securityeval

Repository for "SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques" published in MSR4P&S'22.

https://github.com/s2e-lab/securityeval

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary

Keywords

code-generation cwe dataset evaluation security
Last synced: 4 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: s2e-lab
  • Language: Python
  • Default Branch: main
  • Size: 378 KB
Statistics
  • Stars: 67
  • Watchers: 1
  • Forks: 13
  • Open Issues: 0
  • Releases: 2
Topics
code-generation cwe dataset evaluation security
Created over 3 years ago · Last pushed about 2 years ago
Metadata Files
  • Readme: README.md
  • Citation: CITATION.cff

README.md


SecurityEval

Update

We have released a new version of the dataset. It addresses the following issues: (1) typos in the prompts have been fixed, and (2) prompts that deliberately ask for vulnerable code have been removed.

This version contains 121 prompts covering 69 CWEs. The original results and model evaluations are unchanged. The new dataset is available in the dataset.jsonl file.

The old dataset and the evaluation results from the MSR4P&S workshop are available in release v1.0.

Introduction

This repository contains the source code for the paper titled SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques, accepted at the first edition of the International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4P&S '22). The paper describes a dataset for evaluating the output of machine learning-based code generation models and demonstrates its application to two code generation tools.

Project Structure

  • dataset.jsonl: dataset file in JSON Lines format (a minimal loading sketch follows this list). Every line contains a JSON object with the following fields:
    • ID: unique identifier of the sample.
    • Prompt: prompt for the code generation model.
    • Insecure_code: example of vulnerable code that may be generated from the prompt.
  • DatasetCreator.py: script to create the dataset from the folders Testcases_Prompt and Testcases_Insecure_Code.
  • Testcases_Prompt: folder containing the prompt files.
  • Testcases_Insecure_Code: folder containing the insecure code files.
  • Testcases_Copilot: folder containing the code generated by GitHub Copilot.
  • Testcases_InCoder: folder containing the code generated by InCoder.
  • Databases: folder containing the databases for the CodeQL analysis.
    • job_{copilot,incoder}.sh: scripts to run the CodeQL analysis.
  • Result: folder containing the results of the evaluation.
    • DataTable.{csv,xlsx}: table of the CWE list with their sources.
    • testcases_copilot: folder containing the results of running CodeQL on Testcases_Copilot.
    • testcases_copilot.json: results of running Bandit on Testcases_Copilot.
    • testcases_copilot.csv: results of the manual analysis of Testcases_Copilot.
    • testcases_incoder: folder containing the results of running CodeQL on Testcases_InCoder.
    • testcases_incoder.json: results of running Bandit on Testcases_InCoder.
    • testcases_incoder.csv: results of the manual analysis of Testcases_InCoder.
    • testcases.json: list of the files and folders in Testcases_Prompt.
    • CSVConvertor.py: script to convert the JSON file (i.e., testcases.json) to CSV files.
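
Since dataset.jsonl uses the JSON Lines format, each record can be read with Python's standard library alone. A minimal loading sketch, assuming the file sits in the repository root and carries the three fields listed above:

```
import json

# Each line of dataset.jsonl is one JSON object with ID, Prompt, Insecure_code.
with open("dataset.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

print(len(samples))          # 121 prompts in the current version
first = samples[0]
print(first["ID"])           # unique sample identifier
print(first["Prompt"][:80])  # beginning of the prompt given to the model
```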

Loading the dataset of prompts from HuggingFace

The dataset is now published on HuggingFace. You can load it as follows:

```
from datasets import load_dataset

dataset = load_dataset("s2e-lab/SecurityEval")
```
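
load_dataset returns a DatasetDict; assuming the prompts live in the default "train" split (an assumption about how the HuggingFace upload is organized), individual samples can be inspected by index:

```
# Assumes the HuggingFace upload exposes the default "train" split.
sample = dataset["train"][0]
print(sample["ID"])
print(sample["Prompt"])
```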

Usage of the Analyzer

Dependencies:
  • Python: 3.9.4
  • CodeQL command-line toolchain: 2.10.0
  • Bandit: 1.7.4

Bandit

```
python3 -m venv bandit-env   # or: virtualenv bandit-env
source bandit-env/bin/activate
pip install bandit
bandit -r Testcases_Copilot -f json -o Result/testcases_copilot.json
bandit -r Testcases_InCoder -f json -o Result/testcases_incoder.json
```
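
With -f json, Bandit writes all findings to a single report under a top-level "results" list, one entry per finding. A minimal sketch for tallying findings per Bandit test ID from the report generated above:

```
import json
from collections import Counter

# Bandit's JSON report keeps individual findings under the "results" key.
with open("Result/testcases_copilot.json", encoding="utf-8") as f:
    report = json.load(f)

counts = Counter(finding["test_id"] for finding in report["results"])
for test_id, n in counts.most_common():
    print(test_id, n)  # e.g., B608 <count>
```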

CodeQL

Install CodeQL from here: https://codeql.github.com/docs/codeql-cli/getting-started-with-the-codeql-cli/

```
cd Testcases_Copilot
codeql database create --language=python 'ROOT_PATH/SecurityEval/Databases/Testcases_Copilot_DB' # Use your path to the database
cd ../Databases
sh job_copilot.sh

cd ..
cd Testcases_InCoder
codeql database create --language=python 'ROOT_PATH/SecurityEval/Databases/Testcases_Incoder_DB' # Use your path to the database
cd ../Databases
sh job_incoder.sh
```
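
The job scripts themselves are not reproduced here. If the analysis step emits a SARIF report (the CodeQL CLI's codeql database analyze command can write one via --format=sarif-latest --output=<file>), the flagged rules can be tallied with a short script; the results.sarif path below is an assumption, not a file produced by this repository:

```
import json
from collections import Counter

# SARIF stores findings under runs[*].results, each tagged with a ruleId.
with open("results.sarif", encoding="utf-8") as f:  # hypothetical report path
    sarif = json.load(f)

counts = Counter(
    result.get("ruleId", "<unknown>")
    for run in sarif.get("runs", [])
    for result in run.get("results", [])
)
for rule_id, n in counts.most_common():
    print(rule_id, n)
```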

Abstract

Automated source code generation is currently a popular machine learning-based task. It can be helpful for software developers to write functionally correct code from a given context. However, just like human developers, a code generation model can produce vulnerable code, which the developers can mistakenly use. For this reason, evaluating the security of a code generation model is a must. In this paper, we describe SecurityEval, an evaluation dataset to fulfill this purpose. It contains 130 samples for 75 vulnerability types, which are mapped to the Common Weakness Enumeration (CWE). We also demonstrate using our dataset to evaluate one open-source (i.e., InCoder) and one closed-source code generation model (i.e., GitHub Copilot).

Citation

```
@inproceedings{siddiq2022seceval,
  author    = {Siddiq, Mohammed Latif and Santos, Joanna C. S.},
  booktitle = {Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4P\&S '22)},
  title     = {SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques},
  year      = {2022},
  doi       = {10.1145/3549035.3561184}
}
```

Owner

  • Name: Security & Software Engineering Research Lab at University of Notre Dame
  • Login: s2e-lab
  • Kind: organization
  • Location: United States of America

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Siddiq"
  given-names: "Mohammed Latif"
  orcid: "https://orcid.org/0000-0002-7984-3611"
- family-names: "Santos"
  given-names: "Joanna C. S."
  orcid: "https://orcid.org/0000-0001-8743-2516"
title: "SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques"
version: 1.0.0
date-released: 2022-03-02
doi: 10.1145/3549035.3561184
url: "https://github.com/s2e-lab/SecurityEval"
preferred-citation:
  type: proceedings
  authors:
  - family-names: "Siddiq"
    given-names: "Mohammed Latif"
    orcid: "https://orcid.org/0000-0002-7984-3611"
  - family-names: "Santos"
    given-names: "Joanna C. S."
    orcid: "https://orcid.org/0000-0001-8743-2516"
  doi: "10.1145/3549035.3561184"
  journal: "Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4P&S '22)"
  month: 11
  title: "SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques"
  year: 2022

GitHub Events

Total
  • Watch event: 16
Last Year
  • Watch event: 16