java-vulnerability-patch-retriever

Semantic retrieval system for suggesting vulnerability-patch commits in Java. It combines BM25 and Sentence Transformers to find relevant fixes, which are evaluated with standard metrics. Based on the curated dataset from TUHH-SoftSec.
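
The lexical half of such a retriever can be illustrated with a minimal Okapi BM25 sketch. This is not the repository's implementation: the function name `bm25_scores`, the sample commit messages, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration, and the Sentence Transformers component is left out entirely.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical commit messages, used only to exercise the function.
commits = [
    "fix xss in group endpoint by escaping output",
    "update readme",
    "escape user input to prevent sql injection",
]
docs = [c.split() for c in commits]
query = "xss escaping".split()
scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking[0], scores)
```

Because BM25 matches exact tokens, the third message ("escape ...") gets no credit for the query term "escaping". That gap is the usual motivation for pairing a lexical scorer with embedding-based similarity.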

https://github.com/lucascandia/java-vulnerability-patch-retriever

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: lucascandia
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 50.8 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 12 months ago · Last pushed 12 months ago
Metadata Files
Readme Citation

README.md

A Manually Curated Dataset of Vulnerability Introducing Commits In Java

Research on identifying vulnerabilities and the commits that introduce them is ongoing. However, many current methods rely heavily on automation, which can lead to a high rate of false positives and require significant error-checking. To address this issue, we developed a tool-assisted pipeline to manually review and examine vulnerabilities and their corresponding commits. Additionally, we collected relevant metadata such as the modified lines of code and the mapping of CVE and CWE categories. This dataset can be used to validate automated methods such as machine learning approaches.

Dataset Description

The complete dataset can be found here.

It is structured as a JSON file with the following fields:

JSON Fields

| Fieldname | Brief |
| --- | --- |
| cwe | Common Weakness Enumeration ID |
| introducing | Commit hash that introduces the vulnerability |
| introstats | Number of lines added/deleted in the introducing commit |
| introlines | Lines marked as vulnerable in the introducing commit |
| fixingstats | Number of lines added/deleted in the fixing commits |
| fixinglines | Lines marked as fixing the vulnerability in the fixing commit |
| days_between | Days between the identified introducing and fixing commits |

Example

```json
{
  "cve": "CVE-2019-11274",
  "cwe": "CWE-79",
  "repository": "https://github.com/cloudfoundry/uaa",
  "fixing": [
    "a34f55fc97a81966faf21e3ae404ec24f1f31cf7"
  ],
  "introducing": "bb8ff8f4e8969b46fdacffcd27781d223c8c7244",
  "introstats": {
    "bb8ff8f4e8969b46fdacffcd27781d223c8c7244": {
      "add": 320,
      "del": 7
    }
  },
  "fixingstats": {
    "a34f55fc97a81966faf21e3ae404ec24f1f31cf7": {
      "add": 68,
      "del": 17
    }
  },
  "daysbetween": 1836,
  "fixinglines": {
    "server/src/main/java/org/cloudfoundry/identity/uaa/scim/endpoints/ScimGroupEndpoints.java": "168"
  },
  "introducing_lines": {
    "scim/src/main/java/org/cloudfoundry/identity/uaa/scim/endpoints/ScimGroupEndpoints.java": "190"
  }
}
```
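
As a rough illustration of how records like the one above might be consumed, the following sketch counts entries per CWE and averages the time between introduction and fix. It assumes the dataset is a JSON array of such records and uses the key names from the example record (`cwe`, `daysbetween`); the helper `summarize` is hypothetical, and the inline `SAMPLE` stands in for the real dataset file.

```python
import json
from collections import Counter

# One record trimmed from the example above; a real run would read the dataset file instead.
SAMPLE = """
[
  {
    "cve": "CVE-2019-11274",
    "cwe": "CWE-79",
    "repository": "https://github.com/cloudfoundry/uaa",
    "fixing": ["a34f55fc97a81966faf21e3ae404ec24f1f31cf7"],
    "introducing": "bb8ff8f4e8969b46fdacffcd27781d223c8c7244",
    "daysbetween": 1836
  }
]
"""

def summarize(records):
    """Return per-CWE record counts and the mean days between introduction and fix."""
    cwe_counts = Counter(r["cwe"] for r in records)
    # Skip records without the key rather than guessing a value.
    days = [r["daysbetween"] for r in records if "daysbetween" in r]
    mean_days = sum(days) / len(days) if days else None
    return cwe_counts, mean_days

records = json.loads(SAMPLE)
cwe_counts, mean_days = summarize(records)
print(cwe_counts.most_common(), mean_days)
```

If the published file spells a key differently (the field table lists `days_between`, for instance), adjust the lookups accordingly.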

Review Pipeline Instructions

Prerequisites

| Software | Used Version |
| --- | --- |
| Python3 | 3.10.8 |
| pip3 | 22.3.1 |
| git | 2.29.0 |
| Web browser of choice | Safari 16.1 |

Setup

To install all required Python packages, run the following command inside the review_pipeline directory:
- `python3 -m pip install -r requirements.txt`

Usage

The pipeline can be executed with the following command inside the review_pipeline directory:
- `python3 manual_analysis_pipeline.py <path_to_input_dataset>`

Input Dataset

The input dataset is expected to be a JSON file with the following fields:

| Fieldname | Brief |
| --- | --- |
| cveid | CVE id of the vulnerability |
| repository | URL to the repository |
| fixingcommits | List of fixing commit SHA-1 hashes |
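
Before running the pipeline, it can help to check that an input file has the expected shape. The sketch below is not part of the review pipeline: `validate_entry` is a hypothetical helper, and it assumes that each entry is a JSON object with the three fields listed above and that commit hashes are full 40-character lowercase SHA-1 strings.

```python
import re

REQUIRED = ("cveid", "repository", "fixingcommits")
SHA1 = re.compile(r"^[0-9a-f]{40}$")  # assumption: full, lowercase hashes

def validate_entry(entry):
    """Return a list of problems found in one input entry (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED if name not in entry]
    commits = entry.get("fixingcommits", [])
    if not isinstance(commits, list):
        problems.append("fixingcommits must be a list")
    else:
        problems += [f"not a SHA-1 hash: {c}" for c in commits if not SHA1.match(str(c))]
    return problems

# Values borrowed from the dataset example; the entry layout itself is an assumption.
good = {
    "cveid": "CVE-2019-11274",
    "repository": "https://github.com/cloudfoundry/uaa",
    "fixingcommits": ["a34f55fc97a81966faf21e3ae404ec24f1f31cf7"],
}
bad = {"cveid": "CVE-2019-11274", "fixingcommits": ["a34f55f"]}

print(validate_entry(good))
print(validate_entry(bad))
```

Abbreviated hashes are reported as problems here; relax the pattern if the pipeline accepts them.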

Citation (citation.cff)

cff-version: 1.2.0
type: dataset
message: "If you use this dataset, please cite it as below."
authors:
- family-names: "Hinrichs"
  given-names: "Torge"
  orcid: "https://orcid.org/0000-0001-7489-3540"
- family-names: "Scandariato"
  given-names: "Riccardo"
  orcid: "https://orcid.org/0000-0003-3591-7671"
title: "A Manually Curated Dataset of Vulnerability Introducing Commits In Java"
version: 1.0.0
doi: 10.5281/zenodo.7565542
date-released: 2023-01-25
url: "https://github.com/tuhh-softsec/A-Manually-Curated-Dataset-of-Vulnerability-Introducing-Commits-in-Java"

GitHub Events

Total
  • Push event: 4
  • Create event: 3
Last Year
  • Push event: 4
  • Create event: 3

Dependencies

review_pipeline/requirements.txt pypi
  • GitPython *
  • PyDriller *
  • browser *