cve-llm_dataset

This is a dataset intended for fine-tuning an LLM on entirely CVE-focused inputs and outputs.

https://github.com/morpheuslord/cve-llm_dataset

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.3%) to scientific vocabulary

Keywords

ai-dataset ai-finetune ai-training llama2 openai openai-chatgpt textgeneration

Last synced: 11 months ago

Repository


Basic Info

Host: GitHub
Owner: morpheuslord
License: mit
Language: Python
Default Branch: main
Homepage: https://huggingface.co/datasets/morpheuslord/cve-llm-training
Size: 178 MB

Statistics

Stars: 63
Watchers: 4
Forks: 13
Open Issues: 0
Releases: 0

Topics

ai-dataset ai-finetune ai-training llama2 openai openai-chatgpt textgeneration

Created almost 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

CVE-llm_dataset

This dataset is intended for fine-tuning an LLM on entirely CVE-focused inputs and outputs.

Data extraction:

For the data extraction, I first downloaded the CVE database from the NVD lists and then loaded it using cve_dataset_2.py and cve_dataset.py. The two scripts produce different datasets: one is for Llama and the other is for OpenAI GPT.

The programs traverse the downloaded folders, extract the data from the files, and arrange it into usable formats for the fine-tuning process.
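The traversal step described above might look roughly like the following. This is a minimal sketch, not the code in cve_dataset.py or cve_dataset_2.py: it assumes the downloaded records are JSON files in the CVE JSON 5 layout (`cveMetadata` plus `containers.cna`), and the names `extract_fields` and `iter_cve_files` are hypothetical.

```python
import json
from pathlib import Path


def extract_fields(record: dict) -> dict:
    """Pull the fields used in the dataset out of one CVE record.

    Assumes the CVE JSON 5 layout; missing fields fall back to "n/a" or [].
    """
    meta = record.get("cveMetadata", {})
    cna = record.get("containers", {}).get("cna", {})
    descriptions = cna.get("descriptions", [])
    products = [entry.get("product", "n/a") for entry in cna.get("affected", [])]
    return {
        "cve_id": meta.get("cveId", "n/a"),
        "state": meta.get("state", "n/a"),
        "description": descriptions[0].get("value", "n/a") if descriptions else "n/a",
        "affected": ", ".join(products) if products else "n/a",
        "references": cna.get("references", []),
    }


def iter_cve_files(root: str):
    """Yield extracted fields for every CVE record found under root."""
    for path in sorted(Path(root).rglob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Skip JSON files that are not CVE records (for example, index files).
        if isinstance(data, dict) and "cveMetadata" in data:
            yield extract_fields(data)
```

Using a generator keeps memory use flat while walking a large folder tree, since only one record is loaded at a time.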

llama2 Model dataset:

The llama2 fine-tune dataset follows this format:

{
  "instruction": "Explain CVE-1999-0001",
  "input": "Explain the vulnerability: CVE-1999-0001",
  "output": "ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets.\nAffected Products: n/a\nReferences: [{'tags': ['x_refsource_CONFIRM'], 'url': 'http://www.openbsd.org/errata23.html#tcpfix'}, {'name': '5707', 'tags': ['vdb-entry', 'x_refsource_OSVDB'], 'url': 'http://www.osvdb.org/5707'}]\nCVE State: PUBLISHED"
}

The instruction is what we tell the AI to do with the data provided. For example, we can instruct the AI to take in user input, analyze it, and return an answer based on what the user asks. This is also where we can give the AI a role or a persona.

The input is the user's input: the main query or data that the AI must process. This is a crucial piece of information that the AI processes in order to produce an output.

The output is the answer in the format we define. It tells the AI to generate answers in that format, or to give that answer to the question asked.
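As an illustration of how the three fields fit together, the sketch below builds one record in the format shown above. The function name `to_llama_record` and its parameters are hypothetical, and the output layout is inferred from the sample record (including the Python-style rendering of the references list), not taken from the repository's scripts.

```python
import json


def to_llama_record(cve_id, description, affected, references, state):
    """Build one instruction/input/output record for a single CVE."""
    # Formatting the references list inside an f-string gives its Python
    # repr (single quotes), which is how the sample record renders it.
    output = (
        f"{description}\n"
        f"Affected Products: {affected}\n"
        f"References: {references}\n"
        f"CVE State: {state}"
    )
    return {
        "instruction": f"Explain {cve_id}",
        "input": f"Explain the vulnerability: {cve_id}",
        "output": output,
    }


record = to_llama_record(
    cve_id="CVE-1999-0001",
    description=(
        "ip_input.c in BSD-derived TCP/IP implementations allows remote "
        "attackers to cause a denial of service (crash or hang) via crafted packets."
    ),
    affected="n/a",
    references=[
        {"tags": ["x_refsource_CONFIRM"],
         "url": "http://www.openbsd.org/errata23.html#tcpfix"},
    ],
    state="PUBLISHED",
)
print(json.dumps(record, indent=2))
```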

OpenAI fine-tune dataset:

The OpenAI fine-tune format differs considerably from the Llama format: it requires us to define roles and messages. Using these, we can provide more detail and improve answer accuracy.

{
  "messages": [
    { "role": "system", "content": "CVE Vulnerability Information" },
    { "role": "user", "content": "Explain the vulnerability: CVE-1999-0001" },
    { "role": "assistant", "content": "ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets.\nAffected Products: n/a\nReferences: [{'tags': ['x_refsource_CONFIRM'], 'url': 'http://www.openbsd.org/errata23.html#tcpfix'}, {'name': '5707', 'tags': ['vdb-entry', 'x_refsource_OSVDB'], 'url': 'http://www.osvdb.org/5707'}]\nCVE State: PUBLISHED" }
  ]
}

In this dataset we define the system, user, and assistant roles, along with the user's content and the assistant's output for it. The core idea is the same as for Llama or any other text-generation model's dataset.
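A record in this chat format could be assembled as in the sketch below. The helper names `to_openai_record` and `to_jsonl` are hypothetical and this is not the repository's code. The sketch writes one JSON object per line (JSONL), which is the file layout OpenAI fine-tuning expects for chat data.

```python
import json


def to_openai_record(cve_id, answer, system_prompt="CVE Vulnerability Information"):
    """Build one chat-format record with system, user, and assistant messages."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Explain the vulnerability: {cve_id}"},
            {"role": "assistant", "content": answer},
        ]
    }


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


example = to_openai_record(
    "CVE-1999-0001",
    "ip_input.c in BSD-derived TCP/IP implementations allows remote attackers "
    "to cause a denial of service (crash or hang) via crafted packets."
    "\nAffected Products: n/a\nCVE State: PUBLISHED",
)
print(to_jsonl([example]))
```

Because `json.dumps` escapes the newlines inside the assistant content, each record still occupies exactly one line of the file.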

Trained model on this dataset.

Someone trained a model on this dataset (here's the LINK), but the accuracy was not great, so I modified the dataset to make it more robust and more useful to others.

OpenAI price calculation:

The price-openai.py script calculates the dataset's total tokens and uses that total to estimate the overall price of training a custom GPT model with OpenAI. tokencount.py is similar, but it only counts the total number of tokens in the dataset.
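The arithmetic behind such an estimate can be sketched as follows. This is an illustration under stated assumptions, not the logic of price-openai.py or tokencount.py: the function names are hypothetical, the default counter is a crude "about 4 characters per token" approximation, and the price and epoch count are placeholders to replace with current values. A real tokenizer can be passed in through `count_tokens` for accurate counts.

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def dataset_tokens(records, count_tokens=rough_token_count) -> int:
    """Sum token counts over every message in chat-format records."""
    return sum(
        count_tokens(message["content"])
        for record in records
        for message in record["messages"]
    )


def training_cost(total_tokens: int, price_per_1k_tokens: float, epochs: int = 1) -> float:
    """Estimated cost = tokens / 1000 * price per 1K tokens * epochs."""
    return total_tokens / 1000 * price_per_1k_tokens * epochs


records = [
    {"messages": [
        {"role": "system", "content": "CVE Vulnerability Information"},
        {"role": "user", "content": "Explain the vulnerability: CVE-1999-0001"},
        {"role": "assistant", "content": "ip_input.c in BSD-derived TCP/IP ..."},
    ]}
]
total = dataset_tokens(records)
# 0.008 is a placeholder price, not a quoted OpenAI rate.
print(total, training_cost(total, price_per_1k_tokens=0.008, epochs=3))
```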

Cite this

@misc{chiranjeevi_g_2024,
  author    = { {Chiranjeevi G} },
  title     = { cve-llm-training (Revision b224515) },
  year      = 2024,
  url       = { https://huggingface.co/datasets/morpheuslord/cve-llm-training },
  doi       = { 10.57967/hf/3627 },
  publisher = { Hugging Face }
}

Owner

Name: morpheuslord
Login: morpheuslord
Kind: user
Location: India

Website: https://chiranjeevi-profile.vercel.app/
Twitter: Morpheuslord2
Repositories: 86
Profile: https://github.com/morpheuslord

#cybersecurity #redteam #Python #websitehacking #applicationtesting #securityresearcher

Citation (CITATION.cff)

cff-version: 1.2.0
message: "POC CVE LLM Traing dataset"
authors:
- family-names: "Chiranjeevi"
  given-names: "G"
- family-names: "Abhishek Reddy"
  given-names: "R"
- family-names: "Shyam"
  given-names: "R"
title: "CVE LLM Dataset"
date-released: 2023-10-15
url: "https://github.com/morpheuslord/CVE-llm_dataset"

GitHub Events

Total

Watch event: 15
Push event: 8
Pull request event: 13
Fork event: 3
Create event: 5

Last Year

Watch event: 15
Push event: 8
Pull request event: 13
Fork event: 3
Create event: 5

Committers

Last synced: 11 months ago

All Time

Total Commits: 27
Total Committers: 3
Avg Commits per committer: 9.0
Development Distribution Score (DDS): 0.111

Past Year

Commits: 8
Committers: 2
Avg Commits per committer: 4.0
Development Distribution Score (DDS): 0.125

Top Committers

Name	Email	Commits
morpheuslord	7****d	24
R ABHISHEK REDDY	9****e	2
Vishrutha S	1****5	1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 1
Total pull requests: 24
Average time to close issues: about 5 hours
Average time to close pull requests: less than a minute
Total issue authors: 1
Total pull request authors: 3
Average comments per issue: 1.0
Average comments per pull request: 0.0
Merged pull requests: 24
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 8
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Issue authors: 0
Pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0
