vulntrain

A tool to generate datasets and models based on vulnerability descriptions from @Vulnerability-Lookup.

https://github.com/vulnerability-lookup/vulntrain

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary

Keywords

dataset llm nlp text-generation vulnerability vulnerability-lookup
Last synced: 6 months ago

Repository

A tool to generate datasets and models based on vulnerability descriptions from @Vulnerability-Lookup.

Statistics
  • Stars: 13
  • Watchers: 4
  • Forks: 2
  • Open Issues: 3
  • Releases: 13
Topics
dataset llm nlp text-generation vulnerability vulnerability-lookup
Created about 1 year ago · Last pushed 6 months ago
Metadata Files
Readme, Changelog, License, Citation, Authors

README.md

VulnTrain


VulnTrain offers a suite of commands to generate diverse AI datasets and train models using comprehensive vulnerability data from Vulnerability-Lookup. It harnesses over one million JSON records from all supported advisory sources to build high-quality, domain-specific models.

Additionally, data from the vulnerability-lookup:meta container, including enrichment sources such as vulnrichment and Fraunhofer FKIE, is incorporated to enhance model quality.
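As a rough illustration of how a JSON advisory record might be flattened into a dataset row, the sketch below pulls the English description and any CVSS base scores out of a record in CVE Record Format v5. The function name `extract_row`, the output keys, and the sample record (with a made-up identifier) are assumptions for illustration only. VulnTrain's actual extraction logic is not shown here and may differ, and other advisory sources use other layouts.

```python
from typing import Any, Optional


def extract_row(record: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Flatten one CVE Record Format v5 document into a dataset row.

    Hypothetical helper: not part of VulnTrain's API.
    """
    cna = record.get("containers", {}).get("cna", {})

    # Keep the first English description; skip records without one.
    description = next(
        (
            d.get("value", "").strip()
            for d in cna.get("descriptions", [])
            if d.get("lang", "").lower().startswith("en")
        ),
        "",
    )
    if not description:
        return None

    # Metrics are optional; collect whichever CVSS versions are present.
    scores: dict[str, float] = {}
    for metric in cna.get("metrics", []):
        for key in ("cvssV4_0", "cvssV3_1", "cvssV3_0", "cvssV2_0"):
            if key in metric and "baseScore" in metric[key]:
                scores[key] = float(metric[key]["baseScore"])

    return {
        "id": record.get("cveMetadata", {}).get("cveId"),
        "description": description,
        "scores": scores,
    }


# Made-up record for illustration; the identifier is not a real CVE.
sample = {
    "cveMetadata": {"cveId": "CVE-2099-0001"},
    "containers": {
        "cna": {
            "descriptions": [
                {"lang": "en", "value": "Buffer overflow in the example parser."}
            ],
            "metrics": [{"cvssV3_1": {"baseScore": 9.8}}],
        }
    },
}

row = extract_row(sample)
```

Records without an English description return `None`, so a caller can filter them out before building the dataset.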

Check out the datasets and models on Hugging Face.

For more information about the use of AI in Vulnerability-Lookup, please refer to the user manual.

Usage

Install VulnTrain:

    $ pipx install VulnTrain

Three types of commands are available:

  • Dataset generation: Create and prepare datasets.
  • Model training: Train models using the prepared datasets.
    • Train a model to classify vulnerabilities by severity.
    • Train a model for text generation to assist in writing vulnerability descriptions.
  • Model validation: Assess the performance of trained models (validations, benchmarks, etc.).
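To make the severity classification and validation steps more concrete, here is a minimal sketch of two pieces such a pipeline could use: mapping a CVSS v3.x base score to its qualitative severity rating, and computing plain accuracy over predicted labels. The thresholds follow the CVSS v3.x qualitative rating scale. Whether VulnTrain derives its labels this way, and the helper names `cvss_to_severity` and `accuracy`, are assumptions for illustration.

```python
def cvss_to_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS base score out of range: {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


def accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of predictions that match the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


# Reference labels derived from scores, compared with hypothetical model output.
labels = [cvss_to_severity(s) for s in (2.1, 5.3, 7.5, 9.8)]
predictions = ["Low", "Medium", "High", "High"]
score = accuracy(predictions, labels)
```

Accuracy alone can hide weak performance on rare classes, so a real validation run would likely also report per-class metrics.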

Check out the documentation for more information.

How to cite

Bonhomme, C., & Dulaunoy, A. (2025). VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification (Version 1.4.0) [Computer software]. https://doi.org/10.48550/arXiv.2507.03607

    @misc{bonhomme2025vlai,
        title={VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification},
        author={Cédric Bonhomme and Alexandre Dulaunoy},
        year={2025},
        eprint={2507.03607},
        archivePrefix={arXiv},
        primaryClass={cs.CR}
    }

License

VulnTrain is licensed under the GNU General Public License version 3.

~~~
Copyright (c) 2025 Computer Incident Response Center Luxembourg (CIRCL)
Copyright (C) 2025 Cédric Bonhomme - https://github.com/cedricbonhomme
Copyright (C) 2025 Léa Ulusan - https://github.com/3LS3-1F
~~~

Owner

  • Name: Vulnerability-Lookup
  • Login: vulnerability-lookup
  • Kind: organization
  • Email: info@circl.lu

Vulnerability-Lookup facilitates quick correlation of vulnerabilities from various sources, independent of vulnerability IDs.

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use VulnTrain or one of our models, please cite the following work."
title: "VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification"
version: 1.4.0
doi: 10.48550/arXiv.2507.03607
url: https://www.vulnerability-lookup.org
repository-code: https://github.com/vulnerability-lookup/VulnTrain
date-released: 2025-07-04
abstract: >
  This paper presents VLAI, a transformer-based model that predicts software vulnerability severity levels 
  directly from text descriptions. Built on RoBERTa, VLAI is fine-tuned on over 600,000 real-world vulnerabilities 
  and achieves over 82% accuracy in predicting severity categories, enabling faster and more consistent triage 
  ahead of manual CVSS scoring. The model and dataset are open-source and integrated into the Vulnerability-Lookup service.
authors:
  - family-names: Bonhomme
    given-names: Cédric
    orcid: https://orcid.org/0009-0003-7679-0109
  - family-names: Dulaunoy
    given-names: Alexandre
    orcid: https://orcid.org/0000-0002-5437-4652

GitHub Events

Total
  • Create event: 20
  • Issues event: 4
  • Release event: 13
  • Watch event: 12
  • Delete event: 2
  • Member event: 1
  • Issue comment event: 4
  • Push event: 125
  • Pull request event: 4
  • Fork event: 2
Last Year
  • Create event: 20
  • Issues event: 4
  • Release event: 13
  • Watch event: 12
  • Delete event: 2
  • Member event: 1
  • Issue comment event: 4
  • Push event: 125
  • Pull request event: 4
  • Fork event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 72
  • Total Committers: 2
  • Avg Commits per committer: 36.0
  • Development Distribution Score (DDS): 0.014
Past Year
  • Commits: 72
  • Committers: 2
  • Avg Commits per committer: 36.0
  • Development Distribution Score (DDS): 0.014
Top Committers
Name Email Commits
Cédric Bonhomme c****c@c****g 71
Else-If-05 l****n@e****r 1
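The committer figures above can be checked with a little arithmetic. The sketch below assumes the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition and the function name are assumptions, not something this page states.

```python
def development_distribution_score(commit_counts: list[int]) -> float:
    """Return 1 minus the top committer's share of commits (assumed DDS definition)."""
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    return 1 - max(commit_counts) / total


# Commit counts per committer, taken from the Top Committers table.
commits = [71, 1]

dds = development_distribution_score(commits)
average = sum(commits) / len(commits)

print(f"DDS: {dds:.3f}")
print(f"Avg commits per committer: {average:.1f}")
```

Under that assumption, a score close to 0 means nearly all commits come from one person, which matches the 71-to-1 split listed here.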

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 5
  • Total pull requests: 7
  • Average time to close issues: N/A
  • Average time to close pull requests: about 3 hours
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 0.8
  • Average comments per pull request: 0.43
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 5
  • Pull requests: 7
  • Average time to close issues: N/A
  • Average time to close pull requests: about 3 hours
  • Issue authors: 2
  • Pull request authors: 3
  • Average comments per issue: 0.8
  • Average comments per pull request: 0.43
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • cedricbonhomme (4)
  • 3LS3-1F (1)
Pull Request Authors
  • 3LS3-1F (3)
  • Else-If-05 (2)
  • cedricbonhomme (2)
Top Labels
Issue Labels
enhancement (3) dataset (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 60 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 14
  • Total maintainers: 1
pypi.org: vulntrain

Generate datasets and models based on vulnerability data from Vulnerability-Lookup.

  • Versions: 14
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 60 Last month
Rankings
Dependent packages count: 9.6%
Average: 31.8%
Dependent repos count: 54.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/release.yml actions
  • actions/checkout v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
poetry.lock pypi
  • aiohappyeyeballs 2.4.6
  • aiohttp 3.11.12
  • aiosignal 1.3.2
  • async-timeout 5.0.1
  • attrs 25.1.0
  • certifi 2025.1.31
  • charset-normalizer 3.4.1
  • click 8.1.8
  • colorama 0.4.6
  • datasets 3.3.1
  • dill 0.3.8
  • filelock 3.17.0
  • frozenlist 1.5.0
  • fsspec 2024.12.0
  • huggingface-hub 0.29.0
  • idna 3.10
  • joblib 1.4.2
  • multidict 6.1.0
  • multiprocess 0.70.16
  • nltk 3.9.1
  • numpy 2.2.3
  • packaging 24.2
  • pandas 2.2.3
  • propcache 0.2.1
  • pyarrow 19.0.1
  • python-dateutil 2.9.0.post0
  • pytz 2025.1
  • pyyaml 6.0.2
  • regex 2024.11.6
  • requests 2.32.3
  • six 1.17.0
  • tqdm 4.67.1
  • typing-extensions 4.12.2
  • tzdata 2025.1
  • urllib3 2.3.0
  • valkey 6.1.0
  • xxhash 3.5.0
  • yarl 1.18.3
pyproject.toml pypi
  • datasets ^3.3.1
  • nltk ^3.9.1
  • pandas ^2.2.3
  • valkey ^6.1.0