malvada

MALVADA: Malware Execution Traces Dataset generation.

https://github.com/reverseame/malvada

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: sciencedirect.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary

Keywords

dataset dataset-generation datasets datasets-preparation malware malware-analysis malware-dataset malware-execution malware-research malware-samples
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: reverseame
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 37.4 MB
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
Readme License Code of conduct Citation Authors

README.md

MALVADA - A Windows Malware Execution Traces Dataset generation framework

MALVADA is a software framework that parses one or more CAPE .json reports coming from Windows programs and processes them in different phases to provide various statistics about their contents.

The main objective of MALVADA is to help generate datasets, specifically datasets built from reports produced by CAPE (although it can be extended to the report formats of other sandboxing engines).

Installation

Install the requirements specified in requirements.txt:

$ pip3 install -r requirements.txt

The last requirement specified in the requirements.txt file is AVClass (from malicialab). In case you face any problem during installation, you can try to install it independently with:

$ pip3 install avclass-malicialab

Usage

To use this framework, run the main script malvada.py (/src/malvada.py) and pass it the path to a directory containing the .json reports you want to process:

$ python3 malvada.py directory

NOTE: The phases that make up MALVADA can also be invoked individually by calling their respective scripts.

The tool processes all the reports in the directory and, where appropriate, moves them into their corresponding folders. You can test the tool using the sample reports provided in test_reports.
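As a rough illustration of processing every report in a directory with a pool of workers, here is a minimal sketch. It assumes each report is an independent .json file holding a JSON object. The names `try_load` and `load_reports` are hypothetical, the choice of threads is an assumption, and, unlike MALVADA, this sketch neither moves nor modifies any file.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def try_load(path):
    """Return (path, report), or (path, None) if the file is not a JSON object."""
    try:
        with path.open(encoding="utf-8") as handle:
            report = json.load(handle)
    except (OSError, ValueError):
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return path, None
    return path, report if isinstance(report, dict) else None


def load_reports(json_dir, workers=10):
    """Load every *.json file in json_dir using a pool of `workers` threads.

    Returns (loaded, rejected): a dict mapping file name to parsed report,
    and a list of the file names that could not be parsed.
    """
    paths = sorted(Path(json_dir).glob("*.json"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(try_load, paths))
    loaded = {path.name: report for path, report in results if report is not None}
    rejected = [path.name for path, report in results if report is None]
    return loaded, rejected
```

Calling `load_reports("test_reports", workers=10)` would return the parsed reports together with the names of any files that failed to parse.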

The help message is printed with the -h flag:

```
$ python3 malvada.py -h
usage: malvada.py [-h] [-w WORKERS] [-s] [-vt VT_POSITIVES_THRESHOLD] [-a ANONIMIZE_TERMS] json_dir

Generates the MALset dataset from CAPE reports. WARNING: This script will modify the reports in the directory provided.

positional arguments:
  json_dir              The directory containing one or more json reports.

options:
  -h, --help            show this help message and exit
  -w WORKERS, --workers WORKERS
                        Number of workers to use (default: 10).
  -s, --silent          Silent mode (default: False).
  -vt VT_POSITIVES_THRESHOLD, --vt-positives-threshold VT_POSITIVES_THRESHOLD
                        Threshold for VirusTotal positives (default: 10).
  -a ANONIMIZE_TERMS, --anonimize-terms ANONIMIZE_TERMS
                        Replace the terms in the file provided with [REDACTED], one by line (default: 'terms_to_anonymize.txt').
```
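For readers who want to script around this interface, the following is a minimal sketch of an argparse parser that accepts the same flags as the help message above. It is a reconstruction from the help text, not MALVADA's actual source: the function name `build_parser` is hypothetical, and the `int` types for the workers and threshold options are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown in the help message (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="malvada.py",
        description=(
            "Generates the MALset dataset from CAPE reports. WARNING: This "
            "script will modify the reports in the directory provided."
        ),
    )
    parser.add_argument(
        "json_dir",
        help="The directory containing one or more json reports.",
    )
    parser.add_argument(
        "-w", "--workers", type=int, default=10,  # int is an assumption
        help="Number of workers to use (default: 10).",
    )
    parser.add_argument(
        "-s", "--silent", action="store_true",
        help="Silent mode (default: False).",
    )
    parser.add_argument(
        "-vt", "--vt-positives-threshold", type=int, default=10,  # int is an assumption
        help="Threshold for VirusTotal positives (default: 10).",
    )
    parser.add_argument(
        "-a", "--anonimize-terms", default="terms_to_anonymize.txt",
        help="Replace the terms in the file provided with [REDACTED], one by line.",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["test_reports", "-w", "100"])` yields a namespace with `workers` set to 100 and every other option at its documented default.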

Contextual overview of MALVADA

[Figure: contextual overview of MALVADA]

Phases of MALVADA

MALVADA processes the reports in the following phases:

1. Detect incorrect reports, that is, those that are malformed for some reason (the sample did not run, it crashed, etc.).
2. Remove duplicate reports (based on the SHA512 of the submitted sample).
3. Sanitize and anonymize reports, that is, remove sensitive information and the terms specified (by default) in terms_to_anonymize.txt.
4. Add the AVClass result to the report: parse the results from all VT vendors, transform them into valid input for AVClass, and invoke AVClass itself. The AVClass consensus result is added under the key avclass_detection.
5. Generate statistics.
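Phases 2 and 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not MALVADA's implementation: it assumes the sample hash is stored at `target.file.sha512` in each report, that term matching is case-sensitive, and that only string values (not keys) are redacted. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

REDACTED = "[REDACTED]"


def sample_sha512(report: dict) -> Optional[str]:
    """Return the sample's SHA-512 (assumed to live at target.file.sha512)."""
    return report.get("target", {}).get("file", {}).get("sha512")


def split_duplicates(reports: Dict[str, dict]) -> Tuple[List[str], List[str]]:
    """Split report names into (kept, duplicates), keeping the first per SHA-512.

    Reports without a hash are kept here; deciding what to do with them is
    left to the error-detection phase.
    """
    seen = set()
    kept, duplicates = [], []
    for name in sorted(reports):
        digest = sample_sha512(reports[name])
        if digest is not None and digest in seen:
            duplicates.append(name)
            continue
        if digest is not None:
            seen.add(digest)
        kept.append(name)
    return kept, duplicates


def parse_terms(text: str) -> List[str]:
    """Read one term per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def redact(value, terms: Sequence[str]):
    """Recursively replace each term with [REDACTED] in all string values."""
    if isinstance(value, str):
        for term in terms:
            value = value.replace(term, REDACTED)
        return value
    if isinstance(value, dict):
        return {key: redact(item, terms) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item, terms) for item in value]
    return value  # numbers, booleans and None pass through unchanged
```

Iterating over the sorted names makes the choice of which duplicate survives deterministic. Which report MALVADA itself keeps is not stated here.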

Internal architecture of MALVADA

[Figure: internal architecture of MALVADA]

Example

Output after executing MALVADA with the test_reports:

$ python3 src/malvada.py test_reports -w 100 (100 workers, default is 10)

[Figure: MALVADA execution example]

How to cite

If you are using this software, please cite it as follows:

Raducu, R., Villagrasa-Labrador, A., Rodríguez, R. J., & Álvarez, P. (2025). MALVADA: A framework for generating datasets of malware execution traces. SoftwareX, 30.

    @article{RADUCU2025_MALVADA,
      title = {MALVADA: A framework for generating datasets of malware execution traces},
      journal = {SoftwareX},
      volume = {30},
      year = {2025},
      issn = {2352-7110},
      doi = {https://doi.org/10.1016/j.softx.2025.102082},
      url = {https://www.sciencedirect.com/science/article/pii/S2352711025000494},
      author = {Razvan Raducu and Alain Villagrasa-Labrador and Ricardo J. Rodríguez and Pedro Álvarez},
      keywords = {Dataset generation, Malware behavior, Execution traces, Malware classification},
      abstract = {Malware attacks have been growing steadily in recent years, making more sophisticated detection methods necessary. These approaches typically rely on analyzing the behavior of malicious applications, for example by examining execution traces that capture their runtime behavior. However, many existing execution trace datasets are simplified, often resulting in the omission of relevant contextual information, which is essential to capture the full scope of a malware sample’s behavior. This paper introduces MALVADA, a flexible framework designed to generate extensive datasets of execution traces from Windows malware. These traces provide detailed insights into program behaviors and help malware analysts to classify a malware sample. MALVADA facilitates the creation of large datasets with minimal user effort, as demonstrated by the WinMET dataset, which includes execution traces from approximately 10,000 Windows malware samples.}
    }

More info in the "Cite this repository" GitHub contextual menu.

Authors

Razvan Raducu
Alain Villagrasa Labrador
Ricardo J. Rodríguez
Pedro Álvarez

Funding support

Part of this research was supported by the Spanish National Cybersecurity Institute (INCIBE) under Proyectos Estratégicos de Ciberseguridad -- CIBERSEGURIDAD EINA UNIZAR and by the Recovery, Transformation and Resilience Plan funds, financed by the European Union (Next Generation).

[Figure: INCIBE logos]

Owner

  • Name: RME-DisCo Research Group
  • Login: reverseame
  • Kind: organization
  • Location: Zaragoza, Spain

Official repository of RME, part of the DisCo research group at the University of Zaragoza, focused on software and systems security.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work, please cite the following article."
preferred-citation:
  type: article-journal
  authors:
    - family-names: "Raducu"
      given-names: "Razvan"
      affiliation: "Department of Computer Science and Systems Engineering, University of Zaragoza"
      orcid: "0000-0002-8938-755X"
    - family-names: "Villagrasa-Labrador"
      given-names: "Alain"
      affiliation: "Department of Computer Science and Systems Engineering, University of Zaragoza"
      orcid: "0009-0005-8644-7376"
    - family-names: "Rodríguez"
      given-names: "Ricardo J."
      affiliation: "Department of Computer Science and Systems Engineering, University of Zaragoza"
      orcid: "0000-0001-7982-0359"
    - family-names: "Álvarez"
      given-names: "Pedro"
      affiliation: "Department of Computer Science and Systems Engineering, University of Zaragoza"
      orcid: "0000-0002-6584-7259"
  doi: "10.1016/j.softx.2025.102082"
  journal-title: "SoftwareX"
  volume: "30"
  issue: "30"
  pages: "TBD"
  title: "MALVADA: A framework for generating datasets of malware execution traces"
  year: "2025"
  url: "https://www.sciencedirect.com/science/article/pii/S2352711025000494"
keywords:
  - Dataset generation
  - Malware behavior
  - Execution traces
  - Malware classification
abstract: "Malware attacks have been growing steadily in recent years, making more sophisticated detection methods necessary. These approaches typically rely on analyzing the behavior of malicious applications, for example by examining execution traces that capture their runtime behavior. However, many existing execution trace datasets are simplified, often resulting in the omission of relevant contextual information, which is essential to capture the full scope of a malware sample’s behavior. This paper introduces MALVADA, a flexible framework designed to generate extensive datasets of execution traces from Windows malware. These traces provide detailed insights into program behaviors and help malware analysts to classify a malware sample. MALVADA facilitates the creation of large datasets with minimal user effort, as demonstrated by the WinMET dataset, which includes execution traces from approximately 10,000 Windows malware samples."

GitHub Events

Total
  • Watch event: 2
  • Push event: 20
  • Fork event: 1
Last Year
  • Watch event: 2
  • Push event: 20
  • Fork event: 1