https://github.com/ffri/ffridataset-scripts

Make datasets like FFRI Dataset

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary
Last synced: 8 months ago

Repository

Make datasets like FFRI Dataset

Basic Info
  • Host: GitHub
  • Owner: FFRI
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Size: 35.4 MB
Statistics
  • Stars: 10
  • Watchers: 1
  • Forks: 3
  • Open Issues: 3
  • Releases: 7
Created over 6 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License

README.md

FFRI Dataset scripts

This script allows you to create datasets in the same format as the FFRI dataset.

Requirements

We recommend using Docker to create datasets. For more information, refer to the Using Docker section.

Alternatively, you can run this script natively on one of the tested platforms after installing its dependencies. For detailed instructions, see the Run This Script Natively section.

Using Docker

Make A CSV File

This script requires a CSV file that contains file information such as labels, dates, and file paths. For example:

    path,label,date
    ./data/cleanware/test0.exe,0,2018/01/01
    ./data/malware/test1.exe,1,2018/01/02

Please note that the file paths in the CSV file should be specified as relative paths from the container's working directory.
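
As a rough illustration of the CSV layout described above, the following sketch walks labeled subdirectories and writes one row per file. The helper names (`build_rows`, `write_csv`), the `container_prefix` parameter, and the use of each file's modification time as its date are assumptions made for this example; they are not part of this repository, and you should substitute whatever date is meaningful for your samples.

```python
import csv
from datetime import datetime
from pathlib import Path


def build_rows(host_dir, label_dirs, container_prefix="./data"):
    """Return (path, label, date) rows for every file under the labeled subdirectories.

    host_dir:         directory on the host that will be mounted into the container
    label_dirs:       mapping of subdirectory name -> label, e.g. {"cleanware": 0}
    container_prefix: where host_dir appears relative to the container's working directory
    """
    host = Path(host_dir)
    rows = []
    for subdir, label in label_dirs.items():
        for file in sorted((host / subdir).rglob("*")):
            if not file.is_file():
                continue
            # Assumption: the modification time stands in for the sample's date.
            date = datetime.fromtimestamp(file.stat().st_mtime).strftime("%Y/%m/%d")
            relative = file.relative_to(host).as_posix()
            rows.append((f"{container_prefix}/{relative}", label, date))
    return rows


def write_csv(rows, csv_path):
    """Write the rows with the path,label,date header shown in the example."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "label", "date"])
        writer.writerows(rows)


# Example usage (hypothetical directory names):
# write_csv(build_rows("data", {"cleanware": 0, "malware": 1}), "data/target.csv")
```

Keeping the host directory and the container-relative prefix separate means the CSV can be generated on the host while still containing paths that are valid inside the container.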

Make Datasets

You can create datasets using the following commands:

```
docker build --target production --tag ffridataset-scripts .
docker run -v <path/to/here>/testbin:/work/testbin ffridataset-scripts test_main.py
```

Note: The data directory should contain a CSV file and the executable files you want to process.

```
docker run -v <path/to/here>/data:/work/data -v <path/to/here>/out_dir:/work/out_dir ffridataset-scripts main.py --csv ./data/target.csv --out ./out_dir --log ./dataset.log --ver <version_string>
```

Please ensure the following:

  • The host directory containing the CSV file and executable files is mounted to the container’s /work/data.
  • The host directory where you want to save the JSON files is mounted to the container’s /work/out_dir.
  • Replace <version_string> with vYYYY (e.g., use v2024 for the FFRI Dataset 2024).

To process non-PE files, include the --not-pe-only flag:

    docker run -v <path/to/here>/data:/work/data -v <path/to/here>/out_dir:/work/out_dir ffridataset-scripts main.py --csv ./data/target.csv --out ./out_dir --log ./dataset.log --ver <version_string> --not-pe-only
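
If you launch the container from a script, the `docker run` invocation above can be assembled programmatically. The sketch below is only an illustration: the function name `build_docker_cmd` and its parameters are hypothetical, while the image tag, mount points, and `main.py` options are the ones used in this section. It resolves the host directories to absolute paths and rejects version strings that do not look like vYYYY.

```python
import re
import shlex
from pathlib import Path

IMAGE = "ffridataset-scripts"  # image tag used in the docker build command


def build_docker_cmd(data_dir, out_dir, version, csv_name="target.csv", not_pe_only=False):
    """Return the docker run argument list for creating a dataset."""
    if not re.fullmatch(r"v\d{4}", version):
        raise ValueError(f"version must look like vYYYY, got {version!r}")
    # Bind mounts are given as absolute host paths here.
    data = Path(data_dir).resolve()
    out = Path(out_dir).resolve()
    cmd = [
        "docker", "run",
        "-v", f"{data}:/work/data",
        "-v", f"{out}:/work/out_dir",
        IMAGE, "main.py",
        "--csv", f"./data/{csv_name}",
        "--out", "./out_dir",
        "--log", "./dataset.log",
        "--ver", version,
    ]
    if not_pe_only:
        cmd.append("--not-pe-only")
    return cmd


# Example usage: print the command, or pass the list to subprocess.run(cmd, check=True).
# print(shlex.join(build_docker_cmd("data", "out_dir", "v2024")))
```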

Run This Script Natively

Prepare To Use

Attention: We recommend running the following commands in the working directory (the ffridataset-scripts directory).

```
export LC_ALL=C.UTF-8
export LANG=C.UTF-8

# System packages
sudo apt update
sudo apt install -y --no-install-recommends wget git gcc g++ make autoconf libfuzzy-dev unar cmake mlocate libssl-dev libglib2.0-0 curl libboost-regex-dev libboost-program-options-dev libboost-system-dev libboost-filesystem-dev build-essential libpcre2-dev libdouble-conversion-dev
sudo apt install -y --no-install-recommends libqt5core5a libqt5svg5 libqt5gui5 libqt5widgets5 libqt5opengl5 libqt5dbus5 libqt5scripttools5 libqt5script5 libqt5network5 libqt5sql5
sudo apt install -y --no-install-recommends libffi-dev libncurses5-dev zlib1g zlib1g-dev libreadline-dev libbz2-dev libsqlite3-dev liblzma-dev
sudo apt install -y --no-install-recommends software-properties-common gpg-agent gpg clang

# Detect It Easy (DIE)
wget https://github.com/horsicq/DIE-engine/releases/download/3.09/die_3.09_Ubuntu_22.04_amd64.deb
sudo apt --fix-broken install ./die_3.09_Ubuntu_22.04_amd64.deb
rm die_3.09_Ubuntu_22.04_amd64.deb

# TrID and its definition file
wget mark0.net/download/trid_linux_64.zip
unar trid_linux_64.zip
cp trid_linux_64/trid ./
chmod u+x trid
cp triddefs_dir/triddefs-dataset2024.trd triddefs.trd

# Third-party tools, pinned to specific commits
cd workspace

git clone https://github.com/JPCERTCC/impfuzzy.git
cd impfuzzy
git checkout b30548d005c9d980b3e3630648b39830597293fc
cd ../

git clone https://github.com/JusticeRage/Manalyze.git
cd Manalyze
git checkout b6800ffcf2f7f4e82fe1f94d0eb2736e75e175ec
cmake .
make
cd ../

git clone https://github.com/lief-project/LIEF.git
cd LIEF
git checkout 573c885de5a2bb217d4d0255b54f9b53d9a4d7c9
git apply ../../patches/lief.patch
cd ../

git clone https://github.com/trendmicro/tlsh.git
cd tlsh
git checkout 96536e3f5b9b322b44ce88d36126121685e45a77
./make.sh
cd ../

git clone https://github.com/erocarrera/pefile.git
cd pefile
git checkout ceab92e003b3436d2e52b74e9c903e812a4aeae1
cd ../../

# Ninja build tool
wget https://github.com/ninja-build/ninja/releases/download/v1.12.1/ninja-linux.zip
unar ninja-linux.zip
sudo mv ninja /usr/bin/

# Python dependencies
poetry install --no-root
```

If something goes wrong, refer to the Dockerfile.

Run Tests

Attention: Do not store a file named test.exe in the working directory. The test script copies testbin/test.exe into the directory and then removes it.

    poetry run python test_main.py

Make Datasets

Before running this script, create the CSV file described in the Make A CSV File section and pass its path as an argument. Unlike when using Docker, file paths can be specified as full paths.

Attention: Do not store malware and cleanware in the working directory. This script copies malware and cleanware into the directory and then removes them.

poetry run python main.py --csv <path/to/csv> --out <path/to/output_dataset_dir> --log <path/to/log_file> --ver <version_string>
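
Before a long native run, it can help to check the CSV against the warning above. The following is a minimal sketch under stated assumptions: `check_csv` is a hypothetical helper (not part of this repository), relative paths are interpreted relative to the working directory, and only files stored directly in the working directory are flagged.

```python
import csv
from pathlib import Path


def check_csv(csv_path, working_dir="."):
    """Return a list of human-readable problems found in the input CSV."""
    work = Path(working_dir).resolve()
    problems = []
    with open(csv_path, newline="") as f:
        # The header is line 1, so data rows start at line 2.
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            target = Path(row["path"])
            if not target.is_absolute():
                target = work / target
            target = target.resolve()
            if not target.is_file():
                problems.append(f"line {line_no}: {row['path']} does not exist")
            elif target.parent == work:
                problems.append(f"line {line_no}: {row['path']} is stored in the working directory")
    return problems


# Example usage (hypothetical CSV path):
# for problem in check_csv("/path/to/target.csv"):
#     print(problem)
```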

Notes About Hashes

  • TLSH may sometimes be an empty string. This occurs because a file must possess a sufficient level of complexity to generate a valid TLSH. For more details, visit https://github.com/trendmicro/tlsh/blob/master/README.md.
  • The peHashes (crits, endgame, and totalhash) can be null due to bugs in their implementation.
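
Because these hash fields may legitimately be empty or null, downstream code should not assume they are always populated. The sketch below lists every key path in a parsed record whose value is null or an empty string. The function name `find_missing` and the record layout in the usage example are hypothetical; the actual key names in the generated JSON may differ.

```python
def find_missing(value, path=""):
    """Return the key paths whose value is None or an empty string."""
    missing = []
    if isinstance(value, dict):
        for key, child in value.items():
            missing += find_missing(child, f"{path}.{key}" if path else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            missing += find_missing(child, f"{path}[{index}]")
    elif value is None or value == "":
        missing.append(path)
    return missing


# Hypothetical record layout, for illustration only.
record = {
    "hashes": {
        "tlsh": "",
        "peHashes": {"crits": None, "endgame": "abc", "totalhash": None},
    }
}
print(find_missing(record))
# ['hashes.tlsh', 'hashes.peHashes.crits', 'hashes.peHashes.totalhash']
```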

Notes About TrID Definition File

  • The TrID definition files located in triddefs_dir are redistributed with permission from the TrID author, Marco Pontello.
  • The latest definition file can be obtained from the TrID website.

Tested

  • Ubuntu 22.04.2 LTS
  • Ubuntu 22.04 on WSL2 on Windows 10

Development

Profiling Measurement

First, create two folders:

    mkdir out_dir
    mkdir measurement

Next, build a Docker image by specifying the measurement target:

    docker build --target measurement --tag ffridataset-scripts .

Then, run the following command to generate executables and a CSV file:

    docker run -v <path/to/here>\testbin:/work/testbin -v <path/to/here>\measurement\:/work/measurement ffridataset-scripts poetry run python create_measurement_env.py

Now you're ready to do profiling. To generate a cProfile result file, run:

    docker run -v <path/to/here>\measurement:/work/data -v <path/to/here>\out_dir:/work/out_dir ffridataset-scripts poetry run python -m cProfile -o ./out_dir/profiling.stats main.py --csv ./data/test.csv --out ./out_dir --log ./test.log --ver v2023

Then, run the following command to serve the result with SnakeViz on port 8080:

    docker run -v <path/to/here>\out_dir\:/work/out_dir/ --rm -p 8080:8080 ffridataset-scripts poetry run snakeviz /work/out_dir/profiling.stats -s -p 8080 -H 0.0.0.0

Now, you can view the profiling results through your browser.
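
If you prefer a quick text summary over the browser view, the same cProfile result file can be read with Python's standard `pstats` module. This is a minimal sketch: `print_top` is a hypothetical helper name, and the path in the usage comment assumes the profiling.stats file produced by the command above.

```python
import pstats


def print_top(stats_path, limit=10, stream=None):
    """Print the `limit` entries with the highest cumulative time."""
    # stream=None makes pstats write to standard output.
    stats = pstats.Stats(stats_path, stream=stream)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)


# Example usage:
# print_top("out_dir/profiling.stats", limit=20)
```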

Author

Yuki Mogi. © FFRI, Inc. 2019-2024

Koh M. Nakagawa. © FFRI, Inc. 2019-2024

Owner

  • Name: FFRI Security, Inc.
  • Login: FFRI
  • Kind: organization
  • Location: Tokyo, Japan

Next Generation Security

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 2
  • Total pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: 6 months
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 5.0
  • Average comments per pull request: 2.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 4
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kohnakagawa (1)
  • wangclin (1)
Pull Request Authors
  • dependabot[bot] (4)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels
dependencies (4)