https://github.com/czs108/microsoft-malware-classification

🔍 "2015 Microsoft Malware Classification Challenge" - Using machine learning to classify malware into different families based on Windows PE structures, disassembly scripts and machine code.

# Microsoft Malware Classification

[![Python](badges/Python.svg)](https://www.python.org)
![LaTeX](badges/LaTeX.svg)
[![Jupyter](badges/Made-with-Jupyter.svg)](https://jupyter.org)
[![Windows](badges/Microsoft-Windows.svg)](https://www.microsoft.com/en-ie/windows)
[![Kaggle](badges/Kaggle.svg)](https://www.kaggle.com)
![License](badges/License-MIT.svg)
[![arXiv](badges/Published-in-arXiv.svg)](https://arxiv.org/abs/2201.07649)
[![TechRxiv](badges/Published-in-TechRxiv.svg)](https://www.techrxiv.org/articles/preprint/Malware_Classification_Using_Static_Disassembly_and_Machine_Learning/17259806)
[![CEUR-WS](badges/Published-in-CEUR-WS.svg)](http://ceur-ws.org/Vol-3105/paper8.pdf)

## Introduction

![Cover](Cover.jpg)

In recent years, the malware industry has become a well-organized market involving large amounts of money. Well-funded, multi-player syndicates invest heavily in technologies and capabilities built to evade traditional protection, requiring anti-malware vendors to develop counter mechanisms for finding and deactivating them. One of the major challenges that anti-malware vendors face today is the vast amount of data and files that must be evaluated for potential malicious intent.

The goal of this project is to train a machine learning classifier that separates malicious samples into families, such as *Virus*, *Worm*, and *Trojan*, with high accuracy and efficiency.

## Dataset

The dataset is from the [*2015 Microsoft Malware Classification Challenge*](https://www.kaggle.com/c/malware-classification). It contains 10868 malware samples representing a mix of nine families.

|       Name       | # Samples |        Type        |
| :--------------: | :-------: | :----------------: |
|     `Ramnit`     |   1541    |        Worm        |
|    `Lollipop`    |   2478    |       Adware       |
|  `Kelihos_ver3`  |   2942    |      Backdoor      |
|     `Vundo`      |    475    |       Trojan       |
|     `Simda`      |    42     |      Backdoor      |
|     `Tracur`     |    751    | Trojan Downloader  |
|  `Kelihos_ver1`  |    398    |      Backdoor      |
| `Obfuscator.ACY` |   1228    | Obfuscated malware |
|     `Gatak`      |   1013    |      Backdoor      |

Each sample comes as two files: a hexadecimal dump of its machine code and a disassembly script generated by *IDA Pro*.

```
10001100 C4 01 74 AC D9 EE D9 C0 DD EA DF E0 F6 C4 44 7A
10001110 0A DD D9 DD 17 DD 1E 8B E5 5D C3 DD D8 E8 74 26
10001120 00 00 DC 0D 78 65 00 10 DC 34 24 E8 60 26 00 00
10001130 DD 44 24 08 D8 C9 DD 1F DC 4C 24 10 DD 1E 8B E5
10001140 5D C3 CC CC CC CC CC CC CC CC CC CC CC CC CC CC
```

```
.text:10001106 D9 C0        fld     st
.text:10001108 DD EA        fucomp  st(2)
.text:1000110A DF E0        fnstsw  ax
.text:1000110C F6 C4 44     test    ah, 44h
.text:1000110F 7A 0A        jp      short loc_1000111B
.text:10001111 DD D9        fstp    st(1)
.text:10001113 DD 17        fst     qword ptr [edi]
```

The `data` folder contains only a few sample files.
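
The hexadecimal dump shown above can be read line by line: the first token is the start address and the remaining tokens are byte values. The following is a minimal sketch of such a parser, not the repository's own loader. The function name `parse_hex_dump_line` is hypothetical, and treating non-hexadecimal tokens as unknown bytes is an assumption.

```python
def parse_hex_dump_line(line):
    """Split one dump line into its start address and a list of byte values.

    Tokens that cannot be parsed as hexadecimal (assumed here to mark
    unknown bytes) are kept as None so that byte positions stay aligned.
    """
    address_text, *byte_tokens = line.split()
    address = int(address_text, 16)
    values = []
    for token in byte_tokens:
        try:
            values.append(int(token, 16))
        except ValueError:
            values.append(None)
    return address, values


sample = "10001100 C4 01 74 AC D9 EE D9 C0 DD EA DF E0 F6 C4 44 7A"
address, values = parse_hex_dump_line(sample)
print(hex(address), len(values), values[:4])
# prints: 0x10001100 16 [196, 1, 116, 172]
```

Keeping a placeholder for unparsable tokens, rather than dropping them, preserves the offset of every byte relative to the line's start address.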

## Precomputed Features

|         Item          |                         Description                          |
| :-------------------: | :----------------------------------------------------------: |
|       File Size       | The sizes of disassembly and machine code files, and their ratios. |
|      API 4-gram       |                        API sequences.                        |
|     Opcode 4-gram     |                      Opcode sequences.                       |
|    Import Library     |           Import libraries in the PE Import Table.           |
|    PE Section Size    | The virtual sizes and raw sizes of PE sections, and their ratios. |
| PE Section Permission | The total sizes of readable data, writable data and executable code. |
|  Content Complexity   | The original sizes, compressed sizes and compression ratios of disassembly and machine code files. |
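
As an illustration of the opcode 4-gram feature, the sketch below pulls the mnemonic out of each disassembly line and counts every run of four consecutive mnemonics. It is a minimal example under stated assumptions: the regular expression is a heuristic fitted to the sample lines in the Dataset section, and the names `extract_opcodes` and `count_ngrams` are hypothetical. How the project itself tokenizes disassembly and selects n-grams is not shown here.

```python
import re
from collections import Counter

# Heuristic pattern for lines such as ".text:1000110C F6 C4 44     test    ah, 44h":
# segment:address, upper-case byte pairs, then a lower-case mnemonic.
INSTRUCTION = re.compile(r"^\S+:[0-9A-F]+\s+(?:[0-9A-F]{2}\s+)+([a-z]\w*)")


def extract_opcodes(lines):
    """Return the mnemonic of every line that looks like an instruction."""
    opcodes = []
    for line in lines:
        match = INSTRUCTION.match(line)
        if match:
            opcodes.append(match.group(1))
    return opcodes


def count_ngrams(items, n=4):
    """Count every run of n consecutive items with a sliding window."""
    windows = zip(*(items[i:] for i in range(n)))
    return Counter(windows)


listing = [
    ".text:10001106 D9 C0        fld     st",
    ".text:10001108 DD EA        fucomp  st(2)",
    ".text:1000110A DF E0        fnstsw  ax",
    ".text:1000110C F6 C4 44     test    ah, 44h",
    ".text:1000110F 7A 0A        jp      short loc_1000111B",
    ".text:10001111 DD D9        fstp    st(1)",
    ".text:10001113 DD 17        fst     qword ptr [edi]",
]
opcodes = extract_opcodes(listing)
for gram, count in count_ngrams(opcodes).items():
    print(count, " ".join(gram))
```

Seven instructions yield four overlapping 4-grams. Over a whole sample, the resulting counts could serve as one row of a feature matrix.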

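The content complexity feature in the table above relates a file's original size to its compressed size. The sketch below shows one way to compute it. Using `zlib` is an assumption, since the compressor used by the project is not named here, and `content_complexity` is a hypothetical helper.

```python
import random
import zlib


def content_complexity(data):
    """Return (original size, compressed size, compressed/original ratio)."""
    original_size = len(data)
    compressed_size = len(zlib.compress(data))
    ratio = compressed_size / original_size if original_size else 0.0
    return original_size, compressed_size, ratio


# Padding-like content is highly repetitive and compresses well.
repetitive = b"\xCC" * 4096
# Pseudo-random content (fixed seed for repeatability) barely compresses.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(4096))

print(content_complexity(repetitive))
print(content_complexity(noisy))
```

A ratio close to 1 suggests packed or encrypted content, while a low ratio suggests padding or repeated structure.
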
## Models

These models have been tested in the project:

- *Support Vector*
- *K-Nearest Neighbors*
- *Random Forest*

## IDA Pro Classification Plug-in

Unlike other machine learning applications such as handwritten digit classification, where the shapes of the digits do not change over time, the similarity between previous and future malware degrades over time because of function updates and polymorphic techniques. Polymorphic techniques can automatically and frequently change identifiable characteristics, such as encryption types and code distribution, to make malware unrecognizable to anti-virus detection.

To address this, we designed an automatic malware classification workflow that applies and enhances our classifier in practice, using *IDA Pro*'s *Python* development kit.

![workflow](paper-tex/figures/workflow.png)

### Getting Started

Copy all the files in the `ida-plugin` folder except `tpot_exported_ida_pipeline.py` to *IDA Pro*'s `plugins` folder.

- `generator.py`

  Generates the disassembly and machine code files for an executable sample, relying on *IDA Pro*'s disassembler. The two output files use a format similar to that of the files in the dataset.

- `generation_cmd.py`

  A wrapper for `generator.py`. It can be run under *IDA Pro*'s autonomous mode.

- `Generate-Samples.ps1`

  Runs `generation_cmd.py` over a directory containing executable samples.

- `tpot_exported_ida_pipeline.py`

  The best machine learning pipeline produced by `TPOT`. It is included for reference only.

- `tpot_exported_ida_clf.joblib`

  A fitted machine learning model using the pipeline in `tpot_exported_ida_pipeline.py`.

- `classifier.py`

  An *IDA Pro* classification plug-in. When an analyst opens a sample with *IDA Pro*, it calculates the probability that the sample belongs to each malware family.

  ```console
  0.53 -> Ramnit
  0.24 -> Lollipop
  0.17 -> Obfuscator.ACY
  0.05 -> Gatak
  0.01 -> Simda
  0.01 -> Vundo
  0.00 -> Tracur
  0.00 -> Kelihos_ver1
  0.00 -> Kelihos_ver3
  ```
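
The ranked output above pairs each family with a probability and lists the most likely family first. The sketch below reproduces that formatting only. It assumes the probabilities arrive as a plain sequence, for example one row of a scikit-learn estimator's `predict_proba` result. The label order in `FAMILIES` and the helper `format_ranking` are hypothetical and not taken from `classifier.py`.

```python
# Family names from the dataset table; this ordering is an assumption.
FAMILIES = [
    "Ramnit", "Lollipop", "Kelihos_ver3", "Vundo", "Simda",
    "Tracur", "Kelihos_ver1", "Obfuscator.ACY", "Gatak",
]


def format_ranking(probabilities, labels=FAMILIES):
    """Pair each probability with its family and list the most likely first."""
    ranked = sorted(zip(probabilities, labels), key=lambda pair: pair[0], reverse=True)
    return [f"{probability:.2f} -> {label}" for probability, label in ranked]


# Hypothetical probabilities in the order of FAMILIES, for illustration only.
example = [0.53, 0.24, 0.00, 0.01, 0.01, 0.00, 0.00, 0.17, 0.05]
print("\n".join(format_ranking(example)))
```

Families with equal probabilities keep their relative order from `FAMILIES`, because Python's sort is stable.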

### Manual Data Generation

The plug-in can generate the disassembly and machine code files itself before automatic classification. You can also generate them manually with `generation_cmd.py` and `Generate-Samples.ps1`.

`generation_cmd.py` can be run from the command line with *IDA Pro*'s parameters `-A` and `-S`, which launch *IDA Pro* in autonomous mode and make it run a script.

```bash
ida -A "-S" 
```

Note that there is no space between `-S` and the path.
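
When the command is assembled programmatically, the script path has to be joined to `-S` within a single argument. The sketch below builds such an argument list without running it. The executable name, both paths, and the helper `build_ida_command` are hypothetical placeholders. It also assumes that the sample path is the final argument and that the script takes no extra arguments, neither of which is confirmed here.

```python
def build_ida_command(ida_executable, script_path, sample_path):
    """Build an argument list that runs a script in IDA Pro's autonomous mode.

    The script path is appended to -S directly, with no space in between.
    """
    return [ida_executable, "-A", f"-S{script_path}", sample_path]


# All names and paths below are hypothetical placeholders.
command = build_ida_command("ida", "generation_cmd.py", "sample.exe")
print(command)
# prints: ['ida', '-A', '-Sgeneration_cmd.py', 'sample.exe']
```

Passing the arguments as a list, for instance to `subprocess.run`, avoids shell quoting problems when a path contains spaces.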

Or use `Generate-Samples.ps1` to process a directory containing executable samples.

```powershell
Generate-Samples.ps1 -InputDirectory  -OutputDirectory 
```

## Dependencies

- [*NumPy*](https://numpy.org)
- [*pandas*](https://pandas.pydata.org)
- [*SciPy*](https://www.scipy.org)
- [*scikit-learn*](https://scikit-learn.org/stable)
- [*auto-sklearn*](https://automl.github.io/auto-sklearn/master)
- [*TPOT*](http://epistasislab.github.io/tpot)
- [*Matplotlib*](https://matplotlib.org)
- [*dtreeviz*](https://github.com/parrt/dtreeviz)
- [*seaborn*](https://seaborn.pydata.org)
- [*joblib*](https://joblib.readthedocs.io/en/latest)
- [*tqdm*](https://tqdm.github.io)

## License

Distributed under the *MIT License*. See `LICENSE` for more information.

## Citing

- [*arXiv*](https://arxiv.org/abs/2201.07649)

  ```tex
  @misc{chen-2021:static-disasm-malware,
      title         = {Malware Classification Using Static Disassembly and Machine Learning},
      author        = {Zhenshuo Chen and Eoin Brophy and Tomas Ward},
      year          = 2021,
      eprint        = {2201.07649},
      archiveprefix = {arXiv},
      primaryclass  = {cs.CR}
  }
  ```

- [*CEUR-WS*](http://ceur-ws.org/Vol-3105/paper8.pdf)

  ```tex
  @inproceedings{chen-2022:static-disasm-malware,
      title    = {Malware Classification Using Static Disassembly and Machine Learning},
      author   = {Zhenshuo Chen and Eoin Brophy and Tomas Ward},
      pages    = {48--59},
      url      = {http://ceur-ws.org/Vol-3105/paper8.pdf},
      crossref = {AICS2021}
  }

  @proceedings{AICS2021,
      title     = {The 29th Irish Conference on Artificial Intelligence and Cognitive Science 2021},
      year      = 2021,
      booktitle = {The 29th Irish Conference on Artificial Intelligence and Cognitive Science 2021},
      address   = {Aachen},
      series    = {CEUR Workshop Proceedings},
      number    = 3105,
      issn      = {1613-0073},
      url       = {http://ceur-ws.org/Vol-3105},
      editor    = {Arjun Pakrashi and Ellen Rushe and Mehran Bazargani and Mac Namee, Brian},
      venue     = {Dublin, Republic of Ireland},
      eventdate = {2021-12-09}
  }
  ```

Owner

  • Name: Chenzs108
  • Login: czs108
  • Kind: user
  • Location: Dublin, Ireland
  • Company: Susquehanna International Group

Software Development | Artificial Intelligence | Reverse Engineering. For more projects, see @Zhuagenborn.
