https://github.com/alfa-group/adv-malware-viz
"On Visual Hallmarks of Robustness to Adversarial Malware" by Alex Huang, Abdullah Al-Dujaili, Erik Hemberg, Una-May O'Reilly
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 16.5%, to scientific vocabulary)
Keywords: adversarial-example, deep-learning, malware, visualization
Last synced: 5 months ago
Repository
"On Visual Hallmarks of Robustness to Adversarial Malware" by Alex Huang, Abdullah Al-Dujaili, Erik Hemberg, Una-May O'Reilly
Basic Info
Statistics
- Stars: 6
- Watchers: 7
- Forks: 5
- Open Issues: 0
- Releases: 0
Topics: adversarial-example, deep-learning, malware, visualization
Created almost 8 years ago · Last pushed over 7 years ago
https://github.com/ALFA-group/adv-malware-viz/blob/master/
Code repository for the paper [On Visual Hallmarks of Robustness to Adversarial Malware](https://arxiv.org/pdf/1805.03553.pdf). A series of related blog posts can be found [here](http://ash-aldujaili.github.io/blog/2018/08/29/evasive-malware/).

# Installation

If you have `conda` installed, `cd` to the main directory and execute the following, using `osx_environment.yml` or `linux_environment.yml` on macOS or Linux, respectively:
```
conda install nb_conda
conda config --add channels conda-forge
conda env create --file ymls/(osx|linux)_environment.yml
```
This will create an environment called `nn_mal`. To activate this environment, execute:
```
source activate nn_mal
```
PS1: If you're going to use Losswise, you may run into an issue with one `print` line whose argument is not enclosed by parentheses; just add the parentheses if this error shows up and you're good to go.

PS2: If you're running the code on macOS with CUDA: according to pytorch.org, the macOS binaries don't support CUDA, so install PyTorch from source if CUDA is needed.

# Jupyter Notebook Code Walkthrough - Synthetic Data

**jupyter_tutorial.ipynb** provides a walkthrough of the code and each of the figures using a synthetic dataset in which malicious vectors have bits set with probability 0.2 and benign vectors have bits set with probability 0.8. Make sure your Jupyter notebook kernel is set to the `nn_mal` conda env. In order to have `nn_mal` show up in the notebook under Kernel->Change kernel, run this command after activating the env:
```
python -m ipykernel install --user --name nn_mal --display-name "nn_mal"
```
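For intuition, data of this kind is just independent Bernoulli draws per feature: bit probability 0.2 for malicious vectors, 0.8 for benign ones. Below is a minimal sketch of such generation; the vector dimension, sample counts, and seeds are illustrative assumptions, not the notebook's actual values:

```python
import numpy as np

def make_synthetic_vectors(num_samples, num_features, p_bit, seed=0):
    """Draw binary feature vectors whose bits are set i.i.d. with probability p_bit."""
    rng = np.random.default_rng(seed)
    return (rng.random((num_samples, num_features)) < p_bit).astype(np.float32)

# Malicious vectors: bits set with probability 0.2; benign: probability 0.8.
malicious = make_synthetic_vectors(1000, 1024, p_bit=0.2, seed=0)
benign = make_synthetic_vectors(1000, 1024, p_bit=0.8, seed=1)
labels = np.concatenate([np.ones(len(malicious)), np.zeros(len(benign))])
```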
# Full Process Walkthrough - Sample Dataset

## 1. Assembling Portable Executable (PE) Dataset
The first step is to gather a dataset of benign and malicious PE files. Each sample is then turned into its corresponding feature vector after examining the entire dataset to create a mapping from imported function to index. We do not include the actual samples in this repo, but we provide the generated feature vectors in **sample_dataset_saved_feature_vectors** and describe the process in (2).

## 2. Generating Feature Vector Files
To save time during training, we generate the feature vector for each file once and save it as a pickle file, rather than recreating the feature vector each time we load a PE file. To do this, modify the *malicious_filepath* and *benign_filepath* parameters in **parameters.ini** to match the locations of your malicious and benign files, respectively. Change the location of the saved vectors by modifying the *saved_vectors_directory* parameter. To generate the vectors, run:
```
python generate_vectors.py
```

## 3A. Train Model
**NOTE: This is the step to start at when running this code for the first time.** The file **framework.py** performs the actual model training and **parameters.ini** provides the specifications. This design pattern is used throughout the various packages. For the sample dataset, set *use_saved_feature_vectors* to True in order to use the generated feature vectors from step 2. To train a model, simply run:
```
python framework.py parameters.ini
```
Sections 3B, 3C, and 3D provide an overview of the parameters available.

## 3B. parameters.ini Dataset Parameters Explanation
- malicious_filepath - directory containing malicious PE files or saved feature vectors
- benign_filepath - directory containing benign PE files or saved feature vectors
- helper_filepath - directory containing index mappings and file lists
- malicious_files_list - a list of malicious files to use; None uses all files in the directory
- benign_files_list - a list of benign files to use; None uses all files in the directory
- load_mapping_from_pickle - whether to load a pre-created function-to-index mapping file
- pickle_mapping_file - path to a function-to-index mapping pickle file
- generate_feature_vector_files - set to True only when running generate_vectors.py
- use_saved_feature_vectors - whether to use saved vectors or regenerate each time a PE is loaded

## 3C. parameters.ini General Parameters Explanation
- is_synthetic_dataset - generate feature vectors by randomly setting bits with some probability
- is_cuda - True if GPU enabled, False otherwise
- use_seed - whether to seed (for reproducibility)
- is_losswise - Losswise integration
- losswise_api_key - API key for Losswise integration
- training_method - the inner maximizer method used to create examples for training (natural, dfgsm_k, rfgsm_k, bga_k, and bca_k)
- evasion_method - the inner maximizer method to use when generating adversarial examples in the validation or test phase
- experiment_suffix - name of the experiment
- train_model_from_scratch - if True, the training process will take place
- load_model_weights - if True, no training; a pre-trained model is loaded instead
- model_weights_path - path to a saved PyTorch model
- num_workers - number of workers to use for PyTorch DataLoaders
- model_output_directory - directory to save models in

## 3D. parameters.ini Hyperparam Parameters Explanation
- ff_h1, ff_h2, ff_h3 - sizes of the three hidden layers
- ff_learning_rate - learning rate
- ff_num_epochs - number of epochs to train and test on
- evasion_iterations - number of iterations to perform for iterative inner maximizer methods

## 4. Generating All Training Model Combinations
**run_experiments.py** is a script that runs **framework.py** with each training and test inner maximizer combination.
```
python run_experiments.py
```
At this point, there should be 5 saved models in *trained_models/*, each with a different inner maximizer method used for training.

## 5. Collecting Accuracy and Evasion Results
Run the following script to generate tex files with results in *result_files/*:
```
python utils/collect_results.py [insert_experiment_name_here]
```

## 6. Generating Adversarial Vectors
We can use the naturally trained model in combination with each of our evasion methods to generate a set of adversarial vectors produced by each method (a sketch of the general idea appears below). Make sure the *experiment_name* and *saved_model_directory* parameters are set properly in **generate_adversarial_parameters.ini**, as well as *output_directory_for_adv_vecs*, the output location for the adversarial vectors. To generate, go to the *generate_adversarial/* directory and run:
```
python generate_adversarial.py
```
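The inner maximizer methods named in 3C (dfgsm_k, rfgsm_k, bga_k, bca_k) perturb binary feature vectors so as to raise the model's loss. As a hedged illustration of the general recipe, not the repository's implementation, the sketch below runs k gradient-ascent steps in the continuous [0, 1] relaxation of the bit vector, rounds back to bits, and keeps every originally set bit so the sample's original imports are preserved; `model`, `k`, and `epsilon` are placeholder assumptions, and the model is assumed to output logits matching the shape of `y`:

```python
import torch
import torch.nn.functional as F

def fgsm_k_binary(model, x, y, k=50, epsilon=0.02):
    """Illustrative k-step FGSM-style inner maximizer for binary feature
    vectors: ascend the loss in the [0, 1] relaxation, round to bits, and
    only allow bits to be added (max with x) so original features survive."""
    x_adv = x.clone()
    for _ in range(k):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.binary_cross_entropy_with_logits(model(x_adv), y)
        loss.backward()
        with torch.no_grad():
            # Gradient-ascent step, kept inside the unit hypercube.
            x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0)
    # Round to binary and keep every bit set in the original sample.
    return torch.max(x, x_adv.detach().round())
```

In the actual repo, the maximizers are selected via the *training_method* and *evasion_method* parameters, and the adversarial vectors saved in step 6 come from those implementations, not from this sketch.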
## 7. Generating Histograms and Loss Progressions (Figures 3 and 4)
To generate loss progressions and histograms, run the following in the **loss_graphs/** directory, taking care to ensure that *experiment_name* is set properly in **figure_generation_parameters.ini**:
```
python run_loss_landscape_experiments.py [insert_experiment_name_here]
python run_histogram_experiments.py [insert_experiment_name_here]
```
The figures will be output to the directories **loss_progressions/** and **histograms/**.

## 8. Generating 3D Loss Landscapes (Figures 5A and 5C)
There are two options for generating loss landscapes: calculating the loss using only vectors generated with the same inner maximizer used to train the model (Figure 5, Column A), or calculating the loss using all types of adversarial vectors (Figure 5, Column C). This is controlled by the *use_all_attack_variants* parameter in **loss_visual_params.ini**. The *plot_size* and *increment* parameters in **loss_visual_params.ini** cause the alpha and beta values for filter-wise normalization to lie on the grid from *-plot_size* to *+plot_size* in steps of *increment*. To generate loss landscapes for each model type:
```
python run_loss_visualization_experiments.py [insert_experiment_name_here]
```

## 9. Training Self-Organizing Maps and Plotting Decision Map (Figures 5B and 5D)
There are two steps to generating the decision map plots: training the self-organizing map (SOM) and using it to plot the decision map. As with the loss landscape methods, we can train a SOM either on all the adversarial vectors or on a single type of adversarial vector; the latter is used for models trained with the same inner maximizer method. The number of vectors of each type, the number of training epochs, and the dimensionality of the SOM are set in the [hyperparam] section of **som_parameters.ini**. To train a SOM after setting the parameters:
```
python train_som.py som_parameters.ini
```
The SOM is saved as a pickle file in *som_pickles/*. To plot a decision map, set the *som_pickle_dir* and *som_pickle_file* variables in **som_parameters.ini** according to the previous training. If *plot_all_attack_variants* is set to true, all types of adversarial vectors will be shown on the decision map (Figure 5, Column D). If it is set to false, only one type will be plotted (Figure 5, Column B); in this case, 5 SOMs, each trained with a single type of adversarial vector, must be provided in place of the TODO in **som_filenames**. To generate decision maps:
```
python som_decision_map.py som_parameters.ini
```
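If SOMs are unfamiliar, the core training loop is small: for each input vector, find the best-matching unit (BMU) on the 2D grid and pull that unit and its grid neighbors toward the input, with the learning rate and neighborhood radius decaying over time. Below is a minimal, generic NumPy sketch; the grid size, epochs, and decay schedule are illustrative assumptions, and this is not the repo's train_som.py:

```python
import numpy as np

def train_som(data, grid_h=10, grid_w=10, epochs=20, lr0=0.5, seed=0):
    """Minimal SOM trainer; returns unit weights of shape (grid_h, grid_w, dim)."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.random((grid_h, grid_w, dim))
    # Grid coordinates of every unit, used for neighborhood distances.
    coords = np.stack(
        np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1
    )
    sigma0 = max(grid_h, grid_w) / 2.0
    steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            lr = lr0 * np.exp(-t / steps)
            sigma = sigma0 * np.exp(-t / steps)
            # Best-matching unit: grid cell whose weight vector is closest to x.
            bmu = np.unravel_index(
                np.argmin(((weights - x) ** 2).sum(-1)), (grid_h, grid_w)
            )
            # Gaussian neighborhood on the grid around the BMU.
            d2 = ((coords - np.array(bmu)) ** 2).sum(-1)
            h = np.exp(-d2 / (2 * sigma**2))[..., None]
            weights += lr * h * (x - weights)
            t += 1
    return weights
```

A trained map can then be labeled by assigning each vector to its BMU and coloring grid cells by the majority class of the vectors that land there, which is essentially what a decision map visualizes.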
Owner
- Name: Anyscale Learning For All (ALFA)
- Login: ALFA-group
- Kind: organization
- Email: alfa-apply@csail.mit.edu
- Location: Cambridge, MA, USA
- Website: https://alfagroup.csail.mit.edu/
- Repositories: 19
- Profile: https://github.com/ALFA-group
Scalable machine learning technology, Adversarial AI, Evolutionary algorithms, and data science frameworks.