vibalcam/ml-malware-detection

Machine Learning-Based Cyberdefenses Competition

The data_features_combined folder has a small dataset with extracted features. To recreate the full dataset check the Dataset Section.

The data_test folder has executables for testing the model. ATTENTION: this folder contains real malware executables which can be harmful.

Quick start

Instead of building the solution from source, you can download the competition Docker image from here.

An additional Docker image with a better overall model is provided here.

Before you proceed, you must install Docker Engine for your operating system.

Load the Docker image:

docker load -i ml.rar

Run the docker container:

docker run -itp 8080:8080 --memory=1g ml

Test the solution on malicious and benign samples of your choosing via:

python -m test -m data/DikeDataset-main/files/malware -b data/DikeDataset-main/files/benign

Build the sample solution

Before you proceed, you must install Docker Engine for your operating system.

A sample solution that you may modify is included in the defender folder.

Install Python requirements needed to test the solution:

pip install -r requirements.txt

OPTIONAL: To obfuscate the code, first copy the defender folder somewhere else, because the obfuscation is applied in place, and then run

pyminify defender/ --in-place --remove-literal-statements

To compile the Python code so that it runs faster and is slightly obfuscated, run

python out.py

Some trained models can be found in defender/saved_models.

Copy the *.pkl file you want to use as the model into docker/models/. The model to use is selected later, when running the container.

From the root folder that contains the Dockerfile, build the solution:

docker build -t ml .

Run the docker container:

docker run -itp 8080:8080 --memory=1g ml

The flag -p 8080:8080 maps the container's port 8080 to the host's port 8080.

The flag --memory=1g limits the container to 1 GB of RAM.

The flag --env MODEL_FILE="models/ml_classifier.pkl" can be added to specify which model to run.

Test the solution on malicious and benign samples of your choosing via:

python -m test -m data/DikeDataset-main/files/malware -b data/DikeDataset-main/files/benign

You can also use the system folder C:\Windows\System32\ as benign samples.

Sample collections may be in a folder, or in an archive of type zip, tar, tar.bz2, tar.gz or tgz.

Archives do not need to be extracted. When testing malicious samples, it is strongly recommended that you leave the archive unextracted.
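How the test module reads archives is not shown here. As a minimal sketch of the idea, assuming a hypothetical helper named iter_samples (not part of the repository), samples can be read from a zip or tar archive directly into memory so that no malicious file is ever written to disk:

```python
import tarfile
import zipfile


def iter_samples(archive_path):
    """Yield (name, bytes) for every regular file in a zip or tar archive.

    Hypothetical helper for illustration: members are read into memory
    only and are never extracted to the filesystem.
    """
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)
    elif tarfile.is_tarfile(archive_path):
        # Mode "r:*" lets tarfile detect gzip or bzip2 compression itself.
        with tarfile.open(archive_path, "r:*") as tf:
            for member in tf:
                if member.isfile():
                    yield member.name, tf.extractfile(member).read()
    else:
        raise ValueError(f"unsupported archive: {archive_path}")
```

This sketch does not handle password-protected archives, which malware collections often use.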

Train and test model

Once you have a trained model, test it by running

python test.py -m model_path.pkl

To train a Random Forest model (see defender/models/ml.py) and then test it, run

python -m defender.models.ml && python test.py -m defender/ml.pkl

To train a Deep Learning model (see defender/models/malware_gpt.py) and then test it, run

./scripts/run.sh && python test.py -m defender/ml.pkl

Requirements

Minimum scores

  • FPR (false positive rate) of at most 1%
  • TPR (true positive rate) of at least 95%
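Reading these thresholds as an FPR of at most 1% and a TPR of at least 95%, the check can be sketched as follows. The function names are hypothetical, and the label convention (1 = malicious, 0 = benign) is assumed from the verdict values the service returns:

```python
def fpr_tpr(y_true, y_pred):
    """Return (FPR, TPR) for binary labels where 1 = malicious, 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against empty classes to avoid division by zero.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return fpr, tpr


def meets_requirements(y_true, y_pred, max_fpr=0.01, min_tpr=0.95):
    """True if the predictions satisfy both score thresholds."""
    fpr, tpr = fpr_tpr(y_true, y_pred)
    return fpr <= max_fpr and tpr >= min_tpr
```

With 100 benign and 100 malicious samples, for example, the thresholds allow at most one benign file flagged as malicious and at most five malicious files missed.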

Constraints

  • 1GB of RAM
  • Response time 5 seconds per sample
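One way to stay inside the response-time limit is to give the classifier a time budget and fall back to a default verdict when it runs out. The following is only a sketch under assumptions: predict_with_budget, the 4.5 second budget, and the benign fallback are illustrative choices, not the repository's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def predict_with_budget(predict, pe_bytes, budget_s=4.5, fallback=0):
    """Run predict(pe_bytes), returning fallback if it exceeds budget_s seconds.

    Hypothetical helper: predict is any callable that returns 0 or 1.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(predict, pe_bytes)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on a slow prediction; the worker thread is abandoned.
        pool.shutdown(wait=False)
```

The abandoned thread keeps running until the prediction finishes, so this approach limits latency but not CPU or memory use.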

A valid submission for the defense track consists of the following

  1. a Docker image
  2. listens on port 8080
  3. accepts POST / with header Content-Type: application/octet-stream and the contents of a PE file in the body
  4. returns {"result": 0} for benign files and {"result": 1} for malicious files
  5. for files up to 2**21 bytes (2 MiB), must respond in less than 5 seconds (a timeout results in a benign verdict)
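The request format above can be exercised with a small client. This is a minimal sketch using only the standard library; the function names and the default URL (the container published on local port 8080) are assumptions, and a timeout is mapped to a benign verdict to mirror the rule stated above.

```python
import json
import socket
import urllib.error
import urllib.request


def build_request(pe_bytes, url="http://127.0.0.1:8080/"):
    """Build the POST / request carrying the raw PE file as the body."""
    return urllib.request.Request(
        url,
        data=pe_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def classify(pe_bytes, url="http://127.0.0.1:8080/", timeout=5.0):
    """Return 0 (benign) or 1 (malicious); a timeout counts as benign."""
    try:
        with urllib.request.urlopen(build_request(pe_bytes, url), timeout=timeout) as resp:
            return int(json.loads(resp.read())["result"])
    except socket.timeout:
        return 0
    except urllib.error.URLError as exc:
        # A timeout while connecting is wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return 0
        raise
```

Usage, with the container running: classify(open("sample.exe", "rb").read()), where sample.exe is a placeholder path.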

Generate Dataset

The datasets used are listed in data.txt.

To apply the feature extractor on a folder of PE files and save them for training models use

python -m defender.dataset -s save_folder/save_name [--dike, --windows, --programs, --benign, --malware]

Different parameters allow creating a dataset from

  • --large_dataset from Practical Security Analytics dataset
  • --dike from the DikeDataset
  • --windows from the local Windows system files
  • --programs from the Program Files and Drivers
  • --benign to specify any number of folders considered benign
  • --malware to specify any number of folders considered malware

PE Files Datasets

Datasets Used

Combined Datasets

Malware Datasets

Benign dataset

Other Datasets (Not Used)