A generalizable approach for network flow image representation for deep learning

Short paper for CSNet 23.

We propose a generalizable approach for network flow image representation to detect patterns without performing any network flow cut-offs. Further, we introduce a novel method to preprocess network traffic to enhance our resulting models. In this step, we remove network protocol header information that hinders the models' generalizability. We use a dataset containing malware and benign classes and train different deep learning architectures: VGG-19, ResNet-50, and ResNeXt-50. On our preprocessed dataset, ResNet-50 reaches a multiclass classification accuracy of up to 99.48%, with a macro F1 score of 98.88% and a Kappa score of 99.39%. In the binary scenario, ResNet-50 and VGG-19 achieve 100% accuracy.

All trained models and the preprocessed dataset are publicly available at heiBOX.

Getting started

The folder ML_code contains all the code we used to train our machine learning models on our custom datasets. The following sections describe how to use the individual Python scripts.

> dataset_train_test_split.py
Will perform a stratified random split of a given dataset into train, test, and validation sets (70/15/15 split).
Usage: python dataset_train_test_split.py [DATASET_PATH]

DATASET_PATH:
  The folder containing the dataset ordered into subfolders of classes.
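
As an illustration of what a stratified 70/15/15 split does, here is a minimal sketch in plain Python. It is not the code of dataset_train_test_split.py; the function name `stratified_split`, the `seed` parameter, and the in-memory lists (instead of class subfolders) are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split (sample, label) pairs into train/test/validation lists per class.

    Hypothetical helper for illustration: each class is shuffled and split
    separately, so every subset keeps roughly the original class proportions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    train, test, val = [], [], []
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)
        n_train = round(fractions[0] * len(items))
        n_test = round(fractions[1] * len(items))
        train += [(s, label) for s in items[:n_train]]
        test += [(s, label) for s in items[n_train:n_train + n_test]]
        # Whatever remains goes to the validation set.
        val += [(s, label) for s in items[n_train + n_test:]]
    return train, test, val


# Toy data: 100 samples of class "benign", 20 of class "malware".
samples = [f"img_{i}.png" for i in range(120)]
labels = ["benign"] * 100 + ["malware"] * 20
train, test, val = stratified_split(samples, labels)
```

Splitting each class on its own is what keeps rare classes represented in all three subsets, which a plain random split does not guarantee.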

> train_model.py
Will train a model on a given dataset from scratch or continue training a given model.
Usage: python train_model.py [OPTIONS]

Options:
  -d, --dataset_type STRING   Can be "multiclass" or "binary"  [required]

  -m, --model_type STRING     Can be "fc" (VGG-FC), "notop" (VGG-NoTop), "resnet" (ResNet), or "next" (ResNeXt)  [required]

  -p, --preprocessing_type STRING  Can be "preprocessed": tells the script to
                              load the dataset from the PREPROCESSED_PATH
                              (given in dataset.py), or "payload": tells the
                              script to load the dataset from the PAYLOAD_PATH
                              (given in dataset.py). (PATHS need to be modified
                              in dataset.py!)  [required]

  -s --s saved_model [NONE|MODEL_PATH] Path to a model to continue training. If
                              NONE, training is performed from scratch
                              [default: NONE]

  -t --training_optimizer [NONE|OPTIMIZER_PATH] Path to an optimizer to continue
                              training. If NONE, a new optimizer is initialized
                              [default: NONE]

  -l --learning_rate  [NONE|FLOAT] Defines the learning rate to be used. If
                              NONE, the optimal learning rate determined for
                              the given model will be used  [default: NONE]

  -e --starting_epoch  [INTEGER] Defines the epoch to start training at
                              [default: 1]

  -n --num_epochs  [INTEGER]  The number of epochs to train the model
                              [default: 35]
                              
  -o  [BOOLEAN]               If true oversampling is used.  [default: False]

  -a  [BOOLEAN]               If true an adaptive learning rate is used (if the
                              validation accuracy reaches 93%, the lr will be
                              divided by 10).  [default: False]
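
The adaptive learning rate rule of the -a option can be sketched as follows. This is only an illustration under assumptions: the helper `adapt_learning_rate`, the starting rate of 1e-3, and the choice to reduce the rate only once are not taken from train_model.py, which may behave differently.

```python
def adapt_learning_rate(lr, val_accuracy, already_reduced,
                        threshold=0.93, factor=10.0):
    """Return (new_lr, already_reduced) after one epoch.

    Hypothetical helper: divides the learning rate by `factor` the first
    time the validation accuracy reaches `threshold` (assumed to happen
    only once per training run).
    """
    if not already_reduced and val_accuracy >= threshold:
        return lr / factor, True
    return lr, already_reduced


# Simulated validation accuracies over four epochs (made-up numbers).
lr, reduced = 1e-3, False
history = []
for val_acc in [0.80, 0.91, 0.94, 0.96]:
    lr, reduced = adapt_learning_rate(lr, val_acc, reduced)
    history.append(lr)
```

In this toy run the rate stays unchanged for the first two epochs and is divided by 10 in the third, when the accuracy first reaches the threshold.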

> hyperparameter_optimization.py
Will perform hyperparameter optimization (currently only for multiclass classification).
Usage: python hyperparameter_optimization.py [OPTIONS]

Options:
  -m, --model_type STRING     Can be "fc" (VGG-FC), "notop" (VGG-NoTop), "resnet" (ResNet), or "next" (ResNeXt)  [required]

  -p, --preprocessing_type STRING  Can be "preprocessed": tells the script to
                              load the dataset from the PREPROCESSED_PATH
                              (given in dataset.py), or "payload": tells the
                              script to load the dataset from the PAYLOAD_PATH
                              (given in dataset.py). (PATHS need to be modified
                              in dataset.py!)  [required]

  -s --s save_dir_best_result [NONE|PATH] Path to save the best resulting model.
                              [default: NONE]

  -o  [BOOLEAN]               If true oversampling is used.  [default: False]
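
The -o option enables oversampling. The scripts' actual oversampling strategy is not described here, so the following is only a minimal sketch of one common variant, random oversampling, in which minority classes are padded with randomly repeated samples until every class matches the largest one. The function name `random_oversample` and the `seed` parameter are assumptions for this example.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes are equal in size.

    Hypothetical helper for illustration; not the implementation used by
    train_model.py or hyperparameter_optimization.py.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: 6 benign samples versus 2 malware samples.
samples = ["b0", "b1", "b2", "b3", "b4", "b5", "m0", "m1"]
labels = ["benign"] * 6 + ["malware"] * 2
balanced_samples, balanced_labels = random_oversample(samples, labels)
counts = Counter(balanced_labels)
```

Oversampling of this kind would normally be applied to the training set only, so that duplicated samples do not leak into the test or validation sets.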