
Microsoft Malware Prediction

Wednesday, May 22, 2019, 8:03 PM

Abstract

This repository is a research journal for a Kaggle competition. By the time I started working on the problem, the competition had already finished.

The goal is to practice solving a data science problem with many variables (82 in this case) and to figure out which ones are the most important.

In this research, an XGBoost classifier was used to build a model that predicts whether or not a given computer will be infected by a virus. Then, the plot_importance method was used to extract the conditions most strongly associated with a PC being hit by malware.

Data Description

Dimensionality: 82 variables

Train dataset size: ~1M rows

Test dataset size:

Method

The ML method used in this exercise is a gradient-boosted decision tree classifier (XGBoost), with hyperparameters estimated using a grid search.
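As a sketch of what that search could look like (the grid values are assumptions built around the best parameters reported in the Results section, and X_train / y_train stand for the prepared feature matrix and labels):

```python
# A minimal grid-search sketch, assuming a recent xgboost and scikit-learn.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [6, 10],
    "min_child_weight": [1, 5],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.9, 1.0],
}

search = GridSearchCV(
    XGBClassifier(learning_rate=0.3, objective="binary:logistic",
                  eval_metric="auc", n_estimators=100),
    param_grid,
    scoring="roc_auc",  # optimize for AUC, the competition metric
    cv=3,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
```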

Data Preparation

I started the research by exploring the data I would be working with:

[figure: overview of the train dataset]

The train dataset contains 82 variables, which makes it hard to explore each of them visually or compare them side by side: dozens of "pair plots" would not be representative and would be difficult to compare with each other.

Instead, I roughly separated all features into "categorical" and "numerical" variables for a more manageable view, as sketched below.
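A minimal sketch of that split, assuming the train set is loaded into a pandas DataFrame (the file path is illustrative):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path to the competition train set

# Rough split by pandas dtype: object/category columns vs. numeric columns.
categorical = df.select_dtypes(include=["object", "category"]).columns.tolist()
numerical = df.select_dtypes(include=["number"]).columns.tolist()
print(len(categorical), "categorical,", len(numerical), "numerical")
```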

Here are the most unbalanced "categorical" variables:

[figure: distribution of the most unbalanced categorical variables]

And the top "numerical" variables (also filtered by percentage of missing values):

[figure: distribution of the top numerical variables]

Census_DeviceFamily, ProductName, OsVer, Platform, Census_FlightRing, Census_OSArchitecture, and Processor are very unbalanced (more than 90% of values fall into the largest bucket).

Among the categorical variables, MachineIdentifier is just a row identifier: it is useless for this analysis and can be dropped.

All heavily unbalanced variables and those with a high share of missing values were filtered out:

[figure: variables remaining after filtering]
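Continuing the sketch above, the filter could look like this (the 90% top-bucket and 50% missing-value thresholds are assumptions based on the description):

```python
# Drop the row identifier, then drop columns that are either mostly missing
# or heavily unbalanced (>90% of values in the largest bucket).
df = df.drop(columns=["MachineIdentifier"])

to_drop = []
for col in df.columns:
    if df[col].isna().mean() > 0.5:  # mostly missing (assumed threshold)
        to_drop.append(col)
    elif df[col].value_counts(normalize=True).iloc[0] > 0.9:  # heavily unbalanced
        to_drop.append(col)

df = df.drop(columns=to_drop)
```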

To fit the XGBoost classifier, all remaining variables were then transformed into categories.
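A sketch of one common way to do this with pandas category codes (HasDetections is the competition label; the exact encoding in the notebooks may differ):

```python
# Convert object columns to pandas categoricals, then to integer codes
# that XGBoost can consume; missing values become the code -1.
for col in df.select_dtypes(include=["object"]).columns:
    df[col] = df[col].astype("category").cat.codes

X_train = df.drop(columns=["HasDetections"])
y_train = df["HasDetections"]
```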

Results


A baseline score on the training dataset (using a dummy classifier) is AUC = 0.66.
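A sketch of how such a baseline could be computed (the DummyClassifier strategy is an assumption; the journal's exact baseline may have been built differently):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# Baseline: a classifier that ignores the features entirely.
dummy = DummyClassifier(strategy="stratified", random_state=0)
dummy.fit(X_train, y_train)
baseline_auc = roc_auc_score(y_train, dummy.predict_proba(X_train)[:, 1])
print(f"Baseline AUC: {baseline_auc:.2f}")
```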

Using grid search cross-validation, the following parameters were found:

Best params: {'max_depth': 10, 'min_child_weight': 5, 'subsample': 0.8, 'colsample_bytree': 0.9, 'eta': 0.3, 'objective': 'binary:logistic', 'eval_metric': 'auc'}
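A sketch of fitting the final model with these parameters while recording AUC after every boosting round, which is what produces the learning curve shown next (the 80/20 validation split is an assumption):

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train,
                                            test_size=0.2, random_state=0)

model = XGBClassifier(
    max_depth=10, min_child_weight=5, subsample=0.8, colsample_bytree=0.9,
    learning_rate=0.3,  # learning_rate is xgboost's canonical name for eta
    objective="binary:logistic", eval_metric="auc",
)
# eval_set records train/validation AUC after every boosting round.
model.fit(X_tr, y_tr, eval_set=[(X_tr, y_tr), (X_val, y_val)], verbose=False)
history = model.evals_result()  # e.g. history["validation_1"]["auc"]
```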

AUC learning curve:

[figure: AUC learning curve]

Training the final model with these parameters produced the following importance plot:

[figure: XGBoost feature importance plot]

According to the final model, the most important feature is AvSigVersion, which is Windows Defender state information (the antivirus signature version). This makes sense, since antivirus quality is usually a key factor in preventing infection.
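The plot itself comes from xgboost's built-in plot_importance; a minimal sketch:

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

# Rank features by how often they are used for splits across all trees.
plot_importance(model, max_num_features=20)
plt.tight_layout()
plt.show()
```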

Conclusion

During this exercise, I tested different approaches to working with XGBoost: using sparse and dense matrices, excluding unbalanced and sparsely populated variables, and performing feature selection with the XGBoost toolkit.

Even though I didn't achieve the original competition goal (evaluating the model on the test dataset), I found the results of my model meaningful and interpretable.