Welcome to the Malicious Executable Detection project! This repository applies machine learning and cluster analysis to detect malicious executable files.
In an era where cyber warfare is on the rise, detecting malicious code has become crucial. This project aims to develop a machine learning approach to identify malicious executable files.
The dataset contains features extracted from both malicious and non-malicious Windows executable files. It has 373 samples in total: 301 malicious and 72 non-malicious, so the classes are imbalanced. Each sample has 531 features, named F1, F2, and so on, plus a label column indicating whether the file is malicious or non-malicious.
- Imputation: Rows and columns in which more than 70% of the values are missing are removed.
- Feature Selection: Relevant features are chosen for analysis.
- Data Standardization: Features are standardized so that they are on a comparable scale for clustering.
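The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the `preprocess` helper, the `label` column name, the median fill for remaining gaps, and the use of zero variance as the feature-selection rule are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df, label_col="label", max_missing=0.70):
    """Hypothetical helper: drop sparse columns/rows, then standardize."""
    features = df.drop(columns=[label_col])
    # Drop columns in which more than 70% of the values are missing.
    features = features.loc[:, features.isna().mean(axis=0) <= max_missing]
    # Drop rows in which more than 70% of the remaining values are missing.
    features = features.loc[features.isna().mean(axis=1) <= max_missing]
    # Assumption: fill any remaining gaps with the column median.
    features = features.fillna(features.median())
    # Assumption: "feature selection" here only removes constant columns.
    features = features.loc[:, features.std(ddof=0) > 0]
    # Standardize each feature to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(features)
    X = pd.DataFrame(scaled, columns=features.columns, index=features.index)
    y = df.loc[features.index, label_col]
    return X, y


# Synthetic stand-in for the real dataset (5 features instead of 531).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(20, 5)), columns=["F1", "F2", "F3", "F4", "F5"])
demo.loc[:15, "F3"] = np.nan              # 16 of 20 values missing (80%)
demo.loc[0, ["F1", "F2", "F4"]] = np.nan  # row 0 becomes mostly empty
demo["F5"] = 1.0                          # constant, carries no information
demo["label"] = rng.integers(0, 2, size=20)

X, y = preprocess(demo)
print(X.shape, list(X.columns))
```

Fitting the scaler once and reusing it for new files keeps training data and later predictions on the same scale.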
K-Means clustering is applied to group similar instances together. The Silhouette method is used to determine the optimal number of clusters.
Silhouette analysis helps evaluate the quality of clustering. Silhouette scores range from -1 to 1, and a higher score indicates more cohesive, better-separated clusters.
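One way to pick the number of clusters with the silhouette method is to fit K-Means for several values of k and keep the one with the highest mean silhouette score. The sketch below does this on synthetic blobs; the `best_k_by_silhouette` helper, the candidate range of 2 to 8, and the demo data are assumptions for illustration, not the project's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values=range(2, 9), random_state=0):
    """Hypothetical helper: return the k with the highest mean silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        # Mean silhouette over all samples; defined only for 2 <= k < n_samples.
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic data: three well-separated groups standing in for real features.
centers = np.array([[-8.0, -8.0], [0.0, 8.0], [8.0, -8.0]])
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=42)

best_k, scores = best_k_by_silhouette(X_demo)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

On real, high-dimensional data the scores are usually much closer together, so it helps to inspect the whole curve rather than only the maximum.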
Cluster stability is assessed by comparing the clusters obtained from the full dataset with those obtained from random samples of it.
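A stability check along these lines might look like the sketch below. It clusters the full data once, re-clusters random subsamples, and compares the two labelings on the shared points. The adjusted Rand index, the 80% sample fraction, and the `stability_score` helper are assumptions chosen for the example; the notebooks may measure agreement differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def stability_score(X, k, sample_frac=0.8, n_rounds=10, random_state=0):
    """Hypothetical helper: mean agreement between full-data and subsample clusterings."""
    rng = np.random.default_rng(random_state)
    full_labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    n_samples = X.shape[0]
    size = int(sample_frac * n_samples)
    agreements = []
    for _ in range(n_rounds):
        # Draw a random subsample without replacement and cluster it on its own.
        idx = rng.choice(n_samples, size=size, replace=False)
        sub_labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X[idx])
        # The adjusted Rand index ignores label permutations; 1.0 means identical partitions.
        agreements.append(adjusted_rand_score(full_labels[idx], sub_labels))
    return float(np.mean(agreements))


# Synthetic, well-separated data, so the clustering should be very stable.
centers = np.array([[-8.0, -8.0], [0.0, 8.0], [8.0, -8.0]])
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=42)

stability = stability_score(X_demo, k=3)
print(f"mean adjusted Rand index: {stability:.3f}")
```

Values close to 1 suggest the clusters do not depend on which samples happened to be included; values near 0 suggest they do.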
The fitted model is used to assign new executable files to clusters.
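As a rough sketch of the prediction step, the scaler and the K-Means model can be wrapped in one scikit-learn pipeline so that new samples are scaled the same way as the training data. The two-cluster setup and the synthetic feature vectors below are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data standing in for the extracted executable features.
centers = np.array([[-5.0, -5.0], [5.0, 5.0]])
X_train, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

# Scaling and clustering are fitted together and reused together at prediction time.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
model.fit(X_train)

# Hypothetical feature vectors for two previously unseen files.
X_new = np.array([[-5.2, -4.8], [4.9, 5.3]])
new_clusters = model.predict(X_new)
print(new_clusters)
```

The cluster ids are arbitrary integers; mapping a cluster to "malicious" or "non-malicious" still requires comparing it with the labeled samples it contains.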
- Implementing cluster analysis in Python
- Pre-processing data for analysis
- Hierarchical clustering and dendrogram visualization
- Implementing K-Means clustering
- Determining the optimal number of clusters
- Cluster stability evaluation
- Predicting clusters for new samples
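For the hierarchical clustering and dendrogram topic in the list above, a minimal sketch with SciPy could look like this. Ward linkage, the cut into three clusters, and the synthetic data are assumptions for the example; `no_plot=True` returns the dendrogram layout as a dictionary without requiring a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
centers = np.array([[-8.0, -8.0], [0.0, 8.0], [8.0, -8.0]])
X_demo, _ = make_blobs(n_samples=60, centers=centers, cluster_std=0.5, random_state=1)

# Build the merge tree; each row of Z records one merge of two clusters.
Z = linkage(X_demo, method="ward")

# Cut the tree so that at most three flat clusters remain.
hier_labels = fcluster(Z, t=3, criterion="maxclust")

# Compute the dendrogram layout without drawing it.
tree = dendrogram(Z, no_plot=True)

print("merges:", Z.shape[0])
print("clusters found:", len(set(hier_labels)))
print("leaves in dendrogram:", len(tree["leaves"]))
```

Dropping `no_plot=True` draws the dendrogram with matplotlib, which is how it would typically be viewed in a notebook.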
Feel free to explore the notebooks and the code to dive deeper into the analysis!
You can also view this project on Kaggle.
Want to run the notebooks in Google Colab? Click here to open them directly!
Join our community and stay updated on our latest projects:
Happy coding!