Employee Attrition Analysis - An Imbalanced Classification Problem

GOAL: To prepare a model for the HR department to predict the Attrition. And, give insights from the data about the important factors associated with the attrition so that HR can take measures to control the attrition of employees.

Attrition is a metric that represents the percentage of employees who leave an organisation(either voluntarily or involuntarily) for reasons including resignation, termination, death or retirement. It is a key indicator used by HR department to monitor and improve their organisation's workforce planning and overall management.

Why Does Attrition Rate Matter?

Can have negative impact on your organization’s performance
Struggle to recruit new people
Hiring cost to replace employees
Impact those employees working with them that can result in adding more work to them. This can increase stress and impact organization’s performance

Imbalanced Classification

In this Classification problem, we try to predict the class label by studying the input data or predictor where the target(Attrition) feature is a categorical variable in nature. It is observed that the numbers of observations in class label 1(minority class) is significantly lower than other class label 0(majority class). And this type of dataset is called an imbalanced class dataset! Here, it is vital to identify the minority classes correctly.

Here we discuss, some approaches used to solve this imbalanced dataset problem. They are:
1. Resampling the training dataset - Using SMOTE(Oversampling) and RandomUnderSampling
  - After sampling the data we can get a balanced dataset for both majority and minority classes. So, when both classes have a similar number of records present in the dataset, we can assume that the classifier will give equal importance to both classes.
  - SMOTE: Synthetic Minority Oversampling Technique or SMOTE. This technique is used to oversample the minority class. In this case, SMOTE looks into minority class instances and uses k nearest neighbor to select a random nearest neighbor, and new instances are synthesized from the existing data.
  - RandomUnderSampling: Here, random rows from the majority class are deleted to match with the minority class.
2. Performance Metric
  - i) F1 score: harmonic mean of precision and recall
  - ii) Precision/Specificity: how many selected instances are relevant
  - iii) Recall/Sensitivity: how many relevant instances are selected
  - For Visualization,
  - iv) AUC: relation between true-positive rate and false positive rate
  - v) Confusion Matrix

Steps for Python implementation

1) EDA

 i) Import neccessary libraries
 
ii) Load the data. Check the datatypes. 

iii) Quality Analysis - Find Missing Value, Duplicates

iv) Correlation check, Heap Mat

 v) Visualization - Bivariate, Multivariate Analysis on Numerical and Categorical features

vi) Attrition Rate - Attrition Rate vs Non-Attrition Rate in various features

vii) KEY INSIGHTS

viii) Capping and Flooring of outliers

ix)  Zero Variance check

x)  Numerical and Categorical Feature Selection 

xi) Split the data into 2 dataframes 
     - Independent features dataframe - X
     - Target feature(Attrition) dataframe - Y

2) Model Building

1. Split the dataset into Train and Test dataset - x_train, y_train, x_test, y_test
1. 3 Algorithms are used - Decision Tree, Random Forest, GBM
1. 'Model_1' notebook - 3 algorithms are used without Resampling the data
With Resampling the training set,
1. 'Moddel_2_SMOTE' notebook - the SMOTE techinque is used to resample the the training data for the 3 algorithms
1. 'Model_3_Undersampling' notebook - the RandomUnderSampling is used to resample the the training data for the 3 algorithms
1. Performance Metrics used - F1 Score, Precision, Recall, AUC, Confusion Matrix

On comparison of all the techniques, we find that best model is 'GBM' with RandomUnderSampling(for training data) using performance metrics such Recall, f1score, Precision, Confusion matrix . We also show, the top 10 important features in the dataset.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
Analysis_SQL		Analysis_SQL
Data		Data
__pycache__		__pycache__
EDA.ipynb		EDA.ipynb
Model_1.ipynb		Model_1.ipynb
Model_2_SMOTE.ipynb		Model_2_SMOTE.ipynb
Model_3_Undersampling.ipynb		Model_3_Undersampling.ipynb
README.md		README.md
func_for_model.py		func_for_model.py
main.py		main.py
top_10_features.png		top_10_features.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Employee Attrition Analysis - An Imbalanced Classification Problem

GOAL: To prepare a model for the HR department to predict the Attrition. And, give insights from the data about the important factors associated with the attrition so that HR can take measures to control the attrition of employees.

Why Does Attrition Rate Matter?

Imbalanced Classification

Here we discuss, some approaches used to solve this imbalanced dataset problem. They are:

Steps for Python implementation

1) EDA

2) Model Building

On comparison of all the techniques, we find that best model is 'GBM' with RandomUnderSampling(for training data) using performance metrics such Recall, f1score, Precision, Confusion matrix . We also show, the top 10 important features in the dataset.

About

Releases

Packages

Languages

priya-explorer/Imbalanced_Classification

Folders and files

Latest commit

History

Repository files navigation

Employee Attrition Analysis - An Imbalanced Classification Problem

GOAL: To prepare a model for the HR department to predict the Attrition. And, give insights from the data about the important factors associated with the attrition so that HR can take measures to control the attrition of employees.

Why Does Attrition Rate Matter?

Imbalanced Classification

Here we discuss, some approaches used to solve this imbalanced dataset problem. They are:

Steps for Python implementation

1) EDA

2) Model Building

On comparison of all the techniques, we find that best model is 'GBM' with RandomUnderSampling(for training data) using performance metrics such Recall, f1score, Precision, Confusion matrix . We also show, the top 10 important features in the dataset.

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages