Traffic Violations

Tooling: Python, black (code style), isort (imports), ruff (linting), pre-commit

Authors: Mykyta Alekseiev, Elizaveta Barysheva, Joao Melo, Thomas Schneider, Harshit Shangari and Maria Stoelben

Description

The goal of this project is to predict a binary target with white- and black-box models and to evaluate the models' performance and fairness with respect to protected attributes, here gender and race. Moreover, the models' predictions are analysed with global, local and performance interpretability methods. This repository is a fork of a group project from the Data Science for Business Master's degree at HEC Paris.

Data

For this project, a dataset of traffic violations in Montgomery County, Maryland, USA was selected. You can download the data here. The .arff file should be placed in a data/ folder in the root of the repository.

The processed data contains 65,203 instances with 15 columns, of which 5 are categorical and the rest binary or numeric. The target column is Citation, which equals 1 when the officer issued a citation and 0 when only a warning was given.
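
For reference, a minimal sketch of loading the raw .arff into a pandas DataFrame, assuming the file is saved as data/traffic_violations.arff (the actual filename depends on the download):

from scipy.io import arff
import pandas as pd

# Load the raw .arff; the filename is an assumption, adjust as needed.
raw, meta = arff.loadarff("data/traffic_violations.arff")
df = pd.DataFrame(raw)

# scipy returns nominal attributes as bytes, so decode them to strings.
for col in df.select_dtypes(include=[object]).columns:
    df[col] = df[col].str.decode("utf-8")

print(df.shape)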

Setup

Create a virtual environment and install the requirements:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
pre-commit install

Data Preprocessing

Check out the Jupyter notebooks to understand the data and the preprocessing decisions.

To run the data preprocessing and get a data.csv output for the following parts, run:

python -m spacy download en_core_web_sm
python src/data_preprocessing/data_preprocessor.py
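
A quick way to sanity-check the preprocessing output (a sketch assuming data.csv is written to the data/ folder; adjust the path if your setup differs):

import pandas as pd

df = pd.read_csv("data/data.csv")
print(df.shape)                        # expected: (65203, 15), see the Data section
print(df["Citation"].value_counts())   # distribution of citations vs. warnings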

Modeling

The parameters can be changed in config/config_modeling.py. By default, the data is split into 60% training, 20% validation and 20% test.
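
The 60/20/20 split can be reproduced with two chained calls to scikit-learn's train_test_split, along these lines (a sketch, not the project's actual code; the random seed is an arbitrary choice):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/data.csv")
X = df.drop(columns=["Citation"])
y = df["Citation"]

# First carve off 40% for validation + test, then split that part in half,
# which yields 60% train, 20% validation and 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)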

Run the training with MLflow tracking with the following command:

python src/modeling/main.py
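
In essence, MLflow tracking wraps training along these lines (a minimal sketch, not the actual src/modeling/main.py; parameter names and values are illustrative):

import mlflow

with mlflow.start_run(run_name="example-model"):
    # Log the configuration used for the run.
    mlflow.log_param("train_fraction", 0.6)
    # ... fit the model and evaluate on the validation set ...
    val_auc = 0.0  # placeholder; computed from the validation predictions in practice
    mlflow.log_metric("val_auc", val_auc)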

Results

The model selection was performed on the validation data. The results for the white- and black-box models are shown below.

Model                 Train AUC   Val AUC   Test AUC   Test Accuracy   Test F1 Score
XGB                   0.898       0.866     0.860      0.778           0.748
Random Forest         0.873       0.849     0.843      0.764           0.728
Decision Tree         0.825       0.818     0.818      0.742           0.703
GAM                   0.805       0.814     0.805      0.730           0.705
Logistic Regression   0.645       0.652     0.641      0.600           0.559
ANN                   0.641       0.649     0.637      0.537           0.097
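
For reference, the reported metrics correspond to the standard scikit-learn implementations, roughly as follows (a sketch with toy data; in the project these are computed from the real test set):

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

# Toy labels and predicted probabilities, for illustration only.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_prob = np.array([0.3, 0.8, 0.6, 0.4, 0.9, 0.2])
y_pred = (y_prob >= 0.5).astype(int)    # threshold probabilities at 0.5

print(roc_auc_score(y_true, y_prob))    # AUC uses the raw probabilities
print(accuracy_score(y_true, y_pred))   # share of correct predictions
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall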

Explainability and Fairness

If you are interested in our conclusions about how the models work and whether they are fair to different protected attributes, see the explanation and fairness subfolders within the notebooks folder.
