Predictive-Analytics-on-USA-Census-dataset

Objective: The objective of this analysis is to predict income inequality among US citizens using various classification models. The study employs logistic regression, random forest, decision tree, gradient boosting, and k-nearest neighbors classification models to classify individuals into income categories based on the provided demographic and socioeconomic variables.

Summary: This analysis focuses on predicting income levels using a dataset obtained from the census. The study begins with data cleaning, where missing values and abnormal entries such as '?' marks are addressed. After data preparation, exploratory data analysis is conducted to understand the distributions of variables. Following this, different classification models are implemented, starting with logistic regression. The analysis compares the performance of logistic regression, random forest, decision tree, and gradient boosting models in terms of accuracy and feature importance.

Logistic regression revealed that variables such as age, education level, marital status, occupation, and gender significantly influence income levels. However, its accuracy was found to be moderate at 71%. Random forest classification yielded higher accuracy (87%) and identified similar important variables affecting income. Decision tree analysis provided acceptable accuracy (79%) and highlighted the importance of family status in predicting income levels. Gradient boosting achieved an accuracy rate of 85%, with variables like age, education, occupation, and family status being significant predictors.

Finally, the analysis implemented a k-nearest neighbors classification model, achieving an accuracy of 83%. The study concludes that the decision tree and gradient boosting models consistently identify family status as a key predictor of income levels. Overall, the project successfully predicts income inequality using various classification techniques in Python.

Introduction

This project works with a census dataset, linked in the References section. Its purpose is to use a nearest neighbors model on the provided data to build a classification model that shows which variables contribute to income differences among US citizens. We define the feature importance of the variables, examine the role each one plays, determine the value of K, and finally report the accuracy of our model. Before building the classification model, we first clean the data to make sure the dataset is ready for classification.

Data Cleaning

From the summary of the dataset, it contains 48,841 observations with 15 variables: '39', '77516', '13', '2174', '0', and '40' are quantitative, while 'State-gov', 'Bachelors', 'Never-married', 'Adm-clerical', 'Not-in-family', 'White', 'Male', 'United-States', and '<=50K' are categorical. (These unusual column names appear because the raw file has no header row, so the first record's values were read as the header; for example, '39' corresponds to age, 'State-gov' to work class, 'Bachelors' to education, 'Never-married' to marital status, 'Adm-clerical' to occupation, 'Not-in-family' to family status, and '<=50K' to income level.) The statistical summary of the numerical variables gives the count, min, max, mean, standard deviation, and quartiles. The counts show that the numerical variables have no missing values, and no null values were found anywhere in the dataset.

This alone does not guarantee clean data, however. We checked the frequency of every variable to find special characters or placeholder values, and identified '?' marks in the 'State-gov', 'Adm-clerical', and 'United-States' columns. The '?' mark accounts for 5.7% (2,799/48,841) of 'State-gov', 5.8% (2,809/48,841) of 'Adm-clerical', and 1.8% (857/48,841) of 'United-States', so we removed the rows containing it. Additionally, the '2174' and '0' columns are more than 50% zeros, which appear to stand in for missing values. A column that is more than half filler carries no essential information for the model, so we dropped both columns. After this cleaning, the dataset holds 45,221 observations with 13 variables.

For the exploratory analysis, we visualized each variable's distribution with histograms. The 'Private' value dominates the work class variable ('State-gov'), while every other value appears fewer than 5,000 times. The education variable ('Bachelors') is spread more evenly, with 'HS-grad', 'Some-college', and 'Bachelors' as its three most frequent values. In the marital status variable ('Never-married'), 'Married-civ-spouse' makes up almost half the column, while 'Married-AF-spouse' appears only 32 times out of 45,221. The occupation variable ('Adm-clerical') has 14 unique values, with 'Craft-repair', 'Prof-specialty', and 'Exec-managerial' taking the top three shares and 'Armed-Forces' and 'Priv-house-serv' the lowest. 'Husband' appears more than 17,500 times in the family status column ('Not-in-family'), and 'White' accounts for more than 85% of the race column. The gender column shows more males than females. The native country variable ('United-States') has 41 unique values, which makes it hard to read on a histogram; the United States covers 91% of the column, with Mexico, the Philippines, Germany, Puerto Rico, and other countries making up the rest. Finally, the income distribution in the '<=50K' column shows that incomes of '<=50K' outnumber '>50K' in the population.
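A minimal pandas sketch of these cleaning steps (the file name adult_census.csv is a hypothetical placeholder, and the pandas defaults are assumptions; column names follow this write-up):

```python
import pandas as pd

# Load the census data. The raw file has no header row, so pandas promotes
# the first record's values ('39', 'State-gov', ...) to column names,
# matching the variable names used throughout this write-up.
df = pd.read_csv("adult_census.csv")   # hypothetical file name
print(df.shape)                        # (48841, 15) per the write-up

# Frequency check: '?' placeholders (sometimes padded with spaces) appear
# in three categorical columns.
for col in ["State-gov", "Adm-clerical", "United-States"]:
    share = (df[col].str.strip() == "?").mean()
    print(f"{col}: {share:.1%} '?' values")

# Drop every row holding a '?' in any of those three columns.
has_question = df[["State-gov", "Adm-clerical", "United-States"]].apply(
    lambda s: s.str.strip() == "?").any(axis=1)
df = df[~has_question]

# Drop '2174' and '0': each is more than 50% zeros, i.e. near-empty filler.
df = df.drop(columns=["2174", "0"])
print(df.shape)                        # (45221, 13) after cleaning
```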

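The per-variable histograms described above could be reproduced along these lines (a sketch; the 3×3 grid and matplotlib styling are assumptions):

```python
import matplotlib.pyplot as plt

# df continues from the cleaning snippet above. One bar chart of value
# counts per categorical column, mirroring the distributions discussed here.
categorical = df.select_dtypes(include="object").columns  # 9 columns
fig, axes = plt.subplots(3, 3, figsize=(16, 12))
for ax, col in zip(axes.ravel(), categorical):
    df[col].value_counts().plot(kind="bar", ax=ax, title=col)
    ax.tick_params(axis="x", labelrotation=90)
plt.tight_layout()
plt.show()
```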
Analysis

Before the analysis and modeling, we preprocessed the data. First, we encoded the categorical variables as numerical values so they can be fed to the models. We then split the data frame into X and y, where y is the target variable '<=50K' and X contains the independent variables. After checking for high correlations between variables, we computed variance inflation factors (VIF) for the independent variables and found that '13', 'White', '40', and 'United-States' have high VIF values. A VIF above 10 indicates multicollinearity, which causes problems for the model, so we dropped these variables. We then standardized the remaining features with a standard scaler. Using SMOTE, we corrected the class imbalance in the target variable; the resampled dataset contains 68,026 observations with 8 predictor variables. Finally, we randomly split the data into training and testing sets with an 80/20 ratio.

Before the k-nearest neighbors model, we built logistic regression, random forest, decision tree, and gradient boosting classifiers and checked their accuracy and feature importance. Logistic regression identified '39', 'Bachelors', 'Adm-clerical', and 'Male' as having the strongest positive impact on the target variable '<=50K', with 'Never-married' and 'Not-in-family' acting as negative effectors; the feature importance coefficients of all variables can be seen in a bar plot. Logistic regression reached 71% accuracy, which is only moderately good. The random forest marked '39', '77516', 'Bachelors', 'Never-married', 'Adm-clerical', and 'Not-in-family' as its strongest predictors and achieved 87% accuracy, higher than logistic regression. The decision tree highlighted 'Not-in-family', 'Bachelors', and '39' as its most important features and reached an acceptable 79% accuracy. Gradient boosting achieved 85% accuracy, with '39', 'Bachelors', 'Adm-clerical', and 'Not-in-family' as positive contributors. In summary, the decision tree and gradient boosting models both rank 'Not-in-family' (family status) as a main predictor of the target variable.

Finally, we fit a k-nearest neighbors classification model with 8 neighbors, which reached about 83% accuracy, a good level for our model. We also evaluated accuracy over a range of K values on both the training and testing sets. The resulting plot shows training accuracy at 100% for K = 1, dropping sharply by K = 3 and fluctuating afterwards. Testing accuracy peaks at 85.5% with K = 1, falls to its minimum of about 82% at K = 2, and then follows an almost flat path, settling around 83%.
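A sketch of this preprocessing pipeline, assuming scikit-learn for encoding and scaling, statsmodels for VIF, and imbalanced-learn for SMOTE (the random seeds are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
from imblearn.over_sampling import SMOTE

# df continues from the snippets above. Encode categoricals as integers.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns=["<=50K"])
y = df["<=50K"]

# Variance inflation factor per predictor; values above 10 flag
# multicollinearity.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns)
print(vif.sort_values(ascending=False))

# Drop the high-VIF columns named in the write-up, then standardize.
X = X.drop(columns=["13", "White", "40", "United-States"])
X_scaled = StandardScaler().fit_transform(X)

# Rebalance the classes with SMOTE, then split 80/20.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_scaled, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42)
```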

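The four baseline models and the sweep over K could be reproduced roughly as follows (default hyperparameters and the K range 1–20 are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Train/test splits continue from the preprocessing snippet above.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy {model.score(X_test, y_test):.2%}")

# Sweep K and record train/test accuracy, as in the plot described above.
ks = range(1, 21)
train_acc, test_acc = [], []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))
    test_acc.append(knn.score(X_test, y_test))

plt.plot(ks, train_acc, label="training accuracy")
plt.plot(ks, test_acc, label="testing accuracy")
plt.xlabel("number of neighbors K")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```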
Conclusion

To conclude our income inequality project, we successfully achieved our assignment goals. Starting from a description of the provided dataset, we carried out an exploratory analysis to observe the distribution of each variable's values. We then applied data cleaning techniques to obtain a high-quality dataset, and built logistic regression, random forest, decision tree, gradient boosting, and k-nearest neighbors classification models to identify the variables with the strongest effect on the target variable and to measure each model's accuracy.

References

Notebook on viewer. Jupyter Notebook Viewer. (n.d.). Retrieved May 31, 2022, from https://nbviewer.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_1-Introduction-to-Pandas.ipynb

Brownlee, J. (2020, October 26). Imbalanced classification with the Adult Income Dataset. Machine Learning Mastery. Retrieved May 31, 2022, from https://machinelearningmastery.com/imbalanced-classification-with-the-adult-income-dataset/

Dataset: https://northeastern.instructure.com/courses/105751/files/14397132/download?download_frd=1
