Predict whether income exceeds 50k per year based on census data
We are using the Census Income (Adult) dataset here, with almost 50,000 instances and 14 features. It contains both categorical and numeric values for different features, and it also has missing values. For the full description of the dataset, please use the Dataset Link. We are going to do all our analysis in Python: observing the data, cleaning the data, transforming the data, and finally training a model on the data. This assumes you already have some Python basics and a working Python environment such as Jupyter. You need to install the libraries used below in your environment, which you can do with a simple pip command.
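For example, you can install everything in one go from a terminal (or from a Jupyter cell by prefixing the command with '!'):

pip install pandas scikit-learn matplotlib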
First we need to store the data file somewhere on the local machine, and from that path we load the CSV file into the Jupyter Notebook. For reading the file, we use the pandas library. We can also specify the delimiter type and header details if needed.
import pandas as pd

adult = pd.read_csv('adult.csv')  # sep= and header= can be passed here if the file needs them
adult.columns  # column names ('columns' is an attribute, not a method)
adult.values   # the underlying NumPy array ('values' is also an attribute)
adult.shape    # (number of rows, number of columns)
adult.head()   # first five rows
adult.tail()   # last five rows
We will see results like below.
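With the standard UCI version of this dataset, the header holds these fifteen columns; the exact spellings in your CSV may differ (the file used in this walkthrough spells some of them 'Age', 'marital_status' and so on):

['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']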
From the above analysis, we see that 'education' and 'education_num' are interrelated (one is just a numeric encoding of the other), so we can keep only one of them; using both might lead to a multicollinearity problem. The field 'fnlwgt' (the census sampling weight) is not informative for prediction, so we can remove it too.
So our next step is to drop those columns. This can be done in several ways: we can explicitly list all the fields we want to keep, or drop the two fields directly.
mydata = adult.iloc[:, [0, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]]  # keep everything except columns 2 and 3
mydata
# equivalently: mydata = adult.drop(columns=['fnlwgt', 'education'])
mydata.isnull().any()
mydata.isnull().sum()
mydata[pd.isnull(mydata['Age'])] # to check only in Age column.
These checks report no NaN values. That is because, in this dataset, missing entries are stored as the string ' ?' (with a leading space) rather than as NaN. If a column such as 'Age' did contain NaNs, we could fill them with a sensible default, for example the median:

mydata['Age'].fillna(mydata['Age'].median(), inplace=True)  # only needed if 'Age' had NaN values

To locate the ' ?' markers, we scan every column instead:
for each_column in mydata.columns:
    if ' ?' in mydata[each_column].values:
        print(each_column)
This prints three columns: 'workclass', 'occupation' and 'native-country'. We also want to see how many such ' ?' rows exist in each of these columns. Handling missing values step by step like this is one of the most important parts of any pre-processing.
mydata.loc[mydata['workclass']==' ?'].shape #1836
mydata.loc[mydata['occupation']==' ?'].shape #1843
mydata.loc[mydata['native-country']==' ?'].shape #583
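All three counts can also be read off in one pass with an element-wise comparison; a minimal sketch:

(mydata == ' ?').sum()  # per-column count of the ' ?' marker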
Now, what do we do with these rows? These are not numerical values that we can replace with a mean or median; they are important categorical values. Looking at the data size, removing roughly 1,800 rows out of around 50K in total is a small loss (under 4%), and it is safer than filling the gaps with made-up categories that could mislead the model. So let's remove them.
final_data = mydata[mydata.workclass != ' ?']
final_data = final_data[final_data.occupation != ' ?']
final_data = final_data[final_data['native-country'] != ' ?']  # drop the ' ?' rows in the third flagged column as well
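As a quick sanity check, compare the row counts before and after the filtering:

print(mydata.shape, final_data.shape)  # the difference is the number of ' ?' rows removed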
Machine-learning models in scikit-learn need numeric inputs, so the categorical columns must be encoded as integers. scikit-learn provides the 'LabelEncoder' class to do that for us, so we transform each column one by one.
from sklearn.preprocessing import LabelEncoder

final_data['workclass'] = LabelEncoder().fit_transform(final_data['workclass'])
final_data['marital_status'] = LabelEncoder().fit_transform(final_data['marital_status'])
final_data['occupation'] = LabelEncoder().fit_transform(final_data['occupation'])
final_data['relationship'] = LabelEncoder().fit_transform(final_data['relationship'])
final_data['race'] = LabelEncoder().fit_transform(final_data['race'])
final_data['sex'] = LabelEncoder().fit_transform(final_data['sex'])
final_data['native-country'] = LabelEncoder().fit_transform(final_data['native-country'])
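The seven lines above can also be written as a loop over every remaining string column; a minimal sketch (if run instead of the lines above, it will also encode the income label column, which is harmless for what follows):

for col in final_data.select_dtypes(include='object').columns:
    final_data[col] = LabelEncoder().fit_transform(final_data[col])

Note that 'LabelEncoder' is really designed for target labels; for features, scikit-learn's 'OrdinalEncoder' or one-hot encoding is usually the more principled choice.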
x = final_data.iloc[:, :-1]  # features: every column except the last
y = final_data.iloc[:, -1]   # target: the last column (the income label)
The next step is splitting. A common practice is to use 80% of the data for training the model and to test it on the remaining 20%, which is called the test data. This split can be applied randomly with the scikit-learn function 'train_test_split'.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)  # pass random_state=... for a reproducible split
Our first model is Logistic Regression, a common baseline for binary classification problems like this one.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)  # a higher max_iter helps the solver converge on these unscaled features
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
y_pred is the predicted outcome on the test data we kept aside. We compare y_pred with the y_test values and check how much of the data matches; this is how we estimate the accuracy of the model. We have measures such as 'accuracy_score', 'confusion_matrix' and 'classification_report' to evaluate the model, all available from the 'metrics' module of scikit-learn.
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test,y_pred)
print(cnf_matrix)
print("Accuracy:", metrics.accuracy_score(y_test,y_pred))
Out: [[4311 274]
[ 867 692]]
Accuracy: 0.8142903645833334
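The 'classification_report' mentioned above adds per-class precision, recall and F1 score to this picture; a minimal sketch using the same 'metrics' module:

print(metrics.classification_report(y_test, y_pred))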
An accuracy of about 81% is a decent start. The model can also be assessed graphically with the ROC curve; the area under it (AUC) tells us how well the model separates the two classes. To plot the graph we use the 'matplotlib' library.
import matplotlib.pyplot as plt

y_pred_proba = logreg.predict_proba(x_test)[:, 1]  # probability of the positive (>50K) class
# pos_label is needed in case y_test still holds string labels
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba, pos_label=logreg.classes_[1])
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="test data, auc=" + str(auc))
plt.legend(loc=1)
plt.show()
We will see the output as below. Next, let's try a second classifier, Gaussian Naive Bayes, and compare its performance.
from sklearn.naive_bayes import GaussianNB
# create a Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(x_train, y_train)
y_pred = gnb.predict(x_test)
print('Accuracy thru Gaussian Naive Bayes ', metrics.accuracy_score(y_test, y_pred))
Out: Accuracy thru Gaussian Naive Bayes 0.7859700520833334
Now the roc_curve:
y_pred_proba = gnb.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba, pos_label=gnb.classes_[1])
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="test data, auc=" + str(auc))
plt.legend(loc=1)
plt.show()
We will get the curve as below.
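To wrap up, the two classifiers can be summarized side by side. This is a small sketch assuming 'logreg' and 'gnb' are the fitted models from above:

# compare accuracy and AUC for both fitted models
for name, model in [('Logistic Regression', logreg), ('Gaussian Naive Bayes', gnb)]:
    proba = model.predict_proba(x_test)[:, 1]
    print(name,
          'accuracy:', round(metrics.accuracy_score(y_test, model.predict(x_test)), 3),
          'auc:', round(metrics.roc_auc_score(y_test, proba), 3))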
Thank you...