This machine learning project predicts whether an individual will have a stroke depending on certain health characteristics.

blythekelly/ML_strokes

Problem

The problem I am trying to solve in this experiment is finding a machine learning algorithm that accurately predicts whether an individual will have a stroke based on certain health characteristics. These characteristics include age, gender, marital status, smoking habits, and the presence of hypertension or heart disease, all of which are provided in the Kaggle dataset I am using. I will use these columns as the predictor variables, and my target variable is stroke, which is either 0 or 1. Because the target is binary, this is a classification problem, so I will use perceptron, multilayer perceptron, and stochastic gradient descent models. I will compare these models using accuracy and confusion matrices, looking for the model that produces the fewest false negatives and false positives.
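The comparison described above can be sketched with scikit-learn. This is only an illustration under stated assumptions: the data is a synthetic, imbalanced stand-in generated with `make_classification` (not the Kaggle stroke data), the models use default settings apart from `max_iter` and `random_state`, and the scaling step and train/test split are my own choices rather than the project's.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stroke data (assumption): roughly 10% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The three model families named in the text.
models = {
    "perceptron": Perceptron(random_state=0),
    "mlp": MLPClassifier(max_iter=500, random_state=0),
    "sgd": SGDClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # Scale features first; all three models are sensitive to feature scale.
    clf = make_pipeline(StandardScaler(), model)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, fn={fn}, fp={fp}")
```

Reporting the false negative and false positive counts next to accuracy makes it easier to see when a model with high accuracy is still missing positive cases.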

Data Preparation

To prepare my data, I checked every column for null values and found that only one column, BMI, contained any. To decide whether to drop or replace them, I counted the missing values and found 201. I replaced these with the mean BMI, because dropping 201 entries would have removed a significant chunk of the dataset. I then confirmed that the dataset no longer contained any null values. Next, I used indicator variables to replace multi-category columns such as work type and smoking status. I replaced the other categorical columns, including gender, marital status, and residence type, with integer codes (0, 1, or 2). Once this cleaning was complete, I confirmed that the data was ready for use by checking the data type of each column; all were either float or int.
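The cleaning steps above could look roughly like the following pandas sketch. The column names, category labels, integer mappings, and sample rows are all assumptions made for illustration; the project's actual code may differ.

```python
import pandas as pd

# Tiny made-up sample; column names and values are assumed, not project data.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
    "residence_type": ["Urban", "Rural", "Urban", "Rural"],
    "work_type": ["Private", "Self-employed", "Private", "Govt_job"],
    "smoking_status": ["smokes", "never smoked", "Unknown", "formerly smoked"],
    "bmi": [28.0, None, 31.0, 25.0],
    "stroke": [0, 1, 0, 0],
})

# 1. Count nulls per column to see which columns need attention.
print(df.isnull().sum())

# 2. Replace missing BMI values with the column mean.
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# 3. Encode low-cardinality categorical columns as integer codes
#    (these particular mappings are hypothetical).
df["gender"] = df["gender"].map({"Male": 0, "Female": 1, "Other": 2})
df["ever_married"] = df["ever_married"].map({"No": 0, "Yes": 1})
df["residence_type"] = df["residence_type"].map({"Rural": 0, "Urban": 1})

# 4. Expand multi-category columns into indicator (one-hot) columns.
df = pd.get_dummies(df, columns=["work_type", "smoking_status"], dtype=int)

# 5. Confirm no nulls remain and every column is numeric.
print(df.isnull().sum().sum())
print(df.dtypes)
```

Mean imputation keeps every row, at the cost of slightly understating the spread of the BMI column.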

Analysis

Through this experiment, I found that the untuned multilayer perceptron was the best model for this stroke dataset: it reached 93.5% accuracy and produced the fewest false negatives (51). I believe this is because the data may not be linearly separable, and the multilayer perceptron can learn non-linear decision boundaries. Based on this, I recommend tuning other parameters while keeping the default activation function, since changing the activation function produced models with more false negatives. My other two models, the perceptron and stochastic gradient descent, also performed well, with accuracy rates of 94% and 92% respectively. However, both produced more false negatives than the multilayer perceptron, and false negatives should be weighted more heavily when evaluating a diagnosis-related prediction model.
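To illustrate why false negatives can matter more than raw accuracy, here is a small sketch that derives accuracy and recall from confusion-matrix counts. The counts are hypothetical and chosen only to make the point; they are not results from this project, and `summarize` is an illustrative helper name.

```python
def summarize(tn, fp, fn, tp):
    """Compute accuracy and recall from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Recall: the share of actual positives the model caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "recall": recall, "false_negatives": fn}


# Hypothetical model A: predicts "no stroke" for everyone.
always_negative = summarize(tn=950, fp=0, fn=50, tp=0)

# Hypothetical model B: lower accuracy, but it misses fewer positive cases.
cautious = summarize(tn=900, fp=50, fn=20, tp=30)

print(always_negative)
print(cautious)
```

In this made-up case, model A has the higher accuracy but misses every positive case, while model B trades some accuracy for fewer false negatives. That is the kind of trade-off the confusion matrix exposes and accuracy alone hides.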
