mishrakushal/news-article-classifier

News Article Classification Model

A machine learning model that classifies news articles into categories. It uses TF-IDF to vectorise the input text and identify important words. The words in each article are mapped to features: the full vocabulary across all articles, with each word weighted by how often it occurs within a document and in how many documents it appears.

TF-IDF

  • Term frequency is the number of occurrences of a specific term in a document; it indicates how important that word is within the document. TF-IDF uses it to transform the text into a usable vector.

  • Document frequency is the number of documents containing a specific term; it indicates how common the word is across the corpus. TF-IDF down-weights terms with a high document frequency.
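As a rough illustration of how the two quantities combine, here is a minimal sketch in plain Python. It assumes whitespace tokenisation and the unsmoothed variant idf = log(N / df); the function name `tf_idf` and the toy documents are hypothetical, and library implementations usually add smoothing and normalisation on top of this.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumption: weight = tf * log(N / df), with no smoothing or normalisation.
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))

    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)  # term frequency: raw count within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus, not taken from the project's data.
docs = [
    "the match ended in a draw",
    "the election ended in a recount",
    "the striker scored in the match",
]
scores = tf_idf(docs)
```

A word such as "the", which appears in every document, gets a weight of zero, while a word that appears in only one document, such as "draw", gets the largest weight per occurrence.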

For each category, we perform a chi-squared test to measure how relevant each word is to that category. This surfaces the "key" words whose presence strongly indicates the class of the entire article.

The unigrams array stores single-word features, sorted in increasing order of their chi-squared statistic.

The bigrams array stores two-word features, sorted in increasing order of their chi-squared statistic.

We run t-SNE on only a subset of our data (in our case, 30%) because it is computationally expensive.
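The sampling step might look like the sketch below, assuming scikit-learn's `TSNE` and a two-dimensional projection. The random matrix is a hypothetical stand-in for the real feature matrix, and the names `SAMPLE_FRACTION` and `projected` are illustrative only.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical dense stand-in for the TF-IDF feature matrix (200 articles, 50 features).
features = rng.random((200, 50))

SAMPLE_FRACTION = 0.3  # the 30% mentioned above
sample_size = round(len(features) * SAMPLE_FRACTION)

# Draw row indices without replacement so no article is picked twice.
indices = rng.choice(len(features), size=sample_size, replace=False)

# Project only the sampled rows down to two dimensions.
projected = TSNE(n_components=2, random_state=0).fit_transform(features[indices])
```

Keeping `indices` around makes it possible to look up the category label of each projected point later, for example when colouring a scatter plot.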

Cross-validation is performed across 5 different models:

  1. Logistic Regression
  2. Naive Bayes
  3. Random Forest Classifier
  4. Decision Tree Classifier
  5. K-Nearest Neighbours

We selected the model with the best mean cross-validation score, which, in this case, was Logistic Regression.
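A minimal sketch of this comparison, assuming scikit-learn's `cross_val_score` with 5 folds and accuracy as the metric. The synthetic data stands in for the real TF-IDF features, the hyperparameters are defaults chosen for illustration, Naive Bayes is assumed to be the multinomial variant, and the random forest is assumed to be a classifier since the task is classification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the TF-IDF features and category labels (hypothetical).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X = X - X.min()  # shift to non-negative values, as MultinomialNB requires

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbours": KNeighborsClassifier(),
}

# Mean accuracy over 5 cross-validation folds for each model.
mean_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

# Pick the model with the highest mean score.
best_model = max(mean_scores, key=mean_scores.get)
```

On this synthetic data the winner may differ from the one reported above; the sketch only shows the selection procedure.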

Classifiers Tested

  • Logistic Regression (used for final model)
  • Naive Bayes
  • Random Forest Classifier
  • Decision Tree Classifier
  • K-Nearest Neighbours

Libraries Used


Resources