ChiWang03/Text-Classification

Capstone project at UBC in collaboration with the Centre for Population Health Data at Statistics Canada

Our goal was to create a website/dashboard based on a given food and nutrition thematic data model. This data model contains five main categories: Food Supply, Distribution, Consumption, Utilization, and Health Outcomes. The first step of the process is to sort data products (data sets, publications, surveys) into the appropriate categories by building a text classification model. This repository contains most of the code for the data collection, the text classification, and the Plotly dashboards used in the website.

The Website Link

The dataset folder contains the Training Data, the New Data (to be classified), and the Classified New Data.

  • Note: All data is saved as CSV files for easy use in Python.

Training Data:

  • Combined_Links.csv contains the 417-record data set that we categorized manually. This data set serves as the training data for all the text classifiers used in the notebooks. Its columns are: Title, Description, Link, Category, Subcategory, Source
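
As a minimal sketch of reading the training data, the snippet below loads a CSV with the columns listed above and checks the header before use. The function name `load_training_rows` and the one-row sample are hypothetical and only stand in for Combined_Links.csv; the notebooks in this repository may load the file differently.

```python
import csv
import io

# Column names as listed in this README for Combined_Links.csv.
EXPECTED_COLUMNS = ["Title", "Description", "Link", "Category", "Subcategory", "Source"]


def load_training_rows(handle):
    """Read labelled records from an open CSV file and check the header."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Made-up one-row sample standing in for the real file.
sample = io.StringIO(
    "Title,Description,Link,Category,Subcategory,Source\n"
    "Example title,Example description,https://example.org,"
    "Consumption,Example subcategory,Example source\n"
)
rows = load_training_rows(sample)
print(len(rows), rows[0]["Category"])
```

To read the real file, pass a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to wherever Combined_Links.csv sits in your checkout.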

New data (To be classified):

  • New Data.csv is the scraped, unlabeled data set that will be categorized using the ensemble classifiers.

Classified New Data:

  • Contains the categorized versions of the New Data mentioned above. combined.csv was produced by merging the predicted classes of all the base classifiers and dropping duplicates. This method is not recommended and was used only for our own research purposes.
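
One plausible reading of the merge-and-drop-duplicates method is sketched below: the (record, predicted category) pairs from every base classifier are pooled and exact duplicates are removed. The function name `merge_predictions`, the link identifiers, and the two toy classifiers are all hypothetical; this is an illustration of the idea, not the code that produced combined.csv.

```python
def merge_predictions(*prediction_sets):
    """Pool (link, category) pairs from several classifiers, dropping exact duplicates."""
    seen = set()
    merged = []
    for predictions in prediction_sets:
        for link, category in predictions:
            if (link, category) not in seen:
                seen.add((link, category))
                merged.append((link, category))
    return merged


# Toy predictions from two hypothetical base classifiers.
clf_a = [("link-1", "Food Supply"), ("link-2", "Consumption")]
clf_b = [("link-1", "Food Supply"), ("link-2", "Distribution")]

merged = merge_predictions(clf_a, clf_b)
for link, category in merged:
    print(link, category)
```

Under this reading, a record on which the classifiers disagree (`link-2` here) ends up listed under more than one category, which may be one reason the method is not recommended.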

  • Bestcombined.csv is the data set classified by the best ensemble classifier. It contains 1376 records in total: 1033 in Supply, 152 in Consumption, 88 in Health Outcomes, 44 in Distribution, and 59 in Utilization. When we examined the data set for accuracy, we found some minor errors in the 1033-record Supply category: it includes data products that are about supply in a general sense, such as GDP, imports, and exports, but are not relevant to food and nutrition. This may be improved in the future.
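
The class distribution reported above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in this README (it does not read Bestcombined.csv) and shows that Supply makes up roughly three quarters of the classified records, which is worth keeping in mind when reviewing that category.

```python
# Counts as reported in this README for Bestcombined.csv.
category_counts = {
    "Supply": 1033,
    "Consumption": 152,
    "Health Outcomes": 88,
    "Distribution": 44,
    "Utilization": 59,
}

total = sum(category_counts.values())
shares = {name: count / total for name, count in category_counts.items()}

# Print the categories from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<16}{category_counts[name]:>6}{share:>8.1%}")
print(f"{'Total':<16}{total:>6}")
```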

The code folder contains most of our code for our classifiers and web scraper

  • Classifier
    • contains all the code for the classifier
  • Webscraper
    • contains the webscraper notebook using BeautifulSoup
  • Visualizations
    • contains the Plotly Dash code used to build the summary statistic visualizations

Note: Please insert your own Mapbox access token for the geolocation plots. You can get one by setting up a Mapbox account.
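
One way to supply the token without hard-coding it in a notebook is to read it from an environment variable, as in the sketch below. The variable name `MAPBOX_ACCESS_TOKEN` and the helper `get_mapbox_token` are assumptions made for illustration; the visualization code in this repository may expect the token to be pasted in directly.

```python
import os


def get_mapbox_token(env_var="MAPBOX_ACCESS_TOKEN"):
    """Return the Mapbox token stored in the given environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {env_var} to your Mapbox access token before building the map plots."
        )
    return token


# Example use with Plotly Express (requires plotly to be installed):
#   import plotly.express as px
#   px.set_mapbox_access_token(get_mapbox_token())
```

Keeping the token out of the source also avoids committing it to the repository by accident.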