
Hashtag multilabel classification with machine learning and deep learning models


LucyLi2021/Hashtag-recommendation-for-twitter-data


Hashtag-recommendation-for-twitter-data

Recommendations simulator for qualitative testing given mock personas fitting model requirements.

Required Packages

  • Python 3.8
  • PySpark 3.3.1
  • PyTorch 1.13.0
  • tweepy 4.12.1
  • huggingface-hub 0.11.0

Installing

Create a fresh virtual environment with the same Python version as the models you want to test, for example:
conda create -n twitter_hashtag38 python=3.8
conda activate twitter_hashtag38

Then run:
pip install -r requirements.txt
To set up Spark correctly, you may need to set environment variables:
PYTHONPATH="PATH_TO_SPARK_PYTHON"
SPARK_HOME="PATH_TO_SPARK"
PYSPARK_PYTHON="PATH_TO_ENV_PYTHON"
PYSPARK_DRIVER_PYTHON="PATH_TO_ENV_PYTHON"
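
If exporting these variables in the shell is inconvenient, they can also be set from Python before any Spark session is created. The following is a minimal sketch, not part of the repository: the function name configure_spark_env and its parameters are hypothetical, and the three paths stand in for the placeholders above.

```python
import os


def configure_spark_env(spark_home: str, spark_python: str, env_python: str) -> dict:
    """Set the four Spark-related variables for this process and its children.

    spark_home   -> PATH_TO_SPARK
    spark_python -> PATH_TO_SPARK_PYTHON
    env_python   -> PATH_TO_ENV_PYTHON (used for both driver and workers)
    """
    settings = {
        "SPARK_HOME": spark_home,
        # PYTHONPATH set here is only seen by child processes; it does not
        # change sys.path of the interpreter that is already running.
        "PYTHONPATH": spark_python,
        "PYSPARK_PYTHON": env_python,
        "PYSPARK_DRIVER_PYTHON": env_python,
    }
    os.environ.update(settings)
    return settings
```

Call it once at the top of a script, before importing or starting Spark, with the real paths for your machine.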

Updating Data

Run data_utils.py to collect and prepare the data:

  1. Data Collection:
    • You MUST have your own Twitter API BEARER_TOKEN, saved to src/main/data/tweepy_token/BEARER_TOKEN.json
  2. Data Preparation:
    • Run data_utils.py to produce three cleaned datasets: one with 200 hashtags, one with 50 hashtags, and one with 50 hashtags for the non-DL models
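
The layout of BEARER_TOKEN.json is not documented here, so the sketch below is an assumption rather than what data_utils.py actually does. It accepts either a bare JSON string or an object with a "BEARER_TOKEN" key; the helper name load_bearer_token is hypothetical.

```python
import json
from pathlib import Path

# Path taken from the Data Collection step above.
TOKEN_PATH = Path("src/main/data/tweepy_token/BEARER_TOKEN.json")


def load_bearer_token(path: Path = TOKEN_PATH, key: str = "BEARER_TOKEN") -> str:
    """Read the bearer token from a JSON file (assumed layout, see lead-in)."""
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)

    if isinstance(payload, str):
        token = payload                 # file holds just "the-token"
    elif isinstance(payload, dict):
        token = payload.get(key, "")    # file holds {"BEARER_TOKEN": "the-token"}
    else:
        token = ""

    if not token:
        raise ValueError(f"No bearer token found in {path}")
    return token
```

The returned string can then be passed to tweepy, for example tweepy.Client(bearer_token=load_bearer_token()).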

Updating Models

Models must be added to the src/main folder. It currently contains lstm.py, resnet.py, bert.py, fasttext.py, and tfidf_logistic.py.
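
Because the task is framed as multilabel classification, a new model will most likely need one target vector per tweet with a slot for each retained hashtag. The sketch below shows one common way to build such multi-hot targets. It is an illustration only: the function names build_vocab and multi_hot are hypothetical, and keeping the top_k most frequent hashtags is an assumption about how the 50 and 200 hashtag sets are chosen.

```python
from collections import Counter


def build_vocab(tag_lists, top_k):
    """Map the top_k most frequent hashtags to column indices."""
    # Count each hashtag once per tweet.
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tag: idx for idx, (tag, _) in enumerate(ranked[:top_k])}


def multi_hot(tags, vocab):
    """Encode one tweet's hashtags as a 0/1 vector over the vocabulary."""
    vec = [0] * len(vocab)
    for tag in tags:
        idx = vocab.get(tag)
        if idx is not None:  # hashtags outside the vocabulary are ignored
            vec[idx] = 1
    return vec
```

For example, with top_k=50 every tweet becomes a length-50 vector that can be paired with a per-label sigmoid output or a one-vs-rest classifier.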

Model Training/Evaluation/Prediction

Simply run main.py