This repository contains a pipeline for topic modeling built on BERTopic, a Python library that creates topic models from BERT embeddings.
Note: This is my first attempt at creating a class-based pipeline. If you have suggestions or best practices to share, please let me know! I'd greatly appreciate any feedback 😊
- Preprocessing: Clean and sample Twitter data from a dataframe.
- Tokenization: Tokenize the documents and build a vocabulary.
- Embedding Generation: Encode the documents using SentenceTransformer. You can also load previously generated embeddings to speed up experiments.
- Dimensionality Reduction: Reduce the dimensionality of the embeddings using UMAP. Previously reduced embeddings can be loaded for faster processing.
- Clustering: Cluster the reduced embeddings using HDBSCAN. You can also load previously generated clusters to speed up experiments.
- Visualization: Visualize the clusters and generate word clouds for each topic.
Each time an instance is created, the pipeline saves all outputs (including the cluster plot and topics overview) to a unique folder. Under the hood, the steps above map onto the standard BERTopic building blocks, as the sketch below illustrates.
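A minimal sketch of that sequence using library defaults (the documents and parameter values are placeholders, not the pipeline's actual settings; the model name matches the example configuration further below):

```python
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

docs = [f"placeholder tweet {i}" for i in range(100)]  # stand-in documents

# 1. Encode documents into dense sentence embeddings
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# 2. Reduce the embedding dimensionality with UMAP
reduced = UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(embeddings)

# 3. Cluster the reduced embeddings with HDBSCAN (label -1 marks outliers)
labels = HDBSCAN(min_cluster_size=5).fit_predict(reduced)
```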
- Dimensionality: A dummy reducer class that lets BERTopic skip its dimensionality-reduction step during model fitting when precomputed reduced embeddings are supplied (see the sketch after this list).
- Plots: A class to visualize topic modeling related information.
- Preprocessor: A pipeline to clean and sample Twitter data from a dataframe.
- TopicModelPipeline: The main class that integrates all the functionalities and provides an end-to-end pipeline for topic modeling.
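Such a dummy reducer typically follows the pattern from the BERTopic documentation; a minimal sketch (the class in this repository may differ in detail):

```python
import numpy as np

class Dimensionality:
    """Hands BERTopic precomputed reduced embeddings so that the
    dimensionality-reduction step is skipped during fit."""

    def __init__(self, reduced_embeddings: np.ndarray):
        self.reduced_embeddings = reduced_embeddings

    def fit(self, X):
        # Nothing to learn; the reduction was done beforehand.
        return self

    def transform(self, X):
        return self.reduced_embeddings
```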
- Import:

```python
from src.topic_model import TopicModelPipeline
```
1.5 Optional: Override the class method for cleaning tweets

```python
from src.preprocessor import Preprocessor

def new_clean_text_columns(self):
    """
    Overrides the clean_text_columns method from Preprocessor.
    """
    self.df[self.text_column] = "Example"

Preprocessor.clean_text_columns = new_clean_text_columns
```
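A more realistic replacement might strip URLs and @mentions before modeling. The cleaning logic below is a hypothetical example; `df` and `text_column` are the `Preprocessor` attributes used above:

```python
from src.preprocessor import Preprocessor

def new_clean_text_columns(self):
    """Custom cleaning: remove URLs and @mentions, collapse whitespace."""
    self.df[self.text_column] = (
        self.df[self.text_column]
        .str.replace(r"https?://\S+", "", regex=True)  # drop URLs
        .str.replace(r"@\w+", "", regex=True)          # drop @mentions
        .str.replace(r"\s+", " ", regex=True)          # collapse whitespace
        .str.strip()
    )

Preprocessor.clean_text_columns = new_clean_text_columns
```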
- Initialization:

```python
pipeline = TopicModelPipeline(
    project_name="Your_Project_Name",
    output_path="path/to/save/output",
    documents_path="path/to/documents",
    file_type="parquet",
    model="all-MiniLM-L6-v2",
    time_column="timestamp_column_name",
    text_column="text_column_name",
    sample=True,
    sample_frequency="daily",
    tresh_absolut=200,
    clean_text=True,
)
```
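As noted above, each instantiation creates a new, uniquely named folder under `output_path`, so repeated runs do not overwrite each other's plots and topic overviews.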
- Visualize Clusters:

```python
pipeline.plot_raw_clusters()
pipeline.plot_top_clusters()
```

- Visualize Word Clouds:

```python
pipeline.plot_wordclouds()
```
Everything is tracked with MLflow.
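Roughly, the tracking follows the standard MLflow pattern sketched below (the parameter keys and artifact paths are hypothetical, not the pipeline's actual names):

```python
import mlflow

with mlflow.start_run(run_name="Your_Project_Name"):
    # Hypothetical keys/paths for illustration; the pipeline defines its own.
    mlflow.log_param("embedding_model", "all-MiniLM-L6-v2")
    mlflow.log_param("sample_frequency", "daily")
    mlflow.log_artifact("path/to/save/output/topics_overview.csv")
```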
For GPU acceleration, if a RAPIDS cuML installation is possible, swap the UMAP and HDBSCAN imports for their cuML counterparts:

```python
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
```
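A common pattern (a sketch, not part of this repository) is to fall back to the CPU implementations when cuML is not installed:

```python
try:
    # GPU-accelerated implementations from RAPIDS cuML
    from cuml.manifold import UMAP
    from cuml.cluster import HDBSCAN
except ImportError:
    # CPU fallbacks with largely compatible APIs
    from umap import UMAP
    from hdbscan import HDBSCAN
```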
```
pip install matplotlib seaborn wordcloud adjustText pandas numpy nltk scikit-learn sentence-transformers bertopic torch umap-learn hdbscan transformers huggingface_hub joblib tqdm swifter
```