This project performs batch sentiment analysis on Twitter data using the ntscraper library, a Hadoop cluster (HDFS), and PySpark.
Twitter is a valuable source of data for sentiment analysis. This project analyzes tweets in batches rather than in real time, making it suitable for processing large volumes of historical Twitter data. The analysis is performed with PySpark, the Python API for Apache Spark, using a Hadoop cluster for distributed storage and computation.
## Features

- **ntscraper library**: Scrapes tweets based on specified criteria such as keywords, hashtags, or user handles.
- **Hadoop cluster (HDFS)**: Stores the scraped Twitter data in the Hadoop Distributed File System, enabling distributed storage and processing.
- **PySpark**: Processes the Twitter data stored in HDFS.
## Requirements

- Python 3.x
- ntscraper library
- Hadoop cluster (for HDFS storage and distributed processing)
- PySpark
## Installation

- **ntscraper library**: Install the `ntscraper` library using `pip`:

  ```bash
  pip install ntscraper
  ```
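Once installed, scraping might look like the following sketch. It uses ntscraper's `Nitter` class; the search term, tweet count, and the exact field names in the returned dict are assumptions to adapt to your needs, and the network call is kept under a `__main__` guard since it requires a reachable public Nitter instance.

```python
import json


def tweets_to_records(result):
    """Flatten the dict returned by Nitter.get_tweets into one
    JSON-serializable record per tweet (field names are assumptions
    based on ntscraper's output format)."""
    records = []
    for t in result.get("tweets", []):
        records.append({
            "user": t.get("user", {}).get("username"),
            "text": t.get("text"),
            "date": t.get("date"),
        })
    return records


if __name__ == "__main__":
    # Network call: fetches from a public Nitter instance.
    from ntscraper import Nitter

    scraper = Nitter()
    # mode="term" searches by keyword; "hashtag" and "user" are also supported.
    result = scraper.get_tweets("pyspark", mode="term", number=50)

    # Write newline-delimited JSON, a convenient format for HDFS/Spark.
    with open("tweets.json", "w") as f:
        for rec in tweets_to_records(result):
            f.write(json.dumps(rec) + "\n")
```

Newline-delimited JSON is used here because Spark can read it directly with `spark.read.json`.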
- **Hadoop cluster configuration**: Set up a Hadoop cluster and configure HDFS. Ensure the necessary permissions are granted for reading and writing data in HDFS.
- **PySpark installation**: PySpark is bundled with Apache Spark. Install Apache Spark locally (or `pip install pyspark`), or use a cloud-based Spark service.
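Once the cluster is configured, a target directory can be created and locally scraped data loaded with standard HDFS shell commands. The paths and file name below are placeholders; adjust them to your cluster layout.

```shell
# Create a directory in HDFS for the scraped tweets (path is an assumption).
hdfs dfs -mkdir -p /user/$(whoami)/tweets

# Copy a locally scraped file into HDFS.
hdfs dfs -put tweets.json /user/$(whoami)/tweets/

# Verify the upload.
hdfs dfs -ls /user/$(whoami)/tweets
```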
## Usage

- **Scrape tweets**: Use the `ntscraper` library to scrape tweets based on your criteria (keywords, hashtags, user handles).
- **Store data in HDFS**: Transfer the scraped Twitter data to HDFS for distributed storage.
- **Perform sentiment analysis**: Execute the PySpark script `sentiment_analysis.py` to analyze the sentiment of the Twitter data stored in HDFS:

  ```bash
  python sentiment_analysis.py
  ```

  Alternatively, run the `.ipynb` notebook version of the script.
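As an illustration of the kind of per-tweet logic `sentiment_analysis.py` might apply, here is a minimal word-list scorer. The word lists are tiny placeholders, not the project's actual lexicon; in the real script, logic like this would typically be wrapped in a PySpark UDF and applied to the text column of the DataFrame read from HDFS.

```python
# Placeholder lexicons for illustration only.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}


def sentiment(text):
    """Label text positive/negative/neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Applying it in PySpark would look roughly like (sketch, not run here):
#   from pyspark.sql.functions import udf
#   df = spark.read.json("hdfs:///user/<you>/tweets")
#   df = df.withColumn("sentiment", udf(sentiment)("text"))
```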
## Contributors

- Ahmed Abdullah
- Ibtehaj Ali
- Maira
## License

This project is licensed under the MIT License - see the LICENSE file for details.