Amazon-Fine-Food-Reviews-

The objective of this project is to conduct text analysis on the Amazon Fine Food Reviews Dataset to create a topic model. By employing topic modeling techniques, the project aims to identify themes across reviews and uncover hidden topics within the dataset. Specifically, the focus is on discerning groups of "Good Reviews" on Amazon, categorized by scores 3, 4, and 5, while contrasting them with "Bad Reviews" marked with scores 1 and 2. The primary business question addressed by this analysis is to determine which product groups have garnered predominantly positive feedback.

Summary: This project revolves around the exploration and analysis of the Amazon Fine Food Reviews Dataset. The initial phase involves data cleaning, where the dataset of 568,454 observations with 10 variables undergoes scrutiny for missing values and redundancy. After filtering, the dataset is reduced to 2,734 observations, specifically focusing on reviews with scores 3, 4, and 5. Stop words are eliminated from the text column to enhance clarity and relevance for subsequent analysis.

The analysis begins with graphical exploration, including histograms depicting the distribution of characters and words within reviews. Word clouds are generated to visualize frequently occurring terms in both the review text and summary attributes, revealing common themes such as "coffee," "taste," and "good." These visualizations confirm the selection of "Good Reviews" for further analysis.

Next, topic modeling is employed to identify coherent groups within the dataset. Various iterations of topic models with different numbers of topics and passes are evaluated, with the most insightful model comprising four topics. These topics are interpreted as representing distinct categories such as "snacks," "beverages," "cookies," and "pet items," aligning with the observed trends in review content.

Further refinement of the model with an increased number of iterations reveals similar thematic clusters, solidifying the conclusions drawn from the initial topic model. The interpretation of these results leads to recommendations for Amazon, emphasizing the importance of maintaining quality in popular categories like "tea," "coffee," "cookie," and "chocolate," while also suggesting a focus on improving lesser-reviewed food products to enhance customer satisfaction.

In conclusion, the topic modeling analysis provides valuable insights into customer preferences and product categories that drive positive reviews on Amazon, guiding strategic decisions for product development and marketing efforts.

Introduction

This project works with the Amazon Fine Food Reviews Dataset. Its purpose is to apply text-analysis techniques to build a topic model that finds themes across reviews and uncovers hidden topics. Topic modeling attempts to classify large amounts of unlabeled text, grouping words with related meanings and distinguishing between uses of words that have multiple meanings. In this project I specifically target 'Good Reviews' on Amazon, defined as reviews with a score of 3, 4, or 5; the remaining scores (1 and 2) are labeled 'Bad Reviews'. The project's business question is: which product group(s) have gained 'Good Reviews'?

Data Cleaning

From the dataset summary, I identified that the Reviews data contains 568,454 observations with 10 variables: Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, and Text. The reviews span more than 10 years, from October 1999 until October 2012, and generally cover product and user information, ratings, and plain-text reviews, including reviews from other Amazon categories. The dataset includes more than 568 thousand reviews, 256 thousand users, 74 thousand products, and 260 users with more than 50 reviews, with an equal number of quantitative and qualitative variables. Checking for missing values, I found 16 nulls in ProfileName and 27 in Summary. These make up less than 1% of the dataset, so there is no need to remove them; they have no meaningful impact on the output. The full dataset contains far too many observations to manage on my home computer, so I sampled 15,000 reviews with Score values of 3, 4, and 5; after filtering, the dataset contains 2,734 observations with the same 10 variables. Before starting topic modeling, I removed junk words that are not useful to the model and would obscure the groups. I selected stop words such as 'like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people', 'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said', 'br', 'www', 'http', 'com' and dropped them from the Text column to make it cleaner.
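The filtering and stop-word removal described above can be sketched as follows; this is a minimal illustration assuming the reviews are in a pandas DataFrame with the dataset's Score and Text columns, using the custom stop-word list from the write-up (the sample rows are invented, not from the real data):

```python
# Sketch of the cleaning step: keep 'Good Reviews' (Score 3-5) and
# strip the custom stop words listed in the write-up from the Text column.
import pandas as pd

CUSTOM_STOPWORDS = {
    'like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
    'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said',
    'br', 'www', 'http', 'com',
}

def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only reviews scored 3, 4, or 5 and remove junk words."""
    good = df[df['Score'].isin([3, 4, 5])].copy()

    def strip_stopwords(text: str) -> str:
        words = text.lower().split()
        return ' '.join(w for w in words if w not in CUSTOM_STOPWORDS)

    good['Text'] = good['Text'].apply(strip_stopwords)
    return good

# Tiny illustrative sample (not the real dataset)
sample = pd.DataFrame({
    'Score': [5, 1, 4],
    'Text': ['I just love this coffee', 'terrible taste', 'great dog food br'],
})
cleaned = clean_reviews(sample)
```

In the real workflow the DataFrame would be loaded from the dataset file and a proper tokenizer could replace the simple `split()`, but the score filter and stop-word drop are the same.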

Analysis

Before starting to model the data, I wanted to get a feel for the reviews, so I created some graphs: histograms and word clouds. The first histogram shows the number of characters in each review in the Text attribute. From it I learned that review lengths range from 0 to about 8,000 characters; roughly 12,000 reviews fall between 0 and 1,000 characters, though some exceed 1,000 and a few fall between 2,000 and 4,000. From the second histogram I found that most reviews contain about 180 words (around 11,500 reviews), with a maximum of roughly 600 words. Having many words per review can be helpful in some cases, but it can also make it harder to define groups of reviews in topic modeling. The third histogram shows the average word length per review: averages range between 3 and 10, with approximately 6,000 reviews averaging 4 to 5 characters per word. This suggests the model's topic words will mostly be 4 to 6 characters long. Lastly, before looking for the group(s) that gained Good Reviews, I built a word cloud from the Text variable to see the most frequent words. There I observed 'coffee', 'taste', 'good', 'great', 'flavor', 'br', 'product', 'dog', 'food', and other clearly visible words. I also created a word cloud for the Summary attribute, which showed 'Great', 'Good', 'Best', 'love', 'coffee', 'product', 'taste', 'Delicious', and 'yummy'. These results confirm that I did indeed select good-scored reviews.
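The three per-review statistics behind those histograms (characters, words, average word length) can be computed as below; a minimal sketch with invented sample texts, not actual dataset rows:

```python
# Per-review statistics feeding the three histograms described above:
# character count, word count, and average word length.

def review_stats(text: str) -> dict:
    words = text.split()
    return {
        'n_chars': len(text),
        'n_words': len(words),
        'avg_word_len': sum(len(w) for w in words) / len(words) if words else 0.0,
    }

# Illustrative sample texts (not from the dataset)
reviews = [
    'great coffee wonderful taste',
    'my dog loves this food',
]
stats = [review_stats(r) for r in reviews]
```

Plotting each of the three keys across all reviews (e.g. with matplotlib's `hist`) reproduces the kind of distributions discussed above.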
After preparing the data, I specified two key parameters, the number of topics and the number of passes, and adjusted the model by varying them. The first table contains two topics with 10 passes; the second and third tables contain three and four topics, respectively, at the same number of passes. Evaluating tables 1, 2, and 3, I had nine topics across the three tables and needed to decide which grouping makes the most sense. Of the examined topic tables (nouns only), table 3 with 4 topics is the most interpretable. I observed four distinct groups: the 1st topic is about 'snacks', the 2nd about 'beverages', the 3rd about 'cookies', and the 4th looks like 'pet items'. As the histograms and word clouds suggested, the topic words' lengths (snack, cookie) are between 4 and 6 characters, and the most frequent words, coffee, food, dog, and product, all appear in my preferred model. Finally, I re-ran the model with 3 topics, increasing the number of passes from 10 to 80 to get the clearest separation within the model. From this model, the 1st topic is about 'cookies and snacks', the 2nd relates to 'beverages', and the 3rd is about 'pet items'.

Interpretation and Recommendations

From these models' results, it is clear that most 'Good Reviews' come from the 'cookies and snacks', 'beverages', and 'pet items' food categories; these categories received scores of 3, 4, or 5 from customers (users). Amazon can therefore be confident that cookies and snacks, beverages, and pet food products are loved by its customers, who generally provide good feedback for these categories, so there is no urgent need to improve them. In other words, Amazon should focus on the categories that do not appear among the well-reviewed products. Based on the topic model, I recommend the company keep 'tea' and 'coffee' among beverages, 'cookie' and 'chocolate' among cookies, and 'cracker' and 'chips' among snacks at good quality, since these matter most to satisfied customers. Furthermore, Amazon should pay more attention to the food products that did not appear in the models, in order to gain Good Reviews from users for those as well.
