A machine learning model for analyzing text for user sentiment and determine whether its a positive, neutral, or negative review.

Overview

Sentiment Analysis on Yelp's Dataset

Author: Roberto Sanchez, Talent Path: D1 Group

Docker Deployment:

Deployment of this application can be found here hosted on AWS

Running it locally:

docker pull rsanchez2892/sentiment_analysis_app

Overview

The scope of this capstone is centered around the data processing, exploratory data analysis, and training of a model to predict sentiment on user reviews.

End goal of the model

Business Goals

Create a model to be able to be used in generating sentiment on reviews or comments found in external / internal websites to give insights on how people feel about certain topics.

This could give the company insights not easily available on sites where ratings are required or for internal use to determine sentiment on blogs or comments.

Business Applications

By utilizing this model, the business can use it for the following purposes:

External:

  • Monitoring Brand and Reputation online
  • Product Research

Internal:

  • Customer Support
  • Customer Feedback
  • Employee Satisfaction

Currently method to achieving this is by using outside resources which come at a cost and increases risk for leaking sensitive data to the public. This product will bypass these outside resources and give the company the ability to do it in house.

Model Deployment

Link: Review Analyzer

After running multiple models and comparing accuracy, I found that the LinearSVC model is a viable candidate to be used in production for analyzing reviews of services or food.

Classification Report / Confusion Matrix:

Classification Report

Technology Stack

I have been using these technologies for this project:

  • Jupyter Notebook - Version 6.3.0
    • Used for most of the data processing, EDA, and model training.
  • Python - Version 3.8.8
    • The main language this project will be done in.
  • Scikit-learn - Version 0.24
    • Utilizing metrics reports and certain models.
  • Postgres - Version 13
    • Main database application used to store this data.
  • Flask - Version 1.1.2
    • Main backend technology to host a usable version of this project to the public.
  • GitHub
    • Versioning control and online documentation
  • Heroku
    • Online cloud platform to host this application for public use

Data Processing

This capstone uses the Yelp dataset found on Kaggle which comprises of multiple files:

  • Business Data
  • Check-in Data
  • Review Data
  • Tips Data
  • User Data

Stage 1 - Read in From JSON files into Postgres

Overview

  • Read in JSON files
  • General observations on the features found in each file
  • Modifying feature names to meet Postgres naming convention
  • Normalized the data to prepare for import to Postgres
  • Saved copies of each table as CSV file for backup incase Database goes down
  • Exported data into Postgres

As stated above, Kaggle provided several JSON files with a large amount of data that needed to be stored in a location for easy access and provide a quick way to query data on the fly. As the files were read in Jupyter notebook a general observation was made to the feature names and amount each file contained to see what data I was dealing with along with the types associated with them. The business data contained a strange number of attributes that had to be broken up into separate data frames to be normalized for Postgres.

Stage 2 - Pre-Processing Data

Overview

  • Read in data from Postgres
  • IDing Null Values
  • Removing Sparse features
  • Saved data frame as a pickle to be used in model training

This stage I performed elementary data analysis where I analyze any null values, see the distribution of my ratings and review lengths.

Stage 3 - Cleaning Up Data

Overview

  • Replace contractions with expanded versions
  • Lemmatized text
  • Removed special characters, dates, emails, and URLs
  • Removed stop words
  • Remove non-english text
  • Normalized text

Exploratory Data Analysis

Analyzing Null Values in Dataset

Below is a visualization of the data provided by Kaggle showing which features have "NaN " values. Its is clear that the review ratings (review_stars) and reviews (text) are fully populated. Some of the business attributes are sparse but have enough values to be useful for other things. Note several other features were dropped in the Data Processing since they did not provide any insights for the scope of this project.

Heatmap of several million rows of data.

Looking Closer at the Ratings (review_stars)

This is a sample of 2 million rows from the original 8 million in the dataset. This distribution of ratings has a left skew on it where most of the reviews are 4 to 5 stars.

A bar graph showing the distribution of ratings between 1 to 5. there is a significant amount of 5 stars compared to 1-3 combined.

I simplified the ratings to better categorize the sentiment of the review by grouping 1 and 2 star reviews as 'negative', 3 star review as 'neutral', and 4 and 5 star reviews as 'positive'.

Simplified Barchar showing just the negative, neutral, and positive ratings

Looking Closer at the Reviews (text)

To analyze the text, I've calculated the length of each review in the sample and plotted a distribution graph showing them the number of characters of each review. The statistics were that the median review was approx. 606 characters with a range of 0 through 5000 characters.

Showing a distribution chart of the length of the reviews. Clearly the distribution skews right with a median around 400 characters.

A closer inspection on the range 0 - 2000 we can see that most of the reviews are around this general area.

A zoomed in version of the same distribution chart now focusing on 0 - 2000 characters

In order to produce a viable word cloud, I've had to process all of the text in the sample to remove special characters and stop words from NLTK to produce a viable string to be used in word cloud. Below is a visualization of all of the key words found in the positive reviews.

Created a word cloud from the positive words after cleaning

As expected, words like "perfect", "great", "good", "great place", and "highly recommend" came out on top.

A word cloud showing all the words from the negative reviews

On the negative word cloud, words like "bad", "customer service", "never", "horrible", and "awful" are appearing on the word cloud.

Model Training

Model Selection

model selection flow chart

These four models were chosen to be trained with this data. Each of these models had a pipeline created with TfidfVectorizer.

Model Training

  • Run a StratifiedKFold with a 5 fold split and analyze the average scores and classification reports
    • Get an average accuracy of the model for comparison
  • Create a single model to generate a confusion matrix
  • Test out model on a handful of examples

Below is the average metrics after running 5 fold cross validation on LinearSVC

average metrics for linearSVC model

Testing Model

After the model was trained, I fed it some reviews I found online to test out whether or not the model can properly detect the right sentiment. The following reviews are ordered as "Negative", "Neutral", and "Positive":

new_test_data = [
    "This was the worst place I've ever eaten at. The staff was rude and did not take my order until after i pulled out my wallet.",
    "The food was alright, nothing special about this place. I would recommend going elsewhere.",
    "I had a pleasent time with kimberly at the granny shack. The food was amazing and very family friendly.",
]
res = model.prediction(new_test_data)

Below is the results of the prediction, notice that the neutral review has been labeled as negative. This makes sense since the model has a poor recall for neutral reviews as shown in the classification report.

Results from the prediction

End Notes

There are some improvements to be made such as the follow:

  • Balancing the data
    • This can be seen in the confusion matrix for the candidate models and other models created that the predictions come out more positive than negative or neutral.
    • While having poor scores in the neutral category, the most important features are found in the negative and positive predictions for business applications.
  • Hyper-parametrization improvement
    • Logistic Regression and Multinomial NB models produced models within a reasonable time frame while returning reasonable scores. Random Forrest Classifier and SVM took a significant amount of time to produce just one iteration. In order to produce results from this model StratifiedKFold was not used in these two models. Changing SVM to LinearSVC improved performance dramatically and replaced the SVM model and outperformed Logistic Regression which was the original candidate model.
Owner
Roberto Sanchez
Full Stack Web Developer / Data Science Startup
Roberto Sanchez
Module for automatic summarization of text documents and HTML pages.

Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim

Mišo Belica 3k Jan 08, 2023
1 Jun 28, 2022
A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer Token id [0, 20000) are image tokens. Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == 'unk', ice

THUDM 42 Dec 27, 2022
Two-stage text summarization with BERT and BART

Two-Stage Text Summarization Description We experiment with a 2-stage summarization model on CNN/DailyMail dataset that combines the ability to filter

Yukai Yang (Alexis) 6 Oct 22, 2022
Download videos from YouTube/Twitch/Twitter right in the Windows Explorer, without installing any shady shareware apps

youtube-dl and ffmpeg Windows Explorer Integration Download videos from YouTube/Twitch/Twitter and more (any platform that is supported by youtube-dl)

Wolfgang 226 Dec 30, 2022
Code Implementation of "Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction".

Span-ASTE: Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction ***** New March 31th, 2022: Scikit-Style API for Easy Usage *****

Chia Yew Ken 111 Dec 23, 2022
NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

pretrain4ir_tutorial NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking 用作NLPIR实验室, Pre-training

ZYMa 12 Apr 07, 2022
a CTF web challenge about making screenshots

screenshotter (web) A CTF web challenge about making screenshots. It is inspired by a bug found in real life. The challenge was created by @LiveOverfl

219 Jan 02, 2023
KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KoGPT KoGPT (Korean Generative Pre-trained Transformer) https://github.com/kakaobrain/kogpt https://huggingface.co/kakaobrain/kogpt Model Descriptions

Kakao Brain 797 Dec 26, 2022
本插件是pcrjjc插件的重置版,可以独立于后端api运行

pcrjjc2 本插件是pcrjjc重置版,不需要使用其他后端api,但是需要自行配置客户端 本项目基于AGPL v3协议开源,由于项目特殊性,禁止基于本项目的任何商业行为 配置方法 环境需求:.net framework 4.5及以上 jre8 别忘了装jre8 别忘了装jre8 别忘了装jre8

132 Dec 26, 2022
使用Mask LM预训练任务来预训练Bert模型。训练垂直领域语料的模型表征,提升下游任务的表现。

Pretrain_Bert_with_MaskLM Info 使用Mask LM预训练任务来预训练Bert模型。 基于pytorch框架,训练关于垂直领域语料的预训练语言模型,目的是提升下游任务的表现。 Pretraining Task Mask Language Model,简称Mask LM,即

Desmond Ng 24 Dec 10, 2022
Words_And_Phrases - Just a repo for useful words and phrases that might come handy in some scenarios. Feel free to add yours

Words_And_Phrases Just a repo for useful words and phrases that might come handy in some scenarios. Feel free to add yours Abbreviations Abbreviation

Subhadeep Mandal 1 Feb 01, 2022
Transcribing audio files using Hugging Face's implementation of Wav2Vec2 + "chain-linking" NLP tasks to combine speech-to-text with downstream tasks like translation and summarisation.

PART 2: CHAIN LINKING AUDIO-TO-TEXT NLP TASKS 2A: TRANSCRIBE-TRANSLATE-SENTIMENT-ANALYSIS In notebook3.0, I demo a simple workflow to: transcribe a lo

Chua Chin Hon 30 Jul 13, 2022
Smart discord chatbot integrated with Dialogflow

academic-NLP-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

Tom Huynh 5 Oct 24, 2022
Tools to download and cleanup Common Crawl data

cc_net Tools to download and clean Common Crawl as introduced in our paper CCNet. If you found these resources useful, please consider citing: @inproc

Meta Research 483 Jan 02, 2023
Training code of Spatial Time Memory Network. Semi-supervised video object segmentation.

Training-code-of-STM This repository fully reproduces Space-Time Memory Networks Performance on Davis17 val set&Weights backbone training stage traini

haochen wang 128 Dec 11, 2022
Korean extractive summarization. 2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드

korean extractive summarization 2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드 Leaderboard Notice Text Summarization with Pretrained Encoders에 나오는 bertsumext모델(ext

3 Aug 10, 2022
official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

Plugin 3 Jan 12, 2022
Unofficial Python library for using the Polish Wordnet (plWordNet / Słowosieć)

Polish Wordnet Python library Simple, easy-to-use and reasonably fast library for using the Słowosieć (also known as PlWordNet) - a lexico-semantic da

Max Adamski 12 Dec 23, 2022
2021语言与智能技术竞赛:机器阅读理解任务

LICS2021 MRC 1. 项目&任务介绍 本项目基于官方给定的baseline(DuReader-Checklist-BASELINE)进行二次改造,对整个代码框架做了简单的重构,对核心网络结构添加了注释,解耦了数据读取的模块,并添加了阈值确认的功能,一些小的细节也做了改进。 本次任务为202

roar 29 Dec 05, 2022