Product-Review-Summarizer - A product review summarizer that clusters thousands of product reviews and condenses them into a summary of at most 500 characters, saving customers time and helping them make an informed buying decision.

Overview

Product Reviews Summarizer

Version 1.0.0

A quick guide to installing the required libraries and running the code.

The project has three .ipynb files - Data Scraper.ipynb, cosine-similarity-wo-tf-idf.ipynb, and cosine-similarity-w-tf-idf.ipynb.


Data Scraper

The Data Scraper notebook imports three libraries: requests, BeautifulSoup, and pandas. Installation instructions are available on each library's own page.

Splash

In this project, instead of using the default web browser to scrape data, we created a Splash container using Docker. Splash is a lightweight JavaScript rendering service with an HTTP API. For an easy installation walkthrough, you can watch this video by John Watson Rooney on YouTube.

https://www.youtube.com/watch?v=8q2K41QC2nQ&t=361s

Note: Make sure you pass the Splash localhost URL to requests.get().
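
As a rough illustration of the note above, the sketch below builds a request URL for Splash's render.html endpoint and passes it to requests.get(). The port 8050, the wait value, and the helper names splash_request_url and fetch_rendered_html are assumptions made for this example, not values taken from the notebook.

```python
from urllib.parse import urlencode

# Assumed address of the local Splash container; adjust to your Docker setup.
SPLASH_RENDER = "http://localhost:8050/render.html"


def splash_request_url(target_url, wait=2):
    """Build the Splash URL that renders target_url (hypothetical helper)."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{SPLASH_RENDER}?{query}"


def fetch_rendered_html(target_url):
    """Fetch the JavaScript-rendered HTML of target_url through Splash."""
    import requests  # imported here so the URL helper works without it

    response = requests.get(splash_request_url(target_url), timeout=60)
    response.raise_for_status()
    return response.text
```

The returned HTML can then be handed to BeautifulSoup in the usual way.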

Running the code

After you have installed and configured everything, you can run the code by providing the URL of your choice. For example, if you are taking a product from Amazon, go to the All Reviews page and then to page 2. Copy this URL up to the last '=' and paste it as an f-string in the code, then add '{x}' after the '='. The code is now ready to run. It scrapes the product name, review title, star rating, and review body from each page until the last page is reached, and saves the result in .xlsx format.

Note: Specify the required output name and destination.
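
The page loop described above can be sketched roughly as follows. The base URL, the function names, and the rule that an empty page marks the end of the reviews are all assumptions for illustration; fetch_page stands in for whatever code requests a page through Splash and parses it with BeautifulSoup.

```python
# Hypothetical base URL, copied up to and including the last '='.
BASE_URL = "https://www.example.com/product-reviews/ITEM?pageNumber="


def page_url(base_url, x):
    """Equivalent of the f-string with '{x}' appended after the last '='."""
    return f"{base_url}{x}"


def scrape_all_pages(base_url, fetch_page, max_pages=500):
    """Collect review rows page by page until a page returns no reviews.

    fetch_page is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns a list of dicts, one per review.
    """
    rows = []
    for x in range(1, max_pages + 1):
        reviews = fetch_page(page_url(base_url, x))
        if not reviews:  # assumed signal that the last page was passed
            break
        rows.extend(reviews)
    return rows
```

The collected rows can be turned into a table with pandas.DataFrame(rows) and written out with its to_excel method.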


cosine-similarity-wo-tf-idf

For the cosine similarity model, first download the pretrained GloVe word embeddings. Run the Load GloVe Word Embeddings section in the script once; it only needs to be rerun if the kernel is restarted.
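
For readers unfamiliar with the GloVe text format, here is a minimal sketch of loading it, assuming each line holds a word followed by its vector components separated by spaces. Plain Python lists are used to keep the example dependency-free; the notebook itself imports numpy. The sentence_vector helper, which averages word vectors, is one common choice and an assumption of this sketch.

```python
def load_glove(lines):
    """Parse GloVe-style lines ('word v1 v2 ...') into a dict of vectors.

    lines can be any iterable of strings, such as an open text file.
    """
    embeddings = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the known words in a cleaned sentence."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

With a downloaded embeddings file, load_glove can be called on the open file object, for example inside a with open(path, encoding="utf-8") block.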

For this script, we need to import the following libraries: numpy, pandas, nltk, nltk.tokenize, nltk.corpus, re, sklearn.metrics.pairwise, networkx, transformers, and time. Also run the nltk.download('punkt') and nltk.download('stopwords') lines to download the tokenizer and stopword data.

The next step is to load the data as a dataframe. Make sure to give the correct file path. The reviews are pre-processed to improve the results. The pre-processing steps are: converting to string datatype, converting alphabetical characters to lowercase, removing stopwords, replacing non-alphabetical characters with a blank character, and tokenizing the reviews into sentences.
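
The pre-processing steps listed above might look roughly like this. The sketch uses a regular expression and a tiny hand-written stopword set as stand-ins for nltk's sent_tokenize and stopwords corpus, and it splits into sentences first so that punctuation is still available for the split; the notebook may order the steps differently.

```python
import re

# Tiny illustrative stopword set; the notebook uses nltk's English list.
STOPWORDS = {"the", "a", "an", "is", "it", "and", "this", "to", "of", "i"}


def split_sentences(review):
    """Convert the review to a string and split it on ., ! or ?."""
    text = str(review)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean_sentence(sentence):
    """Blank out non-letters, lowercase, and drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence).lower()
    return " ".join(w for w in letters_only.split() if w not in STOPWORDS)
```

For example, split_sentences applied to a review followed by clean_sentence on each piece gives one cleaned string per sentence, ready to be turned into a vector.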

The pre-processed data is then grouped by star rating and sent to the cosine similarity and PageRank algorithm. The top 10 ranked sentences after applying the PageRank algorithm are sent to Hugging Face transformers to create an extractive summary (min_length = 75, max_length = 300). The summary, along with the product name, star rating, number of reviews, % of total reviews, and the top 5 most frequent words with their counts, is saved in .xlsx format.

Note: Specify the required output name and destination.
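
To make the ranking step concrete, the following is a dependency-free sketch of cosine similarity followed by PageRank over the sentence graph. The notebook relies on sklearn.metrics.pairwise and networkx for this; the pure-Python power iteration, the damping factor of 0.85, the clamping of negative similarities to zero, and the function names are assumptions of this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise similarities, zero on the diagonal, negatives clamped to 0."""
    n = len(vectors)
    return [
        [max(0.0, cosine(vectors[i], vectors[j])) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration over a similarity matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    row_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        updated = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if row_sums[i] > 0:
                    incoming += scores[i] * weights[i][j] / row_sums[i]
                else:
                    incoming += scores[i] / n  # isolated node: spread evenly
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


def top_sentences(sentences, vectors, k=10):
    """Return the k sentences with the highest PageRank score."""
    scores = pagerank(similarity_matrix(vectors))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:k]]
```

Building the similarity matrix compares every sentence with every other one, so its cost grows quadratically with the number of sentences in a star-rating group.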


cosine-similarity-w-tf-idf

For this model, along with the above libraries, we need to import two additional libraries: spacy and heapq. The cosine similarity algorithm has a time complexity of O(n^2) in the number of sentences. To speed up execution, this method uses the tf-idf measure to score the frequent words, and hence the corresponding sentences. Only the top 1000 sentences are then sent to the cosine similarity algorithm. Using the tf-idf measure ensures that each product, irrespective of the number of sentences in its reviews, gives an output within 120 seconds. This method makes sure no important feature is lost, giving similar results to the previous method in considerably less time.
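
One way the tf-idf pre-filter could work is sketched below: each cleaned sentence is treated as a document, scored by the sum of the tf-idf weights of its words, and only the best-scoring sentences are kept via heapq.nlargest. The exact scoring used in the notebook is not described here, so the formula and the function names are assumptions.

```python
import heapq
import math
from collections import Counter


def tfidf_sentence_scores(sentences):
    """Score each sentence by the summed tf-idf of its words."""
    tokenized = [s.split() for s in sentences]
    n = len(tokenized)
    # Number of sentences in which each word appears at least once.
    doc_freq = Counter(w for words in tokenized for w in set(words))
    scores = []
    for words in tokenized:
        if not words:
            scores.append(0.0)
            continue
        term_freq = Counter(words)
        scores.append(sum(
            (count / len(words)) * math.log(n / doc_freq[w])
            for w, count in term_freq.items()
        ))
    return scores


def top_sentences_by_tfidf(sentences, limit=1000):
    """Keep the limit best-scoring sentences, in their original order."""
    scores = tfidf_sentence_scores(sentences)
    best = heapq.nlargest(limit, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(best)]
```

Capping the input at a fixed number of sentences bounds the size of the similarity matrix, which is what keeps the quadratic step from growing with the number of reviews.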


Contributors

© Parv Bhatt © Namratha Sri Mateti © Dominic Thomas


Owner
Parv Bhatt
Masters in Data Analytics Student at Penn State University