Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

Overview

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

This project is a good starting point for those who have little or no experience with Apache Spark Streaming. We use Twitter data since Twitter provides an API for developers that is easy to access. We present an end-to-end architecture on how to stream data from Twitter, clean it, and apply a simple sentiment analysis model to detect the polarity and subjectivity of each tweet.

Input data: Live tweets with a keyword
Main model: Data preprocessing and apply sentiment analysis on the tweets
Output: A parquet file with all the tweets and their sentiment analysis scores (polarity and subjectivity)

We use Python version 3.7.6 and Spark version 2.4.7. We should be cautious on the versions that we use because different versions of Spark require a different version of Python.

Main Libraries

tweepy: interact with the Twitter Streaming API and create a live data streaming pipeline with Twitter
pyspark: preprocess the twitter data (Python's Spark library)
textblob: apply sentiment analysis on the twitter text data

Instructions

First, run the Part 1: twitter_connection.py and let it continue running.
Then, run the Part 2: sentiment_analysis.py from a different IDE.

Part 1: Send tweets from the Twitter API 

In this part, we use our developer credentials to authenticate and connect to the Twitter API. We also create a TCP socket between Twitter's API and Spark, which waits for the call of the Spark Structured Streaming and then sends the Twitter data. Here, we use Python's Tweepy library for connecting and getting the tweets from the Twitter API. 

Part 2: Tweet preprocessing and sentiment analysis

In this part, we receive the data from the TCP socket and preprocess it with the pyspark library, which is Python's API for Spark. Then, we apply sentiment analysis using textblob, which is Python's library for processing textual data. After sentiment analysis, we save the tweet and the sentiment analysis scores in a parquet file, which is a data storage format.

Owner
Himanshu Kumar singh
Always Passionate to work on Artificial Intelligence technologies. Getting better at AI with each passing day. #TowardsAI
Himanshu Kumar singh
Pandas and Dask test helper methods with beautiful error messages.

beavis Pandas and Dask test helper methods with beautiful error messages. test helpers These test helper methods are meant to be used in test suites.

Matthew Powers 18 Nov 28, 2022
LynxKite: a complete graph data science platform for very large graphs and other datasets.

LynxKite is a complete graph data science platform for very large graphs and other datasets. It seamlessly combines the benefits of a friendly graphical interface and a powerful Python API.

124 Dec 14, 2022
Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

898 Jan 09, 2023
Parses data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)

google_takeout_parser parses both the Historical HTML and new JSON format for Google Takeouts caches individual takeout results behind cachew merge mu

Sean Breckenridge 27 Dec 28, 2022
[CVPR2022] This repository contains code for the paper "Nested Collaborative Learning for Long-Tailed Visual Recognition", published at CVPR 2022

Nested Collaborative Learning for Long-Tailed Visual Recognition This repository is the official PyTorch implementation of the paper in CVPR 2022: Nes

Jun Li 65 Dec 09, 2022
This repository contains some analysis of possible nerdle answers

Nerdle Analysis https://nerdlegame.com/ This repository contains some analysis of possible nerdle answers. Here's a quick overview: nerdle.py contains

0 Dec 16, 2022
A Streamlit web-app for a data-science project that aims to evaluate if the answer to a question is helpful.

How useful is the aswer? A Streamlit web-app for a data-science project that aims to evaluate if the answer to a question is helpful. If you want to l

1 Dec 17, 2021
Vaex library for Big Data Analytics of an Airline dataset

Vaex-Big-Data-Analytics-for-Airline-data A Python notebook (ipynb) created in Jupyter Notebook, which utilizes the Vaex library for Big Data Analytics

Nikolas Petrou 1 Feb 13, 2022
Predictive Modeling & Analytics on Home Equity Line of Credit

Predictive Modeling & Analytics on Home Equity Line of Credit Data (Python) HMEQ Data Set In this assignment we will use Python to examine a data set

Dhaval Patel 1 Jan 09, 2022
Option Pricing Calculator using the Binomial Pricing Method (No Libraries Required)

Binomial Option Pricing Calculator Option Pricing Calculator using the Binomial Pricing Method (No Libraries Required) Background A derivative is a fi

sammuhrai 1 Nov 29, 2021
We're Team Arson and we're using the power of predictive modeling to combat wildfires.

We're Team Arson and we're using the power of predictive modeling to combat wildfires. Arson Map Inspiration There’s been a lot of wildfires in Califo

Jerry Lee 3 Oct 17, 2021
A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.

Realtime Financial Market Data Visualization and Analysis Introduction This repo shows my project about real-time stock data pipeline. All the code is

6 Sep 07, 2022
Making the DAEN information accessible.

The purpose of this repository is to make the information on Australian COVID-19 adverse events accessible. The Therapeutics Goods Administration (TGA) keeps a database of adverse reactions to medica

10 May 10, 2022
Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production

Numerics Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production Use procedure: Initialise a new i

George Whittle 1 Nov 13, 2021
Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle.

2019-indian-election-eda Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle. This project is a part of the Cou

Souradeep Banerjee 5 Oct 10, 2022
Synthetic Data Generation for tabular, relational and time series data.

An Open Source Project from the Data to AI Lab, at MIT Website: https://sdv.dev Documentation: https://sdv.dev/SDV User Guides Developer Guides Github

The Synthetic Data Vault Project 1.2k Jan 07, 2023
A Python package for the mathematical modeling of infectious diseases via compartmental models

A Python package for the mathematical modeling of infectious diseases via compartmental models. Originally designed for epidemiologists, epispot can be adapted for almost any type of modeling scenari

epispot 12 Dec 28, 2022
Efficient matrix representations for working with tabular data

Efficient matrix representations for working with tabular data

QuantCo 70 Dec 14, 2022
Scraping and analysis of leetcode-compensations page.

Leetcode compensations report Scraping and analysis of leetcode-compensations page.

utsav 96 Jan 01, 2023