Reading streams of Twitter data, saving them to Kafka, then processing them with the Kafka Streams API and Spark Streaming

Last update: Dec 06, 2021

Related tags

Data Analysis kafka-to-spark-streaming

Overview

Using Streaming Twitter Data with Kafka and Spark

Read streams of Twitter data, publish them to a Kafka topic, and process the messages using the Kafka Streams API and Spark Streaming.

Twitter is blocked in some countries. If it is blocked where you are, make sure a VPN is switched on so that you can reach Twitter.

You also need your own consumer_key, consumer_secret, and access_token (with its secret) inside the config.py file.
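As a rough sketch of the credentials file described above, config.py could hold four plain string constants. The name access_token_secret is an assumption (the text only says the access token comes with its secret), and the missing_credentials helper is hypothetical, added here only to catch an unfilled file before streaming starts.

```python
# config.py (sketch): replace the placeholders with your own Twitter credentials.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"  # assumed name, not given in the text

REQUIRED = ("consumer_key", "consumer_secret", "access_token", "access_token_secret")


def missing_credentials(settings):
    """Return the names of required credentials that are empty or still placeholders.

    `settings` is any mapping of names to values, e.g. vars(config).
    This helper is hypothetical and not part of the original project.
    """
    return [
        name
        for name in REQUIRED
        if not settings.get(name) or str(settings[name]).startswith("YOUR_")
    ]
```

For example, after `import config`, calling `missing_credentials(vars(config))` would list any field that still needs to be filled in.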

Create an environment using conda with Python 3.8:
- conda create -n python38 python=3.8
- conda activate python38
- Check the requirements inside requirements.txt and install them using conda:
  - conda install -c conda-forge tweepy==4.4.0
  - conda install -c conda-forge kafka-python==2.0.2
- Kafka should be installed on your machine; check the Kafka documentation for installation instructions. If you use Homebrew on a Mac, you can run: brew install kafka
- Start ZooKeeper (port 2181): zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
- In another terminal window, start the broker (port 9092): kafka-server-start /usr/local/etc/kafka/server.properties
- In a terminal window, list the topics you have: kafka-topics --list --bootstrap-server localhost:9092
- Create the Kafka topic "tweeter" with 1 partition and a replication factor of 1 (no extra replicas, because we are using a local machine): kafka-topics --create --topic tweeter --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
- List the topics again: kafka-topics --list --bootstrap-server localhost:9092
- Check what is inside the "tweeter" topic: kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning. There is nothing yet, but data will appear once streaming starts.
- Run python kafka_producer.py to start streaming from Twitter and pushing messages to the topic.
- Check that the data is now in the topic: kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning

Congrats! You have done it!
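The producer step above might look roughly like the following. This is only a sketch, not the repository's actual kafka_producer.py: it assumes tweepy 4.4.0's Stream class and kafka-python's KafkaProducer, a config module with the four credentials (access_token_secret is an assumed name), and hypothetical helper names forward_tweet and run_stream.

```python
TOPIC = "tweeter"             # topic created in the steps above
BOOTSTRAP = "localhost:9092"  # broker address used in the steps above


def forward_tweet(send, topic, raw_data):
    """Pass one raw tweet payload to `send(topic, value)` as bytes; return the bytes."""
    if isinstance(raw_data, bytes):
        payload = raw_data
    else:
        payload = str(raw_data).encode("utf-8")
    send(topic, payload)
    return payload


def run_stream(keywords):
    """Stream tweets matching `keywords` into Kafka.

    Needs network access, a running broker, and valid credentials.
    """
    # Third-party imports stay local so forward_tweet works without them installed.
    import tweepy
    from kafka import KafkaProducer
    import config  # assumed to define the four credentials

    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)

    class KafkaForwarder(tweepy.Stream):
        # tweepy calls on_data with the raw payload of each streamed message.
        def on_data(self, raw_data):
            forward_tweet(producer.send, TOPIC, raw_data)

    stream = KafkaForwarder(
        config.consumer_key,
        config.consumer_secret,
        config.access_token,
        config.access_token_secret,
    )
    stream.filter(track=keywords)  # blocks until the stream disconnects
```

Calling `run_stream(["kafka"])` would then start forwarding matching tweets; the keyword list is just an illustration.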

So what's next?

You can use the generated data with Kafka Streams and Spark Streaming, and practice more!
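As a first practice step before moving on to Kafka Streams or Spark Streaming, the topic can also be read back from Python. The sketch below uses a plain kafka-python KafkaConsumer as a stand-in, not either of those frameworks. It assumes each message is a JSON tweet payload with a "text" field, and the function names tweet_text and print_tweets are hypothetical.

```python
import json


def tweet_text(raw):
    """Return the "text" field of a raw tweet payload, or None if absent or not JSON."""
    try:
        tweet = json.loads(raw)
    except (ValueError, TypeError):
        return None
    return tweet.get("text") if isinstance(tweet, dict) else None


def print_tweets(topic="tweeter", bootstrap="localhost:9092"):
    """Print the text of every tweet in the topic (needs a running broker)."""
    # Imported locally so tweet_text can be used without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,
        auto_offset_reset="earliest",  # start from the oldest available message
    )
    for message in consumer:  # blocks, waiting for new messages
        text = tweet_text(message.value)
        if text is not None:
            print(text)
```

Keeping the parsing in a separate function makes it easy to reuse the same logic later inside a Spark or Kafka Streams job.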

Owner

Rustam Zokirov
