Reading streams of Twitter data, save them to Kafka, then process with Kafka Stream API and Spark Streaming

Overview

Using Streaming Twitter Data with Kafka and Spark

Reading streams of Twitter data, publishing them to Kafka topic, process message using Kafka Stream API and Spark Streaming

Make sure that VPN is switched on, so that you can use Twitter. In some countries Twitter is blocked.

Moreover, you should have own consumer_key, consumer_secret, and access_token with its secret inside config.py file

  • Create environment using conda with Python 3.8:
    • conda create -n python38 python=3.8
    • conda activate python38
    • Check requirements inside requirements.txt and install then using conda:
      • conda install -c conda-forge tweepy==4.4.0
      • conda install -c conda-forge kafka-python==2.0.2
  • Kafka should be installed in your machine, check the documentation for installation. if you use brew with Mac you can use brew install kafka
  • Start zookeeper: zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties, port: 2181
  • On another terminal window start broker: kafka-server-start /usr/local/etc/kafka/server.properties, port: 9092 - In terminal window list topics you have: kafka-topics --list --bootstrap-server localhost:9092
  • Create Kafka topic "tweeter" with 1 partition and no replication because we use local machine: kafka-topics --create --topic tweeter --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
  • Now list again, the topics you have: kafka-topics --list --bootstrap-server localhost:9092
  • Let's see what we have inside the "tweeter" topic kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning, absolutely noting), but when we start streaming, data will be generated
  • Now run python kafka_producer.py to start stream Twitter and push message to topic.
  • And now check that the data is inside topic with kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning
  • Congrats! You have done it!

So what's next?

You can use generated data with Kafka Stream and Spark Streaming, and practice more!

Owner
Rustam Zokirov
15x Engineer • Data Engineer
Rustam Zokirov
Pipeline and Dataset helpers for complex algorithm evaluation.

tpcp - Tiny Pipelines for Complex Problems A generic way to build object-oriented datasets and algorithm pipelines and tools to evaluate them pip inst

Machine Learning and Data Analytics Lab FAU 3 Dec 07, 2022
TE-dependent analysis (tedana) is a Python library for denoising multi-echo functional magnetic resonance imaging (fMRI) data

tedana: TE Dependent ANAlysis TE-dependent analysis (tedana) is a Python library for denoising multi-echo functional magnetic resonance imaging (fMRI)

136 Dec 22, 2022
Extract data from a wide range of Internet sources into a pandas DataFrame.

pandas-datareader Up to date remote data access for pandas, works for multiple versions of pandas. Installation Install using pip pip install pandas-d

Python for Data 2.5k Jan 09, 2023
Python data processing, analysis, visualization, and data operations

Python This is a Python data processing, analysis, visualization and data operations of the source code warehouse, book ISBN: 9787115527592 Descriptio

FangWei 1 Jan 16, 2022
BIGDATA SIMULATION ONE PIECE WORLD CENSUS

ONE PIECE is a Japanese manga of great international success. The story turns inhabited in a fictional world, tells the adventures of a young man whose body gained rubber properties after accidentall

Maycon Cypriano 3 Jun 30, 2022
MeSH2Matrix - A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

SisonkeBiotik 6 Nov 30, 2022
Data Analysis for First Year Laboratory at Imperial College, London.

Data Analysis for First Year Laboratory at Imperial College, London. For personal reference only, and to reference in lab reports and lab books.

Martin He 0 Aug 29, 2022
Important dataframe statistics with a single command

quick_eda Receiving dataframe statistics with one command Project description A python package for Data Scientists, Students, ML Engineers and anyone

Sven Eschlbeck 2 Dec 19, 2021
Desafio 1 ~ Bantotal

Challenge 01 | Bantotal Please read the instructions for the challenge by selecting your preferred language below: Español Português License Copyright

Maratona Behind the Code 44 Sep 28, 2022
OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere.

opendrift OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere. Do

OpenDrift 167 Dec 13, 2022
Leverage Twitter API v2 to analyze tweet metrics such as impressions and profile clicks over time.

Tweetmetric Tweetmetric allows you to track various metrics on your most recent tweets, such as impressions, retweets and clicks on your profile. The

Mathis HAMMEL 29 Oct 18, 2022
Flexible HDF5 saving/loading and other data science tools from the University of Chicago

deepdish Flexible HDF5 saving/loading and other data science tools from the University of Chicago. This repository also host a Deep Learning blog: htt

UChicago - Department of Computer Science 255 Dec 10, 2022
Pandas and Spark DataFrame comparison for humans

DataComPy DataComPy is a package to compare two Pandas DataFrames. Originally started to be something of a replacement for SAS's PROC COMPARE for Pand

Capital One 259 Dec 24, 2022
PCAfold is an open-source Python library for generating, analyzing and improving low-dimensional manifolds obtained via Principal Component Analysis (PCA).

PCAfold is an open-source Python library for generating, analyzing and improving low-dimensional manifolds obtained via Principal Component Analysis (PCA).

Burn Research 4 Oct 13, 2022
Statsmodels: statistical modeling and econometrics in Python

About statsmodels statsmodels is a Python package that provides a complement to scipy for statistical computations including descriptive statistics an

statsmodels 8k Dec 29, 2022
Flood modeling by 2D shallow water equation

hydraulicmodel Flood modeling by 2D shallow water equation. Refer to Hunter et al (2005), Bates et al. (2010). Diffusive wave approximation Local iner

6 Nov 30, 2022
scikit-survival is a Python module for survival analysis built on top of scikit-learn.

scikit-survival scikit-survival is a Python module for survival analysis built on top of scikit-learn. It allows doing survival analysis while utilizi

Sebastian Pölsterl 876 Jan 04, 2023
Py-price-monitoring - A Python price monitor

A Python price monitor This project was focused on Brazil, so the monitoring is

Samuel 1 Jan 04, 2022
🌍 Create 3d-printable STLs from satellite elevation data 🌏

mapa 🌍 Create 3d-printable STLs from satellite elevation data Installation pip install mapa Usage mapa uses numpy and numba under the hood to crunch

Fabian Gebhart 13 Dec 15, 2022