💬 Python scripts to parse Messenger, Hangouts, WhatsApp and Telegram chat logs into DataFrames.

Overview

Chatistics

Python 3 scripts to convert chat logs from various messaging platforms into Pandas DataFrames. Can also generate histograms and word clouds from the chat logs.

Changelog

10 Jan 2020: UPDATED ALL THE THINGS! Thanks to mar-muel and manueth, pretty much everything has been updated and improved, and WhatsApp is now supported!

21 Oct 2018: Updated Facebook Messenger and Google Hangouts parsers to make them work with the new exported file formats.

9 Feb 2018: Telegram support added thanks to bmwant.

24 Oct 2016: Initial release supporting Facebook Messenger and Google Hangouts.

Support Matrix

Platform Direct Chat Group Chat
Facebook Messenger ✔ ✘
Google Hangouts ✔ ✘
Telegram ✔ ✘
WhatsApp ✔ ✔

Exported data

Data exported for each message regardless of the platform:

Column Content
timestamp UNIX timestamp (in seconds)
conversationId A conversation ID, unique by platform
conversationWithName Name of the other people in a direct conversation, or name of the group conversation
senderName Name of the sender
outgoing Boolean value whether the message is outgoing/coming from owner
text Text of the message
language Language of the conversation as inferred by langdetect
platform Platform (see support matrix above)

Exporting your chat logs

1. Download your chat logs

Google Hangouts

Warning: Google Hangouts archives can take a long time to be ready for download - up to one hour in our experience.

  1. Go to Google Takeout: https://takeout.google.com/settings/takeout
  2. Request an archive containing your Hangouts chat logs
  3. Download the archive, then extract the file called Hangouts.json
  4. Move it to ./raw_data/hangouts/

Facebook Messenger

Warning: Facebook archives can take a very long time to be ready for download - up to 12 hours! They can weight several gigabytes. Start with an archive containing just a few months of data if you want to quickly get started, this shouldn't take more than a few minutes to complete.

  1. Go to the page "Your Facebook Information": https://www.facebook.com/settings?tab=your_facebook_information
  2. Click on "Download Your Information"
  3. Select the date range you want. The format must be JSON. Media won't be used, so you can set the quality to "Low" to speed things up.
  4. Click on "Deselect All", then scroll down to select "Messages" only
  5. Click on "Create File" at the top of the list. It will take Facebook a while to generate your archive.
  6. Once the archive is ready, download and extract it, then move the content of the messages folder into ./raw_data/messenger/

WhatsApp

Unfortunately, WhatsApp only lets you export your conversations from your phone and one by one.

  1. On your phone, open the chat conversation you want to export
  2. On Android, tap on â‹® > More > Export chat. On iOS, tap on the interlocutor's name > Export chat
  3. Choose "Without Media"
  4. Send chat to yourself eg via Email
  5. Unpack the archive and add the individual .txt files to the folder ./raw_data/whatsapp/

Telegram

The Telegram API works differently: you will first need to setup Chatistics, then query your chat logs programmatically. This process is documented below. Exporting Telegram chat logs is very fast.

2. Setup Chatistics

First, install the required Python packages using conda:

conda env create -f environment.yml
conda activate chatistics

You can now parse the messages by using the command python parse.py .

By default the parsers will try to infer your own name (i.e. your username) from the data. If this fails you can provide your own name to the parser by providing the --own-name argument. The name should match your name exactly as used on that chat platform.

# Google Hangouts
python parse.py hangouts

# Facebook Messenger
python parse.py messenger

# WhatsApp
python parse.py whatsapp

Telegram

  1. Create your Telegram application to access chat logs (instructions). You will need api_id and api_hash which we will now set as environment variables.
  2. Run cp secrets.sh.example secrets.sh and fill in the values for the environment variables TELEGRAM_API_ID, TELEGRAMP_API_HASH and TELEGRAM_PHONE (your phone number including country code).
  3. Run source secrets.sh
  4. Execute the parser script using python parse.py telegram

The pickle files will now be ready for analysis in the data folder!

For more options use the -h argument on the parsers (e.g. python parse.py telegram --help).

3. All done! Play with your data

Chatistics can print the chat logs as raw text. It can also create histograms, showing how many messages each interlocutor sent, or generate word clouds based on word density and a base image.

Export

You can view the data in stdout (default) or export it to csv, json, or as a Dataframe pickle.

python export.py

You can use the same filter options as described above in combination with an output format option:

  -f {stdout,json,csv,pkl}, --format {stdout,json,csv,pkl}
                        Output format (default: stdout)

Histograms

Plot all messages with:

python visualize.py breakdown

Among other options you can filter messages as needed (also see python visualize.py breakdown --help):

  --platforms {telegram,whatsapp,messenger,hangouts}
                        Use data only from certain platforms (default: ['telegram', 'whatsapp', 'messenger', 'hangouts'])
  --filter-conversation
                        Limit by conversations with this person/group (default: [])
  --filter-sender
                        Limit to messages sent by this person/group (default: [])
  --remove-conversation
                        Remove messages by these senders/groups (default: [])
  --remove-sender
                        Remove all messages by this sender (default: [])
  --contains-keyword
                        Filter by messages which contain certain keywords (default: [])
  --outgoing-only       
                        Limit by outgoing messages (default: False)
  --incoming-only       
                        Limit by incoming messages (default: False)

Eg to see all the messages sent between you and Jane Doe:

python visualize.py breakdown --filter-conversation "Jane Doe"

To see the messages sent to you by the top 10 people with whom you talk the most:

python visualize.py breakdown -n 10 --incoming-only

You can also plot the conversation densities using the --as-density flag.

Word Cloud

You will need a mask file to render the word cloud. The white bits of the image will be left empty, the rest will be filled with words using the color of the image. See the WordCloud library documentation for more information.

python visualize.py cloud -m raw_outlines/users.jpg

You can filter which messages to use using the same flags as with histograms.

Development

Install dev environment using

conda env create -f environment_dev.yml

Run tests from project root using

python -m pytest

Improvement ideas

  • Parsers for more chat platforms: Discord? Signal? Pidgin? ...
  • Handle group chats on more platforms.
  • See open issues for more ideas.

Pull requests are welcome!

Social medias

Projects using Chatistics

Meet your Artificial Self: Generate text that sounds like you workshop

Credits

Owner
Florian
🤖 Machine Learning
Florian
Fast, flexible and easy to use probabilistic modelling in Python.

Please consider citing the JMLR-MLOSS Manuscript if you've used pomegranate in your academic work! pomegranate is a package for building probabilistic

Jacob Schreiber 3k Jan 02, 2023
SparseLasso: Sparse Solutions for the Lasso

SparseLasso: Sparse Solutions for the Lasso Introduction SparseLasso provides a Scikit-Learn based estimation of the Lasso with cross-validation tunin

Gabriel Okasa 1 Nov 08, 2021
This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP.

Overview Welcome to the Step-X repository. This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP. Be

Keanu Pang 0 Jan 20, 2022
Data and code accompanying the paper Politics and Virality in the Time of Twitter

Politics and Virality in the Time of Twitter Data and code accompanying the paper Politics and Virality in the Time of Twitter. In specific: the code

Cardiff NLP 3 Jul 02, 2022
The official repository for ROOT: analyzing, storing and visualizing big data, scientifically

About The ROOT system provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficien

ROOT 2k Dec 29, 2022
An orchestration platform for the development, production, and observation of data assets.

Dagster An orchestration platform for the development, production, and observation of data assets. Dagster lets you define jobs in terms of the data f

Dagster 6.2k Jan 08, 2023
pyhsmm MITpyhsmm - Bayesian inference in HSMMs and HMMs. MIT

Bayesian inference in HSMMs and HMMs This is a Python library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and expli

Matthew Johnson 527 Dec 04, 2022
Package for decomposing EMG signals into motor unit firings, as used in Formento et al 2021.

EMGDecomp Package for decomposing EMG signals into motor unit firings, created for Formento et al 2021. Based heavily on Negro et al, 2016. Supports G

13 Nov 01, 2022
A simplified prototype for an as-built tracking database with API

Asbuilt_Trax A simplified prototype for an as-built tracking database with API The purpose of this project is to: Model a database that tracks constru

Ryan Pemberton 1 Jan 31, 2022
A lightweight interface for reading in output from the Weather Research and Forecasting (WRF) model into xarray Dataset

xwrf A lightweight interface for reading in output from the Weather Research and Forecasting (WRF) model into xarray Dataset. The primary objective of

National Center for Atmospheric Research 43 Nov 29, 2022
A crude Hy handle on Pandas library

Quickstart Hyenas is a curde Hy handle written on top of Pandas API to allow for more elegant access to data-scientist's powerhouse that is Pandas. In

Peter Výboch 4 Sep 05, 2022
Import, connect and transform data into Excel

xlwings_query Import, connect and transform data into Excel. Description The concept is to apply data transformations to a main query object. When the

George Karakostas 1 Jan 19, 2022
Improving your data science workflows with

Make Better Defaults Author: Kjell Wooding [email protected] This is the git re

Kjell Wooding 18 Dec 23, 2022
Universal data analysis tools for atmospheric sciences

U_analysis Universal data analysis tools for atmospheric sciences Script written in python 3. This file defines multiple functions that can be used fo

Luis Ackermann 1 Oct 10, 2021
:truck: Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

To launch a live notebook server to test optimus using binder or Colab, click on one of the following badges: Optimus is the missing framework to prof

Iron 1.3k Dec 30, 2022
Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions.

About Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions. The tool provides rich data and a summary g

9 Nov 16, 2022
ASTR 302: Python for Astronomy (Winter '22)

ASTR 302, Winter 2022, University of Washington: Python for Astronomy Mario Jurić Location When: 2:30-3:50, Monday & Wednesday, Winter quarter 2022 Wh

UW ASTR 302: Python for Astronomy 4 Jan 12, 2022
Statistical & Probabilistic Analysis of Store Sales, University Survey, & Manufacturing data

Statistical_Modelling Statistical & Probabilistic Analysis of Store Sales, University Survey, & Manufacturing data Statistical Methods for Decision Ma

Avnika Mehta 1 Jan 27, 2022
Spaghetti: an open-source Python library for the analysis of network-based spatial data

pysal/spaghetti SPAtial GrapHs: nETworks, Topology, & Inference Spaghetti is an open-source Python library for the analysis of network-based spatial d

Python Spatial Analysis Library 203 Jan 03, 2023
Larch: Applications and Python Library for Data Analysis of X-ray Absorption Spectroscopy (XAS, XANES, XAFS, EXAFS), X-ray Fluorescence (XRF) Spectroscopy and Imaging

Larch: Data Analysis Tools for X-ray Spectroscopy and More Documentation: http://xraypy.github.io/xraylarch Code: http://github.com/xraypy/xraylarch L

xraypy 95 Dec 13, 2022