pipeline for migrating lichess data into postgresql

Overview

How Long Does It Take Ordinary People To "Get Good" At Chess?

TL;DR: According to 5.5 years of data from 2.3 million players and 450 million games, most beginners will improve their rating by 100 lichess elo points in 3-6 months. Most "experienced" chess players in the 1400-1800 rating range will take 3-4 years to improve their rating by 100 lichess elo points. There's no strong evidence that playing more games makes you improve quicker.

Table Of Contents

  1. Backstory
  2. ETL Process
  3. Data Analysis
  4. How To Do What I Did

Backstory

I've been a casual chess player for a few years now. Like most people who get into chess, one question that's been at the back of my mind is, like the title suggests, how long is it going to take me to actually get good at this game?

Fortunately for us, lichess.org, the largest open source online chess website, publishes all chess games played on their site freely available for the public to download (including elo rating data). This gives me just what I need to take a crack at this question.

But before we start getting into the data mining, we would first need to define what is "good at chess". Naturally, that's going to very from person to person. If your definition of "good" is never losing, that's never going to happen (unless you're picky about who you choose to play or you're a computer). If your definition of "good" is better than most everyday people you would find off the street, then you would probably get there after spending 30 minutes learning how the pieces move.

For what it's worth, the data says the 90th percentile rating is at 1927. On lichess.org/stat/rating/distribution/blitz, I believe the distribution is calculated from currently active players, raising the 90th percentile to 2075. For reference, GM Hikaru Nakamura (one of the world's top blitz players) was able to jump from total beginner (600 USCF ~1100 lichess) to the 90th percentile (1800 USCF, ~2000 lichess) in just 2 years (Jan 1995- Jan 1997). So best case scenario if you're a chess prodigy, you can get into the 90th percentile ballpark in 2 years of serious study.

Given the ambiguous nature of this question, I decided to focus on improvement rate rather than trying to answer something like how long it takes to go from total beginner to grandmaster.

There are a few reasons for this:

  1. People who become grandmasters probably don't represent your typical everyday chess player. I'm more interested in how long it's going to take a "normal" person like me to improve.
  2. While tournament games and rating histories of grandmasters (and other tournament players) are publically available, over the board chess is likely different than online chess, even if the time controls are the same. I'm more interested in getting better at online chess, which is where I usually play chess.
  3. I think the rate of improvement is what people (myself included) are really interested in. I felt like the core of my question is actually "how long is it going to take until I see noticeable improvement in my chess skills?" It's also a helpful way to benchmark if your struggling at chess improvement is "normal" or if there's something wrong with your training plan.
  4. It's unlikely I'm going to find many examples of people who went from beginner to grandmaster in the data. It takes at least 8-10 years to jump that gap in addition to extreme talent and near full-time study (back to point 1). Lichess data only goes back to 2013, 8 years of data as of this writing. Because it takes many years to develop strong players, there's a low chance such players have documented their entire progression on lichess. They would have had to start in the early days of lichess, back when it wasn't very popular.
  5. Even if I could get data spanning the many years required to build strong players, the amount of data would be massive even after stripping down the data to the bare minimum required, given the exponential growth of chess games being played on lichess. There are ways I could store and analyze the massive amount of data, but that's going to cost me an arm and a leg. Phrasing the question this way is going to allow me to work with the limit data I'm able to get my hands on.

So that brings me to the purpose of this project: to figure out the typical improvement rates for typical online chess players on lichess.org.

ETL Process

Here's an outline of what I did to do this analysis:

  1. Extract the data from database.lichess.org
  2. Transform the data to the format I need it in
  3. Load the processed data in a relational database
  4. Analyze the data to answer my questions

Lichess has all their games available for download by month (e.g. January 2013). The problem is, the data in its raw format can't be queried easily, nor can it directly answer the question I'm interested in.

Here's what the raw data looks like:

Alt test

I believe it's an export from MongoDB, a NoSQL document database, which does not have a fixed schema. The problem with schemaless databases like MongoDB is they aren't designed for complex analytical queries (but great for application flexibility and scaling for big data).

I decided to migrate the data into postgreSQL, a structured, relational database. That's going to allow me to run complex SQL queries to get the elo rating per player over time, which is really what I need to answer my question. It also gives me the flexibility to play around with the data for other questions that might pop up in the future.

To help manage the ETL process, I built a datapipeline using Airflow to schedule the downloading, processing, and deletion of each data file on lichess. The nice thing about Airflow is it comes with a web UI that helps you visually see where you're at in your data pipeline. It also provides logging and task status to help make troubleshooting/debugging a bit easier.

Here's a couple of screenshots of what the pipeline looks like:

Alt test Alt test

After a lot of troubleshooting and a few weeks of waiting for the data to download on my dinky 10 year old laptop, I got about 100 Gb of data loaded into postgres, spanning 5.5 years, 450 million games, and 2.3 million players.

Once that was done, I did the analysis with a bit of SQL and pandas then made the plots with seaborn/matplotlib.

Data Analysis

How Long Does It Take To Improve At Chess?

Now to the fun stuff.

What does the data say about chess improvement rate?

After extracting the data for elo per player over time (including games as white and black), filtering for one time control, calculating the monthly average, aligning everyone's starting dates, assigning the ratings into rating bins, and averaging the ratings by the rating bins, I get the plot below:

Alt test

I analyzed the data from the perspective of a player's monthly average which should be a better estimate of a player's playing strength than looking at the game-by-game elo fluctuation. I'm not particularly interested in cases of players who managed to jump 100 points in one afternoon blitz binge session. I believe those instances can be attributed to random chance rather than those players suddenly having a "eureka" moment that boosted their playing strength by 100 elo points overnight.

From the graph, it looks like improvement rate depends a lot on what your current elo is. As one might expect, lower elo ratings have the greatest opportunity to improve quickly, while higher elo ratings will take much longer to see improvement. Most players in the 800-1000 rating range (about 6% of players) will see their elo jump up 100 points in just a few months of activity. Most players in the 1600-2000 range (27% of players) will take 4 years or more to move up just 100 elo points.

I'm not sure what the weird bump and dip is that happens around the 3 year mark. That may be an artifact of the data only containing 5.5 years of data, with datapoints heavily clustered around lower month counts (see player churn).

4 years just for 100 elo points? Seems a bit longer than I expected. But it is plausible.

There are players in the data with long histories of activity who have not improved their rating despite playing many games over the span of many years. See the player who's played the most games of all time on lichess:

Alt test

But what if the mean ratings are being dragged down by the mass of "casual" players who aren't interested in improving and just play for fun? (Not that there's anything wrong with that). Is there a way to look at just the players who are serious about improving? I tried filtering the data to players who have gained at least 100 elo points since joining lichess:

Alt test

There's a strange jump in rating in the first month for players in this category that is less prevalent in the entire dataset. It's possible this could be due to players starting out on lichess as underrated and quickly catching up in the first month. I'm going to ignore the first month when calculating the improvement rate.

From the chart, players in the 800-1000 range on average improve their elo by 100 points in just 1-2 months. Players in the 1600-2000 range improve their elo by 100 points in about 3-4 years, which isn't much different from the average of all players. One explanation for this is that most players at this rating range are already considered "serious" players and most of them "have what it takes" to improve at chess. Setting a cutoff in the data for 100 elo points of improvement is not as strong a filter for players in this range.

So it looks like for most people, improvement happens over long periods of time of consistent study/activity on the scale of months for beginners and years for experienced players.

Does Playing More Games Make You A Better Chess Player?

But while I have the data here, let's take a moment to answer the question everyone's asking: does playing more games make you a better chess player?

Intuitively, the answer seems like it should be yes. Seems like experience should be a huge factor in someone's chess strength. It also seems like the go-to answer many of the top players recommend for improving at chess.

This is what the data says:

Elo Rating vs. Number of Games Played Alt test Elo Rating vs. Number of Games Played Per Month Alt test Net Elo Gain vs. Number of Games Played Alt test Net Elo Gain Per Month vs. Number of Games Played Per Month Alt test

It seems like no matter how the data is sliced up, there does not seem to be a clear 1:1 linear correlation between improvement rate and number of games played. There's maybe a slight upward trend with elo gain vs games per month, but it seems to be a weak trend.

However, there does seem to be a sweet spot in the elo gain rate where a large portion of the players who have gained the most elo per month seem to cluster around 100-300 games per month, which comes out to a handful of games per day. That may be evidence that playing at least a few games here and there on a consistent basis will give the best chances at improving, but that could also be due to the fact that there are just more data points for players playing at that rate.

In case you're curious about how people gained 100+ points per month, I checked the data and the majority of them are either cheaters, smurf accounts, or bots. For reference, the most improved player is (as of oct 2021) an 11 year old world chess champion from the Ukraine. His rate of improvement averaged out to 33 points per month over a period of 6 years.

Regardless, the data does not give any indications that bingeing chess games like it's a full time job is going to make you a stronger chess player any faster than playing a few games per day on your lunch breaks.

So what does make you a better chess player? I can only assume that it takes additional study beyond mindlessly playing chess games (like many other skills in life). Perhaps by practicing tactics, studying chess strategy, or studying your games. That seems to be what all the top players do. I would love to take a look at player's tactics data or studying habits in relation to their improvement rate, but that data is not readily available/easily obtainable, especially at the scale of this dataset. As a project manager would say, that's probably out of scope for this project.

Misc. Questions Explored Out Of Curiosity

I had a lot of other ideas I wanted to explore while I had this data stored in my postgres database. That's the nice thing about relational databases. You have a lot of flexibility in how you can slice up and analyze the data.

These questions didn't fit with the narrative I kicked this project off with, so I'm just going to dump them here for you to browse through in case you're interested.

How many games are played per time control?

Alt test

Seems like blitz is the most popular and almost no one plays correspondence. That surprised me a bit, I would have thought bullet would be the most popular just because more bullet games can be played than blitz games in a fixed amount of time.

When do most people play chess?

Alt test Alt test Alt test

Looks like 6pm UTC is when lichess players are the most active and 4am when they are least active. I guess most lichess players are from Europe because that lines up with when most people in European time zones would be getting off work and sleeping.

There's not much difference between the days of the week, but maybe Sunday has a slight edge over the other days.

Who has played the most games in the dataset?

The most active player in this dataset played 250,516 games (mostly bullet games), about 0.05% of all games in this database. There are only 10 players who have played over 100,000 games in this dataset (ordered from most to least):

  1. german11
  2. ribarisah
  3. decidement
  4. ASONINYA
  5. bernes
  6. Grga1
  7. leko29
  8. rezo2015
  9. jiacomo
  10. PILOTVE

What are the highest/lowest ratings in the dataset?

  • Lowest: 630 by JAckermaan
  • Highest (bot): 3233 by LeelaChessOfficial
  • Highest (non-bot): 3196 by penguingim1 and Kuuhaku_1
  • Highest (non-bot blitz): 3031 by penguingim1, Ultimate_SHSL_Gamer, BehaardTheBonobo, galbijjim, Kuuhaku_1, BeepBeepImAJeep, keithster

How many games are played per elo band?

Alt test

How many games does the median, average, maximum user play?

Alt test

Here's the number of games players have played in this dataset:

  • Median games played: 27
  • Average games played: 482
  • Maximum games played: 250,516

Which elo bands analyze their games the most?

Alt test

Most analyzed games are clustered at the top levels, probably by viewers or possibly by someone who decided to have a bunch of master games analyzed in bulk. 1800s seem to analyze the least. Perhaps at the higher levels, there is less need to rely on the computer and more reliance on player skill. Intuitively, that seems to make sense, at least from what I've seen strong chess players on youtube do. But then again, the difference is really small (only 1-2%) and probably is not significant.

What is the typical change between players' starting and ending elo ratings?

  • Most negative change (banned): -1045.96. This account (laurent17320) was closed for violating terms of service (probably for intentionally losing games). They went from 1950 down to 895 elo.
  • Most negative change (not banned): -1012.33. This account (Gyarados) went from 2145 elo down to 1133. I think this account used to be played by a strong player around the year 2016, then was handed off to a weaker player in 2017 who proceeded to drop the rating down to where it is now.
  • Median change: 0. Most players don't play more than ~30 or so games before quitting lichess permanently, much less so staying active longer than 1 month. Hence why the most common rating change is 0.
  • Mean change: 22. As expected, players should tend to get stronger over time, but the average is probably brought down by the majority of casual, non-serious chess players on lichess.
  • Most positive change: 1404. This is held by the account PeshkaCh, as of this writing (Oct 2021) an 11 year old world chess champion from ukraine. Browsing some of the other top accounts, some of them are bots, many of them have been closed.
  • Highest positive change/month: 897. I have explored accounts with the greatest elo change/month, however most of them are either bots with a very short activity timespan (ok_zabu) or accounts that have been closed/banned probably for cheating (as expected). I don't think these accounts are particularly interesting. For reference, the account with the most positive change had an average increase of 33.45 elo/month.

Alt test Alt test

How do most games finish?

  • Closed for fair play violation: 25,251 (0.01%)
  • Game abandoned: 770,328 (0.17%)
  • Lost due to time: 149,958,742 (33%)
  • Finished normally: 62,791,173 (67%)

How many games were played where black and/or white had a provisional rating (rating gain/loss of over 30 points)?

26,778,555, about 6% of all games played

Who played the most games for each time control?

  • german11: 211,781 bullet games
  • jlomb: 58,966 blitz games
  • tutubelezza: 22,674 rapid games #note: Lichess categorized rapid as either blitz/classical initially, so data isn't entirely accurate here
  • Snowden: 45,814 classical games
  • lapos: 6,870 correspondence games

How To Do What I Did

If you're interested in exploring some of this data yourself, I tried to make it easy to get everything setup on your machine. It will probably help if you're familiar with docker, python, SQL, and airflow.

First, you're going to need docker and docker compose installed on your computer. You can visit https://docs.docker.com/get-docker/ for instructions on how to do that. Once you have that installed, you can follow the steps below:

Clone this repository to your computer:

git clone 
   
     .

   

Give global write access to the src folder (for docker-compose)

chmod +777 src

Run docker-compose up at the home directory of the repository where the docker-compose.yml file is. This will setup several docker containers for the airflow webserver, airflow worker, airflow scheduler, postgres databases (one for chess game data, one for airflow), redis database (for airflow), and a bash cli with python installed to run any scripts (mainly for plotting/running SQL queries)

docker-compose up

Once the containers are up and running, you can get to the airflow webserver UI by going to localhost:8080 in your web browser. You can login with "username" and "password" (the default specified in the docker-compose.yml file).

You can click the switch to activate the DAG to start loading data into postgres. That's pretty much all you need to do to start loading data into your postgres database. You may want to modify the "airflow_dag_local.py" file if you want to download data from particular months. NOTE: There is a DAG that uses Kafka that I was playing with, but does not work without setting up kafka (complicated):

Alt test

You can use the cli container if you want to play with any of the python code (i.e. modify the code that transforms the data from lichess to postgres)

docker ps
docker container exec -it 
   
     bash

   

Alternatively, you could just install the required python packages by installing from requirements.txt and run the scripts without docker:

pip install -r requirements.txt

You can run SQL queries directly on the postgres database containing all the chess games you've downloaded (i.e. to peek at what the data looks like going into postgres):

docker ps
docker container exec -it 
   
     psql lichess_games username 

   
Owner
Joseph Wong
Boeing engineer who writes code and builds airplanes.
Joseph Wong
Data Competition: automated systems that can detect whether people are not wearing masks or are wearing masks incorrectly

Table of contents Introduction Dataset Model & Metrics How to Run Quickstart Install Training Evaluation Detection DATA COMPETITION The COVID-19 pande

Thanh Dat Vu 1 Feb 27, 2022
Shot notebooks resuming the main functions of GeoPandas

Shot notebooks resuming the main functions of GeoPandas, 2 notebooks written as Exercises to apply these functions.

1 Jan 12, 2022
In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

ETL Pipeline for AWS Project Description In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift. The data is loaded from S3 t

Mobeen Ahmed 1 Nov 01, 2021
An Integrated Experimental Platform for time series data anomaly detection.

Curve Sorry to tell contributors and users. We decided to archive the project temporarily due to the employee work plan of collaborators. There are no

Baidu 486 Dec 21, 2022
A tool to compare differences between dataframes and create a differences report in Excel

similarpanda A module to check for differences between pandas Dataframes, and generate a report in Excel format. This is helpful in a workplace settin

Andre Pretorius 9 Sep 15, 2022
TE-dependent analysis (tedana) is a Python library for denoising multi-echo functional magnetic resonance imaging (fMRI) data

tedana: TE Dependent ANAlysis TE-dependent analysis (tedana) is a Python library for denoising multi-echo functional magnetic resonance imaging (fMRI)

136 Dec 22, 2022
Flood modeling by 2D shallow water equation

hydraulicmodel Flood modeling by 2D shallow water equation. Refer to Hunter et al (2005), Bates et al. (2010). Diffusive wave approximation Local iner

6 Nov 30, 2022
Instant search for and access to many datasets in Pyspark.

SparkDataset Provides instant access to many datasets right from Pyspark (in Spark DataFrame structure). Drop a star if you like the project. 😃 Motiv

Souvik Pratiher 31 Dec 16, 2022
Implementation in Python of the reliability measures such as Omega.

reliabiliPy Summary Simple implementation in Python of the [reliability](https://en.wikipedia.org/wiki/Reliability_(statistics) measures for surveys:

Rafael Valero Fernández 2 Apr 27, 2022
ETL flow framework based on Yaml configs in Python

ETL framework based on Yaml configs in Python A light framework for creating data streams. Setting up streams through configuration in the Yaml file.

Павел Максимов 18 Jul 06, 2022
Provide a market analysis (R)

market-study Provide a market analysis (R) - FRENCH Produisez une étude de marché Prérequis Pour effectuer ce projet, vous devrez maîtriser la manipul

1 Feb 13, 2022
SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

SNV Pipeline SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

East Genomics 1 Nov 02, 2021
sportsdataverse python package

sportsdataverse-py See CHANGELOG.md for details. The goal of sportsdataverse-py is to provide the community with a python package for working with spo

Saiem Gilani 37 Dec 27, 2022
Statistical Analysis 📈 focused on statistical analysis and exploration used on various data sets for personal and professional projects.

Statistical Analysis 📈 This repository focuses on statistical analysis and the exploration used on various data sets for personal and professional pr

Andy Pham 1 Sep 03, 2022
Calculate multilateral price indices in Python (with Pandas and PySpark).

IndexNumCalc Calculate multilateral price indices using the GEKS-T (CCDI), Time Product Dummy (TPD), Time Dummy Hedonic (TDH), Geary-Khamis (GK) metho

Dr. Usman Kayani 3 Apr 27, 2022
Python Practicum - prepare for your Data Science interview or get a refresher.

Python-Practicum Python Practicum - prepare for your Data Science interview or get a refresher. Data Data visualization using data on births from the

Jovan Trajceski 1 Jul 27, 2021
PyClustering is a Python, C++ data mining library.

pyclustering is a Python, C++ data mining library (clustering algorithm, oscillatory networks, neural networks). The library provides Python and C++ implementations (C++ pyclustering library) of each

Andrei Novikov 1k Jan 05, 2023
PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

1 Feb 07, 2022
Fancy data functions that will make your life as a data scientist easier.

WhiteBox Utilities Toolkit: Tools to make your life easier Fancy data functions that will make your life as a data scientist easier. Installing To ins

WhiteBox 3 Oct 03, 2022
Analysis scripts for QG equations

qg-edgeofchaos Analysis scripts for QG equations FIle/Folder Structure eigensolvers.py - Spectral and finite-difference solvers for Rossby wave eigenf

Norman Cao 2 Sep 27, 2022