Exploring the Top ML and DL GitHub Repositories

Overview

Exploring the Top ML and DL GitHub Repositories

This repository contains my work related to my project where I collected data on the most popular machine learning and deep learning GitHub repositories in order to further visualize and analyze it.

I've written a corresponding article about this project, which you can find on Towards Data Science. The article was selected as an "Editors Pick", and was also selected to be in their "Hands on Tutorials" section of their publication.

At a high level, my analysis is as follows:

  1. I collected data on the top machine learning and deep learning repositories and their respective owners from GitHub.
  2. I cleaned and prepared the data.
  3. I visualized what I thought were interesting patterns, trends, and findings within the data, and discuss each visualization in detail within the TDS article above.

Tools used

Python NumPy pandas tqdm PyGitHub GeoPy Altair tqdm wordcloud docopt black

Replicating the Analysis

I've designed the analysis in this repository so that anyone is able to recreate the data collection, cleaning, and visualization steps in a fully automated manner. To do this, open up a terminal and follow the steps below:

Step 1: Clone this repository to your computer

# clone the repo
git clone https://github.com/nicovandenhooff/top-repo-analysis.git

# change working directory to the repos root directory
cd top-repo-analysis

Step 2: Create and activate the required virtual environment

# create the environment
conda env create -f environment.yaml

# activate the environment
conda activate top-repo-analysis

Step 3: Obtain a GitHub personal access token ("PAT") and add it to the credentials file

Please see how to obtain a PAT here.

Once you have it perform the following:

# open the credentials file
open src/credentials.json

This will open the credentials json file which contains the following:

" }">
{
"github_token": "
   
    "
   
}

Change to your PAT.

Step 4: Run the following command to delete the current data and visualizations in the repository

make clean

Step 5: Run the following command to recreate the analysis

make all

Please note that if you are recreating the analysis:

  • The last step will take several hours to run (approximately 6-8 hours) as the data collection process from GitHub has to sleep to respect the GitHub API rate limit. The total number of API requests for the data collection will approximately be between 20,000 to 30,000.
  • When the data cleaning script data_cleaning.py runs, there make be some errors may be printed to the screen by GeoPy if the Noinatim geolocation service is unable to find a valid location for a GitHub user. This will not cause the script to terminate, and is just ugly in the terminal. Unfortunately you cannot suppress these error messages, so just ignore them if they occur.
  • Getting the location data with GeoPy in the data cleaning script also takes about 30 minutes as the Nominatim geolocation service limits 1 API request per second.
  • I ran this analysis on December 30, 2021 and as such collected the data from GitHub on this date. If you run this analysis in the future, the data you collect will inherently be slightly different if the machine learning and deep learning repositories with the highest number of stars has changed since the date when I ran the analysis. This will slightly change how the resulting visualizations look.

Using the Scraper to Collect New Data

You can also use the scraping script in isolation to collect new data from GitHub if you desire.

If you'd like to do this, all you'll need to do is open up a terminal, follow steps 1 to 3 above, and then perform the following:

Step a) Run the scraping script with your desired options as follows

python src/github_scraper.py --queries=<queries> --path=<path>
  • Replace with your desired queries. Note that if you desire multiple search queries, enclose them in "" separate them by a single comma with NO SPACE after the comma. For example "Machine Learning,Deep Learning"
  • Replace with the output path that you want the scraped data to be saved at.

Please see the documentation in the header of the scraping script for additional options that are available.

Step b) Run the data cleaning script to clean your newly scraped data

python src/data_cleaning.py --input_path=<path> --output_path=<output_path>
  • Replace with the path that you saved the scraped data at.
  • Replace with the output path that you want the cleaned data to be saved at.
  • As metioned in the last section, some errors may be printed to the terminal by GeoPy during the data cleaning process, but feel free to ignore these as they do not affect the execution of the script.

Dependencies

Please see the environment file for a full list of dependencies.

License

The source code for the site is licensed under the MIT license.

You might also like...
Spectacular AI SDK fuses data from cameras and IMU sensors and outputs an accurate 6-degree-of-freedom pose of a device.
Spectacular AI SDK fuses data from cameras and IMU sensors and outputs an accurate 6-degree-of-freedom pose of a device.

Spectacular AI SDK examples Spectacular AI SDK fuses data from cameras and IMU sensors (accelerometer and gyroscope) and outputs an accurate 6-degree-

Working Time Statistics of working hours and working conditions by industry and company

Working Time Statistics of working hours and working conditions by industry and company

A python package which can be pip installed to perform statistics and visualize binomial and gaussian distributions of the dataset

GBiStat package A python package to assist programmers with data analysis. This package could be used to plot : Binomial Distribution of the dataset p

ToeholdTools is a Python package and desktop app designed to facilitate analyzing and designing toehold switches, created as part of the 2021 iGEM competition.

ToeholdTools Category Status Repository Package Build Quality A library for the analysis of toehold switch riboregulators created by the iGEM team Cit

A collection of robust and fast processing tools for parsing and analyzing web archive data.

ChatNoir Resiliparse A collection of robust and fast processing tools for parsing and analyzing web archive data. Resiliparse is part of the ChatNoir

Python beta calculator that retrieves stock and market data and provides linear regressions.

Stock and Index Beta Calculator Python script that calculates the beta (β) of a stock against the chosen index. The script retrieves the data and resa

Larch: Applications and Python Library for Data Analysis of X-ray Absorption Spectroscopy (XAS, XANES, XAFS, EXAFS), X-ray Fluorescence (XRF) Spectroscopy and Imaging

Larch: Data Analysis Tools for X-ray Spectroscopy and More Documentation: http://xraypy.github.io/xraylarch Code: http://github.com/xraypy/xraylarch L

A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.
A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.

Realtime Financial Market Data Visualization and Analysis Introduction This repo shows my project about real-time stock data pipeline. All the code is

Python script to automate the plotting and analysis of percentage depth dose and dose profile simulations in TOPAS.

topas-create-graphs A script to automatically plot the results of a topas simulation Works for percentage depth dose (pdd) and dose profiles (dp). Dep

Releases(v1.0.0)
Owner
Nico Van den Hooff
UBC Master of Data Science 2022
Nico Van den Hooff
Data analysis and visualisation projects from a range of individual projects and applications

Python-Data-Analysis-and-Visualisation-Projects Data analysis and visualisation projects from a range of individual projects and applications. Python

Tom Ritman-Meer 1 Jan 25, 2022
PyPSA: Python for Power System Analysis

1 Python for Power System Analysis Contents 1 Python for Power System Analysis 1.1 About 1.2 Documentation 1.3 Functionality 1.4 Example scripts as Ju

758 Dec 30, 2022
CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images.

cleanX CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological

Candace Makeda Moore, MD 20 Jan 05, 2023
Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data

WeRateDogs Twitter Data from 2015 to 2017 Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data Table of Contents Introduction Proj

Keenan Cooper 1 Jan 12, 2022
Full ELT process on GCP environment.

Rent Houses Germany - GCP Pipeline Project: The goal of the project is to extract data about house rentals in Germany, store, process and analyze it u

Felipe Demenech Vasconcelos 2 Jan 20, 2022
InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family.

CRISPRanalysis InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family. In this work, we present a workflow

2 Jan 31, 2022
Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Data Scientist Learning Plan Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Trung-Duy Nguyen 27 Nov 01, 2022
This is an analysis and prediction project for house prices in King County, USA based on certain features of the house

This is a project for analysis and estimation of House Prices in King County USA The .csv file contains the data of the house and the .ipynb file con

Amit Prakash 1 Jan 21, 2022
Modular analysis tools for neurophysiology data

Neuroanalysis Modular and interactive tools for analysis of neurophysiology data, with emphasis on patch-clamp electrophysiology. Functions for runnin

Allen Institute 5 Dec 22, 2021
SparseLasso: Sparse Solutions for the Lasso

SparseLasso: Sparse Solutions for the Lasso Introduction SparseLasso provides a Scikit-Learn based estimation of the Lasso with cross-validation tunin

Gabriel Okasa 1 Nov 08, 2021
This tool parses log data and allows to define analysis pipelines for anomaly detection.

logdata-anomaly-miner This tool parses log data and allows to define analysis pipelines for anomaly detection. It was designed to run the analysis wit

AECID 32 Nov 27, 2022
HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets

HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets that can be described as multidimensional arrays o

HyperSpy 411 Dec 27, 2022
Bigdata Simulation Library Of Dream By Sandman Books

BIGDATA SIMULATION LIBRARY OF DREAM BY SANDMAN BOOKS ================= Solution Architecture Description In the realm of Dreaming, its ruler SANDMAN,

Maycon Cypriano 3 Jun 30, 2022
Generates a simple report about the current Covid-19 cases and deaths in Malaysia

Generates a simple report about the current Covid-19 cases and deaths in Malaysia. Results are delay one day, data provided by the Ministry of Health Malaysia Covid-19 public data.

Yap Khai Chuen 7 Dec 15, 2022
LynxKite: a complete graph data science platform for very large graphs and other datasets.

LynxKite is a complete graph data science platform for very large graphs and other datasets. It seamlessly combines the benefits of a friendly graphical interface and a powerful Python API.

124 Dec 14, 2022
Synthetic Data Generation for tabular, relational and time series data.

An Open Source Project from the Data to AI Lab, at MIT Website: https://sdv.dev Documentation: https://sdv.dev/SDV User Guides Developer Guides Github

The Synthetic Data Vault Project 1.2k Jan 07, 2023
The official pytorch implementation of ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias Introduction | Updates | Usage | Results&Pretrained Models | Statement | Intr

104 Nov 27, 2022
An interactive grid for sorting, filtering, and editing DataFrames in Jupyter notebooks

qgrid Qgrid is a Jupyter notebook widget which uses SlickGrid to render pandas DataFrames within a Jupyter notebook. This allows you to explore your D

Quantopian, Inc. 2.9k Jan 08, 2023
Advanced Pandas Vault — Utilities, Functions and Snippets (by @firmai).

PandasVault ⁠— Advanced Pandas Functions and Code Snippets The only Pandas utility package you would ever need. It has no exotic external dependencies

Derek Snow 374 Jan 07, 2023
ETL pipeline on movie data using Python and postgreSQL

Movies-ETL ETL pipeline on movie data using Python and postgreSQL Overview This project consisted on a automated Extraction, Transformation and Load p

Juan Nicolas Serrano 0 Jul 07, 2021