Catalogue data - Python scripts to prepare catalogue data

Last update: Mar 03, 2022


Overview

catalogue_data

Scripts to prepare catalogue data.

Setup

Clone this repo.

Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation

sudo apt-get install git-lfs
git lfs install

Install system dependencies (unrar, from the non-free repository):

sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar

Create a virtual environment, activate it, and install the Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a User Access Token (with write access) on the Hugging Face Hub at https://huggingface.co/settings/token, then set the following environment variables in the .env file at the root directory:

HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
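
Before running the scripts, it can help to check that the .env file defines all four variables. The following is a minimal sketch using only the standard library. The helper names `parse_env_file` and `missing_keys` are hypothetical and not part of this repo, and the parser assumes plain KEY=VALUE lines; the scripts themselves may load the file differently.

```python
from pathlib import Path

# The four variable names listed in the Setup section.
REQUIRED_KEYS = ("HF_USERNAME", "HF_USER_ACCESS_TOKEN", "GIT_USER", "GIT_EMAIL")


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments.

    Hypothetical helper: assumes no multi-line values or variable expansion.
    """
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in listed order."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        missing = missing_keys(parse_env_file(env_path))
        print("Missing or empty:", missing if missing else "none")
    else:
        print("No .env file found in the current directory")
```

Run it from the repository root; any name it reports still needs a value in .env.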

Create metadata

To create the dataset metadata (in the file dataset_infos.json), run:

python create_metadata.py --repo <repo_id>

where you should replace <repo_id> with the ID of the dataset repository, e.g. bigscience-catalogue-lm-data/lm_ca_viquiquad

Aggregate datasets

To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:

python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>

where you should replace:

<path_to_file_with_dataset_ratios>: path to a JSON file containing a dict that maps dataset names (keys) to their ratios (values), each between 0 and 1.
<dir_path_to_save_aggregated_dataset>: path to the directory where the aggregated dataset is saved.
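
As an illustration of the ratios file described above, here is a minimal sketch that validates a dict of ratios and serializes it to JSON. The helper name `validate_ratios` and the second dataset name are hypothetical. It also assumes that the keys are Hub repository IDs and that the bounds 0 and 1 are inclusive; the README does not confirm either.

```python
import json


def validate_ratios(ratios):
    """Check that `ratios` maps dataset names to numbers in [0, 1].

    Hypothetical helper; inclusive bounds are an assumption.
    """
    if not isinstance(ratios, dict):
        raise ValueError("expected a dict mapping dataset names to ratios")
    for name, ratio in ratios.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(ratio, bool) or not isinstance(ratio, (int, float)):
            raise ValueError(f"ratio for {name!r} is not a number: {ratio!r}")
        if not 0 <= ratio <= 1:
            raise ValueError(f"ratio for {name!r} is outside [0, 1]: {ratio!r}")
    return ratios


if __name__ == "__main__":
    example = {
        # Dataset name taken from the "Create metadata" example.
        "bigscience-catalogue-lm-data/lm_ca_viquiquad": 1.0,
        # Hypothetical dataset name, for illustration only.
        "hypothetical-org/hypothetical_dataset": 0.25,
    }
    # Print the JSON; save this output to a file and pass its path
    # as --dataset_ratios_path.
    print(json.dumps(validate_ratios(example), indent=2))
```

Validating the file up front surfaces a mistyped ratio before the aggregation run starts.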


Owner

BigScience Workshop
