Catalogue data - Python scripts to prepare catalogue data

Last update: Mar 03, 2022

Overview

catalogue_data

Scripts to prepare catalogue data.

Setup

Clone this repo.

Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation

sudo apt-get install git-lfs
git lfs install

Install system dependencies:

sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar

Create a virtual environment, activate it, and install the Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a User Access Token (with write access) at Hugging Face Hub: https://huggingface.co/settings/token and set the following environment variables in the .env file at the root directory:

HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
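
Before running the scripts, it can help to confirm that the .env file defines all four variables. The following is a minimal sketch using only the standard library. The helpers `parse_env` and `missing_variables` are hypothetical and not part of this repo, and the parser only handles simple KEY=VALUE lines; how the scripts themselves load .env is not shown here.

```python
from pathlib import Path

# Variable names taken from the Setup section above.
REQUIRED = ("HF_USERNAME", "HF_USER_ACCESS_TOKEN", "GIT_USER", "GIT_EMAIL")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_variables(values, required=REQUIRED):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    missing = missing_variables(parse_env(text))
    if missing:
        print("Missing or empty in .env:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Run from the root directory, this prints the names of any variables that still need a value.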

Create metadata

To create dataset metadata (in file dataset_infos.json) run:

python create_metadata.py --repo <repo_id>

where you should replace <repo_id> with the ID of the dataset repository, e.g. bigscience-catalogue-lm-data/lm_ca_viquiquad
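
To create metadata for several repositories in one go, the command above can be wrapped in a small loop. This is only a sketch: `build_command` and `create_metadata_for` are hypothetical helpers, not part of this repo, and the only flag assumed is the `--repo` option shown above. By default it only prints the commands (dry run).

```python
import subprocess
import sys


def build_command(repo_id):
    """Build the create_metadata.py invocation for one repo ID."""
    if not repo_id or not repo_id.strip():
        raise ValueError("repo_id must be a non-empty string")
    return [sys.executable, "create_metadata.py", "--repo", repo_id.strip()]


def create_metadata_for(repo_ids, dry_run=True):
    """Run (or, with dry_run=True, only print) the command for each repo ID."""
    commands = [build_command(repo_id) for repo_id in repo_ids]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError at the first failing repo.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    create_metadata_for(["bigscience-catalogue-lm-data/lm_ca_viquiquad"])
```

Pass `dry_run=False` from the root directory, with the virtual environment activated, to actually run the script for each repo.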

Aggregate datasets

To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:

python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>

where you should replace:

<path_to_file_with_dataset_ratios>: path to a JSON file containing a dict that maps dataset names (keys) to their ratios (values), each between 0 and 1.
<dir_path_to_save_aggregated_dataset>: directory path to save the aggregated dataset
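
The dataset ratios file described above can be written and checked with a few lines of Python. This is a sketch under assumptions: `validate_ratios` and `write_ratios` are hypothetical helpers, the file name `dataset_ratios.json` is arbitrary, the second dataset name is a placeholder, and using repo IDs as dataset names is a guess based on the example in the previous section.

```python
import json
from pathlib import Path


def validate_ratios(ratios):
    """Check that ratios is a dict of dataset name -> number in [0, 1]."""
    if not isinstance(ratios, dict):
        raise ValueError("ratios must be a dict")
    for name, ratio in ratios.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(ratio, bool) or not isinstance(ratio, (int, float)):
            raise ValueError(f"ratio for {name!r} is not a number")
        if not 0 <= ratio <= 1:
            raise ValueError(f"ratio for {name!r} is outside [0, 1]: {ratio}")
    return ratios


def write_ratios(ratios, path):
    """Validate the ratios, then save them as a JSON file."""
    validate_ratios(ratios)
    Path(path).write_text(json.dumps(ratios, indent=2), encoding="utf-8")


if __name__ == "__main__":
    example = {
        "bigscience-catalogue-lm-data/lm_ca_viquiquad": 1.0,
        "some_org/some_dataset": 0.5,  # placeholder name, not a real dataset
    }
    write_ratios(example, "dataset_ratios.json")
```

The resulting file can then be passed to aggregate_datasets.py via --dataset_ratios_path.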

Owner

BigScience Workshop
