Installation:

pip install lm_dataloader

Design Philosophy

A library to unify language-model (LM) data loading at large scale.
Simple interface: any tokenizer can be integrated.
Minimal changes needed to go from small to large scale (many multi-GPU nodes).
Follows the 'mmap' data format used by fairseq and Megatron, with the following improvements:
- Easily combine multiple datasets
- Easily split a dataset into train / val / test splits
- Easily build a weighted dataset out of a list of existing ones along with weights.
- Unified into a single 'file' (actually a directory containing a .bin and an .idx file)
- Index files that are built on the fly are hidden files, leaving less clutter in the directory.
- More straightforward interface, better documentation.
- Inspectable with a command line tool
- Can load from urls
- Can load from S3 buckets
- Can load from GCS buckets
- Can tokenize on the fly instead of preprocessing
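
To make the splitting and weighting items above concrete, here is a minimal, library-independent sketch of the two ideas. It is not lm_dataloader's implementation or API: the function names `split_indices` and `weighted_mix` are hypothetical, and plain Python lists stand in for datasets.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(n: int, fractions: Sequence[float]) -> List[range]:
    """Split range(n) into contiguous chunks sized by `fractions` (must sum to 1)."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    bounds = [0]
    cumulative = 0.0
    for frac in fractions[:-1]:
        cumulative += frac
        bounds.append(round(n * cumulative))
    bounds.append(n)  # the last split absorbs any rounding remainder
    return [range(bounds[i], bounds[i + 1]) for i in range(len(fractions))]


def weighted_mix(
    datasets: Sequence[Sequence],
    weights: Sequence[float],
    num_samples: int,
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Return (dataset_index, item_index) pairs, picking each dataset
    with probability proportional to its weight and cycling through
    the items of a dataset once it is exhausted."""
    rng = random.Random(seed)
    cursors = [0] * len(datasets)
    picks = []
    for _ in range(num_samples):
        d = rng.choices(range(len(datasets)), weights=weights, k=1)[0]
        picks.append((d, cursors[d] % len(datasets[d])))
        cursors[d] += 1
    return picks


if __name__ == "__main__":
    train, val, test = split_indices(10, [0.8, 0.1, 0.1])
    print(len(train), len(val), len(test))
    print(weighted_mix([["a", "b", "c"], ["x", "y"]], [0.7, 0.3], 5))
```

Precomputing index pairs like this keeps the mixture reproducible for a given seed, which matters when several workers must agree on the sample order.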

Misc. TODO:
- [ ] Option to set mpu globally (for distributed dataloading)

Example usage
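
The examples in this section read a .jsonl file in which every line is a JSON object holding the text under a 'text' key. If you have no such file at hand, a small one can be written with the standard library alone. The helper name `write_jsonl` and the sample sentences are illustrative only and are not part of lm_dataloader.

```python
import json
from pathlib import Path
from typing import Iterable


def write_jsonl(path: str, texts: Iterable[str]) -> Path:
    """Write one JSON object per line, each with a single 'text' key."""
    out = Path(path)
    with out.open("w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")
    return out


if __name__ == "__main__":
    # Hypothetical sample documents; replace with your own corpus.
    write_jsonl(
        "test.jsonl",
        ["The quick brown fox.", "Jumps over the lazy dog."],
    )
```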

To tokenize a dataset contained in a .jsonl file (where the text to be tokenized can be accessed under the 'text' key):

import lm_dataloader as lmdl
from transformers import GPT2TokenizerFast 

jsonl_path = "test.jsonl"
output = "my_dataset.lmd"
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

lmdl.encode(
    jsonl_path,
    output_prefix=output,
    tokenize_fn=tokenizer.encode,
    tokenizer_vocab_size=len(tokenizer),
    eod_token=tokenizer.eos_token_id,
)

This will create a dataset at "my_dataset.lmd" which can be loaded as an indexed torch dataset like so:

from lm_dataloader import LMDataset
from transformers import GPT2TokenizerFast 

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
seq_length = tokenizer.model_max_length # or whatever the sequence length of your model is

dataset = LMDataset("my_dataset.lmd", seq_length=seq_length)

# peek at 0th index
print(dataset[0])
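
To give a feel for why `seq_length` and `eod_token` appear in the calls above, the sketch below packs tokenized documents into fixed-length training sequences. This is a conceptual illustration in plain Python, assuming the common approach of concatenating documents with an end-of-document token between them; it is not taken from lm_dataloader, and LMDataset's actual behaviour may differ. The name `pack_sequences` is hypothetical.

```python
from typing import Iterable, List


def pack_sequences(
    docs: Iterable[List[int]], seq_length: int, eod_token: int
) -> List[List[int]]:
    """Concatenate token lists, appending eod_token after each document,
    then cut the stream into chunks of exactly seq_length tokens.
    A trailing partial chunk is dropped."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eod_token)
    n_full = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]


if __name__ == "__main__":
    # Three toy "documents" of token ids, with 0 standing in for the EOD token.
    print(pack_sequences([[1, 2, 3], [4, 5], [6]], seq_length=4, eod_token=0))
    # prints [[1, 2, 3, 0], [4, 5, 0, 6]]
```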

Command line utilities

Command line utilities are also provided to inspect and merge datasets, e.g.:

lm-dataloader inspect my_dataset.lmd

This launches an interactive terminal for inspecting the data in my_dataset.lmd.

And:

lm-dataloader merge my_dataset.lmd,my_dataset_2.lmd new_dataset.lmd

This merges the datasets at "my_dataset.lmd" and "my_dataset_2.lmd" into a new dataset at "new_dataset.lmd".
