Large dataset storage format for PyTorch

Last update: Oct 22, 2022

Overview

H5Record

Large dataset (> 100G, <= 1T) storage format for PyTorch (work in progress)

Supports Python 3

pip install h5record

Why?

Writing large datasets is still a wild west in PyTorch. Approaches seen in the wild include:
- a large directory with lots of small files: IO is slow when complex files are fetched and deserialized frequently
- a database: behaviour depends on the database engine used, and multi-process reads are usually not supported
- with both of the above, storage size scales non-linearly with the amount of data
TFRecord solves these problems well, with multi-process fetching, (de)compression, and fast serialization (protobuf).
However, the TFRecord port does not support dataset size evaluation (used frequently by DataLoader) and offers no index-level access (important for data evaluation or verification).
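
To make those two missing features concrete: a map-style dataset exposes `__len__` (size evaluation) and `__getitem__` (index-level access), while a record stream offers neither. The following is a minimal stdlib-only sketch of that contrast. `record_stream` and `IndexedRecords` are hypothetical names used for illustration; they are not part of H5Record or TFRecord.

```python
from typing import Dict, Iterator, List

Record = Dict[str, object]

ROWS: List[Record] = [
    {"sentence1": "Sent 1.", "sentence2": "Sent 2", "label": 0.1},
    {"sentence1": "Sent 3", "sentence2": "Sent 4", "label": 0.2},
]


def record_stream() -> Iterator[Record]:
    """Stream-style access: records can only be consumed in order."""
    for row in ROWS:
        yield row


class IndexedRecords:
    """Map-style access: exposes a length and random access by index."""

    def __init__(self, rows: List[Record]) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, idx: int) -> Record:
        return self._rows[idx]


stream = record_stream()
try:
    len(stream)  # a generator has no length, so this raises TypeError
    stream_has_len = True
except TypeError:
    stream_has_len = False

indexed = IndexedRecords(ROWS)
print(stream_has_len, len(indexed), indexed[1]["label"])
```

With a stream, checking a single record means reading everything before it; with index-level access, any record can be inspected directly.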

H5Record aims to tackle TFRecord's problems by compressing the dataset into an HDF5 file, with an easy-to-use interface built on predefined data types (String, Image, Sequences, Integer).

Some advantages of using H5Record:

- Supports multi-process reads
- Relatively simple to use, with low technical debt
- Supports compression and decompression on the fly
- Quick to load into memory if required

Simple usage

pip install h5record

Sentence Similarity

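from h5record import H5Dataset, Float, String

# One entry per field stored for every record
schema = (
    String(name='sentence1'),
    String(name='sentence2'),
    Float(name='label')
)
data = [
    ['Sent 1.', 'Sent 2', 0.1],
    ['Sent 3', 'Sent 4', 0.2],
]

def pair_iter():
    # Yield one dict per record, keyed by the schema field names
    for row in data:
        yield {
            'sentence1': row[0],
            'sentence2': row[1],
            'label': row[2]
        }

# Build the dataset backed by ./question_pair.h5 from the record iterator
dataset = H5Dataset(schema, './question_pair.h5', pair_iter())
# Both size evaluation (len) and index-level access are available
for idx in range(len(dataset)):
    print(dataset[idx])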
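
In practice the records usually come from a file rather than an in-memory list. The sketch below shows one way to write such an iterator for a tab-separated file with three columns, using only the standard library. `pair_iter_from_tsv` is a hypothetical helper that is not part of h5record, and the three-column layout is an assumption chosen to match the schema above.

```python
import csv
import io
from typing import Dict, Iterator, TextIO, Union


def pair_iter_from_tsv(handle: TextIO) -> Iterator[Dict[str, Union[str, float]]]:
    """Yield one record dict per well-formed line: sentence1, sentence2, label."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed lines instead of failing mid-write
        sentence1, sentence2, label = row
        yield {
            "sentence1": sentence1,
            "sentence2": sentence2,
            "label": float(label),
        }


# An in-memory file stands in for a real .tsv on disk
sample = io.StringIO("Sent 1.\tSent 2\t0.1\nSent 3\tSent 4\t0.2\n")
records = list(pair_iter_from_tsv(sample))
print(records)
```

Because it yields dicts keyed by the schema field names, a generator like this should be usable wherever `pair_iter()` appears in the example above, although that hand-off is assumed here and was not run against the library.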

Note

Because development is still in progress, use this package with care on storage formatted as FAT or FAT32 (FAT32, for example, cannot store a single file larger than 4 GiB).

Comparison between different compression algorithms

No chunking was used.

Compression type	File size	Read speed (rows/second)
no compression	2.0G	2084.55
lzf	1.7G	1496.14
gzip	1.1G	843.78

Benchmarked on an i7-9700 with a 1TB NVMe SSD.
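
To read the trade-off off the table more easily, the short sketch below expresses each row relative to the uncompressed baseline. The numbers are copied from the table; `relative_to_baseline` is a hypothetical helper written for this illustration.

```python
from typing import Dict, Tuple

# (file size in G as listed, read speed in rows/second), copied from the table
RESULTS: Dict[str, Tuple[float, float]] = {
    "no compression": (2.0, 2084.55),
    "lzf": (1.7, 1496.14),
    "gzip": (1.1, 843.78),
}


def relative_to_baseline(
    results: Dict[str, Tuple[float, float]],
    baseline: str = "no compression",
) -> Dict[str, Tuple[float, float]]:
    """Return (size ratio, speed ratio) of every row against the baseline row."""
    base_size, base_speed = results[baseline]
    return {
        name: (size / base_size, speed / base_speed)
        for name, (size, speed) in results.items()
    }


ratios = relative_to_baseline(RESULTS)
for name, (size_ratio, speed_ratio) in ratios.items():
    print(f"{name}: {size_ratio:.0%} of the size, {speed_ratio:.0%} of the read speed")
```

On these figures, lzf keeps roughly 72% of the uncompressed read speed at 85% of the file size, while gzip drops to roughly 40% of the read speed at 55% of the file size.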

If you are interested in learning more, feel free to check out the note as well!

Comments

Example about Image dataset

Thanks for your work. Do you have an end to end example about image dataset which includes creating h5records file similar to tfrecord files and then using it in dataloader mechanism just like tf dataset api loader mechanism?
documentation question

opened by meet-minimalist 1

Releases(1.0.4)

1.0.4(Jun 8, 2021)

Minor bug fix
1.0.3(Jun 6, 2021)
Support for image sequence, float16 sequence, float sequence and float16 datatype

Fix bugs

1.0.1(Jun 5, 2021)



Owner

theblackcat102