An unopinionated replacement for PyTorch's Dataset and ImageFolder that handles Tar archives

Last update: Dec 20, 2022

Simple Tar Dataset

An unopinionated replacement for PyTorch's Dataset and ImageFolder classes, for datasets stored as uncompressed Tar archives.

Just Tar it: No particular structure is enforced in the Tar archive. This means that you can just archive your files with no modification, and handle any data/meta-data with your dataset code.

Why? Storing a dataset as millions of small files makes access inefficient, and can create other difficulties in large-scale scenarios (e.g. running out of inodes, or inefficient operations in distributed filesystems, which are optimised for fewer large files). Tar is a simple, uncompressed archive format for which numerous utilities exist, and it allows fast random access into a single archive file.
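To illustrate why random access into an uncompressed Tar archive is cheap, here is a minimal sketch using only Python's standard tarfile module. It is not this library's implementation; the helper names build_index, read_member and make_demo_archive are hypothetical. The idea is to scan the archive once, record where each file's bytes start, and later seek straight to them.

```python
import io
import os
import tarfile
import tempfile


def build_index(tar_path):
    """Scan the archive once and map each file name to (data offset, size)."""
    index = {}
    # Mode "r:" only accepts uncompressed archives, where offsets are meaningful.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path, index, name):
    """Read one file's bytes by seeking directly to its data in the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def make_demo_archive(tar_path, files):
    """Write a dict of {name: bytes} into an uncompressed Tar archive (demo helper)."""
    with tarfile.open(tar_path, "w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
```

After the one-off indexing pass, each read costs a single seek and a single read on one file, regardless of how many files the archive holds.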

Example

The default TarDataset simply loads all PNG, JPG and JPEG images from a Tar file, and allows you to iterate over them.

Each image is returned as a Tensor. The example below prints the RGB values of the top-left pixel of each image.

from tardataset import TarDataset

dataset = TarDataset('example-data/colors.tar')

for (idx, image) in enumerate(dataset):
  print(f"Image #{idx}, color: {image[:,0,0]}")

Usage

For image classification datasets, where images are usually stored in one folder per class (e.g. ImageNet), TarImageFolder is a drop-in replacement for torchvision.datasets.ImageFolder.
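The one-folder-per-class convention can be sketched with the standard library alone. This is an illustration of the idea, not the code behind TarImageFolder; the function name folder_labels and the extension list are assumptions. It follows the ImageFolder convention of sorting class folder names and numbering them from zero.

```python
import posixpath
import tarfile

# Assumed set of image extensions, matching the ones named in this README.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def folder_labels(tar_path):
    """Return (class_to_idx, samples), where samples is a list of (name, label)."""
    found = []
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if not member.isfile():
                continue
            if not member.name.lower().endswith(IMAGE_EXTENSIONS):
                continue
            # The class is the name of the folder that directly contains the image.
            class_name = posixpath.basename(posixpath.dirname(member.name))
            found.append((member.name, class_name))
    # Sorted class names get consecutive integer labels, as in ImageFolder.
    classes = sorted({class_name for _, class_name in found})
    class_to_idx = {class_name: i for i, class_name in enumerate(classes)}
    samples = [(name, class_to_idx[class_name]) for name, class_name in found]
    return class_to_idx, samples
```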

For more complex scenarios -- say, you store some data in one or more JSON files, or you have folders with video frames in specific formats -- you can subclass TarDataset, and read the data in any format you like.
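As a rough sketch of the JSON scenario, the class below pairs each file in an archive with metadata read from a JSON file stored in the same archive. It deliberately uses only the standard library rather than subclassing TarDataset, since the methods a subclass would override are not described here. The class name JsonAnnotatedTar and the assumed JSON layout (a mapping from archive file names to metadata) are hypothetical.

```python
import json
import tarfile


class JsonAnnotatedTar:
    """Hypothetical stand-in for a TarDataset subclass: files plus JSON metadata."""

    def __init__(self, tar_path, annotations_name):
        self.tar_path = tar_path
        with tarfile.open(tar_path, "r:") as tar:
            with tar.extractfile(annotations_name) as f:
                # Assumed layout: {"file name in archive": arbitrary metadata}
                self.annotations = json.load(f)
        self.names = sorted(self.annotations)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, index):
        name = self.names[index]
        # Reopening the archive per item keeps the sketch short; a real
        # dataset would more likely keep an index of offsets instead.
        with tarfile.open(self.tar_path, "r:") as tar:
            with tar.extractfile(name) as f:
                data = f.read()
        return data, self.annotations[name]
```

The returned bytes can then be decoded however the dataset requires, for example as an image or a video frame.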

Jupyter notebook tutorial

There is a more comprehensive set of examples as a Jupyter notebook in example.ipynb.

Full "ImageNet in a Tar file" example

A large-scale data loading example is given in imagenet-example.py. Only the section of code responsible for data loading was modified from the official PyTorch ImageNet example.

First, ensure that the data is in the expected format for the original example to work, in a folder named ILSVRC12. Then, create a Tar archive from it (tar cf ILSVRC12.tar ILSVRC12 on Linux or a utility like 7-Zip on Windows). Finally, run our modified imagenet-example.py, passing it the path to the Tar archive instead.
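If neither the tar command nor 7-Zip is convenient, the archive can also be created from Python. The sketch below uses the standard tarfile module in mode "w", which writes an uncompressed archive; the function name make_uncompressed_tar is hypothetical.

```python
import os
import tarfile


def make_uncompressed_tar(source_dir, tar_path):
    """Archive source_dir (recursively) into an uncompressed Tar file."""
    # Store paths relative to the folder's own name, e.g. "ILSVRC12/...",
    # as "tar cf ILSVRC12.tar ILSVRC12" does when run from the parent folder.
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(tar_path, "w") as tar:  # "w" means no compression
        tar.add(source_dir, arcname=top_level)
```

For example, make_uncompressed_tar("ILSVRC12", "ILSVRC12.tar") would be the counterpart of the command above.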

Author

João Henriques, Visual Geometry Group (VGG), University of Oxford
