An unopinionated replacement for PyTorch's Dataset and ImageFolder, that handles Tar archives

Last update: Dec 20, 2022


Simple Tar Dataset

An unopinionated replacement for PyTorch's Dataset and ImageFolder classes, for datasets stored as uncompressed Tar archives.

Just Tar it: No particular structure is enforced in the Tar archive. This means that you can just archive your files with no modification, and handle any data/meta-data with your dataset code.

Why? Storing a dataset as millions of small files makes access inefficient, and can create other difficulties in large-scale scenarios (e.g. running out of inodes, or inefficient operations in distributed filesystems, which are optimised for fewer large files). A Tar file is a simple and uncompressed archive format for which numerous utilities exist, and it allows fast random access into a single archive file.
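To illustrate why an uncompressed Tar file allows random access, here is a minimal sketch using only Python's standard tarfile module. It is not the TarDataset implementation; the helper names build_index and read_member are hypothetical. The idea is that each member's bytes sit at a known offset in the archive, so one scan can record those offsets and later reads need only a seek.

```python
import io
import tarfile


def build_index(fileobj):
    """Scan the archive once and record where each regular file's bytes live."""
    index = {}
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:  # "r:" = no compression
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(fileobj, index, name):
    """Read one file's bytes with a single seek, without rescanning the archive."""
    offset, size = index[name]
    fileobj.seek(offset)
    return fileobj.read(size)


# Build a tiny in-memory archive so the sketch runs without any files on disk.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name, payload in [("a.txt", b"first"), ("dir/b.txt", b"second")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

buffer.seek(0)
index = build_index(buffer)
print(read_member(buffer, index, "dir/b.txt"))
```

This only works because the archive is uncompressed: in a compressed archive, byte offsets in the file no longer correspond to member data.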

Example

The default TarDataset simply loads all PNG, JPG and JPEG images from a Tar file, and lets you iterate over them.

Images are returned as Tensors. The example below prints the RGB values of the first pixel of each image.

from tardataset import TarDataset

dataset = TarDataset('example-data/colors.tar')

for (idx, image) in enumerate(dataset):
  print(f"Image #{idx}, color: {image[:,0,0]}")

Usage

For image classification datasets, where images are usually stored in one folder per class (e.g. ImageNet), TarImageFolder is a drop-in replacement for torchvision.datasets.ImageFolder.
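The one-folder-per-class convention can be sketched with the standard library alone. The snippet below is an illustration under assumptions, not TarImageFolder's actual code: the function index_by_class and the file names are hypothetical. It follows the ImageFolder convention of sorting class folder names and assigning them consecutive integer labels.

```python
import io
import tarfile
from pathlib import PurePosixPath


def index_by_class(tar):
    """Label each file by its parent folder, using sorted folder names as classes."""
    files = sorted(m.name for m in tar.getmembers() if m.isfile())
    classes = sorted({PurePosixPath(name).parent.name for name in files})
    class_to_idx = {cls: i for i, cls in enumerate(classes)}
    samples = [(name, class_to_idx[PurePosixPath(name).parent.name]) for name in files]
    return class_to_idx, samples


# Hypothetical archive layout: data/<class name>/<image file>.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name in ["data/dog/2.png", "data/cat/1.png", "data/cat/3.png"]:
        payload = b"placeholder bytes, not a real image"
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:") as tar:
    class_to_idx, samples = index_by_class(tar)

print(class_to_idx)
print(samples)
```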

For more complex scenarios -- say, you store some data in one or more JSON files, or you have folders with video frames in specific formats -- you can subclass TarDataset, and read the data in any format you like.
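As a rough sketch of the JSON scenario, the stand-alone class below reads labels from a JSON file stored inside the archive and pairs them with raw file contents. It deliberately does not subclass TarDataset, so that it runs with the standard library only; the class name JsonLabelledTar, the archive layout and the file names are all assumptions made for illustration. In real use you would subclass TarDataset and rely on its own file-reading methods instead.

```python
import io
import json
import tarfile


class JsonLabelledTar:
    """Map-style dataset sketch: labels come from a JSON file inside the archive."""

    def __init__(self, fileobj, labels_path):
        self.tar = tarfile.open(fileobj=fileobj, mode="r:")
        raw = self.tar.extractfile(labels_path).read()
        self.labels = json.loads(raw.decode("utf-8"))  # {member name: label}
        self.names = sorted(self.labels)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        data = self.tar.extractfile(name).read()  # raw bytes; decode as needed
        return data, self.labels[name]


def add_bytes(tar, name, payload):
    """Helper for the demo: store a bytes payload under the given member name."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# Hypothetical archive: one JSON label file plus two binary "frames".
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    labels = {"frames/0001.bin": "walk", "frames/0002.bin": "run"}
    add_bytes(tar, "meta/labels.json", json.dumps(labels).encode("utf-8"))
    add_bytes(tar, "frames/0001.bin", b"\x00\x01")
    add_bytes(tar, "frames/0002.bin", b"\x02\x03")

buffer.seek(0)
dataset = JsonLabelledTar(buffer, "meta/labels.json")
for data, label in (dataset[i] for i in range(len(dataset))):
    print(label, data)
```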

Jupyter notebook tutorial

There is a more comprehensive set of examples as a Jupyter notebook in example.ipynb.

Full "ImageNet in a Tar file" example

A large-scale data loading example is given in imagenet-example.py. Only the section of code responsible for data loading was modified from the official PyTorch ImageNet example.

First, ensure that the data is in the expected format for the original example to work, in a folder named ILSVRC12. Then, create a Tar archive from it (tar cf ILSVRC12.tar ILSVRC12 on Linux or a utility like 7-Zip on Windows). Finally, run our modified imagenet-example.py, passing it the path to the Tar archive instead.
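If you prefer to create the archive from Python rather than from the command line, the standard tarfile module can do it on any platform. The sketch below is one possible approach, not part of this repository: make_tar is a hypothetical helper, and the small folder tree it packs is placeholder data standing in for the real ILSVRC12 folder.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tar(source_dir, archive_path):
    """Pack source_dir into an uncompressed Tar archive, keeping its folder name."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, mode="w") as tar:  # "w" = no compression
        tar.add(source_dir, arcname=source_dir.name)
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder layout; a real dataset folder would be used instead.
    root = Path(tmp) / "ILSVRC12"
    (root / "train" / "class_a").mkdir(parents=True)
    (root / "train" / "class_a" / "example.JPEG").write_bytes(b"not a real image")

    archive = make_tar(root, Path(tmp) / "ILSVRC12.tar")
    with tarfile.open(archive, mode="r:") as tar:
        names = tar.getnames()

print(names)
```

Opening the result with mode "r:" doubles as a check that the archive is uncompressed, since that mode refuses compressed input.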

Author

João Henriques, Visual Geometry Group (VGG), University of Oxford

