img2dataset


Easily turn large sets of image urls into an image dataset. Can download, resize and package 100M urls in 20h on one machine.

Also supports saving captions for url+caption datasets.

Install

pip install img2dataset

Usage

First, get an image url list. For example:

echo 'https://placekitten.com/200/305' >> myimglist.txt
echo 'https://placekitten.com/200/304' >> myimglist.txt
echo 'https://placekitten.com/200/303' >> myimglist.txt

Then, run the tool:

img2dataset --url_list=myimglist.txt --output_folder=output_folder --thread_count=64 --image_size=256

The tool will then automatically download the urls, resize the images, and store them in the following format:

  • output_folder
    • 0
      • 0.jpg
      • 1.jpg
      • 2.jpg

or in this format if webdataset is chosen:

  • output_folder
    • 0.tar containing:
      • 0.jpg
      • 1.jpg
      • 2.jpg

with each number being the position of the image in the list. The subfolders avoid having too many files in a single folder.

If captions are provided, they will be saved as 0.txt, 1.txt, ...

This can then easily be fed into machine learning training or any other use case.
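
For example, a webdataset output can be streamed into a training loop with the webdataset library. A minimal sketch (it assumes the webdataset and Pillow packages are installed, that --output_format webdataset was used, and that captions were provided):

# Minimal sketch, not part of img2dataset itself.
import webdataset as wds

dataset = (
    wds.WebDataset("output_folder/0.tar")  # tar produced by img2dataset
    .decode("pil")                         # decode jpg bytes into PIL images
    .to_tuple("jpg", "txt")                # yield (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, caption)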

If the save_metadata option is turned on (it is by default), then .json files named 0.json, 1.json, ... are saved with these keys:

  • url
  • caption
  • key
  • shard_id
  • status : whether the download succeeded
  • error_message
  • width
  • height
  • original_width
  • original_height
  • exif

A .parquet file will also be saved with the same name as the subfolder/tar file, containing the same metadata. It can be used to analyze the results efficiently.
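
For instance, a quick sketch of such an analysis (assuming pandas is installed; 0.parquet mirrors the shard naming described above):

# Minimal sketch, assuming pandas is installed.
import pandas as pd

df = pd.read_parquet("output_folder/0.parquet")  # one metadata file per shard
print(df["status"].value_counts())               # how many downloads succeeded
print(df["error_message"].value_counts())        # most frequent failure reasons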

Integration with Weights & Biases

Performance metrics are monitored through Weights & Biases.

W&B metrics

In addition, the most frequent errors are logged for easier debugging.

W&B table

Other features are available:

  • logging of environment configuration (OS, python version, CPU count, Hostname, etc)
  • monitoring of hardware resources (GPU/CPU, RAM, Disk, Networking, etc)
  • custom graphs and reports
  • comparison of runs (convenient when optimizing parameters such as number of threads/cpus)

When running the script for the first time, you can decide to either associate your metrics with your account or log them anonymously.

You can also log in (or create an account) beforehand by running wandb login.

API

This module exposes a single function, download, which takes the same arguments as the command line tool (a sketch of calling it follows the parameter list):

  • url_list A file with the list of urls of images to download. It can be a folder of such files. (required)
  • image_size The size to resize images to (default 256)
  • output_folder The path to the output folder. If existing subfolders are present, the tool will continue with the next number. (default "images")
  • processes_count The number of processes used for downloading the pictures. Setting this high is important for performance. (default 1)
  • thread_count The number of threads used for downloading the pictures. Setting this high is important for performance. (default 256)
  • resize_mode The way to resize pictures, can be no, border, keep_ratio or center_crop (default border)
    • no doesn't resize at all
    • border will make the image image_size x image_size and add a border
    • keep_ratio will keep the ratio and make the smallest side of the picture image_size
    • center_crop will keep the ratio and center crop the largest side so the picture is squared
  • resize_only_if_bigger resize pictures only if bigger than the image_size (default False)
  • output_format decides how to save pictures (default files)
    • files saves as a set of subfolders containing pictures
    • webdataset saves as tars containing pictures
  • input_format decides how to load the urls (default txt)
    • txt loads the urls as a text file of urls, one per line
    • csv loads the urls and optional caption as a csv
    • tsv loads the urls and optional caption as a tsv
    • parquet loads the urls and optional caption as a parquet
  • url_col the name of the url column for parquet and csv (default url)
  • caption_col the name of the caption column for parquet and csv (default None)
  • number_sample_per_shard the number of samples that will be downloaded in one shard (default 10000)
  • save_metadata if true, saves one parquet file per folder/tar and json files with metadata (default True)
  • save_additional_columns list of additional columns to take from the csv/parquet files and save in metadata files (default None)
  • timeout maximum time (in seconds) to wait when trying to download an image (default 10)
  • wandb_project name of W&B project used (default img2dataset)
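
For example, a minimal sketch of calling it from Python, mirroring the command line example above (the values are illustrative):

import img2dataset

img2dataset.download(
    url_list="myimglist.txt",
    output_folder="output_folder",
    thread_count=64,
    image_size=256,
)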

How to tweak the options

The default values should be good enough for small datasets. For larger ones, these tips may help you get the best performance (a tuned example follows the list):

  • set the processes_count as the number of cores your machine has
  • increase thread_count as long as your bandwidth and cpu are below the limits
  • I advise setting output_format to webdataset if your dataset has more than 1M elements; it is easier to manipulate a few tars than millions of files
  • keeping save_metadata at True can be useful to check which items were already saved and avoid redownloading them
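
For instance, a configuration following these tips might look like this (a sketch only; large_url_list.parquet is a hypothetical input file and the exact values depend on your machine and bandwidth):

import multiprocessing

import img2dataset

img2dataset.download(
    url_list="large_url_list.parquet",            # hypothetical input file
    input_format="parquet",
    url_col="url",
    caption_col="caption",
    output_folder="output_folder",
    output_format="webdataset",                   # a few tars instead of millions of files
    image_size=256,
    processes_count=multiprocessing.cpu_count(),  # one process per core
    thread_count=256,                             # raise while bandwidth and cpu allow
    save_metadata=True,                           # keeps track of what was already saved
)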

Road map

This tool works very well in the current state for up to 100M elements. Future goals include:

  • a benchmark for 1B pictures, which may require:
    • further optimization of the resizing part
    • better multi node support
    • integrated support for incremental downloads (only download new elements)

Architecture notes

This tool is designed to download pictures as fast as possible. This puts stress on various kinds of resources. Some numbers, assuming 1350 images/s (a back-of-the-envelope check follows the list):

  • Bandwidth: downloading a thousand average images per second requires about 130MB/s
  • CPU: resizing one image may take several milliseconds; several thousand per second can use up to 16 cores
  • DNS querying: millions of urls mean millions of domains; default OS settings are usually not enough. Setting up a local bind9 resolver may be required
  • Disk: if using resizing, up to 30MB/s write speed is necessary. If not using resizing, up to 130MB/s. Writing to a few tar files makes it possible to use rotational drives instead of an SSD.
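
As a back-of-the-envelope check of these numbers (the average image sizes are implied by the stated throughput, not measured):

rate = 1350              # images per second (observed)
download_bw = 130e6      # ~130MB/s of incoming bandwidth
resized_write_bw = 30e6  # ~30MB/s written to disk when resizing

print(download_bw / rate)       # ~96KB average original image
print(resized_write_bw / rate)  # ~22KB average resized image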

With this information in mind, the design choices were made as follows:

  • the list of urls is split into N shards. N is usually chosen as the number of cores
  • N processes are started (using a multiprocessing process pool)
    • each process starts M threads. M should be maximized in order to use as much network as possible while keeping cpu usage below 100%.
    • each of these threads downloads 1 image and returns it
    • the parent thread handles resizing (which means there are at most N resizes running at once, using up the cores but not more)
    • the parent thread saves to a tar file that is different from those of the other processes

This design makes it possible to use CPU resources efficiently by doing only 1 resize per core, to reduce disk overhead by opening only 1 file per core, and to use as much bandwidth as possible with M threads per process.
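
A highly simplified sketch of this layout (not the actual img2dataset implementation; resizing is reduced to a comment and details are kept to the bare minimum):

# Simplified sketch of the N-processes x M-threads layout described above;
# not the actual img2dataset implementation.
import io
import tarfile
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool


def download_one(entry):
    # runs in one of the M threads: fetch a single image, return its bytes
    index, url = entry
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return index, response.read()
    except Exception:
        return index, None


def process_shard(args):
    # runs in one of the N processes: download a shard with M threads,
    # then resize (omitted here) and write to this process's own tar file
    shard_id, urls = args
    with ThreadPoolExecutor(max_workers=256) as threads, \
            tarfile.open(f"{shard_id}.tar", "w") as tar:
        for index, data in threads.map(download_one, enumerate(urls)):
            if data is None:
                continue
            info = tarfile.TarInfo(name=f"{index}.jpg")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return shard_id


if __name__ == "__main__":
    shards = [
        (0, ["https://placekitten.com/200/305"]),
        (1, ["https://placekitten.com/200/304"]),
    ]
    with Pool(processes=2) as processes:  # N is usually the core count
        processes.map(process_shard, shards)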

Setting up a bind9 resolver

In order to keep the success rate high, it is necessary to use an efficient DNS resolver. I tried several options: systemd-resolved, dnsmasq and bind9, and reached the conclusion that bind9 gives the best performance for this use case. Here is how to set this up on ubuntu:

sudo apt install bind9
sudo vim /etc/bind/named.conf.options

Add this inside the options block:
        recursive-clients 10000;
        resolver-query-timeout 30000;
        max-clients-per-query 10000;
        max-cache-size 2000m;

sudo systemctl restart bind9

sudo vim /etc/resolv.conf

Put this content:
nameserver 127.0.0.1

This makes it possible to keep a high success rate while doing thousands of dns queries. You may also want to set up bind9 logging in order to check that few dns errors happen.

For development

Either locally, or in gitpod (do export PIP_USER=false there)

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

to run tests:

pip install -r requirements-test.txt

then

python -m pytest -v tests -s

Benchmarks

10000 image benchmark

cd tests
bash benchmark.sh

18M image benchmark

Download the first part of the crawling at home dataset, then:

cd tests
bash large_bench.sh

It takes 3.7h to download 18M pictures.

1350 images/s is the currently observed performance. 4.8M images per hour, 116M images per 24h.

36M image benchmark

Downloading 2 parquet files of 18M items each (936GB result) took 7h24, an average of 1345 images/s.

190M benchmark

Downloading 190M images from the crawling at home dataset took 41h (5TB result), an average of 1280 images/s.

Comments
  • Downloader is not producing full set of expected outputs

    Heya, I was trying to download the LAION400M dataset and noticed that I am not getting the full set of data for some reason.

    Any tips on debugging further?

    TL;DR - I was expecting ~12M files to be downloaded, only seeing successes in *_stats.json files indicating ~2M files were actually downloaded

    For example - I recently tried to download this dataset in a distributed manner on EMR:

    https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/dataset/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet

    I applied some light NSFW filtering on it to produce a new parquet

    # rest of the script is redacted, but there is some code before this to normalize the NSFW row to make filtering more convenient
    sampled_df = df[df["NSFW"] == "unlikely"]
    sampled_df.reset_index(inplace=True)
    

    Verified its row count is ~12M samples:

    import glob
    import json
    from pyarrow.parquet import ParquetDataset
    
    files = glob.glob("*.parquet")
    
    d = {}
    
    for file in files:
        d[file] = 0
        dataset = ParquetDataset(file)
        for piece in dataset.pieces:
            d[file] += piece.get_metadata().num_rows
    
    print(json.dumps(d, indent=2, sort_keys=True))
    
    {
      "part00000.parquet": 12026281
    }
    

    Ran the download, and scanned over the output s3 bucket:

    aws s3 cp\
    	s3://path/to/s3/download/ . \
    	--exclude "*" \
    	--include "*.json" \
    	--recursive
    

    Ran this script to get the total count of images downloaded:

    import json
    import glob
    
    files = glob.glob("/path/to/json/files/*.json")
    
    count = {}
    successes = {}
    
    for file in files:
        with open(file) as f:
            j = json.load(f)
            count[file] = j["count"]
            successes[file] = j["successes"]
    
    rate = 100 * sum(successes.values()) / sum(count.values())
    print(f"Success rate: {rate}. From {sum(successes.values())} / {sum(count.values())}")
    

    which gave me the following output:

    Success rate: 56.15816066896948. From 1508566 / 2686281
    

    The high error rate here is not of major concern; I was running at a low worker node count for experimentation so we have a lot of dns issues (I'll use a knot resolver later)

    unknown url type: '21nicrmo2'                                                      1.0
    <urlopen error [errno 22] invalid argument>                                        1.0
    encoding with 'idna' codec failed (unicodeerror: label empty or too long)          1.0
    http/1.1 401.2 unauthorized\r\n                                                    4.0
    <urlopen error no host given>                                                      5.0
    <urlopen error unknown url type: "https>                                          11.0
    incomplete read                                                                   14.0
    <urlopen error [errno 101] network is unreachable>                                38.0
    <urlopen error [errno 104] connection reset by peer>                              75.0
    [errno 104] connection reset by peer                                              92.0
    opencv                                                                           354.0
    <urlopen error [errno 113] no route to host>                                     448.0
    remote end closed connection without response                                    472.0
    <urlopen error [errno 111] connection refused>                                  1144.0
    encoding issue                                                                  2341.0
    timed out                                                                       2850.0
    <urlopen error timed out>                                                       4394.0
    the read operation timed out                                                    4617.0
    image decoding error                                                            5563.0
    ssl                                                                             6174.0
    http error                                                                     62670.0
    <urlopen error [errno -2] name or service not known>                         1086446.0
    success                                                                      1508566.0
    

    I also noticed there were only 270 json files produced, but given that each shard should contain 10,000 images, I expected ~1,200 json files to be produced. Not sure where this discrepancy is coming from

    > ls
    00000_stats.json  00051_stats.json  01017_stats.json  01066_stats.json  01112_stats.json  01157_stats.json
    00001_stats.json  00052_stats.json  01018_stats.json  01067_stats.json  01113_stats.json  01159_stats.json
    ...
    > ls -l | wc -l 
    270
    
    opened by PranshuBansalDev 33
  • Increasing mem and no output files

    Currently using your tool to download laion dataset, thank you for your contribution. The program grows in memory until it uses all of my 32G of RAM and 64G of SWAP. No tar files are ever output. Am I doing something wrong?

    Using the following command (slightly modified from official command provided by laion):

    img2dataset --url_list laion400m-meta --input_format "parquet" \
        --url_col "URL" --caption_col "TEXT" --output_format webdataset \
        --output_folder webdataset --processes_count 1 --thread_count 12 --image_size 384 \
        --save_additional_columns '["NSFW","similarity","LICENSE"]'

    opened by pbatk 23
  • feat: support tfrecord

    Add support for tfrecords.

    The webdataset format is not very convenient on TPU's due to bad support of pytorch dataloaders in multiprocessing at the moment so tfrecords allow better usage of CPU's.

    opened by borisdayma 22
  • Download stall at the end

    I'm trying to download the CC3M dataset on an AWS Sagemaker Notebook instance. I first do pip install img2dataset. Then I fired up a terminal and do

    img2dataset --url_list cc3m.tsv --input_format "tsv"\
             --url_col "url" --caption_col "caption" --output_format webdataset\
               --output_folder cc3m --processes_count 16 --thread_count 64 --resize_mode no\
                 --enable_wandb False
    

    Code runs and downloads but stalls towards the end. I tried terminating by restarting the instance; as a result, some .tar files give a read error "Unexpected end of file" when using the tar files for training. I also tried to terminate it using Ctrl-C on a second run, which resulted in the same read error when using the tar files for training. The difference between the two termination methods is that the latter seemed to do some cleanup which removed the "_tmp" folder within the download folder.

    opened by xiankgx 13
  • Respect noai and noimageai directives when downloading image files

    Media owners can use the X-Robots-Tag header to communicate usage directives for the associated media, including instruction that the image not be used in any indexes (noindex) or included in datasets used for machine learning purposes (noai).

    This PR makes img2dataset respect such directives by not including associated media in the generated dataset. It also updates the useragent string, introducing an img2dataset user agent token so that requests made using the tool are identifiable by media hosts.

    Refs:

    • https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag
    • https://www.deviantart.com/team/journal/A-New-Directive-for-Opting-Out-of-AI-Datasets-934500371
    opened by raincoastchris 12
  • How to download SBUcaptions and Visual Genome (VG) dataset in webdataset format

    For Vision and Language pretraining cc3m, mscoco, SBUcaptions and VG are very relevant datasets. I haven't been able to download SBU captions and VG. Here are my questions.

    1. How to download SBU captions and VG's metadata?
    2. How to download these datasets on webdataset format?

    Could you also please provide me with a tutorial or just some hints to download it in webdataset format using img2dataset? Thank you in advance.

    opened by sanyalsunny111 8
  • clip-retrieval-getting-started.ipynb giving errors (Urgent)

    Hello there, I am new to the world of deep learning. I am trying to run clip-retrieval-getting-started.ipynb but getting the error attached as a snip. Please help, it's urgent.

    opened by minakshimathpal 8
  • Decrease memory usage

    Currently the memory usage is about 1.5GB per core. That's way too much; it must be possible to decrease it. Figure out what's using all that ram (is it because the resize queue is full? should there be some backpressure on the downloader? etc.) and solve it.

    opened by rom1504 8
  • Interest in supporting video datasets?

    Hi. Thanks for the amazing repository. It really makes the workflow very easy. I was wondering if you are considering adding video datasets as well. Some are based on urls, while others are derived from youtube or segments from youtube.

    opened by TheShadow29 7
  • Add checksum of image

    I think it could be useful to add a checksum in the parquet files since we're downloading the images anyway and it's fast to compute. It would help us do a real deduplication, not only on urls but on actual image content.

    opened by borisdayma 7
  • add list of int, float feature in TFRecordSampleWriter

    We use list-of-int and list-of-float attributes in the coyo-labeled-300m dataset (it will be released soon). To create the dataset using img2dataset in tfrecord format, we need to add the above features.

    opened by justHungryMan 6
  • Figure out how to timeout

    I implemented some new metrics and found that many urls time out after 20s, which clearly slows down everything

    here are some examples:

    Downloaded (12, 'http://www.herteldenbirname.com/wp-content/uploads/2014/05/Italia-Independent-Flocked-Aviator-Sunglasses-150x150.jpg') in 10.019284009933472
    Downloaded (124, 'http://image.rakuten.co.jp/sneak/cabinet/shoes-03/cr-ucrocs5-a.jpg?_ex=128x128') in 10.01184344291687
    Downloaded (146, 'http://www.slicingupeyeballs.com/wp-content/uploads/2009/05/stoneroses452.jpg') in 10.006474256515503
    Downloaded (122, 'https://media.mwcradio.com/mimesis/2013-03/01/2013-03-01T153415Z_1_CBRE920179600_RTROPTP_3_TECH-US-GERMANY-EREADER_JPG_475x310_q85.jpg') in 10.241626739501953
    Downloaded (282, 'https://8d1aee3bcc.site.internapcdn.net/00/images/media/5/5cfb2eba8f1f6244c6f7e261b9320a90-1.jpg') in 10.431355476379395
    Downloaded (298, 'https://my-furniture.com.au/media/catalog/product/cache/1/small_image/295x295/9df78eab33525d08d6e5fb8d27136e95/a/u/au0019-stool-01.jpg') in 10.005694150924683
    Downloaded (300, 'http://images.tastespotting.com/thumbnails/889506.jpg') in 10.007027387619019
    Downloaded (330, 'https://www.infoworld.pk/wp-content/uploads/2016/02/Cool-HD-Valentines-Day-Wallpapers-480x300.jpeg') in 10.004335880279541
    Downloaded (361, 'http://pendantscarf.com/image/cache/data/necklace/JW0013-(2)-150x150.jpg') in 10.00539231300354
    Downloaded (408, 'https://www.solidrop.net/photo-6/animorphia-coloring-books-for-adults-children-drawing-book-secret-garden-style-relieve-stress-graffiti-painting-book.jpg') in 10.004313945770264

    Let's try to implement request timeout

    I tried #153 , eventlet and #260 and none of them can timeout properly

    A good value for timeout is 2s

    opened by rom1504 16
  • Add asyncio implementation of downloader

    #252 #256

    The implementation of the asyncio downloader. It also runs properly on Windows (without a 3rd party dns resolver) with an average of 500~600Mbps (on a 1Gbps network).

    Use the command arg --downloader to choose the type of downloader ("normal", "async"):

    img2dataset --downloader async
    

    mscoco download test

    opened by KohakuBlueleaf 3
  • opencv-python => opencv-python-headless

    This PR replaces opencv-python with opencv-python-headless to remove the dependency on GUI-related libraries (see: https://github.com/opencv/opencv-python/issues/370#issuecomment-671202529). I tested this working on the python:3.9 Docker image.

    opened by shionhonda 2