img2dataset

pypi Open In Colab Try it on gitpod

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

Also supports saving captions for url+caption datasets.

Install

pip install img2dataset

Usage

First, get a list of image urls. For example:

echo 'https://placekitten.com/200/305' >> myimglist.txt
echo 'https://placekitten.com/200/304' >> myimglist.txt
echo 'https://placekitten.com/200/303' >> myimglist.txt

Then, run the tool:

img2dataset --url_list=myimglist.txt --output_folder=output_folder --thread_count=64 --image_size=256

The tool will then automatically download the urls, resize them, and store them in this format:

  • output_folder
    • 0
      • 0.jpg
      • 1.jpg
      • 2.jpg

or in this format if choosing webdataset:

  • output_folder
    • 0.tar containing:
      • 0.jpg
      • 1.jpg
      • 2.jpg

with each number being the position in the list. The subfolders avoid having too many files in a single folder.

If captions are provided, they will be saved as 0.txt, 1.txt, ...

This can then easily be fed into machine learning training or any other use case.
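For example, when output_format is webdataset, the resulting tar shards can be read with the webdataset library (a minimal sketch, assuming jpg images and txt captions as produced above):

import webdataset as wds

# Iterate over one produced shard; drop "txt" if no captions were saved.
dataset = (
    wds.WebDataset("output_folder/0.tar")
    .decode("pil")            # decode jpg bytes into PIL images
    .to_tuple("jpg", "txt")   # yield (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, caption)
    break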

If the save_metadata option is turned on (the default), then .json files named 0.json, 1.json, ... are saved with these keys:

  • url
  • caption
  • key
  • shard_id
  • status : whether the download succeeded
  • error_message
  • width
  • height
  • original_width
  • original_height
  • exif

A .parquet file with the same name as the subfolder/tar file will also be saved, containing the same metadata. It can be used to analyze the results efficiently.
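For instance, the per-shard parquet file can be loaded with pandas to inspect the success rate and the most frequent errors (a sketch; 0.parquet mirrors the shard naming shown above):

import pandas as pd

df = pd.read_parquet("output_folder/0.parquet")
print(df["status"].value_counts())                  # how many downloads succeeded or failed
print(df["error_message"].value_counts().head(10))  # most frequent failure reasons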

Integration with Weights & Biases

Performance metrics are monitored through Weights & Biases.

W&B metrics

In addition, most frequent errors are logged for easier debugging.

W&B table

Other features are available:

  • logging of environment configuration (OS, python version, CPU count, Hostname, etc)
  • monitoring of hardware resources (GPU/CPU, RAM, Disk, Networking, etc)
  • custom graphs and reports
  • comparison of runs (convenient when optimizing parameters such as number of threads/cpus)

When running the script for the first time, you can decide to either associate your metrics with your account or log them anonymously.

You can also log in (or create an account) beforehand by running wandb login.

API

This module exposes a single function, download, which takes the same arguments as the command line tool (a usage sketch follows the parameter list below):

  • url_list A file with the list of urls of the images to download. It can be a folder of such files. (required)
  • image_size The size to resize images to (default 256)
  • output_folder The path to the output folder. If existing subfolders are present, the tool will continue from the next number. (default "images")
  • processes_count The number of processes used for downloading the pictures. Setting this high is important for performance. (default 1)
  • thread_count The number of threads used for downloading the pictures. Setting this high is important for performance. (default 256)
  • resize_mode The way to resize pictures; can be no, border, keep_ratio or center_crop (default border)
    • no doesn't resize at all
    • border will make the image image_size x image_size and add a border
    • keep_ratio will keep the ratio and make the smallest side of the picture image_size
    • center_crop will keep the ratio and center crop the largest side so the picture is squared
  • resize_only_if_bigger resize pictures only if bigger than the image_size (default False)
  • output_format decides how to save pictures (default files)
    • files saves as a set of subfolders containing pictures
    • webdataset saves as tars containing pictures
  • input_format decides how to load the urls (default txt)
    • txt loads the urls from a text file, one url per line
    • csv loads the urls and optional caption as a csv
    • tsv loads the urls and optional caption as a tsv
    • parquet loads the urls and optional caption as a parquet
  • url_col the name of the url column for parquet and csv (default url)
  • caption_col the name of the caption column for parquet and csv (default None)
  • number_sample_per_shard the number of samples that will be downloaded in one shard (default 10000)
  • save_metadata if true, saves one parquet file per folder/tar and json files with metadata (default True)
  • save_additional_columns list of additional columns to take from the csv/parquet files and save in metadata files (default None)
  • timeout maximum time (in seconds) to wait when trying to download an image (default 10)
  • wandb_project name of W&B project used (default img2dataset)
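For example, the command-line invocation from the usage section translates to a call like this (a minimal sketch; only a few of the arguments listed above are passed, the rest keep their defaults):

from img2dataset import download

download(
    url_list="myimglist.txt",
    output_folder="output_folder",
    thread_count=64,
    image_size=256,
)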

How to tweak the options

The default values should be good enough for a small-sized dataset. For larger ones, these tips may help you get the best performance:

  • set the processes_count as the number of cores your machine has
  • increase thread_count as long as your bandwidth and cpu are below the limits
  • I advise setting output_format to webdataset if your dataset has more than 1M elements; it is easier to manipulate a few tars than millions of files (an example invocation is shown after this list)
  • keeping save_metadata to True can be useful to check which items were already saved and avoid redownloading them
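For example, a run over a large parquet url+caption list on a 16-core machine could look like the following sketch (the file name and column names are placeholders, not part of any real dataset):

from img2dataset import download

download(
    url_list="large_url_list.parquet",   # placeholder file name
    input_format="parquet",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",          # a few tars instead of millions of files
    output_folder="output_folder",
    processes_count=16,                  # one process per core
    thread_count=256,                    # raise while bandwidth and cpu allow it
    image_size=256,
    save_metadata=True,                  # helps avoid redownloading on restart
)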

Road map

This tool works very well in the current state for up to 100M elements. Future goals include:

  • a benchmark for 1B pictures which may require
    • further optimization on the resizing part
    • better multi node support
    • integrated support for incremental downloads (only download new elements)

Architecture notes

This tool is designed to download pictures as fast as possible. This puts a stress on various kinds of resources. Some numbers assuming 1350 images/s (a rough sanity check is sketched after this list):

  • Bandwidth: downloading a thousand average images per second requires about 130MB/s
  • CPU: resizing one image may take several milliseconds; several thousand per second can use up to 16 cores
  • DNS querying: millions of urls mean millions of domains; the default OS settings are usually not enough. Setting up a local bind9 resolver may be required
  • Disk: if using resizing, up to 30MB/s write speed is necessary. If not using resizing, up to 130MB/s. Writing to a few tar files makes it possible to use rotational drives instead of an SSD.
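As a rough sanity check of the bandwidth figure (a sketch; the ~130KB average image size is an assumption implied by the 130MB/s per thousand images number above):

images_per_second = 1350
avg_image_size_kb = 130          # assumed average original image size
bandwidth_mb_per_s = images_per_second * avg_image_size_kb / 1000
print(f"~{bandwidth_mb_per_s:.0f} MB/s of download bandwidth")  # about 175 MB/s at 1350 images/s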

With this information in mind, the design was chosen as follows:

  • the list of urls is split into N shards. N is usually chosen as the number of cores
  • N processes are started (using a multiprocessing process pool)
    • each process starts M threads. M should be maximized in order to use as much network as possible while keeping cpu usage below 100%.
    • each of these threads downloads one image and returns it
    • the parent thread handles resizing (which means there are at most N resizes running at once, using up the cores but not more)
    • the parent thread saves to a tar file that is different from those of the other processes

This design makes it possible to use the CPU efficiently by doing only one resize per core, to reduce disk overhead by opening one file per core, and to use as much bandwidth as possible by using M threads per process.
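A minimal sketch of this layout, not the actual implementation (download_image, resize and the url shards are placeholders):

from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

def download_image(url):
    ...  # placeholder: fetch one image over the network

def resize(image):
    ...  # placeholder: resize in the parent thread of the worker process

def process_shard(urls):
    # M threads maximize network usage; resizing stays in this single parent thread,
    # so there is at most one resize running per process (i.e. per core).
    with ThreadPoolExecutor(max_workers=256) as threads:
        for image in threads.map(download_image, urls):
            resize(image)  # then write to this process's own tar file

if __name__ == "__main__":
    shards = [["https://placekitten.com/200/305"], ["https://placekitten.com/200/304"]]
    with Pool(processes=len(shards)) as processes:  # N processes, one shard each
        processes.map(process_shard, shards)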

Setting up a bind9 resolver

In order to keep the success rate high, it is necessary to use an efficient DNS resolver. I tried several options: systemd-resolved, dnsmasq and bind9, and reached the conclusion that bind9 gives the best performance for this use case. Here is how to set it up on ubuntu:

sudo apt install bind9
sudo vim /etc/bind/named.conf.options

Add this in options:
        recursive-clients 10000;
        resolver-query-timeout 30000;
        max-clients-per-query 10000;
        max-cache-size 2000m;

sudo systemctl restart bind9

sudo vim /etc/resolv.conf

Put this content:
nameserver 127.0.0.1

This will make it possible to keep a high success rate while doing thousands of dns queries. You may also want to set up bind9 logging in order to check that only a few dns errors happen.
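To sanity-check the resolver, a quick test over a sample of hostnames can be run (a sketch; the hostnames are placeholders to be replaced with domains from your url list):

import socket

hostnames = ["placekitten.com", "example.com"]  # replace with a sample from your url list

resolved = 0
for host in hostnames:
    try:
        socket.gethostbyname(host)  # goes through the local resolver configured above
        resolved += 1
    except socket.gaierror:
        pass

print(f"{resolved}/{len(hostnames)} hostnames resolved")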

For development

Either locally, or in gitpod (do export PIP_USER=false there)

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

to run tests:

pip install -r requirements-test.txt

then

python -m pytest -v tests -s

Benchmarks

10000 image benchmark

cd tests
bash benchmark.sh

18M image benchmark

Download the first part of the crawling at home dataset, then:

cd tests
bash large_bench.sh

It takes 3.7h to download 18M pictures.

1350 images/s is the currently observed performance. 4.8M images per hour, 116M images per 24h.

36M image benchmark

Downloading 2 parquet files of 18M items (resulting in 936GB) took 7h24, an average of 1345 images/s.

190M benchmark

Downloading 190M images from the crawling at home dataset took 41h (resulting in 5TB), an average of 1280 images/s.

Comments
  • Downloader is not producing full set of expected outputs

    Heya, I was trying to download the LAION400M dataset and noticed that I am not getting the full set of data for some reason.

    Any tips on debugging further?

    TL;DR - I was expecting ~12M files to be downloaded, only seeing successes in *_stats.json files indicating ~2M files were actually downloaded

    For example - I recently tried to download this dataset in a distributed manner on EMR:

    https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/dataset/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet

    I applied some light NSFW filtering on it to produce a new parquet

    # rest of the script is redacted, but there is some code before this to normalize the NSFW row to make filtering more convenient
    sampled_df = df[df["NSFW"] == "unlikely"]
    sampled_df.reset_index(inplace=True)
    

    Verified its row count is ~12M samples:

    import glob
    import json
    from pyarrow.parquet import ParquetDataset
    
    files = glob.glob("*.parquet")
    
    d = {}
    
    for file in files:
        d[file] = 0
        dataset = ParquetDataset(file)
        for piece in dataset.pieces:
            d[file] += piece.get_metadata().num_rows
    
    print(json.dumps(d, indent=2, sort_keys=True))
    
    {
      "part00000.parquet": 12026281
    }
    

    Ran the download, and scanned over the output s3 bucket:

    aws s3 cp\
    	s3://path/to/s3/download/ . \
    	--exclude "*" \
    	--include "*.json" \
    	--recursive
    

    Ran this script to get the total count of images downloaded:

    import json
    import glob
    
    files = glob.glob("/path/to/json/files/*.json")
    
    count = {}
    successes = {}
    
    for file in files:
        with open(file) as f:
            j = json.load(f)
            count[file] = j["count"]
            successes[file] = j["successes"]
    
    rate = 100 * sum(successes.values()) / sum(count.values())
    print(f"Success rate: {rate}. From {sum(successes.values())} / {sum(count.values())}")
    

    which gave me the following output:

    Success rate: 56.15816066896948. From 1508566 / 2686281
    

    The high error rate here is not of major concern; I was running at a low worker node count for experimentation, so we have a lot of dns issues (I'll use a knot resolver later).

    unknown url type: '21nicrmo2'                                                      1.0
    <urlopen error [errno 22] invalid argument>                                        1.0
    encoding with 'idna' codec failed (unicodeerror: label empty or too long)          1.0
    http/1.1 401.2 unauthorized\r\n                                                    4.0
    <urlopen error no host given>                                                      5.0
    <urlopen error unknown url type: "https>                                          11.0
    incomplete read                                                                   14.0
    <urlopen error [errno 101] network is unreachable>                                38.0
    <urlopen error [errno 104] connection reset by peer>                              75.0
    [errno 104] connection reset by peer                                              92.0
    opencv                                                                           354.0
    <urlopen error [errno 113] no route to host>                                     448.0
    remote end closed connection without response                                    472.0
    <urlopen error [errno 111] connection refused>                                  1144.0
    encoding issue                                                                  2341.0
    timed out                                                                       2850.0
    <urlopen error timed out>                                                       4394.0
    the read operation timed out                                                    4617.0
    image decoding error                                                            5563.0
    ssl                                                                             6174.0
    http error                                                                     62670.0
    <urlopen error [errno -2] name or service not known>                         1086446.0
    success                                                                      1508566.0
    

    I also noticed there were only 270 json files produced, but given that each shard should contain 10,000 images, I expected ~1,200 json files to be produced. Not sure where this discrepancy is coming from

    > ls
    00000_stats.json  00051_stats.json  01017_stats.json  01066_stats.json  01112_stats.json  01157_stats.json
    00001_stats.json  00052_stats.json  01018_stats.json  01067_stats.json  01113_stats.json  01159_stats.json
    ...
    > ls -l | wc -l 
    270
    
    opened by PranshuBansalDev 33
  • Increasing mem and no output files

    Currently using your tool to download laion dataset, thank you for your contribution. The program grows in memory until it uses all of my 32G of RAM and 64G of SWAP. No tar files are ever output. Am I doing something wrong?

    Using the following command (slightly modified from the official command provided by laion):

    img2dataset --url_list laion400m-meta --input_format "parquet" \
      --url_col "URL" --caption_col "TEXT" --output_format webdataset \
      --output_folder webdataset --processes_count 1 --thread_count 12 --image_size 384 \
      --save_additional_columns '["NSFW","similarity","LICENSE"]'

    opened by pbatk 23
  • feat: support tfrecord

    Add support for tfrecords.

    The webdataset format is not very convenient on TPUs due to the currently poor support of pytorch dataloaders in multiprocessing, so tfrecords allow better usage of CPUs.

    opened by borisdayma 22
  • Download stall at the end

    I'm trying to download the CC3M dataset on an AWS Sagemaker Notebook instance. I first do pip install img2dataset. Then I fired up a terminal and do

    img2dataset --url_list cc3m.tsv --input_format "tsv"\
             --url_col "url" --caption_col "caption" --output_format webdataset\
               --output_folder cc3m --processes_count 16 --thread_count 64 --resize_mode no\
                 --enable_wandb False
    

    The code runs and downloads but stalls towards the end. I tried terminating by restarting the instance; as a result, some .tar files give the read error "Unexpected end of file" when using the tar files for training. I also tried to terminate it using Ctrl-C on a second run, which results in the same read error. The difference between the two termination methods is that the latter seemed to do some cleanup, which removed the "_tmp" folder within the download folder.

    opened by xiankgx 13
  • Respect noai and noimageai directives when downloading image files

    Media owners can use the X-Robots-Tag header to communicate usage directives for the associated media, including instruction that the image not be used in any indexes (noindex) or included in datasets used for machine learning purposes (noai).

    This PR makes img2dataset respect such directives by not including associated media in the generated dataset. It also updates the useragent string, introducing an img2dataset user agent token so that requests made using the tool are identifiable by media hosts.

    Refs:

    • https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag
    • https://www.deviantart.com/team/journal/A-New-Directive-for-Opting-Out-of-AI-Datasets-934500371
    opened by raincoastchris 12
  • How to download SBUcaptions and Visual Genome (VG) dataset in webdataset format

    For Vision and Language pretraining cc3m, mscoco, SBUcaptions and VG are very relevant datasets. I haven't been able to download SBU captions and VG. Here are my questions.

    1. How to download SBU captions and VG's metadata?
    2. How to download these datasets on webdataset format?

    Could you also please provide me with a tutorial or just some hints to download it in webdataset format using img2dataset? Thank you in advance.

    opened by sanyalsunny111 8
  • clip-retrieval-getting-started.ipynb giving errors (Urgent)

    Hello there, I am new to the world of deep learning. I am trying to run clip-retrieval-getting-started.ipynb but am getting the error attached as a snip. Please help, it's urgent.

    opened by minakshimathpal 8
  • Decrease memory usage

    Currently the memory usage is about 1.5GB per core. That's way too much, and it must be possible to decrease it. Figure out what's using all that ram (is it because the resize queue is full? should there be some backpressure on the downloader, etc.) and solve it.

    opened by rom1504 8
  • Interest in supporting video datasets?

    Hi. Thanks for the amazing repository. It really makes the workflow very easy. I was wondering if you are considering adding video datasets as well. Some are based on urls, while others are derived from youtube or segments from youtube.

    opened by TheShadow29 7
  • Add checksum of image

    I think it could be useful to add a checksum in the parquet files since we're downloading the images anyway and it's fast to compute. It would help us do a real deduplication, not only on urls but on actual image content.

    opened by borisdayma 7
  • add list of int, float feature in TFRecordSampleWriter

    We use list-of-int and list-of-float attributes in the coyo-labeled-300m dataset (it will be released soon). To create a dataset using img2dataset in tfrecord format, we need to add the above features.

    opened by justHungryMan 6
  • Figure out how to timeout

    I implemented some new metrics and found that many urls time out after 20s, which clearly slows down everything.

    Here are some examples:

    Downloaded (12, 'http://www.herteldenbirname.com/wp-content/uploads/2014/05/Italia-Independent-Flocked-Aviator-Sunglasses-150x150.jpg') in 10.019284009933472
    Downloaded (124, 'http://image.rakuten.co.jp/sneak/cabinet/shoes-03/cr-ucrocs5-a.jpg?_ex=128x128') in 10.01184344291687
    Downloaded (146, 'http://www.slicingupeyeballs.com/wp-content/uploads/2009/05/stoneroses452.jpg') in 10.006474256515503
    Downloaded (122, 'https://media.mwcradio.com/mimesis/2013-03/01/2013-03-01T153415Z_1_CBRE920179600_RTROPTP_3_TECH-US-GERMANY-EREADER_JPG_475x310_q85.jpg') in 10.241626739501953
    Downloaded (282, 'https://8d1aee3bcc.site.internapcdn.net/00/images/media/5/5cfb2eba8f1f6244c6f7e261b9320a90-1.jpg') in 10.431355476379395
    Downloaded (298, 'https://my-furniture.com.au/media/catalog/product/cache/1/small_image/295x295/9df78eab33525d08d6e5fb8d27136e95/a/u/au0019-stool-01.jpg') in 10.005694150924683
    Downloaded (300, 'http://images.tastespotting.com/thumbnails/889506.jpg') in 10.007027387619019
    Downloaded (330, 'https://www.infoworld.pk/wp-content/uploads/2016/02/Cool-HD-Valentines-Day-Wallpapers-480x300.jpeg') in 10.004335880279541
    Downloaded (361, 'http://pendantscarf.com/image/cache/data/necklace/JW0013-(2)-150x150.jpg') in 10.00539231300354
    Downloaded (408, 'https://www.solidrop.net/photo-6/animorphia-coloring-books-for-adults-children-drawing-book-secret-garden-style-relieve-stress-graffiti-painting-book.jpg') in 10.004313945770264

    Let's try to implement request timeout

    I tried #153, eventlet and #260, and none of them can timeout properly.

    A good value for timeout is 2s

    opened by rom1504 16
  • Add asyncio implementation of downloader

    #252 #256

    The implementation of the asyncio downloader. It can also run properly on Windows (without a 3rd-party dns resolver) with an average of 500~600Mbps (under a 1Gbps network).

    Use the command arg --downloader to choose the type of downloader ("normal", "async"):

    img2dataset --downloader async
    

    mscoco download test

    opened by KohakuBlueleaf 3
  • opencv-python => opencv-python-headless

    This PR replaces opencv-python with opencv-python-headless to remove the dependency on GUI-related libraries (see: https://github.com/opencv/opencv-python/issues/370#issuecomment-671202529). I tested this working on the python:3.9 Docker image.

    opened by shionhonda 2