Food Drinks and groceries Images Multi Lingual (FooDI-ML) dataset.

Overview

Foodi-ML dataset

This is the GitHub repository for the Food Drinks and groceries Images Multi Lingual (FooDI-ML) dataset. This dataset contains over 1.5M unique images and over 9.5M store names, product names, descriptions and collection sections gathered from the Glovo application. The data made available corresponds to food, drinks and groceries products from over 37 countries in Europe, the Middle East, Africa and Latin America. The dataset comprehends 33 languages, including 870k samples of languages of countries from Eastern Europe and West Asia such as Ukrainian and Kazakh, which have been so far underrepresented in publicly available visio-linguistic datasets. The dataset also includes widely spoken languages such as Spanish and English.

License

The FooDI-ML dataset is offered under the BY-NC-SA license.

1. Download the dataset

The FooDI-ML dataset is hosted in a S3 bucket in AWS. Therefore AWS CLI is needed to download it. Our dataset is composed of:

  • One DataFrame (glovo-foodi-ml-dataset) stored as a csv file containing all text information + image paths in S3. The size of this CSV file is 540 MB.
  • Set of images listed in the DataFrame. The disk space required to store all images is 316.1 GB.

1.1. Download AWS CLI

If you do not have AWS CLI already installed, please download the latest version of AWS CLI for your operating system.

1.2. Download FooDI-ML

  1. Run the following command to download the DataFrame in ENTER_DESTINATION_PATH directory. We provide an example as if we were going to download the dataset in the directory /mnt/data/foodi-ml/.

    aws s3 cp s3://glovo-products-dataset-d1c9720d/glovo-foodi-ml-dataset.csv ENTER_DESTINATION_PATH --no-sign-request

    Example: aws s3 cp s3://glovo-products-dataset-d1c9720d/glovo-foodi-ml-dataset.csv /mnt/data/foodi-ml/ --no-sign-request

  2. Run the following command to download the images in ENTER_DESTINATION_PATH/dataset directory (please note the appending of /dataset). This command will download the images in ENTER_DESTINATION_PATHdirectory.

    aws s3 cp --recursive s3://glovo-products-dataset-d1c9720d/dataset ENTER_DESTINATION_PATH/dataset --no-sign-request --quiet

    Example: aws s3 cp --recursive s3://glovo-products-dataset-d1c9720d/dataset /mnt/data/foodi-ml/dataset --no-sign-request --quiet

  3. Run the script rename_images.py. This script modifies the DataFrame column to include the paths of the images in the location you specified with ENTER_DESTINATION_PATH/dataset.

    pip install pandas
    python scripts/rename_images.py --output-dir ENTER_DESTINATION_PATH
    

Getting started

Our dataset is managed by the DataFrame glovo-foodi-ml-dataset.csv. This dataset contains the following columns:

  • country_code: This column comprehends 37 unique country codes as explained in our paper. These codes are:

    'ES', 'PL', 'CI', 'PT', 'MA', 'IT', 'AR', 'BG', 'KZ', 'BR', 'ME', 'TR', 'PE', 'SI', 'GE', 'EG', 'RS', 'RO', 'HR', 'UA', 'DO', 'KG', 'CR', 'UY', 'EC', 'HN', 'GH', 'KE', 'GT', 'CL', 'FR', 'BA', 'PA', 'UG', 'MD', 'NG', 'PR'

  • city_code: Name of the city where the store is located.

  • store_name: Name of the store selling that product. If store_name is equal to AS_XYZ, it represents an auxiliary store. This means that while the samples contained are for the most part valid, the store name can't be used in learning tasks

  • product_name: Name of the product. All products have product_name, so this column does not contain any NaN value.

  • collection_section: Name of the section of the product, used for organizing the store menu. Common values are "drinks", "our pizzas", "desserts". All products have collection_section associated to it, so this column does not have any NaN value in it.

  • product_description: A detailed description of the product, describing ingredients and components of it. Not all products of our data have description, so this column contains NaN values that must be removed by the researchers as a preprocessing step.

  • subset: Categorical variable indicating if the sample belongs to the Training, Validation or Test set. The respective values in the DataFrame are ["train", "val", "test"].

  • HIER: Boolean variable indicating if the store name can be used to retrieve product information (indicating if the store_name is not an auxiliary store (with code AS_XYZ)).

  • s3_path: Path of the image of the product in the disk location you chose.

Dataset Statistics

A notebook analyzing several dataset statistics is provided in notebooks/FooDI-ML Dataset Stats Analytics.ipynb.

Benchmark

To run the benchmark included in the original paper one must follow the procedure listed in the following link.

The hyperparameters of the model are included here link

Citation

This paper is under review. In the meanwhile you can cite it in arxiv: https://arxiv.org/abs/2110.02035

Tensorflow-Project-Template - A best practice for tensorflow project template architecture.

Tensorflow Project Template A simple and well designed structure is essential for any Deep Learning project, so after a lot of practice and contributi

Mahmoud G. Salem 3.6k Dec 22, 2022
Automated Hyperparameter Optimization Competition

QQ浏览器2021AI算法大赛 - 自动超参数优化竞赛 ACM CIKM 2021 AnalyticCup 在信息流推荐业务场景中普遍存在模型或策略效果依赖于“超参数”的问题,而“超参数"的设定往往依赖人工经验调参,不仅效率低下维护成本高,而且难以实现更优效果。因此,本次赛题以超参数优化为主题,从真

20 Dec 09, 2021
MGFN: Multi-Graph Fusion Networks for Urban Region Embedding was accepted by IJCAI-2022.

Multi-Graph Fusion Networks for Urban Region Embedding (IJCAI-22) This is the implementation of Multi-Graph Fusion Networks for Urban Region Embedding

202 Nov 18, 2022
Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting

Autoformer (NeurIPS 2021) Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting Time series forecasting is a c

THUML @ Tsinghua University 847 Jan 08, 2023
Text to Image Generation with Semantic-Spatial Aware GAN

text2image This repository includes the implementation for Text to Image Generation with Semantic-Spatial Aware GAN This repo is not completely. Netwo

CVDDL 124 Dec 30, 2022
A PyTorch implementation of Radio Transformer Networks from the paper "An Introduction to Deep Learning for the Physical Layer".

An Introduction to Deep Learning for the Physical Layer An usable PyTorch implementation of the noisy autoencoder infrastructure in the paper "An Intr

Gram.AI 120 Nov 21, 2022
UNet model with VGG11 encoder pre-trained on Kaggle Carvana dataset

TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation By Vladimir Iglovikov and Alexey Shvets Introduction TernausNet is

Vladimir Iglovikov 1k Dec 28, 2022
Non-stationary GP package written from scratch in PyTorch

NSGP-Torch Examples gpytorch model with skgpytorch # Import packages import torch from regdata import NonStat2D from gpytorch.kernels import RBFKernel

Zeel B Patel 1 Mar 06, 2022
Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples

Welcome to the cuQuantum repository! This public repository contains two sets of files related to the NVIDIA cuQuantum SDK: samples: All C/C++ sample

NVIDIA Corporation 147 Dec 27, 2022
Spearmint Bayesian optimization codebase

Spearmint Spearmint is a software package to perform Bayesian optimization. The Software is designed to automatically run experiments (thus the code n

Formerly: Harvard Intelligent Probabilistic Systems Group -- Now at Princeton 1.5k Dec 29, 2022
Adapter-BERT: Parameter-Efficient Transfer Learning for NLP.

Adapter-BERT: Parameter-Efficient Transfer Learning for NLP.

Google Research 340 Jan 03, 2023
Human motion synthesis using Unity3D

Human motion synthesis using Unity3D Prerequisite: Software: amc2bvh.exe, Unity 2017, Blender. Unity: RockVR (Video Capture), scenes, character models

Hao Xu 9 Jun 01, 2022
Provide baselines and evaluation metrics of the task: traffic flow prediction

Note: This repo is adpoted from https://github.com/UNIMIBInside/Smart-Mobility-Prediction. Due to technical reasons, I did not fork their code. Introd

Zhangzhi Peng 11 Nov 02, 2022
An official source code for "Augmentation-Free Self-Supervised Learning on Graphs"

Augmentation-Free Self-Supervised Learning on Graphs An official source code for Augmentation-Free Self-Supervised Learning on Graphs paper, accepted

Namkyeong Lee 59 Dec 01, 2022
The World of an Octopus: How Reporting Bias Influences a Language Model's Perception of Color

The World of an Octopus: How Reporting Bias Influences a Language Model's Perception of Color Overview Code and dataset for The World of an Octopus: H

1 Nov 13, 2021
A curated list of programmatic weak supervision papers and resources

A curated list of programmatic weak supervision papers and resources

Jieyu Zhang 118 Jan 02, 2023
GDR-Net: Geometry-Guided Direct Regression Network for Monocular 6D Object Pose Estimation. (CVPR 2021)

GDR-Net This repo provides the PyTorch implementation of the work: Gu Wang, Fabian Manhardt, Federico Tombari, Xiangyang Ji. GDR-Net: Geometry-Guided

169 Jan 07, 2023
Apply a perspective transformation to a raster image inside Inkscape (no need to use an external software such as GIMP or Krita).

Raster Perspective Apply a perspective transformation to bitmap image using the selected path as envelope, without the need to use an external softwar

s.ouchene 19 Dec 22, 2022
A PyTorch implementation of "DGC-Net: Dense Geometric Correspondence Network"

DGC-Net: Dense Geometric Correspondence Network This is a PyTorch implementation of our work "DGC-Net: Dense Geometric Correspondence Network" TL;DR A

191 Dec 16, 2022
A highly efficient, fast, powerful and light-weight anime downloader and streamer for your favorite anime.

AnimDL - Download & Stream Your Favorite Anime AnimDL is an incredibly powerful tool for downloading and streaming anime. Core features Abuses the dev

KR 759 Jan 08, 2023