SHIFT15M: multiobjective large-scale fashion dataset with distributional shifts

Last update: Nov 24, 2022

Overview

The main motivation of the SHIFT15M project is to provide a dataset that contains natural dataset shifts, collected from the web service IQON, which was in operation for a decade. In addition, the SHIFT15M dataset contains several types of dataset shift, which makes it possible to evaluate the robustness of a model to different types of shift (e.g., covariate shift and target shift).

We provide the Datasheet for SHIFT15M. This datasheet is based on the Datasheets for Datasets [1] template.

System	Python 3.6	Python 3.7	Python 3.8
Linux CPU
Linux GPU
Windows CPU / GPU	Status Currently Unavailable	Status Currently Unavailable	Status Currently Unavailable
Mac OS CPU

SHIFT15M is a large-scale dataset based on approximately 15 million items accumulated by the fashion search service IQON.

Installation

From PyPI

$ pip install shift15m

From source

$ git clone https://github.com/st-tech/zozo-shift15m.git
$ cd zozo-shift15m
$ poetry build
$ pip install dist/shift15m-xxxx-py3-none-any.whl

Download SHIFT15M dataset

Use Dataset class

You can download the SHIFT15M dataset as follows:

from shift15m.datasets import NumLikesRegression

dataset = NumLikesRegression(root="./data", download=True)

Download directly by using download scripts

Please download the dataset as follows:

$ bash scripts/download_all.sh

To avoid downloading the test dataset for set matching (80GB), which is not required in training, you can use the following script.

$ bash scripts/download_all_wo_set_testdata.sh

Tasks

The following tasks are now available:

Tasks	Task type	Shift type	# of input dim	# of output dim
NumLikesRegression	regression	target shift	(N, 25)	(N, 1)
SumPricesRegression	regression	covariate shift, target shift	(N, 1)	(N, 1)
ItemPriceRegression	regression	target shift	(N, 4096)	(N, 1)
ItemCategoryClassification	classification	target shift	(N, 4096)	(N, 7)
Set2SetMatching	set-to-set matching	covariate shift	(N, 4096)x(M, 4096)	(1)
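
The two shift types named in the table can be illustrated on synthetic data. The sketch below is not part of the shift15m package and uses only NumPy; the distributions, sample sizes, and names are assumptions chosen for illustration. Under covariate shift the input distribution p(x) differs between training and test while p(y|x) stays fixed; under target shift the label distribution p(y) differs while p(x|y) stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def label_fn(x):
    # Fixed conditional p(y | x): the same rule is used for train and test.
    return 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

# Covariate shift: inputs come from different distributions.
x_train = rng.normal(0.0, 1.0, size=n)
x_test = rng.normal(1.0, 1.0, size=n)
y_train, y_test = label_fn(x_train), label_fn(x_test)

def sample_classes(prior, size):
    # Target shift: class priors vary, class-conditional p(x | y) is shared.
    y = rng.choice(len(prior), size=size, p=prior)
    x = rng.normal(loc=y.astype(float), scale=0.5)  # p(x | y) = N(y, 0.5^2)
    return x, y

_, c_train = sample_classes([0.8, 0.2], n)
_, c_test = sample_classes([0.3, 0.7], n)

print("mean of x, train vs. test:", x_train.mean(), x_test.mean())
print("P(y = 1), train vs. test:", c_train.mean(), c_test.mean())
```

In the toy data the input mean moves from about 0 to about 1 in the first case, and the share of class 1 moves from about 0.2 to about 0.7 in the second.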

Benchmarks

As templates for numerical experiments on the SHIFT15M dataset, we have published experimental results for each task with several models.

Original Dataset Structure

The original dataset is maintained in json format, and a row consists of the following:

{
  "user":{"user_id":"xxxx", "fav_brand_ids":"xxxx,xx,..."},
  "like_num":"xx",
  "set_id":"xxx",
  "items":[
    {"price":"xxxx","item_id":"xxxxxx","category_id1":"xx","category_id2":"xxxxx"},
    ...
  ],
  "publish_date":"yyyy-mm-dd"
}
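
As a rough sketch of how one such row might be read in Python: the sample record below is made up (the IDs, prices, and date are placeholders, not real data), and the assumption that numeric fields arrive as strings follows the quoted values in the schema above. `parse_row` is a hypothetical helper, not part of the shift15m package.

```python
import json
from datetime import datetime

# Made-up sample row following the schema above (all values are placeholders).
raw = """
{
  "user": {"user_id": "1001", "fav_brand_ids": "12,34,56"},
  "like_num": "7",
  "set_id": "42",
  "items": [
    {"price": "3000", "item_id": "900001", "category_id1": "10", "category_id2": "10001"},
    {"price": "5500", "item_id": "900002", "category_id1": "11", "category_id2": "11002"}
  ],
  "publish_date": "2015-04-01"
}
"""

def parse_row(text):
    """Convert the string-typed fields of one row into Python types."""
    row = json.loads(text)
    return {
        "set_id": row["set_id"],
        "user_id": row["user"]["user_id"],
        # Comma-separated brand IDs become a list of strings.
        "fav_brand_ids": [b for b in row["user"]["fav_brand_ids"].split(",") if b],
        "like_num": int(row["like_num"]),
        "prices": [int(item["price"]) for item in row["items"]],
        "publish_date": datetime.strptime(row["publish_date"], "%Y-%m-%d").date(),
    }

parsed = parse_row(raw)
print(parsed["like_num"], sum(parsed["prices"]), parsed["publish_date"].year)
```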

Contributing

To learn more about making a contribution to SHIFT15M, please see the following materials:

License

The dataset itself is provided under a CC BY-NC 4.0 license. On the other hand, the software in this repository is provided under the MIT license.

Dataset metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property	value
name	SHIFT15M Dataset
alternateName	SHIFT15M
alternateName	shift15m-dataset
url	https://github.com/st-tech/zozo-shift15m
sameAs	https://github.com/st-tech/zozo-shift15m
description	SHIFT15M is a multi-objective, multi-domain dataset which includes multiple dataset shifts.

provider

property	value
name	`ZOZO Research`
sameAs	`https://ja.wikipedia.org/wiki/ZOZO`

license

property	value
name	`CC BY-NC 4.0`
url	`https://github.com/st-tech/zozo-shift15m/blob/main/LICENSE.CC`

Citation

@misc{Kimura_SHIFT15M_Multiobjective_LargeScale_2021,
author = {Kimura, Masanari and Nakamura, Takuma and Saito, Yuki},
month = {8},
title = {SHIFT15M: Multiobjective Large-Scale Fashion Dataset with Distributional Shifts},
year = {2021}
}

Errata

No errata are currently available.

References

[1] Gebru, Timnit, et al. "Datasheets for datasets." arXiv preprint arXiv:1803.09010 (2018).

Comments

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

documentation datasheet

opened by nocotan 3
Extracting Image Features
@nocotan I'm planning to prepare image features as we discussed. To be extracted:

CNN features (2048 dimensional features from the pre-trained Inception-V3 model on ILSVRC2012)

By the way, I was looking for a well-established hand-crafted image feature extractor that uses color, but could not find any available code. For instance, the following paper reports that combining Local Binary Pattern (LBP) and Local Color Contrast (LCC) outperformed other color-based hand-crafted features in a texture classification task, but no open-source implementation of LCC is available. https://www.researchgate.net/publication/315858786_Hand-Crafted_vs_Learned_Descriptors_for_Color_Texture_Classification

So I'm planning not to include a hand-crafted feature for the image-based task.
opened by wildsnowman 2
Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.

documentation datasheet

opened by nocotan 2
add LICENSE
Before publishing, we need to determine the license of the repository, e.g.,

MIT

Apache

BSD

GPL

After researching which license is appropriate, please add the LICENSE to the repository.
documentation
opened by nocotan 2
Got a TypeError exception when trying to run the item category prediction task
First of all, thank you for your great work and for releasing the dataset.

Description: When I tried to run the item_category_prediction task following the usage (item_category_prediction), I got an exception like this:

Environment:

Python 3.8.8

Any advice would be very helpful. Thank you.
bug
opened by you0xy 1
Information: the dataset size

the number of outfits: 2,555,147
the number of images (multiple-counting): 15,218,721
the number of unique images: 2,335,598

Note: maybe shift28M is not the correct name.

opened by wildsnowman 1
How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

documentation datasheet

opened by nocotan 1
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

documentation datasheet

opened by nocotan 1
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?

documentation datasheet

opened by nocotan 1
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

documentation datasheet

opened by nocotan 1
Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

documentation datasheet

opened by nocotan 1
Bug: the number of val/test data is not consistent with other cases when the same years are selected for train_year and test_year.

Describe the bug: In set matching, the numbers of samples are restricted to 30816, 3851, and 3851 for the train, val, and test data, respectively; however, when the same years are selected for train_year and test_year, these numbers become inconsistent.

This bug may lead to inappropriate experiments when changing train_year and test_year.
bug

opened by wildsnowman 0
disjoint set matching

Parent Task

set matching

Model List

Note

It may be necessary to conduct set matching experiments under the disjoint setting. In this setting, testing uses only items that do not appear during training; we call this setting disjoint.
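
One way to picture the disjoint setting in the note above is as a filter on the test split: keep only test sets that share no item with any training set. The sketch below assumes each set is a dict with a `set_id` and a list of `item_ids`; that layout, the toy data, and the helper name `disjoint_test_sets` are illustrative assumptions, not the repository's implementation.

```python
def disjoint_test_sets(train_sets, test_sets):
    """Keep only test sets with no item that appears in any training set."""
    train_items = {item for s in train_sets for item in s["item_ids"]}
    return [s for s in test_sets if train_items.isdisjoint(s["item_ids"])]

# Toy data: the item IDs are placeholders.
train_sets = [
    {"set_id": "a", "item_ids": ["i1", "i2", "i3"]},
    {"set_id": "b", "item_ids": ["i4", "i5"]},
]
test_sets = [
    {"set_id": "c", "item_ids": ["i6", "i7"]},  # no overlap with training, kept
    {"set_id": "d", "item_ids": ["i2", "i8"]},  # shares i2 with training, dropped
]

kept = disjoint_test_sets(train_sets, test_sets)
print([s["set_id"] for s in kept])
```

Dropping whole sets is only one possible choice; an alternative would be to remove the overlapping items and keep the remainder of each test set.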

References

https://arxiv.org/abs/1804.09979
benchmark

opened by wildsnowman 0
Implementation of the set data loader with tags

Is your feature request related to a problem? Please describe. We added tag information to our dataset, so it would be good to implement an additional data loader that uses it.

Describe the solution you'd like: This can be accomplished by adding arguments to an existing data loader.


opened by nocotan 0

Releases(v0.2.0)

v0.2.0(Sep 20, 2022)

add tags info as follows:

{
  "user":{"user_id":"xxxx", "fav_brand_ids":"xxxx,xx,..."},
  "like_num":"xx",
  "set_id":"xxx",
  "items":[
    {"price":"xxxx","item_id":"xxxxxx","category_id1":"xx","category_id2":"xxxxx"},
    ...
  ],
  "publish_date":"yyyy-mm-dd",
  "tags": "tag_a, tag_b, tag_c, ..."
}

add superset matching benchmark
fix a label creation bug on set matching with multiple splits
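
The `tags` field in the v0.2.0 record format above is a single comma-separated string. A minimal sketch of splitting it into a list, assuming tags never contain commas themselves (the helper name `split_tags` and the sample row are hypothetical):

```python
def split_tags(tags):
    """Split a comma-separated tag string into trimmed, non-empty tags."""
    return [t.strip() for t in tags.split(",") if t.strip()]

# Placeholder row using the tag format from the release note.
row = {"set_id": "xxx", "tags": "tag_a, tag_b, tag_c"}
print(split_tags(row["tags"]))
```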


v.0.1.2(Nov 24, 2021)

v.0.1.1(Sep 6, 2021)


Owner

ZOZO, Inc.
