SHIFT15M: multiobjective large-scale fashion dataset with distributional shifts

Last update: Nov 24, 2022

Overview

The SHIFT15M project provides a dataset containing naturally occurring dataset shifts, collected from the web service IQON, which was in operation for a decade. The dataset exhibits several types of shift (e.g., covariate shift and target shift), which makes it possible to evaluate a model's robustness to each of them.

We provide the Datasheet for SHIFT15M. This datasheet is based on the Datasheets for Datasets [1] template.

Build status is reported for Python 3.6, 3.7, and 3.8 on Linux CPU, Linux GPU, and Mac OS CPU. Status for Windows CPU / GPU is currently unavailable.

SHIFT15M is a large-scale dataset based on approximately 15 million items accumulated by the fashion search service IQON.

Installation

From PyPI

$ pip install shift15m

From source

$ git clone https://github.com/st-tech/zozo-shift15m.git
$ cd zozo-shift15m
$ poetry build
$ pip install dist/shift15m-xxxx-py3-none-any.whl

Download SHIFT15M dataset

Use Dataset class

You can download SHIFT15M dataset as follows:

from shift15m.datasets import NumLikesRegression

dataset = NumLikesRegression(root="./data", download=True)

Download directly by using download scripts

Please download the dataset as follows:

$ bash scripts/download_all.sh

To avoid downloading the test dataset for set matching (80GB), which is not required in training, you can use the following script.

$ bash scripts/download_all_wo_set_testdata.sh

Tasks

The following tasks are now available:

Tasks	Task type	Shift type	# of input dim	# of output dim
NumLikesRegression	regression	target shift	(N, 25)	(N, 1)
SumPricesRegression	regression	covariate shift, target shift	(N, 1)	(N, 1)
ItemPriceRegression	regression	target shift	(N, 4096)	(N, 1)
ItemCategoryClassification	classification	target shift	(N, 4096)	(N, 7)
Set2SetMatching	set-to-set matching	covariate shift	(N, 4096)x(M, 4096)	(1)

Benchmarks

As templates for numerical experiments on the SHIFT15M dataset, we have published experimental results for each task with several models.

Original Dataset Structure

The original dataset is maintained in json format, and a row consists of the following:

{
  "user":{"user_id":"xxxx", "fav_brand_ids":"xxxx,xx,..."},
  "like_num":"xx",
  "set_id":"xxx",
  "items":[
    {"price":"xxxx","item_id":"xxxxxx","category_id1":"xx","category_id2":"xxxxx"},
    ...
  ],
  "publish_date":"yyyy-mm-dd"
}
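As a minimal sketch of how one such row might be read, the snippet below parses a made-up record that follows the schema above. It assumes that `price` and `like_num` are integers stored as strings and that `fav_brand_ids` is a comma-separated string, as the placeholders suggest; `parse_row` and every value in the sample row are hypothetical and are not part of the shift15m package.

```python
import json

# A made-up row that follows the documented schema; all values are placeholders.
row_text = """
{
  "user": {"user_id": "u001", "fav_brand_ids": "12,34,56"},
  "like_num": "7",
  "set_id": "s001",
  "items": [
    {"price": "1200", "item_id": "i001", "category_id1": "10", "category_id2": "10001"},
    {"price": "3400", "item_id": "i002", "category_id1": "11", "category_id2": "11002"}
  ],
  "publish_date": "2015-04-01"
}
"""


def parse_row(text):
    """Convert one JSON row into a dict with typed fields (hypothetical helper)."""
    raw = json.loads(text)
    return {
        "set_id": raw["set_id"],
        "user_id": raw["user"]["user_id"],
        # Assumption: fav_brand_ids is a comma-separated string of brand ids.
        "fav_brand_ids": [b for b in raw["user"]["fav_brand_ids"].split(",") if b],
        # Assumption: numeric fields are integers encoded as strings.
        "like_num": int(raw["like_num"]),
        "prices": [int(item["price"]) for item in raw["items"]],
        "item_ids": [item["item_id"] for item in raw["items"]],
        # publish_date is documented as yyyy-mm-dd, so the year is the first 4 characters.
        "publish_year": int(raw["publish_date"][:4]),
    }


parsed = parse_row(row_text)
total_price = sum(parsed["prices"])
print(parsed["publish_year"], parsed["like_num"], total_price)
```

The publication year extracted here is the kind of field that a year-based train/test split would key on, and the summed price is one plausible regression target for a set.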

Contributing

To learn more about making a contribution to SHIFT15M, please see the following materials:

License

The dataset itself is provided under a CC BY-NC 4.0 license. On the other hand, the software in this repository is provided under the MIT license.

Dataset metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property	value
name	SHIFT15M Dataset
alternateName	SHIFT15M
alternateName	shift15m-dataset
url	https://github.com/st-tech/zozo-shift15m
sameAs	https://github.com/st-tech/zozo-shift15m
description	SHIFT15M is a multi-objective, multi-domain dataset which includes multiple dataset shifts.

provider

property	value
name	`ZOZO Research`
sameAs	`https://ja.wikipedia.org/wiki/ZOZO`

license

property	value
name	`CC BY-NC 4.0`
url	`https://github.com/st-tech/zozo-shift15m/blob/main/LICENSE.CC`

Citation

@misc{Kimura_SHIFT15M_Multiobjective_LargeScale_2021,
author = {Kimura, Masanari and Nakamura, Takuma and Saito, Yuki},
month = {8},
title = {SHIFT15M: Multiobjective Large-Scale Fashion Dataset with Distributional Shifts},
year = {2021}
}

Errata

No errata are currently available.

References

[1] Gebru, Timnit, et al. "Datasheets for datasets." arXiv preprint arXiv:1803.09010 (2018).

Comments

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

documentation datasheet

opened by nocotan 3
Extracting Image Features
@nocotan I'm planning to prepare image features as we discussed. To be extracted:

CNN features (2048 dimensional features from the pre-trained Inception-V3 model on ILSVRC2012)

By the way, I tried to find a suitable hand-crafted image feature extractor that takes color into account, but could not find any available code. For instance, the following paper reports that combining Local Binary Pattern (LBP) and Local Color Contrast (LCC) outperformed other color-based hand-crafted features in a texture classification task, but no open-source implementation of LCC is available. https://www.researchgate.net/publication/315858786_Hand-Crafted_vs_Learned_Descriptors_for_Color_Texture_Classification

So I'm planning not to include a hand-crafted feature for the image-based task.
opened by wildsnowman 2
Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.

documentation datasheet

opened by nocotan 2
add LICENSE
Before publishing, we need to determine the license of the repository, e.g.,

MIT

Apache

BSD

GPL

After researching which license is appropriate, please add the LICENSE to the repository.
documentation
opened by nocotan 2
Got a TypeError exception when trying to run the item category prediction task
First of all, thank you for your great work and for releasing the dataset.

Description: When I tried to run the item_category_prediction task following the usage for item_category_prediction, I got an exception like this:

Environment:

Python 3.8.8

It would be very helpful if you could offer any advice. Thank you.
bug
opened by you0xy 1
Information: the dataset size

the number of outfits: 2,555,147
the number of images (multiple-counting): 15,218,721
the number of unique images: 2,335,598

Note: maybe shift28M is not the correct name.
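
To put these counts in perspective, the short calculation below derives two ratios from them: the average number of item images per outfit, and how often each unique image is reused across outfits on average. Only the three counts come from the text above; the variable names are illustrative.

```python
# Counts quoted above.
num_outfits = 2_555_147
num_images_with_repeats = 15_218_721  # images counted once per outfit they appear in
num_unique_images = 2_335_598

# Average number of item images in one outfit.
items_per_outfit = num_images_with_repeats / num_outfits

# Average number of outfits in which a given unique image appears.
uses_per_unique_image = num_images_with_repeats / num_unique_images

print(f"average items per outfit: {items_per_outfit:.2f}")
print(f"average uses per unique image: {uses_per_unique_image:.2f}")
```

Both ratios come out at roughly six, and the multiple-counting total of about 15 million is the figure that matches the 15M in the dataset name.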

opened by wildsnowman 1
How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
documentation datasheet

opened by nocotan 1
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

documentation datasheet

opened by nocotan 1
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?
documentation datasheet

opened by nocotan 1
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

documentation datasheet

opened by nocotan 1
Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.
documentation datasheet

opened by nocotan 1
Bug: the numbers of val/test samples are inconsistent with other cases when the same year is selected for train_year and test_year.

Describe the bug: In set matching, the numbers of samples are restricted to 30816, 3851, and 3851 for the train, val, and test data, respectively; however, when the same year is selected for train_year and test_year, the numbers no longer match these values.

This bug may invalidate experiments that vary train_year and test_year.
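
One way to keep the split sizes identical in both situations is sketched below. This is not the repository's implementation: `split_by_year` is a hypothetical helper, and it assumes that training sets are drawn from train_year while validation and test sets are drawn from test_year. When the two years coincide, all three parts are carved out of a single shuffled pool so that they stay disjoint and keep the requested sizes.

```python
import random


def split_by_year(sets_by_year, train_year, test_year,
                  n_train, n_val, n_test, seed=0):
    """Return (train, val, test) lists with fixed sizes (hypothetical helper)."""
    rng = random.Random(seed)

    if train_year == test_year:
        # Same year: take all three parts from one shuffled pool, without overlap.
        pool = list(sets_by_year[train_year])
        needed = n_train + n_val + n_test
        if len(pool) < needed:
            raise ValueError(f"year {train_year} has {len(pool)} sets, need {needed}")
        rng.shuffle(pool)
        train = pool[:n_train]
        val = pool[n_train:n_train + n_val]
        test = pool[n_train + n_val:needed]
    else:
        # Different years: train from train_year, val and test from test_year.
        train_pool = list(sets_by_year[train_year])
        eval_pool = list(sets_by_year[test_year])
        if len(train_pool) < n_train or len(eval_pool) < n_val + n_test:
            raise ValueError("not enough sets for the requested split sizes")
        rng.shuffle(train_pool)
        rng.shuffle(eval_pool)
        train = train_pool[:n_train]
        val = eval_pool[:n_val]
        test = eval_pool[n_val:n_val + n_test]

    return train, val, test


# Toy data standing in for per-year collections of set ids.
sets_by_year = {
    2013: [f"2013-set-{i}" for i in range(20)],
    2017: [f"2017-set-{i}" for i in range(20)],
}
cross = split_by_year(sets_by_year, 2013, 2017, n_train=8, n_val=2, n_test=2)
same = split_by_year(sets_by_year, 2013, 2013, n_train=8, n_val=2, n_test=2)
print([len(part) for part in cross])
print([len(part) for part in same])
```

With the sizes from the report, the call would use n_train=30816, n_val=3851, and n_test=3851; raising an error when a year has too few sets makes a size mismatch visible instead of silently shrinking a split.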
bug

opened by wildsnowman 0
disjoint set matching

Parent Task

set matching

Model List

Note

It may be necessary to conduct set matching experiments under a disjoint setting, in which testing uses only items that were not seen during training; we call this setting disjoint.
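
A minimal sketch of one way to build such a disjoint test split follows. It assumes that each set is represented as a list of item ids and that a test set is dropped entirely if it shares any item with the training data; `make_disjoint_test` is a hypothetical helper, not a function from this repository.

```python
def make_disjoint_test(train_sets, test_sets):
    """Keep only the test sets that share no item with any training set."""
    # Collect every item id that appears anywhere in the training data.
    train_items = {item for s in train_sets for item in s}
    # set.isdisjoint accepts any iterable, so each test set can stay a list.
    return [s for s in test_sets if train_items.isdisjoint(s)]


# Toy item ids for illustration only.
train_sets = [["a", "b"], ["c"]]
test_sets = [["d", "e"], ["a", "f"], ["g"]]

disjoint_test = make_disjoint_test(train_sets, test_sets)
print(disjoint_test)
```

An alternative design would remove only the overlapping items from each test set instead of discarding the whole set, which keeps more test data but changes set sizes.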

References

https://arxiv.org/abs/1804.09979
benchmark

opened by wildsnowman 0
Implementation of the set data loader with tags

Is your feature request related to a problem? Please describe. We added tag information to our dataset, so it would be good to implement an additional data loader that exposes the tags.

Describe the solution you'd like: This can be accomplished by adding arguments to an existing data loader.


opened by nocotan 0

Releases(v0.2.0)

v0.2.0(Sep 20, 2022)

add tags info as follows:

{
  "user":{"user_id":"xxxx", "fav_brand_ids":"xxxx,xx,..."},
  "like_num":"xx",
  "set_id":"xxx",
  "items":[
    {"price":"xxxx","item_id":"xxxxxx","category_id1":"xx","category_id2":"xxxxx"},
    ...
  ],
  "publish_date":"yyyy-mm-dd",
  "tags": "tag_a, tag_b, tag_c, ..."
}

add superset matching benchmark
fix a label creation bug on set matching with multiple splits
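
The `tags` field added in this release is a single comma-separated string, so a consumer would typically split it into a list before use. The sketch below assumes that tags never contain commas themselves; `parse_tags` is a hypothetical helper rather than part of the shift15m package.

```python
def parse_tags(tags_field):
    """Split a comma-separated tags string into a list of trimmed, non-empty tags."""
    return [tag.strip() for tag in tags_field.split(",") if tag.strip()]


tags = parse_tags("tag_a, tag_b, tag_c")
no_tags = parse_tags("")
print(tags, no_tags)
```

Filtering out empty pieces means an empty string yields an empty list rather than a list containing one empty tag.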


v.0.1.2(Nov 24, 2021)

v.0.1.1(Sep 6, 2021)


Owner

ZOZO, Inc.
