Download and preprocess popular sequential recommendation datasets

Overview

Build Status codebeat badge

Sequential Recommendation Datasets

This repository collects some commonly used sequential recommendation datasets in recent research papers and provides a tool for downloading, preprocessing and batch-loading those datasets. The preprocessing method can be customized based on the task, for example: short-term recommendation (including session-based recommendation) and long-short term recommendation. Loading has faster version which intergrates the DataLoader of PyTorch.

Datasets

Install this tool

Stable version

pip install -U srdatasets —-user

Latest version

pip install git+https://github.com/guocheng2018/sequential-recommendation-datasets.git --user

Download datasets

Run the command below to download datasets. As some datasets are not directly accessible, you'll be warned to download them manually and place them somewhere it tells you.

srdatasets download --dataset=[dataset_name]

To get a view of downloaded and processed status of all datasets, run

srdatasets info

Process datasets

The generic processing command is

srdatasets process --dataset=[dataset_name] [--options]

Splitting options

Two dataset splitting methods are provided: user-based and time-based. User-based means that splitting is executed on every user behavior sequence given the ratio of validation set and test set, while time-based means that splitting is based on the date of user behaviors. After splitting some dataset, two processed datasets are generated, one for development, which uses the validation set as the test set, the other for test, which contains the full training set.

--split-by     User or time (default: user)
--test-split   Proportion of test set to full dataset (default: 0.2)
--dev-split    Proportion of validation set to full training set (default: 0.1)

NOTE: time-based splitting need you to manually input days at console by tipping you total days of that dataset, since you may not know.

Task related options

For short term recommnedation task, you use previous input-len items to predict next target-len items. To make user interests more focused, user behavior sequences can also be cut into sessions if session-interval is given. If the number of previous items is smaller than input-len, 0 is padded to the left.

For long and short term recommendation task, you use pre-sessions previous sessions and current session to predict target-len items. The target items are picked randomly or lastly from current session. So the length of current session is max-session-len - target-len while the length of any previous session is max-session-len. If any previous session or current session is shorter than the preset length, 0 is padded to the left.

--task              Short or long-short (default: short)
--input-len         Number of previous items (default: 5)
--target-len        Number of target items (default: 1)
--pre-sessions      Number of previous sessions (default: 10)
--pick-targets      Randomly or lastly pick items from current session (default: random)
--session-interval  Session splitting interval (minutes)  (default: 0)
--min-session-len   Sessions less than this in length will be dropped  (default: 2)
--max-session-len   Sessions greater than this in length will be cut  (default: 20)

Common options

--min-freq-item        Items less than this in frequency will be dropped (default: 5)
--min-freq-user        Users less than this in frequency will be dropped (default: 5)
--no-augment           Do not use data augmentation (default: False)
--remove-duplicates    Remove duplicated items in user sequence or user session (if splitted) (default: False)

Dataset related options

--rating-threshold  Interactions with rating less than this will be dropped (Amazon, Movielens, Yelp) (default: 4)
--item-type         Recommend artists or songs (Lastfm) (default: song)

Version

By using different options, a dataset will have many processed versions. You can run the command below to get configurations and statistics of all processed versions of some dataset. The config id shown in output is a required argument of DataLoader.

srdatasets info --dataset=[dataset_name]

DataLoader

DataLoader is a built-in class that makes loading processed datasets easy. Practically, once initialized a dataloder by passing the dataset name, processed version (config id), batch_size and a flag to load training data or test data, you can then loop it to get batch data. Considering that some models use rank-based learning, negative sampling is intergrated into DataLoader. The negatives are sampled from all items except items in current data according to popularity. By default it (negatives_per_target) is turned off. Also, the time of user behaviors is sometimes an important feature, you can include it into batch data by setting include_timestmap to True.

Arguments

  • dataset_name: dataset name (case insensitive)
  • config_id: configuration id
  • batch_size: batch size (default: 1)
  • train: load training dataset (default: True)
  • development: load the dataset aiming for development (default: False)
  • negatives_per_target: number of negative samples per target (default: 0)
  • include_timestamp: add timestamps to batch data (default: False)
  • drop_last: drop last incomplete batch (default: False)

Attributes

  • num_users: total users in training dataset
  • num_items: total items in training dataset (not including the padding item 0)

Initialization example

from srdatasets.dataloader import DataLoader

trainloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=True, negatives_per_target=5, include_timestamp=True)
testloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=False, include_timestamp=True)

For pytorch users, there is a wrapper implementation of torch.utils.data.DataLoader, you can then set keyword arguments like num_workers and pin_memory to speed up loading data

from srdatasets.dataloader_pytorch import DataLoader

trainloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=True, negatives_per_target=5, include_timestamp=True, num_workers=8, pin_memory=True)
testloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=False, include_timestamp=True, num_workers=8, pin_memory=True)

Iteration template

For short term recommendation task

for epoch in range(10):
    # Train
    for users, input_items, target_items, input_item_timestamps, target_item_timestamps, negative_samples in trainloader:
        # Shape
        #   users:                  (batch_size,)
        #   input_items:            (batch_size, input_len)
        #   target_items:           (batch_size, target_len)
        #   input_item_timestamps:  (batch_size, input_len)
        #   target_item_timestamps: (batch_size, target_len)
        #   negative_samples:       (batch_size, target_len, negatives_per_target)
        #
        # DataType
        #   numpy.ndarray or torch.LongTensor
        pass

    # Test
    for users, input_items, target_items, input_item_timestamps, target_item_timestamps in testloader:
        pass

For long and short term recommendation task

for epoch in range(10):
    # Train
    for users, pre_sessions_items, cur_session_items, target_items, pre_sessions_item_timestamps, cur_session_item_timestamps, target_item_timestamps, negative_samples in trainloader:
        # Shape
        #   users:                          (batch_size,)
        #   pre_sessions_items:             (batch_size, pre_sessions * max_session_len)
        #   cur_session_items:              (batch_size, max_session_len - target_len)
        #   target_items:                   (batch_size, target_len)
        #   pre_sessions_item_timestamps:   (batch_size, pre_sessions * max_session_len)
        #   cur_session_item_timestamps:    (batch_size, max_session_len - target_len)
        #   target_item_timestamps:         (batch_size, target_len)
        #   negative_samples:               (batch_size, target_len, negatives_per_target)
        #
        # DataType
        #   numpy.ndarray or torch.LongTensor
        pass

    # Test
    for users, pre_sessions_items, cur_session_items, target_items, pre_sessions_item_timestamps, cur_session_item_timestamps, target_item_timestamps in testloader:
        pass

Disclaimers

This repo does not host or distribute any of the datasets, it is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

Code accompanying "Adaptive Methods for Aggregated Domain Generalization"

Adaptive Methods for Aggregated Domain Generalization (AdaClust) Official Pytorch Implementation of Adaptive Methods for Aggregated Domain Generalizat

Xavier Thomas 15 Sep 20, 2022
Implementation of DropLoss for Long-Tail Instance Segmentation in Pytorch

[AAAI 2021]DropLoss for Long-Tail Instance Segmentation [AAAI 2021] DropLoss for Long-Tail Instance Segmentation Ting-I Hsieh*, Esther Robb*, Hwann-Tz

Tim 37 Dec 02, 2022
Source code for "Understanding Knowledge Integration in Language Models with Graph Convolutions"

Graph Convolution Simulator (GCS) Source code for "Understanding Knowledge Integration in Language Models with Graph Convolutions" Requirements: PyTor

yifan 10 Oct 18, 2022
Pose Transformers: Human Motion Prediction with Non-Autoregressive Transformers

Pose Transformers: Human Motion Prediction with Non-Autoregressive Transformers This is the repo used for human motion prediction with non-autoregress

Idiap Research Institute 26 Dec 14, 2022
利用Tensorflow实现基于CNN的中文短文本分类

Text Classification with CNN 使用卷积神经网络进行中文文本分类 CNN做句子分类的论文可以参看: Convolutional Neural Networks for Sentence Classification 还可以去读dennybritz大牛的博客:Implemen

Jeremiah 4 Nov 08, 2022
Invariant Causal Prediction for Block MDPs

MISA Abstract Generalization across environments is critical to the successful application of reinforcement learning algorithms to real-world challeng

Meta Research 41 Sep 17, 2022
CoINN: Correlated-informed neural networks: a new machine learning framework to predict pressure drop in micro-channels

CoINN: Correlated-informed neural networks: a new machine learning framework to predict pressure drop in micro-channels Accurate pressure drop estimat

Alejandro Montanez 0 Jan 21, 2022
Learning to Communicate with Deep Multi-Agent Reinforcement Learning in PyTorch

Learning to Communicate with Deep Multi-Agent Reinforcement Learning This is a PyTorch implementation of the original Lua code release. Overview This

Minqi 297 Dec 12, 2022
Code for our CVPR 2021 Paper "Rethinking Style Transfer: From Pixels to Parameterized Brushstrokes".

Rethinking Style Transfer: From Pixels to Parameterized Brushstrokes (CVPR 2021) Project page | Paper | Colab | Colab for Drawing App Rethinking Style

CompVis Heidelberg 153 Jan 04, 2023
Adversarial Framework for (non-) Parametric Image Stylisation Mosaics

Fully Adversarial Mosaics (FAMOS) Pytorch implementation of the paper "Copy the Old or Paint Anew? An Adversarial Framework for (non-) Parametric Imag

Zalando Research 120 Dec 24, 2022
🧠 A PyTorch implementation of 'Deep CORAL: Correlation Alignment for Deep Domain Adaptation.', ECCV 2016

Deep CORAL A PyTorch implementation of 'Deep CORAL: Correlation Alignment for Deep Domain Adaptation. B Sun, K Saenko, ECCV 2016' Deep CORAL can learn

Andy Hsu 200 Dec 25, 2022
使用深度学习框架提取视频硬字幕;docker容器免安装深度学习库,使用本地api接口使得界面和后端识别分离;

extract-video-subtittle 使用深度学习框架提取视频硬字幕; 本地识别无需联网; CPU识别速度可观; 容器提供API接口; 运行环境 本项目运行环境非常好搭建,我做好了docker容器免安装各种深度学习包; 提供windows界面操作; 容器为CPU版本; 视频演示 https

歌者 16 Aug 06, 2022
"NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search".

NAS-Bench-301 This repository containts code for the paper: "NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search". The

AutoML-Freiburg-Hannover 57 Nov 30, 2022
Pytorch Implementation of LNSNet for Superpixel Segmentation

LNSNet Overview Official implementation of Learning the Superpixel in a Non-iterative and Lifelong Manner (CVPR'21) Learning Strategy The proposed LNS

42 Oct 11, 2022
We utilize deep reinforcement learning to obtain favorable trajectories for visual-inertial system calibration.

Unified Data Collection for Visual-Inertial Calibration via Deep Reinforcement Learning Update: The lastest code will be updated in this branch. Pleas

ETHZ ASL 27 Dec 29, 2022
Reference implementation for Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Diffusion Probabilistic Models This repository provides a reference implementation of the method described in the paper: Deep Unsupervised Learning us

Jascha Sohl-Dickstein 238 Jan 02, 2023
CLEAR algorithm for multi-view data association

CLEAR: Consistent Lifting, Embedding, and Alignment Rectification Algorithm The Matlab, Python, and C++ implementation of the CLEAR algorithm, as desc

MIT Aerospace Controls Laboratory 30 Jan 02, 2023
Genetic Programming in Python, with a scikit-learn inspired API

Welcome to gplearn! gplearn implements Genetic Programming in Python, with a scikit-learn inspired and compatible API. While Genetic Programming (GP)

Trevor Stephens 1.3k Jan 03, 2023
Advantage Actor Critic (A2C): jax + flax implementation

Advantage Actor Critic (A2C): jax + flax implementation Current version supports only environments with continious action spaces and was tested on muj

Andrey 3 Jan 23, 2022
Multi-View Radar Semantic Segmentation

Multi-View Radar Semantic Segmentation Paper Multi-View Radar Semantic Segmentation, ICCV 2021. Arthur Ouaknine, Alasdair Newson, Patrick Pérez, Flore

valeo.ai 37 Oct 25, 2022