Download and preprocess popular sequential recommendation datasets

Overview

Build Status codebeat badge

Sequential Recommendation Datasets

This repository collects some commonly used sequential recommendation datasets in recent research papers and provides a tool for downloading, preprocessing and batch-loading those datasets. The preprocessing method can be customized based on the task, for example: short-term recommendation (including session-based recommendation) and long-short term recommendation. Loading has faster version which intergrates the DataLoader of PyTorch.

Datasets

Install this tool

Stable version

pip install -U srdatasets —-user

Latest version

pip install git+https://github.com/guocheng2018/sequential-recommendation-datasets.git --user

Download datasets

Run the command below to download datasets. As some datasets are not directly accessible, you'll be warned to download them manually and place them somewhere it tells you.

srdatasets download --dataset=[dataset_name]

To get a view of downloaded and processed status of all datasets, run

srdatasets info

Process datasets

The generic processing command is

srdatasets process --dataset=[dataset_name] [--options]

Splitting options

Two dataset splitting methods are provided: user-based and time-based. User-based means that splitting is executed on every user behavior sequence given the ratio of validation set and test set, while time-based means that splitting is based on the date of user behaviors. After splitting some dataset, two processed datasets are generated, one for development, which uses the validation set as the test set, the other for test, which contains the full training set.

--split-by     User or time (default: user)
--test-split   Proportion of test set to full dataset (default: 0.2)
--dev-split    Proportion of validation set to full training set (default: 0.1)

NOTE: time-based splitting need you to manually input days at console by tipping you total days of that dataset, since you may not know.

Task related options

For short term recommnedation task, you use previous input-len items to predict next target-len items. To make user interests more focused, user behavior sequences can also be cut into sessions if session-interval is given. If the number of previous items is smaller than input-len, 0 is padded to the left.

For long and short term recommendation task, you use pre-sessions previous sessions and current session to predict target-len items. The target items are picked randomly or lastly from current session. So the length of current session is max-session-len - target-len while the length of any previous session is max-session-len. If any previous session or current session is shorter than the preset length, 0 is padded to the left.

--task              Short or long-short (default: short)
--input-len         Number of previous items (default: 5)
--target-len        Number of target items (default: 1)
--pre-sessions      Number of previous sessions (default: 10)
--pick-targets      Randomly or lastly pick items from current session (default: random)
--session-interval  Session splitting interval (minutes)  (default: 0)
--min-session-len   Sessions less than this in length will be dropped  (default: 2)
--max-session-len   Sessions greater than this in length will be cut  (default: 20)

Common options

--min-freq-item        Items less than this in frequency will be dropped (default: 5)
--min-freq-user        Users less than this in frequency will be dropped (default: 5)
--no-augment           Do not use data augmentation (default: False)
--remove-duplicates    Remove duplicated items in user sequence or user session (if splitted) (default: False)

Dataset related options

--rating-threshold  Interactions with rating less than this will be dropped (Amazon, Movielens, Yelp) (default: 4)
--item-type         Recommend artists or songs (Lastfm) (default: song)

Version

By using different options, a dataset will have many processed versions. You can run the command below to get configurations and statistics of all processed versions of some dataset. The config id shown in output is a required argument of DataLoader.

srdatasets info --dataset=[dataset_name]

DataLoader

DataLoader is a built-in class that makes loading processed datasets easy. Practically, once initialized a dataloder by passing the dataset name, processed version (config id), batch_size and a flag to load training data or test data, you can then loop it to get batch data. Considering that some models use rank-based learning, negative sampling is intergrated into DataLoader. The negatives are sampled from all items except items in current data according to popularity. By default it (negatives_per_target) is turned off. Also, the time of user behaviors is sometimes an important feature, you can include it into batch data by setting include_timestmap to True.

Arguments

  • dataset_name: dataset name (case insensitive)
  • config_id: configuration id
  • batch_size: batch size (default: 1)
  • train: load training dataset (default: True)
  • development: load the dataset aiming for development (default: False)
  • negatives_per_target: number of negative samples per target (default: 0)
  • include_timestamp: add timestamps to batch data (default: False)
  • drop_last: drop last incomplete batch (default: False)

Attributes

  • num_users: total users in training dataset
  • num_items: total items in training dataset (not including the padding item 0)

Initialization example

from srdatasets.dataloader import DataLoader

trainloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=True, negatives_per_target=5, include_timestamp=True)
testloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=False, include_timestamp=True)

For pytorch users, there is a wrapper implementation of torch.utils.data.DataLoader, you can then set keyword arguments like num_workers and pin_memory to speed up loading data

from srdatasets.dataloader_pytorch import DataLoader

trainloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=True, negatives_per_target=5, include_timestamp=True, num_workers=8, pin_memory=True)
testloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=False, include_timestamp=True, num_workers=8, pin_memory=True)

Iteration template

For short term recommendation task

for epoch in range(10):
    # Train
    for users, input_items, target_items, input_item_timestamps, target_item_timestamps, negative_samples in trainloader:
        # Shape
        #   users:                  (batch_size,)
        #   input_items:            (batch_size, input_len)
        #   target_items:           (batch_size, target_len)
        #   input_item_timestamps:  (batch_size, input_len)
        #   target_item_timestamps: (batch_size, target_len)
        #   negative_samples:       (batch_size, target_len, negatives_per_target)
        #
        # DataType
        #   numpy.ndarray or torch.LongTensor
        pass

    # Test
    for users, input_items, target_items, input_item_timestamps, target_item_timestamps in testloader:
        pass

For long and short term recommendation task

for epoch in range(10):
    # Train
    for users, pre_sessions_items, cur_session_items, target_items, pre_sessions_item_timestamps, cur_session_item_timestamps, target_item_timestamps, negative_samples in trainloader:
        # Shape
        #   users:                          (batch_size,)
        #   pre_sessions_items:             (batch_size, pre_sessions * max_session_len)
        #   cur_session_items:              (batch_size, max_session_len - target_len)
        #   target_items:                   (batch_size, target_len)
        #   pre_sessions_item_timestamps:   (batch_size, pre_sessions * max_session_len)
        #   cur_session_item_timestamps:    (batch_size, max_session_len - target_len)
        #   target_item_timestamps:         (batch_size, target_len)
        #   negative_samples:               (batch_size, target_len, negatives_per_target)
        #
        # DataType
        #   numpy.ndarray or torch.LongTensor
        pass

    # Test
    for users, pre_sessions_items, cur_session_items, target_items, pre_sessions_item_timestamps, cur_session_item_timestamps, target_item_timestamps in testloader:
        pass

Disclaimers

This repo does not host or distribute any of the datasets, it is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

Pocsploit is a lightweight, flexible and novel open source poc verification framework

Pocsploit is a lightweight, flexible and novel open source poc verification framework

cckuailong 208 Dec 24, 2022
O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance Learning (CoRL 2021)

O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance Learning Object-object Interaction Affordance Learning. For a given object-object int

Kaichun Mo 26 Nov 04, 2022
Propose a principled and practically effective framework for unsupervised accuracy estimation and error detection tasks with theoretical analysis and state-of-the-art performance.

Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles This project is for the paper: Detecting Errors and Estimating

Jiefeng Chen 13 Nov 21, 2022
Replication Package for "An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Datasets"

Replication Package for "An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Data

2 Oct 06, 2022
Generate saved_model, tfjs, tf-trt, EdgeTPU, CoreML, quantized tflite and .pb from .tflite.

tflite2tensorflow Generate saved_model, tfjs, tf-trt, EdgeTPU, CoreML, quantized tflite and .pb from .tflite. 1. Supported Layers No. TFLite Layer TF

Katsuya Hyodo 214 Dec 29, 2022
Production First and Production Ready End-to-End Speech Recognition Toolkit

WeNet 中文版 Discussions | Docs | Papers | Runtime (x86) | Runtime (android) | Pretrained Models We share neural Net together. The main motivation of WeN

2.7k Jan 04, 2023
GANfolk: Using AI to create portraits of fictional people to sell as NFTs

GANfolk are AI-generated renderings of fictional people. Each image in the collection was created by a pair of Generative Adversarial Networks (GANs) with names and backstories also created with AI.

Robert A. Gonsalves 32 Dec 02, 2022
Tightness-aware Evaluation Protocol for Scene Text Detection

TIoU-metric Release on 27/03/2019. This repository is built on the ICDAR 2015 evaluation code. If you propose a better metric and require further eval

Yuliang Liu 206 Nov 18, 2022
Training RNNs as Fast as CNNs

News SRU++, a new SRU variant, is released. [tech report] [blog] The experimental code and SRU++ implementation are available on the dev branch which

ASAPP Research 2.1k Jan 01, 2023
The missing CMake project initializer

cmake-init - The missing CMake project initializer Opinionated CMake project initializer to generate CMake projects that are FetchContent ready, separ

1k Jan 01, 2023
This is the code for ACL2021 paper A Unified Generative Framework for Aspect-Based Sentiment Analysis

This is the code for ACL2021 paper A Unified Generative Framework for Aspect-Based Sentiment Analysis Install the package in the requirements.txt, the

108 Dec 23, 2022
Neural network graphs and training metrics for PyTorch, Tensorflow, and Keras.

HiddenLayer A lightweight library for neural network graphs and training metrics for PyTorch, Tensorflow, and Keras. HiddenLayer is simple, easy to ex

Waleed 1.7k Dec 31, 2022
The code of paper "Block Modeling-Guided Graph Convolutional Neural Networks".

Block Modeling-Guided Graph Convolutional Neural Networks This repository contains the demo code of the paper: Block Modeling-Guided Graph Convolution

22 Dec 08, 2022
QuakeLabeler is a Python package to create and manage your seismic training data, processes, and visualization in a single place — so you can focus on building the next big thing.

QuakeLabeler Quake Labeler was born from the need for seismologists and developers who are not AI specialists to easily, quickly, and independently bu

Hao Mai 15 Nov 04, 2022
Predicting Semantic Map Representations from Images with Pyramid Occupancy Networks

This is the code associated with the paper Predicting Semantic Map Representations from Images with Pyramid Occupancy Networks, published at CVPR 2020.

Thomas Roddick 219 Dec 20, 2022
When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings

When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings This is the repository for t

RegLab 39 Jan 07, 2023
Implement face detection, and age and gender classification, and emotion classification.

YOLO Keras Face Detection Implement Face detection, and Age and Gender Classification, and Emotion Classification. (image from wider face dataset) Ove

Chloe 10 Nov 14, 2022
[ICCV 2021] FaPN: Feature-aligned Pyramid Network for Dense Image Prediction

FaPN: Feature-aligned Pyramid Network for Dense Image Prediction [arXiv] [Project Page] @inproceedings{ huang2021fapn, title={{FaPN}: Feature-alig

Shihua Huang 23 Jul 22, 2022
This repository is the official implementation of Using Time-Series Privileged Information for Provably Efficient Learning of Prediction Models

Using Time-Series Privileged Information for Provably Efficient Learning of Prediction Models Link to paper Abstract We study prediction of future out

Rickard Karlsson 2 Aug 19, 2022
GANSketchingJittor - Implementation of Sketch Your Own GAN in Jittor

GANSketching in Jittor Implementation of (Sketch Your Own GAN) in Jittor(计图). Or

Bernard Tan 10 Jul 02, 2022