Time Series Cross-Validation -- an extension for scikit-learn

Overview

TSCV: Time Series Cross-Validation

This repository is a scikit-learn extension for time series cross-validation. It introduces gaps between the training set and the test set, which mitigates the temporal dependence of time series and prevents information leakage.

Installation

pip install tscv

or

conda install -c conda-forge tscv

Usage

This extension defines 3 cross-validator classes and 1 function:

  • GapLeavePOut
  • GapKFold
  • GapRollForward
  • gap_train_test_split

The three classes can all be passed, as the cv argument, to scikit-learn functions such as cross_validate, cross_val_score, and cross_val_predict, just like the native cross-validator classes.

The one function is an alternative to the train_test_split function in scikit-learn.

Examples

The following example uses GapKFold instead of KFold as the cross-validator.

import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import cross_val_score
from tscv import GapKFold

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# use GapKFold as the cross-validator
cv = GapKFold(n_splits=5, gap_before=5, gap_after=5)
scores = cross_val_score(clf, iris.data, iris.target, cv=cv)
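
To make the effect of gap_before and gap_after concrete, here is a minimal pure-Python sketch that enumerates the indices a K-fold split with gaps would produce. The helper name gap_kfold_indices is hypothetical and not part of tscv; it only assumes contiguous test folds, with the gap samples dropped from the training side.

```python
def gap_kfold_indices(n_samples, n_splits, gap_before=0, gap_after=0):
    """Yield (train, test) index lists for contiguous folds with gaps.

    Hypothetical helper for illustration only; not the tscv implementation.
    """
    # Spread the samples over the folds as evenly as possible.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        # Samples inside the gaps belong to neither the training nor the test set.
        lo = max(start - gap_before, 0)
        hi = min(stop + gap_after, n_samples)
        train = list(range(0, lo)) + list(range(hi, n_samples))
        test = list(range(start, stop))
        yield train, test
        start = stop


for train, test in gap_kfold_indices(10, 5, gap_before=1, gap_after=1):
    print("TRAIN:", train, "TEST:", test)
```

Each test fold is surrounded by one dropped sample on either side, so the training set never touches the test set directly.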

The following example uses gap_train_test_split to split the data set into the training set and the test set.

import numpy as np
from tscv import gap_train_test_split

X, y = np.arange(20).reshape((10, 2)), np.arange(10)
X_train, X_test, y_train, y_test = gap_train_test_split(X, y, test_size=2, gap_size=2)
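
As a rough picture of what such a split does, the sketch below cuts a sequence into a training block, a discarded gap, and a test block. It assumes the training block comes first and the test block last; the helper name gap_holdout is hypothetical and not part of tscv.

```python
def gap_holdout(seq, test_size, gap_size):
    """Split seq into (train, test), discarding gap_size items in between.

    Hypothetical helper; assumes train comes first, then the gap, then test.
    """
    n_train = len(seq) - test_size - gap_size
    if n_train <= 0:
        raise ValueError("test_size + gap_size leaves no training samples")
    return seq[:n_train], seq[n_train + gap_size:]


y = list(range(10))
y_train, y_test = gap_holdout(y, test_size=2, gap_size=2)
print(y_train)  # [0, 1, 2, 3, 4, 5]
print(y_test)   # [8, 9]
```

Under these assumptions, samples 6 and 7 fall into the gap and are used for neither training nor testing.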

Contributing

  • Report bugs in the issue tracker
  • Express your use cases in the issue tracker

Documentations

Acknowledgments

  • I would like to thank Jeffrey Racine and Christoph Bergmeir for the helpful discussion.

License

BSD-3-Clause

Citation

Wenjie Zheng. (2021). Time Series Cross-Validation (TSCV): an extension for scikit-learn. Zenodo. http://doi.org/10.5281/zenodo.4707309

@software{zheng_2021_4707309,
  title={{Time Series Cross-Validation (TSCV): an extension for scikit-learn}},
  author={Zheng, Wenjie},
  month={april},
  year={2021},
  publisher={Zenodo},
  doi={10.5281/zenodo.4707309},
  url={http://doi.org/10.5281/zenodo.4707309}
}
Comments
  • Make it work with cross_val_predict

    Is it possible to somehow make the CV work with the cross_val_predict function? For example, if I try:

    cv = GapWalkForward(n_splits=3, gap_size=1, test_size=2)
    cross_val_predict(estimator=SGDClassifier(), X=X_sample, y=y_bin_sample, cv=cv, n_jobs=6)
    

    it returns an error

    ValueError: cross_val_predict only works for partitions

    but I would like to have predictions so I can make a confusion matrix and other statistics.

    Is it possible to make it work with your cross-validators?

    opened by MislavSag 8
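
The error arises because cross_val_predict expects every sample to appear in exactly one test set, which a walk-forward scheme does not guarantee. One possible workaround, sketched below in plain Python, is to loop over the splits manually and keep predictions only for the samples that were actually tested. The function collect_predictions and the toy majority-class "estimator" are hypothetical stand-ins, not part of tscv or scikit-learn.

```python
def collect_predictions(splits, X, y, fit, predict):
    """Return {sample_index: prediction} for every index that appears in a test set."""
    predictions = {}
    for train_idx, test_idx in splits:
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_hat = predict(model, [X[i] for i in test_idx])
        for i, value in zip(test_idx, y_hat):
            predictions[i] = value
    return predictions


# Toy "estimator": always predicts the majority class of its training labels.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, X_test):
    return [model] * len(X_test)


X = list(range(8))
y = [0, 0, 1, 1, 1, 0, 1, 1]
# Walk-forward style splits with a one-sample gap; samples 0 to 3 are never tested.
splits = [([0, 1, 2], [4, 5]), ([0, 1, 2, 3, 4], [6, 7])]
preds = collect_predictions(splits, X, y, fit_majority, predict_majority)
print(preds)  # {4: 0, 5: 0, 6: 1, 7: 1}
```

A confusion matrix can then be computed from the tested indices only, by comparing preds[i] with y[i] for each key of preds.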
  • Documentation

    Documentation and examples do not address the splitting of data set into training and test sets.

    If using one of the cross-validators, does the data set need to be sorted in time order? Is there a way to designate a datetime column so the class understands on what basis to sequentially split the data?

    opened by mksamelson 3
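
Assuming the cross-validators split purely on row position, as scikit-learn style splitters generally do, one way to address the question above is to sort the rows by timestamp before building X and y. The record layout below is made up for illustration.

```python
from datetime import datetime

# Hypothetical records arriving out of order.
records = [
    {"timestamp": datetime(2021, 3, 1), "value": 3.0, "label": 1},
    {"timestamp": datetime(2021, 1, 1), "value": 1.0, "label": 0},
    {"timestamp": datetime(2021, 2, 1), "value": 2.0, "label": 0},
]

# Sort chronologically so that row position reflects time order.
ordered = sorted(records, key=lambda r: r["timestamp"])

X = [[r["value"]] for r in ordered]
y = [r["label"] for r in ordered]
print(X)  # [[1.0], [2.0], [3.0]]
print(y)  # [0, 0, 1]
```

After sorting, positional train/gap/test blocks correspond to earlier and later periods.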
  • split.py depends on deprecated / newly private method `_safe_indexing` in scikit-learn 0.24.0

    Just flagging a minor issue:

    We found this after running poetry update on our dependencies, which inadvertently bumped scikit-learn to 0.24.0. This broke code we have that uses tscv.

    Relevant scikit-learn source code from version 0.23.0: https://github.com/scikit-learn/scikit-learn/blob/0.23.0/sklearn/utils/__init__.py#L274-L275

    The method has been made private in scikit-learn 0.24.0: https://github.com/scikit-learn/scikit-learn/blob/0.24.0/sklearn/utils/__init__.py#L271

    I did not investigate further, we pinned scikit-learn to 0.23.0 and that's OK for now, but some refactoring may be in order to move off the private method.

    opened by rob-sokolowski 3
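
A common way to tolerate a name that moved between library versions is to try several import locations in turn. The sketch below shows the pattern with a hypothetical helper, import_first, and demonstrates it on the standard library so that it runs without scikit-learn installed.

```python
import importlib


def import_first(candidates):
    """Return the first attribute importable from a list of (module, name) pairs."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError):
            continue  # try the next candidate location
    raise ImportError(f"none of {candidates} could be imported")


# Stand-in demonstration: the first name does not exist, the second does.
sqrt = import_first([("math", "no_such_function"), ("math", "sqrt")])
print(sqrt(9.0))  # 3.0
```

For the case reported here, the candidates would presumably be ("sklearn.utils", "_safe_indexing") followed by ("sklearn.utils", "safe_indexing"), the two names that appear in the tracebacks; relying on a private name remains fragile, so this is a stopgap rather than a fix.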
  • Error when Importing TSCV Gapwalkforward

    Using TSCV Gapwalkforward successfully with Python 3.7.

    Suddenly getting following error:

    ImportError Traceback (most recent call last) in
         41 #Modeling
         42
    ---> 43 from tscv import GapWalkForward
         44 from sklearn.utils import shuffle
         45 from sklearn.model_selection import KFold

    ~\Anaconda3\envs\py37\lib\site-packages\tscv\__init__.py in
    ----> 1 from .split import GapCrossValidator
          2 from .split import GapLeavePOut
          3 from .split import GapKFold
          4 from .split import GapWalkForward
          5 from .split import gap_train_test_split

    ~\Anaconda3\envs\py37\lib\site-packages\tscv\split.py in
          7
          8 import numpy as np
    ----> 9 from sklearn.utils import indexable, safe_indexing
         10 from sklearn.utils.validation import _num_samples
         11 from sklearn.base import _pprint

    ImportError: cannot import name 'safe_indexing' from 'sklearn.utils'

    Any insight? I get this when simply importing Gapwalkforward.

    opened by mksamelson 2
  • GapWalkForward Issue with Scikit-learn 0.24.1

    When I upgrade to Scikit-learn 0.24.1 I get an issue:

    cannot import name 'safe_indexing' from 'sklearn.utils'

    This appears to be a change within scikit-learn as indicated here:

    https://stackoverflow.com/questions/65602076/yellowbrick-importerror-cannot-import-name-safe-indexing-from-sklearn-utils

    No issue using scikit-learn 0.23.2

    opened by mksamelson 2
  • Release 0.0.4 for GridSearch compat

    Would it be possible to issue a new release on PyPI to include the latest changes from this commit which aligns the get_n_splits method signature with the abstract method signature required by GridSearchCV?

    opened by wderose 2
  • Warning once is not enough

    https://github.com/WenjieZ/TSCV/blob/f8b832fab1dca0e2d2d46029308c2d06eef8b858/tscv/split.py#L253

    This warning should appear for every occurrence. Use standard output instead.

    opened by WenjieZ 1
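
For context, Python's warnings module decides how often a repeated warning is reported through its filter actions. The sketch below compares the "always" and "once" actions on a hypothetical function, risky_split, that stands in for whatever code emits the warning; it does not reproduce the tscv code referenced above.

```python
import warnings


def risky_split():
    # Hypothetical stand-in for a splitter method that emits a warning.
    warnings.warn("hypothetical splitter warning", UserWarning)


def count_reported(action, n_calls=3):
    """Count how many of n_calls identical warnings are reported under a filter action."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        for _ in range(n_calls):
            risky_split()
    return len(caught)


always_count = count_reported("always")  # every occurrence is reported
once_count = count_reported("once")      # repeats are suppressed
print(always_count, once_count)
```

Writing to standard output, as the issue suggests, sidesteps the filters entirely, at the cost of no longer letting users silence the message through the warnings machinery.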
  • Retrained version of GapWalkForward: GapRollForward

    The current implementation is based on legacy K-Fold cross-validation requiring an explicit value for the n_splits parameter. It puts the burden of calculating desired value of n_splits on the user.

    A better implementation should allow the user to initiate a GapWalkForward class without specifying the value for n_splits. Instead, it can deduce the right value from the other inputs.

    It is theoretically desirable to keep both channels of kickstarting a GapWalkForward class. In practice, however, it is hard to maintain both within a single class. Therefore, I decide to ~~deprecate the n_splits channel~~ implement a new class dubbed GapRollForward in v0.1.0 -- the version after the next.

    opened by WenjieZ 1
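
The idea of deriving the number of splits from the other inputs can be sketched as a generator that keeps rolling forward until the test window no longer fits. The function roll_forward_indices and all of its parameter names are hypothetical and chosen for illustration; this is not the GapRollForward implementation.

```python
def roll_forward_indices(n_samples, min_train_size, test_size, gap_size=0, roll_size=None):
    """Yield (train, test) index lists for an expanding training window.

    The number of splits is not an input; it follows from the other arguments.
    """
    roll_size = test_size if roll_size is None else roll_size
    train_stop = min_train_size
    # Keep rolling while a gap plus a full test window still fits.
    while train_stop + gap_size + test_size <= n_samples:
        test_start = train_stop + gap_size
        train = list(range(0, train_stop))
        test = list(range(test_start, test_start + test_size))
        yield train, test
        train_stop += roll_size


splits = list(roll_forward_indices(10, min_train_size=3, test_size=2, gap_size=1))
for train, test in splits:
    print("TRAIN:", train, "TEST:", test)
print("number of splits:", len(splits))  # 3
```

Here three splits fall out of the sample count, window sizes, and gap, without the caller computing n_splits by hand.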
  • Changed GapWalkForward.get_n_splits to match abstract method signature

    Now works with GridSearchCV. Otherwise, using GapWalkForward as the cross-validation class passed to GridSearchCV will fail with "TypeError: get_n_splits() takes 1 positional argument but 4 were given."

    opened by lawsonmcw 1
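
The quoted TypeError is what Python raises when a caller passes X, y, and groups to a method that accepts only self. The two toy classes below are hypothetical and only illustrate the signature mismatch and its fix.

```python
class NarrowSplitter:
    """Hypothetical splitter whose get_n_splits takes no data arguments."""

    def __init__(self, n_splits):
        self.n_splits = n_splits

    def get_n_splits(self):
        return self.n_splits


class CompatibleSplitter(NarrowSplitter):
    """Same splitter, with the (X, y, groups) signature that callers may use."""

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits


def count_splits(cv, X, y, groups=None):
    # Mimics a caller that passes three positional arguments.
    return cv.get_n_splits(X, y, groups)


print(count_splits(CompatibleSplitter(3), [[0], [1]], [0, 1]))  # 3
try:
    count_splits(NarrowSplitter(3), [[0], [1]], [0, 1])
except TypeError as exc:
    print("TypeError:", exc)
```

Accepting and ignoring the extra arguments keeps the splitter usable by callers that never pass them as well as by those that always do.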
  • Import error with latest sklearn version

    Hi guys, this issue occurred after the upgrade to 1.1.3

    ImportError: cannot import name '_pprint' from 'sklearn.base'

    /.venv/lib/python3.10/site-packages/tscv/_split.py:19 in <module>

       16 import numpy as np
       17 from sklearn.utils import indexable
       18 from sklearn.utils.validation import _num_samples, check_consistent_length
    ❱  19 from sklearn.base import _pprint
       20 from sklearn.utils import _safe_indexing
    

    Could you please fix it?

    Kind regards, Jim

    opened by teneon 1
  • Consistently use the test sets as reference for `gap_before` and `gap_after`

    There are two ways of defining a derived cross-validator. One is to redefine _iter_test_indices or _iter_test_masks (test viewpoint), and the other is to redefine _iter_train_masks or _iter_train_indices (train viewpoint).

    Currently, these two methods assign different semantic meanings to the parameters gap_before and gap_after. The test viewpoint uses the test sets as the reference:

    train    gap_before    test    gap_after    train
    

    The train viewpoint uses the training sets as the reference:

    test    gap_before    train    gap_after    test
    

    This divergent behavior is ~~not intended~~ inappropriate. The package should insist on the test viewpoint, and hence this PR. It will be enforced in v0.2.

    I don't think this issue has touched any users, for the derived classes in this package use _iter_test_indices exclusively (test viewpoint). No users have reported this issue either. If you suspect that you have been affected by it, please reply to this PR.

    opened by WenjieZ 1
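
Under the test viewpoint, the training set can be derived from any test set, contiguous or not, by excluding every index that lies within gap_before before or gap_after after a test index. The helper train_from_test below is a hypothetical illustration of that rule, not code from the package.

```python
def train_from_test(n_samples, test_indices, gap_before=0, gap_after=0):
    """Return training indices, using the test set as the reference for both gaps."""
    excluded = set()
    for t in test_indices:
        # Exclude the test index itself plus the gap on either side of it.
        excluded.update(range(t - gap_before, t + gap_after + 1))
    return [i for i in range(n_samples) if i not in excluded]


print(train_from_test(10, [2, 3], gap_before=1, gap_after=1))        # [0, 5, 6, 7, 8, 9]
print(train_from_test(10, [0, 1, 4, 5], gap_before=1, gap_after=1))  # [7, 8, 9]
```

With asymmetric gaps the choice of reference matters: gap_before always sits immediately ahead of a test block and gap_after immediately behind it.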
  • time boost in folds generation

    With contiguous test sets:

    cv_orig = GapKFold(n_splits=5, gap_before=1, gap_after=1)
    
    for train_index, test_index in cv_orig.split(np.arange(10)):
        print("TRAIN:", train_index, "TEST:", test_index)
    
    
    ... TRAIN: [3 4 5 6 7 8 9] TEST: [0 1]
    ... TRAIN: [0 5 6 7 8 9] TEST: [2 3]
    ... TRAIN: [0 1 2 7 8 9] TEST: [4 5]
    ... TRAIN: [0 1 2 3 4 9] TEST: [6 7]
    ... TRAIN: [0 1 2 3 4 5 6] TEST: [8 9]
    
    cv_opt = GapKFold(n_splits=5, gap_before=1, gap_after=1)
    
    for train_index, test_index in cv_opt.split(np.arange(10)):
        print("TRAIN:", train_index, "TEST:", test_index)
    
    
    ... TRAIN: [3 4 5 6 7 8 9] TEST: [0 1]
    ... TRAIN: [0 5 6 7 8 9] TEST: [2 3]
    ... TRAIN: [0 1 2 7 8 9] TEST: [4 5]
    ... TRAIN: [0 1 2 3 4 9] TEST: [6 7]
    ... TRAIN: [0 1 2 3 4 5 6] TEST: [8 9]
    
    %%timeit
    folds = list(cv_orig.split(np.arange(10000)))
    
    
    ... 1.21 s ± 37.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %%timeit
    folds = list(cv_opt.split(np.arange(10000)))
    
    
    ... 4.74 ms ± 44.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    With non-contiguous test sets:

    cv_orig = _XXX_(_xxx_, gap_before=1, gap_after=1)
    
    for train_index, test_index in cv_orig.split(np.arange(10)):
        print("TRAIN:", train_index, "TEST:", test_index)
    
    
    ... TRAIN: [5 6 7 8 9] TEST: [0 1 2 3]
    ... TRAIN: [7 8 9] TEST: [0 1 4 5]
    ... TRAIN: [3 4 9] TEST: [0 1 6 7]
    ... TRAIN: [3 4 5 6] TEST: [0 1 8 9]
    ... TRAIN: [0 7 8 9] TEST: [2 3 4 5]
    ... TRAIN: [0 9] TEST: [2 3 6 7]
    ... TRAIN: [0 5 6] TEST: [2 3 8 9]
    ... TRAIN: [0 1 2 9] TEST: [4 5 6 7]
    ... TRAIN: [0 1 2] TEST: [4 5 8 9]
    ... TRAIN: [0 1 2 3 4] TEST: [6 7 8 9]
    
    cv_opt = _XXX_(_xxx_, gap_before=1, gap_after=1)
    
    for train_index, test_index in cv_opt.split(np.arange(10)):
        print("TRAIN:", train_index, "TEST:", test_index)
    
    
    ... TRAIN: [5 6 7 8 9] TEST: [0 1 2 3]
    ... TRAIN: [7 8 9] TEST: [0 1 4 5]
    ... TRAIN: [3 4 9] TEST: [0 1 6 7]
    ... TRAIN: [3 4 5 6] TEST: [0 1 8 9]
    ... TRAIN: [0 7 8 9] TEST: [2 3 4 5]
    ... TRAIN: [0 9] TEST: [2 3 6 7]
    ... TRAIN: [0 5 6] TEST: [2 3 8 9]
    ... TRAIN: [0 1 2 9] TEST: [4 5 6 7]
    ... TRAIN: [0 1 2] TEST: [4 5 8 9]
    ... TRAIN: [0 1 2 3 4] TEST: [6 7 8 9]
    
    %%timeit
    folds = list(cv_orig.split(np.arange(10000)))
    
    ... 1.23 s ± 75.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %%timeit
    folds = list(cv_opt.split(np.arange(10000)))
    
    ... 4.78 ms ± 49.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    opened by aldder 3
  • add CombinatorialGapKFold

    This adds an implementation of Combinatorial Cross-Validation with purging and embargoing, as described in the book "Advances in Financial Machine Learning" by Marcos López de Prado.

    Explanatory video: https://www.youtube.com/watch?v=hDQssGntmFA

    opened by aldder 3
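
The combinatorial idea can be sketched by cutting the series into folds, choosing every combination of folds as the test set, and dropping the samples adjacent to each chosen fold from the training set. The function combinatorial_gap_splits and its parameters are hypothetical, and the symmetric index gaps are only a rough stand-in for the book's purging and embargoing; this is not the code proposed in the PR.

```python
from itertools import combinations


def combinatorial_gap_splits(n_samples, n_folds, n_test_folds, gap_before=0, gap_after=0):
    """Yield (train, test) index lists for every combination of test folds."""
    # Fold boundaries, spreading any remainder over the first folds.
    fold_size, remainder = divmod(n_samples, n_folds)
    bounds, start = [], 0
    for k in range(n_folds):
        stop = start + fold_size + (1 if k < remainder else 0)
        bounds.append((start, stop))
        start = stop

    for chosen in combinations(range(n_folds), n_test_folds):
        test, excluded = [], set()
        for k in chosen:
            lo, hi = bounds[k]
            test.extend(range(lo, hi))
            # Exclude the fold itself and the gaps around it from training.
            excluded.update(range(lo - gap_before, hi + gap_after))
        train = [i for i in range(n_samples) if i not in excluded]
        yield train, test


splits = list(combinatorial_gap_splits(10, 5, 2, gap_before=1, gap_after=1))
for train, test in splits:
    print("TRAIN:", train, "TEST:", test)
```

Choosing 2 of 5 folds gives 10 splits, so each sample is tested several times under different training sets.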
  • Implement Rep-Holdout

    Thank you for this repository and the implemented CV-methods; especially GapRollForward. I was looking for exactly this package.

    I was wondering if you are interested in implementing another CV-Method for time series, called Rep-Holdout. It is used in this evaluation paper (https://arxiv.org/abs/1905.11744) and has good performance compared to all other CV-methods - some of which you have implemented here.

    As I understand it, it is somewhat like sklearn.model_selection.TimeSeriesSplit but with a randomized selection of all possible folds. Here is the description from the paper as an image:

    [Image: description of Rep-Holdout from the paper]

    The authors provided code in R, but it is written very differently from how it would need to look in Python. I adapted your functions to implement it in Python, but I am not the best coder and it really only serves my purpose of tuning a specific model. Seeing as the performance of Rep-Holdout is good and, to me at least, it makes sense for time series cross-validation, maybe you are interested in adding this function to your package?

    opened by georgeblck 8
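
One reading of the description above is a repeated holdout in which the split point is drawn at random, the training window ends at that point, and the test window follows it. The sketch below implements that reading only; the function rep_holdout_splits, the fixed-size training window, and the optional gap_size are assumptions, not the paper's code or a tscv API.

```python
import random


def rep_holdout_splits(n_samples, n_reps, train_size, test_size, gap_size=0, seed=None):
    """Yield n_reps (train, test) index lists around randomly drawn split points."""
    rng = random.Random(seed)
    first = train_size                       # earliest admissible split point
    last = n_samples - gap_size - test_size  # latest admissible split point
    if first > last:
        raise ValueError("series too short for the requested window sizes")
    for _ in range(n_reps):
        point = rng.randint(first, last)     # both ends inclusive
        train = list(range(point - train_size, point))
        test_start = point + gap_size
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(rep_holdout_splits(100, n_reps=5, train_size=60, test_size=10, gap_size=2, seed=0))
for train, test in splits:
    print("train:", train[0], "to", train[-1], "| test:", test[0], "to", test[-1])
```

Every repetition keeps the training window strictly before the test window, so the temporal order is respected even though the split points are random.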
  • Intuition on setting number of gaps

    If for example, I have data without gaps, when and why would I still create a break between my train and validation? I have seen the argument for setting gaps when the period that needs to be predicted may be N days after the train. Are there other reasons? And if so, what is the intuition on knowing how many gaps to include before/after the training set?

    opened by tyokota 0
Releases(v0.1.2)
Owner
Wenjie Zheng
Statistical Learning Solution Expert