A library to generate synthetic time series data by easy-to-use factors and generator

Overview

timeseries-generator

This repository consists of a python packages that generates synthetic time series dataset in a generic way (under /timeseries_generator) and demo notebooks on how to generate synthetic timeseries data (under /examples). The goal here is to have non-sensitive data available to demo solutions and test the effectiveness of those solutions and/or algorithms. In order to test your algorithm, you want to have time series available containing different kinds of trends. The python package should help create different kinds of time series while still being maintainable.

timeseries_generator package

For this package, it is assumed that a time series is composed of a base value multiplied by many factors.

ts = base_value * factor1 * factor2 * ... * factorN + Noiser

Diagram

These factors can be anything, random noise, linear trends, to seasonality. The factors can affect different features. For example, some features in your time series may have a seasonal component, while others do not.

Different factors are represented in different classes, which inherit from the BaseFactor class. Factor classes are input for the Generator class, which creates a dataframe containing the features, base value, all the different factors working on the base value and and the final factor and value.

Core concept

  • Generator: a python class to generate the time series. A generator contains a list of factors and noiser. By overlaying the factors and noiser, generator can produce a customized time series
  • Factor: a python class to generate the trend, seasonality, holiday factors, etc. Factors take effect by multiplying on the base value of the generator.
  • Noised: a python class to generate time series noise data. Noiser take effect by summing on top of "factorized" time series. This formula describes the concepts we talk above

Built-in Factors

  • LinearTrend: give a linear trend based on the input slope and intercept
  • CountryYearlyTrend: give a yearly-based market cap factor based on the GDP per - capita.
  • EUEcoTrendComponents: give a monthly changed factor based on EU industry product public data
  • HolidayTrendComponents: simulate the holiday sale peak. It adapts the holiday days - differently in different country
  • BlackFridaySaleComponents: simulate the BlackFriday sale event
  • WeekendTrendComponents: more sales at weekends than on weekdays
  • FeatureRandFactorComponents: set up different sale amount for different stores and different product
  • ProductSeasonTrendComponents: simulate season-sensitive product sales. In this example code, we have 3 different types of product:
    • winter jacket: inverse-proportional to the temperature, more sales in winter
    • basketball top: proportional to the temperature, more sales in summer
    • Yoga Mat: temperature insensitive

Installation

pip install timeseries-generator

Usage

from timeseries_generator import LinearTrend, Generator, WhiteNoise, RandomFeatureFactor
import pandas as pd

# setting up a linear tren
lt = LinearTrend(coef=2.0, offset=1., col_name="my_linear_trend")
g = Generator(factors={lt}, features=None, date_range=pd.date_range(start="01-01-2020", end="01-20-2020"))
g.generate()
g.plot()

# update by adding some white noise to the generator
wn = WhiteNoise(stdev_factor=0.05)
g.update_factor(wn)
g.generate()
g.plot()

Example Notebooks

We currently have 2 example notebooks available:

  1. generate_stationary_process: Good for introducing the basics of the timeseries_generator. Shows how to apply simple linear trends and how to introduce features and labels, as well as random noise.
  2. use_external_factors: Goes more into detail and shows how to use the external_factors submodule. Shows how to create seasonal trends.

Web based prototyping UI

We also use Streamlit to build a web-based UI to demonstrate how to use this package to generate synthesis time series data in an interactive web UI.

streamlit run examples/streamlit/app.py

Web UI

License

This package is released under the Apache License, Version 2.0

You might also like...
A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.

Prophet: Automatic Forecasting Procedure Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends ar

A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

Visualize classified time series data with interactive Sankey plots in Google Earth Engine
Visualize classified time series data with interactive Sankey plots in Google Earth Engine

sankee Visualize changes in classified time series data with interactive Sankey plots in Google Earth Engine Contents Description Installation Using P

PyPOTS - A Python Toolbox for Data Mining on Partially-Observed Time Series

A python toolbox/library for data mining on partially-observed time series, supporting tasks of forecasting/imputation/classification/clustering on incomplete multivariate time series with missing values.

A collection of Scikit-Learn compatible time series transformers and tools.
A collection of Scikit-Learn compatible time series transformers and tools.

tsfeast A collection of Scikit-Learn compatible time series transformers and tools. Installation Create a virtual environment and install: From PyPi p

Automatic extraction of relevant features from time series:
Automatic extraction of relevant features from time series:

tsfresh This repository contains the TSFRESH python package. The abbreviation stands for "Time Series Feature extraction based on scalable hypothesis

A unified framework for machine learning with time series

Welcome to sktime A unified framework for machine learning with time series We provide specialized time series algorithms and scikit-learn compatible

Probabilistic time series modeling in Python
Probabilistic time series modeling in Python

GluonTS - Probabilistic Time Series Modeling in Python GluonTS is a Python toolkit for probabilistic time series modeling, built around Apache MXNet (

Comments
  • Time series data augmentation

    Time series data augmentation

    There is a code example that gives to increase the amount of series data by adding slightly modified copies of already existing time series data or newly created synthetic series data from existing data?

    opened by YAYAYru 0
  • KeyError: 'country'

    KeyError: 'country'

    From the following code,

    from timeseries_generator import HolidayFactor, LinearTrend, Generator
    
    lt = LinearTrend(coef=2.0, offset=1., col_name="my_linear_trend")
    
    g: Generator = Generator(factors={lt}, features=None, date_range=pd.date_range(start="01-01-2020", end="01-01-2021"))
    
    holiday_factor = HolidayFactor(
        country_feature_name="country",
    )
    g.add_factor(holiday_factor)
    g.generate()
    

    I get the error. I am not sure this is expected behavior.

    File /usr/local/Caskroom/miniconda/base/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py:10083, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
    ...
    -> 1849     raise KeyError(key)
       1851 # Check for duplicates
       1852 if values.ndim > 1:
    
    KeyError: 'country'
    
    opened by twobitunicorn 0
  • [Feature request] Customizable feature combinations

    [Feature request] Customizable feature combinations

    Hi team, Thanks for the useful library! I wonder if you'd be open to this idea:

    I would like to be able to:

    • Set up categorizing features (let's say, for illustration, CATEGORY=[footwear, t-shirts, socks], SIZE=[S, M, L, US-Mens-8, US-Womens-6) and define Factors on them
    • Generate time-series with more restricted feature combinations than the outer product (again for illustration, "t-shirt sizes for t-shirts, shoe sizes for footwear")

    Today, it seems like Generator.generate() hard-codes the assumption that time-series should be generated for the product of all provided feature values.

    It'd be helpful if, instead, we could have the option of customizing this join to limit down generated combinations?

    Some options I can think of:

    1. Leave the library as-is: Users generate full outer product and limit down what they want in post-processing
      • This seems possible already, but very RAM-intensive if your desired combinations are sparse?
    2. Accept an optional dataframe of factor combinations as parameter to the generate() method
      • Gives full flexibility over which combinations are kept / ignored, without assuming any particular rigid hierarchies between features
      • ...But might need to do a bit of validation to protect against user errors? May not be super easy to use without some documented examples / functions to generate the dataframe
    3. Some more complex API for feature configuration that accommodates specifying valid/invalid feature combinations
      • Might be nicer for usability, but difficult to make general: E.g. a straightforward hierarchy could be represented as a nested dict, but in practice many applications have multiple intersecting views of product category information e.g. brand, type, target segment, etc.
    opened by athewsey 1
  • Generate hourly data

    Generate hourly data

    First of all, thank you for making this repository public! I enjoy its ease of use and the built-in factors.

    Problem description

    I'm currently trying to generate revenue data for a bar/restaurant on an hourly basis. As far as I can see, the timeseries-generator only supports generating one data point per day, not per hour.

    I tried to generate hourly data like g = Generator(factors={lt}, features=None, date_range=pd.date_range(start='15/9/2021', end='30/9/2021', freq='h')) which didn't work.

    Potential solution

    Add the possibility to generate hourly data too. If this is a promising idea in your opinion, I'm willing to contribute to the implementation.

    Thank you in advance!

    opened by nileger 1
Releases(v0.1.0)
  • v0.1.0(Jul 20, 2021)

    • first release of time series generators, including:
      • base factor
      • linear trend factor
      • sinusoidal factor
      • white noise factor
      • random factor
      • holiday factor
      • weekday factor
      • country GDP factor
      • EU industry index factor
    • Examples
      • notebooks which includes some simple examples
      • streamlit dashboard
    Source code(tar.gz)
    Source code(zip)
Owner
Nike Inc.
Nike Inc.
Coursera Machine Learning - Python code

Coursera Machine Learning This repository contains python implementations of certain exercises from the course by Andrew Ng. For a number of assignmen

Jordi Warmenhoven 859 Dec 10, 2022
easyNeuron is a simple way to create powerful machine learning models, analyze data and research cutting-edge AI.

easyNeuron is a simple way to create powerful machine learning models, analyze data and research cutting-edge AI.

Neuron AI 5 Jun 18, 2022
Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis.

Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis. It is distributed under the MIT License.

Jeong-Yoon Lee 720 Dec 25, 2022
PLUR is a collection of source code datasets suitable for graph-based machine learning.

PLUR (Programming-Language Understanding and Repair) is a collection of source code datasets suitable for graph-based machine learning. We provide scripts for downloading, processing, and loading the

Google Research 76 Nov 25, 2022
SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and TensorFlow

SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and TensorFlow, in High Performance Computing (HPC) simulations and workloads.

Empyrial is a Python-based open-source quantitative investment library dedicated to financial institutions and retail investors

By Investors, For Investors. Want to read this in Chinese? Click here Empyrial is a Python-based open-source quantitative investment library dedicated

Santosh 640 Dec 31, 2022
Tangram makes it easy for programmers to train, deploy, and monitor machine learning models.

Tangram Website | Discord Tangram makes it easy for programmers to train, deploy, and monitor machine learning models. Run tangram train to train a mo

Tangram 1.4k Jan 05, 2023
monolish: MONOlithic Liner equation Solvers for Highly-parallel architecture

monolish is a linear equation solver library that monolithically fuses variable data type, matrix structures, matrix data format, vendor specific data transfer APIs, and vendor specific numerical alg

RICOS Co. Ltd. 179 Dec 21, 2022
A game theoretic approach to explain the output of any machine learning model.

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allo

Scott Lundberg 18.2k Jan 02, 2023
Pydantic based mock data generation

This library offers powerful mock data generation capabilities for pydantic based models. It can also be used with other libraries that use pydantic as a foundation, for example SQLModel, Beanie and

Na'aman Hirschfeld 396 Dec 28, 2022
AI and Machine Learning with Kubeflow, Amazon EKS, and SageMaker

Data Science on AWS - O'Reilly Book Get the book on Amazon.com Book Outline Quick Start Workshop (4-hours) In this quick start hands-on workshop, you

Data Science on AWS 2.8k Jan 03, 2023
The easy way to combine mlflow, hydra and optuna into one machine learning pipeline.

mlflow_hydra_optuna_the_easy_way The easy way to combine mlflow, hydra and optuna into one machine learning pipeline. Objective TODO Usage 1. build do

shibuiwilliam 9 Sep 09, 2022
QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

152 Jan 02, 2023
Drug prediction

I have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications, Drug A, Drug B, Drug c, Dr

Khazar 1 Jan 28, 2022
BigDL: Distributed Deep Learning Framework for Apache Spark

BigDL: Distributed Deep Learning on Apache Spark What is BigDL? BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can w

4.1k Jan 09, 2023
Projeto: Machine Learning: Linguagens de Programacao 2004-2001

Projeto: Machine Learning: Linguagens de Programacao 2004-2001 Projeto de Data Science e Machine Learning de análise de linguagens de programação de 2

Victor Hugo Negrisoli 0 Jun 29, 2021
Kalman filter library

The kalman filter framework described here is an incredibly powerful tool for any optimization problem, but particularly for visual odometry, sensor fusion localization or SLAM.

comma.ai 276 Jan 01, 2023
PyHarmonize: Adding harmony lines to recorded melodies in Python

PyHarmonize: Adding harmony lines to recorded melodies in Python About To use this module, the user provides a wav file containing a melody, the key i

Julian Kappler 2 May 20, 2022
The Fuzzy Labs guide to the universe of open source MLOps

Open Source MLOps This is the Fuzzy Labs guide to the universe of free and open source MLOps tools. Contents What is MLOps, anyway? Data version contr

Fuzzy Labs 352 Dec 29, 2022
Implemented four supervised learning Machine Learning algorithms

Implemented four supervised learning Machine Learning algorithms from an algorithmic family called Classification and Regression Trees (CARTs), details see README_Report.

Teng (Elijah) Xue 0 Jan 31, 2022