Automatic extraction of relevant features from time series:

Overview

tsfresh

This repository contains the TSFRESH python package. The abbreviation stands for

"Time Series Feature extraction based on scalable hypothesis tests".

The package contains many feature extraction methods and a robust feature selection algorithm.

Spend less time on feature engineering

Data Scientists often spend most of their time either cleaning data or building features. While we cannot change the first, the second can be automated. TSFRESH frees your time spent on building features by extracting them automatically. Hence, you have more time to study the newest deep learning paper, read Hacker News or build better models.

Automatic extraction of 100s of features

TSFRESH automatically extracts 100s of features from time series. Those features describe basic characteristics of the time series, such as the number of peaks or the average or maximal value, as well as more complex features such as the time reversal symmetry statistic.

The features extracted from an exemplary time series

The set of features can then be used to construct statistical or machine learning models on the time series to be used for example in regression or classification tasks.
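
For example, extraction is a single function call. The following is a minimal sketch using the robot execution failures example data set that ships with tsfresh; extract_features returns one row of features per time series id:

    from tsfresh import extract_features
    from tsfresh.examples.robot_execution_failures import (
        download_robot_execution_failures,
        load_robot_execution_failures,
    )

    # Download and load the bundled example data set
    download_robot_execution_failures()
    timeseries, y = load_robot_execution_failures()

    # One row of extracted features per id in the input frame
    X = extract_features(timeseries, column_id="id", column_sort="time")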

Forget irrelevant features

Time series often contain noise, redundancies or irrelevant information. As a result, most of the extracted features will not be useful for the machine learning task at hand.

To avoid extracting irrelevant features, the TSFRESH package has a built-in filtering procedure. This filtering procedure evaluates the explaining power and importance of each characteristic for the regression or classification tasks at hand.

It is based on the well-developed theory of hypothesis testing and uses a multiple test procedure. As a result, the filtering process mathematically controls the percentage of irrelevant extracted features.
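
Continuing the sketch above, filtering is a second function call. impute replaces NaN and infinite values so the hypothesis tests can run, and fdr_level controls the expected percentage of irrelevant features among the selected ones:

    from tsfresh import select_features
    from tsfresh.utilities.dataframe_functions import impute

    # Replace NaN and +-inf in-place before the hypothesis tests
    impute(X)

    # Keep only features relevant for predicting y, controlling the
    # false discovery rate at 5%
    X_filtered = select_features(X, y, fdr_level=0.05)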

The TSFRESH package is described in the following open access paper:

  • Christ, M., Braun, N., Neuffer, J. and Kempa-Liehr A.W. (2018). Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh -- A Python package). Neurocomputing 307 (2018) 72-77, doi:10.1016/j.neucom.2018.03.067.

The FRESH algorithm is described in the following whitepaper:

  • Christ, M., Kempa-Liehr, A.W. and Feindt, M. (2017). Distributed and parallel time series feature extraction for industrial big data applications. ArXiv e-print 1610.07717, https://arxiv.org/abs/1610.07717.

Advantages of tsfresh

TSFRESH has several selling points, for example:

  1. it is field tested
  2. it is unit tested
  3. the filtering process is statistically/mathematically correct
  4. it has comprehensive documentation
  5. it is compatible with sklearn, pandas and numpy
  6. it allows anyone to easily add their favorite features (see the sketch below)
  7. it runs on your local machine or even on a cluster
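
Regarding point 6, here is a minimal sketch of a custom feature calculator, following the pattern from the "how to add a custom feature" documentation. The function name, its registration on the feature_calculators module, and the settings dict are illustrative assumptions, not part of tsfresh itself:

    import numpy as np
    from tsfresh.feature_extraction import feature_calculators
    from tsfresh.feature_extraction.feature_calculators import set_property

    @set_property("fctype", "simple")
    def zero_crossings(x):
        """Hypothetical custom feature: count of sign changes in the series."""
        x = np.asarray(x)
        return int(np.sum(x[:-1] * x[1:] < 0))

    # Make the calculator visible to the extractor, then request it by name
    feature_calculators.zero_crossings = zero_crossings
    fc_parameters = {"zero_crossings": None}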

Next steps

If you are interested in the technical workings, see our comprehensive Read-The-Docs documentation at http://tsfresh.readthedocs.io.

The algorithm, especially the filtering part, is also described in the paper mentioned above.

If you have questions or feedback, you can find the developers in the Gitter chatroom.

We appreciate any contributions. If you are interested in helping us make TSFRESH the biggest archive of feature extraction methods in Python, just head over to our How-To-Contribute instructions.

If you want to try out tsfresh quickly or if you want to integrate it into your workflow, we also have a docker image available:

docker pull nbraun/tsfresh

Acknowledgements

The research and development of TSFRESH was funded in part by the German Federal Ministry of Education and Research under grant number 01IS14004 (project iPRODICT).

Comments
  • Shapelet extraction

    One interesting feature with an explanatory ability is shapelet extraction.

    Would it maybe be interesting to implement within this package? A far-from-optimal code example by me can be found here.

    enhancement 
    opened by GillesVandewiele 74
  • extract_features is failing with: "OverflowError: value too large to convert to int"

    I am running extract_features on a very large matrix, having ~350 million rows and 6 features (as part of a complex data science pipeline). I am using a machine with 64 cores and 2TB memory, and utilizing all 64 cores. I am getting this error: "OverflowError: value too large to convert to int". Some comments:

    i) When I split the matrix vertically into, say, 3 chunks (each chunk having 2 features only) and run them sequentially, everything works fine. So it does not seem like I am having issues with "problematic" values in the matrix.
    ii) It does not seem to be a memory related issue either (as alluded to in https://github.com/blue-yonder/tsfresh/issues/368), because I was babysitting the mentioned run that failed and was checking memory usage regularly (using "free -g"). It never got above 400GB.
    iii) I also tried running with LocalDaskDistributor and got the same error.
    iv) All 6 features in the matrix are floats.
    v) pai_tsfresh below is a fork of tsfresh.

    1. Your operating system: Ubuntu 16.04.3 LTS (xenial); no LSB modules are available.

    2. The version of tsfresh that you are using: latest

    3. A minimal code snippet which reproduces the problem/bug. Here's the call in my code to extract_features:

        extracted_features_df = extract_features(rolled_design_matrix,
                                                 column_id='account_date_index',
                                                 column_sort='date',
                                                 default_fc_parameters=fc_parameters,
                                                 n_jobs=64)

    where fc_parameters is:

        {'abs_energy': None,
         'autocorrelation': [{'lag': 1}],
         'binned_entropy': [{'max_bins': 10}],
         'c3': [{'lag': 1}],
         'cid_ce': [{'normalize': True}],
         'fft_aggregated': [{'aggtype': 'centroid'}, {'aggtype': 'variance'},
                            {'aggtype': 'skew'}, {'aggtype': 'kurtosis'}],
         'fft_coefficient': [{'attr': 'real', 'coeff': 0}],
         'sample_entropy': None,
         'spkt_welch_density': [{'coeff': 2}],
         'time_reversal_asymmetry_statistic': [{'lag': 1}]}

    4. Any reported errors or traceback. Here's the traceback:

        Traceback (most recent call last):
          File "/home/yuval/pai/projects/ds-feature-engineering-service/feature_engineering_service/src/fe/stateless/time_series_features_enricher/time_series_features_enricher.py", line 175, in do_enrich
            distributor=local_dask_distributor)
          File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 152, in extract_features
            distributor=distributor)
          File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 217, in _do_extraction
            data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
          File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 217, in <listcomp>
            data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
          File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1922, in get_iterator
            splitter = self._get_splitter(data, axis=axis)
          File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1928, in _get_splitter
            comp_ids, _, ngroups = self.group_info
          File "pandas/_libs/properties.pyx", line 38, in pandas._libs.properties.cache_readonly.__get__
          File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2040, in group_info
            comp_ids, obs_group_ids = self._get_compressed_labels()
          File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in _get_compressed_labels
            all_labels = [ping.labels for ping in self.groupings]
          File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in <listcomp>
            all_labels = [ping.labels for ping in self.groupings]
          File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2750, in labels
            self._make_labels()
          File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2767, in _make_labels
            self.grouper, sort=self.sort)
          File "/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py", line 468, in factorize
            table = hash_klass(size_hint or len(values))
          File "pandas/_libs/hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.StringHashTable.__init__
        OverflowError: value too large to convert to int
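
    A hedged sketch of the vertical-splitting workaround described in comment i) above; the value column names are made up, and fc_parameters is the dict quoted in point 3:

        import pandas as pd
        from tsfresh import extract_features

        value_columns = ["f1", "f2", "f3", "f4", "f5", "f6"]  # hypothetical names
        column_chunks = [value_columns[i:i + 2]
                         for i in range(0, len(value_columns), 2)]

        # Extract features for two value columns at a time, then concatenate
        parts = []
        for cols in column_chunks:
            part = rolled_design_matrix[["account_date_index", "date"] + cols]
            parts.append(extract_features(part,
                                          column_id="account_date_index",
                                          column_sort="date",
                                          default_fc_parameters=fc_parameters))
        extracted_features_df = pd.concat(parts, axis=1)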

    bug 
    opened by yuval-nardi 46
  • Improve performance of the impute function.

    Improve performance of the functions:

    • utilities.dataframe_functions.get_range_values_per_column(df)
    • utilities.dataframe_functions.impute(df_impute)

    More specifically: apply the impute function directly on numpy arrays to improve computation time.

    Now the impute function runs in 109 ms (60 samples, 14256 features/columns).

    Note: I did not improve the performance of impute_dataframe_range(...) since it would have been too much of a hassle to implement all the checks in that function, e.g. in case the min/max/median values of each column are not present. In our case we call get_range_values_per_column just before, so these checks are not necessary. So I just reimplemented the function impute_dataframe_range directly in the impute function. This is less modular. Maybe you could pack this code in a new impute_dataframe_range function.

    Solves issue #123.
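
    For reference, a minimal sketch of the two-step pattern this PR targets, using the public helpers (df_impute stands in for the extracted feature matrix): compute the per-column replacement values once, then impute with them:

        from tsfresh.utilities.dataframe_functions import (
            get_range_values_per_column,
            impute_dataframe_range,
        )

        # Compute the per-column min/max/median once ...
        col_to_max, col_to_min, col_to_median = get_range_values_per_column(df_impute)

        # ... then replace NaN/+inf/-inf with median/max/min respectively
        df_impute = impute_dataframe_range(df_impute, col_to_max, col_to_min, col_to_median)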

    opened by F-A 31
  • Avoid leaking indices from training data sets as feature, classification accuracy depends on order of input time series in data frame

    I attempt to use tsfresh for a simple binary classification using a k-nearest-neighbor-classifier and k-fold-validation. However, the classification accuracy depends on the order of the input time series, which should not be relevant at all.

    The underlying problem is the set of features selected by select_features: value__index_mass_quantile__q_0.8, value__index_mass_quantile__q_0.7, value__index_mass_quantile__q_0.2, value__index_mass_quantile__q_0.3 and so on. All of them are directly proportional to the id in the training data set.

    Now the k-nearest-neighbor classifier just has to decide whether these index "features" are above a certain threshold to make a correct classification.

    I need to disable the consideration of the index for feature extraction. Using the index of the samples in my training data as input for feature extraction reduces my model to absurdity. All features should be only based on the time stamps and the associated values, but not on the order of the samples in my input data.

    How can I disable this incorrect behavior?

    extracted_features = extract_features(time_series, column_id="id", column_sort="time", column_value="value")
    impute(extracted_features)
    features_filtered = select_features(extracted_features, y) # use features_filtered and y as input for k-fold validation
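
    A hedged workaround sketch: drop the index-based calculators from the extraction settings so the index_mass_quantile features are never computed. The dict key name is an assumption matching the feature name:

        from tsfresh import extract_features, select_features
        from tsfresh.feature_extraction import ComprehensiveFCParameters

        settings = ComprehensiveFCParameters()
        del settings["index_mass_quantile"]  # assumed key name

        extracted_features = extract_features(time_series, column_id="id",
                                              column_sort="time",
                                              column_value="value",
                                              default_fc_parameters=settings)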
    

    The time_series data frame is constructed in the same way as the robots example:

        id  time  value
    0    0     1    760
    1    0    11    761
    2    0   466    761
    3    0   473    765
    4    0   481    763
    5    0   488    761
    6    0   516    763
    7    0   532    763
    8    0   542    756
    9    0   610    756
    10   0   618    757
    11   0   885    757
    12   0  1189    757
    13   0  1206    758
    14   0  1263    758
    15   0  1275    760
    16   0  1295    768
    17   1     1    760
    18   1    11    761
    19   1   466    761
    20   1   473    765
    21   1   481    763
    22   1   488    761
    23   1   516    763
    ..  ..   ...    ...
    538 31   885    757
    539 31  1189    757
    540 31  1206    758
    541 31  1263    758
    542 31  1275    760
    543 31  5000    768
    544 32     1    760
    545 32    11    761
    546 32   466    761
    547 32   473    765
    548 32   481    763
    549 32   488    761
    550 32   516    763
    551 32   532    763
    552 32   542    756
    553 32   610    756
    554 32   618    757
    555 32   885    757
    556 32  1189    757
    557 32  1206    758
    558 32  1263    758
    559 32  1275    760
    560 32  5000    768

    The same goes for the target labels y:

    0     1
    1     1
    2     1
    3     1
    4     1
    5     1
    6     1
    7     1
    8     1
    9     1
    10    1
    11    2
    12    2
    13    2
    14    2
    15    2
    16    2
    17    2
    18    2
    19    2
    20    2
    21    2
    22    2
    23    2
    24    2
    25    2
    26    2
    27    2
    28    2
    29    2
    30    2
    31    2
    32    2
    dtype: int64

    opened by fkirc 28
  • Notebook rolling

    It is not ready to be merged.

    However, I would like to get your feedback on this. What do you think about that make_forecasting_frame method?

    The notebooks/timeseries_forecasting_basic_example.ipynb notebook can be used to play a little bit with the method.
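
    For reference, a minimal usage sketch of the method under discussion, assuming a pandas Series x indexed by time; df_shift holds the rolled windows and y the value to forecast one step ahead:

        import pandas as pd
        from tsfresh.utilities.dataframe_functions import make_forecasting_frame

        x = pd.Series(range(20))  # hypothetical toy series
        df_shift, y = make_forecasting_frame(x, kind="value",
                                             max_timeshift=5,
                                             rolling_direction=1)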

    opened by MaxBenChrist 26
  • Parallelization performance v0.7.1 and commit ...1c99e8

    Hi,

    Two topics that I wanted to discuss:

    Runtime increase in latest commit

    We compared the extraction time of version 0.7.1 to code version 1c99e8. We varied the number of ids, where every chunk/time series had length = 91 and the dataframe had 4 kinds of ts columns. As you may see in the figure and table below, we didn't see any significant improvement when using 4 processes in parallel and CHUNKSIZE = None (fastest settings). How does this fit your benchmark results?

    [figure: runtime comparison]

    Results table:

    #ids  | 0.7.1 w parallel [sec] | 1c99e8 w parallel [sec]
    ----- | ---------------------- | -----------------------
    940   | 8.95 ±1.2              | 14.6 ±2.47
    4500  | 46.98 ±14.21           | 61.96 ±1.66
    7612  | 74.37 ±9.18            | 69.07 ±6.3
    30523 | 295.98 ±49.15          | 317.16 ±16.52

    BTW, I actually saw a nice improvement with no-parallelization (N_PROCESSES = 0). See table below:

    #ids | 0.7.1 w/o parallel [sec] | 1c99e8 w/o parallel [sec]
    ---- | ------------------------ | -------------------------
    940  | 12.42 ±0.51              | 10.34 ±0.69
    4500 | 58.04 ±1.04              | 45.79 ±0.09

    mean and std taken from 5 trials

    50% CPU usage with no-parallelization

    Recently I've noticed that while using the package for feature extraction, the CPU usage with no-parallelization settings stays mostly near 50±2%. This means that functions located deeper in _do_extraction_on_chunk are opening many processes/threads that are not configurable via the tsfresh API. Accordingly, what is the point of working with more than 2-3 processes, which would use 100% CPU? I've also noticed that when working with N_PROCESSES > 4, run time tends to rise drastically.

    [figure]

    Disclaimer:

    I've modified the parallelization backend in feature_extraction/extraction.py from multiprocessing to concurrent.futures (ProcessPoolExecutor) in order to allow a hierarchical parallelization scheme. This scheme allows running two modules in parallel: one module for DB extraction, preprocessing and feature extraction (tsfresh), and a second child module for the internal tsfresh parallelization. The change can be seen in the image below.

    [image: https://user-images.githubusercontent.com/13464827/28779134-c5264a92-760a-11e7-91b0-009e9aa8123b.png]

    Benchmark settings:

    • OS: Windows 7
    • Resources: Xeon 48 cores, 32 GB RAM
    • Python interpreter: Python 3.6
    • tsfresh package version: 0.7.1 and 1c99e8
    • feature extraction settings: mean, std, var, median, max, min, sum_values, length, augmented_dickey_fuller, ar_5_params

    Thanks

    opened by NoamGit 23
  • Formatting the data for 'tsfresh'

    First off, thank you for this amazing library, which showed me another way of observing data.

    I read the tutorials, but I think I clearly don't get something right somewhere... Below is the DataFrame 'df' I was trying to feed into 'tsfresh': multiple tickers with n 'features', so that the machine can learn the 'label' for the particular ticker on the particular date.

    [screenshot of the DataFrame]

    X = extract_features(df[:, :-1], column_id='ticker', column_sort='date')
    y = df['label']
    

    But then [ValueError: Index of X must be a subset of y's index] occurred, because the number of unique 'column_id' values equals 3 whereas the number of 'label' values equals 15. I know this is what the tutorial explains; each robot will be predicted 'only once' with all the corresponding time-series data of features.

    My intention was to predict each 'label' per 'time step' for each 'ticker' like the figure above. Could you please help me out with this?

    opened by Nuri8 23
  • STUMPY, Matrix Profiles, and Motif Discovery

    Hello, tsfresh devs/users! First off, I wanted to say thank you for this wonderful and thoughtfully created package. I am a big fan of the work that y'all are doing here!

    I had noticed an older issue (and PR) between @Ezekiel-Kruglick, @GillesVandewiele, @MaxBenChrist, and @nils-braun regarding the earlier work from Eamonn Keogh's group on motif discovery (and shapelets too). If I understand correctly, this discussion happened right before Keogh published his wonderful papers on matrix profiles during the fall of 2017. I was wondering if it is of any interest to the group to re-explore the idea of motif extraction in light of these papers. The STUMPY Python package is focused on providing a fast and user friendly interface for computing the matrix profile and, more importantly, faithfully reproduces Keogh's work. It is Python 3 only and has support for parallel CPU computation via Numba, distributed computations via Dask, multi-GPU support, and maintains 100% code coverage. Depending on the data size, it may fit well with some of the tsfresh use cases.

    Full disclosure, I am the creator of STUMPY so let me know if you see an opportunity to collaborate here!

    new feature calculator 
    opened by seanlaw 21
  • extract_features MemoryError with > 7x5k timeseries

    I'm attempting to do feature extraction on a time series that's 5 minutes long at 30 samples per second, with 7 features. However, I noticed that once I got past ~5k samples, I got a MemoryError.

    Running on Windows 10 64-bit/16gb RAM with python 3.5.2 32-bit, and the master branch of tsfresh. Sample Timeseries, Code & Traceback: https://gist.github.com/ProgBot/0463a68efcbabdb0e6c204c4b8bbf52a

    Is this a limitation of 32 bit Python? Or is tsfresh sadly incapable of handling this amount of data? Thanks

    question 
    opened by ProgBot 21
  • Implement parallelization of feature calculation per kind

    If there are several kinds of time series, their features are calculated in parallel using a process pool. Standard behavior will be one process per cpu. This setting can be overwritten in the FeatureExtractionSettings object provided to extract_features.

    opened by jneuff 21
  • Use tqdm for Jupyter Notebooks

    When using tsfresh in a Jupyter Notebook, the output from the tqdm progress bar is not overwriting itself but creates a new line for every change in percentage. Can we somehow build a switch for using tqdm_notebook?

    opened by anderl80 19
  • Added functionality to test multiple versions of python using tox

    What functionality changed?

    This PR adds functionality to be able to test multiple versions of python using tox.

    This PR will run the test suite for python versions 3.7, 3.8, 3.9, 3.10, and 3.11 for whichever platform the user runs the tests on. The specific microversion of python depends on what is installed on the user's machine.

    tox will skip over running the test suite for a given version if it cannot find a particular python environment.

    How to run?

    To test on multiple python versions, edit envlist in setup.cfg to include the python versions you want to test on, and then run:

    tox -r
    

    in the top-level directory for tsfresh. Currently the versions being tested span between python 3.7-3.11
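
    For illustration, the envlist stanza in setup.cfg could look like the following sketch (an assumption based on tox's setup.cfg support, which prefixes section names with "tox:"; not the exact configuration of this PR):

        # illustrative stanza only
        [tox:tox]
        envlist = py37, py38, py39, py310, py311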

    Why?

    This PR will make it much easier to test tsfresh against multiple versions of python, for a given platform.

    What code changed?

    setup.cfg was changed to include some extra configuration information.

    tox was added to test-requirements.txt

    Tips for reviewing/running

    By default, tox will look in binary directories for any relevant python interpreters. For example, if

    envlist = (py37, py38)
    

    exists in setup.cfg, then tox will look inside binary directories for python executables named similarly to python3.7.X and python3.8.X. If it cannot find any executables for a given python version, then it will skip over testing that version.

    Using pyenv in tandem with tox

    A recommended way to handle multiple versions of python is with pyenv. pyenv will allow you to install multiple versions of python in a well-organised fashion.

    If you choose to use pyenv to manage the different python versions installed on your machine, then the executables of each python version will be in ~/.pyenv/shims/ which will not be found by tox by default. A recommended solution to this is to make a .python-version file, with the versions of python that you want tox to look for.

    For example, if the output of running

    pyenv versions
    

    shows that you have installed python 3.7.16 and python 3.8.16, then you should put 3.7.16 and 3.8.16 as separate lines in the .python-version file. Running tox -r will then run the tests for python3.7.16 and python3.8.16, and tox will know where to find the relevant python interpreters.
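
    In that example, the .python-version file would simply contain:

        3.7.16
        3.8.16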

    pyenv important note

    The python version that the tox command is initially invoked from matters!

    Running pyenv which python will show the version of python from which the tox command will be invoked. If tox is initially invoked from a version of python that is not supported by the package (e.g. tox is invoked from python 3.6 while python_requires is >=3.7), then tox will fail for all environments, including python versions that would otherwise work if tox were originally invoked from a version of python supported by the package.

    Note that we can still test tsfresh on unsupported versions of python (such as 3.6), provided that tox is initially invoked from a version of python that is in tsfresh python_requires (such as 3.8).

    Log files

    Log files are stored in the .tox directory which is created once tox is run.

    opened by Scott-Simmons 2
  • Why make_forecasting_frame does not have min_timeshift argument?

    The problem:

    Hi. I noticed that make_forecasting_frame does not have a min_timeshift argument, so the first few rows have fewer predictor rows.
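
    A hedged workaround sketch: build the rolled frame manually with roll_time_series, which does expose min_timeshift; the flat input frame df here is hypothetical:

        from tsfresh.utilities.dataframe_functions import roll_time_series

        # short leading windows are dropped, per the min_timeshift parameter docs
        rolled = roll_time_series(df, column_id="id", column_sort="time",
                                  min_timeshift=3, max_timeshift=10)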

    Environment:

    • Python version: 3.8.16
    • Operating System: Ubuntu
    • tsfresh version: 0.19.0
    • Install method (conda, pip, source): pip-Colab
    bug 
    opened by arashag 0
  • unmaintained dependency `matrixprofile` makes `tsfresh` uninstallable on python 3.10

    The problem:

    tsfresh cannot be installed on python 3.10 because it has matrixprofile as a dependency, yet matrixprofile cannot be installed on python 3.10 and is no longer maintained.

    Anything else we need to know?:

    matrixprofile is superseded by stumpy.

    Environment:

    • Python version: 3.10
    bug 
    opened by mmp3 6
  • IndexError: cannot do a non-empty take from an empty axes.

    This error occurs when using EfficientFCParameters or ComprehensiveFCParameters (not MinimalFCParameters). The error does not occur with pandas version 1.3.5. However, this is an old version and is not compatible with other python packages. Can you please resolve the issue, making it work with later pandas versions such as 1.4.3?

    bug 
    opened by hn2 0
  • BrokenPipeError

    The problem: I'm trying to run extract_features on my data but keep getting a BrokenPipeError. I tried it on two different computers (both with the same environment) with the same error. The dataset is quite large (merged DataFrame shape: (880169, 522)), so it is expected to run for 20 hours. It runs for a few hours and then crashes.

    Settings:

    features = extract_features(
        merged_time_series,
        column_id="id",
        default_fc_parameters=ComprehensiveFCParameters(),
        n_jobs=15,
        impute_function=impute,
    )
    

    Error (repeated many times):

    Process ForkPoolWorker-1:
    Traceback (most recent call last):
      File "/usr/lib/python3.10/multiprocessing/pool.py", line 131, in worker
        put((job, i, result))
      File "/usr/lib/python3.10/multiprocessing/queues.py", line 377, in put
        self._writer.send_bytes(obj)
      File "/usr/lib/python3.10/multiprocessing/connection.py", line 200, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/usr/lib/python3.10/multiprocessing/connection.py", line 404, in _send_bytes
        self._send(header)
      File "/usr/lib/python3.10/multiprocessing/connection.py", line 368, in _send
        n = write(self._handle, buf)
    BrokenPipeError: [Errno 32] Broken pipe
    

    Anything else we need to know?: I also tried running it with a smaller chunksize and fewer jobs, but with no change.

    features = extract_features(
        merged_time_series,
        column_id="id",
        default_fc_parameters=ComprehensiveFCParameters(),
        n_jobs=8,
        impute_function=impute,
        chunksize=1,
    )
    

    Environment:

    • Python version: 3.10
    • Operating System: Ubuntu 22.04
    • tsfresh version: 0.19.0
    • Install method (conda, pip, source): pip
    bug 
    opened by johan-sightic 0
Releases
  • v0.20.0(Dec 30, 2022)

    • Breaking Change

      • The matrixprofile package becomes an optional dependency
    • Bugfixes/Typos/Documentation:

      • Fix feature extraction of Friedrich coefficients for pandas>1.3.5
      • Fix file paths after example notebooks were moved
  • v0.19.0(Dec 21, 2021)

    • Breaking Change

      • Drop Python 3.6 support due to dependency on statsmodels 0.13
    • Added Features

      • Improve documentation (#831, #834, #851, #853, #870)
      • Add absolute_maximum and mean_n_absolute_max features (#833)
      • Make settings pickable (#845, #847, #910)
      • Disable multiprocessing for n_jobs=1 (#852)
      • Add black, isort, and pre-commit (#876)
    • Bugfixes/Typos/Documentation:

      • Fix conversion of time-series into sequence for lempel_ziv_complexity (#806)
      • Fix range count config (#827)
      • Reword documentation (#893)
      • Fix statsmodels deprecation issues (#898, #912)
      • Fix typo in requirements (#903)
      • Updated references
  • v0.18.0(Mar 6, 2021)

    • Added Features

      • Allow arbitrary rolling sizes (#766)
      • Allow for multiclass significance tests (#762)
      • Add multiclass option to RelevantFeatureAugmenter (#782)
      • Addition of matrix_profile feature (#793)
      • Added new query similarity counter feature (#798)
      • Add root mean square feature (#813)
    • Bugfixes/Typos/Documentation:

      • Do not send coverage of notebook tests to codecov (#759)
      • Fix typos in notebook (#757, #780)
      • Fix output format of make_forecasting_frame (#758)
      • Fix badges and remove benchmark test
      • Fix BY notebook plot (#760)
      • Ts forecast example improvement (#763)
      • Also suppress warnings in dask (#769)
      • Update relevant_feature_augmenter.py (#779)
      • Fix column names in quick_start.rst (#778)
      • Improve relevance table function documentation (#781)
      • Fixed #789 Typo in "how to add custom feature" (#790)
      • Convert to the correct type on warnings (#799)
      • Fix minor typos in the docs (#802)
      • Add unwanted filetypes to gitignore (#819)
      • Fix build and test failures (#815)
      • Fix imputing docu (#800)
      • Bump the scikit-learn version (#822)
  • v0.17.0(Sep 9, 2020)

    We changed the default branch from "master" to "main".

    • Breaking Change
      • Changed constructed id in roll_time_series from string to tuple (#700)
      • Same for add_sub_time_series_index (#720)
    • Added Features
      • Implemented the Lempel-Ziv-Complexity and the Fourier Entropy (#688)
      • Prevent #524 by adding an assert for common identifiers (#690)
      • Added permutation entropy (#691)
      • Added a logo :-) (#694)
      • Implemented the benford distribution feature (#689)
      • Reworked the notebooks (#701, #704)
      • Speed up the result pivoting (#705)
      • Add a test for the dask bindings (#719)
      • Refactor input data iteration to need less memory (#707)
      • Added benchmark tests (#710)
      • Make dask a possible input format (#736)
    • Bugfixes:
      • Fixed a bug in the selection, that caused all regression tasks with un-ordered index to be wrong (#715)
      • Fixed readthedocs (#695, #696)
      • Fix spark and dask after #705 and for non-id named id columns (#712)
      • Fix in the forecasting notebook (#729)
      • Let tsfresh choose the value column if possible (#722)
      • Move from coveralls github action to codecov (#734)
      • Improve speed of data processing (#735)
      • Fix for newer, more strict pandas versions (#737)
      • Fix documentation for feature calculators (#743)
  • v0.16.0(May 12, 2020)

    • Breaking Change
      • Fix the sorting of the parameters in the feature names (#656) The feature names consist of a sorted list of all parameters now. That used to be true for all non-combiner features, and is now also true for combiner features. If you relied on the actual feature name, this is a breaking change.
      • Change the id after the rolling (#668) Now, the old id of your data is still kept. Additionally, we improved the way dataframes without a time column are rolled and how the new sub-time series are named. Also, the documentation was improved a lot.
    • Added Features
      • Added variation coefficient (#654)
      • Added the datetimeindex explanation from the notebook to the docs (#661)
      • Optimize RelevantFeatureAugmenter to avoid re-extraction (#669)
      • Added a function add_sub_time_series_index (#666)
      • Added Dockerfile
      • Speed optimizations and speed testing script (#681)
    • Bugfixes
      • Increase the extracted ar coefficients to the full parameter range. (#662)
      • Documentation fixes (#663, #664, #665)
      • Rewrote the sample_entropy feature calculator (#681) It is now faster and (hopefully) more correct. But your results will change!
  • v0.15.1(May 12, 2020)

  • v0.15.0(Mar 26, 2020)

    • Added Features
      • Add count_above and count_below feature (#632)
      • Add convenience bindings for dask dataframes and pyspark dataframes (#651)
    • Bugfixes
      • Fix documentation build and feature table in sphinx (#637, #631, #627)
      • Add scripts to API documentation
      • Skip dask test for older python versions (#649)
      • Add missing distributor keyword (#648)
      • Fix tuple input for cwt (#645)
  • v0.14.0(Feb 4, 2020)

    • Breaking Change
      • Replace Benjamini-Hochberg implementation with statsmodels implementation (#570)
    • Refactoring and Documentation
      • travis.yml (#605)
      • gitignore (#608)
      • Fix docstring of c3 (#590)
      • Feature/pep8 (#607)
    • Added Features
      • Improve test coverage (#609)
      • Add "autolag" parameter to augmented_dickey_fuller() (#612)
    • Bugfixes
      • Feature/pep8 (#607)
      • Fix filtering on warnings with multiprocessing on Windows (#610)
      • Remove outdated logging config (#621)
      • Replace Benjamini-Hochberg implementation with statsmodels implementation (#570)
      • Fix the kernel and the naming of a notebook (#626)
  • v0.13.0(Nov 24, 2019)

    • Drop python 2.7 support (#568)
    • Fixed bugs
      • Fix cache in friedrich_coefficients and agg_linear_trend (#593)
      • Added a check for wrong column names and a test for this check (#586)
      • Make sure to not install the tests folder (#599)
      • Make sure there is at least a single column which we can use for data (#589)
      • Avoid division by zero in energy_ratio_by_chunks (#588)
      • Ensure that get_moment() uses float computations (#584)
      • Preserve index when column_value and column_kind not provided (#576)
      • Add @set_property("input", "pd.Series") when needed (#582)
      • Fix off-by-one error in longest strike features (fixes #577) (#578)
      • Add set_property import (#572)
      • Fix typo (#571)
      • Fix indexing of melted normalized input (#563)
      • Fix travis (#569)
    • Remove warnings (#583)
    • Update to newest python version (#594)
    • Optimizations
      • Early return from change_quantiles if ql >= qh (#591)
      • Optimize mean_second_derivative_central (#587)
      • Improve performance with Numpy's sum function (#567)
      • Optimize mean_change (fixes issue #542) and correct documentation (#574)
  • v0.12.0(Nov 24, 2019)

    • fixed bugs
      • wrong calculation of friedrich coefficients
      • feature selection selected too many features
      • an ignored max_timeshift parameter in roll_time_series
    • add deprecation warning for python 2
    • added support for index based features
    • new feature calculator
      • linear_trend_timewise
    • enable the RelevantFeatureAugmenter to be used in cross validated pipelines
    • increased scipy dependency to 1.2.0
  • v0.11.1(Nov 24, 2019)

    • general performance improvements
    • removed hard pinning of dependencies
    • fixed bugs
      • the stock price forecasting notebook
      • the multi classification notebook
  • v0.11.0(Nov 24, 2019)

    • new feature calculators:
      • fft_aggregated
      • cid_ce
    • renamed mean_second_derivate_central to mean_second_derivative_central
    • add warning if no relevant features were found in feature selection
    • add columns_to_ignore parameter to from_columns method
    • add distribution module, contains support for distributed feature extraction on Dask