An open source Python library for automated feature engineering

Overview

Featuretools

"One of the holy grails of machine learning is to automate more and more of the feature engineering process." ― Pedro Domingos, A Few Useful Things to Know about Machine Learning

Featuretools is a Python library for automated feature engineering. See the documentation for more information.

Installation

Install with pip:

python -m pip install featuretools

or from the Conda-forge channel on conda:

conda install -c conda-forge featuretools

Add-ons

You can install add-ons individually or all at once by running:

python -m pip install featuretools[complete]

Update checker - Receive automatic notifications of new Featuretools releases

python -m pip install featuretools[update_checker]

TSFresh Primitives - Use 60+ primitives from tsfresh within Featuretools

python -m pip install featuretools[tsfresh]

Example

Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.

>> import featuretools as ft
>> es = ft.demo.load_mock_customer(return_entityset=True)
>> es.plot()

Featuretools can automatically create a single table of features for any "target entity":

>> feature_matrix, features_defs = ft.dfs(entityset=es, target_entity="customers")
>> feature_matrix.head(5)
            zip_code  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount) MODE(sessions.device)  MIN(transactions.amount)  MAX(transactions.amount)  YEAR(join_date)  SKEW(transactions.amount)  DAY(join_date)                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id                                                                                                                                                                                                                                  ...
1              60091                  131               10                  10236.77               desktop                      5.60                    149.95             2008                   0.070041               1                   ...                                                     169.77                                 0.610052                                   41.95                               791.976505                              175.939423                                 9.299023                                 -0.377150                                5.857976                                        1                                -0.395358
2              02139                  122                8                   9118.81                mobile                      5.81                    149.15             2008                   0.028647              20                   ...                                                     114.85                                 0.492531                                   42.96                               596.243506                              230.333502                                10.925037                                  0.962350                                7.420480                                        1                                -0.470007
3              02139                   78                5                   5758.24               desktop                      6.78                    147.73             2008                   0.070814              10                   ...                                                      64.98                                 0.645728                                   21.77                               369.770121                              471.048551                                 9.819148                                 -0.244976                               12.537259                                        1                                -0.630425
4              60091                  111                8                   8205.28               desktop                      5.73                    149.56             2008                   0.087986              30                   ...                                                      83.53                                 0.516262                                   17.27                               584.673126                              322.883448                                13.065436                                 -0.548969                               12.738488                                        1                                -0.497169
5              02139                   58                4                   4571.37                tablet                      5.91                    148.17             2008                   0.085883              19                   ...                                                      73.09                                 0.830112                                   27.46                               313.448942                              198.522508                                 8.950528                                  0.098885                                5.599228                                        1                                -0.396571

[5 rows x 69 columns]

We now have a feature vector for each customer that can be used for machine learning. See the documentation on Deep Feature Synthesis for more examples.
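
Feature names such as SUM(sessions.MEAN(transactions.amount)) describe stacked aggregations: a value is computed per session first, then aggregated per customer. The following is a minimal sketch of that idea in plain Python, not the Featuretools implementation; the toy rows, the session-to-customer mapping, and the variable names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows: (transaction_id, session_id, amount)
transactions = [
    (1, "s1", 10.0),
    (2, "s1", 30.0),
    (3, "s2", 5.0),
    (4, "s3", 20.0),
    (5, "s3", 40.0),
]
# Hypothetical mapping of each session to its parent customer
sessions = {"s1": "c1", "s2": "c1", "s3": "c2"}

# Depth 1: MEAN(transactions.amount), one value per session
amounts_by_session = defaultdict(list)
for _, session_id, amount in transactions:
    amounts_by_session[session_id].append(amount)
session_mean = {s: mean(a) for s, a in amounts_by_session.items()}

# Depth 2: SUM(sessions.MEAN(transactions.amount)), one value per customer
customer_feature = defaultdict(float)
for session_id, customer_id in sessions.items():
    customer_feature[customer_id] += session_mean[session_id]

print(dict(customer_feature))
```

Each extra level of nesting in a feature name corresponds to one more pass of this kind up the chain of related tables.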

Featuretools contains many different types of built-in primitives for creating features. If the primitive you need is not included, Featuretools also allows you to define your own custom primitives.

Demos

Predict Next Purchase

Repository | Notebook

In this demonstration, we use a multi-table dataset of 3 million online grocery orders from Instacart to predict what a customer will buy next. We show how to generate features with automated feature engineering and build an accurate machine learning pipeline using Featuretools, which can be reused for multiple prediction problems. For more advanced users, we show how to scale that pipeline to a large dataset using Dask.

For more examples of how to use Featuretools, check out our demos page.

Testing & Development

The Featuretools community welcomes pull requests. Instructions for testing and development are available here.

Support

The Featuretools community is happy to provide support to users of Featuretools. Project support can be found in four places depending on the type of question:

  1. For usage questions, use Stack Overflow with the featuretools tag.
  2. For bugs, issues, or feature requests, start a GitHub issue.
  3. For discussion regarding development on the core library, use Slack.
  4. For everything else, the core developers can be reached by email at [email protected].

Citing Featuretools

If you use Featuretools, please consider citing the following paper:

James Max Kanter, Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. IEEE DSAA 2015.

BibTeX entry:

@inproceedings{kanter2015deep,
  author    = {James Max Kanter and Kalyan Veeramachaneni},
  title     = {Deep feature synthesis: Towards automating data science endeavors},
  booktitle = {2015 {IEEE} International Conference on Data Science and Advanced Analytics, DSAA 2015, Paris, France, October 19-21, 2015},
  pages     = {1--10},
  year      = {2015},
  organization={IEEE}
}

Built at Alteryx Innovation Labs

Comments
  • Spark Example for Featuretools

    Bug/Feature Request Description

    Notebooks such as https://github.com/Featuretools/predict-next-purchase/blob/master/Tutorial.ipynb and the documentation at https://docs.featuretools.com/usage_tips/scaling.html mention the ability to scale to Spark. Could an example be provided, as was done for Dask here: https://github.com/Featuretools/predict-next-purchase?


    opened by charliec443 26
  • Refactor LatLong and Datetime Primitives into Separate Files

    Pull Request Description

    • Fixes #1855

    Changes: I decided to split all classes containing Lat/Long functions into their own file as well as classes containing date/time into their own file. In each file I also organized classes in alphabetical order. I don't believe there are any conflicts with the new files as I was able to run the tests.

    Comments: Whenever someone is able to review my changes I would also appreciate some input/advice on the testing. I am running them as described on Ubuntu. They run to the end but I do have some failed tests, not sure if this is due to my changes or if it is just part of the process.

    As an aside I apologize for all of the unnecessary commits. I'm still getting the hang of it and understand now I may have gone overboard. Also, I accidentally deleted my original branch which is why I am submitting a second pull request.

    opened by jacobboney 21
  • “IndexError: Too many levels” when running Featuretools dfs after upgrade

    Featuretools' dfs() method fails to run on my entity set after upgrading from v0.1.21 to v0.2.x and v0.3.0.

    The error is raised when the Pandas backend tries to calculate the aggregate features in _calculate_agg_features(). In particular:

    --> 442 to_merge.reset_index(1, drop=True, inplace=True)
    ...
    IndexError: Too many levels: Index has only 1 level, not 2

    This is working fine in v0.1.x and the entity set hasn't changed after the upgrade. The entity set is composed of 7 entities and 6 relationships. Each entity (dataframe) is added via entity_from_dataframe.

    opened by jrkinley-zz 20
  • Memory crashing when using featuretools/dask

    I'm not sure what I'm doing wrong, but basically I'm taking a fairly large dataframe (11 GB) and converting it to Dask before running Featuretools on it. During DFS my system runs out of memory, which is strange to me because I thought it should be writing to disk.

    from dask.distributed import Client, progress
    
    client = Client(n_workers=2, threads_per_worker=2, memory_limit='2GB')
    client
    
    import featuretools as ft
    import dask.dataframe as dd
    dt = {}
    dt.update(dict.fromkeys(catgoricalValues, ft.variable_types.Categorical))
    dt.update(dict.fromkeys(NumericColumns, ft.variable_types.Numeric))
    dask_df = dd.from_pandas(Main[NumericColumns + catgoricalValues], npartitions=50000)
    dask_df  # this works
    
    # Make an entityset and add the entity
    es = ft.EntitySet(id="Test")
    es = es.entity_from_dataframe(entity_id="dask_entity", dataframe=dask_df, make_index=True, index="index", variable_types=dt)

    # Primitives to use
    default_agg_primitives = ["sum", "std", "max", "min", "mean", "count", "percent_true", "num_unique"]
    default_trans_primitives = ["add_numeric", "multiply_numeric"]

    # Run DFS on the single Dask-backed entity
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_entity="dask_entity",
        trans_primitives=default_trans_primitives,
        agg_primitives=default_agg_primitives,
        max_depth=2,
        features_only=False,
        verbose=True,
    )
    

    My session crashes at this point from using all the memory. I followed various tutorials but I'm not sure what I'm doing wrong? My goal is after DFS is done, I would save the results to a file that I can then pass on to TF/Keras.

    opened by gautambak 16
  • How is `DIFF` calculated?

    I read the docs but can't understand how DIFF calculates its value.

    This is part of my example:

    Screen Shot 2019-11-12 at 22 00 07

    I generated this dataframe using dfs(..., time_window=None)

    (the time in the index is the cutoff_time)

    What I can't understand is this: DIFF(MAX(sales.amount)) is calculated by applying DIFF to MAX(sales.amount), but MAX(sales.amount) is an aggregated value, so it should be a single value (the max value before the cutoff time). How does DIFF calculate its value then? I thought DIFF requires at least 2 values.

    If I missed something, please let me know how the first value of DIFF(MAX(sales.amount)), 25714.287, is calculated.

    Thanks

    opened by rightx2 16
  • Calculating direct features use default value if parent missing


    opened by seriallazer 15
  • Support/approach for sliding window/multiple snapshots in time

    Hi there! (First of all, huge thanks for DFS, the vision, and the tools. Superb work.)

    My question: the predict_next_purchase sample uses a single cutoff time, right? But doesn't that remove a lot of data that could help with the purchase prediction? And we're only using a single day for reference, right?

    Only this data/these users are used -> "Using users who had activity during training_window days before the cutoff_time, we look to see if they purchase the product in the prediction_window."

    I would like to use all the data in a single final ML table for the models. Is there support for making the cutoff a sliding window (e.g., for each customer) of features from the last x days, predicting purchase (yes/no) up to x days in the future? Each customer would then appear multiple times, depending on the chosen sliding window.

    I think it's a typical pattern in predicting future events (predictive maintenance, churn, healthcare), and it usually applies to any kind of event prediction. (E.g., for every user or machine, predict the probability of event E for the next x days at a specific point in time. Obviously the training dataset has proper timestamps, so we can "recalculate" feature values for the user/machine up to any point in time.)

    The dataset obviously becomes non-IID, so some cautions apply.

    Does that make sense? What's the approach for using DFS in these scenarios? Thanks!
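
One way to frame this scenario, sketched under assumptions: build one row per (customer, cutoff time) pair so that each customer appears once per window step. The helper name sliding_cutoffs and the dates below are hypothetical. The Featuretools documentation describes passing a table of instance ids and times through the cutoff_time argument of dfs, but how that combines with training_window for this use case is not confirmed here.

```python
from datetime import date, timedelta


def sliding_cutoffs(instance_ids, start, end, step_days):
    """Return (instance_id, cutoff_date) pairs, one per window step."""
    rows = []
    for instance_id in instance_ids:
        cutoff = start
        while cutoff <= end:
            rows.append((instance_id, cutoff))
            cutoff += timedelta(days=step_days)
    return rows


# Two hypothetical customers, weekly cutoffs over two weeks
cutoffs = sliding_cutoffs([1, 2], date(2014, 1, 1), date(2014, 1, 15), step_days=7)
for row in cutoffs:
    print(row)
```

Because consecutive windows for the same customer overlap in time, the resulting rows are correlated, which is the non-IID caution raised above.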

    opened by rquintino 15
  • LatLong type

    The issue in testing comes from mock_ds.py, where the mock retail entityset is made with es.entity_from_csv (line 292). This makes the latlong type in that entityset a string rather than a tuple. The options as I understand them are:

    1. Modify Latitude and Longitude to check if the latlong is a string
    2. Modify entity_from_csv to convert certain strings to tuples
    3. Change the test to do the pandas _from_csv, modify the dataframe and then make entity_from_dataframe.
    4. Leave Latitude and Longitude with no real tests for now.

    My gut is to go with 3. Do you have a preference @kmax12?
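
As a rough illustration of the conversion that options 1 and 2 would need, a string-to-tuple helper could look like the sketch below. The name parse_latlong and the "(lat, long)" string format are assumptions; the format actually read back from the mock CSV is not shown here.

```python
import ast


def parse_latlong(value):
    """Convert a string such as "(42.36, -71.06)" into a (lat, lon) tuple of floats."""
    if isinstance(value, tuple):
        # Already a tuple, nothing to convert
        return value
    # literal_eval only parses Python literals, so it does not execute arbitrary code
    parsed = ast.literal_eval(value)
    if not (isinstance(parsed, tuple) and len(parsed) == 2):
        raise ValueError(f"not a latlong pair: {value!r}")
    lat, lon = parsed
    return (float(lat), float(lon))


print(parse_latlong("(42.36, -71.06)"))
```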

    opened by Seth-Rothschild 15
  • Bug with parallel feature matrix calculation within sklearn cross-validation

    Bug Description

    Hello, guys! Thank you for the quick release of featuretools 1.1.0 !

    During my research I ran into the following bug. I have an estimator which is actually an imblearn Pipeline. The estimator consists of several steps, including my custom transformer, which calculates a feature matrix with featuretools. I want to check the quality of the model with the sklearn cross_validate function. If I set n_jobs > 1 in both featuretools.calculate_feature_matrix and sklearn.cross_validate, I get an unexpected error: ValueError: cannot find context for 'loky'. When either n_jobs is set to 1, everything works fine.

    I googled for some time and learned that such an error might happen when parallelization is used without if __name__ == '__main__', but that is the best information I found. So it looks to me like there is a conflict between how sklearn and featuretools use parallelization. Since both libraries are essential, as is parallelization when working with big data, I really hope you will be able to find a way to fix it :)

    P.S. This problem existed before the 1.0.0 release: I previously used 0.24.0 and still faced it.

    Output of featuretools.show_info()

    Featuretools version: 1.1.0

    SYSTEM INFO

    python: 3.7.5.final.0
    python-bits: 64
    OS: Darwin
    OS-release: 19.6.0
    machine: x86_64
    processor: i386
    byteorder: little
    LC_ALL: None
    LANG: ru_RU.UTF-8
    LOCALE: ru_RU.UTF-8

    INSTALLED VERSIONS

    numpy: 1.21.1
    pandas: 1.3.2
    tqdm: 4.62.2
    PyYAML: 5.4.1
    cloudpickle: 1.6.0
    dask: 2021.10.0
    distributed: 2021.10.0
    psutil: 5.8.0
    pip: 19.2.3
    setuptools: 41.2.0

    opened by VGODIE 14
  • Add include_cutoff_time arg to control whether data at cutoff times a…

    Add include_cutoff_time arg to control whether data at cutoff times are included in feature calculations and prevent training_window overlapping

    Pull Request Description

    There was a data overlap problem when calculating the feature matrix: the data at the cutoff time might be used both in calculating features and in calculating target values (#918). This could cause data leakage and affect the results. There was an attempt to solve the issue (#930), but it still didn't solve the leakage problem. So we decided to add a parameter that controls whether data at cutoff times are included in feature calculations (#942), and this PR implements it.
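
The effect of such a flag can be sketched in plain Python. This is an illustration only, not the code in this PR; the helper rows_up_to_cutoff, its default value, and the sample rows are hypothetical.

```python
from datetime import datetime


def rows_up_to_cutoff(rows, cutoff_time, include_cutoff_time=True):
    """Select the (timestamp, value) rows that may be used to build features."""
    if include_cutoff_time:
        # Rows stamped exactly at the cutoff are used for features
        return [r for r in rows if r[0] <= cutoff_time]
    # Rows stamped exactly at the cutoff are left for the target calculation
    return [r for r in rows if r[0] < cutoff_time]


rows = [
    (datetime(2020, 1, 1), 10.0),
    (datetime(2020, 1, 2), 20.0),
    (datetime(2020, 1, 3), 30.0),
]
cutoff = datetime(2020, 1, 2)

included = rows_up_to_cutoff(rows, cutoff, include_cutoff_time=True)
excluded = rows_up_to_cutoff(rows, cutoff, include_cutoff_time=False)
print(len(included), len(excluded))
```

With the flag off, the row stamped exactly at the cutoff is no longer shared between the feature side and the target side.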

    opened by rightx2 14
  • Fixed #297 update tests to check error strings

    • On Windows, there is currently an open issue in pandas where it raises an error when reading a file with accents in the file path (e.g. régions.csv). I worked around it with the following:
    # featuretools\tests\testing_utils\mock_ds.py:334
    df = pd.read_csv(open(filenames[entity], 'r', encoding='utf8'), encoding='utf-8')
    
    • This snippet np.dtype((np.integer, np.floating)).type was causing this issue. So, I resolved it by changing it to the following:
    np.issubdtype(time, np.integer) or np.issubdtype(time, np.floating)
    
    • Not sure how to get the error text for test_not_enough_memory
    opened by jeff-hernandez 14
  • Consider adding new scalar comparison primitives for Datetime and Ordinal column types

    In PR #2434 Datetime and Ordinal inputs were removed from the valid input types for four comparison primitives. This was done due to errors that could be encountered during feature value calculation. We should consider adding new primitives to replace the lost functionality, if these types of comparisons are needed.

    The Ordinal comparison primitives might be too specific and may not be necessary, but the datetime comparison primitives could be useful.

    Some primitives to add could include (naming could be improved): GreaterThanDate, GreaterThanOrEqualToDate, LessThanDate, LessThanOrEqualToDate

    opened by thehomebrewnerd 0
  • Add global test for NaturalLanguage primitives

    • As per discussion in #2413, this adds a test that checks strings that have caused errors in the past, ~~as well as randomized test inputs~~ to make sure none of our NaturalLanguage primitives hang or fail on generated input.
    • Adds get_natural_language_primitives and refactors get_transform_primitives and get_aggregation_primitives utility functions
    opened by sbadithe 3
  • Fix warnings encountered during unit tests

    Currently when running the full test suite over 17,000 warnings are generated:

    2417 passed, 23 skipped, 87 xfailed, 17879 warnings
    

    Some of these warnings may be expected, but some should be addressed. This may require multiple PRs, but we should investigate these warnings and implement fixes where possible. Of immediate concern are the deprecation warnings that might cause tests to break when new versions of dependencies are released:

    featuretools/tests/primitive_tests/test_num_consecutive.py::TestNumConsecutiveLessMean::test_inf
      /dev/featuretools/featuretools/tests/primitive_tests/test_num_consecutive.py:259: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
        x = x.append(pd.Series([np.inf]))
    

    This Dask warning also appears to be quite common and perhaps of concern.

    featuretools/tests/primitive_tests/test_transform_features.py::test_comparisons_with_ordinal_invalid_inputs[dask_es]
      /dev/featuretools/env/lib/python3.9/site-packages/dask/dataframe/core.py:4134: UserWarning:
      You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
      To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
        Before: .apply(func)
        After:  .apply(func, meta=('priority_level <= ordinal_invalid', 'object'))
    
        warnings.warn(meta_warning(meta))
    
    opened by thehomebrewnerd 0
  • Remove numpy restriction in featuretools[spark] install when possible

    In PR #2414, numpy was restricted to <1.24.0 when installing the featuretools[spark] requirements due to an incompatibility between pyspark and numpy==1.24.0 (likely due to pyspark's use of numpy aliases, which were removed in 1.24.0). Once pyspark is updated to work with numpy 1.24.0, we should remove the upper bound on numpy in the Featuretools requirements as well.

    opened by thehomebrewnerd 0
  • Refactor computation of primitive lists in `DeepFeatureSynthesis` `__init__`

    When building the following lists, there is a lot of code duplication:

    • self.groupby_trans_primitives
    • self.agg_primitives
    • self.where_primitives
    • self.trans_primitives

    Furthermore, refactoring this logic outside of the __init__ would help make the code more expressive and testable.

    enhancement refactor tech debt 
    opened by sbadithe 0
Releases(v1.20.0)
  • v1.20.0(Jan 5, 2023)

    Jan 5, 2023

    • Enhancements
      • Add TimeSinceLastFalse, TimeSinceLastMax, TimeSinceLastMin, and TimeSinceLastTrue primitives (#2418)
      • Add MaxConsecutiveFalse, MaxConsecutiveNegatives, MaxConsecutivePositives, MaxConsecutiveTrue, MaxConsecutiveZeros, NumConsecutiveGreaterMean, NumConsecutiveLessMean (#2420)
    • Fixes
      • Fix typo in _handle_binary_comparison function name and update set_feature_names docstring (#2388)
      • Only allow Datetime time index as input to RateOfChange primitive (#2408)
      • Prevent catastrophic backtracking in regex for NumberOfWordsInQuotes (#2413)
      • Fix to eliminate fragmentation PerformanceWarning in feature_set_calculator.py (#2424)
      • Fix serialization of NumberOfCommonWords feature with custom word_set (#2432)
      • Improve edge case handling in NaturalLanguage primitives by standardizing delimiter regex (#2423)
      • Remove support for Datetime and Ordinal inputs in several primitives to prevent creation of Features that cannot be calculated (#2434)
    • Changes
      • Refactor _all_direct_and_same_path by deleting call to _features_have_same_path (#2400)
      • Refactor _build_transform_features by iterating over input_features once (#2400)
      • Iterate only once over ignore_columns in DeepFeatureSynthesis init (#2397)
      • Resolve empty Pandas series warnings (#2403)
      • Initialize Woodwork with init_with_partial_schema instead of init in EntitySet.add_last_time_indexes (#2409)
      • Updates for compatibility with numpy 1.24.0 (#2414)
      • The delimiter_regex parameter for TotalWordLength has been renamed to do_not_count (#2423)
    • Documentation Changes
      • Remove unused sections from 1.19.0 notes (#2396)

    Thanks to the following people for contributing to this release: @gsheni, @rwedge, @sbadithe, @thehomebrewnerd

    Breaking Changes

    • The delimiter_regex parameter for TotalWordLength has been renamed to do_not_count. Old saved features that had a non-default value for the parameter will no longer load.
    • Support for Datetime and Ordinal inputs has been removed from the LessThanScalar, GreaterThanScalar, LessThanEqualToScalar and GreaterThanEqualToScalar primitives.
  • v1.19.0(Dec 9, 2022)

    v1.19.0 Dec 9, 2022

    • Enhancements
      • Add OneDigitPostalCode and TwoDigitPostalCode primitives (#2365)
      • Add ExpandingCount, ExpandingMin, ExpandingMean, ExpandingMax, ExpandingSTD and ExpandingTrend primitives (#2343)
    • Fixes
      • Fix DeepFeatureSynthesis to consider the base_of_exclude family of attributes when creating transform features (#2380)
      • Fix bug with negative version numbers in test_version (#2389)
      • Fix bug in MultiplyNumericBoolean primitive that can cause an error with certain input dtype combinations (#2393)
    • Testing Changes
      • Fix version comparison in test_holiday_out_of_range (#2382)

    Thanks to the following people for contributing to this release: @sbadithe, @thehomebrewnerd

  • v1.18.0(Nov 15, 2022)

    v1.18.0 Nov 15, 2022

    • Enhancements
      • Add RollingOutlierCount primitive (#2129)
      • Add RateOfChange primitive (#2359)
    • Fixes
      • Sets uses_full_dataframe for Rolling* and Exponential* primitives (#2354)
      • Updates for compatibility with upcoming Woodwork release 0.21.0 (#2363)
      • Updates demo dataset location to use new links (#2366)
      • Fix test_holiday_out_of_range after holidays release 0.17 (#2373)
    • Changes
      • Remove click and CLI functions (list-primitives, info) (#2353, #2358)
    • Documentation Changes
      • Build docs in parallel with Sphinx (#2351)
      • Use non-editable install to allow local docs build (#2367)
      • Remove primitives.featurelabs.com website from documentation (#2369)
    • Testing Changes
      • Replace use of pytest's tmpdir fixture with tmp_path (#2344)

    Thanks to the following people for contributing to this release: @gsheni, @rwedge, @sbadithe, @tamargrey, @thehomebrewnerd

  • v1.17.0(Oct 31, 2022)

    v1.17.0 Oct 31, 2022

    • Enhancements

      • Add featuretools-sklearn-transformer as an extra installation option (#2335)
      • Add CountAboveMean, CountBelowMean, CountGreaterThan, CountInsideNthSTD, CountInsideRange, CountLessThan, CountOutsideNthSTD, CountOutsideRange (#2336)
    • Changes

      • Restructure primitives directory to use individual primitives files (#2331)
      • Restrict 2022.10.1 for dask and distributed (#2347)
    • Documentation Changes

      • Add Featuretools-SQL to Install page on documentation (#2337)
      • Fixes broken link in Featuretools documentation (#2339)

      Thanks to the following people for contributing to this release: @gsheni, @rwedge, @sbadithe, @thehomebrewnerd

  • v1.16.0(Oct 24, 2022)

    • Enhancements
      • Add ExponentialWeighted primitives and DateToTimeZone primitive (#2318)
      • Add 14 natural language primitives from nlp_primitives library (#2328)
    • Documentation Changes
      • Fix typos in aggregation_primitive_base.py and features_deserializer.py (#2317) (#2324)
      • Update SQL integration documentation to reflect Snowflake compatibility (#2313)
    • Testing Changes
      • Add Windows install test (#2330)

    Thanks to the following people for contributing to this release: @gsheni, @sbadithe, @thehomebrewnerd

  • v1.15.0(Oct 6, 2022)

    v1.15.0 Oct 6, 2022

    • Enhancements
      • Add series_library attribute to EntitySet dictionary (#2257)
      • Leverage Library Enum inheriting from str (#2275)
    • Changes
      • Change default gap for Rolling* primitives from 0 to 1 to prevent accidental leakage (#2282)
      • Updates for pandas 1.5.0 compatibility (#2290, #2291, #2308)
      • Exclude documentation files from release workflow (#2295)
      • Bump requirements for optional pyspark dependency (#2299)
      • Bump scipy and woodwork[spark] dependencies (#2306)
    • Documentation Changes
      • Add documentation describing how to use featuretools_sql with featuretools (#2262)
      • Remove featuretools_sql as a docs requirement (#2302)
      • Fix typo in DiffDatetime doctest (#2314)
      • Fix typo in EntitySet documentation (#2315)
    • Testing Changes
      • Remove graphviz version restrictions in Windows CI tests (#2285)
      • Run CI tests with pytest -n auto (#2298, #2310)

    Thanks to the following people for contributing to this release: @gsheni, @rwedge, @sbadithe, @thehomebrewnerd

    Breaking Changes

    • The EntitySet schema has been updated to include a series_library attribute
    • The default behavior of the Rolling* primitives has changed in this release. If this primitive was used without defining the gap value, the feature values returned with this release will be different than feature values from prior releases.
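
    To illustrate why the gap matters, here is a simplified rolling mean in plain Python. It is a sketch under assumptions, not the Rolling* implementation: the helper rolling_mean is hypothetical, windows are counted in rows, and a gap of 0 is assumed to include the current row in its own window.

```python
def rolling_mean(values, window_length, gap):
    """Average up to `window_length` values ending `gap` rows before each position."""
    result = []
    for i in range(len(values)):
        end = i - gap + 1  # exclusive end of the window
        start = max(0, end - window_length)
        window = values[start:end] if end > 0 else []
        # No usable rows yet: emit None instead of a number
        result.append(sum(window) / len(window) if window else None)
    return result


values = [1.0, 2.0, 3.0, 4.0]
with_gap_0 = rolling_mean(values, window_length=2, gap=0)  # current row included
with_gap_1 = rolling_mean(values, window_length=2, gap=1)  # current row excluded
print(with_gap_0)
print(with_gap_1)
```

    In this sketch, a gap of 1 means each output depends only on earlier rows, which is the leakage the new default is meant to prevent.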
  • v1.15.0.dev0(Oct 5, 2022)

  • v1.14.0(Sep 1, 2022)

    v1.14.0 Sep 1, 2022

    • Enhancements
      • Replace NumericLag with Lag primitive (#2252)
      • Refactor build_features to speed up long running DFS calls by 50% (#2224)
    • Fixes
      • Fix compatibility issues with holidays 0.15 (#2254)
    • Changes
      • Update release notes to make clear conda release portion (#2249)
      • Use pyproject.toml only (move away from setup.cfg) (#2260, #2263, #2265)
      • Add entry point instructions for pyproject.toml project (#2272)
    • Documentation Changes
      • Fix to remove warning from Using Spark EntitySets Guide (#2258)
    • Testing Changes
      • Add tests/profiling/dfs_profile.py (#2224)
      • Add workflow to test featuretools without test dependencies (#2274)

    Thanks to the following people for contributing to this release: @cp2boston, @gsheni, @ozzieD, @stefaniesmith, @thehomebrewnerd

  • v1.13.0(Aug 18, 2022)

    v1.13.0 Aug 18, 2022

    • Fixes
      • Allow boolean columns to be included in remove_highly_correlated_features (#2231)
    • Changes
      • Refactor schema version checking to use packaging method (#2230)
      • Extract duplicated logic for Rolling primitives into a general utility function (#2218)
      • Set pandas version to >=1.4.0 (#2246)
      • Remove workaround in roll_series_with_gap caused by pandas version < 1.4.0 (#2246)
    • Documentation Changes
      • Add line breaks between sections of IsFederalHoliday primitive docstring (#2235)
    • Testing Changes
      • Update create feedstock PR forked repo to use (#2223, #2237)
      • Update development requirements and use latest for documentation (#2225)

    Thanks to the following people for contributing to this release: @gsheni, @ozzieD, @sbadithe, @tamargrey

  • v1.12.1(Aug 4, 2022)

    v1.12.1 Aug 4, 2022

    • Fixes
      • Update Trend and RollingTrend primitives to work with IntegerNullable inputs (#2204)
      • camel_and_title_to_snake handles snake case strings with numbers (#2220)
      • Change _get_description to split on blank lines to avoid truncating primitive descriptions (#2219)
    • Documentation Changes
      • Add instructions to add new users to featuretools feedstock (#2215)
    • Testing Changes
      • Add create feedstock PR workflow (#2181)
      • Add performance tests for python 3.9 and 3.10 (#2198, #2208)
      • Add test to ensure primitive docstrings use standardized verbs (#2200)
      • Configure codecov to avoid premature PR comments (#2209)

    Thanks to the following people for contributing to this release: @gsheni, @rwedge, @sbadithe, @tamargrey, @thehomebrewnerd
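The snake case fix above concerns names that are already in snake case and contain digits. A minimal sketch of a camel/title-to-snake converter that leaves such names alone is shown below. The function `to_snake` is hypothetical and uses a common two-pass regular expression recipe; it is not the body of `camel_and_title_to_snake`.

```python
import re


def to_snake(name):
    """Convert CamelCase or TitleCase to snake_case (hypothetical helper)."""
    # Split before a capitalized word that follows any character: "NumUnique" -> "Num_Unique".
    name = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Split between a lowercase letter or digit and a following capital.
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return name.lower()


converted = to_snake("NumUnique")
unchanged = to_snake("rolling_mean_2")
```

Because neither pattern matches lowercase letters, underscores, or digits alone, a name such as `rolling_mean_2` passes through unchanged.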

  • v1.12.0(Jul 19, 2022)

    v1.12.0 Jul 19, 2022

    warning: This release of Featuretools will not support Python 3.7

    • Enhancements
      • Add IsWorkingHours and IsLunchTime transform primitives (#2130)
      • Add periods parameter to Diff and add DiffDatetime primitive (#2155)
      • Add RollingTrend primitive (#2170)
    • Fixes
      • Resolves Woodwork integration test failure and removes Python version check for codecov (#2182)
    • Changes
      • Drop Python 3.7 support (#2169, #2186)
      • Add pre-commit hooks for linting (#2177)
    • Documentation Changes
      • Augment single table entry in DFS to include information about passing in a dictionary for dataframes argument (#2160)
    • Testing Changes
      • Standardize imports across test files to simplify accessing featuretools functions (#2166)

    Thanks to the following people for contributing to this release: @dvreed77, @gsheni, @ozzieD, @rwedge, @sbadithe
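As a rough illustration of what hour-based transform primitives such as IsWorkingHours and IsLunchTime compute, here is a standard-library sketch. The function names, the default hours (8, 18, and 12), and the half-open working interval are assumptions made for this example and may not match the primitives' actual defaults.

```python
from datetime import datetime


def is_working_hours(ts, start_hour=8, end_hour=18):
    """True if `ts` falls in [start_hour, end_hour). Defaults are assumed."""
    return start_hour <= ts.hour < end_hour


def is_lunch_time(ts, lunch_hour=12):
    """True if `ts` falls within the hour starting at `lunch_hour` (assumed)."""
    return ts.hour == lunch_hour


morning = datetime(2022, 7, 19, 9, 30)
noon = datetime(2022, 7, 19, 12, 15)
evening = datetime(2022, 7, 19, 18, 0)
flags = [(is_working_hours(t), is_lunch_time(t)) for t in (morning, noon, evening)]
```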

  • v1.11.1(Jul 5, 2022)

    v1.11.1 Jul 5, 2022

    • Fixes
      • Remove 24th hour from PartOfDay primitive and add 0th hour (#2167)

    Thanks to the following people for contributing to this release: @tamargrey
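The PartOfDay fix reflects the fact that datetime hours run from 0 to 23, so hour 24 never occurs and hour 0 must be covered. The sketch below shows a mapping that covers exactly that range. The bucket names and boundaries are hypothetical and are not the categories used by the PartOfDay primitive.

```python
def part_of_day(hour):
    """Map an hour in 0..23 to a coarse label (hypothetical buckets)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


# Every valid hour, including midnight (0), receives a label.
labels = [part_of_day(h) for h in range(24)]
```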

  • v1.11.0(Jun 30, 2022)

    v1.11.0 Jun 30, 2022

    • Enhancements
      • Add datetime and string types as valid arguments to dfs cutoff_time (#2147)
      • Add PartOfDay transform primitive (#2128)
      • Add IsYearEnd, IsYearStart transform primitives (#2124)
      • Add Feature.set_feature_names method to directly set output column names for multi-output features (#2142)
      • Include np.nan testing for DayOfYear and DaysInMonth primitives (#2146)
      • Allow dfs kwargs to be passed into get_valid_primitives (#2157)
    • Fixes
    • Changes
      • Improve serialization and deserialization to reduce storage of duplicate primitive information (#2136, #2127, #2144)
      • Sort core requirements and test requirements in setup cfg (#2152)
    • Documentation Changes
    • Testing Changes
      • Fix pandas warning and reduce dask .apply warnings (#2145)
      • Pin graphviz version used in windows tests (#2159)

    Thanks to the following people for contributing to this release: @gsheni, @ozzieD, @rwedge, @sbadithe, @tamargrey, @thehomebrewnerd
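Accepting either a datetime or a string for a cutoff time amounts to normalizing the input to a single type. The sketch below shows one way to do that with the standard library. `normalize_cutoff_time` is a hypothetical helper, and ISO 8601 parsing is an assumption; these notes do not say how Featuretools parses strings.

```python
from datetime import datetime


def normalize_cutoff_time(cutoff_time):
    """Return a datetime for a datetime or ISO 8601 string (hypothetical helper)."""
    if isinstance(cutoff_time, datetime):
        return cutoff_time
    if isinstance(cutoff_time, str):
        return datetime.fromisoformat(cutoff_time)
    raise TypeError("cutoff_time must be a datetime or an ISO 8601 string")


from_string = normalize_cutoff_time("2014-01-01 05:30:00")
from_datetime = normalize_cutoff_time(datetime(2014, 1, 1, 5, 30))
```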

  • v1.10.0(Jun 23, 2022)

    v1.10.0 Jun 23, 2022

    • Enhancements
      • Add DayOfYear, DaysInMonth, Quarter, IsLeapYear, IsQuarterEnd, IsQuarterStart transform primitives (#2110, #2117)
      • Add IsMonthEnd, IsMonthStart transform primitives (#2121)
      • Move Quarter test cases (#2123)
      • Add summarize_primitives function for getting metrics about available primitives (#2099)
    • Changes
      • Changes for compatibility with numpy 1.23.0 (#2135, #2137)
    • Documentation Changes
      • Update contributing.md to add pandoc (#2103, #2104)
      • Update NLP primitives section of API reference (#2109)
      • Fixing release notes formatting (#2139)
    • Testing Changes
      • Latest dependency checker installs spark dependencies (#2112)
      • Fix test failures with pyspark v3.3.0 (#2114, #2120)

    Thanks to the following people for contributing to this release: @gsheni, @ozzieD, @rwedge, @sbadithe, @thehomebrewnerd
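The calendar quantities named by the new datetime primitives can be computed for a single date with the standard library, as sketched below. The function names are hypothetical stand-ins chosen to mirror the primitive names; the primitives themselves operate on whole columns, and this sketch is not their implementation.

```python
import calendar
from datetime import date


def day_of_year(d):
    return d.timetuple().tm_yday


def days_in_month(d):
    return calendar.monthrange(d.year, d.month)[1]


def quarter(d):
    return (d.month - 1) // 3 + 1


def is_leap_year(d):
    return calendar.isleap(d.year)


def is_month_end(d):
    return d.day == days_in_month(d)


def is_quarter_end(d):
    # The last month of each quarter is a multiple of 3.
    return d.month % 3 == 0 and is_month_end(d)


def is_quarter_start(d):
    # Quarters begin in January, April, July, and October.
    return d.month % 3 == 1 and d.day == 1


release_day = date(2022, 6, 30)
leap_day = date(2020, 2, 29)
```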

  • v1.9.2(Jun 10, 2022)

    v1.9.2 Jun 10, 2022

    • Fixes
      • Add feature origin information to all multi-output feature columns (#2102)
    • Documentation Changes
      • Update contributing.md to add pandoc (#2103)

    Thanks to the following people for contributing to this release: @gsheni, @thehomebrewnerd

  • v1.9.1(May 27, 2022)

    v1.9.1 May 27, 2022

    • Enhancements
      • Update DateToHoliday and DistanceToHoliday primitives to work with timezone-aware inputs (#2056)
    • Changes
      • Delete setup.py, MANIFEST.in and move configuration to pyproject.toml (#2046)
    • Documentation Changes
      • Update slack invite link to new (#2044)
      • Add slack and stackoverflow icon to footer (#2087)
      • Update dead links in docs and docstrings (#2092)
    • Testing Changes
      • Skip test for normalize_dataframe due to different error coming from Woodwork in 0.16.3 (#2052)
      • Fix Woodwork install in test with Woodwork main branch (#2055)
      • Use codecov action v3 (#2039)
      • Add workflow to kickoff EvalML unit tests with Featuretools main (#2072)
      • Rename yml to yaml for GitHub Actions workflows (#2073, #2077)
      • Update Dask test fixtures to prevent flaky behavior (#2079)
      • Update Makefile with better pkg command (#2081)
      • Add scheduled workflow that checks for broken links in documentation (#2084)

    Thanks to the following people for contributing to this release: @gsheni, @rwedge, @thehomebrewnerd

  • v1.9.0(Apr 27, 2022)

    v1.9.0 Apr 27, 2022

    • Enhancements
      • Improve UnusedPrimitiveWarning with additional information (#2003)
      • Update DFS primitive matching to use all inputs defined in primitive input_types (#2019)
      • Add MultiplyNumericBoolean primitive (#2035)
    • Fixes
      • Fix issue with Ordinal inputs to binary comparison primitives (#2024, #2025)
    • Changes
      • Updated autonormalize version requirement (#2002)
      • Remove extra NaN checking in LatLong primitives (#1924)
      • Normalize LatLong NaN values during EntitySet creation (#1924)
      • Pass primitive dictionaries into check_primitive to avoid repetitive calls (#2016)
      • Remove Boolean and BooleanNullable from MultiplyNumeric primitive inputs (#2022)
      • Update serialization for compatibility with Woodwork version 0.16.1 (#2030)
    • Documentation Changes
      • Update README text to Alteryx (#2010, #2015)
    • Testing Changes
      • Update unit tests with Woodwork main branch workflow name (#2033)

    Thanks to the following people for contributing to this release: @dvreed77, @gsheni, @rwedge, @thehomebrewnerd
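Multiplying a numeric column by a boolean column keeps the number where the flag is true and yields zero where it is false. The sketch below shows that element-wise operation on plain lists. `multiply_numeric_boolean` is a hypothetical function, and propagating missing values as None is an assumption made for the example.

```python
def multiply_numeric_boolean(numbers, flags):
    """Element-wise number * flag; a missing input gives None (assumed)."""
    return [
        None if n is None or f is None else n * int(f)
        for n, f in zip(numbers, flags)
    ]


masked = multiply_numeric_boolean([2.5, 4, None, 7], [True, False, True, None])
```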

  • v1.8.0(Mar 31, 2022)

    • Changes
      • Removed make_trans_primitive and make_agg_primitive utility functions (#1970)
    • Documentation Changes
      • Update project urls in setup cfg to include Twitter and Slack (#1981)
      • Update nbconvert to version 6.4.5 to fix docs build issue (#1984)
      • Update ReadMe to have centered badges and add docs badge (#1993)
      • Add M1 installation instructions to docs and contributing (#1997)
    • Testing Changes
      • Updated scheduled workflows to only run on Alteryx owned repos (#1973)
      • Updated minimum dependency checker to use new version with write file support (#1975, #1976)
      • Add black linting package and remove autopep8 (#1978)
      • Update tests for compatibility with Woodwork version 0.15.0 (#1984)

    Thanks to the following people for contributing to this release: @gsheni, @thehomebrewnerd

  • v1.7.0(Mar 16, 2022)

    v1.7.0 Mar 16, 2022

    • Enhancements
      • Add support for Python 3.10 (#1940)
      • Added the SquareRoot, NaturalLogarithm, Sine, Cosine and Tangent primitives (#1948)
    • Fixes
      • Updated the conda install commands to specify the channel (#1917)
    • Changes
      • Update error message when DFS returns an empty list of features (#1919)
      • Remove list_variable_types and related directories (#1929)
      • Transition to use pyproject.toml and setup.cfg (moving away from setup.py) (#1941, #1950, #1952, #1954, #1957, #1964)
      • Replace Koalas with pandas API on Spark (#1949)
    • Documentation Changes
      • Add time series guide (#1896)
      • Update minimum nlp_primitives requirement for docs (#1925)
      • Add GitHub URL for PyPi (#1928)
      • Add backport release support (#1932)
      • Update instructions in release.md (#1963)
    • Testing Changes
      • Update test cases to cover main.py file (#1927)
      • Upgrade moto requirement (#1929, #1938)
      • Add Python 3.9 linting, install complete, and docs build CI tests (#1934)
      • Add CI workflow to test with latest woodwork main branch (#1936)
      • Add lower bound for wheel for minimum dependency checker and limit lint CI tests to Python 3.10 (#1945)
      • Fix non-deterministic test in test_es.py (#1961)

    Thanks to the following people for contributing to this release: @andriyor, @gsheni, @jeff-hernandez, @kushal-gopal, @mingdavidqi, @rwedge, @tamargrey, @thehomebrewnerd, @tvdboom

  • v1.7.0.dev2(Mar 16, 2022)

  • v1.7.0.dev1(Mar 15, 2022)

  • v1.7.0.dev0(Mar 15, 2022)

  • v1.6.0(Feb 17, 2022)

    v1.6.0 Feb 17, 2022

    • Enhancements
      • Add IsFederalHoliday transform primitive (#1912)
    • Fixes
      • Fix to catch new NotImplementedError raised by holidays library for unknown country (#1907)
    • Changes
      • Remove outdated pandas workaround code (#1906)
    • Documentation Changes
      • Add in-line tabs and copy-paste functionality to docs (#1905)
    • Testing Changes
      • Fix URL deserialization file (#1909)

    Thanks to the following people for contributing to this release: @jeff-hernandez, @rwedge, @thehomebrewnerd

  • v1.5.0(Feb 14, 2022)

    v1.5.0 Feb 14, 2022

    warning: Featuretools may not support Python 3.7 in next non-bugfix release.

    • Enhancements
      • Add ability to use offset alias strings as inputs to rolling primitives (#1809)
      • Update to add support for pandas version 1.4.0 (#1881, #1895)
    • Fixes
      • Fix featuretools_primitives entry point (#1891)
    • Changes
      • Allow only snake, camel, and title case for primitives (#1854)
      • Add autonormalize as an add-on library (#1840)
      • Add DateToHoliday Transform Primitive (#1848)
      • Add DistanceToHoliday Transform Primitive (#1853)
      • Temporarily restrict pandas and koalas max versions (#1863)
      • Add __setitem__ method to overload add_dataframe method on EntitySet (#1862)
      • Add support for woodwork 0.12.0 (#1872, #1897)
      • Split Datetime and LatLong primitives into separate files (#1861)
      • Null values will not be included in index of normalized dataframe (#1897)
    • Documentation Changes
      • Bump ipython version (#1857)
      • Update README.md with Alteryx link (#1886)
    • Testing Changes
      • Add check for package conflicts with install workflow (#1843)
      • Change auto approve workflow to use assignee (#1843)
      • Update auto approve workflow to delete branch and change on trigger (#1852)
      • Upgrade tests to use compose version 0.8.0 (#1856)
      • Updated deep feature synthesis and feature serialization tests to use new primitive files (#1861)

    Thanks to the following people for contributing to this release: @dvreed77, @gsheni, @jacobboney, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd, @tuethan1999
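The `__setitem__` change means item assignment on an EntitySet delegates to `add_dataframe`. The toy class below sketches that pattern. `ToyEntitySet` is a stand-in written for this example, not the real EntitySet, and it stores whatever object it is given instead of validating a dataframe.

```python
class ToyEntitySet:
    """Minimal stand-in that shows __setitem__ delegating to add_dataframe."""

    def __init__(self):
        self.dataframes = {}

    def add_dataframe(self, dataframe, dataframe_name):
        # The real method does far more (typing, indexes); this only stores it.
        self.dataframes[dataframe_name] = dataframe
        return self

    def __setitem__(self, dataframe_name, dataframe):
        # es["name"] = df becomes es.add_dataframe(dataframe=df, dataframe_name="name").
        self.add_dataframe(dataframe=dataframe, dataframe_name=dataframe_name)


es = ToyEntitySet()
es["customers"] = [{"customer_id": 1}]
```

Routing the assignment through `add_dataframe` keeps a single code path, so any validation added there applies to both spellings.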

  • v1.4.1(Jan 28, 2022)

    v1.4.1 Jan 28, 2022

    • Changes
      • Set upper bound for compatible Woodwork version (#1872)
      • Restrict pandas and koalas max versions (#1863)
    • Testing Changes
      • Upgrade tests to use compose version 0.8.0 (#1856)

    Thanks to the following people for contributing to this release: @dvreed77, @thehomebrewnerd

  • v1.4.0(Jan 11, 2022)

    • Enhancements
      • Add LatLong transform primitives - GeoMidpoint, IsInGeoBox, CityblockDistance (#1814)
      • Add issue templates for bugs, feature requests and documentation improvements (#1834)
    • Fixes
      • Fix bug where Woodwork initialization could fail on feature matrix if cutoff times caused null values to be introduced (#1810)
    • Changes
      • Skip code coverage for specific dask usage lines (#1829)
      • Increase minimum required numpy version to 1.21.0, scipy to 1.3.3, koalas to 1.8.1 (#1833)
      • Remove pyyaml as a requirement (#1833)
    • Documentation Changes
      • Remove testing on conda forge in release.md (#1811)
    • Testing Changes
      • Enable auto-merge for minimum and latest dependency merge requests (#1818, #1821, #1822)
      • Change auto approve workflow to use PR number and run every 30 minutes (#1827)
      • Add auto approve workflow to run when unit tests complete (#1837)
      • Test deserializing from S3 with mocked S3 fixtures only (#1825)
      • Remove fastparquet as a test requirement (#1833)

    Thanks to the following people for contributing to this release: @davesque, @gsheni, @rwedge, @thehomebrewnerd
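Two of the LatLong ideas named above, a bounding-box test and a midpoint, can be sketched on (latitude, longitude) tuples as follows. The function names and signatures are hypothetical, and the midpoint is a plain coordinate average, which is an assumption and only a reasonable approximation for nearby points. This is not the code of the IsInGeoBox or GeoMidpoint primitives.

```python
def is_in_geo_box(point, corner_a, corner_b):
    """True if `point` lies inside the box spanned by two opposite corners."""
    lat, lon = point
    lat_low, lat_high = sorted((corner_a[0], corner_b[0]))
    lon_low, lon_high = sorted((corner_a[1], corner_b[1]))
    return lat_low <= lat <= lat_high and lon_low <= lon <= lon_high


def naive_midpoint(point_a, point_b):
    """Average the coordinates of two points (ignores the Earth's curvature)."""
    return ((point_a[0] + point_b[0]) / 2, (point_a[1] + point_b[1]) / 2)


inside = is_in_geo_box((40.7, -74.0), (40.0, -75.0), (41.0, -73.0))
outside = is_in_geo_box((42.0, -74.0), (40.0, -75.0), (41.0, -73.0))
middle = naive_midpoint((10.0, 20.0), (30.0, 40.0))
```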

  • v1.3.0(Dec 2, 2021)

    • Enhancements
      • Add NumericLag transform primitive (#1797)
    • Changes
      • Update pip to 21.3.1 for test requirements (#1789)
    • Documentation Changes
      • Add Docker install instructions and documentation on the install page (#1785)
      • Update install page on documentation with correct python version (#1784)
      • Fix formatting in Improving Computational Performance guide (#1786)

    Thanks to the following people for contributing to this release: @gsheni, @HenryRocha, @tamargrey, @thehomebrewnerd

  • v1.3.0.dev0(Dec 2, 2021)

  • v1.2.0(Nov 15, 2021)

    • Enhancements
      • Add Rolling Transform primitives with integer parameters (#1770)
    • Fixes
      • Handle new graphviz FORMATS import (#1770)
    • Changes
      • Add new version of featuretools_tsfresh_primitives as an add-on library (#1772)
      • Add load_weather as demo dataset for time series (#1777)

    Thanks to the following people for contributing to this release: @gsheni, @tamargrey

  • v1.2.0.dev0(Nov 15, 2021)
