UpliftML: A Python Package for Scalable Uplift Modeling

Overview

UpliftML: A Python Package for Scalable Uplift Modeling

upliftml

UpliftML is a Python package for scalable unconstrained and constrained uplift modeling from experimental data. To accommodate working with big data, the package uses PySpark and H2O models as base learners for the uplift models. Evaluation functions expect a PySpark dataframe as input.

Uplift modeling is a family of techniques for estimating the Conditional Average Treatment Effect (CATE) from experimental or observational data using machine learning. In particular, we are interested in estimating the causal effect of a treatment T on the outcome Y of an individual characterized by features X. In experimental data with binary treatments and binary outcomes, this is equivalent to estimating Pr(Y=1 | T=1, X=x) - Pr(Y=1 | T=0, X=x).

In many practical use cases the goal is to select which users to target in order to maximize the overall uplift without exceeding a specified budget or ROI constraint. In those cases, estimating uplift alone is not sufficient to make optimal decisions and we need to take into account the costs and monetary benefit incurred by the treatment.

Uplift modeling is an emerging tool for various personalization applications. Example use cases include marketing campaigns personalization and optimization, personalized pricing in e-commerce, and clinical treatment personalization.

The UpliftML library includes PySpark/H2O implementations for the following:

  • 6 metalearner approaches for uplift modeling: T-learner[1], S-learner[1], X-learner[1], R-learner[2], class variable transformation[3], transformed outcome approach[4].
  • The Retrospective Estimation[5] technique for uplift modeling under ROI constraints.
  • Uplift and iROI-based evaluation and plotting functions with bootstrapped confidence intervals. Currently implemented: ATE, ROI, iROI, CATE per category/quantile, CATE lift, Qini/AUUC curves[6], Qini/AUUC score[6], cumulative iROI curves.

For detailed information about the package, read the UpliftML documentation.

Installation

Install the latest release from PyPI:

$ pip install upliftml

Quick Start

from upliftml.models.pyspark import TLearnerEstimator
from upliftml.evaluation import estimate_and_plot_qini
from upliftml.datasets import simulate_randomized_trial
from pyspark.ml.classification import LogisticRegression


# Read/generate the dataset and convert it to Spark if needed
df_pd = simulate_randomized_trial(n=2000, p=6, sigma=1.0, binary_outcome=True)
df_spark = spark.createDataFrame(df_pd)

# Split the data into train, validation, and test sets
df_train, df_val, df_test = df_spark.randomSplit([0.5, 0.25, 0.25])

# Preprocess the datasets (for implementation of get_features_vector, see the full example notebook)
num_features = [col for col in df_spark.columns if col.startswith('feature')]
cat_features = []
df_train_assembled = get_features_vector(df_train, num_features, cat_features)
df_val_assembled = get_features_vector(df_val, num_features, cat_features)
df_test_assembled = get_features_vector(df_test, num_features, cat_features)

# Build a two-model estimator
model = TLearnerEstimator(base_model_class=LogisticRegression,
                          base_model_params={'maxIter': 15},
                          predictors_colname='features',
                          target_colname='outcome',
                          treatment_colname='treatment',
                          treatment_value=1,
                          control_value=0)
model.fit(df_train_assembled, df_val_assembled)

# Apply the model to test data
df_test_eval = model.predict(df_test_assembled)

# Evaluate performance on the test set
qini_values, ax = estimate_and_plot_qini(df_test_eval)

For complete examples with more estimators and evaluation functions, see the demo notebooks in the examples folder.

Contributing

If interested in contributing to the package, get started by reading our contributor guidelines.

License

The project is licensed under Apache 2.0 License

Citation

If you use UpliftML, please cite it as follows:

Irene Teinemaa, Javier Albert, Nam Pham. UpliftML: A Python Package for Scalable Uplift Modeling. https://github.com/bookingcom/upliftml, 2021. Version 0.0.1.

@misc{upliftml,
  author={Irene Teinemaa, Javier Albert, Nam Pham},
  title={{UpliftML}: {A Python Package for Scalable Uplift Modeling}},
  howpublished={https://github.com/bookingcom/upliftml},
  note={Version 0.0.1},
  year={2021}
}

Resources

Documentation:

Tutorials and blog posts:

Related packages:

  • CausalML: a Python package for uplift modeling and causal inference with machine learning
  • EconML: a Python package for estimating heterogeneous treatment effects from observational data via machine learning

References

  1. Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 2019.
  2. Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.
  3. Maciej Jaskowski and Szymon Jaroszewicz. Uplift modeling for clinical trial data. ICML Workshop on Clinical Data Analysis, 2012.
  4. Susan Athey and Guido W. Imbens. Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5), 2015.
  5. Dmitri Goldenberg, Javier Albert, Lucas Bernardi, Pablo Estevez Castillo. Free Lunch! Retrospective Uplift Modeling for Dynamic Promotions Recommendation within ROI Constraints. In Fourteenth ACM Conference on Recommender Systems (pp. 486-491), 2020.
  6. Nicholas J Radcliffe and Patrick D Surry. Real-world uplift modelling with significance based uplift trees. White Paper tr-2011-1, Stochastic Solutions, 2011.
Owner
Booking.com
Open source projects and forks of projects we use internally (for better upstream collaboration)
Booking.com
dirty_cat is a Python module for machine-learning on dirty categorical variables.

dirty_cat dirty_cat is a Python module for machine-learning on dirty categorical variables.

637 Dec 29, 2022
A toolbox to iNNvestigate neural networks' predictions!

iNNvestigate neural networks! Table of contents Introduction Installation Usage and Examples More documentation Contributing Releases Introduction In

Maximilian Alber 1.1k Jan 05, 2023
Management of exclusive GPU access for distributed machine learning workloads

TensorHive is an open source tool for managing computing resources used by multiple users across distributed hosts. It focuses on granting

Paweł Rościszewski 131 Dec 12, 2022
easyNeuron is a simple way to create powerful machine learning models, analyze data and research cutting-edge AI.

easyNeuron is a simple way to create powerful machine learning models, analyze data and research cutting-edge AI.

Neuron AI 5 Jun 18, 2022
Built various Machine Learning algorithms (Logistic Regression, Random Forest, KNN, Gradient Boosting and XGBoost. etc)

Built various Machine Learning algorithms (Logistic Regression, Random Forest, KNN, Gradient Boosting and XGBoost. etc). Structured a custom ensemble model and a neural network. Found a outperformed

Chris Yuan 1 Feb 06, 2022
Climin is a Python package for optimization, heavily biased to machine learning scenarios

climin climin is a Python package for optimization, heavily biased to machine learning scenarios distributed under the BSD 3-clause license. It works

Biomimetic Robotics and Machine Learning at Technische Universität München 177 Sep 02, 2022
Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis.

Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis. It is distributed under the MIT License.

Jeong-Yoon Lee 720 Dec 25, 2022
Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)

FFT-accelerated Interpolation-based t-SNE (FIt-SNE) Introduction t-Stochastic Neighborhood Embedding (t-SNE) is a highly successful method for dimensi

Kluger Lab 547 Dec 21, 2022
A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

2.3k Dec 29, 2022
AutoX是一个高效的自动化机器学习工具,它主要针对于表格类型的数据挖掘竞赛。 它的特点包括: 效果出色、简单易用、通用、自动化、灵活。

English | 简体中文 AutoX是什么? AutoX一个高效的自动化机器学习工具,它主要针对于表格类型的数据挖掘竞赛。 它的特点包括: 效果出色: AutoX在多个kaggle数据集上,效果显著优于其他解决方案(见效果对比)。 简单易用: AutoX的接口和sklearn类似,方便上手使用。

4Paradigm 431 Dec 28, 2022
SPCL 48 Dec 12, 2022
Lightweight Machine Learning Experiment Logging 📖

Simple logging of statistics, model checkpoints, plots and other objects for your Machine Learning Experiments (MLE). Furthermore, the MLELogger comes with smooth multi-seed result aggregation and co

Robert Lange 65 Dec 08, 2022
Distributed scikit-learn meta-estimators in PySpark

sk-dist: Distributed scikit-learn meta-estimators in PySpark What is it? sk-dist is a Python package for machine learning built on top of scikit-learn

Ibotta 282 Dec 09, 2022
Built on python (Mathematical straight fit line coordinates error predictor machine learning foundational model)

Sum-Square_Error-Business-Analytical-Tool- Built on python (Mathematical straight fit line coordinates error predictor machine learning foundational m

om Podey 1 Dec 03, 2021
ThunderGBM: Fast GBDTs and Random Forests on GPUs

Documentations | Installation | Parameters | Python (scikit-learn) interface What's new? ThunderGBM won 2019 Best Paper Award from IEEE Transactions o

Xtra Computing Group 648 Dec 16, 2022
Flask app to predict daily radiation from the time series of Solcast from Islamabad, Pakistan

Solar-radiation-ISB-MLOps - Flask app to predict daily radiation from the time series of Solcast from Islamabad, Pakistan.

Abid Ali Awan 1 Dec 31, 2021
Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

EconML/CausalML KDD 2021 Tutorial 124 Dec 28, 2022
DistML is a Ray extension library to support large-scale distributed ML training on heterogeneous multi-node multi-GPU clusters

DistML is a Ray extension library to support large-scale distributed ML training on heterogeneous multi-node multi-GPU clusters

27 Aug 19, 2022
Predict profitability of trades based on indicator buy / sell signals

Predict profitability of trades based on indicator buy / sell signals Trade profitability analysis for trades based on various indicators signals: MAC

Tomasz Porzycki 1 Dec 15, 2021
GAM timeseries modeling with auto-changepoint detection. Inspired by Facebook Prophet and implemented in PyMC3

pm-prophet Pymc3-based universal time series prediction and decomposition library (inspired by Facebook Prophet). However, while Faceook prophet is a

Luca Giacomel 314 Dec 25, 2022