UpliftML: A Python Package for Scalable Uplift Modeling

Overview

UpliftML: A Python Package for Scalable Uplift Modeling

upliftml

UpliftML is a Python package for scalable unconstrained and constrained uplift modeling from experimental data. To accommodate working with big data, the package uses PySpark and H2O models as base learners for the uplift models. Evaluation functions expect a PySpark dataframe as input.

Uplift modeling is a family of techniques for estimating the Conditional Average Treatment Effect (CATE) from experimental or observational data using machine learning. In particular, we are interested in estimating the causal effect of a treatment T on the outcome Y of an individual characterized by features X. In experimental data with binary treatments and binary outcomes, this is equivalent to estimating Pr(Y=1 | T=1, X=x) - Pr(Y=1 | T=0, X=x).

In many practical use cases the goal is to select which users to target in order to maximize the overall uplift without exceeding a specified budget or ROI constraint. In those cases, estimating uplift alone is not sufficient to make optimal decisions and we need to take into account the costs and monetary benefit incurred by the treatment.

Uplift modeling is an emerging tool for various personalization applications. Example use cases include marketing campaigns personalization and optimization, personalized pricing in e-commerce, and clinical treatment personalization.

The UpliftML library includes PySpark/H2O implementations for the following:

  • 6 metalearner approaches for uplift modeling: T-learner[1], S-learner[1], X-learner[1], R-learner[2], class variable transformation[3], transformed outcome approach[4].
  • The Retrospective Estimation[5] technique for uplift modeling under ROI constraints.
  • Uplift and iROI-based evaluation and plotting functions with bootstrapped confidence intervals. Currently implemented: ATE, ROI, iROI, CATE per category/quantile, CATE lift, Qini/AUUC curves[6], Qini/AUUC score[6], cumulative iROI curves.

For detailed information about the package, read the UpliftML documentation.

Installation

Install the latest release from PyPI:

$ pip install upliftml

Quick Start

from upliftml.models.pyspark import TLearnerEstimator
from upliftml.evaluation import estimate_and_plot_qini
from upliftml.datasets import simulate_randomized_trial
from pyspark.ml.classification import LogisticRegression


# Read/generate the dataset and convert it to Spark if needed
df_pd = simulate_randomized_trial(n=2000, p=6, sigma=1.0, binary_outcome=True)
df_spark = spark.createDataFrame(df_pd)

# Split the data into train, validation, and test sets
df_train, df_val, df_test = df_spark.randomSplit([0.5, 0.25, 0.25])

# Preprocess the datasets (for implementation of get_features_vector, see the full example notebook)
num_features = [col for col in df_spark.columns if col.startswith('feature')]
cat_features = []
df_train_assembled = get_features_vector(df_train, num_features, cat_features)
df_val_assembled = get_features_vector(df_val, num_features, cat_features)
df_test_assembled = get_features_vector(df_test, num_features, cat_features)

# Build a two-model estimator
model = TLearnerEstimator(base_model_class=LogisticRegression,
                          base_model_params={'maxIter': 15},
                          predictors_colname='features',
                          target_colname='outcome',
                          treatment_colname='treatment',
                          treatment_value=1,
                          control_value=0)
model.fit(df_train_assembled, df_val_assembled)

# Apply the model to test data
df_test_eval = model.predict(df_test_assembled)

# Evaluate performance on the test set
qini_values, ax = estimate_and_plot_qini(df_test_eval)

For complete examples with more estimators and evaluation functions, see the demo notebooks in the examples folder.

Contributing

If interested in contributing to the package, get started by reading our contributor guidelines.

License

The project is licensed under Apache 2.0 License

Citation

If you use UpliftML, please cite it as follows:

Irene Teinemaa, Javier Albert, Nam Pham. UpliftML: A Python Package for Scalable Uplift Modeling. https://github.com/bookingcom/upliftml, 2021. Version 0.0.1.

@misc{upliftml,
  author={Irene Teinemaa, Javier Albert, Nam Pham},
  title={{UpliftML}: {A Python Package for Scalable Uplift Modeling}},
  howpublished={https://github.com/bookingcom/upliftml},
  note={Version 0.0.1},
  year={2021}
}

Resources

Documentation:

Tutorials and blog posts:

Related packages:

  • CausalML: a Python package for uplift modeling and causal inference with machine learning
  • EconML: a Python package for estimating heterogeneous treatment effects from observational data via machine learning

References

  1. Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 2019.
  2. Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.
  3. Maciej Jaskowski and Szymon Jaroszewicz. Uplift modeling for clinical trial data. ICML Workshop on Clinical Data Analysis, 2012.
  4. Susan Athey and Guido W. Imbens. Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5), 2015.
  5. Dmitri Goldenberg, Javier Albert, Lucas Bernardi, Pablo Estevez Castillo. Free Lunch! Retrospective Uplift Modeling for Dynamic Promotions Recommendation within ROI Constraints. In Fourteenth ACM Conference on Recommender Systems (pp. 486-491), 2020.
  6. Nicholas J Radcliffe and Patrick D Surry. Real-world uplift modelling with significance based uplift trees. White Paper tr-2011-1, Stochastic Solutions, 2011.
Owner
Booking.com
Open source projects and forks of projects we use internally (for better upstream collaboration)
Booking.com
Tribuo - A Java machine learning library

Tribuo - A Java prediction library (v4.1) Tribuo is a machine learning library in Java that provides multi-class classification, regression, clusterin

Oracle 1.1k Dec 28, 2022
Model factory is a ML training platform to help engineers to build ML models at scale

Model Factory Machine learning today is powering many businesses today, e.g., search engine, e-commerce, news or feed recommendation. Training high qu

16 Sep 23, 2022
A unified framework for machine learning with time series

Welcome to sktime A unified framework for machine learning with time series We provide specialized time series algorithms and scikit-learn compatible

The Alan Turing Institute 6k Jan 06, 2023
EbookMLCB - ebook Machine Learning cơ bản

Mã nguồn cuốn ebook "Machine Learning cơ bản", Vũ Hữu Tiệp. ebook Machine Learning cơ bản pdf-black_white, pdf-color. Mọi hình thức sao chép, in ấn đề

943 Jan 02, 2023
Open MLOps - A Production-focused Open-Source Machine Learning Framework

Open MLOps - A Production-focused Open-Source Machine Learning Framework Open MLOps is a set of open-source tools carefully chosen to ease user experi

Data Revenue 590 Dec 28, 2022
A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.

pmdarima Pmdarima (originally pyramid-arima, for the anagram of 'py' + 'arima') is a statistical library designed to fill the void in Python's time se

alkaline-ml 1.3k Jan 06, 2023
MLOps pipeline project using Amazon SageMaker Pipelines

This project shows steps to build an end to end MLOps architecture that covers data prep, model training, realtime and batch inference, build model registry, track lineage of artifacts and model drif

AWS Samples 3 Sep 16, 2022
AP1 Transcription Factor Binding Site Prediction

A machine learning project that predicted binding sites of AP1 transcription factor, using ChIP-Seq data and local DNA shape information.

1 Jan 21, 2022
Little Ball of Fur - A graph sampling extension library for NetworKit and NetworkX (CIKM 2020)

Little Ball of Fur is a graph sampling extension library for Python. Please look at the Documentation, relevant Paper, Promo video and External Resour

Benedek Rozemberczki 619 Dec 14, 2022
The unified machine learning framework, enabling framework-agnostic functions, layers and libraries.

The unified machine learning framework, enabling framework-agnostic functions, layers and libraries. Contents Overview In a Nutshell Where Next? Overv

Ivy 8.2k Dec 31, 2022
monolish: MONOlithic Liner equation Solvers for Highly-parallel architecture

monolish is a linear equation solver library that monolithically fuses variable data type, matrix structures, matrix data format, vendor specific data transfer APIs, and vendor specific numerical alg

RICOS Co. Ltd. 179 Dec 21, 2022
Turns your machine learning code into microservices with web API, interactive GUI, and more.

Turns your machine learning code into microservices with web API, interactive GUI, and more.

Machine Learning Tooling 2.8k Jan 02, 2023
STUMPY is a powerful and scalable Python library for computing a Matrix Profile, which can be used for a variety of time series data mining tasks

STUMPY STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of tim

TD Ameritrade 2.5k Jan 06, 2023
List of Data Science Cheatsheets to rule the world

Data Science Cheatsheets List of Data Science Cheatsheets to rule the world. Table of Contents Business Science Business Science Problem Framework Dat

Favio André Vázquez 11.7k Dec 30, 2022
Data from "Datamodels: Predicting Predictions with Training Data"

Data from "Datamodels: Predicting Predictions with Training Data" Here we provid

Madry Lab 51 Dec 09, 2022
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

TensorFlowOnSpark TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the T

Yahoo 3.8k Jan 04, 2023
Unofficial pytorch implementation of the paper "Context Reasoning Attention Network for Image Super-Resolution (ICCV 2021)"

CRAN Unofficial pytorch implementation of the paper "Context Reasoning Attention Network for Image Super-Resolution (ICCV 2021)" This code doesn't exa

4 Nov 11, 2021
Neural Machine Translation (NMT) tutorial with OpenNMT-py

Neural Machine Translation (NMT) tutorial with OpenNMT-py. Data preprocessing, model training, evaluation, and deployment.

Yasmin Moslem 29 Jan 09, 2023
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

imbalanced-learn imbalanced-learn is a python package offering a number of re-sampling techniques commonly used in datasets showing strong between-cla

6.2k Jan 01, 2023
Gaussian Process Optimization using GPy

End of maintenance for GPyOpt Dear GPyOpt community! We would like to acknowledge the obvious. The core team of GPyOpt has moved on, and over the past

Sheffield Machine Learning Software 847 Dec 19, 2022