Improving XGBoost survival analysis with embeddings and debiased estimators

Overview

xgbse: XGBoost Survival Embeddings

"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown." - Leo Breiman, Statistical Modeling: The Two Cultures

Survival analysis is a powerful statistical technique with a wide range of applications, such as predictive maintenance, customer churn, credit risk, and asset liquidity risk.

However, it has not yet seen widespread adoption in industry, with most implementations embracing one of two cultures:

  1. models with sound statistical properties, but lacking in expressiveness and computational efficiency
  2. highly efficient and expressive models, but lacking in statistical rigor

xgbse aims to unite the two cultures in a single package, adding a layer of statistical rigor to the highly expressive and computationally efficient xgboost survival analysis implementation.

The package offers:

  • calibrated and unbiased survival curves with confidence intervals (instead of point predictions)
  • great predictive power, competitive with vanilla xgboost
  • efficient, easy to use implementation
  • explainability through prototypes

This is a research project by the Loft Data Science Team, but we invite the community to contribute. Please help by trying it out, reporting bugs, and letting us know what you think!

Installation

pip install xgbse

Usage

Basic usage

The package follows the scikit-learn API, with a minor adaptation to work with time and event data (y is a numpy structured array of times and events). .predict() returns a dataframe where each column is a time window and the values are the predicted probability of surviving up to and including that time window.

# importing dataset from pycox package
from pycox.datasets import metabric

# importing model and utils from xgbse
from xgbse import XGBSEKaplanNeighbors
from xgbse.converters import convert_to_structured

# getting data
df = metabric.read_df()

# splitting to X, y format
X = df.drop(['duration', 'event'], axis=1)
y = convert_to_structured(df['duration'], df['event'])

# fitting xgbse model
xgbse_model = XGBSEKaplanNeighbors(n_neighbors=50)
xgbse_model.fit(X, y)

# predicting
survival_probs = xgbse_model.predict(X)
survival_probs.head()
index 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 255 270 285
0 0.98 0.87 0.81 0.74 0.71 0.66 0.53 0.47 0.42 0.4 0.3 0.25 0.21 0.16 0.12 0.098 0.085 0.062 0.054
1 0.99 0.89 0.79 0.7 0.64 0.58 0.53 0.46 0.42 0.39 0.33 0.31 0.3 0.24 0.21 0.18 0.16 0.11 0.095
2 0.94 0.78 0.63 0.57 0.54 0.49 0.44 0.37 0.34 0.32 0.26 0.23 0.21 0.16 0.13 0.11 0.098 0.072 0.062
3 0.99 0.95 0.93 0.88 0.84 0.81 0.73 0.67 0.57 0.52 0.45 0.37 0.33 0.28 0.23 0.19 0.16 0.12 0.1
4 0.98 0.92 0.82 0.77 0.72 0.68 0.63 0.6 0.57 0.55 0.51 0.48 0.45 0.42 0.38 0.33 0.3 0.22 0.2
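
Each row of the returned dataframe is a full survival curve over the time windows, so it can be plotted directly. A minimal sketch (assuming matplotlib is available):

# plotting the predicted survival curve of the first sample
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4), dpi=120)
plt.plot(survival_probs.columns, survival_probs.iloc[0])
plt.xlabel('time window')
plt.ylabel('predicted survival probability')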

You can also get interval predictions (the probability of the event occurring within each specific time window) using return_interval_probs:

# interval predictions
interval_probs = xgbse_model.predict(X, return_interval_probs=True)
interval_probs.head()
index 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 255 270 285
0 0.024 0.1 0.058 0.07 0.034 0.05 0.12 0.061 0.049 0.026 0.096 0.049 0.039 0.056 0.04 0.02 0.013 0.023 0.0078
1 0.014 0.097 0.098 0.093 0.052 0.065 0.054 0.068 0.034 0.038 0.053 0.019 0.018 0.052 0.038 0.027 0.018 0.05 0.015
2 0.06 0.16 0.15 0.054 0.033 0.053 0.046 0.073 0.032 0.014 0.06 0.03 0.017 0.055 0.031 0.016 0.014 0.027 0.0097
3 0.011 0.034 0.021 0.053 0.038 0.038 0.08 0.052 0.1 0.049 0.075 0.079 0.037 0.052 0.053 0.041 0.026 0.04 0.017
4 0.016 0.067 0.099 0.046 0.05 0.042 0.051 0.028 0.03 0.018 0.048 0.022 0.029 0.038 0.035 0.047 0.031 0.08 0.027

Survival curves and confidence intervals

XGBSEKaplanTree and XGBSEKaplanNeighbors support estimation of survival curves and confidence intervals via the Exponential Greenwood formula out of the box, through the return_ci argument:

# fitting xgbse model
xgbse_model = XGBSEKaplanNeighbors(n_neighbors=50)
xgbse_model.fit(X_train, y_train, time_bins=TIME_BINS)

# predicting
mean, upper_ci, lower_ci = xgbse_model.predict(X_valid, return_ci=True)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

XGBSEDebiasedBCE does not support estimation of confidence intervals out of the box, but we provide XGBSEBootstrapEstimator to obtain non-parametric confidence intervals. Since the stacked logistic regressions are trained with more samples (compared to the neighbor sets in XGBSEKaplanNeighbors), the confidence intervals are more concentrated:

# base model as BCE
base_model = XGBSEDebiasedBCE(PARAMS_XGB_AFT, PARAMS_LR)

# bootstrap meta estimator
bootstrap_estimator = XGBSEBootstrapEstimator(base_model, n_estimators=20)

# fitting the meta estimator
bootstrap_estimator.fit(
    X_train,
    y_train,
    validation_data=(X_valid, y_valid),
    early_stopping_rounds=10,
    time_bins=TIME_BINS,
)

# predicting
mean, upper_ci, lower_ci = bootstrap_estimator.predict(X_valid, return_ci=True)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

The bootstrap abstraction can be used for XGBSEKaplanTree and XGBSEKaplanNeighbors as well; however, the confidence interval will then be estimated via bootstrap only (not via the Exponential Greenwood formula):

# base model
base_model = XGBSEKaplanTree(PARAMS_TREE)

# bootstrap meta estimator
bootstrap_estimator = XGBSEBootstrapEstimator(base_model, n_estimators=100)

# fitting the meta estimator
bootstrap_estimator.fit(
    X_train,
    y_train,
    time_bins=TIME_BINS,
)

# predicting
mean, upper_ci, lower_ci = bootstrap_estimator.predict(X_valid, return_ci=True)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

With a sufficiently large n_estimators, interval widths shouldn't differ much, with the added benefit of model stability and improved accuracy. Additionally, XGBSEBootstrapEstimator allows building confidence intervals for interval probabilities (which the Exponential Greenwood formula does not support):

# predicting
mean, upper_ci, lower_ci = bootstrap_estimator.predict(
    X_valid,
    return_ci=True,
    return_interval_probs=True
)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

The parameter ci_width controls the width of the confidence interval. For XGBSEKaplanTree it should be passed at .fit(), as KM curves are pre-calculated for each leaf at fit time to avoid storing training data.

# fitting xgbse model
xgbse_model = XGBSEKaplanTree(PARAMS_TREE)
xgbse_model.fit(X_train, y_train, time_bins=TIME_BINS, ci_width=0.99)

# predicting
mean, upper_ci, lower_ci = xgbse_model.predict(X_valid, return_ci=True)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

For other models (XGBSEKaplanNeighbors and XGBSEBootstrapEstimator) it should be passed at .predict().

# base model
model = XGBSEKaplanNeighbors(PARAMS_XGB_AFT, N_NEIGHBORS)

# fitting the meta estimator
model.fit(
    X_train, y_train,
    validation_data = (X_valid, y_valid),
    early_stopping_rounds=10,
    time_bins=TIME_BINS
)

# predicting
mean, upper_ci, lower_ci = model.predict(X_valid, return_ci=True, ci_width=0.99)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

Extrapolation

We provide an extrapolation interface to deal with cases where the survival curve does not reach 0 due to censoring. Currently, the implementation only supports extrapolation under a constant-risk assumption.

from xgbse.extrapolation import extrapolate_constant_risk

# predicting
survival = bootstrap_estimator.predict(X_valid)

# extrapolating
survival_ext = extrapolate_constant_risk(survival, 450, 11)
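
To sanity-check the extrapolation, we can overlay the original and extrapolated curves for a single sample. A minimal sketch, assuming the extrapolated curve is returned as a dataframe over the extended time windows (matching the .predict() output format):

# comparing predicted and extrapolated mean curves for the first sample
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4), dpi=120)
plt.plot(survival.columns, survival.iloc[0], label='predicted')
plt.plot(survival_ext.columns, survival_ext.iloc[0], linestyle='--', label='extrapolated')
plt.legend()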

Early stopping

A simple interface to xgboost early stopping is provided.

from sklearn.model_selection import train_test_split

# splitting between train and validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fitting with early stopping
xgb_model = XGBSEDebiasedBCE()
xgb_model.fit(
    X_train,
    y_train,
    validation_data=(X_valid, y_valid),
    early_stopping_rounds=10,
    verbose_eval=50
)
[0]	validation-aft-nloglik:16.86713
Will train until validation-aft-nloglik hasn't improved in 10 rounds.
[50]	validation-aft-nloglik:3.64540
[100]	validation-aft-nloglik:3.53679
[150]	validation-aft-nloglik:3.53207
Stopping. Best iteration:
[174]	validation-aft-nloglik:3.53004

Explainability through prototypes

xgbse also provides explainability through prototypes, by searching the embedding for neighbors. The idea is to explain model predictions with real samples, providing solid ground to justify them (see [8]). The method .get_neighbors() searches index_data for the n_neighbors nearest neighbors of each sample in query_data:

neighbors = xgb_model.get_neighbors(
    query_data=X_valid,
    index_data=X_train,
    n_neighbors=10
)
neighbors.head(5)
index neighbor_1 neighbor_2 neighbor_3 neighbor_4 neighbor_5 neighbor_6 neighbor_7 neighbor_8 neighbor_9 neighbor_10
1225 1151 1513 1200 146 215 452 1284 1127 1895 257
111 1897 1090 1743 1224 892 1695 1624 1546 1418 4
554 9 627 1257 1460 1031 1575 1557 440 1236 858
526 726 1042 177 1640 242 1529 234 1800 399 1431
1313 205 1738 599 954 1694 1715 1651 828 541 992

This way, we can inspect the neighbors of a given sample to help explain its prediction. For instance, as a sanity check, we can choose a reference sample and verify that its neighbors are indeed very similar to it:

import pandas as pd

i = 0

reference = X_valid.iloc[i]
reference.name = 'reference'
train_neighs = X_train.loc[neighbors.iloc[i]]

pd.concat([reference.to_frame().T, train_neighs])
index x0 x1 x2 x3 x4 x5 x6 x7 x8
reference 5.7 5.7 11 5.6 1 1 0 1 86
1151 5.8 5.9 11 5.5 1 1 0 1 82
1513 5.5 5.5 11 5.6 1 1 0 1 79
1200 5.7 6 11 5.6 1 1 0 1 76
146 5.9 5.9 11 5.5 0 1 0 1 75
215 5.8 5.5 11 5.4 1 1 0 1 78
452 5.7 5.7 12 5.5 0 0 0 1 76
1284 5.6 6.2 11 5.6 1 0 0 1 79
1127 5.5 5.1 11 5.5 1 1 0 1 86
1895 5.5 5.4 10 5.5 1 1 0 1 85
257 5.7 6 9.6 5.6 1 1 0 1 76

We can also compare the Kaplan-Meier curve estimated from the neighbors to the actual model prediction, checking that it lies inside the confidence interval:

import numpy as np
import matplotlib.pyplot as plt

from xgbse.non_parametric import calculate_kaplan_vectorized

mean, high, low = calculate_kaplan_vectorized(
    np.array([y['c2'][neighbors.iloc[i]]]),
    np.array([y['c1'][neighbors.iloc[i]]]),
    TIME_BINS
)

model_surv = xgb_model.predict(X_valid)

plt.figure(figsize=(12,4), dpi=120)
plt.plot(model_surv.columns, model_surv.iloc[i])
plt.plot(mean.columns, mean.iloc[0])
plt.fill_between(mean.columns, low.iloc[0], high.iloc[0], alpha=0.1, color='red')

Specifically, for XGBSEKaplanNeighbors, prototype predictions and model predictions should match exactly if n_neighbors is the same and query_data equals the training data.
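
A minimal sketch of that check, reusing the calls from the snippets above (assumes X_train, y, y_train and TIME_BINS are defined as in earlier examples; exact agreement also assumes get_neighbors uses the same embedding and neighbor search as .predict()):

import numpy as np
from xgbse import XGBSEKaplanNeighbors
from xgbse.non_parametric import calculate_kaplan_vectorized

# fitting a KaplanNeighbors model
knn_model = XGBSEKaplanNeighbors(n_neighbors=30)
knn_model.fit(X_train, y_train, time_bins=TIME_BINS)

# neighbors of the training data within the training data itself
train_neighbors = knn_model.get_neighbors(
    query_data=X_train,
    index_data=X_train,
    n_neighbors=30
)

# Kaplan-Meier curve built from the prototypes of the first training sample
i = 0
proto_surv, _, _ = calculate_kaplan_vectorized(
    np.array([y['c2'][train_neighbors.iloc[i]]]),
    np.array([y['c1'][train_neighbors.iloc[i]]]),
    TIME_BINS,
)

# model prediction for the same sample; the two curves should coincide
model_surv = knn_model.predict(X_train)
print(np.allclose(proto_surv.values[0], model_surv.iloc[i].values))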

Metrics

We made our own metrics submodule to keep the library self-contained. xgbse.metrics implements the C-index, an approximate Brier Score, and D-Calibration from [9], including adaptations to deal with censoring:

# training model
xgbse_model = XGBSEKaplanNeighbors(PARAMS_XGB_AFT, n_neighbors=30)

xgbse_model.fit(
    X_train, y_train,
    validation_data = (X_valid, y_valid),
    early_stopping_rounds=10,
    time_bins=TIME_BINS
)

# predicting
preds = xgbse_model.predict(X_valid)

# importing metrics
from xgbse.metrics import (
    concordance_index,
    approx_brier_score,
    dist_calibration_score
)

# running metrics
print(f'C-index: {concordance_index(y_valid, preds)}')
print(f'Avg. Brier Score: {approx_brier_score(y_valid, preds)}')
print(f"""D-Calibration: {dist_calibration_score(y_valid, preds) > 0.05}""")
C-index: 0.6495863029409356
Avg. Brier Score: 0.1704190044350422
D-Calibration: True

As the metrics follow the score_func(y, y_pred, **kwargs) pattern, we can easily use the scikit-learn model selection module:

from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

xgbse_model = XGBSEKaplanTree(PARAMS_TREE)
results = cross_val_score(xgbse_model, X, y, scoring=make_scorer(approx_brier_score))
results
array([0.17432953, 0.15907712, 0.13783666, 0.16770409, 0.16792016])

If you want to dive deeper, please check our docs and examples!

References

[1] Practical Lessons from Predicting Clicks on Ads at Facebook: paper that shows how stacking boosting models with logistic regression improves performance and calibration

[2] Feature transformations with ensembles of trees: scikit-learn post showing tree ensembles as feature transformers

[3] Calibration of probabilities for tree-based models: blog post showing a practical example of tree ensemble probability calibration with a logistic regression

[4] Supervised dimensionality reduction and clustering at scale with RFs with UMAP: blog post showing how forests of decision trees act as noise filters, reducing the intrinsic dimension of the dataset

[5] Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors: inspiration for the BCE method (multi-task logistic regression)

[6] The Brier Score under Administrative Censoring: Problems and Solutions: reference for the BCE (binary cross-entropy) survival method

[7] The Greenwood and Exponential Greenwood Confidence Intervals in Survival Analysis: reference we used for the Exponential Greenwood formula for KM confidence intervals

[8] Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable: paper showing a very similar method for extracting prototypes

[9] Effective Ways to Build and Evaluate Individual Survival Distributions: paper showing how to validate survival analysis models with different metrics

Citing xgbse

To cite this repository:

@software{xgbse2020github,
  author = {Davi Vieira and Gabriel Gimenez and Guilherme Marmerola and Vitor Estima},
  title = {XGBoost Survival Embeddings: improving statistical properties of XGBoost survival analysis implementation},
  url = {http://github.com/loft-br/xgboost-survival-embeddings},
  version = {0.2.0},
  year = {2020},
}
Comments
  • xgboost 1.4.0+: ValueError: If using all scalar values, you must pass an index

    Using xgboost 1.4.0 or 1.4.1, we are now getting an error: ValueError: If using all scalar values, you must pass an index

    No error with 1.3.3

    With all releases after 1.3.3, we're receiving a ValueError upon the XGBSEBootstrapEstimator.fit() call. Tested on Python 3.7.2 and 3.8.6.

    Trace:

    C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\xgboost\core.py:101: UserWarning: ntree_limit is deprecated, use iteration_range or model slicing instead.
      warnings.warn(
    Traceback (most recent call last):
      File "F:/git/gqc/pipe_breaks/script_runner.py", line 111, in <module>
    	main()
      File "F:/git/gqc/pipe_breaks/script_runner.py", line 100, in main
    	script_module.main()
      File "F:\git\gqc\pipe_breaks\algorithms\xgbse_gqc.py", line 271, in main
    	do_extrapolation(X=X, X_valid=X_valid, X_train=X_train, y_train=y_train, main_ids=id_column)
      File "F:\git\gqc\pipe_breaks\algorithms\xgbse_gqc.py", line 122, in do_extrapolation
    	bootstrap_estimator, mean, upper_ci, lower_ci = fit_predict_bootstrap_est(
      File "F:\git\gqc\pipe_breaks\algorithms\xgbse_gqc.py", line 208, in fit_predict_bootstrap_est
    	bootstrap_estimator.fit(
      File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\xgbse\_meta.py", line 57, in fit
    	trained_model = self.base_estimator.fit(X_sample, y_sample, **kwargs)
      File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\xgbse\_kaplan_neighbors.py", line 407, in fit
    	pd.DataFrame({"leaf": leaves})
      File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\frame.py", line 467, in __init__
    	mgr = init_dict(data, index, columns, dtype=dtype)
      File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\internals\construction.py", line 283, in init_dict
    	return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
      File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\internals\construction.py", line 78, in arrays_to_mgr
    	index = extract_index(arrays)
      File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\internals\construction.py", line 387, in extract_index
    	raise ValueError("If using all scalar values, you must pass an index")
    ValueError: If using all scalar values, you must pass an index
    

    Throwing code block:

    def fit_predict_bootstrap_est(base_model, n_estimators, X_train, y_train, X_valid):
        """Instantiate, fit, and predict a bootstrap_estimator."""
        bootstrap_estimator = XGBSEBootstrapEstimator(base_model, n_estimators=n_estimators)
        bootstrap_estimator.fit(
            X_train,
            y_train,
            time_bins=TIME_BINS,
        )
    

    I'm unable to share specific data of the train structures, but their types and shapes follow: X_train = DataFrame: (2916, 11) y_train = ndarray: (4916,) TIME_BINS = np.arange(5, 540, 5)

    Requirements: astor==0.8.1 autograd==1.3 autograd-gamma==0.5.0 backcall==0.2.0 colorama==0.4.4 cycler==0.10.0 decorator==5.0.7 ecos==2.0.7.post1 formulaic==0.2.3 future==0.18.2 interface-meta==1.2.3 ipykernel==5.5.3 ipython==7.22.0 ipython-genutils==0.2.0 jedi==0.18.0 joblib==1.0.1 jupyter-client==6.1.12 jupyter-core==4.7.1 kiwisolver==1.3.1 lifelines==0.25.11 matplotlib==3.3.0 numexpr==2.7.3 numpy==1.20.2 osqp==0.6.2.post0 pandas==1.1.0 parso==0.8.2 pickleshare==0.7.5 Pillow==8.2.0 prompt-toolkit==3.0.18 Pygments==2.8.1 pyparsing==2.4.7 python-dateutil==2.8.1 pytz==2021.1 pywin32==300 pyzmq==22.0.3 qdldl==0.1.5.post0 scikit-learn==0.24.1 scikit-survival==0.15.0.post0 scipy==1.6.2 six==1.15.0 threadpoolctl==2.1.0 toml==0.10.2 tornado==6.1 traitlets==5.0.5 wcwidth==0.2.5 wrapt==1.12.1 xgboost==1.3.3 # xgboost==1.4.1 xgbse==0.2.1

    bug 
    opened by jacobgqc 7
  • XGBEmbedKaplanNeighbors appears to now be XGBSEKaplanNeighbors

    Hello - thanks for this amazing library. It's truly a life-saver after looking around for a modern survival analysis library which could handle both large data, as well provide the full survival curve for every row. You've built something special here.

    In exploring, I discovered that the XGBEmbedKaplanNeighbors import in the documentation should now be a XGBSEKaplanNeighbors import, if I'm reading things correctly. I don't know if any other class names have changed, but wanted to flag to help get more people using this library.

    (Unrelatedly, I was unable to install this package in Colab due to a claimed dependency on Python 3.6.9. However, it worked fine on my local machine on 3.8).

    Thanks again for the great work. I look forward to exploring the functionality more.

    documentation 
    opened by kmedved 5
  • Competing risks/events

    If I have competing events and I treat the occurrence of a competing event as a censored observation (informative/dependent censoring), will this model still give me an unbiased estimate of survival? In my use case the competing event precludes the event of interest, so the two events are completely mutually exclusive: I'm interested in event A, but if event B happens, then event A can never happen.

    Generally, I gather that competing events can work within a Cox framework, but a Kaplan-Meier approach will give a biased estimate of incidence, and XGBSE partially relies on Kaplan-Meier.

    opened by onacrame 4
  • XGBSEDebiasedBCE n_jobs parameter

    I'm curious to understand how the XGBSEDebiasedBCE (and perhaps other modules') n_jobs parameter works. I see it listed here: https://loft-br.github.io/xgboost-survival-embeddings/bce.html as a separate parameter from the xgboost parameters themselves (which have their own n_jobs parameter).

    Does this parameter impact the logistic regression instead of xgboost? In other words, what is being multiprocessed here?

    Perhaps relatedly, I've noticed that the xgboost part of XGBSEDebiasedBCE will typically fit quite quickly on my data (~20 seconds if using GPU and stopping after about 150 rounds), but will then pause for 30-40 seconds before the fitting is complete. Is this due to the logistic regression that XGBSEDebiasedBCE is performing after fitting with xgboost and is there any good way to speed this up?

    documentation 
    opened by kmedved 4
  • XGBoost Nrounds Parameter

    Thanks again for the wonderful package. I'm wondering what the best way is to set the nrounds parameter in the underlying XGBoost model. I have tried nrounds, n_estimators, and num_boost_round, and got the following warning each time:

    [08:18:45] WARNING: C:\Users\Administrator\workspace\xgboost-win64_release_1.2.0\src\learner.cc:516: 
    Parameters: { num_boost_round } might not be used.
    
      This may not be accurate due to some parameters are only used in language bindings but
      passed down to XGBoost core.  Or some parameters are not used but slip through this
      verification. Please open an issue if you find above cases.
    

    What's the correct interface for this?

    documentation 
    opened by kmedved 4
  • D-Calibration Maximum Deviation

    Status

    READY

    Background context

    Update the D-Calibration maximum deviation to take into account the number of bins used.

    Description of the changes proposed in the pull request

    Replace 0.1 with 1/n_bins.

    opened by Jeanselme 3
  • pip install xgbse

    Code sample

    pip install xgbse

    Problem description

    ERROR: Could not find a version that satisfies the requirement xgbse (from versions: none)
    ERROR: No matching distribution found for xgbse

    bug 
    opened by alextsui86 3
  • Multi-states and competing risk modelling

    Hello guys, first of all congratulations for your work :)

    Currently I'm facing a multi-state modeling problem with multiple absorbing states, and while searching for techniques I found this module. As far as I could tell from the docs, there is no mention of multi-state and competing-risks modelling; is my understanding correct?

    If so, do you guys intend to extend the use cases of this module in that direction?

    Thanks!

    enhancement 
    opened by matdeof 2
  • Using best round when applying early stopping

    Hi and thanks for a really interesting library!

    I have two small questions.

    First of all, I was wondering about the reasoning behind your choice of XGBoost over LightGBM. I am just curious whether XGB has any inherent advantages for survival analysis compared to LGB.

    Then on to my main question, related to early stopping. As far as I can see, when using early stopping, you are not using the best iteration for predictions, but rather the last iteration. Or, in code: instead of

    y_pred = bst.predict(dtrain)
    

    it should be:

    y_pred = bst.predict(dtrain, ntree_limit=bst.best_ntree_limit)
    

    Please correct me if I am mistaken since I have not used your library extensively yet :)

    Cheers,

    bug 
    opened by ChristianMichelsen 2
  • Underlying xgboost model for calculating feature importances

    Xgboost models can output permutation or gain feature importances. Additionally, through packages such as Shap, it's possible to get Shapley values for Xgboost models, which can be highly informative about what features are contributing to survival.

    It would be helpful to allow the user to access the underlying Xgboost regressor object in order to calculate/view these feature importances. I don't know if this is already possible in some way, but if not, exposing this to the user would be helpful.
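
    A hedged sketch of one possible workaround (not a documented API): a traceback elsewhere on this page suggests the fitted booster is stored on the bst attribute of xgbse models, which could then be passed to xgboost's importance methods or to SHAP:

    # assumption: the fitted xgboost Booster is exposed as `bst` on the xgbse model
    booster = xgbse_model.bst

    # gain-based feature importances straight from the booster
    print(booster.get_score(importance_type='gain'))

    # SHAP values computed from the booster (TreeExplainer accepts a Booster)
    import shap
    explainer = shap.TreeExplainer(booster)
    shap_values = explainer.shap_values(X_valid)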

    enhancement 
    opened by kmedved 2
  • TypeError: predict() got an unexpected keyword argument 'iteration_range'

    Code sample

    # Your code here
    
    # importing dataset from pycox package
    from pycox.datasets import metabric
    
    # importing model and utils from xgbse
    from xgbse import XGBSEKaplanNeighbors
    from xgbse.converters import convert_to_structured
    
    # getting data
    df = metabric.read_df()
    
    # splitting to X, y format
    X = df.drop(['duration', 'event'], axis=1)
    y = convert_to_structured(df['duration'], df['event'])
    
    # fitting xgbse model
    xgbse_model = XGBSEKaplanNeighbors(n_neighbors=50)
    xgbse_model.fit(X, y)
    
    

    Problem description

    I tried to run the example code above and I encountered the following error:


    TypeError                                 Traceback (most recent call last)
    <ipython-input> in <module>
         15 # fitting xgbse model
         16 xgbse_model = XGBSEKaplanNeighbors(n_neighbors=50)
    ---> 17 xgbse_model.fit(X, y)
         18
         19 # # predicting

    ~/opt/anaconda3/lib/python3.8/site-packages/xgbse/_kaplan_neighbors.py in fit(self, X, y, num_boost_round, validation_data, early_stopping_rounds, verbose_eval, persist_train, index_id, time_bins)
        176
        177     # creating nearest neighbor index
    --> 178     leaves = self.bst.predict(
        179         dtrain, pred_leaf=True, iteration_range=(0, self.bst.best_iteration + 1)
        180     )

    TypeError: predict() got an unexpected keyword argument 'iteration_range'

    opened by emilyybo 1
  • Category features support

    Only a DataFrame object is supported by the "fit" method:

    xgbse_model = XGBSEKaplanTree(PARAMS_TREE)
    # X = xgb.DMatrix(X, enable_categorical=True)
    xgbse_model.fit(X, y)
    

    but enable_categorical is not set to True in the source code of xgbse, so it gives the error: "ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, DMatrix parameter enable_categorical must be set to True"

    def build_xgb_cox_dmatrix(X, T, E):
        """Builds a XGB DMatrix using specified Data Frame of features (X)
            arrays of times (T) and censors/events (E).
    
        Args:
            X ([pd.DataFrame, np.array]): Data Frame to be converted to XGBDMatrix format.
            T ([np.array, pd.Series]): Array of times.
            E ([np.array, pd.Series]): Array of censors(False) / events(True).
    
        Returns:
            (DMatrix): A XGB DMatrix is returned including features and target.
        """
    
        target = np.where(E, T, -T)
    
        return xgb.DMatrix(X, label=target)
    

    The last line here does not set enable_categorical=True. Are categorical features not supported, or do I just need to change the code here myself?

    enhancement 
    opened by UTimeStrange 0
  • feature selection

    Hi, I have read the documentation about xgbse. However, I didn't find guidelines on how to use xgbse for feature selection. Could you provide the related documentation or method?

    documentation 
    opened by yangwei1993 0
  • XGBSE error with sklearn pipeline and GridSearchCV

    Code sample

    # Your code here
    
    import pandas as pd
    import numpy as np
    import sklearn.model_selection
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import mutual_info_classif
    from xgbse import XGBSEKaplanNeighbors
    from sklearn.preprocessing import RobustScaler
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sksurv.metrics import concordance_index_censored
    import itertools
    import time
    
    [Radiomics_Clinical_GMMC_2022_09_22-CT_der_PET-mix.csv](https://github.com/loft-br/xgboost-survival-embeddings/files/9636594/Radiomics_Clinical_GMMC_2022_09_22-CT_der_PET-mix.csv)
    
    csv_path = "C:/Radiomics_Clinical_GMMC_2022_09_22-CT_der_PET-mix.csv"
    
    df = pd.read_csv(csv_path)
    df = df.set_index('cohort', drop=True)
    df.index.rename('index', inplace=True)
    df.index = df.index.astype(int)
    
    X = df.drop(columns=['PFS_binaer_Progress', 'Ereignis_korrigiert_Update_2021_03', 'DFS_M_ED_Update_2021_03',
                         'Pseudonym', 'train_test_mix', 'SUVmax', 'SEX_SPSS', 
                         'DIAGECOG_komplett_ueber_1', 'DIAALTER', 'PTNM_T_SPSS_korr_grob_7th',
                         'PTNM_N_SPSS_korr', 'STADIUM_GROB_SPSS_7thEdition',
                         'R_Status', 'PTNM_T_SPSS_korr_7th', 'STADIUM_SPSS_7thEdition',
                         'Histo_Subtyp', 'NEOADJ_CHEMO', 'NEOADJ_BESTR', 'ADJ_CHEMO', 'ADJ_BESTR',
                         'ANY_CHEMO', 'ANY_BESTR', 'ASP_high_19_5', 'ASP', 'ASP_high_33_3'])
    
    y_df = pd.DataFrame()
    y_df['event'] = df['Ereignis_korrigiert_Update_2021_03'].astype(bool)
    y_df['time'] = df['DFS_M_ED_Update_2021_03']
    
    # split dataframe into training+test and validation cohort
    
    X_train_test = X.iloc[X.index.isin([1])]
    X_valid = X.iloc[X.index.isin([2])]
    
    y_train_test_df = y_df.iloc[y_df.index.isin([1])]
    y_valid_df = y_df.iloc[y_df.index.isin([2])]
    
    s = y_df.dtypes
    y_train_test = np.array([tuple(x) for x in y_train_test_df.values], dtype=list(zip(s.index, s)))
    y_valid = np.array([tuple(x) for x in y_valid_df.values], dtype=list(zip(s.index, s)))
    
    
    def score_survival_model(model, X, y):
        prediction = model.predict(X)
        result = concordance_index_censored(y['event'], y['time'], prediction)
        return result[0]
    
    
    feature_select_dict = {
                            "MIC" : SelectKBest(mutual_info_classif, k=30),                        
                            }
    
    
    p_grid_dict = {"xgbse" : {"estimator__objective": ["survival:aft"],
                                 "estimator__eval_metric": ["aft-nloglik"],
                                 "estimator__aft_loss_distribution": ["normal", "logistic"],
                                 "estimator__aft_loss_distribution_scale": [0.5, 1.0, 1.5],
                                 "estimator__tree_method": ["hist"],
                                 "estimator__learning_rate": np.logspace(-2, 0, num=6),
                                 "estimator__max_depth": [0, 1, 5, 10, 15, 25, 50],
                                 "estimator__booster": ["dart"],
                                 "estimator__subsample": [0.1, 0.2, 0.4, 0.6, 0.8, 1.0],
                                 "estimator__min_child_weight": [1, 2, 5, 10],
                                 "estimator__colsample_bynode": [0.5, 1.0, 2.0]}
                            }
    
    
    models_dict = {
                    "xgbse" : XGBSEKaplanNeighbors(xgb_params=p_grid_dict['xgbse'])
                    }
    
    
    inner_cv = sklearn.model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
    outer_cv = sklearn.model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
    
    
    
    def model_scores(feature_select_dict, models_dict, p_grid_dict, X_train, y_train, X_valid, y_valid):
        
        # define the scaler and prepare empty dict or dataframe to assemble results and best parameters
        
        scaler = RobustScaler(with_centering = False)
        models_df_dict = dict()
        params_df = pd.DataFrame()    
        
        for outerKey in feature_select_dict:
             
            models = pd.DataFrame()  
            feature_select = feature_select_dict[outerKey]
    
            for innerKey in models_dict:
      
                # instantiate model
            
                model = models_dict[innerKey]
                p_grid = p_grid_dict[innerKey]
            
            
                # inner loop of nested cross-validation: perform hyperparameter tuning in the training set of the outer loop
            
                t1 = time.time()
                
                pipeline = Pipeline([('scaling', scaler), ('feature_selection', feature_select), ('estimator', model)])
                
                clf_model = sklearn.model_selection.GridSearchCV(estimator=pipeline, 
                                                                 scoring=score_survival_model,
                                                                 param_grid=p_grid, 
                                                                 cv=inner_cv, refit=True) 
                
                            
                # outer loop: train the model with training data and score the perfomance on test sets
                nested_test_score = sklearn.model_selection.cross_val_score(clf_model, scoring=score_survival_model,
                                                                            X=X_train, y=y_train, cv=outer_cv)
            
            
                # calculate AUC from test and validation set
            
                test_mean = nested_test_score.mean()
                test_std = nested_test_score.std()
                
                clf_model_fit = clf_model.fit(X_train, y_train)
                
                clf_model_best_parameters = str(clf_model.best_params_)
            
                valid_score = clf_model.score(X_valid, y_valid)
                
                test_plus_valid = test_mean + valid_score
                
                model_time = (time.time()-t1)
                    
                    
                # at the end of this nested CV: add results for this model to the models dataframe 
            
                models[innerKey] = [test_mean, test_std, model_time, valid_score, test_plus_valid]
                
                df_list = [outerKey, innerKey, clf_model_best_parameters]
                params = pd.DataFrame(df_list)
                params_df = pd.concat([params_df, params], axis = 1)
    
            # add model results for different feature_select_dict keys to the dict of model results
            # add best parmaeters to the dataframe
           
    
            models_df_dict[outerKey] = models 
            
            
        # subsequent to all model calculations: add multiindex, flip (transpose) dataframe and sort by highest "test+valid"
        # finalize the "best parameters" dataframe
        
        multiindex = {(outerKey, innerKey): values for outerKey, innerDict in models_df_dict.items() for innerKey, values in innerDict.items()}
    
        models_df_dict_multiindex = pd.DataFrame(multiindex)
        models_df_dict_multiindex.index = ['nested_test_mean', 'nested_test_SD', 'time', 'valid', 'test_plus_valid']
        
        models_transpose = models_df_dict_multiindex.transpose()
        models_transpose.index.set_names(['pre', 'model'], inplace=True)
        models_transpose = models_transpose.sort_values(by = ['nested_test_mean'], ascending = False)
    
        params_df = params_df.T
        params_df.columns = ['feature_select', 'model', 'parameters']
        params_df = params_df.sort_values(by = ['model', 'feature_select'], ascending = [True, True])
        
        return models_transpose, params_df
    
    
    results, params = model_scores(feature_select_dict, models_dict, p_grid_dict, X_train_test, y_train_test, X_valid, y_valid)
    results_ready = results.reset_index(level=['pre', 'model'])
    print(results_ready)
    
    

    Problem description

    Use of GridSearchCV is not possible because XGBSE requires hyperparameters to be unique and to be passed at model initialization. Furthermore, parameter values in the parameter dict need to be given without [], while GridSearchCV expects values in []. XGBSE therefore seems to be incompatible with GridSearchCV. XGBSE also seems to be incompatible with sklearn's Pipeline: when the pipeline is fitted, the XGBSE estimator receives X as a np.array in the last step of the pipeline, which has no index. This raises a corresponding error, because XGBSE fitting seems to require X.index.

    Expected behavior

    It would be expected that XGBSE can be used with GridSearchCV and Pipeline.

    Possible solutions

    It would be required that hyperparameters could be defined individually and that fitting would accept X without an index (a np.array).

    bug 
    opened by jules-germany 2
  • Feat/add pre trained model xgbse 1st step

    • Add the possibility of using a pre-trained XGB estimator for the 1st step model in both XGBSEDebiasedBCE and XGBSEStackedWeibull

    Add tests:

    • function to assert a valid xgb estimator is used as pre-trained model
    • model with and without the pre-trained xgb is fitted properly for both modules

    Add docs:

    • example on how to use a pre-trained xgb model for both modules
    opened by davivieirab 0
  • Ensembling of predictions

    Dear xgbse-team,

    What would be the correct way to ensemble predictions? Let's say that I have 5 StackedWeibull models and would like to ensemble their predictions on a test dataset. Should I average the interval predictions?

    Thank you very much.

    question 
    opened by andwurl 1
Releases: v0.2.3