Predictive Modeling on Electronic Health Records(EHR) using Pytorch

Overview

Predictive Modeling on Electronic Health Records(EHR) using Pytorch


Overview

Although there are plenty of repos on vision and NLP models, there are very limited repos on EHR using deep learning that we can find. Here we open source our repo, implementing data preprocessing, data loading, and a zoo of common RNN models. The main goal is to lower the bar of entering this field for researchers. We are not claiming any state-of-the-art performance, though our models are quite competitive (a paper describing our work will be available soon).

Based on existing works (e.g., Dr. AI and RETAIN), we represent electronic health records (EHRs) using the pickled list of list of list, which contain histories of patients' diagnoses, medications, and other various events. We integrated all relevant information of a patient's history, allowing easy subsetting.

Currently, this repo includes the following predictive models: Vanilla RNN, GRU, LSTM, Bidirectional RNN, Bidirectional GRU, Bidirectional LSTM, Dilated RNN, Dilated GRU, Dilated LSTM, QRNN,and T-LSTM to analyze and predict clinical performaces. Additionally we have tutorials comparing perfomance to plain LR, Random Forest.

Pipeline

pipeline

Primary Results

Results Summary

Note this result is over two prediction tasks: Heart Failure (HF) risk and Readmission. We showed simple gated RNNs (GRUs or LSTMs) consistently beat traditional MLs (logistic regression (LR) and Random Forest (RF)). All methods were tuned by Bayesian Optimization. All these are described in this paper.

Folder Organization

  • ehr_pytorch: main folder with modularized components:
    • EHREmb.py: EHR embeddings
    • EHRDataloader.py: a separate module to allow for creating batch preprocessed data with multiple functionalities including sorting on visit length and shuffle batches before feeding.
    • Models.py: multiple different models
    • Utils.py
    • main.py: main execution file
    • tplstm.py: tplstm package file
  • Data
    • toy.train: pickle file of toy data with the same structure (multi-level lists) of our processed Cerner data, can be directly utilized for our models for demonstration purpose;
  • Preprocessing
    • data_preprocessing_v1.py: preprocess the data from dataset to build the required multi-level input structure (clear description of how to run this file is in its document header)
  • Tutorials
    • RNN_tutorials_toy.ipynb: jupyter notebooks with examples on how to run our models with visuals and/or utilize our dataloader as a standalone;
    • HF prediction for Diabetic Patients.ipynb
    • Early Readmission v2.ipynb
  • trained_models examples:
    • hf.trainEHRmodel.log: examples of the output of the model
    • hf.trainEHRmodel.pth: actual trained model
    • hf.trainEHRmodel.st: state dictionary

Data Structure

  • We followed the data structure used in the RETAIN. Encounters may include pharmacy, clinical and microbiology laboratory, admission, and billing information from affiliated patient care locations. All admissions, medication orders and dispensing, laboratory orders, and specimens are date and time stamped, providing a temporal relationship between treatment patterns and clinical information.These clinical data are mapped to the most common standards, for example, diagnoses and procedures are mapped to the International Classification of Diseases (ICD) codes, medimultications information include the national drug codes (NDCs), and laboratory tests are linked to their LOINIC codes.

  • Our processed pickle data: multi-level lists. From most outmost to gradually inside (assume we have loaded them as X)

    • Outmost level: patients level, e.g. X[0] is the records for patient indexed 0
    • 2nd level: patient information indicated in X[0][0], X[0][1], X[0][2] are patient id, disease status (1: yes, 0: no disease), and records
    • 3rd level: a list of length of total visits. Each element will be an element of two lists (as indicated in 4)
    • 4th level: for each row in the 3rd-level list.
      • 1st element, e.g. X[0][2][0][0] is list of visit_time (since last time)
      • 2nd element, e.g. X[0][2][0][1] is a list of codes corresponding to a single visit
    • 5th level: either a visit_time, or a single code
  • An illustration of the data structure is shown below:

data structure

In the implementation, the medical codes are tokenized with a unified dictionary for all patients. data example

  • Notes: as long as you have multi-level list you can use our EHRdataloader to generate batch data and feed them to your model

Paper Reference

The paper upon which this repo was built.

Versions This is Version 0.2, more details in the release notes

Dependencies

  • Pytorch 0.4.0 (All models except T-LSTM are compatible with pytorch version 1.4.0) , Issues appear with pytorch 1.5 solved in 1.6 version
  • Torchqrnn
  • Pynvrtc
  • sklearn
  • Matplotlib (for visualizations)
  • tqdm
  • Python: 3.6+

Usage

  • For preprocessing python data_preprocessing.py The above case and control files each is just a three columns table like pt_id | medical_code | visit/event_date

  • To run our models, directly use (you don't need to separately run dataloader, everything can be specified in args here):

python3 main.py -root_dir<'your folder that contains data file(s)'> -files<['filename(train)' 'filename(valid)' 'filename(test)']> -which_model<'RNN'> -optimizer<'adam'> ....(feed as many args as you please)
  • Example:
python3.7 main.py -root_dir /.../Data/ -files sample.train sample.valid sample.test -input_size 15800 -batch_size 100 -which_model LR -lr 0.01 -eps 1e-06 -L2 1e-04
  • To singly use our dataloader for generating data batches, use:
data = EHRdataFromPickles(root_dir = '../data/', 
                          file = ['toy.train'])
loader =  EHRdataLoader(data, batch_size = 128)

#Note: If you want to split data, you must specify the ratios in EHRdataFromPickles() otherwise, call separate loaders for your seperate data files If you want to shuffle batches before using them, add this line

loader = iter_batch2(loader = loader, len(loader))

otherwise, directly call

for i, batch in enumerate(loader): 
    #feed the batch to do things

Check out this notebook with a step by step guide of how to utilize our package.

Warning

  • This repo is for research purpose. Using it at your own risk.
  • This repo is under GPL-v3 license.

Acknowledgements Hat-tip to:

Comments
  • kaplan meier

    kaplan meier

    I attended your session during ACM-BCB conference. Great presentation! I have one question regarding survival analysis. What is the purpose of the "kaplan meier plot" used in survival analysis in ModelTraining file. Is it some kind of baseline to your actual models or is it shoing that survival probability predicted by best model is same as kaplan meier ?

    opened by mehak25 2
  • Getting embedding error when running main.py with toy.train

    Getting embedding error when running main.py with toy.train

    Hi @ZhiGroup and @lrasmy,

    I am very impressed by this work.

    I am getting the attached error when trying to retrieve the embeddings in the EmbedPatients_MB(self,mb_t, mtd) method when using the toy.train file. I just wanted to test the repo's code with this sample data. Should I not use this file and just follow the ACM-BCB-Tutorial instead to generate the processed data?

    Thank you so much for providing this code and these tutorials, it is very help.

    Best Regards,

    Aaron Reich

    pytorch ehr error

    opened by agr505 1
  • Cell_type option

    Cell_type option

    Currently user can input any cell_type (e.g. celltype of "QRNN" for EHR_RNN model), leading to some mismatch in handling packPadMode.
    => Restrict cell_type option to "RNN", "GRU", "LSTM". => Make cell_type of "QRNN" and "TLSTM" a default for qrnn, tlstm model.

    opened by 2miatran 1
  • Mia test

    Mia test

    MODIFIED PARTS: Main.py

    • Modify codes to take data with split options (split is True => split to train, test, valid, split is False => keep the file and sort)
    • Add model prefix (the hospital name) and suffix (optional: user input) to output file
    • Batch_size is used in EHRdataloader => need to give batch_size parameter to dataloader instead of ut.epochs_run()
    • Results are different due to embedded => No modification. Laila's suggestion: change codes in EHRmb.py
    • Eps (currently not required for current optimizer Adagrad but might need later for other optimzers)
    • n_layer default to 1
    • args = parser.parse_args([])

    Utils:

    • Remove batch_size in all functions
    • Add prefix, suffix to the epochs_run function

    Note: mia_test_1 is first created for testing purpose, please ignore this file.

    opened by 2miatran 1
  • Random results with each run even with setting Random seed

    Random results with each run even with setting Random seed

    Testing GPU performance:

    GPU 0 Run 1: Epoch 1 Train_auc : 0.8716401835745263 , Valid_auc : 0.8244826612068169 ,& Test_auc : 0.8398872287083271 Avg Loss: 0.2813216602802277 Train Time (0m 38s) Eval Time (0m 53s)

    Epoch 2 Train_auc : 0.8938440516209567 , Valid_auc : 0.8162852367127903 ,& Test_auc : 0.836586122995983 Avg Loss: 0.26535209695498146 Train Time (0m 38s) Eval Time (0m 53s)

    Epoch 3 Train_auc : 0.9090785000429356 , Valid_auc : 0.8268489421541162 ,& Test_auc : 0.8355234191881434 Avg Loss: 0.25156350443760556 Train Time (0m 38s) Eval Time (0m 53s) (edited)

    lrasmy [3:27 PM]

    GPU0 Run 2: Epoch 1 Train_auc : 0.870730593956147 , Valid_auc : 0.8267809126014227 ,& Test_auc : 0.8407658238915342 Avg Loss: 0.28322121808926265 Train Time (0m 39s) Eval Time (0m 53s)

    Epoch 2 Train_auc : 0.8918280081196787 , Valid_auc : 0.814092171574357 ,& Test_auc : 0.8360580004715573 Avg Loss: 0.26621529906988145 Train Time (0m 39s) Eval Time (0m 53s)

    Epoch 3 Train_auc : 0.9128840712381358 , Valid_auc : 0.8237124792427901 ,& Test_auc : 0.839372227662688 Avg Loss: 0.2513388389100631 Train Time (0m 39s) Eval Time (0m 54s)

    lrasmy [3:43 PM]

    GPU0 Run 3: Epoch 1 Train_auc : 0.8719306438569514 , Valid_auc : 0.8290540285789691 ,& Test_auc : 0.8416333372040562 Avg Loss: 0.28306034040947753 Train Time (0m 40s) Eval Time (0m 55s)

    Epoch 2 Train_auc : 0.8962238893571299 , Valid_auc : 0.812984847168468 ,& Test_auc : 0.8358539036875299 Avg Loss: 0.26579822269578773 Train Time (0m 39s) Eval Time (0m 54s)

    Epoch 3 Train_auc : 0.9131959085864382 , Valid_auc : 0.824907504397332 ,& Test_auc : 0.8411787765451596 Avg Loss: 0.24994653667012848 Train Time (0m 40s) Eval Time (0m 54s)

    opened by lrasmy 1
Releases(v0.2-Feb20)
  • v0.2-Feb20(Feb 21, 2020)

    This release is offering a faster and more memory efficient code than the previously released version

    Key Changes:

    • Moving paddings and mini-batches related tensors creation to the EHR_dataloader
    • Creating the mini-batches list once before running the epochs
    • Adding RETAIN to the models list
    Source code(tar.gz)
    Source code(zip)
Long Expressive Memory (LEM)

Long Expressive Memory for Sequence Modeling This repository contains the implementation to reproduce the numerical experiments of the paper Long Expr

Konstantin Rusch 47 Dec 17, 2022
PyTorch Code for "Generalization in Dexterous Manipulation via Geometry-Aware Multi-Task Learning"

Generalization in Dexterous Manipulation via Geometry-Aware Multi-Task Learning [Project Page] [Paper] Wenlong Huang1, Igor Mordatch2, Pieter Abbeel1,

Wenlong Huang 40 Nov 22, 2022
PyTorch implementation for the ICLR 2020 paper "Understanding the Limitations of Variational Mutual Information Estimators"

Smoothed Mutual Information ``Lower Bound'' Estimator PyTorch implementation for the ICLR 2020 paper Understanding the Limitations of Variational Mutu

50 Nov 09, 2022
[UNMAINTAINED] Automated machine learning for analytics & production

auto_ml Automated machine learning for production and analytics Installation pip install auto_ml Getting started from auto_ml import Predictor from au

Preston Parry 1.6k Jan 02, 2023
Deep Learning for Natural Language Processing SS 2021 (TU Darmstadt)

Deep Learning for Natural Language Processing SS 2021 (TU Darmstadt) Task Training huge unsupervised deep neural networks yields to strong progress in

Oliver Hahn 1 Jan 26, 2022
Official code for "Distributed Deep Learning in Open Collaborations" (NeurIPS 2021)

Distributed Deep Learning in Open Collaborations This repository contains the code for the NeurIPS 2021 paper "Distributed Deep Learning in Open Colla

Yandex Research 96 Sep 15, 2022
CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images

CurriculumNet Introduction This repo contains related code and models from the ECCV 2018 CurriculumNet paper. CurriculumNet is a new training strategy

156 Jul 04, 2022
Final report with code for KAIST Course KSE 801.

Orthogonal collocation is a method for the numerical solution of partial differential equations

Chuanbo HUA 4 Apr 06, 2022
Replication of Pix2Seq with Pretrained Model

Pretrained-Pix2Seq We provide the pre-trained model of Pix2Seq. This version contains new data augmentation. The model is trained for 300 epochs and c

peng gao 51 Nov 22, 2022
RE3: State Entropy Maximization with Random Encoders for Efficient Exploration

State Entropy Maximization with Random Encoders for Efficient Exploration (RE3) (ICML 2021) Code for State Entropy Maximization with Random Encoders f

Younggyo Seo 47 Nov 29, 2022
Small-bets - Ergodic Experiment With Python

Ergodic Experiment Based on this video. Run this experiment with this command: p

Michael Brant 3 Jan 11, 2022
Bayesian Inference Tools in Python

BayesPy Bayesian Inference Tools in Python Our goal is, given the discrete outcomes of events, estimate the distribution of categories. Using gradient

Max Sklar 99 Dec 14, 2022
Class-Attentive Diffusion Network for Semi-Supervised Classification [AAAI'21] (official implementation)

Class-Attentive Diffusion Network for Semi-Supervised Classification Official Implementation of AAAI 2021 paper Class-Attentive Diffusion Network for

Jongin Lim 7 Sep 20, 2022
EEGEyeNet is benchmark to evaluate ET prediction based on EEG measurements with an increasing level of difficulty

Introduction EEGEyeNet EEGEyeNet is a benchmark to evaluate ET prediction based on EEG measurements with an increasing level of difficulty. Overview T

Ard Kastrati 23 Dec 22, 2022
ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction

ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction. NeurIPS 2021.

Gengshan Yang 59 Nov 25, 2022
The code for our CVPR paper PISE: Person Image Synthesis and Editing with Decoupled GAN, Project Page, supp.

PISE The code for our CVPR paper PISE: Person Image Synthesis and Editing with Decoupled GAN, Project Page, supp. Requirement conda create -n pise pyt

jinszhang 110 Nov 21, 2022
A curated list of neural network pruning resources.

A curated list of neural network pruning and related resources. Inspired by awesome-deep-vision, awesome-adversarial-machine-learning, awesome-deep-learning-papers and Awesome-NAS.

Yang He 1.7k Jan 09, 2023
Hands-On Machine Learning for Algorithmic Trading, published by Packt

Hands-On Machine Learning for Algorithmic Trading Hands-On Machine Learning for Algorithmic Trading, published by Packt This is the code repository fo

Packt 981 Dec 29, 2022
Label Studio is a multi-type data labeling and annotation tool with standardized output format

Website • Docs • Twitter • Join Slack Community What is Label Studio? Label Studio is an open source data labeling tool. It lets you label data types

Heartex 11.7k Jan 09, 2023
SemiNAS: Semi-Supervised Neural Architecture Search

SemiNAS: Semi-Supervised Neural Architecture Search This repository contains the code used for Semi-Supervised Neural Architecture Search, by Renqian

Renqian Luo 21 Aug 31, 2022