Context-Sensitive Misspelling Correction of Clinical Text via Conditional Independence, CHIL 2022

Overview

cim-misspelling

Pytorch implementation of Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence, CHIL 2022.

image

This model (CIM) corrects misspellings with a char-based language model and a corruption model (edit distance). The model is being pre-trained and evaluated on clinical corpus and datasets. Please see the paper for more detailed explanation.

Requirements

How to Run

Clone the repo

$ git clone --recursive https://github.com/dalgu90/cim-misspelling.git

Data preparing

  1. Download the MIMIC-III dataset from PhysioNet, especially NOTEEVENTS.csv and put under data/mimic3

  2. Download LRWD and prevariants of the SPECIALIST Lexicon from the LSG website (2018AB version) and put under data/umls.

  3. Download the English dictionary english.txt from here (commit 7cb484d) and put under data/english_words.

  4. Run scripts/build_vocab_corpus.ipynb to build the dictionary and split the MIMIC-III notes into files.

  5. Run the Jupyter notebook for the dataset that you want to download/pre-process:

    • MIMIC-III misspelling dataset, or ClinSpell (Fivez et al., 2017): scripts/preprocess_clinspell.ipynb
    • CSpell dataset (Lu et al., 2019): scripts/preprocess_cspell.ipynb
    • Synthetic misspelling dataset from the MIMIC-III: scripts/synthetic_dataset.ipynb
  6. Download the BlueBERT model from here under bert/ncbi_bert_{base|large}.

    • For CIM-Base, please download "BlueBERT-Base, Uncased, PubMed+MIMIC-III"
    • For CIM-Large, please download "BlueBERT-Large, Uncased, PubMed+MIMIC-III"

Pre-training the char-based LM on MIMIC-III

Please run pretrain_cim_base.sh (CIM-Base) or pretrain_cim_large.sh(CIM-Large) and to pretrain the character langauge model of CIM. The pre-training will evaluate the LM periodically by correcting synthetic misspells generated from the MIMIC-III data. You may need 2~4 GPUs (XXGB+ GPU memory for CIM-Base and YYGB+ for CIM-Large) to pre-train with the batch size 256. There are several options you may want to configure:

  • num_gpus: number of GPUs
  • batch_size: batch size
  • training_step: total number of steps to train
  • init_ckpt/init_step: the checkpoint file/steps to resume pretraining
  • num_beams: beam search width for evaluation
  • mimic_csv_dir: directory of the MIMIC-III csv splits
  • bert_dir: directory of the BlueBERT files

You can also download the pre-trained LMs and put under model/:

Misspelling Correction with CIM

Please specify the dataset dir and the file to evaluate in the evaluation script (eval_cim_base.sh or eval_cim_large.sh), and run the script.
You may want to set init_step to specify the checkpoint you want to load

Cite this work

@InProceedings{juyong2022context,
  title = {Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence},
  author = {Kim, Juyong and Weiss, Jeremy C and Ravikumar, Pradeep},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  pages = {234--247},
  year = {2022},
  volume = {174},
  series = {Proceedings of Machine Learning Research},
  month = {07--08 Apr},
  publisher = {PMLR}
}
Owner
Juyong Kim
Juyong Kim
Lab course materials for IEMBA 8/9 course "Coding and Artificial Intelligence"

IEMBA 8/9 - Coding and Artificial Intelligence Dear IEMBA 8/9 students, welcome to our IEMBA 8/9 elective course Coding and Artificial Intelligence, t

Artificial Intelligence & Machine Learning (AI:ML Lab) @ HSG 1 Jan 11, 2022
Codebase for Amodal Segmentation through Out-of-Task andOut-of-Distribution Generalization with a Bayesian Model

Codebase for Amodal Segmentation through Out-of-Task andOut-of-Distribution Generalization with a Bayesian Model

Yihong Sun 12 Nov 15, 2022
Class activation maps for your PyTorch models (CAM, Grad-CAM, Grad-CAM++, Smooth Grad-CAM++, Score-CAM, SS-CAM, IS-CAM, XGrad-CAM, Layer-CAM)

TorchCAM: class activation explorer Simple way to leverage the class-specific activation of convolutional layers in PyTorch. Quick Tour Setting your C

F-G Fernandez 1.2k Dec 29, 2022
To SMOTE, or not to SMOTE?

To SMOTE, or not to SMOTE? This package includes the code required to repeat the experiments in the paper and to analyze the results. To SMOTE, or not

Amazon Web Services 1 Jan 03, 2022
N-gram models- Unsmoothed, Laplace, Deleted Interpolation

N-gram models- Unsmoothed, Laplace, Deleted Interpolation

Ravika Nagpal 1 Jan 04, 2022
Differentiable Abundance Matching With Python

shamnet Differentiable Stellar Population Synthesis Installation You can install shamnet with pip. Installation dependencies are numpy, jax, corrfunc,

5 Dec 17, 2021
Roger Labbe 13k Dec 29, 2022
Code release to accompany paper "Geometry-Aware Gradient Algorithms for Neural Architecture Search."

Geometry-Aware Gradient Algorithms for Neural Architecture Search This repository contains the code required to run the experiments for the DARTS sear

18 May 27, 2022
LeetCode Solutions https://t.me/tenvlad

leetcode LeetCode Solutions groupped by common patterns YouTube: https://www.youtube.com/c/vladten Telegram: https://t.me/nilinterface Problems source

Vlad Ten 158 Dec 29, 2022
a baseline to practice

ccks2021_track3_baseline a baseline to practice 路径可能会有问题,自己改改 torch==1.7.1 pyhton==3.7.1 transformers==4.7.0 cuda==11.0 this is a baseline, you can fi

45 Nov 23, 2022
An end-to-end machine learning library to directly optimize AUC loss

LibAUC An end-to-end machine learning library for AUC optimization. Why LibAUC? Deep AUC Maximization (DAM) is a paradigm for learning a deep neural n

Andrew 75 Dec 12, 2022
Learning from graph data using Keras

Steps to run = Download the cora dataset from this link : https://linqs.soe.ucsc.edu/data unzip the files in the folder input/cora cd code python eda

Mansar Youness 64 Nov 16, 2022
Deep Learning for Natural Language Processing SS 2021 (TU Darmstadt)

Deep Learning for Natural Language Processing SS 2021 (TU Darmstadt) Task Training huge unsupervised deep neural networks yields to strong progress in

2 Aug 05, 2022
A flexible and extensible framework for gait recognition.

A flexible and extensible framework for gait recognition. You can focus on designing your own models and comparing with state-of-the-arts easily with the help of OpenGait.

Shiqi Yu 335 Dec 22, 2022
(CVPR 2022 - oral) Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry

Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry Official implementation of the paper Multi-View Depth Est

Bae, Gwangbin 138 Dec 28, 2022
Code to accompany the paper "Finding Bipartite Components in Hypergraphs", which is published in NeurIPS'21.

Finding Bipartite Components in Hypergraphs This repository contains code to accompany the paper "Finding Bipartite Components in Hypergraphs", publis

Peter Macgregor 5 May 06, 2022
Official code for "On the Frequency Bias of Generative Models", NeurIPS 2021

Frequency Bias of Generative Models Generator Testbed Discriminator Testbed This repository contains official code for the paper On the Frequency Bias

35 Nov 01, 2022
Annealed Flow Transport Monte Carlo

Annealed Flow Transport Monte Carlo Open source implementation accompanying ICML 2021 paper by Michael Arbel*, Alexander G. D. G. Matthews* and Arnaud

DeepMind 30 Nov 21, 2022
Sionna: An Open-Source Library for Next-Generation Physical Layer Research

Sionna: An Open-Source Library for Next-Generation Physical Layer Research Sionna™ is an open-source Python library for link-level simulations of digi

NVIDIA Research Projects 313 Dec 22, 2022
SSL_SLAM2: Lightweight 3-D Localization and Mapping for Solid-State LiDAR (mapping and localization separated) ICRA 2021

SSL_SLAM2 Lightweight 3-D Localization and Mapping for Solid-State LiDAR (Intel Realsense L515 as an example) This repo is an extension work of SSL_SL

Wang Han 王晗 1.3k Jan 08, 2023