Context-Sensitive Misspelling Correction of Clinical Text via Conditional Independence, CHIL 2022

Last update: Dec 19, 2022

Overview

cim-misspelling

PyTorch implementation of Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence, CHIL 2022.

This model (CIM) corrects misspellings by combining a character-based language model with a corruption model based on edit distance. The model is pre-trained and evaluated on a clinical corpus and clinical misspelling datasets. Please see the paper for a more detailed explanation.
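As a rough illustration of how a language model and an edit-distance corruption model can be combined, the sketch below scores each candidate word by adding a context log-probability to a corruption log-probability. It is not the repository's code: the `correct` function, the linear `edit_penalty`, and the toy context scores are all hypothetical, and a real system would obtain the context scores from the trained character LM.

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def correct(typo, candidates, lm_log_prob, edit_penalty=2.0):
    """Pick the candidate maximizing log P(word | context) + log P(typo | word)."""
    def score(word):
        # Hypothetical corruption model: log-probability falls linearly
        # with the edit distance between the typo and the candidate.
        return lm_log_prob[word] - edit_penalty * edit_distance(typo, word)
    return max(candidates, key=score)

# Made-up context scores for a sentence such as "patient has a history of ___".
lm_log_prob = {
    "hypertension": math.log(0.6),
    "hypotension": math.log(0.3),
    "hyperextension": math.log(0.1),
}
best = correct("hypertnsion", list(lm_log_prob), lm_log_prob)
print(best)
```

Treating the two terms as separate factors is what lets the language model be trained on clean text alone, while the corruption term only needs to compare the typo with each candidate.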

Requirements

Python 3.8 and packages in requirements.txt
The MIMIC-III dataset (v1.4): PhysioNet link
BlueBERT: GitHub link
The SPECIALIST Lexicon of UMLS: LSG website
English dictionary (DWYL): GitHub link

How to Run

Clone the repo

$ git clone --recursive https://github.com/dalgu90/cim-misspelling.git

Data preparation

Download the MIMIC-III dataset from PhysioNet (in particular NOTEEVENTS.csv) and put it under data/mimic3.
Download LRWD and prevariants of the SPECIALIST Lexicon from the LSG website (2018AB version) and put them under data/umls.
Download the English dictionary english.txt from the DWYL GitHub repository (commit 7cb484d) and put it under data/english_words.
Run scripts/build_vocab_corpus.ipynb to build the dictionary and split the MIMIC-III notes into files.
Run the Jupyter notebook for the dataset that you want to download/pre-process:
- MIMIC-III misspelling dataset, or ClinSpell (Fivez et al., 2017): scripts/preprocess_clinspell.ipynb
- CSpell dataset (Lu et al., 2019): scripts/preprocess_cspell.ipynb
- Synthetic misspelling dataset from the MIMIC-III: scripts/synthetic_dataset.ipynb
Download the BlueBERT model from the BlueBERT GitHub repository and put it under bert/ncbi_bert_{base|large}.
- For CIM-Base, please download "BlueBERT-Base, Uncased, PubMed+MIMIC-III"
- For CIM-Large, please download "BlueBERT-Large, Uncased, PubMed+MIMIC-III"
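Before running the notebooks, it can help to confirm that the downloads landed where the steps above expect them. The following is a minimal sketch, not part of the repository: `missing_paths` and the `EXPECTED` list are hypothetical helpers, the BlueBERT entry assumes CIM-Base, and only the data/umls directory is checked because the exact Lexicon file names are not listed here.

```python
from pathlib import Path

# Paths named in the data preparation steps; the BlueBERT entry assumes CIM-Base.
EXPECTED = [
    "data/mimic3/NOTEEVENTS.csv",
    "data/umls",
    "data/english_words/english.txt",
    "bert/ncbi_bert_base",
]

def missing_paths(root=".", expected=EXPECTED):
    """Return the expected paths that do not exist under root."""
    base = Path(root)
    return [p for p in expected if not (base / p).exists()]

if __name__ == "__main__":
    # Run from the repository root; prints nothing when everything is in place.
    for path in missing_paths():
        print(f"missing: {path}")
```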

Pre-training the char-based LM on MIMIC-III

Please run pretrain_cim_base.sh (CIM-Base) or pretrain_cim_large.sh (CIM-Large) to pre-train the character language model of CIM. During pre-training, the LM is evaluated periodically by correcting synthetic misspellings generated from the MIMIC-III data. You may need 2 to 4 GPUs (XXGB+ GPU memory for CIM-Base and YYGB+ for CIM-Large) to pre-train with a batch size of 256. There are several options you may want to configure:

num_gpus: number of GPUs
batch_size: batch size
training_step: total number of steps to train
init_ckpt/init_step: the checkpoint file/steps to resume pretraining
num_beams: beam search width for evaluation
mimic_csv_dir: directory of the MIMIC-III csv splits
bert_dir: directory of the BlueBERT files
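To give a feel for what num_beams controls, here is a generic character-level beam search over a toy next-character table. It is only a sketch under stated assumptions: `beam_search`, `toy_lm`, and `TABLE` are hypothetical names, the probabilities are made up, and the repository's actual decoding (which also involves the corruption model) may differ.

```python
import math

# Made-up next-character probabilities; "$" marks the end of a word.
TABLE = {
    "": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 1.0},
}

def toy_lm(prefix):
    """Return log-probabilities of the next character given a prefix."""
    probs = TABLE.get(prefix, {"$": 1.0})
    return {ch: math.log(p) for ch, p in probs.items()}

def beam_search(next_log_probs, num_beams, max_len, eos="$"):
    """Keep the num_beams highest-scoring prefixes at every step."""
    beams = [("", 0.0)]
    for _ in range(max_len):
        expanded = []
        for prefix, score in beams:
            if prefix.endswith(eos):
                expanded.append((prefix, score))  # finished hypotheses carry over
                continue
            for ch, lp in next_log_probs(prefix).items():
                expanded.append((prefix + ch, score + lp))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:num_beams]
    return beams

greedy = beam_search(toy_lm, num_beams=1, max_len=3)[0][0]
wider = beam_search(toy_lm, num_beams=2, max_len=3)[0][0]
print(greedy, wider)
```

With a width of 1 the search commits to the locally best first character and ends on a word of probability 0.3, while a width of 2 keeps the alternative alive and finds the word of probability 0.4. A larger width explores more hypotheses at a higher evaluation cost.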

You can also download the pre-trained LMs and put them under model/.

Misspelling Correction with CIM

Please specify the dataset directory and the file to evaluate in the evaluation script (eval_cim_base.sh or eval_cim_large.sh), then run the script.
You may want to set init_step to specify which checkpoint to load.

Cite this work

@InProceedings{juyong2022context,
  title = {Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence},
  author = {Kim, Juyong and Weiss, Jeremy C and Ravikumar, Pradeep},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  pages = {234--247},
  year = {2022},
  volume = {174},
  series = {Proceedings of Machine Learning Research},
  month = {07--08 Apr},
  publisher = {PMLR}
}


Owner

Juyong Kim
