Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence, CHIL 2022

Last update: Dec 19, 2022

Overview

cim-misspelling

PyTorch implementation of Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence, CHIL 2022.

This model (CIM) corrects misspellings with a character-based language model and a corruption model (edit distance). The model is pre-trained on a clinical corpus and evaluated on clinical misspelling datasets. Please see the paper for a more detailed explanation.
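As a rough illustration of how a language model and an edit-distance corruption model can be combined, the sketch below ranks candidate corrections by the sum of two log scores. It is not the repository's implementation: `corruption_log_prob`, `rank_candidates`, the fixed per-edit `penalty`, and the example LM scores are all hypothetical, and a plain Levenshtein distance stands in for whatever corruption model the paper actually uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def corruption_log_prob(typo: str, candidate: str, penalty: float = 3.0) -> float:
    """Toy corruption model (assumption): each edit costs a fixed log-penalty."""
    return -penalty * edit_distance(typo, candidate)


def rank_candidates(typo: str, lm_log_probs: dict, penalty: float = 3.0) -> list:
    """Sort candidates by log P(candidate | context) + log P(typo | candidate)."""
    scored = [
        (lm_lp + corruption_log_prob(typo, cand, penalty), cand)
        for cand, lm_lp in lm_log_probs.items()
    ]
    return [cand for _, cand in sorted(scored, reverse=True)]


# Hypothetical context scores, as if produced by a language model.
example_lm_scores = {"patient": -2.0, "patent": -6.0, "painting": -9.0}
ranking = rank_candidates("pateint", example_lm_scores)
```

Here "patent" is closer to the typo in edit distance, but the context score lets "patient" rank first, which is the point of making the correction context-sensitive.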

Requirements

Python 3.8 and packages in requirements.txt
The MIMIC-III dataset (v1.4), available from PhysioNet
BlueBERT, available from its GitHub repository
The SPECIALIST Lexicon of UMLS, available from the LSG website
English dictionary (DWYL), available from its GitHub repository

How to Run

Clone the repo

$ git clone --recursive https://github.com/dalgu90/cim-misspelling.git

Data preparation

Download the MIMIC-III dataset from PhysioNet (in particular NOTEEVENTS.csv) and put it under data/mimic3.
Download LRWD and prevariants of the SPECIALIST Lexicon from the LSG website (2018AB version) and put them under data/umls.
Download the English dictionary english.txt from the DWYL GitHub repository (commit 7cb484d) and put it under data/english_words.
Run scripts/build_vocab_corpus.ipynb to build the dictionary and split the MIMIC-III notes into files.
Run the Jupyter notebook for the dataset that you want to download/pre-process:
- MIMIC-III misspelling dataset, or ClinSpell (Fivez et al., 2017): scripts/preprocess_clinspell.ipynb
- CSpell dataset (Lu et al., 2019): scripts/preprocess_cspell.ipynb
- Synthetic misspelling dataset from the MIMIC-III: scripts/synthetic_dataset.ipynb
Download the BlueBERT model from its GitHub repository and put it under bert/ncbi_bert_{base|large}.
- For CIM-Base, please download "BlueBERT-Base, Uncased, PubMed+MIMIC-III"
- For CIM-Large, please download "BlueBERT-Large, Uncased, PubMed+MIMIC-III"
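Before running the notebooks, it can help to confirm that the downloads landed where the steps above expect them. The following is a minimal sketch, not part of the repository: `missing_paths` is a hypothetical helper, and because the exact file names inside data/umls and the BlueBERT directory are not listed here, only the directories themselves are checked.

```python
from pathlib import Path

# Paths taken from the data preparation steps above.
EXPECTED = [
    "data/mimic3/NOTEEVENTS.csv",
    "data/umls",                       # directory only; file names are an unknown
    "data/english_words/english.txt",
]


def missing_paths(root: str = ".", bert_size: str = "base") -> list:
    """Return the expected paths that do not exist under `root`."""
    expected = EXPECTED + [f"bert/ncbi_bert_{bert_size}"]
    return [p for p in expected if not (Path(root) / p).exists()]
```

Calling `missing_paths()` from the repository root returns an empty list once everything is in place; pass `bert_size="large"` when preparing CIM-Large.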

Pre-training the char-based LM on MIMIC-III

Run pretrain_cim_base.sh (CIM-Base) or pretrain_cim_large.sh (CIM-Large) to pre-train the character language model of CIM. Pre-training periodically evaluates the LM by correcting synthetic misspellings generated from the MIMIC-III data. You may need 2 to 4 GPUs (XXGB+ of GPU memory for CIM-Base and YYGB+ for CIM-Large) to pre-train with a batch size of 256. There are several options you may want to configure:

num_gpus: number of GPUs
batch_size: batch size
training_step: total number of steps to train
init_ckpt/init_step: the checkpoint file/step from which to resume pre-training
num_beams: beam search width for evaluation
mimic_csv_dir: directory of the MIMIC-III csv splits
bert_dir: directory of the BlueBERT files
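The periodic evaluation during pre-training relies on synthetic misspellings. To give a feel for what such data can look like, here is a minimal sketch that applies one random character edit to a correct word. It is an assumption-laden illustration: `corrupt_word` is a hypothetical function, the four edit types are chosen uniformly, and the procedure in scripts/synthetic_dataset.ipynb may well differ.

```python
import random
import string


def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random edit: insert, substitute, delete, or transpose."""
    if not word:
        return word
    ops = ["insert", "substitute"]
    if len(word) > 1:
        ops += ["delete", "transpose"]  # avoid emptying or no-op on 1-char words
    op = rng.choice(ops)
    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "substitute":
        i = rng.randrange(len(word))
        letters = [c for c in string.ascii_lowercase if c != word[i]]
        return word[:i] + rng.choice(letters) + word[i + 1:]
    if op == "delete":
        i = rng.randrange(len(word))
        return word[:i] + word[i + 1:]
    # Transpose two adjacent characters.
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]
```

Passing a seeded `random.Random` instance keeps the generated misspellings reproducible across evaluation runs.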

You can also download the pre-trained LMs and put them under model/.

Misspelling Correction with CIM

Specify the dataset directory and the file to evaluate in the evaluation script (eval_cim_base.sh or eval_cim_large.sh), then run the script.
You may also want to set init_step to select the checkpoint to load.

Cite this work

@InProceedings{juyong2022context,
  title = {Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence},
  author = {Kim, Juyong and Weiss, Jeremy C and Ravikumar, Pradeep},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  pages = {234--247},
  year = {2022},
  volume = {174},
  series = {Proceedings of Machine Learning Research},
  month = {07--08 Apr},
  publisher = {PMLR}
}

Owner

Juyong Kim
