Natural Language Processing for Adverse Drug Reaction (ADR) Detection

This repo contains code from a project to identify ADRs in discharge summaries at Austin Health. The model is built with the HuggingFace Transformers library, starting from the pretrained DeBERTa model. Further masked language modelling (MLM) pre-training is performed on a large corpus of unannotated discharge summaries. Finally, the model is fine-tuned on a corpus of discharge summaries annotated using Prodigy. The model performs named entity recognition (NER), but final performance is measured at the document level, using the maximum token-level score as the document score.
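To make the document-level scoring concrete, here is a minimal sketch of taking the maximum token-level score as the document score. The function names, the 0.5 threshold, and the choice to score an empty document as 0.0 are assumptions for illustration, not taken from the repo.

```python
from typing import Sequence


def document_score(token_scores: Sequence[float]) -> float:
    """Return the document-level score: the maximum token-level ADR score.

    An empty document is scored 0.0 (an assumption made for this sketch).
    """
    if not token_scores:
        return 0.0
    return max(token_scores)


def document_prediction(token_scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag the document as containing an ADR if its score reaches the threshold.

    The default threshold of 0.5 is hypothetical.
    """
    return document_score(token_scores) >= threshold


# Hypothetical per-token probabilities of the ADR class for one summary.
scores = [0.02, 0.10, 0.91, 0.35]
print(document_score(scores))
print(document_prediction(scores))
```

With this rule, a single confidently tagged token is enough to flag the whole document.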

We used Weights and Biases for experiment tracking.

The pretrain script takes a folder containing discharge summaries stored in CSV files, tokenizes them, and continues MLM training from deberta-base.
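As a rough illustration of the first step, the sketch below gathers the summary texts from every CSV file in a folder. The function name and the "text" column name are assumptions; the actual pretrain script may read its input differently. The resulting list would then be tokenized and passed to MLM training.

```python
import csv
from pathlib import Path
from typing import List


def load_summaries(folder: str, text_column: str = "text") -> List[str]:
    """Collect non-empty summary texts from all CSV files in a folder.

    The column name "text" is a hypothetical default for this sketch.
    """
    texts: List[str] = []
    # Sort the paths so the order of the corpus is reproducible.
    for csv_path in sorted(Path(folder).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                text = (row.get(text_column) or "").strip()
                if text:
                    texts.append(text)
    return texts
```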

Fine-tuning can then be performed by running the finetune script from the command line. The script expects the data to be either a JSONL file of annotated text exported from Prodigy (--datafile example.jsonl) or a saved HuggingFace Dataset. If you run the script once on a JSONL file of annotations, you can save the resulting Dataset to a folder (--save_data_dir "save_to_here") and reuse it in subsequent training runs (--datafile "save_to_here").
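For readers unfamiliar with Prodigy exports, the sketch below shows one way a JSONL record could be turned into token-level labels for NER. It assumes each record carries "tokens" and "spans" fields, with inclusive "token_start" and "token_end" indices on each span, and it uses a BIO tagging scheme. The function names, the BIO scheme, and the "ADR" label are assumptions for illustration; the finetune script may handle this differently.

```python
import json
from typing import Dict, List, Tuple


def prodigy_to_bio(record: Dict) -> Tuple[List[str], List[str]]:
    """Convert one Prodigy NER record into parallel token and BIO label lists."""
    tokens = [tok["text"] for tok in record["tokens"]]
    labels = ["O"] * len(tokens)
    for span in record.get("spans", []):
        # token_end is treated as inclusive (assumed export convention).
        start, end = span["token_start"], span["token_end"]
        labels[start] = "B-" + span["label"]
        for i in range(start + 1, end + 1):
            labels[i] = "I-" + span["label"]
    return tokens, labels


def read_prodigy_jsonl(path: str) -> List[Tuple[List[str], List[str]]]:
    """Read a UTF-8 JSONL export, keeping only accepted annotations."""
    examples = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Records without an "answer" field are kept (an assumption).
            if record.get("answer", "accept") != "accept":
                continue
            examples.append(prodigy_to_bio(record))
    return examples
```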

Example usage:

python .\finetune.py --folds 5 --epochs 15 --lr 5e-5 --wandb_on --hub_off --project 'CLI Tests' --run_name cross-validation --datafile 'data'
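The --folds 5 flag and the cross-validation run name in the command above suggest k-fold cross-validation. The sketch below shows how the examples could be split into folds; the function name, the shuffling, and the seed are assumptions, and the finetune script may split the data another way.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_examples: int, n_folds: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each fold.

    Every example appears in exactly one validation fold.
    """
    if not 2 <= n_folds <= n_examples:
        raise ValueError("n_folds must be between 2 and n_examples")
    indices = list(range(n_examples))
    # Shuffle once with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(indices)
    for fold in range(n_folds):
        validation = indices[fold::n_folds]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation
```

Each fold would train a fresh model on its training indices and evaluate on its validation indices, with the scores averaged across folds.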

Note: your exported annotations (JSONL file) may not be encoded in UTF-8, which will prevent this code from working. There are various ways to change the encoding. On a Windows machine, for example, adapt the following PowerShell command:

Get-Content .\name_of_file.jsonl -Encoding Unicode | Set-Content -Encoding UTF8 .\name_of_new_file.jsonl
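On other platforms, the same conversion can be sketched in Python. This assumes the export is UTF-16, which is what PowerShell calls Unicode; the function name and file paths are placeholders.

```python
def reencode_to_utf8(src_path: str, dst_path: str, src_encoding: str = "utf-16") -> None:
    """Rewrite a text file as UTF-8, leaving its line endings unchanged.

    The default source encoding of UTF-16 is an assumption; pass a different
    codec name if your export uses another encoding.
    """
    # newline="" disables newline translation on both read and write.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

For example, reencode_to_utf8("name_of_file.jsonl", "name_of_new_file.jsonl") writes a UTF-8 copy next to the original.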

Owner

Medicines Optimisation Service - Austin Health
