Randomised controlled trial abstract result tabulator

Overview

RCT-ART is an NLP pipeline built with spaCy that converts clinical trial result sentences into tables by jointly extracting intervention, outcome and outcome measure entities and their relations. The system is currently constrained to result sentences with specific measures of an outcome for a specific intervention, and does not extract comparative relationships (e.g. a relative decrease between the study intervention and placebo).

This repository contains the custom pipes and models we developed, trained and ran using the spaCy library. These are defined and initialised through config files and custom scripts.

In addition, we include all stages of our datasets, from their raw format and gold-standard annotations to pre-processed spaCy docs and the system's output tables, as well as the evaluation results of the system for its different NLP tasks across each pre-trained model.

Running the system from Python

After cloning this repository and pip-installing its dependencies from requirements.txt, the system can be run in two steps.
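For reference, the initial clone-and-install might look like this (the repository URL is a placeholder):

git clone <repository-url> RCT-ART
cd RCT-ART
pip install -r requirements.txt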

1. Download and extract the trained models

In the primary study of RCT-ART, we explored a number of BERT-based models in the development of the system. Here, we make available the BioBERT-based named entity recognition (NER) and relation extraction (RE) models:

Download models from here.

The train_models folder in the compressed archive should be extracted into the root of the cloned directory so that the system scripts can access the models.
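Once extracted, a model directory can be loaded like any spaCy pipeline. A minimal sketch, assuming the extracted layout matches the output paths of the training commands later in this README; the example sentence and the printed labels are purely illustrative:

import spacy

# Load the extracted BioBERT-based NER model (path assumes the layout
# produced by the training commands in this README; model-best is the
# standard output subdirectory of spacy train).
nlp = spacy.load("trained_models/biobert/ner/all_domains/model-best")

doc = nlp("Mean HbA1c decreased by 0.5% in the metformin group.")  # illustrative sentence
print([(ent.text, ent.label_) for ent in doc.ents])

Note that the RE model depends on the custom components defined under scripts (hence the -c ../scripts/custom_functions.py flag in the training commands), so it may not load with a bare spacy.load.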

2a. Demo the system NLP tasks

Once the model folder has been extracted, a Streamlit demo of the system's NER, RE and tabulation tasks can be run locally in your browser with the following command:

streamlit run scripts/demo.py
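By default, Streamlit serves the demo at http://localhost:8501 and will usually open it in your browser automatically.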

2b. Process multiple RCT result sentences

Alternatively, multiple result sentences can be processed in batch using tabulate.py in the scripts directory. Input sentences should be in spaCy's Doc format; the sentences from the study are available within datasets/preprocessed.
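A minimal sketch of loading one of these pre-processed files with spaCy's DocBin (the dataset path is one of those shipped in datasets/preprocessed; the blank pipeline only supplies a vocab for deserialisation):

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # supplies the vocab needed to deserialise the docs
doc_bin = DocBin().from_disk("datasets/preprocessed/all_domains/results_only/dev.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
print(f"Loaded {len(docs)} result sentences")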

Training new models for the system

The NER and RE models employed by RCT-ART were both trained using spaCy config files, where we defined their architectures and training hyper-parameters. These are included in the config directory, with a config for each model type and the different BERT-based language representation models we explored in the development of the system. The simplest way to initiate spaCy model training is with the library's inbuilt commands (https://spacy.io/usage/training), passing in the paths of the config file, training set and development set. Below are the commands we used to train the models made available with this repository:

spaCy command for training the BioBERT-based NER model on the all-domains dataset

python -m spacy train configs/ner_biobert.cfg --output ../trained_models/biobert/ner/all_domains --paths.train ../datasets/preprocessed/all_domains/results_only/train.spacy --paths.dev ../datasets/preprocessed/all_domains/results_only/dev.spacy -c ../scripts/custom_functions.py --gpu-id 0

spaCy command for training the BioBERT-based RE model on the all-domains dataset

python -m spacy train configs/rel_biobert.cfg --output ../trained_models/biobert/rel/all_domains  --paths.train ../datasets/preprocessed/all_domains/results_only/train.spacy --paths.dev ../datasets/preprocessed/all_domains/results_only/dev.spacy -c ../scripts/custom_functions.py --gpu-id 0

Repository contents breakdown

The following is a brief description of the assets available in this repository.

configs

Includes the spaCy config files for training the NER and RE models of the RCT-ART system. These files define the model architectures, including the BERT-based language representations. Three BERT language representations were experimented with for each model in the main study of this system: BioBERT, SciBERT and RoBERTa.
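For orientation, the transformer block of such a config looks roughly like the excerpt below; the architecture version and model name here are illustrative, and the actual values are in the files under configs:

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "dmis-lab/biobert-base-cased-v1.1"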

datasets

Includes all stages of the data used to train and test the RCT-ART models, from raw data to split gold-standard files in spaCy Doc format.

Before filtering and result-sentence extraction, abstracts were sourced from the EBM-NLP corpus and from the annotated corpus of the Trenta et al. study, which explored automated information extraction from RCTs and was a key reference for our study.

evaluation_results

Output txt files from the evaluate.py script, giving precision, recall and F1 scores for each of the system tasks across the various dataset cuts.

output_tables

Output csv files from the tabulate.py script, including the predicted tables output by our system for each test result sentence.

scripts

Below is a contents list of the repository scripts with brief descriptions. Full descriptions can be found at the head of each script.

custom_functions.py -- helper functions supporting key modules of the system.

data_collection.py -- classes and functions for filtering the EBM-NLP corpus and result sentence preprocessing.

demo.py -- a browser-based demo of the RCT-ART system developed with spaCy and Streamlit (see above).

entity_ruler.py -- a script for rule-based entity recognition. Unused in the final system, but made available for further development.

evaluate.py -- a set of functions for evaluating the system across the NLP tasks: NER, RE, joint NER + RE and tabulation.

preprocessing.py -- a set of functions for further data preprocessing after data collection, and for splitting data into train, test and dev sets.

rel_model.py -- defines the relation extraction model.

rel_pipe.py -- integrates the relation extraction model as a spaCy pipeline component.

tabulate.py -- runs the full system by loading the NER and RE models and passing their outputs through a tabulation function. Can be used on batches of RCT sentences to output batches of CSV files.

train_multiple_models.py -- iterates through spaCy train commands with different input parameters, allowing batches of models to be trained.
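For illustration, a minimal sketch of that batching pattern, run from the repository root. The non-BioBERT config names are assumptions based on the description of the configs directory above, and the actual loop in train_multiple_models.py may differ:

import subprocess

# Train one NER model per language representation explored in the study.
# Config names other than ner_biobert.cfg are assumed from the configs description.
for model in ["biobert", "scibert", "roberta"]:
    subprocess.run([
        "python", "-m", "spacy", "train", f"configs/ner_{model}.cfg",
        "--output", f"trained_models/{model}/ner/all_domains",
        "--paths.train", "datasets/preprocessed/all_domains/results_only/train.spacy",
        "--paths.dev", "datasets/preprocessed/all_domains/results_only/dev.spacy",
        "-c", "scripts/custom_functions.py",
        "--gpu-id", "0",
    ], check=True)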

Common issues

The transformer models of this system require a GPU with sufficient video RAM; in the primary study, they were trained and run on a GeForce RTX 3080 (10 GB).

There can be issues with the transformer library dependencies, CUDA and PyTorch. If an issue occurs, ensure CUDA 11.1 is installed on your system, and try reinstalling PyTorch with the following command:

pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio===0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
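After reinstalling, the setup can be sanity-checked from Python:

import torch

print(torch.__version__)           # expected: 1.9.1+cu111
print(torch.cuda.is_available())   # should print True if CUDA is set up correctly
print(torch.cuda.get_device_name(0))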

References

  1. The relation extraction component was adapted from the following spaCy project tutorial.

  2. The EBM-NLP corpus is accessible from here and its publication can be found here.

  3. The glaucoma corpus can be found in the Trenta et al. study.

Repo for "TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets" at [email protected]

TableParser Repo for "TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets" at DS3 Lab 11 Dec 13, 2022

Codebase for BMVC 2021 paper "Text Based Person Search with Limited Data"

Text Based Person Search with Limited Data This is the codebase for our BMVC 2021 paper. Please bear with me refactoring this codebase after CVPR dead

Xiao Han 33 Nov 24, 2022
Cancer Drug Response Prediction via a Hybrid Graph Convolutional Network

DeepCDR Cancer Drug Response Prediction via a Hybrid Graph Convolutional Network This work has been accepted to ECCB2020 and was also published in the

Qiao Liu 50 Dec 18, 2022
Continuous Security Group Rule Change Detection & Response at scale

Introduction Get notified of Security Group Changes across all AWS Accounts & Regions in an AWS Organization, with the ability to respond/revert those

Raajhesh Kannaa Chidambaram 3 Aug 13, 2022
PyTorch implementations of the beta divergence loss.

Beta Divergence Loss - PyTorch Implementation This repository contains code for a PyTorch implementation of the beta divergence loss. Dependencies Thi

Billy Carson 7 Nov 09, 2022
Cossim - Sharpened Cosine Distance implementation in PyTorch

Sharpened Cosine Distance PyTorch implementation of the Sharpened Cosine Distanc

Istvan Fehervari 10 Mar 22, 2022
Improving Transferability of Representations via Augmentation-Aware Self-Supervision

Improving Transferability of Representations via Augmentation-Aware Self-Supervision Accepted to NeurIPS 2021 TL;DR: Learning augmentation-aware infor

hankook 38 Sep 16, 2022
Self-Supervised Image Denoising via Iterative Data Refinement

Self-Supervised Image Denoising via Iterative Data Refinement Yi Zhang1, Dasong Li1, Ka Lung Law2, Xiaogang Wang1, Hongwei Qin2, Hongsheng Li1 1CUHK-S

Zhang Yi 72 Jan 01, 2023
FEMDA: Robust classification with Flexible Discriminant Analysis in heterogeneous data

FEMDA: Robust classification with Flexible Discriminant Analysis in heterogeneous data. Flexible EM-Inspired Discriminant Analysis is a robust supervised classification algorithm that performs well i

0 Sep 06, 2022
Scripts and a shader to get you started on setting up an exported Koikatsu character in Blender.

KK Blender Shader Pack A plugin and a shader to get you started with setting up an exported Koikatsu character in Blender. The plugin is a Blender add

166 Jan 01, 2023
PyTorch framework, for reproducing experiments from the paper Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks

Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks. Code, based on the PyTorch framework, for reprodu

Asaf 3 Dec 27, 2022
A multi-entity Transformer for multi-agent spatiotemporal modeling.

baller2vec This is the repository for the paper: Michael A. Alcorn and Anh Nguyen. baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotempor

Michael A. Alcorn 56 Nov 15, 2022
HINet: Half Instance Normalization Network for Image Restoration

HINet: Half Instance Normalization Network for Image Restoration Liangyu Chen, Xin Lu, Jie Zhang, Xiaojie Chu, Chengpeng Chen Paper: https://arxiv.org

303 Dec 31, 2022
Python code to fuse multiple RGB-D images into a TSDF voxel volume.

Volumetric TSDF Fusion of RGB-D Images in Python This is a lightweight python script that fuses multiple registered color and depth images into a proj

Andy Zeng 845 Jan 03, 2023
Implementation of "GNNAutoScale: Scalable and Expressive Graph Neural Networks via Historical Embeddings" in PyTorch

PyGAS: Auto-Scaling GNNs in PyG PyGAS is the practical realization of our G NN A uto S cale (GAS) framework, which scales arbitrary message-passing GN

Matthias Fey 139 Dec 25, 2022
Supporting code for the Neograd algorithm

Neograd This repo supports the paper Neograd: Gradient Descent with a Near-Ideal Learning Rate, which introduces the algorithm "Neograd". The paper an

Michael Zimmer 12 May 01, 2022
This is the code for CVPR 2021 oral paper: Jigsaw Clustering for Unsupervised Visual Representation Learning

JigsawClustering Jigsaw Clustering for Unsupervised Visual Representation Learning Pengguang Chen, Shu Liu, Jiaya Jia Introduction This project provid

DV Lab 73 Sep 18, 2022
Tensor-based approaches for fMRI classification

tensor-fmri Using tensor-based approaches to classify fMRI data from StarPLUS. Citation If you use any code in this repository, please cite the follow

4 Sep 07, 2022
某学校选课系统GIF验证码数据集 + Baseline模型 + 上下游相关工具

elective-dataset-2021spring 某学校2021春季选课系统GIF验证码数据集(29338张) + 准确率98.4%的Baseline模型 + 上下游相关工具。 数据集采用 知识共享署名-非商业性使用 4.0 国际许可协议 进行许可。 Baseline模型和上下游相关工具采用

xmcp 27 Sep 17, 2021
code and models for "Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation"

Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation This repository contains code and models for the method described in: Golnaz

55 Jun 18, 2022