Resources related to our paper "CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain"

CLIN-X

(CLIN-X-ES) & (CLIN-X-EN)

This repository holds the companion code for the system reported in the paper:

"CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain" by Lukas Lange, Heike Adel, Jannik Strötgen and Dietrich Klakow.

The paper can be found here. The code allows users to reproduce and extend the results reported in the paper. Please cite the above paper when reporting, reproducing, or extending the results.

@inproceedings{lange-etal-2021-clin-x,
      author    = {Lukas Lange and
                   Heike Adel and
                   Jannik Str{\"{o}}tgen and
                   Dietrich Klakow},
      title     = {"CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain},
      year={2021},
      url={https://arxiv.org/abs/2112.08754}
}

In case of questions, please contact the authors as listed on the paper.

Purpose of the project

This software is a research prototype, solely developed for and published as part of the publication cited above. It will neither be maintained nor monitored in any way.

The CLIN-X language models

As part of this work, two XLM-R models were adapted to the clinical domain. The models can be found here:

  • CLIN-X ES: Spanish clinical XLM-R (link)
  • CLIN-X EN: English clinical XLM-R (link)

The CLIN-X models are open-sourced under the CC-BY 4.0 license. See the LICENSE_models file for details.
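
Both models follow the standard XLM-R architecture and can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch: the model path is a placeholder and has to point at the downloaded CLIN-X checkpoint (or the corresponding model hub identifier).

from transformers import AutoTokenizer, AutoModel

# Placeholder path: replace with the location of the downloaded CLIN-X checkpoint
# (the Spanish or English clinical XLM-R linked above).
model_path = "path/to/clin-x-es"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Encode a clinical sentence and obtain contextualized token representations.
inputs = tokenizer("El paciente presenta fiebre y tos.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_subtokens, 1024) for XLM-R large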

Prepare the conda environment

The code requires the following Python libraries:

conda create -n clin-x python==3.8.5
conda activate clin-x
pip install flair==0.8 transformers==4.6.1 torch==1.8.1 scikit-learn==0.23.1 scipy==1.6.3 numpy==1.20.3 nltk tqdm seaborn matplotlib

Masked-Language-Modeling training

The models were trained using the huggingface MLM script that can be found here. The script was called as follows:

python -m torch.distributed.launch --nproc_per_node 8 run_mlm.py  \
--model_name_or_path xlm-roberta-large  \
--train_file data/spanisch_clinical_train.txt  \
--validation_file data/spanisch_clinical_valid.txt  \
--do_train   --do_eval  \
--output_dir models/xlm-roberta-large-spanisch-clinical-domain/  \
--fp16  \
--per_device_train_batch_size 4 --per_device_eval_batch_size 4  \
--save_strategy steps --save_steps 10000
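
After pre-training, the adapted model can be sanity-checked with the fill-mask pipeline. This is only an illustrative check and not part of the original training pipeline; the model path matches the --output_dir argument above and the example sentence is made up.

from transformers import pipeline

# Illustrative sanity check: let the domain-adapted model fill in a masked term.
fill_mask = pipeline(
    "fill-mask",
    model="models/xlm-roberta-large-spanisch-clinical-domain/",
)

# XLM-R uses <mask> as its mask token.
for prediction in fill_mask("El paciente fue tratado con <mask>."):
    print(prediction["token_str"], prediction["score"])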

Using the CLIN-X model with our proposed model architecture (as reported in Table 7)

The following describes the different scripts needed to reproduce the results. See each script file for detailed information on its input arguments.

Tokenize and split the data

python tokenize_files.py --input_path path/to/input/files/ --output_path /path/to/bio_files/
python create_data_splits.py --train_files /path/to/bio_files/ --method random --output_dir /path/to/split_files/
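
The resulting files are word-level BIO files. Assuming a standard CoNLL-style layout (one token and its tag per line, blank lines between sentences), they can be inspected with a few lines of Python; the exact column layout produced by tokenize_files.py may differ, so treat this only as a sketch.

# Quick look at a split file, assuming a CoNLL-style BIO layout
# (token and tag per line, blank lines separating sentences).
from collections import Counter

tag_counts = Counter()
with open("/path/to/split_files/random_split_1.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split()
        if not parts:
            continue  # sentence boundary
        tag_counts[parts[-1]] += 1  # the tag is assumed to be the last column

print(tag_counts.most_common(10))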

Train the model (using random data splits)

The following command trains one model on four splits (1, 2, 3, 4) and uses the remaining split (5) for validation. For different split combinations, adjust the --train_files list and the --dev_file argument accordingly (a sketch that enumerates all five rotations follows the command).

python train_our_model_architecture.py   \
--data_path /path/to/split_files/  \
--train_files random_split_1.txt,random_split_2.txt,random_split_3.txt,random_split_4.txt  \
--dev_file random_split_5.txt  \
--model xlm-roberta-large-spanish-clinical  \
--name model_name --storage_path models
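
In the full five-fold setup, each split serves once as the development set. The following sketch (assuming the files are named random_split_1.txt to random_split_5.txt as above) only assembles the corresponding command line for each rotation; the generated model names are placeholders.

# Enumerate the five train/dev rotations for train_our_model_architecture.py.
# Assumes the split files are named random_split_1.txt .. random_split_5.txt.
splits = [f"random_split_{i}.txt" for i in range(1, 6)]

for dev_file in splits:
    train_files = ",".join(s for s in splits if s != dev_file)
    print(
        "python train_our_model_architecture.py"
        " --data_path /path/to/split_files/"
        f" --train_files {train_files}"
        f" --dev_file {dev_file}"
        " --model xlm-roberta-large-spanish-clinical"
        f" --name model_{dev_file.replace('.txt', '')}"
        " --storage_path models"
    )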

Get ensemble predictions

For all models, get the predictions on the test set as follows:

python get_test_predictions.py --name models/model_name --conll_path /path/to/bio_files/ --out_path predictions/model_name/

Then, combine the different models into one ensemble. The script takes the output path as its first argument, followed by the list of model prediction directories:

python create_ensemble_data.py predictions/ensemble1 predictions/model_name/ predictions/model_name_2/ ...
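
create_ensemble_data.py performs the actual combination. Purely for illustration, a token-level majority vote over aligned model predictions could look like the sketch below; the voting strategy and the tag names are assumptions and not necessarily what the script implements.

# Illustrative token-level majority vote over predictions from several models.
# This is NOT necessarily the strategy implemented in create_ensemble_data.py.
from collections import Counter

def majority_vote(predictions_per_model):
    # predictions_per_model: one label sequence per model, aligned to the same tokens.
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions_per_model)]

# Example with three models labeling the same four tokens (tag names are made up).
print(majority_vote([
    ["B-DISEASE", "I-DISEASE", "O", "O"],
    ["B-DISEASE", "O",         "O", "O"],
    ["B-DISEASE", "I-DISEASE", "O", "B-DRUG"],
]))  # -> ['B-DISEASE', 'I-DISEASE', 'O', 'O']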

Using the CLIN-X model (as reported in Table 3)

While we recommend using our model architecture, the CLIN-X models can be used in many other architectures. In the paper, we compare to the standard transformer sequence-labeling model as proposed by Devlin et al. For this, we provide the train_standard_model_architecture.py script:

python train_standard_model_architecture.py  \
--data_path /path/to/bio_files/  \
--model xlm-roberta-large-spanish-clinical  \
--name model_name --storage_path models
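
For reference, a standard transformer sequence-labeling baseline can also be assembled directly with Flair, which is already part of the environment above. The sketch below is an assumption about such a baseline, not the exact configuration of train_standard_model_architecture.py; the corpus column format, the checkpoint path, and the hyperparameters are placeholders.

# Sketch of a standard transformer sequence-labeling baseline with Flair 0.8.
# Illustrative only; train_standard_model_architecture.py may be configured differently.
import torch
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Assumed CoNLL-style column layout: token in column 0, tag in column 1.
corpus = ColumnCorpus("/path/to/bio_files/", {0: "text", 1: "ner"})
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

embeddings = TransformerWordEmbeddings(
    "path/to/clin-x-es",  # placeholder: local CLIN-X checkpoint or model hub identifier
    fine_tune=True,
)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=False,  # plain classification head on top of the transformer
)

trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
trainer.train("models/standard_baseline", learning_rate=5e-6, mini_batch_size=16, max_epochs=10)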

License

The CLIN-X code is open-sourced under the AGPL-3.0 license. See the LICENSE file for details.

For a list of other open source components included in CLIN-X, see the file 3rd-party-licenses.txt.
