Implementation of BI-RADS-BERT & The Advantages of Section Tokenization.

Overview

BI-RADS BERT

Implementation of BI-RADS-BERT & The Advantages of Section Tokenization.

This implementation could be used on other radiology in house corpus as well. Labelling your own data should take the same form as reports and dataframes in './mockdata'.

Conda Environment setup

This project was developed using conda environments. To build the conda environment use the line of code below from the command line

conda create --name NLPenv --file requirements.txt --channel default --channel conda-forge --channel huggingface --channel pytorch

Dataset Organization

Two datasets are needed to build BERT embeddings and fine tuned Field Extractors. 1. dataframe of SQL data, 2. labeled data for field extraction.

Dataframe of SQL data: example file './mock_data/sql_dataframe.csv'. This file was efficiently made by producing a spreadsheet of all entries in the sql table and saving them as a csv file. It will require that each line of the report be split and coordinated with a SequenceNumber column to combine all the reports. Then continue to the 'How to Run BERT Pretraining' Section.

Labeled data for Field Extraction: example of files in './mock_data/labaled_data'. Exach txt file is a save dict object with fields:

example = {
    'original_report': original text report unprocessed from the exam_dataframe.csv, 
    'sectionized': dict example of the report in sections, ex. {'Title': '...', 'Hx': '...', ...}
    'PID': patient identification number,
    'date': date of the exam,
    'field_name1': name of a field you wish to classify, vlaue is the label, 
    'field_name2': more labeled fields are an option, 
    ...
}

How to Run BERT Pretraining

Step 1: SQLtoDataFrame.py

This script can be ran to convert SQL data from a hospital records system to a dataframe for all exams. Hospital records keep each individual report line as a separate SQL entry, so by using 'SequenceNumber' we can assemble them in order.

python ./examples/SQLtoDataFrame.py 
--input_sql ./mock_data/sql_dataframe.csv 
--save_name /folder/to/save/exam_dataframe/save_file.csv

This will output an 'exam_dataframe.csv' file that can be used in the next step.

Step 2: TextPreProcessingBERTModel.py

This script is ran to convert the exam_dataframe.csv file into a pre_training text file for training and validation, with a vocabulary size. An example of the output can be found in './mock_data/pre_training_data'.

python ./examples/TextPreProcessingBERTModel.py 
--dfolder /folder/that/contains/exam_dataframe 
--ft_folder ./mock_data/labeled_data

Step 3: MLM_Training_transformers.py

This script will now run the BERT pre training with masked language modeling. The Output directory (--output_dir) used is required to be empty; eitherwise the parser parameter --overwrite_output_dir is required to overwrite the files in the output directory.

python ./examples/MLM_Training_transformers.py 
--train_data_file ./mock_data/pre_training_data/VocabOf39_PreTraining_training.txt 
--output_dir /folder/to/save/bert/model
--do_eval 
--eval_data_file ./mock_data/pre_training_data/PreTraining_validation.txt 

How to Run BERT Fine Tuning

--pre_trained_model parsed arugment that can be used for all the follwing scripts to load a pre trained embedding. The default is bert-base-uncased. To get BioClinical BERT use --pre_trained_model emilyalsentzer/Bio_ClinicalBERT.

Step 4: BERTFineTuningSectionTokenization.py

This script will run fine tuning to train a section tokenizer with the option of using auxiliary data.

python ./examples/BERTFineTuningSectionTokenization.py 
--dfolder ./mock_data/labeled_data
--sfolder /folder/to/save/section_tokenizer

Optional parser arguements:

--aux_data If used then the Section Tokenizer will be trained with the auxilliary data.

--k_fold If used then the experiment is run with a 5 fold cross validation.

Step 5: BERTFineTuningFieldExtractionWoutSectionization.py

This script will run fine tuning training of field extraction without section tokenization.

python ./examples/BERTFineTuningFieldExtractionWoutSectionization.py 
--dfolder ./mock_data/labeled_data
--sfolder /folder/to/save/field_extractor_WoutST
--field_name Modality

field_name is a required parsed arguement.

Optional parser arguements:

--k_fold If used then the experiment is run with a 5 fold cross validation.

Step 6: BERTFineTuningFieldExtraction.py

This script will run fine tuning training of field extraction with section tokenization.

python ./examples/BERTFineTuningFieldExtraction.py 
--dfolder ./mock_data/labeled_data
--sfolder /folder/to/save/field_extractor
--field_name Modality
--report_section Title

field_name and report_section is a required parsed arguement.

Optional parser arguements:

--k_fold If used then the experiment is run with a 5 fold cross validation.

Additional Codes

post_ExperimentSummary.py

This code can be used to run statistical analysis of test results that are produced from BERTFineTuning codes.

To determine the best final model, we performed statistical significance testing with a 95% confidence. We used the Mann-Whitney U test to compare the medians of different section tokenizers as the distribution of accuracy and G.F1 performance is skewed to the left (medians closer to 100%). For the field extraction classifiers, we used the McNemar test to compare the agreement between two classifiers. The McNemar test was chosen because it has been robustly proven to have an acceptable probability of Type I errors (not detecting a difference between two classifiers when there is a difference). After evaluating both configurations of field extraction explored in this paper, we performed another McNemar test to assist in choosing the best technique. All statistical tests were performed with p-value adjustments for multiple comparisons testing with Bonferonni correction.

Note: input folder must contain 2 or more .xlsx files of experiemtnal results to perform a statistical test.

python ./examples/post_ExperimentSummary.py --folder /folder/where/xlsx/files/are/located --stat_test MannWhitney

--stat_test options: 'MannWhitney' and 'McNemar'.

'MannWhitney': MannWhitney U-Test. This test was used for the Section Tokenizer experimental results comparing the results from different models. https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test

'McNemar' : McNemar's test. This test was used for the Field Extraction experimental results comparing the results from different models. https://en.wikipedia.org/wiki/McNemar%27s_test

Contact

Please post a Github issue if you have any questions.

High-Fidelity Pluralistic Image Completion with Transformers (ICCV 2021)

Image Completion Transformer (ICT) Project Page | Paper (ArXiv) | Pre-trained Models | Supplemental Material This repository is the official pytorch i

Ziyu Wan 243 Jan 03, 2023
[IROS2021] NYU-VPR: Long-Term Visual Place Recognition Benchmark with View Direction and Data Anonymization Influences

NYU-VPR This repository provides the experiment code for the paper Long-Term Visual Place Recognition Benchmark with View Direction and Data Anonymiza

Automation and Intelligence for Civil Engineering (AI4CE) Lab @ NYU 22 Sep 28, 2022
OpenMMLab Video Perception Toolbox. It supports Video Object Detection (VID), Multiple Object Tracking (MOT), Single Object Tracking (SOT), Video Instance Segmentation (VIS) with a unified framework.

English | 简体中文 Documentation: https://mmtracking.readthedocs.io/ Introduction MMTracking is an open source video perception toolbox based on PyTorch.

OpenMMLab 2.7k Jan 08, 2023
Demonstrates iterative FGSM on Apple's NeuralHash model.

apple-neuralhash-attack Demonstrates iterative FGSM on Apple's NeuralHash model. TL;DR: It is possible to apply noise to CSAM images and make them loo

Lim Swee Kiat 11 Jun 23, 2022
Self-Supervised Collision Handling via Generative 3D Garment Models for Virtual Try-On

Self-Supervised Collision Handling via Generative 3D Garment Models for Virtual Try-On [Project website] [Dataset] [Video] Abstract We propose a new g

71 Dec 24, 2022
Convert scikit-learn models to PyTorch modules

sk2torch sk2torch converts scikit-learn models into PyTorch modules that can be tuned with backpropagation and even compiled as TorchScript. Problems

Alex Nichol 101 Dec 16, 2022
Using this you can control your PC/Laptop volume by Hand Gestures (pinch-in, pinch-out) created with Python.

Hand Gesture Volume Controller Using this you can control your PC/Laptop volume by Hand Gestures (pinch-in, pinch-out). Code Firstly I have created a

Tejas Prajapati 16 Sep 11, 2021
Node for thenewboston digital currency network.

Project setup For project setup see INSTALL.rst Community Join the community to stay updated on the most recent developments, project roadmaps, and ra

thenewboston 27 Jul 08, 2022
A Shading-Guided Generative Implicit Model for Shape-Accurate 3D-Aware Image Synthesis

A Shading-Guided Generative Implicit Model for Shape-Accurate 3D-Aware Image Synthesis Project Page | Paper A Shading-Guided Generative Implicit Model

Xingang Pan 115 Dec 18, 2022
Boostcamp CV Serving For Python

Boostcamp-CV-Serving Prerequisites MySQL GCP Cloud Storage GCP key file Sentry Streamlit Cloud Secrets: .streamlit/secrets.toml #DO NOT SHARE THIS I

Jungwon Seo 19 Feb 22, 2022
BookMyShowPC - Movie Ticket Reservation App made with Tkinter

Book My Show PC What is this? Movie Ticket Reservation App made with Tkinter. Tk

The Nithin Balaji 3 Dec 09, 2022
Everything's Talkin': Pareidolia Face Reenactment (CVPR2021)

Everything's Talkin': Pareidolia Face Reenactment (CVPR2021) Linsen Song, Wayne Wu, Chaoyou Fu, Chen Qian, Chen Change Loy, and Ran He [Paper], [Video

71 Dec 21, 2022
NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go

NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go This repository provides our implementation of the CVPR 2021 paper NeuroMorp

Meta Research 35 Dec 08, 2022
Official code of the paper "ReDet: A Rotation-equivariant Detector for Aerial Object Detection" (CVPR 2021)

ReDet: A Rotation-equivariant Detector for Aerial Object Detection ReDet: A Rotation-equivariant Detector for Aerial Object Detection (CVPR2021), Jiam

csuhan 334 Dec 23, 2022
PyTorch implementations of the paper: "DR.VIC: Decomposition and Reasoning for Video Individual Counting, CVPR, 2022"

DRNet for Video Indvidual Counting (CVPR 2022) Introduction This is the official PyTorch implementation of paper: DR.VIC: Decomposition and Reasoning

tao han 35 Nov 22, 2022
U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection

The code for our newly accepted paper in Pattern Recognition 2020: "U^2-Net: Going Deeper with Nested U-Structure for Salient Object Detection."

Xuebin Qin 6.5k Jan 09, 2023
Machine Learning Framework for Operating Systems - Brings ML to Linux kernel

KML: A Machine Learning Framework for Operating Systems & Storage Systems Storage systems and their OS components are designed to accommodate a wide v

File systems and Storage Lab (FSL) 186 Nov 24, 2022
Final project for Intro to CS class.

Financial Analysis Web App https://share.streamlit.io/mayurk1/fin-web-app-final-project/webApp.py 1. Project Description This project is a technical a

Mayur Khanna 1 Dec 10, 2021
Deep learning operations reinvented (for pytorch, tensorflow, jax and others)

This video in better quality. einops Flexible and powerful tensor operations for readable and reliable code. Supports numpy, pytorch, tensorflow, and

Alex Rogozhnikov 6.2k Jan 01, 2023
Creating Artificial Life with Reinforcement Learning

Although Evolutionary Algorithms have shown to result in interesting behavior, they focus on learning across generations whereas behavior could also be learned during ones lifetime.

Maarten Grootendorst 49 Dec 21, 2022