SciFive: a text-to-text transformer model for biomedical literature

Last update: Dec 24, 2022

Overview

SciFive

SciFive provides a text-to-text framework for biomedical language and natural language in NLP. Built on the T5 framework and described in the paper SciFive: a text-to-text transformer model for biomedical literature, SciFive achieves state-of-the-art and competitive results on multiple biomedical natural language tasks.

Google Cloud Storage

Our base Google Cloud Storage URI is gs://scifive

As described in our paper, we release 6 versions of SciFive, each benchmarked to achieve state-of-the-art results on different biomedical tasks. They are all available in our Google Cloud bucket, and we are also working on releasing the models on HuggingFace.

Instructions for accessing Cloud Storage from the command line with the Python-based gsutil tool are described in the gsutil documentation.

gsutil URIs for the 6 SciFive models:

SciFive Pubmed+PMC Base: gs://scifive/models/pubmed_pmc/base
SciFive Pubmed+PMC Large: gs://scifive/models/pubmed_pmc/large
SciFive Pubmed Base: gs://scifive/models/pubmed/base
SciFive Pubmed Large: gs://scifive/models/pubmed/large
SciFive PMC Base: gs://scifive/models/pmc/base
SciFive PMC Large: gs://scifive/models/pmc/large

gsutil URIs for the pretraining data:

Pubmed: gs://scifive/pretrain/pubmed
PMC: gs://scifive/pretrain/pmc
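
The URIs above follow a regular pattern, so they can be assembled programmatically. The sketch below builds a URI and the matching gsutil copy command without running it. The helper names (model_uri, pretrain_uri, gsutil_copy_command) and the destination path are illustrative assumptions, not part of the SciFive code.

```python
import shlex
from typing import List

BUCKET = "gs://scifive"  # base URI given in this README

MODEL_CORPORA = {"pubmed_pmc", "pubmed", "pmc"}
MODEL_SIZES = {"base", "large"}
PRETRAIN_CORPORA = {"pubmed", "pmc"}


def model_uri(corpus: str, size: str) -> str:
    """Return the bucket URI of one of the 6 listed SciFive checkpoints."""
    if corpus not in MODEL_CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown size: {size!r}")
    return f"{BUCKET}/models/{corpus}/{size}"


def pretrain_uri(corpus: str) -> str:
    """Return the bucket URI of one of the listed pretraining corpora."""
    if corpus not in PRETRAIN_CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    return f"{BUCKET}/pretrain/{corpus}"


def gsutil_copy_command(uri: str, dest: str) -> List[str]:
    """Build a recursive, parallel gsutil copy command (it is not executed here)."""
    return ["gsutil", "-m", "cp", "-r", uri, dest]


if __name__ == "__main__":
    # "./scifive_base" is a hypothetical local destination directory.
    cmd = gsutil_copy_command(model_uri("pubmed_pmc", "base"), "./scifive_base")
    print(shlex.join(cmd))
```

Passing the returned list to subprocess.run would perform the download, assuming gsutil is installed and the destination directory already exists.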

Example

Below is an example of how to use SciFive from HuggingFace to generate MedNLI outputs. We also publish SciFive fine-tuned on MedNLI so that the experiments can be reproduced.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model.to(device)

# MedNLI input: the two sentences follow the "mednli:" task prefix.
sent_1 = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
sent_2 = "The patient is hemodynamically stable"
text = f"mednli: sentence1: {sent_1} sentence2: {sent_2}"

encoding = tokenizer(text, padding="max_length", max_length=256, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=8,
    early_stopping=True
)

# Decode the generated tokens back into text.
for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
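
The example above handles a single sentence pair. A minimal sketch of how the same steps might be wrapped for several pairs at once is given below. It assumes a tokenizer and model loaded as in the example are passed in. The function names build_prompt and predict_mednli are hypothetical, and dynamic padding with truncation is a choice made here rather than something taken from the SciFive code.

```python
from typing import Iterable, List, Tuple


def build_prompt(premise: str, hypothesis: str) -> str:
    """Format one sentence pair with the "mednli:" prefix used in the example above."""
    return f"mednli: sentence1: {premise} sentence2: {hypothesis}"


def predict_mednli(tokenizer, model, pairs: Iterable[Tuple[str, str]],
                   max_input_length: int = 256) -> List[str]:
    """Generate one output string per (premise, hypothesis) pair.

    `tokenizer` and `model` are expected to be the HuggingFace objects
    loaded in the example above; they are not created here.
    """
    prompts = [build_prompt(premise, hypothesis) for premise, hypothesis in pairs]
    # Pad to the longest prompt in the batch and truncate overly long inputs.
    encoding = tokenizer(
        prompts,
        padding=True,
        truncation=True,
        max_length=max_input_length,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(
        input_ids=encoding["input_ids"],
        attention_mask=encoding["attention_mask"],
        max_length=8,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling predict_mednli(tokenizer, model, [(sent_1, sent_2)]) would then return a one-element list with the generated text for that pair.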

HuggingFace

SciFive Pubmed+PMC: Base | Large
SciFive Pubmed: Base | Large
SciFive PMC: Base | Large

Datasets

All of the fine-tuning datasets, already pre-processed into text-to-text format, are also available.
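
To illustrate what "text-to-text format" means for a classification task, the sketch below turns one NLI record into an (input, target) string pair and writes it as a tab-separated line. The input prefix follows the MedNLI example in this README. The record itself, the TSV layout, and the helper names are assumptions made for illustration and may differ from the published files.

```python
import csv
import io
from typing import Iterable, List, Tuple


def to_text_pair(premise: str, hypothesis: str, label: str) -> Tuple[str, str]:
    """Map one NLI record to (input text, target text); the label is the target."""
    source = f"mednli: sentence1: {premise} sentence2: {hypothesis}"
    return source, label


def write_tsv(pairs: Iterable[Tuple[str, str]], handle) -> None:
    """Write one "input<TAB>target" line per pair (assumed layout)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for source, target in pairs:
        writer.writerow([source, target])


def read_tsv(handle) -> List[Tuple[str, str]]:
    """Read the pairs back from a file written by write_tsv."""
    return [(row[0], row[1]) for row in csv.reader(handle, delimiter="\t")]


if __name__ == "__main__":
    # A made-up record, not taken from MedNLI.
    pair = to_text_pair("The patient has a fever.",
                        "The patient's temperature is elevated.",
                        "entailment")
    buffer = io.StringIO()
    write_tsv([pair], buffer)
    print(buffer.getvalue(), end="")
```

Because both sides are plain strings, the same pairs can be used for any task simply by changing the prefix and the target vocabulary.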

📊 Expected Results

Citations

If you use the SciFive models or our code in a publication, please cite:

@misc{phan2021scifive,
      title={SciFive: a text-to-text transformer model for biomedical literature}, 
      author={Long N. Phan and James T. Anibal and Hieu Tran and Shaurya Chanana and Erol Bahadroglu and Alec Peltekian and Grégoire Altan-Bonnet},
      year={2021},
      eprint={2106.03598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Owner

Long Phan
