[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.

Last update: Dec 07, 2022

Overview

SapBERT: Self-alignment pretraining for BERT

This repo holds code for the SapBERT model presented in our NAACL 2021 paper: Self-Alignment Pretraining for Biomedical Entity Representations [arxiv]; and our ACL 2021 paper: Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [PDF].

Huggingface Models

[SapBERT]

Standard SapBERT as described in [Liu et al., NAACL 2021]. Trained with UMLS 2020AA (English only), using microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as the base model. Use [CLS] (before pooler) as the representation of the input.

[SapBERT-XLMR]

Cross-lingual SapBERT as described in [Liu et al., ACL 2021]. Trained with UMLS 2020AB (all languages), using xlm-roberta-base as the base model. Use [CLS] (before pooler) as the representation of the input.

[SapBERT-mean-token]

Same as the standard SapBERT but trained with mean-pooling instead of [CLS] representations.
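The difference between the [CLS] representation and mean-pooling can be sketched in plain Python. This is only an illustration, not the repo's code: the helper names cls_pool and mean_pool and the toy vectors are hypothetical, and it assumes mean-pooling averages over every non-padding position, which may differ from what the released model does. In practice the same operations would be applied to the model's last-layer hidden states as tensors.

```python
from typing import List, Sequence

Vector = List[float]


def cls_pool(token_vectors: Sequence[Sequence[float]]) -> Vector:
    """Return the vector at position 0, i.e. the [CLS] token (hypothetical helper)."""
    return list(token_vectors[0])


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> Vector:
    """Average the vectors of all non-padding positions (mask == 1).

    Assumption: special tokens are included in the average.
    """
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy last-layer output for one input: [CLS], two word pieces, one padding slot.
hidden = [
    [1.0, 0.0],  # [CLS]
    [3.0, 2.0],
    [2.0, 4.0],
    [9.0, 9.0],  # padding, masked out below
]
mask = [1, 1, 1, 0]

print(cls_pool(hidden))         # [1.0, 0.0]
print(mean_pool(hidden, mask))  # [2.0, 2.0]
```

Use the pooling that matches the checkpoint: [CLS] for SapBERT and SapBERT-XLMR, mean-pooling for SapBERT-mean-token.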

Environment

The code is tested with Python 3.8, torch 1.7.0, and Hugging Face transformers 4.4.2. See requirements.txt for more details.
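A quick way to compare a local setup against the tested versions is sketched below. It uses only the standard library; the helper name installed_versions and the TESTED mapping are illustrative and not part of this repo.

```python
import sys
from importlib import metadata

# Versions the code was tested with, as stated in this README.
TESTED = {"torch": "1.7.0", "transformers": "4.4.2"}


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(TESTED)
print("python", "%d.%d" % sys.version_info[:2], "(tested: 3.8)")
for name, tested in TESTED.items():
    print(name, report[name] or "not installed", "(tested: %s)" % tested)
```

Other versions may work as well; this only shows where a setup differs from the tested one.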

Train SapBERT

Prepare the training data as instructed in data/generate_pretraining_data.ipynb.

Run:

cd umls_pretraining
./pretrain.sh 0,1

where 0,1 specifies the GPU devices.

Evaluate SapBERT

Please view evaluation/README.md for details.

Citations

@article{liu2021self,
	title={Self-Alignment Pretraining for Biomedical Entity Representations},
	author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
	journal={arXiv preprint arXiv:2010.11784},
	year={2020}
}

Acknowledgement

Parts of the code are modified from BioSyn. We thank the authors for open-sourcing BioSyn.

License

SapBERT is MIT licensed. See the LICENSE file for details.

Owner

Cambridge Language Technology Lab
