BERTMap: A BERT-Based Ontology Alignment System

Overview

BERTMap: A BERT-based Ontology Alignment System

Important Notices

About

BERTMap is a BERT-based ontology alignment system, which utilizes the textual knowledge of ontologies to fine-tune BERT and make prediction. It also incorporates sub-word inverted indices for candidate selection, and (graph-based) extension and (logic-based) repair modules for mapping refinement.

Essential dependencies

The following packages are necessary but not sufficient for running BERTMap:

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch  # pytorch
pip install cython  # the optimized parser of owlready2 relies on Cython
pip install owlready2  # for managing ontologies
pip install tensorboard  # tensorboard logging (optional)
pip install transformers  # huggingface library
pip install datasets  # huggingface datasets

Running BERTMap

IMPORTANT NOTICE: BERTMap relies on class labels for training, but different ontologies have different annotation properties to define the aliases (synonyms), so preprocessing is required for adding all the synonyms to rdf:label before running BERTMap. The preprocessed ontologies involved in our paper together with their reference mappings are available in data.zip.

Clone the repository and run:

# fine-tuning and evaluate bertmap prediction 
python run_bertmap.py -c config.json -m bertmap

# mapping extension (-e specify which mapping set {src, tgt, combined} to be extended)
python extend_bertmap.py -c config.json -e src

# evaluate extended bertmap 
python eval_bertmap.py -c config.json -e src

# repair and evluate final outputs (-t specify best validation threshold)
python repair_bertmap.py -c config.json -e src -t 0.999

# baseline models (edit similarity and pretrained bert embeddings)
python run_bertmap.py -c config.json -m nes
python run_bertmap.py -c config.json -m bertembeds

The script skips data construction once built for the first time to ensure that all of the models share the same set of pre-processed data.

The fine-tuning model is implemented with huggingface Trainer, which by default uses multiple GPUs, for restricting to GPUs of specified indices, please run (for example):

# only device (1) and (2) are visible to the script
CUDA_VISIBLE_DEVICES=1,2 python run_bertmap.py -c config.json -m bertmap 

Configurations

Here gives the explanations of the variables used in config.json for customized BERTMap running.

  • data:
    • task_dir: directory for saving all the output files.
    • src_onto: source ontology name.
    • tgt_onto: target ontology name.
    • task_suffix: any suffix of the task if needed, e.g. the LargeBio track has 'small' and 'whole'.
    • src_onto_file: source ontology file in .owl format.
    • tgt_onto_fil: target ontology file in .owl format.
    • properties: list of textual properties used for constructing semantic data , default is class labels: ["label"].
    • cut: threshold length for the keys of sub-word inverted index, preserve the keys only if their lengths > cut, default is 0.
  • corpora:
    • sample_rate: number of (soft) negative samples for each positive sample generated in corpora (not the ultimate fine-tuning data).
    • src2tgt_mappings_file: reference mapping file for evaluation and semi-supervised learning setting in .tsv format with columns: "Entity1", "Entity2" and "Value".
    • ignored_mappings_file: file in .tsv format but stores mappings that should be ignored by the evaluator.
    • train_map_ratio: proportion of training mappings to used in semi-supervised setting, default is 0.2.
    • val_map_ratio: proportion of validation mappings to used in semi-supervised setting, default is 0.1.
    • test_map_ratio: proportion of test mappings to used in semi-supervised setting, default is 0.7.
    • io_soft_neg_rate: number of soft negative sample for each positive sample generated in the fine-tuning data at the intra-ontology level.
    • io_hard_neg_rate: number of hard negative sample for each positive sample generated in the fine-tuning data at the intra-ontology level.
    • co_soft_neg_rate: number of soft negative sample for each positive sample generated in the fine-tuning data at the cross-ontology level.
    • depth_threshold: classes of depths larger than this threshold will not considered in hard negative generation, default is null.
    • depth_strategy: strategy to compute the depths of the classes if any threshold is set, default is max, choices are max and min.
  • bert
    • pretrained_path: real or huggingface library path for pretrained BERT, e.g. "emilyalsentzer/Bio_ClinicalBERT" (BioClinicalBERT).
    • tokenizer_path: real or huggingface library path for BERT tokenizer, e.g. "emilyalsentzer/Bio_ClinicalBERT" (BioClinicalBERT).
  • fine-tune
    • include_ids: include identity synonyms in the positive samples or not.
    • learning: choice of learning setting ss (semi-supervised) or us (unsupervised).
    • warm_up_ratio: portion of warm up steps.
    • max_length: maximum length for tokenizer (highly important for large task!).
    • num_epochs: number of training epochs, default is 3.0.
    • batch_size: batch size for fine-tuning BERT.
    • early_stop: whether or not to apply early stopping (patience has been set to 10), default is false.
    • resume_checkpoint: path to previous checkpoint if any, default is null.
  • map
    • candidate_limits: list of candidate limits used for mapping computation, suggested values are [25, 50, 100, 150, 200].
    • batch_size: batch size used for mapping computation.
    • nbest: number of top results to be considered.
    • string_match: whether or not to use string match before others.
    • strategy: strategy for classifier scoring method, default is mean.
  • eval:
    • automatic: whether or not automatically evaluate the mappings.

Should you need any further customizaions especially on the evaluation part, please set eval: automatic to false and use your own evaluation script.

Acknolwedgements

The repair module is credited to Ernesto Jiménez Ruiz et al., and the code can be found here.

Owner
KRR
Knowledge Representation and Reasoning Group - University of Oxford
KRR
Semi-supervised Semantic Segmentation with Directional Context-aware Consistency (CVPR 2021)

Semi-supervised Semantic Segmentation with Directional Context-aware Consistency (CAC) Xin Lai*, Zhuotao Tian*, Li Jiang, Shu Liu, Hengshuang Zhao, Li

DV Lab 137 Dec 14, 2022
SegNet-Basic with Keras

SegNet-Basic: What is Segnet? Deep Convolutional Encoder-Decoder Architecture for Semantic Pixel-wise Image Segmentation Segnet = (Encoder + Decoder)

Yad Konrad 81 Jun 30, 2022
TensorFlow implementation of "Variational Inference with Normalizing Flows"

[TensorFlow 2] Variational Inference with Normalizing Flows TensorFlow implementation of "Variational Inference with Normalizing Flows" [1] Concept Co

YeongHyeon Park 7 Jun 08, 2022
ReLoss - Official implementation for paper "Relational Surrogate Loss Learning" ICLR 2022

Relational Surrogate Loss Learning (ReLoss) Official implementation for paper "R

Tao Huang 31 Nov 22, 2022
MOpt-AFL provided by the paper "MOPT: Optimized Mutation Scheduling for Fuzzers"

MOpt-AFL 1. Description MOpt-AFL is a AFL-based fuzzer that utilizes a customized Particle Swarm Optimization (PSO) algorithm to find the optimal sele

172 Dec 18, 2022
BasicRL: easy and fundamental codes for deep reinforcement learning。It is an improvement on rainbow-is-all-you-need and OpenAI Spinning Up.

BasicRL: easy and fundamental codes for deep reinforcement learning BasicRL is an improvement on rainbow-is-all-you-need and OpenAI Spinning Up. It is

RayYoh 12 Apr 28, 2022
Paddle pit - Rethinking Spatial Dimensions of Vision Transformers

基于Paddle实现PiT ——Rethinking Spatial Dimensions of Vision Transformers,arxiv 官方原版代

Hongtao Wen 4 Jan 15, 2022
Designing a Practical Degradation Model for Deep Blind Image Super-Resolution (ICCV, 2021) (PyTorch) - We released the training code!

Designing a Practical Degradation Model for Deep Blind Image Super-Resolution Kai Zhang, Jingyun Liang, Luc Van Gool, Radu Timofte Computer Vision Lab

Kai Zhang 804 Jan 08, 2023
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

Introduction This is a Python package available on PyPI for NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pyto

Artit 'Art' Wangperawong 5 Sep 29, 2021
K Closest Points and Maximum Clique Pruning for Efficient and Effective 3D Laser Scan Matching (To appear in RA-L 2022)

KCP The official implementation of KCP: k Closest Points and Maximum Clique Pruning for Efficient and Effective 3D Laser Scan Matching, accepted for p

Yu-Kai Lin 109 Dec 14, 2022
Conversational text Analysis using various NLP techniques

PyConverse Let me try first Installation pip install pyconverse Usage Please try this notebook that demos the core functionalities: basic usage noteb

Rita Anjana 158 Dec 25, 2022
Hierarchical Motion Encoder-Decoder Network for Trajectory Forecasting (HMNet)

Hierarchical Motion Encoder-Decoder Network for Trajectory Forecasting (HMNet) Our paper: https://arxiv.org/abs/2111.13324 We will release the complet

15 Oct 17, 2022
Decentralized Reinforcment Learning: Global Decision-Making via Local Economic Transactions (ICML 2020)

Decentralized Reinforcement Learning This is the code complementing the paper Decentralized Reinforcment Learning: Global Decision-Making via Local Ec

40 Oct 30, 2022
Tree-based Search Graph for Approximate Nearest Neighbor Search

TBSG: Tree-based Search Graph for Approximate Nearest Neighbor Search. TBSG is a graph-based algorithm for ANNS based on Cover Tree, which is also an

Fanxbin 2 Dec 27, 2022
ProMP: Proximal Meta-Policy Search

ProMP: Proximal Meta-Policy Search Implementations corresponding to ProMP (Rothfuss et al., 2018). Overall this repository consists of two branches: m

Jonas Rothfuss 212 Dec 20, 2022
The MATH Dataset

Measuring Mathematical Problem Solving With the MATH Dataset This is the repository for Measuring Mathematical Problem Solving With the MATH Dataset b

Dan Hendrycks 267 Dec 26, 2022
Air Pollution Prediction System using Linear Regression and ANN

AirPollution Pollution Weather Prediction System: Smart Outdoor Pollution Monitoring and Prediction for Healthy Breathing and Living Publication Link:

Dr Sharnil Pandya, Associate Professor, Symbiosis International University 19 Feb 07, 2022
the code of the paper: Recurrent Multi-view Alignment Network for Unsupervised Surface Registration (CVPR 2021)

RMA-Net This repo is the implementation of the paper: Recurrent Multi-view Alignment Network for Unsupervised Surface Registration (CVPR 2021). Paper

Wanquan Feng 205 Nov 09, 2022
Code for technical report "An Improved Baseline for Sentence-level Relation Extraction".

RE_improved_baseline Code for technical report "An Improved Baseline for Sentence-level Relation Extraction". Requirements torch = 1.8.1 transformers

Wenxuan Zhou 74 Nov 29, 2022
PyTorch implementation of neural style randomization for data augmentation

README Augment training images for deep neural networks by randomizing their visual style, as described in our paper: https://arxiv.org/abs/1809.05375

84 Nov 23, 2022