Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

Last update: Jan 04, 2023

Overview

Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

arXiv

This is the code base for weakly supervised NER.

We provide a three stage framework:

Stage I: Domain continual pre-training;
Stage II: Noise-aware weakly supervised pre-training;
Stage III: Fine-tuning.

In this code base, we actually provide basic building blocks which allow arbitrary combination of different stages. We also provide examples scripts for reproducing our results in BioMedical NER.

See details in arXiv.

Weakly Supervised NER

Performance Benchmark

BioMedical NER

Method (F1)	BC5CDR-chem	BC5CDR-disease	NCBI-disease
BERT	89.99	79.92	85.87
bioBERT	92.85	84.70	89.13
PubMedBERT	93.33	85.62	87.82
Ours	94.17	90.69	92.28

See more in bio_script/README.md

Dependency

pytorch==1.6.0
transformers==3.3.1
allennlp==1.1.0
flashtool==0.0.10
ray==0.8.7

Install requirements

pip install -r requirements.txt

(If the allennlp and transformers are incompatible, install allennlp first and then update transformers. Since we only use some small functions of allennlp, it should works fine. )

File Structure:

├── bert-ner          #  Python Code for Training NER models
│   └── ...
└── bio_script        #  Shell Scripts for Training BioMedical NER models
    └── ...

Usage

See examples in bio_script

Hyperparameter Explaination

Here we explain hyperparameters used the scripts in ./bio_script.

Training Scripts:

Scripts

roberta_mlm_pretrain.sh
weak_weighted_selftrain.sh
finetune.sh

Hyperparameter

GPUID: Choose the GPU for training. It can also be specified by xxx.sh 0,1,2,3.
MASTER_PORT: automatically constructed (avoid conflicts) for distributed training.
DISTRIBUTE_GPU: use distributed training or not
PROJECT_ROOT: automatically detected, the root path of the project folder.
DATA_DIR: Directory of the training data, where it contains train.txt test.txt dev.txt labels.txt weak_train.txt (weak data) aug_train.txt (optional).
USE_DA: if augment training data by augmentation, i.e., combine train.txt + aug_train.txt in DATA_DIR for training.
BERT_MODEL: the model backbone, e.g., roberta-large. See transformers for details.
BERT_CKP: see BERT_MODEL_PATH.
BERT_MODEL_PATH: the path of the model checkpoint that you want to load as the initialization. Usually used with BERT_CKP.
LOSSFUNC: nll the normal loss function, corrected_nll noise-aware risk (i.e., add weighted log-unlikelihood regularization: wei*nll + (1-wei)*null ).
MAX_WEIGHT: The maximum weight of a sample in the loss.
MAX_LENGTH: max sentence length.
BATCH_SIZE: batch size per GPU.
NUM_EPOCHS: number of training epoches.
LR: learning rate.
WARMUP: learning rate warmup steps.
SAVE_STEPS: the frequency of saving models.
EVAL_STEPS: the frequency of testing on validation.
SEED: radnom seed.
OUTPUT_DIR: the directory for saving model and code. Some parameters will be automatically appended to the path.
- roberta_mlm_pretrain.sh: It's better to manually check where you want to save the model.]
- finetune.sh: It will be save in ${BERT_MODEL_PATH}/finetune_xxxx.
- weak_weighted_selftrain.sh: It will be save in ${BERT_MODEL_PATH}/selftrain/${FBA_RULE}_xxxx (see FBA_RULE below)

There are some addition parameters need to be set for weakly supervised learning (weak_weighted_selftrain.sh).

WEAK_RULE: what kind of weakly supervised data to use. See Weakly Supervised Data Refinement Script for details.

Profiling Script

Scripts

profile.sh

Profiling scripts also use the same entry as the training script: bert-ner/run_ner.py but only do evaluation.

Hyperparameter Basically the same as training script.

PROFILE_FILE: can be train,dev,test or a specific path to a txt data. E.g., using Weak by

PROFILE_FILE=weak_train_100.txt PROFILE_FILE=$DATA_DIR/$PROFILE_FILE
OUTPUT_DIR: It will be saved in OUTPUT_DIR=${BERT_MODEL_PATH}/predict/profile

Weakly Supervised Data Refinement Script

Scripts

profile2refinedweakdata.sh

Hyperparameter

BERT_CKP: see BERT_MODEL_PATH.
BERT_MODEL_PATH: the path of the model checkpoint that you want to load as the initialization. Usually used with BERT_CKP.
WEI_RULE: rule for generating weight for each weak sample.
- uni: all are 1
- avgaccu: confidence estimate for new labels generated by all_overwrite
- avgaccu_weak_non_O_promote: confidence estimate for new labels generated by non_O_overwrite
PRED_RULE: rule for generating new weak labels.
- non_O_overwrite: non-entity ('O') is overwrited by prediction
- all_overwrite: all use prediction, i.e., self-training
- no: use original weak labels
- non_O_overwrite_all_overwrite_over_accu_xx: non_O_overwrite + if confidence is higher than xx all tokens use prediction as new labels

The generated data will be saved in ${BERT_MODEL_PATH}/predict/weak_${PRED_RULE}-WEI_${WEI_RULE} WEAK_RULE specified in weak_weighted_selftrain.sh is essential the name of folder weak_${PRED_RULE}-WEI_${WEI_RULE}.

More Rounds of Training, Try Different Combination

To do training with weakly supervised data from any model checkpoint directory:

i) Set BERT_CKP appropriately;
ii) Create profile data, e.g., run ./bio_script/profile.sh for dev set and weak set
iii) Generate data with weak labels from profile data, e.g., run ./bio_script/profile2refinedweakdata.sh. You can use different rules to generate weights for each sample (WEI_RULE) and different rules to refine weak labels (PRED_RULE). See more details in ./ber-ner/profile2refinedweakdata.py
iv) Do training with ./bio_script/weak_weighted_selftrain.sh.

To do fine-tuning with human labeled data from any model checkpoint directory:

i) Set BERT_CKP appropriately;
ii) Run ./bio_script/finetune.sh.

Reference

@inproceedings{Jiang2021NamedER,
  title={Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data},
  author={Haoming Jiang and Danqing Zhang and Tianyue Cao and Bing Yin and T. Zhao},
  booktitle={ACL/IJCNLP},
  year={2021}
}

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

Related tags

Overview

Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

Performance Benchmark

Dependency

File Structure:

Usage

Hyperparameter Explaination

Training Scripts:

Profiling Script

Weakly Supervised Data Refinement Script

More Rounds of Training, Try Different Combination

Reference

Security

License

Owner

Amazon

Vector AI — A platform for building vector based applications. Encode, query and analyse data using vectors.

Dynamica causal Bayesian optimisation

Super-BPD: Super Boundary-to-Pixel Direction for Fast Image Segmentation (CVPR 2020)

AI drive app that can help user become beautiful.

Fast image augmentation library and easy to use wrapper around other libraries. Documentation: https://albumentations.ai/docs/ Paper about library: https://www.mdpi.com/2078-2489/11/2/125

3D Multi-Person Pose Estimation by Integrating Top-Down and Bottom-Up Networks

Source Code for ICSE 2022 Paper - ``Can We Achieve Fairness Using Semi-Supervised Learning?''

Official implementation of MSR-GCN (ICCV 2021 paper)

Create images and texts with the First Order Generative Adversarial Networks

It is the assignment for COMP 576 in Rice University

This repository lets you interact with Lean through a REPL.

ISNAS-DIP: Image Specific Neural Architecture Search for Deep Image Prior [CVPR 2022]

Good Classification Measures and How to Find Them

A Shading-Guided Generative Implicit Model for Shape-Accurate 3D-Aware Image Synthesis

Composable transformations of Python+NumPy programsComposable transformations of Python+NumPy programs

Setup and customize deep learning environment in seconds.

Chinese Mandarin tts text-to-speech 中文 (普通话) 语音合成 , by fastspeech 2 , implemented in pytorch, using waveglow as vocoder,

Official re-implementation of the Calibrated Adversarial Refinement model described in the paper Calibrated Adversarial Refinement for Stochastic Semantic Segmentation

Source code for CVPR2022 paper "Abandoning the Bayer-Filter to See in the Dark"

EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction

Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

Related tags

Overview

Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

Performance Benchmark

Dependency

File Structure:

Usage

Hyperparameter Explaination

Training Scripts:

Profiling Script

Weakly Supervised Data Refinement Script

More Rounds of Training, Try Different Combination

Reference

Security

License

Owner

Amazon

Vector AI — A platform for building vector based applications. Encode, query and analyse data using vectors.

Dynamica causal Bayesian optimisation

Super-BPD: Super Boundary-to-Pixel Direction for Fast Image Segmentation (CVPR 2020)

AI drive app that can help user become beautiful.

Fast image augmentation library and easy to use wrapper around other libraries. Documentation: https://albumentations.ai/docs/ Paper about library: https://www.mdpi.com/2078-2489/11/2/125

3D Multi-Person Pose Estimation by Integrating Top-Down and Bottom-Up Networks

Source Code for ICSE 2022 Paper - ``Can We Achieve Fairness Using Semi-Supervised Learning?''

Official implementation of MSR-GCN (ICCV 2021 paper)

Create images and texts with the First Order Generative Adversarial Networks

It is the assignment for COMP 576 in Rice University

This repository lets you interact with Lean through a REPL.

ISNAS-DIP: Image Specific Neural Architecture Search for Deep Image Prior [CVPR 2022]

Good Classification Measures and How to Find Them

A Shading-Guided Generative Implicit Model for Shape-Accurate 3D-Aware Image Synthesis

Composable transformations of Python+NumPy programsComposable transformations of Python+NumPy programs

Setup and customize deep learning environment in seconds.

Chinese Mandarin tts text-to-speech 中文 (普通话) 语音 合成 , by fastspeech 2 , implemented in pytorch, using waveglow as vocoder,

Official re-implementation of the Calibrated Adversarial Refinement model described in the paper Calibrated Adversarial Refinement for Stochastic Semantic Segmentation

Source code for CVPR2022 paper "Abandoning the Bayer-Filter to See in the Dark"

EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction

Chinese Mandarin tts text-to-speech 中文 (普通话) 语音合成 , by fastspeech 2 , implemented in pytorch, using waveglow as vocoder,