[EMNLP 2021] Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training

Overview

RoSTER

The source code used for Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training, published in EMNLP 2021.

Requirements

At least one GPU is required to run the code.

Before running, you need to first install the required packages by typing following commands:

$ pip3 install -r requirements.txt

Python 3.6 or above is strongly recommended; using older python versions might lead to package incompatibility issues.

Reproducing the Results

The three datasets used in the paper can be found under the data directory. We provide three bash scripts run_conll.sh, run_onto.sh and run_wikigold.sh for running the model on the three datasets.

Note: Our model does not use any ground truth training/valid/test set labels but only distant labels; we provide the ground truth label files only for completeness and evaluation.

The training bash scripts assume you use one GPU for training (a GPU with around 20GB memory would be sufficient). If your GPUs have smaller memory sizes, try increasing gradient_accumulation_steps or using more GPUs (by setting the CUDA_VISIBLE_DEVICES environment variable). However, the train_batch_size should be always kept as 32.

Command Line Arguments

The meanings of the command line arguments will be displayed upon typing

python src/train.py -h

The following arguments are important and need to be set carefully:

  • train_batch_size: The effective training batch size after gradient accumulation. Usually 32 is good for different datasets.
  • gradient_accumulation_steps: Increase this value if your GPU cannot hold the training batch size (while keeping train_batch_size unchanged).
  • eval_batch_size: This argument only affects the speed of the algorithm; use as large evaluation batch size as your GPUs can hold.
  • max_seq_length: This argument controls the maximum length of sequence fed into the model (longer sequences will be truncated). Ideally, max_seq_length should be set to the length of the longest document (max_seq_length cannot be larger than 512 under RoBERTa architecture), but using larger max_seq_length also consumes more GPU memory, resulting in smaller batch size and longer training time. Therefore, you can trade model accuracy for faster training by reducing max_seq_length.
  • noise_train_epochs, ensemble_train_epochs, self_train_epochs: They control how many epochs to train the model for noise-robust training, ensemble model trianing and self-training, respectively. Their default values will be a good starting point for most datasets, but you may increase them if your dataset is small (e.g., Wikigold dataset) and decrease them if your dataset is large (e.g., OntoNotes dataset).
  • q, tau: Hyperparameters used for noise-robust training. Their default values will be a good starting point for most datasets, but you may use higher values if your dataset is more noisy and use lower values if your dataset is cleaner.
  • noise_train_update_interval, self_train_update_interval: They control how often to update training label weights in noise-robust training and compute soft labels in soft-training, respectively. Their default values will be a good starting point for most datasets, but you may use smaller values (more frequent updates) if your dataset is small (e.g., Wikigold dataset).

Other arguments can be kept as their default values.

Running on New Datasets

To execute the code on a new dataset, you need to

  1. Create a directory named your_dataset under data.
  2. Prepare a training corpus train_text.txt (one sequence per line; words separated by whitespace) and the corresponding distant label train_label_dist.txt (one sequence per line; labels separated by whitespace) under your_dataset for training the NER model.
  3. Prepare an entity type file types.txt under your_dataset (each line contains one entity type; no need to include O class; no need to prepend I-/B- to type names). The entity type names need to be consistant with those in train_label_dist.txt.
  4. (Optional) You can choose to provide a test corpus test_text.txt (one sequence per line) with ground truth labels test_label_true.txt (one sequence per line; labels separated by whitespace). If the test corpus is provided and the command line argument do_eval is turned on, the code will display evaluation results on the test set during training, which is useful for tuning hyperparameters and monitoring the training progress.
  5. Run the code with appropriate command line arguments (I recommend creating a new bash script by referring to the three example scripts).
  6. The final trained classification model will be saved as final_model.pt under the output directory specified by the command line argument output_dir.

You can always refer to the example datasets when preparing your own datasets.

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{meng2021distantly,
  title={Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training},
  author={Meng, Yu and Zhang, Yunyi and Huang, Jiaxin and Wang, Xuan and Zhang, Yu and Ji, Heng and Han, Jiawei},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  year={2021},
}
Owner
Yu Meng
Ph.D. student, Text Mining
Yu Meng
Double pendulum simulator using a symplectic Euler's method and Hamiltonian mechanics

Symplectic Double Pendulum Simulator Double pendulum simulator using a symplectic Euler's method. The program calculates the momentum and position of

Scott Marino 1 Jan 12, 2022
DSTC10 Track 2 - Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations

DSTC10 Track 2 - Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations This repository contains the data, scripts and baseline co

Alexa 51 Dec 17, 2022
A set of tools for converting a darknet dataset to COCO format working with YOLOX

darknet格式数据→COCO darknet训练数据目录结构(详情参见dataset/darknet): darknet ├── class.names ├── gen_config.data ├── gen_train.txt ├── gen_valid.txt └── images

RapidAI-NG 148 Jan 03, 2023
BTC-Generator - BTC Generator With Python

Что такое BTC-Generator? Это генератор чеков всеми любимого @BTC_BANKER_BOT Для

DoomGod 3 Aug 24, 2022
ByteTrack(Multi-Object Tracking by Associating Every Detection Box)のPythonでのONNX推論サンプル

ByteTrack-ONNX-Sample ByteTrack(Multi-Object Tracking by Associating Every Detection Box)のPythonでのONNX推論サンプルです。 ONNXに変換したモデルも同梱しています。 変換自体を試したい方はByteT

KazuhitoTakahashi 16 Oct 26, 2022
An unofficial styleguide and best practices summary for PyTorch

A PyTorch Tools, best practices & Styleguide This is not an official style guide for PyTorch. This document summarizes best practices from more than a

IgorSusmelj 1.5k Jan 05, 2023
Vehicle direction identification consists of three module detection , tracking and direction recognization.

Vehicle-direction-identification Vehicle direction identification consists of three module detection , tracking and direction recognization. Algorithm

5 Nov 15, 2022
Misc YOLOL scripts for use in the Starbase space sandbox videogame

starbase-misc Misc YOLOL scripts for use in the Starbase space sandbox videogame. Each directory contains standalone YOLOL scripts. They don't really

4 Oct 17, 2021
[ECCV'20] Convolutional Occupancy Networks

Convolutional Occupancy Networks Paper | Supplementary | Video | Teaser Video | Project Page | Blog Post This repository contains the implementation o

622 Dec 30, 2022
Additional code for Stable-baselines3 to load and upload models from the Hub.

Hugging Face x Stable-baselines3 A library to load and upload Stable-baselines3 models from the Hub. Installation With pip Examples [Todo: add colab t

Hugging Face 34 Dec 10, 2022
A very lightweight monitoring system for Raspberry Pi clusters running Kubernetes.

OMNI A very lightweight monitoring system for Raspberry Pi clusters running Kubernetes. Why? When I finished my Kubernetes cluster using a few Raspber

Matias Godoy 148 Dec 29, 2022
Tracing Versus Freehand for Evaluating Computer-Generated Drawings (SIGGRAPH 2021)

Tracing Versus Freehand for Evaluating Computer-Generated Drawings (SIGGRAPH 2021) Zeyu Wang, Sherry Qiu, Nicole Feng, Holly Rushmeier, Leonard McMill

Zach Zeyu Wang 23 Dec 09, 2022
Official respository for "Modeling Defocus-Disparity in Dual-Pixel Sensors", ICCP 2020

Official respository for "Modeling Defocus-Disparity in Dual-Pixel Sensors", ICCP 2020 BibTeX @INPROCEEDINGS{punnappurath2020modeling, author={Abhi

Abhijith Punnappurath 22 Oct 01, 2022
Causal Influence Detection for Improving Efficiency in Reinforcement Learning

Causal Influence Detection for Improving Efficiency in Reinforcement Learning This repository contains the code release for the paper "Causal Influenc

Autonomous Learning Group 21 Nov 29, 2022
A PyTorch-Based Framework for Deep Learning in Computer Vision

TorchCV: A PyTorch-Based Framework for Deep Learning in Computer Vision @misc{you2019torchcv, author = {Ansheng You and Xiangtai Li and Zhen Zhu a

Donny You 2.2k Jan 09, 2023
A PyTorch implementation of "Signed Graph Convolutional Network" (ICDM 2018).

SGCN ⠀ A PyTorch implementation of Signed Graph Convolutional Network (ICDM 2018). Abstract Due to the fact much of today's data can be represented as

Benedek Rozemberczki 251 Nov 30, 2022
An efficient 3D semantic segmentation framework for Urban-scale point clouds like SensatUrban, Campus3D, etc.

An efficient 3D semantic segmentation framework for Urban-scale point clouds like SensatUrban, Campus3D, etc.

Zou 33 Jan 03, 2023
Code for Domain Adaptive Video Segmentation via Temporal Consistency Regularization in ICCV 2021

Domain Adaptive Video Segmentation via Temporal Consistency Regularization Updates 08/2021: check out our domain adaptation for sematic segmentation p

36 Dec 12, 2022
An ever-growing playground of notebooks showcasing CLIP's impressive zero-shot capabilities.

Playground for CLIP-like models Demo Colab Link GradCAM Visualization Naive Zero-shot Detection Smarter Zero-shot Detection Captcha Solver Changelog 2

Kevin Zakka 101 Dec 30, 2022
Scripts and a shader to get you started on setting up an exported Koikatsu character in Blender.

KK Blender Shader Pack A plugin and a shader to get you started with setting up an exported Koikatsu character in Blender. The plugin is a Blender add

166 Jan 01, 2023