“Data Augmentation for Cross-Domain Named Entity Recognition” (EMNLP 2021)


Last update: Sep 10, 2022


Overview

Data Augmentation for Cross-Domain Named Entity Recognition

Authors: Shuguang Chen, Gustavo Aguilar, Leonardo Neves and Thamar Solorio

This repository contains the implementation of the system described in the paper "Data Augmentation for Cross-Domain Named Entity Recognition", presented at the EMNLP 2021 conference.

The main contribution of this paper is a novel neural architecture that learns textual patterns and transforms text from a high-resource domain to a low-resource domain. Please refer to the paper for details.

Installation

We have updated the code to work with Python 3.9, PyTorch 1.9, and CUDA 11.1. If you use conda, you can set up the environment as follows:

conda create -n style_NER python==3.9
conda activate style_NER
conda install pytorch==1.9 cudatoolkit=11.1 -c pytorch

Then install the dependencies listed in requirements.txt:

pip install -r requirements.txt

Data

Please download the data from the following links: OntoNotes-5.0-NER-BIO and Temporal Twitter Corpus. We provide two toy datasets: one under the data/linearized_domain directory for cross-domain mapping experiments, and one under the data/ner directory for NER experiments. After downloading the data, you may need to preprocess it so that it has the same format as the toy datasets, and then put it under the corresponding directory.

Data pre-processing

For data pre-processing, we provide some functions in src/commons/preproc_domain.py and src/commons/preproc_ner.py. You can use them to convert the data to the JSON format for cross-domain mapping experiments.
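As an illustration of this kind of conversion, here is a minimal sketch that reads CoNLL-style lines and emits one JSON record per sentence. It is not the code in src/commons; the function names and the `tokens`/`labels` schema are assumptions made for this example. Check the toy datasets for the format the repository actually expects.

```python
import json


def read_conll(lines):
    """Group 'token label' lines into sentences; blank lines separate sentences."""
    sentences, tokens, labels = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            if tokens:
                sentences.append({"tokens": tokens, "labels": labels})
                tokens, labels = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: the word
        labels.append(parts[-1])  # last column: the BIO tag
    if tokens:  # flush the last sentence if the input has no trailing blank line
        sentences.append({"tokens": tokens, "labels": labels})
    return sentences


def to_json_lines(sentences):
    """Serialize each sentence as one JSON object per line (assumed schema)."""
    return "\n".join(json.dumps(s) for s in sentences)


sample = ["EU B-ORG", "rejects O", "", "Peter B-PER", "Blackburn I-PER"]
records = read_conll(sample)
print(to_json_lines(records))
```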

Data post-processing

After generating the data, you may want to use the code in src/commons/postproc_domain.py to convert the data from JSON to CoNLL format for named entity recognition experiments.
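The sketch below shows what such a conversion could look like. It assumes that a generated sentence is a flat token list in which each entity label appears immediately before the word it tags, as the data/linearized_domain directory name suggests. That layout, the label set, and the function names are assumptions for illustration; the actual logic lives in src/commons/postproc_domain.py.

```python
def delinearize(linearized_tokens, label_set):
    """Pair each word with the label token that directly precedes it, else 'O'."""
    pairs = []
    pending = "O"
    for tok in linearized_tokens:
        if tok in label_set:
            pending = tok  # remember the label for the next word
        else:
            pairs.append((tok, pending))
            pending = "O"
    return pairs


def to_conll(sentences):
    """Render (word, label) pairs as tab-separated lines, one blank line per sentence break."""
    blocks = ["\n".join(f"{word}\t{label}" for word, label in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


# Hypothetical label set and generated sentence, for illustration only.
labels = {"B-PER", "I-PER", "B-LOC", "I-LOC"}
generated = ["B-PER", "John", "lives", "in", "B-LOC", "London"]
pairs = delinearize(generated, labels)
print(to_conll([pairs]))
```

A label token with no following word (for example at the end of a truncated generation) is silently dropped in this sketch; a real post-processing step may want to log or filter such sentences instead.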

Running

Running this project involves two main stages:

1. Cross-domain mapping with a cross-domain autoencoder
2. Named entity recognition with a sequence labeling model

1. Cross-domain Mapping

Training

You can train a model from pre-defined config files in this repo with the following command:

CUDA_VISIBLE_DEVICES=[gpu_id] python src/exp_domain/main.py --config configs/exp_domain/cdar1.0-nw-sm.json

The code saves a model checkpoint after every epoch if the model improves (either lower loss or higher metric). You will notice that a directory is created using the experiment id (e.g. style_NER/checkpoints/cdar1.0-nw-sm/). You can resume training by running the same command.
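The "save only if the model improves" rule can be sketched as follows. This is one reading of "either lower loss or higher metric", not the repository's checkpointing code; the class name and the sample numbers are made up for illustration.

```python
class ImprovementTracker:
    """Remembers the best loss and the best metric seen so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_metric = float("-inf")

    def update(self, loss, metric):
        """Return True if this epoch beats the best loss or the best metric."""
        improved = loss < self.best_loss or metric > self.best_metric
        self.best_loss = min(self.best_loss, loss)
        self.best_metric = max(self.best_metric, metric)
        return improved


# Made-up (loss, metric) pairs for four epochs, for illustration only.
history = [(0.90, 0.50), (0.80, 0.48), (0.85, 0.47), (0.82, 0.55)]
tracker = ImprovementTracker()
saved_epochs = [epoch for epoch, (loss, metric) in enumerate(history)
                if tracker.update(loss, metric)]
print(saved_epochs)  # [0, 1, 3]
```

In a real training loop, the branch where `update` returns True is where a checkpoint would be written to the experiment directory.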

Two-phase training: our training algorithm has two phases. In the first phase, we train the model with only denoising reconstruction and domain classification. In the second phase, we train the model with denoising reconstruction, detransforming reconstruction, and domain classification together. To do this, set lambda_cross to 0 for the first phase and to 1 for the second phase in the config file:

    ...
    "lambda_coef":{
        "lambda_auto": 1.0,
        "lambda_adv": 10.0,
        "lambda_cross": 1.0
    }
    ...
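One way to manage the two phases is to derive both configs from a single base config, as in the sketch below. The helper names are hypothetical. The weighted sum in `weighted_loss`, and the pairing of each coefficient with a loss term (lambda_auto with denoising reconstruction, lambda_adv with domain classification, lambda_cross with detransforming reconstruction), are assumptions inferred from the key names, not taken from the repository.

```python
import copy


def make_phase_configs(base_config):
    """Return (phase1, phase2) copies that differ only in lambda_cross."""
    phase1 = copy.deepcopy(base_config)
    phase1["lambda_coef"]["lambda_cross"] = 0.0  # phase 1: no detransforming reconstruction
    phase2 = copy.deepcopy(base_config)
    phase2["lambda_coef"]["lambda_cross"] = 1.0  # phase 2: all three objectives
    return phase1, phase2


def weighted_loss(coef, loss_auto, loss_adv, loss_cross):
    """Assumed combination: a weighted sum of the three loss terms."""
    return (coef["lambda_auto"] * loss_auto
            + coef["lambda_adv"] * loss_adv
            + coef["lambda_cross"] * loss_cross)


base = {"lambda_coef": {"lambda_auto": 1.0, "lambda_adv": 10.0, "lambda_cross": 1.0}}
phase1, phase2 = make_phase_configs(base)

# Made-up loss values, to show that the cross term vanishes in phase 1.
print(weighted_loss(phase1["lambda_coef"], 0.5, 0.1, 2.0))
print(weighted_loss(phase2["lambda_coef"], 0.5, 0.1, 2.0))
```

Each phase config can then be written out with `json.dump` and passed to `--config`.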

Evaluate

To evaluate the model, use --mode eval (default: train):

CUDA_VISIBLE_DEVICES=[gpu_id] python src/exp_domain/main.py --config configs/exp_domain/cdar1.0-nw-sm.json --mode eval

Generation

To generate data with the trained model, use --mode generate (default: train):

CUDA_VISIBLE_DEVICES=[gpu_id] python src/exp_domain/main.py --config configs/exp_domain/cdar1.0-nw-sm.json --mode generate
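If you want to script the three modes, a small wrapper like the sketch below can build the commands shown above. The `build_command` helper is hypothetical; it uses only the `--config` and `--mode` flags and the `CUDA_VISIBLE_DEVICES` variable that appear in this README, and it assumes that passing `--mode train` explicitly is equivalent to the default.

```python
import os

VALID_MODES = ("train", "eval", "generate")


def build_command(mode, config_path, gpu_id):
    """Return (argv, env) for one run of the cross-domain mapping script."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # Copy the current environment and restrict the visible GPU.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    argv = ["python", "src/exp_domain/main.py",
            "--config", config_path,
            "--mode", mode]
    return argv, env


config = "configs/exp_domain/cdar1.0-nw-sm.json"
for mode in VALID_MODES:
    argv, env = build_command(mode, config, gpu_id=0)
    print(" ".join(argv))
    # To actually launch the run from the repository root:
    # subprocess.run(argv, env=env, check=True)
```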

2. Named Entity Recognition

We fine-tune a sequence labeling model (BERT + Linear) to evaluate our cross-domain mapping method. After generating the data, add the path of the generated data to the configuration file and run the code with the following command:

CUDA_VISIBLE_DEVICES=[gpu_id] python src/exp_ner/main.py --config configs/exp_ner/ner1.0-nw-sm.json

Citation

(Coming soon...)

Contact

Feel free to get in touch via email to [email protected].
