“Data Augmentation for Cross-Domain Named Entity Recognition” (EMNLP 2021)

Overview

Data Augmentation for Cross-Domain Named Entity Recognition

Authors: Shuguang Chen, Gustavo Aguilar, Leonardo Neves and Thamar Solorio

License: MIT

This repository contains the implementations of the system described in the paper "Data Augmentation for Cross-Domain Named Entity Recognition" at EMNLP 2021 conference.

The main contribution of this paper is a novel neural architecture that can learn the textual patterns and effectively transform the text from a high-resource to a low-resource domain. Please refer to the paper for details.

Installation

We have updated the code to work with Python 3.9, Pytorch 1.9, and CUDA 11.1. If you use conda, you can set up the environment as follows:

conda create -n style_NER python==3.9
conda activate style_NER
conda install pytorch==1.9 cudatoolkit=11.1 -c pytorch

Also, install the dependencies specified in the requirements.txt:

pip install -r requirements.txt

Data

Please download the data with the following links: OntoNotes-5.0-NER-BIO and Temporal Twitter Corpus. We provide two toy datasets under the data/linearized_domain dictory for cross-domain mapping experiments and data/ner directory for NER experiments. After downloading the data with the links above, you may need to preprocess it so that it can have the same format as toy datasets and put them under the corresponding directory.

Data pre-processing

For data pre-processing, we provide some functions under the src/commons/preproc_domain.py and src/commons/preproc_ner.py directory. You can use them to convert the data to the json format for cross-domain mapping experiments.

Data post-processing

After generating the data, you may want to use the code under the src/commons/postproc_domain.py directory to convert the data from json to CoNLL format for named entity recognition experiments.

Running

There are two main stages to run this project.

  1. Cross-domain mapping with cross-domain autoencoder
  2. Named entity recognition with sequencel labeling model

1. Cross-domain Mapping

Training

You can train a model from pre-defined config files in this repo with the following command:

CUDA_VISIBLE_DEVICES=[gpu_id] python src/exp_domain/main.py --config configs/exp_domain/cdar1.0-nw-sm.json

The code saves a model checkpoint after every epoch if the model improves (either lower loss or higher metric). You will notice that a directory is created using the experiment id (e.g. style_NER/checkpoints/cdar1.0-nw-sm/). You can resume training by running the same command.

Two phases training: our training algorithm includes two phases: 1) in the first phase, we train the model with only denoising reconstruction and domain classification, and 2) in the second phase, we train the model together with denoising reconstruction, detransforming reconstruction, and the domain classification. To do this, you can simply set lambda_cross as 0 for the first phase and 1 for the second phase in the config file.

    ...
    "lambda_coef":{
        "lambda_auto": 1.0,
        "lambda_adv": 10.0,
        "lambda_cross": 1.0
    }
    ...
Evaluate

To evaluate the model, use --mode eval (default: train):

CUDA_VISIBLE_DEVICES=[gpu_id] python src/exp_domain/main.py --config configs/exp_domain/cdar1.0-nw-sm.json --mode eval
Generation

To evaluate the model, use --mode generate (default: train):

CUDA_VISIBLE_DEVICES=[gpu_id] python src/exp_domain/main.py --config configs/exp_domain/cdar1.0-nw-sm.json --mode generate

2. Named Entity Recognition

We fine-tune a sequence labeling model (BERT + Linear) to evaluate our cross-domain mapping method. After generating the data, you can add the path of the generated data into the configuration file and run the code with the following command:

CUDA_VISIBLE_DEVICES=[gpu_id] python src/exp_ner/main.py --config configs/exp_ner/ner1.0-nw-sm.json

Citation

(Comming soon...)

Contact

Feel free to get in touch via email to [email protected].

Owner
<a href=[email protected]">
A series of Python scripts to access measurements from Fluke 28X meters. Fluke IR Remote Interface required.

Fluke289_data_access A series of Python scripts to access measurements from Fluke 28X meters. Fluke IR Remote Interface required. Created from informa

3 Dec 08, 2022
Measures input lag without dedicated hardware, performing motion detection on recorded or live video

What is InputLagTimer? This tool can measure input lag by analyzing a video where both the game controller and the game screen can be seen on a webcam

Bruno Gonzalez 4 Aug 18, 2022
Some tentative models that incorporate label propagation to graph neural networks for graph representation learning in nodes, links or graphs.

Some tentative models that incorporate label propagation to graph neural networks for graph representation learning in nodes, links or graphs.

zshicode 1 Nov 18, 2021
training script for space time memory network

Trainig Script for Space Time Memory Network This codebase implemented training code for Space Time Memory Network with some cyclic features. Requirem

Yuxi Li 100 Dec 20, 2022
Official implementation of "OpenPifPaf: Composite Fields for Semantic Keypoint Detection and Spatio-Temporal Association" in PyTorch.

openpifpaf Continuously tested on Linux, MacOS and Windows: New 2021 paper: OpenPifPaf: Composite Fields for Semantic Keypoint Detection and Spatio-Te

VITA lab at EPFL 50 Dec 29, 2022
image scene graph generation benchmark

Scene Graph Benchmark in PyTorch 1.7 This project is based on maskrcnn-benchmark Highlights Upgrad to pytorch 1.7 Multi-GPU training and inference Bat

Microsoft 303 Dec 27, 2022
✂️ EyeLipCropper is a Python tool to crop eyes and mouth ROIs of the given video.

EyeLipCropper EyeLipCropper is a Python tool to crop eyes and mouth ROIs of the given video. The whole process consists of three parts: frame extracti

Zi-Han Liu 9 Oct 25, 2022
Official implementation of NLOS-OT: Passive Non-Line-of-Sight Imaging Using Optimal Transport (IEEE TIP, accepted)

NLOS-OT Official implementation of NLOS-OT: Passive Non-Line-of-Sight Imaging Using Optimal Transport (IEEE TIP, accepted) Description In this reposit

Ruixu Geng(耿瑞旭) 16 Dec 16, 2022
Net2net - Network-to-Network Translation with Conditional Invertible Neural Networks

Net2Net Code accompanying the NeurIPS 2020 oral paper Network-to-Network Translation with Conditional Invertible Neural Networks Robin Rombach*, Patri

CompVis Heidelberg 206 Dec 20, 2022
Anatomy of Matplotlib -- tutorial developed for the SciPy conference

Introduction This tutorial is a complete re-imagining of how one should teach users the matplotlib library. Hopefully, this tutorial may serve as insp

Matplotlib Developers 1.1k Dec 29, 2022
CC-GENERATOR - A python script for generating CC

CC-GENERATOR A python script for generating CC NOTE: This tool is for Educationa

Lêkzï 6 Oct 14, 2022
Scale-aware Automatic Augmentation for Object Detection (CVPR 2021)

SA-AutoAug Scale-aware Automatic Augmentation for Object Detection Yukang Chen, Yanwei Li, Tao Kong, Lu Qi, Ruihang Chu, Lei Li, Jiaya Jia [Paper] [Bi

DV Lab 182 Dec 29, 2022
Official Pytorch Implementation of: "Semantic Diversity Learning for Zero-Shot Multi-label Classification"(2021) paper

Semantic Diversity Learning for Zero-Shot Multi-label Classification Paper Official PyTorch Implementation Avi Ben-Cohen, Nadav Zamir, Emanuel Ben Bar

28 Aug 29, 2022
WebUAV-3M: A Benchmark Unveiling the Power of Million-Scale Deep UAV Tracking

WebUAV-3M: A Benchmark Unveiling the Power of Million-Scale Deep UAV Tracking [Paper Link] Abstract In this work, we contribute a new million-scale Un

25 Jan 01, 2023
This is the repository of shape matching algorithm Iterative Rotations and Assignments (IRA)

Description This is the repository of shape matching algorithm Iterative Rotations and Assignments (IRA), described in the publication [1]. Directory

MAMMASMIAS Consortium 6 Nov 14, 2022
Springer Link Download Module for Python

♞ pupalink A simple Python module to search and download books from SpringerLink. 🧪 This project is still in an early stage of development. Expect br

Pupa Corp. 18 Nov 21, 2022
Python Library for Signal/Image Data Analysis with Transport Methods

PyTransKit Python Transport Based Signal Processing Toolkit Website and documentation: https://pytranskit.readthedocs.io/ Installation The library cou

24 Dec 23, 2022
Contenido del curso Bases de datos del DCC PUC versión 2021-2

IIC2413 - Bases de Datos Tabla de contenidos Equipo Profesores Ayudantes Contenidos Calendario Evaluaciones Resumen de notas Foro Política de integrid

54 Nov 23, 2022
3D-aware GANs based on NeRF (arXiv).

CIPS-3D This repository will contain the code of the paper, CIPS-3D: A 3D-Aware Generator of GANs Based on Conditionally-Independent Pixel Synthesis.

Peterou 563 Dec 31, 2022
Disentangled Face Attribute Editing via Instance-Aware Latent Space Search, accepted by IJCAI 2021.

Instance-Aware Latent-Space Search This is a PyTorch implementation of the following paper: Disentangled Face Attribute Editing via Instance-Aware Lat

67 Dec 21, 2022