A Unified Generative Framework for Various NER Subtasks.

Overview

This is the code for the ACL-IJCNLP 2021 paper A Unified Generative Framework for Various NER Subtasks.

Install the packages in requirements.txt, then use the following commands to install two other packages:

pip install git+https://github.com/fastnlp/fastNLP@dev
pip install git+https://github.com/fastnlp/fitlog

You need to put your data in a folder parallel to this repo, like the following:

    - BARTNER/
        - train.py
        ...
    - data/
        - conll2003
            - train.txt
            - test.txt
            - dev.txt
        - en-ontonotes
            - ...
        - Share_2013
        - Share_2014
        - CADEC
        - en_ace04
        - en_ace05
        - genia

For conll2003 and en-ontonotes, the data in each split should look like the following (the first column is the word, the second column is the tag; we assume BIO tagging). A parsing sketch follows the example.

LONDON B-LOC
1996-08-30 O

West B-MISC
Indian I-MISC
all-rounder O
Phil B-PER
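
For reference, here is a minimal sketch (not part of this repo) of how such a two-column BIO file can be parsed. The helper name read_conll is ours, and we assume blank lines separate sentences:

def read_conll(path):
    """Read a two-column BIO file into (words, tags) sentence pairs."""
    sentences, words, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # blank line marks a sentence boundary
                if words:
                    sentences.append((words, tags))
                    words, tags = [], []
                continue
            word, tag = line.split()
            words.append(word)
            tags.append(tag)
    if words:  # flush the last sentence if the file lacks a trailing blank line
        sentences.append((words, tags))
    return sentences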

For the nested datasets en_ace04, en_ace05 and genia, the data should look like the following (each line is a JSON line containing ners and sentences keys). A reading sketch follows the example.

{"ners": [[[16, 16, "DNA"], [4, 8, "DNA"], [24, 26, "DNA"], [19, 20, "DNA"]], [[31, 31, "DNA"], [2, 2, "DNA"], [4, 4, "DNA"], [30, 31, "DNA"]], [[23, 24, "RNA"], [14, 15, "cell_type"], [1, 2, "RNA"]], [[2, 2, "DNA"]], [], [[0, 0, "DNA"], [9, 9, "cell_type"]]], "sentences": [["There", "is", "a", "single", "methionine", "codon-initiated", "open", "reading", "frame", "of", "1,458", "nt", "in", "frame", "with", "a", "homeobox", "and", "a", "CAX", "repeat", ",", "and", "the", "open", "reading", "frame", "is", "predicted", "to", "encode", "a", "protein", "of", "51,659", "daltons."], ["When", "the", "homeodomain", "from", "HB24", "was", "compared", "to", "known", "mammalian", "and", "Drosophila", "homeodomains", "it", "was", "found", "to", "be", "only", "moderately", "conserved,", "but", "when", "it", "was", "compared", "to", "a", "highly", "diverged", "Drosophila", "homeodomain", ",", "H2.0,", "it", "was", "found", "to", "be", "80%", "identical."], ["The", "HB24", "mRNA", "was", "absent", "or", "present", "at", "low", "levels", "in", "normal", "B", "and", "T", "lymphocytes", ";", "however,", "with", "the", "appropriate", "activation", "signal", "HB24", "mRNA", "was", "induced", "within", "several", "hours", "even", "in", "the", "presence", "of", "cycloheximide", "."], ["Characterization", "of", "HB24", "expression", "in", "lymphoid", "and", "select", "developing", "tissues", "was", "performed", "by", "in", "situ", "hybridization", "."], ["Positive", "hybridization", "was", "found", "in", "thymus", ",", "tonsil", ",", "bone", "marrow", ",", "developing", "vessels", ",", "and", "in", "fetal", "brain", "."], ["HB24", "is", "likely", "to", "have", "an", "important", "role", "in", "lymphocytes", "as", "well", "as", "in", "certain", "developing", "tissues", "."]]}
{"ners": [[[16, 16, "DNA"], [4, 8, "DNA"], [24, 26, "DNA"], [19, 20, "DNA"]], [[31, 31, "DNA"], [2, 2, "DNA"], [4, 4, "DNA"], [30, 31, "DNA"]], [[23, 24, "RNA"], [14, 15, "cell_type"], [1, 2, "RNA"]], [[2, 2, "DNA"]], [], [[0, 0, "DNA"], [9, 9, "cell_type"]]], "sentences": [["There", "is", "a", "single", "methionine", "codon-initiated", "open", "reading", "frame", "of", "1,458", "nt", "in", "frame", "with", "a", "homeobox", "and", "a", "CAX", "repeat", ",", "and", "the", "open", "reading", "frame", "is", "predicted", "to", "encode", "a", "protein", "of", "51,659", "daltons."], ["When", "the", "homeodomain", "from", "HB24", "was", "compared", "to", "known", "mammalian", "and", "Drosophila", "homeodomains", "it", "was", "found", "to", "be", "only", "moderately", "conserved,", "but", "when", "it", "was", "compared", "to", "a", "highly", "diverged", "Drosophila", "homeodomain", ",", "H2.0,", "it", "was", "found", "to", "be", "80%", "identical."], ["The", "HB24", "mRNA", "was", "absent", "or", "present", "at", "low", "levels", "in", "normal", "B", "and", "T", "lymphocytes", ";", "however,", "with", "the", "appropriate", "activation", "signal", "HB24", "mRNA", "was", "induced", "within", "several", "hours", "even", "in", "the", "presence", "of", "cycloheximide", "."], ["Characterization", "of", "HB24", "expression", "in", "lymphoid", "and", "select", "developing", "tissues", "was", "performed", "by", "in", "situ", "hybridization", "."], ["Positive", "hybridization", "was", "found", "in", "thymus", ",", "tonsil", ",", "bone", "marrow", ",", "developing", "vessels", ",", "and", "in", "fetal", "brain", "."], ["HB24", "is", "likely", "to", "have", "an", "important", "role", "in", "lymphocytes", "as", "well", "as", "in", "certain", "developing", "tissues", "."]]}
...
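
A minimal sketch (not part of this repo) of reading this JSON-lines format. The file path is a placeholder, and we assume the end index is inclusive, as single-token spans like [16, 16, "DNA"] suggest:

import json

with open("data/genia/train.jsonlines", encoding="utf-8") as f:  # placeholder path
    for line in f:
        sample = json.loads(line)
        # each sentence has a parallel list of [start, end, label] triples
        for sent, ners in zip(sample["sentences"], sample["ners"]):
            for start, end, label in ners:
                print(label, " ".join(sent[start:end + 1]))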

For the discontinuous datasets Share_2013, Share_2014 and CADEC, the data should look like the following (each sample takes two lines; an empty second line means the sample has no entities). A parsing sketch follows the examples.

Abdominal cramps , flatulence , gas , bloating .
0,1 ADR|3,3 ADR|7,7 ADR|5,5 ADR

Cramps would start within 15 minutes of taking pill , even during meals .
0,0 ADR

...
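
A minimal sketch (not part of this repo) of parsing this two-line format. The layout of the annotation line ("start,end LABEL" chunks joined by "|", with possibly several index pairs per chunk for discontinuous entities) is inferred from the examples above:

def read_discontinuous(path):
    """Yield (tokens, entities); each entity is (index list, label)."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    i = 0
    while i < len(lines):
        if not lines[i]:  # skip blank lines between samples
            i += 1
            continue
        tokens = lines[i].split()
        entities = []
        ann = lines[i + 1] if i + 1 < len(lines) else ""
        if ann:  # an empty annotation line means no entities
            for chunk in ann.split("|"):
                parts = chunk.split()
                label = parts[-1]
                # flatten "start,end" pairs; discontinuous entities have several
                indices = [int(x) for seg in parts[:-1] for x in seg.split(",")]
                entities.append((indices, label))
        yield tokens, entities
        i += 2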

We use code from https://github.com/daixiangau/acl2020-transition-discontinuous-ner to pre-process the data.

You can run the code directly with

python train.py

You should see output like the following:

Save cache to caches/data_facebook/bart-large_conll2003_word.pt.                                                                                                        
max_len_a:0.6, max_len:10
In total 3 datasets:
        test has 3453 instances.
        train has 14041 instances.
        dev has 3250 instances.

The number of tokens in tokenizer  50265
50269 50274
input fields after batch(if batch size is 2):
        tgt_tokens: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 8]) 
        src_tokens: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 11]) 
        first: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 11]) 
        src_seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
        tgt_seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
target fields after batch(if batch size is 2):
        entities: (1)type:numpy.ndarray (2)dtype:object, (3)shape:(2,) 
        tgt_tokens: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 8]) 
        target_span: (1)type:numpy.ndarray (2)dtype:object, (3)shape:(2,) 
        tgt_seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 

training epochs started 2021-06-02-11-49-26-964889
Epoch 1/30:   0%|                                                         | 15/32430 [00:06<3:12:37,  2.80it/s, loss:6.96158

Some important Python files are listed below:

- BartNER
  - data
     - pipe.py # load and process data
  - model
     - bart.py # the model file
  - train.py  # the training file

The different Loaders in data/pipe.py load the data, and the data.BartNERPipe class processes it. A Loader should read data into a DataBundle object; you can mimic the provided Loaders to write your own. As long as your dataset has the following four fields, BartNERPipe should be able to process it (a minimal loader sketch follows the list):

- raw_words  # List[str]
    # ['AL-AIN', ',', 'United', 'Arab', 'Emirates', '1996-12-06']
- entities  # List[List[str]]
    # [['AL-AIN'], ['United', 'Arab', 'Emirates']]
- entity_tags  # List[str], the same length as entities
    # ['loc', 'loc']
- entity_spans # List[List[int]], each inner list must have an even number of ints, giving the start (inclusive) and end (exclusive) of each entity segment
    # [[0, 1], [2, 5]] or, for discontinuous NER, [[0, 1, 5, 7], [2, 3, 5, 7], ...]
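
For illustration, a minimal loader sketch (not part of this repo) that builds such an instance; we assume fastNLP's DataSet / Instance / DataBundle API and reuse the field values above:

from fastNLP import DataSet, Instance
from fastNLP.io import DataBundle

ds = DataSet()
ds.append(Instance(
    raw_words=['AL-AIN', ',', 'United', 'Arab', 'Emirates', '1996-12-06'],
    entities=[['AL-AIN'], ['United', 'Arab', 'Emirates']],
    entity_tags=['loc', 'loc'],
    entity_spans=[[0, 1], [2, 5]],  # start inclusive, end exclusive
))
data_bundle = DataBundle(datasets={'train': ds})  # what a Loader should return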

To help you reproduce the results, we have hardcoded the hyper-parameters for each dataset in the code; you can change them as needed. We conducted all experiments on an NVIDIA 3090 (24GB memory). Some known difficulties with reproducing this code: (1) on some datasets (nested and discontinuous), the F1 will drop to 0 or near 0 during training; please discard these runs; (2) randomness causes large performance variance on some datasets; please run multiple times.

We deeply understand how frustrating it can be when results are hard to reproduce. We tried our best to make sure the results were at least reproducible on our equipment (we usually average over at least five runs).
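
For example, a small sketch of how multiple runs can be aggregated (the F1 values are placeholders, not real results):

from statistics import mean, stdev

f1_runs = [0.9312, 0.9287, 0.9330, 0.9295, 0.9301]  # placeholder scores from five runs
print(f"F1: {mean(f1_runs):.4f} +/- {stdev(f1_runs):.4f} over {len(f1_runs)} runs")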
