Negative sampling for solving the unlabeled entity problem in NER. ICLR-2021 paper: Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition.

Last update: Dec 29, 2022

Overview

Negative Sampling for NER

The unlabeled entity problem is prevalent in many NER scenarios (e.g., weakly supervised NER). Our ICLR 2021 paper proposes negative sampling to address this issue. This repository contains the implementation of our approach.

Note that this is not an officially supported Tencent product.

Preparation

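Preparation takes two steps. First, reformulate the NER data and move it into a new folder named "dataset". The folder contains train.json, dev.json, and test.json. Each JSON file is a list of dicts. In each dict, "sentence" holds the token list and "labeled entities" holds (start, end, label) triples with 0-based, inclusive token indices; both values are stored as strings. See the following case: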

[
 {
  "sentence": "['Somerset', '83', 'and', '174', '(', 'P.', 'Simmons', '4-38', ')', ',', 'Leicestershire', '296', '.']",
  "labeled entities": "[(0, 0, 'ORG'), (5, 6, 'PER'), (10, 10, 'ORG')]"
 },
 {
  "sentence": "['Leicestershire', '22', 'points', ',', 'Somerset', '4', '.']",
  "labeled entities": "[(0, 0, 'ORG'), (4, 4, 'ORG')]"
 }
]
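As an illustration of the reformulation step, the sketch below converts a BIO-tagged sentence into one record of the format shown above and parses it back. It assumes the source data uses BIO tags, which the original text does not state, and the helper names (`bio_to_spans`, `to_record`, `parse_record`) are hypothetical and not part of this repo. Parsing the string fields with `ast.literal_eval` is one plausible way to read them, not necessarily what `main.py` does.

```python
import ast
import json


def bio_to_spans(tags):
    """Convert BIO tags to (start, end, label) spans with inclusive token indices."""
    spans, start, label = [], None, None
    # A trailing "O" sentinel closes an entity that ends on the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i - 1, label))
            start, label = None, None
        # "B-" always opens a span; a stray "I-" with no open span also opens one.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def to_record(tokens, tags):
    """Build one dict; both fields are strings holding Python literals, as in the example."""
    return {
        "sentence": str(list(tokens)),
        "labeled entities": str(bio_to_spans(tags)),
    }


def parse_record(record):
    """Recover the token list and entity triples from one dict (assumed parsing strategy)."""
    tokens = ast.literal_eval(record["sentence"])
    entities = ast.literal_eval(record["labeled entities"])
    return tokens, entities


tokens = ["Leicestershire", "22", "points", ",", "Somerset", "4", "."]
tags = ["B-ORG", "O", "O", "O", "B-ORG", "O", "O"]
record = to_record(tokens, tags)
print(json.dumps([record], indent=1))
```

A list of such records can be written to dataset/train.json (and likewise dev.json and test.json) with `json.dump`.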

Second, prepare the pretrained language model (i.e., BERT) and the evaluation script. Create a directory named "resource" and arrange them as follows:

resource
- bert-base-cased
  - model.pt
  - vocab.txt
- conlleval.pl

Note that the files in BERT.tar.gz need to be renamed as above.
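Before training, it may help to confirm that the layout matches the tree above. The following is a minimal sketch using only the standard library; the function name `missing_resources` is hypothetical and not part of this repo, and the check only tests that the files exist, not that their contents are valid.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_FILES = [
    "bert-base-cased/model.pt",
    "bert-base-cased/vocab.txt",
    "conlleval.pl",
]


def missing_resources(resource_dir="resource"):
    """Return the expected files that are absent under resource_dir."""
    root = Path(resource_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_resources("resource")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Resource directory looks complete.")
```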

Training and Test

CUDA_VISIBLE_DEVICES=0 python main.py -dd dataset -cd save -rd resource

Citation

@inproceedings{li2021empirical,
    title={Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition},
    author={Yangming Li and Lemao Liu and Shuming Shi},
    booktitle={International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=5jRVa89sZk}
}

Owner

Yangming Li
