Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).

Overview

logo

Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER.

@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "{W}iki{NE}u{R}al: {C}ombined Neural and Knowledge-based Silver Data Creation for Multilingual {NER}",
    author = "Tedeschi, Simone  and
      Maiorca, Valentino  and
      Campolungo, Niccol{\`o}  and
      Cecconi, Francesco  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.215",
    pages = "2521--2533",
    abstract = "Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.",
}

Please consider citing our work if you use data and/or code from this repository.

In a nutshell, WikiNEuRal consists in a novel technique which builds upon a multilingual lexical knowledge base (i.e., BabelNet) and transformer-based architectures (i.e., BERT) to produce high-quality annotations for multilingual NER. It shows consistent improvements of up to 6 span-based F1-score points against state-of-the-art alternative data production methods on common benchmarks for NER. Moreover, in our paper we also present a new approach for creating interpretable word embeddings together with a Domain Adaptation algorithm, which enable WikiNEuRal to create domain-specific training corpora.

Data

Dataset Version Sentences Tokens PER ORG LOC MISC OTHER
WikiNEuRal EN 116k 2.73M 51k 31k 67k 45k 2.40M
WikiNEuRal ES 95k 2.33M 43k 17k 68k 25k 2.04M
WikiNEuRal NL 107k 1.91M 46k 22k 61k 24k 1.64M
WikiNEuRal DE 124k 2.19M 60k 32k 59k 25k 1.87M
WikiNEuRal RU 123k 2.39M 40k 26k 89k 25k 2.13M
WikiNEuRal IT 111k 2.99M 67k 22k 97k 26k 2.62M
WikiNEuRal FR 127k 3.24M 76k 25k 101k 29k 2.83M
WikiNEuRal PL 141k 2.29M 59k 34k 118k 22k 1.91M
WikiNEuRal PT 106k 2.53M 44k 17k 112k 25k 2.20M
WikiNEuRal EN DA (CoNLL) 29k 759k 12k 23k 6k 3k 0.54M
WikiNEuRal NL DA (CoNLL) 34k 598k 17k 8k 18k 6k 0.51M
WikiNEuRal DE DA (CoNLL) 41k 706k 17k 12k 23k 3k 0.61M
WikiNEuRal EN DA (OntoNotes) 48k 1.18M 20k 13k 38k 12k 1.02M

Further datasets, such as the combination of WikiNEuRal with gold-standard training data (i.e., CoNLL) or the gold-standard datasets themselves, can be obtained by simply concatenating the two train.conllu files together (e.g., data/conll/en/train.conllu and data/wikineural/en/train.conllu give CoNLL+WikiNEuRal).

How to use

  1. To train 10 models on CoNLL English, run:

    python run.py -m +train.seed_idx=0,1,2,3,4,5,6,7,8,9 data.datamodule.source=conll data.datamodule.language=en
    

    note: for the EN, ES, NL and DE versions of WikiNEuRal, you can use the CoNLL splits as validation and testing material (e.g., copy the data/conll/en/val.conllu into data/wikineural/en/). Similarly, for RU and PL you can use the BSNLP splits. For the other languages instead, you can use the scripts/create_splits.py script to split a given train.conllu file into train, dev and test sets.

  2. To produce results for the 10 trained models, run:

    bash test.sh
    

    test.sh also contains more complex bash for loops that can produce results on multiple datasets / models at once.

License

WikiNEuRal is licensed under the CC BY-SA-NC 4.0 license. The text of the license can be found here.

We underline that the source from which the raw sentences have been extracted is Wikipedia (wikipedia.org) and the NER annotations have been produced by Babelscape.

Acknowledgments

We gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon2020 research and innovation programme (http://mousse-project.org/).

This work was also supported by the PerLIR project (Personal Linguistic resources in Information Retrieval) funded by the MIUR Progetti di ricerca di Rilevante Interesse Nazionale programme (PRIN2017).

The code in this repository is built on top of .

Owner
Babelscape
Babelscape is a deep tech company founded in 2016 focused on multilingual Natural Language Processing.
Babelscape
lightweight, fast and robust columnar dataframe for data analytics with online update

streamdf Streamdf is a lightweight data frame library built on top of the dictionary of numpy array, developed for Kaggle's time-series code competiti

23 May 19, 2022
Installation, test and evaluation of Scribosermo speech-to-text engine

Scribosermo STT Setup Scribosermo is a LGPL licensed, open-source speech recognition engine to "Train fast Speech-to-Text networks in different langua

Florian Quirin 3 Jun 20, 2022
A Structured Self-attentive Sentence Embedding

Structured Self-attentive sentence embeddings Implementation for the paper A Structured Self-Attentive Sentence Embedding, which was published in ICLR

Kaushal Shetty 488 Nov 28, 2022
This is a general repo that helps you develop fast/effective NLP classifiers using Huggingface

NLP Classifier Introduction This project trains a bert model on any NLP classifcation model. And uses the model in make predictions on new data using

Abdullah Tarek 3 Mar 11, 2022
BERT-based Financial Question Answering System

BERT-based Financial Question Answering System In this example, we use Jina, PyTorch, and Hugging Face transformers to build a production-ready BERT-b

Bithiah Yuan 61 Sep 18, 2022
Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG)

Indobenchmark Toolkit Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources fo

Samuel Cahyawijaya 11 Aug 26, 2022
Korean Simple Contrastive Learning of Sentence Embeddings using SKT KoBERT and kakaobrain KorNLU dataset

KoSimCSE Korean Simple Contrastive Learning of Sentence Embeddings implementation using pytorch SimCSE Installation git clone https://github.com/BM-K/

34 Nov 24, 2022
Toward Model Interpretability in Medical NLP

Toward Model Interpretability in Medical NLP LING380: Topics in Computational Linguistics Final Project James Cross ( 1 Mar 04, 2022

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"

Pattern-Exploiting Training (PET) This repository contains the code for Exploiting Cloze Questions for Few-Shot Text Classification and Natural Langua

Timo Schick 1.4k Dec 30, 2022
Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation.

Covid-19-BOT Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation. This bot uses torc

Neeraj Majhi 2 Nov 05, 2021
✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

✨A Python framework to explore, label, and monitor data for NLP projects

Recognai 1.5k Jan 02, 2023
xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building blocks.

Description xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building bl

Facebook Research 2.3k Jan 08, 2023
A BERT-based reverse dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : 모델링 / 데이터 수집 / 프로젝트 설계 / back-end 김종윤 : 데이터 수집 / 프로젝트 설계 / front-end / back-end 임용

94 Dec 08, 2022
Text-to-Speech for Belarusian language

title emoji colorFrom colorTo sdk app_file pinned Belarusian TTS 🐸 green green gradio app.py false Belarusian TTS 📢 🤖 Belarusian TTS (text-to-speec

Yurii Paniv 1 Nov 27, 2021
RIDE automatically creates the package and boilerplate OOP Python node scripts as per your needs

RIDE: ROS IDE RIDE automatically creates the package and boilerplate OOP Python code for nodes as per your needs (RIDE is not an IDE, but even ROS isn

Jash Mota 20 Jul 14, 2022
Kerberoast with ACL abuse capabilities

targetedKerberoast targetedKerberoast is a Python script that can, like many others (e.g. GetUserSPNs.py), print "kerberoast" hashes for user accounts

Shutdown 213 Dec 22, 2022
Задания КЕГЭ по информатике 2021 на Python

КЕГЭ 2021 на Python В этом репозитории мои решения типовых заданий КЕГЭ по информатике в 2021 году, БЕСПЛАТНО! Задания Взяты с https://inf-ege.sdamgia

8 Oct 13, 2022
MEDIALpy: MEDIcal Abbreviations Lookup in Python

A small python package that allows the user to look up common medical abbreviations.

Aberystwyth Systems Biology 7 Nov 09, 2022
A high-level Python library for Quantum Natural Language Processing

lambeq About lambeq is a toolkit for quantum natural language processing (QNLP). Documentation: https://cqcl.github.io/lambeq/ Getting started Prerequ

Cambridge Quantum 315 Jan 01, 2023