KIND: an Italian Multi-Domain Dataset for Named Entity Recognition

Last update: Jun 21, 2022

Overview

KIND (Kessler Italian Named-entities Dataset)

KIND is an Italian dataset for Named-Entity Recognition.

It contains more than one million tokens annotated with three entity classes: persons, locations, and organizations. Most of the dataset (around 600K tokens) carries manual gold annotations and spans three domains: news, literature, and political discourse.
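As an illustration of how annotations over these three classes can be consumed, the sketch below groups token-level tags into entity spans. The BIO tag scheme (`B-PER`, `I-PER`, `B-LOC`, ...), the `extract_entities` helper, and the sample sentence are assumptions made for this example; the actual file format of the release is not described in this section.

```python
from typing import Iterable, List, Optional, Tuple

# Assumed label set: BIO prefixes over the three classes named above.
ENTITY_CLASSES = {"PER", "LOC", "ORG"}


def extract_entities(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group (token, tag) pairs into (entity_text, entity_class) spans."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_class: Optional[str] = None

    def flush() -> None:
        # Close the span being built, if any, and reset the state.
        nonlocal current_tokens, current_class
        if current_tokens and current_class is not None:
            entities.append((" ".join(current_tokens), current_class))
        current_tokens, current_class = [], None

    for token, tag in rows:
        prefix, _, label = tag.partition("-")
        if prefix == "B" and label in ENTITY_CLASSES:
            flush()  # a new entity starts here
            current_tokens, current_class = [token], label
        elif prefix == "I" and label == current_class:
            current_tokens.append(token)  # continue the open entity
        else:
            flush()  # "O" tag or inconsistent continuation
    flush()
    return entities


# Hypothetical sample sentence, not taken from the dataset.
sample = [
    ("Aldo", "B-PER"), ("Moro", "I-PER"), ("nacque", "O"), ("a", "O"),
    ("Maglie", "B-LOC"), (".", "O"),
]
print(extract_entities(sample))
```

A stray `I-` tag with no open entity of the same class is dropped rather than promoted to a new span; a different policy may suit other use cases.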

To build the dataset, we chose texts that are freely available under a license permitting both research and commercial use.

In particular, we release four chapters, with texts taken from: (i) Wikinews (WN), as a source of news texts from the last few decades; (ii) Italian fiction books (FIC) whose authors died more than 70 years ago; (iii) writings and speeches by the Italian politician Aldo Moro (AM); and (iv) writings and speeches by the Italian politician Alcide De Gasperi (ADG).

Wikinews

Wikinews is a free, multilingual project of collaborative journalism. The Italian chapter contains more than 11,000 news articles, released under the Creative Commons Attribution 2.5 License.

In building KIND, we randomly select 1,000 articles, evenly distributed over the last 20 years, for a total of 308,622 tokens.
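One way to obtain a random sample that is evenly distributed over time is stratified sampling by year. The sketch below is only an assumption about how such a selection could be done, not the procedure actually used for KIND; the `sample_evenly_by_year` function, the year range, and the placeholder article identifiers are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Article = Tuple[str, int]  # (article identifier, publication year)


def sample_evenly_by_year(
    articles: List[Article], total: int, seed: int = 0
) -> List[Article]:
    """Draw up to `total` articles, splitting the quota evenly across years."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    by_year: Dict[int, List[Article]] = defaultdict(list)
    for article in articles:
        by_year[article[1]].append(article)

    years = sorted(by_year)
    base, extra = divmod(total, len(years))
    chosen: List[Article] = []
    for i, year in enumerate(years):
        # The first `extra` years take one more article when the
        # total does not divide evenly.
        quota = base + (1 if i < extra else 0)
        pool = by_year[year]
        chosen.extend(rng.sample(pool, min(quota, len(pool))))
    return chosen


# Toy pool: 20 hypothetical years with 100 placeholder articles each.
pool = [(f"article-{y}-{i}", y) for y in range(2002, 2022) for i in range(100)]
picked = sample_evenly_by_year(pool, total=1000)
print(len(picked))
```

If a year holds fewer articles than its quota, this sketch returns fewer than `total` items instead of redistributing the shortfall to other years.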

Literature

Regarding fiction literature, we annotate 86 book chapters taken from 10 books written by Italian authors, who all died more than 70 years ago, for a total of 192,448 tokens. The plain texts are taken from the Liber Liber website.

In particular, we choose: Il giorno delle Mésules (Ettore Castiglioni, 12,853 tokens), L'amante di Cesare (Augusto De Angelis, 13,464 tokens), Canne al vento (Grazia Deledda, 13,945 tokens), 1861-1911 - Cinquant’anni di vita nazionale ricordati ai fanciulli (Guido Fabiani, 10,801 tokens), Lettere dal carcere (Antonio Gramsci, 10,655 tokens), Anarchismo e democrazia (Errico Malatesta, 11,557 tokens), L'amore negato (Maria Messina, 31,115 tokens), La luna e i falò (Cesare Pavese, 10,705 tokens), La coscienza di Zeno (Italo Svevo, 56,364 tokens), Le cose più grandi di lui (Luciano Zuccoli, 20,989 tokens).
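As a quick consistency check, the per-book token counts listed above can be summed and compared with the 192,448 tokens reported for the literature chapter. The dictionary below simply transcribes the figures from the list; the variable names are illustrative.

```python
# Token counts per book, transcribed from the list above.
book_tokens = {
    "Il giorno delle Mésules": 12_853,
    "L'amante di Cesare": 13_464,
    "Canne al vento": 13_945,
    "1861-1911 - Cinquant'anni di vita nazionale ricordati ai fanciulli": 10_801,
    "Lettere dal carcere": 10_655,
    "Anarchismo e democrazia": 11_557,
    "L'amore negato": 31_115,
    "La luna e i falò": 10_705,
    "La coscienza di Zeno": 56_364,
    "Le cose più grandi di lui": 20_989,
}

total = sum(book_tokens.values())
print(total)  # 192448, matching the total reported for the literature chapter

# Share of the chapter contributed by each book, largest first.
for title, count in sorted(book_tokens.items(), key=lambda kv: -kv[1]):
    print(f"{title}: {count / total:.1%}")
```

The breakdown shows that the ten books contribute very unevenly, with La coscienza di Zeno alone accounting for more than a quarter of the literature tokens.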

In selecting works that are out of copyright, we favored texts that are as recent as possible: their language is more likely to resemble that of contemporary novels, so a model trained on this data should transfer better to fiction written in recent years.

Aldo Moro's Works

Aldo Moro's writings have recently been collected by the University of Bologna and published on a platform called Edizione Nazionale delle Opere di Aldo Moro.

The project is still ongoing and currently contains 806 documents, for a total of about one million tokens.

In the first release of KIND, we include 392,604 tokens from Aldo Moro's works, with silver annotations (see the reference below).

Alcide De Gasperi's Writings

Finally, we annotate 158 documents (150,632 tokens) from Alcide Digitale, spanning 50 years of European history.

The complete corpus contains a comprehensive collection of Alcide De Gasperi’s public documents, 2,762 in total, written or transcribed between 1901 and 1954.
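The token counts given for the four chapters can be combined to check the overall figures quoted in the overview. The sketch below assumes that the Wikinews, literature, and De Gasperi chapters are the gold-annotated ones, which the text implies but does not state outright; only the Aldo Moro chapter is explicitly described as silver.

```python
# (token count, annotation type) per chapter; the "gold" labels are an
# assumption inferred from the text, "silver" is stated for Aldo Moro.
chapters = {
    "WN": (308_622, "gold"),
    "FIC": (192_448, "gold"),
    "AM": (392_604, "silver"),
    "ADG": (150_632, "gold"),
}

gold_total = sum(n for n, kind in chapters.values() if kind == "gold")
silver_total = sum(n for n, kind in chapters.values() if kind == "silver")
overall = gold_total + silver_total

print(gold_total)    # 651702, in line with "around 600K" gold tokens
print(silver_total)  # 392604
print(overall)       # 1044306, i.e. more than one million tokens
```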

License

The NER annotations in (i), (ii), and (iii) are released under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Annotations from Alcide De Gasperi's writings are released under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Owner

Digital Humanities
