Klexikon: A German Dataset for Joint Summarization and Simplification

Overview

Klexikon: A German Dataset for Joint Summarization and Simplification

Dennis Aumiller and Michael Gertz
Heidelberg University

Under submission at LREC 2022
A preprint version of the paper can be found on arXiv!
For easy access, we have also made the dataset available on Huggingface Datasets!


Data Availability

To use data in your experiments, we suggest the existing training/validation/test split, available in ./data/splits/. This split has been generated with a stratified sampling strategy (based on document lengths) and a 80/10/10 split, which ensure that the samples are somewhat evenly distributed.

Alternatively, please refer to our Huggingface Datasets version for easy access of the preprocessed data.

Installation

This repository contains the code to crawl the Klexikon data set presented in our paper, as well as all associated baselines and splits. You can work on the existing code base by simply cloning this repository.

Install all required dependencies with the following command:

python3 -m pip install -r requirements.txt

The experiments were run on Python 3.8.4, but should run fine with any version >3.7. To run files, relative imports are required, which forces you to run them as modules, e.g.,

python3 -m klexikon.analysis.compare_offline_stats

instead of

python3 klexikon/analysis/compare_offline_stats.py

Furthermore, this requires the working directory to be the root folder as well, to ensure correct referencing of relative data paths. I.e., if you cloned this repository into /home/dennis/projects/klexikon, make sure to run scripts directly from this path.

Extended Explanation

Manually Replaced Articles in articles.json

Aside from all the manual matches, which can be produced by create_matching_url_list.py, there are some articles which simply link to an incorrect article in Wikipedia.
We approximate this by the number of paragraphs in the Wikipedia article, which is generally much longer than the Klexikon article, and therefore should have at least 15 paragraphs. Note that most of the pages are disambiguations, which unfortunately don't necessarily correspond neatly to a singular Wikipedia page. We remove the article if it is not possible to find a singular Wikipedia article that covers more than 66% of the paragraphs in the Klexikon article. Some examples for manual changes were:

  • "Aal" to "Aale"
  • "Abendmahl" to "Abendmahl Jesu"
  • "Achse" to "Längsachse"
  • "Ader" to "Blutgefäß"
  • "Albino" to "Albinismus"
  • "Alkohol" to "Ethanol"
  • "Android" to "Android (Betriebssystem)"
  • "Anschrift" to "Postanschrift"
  • "Apfel" to "Kulturapfel"
  • "App" to "Mobile App"
  • "Appenzell" to "Appenzellerland"
  • "Arabien" to "Arabische Halbinsel"
  • "Atlas" to "Atlas (Kartografie)"
  • "Atmosphäre" to "Erdatmospähre"

Merging sentences that end in a semicolon (;)

This applies to any position in the document. The reason is rectifying some unwanted splits by spaCy.

Merge of short lines in lead 3 baseline

Also checking for lines that have less than 10 characters in the first three sentences. This helps with fixing the lead-3 baseline, and most issues arise from some incorrect splits to begin with.

Removal of coordinates

Sometimes, coordinate information is leading in the data, which seems to be embedded in some Wikipedia articles. We remove any coordinate with a simple regex.

Sentences that do not end in a period

Manual correction of sentences (in the lead 3) that do not end in periods. This has been automatically fixed by merging content similarly to the semicolon case. Specifically, we only merge if the subsequent line is not just an empty line.

Using your own data

Currently, the systems expect input data to be processed in a line-by-line fashion, where every line represents a sentence, and each file represents an input document. Note that we currently do not support multi-document summarization.

Criteria for discarding articles

Articles where Wikipedia has less than 15 paragraphs. Otherwise, manually discarding when there are no matching articles in Wikipedia (see above). Examples of the latter case are for example "Kiwi" or "Washington"

Reasons for not using lists

As described in the paper, we discard any element that is not a

tag in the HTLM code. This helps getting rid of actual unwanted information (images, image captions, meta-descriptors, etc.), but also removes list items. After reviewing some examples, we have decided to discard list elements altogether. This means that some articles (especially disambiguation pages) are also easier to detect.

Final number of valid article pairs: 2898

This means we had to discard around 250 articles from the original list at the time of crawling (April 2021). In the meantime, there have been new articles added to Klexikon, which leaves room for future improvements.

Execution Order of Scripts

TK: I'll include a better reference to the particular scripts in the near future, as well as a script that actually executes everything relevant in order.

  • Generate JSON file with article URLs
  • Crawl texts
  • Fix lead sentences
  • Remove unused articles (optional)
  • Generate stratified split

License Information

Both Wikipedia and Klexikon make their textual contents available under the CC BY-SA license. Per recommendation of the Creative Commons, we apply a separate license to the software component of this repository. Data will be re-distributed under the CC BY-SA license.

Contributions

Contributions are very welcome. Please either open an issue or pull request if you have any suggestion on how this data can be improved. Open TODOs:

  • So far, the data does not have more than a few simplistic baselines, and lacks an actually trained system on top of the data.
  • The dataset is "out-of-date", since it does not include any of the more recently articles (~100 since the inception of my version). Potentially, we can increase the availability to almost 3000 articles.
  • Adding a top-level script that adds correct execution order of different scripts to generate baselines/results/etc.
  • Adding a proper data managing script for the Huggingface Datasets version of this dataset.

How to Cite?

If you use our dataset, or code from this repository, please cite

@article{aumiller-gertz-2022-klexikon,  
  title   = {{Klexikon: A German Dataset for Joint Summarization and Simplification}},  
  author  = {Aumiller, Dennis and Gertz, Michael},  
  year    = {2022},  
  journal = {arXiv preprint arXiv:2201.07198},  
  url     = {https://arxiv.org/abs/2201.07198},  
}
Owner
Dennis Aumiller
PhD student in Information Retrieval & NLP at Heidelberg University. Python is awesome, and so is Huggingface
Dennis Aumiller
Deep learning for NLP crash course at ABBYY.

Deep NLP Course at ABBYY Deep learning for NLP crash course at ABBYY. Suggested textbook: Neural Network Methods in Natural Language Processing by Yoa

Dan Anastasyev 597 Dec 18, 2022
Research code for "What to Pre-Train on? Efficient Intermediate Task Selection", EMNLP 2021

efficient-task-transfer This repository contains code for the experiments in our paper "What to Pre-Train on? Efficient Intermediate Task Selection".

AdapterHub 26 Dec 24, 2022
Code for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

🌳 Fingerprinting Fine-tuned Language Models in the wild This is the code and dataset for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned La

LCS2-IIITDelhi 5 Sep 13, 2022
History Aware Multimodal Transformer for Vision-and-Language Navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation This repository is the official implementation of History Aware Multimodal Tra

Shizhe Chen 46 Nov 23, 2022
A script that automatically creates a branch name using google translation api and jira api

About google translation api와 jira api을 사용하여 자동으로 브랜치 이름을 만들어주는 스크립트 Setup 환경변수에 다음 3가지를 등록해야 한다. JIRA_USER : JIRA email (ex: hyunwook.kim 2 Dec 20, 2021

In this project, we compared Spanish BERT and Multilingual BERT in the Sentiment Analysis task.

Applying BERT Fine Tuning to Sentiment Classification on Amazon Reviews Abstract Sentiment analysis has made great progress in recent years, due to th

Alexander Leonardo Lique Lamas 5 Jan 03, 2022
Repository for fine-tuning Transformers 🤗 based seq2seq speech models in JAX/Flax.

Seq2Seq Speech in JAX A JAX/Flax repository for combining a pre-trained speech encoder model (e.g. Wav2Vec2, HuBERT, WavLM) with a pre-trained text de

Sanchit Gandhi 21 Dec 14, 2022
Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018)

Neural Network Models for Joint POS Tagging and Dependency Parsing Implementations of joint models for POS tagging and dependency parsing, as describe

Dat Quoc Nguyen 152 Sep 02, 2022
Multilingual Emotion classification using BERT (fine-tuning). Published at the WASSA workshop (ACL2022).

XLM-EMO: Multilingual Emotion Prediction in Social Media Text Abstract Detecting emotion in text allows social and computational scientists to study h

MilaNLP 35 Sep 17, 2022
Main repository for the chatbot Bobotinho.

Bobotinho Bot Main repository for the chatbot Bobotinho. ℹ️ Introduction Twitch chatbot with entertainment commands. ‎ 💻 Technologies Concurrent code

Bobotinho 14 Nov 29, 2022
An open source framework for seq2seq models in PyTorch.

pytorch-seq2seq Documentation This is a framework for sequence-to-sequence (seq2seq) models implemented in PyTorch. The framework has modularized and

International Business Machines 1.4k Jan 02, 2023
Natural Language Processing Best Practices & Examples

NLP Best Practices In recent years, natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive bus

Microsoft 6.1k Dec 31, 2022
Super easy library for BERT based NLP models

Fast-Bert New - Learning Rate Finder for Text Classification Training (borrowed with thanks from https://github.com/davidtvs/pytorch-lr-finder) Suppor

Utterworks 1.8k Dec 27, 2022
Code for the paper "Are Sixteen Heads Really Better than One?"

Are Sixteen Heads Really Better than One? This repository contains code to reproduce the experiments in our paper Are Sixteen Heads Really Better than

Paul Michel 143 Dec 14, 2022
BiQE: Code and dataset for the BiQE paper

BiQE: Bidirectional Query Embedding This repository includes code for BiQE and the datasets introduced in Answering Complex Queries in Knowledge Graph

Bhushan Kotnis 1 Oct 20, 2021
Code for PED: DETR For (Crowd) Pedestrian Detection

Code for PED: DETR For (Crowd) Pedestrian Detection

36 Sep 13, 2022
In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a model using HugginFace transformers framework.

Transformers are all you need In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a

Aymen Berriche 8 Apr 13, 2022
MicBot - MicBot uses Google Translate to speak everyone's chat messages

MicBot MicBot uses Google Translate to speak everyone's chat messages. It can al

2 Mar 09, 2022
Training and evaluation codes for the BertGen paper (ACL-IJCNLP 2021)

BERTGEN This repository is the implementation of the paper "BERTGEN: Multi-task Generation through BERT" (https://arxiv.org/abs/2106.03484). The codeb

<a href=[email protected]"> 9 Oct 26, 2022
Maix Speech AI lib, including ASR, chat, TTS etc.

Maix-Speech 中文 | English Brief Now only support Chinese, See 中文 Build Clone code by: git clone https://github.com/sipeed/Maix-Speech Compile x86x64 c

Sipeed 267 Dec 25, 2022