Klexikon: A German Dataset for Joint Summarization and Simplification

Overview

Dennis Aumiller and Michael Gertz
Heidelberg University

Under submission at LREC 2022
A preprint version of the paper can be found on arXiv!
For easy access, we have also made the dataset available on Huggingface Datasets!


Data Availability

To use the data in your experiments, we suggest the existing training/validation/test split, available in ./data/splits/. This split has been generated with a stratified sampling strategy (based on document lengths) and an 80/10/10 ratio, which ensures that the samples are somewhat evenly distributed across the splits.
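To illustrate the idea of a length-stratified 80/10/10 split, here is a minimal sketch. It is not the script used to create the files in ./data/splits/; the function name, the number of length bins, and the seed are assumptions made for this example.

```python
import random
from typing import Dict, List, Sequence


def stratified_length_split(
    lengths: Sequence[int],
    num_bins: int = 10,                          # assumed number of length strata
    ratios: Sequence[float] = (0.8, 0.1, 0.1),   # train / validation / test
    seed: int = 42,                              # assumed seed, for reproducibility only
) -> Dict[str, List[int]]:
    """Split document indices into train/validation/test, stratified by length."""
    rng = random.Random(seed)
    # Sort document indices by length so that each bin covers one length range.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    bin_size = max(1, -(-len(order) // num_bins))  # ceiling division, at least 1
    splits: Dict[str, List[int]] = {"train": [], "validation": [], "test": []}
    for start in range(0, len(order), bin_size):
        bucket = order[start:start + bin_size]
        rng.shuffle(bucket)
        n_train = round(ratios[0] * len(bucket))
        n_val = round(ratios[1] * len(bucket))
        # Each length bin contributes to all three splits in the same proportion.
        splits["train"].extend(bucket[:n_train])
        splits["validation"].extend(bucket[n_train:n_train + n_val])
        splits["test"].extend(bucket[n_train + n_val:])
    return splits
```

Because every length bin is divided with the same ratio, short and long documents end up in each split in roughly equal proportion.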

Alternatively, please refer to our Huggingface Datasets version for easy access of the preprocessed data.

Installation

This repository contains the code to crawl the Klexikon data set presented in our paper, as well as all associated baselines and splits. You can work on the existing code base by simply cloning this repository.

Install all required dependencies with the following command:

python3 -m pip install -r requirements.txt

The experiments were run on Python 3.8.4, but should run fine with any version >3.7. The code base uses relative imports, so files have to be run as modules, e.g.,

python3 -m klexikon.analysis.compare_offline_stats

instead of

python3 klexikon/analysis/compare_offline_stats.py

Furthermore, the working directory has to be the root folder of the repository, to ensure correct referencing of relative data paths. For example, if you cloned this repository into /home/dennis/projects/klexikon, make sure to run scripts directly from this path.

Extended Explanation

Manually Replaced Articles in articles.json

Aside from all the manual matches, which can be produced by create_matching_url_list.py, there are some articles which simply link to an incorrect article in Wikipedia.
We approximate this by the number of paragraphs in the Wikipedia article: it is generally much longer than the Klexikon article, and should therefore have at least 15 paragraphs. Note that most of these pages are disambiguations, which unfortunately don't necessarily correspond neatly to a single Wikipedia page. We remove the article if it is not possible to find a single Wikipedia article that covers more than 66% of the paragraphs in the Klexikon article. Some examples of manual changes are:

  • "Aal" to "Aale"
  • "Abendmahl" to "Abendmahl Jesu"
  • "Achse" to "Längsachse"
  • "Ader" to "Blutgefäß"
  • "Albino" to "Albinismus"
  • "Alkohol" to "Ethanol"
  • "Android" to "Android (Betriebssystem)"
  • "Anschrift" to "Postanschrift"
  • "Apfel" to "Kulturapfel"
  • "App" to "Mobile App"
  • "Appenzell" to "Appenzellerland"
  • "Arabien" to "Arabische Halbinsel"
  • "Atlas" to "Atlas (Kartografie)"
  • "Atmosphäre" to "Erdatmosphäre"

Merging sentences that end in a semicolon (;)

This applies to any position in the document. The reason is to rectify some unwanted sentence splits made by spaCy.
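A minimal sketch of such a merge step, assuming the document is given as a list of sentence strings (the function name is hypothetical and not taken from the repository):

```python
from typing import List


def merge_semicolon_lines(lines: List[str]) -> List[str]:
    """Join every line ending in ';' with the line that follows it."""
    merged: List[str] = []
    carry_over = ""  # text that still waits for its continuation
    for line in lines:
        current = f"{carry_over} {line.strip()}".strip()
        if current.endswith(";"):
            # The sentence was split too early; keep collecting.
            carry_over = current
        else:
            merged.append(current)
            carry_over = ""
    if carry_over:
        # A trailing semicolon line has nothing to merge with; keep it as-is.
        merged.append(carry_over)
    return merged
```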

Merge of short lines in lead 3 baseline

We also check for lines with fewer than 10 characters among the first three sentences. This helps with fixing the lead-3 baseline; most of these issues arise from incorrect sentence splits to begin with.
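The check could look roughly like the sketch below. It assumes that a short fragment is folded into the line that follows it, which is a guess about the merge direction; the function name and parameters are hypothetical.

```python
from typing import List


def merge_short_lead_lines(lines: List[str], min_chars: int = 10, lead: int = 3) -> List[str]:
    """Merge lines shorter than `min_chars` within the first `lead` lines."""
    result = list(lines)
    idx = 0
    # Only the lead positions are inspected, and a following line must exist.
    while idx < min(lead, len(result) - 1):
        if len(result[idx].strip()) < min_chars:
            # Fold the fragment into the following line, then re-check this position.
            joined = f"{result[idx].strip()} {result[idx + 1].strip()}".strip()
            result[idx:idx + 2] = [joined]
        else:
            idx += 1
    return result
```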

Removal of coordinates

Sometimes, the data starts with coordinate information, which seems to be embedded in some Wikipedia articles. We remove any such coordinates with a simple regex.
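The exact expression is not reproduced here. As an illustration only, the following sketch assumes that coordinates appear at the start of a line in a degree/minute/second notation with German compass letters (for example "49° 25′ N, 8° 43′ O"); both the pattern and the function name are hypothetical.

```python
import re

# Hypothetical pattern: an optional "Koordinaten:" label, followed by latitude
# and longitude in degrees, optional minutes (′) and optional seconds (″).
COORDINATE_PATTERN = re.compile(
    r"^\s*(?:Koordinaten:\s*)?"
    r"\d{1,3}°\s*(?:\d{1,2}′\s*)?(?:\d{1,2}(?:[.,]\d+)?″\s*)?[NS]\s*,\s*"
    r"\d{1,3}°\s*(?:\d{1,2}′\s*)?(?:\d{1,2}(?:[.,]\d+)?″\s*)?[OWE]\s*"
)


def strip_leading_coordinates(line: str) -> str:
    """Remove coordinate information from the beginning of a line, if present."""
    return COORDINATE_PATTERN.sub("", line, count=1)
```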

Sentences that do not end in a period

Sentences in the lead-3 that do not end in a period are corrected. This is done automatically by merging content, similarly to the semicolon case. Specifically, we only merge a line with the subsequent line if that line is not empty.
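A sketch of this rule under the same assumptions as before (one sentence per list entry, hypothetical function name); empty lines are neither merged nor used as a merge target:

```python
from typing import List


def merge_unterminated_lead_lines(lines: List[str], lead: int = 3) -> List[str]:
    """Merge lead lines that lack a final period with a non-empty following line."""
    result = list(lines)
    idx = 0
    while idx < min(lead, len(result) - 1):
        current = result[idx].strip()
        following = result[idx + 1].strip()
        if current and not current.endswith(".") and following:
            # Re-check the merged line, since it may still lack a period.
            result[idx:idx + 2] = [f"{current} {following}"]
        else:
            idx += 1
    return result
```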

Using your own data

Currently, the systems expect input data to be processed in a line-by-line fashion, where every line represents a sentence, and each file represents an input document. Note that we currently do not support multi-document summarization.
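For your own data, a file in this format could be read as in the following sketch. The helper name is hypothetical, and skipping empty lines is an assumption of this example rather than documented behavior of the systems.

```python
from pathlib import Path
from typing import List


def load_document(path: str) -> List[str]:
    """Read one document; every non-empty line is treated as one sentence."""
    text = Path(path).read_text(encoding="utf-8")
    # Assumption: blank lines carry no content and can be skipped.
    return [line.strip() for line in text.splitlines() if line.strip()]
```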

Criteria for discarding articles

We discard articles where the Wikipedia article has fewer than 15 paragraphs. In addition, we manually discard articles for which there is no matching article in Wikipedia (see above). Examples of the latter case are "Kiwi" and "Washington".
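The automatic part of this filter amounts to a simple threshold check, sketched below. The function and constant names are hypothetical, and counting only non-empty paragraphs is an assumption; the manual discarding step cannot be expressed in code.

```python
from typing import List

MIN_WIKIPEDIA_PARAGRAPHS = 15  # threshold stated in the text


def has_enough_paragraphs(
    wikipedia_paragraphs: List[str],
    min_paragraphs: int = MIN_WIKIPEDIA_PARAGRAPHS,
) -> bool:
    """Return True if the Wikipedia article is long enough to keep the pair."""
    # Assumption: whitespace-only paragraphs do not count.
    non_empty = [p for p in wikipedia_paragraphs if p.strip()]
    return len(non_empty) >= min_paragraphs
```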

Reasons for not using lists

As described in the paper, we discard any element that is not a <p> tag in the HTML code. This helps with getting rid of actual unwanted information (images, image captions, meta-descriptors, etc.), but also removes list items. After reviewing some examples, we have decided to discard list elements altogether. This means that some articles (especially disambiguation pages) are also easier to detect.
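To show the effect of keeping only paragraph elements, here is a small sketch based on Python's standard html.parser module. It assumes the retained elements are paragraph (<p>) tags and is not necessarily how the crawler in this repository is implemented; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements and ignore everything else."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._in_paragraph = False
        self._buffer: List[str] = []

    def _flush(self) -> None:
        # Normalize whitespace and store the paragraph if it has any text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_paragraph:
                self._flush()  # tolerate an unclosed previous <p>
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        # Text in lists, captions, etc. lies outside <p> and is dropped.
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html: str) -> List[str]:
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Inline markup inside a paragraph (such as bold text or links) is kept as plain text, while list items and images are skipped entirely.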

Final number of valid article pairs: 2898

This means we had to discard around 250 articles from the original list at the time of crawling (April 2021). In the meantime, there have been new articles added to Klexikon, which leaves room for future improvements.

Execution Order of Scripts

TK: I'll include a better reference to the particular scripts in the near future, as well as a script that actually executes everything relevant in order.

  • Generate JSON file with article URLs
  • Crawl texts
  • Fix lead sentences
  • Remove unused articles (optional)
  • Generate stratified split

License Information

Both Wikipedia and Klexikon make their textual contents available under the CC BY-SA license. Per recommendation of the Creative Commons, we apply a separate license to the software component of this repository. Data will be re-distributed under the CC BY-SA license.

Contributions

Contributions are very welcome. Please either open an issue or pull request if you have any suggestion on how this data can be improved. Open TODOs:

  • So far, the data does not have more than a few simplistic baselines, and lacks an actually trained system on top of the data.
  • The dataset is "out-of-date", since it does not include any of the more recently added articles (~100 since the inception of my version). Potentially, we can increase the availability to almost 3000 articles.
  • Adding a top-level script that adds correct execution order of different scripts to generate baselines/results/etc.
  • Adding a proper data managing script for the Huggingface Datasets version of this dataset.

How to Cite?

If you use our dataset, or code from this repository, please cite

@article{aumiller-gertz-2022-klexikon,  
  title   = {{Klexikon: A German Dataset for Joint Summarization and Simplification}},  
  author  = {Aumiller, Dennis and Gertz, Michael},  
  year    = {2022},  
  journal = {arXiv preprint arXiv:2201.07198},  
  url     = {https://arxiv.org/abs/2201.07198},  
}