The NewSHead dataset is a multi-doc headline dataset used in NHNet for training a headline summarization model.

Last update: Jul 15, 2022

NewSHead

This repository contains the raw dataset used in NHNet [1] for the task of News Story Headline Generation. The code for data processing and training is available under TensorFlow Models - NHNet.

A news story is defined as a list of articles about the same event with a coherent topic. The released dataset contains 369,940 English stories with 932,571 unique URLs: 359,940 stories for training, 5,000 for validation, and 5,000 for testing. Each news story contains at least three and at most five articles.
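The figures above can be checked with a minimal sketch. The `Story` record and its field names (`urls`, `headline`) are assumptions made for illustration and may not match the layout of the released files; only the split sizes and the three-to-five article range come from the text.

```python
from dataclasses import dataclass
from typing import List

# Split sizes as stated in the dataset description.
SPLIT_SIZES = {"train": 359_940, "validation": 5_000, "test": 5_000}
TOTAL_STORIES = 369_940

# Article-count bounds per story, as stated in the dataset description.
MIN_ARTICLES = 3
MAX_ARTICLES = 5


@dataclass
class Story:
    """Hypothetical in-memory record; the field names are assumptions."""
    urls: List[str]
    headline: str


def has_valid_article_count(story: Story) -> bool:
    """Return True if the story holds between three and five articles."""
    return MIN_ARTICLES <= len(story.urls) <= MAX_ARTICLES


# The three splits should add up to the reported total.
splits_are_consistent = sum(SPLIT_SIZES.values()) == TOTAL_STORIES

# Made-up example story with placeholder URLs.
example = Story(
    urls=[
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
    headline="Drilling for oil in Pakistan",
)
print(splits_are_consistent, has_valid_article_count(example))
```

A check like this is mainly useful as a sanity test after loading the data, since crawled stories may lose articles when URLs are no longer reachable.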

The dataset was collected from news stories published between May 2018 and May 2019, where a proprietary clustering algorithm iteratively loads articles published in a time window and groups them based on content similarity¹. Up to five representative articles are picked from each cluster for generating the story headline². Curators from a crowd-sourcing platform were asked to provide a headline of up to 35 characters that describes the major information covered by the story.

Example Headlines:

International Space Station flyover
Drilling for oil in Pakistan
Review of 'Mr. Local'
MLB: Pirates vs Padres
Braves re-sign Jerry Blevins

Download Link

Tools to Process

Citation

If you use or discuss this dataset in your work, please cite our paper:

@InProceedings{headline2020,
  title = {{Generating Representative Headlines for News Stories}},
  author = {Gu, Xiaotao and Mao, Yuning and Han, Jiawei and Liu, Jialu and Yu, Hongkun and Wu, You and Yu, Cong
and Finnie, Daniel and Zhai, Jiaqi and Zukoski, Nicholas},
  booktitle = {Proc. of the Web Conf. 2020},
  year = {2020}
}

Analysis

We performed a broad topic analysis of the 932,571 articles in our dataset. A histogram is attached below.

Across all 369,940 stories, each headline is required to be between 10 and 35 characters long.

These curated story headlines are much shorter than traditional summaries, and even shorter than the article titles in our dataset, as depicted below.
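The length requirement can be illustrated with a short sketch that checks the example headlines listed earlier. It assumes that length is counted as plain characters, spaces and punctuation included; the text does not specify how the curators' tool counted them. The function name is hypothetical.

```python
# Length bounds as stated in the text.
MIN_CHARS = 10
MAX_CHARS = 35

# Example headlines copied from the "Example Headlines" list.
EXAMPLE_HEADLINES = [
    "International Space Station flyover",
    "Drilling for oil in Pakistan",
    "Review of 'Mr. Local'",
    "MLB: Pirates vs Padres",
    "Braves re-sign Jerry Blevins",
]


def is_valid_headline(headline: str) -> bool:
    """Return True if the headline length is within the stated bounds.

    Assumption: every character counts, including spaces and punctuation.
    """
    return MIN_CHARS <= len(headline) <= MAX_CHARS


# Print each headline with its length and whether it passes the check.
for headline in EXAMPLE_HEADLINES:
    print(len(headline), is_valid_headline(headline), headline)
```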

References

[1] Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, Hongkun Yu, You Wu, Cong Yu, Daniel Finnie, Jiaqi Zhai, and Nicholas Zukoski. "Generating Representative Headlines for News Stories." World Wide Web Conf. (WWW'2020). https://arxiv.org/abs/2001.09386

Footnote

1 The clustering algorithm may introduce noise, so some articles in a story may not be relevant to the rest.

2 The articles presented here do not necessarily correspond to the articles we would show on Google products such as Search and the News app.

Releases(v1.0-config)

v1.0-config(Jun 9, 2020)

Config file for crawling articles using news-please.
news_please.zip(28.38 MB)
v1.0(Mar 27, 2020)

First version includes 369,940 news stories with 932,571 unique URLs, among which we have 359,940 stories for training, 5,000 for validation, and 5,000 for testing, respectively.
dataset.zip(66.60 MB)

Owner

Google Research Datasets
