The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization

Last update: Dec 18, 2022

PRIMER

PRIMER is a pre-trained model for multi-document representation with a focus on summarization. It reduces the need for dataset-specific architectures and for large amounts of labeled fine-tuning data. In extensive experiments on 6 multi-document summarization datasets from 3 different domains, covering the zero-shot, few-shot, and fully supervised settings, PRIMER outperforms current state-of-the-art models in most of these settings by large margins.

Set up

Create a new virtual environment by

conda create --name primer python=3.7
conda activate primer
conda install cudatoolkit=10.0

Install Longformer by

pip install git+https://github.com/allenai/longformer.git

Install requirements to run the summarization scripts and data generation scripts by

pip install -r requirements.txt

Usage of PRIMER

Download the pre-trained PRIMER model here to ./PRIMER_model
Load the tokenizer and model by

from transformers import AutoTokenizer
from longformer import LongformerEncoderDecoderForConditionalGeneration
from longformer import LongformerEncoderDecoderConfig

tokenizer = AutoTokenizer.from_pretrained('./PRIMER_model/')
config = LongformerEncoderDecoderConfig.from_pretrained('./PRIMER_model/')
model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained(
            './PRIMER_model/', config=config)

Make sure the documents are separated with <doc-sep> in the input.
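As a minimal sketch of that input format, the helper below joins the documents of one cluster into a single string with <doc-sep> between them. The function name `join_documents`, the single space around the separator, and the choice to drop empty documents are assumptions made for illustration; they are not taken from the PRIMER code.

```python
DOC_SEP = "<doc-sep>"  # separator token named in the README


def join_documents(documents, sep=DOC_SEP):
    """Join the documents of one cluster into a single input string.

    Hypothetical helper: whitespace handling here is an assumption,
    not the behaviour of the PRIMER scripts.
    """
    # Drop empty or whitespace-only documents and trim the rest.
    cleaned = [doc.strip() for doc in documents if doc and doc.strip()]
    return f" {sep} ".join(cleaned)


cluster = ["First article text.", "Second article text.", "   "]
joined = join_documents(cluster)
print(joined)
```

The resulting string can then be passed to the tokenizer loaded above.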

Summarization Scripts

You can use script/primer_main.py to pre-train, train, and test PRIMER, and script/compared_model_main.py to train and test BART, PEGASUS, and LED.

Pre-training Data Generation

Newshead: we crawled the newshead dataset using the original code and cleaned up the crawled data. The final newshead dataset can be found here.

You can use utils/pretrain_preprocess.py to generate pre-training data.

Generate data with scores and entities with --mode compute_all_scores
Generate pre-training data with --mode pretraining_data_with_score:
- Pegasus: --strategy greedy --metric pegasus_score
- Entity_Pyramid: --strategy greedy_entity_pyramid --metric pyramid_rouge
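The two steps above can be sketched as command lines assembled in Python. Only the script path and the --mode, --strategy, and --metric flags come from the text above; the helper `build_command` is hypothetical, and the script may need further arguments (for example input and output paths) that are not listed here.

```python
SCRIPT = "utils/pretrain_preprocess.py"  # path given in the README


def build_command(mode, strategy=None, metric=None):
    """Assemble an argument list for the preprocessing script.

    Hypothetical helper: it only covers the flags named in the README.
    """
    cmd = ["python", SCRIPT, "--mode", mode]
    if strategy is not None:
        cmd += ["--strategy", strategy]
    if metric is not None:
        cmd += ["--metric", metric]
    return cmd


commands = [
    # Step 1: compute scores and entities.
    build_command("compute_all_scores"),
    # Step 2: build pre-training data with the Entity_Pyramid strategy.
    build_command(
        "pretraining_data_with_score",
        strategy="greedy_entity_pyramid",
        metric="pyramid_rouge",
    ),
]

for cmd in commands:
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once any remaining required arguments have been added.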

Datasets

Multi-News and Multi-XScience: the data is downloaded automatically from Hugging Face.
WCEP-10: the preprocessed version can be found here
Wikisum: we only use a small subset for few-shot training (10/100) and testing (3200). The subset we used can be found here. Note that train.pt and valid.pt contain significantly more examples than we used: we sample 10/100 examples multiple times in the few-shot setting, so we need a large pool to sample from.
DUC2003/2004: you need to apply for access by following the instructions
arXiv: you can find the data we used in this repo
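The repeated few-shot sampling described for Wikisum can be illustrated with a short sketch: each run draws 10 (or 100) examples without replacement from a larger pool, using a different seed. The function name `sample_few_shot`, the pool size, the seeds, and the number of runs are placeholders chosen for illustration and are not taken from the PRIMER scripts.

```python
import random


def sample_few_shot(pool, k, seed):
    """Draw k distinct examples from the pool, reproducibly for a given seed."""
    rng = random.Random(seed)  # independent generator per run
    return rng.sample(pool, k)


# Placeholder pool standing in for the examples stored in train.pt.
pool = [f"example-{i}" for i in range(1000)]

# One few-shot training set of 10 examples per seed (5 runs is an assumption).
runs = {seed: sample_few_shot(pool, 10, seed) for seed in range(5)}

for seed, subset in runs.items():
    print(seed, len(subset))
```

Because sampling is without replacement, the pool has to hold at least as many examples as are drawn in one run, and a larger pool lets the repeated runs differ from one another.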

Owner

AI2
