Codebase for the Summary Loop paper at ACL2020

Overview

Summary Loop

This repository contains the code for ACL2020 paper: The Summary Loop: Learning to Write Abstractive Summaries Without Examples.

Training Procedure

We provide pre-trained models for each component needed in the Summary Loop Release:

  • keyword_extractor.joblib: An sklearn pipeline that will extract can be used to compute tf-idf scores of words according to the BERT vocabulary, which is used by the Masking Procedure,
  • bert_coverage.bin: A bert-base-uncased finetuned model on the task of Coverage for the news domain,
  • fluency_news_bs32.bin: A GPT2 (base) model finetuned on a large corpus of news articles, used as the Fluency model,
  • gpt2_copier23.bin: A GPT2 (base) model that can be used as an initial point for the Summarizer model.

In the release, we also provide:

  • pretrain_coverage.py script to train a coverage model from scratch,
  • train_generator.py to train a fluency model from scratch (we recommend Fluency model on domain of summaries, such as news, legal, etc.)

Once all the pretraining models are ready, training a summarizer can be done using the train_summary_loop.py:

python train_summary_loop.py --experiment wikinews_test --dataset_file data/wikinews.db

Scorer Models

The Coverage and Fluency model and Guardrails scores can be used separately for analysis, evaluation, etc. They are respectively in model_coverage.py and model_guardrails.py, each model is implemented as a class with a score(document, summary) function. The Fluency model is a Language model, which is also the generator (in model_generator.py). Examples of how to run each model are included in the class files, at the bottom of the files.

Bringing in your own data

Want to test out the Summary Loop on a different language/type of text? A Jupyter Notebook can help you bring your own data into the SQLite format we use in the pre-training scripts. Otherwise you can modify the scripts' data loading (DataLoader) and collate function (collate_fn).

Cite the work

If you make use of the code, models, or algorithm, please cite our paper:

@inproceedings{laban2020summary,
  title={The Summary Loop: Learning to Write Abstractive Summaries Without Examples},
  author={Laban, Philippe and Hsi, Andrew and Canny, John and Hearst, Marti A},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  volume={1},
  year={2020}
}

Contributing

If you'd like to contribute, or have questions or suggestions, you can contact us at [email protected]. All contributions welcome! For example, if you have a type of text data on which you want to apply the Summary Loop.

Comments
  • Error Loading Model  RuntimeError: Error(s) in loading state_dict for GPT2LMHeadModel:

    Error Loading Model RuntimeError: Error(s) in loading state_dict for GPT2LMHeadModel:

    Traceback (most recent call last):
      File "train_summary_loop.py", line 59, in <module>
        summarizer = GeneTransformer(max_output_length=args.max_output_length, device=args.device, tokenizer_type='gpt2', starter_model=summarizer_model_start)
      File "/home/tait-dev-0/summary_loop/summary_loop/model_generator.py", line 30, in __init__
        self.reload(starter_model)
      File "/home/tait-dev-0/summary_loop/summary_loop/model_generator.py", line 39, in reload
        print(self.model.load_state_dict(torch.load(from_file)))
      File "/home/tait-dev-0/anaconda2/envs/summary_loop/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
        self.__class__.__name__, "\n\t".join(error_msgs)))
    RuntimeError: Error(s) in loading state_dict for GPT2LMHeadModel:
    	Missing key(s) in state_dict: "transformer.h.0.attn.masked_bias", "transformer.h.1.attn.masked_bias", "transformer.h.2.attn.masked_bias", "transformer.h.3.attn.masked_bias", "transformer.h.4.attn.masked_bias", "transformer.h.5.attn.masked_bias", "transformer.h.6.attn.masked_bias", "transformer.h.7.attn.masked_bias", "transformer.h.8.attn.masked_bias", "transformer.h.9.attn.masked_bias", "transformer.h.10.attn.masked_bias", "transformer.h.11.attn.masked_bias". 
    
    
    opened by raviolli 6
  • Missing models for training

    Missing models for training

    Dear author, I tried to load the fluency_news_model_file models but failed. It seems that the "news_gpt2_bs32.bin" is not provided in the release.

    I tried to replace it with "fluency_news_bs32.bin", but it does not seem to match the GeneTransformer. I.e. when I tried to load the fluency model using modelf=GeneTransformer(max_output_length=args.max_output_length, device=args.device, starter_model=fluency_news_model_file) it shows "IncompatibleKeys(missing_keys=['transformer.h.0.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.11.attn.masked_bias'], unexpected_keys=[]) "

    Is this fine?

    In addition, when I tried to load the key word coverage model, the keys do not match either I.e. When running modelc = KeywordCoverage(args.device, keyword_model_file=coverage_keyword_model_file, model_file=coverage_model_file)} It shows IncompatibleKeys(missing_keys=['bert.embeddings.position_ids', 'cls.predictions.decoder.bias'], unexpected_keys=[])

    Wondering how I could deal with this situation

    opened by pengshancai 2
  • IndexError when decode with beam_size > 1

    IndexError when decode with beam_size > 1

    Followed the instruction from here and changed the beam_size to more than 1. IndexError occur:

    ~/summary_loop/model_generator.py in decode(self, bodies, max_output_length, max_batch_size, beam_size, return_scores, sample, progress)
        232             with torch.no_grad():
        233                 if beam_size > 1:
    --> 234                     batch_outputs = self.decode_beam_batch(batch_bodies, beam_size=beam_size, max_output_length=max_output_length, sample=sample)
        235                 else:
        236                     batch_outputs = self.decode_batch(batch_bodies, max_output_length=max_output_length, sample=sample, return_scores=return_scores)
    
    ~/summary_loop/model_generator.py in decode_beam_batch(self, bodies, beam_size, max_output_length, sample)
        200             if build_up is not None:
        201                 build_up = build_up[tracks, :]
    --> 202             past = [p[:, tracks, :] for p in past]
        203 
        204             # Update the latest scores, and the current_build
    
    ~/summary_loop/model_generator.py in <listcomp>(.0)
        200             if build_up is not None:
        201                 build_up = build_up[tracks, :]
    --> 202             past = [p[:, tracks, :] for p in past]
        203 
        204             # Update the latest scores, and the current_build
    
    IndexError: tensors used as indices must be long, byte or bool tensors
    
    opened by s103321048 2
  • cannot reshape tensor of 0 elements into shape [-1, 0]

    cannot reshape tensor of 0 elements into shape [-1, 0]

    I followed the instruction training model with the provided example wikinews.db: python train_summary_loop.py --experiment wikinews_test --dataset_file data/wikinews.db

    It did start training but later stop due to Runtimeerror:

    Traceback (most recent call last):
      File "train_summary_loop.py", line 138, in <module>
        sampled_summaries, sampled_logprobs, sampled_tokens, input_past, sampled_end_idxs = summarizer.decode_batch(bodies, max_output_length=args.max_output_length, return_logprobs=True, sample=True)
      File "/home/robin/TrySomethingNew/summary_loop/model_generator.py", line 100, in decode_batch
        _, input_past = self.model(input_ids=inputs, past_key_values=None)
      File "/home/robin/virtual-env/summary-loop/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/robin/virtual-env/summary-loop/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 731, in forward
        return_dict=return_dict,
      File "/home/robin/virtual-env/summary-loop/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/robin/virtual-env/summary-loop/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 533, in forward
        input_ids = input_ids.view(-1, input_shape[-1])
    RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified
    dimension size -1 can be any value and is ambiguous
    
    opened by s103321048 2
  • Code for summary generation from the given model is not provided

    Code for summary generation from the given model is not provided

    You mentioned "Releasing the 11,490 summaries generated by the Summary Loop model (summary_loop_length46.bin) on the CNN/DM test set." and provided the json file "cnndm_test_summary_loop.json". Is there any code to get the json file(summaries) from the given model(.bin). If you have such code, then please share.

    opened by tarunyadav 1
  • Resuming training

    Resuming training

    Is resuming training simply starting from the checkpoint instead of the gpt3 copier bin?

    For example:

    #summarizer_model_start = os.path.join(models_folder, "gpt2_copier23.bin")
    summarizer_model_start = os.path.join(models_folder, "summarizer_wikinews_test_0_ckpt.bin")
    
    opened by RevanthRameshkumar 1
  • Encoding error in bin file

    Encoding error in bin file

    (dlenv) D:\summary loop\summary_loop-0.1>python summary_loop_length10.bin --experiment wikinews_test --dataset_file data/wikinews.db File "summary_loop_length10.bin", line 1 SyntaxError: Non-UTF-8 code starting with '\x80' in file summary_loop_length10.bin on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

    Anyone else get this issue? Currently debugging

    opened by RevanthRameshkumar 1
  • a sample of data in hdf5 format

    a sample of data in hdf5 format

    Hi,

    I'm trying to train the models from scratch since I'd like to use them on a different language. It seems that one needs a dataset in hdf5 format instead of SQL to do that. Can you please release a sample of data in hdf5 format?

    Thanks

    opened by azagsam 1
  • Missing Model to run example

    Missing Model to run example

    I'm trying to run the example:

    python train_summary_loop.py --experiment wikinews_test --dataset_file ../data/wikinews.db --root_folder ../ --device cuda

    but it seems I'm missing the ../models/fluency_news_bs32.bin

    it doesn't seem to be in the list of downloadable models. Mistake??

    opened by raviolli 1
  • Error running training_summary example

    Error running training_summary example

    python train_summary_loop.py --experiment wikinews_test --dataset_file ../data/wikinews.db

    Traceback (most recent call last):
      File "train_summary_loop.py", line 56, in <module>
        bert_tokenizer = utils_tokenizer.BERTCacheTokenizer()
      File "/home/tait-dev-0/summary_loop/summary_loop/utils_tokenizer.py", line 88, in __init__
        self.tokenizer.max_len = 10000
    AttributeError: can't set attribute
    

    transformers 3.0.2 py_0 conda-forge

    I created a separate conda environment. Is this a transformer version issue?

    opened by raviolli 1
  • updated torch.load params

    updated torch.load params

    Updated occurrences of torch.load to include map_location parameter. When attempting to train with --device set to cpu, torch.load may attempt to load a file with GPU tensors, which will lead to loading to GPU by default (see: https://pytorch.org/docs/stable/generated/torch.load.html). If --device is set to cpu, this will error on a cpu-only machine. Otherwise, it will go against desired functionality. This pull request resolves this issue.

    opened by bsh98 0
Releases(0.3)
  • 0.3(Jun 11, 2021)

    Releasing the 11,490 summaries generated by the Summary Loop model (summary_loop_length46.bin) on the CNN/DM test set. Each summary is released attached with the CNN/DM id. The following code snippet can be used to evaluate ROUGE scores:

    from datasets import load_dataset, load_metric
    import json
    with open("/home/phillab/data/cnndm_test_summary_loop.json", "r") as f:
        summary_loop_gens = json.load(f)
    rouge = load_metric("rouge")
    dataset_test = load_dataset("cnn_dailymail", "3.0.0")["test"]
    id2summary_loop = {d["id"]: d["summary_loop_gen"] for d in summary_loop_gens}
    candidates, references = [], []
    for d in dataset_test:
        references.append(d["highlights"])
        candidates.append(id2summary_loop[d["id"]])
    print(len(references), len(candidates))
    print(rouge.compute(predictions=candidates, references=references))
    

    Notes: (1) this relies on HuggingFace's datasets repository (https://github.com/huggingface/datasets) to load the CNN/DM dataset, and the ROUGE metric. (2) The ROUGE metric implementation used in the above example is not the original, PERL-based implementation of ROUGE used for official numbers in the paper. This serves for demonstration purposes to show how to use the file.

    Source code(tar.gz)
    Source code(zip)
    cnndm_test_summary_loop.json(3.40 MB)
  • 0.2(Sep 8, 2020)

    We release an upgraded set of initial models for the training script that are compatible with transformers==3.1.0 to make it easier to get started. The original release (0.1) used version 2.8.0 of transformers, and there were some breaking changed introduced since, which leads to some model loading failing. The requirements.txt in the latest release has been updated with compatible library versions to simplify installation.

    Initial Models

    These sets of models work using Python 3.6.10, Transformers 3.1.0 and Sklearn 0.22.1:

    • keyword_extractor.joblib: An sklearn pipeline that will extract can be used to compute tf-idf scores of words according to the BERT vocabulary, which is used by the Masking Procedure,
    • bert_coverage.bin: A bert-base-uncased finetuned model on the task of Coverage for the news domain,
    • fluency_news_bs32.bin: A GPT2 (base) model finetuned on a large corpus of news articles, used as the Fluency model,
    • gpt2_copier23.bin: A GPT2 (base) model that can be used as an initial point for the Summarizer model.

    Final Models

    Unfortunately, the three final models (trained summarizers) released in v0.1 do not work anymore in the latest transformers library, and only work in versions 2.8.0 and before. Once we retrain these models, we will reupload them. If this is of interest, feel free to add an issue or contact us.

    Source code(tar.gz)
    Source code(zip)
    bert_coverage.bin(420.06 MB)
    fluency_news_bs32.bin(486.73 MB)
    gpt2_copier23.bin(633.97 MB)
    keyword_extractor.joblib(667.33 KB)
  • v0.1(Jun 25, 2020)

    We release models and data needed to run the Summary Loop and use the models we trained.

    Initial models

    Here are the models needed to run the train_summary_loop.py:

    • keyword_extractor.joblib: An sklearn pipeline that will extract can be used to compute tf-idf scores of words according to the BERT vocabulary, which is used by the Masking Procedure,
    • bert_coverage.bin: A bert-base-uncased finetuned model on the task of Coverage for the news domain,
    • fluency_news_bs32.bin: A GPT2 (base) model finetuned on a large corpus of news articles, used as the Fluency model,
    • gpt2_copier23.bin: A GPT2 (base) model that can be used as an initial point for the Summarizer model.

    Sample dataset

    We release a sample dataset of Wikinews news articles to get researchers started using the Summary Loop: wikinews.db. We cannot release the full dataset we used for copyright reasons. We note that we do not expect this to be enough to train to best performance, and recommend finding larger datasets (such as Newsroom or CNN/DM) for full-fledged training.

    Final models

    We release 3 Summarizer models obtained through the Summary Loop procedure for 3 target lengths: summary_loop_length_12.bin, summary_loop_length_27.bin, summary_loop_length_61.bin

    Source code(tar.gz)
    Source code(zip)
    bert_coverage.bin(420.06 MB)
    fluency_news_bs32.bin(522.73 MB)
    gpt2_copier23.bin(633.97 MB)
    keyword_extractor.joblib(667.33 KB)
    summary_loop_length10.bin(522.73 MB)
    summary_loop_length24.bin(522.73 MB)
    summary_loop_length46.bin(522.73 MB)
    wikinews.db(91.20 MB)
Owner
Canny Lab @ The University of California, Berkeley
Canny Lab @ The University of California, Berkeley
MT-GAN-PyTorch - PyTorch Implementation of Learning to Transfer: Unsupervised Domain Translation via Meta-Learning

MT-GAN-PyTorch PyTorch Implementation of AAAI-2020 Paper "Learning to Transfer: Unsupervised Domain Translation via Meta-Learning" Dependency: Python

29 Oct 19, 2022
This is the official source code for SLATE. We provide the code for the model, the training code, and a dataset loader for the 3D Shapes dataset. This code is implemented in Pytorch.

SLATE This is the official source code for SLATE. We provide the code for the model, the training code and a dataset loader for the 3D Shapes dataset.

Gautam Singh 66 Dec 26, 2022
Use VITS and Opencpop to develop singing voice synthesis; Maybe it will VISinger.

Init Use VITS and Opencpop to develop singing voice synthesis; Maybe it will VISinger. 本项目基于 https://github.com/jaywalnut310/vits https://github.com/S

AmorTX 107 Dec 23, 2022
MPLP: Metapath-Based Label Propagation for Heterogenous Graphs

MPLP: Metapath-Based Label Propagation for Heterogenous Graphs Results on MAG240M Here, we demonstrate the following performance on the MAG240M datase

Qiuying Peng 10 Jun 28, 2022
Contrastive Learning of Structured World Models

Contrastive Learning of Structured World Models This repository contains the official PyTorch implementation of: Contrastive Learning of Structured Wo

Thomas Kipf 371 Jan 06, 2023
Official page of Struct-MDC (RA-L'22 with IROS'22 option); Depth completion from Visual-SLAM using point & line features

Struct-MDC (click the above buttons for redirection!) Official page of "Struct-MDC: Mesh-Refined Unsupervised Depth Completion Leveraging Structural R

Urban Robotics Lab. @ KAIST 37 Dec 22, 2022
Code for "The Box Size Confidence Bias Harms Your Object Detector"

The Box Size Confidence Bias Harms Your Object Detector - Code Disclaimer: This repository is for research purposes only. It is designed to maintain r

Johannes G. 24 Dec 07, 2022
Understanding the Generalization Benefit of Model Invariance from a Data Perspective

Understanding the Generalization Benefit of Model Invariance from a Data Perspective This is the code for our NeurIPS2021 paper "Understanding the Gen

1 Jan 15, 2022
CVPR 2021 - Official code repository for the paper: On Self-Contact and Human Pose.

selfcontact This repo is part of our project: On Self-Contact and Human Pose. [Project Page] [Paper] [MPI Project Page] It includes the main function

Lea Müller 68 Dec 06, 2022
LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021 We propose a cross encoder model (LTR_CrossEncoder) for information retrieval, re-retrie

Hieu Duong 7 Jan 12, 2022
Apache Flink

Apache Flink Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities. Learn more about Flin

The Apache Software Foundation 20.4k Dec 30, 2022
PyTorch implementation of InstaGAN: Instance-aware Image-to-Image Translation

InstaGAN: Instance-aware Image-to-Image Translation Warning: This repo contains a model which has potential ethical concerns. Remark that the task of

Sangwoo Mo 827 Dec 29, 2022
PyTorch implementation of the YOLO (You Only Look Once) v2

PyTorch implementation of the YOLO (You Only Look Once) v2 The YOLOv2 is one of the most popular one-stage object detector. This project adopts PyTorc

申瑞珉 (Ruimin Shen) 433 Nov 24, 2022
PyTorch implementation of Graph Convolutional Networks in Feature Space for Image Deblurring and Super-resolution, IJCNN 2021.

GCResNet PyTorch implementation of Graph Convolutional Networks in Feature Space for Image Deblurring and Super-resolution, IJCNN 2021. The code will

11 May 19, 2022
A simple rest api serving a deep learning model that classifies human gender based on their faces. (vgg16 transfare learning)

this is a simple rest api serving a deep learning model that classifies human gender based on their faces. (vgg16 transfare learning)

crispengari 5 Dec 09, 2021
Use unsupervised and supervised learning to predict stocks

AIAlpha: Multilayer neural network architecture for stock return prediction This project is meant to be an advanced implementation of stacked neural n

Vivek Palaniappan 1.5k Jan 06, 2023
Occlusion robust 3D face reconstruction model in CFR-GAN (WACV 2022)

Occlusion Robust 3D face Reconstruction Yeong-Joon Ju, Gun-Hee Lee, Jung-Ho Hong, and Seong-Whan Lee Code for Occlusion Robust 3D Face Reconstruction

Yeongjoon 31 Dec 19, 2022
Tiny Kinetics-400 for test

Kinetics-400迷你数据集 English | 简体中文 该数据集旨在解决的问题:参照Kinetics-400数据格式,训练基于自己数据的视频理解模型。 数据集介绍 Kinetics-400是视频领域benchmark常用数据集,详细介绍可以参考其官方网站Kinetics。整个数据集包含40

38 Jan 06, 2023
UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss

UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss This repository contains the TensorFlow implementation of the paper UnF

Simon Meister 270 Nov 06, 2022
Generative code template for PixelBeasts 10k NFT project.

generator-template Generative code template for combining transparent png attributes into 10,000 unique images. Used for the PixelBeasts 10k NFT proje

Yohei Nakajima 9 Aug 24, 2022