SummVis is an interactive visualization tool for text summarization.

Overview

SummVis

SummVis is an interactive visualization tool for analyzing abstractive summarization model outputs and datasets.

Figure

Installation

IMPORTANT: Please use python>=3.8 since some dependencies require that for installation.

git clone https://github.com/robustness-gym/summvis.git
cd summvis
pip install -r requirements.txt
python -m spacy download en_core_web_sm

Quickstart

Follow the steps below to start using SummVis immediately.

1. Download and extract data

Download our pre-cached dataset that contains predictions for state-of-the-art models such as PEGASUS and BART on 1000 examples taken from the CNN / Daily Mail validation set.

mkdir data
mkdir preprocessing
curl https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip --output preprocessing/cnn_dailymail_1000.validation.anonymized.zip
unzip preprocessing/cnn_dailymail_1000.validation.anonymized.zip -d preprocessing/

2. Deanonymize data

Next, we'll need to add the original examples from the CNN / Daily Mail dataset to deanonymize the data (this information is omitted for copyright reasons). The preprocessing.py script can be used for this with the --deanonymize flag.

Deanonymize 10 examples (try_it mode):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/try:cnn_dailymail_1000.validation \
--try_it

This will take between 10 seconds and several minutes depending on whether you've previously loaded CNN/DailyMail from the Datasets library.

3. Run SummVis

Finally, we're ready to run the Streamlit app. Once the app loads, make sure it's pointing to the right File at the top of the interface.

streamlit run summvis.py

General instructions for running with pre-loaded datasets

1. Download one of the pre-loaded datasets:

CNN / Daily Mail (1000 examples from validation set): https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip
CNN / Daily Mail (full validation set): https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail.validation.anonymized.zip
XSum (1000 examples from validation set): https://storage.googleapis.com/sfr-summvis-data-research/xsum_1000.validation.anonymized.zip
XSum (full validation set): https://storage.googleapis.com/sfr-summvis-data-research/xsum.validation.anonymized.zip

We recommend that you choose the smallest dataset that fits your need in order to minimize download / preprocessing time.

Example: Download and unzip CNN / Daily Mail

mkdir data
mkdir preprocessing
curl https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip --output preprocessing/cnn_dailymail_1000.validation.anonymized.zip
unzip preprocessing/cnn_dailymail_1000.validation.anonymized.zip -d preprocessing/

2. Deanonymize n examples:

Set the --n_samples argument and name the --processed_dataset_path output file accordingly.

Example: Deanonymize 100 examples from CNN / Daily Mail:

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/100:cnn_dailymail_1000.validation \
--n_samples 100

Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (1000 examples dataset):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail_1000.validation \
--n_samples 1000

Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (full dataset):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail.validation

Example: Deanonymize all pre-loaded examples from XSum (1000 examples dataset):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/xsum_1000.validation.anonymized \
--dataset xsum \
--split validation \
--processed_dataset_path data/full:xsum_1000.validation \
--n_samples 1000

3. Run SummVis

Once the app loads, make sure it's pointing to the right File at the top of the interface.

streamlit run summvis.py

Alternately, if you need to point SummVis to a folder where your data is stored.

streamlit run summvis.py -- --path your/path/to/data

Note that the additional -- is not a mistake, and is required to pass command-line arguments in streamlit.

Get your data into SummVis: end-to-end preprocessing

You can also perform preprocessing end-to-end to load any summarization dataset or model predictions into SummVis. Instructions for this are provided below.

Prior to running the following, an additional install step is required:

python -m spacy download en_core_web_lg

1. Standardize and save dataset to disk.

Loads in a dataset from HF, or any dataset that you have and stores it in a standardized format with columns for document and summary:reference.

Example: Save CNN / Daily Mail validation split to disk as a jsonl file.

python preprocessing.py \
--standardize \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl

Example: Load custom my_dataset.jsonl, standardize, and save.

python preprocessing.py \
--standardize \
--dataset_jsonl path/to/my_dataset.jsonl \
--doc_column name_of_document_column \
--reference_column name_of_reference_summary_column \
--save_jsonl_path preprocessing/my_dataset.jsonl

2. Add predictions to the saved dataset.

Takes a saved dataset that has already been standardized and adds predictions to it from prediction jsonl files. Cached predictions for several models available here: https://storage.googleapis.com/sfr-summvis-data-research/predictions.zip

You may also generate your own predictions using this this script.

Example: Add 6 prediction files for PEGASUS and BART to the dataset.

python preprocessing.py \
--join_predictions \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--prediction_jsonls \
predictions/bart-cnndm.cnndm.validation.results.anonymized \
predictions/bart-xsum.cnndm.validation.results.anonymized \
predictions/pegasus-cnndm.cnndm.validation.results.anonymized \
predictions/pegasus-multinews.cnndm.validation.results.anonymized \
predictions/pegasus-newsroom.cnndm.validation.results.anonymized \
predictions/pegasus-xsum.cnndm.validation.results.anonymized \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl

3. Run the preprocessing workflow and save the dataset.

Takes a saved dataset that has been standardized, and predictions already added. Applies all the preprocessing steps to it (running spaCy, lexical and semantic aligners), and stores the processed dataset back to disk.

Example: Autorun with default settings on a few examples to try it.

python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail.validation \
--try_it

Example: Autorun with default settings on all examples.

python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail

Citation

When referencing this repository, please cite this paper:

@misc{vig2021summvis,
      title={SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization}, 
      author={Jesse Vig and Wojciech Kryscinski and Karan Goel and Nazneen Fatema Rajani},
      year={2021},
      eprint={2104.07605},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2104.07605}
}

Acknowledgements

We thank Michael Correll for his valuable feedback.

Owner
Robustness Gym
Building tools for evaluating and repairing ML models.
Robustness Gym
BrowZen correlates your emotional states with the web sites you visit to give you actionable insights about how you spend your time browsing the web.

BrowZen BrowZen correlates your emotional states with the web sites you visit to give you actionable insights about how you spend your time browsing t

Nick Bild 36 Sep 28, 2022
A simple agent-based model used to teach the basics of OOP in my lectures

Pydemic A simple agent-based model of a pandemic. This is used to teach basic principles of object-oriented programming to master students. It is not

Fabien Maussion 2 Jun 08, 2022
Chem: collection of mostly python code for molecular visualization, QM/MM, FEP, etc

chem: collection of mostly python code for molecular visualization, QM/MM, FEP,

5 Sep 02, 2022
100 Days of Code The Complete Python Pro Bootcamp for 2022

100-Day-With-Python 100 Days of Code - The Complete Python Pro Bootcamp for 2022. In this course, I spend with python language over 100 days, and I up

Rajdip Das 8 Jun 22, 2022
Scientific measurement library for instruments, experiments, and live-plotting

PyMeasure scientific package PyMeasure makes scientific measurements easy to set up and run. The package contains a repository of instrument classes a

PyMeasure 445 Jan 04, 2023
又一个云探针

ServerStatus-Murasame 感谢ServerStatus-Hotaru,又一个云探针诞生了(大雾 本项目在ServerStatus-Hotaru的基础上使用fastapi重构了服务端,部分修改了客户端与前端 项目还在非常原始的阶段,可能存在严重的问题 演示站:https://stat

6 Oct 19, 2021
A simple Monte Carlo simulation using Python and matplotlib library

Monte Carlo python simulation Install linux dependencies sudo apt update sudo apt install build-essential \ software-properties-commo

Samuel Terra 2 Dec 13, 2021
Lime: Explaining the predictions of any machine learning classifier

lime This project is about explaining what machine learning classifiers (or models) are doing. At the moment, we support explaining individual predict

Marco Tulio Correia Ribeiro 10.3k Dec 29, 2022
A way of looking at COVID-19 data that I haven't seen before.

Visualizing Omicron: COVID-19 Deaths vs. Cases Click here for other countries. Data is from Our World in Data/Johns Hopkins University. About this pro

1 Jan 10, 2022
Statistics and Visualization of acceptance rate, main keyword of CVPR 2021 accepted papers for the main Computer Vision conference (CVPR)

Statistics and Visualization of acceptance rate, main keyword of CVPR 2021 accepted papers for the main Computer Vision conference (CVPR)

Hoseong Lee 78 Aug 23, 2022
With Holoviews, your data visualizes itself.

HoloViews Stop plotting your data - annotate your data and let it visualize itself. HoloViews is an open-source Python library designed to make data a

HoloViz 2.3k Jan 04, 2023
a python function to plot a geopandas dataframe

Pretty GeoDataFrame A minimum python function (~60 lines) to draw pretty geodataframe. Based on matplotlib, shapely, descartes. Installation just use

haoming 27 Dec 05, 2022
Python scripts to manage Chia plots and drive space, providing full reports. Also monitors the number of chia coins you have.

Chia Plot, Drive Manager & Coin Monitor (V0.5 - April 20th, 2021) Multi Server Chia Plot and Drive Management Solution Be sure to ⭐ my repo so you can

338 Nov 25, 2022
Farhad Davaripour, Ph.D. 1 Jan 05, 2022
Quickly and accurately render even the largest data.

Turn even the largest data into images, accurately Build Status Coverage Latest dev release Latest release Docs Support What is it? Datashader is a da

HoloViz 2.9k Dec 28, 2022
A simple interpreted language for creating basic mathematical graphs.

graphr Introduction graphr is a small language written to create basic mathematical graphs. It is an interpreted language written in python and essent

2 Dec 26, 2021
Write python locally, execute SQL in your data warehouse

RasgoQL Write python locally, execute SQL in your data warehouse ≪ Read the Docs · Join Our Slack » RasgoQL is a Python package that enables you to ea

Rasgo 265 Nov 21, 2022
Declarative statistical visualization library for Python

Altair http://altair-viz.github.io Altair is a declarative statistical visualization library for Python. With Altair, you can spend more time understa

Altair 8k Jan 05, 2023
Learning Convolutional Neural Networks with Interactive Visualization.

CNN Explainer An interactive visualization system designed to help non-experts learn about Convolutional Neural Networks (CNNs) For more information,

Polo Club of Data Science 6.3k Jan 01, 2023
Certificate generating and sending system written in Python.

Certificate Generator & Sender How to use git clone https://github.com/saadhaxxan/Certificate-Generator-Sender.git cd Certificate-Generator-Sender Add

Saad Hassan 11 Dec 01, 2022