GraphSum

This repo contains the data and code for the G2G model in the paper: Extracting Summary Knowledge Graphs from Long Documents. The other baseline TTG is simply based on BertSumExt.

Environment Setup

This code is tested on python 3.6.9, transformer 3.0.2 and pytorch 1.7.0. You would also need numpy and scipy packages.

Data

Download and unzip the data from this link. Put the unzipped folder named as ./data parallel with ./src. You should see four subfolders under ./data/json, corresponding to four data splits as described in the paper.

Under each subfolder, the json file contains all document full texts, abstracts as well as the summarized graphs obtained from the abstract, organized by the document keys. Each full text consists of a list of sections. Each summarized graph contains a list of entity and relation mentions. Except for the test split, other three data splits have their summarized graphs obtained by running DyGIE++ on the abstract. The test set have manually annotated summarized graphs from SciERC dataset. The format of the graph follows the output of DyGIE++, where each entity mention in a section is represented by (start token id, end token id, entity type) and each relation mention is represented by (start token id of entity 1, end token id of entity 1, start token id of entity 2, end token id of entity 2, relation type). The graph also contains a list of coreferential entity mentions.

You should also see two subfolders under the processed folder of each data split: merged_entities and aligned_entities. merged_entities contains the full and summarized graphs for each document, where the graph vertices are cluster of entity mentions. Entity clusters in each summarized graph are coreferential entity mentions predicted by DyGIE++ or annotated (in test set). Entity clusters in each full graph contains entity mentions that are coreferences or share the same non-generic string names (as described in our paper). Under merged_entities, we provide entity clusters and relations between entity clusters, as well as corresponding entity and relation mentions in the full paper or abstract. Each relation is represented by "[entity cluster id 1]_[entity cluster id 2]_[relation type]". The original full graphs with all entity and relation mentions are obtained by running DyGIE++ on the document full text. You don't need them to run the code, but you can find them here. For some entity names, you may see a trailing string "<GENERIC_ID> [number]". It means these entity names are classified by DyGIE++ as "generic" and the trailing string is used to differentiate the same entity name strings in different clusters in such cases.

aligned_entities contains the pre-calculated alignment between entity clusters (see Section 5.1 in the paper) in the summarized and full graphs for each document. In each entity alignment file, under each entity cluster of the summarized graph, there is a list of entity clusters from the full graph if the list is not empty. They are used to facilitate data preprocessing of G2G and evaluation.

Training and Evaluation

The model is based on GAT. Go to ./src and run bash run.sh. You can also find the pretrained model here. Put it under ./src/output and run the inference and evaluation parts in ./src/run.sh.

Extracting Summary Knowledge Graphs from Long Documents

Related tags

Overview

GraphSum

Environment Setup

Data

Training and Evaluation

Owner

Zeqiu (Ellen) Wu

justCTF [*] 2020 challenges sources

Transcribing audio files using Hugging Face's implementation of Wav2Vec2 + "chain-linking" NLP tasks to combine speech-to-text with downstream tasks like translation and summarisation.

Repositório da disciplina no semestre 2021-2

🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.

Official PyTorch Implementation of paper "NeLF: Neural Light-transport Field for Single Portrait View Synthesis and Relighting", EGSR 2021.

Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

List of GSoC organisations with number of times they have been selected.

Nateve compiler developed with python.

2021搜狐校园文本匹配算法大赛baseline

Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models

ConvBERT: Improving BERT with Span-based Dynamic Convolution

GPT-3 command line interaction

A fast Text-to-Speech (TTS) model. Work well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (so far). 快速语音合成模型，适用于英语、普通话/中文、日语、韩语、俄语和藏语（当前已测试）。

Fast, DB Backed pretrained word embeddings for natural language processing.

Code to reproduce the results of the paper 'Towards Realistic Few-Shot Relation Extraction' (EMNLP 2021)

Labelling platform for text using distant supervision

novel deep learning research works with PaddlePaddle

Code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

🕹 An esoteric language designed so that the program looks like the transcript of a Pokémon battle

Shared code for training sentence embeddings with Flax / JAX