Korean extractive summarization. 2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드

Overview

korean extractive summarization

2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드

Leaderboard

1

Notice

  • Text Summarization with Pretrained Encoders에 나오는 bertsumext모델(extractive summarization을 위해 bert위에 추가적으로 inter-sentence 레이어를 얹은구조)의 bert를 klue/roberta-large모델로 대체하여 구성하였음.
  • uoneway님 KoBertSum 레포지토리를 기반으로 만들어짐.
  • 수정된 부분 - pytorch 1.1 ->pytorch 1.7.1버전 지원하도록 수정.
  • 수정된 부분 - transformers 4.0 버전 지원하도록 수정, klue/roberta-large 포팅
  • 수정된 부분 - 불필요한 부분 삭제 or 수정

Process

  1. Environment Setting
pip install -r requirements.txt
python src/others/install_mecab.py # mecab설치
  1. Preprocess( ./ext/data/raw/train.jsonl, ./ext/data/raw/test.jsonl이 있어야 함)
python main.py -task make_data -n_cpus 5
  1. Train
python main.py -task train -target_summary_sent abs -visible_gpus 0
  1. Validation(path에 있는 모델파일 전부 validation하는 코드임.)
python main.py -task valid -model_path 1209_1236
  1. Test and submission 파일 생성
python main.py -task test -test_from 1209_1236/model_step_500.pt -visible_gpus 0
cd ext/results/
python get_submission.py -filename result_1209_1236_step_500.candidate.jsonl

포함되지 않은 부분

  • 대회에선, ensemble 이용해서 rouge-L 53.15 -> 53.5 으로 끌어올렸는데, 간단하니까 필요하신 분들은 구현해서 사용하시면 성능향상에 도움이 될 듯.

  • 추가로 데이터셋 폼(jsonl각 line)은 이렇게 구성됨(세줄요약 데이터셋)

{"category": "none", "id": 0, "article_original": ["","","","",""], "extractive": [2, 3, 4], "abstractive": "", "extractive_sents": ["", "", ""]}

Reference

Owner
NLP, ML research
Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

SEW (Squeezed and Efficient Wav2vec) The repo contains the code of the paper "Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speec

ASAPP Research 67 Dec 01, 2022
A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset 台達閱讀理解資料集 Delta Reading Comprehension Dataset (DRCD) 屬於通用領域繁體中文機器閱讀理解資料集。 本資料集期望成為適用於遷移學習之標準中文閱讀理解資料集。 本資料集從2,108篇

272 Dec 15, 2022
Translators - is a library which aims to bring free, multiple, enjoyable translation to individuals and students in Python

Translators - is a library which aims to bring free, multiple, enjoyable translation to individuals and students in Python

UlionTse 907 Dec 27, 2022
Source code for the paper "TearingNet: Point Cloud Autoencoder to Learn Topology-Friendly Representations"

TearingNet: Point Cloud Autoencoder to Learn Topology-Friendly Representations Created by Jiahao Pang, Duanshun Li, and Dong Tian from InterDigital In

InterDigital 21 Dec 29, 2022
AudioCLIP Extending CLIP to Image, Text and Audio

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

458 Jan 02, 2023
FedNLP: A Benchmarking Framework for Federated Learning in Natural Language Processing

FedNLP is a research-oriented benchmarking framework for advancing federated learning (FL) in natural language processing (NLP). It uses FedML repository as the git submodule. In other words, FedNLP

FedML-AI 216 Nov 27, 2022
Search for documents in a domain through Google. The objective is to extract metadata

MetaFinder - Metadata search through Google _____ __ ___________ .__ .___ / \

Josué Encinar 85 Dec 16, 2022
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Espresso Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning libra

Yiming Wang 919 Jan 03, 2023
Sentello is python script that simulates the anti-evasion and anti-analysis techniques used by malware.

sentello Sentello is a python script that simulates the anti-evasion and anti-analysis techniques used by malware. For techniques that are difficult t

Malwation 62 Oct 02, 2022
This repository contains helper functions which can help you generate additional data points depending on your NLP task.

NLP Albumentations For Data Augmentation This repository contains helper functions which can help you generate additional data points depending on you

Aflah 6 May 22, 2022
📔️ Generate a text-based journal from a template file.

JGen 📔️ Generate a text-based journal from a template file. Contents Getting Started Example Overview Usage Details Reserved Keywords Gotchas Getting

Harrison Broadbent 21 Sep 25, 2022
Py65 65816 - Add support for the 65C816 to py65

Add support for the 65C816 to py65 Py65 (https://github.com/mnaberez/py65) is a

4 Jan 04, 2023
Search-Engine - 📖 AI based search engine

Search Engine AI based search engine that was trained on 25000 samples, feel free to train on up to 1.2M sample from kaggle dataset, link below StackS

Vladislav Kruglikov 2 Nov 29, 2022
Klexikon: A German Dataset for Joint Summarization and Simplification

Klexikon: A German Dataset for Joint Summarization and Simplification Dennis Aumiller and Michael Gertz Heidelberg University Under submission at LREC

Dennis Aumiller 8 Jan 03, 2023
Code for ACL 2021 main conference paper "Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances".

Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances This repository contains the code and pre-trained mode

ICTNLP 90 Dec 27, 2022
Sentiment Analysis Project using Count Vectorizer and TF-IDF Vectorizer

Sentiment Analysis Project This project contains two sentiment analysis programs for Hotel Reviews using a Hotel Reviews dataset from Datafiniti. The

Simran Farrukh 0 Mar 28, 2022
Automated question generation and question answering from Turkish texts using text-to-text transformers

Turkish Question Generation Offical source code for "Automated question generation & question answering from Turkish texts using text-to-text transfor

Open Business Software Solutions 29 Dec 14, 2022
Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, Explosion AI 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 French 1.2.3 German 1.2

Explosion 70 Dec 12, 2022
A Telegram bot to add notes to Flomo.

flomo bot 使用 Telegram 机器人发送笔记到你的 Flomo. 你需要有一台可访问 Telegram 的服务器。 Steps @BotFather 新建机器人,获取 token Flomo 官网获取 API,链接 https://flomoapp.com/mine?source=in

Zhen 44 Dec 30, 2022
TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

TweebankNLP This repo contains the new Tweebank-NER dataset and off-the-shelf Twitter-Stanza pipeline for state-of-the-art Tweet NLP, as described in

Laboratory for Social Machines 84 Dec 20, 2022