Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Overview

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

📥 Download Datasets
📥 Download Trained Models

INTRODUCTION

TH2ZH (Thai-to-Simplified Chinese) and TH2EN (Thai-to-English) are cross-lingual summarization (CLS) datasets. The source articles of these datasets are from TR-TPBS dataset, a monolingual Thai text summarization dataset. To create CLS dataset out of TR-TPBS, we used a neural machine translation service to translate articles into target languages. For some reasons, we were strongly recommended not to mention the name of the service that we used 🥺 . We will refer to the service we used as ‘main translation service’.

Cross-lingual summarization (cross-sum) is a task to summarize a given document written in one language to another language short summary.

crosslingual summarization

Traditional cross-sum approaches are based on two techniques namely early translation technique and late translation technique. Early translation can be explained easily as translate-then-summarize method. Late translation, in reverse, is summarize-then-translate method.

However, classical cross-sum methods tend to carry errors from monolingual summarization process or translation process to final cross-language output summary. Several end-to-end approaches have been proposed to tackle problems of traditional ones. Couple of end-to-end models are available to download as well.

DATASET CONSTRUCTION

💡 Important Note In contrast to Zhu, et al, in our experiment, we found that filtering out articles using RTT technique worsen the overall performance of the end-to-end models significantly. Therefore, full datasets are highly recommended.

We used TR-TPBS as source documents for creating cross-lingual summarization dataset. In the same way as Zhu, et al., we constructed Th2En and Th2Zh by translating the summary references into target languages using translation service and filtered out those poorly-translated summaries using round-trip translation technique (RTT). The overview of cross-lingual summarization dataset construction is presented in belowe figure. Please refer to the corresponding paper for more details on RTT.

crosslingual summarization In our experiment, we set 𝑇1 and 𝑇2 equal to 0.45 and 0.2 respectively, backtranslation technique filtered out 27.98% from Th2En and 56.79% documents from Th2Zh.

python3 src/tools/cls_dataset_construction.py \
--dataset th2en \
--input_csv path/to/full_dataset.csv \
--output_csv path/to/save/filtered_csv \
--r1 0.45 \
--r2 0.2
  • --dataset can be {th2en, th2zh}.
  • --r1 and --r2 are where you can set ROUGE score thresholds (r1 and r2 represent ROUGE-1 and ROUGE-2 respectively) for filtering (assumingly) poor translated articles.

Dataset Statistic

Click the file name to download.

File Number of Articles Size
th2en_full.csv 310,926 2.96 GB
th2zh_full.csv 310,926 2.81 GB
testset.csv 3,000 44 MB
validation.csv 3,000 43 MB

Data Fields

Please refer to th2enzh_data_exploration.ipynb for more details.

Column Description
th_body Original Thai body text
th_sum Original Thai summary
th_title Original Thai Article headline
{en/zh}_body Translated body text
{en/zh}_sum Translated summary
{en/zh}_title Translated article's headline
{en/zh}2th Back translation of{en/zh}_body
{en/zh}_gg_sum Translated summary (by Google Translation)
url URL to original article’s webpage
  • {th/en/zh}_title are only available in test set.
  • {en/zh}_gg_sum are also only available in test set. We (at the time this experiment took place) assumed that Google translation was better than the main translation service we were using. We intended to use these Google translated summaries as some kind of alternative summary references, but in the end, they never been used. We decided to make them available in the test set anyway, just in case the others find them useful.
  • {en/zh}_body were not presented during training end-to-end models. They were used only in early translation methods.

AVAILABLE TRAINED MODELS

Model Corresponding Paper Thai -> English Thai -> Simplified Chinese
Full Filtered Full Filtered
TNCLS Zhu et al., 2019 - Available - -
CLS+MS Zhu et al., 2019 Available - - -
CLS+MT Zhu et al., 2019 Available - Available -
XLS – RL-ROUGE Dou et al., 2020 Available - Available -

To evaluate these trained models, please refer to xls_model_evaluation.ipynb and ncls_model_evaluation.ipynb.

If you wish to evaluate the models with our test sets, you can use below script to create test files for XLS and NCLS models.

python3 src/tools/create_cls_test_manifest.py \
--test_csv_path path/to/testset.csv \
--output_dir path/to/save/testset_files \
--use_google_sum {true/false} \
--max_tokens 500 \
--create_ms_ref {true/false}
  • output_dir is path to directory that you want to save test set files
  • use_google_sum can be {true/false}. If true, it will select summary reference from columns {en/zh}_gg_sum. Default is false.
  • max_tokens number of maximum words in input articles. Default is 500 words. Too short or too long articles can significantly worsen performance of the models.
  • create_ms_ref whether to create Thai summary reference file to evaluate MS task in NCLS:CLS+MS model.

This script will produce three files namely test.CLS.source.thai.txt and test.CLS.target.{en/zh}.txt. test.CLS.source.thai.txt is used as a test file for cls task. test.CLS.target.{en/zh}.txt are the crosslingual summary reference for English and Chinese, they are used to evaluate ROUGE and BertScore. Each line is corresponding to the body articles in test.CLS.source.thai.txt.

🥳 We also evaluated MT tasks in XLS and NCLS:CLS+MT models. Please refers to xls_model_evaluation.ipynb and ncls_model_evaluation.ipynb for BLUE score results . For test sets that we used to evaluate MT task, please refer to data/README.md.

EXPERIMENT RESULTS

🔆 It has to be noted that all of end-to-end models reported in this section were trained on filtered datasets NOT full datasets. And for all end-to-end models, only `th_body` and `{en/zh}_sum` were present during training. We trained end-to-end models for 1,000,000 steps and selected model checkpoints that yielded the highest overall ROUGE scores to report the experiment.

In this experiment, we used two automatic evaluation matrices namely ROUGE and BertScore to assess the performance of CLS models. We evaluated ROUGE on Chinese text at word-level, NOT character level.

We only reported BertScore on abstractive summarization models. To evaluate the results with BertScore we used weights from ‘roberta-large’ and ‘bert-base-chinese’ pretrained models for Th2En and Th2Zh respectively.

Model Thai to English Thai to Chinese
ROUGE BertScore ROUGE BertScore
R1 R2 RL F1 R1 R2 RL F1
Traditional Approaches
Translated Headline 23.44 6.99 21.49 - 21.55 4.66 18.58 -
ETrans → LEAD2 51.96 42.15 50.01 - 44.18 18.83 43.84 -
ETrans → BertSumExt 51.85 38.09 49.50 - 34.58 14.98 34.84 -
ETrans → BertSumExtAbs 52.63 32.19 48.14 88.18 35.63 16.02 35.36 70.42
BertSumExt → LTrans 42.33 27.33 34.85 - 28.11 18.85 27.46 -
End-to-End Training Approaches
TNCLS 26.48 6.65 21.66 85.03 27.09 6.69 21.99 63.72
CLS+MS 32.28 15.21 34.68 87.22 34.34 12.23 28.80 67.39
CLS+MT 42.85 19.47 39.48 88.06 42.48 19.10 37.73 71.01
XLS – RL-ROUGE 42.82 19.62 39.53 88.03 43.20 19.19 38.52 72.19

LICENSE

Thai crosslingual summarization datasets including TH2EN, TH2ZH, test and validation set are licensed under MIT License.

ACKNOWLEDGEMENT

  • These cross-lingual datasets and the experiments are parts of Nakhun Chumpolsathien ’s master’s thesis at school of computer science, Beijing Institute of Technology. Therefore, as well, a great appreciation goes to his supervisor, Assoc. Prof. Gao Yang.
  • Shout out to Tanachat Arayachutinan for the initial data processing and for introducing me 麻辣烫, 黄焖鸡.
  • We would like to thank Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications for providing computing resources to conduct the experiment.
  • In this experiment, we used PyThaiNLP v. 2.2.4 to tokenize (on both word & sentence levels) Thai texts. For Chinese and English segmentation, we used Stanza.
Owner
Nakhun Chumpolsathien
I thought it was fun.
Nakhun Chumpolsathien
BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Table of contents Introduction Using BARTpho with fairseq Using BARTpho with transformers Notes BARTpho: Pre-trained Sequence-to-Sequence Models for V

VinAI Research 58 Dec 23, 2022
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
Skipgram Negative Sampling in PyTorch

PyTorch SGNS Word2Vec's SkipGramNegativeSampling in Python. Yet another but quite general negative sampling loss implemented in PyTorch. It can be use

Jamie J. Seol 287 Dec 14, 2022
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large 💻 GitHub Repository 📚 Documentat

Xing Han Lu 244 Dec 30, 2022
pyMorfologik MorfologikpyMorfologik - Python binding for Morfologik.

Python binding for Morfologik Morfologik is Polish morphological analyzer. For more information see http://github.com/morfologik/morfologik-stemming/

Damian Mirecki 18 Dec 29, 2021
[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

Cambridge Language Technology Lab 61 Dec 10, 2022
Yes it's true :broken_heart:

Information WARNING: No longer hosted If you would like to be on this repo's readme simply fork or star it! Forks 1 - Flowzii 2 - Errorcrafter 3 - vk-

Dropout 66 Dec 31, 2022
Transformer Based Korean Sentence Spacing Corrector

TKOrrector Transformer Based Korean Sentence Spacing Corrector License Summary This solution is made available under Apache 2 license. See the LICENSE

Paul Hyung Yuel Kim 3 Apr 18, 2022
Code to use Augmented Shapiro Wilks Stopping, as well as code for the paper "Statistically Signifigant Stopping of Neural Network Training"

This codebase is being actively maintained, please create and issue if you have issues using it Basics All data files are included under losses and ea

Justin Terry 32 Nov 09, 2021
CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

Meta Research 29 Nov 30, 2022
Snips Python library to extract meaning from text

Snips NLU Snips NLU (Natural Language Understanding) is a Python library that allows to extract structured information from sentences written in natur

Snips 3.7k Dec 30, 2022
Fast, general, and tested differentiable structured prediction in PyTorch

Torch-Struct: Structured Prediction Library A library of tested, GPU implementations of core structured prediction algorithms for deep learning applic

HNLP 1.1k Dec 16, 2022
Convolutional Neural Networks for Sentence Classification

Convolutional Neural Networks for Sentence Classification Code for the paper Convolutional Neural Networks for Sentence Classification (EMNLP 2014). R

Yoon Kim 2k Jan 02, 2023
Google's Meena transformer chatbot implementation

Here's my attempt at recreating Meena, a state of the art chatbot developed by Google Research and described in the paper Towards a Human-like Open-Domain Chatbot.

Francesco Pham 94 Dec 25, 2022
A crowdsourced dataset of dialogues grounded in social contexts involving utilization of commonsense.

A crowdsourced dataset of dialogues grounded in social contexts involving utilization of commonsense.

Alexa 62 Dec 20, 2022
NVDA, the free and open source Screen Reader for Microsoft Windows

NVDA NVDA (NonVisual Desktop Access) is a free, open source screen reader for Microsoft Windows. It is developed by NV Access in collaboration with a

NV Access 1.6k Jan 07, 2023
Wrapper to display a script output or a text file content on the desktop in sway or other wlroots-based compositors

nwg-wrapper This program is a part of the nwg-shell project. This program is a GTK3-based wrapper to display a script output, or a text file content o

Piotr Miller 94 Dec 27, 2022
List of GSoC organisations with number of times they have been selected.

Welcome to GSoC Organisation Frequency And Details 👋 List of GSoC organisations with number of times they have been selected, techonologies, topics,

Shivam Kumar Jha 41 Oct 01, 2022
A Chinese to English Neural Model Translation Project

ZH-EN NMT Chinese to English Neural Machine Translation This project is inspired by Stanford's CS224N NMT Project Dataset used in this project: News C

Zhenbang Feng 29 Nov 26, 2022
Natural Language Processing

NLP Natural Language Processing apps Multilingual_NLP.py start #This script is demonstartion of Mul

Ritesh Sharma 1 Oct 31, 2021