Automated question generation and question answering from Turkish texts using text-to-text transformers

Overview
citation

If you use this software in your work, please cite as:

@article{akyon2021automated,
  title={Automated question generation and question answering from Turkish texts using text-to-text transformers},
  author={Akyon, Fatih Cagatay and Cavusoglu, Devrim and Cengiz, Cemil and Altinuc, Sinan Onur and Temizel, Alptekin},
  journal={arXiv preprint arXiv:2111.06476},
  year={2021}
}
install
git clone https://github.com/obss/turkish-question-generation.git
cd turkish-question-generation
pip install -r requirements.txt
train
  • start a training using args:
python run.py --model_name_or_path google/mt5-small  --output_dir runs/exp1 --do_train --do_eval --tokenizer_name_or_path mt5_qg_tokenizer --per_device_train_batch_size 4 --gradient_accumulation_steps 2 --learning_rate 1e-4 --seed 42 --save_total_limit 1
python run.py config.json
python run.py config.yaml
evaluate
  • arrange related params in config:
do_train: false
do_eval: true
eval_dataset_list: ["tquad2-valid", "xquad.tr"]
prepare_data: true
mt5_task_list: ["qa", "qg", "ans_ext"]
mt5_qg_format: "both"
no_cuda: false
  • start an evaluation:
python run.py config.yaml
neptune
  • install neptune:
pip install neptune-client
  • download config file and arrange neptune params:
run_name: 'exp1'
neptune_project: 'name/project'
neptune_api_token: 'YOUR_API_TOKEN'
  • start a training:
python train.py config.yaml
wandb
  • install wandb:
pip install wandb
  • download config file and arrange wandb params:
run_name: 'exp1'
wandb_project: 'turque'
  • start a training:
python train.py config.yaml
finetuned checkpoints
Name Model data
train
params
(M)
model size
(GB)
turque-s1 mt5-small tquad2-train+tquad2-valid+xquad.tr 60M 1.2GB
mt5-small-3task-both-tquad2 mt5-small tquad2-train 60M 1.2GB
mt5-small-3task-prepend-tquad2 mt5-small tquad2-train 60M 1.2GB
mt5-base-3task-both-tquad2 mt5-base tquad2-train 220M 2.3GB
format
  • answer extraction:

input:

Osman Bey 1258 yılında Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi." ">
"
      
        Osman Bey 1258 yılında Söğüt’te doğdu. 
       
         Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

       
      

target:


    
      1258 
     
       Söğüt’te 
      

      
     
    
  • question answering:

input:

"question: Osman Bey nerede doğmuştur? context: Osman Bey 1258 yılında Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

target:

"Söğüt’te"
  • question generation (prepend):

input:

"answer: Söğüt’te context: Osman Bey 1258 yılında Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

target:

"Osman Bey nerede doğmuştur?"
  • question generation (highlight):

input:

Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi." ">
"generate question: Osman Bey 1258 yılında 
     
       Söğüt’te 
      
        doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

      
     

target:

"Osman Bey nerede doğmuştur?"
  • question generation (both):

input:

Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi." ">
"answer: Söğüt’te context: Osman Bey 1258 yılında 
     
       Söğüt’te 
      
        doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

      
     

target:

"Osman Bey nerede doğmuştur?"
paper results
BERTurk-base and mT5-base QA evaluation results for TQuADv2 fine-tuning.

mT5-base QG evaluation results for single-task (ST) and multi-task (MT) for TQuADv2 fine-tuning.

TQuADv1 and TQuADv2 fine-tuning QG evaluation results for multi-task mT5 variants. MT-Both means, mT5 model is fine-tuned with ’Both’ input format and in a multi-task setting.

paper configs

You can find the config files used in the paper under configs/paper.

contributing

Before opening a PR:

  • Install required development packages:
pip install "black==21.7b0" "flake8==3.9.2" "isort==5.9.2"
  • Reformat with black and isort:
black . --config pyproject.toml
isort .
You might also like...
NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT
NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT

NeuralQA: A Usable Library for (Extractive) Question Answering on Large Datasets with BERT Still in alpha, lots of changes anticipated. View demo on n

Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统
Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统

Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统

Baseline code for Korean open domain question answering(ODQA)
Baseline code for Korean open domain question answering(ODQA)

Open-Domain Question Answering(ODQA)는 다양한 주제에 대한 문서 집합으로부터 자연어 질의에 대한 답변을 찾아오는 task입니다. 이때 사용자 질의에 답변하기 위해 주어지는 지문이 따로 존재하지 않습니다. 따라서 사전에 구축되어있는 Knowl

Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 (Rajpurkar et al., 2018) dataset, where each question in the dev set is annotated to add a contextual disfluency using the paragraph as a source of distractors.

CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

chaii - hindi & tamil question answering

chaii - hindi & tamil question answering This is the solution for rank 5th in Kaggle competition: chaii - Hindi and Tamil Question Answering. The comp

Contact Extraction with Question Answering.

contactsQA Extraction of contact entities from address blocks and imprints with Extractive Question Answering. Goal Input: Dr. Max Mustermann Hauptstr

BERT-based Financial Question Answering System
BERT-based Financial Question Answering System

BERT-based Financial Question Answering System In this example, we use Jina, PyTorch, and Hugging Face transformers to build a production-ready BERT-b

Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Owner
Open Business Software Solutions
Open Source for Open Business
Open Business Software Solutions
Conversational text Analysis using various NLP techniques

Conversational text Analysis using various NLP techniques

Rita Anjana 159 Jan 06, 2023
Lattice methods in TensorFlow

TensorFlow Lattice TensorFlow Lattice is a library that implements constrained and interpretable lattice based models. It is an implementation of Mono

504 Dec 20, 2022
🧪 Cutting-edge experimental spaCy components and features

spacy-experimental: Cutting-edge experimental spaCy components and features This package includes experimental components and features for spaCy v3.x,

Explosion 65 Dec 30, 2022
Creating a Feed of MISP Events from ThreatFox (by abuse.ch)

ThreatFox2Misp Creating a Feed of MISP Events from ThreatFox (by abuse.ch) What will it do? This will fetch IOCs from ThreatFox by Abuse.ch, convert t

17 Nov 22, 2022
ChatBotProyect - This is an unfinished project about a simple chatbot.

chatBotProyect This is an unfinished project about a simple chatbot. (union_todo.ipynb) Reminders for the project: Find why one of the vectorizers fai

Tomás 0 Jul 24, 2022
SimBERT升级版(SimBERTv2)!

RoFormer-Sim RoFormer-Sim,又称SimBERTv2,是我们之前发布的SimBERT模型的升级版。 介绍 https://kexue.fm/archives/8454 训练 tensorflow 1.14 + keras 2.3.1 + bert4keras 0.10.6 下载

317 Dec 23, 2022
NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

pretrain4ir_tutorial NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking 用作NLPIR实验室, Pre-training

ZYMa 12 Apr 07, 2022
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
Large-scale pretraining for dialogue

A State-of-the-Art Large-scale Pretrained Response Generation Model (DialoGPT) This repository contains the source code and trained model for a large-

Microsoft 1.8k Jan 07, 2023
A toolkit for document-level event extraction, containing some SOTA model implementations

Document-level Event Extraction via Heterogeneous Graph-based Interaction Model with a Tracker Source code for ACL-IJCNLP 2021 Long paper: Document-le

84 Dec 15, 2022
Help you discover excellent English projects and get rid of disturbing by other spoken language

GitHub English Top Charts 「Help you discover excellent English projects and get

GrowingGit 544 Jan 09, 2023
Voice Assistant inspired by Google Assistant, Cortana, Alexa, Siri, ...

author: @shival_gupta VoiceAI This program is an example of a simple virtual assitant It will listen to you and do accordingly It will begin with wish

Shival Gupta 1 Jan 06, 2022
Text to speech for Vietnamese, ez to use, ez to update

Chào mọi người, đây là dự án mở nhằm giúp việc đọc được trở nên dễ dàng hơn. Rất cảm ơn đội ngũ Zalo đã cung cấp hạ tầng để mình có thể tạo ra app này

Trần Cao Minh Bách 32 Jul 29, 2022
Machine translation models released by the Gourmet project

Gourmet Models Overview The Gourmet project has released several machine translation models to translate low-resource languages. This repository conta

Edinburgh NLP 5 Dec 08, 2021
Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles

Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles (TASLP 2022)

Zhuosheng Zhang 3 Apr 14, 2022
🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.

pySBD: Python Sentence Boundary Disambiguation (SBD) pySBD - python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detecti

Nipun Sadvilkar 549 Jan 06, 2023
تولید اسم های رندوم فینگیلیش

karafs کرفس تولید اسم های رندوم فینگیلیش installation ➜ pip install karafs usage دو زبانه ➜ karafs -n 10 توت فرنگی بی ناموس toot farangi-ye bi_namoos

Vaheed NÆINI (9E) 36 Nov 24, 2022
Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

0 Feb 13, 2022
Fuzzy String Matching in Python

FuzzyWuzzy Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

SeatGeek 8.8k Jan 01, 2023
The FinQA dataset from paper: FinQA: A Dataset of Numerical Reasoning over Financial Data

Data and code for EMNLP 2021 paper "FinQA: A Dataset of Numerical Reasoning over Financial Data"

Zhiyu Chen 114 Dec 29, 2022