Chinese question generator using the Delta Reading Comprehension Dataset (DRCD)

Last update: Oct 22, 2021

Overview

Transformer QG on DRCD

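The model input follows the highlight (HL) format: given a context C and an answer A, we integrate C and A into a new sequence C' by marking the answer span with [HL] tokens, in the following form.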
C' = [c1, c2, ..., [HL], a1, ..., a|A|, [HL], ..., c|C|]
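As a minimal sketch of the format above, the answer span can be wrapped in [HL] markers with plain string slicing. The helper name `highlight_answer` is hypothetical and not part of this repository, and it works on characters rather than on tokens, which is an assumption made for illustration.

```python
HL_TOKEN = "[HL]"


def highlight_answer(context: str, answer_start: int, answer_end: int) -> str:
    """Wrap context[answer_start:answer_end] in [HL] markers (hypothetical helper)."""
    if not 0 <= answer_start < answer_end <= len(context):
        raise ValueError("answer span must lie inside the context")
    return (
        context[:answer_start]
        + HL_TOKEN
        + context[answer_start:answer_end]
        + HL_TOKEN
        + context[answer_end:]
    )


context = "伊隆·里夫·馬斯克是一名企業家和商業大亨"
answer = "伊隆·里夫·馬斯克"

# Locate the answer inside the context, then mark it.
start = context.index(answer)
highlighted = highlight_answer(context, start, start + len(answer))
print(highlighted)  # [HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨
```

The resulting string has the same shape as the `context` field used in the API request example in this README.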

Proposed by Ying-Hong Chan & Yao-Chung Fan (2019), A Recurrent BERT-based Model for Question Generation.

We also have an English QG project: Transformer-QG-on-SQuAD

Features

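Complete pipeline, from fine-tuning to model evaluation
Supports many state-of-the-art language models
Built-in Flask server, so the model can quickly be served as an API server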

DRCD dataset

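The Delta Reading Comprehension Dataset (DRCD) is a general-domain Traditional Chinese machine reading comprehension dataset. It contains 10,014 paragraphs drawn from 2,108 Wikipedia articles, with more than 30,000 questions annotated on those paragraphs.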

Available models

BART (based on uer/bart-base-chinese-cluecorpussmall)

Use in Transformers

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  
tokenizer = AutoTokenizer.from_pretrained("p208p2002/bart-drcd-qg-hl")

model = AutoModelForSeq2SeqLM.from_pretrained("p208p2002/bart-drcd-qg-hl")
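Once the tokenizer and model are loaded, generating a question might look like the sketch below. The function names, the beam size, and the maximum length are illustrative assumptions rather than settings taken from this repository. The sketch also assumes the tokenizer is BERT-style: only `input_ids` and `attention_mask` are passed to `generate`, and the spaces that such tokenizers put between decoded Chinese characters are stripped.

```python
def clean_decoded(text: str) -> str:
    """Remove the whitespace a BERT-style tokenizer inserts between characters.

    Assumption: the output is Chinese text, so dropping all whitespace is safe.
    """
    return "".join(text.split())


def generate_question(
    highlighted_context: str,
    model_name: str = "p208p2002/bart-drcd-qg-hl",
    max_length: int = 64,  # illustrative value, not from the repository
    num_beams: int = 4,    # illustrative value, not from the repository
) -> str:
    """Generate one question for a context whose answer is wrapped in [HL]."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer(highlighted_context, return_tensors="pt")
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_beams=num_beams,
    )
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return clean_decoded(decoded)
```

Calling `generate_question("[HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨")` downloads the model on first use. The exact question returned depends on the decoding settings.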

Experiments

Model	Bleu 1	Bleu 2	Bleu 3	Bleu 4	METEOR	ROUGE-L
BART-HLSQG	34.25	27.70	22.43	18.13	23.58	36.88

Environment requirements

All development was done on Ubuntu.

If you don't have PyTorch 1.6 or later, please install or update it first:

https://pytorch.org/get-started/locally/
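A quick way to check the installed version against the 1.6 requirement is sketched below. The helper names are hypothetical, and the check compares only the major and minor numbers, so local build suffixes such as `+cu101` are ignored.

```python
import re


def parse_version(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.6.0+cu101'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str, minimum: tuple = (1, 6)) -> bool:
    """Return True if the version is at least the required (major, minor)."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        status = "OK" if meets_minimum(torch.__version__) else "too old, please update"
        print(f"PyTorch {torch.__version__}: {status}")
```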

Install packages: pip install -r requirements.txt
Set up the scorer: python setup_scorer.py
Download the dataset: python init_dataset.py

Training

Seq2Seq LM

usage: train_seq2seq_lm.py [-h]
                           [--base_model {bert-base-chinese,uer/bart-base-chinese-cluecorpussmall,p208p2002/bart-drcd-qg-hl}]
                           [-d {drcd}] [--batch_size BATCH_SIZE]
                           [--epoch EPOCH] [--lr LR] [--dev DEV] [--server]
                           [--run_test] [-fc FROM_CHECKPOINT]

optional arguments:
  -h, --help            show this help message and exit
  --base_model {bert-base-chinese,uer/bart-base-chinese-cluecorpussmall,p208p2002/bart-drcd-qg-hl}
  -d {drcd}, --dataset {drcd}
  --batch_size BATCH_SIZE
  --epoch EPOCH
  --lr LR
  --dev DEV
  --server
  --run_test
  -fc FROM_CHECKPOINT, --from_checkpoint FROM_CHECKPOINT
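To make the options above easier to read, here is a hedged reconstruction of a parser that accepts the same flags. It is inferred from the help text only and is not the repository's actual code. The `int` and `float` types are assumptions, `--dev` is left as a plain string because its meaning is not documented here, and no defaults are set because the help text does not show any.

```python
import argparse

BASE_MODELS = [
    "bert-base-chinese",
    "uer/bart-base-chinese-cluecorpussmall",
    "p208p2002/bart-drcd-qg-hl",
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (reconstruction, not the original)."""
    parser = argparse.ArgumentParser(prog="train_seq2seq_lm.py")
    parser.add_argument("--base_model", choices=BASE_MODELS)
    parser.add_argument("-d", "--dataset", choices=["drcd"])
    parser.add_argument("--batch_size", type=int)    # type is an assumption
    parser.add_argument("--epoch", type=int)         # type is an assumption
    parser.add_argument("--lr", type=float)          # type is an assumption
    parser.add_argument("--dev")                     # meaning not documented; kept as str
    parser.add_argument("--server", action="store_true")
    parser.add_argument("--run_test", action="store_true")
    parser.add_argument("-fc", "--from_checkpoint")
    return parser


# Illustrative invocation; the batch size and epoch count are made-up values.
args = build_parser().parse_args(
    ["--base_model", "uer/bart-base-chinese-cluecorpussmall",
     "-d", "drcd", "--batch_size", "8", "--epoch", "10"]
)
print(args.base_model, args.dataset, args.batch_size, args.epoch)
```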

Run as API server

From pre-trained (recommended)

python train_seq2seq_lm.py --server --base_model p208p2002/bart-drcd-qg-hl

From your own checkpoint

python train_xxx_lm.py --server --base_model YOUR_BASE_MODEL --from_checkpoint FROM_CHECKPOINT

Request example

curl --location --request POST 'http://127.0.0.1:5000/' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'context=[HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨'

{"predict": "哪一個人是一名企業家和商業大亨?"}
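The same request can be sent from Python using only the standard library. This is a sketch that assumes the server is running locally on port 5000 and returns JSON with a `predict` field, as in the example above. The function names `build_request` and `ask` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

DEFAULT_URL = "http://127.0.0.1:5000/"


def build_request(context: str, url: str = DEFAULT_URL) -> urllib.request.Request:
    """Build a form-encoded POST request carrying the highlighted context."""
    body = urllib.parse.urlencode({"context": context}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def ask(context: str, url: str = DEFAULT_URL) -> str:
    """Send the request and return the generated question from the 'predict' field."""
    with urllib.request.urlopen(build_request(context, url)) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["predict"]
```

With the server started as described above, `ask("[HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨")` would return the generated question as a string.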

Owner

Philip
