NLP and Text Generation Experiments in TensorFlow 2.x / 1.x

	Code has been run on Google Colab, thanks Google for providing computational resources

Natural Language Processing（自然语言处理）
- Text Classification（文本分类）
  - IMDB（ENG）
  - CLUE Emotion Analysis Dataset (CHN)
- Text Matching（文本匹配）
  - SNLI（ENG）
  - 微众银行智能客服（CHN）
  - 蚂蚁金融语义相似度 (CHN)
- Intent Detection and Slot Filling（意图检测与槽位填充）
  - ATIS（ENG）
- Retrieval Dialog（检索式对话）
  - ElasticSearch
    - Sparse Retrieval
    - Dense Retrieval
- Generative Dialog（生成式对话）
  - Large-scale Chinese Conversation Dataset (CHN)
- Multi-turn Dialogue Rewriting（多轮对话改写）
  - 20k 腾讯 AI 研发数据（CHN）
- Semantic Parsing（语义解析）
  - Facebook's Hierarchical Task Oriented Dialog（ENG）
- Multi-hop Question Answering（多跳问题回答）
  - bAbI（ENG）
- Text Visualization（文本可视化）
  - Topic Modelling
  - Explain Prediction
Knowledge Graph（知识图谱）
- Knowledge Graph Completion（知识图谱补全）
- Knowledge Base Question Answering（知识图谱问答）
Recommender System（推荐系统）
- Movielens 1M（English Data）

Text Classification

└── finch/tensorflow2/text_classification/imdb
	│
	├── data
	│   └── glove.840B.300d.txt          # pretrained embedding, download and put here
	│   └── make_data.ipynb              # step 1. make data and vocab: train.txt, test.txt, word.txt
	│   └── train.txt  		     # incomplete sample, format <label, text> separated by \t 
	│   └── test.txt   		     # incomplete sample, format <label, text> separated by \t
	│   └── train_bt_part1.txt  	     # (back-translated) incomplete sample, format <label, text> separated by \t
	│
	├── vocab
	│   └── word.txt                     # incomplete sample, list of words in vocabulary
	│	
	└── main
		└── sliced_rnn.ipynb         # step 2: train and evaluate model
		└── ...

Task: IMDB（English Data）

  Training Data: 25000, Testing Data: 25000, Labels: 2

<Notebook>: Make Data & Vocabulary
- <Text File>: Data Example
- <Text File>: Data Example (Back-Translated)
```
 Back-Translation increases training data from 25000 to 50000
 
 which is done by "english -> french -> english" translation
```
- <Text File>: Vocabulary Example

Model: TF-IDF + Logistic Regression (Sklearn)

Logistic Regression	Binary TF	NGram Range	Knowledge Dist	Testing Accuracy
<Notebook>	False	(1, 1)	False	88.3%
<Notebook>	True	(1, 1)	False	88.8%
<Notebook>	True	(1, 2)	False	89.6%
<Notebook>	True	(1, 2)	True	90.7%

-> PySpark Equivalent

Model: FastText, CNN and RNN

Code	Model	Testing Accuracy
<Notebook>	FastText (Unigram)	87.3%
<Notebook>	FastText (Bigram)	89.8%
<Notebook>	FastText (AutoTune)	90.1%
<Notebook>	TextCNN	91.8%
<Notebook>	Sliced RNN	92.6%

Model: Large-scale Transformer

TensorFlow 2 + transformers

Code	Model	Batch Size	Max Length	Testing Accuracy
<Notebook>	BERT	32	128	92.6%
<Notebook>	BERT	16	200	93.3%
<Notebook>	BERT	12	256	93.8%
<Notebook>	BERT	8	300	94%
<Notebook>	RoBERTa	8	300	94.7%

└── finch/tensorflow2/text_classification/clue
	│
	├── data
	│   └── make_data.ipynb              # step 1. make data and vocab
	│   └── train.txt  		     # download from clue benchmark
	│   └── test.txt   		     # download from clue benchmark
	│
	├── vocab
	│   └── label.txt                    # list of emotion labels
	│	
	└── main
		└── bert_finetune.ipynb      # step 2: train and evaluate model
		└── ...

Task: CLUE Emotion Analysis Dataset（Chinese Data）

  Training Data: 31728, Testing Data: 3967, Labels: 7

Model: TF-IDF + Linear Model

Logistic Regression	Binary TF	NGram Range	Split By	Testing Accuracy
<Notebook>	False	(1, 1)	Char	57.4%
<Notebook>	True	(1, 1)	Word	57.7%
<Notebook>	False	(1, 2)	Word	57.8%
<Notebook>	False	(1, 1)	Word	58.3%
<Notebook>	True	(1, 2)	Char	59.1%
<Notebook>	False	(1, 2)	Char	59.4%

Model: Deep Model

Code Model Env Testing Accuracy

<Notebook> BERT TF2 61.7%

<Notebook> BERT + TAPT (<Notebook>) TF2 62.3%

Code	Model	Env	Testing Accuracy
<Notebook>	BERT	TF2	61.7%
<Notebook>	BERT + TAPT (<Notebook>)	TF2	62.3%

Text Matching

└── finch/tensorflow2/text_matching/snli
	│
	├── data
	│   └── glove.840B.300d.txt       # pretrained embedding, download and put here
	│   └── download_data.ipynb       # step 1. run this to download snli dataset
	│   └── make_data.ipynb           # step 2. run this to generate train.txt, test.txt, word.txt 
	│   └── train.txt  		  # incomplete sample, format <label, text1, text2> separated by \t 
	│   └── test.txt   		  # incomplete sample, format <label, text1, text2> separated by \t
	│
	├── vocab
	│   └── word.txt                  # incomplete sample, list of words in vocabulary
	│	
	└── main              
		└── dam.ipynb      	  # step 3. train and evaluate model
		└── esim.ipynb      	  # step 3. train and evaluate model
		└── ......

Task: SNLI（English Data）

  Training Data: 550152, Testing Data: 10000, Labels: 3

Code	Reference	Env	Testing Accuracy
<Notebook>	DAM	TF2	85.3%
<Notebook>	Match Pyramid	TF2	87.1%
<Notebook>	ESIM	TF2	87.4%
<Notebook>	RE2	TF2	87.7%
<Notebook>	RE3	TF2	88.3%
<Notebook>	BERT	TF2	90.4%
<Notebook>	RoBERTa	TF2	91.1%

└── finch/tensorflow2/text_matching/chinese
	│
	├── data
	│   └── make_data.ipynb           # step 1. run this to generate char.txt and char.npy
	│   └── train.csv  		  # incomplete sample, format <text1, text2, label> separated by comma 
	│   └── test.csv   		  # incomplete sample, format <text1, text2, label> separated by comma
	│
	├── vocab
	│   └── cc.zh.300.vec             # pretrained embedding, download and put here
	│   └── char.txt                  # incomplete sample, list of chinese characters
	│   └── char.npy                  # saved pretrained embedding matrix for this task
	│	
	└── main              
		└── pyramid.ipynb      	  # step 2. train and evaluate model
		└── esim.ipynb      	  # step 2. train and evaluate model
		└── ......

Task: 微众银行智能客服（Chinese Data）

  Training Data: 100000, Testing Data: 10000, Labels: 2, Balanced

<Notebook>: Make Data & Vocabulary
- <Text File>: Data Example (数据示例)
- <Text File>: Vocabulary
Model (can be compared to this benchmark since the dataset is the same)

Code	Reference	Env	Split by	Testing Accuracy
<Notebook>	RE2	TF2	Word	82.5%
<Notebook>	ESIM	TF2	Char	82.5%
<Notebook>	Match Pyramid	TF2	Char	82.7%
<Notebook>	RE2	TF2	Char	83.8%
<Notebook>	BERT	TF2	Char	83.8%
<Notebook>	BERT-wwm	TF1 + bert4keras	Char	84.75%

└── finch/tensorflow2/text_matching/ant
	│
	├── data
	│   └── make_data.ipynb           # step 1. run this to generate char.txt and char.npy
	│   └── train.json           	  # incomplete sample, format <text1, text2, label> separated by comma 
	│   └── dev.json   		  # incomplete sample, format <text1, text2, label> separated by comma
	│
	├── vocab
	│   └── cc.zh.300.vec             # pretrained embedding, download and put here
	│   └── char.txt                  # incomplete sample, list of chinese characters
	│   └── char.npy                  # saved pretrained embedding matrix for this task
	│	
	└── main              
		└── pyramid.ipynb      	  # step 2. train and evaluate model
		└── bert.ipynb      	  # step 2. train and evaluate model
		└── ......

Task: 蚂蚁金融语义相似度（Chinese Data）

  Training Data: 34334, Testing Data: 4316, Labels: 2, Imbalanced

Code	Reference	Env	Split by	Testing Accuracy
<Notebook>	RE2	TF2	Char	66.5%
<Notebook>	Match Pyramid	TF2	Char	69.0%
<Notebook>	Match Pyramid + Joint Training	TF2	Char	70.3%
<Notebook>	BERT	TF2	Char	73.8%
<Notebook>	BERT + TAPT (<Notebook>)	TF2	Char	74.3%

Joint training

set data_1 = 微众银行智能客服 (size: 100000)
set data_2 = 蚂蚁金融语义相似度 (size: 34334)
joint training (size: 100000 + 34334 = 134334)

BERT	train by data_1	train by data_2	joint train	joint train + TAPT
Code	<Notebook>	<Notebook>	<Notebook>	<Notebook>
data_1 accuracy	83.8%	-	84.4%	85.0%
data_2 accuracy	-	73.8%	74.0%	74.9%

Intent Detection and Slot Filling

└── finch/tensorflow2/spoken_language_understanding/atis
	│
	├── data
	│   └── glove.840B.300d.txt           # pretrained embedding, download and put here
	│   └── make_data.ipynb               # step 1. run this to generate vocab: word.txt, intent.txt, slot.txt 
	│   └── atis.train.w-intent.iob       # incomplete sample, format <text, slot, intent>
	│   └── atis.test.w-intent.iob        # incomplete sample, format <text, slot, intent>
	│
	├── vocab
	│   └── word.txt                      # list of words in vocabulary
	│   └── intent.txt                    # list of intents in vocabulary
	│   └── slot.txt                      # list of slots in vocabulary
	│	
	└── main              
		└── bigru_clr.ipynb               # step 2. train and evaluate model
		└── ...

Task: ATIS（English Data）

  Training Data: 4978, Testing Data: 893

Code	Model	Helper	Env	Intent Accuracy	Slot Micro-F1
<Notebook>	CRF	-	crfsuite	-	92.6%
<Notebook>	Bi-GRU	-	TF2	97.4%	95.4%
<Notebook>	Bi-GRU	+ CRF	TF2	97.2%	95.8%
<Notebook>	Transformer	-	TF2	96.5%	95.5%
<Notebook>	Transformer	+ Time Weighting	TF2	97.2%	95.6%
<Notebook>	Transformer	+ Time Mixing	TF2	97.5%	95.8%
<Notebook>	Bi-GRU	+ ELMO	TF1	97.5%	96.1%
<Notebook>	Bi-GRU	+ ELMO + CRF	TF1	97.3%	96.3%

Retrieval Dialog

Task: Build a chatbot answering fundamental questions

Code	Engine	Encoder	Vector Type	Unit Test Accuracy
<Notebook>	Elastic Search	Default (TF-IDF)	Sparse	80%
<Notebook>	Elastic Search	Default (TF-IDF) + Segmentation	Sparse	90%
<Notebook>	Elastic Search	Bert	Dense	80%
<Notebook>	Elastic Search	Universal Sentence Encoder	Dense	100%

Semantic Parsing

└── finch/tensorflow2/semantic_parsing/tree_slu
	│
	├── data
	│   └── glove.840B.300d.txt     	# pretrained embedding, download and put here
	│   └── make_data.ipynb           	# step 1. run this to generate vocab: word.txt, intent.txt, slot.txt 
	│   └── train.tsv   		  	# incomplete sample, format <text, tokenized_text, tree>
	│   └── test.tsv    		  	# incomplete sample, format <text, tokenized_text, tree>
	│
	├── vocab
	│   └── source.txt                	# list of words in vocabulary for source (of seq2seq)
	│   └── target.txt                	# list of words in vocabulary for target (of seq2seq)
	│	
	└── main
		└── lstm_seq2seq_tf_addons.ipynb           # step 2. train and evaluate model
		└── ......

Task: Semantic Parsing for Task Oriented Dialog（English Data）

  Training Data: 31279, Testing Data: 9042

Code	Reference	Env	Testing Exact Match
<Notebook>	GRU Seq2Seq	TF2	74.1%
<Notebook>	LSTM Seq2Seq	TF2	74.1%
<Notebook>	GRU Pointer-Generator	TF2	80.4%
<Notebook>	GRU Pointer-Generator + Char Embedding	TF2	80.7%

The Exact Match result is higher than original paper

Knowledge Graph Completion

└── finch/tensorflow2/knowledge_graph_completion/wn18
	│
	├── data
	│   └── download_data.ipynb       	# step 1. run this to download wn18 dataset
	│   └── make_data.ipynb           	# step 2. run this to generate vocabulary: entity.txt, relation.txt
	│   └── wn18  		          	# wn18 folder (will be auto created by download_data.ipynb)
	│   	└── train.txt  		  	# incomplete sample, format <entity1, relation, entity2> separated by \t
	│   	└── valid.txt  		  	# incomplete sample, format <entity1, relation, entity2> separated by \t 
	│   	└── test.txt   		  	# incomplete sample, format <entity1, relation, entity2> separated by \t
	│
	├── vocab
	│   └── entity.txt                  	# incomplete sample, list of entities in vocabulary
	│   └── relation.txt                	# incomplete sample, list of relations in vocabulary
	│	
	└── main              
		└── distmult_1-N.ipynb    	# step 3. train and evaluate model
		└── ...

Task: WN18

  Training Data: 141442, Testing Data: 5000

<Notebook>: Download Data
- <Text File>: Data Example
<Notebook>: Make Vocabulary
- <Text File>: Vocabulary Example
We use the idea of multi-label classification to accelerate evaluation

Code	Reference	Env	MRR	[email protected]	[email protected]	[email protected]
<Notebook>	DistMult	TF2	0.797	0.938	0.902	0.688
<Notebook>	TuckER	TF2	0.885	0.939	0.909	0.853
<Notebook>	ComplEx	TF2	0.938	0.958	0.948	0.925

Knowledge Base Question Answering

Rule-based System（基于规则的系统）

For example, we want to answer the following questions with car knowledge:
```
 	What is BMW?
     	I want to know about the BMW
     	Please introduce the BMW to me
     	How is the BMW?
     	How is the BMW compared to the Benz?
```
- <Notebook> Regular Expression
- <Notebook> Regular Expression + POS Feature

Multi-hop Question Answering

└── finch/tensorflow1/question_answering/babi
	│
	├── data
	│   └── make_data.ipynb           		# step 1. run this to generate vocabulary: word.txt 
	│   └── qa5_three-arg-relations_train.txt       # one complete example of babi dataset
	│   └── qa5_three-arg-relations_test.txt	# one complete example of babi dataset
	│
	├── vocab
	│   └── word.txt                  		# complete list of words in vocabulary
	│	
	└── main              
		└── dmn_train.ipynb
		└── dmn_serve.ipynb
		└── attn_gru_cell.py

Task: bAbI（English Data）
- <Text File>: Data Example
- <Notebook>: Make Vocabulary
- Model: Dynamic Memory Network
  - TensorFlow 1
    - <Notebook> DMN -> 99.4% Testing Accuracy
    - Inference

Text Visualization

Topic Modelling
- Model: TF-IDF + LDA
- Data: IMDB Movie Reviews
- <Notebook> Code
- <Notebook> Visualization
Explain Prediction
- Model: SHAP
- Data: IMDB Movie Reviews
- <Notebook>

Recommender System

└── finch/tensorflow1/recommender/movielens
	│
	├── data
	│   └── make_data.ipynb           		# run this to generate vocabulary
	│
	├── vocab
	│   └── user_job.txt
	│   └── user_id.txt
	│   └── user_gender.txt
	│   └── user_age.txt
	│   └── movie_types.txt
	│   └── movie_title.txt
	│   └── movie_id.txt
	│	
	└── main              
		└── dnn_softmax.ipynb
		└── ......

Task: Movielens 1M（English Data）

  Training Data: 900228, Testing Data: 99981, Users: 6000, Movies: 4000, Rating: 1-5

<Notebook>: Make Vocabulary
- <Text File>: Data Example

Model: Fusion

Code	Scoring	LR Decay	Env	Testing MAE
<Notebook>	Sigmoid (Continuous)	Exponential	TF1	0.663
<Notebook>	Sigmoid (Continuous)	Cyclical	TF1	0.661
<Notebook>	Softmax (Discrete)	Exponential	TF1	0.633
<Notebook>	Softmax (Discrete)	Cyclical	TF1	0.628

The MAE results seem better than the all the results here and all the results here

Multi-turn Dialogue Rewriting

└── finch/tensorflow1/multi_turn_rewrite/chinese/
	│
	├── data
	│   └── make_data.ipynb         # run this to generate vocab, split train & test data, make pretrained embedding
	│   └── corpus.txt		# original data downloaded from external
	│   └── train_pos.txt		# processed positive training data after {make_data.ipynb}
	│   └── train_neg.txt		# processed negative training data after {make_data.ipynb}
	│   └── test_pos.txt		# processed positive testing data after {make_data.ipynb}
	│   └── test_neg.txt		# processed negative testing data after {make_data.ipynb}
	│
	├── vocab
	│   └── cc.zh.300.vec		# fastText pretrained embedding downloaded from external
	│   └── char.npy		# chinese characters and their embedding values (300 dim)	
	│   └── char.txt		# list of chinese characters used in this project 
	│	
	└── main              
		└── baseline_lstm_train.ipynb
		└── baseline_lstm_predict.ipynb
		└── ...

Task: 20k 腾讯 AI 研发数据（Chinese Data）

 data split as: training data (positive): 18986, testing data (positive): 1008

 Training data = 2 * 18986 because of 1:1 Negative Sampling

<Text File>: Full Data
<Notebook>: Make Data & Vocabulary & Pretrained Embedding
```
  There are six incorrect data and we have deleted them
```
- <Text File>: Positive Data Example
- <Text File>: Negative Data Example

Model (results can be compared to here with the same dataset)

Code	Model	Env	Exact Match	BLEU-1	BLEU-2	BLEU-4
<Notebook>	LSTM Seq2Seq + Dynamic Memory	TF1	56.2%	94.6	89.1	78.5
<Notebook>	GRU Seq2Seq + Dynamic Memory	TF1	56.2%	95.0	89.5	78.9
<Notebook>	GRU Pointer	TF1	59.2%	93.2	87.7	77.2
<Notebook>	GRU Pointer + Multi-Attention	TF1	60.2%	94.2	88.7	78.3

Deployment: first export the model

Inference Code Environment

<Notebook> Python

<Notebook> Java

Inference Code	Environment
<Notebook>	Python
<Notebook>	Java

Generative Dialog

└── finch/tensorflow1/free_chat/chinese_lccc
	│
	├── data
	│   └── LCCC-base.json           	# raw data downloaded from external
	│   └── LCCC-base_test.json         # raw data downloaded from external
	│   └── make_data.ipynb           	# step 1. run this to generate vocab {char.txt} and data {train.txt & test.txt}
	│   └── train.txt           		# processed text file generated by {make_data.ipynb}
	│   └── test.txt           			# processed text file generated by {make_data.ipynb}
	│
	├── vocab
	│   └── char.txt                	# list of chars in vocabulary for chinese
	│   └── cc.zh.300.vec			# fastText pretrained embedding downloaded from external
	│   └── char.npy			# chinese characters and their embedding values (300 dim)	
	│	
	└── main
		└── lstm_seq2seq_train.ipynb    # step 2. train and evaluate model
		└── lstm_seq2seq_infer.ipynb    # step 4. model inference
		└── ...

Task: Large-scale Chinese Conversation Dataset

  Training Data: 5000000 (sampled due to small memory), Testing Data: 19008

Data
- <Text File>: Data Example
- <Notebook>: Make Data & Vocabulary
  - <Text File>: Vocabulary Example

Model

Code	Model	Env	Test Case	Perplexity
<Notebook>	Transformer Encoder + LSTM Generator	TF1	<Notebook>	42.465
<Notebook>	LSTM Encoder + LSTM Generator	TF1	<Notebook>	41.250
<Notebook>	LSTM Encoder + LSTM Pointer-Generator	TF1	<Notebook>	36.525

If you want to deploy model in Java production

 └── FreeChatInference
 	│
 	├── data
 	│   └── transformer_export/
 	│   └── char.txt
 	│   └── libtensorflow-1.14.0.jar
 	│   └── tensorflow_jni.dll
 	│
 	└── src              
 		└── ModelInference.java

<Notebook> Java Inference

If you don't know the input and output node names in Java, you can display the node names:

 !saved_model_cli show --dir ../model/xxx/1587959473/ --tag_set serve --signature_def serving_default

Large Pre-trained GPT

Test Case	Model	Env	Author
<Notebook>	GPT LCCC base	TF1 + bert4keras	Tsinghua University
<Notebook>	GPT LCCC large	TF1 + bert4keras	Tsinghua University
<Notebook>	NEZHA LM	TF1 + bert4keras	Jianlin Su

Diverse Response Comparsion

问题 / 模型回复	Our RNN	GPT base	GPT large	Nezha
Q: 你好	A: 你好	A: 你好	A: 你好，你好，我是大好人。	A: 您好
Q: 早上好	A: 早上好	A: 早上好，好久不见	A: 早上好！	A: 早啊早啊
Q: 晚上好	A: 晚上好	A: 你好	A: 好久不见	A: 晚上好！
Q: 再见	A: 再见	A: 不见	A: 你是不是不爱我了	A: 我要走了
Q: 好久不见	A: 好久不见	A: 是呀，好久不见。	A: 是啊，好久不见。	A: 是啊，最近怎么样啊
Q: 想死你了	A: 我也是	A: 我也想你，	A: 我不信	A: 哈哈哈哈哈
Q: 谢谢你	A: 不客气	A: 不用谢	A: 不用	A: 不谢不谢，
Q: 爱你	A: 爱你	A: 你是爱我	A: 我也爱你，	A: 我也爱你，你最美
Q: 你叫什么名字	A: 你猜	A: 我也想知道	A: 叫你	A: 你是不是想说我的名字太俗了
Q: 你几岁了	A: 问女孩子年龄可不是绅士所为哦 ~	A: 你猜	A: 你猜	A: 我不知道
Q: 现在几点了	A: 11点	A: 你说呢？	A: 十二点	A: 我也不知道啊
Q: 今天天气怎么样	A: 下雨了	A: 今天很冷，但是我还是觉得很冷	A: 很好	A: 不错
Q: 我们现在在哪里	A: 在这里	A: 我在这里等你呢	A: 我不知道你在说什么	A: 在我们宿舍楼下。。。
Q: 你能给我讲个笑话吗	A: 不能	A: 讲个笑话	A: 讲给谁听	A: 不能
Q: 你是男孩还是女孩呀	A: 我是萌妹纸	A: 你猜	A: 你猜呀	A: 我是萌妹纸
Q: 你会几种语言呀	A: 你猜	A: 不会	A: 你猜呀~	A: 只会说一句
Q: 你能陪我玩吗	A: 可以啊	A: 可以	A: 我可以啊	A: 你猜
Q: 说话可以大声一点吗	A: 可以	A: 可以的	A: 不要	A: 我不
Q: 天气真好	A: 是的	A: 是啊！	A: 天气好是因为有你	A: 是啊。
Q: 天气太糟糕了	A: 是啊	A: 是啊	A: 是啊	A: 是的
Q: 下雨了	A: 是的	A: 你们那边也下雨了	A: 是呀	A: 我也是
Q: 雨好大	A: 今天下雨了	A: 是啊	A: 是的，好大雨	A: 对呀，我们还穿短袖
Q: 我讨厌艳阳天	A: 我也讨厌艳阳天	A: 我也是。	A: 我喜欢阴天	A: 我喜欢
Q: 好晒啊	A: 是啊	A: 是啊	A: 是啊！！	A: 是啊，不是很热的
Q: 今天好冷	A: 多穿点衣服	A: 你们都不来	A: 你那儿也冷了	A: 是啊。
Q: 今天好热	A: 今天热死了	A: 热死我了	A: 对啊	A: 对啊，我都热疯
Q: 风好大	A: 是啊	A: 风太大了。。。	A: 你不是在上课嘛	A: 嗯嗯嗯，风好大，冷
Q: 雾太大了看不清路	A: 雾太大了	A: 哈哈，你这是在逗我笑	A: 我也看不清	A: 你是不是傻啊
Q: 打雷了好可怕	A: 可怕	A: 我们这边也是，我们这边下雨了，我们这边也没有下雨。	A: 是啊	A: 我们这边下雨了
Q: 下雪了诶	A: 下雪了？	A: 是滴！	A: 嗯嗯嗯！！好大的雪！！	A: 下了一天了
Q: 好烦啊	A: 我也好烦	A: 我也是	A: 我好烦	A: 我也是
Q: 好开心	A: 开心就好	A: 你是不是想我了	A: 是呀	A: 哈哈哈，我也是！
Q: 太激动了	A: 激动什么	A: 我也激动	A: 哈哈，还是不要了	A: 哈哈
Q: 我好难过	A: 怎么了？	A: 我也是，我们都是	A: 不要难过，我陪你	A: 你怎么难过
Q: 我想哭	A: 抱抱	A: 抱紧你	A: 我不想哭！你不要想太多！	A: 哭吧哭吧，我陪你哭
Q: 太好笑了	A: 笑死我了	A: 你也笑了？	A: 我也觉得	A: 哈哈
Q: 我好伤心	A: 怎么了	A: 你伤心什么	A: 我也是	A: 你是不是想笑死我！！下次再给你拍
Q: 心好痛	A: 抱抱	A: 你心痛什么	A: 心碎一地了。	A: 怎么了
Q: 好累啊	A: 加油	A: 你不会还是一个人在寝室吧	A: 好好休息	A: 累了就睡觉
Q: 我好疲惫	A: 我也是	A: 我也差不多。。	A: 我好困	A: 你不是一直都很疲惫么
Q: 我爱你	A: 我也爱你	A: 我也是	A: 我们都爱你	A: 爱我就来见面
Q: 我讨厌你	A: 我也讨厌你	A: 你不是一直很喜欢我	A: 不是我的	A: 我也讨厌你
Q: 你真是太棒啦	A: 谢谢	A: 你最棒	A: 你也棒棒哒	A: 谢谢！
Q: 你好厉害啊	A: 哈哈	A: 是不是	A: 你也不差呀	A: 你也可以的
Q: 吓死我了	A: 哈哈	A: 我都不敢出门了	A: 哈哈哈哈哈！！！	A: 哈哈哈哈哈笑死我了
Q: 我想回家	A: 我也想回家	A: 我也想！！	A: 想家的时候回来，想家的时候离开。	A: 回来吧，家真的好啊！
Q: 我想爸妈了	A: 我也想爸妈	A: 哈哈	A: 我也想	A: 想我吗
Q: 不知道小孩在家有没有听话	A: 我也不知道	A: 没有	A: 听话的话肯定是会听话的。	A: 我也是听不懂啊
Q: 想回家撸猫	A: 我也想回家	A: 你也想啊？	A: 我们这也有一个	A: 回呀回呀

NLP and Text Generation Experiments in TensorFlow 2.x / 1.x

Related tags

Overview

Contents

Text Classification

Text Matching

Intent Detection and Slot Filling

Retrieval Dialog

Semantic Parsing

Knowledge Graph Completion

Knowledge Base Question Answering

Multi-hop Question Answering

Text Visualization

Recommender System

Multi-turn Dialogue Rewriting

Generative Dialog

Owner

ChessCoach is a neural network-based chess engine capable of natural-language commentary.

End-2-end speech synthesis with recurrent neural networks

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Text vectorization tool to outperform TFIDF for classification tasks

Amazon Multilingual Counterfactual Dataset (AMCD)

A list of NLP(Natural Language Processing) tutorials

Precision Medicine Knowledge Graph (PrimeKG)

2021 2학기 데이터크롤링 기말프로젝트

WikiPron - a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary

👄 The most accurate natural language detection library for Python, suitable for long and short text alike

Ray-based parallel data preprocessing for NLP and ML.

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

Data preprocessing rosetta parser for python

小布助手对话短文本语义匹配的一个baseline

Pytorch version of BERT-whitening

BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

Accurately generate all possible forms of an English word e.g "election" --> "elect", "electoral", "electorate" etc.

A curated list of FOSS tools to improve the Hacker News experience

Sentiment-Analysis and EDA on the IMDB Movie Review Dataset

NLP, Machine learning