Jingju baseline - A baseline model of our project of Beijing opera script generation

Overview

Jingju Baseline

It is a baseline of our project about Beijing opera script generation. Our baseline model is based on gpt2-chinese-ancient which is pretrained with 1.5GB literary Chinese.Please refer to our paper for details.

Directory Annotation

jingju_baseline/
	|-- finetuning.py 	#the finetuning script
	|-- jingju_test.py 	#test script
	|-- preprocess.py 	#data preprocess script
	|-- config/ 		#model configuration files 
	|-- corpora/ 		#corpora files
	|-- models/ 		# vocab file, model checkpoints and some necessary files
	|-- scripts/ 		# several functional scripts
	|-- test/ 		#test files
	|-- uer/ 		#files from UER-py

Environment Preparation

Our baseline model is fineturned with a pretraining framework UER-py. Refer to the part for environment requirements.

Finetuning

  1. Data preprocess
python3 preprocess.py --corpus_path corpora/jingju_train.txt\
   	  --vocab_path models/vocab.txt \
   	  --tokenizer bert \
   	  --dataset_path corpora/jingju_train.pt \
   	  --processes_num 32 --seq_length 1024 --target lm
python3 preprocess.py --corpus_path corpora/jingju_dev.txt\
   	  --vocab_path models/vocab.txt \
   	  --tokenizer bert \
   	  --dataset_path corpora/jingju_dev.pt \
   	  --processes_num 32 --seq_length 1024 --target lm
  1. Finetuning
export CUDA_VISIBLE_DEVICES=0
nohup python3 -u finetuning.py --dataset_path corpora/jingju_train.pt\
		 --devset_path corpora/jingju_dev.pt\
		 --vocab_path models/vocab.txt \
		 --config_path config/jingju_config.json \
		 --output_model_path models/finetuned_model.bin\
		 --pretrained_model_path models/uer-ancient-chinese.bin\
		 --world_size 1 --gpu_ranks 0  \
		 --total_steps 100000 --save_checkpoint_steps 50000\
		 --report_steps 1000 --learning_rate 5e-5\
		 --batch_size 5 --accumulation_steps 4 \
		 --embedding word_pos  --fp16 --fp16_opt_level O1 \
		 --remove_embedding_layernorm --encoder transformer \
		 --mask causal --layernorm_positioning pre \
		 --target lm --tie_weights > fineturning.log 2>&1 &

Refer to here for function of every argument.

Specificly, you may change environment variable CUDA_VISIBLE_DEVICES and --world_size paired with --gpu_ranks option for multi-GPU training. To enable --fp16 coordinated with --fp16_opt_level needs apex.

Test

You can finetuning by yourself with instructions above, or download the checkpoint(extracting code: q0yn) to directory ./models Then run as follows:

python3 preprocess.py --corpus_path corpora/jingju_test.txt\
		  --vocab_path models/vocab.txt \
		  --tokenizer bert \
		  --dataset_path corpora/jingju_test.pt \
		  --processes_num 32 --seq_length 1024 --target lm
nohup python3 -u jingju_test.py --load_model_path models/finetuned_model.bin-100000 \
		--vocab_path models/vocab.txt \
		--beginning_path test/jingju_beginning.txt  \
		--reference_path test/jingju_reference.txt \
		--prediction_path test/jingju_candidates.txt \
		--test_path test/jingju_beginning.txt \
		--testset_path datasets/jingju_test.pt \
		--config_path config/jingju_config.json \
		--seq_length 1024 --embedding word_pos \
		--remove_embedding_layernorm \
		--encoder transformer --mask causal \
		--layernorm_positioning pre --target lm \
		--tie_weights > test_candidate_generation.log 2>&1 &

The automatic mertics(i.e., F1, Perplexity, BLEU and Distinct) will be displayed on stdout.

Generation

nohup python3 -u scripts/generate_lm.py \
		--load_model_path models/finetuned_model.bin-100000 \
		--vocab_path models/vocab.txt \
		--test_path test/beginning.txt \
		--prediction_path test/generation.txt \
		--config_path config/jingju_config.json \
		--seq_length 1024 --embedding word_pos \
		--remove_embedding_layernorm --encoder transformer \
		--mask causal --layernorm_positioning pre \
		--target lm --tie_weights > generation_log.log 2>&1 &

Given the beginning, the model will generates script corresponding with it. The generate_lm.py script only generates sequence no longer than 1024. If you want longer script, replace scripts/generate_lm.py with scripts/long_generate_lm.py and revise --seq_length to the length you desire. Note that the generation procedure employs auto-regressive fashion, so generating long sequence is a time-consuming process.

Citation

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}
@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}
Owner
midon
master from School of Informatics,Xiamen University
midon
Code for 'Self-Guided and Cross-Guided Learning for Few-shot segmentation. (CVPR' 2021)'

SCL Introduction Code for 'Self-Guided and Cross-Guided Learning for Few-shot segmentation. (CVPR' 2021)' We evaluated our approach using two baseline

34 Oct 08, 2022
A reimplementation of DCGAN in PyTorch

DCGAN in PyTorch A reimplementation of DCGAN in PyTorch. Although there is an abundant source of code and examples found online (as well as an officia

Diego Porres 6 Jan 08, 2022
Using Machine Learning to Create High-Res Fine Art

BIG.art: Using Machine Learning to Create High-Res Fine Art How to use GLIDE and BSRGAN to create ultra-high-resolution paintings with fine details By

Robert A. Gonsalves 13 Nov 27, 2022
EMNLP 2021: Single-dataset Experts for Multi-dataset Question-Answering

MADE (Multi-Adapter Dataset Experts) This repository contains the implementation of MADE (Multi-adapter dataset experts), which is described in the pa

Princeton Natural Language Processing 68 Jul 18, 2022
It is a system used to detect bone fractures. using techniques deep learning and image processing

MohammedHussiengadalla-Intelligent-Classification-System-for-Bone-Fractures It is a system used to detect bone fractures. using techniques deep learni

Mohammed Hussien 7 Nov 11, 2022
RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving

RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving (AAAI2021). RTS3D is efficiency and accuracy s

71 Nov 29, 2022
A fast model to compute optical flow between two input images.

DCVNet: Dilated Cost Volumes for Fast Optical Flow This repository contains our implementation of the paper: @InProceedings{jiang2021dcvnet, title={

Huaizu Jiang 8 Sep 27, 2021
This folder contains the implementation of the multi-relational attribute propagation algorithm.

MrAP This folder contains the implementation of the multi-relational attribute propagation algorithm. It requires the package pytorch-scatter. Please

6 Dec 06, 2022
Angle data is a simple data type.

angledat Angle data is a simple data type. Installing + using Put angledat.py in the main dir of your project. Import it and use. Comments Comments st

1 Jan 05, 2022
MINIROCKET: A Very Fast (Almost) Deterministic Transform for Time Series Classification

MINIROCKET: A Very Fast (Almost) Deterministic Transform for Time Series Classification

187 Dec 26, 2022
A tool to analyze leveraged liquidity mining and find optimal option combination for hedging.

LP-Option-Hedging Description A Python program to analyze leveraged liquidity farming/mining and find the optimal option combination for hedging imper

Aureliano 18 Dec 19, 2022
Stochastic gradient descent with model building

Stochastic Model Building (SMB) This repository includes a new fast and robust stochastic optimization algorithm for training deep learning models. Th

S. Ilker Birbil 22 Jan 19, 2022
A style-based Quantum Generative Adversarial Network

Style-qGAN A style based Quantum Generative Adversarial Network (style-qGAN) model for Monte Carlo event generation. Tutorial We have prepared a noteb

9 Nov 24, 2022
Graph Convolutional Networks in PyTorch

Graph Convolutional Networks in PyTorch PyTorch implementation of Graph Convolutional Networks (GCNs) for semi-supervised classification [1]. For a hi

Thomas Kipf 4.5k Dec 31, 2022
JFB: Jacobian-Free Backpropagation for Implicit Models

JFB: Jacobian-Free Backpropagation for Implicit Models

Typal Research 28 Dec 11, 2022
face2comics by Sxela (Alex Spirin) - face2comics datasets

This is a paired face to comics dataset, which can be used to train pix2pix or similar networks.

Alex 164 Nov 13, 2022
An LSTM based GAN for Human motion synthesis

GAN-motion-Prediction An LSTM based GAN for motion synthesis has a few issues reading H3.6M data from A.Jain et al , will fix soon. Prediction of the

Amogh Adishesha 9 Jun 17, 2022
SMIS - Semantically Multi-modal Image Synthesis(CVPR 2020)

Semantically Multi-modal Image Synthesis Project page / Paper / Demo Semantically Multi-modal Image Synthesis(CVPR2020). Zhen Zhu, Zhiliang Xu, Anshen

316 Dec 01, 2022
The Official PyTorch Implementation of "LSGM: Score-based Generative Modeling in Latent Space" (NeurIPS 2021)

The Official PyTorch Implementation of "LSGM: Score-based Generative Modeling in Latent Space" (NeurIPS 2021) Arash Vahdat*   ·   Karsten Kreis*   ·  

NVIDIA Research Projects 238 Jan 02, 2023
The official implementation of CSG-Stump: A Learning Friendly CSG-Like Representation for Interpretable Shape Parsing

CSGStumpNet The official implementation of CSG-Stump: A Learning Friendly CSG-Like Representation for Interpretable Shape Parsing Paper | Project page

Daxuan 39 Dec 26, 2022