Jingju baseline - A baseline model for our Beijing opera script generation project

Overview

Jingju Baseline

This is the baseline for our project on Beijing opera script generation. The baseline model is based on gpt2-chinese-ancient, which is pretrained on 1.5GB of literary Chinese. Please refer to our paper for details.

Directory Annotation

jingju_baseline/
	|-- finetuning.py 	# finetuning script
	|-- jingju_test.py 	# test script
	|-- preprocess.py 	# data preprocessing script
	|-- config/ 		# model configuration files
	|-- corpora/ 		# corpus files
	|-- models/ 		# vocab file, model checkpoints and other necessary files
	|-- scripts/ 		# several utility scripts
	|-- test/ 		# test files
	|-- uer/ 		# files from UER-py

Environment Preparation

Our baseline model is fine-tuned with the pretraining framework UER-py. Refer to the UER-py documentation for the environment requirements.

Finetuning

  1. Data preprocessing
python3 preprocess.py --corpus_path corpora/jingju_train.txt \
   	  --vocab_path models/vocab.txt \
   	  --tokenizer bert \
   	  --dataset_path corpora/jingju_train.pt \
   	  --processes_num 32 --seq_length 1024 --target lm
python3 preprocess.py --corpus_path corpora/jingju_dev.txt \
   	  --vocab_path models/vocab.txt \
   	  --tokenizer bert \
   	  --dataset_path corpora/jingju_dev.pt \
   	  --processes_num 32 --seq_length 1024 --target lm
  2. Finetuning
export CUDA_VISIBLE_DEVICES=0
nohup python3 -u finetuning.py --dataset_path corpora/jingju_train.pt \
		 --devset_path corpora/jingju_dev.pt \
		 --vocab_path models/vocab.txt \
		 --config_path config/jingju_config.json \
		 --output_model_path models/finetuned_model.bin \
		 --pretrained_model_path models/uer-ancient-chinese.bin \
		 --world_size 1 --gpu_ranks 0 \
		 --total_steps 100000 --save_checkpoint_steps 50000 \
		 --report_steps 1000 --learning_rate 5e-5 \
		 --batch_size 5 --accumulation_steps 4 \
		 --embedding word_pos --fp16 --fp16_opt_level O1 \
		 --remove_embedding_layernorm --encoder transformer \
		 --mask causal --layernorm_positioning pre \
		 --target lm --tie_weights > finetuning.log 2>&1 &

Refer to the UER-py documentation for the function of each argument.
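As a rough aid for reading the finetuning command, the sketch below works out which checkpoint files to expect and how many sequences contribute to each optimizer update. It assumes that a checkpoint is written every --save_checkpoint_steps steps under the name `<output_model_path>-<step>` (consistent with the `models/finetuned_model.bin-100000` path used in the test command) and that --batch_size is per GPU with standard gradient accumulation. The helper names `expected_checkpoints` and `sequences_per_update` are hypothetical and not part of this repository.

```python
def expected_checkpoints(output_model_path, total_steps, save_checkpoint_steps):
    """List checkpoint paths, ASSUMING one file per save interval
    named '<output_model_path>-<step>'."""
    return [
        f"{output_model_path}-{step}"
        for step in range(save_checkpoint_steps, total_steps + 1, save_checkpoint_steps)
    ]


def sequences_per_update(batch_size, accumulation_steps, world_size=1):
    """Sequences per optimizer update, ASSUMING a per-GPU batch size
    and standard gradient accumulation."""
    return batch_size * accumulation_steps * world_size


# Values taken from the finetuning command above.
print(expected_checkpoints("models/finetuned_model.bin", 100000, 50000))
# ['models/finetuned_model.bin-50000', 'models/finetuned_model.bin-100000']
print(sequences_per_update(batch_size=5, accumulation_steps=4, world_size=1))
# 20
```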

For multi-GPU training, change the CUDA_VISIBLE_DEVICES environment variable together with the --world_size and --gpu_ranks options. Using --fp16 together with --fp16_opt_level requires apex.
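The relationship between these three settings can be sketched as follows. This is a minimal illustration, assuming single-machine training in which the visible GPUs are ranked 0 to n-1 and --gpu_ranks accepts a space-separated list; the helper `multi_gpu_settings` is hypothetical and not part of this repository.

```python
def multi_gpu_settings(gpu_ids):
    """Build the environment variable and CLI arguments for the given physical GPU ids.

    ASSUMPTION: single-machine training, with one process per visible GPU
    and ranks numbered from 0.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    ranks = [str(rank) for rank in range(len(gpu_ids))]
    args = ["--world_size", str(len(gpu_ids)), "--gpu_ranks", *ranks]
    return env, args


env, args = multi_gpu_settings([2, 3])
print(env)             # {'CUDA_VISIBLE_DEVICES': '2,3'}
print(" ".join(args))  # --world_size 2 --gpu_ranks 0 1
```

With physical GPUs 2 and 3 selected, the processes see them as devices 0 and 1, which is why the ranks restart from 0.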

Test

You can finetune the model yourself with the instructions above, or download the checkpoint (extraction code: q0yn) to the ./models directory. Then run the following:

python3 preprocess.py --corpus_path corpora/jingju_test.txt \
		  --vocab_path models/vocab.txt \
		  --tokenizer bert \
		  --dataset_path corpora/jingju_test.pt \
		  --processes_num 32 --seq_length 1024 --target lm
nohup python3 -u jingju_test.py --load_model_path models/finetuned_model.bin-100000 \
		--vocab_path models/vocab.txt \
		--beginning_path test/jingju_beginning.txt  \
		--reference_path test/jingju_reference.txt \
		--prediction_path test/jingju_candidates.txt \
		--test_path test/jingju_beginning.txt \
		--testset_path corpora/jingju_test.pt \
		--config_path config/jingju_config.json \
		--seq_length 1024 --embedding word_pos \
		--remove_embedding_layernorm \
		--encoder transformer --mask causal \
		--layernorm_positioning pre --target lm \
		--tie_weights > test_candidate_generation.log 2>&1 &

The automatic metrics (i.e., F1, perplexity, BLEU and Distinct) are written to stdout, which the command above redirects to test_candidate_generation.log.
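For readers unfamiliar with two of these metrics, the sketch below shows common definitions: Distinct-n as the ratio of unique n-grams to total n-grams over all generated sequences, and perplexity as the exponential of the negative mean token log-probability. It assumes character-level tokens and is only an illustration; jingju_test.py may compute these metrics differently, and the function names here are hypothetical.

```python
import math


def distinct_n(sequences, n):
    """Ratio of unique n-grams to total n-grams across all token sequences."""
    unique = set()
    total = 0
    for tokens in sequences:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy sequences, ASSUMING one token per character.
generated = [list("一二一二"), list("一三")]
print(distinct_n(generated, 1))  # 0.5  (3 unique unigrams out of 6)
print(distinct_n(generated, 2))  # 0.75 (3 unique bigrams out of 4)

# A model that assigns probability 0.25 to every token has perplexity of about 4.
print(perplexity([math.log(0.25)] * 4))
```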

Generation

nohup python3 -u scripts/generate_lm.py \
		--load_model_path models/finetuned_model.bin-100000 \
		--vocab_path models/vocab.txt \
		--test_path test/beginning.txt \
		--prediction_path test/generation.txt \
		--config_path config/jingju_config.json \
		--seq_length 1024 --embedding word_pos \
		--remove_embedding_layernorm --encoder transformer \
		--mask causal --layernorm_positioning pre \
		--target lm --tie_weights > generation_log.log 2>&1 &

Given a beginning, the model generates a script that continues it. The generate_lm.py script only generates sequences of at most 1024 tokens. If you want a longer script, replace scripts/generate_lm.py with scripts/long_generate_lm.py and set --seq_length to the length you desire. Note that generation is auto-regressive, so generating long sequences is time-consuming.
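To illustrate why long generation is slow, the following toy sketch shows the general shape of auto-regressive decoding: each new token requires one more call to the model, conditioned on everything generated so far. It is not the code in scripts/generate_lm.py; `generate` and `toy_next_token` are hypothetical stand-ins, and the "model" here is a trivial function rather than a neural network.

```python
def generate(next_token, beginning, max_length, end_token=None):
    """Extend `beginning` one token at a time until max_length or end_token."""
    tokens = list(beginning)
    model_calls = 0
    while len(tokens) < max_length:
        token = next_token(tokens)  # one model call per generated token
        model_calls += 1
        tokens.append(token)
        if token == end_token:
            break
    return tokens, model_calls


def toy_next_token(tokens):
    """Stand-in for a language model: returns the previous token plus one, modulo 10."""
    return (tokens[-1] + 1) % 10


tokens, calls = generate(toy_next_token, beginning=[1, 2, 3], max_length=8)
print(tokens)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(calls)   # 5
```

The number of model calls grows linearly with the number of generated tokens, and with a real Transformer each call also attends over the whole prefix, so the total cost grows faster than linearly unless intermediate states are cached.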

Citation

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}
@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}
Owner
midon
master from School of Informatics,Xiamen University