Code for producing Japanese GPT-2 provided by rinna Co., Ltd.

Last update: Jan 07, 2023

Overview

japanese-gpt2

This repository provides the code for training Japanese GPT-2 models. It was used to produce japanese-gpt2-medium, which rinna released on the Hugging Face model hub.

Please open an issue (in English/日本語) if you run into any problem with the code or with our models on Hugging Face.

Train a Japanese GPT-2 from scratch on your own machine

Download the Japanese CC-100 training corpus and extract the ja.txt file.
Either move ja.txt to the location given by self.raw_data_dir in src/corpus/jp_cc100/config.py, or edit self.raw_data_dir so that it points to where ja.txt is stored.
Split ja.txt into smaller files by running:

cd src/
python -m corpus.jp_cc100.split_to_small_files

Train a medium-sized GPT-2 on 4 GPUs by running:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m task.pretrain.train --n_gpus 4 --save_model True --enable_log True
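
To make the corpus-splitting step above more concrete, here is a minimal sketch of how a large text file such as ja.txt could be cut into smaller files. It is not the code in corpus.jp_cc100.split_to_small_files; the function name, the numbered output file names, and the default of 100,000 lines per file are all assumptions made for illustration.

```python
from pathlib import Path


def split_to_small_files(src_path, out_dir, lines_per_file=100_000):
    """Split a large text file into numbered smaller files.

    Returns the list of created paths. The names and the default chunk
    size are illustrative assumptions, not values from the repository.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    buffer = []

    def flush():
        # Write the buffered lines to the next numbered file.
        path = out_dir / f"{len(created):05d}.txt"
        path.write_text("".join(buffer), encoding="utf-8")
        created.append(path)
        buffer.clear()

    # Stream the source line by line so the whole corpus never sits in memory.
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            buffer.append(line)
            if len(buffer) >= lines_per_file:
                flush()
    if buffer:
        # Write whatever is left after the last full chunk.
        flush()
    return created
```

Streaming the input keeps memory use bounded by one chunk, which matters for a corpus of this size.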

Interact with the trained model

Assume you have run the training script and saved your medium-sized GPT-2 to data/model/gpt2-medium-xxx.checkpoint. Run the following command to complete text on one GPU, sampling with top-k filtering (k=40) combined with nucleus (top-p) filtering (p=0.95). The relative checkpoint path assumes you are still in src/:

CUDA_VISIBLE_DEVICES=0 python -m task.pretrain.interact --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --gen_type top --top_p 0.95 --top_k 40
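
For readers unfamiliar with the two sampling options, the following is a minimal pure-Python sketch of top-k filtering followed by nucleus (top-p) filtering over a single vector of logits. It is not taken from task.pretrain.interact; the function names are hypothetical, and applying top-k first and then top-p over the renormalized survivors is one common convention assumed here.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(logits, top_k=40, top_p=0.95):
    """Return {token_id: probability} for the tokens that survive filtering.

    Assumed order: keep the top_k most probable tokens, renormalize, then
    keep the smallest prefix whose cumulative probability reaches top_p.
    """
    probs = softmax(logits)
    # Rank token ids from most to least probable and keep the k best.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:top_k]
    mass = sum(probs[i] for i in kept)

    nucleus = []
    cumulative = 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(probs[i] for i in nucleus)
    return {i: probs[i] / total for i in nucleus}


def sample(logits, top_k=40, top_p=0.95, rng=random):
    """Draw one token id from the filtered distribution."""
    dist = top_k_top_p_filter(logits, top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

Lower values of top_k or top_p restrict sampling to fewer high-probability tokens, which tends to make completions more conservative.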

Prepare files for uploading to Hugging Face

Create a Hugging Face account, create a model repo, and clone it to your local machine.
Create model and config files from a checkpoint by running:

python -m task.pretrain.checkpoint2huggingface --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --save_dir {huggingface's model repo directory}

Validate the created files by running:

python -m task.pretrain.check_huggingface --model_dir {huggingface's model repo directory}

Add the files, commit, and push to your Hugging Face repo.

Customize your training script

Check available arguments by running:

python -m task.pretrain.train --help
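
The commands in this README pass boolean options as text, for example --save_model True. As a rough illustration of how such flags can be handled with argparse, here is a small sketch. The parser below is hypothetical and is not the one defined in task.pretrain.train; only the three flag names come from the training command above, while the defaults and help strings are assumptions.

```python
import argparse


def str2bool(value):
    """Convert command-line text such as "True" or "false" to a bool."""
    text = value.strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Build a hypothetical parser with the flags used in this README."""
    parser = argparse.ArgumentParser(description="Hypothetical training options")
    # Defaults below are assumptions chosen for the example.
    parser.add_argument("--n_gpus", type=int, default=1,
                        help="number of GPUs to train on")
    parser.add_argument("--save_model", type=str2bool, default=False,
                        help="whether to save checkpoints")
    parser.add_argument("--enable_log", type=str2bool, default=False,
                        help="whether to write training logs")
    return parser


# Parse the same flags as the 4-GPU training command shown earlier.
args = build_parser().parse_args(
    ["--n_gpus", "4", "--save_model", "True", "--enable_log", "True"]
)
```

An explicit converter is used because argparse's type=bool would treat any non-empty string, including "False", as True.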

License

The MIT license

Owner

rinna Co.,Ltd.
