Pretrained Japanese BERT models

This is a repository of pretrained Japanese BERT models. The models are available in Transformers by Hugging Face.

For information on the previous versions of our pretrained models, see the v1.0 tag of this repository.

Model Architecture

The architecture of our models is the same as that of the original BERT models proposed by Google.

  • BERT-base models consist of 12 layers, 768 dimensions of hidden states, and 12 attention heads.
  • BERT-large models consist of 24 layers, 1024 dimensions of hidden states, and 16 attention heads.
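The two configurations above can be written down as a small Python sketch. The `BertSize` class and its field names are hypothetical and chosen only for illustration; they are not part of this repository or of Transformers. The per-head dimension is derived by dividing the hidden size by the number of attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertSize:
    """Hypothetical container for the sizes listed above (not a library class)."""

    num_layers: int
    hidden_size: int
    num_attention_heads: int

    @property
    def head_size(self) -> int:
        # Each attention head works on an equal slice of the hidden state.
        assert self.hidden_size % self.num_attention_heads == 0
        return self.hidden_size // self.num_attention_heads


BERT_BASE = BertSize(num_layers=12, hidden_size=768, num_attention_heads=12)
BERT_LARGE = BertSize(num_layers=24, hidden_size=1024, num_attention_heads=16)

if __name__ == "__main__":
    for name, size in [("BERT-base", BERT_BASE), ("BERT-large", BERT_LARGE)]:
        print(name, size, "head_size =", size.head_size)
```

Both sizes end up with the same per-head dimension, so BERT-large is wider and deeper while each head stays the same size.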

Training Data

The models are trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 31, 2020.

The generated corpus files are 4.0GB in total, consisting of approximately 30M sentences. We used the MeCab morphological parser with the mecab-ipadic-NEologd dictionary to split texts into sentences.

$ WORK_DIR="$HOME/work/bert-japanese"

$ python make_corpus_wiki.py \
--input_file jawiki-20200831-cirrussearch-content.json.gz \
--output_file $WORK_DIR/corpus/jawiki-20200831/corpus.txt \
--min_text_length 10 \
--max_text_length 200 \
--mecab_option "-r $HOME/local/etc/mecabrc -d $HOME/local/lib/mecab/dic/mecab-ipadic-neologd-v0.0.7"

# Split corpus files for parallel preprocessing of the files
$ python merge_split_corpora.py \
--input_files $WORK_DIR/corpus/jawiki-20200831/corpus.txt \
--output_dir $WORK_DIR/corpus/jawiki-20200831 \
--num_files 8

# Sample some lines for training tokenizers
$ cat $WORK_DIR/corpus/jawiki-20200831/corpus.txt|grep -v '^$'|shuf|head -n 1000000 \
> $WORK_DIR/corpus/jawiki-20200831/corpus_sampled.txt
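For readers who prefer to do the sampling step in Python, the `grep -v '^$' | shuf | head -n 1000000` pipeline above can be approximated as follows. This is a minimal sketch: `sample_lines` is a hypothetical helper, not a script in this repository, and it holds all non-empty lines in memory.

```python
import random
from typing import Iterable, List, Optional


def sample_lines(lines: Iterable[str], num_samples: int,
                 seed: Optional[int] = None) -> List[str]:
    """Drop empty lines, shuffle the rest, and keep at most num_samples."""
    # Like `grep -v '^$'`: only lines with no characters at all are dropped.
    non_empty = [line for line in lines if line.rstrip("\n") != ""]
    rng = random.Random(seed)
    rng.shuffle(non_empty)  # like `shuf`
    return non_empty[:num_samples]  # like `head -n num_samples`


if __name__ == "__main__":
    demo = ["first sentence\n", "\n", "second sentence\n", "\n", "third sentence\n"]
    for line in sample_lines(demo, num_samples=2, seed=0):
        print(line, end="")
```

With a real corpus, the input would be an open file handle on `corpus.txt` and `num_samples` would be 1000000.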

Tokenization

For each of BERT-base and BERT-large, we provide two models with different tokenization methods.

  • For wordpiece models, the texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into subwords by the WordPiece algorithm. The vocabulary size is 32768.
  • For character models, the texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into characters. The vocabulary size is 6144.
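The second stage of each method can be sketched in Python. The sketch assumes the text has already been segmented into words (the MeCab step is not shown), uses the greedy longest-match-first matching commonly associated with WordPiece in BERT, and relies on a toy vocabulary made up for the example. The function names are hypothetical and not taken from this repository.

```python
from typing import List, Set


def wordpiece_split(word: str, vocab: Set[str],
                    unk_token: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one pre-tokenized word."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def character_split(word: str) -> List[str]:
    """Split one pre-tokenized word into single characters."""
    return list(word)


if __name__ == "__main__":
    toy_vocab = {"東北", "##大学", "東", "##北"}  # toy vocabulary, for illustration only
    print(wordpiece_split("東北大学", toy_vocab))
    print(character_split("東北大学"))
```

The sketch covers only how a trained vocabulary is applied; learning the vocabulary itself is a separate step.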

We used the fugashi and unidic-lite packages for the tokenization.

$ WORK_DIR="$HOME/work/bert-japanese"

# WordPiece (unidic_lite)
$ TOKENIZERS_PARALLELISM=false python train_tokenizer.py \
--input_files $WORK_DIR/corpus/jawiki-20200831/corpus_sampled.txt \
--output_dir $WORK_DIR/tokenizers/jawiki-20200831/wordpiece_unidic_lite \
--tokenizer_type wordpiece \
--mecab_dic_type unidic_lite \
--vocab_size 32768 \
--limit_alphabet 6129 \
--num_unused_tokens 10

# Character
$ head -n 6144 $WORK_DIR/tokenizers/jawiki-20200831/wordpiece_unidic_lite/vocab.txt \
> $WORK_DIR/tokenizers/jawiki-20200831/character/vocab.txt
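The figure 6144 used in the `head` command appears to line up with the tokenizer training flags. The arithmetic below rests on an assumption not stated in this README: that the WordPiece vocabulary file starts with the five standard BERT special tokens (`[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`), followed by the unused tokens and then the character alphabet.

```python
# Assumption (not stated in the README): five standard BERT special tokens
# come first in vocab.txt.
NUM_SPECIAL_TOKENS = 5
NUM_UNUSED_TOKENS = 10       # from --num_unused_tokens 10
LIMIT_ALPHABET = 6129        # from --limit_alphabet 6129
WORDPIECE_VOCAB_SIZE = 32768  # from --vocab_size 32768

# Entries that would precede the first multi-character subword.
character_vocab_size = NUM_SPECIAL_TOKENS + NUM_UNUSED_TOKENS + LIMIT_ALPHABET

# Entries left over for the remaining WordPiece vocabulary.
remaining_wordpiece_entries = WORDPIECE_VOCAB_SIZE - character_vocab_size

if __name__ == "__main__":
    print("character vocabulary size:", character_vocab_size)
    print("remaining WordPiece entries:", remaining_wordpiece_entries)
```

If that ordering holds, taking the first 6144 lines keeps exactly the special tokens, the unused tokens, and the single characters.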

Training

The models are trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps. For the MLM (masked language modeling) objective, we introduced whole word masking, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.
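Whole word masking can be illustrated with a short Python sketch. It is a simplification under stated assumptions and not the code of `create_pretraining_data.py`: word boundaries are recovered from the `##` continuation prefix of WordPiece tokens, every selected token is replaced by `[MASK]` (BERT's usual mix of random and unchanged replacements is omitted), and special tokens such as `[CLS]` are not handled. The function names are hypothetical.

```python
import random
from typing import List, Optional


def group_into_words(tokens: List[str]) -> List[List[int]]:
    """Group token indices so that a '##' piece joins the preceding word."""
    groups: List[List[int]] = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and groups:
            groups[-1].append(index)
        else:
            groups.append([index])
    return groups


def whole_word_mask(tokens: List[str], num_to_mask: int,
                    seed: Optional[int] = None,
                    mask_token: str = "[MASK]") -> List[str]:
    """Mask whole words in random order without exceeding num_to_mask tokens."""
    rng = random.Random(seed)
    groups = group_into_words(tokens)
    rng.shuffle(groups)
    masked = list(tokens)
    num_masked = 0
    for group in groups:
        # Skip a word if masking all of its pieces would exceed the budget.
        if num_masked + len(group) > num_to_mask:
            continue
        for index in group:
            masked[index] = mask_token
        num_masked += len(group)
    return masked


if __name__ == "__main__":
    tokens = ["東北", "##大学", "で", "研究", "##する"]  # toy example
    print(whole_word_mask(tokens, num_to_mask=2, seed=0))
```

In the output, a word is either masked in full or left untouched; no word is ever partially masked.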

To train each model, we used a v3-8 instance of Cloud TPUs provided by the TensorFlow Research Cloud program. Training took about 5 days for the BERT-base models and about 14 days for the BERT-large models.
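The numbers quoted in this section can be tied together with a little arithmetic. The sketch below uses only figures from this README (including the `--num_train_epochs=100` and `--num_steps_per_epoch=10000` flags of the training commands). The throughput values are rough estimates, since the training times are given only as "about" 5 and 14 days, and the token count includes padding positions.

```python
TOKENS_PER_INSTANCE = 512
INSTANCES_PER_BATCH = 256
NUM_TRAIN_EPOCHS = 100        # --num_train_epochs
NUM_STEPS_PER_EPOCH = 10_000  # --num_steps_per_epoch
SECONDS_PER_DAY = 24 * 60 * 60

# 100 epochs of 10,000 steps each give the 1M training steps.
total_steps = NUM_TRAIN_EPOCHS * NUM_STEPS_PER_EPOCH

# Token positions processed per step and over the whole run (padding included).
tokens_per_step = TOKENS_PER_INSTANCE * INSTANCES_PER_BATCH
total_token_positions = tokens_per_step * total_steps


def steps_per_second(steps: int, days: float) -> float:
    """Average training throughput implied by a wall-clock duration in days."""
    return steps / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    print("total steps:", total_steps)
    print("token positions per step:", tokens_per_step)
    print("token positions in total:", total_token_positions)
    print("BERT-base steps/s (approx.):", steps_per_second(total_steps, 5))
    print("BERT-large steps/s (approx.):", steps_per_second(total_steps, 14))
```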

Creation of the pretraining data

$ WORK_DIR="$HOME/work/bert-japanese"

# WordPiece (unidic_lite)
$ mkdir -p $WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data
# It takes 3h and 420GB RAM, producing 43M instances
$ seq -f %02g 1 8|xargs -L 1 -I {} -P 8 python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/jawiki-20200831/corpus_{}.txt \
--output_file $WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/jawiki-20200831/wordpiece_unidic_lite/vocab.txt \
--tokenizer_type wordpiece \
--mecab_dic_type unidic_lite \
--do_whole_word_mask \
--gzip_compress \
--max_seq_length 512 \
--max_predictions_per_seq 80 \
--dupe_factor 10

# Character
$ mkdir -p $WORK_DIR/bert/jawiki-20200831/character/pretraining_data
# It takes 4h10m and 615GB RAM, producing 55M instances
$ seq -f %02g 1 8|xargs -L 1 -I {} -P 8 python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/jawiki-20200831/corpus_{}.txt \
--output_file $WORK_DIR/bert/jawiki-20200831/character/pretraining_data/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/jawiki-20200831/character/vocab.txt \
--tokenizer_type character \
--mecab_dic_type unidic_lite \
--do_whole_word_mask \
--gzip_compress \
--max_seq_length 512 \
--max_predictions_per_seq 80 \
--dupe_factor 10
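The `seq -f %02g 1 8 | xargs ... -P 8` idiom above runs one `create_pretraining_data.py` process per corpus shard, substituting a zero-padded shard id for `{}`. The following Python sketch only reproduces the shard ids and the input/output file pairs that result; `shard_ids` and `shard_paths` are hypothetical helpers, and the directory arguments are placeholders.

```python
from typing import List, Tuple


def shard_ids(num_shards: int) -> List[str]:
    """Mirror `seq -f %02g 1 N`: 1-based ids, zero-padded to two digits."""
    return [f"{i:02d}" for i in range(1, num_shards + 1)]


def shard_paths(corpus_dir: str, output_dir: str,
                num_shards: int) -> List[Tuple[str, str]]:
    """Pair each corpus shard with the pretraining data file produced from it."""
    return [
        (
            f"{corpus_dir}/corpus_{shard}.txt",
            f"{output_dir}/pretraining_data_{shard}.tfrecord.gz",
        )
        for shard in shard_ids(num_shards)
    ]


if __name__ == "__main__":
    # Placeholder directories, for illustration only.
    for source, target in shard_paths("corpus", "pretraining_data", 8):
        print(source, "->", target)
```

Because each shard is processed independently, the eight processes can run in parallel; the memory figures quoted in the comments above presumably cover all eight together.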

Training of the models

Note: all the necessary files need to be stored in a Google Cloud Storage (GCS) bucket.

# BERT-base, WordPiece (unidic_lite)
$ ctpu up -name tpu01 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://<your-bucket-name>/bert-japanese"
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-base" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-base/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=1e-4 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu01

# BERT-base, Character
$ ctpu up -name tpu02 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://<your-bucket-name>/bert-japanese"
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/character/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/character/bert-base" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/character/bert-base/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=1e-4 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu02

# BERT-large, WordPiece (unidic_lite)
$ ctpu up -name tpu03 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://<your-bucket-name>/bert-japanese"
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-large" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-large/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=5e-5 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu03

# BERT-large, Character
$ ctpu up -name tpu04 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://<your-bucket-name>/bert-japanese"
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/character/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/character/bert-large" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/character/bert-large/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=5e-5 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu04
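The commands above set a peak learning rate (`1e-4` for BERT-base, `5e-5` for BERT-large) and `--warmup_steps=10000`. A common schedule for BERT pretraining is linear warmup to the peak followed by linear decay to zero. The sketch below illustrates that generic shape; whether `run_pretraining.py` uses exactly this formula is an assumption not confirmed by this README, and `learning_rate` is a hypothetical function.

```python
def learning_rate(step: int, base_lr: float = 1e-4,
                  warmup_steps: int = 10_000,
                  total_steps: int = 1_000_000) -> float:
    """Linear warmup to base_lr, then linear decay to zero (generic sketch)."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 505_000, 1_000_000):
        print(step, learning_rate(step))
```

For the BERT-large runs, the same function would be called with `base_lr=5e-5`.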

Licenses

The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license.

The code in this repository is distributed under the Apache License 2.0.

Related Work

Acknowledgments

The models are trained with Cloud TPUs provided by the TensorFlow Research Cloud program.

Owner
Inui Laboratory, Tohoku University