Pre-training BERT Masked Language Models (MLM)

This repository contains code for pre-training a BERT model with a custom vocabulary. It was used to pre-train JuriBERT, presented in [https://arxiv.org/abs/2110.01485].

It also contains the code for the classification task used to evaluate JuriBERT.

Our models are listed at [http://master2-bigdata.polytechnique.fr/FrenchLinguisticResources/resources#juribert] and can be downloaded upon request.

Instructions

To pre-train a new BERT model, you need the path to a dataset containing raw text. You can also specify an existing tokenizer for the model. Paths for saving the model and the checkpoints are required.

python pretrain.py \
  --files /path/to/text \
  --model_path /path/to/save/model \
  --checkpoint /path/to/save/checkpoints \
  --epochs 30 \
  --hidden_layers 2 \
  --hidden_size 128 \
  --attention_heads 2 \
  --save_steps 10 \
  --save_limit 0 \
  --min_freq 0
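
Before launching a long run, it can help to sanity-check the architecture flags. In BERT-style models the hidden size must be divisible by the number of attention heads, because each head works on an equal slice of the hidden vector. The following is a minimal sketch of such a check. `check_architecture` is a hypothetical helper, not part of pretrain.py, and it assumes the flags map directly onto the usual BERT hyperparameters.

```python
def check_architecture(hidden_size: int, attention_heads: int, hidden_layers: int) -> int:
    """Return the per-head dimension, or raise ValueError if the sizes are inconsistent.

    Hypothetical helper for illustration; not part of pretrain.py.
    """
    if hidden_size <= 0 or attention_heads <= 0 or hidden_layers <= 0:
        raise ValueError("hidden_size, attention_heads and hidden_layers must be positive")
    if hidden_size % attention_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) must be divisible by "
            f"attention_heads ({attention_heads})"
        )
    # Each attention head operates on hidden_size / attention_heads dimensions.
    return hidden_size // attention_heads


if __name__ == "__main__":
    # Values taken from the example command above.
    print(check_architecture(hidden_size=128, attention_heads=2, hidden_layers=2))  # 64
```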

To fine-tune on a classification task, you need the path to the pre-trained model and a CSV file containing the classification dataset. You also need to specify the columns containing the category and the text, as well as the paths for saving the final model and the checkpoints.

python classification.py \
  --model "custom" \
  --pretrained_path /path/to/model.bin \
  --tokenizer_path /path/to/tokenizer.json \
  --data /path/to/data.csv \
  --category "category-column" \
  --text "text-column" \
  --model_path /path/to/save/model \
  --checkpoint /path/to/save/checkpoints
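
Since the column names are passed on the command line, a typo only surfaces once the script starts reading the data. The sketch below checks the CSV header up front. `check_columns` is a hypothetical helper, not part of classification.py, and it assumes a comma-delimited, UTF-8 file with a header row.

```python
import csv


def check_columns(csv_path: str, category_col: str, text_col: str) -> int:
    """Return the number of data rows, or raise ValueError if a column is missing.

    Hypothetical helper for illustration; assumes a comma-delimited UTF-8 CSV
    whose first row is the header.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in (category_col, text_col) if c not in header]
        if missing:
            raise ValueError(f"missing column(s) {missing}; found {header}")
        # Count the remaining rows (the header is already consumed).
        return sum(1 for _ in reader)


# Usage, with the placeholder names from the command above:
# check_columns("/path/to/data.csv", "category-column", "text-column")
```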

You can use --help to see all the available options.

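To test the masked language model, use:

from transformers import pipeline

# `tokenizer` must already be loaded: the tokenizer used for pre-training.
fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer=tokenizer
)

fill_mask("Paris est la capitale de la <mask>.")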
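
For an input with a single mask, the fill-mask pipeline of the transformers library returns a list of dictionaries with the keys `score`, `token`, `token_str` and `sequence`. A minimal sketch for turning that list into readable lines follows. `format_predictions` is a hypothetical helper, not part of this repository.

```python
def format_predictions(predictions, top_k=3):
    """Format fill-mask predictions as 'token (score)' strings, best first.

    Hypothetical helper for illustration. `predictions` is expected to be a
    list of dicts with at least the keys "token_str" and "score".
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [f'{p["token_str"].strip()} ({p["score"]:.3f})' for p in ranked[:top_k]]


# Usage, assuming `fill_mask` is the pipeline created above:
# for line in format_predictions(fill_mask("Paris est la capitale de la <mask>.")):
#     print(line)
```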

Owner

Stella Douka
