Pre-training BERT Masked Language Models (MLM)

This repository contains the code to pre-train a BERT model with a custom vocabulary. It was used to pre-train JuriBERT, presented in [https://arxiv.org/abs/2110.01485].

It also contains the code for the classification task that was used to evaluate JuriBERT.

Our models can be found at [http://master2-bigdata.polytechnique.fr/FrenchLinguisticResources/resources#juribert] and downloaded upon request.

Instructions

To pre-train a new BERT model, provide the path to a dataset of raw text. You can optionally specify an existing tokenizer for the model. Paths for saving the model and the checkpoints are required.

python pretrain.py \
      --files /path/to/text \
      --model_path /path/to/save/model \
      --checkpoint /path/to/save/checkpoints \
      --epochs 30 \
      --hidden_layers 2 \
      --hidden_size 128 \
      --attention_heads 2 \
      --save_steps 10 \
      --save_limit 0 \
      --min_freq 0
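
In BERT-style models the hidden size is split evenly across the attention heads, so the value passed to --hidden_size should be divisible by the value passed to --attention_heads. The sketch below is a standalone check of that constraint; the function name head_dimension is hypothetical and is not part of pretrain.py.

```python
def head_dimension(hidden_size: int, attention_heads: int) -> int:
    """Return the width of each attention head.

    Hypothetical helper, not part of pretrain.py. Raises ValueError when
    the hidden size cannot be split evenly across the heads.
    """
    if attention_heads <= 0:
        raise ValueError("attention_heads must be a positive integer")
    if hidden_size % attention_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) is not divisible by "
            f"attention_heads ({attention_heads})"
        )
    return hidden_size // attention_heads


# Values taken from the example command above.
print(head_dimension(128, 2))  # 64
```

With the example values, each of the 2 heads works on a 64-dimensional slice of the 128-dimensional hidden state.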

To fine-tune on a classification task, you need the path to the pre-trained model and a CSV file containing the classification dataset. You must specify the names of the columns that hold the category and the text, as well as the paths for saving the final model and the checkpoints.

python classification.py \
  --model "custom" \
  --pretrained_path /path/to/model.bin \
  --tokenizer_path /path/to/tokenizer.json \
  --data /path/to/data.csv \
  --category "category-column" \
  --text "text-column" \
  --model_path /path/to/save/model \
  --checkpoint /path/to/save/checkpoints
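
The CSV file passed to --data needs one column for the label and one for the text. The following is a minimal sketch of how such a file could be produced with the Python standard library. The column names reuse the placeholders from the command above, and the rows, labels, and helper names (write_dataset, read_columns) are made up for illustration; they are not part of classification.py.

```python
import csv

# Hypothetical example rows; replace with your own labelled texts.
ROWS = [
    {"category-column": "civil", "text-column": "Le contrat est résilié de plein droit."},
    {"category-column": "social", "text-column": "Le salarié conteste son licenciement."},
]


def write_dataset(path, rows, category_col="category-column", text_col="text-column"):
    """Write rows to a CSV file with a header line naming both columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[category_col, text_col])
        writer.writeheader()
        writer.writerows(rows)


def read_columns(path):
    """Return the column names found in the header of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return csv.DictReader(f).fieldnames
```

Calling write_dataset("data.csv", ROWS) produces a file whose header matches the --category and --text values used in the command above; read_columns can be used to confirm the column names before launching a run.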

Run either script with --help to see all the available options.

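To test the masked language model, use:

from transformers import pipeline

# `tokenizer` must already be loaded: it is the tokenizer that was
# trained or supplied during pre-training.
fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer=tokenizer
)

# Returns the most likely replacements for the <mask> token.
fill_mask("Paris est la capitale de la <mask>.")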

Owner

Stella Douka
