EMNLP 2021 - Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Overview

This is the official implementation for "Frustratingly Simple Pretraining Alternatives to Masked Language Modeling" (EMNLP 2021).

Requirements

  • torch
  • transformers
  • datasets
  • scikit-learn
  • tensorflow
  • spacy

How to pre-train

1. Clone this repository

git clone https://github.com/gucci-j/light-transformer-emnlp2021.git

2. Install required packages

cd ./light-transformer-emnlp2021
pip install -r requirements.txt

requirements.txt is located just under light-transformer-emnlp2021.

We also need spaCy's en_core_web_sm for preprocessing. If you have not installed this model, please run python -m spacy download en_core_web_sm.
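
If you would rather check from Python whether the model is already available before preprocessing, a small sketch like the one below may help. It assumes the model is installed as a regular Python package, which is how spacy download normally installs it. The helper name has_package is illustrative and is not part of this repository.

```python
import importlib.util


def has_package(name: str) -> bool:
    """Return True if a top-level package called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_package("en_core_web_sm"):
        print("en_core_web_sm is installed.")
    else:
        print("Model missing; run: python -m spacy download en_core_web_sm")
```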

3. Preprocess datasets

cd ./src/utils
python preprocess_roberta.py --path=/path/to/save/data/

You need to specify the following argument:

  • path: (str) Where to save the processed data?

4. Pre-training

You need to specify configs as command line arguments. Sample configs for pre-training with MLM are shown below. Running python pretrainer.py --help displays the help messages.

cd ../
python pretrainer.py \
--data_dir=/path/to/dataset/ \
--do_train \
--learning_rate=1e-4 \
--weight_decay=0.01 \
--adam_epsilon=1e-8 \
--max_grad_norm=1.0 \
--num_train_epochs=1 \
--warmup_steps=12774 \
--save_steps=12774 \
--seed=42 \
--per_device_train_batch_size=16 \
--logging_steps=100 \
--output_dir=/path/to/save/weights/ \
--overwrite_output_dir \
--logging_dir=/path/to/save/log/files/ \
--disable_tqdm=True \
--prediction_loss_only \
--fp16 \
--mlm_prob=0.15 \
--pretrain_model=RobertaForMaskedLM

  • pretrain_model should be selected from:
    • RobertaForMaskedLM (MLM)
    • RobertaForShuffledWordClassification (Shuffle)
    • RobertaForRandomWordClassification (Random)
    • RobertaForShuffleRandomThreeWayClassification (Shuffle+Random)
    • RobertaForFourWayTokenTypeClassification (Token Type)
    • RobertaForFirstCharPrediction (First Char)
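
To give a feel for what a token-level objective such as Shuffle asks the model to predict, here is a minimal sketch of how inputs and labels for shuffled-word detection could be built. This is not the repository's implementation: the function name, the labelling rule (1 if the token at a position changed, 0 otherwise) and the use of plain whitespace tokens are all assumptions made for illustration. The default prob=0.15 simply mirrors --mlm_prob from the sample command.

```python
import random
from typing import List, Tuple


def shuffle_tokens(
    tokens: List[str], prob: float = 0.15, seed: int = 42
) -> Tuple[List[str], List[int]]:
    """Shuffle a random subset of positions and label the tokens that changed.

    Returns the corrupted sequence and one label per position:
    1 if the token at that position differs from the original, else 0.
    """
    rng = random.Random(seed)
    # Select the positions to shuffle (at least two, so that a swap is possible).
    n_pick = min(max(2, round(len(tokens) * prob)), len(tokens))
    positions = rng.sample(range(len(tokens)), n_pick)

    # Permute the selected tokens among the selected positions only.
    picked = [tokens[i] for i in positions]
    rng.shuffle(picked)

    corrupted = list(tokens)
    for pos, tok in zip(positions, picked):
        corrupted[pos] = tok

    # A position counts as shuffled only if its token actually changed.
    labels = [int(c != o) for c, o in zip(corrupted, tokens)]
    return corrupted, labels


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    corrupted, labels = shuffle_tokens(words, prob=0.3)
    print(corrupted)
    print(labels)
```

A classifier head trained on such labels makes one binary decision per token, in contrast to MLM, which predicts over the whole vocabulary.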

Check the pre-training process

You can monitor the progress of pre-training via TensorBoard. Simply run the following:

tensorboard --logdir=/path/to/log/dir/

Distributed training

pretrainer.py is compatible with distributed training. Sample configs for pre-training MLM are as follows.

python -m torch.distributed.launch \
--nproc_per_node=8 \
pretrainer.py \
--data_dir=/path/to/dataset/ \
--model_path=None \
--do_train \
--learning_rate=5e-5 \
--weight_decay=0.01 \
--adam_epsilon=1e-8 \
--max_grad_norm=1.0 \
--num_train_epochs=1 \
--warmup_steps=24000 \
--save_steps=1000 \
--seed=42 \
--per_device_train_batch_size=8 \
--logging_steps=100 \
--output_dir=/path/to/save/weights/ \
--overwrite_output_dir \
--logging_dir=/path/to/save/log/files/ \
--disable_tqdm \
--prediction_loss_only \
--fp16 \
--mlm_prob=0.15 \
--pretrain_model=RobertaForMaskedLM 

For more details about launch.py, please refer to https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py.
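
When switching between the single-process and distributed sample commands, it can be useful to compare how many sequences are consumed per optimiser step. The arithmetic below is a sketch that assumes one GPU per process and no gradient accumulation unless you set it yourself. The helper name is illustrative.

```python
def effective_batch_size(per_device: int, n_procs: int = 1, grad_accum: int = 1) -> int:
    """Number of sequences processed per optimiser step across all processes."""
    return per_device * n_procs * grad_accum


# Values taken from the two sample commands in this README.
single = effective_batch_size(per_device=16)
distributed = effective_batch_size(per_device=8, n_procs=8)
print(single, distributed)  # prints "16 64"
```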

Mixed precision training

Installation

  • For PyTorch version >= 1.6, there is a native functionality to enable mixed precision training.
  • For older versions, NVIDIA apex must be installed.
    • You might encounter some errors when installing apex due to permission problems. To fix these, specify export TMPDIR='/path/to/your/favourite/dir/' and change permissions of all files under apex/.git/ to 777.
    • You also need to specify an optimisation method from https://nvidia.github.io/apex/amp.html.

Usage

To use mixed precision during pre-training, just specify --fp16 as an input argument. For older PyTorch versions, also specify --fp16_opt_level from O0, O1, O2, and O3.
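
The version rule above can be expressed as a small helper that decides which fp16 flags to pass. This is only a sketch: the function names are hypothetical, and the default level O1 is an assumption chosen for illustration rather than a recommendation from this repository.

```python
import re
from typing import List


def needs_apex(torch_version: str) -> bool:
    """Return True when the PyTorch version predates native AMP (PyTorch 1.6)."""
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {torch_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (1, 6)


def fp16_flags(torch_version: str, opt_level: str = "O1") -> List[str]:
    """Build the fp16-related command line flags for the given PyTorch version."""
    flags = ["--fp16"]
    if needs_apex(torch_version):
        # Older PyTorch relies on apex, which also needs an optimisation level.
        flags.append(f"--fp16_opt_level={opt_level}")
    return flags


if __name__ == "__main__":
    print(fp16_flags("1.5.1", opt_level="O2"))
    print(fp16_flags("1.9.0"))
```

In practice you would pass torch.__version__ as the version string.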

How to fine-tune

GLUE

  1. Download GLUE data

    git clone https://github.com/huggingface/transformers
    python transformers/utils/download_glue_data.py
    
  2. Create a json config file
    You need to create a .json file for configuration or use command line arguments.

    {
        "model_name_or_path": "/path/to/pretrained/weights/",
        "tokenizer_name": "roberta-base",
        "task_name": "MNLI",
        "do_train": true,
        "do_eval": true,
        "data_dir": "/path/to/MNLI/dataset/",
        "max_seq_length": 128,
        "learning_rate": 2e-5,
        "num_train_epochs": 3, 
        "per_device_train_batch_size": 32,
        "per_device_eval_batch_size": 128,
        "logging_steps": 500,
        "logging_first_step": true,
        "save_steps": 1000,
        "save_total_limit": 2,
        "evaluate_during_training": true,
        "output_dir": "/path/to/save/models/",
        "overwrite_output_dir": true,
        "logging_dir": "/path/to/save/log/files/",
        "disable_tqdm": true
    }

    For task_name, choose one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI, and point data_dir to the dataset directory of the same task.

  3. Fine-tune

    python run_glue.py /path/to/json/
    

    Instead of specifying a JSON path, you can directly specify configs as input arguments.
    You can also monitor training via TensorBoard.
    The --help option displays a help message.
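
If you plan to fine-tune on several GLUE tasks, writing one config per task by hand gets repetitive. The sketch below generates the JSON files from the sample config above. It assumes that each task's data lives in a sub-directory named after the task and that the same hyperparameters are reused for every task; both are assumptions for illustration, as are the helper names and the output layout.

```python
import json
from pathlib import Path
from typing import List

GLUE_TASKS = ["CoLA", "SST-2", "MRPC", "STS-B", "QQP", "MNLI", "QNLI", "RTE", "WNLI"]


def make_config(task: str, glue_root: str, weights: str, out_root: str) -> dict:
    """Return a config dict for one task, mirroring the sample JSON above."""
    if task not in GLUE_TASKS:
        raise ValueError(f"Unknown GLUE task: {task}")
    return {
        "model_name_or_path": weights,
        "tokenizer_name": "roberta-base",
        "task_name": task,
        "do_train": True,
        "do_eval": True,
        "data_dir": str(Path(glue_root) / task),  # assumed layout: <root>/<task>
        "max_seq_length": 128,
        "learning_rate": 2e-5,
        "num_train_epochs": 3,
        "per_device_train_batch_size": 32,
        "per_device_eval_batch_size": 128,
        "logging_steps": 500,
        "logging_first_step": True,
        "save_steps": 1000,
        "save_total_limit": 2,
        "evaluate_during_training": True,
        "output_dir": str(Path(out_root) / task / "models"),
        "overwrite_output_dir": True,
        "logging_dir": str(Path(out_root) / task / "logs"),
        "disable_tqdm": True,
    }


def write_configs(config_dir: str, glue_root: str, weights: str, out_root: str) -> List[Path]:
    """Write one JSON file per GLUE task and return the created paths."""
    out = Path(config_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for task in GLUE_TASKS:
        path = out / f"{task}.json"
        path.write_text(json.dumps(make_config(task, glue_root, weights, out_root), indent=4))
        paths.append(path)
    return paths
```

Each generated file can then be passed to run_glue.py in the same way as a hand-written config.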

SQuAD

  1. Download SQuAD data

    cd ./utils
    python download_squad_data.py --save_dir=/path/to/squad/
    
  2. Fine-tune

    cd ..
    export SQUAD_DIR=/path/to/squad/
    python run_squad.py \
    --model_type roberta \
    --model_name_or_path=/path/to/pretrained/weights/ \
    --tokenizer_name roberta-base \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir=$SQUAD_DIR \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --per_gpu_train_batch_size 16 \
    --per_gpu_eval_batch_size 32 \
    --learning_rate 3e-5 \
    --weight_decay=0.01 \
    --warmup_steps=3327 \
    --num_train_epochs 10.0 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --logging_steps=278 \
    --save_steps=50000 \
    --patience=5 \
    --objective_type=maximize \
    --metric_name=f1 \
    --overwrite_output_dir \
    --evaluate_during_training \
    --output_dir=/path/to/save/weights/ \
    --logging_dir=/path/to/save/logs/ \
    --seed=42 
    

    Similar to pre-training, you can monitor the fine-tuning status via TensorBoard.
    The --help option displays a help message.
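
The sample command passes --patience, --objective_type and --metric_name, which suggests early stopping on the evaluation metric. This README does not describe the exact behaviour, so the following is only a sketch of one common interpretation: stop once the monitored metric has failed to improve for patience consecutive evaluations. The class name and logic are assumptions, not the code used by run_squad.py.

```python
from typing import Optional


class EarlyStopper:
    """Stop when the metric has not improved for `patience` evaluations in a row."""

    def __init__(self, patience: int = 5, objective_type: str = "maximize"):
        if objective_type not in ("maximize", "minimize"):
            raise ValueError("objective_type must be 'maximize' or 'minimize'")
        self.patience = patience
        self.maximize = objective_type == "maximize"
        self.best: Optional[float] = None
        self.bad_evals = 0

    def step(self, value: float) -> bool:
        """Record one evaluation result; return True if training should stop."""
        improved = (
            self.best is None
            or (self.maximize and value > self.best)
            or (not self.maximize and value < self.best)
        )
        if improved:
            self.best = value
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopper(patience=5, objective_type="maximize")
    for f1 in [80.1, 84.3, 85.0, 84.9, 84.7]:
        if stopper.step(f1):
            break
    print("best f1 so far:", stopper.best)
```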

Citation

@inproceedings{yamaguchi-etal-2021-frustratingly,
    title = "Frustratingly Simple Pretraining Alternatives to Masked Language Modeling",
    author = "Yamaguchi, Atsuki  and
      Chrysostomou, George  and
      Margatina, Katerina  and
      Aletras, Nikolaos",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

License

MIT License

Owner
Atsuki Yamaguchi