MPNet: Masked and Permuted Pre-training for Language Understanding

Related tags

Text Data & NLPMPNet
Overview

MPNet

MPNet: Masked and Permuted Pre-training for Language Understanding, by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu, is a novel pre-training method for language understanding tasks. It solves the problems of MLM (masked language modeling) in BERT and PLM (permuted language modeling) in XLNet and achieves better accuracy.

News: We have updated the pre-trained models now.

Supported Features

  • A unified view and implementation of several pre-training models including BERT, XLNet, MPNet, etc.
  • Code for pre-training and fine-tuning for a variety of language understanding (GLUE, SQuAD, RACE, etc) tasks.

Installation

We implement MPNet and this pre-training toolkit based on the codebase of fairseq. The installation is as follow:

pip install --editable pretraining/
pip install pytorch_transformers==1.0.0 transformers scipy sklearn

Pre-training MPNet

Our model is pre-trained with bert dictionary, you first need to pip install transformers to use bert tokenizer. We provide a script encode.py and a dictionary file dict.txt to tokenize your corpus. You can modify encode.py if you want to use other tokenizers (like roberta).

1) Preprocess data

We choose WikiText-103 as a demo. The running script is as follow:

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

for SPLIT in train valid test; do \
    python MPNet/encode.py \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done

Then, we need to binarize data. The command of binarizing data is following:

fairseq-preprocess \
    --only-source \
    --srcdict MPNet/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

2) Pre-train MPNet

The below command is to train a MPNet model:

TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=data-bin/wikitext-103

fairseq-train --fp16 $DATA_DIR \
    --task masked_permutation_lm --criterion masked_permutation_cross_entropy \
    --arch mpnet_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --input-mode 'mpnet'

Notes: You can replace arch with mpnet_rel_base and add command --mask-whole-words --bpe bert to use relative position embedding and whole word mask.

Notes: You can specify --input-mode as mlm or plm to train masked language model or permutation language model.

Pre-trained models

We have updated the final pre-trained MPNet model for fine-tuning.

You can load the pre-trained MPNet model like this:

from fairseq.models.masked_permutation_net import MPNet
mpnet = MPNet.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data', bpe='bert')
assert isinstance(mpnet.model, torch.nn.Module)

Fine-tuning MPNet on down-streaming tasks

Acknowledgements

Our code is based on fairseq-0.8.0. Thanks for their contribution to the open-source commuity.

Reference

If you find this toolkit useful in your work, you can cite the corresponding papers listed below:

@article{song2020mpnet,
    title={MPNet: Masked and Permuted Pre-training for Language Understanding},
    author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
    journal={arXiv preprint arXiv:2004.09297},
    year={2020}
}

Related Works

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

What is MUSE? MUSE stands for Multilingual Universal Sentence Encoder - multilingual extension (16 languages) of Universal Sentence Encoder (USE). MUS

Dani El-Ayyass 47 Sep 05, 2022
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
Athena is an open-source implementation of end-to-end speech processing engine.

Athena is an open-source implementation of end-to-end speech processing engine. Our vision is to empower both industrial application and academic research on end-to-end models for speech processing.

Ke Technologies 34 Sep 08, 2022
Ceaser-Cipher - The Caesar Cipher technique is one of the earliest and simplest method of encryption technique

Ceaser-Cipher The Caesar Cipher technique is one of the earliest and simplest me

Lateefah Ajadi 2 May 12, 2022
A fast Text-to-Speech (TTS) model. Work well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (so far). 快速语音合成模型,适用于英语、普通话/中文、日语、韩语、俄语和藏语(当前已测试)。

简体中文 | English 并行语音合成 [TOC] 新进展 2021/04/20 合并 wavegan 分支到 main 主分支,删除 wavegan 分支! 2021/04/13 创建 encoder 分支用于开发语音风格迁移模块! 2021/04/13 softdtw 分支 支持使用 Sof

Atomicoo 161 Dec 19, 2022
Statistics and Mathematics for Machine Learning, Deep Learning , Deep NLP

Stat4ML Statistics and Mathematics for Machine Learning, Deep Learning , Deep NLP This is the first course from our trio courses: Statistics Foundatio

Omid Safarzadeh 83 Dec 29, 2022
SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。

SimpleChinese2 SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。 声明 本项目是为方便个人工作所创建的,仅有部分代码原创。

Ming 30 Dec 02, 2022
Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vecto

3.2k Dec 30, 2022
Code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

This repository contains the code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

Chenhe Dong 28 Nov 10, 2022
SummerTime - Text Summarization Toolkit for Non-experts

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.

Yale-LILY 213 Jan 04, 2023
Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

Michael Petrochuk 2.1k Jan 01, 2023
Summarization module based on KoBART

KoBART-summarization Install KoBART pip install git+https://github.com/SKT-AI/KoBART#egg=kobart Requirements pytorch==1.7.0 transformers==4.0.0 pytor

seujung hwan, Jung 148 Dec 28, 2022
Mesh TensorFlow: Model Parallelism Made Easier

Mesh TensorFlow - Model Parallelism Made Easier Introduction Mesh TensorFlow (mtf) is a language for distributed deep learning, capable of specifying

1.3k Dec 26, 2022
Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation Official Code Repository for the paper "Unsupervised Documen

NLP*CL Laboratory 2 Oct 26, 2021
Model parallel transformers in JAX and Haiku

Table of contents Mesh Transformer JAX Updates Pretrained Models GPT-J-6B Links Acknowledgments License Model Details Zero-Shot Evaluations Architectu

Ben Wang 4.9k Jan 04, 2023
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context This repository contains the code in both PyTorch and TensorFlow for our paper

Zhilin Yang 3.3k Dec 28, 2022
Toward Model Interpretability in Medical NLP

Toward Model Interpretability in Medical NLP LING380: Topics in Computational Linguistics Final Project James Cross ( 1 Mar 04, 2022

This project aims to conduct a text information retrieval and text mining on medical research publication regarding Covid19 - treatments and vaccinations.

Project: Text Analysis - This project aims to conduct a text information retrieval and text mining on medical research publication regarding Covid19 -

1 Mar 14, 2022
customer care chatbot made with Rasa Open Source.

Customer Care Bot Customer care bot for ecomm company which can solve faq and chitchat with users, can contact directly to team. 🛠 Features Basic E-c

Dishant Gandhi 23 Oct 27, 2022
Module for automatic summarization of text documents and HTML pages.

Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim

Mišo Belica 3k Jan 08, 2023