An ultra-lightweight PyTorch implementation of BERT, with extensive Chinese comments, an easily modified structure, and ongoing updates

Last update: Dec 18, 2022

Overview

bert4pytorch

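Update, August 27, 2021:

Thanks for the stars. Some users have recently reported a few small bugs, which I have also noticed. Work has been very busy this month, so updates have lagged. Around mid-September I plan to release a version that is fully usable after a single pip install, along with some additional key comments. I will also add adversarial training and a complete fine-tuning example.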

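Background

The most popular PyTorch BERT framework today is the Transformers project from the Hugging Face team. As that project has grown, however, it has become heavy. For beginners, and even for people with some NLP background, following its code logic and gaining a deep understanding of BERT is difficult.

Modifying the low-level code of Transformers is also quite difficult, which makes it hard to customize the model architecture.

This project condenses the whole BERT architecture into a few files (mostly adapted from the open-source Transformers project), removes a large amount of non-essential code, and adds a few features such as EMA and a warmup schedule. The core parts carry extensive Chinese comments that aim to answer questions readers may run into while using the code.

The core of the project is just three files: modeling, tokenization, and optimization, each only a few hundred lines long. Together with the extensive Chinese comments, they are meant to make BERT quick to understand in depth.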

Features

Implemented so far

Loading pretrained bert and RoBERTa-wwm-ext weights for fine-tuning
An optimizer with warmup
Exponential moving average (EMA) of the model weights
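
To make the EMA item concrete, here is a minimal sketch of the update rule. It is not the project's own implementation: the class name `EMA`, its `update` method, and the `decay` parameter are hypothetical names chosen for illustration. Plain floats stand in for weights; the same arithmetic applies elementwise to tensors.

```python
class EMA:
    """Exponential moving average over named values (illustrative sketch).

    Values can be floats or any objects supporting `*` and `+`.
    When passing tensors, pass detached copies so the averages
    do not alias the live model weights.
    """

    def __init__(self, decay=0.999):
        if not 0.0 <= decay < 1.0:
            raise ValueError("decay must be in [0, 1)")
        self.decay = decay
        self.shadow = {}  # name -> running average

    def update(self, named_values):
        for name, value in named_values.items():
            if name not in self.shadow:
                # First observation initialises the average.
                self.shadow[name] = value
            else:
                # shadow <- decay * shadow + (1 - decay) * value
                self.shadow[name] = (
                    self.decay * self.shadow[name] + (1.0 - self.decay) * value
                )


ema = EMA(decay=0.9)
ema.update({"w": 1.0})  # average starts at 1.0
ema.update({"w": 2.0})  # average moves a tenth of the way toward 2.0
```

In a training loop, a sketch like this would typically be updated once after every optimizer step, and the averaged values would be swapped in for evaluation.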

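Planned

Architectures such as albert, GPT, and XLNet
Adversarial training, conditional Layer Norm, and similar features (the ideas come from Su Jianlin's open-source bert4keras project, which in fact inspired bert4pytorch)
Many more examples and Chinese comments to make the code easier to learn from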

Installation

pip install bert4pytorch==0.1.2

Usage

Loading a pretrained model

from bert4pytorch.modeling import BertModel, BertConfig
from bert4pytorch.tokenization import BertTokenizer
from bert4pytorch.optimization import AdamW, get_linear_schedule_with_warmup
import torch

model_path = "/model/pytorch_bert_pretrain_model"
config = BertConfig(model_path + "/config.json")

tokenizer = BertTokenizer(model_path + "/vocab.txt")
model = BertModel.from_pretrained(model_path, config)

input_ids, token_type_ids = tokenizer.encode("今天很开心")

input_ids = torch.tensor([input_ids])
token_type_ids = torch.tensor([token_type_ids])

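model.eval()  # switch off dropout for inference

# output_all_encoded_layers=True asks for the hidden states of every encoder layer
outputs = model(input_ids, token_type_ids, output_all_encoded_layers=True)

# other code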

Optimizer with warmup

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer
                if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer
                if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5, correct_bias=False)

# train_batches, num_epoches and warmup_proportion come from your own training setup
num_training_steps = train_batches * num_epoches
num_warmup_steps = int(num_training_steps * warmup_proportion)
schedule = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)
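
The shape of a linear warmup schedule can be sketched without any framework. The function below is an illustration only, assuming the project's schedule follows the same formula as the Transformers function of the same name; `linear_warmup_multiplier` is a hypothetical helper, not part of bert4pytorch. It returns the factor by which the base learning rate is scaled at a given step.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate scale factor: linear ramp up, then linear decay to zero."""
    if step < num_warmup_steps:
        # Warmup phase: grow linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: shrink linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# 10 warmup steps out of 100 total steps
multipliers = [linear_warmup_multiplier(s, 10, 100) for s in (0, 5, 10, 55, 100)]
print(multipliers)  # [0.0, 0.5, 1.0, 0.5, 0.0]
```

With lr=1e-5 as in the optimizer above, the effective learning rate under this formula would peak at 1e-5 at the end of warmup and reach zero at the last training step.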

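Other

I originally put this project together for my own convenience. During that time I often read Su Jianlin's blog, whose posts are remarkably incisive. What impresses me even more is that he regularly comes up with good tricks working on his own. Hats off to him.

This project is therefore named after bert4keras, as thanks for his generous sharing.

Later I gradually came to want to open-source the small results of my own learning. More examples will be added over time; early on, I will borrow examples from bert4keras and implement PyTorch versions of them. If you have questions, discussion is welcome; if this project is useful to you, please give it a star!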


Owner

muqiu
