超轻量级bert的pytorch版本,大量中文注释,容易修改结构,持续更新

Overview

bert4pytorch

2021年8月27更新:

感谢大家的star,最近有小伙伴反映了一些小的bug,我也注意到了,奈何这个月工作上实在太忙,更新不及时,大约会在9月中旬集中更新一个只需要pip一下就完全可用的版本,然后会新添加一些关键注释。 再增加对抗训练的内容,更新一个完整的finetune案例。

背景

目前最流行的pytorch版本的bert框架,莫过于huggingface团队的Transformers项目,但是随着项目的越来越大,显得很重,对于初学者、有一定nlp基础的人来说,想看懂里面的代码逻辑,深入了解bert,有很大的难度。

另外,如果想修改Transformers的底层代码也是想当困难的,导致很难对模型进行魔改。

本项目把整个bert架构,浓缩在几个文件当中(主要修改自Transfomers开源项目),删除大量无关紧要的代码,新增了一些功能,比如:ema、warmup schedule,并且在核心部分,添加了大量中文注释,力求解答读者在使用过程中产生的一些疑惑。

此项目核心只有三个文件,modeling、tokenization、optimization。并且都在几百行内完成。结合大量的中文注释,分分钟透彻理解bert。

功能

现在已经实现

  • 加载bert、RoBERTa-wwm-ext的预训练权重进行fintune
  • 实现了带warmup的优化器
  • 实现了模型权重的指数滑动平均(ema)

未来将实现

  • albert、GPT、XLnet等网络架构
  • 实现对抗训练、conditional Layer Norm等功能(想法来自于苏神(苏剑林)的bert4keras开源项目,事实上,bert4pytorch就是受到了它的启发)
  • 添加大量的例子和中文注释,减轻学习难度

安装

pip install bert4pytorch==0.1.2

使用

  • 加载预训练模型
from bert4pytorch.modeling import BertModel, BertConfig
from bert4pytorch.tokenization import BertTokenizer
from bert4pytorch.optimization import AdamW, get_linear_schedule_with_warmup
import torch

model_path = "/model/pytorch_bert_pretrain_model"
config = BertConfig(model_path + "/config.json")

tokenizer = BertTokenizer(model_path + "/vocab.txt")
model = BertModel.from_pretrained(model_path, config)

input_ids, token_type_ids = tokenizer.encode("今天很开心")

input_ids = torch.tensor([input_ids])
token_type_ids = torch.tensor([token_type_ids])

model.eval()

outputs = model(input_ids, token_type_ids, output_all_encoded_layers=True)

## orther code
  • 带warmup的优化器实现
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer
                if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer
                if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5, correct_bias=False)

num_training_steps=train_batches * num_epoches
num_warmup_steps=num_training_steps * warmup_proportion
schedule = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)

其他

最初整理这个项目,只是为了自己方便。这一段时间,经常逛苏剑林大佬的博客,里面的内容写得相当精辟,更加感叹的是, 苏神经常能闭门造车出一些还不错的trick,只能说,大佬牛逼。

所以本项目命名也雷同bert4keras,以感谢苏大佬无私的分享。

后来,慢慢萌生把学习中的小小成果开源出来,后期会渐渐补充例子,前期会借用苏神的bert4keras里面的例子,实现pytorch版本。如果有问题,欢迎讨论;如果本项目对您有用,请不吝star!

Owner
muqiu
muqiu
GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates

GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates Vibhor Agarwal, Sagar Joglekar, Anthony P. Young an

Vibhor Agarwal 2 Jun 30, 2022
Deploying a Text Summarization NLP use case on Docker Container Utilizing Nvidia GPU

GPU Docker NLP Application Deployment Deploying a Text Summarization NLP use case on Docker Container Utilizing Nvidia GPU, to setup the enviroment on

Ritesh Yadav 9 Oct 14, 2022
AI-powered literature discovery and review engine for medical/scientific papers

AI-powered literature discovery and review engine for medical/scientific papers paperai is an AI-powered literature discovery and review engine for me

NeuML 819 Dec 30, 2022
Python library for parsing resumes using natural language processing and machine learning

CVParser Python library for parsing resumes using natural language processing and machine learning. Setup Installation on Linux and Mac OS Follow the

nafiu 0 Jul 29, 2021
NLP Core Library and Model Zoo based on PaddlePaddle 2.0

PaddleNLP 2.0拥有丰富的模型库、简洁易用的API与高性能的分布式训练的能力,旨在为飞桨开发者提升文本建模效率,并提供基于PaddlePaddle 2.0的NLP领域最佳实践。

6.9k Jan 01, 2023
A music comments dataset, containing 39,051 comments for 27,384 songs.

Music Comments Dataset A music comments dataset, containing 39,051 comments for 27,384 songs. For academic research use only. Introduction This datase

Zhang Yixiao 2 Jan 10, 2022
Simple virtual assistant using pyttsx3 and speech recognition optionally with pywhatkit and pther libraries.

VirtualAssistant Simple virtual assistant using pyttsx3 and speech recognition optionally with pywhatkit and pther libraries. Third Party Libraries us

Logadheep 1 Nov 27, 2021
HF's ML for Audio study group

Hugging Face Machine Learning for Audio Study Group Welcome to the ML for Audio Study Group. Through a series of presentations, paper reading and disc

Vaibhav Srivastav 110 Jan 01, 2023
NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

Artefact 114 Dec 15, 2022
A text augmentation tool for named entity recognition.

neraug This python library helps you with augmenting text data for named entity recognition. Augmentation Example Reference from An Analysis of Simple

Hiroki Nakayama 48 Oct 11, 2022
MPNet: Masked and Permuted Pre-training for Language Understanding

MPNet MPNet: Masked and Permuted Pre-training for Language Understanding, by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu, is a novel pre-tr

Microsoft 228 Nov 21, 2022
A PyTorch implementation of the Transformer model in "Attention is All You Need".

Attention is all you need: A Pytorch Implementation This is a PyTorch implementation of the Transformer model in "Attention is All You Need" (Ashish V

Yu-Hsiang Huang 7.1k Jan 05, 2023
💛 Code and Dataset for our EMNLP 2021 paper: "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes"

Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes Official PyTorch implementation and EmoCause evaluatio

Hyunwoo Kim 50 Dec 21, 2022
NLP library designed for reproducible experimentation management

Welcome to the Transfer NLP library, a framework built on top of PyTorch to promote reproducible experimentation and Transfer Learning in NLP You can

Feedly 290 Dec 20, 2022
A fast, efficient universal vector embedding utility package.

Magnitude: a fast, simple vector embedding utility library A feature-packed Python package and vector storage file format for utilizing vector embeddi

Plasticity 1.5k Jan 02, 2023
Smart discord chatbot integrated with Dialogflow to manage different classrooms and assist in teaching!

smart-school-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

Tom Huynh 5 Oct 24, 2022
Implementation of ProteinBERT in Pytorch

ProteinBERT - Pytorch (wip) Implementation of ProteinBERT in Pytorch. Original Repository Install $ pip install protein-bert-pytorch Usage import torc

Phil Wang 92 Dec 25, 2022
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

ALBERT ***************New March 28, 2020 *************** Add a colab tutorial to run fine-tuning for GLUE datasets. ***************New January 7, 2020

Google Research 3k Dec 26, 2022
txtai: Build AI-powered semantic search applications in Go

txtai: Build AI-powered semantic search applications in Go txtai executes machine-learning workflows to transform data and build AI-powered semantic s

NeuML 49 Dec 06, 2022
🚀Clone a voice in 5 seconds to generate arbitrary speech in real-time

English | 中文 Features 🌍 Chinese supported mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, data_aishell, and etc. ?

Vega 25.6k Dec 31, 2022