TaCL: Improve BERT Pre-training with Token-aware Contrastive Learning

Overview

基于TaCL-BERT的中文命名实体识别及中文分词

Paper: TaCL: Improve BERT Pre-training with Token-aware Contrastive Learning

Authors: Yixuan Su, Fangyu Liu, Zaiqiao Meng, Lei Shu, Ehsan Shareghi, and Nigel Collier

论文主Github repo: https://github.com/yxuansu/TaCL

引用:

如果我们提供的资源对你有帮助,请考虑引用我们的文章。

@misc{su2021tacl,
      title={TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning}, 
      author={Yixuan Su and Fangyu Liu and Zaiqiao Meng and Lei Shu and Ehsan Shareghi and Nigel Collier},
      year={2021},
      eprint={2111.04198},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

环境配置

python version == 3.8
pip install -r requirements.txt

模型结构

Chinese TaCL BERT + CRF

Huggingface模型:

Model Name Model Address
Chinese (cambridgeltl/tacl-bert-base-chinese) link

使用范例:

实验

一、实验数据集

(1). 命名实体识别: (1) MSRA (2) OntoNotes (3) Resume (4) Weibo

(2). 中文分词: (1) PKU (2) CityU (3) AS

二、下载数据集

chmod +x ./download_benchmark_data.sh
./download_benchmark_data.sh

三、下载训练好的模型

chmod +x ./download_checkpoints.sh
./download_checkpoints.sh

四、使用训练好的模型进行inference

cd ./sh_folder/inference/
chmod +x ./inference_{}.sh
./inference_{}.sh

对于不同的数据集{}的取值为['msra', 'ontonotes', 'weibo', 'resume', 'pku', 'cityu', 'as'],相关参数的含义为:

--saved_ckpt_path: 训练好的模型位置
--train_path: 训练集数据路径
--dev_path: 验证集数据路径
--test_path: 测试集数据路径
--label_path: 数据标签路径
--batch_size: inference时的batch size

五、测试集模型结果

使用提供的模型进行inference后,可以得到如下结果。

Dataset Precision Recall F1
MSRA 95.41 95.47 95.44
OntoNotes 81.88 82.98 82.42
Resume 96.48 96.42 96.45
Weibo 68.40 70.73 69.54
PKU 97.04 96.46 96.75
CityU 98.16 98.19 98.18
AS 96.51 96.99 96.75

六、从头训练一个模型

cd ./sh_folder/train/
chmod +x ./{}.sh
./{}.sh

对于不同的数据集{}的取值为['msra', 'ontonotes', 'weibo', 'resume', 'pku', 'cityu', 'as'],相关参数的含义为:

--model_name: 中文TaCL BERT的模型名称(cambridgeltl/tacl-bert-base-chinese)
--train_path: 训练集数据路径
--dev_path: 验证集数据路径
--test_path: 测试集数据路径
--label_path: 数据标签路径
--learning_rate: 学习率
--number_of_gpu: 可使用的GPU数量
--number_of_runs: 重复试验次数
--save_path_prefix: 模型存储路径

[Note 1] 我们没有对模型进行任何和学习率调参,2e-5只是默认值。通过调整学习率也许可以获得更好的结果。

[Note 2] 实际的batch size等于gradient_accumulation_steps x number_of_gpu x batch_size_per_gpu。我们推荐将其设置为128。

Inference: 使用在./sh_folder/inference/路径中的sh进行inference。将--saved_ckpt_path设置为自己重新训练好的模型的路径。

交互式使用训练好的模型进行inference

以下我们使用MSRA数据集作为范例。

(使用以下代码前,请先下载我们提供的训练好的模型以及数据集。具体的指导请见以上章节)

# 载入数据
from dataclass import Data
from transformers import AutoTokenizer
model_name = 'cambridgeltl/tacl-bert-base-chinese'
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_path = r'./benchmark_data/NER/MSRANER/MSRA.test.char.txt'
label_path = r'./benchmark_data/NER/MSRANER/MSRA_NER_Label.txt'
max_len = 128
data = Data(tokenizer, data_path, data_path, data_path, label_path, max_len)

# 载入模型
import torch
from model import NERModel
model = NERModel(model_name, data.num_class)
ckpt_path = r'./pretrained_ckpt/msra/msra_ckpt'
model_ckpt = torch.load(ckpt_path, map_location=torch.device('cpu'))
model_parameters = model_ckpt['model']
model.load_state_dict(model_parameters)
model.eval()

# 提供输入
text = "中 共 中 央 致 中 国 致 公 党 十 一 大 的 贺 词"
text = "[CLS] " + text + " [SEP]"
tokens = tokenizer.tokenize(text)
# process token input
input_id = tokenizer.convert_tokens_to_ids(tokens)
input_id = torch.LongTensor(input_id).view(1, -1)
attn_mask = ~input_id.eq(data.pad_idx)
tgt_mask = [1.0] * len(tokens)
tgt_mask = torch.tensor(tgt_mask, dtype=torch.uint8).contiguous().view(1,-1)

# 使用模型进行解码
x = model.decode(input_id, attn_mask, tgt_mask)[0][1:-1] # remove [CLS] and [SEP] tokens.
res = ' '.join([data.id2label_dict[tag] for tag in x])
print (res)

# 模型输出结果: 
# B-NT M-NT M-NT E-NT O B-NT M-NT M-NT M-NT M-NT M-NT M-NT E-NT O O O
# 标准预测结果: 
# B-NT M-NT M-NT E-NT O B-NT M-NT M-NT M-NT M-NT M-NT M-NT E-NT O O O

联系

如果有任何的问题,以下是我的联系方式(ys484 at outlook dot com)。

Owner
Yixuan Su
Yixuan Su
2021语言与智能技术竞赛:机器阅读理解任务

LICS2021 MRC 1. 项目&任务介绍 本项目基于官方给定的baseline(DuReader-Checklist-BASELINE)进行二次改造,对整个代码框架做了简单的重构,对核心网络结构添加了注释,解耦了数据读取的模块,并添加了阈值确认的功能,一些小的细节也做了改进。 本次任务为202

roar 29 Dec 05, 2022
✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

✨A Python framework to explore, label, and monitor data for NLP projects

Recognai 1.5k Jan 02, 2023
Easy-to-use CPM for Chinese text generation

CPM 项目描述 CPM(Chinese Pretrained Models)模型是北京智源人工智能研究院和清华大学发布的中文大规模预训练模型。官方发布了三种规模的模型,参数量分别为109M、334M、2.6B,用户需申请与通过审核,方可下载。 由于原项目需要考虑大模型的训练和使用,需要安装较为复杂

382 Jan 07, 2023
OCR을 이용하여 인원수를 인식 후 줌을 Kill 해줍니다

How To Use killtheZoom-2.0 Windows 0. https://joyhong.tistory.com/79 이 글을 보면서 tesseract를 C:\Program Files\Tesseract-OCR 경로로 설치해주세요(한국어 언어 추가 필요) 상단의 초

김정인 9 Sep 13, 2021
Blue Brain text mining toolbox for semantic search and structured information extraction

Blue Brain Search Source Code DOI Data & Models DOI Documentation Latest Release Python Versions License Build Status Static Typing Code Style Securit

The Blue Brain Project 29 Dec 01, 2022
Stack based programming language that compiles to x86_64 assembly or can alternatively be interpreted in Python

lang lang is a simple stack based programming language written in Python. It can

Christoffer Aakre 1 May 30, 2022
Generating new names based on trends in data using GPT2 (Transformer network)

MLOpsNameGenerator Overall Goal The goal of the project is to develop a model that is capable of creating Pokémon names based on its description, usin

Gustav Lang Moesmand 2 Jan 10, 2022
Journalism AI – Quotes extraction for modular journalism

Quote extraction for modular journalism (JournalismAI collab 2021)

Journalism AI collab 2021 207 Dec 25, 2022
The Classical Language Toolkit

Notice: This Git branch (dev) contains the CLTK's upcoming major release (v. 1.0.0). See https://github.com/cltk/cltk/tree/master and https://docs.clt

Classical Language Toolkit 754 Jan 09, 2023
Baseline code for Korean open domain question answering(ODQA)

Open-Domain Question Answering(ODQA)는 다양한 주제에 대한 문서 집합으로부터 자연어 질의에 대한 답변을 찾아오는 task입니다. 이때 사용자 질의에 답변하기 위해 주어지는 지문이 따로 존재하지 않습니다. 따라서 사전에 구축되어있는 Knowl

VUMBLEB 69 Nov 04, 2022
A list of NLP(Natural Language Processing) tutorials built on Tensorflow 2.0.

A list of NLP(Natural Language Processing) tutorials built on Tensorflow 2.0.

Won Joon Yoo 335 Jan 04, 2023
Modified GPT using average pooling to reduce the softmax attention memory constraints.

NLP-GPT-Upsampling This repository contains an implementation of Open AI's GPT Model. In particular, this implementation takes inspiration from the Ny

WD 1 Dec 03, 2021
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Max Woolf 3.1k Jan 07, 2023
Club chatbot

Chatbot Club chatbot Instructions to get the Chatterbot working Step 1. First make sure you are using a version of Python 3 or newer. To check your ve

5 Mar 07, 2022
OceanScript is an Esoteric language used to encode and decode text into a formulation of characters

OceanScript is an Esoteric language used to encode and decode text into a formulation of characters - where the final result looks like waves in the ocean.

[ICCV 2021] Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification

Counterfactual Attention Learning Created by Yongming Rao*, Guangyi Chen*, Jiwen Lu, Jie Zhou This repository contains PyTorch implementation for ICCV

Yongming Rao 89 Dec 18, 2022
Simple Annotated implementation of GPT-NeoX in PyTorch

Simple Annotated implementation of GPT-NeoX in PyTorch This is a simpler implementation of GPT-NeoX in PyTorch. We have taken out several optimization

labml.ai 101 Dec 03, 2022
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 3k Jan 05, 2023
Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

0 Feb 13, 2022
In this project, we compared Spanish BERT and Multilingual BERT in the Sentiment Analysis task.

Applying BERT Fine Tuning to Sentiment Classification on Amazon Reviews Abstract Sentiment analysis has made great progress in recent years, due to th

Alexander Leonardo Lique Lamas 5 Jan 03, 2022