NLP Text Classification

Overview

多标签文本分类任务

近年来随着深度学习的发展,模型参数的数量飞速增长。为了训练这些参数,需要更大的数据集来避免过拟合。然而,对于大部分NLP任务来说,构建大规模的标注数据集非常困难(成本过高),特别是对于句法和语义相关的任务。相比之下,大规模的未标注语料库的构建则相对容易。为了利用这些数据,我们可以先从其中学习到一个好的表示,再将这些表示应用到其他任务中。最近的研究表明,基于大规模未标注语料库的预训练模型(Pretrained Models, PTM) 在NLP任务上取得了很好的表现。

大量的研究表明基于大型语料库的预训练模型(Pretrained Models, PTM)可以学习通用的语言表示,有利于下游NLP任务,同时能够避免从零开始训练模型。随着计算能力的发展,深度模型的出现(即 Transformer)和训练技巧的增强使得 PTM 不断发展,由浅变深。


本图片来自于:https://github.com/thunlp/PLMpapers

本示例展示了如何以BERT(Bidirectional Encoder Representations from Transformers)预训练模型Finetune完成多标签文本分类任务。

快速开始

代码结构说明

以下是本项目主要代码结构及说明:

pretrained_models/
├── deploy # 部署
│   └── python
│       └── predict.py # python预测部署示例
├── export_model.py # 动态图参数导出静态图参数脚本
├── predict.py # 预测脚本
├── README.md # 使用说明
├── data.py # 数据处理
├── metric.py # 指标计算
├── model.py # 模型网络
└── train.py # 训练评估脚本

数据准备

从Kaggle下载Toxic Comment Classification Challenge数据集并将数据集文件放在./data路径下。 以下是./data路径的文件组成:

data/
├── sample_submission.csv # 预测结果提交样例
├── train.csv # 训练集
├── test.csv # 测试集
└── test_labels.csv # 测试数据标签,数值-1代表该条数据不参与打分

模型训练

我们以Kaggle Toxic Comment Classification Challenge为示例数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练

unset CUDA_VISIBLE_DEVICES
python -m paddle.distributed.launch --gpus "0" train.py --device gpu --save_dir ./checkpoints

可支持配置的参数:

  • save_dir:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。
  • max_seq_length:可选,BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。
  • batch_size:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。
  • learning_rate:可选,Fine-tune的最大学习率;默认为5e-5。
  • weight_decay:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。
  • epochs: 训练轮次,默认为3。
  • warmup_proption:可选,学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。
  • init_from_ckpt:可选,模型参数路径,热启动模型训练;默认为None。
  • seed:可选,随机种子,默认为1000。
  • device: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。
  • data_path: 可选,数据集文件路径,默认数据集存放在当前目录data文件夹下。

代码示例中使用的预训练模型是BERT,如果想要使用其他预训练模型如ERNIE等,只需要更换modeltokenizer即可。

程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的save_dir中。 如:

checkpoints/
├── model_100
│   ├── model_state.pdparams
│   ├── tokenizer_config.json
│   └── vocab.txt
└── ...

NOTE:

  • 如需恢复模型训练,则可以设置init_from_ckpt,如init_from_ckpt=checkpoints/model_100/model_state.pdparams
  • 使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,具体代码见export_model.py。静态图参数保存在output_path指定路径中。 运行方式:
python export_model.py --params_path=./checkpoints/model_1000/model_state.pdparams --output_path=./static_graph_params

其中params_path是指动态图训练保存的参数路径,output_path是指静态图参数导出路径。

导出模型之后,可以用于部署,deploy/python/predict.py文件提供了python部署预测示例。

NOTE:

  • 可通过threshold参数调整最终预测结果,当预测概率值大于threshold时预测结果为1,否则为0;默认为0.5。 运行方式:
python deploy/python/predict.py --model_file=static_graph_params.pdmodel --params_file=static_graph_params.pdiparams

待预测数据如以下示例:

Your bullshit is not welcome here.
Thank you for understanding. I think very highly of you and would not revert without discussion.

预测结果示例:

Data:    Your bullshit is not welcome here.
toxic:   1
severe_toxic:    0
obscene:         0
threat:          0
insult:          0
identity_hate:   0
Data:    Thank you for understanding. I think very highly of you and would not revert without discussion.
toxic:   0
severe_toxic:    0
obscene:         0
threat:          0
insult:          0
identity_hate:   0

模型预测

启动预测:

export CUDA_VISIBLE_DEVICES=0
python predict.py --device 'gpu' --params_path checkpoints/model_1000/model_state.pdparams

预测结果会以csv文件sample_test.csv保存在当前目录下。

Owner
Jason
Jason
Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models

PEGASUS library Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models, or PEGASUS, uses self-supervised

Google Research 1.4k Dec 22, 2022
GVT is a generic translation tool for parts of text on the PC screen with Text to Speak functionality.

GVT is a generic translation tool for parts of text on the PC screen with Text to Speech functionality. I wanted to create it because the existing tools that I experimented with did not satisfy me in

Nuked 1 Aug 21, 2022
A number of methods in order to perform Natural Language Processing on live data derived from Twitter

A number of methods in order to perform Natural Language Processing on live data derived from Twitter

1 Nov 24, 2021
Tensorflow Implementation of A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Tensorflow Implementation of A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Ankur Dhuriya 10 Oct 13, 2022
Lumped-element impedance calculator and frequency-domain plotter.

fastZ: Lumped-Element Impedance Calculator fastZ is a small tool for calculating and visualizing electrical impedance in Python. Features include: Sup

Wesley Hileman 47 Nov 18, 2022
🕹 An esoteric language designed so that the program looks like the transcript of a Pokémon battle

PokéBattle is an esoteric language designed so that the program looks like the transcript of a Pokémon battle. Original inspiration and specification

Eduardo Correia 9 Jan 11, 2022
Honor's thesis project analyzing whether the GPT-2 model can more effectively generate free-verse or structured poetry.

gpt2-poetry The following code is for my senior honor's thesis project, under the guidance of Dr. Keith Holyoak at the University of California, Los A

Ashley Kim 2 Jan 09, 2022
Twitter Sentiment Analysis using #tag, words and username

Twitter Sentment Analysis Web App using #tag, words and username to fetch data finds Insides of data and Tells Sentiment of the perticular #tag, words or username.

Kumar Saksham 26 Dec 25, 2022
Open-World Entity Segmentation

Open-World Entity Segmentation Project Website Lu Qi*, Jason Kuen*, Yi Wang, Jiuxiang Gu, Hengshuang Zhao, Zhe Lin, Philip Torr, Jiaya Jia This projec

DV Lab 408 Dec 29, 2022
Production First and Production Ready End-to-End Keyword Spotting Toolkit

Production First and Production Ready End-to-End Keyword Spotting Toolkit

223 Jan 02, 2023
A large-scale (194k), Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

MedMCQA MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering A large-scale, Multiple-Choice Question Answe

MedMCQA 24 Nov 30, 2022
Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

Mask-Align: Self-Supervised Neural Word Alignment This is the implementation of our work Mask-Align: Self-Supervised Neural Word Alignment. @inproceed

THUNLP-MT 46 Dec 15, 2022
File-based TF-IDF: Calculates keywords in a document, using a word corpus.

File-based TF-IDF Calculates keywords in a document, using a word corpus. Why? Because I found myself with hundreds of plain text files, with no way t

Jakob Lindskog 1 Feb 11, 2022
Protein Language Model

ProteinLM We pretrain protein language model based on Megatron-LM framework, and then evaluate the pretrained model results on TAPE (Tasks Assessing P

THUDM 77 Dec 27, 2022
Graph Coloring - Weighted Vertex Coloring Problem

Graph Coloring - Weighted Vertex Coloring Problem This project proposes several local searches and an MCTS algorithm for the weighted vertex coloring

Cyril 1 Jul 08, 2022
text to speech toolkit. 好用的中文语音合成工具箱,包含语音编码器、语音合成器、声码器和可视化模块。

ttskit Text To Speech Toolkit: 语音合成工具箱。 安装 pip install -U ttskit 注意 可能需另外安装的依赖包:torch,版本要求torch=1.6.0,=1.7.1,根据自己的实际环境安装合适cuda或cpu版本的torch。 ttskit的

KDD 483 Jan 04, 2023
Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Toy Machine Learning Pipeline Table of Contents About Getting Started ML task description and evaluation procedure Dataset description Repository stru

Shreya Shankar 190 Dec 21, 2022
一个基于Nonebot2和go-cqhttp的娱乐性qq机器人

Takker - 一个普通的QQ机器人 此项目为基于 Nonebot2 和 go-cqhttp 开发,以 Sqlite 作为数据库的QQ群娱乐机器人 关于 纯兴趣开发,部分功能借鉴了大佬们的代码,作为Q群的娱乐+功能性Bot 声明 此项目仅用于学习交流,请勿用于非法用途 这是开发者的第一个Pytho

风屿 79 Dec 29, 2022
Official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

This repository is the official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

vanint 101 Dec 30, 2022
Takes a string and puts it through different languages in Google Translate a requested amount of times, returning nonsense.

PythonTextObfuscator Takes a string and puts it through different languages in Google Translate a requested amount of times, returning nonsense. Requi

2 Aug 29, 2022