Chinese Pre-Trained Language Models (CPM-LM) Version-I

Overview

CPM-Generate

为了促进中文自然语言处理研究的发展,本项目提供了 CPM-LM (2.6B) 模型的文本生成代码,可用于文本生成的本地测试,并以此为基础进一步研究零次学习/少次学习等场景。[项目首页] [模型下载] [技术报告]

若您想使用CPM-1进行推理,我们建议使用高效推理工具BMInf,支持1060以上显卡单卡推理。

安装

首先安装pytorch等基础依赖,再安装APEX以支持fp16:

pip install -r requirements.txt
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

考虑apex的安装容易发生问题,我们构建了对应的Docker容器,可以进行快速环境搭建。安装方式如下:

docker pull dmye/cpm:v0

参考运行指令如下:

:/CPM --name=cpm cpm:v0 ">
sudo docker run --gpus '"device=0,1"' -it -v 
   
    :/CPM  --name=cpm  cpm:v0

   

其中 为代码所在目录,-v进行文件目录挂载

注:感谢qhduan同学提供了基于TensorFlow的使用代码,用作Pytorch之外的备选。

模型

模型下载后文件夹的目录结构需设置如下:

.
├── 80000
│   ├── mp_rank_00_model_states.pt
│   └── mp_rank_01_model_states.pt
└── latest_checkpointed_iteration.txt

为保证下载文件的正确性,文件的checksum如下:

SHA1
71d6b6ad4f47b46724eb82c05da8fb9175e62a7d  80000/mp_rank_00_model_states.pt
42aa247a262e2011fa5e276f1a8389fad6d80edc  80000/mp_rank_01_model_states.pt
MD5
f3f6d2f7d84c6a45290a31dabf79ddac  80000/mp_rank_00_model_states.pt
b0e960be4b5226e759ae6fc5246f9160  80000/mp_rank_01_model_states.pt

使用

提供了命令行交互式生成:

bash scripts/generate_text.sh /path/to/CPM

如不使用交互式输入,可增加第二个参数,告知输入文本的位置

bash scripts/generate_text.sh /path/to/CPM example.txt

运行该脚本需要两块GPU,每张卡的GPU内存占用约为7GB。该项目主要基于 Megatron-LM 进行修改。模型的主体架构与GPT-2一致。

默认的模型并行参数为2,如果需要修改,可以使用change_mp.py,并调整generate_text.sh中的MPSIZEchange_mp.py的使用示例如下:

python change_mp.py /path/to/CPM MPSIZE

这里的/path/to/CPM为模型路径,MPSIZE为一个整数,可以为1或者2的倍数,结果会生成一个新的模型,存储路径为/path/to/CPM_MPSIZE

Tokenization

Tokenization实现主要在data_util/tokenization_gpt2.py,先对于文本进行分词,再使用 SentencePiece 得到 BPE 的结果。由于 SentencePiece 不能有效编码空格和换行符,在 BPE 之前,我们将文本中的空格和换行符替换为\u2582\u2583。生成文本的时候也会对应的把生成的\u2582\u2583替换回空格和换行符。

对应问题已解决。

分类任务零次学习(Zero-shot Learning)

提供了三个任务的零次学习任务脚本以供参考,包括OCNLI、TNEWS和IFLYTEK,数据下载链接。脚本使用方法如下:

# OCNLI
bash scripts/zero-shot-ocnli.sh /path/to/CPM /path/to/dataset
# TNEWS
bash scripts/zero-shot-tnews.sh /path/to/CPM /path/to/dataset
# IFLYTEK
bash scripts/zero-shot-iflytek.sh /path/to/CPM /path/to/dataset

TODO

  • 实验环境的docker镜像
  • 提供各个任务具体的使用模板
  • 公开技术报告
  • 模型并行数可动态调整
  • Fine-tune代码
  • 开源实验中使用的小规模模型参数

引用

@article{cpm-v1,
  title={CPM: A Large-scale Generative Chinese Pre-trained Language Model},
  author={Zhang, Zhengyan and Han, Xu, and Zhou, Hao, and Ke, Pei, and Gu, Yuxian and Ye, Deming and Qin, Yujia and Su, Yusheng and Ji, Haozhe and Guan, Jian and Qi, Fanchao and Wang, Xiaozhi and Zheng, Yanan and Zeng, Guoyang and Cao, Huanqi and Chen, Shengqi and Li, Daixuan and Sun, Zhenbo and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Tang, Jie and Li, Juanzi and Sun, Maosong},
  year={2020}
}
Owner
Tsinghua AI
Tsinghua AI
The training code for the 4th place model at MDX 2021 leaderboard A.

The training code for the 4th place model at MDX 2021 leaderboard A.

Chin-Yun Yu 32 Dec 18, 2022
An ActivityWatch watcher to pose questions to the user and record her answers.

aw-watcher-ask An ActivityWatch watcher to pose questions to the user and record her answers. This watcher uses Zenity to present dialog boxes to the

Bernardo Chrispim Baron 33 Dec 03, 2022
CredData is a set of files including credentials in open source projects

CredData is a set of files including credentials in open source projects. CredData includes suspicious lines with manual review results and more information such as credential types for each suspicio

Samsung 19 Sep 07, 2022
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ELECTRA Introduction ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using

Google Research 2.1k Dec 28, 2022
Implementation of paper Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa.

RoBERTaABSA This repo contains the code for NAACL 2021 paper titled Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoB

106 Nov 28, 2022
🧪 Cutting-edge experimental spaCy components and features

spacy-experimental: Cutting-edge experimental spaCy components and features This package includes experimental components and features for spaCy v3.x,

Explosion 65 Dec 30, 2022
Skipgram Negative Sampling in PyTorch

PyTorch SGNS Word2Vec's SkipGramNegativeSampling in Python. Yet another but quite general negative sampling loss implemented in PyTorch. It can be use

Jamie J. Seol 287 Dec 14, 2022
NLPShala , the best IDE for all Natural language processing tasks.

The revolutionary IDE for all NLP (Natural language processing) stuffs on the internet.

Abhi 3 Aug 08, 2021
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
本插件是pcrjjc插件的重置版,可以独立于后端api运行

pcrjjc2 本插件是pcrjjc重置版,不需要使用其他后端api,但是需要自行配置客户端 本项目基于AGPL v3协议开源,由于项目特殊性,禁止基于本项目的任何商业行为 配置方法 环境需求:.net framework 4.5及以上 jre8 别忘了装jre8 别忘了装jre8 别忘了装jre8

132 Dec 26, 2022
ACL'2021: Learning Dense Representations of Phrases at Scale

DensePhrases DensePhrases is an extractive phrase search tool based on your natural language inputs. From 5 million Wikipedia articles, it can search

Princeton Natural Language Processing 540 Dec 30, 2022
ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab

AliceMind AliceMind: ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab This repository provides pre-trained encode

Alibaba 1.4k Jan 04, 2023
Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Alexander Veysov 3.2k Dec 31, 2022
AEC_DeepModel - Deep learning based acoustic echo cancellation baseline code

AEC_DeepModel - Deep learning based acoustic echo cancellation baseline code

凌逆战 75 Dec 05, 2022
The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021)

Language Models are Few-shot Multilingual Learners Paper This is the source code of the paper [Arxiv] [ACL Anthology]: This code has been written usin

Genta Indra Winata 45 Nov 21, 2022
Revisiting Pre-trained Models for Chinese Natural Language Processing (Findings of EMNLP 2020)

This repository contains the resources in our paper "Revisiting Pre-trained Models for Chinese Natural Language Processing", which will be published i

Yiming Cui 463 Dec 30, 2022
A framework for cleaning Chinese dialog data

A framework for cleaning Chinese dialog data

Yida 136 Dec 20, 2022
Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

This repo provides the code of the following papers: (GAR) "Generation-Augmented Retrieval for Open-domain Question Answering", ACL 2021 (RIDER) "Read

morning 49 Dec 26, 2022
We have built a Voice based Personal Assistant for people to access files hands free in their device using natural language processing.

Voice Based Personal Assistant We have built a Voice based Personal Assistant for people to access files hands free in their device using natural lang

Rushabh 2 Nov 13, 2021
PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

17 Dec 14, 2022