Chinese Pre-Trained Language Models (CPM-LM) Version-I

Overview

CPM-Generate

为了促进中文自然语言处理研究的发展,本项目提供了 CPM-LM (2.6B) 模型的文本生成代码,可用于文本生成的本地测试,并以此为基础进一步研究零次学习/少次学习等场景。[项目首页] [模型下载] [技术报告]

若您想使用CPM-1进行推理,我们建议使用高效推理工具BMInf,支持1060以上显卡单卡推理。

安装

首先安装pytorch等基础依赖,再安装APEX以支持fp16:

pip install -r requirements.txt
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

考虑apex的安装容易发生问题,我们构建了对应的Docker容器,可以进行快速环境搭建。安装方式如下:

docker pull dmye/cpm:v0

参考运行指令如下:

:/CPM --name=cpm cpm:v0 ">
sudo docker run --gpus '"device=0,1"' -it -v 
   
    :/CPM  --name=cpm  cpm:v0

   

其中 为代码所在目录,-v进行文件目录挂载

注:感谢qhduan同学提供了基于TensorFlow的使用代码,用作Pytorch之外的备选。

模型

模型下载后文件夹的目录结构需设置如下:

.
├── 80000
│   ├── mp_rank_00_model_states.pt
│   └── mp_rank_01_model_states.pt
└── latest_checkpointed_iteration.txt

为保证下载文件的正确性,文件的checksum如下:

SHA1
71d6b6ad4f47b46724eb82c05da8fb9175e62a7d  80000/mp_rank_00_model_states.pt
42aa247a262e2011fa5e276f1a8389fad6d80edc  80000/mp_rank_01_model_states.pt
MD5
f3f6d2f7d84c6a45290a31dabf79ddac  80000/mp_rank_00_model_states.pt
b0e960be4b5226e759ae6fc5246f9160  80000/mp_rank_01_model_states.pt

使用

提供了命令行交互式生成:

bash scripts/generate_text.sh /path/to/CPM

如不使用交互式输入,可增加第二个参数,告知输入文本的位置

bash scripts/generate_text.sh /path/to/CPM example.txt

运行该脚本需要两块GPU,每张卡的GPU内存占用约为7GB。该项目主要基于 Megatron-LM 进行修改。模型的主体架构与GPT-2一致。

默认的模型并行参数为2,如果需要修改,可以使用change_mp.py,并调整generate_text.sh中的MPSIZEchange_mp.py的使用示例如下:

python change_mp.py /path/to/CPM MPSIZE

这里的/path/to/CPM为模型路径,MPSIZE为一个整数,可以为1或者2的倍数,结果会生成一个新的模型,存储路径为/path/to/CPM_MPSIZE

Tokenization

Tokenization实现主要在data_util/tokenization_gpt2.py,先对于文本进行分词,再使用 SentencePiece 得到 BPE 的结果。由于 SentencePiece 不能有效编码空格和换行符,在 BPE 之前,我们将文本中的空格和换行符替换为\u2582\u2583。生成文本的时候也会对应的把生成的\u2582\u2583替换回空格和换行符。

对应问题已解决。

分类任务零次学习(Zero-shot Learning)

提供了三个任务的零次学习任务脚本以供参考,包括OCNLI、TNEWS和IFLYTEK,数据下载链接。脚本使用方法如下:

# OCNLI
bash scripts/zero-shot-ocnli.sh /path/to/CPM /path/to/dataset
# TNEWS
bash scripts/zero-shot-tnews.sh /path/to/CPM /path/to/dataset
# IFLYTEK
bash scripts/zero-shot-iflytek.sh /path/to/CPM /path/to/dataset

TODO

  • 实验环境的docker镜像
  • 提供各个任务具体的使用模板
  • 公开技术报告
  • 模型并行数可动态调整
  • Fine-tune代码
  • 开源实验中使用的小规模模型参数

引用

@article{cpm-v1,
  title={CPM: A Large-scale Generative Chinese Pre-trained Language Model},
  author={Zhang, Zhengyan and Han, Xu, and Zhou, Hao, and Ke, Pei, and Gu, Yuxian and Ye, Deming and Qin, Yujia and Su, Yusheng and Ji, Haozhe and Guan, Jian and Qi, Fanchao and Wang, Xiaozhi and Zheng, Yanan and Zeng, Guoyang and Cao, Huanqi and Chen, Shengqi and Li, Daixuan and Sun, Zhenbo and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Tang, Jie and Li, Juanzi and Sun, Maosong},
  year={2020}
}
Owner
Tsinghua AI
Tsinghua AI
ChainKnowledgeGraph, 产业链知识图谱包括A股上市公司、行业和产品共3类实体

ChainKnowledgeGraph, 产业链知识图谱包括A股上市公司、行业和产品共3类实体,包括上市公司所属行业关系、行业上级关系、产品上游原材料关系、产品下游产品关系、公司主营产品、产品小类共6大类。 上市公司4,654家,行业511个,产品95,559条、上游材料56,824条,上级行业480条,下游产品390条,产品小类52,937条,所属行业3,946条。

liuhuanyong 415 Jan 06, 2023
profile tools for pytorch nn models

nnprof Introduction nnprof is a profile tool for pytorch neural networks. Features multi profile mode: nnprof support 4 profile mode: Layer level, Ope

Feng Wang 42 Jul 09, 2022
aMLP Transformer Model for Japanese

aMLP-japanese Japanese aMLP Pretrained Model aMLPとは、Liu, Daiらが提案する、Transformerモデルです。 ざっくりというと、BERTの代わりに使えて、より性能の良いモデルです。 詳しい解説は、こちらの記事などを参考にしてください。 この

tanreinama 13 Aug 11, 2022
The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Neural Machine Translation communication system The model is basically direct to convert one source language to another targeted language using encode

Nishant Banjade 7 Sep 22, 2022
jiant is an NLP toolkit

🚨 Update 🚨 : As of 2021/10/17, the jiant project is no longer being actively maintained. This means there will be no plans to add new models, tasks,

ML² AT CILVR 1.5k Dec 28, 2022
Perform sentiment analysis and keyword extraction on Craigslist listings

craiglist-helper synopsis Perform sentiment analysis and keyword extraction on Craigslist listings Background I love Craigslist. I've found most of my

Mark Musil 1 Nov 08, 2021
OCR을 이용하여 인원수를 인식 후 줌을 Kill 해줍니다

How To Use killtheZoom-2.0 Windows 0. https://joyhong.tistory.com/79 이 글을 보면서 tesseract를 C:\Program Files\Tesseract-OCR 경로로 설치해주세요(한국어 언어 추가 필요) 상단의 초

김정인 9 Sep 13, 2021
PORORO: Platform Of neuRal mOdels for natuRal language prOcessing

PORORO: Platform Of neuRal mOdels for natuRal language prOcessing pororo performs Natural Language Processing and Speech-related tasks. It is easy to

Kakao Brain 1.2k Dec 21, 2022
Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Data Augmentation using Pre-trained Transformer Models Code associated with the Data Augmentation using Pre-trained Transformer Models paper Code cont

44 Dec 31, 2022
The code from the whylogs workshop in DataTalks.Club on 29 March 2022

whylogs Workshop The code from the whylogs workshop in DataTalks.Club on 29 March 2022 whylogs - The open source standard for data logging (Don't forg

DataTalksClub 12 Sep 05, 2022
Dope Wars game engine on StarkNet L2 roll-up

RYO Dope Wars game engine on StarkNet L2 roll-up. What TI-83 drug wars built as smart contract system. Background mechanism design notion here. Initia

104 Dec 04, 2022
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training Code and model from our AAAI 2021 paper

Amazon Web Services - Labs 83 Jan 09, 2023
An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations

FantasyBert English | 中文 Introduction An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations. You can imp

Fan 137 Oct 26, 2022
Pytorch NLP library based on FastAI

Quick NLP Quick NLP is a deep learning nlp library inspired by the fast.ai library It follows the same api as fastai and extends it allowing for quick

Agis pof 283 Nov 21, 2022
This code extends the neural style transfer image processing technique to video by generating smooth transitions between several reference style images

Neural Style Transfer Transition Video Processing By Brycen Westgarth and Tristan Jogminas Description This code extends the neural style transfer ima

Brycen Westgarth 110 Jan 07, 2023
Exploration of BERT-based models on twitter sentiment classifications

twitter-sentiment-analysis Explore the relationship between twitter sentiment of Tesla and its stock price/return. Explore the effect of different BER

Sammy Cui 2 Oct 02, 2022
NLP, Machine learning

Netflix-recommendation-system NLP, Machine learning About Recommendation algorithms are at the core of the Netflix product. It provides their members

Harshith VH 6 Jan 12, 2022
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
skweak: A software toolkit for weak supervision applied to NLP tasks

Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels wi

Norsk Regnesentral (Norwegian Computing Center) 850 Dec 28, 2022
[EMNLP 2021] LM-Critic: Language Models for Unsupervised Grammatical Error Correction

LM-Critic: Language Models for Unsupervised Grammatical Error Correction This repo provides the source code & data of our paper: LM-Critic: Language M

Michihiro Yasunaga 98 Nov 24, 2022