Protein Language Model

Last update: Dec 27, 2022

ProteinLM

We pretrain a protein language model based on the Megatron-LM framework and evaluate the pretrained model on TAPE (Tasks Assessing Protein Embeddings), a benchmark of five biologically relevant semi-supervised learning tasks. The pretrained model matches or outperforms the TAPE baseline on all five tasks (see Downstream Tasks Performance).

Overview

The introduction of pre-trained models such as BERT has greatly advanced natural language processing and improved the performance of language models. Inspired by the similarity between amino acid sequences and text sequences, we apply language model pre-training to biological data.
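
To make the analogy between residues and words concrete, here is a minimal sketch of BERT-style masking applied to an amino acid sequence. The vocabulary, the special-token names, and the `tokenize` and `mask_tokens` helpers are hypothetical and are not taken from the ProteinLM code. The masking is also simplified: every selected position becomes `[MASK]`, whereas BERT additionally keeps or randomizes some of the selected tokens.

```python
import random

# The 20 standard amino acids, one letter each. The special tokens and the
# id layout are illustrative assumptions, not the ProteinLM vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}


def tokenize(sequence):
    """Map each residue to a token id, wrapped in [CLS] ... [SEP]."""
    unk = VOCAB["[UNK]"]
    ids = [VOCAB.get(residue, unk) for residue in sequence.upper()]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Replace a random subset of residue tokens with [MASK].

    Returns (masked_ids, labels). labels holds the original id at masked
    positions and -1 elsewhere, so a loss can skip unmasked positions.
    """
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [-1] * len(token_ids)
    for pos in range(1, len(token_ids) - 1):  # never mask [CLS] or [SEP]
        if rng.random() < mask_prob:
            labels[pos] = token_ids[pos]
            masked[pos] = VOCAB["[MASK]"]
    return masked, labels


if __name__ == "__main__":
    ids = tokenize("MKVLAAGIVG")
    masked_ids, labels = mask_tokens(ids, mask_prob=0.3)
    print(masked_ids)
    print(labels)
```

The model is then trained to predict the original residue at each masked position, in the same way a text model predicts masked words.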

Guidance

We provide the pretraining and finetuning code in two separate folders. To use the pretrained model we provide, download the checkpoint and follow the finetune guide. To pretrain your own model, refer to the pretrain guide.

Pretrain README
Finetune README

Download ProteinLM

ProteinLM (200M)

For the pretrained model with 200 million parameters, you can download the model checkpoint from Google Drive or Tsinghua Cloud.

ProteinLM (3B)

For the pretrained model with 3 billion parameters, you can download the model checkpoint from here.

Project Structure

.
├── pretrain                (protein language model pretrain)
│   ├── megatron            (model folder)
│   ├── pretrain_tools      (multi-node pretrain)
│   └── protein_tools       (data preprocessing shell scripts)
└── tape
    ├── conda_env           (conda env in yaml format)
    ├── converter           (converter script and model config files)
    ├── scripts             (model generator, finetune)
    └── tape                (tape model)

Usage

As the structure above shows, usage has two stages.

Pretrain
- Prepare dataset (PFAM)
- Preprocess data
- Pretrain
Finetune
- Convert the pretrained protein model checkpoint
- Finetune on downstream tasks

Detailed explanations are given in each folder's README.
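
As an illustration of the "Preprocess data" step, the sketch below converts FASTA records into loose JSON, one object with a `text` field per line, which is the input format the upstream Megatron-LM preprocessing script reads by default. This is an assumption about the general shape of the step: the `read_fasta` and `fasta_to_jsonl` helpers are hypothetical, and the scripts in `protein_tools` may use a different format, so treat the pretrain README as authoritative.

```python
import io
import json


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_jsonl(fasta_handle, out_handle, max_len=None):
    """Write one JSON object per sequence and return how many were written.

    Sequences longer than max_len (if given) are skipped.
    """
    written = 0
    for _, sequence in read_fasta(fasta_handle):
        if max_len is not None and len(sequence) > max_len:
            continue
        out_handle.write(json.dumps({"text": sequence}) + "\n")
        written += 1
    return written


if __name__ == "__main__":
    # Tiny in-memory example; real Pfam files would be opened from disk.
    fasta = io.StringIO(">seq1\nMKVLA\nAGIVG\n>seq2\nGGSGG\n")
    out = io.StringIO()
    count = fasta_to_jsonl(fasta, out)
    print(count, out.getvalue(), sep="\n")
```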

Downstream Tasks Performance

Task	Metric	TAPE	ProteinLM (200M)	ProteinLM (3B)
contact prediction	P@L/5	0.36	0.52	0.75
remote homology	Top 1 Accuracy	0.21	0.26	0.30
secondary structure	Accuracy (3-class)	0.73	0.75	0.79
fluorescence	Spearman's rho	0.68	0.68	0.68
stability	Spearman's rho	0.73	0.77	0.79

Contact

If you have any problem using ProteinLM, feel free to contact us.

Reference

Our work is based on the following papers. In addition, part of the code is based on Megatron-LM and TAPE.

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

@article{DBLP:journals/corr/abs-1909-08053,
  author    = {Mohammad Shoeybi and
               Mostofa Patwary and
               Raul Puri and
               Patrick LeGresley and
               Jared Casper and
               Bryan Catanzaro},
  title     = {Megatron-LM: Training Multi-Billion Parameter Language Models Using
               Model Parallelism},
  journal   = {CoRR},
  volume    = {abs/1909.08053},
  year      = {2019},
  url       = {http://arxiv.org/abs/1909.08053},
  archivePrefix = {arXiv},
  eprint    = {1909.08053},
  timestamp = {Tue, 24 Sep 2019 11:33:51 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1909-08053.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Evaluating Protein Transfer Learning with TAPE

@article{DBLP:journals/corr/abs-1906-08230,
  author    = {Roshan Rao and
               Nicholas Bhattacharya and
               Neil Thomas and
               Yan Duan and
               Xi Chen and
               John F. Canny and
               Pieter Abbeel and
               Yun S. Song},
  title     = {Evaluating Protein Transfer Learning with {TAPE}},
  journal   = {CoRR},
  volume    = {abs/1906.08230},
  year      = {2019},
  url       = {http://arxiv.org/abs/1906.08230},
  archivePrefix = {arXiv},
  eprint    = {1906.08230},
  timestamp = {Sat, 23 Jan 2021 01:20:25 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1906-08230.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Owner

THUDM
