Implementation and replication of ProGen, Language Modeling for Protein Generation, in Jax

Overview

ProGen - (wip)

Implementation and replication of ProGen, Language Modeling for Protein Generation, in PyTorch and JAX (the weights will be made easily transferable between the two)

Install

$ pip install progen-transformer

Usage

from jax import random
from haiku import PRNGSequence
from progen_transformer import ProGen

model = ProGen(
    num_tokens = 256,
    dim = 512,
    seq_len = 1024,
    window_size = 256,       # local attention window size
    depth = 12,              # number of layers
    heads = 8,               # attention heads
    dim_head = 64,           # dimension per head
    ff_glu = True,           # use GLU in feedforward, from Noam Shazeer's "GLU Variants Improve Transformer"
    global_mlp_depth = 2     # last N global gMLP layers
)

rng = PRNGSequence(42)
seq = random.randint(next(rng), (1024,), 0, 256)

params = model.init(next(rng), seq)
logits = model.apply(params, next(rng), seq) # (1024, 256)
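The model consumes integer token ids in the range [0, num_tokens). How protein strings are mapped to ids is not shown in this section, so the sketch below assumes a byte-level scheme in which each character maps to its code point plus one and id 0 is reserved for padding. The helpers `encode_sequence` and `decode_tokens` are hypothetical names for illustration, not part of the progen_transformer API.

```python
PAD_ID = 0          # assumption: id 0 is reserved for padding
NUM_TOKENS = 256    # matches num_tokens in the model config above

def encode_sequence(seq, seq_len):
    """Map a string to a fixed-length list of token ids, right-padded with PAD_ID."""
    ids = [ord(ch) + 1 for ch in seq]          # shift by one so 0 stays free for padding
    if any(i >= NUM_TOKENS for i in ids):
        raise ValueError("character outside the assumed byte-level vocabulary")
    ids = ids[:seq_len]                         # truncate overly long sequences
    return ids + [PAD_ID] * (seq_len - len(ids))

def decode_tokens(ids):
    """Invert encode_sequence, dropping padding positions."""
    return "".join(chr(i - 1) for i in ids if i != PAD_ID)

tokens = encode_sequence("MKTAYIAKQR", seq_len=16)
```

A list such as `tokens` could then be wrapped with `jnp.array(tokens)` before being passed to `model.apply`, provided its length matches `seq_len`.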

Training from Uniref

Download UniRef50 from UniProt and place uniref50.fasta in the root directory of the repository, then generate the training data

$ python gen_train_data.py
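For orientation, a FASTA file stores each record as a header line starting with `>` followed by one or more sequence lines. A minimal reader might look like the sketch below. `iter_fasta` is a hypothetical helper, the example headers are made up, and nothing here is claimed to match what gen_train_data.py does internally.

```python
import io

def iter_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # start a new record
        else:
            chunks.append(line)            # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)      # flush the final record

# Tiny in-memory example; the header text is invented for illustration.
example = io.StringIO(
    ">UniRef50_EXAMPLE1 demo protein\n"
    "MKTAYIAK\n"
    "QRQISFVK\n"
    ">UniRef50_EXAMPLE2 another\n"
    "MSLLT\n"
)
records = list(iter_fasta(example))
```

Streaming records one at a time keeps memory use flat, which matters for a file the size of uniref50.fasta.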

If gen_train_data.py succeeds, you should see a lot of green output. Then start training

$ python train.py

By default, the script checkpoints and resumes automatically. To clear your progress and restart, add the --new flag

$ python train.py --new

Model checkpoints will be saved periodically to ./ckpts
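The resume behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the logic of train.py: the file naming scheme `ckpt_<zero-padded step>.pkl` and the helpers `latest_checkpoint` and `choose_start` are hypothetical, and the sketch only skips loading when `--new` is given rather than deleting anything.

```python
import argparse
import tempfile
from pathlib import Path

def latest_checkpoint(ckpt_dir):
    """Return the highest-numbered checkpoint in ckpt_dir, or None if there is none."""
    # Zero-padded step numbers make lexicographic order match numeric order.
    paths = sorted(Path(ckpt_dir).glob("ckpt_*.pkl"))
    return paths[-1] if paths else None

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--new", action="store_true", help="ignore existing checkpoints")
    return parser.parse_args(argv)

def choose_start(ckpt_dir, new):
    """Resume from the latest checkpoint unless a fresh run was requested."""
    return None if new else latest_checkpoint(ckpt_dir)

# Demonstration in a temporary directory standing in for ./ckpts.
with tempfile.TemporaryDirectory() as tmp:
    for step in (100, 200, 1000):
        (Path(tmp) / f"ckpt_{step:08d}.pkl").touch()
    resumed = choose_start(tmp, parse_args([]).new)
    fresh = choose_start(tmp, parse_args(["--new"]).new)
    resumed_name = resumed.name
```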

Todo

  • train tfrecords from google cloud storage path
  • generate validation tfrecords
  • add pandas integration with GO annotations
  • resume from correct place in tfrecord even if batch size is changed between runs, display number of sequences processed (aiming for 1 billion)
  • model parallelism with pjit
  • bfloat16 on xla
  • checkpoint and resume from a google cloud storage path
  • config to annotation to template string with jinja2 - use jinja2 for wandb html logging as well
  • manage experiment tracker state, and also allow ability to turn it off by piping to noop
  • add a confirmation before clearing a folder for --new run
  • engineer mask in cross entropy loss so that padding can be reused as end-of-string token
  • flip seq # annotation order with prob set in config
  • keep N last checkpoints
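
One item in the list above, reusing padding as the end-of-string token in the cross entropy mask, can be illustrated in plain Python. This is one possible reading of that item and not the repository's implementation: it assumes right-padded targets with pad id 0, and `eos_aware_mask` and `masked_cross_entropy` are hypothetical helpers.

```python
import math

PAD_ID = 0  # assumption: padding id, doubling as end-of-string

def eos_aware_mask(targets):
    """Return a 0/1 mask keeping every non-pad target plus the first pad.

    The first padding position acts as the end-of-string target; all
    later padding positions are excluded from the loss.
    """
    mask, seen_pad = [], False
    for t in targets:
        if t != PAD_ID:
            mask.append(1)
        elif not seen_pad:
            mask.append(1)      # first pad counts as end-of-string
            seen_pad = True
        else:
            mask.append(0)      # remaining padding is ignored
    return mask

def masked_cross_entropy(logits, targets, mask):
    """Mean negative log-likelihood over positions where mask is 1."""
    total, count = 0.0, 0
    for row, t, m in zip(logits, targets, mask):
        if not m:
            continue
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[t]
        count += 1
    return total / max(count, 1)

targets = [5, 7, 3, PAD_ID, PAD_ID, PAD_ID]
mask = eos_aware_mask(targets)
uniform_logits = [[0.0] * 8 for _ in targets]   # 8-token toy vocabulary
loss = masked_cross_entropy(uniform_logits, targets, mask)
```

With uniform logits the loss equals the log of the vocabulary size, which is a quick sanity check that only the four unmasked positions contribute. In JAX the same mask would normally be built with vectorized operations; plain lists are used here only for readability.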

Citations

@misc{madani2020progen,
    title   = {ProGen: Language Modeling for Protein Generation}, 
    author  = {Ali Madani and Bryan McCann and Nikhil Naik and Nitish Shirish Keskar and Namrata Anand and Raphael R. Eguchi and Po-Ssu Huang and Richard Socher},
    year    = {2020},
    eprint  = {2004.03497},
    archivePrefix = {arXiv},
    primaryClass = {q-bio.BM}
}
@misc{su2021roformer,
    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author  = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},
    year    = {2021},
    eprint  = {2104.09864},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
@misc{shazeer2020glu,
    title   = {GLU Variants Improve Transformer},
    author  = {Noam Shazeer},
    year    = {2020},
    url     = {https://arxiv.org/abs/2002.05202}
}
Comments
  • protein bert uniref90 dataset

    (discussed in discord)

    After running the first step (create_uniref_db) of https://github.com/nadavbra/protein_bert, I got a 24 GB file, "uniref_proteins_and_annotations.db". It could be useful for generating sequences for this project, so I am sharing the links here:

    • https://gitlab.com/rom1504/uniref data
    • colab to get the db and do a few queries: https://colab.research.google.com/drive/1BGYEBDmD0yToLNou2T-t-QbJV5wCtIBz#scrollTo=21U3PpCp-pxr

    There are 135301051 records in the db, in a table that looks like this:

    CREATE TABLE "protein_annotations" (
        "index"    INTEGER,
        "tax_id"    REAL,
        "uniprot_name"    TEXT,
        "go_annotations"    TEXT,
        "flat_go_annotations"    TEXT,
        "n_go_annotations"    INTEGER,
        "complete_go_annotation_indices"    TEXT,
        "n_complete_go_annotations"    INTEGER
    );
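
Given this schema, a query for annotated proteins might look like the sketch below. It uses an in-memory database as a stand-in for uniref_proteins_and_annotations.db, loads two rows copied from the sample in this comment (tax_id as displayed there), and assumes the text columns hold JSON-encoded values, which the sample suggests but does not confirm.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real 24 GB database file
conn.execute("""
    CREATE TABLE "protein_annotations" (
        "index" INTEGER,
        "tax_id" REAL,
        "uniprot_name" TEXT,
        "go_annotations" TEXT,
        "flat_go_annotations" TEXT,
        "n_go_annotations" INTEGER,
        "complete_go_annotation_indices" TEXT,
        "n_complete_go_annotations" INTEGER
    )
""")

empty = {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []}
rows = [
    (0, 1572040.0, "A0A5A9P0L4_9TELE",
     json.dumps({**empty, "GO Molecular Function": ["GO:0003755", "GO:0005524", "GO:0004672", "GO:0005509"]}),
     json.dumps(["GO:0003755", "GO:0004672", "GO:0005509", "GO:0005524"]),
     4, "[2761, 3561, 4193, 4205]", 4),
    (1, 648755.0, "UPI0016133188", json.dumps(empty), "[]", 0, "[]", 0),
]
conn.executemany("INSERT INTO protein_annotations VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Keep only proteins with at least one GO annotation and decode the JSON list.
annotated = {
    name: json.loads(flat)
    for name, flat in conn.execute(
        "SELECT uniprot_name, flat_go_annotations FROM protein_annotations "
        "WHERE n_go_annotations > 0"
    )
}
```

Against the real file, only the `sqlite3.connect` argument would change; the table creation and insert steps exist here purely to make the example self-contained.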
    

    Sample rows look like this:

    |   | index | tax_id | uniprot_name | go_annotations | flat_go_annotations | n_go_annotations | complete_go_annotation_indices | n_complete_go_annotations |
    |---:|---:|---:|:---|:---|:---|---:|:---|---:|
    | 0 | 0 | 1.57204e+06 | A0A5A9P0L4_9TELE | {"GO Molecular Function": ["GO:0003755", "GO:0005524", "GO:0004672", "GO:0005509"], "GO Biological Process": [], "GO Cellular Component": []} | ["GO:0003755", "GO:0004672", "GO:0005509", "GO:0005524"] | 4 | [2761, 3561, 4193, 4205] | 4 |
    | 1 | 1 | 648755 | UPI0016133188 | {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} | [] | 0 | [] | 0 |
    | 2 | 2 | 1.93059e+06 | A0A410P257_9BACT | {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} | [] | 0 | [] | 0 |
    | 3 | 3 | 519421 | UPI0019403D63 | {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} | [] | 0 | [] | 0 |
    | 4 | 4 | 72004 | A0A6B0RPA5_9CETA | {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": []} | ["GO:0004672", "GO:0005524"] | 2 | [3561, 4205] | 2 |
    | 5 | 5 | 375764 | A0A672ZWI7_9TELE | {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} | [] | 0 | [] | 0 |
    | 6 | 6 | 1.41558e+06 | A0A6P7YNV3_9AMPH | {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": ["GO:0005886"]} | ["GO:0004672", "GO:0005524", "GO:0005886"] | 3 | [3561, 4205, 4526] | 3 |
    | 7 | 7 | 240159 | A0A4U5TZD8_COLLU | {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": ["GO:0016021", "GO:0005886"]} | ["GO:0004672", "GO:0005524", "GO:0005886", "GO:0016021"] | 4 | [3561, 4205, 4526, 10019] | 4 |
    | 8 | 8 | 146911 | UPI00074FFD9C | {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} | [] | 0 | [] | 0 |
    | 9 | 9 | 260995 | A0A6P8RG40_GEOSA | {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": ["GO:0005886"]} | ["GO:0004672", "GO:0005524", "GO:0005886"] | 3 | [3561, 4205, 4526] | 3 |

    Opened by rom1504
Releases(0.0.36)
Owner
Phil Wang