Codebase of deep learning models for inferring stability of mRNA molecules

Overview

Kaggle OpenVaccine Models

This repository contains deep learning models for inferring the stability of mRNA molecules. It corresponds to the Kaggle Open Vaccine Challenge and the accompanying manuscript "Predictive models of RNA degradation through dual crowdsourcing", Wayment-Steele et al. (2021) (full citation when available).

Models contained here are:

"Nullrecurrent": A reconstruction of the winning solution by Jiayang Gao. Links to the original notebooks are provided below.

"DegScore-XGBoost": A model based on the original DegScore model and XGBoost.

Note on other historical names for the models

  • The Nullrecurrent model was called the "OV" model in some instances, and the .h5 model files for the Nullrecurrent model are labeled "ov".

  • The DegScore-XGBoost model was called the "BT" model in Eterna analysis.

Organization

scripts: Python scripts to perform inference.

notebooks: Python notebooks to perform inference.

model_files: Stores the .h5 model files used at inference time.

data: Data corresponding to the Kaggle challenge and to subsequent tests on mRNAs.

data/Kaggle_RYOS_data

This directory contains the training and test sets in .csv and .json form.

Kaggle_RYOS_trainset_prediction_output_Sep2021.txt contains predictions from the Nullrecurrent code in this repository.

Model MCRMSEs were evaluated by uploading submissions to the Kaggle competition website at https://www.kaggle.com/c/stanford-covid-vaccine.
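For reference, MCRMSE (mean column-wise root mean squared error) is the RMSE computed separately for each target column and then averaged across columns. The following is a minimal sketch in plain Python using made-up values; the function names are illustrative and are not part of this repository, and the scores reported for the models come only from the Kaggle site.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mcrmse(true_columns, pred_columns):
    """Average the per-column RMSE; inputs map column name -> list of values."""
    scores = [rmse(true_columns[name], pred_columns[name]) for name in true_columns]
    return sum(scores) / len(scores)


# Illustrative values only, not real measurements or predictions.
truth = {"deg_Mg_pH10": [0.0, 1.0], "deg_Mg_50C": [1.0, 1.0]}
preds = {"deg_Mg_pH10": [0.0, 3.0], "deg_Mg_50C": [1.0, 1.0]}
print(mcrmse(truth, preds))
```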

data/mRNA_233x_data

This directory contains the original data and the scripts needed to reproduce the model analysis from the manuscript.

Because the original prediction formats all differ slightly, the reformat_*.py scripts read in the original formats and rewrite each prediction in two forms, "FULL" and "PCR", in the directory formatted_predictions.

"FULL" contains per-nucleotide predictions for all nucleotides. "PCR" contains the same predictions with the regions outside the RT-PCR sequencing set to NaN.
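The relationship between the two forms can be sketched as follows. This is an illustration only: the function name and the region boundaries are hypothetical, and the reformat_*.py scripts may implement the masking differently.

```python
import math


def mask_outside_region(predictions, start, end):
    """Return a copy of predictions with values outside [start, end) set to NaN."""
    return [
        value if start <= index < end else math.nan
        for index, value in enumerate(predictions)
    ]


# Illustrative per-nucleotide values; start and end are hypothetical boundaries.
full = [0.5, 0.7, 0.9, 0.4, 0.2]
pcr = mask_outside_region(full, start=1, end=4)
print(pcr)
```

The "PCR" list keeps the same length as the "FULL" list, so positions still line up with the nucleotide sequence.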

Running python collate_predictions.py reads in all the data and outputs all_predictions_233x.csv.

RegenerateFigure5.ipynb reproduces the final scatterplot comparisons.

posthoc_code_predictions contains predictions from the Nullrecurrent model code contained in this repository. To generate these predictions, use the sequence file in the mRNA_233x_data folder and run the following command once per degradation condition:

python scripts/nullrecurrent_inference.py -d deg_Mg_pH10 -i 233_sequences.txt -o 233x_nullrecurrent_output_Oct2021_deg_Mg_pH10.txt

and likewise for the other degradation conditions.
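Repeating the command for each condition can be scripted. The sketch below only builds and prints the command lines; the helper name and the output file name template are assumptions made for illustration, while the script path, the flags, and the four condition names are those given in this README.

```python
import shlex

# The four degradation conditions accepted by the -d flag.
DEG_TYPES = ["deg_Mg_pH10", "deg_pH10", "deg_Mg_50C", "deg_50C"]


def build_commands(input_file, output_template):
    """Build one nullrecurrent inference command per degradation condition."""
    commands = []
    for deg in DEG_TYPES:
        commands.append([
            "python", "scripts/nullrecurrent_inference.py",
            "-d", deg,
            "-i", input_file,
            "-o", output_template.format(deg=deg),
        ])
    return commands


# The output template here is hypothetical; adjust it to the naming you need.
for cmd in build_commands("233_sequences.txt", "233x_nullrecurrent_output_{deg}.txt"):
    print(shlex.join(cmd))
```

To execute the commands instead of printing them, each list can be passed to subprocess.run(cmd, check=True) from the repository root.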

Dependencies

Install via pip install -r requirements.txt or conda install --file requirements.txt.

EternaFold, ViennaRNA, and Arnie are not pip-installable; see Setup below.

Setup

  1. Install git-lfs (best to do before git-cloning this KaggleOpenVaccine repo).

  2. Install EternaFold (the nullrecurrent model uses this), available for free noncommercial use here.

  3. Install ViennaRNA (the DegScore-XGBoost model uses this), available here.

  4. Git clone Arnie, which wraps EternaFold in python and allows RNA thermodynamic calculations across many packages. Follow instructions here to link EternaFold to it.

  5. Set the path to this repository as the environment variable KOV_PATH (so that the scripts can find the stored model files):

export KOV_PATH='/path/to/KaggleOpenVaccine'
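As a rough illustration of how a script might use this variable, the sketch below resolves the model_files directory from KOV_PATH. The function name is hypothetical and the inference scripts in this repository may locate the model files differently.

```python
import os
from pathlib import Path


def model_dir(env=None):
    """Resolve the model_files directory from the KOV_PATH environment variable."""
    env = os.environ if env is None else env
    root = env.get("KOV_PATH")
    if not root:
        raise RuntimeError("KOV_PATH is not set; export it before running the scripts")
    return Path(root) / "model_files"


# Passing a dict stands in for the real environment in this example.
print(model_dir({"KOV_PATH": "/path/to/KaggleOpenVaccine"}))
```

Failing early with a clear message when KOV_PATH is missing is usually easier to debug than a later file-not-found error on a .h5 file.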

Usage

To run the nullrecurrent winning solution on one construct, given in example.txt:

CGC

Run

python scripts/nullrecurrent_inference.py [-d deg] -i example.txt -o predict.txt

where deg is one of the following options:

deg_Mg_pH10
deg_pH10
deg_Mg_50C
deg_50C

Similarly, for the DegScore-XGBoost model:

python scripts/degscore-xgboost_inference.py -i example.txt -o predict.txt

This writes the output predictions to the text file predict.txt, one value per nucleotide:

(Nullrecurrent output)

2.1289976365, 2.650808962, 2.1869660805000004

(DegScore-XGBoost output)

0.2697107, 0.37091506, 0.48528114
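The output can be read back into Python for further analysis. The following is a minimal sketch that assumes the file holds comma-separated values as in the example above; the function name is illustrative, and files covering several constructs may need per-line handling.

```python
def parse_predictions(text):
    """Parse comma-separated per-nucleotide predictions into a list of floats."""
    return [float(token) for token in text.split(",") if token.strip()]


# The Nullrecurrent example output shown above for the construct CGC.
example = "2.1289976365, 2.650808962, 2.1869660805000004"
values = parse_predictions(example)
print(len(values), values)
```

In practice the text would come from the output file, for example via pathlib.Path("predict.txt").read_text().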

A note on energy model versions

The predictions in the Kaggle competition and for the manuscript were performed with EternaFold parameters and CONTRAfold-SE code. The currently available EternaFold code will result in slightly different values. For more on the difference, see the EternaFold README.

Individual Kaggle Solutions

This code is based on the winning solution for the Open Vaccine Kaggle Competition Challenge. The competition can be found here:

https://www.kaggle.com/c/stanford-covid-vaccine/overview

This code is also the supplementary material for the Kaggle Competition Solution Paper. The individual Kaggle writeups for the top solutions that have been featured in that paper can be found in the following table:

| Team Name | Team Members | Rank | Link to the solution |
|---|---|---|---|
| Nullrecurrent | Jiayang Gao | 1 | https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189620 |
| Kazuki ** 2 | Kazuki Onodera, Kazuki Fujikawa | 2 | https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189709 |
| Striderl | Hanfei Mao | 3 | https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189574 |
| FromTheWheel & Dyed & StoneShop | Gilles Vandewiele, Michele Tinti, Bram Steenwinckel | 4 | https://www.kaggle.com/group16/covid-19-mrna-4th-place-solution |
| tito | Takuya Ito | 5 | https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189691 |
| nyanp | Taiga Noumi | 6 | https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189241 |
| One architecture | Shujun He | 7 | https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189564 |
| ishikei | Keiichiro Ishi | 8 | https://www.kaggle.com/c/stanford-covid-vaccine/discussion/190314 |
| Keep going to be GM | Youhan Lee | 9 | https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189845 |
| Social Distancing Please | Fatih Öztürk, Anthony Chiu, Emin Ozturk | 11 | https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189571 |
| The Machine | Karim Amer, Mohamed Fares | 13 | https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189585 |