AWS RoseTTAFold

Infrastructure template and Jupyter notebooks for running RoseTTAFold on AWS Batch.

Overview

Proteins are large biomolecules that play an important role in the body. Knowing the physical structure of proteins is key to understanding their function. However, it can be difficult and expensive to determine the structure of many proteins experimentally. One alternative is to predict these structures using machine learning algorithms. Several high-profile research teams have released such algorithms, including AlphaFold 2 (from DeepMind) and RoseTTAFold (from the Baker lab at the University of Washington). Their work was important enough for Science magazine to name it the "2021 Breakthrough of the Year".

Both AlphaFold 2 and RoseTTAFold use a multi-track transformer architecture trained on known protein templates to predict the structure of unknown peptide sequences. These predictions are heavily GPU-dependent and take anywhere from minutes to days to complete. The input features for these predictions include multiple sequence alignment (MSA) data. MSA algorithms are CPU-dependent and can themselves require several hours of processing time.

Running both the MSA and structure prediction steps in the same computing environment can be cost-inefficient, because the expensive GPU resources required for the prediction sit unused while the MSA step runs. Instead, using a high-performance computing (HPC) service like AWS Batch allows us to run each step as a containerized job with the best fit of CPU, memory, and GPU resources.

This project demonstrates how to provision and use AWS services for running the RoseTTAFold protein folding algorithm on AWS Batch.

Setup

  1. Log into the AWS Console.

  2. Click on Launch Stack:

    Launch Stack

  3. For Stack Name, enter a unique name.

  4. Select an availability zone from the dropdown menu.

  5. Acknowledge that AWS CloudFormation might create IAM resources and then click Create Stack.

  6. CloudFormation takes about 10 minutes to create the stack, and CodeBuild takes about another 15 minutes to build and publish the container (roughly 25 minutes in total). Wait for both of these tasks to finish before you submit any analysis jobs.

  7. Download and extract the RoseTTAFold network weights (under Rosetta-DL Software license), and sequence and structure databases to the newly-created FSx for Lustre file system. There are two ways to do this:

Option 1

In the AWS Console, navigate to EC2 > Launch Templates, select the template beginning with "aws-rosettafold-launch-template-", and then choose Actions > Launch instance from template. Select the Amazon Linux 2 AMI and launch the instance into the public subnet with a public IP. SSH into the instance, then download and extract your network weights and reference data of interest to the attached volume at /fsx/aws-rosettafold-ref-data (i.e. Installation steps 3 and 5 from the RoseTTAFold public repository).

Option 2

Create a new S3 bucket in your region of interest. Spin up an EC2 instance in a public subnet in the same region and use this to download and extract the network weights and reference data. Once this is complete, copy the extracted data to S3. In the AWS Console, navigate to FSx > File Systems and select the FSx for Lustre file system created above. Link this file system to your new S3 bucket using these instructions. Specify /aws-rosettafold-ref-data as the file system path when creating the data repository association. This is a good option if you want to create multiple stacks without downloading and extracting the reference data multiple times. Note that the first job you submit using this data repository will cause the FSx file system to transfer and compress 3 TB of reference data from S3. This process may require as many as six hours to complete. Alternatively, you can preload files into the file system by following these instructions.
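The association step in Option 2 can also be scripted. The following is a minimal sketch, assuming boto3 is installed and configured with credentials; the file system ID and bucket name are placeholders, and the helper names `build_association_request` and `link_bucket` are hypothetical rather than part of this project. It uses the FSx `create_data_repository_association` call and the file system path named above.

```python
def build_association_request(file_system_id, bucket, prefix=""):
    """Build keyword arguments for FSx create_data_repository_association.

    The file system path is the one given in Option 2; the bucket and
    prefix are whatever you chose when copying the extracted data to S3.
    """
    repo_path = f"s3://{bucket}/{prefix}".rstrip("/")
    return {
        "FileSystemId": file_system_id,
        "FileSystemPath": "/aws-rosettafold-ref-data",
        "DataRepositoryPath": repo_path,
        # Import file metadata from S3 as soon as the association exists.
        "BatchImportMetaDataOnCreate": True,
    }


def link_bucket(file_system_id, bucket, prefix=""):
    """Create the data repository association (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above works offline

    fsx = boto3.client("fsx")
    return fsx.create_data_repository_association(
        **build_association_request(file_system_id, bucket, prefix)
    )
```

For example, `link_bucket("fs-0123456789abcdef0", "my-ref-data-bucket")` would link a placeholder file system to a placeholder bucket. Whether the file system created by the stack supports data repository associations depends on its deployment type, so check the FSx documentation before relying on this.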

Once this is complete, your FSx for Lustre file system should look like this (file sizes are uncompressed):

/fsx
└── /aws-rosettafold-ref-data
    ├── /bfd
    │   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata (1.4 TB)
    │   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex (1.7 GB)
    │   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata (15.7 GB)
    │   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex (1.6 GB)
    │   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata (304.4 GB)
    │   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex (123.6 MB)
    ├── /pdb100_2021Mar03
    │   ├── LICENSE (20.4 KB)
    │   ├── pdb100_2021Mar03_a3m.ffdata (633.9 GB)
    │   ├── pdb100_2021Mar03_a3m.ffindex (3.9 MB)
    │   ├── pdb100_2021Mar03_cs219.ffdata (41.8 MB)
    │   ├── pdb100_2021Mar03_cs219.ffindex (2.8 MB)
    │   ├── pdb100_2021Mar03_hhm.ffdata (6.8 GB)
    │   ├── pdb100_2021Mar03_hhm.ffindex (3.4 GB)
    │   ├── pdb100_2021Mar03_pdb.ffdata (26.2 GB)
    │   └── pdb100_2021Mar03_pdb.ffindex (3.7 MB)
    ├── /UniRef30_2020_06
    │   ├── UniRef30_2020_06_a3m.ffdata (139.6 GB)
    │   ├── UniRef30_2020_06_a3m.ffindex (671.0 MB)
    │   ├── UniRef30_2020_06_cs219.ffdata (6.0 GB)
    │   ├── UniRef30_2020_06_cs219.ffindex (605.0 MB)
    │   ├── UniRef30_2020_06_hhm.ffdata (34.1 GB)
    │   ├── UniRef30_2020_06_hhm.ffindex (19.4 MB)
    │   └── UniRef30_2020_06.md5sums (379.0 B)
    └── /weights
        ├── RF2t.pt (126 MB)
        ├── Rosetta-DL_LICENSE.txt (3.1 KB)
        ├── RoseTTAFold_e2e.pt (533 MB)
        └── RoseTTAFold_pyrosetta.pt (506 MB)
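
Before submitting jobs, it can help to confirm that the layout above is in place. The sketch below is one way to do that from an instance with the file system mounted at /fsx. It checks only a subset of the files listed in the tree, and the function name `missing_reference_files` is hypothetical, not part of this project.

```python
from pathlib import Path

# A subset of the files shown in the tree above; extend the list as needed.
EXPECTED_FILES = [
    "bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata",
    "pdb100_2021Mar03/pdb100_2021Mar03_a3m.ffdata",
    "UniRef30_2020_06/UniRef30_2020_06_a3m.ffdata",
    "weights/RoseTTAFold_e2e.pt",
    "weights/RoseTTAFold_pyrosetta.pt",
]


def missing_reference_files(root="/fsx/aws-rosettafold-ref-data",
                            expected=EXPECTED_FILES):
    """Return the expected relative paths that are not regular files under root."""
    root_path = Path(root)
    return [rel for rel in expected if not (root_path / rel).is_file()]


if __name__ == "__main__":
    missing = missing_reference_files()
    if missing:
        print("Missing reference files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All checked reference files are present.")
```

This only checks that the files exist. It does not verify sizes or checksums.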

  8. Clone the CodeCommit repository created by CloudFormation to a Jupyter Notebook environment of your choice.
  9. Use the AWS-RoseTTAFold.ipynb and CASP14-Analysis.ipynb notebooks to submit protein sequences for analysis.

Architecture

AWS-RoseTTAFold Architecture

This project creates two computing environments in AWS Batch to run the "end-to-end" protein folding workflow in RoseTTAFold. The first of these uses the optimal mix of c4, m4, and r4 spot instance types based on the vCPU and memory requirements specified in the Batch job. The second environment uses g4dn on-demand instances to balance performance, availability, and cost.

A scientist can create structure prediction jobs using one of the two included Jupyter notebooks. AWS-RoseTTAFold.ipynb demonstrates how to submit a single analysis job and view the results. CASP14-Analysis.ipynb demonstrates how to submit multiple jobs at once using the CASP14 target list. In both of these cases, submitting a sequence for analysis creates two Batch jobs, one for data preparation (using the CPU computing environment) and a second, dependent job for structure prediction (using the GPU computing environment).
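
The two-job pattern described above can be sketched with the AWS Batch `submit_job` API, where the second job lists the first in `dependsOn`. This is an illustration of the dependency mechanism only, not the code the notebooks run. The function name, job name suffixes, and the queue and job definition arguments are all hypothetical, and a real submission would also pass container overrides for the input sequence.

```python
def submit_two_step(batch, job_name, cpu_queue, gpu_queue,
                    cpu_job_def, gpu_job_def):
    """Submit a data preparation job, then a prediction job that waits for it.

    `batch` is any object with a Batch-style submit_job method, for example
    boto3.client("batch"). Returns the two job IDs.
    """
    # Step 1: CPU-bound data preparation (MSA and template search).
    prep = batch.submit_job(
        jobName=f"{job_name}-data-prep",
        jobQueue=cpu_queue,
        jobDefinition=cpu_job_def,
    )
    # Step 2: GPU-bound structure prediction. Batch holds this job until
    # the data preparation job has completed successfully.
    predict = batch.submit_job(
        jobName=f"{job_name}-predict",
        jobQueue=gpu_queue,
        jobDefinition=gpu_job_def,
        dependsOn=[{"jobId": prep["jobId"]}],
    )
    return prep["jobId"], predict["jobId"]
```

With boto3 installed and credentials configured, a call might look like `submit_two_step(boto3.client("batch"), "target-1", cpu_queue, gpu_queue, cpu_def, gpu_def)`, using the queue and job definition names created by your stack. Because the GPU job is queued only behind its own data preparation job, GPU instances are not held while the MSA step runs.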

Both the data preparation and structure prediction use the same Docker image for execution. This image, based on the public Nvidia CUDA image for Ubuntu 20, includes the v1.1 release of the public RoseTTAFold repository, as well as additional scripts for integrating with AWS services. CodeBuild will automatically download this container definition and build the required image during stack creation. However, end users can make changes to this image by pushing to the CodeCommit repository included in the stack. For example, users could replace the included MSA algorithm (hhblits) with an alternative like MMseqs2 or replace the RoseTTAFold network with an alternative like AlphaFold 2 or Uni-Fold.

Costs

This workload costs approximately $217 per month to maintain, plus another $2.56 per job.
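
As a rough planning aid, the two figures above combine into a simple linear estimate. The sketch below assumes the per-job cost is the same for every job, which will not hold exactly because run time varies with sequence length. The function name is hypothetical.

```python
MONTHLY_BASE_USD = 217.00  # approximate standing cost quoted above
PER_JOB_USD = 2.56         # approximate cost per job quoted above


def estimate_monthly_cost(jobs_per_month):
    """Estimate the monthly cost in USD for a given number of jobs."""
    if jobs_per_month < 0:
        raise ValueError("jobs_per_month must be non-negative")
    return round(MONTHLY_BASE_USD + PER_JOB_USD * jobs_per_month, 2)
```

For example, 100 jobs in a month would come to roughly $217 + 100 × $2.56 = $473.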

Deployment

AWS-RoseTTAFold Deployment

Running the CloudFormation template at config/cfn.yaml creates the following resources in the specified availability zone:

  1. A new VPC with a private subnet, public subnet, NAT gateway, internet gateway, elastic IP, route tables, and S3 gateway endpoint.
  2. An FSx for Lustre file system with 1.2 TiB of storage and 120 MB/s throughput capacity. This file system can be linked to an S3 bucket for loading the required reference data when the first job executes.
  3. An EC2 launch template for mounting the FSx file system to Batch compute instances.
  4. Two sets of AWS Batch compute environments, job queues, and job definitions: one for the CPU-dependent data preparation job and a second for the GPU-dependent prediction job.
  5. CodeCommit, CodeBuild, CodePipeline, and ECR resources for building and publishing the Batch container image. When CloudFormation creates the CodeCommit repository, it populates it with a zipped version of this repository stored in a public S3 bucket. CodeBuild uses this repository as its source and adds additional code from release 1.1 of the public RoseTTAFold repository. CodeBuild then publishes the resulting container image to ECR, where Batch jobs can use it as needed.
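
The template can also be deployed programmatically instead of through the console. The following is a minimal sketch using the boto3 CloudFormation client. The parameter key `AvailabilityZone` is an assumption, so check the Parameters section of config/cfn.yaml for the actual name. The helper names are hypothetical, and the capability list may need `CAPABILITY_NAMED_IAM` depending on how the template names its IAM resources.

```python
def build_stack_request(stack_name, template_body, availability_zone):
    """Build keyword arguments for CloudFormation create_stack."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            # Hypothetical parameter key; confirm it against config/cfn.yaml.
            {"ParameterKey": "AvailabilityZone",
             "ParameterValue": availability_zone},
        ],
        # Acknowledges that the stack might create IAM resources.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def deploy(stack_name, availability_zone, template_path="config/cfn.yaml"):
    """Create the stack and block until CloudFormation reports completion."""
    from pathlib import Path
    import boto3  # imported lazily so the builder above works offline

    cfn = boto3.client("cloudformation")
    # TemplateBody is limited to 51,200 bytes; larger templates must be
    # uploaded to S3 and passed with TemplateURL instead.
    body = Path(template_path).read_text()
    cfn.create_stack(**build_stack_request(stack_name, body, availability_zone))
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
```

The waiter covers stack creation only. The CodeBuild container build described in item 5 continues after the stack reports completion, so allow time for it before submitting jobs.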

Licensing

This library is licensed under the MIT-0 License. See the LICENSE file for more information.

The University of Washington has made the code and data in the RoseTTAFold public repository available under an MIT license. However, the model weights used for prediction are only available for internal, non-profit, non-commercial research use. For more information, please see the full license agreement and contact the University of Washington for details.

Security

See CONTRIBUTING for more information.
