Structural basis for solubility in protein expression systems

Overview

Structural basis for solubility in protein expression systems

Twitter Follow GitHub repo size

Large-scale protein production for biotechnology and biopharmaceutical applications rely on high protein solubility in expression systems. Solubility has been measured for a significant fraction of E. coli and S. cerevisiae proteomes and these datasets are routinely used to train predictors of protein solubility in different organisms. Thanks to continued advances in experimental structure-determination and modelling, many of these solubility measurements can now be paired with accurate structural models.

The challenge is mentored by Christopher Ing and Mark Fingerhuth.

Aim of the challenge

It is the objective of this project to use our provided dataset of protein structure and solubility value pairs in order to produce a solubility predictor with comparable accuracy to sequence-based predictors reported in the literature. The provided dataset to be used in this project is created by following the dataset curation procedure described in the SOLart paper, and this hackathon project has a similar aim to this manuscript.

The dataset

The process of generating the dataset is described in the SOLArt manuscript. At a high level, all experimentally tested E. coli and S. cerevisiae proteins were matched through Uniprot IDs to known crystallographic structures or high sequence similarity homology models. After balancing the fold types using CATH, a dataset containing a balanced spread of solubility values was produced. The resulting proteins for the training and testing of these models were prepared and disclosed in the supplemental material of this paper as a list of (Uniprot,PDB,Chain,Solubility) pairs. The PDB files were not included in this work so we had to re-extract them from SWISS-MODEL. Whenever a crystallographic structure was present, it was used, assuming high coverage over the Uniprot sequence. In some cases, the original PDB templates used within the original SOLArt paper had been superceded by improved templates, and we opted to take the highest resolution, highest sequence identity, models in our updated dataset. We stripped away all irrelevant chains and heteroatoms.

If issues are identified with individual structures, please refer to the Uniprot ID and manually investigate the best template. In some cases, we needed to improve structure correctness by modelling missing atoms/residues inside the Chemical Computing Group software MOE on a case-by-case basis.

The dataset can be found in the data/ subdirectory - it is already divided into training/ and test/ data. The training/ data comes with solubility_values.csv and solublity_values.yaml (same content just different format) which both contain the solubility target values for all the PDB files provided in that directory. Note that each PDB file is named after the Uniprot identifier of the respective protein and the protein column in the solubility_values.csv also contains the Uniprot identifiers.

The test/ dataset consists of three different subdirectories (protein structures derived from different organisms and with different approaches) and you should NOT use them for any training. Only the yeast_crystal_structs/ directory contains solubility_values.csv and solublity_values.yaml (same content just different format) files which you can use for some local testing & validation. In order to find out your performance on the entire test dataset you need to use the automated benchmarking system (see below).

Example output

Your code should output a file called predictions.csv in the following format:

protein,solubility
P69829,83
P31133,62

whereby the protein column contains the Uniprot ID (corresponds to the filename of the PDB files) and the solubility column contains the predicted solubility value (can be int or float).

Note, that there are three (!) test subsets but you are expected to submit all the predictions in one file (not three) for the benchmarking system to work.

Automated benchmarking system

The continuous integration script in .github/workflows/ci.yml will automatically build the Dockerfile on every commit to the main branch. This docker image will be published as your hackathon submission to https://biolib.com//. For this to work, make sure you set the BIOLIB_TOKEN and BIOLIB_PROJECT_URI accordingly as repository secrets.

To read more about the benchmarking system click here.

Say thanks

Give this repo a star: GitHub Repo stars

Star the ProteinQure org on Github: GitHub Org's stars

Owner
ProteinQure
ProteinQure
Macros in Python: quasiquotes, case classes, LINQ and more!

MacroPy3 1.1.0b2 MacroPy is an implementation of Syntactic Macros in the Python Programming Language. MacroPy provides a mechanism for user-defined fu

Li Haoyi 3.2k Jan 06, 2023
Terrible sudoku solver with spaghetti code and performance issues

SudokuSolver Terrible sudoku solver with spaghetti code and performance issues - if it's unable to figure out next step it will stop working, it never

Kamil Bizoń 1 Dec 05, 2021
A few of my adventures with Devito.

Devito-playbox A few of my adventures with Devito. This repository contains a few notebooks and scripts that will lead me in the road of learning this

Átila Saraiva Quintela Soares 1 Feb 08, 2022
A simple but flexible plugin system for Python.

PluginBase PluginBase is a module for Python that enables the development of flexible plugin systems in Python. Step 1: from pluginbase import PluginB

Armin Ronacher 1k Dec 16, 2022
El_Binario - A converter for Binary, Decimal, Hexadecimal and Octal numbers

El_Binario El_Binario es un conversor de números Binarios, Decimales, Hexadecima

2 Jan 28, 2022
Project Guide for ASAM OpenX standards

ASAM Project Guide Important This guide is a work in progress and subject to change! Hosted version available at: ASAM Project Guide (Link) Includes:

ASAM e.V. 2 Mar 17, 2022
A Python utility belt containing simple tools, a stdlib like feel, and extra batteries. Hashing, Caching, Timing, Progress, and more made easy!

Ubelt is a small library of robust, tested, documented, and simple functions that extend the Python standard library. It has a flat API that all behav

Jon Crall 638 Dec 13, 2022
Anki cards generator for Leetcode

Leetcode Anki card generator Summary By running this script you'll be able to generate Anki cards with all the leetcode problems. I personally use it

Pavel Safronov 150 Dec 25, 2022
A simple, light-weight and highly maintainable online judge system for secondary education

y³OJ a simple, light-weight and highly maintainable online judge system for secondary education 一个简单、轻量化、易于维护的、为中学信息技术学科课业教学设计的 Online Judge 系统。 Onlin

20 Oct 04, 2022
Alerts for Western Australian Covid-19 exposure locations via email and Slack

WA Covid Mailer Sends alerts from Healthy WA's Covid19 Exposure Locations via email and slack. Setup Edit the configuration items in wacovidmailer.py

13 Mar 29, 2022
RCCで開催する『バックエンド勉強会』の資料

RCC バックエンド勉強会 開発環境 Python 3.9 Pipenv 使い方 1. インストール pipenv install 2. アプリケーションを起動 pipenv run start 本コマンドを実行するとlocalhost:8000へアクセスできるようになります。 3. テストを実行

Averak 7 Nov 14, 2021
Repositório contendo atividades no curso de desenvolvimento de sistemas no SENAI

SENAI-DES Este é um repositório contendo as atividades relacionadas ao curso de desenvolvimento de sistemas no SENAI. Se é a primeira vez em contato c

Abe Hidek 4 Dec 06, 2022
Agora-token-helper - Some help tools for AgoraToken

Agora Token Helper Support AgoraToken version 001 - 006. But for security reason

:snake: Complete C99 parser in pure Python

pycparser v2.20 Contents 1 Introduction 1.1 What is pycparser? 1.2 What is it good for? 1.3 Which version of C does pycparser support? 1.4 What gramma

Eli Bendersky 2.8k Dec 29, 2022
A shim for the typeshed changes in mypy 0.900

types-all A shim for the typeshed changes in mypy 0.900 installation pip install types-all why --install-types is annoying, this installs all the thin

Anthony Sottile 28 Oct 20, 2022
Minimalistic Gridworld Environment (MiniGrid)

Minimalistic Gridworld Environment (MiniGrid) There are other gridworld Gym environments out there, but this one is designed to be particularly simple

Maxime Chevalier-Boisvert 1.7k Jan 03, 2023
Simple calculator with random number button and dark gray theme created with PyQt6

Calculator Application Simple calculator with random number button and dark gray theme created with : PyQt6 Python 3.9.7 you can download the dark gra

Flamingo 2 Mar 07, 2022
Package pyVHR is a comprehensive framework for studying methods of pulse rate estimation relying on remote photoplethysmography (rPPG)

Package pyVHR (short for Python framework for Virtual Heart Rate) is a comprehensive framework for studying methods of pulse rate estimation relying on remote photoplethysmography (rPPG)

PHUSE Lab 261 Jan 03, 2023
Materials for the Introduction in Python , Linux , Git and Github

This repository contains all the materials of the presentation on the introduction of python, linux, git and Github.

AMMI 3 Aug 28, 2022
Aevsploit İçin Destekde Bulun Papara: 1427113016

Aevsploit İçin Destekde Bulun Papara: 1427113016 Toolu Geliştirmek İçin Fikirlerinizi Bekliyorum Telegram

9 Jun 07, 2022