Differentiable molecular simulation of proteins with a coarse-grained potential

Overview

Differentiable molecular simulation of proteins with a coarse-grained potential

Build status

This repository contains the learned potential, simulation scripts and training code for the paper:

Greener JG and Jones DT, Differentiable molecular simulation can learn all the parameters in a coarse-grained force field for proteins, bioRxiv (2021) - link

It provides the cgdms Python package which can be used to simulate any protein and reproduce the results in the paper.

Installation

  1. Python 3.6 or later is required. The software is OS-independent.
  2. Install PyTorch 1.6 or later as appropriate for your system. A GPU is not essential but is recommended as simulations are slower on the CPU. However running on CPU is about 3x slower than GPU depending on hardware, so it is still feasible.
  3. Run pip install cgdms, which will also install NumPy, Biopython and PeptideBuilder if they are not already present. The package takes up about 75 MB of disk space.

Usage

On Unix systems the executable cgdms will be added to the path during installation. On Windows you can call the bin/cgdms script with python if you can't access the executable.

Run cgdms -h to see the help text and cgdms {mode} -h to see the help text for each mode. The modes are described below but there are other options outlined in the help text such as specifying the device to run on, running with a custom parameter set or changing the logging verbosity.

Generating protein data files

To simulate or calculate the energy of proteins you need to generate files of a certain format. If you want to use the proteins presented in the paper, the data files are here. Otherwise you will need to generate these files:

cgdms makeinput -i 1CRN.pdb -s 1CRN.ss2 > 1CRN.txt
cat 1CRN.txt
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
CCCCCCCEECCCCCEECCCCCHHHEEEECCCEEEECCCCCCCCCCC
17.047 14.099 3.625 16.967 12.784 4.338 15.685 12.755 5.133 18.551 12.359 5.368
15.115 11.555 5.265 13.856 11.469 6.066 14.164 10.785 7.379 12.841 10.531 4.694
13.488 11.241 8.417 13.66 10.707 9.787 12.269 10.431 10.323 15.126 12.087 10.354
12.019 9.272 10.928 10.646 8.991 11.408 10.654 8.793 12.919 9.947 7.885 9.793
...
  • -i is a well-behaved PDB or mmCIF file. This means a single protein chain with no missing residues or heavy atoms. Hetero atoms are ignored and all residues must be standard. The format is guessed from the file extension, default PDB.
  • -s is the PSIPRED secondary structure prediction ss2 output file. An example is given along with other example files here. If this option is omitted then fully coiled is assumed, which is not recommended, though you could replace that with a secondary structure prediction of your choosing or the known secondary structure depending on your use case.

If you are not interested in the RMSDs logged during the simulation and don't want to start simulation from the native structure, the coordinate lines (which contain coordinates for N/Cα/C/sidechain centroid) are not used. In this case you can generate your own files with random numbers in place of the coordinates. This would also apply to sequences where you don't know the native structure.

Running a simulation

Run a molecular dynamics simulation of a protein in the learned potential:

cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7
    Step        1 / 12000000 - acc  0.005 - vel  0.025 - energy -44.06 ( -21.61 -15.59  -6.86 ) - Cα RMSD  32.59
    Step    10001 / 12000000 - acc  0.005 - vel  0.032 - energy -14.76 ( -11.82   0.46  -3.40 ) - Cα RMSD  32.28
    Step    20001 / 12000000 - acc  0.005 - vel  0.030 - energy  -9.15 (  -8.19   2.15  -3.10 ) - Cα RMSD  31.95
    Step    30001 / 12000000 - acc  0.005 - vel  0.028 - energy  -9.03 ( -10.20   2.22  -1.04 ) - Cα RMSD  31.79
...
  • -i is a protein data file as described above.
  • -o is the optional output PDB filepath to write the simulation to. By default snapshots are taken and the energy printed every 10,000 steps but this can be changed with the -r flag. PULCHRA can be used to generate all-atom structures from these output files if required.
  • -s is the starting conformation. This can be predss (extended with predicted secondary structure), native (the conformation in the protein data file), extended (extended with small random perturbations to the angles), random (random in ϕ -180° -> -30°, ψ -180° -> 180°) or helix (ϕ -60°, ψ -60°).
  • -n is the number of simulation steps. It takes ~36 hours on a GPU to run a simulation of this length, or ~10 ms per time step.
  • -t, -c, -st, -ts can be used to change the thermostat temperature, thermostat coupling constant, starting temperature and integrator time step respectively.

Calculating the energy

Calculate the energy of a protein structure in the learned potential:

cgdms energy -i 1CRN.txt
-136.122
  • -i is a protein data file as described above.
  • -m gives an optional number of minimisation steps before returning the energy, default 0.

Since calculating the energy without minimisation steps is mostly setup, running on the CPU using -d cpu is often faster than running on the GPU (~5 s to ~3 s).

Threading sequences onto a structure

Calculate the energy in the learned potential of a set of sequences threaded onto a structure.

cgdms thread -i 1CRN.txt -s sequences.txt
1 -145.448
2 -138.533
3 -142.473
...
  • -i is a protein data file as described above.
  • -s is a file containing protein sequences, one per line, of the same length as the sequence in the protein data file (that sequence is ignored). Since lines in the sequence file starting with > are ignored, FASTA files can be used provided each sequence is on a single line.
  • -m gives an optional number of minimisation steps before returning the energy, default 100.

Training the system

Train the system.

cgdms train
Starting training
Epoch    1 - med train/val RMSD  0.863 /  0.860 over  250 steps
Epoch    2 - med train/val RMSD  0.859 /  0.860 over  250 steps
Epoch    3 - med train/val RMSD  0.856 /  0.854 over  250 steps
...
  • -o is an optional output learned parameter filepath, default cgdms_params.pt.

Training takes about 2 months on a decent GPU and is unlikely something you want to do.

Exploring potentials

The learned potential and information on the interactions can be found in the Python package:

import torch
from cgdms import trained_model_file
params = torch.load(trained_model_file, map_location="cpu")
print(params.keys())
dict_keys(['distances', 'angles', 'dihedrals', 'optimizer'])
  • params["distances"] has shape [28961, 140] corresponding to the 28,960 distance potentials described in the paper and a flat potential used for same atom interactions. See cgdms.interactions for the interaction described by each potential, which has values corresponding to 140 distance bins.
  • params["angles"] has shape [5, 20, 140] corresponding to the 5 bond angles in cgdms.angles, the 20 amino acids in cgdms.aas, and 140 angle bins.
  • params["dihedrals"] has shape [5, 60, 142] corresponding to the 5 dihedral angles in cgdms.dihedrals, the 20 amino acids from cgdms.aas in each predicted secondary structure type (ala helix, ala sheet, ala coil, arg helix, etc.), and 140 angle bins with an extra 2 to wrap round and allow periodicity.

Notes

Running a simulation takes less than 1 GB of GPU memory for any number of steps. Training a model takes up to 32 GB of GPU memory once the number of steps is fully scaled up to 2,000. See the discussion in the paper for ways of alleviating this.

The lists of training and validation PDB chains are available here and the protein data files here.

See the autobuild script and logs for automated commands to install and run the package in Ubuntu.

The code in this package is set up to run specific coarse-grained simulations of proteins. However, the package contains code that could be useful to others wishing to carry out general differentiable simulations with PyTorch. This includes integrators not used in the paper and not thoroughly tested (velocity-free Verlet, two Langevin implementations), the Andersen thermostat, RMSD with the Kabsch algorithm, and code to apply forces to atoms from bond angle and dihedral angle potentials.

Other software related to differentiable molecular simulation includes Jax MD, TorchMD, DeePMD-kit, SchNetPack, DiffTaichi, Time Machine and Molly.

You might also like...
Implementation of Graph Transformer in Pytorch, for potential use in replicating Alphafold2
Implementation of Graph Transformer in Pytorch, for potential use in replicating Alphafold2

Graph Transformer - Pytorch Implementation of Graph Transformer in Pytorch, for potential use in replicating Alphafold2. This was recently used by bot

[ICCV 2021] Excavating the Potential Capacity of Self-Supervised Monocular Depth Estimation
[ICCV 2021] Excavating the Potential Capacity of Self-Supervised Monocular Depth Estimation

EPCDepth EPCDepth is a self-supervised monocular depth estimation model, whose supervision is coming from the other image in a stereo pair. Details ar

Neural Scene Flow Fields using pytorch-lightning, with potential improvements
Neural Scene Flow Fields using pytorch-lightning, with potential improvements

nsff_pl Neural Scene Flow Fields using pytorch-lightning. This repo reimplements the NSFF idea, but modifies several operations based on observation o

This is a Pytorch implementation of the paper: Self-Supervised Graph Transformer on Large-Scale Molecular Data.

This is a Pytorch implementation of the paper: Self-Supervised Graph Transformer on Large-Scale Molecular Data.

Few-Shot Graph Learning for Molecular Property Prediction

Few-shot Graph Learning for Molecular Property Prediction Introduction This is the source code and dataset for the following paper: Few-shot Graph Lea

SkipGNN: Predicting Molecular Interactions with Skip-Graph Networks (Scientific Reports)
SkipGNN: Predicting Molecular Interactions with Skip-Graph Networks (Scientific Reports)

SkipGNN: Predicting Molecular Interactions with Skip-Graph Networks Molecular interaction networks are powerful resources for the discovery. While dee

MolRep: A Deep Representation Learning Library for Molecular Property Prediction
MolRep: A Deep Representation Learning Library for Molecular Property Prediction

MolRep: A Deep Representation Learning Library for Molecular Property Prediction Summary MolRep is a Python package for fairly measuring algorithmic p

Implementation of Learning Gradient Fields for Molecular Conformation Generation (ICML 2021).
Implementation of Learning Gradient Fields for Molecular Conformation Generation (ICML 2021).

[PDF] | [Slides] The official implementation of Learning Gradient Fields for Molecular Conformation Generation (ICML 2021 Long talk) Installation Inst

Kaggle | 9th place (part of) solution for the Bristol-Myers Squibb – Molecular Translation challenge

Part of the 9th place solution for the Bristol-Myers Squibb – Molecular Translation challenge translating images containing chemical structures into I

Comments
  • Output `Killed` with no other information

    Output `Killed` with no other information

    Hi, Thanks for the excellent work.

    I installed the tool and can run the file generation command: cgdms makeinput -i 1CRN.pdb -s 1CRN.ss2 > 1CRN.txt

    But I cannot run the simulate command. The output is just one word Killed.

    $ cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7
    Killed
    

    Could you please take a look and advise how to debug?

    Best, Roden

    My conda environment is attached:

    name: cgdms
    channels:
      - pytorch
      - salilab
      - conda-forge
      - bioconda
      - defaults
    dependencies:
      - _libgcc_mutex=0.1=conda_forge
      - _openmp_mutex=4.5=2_kmp_llvm
      - blas=1.0=mkl
      - bzip2=1.0.8=h7f98852_4
      - ca-certificates=2022.6.15=ha878542_0
      - cudatoolkit=10.2.89=h713d32c_10
      - freetype=2.10.4=h0708190_1
      - giflib=5.2.1=h36c2ea0_2
      - jpeg=9e=h166bdaf_2
      - lcms2=2.12=hddcbb42_0
      - ld_impl_linux-64=2.36.1=hea4e1c9_2
      - lerc=3.0=h9c3ff4c_0
      - libdeflate=1.12=h166bdaf_0
      - libffi=3.4.2=h7f98852_5
      - libgcc-ng=12.1.0=h8d9b700_16
      - libnsl=2.0.0=h7f98852_0
      - libpng=1.6.37=h21135ba_2
      - libstdcxx-ng=12.1.0=ha89aaad_16
      - libtiff=4.4.0=hc85c160_1
      - libuuid=2.32.1=h7f98852_1000
      - libwebp=1.2.2=h3452ae3_0
      - libwebp-base=1.2.2=h7f98852_1
      - libxcb=1.13=h7f98852_1004
      - libzlib=1.2.12=h166bdaf_1
      - llvm-openmp=14.0.4=he0ac6c6_0
      - lz4-c=1.9.3=h9c3ff4c_1
      - mkl=2021.4.0=h8d4b97c_729
      - mkl-service=2.4.0=py38h95df7f1_0
      - mkl_fft=1.3.1=py38h8666266_1
      - mkl_random=1.2.2=py38h1abd341_0
      - ncurses=6.3=h27087fc_1
      - ninja=1.11.0=h924138e_0
      - numpy=1.22.3=py38he7a7128_0
      - numpy-base=1.22.3=py38hf524024_0
      - openjpeg=2.4.0=hb52868f_1
      - openssl=3.0.4=h166bdaf_2
      - pillow=9.1.1=py38h0ee0e06_1
      - pip=22.1.2=pyhd8ed1ab_0
      - pthread-stubs=0.4=h36c2ea0_1001
      - python=3.8.13=ha86cf86_0_cpython
      - python_abi=3.8=2_cp38
      - pytorch=1.6.0=py3.8_cuda10.2.89_cudnn7.6.5_0
      - readline=8.1.2=h0f457ee_0
      - setuptools=62.6.0=py38h578d9bd_0
      - six=1.16.0=pyh6c4a22f_0
      - sqlite=3.39.0=h4ff8645_0
      - tbb=2021.5.0=h924138e_1
      - tk=8.6.12=h27826a3_0
      - torchvision=0.7.0=py38_cu102
      - wheel=0.37.1=pyhd8ed1ab_0
      - xorg-libxau=1.0.9=h7f98852_0
      - xorg-libxdmcp=1.1.3=h7f98852_0
      - xz=5.2.5=h516909a_1
      - zlib=1.2.12=h166bdaf_1
      - zstd=1.5.2=h8a70e8d_2
      - pip:
        - biopython==1.79
        - cgdms==1.0
        - colorama==0.4.5
        - peptidebuilder==1.1.0
    
    opened by RodenLuo 7
Releases(v1.0)
Owner
UCL Bioinformatics Group
UCL bioinformatics group repositories
UCL Bioinformatics Group
This repo is a C++ version of yolov5_deepsort_tensorrt. Packing all C++ programs into .so files, using Python script to call C++ programs further.

yolov5_deepsort_tensorrt_cpp Introduction This repo is a C++ version of yolov5_deepsort_tensorrt. And packing all C++ programs into .so files, using P

41 Dec 27, 2022
Compositional Sketch Search

Compositional Sketch Search Official repository for ICIP 2021 Paper: Compositional Sketch Search Requirements Install and activate conda environment c

Alexander Black 8 Sep 06, 2021
This program will stylize your photos with fast neural style transfer.

Neural Style Transfer (NST) Using TensorFlow Demo TensorFlow TensorFlow is an end-to-end open source platform for machine learning. It has a comprehen

Ismail Boularbah 1 Aug 08, 2022
[NeurIPS 2019] Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss

Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, Tengyu Ma This is the offi

Kaidi Cao 528 Jan 01, 2023
A BaSiC Tool for Background and Shading Correction of Optical Microscopy Images

BaSiC Matlab code accompanying A BaSiC Tool for Background and Shading Correction of Optical Microscopy Images by Tingying Peng, Kurt Thorn, Timm Schr

Marr Lab 34 Dec 18, 2022
Speech recognition tool to convert audio to text transcripts, for Linux and Raspberry Pi.

Spchcat Speech recognition tool to convert audio to text transcripts, for Linux and Raspberry Pi. Description spchcat is a command-line tool that read

Pete Warden 279 Jan 03, 2023
Predicts an answer in yes or no.

Oui-ou-non-prediction Predicts an answer in 'yes' or 'no'. It is based on the game 'effeuiller la marguerite' in which the person plucks flower petals

Ananya Gupta 1 Jan 15, 2022
[ICCV2021] Learning to Track Objects from Unlabeled Videos

Unsupervised Single Object Tracking (USOT) 🌿 Learning to Track Objects from Unlabeled Videos Jilai Zheng, Chao Ma, Houwen Peng and Xiaokang Yang 2021

53 Dec 28, 2022
Diverse Branch Block: Building a Convolution as an Inception-like Unit

Diverse Branch Block: Building a Convolution as an Inception-like Unit (PyTorch) (CVPR-2021) DBB is a powerful ConvNet building block to replace regul

253 Dec 24, 2022
PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)

Vision Transformer for Fast and Efficient Scene Text Recognition (ICDAR 2021) ViTSTR is a simple single-stage model that uses a pre-trained Vision Tra

Rowel Atienza 198 Dec 27, 2022
Shared Attention for Multi-label Zero-shot Learning

Shared Attention for Multi-label Zero-shot Learning Overview This repository contains the implementation of Shared Attention for Multi-label Zero-shot

dathuynh 26 Dec 14, 2022
Automatic voice-synthetised summaries of latest research papers on arXiv

PaperWhisperer PaperWhisperer is a Python application that keeps you up-to-date with research papers. How? It retrieves the latest articles from arXiv

Valerio Velardo 124 Dec 20, 2022
The official homepage of the COCO-Stuff dataset.

The COCO-Stuff dataset Holger Caesar, Jasper Uijlings, Vittorio Ferrari Welcome to official homepage of the COCO-Stuff [1] dataset. COCO-Stuff augment

Holger Caesar 715 Dec 31, 2022
Point Cloud Registration Network

PCRNet: Point Cloud Registration Network using PointNet Encoding Source Code Author: Vinit Sarode and Xueqian Li Paper | Website | Video | Pytorch Imp

ViNiT SaRoDe 59 Nov 19, 2022
This repository contains the code used for Predicting Patient Outcomes with Graph Representation Learning (https://arxiv.org/abs/2101.03940).

Predicting Patient Outcomes with Graph Representation Learning This repository contains the code used for Predicting Patient Outcomes with Graph Repre

Emma Rocheteau 76 Dec 22, 2022
CVPR 2021 - Official code repository for the paper: On Self-Contact and Human Pose.

TUCH This repo is part of our project: On Self-Contact and Human Pose. [Project Page] [Paper] [MPI Project Page] License Software Copyright License fo

Lea Müller 45 Jan 07, 2023
StyleGAN2-ADA - Official PyTorch implementation

Need Help? If you’re new to StyleGAN2-ADA and looking to get started, please check out this video series from a course Lia Coleman and I taught in Oct

Derrick Schultz 217 Jan 04, 2023
Subpopulation detection in high-dimensional single-cell data

PhenoGraph for Python3 PhenoGraph is a clustering method designed for high-dimensional single-cell data. It works by creating a graph ("network") repr

Dana Pe'er Lab 42 Sep 05, 2022
Learning-based agent for Google Research Football

TiKick 1.Introduction Learning-based agent for Google Research Football Code accompanying the paper "TiKick: Towards Playing Multi-agent Football Full

Tsinghua AI Research Team for Reinforcement Learning 90 Dec 26, 2022
Official repo for our 3DV 2021 paper "Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements".

Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements Yu Rong, Jingbo Wang, Ziwei Liu, Chen Change Loy Paper. Pr

Yu Rong 41 Dec 13, 2022