MolRep: A Deep Representation Learning Library for Molecular Property Prediction

Summary

MolRep is a Python package for fairly measuring algorithmic progress on chemical property datasets. It currently provides a complete re-evaluation of 16 state-of-the-art deep representation models over 16 benchmark property datsaets.

If you found this package useful, please cite biorxiv for now:

Install & Usage

We provide a script to install the environment. You will need the conda package manager, which can be installed from here.

To install the required packages, follow there instructions (tested on a linux terminal):

clone the repository

git clone https://github.com/Jh-SYSU/MolRep
cd into the cloned directory

cd MolRep
run the install script

source install.sh [<your_cuda_version>]

Where <your_cuda_version> is an optional argument that can be either cpu, cu92, cu100, cu101. If you do not provide a cuda version, the script will default to cpu. The script will create a virtual environment named MolRep, with all the required packages needed to run our code. Important: do NOT run this command using bash instead of source!

Data

Data could be download from Google_Driver

Current Dataset

Dataset	Task	Task type	#Molecule	Splits	Metric	Reference
QM7	1	Regression	7160	Stratified	MAE	Wu et al.
QM8	12	Regression	21786	Random	MAE	Wu et al.
QM9	12	Regression	133885	Random	MAE	Wu et al.
ESOL	1	Regression	1128	Random	RMSE	Wu et al.
FreeSolv	1	Regression	642	Random	RMSE	Wu et al.
Lipophilicity	1	Regression	4200	Random	RMSE	Wu et al.
BBBP	1	Classification	2039	Scaffold	ROC-AUC	Wu et al.
Tox21	12	Classification	7831	Random	ROC-AUC	Wu et al.
SIDER	27	Classification	1427	Random	ROC-AUC	Wu et al.
ClinTox	2	Classification	1478	Random	ROC-AUC	Wu et al.
Liver injury	1	Classification	2788	Random	ROC-AUC	Xu et al.
Mutagenesis	1	Classification	6511	Random	ROC-AUC	Hansen et al.
hERG	1	Classification	4813	Random	ROC-AUC	Li et al.
MUV	17	Classification	93087	Random	PRC-AUC	Wu et al.
HIV	1	Classification	41127	Random	ROC-AUC	Wu et al.
BACE	1	Classification	1513	Random	ROC-AUC	Wu et al.

Methods

Current Methods

Self-/unsupervised Models

Methods	Descriptions	Reference
Mol2Vec	Mol2Vec is an unsupervised approach to learns vector representations of molecular substructures that point in similar directions for chemically related substructures.	Jaeger et al.
N-Gram graph	N-gram graph is a simple unsupervised representation for molecules that first embeds the vertices in the molecule graph and then constructs a compact representation for the graph by assembling the ver-tex embeddings in short walks in the graph.	Liu et al.
FP2Vec	FP2Vec is a molecular featurizer that represents a chemical compound as a set of trainable embedding vectors and combine with CNN model.	Jeon et al.
VAE	VAE is a framework for training two neural networks (encoder and decoder) to learn a mapping from high-dimensional molecular representation into a lower-dimensional space.	Kingma et al.

Sequence Models

Methods	Descriptions	Reference
BiLSTM	BiLSTM is an artificial recurrent neural network (RNN) architecture to encoding sequences from compound SMILES strings.	Hochreiter et al.
SALSTM	SALSTM is a self-attention mechanism with improved BiLSTM for molecule representation.	Zheng et al
Transformer	Transformer is a network based solely on attention mechanisms and dispensing with recurrence and convolutions entirely to encodes compound SMILES strings.	Vaswani et al.
MAT	MAT is a molecule attention transformer utilized inter-atomic distances and the molecular graph structure to augment the attention mechanism.	Maziarka et al.

Graph Models

Methods	Descriptions	Reference
DGCNN	DGCNN is a deep graph convolutional neural network that proposes a graph convolution model with SortPooling layer which sorts graph vertices in a consistent order to learning the embedding of molec-ular graph.	Zhang et al.
GraphSAGE	GraphSAGE is a framework for inductive representation learning on molecular graphs that used to generate low-dimensional representations for atoms and performs sum, mean or max-pooling neigh-borhood aggregation to updates the atom representation and molecular representation.	Hamilton et al.
GIN	GIN is the Graph Isomorphism Network that builds upon the limitations of GraphSAGE to capture different graph structures with the Weisfeiler-Lehman graph isomorphism test.	Xu et al.
ECC	ECC is an Edge-Conditioned Convolution Network that learns a different parameter for each edge label (bond type) on the molecular graph, and neighbor aggregation is weighted according to specific edge parameters.	Simonovsky et al.
DiffPool	DiffPool combines a differentiable graph encoder with its an adaptive pooling mechanism that col-lapses nodes on the basis of a supervised criterion to learning the representation of molecular graphs.	Ying et al.
MPNN	MPNN is a message-passing graph neural network that learns the representation of compound molecular graph. It mainly focused on obtaining effective vertices (atoms) embedding	Gilmer et al.
D-MPNN	DMPNN is another message-passing graph neural network that messages associated with directed edges (bonds) rather than those with vertices. It can make use of the bond attributes.	Yang et al.
CMPNN	CMPNN is the graph neural network that improve the molecular graph embedding by strengthening the message interactions between edges (bonds) and nodes (atoms).	Song et al.

Training

To train a model by K-fold, run 5-fold-training_example.ipynb.

Testing

To test a pretrained model, run testing-example.ipynb.

Results

Results on Classification Tasks.

Datasets	BBBP	Tox21	SIDER	ClinTox	MUV	HIV	BACE
Mol2Vec	0.9213±0.0052	0.8139±0.0081	0.6043±0.0061	0.8572±0.0054	0.1178±0.0032	0.8413±0.0047	0.8284±0.0023
N-Gram graph	0.9012±0.0385	0.8371±0.0421	0.6482±0.0437	0.8753±0.0077	0.1011±0.0000	0.8378±0.0034	0.8472±0.0057
FP2Vec	0.8076±0.0032	0.8578±0.0076	0.6678±0.0068	0.8834±0.0432	0.0856±0.0031	0.7894±0.0052	0.8129±0.0492
VAE	0.8378±0.0031	0.8315±0.0382	0.6493±0.0762	0.8674±0.0124	0.0794±0.0001	0.8109±0.0381	0.8368±0.0762
BiLSTM	0.8391±0.0032	0.8279±0.0098	0.6092±0.0303	0.8319±0.0120	0.0382±0.0000	0.7962±0.0098	0.8263±0.0031
SALSTM	0.8482±0.0329	0.8253±0.0031	0.6308±0.0036	0.8317±0.0003	0.0409±0.0000	0.8034±0.0128	0.8348±0.0019
Transformer	0.9610±0.0119	0.8129±0.0013	0.6017±0.0012	0.8572±0.0032	0.0716±0.0017	0.8372±0.0314	0.8407±0.0738
MAT	0.9620±0.0392	0.8393±0.0039	0.6276±0.0029	0.8777±0.0149	0.0913±0.0001	0.8653±0.0054	0.8519±0.0504
DGCNN	0.9311±0.0434	0.7992±0.0057	0.6007±0.0053	0.8302±0.0126	0.0438±0.0000	0.8297±0.0038	0.8361±0.0034
GraphSAGE	0.9630±0.0474	0.8166±0.0041	0.6403±0.0045	0.9116±0.0146	0.1145±0.0000	0.8705±0.0724	0.9316±0.0360
GIN	0.8746±0.0359	0.8178±0.0031	0.5904±0.0000	0.8842±0.0004	0.0832±0.0000	0.8015±0.0328	0.8275±0.0034
ECC	0.9620±0.0003	0.8677±0.0090	0.6750±0.0092	0.8862±0.0831	0.1308±0.0013	0.8733±0.0025	0.8419±0.0092
DiffPool	0.8732±0.0391	0.8012±0.0130	0.6087±0.0130	0.8345±0.0233	0.0934±0.0001	0.8452±0.0042	0.8592±0.0391
MPNN	0.9321±0.0312	0.8440±0.014	0.6313±0.0121	0.8414±0.0294	0.0572±0.0001	0.8032±0.0092	0.8493±0.0013
DMPNN	0.9562±0.0070	0.8429±0.0391	0.6378±0.0329	0.8692±0.0051	0.0867±0.0032	0.8137±0.0072	0.8678±0.0372
CMPNN	0.9854±0.0215	0.8593±0.0088	0.6581±0.0020	0.9169±0.0065	0.1435±0.0002	0.8687±0.0003	0.8932±0.0019

More results will be updated soon.

MolRep: A Deep Representation Learning Library for Molecular Property Prediction

Related tags

Overview

MolRep: A Deep Representation Learning Library for Molecular Property Prediction

Summary

Install & Usage

Data

Current Dataset

Methods

Current Methods

Self-/unsupervised Models

Sequence Models

Graph Models

Training

Testing

Results

Results on Classification Tasks.

Owner

AI-Health @NSCC-gz

rliable is an open-source Python library for reliable evaluation, even with a handful of runs, on reinforcement learning and machine learnings benchmarks.

Official PyTorch Implementation of paper "Deep 3D Mask Volume for View Synthesis of Dynamic Scenes", ICCV 2021.

Autoregressive Models in PyTorch.

The code for two papers: Feedback Transformer and Expire-Span.

Repo for code associated with Modeling the Mitral Valve.

Unofficial pytorch implementation of 'Image Inpainting for Irregular Holes Using Partial Convolutions'

TargetAllDomainObjects - A python wrapper to run a command on against all users/computers/DCs of a Windows Domain

Official codebase for running the small, filtered-data GLIDE model from GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.

A library for optimization on Riemannian manifolds

Discord bot-CTFD-Thread-Parser - Discord bot CTFD-Thread-Parser

[SIGGRAPH Asia 2021] DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning.

Evaluation toolkit of the informative tracking benchmark comprising 9 scenarios, 180 diverse videos, and new challenges.

🐦 Opytimizer is a Python library consisting of meta-heuristic optimization techniques.

The MLOps platform for innovators 🚀

LowRankModels.jl is a julia package for modeling and fitting generalized low rank models.

Faune proche - Retrieval of Faune-France data near a google maps location

scikit-learn inspired API for CRFsuite

PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation

Covid19-Forecasting - An interactive website that tracks, models and predicts COVID-19 Cases

Facilitating Database Tuning with Hyper-ParameterOptimization: A Comprehensive Experimental Evaluation