A Python package to process & model ChEMBL data.

Last update: Dec 09, 2021

Overview

insilico: A Python package to process & model ChEMBL data.

ChEMBL is a manually curated chemical database of bioactive molecules with drug-like properties. It is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL) based in Hinxton, UK.

insilico helps drug researchers find promising compounds for drug discovery. It preprocesses ChEMBL molecular data and outputs Lapinski's descriptors and chemical fingerprints using popular bioinformatic libraries. Additionally, this package can be used to make a decision tree model that predicts drug efficacy.

About the package name

The term in silico is a neologism used to mean pharmacology hypothesis development & testing performed via computer (silicon), and is related to the more commonly known biological terms in vivo ("within the living") and in vitro ("within the glass".)

Installation

Installation via pip:

$ pip install insilico

Installation via cloned repository:

$ git clone https://github.com/konstanzer/insilico
$ cd insilico
$ python setup.py install

Python dependencies

For preprocessing, rdkit-pypi, padelpy, and chembl_webresource_client and for modeling, sklearn and seaborn

Basic Usage

insilico offers two functions: one to search the ChEMBL database and a second to output preprocessed ChEMBL data based on the molecular ID. Using the chemical fingerprint from this output, the Model class creates a decision tree and outputs residual plots and metrics.

The function process_target_data saves the chemical fingerprint and, optionally, molecular descriptor plots to a data folder if plots=True.

When declaring the model class, you may specify a test set size and a variance threshold, which sets the minimum variance allowed for each column. This optional step may eliminate hundreds of features unhelpful for modeling. When calling the decision_tree function, optionally specify max tree depth and cost-complexity alpha, hyperparameters to control overfitting. If save=True, the model is saved to the data folder.

from insilico import target_search, process_target_data, Model

# return search results for 'P. falciparum D6'
result = target_search('P. falciparum')

# returns a dataframe of molecular data for CHEMBL2367107 (P. falciparum D6)
df = process_target_data('CHEMBL2367107')

model = Model(test_size=0.2, var_threshold=0.15)

# returns a decision tree and metrics (R^2 and MAE) & saves residual plot
tree, metrics = model.decision_tree(df, max_depth=50, ccp_alpha=0.)

# returns split data for use in other models
X_train, X_test, y_train, y_test = model.split_data()

Advanced option: Use optional 'fp' parameter to specify fingerprinter

Valid fingerprinters are "PubchemFingerprinter" (default), "ExtendedFingerprinter", "EStateFingerprinter", "GraphOnlyFingerprinter", "MACCSFingerprinter", "SubstructureFingerprinter", "SubstructureFingerprintCount", "KlekotaRothFingerprinter", "KlekotaRothFingerprintCount", "AtomPairs2DFingerprinter", and "AtomPairs2DFingerprintCount".

df = process_target_data('CHEMBL2367107', plots=False, fp='SubstructureFingerprinter')

Contributing, Reporting Issues & Support

Make a pull request if you'd like to contribute to insilico. Contributions should include tests for new features added and documentation. File an issue to report problems with the software or feature requests. Include information such as error messages, your OS/environment and Python version.

Questions may be sent to Steven Newton ([email protected]).

References

Bioinformatics Project from Scratch: Drug Discovery by Chanin Nantasenamat

A Python package to process & model ChEMBL data.

Related tags

Overview

insilico: A Python package to process & model ChEMBL data.

About the package name

Installation

Python dependencies

Basic Usage

Advanced option: Use optional 'fp' parameter to specify fingerprinter

Contributing, Reporting Issues & Support

References

Owner

Steven Newton

Bonnet: An Open-Source Training and Deployment Framework for Semantic Segmentation in Robotics.

Hso-groupie - A pwnable challenge in Real World CTF 4th

Large Scale Multi-Illuminant (LSMI) Dataset for Developing White Balance Algorithm under Mixed Illumination

a reccurrent neural netowrk that when trained on a peice of text and fed a starting prompt will write its on 250 character text using LSTM layers

Original Implementation of Prompt Tuning from Lester, et al, 2021

Code for binary and multiclass model change active learning, with spectral truncation implementation.

WHENet - ONNX, OpenVINO, TFLite, TensorRT, EdgeTPU, CoreML, TFJS, YOLOv4/YOLOv4-tiny-3L

Survival analysis (SA) is a well-known statistical technique for the study of temporal events.

Meta graph convolutional neural network-assisted resilient swarm communications

The implemetation of Dynamic Nerual Garments proposed in Siggraph Asia 2021

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

This repo generates the training data and the model for Morpheus-Deblend

Pytorch implementation for RelTransformer

Breaking the Dilemma of Medical Image-to-image Translation

Dataset and Code for the paper "DepthTrack: Unveiling the Power of RGBD Tracking" (ICCV2021), and "Depth-only Object Tracking" (BMVC2021)

Using modified BiSeNet for face parsing in PyTorch

Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!

A state of the art of new lightweight YOLO model implemented by TensorFlow 2.

End-to-End Referring Video Object Segmentation with Multimodal Transformers

This MVP data web app uses the Streamlit framework and Facebook's Prophet forecasting package to generate a dynamic forecast from your own data.