Reaction SMILES-AA mapping via language modelling

Last update: Dec 13, 2022

rxn-aa-mapper

Reaction SMILES-AA sequence mapping

setup

conda env create -f conda.yml
conda activate rxn_aa_mapper

The following sections use the provided examples to show how to use RXNAAMapper.

generate a vocabulary to be used with the `EnzymaticReactionBertTokenizer`

Create a vocabulary compatible with the enzymatic reaction tokenizer:

create-enzymatic-reaction-vocabulary ./examples/data-samples/biochemical ./examples/token_75K_min_600_max_750_500K.json /tmp/vocabulary.txt "*.csv"
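To give an intuition for what a token vocabulary over reaction SMILES looks like, here is a minimal sketch. It is not the implementation behind `create-enzymatic-reaction-vocabulary`: the regex is a commonly used atom-level SMILES pattern, the helper names `tokenize_smiles` and `build_vocabulary` are hypothetical, the special tokens are the usual BERT ones and are assumed, and the AA-sequence side of the vocabulary is left out entirely.

```python
import re
from collections import Counter
from typing import Iterable, List, Sequence

# Atom-level SMILES pattern (an assumption; the library's own tokenization may differ).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into bracket atoms, atoms, bonds and ring digits."""
    return SMILES_TOKEN_PATTERN.findall(smiles)


def build_vocabulary(
    smiles_list: Iterable[str],
    special_tokens: Sequence[str] = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"),
) -> List[str]:
    """Return special tokens followed by SMILES tokens sorted by descending frequency."""
    counts: Counter = Counter()
    for smiles in smiles_list:
        counts.update(tokenize_smiles(smiles))
    # Ties are broken alphabetically so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(special_tokens) + [token for token, _ in ranked]


vocabulary = build_vocabulary(["O=C([O-])CCC(=O)C(=O)[O-]"])
print("\n".join(vocabulary))  # one token per line, like a BERT-style vocabulary file
```

Writing one token per line mirrors the layout of BERT-style vocabulary files; whether `/tmp/vocabulary.txt` follows exactly this layout is an assumption.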

use the tokenizer

Using the example vocabulary and AA tokenizer provided, we can observe the enzymatic reaction tokenizer in action:

from rxn_aa_mapper.tokenization import EnzymaticReactionBertTokenizer

tokenizer = EnzymaticReactionBertTokenizer(
    vocabulary_file="./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    aa_sequence_tokenizer_filepath="./examples/token_75K_min_600_max_750_500K.json"
)
tokenizer.tokenize("NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]")
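The input string above follows the layout `reactants|AA sequence>>products`. The sketch below splits such a string into its three parts, which can help when preparing or checking data. The helper `split_enzymatic_reaction` is hypothetical and not part of `rxn_aa_mapper`, and the example uses a shortened, purely illustrative AA sequence.

```python
from typing import NamedTuple


class EnzymaticReaction(NamedTuple):
    reactants: str    # reactant SMILES, molecules separated by "."
    aa_sequence: str  # enzyme amino-acid sequence, one letter per residue
    products: str     # product SMILES


def split_enzymatic_reaction(reaction: str) -> EnzymaticReaction:
    """Split 'reactants|AA sequence>>products' into its three components."""
    precursors, products = reaction.split(">>", 1)
    reactants, aa_sequence = precursors.split("|", 1)
    return EnzymaticReaction(reactants, aa_sequence, products)


# Shortened AA sequence, for illustration only.
example = "O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDG>>O=C([O-])CCC(=O)C(=O)[O-]"
parts = split_enzymatic_reaction(example)
print(parts.reactants)
print(parts.aa_sequence)
print(parts.products)
```

A string that lacks either separator raises a `ValueError` during unpacking, which is a cheap way to catch malformed rows before tokenization.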

train the model

The mlm-trainer script can be used to train a model via MTL:

# The train and validation folders are identical here because this is just a sample;
# in practice, split the data into a train folder and a validation folder.
# For a more realistic config see ./examples/config.json.
mlm-trainer \
    ./examples/data-samples/biochemical ./examples/data-samples/biochemical \
    ./examples/vocabulary_token_75K_min_600_max_750_500K.txt /tmp/mlm-trainer-log \
    ./examples/sample-config.json "*.csv" 1 \
    ./examples/data-samples/organic ./examples/data-samples/organic \
    ./examples/token_75K_min_600_max_750_500K.json

Checkpoints are stored in /tmp/mlm-trainer-log for later use in identifying active sites.

They can be converted into a Hugging Face model by running:

checkpoint-to-hf-model /path/to/model.ckpt /tmp/rxnaamapper-pretrained-model ./examples/vocabulary_token_75K_min_600_max_750_500K.txt ./examples/sample-config.json ./examples/token_75K_min_600_max_750_500K.json

predict active site

The trained model can be used to map reactant atoms to AA sequence locations that potentially represent the active site.

from rxn_aa_mapper.aa_mapper import RXNAAMapper

config_mapper = {
    "vocabulary_file": "./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    "aa_sequence_tokenizer_filepath": "./examples/token_75K_min_600_max_750_500K.json",
    "model_path": "/tmp/rxnaamapper-pretrained-model",
    "head": 3,
    "layers": [11],
    "top_k": 1,
}
mapper = RXNAAMapper(config=config_mapper)
mapper.get_reactant_aa_sequence_attention_guided_maps(["NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]"])
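The `head`, `layers` and `top_k` entries in the config suggest that the mapping is read off attention weights from a chosen head and layer, keeping the `top_k` strongest AA tokens per reactant token. The sketch below illustrates that idea on a toy attention matrix. It is an assumption about the general technique, not the code inside `RXNAAMapper`, and the function name `top_k_aa_positions` is hypothetical.

```python
from typing import List


def top_k_aa_positions(attention: List[List[float]], top_k: int = 1) -> List[List[int]]:
    """For each reactant token (row), return the column indices of the
    top_k AA tokens that receive the highest attention weight."""
    selected = []
    for row in attention:
        # Rank AA token indices by attention weight, highest first.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        selected.append(ranked[:top_k])
    return selected


# Toy matrix: 2 reactant tokens (rows) x 3 AA tokens (columns).
toy_attention = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
]
print(top_k_aa_positions(toy_attention, top_k=1))
```

With `top_k` set to 1, as in the config above, each reactant token maps to a single AA token; larger values return several candidate positions per token.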

citation

@article{dassi2021identification,
  title={Identification of Enzymatic Active Sites with Unsupervised Language Modeling},
  author={Dassi, Lo{\"\i}c Kwate and Manica, Matteo and Probst, Daniel and Schwaller, Philippe and Teukam, Yves Gaetan Nana and Laino, Teodoro},
  year={2021},
  conference={AI for Science: Mind the Gaps at NeurIPS 2021, ELLIS Machine Learning for Molecule Discovery Workshop 2021}
}
