Reaction SMILES-AA mapping via language modelling

Last update: Dec 13, 2022

rxn-aa-mapper

Reaction SMILES-AA sequence mapping

setup

conda env create -f conda.yml
conda activate rxn_aa_mapper

The following examples show how to use RXNAAMapper.

generate a vocabulary to be used with the `EnzymaticReactionBertTokenizer`

Create a vocabulary compatible with the enzymatic reaction tokenizer:

create-enzymatic-reaction-vocabulary ./examples/data-samples/biochemical ./examples/token_75K_min_600_max_750_500K.json /tmp/vocabulary.txt "*.csv"

use the tokenizer

Using the example vocabulary and AA tokenizer provided, we can observe the enzymatic reaction tokenizer in action:

from rxn_aa_mapper.tokenization import EnzymaticReactionBertTokenizer

tokenizer = EnzymaticReactionBertTokenizer(
    vocabulary_file="./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    aa_sequence_tokenizer_filepath="./examples/token_75K_min_600_max_750_500K.json"
)
tokenizer.tokenize("NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]")
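
The input string passed to the tokenizer appears to follow the pattern `reactants|AA sequence>>products`, with individual molecules separated by `.`. As a minimal sketch under that assumption, the parts can be separated in plain Python before tokenization. The helper `split_enzymatic_reaction` and the `EnzymaticReaction` tuple are hypothetical and not part of rxn_aa_mapper, and the AA sequence is shortened here for readability:

```python
from typing import List, NamedTuple


class EnzymaticReaction(NamedTuple):
    reactants: List[str]  # reactant SMILES, one entry per molecule
    aa_sequence: str      # enzyme amino-acid sequence
    products: List[str]   # product SMILES, one entry per molecule


def split_enzymatic_reaction(rxn: str) -> EnzymaticReaction:
    """Split 'reactants|AA sequence>>products' into its parts (hypothetical helper)."""
    if rxn.count(">>") != 1:
        raise ValueError("expected exactly one '>>' separator")
    left, right = rxn.split(">>")
    if left.count("|") != 1:
        raise ValueError("expected exactly one '|' between reactants and AA sequence")
    reactants, aa_sequence = left.split("|")
    return EnzymaticReaction(reactants.split("."), aa_sequence, right.split("."))


# Shortened version of the example reaction (one reactant, truncated enzyme sequence).
parsed = split_enzymatic_reaction(
    "O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEIS>>O=C([O-])CCC(=O)C(=O)[O-]"
)
print(parsed.reactants)
print(parsed.aa_sequence)
print(parsed.products)
```

A check like this can catch malformed rows (a missing `|` or `>>`) before they reach the tokenizer.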

train the model

The mlm-trainer script can be used to train a model via MTL:

# This is just a sample: the same folder is passed as both the train and the validation folder.
# For real runs, split the data into a train and a validation folder.
# For a more realistic config see ./examples/config.json
mlm-trainer \
    ./examples/data-samples/biochemical ./examples/data-samples/biochemical \
    ./examples/vocabulary_token_75K_min_600_max_750_500K.txt /tmp/mlm-trainer-log \
    ./examples/sample-config.json "*.csv" 1 \
    ./examples/data-samples/organic ./examples/data-samples/organic \
    ./examples/token_75K_min_600_max_750_500K.json
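
The sample command reuses one folder for training and validation. For real data, one possible way to produce separate train and validation folders is sketched below. It is not part of rxn_aa_mapper: the function `split_csv` and its parameters are hypothetical, and it assumes each CSV has a header row and one record per line (no multi-line quoted fields).

```python
import random
from pathlib import Path
from typing import Tuple


def split_csv(source: Path, train_dir: Path, valid_dir: Path,
              valid_fraction: float = 0.1, has_header: bool = True,
              seed: int = 42) -> Tuple[int, int]:
    """Write a shuffled train/validation split of one CSV file.

    Returns (number of training rows, number of validation rows).
    Assumes one record per line; the header, if any, is copied to both outputs.
    """
    lines = source.read_text().splitlines()
    header, rows = (lines[:1], lines[1:]) if has_header else ([], lines)
    random.Random(seed).shuffle(rows)  # seeded, so the split is reproducible
    n_valid = int(len(rows) * valid_fraction)
    for folder, part in ((valid_dir, rows[:n_valid]), (train_dir, rows[n_valid:])):
        folder.mkdir(parents=True, exist_ok=True)
        (folder / source.name).write_text("\n".join(header + part) + "\n")
    return len(rows) - n_valid, n_valid
```

Calling `split_csv` once per file matched by `"*.csv"` in a data folder yields two folders that can be passed to mlm-trainer in place of the repeated sample path.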

Checkpoints are stored in /tmp/mlm-trainer-log for later use in identifying active sites.

They can be converted into a Hugging Face model by running:

checkpoint-to-hf-model /path/to/model.ckpt /tmp/rxnaamapper-pretrained-model ./examples/vocabulary_token_75K_min_600_max_750_500K.txt ./examples/sample-config.json ./examples/token_75K_min_600_max_750_500K.json

predict active site

The trained model can be used to map reactant atoms to AA sequence locations that potentially represent the active site:

from rxn_aa_mapper.aa_mapper import RXNAAMapper

config_mapper = {
    "vocabulary_file": "./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    "aa_sequence_tokenizer_filepath": "./examples/token_75K_min_600_max_750_500K.json",
    "model_path": "/tmp/rxnaamapper-pretrained-model",
    "head": 3,
    "layers": [11],
    "top_k": 1,
}
mapper = RXNAAMapper(config=config_mapper)
mapper.get_reactant_aa_sequence_attention_guided_maps(["NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]"])
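
The mapper config refers to three locations on disk, one of which (`model_path`) only exists after the checkpoint conversion step. A small pre-flight check, sketched below, can report missing paths before the mapper is constructed. The helper `missing_paths` and the `PATH_KEYS` tuple are hypothetical and not part of rxn_aa_mapper; only the config keys are taken from the example above.

```python
from pathlib import Path
from typing import Dict, List

# Config keys from the example that point to files or folders on disk.
PATH_KEYS = ("vocabulary_file", "aa_sequence_tokenizer_filepath", "model_path")


def missing_paths(config: Dict[str, object]) -> List[str]:
    """Return 'key: value' entries for path keys that are unset or do not exist."""
    problems = []
    for key in PATH_KEYS:
        value = config.get(key)
        if value is None or not Path(str(value)).exists():
            problems.append(f"{key}: {value}")
    return problems


config_mapper = {
    "vocabulary_file": "./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    "aa_sequence_tokenizer_filepath": "./examples/token_75K_min_600_max_750_500K.json",
    "model_path": "/tmp/rxnaamapper-pretrained-model",
    "head": 3,
    "layers": [11],
    "top_k": 1,
}
for problem in missing_paths(config_mapper):
    print("missing path ->", problem)
```

If nothing is printed, all three locations exist and the config can be passed to `RXNAAMapper`.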

citation

@article{dassi2021identification,
  title={Identification of Enzymatic Active Sites with Unsupervised Language Modeling},
  author={Dassi, Lo{\"\i}c Kwate and Manica, Matteo and Probst, Daniel and Schwaller, Philippe and Teukam, Yves Gaetan Nana and Laino, Teodoro},
  year={2021},
  conference={AI for Science: Mind the Gaps at NeurIPS 2021, ELLIS Machine Learning for Molecule Discovery Workshop 2021}
}
