Reaction SMILES-AA mapping via language modelling

Overview

rxn-aa-mapper

Reactions SMILES-AA sequence mapping

setup

conda env create -f conda.yml
conda activate rxn_aa_mapper

In the following we consider on examples provided to show how to use RXNAAMapper.

generate a vocabulary to be used with the EnzymaticReactionBertTokenizer

Create a vocabulary compatible with the enzymatic reaction tokenizer:

create-enzymatic-reaction-vocabulary ./examples/data-samples/biochemical ./examples/token_75K_min_600_max_750_500K.json /tmp/vocabulary.txt "*.csv"

use the tokenizer

Using the examples vocabulary and AA tokenizer provided, we can observe the enzymatic reaction tokenizer in action:

from rxn_aa_mapper.tokenization import EnzymaticReactionBertTokenizer

tokenizer = EnzymaticReactionBertTokenizer(
    vocabulary_file="./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    aa_sequence_tokenizer_filepath="./examples/token_75K_min_600_max_750_500K.json"
)
tokenizer.tokenize("NC(=O)c1ccc[n+]([C@@H]2O[[email protected]](COP(=O)(O)OP(=O)(O)OC[[email protected]]3O[C@@H](n4cnc5c(N)ncnc54)[[email protected]](O)[C@@H]3O)[C@@H](O)[[email protected]]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]")

train the model

The mlm-trainer script can be used to train a model via MTL:

mlm-trainer \
    ./examples/data-samples/biochemical ./examples/data-samples/biochemical \  # just a sample, simply split data in a train and a validation folder
    ./examples/vocabulary_token_75K_min_600_max_750_500K.txt /tmp/mlm-trainer-log \
    ./examples/sample-config.json "*.csv" 1 \  # for a more realistic config see ./examples/config.json
    ./examples/data-samples/organic ./examples/data-samples/organic \  # just a sample, simply split data in a train and a validation folder
    ./examples/token_75K_min_600_max_750_500K.json

Checkpoints will be stored in the /tmp/mlm-trainer-log for later usage in identification of active sites.

Those can be turned into an HuggingFace model by simply running:

checkpoint-to-hf-model /path/to/model.ckpt /tmp/rxnaamapper-pretrained-model ./examples/vocabulary_token_75K_min_600_max_750_500K.txt ./examples/sample-config.json ./examples/token_75K_min_600_max_750_500K.json

predict active site

The trained model can used to map reactant atoms to AA sequence locations that potentially represent the active site.

from rxn_aa_mapper.aa_mapper import RXNAAMapper

config_mapper = {
    "vocabulary_file": "./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    "aa_sequence_tokenizer_filepath": "./examples/token_75K_min_600_max_750_500K.json",
    "model_path": "/tmp/rxnaamapper-pretrained-model",
    "head": 3,
    "layers": [11],
    "top_k": 1,
}
mapper = RXNAAMapper(config=config_mapper)
mapper.get_reactant_aa_sequence_attention_guided_maps(["NC(=O)c1ccc[n+]([C@@H]2O[[email protected]](COP(=O)(O)OP(=O)(O)OC[[email protected]]3O[C@@H](n4cnc5c(N)ncnc54)[[email protected]](O)[C@@H]3O)[C@@H](O)[[email protected]]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]"])

citation

@article{dassi2021identification,
  title={Identification of Enzymatic Active Sites with Unsupervised Language Modeling},
  author={Dassi, Lo{\"\i}c Kwate and Manica, Matteo and Probst, Daniel and Schwaller, Philippe and Teukam, Yves Gaetan Nana and Laino, Teodoro},
  year={2021}
  conference={AI for Science: Mind the Gaps at NeurIPS 2021, ELLIS Machine Learning for Molecule Discovery Workshop 2021}
}
StackNet is a computational, scalable and analytical Meta modelling framework

StackNet This repository contains StackNet Meta modelling methodology (and software) which is part of my work as a PhD Student in the computer science

Marios Michailidis 1.3k Dec 15, 2022
Conditional Generative Adversarial Networks (CGAN) for Mobility Data Fusion

This code implements the paper, Kim et al. (2021). Imputing Qualitative Attributes for Trip Chains Extracted from Smart Card Data Using a Conditional Generative Adversarial Network. Transportation Re

Eui-Jin Kim 2 Feb 03, 2022
Neural network chess engine trained on Gary Kasparov's games.

Neural Chess It's not the best chess engine, but it is a chess engine. Proof of concept neural network chess engine (feed-forward multi-layer perceptr

3 Jun 22, 2022
Tool cek opsi checkpoint facebook!

tool apa ini? cek_opsi_facebook adalah sebuah tool yang mengecek opsi checkpoint akun facebook yang terkena checkpoint! tujuan dibuatnya tool ini? too

Muhammad Latif Harkat 2 Jul 17, 2022
Everything's Talkin': Pareidolia Face Reenactment (CVPR2021)

Everything's Talkin': Pareidolia Face Reenactment (CVPR2021) Linsen Song, Wayne Wu, Chaoyou Fu, Chen Qian, Chen Change Loy, and Ran He [Paper], [Video

71 Dec 21, 2022
CoReD: Generalizing Fake Media Detection with Continual Representation using Distillation (ACMMM'21 Oral Paper)

CoReD: Generalizing Fake Media Detection with Continual Representation using Distillation (ACMMM'21 Oral Paper) (Accepted for oral presentation at ACM

Minha Kim 1 Nov 12, 2021
Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data

Real-ESRGAN Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data Ported from https://github.com/xinntao/Real-ESRGAN Depend

Holy Wu 44 Dec 27, 2022
Code for our NeurIPS 2021 paper Mining the Benefits of Two-stage and One-stage HOI Detection

CDN Code for our NeurIPS 2021 paper "Mining the Benefits of Two-stage and One-stage HOI Detection". Contributed by Aixi Zhang*, Yue Liao*, Si Liu, Mia

71 Dec 14, 2022
TiP-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

TiP-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling This is the official code release for the paper 'TiP-Adapter: Training-fre

peng gao 189 Jan 04, 2023
Geometric Algebra package for JAX

JAXGA - JAX Geometric Algebra GitHub | Docs JAXGA is a Geometric Algebra package on top of JAX. It can handle high dimensional algebras by storing onl

Robin Kahlow 36 Dec 22, 2022
Iowa Project - My second project done at General Assembly, focused on feature engineering and understanding Linear Regression as a concept

Project 2 - Ames Housing Data and Kaggle Challenge PROBLEM STATEMENT Inferring or Predicting? What's more valuable for a housing model? When creating

Adam Muhammad Klesc 1 Jan 03, 2022
HeatNet is a python package that provides tools to build, train and evaluate neural networks designed to predict extreme heat wave events globally on daily to subseasonal timescales.

HeatNet HeatNet is a python package that provides tools to build, train and evaluate neural networks designed to predict extreme heat wave events glob

Google Research 6 Jul 07, 2022
Notebooks, slides and dataset of the CorrelAid Machine Learning Winter School

CorrelAid Machine Learning Winter School Welcome to the CorrelAid ML Winter School! Task The problem we want to solve is to classify trees in Roosevel

CorrelAid 12 Nov 23, 2022
X-modaler is a versatile and high-performance codebase for cross-modal analytics.

X-modaler X-modaler is a versatile and high-performance codebase for cross-modal analytics. This codebase unifies comprehensive high-quality modules i

910 Dec 28, 2022
Neural Radiance Fields Using PyTorch

This project is a PyTorch implementation of Neural Radiance Fields (NeRF) for reproduction of results whilst running at a faster speed.

Vedant Ghodke 1 Feb 11, 2022
Bounding Wasserstein distance with couplings

BoundWasserstein These scripts reproduce the results of the article Bounding Wasserstein distance with couplings by Niloy Biswas and Lester Mackey. ar

Niloy Biswas 1 Jan 11, 2022
To build a regression model to predict the concrete compressive strength based on the different features in the training data.

Cement-Strength-Prediction Problem Statement To build a regression model to predict the concrete compressive strength based on the different features

Ashish Kumar 4 Jun 11, 2022
Supervised forecasting of sequential data in Python.

Supervised forecasting of sequential data in Python. Intro Supervised forecasting is the machine learning task of making predictions for sequential da

The Alan Turing Institute 54 Nov 15, 2022
This folder contains the implementation of the multi-relational attribute propagation algorithm.

MrAP This folder contains the implementation of the multi-relational attribute propagation algorithm. It requires the package pytorch-scatter. Please

6 Dec 06, 2022
Implementation of Wasserstein adversarial attacks.

Stronger and Faster Wasserstein Adversarial Attacks Code for Stronger and Faster Wasserstein Adversarial Attacks, appeared in ICML 2020. This reposito

21 Oct 06, 2022