Sapiens is a human antibody language model based on BERT.

Last update: Nov 20, 2022

Overview

Sapiens: Human antibody language model

    ____              _                
   / ___|  __ _ _ __ (_) ___ _ __  ___ 
   \___ \ / _` | '_ \| |/ _ \ '_ \/ __|
    ___| | |_| | |_| | |  __/ | | \__ \
   |____/ \__,_|  __/|_|\___|_| |_|___/
               |_|


Learn more about Sapiens, OASis and BioPhi in our publication:

David Prihoda, Jad Maamary, Andrew Waight, Veronica Juan, Laurence Fayadat-Dilman, Daniel Svozil & Danny A. Bitton (2022) BioPhi: A platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning, mAbs, 14:1, DOI: https://doi.org/10.1080/19420862.2021.2020203

For more information about BioPhi, see the BioPhi repository.

Features

Infilling missing residues in human antibody sequences
Suggesting mutations (in frameworks as well as CDRs)
Creating vector representations (embeddings) of residues or sequences

Usage

Install Sapiens using pip:

# Recommended: Create dedicated conda environment
conda create -n sapiens python=3.8
conda activate sapiens
# Install Sapiens
pip install sapiens

❗️ Python 3.7 or 3.8 is currently required due to a fairseq bug in Python 3.9 and above: pytorch/fairseq#3535

Antibody sequence infilling

Positions marked with * or X are infilled with the most likely human residues, given the rest of the sequence:

import sapiens

best = sapiens.predict_masked(
    '**QLV*SGVEVKKPGASVKVSCKASGYTFTNYYMYWVRQAPGQGLEWMGGINPSNGGTNFNEKFKNRVTLTTDSSTTTAYMELKSLQFDDTAVYYCARRDYRFDMGFDYWGQGTTVTVSS',
    'H'
)
print(best)
# QVQLVQSGVEVKKPGASVKVSCKASGYTFTNYYMYWVRQAPGQGLEWMGGINPSNGGTNFNEKFKNRVTLTTDSSTTTAYMELKSLQFDDTAVYYCARRDYRFDMGFDYWGQGTTVTVSS
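A masked input like the one above can also be built programmatically. The following is a minimal sketch, not part of Sapiens: `mask_positions` is a hypothetical helper, it assumes 0-based positions, and the sequence is shortened for illustration.

```python
def mask_positions(sequence, positions, mask_char="*"):
    """Return a copy of `sequence` with the given 0-based positions masked.

    Hypothetical helper for illustration; it is not part of the Sapiens API.
    """
    residues = list(sequence)
    for pos in positions:
        if not 0 <= pos < len(residues):
            raise IndexError(f"position {pos} is outside the sequence")
        residues[pos] = mask_char
    return "".join(residues)


# Shortened heavy-chain prefix, used here only to keep the example small
heavy = "QVQLVQSGVEVKKPGASVKVSCKAS"
masked = mask_positions(heavy, [0, 1, 5])
print(masked)
# **QLV*SGVEVKKPGASVKVSCKAS
```

The resulting string can then be passed to sapiens.predict_masked together with the chain type, as in the example above.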

Suggesting mutations

Return residue scores for a given sequence:

import sapiens

scores = sapiens.predict_scores(
    '**QLV*SGVEVKKPGASVKVSCKASGYTFTNYYMYWVRQAPGQGLEWMGGINPSNGGTNFNEKFKNRVTLTTDSSTTTAYMELKSLQFDDTAVYYCARRDYRFDMGFDYWGQGTTVTVSS',
    'H'
)
scores.head()
#           A         C         D         E  ...
# 0  0.003272  0.004147  0.004011  0.004590  ... <- based on masked input
# 1  0.012038  0.003854  0.006803  0.008174  ... <- based on masked input
# 2  0.003384  0.003895  0.003726  0.004068  ... <- based on Q input
# 3  0.004612  0.005325  0.004443  0.004641  ... <- based on L input
# 4  0.005519  0.003664  0.003555  0.005269  ... <- based on V input
#
# Scores are given both for residues that are masked and that are present. 
# When inputting a non-human antibody sequence, the output scores can be used for humanization.
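One way to turn these scores into candidate mutations is to compare each input residue with the top-scoring residue at that position. The sketch below is an illustration under assumptions, not the method used by Sapiens or BioPhi: `suggest_mutations` is a hypothetical helper, and it works on plain per-position dictionaries. Assuming the scores are returned as a pandas DataFrame (as the scores.head() output suggests), such rows could be obtained with scores.to_dict("records").

```python
def suggest_mutations(sequence, score_rows, mask_chars="*X"):
    """List positions where the top-scoring residue differs from the input.

    Hypothetical helper for illustration; it is not part of the Sapiens API.
    `score_rows` holds one {residue: score} dict per position of `sequence`.
    Returns (position, input residue, top residue, top score) tuples.
    """
    if len(sequence) != len(score_rows):
        raise ValueError("need exactly one score row per sequence position")
    suggestions = []
    for pos, (residue, row) in enumerate(zip(sequence, score_rows)):
        best = max(row, key=row.get)  # residue with the highest score
        if residue in mask_chars or residue != best:
            suggestions.append((pos, residue, best, row[best]))
    return suggestions


# Toy scores for a 3-residue sequence (made-up numbers, 3-letter alphabet)
toy_rows = [
    {"A": 0.1, "Q": 0.8, "V": 0.1},
    {"A": 0.2, "Q": 0.1, "V": 0.7},
    {"A": 0.6, "Q": 0.3, "V": 0.1},
]
print(suggest_mutations("QVQ", toy_rows))
# [(2, 'Q', 'A', 0.6)]
```

Taking the single best residue is the simplest choice; a score threshold or a restriction to framework positions could be applied on top of the returned list.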

Antibody sequence embedding

Get a vector representation of each position in a sequence:

import sapiens

residue_embed = sapiens.predict_residue_embedding(
    'QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS', 
    'H', 
    layer=None
)
residue_embed.shape
# (layer, position in sequence, features)
# (5, 119, 128)

Get a single vector for each sequence:

import sapiens

seq_embed = sapiens.predict_sequence_embedding(
    'QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS', 
    'H', 
    layer=None
)
seq_embed.shape
# (layer, features)
# (5, 128)
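Sequence embeddings are often compared with cosine similarity. The sketch below is a plain-Python illustration, not part of Sapiens: `cosine_similarity` is a hypothetical helper and the vectors are toy values rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length 1-D vectors.

    Hypothetical helper for illustration; it is not part of the Sapiens API.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two 128-dimensional sequence embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), orthogonal)
# 1.0 0.0
```

With real embeddings, one layer of each result would be compared, for example seq_embed[-1] for the last layer, assuming the layer axis comes first as in the shape shown above.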

Notebooks

Try out Sapiens in your browser using these example notebooks:

Notebook	Description
01_sapiens_antibody_infilling	Predict missing positions in an antibody sequence
02_sapiens_antibody_embedding	Get vector representations and visualize them using t-SNE

Acknowledgements

Sapiens is based on antibody repertoires from the Observed Antibody Space:

Kovaltsuk, A., Leem, J., Kelm, S., Snowden, J., Deane, C. M., & Krawczyk, K. (2018). Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. The Journal of Immunology, 201(8), 2502–2509. https://doi.org/10.4049/jimmunol.1800708

Owner

Merck Sharp & Dohme Corp. a subsidiary of Merck & Co., Inc.
