Overview

GNN_PPI

Codes and models for the paper "Learning Unknown from Correlations: Graph Neural Network for Inter-novel-protein Interaction Prediction".

Learning Unknown from Correlations: Graph Neural Network for Inter-novel-protein Interaction Prediction
Authors: Guofeng Lv, Zhiqiang Hu, Yanguang Bi, Shaoting Zhang
arXiv extended version: https://arxiv.org/abs/2105.06709

Contact: [email protected]. Any questions or discussions are welcome!

Abstract

The study of multi-type Protein-Protein Interaction (PPI) is fundamental for understanding biological processes from a systematic perspective and revealing disease mechanisms. Existing methods suffer from significant performance degradation when tested on unseen datasets. In this paper, we investigate the problem and find that it is mainly attributed to the poor performance for inter-novel-protein interaction prediction. However, current evaluations overlook the inter-novel-protein interactions, and thus fail to give an instructive assessment. As a result, we propose to address the problem from both the evaluation and the methodology. Firstly, we design a new evaluation framework that fully respects the inter-novel-protein interactions and gives consistent assessment across datasets. Secondly, we argue that correlations between proteins must provide useful information for analysis of novel proteins, and based on this, we propose a graph neural network based method (GNN-PPI) for better inter-novel-protein interaction prediction. Experimental results on real-world datasets of different scales demonstrate that GNN-PPI significantly outperforms state-of-the-art PPI prediction methods, especially for the inter-novel-protein interaction prediction.

Contribution

  1. We design a new evaluation framework that fully respects the inter-novel-protein interactions and gives consistent assessment across datasets.

    An example of the testset construction strategies under the new evaluation framework. Random is the current scheme, while Breadth-First Search (BFS) and Depth-First Search (DFS) are the proposed schemes (see the sketch after this list).
  2. We propose to incorporate correlation between proteins into the PPI prediction problem. A graph neural network based method is presented to model the correlations.

    Development and evaluation of the GNN-PPI framework. Pairwise interaction data are first assembled into a graph, where proteins serve as the nodes and interactions as the edges. The testset is constructed by first selecting a root node and then performing the proposed BFS or DFS strategy. The model first embeds each protein to obtain predefined features, which are processed by Convolution, Pooling, BiGRU and FC modules to extract protein-independent encoding (PIE) features; these are then aggregated by graph convolutions to arrive at protein-graph encoding (PGE) features. The features of the two proteins in an interaction are multiplied and classified, supervised by the trainset labels.
  3. The proposed GNN-PPI model achieves state-of-the-art performance on real-world datasets of different scales, especially for the inter-novel-protein interaction prediction.

    For further investigation, we divide the testset into BS, ES and NS subsets, where BS denotes that Both of the pair proteins in an interaction were Seen during training, ES denotes that Either (but not both) of the pair proteins was Seen, and NS denotes that Neither of the pair proteins was Seen during training. We regard ES and NS as inter-novel-protein interactions. Existing methods suffer from significant performance degradation when tested on unseen protein-protein interactions, especially inter-novel-protein interactions. In contrast, GNN-PPI handles all three cases well: whether BS, ES or NS, its performance does not degrade greatly. A minimal sketch of the BFS testset construction and the BS/ES/NS split follows this list.
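
The sketch below illustrates the BFS-based testset construction and the BS/ES/NS split described above. Function and variable names are illustrative only, not the exact implementation in gnn_data.py.

import random
from collections import defaultdict, deque

def bfs_testset(edges, test_size):
    """Select test interactions by breadth-first search from a random root protein."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    root = random.choice(list(adj))
    visited, queue = {root}, deque([root])
    test_edges = set()
    while queue and len(test_edges) < test_size:
        node = queue.popleft()
        for nb in adj[node]:
            # edges around the BFS frontier go to the testset
            # (may slightly overshoot test_size; trim in practice)
            test_edges.add(tuple(sorted((node, nb))))
            if nb not in visited:
                visited.add(nb)
                queue.append(nb)
    return test_edges

def split_bs_es_ns(test_edges, train_proteins):
    """Group test interactions by how many of their proteins appear in the trainset."""
    subsets = {"BS": [], "ES": [], "NS": []}
    for a, b in test_edges:
        seen = (a in train_proteins) + (b in train_proteins)
        subsets["BS" if seen == 2 else "ES" if seen == 1 else "NS"].append((a, b))
    return subsets

Because BFS keeps the selected proteins within one neighborhood, most test pairs fall into the ES and NS subsets, which is exactly the inter-novel-protein regime the new evaluation is meant to stress.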

Experimental Results

We evaluate the multi-label PPI prediction performance using micro-F1. We use micro-averaging because it gives each sample the same weight and therefore emphasizes the common labels in the dataset.
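
For reference, the metric can be computed with scikit-learn as below; the label matrices are placeholders, with one row per protein pair and one column per PPI type.

import numpy as np
from sklearn.metrics import f1_score

# Multi-label targets/predictions: one row per protein pair, one column per PPI type.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# micro-F1 pools true/false positives and negatives over all labels before computing F1.
print(f1_score(y_true, y_pred, average="micro"))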

Benchmark

  • Performance of GNN-PPI against comparative methods over different datasets and data partition schemes.

In-depth Analysis

  • In-depth analysis between PIPR and GNN-PPI over BS, ES and NS subsets.

Model Generalization

  • Testing on trainset-homologous testset vs. unseen testset, under different evaluations.

PPI Network Graph Construction

  • The impact of the PPI network graph construction method.

Using GNN_PPI

This repository contains:

  • Environment Setup
  • Data Processing
  • Training
  • Testing
  • Inference

Environment Setup

base environment: python 3.7, cuda 10.2, pytorch 1.6, torchvision 0.7.0, tensorboardX 2.1
pytorch-geometric:
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.6.0+cu102.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.6.0+cu102.html
pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.6.0+cu102.html
pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.6.0+cu102.html
pip install torch-geometric
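
A quick post-install sanity check (illustrative) verifies that the CUDA build of PyTorch and PyTorch Geometric import correctly:

import torch
import torch_geometric

print(torch.__version__)           # expected: 1.6.0
print(torch.cuda.is_available())   # expected: True with CUDA 10.2
print(torch_geometric.__version__)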

Data Processing

The data processing code is in gnn_data.py (class GNN_DATA), including:

  • data reading (def __init__)
  • protein vectorization (def get_feature_origin)
  • PyG data generation (def generate_data); a minimal sketch of the resulting object follows this list
  • data partition (def split_dataset)
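
In essence, generate_data assembles a torch_geometric Data object whose nodes are proteins and whose edges carry multi-hot interaction-type labels. The sketch below is illustrative only; the shapes and the edge_attr layout are assumptions, not the exact objects produced by gnn_data.py.

import torch
from torch_geometric.data import Data

num_proteins, num_ppi_types, feat_dim = 4, 7, 13

# Node features: one pretrained embedding per protein (placeholder random values here).
x = torch.randn(num_proteins, feat_dim)

# Each undirected interaction is stored as two directed edges.
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]], dtype=torch.long)

# Multi-hot PPI-type label for every directed edge.
edge_attr = torch.randint(0, 2, (edge_index.size(1), num_ppi_types)).float()

data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr)
print(data)  # graph with 4 protein nodes and 6 directed edges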

Training

The training code is in gnn_train.py, and the run script is run.py.

"python -u gnn_train.py \
    --description={} \              # Description of the current training task
    --ppi_path={} \                 # ppi dataset
    --pseq_path={} \                # protein sequence
    --vec_path={} \                 # protein pretrained-embedding
    --split_new={} \                # whether to generate a new data partition, or use the previous
    --split_mode={} \               # data split mode
    --train_valid_index_path={} \   # Data partition json file path
    --use_lr_scheduler={} \         # whether to use training learning rate scheduler
    --save_path={} \                # save model, config and results dir path
    --graph_only_train={} \         # PPI network graph construction method, True: GCT, False: GCA
    --batch_size={} \               # Batch size
    --epochs={} \                   # Train epochs
    ".format(description, ppi_path, pseq_path, vec_path, 
            split_new, split_mode, train_valid_index_path,
            use_lr_scheduler, save_path, graph_only_train, 
            batch_size, epochs)
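
run.py essentially fills this template and launches it; an illustrative stand-alone version is shown below. All paths and flag values are placeholders, so point them at your own data and preferred settings.

import os

cmd = " ".join([
    "python -u gnn_train.py",
    "--description=string_bfs_demo",              # free-text name for this run
    "--ppi_path=./data/ppi_actions.txt",           # placeholder: PPI dataset file
    "--pseq_path=./data/protein_sequences.tsv",    # placeholder: protein sequence file
    "--vec_path=./data/protein_embeddings.txt",    # placeholder: pretrained embedding file
    "--split_new=True",                            # generate a fresh data partition
    "--split_mode=bfs",                            # random / bfs / dfs
    "--train_valid_index_path=./data/split_bfs.json",
    "--use_lr_scheduler=True",
    "--save_path=./save_model/",
    "--graph_only_train=False",                    # False: GCA, True: GCT
    "--batch_size=1024",
    "--epochs=300",
])
os.system(cmd)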

Dataset Download:

STRING (we use the Homo sapiens subset):

SHS27k and SHS148k:

The processed datasets used by this repository can be downloaded from:

Testing

The testing code is in gnn_test.py and gnn_test_bigger.py, with run scripts run_test.py and run_test_bigger.py.

gnn_test.py: tests the overall performance, and can also perform the in-depth analysis that evaluates different test subsets separately.
gnn_test_bigger.py: compares performance on the trainset-homologous testset versus the unseen testset.
The run scripts run_test.py and run_test_bigger.py are invoked in the same way as the training script above.

Inference

If you have your own dataset or want to save the prediction results, you can use inference.py. After execution, a PPI CSV file is generated that records the predicted PPI type(s) of each pair of interacting proteins.

The run script run_inference.py is invoked in the same way as above.
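
The generated file can then be inspected like any other CSV, e.g. with pandas; the file name below is a placeholder and the column layout is whatever inference.py writes.

import pandas as pd

df = pd.read_csv("./ppi_prediction_results.csv")  # placeholder path to the generated file
print(df.head())  # one row per interacting protein pair with its predicted PPI type(s)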

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@misc{lv2021learning,
    title={Learning Unknown from Correlations: Graph Neural Network for Inter-novel-protein Interaction Prediction}, 
    author={Guofeng Lv and Zhiqiang Hu and Yanguang Bi and Shaoting Zhang},
    year={2021},
    eprint={2105.06709},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}