ProtFeat is a protein feature extraction tool that utilizes POSSUM and iFeature.

Last update: Dec 16, 2022

Overview

Description:

ProtFeat extracts protein features using the Python-based tools POSSUM and iFeature. It provides a total of 39 distinct protein feature extraction methods (protein descriptors): 21 PSSM-based protein descriptors from POSSUM and 18 protein descriptors from iFeature.

POSSUM (Position-Specific Scoring matrix-based feature generator for machine learning) is a versatile toolkit with an online web server that can generate 21 types of PSSM-based feature descriptors, thereby addressing a crucial need for bioinformaticians and computational biologists.

iFeature is a versatile Python-based toolkit for generating various numerical feature representation schemes for both protein and peptide sequences. iFeature can calculate and extract a comprehensive spectrum of 18 major sequence encoding schemes that encompass 53 different types of feature descriptors.

Installation

ProtFeat is a Python package for extracting features from protein sequences. It is written in Python 3.9 and was developed and tested on Ubuntu 20.04 LTS. Please make sure that you have Anaconda installed on your computer, then run the commands below to create and activate a conda environment. The dependencies are listed in the requirements.txt file.

conda create -n protFeat_env python=3.9
conda activate protFeat_env

How to run ProtFeat to extract the protein features

Run the following commands in the given order:

To use ProtFeat as a Python package:

pip install protFeat

Then you can use protFeat in Python as follows:

import protFeat
from protFeat.feature_extracter import extract_protein_feature, usage
usage()
extract_protein_feature(protein_feature, place_protein_id, input_folder, fasta_file_name)

For example,

extract_protein_feature("AAC", 1, "input_folder", "sample")

To use ProtFeat from terminal:

Clone the Git Repository.

git clone https://github.com/gozsari/ProtFeat

In a terminal or command line, navigate into the ProtFeat folder.

cd ProtFeat

Install the requirements by running the following command.

pip install -r requirements.txt

Then run ProtFeat from the terminal as follows:

cd src
python protFeat_command_line.py --pf protein_feature --ppid place_protein_id --inpf input_folder --fname fasta_file_name

For example,

python protFeat_command_line.py --pf AAC --ppid 1 --inpf input_folder --fname sample
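
When several descriptors are needed, the command above can be assembled programmatically. The following is a minimal sketch, not part of ProtFeat: build_command is a hypothetical helper, and it uses only the script name and the four flags shown above, with the default values listed under Explanation of Parameters.

```python
import sys
from typing import List


def build_command(
    protein_feature: str,
    place_protein_id: int = 1,
    input_folder: str = "input_folder",
    fasta_file_name: str = "sample",
) -> List[str]:
    """Build the argument list for protFeat_command_line.py (hypothetical helper)."""
    return [
        sys.executable,
        "protFeat_command_line.py",
        "--pf", protein_feature,
        "--ppid", str(place_protein_id),
        "--inpf", input_folder,
        "--fname", fasta_file_name,
    ]


# Print one command line per descriptor instead of executing it.
for descriptor in ["AAC", "DPC"]:
    print(" ".join(build_command(descriptor)[1:]))
```

Each list could then be passed to subprocess.run(cmd, cwd="src", check=True) from the repository root, assuming the script is run from the src folder as described above.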

Explanation of Parameters

protein_feature: {string}, (default = 'aac_pssm'): one of the protein descriptors in POSSUM and iFeature.

POSSUM descriptors:

aac_pssm, d_fpssm, smoothed_pssm, ab_pssm, pssm_composition, rpm_pssm,
s_fpssm, dpc_pssm, k_separated_bigrams_pssm, eedp, tpc, edp, rpssm,
pse_pssm, dp_pssm, pssm_ac, pssm_cc, aadp_pssm, aatp, medp, or all_POSSUM

Note: all_POSSUM extracts the features of all (21) POSSUM protein descriptors.

iFeature descriptors:

AAC, PAAC, APAAC, DPC, GAAC, CKSAAP, CKSAAGP, GDPC, Moran, Geary,
NMBroto, CTDC, CTDD, CTDT, CTriad, KSCTriad, SOCNumber, QSOrder, or all_iFeature

Note: all_iFeature extracts the features of all (18) iFeature protein descriptors.

place_protein_id: {int}, (default = 1): the position of the protein id in the fasta header. For example, given the fasta header >sp|O27002|....|....|...., splitting the header on '|' puts >sp in the zeroth position and the protein id in the first (1) position.

input_folder: {string}, (default = 'input_folder'): the path to the folder that contains the fasta file.

fasta_file_name: {string}, (default = 'sample'): the name of the fasta file, excluding the '.fasta' extension.
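
The meaning of place_protein_id can be illustrated with a short sketch. This is not ProtFeat's own code: protein_id_from_header is a hypothetical helper that only mirrors the splitting rule described above, and the tail of the example header (DEMO_ENTRY) is made up.

```python
def protein_id_from_header(header: str, place_protein_id: int = 1) -> str:
    """Split a fasta header on '|' and return the field at the given position."""
    fields = header.strip().split("|")
    if not 0 <= place_protein_id < len(fields):
        raise ValueError(
            f"header has {len(fields)} fields, no position {place_protein_id}"
        )
    return fields[place_protein_id]


header = ">sp|O27002|DEMO_ENTRY"  # the last field is a made-up placeholder
print(protein_id_from_header(header, 1))  # O27002
print(protein_id_from_header(header, 0))  # >sp
```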

Input file

The input file must be in fasta format.
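
To check an input file before running ProtFeat, a small reader such as the one below may help. It is a minimal sketch that assumes the common fasta layout (a header line starting with '>' followed by one or more sequence lines); read_fasta is a hypothetical helper, not a ProtFeat function, and the ids and sequences in the example are made up.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its joined sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header] += line  # multi-line sequences are concatenated
    return records


# Made-up records, used only to show the expected layout.
sample_text = """>sp|ID0001|DEMO_ONE
MKTAYIAKQR
QISFVKSHFS
>sp|ID0002|DEMO_TWO
GSHMLEDPVA
"""
records = read_fasta(sample_text.splitlines())
print(len(records), "sequences read")
```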

Output file

The extracted feature files are written to the feature_extraction_output folder and are named fasta_file_name_protein_feature.txt (e.g. sample_AAC.txt).

The content of the output files:

The output file is tab-separated.
Each row corresponds to the extracted features of one protein sequence.
The first column of each row is the UniProtKB id of the protein; the remaining columns are the extracted features of that protein sequence.
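
The output layout described above can be read back with the standard library. The sketch below makes two assumptions that are not stated in this README: the file has no header row, and every feature value is numeric. output_path and read_feature_table are hypothetical helpers, and the ids and values in the example are made up.

```python
import csv
import io
import os
from typing import Dict, List, TextIO


def output_path(fasta_file_name: str, protein_feature: str) -> str:
    """Build the output path from the naming rule given above."""
    return os.path.join(
        "feature_extraction_output", f"{fasta_file_name}_{protein_feature}.txt"
    )


def read_feature_table(handle: TextIO) -> Dict[str, List[float]]:
    """Map each protein id (first column) to its list of feature values."""
    features: Dict[str, List[float]] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        features[row[0]] = [float(value) for value in row[1:]]
    return features


# Made-up content standing in for a real output file.
demo = io.StringIO("ID0001\t0.5\t0.25\nID0002\t0.0\t1.0\n")
table = read_feature_table(demo)
print(output_path("sample", "AAC"), "->", len(table), "proteins")
```

For a real run, the StringIO object would be replaced by open(output_path("sample", "AAC"), newline="").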

Tables of the available protein descriptors

Table 1: Protein descriptors obtained from the POSSUM tool.

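Descriptor group	Protein descriptor	Number of dimensions
Row Transformations	AAC-PSSM	20
Row Transformations	D-FPSSM	20
Row Transformations	smoothed-PSSM	1000
Row Transformations	AB-PSSM	400
Row Transformations	PSSM-composition	400
Row Transformations	RPM-PSSM	400
Row Transformations	S-FPSSM	400
Column Transformation	DPC-PSSM	400
Column Transformation	k-separated-bigrams-PSSM	400
Column Transformation	tri-gram-PSSM	8000
Column Transformation	EEDP	4000
Column Transformation	TPC	400
Mixture of row and column transformation	EDP	20
Mixture of row and column transformation	RPSSM	110
Mixture of row and column transformation	Pse-PSSM	40
Mixture of row and column transformation	DP-PSSM	240
Mixture of row and column transformation	PSSM-AC	200
Mixture of row and column transformation	PSSM-CC	3800
Combination of above descriptors	AADP-PSSM	420
Combination of above descriptors	AATP	420
Combination of above descriptors	MEDP	420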

Table 2: Protein descriptors obtained from the iFeature tool.

Descriptor group	Protein descriptor	Number of dimensions
Amino acid composition	Amino acid composition (AAC)	20
Amino acid composition	Composition of k-spaced amino acid pairs (CKSAAP)	2400
Amino acid composition	Dipeptide composition (DPC)	400
Grouped amino acid composition	Grouped amino acid composition (GAAC)	5
Grouped amino acid composition	Composition of k-spaced amino acid group pairs (CKSAAGP)	150
Grouped amino acid composition	Grouped dipeptide composition (GDPC)	25
Autocorrelation	Moran (Moran)	240
Autocorrelation	Geary (Geary)	240
Autocorrelation	Normalized Moreau-Broto (NMBroto)	240
C/T/D	Composition (CTDC)	39
C/T/D	Transition (CTDT)	39
C/T/D	Distribution (CTDD)	195
Conjoint triad	Conjoint triad (CTriad)	343
Conjoint triad	Conjoint k-spaced triad (KSCTriad)	343*(k+1)
Quasi-sequence-order	Sequence-order-coupling number (SOCNumber)	60
Quasi-sequence-order	Quasi-sequence-order descriptors (QSOrder)	100
Pseudo-amino acid composition	Pseudo-amino acid composition (PAAC)	50
Pseudo-amino acid composition	Amphiphilic PAAC (APAAC)	80
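
The only dimension in Table 2 that depends on a parameter is KSCTriad, given as 343*(k+1). The sketch below simply evaluates that formula for a few values of k; ksctriad_dimensions is a hypothetical helper, and the value of k that ProtFeat actually uses is not stated in this README.

```python
def ksctriad_dimensions(k: int) -> int:
    """Evaluate the KSCTriad dimension formula from Table 2: 343*(k+1)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return 343 * (k + 1)


# With k = 0 the formula gives the same size as CTriad (343).
for k in range(3):
    print(k, ksctriad_dimensions(k))
```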

License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Releases(protein-feature-extraction)

protein-feature-extraction(Apr 12, 2022)


Owner

GOKHAN OZSARI
