ProtFeat is a protein feature extraction tool built on POSSUM and iFeature.

Overview

Description:

ProtFeat extracts protein features using two Python-based tools, POSSUM and iFeature. It provides 39 distinct protein feature extraction methods (protein descriptors): 21 PSSM-based descriptors from POSSUM and 18 descriptors from iFeature.

POSSUM (Position-Specific Scoring matrix-based feature generator for machine learning) is a versatile toolkit with an online web server that can generate 21 types of PSSM-based feature descriptors, addressing a need of bioinformaticians and computational biologists.

iFeature is a versatile Python-based toolkit for generating various numerical feature representation schemes for both protein and peptide sequences. It can calculate and extract 18 major sequence encoding schemes that encompass 53 different types of feature descriptors.

Installation

ProtFeat is a Python package for extracting features from protein sequences, written in Python 3.9. It was developed and tested on Ubuntu 20.04 LTS. Make sure Anaconda is installed on your computer, then run the commands below to create and activate the environment. The dependencies are listed in the requirements.txt file.

conda create -n protFeat_env python=3.9
conda activate protFeat_env

How to run ProtFeat to extract the protein features

Run the following commands in the given order:

To use ProtFeat as a Python package:

pip install protFeat

Then you can use protFeat in Python as follows:

import protFeat
from protFeat.feature_extracter import extract_protein_feature, usage
usage()
extract_protein_feature(protein_feature, place_protein_id, input_folder, fasta_file_name)

For example,

extract_protein_feature("AAC", 1, "input_folder", "sample")
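Building on the example call, the sketch below loops over several descriptors and records which calls succeed. The helper name run_descriptors is hypothetical and not part of protFeat; the extractor is passed in as an argument so the loop can be tried without the package installed. It assumes only the four-argument call shown above.

```python
from typing import Callable, Dict, Iterable


def run_descriptors(
    extract: Callable[[str, int, str, str], object],
    descriptors: Iterable[str],
    place_protein_id: int = 1,
    input_folder: str = "input_folder",
    fasta_file_name: str = "sample",
) -> Dict[str, str]:
    """Call `extract` once per descriptor and record whether each call succeeded."""
    status: Dict[str, str] = {}
    for name in descriptors:
        try:
            extract(name, place_protein_id, input_folder, fasta_file_name)
            status[name] = "ok"
        except Exception as exc:  # keep going if one descriptor fails
            status[name] = f"failed: {exc}"
    return status


# Usage with the real extractor (assumes `pip install protFeat` has been run):
#   from protFeat.feature_extracter import extract_protein_feature
#   print(run_descriptors(extract_protein_feature, ["AAC", "DPC", "CTDC"]))
```

Collecting a status per descriptor is a design choice for long runs: one failing descriptor does not stop the rest.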

To use ProtFeat from terminal:

Clone the Git Repository.

git clone https://github.com/gozsari/ProtFeat

In a terminal or command line, navigate into the ProtFeat folder.

cd ProtFeat

Install the requirements by running the following command.

pip install -r requirements.txt

Then run ProtFeat from the terminal as follows:

cd src
python protFeat_command_line.py --pf protein_feature --ppid place_protein_id --inpf input_folder --fname fasta_file_name

For example,

python protFeat_command_line.py --pf AAC --ppid 1 --inpf input_folder --fname sample
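If the command-line script needs to be driven from another Python program, the argument list can be assembled as in the following sketch. The function build_command is a hypothetical helper, not part of ProtFeat; it uses only the four flags shown above, and its defaults mirror the documented parameter defaults.

```python
import sys
from typing import List


def build_command(
    protein_feature: str = "aac_pssm",
    place_protein_id: int = 1,
    input_folder: str = "input_folder",
    fasta_file_name: str = "sample",
) -> List[str]:
    """Return the argument list for protFeat_command_line.py."""
    return [
        sys.executable,  # the Python interpreter currently running
        "protFeat_command_line.py",
        "--pf", protein_feature,
        "--ppid", str(place_protein_id),
        "--inpf", input_folder,
        "--fname", fasta_file_name,
    ]


# Usage (run from a directory that contains the cloned ProtFeat repository):
#   import subprocess
#   subprocess.run(build_command("AAC", 1, "input_folder", "sample"),
#                  cwd="ProtFeat/src", check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when folder names contain spaces.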

Explanation of Parameters

protein_feature: {string}, (default = 'aac_pssm'): one of the protein descriptors from POSSUM or iFeature.

POSSUM descriptors:

aac_pssm, d_fpssm, smoothed_pssm, ab_pssm, pssm_composition, rpm_pssm,
s_fpssm, dpc_pssm, k_separated_bigrams_pssm, eedp, tpc, edp, rpssm,
pse_pssm, dp_pssm, pssm_ac, pssm_cc, aadp_pssm, aatp, medp, or all_POSSUM

Note: all_POSSUM extracts the features of all (21) POSSUM protein descriptors.

iFeature descriptors:

AAC, PAAC, APAAC, DPC, GAAC, CKSAAP, CKSAAGP, GDPC, Moran, Geary,
NMBroto, CTDC, CTDD, CTDT, CTriad, KSCTriad, SOCNumber, QSOrder, or all_iFeature

Note: all_iFeature extracts the features of all (18) iFeature protein descriptors.

place_protein_id: {int}, (default = 1): the position of the protein ID in the FASTA header. For example, for the header >sp|O27002|....|....|...., splitting on '|' puts >sp at position 0 and the protein ID at position 1.

input_folder: {string}, (default = 'input_folder'): the path to the folder that contains the FASTA file.

fasta_file_name: {string}, (default = 'sample'): the name of the FASTA file, excluding the '.fasta' extension.
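The place_protein_id rule can be illustrated with a few lines of Python. This is a minimal sketch of the splitting rule as described above, not ProtFeat's own code; the function name protein_id_from_header is hypothetical.

```python
def protein_id_from_header(header: str, place_protein_id: int = 1) -> str:
    """Split a FASTA header on '|' and return the field at the given position."""
    fields = header.strip().split("|")
    if not 0 <= place_protein_id < len(fields):
        raise ValueError(
            f"header has {len(fields)} fields, no position {place_protein_id}"
        )
    return fields[place_protein_id]


# The header from the parameter description: '>sp' is field 0, the ID is field 1.
print(protein_id_from_header(">sp|O27002|....|....|....", 1))
```

For headers where the ID is the first token and there is no '|', position 0 would return the whole header including the leading '>'.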

Input file

The input file must be in FASTA format.
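To check a file before running an extraction, a small reader such as the one below may help. It is a sketch of a generic FASTA parser, not the parser ProtFeat uses; read_fasta is a hypothetical name, and the only format rule it enforces is that sequence lines must follow a '>' header line.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Usage (path built from the documented defaults):
#   with open("input_folder/sample.fasta") as handle:
#       for header, sequence in read_fasta(handle):
#           print(header, len(sequence))
```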

Output file

The extracted feature files are written to the feature_extraction_output folder and named fasta_file_name_protein_feature.txt (e.g. sample_AAC.txt).

The content of the output files:

  • The output file is tab-separated.
  • Each row holds the extracted features of one protein sequence.
  • The first column of each row is the UniProtKB ID of the protein; the remaining columns are the extracted features.
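Given the layout described above, an output file can be loaded with the standard library alone. The sketch below is an illustration under assumptions: read_feature_rows is a hypothetical helper, and because it is not stated whether the file starts with a header row, any row whose feature columns are not numeric is skipped.

```python
import csv
from typing import Dict, Iterable, List


def read_feature_rows(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Map each protein ID (first column) to its list of feature values."""
    features: Dict[str, List[float]] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        protein_id, values = row[0], row[1:]
        try:
            features[protein_id] = [float(v) for v in values if v != ""]
        except ValueError:
            # Assumption: a row with non-numeric values is a header line.
            continue
    return features


# Usage (file name follows the documented naming pattern):
#   with open("feature_extraction_output/sample_AAC.txt", newline="") as handle:
#       table = read_feature_rows(handle)
```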

Tables of the available protein descriptors

Table 1: Protein descriptors obtained from the POSSUM tool.

Descriptor group                          Protein descriptor          Number of dimensions
Row Transformations                       AAC-PSSM                    20
                                          D-FPSSM                     20
                                          smoothed-PSSM               1000
                                          AB-PSSM                     400
                                          PSSM-composition            400
                                          RPM-PSSM                    400
                                          S-FPSSM                     400
Column Transformation                     DPC-PSSM                    400
                                          k-separated-bigrams-PSSM    400
                                          tri-gram-PSSM               8000
                                          EEDP                        4000
                                          TPC                         400
Mixture of row and column transformation  EDP                         20
                                          RPSSM                       110
                                          Pse-PSSM                    40
                                          DP-PSSM                     240
                                          PSSM-AC                     200
                                          PSSM-CC                     3800
Combination of above descriptors          AADP-PSSM                   420
                                          AATP                        420
                                          MEDP                        420

Table 2: Protein descriptors obtained from the iFeature tool.

Descriptor group                Protein descriptor                                        Number of dimensions
Amino acid composition          Amino acid composition (AAC)                              20
                                Composition of k-spaced amino acid pairs (CKSAAP)         2400
                                Dipeptide composition (DPC)                               400
Grouped amino acid composition  Grouped amino acid composition (GAAC)                     5
                                Composition of k-spaced amino acid group pairs (CKSAAGP)  150
                                Grouped dipeptide composition (GDPC)                      25
Autocorrelation                 Moran (Moran)                                             240
                                Geary (Geary)                                             240
                                Normalized Moreau-Broto (NMBroto)                         240
C/T/D                           Composition (CTDC)                                        39
                                Transition (CTDT)                                         39
                                Distribution (CTDD)                                       195
Conjoint triad                  Conjoint triad (CTriad)                                   343
                                Conjoint k-spaced triad (KSCTriad)                        343*(k+1)
Quasi-sequence-order            Sequence-order-coupling number (SOCNumber)                60
                                Quasi-sequence-order descriptors (QSOrder)                100
Pseudo-amino acid composition   Pseudo-amino acid composition (PAAC)                      50
                                Amphiphilic PAAC (APAAC)                                  80
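
To make the dimension counts in Table 2 concrete, the sketch below computes two of the simplest descriptors by hand: AAC (one frequency per amino acid, 20 values) and DPC (one frequency per ordered amino acid pair, 20 x 20 = 400 values). This is a textbook-style illustration and not the iFeature implementation; the alphabetical ordering of the 20 standard amino acids is an assumption, so column order may differ from ProtFeat's output.

```python
from collections import Counter
from typing import List

# Assumed ordering: the 20 standard amino acids by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence: str) -> List[float]:
    """Amino acid composition: the fraction of each residue type (20 values)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def dpc(sequence: str) -> List[float]:
    """Dipeptide composition: the fraction of each adjacent pair (400 values)."""
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("sequence needs at least two residues")
    total = len(seq) - 1  # number of overlapping pairs
    pairs = Counter(seq[i:i + 2] for i in range(total))
    return [pairs[a + b] / total for a in AMINO_ACIDS for b in AMINO_ACIDS]


toy = "AACD"  # made-up toy sequence, for illustration only
print(len(aac(toy)), len(dpc(toy)))
```

The two vector lengths match the AAC and DPC rows of Table 2.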

License

MIT License

ProtFeat Copyright (C) 2020 CanSyL

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Releases

  • protein-feature-extraction (Apr 12, 2022)

Owner
GOKHAN OZSARI
Research and Teaching Assistant, at CEng, METU