Accurate identification of bacteriophages from metagenomic data using Transformer


PhaMer

PhaMer is a Python library for identifying bacteriophages from metagenomic data. PhaMer is based on a Transformer model and relies on a protein-based vocabulary to convert DNA sequences into sentences.

Overview

The main function of PhaMer is to identify phage-like contigs from metagenomic data. The input of the program should be FASTA files, and the output will be a CSV file containing the predictions. Since PhaMer is a deep-learning model, we recommend running it on a GPU if one is available to save time.

If you have any trouble installing or using PhaMer, please let us know by opening an issue on GitHub or emailing us ([email protected]).

Required Dependencies

If you want to use a GPU to accelerate the program, you will also need:

  • CUDA

  • PyTorch (GPU build)

  • For the CPU version of PyTorch: conda install pytorch torchvision torchaudio cpuonly -c pytorch

  • For the GPU version of PyTorch: search the PyTorch website for the install command matching your CUDA version

An easier way to install

Note: we suggest installing all the packages with conda (either Miniconda or Anaconda is fine).

After cloning this repository, you can use conda to create the environment from PhaMer.yaml. This will install all the packages you need in GPU mode (make sure CUDA is installed on your system to use the GPU version; otherwise, PhaMer will run in CPU mode). The command is: conda env create -f PhaMer.yaml -n phamer

Prepare the database and environment

Due to GitHub's file-size limits, the database is shipped compressed. Before using PhaMer, you need to unpack it using the following commands.

  1. When you use PhaMer for the first time
cd PhaMer/
conda env create -f PhaMer.yaml -n phamer
conda activate phamer
cd database/
bzip2 -d database.fa.bz2
git lfs install
rm transformer.pth
git checkout .
cd ..

Note: Because the model parameter file is larger than 100 MB, please make sure you have installed git-lfs so that it can be downloaded from GitHub.

  2. If the example runs without bugs, you only need to activate your 'phamer' environment before using PhaMer.
conda activate phamer

Usage

python preprocessing.py [--contigs INPUT_FA] [--len MINIMUM_LEN]
python PhaMer.py [--out OUTPUT_CSV] [--reject THRESHOLD]

Options

  --contigs INPUT_FA
                        input FASTA file
  --len MINIMUM_LEN
                        predict only for sequences >= len bp (default 3000)
  --out OUTPUT_CSV
                        the output CSV file (prediction)
  --reject THRESHOLD
                        threshold for rejecting prophages; the higher the value, the more prophages will be rejected (default 0.3)

Example

Prediction on the example file:

python preprocessing.py --contigs test_contigs.fa
python PhaMer.py --out example_prediction.csv

The prediction will be written to example_prediction.csv. The CSV file has three columns: contig name, prediction, and prediction score.

References

The paper has been submitted to ISMB 2022.

The arXiv version can be found via: Accurate identification of bacteriophages from metagenomic data using Transformer

Contact

If you have any questions, please email us: [email protected]

Comments
  • issues collections from schackartk (solved)


    Hi! Thank you for publishing your code publicly.

    I am a researcher who works with many tools that identify phage in metagenomes. However, I like to be confident in the implementation of the concepts. I noticed that your repository does not have any formal testing. Without tests, I am always skeptical about implementing a tool in my own work because I cannot be sure it is working as described.

    Would your team be interested in adding tests to the code (e.g. using pytest)? If I, or another developer, were to create a pull request that implemented testing, would your team consider accepting such a request?

    Also, it is a small thing, but I noticed that your code is not formatted in any community-accepted way. Would you consider accepting a pull request that has passed the code through a linter such as yapf or black? I usually add linting as part of my test suites.

    opened by schackartk 12
  • Rename preprocessing.py?


    Hi Kenneth,

    preprocessing.py is a pretty generic name; maybe rename the script to PhaMer_preprocess.py to avoid potential future conflicts with other software?

    opened by sjaenick 1
  • bioconda recipe


    Any plans on creating a bioconda recipe for PhaMer? That would greatly help users with the install & version management of PhaMer.

    Also in regards to:

    Because the parameter is larger than 100M, please make sure you have downloaded transformer.pth correctly.

    Why not just use md5sum?

    opened by nick-youngblut 1
  • Threading/Performance updates


    Hi,

    • introduce --threads to control threading behavior
    • allow to supply external database directory, so DIAMOND database formatting isn't needed every time
    • use 'pprodigal' for faster gene prediction step
    • removed unused imports

    Please note I didn't yet add pprodigal to the conda yaml - feel free to do so if you want to include it

    opened by sjaenick 1
  • Bug: Unable to clone repository


    Hello,

    It seems that this repository is exceeding its data transfer limits. I believe you are aware of this, as you instruct users to download the transformer.pth from Google Drive.

    However, it seems that I cannot clone the repository in general. I just want to make sure this is not a problem on my end, so I will walk through what I am doing.

    Reproducible Example

    First, cloning the repository

    $ git clone git@github.com:KennthShang/PhaMer.git
    Cloning into 'PhaMer'...
    Downloading database/transformer.pth (143 MB)
    Error downloading object: database/transformer.pth (28a82c1): Smudge error: Error downloading database/transformer.pth (28a82c1ca0fb2499c0071c685dbf49f3a0d060fdc231bb04f7535e88e7fe0858): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
    
    Errors logged to /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/schackartk/projects/PhaMer/.git/lfs/logs/20220124T113548.16324098.log
    Use `git lfs logs last` to view the log.
    error: external filter git-lfs smudge -- %f failed 2
    error: external filter git-lfs smudge -- %f failed
    fatal: database/transformer.pth: smudge filter lfs failed
    warning: Clone succeeded, but checkout failed.
    You can inspect what was checked out with 'git status'
    and retry the checkout with 'git checkout -f HEAD'
    

    Checking git status as suggested indicates that several files were not checked out

    $ cd PhaMer/
    $ git status
    # On branch main
    # Changes to be committed:
    #   (use "git reset HEAD <file>..." to unstage)
    #
    #	deleted:    .gitattributes
    #	deleted:    LICENSE.txt
    #	deleted:    PhaMer.py
    #	deleted:    PhaMer.yaml
    #	deleted:    README.md
    #	deleted:    database/.DS_Store
    #	deleted:    database/contigs.csv
    #	deleted:    database/database.fa.bz2
    #	deleted:    database/pc2wordsid.dict
    #	deleted:    database/pcs.csv
    #	deleted:    database/profiles.csv
    #	deleted:    database/proteins.csv
    #	deleted:    database/transformer.pth
    #	deleted:    logo.jpg
    #	deleted:    model.py
    #	deleted:    preprocessing.py
    #	deleted:    test_contigs.fa
    #
    # Untracked files:
    #   (use "git add <file>..." to include in what will be committed)
    #
    #	.gitattributes
    #	LICENSE.txt
    #	PhaMer.py
    #	PhaMer.yaml
    #	README.md
    #	database/
    

    Confirming that several files are missing, such as preprocessing.py:

    $ ls
    database  LICENSE.txt  PhaMer.py  PhaMer.yaml  README.md
    

    Continuing with installation instructions anyway in case they resolve these issues.

    $ conda env create -f PhaMer.yaml -n phamer
    Collecting package metadata (repodata.json): done
    Solving environment: done
    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: \ By downloading and using the CUDA Toolkit conda packages, you accept the terms and conditions of the CUDA End User License Agreement (EULA): https://docs.nvidia.com/cuda/eula/index.html
    
    done
    Installing pip dependencies: / Ran pip subprocess with arguments:
    ['/home/u29/schackartk/.conda/envs/phamer/bin/python', '-m', 'pip', 'install', '-U', '-r', '/xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/schackartk/projects/PhaMer/condaenv.1nhq2jy1.requirements.txt']
    Pip subprocess output:
    Collecting joblib==1.1.0
      Using cached joblib-1.1.0-py2.py3-none-any.whl (306 kB)
    Collecting scikit-learn==1.0.1
      Using cached scikit_learn-1.0.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (25.9 MB)
    Collecting sklearn==0.0
      Using cached sklearn-0.0-py2.py3-none-any.whl
    Collecting threadpoolctl==3.0.0
      Using cached threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
    Requirement already satisfied: scipy>=1.1.0 in /home/u29/schackartk/.conda/envs/phamer/lib/python3.8/site-packages (from scikit-learn==1.0.1->-r /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/schackartk/projects/PhaMer/condaenv.1nhq2jy1.requirements.txt (line 2)) (1.7.1)
    Requirement already satisfied: numpy>=1.14.6 in /home/u29/schackartk/.conda/envs/phamer/lib/python3.8/site-packages (from scikit-learn==1.0.1->-r /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/schackartk/projects/PhaMer/condaenv.1nhq2jy1.requirements.txt (line 2)) (1.21.2)
    Installing collected packages: threadpoolctl, joblib, scikit-learn, sklearn
    Successfully installed joblib-1.1.0 scikit-learn-1.0.1 sklearn-0.0 threadpoolctl-3.0.0
    
    done
    #
    # To activate this environment, use
    #
    #     $ conda activate phamer
    #
    # To deactivate an active environment, use
    #
    #     $ conda deactivate
    

    Activating the conda environment and attempting to get transformer.pth

    $ conda activate phamer
    $ cd database
    $ bzip2 -d database.fa.bz2
    $ git lfs install
    Updated git hooks.
    Git LFS initialized.
    $ rm transformer.pth
    rm: cannot remove ‘transformer.pth’: No such file or directory
    $ git checkout .
    error: pathspec './' did not match any file(s) known to git.
    
    $ ls
    contigs.csv  database.fa  pc2wordsid.dict  pcs.csv  profiles.csv  proteins.csv
    

    Since the file doesn't seem to exist, I followed your Google Drive link and placed transformer.pth into database/ manually.

    $ ls
    contigs.csv  database.fa  pc2wordsid.dict  pcs.csv  profiles.csv  proteins.csv  transformer.pth
    

    I will try checking out again.

    $ git checkout .
    error: pathspec './' did not match any file(s) known to git.
    

    Going back up, you can see that I am still missing scripts.

    $ cd ..
    $ ls
    metaphinder_reprex  phage_finders  PhaMer  snakemake_tutorial
    

    Conclusions

    I am missing the scripts and cannot run the tool. I believe this all comes down to the repo exceeding its data transfer limits, probably due to storing the large database files in the repository.

    Possible solution?

    Going forward, maybe entirely remove the large files from the repo so that you don't exceed limits. I am not sure what I can do at this moment since the limits are already exceeded.

    Also, it would be helpful if I could obtain transformer.pth from the command line (e.g. using wget), since I, like many researchers, am working on an HPC or cloud system.

    Thank you, -Ken

    opened by schackartk 1
Releases: v1.0
Owner: Kenneth Shang