ONT Analysis Toolkit (OAT)

Overview

Code style: black

                                       ,d
                                       88
              ,adPPYba,  ,adPPYYba, MM88MMM
              a8"     "8a ""     `Y8   88   
              8b       d8 ,adPPPPP88   88   
              "8a,   ,a8" 88,    ,88   88,  
              `"YbbdP"'  `"8bbdP"Y8   "Y888

ONT Analysis Toolkit (OAT)

A pipeline to facilitate sequencing of viral genomes (amplified with a tiled amplicon scheme) and assembly into consensus genomes. Supported viruses currently include Human betaherpesvirus 5 (CMV) and SARS-CoV-2. ONT sequencing data from a MinION can be monitored in real time with the rampart module. All analysis steps are handled by the analysis module, whereby sequencing data are analysed using a pipeline written in snakemake, with the choice of tools heavily influenced by the artic minion pipeline. Both steps can be run in order with a single command using the all module.

Author: Dr Charles Foster

Starting out

To begin with, clone this GitHub repository:

git clone https://github.com/charlesfoster/ont-analysis-toolkit.git

cd ont-analysis-toolkit

Next, install most dependencies using conda:

conda env create -f environment.yml

Pro tip: if you install mamba, you can create the environment with mamba instead of conda. A lot of conda headaches go away: mamba is a much faster drop-in replacement for conda.

conda install mamba
mamba env create -f environment.yml

Install the pipeline by activating the conda environment and running pip (which uses the included setup.py) from the repository root:

conda activate oat
pip install .

Other dependencies:

  • Demultiplexing is done using guppy_barcoder. The program needs to be installed and available in your PATH.
  • If analysing SARS-CoV-2 data, lineages are typed using pangolin. Accordingly, pangolin must be installed following the instructions at https://github.com/cov-lineages/pangolin. The pipeline will fail if the pangolin environment can't be activated.
  • Variants are called using medaka and longshot. Ideally these would be installed via mamba in the main environment.yml file, but some libraries required by medaka are incompatible with the main oat environment. Consequently, the analysis pipeline is written so that snakemake automatically installs medaka and its dependencies into an isolated environment during execution of the oat pipeline. The environment is only created the first time you run the pipeline, but if you later run the pipeline from a different working directory, it will be created again. Solution: always run oat from the same working directory.

tl;dr: you don't need to do anything for variant calling to work; just don't get confused during the initial pipeline run when the terminal indicates creation of a new conda environment. You should run oat from the same directory each time, otherwise a new conda environment will be created each time.
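Because the isolated medaka environment is tied to the working directory, it can help to pick one analysis directory up front and pre-build the environments there using the --create_envs_only option described in the options list below. A minimal sketch (the directory and spreadsheet paths are only illustrative):

# choose a fixed working directory and always launch oat from it
mkdir -p ~/oat_runs && cd ~/oat_runs
conda activate oat
# build the snakemake-managed environments without running any analysis
oat --create_envs_only /path/to/run_data.csv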

Usage

For brevity, the conda environment containing all necessary tools is simply named 'oat'. The environment should first be activated:

conda activate oat

Then, to run the pipeline, it's as simple as:

oat <samples_file>

where <samples_file> should be replaced with the full path to a spreadsheet with minimal metadata for the ONT sequencing run (see example spreadsheet: run_data_example.csv). The most important thing to remember is that the 'run_name' in the spreadsheet must exactly match the name of the sequencing run, as set up in MinKNOW.
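For example, with a spreadsheet saved at /data/runs/run_data.csv (an illustrative path), you could preview the workflow with a dry run before launching it for real:

oat -n /data/runs/run_data.csv    # dry run: check the planned workflow without analysing anything
oat /data/runs/run_data.csv       # full run with default settings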

Note that there are many additional options/settings to take advantage of:


                                       ,d
                                       88
              ,adPPYba,  ,adPPYYba, MM88MMM
              a8"     "8a ""     `Y8   88   
              8b       d8 ,adPPPPP88   88   
              "8a,   ,a8" 88,    ,88   88,  
              `"YbbdP"'  `"8bbdP"Y8   "Y888

        OAT: ONT Analysis Toolkit (version 0.2.0)

usage: oat [options] <samples_file>

A pipeline for sequencing and analysis of viral genomes using an ONT MinION

positional arguments:
  samples_file          Path to file with sample metadata (.csv format). See
                        example spreadsheet for minimum necessary information.

optional arguments:
  -h, --help            show this help message and exit
  -b, --barcode_kit     Barcode kit that you used: '12' (SQK-RBK004) or '96'
                        (SQK-RBK110-96) (default: 12)
  -c, --consensus_freq  Variant allele frequency threshold for a variant to be
                        incorporated into consensus genome. Variants below this
                        frequency will be incorporated with an IUPAC ambiguity.
                        Default: 0.8
  -d, --demultiplex     Demultiplex reads using guppy_barcoder. By default,
                        assumes reads were already demultiplexed by MinKNOW.
                        Reads are demultiplexed into the output directory.
  -f, --force           Force overwriting of completed files in snakemake
                        analysis (Default: files not overwritten)
  -n, --dry_run         Dry run only
  -m rampart | analysis | all, --module rampart | analysis | all
                        Pipeline module to run: 'rampart', 'analysis' or 'all'
                        (rampart followed by analysis). Default: 'all'
  -o OUTDIR, --outdir OUTDIR
                        Output directory. Default:
                        /path/to/ont-analysis-toolkit/analysis_results +
                        'run_name' from samples spreadsheet
  --rampart_outdir RAMPART_OUTDIR
                        Output directory. Default:
                        /path/to/ont-analysis-toolkit/rampart_files
  -p, --print_dag       Save directed acyclic graph (DAG) of workflow to outdir
  -r REFERENCE, --reference REFERENCE
                        Reference genome to use: 'MN908947.3' (SARS-CoV-2),
                        'NC_006273.2' (CMV Merlin). Other references can be
                        used, but the corresponding assembly (fasta) and
                        annotation (gff3 from Ensembl) must be added to
                        /home/cfos/miniconda3/envs/oat/lib/python3.9/site-packages/oat/references
                        (Default: MN908947.3)
  -t, --threads         Number of threads to use
  -v VARIANT_CALLER, --variant_caller VARIANT_CALLER
                        Variant caller to use. Choices: 'medaka-longshot'.
                        Default: 'medaka-longshot'
  --create_envs_only    Create conda environments for snakemake analysis, but
                        do no further analysis. Useful for initial pipeline
                        setup. Default: False
  --snv_min SNV_MIN     Minimum variant allele frequency for an SNV to be kept.
                        Default: 0.2
  --delete_reads        Delete demultiplexed reads after analysis
  --redo_analysis       Delete entire analysis output directory and contents
                        for a fresh run
  --version             show program's version number and exit
  --minknow_data MINKNOW_DATA
                        Location of MinKNOW data root. Default:
                        /var/lib/minknow/data
  --max_memory          Maximum memory (in MB) that you would like to provide
                        to snakemake. Default: 53456MB
  --quiet               Stop printing of snakemake commands to screen.
  --report              Generate report (currently non-functional).
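As an illustration of combining these options (the spreadsheet path, output directory, and thread count below are hypothetical), a run that demultiplexes with guppy_barcoder, uses the 96-barcode kit, and writes to a custom output directory could look like:

oat --demultiplex -b 96 -t 8 -o /data/oat_results /data/runs/run_data.csv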

What does the pipeline do?

RAMPART Module

All input files for RAMPART are generated based on your input spreadsheet, and a web browser is launched to view the sequencing in real time.
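To launch live monitoring on its own, the module flag from the options above can be used; the paths here are again illustrative:

oat -m rampart /data/runs/run_data.csv
# or, with a custom location for the generated RAMPART files:
oat -m rampart --rampart_outdir /data/rampart_files /data/runs/run_data.csv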

Analysis Module

Reads are mapped to the relevant reference genome with minimap2. Amplicon primers are trimmed using samtools ampliconclip. Variants are called using medaka and longshot, followed by filtering and consensus genome assembly using bcftools. The amino acid consequences of SNPs are inferred using bcftools csq. If analysing SARS-CoV-2, lineages are inferred using pangolin. Finally, a variety of sample QC metrics are combined into a final QC file.
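For orientation only, the steps above correspond roughly to the manual commands below. This is a simplified sketch with illustrative file names; the pipeline's actual rules, flags, and intermediate files differ, and the medaka/longshot variant-calling commands are omitted.

# map reads to the reference and sort (file names are illustrative)
minimap2 -ax map-ont reference.fasta barcode01.fastq.gz | samtools sort -o barcode01.bam
samtools index barcode01.bam

# trim amplicon primers using the scheme's primer coordinates
samtools ampliconclip -b scheme.bed barcode01.bam -o barcode01.clipped.bam

# variants are called from the clipped alignment with medaka + longshot (commands omitted),
# giving e.g. barcode01.vcf.gz; filtered variants are then used to build the consensus genome
bcftools view -f PASS barcode01.vcf.gz -Oz -o barcode01.pass.vcf.gz
bcftools index barcode01.pass.vcf.gz
bcftools consensus -f reference.fasta barcode01.pass.vcf.gz > barcode01.consensus.fasta

# amino acid consequences and, for SARS-CoV-2, lineage assignment
bcftools csq -f reference.fasta -g reference.gff3 barcode01.pass.vcf.gz -Ov -o barcode01.csq.vcf
pangolin barcode01.consensus.fasta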

Other Notes

Protocols

A protocol for the amplicon scheme needs (a) to be installed in the pipeline, and (b) to be named in the run_data.csv spreadsheet for analyses to work correctly. The pipeline comes with the Midnight protocol for SARS-CoV-2 pre-installed (https://www.protocols.io/view/sars-cov2-genome-sequencing-protocol-1200bp-amplic-bwyppfvn). Adding additional protocols is fairly easy; the steps below use the ARTIC V3 scheme as an example, and a shell sketch of the same steps follows the list:

  1. Make a directory called /path/to/ont-analysis-toolkit/oat/protocols/ARTICV3 (needs to be in all caps)

  2. Make a directory within ARTICV3 called 'rampart'

    (a) Put the normal rampart files within that directory (genome.json, primers.json, protocol.json, references.fasta)

  3. Make a directory within ARTICV3 called 'schemes'

    (a) Put the 'scheme.bed' file with primer coordinates in the 'schemes' directory

  4. Make sure you're in /path/to/ont-analysis-toolkit/, then activate the conda environment and use the following command: pip install .

Done!
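As a concrete sketch of the steps above, assuming the repository lives at /path/to/ont-analysis-toolkit and that the ARTIC V3 RAMPART files and scheme.bed already exist at the illustrative source paths shown:

cd /path/to/ont-analysis-toolkit
mkdir -p oat/protocols/ARTICV3/rampart oat/protocols/ARTICV3/schemes
# copy your protocol files into place (source paths are illustrative)
cp /path/to/artic_v3/rampart/{genome.json,primers.json,protocol.json,references.fasta} oat/protocols/ARTICV3/rampart/
cp /path/to/artic_v3/scheme.bed oat/protocols/ARTICV3/schemes/
conda activate oat
pip install .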

Amino acid consequences

The amino acid consequences step requires an annotation file for the chosen reference genome. The annotation must be in gff3 format, and specifically in the 'Ensembl flavour' of gff3. A script included in the repository can convert an NCBI gff3 file into an 'Ensembl flavour' gff3 file: /path/to/ont-analysis-toolkit/oat/scripts/gff2gff.py.
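Exactly how gff2gff.py is invoked is best checked in the script itself. Once you have an Ensembl-flavour gff3, a quick sanity check (with illustrative file names, using a VCF from a previous run) is to pass it to bcftools csq directly, since that is the tool that consumes it:

bcftools csq -f my_reference.fasta -g my_reference.ensembl.gff3 previous_calls.vcf.gz | head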

Credits

  • When this pipeline is used, please also find and cite the programs used internally.
  • The gff3 file I included for SARS-CoV-2 was originally sent to me by Torsten Seemann.
  • Being new to using snakemake + wrapper scripts, I used pangolin as a guide for directory structure and rule creation - so thanks to them.
  • The analysis module was heavily influenced by the ARTIC team, especially the artic minion pipeline.
  • gff2gff.py is based on work by Damien Farrell https://dmnfarrell.github.io/bioinformatics/bcftools-csq-gff-format
Releases

  • v0.10.1 (Apr 27, 2022)

    While the 'oat' pipeline has undergone continuous development since its inception, this is the first designated release (well, pre-release). The pipeline should be fully functional, but please let me know if you encounter any errors or bugs.

    The pipeline has the most built-in support for SARS-CoV-2, but works well with other viruses if you set them up properly as per the instructions in the README.md file.
