An adaptable Snakemake workflow which uses GATKs best practice recommendations to perform germline mutation calling starting with BAM files

Overview

Germline Mutation Calling

This Snakemake workflow follows the GATK best-practice recommandations to call small germline variants.

The pipeline requires as inputs aligned BAM files (e.g. with BWA) where the duplicates are already marked (e.g. with Picard or sambamba). It then performed Base Quality Score Recalibration and joint genotyping of multiple samples, which is automatically parallized over user defined intervals (for examples see intervals.txt) and chromosomes.

Filtering is performed using GATKs state-of-the-art Variant Quality Score Recalibration

At the end of the worklow, the Variant Effect Predictor is used to annotate the identified germline mutations.

A high level overview of the performed steps can be seen below:

DAG

As seen by the execution graph, an arbitrary number of samples/BAM files can be processed in parallel up to the joint variant calling.

Installation

Required tools:

The majority of the listed tools can be quite easily installed with conda which is recommanded.

Usage

First, modify the config_wgs.yaml and resources.yaml files. Both files contain detailed description what is expected. The config_wgs.yaml also contains links to some reference resources. Be careful, they are all specific for the GRCh37/hg19/b37 genome assembly.

After setting up all the config files and installing all tools, you can simply run:

snakemake --latency-wait 300 -j 5 --cluster "sbatch --mem={resources.mem_mb} --time {resources.runtime_min} --cpus-per-task {threads} --job-name={rule}.%j --output snakemake_cluster_submit_log/{rule}.%j.out --mail-type=FAIL"

This assumes that the cluster you are using is running SLURM. If this is not the case, you have to adjust the command after --cluster. The log information of each job will be safed in the snakemake_cluster_submit_log directory. This directory will not be created automatically.

-j specifies the number of jobs/rules should be submitted in parallel.

I recommand running this command in a detached session with tmux or screen.

Output

Below is the output of the tree command, after the workflow has finished for one patient H005-00ML. Usually you would include many patients simultaneously (>50). This is just to illustrate the created output files.

.
├── cohort
│ ├── benchmark
│ │ ├── ApplyVQSR_indel.txt
│ │ ├── ApplyVQSR_snp.txt
│ │ ├── CombineGVCFs.txt
│ │ ├── GenotypeGVCFs.txt
│ │ ├── MergeCohortVCFs.txt
│ │ ├── SelectVariants.txt
│ │ ├── VEP.txt
│ │ ├── VQSR_indel.txt
│ │ └── VQSR_snp.txt
│ ├── cohort.recalibrated.pass.vep.vcf.gz
│ ├── cohort.recalibrated.pass.vep.vcf.gz_summary.html
│ ├── cohort.recalibrated.vcf.gz
│ ├── cohort.recalibrated.vcf.gz.tbi
│ └── logs
│     ├── ApplyVQSR_indel.out
│     ├── ApplyVQSR_snp.out
│     ├── CombineGVCFs
│     ├── CombineGVCFs.1.out
│     ├── CombineGVCFs.2.out
│     ├── ...
│     ├── ...
│     ├── CombineGVCFs.Y.out
│     ├── GenotypeGVCFs.1.out
│     ├── GenotypeGVCFs.2.out
│     ├── ...
│     ├── ...
│     ├── GenotypeGVCFs.Y.out
│     ├── MakeSitesOnly.out
│     ├── MergeCohortVCFs.out
│     ├── SelectVariants.err
│     ├── VEP.out
│     ├── VQSR_indel.out
│     └── VQSR_snp.out
├── config
│ ├── config_wgs.yaml
│ └── resources.yaml
├── H005-00ML
│ ├── benchmark
│ │ ├── ApplyBQSR.txt
│ │ ├── BaseRecalibrator.txt
│ │ ├── GatherBQSRReports.txt
│ │ ├── GatherRecalBamFiles.txt
│ │ ├── HaplotypeCaller.txt
│ │ ├── IndexBam.txt
│ │ ├── MergeHaplotypeCaller.txt
│ │ └── SortBam.txt
│ ├── H005-00ML.germline.merged.g.vcf.gz
│ ├── H005-00ML.germline.merged.g.vcf.gz.tbi
│ └── logs
│     ├── ApplyBQSR
│     ├── ApplyBQSR.0000-scattered.interval_list.out
│     ├── ApplyBQSR.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── ApplyBQSR.0049-scattered.interval_list.out
│     ├── BaseRecalibrator
│     ├── BaseRecalibrator.0000-scattered.interval_list.out
│     ├── BaseRecalibrator.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── BaseRecalibrator.0049-scattered.interval_list.out
│     ├── GatherBQSRReports.out
│     ├── GatherRecalBamFiles.out
│     ├── HaplotypeCaller
│     ├── HaplotypeCaller.0000-scattered.interval_list.out
│     ├── HaplotypeCaller.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── HaplotypeCaller.0049-scattered.interval_list.out
│     ├── IndexBam.out
│     ├── MergeHaplotypeCaller.out
│     └── SortBam.out
├── rules
│ ├── BaseQualityScoreRecalibration.smk
│ ├── JointGenotyping.smk
│ ├── VEP.smk
│ └── VQSR.smk
├── Snakefile
├── snakemake_cluster_submit_log
│ ├── ApplyBQSR.24720887.out
│ ├── ApplyVQSR_snp.24777265.out
│ ├── BaseRecalibrator.24710227.out
│ ├── CombineGVCFs.24772984.out
│ ├── GatherBQSRReports.24715726.out
│ ├── GatherRecalBamFiles.24722478.out
│ ├── GenotypeGVCFs.24773026.out
│ ├── HaplotypeCaller.24769848.out
│ ├── IndexBam.24768728.out
│ ├── MergeCohortVCFs.24776018.out
│ ├── MergeHaplotypeCaller.24772183.out
│ ├── SelectVariants.24777733.out
│ ├── SortBam.24768066.out
│ ├── VEP.24777739.out
│ ├── VQSR_indel.24776035.out
│ └── VQSR_snp.24776036.out

For each analyzed patient, a seperate directory gets created. Along with the patient specific gvcf file, this directory contains log files for all the processing steps that were performed for that patient (log directory) as well as benchmarks for each rule, e.g. how long the step took or how much CPU/RAM was used (benchmark directory).

The cohort directory contains the multi-sample VCF file, which gets created after performing the joint variant calling. The cohort.recalibrated.vcf.gz is the product of GATKs Variant Quality Score Recalibration. The cohort.recalibrated.pass.vep.vcf.gz is the filtered and VEP annotated version of cohort.recalibrated.vcf.gz (only variants with PASS are kept).

For most applications, the cohort.recalibrated.pass.vep.vcf.gz file, is the file you want to continue working with.

A minimal Python package that produces slice plots through h5m DAGMC geometry files

A minimal Python package that produces slice plots through h5m DAGMC geometry files Installation pip install dagmc_geometry_slice_plotter Python API U

Fusion Energy 4 Dec 02, 2022
Voilà, install macOS on ANY Computer! This is really and magic easiest way!

OSX-PROXMOX - Run macOS on ANY Computer - AMD & Intel Install Proxmox VE v7.02 - Next, Next & Finish (NNF). Open Proxmox Web Console - Datacenter N

Gabriel Luchina 654 Jan 09, 2023
A workshop on data visualization in Python with notebooks and exercises for following along.

Beyond the Basics: Data Visualization in Python The human brain excels at finding patterns in visual representations, which is why data visualizations

Stefanie Molin 162 Dec 05, 2022
Tidy data structures, summaries, and visualisations for missing data

naniar naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot

Nicholas Tierney 611 Dec 22, 2022
Graphical display tools, to help students debug their class implementations in the Carcassonne family of projects

carcassonne_tools Graphical display tools, to help students debug their class implementations in the Carcassonne family of projects NOTE NOTE NOTE The

1 Nov 08, 2021
Simple plotting for Python. Python wrapper for D3xter - render charts in the browser with simple Python syntax.

PyDexter Simple plotting for Python. Python wrapper for D3xter - render charts in the browser with simple Python syntax. Setup $ pip install PyDexter

D3xter 31 Mar 06, 2021
Kglab - an abstraction layer in Python for building knowledge graphs

Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, RDFlib, pySHACL, RAPIDS, NetworkX, iGraph, PyVis, pslpython, p

derwen.ai 466 Jan 09, 2023
Some problems of SSLC ( High School ) before outputs and after outputs

Some problems of SSLC ( High School ) before outputs and after outputs 1] A Python program and its output (output1) while running the program is given

Fayas Noushad 3 Dec 01, 2021
SummVis is an interactive visualization tool for text summarization.

SummVis is an interactive visualization tool for analyzing abstractive summarization model outputs and datasets.

Robustness Gym 246 Dec 08, 2022
Simple Inkscape Scripting

Simple Inkscape Scripting Description In the Inkscape vector-drawing program, how would you go about drawing 100 diamonds, each with a random color an

Scott Pakin 140 Dec 27, 2022
A curated list of awesome Dash (plotly) resources

Awesome Dash A curated list of awesome Dash (plotly) resources Dash is a productive Python framework for building web applications. Written on top of

Luke Singham 1.7k Jan 07, 2023
A central task in drug discovery is searching, screening, and organizing large chemical databases

A central task in drug discovery is searching, screening, and organizing large chemical databases. Here, we implement clustering on molecular similarity. We support multiple methods to provide a inte

NVIDIA Corporation 124 Jan 07, 2023
Ana's Portfolio

Ana's Portfolio ✌️ Welcome to my Portfolio! You will find here different Projects I have worked on (from scratch) 💪 Projects 💻 1️⃣ Hangman game (Mad

Ana Katherine Cortes Sobrino 9 Mar 15, 2022
Official Matplotlib cheat sheets

Official Matplotlib cheat sheets

Matplotlib Developers 6.7k Jan 09, 2023
2021 grafana arbitrary file read

2021_grafana_arbitrary_file_read base on pocsuite3 try 40 default plugins of grafana alertlist annolist barchart cloudwatch dashlist elasticsearch gra

ATpiu 5 Nov 09, 2022
Visualization of numerical optimization algorithms

Visualization of numerical optimization algorithms

Zhengxia Zou 46 Dec 01, 2022
A high-level plotting API for pandas, dask, xarray, and networkx built on HoloViews

hvPlot A high-level plotting API for the PyData ecosystem built on HoloViews. Build Status Coverage Latest dev release Latest release Docs What is it?

HoloViz 694 Jan 04, 2023
3D Vision functions with end-to-end support for deep learning developers, written in Ivy.

Ivy vision focuses predominantly on 3D vision, with functions for camera geometry, image projections, co-ordinate frame transformations, forward warping, inverse warping, optical flow, depth triangul

Ivy 61 Dec 29, 2022
Create a visualization for Trump's Tweeted Words Using Python

Data Trump's Tweeted Words This plot illustrates twitter word occurences. We already did the coding I needed for this plot, so I was very inspired to

7 Mar 27, 2022
Example Code Notebooks for Data Visualization in Python

This repository contains sample code scripts for creating awesome data visualizations from scratch using different python libraries (such as matplotli

Javed Ali 27 Jan 04, 2023