An adaptable Snakemake workflow that uses GATK's best-practice recommendations to perform germline mutation calling, starting from BAM files

Overview

Germline Mutation Calling

This Snakemake workflow follows the GATK best-practice recommendations to call small germline variants.

The pipeline requires aligned BAM files as input (e.g. aligned with BWA) in which duplicates are already marked (e.g. with Picard or sambamba). It then performs Base Quality Score Recalibration and joint genotyping of multiple samples, which is automatically parallelized over user-defined intervals (for examples see intervals.txt) and chromosomes.

Filtering is performed using GATK's state-of-the-art Variant Quality Score Recalibration.

At the end of the workflow, the Variant Effect Predictor (VEP) is used to annotate the identified germline mutations.

A high-level overview of the performed steps can be seen below:

DAG

As the execution graph shows, an arbitrary number of samples/BAM files can be processed in parallel up to the joint variant calling step.

Installation

Required tools:

Most of the listed tools can be installed easily with conda, which is recommended.

Usage

First, modify the config_wgs.yaml and resources.yaml files. Both files contain a detailed description of what is expected. The config_wgs.yaml file also contains links to some reference resources. Be careful: they are all specific to the GRCh37/hg19/b37 genome assembly.

After setting up all the config files and installing all tools, you can simply run:

snakemake --latency-wait 300 -j 5 --cluster "sbatch --mem={resources.mem_mb} --time {resources.runtime_min} --cpus-per-task {threads} --job-name={rule}.%j --output snakemake_cluster_submit_log/{rule}.%j.out --mail-type=FAIL"

This assumes that the cluster you are using runs SLURM. If this is not the case, you have to adjust the command after --cluster. The log information of each job will be saved in the snakemake_cluster_submit_log directory. This directory is not created automatically, so it has to exist before the jobs are submitted.
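Since the log directory named in the --output option is not created for you, a minimal preparation step, run from the directory in which you start snakemake, could look like this (a sketch, assuming the directory name from the command above is kept unchanged):

```shell
# Create the directory that sbatch writes its per-job logs to.
# -p makes this a no-op if the directory already exists.
mkdir -p snakemake_cluster_submit_log
```

If you change the path after --output in the cluster command, adjust the directory name here accordingly.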

-j specifies the number of jobs/rules that should be submitted in parallel.

I recommend running this command in a detached session with tmux or screen.

Output

Below is the output of the tree command after the workflow has finished for one patient, H005-00ML. Usually you would include many patients simultaneously (>50). This is just to illustrate the created output files.

.
├── cohort
│ ├── benchmark
│ │ ├── ApplyVQSR_indel.txt
│ │ ├── ApplyVQSR_snp.txt
│ │ ├── CombineGVCFs.txt
│ │ ├── GenotypeGVCFs.txt
│ │ ├── MergeCohortVCFs.txt
│ │ ├── SelectVariants.txt
│ │ ├── VEP.txt
│ │ ├── VQSR_indel.txt
│ │ └── VQSR_snp.txt
│ ├── cohort.recalibrated.pass.vep.vcf.gz
│ ├── cohort.recalibrated.pass.vep.vcf.gz_summary.html
│ ├── cohort.recalibrated.vcf.gz
│ ├── cohort.recalibrated.vcf.gz.tbi
│ └── logs
│     ├── ApplyVQSR_indel.out
│     ├── ApplyVQSR_snp.out
│     ├── CombineGVCFs
│     ├── CombineGVCFs.1.out
│     ├── CombineGVCFs.2.out
│     ├── ...
│     ├── ...
│     ├── CombineGVCFs.Y.out
│     ├── GenotypeGVCFs.1.out
│     ├── GenotypeGVCFs.2.out
│     ├── ...
│     ├── ...
│     ├── GenotypeGVCFs.Y.out
│     ├── MakeSitesOnly.out
│     ├── MergeCohortVCFs.out
│     ├── SelectVariants.err
│     ├── VEP.out
│     ├── VQSR_indel.out
│     └── VQSR_snp.out
├── config
│ ├── config_wgs.yaml
│ └── resources.yaml
├── H005-00ML
│ ├── benchmark
│ │ ├── ApplyBQSR.txt
│ │ ├── BaseRecalibrator.txt
│ │ ├── GatherBQSRReports.txt
│ │ ├── GatherRecalBamFiles.txt
│ │ ├── HaplotypeCaller.txt
│ │ ├── IndexBam.txt
│ │ ├── MergeHaplotypeCaller.txt
│ │ └── SortBam.txt
│ ├── H005-00ML.germline.merged.g.vcf.gz
│ ├── H005-00ML.germline.merged.g.vcf.gz.tbi
│ └── logs
│     ├── ApplyBQSR
│     ├── ApplyBQSR.0000-scattered.interval_list.out
│     ├── ApplyBQSR.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── ApplyBQSR.0049-scattered.interval_list.out
│     ├── BaseRecalibrator
│     ├── BaseRecalibrator.0000-scattered.interval_list.out
│     ├── BaseRecalibrator.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── BaseRecalibrator.0049-scattered.interval_list.out
│     ├── GatherBQSRReports.out
│     ├── GatherRecalBamFiles.out
│     ├── HaplotypeCaller
│     ├── HaplotypeCaller.0000-scattered.interval_list.out
│     ├── HaplotypeCaller.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── HaplotypeCaller.0049-scattered.interval_list.out
│     ├── IndexBam.out
│     ├── MergeHaplotypeCaller.out
│     └── SortBam.out
├── rules
│ ├── BaseQualityScoreRecalibration.smk
│ ├── JointGenotyping.smk
│ ├── VEP.smk
│ └── VQSR.smk
├── Snakefile
├── snakemake_cluster_submit_log
│ ├── ApplyBQSR.24720887.out
│ ├── ApplyVQSR_snp.24777265.out
│ ├── BaseRecalibrator.24710227.out
│ ├── CombineGVCFs.24772984.out
│ ├── GatherBQSRReports.24715726.out
│ ├── GatherRecalBamFiles.24722478.out
│ ├── GenotypeGVCFs.24773026.out
│ ├── HaplotypeCaller.24769848.out
│ ├── IndexBam.24768728.out
│ ├── MergeCohortVCFs.24776018.out
│ ├── MergeHaplotypeCaller.24772183.out
│ ├── SelectVariants.24777733.out
│ ├── SortBam.24768066.out
│ ├── VEP.24777739.out
│ ├── VQSR_indel.24776035.out
│ └── VQSR_snp.24776036.out

For each analyzed patient, a separate directory is created. Along with the patient-specific GVCF file, this directory contains log files for all the processing steps that were performed for that patient (logs directory) as well as benchmarks for each rule, e.g. how long the step took or how much CPU/RAM was used (benchmark directory).

The cohort directory contains the multi-sample VCF file, which is created after performing the joint variant calling. The cohort.recalibrated.vcf.gz file is the product of GATK's Variant Quality Score Recalibration. The cohort.recalibrated.pass.vep.vcf.gz file is the filtered and VEP-annotated version of cohort.recalibrated.vcf.gz (only variants with PASS are kept).

For most applications, the cohort.recalibrated.pass.vep.vcf.gz file is the file you want to continue working with.
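To get a quick impression of how many variants carry the PASS filter in one of the cohort VCF files, something along the following lines can be used. This is only a sketch and not part of the workflow: count_pass is a hypothetical helper name, and it assumes that gzip and awk are available and that the FILTER value sits in the seventh tab-separated column, as defined by the VCF format.

```shell
# Hypothetical helper (not part of the workflow): count the records in a
# gzip/bgzip-compressed VCF whose FILTER column (7th field) equals PASS.
count_pass() {
    gzip -dc "$1" | awk -F'\t' '!/^#/ && $7 == "PASS" { n++ } END { print n + 0 }'
}

# Example call, assuming the workflow has finished:
#   count_pass cohort/cohort.recalibrated.vcf.gz
```

Header lines (starting with #) are skipped, so only variant records are counted.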
