HG002 QC Snakemake

To Run

Resources and data are specified within the Snakefile (hg002QC.smk) for simplicity. Tested with Snakemake v6.15.3.

Warning: Several steps of this workflow require a minimum coverage. It is recommended not to run this workflow when the yield in base pairs is insufficient to produce at least 15X coverage (i.e., run it only when yield/3099922541 >= 15).
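
The coverage check in the warning above can be sketched in Python as follows. The function names are hypothetical and not part of the workflow; the genome size and the 15X threshold are the values quoted in the warning.

```python
# Genome size (bp) and minimum coverage quoted in the warning above.
GENOME_SIZE_BP = 3_099_922_541
MIN_COVERAGE = 15.0


def estimated_coverage(yield_bp, genome_size_bp=GENOME_SIZE_BP):
    """Approximate fold coverage as total yield divided by genome size."""
    return yield_bp / genome_size_bp


def has_sufficient_yield(yield_bp, min_coverage=MIN_COVERAGE):
    """Return True when the yield reaches the recommended minimum coverage."""
    return estimated_coverage(yield_bp) >= min_coverage
```

For example, `has_sufficient_yield(30_000_000_000)` is False because 30 Gb corresponds to slightly less than 10X.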

# clone repo
git clone --recursive https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake.git workflow

# make necessary directories
mkdir cluster_logs

# create conda environment
conda env create --file workflow/environment.yaml

# activate conda environment
conda activate pb-human-wgs-workflow

# submit job
sbatch workflow/run_hg002QC.sh

Plots

A list of important stats from target files that would be good for plotting.

targets = [f"conditions/{condition}/{filename}"
                    for condition in ubam_dict.keys()
                    for filename in ["smrtcell_stats/all_movies.read_length_and_quality.tsv",
                                    "hifiasm/asm.p_ctg.fasta.stats.txt",
                                    "hifiasm/asm.a_ctg.fasta.stats.txt",
                                    "hifiasm/asm.p_ctg.qv.txt",
                                    "hifiasm/asm.a_ctg.qv.txt",
                                    "truvari/summary.txt",
                                    "pbsv/all_chroms.pbsv.vcf.gz",
                                    "deepvariant/deepvariant.vcf.stats.txt",
                                    "whatshap/deepvariant.phased.tsv",
                                    "happy/all.summary.csv",
                                    "happy/all.extended.csv",
                                    "happy/cmrg.summary.csv",
                                    "happy/cmrg.extended.csv",
                                    "mosdepth/coverage.mosdepth.summary.txt",
                                    "mosdepth/mosdepth.M2_ratio.txt",
                                    "mosdepth/gc_coverage.summary.txt",
                                    "mosdepth/coverage.thresholds.summary.txt"]]
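
To illustrate how the comprehension above expands, here is a minimal sketch with a hypothetical two-condition `ubam_dict` and a shortened filename list. The real `ubam_dict` is built inside hg002QC.smk; the condition names and BAM paths below are invented for illustration.

```python
# Hypothetical stand-in for the ubam_dict built in the Snakefile:
# keys are condition names, values are lists of unaligned BAM paths.
ubam_dict = {
    "conditionA": ["a1.hifi_reads.bam"],
    "conditionB": ["b1.hifi_reads.bam", "b2.hifi_reads.bam"],
}

# Shortened list; the full list is shown above.
filenames = ["truvari/summary.txt", "happy/all.summary.csv"]

# One target path per (condition, filename) pair, conditions in the outer loop.
targets = [f"conditions/{condition}/{filename}"
           for condition in ubam_dict.keys()
           for filename in filenames]

for target in targets:
    print(target)
```

Each condition therefore gets its own copy of every stats file under `conditions/<condition>/`, which is what makes per-condition comparison plots possible.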

smrtcell_stats/all_movies.read_length_and_quality.tsv
- outputs 3 columns (read name, read length, read quality)
- boxplots of read length and quality
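
A minimal sketch of the numbers a boxplot would need from this file, assuming it is tab-separated with read name, read length, and read quality in that order. Whether the file has a header line is not stated here, so non-numeric rows are simply skipped. The function names are hypothetical.

```python
import csv
import io
import statistics


def five_number_summary(values):
    """Return the min, quartiles, and max that a boxplot would draw."""
    # statistics.quantiles needs at least two values on older Python versions.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def summarize_reads(handle):
    """Summarize read length (column 2) and read quality (column 3)."""
    lengths, qualities = [], []
    for row in csv.reader(handle, delimiter="\t"):
        try:
            length, quality = int(row[1]), float(row[2])
        except (IndexError, ValueError):
            continue  # skip blank lines or a header, if one is present
        lengths.append(length)
        qualities.append(quality)
    return {"length": five_number_summary(lengths),
            "quality": five_number_summary(qualities)}
```

Usage would be along the lines of `with open(path) as handle: summary = summarize_reads(handle)`, run once per condition.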
hifiasm/asm.p_ctg.fasta.stats.txt (primary) + hifiasm/asm.a_ctg.fasta.stats.txt (alternate)
- all stats below should be collected for both primary (p_ctg) and alternate (a_ctg) assemblies
- assembly size awk '$1=="SZ" {print $2}' <filename>
- auN (area under the curve) awk '$1=="AU" {print $2}' <filename>
- NGx - line plot of NG10 through NG90 awk '$1=="NL" {print $2, $3}' <filename> ($2 is x-axis, $3 is y-axis) like this: example plot
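
The three awk commands above can be combined into one pass over the stats file. This is a sketch that assumes whitespace-separated lines tagged SZ, AU, and NL as described above; the function name and the returned dictionary layout are hypothetical.

```python
def parse_assembly_stats(lines):
    """Collect assembly size, auN, and the (x, length) points of the NL lines."""
    stats = {"size": None, "aun": None, "ngx": []}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "SZ":
            stats["size"] = int(fields[1])
        elif fields[0] == "AU":
            stats["aun"] = float(fields[1])
        elif fields[0] == "NL":
            # Column 2 is the x-axis value, column 3 the y-axis value.
            stats["ngx"].append((int(fields[1]), int(fields[2])))
    return stats
```

Calling it on both `asm.p_ctg.fasta.stats.txt` and `asm.a_ctg.fasta.stats.txt` gives two `ngx` series that can be drawn on the same line plot.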
hifiasm/asm.p_ctg.qv.txt + hifiasm/asm.a_ctg.qv.txt
- adjusted assembly quality awk '$1=="QV" {print $3}' <filename> for primary and alternate assemblies
truvari/summary.txt
- structural variant recall jq .recall <filename>
- structural variant precision jq .precision <filename>
- structural variant f1 jq .f1 <filename>
- number of calls jq '."call cnt"' <filename>
- FP jq .FP <filename>
- TP-call jq '."TP-call"' <filename>
- FN jq .FN <filename>
- TP-base jq '."TP-base"' <filename>
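
As an alternative to running jq once per value, the summary can be read in one step. This sketch assumes the file is a single JSON object containing the keys listed above; the function name is hypothetical.

```python
import json

# Keys listed above; note that some contain a space or a hyphen.
TRUVARI_KEYS = ["recall", "precision", "f1", "call cnt",
                "FP", "TP-call", "FN", "TP-base"]


def truvari_metrics(text):
    """Extract the metrics of interest from the Truvari summary JSON text."""
    summary = json.loads(text)
    # Missing keys are reported as None rather than raising an error.
    return {key: summary.get(key) for key in TRUVARI_KEYS}
```

Usage: `truvari_metrics(open(path).read())`. Dictionary access avoids the quoting required in jq for keys such as `TP-call`.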
pbsv/all_chroms.pbsv.vcf.gz
- counts of each type of variant bcftools query -i 'FILTER=="PASS"' -f '%INFO/SVTYPE\n' <filename> | awk '{A[$1]++}END{for(i in A)print i,A[i]}'
- can also do size distributions of indels bcftools query -i 'FILTER=="PASS" && (INFO/SVTYPE=="INS" | INFO/SVTYPE=="DEL")' -f '%INFO/SVTYPE\t%INFO/SVLEN\n' <filename>
deepvariant/deepvariant.vcf.stats.txt
- several values in lines starting with 'SN' awk '$1=="SN"' <filename>
  - number of SNPs
  - number of INDELs
  - number of multi-allelic sites
  - number of multi-allelic SNP sites
- ratio of transitions to transversions awk '$1=="TSTV" {print $5}' <filename>
- can monitor substitution types awk '$1=="ST"' <filename>
- SNP heterozygous : non-ref homozygous ratio awk '$1=="PSC" {print $6/$5}' <filename>
- SNP transitions : transversions awk '$1=="PSC" {print $7/$8}' <filename>
- Number of heterozygous insertions : number of homozygous alt insertions awk '$1=="PSI" {print $8/$10}' <filename>
- Number of heterozygous deletions : number of homozygous alt deletions awk '$1=="PSI" {print $9/$11}' <filename>
- Total INDEL heterozygous:homozygous ratio awk '$1=="PSI" {print ($8+$9)/($10+$11)}' <filename>
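
The PSC and PSI ratios above can be computed together. This sketch mirrors the awk column numbers (awk's `$N` is index `N-1` in Python) and assumes a single-sample, tab-separated stats file; the function name is hypothetical, and zero denominators are not guarded against.

```python
def per_sample_ratios(lines):
    """Compute the PSC and PSI ratios listed above from bcftools stats lines."""
    ratios = {}
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if f[0] == "PSC":
            ratios["snp_het_hom"] = int(f[5]) / int(f[4])   # $6 / $5
            ratios["snp_ts_tv"] = int(f[6]) / int(f[7])     # $7 / $8
        elif f[0] == "PSI":
            # $8, $9, $10, $11 in awk numbering
            ins_het, del_het, ins_hom, del_hom = (int(x) for x in f[7:11])
            ratios["ins_het_hom"] = ins_het / ins_hom
            ratios["del_het_hom"] = del_het / del_hom
            ratios["indel_het_hom"] = (ins_het + del_het) / (ins_hom + del_hom)
    return ratios
```

Header lines in the stats file begin with `#`, so they never match `PSC` or `PSI` in the first field and are ignored.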
whatshap/deepvariant.phased.tsv
- phase block N50 awk '$2=="ALL" {print $22}' <filename>
- bp_per_block_sum (total number of phased bases) awk '$2=="ALL" {print $18}' <filename>
whatshap/deepvariant.phased.blocklist
- calculate phase block size (to - from) and sort in descending order (awk 'NR>1 {print $5-$4}' <filename> | sort -nr), then plot as a cumulative line graph like for assembly, N0 to N90 (example plot)
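
The cumulative curve described above can be sketched as follows. It assumes the block list has one header line and that columns 4 and 5 hold the block start and end, as in the awk command; the function names are hypothetical.

```python
def block_sizes(lines):
    """Return phase block sizes (to - from), skipping the header line."""
    sizes = []
    for line in list(lines)[1:]:
        fields = line.split()
        if len(fields) >= 5:
            sizes.append(int(fields[4]) - int(fields[3]))
    return sizes


def nx_curve(sizes, percents=range(0, 100, 10)):
    """For each x, find the block size at which the running total of the
    descending-sorted sizes first reaches x% of the overall total."""
    ordered = sorted(sizes, reverse=True)
    total = sum(ordered)
    curve = []
    for x in percents:
        threshold = total * x / 100
        cumulative = 0
        for size in ordered:
            cumulative += size
            if cumulative >= threshold:
                curve.append((x, size))
                break
    return curve
```

With this definition N0 is the largest block and N50 is the usual phase block N50, so the resulting points can be drawn the same way as the assembly NGx plot.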
happy/all.summary.csv + happy/cmrg.summary.csv
- stats should be collected for all variants and for CMRG (challenging medically relevant genes)
  - SNP recall awk -F, '$1=="SNP" && $2=="PASS" {print $10}' <filename>
  - SNP precision awk -F, '$1=="SNP" && $2=="PASS" {print $11}' <filename>
  - SNP F1 awk -F, '$1=="SNP" && $2=="PASS" {print $13}' <filename>
  - INDEL recall awk -F, '$1=="INDEL" && $2=="PASS" {print $10}' <filename>
  - INDEL precision awk -F, '$1=="INDEL" && $2=="PASS" {print $11}' <filename>
  - INDEL F1 awk -F, '$1=="INDEL" && $2=="PASS" {print $13}' <filename>
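
Column positions in the summary CSV can shift between hap.py versions, so a sketch that selects columns by header name may be more robust than the positional awk commands. It assumes the header names `Type`, `Filter`, `METRIC.Recall`, `METRIC.Precision`, and `METRIC.F1_Score`; the function name is hypothetical.

```python
import csv
import io


def happy_summary_metrics(handle, variant_filter="PASS"):
    """Return recall, precision, and F1 per variant type for one filter."""
    metrics = {}
    for row in csv.DictReader(handle):
        if row["Filter"] != variant_filter:
            continue
        metrics[row["Type"]] = {
            "recall": float(row["METRIC.Recall"]),
            "precision": float(row["METRIC.Precision"]),
            "f1": float(row["METRIC.F1_Score"]),
        }
    return metrics
```

Running it on both `happy/all.summary.csv` and `happy/cmrg.summary.csv` yields the SNP and INDEL values for each benchmark set.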
happy/all.extended.csv + happy/cmrg.extended.csv
- there are many stratifications that can be examined, and Aaron Wenger might have opinions on which are most important. The commands below are for just one stratification, "GRCh38_lowmappabilityall.bed.gz".
- SNP GRCh38_lowmappabilityall recall awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
- SNP GRCh38_lowmappabilityall precision awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
- SNP GRCh38_lowmappabilityall F1 awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
- INDEL GRCh38_lowmappabilityall recall awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
- INDEL GRCh38_lowmappabilityall precision awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
- INDEL GRCh38_lowmappabilityall F1 awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
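
Because there are many stratifications, it may be convenient to load all of them at once and choose which to plot afterwards. This sketch assumes the extended CSV has header names `Type`, `Subtype`, `Subset`, `Filter`, `METRIC.Recall`, `METRIC.Precision`, and `METRIC.F1_Score`; the function name is hypothetical.

```python
import csv
import io


def stratified_metrics(handle, variant_type, variant_filter="PASS", subtype="*"):
    """Map each stratification (Subset) to recall, precision, and F1."""
    results = {}
    for row in csv.DictReader(handle):
        key = (row["Type"], row["Subtype"], row["Filter"])
        if key != (variant_type, subtype, variant_filter):
            continue
        try:
            results[row["Subset"]] = {
                "recall": float(row["METRIC.Recall"]),
                "precision": float(row["METRIC.Precision"]),
                "f1": float(row["METRIC.F1_Score"]),
            }
        except ValueError:
            continue  # metrics left empty for this stratification
    return results
```

The single-stratification values above then correspond to `stratified_metrics(handle, "SNP")["GRCh38_lowmappabilityall.bed.gz"]`.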
mosdepth/coverage.mosdepth.summary.txt
- mean aligned coverage is the 4th column of the final row; can grep 'total_region'
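
A sketch for pulling the mean coverage out of the summary, assuming tab-separated columns with the mean in the 4th column. Falling back to a row named `total` when `total_region` is absent is an assumption about mosdepth's output rather than something stated here; the function name is hypothetical.

```python
def mean_coverage(lines):
    """Return the mean coverage from the summary's total row, or None."""
    means = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0] == "chrom":
            continue  # skip the header and malformed lines
        means[fields[0]] = float(fields[3])
    # Prefer the row named above; the 'total' fallback is an assumption.
    for key in ("total_region", "total"):
        if key in means:
            return means[key]
    return None
```

One value per condition is enough for a simple bar chart of mean aligned coverage.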
mosdepth/mosdepth.M2_ratio.txt
- outputs single value: ratio of chr2 coverage to chrM coverage
- bar chart of M2 ratio
mosdepth/gc_coverage.summary.txt
- outputs 5 columns: gc percentage bin, q1 , median , q3 , count
- q1, median, q3 columns are statistics for coverage at different GC percentages (e.g. median coverage at 30% GC)
- "count" refers to # of 500 bp windows that fall in that bin
- can pick a couple of key GC coverage bins and make box plots out of them
mosdepth/coverage.thresholds.summary.txt
- outputs 10 columns corresponding to % of genome sequenced to minimum coverage depths (1X - 10X)
- maybe a line chart comparing the different coverage thresholds among conditions

Owner

Juniper A. Lake