HG002 QC Snakemake

To Run

Resources and data specified within snakefile (hg002QC.smk) for simplicity. Tested with snakemake v6.15.3.

Warning: Several steps of this workflow require minimum coverage. It's recommended that this workflow not be run when yield in base pairs is insufficient to produceat least 15X coverage (i.e. yield/3099922541 >= 15x).

# clone repo
git clone --recursive https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake.git workflow

# make necessary directories
mkdir cluster_logs

# create conda environment
conda env create --file workflow/environment.yaml

# activate conda environment
conda activate pb-human-wgs-workflow

# submit job
sbatch workflow/run_hg002QC.sh

Plots

A list of important stats from target files that would be good for plotting.

targets = [f"conditions/{condition}/{filename}"
                    for condition in ubam_dict.keys()
                    for filename in ["smrtcell_stats/all_movies.read_length_and_quality.tsv",
                                    "hifiasm/asm.p_ctg.fasta.stats.txt",
                                    "hifiasm/asm.a_ctg.fasta.stats.txt",
                                    "hifiasm/asm.p_ctg.qv.txt",
                                    "hifiasm/asm.a_ctg.qv.txt",
                                    "truvari/summary.txt",
                                    "pbsv/all_chroms.pbsv.vcf.gz",
                                    "deepvariant/deepvariant.vcf.stats.txt",
                                    "whatshap/deepvariant.phased.tsv",
                                    "happy/all.summary.csv",
                                    "happy/all.extended.csv",
                                    "happy/cmrg.summary.csv",
                                    "happy/cmrg.extended.csv",
                                    "mosdepth/coverage.mosdepth.summary.txt",
                                    "mosdepth/mosdepth.M2_ratio.txt",
                                    "mosdepth/gc_coverage.summary.txt",
                                    "mosdepth/coverage.thresholds.summary.txt"]]

smrtcell_stats/all_movies.read_length_and_quality.tsv
- outputs 3 columns (read name, read length, read quality)
- boxplots of read length and quality
hifiasm/asm.p_ctg.fasta.stats.txt (primary) + hifiasm/asm.a_ctg.fasta.stats.txt (alternate)
- all stats below should be collected for both primary (p_ctg) and alternate (p_atg) assemblies
- assembly size awk '$1=="SZ" {print $2}' <filename>
- auN (area under the curve) awk '$1=="AU" {print $2}' <filename>
- NGx - line plot of NG10 through NG90 awk '$1=="NL" {print $2 $3}' <filename> ($2 is x-axis, $3 y-axis) like this: example plot
hifiasm/asm.p_ctg.qv.txt + hifiasm/asm.a_ctg.qv.txt
- adjusted assembly quality awk '$1=="QV" {print $3}' <filename> for primary and alternate assemblies
truvari/truvari.summary.txt
- structural variant recall jq .recall <filename>
- structural variant precision jq .precision <filename>
- structural variant f1 jq .f1 <filename>
- number of calls jq '."call cnt"' <filename>
- FP jq .FP <filename>
- TP-call jq .TP-call <filename>
- FN jq .FN <filename>
- TP-base jq .TP-base <filename>
pbsv/all_chroms.pbsv.vcf.gz
- counts of each type of variant bcftools query -i 'FILTER=="PASS"' -f '%INFO/SVTYPE\n' <filename> | awk '{A[$1]++}END{for(i in A)print i,A[i]}'
- can also do size distributions of indels bcftools query -i 'FILTER=="PASS" && (INFO/SVTYPE=="INS" | INFO/SVTYPE=="DEL")' -f '%INFO/SVTYPE\t%INFO/SVLEN\n' <filename>
deepvariant/deepvariant.vcf.stats.txt
- several values in lines starting with 'SN' awk '$1=="SN"' <filename>
  - number of SNPS
  - number INDELs
  - number of multi-allelic sites
  - number of multi-allelic SNP sites
- ratio of transitions to transversions awk '$1=="TSTV" {print$5}' <filename>
- can monitor substitution types awk '$1=="ST"' <filename>
- SNP heterozygous : non-ref homozygous ratio awk '$1=="PSC" {print $6/$5}' <filename>
- SNP transitions : transversions awk '$1=="PSC" {print $7/$8}' <filename>
- Number of heterozygous insertions : number of homozgyous alt insertions awk '$1=="PSI" {print $8/$10}' <filename>
- Number of heterozygous deletions : number of homozgyous alt deletions awk '$1=="PSI" {print $9/$11}' <filename>
- Total INDEL heterozygous:homozygous ratio awk '$1=="PSI" {print ($8+$9)/($10+$11)}' <filename>8+9:10+11 indel het:hom)
whatshap/deepvariant.phased.tsv
- phase block N50 awk '$2=="ALL" {print $22}' <filename>
- bp_per_block_sum (total number of phased bases) awk '$2=="ALL" {print $18}' <filename>
whatshap/deepvariant.phased.blocklist
- calculate phase block size (to - from) and reverse order them (awk 'NR>1 {print $5-$4}' <filename> |sort -nr), then plot as cumulative line graph like for assembly, N_0 to N90 example plot
happy/all.summary.csv + happy/cmrg.summary.csv
- stats should be collected for all variants and cmrg challenging medically relevant genes
  - SNP recall awk -F, '$1=="SNP" && $2=="PASS" {print $10}' <filename>
  - SNP precision awk -F, '$1=="SNP" && $2=="PASS" {print $11}' <filename>
  - SNP F1 awk -F, '$1=="SNP" && $2=="PASS" {print $13}' <filename>
  - INDEL recall awk -F, '$1=="INDEL" && $2=="PASS" {print $10}' <filename>
  - INDEL precision awk -F, '$1=="INDEL" && $2=="PASS" {print $11}' <filename>
  - INDEL F1 awk -F, '$1=="INDEL" && $2=="PASS" {print $13}' <filename>
happy/all.extended.csv + happy/cmrg.extended.csv
- there are many stratifications that can be examined, and Aaron Wenger might have opinionso n which are most important. The below commands are just for one stratification "GRCh38_lowmappabilityall.bed.gz".
- SNP GRCh38_lowmappabilityall recall awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
- SNP GRCh38_lowmappabilityall precision awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
- SNP GRCh38_lowmappabilityall F1 awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
- INDEL GRCh38_lowmappabilityall recall awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
- INDEL GRCh38_lowmappabilityall precision awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
- INDEL GRCh38_lowmappabilityall F1 awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
mosdepth/coverage.mosdepth.summary.txt
- mean aligned coverage in "coverage.mosdepth.summary.txt" - 4th column of final row, can grep 'total_region'
mosdepth/mosdepth.M2_ratio.txt
- outputs single value: ratio of chr2 coverage to chrM coverage
- bar chart of m2 ratio
mosdepth/gc_coverage.summary.txt
- outputs 5 columns: gc percentage bin, q1 , median , q3 , count
- q1, median, q3 columns are statistics for coverage at different gc percentages (e.g. median cover at 30% GC)
- "count" refers to # of 500 bp windows that fall in that bin
- can pick a couple of key GC coverage bins and make box plots out of them
mosdepth/coverage.thresholds.summary.txt
- outputs 10 columns corresponding to % of genome sequenced to minimum coverage depths (1X - 10X)
- maybe a line chart comparing the different coverage thresholds among conditions

Hg002-qc-snakemake - HG002 QC Snakemake

Related tags

Overview

HG002 QC Snakemake

To Run

Plots

Owner

Juniper A. Lake

pipeline for migrating lichess data into postgresql

Implementation in Python of the reliability measures such as Omega.

A collection of robust and fast processing tools for parsing and analyzing web archive data.

Feature engineering and machine learning: together at last

Processo de ETL (extração, transformação, carregamento) realizado pela equipe no projeto final do curso da Soul Code Academy.

Demonstrate a Dataflow pipeline that saves data from an API into BigQuery table

Basis Set Format Converter

pyETT: Python library for Eleven VR Table Tennis data

4CAT: Capture and Analysis Toolkit

Python data processing, analysis, visualization, and data operations

Catalogue data - A Python Scripts to prepare catalogue data

Picka: A Python module for data generation and randomization.

A powerful data analysis package based on mathematical step functions. Strongly aligned with pandas.

Display the behaviour of a realtime program with a scope or logic analyser.

Yet Another Workflow Parser for SecurityHub

Probabilistic reasoning and statistical analysis in TensorFlow

A data structure that extends pyspark.sql.DataFrame with metadata information.

Python for Data Analysis, 2nd Edition

a tool that compiles a csv of all h1 program stats

ped-crash-techvol: Texas Ped Crash Tech Volume Pack