SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

Overview

SNV Pipeline

SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38). The pipeline requires user defined datasets & annotation sources, available tools and input set of vcf files. It generates analysis scripts that can be incorporated into high performance cluster (HPC) computing to process the samples. This results in list of filtered variants per family that can be used for interpreation, reporting and further downstream analysis.

For demonstration purpose below example is presented for GRCh37. However, the same can be replicated for GRCh38.

Installation

git clone https://github.com/ajaarma/snv.git

Required Installation packages

Install anaconda v2.0
Follow this link for installation: https://docs.anaconda.com/anaconda/install/linux/
Conda environment commands
$ conda create --name snv
$ source activate snv
$ conda install python=2.7.16
$ pip install xmltodict
$ pip install dicttoxml

$ conda install -c bioconda gvcfgenotyper
$ conda install -c anaconda gawk	
$ conda install samtools=1.3
$ conda install vcftools=0.1.14
$ conda install bcftools=1.9
$ conda install gcc #(OSX)
$ conda install gcc_linux-64 #(Linux)
$ conda install parallel
$ conda install -c mvdbeek ucsc_tools
** conda-develop -n 
    
    
     /demo/softwares/vep/Plugins/

$ conda install -c r r-optparse
$ conda install -c r r-dplyr
$ conda install -c r r-plyr
$ conda install -c r r-data.table
$ conda install -c aakumar r-readbulk
$ conda install -c bioconda ensembl-vep=100.4
$ vep_install -a cf -s homo_sapiens -y GRCh37 -c 
     
      /demo/softwares/vep/grch37 --CONVERT
$ vep_install -a cf -s homo_sapiens -y GRCh38 -c 
      
       /demo/softwares/vep/grch38 --CONVERT

      
     
    
   

Data directory and datasets

Default datasets provided
1. exac_pli: demo/resources/gnomad/grch37/gnomad.v2.1.1.lof_metrics.by_transcript_forVEP.txt
2. ensembl: demo/resources/ensembl/grch37/ensBioMart_grch37_v98_ENST_lengths_191208.txt
3. region-exons: demo/resources/regions/grch37/hg19_refseq_ensembl_exons_50bp_allMT_hgmd_clinvar_20200519.txt
4. region-pseudo-autosomal: demo/resources/regions/grch37/hg19_non_pseudoautosomal_regions_X.txt
5. HPO: demo/resources/hpo/phenotype_to_genes.tar.gz
Other datasets that require no entry to user-configuration file
6. Curated: 
	6.1. Genelist: demo/resources/curated/NGC_genelist_allNamesOnly-20200519.txt
	6.2. Somatic mosaicism genes: demo/resources/curated/haem_somatic_mosaicism_genes_20191015.txt
	6.3. Imprinted gene list: demo/resources/curated/imprinted_genes_20200424.txt
	6.4. Polymorphic gene list: demo/resources/curated/polymorphic_genes_20200509.txt
7. OMIM: demo/resources/omim/omim_20200421_geneInfoBase.txt

Download link for following dataset and place them in corresponding directories as shown

' | awk -v OFS="\t" '{ if(/^#/){ print }else{ print $1,$2,$3,$4,$5,$6,$7,"ID="$3";"$8 } }' | bgzip -c > hgmd_pro_2019.4_hg19_wID.vcf.gz $ bcftools index -t hgmd_pro_2019.4_hg19_wID.vcf.gz $ bcftools index hgmd_pro_2019.4_hg19_wID.vcf.gz Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz Edit the user config flat file CONFIG/UserConfig.txt : hgmd=hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz 8. CLINVAR: Download link: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz.tbi Put it in this directory: demo/resources/clinvar/grch37/clinvar_20200506.vcf.gz Edit the user config flat file CONFIG/UserConfig.txt : clinvar=clinvar/grch37/clinvar_20200506.vcf.gz ">
1. HPO: Extract HPO phenotypes mapping:
	$ cd 
   
    /demo/resources/hpo/
	$ tar -zxvf phenotypes_to_genes.tar.gz 

2. REFERENCE SEQUENCE GENOME (FASTA file alongwith Index)
	Download link: https://drive.google.com/drive/folders/1Ro3pEYhVdYkMmteSr8YRPFeTvb_K0lVf?usp=sharing
	Download file: Homo_sapiens.GRCh37.74.dna.fasta
		Get the corresponding index and dict files: *.fai and *.dict
	Put this in folder: demo/resources/genomes/grch37/Homo_sapiens.GRCh37.74.dna.fasta

3. GNOMAD
	Download link (use wget): 
	Genomes: https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz
	Exomes: https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz
	Put it in this folder: 
		demo/resources/gnomad/grch37/gnomad.genomes.r2.1.1.sites.vcf.bgz
		demo/resources/gnomad/grch37/gnomad.exomes.r2.1.1.sites.vcf.bgz
	Edit User config flat file CONFIG/UserConfig.txt : 
		gnomad_g=gnomad/grch37/gnomad.genomes.r2.1.1.sites.vcf.bgz
		gnomad_e=gnomad/grch37/gnomad.exomes.r2.1.1.sites.vcf.bgz

4. ExAC:
	Download Link: https://drive.google.com/drive/folders/11Ya8XfAxOYmlKZ9mN8A16IDTLHdHba_0?usp=sharing
	Download file: ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
		also the index files (*.csi and *.tbi)
	Put it in this folder as: 
		demo/resources/exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
	Edit User config flat file CONFIG/UserConfig.txt : 
		exac=exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
		exac_t=exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz

5. CADD:
	Download link (use wget):
		https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/whole_genome_SNVs.tsv.gz
		https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/InDels.tsv.gz
		(Also download the corresponding tabix index files as well)
	Put it in this directory: 
		demo/resources/cadd/grch37/whole_genome_SNVs.tsv.gz
		demo/resource/cadd/grch37/InDels.tsv.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		cadd_snv=cadd/grch37/whole_genome_SNVs.tsv.gz
		cadd_indel=cadd/grch37/InDels.tsv.gz

6. REVEL:
	Download link: https://drive.google.com/drive/folders/12Tl1YU5bI-By_VawTPVWHef7AXzn4LuP?usp=sharing
	Download file: new_tabbed_revel.tsv.gz
	         Also the index file: *.tbi
	Put it in this directory: demo/resources/revel/grch37/new_tabbed_revel.tsv.gz
	Edit the user config flat file CONFIG/UserConfig.txt : 
		revel=revel/grch37/new_tabbed_revel.tsv.gz

7. HGMD:
	Download link: http://www.hgmd.cf.ac.uk/ac/index.php (Require personal access login)
	Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19.vcf

	Use this command to process HGMD file inside this directory:
		$ cat hgmd_pro_2019.4_hg19.vcf | sed '/##comment=.*\"/a  ##INFO=
    
     ' | awk -v OFS="\t" '{ if(/^#/){ print }else{ print $1,$2,$3,$4,$5,$6,$7,"ID="$3";"$8 } }' | bgzip -c  > hgmd_pro_2019.4_hg19_wID.vcf.gz
		$ bcftools index -t hgmd_pro_2019.4_hg19_wID.vcf.gz
		$ bcftools index hgmd_pro_2019.4_hg19_wID.vcf.gz	

	Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		hgmd=hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz

8. CLINVAR:
	Download link: 
		https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz
		https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz.tbi
	Put it in this directory: demo/resources/clinvar/grch37/clinvar_20200506.vcf.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		clinvar=clinvar/grch37/clinvar_20200506.vcf.gz

    
   
Customized Curated Annotation sets
Default present with this distribution. Can be found in XML file with these tags:
	(1) GeneList: 
   
    
	(2) Somatic mosaicism genes: 
    
     
	(3) Imprinted genes: 
     
      
	(4) Polymorphic genes: 
      
       
	(5) HPO terms: 
       
         (6) OMIM: 
         
        
       
      
     
    
   

Activate the conda environment

$ source activate snv

Step - 1:

1. Edit CONFIG/UserConfig.txt: 
	(a) Add the absolute path prefix for the resources directory with tag: resourceDir. 
	    An example can be seen in CONFIG/Example-UserConfig.txt file.
	(b) Manually check if datasets corresponding to other field tags are correctly downloaded and 
	    put in respective folders.
2. Create user defined XML file from input User Configuration flat file and Base-XML file
Command:
$ python createAnalysisXML.py -u 
   
     
		    	      -b 
    
      
		              -o 
     

     
    
   
Example:
$ python createAnalysisXML.py -u CONFIG/UserConfig.txt 
		       	      -b CONFIG/Analysis_base_grch37.xml 
		              -o CONFIG/Analysis_user_grch37.xml
Outputs:
CONFIG/Analysis_user_grch37.xml

Step-2:

1. Put the respective vcf files in the directory. For example: demo/example/vcf/ 
2. Create manifest file in same format as shown in demo/example/example_manifest.txt
3. Assign gender to each family members (illumina or sample id). For example: demo/example/example_genders.txt
4. List of all the family ids that needed to be analyzed.
         For e.g: demo/example/manifest/example_family_analysis.txt

Step -3:

Generate all the shell scripts that can be incorporated into user specific HPC cluster network. For e.g: Slurm/PBS/LSF network.

Command:
$ python processSNV.py 	-a 
   
    
	    		-p 
    
     
	      		-m 
     
      
	     		-e 
      
       
			-w 
       
         -g 
        
          -d 
         
           -f 
          
            -s 
           
             -r 
             
            
           
          
         
        
       
      
     
    
   
Example:
$ python processSNV.py 	-a CONFIG/Analysis_user_grch37.xml \ 
			-p 20210326 \
			-m 
   
    /demo/example/example_manifest.txt \
			-e gvcfGT \
			-w 
    
     /demo/example/ \
			-g 
     
      /demo/example/example_genders.txt \
			-d 
      
       /demo/example/exeter_samples_norm.fof \
			-f 
       
        /demo/example/manifest/example_family_analysis.txt \ -s 
        
         /demo/example/manifest/example_family.fof \ -r 
         
          /demo/example/manifest/example_family_header.txt (optional) 
         
        
       
      
     
    
   
Outputs:
Two scripts in the directory: 
   
    /demo/example/20210326/tmp_binaries/
Launch the scripts in these 2 stages sequentially after each of them gets finished.

   (1) genotypeAndAnnotate_chr%.sh where %=1..22,X,Y and MT
	scatter the annotation and frequency filtering per chromosome for all families.
   (2) mergeAndFilter.sh:
	Merge all the chromosome and apply inheritance filtering.

   

Step-4

Final output of list of filtered variant is present in:

   
    /demo/example/20210326/fam_filter/
    
     /
     
      .filt_
      
       .txt

      
     
    
   
For any questions/issues/bugs please mail us at:
Owner
East Genomics
Bringing together genomic medicine across the East Midlands and East of England
East Genomics
A Python and R autograding solution

Otter-Grader Otter Grader is a light-weight, modular open-source autograder developed by the Data Science Education Program at UC Berkeley. It is desi

Infrastructure Team 93 Jan 03, 2023
pipeline for migrating lichess data into postgresql

How Long Does It Take Ordinary People To "Get Good" At Chess? TL;DR: According to 5.5 years of data from 2.3 million players and 450 million games, mo

Joseph Wong 182 Nov 11, 2022
Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

10k Jan 01, 2023
pandas: powerful Python data analysis toolkit

pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.

pandas 36.4k Jan 03, 2023
University Challenge 2021 With Python

University Challenge 2021 This repository contains: The TeX file of the technical write-up describing the University / HYPER Challenge 2021 under late

2 Nov 27, 2021
Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Theano

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning focusing on advanced Markov chain Monte Carlo (MCMC) an

PyMC 7.2k Dec 30, 2022
PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

1 Feb 07, 2022
Pandas and Dask test helper methods with beautiful error messages.

beavis Pandas and Dask test helper methods with beautiful error messages. test helpers These test helper methods are meant to be used in test suites.

Matthew Powers 18 Nov 28, 2022
Desafio proposto pela IGTI em seu bootcamp de Cloud Data Engineer

Desafio Modulo 4 - Cloud Data Engineer Bootcamp - IGTI Objetivos Criar infraestrutura como código Utuilizando um cluster Kubernetes na Azure Ingestão

Otacilio Filho 4 Jan 23, 2022
MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.

MetPy MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data. MetPy follows semantic versioni

Unidata 971 Dec 25, 2022
Orchest is a browser based IDE for Data Science.

Orchest is a browser based IDE for Data Science. It integrates your favorite Data Science tools out of the box, so you don’t have to. The application is easy to use and can run on your laptop as well

Orchest 3.6k Jan 09, 2023
TextDescriptives - A Python library for calculating a large variety of statistics from text

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistic

150 Dec 30, 2022
Cleaning and analysing aggregated UK political polling data.

Analysing aggregated UK polling data The tweet collection & storage pipeline used in email-service is used to also collect tweets from @britainelects.

Ajay Pethani 0 Dec 22, 2021
Pypeln is a simple yet powerful Python library for creating concurrent data pipelines.

Pypeln Pypeln (pronounced as "pypeline") is a simple yet powerful Python library for creating concurrent data pipelines. Main Features Simple: Pypeln

Cristian Garcia 1.4k Dec 31, 2022
SparseLasso: Sparse Solutions for the Lasso

SparseLasso: Sparse Solutions for the Lasso Introduction SparseLasso provides a Scikit-Learn based estimation of the Lasso with cross-validation tunin

Gabriel Okasa 1 Nov 08, 2021
Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle.

2019-indian-election-eda Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle. This project is a part of the Cou

Souradeep Banerjee 5 Oct 10, 2022
Udacity-api-reporting-pipeline - Udacity api reporting pipeline

udacity-api-reporting-pipeline In this exercise, you'll use portions of each of

Fabio Barbazza 1 Feb 15, 2022
Leverage Twitter API v2 to analyze tweet metrics such as impressions and profile clicks over time.

Tweetmetric Tweetmetric allows you to track various metrics on your most recent tweets, such as impressions, retweets and clicks on your profile. The

Mathis HAMMEL 29 Oct 18, 2022
Data-sets from the survey and analysis

bachelor-thesis "Umfragewerte.xlsx" contains the orginal survey results. "umfrage_alle.csv" contains the survey results but one participant is cancele

1 Jan 26, 2022
Python package for analyzing sensor-collected human motion data

Python package for analyzing sensor-collected human motion data

Simon Ho 71 Nov 05, 2022