An SNV calling pipeline developed to process single-sample or trio VCF files produced by Illumina-based pipelines (GRCh37/GRCh38).

Overview

SNV Pipeline

An SNV calling pipeline developed to process single-sample or trio VCF files produced by Illumina-based pipelines (GRCh37/GRCh38). The pipeline requires user-defined datasets and annotation sources, the available tools, and an input set of VCF files. It generates analysis scripts that can be submitted to a high-performance computing (HPC) cluster to process the samples. The result is a list of filtered variants per family that can be used for interpretation, reporting and further downstream analysis.

For demonstration purposes, the examples below use GRCh37; the same steps can be replicated for GRCh38.

Installation

git clone https://github.com/ajaarma/snv.git

Required Installation packages

Install Anaconda v2.0
Follow this link for installation instructions: https://docs.anaconda.com/anaconda/install/linux/
Conda environment commands
$ conda create --name snv
$ source activate snv
$ conda install python=2.7.16
$ pip install xmltodict
$ pip install dicttoxml

$ conda install -c bioconda gvcfgenotyper
$ conda install -c anaconda gawk	
$ conda install samtools=1.3
$ conda install vcftools=0.1.14
$ conda install bcftools=1.9
$ conda install gcc #(OSX)
$ conda install gcc_linux-64 #(Linux)
$ conda install parallel
$ conda install -c mvdbeek ucsc_tools
** conda-develop -n <snv-environment-name> <path-prefix>/demo/softwares/vep/Plugins/

$ conda install -c r r-optparse
$ conda install -c r r-dplyr
$ conda install -c r r-plyr
$ conda install -c r r-data.table
$ conda install -c aakumar r-readbulk
$ conda install -c bioconda ensembl-vep=100.4
$ vep_install -a cf -s homo_sapiens -y GRCh37 -c <path-prefix>/demo/softwares/vep/grch37 --CONVERT
$ vep_install -a cf -s homo_sapiens -y GRCh38 -c <path-prefix>/demo/softwares/vep/grch38 --CONVERT

Data directory and datasets

Default datasets provided
1. exac_pli: demo/resources/gnomad/grch37/gnomad.v2.1.1.lof_metrics.by_transcript_forVEP.txt
2. ensembl: demo/resources/ensembl/grch37/ensBioMart_grch37_v98_ENST_lengths_191208.txt
3. region-exons: demo/resources/regions/grch37/hg19_refseq_ensembl_exons_50bp_allMT_hgmd_clinvar_20200519.txt
4. region-pseudo-autosomal: demo/resources/regions/grch37/hg19_non_pseudoautosomal_regions_X.txt
5. HPO: demo/resources/hpo/phenotype_to_genes.tar.gz
Other datasets that require no entry in the user-configuration file:
6. Curated: 
	6.1. Genelist: demo/resources/curated/NGC_genelist_allNamesOnly-20200519.txt
	6.2. Somatic mosaicism genes: demo/resources/curated/haem_somatic_mosaicism_genes_20191015.txt
	6.3. Imprinted gene list: demo/resources/curated/imprinted_genes_20200424.txt
	6.4. Polymorphic gene list: demo/resources/curated/polymorphic_genes_20200509.txt
7. OMIM: demo/resources/omim/omim_20200421_geneInfoBase.txt

Download links for the following datasets; place each one in the corresponding directory as shown:

1. HPO: Extract the HPO phenotype-to-genes mapping:
	$ cd <path-prefix>/demo/resources/hpo/
	$ tar -zxvf phenotype_to_genes.tar.gz

2. REFERENCE SEQUENCE GENOME (FASTA file along with index)
	Download link: https://drive.google.com/drive/folders/1Ro3pEYhVdYkMmteSr8YRPFeTvb_K0lVf?usp=sharing
	Download file: Homo_sapiens.GRCh37.74.dna.fasta
		Get the corresponding index and dict files: *.fai and *.dict
	Put this in the folder: demo/resources/genomes/grch37/Homo_sapiens.GRCh37.74.dna.fasta
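	If the downloaded index is missing, the .fai can be regenerated with samtools (installed in the environment above); this is a minimal sketch, and the .dict file should still be taken from the download link:
		$ cd <path-prefix>/demo/resources/genomes/grch37/
		$ samtools faidx Homo_sapiens.GRCh37.74.dna.fasta   # writes Homo_sapiens.GRCh37.74.dna.fasta.fai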

3. GNOMAD
	Download link (use wget): 
	Genomes: https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz
	Exomes: https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz
	Put it in this folder: 
		demo/resources/gnomad/grch37/gnomad.genomes.r2.1.1.sites.vcf.bgz
		demo/resources/gnomad/grch37/gnomad.exomes.r2.1.1.sites.vcf.bgz
	Edit User config flat file CONFIG/UserConfig.txt : 
		gnomad_g=gnomad/grch37/gnomad.genomes.r2.1.1.sites.vcf.bgz
		gnomad_e=gnomad/grch37/gnomad.exomes.r2.1.1.sites.vcf.bgz
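	For example, the genome and exome sites files can be fetched directly into the resources tree (a sketch using the URLs above; <path-prefix> is your resources prefix):
		$ cd <path-prefix>/demo/resources/gnomad/grch37/
		$ wget https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz
		$ wget https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz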

4. ExAC:
	Download Link: https://drive.google.com/drive/folders/11Ya8XfAxOYmlKZ9mN8A16IDTLHdHba_0?usp=sharing
	Download file: ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
		also the index files (*.csi and *.tbi)
	Put it in this folder as: 
		demo/resources/exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
	Edit User config flat file CONFIG/UserConfig.txt : 
		exac=exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
		exac_t=exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz

5. CADD:
	Download link (use wget):
		https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/whole_genome_SNVs.tsv.gz
		https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/InDels.tsv.gz
		(Also download the corresponding tabix index files.)
	Put these in this directory: 
		demo/resources/cadd/grch37/whole_genome_SNVs.tsv.gz
		demo/resources/cadd/grch37/InDels.tsv.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		cadd_snv=cadd/grch37/whole_genome_SNVs.tsv.gz
		cadd_indel=cadd/grch37/InDels.tsv.gz
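	A download sketch for CADD, assuming the tabix indexes follow the usual *.tsv.gz.tbi naming at the same URLs:
		$ cd <path-prefix>/demo/resources/cadd/grch37/
		$ wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/whole_genome_SNVs.tsv.gz
		$ wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/whole_genome_SNVs.tsv.gz.tbi
		$ wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/InDels.tsv.gz
		$ wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/InDels.tsv.gz.tbi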

6. REVEL:
	Download link: https://drive.google.com/drive/folders/12Tl1YU5bI-By_VawTPVWHef7AXzn4LuP?usp=sharing
	Download file: new_tabbed_revel.tsv.gz
	         Also the index file: *.tbi
	Put it in this directory: demo/resources/revel/grch37/new_tabbed_revel.tsv.gz
	Edit the user config flat file CONFIG/UserConfig.txt : 
		revel=revel/grch37/new_tabbed_revel.tsv.gz

7. HGMD:
	Download link: http://www.hgmd.cf.ac.uk/ac/index.php (Require personal access login)
	Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19.vcf

	Use this command to process HGMD file inside this directory:
		$ cat hgmd_pro_2019.4_hg19.vcf | sed '/##comment=.*\"/a ##INFO=<ID=ID,Number=1,Type=String,Description="HGMD ID">' | awk -v OFS="\t" '{ if(/^#/){ print }else{ print $1,$2,$3,$4,$5,$6,$7,"ID="$3";"$8 } }' | bgzip -c > hgmd_pro_2019.4_hg19_wID.vcf.gz
		$ bcftools index -t hgmd_pro_2019.4_hg19_wID.vcf.gz
		$ bcftools index hgmd_pro_2019.4_hg19_wID.vcf.gz	

	Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		hgmd=hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz
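	To sanity-check the processed HGMD file, you can confirm that the added ID key is declared in the header and that both indexes exist (a sketch; the Description text in the reconstructed ##INFO line above is an assumption and may differ in your HGMD release):
		$ bcftools view -h hgmd_pro_2019.4_hg19_wID.vcf.gz | grep '^##INFO=<ID=ID,'
		$ ls hgmd_pro_2019.4_hg19_wID.vcf.gz.tbi hgmd_pro_2019.4_hg19_wID.vcf.gz.csi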

8. CLINVAR:
	Download link: 
		https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz
		https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz.tbi
	Put it in this directory: demo/resources/clinvar/grch37/clinvar_20200506.vcf.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		clinvar=clinvar/grch37/clinvar_20200506.vcf.gz
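	Both the ClinVar VCF and its index can be fetched with wget (a sketch using the URLs above):
		$ cd <path-prefix>/demo/resources/clinvar/grch37/
		$ wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz
		$ wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz.tbi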

    
   
Customized Curated Annotation sets

These are present by default with this distribution and can be found in the Base-XML file under their respective tags:
	(1) GeneList
	(2) Somatic mosaicism genes
	(3) Imprinted genes
	(4) Polymorphic genes
	(5) HPO terms
	(6) OMIM

Activate the conda environment

$ source activate snv

Step-1:

1. Edit CONFIG/UserConfig.txt: 
	(a) Add the absolute path prefix for the resources directory with the tag: resourceDir. 
	    An example can be seen in the CONFIG/Example-UserConfig.txt file; a minimal sketch is also given at the end of this step.
	(b) Manually check that the datasets corresponding to the other field tags are correctly downloaded and 
	    placed in their respective folders.
2. Create the user-defined XML file from the input user-configuration flat file and the Base-XML file:
Command:
$ python createAnalysisXML.py -u <user-config-flat-file> \
			      -b <base-xml-file> \
			      -o <output-user-xml-file>
Example:
$ python createAnalysisXML.py -u CONFIG/UserConfig.txt \
		       	      -b CONFIG/Analysis_base_grch37.xml \
		              -o CONFIG/Analysis_user_grch37.xml
Outputs:
CONFIG/Analysis_user_grch37.xml
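For reference, a minimal UserConfig.txt sketch assembled from the dataset tags listed earlier (the resourceDir value is illustrative; point it at your own resources directory):

	resourceDir=/home/user/snv/demo/resources/
	gnomad_g=gnomad/grch37/gnomad.genomes.r2.1.1.sites.vcf.bgz
	gnomad_e=gnomad/grch37/gnomad.exomes.r2.1.1.sites.vcf.bgz
	exac=exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
	cadd_snv=cadd/grch37/whole_genome_SNVs.tsv.gz
	cadd_indel=cadd/grch37/InDels.tsv.gz
	revel=revel/grch37/new_tabbed_revel.tsv.gz
	hgmd=hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz
	clinvar=clinvar/grch37/clinvar_20200506.vcf.gz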

Step-2:

1. Put the respective VCF files in the input directory. For example: demo/example/vcf/ 
2. Create a manifest file in the same format as shown in demo/example/example_manifest.txt
3. Assign a gender to each family member (Illumina or sample ID). For example: demo/example/example_genders.txt 
   (a hypothetical layout is sketched below)
4. List all the family IDs that need to be analyzed.
         For example: demo/example/manifest/example_family_analysis.txt
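Purely for illustration, a genders file might pair each sample with its gender, one family member per line; this layout is hypothetical, so check demo/example/example_genders.txt for the authoritative format:

	FAM001_father	M
	FAM001_mother	F
	FAM001_proband	M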

Step-3:

Generate all the shell scripts that can be submitted to a user-specific HPC cluster, e.g. a Slurm/PBS/LSF network.

Command:
$ python processSNV.py	-a <analysis-user-xml-file> \
			-p <project-date> \
			-m <manifest-file> \
			-e <analysis-type> \
			-w <work-dir> \
			-g <genders-file> \
			-d <normalized-samples-fof> \
			-f <family-analysis-file> \
			-s <family-fof> \
			-r <family-header-file> (optional)
Example:
$ python processSNV.py	-a CONFIG/Analysis_user_grch37.xml \
			-p 20210326 \
			-m <path-prefix>/demo/example/example_manifest.txt \
			-e gvcfGT \
			-w <path-prefix>/demo/example/ \
			-g <path-prefix>/demo/example/example_genders.txt \
			-d <path-prefix>/demo/example/exeter_samples_norm.fof \
			-f <path-prefix>/demo/example/manifest/example_family_analysis.txt \
			-s <path-prefix>/demo/example/manifest/example_family.fof \
			-r <path-prefix>/demo/example/manifest/example_family_header.txt (optional)
Outputs:
Two scripts in the directory: <path-prefix>/demo/example/20210326/tmp_binaries/
Launch the scripts in these two stages sequentially, waiting for each stage to finish before starting the next (for example, as sketched below):

   (1) genotypeAndAnnotate_chr%.sh, where % = 1..22, X, Y and MT:
	scatters the annotation and frequency filtering per chromosome for all families.
   (2) mergeAndFilter.sh:
	merges all the chromosomes and applies inheritance filtering.
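For instance, on a Slurm cluster the two stages could be launched as follows (a sketch; the scheduler and its submission flags depend on your site, and PBS/LSF equivalents apply):

	$ cd <path-prefix>/demo/example/20210326/tmp_binaries/
	$ for chr in $(seq 1 22) X Y MT; do sbatch genotypeAndAnnotate_chr${chr}.sh; done
	# wait for all per-chromosome jobs to finish, then:
	$ sbatch mergeAndFilter.sh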

   

Step-4:

The final list of filtered variants per family is present in:

   <path-prefix>/demo/example/20210326/fam_filter/<family-id>/<family-id>.filt_<date>.txt
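The output is a plain-text table per family, so it can be inspected with standard command-line tools, e.g.:

	$ less -S <path-prefix>/demo/example/20210326/fam_filter/<family-id>/<family-id>.filt_<date>.txt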

      
     
    
   
For any questions, issues, or bugs, please contact the owner: East Genomics, bringing together genomic medicine across the East Midlands and East of England.