Pipeline to convert a haploid assembly into diploid

Last update: Jan 05, 2023

Related tags

Overview

HapDup

HapDup (haplotype duplicator) is a pipeline to convert a haploid long read assembly into a dual diploid assembly. The reconstructed haplotypes preserve heterozygous structural variants (in addition to small variants) and are locally phased.

Version 0.4

Input requirements

HapDup takes as input a haploid long-read assembly, such as produced with Flye or Shasta. Currenty, ONT reads (Guppy 5+ recommended) and PacBio HiFi reads are supported.

HapDup is currently designed for low-heterozygosity genomes (such as human). The expectation is that the assembly has most of the diploid genome collapsed into a single haplotype. For assemblies with partially resolved haplotypes, alternative alleles could be removed prior to running the pipeline using purge_dups. We expect to add a better support of highly heterozygous genomes in the future.

The first stage is to realign the original long reads on the assembly using minimap2. We recommend to use the latest minimap2 release.

minimap2 -ax map-ont -t 30 assembly.fasta reads.fastq | samtools sort -@ 4 -m 4G > lr_mapping.bam
samtools index -@ 4 assembly_lr_mapping.bam

Quick start using Docker

HapDup is available on the Docker Hub.

If Docker is not installed in your system, you need to set it up first following this guide.

Next steps assume that your assembly.fasta and lr_mapping.bam are in the same directory, which will also be used for HapDup output. If it is not the case, you might need to bind additional directories using the Docker's -v / --volume argument. The number of threads (-t argument) should be adjusted according to the available resources. For PacBio HiFi input, use --rtype hifi instead of --rtype ont.

cd directory_with_assembly_and_alignment
HD_DIR=`pwd`
docker run -v $HD_DIR:$HD_DIR -u `id -u`:`id -g` mkolmogo/hapdup:0.4 \
  hapdup --assembly $HD_DIR/assembly.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup -t 64 --rtype ont

Quick start using Singularity

Alternatively, you can use Singularity. First, you will need install the client as descibed in the manual. One way to do it is through conda:

conda install singularity

Next steps assume that your assembly.fasta and lr_mapping.bam are in the same directory, which will also be used for HapDup output. If it is not the case, you might need to bind additional directories using the --bind argument. The number of threads (-t argument) should be adjusted according to the available resources. For PacBio HiFi input, use --rtype hifi instead of --rtype ont.

singularity pull docker://mkolmogo/hapdup:0.4
HD_DIR=`pwd`
singularity exec --bind $HD_DIR hapdup_0.4.sif \
  hapdup --assembly $HD_DIR/assembly.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup -t 64 --rtype ont

Output files

The output directory will contain:

haplotype_{1,2}.fasta - final assembled haplotypes
phased_blocks_hp{1,2}.bed - phased blocks coordinates

Haplotypes generated by the pipeline contain homozogous and heterozygous varinats (small and structural). Becuase the pipeline is only using long-read (ONT) data, it does not achieve chromosome-level phasing. Fully-phased blocks are given in the the phased_blocks* files.

Pipeline overview

HapDup starts with filtering alignments that are likely originating from the unassembled parts of the genome. Such alignments may later create false haplotypes if not removed (e.g. if reads from a segmental duplication with two copies can create four haplotypes).
Afterwards, PEPPER is used to call SNPs from the filtered alignment file
Then we use Margin to phase SNPs and haplotype reads
We then use Flye to polish the initiall assembly with the reads from each of the two haplotypes independently
Finally, we find (heterozygous) breakpoints in long-read alignments and apply the corresponding structural changes to the corresponding polished haplotypes. Currently, it allows to recover large heterozygous inversions.

Benchmarks

We evaluated HapDup haplotypes in terms of reconstructed structural variants signatures (heterozygous & homozygous) using the HG002 for which the curated set of SVs is available. We used the recent ONT data basecalled with Guppy 5.

Given HapDup haplotypes, we called SV using dipdiff. We also compare SV set against hifiasm assemblies, even though they were produced from HiFi, rather than ONT reads. Evaluated using truvari with -r 2000 option. GT refers to genotype-considered benchmarks.

Method	Precision	Recall	F1-score	GT Precision	GT Recall	GT F1-score
Shasta+HapDup	0.9500	0.9551	0.9525	0.934	0.9543	0.9405
Sniffles	0.9294	0.9143	0.9219	0.8284	0.9051	0.8605
CuteSV	0.9324	0.9428	0.9376	0.9119	0.9416	0.9265
hifiasm	0.9512	0.9734	0.9622	0.9129	0.9723	0.9417

Yak k-mer based evaluations:

Hap	QV	Switch err	Hamming err
1	35	0.0389	0.1862
2	35	0.0385	0.1845

Given a minimap2 alignment, HapDup runs in ~400 CPUh and uses ~80 Gb of RAM.

Source installation

If you prefer, you can install from source as follows:

#create a new conda environemnt and activate it
conda create -n hapdup python=3.8
conda activate hapdup

#get HapDup source
git clone https://github.com/fenderglass/hapdup
cd hapdup
git submodule update --init --recursive

#build and install Flye
pushd submodules/Flye/ && python setup.py install && popd

#build and install Margin
pushd submodules/margin/ && mkdir build && cd build && cmake .. && make && cp ./margin $CONDA_PREFIX/bin/ && popd

#build and install PEPPER and its dependencies
pushd submodules/pepper/ && python -m pip install . && popd

To run, ensure that the conda environemnt is activated and then execute:

conda activate hapdup
./hapdup.py --assembly assembly.fasta --bam lr_mapping.bam --out-dir hapdup -t 64 --rtype ont

Acknowledgements

The major parts of the HapDup pipeline are:

Authors

The pipeline was developed at UC Santa Cruz genomics institute, Benedict Paten's lab.

Pipeline code contributors:

Mikhail Kolmogorov

PEPPER/Margin/Shasta support:

Kishwar Shafin
Trevor Pesout
Paolo Carnevali

Citation

If you use HapDup in your research, the most relevant papers to cite are:

Kishwar Shafin, Trevor Pesout, Pi-Chuan Chang, Maria Nattestad, Alexey Kolesnikov, Sidharth Goel, Gunjan Baid et al. "Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks." bioRxiv (2021). doi:10.1101/2021.03.04.433952

Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin and Pavel Pevzner, "Assembly of Long Error-Prone Reads Using Repeat Graphs", Nature Biotechnology, 2019 doi:10.1038/s41587-019-0072-8

License

HapDup is distributed under a BSD license. See the LICENSE file for details. Other software included in this discrubution is released under either MIT or BSD licenses.

How to get help

A preferred way report any problems or ask questions is the issue tracker.

In case you prefer personal communication, please contact Mikhail at [email protected].

Comments

pthread_setaffinity_np failed Error while running pepper
Hi,

I'm trying to run HapDup on my assembly from Flye. However, an error has occurred while running Pepper: RuntimeError: /onnxruntime_src/onnxruntime/core/platform/posix/env.cc:173 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 0 error msg: there's also a warning before this runtime error: /usr/local/lib/python3.8/dist-packages/torch/onnx/symbolic_opset9.py:2095: UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable length with LSTM can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model. warnings.warn("Exporting a model to ONNX with a batch_size other than 1, " + Do you have any idea why this happens?

The commands that I use are like:

reads=NA24385_ONT_Promethion.fastq outdir=`pwd` assembly=${outdir}/assembly.fasta hapdup_sif=../HapDup/hapdup_0.4.sif time minimap2 -ax map-ont -t 30 ${assembly} ${reads} | samtools sort -@ 4 -m 4G > assembly_lr_mapping.bam samtools index -@ 4 assembly_lr_mapping.bam time singularity exec --bind ${outdir} ${hapdup_sif}\ hapdup --assembly ${assembly} --bam ${outdir}/assembly_lr_mapping.bam --out-dir ${outdir}/hapdup -t 64 --rtype ont

Thank you
bug
opened by LYC-vio 11

Docker fails to mount HD_DIR

Hi! The following error occurs when I try to run:

sudo docker run -v $HD_DIR:$HD_DIR -u `id -u`:`id -g` mkolmogo/hapdup:0.2   hapdup --assembly $HD_DIR/barcode11CBS1878_v2.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup

docker: Error response from daemon: error while creating mount source path '/home/user/data/volume_2/hapdup_results': mkdir /home/user/data: file exists.
ERRO[0000] error waiting for container: context canceled``

opened by alkminion1 8

Incorrect genotype for large deletion

Hi,

I have used Hapdup to make a haplotype-resolved assembly from Illumina-corrected ONT reads (haploid assembly made with Flye 2.9) and I am particularly interested in a large 32kb deletion. Here is a screenshot of IGV (from top to bottom: HAP1, HAP2 and haploid assembly):

I believe the position and size of the deletion are near correct. However, the deletion is homozygous while it should be heterozygous. I have assembled with Hifiasm this proband and its parents using 30x PacBio HiFi: the 3 assemblies support an heterozygous call in the proband. I can also see from the corrected ONT that there is support for a heterozygous call. Finally, we can see this additional contig in the haploid assembly which I guess also support a heterozygous call.

Hence, my question is: even if MARGIN manages to correctly separate reads with the deletion from reads without the deletion, can the polishing of Flye actually "fix" such a large event in one of the haplotype assembly?

Thanks, Guillaume

opened by GuillaumeHolley 7
invalid contig

Hi, I got an error when I ran the third step, and here is the error reporting information. Skipped filtering phase Skipped pepper phase Skipped margin phase Skipped Flye phase Finding breakpoints Parsed 304552 reads 14590 split reads Running: flye-minimap2 -ax asm5 -t 64 -K 5G /usr_storage/zyl/SY_haplotype/ZSP192L/ZSP192L.fasta /usr_storage/zyl/SY_haplotype/ZSP192L/hapdup/flye_hap_1/polished_1.fasta 2>/dev/null | flye-samtools sort -m 4G [email protected] > /usr_storage/zyl/SY_haplotype/ZSP192L/hapdup/structural/liftover_hp1.bam [bam_sort_core] merging from 0 files and 4 in-memory blocks... Traceback (most recent call last): File "/usr/local/bin/hapdup", line 8, in sys.exit(main()) File "/usr/local/lib/python3.8/dist-packages/hapdup/main.py", line 173, in main bed_liftover(inversions_bed, minimap_out, open(inversions_hp, "w")) File "/usr/local/lib/python3.8/dist-packages/hapdup/bed_liftover.py", line 76, in bed_liftover proj_start_chr, proj_start_pos, proj_start_sign = project(bam_file, chr_id, chr_start) File "/usr/local/lib/python3.8/dist-packages/hapdup/bed_liftover.py", line 9, in project name, pos, sign = project_flank(bam_path, ref_seq, ref_pos, 1) File "/usr/local/lib/python3.8/dist-packages/hapdup/bed_liftover.py", line 23, in project_flank for pileup_col in samfile.pileup(ref_seq, max(0, ref_pos - flank), ref_pos + flank, truncate=True, File "pysam/libcalignmentfile.pyx", line 1335, in pysam.libcalignmentfile.AlignmentFile.pileup File "pysam/libchtslib.pyx", line 685, in pysam.libchtslib.HTSFile.parse_region ValueError: invalid contig Contig125
bug

opened by tongyin121 7
hapdup fails with multiple primary alignments
I've got one sample in a series which is failing in hapdup with the following error.

Any suggestions?

Thanks

`

Starting merge Expected three tokens in header line, got 2 This usually means you have multiple primary alignments with the same read ID. You can identify whether this is the case with this command:

samtools view -F 0x904 YOUR.bam | cut -f 1 | sort | uniq -c | awk '$1 > 1'

Expected three tokens in header line, got 2 This usually means you have multiple primary alignments with the same read ID. You can identify whether this is the case with this command:

samtools view -F 0x904 YOUR.bam | cut -f 1 | sort | uniq -c | awk '$1 > 1'

[2022-12-13 17:15:57] ERROR: Missing output: hapdup/margin/MARGIN_PHASED.haplotagged.bam Traceback (most recent call last): File "/data/test_data/GIT/hapdup/hapdup.py", line 24, in sys.exit(main()) File "/data/test_data/GIT/hapdup/hapdup/main.py", line 206, in main file_check(haplotagged_bam) File "/data/test_data/GIT/hapdup/hapdup/main.py", line 114, in file_check raise Exception("Missing output") Exception: Missing output `
opened by mattloose 4
ZeroDivisionError

Hi,

I'm running hapdup version 0.8 on a number of human genomes.

It appears to fail fairly regulary with an error in flye:

[2022-10-20 15:33:48] INFO: Running Flye polisher [2022-10-20 15:33:48] INFO: Polishing genome (1/1) [2022-10-20 15:33:48] INFO: Polishing with provided bam [2022-10-20 15:33:48] INFO: Separating alignment into bubbles [2022-10-20 15:37:12] ERROR: Thread exception [2022-10-20 15:37:12] ERROR: Traceback (most recent call last): File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/flye/polishing/bubbles.py", line 79, in _thread_worker indels_profile = _get_indel_clusters(ctg_aln, profile, ctg_region.start) File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/flye/polishing/bubbles.py", line 419, in _get_indel_clusters get_clusters(deletions, add_last=True) File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/flye/polishing/bubbles.py", line 410, in get_clusters support = len(reads) / region_coverage ZeroDivisionError: division by zero

The flye version is 2.9-b1778

Has anyone else seen this and have any suggestions on how to fix?

Thanks.
bug

opened by mattloose 4
Super cool tool! Maybe in the README mention that FASTQ files are required if using the container

As far as know, inputting FASTA reads aligned to the reference as the bam file will result in Pepper failing to find any variants using the default settings of the Docker/Singularity container as the config for Pepper requires a minimum base quality setting.

opened by jelber2 3

Option to set minimap2 -I flag

Hi,

Ran into this error while trying to run hapdup:

[2022-03-08 10:23:36] INFO: Running: flye-minimap2 -ax asm5 -t 10 -K 5G <PATH>/assembly.fasta <PATH>/hapdup/flye_hap_1/polished_1.fasta 2>/dev/null | flye-samtools sort -m 4G [email protected] > <PATH>/hapdup/structural/liftover_hp1.bam
[E::sam_parse1] missing SAM header
[W::sam_read1] Parse error at line 2
samtools sort: truncated file. Aborting
Traceback (most recent call last):
  File "/usr/local/bin/hapdup", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/hapdup/main.py", line 245, in main
    subprocess.check_call(" ".join(minimap_cmd), shell=True)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'flye-minimap2 -ax asm5 -t 10 -K 5G <PATH>/assembly.fasta <PATH>/hapdup/flye_hap_1/polished_1.fasta 2>/dev/null | flye-samtools sort -m 4G -@4 > <PATH>/hapdup/structural/liftover_hp1.bam' returned non-zero exit status 1.

I suspect it could be because of the default minimap2 -I flag being too small (4G)? If this is the case, maybe an option to specify this could be added, or adjust it automatically depending on genome size?

Thanks!

bug

opened by fellen31 3

Using Pepper-MARGIN r0.7?

Hi,

Would it be possible to include the latest version of Pepper-MARGIN (r0.7) in hapdup? I haven't been able to run hapdup on my data so far because of some issues in Pepper-MARGIN r0.6 (now solved in r0.7).

Thank you!

Guillaume
enhancement

opened by GuillaumeHolley 3
Please add CLI option to specify location of BAM index

Could you add an additional command line parameter, allowing the specification of the BAM index location? Eg,

hapdup --bam /some/where/abam.bam --bam-index /other/location/a_bam_index.bam.csi

And then pass that optional value to pysam.AlignmentFile() argument filepath_index?

Motivation: I'm wrapping hapdup and some other steps in a WDL script, and need to pass each file separately (ie, they are localized as individual files, and there is no guarantee they end up in the same directory when hapdup is invoked). The current hapdup assumes the index and bam are in the same directory, and fails.

Thanks!

CC: @0seastar0
enhancement

opened by bkmartinjr 3
Singularity

Hi,

Thank you for this tool, I am really excited to try it! Would it be possible to have hapdup available as a Singularity image or to have the Docker image available online in a container repo (such that it can be converted to a Singularity image with a singularity pull)?

Thanks, Guillaume

opened by GuillaumeHolley 3
--overwrite fails

[2022-11-03 20:16:22] INFO: Filtering alignments Traceback (most recent call last): File "/data/test_data/GIT/hapdup/hapdup.py", line 24, in sys.exit(main()) File "/data/test_data/GIT/hapdup/hapdup/main.py", line 153, in main filter_alignments_parallel(args.bam, filtered_bam, min(args.threads, 30), File "/data/test_data/GIT/hapdup/hapdup/filter_misplaced_alignments.py", line 188, in filter_alignments_parallel pysam.merge("-@", str(num_threads), bam_out, *bams_to_merge) File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/pysam/utils.py", line 69, in call raise SamtoolsError( pysam.utils.SamtoolsError: "samtools returned with error 1: stdout=, stderr=[bam_merge] File 'hapdup/filtered.bam' exists. Please apply '-f' to overwrite. Abort.\n"

Looks as though when you run with --overwrite the command is not being correctly passed through to sub processes.
bug

opened by mattloose 2
phase block number compared to Whatshap

Hello, thank you for the great tool!

I was just testing HapDup v0.7 on our fish genome. Comparing the output with phasing done with WhatsHap (WH), I wondered why there is such a big difference in phased block size and block number between HapDup and the WH pipeline?

For the fish chromosomes, WH was generating 679 blocks using 2'689'114 phased SNPs. Margin (HapDup pipeline) was generating 5352 blocks using 3'862'108 phased SNPs.

The main difference seems to be the prior read filtering and usage of MarginPhase for the phasing in HapDup, but does this explain such a big difference?

I was wondering if phase blocks of HapDup could be concatenated using whatshap SNP and block information to increase continuity? I imagine it would be a straightforward approach overlapping SNP positions between Margin and WH with phase block ids and lift-over phase ids from WH. I will do some visual inspections and scripting to test if there is overlap of called SNPs and agreement on block boarders.

Cheers, Michel

opened by MichelMoser 3

Releases(0.10)

0.10(Oct 29, 2022)
Margin version update

Option to use unphased reads for polishing

Het insertion polishing improvement

A few small bigfixes

Source code(tar.gz)
Source code(zip)
0.6(Mar 4, 2022)
Fixed issue with PEPPER model quantization causing the pipeline to hang on some systems

Speed-up of the last breakpoint analysis part of the pipeline, causing bottlenecks on some datasets

Source code(tar.gz)
Source code(zip)
0.5(Feb 6, 2022)
Update to PEPPER 0.7

Added new option --pepper-model for custom pepper models

Added new option --bam-index to provide a non-standard path to alignment index file

Source code(tar.gz)
Source code(zip)
0.4(Nov 19, 2021)
Added HiFi reads support

Source code(tar.gz)
Source code(zip)

Owner

Mikhail Kolmogorov

Postdoc @ UCSC CGL, Paten lab. I work on building algorithms for computational genomics.

GitHub Repository

A stock analysis app with streamlit

StockAnalysisApp A stock analysis app with streamlit. You select the ticker of the stock and the app makes a series of analysis by using the price cha

50 Nov 27, 2022

Tools for working with MARC data in Catalogue Bridge.

catbridge_tools Tools for working with MARC data in Catalogue Bridge. Borrows heavily from PyMarc

1 Nov 11, 2021

Developed for analyzing the covariance for OrcVIO

about This repo is developed for analyzing the covariance for OrcVIO environment setup platform ubuntu 18.04 using conda conda env create --file envir

1 Dec 08, 2021

🌍 Create 3d-printable STLs from satellite elevation data 🌏

mapa 🌍 Create 3d-printable STLs from satellite elevation data Installation pip install mapa Usage mapa uses numpy and numba under the hood to crunch

13 Dec 15, 2022

Python data processing, analysis, visualization, and data operations

Python This is a Python data processing, analysis, visualization and data operations of the source code warehouse, book ISBN: 9787115527592 Descriptio

1 Jan 16, 2022

A neural-based binary analysis tool

A neural-based binary analysis tool Introduction This directory contains the demo of a neural-based binary analysis tool. We test the framework using

208 Dec 22, 2022

VevestaX is an open source Python package for ML Engineers and Data Scientists.

VevestaX Track failed and successful experiments as well as features. VevestaX is an open source Python package for ML Engineers and Data Scientists.

24 Dec 14, 2022

Functional Data Analysis, or FDA, is the field of Statistics that analyses data that depend on a continuous parameter.

Functional Data Analysis Python package

184 Dec 27, 2022

collect training and calibration data for gaze tracking

Collect Training and Calibration Data for Gaze Tracking This tool allows collecting gaze data necessary for personal calibration or training of eye-tr

5 Dec 17, 2022

nrgpy is the Python package for processing NRG Data Files

nrgpy nrgpy is the Python package for processing NRG Data Files Website and source: https://github.com/nrgpy/nrgpy Documentation: https://nrgpy.github

23 Dec 08, 2022

Python-based Space Physics Environment Data Analysis Software

pySPEDAS pySPEDAS is an implementation of the SPEDAS framework for Python. The Space Physics Environment Data Analysis Software (SPEDAS) framework is

98 Dec 22, 2022

Vaex library for Big Data Analytics of an Airline dataset

Vaex-Big-Data-Analytics-for-Airline-data A Python notebook (ipynb) created in Jupyter Notebook, which utilizes the Vaex library for Big Data Analytics

1 Feb 13, 2022

PyPDC is a Python package for calculating asymptotic Partial Directed Coherence estimations for brain connectivity analysis.

Python asymptotic Partial Directed Coherence and Directed Coherence estimation package for brain connectivity analysis. Free software: MIT license Doc

3 Nov 26, 2022

🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

🧪📈 🐍. The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using using Python a

97 Dec 08, 2022