Authors:
Ada Madejska, MCDB, UCSB (contact: [email protected])
Nick Noll, UCSB

This pipeline takes error-prone Nanopore reads and aims to increase the percent identity obtained when identifying species with BLAST. Reads in FASTQ format are put through the following steps.

1. Quality control - very short and very long reads (reads that deviate strongly from the usual length of the 16S sequence) are dropped.
2. Kmer frequency matrix - a kmer frequency matrix is built from the reads that pass quality control. The value of k can be changed (k=5 or 6 is recommended).
3. UMAP projection and HDBSCAN clustering - the kmer frequency matrix is used to create a UMAP projection, which is then clustered with HDBSCAN. The default parameters for the UMAP and HDBSCAN functions were chosen based on a mock dataset but can be changed.
4. Refinement - in our tests on mock datasets, reads from different species sometimes clustered together. To separate them, we include a refinement step based on a Clustal Omega multiple sequence alignment (MSA) of each cluster. The alignment outputs a guide tree, which is used to divide the cluster into smaller subclusters. The distance threshold can be changed to suit each dataset.
5. Consensus making - the last step creates a consensus sequence for each defined cluster by majority calling. The direction of the reads is fixed using minimap2, the alignment is performed by MAFFT, and the consensus is created using em_cons. The consensus sequences are run through BLASTN to check the identity of each cluster.

Software Dependencies:
To run the pipeline successfully, the following software must be installed.

1. Minimap2 - for the consensus making step (https://github.com/lh3/minimap2)
2. MAFFT - for alignment in the consensus making step (https://mafft.cbrc.jp/alignment/software/)
3. em_cons (EMBOSS cons) - for creating the consensus (http://emboss.sourceforge.net/apps/cvs/emboss/apps/cons.html)
4. BLASTN (NCBI BLAST+) - for identification of the consensus sequences in the database (https://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/) (a 16S database is also required)
5. Clustal Omega (clustalo) - for the refinement step (http://www.clustal.org/omega/)

Specifications:
This pipeline runs on Python 3.8.10 and Julia v1.4.1.

The following Python libraries are required:
BioPython
hdbscan
matplotlib
pandas
sklearn (distributed as scikit-learn)
umap (distributed as umap-learn)

The following Julia packages are required:
Pkg
DataFrames
CSV
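Example sketch:
To make steps 1, 2 and 5 above more concrete, here is a minimal plain-Python sketch of a length filter, a kmer frequency matrix and a per-column majority call. It is not the pipeline's own code: the function names, the length thresholds (1200 and 1800) and the choice to drop gap-majority columns are assumptions made for illustration, and the real pipeline delegates alignment to MAFFT and consensus calling to em_cons.

```python
from collections import Counter
from itertools import product


def length_filter(reads, min_len=1200, max_len=1800):
    """Step 1 (sketch): drop very short and very long reads.

    The default thresholds are illustrative assumptions, not the
    pipeline's actual defaults.
    """
    return [r for r in reads if min_len <= len(r) <= max_len]


def kmer_frequency_matrix(reads, k=5):
    """Step 2 (sketch): one row per read, one column per DNA kmer.

    Each row holds relative frequencies, so reads of different
    lengths are comparable.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    matrix = []
    for read in reads:
        read = read.upper()
        row = [0.0] * len(kmers)
        for i in range(len(read) - k + 1):
            col = index.get(read[i:i + k])
            if col is not None:  # kmers with N or other symbols are skipped
                row[col] += 1.0
        total = sum(row)
        matrix.append([c / total for c in row] if total else row)
    return kmers, matrix


def majority_consensus(aligned, gap="-"):
    """Step 5 (sketch): majority call per column of an alignment.

    `aligned` must contain equal-length, already aligned sequences.
    Columns whose most common symbol is a gap are dropped (an
    assumption of this sketch).
    """
    consensus = []
    for column in zip(*aligned):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "AC", "ACGTACGTACGTACGTACGT"]
    kept = length_filter(toy_reads, min_len=4, max_len=10)
    kmers, matrix = kmer_frequency_matrix(kept, k=2)
    print(kept, len(kmers), len(matrix))
    print(majority_consensus(["ACG-T", "ACGAT", "AC--T"]))
```

For step 3, a matrix of this shape could be passed to umap.UMAP(...).fit_transform and the resulting embedding to hdbscan.HDBSCAN(...).fit_predict. Which parameter values work well depends on the dataset.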
A pipeline that creates consensus sequences from Nanopore reads.
Overview
PostQF is a user-friendly Postfix queue data filter which operates on data produced by postqueue -j.
PostQF Copyright © 2022 Ralph Seichter PostQF is a user-friendly Postfix queue data filter which operates on data produced by postqueue -j. See the ma
signac-flow - manage workflows with signac
signac-flow - manage workflows with signac The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, a
Open-Domain Question-Answering for COVID-19 and Other Emergent Domains
Open-Domain Question-Answering for COVID-19 and Other Emergent Domains This repository contains the source code for an end-to-end open-domain question
Using Python to derive insights on particular Pokemon, Types, Generations, and Stats
Pokémon Analysis Andreas Nikolaidis February 2022 Introduction Exploratory Analysis Correlations & Descriptive Statistics Principal Component Analysis
Senator Trades Monitor
Senator Trades Monitor This monitor will grab the most recent trades by senators and send them as a webhook to discord. Installation To use the monito
Aggregating gridded data (xarray) to polygons
A package to aggregate gridded data in xarray to polygons in geopandas using area-weighting from the relative area overlaps between pixels and polygons. Check out the binder link above for a sample c
Conduits - A Declarative Pipelining Tool For Pandas
Conduits - A Declarative Pipelining Tool For Pandas Traditional tools for declaring pipelines in Python suck. They are mostly imperative, and can some
A data parser for the internal syncing data format used by Fog of World.
A data parser for the internal syncing data format used by Fog of World. The parser is not designed to be a well-coded library with good performance, it is more like a demo for showing the data struc
Full ELT process on GCP environment.
Rent Houses Germany - GCP Pipeline Project: The goal of the project is to extract data about house rentals in Germany, store, process and analyze it u
A utility for functional piping in Python that allows you to access any function in any scope as a partial.
WithPartial Introduction WithPartial is a simple utility for functional piping in Python. The package exposes a context manager (used with with) calle
Intercepting proxy + analysis toolkit for Second Life compatible virtual worlds
Hippolyzer Hippolyzer is a revival of Linden Lab's PyOGP library targeting modern Python 3, with a focus on debugging issues in Second Life-compatible
Python script for transferring data between three drives in two separate stages
Waterlock Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs ha
Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials
Data Scientist Learning Plan Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials
Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.
Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza
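ASOUL livestream danmaku (bullet comment) scraping && data analysis
ASOUL livestream danmaku (bullet comment) scraping && data analysis (work in progress) These files are used to scrape danmaku and other information from ASOUL livestream rooms (other livestream rooms also work), and to generate simple data analyses.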
This cosmetics generator allows you to generate the new Fortnite cosmetics, Search pak and search cosmetics!
COSMETICS GENERATOR This cosmetics generator allows you to generate the new Fortnite cosmetics, Search pak and search cosmetics! Remember to put the l
Sensitivity Analysis Library in Python (Numpy). Contains Sobol, Morris, Fractional Factorial and FAST methods.
Sensitivity Analysis Library (SALib) Python implementations of commonly used sensitivity analysis methods. Useful in systems modeling to calculate the
Desafio 1 ~ Bantotal
Challenge 01 | Bantotal Please read the instructions for the challenge by selecting your preferred language below: Español Português License Copyright
Spaghetti: an open-source Python library for the analysis of network-based spatial data
pysal/spaghetti SPAtial GrapHs: nETworks, Topology, & Inference Spaghetti is an open-source Python library for the analysis of network-based spatial d
Spectral Analysis in Python
SPECTRUM : Spectral Analysis in Python contributions: Please join https://github.com/cokelaer/spectrum contributors: https://github.com/cokelaer/spect