A pipeline that creates consensus sequences from a Nanopore reads. I

Overview
Authors: 
Ada Madejska, MCDB, UCSB  (contact: [email protected])
Nick Noll, UCSB

This pipeline takes error-prone Nanopore reads and tries to increase the percentage identity
of the results of identifying species with BLAST. The reads in fastq format are put through the pipeline
which includes the following steps.
1. Quality control 
    - very short and very long reads (reads that highly deviate from the usual length of the 16S sequence)
    are dropped.
2. Kmer frequency matrix
    - make a kmer frequency matrix based on the reads from the quality control step. The value of k
    can be changed (k=5 or 6 is recommended)
3. UMAP projection and HDBSCAN clustering
    - the kmer frequency matrix is used to create a UMAP projection. The default parameters for UMAP
    and HDBSCAN functions have been chosen based on mock dataset but can be changed. 
4. Refinement 
    - based on our tests on mock datasets, sometimes reads from different species can cluster together.
    To prevent that, we include a refinement step based on MSA of Clustal Omega on each cluster.
    The alignment outputs a guide tree which is used for dividing the cluster into smaller subclusters.
    The distance threshold can be changed to suit each dataset.
5. Consensus making
    - lastly, based on the defined clusters, the last step creates a consensus sequence based on 
    majority calling. The direction of the reads is fixed using minimap2, the alignment is performed 
    by MAFFT, and the consensus is created using em_cons. The reads are run through BLASTN to check
    for identity of each cluster. 

Software Dependencies:

To successfully run the pipeline, certain software need to be installed.
1. Minimap2 - for the consensus making step (https://github.com/lh3/minimap2)
2. MAFFT - for alignment in the consensus making step (https://mafft.cbrc.jp/alignment/software/)
3. EM_CONS - for creating the consensus (http://emboss.sourceforge.net/apps/cvs/emboss/apps/cons.html)
4. NCBIN - for identification of the consensus sequences in the database 
    (https://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/) (a 16S database is also required)
5. CLUSTALO - for the refinement step (http://www.clustal.org/omega/)

Specifications:

This pipeline runs in python3.8.10 and julia v"1.4.1". 

The following Python libraries are also required:
BioPython
hdbscan
matplotlib
pandas
sklearn
umap

Following Julia packages are required:
Pkg
DataFrames
CSV
Owner
Ada Madejska
UCSB Graduate Student in Computational Biology
Ada Madejska
PipeChain is a utility library for creating functional pipelines.

PipeChain Motivation PipeChain is a utility library for creating functional pipelines. Let's start with a motivating example. We have a list of Austra

Michael Milton 2 Aug 07, 2022
Random dataframe and database table generator

Random database/dataframe generator Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA Introduction Often, beginners in SQL or data scien

Tirthajyoti Sarkar 249 Jan 08, 2023
Big Data & Cloud Computing for Oceanography

DS2 Class 2022, Big Data & Cloud Computing for Oceanography Home of the 2022 ISblue Big Data & Cloud Computing for Oceanography class (IMT-A, ENSTA, I

Ocean's Big Data Mining 5 Mar 19, 2022
A script to "SHUA" H1-2 map of Mercenaries mode of Hearthstone

lushi_script Introduction This script is to "SHUA" H1-2 map of Mercenaries mode of Hearthstone Installation Make sure you installed python=3.6. To in

210 Jan 02, 2023
The Master's in Data Science Program run by the Faculty of Mathematics and Information Science

The Master's in Data Science Program run by the Faculty of Mathematics and Information Science is among the first European programs in Data Science and is fully focused on data engineering and data a

Amir Ali 2 Jun 17, 2022
Ejercicios Panda usando Pandas

Readme Below we add configuration details to locally test your application To co

1 Jan 22, 2022
A computer algebra system written in pure Python

SymPy See the AUTHORS file for the list of authors. And many more people helped on the SymPy mailing list, reported bugs, helped organize SymPy's part

SymPy 9.9k Dec 31, 2022
CSV database for chihuahua (HUAHUA) blockchain transactions

super-fiesta Shamelessly ripped components from https://github.com/hodgerpodger/staketaxcsv - Thanks for doing all the hard work. This code does only

Arlene Macciaveli 1 Jan 07, 2022
Flexible HDF5 saving/loading and other data science tools from the University of Chicago

deepdish Flexible HDF5 saving/loading and other data science tools from the University of Chicago. This repository also host a Deep Learning blog: htt

UChicago - Department of Computer Science 255 Dec 10, 2022
A library to create multi-page Streamlit applications with ease.

A library to create multi-page Streamlit applications with ease.

Jackson Storm 107 Jan 04, 2023
Orchest is a browser based IDE for Data Science.

Orchest is a browser based IDE for Data Science. It integrates your favorite Data Science tools out of the box, so you don’t have to. The application is easy to use and can run on your laptop as well

Orchest 3.6k Jan 09, 2023
Common bioinformatics database construction

biodb Common bioinformatics database construction 1.taxonomy (Substance classification database) Download the database wget -c https://ftp.ncbi.nlm.ni

sy520 2 Jan 04, 2022
Meltano: ELT for the DataOps era. Meltano is open source, self-hosted, CLI-first, debuggable, and extensible.

Meltano is open source, self-hosted, CLI-first, debuggable, and extensible. Pipelines are code, ready to be version c

Meltano 625 Jan 02, 2023
Powerful, efficient particle trajectory analysis in scientific Python.

freud Overview The freud Python library provides a simple, flexible, powerful set of tools for analyzing trajectories obtained from molecular dynamics

Glotzer Group 195 Dec 20, 2022
A tool to compare differences between dataframes and create a differences report in Excel

similarpanda A module to check for differences between pandas Dataframes, and generate a report in Excel format. This is helpful in a workplace settin

Andre Pretorius 9 Sep 15, 2022
Making the DAEN information accessible.

The purpose of this repository is to make the information on Australian COVID-19 adverse events accessible. The Therapeutics Goods Administration (TGA) keeps a database of adverse reactions to medica

10 May 10, 2022
Reading streams of Twitter data, save them to Kafka, then process with Kafka Stream API and Spark Streaming

Using Streaming Twitter Data with Kafka and Spark Reading streams of Twitter data, publishing them to Kafka topic, process message using Kafka Stream

Rustam Zokirov 1 Dec 06, 2021
follow-analyzer helps GitHub users analyze their following and followers relationship

follow-analyzer follow-analyzer helps GitHub users analyze their following and followers relationship by providing a report in html format which conta

Yin-Chiuan Chen 2 May 02, 2022
A program that uses an API and a AI model to get info of sotcks

Stock-Market-AI-Analysis I dont mind anyone using this code but please give me credit A program that uses an API and a AI model to get info of stocks

1 Dec 17, 2021
Data Scientist in Simple Stock Analysis of PT Bukalapak.com Tbk for Long Term Investment

Data Scientist in Simple Stock Analysis of PT Bukalapak.com Tbk for Long Term Investment Brief explanation of PT Bukalapak.com Tbk Bukalapak was found

Najibulloh Asror 2 Feb 10, 2022