Authors: Ada Madejska, MCDB, UCSB (contact: [email protected]) Nick Noll, UCSB This pipeline takes error-prone Nanopore reads and tries to increase the percentage identity of the results of identifying species with BLAST. The reads in fastq format are put through the pipeline which includes the following steps. 1. Quality control - very short and very long reads (reads that highly deviate from the usual length of the 16S sequence) are dropped. 2. Kmer frequency matrix - make a kmer frequency matrix based on the reads from the quality control step. The value of k can be changed (k=5 or 6 is recommended) 3. UMAP projection and HDBSCAN clustering - the kmer frequency matrix is used to create a UMAP projection. The default parameters for UMAP and HDBSCAN functions have been chosen based on mock dataset but can be changed. 4. Refinement - based on our tests on mock datasets, sometimes reads from different species can cluster together. To prevent that, we include a refinement step based on MSA of Clustal Omega on each cluster. The alignment outputs a guide tree which is used for dividing the cluster into smaller subclusters. The distance threshold can be changed to suit each dataset. 5. Consensus making - lastly, based on the defined clusters, the last step creates a consensus sequence based on majority calling. The direction of the reads is fixed using minimap2, the alignment is performed by MAFFT, and the consensus is created using em_cons. The reads are run through BLASTN to check for identity of each cluster. Software Dependencies: To successfully run the pipeline, certain software need to be installed. 1. Minimap2 - for the consensus making step (https://github.com/lh3/minimap2) 2. MAFFT - for alignment in the consensus making step (https://mafft.cbrc.jp/alignment/software/) 3. EM_CONS - for creating the consensus (http://emboss.sourceforge.net/apps/cvs/emboss/apps/cons.html) 4. NCBIN - for identification of the consensus sequences in the database (https://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/) (a 16S database is also required) 5. CLUSTALO - for the refinement step (http://www.clustal.org/omega/) Specifications: This pipeline runs in python3.8.10 and julia v"1.4.1". The following Python libraries are also required: BioPython hdbscan matplotlib pandas sklearn umap Following Julia packages are required: Pkg DataFrames CSV
A pipeline that creates consensus sequences from a Nanopore reads. I
Overview
Scraping and analysis of leetcode-compensations page.
Leetcode compensations report Scraping and analysis of leetcode-compensations page.
MIR Cheatsheet - Survival Guidebook for MIR Researchers in the Lab
MIR Cheatsheet - Survival Guidebook for MIR Researchers in the Lab
a tool that compiles a csv of all h1 program stats
h1stats - h1 Program Stats Scraper This python3 script will call out to HackerOne's graphql API and scrape all currently active programs for informati
Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities
Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities. This is aimed at those looking to get into the field of D
Pandas and Spark DataFrame comparison for humans
DataComPy DataComPy is a package to compare two Pandas DataFrames. Originally started to be something of a replacement for SAS's PROC COMPARE for Pand
Uses MIT/MEDSL, New York Times, and US Census datasources to analyze per-county COVID-19 deaths.
Covid County Executive summary Setup Install miniconda, then in the command line, run conda create -n covid-county conda activate covid-county conda i
An Integrated Experimental Platform for time series data anomaly detection.
Curve Sorry to tell contributors and users. We decided to archive the project temporarily due to the employee work plan of collaborators. There are no
Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer
Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer
Additional tools for particle accelerator data analysis and machine information
PyLHC Tools This package is a collection of useful scripts and tools for the Optics Measurements and Corrections group (OMC) at CERN. Documentation Au
PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams
PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams Motivation When dataset freshness is critical, the annotating of high speed
Maximum Covariance Analysis in Python
xMCA | Maximum Covariance Analysis in Python The aim of this package is to provide a flexible tool for the climate science community to perform Maximu
Convert monolithic Jupyter notebooks into Ploomber pipelines.
Soorgeon Join our community | Newsletter | Contact us | Blog | Website | YouTube Convert monolithic Jupyter notebooks into Ploomber pipelines. soorgeo
The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.
The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.
💬 Python scripts to parse Messenger, Hangouts, WhatsApp and Telegram chat logs into DataFrames.
Chatistics Python 3 scripts to convert chat logs from various messaging platforms into Pandas DataFrames. Can also generate histograms and word clouds
The official pytorch implementation of ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias Introduction | Updates | Usage | Results&Pretrained Models | Statement | Intr
Parses data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)
google_takeout_parser parses both the Historical HTML and new JSON format for Google Takeouts caches individual takeout results behind cachew merge mu
An extension to pandas dataframes describe function.
pandas_summary An extension to pandas dataframes describe function. The module contains DataFrameSummary object that extend describe() with: propertie
Data Analytics: Modeling and Studying data relating to climate change and adoption of electric vehicles
Correlation-Study-Climate-Change-EV-Adoption Data Analytics: Modeling and Studying data relating to climate change and adoption of electric vehicles I
This is a python script to navigate and extract the FSD50K dataset
FSD50K navigator This is a script I use to navigate the sound dataset from FSK50K.
BioMASS - A Python Framework for Modeling and Analysis of Signaling Systems
Mathematical modeling is a powerful method for the analysis of complex biological systems. Although there are many researches devoted on produ