InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family.

Overview

CRISPRanalysis

InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family.

In this work, we present a workflow to analyze InDels from the multicopy α-gliadin gene family from wheat based on NGS data without the need to pre-viously establish a reference sequence for each genetic background. The pipeline was tested it in a multiple sample set, including three generations of edited wheat lines (T0, T1, and T2), from three different backgrounds and ploidy levels (hexaploid and tetraploid). Implementation of Bayesian optimization of Usearch parameters, inhouse Python, and bash scripts are reported.

Workflow:

Step1:

Bayesian optimization was implemented to optimize Usearch v9.2.64 parameters from merge to search steps for the α-gliadin amplicons on the wild type lines.

python Step1_Bayesian_usearch.py --database 
   
     --file_intervals 
    
      --trim_primers 
     
       --path_usearch_control 
      

      
     
    
   


Help:

Argument Help
--database File fasta with database sequences. Example: /path/to/database/database.fasta.
--file_intervals File with intervals for parameters. Example in /Examples/Example_intervals.txt.
--trim_primers Trim primers in reads if you use database without primers. Optios: YES | NO.
--path_usearch_control Path of usearch and control raw data separated by "," without white spaces. Example: /paht/to/usearch,/path/to/reads_control.


Outputs:

  • Bayesian_usearch.txt File with optimal values, optimal function value, samples or observations, obatained values and search space.
  • Bayesian.png Convergence plot.
  • Bayesian_data_res.txt File with the minf(x) after n calls in each iteration.

Step 2:

Usearch pipeline optimazed on wild type lines for studying results of optimization.

Step2_usearch_WT_to_DB.sh dif pct maxee amp id path_control name_dir_usearch path_database trim_primers


Help:

Arguments must be disposed in the order indicated before.

  • dif Optimal value for dif Usearch parameter.
  • pct Optimal value for pct Usearch parameter.
  • maxee Optimal value for maxee Usearch parameter.
  • amp Optimal value for amp Usearch parameter.
  • id Optimal value for id Usearch parameter.
  • path_control Path of the wild type lines fastq files.
  • name_dir_usearch Path of Usearch.
  • path_database Path of alpha-gliadin amplicon database.
  • trim_primers Trim primers in reads if you use database without primers. Optios: YES | NO.


Outputs:

Usearch merge files, filter files, unique amplicons file, unique denoised amplicon (Amp/otu) file, otu table file.

Step 3:

Usearch pipeline optimazed on all lines (wild types and CRISPR lines) for studying denoised unique amplicon relative abundances.

Step3_usearch_ALL_LINES.sh dif pct maxee amp id path_ALL name_dir_usearch trim_primers


Help:

Arguments must be disposed in the order indicated before.

  • dif Optimal value for dif Usearch parameter.
  • pct Optimal value for pct Usearch parameter.
  • maxee Optimal value for maxee Usearch parameter.
  • amp Optimal value for amp Usearch parameter.
  • id Optimal value for id Usearch parameter.
  • path_ALL Path of all lines (wild type and CRISPR lines) fastq files.
  • name_dir_usearch Path of Usearch.
  • trim_primers Trim primers in reads if you use database without primers. Optios: YES | NO.


Outputs:

Usearch merge files, filter files, unique amplicon file, unique denoised amplicon (Amp/otu) file, otu table file.

Before Step 4, otu table file must be normalized by TMM normalization method (edgeR package in R). Results of TMM normalized unique denoised amplicons table can be represented as heatmaps. Unique denoised amplicons can be compared between them to detect Insertions and Deletions (InDels) in CRISPR lines.

Step 4:

Create tables with the presence or absence of unique denoised amplicons in each CRISPR line compared to the wild type lines.

python Step4_usearch_to_table.py --file_otu 
   
     --file_group 
    
      --prefix_output 
     
       --genotype 
      

      
     
    
   


Help:

Argument Help
--file_otu File of TMM normalized otu_table from usearch. Remove "#OTU" from the first line.
--file_group Path to file of genotypes in wild type and CRISPR lines. Example in /Examples/Example_groups.txt.
--prefix_output Prefix to output name. Example: if you are working with BW208 groups: BW.
--genotype Genotype name. Example: if you are working with BW208 groups: BW208.

Default threshold 0.3 % of frequency of each unique denoised amplicon (Amp) in each line.


Outputs:

Substitute "name" in output names for the prefix_output string.

  • Amptable_frequency.txt Table of Amps (otus) transformed to frequencies for apply the threshold.
  • Amptable_brutes_name.txt Table with number of reads contained in the unique denoised amplicons (Amps) present in each line.
  • Amps_name.txt Table with number of unique denoised amplicons (Amps) in each line.

Python 3.6 or later is required.

SparseLasso: Sparse Solutions for the Lasso

SparseLasso: Sparse Solutions for the Lasso Introduction SparseLasso provides a Scikit-Learn based estimation of the Lasso with cross-validation tunin

Gabriel Okasa 1 Nov 08, 2021
pipeline for migrating lichess data into postgresql

How Long Does It Take Ordinary People To "Get Good" At Chess? TL;DR: According to 5.5 years of data from 2.3 million players and 450 million games, mo

Joseph Wong 182 Nov 11, 2022
Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python 📊

Thomas 2 May 26, 2022
Bigdata Simulation Library Of Dream By Sandman Books

BIGDATA SIMULATION LIBRARY OF DREAM BY SANDMAN BOOKS ================= Solution Architecture Description In the realm of Dreaming, its ruler SANDMAN,

Maycon Cypriano 3 Jun 30, 2022
Shot notebooks resuming the main functions of GeoPandas

Shot notebooks resuming the main functions of GeoPandas, 2 notebooks written as Exercises to apply these functions.

1 Jan 12, 2022
Program that predicts the NBA mvp based on data from previous years.

NBA MVP Predictor A machine learning model using RandomForest Regression that predicts NBA MVP's using player data. Explore the docs » View Demo · Rep

Muhammad Rabee 1 Jan 21, 2022
Python script for transferring data between three drives in two separate stages

Waterlock Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs ha

David Swanlund 13 Nov 10, 2021
OpenARB is an open source program aiming to emulate a free market while encouraging players to participate in arbitrage in order to increase working capital.

Overview OpenARB is an open source program aiming to emulate a free market while encouraging players to participate in arbitrage in order to increase

Tom 3 Feb 12, 2022
Automated Exploration Data Analysis on a financial dataset

Automated EDA on financial dataset Just a simple way to get automated Exploration Data Analysis from financial dataset (OHLCV) using Streamlit and ta.

Darío López Padial 28 Nov 27, 2022
Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather

Tuplex 791 Jan 04, 2023
track your GitHub statistics

GitHub-Stalker track your github statistics 👀 features find new followers or unfollowers find who got a star on your project or remove stars find who

Bahadır Araz 34 Nov 18, 2022
A stock analysis app with streamlit

StockAnalysisApp A stock analysis app with streamlit. You select the ticker of the stock and the app makes a series of analysis by using the price cha

Antonio Catalano 50 Nov 27, 2022
Python Project on Pro Data Analysis Track

Udacity-BikeShare-Project: Python Project on Pro Data Analysis Track Basic Data Exploration with pandas on Bikeshare Data Basic Udacity project using

Belal Mohammed 0 Nov 10, 2021
BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems.

Vo Cong Thanh 1 Jan 06, 2022
Conduits - A Declarative Pipelining Tool For Pandas

Conduits - A Declarative Pipelining Tool For Pandas Traditional tools for declaring pipelines in Python suck. They are mostly imperative, and can some

Kale Miller 7 Nov 21, 2021
Codes for the collection and predictive processing of bitcoin from the API of coinmarketcap

Codes for the collection and predictive processing of bitcoin from the API of coinmarketcap

Teo Calvo 5 Apr 26, 2022
Data Analytics: Modeling and Studying data relating to climate change and adoption of electric vehicles

Correlation-Study-Climate-Change-EV-Adoption Data Analytics: Modeling and Studying data relating to climate change and adoption of electric vehicles I

Jonathan Feng 1 Jan 03, 2022
Making the DAEN information accessible.

The purpose of this repository is to make the information on Australian COVID-19 adverse events accessible. The Therapeutics Goods Administration (TGA) keeps a database of adverse reactions to medica

10 May 10, 2022
The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.

The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.

Bell Eapen 14 Jan 02, 2023
Python data processing, analysis, visualization, and data operations

Python This is a Python data processing, analysis, visualization and data operations of the source code warehouse, book ISBN: 9787115527592 Descriptio

FangWei 1 Jan 16, 2022