InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family.

Overview

CRISPRanalysis

InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family.

In this work, we present a workflow to analyze InDels from the multicopy α-gliadin gene family from wheat based on NGS data without the need to pre-viously establish a reference sequence for each genetic background. The pipeline was tested it in a multiple sample set, including three generations of edited wheat lines (T0, T1, and T2), from three different backgrounds and ploidy levels (hexaploid and tetraploid). Implementation of Bayesian optimization of Usearch parameters, inhouse Python, and bash scripts are reported.

Workflow:

Step1:

Bayesian optimization was implemented to optimize Usearch v9.2.64 parameters from merge to search steps for the α-gliadin amplicons on the wild type lines.

python Step1_Bayesian_usearch.py --database 
   
     --file_intervals 
    
      --trim_primers 
     
       --path_usearch_control 
      

      
     
    
   


Help:

Argument Help
--database File fasta with database sequences. Example: /path/to/database/database.fasta.
--file_intervals File with intervals for parameters. Example in /Examples/Example_intervals.txt.
--trim_primers Trim primers in reads if you use database without primers. Optios: YES | NO.
--path_usearch_control Path of usearch and control raw data separated by "," without white spaces. Example: /paht/to/usearch,/path/to/reads_control.


Outputs:

  • Bayesian_usearch.txt File with optimal values, optimal function value, samples or observations, obatained values and search space.
  • Bayesian.png Convergence plot.
  • Bayesian_data_res.txt File with the minf(x) after n calls in each iteration.

Step 2:

Usearch pipeline optimazed on wild type lines for studying results of optimization.

Step2_usearch_WT_to_DB.sh dif pct maxee amp id path_control name_dir_usearch path_database trim_primers


Help:

Arguments must be disposed in the order indicated before.

  • dif Optimal value for dif Usearch parameter.
  • pct Optimal value for pct Usearch parameter.
  • maxee Optimal value for maxee Usearch parameter.
  • amp Optimal value for amp Usearch parameter.
  • id Optimal value for id Usearch parameter.
  • path_control Path of the wild type lines fastq files.
  • name_dir_usearch Path of Usearch.
  • path_database Path of alpha-gliadin amplicon database.
  • trim_primers Trim primers in reads if you use database without primers. Optios: YES | NO.


Outputs:

Usearch merge files, filter files, unique amplicons file, unique denoised amplicon (Amp/otu) file, otu table file.

Step 3:

Usearch pipeline optimazed on all lines (wild types and CRISPR lines) for studying denoised unique amplicon relative abundances.

Step3_usearch_ALL_LINES.sh dif pct maxee amp id path_ALL name_dir_usearch trim_primers


Help:

Arguments must be disposed in the order indicated before.

  • dif Optimal value for dif Usearch parameter.
  • pct Optimal value for pct Usearch parameter.
  • maxee Optimal value for maxee Usearch parameter.
  • amp Optimal value for amp Usearch parameter.
  • id Optimal value for id Usearch parameter.
  • path_ALL Path of all lines (wild type and CRISPR lines) fastq files.
  • name_dir_usearch Path of Usearch.
  • trim_primers Trim primers in reads if you use database without primers. Optios: YES | NO.


Outputs:

Usearch merge files, filter files, unique amplicon file, unique denoised amplicon (Amp/otu) file, otu table file.

Before Step 4, otu table file must be normalized by TMM normalization method (edgeR package in R). Results of TMM normalized unique denoised amplicons table can be represented as heatmaps. Unique denoised amplicons can be compared between them to detect Insertions and Deletions (InDels) in CRISPR lines.

Step 4:

Create tables with the presence or absence of unique denoised amplicons in each CRISPR line compared to the wild type lines.

python Step4_usearch_to_table.py --file_otu 
   
     --file_group 
    
      --prefix_output 
     
       --genotype 
      

      
     
    
   


Help:

Argument Help
--file_otu File of TMM normalized otu_table from usearch. Remove "#OTU" from the first line.
--file_group Path to file of genotypes in wild type and CRISPR lines. Example in /Examples/Example_groups.txt.
--prefix_output Prefix to output name. Example: if you are working with BW208 groups: BW.
--genotype Genotype name. Example: if you are working with BW208 groups: BW208.

Default threshold 0.3 % of frequency of each unique denoised amplicon (Amp) in each line.


Outputs:

Substitute "name" in output names for the prefix_output string.

  • Amptable_frequency.txt Table of Amps (otus) transformed to frequencies for apply the threshold.
  • Amptable_brutes_name.txt Table with number of reads contained in the unique denoised amplicons (Amps) present in each line.
  • Amps_name.txt Table with number of unique denoised amplicons (Amps) in each line.

Python 3.6 or later is required.

ETL pipeline on movie data using Python and postgreSQL

Movies-ETL ETL pipeline on movie data using Python and postgreSQL Overview This project consisted on a automated Extraction, Transformation and Load p

Juan Nicolas Serrano 0 Jul 07, 2021
This repo contains a simple but effective tool made using python which can be used for quality control in statistical approach.

📈 Statistical Quality Control 📉 This repo contains a simple but effective tool made using python which can be used for quality control in statistica

SasiVatsal 8 Oct 18, 2022
Fast, flexible and easy to use probabilistic modelling in Python.

Please consider citing the JMLR-MLOSS Manuscript if you've used pomegranate in your academic work! pomegranate is a package for building probabilistic

Jacob Schreiber 3k Jan 02, 2023
Binance Kline Data With Python

Binance Kline Data by seunghan(gingerthorp) reference https://github.com/binance/binance-public-data/ All intervals are supported: 1m, 3m, 5m, 15m, 30

shquant 5 Jul 13, 2022
Titanic data analysis for python

Titanic-data-analysis This Repo is an analysis on Titanic_mod.csv This csv file contains some assumed data of the Titanic ship after sinking This full

Hardik Bhanot 1 Dec 26, 2021
Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python 📊

Thomas 2 May 26, 2022
signac-flow - manage workflows with signac

signac-flow - manage workflows with signac The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, a

Glotzer Group 44 Oct 14, 2022
Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

10k Jan 01, 2023
This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP.

Overview Welcome to the Step-X repository. This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP. Be

Keanu Pang 0 Jan 20, 2022
Retentioneering 581 Jan 07, 2023
follow-analyzer helps GitHub users analyze their following and followers relationship

follow-analyzer follow-analyzer helps GitHub users analyze their following and followers relationship by providing a report in html format which conta

Yin-Chiuan Chen 2 May 02, 2022
Open-Domain Question-Answering for COVID-19 and Other Emergent Domains

Open-Domain Question-Answering for COVID-19 and Other Emergent Domains This repository contains the source code for an end-to-end open-domain question

7 Sep 27, 2022
TextDescriptives - A Python library for calculating a large variety of statistics from text

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistic

150 Dec 30, 2022
cLoops2: full stack analysis tool for chromatin interactions

cLoops2: full stack analysis tool for chromatin interactions Introduction cLoops2 is an extension of our previous work, cLoops. From loop-calling base

YaqiangCao 25 Dec 14, 2022
Detecting Underwater Objects (DUO)

Underwater object detection for robot picking has attracted a lot of interest. However, it is still an unsolved problem due to several challenges. We take steps towards making it more realistic by ad

27 Dec 12, 2022
Package for decomposing EMG signals into motor unit firings, as used in Formento et al 2021.

EMGDecomp Package for decomposing EMG signals into motor unit firings, created for Formento et al 2021. Based heavily on Negro et al, 2016. Supports G

13 Nov 01, 2022
Python utility to extract differences between two pandas dataframes.

Python utility to extract differences between two pandas dataframes.

Jaime Valero 8 Jan 07, 2023
A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

TennisBusinessIntelligenceProject - A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

carlo paladino 1 Jan 02, 2022
Ejercicios Panda usando Pandas

Readme Below we add configuration details to locally test your application To co

1 Jan 22, 2022
Full automated data pipeline using docker images

Create postgres tables from CSV files This first section is only relate to creating tables from CSV files using postgres container alone. Just one of

1 Nov 21, 2021