Retrieve annotated intron sequences and classify them as minor (U12-type) or major (U2-type)

Overview

(intron Interrogator and Classifier)

intronIC is a program that can be used to classify intron sequences as minor (U12-type) or major (U2-type), using a genome and annotation or the sequences themselves. Alternatively, intronIC can be used to simply extract all intron sequences without classification (using -s).

Installation

via pip

If you have (or can get) pip, running it on this repo is the easiest way to install the most recent version of intronIC. If you have multiple versions of Python installed, be sure to use the appropriate Python 3 version (e.g. python3) in the following commands:

python3 -m pip install git+https://github.com/glarue/intronIC

Alternatively, you can get the last stable version published to PyPI:

python3 -m pip install intronIC

If successful, intronIC should now be callable from the command-line.

To upgrade to the latest version from a previous one, include --upgrade in either of the previous pip commands, e.g.

python3 -m pip install git+https://github.com/glarue/intronIC --upgrade

via git clone

Otherwise, you can simply clone this repository to your local machine using git:

git clone https://github.com/glarue/intronIC.git
cd intronIC/intronIC

If you clone the repo, you may also wish to add intronIC/intronIC to your system PATH (how best to do this depends on your platform).

See the wiki for more detailed information about configuration/run options.

Dependencies

To install the dependencies separately using pip, run:

python3 -m pip install numpy scipy matplotlib 'scikit-learn>=0.22' biogl
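
Since the pip command above pins scikit-learn to 0.22 or later, it can be handy to confirm what is installed before starting a run. The following is a minimal sketch using only the standard library; the helper names (`version_tuple`, `check_dependencies`) are hypothetical and not part of intronIC, and the package names are simply those from the pip command.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip command above; only scikit-learn
# has a stated minimum version.
REQUIRED = {
    "numpy": None,
    "scipy": None,
    "matplotlib": None,
    "scikit-learn": "0.22",
    "biogl": None,
}


def version_tuple(text):
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def check_dependencies(required=REQUIRED):
    """Map each package to 'ok', 'missing' or 'too old' (hypothetical helper)."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if minimum and version_tuple(installed) < version_tuple(minimum):
            report[name] = "too old"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for package, status in check_dependencies().items():
        print(f"{package}: {status}")
```

The comparison only looks at the leading numeric part of each version string, which is enough for a quick sanity check but is not a full version parser.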

intronIC was built and tested on Linux, but should run on Windows or Mac OSes without too much trouble (I say that now...).

Useful arguments

The required arguments for any classification run include a name (-n; see note below), along with either of the following:

  • Genome (-g) and annotation/BED (-a, -b) files

    —OR—

  • Intron sequences file (-q) (see Training-data-and-PWMs for formatting information, which matches the reference sequence format)

By default, intronIC includes non-canonical introns, and considers only the longest isoform of each gene. Helpful arguments may include:

  • -p parallel processes, which can significantly reduce runtime

  • -f cds use only CDS features to identify introns (by default, uses both CDS and exon features)

  • --no_nc exclude introns with non-canonical (non-GT-AG/GC-AG/AT-AC) boundaries

  • -i include introns from multiple isoforms of the same gene (default: longest isoform only)
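
When scripting many runs, the options above can be assembled programmatically. This is only a sketch: `build_command` is a hypothetical helper, it covers just the flags listed in this section, and it assumes that -p takes the number of processes as its value.

```python
import shlex


def build_command(name, genome=None, annotation=None, sequences=None,
                  processes=None, cds_only=False, no_noncanonical=False,
                  all_isoforms=False):
    """Assemble an intronIC argument list from the options described above."""
    if sequences is not None:
        # Intron sequences file supplied directly
        cmd = ["intronIC", "-q", sequences]
    elif genome is not None and annotation is not None:
        # Genome plus annotation
        cmd = ["intronIC", "-g", genome, "-a", annotation]
    else:
        raise ValueError("need genome + annotation, or an intron sequences file")
    cmd += ["-n", name]
    if processes:
        cmd += ["-p", str(processes)]  # assumption: -p takes a process count
    if cds_only:
        cmd += ["-f", "cds"]
    if no_noncanonical:
        cmd.append("--no_nc")
    if all_isoforms:
        cmd.append("-i")
    return cmd


if __name__ == "__main__":
    command = build_command(
        "homo_sapiens",
        genome="Homo_sapiens.Chr19.Ensembl_91.fa.gz",
        annotation="Homo_sapiens.Chr19.Ensembl_91.gff3.gz",
        processes=4,
    )
    print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` once intronIC is on the PATH; building a list rather than a string avoids shell-quoting problems with unusual file names.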

Running on test data

  • If you have installed via pip, first download the chromosome 19 FASTA and GFF3 sample files into a directory of your choice.

  • If you have cloned the repo, first change to the /intronIC/intronIC/test_data subdirectory, which contains Ensembl annotations and sequence for chromosome 19 of the human genome. Replace intronIC with ../intronIC.py in the following examples.

Classify annotated introns

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

The various output files contain different information about each intron; information can be cross-referenced by using the intron label (usually the first column of the file). U12-type introns are those (by default) with probability scores >90%, or equivalently (depending on the output file) relative scores >0. For example, here is a U12-type AT-AC intron from the meta.iic file:

HomSap-gene:ENSG00000141837@transcript:ENST00000614285-intron_1(47);[c:-1]      10.0    AT-AC   GCC|ATATCCTTTT...TTTTCCTTAATT...AATAC|TCC       CACCTCCAACACCCTTCTTTTCTTTGAACAAGAT[TTTTCCTTAATT]CCCCAATAC       50719   transcript:ENST00000614285      gene:ENSG00000141837    1       47      3.9     2       u12     cds

To retrieve all U12-type introns from this file, one can filter based on the relative score (2nd column; U12-type introns have relative scores >0), e.g.

awk '($2!="." && $2>0)' homo_sapiens.meta.iic
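
The same filter can be written in Python if the rows need further processing. This is a minimal sketch that assumes the meta.iic file is tab-delimited and that unscored introns carry "." in the second column, as the awk command implies; `u12_rows` is a hypothetical helper, not part of intronIC.

```python
def u12_rows(path):
    """Yield rows of a meta.iic file whose relative score (column 2) is > 0."""
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or fields[1] == ".":
                continue  # skip short lines and unscored introns
            try:
                score = float(fields[1])
            except ValueError:
                continue  # skip anything that is not a numeric score
            if score > 0:
                yield fields
```

For example, `[row[0] for row in u12_rows("homo_sapiens.meta.iic")]` would collect the labels of all introns classified as U12-type.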

Extract all annotated intron sequences

If you just want to retrieve all annotated intron sequences (without classification), add the -s flag:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens -s

See the rest of the wiki for more details about output files, etc.

A note on the -n (name) argument

By default, intronIC expects names in binomial (genus, species) form separated by a non-alphanumeric character, e.g. 'homo_sapiens', 'homo.sapiens', etc. intronIC then formats that name internally into a tag that it uses to label all output intron IDs, ignoring anything past the second non-alphanumeric character.

Output files, on the other hand, are named using the full name supplied via -n. If you'd prefer to have it leave whatever argument you supply to -n unmodified, use the --na flag.

If you are running multiple versions of the same species and would like to keep the same species abbreviations in the output intron data, simply add a tag to the end of the name, e.g. "homo_sapiens.v2"; the tags within files will be consistent ("HomSap"), but the file names across runs will be distinct.
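
To make the naming behavior concrete, here is a rough sketch of how a tag such as "HomSap" could be derived from the -n argument. It is not intronIC's actual code: `species_tag` is a hypothetical function, and the rule of taking the first three letters of each of the first two name parts is an assumption inferred from the single "homo_sapiens" to "HomSap" example above.

```python
import re


def species_tag(name, part_length=3):
    """Build a short tag from the first two parts of a binomial name.

    Assumption: each part contributes its first `part_length` letters,
    capitalized; anything after the second part is ignored.
    """
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", name) if p]
    return "".join(p[:part_length].capitalize() for p in parts[:2])


if __name__ == "__main__":
    for example in ("homo_sapiens", "homo.sapiens", "homo_sapiens.v2"):
        print(example, "->", species_tag(example))  # each prints HomSap
```

Under this assumption, "homo_sapiens" and "homo_sapiens.v2" yield the same tag, which is why intron labels stay consistent across runs while the output file names differ.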

Resource usage

For genomes with a large number of annotated introns, memory usage can be on the order of gigabytes. This should rarely be a problem even for most modern personal computers, however. For reference, the Ensembl 95 release of the human genome requires ~5 GB of memory.

For many non-model genomes, intronIC should run fairly quickly (e.g. tens of minutes). For human and other very well annotated genomes, runtime may be longer (the human Ensembl 95 release takes ~20-35 minutes in testing); runtime scales roughly linearly with the total number of annotated introns, and can be improved by using parallel processes via -p.

See the rest of the wiki for more detailed instructions.

Cite

If you find this tool useful, please cite:

Devlin C Moyer, Graham E Larue, Courtney E Hershberger, Scott W Roy, Richard A Padgett, Comprehensive database and evolutionary dynamics of U12-type introns, Nucleic Acids Research, Volume 48, Issue 13, 27 July 2020, Pages 7066–7078, https://doi.org/10.1093/nar/gkaa464

About

intronIC was written to provide a customizable, open-source method for identifying minor (U12-type) spliceosomal introns from annotated intron sequences. Minor introns usually represent ~0.5% (at most) of a given genome's introns, and contain distinct splicing motifs which make them amenable to bioinformatic identification.

Earlier minor intron resources (U12DB, SpliceRack, ERISdb, etc.), while important contributions to the field, are static by design. As such, these databases fail to reflect the dramatic increase in available genome sequences and annotation quality of the last decade.

In addition, other published identification methods employ a certain amount of heuristic fuzziness in defining the classification criteria of their U12-type scoring systems (i.e., how "U12-like" does an intron need to look before being called a U12-type intron). intronIC relegates this decision to the well-established support-vector machine (SVM) classification method, which produces an easy-to-interpret "probability of being U12-type" score for each intron.

Furthermore, intronIC provides researchers the opportunity to tailor the underlying training data/position-weight matrices, should they have species-specific data to take advantage of.

Finally, intronIC performs a fair amount of bookkeeping during the intron collection process, resulting in (potentially) useful metadata about each intron including parent gene/transcript, ordinal index and phase, information which (as far as I'm aware) is otherwise somewhat non-trivial to acquire.

Comments
  • [BUG] intronIC not working for example data and own data


    I am trying to run intronIC with example/own data and it is not working.

    COMMAND:

    intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens (same as wiki)

    FEEDBACK:

    [#] Starting intronIC [v1.1.1] run on [homo_sapiens (HomSap)]
    [#] Run command: [/home/rocesv/anaconda3/envs/Seidr/bin/intronIC -g /mnt/e/Gymnosperms_Comparative/Gymnosperms_ComparativeGenomics/Introns_U2vsU12/Homo_sapiens.Chr19.Ensembl_91.fa.gz -a /mnt/e/Gymnosperms_Comparative/Gymnosperms_ComparativeGenomics/Introns_U2vsU12/Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens]
    [#] Using [cds,exon] features to define introns
    [#] [58933] introns found in [Homo_sapiens.Chr19.Ensembl_91.gff3.gz]
    [#] [38681] introns with redundant coordinates excluded
    [#] [8178] introns omitted from scoring based on the following criteria:
    [#] * short (<30 nt): 66
    [#] * ambiguous nucleotides in scoring regions: 0
    [#] * non-canonical boundaries: 0
    [#] * overlapping coordinates: 0
    [#] * not in longest isoform: 8112
    [#] Most common non-canonical splice sites:
    [#] * AT-AG (16/328, 4.88%)
    [#] * GT-TG (12/328, 3.66%)
    [#] * GG-AG (12/328, 3.66%)
    [#] * GA-AG (11/328, 3.35%)
    [#] * AG-AG (10/328, 3.05%)
    [#] [24] ([15] unique, [9] redundant) putatively misannotated U12-type introns corrected in [homo_sapiens.annotation.iic]
    [#] [12074] introns included in scoring analysis
    [#] Scoring introns using the following regions: [five, bp]
    [#] Raw scores calculated for [20690] U2 and [387] U12 reference introns
    [#] Raw scores calculated for [12074] experimental introns
    [#] Training set score vectors constructed: [20690] U2, [387] U12
    [#] Training SVM using reference data
    Starting optimization round 1/5
    Traceback (most recent call last):
      File "/home/rocesv/anaconda3/envs/Seidr/bin/intronIC", line 8, in <module>
        sys.exit(main())
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5216, in main
        finalized_introns, model, u12_count, atac_count, demoted_swaps = apply_scores(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 3804, in apply_scores
        model, model_performance = optimize_svm(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5512, in optimize_svm
        search_model, performance = train_svm(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5431, in train_svm
        model = GridSearchCV(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
        return f(*args, **kwargs)
    TypeError: __init__() got an unexpected keyword argument 'iid'

    PROBLEM TRACEBACK:

    Starting optimization round 1/5
    Traceback (most recent call last):
      File "/home/rocesv/anaconda3/envs/Seidr/bin/intronIC", line 8, in <module>
        sys.exit(main())
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5216, in main
        finalized_introns, model, u12_count, atac_count, demoted_swaps = apply_scores(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 3804, in apply_scores
        model, model_performance = optimize_svm(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5512, in optimize_svm
        search_model, performance = train_svm(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5431, in train_svm
        model = GridSearchCV(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
        return f(*args, **kwargs)
    TypeError: __init__() got an unexpected keyword argument 'iid'

    In both cases I have the same problem and the log is similar. Any idea? I am very interested in using this amazing tool.

    Thank you in advance :)

    PS: Running in a conda env with Python 3.9 (WSL 2, Ubuntu 20.04, Windows 10 Pro)

    opened by RocesV 2
Releases(v1.3.7)
  • v1.3.7(Jun 10, 2022)

  • v1.3.6(Jun 10, 2022)

    • Deal with edge-case issue where a gene feature has children exon/CDS features in a direct parent-child relationship. Previously, this would bypass the recursive search for introns used by get_introns() due to an early exit, resulting in preferential inclusion of introns whose Parent attribute was the gene itself rather than a child transcript.
    • Remove old code/fix whitespace
    • Update __version__ paradigm
    • Remove Physarum-specific branch-point PWM code
  • v1.3.2(Oct 20, 2021)

    Misc. minor changes not affecting functionality.

    Switch to limiting master (soon to be main) to point releases, with development code contained to dev.

  • v1.3.0(Jul 23, 2021)

    • Changes default scoring behavior to include all (5', BPS and 3') regions, instead of the previous default of just 5' and BPS. The 3' region typically contains less differentiation between U2- and U12-type introns, but may help reduce FP and FN classifier calls in edge cases. Of course, it's also possible that it could introduce FPs and/or FNs, although in my experience using all three seems to be more conservative than not.
    • Misc. minor internal changes.
  • v1.2.0(Feb 8, 2021)

    intronIC v1.2.0

    • Fix GridSearchCV regression with newer versions of scikit-learn (>v0.22) (see issue #1)
    • Due to scikit-learn's inversion of a default flag in GridSearchCV, intronIC must now require scikit-learn to be at least v0.22
    • This fix breaks compatibility with scikit-learn versions <v0.22
  • v1.1.1(Dec 5, 2020)

    intronIC v1.1.1

    • Replace parent-child hierarchical clustering of annotation features with simpler, directed graph-based approach
    • Fix occasional issues where parent genes of CDS/exon features weren't correctly identified
  • v1.1.0(Oct 31, 2020)

    A number of changes to the underlying data in this release - the default PWMs have been changed to a slightly less-stringent set, which should leave most results relatively unchanged and deals with some edge-cases where the original PWMs were overly penalizing for certain base positions due to being built from low-N samples. Other changes include:

    • Default 3'SS region shortened to [-6, 4]
    • By default, the human U2-type BPS PWM is used instead of the on-the-fly version. A per-run PWM can be generated using --generate_u2_bps_pwm
    • z-scores in the output have been adjusted to correspond to the entire dataset (previously, they were based on the training set only)
    • Non-canonical introns by default now use whatever PWM is closest to their terminal dinucleotides if one is obvious (e.g. for AT-TC introns, this would be the AT-AC PWM; for AT-AG introns, GT-AG and AT-AC are equally close in terms of edit distance). Otherwise, the terminal dinucleotides will be ignored and the best PWM will be selected based on the geometric mean of the component scores from each PWM. This can be reverted to the old behavior using --no_ignore_nc_dnts
  • v1.0.14(Oct 30, 2020)

    intronIC v1.0.14

    • Uses human U2-type BPS PWM (data from Pineda 2018) by default. To restore the previous paradigm wherein U2-type BPS PWMs are generated on-the-fly using the best match to U12-type BPS motifs in likely U2-type introns, pass --generate_u2_bps_pwm.
  • v1.0.13(Sep 6, 2020)

    intronIC v1.0.13

    • Add best U2-type BPS to meta.iic output file. Previously, only the best U12-type BPS sequence was reported. In certain cases, it may be useful to know which U2-type sequence was used in determining the BPS log-ratio score.
    • Reduce formatting stringency for custom PWMs. This should reduce headaches if folks are adding their own PWMs by ignoring case, etc.
    • Add clause to terminate multiprocessing pool processes on forced exit. There were cases I'd noticed in my own usage when force-exiting (e.g. via ctrl-c) where zombie processes would persist. Wrapping the whole thing in a try/except/finally seems to eliminate the issue (limited testing).
  • v1.0.12(Aug 21, 2020)

Owner
Graham Larue
PhD candidate in bioinformatics and molecular evolution at UC Merced