Retrieve annotated intron sequences and classify them as minor (U12-type) or major (U2-type)

Last update: Jul 26, 2022

Overview

(intron Interrogator and Classifier)

intronIC is a program that can be used to classify intron sequences as minor (U12-type) or major (U2-type), using a genome and annotation or the sequences themselves. Alternatively, intronIC can be used to simply extract all intron sequences without classification (using -s).

Installation

via `pip`

If you have (or can get) pip, running it on this repo is the easiest way to install the most recent version of intronIC (if you have multiple versions of Python installed, be sure to use the appropriate Python 3 version e.g. python3 in the following commands):

python3 -m pip install git+https://github.com/glarue/intronIC

Alternatively, you can get the last stable version published to PyPI:

python3 -m pip install intronIC

If successful, intronIC should now be callable from the command-line.

To upgrade to the latest version from a previous one, include --upgrade in either of the previous pip commands, e.g.

python3 -m pip install git+https://github.com/glarue/intronIC --upgrade

via `git clone`

Otherwise, you can simply clone this repository to your local machine using git:

git clone https://github.com/glarue/intronIC.git
cd intronIC/intronIC

If you clone the repo, you may also wish to add intronIC/intronIC to your system PATH (how best to do this depends on your platform).

See the wiki for more detail information about configuration/run options.

Dependencies

Python >=3.3
numpy & scipy
scikit-learn >=0.22
biogl
matplotlib (optional, required for plotting)

To install dependencies separately using pip, do

python3 -m pip install numpy scipy matplotlib 'scikit-learn>=0.22' biogl

intronIC was built and tested on Linux, but should run on Windows or Mac OSes without too much trouble (I say that now...).

Useful arguments

The required arguments for any classification run include a name (-n; see note below), along with either of the following:

Genome (-g) and annotation/BED (-a, -b) files

—OR—
Intron sequences file (-q) (see Training-data-and-PWMs for formatting information, which matches the reference sequence format)

By default, intronIC includes non-canonical introns, and considers only the longest isoform of each gene. Helpful arguments may include:

-p parallel processes, which can significantly reduce runtime
-f cds use only CDS features to identify introns (by default, uses both CDS and exon features)
--no_nc exclude introns with non-canonical (non-GT-AG/GC-AG/AT-AC) boundaries
-i include introns from multiple isoforms of the same gene (default: longest isoform only)

Running on test data

If you have installed via pip, first download the chromosome 19 FASTA and GFF3 sample files into a directory of your choice.
If you have cloned the repo, first change to the /intronIC/intronIC/test_data subdirectory, which contains Ensembl annotations and sequence for chromosome 19 of the human genome. Replace intronIC with ../intronIC.py in the following examples.

Classify annotated introns

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

The various output files contain different information about each intron; information can be cross-referenced by using the intron label (usually the first column of the file). U12-type introns are those (by default) with probability scores >90%, or equivalently (depending on the output file) relative scores >0. For example, here is an example U12-type AT-AC intron from the meta.iic file:

HomSap-gene:[email protected]:ENST00000614285-intron_1(47);[c:-1]      10.0    AT-AC   GCC|ATATCCTTTT...TTTTCCTTAATT...AATAC|TCC       CACCTCCAACACCCTTCTTTTCTTTGAACAAGAT[TTTTCCTTAATT]CCCCAATAC       50719   transcript:ENST00000614285      gene:ENSG00000141837    1       47      3.9
     2       u12     cds

To retrieve all U12-type introns from this file, one can filter based on the relative score (2nd column; U12-type introns have relative scores >0), e.g.

0)' homo_sapiens.meta.iic">

awk '($2!="." && $2>0)' homo_sapiens.meta.iic

Extract all annotated intron sequences

If you just want to retrieve all annotated intron sequences (without classification), add the -s flag:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens -s

See the rest of the wiki for more details about output files, etc.

A note on the `-n` (name) argument

By default, intronIC expects names in binomial (genus, species) form separated by a non-alphanumeric character, e.g. 'homo_sapiens', 'homo.sapiens', etc. intronIC then formats that name internally into a tag that it uses to label all output intron IDs, ignoring anything past the second non-alphanumeric character.

Output files, on the other hand, are named using the full name supplied via -n. If you'd prefer to have it leave whatever argument you supply to -n unmodified, use the --na flag.

If you are running multiple versions of the same species and would like to keep the same species abbreviations in the output intron data, simply add a tag to the end of the name, e.g. "homo_sapiens.v2"; the tags within files will be consistent ("HomSap"), but the file names across runs will be distinct.

Resource usage

For genomes with a large number of annotated introns, memory usage can be on the order of gigabytes. This should rarely be a problem even for most modern personal computers, however. For reference, the Ensembl 95 release of the human genome requires ~5 GB of memory.

For many non-model genomes, intronIC should run fairly quickly (e.g. tens of minutes). For human and other very well annotated genomes, runtime may be longer (the human Ensembl 95 release takes ~20-35 minutes in testing); run time scales relatively linearly with the total number of annotated introns, and can be improved by using parallel processes via -p.

See the rest of the wiki for more detailed instructions.

Cite

If you find this tool useful, please cite:

Devlin C Moyer, Graham E Larue, Courtney E Hershberger, Scott W Roy, Richard A Padgett, Comprehensive database and evolutionary dynamics of U12-type introns, Nucleic Acids Research, Volume 48, Issue 13, 27 July 2020, Pages 7066–7078, https://doi.org/10.1093/nar/gkaa464

About

intronIC was written to provide a customizable, open-source method for identifying minor (U12-type) spliceosomal introns from annotated intron sequences. Minor introns usually represent ~0.5% (at most) of a given genome's introns, and contain distinct splicing motifs which make them amenable to bioinformatic identification.

Earlier minor intron resources (U12DB, SpliceRack, ERISdb, etc.), while important contributions to the field, are static by design. As such, these databases fail to reflect the dramatic increase in available genome sequences and annotation quality of the last decade.

In addition, other published identification methods employ a certain amount of heuristic fuzziness in defining the classification criteria of their U12-type scoring systems (i.e how "U12-like" does an intron need to look before being called a U12-type intron). intronIC relegates this decision to the well-established support-vector machine (SVM) classification method, which produces an easy-to-interpret "probability of being U12-type" score for each intron.

Furthermore, intronIC provides researchers the opportunity to tailor the underlying training data/position-weight matrices, should they have species-specific data to take advantage of.

Finally, intronIC performs a fair amount of bookkeping during the intron collection process, resulting in (potentially) useful metadata about each intron including parent gene/transcript, ordinal index and phase, information which (as far as I'm aware) is otherwise somewhat non-trivial to acquire.

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

Petastorm Contents Petastorm Installation Generating a dataset Plain Python API Tensorflow API Pytorch API Spark Dataset Converter API Analyzing petas

1.6k Dec 31, 2022

A library of extension and helper modules for Python's data analysis and machine learning libraries.

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks. Sebastian Raschka 2014-2021 Links Doc

4.2k Dec 29, 2022

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

6.9k Jan 5, 2023

Simple, fast, and parallelized symbolic regression in Python/Julia via regularized evolution and simulated annealing

Parallelized symbolic regression built on Julia, and interfaced by Python. Uses regularized evolution, simulated annealing, and gradient-free optimization.

924 Jan 3, 2023

(3D): LeGO-LOAM, LIO-SAM, and LVI-SAM installation and application

SLAM-application: installation and test (3D): LeGO-LOAM, LIO-SAM, and LVI-SAM Tested on Quadruped robot in Gazebo ● Results: video, video2 Requirement

203 Dec 26, 2022

Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

124 Dec 28, 2022

A Tools that help Data Scientists and ML engineers train and deploy ML models.

Domino Research This repo contains projects under active development by the Domino R&D team. We build tools that help Data Scientists and ML engineers

73 Oct 17, 2022

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices. It bridges the gap between any machine learning models you just trained and the efficient online service API.

164 Jan 4, 2023

A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search

31 Nov 3, 2022

Comments

[BUG] intronIC not working for example data and own data

I am trying to run intronIC with example/own data and is not working.

COMMAND:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens (same as wiki)

FEEDBACK:

[#] Starting intronIC [v1.1.1] run on [homo_sapiens (HomSap)] [#] Run command: [/home/rocesv/anaconda3/envs/Seidr/bin/intronIC -g /mnt/e/Gymnosperms_Comparative/Gymnosperms_ComparativeGenomics/Introns_U2vsU12/Homo_sapiens.Chr19.Ensembl_91.fa.gz -a /mnt/e/Gymnosperms_Comparative/Gymnosperms_ComparativeGenomics/Introns_U2vsU12/Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens] [#] Using [cds,exon] features to define introns [#] [58933] introns found in [Homo_sapiens.Chr19.Ensembl_91.gff3.gz] [#] [38681] introns with redundant coordinates excluded [#] [8178] introns omitted from scoring based on the following criteria: [#] * short (<30 nt): 66 [#] * ambiguous nucleotides in scoring regions: 0 [#] * non-canonical boundaries: 0 [#] * overlapping coordinates: 0 [#] * not in longest isoform: 8112 [#] Most common non-canonical splice sites: [#] * AT-AG (16/328, 4.88%) [#] * GT-TG (12/328, 3.66%) [#] * GG-AG (12/328, 3.66%) [#] * GA-AG (11/328, 3.35%) [#] * AG-AG (10/328, 3.05%) [#] [24] ([15] unique, [9] redundant) putatively misannotated U12-type introns corrected in [homo_sapiens.annotation.iic] [#] [12074] introns included in scoring analysis [#] Scoring introns using the following regions: [five, bp] [#] Raw scores calculated for [20690] U2 and [387] U12 reference introns [#] Raw scores calculated for [12074] experimental introns [#] Training set score vectors constructed: [20690] U2, [387] U12 [#] Training SVM using reference data Starting optimization round 1/5 Traceback (most recent call last): File "/home/rocesv/anaconda3/envs/Seidr/bin/intronIC", line 8, in <module> sys.exit(main()) File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5216, in main finalized_introns, model, u12_count, atac_count, demoted_swaps = apply_scores( File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 3804, in apply_scores model, model_performance = optimize_svm( File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5512, in optimize_svm search_model, performance = train_svm( File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5431, in train_svm model = GridSearchCV( File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f return f(*args, **kwargs) TypeError: __init__() got an unexpected keyword argument 'iid'

PROBLEM TRACEBACK:

**Starting optimization round 1/5 Traceback (most recent call last): File "/home/rocesv/anaconda3/envs/Seidr/bin/intronIC", line 8, in sys.exit(main()) File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5216, in main finalized_introns, model, u12_count, atac_count, demoted_swaps = apply_scores( File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 3804, in apply_scores model, model_performance = optimize_svm( File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5512, in optimize_svm search_model, performance = train_svm( File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5431, in train_svm model = GridSearchCV( File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f return f(*args, kwargs) TypeError: init() got an unexpected keyword argument 'iid

In both cases i have the same problem and the log is similar. Any idea? I am very interested on using this amazing tool.

Thank you in advance :)

PD: Running in conda env with python 3.9 (wsl 2 Ubuntu 20.04 Windows 10Pro)

opened by RocesV 2

Releases(v1.3.7)

v1.3.7(Jun 10, 2022)

Fix annoying issues with generating a version number under different installation scenarios.
Source code(tar.gz)
Source code(zip)
v1.3.6(Jun 10, 2022)
Deal with edge-case issue where a gene feature has children exon/CDS features in a direct parent-child relationship. Previously, this would bypass the recursive search for introns used by get_introns() due to an early exit, resulting in preferential inclusion of introns whose Parent attribute was the gene itself rather than a child transcript.

Remove old code/fix whitespace

Update __version__ paradigm

Remove Physarum-specific branch-point PWM code

Source code(tar.gz)
Source code(zip)
v1.3.2(Oct 20, 2021)

Misc. minor changes not affecting functionality.

Switch to limiting master (soon to be main) to point releases, with development code contained to dev.
Source code(tar.gz)
Source code(zip)
v1.3.0(Jul 23, 2021)
Changes default scoring behavior to include all (5', BPS and 3') regions, instead of the previous default of just 5' and BPS. The 3' region typically contains less differentiation between U2- and U12-type introns, but may help reduce FP and FN classifier calls in edge cases. Of course, it's also possible that it could also introduce FPs and/or FNs, although in my experience using all three seems to be more conservative than not.

Misc. minor internal changes.

Source code(tar.gz)
Source code(zip)
v1.2.0(Feb 8, 2021)
intronIC v1.2.0

Fix GridSearchCV regression with newer versions of scikit-learn (>v0.22) (see issue #1)

Due to scikit-learn's inversion of a default flag in GridSearchCV, intronIC must now require scikit-learn to be at least v0.22

This fix breaks compatibility with scikit-learn versions <v0.22

Source code(tar.gz)
Source code(zip)
v1.1.1(Dec 5, 2020)
intronIC v1.1.1

Replace parent-child hierarchical clustering of annotation features with simpler, directed graph-based approach

Fix occasional issues where parent genes of CDS/exon features weren't correctly identified

Source code(tar.gz)
Source code(zip)
v1.1.0(Oct 31, 2020)
A number of changes to the underlying data in this release - the default PWMs have been changed to a slightly less-stringent set, which should leave most results relatively unchanged and deals with some edge-cases where the original PWMs were overly penalizing for certain base positions due to being built from low-N samples. Other changes include:

Default 3'SS region shortened to [-6, 4]

By default, the human U2-type BPS PWM is used instead of the on-the-fly version. A per-run PWM can be generated using --generate_u2_bps_pwm

z-scores in the output have been adjusted to correspond to the entire dataset (previously, they were based on the training set only)

Non-canonical introns by default now use whatever PWM is closest to their terminal dinucleotides if one is obvious (e.g. for AT-TC introns, this would be the AT-AC PWM; for AT-AG introns, GT-AG and AT-AC are equally close in terms of edit distance). Otherwise, the terminal dinucleotides will be ignored and the best PWM will be selected based on the geometric mean of the component scores from each PWM. This can be reverted to the old behavior using --no_ignore_nc_dnts

Source code(tar.gz)
Source code(zip)
v1.0.14(Oct 30, 2020)
intronIC v1.0.14

Uses human U2-type BPS PWM (data from Pineda 2018) by default. To restore the previous paradigm wherein U2-type BPS PWMs are generated on-the-fly using the best match to U12-type BPS motifs in likely U2-type introns, pass --generate_u2_bps_pwm.

Source code(tar.gz)
Source code(zip)
v1.0.13(Sep 6, 2020)
intronIC v1.0.13

Add best U2-type BPS to meta.iic output file. Previously, only the best U12-type BPS sequence was reported. In certain cases, it may be useful to know which U2-type sequence was used in determining the BPS log-ratio score.

Reduce formatting stringency for custom PWMs This should reduce headaches if folks are adding their own PWMs by ignoring case, etc.

Add clause to terminate multiprocessing pool processes on forced exit There were cases I'd noticed in my own usage when force-exiting (e.g. via ctrl-c) where zombie processes would persist. Wrapping the whole thing in a try/except/finally seems to eliminate the issue (limited testing).

Source code(tar.gz)
Source code(zip)
v1.0.12(Aug 21, 2020)

Manual release to trigger Zenodo archive. No functional difference between this and previous version (v1.0.11).
Source code(tar.gz)
Source code(zip)

Owner

Graham Larue

PhD candidate in bioinformatics and molecular evolution at UC Merced

GitHub Repository

ThunderGBM: Fast GBDTs and Random Forests on GPUs

Documentations | Installation | Parameters | Python (scikit-learn) interface What's new? ThunderGBM won 2019 Best Paper Award from IEEE Transactions o

648 Dec 16, 2022

A linear equation solver using gaussian elimination. Implemented for fun and learning/teaching.

A linear equation solver using gaussian elimination. Implemented for fun and learning/teaching. The solver will solve equations of the type: A can be

3 Feb 15, 2022

A basic Ray Tracer that exploits numpy arrays and functions to work fast.

Python-Fast-Raytracer A basic Ray Tracer that exploits numpy arrays and functions to work fast. The code is written keeping as much readability as pos

393 Dec 27, 2022

Simple and flexible ML workflow engine.

This is a simple and flexible ML workflow engine. It helps to orchestrate events across a set of microservices and create executable flow to handle requests. Engine is designed to be configurable wit

295 Jan 06, 2023

DistML is a Ray extension library to support large-scale distributed ML training on heterogeneous multi-node multi-GPU clusters

27 Aug 19, 2022

As we all know the BGMI Loot Crate comes with so many resources for the gamers, this ML Crate will be the hub of various ML projects which will be the resources for the ML enthusiasts! Open Source Program: SWOC 2021 and JWOC 2022.

Machine Learning Loot Crate 💻 🧰 🔴 Welcome contributors! As we all know the BGMI Loot Crate comes with so many resources for the gamers, this ML Cra

89 Dec 28, 2022

Crypto-trading - ML techiques are used to forecast short term returns in 14 popular cryptocurrencies

Crypto-trading - ML techiques are used to forecast short term returns in 14 popular cryptocurrencies. We have amassed a dataset of millions of rows of high-frequency market data dating back to 2018 w

4 Sep 22, 2022

Python implementation of Weng-Lin Bayesian ranking, a better, license-free alternative to TrueSkill

Python implementation of Weng-Lin Bayesian ranking, a better, license-free alternative to TrueSkill This is a port of the amazing openskill.js package

156 Dec 14, 2022

Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable.

SDK: Overview of the Kubeflow pipelines service Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on

3.1k Jan 06, 2023

Lightweight Machine Learning Experiment Logging 📖

Simple logging of statistics, model checkpoints, plots and other objects for your Machine Learning Experiments (MLE). Furthermore, the MLELogger comes with smooth multi-seed result aggregation and co

65 Dec 08, 2022

Time Series Prediction with tf.contrib.timeseries

TensorFlow-Time-Series-Examples Additional examples for TensorFlow Time Series(TFTS). Read a Time Series with TFTS From a Numpy Array: See "test_input

476 Nov 17, 2022

Examples and code for the Practical Machine Learning workshop series

Practical Machine Learning Workshop Series Practical Machine Learning for Quantitative Finance Post conference workshop at the WBS Spring Conference D

21 Jun 25, 2022

Hypernets: A General Automated Machine Learning framework to simplify the development of End-to-end AutoML toolkits in specific domains.

A General Automated Machine Learning framework to simplify the development of End-to-end AutoML toolkits in specific domains.

216 Dec 23, 2022

Retrieve annotated intron sequences and classify them as minor (U12-type) or major (U2-type)

Related tags

Overview

(intron Interrogator and Classifier)

Installation

via pip

via git clone

Dependencies

Useful arguments

Running on test data

Classify annotated introns

Extract all annotated intron sequences

A note on the -n (name) argument

Resource usage

Cite

About

You might also like...

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

A library of extension and helper modules for Python's data analysis and machine learning libraries.

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Simple, fast, and parallelized symbolic regression in Python/Julia via regularized evolution and simulated annealing

(3D): LeGO-LOAM, LIO-SAM, and LVI-SAM installation and application

Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

A Tools that help Data Scientists and ML engineers train and deploy ML models.

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices

A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search

Comments

[BUG] intronIC not working for example data and own data

Releases(v1.3.7)

v1.3.7(Jun 10, 2022)

v1.3.6(Jun 10, 2022)

v1.3.2(Oct 20, 2021)

v1.3.0(Jul 23, 2021)

v1.2.0(Feb 8, 2021)

v1.1.1(Dec 5, 2020)

v1.1.0(Oct 31, 2020)

v1.0.14(Oct 30, 2020)

v1.0.13(Sep 6, 2020)

v1.0.12(Aug 21, 2020)