ULMFiT for Genomic Sequence Data

Overview

Genomic ULMFiT

This is an implementation of ULMFiT for genomics classification using Pytorch and Fastai. The model architecture used is based on the AWD-LSTM model, consisting of an embedding, three LSTM layers, and a final set of linear layers.

The ULMFiT approach uses three training phases to produce a classification model:

  1. Train a language model on a large, unlabeled corpus
  2. Fine tune the language model on the classification corpus
  3. Use the fine tuned language model to initialize a classification model

This method is particularly advantageous for genomic data, where large amounts of unlabeled data is abundant and labeled data is scarce. The ULMFiT approach allows us to train a model on a large, unlabeled genomic corpus in an unsupervised fashion. The pre-trained language model serves as a feature extractor for parsing genomic data.

Typical deep learning approaches to genomics classification are highly restricted to whatever labeled data is available. Models are usually trained from scratch on small datasets, leading to problems with overfitting. When unsupervised pre-training is used, it is typically done only on the classification dataset or on synthetically generated data. The Genomic-ULMFiT approach uses genome scale corpuses for pre-training to produce better feature extractors than we would get by training only on the classification corpus.

For a deep dive into the ULMFiT approach, model architectures, regularization and training strategies, see the Methods Long Form document in the Methods section.

Results

Performance of Genomic-ULMFiT relative to other methods

Promoter Classification

E. coli promoters

The Genomic-ULMFiT method performs well at the task of classifying promoter sequences from random sections of the genome. The process of unsupervised pre-training and fine-tuning has a clear impact on the performance of the classification model

Model Accuracy Precision Recall Correlation Coefficient
Naive 0.834 0.847 0.816 0.670
E. coli Genome Pre-Training 0.919 0.941 0.893 0.839
Genomic Ensemble Pre-Training 0.973 0.980 0.966 0.947

Data generation described in notebook

Notebook Directory

Classification performance on human promoters is competitive with published results

Human Promoters (short)

For the short promoter sequences, using data from Recognition of Prokaryotic and Eukaryotic Promoters using Convolutional Deep Learning Neural Networks:

Model DNA Size kmer/stride Accuracy Precision Recall Correlation Coefficient Specificity
Kh et al. -200/50 - - - 0.9 0.89 0.98
Naive Model -200/50 5/2 0.80 0.74 0.80 0.59 0.80
With Pre-Training -200/50 5/2 0.922 0.963 0.849 0.844 0.976
With Pre-Training and Fine Tuning -200/50 5/2 .977 .959 .989 .955 .969
With Pre-Training and Fine Tuning -200/50 5/1 .990 .983 .995 .981 .987
With Pre-Training and Fine Tuning -200/50 3/1 .995 .992 .996 .991 .994

Data Source

Notebook Directory

Human Promoters (long)

For the long promoter sequences, using data from PromID: Human Promoter Prediction by Deep Learning:

Model DNA Size Models Accuracy Precision Recall Correlation Coefficient
Umarov et al. -1000/500 2 Model Ensemble - 0.636 0.802 0.714
Umarov et al. -200/400 2 Model Ensemble - 0.769 0.755 0.762
Naive Model -500/500 Single Model 0.858 0.877 0.772 0.708
With Pre-Training -500/500 Single Model 0.888 0.90 0.824 0.770
With Pre-Training and Fine Tuning -500/500 Single Model 0.892 0.877 0.865 0.778

Data generation described in notebook

Notebook Directory

Other Bacterial Promoters

This table shows results on data from Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. These results show how CNN based methods can sometimes perform better when training on small datasets.

Method Organism Training Examples Accuracy Precision Recall Correlation Coefficient Specificity
Kh et al. E. coli 2936 - - 0.90 0.84 0.96
Genomic-ULMFiT E. coli 2936 0.956 0.917 0.880 0.871 0.977
Kh et al. B. subtilis 1050 - - 0.91 0.86 0.95
Genomic-ULMFiT B. subtilis 1050 0.905 0.857 0.789 0.759 0.95

Data Source

Notebook Directory

Metaganomics Classification

Genomic-ULMFiT shows improved performance on the metagenomics taxonomic dataset from Deep learning models for bacteria taxonomic classification of metagenomic data.

Method Data Source Accuracy Precision Recall F1
Fiannaca et al. Amplicon .9137 .9162 .9137 .9126
Genomic-ULMFiT Amplicon .9239 .9402 .9332 .9306
Fiannaca et al. Shotgun .8550 .8570 .8520 .8511
Genomic-ULMFiT Shotgun .8797 .8824 .8769 .8758

Data Source

Notebook Directory

Enhancer Classification

When trained on a dataset of mammalian enhancer sequences from Enhancer Identification using Transfer and Adversarial Deep Learning of DNA Sequences, Genomic_ULMFiT improves on results from Cohn et al.

Model/ROC-AUC Human Mouse Dog Opossum
Cohn et al. 0.80 0.78 0.77 0.72
Genomic-ULMFiT 5-mer Stride 2 0.812 0.871 0.773 0.787
Genomic-ULMFiT 4-mer Stride 2 0.804 0.876 0.771 0.786
Genomic-ULMFiT 3-mer Stride 1 0.819 0.875 0.788 0.798

Data Source

Notebook Directory

mRNA/lncRNA Classification

This table shows results for training a classification model on a dataset of coding mRNA sequences and long noncoding RNA (lncRNA) sequences. The dataset comes from A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential by Hill et al. The dataset contains two test sets - a standard test set and a challenge test set.

Model Test Set Accuracy Specificity Sensitivity Precision MCC
GRU Ensemble (Hill et al.)* Standard Test Set 0.96 0.97 0.95 0.97 0.92
Genomic ULMFiT (3mer stride 1) Standard Test Set 0.963 0.952 0.974 0.953 0.926
GRU Ensemble (Hill et al.)* Challenge Test Set 0.875 0.95 0.80 0.95 0.75
Genomic ULMFiT (3mer stride 1) Challenge Test Set 0.90 0.944 0.871 0.939 0.817

(*) Hill et al. presented their results as a plot rather than as a data table. Values in the above table are estimated by reading off the plot

Data Source

Notebook Directory

Interpreting Results

One way to gain insight into how the classification model makes decisions is to perturb regions of a given input sequence to see how changing different regions of the sequence impact the classification result. This allows us to create plots like the one below, highlighting important sequence regions for classification. In the plot below, the red line corresponds to a true transcription start site. The plot shows how prediction results are sensitive to changes around that location. More detail on interpretations can be found in the Model Interpretations directory.

Long Sequence Inference

Inference on long, unlabeled sequences can be done by breaking the input sequence into chunks and plotting prediction results as a function of length. The image below shows a sample prediction of promoter locations on a 40,000 bp region of the E. coli genome. True promoter locations are shown in red. More detail can be found in this notebook

Relevant Literature

For a comparison to other published methods, see Section 6 of the Methods notebook. Here are some relevant papers in the deep genomics classification space.

DeepCRISPR: optimized CRISPR guide RNA design by deep learning

Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks

PromID: human promoter prediction by deep learning

Deep Learning for Genomics: A Concise Overview

Prediction of deleterious mutations in coding regions of mammals with transfer learning

Enhancer Identification using Transfer and Adversarial Deep Learning of DNA Sequences

PEDLA: predicting enhancers with a deep learning-based algorithmic framework

Predicting enhancers with deep convolutional neural networks

BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone

Deep learning models for bacteria taxonomic classification of metagenomic data

Prediction of enhancer-promoter interactions via natural language processing

A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential

Recurrent Neural Network for Predicting Transcription Factor Binding Sites

Learning the Language of the Genome using RNNs

Owner
Karl
Interested in anything related to deep learning, biotech, energy, materials
Karl
Point detection through multi-instance deep heatmap regression for sutures in endoscopy

Suture detection PyTorch This repo contains the reference implementation of suture detection model in PyTorch for the paper Point detection through mu

artificial intelligence in the area of cardiovascular healthcare 3 Jul 16, 2022
Read number plates with https://platerecognizer.com/

HASS-plate-recognizer Read vehicle license plates with https://platerecognizer.com/ which offers free processing of 2500 images per month. You will ne

Robin 69 Dec 30, 2022
An implementation of the "Attention is all you need" paper without extra bells and whistles, or difficult syntax

Simple Transformer An implementation of the "Attention is all you need" paper without extra bells and whistles, or difficult syntax. Note: The only ex

29 Jun 16, 2022
Reimplementation of the paper `Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words? (ACL2020)`

Human Attention for Text Classification Re-implementation of the paper Human Attention Maps for Text Classification: Do Humans and Neural Networks Foc

Shunsuke KITADA 15 Dec 13, 2021
Junction Tree Variational Autoencoder for Molecular Graph Generation (ICML 2018)

Junction Tree Variational Autoencoder for Molecular Graph Generation Official implementation of our Junction Tree Variational Autoencoder https://arxi

Wengong Jin 418 Jan 07, 2023
Permute Me Softly: Learning Soft Permutations for Graph Representations

Permute Me Softly: Learning Soft Permutations for Graph Representations

Giannis Nikolentzos 7 Jul 10, 2022
Implementation of Fast Transformer in Pytorch

Fast Transformer - Pytorch Implementation of Fast Transformer in Pytorch. This only work as an encoder. Yannic video AI Epiphany Install $ pip install

Phil Wang 167 Dec 27, 2022
Data-Driven Operational Space Control for Adaptive and Robust Robot Manipulation

OSCAR Project Page | Paper This repository contains the codebase used in OSCAR: Data-Driven Operational Space Control for Adaptive and Robust Robot Ma

NVIDIA Research Projects 74 Dec 22, 2022
This repository builds a basic vision transformer from scratch so that one beginner can understand the theory of vision transformer.

vision-transformer-from-scratch This repository includes several kinds of vision transformers from scratch so that one beginner can understand the the

1 Dec 24, 2021
This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.

Introduction This is an official implementation of CvT: Introducing Convolutions to Vision Transformers. We present a new architecture, named Convolut

Bin Xiao 175 Jan 08, 2023
Implementation of PersonaGPT Dialog Model

PersonaGPT An open-domain conversational agent with many personalities PersonaGPT is an open-domain conversational agent cpable of decoding personaliz

ILLIDAN Lab 42 Jan 01, 2023
Sequential model-based optimization with a `scipy.optimize` interface

Scikit-Optimize Scikit-Optimize, or skopt, is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements

Scikit-Optimize 2.5k Jan 04, 2023
DISTIL: Deep dIverSified inTeractIve Learning.

DISTIL: Deep dIverSified inTeractIve Learning. An active/inter-active learning library built on py-torch for reducing labeling costs.

decile-team 110 Dec 06, 2022
Code For TDEER: An Efficient Translating Decoding Schema for Joint Extraction of Entities and Relations (EMNLP2021)

TDEER (WIP) Code For TDEER: An Efficient Translating Decoding Schema for Joint Extraction of Entities and Relations (EMNLP2021) Overview TDEER is an e

Alipay 6 Dec 17, 2022
Tensorflow implementation for "Improved Transformer for High-Resolution GANs" (NeurIPS 2021).

HiT-GAN Official TensorFlow Implementation HiT-GAN presents a Transformer-based generator that is trained based on Generative Adversarial Networks (GA

Google Research 78 Oct 31, 2022
DeepFaceLive - Live Deep Fake in python, Real-time face swap for PC streaming or video calls

DeepFaceLive - Live Deep Fake in python, Real-time face swap for PC streaming or video calls

8.3k Dec 31, 2022
Code repository for "Free View Synthesis", ECCV 2020.

Free View Synthesis Code repository for "Free View Synthesis", ECCV 2020. Setup Install the following Python packages in your Python environment - num

Intelligent Systems Lab Org 253 Dec 07, 2022
SmartSim Infrastructure Library.

Home Install Documentation Slack Invite Cray Labs SmartSim SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and Ten

Cray Labs 139 Jan 01, 2023
Cervix ROI Segmentation Using U-NET

Cervix ROI Segmentation Using U-NET Overview This code illustrate how to segment the ROI in cervical images using U-NET. The ROI here meant to include

Scotty Kwok 35 Sep 14, 2022
Codes for NeurIPS 2021 paper "Adversarial Neuron Pruning Purifies Backdoored Deep Models"

Adversarial Neuron Pruning Purifies Backdoored Deep Models Code for NeurIPS 2021 "Adversarial Neuron Pruning Purifies Backdoored Deep Models" by Dongx

Dongxian Wu 31 Dec 11, 2022