Collapse a set of redundant kmers to use IUPAC degenerate bases

Overview

kmer-collapse

Given an input set of kmers, find the smallest set of kmers that encapsulates all diversity in the input set using IUPAC degenerate bases. This aims to solve the problem described here: https://www.biostars.org/p/9498272/
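As a minimal sketch of what "IUPAC degenerate bases" means here, the standard IUPAC nucleotide codes can be stored as a lookup from a set of bases to the single letter that represents that set. The names `IUPAC_CODES` and `encode_position` are illustrative assumptions and are not taken from the script.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases each one represents.
# IUPAC_CODES and encode_position are hypothetical names used for illustration only.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def encode_position(bases):
    """Return the single IUPAC code that stands for exactly this set of bases."""
    return IUPAC_CODES[frozenset(bases)]


print(encode_position("AT"))    # W
print(encode_position("CG"))    # S
print(encode_position("ACGT"))  # N
```

Two kmers that differ at a single position, such as AAAAAAAAAA and TAAAAAAAAA, can therefore be written as one kmer with W at that position.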

Usage

Install the marisa-trie library, if necessary.

Modify the script's input variable to specify desired sequences, and then run the script:

$ python kmer-collapse.py
{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA",
        "ASAAAAAAAA"
    ]
}
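
One way to sanity-check an encoding like the one above is to expand each degenerate kmer back into the concrete kmers it stands for and compare the result with the input. The sketch below is an independent check, not part of the script; `IUPAC_BASES`, `expand`, and `covers_exactly` are hypothetical names.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(kmer):
    """Return the set of concrete kmers that a degenerate kmer stands for."""
    return {"".join(p) for p in product(*(IUPAC_BASES[c] for c in kmer))}


def covers_exactly(encoded, original):
    """True if the encoded kmers expand to the original set, no more and no less."""
    expanded = set()
    for kmer in encoded:
        expanded |= expand(kmer)
    return expanded == set(original)


original = ["AAAAAAAAAA", "TAAAAAAAAA", "ACAAAAAAAA", "AGAAAAAAAA"]
print(covers_exactly(["WAAAAAAAAA", "ASAAAAAAAA"], original))  # True
print(covers_exactly(["WSAAAAAAAA"], original))                # False
```

The second call shows why the four inputs cannot be merged into a single kmer: WSAAAAAAAA would also stand for TCAAAAAAAA and TGAAAAAAAA, which are not in the input.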

Notes

This has only been tested with the example kmer sets provided below. However, it aims to be scalable by pruning, along the way, combinations of sub-kmers that would otherwise yield incorrect encodings. It also uses a trie for faster prefix testing. If further performance is needed, an easy win would be to cache sub-kmer tests, since most of these test outcomes are redundant.
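
The caching idea could look roughly like the sketch below. It is an assumption about how such a cache might be added, not the script's actual code: to stay dependency-free, a sorted list with binary search stands in for the trie, and `KMERS` and `has_prefix` are hypothetical names.

```python
from bisect import bisect_left
from functools import lru_cache

# Stand-in for the trie: a sorted list supports prefix tests via binary search.
KMERS = sorted(["AAAAAAAAAA", "TAAAAAAAAA", "ACAAAAAAAA", "AGAAAAAAAA"])


@lru_cache(maxsize=None)
def has_prefix(prefix):
    """True if any input kmer starts with this prefix; repeated queries are cached."""
    # The first kmer that sorts at or after the prefix is the only candidate.
    i = bisect_left(KMERS, prefix)
    return i < len(KMERS) and KMERS[i].startswith(prefix)


print(has_prefix("AC"))  # True
print(has_prefix("TC"))  # False
has_prefix("AC")         # repeated query, answered from the cache
print(has_prefix.cache_info().hits)  # 1
```

Wrapping the prefix test in `functools.lru_cache` leaves the calling code unchanged, so the same approach should carry over to a trie-backed test.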

Additionally, no error checking is done on the input kmer alphabet or on the consistency of kmer lengths. It may be useful to validate input before using this script.
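
A minimal validation pass, run before handing kmers to the script, might look like the following. This is a sketch under the assumption that inputs are uppercase and restricted to A, C, G, and T; `validate_kmers` is a hypothetical helper and is not part of the script.

```python
def validate_kmers(kmers, alphabet="ACGT"):
    """Raise ValueError if the list is empty, lengths differ, or a base is not allowed.

    The default alphabet (uppercase A, C, G, T) is an assumption; adjust as needed.
    """
    if not kmers:
        raise ValueError("no kmers provided")
    lengths = {len(k) for k in kmers}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent kmer lengths: {sorted(lengths)}")
    allowed = set(alphabet)
    for k in kmers:
        bad = set(k) - allowed
        if bad:
            raise ValueError(f"kmer {k!r} contains invalid bases: {sorted(bad)}")
    return kmers


validate_kmers(["AAAAAAAAAA", "TAAAAAAAAA"])  # passes silently

try:
    validate_kmers(["AAAAAAAAAA", "TAAAA"])
except ValueError as err:
    print(err)  # inconsistent kmer lengths: [5, 10]
```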

Examples

These examples are available from the script by uncommenting the relevant input.

A

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA"
    ]
}

B

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "GCGAAAAAAA"
    ],
    "encoded_output": [
        "GCGAAAAAAA",
        "WAAAAAAAAA"
    ]
}

C

{
    "input": [
        "AAAAAAAAAA"
    ],
    "encoded_output": [
        "AAAAAAAAAA"
    ]
}

D

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA"
    ]
}

E

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "TTAAAAAAAA",
        "ATAAAAAAAA"
    ],
    "encoded_output": [
        "WWAAAAAAAA"
    ]
}

F

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ]
}

G

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "ASAAAAAAAA",
        "WAAAAAAAAA"
    ]
}
Owner
Alex Reynolds