Collapse a set of redundant kmers to use IUPAC degenerate bases

Overview

kmer-collapse

Collapse a set of redundant kmers to use IUPAC degenerate bases

Overview

Given an input set of kmers, find the smallest set of kmers that encapsulates all diversity in the input set using IUPAC degenerate bases. This aims to solve the problem described here: https://www.biostars.org/p/9498272/

Usage

Install the marisa-trie library, if necessary.

Modify the script's input variable to specify desired sequences, and then run the script:

$ python kmer-collapse.py
{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA",
        "ASAAAAAAAA"
    ]
}

Notes

This has not been tested with any kmer sets but those examples provided. However, it aims to be scalable by pruning combinations of sub-kmers along the way, which would otherwise yield incorrect encodings. This also uses a trie for faster prefix testing. If futher performance is needed, some easy wins would be to cache sub-kmer tests, since most of these test outcomes would be redundant.

Additionally, no error checking is done on the input kmer alphabet or on the consistency of kmer lengths. It may be useful to validate input before using this script.

Examples

These examples are available from the script by uncommenting the relevant input.

A

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA"
    ]
}

B

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "GCGAAAAAAA"
    ],
    "encoded_output": [
        "GCGAAAAAAA",
        "WAAAAAAAAA"
    ]
}

C

{
    "input": [
        "AAAAAAAAAA"
    ],
    "encoded_output": [
        "AAAAAAAAAA"
    ]
}

D

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA"
    ]
}

E

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "TTAAAAAAAA",
        "ATAAAAAAAA"
    ],
    "encoded_output": [
        "WWAAAAAAAA"
    ]
}

F

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ]
}

G

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "ASAAAAAAAA",
        "WAAAAAAAAA"
    ]
}
Owner
Alex Reynolds
Pug caregiver, curler, cyclist, gardener, beginning French scholar
Alex Reynolds
XlvnsScriptTool - Tool for decompilation and compilation of scripts .SDT from the visual novel's engine xlvns

XlvnsScriptTool English Dual languaged (rus+eng) tool for decompiling and compiling (actually, this tool is more than just (dis)assenbler, but less th

Tester 3 Sep 15, 2022
mypy plugin for PynamoDB

pynamodb-mypy A plugin for mypy which gives it deeper understanding of PynamoDB (beyond what's possible through type stubs). Usage Add it to the plugi

1 Oct 21, 2022
WordlistPasswordGenerator - Shuhfab Basheer

WordlistPasswordGenerator - Shuhfab Basheer Python wordlist generator MAINTAINER

1 Dec 31, 2021
A desktop app to check the unlocked courses bases on previously done courses.

Course Picker A desktop app to check the unlocked courses bases on previously done courses. Table of contents About the Project Built with What it doe

Ahmed Symum Swapno 3 Feb 07, 2022
How to create the game Rock, Paper, Scissors in Python

Rock, Paper, Scissors! If you want to learn how to do interactive games using Python, then this is great start for you. In this code, You will learn h

SplendidSpidey 1 Dec 18, 2021
gwcheck is a tool to check .gnu.warning.* sections in ELF object files and display their content.

gwcheck Description gwcheck is a tool to check .gnu.warning.* sections in ELF object files and display their content. For an introduction to .gnu.warn

Frederic Cambus 11 Oct 28, 2022
Turn your IPad into a Screen-Slaver with 1 simple Pythonista script

ScreenSlaver Turn your IPad into a Screen-Slaver with 1 simple Pythonista script

6 Jul 09, 2022
TB Set color display - Add-on for Blender to set multiple objects and material Display Color at once.

TB_Set_color_display Add-on for Blender with operations to transfer name between object, data, materials and action names Set groups of object's or ma

1 Jun 01, 2022
Material de apoio da oficina de SAST apresentada pelo CAIS no Webinar de 28/05/21.

CAIS-CAIS Conjunto de Aplicações Intencionamente Sem-Vergonha do CAIS Material didático do Webinar "EP1. Oficina - Práticas de análise estática de cód

Fausto Filho 14 Jul 25, 2022
A webapp that timestamps key moments in a football clip

A look into what we're building Demo.mp4 Prerequisites Python 3 Node v16+ Steps to run Create a virtual environment. Activate the virtual environment.

Pranav 1 Dec 10, 2021
Alfred 4 Workflow to search through your maintained/watched/starred GitHub repositories.

Alfred 4 Workflow to search through your maintained/watched/starred GitHub repositories. Setup This workflow requires a number of Python modules. Thes

Bᴇʀɴᴅ Sᴄʜᴏʀɢᴇʀs 1 Oct 14, 2022
🤖🧭Creates google-like navigation menu using python-telegram-bot wrapper

python telegram bot menu pagination Makes a google style pagination line for a list of items. In other words it builds a menu for navigation if you ha

Sergey Smirnov 9 Nov 27, 2022
Run Windows Applications on Linux as if they are native, Use linux applications to launch files files located in windows vm without needing to install applications on vm. With easy to use configuration GUI

Run Windows Applications on Linux as if they are native, Use linux applications to launch files files located in windows vm without needing to install applications on vm. With easy to use configurati

Casu Al Snek 2k Jan 02, 2023
A python script for compiling and executing .cc files

Debug And Run A python script for compiling and executing .cc files Example dbrun fname.cc [DEBUG MODE] Compiling fname.cc with C++17 ------------

1 May 28, 2022
MeerKAT radio telescope simulation package. Built to simulate multibeam antenna data.

MeerKATgen MeerKAT radio telescope simulation package. Designed with performance in mind and utilizes Just in time compile (JIT) and XLA backed vectro

Peter Ma 6 Jan 23, 2022
Feapder的管道扩展

FEAPDER 管道扩展 简介 此模块为feapder的pipelines扩展,感谢广大开发者对feapder的贡献 随着feapder支持的pipelines越来越多,为减少feapder的体积,特将pipelines提出,使用者可按需安装 管道 PostgreSQL 贡献者:沈瑞祥 联系方式:r

boris 9 Dec 07, 2022
CRC Reverse Engineering Tool in Python

CRC Beagle CRC Beagle is a tool for reverse engineering CRCs. It is designed for commnication protocols where you often have several messages of the s

Colin O'Flynn 51 Jan 05, 2023
Traffic flow test platform, especially for reinforcement learning

Traffic Flow Test Platform Traffic flow test platform, especially for reinforcement learning, named TFTP. A traffic signal control framework that can

4 Nov 07, 2022
A stupid obfuscation thing

StupidObfuscation A stupid obfuscation thing How it works The obfuscator takes a string, splits into pieces of one, then, using the table from letter.

Echo 2 May 03, 2022
This repository contains completed Python projects

My Python projects This repository contains completed Python projects: 1) Build projects Guide for building projects into executable files 2) Calculat

Igor Yunusov 8 Nov 04, 2021