Collapse a set of redundant kmers to use IUPAC degenerate bases

Overview

kmer-collapse

Collapse a set of redundant kmers to use IUPAC degenerate bases

Overview

Given an input set of kmers, find the smallest set of kmers that encapsulates all diversity in the input set using IUPAC degenerate bases. This aims to solve the problem described here: https://www.biostars.org/p/9498272/

Usage

Install the marisa-trie library, if necessary.

Modify the script's input variable to specify desired sequences, and then run the script:

$ python kmer-collapse.py
{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA",
        "ASAAAAAAAA"
    ]
}

Notes

This has not been tested with any kmer sets but those examples provided. However, it aims to be scalable by pruning combinations of sub-kmers along the way, which would otherwise yield incorrect encodings. This also uses a trie for faster prefix testing. If futher performance is needed, some easy wins would be to cache sub-kmer tests, since most of these test outcomes would be redundant.

Additionally, no error checking is done on the input kmer alphabet or on the consistency of kmer lengths. It may be useful to validate input before using this script.

Examples

These examples are available from the script by uncommenting the relevant input.

A

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA"
    ]
}

B

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "GCGAAAAAAA"
    ],
    "encoded_output": [
        "GCGAAAAAAA",
        "WAAAAAAAAA"
    ]
}

C

{
    "input": [
        "AAAAAAAAAA"
    ],
    "encoded_output": [
        "AAAAAAAAAA"
    ]
}

D

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA"
    ]
}

E

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "TTAAAAAAAA",
        "ATAAAAAAAA"
    ],
    "encoded_output": [
        "WWAAAAAAAA"
    ]
}

F

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ]
}

G

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "ASAAAAAAAA",
        "WAAAAAAAAA"
    ]
}
Owner
Alex Reynolds
Pug caregiver, curler, cyclist, gardener, beginning French scholar
Alex Reynolds
Runtime fault injection platform by Daniele Rizzieri (2021)

GDBitflip [v1.04] Runtime fault injection platform by Daniele Rizzieri (2021) This platform executes N times a binary and during each execution it inj

Daniele Rizzieri 1 Dec 07, 2021
A Web app to Cross-Seed torrents in Deluge/qBittorrent/Transmission

SeedCross A Web app to Cross-Seed torrents in Deluge/qBittorrent/Transmission based on CrossSeedAutoDL Require Jackett Deluge/qBittorrent/Transmission

ccf2012 76 Dec 19, 2022
A minimal configuration for a dockerized kafka project.

Docker Kafka Quickstart A minimal configuration for a dockerized kafka project. Usage: Run this command to build kafka and zookeeper containers, and c

Nouamane Tazi 5 Jan 12, 2022
Oblique Strategies for Python

Oblique Strategies for Python

Ɓukasz Langa 3 Feb 17, 2022
Collapse a set of redundant kmers to use IUPAC degenerate bases

kmer-collapse Collapse a set of redundant kmers to use IUPAC degenerate bases Overview Given an input set of kmers, find the smallest set of kmers tha

Alex Reynolds 3 Jan 06, 2022
Simple AoC helper program you can use to develop your own solutions in python.

AoC-Compabion Simple AoC helper program you can use to develop your own solutions in python. Simply install it in your python environment using pip fr

Alexander Vollmer 1 Dec 20, 2021
Run the Tianxunet software on the Xiaoyao Android simulator

Run the Tianxunet software on the Xiaoyao Android simulator, and automatically fill in the answers of English listening on the premise of having answers

1 Feb 13, 2022
JupyterLite as a Datasette plugin

datasette-jupyterlite JupyterLite as a Datasette plugin Installation Install this plugin in the same environment as Datasette. $ datasette install dat

Simon Willison 11 Sep 19, 2022
PyPIContents is an application that generates a Module Index from the Python Package Index (PyPI) and also from various versions of the Python Standard Library.

PyPIContents is an application that generates a Module Index from the Python Package Index (PyPI) and also from various versions of the Python Standar

Collage Labs 10 Nov 19, 2022
Transform your boring distro into a hacking powerhouse.

Pentizer Transform your boring distro into a hacking powerhouse. Pentizer is a personal project that imports Kali and Parrot repositories in any Debia

Michail Tsimpliarakis 2 Nov 05, 2021
This library is an abstraction for Splunk-related development, maintenance, or migration operations

This library is an abstraction for Splunk-related development, maintenance, or migration operations. It provides a single CLI or SDK to conveniently perform various operations such as managing a loca

NEXTPART 6 Dec 21, 2022
Howell County, Missouri, COVID-19 data and (unofficial) estimates

COVID-19 in Howell County, Missouri This repository contains the daily data files used to generate my COVID-19 dashboard for Howell County, Missouri,

Jonathan Thornton 0 Jun 18, 2022
Here You will Find CodeChef Challenge Solutions

Here You will Find CodeChef Challenge Solutions

kanishk kashyap 1 Sep 03, 2022
Python library for generating CycloneDX SBOMs

Python Library for generating CycloneDX This CycloneDX module for Python can generate valid CycloneDX bill-of-material document containing an aggregat

CycloneDX SBOM Standard 31 Dec 16, 2022
For radiometrically calibrating and PSF deconvolving IRIS data

irispreppy For radiometrically calibrating and PSF deconvolving IRIS data. I dislike how I need to own proprietary software (IDL) just to simply prepa

Aaron W. Peat 4 Nov 01, 2022
Prometheus exporter for Spigot accounts

SpigotExporter Prometheus exporter for Spigot accounts What it provides SpigotExporter will output metrics for each of your plugins and a cumulative d

Jacob Bashista 5 Dec 20, 2021
fast_bss_eval is a fast implementation of the bss_eval metrics for the evaluation of blind source separation.

fast_bss_eval Do you have a zillion BSS audio files to process and it is taking days ? Is your simulation never ending ? Fear no more! fast_bss_eval i

Robin Scheibler 99 Dec 13, 2022
Python program to start your zoom meetings

zoomstarter Python programm to start your zoom meetings More about Initially this was a bash script for starting zoom meetings, but as i started devel

Viktor Cvetanovic 2 Nov 24, 2021
A web-based chat application that enables multiple users to interact with one another

A web-based chat application that enables multiple users to interact with one another, in the same chat room or different ones according to their choosing.

3 Apr 22, 2022
Quantity Takeoff with Python. Collecting groups of elements by filters

The free tool QuantityTakeoff allows you to group elements from Revit and IFC models (in BIMJSON-CSV format) with just a few filters and find the required volume values for the grouped elements.

OpenDataBIM 9 Jan 06, 2023