Weird Sort-and-Compress Thing

A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by some name I don't know yet). There's a lot still to improve about this algorithm, so be careful where you use it.

How it works

Here's an example for the following list:

l = [1, 2, 2, 2, 3]

The algorithm starts with counting sort, creating a dictionary with each unique number as the key and the number of occurrences of that number in the list as the value:

d = {1: 1, 2: 3, 3: 1}
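As a rough sketch of this counting step (the helper name count_sorted is made up for illustration and is not part of the original code), Python's collections.Counter already does most of the work:

```python
from collections import Counter

def count_sorted(values):
    """Return (number, occurrences) pairs ordered by number."""
    counts = Counter(values)       # number -> how many times it appears
    return sorted(counts.items())  # sort by the number itself

l = [1, 2, 2, 2, 3]
d = dict(count_sorted(l))
print(d)  # {1: 1, 2: 3, 3: 1}
```

Sorting the keys up front matters because the next step relies on visiting the numbers in increasing order.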

To decrease the space needed to store the numbers in memory, we'll only store the first number and then the difference between each of the next numbers and the previous one:

d2 = [(1, 1), (1, 3), (1, 1)]
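One way to sketch this delta step, assuming the dictionary from the counting step and non-negative numbers (delta_encode is a hypothetical name, not the original implementation):

```python
def delta_encode(d):
    """Turn a {number: count} dict into (gap, count) pairs, ordered by number."""
    pairs = []
    previous = 0  # assumption: the first number is stored as-is, i.e. as its gap from 0
    for number in sorted(d):
        pairs.append((number - previous, d[number]))
        previous = number
    return pairs

d = {1: 1, 2: 3, 3: 1}
d2 = delta_encode(d)
print(d2)  # [(1, 1), (1, 3), (1, 1)]
```

The counts are carried over unchanged; only the keys shrink.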

Now, the minimum amount of memory we need to store any key in d2 is 1 bit, since 1 is the maximum difference between any two consecutive numbers. The same reasoning applies to the values, except that storing any value here takes 2 bits, since the maximum value is 3 (11 in binary). So we know that we can store this list as a sequence of 3-bit elements, like this:

d2_bin = ["101", "111", "101"]
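The fixed-width encoding can be sketched like this, assuming non-negative gaps and counts (to_bit_strings is a hypothetical helper, not the original code). Each width is the bit length of the largest key or value:

```python
def to_bit_strings(d2):
    """Encode (gap, count) pairs as fixed-width binary strings."""
    # Width of the largest gap and the largest count, with at least 1 bit each.
    key_bits = max(max(gap for gap, _ in d2).bit_length(), 1)
    value_bits = max(max(count for _, count in d2).bit_length(), 1)
    chunks = [
        format(gap, f"0{key_bits}b") + format(count, f"0{value_bits}b")
        for gap, count in d2
    ]
    return chunks, key_bits, value_bits

d2 = [(1, 1), (1, 3), (1, 1)]
d2_bin, key_bits, value_bits = to_bit_strings(d2)
print(d2_bin, key_bits, value_bits)  # ['101', '111', '101'] 1 2
```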

We can now return the list as a single number, along with a pair of integers containing the number of bits in each key and the number of bits in each value, allowing the value to be decompressed.
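Here's a minimal sketch of the packing and unpacking, assuming the elements are simply concatenated from left to right and that every count is at least 1 (pack and unpack are illustrative names, and the real implementation may lay out the bits differently):

```python
def pack(d2, key_bits, value_bits):
    """Concatenate fixed-width (gap, count) pairs into one integer."""
    packed = 0
    for gap, count in d2:
        packed = (packed << key_bits) | gap
        packed = (packed << value_bits) | count
    return packed

def unpack(packed, key_bits, value_bits):
    """Rebuild the sorted list from the integer and the two bit widths."""
    pairs = []
    # Read elements from the low end; every element is non-zero because count >= 1.
    while packed > 0:
        count = packed & ((1 << value_bits) - 1)
        packed >>= value_bits
        gap = packed & ((1 << key_bits) - 1)
        packed >>= key_bits
        pairs.append((gap, count))
    pairs.reverse()
    result, number = [], 0
    for gap, count in pairs:
        number += gap                    # undo the delta encoding
        result.extend([number] * count)  # undo the counting
    return result

packed = pack([(1, 1), (1, 3), (1, 1)], 1, 2)
print(bin(packed))           # 0b101111101
print(unpack(packed, 1, 2))  # [1, 2, 2, 2, 3]
```

Reading from the low end means leading zero bits lost in the integer don't matter, as long as the two widths are returned alongside it.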

Memory efficiency

Here's a comparison between the sum of the number of bits of all numbers in a list of 100 elements, filled with random values in the range 0 to 50 and regenerated 20 times, and the number of bits in the resulting compressed integer (assuming that all numbers in the list, duplicates included, are actually stored in contiguous memory):

And for 1000 numbers from 0 to 50, also 20 times (sum of bits in the original list => bits in the compressed integer):

4724 => 358
4827 => 309
4818 => 308
4801 => 309
4763 => 309
4763 => 309
4801 => 359
4757 => 359
4766 => 309
4794 => 309
4769 => 309
4789 => 359
4887 => 359
4787 => 309
4761 => 309
4749 => 309
4844 => 308
4798 => 359
4799 => 308
4763 => 359
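A comparison like this could be produced with something along these lines. This is not the script behind the numbers above, just a sketch under a few assumptions: the range 0 to 50 is inclusive, a 0 costs 1 bit in the raw list, and the compressed size is the bit length of the packed integer alone. The helper names are made up, and the exact figures will likely differ by a few bits from the ones listed.

```python
import random
from collections import Counter

def raw_bits(values):
    # Assumption: each number costs its own bit length, with 0 counted as 1 bit.
    return sum(max(v.bit_length(), 1) for v in values)

def compressed_bits(values):
    counts = sorted(Counter(values).items())
    gaps, previous = [], 0
    for number, _ in counts:
        gaps.append(number - previous)  # first number is stored as its gap from 0
        previous = number
    key_bits = max(max(gaps).bit_length(), 1)
    value_bits = max(count for _, count in counts).bit_length()
    packed = 0
    for gap, (_, count) in zip(gaps, counts):
        packed = (packed << (key_bits + value_bits)) | (gap << value_bits) | count
    return packed.bit_length()

for _ in range(20):
    sample = [random.randint(0, 50) for _ in range(1000)]
    print(raw_bits(sample), "=>", compressed_bits(sample))
```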

Owner

Douglas