Weird Sort-and-Compress Thing

A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by some name I don't know yet). There's a lot still to improve about this algorithm, so be careful where you use it.

How it works

Here's an example for the following list:

l = [1, 2, 2, 2, 3]

The algorithm starts with a counting sort, building a dictionary with each unique number as the key and the number of occurrences of that number in the list as the value:

d = {1: 1, 2: 3, 3: 1}
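A minimal sketch of this counting step, assuming plain Python integers. The helper name `count_occurrences` is made up for illustration, and ordering the keys with `sorted()` is an assumption about how the keys end up in ascending order:

```python
from collections import Counter

def count_occurrences(values):
    """Map each unique integer to how many times it appears, keys ascending."""
    counts = Counter(values)
    # Assumption: sorting the unique keys is enough to get the sorted order.
    return dict(sorted(counts.items()))

d = count_occurrences([1, 2, 2, 2, 3])
print(d)  # {1: 1, 2: 3, 3: 1}
```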

To decrease the space needed to store the keys in memory, we store only the first key as-is, and then for each following key the difference between it and the previous key:

d2 = [(1, 1), (1, 3), (1, 1)]
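The difference step could look roughly like this. The function name `delta_encode` is hypothetical, and treating the first key as its difference from 0 is an assumption that happens to match the example:

```python
def delta_encode(counts):
    """Turn {key: count} into [(key_delta, count), ...] with keys ascending."""
    pairs = []
    previous = 0  # assumption: the first key is stored as its difference from 0
    for key in sorted(counts):
        pairs.append((key - previous, counts[key]))
        previous = key
    return pairs

d2 = delta_encode({1: 1, 2: 3, 3: 1})
print(d2)  # [(1, 1), (1, 3), (1, 1)]
```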

Now, the minimum amount of memory we need to store any key in d2 is 1 bit, since 1 is the maximum difference between any two consecutive keys. The same applies to the values, except that to store any value here we need 2 bits of memory, since the maximum value is 3 (11 in binary). So we know that we can store this list as a sequence of 3-bit elements, like this:

d2_bin = ["101", "111", "101"]
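As a sketch of how the bit widths and the fixed-width elements might be derived (the name `to_bit_strings` is invented here, and using at least 1 bit per field is an assumption):

```python
def to_bit_strings(pairs):
    """Return (chunks, key_bits, value_bits) for a list of (key_delta, count)."""
    # Width of each field is the bit length of its largest entry (at least 1).
    key_bits = max(max(k for k, _ in pairs).bit_length(), 1)
    value_bits = max(max(v for _, v in pairs).bit_length(), 1)
    chunks = [
        format(k, f"0{key_bits}b") + format(v, f"0{value_bits}b")
        for k, v in pairs
    ]
    return chunks, key_bits, value_bits

d2_bin, key_bits, value_bits = to_bit_strings([(1, 1), (1, 3), (1, 1)])
print(d2_bin, key_bits, value_bits)  # ['101', '111', '101'] 1 2
```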

We can now concatenate these elements and return the list as a single number, along with a pair of integers containing the number of bits in each key and the number of bits in each value, which is what allows the number to be decompressed.
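Here is one possible way to pack the pairs into a single integer and to reverse the process. The names `pack` and `unpack` are hypothetical. The decoder relies on every count being at least 1, so no element is all zeros and the loop can stop when the remaining number reaches 0:

```python
def pack(pairs, key_bits, value_bits):
    """Concatenate fixed-width (key_delta, count) elements into one integer."""
    n = 0
    for key_delta, count in pairs:
        n = (n << (key_bits + value_bits)) | (key_delta << value_bits) | count
    return n

def unpack(n, key_bits, value_bits):
    """Rebuild the sorted list from the packed integer and the two widths."""
    width = key_bits + value_bits
    pairs = []
    while n:  # each element is non-zero because every count is >= 1
        chunk = n & ((1 << width) - 1)
        pairs.append((chunk >> value_bits, chunk & ((1 << value_bits) - 1)))
        n >>= width
    pairs.reverse()  # elements were read from the least significant end
    result, key = [], 0
    for key_delta, count in pairs:
        key += key_delta
        result.extend([key] * count)
    return result

packed = pack([(1, 1), (1, 3), (1, 1)], 1, 2)
print(bin(packed))           # 0b101111101
print(unpack(packed, 1, 2))  # [1, 2, 2, 2, 3]
```

Reading the elements from the low end avoids having to store the number of pairs separately, at the cost of one list reversal.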

Memory efficiency

Here's a comparison of the sum of the bit lengths of all numbers in a list of 100 elements, filled with random values in the range 0 to 50 and generated 20 times, vs. the number of bits in the resulting compressed integer (assuming that all numbers in the array, including duplicates, are stored in contiguous memory):

And 1000 numbers from 0 to 50, also 20 times:

4724 => 358
4827 => 309
4818 => 308
4801 => 309
4763 => 309
4763 => 309
4801 => 359
4757 => 359
4766 => 309
4794 => 309
4769 => 309
4789 => 359
4887 => 359
4787 => 309
4761 => 309
4749 => 309
4844 => 308
4798 => 359
4799 => 308
4763 => 359
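A comparison like this could be reproduced with a sketch along the following lines. How the bits were counted for the figures above is not stated, so `raw_bits` and `compressed_bits` are hypothetical helpers built on assumptions (zero counts as 1 bit, the first key is stored as a difference from 0), and the output will not necessarily match those numbers:

```python
import random
from collections import Counter

def raw_bits(values):
    """Sum of the bit lengths of all values (assumption: 0 takes 1 bit)."""
    return sum(max(v.bit_length(), 1) for v in values)

def compressed_bits(values):
    """Bit length of the packed integer for a non-empty list of values."""
    counts = Counter(values)
    keys = sorted(counts)
    # Assumption: the first key is kept as-is, the rest as differences.
    deltas = [keys[0]] + [b - a for a, b in zip(keys, keys[1:])]
    key_bits = max(max(deltas).bit_length(), 1)
    value_bits = max(counts.values()).bit_length()
    n = 0
    for delta, key in zip(deltas, keys):
        n = (n << (key_bits + value_bits)) | (delta << value_bits) | counts[key]
    return n.bit_length()

for _ in range(20):
    sample = [random.randint(0, 50) for _ in range(1000)]
    print(raw_bits(sample), "=>", compressed_bits(sample))
```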

Owner

Douglas
