
Overview

PyProbables


pyprobables is a pure-python library for probabilistic data structures. The goal is to provide the developer with a pure-python implementation of common probabilistic data-structures to use in their work.

For better raw performance, it is recommended to supply an alternative hashing algorithm that has been compiled in C. This could mean using the provided md5 or sha512 strategies, or installing a third-party package and writing your own hashing strategy; options include the murmur hash mmh3 or the hashes from the pyhash library. Each data object in pyprobables makes it easy to pass in a custom hashing function.

Read more below about supplying a pre-defined, alternative hashing strategy or defining a hashing function using the provided decorators.

Installation

Pip Installation:

$ pip install pyprobables

To install from source, clone the repository from GitHub and run the following from the project folder:

$ python setup.py install

pyprobables supports Python 3.5 - 3.9+

For Python 2.7 support, install release 0.3.2:

$ pip install pyprobables==0.3.2

API Documentation

The documentation is hosted on readthedocs.io

You can build the documentation locally by running:

$ pip install sphinx
$ cd docs/
$ make html

Automated Tests

To run the automated tests, run the following command from the project folder:

$ python setup.py test

Quickstart

Import pyprobables and set up a Bloom Filter

from probables import (BloomFilter)
blm = BloomFilter(est_elements=1000, false_positive_rate=0.05)
blm.add('google.com')
blm.check('facebook.com')  # should return False
blm.check('google.com')  # should return True

Import pyprobables and set up a Count-Min Sketch

from probables import (CountMinSketch)
cms = CountMinSketch(width=1000, depth=5)
cms.add('google.com')  # should return 1
cms.add('facebook.com', 25)  # insert 25 at once; should return 25

Import pyprobables and set up a Cuckoo Filter

from probables import (CuckooFilter)
cko = CuckooFilter(capacity=100, max_swaps=10)
cko.add('google.com')
cko.check('facebook.com')  # should return False
cko.check('google.com')  # should return True

Supplying a pre-defined, alternative hashing strategy

from probables import (BloomFilter)
from probables.hashes import (default_sha256)
blm = BloomFilter(est_elements=1000, false_positive_rate=0.05,
                  hash_function=default_sha256)
blm.add('google.com')
blm.check('facebook.com')  # should return False
blm.check('google.com')  # should return True

Defining a hashing function using the provided decorators

import mmh3  # murmur hash 3 implementation (pip install mmh3)
from probables.hashes import (hash_with_depth_bytes)
from probables import (BloomFilter)

@hash_with_depth_bytes
def my_hash(key):
    return mmh3.hash_bytes(key)

blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)

Or using an integer-returning hash with the hash_with_depth_int decorator:

import hashlib
from probables.hashes import (hash_with_depth_int)
from probables.constants import (UINT64_T_MAX)
from probables import (BloomFilter)

@hash_with_depth_int
def my_hash(key, encoding='utf-8'):
    max64mod = UINT64_T_MAX + 1
    val = int(hashlib.sha512(key.encode(encoding)).hexdigest(), 16)
    return val % max64mod

blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)

See the API documentation for other data structures available and the quickstart page for more examples!

Changelog

Please see the changelog for a list of all changes.

Comments
  • Math domain error

    Math domain error

    Hello,

    I'm getting the following error when using print(bloom_filter).

      File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 127, in __str__
        self.estimate_elements(),
      File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 350, in estimate_elements
        log_n = math.log(1 - (float(setbits) / float(self.number_bits)))
    ValueError: math domain error
    

    I'm running the latest version, downloaded from pip only the other day, and I'm using Python version 3.8.6.

    opened by Glfrey 31
  • Wrong result with large filter

    Wrong result with large filter

    I expect that if I ask the filter to check for membership and it tells me FALSE, then it's definitely NOT a member. I did the following:

    def verifyMembership(key):
        global bloom
        if key in bloom:
            print('Its possibly in')
        else:
            print('Definitly not in')
    
    key = 'some'
    filterFile = 'index.dat'
    bloom = BloomFilter(est_elements=100000000, false_positive_rate=0.03, filepath=filterFile)
    verifyMembership(key)
    bloom.add(key)
    verifyMembership(key)
    bloom.export(filterFile)
    

    I called my script twice and the output is:

    Definitly not in
    Its possibly in
    Definitly not in
    Its possibly in
    

    But I would expect:

    Definitly not in
    Its possibly in
    Its possibly in
    Its possibly in
    

    If I reduce est_elements to, let's say, 10000, then it's fine.

    opened by mrqc 7
  • Use black to format code, add support for poetry and pre-commit

    Use black to format code, add support for poetry and pre-commit

    This is mainly a cosmetic update for the codebase. Black is now the de facto standard code formatter.

    I added minimal config for pre-commit.

    I also took the liberty of adding poetry support, as it uses the pyproject.toml file also used by black and isort. And it's the best venv solution on the market.

    opened by dekoza 6
  • Hotfix/export-c-header

    Hotfix/export-c-header

    @dnanto this is a minor tweak to the export_c_header function. It allows for the exported bloom as chars to be compatible with the hex export which is how it can also be tested.

    Thoughts?

    opened by barrust 4
  • Several problems with the default hash

    Several problems with the default hash

    Hi, I found some problems with the default fnv hash used. Even though it is recommended to provide custom hashes, some users may expect the defaults to work properly.

    First off, the results differ from standardized implementations:

    $ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("foo"))'
    3411864951044856955 # should be 15902901984413996407
    $ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("bar"))'
    1047262940628067782 # should be 16101355973854746
    

    This is caused by wrong hval value here https://github.com/barrust/pyprobables/blob/beb73f2f6c2ab9d8b8b477381e84271c88b25e8f/probables/hashes.py#L85 (should be 14695981039346656037 instead of 14695981039346656073). Changing this constant helps:

    $ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("foo"))'
    15902901984413996407
    $ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("bar"))'
    16101355973854746
    

    The second problem is in the @hash_with_depth_int wrapper once more hashes than one are computed. Because the value of the first hash is used as a seed for the subsequent hashes, once we get a collision in the first hash, all other hashes are identical:

    $ python3 -c 'from probables.hashes import default_fnv_1a; print(default_fnv_1a("gMPflVXtwGDXbIhP73TX", 3))'
    [10362567113185002004, 14351534809307984379, 3092021042139682764]
    $ python3 -c 'from probables.hashes import default_fnv_1a; print(default_fnv_1a("LtHf1prlU1bCeYZEdqWf", 3))'
    [10362567113185002004, 14351534809307984379, 3092021042139682764]
    

    This makes all Count*Sketch data structures much less accurate, since they rely on small probabilities of collision in all hash functions involved.

    opened by simonmandlik 4
  • Missing method to aggregate count-min sketches

    Missing method to aggregate count-min sketches

    The count-min sketch has, in theory, the property that two tables can be summed together, which allows building count-min sketches in parallel, but I don't see it implemented here. Should I make a pull request that implements it?

    opened by racinmat 4
  • add `frombytes` to all probabilistic data structures

    add `frombytes` to all probabilistic data structures

    • added ability to load data structures using the exported bytes
    • added tests to verify the frombytes() functionality
    • minor changes to MMap class for possible use in the future (with tests)

    Resolves #88 @KOLANICH

    opened by barrust 3
  • Several fixes and performance improvement

    Several fixes and performance improvement

    Fixes #60, fixes #61, fixes part of #62. Uses the correct seed for the default hash. Modified tests to reflect these changes. Changed the order of arguments for assertions to follow the assert(expected, actual) paradigm; when not followed, unittest yields misleading error messages. The problem with propagating collisions to larger depths still needs to be addressed though. Should I do it in this PR, or in some other?

    opened by racinmat 3
  • bloom filter intersection failure

    bloom filter intersection failure

    Tried an intersection of 2 bloom filters, both with est_elements=16000000, and got a list index out of range error.

    It works fine if both have est_elements=16000001.

    If one is 160000000 and the other is 16000001, the intersection returns None rather than throwing an error explaining what the problem is.

    opened by sfletc 3
  • How to cPickle count min sketch instance

    How to cPickle count min sketch instance

    I encounter this error when using cPickle to save count min sketch instance:

    Traceback (most recent call last):
      File "test.py", line 14, in <module>
        pkl.dump(cms, f)
      File "/usr/local/Cellar/[email protected]/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy_reg.py", line 77, in _reduce_ex
        raise TypeError("a class that defines __slots__ without "
    TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

    opened by huuthonguyen76 3
  • Moved the metadata into PEP 621-compliant sections.

    Moved the metadata into PEP 621-compliant sections.

    Since poetry has not yet implemented PEP 621 I have temporarily switched the build backend to flit. To use setuptools one has to comment out the lines choosing flit and uncomment the lines choosing setuptools. setup.py and setup.cfg have been removed, their content has been integrated into pyproject.toml.

    PEP 621 support in poetry is tracked here: https://github.com/python-poetry/roadmap/issues/3

    opened by KOLANICH 2
Releases(v0.5.6)
  • v0.5.6(Mar 10, 2022)

  • v0.5.5(Jan 15, 2022)

    • Bloom Filters:
      • Re-implemented the entire Bloom Filter data structure to reduce complexity and code duplication
    • Removed unused imports
    • Removed unnecessary casts
    • Pylint Requested Style Changes:
      • Use python 3 super()
      • Use python 3 classes
    • Remove use of temporary variables if possible and still clear
    Source code(tar.gz)
    Source code(zip)
  • v0.5.4(Jan 8, 2022)

    • All Probabilistic Data Structures:
      • Added ability to load each frombytes()
      • Updated underlying data structures of number based lists to be more space and time efficient; see Issue #60
    • Cuckoo Filters:
      • Added fingerprint_size_bits property
      • Added error_rate property
      • Added ability to initialize based on error rate
    • Simplified typing
    • Ensure all filepaths can be str or Path
    Source code(tar.gz)
    Source code(zip)
  • v0.5.3(Dec 29, 2021)

    • Additional type hinting
    • Improved format parsing and serialization; see PR#81. Thanks @KOLANICH
    • Bloom Filters
      • Added export_to_hex functionality for Bloom Filters on Disk
      • Export as C header (*.h) for Bloom Filters on Disk and Counting Bloom Filters
    • Added support for more input types for exporting and loading of saved files
    Source code(tar.gz)
    Source code(zip)
  • v0.5.2(Dec 13, 2021)

  • v0.5.1(Nov 19, 2021)

    • Bloom Filter:
      • Export as a C header (*.h)
    • Count-Min Sketch
      • Add join/merge functionality
    • Moved testing to use NamedTemporaryFile for file based tests
    Source code(tar.gz)
    Source code(zip)
  • v0.5.0(Oct 19, 2021)

    • BACKWARD INCOMPATIBLE CHANGES
      • NOTE: Breaks backwards compatibility with previously exported blooms, counting-blooms, cuckoo filter, or count-min-sketch files using the default hash!
      • Update to the FNV_1a hash function
      • Simplified the default hash to use a seed value
    • Ensure passing of depth to hashing function when using hash_with_depth_int or hash_with_depth_bytes
    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Apr 30, 2021)

  • v0.4.0(Dec 31, 2020)

  • v0.3.2(Aug 9, 2020)

  • v0.3.1(Mar 20, 2020)

  • v0.3.0(Nov 21, 2018)

  • v0.2.6(Nov 12, 2018)

  • v0.2.5(Nov 10, 2018)

  • v0.2.0(Nov 7, 2018)

  • v0.1.4(May 25, 2018)

  • v0.1.3(Jan 2, 2018)

  • v0.1.2(Oct 9, 2017)

  • v0.1.1(Oct 4, 2017)

  • v0.1.0(Sep 29, 2017)

  • v0.0.8(Aug 22, 2017)

  • v0.0.7(Aug 12, 2017)

    • Counting Bloom Filter
      • Fix counting bloom hex export / import
      • Fix for overflow issue in counting bloom export
      • Added ability to remove from counting bloom
    • Count-Min Sketch
      • Fix for not recording large numbers of inserts and deletions correctly
    Source code(tar.gz)
    Source code(zip)
  • v0.0.6(Aug 5, 2017)

  • v0.0.5(Jul 21, 2017)

  • v0.0.4(Jul 15, 2017)

    • Initial probabilistic data-structures
      • Bloom Filter
      • Bloom Filter On Disk
      • Count-Min Sketch
      • Count-Mean Sketch
      • Count-Mean-Min Sketch
      • Heavy Hitters
      • Stream Threshold
    • Import / Export of each type
    Source code(tar.gz)
    Source code(zip)
Owner
Tyler Barrus