SIMD-accelerated bitwise hamming distance Python module for hexidecimal strings

Overview

hexhamming

Pip Prs Github

What does it do?

This module performs a fast bitwise hamming distance of two hexadecimal strings.

This looks like:

DEADBEEF = 11011110101011011011111011101111
00000000 = 00000000000000000000000000000000
XOR      = 11011110101011011011111011101111
Hamming  = number of ones in DEADBEEF ^ 00000000 = 24

This essentially amounts to

>>> import gmpy
>>> gmpy.popcount(0xdeadbeef ^ 0x00000000)
24

except with Python strings, so

>>> import gmpy
>>> gmpy.popcount(int("deadbeef", 16) ^ int("00000000", 16))
24

A few assumptions are made and enforced:

  • this is a valid hexadecimal string (i.e., [a-fA-F0-9]+)
  • the strings are the same length
  • the strings do not begin with "0x"

Why yet another Hamming distance library?

There are a lot of fantastic (python) libraries that offer methods to calculate various edit distances, including Hamming distances: Distance, textdistance, scipy, jellyfish, etc.

In this case, I needed a hamming distance library that worked on hexadecimal strings (i.e., a Python str) and performed blazingly fast. Furthermore, I often did not care about hex strings greater than 256 bits. That length constraint is different vs all the other libraries and enabled me to explore vectorization techniques via numba, numpy, and SSE/AVX intrinsics.

Lastly, I wanted to minimize dependencies, meaning you do not need to install numpy, gmpy, cython, pypy, pythran, etc.

Eventually, after playing around with gmpy.popcount, numba.jit, pythran.run, numpy, I decided to write what I wanted in essentially raw C. At this point, I'm using raw char* and int*, so exploring re-writing this in Fortran makes little sense.

Installation

To install, ensure you have Python 2.7 or 3.4+. Run

pip install hexhamming

or to install from source

git clone https://github.com/mrecachinas/hexhamming
cd hexhamming
python setup.py install # or pip install .

If you want to contribute to hexhamming, you should install the dev dependencies

pip install -r requirements-dev.txt

and make sure the tests pass with

python -m pytest -vls .

Example

Using hexhamming is as simple as

>>> from hexhamming import hamming_distance_string
>>> hamming_distance_string("deadbeef", "00000000")
24

New in v2.0.0 : hexhamming now supports byte`s via ``hamming_distance_bytes`. You use it in the exact same way as before, except you pass in a byte string.

>>> from hexhamming import hamming_distance_bytes
>>> hamming_distance_bytes(b"\xde\xad\xbe\xef", b"\x00\x00\x00\x00")
24

Benchmark

Below is a benchmark using pytest-benchmark with hexhamming==v1.3.2 my 2020 2.0 GHz quad-core Intel Core i5 16 GB 3733 MHz LPDDR4 macOS Catalina (10.15.5) with Python 3.7.3 and Apple clang version 11.0.3 (clang-1103.0.32.62).

Name Mean (ns) Std (ns) Median (ns) Rounds Iterations
test_hamming_distance_bench_3 93.8 10.5 94.3 53268 200
test_hamming_distance_bench_3_same 94.2 15.2 94.9 102146 100
test_check_hexstrings_within_dist_bench 231.9 104.2 216.5 195122 22
test_hamming_distance_bench_256 97.5 34.1 94.0 195122 22
test_hamming_distance_bench_1000 489.8 159.4 477.5 94411 20
test_hamming_distance_bench_1000_same 497.8 87.8 496.6 18971 20
test_hamming_distance_bench_1024 509.9 299.5 506.7 18652 10
test_hamming_distance_bench_1024_same 467.4 205.9 450.4 181819 10
Owner
Michael Recachinas
Husband to @erinrecachinas, Dad, 🐶 Dad, he/him/his
Michael Recachinas
Sequence learning toolkit for Python

seqlearn seqlearn is a sequence classification toolkit for Python. It is designed to extend scikit-learn and offer as similar as possible an API. Comp

Lars 653 Dec 27, 2022
Implementation of linesearch Optimization Algorithms in Python

Nonlinear Optimization Algorithms During my time as Scientific Assistant at the Karlsruhe Institute of Technology (Germany) I implemented various Opti

Paul 3 Dec 06, 2022
Generate music from midi files using BPE and markov model

Generate music from midi files using BPE and markov model

Aditya Khadilkar 37 Oct 24, 2022
Evidently helps analyze machine learning models during validation or production monitoring

Evidently helps analyze machine learning models during validation or production monitoring. The tool generates interactive visual reports and JSON profiles from pandas DataFrame or csv files. Current

Evidently AI 3.1k Jan 07, 2023
Dive into Machine Learning

Dive into Machine Learning Hi there! You might find this guide helpful if: You know Python or you're learning it 🐍 You're new to Machine Learning You

Michael Floering 11.1k Jan 03, 2023
Pragmatic AI Labs 421 Dec 31, 2022
Implementation of different ML Algorithms from scratch, written in Python 3.x

Implementation of different ML Algorithms from scratch, written in Python 3.x

Gautam J 393 Nov 29, 2022
A simple python program which predicts the success of a movie based on it's type, actor, actress and director

Movie-Success-Prediction A simple python program which predicts the success of a movie based on it's type, actor, actress and director. The program us

Mahalinga Prasad R N 1 Dec 17, 2021
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

Microsoft 14.5k Jan 07, 2023
Examples and code for the Practical Machine Learning workshop series

Practical Machine Learning Workshop Series Practical Machine Learning for Quantitative Finance Post conference workshop at the WBS Spring Conference D

CompatibL 21 Jun 25, 2022
Python module for machine learning time series:

seglearn Seglearn is a python package for machine learning time series or sequences. It provides an integrated pipeline for segmentation, feature extr

David Burns 536 Dec 29, 2022
A toolbox to iNNvestigate neural networks' predictions!

iNNvestigate neural networks! Table of contents Introduction Installation Usage and Examples More documentation Contributing Releases Introduction In

Maximilian Alber 1.1k Jan 05, 2023
LightGBM + Optuna: no brainer

AutoLGBM LightGBM + Optuna: no brainer auto train lightgbm directly from CSV files auto tune lightgbm using optuna auto serve best lightgbm model usin

Rishiraj Acharya 22 Dec 15, 2022
Land Cover Classification Random Forest

You can perform Land Cover Classification on Satellite Images using Random Forest and visualize the result using Earthpy package. Make sure to install the required packages and such as

Dr. Sander Ali Khowaja 1 Jan 21, 2022
Magenta: Music and Art Generation with Machine Intelligence

Magenta is a research project exploring the role of machine learning in the process of creating art and music. Primarily this involves developing new

Magenta 18.1k Dec 30, 2022
easyNeuron is a simple way to create powerful machine learning models, analyze data and research cutting-edge AI.

easyNeuron is a simple way to create powerful machine learning models, analyze data and research cutting-edge AI.

Neuron AI 5 Jun 18, 2022
UpliftML: A Python Package for Scalable Uplift Modeling

UpliftML is a Python package for scalable unconstrained and constrained uplift modeling from experimental data. To accommodate working with big data, the package uses PySpark and H2O models as base l

Booking.com 254 Dec 31, 2022
A Streamlit demo to interactively visualize Uber pickups in New York City

Streamlit Demo: Uber Pickups in New York City A Streamlit demo written in pure Python to interactively visualize Uber pickups in New York City. View t

Streamlit 230 Dec 28, 2022
Model search (MS) is a framework that implements AutoML algorithms for model architecture search at scale.

Model Search Model search (MS) is a framework that implements AutoML algorithms for model architecture search at scale. It aims to help researchers sp

AriesTriputranto 1 Dec 13, 2021
MiniTorch - a diy teaching library for machine learning engineers

This repo is the full student code for minitorch. It is designed as a single repo that can be completed part by part following the guide book. It uses

1.1k Jan 07, 2023