A fast python implementation of the SimHash algorithm.

Overview

FLoC SimHash

This Python package provides hashing algorithms for computing cohort ids of users based on their browsing history. As such, it may be used to compute cohort ids of users following Google's Federated Learning of Cohorts (FLoC) proposal.

The FLoC proposal is an important part of The Privacy Sandbox, which is Google's replacement for third-party cookies. FLoC will enable interest-based advertising, thus preserving an important source of monetization for today's web.

The main idea, as outlined in the FLoC whitepaper, is to replace user cookie ids, which enable user-targeting across multiple sites, by cohort ids. A cohort would consist of a set of users sharing similar browsing behaviour. By targeting a given cohort, advertisers can ensure that relevant ads are shown while user privacy is preserved by a hiding in the pack mechanism.

The FLoC whitepaper mentions several mechanisms to map users to cohorts, with varying amounts of centralized information. The algorithms currently being implemented in Google Chrome as a POC are methods based on SimHash, which is a type of locality-sensitive hashing initially introduced for detecting near-duplicate documents.

Contents

Installation

The floc-simhash package is available at PyPI. Install using pip as follows.

pip install floc-simhash

The package requires python>=3.7 and will install scikit-learn as a dependency.

Usage

The package provides two main classes.

  • SimHash, applying the SimHash algorithm on the md5 hashes of tokens in the given document.

  • SimHashTransformer, applying the SimHash algorithm to a document vectorization as part of a scikit-learn pipeline

Finally, there is a third class available:

  • SortingSimHash, which performs the SortingLSH algorithm by first applying SimHash and then clipping the resulting hashes to a given precision.

Individual document-based SimHash

The SimHash class provides a way to calculate the SimHash of any given document, without using any information coming from other documents.

In this case, the document hash is computed by looking at md5 hashes of individual tokens. We use:

  • The implementation of the md5 hashing algorithm available in the hashlib module in the Python standard library.

  • Bitwise arithmetic for fast computations of the document hash from the individual hashed tokens.

The program below, for example, will print the following hexadecimal string: cf48b038108e698418650807001800c5.

from floc_simhash import SimHash

document = "Lorem ipsum dolor sit amet consectetur adipiscing elit"
hashed_document = SimHash(n_bits=128).hash(document)

print(hashed_document)

An example more related to computing cohort ids: the following program computes the cohort id of a user by applying SimHash to the document formed by the pipe-separated list of domains in the user browsing history.

from floc_simhash import SimHash

document = "google.com|hybridtheory.com|youtube.com|reddit.com"
hasher = SimHash(n_bits=128, tokenizer=lambda x: x.split("|"))
hashed_document = hasher.hash(document)

print(hashed_document)

The code above will print the hexadecimal string: 14dd1064800880b40025764cd0014715.

Providing your own tokenizer

The SimHash constructor will split the given document according to white space by default. However, it is possible to pass any callable that parses a string into a list of strings in the tokenizer parameter. We have provided an example above where we pass tokenizer=lambda x: x.split("|").

A good example of a more complex tokenization could be passing the word tokenizer in NLTK. This would be a nice choice if we wished to compute hashes of text documents.

Using the SimHashTransformer in scikit-learn pipelines

The approach to SimHash outlined in the FLoC Whitepaper consists of choosing random unit vectors and working on already vectorized data.

The choice of a random unit vector is equivalent to choosing a random hyperplane in feature space. Choosing p random hyperplanes partitions the feature space into 2^p regions. Then, a p-bit SimHash of a vector encodes the region to which it belongs.

It is reasonable to expect similar documents to have the same hash, provided the vectorization respects the given notion of similarity.

Two vectorizations are discussed in the aforementioned whitepaper: one-hot and tf-idf; they are available in scikit-learn.

The SimHashTransformer supplies a transformer (implementing the fit and transform methods) that can be used directly on the output of any of these two vectorizers in order to obtain hashes.

For example, given a 1d-array X containing strings, each of them corresponding to a concatenation of the domains visited by a given user and separated by "|", the following code will store in y the cohort id of each user, using one-hot encoding and a 32-bit SimHash.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

from floc_simhash import SimHashTransformer


X = [
    "google.com|hybridtheory.com|youtube.com|reddit.com",
    "google.com|youtube.com|reddit.com",
    "github.com",
    "google.com|github.com",
]

one_hot_simhash = Pipeline(
    [
        ("vect", CountVectorizer(tokenizer=lambda x: x.split("|"), binary=True)),
        ("simhash", SimHashTransformer(n_bits=32)),
    ]
)

y = one_hot_simhash.fit_transform(X)

After running this code, the value of y would look similar to the following (expect same lengths; actual hash values depend on the choice of random vectors during fit):

['0xd98c7e93' '0xd10b79b3' '0x1085154d' '0x59cd150d']

Caveats

  • The implementation works on the sparse matrices output by CountVectorizer and TfidfTransformer, in order to manage memory efficiently.

  • At the moment, the choice of precision in the numpy arrays results in overflow errors for p >= 64. While we are waiting for implementation details of the FLoC POCs, the first indications hint at choices around p = 50.

Development

This project uses poetry for managing dependencies.

In order to clone the repository and run the unit tests, execute the following steps on an environment with python>=3.7.

git clone https://github.com/hybridtheory/floc-simhash.git
cd floc-simhash
poetry install
pytest

The unit tests are property-based, using the hypothesis library. This allows for algorithm veritication against hundreds or thousands of random generated inputs.

Since running many examples may lengthen the test suite runtime, we also use pytest-xdist in order to parallelize the tests. For example, the following call will run up to 1000 examples for each test with parallelism 4.

pytest -n 4 --hypothesis-profile=ci
Owner
Hybrid Theory
(formerly Affectv)
Hybrid Theory
This is an Airport Scheduling Time table implemented using Genetic Algorithm

This is an Airport Scheduling Time table implemented using Genetic Algorithm In this The scheduling is performed on the basisi of that no two Air planes are arriving or departing at the same runway a

1 Jan 06, 2022
This is an implementation of the QuickHull algorithm in Python. I

QuickHull This is an implementation of the QuickHull algorithm in Python. It randomly generates a set of points and finds the convex hull of this set

Anant Joshi 4 Dec 04, 2022
Python Package for Reflection Ultrasound Computed Tomography (RUCT) Delay And Sum (DAS) Algorithm

pyruct Python Package for Reflection Ultrasound Computed Tomography (RUCT) Delay And Sum (DAS) Algorithm The imaging setup is explained in these paper

Berkan Lafci 21 Dec 12, 2022
causal-learn: Causal Discovery for Python

causal-learn: Causal Discovery for Python Causal-learn is a python package for causal discovery that implements both classical and state-of-the-art ca

589 Dec 29, 2022
This is a Python implementation of the HMRF algorithm on networks with categorial variables.

Salad Salad is an Open Source Python library to segment tissues into different biologically relevant regions based on Hidden Markov Random Fields. The

1 Nov 16, 2021
This repository explores an implementation of Grover's Algorithm for knights on a chessboard.

Grover Knights Welcome to my Knights project! Project Description: I explore an implementation of a quantum oracle for knights on a chessboard.

Will Sun 8 Feb 22, 2022
Implementation of an ordered dithering algorithm used in computer graphics

Ordered Dithering Project In this project, we use an ordered dithering method to turn an RGB image, first to a gray scale image and then, turn the gra

1 Oct 26, 2021
Algorithms written in different programming languages

Data Structures and Algorithms Clean example implementations of data structures and algorithms written in different languages. List of implementations

Zoran Pandovski 1.3k Jan 03, 2023
Path finding algorithm visualizer with python

path-finding-algorithm-visualizer ~ click on the grid to place the starting block and then click elsewhere to add the end block ~ click again to place

izumi 1 Oct 31, 2021
🧬 Training the car to do self-parking using a genetic algorithm

🧬 Training the car to do self-parking using a genetic algorithm

Oleksii Trekhleb 652 Jan 03, 2023
This is a demo for AAD algorithm.

Asynchronous-Anisotropic-Diffusion-Algorithm This is a demo for AAD algorithm. The subroutine of the anisotropic diffusion algorithm is modified from

3 Mar 21, 2022
Genetic Algorithm for Robby Robot based on Complexity a Guided Tour by Melanie Mitchell

Robby Robot Genetic Algorithm A Genetic Algorithm based Robby the Robot in Chapter 9 of Melanie Mitchell's book Complexity: A Guided Tour Description

Matthew 2 Dec 01, 2022
Nature-inspired algorithms are a very popular tool for solving optimization problems.

Nature-inspired algorithms are a very popular tool for solving optimization problems. Numerous variants of nature-inspired algorithms have been develo

NiaOrg 215 Dec 28, 2022
A collection of design patterns/idioms in Python

python-patterns A collection of design patterns and idioms in Python. Current Patterns Creational Patterns: Pattern Description abstract_factory use a

Sakis Kasampalis 36.2k Jan 05, 2023
HashDB is a community-sourced library of hashing algorithms used in malware.

HashDB HashDB is a community-sourced library of hashing algorithms used in malware. How To Use HashDB HashDB can be used as a stand alone hashing libr

OALabs 216 Jan 06, 2023
A Python Package for Portfolio Optimization using the Critical Line Algorithm

A Python Package for Portfolio Optimization using the Critical Line Algorithm

19 Oct 11, 2022
A litle algorithm that i made for transform a picture in a spreadsheet.

PicsToSheets How it works? It is an algorithm designed to transform an image into a spreadsheet file. this converts image pixels to color cells of she

Guilherme de Oliveira 1 Nov 12, 2021
PickMush - A mini study/project on boosting algorithm

PickMush A mini project implementing Boosting Author Shashwat Vaibhav What does it do? Classifies whether Mushroom is edible or is non-edible (binary

Shashwat Vaibahav 3 Nov 08, 2022
Pathfinding visualizer in pygame: A*

Pathfinding Visualizer A* What is this A* algorithm ? Simply put, it is an algorithm that aims to find the shortest possible path between two location

0 Feb 26, 2022
Exact algorithm for computing two-sided statistical tolerance intervals under a normal distribution assumption using Python.

norm-tol-int Exact algorithm for computing two-sided statistical tolerance intervals under a normal distribution assumption using Python. Methods The

Jed Ludlow 1 Jan 06, 2022