Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
Comments
  • Add support for pretrained encoders and transformed data

    Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit where is my phoone and you get similarities you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

    image

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
GTK3-based panel for sway window manager

nwg-panel I have been using sway since 2019 and find it the most comfortable working environment, but... Have you ever missed all the graphical bells

Piotr Miller 290 Jan 07, 2023
D-Ticket is a discord bot for ticket system

D-Ticket Discord Bot D-Ticket is a discord bot for ticket management system. This is not final product is currently being in development stay connecte

DeViL 1 Jan 06, 2022
A liblary whre you can find helpful functions for your discord bot

DBotUtils A liblary whre you can find helpful functions for your discord bot Easy setup Setup is easily and flexible. Change anytime. After setup just

Kondek286 1 Nov 02, 2021
"zpool iostats" for humans; find the slow parts of your ZFS pool

Getting the gist of zfs statistics vpool-demo.mp4 The ZFS command "zpool iostat" provides a histogram listing of how often it takes to do things in pa

Chad 57 Oct 24, 2022
Flask extension that provides integration with Azure Storage

Flask-Azure-Storage A Flask extension that provides integration with Azure Storage Table of Contents Flask-Azure-Storage Install Usage Examples Create

Alejo Arias 17 Nov 14, 2021
An instagram bot developed in Python with Selenium that helps you get more Instagram followers.

instabot An instagram bot developed in Python with Selenium that helps you get more Instagram followers. Install You’ll need to have: Python Selenium

65 Nov 22, 2022
Social Framework

Social Int Framework Social Int Framework its a Selenium script that scrape the IG photos and do a Reverse search on google and yandex for finding ano

29 Dec 06, 2022
Telegram bot to download tiktok video/audio

TikTokDL (Bot) Telegram RoBot to Download Tiktok video/audio. Features: 👉 Download TikTok Video without Watermark 👉 Download TikTok Video with Water

X-Noid 23 Nov 21, 2022
A small and fun Discord Bot that is written in Python and discord-interactions (with discord.py)

Articuno (discord-interactions) A small and fun Discord Bot that is written in Python and discord-interactions (with discord.py) Get started If you wa

Blue 8 Dec 26, 2022
EZPZ-PGP: This is a simple and easy to use PGP tool.

EZPZ-PGP This is a simple and easy to use PGP tool. Features [X] Create new PGP Keypairs, able to choose between 4096 and 8192 bit keys.\n [X] Import

6 Dec 30, 2022
Python SDK for LUSID by FINBOURNE, a bi-temporal investment management data platform with portfolio accounting capabilities.

LUSID® Python SDK This is the Python SDK for LUSID by FINBOURNE, a bi-temporal investment management data platform with portfolio accounting capabilit

FINBOURNE 6 Dec 24, 2022
ThetaGang is an IBKR bot for collecting money

💬 Join the Matrix chat, we can get money together. Θ ThetaGang Θ Beat the capitalists at their own game with ThetaGang 📈 ThetaGang is an IBKR tradin

Brenden Matthews 1.5k Jan 08, 2023
Generates a coverage badge using coverage.py and the shields.io service.

Welcome to README Coverage Badger 👋 Generates a coverage badge using coverage.py and the shields.io service. Your README file is then updated with th

Victor Miti 10 Dec 06, 2022
Discord group chat spammer concept.

GC Spammer [Concept] GC-Spammer for https://discord.com/ Warning: This is purely a concept. In the past the script worked, however, Discord ratelimite

Roover 3 Feb 28, 2022
A Pancakeswap v2 trading client (and bot) with limit orders, stop-loss, custom gas strategies, a GUI and much more.

Pancakeswap v2 trading client A Pancakeswap trading client (and bot) with limit orders, stop-loss, custom gas strategies, a GUI and much more. If you

571 Mar 15, 2022
A Simple Telegram Inline Torrent Search Bot by @infotechIT

Torrent-Search-RoBot A Simple Telegram Inline Torrent Search Bot by @infotechIT. Torrent API Using api.infotech.wtf API Host Bot Deploy to Heroku Clic

InfoTech 0 May 05, 2022
Quadrirrotor UFABC - ROS/Gazebo

QuadROS_UFABC - Repositório utilizado durante minha dissertação de mestrado para simular sistemas de controle e estimação para navegação de um quadrirrotor utilizando visão computacional.

Mateus Ribeiro 1 Dec 13, 2022
Telegram Group Management Bot based on Pyrogram

Komi-San Telegram Group Management Bot based on Pyrogram More updates coming soon Support Group Open a Pull request if you wana contribute Example for

33 Nov 07, 2022
Ridogram is an advanced multi-featured Telegram UserBot.

Ridogram Ridogram is an advanced multi-featured Telegram UserBot. String Session Collect String Session by running python3 stringsession.py locally or

Md. Ridwanul Islam Muntakim 134 Dec 29, 2022
Скрипт, позволяющий импортировать плейлисты из Spotify, а также обычные треклисты в VK музыку.

vk-music-import Программа для переноса плейлистов из Spotify и текстовых треклистов в VK Музыку. Преимущества: Позволяет быстро импортировать плейлист

Mew Forest 32 Nov 23, 2022