🎐 a python library for doing approximate and phonetic matching of strings.

Last update: Dec 21, 2022

Overview

jellyfish

Jellyfish is a Python library for approximate and phonetic matching of strings.

Written by James Turk <[email protected]> and Michael Stephens.

See https://github.com/jamesturk/jellyfish/graphs/contributors for contributors.

See http://jellyfish.readthedocs.io for documentation.

Source is available at http://github.com/jamesturk/jellyfish.

Jellyfish >= 0.7 supports only Python 3. If you need Python 2, please use 0.6.x.

Included Algorithms

String comparison:

Levenshtein Distance
Damerau-Levenshtein Distance
Jaro Distance
Jaro-Winkler Distance
Match Rating Approach Comparison
Hamming Distance
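
To make the idea of an edit distance concrete, here is a minimal pure-Python sketch of Levenshtein distance: the number of single-character insertions, deletions, and substitutions needed to turn one string into another. The function name `levenshtein_sketch` is hypothetical and this is not Jellyfish's implementation; it only illustrates what the measure counts.

```python
def levenshtein_sketch(a: str, b: str) -> int:
    """Illustrative Levenshtein distance (hypothetical helper, not Jellyfish code)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # delete char_a
                    current[j - 1] + 1,                    # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(levenshtein_sketch("jellyfish", "smellyfish"))
```

Only two rows of the dynamic-programming table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.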

Phonetic encoding:

American Soundex
Metaphone
NYSIIS (New York State Identification and Intelligence System)
Match Rating Codex
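
A phonetic encoding maps words that sound alike to the same short key. As an illustration, the sketch below follows the commonly documented American Soundex rules: keep the first letter, map consonants to digits, collapse repeated codes, let H and W pass through without separating equal codes, and pad to four characters. The name `soundex_sketch` is hypothetical, only ASCII letters are handled, and Jellyfish's own implementation may treat edge cases differently.

```python
# Consonant groups and their Soundex digits; vowels, H, W, and Y have no code.
_SOUNDEX_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex_sketch(name: str) -> str:
    """Illustrative American Soundex (hypothetical helper, not Jellyfish code)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]
    previous_code = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != previous_code:
            result.append(code)
            if len(result) == 4:
                break
        previous_code = code  # a vowel resets this to "", allowing a repeat
    return "".join(result).ljust(4, "0")


print(soundex_sketch("Jellyfish"))
```

Because the key ignores vowels and groups similar consonants, different spellings of a similar-sounding name (for example "Robert" and "Rupert") end up with the same code.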

Example Usage

>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2
>>> jellyfish.jaro_distance('jellyfish', 'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
1

>>> jellyfish.metaphone('Jellyfish')
'JLFX'
>>> jellyfish.soundex('Jellyfish')
'J412'
>>> jellyfish.nysiis('Jellyfish')
'JALYF'
>>> jellyfish.match_rating_codex('Jellyfish')
'JLLFSH'
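
The Jaro result in the example is the least obvious of the numbers above, so here is a rough pure-Python sketch of the standard Jaro formula showing where a value near 0.896 comes from. The function name `jaro_sketch` is hypothetical and the code is an assumption-laden illustration, not Jellyfish's implementation.

```python
def jaro_sketch(a: str, b: str) -> float:
    """Illustrative Jaro similarity (hypothetical helper, not Jellyfish code)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        start = max(0, i - window)
        stop = min(len(b), i + window + 1)
        for j in range(start, stop):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    a_chars = [ch for ch, used in zip(a, a_matched) if used]
    b_chars = [ch for ch, used in zip(b, b_matched) if used]
    transpositions = sum(x != y for x, y in zip(a_chars, b_chars)) // 2
    return (
        matches / len(a)
        + matches / len(b)
        + (matches - transpositions) / matches
    ) / 3


print(round(jaro_sketch("jellyfish", "smellyfish"), 4))
```

For 'jellyfish' and 'smellyfish', eight characters match with no transpositions, so the score is (8/9 + 8/10 + 1) / 3, which is approximately 0.8963.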

Running Tests

If you are interested in contributing to Jellyfish, you may want to run the tests locally. Jellyfish uses tox to run its tests, which you can set up and run as follows:

pip install tox
# cd jellyfish/
tox
