Compute distances between sequences. 30+ algorithms, pure Python implementation, common interface, optional use of external libraries.

Overview

TextDistance

TextDistance -- a Python library for comparing the distance between two or more sequences using many algorithms.

Features:

  • 30+ algorithms
  • Pure Python implementation
  • Simple usage
  • Comparison of more than two sequences at once
  • Some algorithms have more than one implementation in one class
  • Optional numpy usage for maximum speed

Algorithms

Edit based

| Algorithm | Class | Functions |
|-----------|-------|-----------|
| Hamming | Hamming | hamming |
| MLIPNS | Mlipns | mlipns |
| Levenshtein | Levenshtein | levenshtein |
| Damerau-Levenshtein | DamerauLevenshtein | damerau_levenshtein |
| Jaro-Winkler | JaroWinkler | jaro_winkler, jaro |
| Strcmp95 | StrCmp95 | strcmp95 |
| Needleman-Wunsch | NeedlemanWunsch | needleman_wunsch |
| Gotoh | Gotoh | gotoh |
| Smith-Waterman | SmithWaterman | smith_waterman |

Token based

| Algorithm | Class | Functions |
|-----------|-------|-----------|
| Jaccard index | Jaccard | jaccard |
| Sørensen–Dice coefficient | Sorensen | sorensen, sorensen_dice, dice |
| Tversky index | Tversky | tversky |
| Overlap coefficient | Overlap | overlap |
| Tanimoto distance | Tanimoto | tanimoto |
| Cosine similarity | Cosine | cosine |
| Monge-Elkan | MongeElkan | monge_elkan |
| Bag distance | Bag | bag |

Sequence based

| Algorithm | Class | Functions |
|-----------|-------|-----------|
| longest common subsequence similarity | LCSSeq | lcsseq |
| longest common substring similarity | LCSStr | lcsstr |
| Ratcliff-Obershelp similarity | RatcliffObershelp | ratcliff_obershelp |

Compression based

Normalized compression distance with different compression algorithms.

Classic compression algorithms:

| Algorithm | Class | Function |
|-----------|-------|----------|
| Arithmetic coding | ArithNCD | arith_ncd |
| RLE | RLENCD | rle_ncd |
| BWT RLE | BWTRLENCD | bwtrle_ncd |

Normal compression algorithms:

| Algorithm | Class | Function |
|-----------|-------|----------|
| Square Root | SqrtNCD | sqrt_ncd |
| Entropy | EntropyNCD | entropy_ncd |

Work-in-progress algorithms that compare two strings as arrays of bits:

| Algorithm | Class | Function |
|-----------|-------|----------|
| BZ2 | BZ2NCD | bz2_ncd |
| LZMA | LZMANCD | lzma_ncd |
| ZLib | ZLIBNCD | zlib_ncd |

See the blog post for more details about NCD.
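
For reference, NCD is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed size of s. A minimal sketch of calling one of the functions listed above (the exact return value depends on the compressor's overhead, especially on short strings):

import textdistance

# compresses both strings and their concatenation and compares sizes;
# values closer to 0 mean more similar
textdistance.zlib_ncd('text', 'test')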

Phonetic

Algorithm Class Functions
MRA MRA mra
Editex Editex editex

Simple

Algorithm Class Functions
Prefix similarity Prefix prefix
Postfix similarity Postfix postfix
Length distance Length length
Identity similarity Identity identity
Matrix similarity Matrix matrix

Installation

Stable

Pure Python implementation only:

pip install textdistance

With extra libraries for maximum speed:

pip install "textdistance[extras]"

With all libraries (required for benchmarking and testing):

pip install "textdistance[benchmark]"

With algorithm specific extras:

pip install "textdistance[Hamming]"

Algorithms with available extras: DamerauLevenshtein, Hamming, Jaro, JaroWinkler, Levenshtein.

Dev

Via pip:

pip install -e git+https://github.com/life4/textdistance.git#egg=textdistance

Or clone the repo and install with some extras:

git clone https://github.com/life4/textdistance.git
pip install -e ".[benchmark]"

Usage

All algorithms have 2 interfaces:

  1. A class with algorithm-specific params for customizing.
  2. A class instance with default params for quick and simple usage.

All algorithms have some common methods:

  1. .distance(*sequences) -- calculate the distance between sequences.
  2. .similarity(*sequences) -- calculate the similarity of sequences.
  3. .maximum(*sequences) -- the maximum possible value for distance and similarity. For any sequences: distance + similarity == maximum (checked in the snippet right after this list).
  4. .normalized_distance(*sequences) -- normalized distance between sequences. A float between 0 and 1, where 0 means the sequences are equal and 1 means they are totally different.
  5. .normalized_similarity(*sequences) -- normalized similarity of sequences. A float between 0 and 1, where 0 means totally different and 1 means equal.
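
A quick check of the distance + similarity == maximum invariant with the built-in Hamming instance (the values match the Examples section below):

import textdistance

h = textdistance.hamming
h.distance('test', 'text') + h.similarity('test', 'text') == h.maximum('test', 'text')
# True: 1 + 3 == 4, the length of the longest sequence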

Most common init arguments (both are shown in the sketch after this list):

  1. qval -- q-value for splitting sequences into q-grams. Possible values:
    • 1 (default) -- compare sequences by chars.
    • 2 or more -- transform sequences into q-grams.
    • None -- split sequences by words.
  2. as_set -- for token-based algorithms:
    • True -- 't' and 'ttt' are equal.
    • False (default) -- 't' and 'ttt' are different.
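
A minimal sketch of both arguments, using the token-based Jaccard algorithm listed above:

import textdistance

# qval=None tokenizes by words instead of characters
textdistance.Jaccard(qval=None).similarity('word one', 'word two')
# 0.333... -- one shared token out of three unique tokens

# as_set=True ignores how often a token occurs
textdistance.Jaccard(as_set=True).similarity('t', 'ttt')
# 1.0 -- both sequences reduce to the set {'t'}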

Examples

For example, Hamming distance:

import textdistance

textdistance.hamming('test', 'text')
# 1

textdistance.hamming.distance('test', 'text')
# 1

textdistance.hamming.similarity('test', 'text')
# 3

textdistance.hamming.normalized_distance('test', 'text')
# 0.25

textdistance.hamming.normalized_similarity('test', 'text')
# 0.75

textdistance.Hamming(qval=2).distance('test', 'text')
# 2

All other algorithms share the same interface.
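
And because textdistance works on arbitrary sequences, not only strings, lists work too. A quick sketch (standard results for these inputs):

import textdistance

# sequences of hashable items are compared just like strings
textdistance.levenshtein.distance(['a', 'b', 'c'], ['a', 'c'])
# 1 -- one deletion

textdistance.jaccard.similarity('test', 'text')
# 0.6 -- multiset intersection of size 3 over a union of size 5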

Articles

A few articles with examples of how to use textdistance in the real world:

Extra libraries

For the main algorithms, textdistance tries to call known external libraries (fastest first) if they are available (installed on your system) and applicable (the implementation can compare the given type of sequences). Install textdistance with extras to enable this feature.

You can disable this by passing the external=False argument on init:

import textdistance
hamming = textdistance.Hamming(external=False)
hamming('text', 'testit')
# 3

Supported libraries:

  1. abydos
  2. Distance
  3. jellyfish
  4. py_stringmatching
  5. pylev
  6. python-Levenshtein
  7. pyxDamerauLevenshtein

Algorithms with external implementations:

  1. DamerauLevenshtein
  2. Hamming
  3. Jaro
  4. JaroWinkler
  5. Levenshtein

Benchmarks

Without extras installation:

| algorithm | library | function | time |
|-----------|---------|----------|------|
| DamerauLevenshtein | jellyfish | damerau_levenshtein_distance | 0.00965294 |
| DamerauLevenshtein | pyxdameraulevenshtein | damerau_levenshtein_distance | 0.151378 |
| DamerauLevenshtein | pylev | damerau_levenshtein | 0.766461 |
| DamerauLevenshtein | textdistance | DamerauLevenshtein | 4.13463 |
| DamerauLevenshtein | abydos | damerau_levenshtein | 4.3831 |
| Hamming | Levenshtein | hamming | 0.0014428 |
| Hamming | jellyfish | hamming_distance | 0.00240262 |
| Hamming | distance | hamming | 0.036253 |
| Hamming | abydos | hamming | 0.0383933 |
| Hamming | textdistance | Hamming | 0.176781 |
| Jaro | Levenshtein | jaro | 0.00313561 |
| Jaro | jellyfish | jaro_distance | 0.0051885 |
| Jaro | py_stringmatching | jaro | 0.180628 |
| Jaro | textdistance | Jaro | 0.278917 |
| JaroWinkler | Levenshtein | jaro_winkler | 0.00319735 |
| JaroWinkler | jellyfish | jaro_winkler | 0.00540443 |
| JaroWinkler | textdistance | JaroWinkler | 0.289626 |
| Levenshtein | Levenshtein | distance | 0.00414404 |
| Levenshtein | jellyfish | levenshtein_distance | 0.00601647 |
| Levenshtein | py_stringmatching | levenshtein | 0.252901 |
| Levenshtein | pylev | levenshtein | 0.569182 |
| Levenshtein | distance | levenshtein | 1.15726 |
| Levenshtein | abydos | levenshtein | 3.68451 |
| Levenshtein | textdistance | Levenshtein | 8.63674 |

Total: 24 libs.

Yes, the pure Python implementation is that slow. Use TextDistance in production only with extras installed.

TextDistance uses these benchmark results for optimization: it tries to call the fastest external library first (when possible).

You can run benchmark manually on your system:

pip install "textdistance[benchmark]"
python3 -m textdistance.benchmark

TextDistance shows a benchmark results table for your system and saves library priorities into a libraries.json file in TextDistance's folder. This file is then used by textdistance to call the fastest algorithm implementation. A default libraries.json is already included in the package.

Running tests

You can run tests via dephell:

curl -L dephell.org/install | python3
dephell venv create --env=pytest-external
dephell deps install --env=pytest-external
dephell venv run --env=pytest-external

Contributing

PRs are welcome!

  • Found a bug? Fix it!
  • Want to add more algorithms? Sure! Just implement them with the same interface as the other algorithms in the lib and add some tests.
  • Can you make something faster? Great! Just avoid external dependencies and remember that everything should work not only with strings.
  • Something else you think is good? Do it! Just make sure that CI passes and everything from the README is still applicable (interface, features, and so on).
  • Have no time to code? Tell your friends and subscribers about textdistance. More users, more contributions, more amazing features.

Thank you ❤️

Comments
  • add support for rapidfuzz

    rapidfuzz implements the following algorithms:

    • Jaro/JaroWinkler (fastest by a large margin)
    • Hamming (slightly slower than python-Levenshtein)
    • Levenshtein (about as fast as python-Levenshtein for very short strings, and fastest for longer strings)

    Additionally, it supports any sequence of hashable types (e.g. lists of strings), not only text.
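
    (For illustration, a minimal call through rapidfuzz. The rapidfuzz.string_metric module referenced in the tables below has since moved to rapidfuzz.distance, so this assumes the newer layout:)

    from rapidfuzz.distance import Levenshtein

    # plain strings
    Levenshtein.distance('test', 'text')
    # 1

    # any sequence of hashable items works too
    Levenshtein.distance(['a', 'b', 'c'], ['a', 'c'])
    # 1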

    Here is the benchmark result:

    # Faster than textdistance:
    
    | algorithm          | library                 | function                     |        time |
    |--------------------+-------------------------+------------------------------+-------------|
    | DamerauLevenshtein | jellyfish               | damerau_levenshtein_distance | 0.0181046   |
    | DamerauLevenshtein | pyxdameraulevenshtein   | damerau_levenshtein_distance | 0.030925    |
    | Hamming            | Levenshtein             | hamming                      | 0.000351586 |
    | Hamming            | rapidfuzz.string_metric | hamming                      | 0.00040442  |
    | Hamming            | jellyfish               | hamming_distance             | 0.0143502   |
    | Jaro               | rapidfuzz.string_metric | jaro_similarity              | 0.000749048 |
    | Jaro               | jellyfish               | jaro_similarity              | 0.0152322   |
    | JaroWinkler        | rapidfuzz.string_metric | jaro_winkler_similarity      | 0.000776006 |
    | JaroWinkler        | jellyfish               | jaro_winkler_similarity      | 0.0157833   |
    | Levenshtein        | rapidfuzz.string_metric | levenshtein                  | 0.0010058   |
    | Levenshtein        | Levenshtein             | distance                     | 0.00103176  |
    | Levenshtein        | jellyfish               | levenshtein_distance         | 0.0147382   |
    | Levenshtein        | pylev                   | levenshtein                  | 0.14116     |
    Total: 13 libs.
    

    and the benchmark results when adding slightly longer strings:

    STMT = """
    func('text', 'test')
    func('qwer', 'asdf')
    func('a' * 15, 'b' * 15)
    func('a' * 30, 'b' * 30)
    """
    
    # Faster than textdistance:
    
    | algorithm          | library                 | function                     |        time |
    |--------------------+-------------------------+------------------------------+-------------|
    | DamerauLevenshtein | jellyfish               | damerau_levenshtein_distance | 0.0323887   |
    | DamerauLevenshtein | pyxdameraulevenshtein   | damerau_levenshtein_distance | 0.143235    |
    | Hamming            | Levenshtein             | hamming                      | 0.000489837 |
    | Hamming            | rapidfuzz.string_metric | hamming                      | 0.000517879 |
    | Hamming            | jellyfish               | hamming_distance             | 0.0182341   |
    | Jaro               | rapidfuzz.string_metric | jaro_similarity              | 0.00111363  |
    | Jaro               | jellyfish               | jaro_similarity              | 0.0201971   |
    | JaroWinkler        | rapidfuzz.string_metric | jaro_winkler_similarity      | 0.00105238  |
    | JaroWinkler        | jellyfish               | jaro_winkler_similarity      | 0.0206678   |
    | Levenshtein        | rapidfuzz.string_metric | levenshtein                  | 0.00138601  |
    | Levenshtein        | Levenshtein             | distance                     | 0.0034889   |
    | Levenshtein        | jellyfish               | levenshtein_distance         | 0.0232467   |
    | Levenshtein        | pylev                   | levenshtein                  | 0.599603    |
    Total: 13 libs.
    
    opened by maxbachmann 13
  • Add new DamerauLevenshtein... classes

    There are two versions of the Damerau-Levenshtein distance, as described in this Debian bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1018933. Some of the external libraries implement one of them, others the other.

    This PR introduces two different classes, DamerauLevenshteinRestricted and DamerauLevenshteinUnrestricted, with DamerauLevenshtein being the unrestricted version, so that it is clear which one is intended.
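
    (A sketch of the difference on the classic distinguishing pair 'ca' / 'abc', assuming the classes proposed in this PR; the values are the textbook results for the two variants:)

    import textdistance

    # restricted (optimal string alignment): no substring is edited twice,
    # so the transposition can't be combined with a later insertion
    textdistance.DamerauLevenshteinRestricted()('ca', 'abc')
    # 3

    # unrestricted: ca -> ac (transposition) -> abc (insertion)
    textdistance.DamerauLevenshteinUnrestricted()('ca', 'abc')
    # 2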

    opened by juliangilbey 7
  • Ignore inconsistent timings on some comparison tests

    Two particular tests have timings that differ wildly between successive runs on arm64 architectures. This might be because some libraries take a long time to load or something like that - I don't know. But this patch turns off hypothesis's timing checks for these two tests. I'm going to apply it to Debian's package; you might or might not want to apply it upstream.

    opened by juliangilbey 5
  • Modify JaroWinkler boosting to match behaviour of jellyfish algorithm

    Jellyfish has recently modified its JaroWinkler algorithm to allow for boosting even when one of the strings is shorter than 4 characters: https://github.com/jamesturk/jellyfish/commit/87f9679910eba0dad6a1f6019f03cbdffba28392. It is very unclear whether this is a good idea or not. But as it is, the tests now fail, as the internal and external algorithms give different results on a pair of strings such as ":" and ":0".

    This patch replicates the change that jellyfish has made, which will then allow the external tests to pass once again. It also modifies the expected value of the comparison "fog" and "frog" to match this new algorithm behaviour.

    If you do not wish to apply this patch, then the external tests will need modifying to exclude the case where either of the strings has length < 4.

    hacktoberfest-accepted 
    opened by juliangilbey 5
  • Possible correction to Monge-Elkan calculation

    I might be wrong about this, but I think the code for the Monge-Elkan algorithm needs to be corrected.

    If you look at the implementation in the py_stringmatching library (line 81 of https://github.com/anhaidgroup/py_stringmatching/blob/master/py_stringmatching/similarity_measure/monge_elkan.py), the score is computed as sim = float(sum_of_maxes) / float(len(bag1)), which is essentially the mean of the per-token maxima.

    But in the textdistance implementation, the score is given on line 222 of https://github.com/life4/textdistance/blob/master/textdistance/algorithms/token_based.py as sum(maxes) / len(seq) / len(maxes).

    I think the further division by len(maxes) isn't needed, and the line should just be sum(maxes) / len(seq).

    The change in the code could mess up tests elsewhere, so I'm not changing anything else. But thought I should bring this to your attention.

    Below is some code and differing scores I got in textdistance and py_stringmatching.

    # score in textdistance
    from textdistance import MongeElkan, levenshtein
    ALG = MongeElkan
    score = ALG(algorithm=levenshtein,qval=None,symmetric=False).similarity('Good Times!', "The Good Times and The Bad Ones")
    score
    # Got 2.25
    
    # score in py_stringmatching
    from py_stringmatching import MongeElkan
    from py_stringmatching import Levenshtein as Levenshtein_2
    ALG_2 = MongeElkan(sim_func=Levenshtein_2().get_raw_score)
    source = 'Good Times!'
    source_split = source.split()
    target = "The Good Times and The Bad Ones"
    target_split = target.split()
    score2 = ALG_2.get_raw_score(source_split, target_split)
    score2
    # got 5.5
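
    (For reference, a minimal sketch of the mean-max formulation argued for above; this is a hypothetical helper, not the library's code:)

    def monge_elkan(seq1, seq2, sim):
        # mean over seq1 of the best match in seq2
        return sum(max(sim(a, b) for b in seq2) for a in seq1) / len(seq1)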
    
    opened by shijithpk 3
  • Handle newer versions of abydos and jellyfish

    abydos has changed its interface for distance metrics quite significantly, and jellyfish has changed the names of the functions. This patch addresses both of these issues.

    opened by juliangilbey 3
  • Ensure that maximum normalised distance is <= 1 and ...

    textdistance is currently failing its test suite on arm64 machines with Python 3.10, which is causing me problems on Debian. I have managed to track down the first of these bugs (and there are at least two more to come): some algorithms call upper() before comparing the strings. As the code already notes, these algorithms were designed for English (ASCII only), and upper() can change the length of a string containing non-English characters -- which hypothesis does when testing. This can result in the normalised distance being greater than 1. This patch addresses that by ensuring that the distance returned from the relevant algorithms is no greater than self.maximum().

    A second issue which arose when doing this was calculating the maximum distance for Editex(); the current function for calculating the maximum does not give the correct answer if match_cost > mismatch_cost, for example. But this would be a silly situation: why would we penalise matching characters more than mismatching ones? There are two ways of resolving this: the first is to calculate the maximum distance using max(match_cost, group_cost, mismatch_cost), the second is to force the inequalities match_cost <= group_cost <= mismatch_cost. I have gone for the latter option in this patch.
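
    (A sketch of the latter option -- a hypothetical init-time check using the standard Editex costs, not the actual patch:)

    class Editex:
        def __init__(self, match_cost=0, group_cost=1, mismatch_cost=2):
            # force the sensible ordering so maximum() stays correct
            assert match_cost <= group_cost <= mismatch_cost
            self.match_cost = match_cost
            self.group_cost = group_cost
            self.mismatch_cost = mismatch_cost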

    All being well, there will be more patches to come in the next few weeks as I get to the bottom of them!

    opened by juliangilbey 2
  • update rapidfuzz

    This updates rapidfuzz to the latest version, which provides a Damerau-Levenshtein implementation. It is the fastest of the supported libraries:

    | algorithm          | library                               | function                     |        time |
    |--------------------+---------------------------------------+------------------------------+-------------|
    | DamerauLevenshtein | rapidfuzz.distance.DamerauLevenshtein | distance                     | 0.00267046  |
    | DamerauLevenshtein | jellyfish                             | damerau_levenshtein_distance | 0.022479    |
    | DamerauLevenshtein | pyxdameraulevenshtein                 | damerau_levenshtein_distance | 0.0393475   |
    | DamerauLevenshtein | **textdistance**                      | DamerauLevenshtein           | 0.589098    |
    

    In addition, it is the only implementation that requires only linear memory.

    opened by maxbachmann 1
  • Fix numpy types warnings

    Basic types have been deprecated in numpy 1.20. Here are the full warnings:

    DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
      Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    
    DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
      Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    

    I don't know the code well enough to assess whether the specific numpy types are required, though.
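
    (The fix is mechanical -- a sketch of the kind of replacement the warnings ask for:)

    import numpy as np

    # before (deprecated): np.zeros((3, 3), dtype=np.int)
    # after: use the builtin, or an explicit precision
    a = np.zeros((3, 3), dtype=int)
    b = np.zeros((3, 3), dtype=np.float64)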

    opened by ArchangeGabriel 1
  • Fix a setuptools warning

    UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead

    opened by ArchangeGabriel 1
  • Fix README links

    Hi,

    I noticed that the Travis CI link was wrong, and then found a few more links that appear to reference an old repository.

    This PR corrects those links by replacing orsinium with life4 in some URLs.

    And thanks for the great project, Bruno

    opened by kinow 1