👄 The most accurate natural language detection library for Python, suitable for long and short text alike

Overview

Lingua Logo

1. What does this library do?

Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases, for instance, might include routing e-mails to the right geographically located customer service department, based on the e-mails' languages.

2. Why does this library exist?

Language detection is often done as part of large machine learning frameworks or natural language processing applications. In cases where you don't need the full-fledged functionality of those systems or don't want to learn the ropes of those, a small flexible library comes in handy.

Python is widely used in natural language processing, so there are several comprehensive open source libraries for this task, such as Google's CLD 2 and CLD 3, langid and langdetect. Unfortunately, except for the last one, they have two major drawbacks:

  1. Detection only works with quite lengthy text fragments. For very short text snippets such as Twitter messages, they do not provide adequate results.
  2. The more languages take part in the decision process, the less accurate the detection results become.

Lingua aims at eliminating these problems. She needs hardly any configuration and yields pretty accurate results on both long and short text, even on single words and phrases. She draws on both rule-based and statistical methods but does not use any dictionaries of words. She does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.

3. Which languages are supported?

Compared to other language detection libraries, Lingua's focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, the following 75 languages are supported:

  • A
    • Afrikaans
    • Albanian
    • Arabic
    • Armenian
    • Azerbaijani
  • B
    • Basque
    • Belarusian
    • Bengali
    • Norwegian Bokmal
    • Bosnian
    • Bulgarian
  • C
    • Catalan
    • Chinese
    • Croatian
    • Czech
  • D
    • Danish
    • Dutch
  • E
    • English
    • Esperanto
    • Estonian
  • F
    • Finnish
    • French
  • G
    • Ganda
    • Georgian
    • German
    • Greek
    • Gujarati
  • H
    • Hebrew
    • Hindi
    • Hungarian
  • I
    • Icelandic
    • Indonesian
    • Irish
    • Italian
  • J
    • Japanese
  • K
    • Kazakh
    • Korean
  • L
    • Latin
    • Latvian
    • Lithuanian
  • M
    • Macedonian
    • Malay
    • Maori
    • Marathi
    • Mongolian
  • N
    • Norwegian Nynorsk
  • P
    • Persian
    • Polish
    • Portuguese
    • Punjabi
  • R
    • Romanian
    • Russian
  • S
    • Serbian
    • Shona
    • Slovak
    • Slovene
    • Somali
    • Sotho
    • Spanish
    • Swahili
    • Swedish
  • T
    • Tagalog
    • Tamil
    • Telugu
    • Thai
    • Tsonga
    • Tswana
    • Turkish
  • U
    • Ukrainian
    • Urdu
  • V
    • Vietnamese
  • W
    • Welsh
  • X
    • Xhosa
  • Y
    • Yoruba
  • Z
    • Zulu

4. How good is it?

Lingua is able to report accuracy statistics for some bundled test data available for each supported language. The test data for each language is split into three parts:

  1. a list of single words with a minimum length of 5 characters
  2. a list of word pairs with a minimum length of 10 characters
  3. a list of complete grammatical sentences of various lengths

Both the language models and the test data have been created from separate documents of the Wortschatz corpora offered by Leipzig University, Germany. Data crawled from various news websites have been used for training, each corpus comprising one million sentences. For testing, corpora made of arbitrarily chosen websites have been used, each comprising ten thousand sentences. From each test corpus, a random unsorted subset of 1000 single words, 1000 word pairs and 1000 sentences has been extracted, respectively.

Given the generated test data, I have compared the detection results of Lingua, langdetect, langid, CLD 2 and CLD 3 running over the data of Lingua's supported 75 languages. Languages that are not supported by the other detectors are simply ignored for them during the detection process.

The box plots below illustrate the distributions of the accuracy values for each classifier. The boxes themselves represent the areas which the middle 50 % of data lie within. Within the colored boxes, the horizontal lines mark the median of the distributions. All these plots demonstrate that Lingua clearly outperforms its contenders. Bar plots for each language can be found in the file ACCURACY_PLOTS.md. Detailed statistics including mean, median and standard deviation values for each language and classifier are available in the file ACCURACY_TABLE.md.

4.1 Single word detection


Single Word Detection Performance



4.2 Word pair detection


Word Pair Detection Performance



4.3 Sentence detection


Sentence Detection Performance



4.4 Average detection


Average Detection Performance



5. Why is it better than other libraries?

Every language detector uses a probabilistic n-gram model trained on the character distribution in some training corpus. Most libraries only use n-grams of size 3 (trigrams), which is satisfactory for detecting the language of longer text fragments consisting of multiple sentences. For short phrases or single words, however, trigrams are not enough. The shorter the input text is, the fewer n-grams are available. The probabilities estimated from so few n-grams are not reliable. This is why Lingua makes use of n-grams of sizes 1 up to 5, which results in much more accurate prediction of the correct language.
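
The following minimal sketch (not Lingua's internal implementation) shows why this matters: a three-letter word yields several unigrams and bigrams but only a single trigram and no higher-order n-grams at all.

def ngrams(text: str, n: int) -> list[str]:
    # All contiguous character sequences of length n in the input.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

for n in range(1, 6):
    print(n, ngrams("cat", n))
# 1 ['c', 'a', 't']
# 2 ['ca', 'at']
# 3 ['cat']
# 4 []
# 5 []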

A second important difference is that Lingua does not only use such a statistical model but also a rule-based engine. This engine first determines the alphabet of the input text and searches for characters which are unique to one or more languages. If exactly one language can be reliably chosen this way, the statistical model is not necessary anymore. In any case, the rule-based engine filters out languages that do not satisfy the conditions of the input text. Only then, in a second step, the probabilistic n-gram model is taken into consideration. This makes sense because loading fewer language models means less memory consumption and better runtime performance.
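
The following simplified sketch illustrates the idea behind this filtering step. The character-to-language mapping shown here is purely illustrative and not Lingua's actual data.

# Illustrative only: characters that occur in few languages, mapped to those languages.
UNIQUE_CHARACTERS = {
    "ß": {"GERMAN"},
    "ñ": {"SPANISH"},
    "ł": {"POLISH"},
}

def filter_candidates(text: str, candidates: set[str]) -> set[str]:
    # Keep only languages compatible with every special character found in the text.
    for char in text:
        if char in UNIQUE_CHARACTERS:
            candidates = candidates & UNIQUE_CHARACTERS[char]
    return candidates

print(filter_candidates("straße", {"GERMAN", "ENGLISH", "SPANISH"}))
# {'GERMAN'}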

In general, it is always a good idea to restrict the set of languages to be considered in the classification process by using the respective API methods. If you know beforehand that certain languages are never going to occur in an input text, do not let those take part in the classification process. The filtering mechanism of the rule-based engine is quite good, but filtering based on your own knowledge of the input text is always preferable.

6. Test report generation

If you want to reproduce the accuracy results above, you can generate the test reports yourself for all classifiers and languages by executing:

poetry install --extras "langdetect langid gcld3 pycld2"
poetry run python3 scripts/accuracy_reporter.py

For each detector and language, a test report file is then written into /accuracy-reports. As an example, here is the current output of the Lingua German report:

##### German #####

>>> Accuracy on average: 89.27%

>> Detection of 1000 single words (average length: 9 chars)
Accuracy: 74.20%
Erroneously classified as Dutch: 2.30%, Danish: 2.20%, English: 2.20%, Latin: 1.80%, Bokmal: 1.60%, Italian: 1.30%, Basque: 1.20%, Esperanto: 1.20%, French: 1.20%, Swedish: 0.90%, Afrikaans: 0.70%, Finnish: 0.60%, Nynorsk: 0.60%, Portuguese: 0.60%, Yoruba: 0.60%, Sotho: 0.50%, Tsonga: 0.50%, Welsh: 0.50%, Estonian: 0.40%, Irish: 0.40%, Polish: 0.40%, Spanish: 0.40%, Tswana: 0.40%, Albanian: 0.30%, Icelandic: 0.30%, Tagalog: 0.30%, Bosnian: 0.20%, Catalan: 0.20%, Croatian: 0.20%, Indonesian: 0.20%, Lithuanian: 0.20%, Romanian: 0.20%, Swahili: 0.20%, Zulu: 0.20%, Latvian: 0.10%, Malay: 0.10%, Maori: 0.10%, Slovak: 0.10%, Slovene: 0.10%, Somali: 0.10%, Turkish: 0.10%, Xhosa: 0.10%

>> Detection of 1000 word pairs (average length: 18 chars)
Accuracy: 93.90%
Erroneously classified as Dutch: 0.90%, Latin: 0.90%, English: 0.70%, Swedish: 0.60%, Danish: 0.50%, French: 0.40%, Bokmal: 0.30%, Irish: 0.20%, Tagalog: 0.20%, Tsonga: 0.20%, Afrikaans: 0.10%, Esperanto: 0.10%, Estonian: 0.10%, Finnish: 0.10%, Italian: 0.10%, Maori: 0.10%, Nynorsk: 0.10%, Somali: 0.10%, Swahili: 0.10%, Turkish: 0.10%, Welsh: 0.10%, Zulu: 0.10%

>> Detection of 1000 sentences (average length: 111 chars)
Accuracy: 99.70%
Erroneously classified as Dutch: 0.20%, Latin: 0.10%

7. How to add it to your project?

Lingua is available in the Python Package Index and can be installed with:

pip install lingua-language-detector

8. How to build?

Lingua requires Python >= 3.9 and uses Poetry for packaging and dependency management. You need to install it first if you have not done so yet. Afterwards, clone the repository and install the project dependencies:

git clone https://github.com/pemistahl/lingua-py.git
cd lingua-py
poetry install

The library makes use of type annotations which allow for static type checking with Mypy. Run the following command to check the types:

poetry run mypy

The source code is accompanied by an extensive unit test suite. To run the tests, simply say:

poetry run pytest

9. How to use?

9.1 Basic usage

>>> from lingua import Language, LanguageDetectorBuilder
>>> languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
>>> detector = LanguageDetectorBuilder.from_languages(*languages).build()
>>> detector.detect_language_of("languages are awesome")
Language.ENGLISH

9.2 Minimum relative distance

By default, Lingua returns the most likely language for a given input text. However, there are certain words that are spelled the same in more than one language. The word prologue, for instance, is both a valid English and French word. Lingua would output either English or French which might be wrong in the given context. For cases like that, it is possible to specify a minimum relative distance that the logarithmized and summed up probabilities for each possible language have to satisfy. It can be stated in the following way:

>>> from lingua import Language, LanguageDetectorBuilder
>>> languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
>>> detector = LanguageDetectorBuilder.from_languages(*languages)\
.with_minimum_relative_distance(0.25)\
.build()
>>> print(detector.detect_language_of("languages are awesome"))
None

Be aware that the distance between the language probabilities is dependent on the length of the input text. The longer the input text, the larger the distance between the languages. So if you want to classify very short text phrases, do not set the minimum relative distance too high. Otherwise, None will be returned most of the time as in the example above. This is the return value for cases where language detection is not reliably possible.
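
As a rough illustration (the exact results depend on the language models), the same detector configured with a minimum relative distance of 0.25 may reject a single word but still classify a longer sentence:

>>> print(detector.detect_language_of("languages"))
None
>>> print(detector.detect_language_of("languages are designed to convey meaning between people"))
Language.ENGLISH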

9.3 Confidence values

Knowing about the most likely language is nice, but how reliable is the computed likelihood? And how much less likely are the other examined languages in comparison to the most likely one? These questions can be answered as well:

>>> from lingua import Language, LanguageDetectorBuilder
>>> languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
>>> detector = LanguageDetectorBuilder.from_languages(*languages).build()
>>> confidence_values = detector.compute_language_confidence_values("languages are awesome")
>>> for language, value in confidence_values:
...     print(f"{language.name}: {value:.2f}")
ENGLISH: 1.00
FRENCH: 0.79
GERMAN: 0.75
SPANISH: 0.70

In the example above, a list of all possible languages is returned, sorted by their confidence value in descending order. The values that the detector computes are part of a relative confidence metric, not of an absolute one. Each value is a number between 0.0 and 1.0. The most likely language is always returned with value 1.0. All other languages get values assigned which are lower than 1.0, denoting how much less likely those languages are in comparison to the most likely language.

The list returned by this method does not necessarily contain all languages which this LanguageDetector instance was built from. If the rule-based engine decides that a specific language is truly impossible, then it will not be part of the returned list. Likewise, if no n-gram probabilities can be found within the detector's languages for the given input text, the returned list will be empty. The confidence value for each language not being part of the returned list is assumed to be 0.0.
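
If you need a value for every language the detector was built from, a small sketch like the following (reusing the detector and languages from above) fills in 0.0 for the missing ones:

>>> confidence = {language: 0.0 for language in languages}
>>> for language, value in detector.compute_language_confidence_values("languages are awesome"):
...     confidence[language] = value
>>> max(confidence, key=confidence.get)
Language.ENGLISH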

9.4 Eager loading versus lazy loading

By default, Lingua uses lazy-loading to load only those language models on demand which are considered relevant by the rule-based filter engine. For web services, for instance, it is rather beneficial to preload all language models into memory to avoid unexpected latency while waiting for the service response. If you want to enable the eager-loading mode, you can do it like this:

LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()

Multiple instances of LanguageDetector share the same language models in memory which are accessed asynchronously by the instances.

9.5 Methods to build the LanguageDetector

There might be classification tasks where you know beforehand that your language data is definitely not written in Latin, for instance. The detection accuracy can become better in such cases if you exclude certain languages from the decision process or just explicitly include relevant languages:

from lingua import LanguageDetectorBuilder, Language, IsoCode639_1, IsoCode639_3

# Including all languages available in the library
# consumes approximately 3GB of memory and might
# lead to slow runtime performance.
LanguageDetectorBuilder.from_all_languages()

# Include only languages that are not yet extinct (= currently excludes Latin).
LanguageDetectorBuilder.from_all_spoken_languages()

# Include only languages written with Cyrillic script.
LanguageDetectorBuilder.from_all_languages_with_cyrillic_script()

# Exclude only the Spanish language from the decision algorithm.
LanguageDetectorBuilder.from_all_languages_without(Language.SPANISH)

# Only decide between English and German.
LanguageDetectorBuilder.from_languages(Language.ENGLISH, Language.GERMAN)

# Select languages by ISO 639-1 code.
LanguageDetectorBuilder.from_iso_codes_639_1(IsoCode639_1.EN, IsoCode639_1.DE)

# Select languages by ISO 639-3 code.
LanguageDetectorBuilder.from_iso_codes_639_3(IsoCode639_3.ENG, IsoCode639_3.DEU)

10. What's next for version 1.1.0?

Take a look at the planned issues.

11. Contributions

Any contributions to Lingua are very much appreciated. Please read the instructions in CONTRIBUTING.md for how to add new languages to the library.

Comments
  • Make the library compatible with Python versions < 3.9

    Hello, I'm trying to use the module on Google Colab and I get this error during installation:

    ERROR: Could not find a version that satisfies the requirement lingua-language-detector (from versions: none)
    ERROR: No matching distribution found for lingua-language-detector
    

    What are the requirements of this module?

    opened by Jourdelune 10
  • Error: ZeroDivisionError: float division by zero

    Hello.

    When running this code with lingua-language-detector version 1.3.0:

    from lingua import LanguageDetectorBuilder

    with open('text.txt') as fh:
        text = fh.read()
        detector = LanguageDetectorBuilder.from_all_languages().build()
        print(text)
        result = detector.detect_language_of(text)
        print(result)
    

    I get this error:

    Traceback (most recent call last):
      File "/home/jordi/sc/crux-top-lists-catalan/bug.py", line 9, in <module>
        result = detector.detect_language_of(text)
      File "/home/jordi/.local/lib/python3.10/site-packages/lingua/detector.py", line 272, in detect_language_of
        confidence_values = self.compute_language_confidence_values(text)
      File "/home/jordi/.local/lib/python3.10/site-packages/lingua/detector.py", line 499, in compute_language_confidence_values
        normalized_probability = probability / denominator
    ZeroDivisionError: float division by zero
    

    I attached the text file that triggers the problem. It works fine with other texts. This happens often in a crawling application that I'm testing.

    bug 
    opened by jordimas 3
  • Import of LanguageDetectorBuilder failed

    When loading the LanguageDetectorBuilder as recommended in the readme, I received the following error:

    from lingua import LanguageDetectorBuilder ... ImportError: cannot import name 'LanguageDetectorBuilder' from 'lingua'

    The following worked for me:

    from lingua.builder import LanguageDetectorBuilder

    opened by geritwagner 3
  • Detect multiple languages in mixed-language text

    Currently, for a given input string, only the most likely language is returned. However, if the input contains contiguous sections of multiple languages, it would be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index and the detected language.

    Input: He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"

    Output:

    [
      {"start": 0, "end": 27, "language": ENGLISH}, 
      {"start": 28, "end": 69, "language": GERMAN}
    ]
    
    new feature 
    opened by pemistahl 3
  • ZeroDivisionError: float division by zero

    On occasion, on longer texts, I am getting this error. Steps to reproduce:

    detector.detect_language_of(text)
    

    Where text is

    Flagged as potential abuser? No Retailer | Concept-store() Brand order:  placed on  Payout scheduled date: Not Scheduled Submission type: Lead How did you initially connected?: Sales rep When did you last reach out?:  (UTC) Did you add this person through ?: I don't know Additional information: Bonjour, Je travaille avec cette boutique depuis plusieurs années. C'est moi qui lui ai conseillé de passer par pour son réassort avec le lien direct que je lui avais transmis. Pourriez vous retirer la commission de 23% ? Je vous remercie. En lien pour preuve la dernière facture que je lui ai éditée et qui date du mois dernier. De plus, j'ai redirigé vers plusieurs autres boutiques avec qui j'ai l'habitude de travailler. Elles devraient passer commande prochainement: Ça m'ennuierai de me retrouver avec le même problème pour ces clients aussi. Merci d'avance pour votre aide ! Cordialement Click here to check out customer uploaded file Click here to approve / reject / flag as potential abuser
    

    It's not an isolated example

    Any help would be massively appreciated

    opened by duboff 2
  • Weird issues with short texts in Russian

    Hi team, great library! I wanted to share an example I stumbled upon when detecting the language of a very short, basic Russian text. It comes out as Macedonian, even though as far as I can tell it's not actually correct Macedonian but is correct Russian. It is identified correctly by AWS Comprehend and other APIs:

    from lingua import LanguageDetectorBuilder

    detector = LanguageDetectorBuilder.from_all_languages().build()
    detector.detect_language_of("как дела")
    # returns Language.MACEDONIAN
    opened by duboff 2
  • Use softmax function instead of min-max normalization

    What do you think about passing the results to a softmax function instead of min-max normalization? I think it's a clearer way because, for example, you can then use a threshold to filter out unidentified languages.

    Are there some pitfalls that aren't clear to me? I've implemented this by slightly changing your code. I've also rounded the results.

    It passes black and mypy, but not the tests. It throws an error like: INTERNALERROR> UnicodeEncodeError: 'charmap' codec can't encode characters in position 712-720: character maps to <undefined>
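
    A minimal sketch of the idea (not part of Lingua's public API), applying softmax to hypothetical per-language log-probability sums so that the values form a distribution which can be thresholded:

    import math

    def softmax(log_probs: dict) -> dict:
        # Subtract the maximum for numerical stability before exponentiating.
        max_value = max(log_probs.values())
        exps = {lang: math.exp(v - max_value) for lang, v in log_probs.items()}
        total = sum(exps.values())
        return {lang: v / total for lang, v in exps.items()}

    print(softmax({"ENGLISH": -42.1, "GERMAN": -45.7, "FRENCH": -48.3}))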

    opened by Alex-Kopylov 2
  • Failed to predict correct language for popular English single words

    Hello

    • "ITALIAN": 0.9900000000000001,
    • "SPANISH": 0.8457074930316446,
    • "ENGLISH": 0.6405700388041755,
    • "FRENCH": 0.260556921899765,
    • "GERMAN": 0.01,
    • "CHINESE": 0,
    • "RUSSIAN": 0

    Bye

    • "FRENCH": 0.9899999999999999,
    • "ENGLISH": 0.9062076381164255,
    • "GERMAN": 0.6259792361883574,
    • "SPANISH": 0.46755135335558035,
    • "ITALIAN": 0.01,
    • "CHINESE": 0,
    • "RUSSIAN": 0

    Loss (not Löss)

    • "GERMAN": 0.99,
    • "ENGLISH": 0.9177028091362562,
    • "ITALIAN": 0.9082690119891484,
    • "FRENCH": 0.7091301303929289,
    • "SPANISH": 0.01,
    • "CHINESE": 0,
    • "RUSSIAN": 0
    opened by Alex-Kopylov 2
  • Is it possible to detect only English using lingua?

    Hi, I'm currently working on a project which requires me to filter out all non-English text. It consists mostly of short texts, most of them in English. I thought of building the language detector with only Language.ENGLISH but got an error that at least two languages are required. I do not care about knowing what language each non-English text is actually in, only English / non-English. What would be the correct way to go about it with lingua? I think it might be problematic if I set it to recognize all languages because it might just add unnecessary noise to the prediction, which should have a bias towards English in my case. Thanks!
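
    Something like the following sketch is what I have in mind; the chosen language set is illustrative only, building the detector from a few frequently occurring languages alongside English and keeping only texts whose top prediction is English:

    from lingua import Language, LanguageDetectorBuilder

    languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
    detector = LanguageDetectorBuilder.from_languages(*languages).build()

    def is_english(text: str) -> bool:
        # Treat the text as English only if English is the top prediction.
        return detector.detect_language_of(text) == Language.ENGLISH

    print(is_english("this text should pass the filter"))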

    opened by OmriPi 2
  • Caught an IndexError while using detect_multiple_languages_of

    On this test case:

    , Ресторан «ТИНАТИН»
    

    The code fails with an error:

    Traceback (most recent call last):
      File "/home/essential/PycharmProjects/pythonProject/test_unnest.py", line 363, in <module>
        for lang, sentence in detector.detect_multiple_languages_of(text)
      File "/home/essential/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/lingua/detector.py", line 389, in detect_multiple_languages_of
        _merge_adjacent_results(results, mergeable_result_indices)
      File "/home/essential/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/lingua/detector.py", line 114, in _merge_adjacent_results
        end_index=results[i + 1].end_index,
    IndexError: list index out of range
    

    Code example:

    from lingua import Language, LanguageDetectorBuilder

    languages = [Language.ENGLISH, Language.RUSSIAN, Language.UKRAINIAN]
    detector = LanguageDetectorBuilder.from_languages(*languages).build()
    text = ', Ресторан «ТИНАТИН»'
    sentences = [(lang, sentence) for lang, sentence in detector.detect_multiple_languages_of(text)]
    
    bug 
    opened by Saninsusanin 1
  • Bad detection in common word

    Hello, I need to detect the language of user-generated content for a chat. I have tested this library, but it gives strange results for short texts, for example the word hello:

    from lingua import Language, LanguageDetectorBuilder
    
    languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
    detector = LanguageDetectorBuilder.from_languages(*languages).build()
    
    text = """
    Hello
    """
    confidence_values = detector.compute_language_confidence_values(text.strip())
    for language, value in confidence_values:
        print(f"{language.name}: {value:.2f}")
    

    It returns Spanish (but the correct language is English):

    SPANISH: 1.00
    ENGLISH: 0.95
    FRENCH: 0.87
    GERMAN: 0.82
    

    Do you know any tips to get better results when detecting the language of user-generated content?

    opened by Jourdelune 1
  • detect_multiple_languages_of predicts incorrect languages

    Using version 1.3.1

    Using a text that is in Catalan only, does not contain any fragments from other languages, and is a very standard kind of text, the detect_multiple_languages_of method detects: CATALAN, SOMALI, LATIN, FRENCH, SPANISH and PORTUGUESE. The expectation is that it should report that the full text is CATALAN.

    Code to reproduce the problem:

    from lingua import Language, LanguageDetectorBuilder, IsoCode639_1
    
    with open('text-catalan.txt') as fh:
        text = fh.read()
    
        detector = LanguageDetectorBuilder.from_all_languages().build()
        
        for result in detector.detect_multiple_languages_of(text):
            print(f"{result.language.name}")
    

    Related to this problem, detect_language_of and detect_multiple_languages_of also predict different languages for the same text. Below is an example where, on the same input, detect_language_of predicts Catalan and detect_multiple_languages_of predicts Tsonga.

    My expectation is that both methods will predict the same language given the same input.

    Code sample:

    from lingua import Language, LanguageDetectorBuilder, IsoCode639_1
    
    with open('china.txt') as fh:
        text = fh.read()
    
        detector = LanguageDetectorBuilder.from_all_languages().build()
          
        result = detector.detect_language_of(text)
        print(f"detect_language_of prediction: {result}")
        
        for result in detector.detect_multiple_languages_of(text):
            print(f"detect_language_of prediction: {result.language.name}")
    
    
    opened by jordimas 2
  • detect_multiple_languages_of is very slow

    Using version 1.3.1

    For a text of 3.5 KB (31 lines), on my machine detect_multiple_languages_of takes 26.56 seconds while detect_language_of takes only 1.68 seconds.

    26 seconds to analyse 3.5 KB of text (a throughput of roughly 7 seconds per KB) makes the detect_multiple_languages_of method really not suitable for processing a large corpus.

    Code used for the benchmark:

    
    from lingua import Language, LanguageDetectorBuilder, IsoCode639_1
    import datetime
    
    
    with open('text.txt') as fh:
        text = fh.read()
    
        detector = LanguageDetectorBuilder.from_all_languages().build()
        
        start_time = datetime.datetime.now()
        result = detector.detect_language_of(text)
        print('Time used for detect_language_of: {0}'.format(datetime.datetime.now() - start_time))
        print(result.iso_code_639_1)
    
        start_time = datetime.datetime.now()    
        results = detector.detect_multiple_languages_of(text)    
        print('Time used for detect_multiple_languages_of: {0}  '.format(datetime.datetime.now() - start_time))    
        for result in results:
            print(result)
            print(f"** {result.language.name}")
    
    opened by jordimas 1
  • Chars to language mapping

    Hello! My understanding is that this mapping:

    https://github.com/pemistahl/lingua-py/blob/502bb9abef2a31b841c49e063f1a0bd7e47af86d/lingua/_constant.py#L34

    It's used by the rule system to identify languages based on characters. Is my assumption correct?

    Looking at this: https://github.com/pemistahl/lingua-py/blob/502bb9abef2a31b841c49e063f1a0bd7e47af86d/lingua/_constant.py#L191

    The Catalan language, for example, does NOT have "Áá" as valid characters (see reference https://en.wikipedia.org/wiki/Catalan_orthography#Alphabet).

    Looking at the data I see other mappings that do not seem right.

    Could it be the case that these mappings can be improved?

    opened by jordimas 0
  • Proposition: Using prior language probability to increase likelihood

    Proposition: Using prior language probability to increase likelihood

    @pemistahl Peter, I think it would be beneficial for this library to have a separate method that adds a prior probability (in a Bayesian way) to the mix.

    Let's look into statistics: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

    So, since 57% of the texts that you see on the internet are in English, if you predicted "English" for any input you would be wrong only 43% of the time. It's like a stopped clock, but it is right on every second probe.

    For example: https://github.com/pemistahl/lingua-py/issues/100

    Based on that premise, if we are using just plain character statistics, "как дела" is more Macedonian than Russian. But overall, if we add language statistics to the mix, lingua-py would be "wrong" less often.

    There are more Russian-speaking users of this library than Macedonian-speaking ones, just because there are more Russian-speaking people overall. And so when a random user writes "как дела", it is "more accurate" to predict "Russian" than "Macedonian", just because in general that is what is expected by these users.

    So my proposition is to add a detector.detect_language_with_prior function and factor in the prior: likelihood = probability × prior_probability

    For example: https://github.com/pemistahl/lingua-py/issues/97

    detector.detect_language_of("Hello")
    
    "ITALIAN": 0.9900000000000001,
    "SPANISH": 0.8457074930316446,
    "ENGLISH": 0.6405700388041755,
    "FRENCH": 0.260556921899765,
    "GERMAN": 0.01,
    "CHINESE": 0,
    "RUSSIAN": 0
    
    detector.detect_language_with_prior("Hello")
    
    # Of course constants are for illustrative purposes only.
    # Results should be normalized afterwards
    "ENGLISH": 0.6405700388041755 * 0.577,
    "SPANISH": 0.8457074930316446 * 0.045,
    "ITALIAN": 0.9900000000000001 * 0.017,
    "FRENCH": 0.260556921899765 * 0.039,
    

    Linked issues:

    • https://github.com/pemistahl/lingua-py/issues/94
    • https://github.com/pemistahl/lingua-py/issues/100
    • https://github.com/pemistahl/lingua-py/issues/97
    opened by slavaGanzin 1
  • Increase speed by compiling to native code

    It should be investigated if and how detection speed can be increased by compiling crucial parts of the library to native code, probably with the help of Cython or mypyc.

    enhancement 
    opened by pemistahl 0