Blazing fast language detection using fastText model

Last update: Dec 20, 2022

Overview

Luga

Blazing-fast language detection using fastText's language models

Luga is the Swahili word for language. fastText provides blazing-fast language detection, but downloading and loading its models is a bit awkward, and the fastText API is not especially elegant. This is why luga was born.

Installation

python -m pip install -U luga

Usage:

Note: First usage downloads the model for you. This is done only once.
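The download-once behaviour can be pictured as a simple "fetch only if the file is missing" check. The following is a minimal sketch of that pattern, not luga's actual code: `ensure_model`, `fake_fetch`, and the file name `language.bin` are hypothetical, and the real download is replaced by a stand-in so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ensure_model(path: Path, fetch: Callable[[Path], None]) -> Path:
    """Call `fetch` to create the file at `path` only if it is missing."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)
    return path


# Hypothetical stand-in for a real download; it records every call.
calls: List[Path] = []


def fake_fetch(target: Path) -> None:
    calls.append(target)
    target.write_bytes(b"model-bytes")


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "models" / "language.bin"
    ensure_model(model_path, fake_fetch)  # first use: file missing, fetch runs
    ensure_model(model_path, fake_fetch)  # second use: file present, fetch skipped
    download_count = len(calls)
```

Passing the downloader in as a parameter keeps the caching check separate from the network call, which makes the pattern easy to test.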

from luga import language

print(language("the world has ended yesterday"))
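For context, fastText's language-identification models report labels of the form `__label__en` together with a probability, which is part of what makes the raw API awkward to use. Below is a minimal sketch of the kind of tidy-up a wrapper might do, assuming that label format. The `Detection` type and `parse_prediction` helper are hypothetical illustrations, not luga's implementation or return type.

```python
from typing import NamedTuple, Sequence


class Detection(NamedTuple):
    """Hypothetical result type; luga's own return value may differ."""
    name: str
    score: float


def parse_prediction(labels: Sequence[str], scores: Sequence[float]) -> Detection:
    """Turn a raw fastText-style (labels, scores) pair into a tidy result."""
    prefix = "__label__"
    top_label = labels[0]
    # Strip the prefix that fastText puts in front of every label.
    name = top_label[len(prefix):] if top_label.startswith(prefix) else top_label
    return Detection(name=name, score=float(scores[0]))


# Example input shaped like a fastText prediction (values made up).
result = parse_prediction(("__label__en",), (0.98,))
```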

Coming soon ...

TODO:

refactor artifacts.py
auto checkers with pre-commit | invoke
write more tests
write github actions
create a smart data checker (a fast List[str]; what to do with non-string values)
make it faster with Cython

You might also like...

A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/casual, active/passive, and many more. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.

Styleformer A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/cas

431 Dec 19, 2022

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

2.2k Jan 9, 2023

🏡 Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

(Framework for Adapting Representation Models) What is it? FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built u

1.6k Dec 27, 2022

🏡 Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

(Framework for Adapting Representation Models) What is it? FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built u

1.1k Feb 14, 2021

A fast Text-to-Speech (TTS) model. Works well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (tested so far).

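Simplified Chinese | English Parallel speech synthesis [TOC] News 2021/04/20: merged the wavegan branch into the main branch and deleted the wavegan branch! 2021/04/13: created the encoder branch for developing the voice style transfer module! 2021/04/13: the softdtw branch supports using Sof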

161 Dec 19, 2022

A python framework to transform natural language questions to queries in a database query language.

1.2k Dec 18, 2022

A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format

RITA DSL This is a language, loosely based on language Apache UIMA RUTA, focused on writing manual language rules, which compiles into either spaCy co

60 Sep 26, 2022

Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG)

Indobenchmark Toolkit Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources fo

11 Aug 26, 2022

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️ The library of Natural Language Processing for Brazilian legal lang

125 Dec 20, 2022

Comments

fix: Fix invalid pytest dependency version
poetry does not want to accept flake8 as a valid version. Fixes issue #13

fix: Fix invalid pytest dependency version

fix: Use fasttext-wheel instead of fasttext
opened by saevarb 1
Installation fails with recent poetry due to `fasttext` issues

Hey!

As explained in this issue: https://github.com/python-poetry/poetry/issues/6113, trying to install fasttext with a recent poetry version fails. This is because fasttext does some really funky things and tries to run a global pip during install. This means that building luga, or using any package that depends on it, doesn't work. :/

This means that columbus doesn't build either, since it depends on luga. However, as is outlined in the issue there is a solution: using fasttext-wheel.

I pulled down luga and columbus and updated luga to use fasttext-wheel instead, and managed to get it to install, which also allowed me to build a new version of columbus using the new luga build.

opened by saevarb 1

SSL WRONG_VERSION_NUMBER

Solution from httpx

import httpx
import ssl

ssl_context = httpx.create_ssl_context()
ssl_context.options ^= ssl.OP_NO_TLSv1  # Enable TLS 1.0 back
resp = httpx.get(..., verify=ssl_context)

opened by Proteusiq 0

Return array for compatibility with pandas

This fails since pandas expects an array and luga returns a list

texts.loc[languages(texts["texts"].to_list(), only_language=True) == "da"]

But this works

texts.loc[np.array(languages(texts["texts"].to_list(), only_language=True)) == "da"]
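The underlying difference is plain Python semantics: comparing a whole list to a string yields a single `False`, whereas an array comparison is element-wise and yields one boolean per row. A minimal sketch without pandas or luga, where `detected` is a made-up stand-in for the list that `languages(..., only_language=True)` returns:

```python
# Hypothetical stand-in for the list of language codes returned by luga.
detected = ["da", "en", "da"]

# A list compared to a string is one bool for the whole list, not a mask.
whole_list_comparison = detected == "da"

# An explicit element-wise comparison produces one bool per element,
# which is the shape a boolean row selector needs.
mask = [code == "da" for code in detected]
```

Here `whole_list_comparison` is `False` and `mask` is `[True, False, True]`; wrapping the list in `np.array(...)` before comparing gives the same element-wise result as the comprehension.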

opened by nthomsencph 0

Releases(v0.2.7)

v0.2.7(Dec 18, 2022)

Source code(tar.gz)
Source code(zip)
luga-0.2.7-py3-none-any.whl(5.55 KB)
luga-0.2.7.tar.gz(5.34 KB)
v0.2.6(Sep 28, 2022)

Source code(tar.gz)
Source code(zip)
luga-0.2.6-py3-none-any.whl(5.51 KB)
luga-0.2.6.tar.gz(5.32 KB)
v0.2.5(Apr 19, 2022)

Source code(tar.gz)
Source code(zip)
luga-0.2.5-py3-none-any.whl(5.50 KB)
luga-0.2.5.tar.gz(5.39 KB)
v0.2.4(Dec 23, 2021)

Source code(tar.gz)
Source code(zip)
luga-0.2.4-py3-none-any.whl(4.60 KB)
luga-0.2.4.tar.gz(4.52 KB)
v0.2.3(Dec 22, 2021)

Source code(tar.gz)
Source code(zip)
luga-0.2.3-py3-none-any.whl(4.56 KB)
luga-0.2.3.tar.gz(4.46 KB)
v0.2.2(Dec 3, 2021)

Source code(tar.gz)
Source code(zip)
luga-0.2.2-py3-none-any.whl(4.42 KB)
luga-0.2.2.tar.gz(4.28 KB)
v0.2.1(Nov 26, 2021)

Source code(tar.gz)
Source code(zip)
luga-0.2.1-py3-none-any.whl(4.07 KB)
luga-0.2.1.tar.gz(3.95 KB)
v0.2.0(Nov 26, 2021)

Source code(tar.gz)
Source code(zip)
luga-0.2.0-py3-none-any.whl(4.07 KB)
luga-0.2.0.tar.gz(3.95 KB)
v0.1.8(Nov 20, 2021)

Source code(tar.gz)
Source code(zip)
luga-0.1.8-py3-none-any.whl(3.88 KB)
luga-0.1.8.tar.gz(3.76 KB)
v0.1.7(Nov 17, 2021)

Source code(tar.gz)
Source code(zip)
luga-0.1.7-py3-none-any.whl(3.81 KB)
luga-0.1.7.tar.gz(3.66 KB)

Owner

Prayson Wilfred Daniel

🍺 Data Scientist | 🍺 Automating Data Mining & Analysis With Python

GitHub Repository

SurvTRACE: Transformers for Survival Analysis with Competing Events

⭐ SurvTRACE: Transformers for Survival Analysis with Competing Events This repo provides the implementation of SurvTRACE for survival analysis. It is

13 Oct 06, 2022

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ELECTRA Introduction ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using

2.1k Dec 28, 2022

This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers.

private-transformers This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers. What is this? Why

73 Dec 28, 2022

Textlesslib - Library for Textless Spoken Language Processing

textlesslib Textless NLP is an active area of research that aims to extend NLP t

379 Dec 27, 2022

Syntax-aware Multi-spans Generation for Reading Comprehension (TASLP 2022)

SyntaxGen Syntax-aware Multi-spans Generation for Reading Comprehension (TASLP 2022) In this repo, we upload all the scripts for this work. Due to siz

3 Jun 13, 2022

Stack based programming language that compiles to x86_64 assembly or can alternatively be interpreted in Python

lang lang is a simple stack based programming language written in Python. It can

1 May 30, 2022

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language mod

20.5k Jan 08, 2023

Mastering Transformers, published by Packt

Mastering Transformers This is the code repository for Mastering Transformers, published by Packt. Build state-of-the-art models from scratch with adv

195 Jan 01, 2023

Pangu-Alpha for Transformers

Pangu-Alpha for Transformers Usage Download MindSpore FP32 weights for GPU from here to data/Pangu-alpha_2.6B.ckpt Activate MindSpore environment and

5 Oct 01, 2022

Natural Language Processing Best Practices & Examples

NLP Best Practices In recent years, natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive bus

6.1k Dec 31, 2022

leaking paid token generator that was a shit lmao for 100$ haha

Discord-Token-Generator-Leaked leaking paid token generator that was a shit lmao for 100$ he selling it for 100$ wth here the code enjoy don't forget

5 Apr 15, 2022

SDL: Synthetic Document Layout dataset

SDL is the project that synthesizes document images. It facilitates multiple-level labeling on document images and can generate in multiple languages.

0 Oct 07, 2021

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART Code pre-release of our work, Unified Pre-training for Program Understanding and Generation accepted at NAACL 2021. Note. A detailed documentat

138 Dec 30, 2022

NLP - Machine learning

Flipkart-product-reviews NLP - Machine learning About Product reviews is an essential part of an online store like Flipkart’s branding and marketing.

1 Oct 29, 2021

DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

DANeS - Open-source E-newspaper dataset Source: Technology vector created by macrovector - www.freepik.com. DANeS is an open-source E-newspaper datase

64 Aug 17, 2022

BERT, LDA, and TFIDF based keyword extraction in Python

BERT, LDA, and TFIDF based keyword extraction in Python kwx is a toolkit for multilingual keyword extraction based on Google's BERT and Latent Dirichl

41 Dec 27, 2022

Predicting the usefulness of reviews given the review text and metadata surrounding the reviews.

Predicting Yelp Review Quality Table of Contents Introduction Motivation Goal and Central Questions The Data Data Storage and ETL EDA Data Pipeline Da

3 Nov 27, 2022

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

2.2k Jan 09, 2023

Natural Language Processing Tasks and Examples.

Natural Language Processing Tasks and Examples With the advancement of A.I. technology in recent years, natural language processing technology has bee

53 Dec 20, 2022

Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

1.1k Dec 27, 2022
