Blazing fast language detection using fastText model

Overview

Luga

  • A blazing fast language detection using fastText's language models

Luga is a Swahili word for language. fastText provides a blazing fast language detection. It is though a bit funky to download and load models. fastText API is also beauty-less. This is why luga was born.

Installation

python -m pip install -U luga

Usage:

Note: First usage downloads the model for you. This is done only once.

from luga import language

print(language("the world has ended yesterday"))

Comming soon ...

TODO:

  • refactor artifacts.py
  • auto checkers with pre-commit | invoke
  • write more tests
  • write github actions
  • create a smart data checker (a fast List[str], what do with none strings)
  • make it faster with Cython
You might also like...
A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/casual, active/passive, and many more. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration. Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

(Framework for Adapting Representation Models) What is it? FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built u

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

(Framework for Adapting Representation Models) What is it? FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built u

A fast Text-to-Speech (TTS) model. Work well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (so far). 快速语音合成模型,适用于英语、普通话/中文、日语、韩语、俄语和藏语(当前已测试)。

简体中文 | English 并行语音合成 [TOC] 新进展 2021/04/20 合并 wavegan 分支到 main 主分支,删除 wavegan 分支! 2021/04/13 创建 encoder 分支用于开发语音风格迁移模块! 2021/04/13 softdtw 分支 支持使用 Sof

A python framework to transform natural language questions to queries in a database query language.

__ _ _ _ ___ _ __ _ _ / _` | | | |/ _ \ '_ \| | | | | (_| | |_| | __/ |_) | |_| | \__, |\__,_|\___| .__/ \__, | |_| |_| |___/

A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format
A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format

RITA DSL This is a language, loosely based on language Apache UIMA RUTA, focused on writing manual language rules, which compiles into either spaCy co

Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG)

Indobenchmark Toolkit Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources fo

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️ The library of Natural Language Processing for Brazilian legal lang

Comments
  • fix: Fix invalid pytest dependency version

    fix: Fix invalid pytest dependency version

    poetry does not want to accept flake8 as a valid versionFixes issue #13

    • fix: Fix invalid pytest dependency version
    • fix: Use fasttext-wheel instead of fasttext
    opened by saevarb 1
  • Installation fails with recent poetry due to `fasttext` issues

    Installation fails with recent poetry due to `fasttext` issues

    Hey!

    As is explained in this issue: https://github.com/python-poetry/poetry/issues/6113 trying to install fasttext with a recent poetry version fails. This is because fasttext does some really funky things and tries to run a global pip during install. So this means that building luga or using any package that depends on it doesn't work. :/

    This means that columbus doesn't build either, since it depends on luga. However, as is outlined in the issue there is a solution: using fasttext-wheel.

    I pulled down luga and columbus and updated luga to use fasttext-wheel instead, and managed to get it to install, which also allowed me to build a new version of columbus using the new luga build.

    opened by saevarb 1
  • SSL WRONG_VERSION_NUMBER

    SSL WRONG_VERSION_NUMBER

    Solution from httpx

    import httpx
    import ssl
    
    ssl_context = httpx.create_ssl_context()
    ssl_context.options ^= ssl.OP_NO_TLSv1  # Enable TLS 1.0 back
    resp = httpx.get(..., verify=ssl_context)
    ```
    opened by Proteusiq 0
  • Return array for compatibility with pandas

    Return array for compatibility with pandas

    This fails since pandas expects an array and luga returns a list

    texts.loc[languages(texts["texts"].to_list(), only_language=True) == "da"]
    

    But this works

    texts.loc[np.array(languages(texts["texts"].to_list(), only_language=True) == "da")]
    
    opened by nthomsencph 0
Owner
Prayson Wilfred Daniel
🍺 Data Scientist | | 🍺 Automating Data Mining & Analysis With Python
Prayson Wilfred Daniel
A desktop GUI providing an audio interface for GPT3.

Jabberwocky neil_degrasse_tyson_with_audio.mp4 Project Description This GUI provides an audio interface to GPT-3. My main goal was to provide a conven

16 Nov 27, 2022
source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

WhiteningBERT Source code and data for paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. Preparation git clone https://github.com

49 Dec 17, 2022
Large-scale pretraining for dialogue

A State-of-the-Art Large-scale Pretrained Response Generation Model (DialoGPT) This repository contains the source code and trained model for a large-

Microsoft 1.8k Jan 07, 2023
Natural Language Processing with transformers

we want to create a repo to illustrate usage of transformers in chinese

Datawhale 763 Dec 27, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Dec 30, 2022
Part of Speech Tagging using Hidden Markov Model (HMM) POS Tagger and Brill Tagger

Part of Speech Tagging using Hidden Markov Model (HMM) POS Tagger and Brill Tagger In this project, our aim is to tune, compare, and contrast the perf

Chirag Daryani 0 Dec 25, 2021
OceanScript is an Esoteric language used to encode and decode text into a formulation of characters

OceanScript is an Esoteric language used to encode and decode text into a formulation of characters - where the final result looks like waves in the ocean.

Crie tokens de autenticação íntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) é uma bilioteca criada para ser utilizada na geração de tokens seguros e íntegros, ou seja, nã

Jaedson Silva 0 Nov 29, 2022
Mkdocs + material + cool stuff

Modern-Python-Doc-Example mkdocs + material + cool stuff Doc is live here Features out of the box amazing good looking website thanks to mkdocs.org an

Francesco Saverio Zuppichini 61 Oct 26, 2022
Binaural Speech Synthesis

Binaural Speech Synthesis This repository contains code to train a mono-to-binaural neural sound renderer. If you use this code or the provided datase

Facebook Research 135 Dec 18, 2022
Extract city and country mentions from Text like GeoText without regex, but FlashText, a Aho-Corasick implementation.

flashgeotext ⚡ 🌍 Extract and count countries and cities (+their synonyms) from text, like GeoText on steroids using FlashText, a Aho-Corasick impleme

Ben 57 Dec 16, 2022
Edge-Augmented Graph Transformer

Edge-augmented Graph Transformer Introduction This is the official implementation of the Edge-augmented Graph Transformer (EGT) as described in https:

Md Shamim Hussain 21 Dec 14, 2022
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training Code and model from our AAAI 2021 paper

Amazon Web Services - Labs 83 Jan 09, 2023
Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

2 Jan 20, 2022
Text classification on IMDB dataset using Keras and Bi-LSTM network

Text classification on IMDB dataset using Keras and Bi-LSTM Text classification on IMDB dataset using Keras and Bi-LSTM network. Usage python3 main.py

Hamza Rashid 2 Sep 27, 2022
Command Line Text-To-Speech using Google TTS

cli-tts Thanks to gTTS by @pndurette! This is an interactive command line text-to-speech tool using Google TTS. Just type text and the voice will be p

ReekyStive 3 Nov 11, 2022
The swas programming language

The Swas programming language This is a language that was made for fun. Installation Step 0: Make sure you have python installed Step 1. Clone this re

Swas.py 19 Jul 18, 2022
Training open neural machine translation models

Train Opus-MT models This package includes scripts for training NMT models using MarianNMT and OPUS data for OPUS-MT. More details are given in the Ma

Language Technology at the University of Helsinki 167 Jan 03, 2023
This repository contains the codes for LipGAN. LipGAN was published as a part of the paper titled "Towards Automatic Face-to-Face Translation".

LipGAN Generate realistic talking faces for any human speech and face identity. [Paper] | [Project Page] | [Demonstration Video] Important Update: A n

Rudrabha Mukhopadhyay 438 Dec 31, 2022
Code for Findings at EMNLP 2021 paper: "Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning"

Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning This repo is for Findings at EMNLP 2021 paper: Learn Cont

INK Lab @ USC 6 Sep 02, 2022