A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Last update: Oct 07, 2022

Overview

Multilingual Latent Dirichlet Allocation (LDA) Pipeline

This project is for text clustering using the Latent Dirichlet Allocation (LDA) algorithm. It can be adapted to many languages provided that the Snowball stemmer, a dependency of this project, supports it.

Usage

from artifici_lda.lda_service import train_lda_pipeline_default


FR_STOPWORDS = [
    "le", "les", "la", "un", "de", "en",
    "a", "b", "c", "s",
    "est", "sur", "tres", "donc", "sont",
    # even slang/texto stop words:
    "ya", "pis", "yer"]
# Note: this list of stop words is poor and is just as an example.

fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?"
]

transformed_comments, top_comments, _1_grams, _2_grams = train_lda_pipeline_default(
    fr_comments,
    n_topics=2,
    stopwords=FR_STOPWORDS,
    language='french')

print(transformed_comments)
print(top_comments)
print(_1_grams)
print(_2_grams)

Output:

array([[0.14218195, 0.85781805],
       [0.11032992, 0.88967008],
       [0.16960695, 0.83039305],
       [0.88967041, 0.11032959],
       [0.8578187 , 0.1421813 ],
       [0.83039303, 0.16960697]])

['Un super-chien aboie', 'Les super-chats aiment ronronner']

[[('chiens', 3.4911404011996545), ('super', 2.5000203653313933)],
 [('chats',  3.4911393765493255), ('super', 2.499979634668601 )]]

[[('super chiens', 2.4921035508342464)],
 [('super chats',  2.492102155345991 )]]

How it works

See Multilingual-LDA-Pipeline-Tutorial for an exhaustive example (intended to be read from top to bottom, not skimmed through). For more explanations on the Inverse Lemmatization, see Stemming-words-from-multiple-languages.

Supported Languages

Those languages are supported:

Danish
Dutch
English
Finnish
French
German
Hungarian
Italian
Norwegian
Porter
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish

You need to bring your own list of stop words. That could be achieved by computing the Term Frequencies on your corpus (or on a bigger corpus of the same language) and to use some of the most common words as stop words.

Dependencies and their license

numpy==1.14.3           # BSD-3-Clause and BSD-2-Clause BSD-like and Zlib
scikit-learn==0.19.1    # BSD-3-Clause
PyStemmer==1.3.0        # BSD-3-Clause and MIT
snowballstemmer==1.2.1  # BSD-3-Clause and BSD-2-Clause
translitcodec==0.4.0    # MIT License
scipy==1.1.0            # BSD-3-Clause and MIT-like

Unit tests

Run pytest with ./run_tests.sh. Coverage:

----------- coverage: platform linux, python 3.6.7-final-0 -----------
Name                                       Stmts   Miss  Cover
--------------------------------------------------------------
artifici_lda/__init__.py                       0      0   100%
artifici_lda/data_utils.py                    39      0   100%
artifici_lda/lda_service.py                   31      0   100%
artifici_lda/logic/__init__.py                 0      0   100%
artifici_lda/logic/count_vectorizer.py         9      0   100%
artifici_lda/logic/lda.py                     23      7    70%
artifici_lda/logic/letter_splitter.py         36      4    89%
artifici_lda/logic/stemmer.py                 60      3    95%
artifici_lda/logic/stop_words_remover.py      61      5    92%
--------------------------------------------------------------
TOTAL                                        259     19    93%

License

This project is published under the MIT License (MIT).

Coded by Guillaume Chevalier at Neuraxio Inc.

A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Related tags

Overview

Multilingual Latent Dirichlet Allocation (LDA) Pipeline

Usage

How it works

Supported Languages

Dependencies and their license

Unit tests

License

Owner

Artifici Online Services inc.

PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

🚀Clone a voice in 5 seconds to generate arbitrary speech in real-time

A simple Streamlit App to classify swahili news into different categories.

Sapiens is a human antibody language model based on BERT.

PyTorch Implementation of "Non-Autoregressive Neural Machine Translation"

APEACH: Attacking Pejorative Expressions with Analysis on Crowd-generated Hate Speech Evaluation Datasets

Fast, DB Backed pretrained word embeddings for natural language processing.

Code for the Python code smells video on the ArjanCodes channel.

👄 The most accurate natural language detection library for Python, suitable for long and short text alike

A minimal code for fairseq vq-wav2vec model inference.

Share constant definitions between programming languages and make your constants constant again

text to speech toolkit. 好用的中文语音合成工具箱，包含语音编码器、语音合成器、声码器和可视化模块。

🤗🖼️ HuggingPics: Fine-tune Vision Transformers for anything using images found on the web.

Trains an OpenNMT PyTorch model and SentencePiece tokenizer.

Python library for parsing resumes using natural language processing and machine learning

Modified GPT using average pooling to reduce the softmax attention memory constraints.

Code for PED: DETR For (Crowd) Pedestrian Detection

Idea is to build a model which will take keywords as inputs and generate sentences as outputs.

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.