Pipeline for training LSA models using Scikit-Learn.

Last update: Sep 05, 2022

Overview

Latent Semantic Analysis

Usage

Instead of writing custom code for latent semantic analysis, you only need to:

Install the pipeline:

pip install latent-semantic-analysis

Run the pipeline:

either in the terminal:

lsa-train --path_to_config config.yaml

or in Python:

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

NOTE: the config file is described in the Config section below.

No data preparation is needed: the only input is a CSV file with a raw text column (the column name is arbitrary).
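As a minimal sketch of the expected input, the snippet below writes a small CSV with a single raw-text column using only the standard library. The path data/data.csv and the column name text mirror the default config shown in the Config section; the sample sentences are invented purely for illustration.

```python
import csv
from pathlib import Path

# Path and column name follow the default config (data_path, text_column);
# the sample sentences are made up for illustration.
data_path = Path("data/data.csv")
text_column = "text"

texts = [
    "Latent semantic analysis maps documents to topics.",
    "TF-IDF weights rare terms more heavily than common ones.",
    "Truncated SVD reduces the dimensionality of the term matrix.",
]

data_path.parent.mkdir(parents=True, exist_ok=True)
with data_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow([text_column])        # header row with the column name
    writer.writerows([t] for t in texts)  # one raw text per row
```

If you pick a different column name, the text_column entry in config.yaml presumably has to match it.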

Config

The user interface consists of a single file:

config.yaml - general configuration with sklearn TF-IDF and SVD parameters

Edit config.yaml to create the desired configuration, then train the LSA model with the following command:

terminal:

lsa-train --path_to_config config.yaml

python:

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack

NOTE: the tf-idf and svd sections hold parameters of sklearn's TfidfVectorizer and TruncatedSVD, respectively, so you can configure instances of these classes however you want.
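To illustrate how these sections map onto scikit-learn, here is a rough sketch; it is not the package's actual internals. It assumes the tf-idf and svd blocks are passed as keyword arguments to TfidfVectorizer and TruncatedSVD, and that seed is used as random_state. Since a YAML parser reads (1, 1) as a plain string, the sketch converts it with ast.literal_eval, which is also an assumption. The toy corpus is invented, and n_components is lowered to 2 because the arpack solver needs fewer components than the smaller dimension of the TF-IDF matrix.

```python
import ast

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Values as they would come out of a YAML parser for the default config.
config = {
    "seed": 42,
    "tf-idf": {"lowercase": True, "ngram_range": "(1, 1)", "max_df": 1.0, "min_df": 1},
    "svd": {"n_components": 2, "algorithm": "arpack"},  # 2, not 10: the toy corpus is tiny
}

tfidf_params = dict(config["tf-idf"])
# Assumption: the string "(1, 1)" has to become a tuple before sklearn sees it.
tfidf_params["ngram_range"] = ast.literal_eval(tfidf_params["ngram_range"])

lsa = Pipeline([
    ("tf-idf", TfidfVectorizer(**tfidf_params)),
    ("svd", TruncatedSVD(random_state=config["seed"], **config["svd"])),
])

# Invented documents, for illustration only.
docs = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]

doc_vectors = lsa.fit_transform(docs)                # (n_documents, n_components)
term_vectors = lsa.named_steps["svd"].components_.T  # (n_terms, n_components)
```

Any other keyword accepted by the two classes (for example max_features or n_iter) could be added to the corresponding config section in the same way, under the same assumption.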

Output

After training, the pipeline saves the following files:

model.joblib - sklearn pipeline with LSA (TF-IDF and SVD steps)
config.yaml - config that was used to train the model
logging.txt - logging file
doc2topic.json - document embeddings
term2topic.json - term embeddings
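The JSON outputs can be inspected with the standard library alone. The sketch below makes no assumption about their internal layout and only reports the top-level type and the number of entries. The helper name describe_json is illustrative, and the models folder is taken from path_to_save_folder in the default config; the exact location of the files inside that folder may differ.

```python
import json
from pathlib import Path


def describe_json(path):
    """Return the top-level JSON type name and its number of entries."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return type(data).__name__, len(data)


# Hypothetical location: adjust to wherever the pipeline saved its outputs.
output_dir = Path("models")
for name in ("doc2topic.json", "term2topic.json"):
    file_path = output_dir / name
    if file_path.exists():
        print(name, describe_json(file_path))
```

The fitted model itself, model.joblib, can be loaded with joblib.load, which returns the saved sklearn pipeline.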

Requirements

Python >= 3.6

Citation

If you use latent-semantic-analysis in a scientific publication, we would appreciate a citation using the following BibTeX entry:

@misc{dayyass2021lsa,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training LSA models},
    howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
    year         = {2021}
}

Releases(v0.1.0)

v0.1.0(Oct 8, 2021)

First Release! 🥳🎉🍾
Owner

Dani El-Ayyass
