Pipeline for training LSA models using Scikit-Learn.

Last update: Sep 05, 2022

Overview

Latent Semantic Analysis

Pipeline for training LSA models using Scikit-Learn.

Usage

Instead of writing custom code for latent semantic analysis, you just need:

install pipeline:

pip install latent-semantic-analysis

run pipeline:

either in terminal:

lsa-train --path_to_config config.yaml

or in python:

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

NOTE: more about config file here.

No data preparation is needed, only a csv file with raw text column (with arbitrary name).

Config

The user interface consists of only one files:

config.yaml - general configuration with sklearn TF-IDF and SVD parameters

Change config.yaml to create the desired configuration and train LSA model with the following command:

terminal:

lsa-train --path_to_config config.yaml

python:

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack

NOTE: tf-idf and svd are sklearn TfidfVectorizer and TruncatedSVD parameters correspondingly, so you can parameterize instances of these classes however you want.

Output

After training the model, the pipeline will return the following files:

model.joblib - sklearn pipeline with LSA (TF-IDF and SVD steps)
config.yaml - config that was used to train the model
logging.txt - logging file
doc2topic.json - document embeddings
term2topic.json - term embeddings

Requirements

Python >= 3.6

Citation

If you use latent-semantic-analysis in a scientific publication, we would appreciate references to the following BibTex entry:

@misc{dayyass2021lsa,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training LSA models},
    howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
    year         = {2021}
}

You might also like...

This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning This repository contains all

9 Jul 17, 2021

Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

背景安装教程快速上手（一）预训练模型（二）机器翻译（三）文本分类 TenTrans 进阶 1. 多语言机器翻译 2. 跨语言预训练背景 TrenTrans是一个统一的端到端的多语言多任务预训练平台，支持多种预训练方式，以及序列生成和自然语言理解任务。安装教程 git clone git

Tencent Minority-Mandarin Translation Team

42 Dec 20, 2022

Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Toy Machine Learning Pipeline Table of Contents About Getting Started ML task description and evaluation procedure Dataset description Repository stru

190 Dec 21, 2022

Pipeline for chemical image-to-text competition

BMS-Molecular-Translation Introduction This is a pipeline for Bristol-Myers Squibb – Molecular Translation by Vadim Timakin and Maksim Zhdanov. We got

7 Sep 20, 2022

Pipeline for fast building text classification TF-IDF + LogReg baselines.

Text Classification Baseline Pipeline for fast building text classification TF-IDF + LogReg baselines. Usage Instead of writing custom code for specif

57 Dec 7, 2022

A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Multilingual Latent Dirichlet Allocation (LDA) Pipeline This project is for text clustering using the Latent Dirichlet Allocation (LDA) algorithm. It

74 Oct 7, 2022

Releases(v0.1.0)

v0.1.0(Oct 8, 2021)

First Release! 🥳🎉🍾
Source code(tar.gz)
Source code(zip)

Pipeline for training LSA models using Scikit-Learn.

Related tags

Overview

Latent Semantic Analysis

Usage

Config

Output

Requirements

Citation

You might also like...

This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Pipeline for chemical image-to-text competition

Pipeline for fast building text classification TF-IDF + LogReg baselines.

A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

MHtyper is an end-to-end pipeline for recognized the Forensic microhaplotypes in Nanopore sequencing data.

BookNLP, a natural language processing pipeline for books

Vad-sli-asr - A Python scripts for a speech processing pipeline with Voice Activity Detection (VAD)

Releases(v0.1.0)

v0.1.0(Oct 8, 2021)

Owner

Dani El-Ayyass

BiQE: Code and dataset for the BiQE paper

Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

Longformer: The Long-Document Transformer

Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 B) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed

SimCTG - A Contrastive Framework for Neural Text Generation

Transcribing audio files using Hugging Face's implementation of Wav2Vec2 + "chain-linking" NLP tasks to combine speech-to-text with downstream tasks like translation and summarisation.

ChainKnowledgeGraph, 产业链知识图谱包括A股上市公司、行业和产品共3类实体

PIZZA - a task-oriented semantic parsing dataset

What are the best Systems? New Perspectives on NLP Benchmarking

The guide to tackle with the Text Summarization

Code for EMNLP20 paper: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training"

Unsupervised Language Model Pre-training for French

Labelling platform for text using distant supervision

Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

A NLP program: tokenize method, PoS Tagging with deep learning

A relatively simple python program to generate one of those reddit text to speech videos dominating youtube.

超轻量级bert的pytorch版本，大量中文注释，容易修改结构，持续更新

Arabic speech recognition, classification and text-to-speech.