Pipeline for quickly building TF-IDF + LogReg text classification baselines.

Last update: Dec 07, 2022

Overview

Text Classification Baseline

Usage

Instead of writing custom code for a specific text classification task, you just need to:

install pipeline:

pip install text-classification-baseline

run pipeline:

either in terminal:

text-clf-train

or in python:

import text_clf

text_clf.train()

No data preparation is needed, only a CSV file with two raw columns (with arbitrary names):

text
target

NOTE: the target can be presented in any format, including text - not necessarily integers from 0 to n_classes-1.
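
As an illustration of the expected input, the sketch below writes such a file with Python's standard csv module. The column names (review, sentiment), the rows, and the file location are made up for the example; whichever names you choose would then be set as text_column and target_column in config.yaml.

```python
import csv
import os
import tempfile

# Hypothetical rows: the column names and labels are illustrative only.
# Targets are plain strings, not integers from 0 to n_classes-1.
rows = [
    {"review": "great phone, battery lasts for days", "sentiment": "positive"},
    {"review": "stopped working after a week", "sentiment": "negative"},
    {"review": "does what it says, nothing more", "sentiment": "neutral"},
]


def write_dataset(path, rows, text_column, target_column):
    """Write rows to a two-column CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[text_column, target_column])
        writer.writeheader()
        writer.writerows(rows)


# A temporary folder keeps the example from touching real data.
path = os.path.join(tempfile.mkdtemp(), "train.csv")
write_dataset(path, rows, "review", "sentiment")
```

The csv module quotes the first review automatically because it contains the separator, so the file still has exactly two columns.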

Config

The user interface consists of only one file, config.yaml.

Change config.yaml to create the desired configuration and train a text classification model with the following command:

terminal:

text-clf-train --path_to_config config.yaml

python:

import text_clf

text_clf.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
verbose: true
path_to_save_folder: models

# data
data:
  train_data_path: data/train.csv
  valid_data_path: data/valid.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 0.0

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  multi_class: auto
  n_jobs: -1

NOTE: the tf-idf and logreg sections hold sklearn TfidfVectorizer and LogisticRegression parameters respectively, so you can parameterize instances of these classes however you want.
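
A minimal sketch of how the two config sections could be turned into keyword arguments for those classes. The dictionaries mirror what a YAML loader would return for the default config; the helper name to_kwargs is hypothetical, and whether the package converts values in exactly this way is an assumption. One detail worth noting: "(1, 1)" is not YAML tuple syntax, so ngram_range is loaded as a string and has to be parsed into a tuple before sklearn can use it.

```python
import ast

# Values as a YAML loader would return them for the default config.
# ngram_range arrives as the string "(1, 1)", not as a tuple.
tfidf_config = {
    "lowercase": True,
    "ngram_range": "(1, 1)",
    "max_df": 1.0,
    "min_df": 0.0,
}
logreg_config = {
    "penalty": "l2",
    "C": 1.0,
    "class_weight": "balanced",
    "solver": "saga",
    "multi_class": "auto",
    "n_jobs": -1,
}


def to_kwargs(section):
    """Copy a config section and parse a string ngram_range into a tuple."""
    kwargs = dict(section)
    ngram_range = kwargs.get("ngram_range")
    if isinstance(ngram_range, str):
        # literal_eval only accepts Python literals, so it is safe for config text.
        kwargs["ngram_range"] = ast.literal_eval(ngram_range)
    return kwargs


tfidf_kwargs = to_kwargs(tfidf_config)
logreg_kwargs = to_kwargs(logreg_config)

# With scikit-learn installed, these could be passed on as
# TfidfVectorizer(**tfidf_kwargs) and LogisticRegression(**logreg_kwargs).
```

Because the sections are forwarded as keyword arguments, any other parameter accepted by the two sklearn classes can be added to config.yaml in the same way.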

Output

After training the model, the pipeline produces the following files:

model.joblib - sklearn pipeline with TF-IDF and LogReg steps
target_names.json - mapping from encoded target labels (0 to n_classes-1) to their names
config.yaml - config that was used to train the model
logging.txt - logging file
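
The sketch below shows one way target_names.json could be used to turn encoded predictions back into the original labels. The exact file layout is an assumption: JSON object keys are always strings, so the example stores the encoded labels as "0", "1", "2" and converts them back to integers on load. The label names, the helper load_target_names, and the prediction values are illustrative.

```python
import json
import os
import tempfile

# Assumed layout of target_names.json: encoded label (as a string key) -> name.
example_mapping = {"0": "negative", "1": "neutral", "2": "positive"}

# Write a stand-in file to a temporary folder so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "target_names.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(example_mapping, f)


def load_target_names(path):
    """Load the mapping and convert the string keys back to integers."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return {int(label): name for label, name in raw.items()}


target_names = load_target_names(path)

# Hypothetical encoded predictions, as a fitted model might output them.
encoded_predictions = [2, 0, 1]
decoded = [target_names[label] for label in encoded_predictions]
```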

Requirements

Python >= 3.6

Citation

If you use text-classification-baseline in a scientific publication, we would appreciate references to the following BibTeX entry:

@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}

Releases(v0.1.6)

v0.1.6(Nov 6, 2021)
Release v0.1.6

fixed token frequency support (#85)

fixed threshold selection for binary classification (#86)

v0.1.5(Oct 21, 2021)
Release v0.1.5 🥳🎉🍾

added pymorphy2 lemmatization (#81)

added token frequency support (#85)

added threshold selection for binary classification (#86)

added arbitrary save folder name (#83)

pymorphy2 lemmatization (config.yaml)

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: pymorphy2

token frequency support

text_clf.token_frequency.get_token_frequency(path_to_config) -
get token frequency of train dataset according to the config file parameters
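
To make the idea concrete, here is a rough sketch of what counting token frequency over a training set involves. It is not the package's implementation: the function name and sample texts are hypothetical, and tokens are split on whitespace for brevity, whereas sklearn's TfidfVectorizer by default extracts tokens of two or more alphanumeric characters with a regular expression.

```python
from collections import Counter


def count_token_frequency(texts, lowercase=True):
    """Count how often each whitespace-separated token occurs across texts."""
    counter = Counter()
    for text in texts:
        if lowercase:
            text = text.lower()
        counter.update(text.split())
    return counter


# Hypothetical training texts.
texts = ["Good movie", "good plot but slow", "Slow start"]
token_frequency = count_token_frequency(texts)

# The most frequent tokens first, e.g. to spot candidates for stop words.
most_common = token_frequency.most_common(2)
```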

threshold selection for binary classification

text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder) -
get precision and recall metrics for precision-recall curve

text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder) -
get false positive rate (fpr) and true positive rate (tpr) metrics for roc curve

text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall) -
plot precision-recall curve

text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr) -
plot roc curve

text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds) -
plot precision, recall, f1-score curves for probability thresholds
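
For readers unfamiliar with threshold selection, the sketch below shows the underlying idea in plain Python: compute precision, recall, and F1 for several probability thresholds and keep the threshold with the best F1. It does not call the functions listed above and is not the package's implementation; the labels, probabilities, candidate thresholds, and the choice of F1 as the selection criterion are all assumptions made for the example.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Binarize probabilities at threshold and return (precision, recall, f1)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical true labels and predicted positive-class probabilities.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]

# Candidate thresholds to compare.
thresholds = [0.3, 0.5, 0.7]
scores = {t: precision_recall_f1(y_true, y_prob, t) for t in thresholds}

# Keep the threshold with the highest F1 score.
best_threshold = max(thresholds, key=lambda t: scores[t][2])
```

Raising the threshold generally trades recall for precision, which is why plotting both against the threshold helps when the two kinds of error have different costs.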

arbitrary save folder name (config.yaml)

experiment_name: model
v0.1.4(Oct 10, 2021)
fixed load_20newsgroups.py (#65 #71)

added Makefile (#71)

added logging confusion matrix (#72)

replaced all "valid" occurrences with "test" (#74)

updated docstrings (#77)

changed python interface - train function returns model and target_names_mapping (#78)

v0.1.3(Sep 2, 2021)
added hyper-parameters tuning (#58)

v0.1.2(Aug 19, 2021)
fixed bug with multiple logging (#55)

v0.1.1(Aug 11, 2021)
added logging (#43)

added unittests (#49)

added CI with linter, tests, codecov (#46 #49)

added docker (#48)

v0.1.0(Aug 7, 2021)

First release.

Owner

Dani El-Ayyass

NLP Tech Lead @ Sber AI, Master Student in Applied Mathematics and Computer Science @ CMC MSU

PyPI project page: https://pypi.org/project/text-classification-baseline/
