
Text Classification Baseline

Pipeline for quickly building text classification baselines with TF-IDF + LogReg.

Usage

Instead of writing custom code for a specific text classification task, you just need to:

  1. install the pipeline:
pip install text-classification-baseline
  2. run the pipeline:
  • either in terminal:
text-clf-train --path_to_config config.yaml
  • or in python:
import text_clf

model, target_names_mapping = text_clf.train(path_to_config="config.yaml")
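
The returned model is a fitted sklearn pipeline that accepts raw texts, and target_names_mapping is assumed here to map the encoded class ids back to the original label names, so predictions can be decoded directly (a minimal sketch continuing the snippet above; the example texts are placeholders):

# predict encoded labels for new raw texts (placeholder examples)
predicted_ids = model.predict(["some raw text", "another raw text"])

# decode them back to the original target names
# (assuming target_names_mapping maps encoded id -> original name)
predicted_names = [target_names_mapping[label] for label in predicted_ids]
print(predicted_names)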

NOTE: more about the config file in the Config section below.

No data preparation is needed; all that is required is a CSV file with two raw columns (with arbitrary names):

  • text
  • target

The target can be presented in any format, including text; it does not have to be integers from 0 to n_classes-1.
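
For example, a minimal train.csv could look like this (the rows and column names below are purely illustrative; if your columns are named differently, set text_column and target_column in config.yaml accordingly):

text,target
"This laptop has great battery life",electronics
"The sourdough recipe turned out dense",cooking
"New GPU drivers improved the frame rate",electronics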

Config

The user interface consists of two files:

  • config.yaml - general configuration with sklearn TF-IDF and LogReg parameters
  • hyperparams.py - sklearn GridSearchCV parameters

Edit config.yaml and hyperparams.py to create the desired configuration, then train a text classification model in one of the following ways:

  • terminal:
text-clf-train --path_to_config config.yaml
  • python:
import text_clf

model, target_names_mapping = text_clf.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
path_to_save_folder: models
experiment_name: model

# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: null  # pymorphy2

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py

NOTE: grid search is disabled by default; to enable it, set do_grid_search: true.

NOTE: tf-idf and logreg are sklearn TfidfVectorizer and LogisticRegression parameters respectively, so you can parameterize instances of these classes however you want. The same applies to grid-search, which configures sklearn GridSearchCV using the parameters in hyperparams.py.
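
For reference, a hedged sketch of what hyperparams.py might contain (the variable name, the pipeline step prefixes tf-idf__ / logreg__, and the exact keys are assumptions; check the hyperparams.py shipped with the package):

# hyperparams.py (sketch, not the packaged defaults)
# assumed to be passed to sklearn GridSearchCV by the pipeline
grid_search_params = {
    "param_grid": {
        "tf-idf__ngram_range": [(1, 1), (1, 2)],  # assumes the TF-IDF step is named "tf-idf"
        "logreg__C": [0.01, 0.1, 1.0, 10.0],      # assumes the LogReg step is named "logreg"
    },
    "cv": 3,
    "scoring": "f1_macro",
}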

Output

After training the model, the pipeline saves the following files:

  • model.joblib - sklearn pipeline with TF-IDF and LogReg steps
  • target_names.json - mapping from the encoded target labels (0 to n_classes-1) to their original names
  • config.yaml - config that was used to train the model
  • hyperparams.py - grid-search parameters (if grid-search was used)
  • logging.txt - logging file
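
A hedged sketch of loading these artifacts for inference (the folder path below is illustrative; the actual save folder is derived from path_to_save_folder and experiment_name in config.yaml):

import json
import joblib

path_to_model_folder = "models/model"  # illustrative path

# load the fitted TF-IDF + LogReg pipeline
model = joblib.load(f"{path_to_model_folder}/model.joblib")

# load the mapping from encoded labels to original target names
with open(f"{path_to_model_folder}/target_names.json") as f:
    target_names = json.load(f)

predicted_ids = model.predict(["some raw text"])
print([target_names[str(label)] for label in predicted_ids])  # JSON keys are strings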

Additional functions

  • text_clf.token_frequency.get_token_frequency(path_to_config) -
    get token frequencies of the training dataset according to the config file parameters

Only for binary classifiers:

  • text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder) -
    get precision and recall metrics for the precision-recall curve
  • text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder) -
    get false positive rate (fpr) and true positive rate (tpr) metrics for the ROC curve
  • text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall) -
    plot the precision-recall curve
  • text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr) -
    plot the ROC curve
  • text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds) -
    plot precision, recall, and f1-score curves over probability thresholds
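
For a binary classifier these can be chained as follows (a sketch; the model folder path is illustrative, and the return values are assumed to mirror sklearn's precision_recall_curve / roc_curve, including thresholds):

from text_clf.pr_roc_curve import (
    get_precision_recall_curve,
    get_roc_curve,
    plot_precision_recall_curve,
    plot_precision_recall_f1_curves_for_thresholds,
    plot_roc_curve,
)

path_to_model_folder = "models/model"  # illustrative path

precision, recall, thresholds = get_precision_recall_curve(path_to_model_folder)  # assumed sklearn-like 3-tuple
fpr, tpr, _ = get_roc_curve(path_to_model_folder)                                 # assumed sklearn-like 3-tuple

plot_precision_recall_curve(precision, recall)
plot_roc_curve(fpr, tpr)
plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds)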

Requirements

Python >= 3.6

Citation

If you use text-classification-baseline in a scientific publication, we would appreciate references to the following BibTeX entry:

@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}