OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)

Overview

OCTIS : Optimizing and Comparing Topic Models is Simple!

OCTIS (Optimizing and Comparing Topic models Is Simple) aims at training, analyzing and comparing Topic Models, whose optimal hyper-parameters are estimated by means of a Bayesian Optimization approach.

Install

You can install OCTIS with the following command:

pip install octis

You can find the requirements in the requirements.txt file.

Features

  • Preprocess your own dataset or use one of the already-preprocessed benchmark datasets
  • Well-known topic models (both classical and neural)
  • Evaluate your model using different state-of-the-art evaluation metrics
  • Optimize the models' hyperparameters for a given metric using Bayesian Optimization
  • Python library for advanced usage or simple web dashboard for starting and controlling the optimization experiments

Examples and Tutorials

To easily understand how to use OCTIS, we invite you to try our tutorials out :)

Name Link
How to build a topic model and evaluate the results (LDA on 20Newsgroups) Open In Colab
How to optimize the hyperparameters of a neural topic model (CTM on M10) Open In Colab

Load a preprocessed dataset

You can load one of the already-preprocessed datasets as follows:

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

Just use one of the dataset names listed below. Note: it is case-sensitive!

Available Datasets

Name         Source       # Docs  # Words  # Labels
20NewsGroup  20Newsgroup  16309   1612     20
BBC_News     BBC-News     2225    2949     5
DBLP         DBLP         54595   1513     4
M10          M10          8355    1696     10

Otherwise, you can load a custom preprocessed dataset in the following way:

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("../path/to/the/dataset/folder")

Make sure that the dataset is in the following format:
  • corpus file: a .tsv file (tab-separated) that contains up to three columns: the document, the partition, and, optionally, the label associated with the document.
  • vocabulary: a .txt file where each line represents a word of the vocabulary

The partition can be "training", "test" or "validation". An example dataset can be found here: sample_dataset.
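
As a rough sketch of the layout described above, the following writes a tiny dataset folder using only Python's standard library. The file names `corpus.tsv` and `vocabulary.txt`, the folder name, and the sample documents are assumptions made for illustration; check the sample dataset for the exact names the loader expects.

```python
import csv
import tempfile
from pathlib import Path

# Toy documents: (text, partition, label). The label column is optional.
rows = [
    ("topic models find latent themes", "training", "nlp"),
    ("bayesian optimization tunes hyperparameters", "training", "ml"),
    ("coherence measures topic quality", "validation", "nlp"),
    ("neural topic models use embeddings", "test", "nlp"),
]


def write_dataset(folder, rows):
    """Write a tab-separated corpus file and a one-word-per-line vocabulary."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)

    # Corpus: one document per line, columns separated by tabs.
    with open(folder / "corpus.tsv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)

    # Vocabulary: every distinct word of the corpus, one per line.
    vocabulary = sorted({word for text, _, _ in rows for word in text.split()})
    (folder / "vocabulary.txt").write_text("\n".join(vocabulary) + "\n", encoding="utf-8")
    return vocabulary


dataset_folder = Path(tempfile.mkdtemp()) / "toy_dataset"
vocabulary = write_dataset(dataset_folder, rows)
```

A folder produced this way could then be passed to `load_custom_dataset_from_folder`.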

Disclaimer

Similarly to TensorFlow Datasets and HuggingFace's nlp library, we just downloaded and prepared public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license and to cite the right owner of the dataset.

If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, please get in touch through a GitHub issue.

If you're a dataset owner and wish to include your dataset in this library, please get in touch through a GitHub issue.

Preprocess

To preprocess a dataset, import the preprocessing class and use the preprocess_dataset method.

import os
import string
from octis.preprocessing.preprocessing import Preprocessing
os.chdir(os.path.pardir)

# Initialize preprocessing
p = Preprocessing(vocabulary=None, max_features=None, remove_punctuation=True, punctuation=string.punctuation,
                  lemmatize=True, remove_stopwords=True, stopword_list=['am', 'are', 'this', 'that'],
                  min_chars=1, min_words_docs=0)
# preprocess
dataset = p.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt')

# save the preprocessed dataset
dataset.save('hello_dataset')

For more details on the preprocessing see the preprocessing demo example in the examples folder.
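
To give a feel for what options such as `remove_punctuation`, `stopword_list`, `min_chars` and `min_words_docs` control, here is a simplified pure-Python sketch. It is not the OCTIS implementation: lemmatization is left out, and the semantics assumed here (`min_chars` as the minimum token length, `min_words_docs` as the minimum number of tokens a document must keep) are an interpretation of the parameter names. The function name `simple_preprocess` is hypothetical.

```python
import string


def simple_preprocess(documents, punctuation=string.punctuation,
                      stopword_list=(), min_chars=1, min_words_docs=0):
    """Lowercase, strip punctuation, drop stopwords and short tokens,
    then discard documents that end up with too few tokens."""
    # Map every punctuation character to a space before splitting.
    strip_table = str.maketrans(punctuation, " " * len(punctuation))
    stopwords = set(stopword_list)
    cleaned = []
    for doc in documents:
        tokens = doc.lower().translate(strip_table).split()
        tokens = [t for t in tokens if t not in stopwords and len(t) >= min_chars]
        if len(tokens) >= min_words_docs:
            cleaned.append(tokens)
    return cleaned


docs = ["This is a test, that much is clear!", "Are we?", "Topic models are fun."]
result = simple_preprocess(docs, stopword_list=["am", "are", "this", "that"],
                           min_chars=2, min_words_docs=2)
```

In this toy run the second document is discarded, because only one token survives the stopword filter.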

Train a model

To build a model, load a preprocessed dataset, set the model hyperparameters and use train_model() to train the model.

from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA

# Load a dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("dataset_folder")

model = LDA(num_topics=25)  # Create model
model_output = model.train_model(dataset) # Train the model

If the dataset is partitioned, you can:

  • Train the model on the training set and test it on the test documents
  • Train the model with the whole dataset, regardless of any partition.

Evaluate a model

To evaluate a model, choose a metric and use the score() method of the metric class.

from octis.evaluation_metrics.diversity_metrics import TopicDiversity

metric = TopicDiversity(topk=10) # Initialize metric
topic_diversity_score = metric.score(model_output) # Compute score of the metric
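
For intuition, topic diversity is commonly defined as the proportion of unique words among the top-k words of all topics. The sketch below computes that quantity from a list of topics in plain Python. It assumes the model output stores the top words under a `topics` key and is meant as an illustration, not as the library's code; `topic_diversity` is a hypothetical helper name.

```python
def topic_diversity(topics, topk=10):
    """Fraction of distinct words among the top-k words of every topic."""
    if not topics:
        raise ValueError("at least one topic is required")
    if any(len(topic) < topk for topic in topics):
        raise ValueError("every topic needs at least topk words")
    unique_words = {word for topic in topics for word in topic[:topk]}
    return len(unique_words) / (topk * len(topics))


# Toy model output: two topics that share the word "team".
toy_output = {
    "topics": [
        ["game", "team", "season", "player"],
        ["space", "nasa", "orbit", "team"],
    ]
}
score = topic_diversity(toy_output["topics"], topk=4)  # 7 unique words out of 8
```

A score close to 1 means the topics barely overlap, while a score close to 0 means they repeat the same words.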

Available metrics

Classification Metrics:

  • F1 measure (F1Score())
  • Precision (PrecisionScore())
  • Recall (RecallScore())
  • Accuracy (AccuracyScore())

Coherence Metrics:

  • UMass Coherence (Coherence({'measure':'c_umass'}))
  • C_V Coherence (Coherence({'measure':'c_v'}))
  • UCI Coherence (Coherence({'measure':'c_uci'}))
  • NPMI Coherence (Coherence({'measure':'c_npmi'}))
  • Word Embedding-based Coherence Pairwise (WECoherencePairwise())
  • Word Embedding-based Coherence Centroid (WECoherenceCentroid())

Diversity Metrics:

  • Topic Diversity (TopicDiversity())
  • InvertedRBO (InvertedRBO())
  • Word Embedding-based InvertedRBO (WordEmbeddingsInvertedRBO())
  • Word Embedding-based InvertedRBO centroid (WordEmbeddingsInvertedRBOCentroid())

Topic significance Metrics:

  • KL Uniform (KL_uniform())
  • KL Vacuous (KL_vacuous())
  • KL Background (KL_background())

Optimize a model

To optimize a model, you need to select a dataset, a metric, and the search space of the hyperparameters to optimize. For the hyperparameter types, we use the scikit-optimize types (https://scikit-optimize.github.io/stable/modules/space.html).

from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real

# Define the search space. To see which hyperparameters to optimize, see the topic model's initialization signature
search_space = {"alpha": Real(low=0.001, high=5.0), "eta": Real(low=0.001, high=5.0)}

# Initialize an optimizer object and start the optimization.
# model and dataset are the objects created in the previous sections;
# eval_metric is a metric object such as the TopicDiversity instance above.
optimizer = Optimizer()
optResult = optimizer.optimize(model, dataset, eval_metric, search_space,
                               save_path="../results",  # path to store the results
                               number_of_call=30,  # number of optimization iterations
                               model_runs=5)  # number of runs of the topic model
# Save the results of the optimization in a CSV file
optResult.save_to_csv("results.csv")

The result provides the best-seen value of the metric with the corresponding hyperparameter configuration, as well as the hyperparameters and metric value for each iteration of the optimization. To visualize this information, set the 'plot' attribute of Bayesian_optimization to True.
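
The bookkeeping behind such a result (a score per iteration plus the best-seen configuration) can be illustrated without any library. The sketch below deliberately uses plain random search, not Bayesian Optimization, and a made-up objective in place of training a topic model. Aggregating the `model_runs` scores with the median is also an assumption made for this example. All names (`random_search`, `toy_metric`) are hypothetical.

```python
import random
import statistics


def toy_metric(alpha, eta, rng):
    """Stand-in for 'train a model and score it': noisy, peaks at alpha=1, eta=2."""
    return -((alpha - 1.0) ** 2 + (eta - 2.0) ** 2) + rng.gauss(0, 0.05)


def random_search(search_space, number_of_call=30, model_runs=5, seed=0):
    rng = random.Random(seed)
    history = []  # one entry per iteration: (configuration, aggregated score)
    best = None   # best-seen (configuration, score) so far
    for _ in range(number_of_call):
        # Sample each hyperparameter uniformly from its (low, high) range.
        config = {name: rng.uniform(low, high)
                  for name, (low, high) in search_space.items()}
        # Repeat the noisy evaluation and aggregate to reduce variance.
        runs = [toy_metric(config["alpha"], config["eta"], rng)
                for _ in range(model_runs)]
        score = statistics.median(runs)
        history.append((config, score))
        if best is None or score > best[1]:
            best = (config, score)
    return best, history


search_space = {"alpha": (0.001, 5.0), "eta": (0.001, 5.0)}
best, history = random_search(search_space)
```

A Bayesian optimizer differs in how it picks the next configuration: it fits a surrogate model to the history instead of sampling uniformly.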

You can find more here: optimizer README

Available Models

Name Implementation
CTM (Bianchi et al. 2020) https://github.com/MilaNLProc/contextualized-topic-models
ETM (Dieng et al. 2020) https://github.com/adjidieng/ETM
HDP (Blei et al. 2004) https://radimrehurek.com/gensim/
LDA (Blei et al. 2003) https://radimrehurek.com/gensim/
LSI (Landauer et al. 1998) https://radimrehurek.com/gensim/
NMF (Lee and Seung 2000) https://radimrehurek.com/gensim/
NeuralLDA (Srivastava and Sutton 2017) https://github.com/estebandito22/PyTorchAVITM
ProdLDA (Srivastava and Sutton 2017) https://github.com/estebandito22/PyTorchAVITM

If you use one of these implementations, make sure to cite the right paper.

If you implemented a model and wish to update any part of it, or do not want your model to be included in this library, please get in touch through a GitHub issue.

If you implemented a model and wish to include your model in this library, please get in touch through a GitHub issue. Otherwise, if you want to include the model by yourself, see the following section.

Implement your own Model

Models inherit from the class AbstractModel defined in octis/models/model.py. To build your own model, your class must override the train_model(self, dataset, hyperparameters) method, which always requires at least a Dataset object and a dictionary of hyperparameters as input and should return a dictionary with the output of the model.

To better understand how a model works, let's have a look at the LDA implementation. The first step in developing a custom model is to define the dictionary of default hyperparameter values:

hyperparameters = {'corpus': None, 'num_topics': 100, 'id2word': None, 'alpha': 'symmetric',
    'eta': None, # ...
    'callbacks': None}

Defining the default hyperparameters values allows users to work on a subset of them without having to assign a value to each parameter.
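
One common way to get this behaviour is to merge the user's dictionary over the defaults, so that only the supplied keys change. This is a generic sketch of the pattern, not necessarily how OCTIS implements it. `merge_hyperparameters` is a hypothetical helper, and the default values are a shortened copy of the dictionary above.

```python
DEFAULTS = {"num_topics": 100, "alpha": "symmetric", "eta": None, "callbacks": None}


def merge_hyperparameters(defaults, user_values=None):
    """Return a new dict in which user-supplied values override the defaults."""
    user_values = user_values or {}
    # Reject misspelled or unsupported keys early instead of ignoring them.
    unknown = set(user_values) - set(defaults)
    if unknown:
        raise KeyError(f"unknown hyperparameters: {sorted(unknown)}")
    return {**defaults, **user_values}


params = merge_hyperparameters(DEFAULTS, {"num_topics": 25, "alpha": 0.1})
```

Because the merge builds a new dictionary, the defaults themselves stay untouched between calls.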

The following step is the train_model() override:

def train_model(self, dataset, hyperparameters={}, top_words=10):

The LDA method requires a dataset, the hyperparameters dictionary, and an extra (optional) argument that selects how many of the most significant words to track for each topic.

With the default hyperparameters, the ones given as input, and the dataset, you should be able to write your own code and return as output a dictionary with at least 3 entries:

  • topics: the list of the most significant words for each topic (list of lists of strings).
  • topic-word-matrix: an NxV matrix of weights where N is the number of topics and V is the vocabulary length.
  • topic-document-matrix: an NxD matrix of weights where N is the number of topics and D is the number of documents in the corpus.

If your model supports the training/test partitioning, it should also return:

  • test-topic-document-matrix: the document topic matrix of the test set.
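
To make the expected output concrete, here is a toy stand-alone sketch. It does not inherit from `AbstractModel`, so that it runs without OCTIS installed, and `ToyDataset` with its `get_corpus()` and `get_vocabulary()` methods is a hypothetical stand-in for the real dataset object. The "model" simply assigns documents to topics round-robin and counts words, which is only meant to show the shape of the three required entries.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset: tokenized documents plus a vocabulary."""

    def __init__(self, corpus):
        self._corpus = corpus
        self._vocabulary = sorted({word for doc in corpus for word in doc})

    def get_corpus(self):
        return self._corpus

    def get_vocabulary(self):
        return self._vocabulary


class RoundRobinModel:
    """Toy model: document i is assigned to topic i % num_topics."""

    default_hyperparameters = {"num_topics": 2}

    def train_model(self, dataset, hyperparameters=None, top_words=10):
        params = {**self.default_hyperparameters, **(hyperparameters or {})}
        n_topics = params["num_topics"]
        corpus, vocab = dataset.get_corpus(), dataset.get_vocabulary()
        index = {word: j for j, word in enumerate(vocab)}

        topic_word = [[0.0] * len(vocab) for _ in range(n_topics)]       # N x V
        topic_document = [[0.0] * len(corpus) for _ in range(n_topics)]  # N x D
        for d, doc in enumerate(corpus):
            topic = d % n_topics
            topic_document[topic][d] = 1.0
            for word in doc:
                topic_word[topic][index[word]] += 1.0

        # Top words per topic, ordered by descending weight.
        topics = [
            [vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:top_words]]
            for row in topic_word
        ]
        return {
            "topics": topics,
            "topic-word-matrix": topic_word,
            "topic-document-matrix": topic_document,
        }


dataset = ToyDataset([["cat", "dog", "cat"], ["stock", "market"], ["dog", "pet"]])
output = RoundRobinModel().train_model(dataset, {"num_topics": 2}, top_words=2)
```

A real model would replace the counting loop with actual training and would subclass `AbstractModel` as described above.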

Dashboard

OCTIS includes a user-friendly graphical interface for creating, monitoring and viewing experiments. As long as you follow the implementation standards for datasets, models and metrics, the dashboard automatically updates and lets you use your own custom implementations.

To run the dashboard, run the following command from the project directory:

python OCTIS/dashboard/server.py

The browser will open and you will be redirected to the dashboard. In the dashboard you can:

  • Create new experiments organized in batch
  • Visualize and compare all the experiments
  • Visualize a custom experiment
  • Manage the experiment queue

How to cite our work

This work has been accepted at the demo track of EACL 2021! You can find it here: https://www.aclweb.org/anthology/2021.eacl-demos.31/ If you decide to use this resource, please cite:

@inproceedings{terragni2020octis,
    title = {{OCTIS}: Comparing and Optimizing Topic Models is Simple!},
    author = {Terragni, Silvia and Fersini, Elisabetta and Galuzzi, Bruno Giovanni and Tropeano, Pietro and Candelieri, Antonio},
    booktitle = {Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations},
    month = apr,
    year = {2021},
    publisher = {Association for Computational Linguistics},
    url = {https://www.aclweb.org/anthology/2021.eacl-demos.31},
    pages = {263--270},
}

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template. Thanks to all the developers that released their topic models' implementations.
