Simple embedding based text classifier inspired by fastText, implemented in tensorflow

Overview

FastText in Tensorflow

This project is based on the ideas in Facebook's FastText but implemented in Tensorflow. However, it is not an exact replica of fastText.

Classification is done by embedding each word, taking the mean embedding over the full text and classifying that with a linear classifier. The embedding is trained together with the classifier. You can also enable character ngrams of length 2 or more. These ngrams are hashed and then embedded in the same way as the original words. Note that ngrams make training much slower but give only marginal improvements in performance, at least in English.
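The classification step described above can be sketched in plain Python. This is only an illustration with random toy parameters, not the project's TensorFlow code; the names `mean_embedding` and `classify`, the tiny vocabulary and the embedding size are all assumptions made for the example.

```python
import math
import random


def mean_embedding(tokens, vocab, table):
    """Average the embedding rows of all tokens found in the vocabulary."""
    rows = [table[vocab[t]] for t in tokens if t in vocab]
    dim = len(table[0])
    if not rows:
        # No known token: fall back to a zero vector (an assumption).
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, vocab, table, weights, bias):
    """Apply a linear layer and softmax to the mean embedding of the text."""
    mean = mean_embedding(tokens, vocab, table)
    logits = [
        sum(w * x for w, x in zip(row, mean)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy, randomly initialised parameters (hypothetical values).
rng = random.Random(0)
vocab = {"good": 0, "bad": 1, "movie": 2}
dim, num_classes = 4, 2
table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = classify("good movie".split(), vocab, table, weights, bias)
```

In the real model the embedding table and the linear layer are trained jointly, so the gradients of the classification loss flow back into the embeddings.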

I may implement skipgram and cbow training later, or preloading of embedding tables.

<< Still WIP >>

You can use Horovod to distribute training across multiple GPUs, on one or multiple servers. See usage section below.

FastText Language Identification

I have added utilities to train a classifier to detect languages, as described in Fast and Accurate Language Identification using FastText.

See usage below. It basically works in the same way as default usage.

Implemented:

  • classification of text using word embeddings
  • char ngrams, hashed to n bins
  • training and prediction program
  • serve models on tensorflow serving
  • preprocess facebook format, or text input into tensorflow records
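As an illustration of the "char ngrams, hashed to n bins" item, here is a minimal sketch of extracting character ngrams from a word and mapping each one to a bucket id. The hash function (`zlib.crc32`), the per-word extraction and the helper names are assumptions chosen for the example; the project may hash differently.

```python
import zlib


def char_ngrams(word, sizes=(2, 3, 4)):
    """Return all character ngrams of the given sizes, shortest size first."""
    return [
        word[i:i + n]
        for n in sizes
        for i in range(len(word) - n + 1)
    ]


def hash_ngrams(word, num_bins, sizes=(2, 3, 4)):
    """Map each ngram to a bucket id in [0, num_bins) with a stable hash."""
    return [
        zlib.crc32(gram.encode("utf-8")) % num_bins
        for gram in char_ngrams(word, sizes)
    ]


grams = char_ngrams("text", sizes=(2, 3))
bucket_ids = hash_ngrams("text", num_bins=1000, sizes=(2, 3))
```

Hashing keeps the ngram embedding table at a fixed size regardless of how many distinct ngrams occur, at the cost of occasional collisions between unrelated ngrams.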

Not Implemented:

  • separate word vector training (though can export embeddings)
  • hierarchical softmax
  • quantize models (supported by tensorflow, but I haven't tried it yet)

Usage

The following are examples of how to use the applications. Run any of the programs with --help for full help.

To transform input data into tensorflow Example format:

process_input.py --facebook_input=queries.txt --output_dir=. --ngrams=2,3,4

Or, using a text file with one example per line with an extra file for labels:

process_input.py --text_input=queries.txt --labels=labels.txt --output_dir=.
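The two input formats carry the same information. The sketch below splits a line in the Facebook (fastText) format into its labels and text, assuming fastText's default `__label__` prefix at the start of each line. The function name is hypothetical, and whether process_input.py accepts several labels per line is not stated here.

```python
LABEL_PREFIX = "__label__"


def split_fasttext_line(line):
    """Split '__label__x some text' into (['x'], 'some text')."""
    labels, words = [], []
    for token in line.split():
        # Only leading tokens count as labels; later ones are treated as text.
        if token.startswith(LABEL_PREFIX) and not words:
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)


labels, text = split_fasttext_line("__label__sports the match ended in a draw")
```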

To train a text classifier:

classifier.py \
  --train_records=queries.tfrecords \
  --eval_records=queries.tfrecords \
  --label_file=labels.txt \
  --vocab_file=vocab.txt \
  --model_dir=model \
  --export_dir=model

To predict classifications for text, use a saved_model from classifier. classifier.py --export_dir stores a saved model in a numbered directory below export_dir. Pass this directory to the following to use that model for predictions:

predictor.py \
  --saved_model=model/12345678 \
  --text="some text to classify" \
  --signature_def=proba
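If you do not want to look up the numbered directory by hand, a small helper can pick the newest one. This is a sketch that assumes the export directories have purely numeric names and that the largest number is the most recent export; `latest_export` is a hypothetical helper, not part of this project.

```python
import os


def latest_export(export_dir):
    """Return the path of the highest-numbered subdirectory, or None."""
    versions = [
        name
        for name in os.listdir(export_dir)
        if name.isdigit() and os.path.isdir(os.path.join(export_dir, name))
    ]
    if not versions:
        return None
    # Compare numerically so that '9' sorts before '10'.
    return os.path.join(export_dir, max(versions, key=int))
```

The returned path can then be passed as the --saved_model value.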

To export the embedding layer, run predictor with the embedding signature. Note that this exports only the text embedding, not the ngram embeddings.

predictor.py \
  --saved_model=model/12345678 \
  --text="some text to classify" \
  --signature_def=embedding

Use the provided script to train easily:

train_classifier.sh path-to-data-directory

Language Identification

To implement something similar to the method described in Fast and Accurate Language Identification using FastText you need to download the data:

lang_dataset.sh [datadir]

You can then process the training and validation data using process_input.py and classifier.py as described above.

There is a utility script to do this for you:

train_langdetect.sh datadir

It reaches about 96% accuracy using word embeddings, and this increases to nearly 99% when adding --ngrams=2,3,4.

Distributed Training

You can run training across multiple GPUs, on one or multiple servers. To do so, install MPI and Horovod, then add the --horovod option. Training speed scales almost linearly with the number of GPUs. For example, with 2 GPUs on your server it should run at close to 2x the speed.

NUM_GPUS=2
mpirun -np $NUM_GPUS python classifier.py \
  --horovod \
  --train_records=queries.tfrecords \
  --eval_records=queries.tfrecords \
  --label_file=labels.txt \
  --vocab_file=vocab.txt \
  --model_dir=model \
  --export_dir=model

The training script, train_classifier.sh, also has this option.

Tensorflow Serving

As well as using predictor.py to run a saved model, you can serve a saved model with Tensorflow Serving in a client-server setup. A simple RPC client (predictor_client.py) is supplied that gets its predictions from the Tensorflow Serving server.

First make sure you install the tensorflow serving binaries. Instructions are here.

You then serve the latest saved model by supplying the base export directory where you exported saved models to. This directory will contain the numbered model directories:

tensorflow_model_server --port=9000 --model_base_path=model

Now you can make requests to the server using gRPC calls. An example simple client is provided in predictor_client.py:

predictor_client.py --text="Some text to classify"

Facebook Examples

<< NOT IMPLEMENTED YET >>

You can compare with Facebook's fastText by running similar examples to what's provided in their repository.

./classification_example.sh
./classification_results.sh
Owner
Alan Patterson