GULAG: GUessing LAnGuages with neural networks

Last update: Sep 02, 2022

Classify languages in text via neural networks.

> Привет! My name is Egor. Was für ein herrliches Frühlingswetter, хутка расцвітуць дрэвы.
ru -- Привет
en -- My name is Egor
de -- Was für ein herrliches Frühlingswetter
be -- хутка расцвітуць дрэвы

Usage

Use requirements.txt to install necessary dependencies:

pip install -r requirements.txt

After that, you can either train a model:

python -m src.main train --gin-file config/train.gin

Or run inference:

python -m src.main infer

Training

Training is handled by PyTorch Lightning. There are two main components:

MultiLanguageClassifier: lightning module that encapsulates model details
MultiLanguageClassificationDataModule: lightning data module that encapsulates data details

Both modules are documented in detail; see the source files for usage details.

Dataset

Extracting languages from a text is a somewhat synthetic task, so there is no dataset built exactly for it. One possible approach is to use general multilingual corpora to create a synthetic dataset with multiple languages per text. There is the popular mC4 dataset with large texts in over 100 languages, but it is too large for this pet project. Therefore, I used the wikiann dataset, which also supports over 100 languages, including Russian, Ukrainian, Belarusian, Kazakh, Azerbaijani, Armenian, Georgian, Hebrew, English, and German. However, this dataset consists only of short sentences intended for NER, which makes the resulting examples less natural.

Synthetic data

To create a dataset with multiple languages per example, I use the following sampling strategy:

Select the number of languages in the next example
Select the number of sentences for each language
Sample sentences, shuffle them, and concatenate them into a single text

For the exact details of the sampling algorithm, see the generate_example method.
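The three steps above can be sketched roughly as follows. This is an illustration only, not the repository's generate_example: the function name generate_example_sketch, the parameters max_languages and max_sentences, the uniform sampling, and the tiny corpus are all assumptions made for the example.

```python
import random


def generate_example_sketch(sentences_by_lang, rng, max_languages=3, max_sentences=2):
    """Build one synthetic multilingual text plus per-sentence language labels."""
    # Step 1: choose how many languages appear in this example, then which ones.
    n_languages = rng.randint(1, min(max_languages, len(sentences_by_lang)))
    languages = rng.sample(sorted(sentences_by_lang), n_languages)

    # Step 2: choose how many sentences each selected language contributes.
    labeled = []
    for lang in languages:
        pool = sentences_by_lang[lang]
        n_sentences = rng.randint(1, min(max_sentences, len(pool)))
        for sentence in rng.sample(pool, n_sentences):
            labeled.append((lang, sentence))

    # Step 3: shuffle the sentences and concatenate them into a single text.
    rng.shuffle(labeled)
    text = " ".join(sentence for _, sentence in labeled)
    return text, labeled


# Hypothetical toy corpus; the project samples from wikiann instead.
corpus = {
    "ru": ["Привет", "Вперёд"],
    "en": ["My name is Egor", "Hello"],
    "de": ["Nach vorne", "Guten Tag"],
}
text, labeled = generate_example_sketch(corpus, random.Random(0))
print(text)
print(labeled)
```

Keeping the language label next to each sentence while shuffling means the labels for training can be derived from the same list that produced the text.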

This strategy allows training on a large non-repeating corpus. For proper evaluation during training, however, we need a deterministic subset of the data. To get one, we can pre-generate a set of texts once and reuse them at every validation run.
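One way to get such a deterministic subset is to generate it from a dedicated, seeded random generator. The sketch below assumes this approach; the names pregenerate and toy_example and the seed value are hypothetical, and the project's data module may do this differently.

```python
import random


def toy_example(rng):
    # Stand-in for the real example generator; it only needs to draw from rng.
    words = ["hello", "hallo", "Привет", "хутка"]
    return " ".join(rng.choices(words, k=3))


def pregenerate(generate, n_examples, seed):
    """Build a fixed list of examples from a private, seeded generator."""
    rng = random.Random(seed)
    return [generate(rng) for _ in range(n_examples)]


# The same seed always yields the same validation texts.
validation_set = pregenerate(toy_example, n_examples=5, seed=7)
print(validation_set)
```

Because the generator is private to pregenerate, the validation texts do not depend on how many training examples were sampled before them.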

Model

As a training objective, I selected per-token classification. This makes it possible not only to classify the languages in a text, but also to locate the span each language covers.
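To illustrate how per-token labels turn into language spans, here is a minimal sketch that merges consecutive tokens sharing the same predicted label. The function name group_spans and the whitespace-level tokens are assumptions for the example; the actual model works on its tokenizer's subword tokens.

```python
from itertools import groupby


def group_spans(tokens, labels):
    """Merge consecutive tokens that share a label into (label, text) spans."""
    spans = []
    pairs = zip(tokens, labels)
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        spans.append((label, " ".join(token for token, _ in group)))
    return spans


tokens = ["Привет", "My", "name", "is", "Egor"]
labels = ["ru", "en", "en", "en", "en"]
for label, text in group_spans(tokens, labels):
    print(f"{label} -- {text}")
# ru -- Привет
# en -- My name is Egor
```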

The model consists of two parts:

A backbone model that embeds tokens into vectors
A head classifier that predicts a class from each embedding vector

As the backbone model, I selected vanilla BERT. This model is already pretrained on large multilingual corpora, including less popular languages. During training on the target task, the BERT weights were frozen to speed up training.

The head classifier is a simple MLP; see TokenClassifier for details.
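The two-part layout can be sketched in plain PyTorch as below. This is not the repository's TokenClassifier: the class name TokenClassifierSketch, the layer sizes, and the nn.Embedding stand-in for BERT are assumptions chosen so the example runs without downloading a pretrained model.

```python
import torch
from torch import nn


class TokenClassifierSketch(nn.Module):
    """Frozen token embedder followed by a small trainable MLP head."""

    def __init__(self, backbone, hidden_size, n_languages):
        super().__init__()
        self.backbone = backbone
        # Freeze the backbone so that only the head receives gradient updates.
        for param in self.backbone.parameters():
            param.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_languages),
        )

    def forward(self, token_ids):
        # token_ids: [batch, seq_len] -> logits: [batch, seq_len, n_languages]
        with torch.no_grad():
            embeddings = self.backbone(token_ids)
        return self.head(embeddings)


# Stand-in backbone for illustration only; the project uses BERT here.
backbone = nn.Embedding(num_embeddings=1000, embedding_dim=32)
model = TokenClassifierSketch(backbone, hidden_size=32, n_languages=10)
logits = model(torch.randint(0, 1000, (2, 7)))
print(logits.shape)
```

Each token gets its own vector of class scores, so a per-token cross-entropy loss can be applied directly to the output.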

Configuration

To handle the large variety of parameters, I used gin-config. The config folder contains all configurations, split by the modules that use them.

Use the --gin-file CLI argument to specify a config file and --gin-param to manually override individual values. For example, to run in debug mode on a small subset with a tiny model for 10 steps, use:

python -m src.main train --gin-file config/debug.gin --gin-param="train.n_steps = 10"

You can also run training from a Jupyter notebook, which is a convenient way to train on Google Colab. See train.ipynb.

Artifacts

All training logs and artifacts are stored on W&B. See voudy/gulag for information about current runs, their losses, and metrics. Any of the models presented there can be used for inference.

Inference

In inference mode, you can play with the model to see how well it performs. The script requires the W&B run path where the checkpoint is stored and the checkpoint name. After that, you can interact with the model in a loop.

The final model is stored in the voudy/gulag/a55dbee8 run. It was trained for 20,000 steps, which took ~9 hours on a Tesla T4.

$ python -m src.main infer --wandb "voudy/gulag/a55dbee8" --ckpt "step_20000.ckpt"
...
Enter text to classify languages (Ctrl-C to exit):
> İrəli! Вперёд! Nach vorne!
az -- İrəli
ru -- Вперёд
de -- Nach vorne
Enter text to classify languages (Ctrl-C to exit):
> Давайте жити дружно
uk -- Давайте жити дружно
> ...

For now, text preprocessing removes all punctuation and digits, which makes the data more robust. Restoring them in the output is straightforward technical work that I was too lazy to do.
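A preprocessing step of this kind could look like the sketch below. It is an assumption about the approach, not the repository's code: the function name strip_punct_and_digits is hypothetical, and replacing removed characters with a space (rather than deleting them) is a choice made here to avoid gluing neighbouring words together.

```python
import re


def strip_punct_and_digits(text):
    """Drop punctuation, symbols, digits, and underscores; collapse whitespace."""
    # [^\w\s] matches anything that is neither a word character nor whitespace;
    # digits and "_" count as word characters, so they are listed separately.
    cleaned = re.sub(r"[^\w\s]|[\d_]", " ", text)
    return " ".join(cleaned.split())


print(strip_punct_and_digits("Привет, мир! Hello, world 42"))
# Привет мир Hello world
```

Python's re module treats \w as Unicode-aware for str patterns, so letters from Cyrillic and other scripts are kept.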

You can also use the model from a Jupyter notebook; see infer.ipynb.

Further work

Next steps may include:

An improved dataset with more natural examples, e.g. by adopting mC4.
Better tokenization to handle rare languages; this should help with problems at the boundaries between similar texts.
Experiments with other embedders, e.g. mGPT-3 from Sber covers all the languages of interest but requires technical work to adapt it to the classification task.

Owner

Egor Spirin
