GULAG: GUessing LAnGuages with neural networks

cannon on sparrows

Classify languages in text via neural networks.

> Привет! My name is Egor. Was für ein herrliches Frühlingswetter, хутка расцвітуць дрэвы.
ru -- Привет
en -- My name is Egor
de -- Was für ein herrliches Frühlingswetter
be -- хутка расцвітуць дрэвы

Usage

Use requirements.txt to install necessary dependencies:

pip install -r requirements.txt

After that, you can either train a model:

python -m src.main train --gin-file config/train.gin

Or run inference:

python -m src.main infer

Training

All training details are handled by PyTorch-Lightning, split across two modules. Both modules have explicit documentation; see the source files for usage details.

Dataset

Since extracting languages from a text is a somewhat synthetic task, there is no dedicated dataset for it. A possible approach is to use general multilingual corpora to create a synthetic dataset with multiple languages per text. There is the popular mC4 dataset with large texts in over 100 languages, but it is too large for this pet project. Therefore, I used the wikiann dataset, which also supports over 100 languages, including Russian, Ukrainian, Belarusian, Kazakh, Azerbaijani, Armenian, Georgian, Hebrew, English, and German. However, this dataset consists only of short sentences intended for NER, which makes the resulting texts less natural.

Synthetic data

To create a dataset with multiple languages per example, I use the following sampling strategy:

  1. Select the number of languages in the next example
  2. Select the number of sentences for each language
  3. Sample sentences, shuffle them, and concatenate them into a single text

For the exact details of the sampling algorithm, see the generate_example method.
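
As a rough illustration, the three steps above could be sketched as follows. This is not the repository's generate_example method: the function name sample_example, its parameters (max_languages, max_sentences), and the toy corpus are assumptions made for this example only.

```python
import random


def sample_example(sentences_by_lang, max_languages=3, max_sentences=2, rng=random):
    """Build one synthetic multilingual text (hypothetical sketch)."""
    # 1. Select the number of languages in the next example.
    n_languages = rng.randint(1, min(max_languages, len(sentences_by_lang)))
    languages = rng.sample(sorted(sentences_by_lang), n_languages)

    # 2. Select the number of sentences for each language and sample them.
    labelled = []
    for lang in languages:
        n_sentences = rng.randint(1, max_sentences)
        for sentence in rng.choices(sentences_by_lang[lang], k=n_sentences):
            labelled.append((lang, sentence))

    # 3. Shuffle the sentences and concatenate them into a single text.
    rng.shuffle(labelled)
    text = " ".join(sentence for _, sentence in labelled)
    return text, labelled


# Toy corpus for illustration; the real data comes from wikiann.
corpus = {
    "ru": ["Привет", "Вперёд"],
    "en": ["My name is Egor"],
    "de": ["Nach vorne"],
}
text, labelled = sample_example(corpus, rng=random.Random(0))
print(text)
print(labelled)
```

Keeping the (language, sentence) pairs next to the text makes it easy to derive per-token labels later.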

This strategy allows training on a large non-repeating corpus. For proper evaluation during training, however, we need a deterministic subset of data. For that, we can pre-generate a fixed set of texts and reuse them on every validation run.
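
One way to get such a deterministic subset is to drive the generator with a dedicated, seeded random number generator. The sketch below is an assumption about how this could be done, not the repository's code; pregenerate and toy_generate are hypothetical names.

```python
import random


def pregenerate(generate, n_examples, seed):
    """Call generate(rng) n_examples times using a dedicated seeded RNG."""
    rng = random.Random(seed)
    return [generate(rng) for _ in range(n_examples)]


def toy_generate(rng):
    # Toy generator standing in for the real sampling routine.
    return " ".join(rng.choices(["Привет", "Hello", "Hallo"], k=3))


# The same seed always yields the same validation texts.
val_a = pregenerate(toy_generate, n_examples=5, seed=7)
val_b = pregenerate(toy_generate, n_examples=5, seed=7)
print(val_a == val_b)
```

Because the validation RNG is separate from the one used for training, the training stream can stay non-repeating while the validation set stays fixed.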

Model

As a training objective, I selected per-token classification. This automatically allows not only classifying languages in the text, but also specifying their ranges.
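
To show how per-token predictions give language ranges, here is a minimal sketch that merges consecutive tokens with the same predicted label. It assumes whitespace-separated word tokens for readability (the actual model works on BERT tokens), and merge_token_labels is a hypothetical helper, not a function from the repository.

```python
from itertools import groupby


def merge_token_labels(tokens, labels):
    """Collapse consecutive tokens with the same label into (label, text) ranges."""
    ranges = []
    pairs = zip(tokens, labels)
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        ranges.append((label, " ".join(token for token, _ in group)))
    return ranges


tokens = ["Привет", "My", "name", "is", "Egor"]
labels = ["ru", "en", "en", "en", "en"]
for label, text in merge_token_labels(tokens, labels):
    print(f"{label} -- {text}")
# ru -- Привет
# en -- My name is Egor
```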

The model consists of two parts:

  1. A backbone model that embeds tokens into vectors
  2. A head classifier that predicts a class from each embedding vector

As the backbone model, I selected vanilla BERT. This model is already pretrained on large multilingual corpora, including less common languages. During training on the target task, the BERT weights were frozen to speed up training.

The head classifier is a simple MLP; see TokenClassifier for details.
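
The two-part design can be sketched in PyTorch as below. This is an illustrative assumption, not the repository's TokenClassifier: the class name FrozenBackboneTagger, the layer sizes, and the nn.Embedding stand-in used instead of BERT are all hypothetical.

```python
import torch
from torch import nn


class FrozenBackboneTagger(nn.Module):
    """Frozen embedding backbone followed by a trainable MLP head (sketch)."""

    def __init__(self, backbone, hidden_size, n_classes, mlp_size=128):
        super().__init__()
        self.backbone = backbone
        # Freeze the backbone so only the head receives gradient updates.
        for param in self.backbone.parameters():
            param.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(hidden_size, mlp_size),
            nn.ReLU(),
            nn.Linear(mlp_size, n_classes),
        )

    def forward(self, token_ids):
        with torch.no_grad():
            embeddings = self.backbone(token_ids)  # [batch, seq_len, hidden_size]
        return self.head(embeddings)  # [batch, seq_len, n_classes]


# Stand-in backbone: a plain embedding table instead of BERT.
backbone = nn.Embedding(num_embeddings=1000, embedding_dim=32)
model = FrozenBackboneTagger(backbone, hidden_size=32, n_classes=10)
logits = model(torch.randint(0, 1000, (2, 7)))
print(logits.shape)
```

With the backbone frozen, only the head's parameters need to be passed to the optimizer, which is where the speed-up comes from.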

Configuration

To handle the large variety of parameters, I used gin-config. The config folder contains all configurations, split by the modules that use them.

Use the --gin-file CLI argument to specify a config file and --gin-param to manually override some values. For example, to run debug mode on a small subset with a tiny model for 10 steps, use:

python -m src.main train --gin-file config/debug.gin --gin-param="train.n_steps = 10"

You can also run training from a Jupyter notebook, which is a convenient way to train with Google Colab; see train.ipynb.

Artifacts

All training logs and artifacts are stored on W&B. See voudy/gulag for information about current runs, their losses, and their metrics. Any of the models presented there may be used for inference.

Inference

In inference mode, you can play with the model to see how well it works. The script requires the W&B run path where the checkpoint is stored and the checkpoint name. After that, you can interact with the model in a loop.

The final model is stored in the voudy/gulag/a55dbee8 run. It was trained for 20 000 steps, which took ~9 hours on a Tesla T4.

$ python -m src.main infer --wandb "voudy/gulag/a55dbee8" --ckpt "step_20000.ckpt"
...
Enter text to classify languages (Ctrl-C to exit):
> İrəli! Вперёд! Nach vorne!
az -- İrəli
ru -- Вперёд
de -- Nach vorne
Enter text to classify languages (Ctrl-C to exit):
> Давайте жити дружно
uk -- Давайте жити дружно
> ...

For now, text preprocessing removes all punctuation and digits, which makes the data more robust. Restoring them in the output is straightforward technical work that I have not done yet.
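
A minimal sketch of this kind of preprocessing, assuming Unicode character categories are used to decide what to drop (the repository may implement it differently; strip_punctuation_and_digits is a hypothetical name):

```python
import unicodedata


def strip_punctuation_and_digits(text):
    """Replace punctuation and number characters with spaces, then collapse whitespace."""
    kept = []
    for char in text:
        category = unicodedata.category(char)
        # Categories starting with "P" are punctuation, "N" are numbers.
        if category[0] in ("P", "N"):
            kept.append(" ")
        else:
            kept.append(char)
    return " ".join("".join(kept).split())


print(strip_punctuation_and_digits("İrəli! Вперёд! Nach vorne! 123"))
# İrəli Вперёд Nach vorne
```

Working on Unicode categories rather than an ASCII punctuation list keeps letters such as İ, ə, and ё intact.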

You can also use the model from a Jupyter notebook; see infer.ipynb.

Further work

Next steps may include:

  • An improved dataset with more natural examples, e.g. by adapting mC4.
  • Better tokenization to handle rare languages; this should help with errors at the boundaries between texts in similar languages.
  • Experiments with other embedders, e.g. mGPT-3 from Sber, which covers all languages of interest but requires technical work to adapt it to the classification task.

Owner

Egor Spirin