Training open neural machine translation models

Last update: Jan 03, 2023

Overview

Train Opus-MT models

This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. The Makefile gives more details, but the documentation still needs to be improved. Note also that the targets require a specific environment and currently work well only on the CSC HPC cluster in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license. More pre-trained models trained with the OPUS-MT training pipeline are available from the Tatoeba translation challenge, also under a CC-BY 4.0 license.

Quickstart

Setting up:

git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
cd OPUS-MT-train
git submodule update --init --recursive --remote
make install

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
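The three commands above share the same language settings and are listed in the order train, eval, release. As a minimal sketch, assuming each stage should run only after the previous one succeeds, they could be wrapped as follows. The function names `run_stage` and `run_pipeline` and the `DRY_RUN` variable are hypothetical helpers for illustration and are not part of OPUS-MT-train; only the `make` targets and the `SRCLANGS`/`TRGLANGS` variables come from the quickstart.

```shell
#!/bin/sh
# Sketch: run the quickstart stages for one language set, in order.
# run_stage, run_pipeline and DRY_RUN are illustrative names (assumptions),
# not something the OPUS-MT-train Makefile provides.

SRCLANGS="${SRCLANGS:-fi et}"      # source languages, as in the quickstart
TRGLANGS="${TRGLANGS:-da sv en}"   # target languages, as in the quickstart
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print the commands; 0 = invoke make

run_stage() {
    # $1 is one of the make targets shown above: train, eval or release
    if [ "$DRY_RUN" = "1" ]; then
        printf 'make SRCLANGS="%s" TRGLANGS="%s" %s\n' "$SRCLANGS" "$TRGLANGS" "$1"
    else
        make SRCLANGS="$SRCLANGS" TRGLANGS="$TRGLANGS" "$1"
    fi
}

run_pipeline() {
    for stage in train eval release; do
        run_stage "$stage" || return 1   # stop at the first failing stage
    done
}

run_pipeline
```

With the default `DRY_RUN=1` the script only prints the three `make` invocations, which is a cheap way to check the language lists before committing cluster time. Setting `DRY_RUN=0` and running it from the repository root would invoke the real targets.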

More information is available in the documentation linked below.

Documentation

Tutorials

References

Please cite the following paper if you use OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} -- {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
 }

Acknowledgements

None of this would be possible without all the great open source software, including:

GNU/Linux tools
Marian-NMT
eflomal

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support of the University of Helsinki, CSC (the IT Center for Science), funding through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG), and the contributors to OPUS, the open collection of parallel corpora.

Owner

Language Technology at the University of Helsinki
