
Low-resource-Machine-Translation

This repository contains the code for the project for the Deep Natural Language Processing course. The goal of the project is to replicate the experiments performed by Dabre et al. on low-resource machine translation. In particular, starting from a machine translation model pretrained on a large dataset, we finetune it on a low-resource language. Two extensions are then implemented:

  • The same approach is tested on translation from Vietnamese to English and then from English to the other low-resource languages
  • The same approach is tested on a different dataset and a different language pair

Implementation details

The Python version used is 3.7.12.

Library versions

transformers 4.16.2
datasets 1.18.3
metrics 0.3.3
sentencepiece 0.1.96
sacrebleu 2.0.0
torch 1.10.0+cu111

Multilingual finetuning


The initial model chosen for the task is MarianMT, a transformer-based model pretrained on a large English-Chinese corpus. The model is finetuned on four low-resource languages from the ALT dataset (Vietnamese, Indonesian, Khmer, and Filipino). The finetuning is performed with the Huggingface 🤗 Transformers library and relies on the Trainer API. The code for model finetuning is available in the finetuning_en_target notebook.
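As a rough illustration of this setup (not the notebook's exact code), the sketch below finetunes the English-Chinese parent model on a toy English-Vietnamese pair with the Seq2SeqTrainer. The toy data, column names, and hyperparameters are placeholder assumptions, and the <2xx> target-language tags used in the project (see the usage example at the end of this README) are omitted for brevity.

from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Parent model pretrained on English-Chinese
model_name = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy stand-in for the ALT parallel corpus (English-Vietnamese direction)
train_data = Dataset.from_dict({
    "en": ["The cat is on the table"],
    "vi": ["Con mèo ở trên bàn"],
})

def preprocess(batch):
    # Tokenize source (English) and target (Vietnamese) sentences
    model_inputs = tokenizer(batch["en"], truncation=True, max_length=128)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(batch["vi"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = train_data.map(preprocess, batched=True, remove_columns=["en", "vi"])

training_args = Seq2SeqTrainingArguments(
    output_dir="finetuned-en-vi",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()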

Changing direction of translation


For this task, the initial model is MarianMT pretrained on a Chinese-English corpus. The model is finetuned on the Vietnamese-English task; the resulting English sentences are then translated to another low-resource language using the models finetuned in the previous part. The results are assessed by computing the BLEU score. The code for Vietnamese-English finetuning is available in the finetuning_vi_en notebook, whereas the code to translate between two low-resource languages using pretrained models is available in the translate_vi_target notebook.
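Translation between two low-resource languages is therefore a two-step (pivot) procedure, sketched below under the assumption that a Vietnamese-English checkpoint and an English-Indonesian checkpoint are available on the hub. The checkpoint names other than Helsinki-NLP/opus-mt-en-zh, and the <2id> tag, are hypothetical placeholders modelled on the usage example at the end of this README.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Step 1: Vietnamese -> English (hypothetical checkpoint name)
vi_en_model = AutoModelForSeq2SeqLM.from_pretrained("CLAck/vi-en")
vi_en_tokenizer = AutoTokenizer.from_pretrained("CLAck/vi-en")

# Step 2: English -> Indonesian (hypothetical checkpoint name)
en_id_model = AutoModelForSeq2SeqLM.from_pretrained("CLAck/en-id")
en_id_tokenizer = AutoTokenizer.from_pretrained("CLAck/en-id")
# The English input is tokenized with the parent model's tokenizer plus the
# target-language tag, as in the usage example at the end of this README
tokenizer_en = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
tokenizer_en.add_tokens(["<2zh>", "<2id>"], special_tokens=True)

vietnamese = "Con mèo ở trên bàn"
# Vietnamese -> English
inputs = vi_en_tokenizer(vietnamese, return_tensors="pt", padding=True)
english = vi_en_tokenizer.decode(vi_en_model.generate(**inputs)[0],
                                 skip_special_tokens=True)
# English -> Indonesian (pivot step), prefixing the target-language tag
inputs = tokenizer_en("<2id> " + english, return_tensors="pt", padding=True)
indonesian = en_id_tokenizer.decode(en_id_model.generate(**inputs)[0],
                                    skip_special_tokens=True)
print(indonesian)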

Testing on a different dataset


In this task, the approach is tested on the WikiMatrix dataset, which consists of parallel sentences mined from Wikipedia using a distance metric to predict alignments. The selected language pair is English-Kazakh because it contains the same number of samples as the datasets used in the previous sections. The starting model is MarianMT pretrained on English-Turkish, and results are evaluated using the BLEU score. The code for model finetuning is available in the finetuning_en_kazakh notebook.
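For all of these experiments, the BLEU score can be computed with the sacrebleu package listed above. A minimal sketch with placeholder hypothesis and reference sentences:

import sacrebleu

# Model outputs and the aligned reference translations (placeholders)
hypotheses = ["The cat is on the table"]
references = [["The cat is sitting on the table"]]  # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)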

Model usage


Some of the models finetuned within this project are available on the Huggingface hub, so they can be downloaded and used directly. An example of usage is provided below.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Download the pretrained model for English-Vietnamese available on the hub
model = AutoModelForSeq2SeqLM.from_pretrained("CLAck/en-vi")

tokenizer = AutoTokenizer.from_pretrained("CLAck/en-vi")
# Download a tokenizer that can tokenize English, since the finetuned model's tokenizer no longer handles English
# We use the tokenizer coming from the initial (parent) model
# This tokenizer is used to tokenize the input sentence
tokenizer_en = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-zh')
# These special tokens are needed to reproduce the original tokenizer
tokenizer_en.add_tokens(["<2zh>", "<2vi>"], special_tokens=True)

sentence = "The cat is on the table"
# This token is needed to identify the target language
input_sentence = "<2vi> " + sentence 
translated = model.generate(**tokenizer_en(input_sentence, return_tensors="pt", padding=True))
output_sentence = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
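
output_sentence is a list with one decoded translation per input sentence; for the example above it contains the Vietnamese translation of the English input.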
