A sentence aligner for comparable corpora

Last update: Aug 24, 2022

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical machine translation relies on parallel corpora (e.g., Europarl) to train translation models. However, these corpora are limited in size and take time to create. Yalign is designed to automate this process by finding sentences in comparable corpora that are close translations of each other. This makes it possible to harvest parallel corpora from sources such as translated documents and the web.

Installation

Yalign requires that you install scikit-learn.

After that, you can install Yalign from PyPI with pip:

sudo pip install yalign

Usage

First, download and unpack the English-to-Spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz
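
On systems without wget or tar, the same download-and-unpack step can be sketched in Python using only the standard library. The helper names `download` and `unpack` are illustrative assumptions and are not part of Yalign; only the model URL comes from the command above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL of the English-to-Spanish model, copied from the wget command above.
MODEL_URL = (
    "https://raw.githubusercontent.com/machinalis/yalign/"
    "develop/data/models/0.1/en-es.tar.gz"
)


def download(url, target):
    """Fetch `url` and save it to the path `target` (hypothetical helper)."""
    target = Path(target)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        out.write(response.read())
    return target


def unpack(archive, dest):
    """Extract a .tar.gz archive into `dest` and return the member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

Calling `unpack(download(MODEL_URL, "en-es.tar.gz"), ".")` mirrors the two shell commands: it saves the archive in the current directory and extracts it there.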

Now use the yalign-align script with the English-to-Spanish model to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
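
To align many page pairs, the command above can be driven from a script. The following is a minimal sketch that assumes yalign-align is on the PATH and takes the model directory followed by the two documents, as in the command above. The function names are hypothetical, and because the output format is not described here, the sketch returns the script's standard output unparsed.

```python
import subprocess


def build_align_command(model_dir, source_a, source_b):
    """Return the argument list for one yalign-align invocation."""
    return ["yalign-align", model_dir, source_a, source_b]


def align(model_dir, source_a, source_b):
    """Run yalign-align and return whatever it prints, as text.

    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    result = subprocess.run(
        build_align_command(model_dir, source_a, source_b),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `align("en-es", "http://en.wikipedia.org/wiki/Antiparticle", "http://es.wikipedia.org/wiki/Antipart%C3%ADcula")` runs the same alignment as the command above.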

Yalign is not limited to any one language pair. By creating your own models, you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine

Gonzalo García Berrotarán

Rafael Carrascosa

Elías Andrawos

Laura Alonso Alemany
