A sentence aligner for comparable corpora

Last update: Aug 24, 2022

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical machine translation relies on parallel corpora (e.g. Europarl) for training translation models. However, these corpora are limited in size and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches in comparable corpora. This opens up avenues for harvesting parallel corpora from sources such as translated documents and the web.

Installation

Yalign requires scikit-learn, so install that first.

After that you can install Yalign from PyPI via pip:

sudo pip install yalign

Usage

First, download and unpack the English to Spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz
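
For scripted setups, the same download-and-unpack step can be sketched in Python using only the standard library. This is a minimal illustration and not part of Yalign itself: the helper names download_model and unpack_model are hypothetical, and the only value taken from the instructions above is the model URL.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the wget command above.
MODEL_URL = (
    "https://raw.githubusercontent.com/machinalis/yalign/"
    "develop/data/models/0.1/en-es.tar.gz"
)


def download_model(url=MODEL_URL, dest_dir="."):
    """Download the archive into dest_dir and return its local path.

    Hypothetical helper, equivalent to the wget step.
    """
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(dest))
    return dest


def unpack_model(archive, target_dir="."):
    """Extract a .tar.gz archive into target_dir and return the member names.

    Hypothetical helper, equivalent to `tar -xvzf`.
    """
    with tarfile.open(str(archive), "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(str(target_dir))
    return names

# Typical use (requires network access):
#   archive = download_model()
#   unpack_model(archive)
```

As with the tar command, the archive is extracted into the current directory unless another target is given.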

Now we can use the yalign-align script with the English to Spanish model to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
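
If you want to call the script from Python, for instance to align many page pairs in a loop, one option is to wrap the command line above with subprocess. The sketch below assumes yalign-align is on your PATH and takes the model directory followed by the two documents, exactly as in the command shown; the function names are hypothetical, and the script's output is returned as raw text because its format is not described here.

```python
import subprocess


def build_align_command(model_dir, doc_a, doc_b):
    """Return the argument list for the yalign-align invocation shown above."""
    return ["yalign-align", model_dir, doc_a, doc_b]


def run_align(model_dir, doc_a, doc_b):
    """Run yalign-align and return whatever it prints to standard output.

    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    result = subprocess.run(
        build_align_command(model_dir, doc_a, doc_b),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout

# Typical use (requires Yalign, the unpacked model and network access):
#   print(run_align(
#       "en-es",
#       "http://en.wikipedia.org/wiki/Antiparticle",
#       "http://es.wikipedia.org/wiki/Antipart%C3%ADcula",
#   ))
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain percent-encoded characters.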

Yalign is not limited to any one language pair. By creating your own models you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine

Gonzalo García Berrotarán

Rafael Carrascosa

Elías Andrawos

Laura Alonso Alemany
