Implementation for paper BLEU: a Method for Automatic Evaluation of Machine Translation

Last update: Oct 07, 2021

Overview

BLEU Score

Author: Ba Ngoc from ProtonX

BLEU is a popular metric for evaluating machine translation output. Check out the Transformer project we recently published.

I. Usage

from bleu_score import cal_corpus_bleu_score

candidates = ['eating chicken chicken is a eating a eating chicken',
              'eating chicken chicken is not good']
references_list = [
    ['a chicken is eating chicken', 'there is a chicken eating chicken'],
    ['a chicken is eating chicken', 'there is a chicken eating chicken'],
]

bleu_score = cal_corpus_bleu_score(candidates, references_list,
                      weights=(0.25, 0.25, 0.25, 0.25), N=4)

print('Bleu Score: {}'.format(bleu_score))

II. BLEU Score Formula

1. Precision

For each n-gram in the candidates, we count how often it occurs in the candidate and how often it occurs in the references. The precision is the sum of the clipped candidate counts divided by the total number of n-grams in the candidates.
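For reference, the modified n-gram precision from the BLEU paper can be written as below. The notation follows the paper and is not necessarily how this repository names things internally.

```latex
p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}
           {\sum_{C' \in \{\text{Candidates}\}} \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}')}
```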

Important to note: Count clip means that the count of a given n-gram in a candidate cannot exceed the maximum count of that n-gram in any single reference.

For example: suppose the bigram ('a', 'a') occurs 3 times in a candidate, but at most 2 times in any single reference. We then use the value 2 in the calculation.
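The clipping rule can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the helper names ngram_counts and clipped_counts are hypothetical, tokenization is a plain whitespace split, and at least one reference is assumed.

```python
from collections import Counter


def ngram_counts(sentence, n):
    """Count the n-grams (as tuples of words) in a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_counts(candidate, references, n):
    """Clip each candidate n-gram count by its maximum count in any single reference."""
    cand = ngram_counts(candidate, n)
    ref_counts = [ngram_counts(ref, n) for ref in references]
    # A Counter returns 0 for n-grams it has never seen, so unmatched grams clip to 0.
    return {
        gram: min(count, max(rc[gram] for rc in ref_counts))
        for gram, count in cand.items()
    }


candidate = 'eating chicken chicken is a eating a eating chicken'
references = ['a chicken is eating chicken', 'there is a chicken eating chicken']

clipped = clipped_counts(candidate, references, n=1)
total = sum(ngram_counts(candidate, 1).values())
print(clipped)
print('unigram precision:', sum(clipped.values()) / total)
```

For this candidate, 'chicken' occurs 3 times but at most 2 times in a single reference, so its clipped count is 2. Overall, 5 of the 9 unigrams survive clipping.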

If you have not come across n-grams before: an n-gram is a contiguous sequence of n words in a string, and we count how many times each one occurs.

Candidate 1: 'eating chicken chicken is a eating a eating chicken'

-------Unigram------


eating	3
chicken	3
is	1
a	2

-------bigrams------


eating chicken	2
chicken chicken	1
chicken is	1
is a	1
a eating	2
eating a	1

The same counting applies to trigrams and 4-grams.
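The counts above can be reproduced with a short sketch. The function name count_ngrams is hypothetical and the tokenization is a plain whitespace split, which may differ from what the repository does.

```python
from collections import Counter


def count_ngrams(sentence, n):
    """Count every run of n consecutive words in the sentence."""
    tokens = sentence.split()
    grams = (' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


candidate = 'eating chicken chicken is a eating a eating chicken'

# Print the counts for unigrams up to 4-grams, one n-gram per line.
for n in range(1, 5):
    print('-------{}-grams------'.format(n))
    for gram, count in count_ngrams(candidate, n).items():
        print('{}\t{}'.format(gram, count))
```

A sentence with 9 words has 9 unigrams, 8 bigrams, 7 trigrams and 6 4-grams, so the counts in each group add up to those totals.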

2. Sentence brevity penalty

For each candidate, the effective reference length is the length of the reference whose length is closest to the candidate's.

Check out the function get_eff_ref_length in utils.py.
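For reference, the brevity penalty as defined in the BLEU paper, written with the totals c and r described in the next two lines:

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```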

c: the total length of all candidates

r: the sum of the effective reference lengths, one per candidate
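A minimal sketch of how c, r and the brevity penalty could be computed. The names closest_ref_length and brevity_penalty are hypothetical and are not the signature of get_eff_ref_length in utils.py. Breaking ties in favour of the shorter reference and returning 0 for an empty candidate set are assumptions.

```python
import math


def closest_ref_length(candidate_len, reference_lens):
    """Pick the reference length closest to the candidate length.

    Assumption: on a tie, the shorter reference wins.
    """
    return min(reference_lens, key=lambda ref_len: (abs(ref_len - candidate_len), ref_len))


def brevity_penalty(c, r):
    """Return 1 when the candidates are longer than the references, else exp(1 - r / c)."""
    if c == 0:
        return 0.0  # assumption: empty candidates score zero
    if c > r:
        return 1.0
    return math.exp(1 - r / c)


candidates = ['eating chicken chicken is a eating a eating chicken',
              'eating chicken chicken is not good']
references_list = [
    ['a chicken is eating chicken', 'there is a chicken eating chicken'],
    ['a chicken is eating chicken', 'there is a chicken eating chicken'],
]

# c sums candidate lengths; r sums the effective reference length of each candidate.
c = sum(len(cand.split()) for cand in candidates)
r = sum(
    closest_ref_length(len(cand.split()), [len(ref.split()) for ref in refs])
    for cand, refs in zip(candidates, references_list)
)
print('c =', c, 'r =', r, 'BP =', brevity_penalty(c, r))
```

Here the candidates have 9 and 6 words, and the closest reference has 6 words in both cases, so c = 15, r = 12, and no penalty applies.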

3. BLEU Formula
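For reference, the BLEU score from the paper combines the brevity penalty BP with the modified precisions p_n, using N and the weights w_n described in the lines that follow:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
```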

N: the maximum n-gram order (for example, N=4 uses unigrams up to 4-grams)

w: the list of preset weights, one per n-gram order
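The final combination step can be sketched as follows. The function combine_bleu is hypothetical and is not the repository's cal_corpus_bleu_score. Returning 0 when any precision is zero is an assumption, made because the logarithm of zero is undefined and no smoothing is applied here.

```python
import math


def combine_bleu(precisions, weights, c, r):
    """Combine per-order precisions into BLEU: BP * exp(sum of w_n * log(p_n))."""
    if c == 0 or min(precisions) == 0:
        return 0.0  # assumption: no smoothing, so a zero precision gives a zero score
    bp = 1.0 if c > r else math.exp(1 - r / c)
    log_average = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return bp * math.exp(log_average)


# Hand-computed corpus precisions for the Usage example with N=2:
# 9 of 15 unigrams and 5 of 13 bigrams survive clipping; c=15 and r=12.
score = combine_bleu(precisions=[9 / 15, 5 / 13], weights=(0.5, 0.5), c=15, r=12)
print('BLEU (N=2): {}'.format(score))
```

With equal weights the exponent is the log of the geometric mean of the precisions, which is why a single zero precision (for example, no matching trigram) drives the unsmoothed score to zero.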

Owner

Ngoc Nguyen Ba
