A notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository

Overview

IITB-English-Hindi Parallel Corpus

GitHub issues GitHub forks GitHub stars License: CC BY-NC 4.0

About

We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository. The notebook also shows how to segment the corpus using BPE tokenization which can be used to train an English-Hindi MT System.

The IIT Bombay English-Hindi corpus contains parallel corpus for English-Hindi as well as monolingual Hindi corpus collected from a variety of existing sources and corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. This corpus has been used at the Workshop on Asian Language Translation Shared Task since 2016 the Hindi-to-English and English-to-Hindi languages pairs and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

The complete details of this corpus are available at this URL. We also provide this parallel corpus via browser download from the same URL. We also provide a monolingual Hindi corpus on the same URL.

Recent Updates

  • Version 3.1 - December 2021 - Added 49,400 sentence pairs to the parallel corpus.
  • Version 3.0 - August 2020 - Added ~47,000 sentence pairs to the parallel corpus.

Usage

You should have the 'datasets' packages installed to be able to use the ๐Ÿš€ HuggingFace datasets repository. Please use the following command and install via pip:

   pip install dataasets

In the notebook, we also provide the code to create Byte-pair encoding segmented version of this corpus. You can choose to tokenize it the way shown in the notebook, or use any other tokenization which also supports the Hindi language.

Other

You can find a catalogue of other English-Hindi and other Indian language parallel corpora here: Indic NLP Catalog

Citation

If you use this corpus or its derivate resources for your research, kindly cite it as follows: Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.

BiBTeX Citation

@inproceedings{kunchukuttan-etal-2018-iit,
    title = "The {IIT} {B}ombay {E}nglish-{H}indi Parallel Corpus",
    author = "Kunchukuttan, Anoop  and
      Mehta, Pratik  and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1548",
}
Owner
Computation for Indian Language Technology (CFILT)
NLP Resources and Codebases released by the ๐ถ๐‘œ๐‘š๐‘๐‘ข๐‘ก๐‘Ž๐‘ก๐‘–๐‘œ๐‘› ๐‘“๐‘œ๐‘Ÿ ๐ผ๐‘›๐‘‘๐‘–๐‘Ž๐‘› ๐ฟ๐‘Ž๐‘›๐‘”๐‘ข๐‘Ž๐‘”๐‘’ ๐‘‡๐‘’๐‘โ„Ž๐‘›๐‘œ๐‘™๐‘œ๐‘”๐‘ฆ ๐ฟ๐‘Ž๐‘ @ ๐ผ๐ผ๐‘‡ ๐ต๐‘œ๐‘š๐‘๐‘Ž๐‘ฆ
Computation for Indian Language Technology (CFILT)
COVID-19 Related NLP Papers

COVID-19 outbreak has become a global pandemic. NLP researchers are fighting the epidemic in their own way.

xcfeng 28 Oct 30, 2022
Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time.

Wordle_Bot Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time. It will log onto the wordle website and en

Lucas Polidori 15 Dec 11, 2022
Yuqing Xie 2 Feb 17, 2022
Code for Emergent Translation in Multi-Agent Communication

Emergent Translation in Multi-Agent Communication PyTorch implementation of the models described in the paper Emergent Translation in Multi-Agent Comm

Facebook Research 75 Jul 15, 2022
A natural language modeling framework based on PyTorch

Overview PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapi

Facebook Research 6.4k Dec 27, 2022
๐Ÿ•น An esoteric language designed so that the program looks like the transcript of a Pokรฉmon battle

PokรฉBattle is an esoteric language designed so that the program looks like the transcript of a Pokรฉmon battle. Original inspiration and specification

Eduardo Correia 9 Jan 11, 2022
Korean Sentence Embedding Repository

Korean-Sentence-Embedding ๐Ÿญ Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides

80 Jan 02, 2023
SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Introduction This codebase contains source-code of the Python-based implementation (ARES) of our SIGIR 2022 paper. Chen, Jia, et al. "Axiomatically Re

Jia Chen 17 Nov 09, 2022
A curated list of efficient attention modules

awesome-fast-attention A curated list of efficient attention modules

Sepehr Sameni 891 Dec 22, 2022
Pangu-Alpha for Transformers

Pangu-Alpha for Transformers Usage Download MindSpore FP32 weights for GPU from here to data/Pangu-alpha_2.6B.ckpt Activate MindSpore environment and

One 5 Oct 01, 2022
Create a semantic search engine with a neural network (i.e. BERT) whose knowledge base can be updated

Create a semantic search engine with a neural network (i.e. BERT) whose knowledge base can be updated. This engine can later be used for downstream tasks in NLP such as Q&A, summarization, generation

Diego 1 Mar 20, 2022
A repo for open resources & information for people to succeed in PhD in CS & career in AI / NLP

A repo for open resources & information for people to succeed in PhD in CS & career in AI / NLP

420 Dec 28, 2022
Code-autocomplete, a code completion plugin for Python

Code AutoComplete code-autocomplete, a code completion plugin for Python.

xuming 13 Jan 07, 2023
Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Training COMET using seq2seq setting Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarizati

tqfang 9 Dec 17, 2022
Japanese synonym library

chikkarpy chikkarpyใฏchikkarใฎPython็‰ˆใงใ™ใ€‚ chikkarpy is a Python version of chikkar. chikkarpy ใฏ Sudachi ๅŒ็พฉ่ชž่พžๆ›ธใ‚’ๅˆฉ็”จใ—ใ€SudachiPyใฎๅ‡บๅŠ›ใซๅŒ็พฉ่ชžๅฑ•้–‹ใ‚’่ฟฝๅŠ ใ™ใ‚‹ใŸใ‚ใซ้–‹็™บใ•ใ‚ŒใŸใƒฉใ‚คใƒ–ใƒฉใƒชใงใ™ใ€‚

Works Applications 48 Dec 14, 2022
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation In this repo you can find the code of the Supervised Hybrid Audio Segmentatio

Machine Translation @ UPC 21 Dec 20, 2022
PORORO: Platform Of neuRal mOdels for natuRal language prOcessing

PORORO: Platform Of neuRal mOdels for natuRal language prOcessing pororo performs Natural Language Processing and Speech-related tasks. It is easy to

Kakao Brain 1.2k Dec 21, 2022
NLP and Text Generation Experiments in TensorFlow 2.x / 1.x

Code has been run on Google Colab, thanks Google for providing computational resources Contents Natural Language Processing๏ผˆ่‡ช็„ถ่ฏญ่จ€ๅค„็†๏ผ‰ Text Classificati

1.5k Nov 14, 2022
novel deep learning research works with PaddlePaddle

Research ๅ‘ๅธƒๅŸบไบŽ้ฃžๆกจ็š„ๅ‰ๆฒฟ็ ”็ฉถๅทฅไฝœ๏ผŒๅŒ…ๆ‹ฌCVใ€NLPใ€KGใ€STDM็ญ‰้ข†ๅŸŸ็š„้กถไผš่ฎบๆ–‡ๅ’Œๆฏ”่ต›ๅ† ๅ†›ๆจกๅž‹ใ€‚ ็›ฎๅฝ• ่ฎก็ฎ—ๆœบ่ง†่ง‰(Computer Vision) ่‡ช็„ถ่ฏญ่จ€ๅค„็†(Natrual Language Processing) ็Ÿฅ่ฏ†ๅ›พ่ฐฑ(Knowledge Graph) ๆ—ถ็ฉบๆ•ฐๆฎๆŒ–ๆŽ˜(Spa

1.5k Jan 03, 2023
Voice Assistant inspired by Google Assistant, Cortana, Alexa, Siri, ...

author: @shival_gupta VoiceAI This program is an example of a simple virtual assitant It will listen to you and do accordingly It will begin with wish

Shival Gupta 1 Jan 06, 2022