Code for the paper titled "Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages"

File organization

  • Preprocessing: contains all files used to preprocess the data (Python 3.6)
  • Data: contains the data required to run this code
  • Statistics: contains all files that hold statistics of the dataset

Dataset

| File name | Description |
| --- | --- |
| train/test/dev.csv | The dataset splits for code-mixed speech translation. |
| chopped_audios | All the audio segments, with their transcriptions and translations. |
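A minimal sketch of how the CSV splits could be loaded with the Python standard library. The `Data` directory location and the exact file names `train.csv`, `dev.csv`, and `test.csv` are assumptions read off the listing above. The column names are not documented here, so the sketch only exposes whatever header each file carries.

```python
import csv
from pathlib import Path


def load_split(path):
    """Read one split into a list of row dictionaries keyed by the CSV header."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def load_all_splits(data_dir="Data"):
    """Load whichever of train/dev/test exist under data_dir.

    data_dir is an assumed location; adjust it to where the CSV files live.
    """
    splits = {}
    for name in ("train", "dev", "test"):
        path = Path(data_dir) / f"{name}.csv"
        if path.exists():
            splits[name] = load_split(path)
    return splits
```

Calling `load_all_splits()` and inspecting the keys of the first row of each split shows which columns are available.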

Corpus statistics

| Language | #Types | #Tokens | Types per line | Tokens per line | Avg. token length |
| --- | --- | --- | --- | --- | --- |
| English[100%] | 40324 | 601889 | 10.58 | 11.27 | 4.92 |
| French (France) | 50510 | 645651 | 11.38 | 12.09 | 5.08 |
| German[100%] | 50748 | 584575 | 10.44 | 10.95 | 5.57 |
| Gujarati[100%] | 41959 | 584989 | 10.37 | 10.95 | 4.46 |
| Hindi[100%] | 29744 | 716800 | 12.36 | 13.42 | 3.74 |
| Hungarian[100%] | 84872 | 506608 | 9.13 | 9.49 | 5.89 |
| Indonesian[100%] | 39365 | 653374 | 11.54 | 12.23 | 6.14 |
| Italian[100%] | 52372 | 512061 | 9.23 | 9.59 | 5.37 |
| Latvian[100%] | 70040 | 477106 | 8.69 | 8.93 | 5.72 |
| Lithuanian[100%] | 75222 | 491558 | 8.92 | 9.2 | 6.04 |
| Nepali[100%] | 52630 | 570268 | 10.03 | 10.68 | 4.88 |
| Persian (Farsi)[100%] | 51722 | 598096 | 10.61 | 11.2 | 4.1 |
| Polish[100%] | 71662 | 494263 | 8.99 | 9.25 | 5.86 |
| Portuguese (Brazil)[100%] | 50087 | 608432 | 10.8 | 11.39 | 5.12 |
| Russian[100%] | 72162 | 490908 | 8.96 | 9.19 | 5.79 |
| Slovak[100%] | 73789 | 520465 | 9.39 | 9.75 | 5.37 |
| Slovenian[100%] | 68619 | 516649 | 9.35 | 9.67 | 5.3 |
| Spanish[100%] | 49806 | 608868 | 10.75 | 11.4 | 5.07 |
| Swedish[100%] | 48233 | 581751 | 10.31 | 10.89 | 5 |
| Tamil[100%] | 84183 | 460678 | 8.37 | 8.63 | 7.65 |
| Telugu[100%] | 72006 | 464665 | 8.34 | 8.7 | 6.56 |
| Turkish[100%] | 78957 | 453521 | 8.27 | 8.49 | 6.35 |
| Bulgarian[100%] | 60712 | 564150 | 10.1 | 10.56 | 5.24 |
| Croatian[100%] | 73075 | 531326 | 9.58 | 9.95 | 5.28 |
| Danish[100%] | 50170 | 587253 | 10.4 | 11 | 4.98 |
| Dutch[100%] | 42716 | 595464 | 10.52 | 11.15 | 5.05 |
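The five columns above can be illustrated with a short sketch. It makes several assumptions that are not confirmed by this README: tokens are obtained by whitespace splitting, "types per line" is the mean number of distinct tokens within each line, and average token length is measured in characters over all tokens. The function name `corpus_stats` is hypothetical, and the scripts in the Statistics folder may compute these values differently.

```python
def corpus_stats(lines):
    """Compute types, tokens, per-line averages, and mean token length.

    Assumes whitespace tokenization; empty lines are skipped.
    """
    vocabulary = set()
    n_lines = 0
    n_tokens = 0
    types_in_lines = 0
    total_chars = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        n_lines += 1
        n_tokens += len(tokens)
        types_in_lines += len(set(tokens))  # distinct tokens within this line
        total_chars += sum(len(token) for token in tokens)
        vocabulary.update(tokens)
    if n_lines == 0:
        raise ValueError("no non-empty lines to analyse")
    return {
        "types": len(vocabulary),
        "tokens": n_tokens,
        "types_per_line": round(types_in_lines / n_lines, 2),
        "tokens_per_line": round(n_tokens / n_lines, 2),
        "avg_token_length": round(total_chars / n_tokens, 2),
    }


# Toy input, not taken from the dataset.
sample = [
    "the soul is eternal",
    "the body is temporary and the soul is not",
]
stats = corpus_stats(sample)
print(stats)
```

Under this reading, types per line can never exceed tokens per line, which matches every row of the table.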

Code-mixing

All languages in Code-mixing

| Language | Total words | Unique words | Percentage |
| --- | --- | --- | --- |
| English | 500136 | 6312 | 83.6 |
| Bengali | 46933 | 3907 | 7.84 |
| Sanskrit | 51246 | 7202 | 8.56 |
| Total | 598315 | 17421 | 100 |
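As a quick check, the Percentage column can be recomputed from the Total words column, assuming it is each language's share of all words. The sketch below uses only the counts listed above; the last digit may differ from the table depending on how the values were rounded.

```python
# Word counts copied from the table above.
word_counts = {"English": 500136, "Bengali": 46933, "Sanskrit": 51246}

total_words = sum(word_counts.values())

# Share of each language in the whole corpus, in percent.
shares = {lang: 100 * count / total_words for lang, count in word_counts.items()}

print(f"Total words: {total_words}")
for lang, share in shares.items():
    print(f"{lang}: {share:.2f}%")
```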

Types of Code-mixing

| Type | English-Sanskrit | Sanskrit-English | English-Bengali | Bengali-English |
| --- | --- | --- | --- | --- |
| Inter-Sentential | 2356 | 2366 | 339 | 339 |
| Intra-Sentential | 2338 | 851 | 124 | 0 |
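To make the two categories concrete, here is a rough sketch of how switches could be counted from language-tagged text. It rests on assumptions that this README does not confirm: every token already carries a language tag, an intra-sentential switch is a change of language between adjacent tokens in one sentence, an inter-sentential switch is a change of language across a sentence boundary, and a label such as "English-Sanskrit" means a switch from English to Sanskrit. The function `count_switches` is hypothetical and is not the procedure used to produce the table.

```python
from collections import Counter


def count_switches(sentences):
    """Count language switches in a sequence of tagged sentences.

    sentences: list of sentences, each a list of (token, language) pairs.
    Returns a Counter keyed by (kind, "From-To"), kind in {"inter", "intra"}.
    """
    counts = Counter()
    previous_last_lang = None
    for sentence in sentences:
        if not sentence:
            continue
        langs = [lang for _, lang in sentence]
        # Inter-sentential: the language changes across the sentence boundary.
        if previous_last_lang is not None and previous_last_lang != langs[0]:
            counts[("inter", f"{previous_last_lang}-{langs[0]}")] += 1
        # Intra-sentential: the language changes between adjacent tokens.
        for current, following in zip(langs, langs[1:]):
            if current != following:
                counts[("intra", f"{current}-{following}")] += 1
        previous_last_lang = langs[-1]
    return counts


# Toy example with hand-assigned tags, not taken from the dataset.
tagged = [
    [("this", "English"), ("is", "English"), ("called", "English"), ("dharma", "Sanskrit")],
    [("sarva", "Sanskrit"), ("dharman", "Sanskrit")],
    [("give", "English"), ("up", "English")],
]
counts = count_switches(tagged)
print(counts)
```

In the toy example the first sentence contributes one intra-sentential English-Sanskrit switch, and the boundary between the second and third sentences contributes one inter-sentential Sanskrit-English switch.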
Owner
Ayush Daksh
IIT Kharagpur | Mathematics & Computing | 3rd Year | NLP | UG Researcher