Code for the paper titled "Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages"

Last update: Dec 01, 2022

Related tags

Deep Learning CMST

Overview

Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages

File organization

Preprocessing: contains all files used to preprocess the data (Python 3.6)
Data: contains the data required to run this code
Statistics: contains all files that hold statistics of the dataset

Dataset

File name	Description
train/test/dev.csv	The dataset splits for code-mixed speech translation.
chopped_audios	Contains all the audio files, transcriptions, and translations.
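
As a minimal sketch of how one of the CSV splits might be inspected, the snippet below reads a file with Python's standard `csv` module and reports the row count and column names. The helper names `load_split` and `summarize_split` are hypothetical and not part of this repository, and the column layout is deliberately not assumed: it is read from the file's own header row.

```python
import csv
from pathlib import Path


def load_split(csv_path):
    """Read one split (e.g. train.csv) into a list of row dictionaries.

    Hypothetical helper: column names come from the file's header row,
    so nothing about the dataset schema is assumed here.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarize_split(rows):
    """Return the number of rows and the column names found in a split."""
    columns = list(rows[0].keys()) if rows else []
    return {"rows": len(rows), "columns": columns}
```

Calling `summarize_split(load_split("train.csv"))` (the path is illustrative) is a quick way to see which columns a split actually provides before writing any further processing code.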

Statistics of the corpora

Languages	#types	#tokens	Types per line	Tokens per line	Avg. token length
English[100%]	40324	601889	10.58	11.27	4.92
French (France)	50510	645651	11.38	12.09	5.08
German[100%]	50748	584575	10.44	10.95	5.57
Gujarati[100%]	41959	584989	10.37	10.95	4.46
Hindi[100%]	29744	716800	12.36	13.42	3.74
Hungarian[100%]	84872	506608	9.13	9.49	5.89
Indonesian[100%]	39365	653374	11.54	12.23	6.14
Italian[100%]	52372	512061	9.23	9.59	5.37
Latvian[100%]	70040	477106	8.69	8.93	5.72
Lithuanian[100%]	75222	491558	8.92	9.2	6.04
Nepali[100%]	52630	570268	10.03	10.68	4.88
Persian (Farsi)[100%]	51722	598096	10.61	11.2	4.1
Polish[100%]	71662	494263	8.99	9.25	5.86
Portuguese (Brazil)[100%]	50087	608432	10.8	11.39	5.12
Russian[100%]	72162	490908	8.96	9.19	5.79
Slovak[100%]	73789	520465	9.39	9.75	5.37
Slovenian[100%]	68619	516649	9.35	9.67	5.3
Spanish[100%]	49806	608868	10.75	11.4	5.07
Swedish[100%]	48233	581751	10.31	10.89	5
Tamil[100%]	84183	460678	8.37	8.63	7.65
Telugu[100%]	72006	464665	8.34	8.7	6.56
Turkish[100%]	78957	453521	8.27	8.49	6.35
Bulgarian[100%]	60712	564150	10.1	10.56	5.24
Croatian[100%]	73075	531326	9.58	9.95	5.28
Danish[100%]	50170	587253	10.4	11	4.98
Dutch[100%]	42716	595464	10.52	11.15	5.05
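
The columns of the table above could be computed along the following lines. This is only a sketch under stated assumptions, not the script in the Statistics folder: tokens are obtained by whitespace splitting, "types per line" is taken to be the average number of distinct tokens within a line, and average token length is measured in characters over all tokens. The function name `corpus_statistics` is hypothetical.

```python
def corpus_statistics(lines):
    """Compute types, tokens, per-line averages and average token length.

    Assumptions (not confirmed by the repository): whitespace tokenization,
    empty lines skipped, token length measured in characters.
    """
    all_types = set()
    total_tokens = 0
    total_chars = 0
    distinct_per_line_sum = 0
    line_count = 0

    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # ignore empty lines
        line_count += 1
        total_tokens += len(tokens)
        total_chars += sum(len(token) for token in tokens)
        distinct_per_line_sum += len(set(tokens))
        all_types.update(tokens)

    if line_count == 0:
        return {"types": 0, "tokens": 0, "types_per_line": 0.0,
                "tokens_per_line": 0.0, "avg_token_length": 0.0}

    return {
        "types": len(all_types),
        "tokens": total_tokens,
        "types_per_line": distinct_per_line_sum / line_count,
        "tokens_per_line": total_tokens / line_count,
        "avg_token_length": total_chars / total_tokens,
    }
```

A different tokenizer (for example one that separates punctuation) would change every column, so figures produced this way may not reproduce the table exactly.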

Code-mixing

All languages in Code-mixing

Language	Total Words	Unique Words	Percentage
English	500136	6312	83.6
Bengali	46933	3907	7.84
Sanskrit	51246	7202	8.56
Total	598315	17421	100
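
The Percentage column is each language's share of the total word count. A short sketch of that arithmetic, using the word counts from the table above (the function name `language_shares` is hypothetical):

```python
# Total word counts per language, copied from the table above.
word_counts = {"English": 500136, "Bengali": 46933, "Sanskrit": 51246}


def language_shares(counts):
    """Return each language's share of the total word count, in percent."""
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.items()}


shares = language_shares(word_counts)
for lang, share in shares.items():
    print(f"{lang}: {share:.2f}%")
```

The computed shares agree with the Percentage column to within about 0.01 percentage points; the small differences come from rounding.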

Types of Code-mixing

	English-Sanskrit	Sanskrit-English	English-Bengali	Bengali-English
Inter-Sentential	2356	2366	339	339
Intra-Sentential	2338	851	124	0

Owner

Ayush Daksh
