Code for the paper titled "Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages"

Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages

File organization

  • Preprocessing: contains all files used to preprocess the data (Python 3.6)
  • Data: contains the data required to run this code
  • Statistics: contains all files with statistics of the dataset

Dataset

| File name | Description |
| --- | --- |
| train/test/dev.csv | The dataset for code-mixed speech translation. |
| chopped_audios | Contains all the audio files, with their transcriptions and translations. |
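Before training on a split, it can help to check which columns the CSV actually has. The sketch below is a minimal example using only the standard library; `peek_csv` is a hypothetical helper, and the column names in the sample string are made up for illustration, since the real schema of train/test/dev.csv is not documented here.

```python
import csv
import io


def peek_csv(handle, n_rows=2):
    """Return the header and the first n_rows rows of an open CSV file object."""
    reader = csv.DictReader(handle)
    rows = []
    for row in reader:
        if len(rows) >= n_rows:
            break
        rows.append(dict(row))
    return reader.fieldnames, rows


# Stand-in for a real split file; these column names are hypothetical.
sample = io.StringIO(
    "audio_id,transcript\n"
    "clip_0001,example text\n"
    "clip_0002,more text\n"
)
header, first_rows = peek_csv(sample, n_rows=1)
print(header)
print(first_rows)
```

For a file on disk, the same function can be passed a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points at one of the split files.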

Statistics of Corpora contained

| Language | #types | #tokens | Types per line | Tokens per line | Avg. token length |
| --- | --- | --- | --- | --- | --- |
| English[100%] | 40324 | 601889 | 10.58 | 11.27 | 4.92 |
| French (France) | 50510 | 645651 | 11.38 | 12.09 | 5.08 |
| German[100%] | 50748 | 584575 | 10.44 | 10.95 | 5.57 |
| Gujarati[100%] | 41959 | 584989 | 10.37 | 10.95 | 4.46 |
| Hindi[100%] | 29744 | 716800 | 12.36 | 13.42 | 3.74 |
| Hungarian[100%] | 84872 | 506608 | 9.13 | 9.49 | 5.89 |
| Indonesian[100%] | 39365 | 653374 | 11.54 | 12.23 | 6.14 |
| Italian[100%] | 52372 | 512061 | 9.23 | 9.59 | 5.37 |
| Latvian[100%] | 70040 | 477106 | 8.69 | 8.93 | 5.72 |
| Lithuanian[100%] | 75222 | 491558 | 8.92 | 9.2 | 6.04 |
| Nepali[100%] | 52630 | 570268 | 10.03 | 10.68 | 4.88 |
| Persian (Farsi)[100%] | 51722 | 598096 | 10.61 | 11.2 | 4.1 |
| Polish[100%] | 71662 | 494263 | 8.99 | 9.25 | 5.86 |
| Portuguese (Brazil)[100%] | 50087 | 608432 | 10.8 | 11.39 | 5.12 |
| Russian[100%] | 72162 | 490908 | 8.96 | 9.19 | 5.79 |
| Slovak[100%] | 73789 | 520465 | 9.39 | 9.75 | 5.37 |
| Slovenian[100%] | 68619 | 516649 | 9.35 | 9.67 | 5.3 |
| Spanish[100%] | 49806 | 608868 | 10.75 | 11.4 | 5.07 |
| Swedish[100%] | 48233 | 581751 | 10.31 | 10.89 | 5 |
| Tamil[100%] | 84183 | 460678 | 8.37 | 8.63 | 7.65 |
| Telugu[100%] | 72006 | 464665 | 8.34 | 8.7 | 6.56 |
| Turkish[100%] | 78957 | 453521 | 8.27 | 8.49 | 6.35 |
| Bulgarian[100%] | 60712 | 564150 | 10.1 | 10.56 | 5.24 |
| Croatian[100%] | 73075 | 531326 | 9.58 | 9.95 | 5.28 |
| Danish[100%] | 50170 | 587253 | 10.4 | 11 | 4.98 |
| Dutch[100%] | 42716 | 595464 | 10.52 | 11.15 | 5.05 |
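Statistics of this kind can be computed in a few lines. The sketch below is one plausible reading of the columns, not the script used for the paper: it assumes whitespace tokenization, takes "types per line" as the mean number of distinct tokens within each line, and averages token length over token occurrences. The function name `corpus_statistics` and the toy sentences are illustrative only.

```python
from collections import Counter


def corpus_statistics(lines):
    """Corpus statistics for a non-empty list of lines (whitespace tokenization assumed)."""
    vocabulary = Counter()
    types_per_line = []
    for line in lines:
        tokens = line.split()
        vocabulary.update(tokens)
        # Distinct tokens within this single line.
        types_per_line.append(len(set(tokens)))
    n_lines = len(lines)
    n_tokens = sum(vocabulary.values())
    total_chars = sum(len(tok) * count for tok, count in vocabulary.items())
    return {
        "types": len(vocabulary),
        "tokens": n_tokens,
        "types_per_line": sum(types_per_line) / n_lines,
        "tokens_per_line": n_tokens / n_lines,
        "avg_token_length": total_chars / n_tokens,
    }


# Toy input, made up for illustration.
toy_lines = ["the soul is eternal", "the body is not the soul"]
stats = corpus_statistics(toy_lines)
print(stats)
```

A different tokenizer, or averaging token length over types rather than occurrences, would shift the numbers, so results from this sketch need not match the table exactly.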

Code-mixing

All languages in Code-mixing

| Language | Total words | Unique words | Percentage |
| --- | --- | --- | --- |
| English | 500136 | 6312 | 83.6 |
| Bengali | 46933 | 3907 | 7.84 |
| Sanskrit | 51246 | 7202 | 8.56 |
| Total | 598315 | 17421 | 100 |
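The percentage column is each language's share of the total word count. As a quick check, the sketch below recomputes the shares from the "total words" figures; the variable names are illustrative, and the results agree with the listed percentages to within about 0.01 percentage points.

```python
# Total word counts per language, copied from the table.
word_counts = {"English": 500136, "Bengali": 46933, "Sanskrit": 51246}

total_words = sum(word_counts.values())
shares = {lang: 100 * count / total_words for lang, count in word_counts.items()}

print(total_words)
for lang, share in shares.items():
    print(lang, round(share, 2))
```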

Types of Code-mixing

| Type | English-Sanskrit | Sanskrit-English | English-Bengali | Bengali-English |
| --- | --- | --- | --- | --- |
| Inter-Sentential | 2356 | 2366 | 339 | 339 |
| Intra-Sentential | 2338 | 851 | 124 | 0 |
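To make the two categories concrete: inter-sentential code-mixing is a language switch at a sentence boundary, while intra-sentential code-mixing is a switch inside a sentence. The sketch below counts both from token-level language tags. It is an illustration of the definitions only, not the procedure used to produce the counts above; the function `count_switches`, the tagging format, and the toy sentences are all assumptions.

```python
from collections import Counter


def count_switches(sentences):
    """Count language switches; each sentence is a list of (token, language) pairs."""
    counts = Counter()
    previous_lang = None  # language of the last token of the previous sentence
    for sentence in sentences:
        langs = [lang for _token, lang in sentence]
        if not langs:
            continue
        # Switch across the sentence boundary: inter-sentential.
        if previous_lang is not None and previous_lang != langs[0]:
            counts[("inter", previous_lang + "-" + langs[0])] += 1
        # Switches between adjacent tokens inside the sentence: intra-sentential.
        for left, right in zip(langs, langs[1:]):
            if left != right:
                counts[("intra", left + "-" + right)] += 1
        previous_lang = langs[-1]
    return counts


# Toy tagged input, made up for illustration.
toy = [
    [("so", "English"), ("this", "English"), ("verse", "English")],
    [("dharma", "Sanskrit"), ("kshetre", "Sanskrit")],
    [("the", "English"), ("atma", "Sanskrit"), ("is", "English"), ("eternal", "English")],
]
switches = count_switches(toy)
print(switches)
```

Here the direction label records the order of the switch, so "English-Sanskrit" means English followed by Sanskrit. Whether the table uses the same convention is not stated in this README.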
Owner
Ayush Daksh
IIT Kharagpur | Mathematics & Computing | 3rd Year | NLP | UG Researcher