A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models

Last update: Oct 23, 2022

wav2vec-toolkit

This repository accompanies the 🤗 HuggingFace Community Paper on finetuning Wav2Vec2 XLSR for low-resource languages [link]

How to contribute

(Mostly identical to the huggingface/datasets contributing guide)

Fork the repository by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.

Clone your fork to your local disk, and add the base repository as a remote:

```
git clone git@github.com:<your Github handle>/wav2vec-toolkit.git
cd wav2vec-toolkit
git remote add upstream https://github.com/anton-l/wav2vec-toolkit.git
```

Create a new branch to hold your development changes:
```
git checkout -b a-descriptive-name-for-my-changes
```
Do not work directly on the main branch.
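To make the branch rule harder to forget, a small guard can be run before committing. The following is a minimal sketch, not part of the toolkit: `assert_not_default_branch` is a hypothetical helper name, and it assumes the default branch is called `main` (as the `git rebase upstream/main` command in this guide suggests); pass a different name as the first argument if your setup differs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of wav2vec-toolkit): fail if the checked-out
# branch is the default branch. Assumes the default branch is "main" unless
# another name is passed as the first argument.
assert_not_default_branch() {
    local default_branch="${1:-main}"
    local current
    # symbolic-ref prints the current branch name; it fails on a detached HEAD
    # or outside a git repository.
    current="$(git symbolic-ref --short HEAD 2>/dev/null)" || {
        echo "could not determine the current branch" >&2
        return 2
    }
    if [ "$current" = "$default_branch" ]; then
        echo "refusing to work on '$default_branch'; create a feature branch first" >&2
        return 1
    fi
    echo "on branch '$current'"
}
```

Calling `assert_not_default_branch && git commit` would then only commit when a feature branch is checked out.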
Set up a development environment by running the following command in a virtual environment:
```
pip install -e ".[dev]"
```
(If wav2vec-toolkit was already installed in the virtual environment, remove it with pip uninstall wav2vec_toolkit before reinstalling it in editable mode with the -e flag.)
Develop the features on your branch.
Format your code. Run black and isort with the following commands so that your newly added files follow the project's style:
```
black --line-length 119 --target-version py36 src scripts
isort src scripts
```
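If you would rather see what needs formatting before any file is rewritten, both tools have a check mode. The sketch below wraps them in a hypothetical `check_format` function (the name is an assumption, not something the repository defines) and reuses the same options and paths as the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of wav2vec-toolkit): report formatting
# problems without modifying any files.
check_format() {
    # --check makes black exit non-zero if any file would be reformatted
    black --check --line-length 119 --target-version py36 src scripts || return 1
    # --check-only makes isort exit non-zero if imports are not sorted
    isort --check-only src scripts || return 1
}
```

A non-zero exit status from `check_format` means at least one file still needs to be formatted with the regular commands.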
Once you're happy with your implementation, add your changes and make a commit to record your changes locally:
```
git add .
git commit
```
It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:
```
git fetch upstream
git rebase upstream/main
```
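The sync step only works if the `upstream` remote was added after cloning. As a hedged sketch, the two commands can be wrapped in a hypothetical `sync_with_upstream` function (an illustrative name, not part of the repository) that checks for the remote first and prints a hint when it is missing.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of wav2vec-toolkit): fetch and rebase onto
# upstream/main, but only if a remote named "upstream" is configured.
sync_with_upstream() {
    # get-url fails when no remote with that name exists
    if ! git remote get-url upstream >/dev/null 2>&1; then
        echo "no 'upstream' remote; add it with: git remote add upstream <url>" >&2
        return 1
    fi
    git fetch upstream && git rebase upstream/main
}
```

If the rebase stops on a conflict, resolve it and continue with `git rebase --continue`, or return to the previous state with `git rebase --abort`.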
Push the changes to your account using:
```
git push -u origin a-descriptive-name-for-my-changes
```
Once you are satisfied, go to the webpage of your fork on GitHub. Click on "Pull request" to send your changes to the project maintainers for review.

Owner

Anton Lozhkov
