Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

Overview

STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation

This is a PyTorch implementation for the ACL 2022 main conference paper STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation.

Training a Model on MuST-C

Let's first take a look at training an En-De model as an example.

Enviroment Configuration

  1. Clone this repository:
git clone [email protected]:ictnlp/STEMM.git
cd STEMM/
  1. Install Montreal Forced Aligner following the official guidance. Please also download the pertained models and dictionary for MFA.

  2. Please make sure you have installed PyTorch, and then install fairseq and other packages as follows:

pip install --editable ./
python3 setup.py install --user
python3 setup.py build_ext --inplace
pip install inflect sentencepiece soundfile textgrid pandas

Data Preparation

  1. First make a directory to store the dataset:
TGT_LANG=de
MUSTC_ROOT=data/mustc/
mkdir -p $MUSTC_ROOT
  1. Download the MuST-C v1.0 archive MUSTC_v1.0_en-de.tar.gz to the $MUSTC_ROOT path, and uncompress it:
cd $MUSTC_ROOT
tar -xzvf MUSTC_v1.0_en-de.tar.gz
  1. Return to the root directory, run the preprocess script preprocess.sh, which will perform forced alignment and organize the raw data and alignment information into .tsv format for using:
sh preprocess.sh $TGT_LANG
  1. Finally, the directory $MUSTC_ROOT should look like this:
.
├── en-de
│   ├── config_raw.yaml
│   ├── data
│   ├── dev_raw_seg_plus.tsv
│   ├── docs
│   ├── segment
│   ├── spm_unigram10000_raw.model
│   ├── spm_unigram10000_raw.txt
│   ├── spm_unigram10000_raw.vocab
│   ├── train_raw_seg_plus.tsv
│   ├── tst-COMMON_raw_seg_plus.tsv
│   ├── tst-HE_raw_seg_plus.tsv
└── MUSTC_v1.0_en-de.tar.gz

Pretrain the MT Module

[OPTIONAL] Use External MT Corpus

If you want to use external MT corpus, please first pretrain a MT model on this corpus following these steps:

  1. Perform BPE on external corpus with the sentencepiece model learned on MuST-C. As we mentioned in our paper, we use WMT for En-De, En-Fr, En-Ru, En-Es, En-Ro, and OPUS100 for En-Pt, En-It, En-Nl as external corpus. You can download them from the internet and put them in the data/ext_en${TGT_LANG}/ directory. Run the following command and replace $input_file with the path of raw text to perform BPE. You should apply BPE to texts in both source and target language of all subset (train/valid/test).
python3 data/scripts/apply_spm.py --input-file $input_file --output-file $output_file --model data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.model
  1. Use fairseq-preprocess command to convert the BPE texts into fairseq formats. Make sure to use the sentencepiece dictionary learned on MuST-C.
$spm_dict=data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.txt
fairseq-preprocess --source-lang en --target-lang $TGT_LANG --trainpref data/ext_en${TGT_LANG}/train --validpref data/ext_en${TGT_LANG}/valid --testpref data/ext_en${TGT_LANG}/test --destdir data/ext_en${TGT_LANG}/binary --joined-dictionary --srcdict $spm_dict --tgtdict $spm_dict --workers=20 --nwordssrc 10000 --nwordstgt 10000
  1. Train the model using the following command:
sh pretrain_mt_ext.sh $TGT_LANG

Pretrain the MT module on MuST-C

  1. Run the following script to pretrain the MT module. The argument --load-pretrained-mt-encoder-decoder-from indicates the path of MT model pretrained on external corpus obtained in the last step.
sh pretrain_mt.sh $TGT_LANG
  1. To ensure consistent performance, we have released our checkpoints of pretrained MT modules. You can download them and directly use them do initialize the MT module in our model for the following experiments.
Direction Link
En-De https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ende_mt.pt
En-Fr https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enfr_mt.pt
En-Es https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enes_mt.pt
En-Ro https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enro_mt.pt
En-Ru https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enru_mt.pt
En-Nl https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ennl_mt.pt
En-It https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enit_mt.pt
En-Pt https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enpt_mt.pt

Training

  1. Download the pretrained wav2vec2.0 model from the official link, and put it in the checkpoints/ directory.
  2. Just run the training scripts:
sh train.sh $TGT_LANG

Evaluate

  1. Run the following script to average the last 10 checkpoints and evaluate on the tst-COMMON set:
sh test.sh mustc_en${TGT_LANG}_stmm_self_learning $TGT_LANG
  1. We also released our checkpoints as follows. You can download and evaluate them directly.
Direction Link
En-De https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ende_stmm_self_learning.pt
En-Fr https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enfr_stmm_self_learning.pt
En-Es https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enes_stmm_self_learning.pt
En-Ro https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enro_stmm_self_learning.pt
En-Ru https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enru_stmm_self_learning.pt
En-Nl https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ennl_stmm_self_learning.pt
En-It https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enit_stmm_self_learning.pt
En-Pt https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enpt_stmm_self_learning.pt

Citation

In this repository is useful for you, please cite as:

@inproceedings{fang-etal-2022-STEMM,
	title = {STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation},
	author = {Fang, Qingkai and Ye, Rong and Li, Lei and Feng, Yang and Wang, Mingxuan},
	booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
	year = {2022},
}

Contact

If you have any questions, feel free to contact me at [email protected].

Issues
  • About fine-tuning

    About fine-tuning

    In the paper, Section 2.2 , you say "We combine those pretrained modules and finetune the whole model for ST". Did you freeze the Wav2Vec2.0 Model during training ? If not , I wonder if it's because of the mix-up training strategy , so as to bridge the modality gap.

    question 
    opened by zhouyan19 2
  • Audio feature extraction during preprocessing ?

    Audio feature extraction during preprocessing ?

    I compare your preprocss procedure (preprocess.sh) to the orginial fairseq example, and find that you remove the step of extracting the audio feature ( as far as my observation is concerned ) . So when you train the fairseq ST baseline, are you using the raw audio inputs rather than the features as in the original codes ? Or are you using Wav2Vec2.0 to do an audio embedding ?

    question 
    opened by zhouyan19 2
Owner
ICTNLP
Natural Language Processing Group, Institute of Computing Technology, Chinese Academy of Sciences
ICTNLP
Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

LancoPKU 58 Apr 9, 2022
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet ?? ???? 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

IndoLEM 28 Apr 4, 2022
Python Implementation of ``Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT'' (Findings of ACL: ACL 2021)

BERT-for-Surprisal Python Implementation of ``Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT'' (Findings

null 6 Nov 5, 2021
Hostapd-mac-tod-acl - Setup a hostapd AP with MAC ToD ACL

A brief explanation This script provides a quick way to setup a Time-of-day (Tod

null 2 Feb 3, 2022
(ACL 2022) The source code for the paper "Towards Abstractive Grounded Summarization of Podcast Transcripts"

Towards Abstractive Grounded Summarization of Podcast Transcripts We provide the source code for the paper "Towards Abstractive Grounded Summarization

null 6 Mar 29, 2022
Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

SWRM Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors" Clone Clone th

null 4 Mar 13, 2022
Implementaion of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation This is the implementaion of our paper: Bridging the

hezw.tkcw 9 Apr 7, 2022
PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Feature_CRF_AE Feature_CRF_AE provides a implementation of Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging

Jacob Zhou 5 Apr 13, 2022
Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

Mask-Align: Self-Supervised Neural Word Alignment This is the implementation of our work Mask-Align: Self-Supervised Neural Word Alignment. @inproceed

THUNLP-MT 31 Mar 25, 2022
Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

ConSERT Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer Requirements torch==1.6.0

Yan Yuanmeng 407 Apr 14, 2022
Code for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

?? Fingerprinting Fine-tuned Language Models in the wild This is the code and dataset for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned La

LCS2-IIITDelhi 5 Aug 30, 2021
Code for our paper "Transfer Learning for Sequence Generation: from Single-source to Multi-source" in ACL 2021.

TRICE: a task-agnostic transferring framework for multi-source sequence generation This is the source code of our work Transfer Learning for Sequence

THUNLP-MT 8 Feb 21, 2022
null 143 Apr 4, 2022
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 38 Mar 29, 2022
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 38 Mar 29, 2022
Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

This repo provides the code of the following papers: (GAR) "Generation-Augmented Retrieval for Open-domain Question Answering", ACL 2021 (RIDER) "Read

morning 34 Apr 10, 2022
This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Speech-Backbones This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab. Grad-TTS Official implementation of the Grad-

HUAWEI Noah's Ark Lab 143 Apr 8, 2022
Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

Dennis Priskorn 7 Mar 11, 2022
Main repository for the chatbot Bobotinho.

Bobotinho Bot Main repository for the chatbot Bobotinho. ℹ️ Introduction Twitch chatbot with entertainment commands. ‎ ?? Technologies Concurrent code

Bobotinho 9 Jan 18, 2022