Implementation of ICLR 2020 paper "Revisiting Self-Training for Neural Sequence Generation"

Overview

Self-Training for Neural Sequence Generation

This repo includes instructions for running noisy self-training algorithms from the following paper:

Revisiting Self-Training for Neural Sequence Generation
Junxian He*, Jiatao Gu*, Jiajun Shen, Marc'Aurelio Ranzato
ICLR 2020

Requirement

  • fairseq (please see the fairseq repo for other requirements on Python and PyTorch versions)

fairseq can be installed with:

pip install fairseq

Data

Download and preprocess the WMT'14 En-De dataset:

# Download and prepare the data
wget https://raw.githubusercontent.com/pytorch/fairseq/master/examples/translation/prepare-wmt14en2de.sh
bash prepare-wmt14en2de.sh --icml17

TEXT=wmt14_en_de
fairseq-preprocess --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir wmt14_en_de_bin --thresholdtgt 0 --thresholdsrc 0 \
    --joined-dictionary --workers 16

Then we mimic a semi-supervised setting where 100K training samples are randomly selected as parallel corpus and the remaining English training samples are treated as unannotated monolingual corpus:

bash extract_wmt100k.sh

Preprocess WMT100K:

bash preprocess.sh 100ken 100kde 

Add noise to the monolingual corpus for later usage:

TEXT=wmt14_en_de
python paraphrase/paraphrase.py \
  --paraphraze-fn noise_bpe \
  --word-dropout 0.2 \
  --word-blank 0.2 \
  --word-shuffle 3 \
  --data-file ${TEXT}/train.mono_en \
  --output ${TEXT}/train.mono_en_noise \
  --bpe-type subword

Train the base supervised model

Train the translation model with 30K updates:

bash supervised_train.sh 100ken 100kde 30000

Self-training as pseudo-training + fine-tuning

Translate the monolingual data to train.[suffix] to form a pseudo parallel dataset:

bash translate.sh [model_path] [suffix]  

Suppose the pseduo target language suffix is mono_de_iter1 (by default), preprocess:

bash preprocess.sh mono_en_noise mono_de_iter1

Pseudo-training + fine-tuning:

bash self_train.sh mono_en_noise mono_de_iter1 

The above command trains the model on the pseduo parallel corpus formed by train.mono_en_noise and train.mono_de_iter1 and then fine-tune it on real parallel data.

This self-training process can be repeated for multiple iterations to improve performance.

Reference

@inproceedings{He2020Revisiting,
title={Revisiting Self-Training for Neural Sequence Generation},
author={Junxian He and Jiatao Gu and Jiajun Shen and Marc'Aurelio Ranzato},
booktitle={Proceedings of ICLR},
year={2020},
url={https://openreview.net/forum?id=SJgdnAVKDH}
}
Owner
Junxian He
NLP/ML PhD student at CMU
Junxian He
The Face Mask recognition system uses AI technology to detect the person with or without a mask.

Face Mask Detection Face Mask Detection system built with OpenCV, Keras/TensorFlow using Deep Learning and Computer Vision concepts in order to detect

Rohan Kasabe 4 Apr 05, 2022
This dlib-based facial login system

Facial-Login-System This dlib-based facial login system is a technology capable of matching a human face from a digital webcam frame capture against a

Mushahid Ali 3 Apr 23, 2022
nnFormer: Interleaved Transformer for Volumetric Segmentation

nnFormer: Interleaved Transformer for Volumetric Segmentation Code for paper "nnFormer: Interleaved Transformer for Volumetric Segmentation ". Please

jsguo 610 Dec 28, 2022
A Python framework for conversational search

Chatty Goose Multi-stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting Installation Ma

Castorini 36 Oct 23, 2022
EfficientNetV2-with-TPU - Cifar-10 case study

EfficientNetV2-with-TPU EfficientNet EfficientNetV2 adalah jenis jaringan saraf convolutional yang memiliki kecepatan pelatihan lebih cepat dan efisie

Sultan syach 1 Dec 28, 2021
Privacy-Preserving Portrait Matting [ACM MM-21]

Privacy-Preserving Portrait Matting [ACM MM-21] This is the official repository of the paper Privacy-Preserving Portrait Matting. Jizhizi Li∗, Sihan M

Jizhizi_Li 212 Dec 27, 2022
This is the repo for our work "Towards Persona-Based Empathetic Conversational Models" (EMNLP 2020)

Towards Persona-Based Empathetic Conversational Models (PEC) This is the repo for our work "Towards Persona-Based Empathetic Conversational Models" (E

Zhong Peixiang 35 Nov 17, 2022
Self-Supervised Monocular DepthEstimation with Internal Feature Fusion(arXiv), BMVC2021

DIFFNet This repo is for Self-Supervised Monocular Depth Estimation with Internal Feature Fusion(arXiv), BMVC2021 A new backbone for self-supervised d

Hang 94 Dec 25, 2022
A Novel Plug-in Module for Fine-grained Visual Classification

Pytorch implementation for A Novel Plug-in Module for Fine-Grained Visual Classification. fine-grained visual classification task.

ChouPoYung 109 Dec 20, 2022
Train emoji embeddings based on emoji descriptions.

emoji2vec This is my attempt to train, visualize and evaluate emoji embeddings as presented by Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko

Miruna Pislar 17 Sep 03, 2022
Code for SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations

The Second Situated Interactive MultiModal Conversations (SIMMC 2.0) Challenge 2021 Welcome to the Second Situated Interactive Multimodal Conversation

Facebook Research 81 Nov 22, 2022
A library for differentiable nonlinear optimization.

Theseus A library for differentiable nonlinear optimization built on PyTorch to support constructing various problems in robotics and vision as end-to

Meta Research 1.1k Dec 30, 2022
Performant, differentiable reinforcement learning

deluca Performant, differentiable reinforcement learning Notes This is pre-alpha software and is undergoing a number of core changes. Updates to follo

Google 114 Dec 27, 2022
(CVPR2021) ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic

ClassSR (CVPR2021) ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic Paper Authors: Xiangtao Kong, Hengyuan

Xiangtao Kong 308 Jan 05, 2023
Official repository for "Exploiting Session Information in BERT-based Session-aware Sequential Recommendation", SIGIR 2022 short.

Session-aware BERT4Rec Official repository for "Exploiting Session Information in BERT-based Session-aware Sequential Recommendation", SIGIR 2022 shor

Jamie J. Seol 22 Dec 13, 2022
CTRL-C: Camera calibration TRansformer with Line-Classification

CTRL-C: Camera calibration TRansformer with Line-Classification This repository contains the official code and pretrained models for CTRL-C (Camera ca

57 Nov 14, 2022
Official code of the paper "Expanding Low-Density Latent Regions for Open-Set Object Detection" (CVPR 2022)

OpenDet Expanding Low-Density Latent Regions for Open-Set Object Detection (CVPR2022) Jiaming Han, Yuqiang Ren, Jian Ding, Xingjia Pan, Ke Yan, Gui-So

csuhan 64 Jan 07, 2023
Implementation of ConvMixer in TensorFlow and Keras

ConvMixer ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on

Sayan Nath 8 Oct 03, 2022
YoloV3 Implemented in Tensorflow 2.0

YoloV3 Implemented in TensorFlow 2.0 This repo provides a clean implementation of YoloV3 in TensorFlow 2.0 using all the best practices. Key Features

Zihao Zhang 2.5k Dec 26, 2022
PyKaldi GOP-DNN on Epa-DB

PyKaldi GOP-DNN on Epa-DB This repository has the tools to run a PyKaldi GOP-DNN algorithm on Epa-DB, a database of non-native English speech by Spani

18 Dec 14, 2022