Understanding the Difficulty of Training Transformers

Overview

License PWC

Admin

Understanding the Difficulty of Training Transformers

Guided by our analyses, we propose Adaptive Model Initialization (Admin), which successfully stabilizes previously-diverged Transformer training and achieves better performance, without introducing additional hyper-parameters. Admin is adapted for better half-precision stability and can be reparameterized into the original Transformer.

We are in an early-release beta. Expect some adventures and rough edges.

Table of Contents

Introduction

What complicates Transformer training?

In our study, we go beyond gradient vanishing and identify an amplification effect that substantially influences Transformer training. Specifically, for each layer in a multi-layer Transformer, heavy dependency on its residual branch makes training unstable, yet light dependency leads to sub-optimal performance.

Dependency and Amplification Effect

Our analysis starts from the observation that Pre-LN is more robust than Post-LN, whereas Post-LN typically leads to a better performance. As shown in Figure 1, we find these two variants have different layer dependency patterns.

With further exploration, we find that for a N-layer residual network, after updating its parameters W to W*, its outputs change is proportion to the dependency on residual branches.

Intuitively, since a larger output change indicates a more unsmooth loss surface, the large dependency complicates training. Moreover, we propose Admin (adaptive model initialization), which starts the training from the area with a smoother surface. More details can be found in our paper.

Quick Start Guide

Our implementation is based on the fairseq package (python 3.6, torch 1.5/1.6 are recommended). It can be installed by:

git clone https://github.com/LiyuanLucasLiu/Transforemr-Clinic.git
cd fairseq
pip install --editable .

The guidance for reproducing our results is available at:

Specifically, our implementation requires to first set --init-type adaptive-profiling and use one GPU for this profiling stage, then set --init-type adaptive and start training.

Citation

Please cite the following papers if you found our model useful. Thanks!

Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han (2020). Understanding the Difficulty of Training Transformers. Proc. 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP'20).

@inproceedings{liu2020admin,
  title={Understanding the Difficulty of Training Transformers},
  author = {Liu, Liyuan and Liu, Xiaodong and Gao, Jianfeng and Chen, Weizhu and Han, Jiawei},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)},
  year={2020}
}

Xiaodong Liu, Kevin Duh, Liyuan Liu, and Jianfeng Gao (2020). Very Deep Transformers for Neural Machine Translation. arXiv preprint arXiv:2008.07772 (2020).

@inproceedings{liu_deep_2020,
 author = {Liu, Xiaodong and Duh, Kevin and Liu, Liyuan and Gao, Jianfeng},
 booktitle = {arXiv:2008.07772 [cs]},
 title = {Very Deep Transformers for Neural Machine Translation},
 year = {2020}
}
Owner
Liyuan Liu
Ph.D. Student @ DMG, UIUC
Liyuan Liu
Predict an emoji that is associated with a text

Sentiment Analysis Sentiment analysis in computational linguistics is a general term for techniques that quantify sentiment or mood in a text. Can you

Tetsumichi(Telly) Umada 30 Sep 07, 2022
DomainWordsDict, Chinese words dict that contains more than 68 domains, which can be used as text classification、knowledge enhance task

DomainWordsDict, Chinese words dict that contains more than 68 domains, which can be used as text classification、knowledge enhance task。涵盖68个领域、共计916万词的专业词典知识库,可用于文本分类、知识增强、领域词汇库扩充等自然语言处理应用。

liuhuanyong 357 Dec 24, 2022
Mesh TensorFlow: Model Parallelism Made Easier

Mesh TensorFlow - Model Parallelism Made Easier Introduction Mesh TensorFlow (mtf) is a language for distributed deep learning, capable of specifying

1.3k Dec 26, 2022
Wake: Context-Sensitive Automatic Keyword Extraction Using Word2vec

Wake Wake: Context-Sensitive Automatic Keyword Extraction Using Word2vec Abstract استخراج خودکار کلمات کلیدی متون کوتاه فارسی با استفاده از word2vec ب

Omid Hajipoor 1 Dec 17, 2021
PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Feature_CRF_AE Feature_CRF_AE provides a implementation of Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging

Jacob Zhou 6 Apr 29, 2022
Auto_code_complete is a auto word-completetion program which allows you to customize it on your needs

auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the model for this program is one of the deep-learning NLP(Natural Language Process) model struc

RUO 2 Feb 22, 2022
Just Another Telegram Ai Chat Bot Written In Python With Pyrogram.

OkaeriChatBot Just another Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher.

Wahyusaputra 2 Dec 23, 2021
Main repository for the chatbot Bobotinho.

Bobotinho Bot Main repository for the chatbot Bobotinho. ℹ️ Introduction Twitch chatbot with entertainment commands. ‎ 💻 Technologies Concurrent code

Bobotinho 14 Nov 29, 2022
MRC approach for Aspect-based Sentiment Analysis (ABSA)

B-MRC MRC approach for Aspect-based Sentiment Analysis (ABSA) Paper: Bidirectional Machine Reading Comprehension for Aspect Sentiment Triplet Extracti

Phuc Phan 1 Apr 05, 2022
🧪 Cutting-edge experimental spaCy components and features

spacy-experimental: Cutting-edge experimental spaCy components and features This package includes experimental components and features for spaCy v3.x,

Explosion 65 Dec 30, 2022
iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform

iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform This repo try to implement iSTFTNet : Fast

Rishikesh (ऋषिकेश) 126 Jan 02, 2023
Based on 125GB of data leaked from Twitch, you can see their monthly revenues from 2019-2021

Twitch Revenues Bu script'i kullanarak istediğiniz yayıncıların, Twitch'den sızdırılan 125 GB'lik veriye dayanarak, 2019-2021 arası aylık gelirlerini

4 Nov 11, 2021
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

Antlr Project 13.6k Jan 05, 2023
A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

tfds-korean A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. TensorFlow-Datasets를 이용한 한국어/한글 데이터셋 모음입니다. Dataset Catalog |

Jeong Ukjae 20 Jul 11, 2022
Python powered crossword generator with database with 20k+ polish words

crossword_generator Generate simple crossword puzzle from words and definitions fetched from krzyżowki.edu.pl endpoints -/ string:word - returns js

0 Jan 04, 2022
Twitter Sentiment Analysis using #tag, words and username

Twitter Sentment Analysis Web App using #tag, words and username to fetch data finds Insides of data and Tells Sentiment of the perticular #tag, words or username.

Kumar Saksham 26 Dec 25, 2022
Torchrecipes provides a set of reproduci-able, re-usable, ready-to-run RECIPES for training different types of models, across multiple domains, on PyTorch Lightning.

Recipes are a standard, well supported set of blueprints for machine learning engineers to rapidly train models using the latest research techniques without significant engineering overhead.Specifica

Meta Research 193 Dec 28, 2022
BeautyNet is an AI powered model which can tell you whether you're beautiful or not.

BeautyNet BeautyNet is an AI powered model which can tell you whether you're beautiful or not. Download Dataset from here:https://www.kaggle.com/gpios

Ansh Gupta 0 May 06, 2022
chaii - hindi & tamil question answering

chaii - hindi & tamil question answering This is the solution for rank 5th in Kaggle competition: chaii - Hindi and Tamil Question Answering. The comp

abhishek thakur 33 Dec 18, 2022
Lyrics generation with GPT2-based Transformer

HuggingArtists - Train a model to generate lyrics Create AI-Artist in just 5 minutes! 🚀 Run the demo notebook to train 🚀 Run the GUI demo to test Di

Aleksey Korshuk 65 Dec 19, 2022