SmallInitEmb - LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

Last update: Dec 25, 2022

SmallInitEmb

LayerNorm(SmallInit(Embedding)) in a Transformer

When training a transformer, I find that the embedding matrix moves slowly, so it is difficult for the model to escape its initial noisy embedding.

(initial embedding)
[[-0.0073  0.0062 -0.0261 ...  0.0086  0.0107 -0.008 ] ... ]
(after 1 step: the directions of the embedding vectors barely change, because each entry moves by only about LR = ~4e-4)
[[-0.0069  0.0066 -0.0265 ...  0.009   0.0111 -0.0084] ... ]

So I propose initializing the embedding matrix to tiny values and putting another LayerNorm after it (before all the SA & FFN layers):

if isinstance(module, (nn.Embedding)):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
if self.config.USE_SMALL_EMB and self.layer_id == 0:
    x = self.lnPre(x) # LN(SmallInit(Emb))
x = x + self.att(self.ln1(x))
x = x + self.ffn(self.ln2(x))
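The fragments above can be assembled into a small runnable model. This is only a minimal sketch under stated assumptions: the class names `Block` and `TinyModel`, the sizes, the output head, and the use of `nn.MultiheadAttention` and a GELU MLP as stand-ins for the SA and FFN layers are all illustrative and not taken from the original code. The causal mask is omitted for brevity.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """PreLN block; only the first block applies the extra LayerNorm (lnPre)."""
    def __init__(self, n_embd, n_head, layer_id, use_small_emb=True):
        super().__init__()
        self.layer_id = layer_id
        self.use_small_emb = use_small_emb
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        if use_small_emb and layer_id == 0:
            self.lnPre = nn.LayerNorm(n_embd)
        # Stand-ins for the SA and FFN layers (assumptions, not the author's modules).
        self.att = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        if self.use_small_emb and self.layer_id == 0:
            x = self.lnPre(x)  # LN(SmallInit(Emb))
        h = self.ln1(x)
        x = x + self.att(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x

class TinyModel(nn.Module):
    """Hypothetical wrapper: embedding -> blocks -> final LN -> output head."""
    def __init__(self, vocab_size=100, n_embd=32, n_head=4, n_layer=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        self.blocks = nn.ModuleList([Block(n_embd, n_head, i) for i in range(n_layer)])
        self.ln_out = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Embedding):
            nn.init.uniform_(module.weight, a=-1e-4, b=1e-4)  # SmallInit(Emb)

    def forward(self, idx):
        x = self.emb(idx)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_out(x))
```

Calling `TinyModel()(torch.randint(0, 100, (2, 5)))` returns logits of shape `(2, 5, 100)`. The extra LayerNorm exists only in the first block, so the cost of the change is one additional LayerNorm per model.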

LayerNorm(SmallInit(Emb)) improves convergence (especially for BPE models) because the model can quickly escape the tiny initial embedding: small changes after 1 step -> significant changes in direction -> significant changes after LayerNorm.
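The direction argument can be checked numerically with a toy calculation. The sketch below uses only the Python standard library and rests on loud assumptions: the first optimizer step is modelled as moving every entry by +/- LR (roughly what an Adam-like update does), the "regular" init scale of 0.02 is a guess at a typical baseline, and `direction_after_one_step` is a hypothetical helper, not part of the original code.

```python
import math
import random

def layer_norm(v, eps=1e-5):
    """LayerNorm without learned gain/bias: subtract the mean, divide by sqrt(var + eps)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def direction_after_one_step(init_scale, lr=4e-4, dim=256, seed=0):
    """Cosine similarity of the LayerNorm output before vs. after one simulated step."""
    rng = random.Random(seed)
    before = [rng.uniform(-init_scale, init_scale) for _ in range(dim)]
    # Assumption: the first update moves each entry by about +/- lr.
    after = [x + rng.choice((-lr, lr)) for x in before]
    return cosine(layer_norm(before), layer_norm(after))

regular = direction_after_one_step(init_scale=0.02)  # assumed baseline init scale
small = direction_after_one_step(init_scale=1e-4)    # SmallInit(Emb)
print(f"regular init: cos = {regular:.4f}")
print(f"small init:   cos = {small:.4f}")
```

Under these assumptions the regular-init vector keeps almost the same direction after one step (cosine close to 1), while the small-init vector is dominated by the update and points somewhere quite different. Cosine similarity ignores overall scale, so the comparison does not depend on the LayerNorm `eps` value.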

Loss curve comparison: https://wandb.ai/blinkdl/SmallEmbTest

(the gap between LayerNorm(SmallEmb) and the baseline persists after more training)

Moreover, with SmallInit(Emb) you can train PostLN models directly, without warmup:

if isinstance(module, (nn.Embedding)):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
x = self.ln1(x) # this plays the same role as the lnPre in the above PreLN code
x = x + self.att(x)
x = self.ln2(x)
x = x + self.ffn(x)
(note: you should add another LN after the final ffn)
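A self-contained sketch of this PostLN layout, including the extra LN after the final ffn, might look as follows. The names `PostLNBlock` and `PostLNStack`, the sizes, and the attention/FFN stand-ins are assumptions made for illustration; the block's forward pass follows the ordering shown above, and the causal mask is again omitted.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Block following the ordering above: LN -> residual SA -> LN -> residual FFN."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        # Stand-ins for the SA and FFN layers (assumptions, not the author's modules).
        self.att = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        x = self.ln1(x)  # in the first block this normalizes the tiny embedding
        x = x + self.att(x, x, x, need_weights=False)[0]
        x = self.ln2(x)
        x = x + self.ffn(x)
        return x

class PostLNStack(nn.Module):
    """Hypothetical wrapper: SmallInit embedding -> blocks -> final LN."""
    def __init__(self, vocab_size=100, n_embd=32, n_head=4, n_layer=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        nn.init.uniform_(self.emb.weight, a=-1e-4, b=1e-4)  # SmallInit(Emb)
        self.blocks = nn.ModuleList([PostLNBlock(n_embd, n_head) for _ in range(n_layer)])
        self.ln_final = nn.LayerNorm(n_embd)  # the extra LN after the final ffn

    def forward(self, idx):
        x = self.emb(idx)
        for block in self.blocks:
            x = block(x)
        return self.ln_final(x)
```

For example, `PostLNStack()(torch.randint(0, 100, (2, 5)))` returns normalized hidden states of shape `(2, 5, 32)`. No warmup schedule is shown here because the sketch covers only the architecture, not the training loop.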

Owner

PENG Bo
