Utilizing a pretrained language model's token embedding and position embedding layers as DALLE's text encoder.
Background
Training a DALLE model from scratch demands a large paired dataset of images and captions. For example, OpenAI's DALLE was trained on more than 250 million text-image pairs.
If the dataset isn't large enough or is limited to a specific domain, the vocabulary of the trained DALLE model is insufficient. For instance, the 1 million text captions of the K-Fashion dataset consist of only around 300 distinct tokens.
Therefore, inference with such a DALLE model can be problematic when the given text query is unrelated to the captions it was originally trained on.
KoDALLE's Results on a Small Fashion Dataset
|                    | OpenAI's DALLE            | KoDALLE of HappyFace                      |
| ------------------ | ------------------------- | ----------------------------------------- |
| Train Dataset Size | 250 Million Pairs         | 0.8 Million Pairs                         |
| #Params            | 12 Billion                | 428 Million                               |
| #Layers            | 64 Layers                 | 16 Layers                                 |
| Computing Resource | 1024 x V100 16GB          | 1 x V100 32GB                             |
| Text Encoder       | 16384 Vocab x 512 Dim BPE | 32000 Vocab x 1024 Dim klue/roberta-large |
| Image Encoder      | VQVAE                     | VQGAN                                     |
| Optimizer          | AdamW                     | AdamW                                     |
| Learning Rate      | 4.5e-5                    | 3.0e-5                                    |
| Weight Decay       | 4.5e-3                    | 3.0e-3                                    |
| LR Scheduler       | ReduceLROnPlateau         | -                                         |
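As a rough illustration of the optimizer settings in the KoDALLE column above, the PyTorch sketch below wires up AdamW with the listed learning rate and weight decay; the `dalle` module here is a hypothetical placeholder standing in for the actual model instance.

```python
import torch
from torch import nn

# Hypothetical placeholder standing in for the ~428M-parameter KoDALLE model.
dalle = nn.Linear(1024, 1024)

# AdamW with the learning rate and weight decay listed in the table;
# no LR scheduler is used on the KoDALLE side (the "-" entry).
optimizer = torch.optim.AdamW(
    dalle.parameters(),
    lr=3.0e-5,
    weight_decay=3.0e-3,
)
```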
The team constructed a Korean text-to-fashion-design DALLE model with fewer than 100k sampled text-image pairs.
Experiments were conducted with the embedding layers of the following Korean Transformer models. The team selected klue/roberta-large as the baseline for the repository, considering the size of the model.
KoDALLE reuses klue/roberta-large's wpe and wte (see the sketch below) and is trainable in a 16GB GPU Google Colab environment. Hyperparameters related to the DALLE model size are summarized in the table above.
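A minimal sketch of the embedding reuse, assuming the Hugging Face transformers library; the `dalle.text_emb` / `dalle.text_pos_emb` attribute names are only illustrative and are not the repository's actual API.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")
backbone = AutoModel.from_pretrained("klue/roberta-large")

# RoBERTa keeps its 32000 x 1024 token embedding (wte) and its
# position embedding (wpe) under `embeddings`.
wte = backbone.embeddings.word_embeddings
wpe = backbone.embeddings.position_embeddings

# A DALLE-style decoder could reuse these layers instead of learning its own
# text embeddings from scratch, e.g. (illustrative attribute names):
# dalle.text_emb = wte
# dalle.text_pos_emb = wpe

# Embed a Korean caption ("a romantic-style long dress") with the reused wte.
tokens = tokenizer("로맨틱 스타일의 롱 원피스", return_tensors="pt")
text_embeddings = wte(tokens["input_ids"])  # shape: (1, seq_len, 1024)
```

Reusing the pretrained Korean embeddings is what lets the 32000-token vocabulary carry over, even though the fashion captions themselves cover only a few hundred distinct tokens.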