configaformers

A python library for highly configurable transformers - easing model architecture search and experimentation.

Special thanks to lucidrains (https://github.com/lucidrains) and Kharr.

Notable Features

The main purpose of this library is to let users quickly construct transformers by editing config files. We will also provide prebuilt configurations for common and promising model architectures.

Another feature is our model compiler. When a model is initialized, it prints (to your console) every module along with its input and output names and shapes. It also performs shape checking, which helps catch errors before running data through the model.

Setup

Requirements: PyTorch and einops

git clone https://github.com/antofuller/configaformers.git
cd configaformers
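
If PyTorch and einops are not already installed, both are available from PyPI (for example, with pip):

pip install torch einops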

Usage

Quick demo that configures a 768-wide, 12-layer transformer with a language modeling head.

Import, and create token embedding block:

from model_builder import ConfigaFormer
from prebuilt_blocks import get_transformer_block

model_dim = 768
num_heads = 12
vocab_size = 50257

# Token embedding block
emb = [{'type': 'embedding',
        'output_dim': model_dim,
        'num_classes': vocab_size}]

Use our prebuilt transformer block:

t_block = get_transformer_block(num_heads=num_heads, dim=model_dim)

Create language modeling head:

to_logits = [{'type': 'linear',
              'output_dim': vocab_size,
              'output_name': 'logits'}]

Assemble the blocks, declare the input shapes, and initialize the model:

my_blocks = [{"config": emb,
              "repeat": 1},
             {"config": t_block,
              "repeat": 12},
             {"config": to_logits,
              "repeat": 1},
             ]

input_streams = {'emb_ids': ['B', 'L_in'],
                 'attn_offset': ['B', num_heads, 'L_in', 'L_in']}

model = ConfigaFormer(blocks=my_blocks, input_shapes=input_streams).cuda()

This will print out the transformer config:

Block #1, 1x
embedding -> Input(s): emb_ids (BSZ, L_in) - Output(s): x (BSZ, L_in, 768)


Block #2, 12x
make_stream -> Input(s): x (BSZ, L_in, 768) - Output(s): residual (BSZ, L_in, 768)
norm -> Input(s): x (BSZ, L_in, 768) - Output(s): x (BSZ, L_in, 768)
linear -> Input(s): x (BSZ, L_in, 768) - Output(s): queries (BSZ, L_in, 768)
linear -> Input(s): x (BSZ, L_in, 768) - Output(s): keys (BSZ, L_in, 768)
linear -> Input(s): x (BSZ, L_in, 768) - Output(s): values (BSZ, L_in, 768)
make_heads -> Input(s): queries (BSZ, L_in, 768) - Output(s): queries (BSZ, 12, L_in, 64)
make_heads -> Input(s): keys (BSZ, L_in, 768) - Output(s): keys (BSZ, 12, L_in, 64)
make_heads -> Input(s): values (BSZ, L_in, 768) - Output(s): values (BSZ, 12, L_in, 64)
mha_dots -> Input(s): queries (BSZ, 12, L_in, 64), keys (BSZ, 12, L_in, 64) - Output(s): attn_dots (BSZ, 12, L_in, L_in)
merge_streams -> Input(s): attn_dots (BSZ, 12, L_in, L_in), attn_offset (B, 12, L_in, L_in) - Output(s): attn_dots (BSZ, 12, L_in, L_in)
mha_sum -> Input(s): values (BSZ, 12, L_in, 64), attn_dots (BSZ, 12, L_in, L_in) - Output(s): x (BSZ, 12, L_in, 64)
merge_heads -> Input(s): x (BSZ, 12, L_in, 64) - Output(s): x (BSZ, L_in, 768)
linear -> Input(s): x (BSZ, L_in, 768) - Output(s): x (BSZ, L_in, 768)
merge_streams -> Input(s): x (BSZ, L_in, 768), residual (BSZ, L_in, 768) - Output(s): x (BSZ, L_in, 768)
make_stream -> Input(s): x (BSZ, L_in, 768) - Output(s): residual (BSZ, L_in, 768)
norm -> Input(s): x (BSZ, L_in, 768) - Output(s): x (BSZ, L_in, 768)
linear -> Input(s): x (BSZ, L_in, 768) - Output(s): x (BSZ, L_in, 3072)
activation -> Input(s): x (BSZ, L_in, 3072) - Output(s): x (BSZ, L_in, 3072)
linear -> Input(s): x (BSZ, L_in, 3072) - Output(s): x (BSZ, L_in, 768)
merge_streams -> Input(s): x (BSZ, L_in, 768), residual (BSZ, L_in, 768) - Output(s): x (BSZ, L_in, 768)


Block #3, 1x
linear -> Input(s): x (BSZ, L_in, 768) - Output(s): logits (BSZ, L_in, 50257)
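
The prebuilt transformer block expands into exactly the module sequence printed above. As a rough illustration of writing such a block by hand, the feed-forward half might be expressed as a list of module dicts like the one below. The module types ('make_stream', 'norm', 'linear', 'activation', 'merge_streams') come straight from the printed config, but any keys beyond 'type', 'output_dim', and 'output_name' are assumptions here, so treat this as a sketch rather than working config:

# Hand-written sketch of the feed-forward half of Block #2 (keys other than
# 'type', 'output_dim', and 'output_name' are assumptions, not the library's API)
ffn_sketch = [{'type': 'make_stream', 'output_name': 'residual'},  # save x for the skip connection
              {'type': 'norm'},                                    # pre-norm
              {'type': 'linear', 'output_dim': 4 * model_dim},     # expand 768 -> 3072
              {'type': 'activation'},
              {'type': 'linear', 'output_dim': model_dim},         # project 3072 -> 768
              {'type': 'merge_streams'}]                           # combine x with 'residual' (input naming omitted)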

Before running, we need to get the attention offset (in this case, ALiBi with a causal mask):

from utils import get_alibi

attn_offset = get_alibi(num_heads=12, max_length=1024)
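
get_alibi is the library's own helper. As a rough standalone sketch of what such an offset contains (not the library's implementation), an ALiBi bias plus a causal mask can be built in plain PyTorch, giving a (1, num_heads, max_length, max_length) tensor like the one expected by the 'attn_offset' input stream above:

import torch

def alibi_offset_sketch(num_heads, max_length):
    # Per-head slopes 2^(-8/h), 2^(-16/h), ... (ALiBi paper; assumes a power-of-two head count)
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)])
    pos = torch.arange(max_length)
    rel = (pos.view(1, -1) - pos.view(-1, 1)).clamp(max=0)  # distance to each past position, 0 elsewhere
    bias = slopes.view(num_heads, 1, 1) * rel                # linear penalty, stronger for distant tokens
    causal = torch.full((max_length, max_length), float('-inf')).triu(diagonal=1)  # block future tokens
    return (bias + causal).unsqueeze(0)                      # (1, num_heads, max_length, max_length)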

Now we can use the model:

# bsz is your batch size; batch_ids is a LongTensor of token ids with shape (bsz, 1024)
# Prepare the attention offset by repeating it over the batch dimension
attn_offset = attn_offset.repeat(bsz, 1, 1, 1)

input_data = {'emb_ids': batch_ids.view(bsz, 1024).cuda(),
              'attn_offset': attn_offset.cuda()}

logits = model(input_data)['logits'].view(bsz, 1024, 50257)
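
From here, training is plain PyTorch. For example, a standard next-token cross-entropy loss on those logits (illustrative only; as above, batch_ids is assumed to hold the token ids):

import torch.nn.functional as F

# Shift by one position so the prediction at position t is scored against token t+1
targets = batch_ids.view(bsz, 1024).cuda()
loss = F.cross_entropy(logits[:, :-1].reshape(-1, 50257),  # predictions for positions 0..1022
                       targets[:, 1:].reshape(-1))         # the tokens that actually came next
loss.backward()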

Features on the way...

  1. Revamp rearrange module
  2. Product-Key memories
  3. Create more prebuilt blocks
  4. Improve attention offsets and masking
  5. Experiment with Triton for speed-up
