Implementation of Vaswani, Ashish, et al. "Attention is all you need."

Overview

Attention Is All You Need Paper Implementation

This is my from-scratch implementation of the original transformer architecture from the following paper: Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

Table of Contents

About

"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. " - Abstract

Transformers came to be a groundbreaking advance in neural network architectures which revolutionized what we can do with NLP and beyond. To name a few applications consider the application of BERT to Google search and GPT to Github Copilot. Those architectures are upgrades on the original transformer architecture described in this seminal paper. The goal of this repository is to provide an implementation that is easy to follow and understand while reading the paper. Setup is easy and everything is runnable on CPU for learning purposes.

✔️ Highly customizable configuration and training loop
✔️ Runnable on CPU and GPU
✔️ W&B integration for detailed logging of every metric
✔️ Pretrained models and their training details
✔️ Gradient Accumulation
✔️ Label smoothing
✔️ BPE and WordLevel Tokenizers
✔️ Dynamic Batching
✔️ Batch Dataset Processing
✔️ Bleu-score calculation during training
✔️ Documented dimensions for every step of the architecture
✔️ Shown progress of translation for an example after every epoch
✔️ Tutorial notebook (Coming soon...)

Setup

Environment

Using Miniconda/Anaconda:

  1. cd path_to_repo
  2. conda env create
  3. conda activate attention-is-all-you-need-paper

Note: Depending on your GPU you might need to switch cudatoolkit to version 10.2

Pretrained Models

To download the pretrained model and tokenizer run:

python scripts/download_pretrained.py

Note: If prompted about wandb setting select option 3

Usage

Training

Before starting training you can either choose a configuration out of available ones or create your own inside a single file src/config.py. The available parameters to customize, sorted by categories, are:

  • Run 🚅 :
    • RUN_NAME - Name of a training run
    • RUN_DESCRIPTION - Description of a training run
    • RUNS_FOLDER_PTH - Saving destination of a training run
  • Data 🔡 :
    • DATASET_SIZE - Number of examples you want to include from WMT14 en-de dataset (max 4,500,000)
    • TEST_PROPORTION - Test set proportion
    • MAX_SEQ_LEN - Maximum allowed sequence length
    • VOCAB_SIZE - Size of the vocabulary (good choice is dependant on the tokenizer)
    • TOKENIZER_TYPE - 'wordlevel' or 'bpe'
  • Training 🏋️‍♂️ :
    • BATCH_SIZE - Batch size
    • GRAD_ACCUMULATION_STEPS - Over how many batches to accumulate gradients before optimizing the parameters
    • WORKER_COUNT - Number of workers used in dataloaders
    • EPOCHS - Number of epochs
  • Optimizer 📉 :
    • BETAS - Adam beta parameter
    • EPS - Adam eps parameter
  • Scheduler ⏲️ :
    • N_WARMUP_STEPS - How many warmup steps to use in the scheduler
  • Model 🤖 :
    • D_MODEL - Model dimension
    • N_BLOCKS - Number of encoder and decoder blocks
    • N_HEADS - Number of heads in the Multi-Head attention mechanism
    • D_FF - Dimension of the Position Wise Feed Forward network
    • DROPOUT_PROBA - Dropout probability
  • Other 🧰 :
    • DEVICE - 'gpu' or 'cpu'
    • MODEL_SAVE_EPOCH_CNT - After how many epochs to save a model checkpoint
    • LABEL_SMOOTHING - Whether to apply label smoothing

Once you decide on the configuration edit the config_name in train.py and do:

$ cd src
$ python train.py

Inference

For inference I created a simple app with Streamlit which runs in your browser. Make sure to train or download the pretrained models beforehand. The app looks at the model directory for model and tokenizer checkpoints.

$ streamlit run app/inference_app.py
app.mp4

Data

Same WMT 2014 data is used for the English-to-German translation task. Dataset contains about 4,500,000 sentence pairs but you can manually specify the dataset size if you want to lower it and see some results faster. When training is initiated the dataset is automatically downloaded, preprocessed, tokenized and dataloaders are created. Also, a custom batch sampler is used for dynamic batching and padding of sentences of similar lengths which speeds up training. HuggingFace 🤗 datasets and tokenizers are used to achieve this very fast.

Architecture

The original transformer architecture presented in this paper consists of an encoder and decoder part purposely included to match the seq2seq problem type of machine translation. There are also encoder-only (e.g. BERT) and decoder-only (e.g. GPT) transformer architectures, those won't be covered here. One of the main features of transformers , in general, is parallelized sequence processing which RNN's lack. Main ingredient here is the attention mechanism which enables creating modified word representations (attention representations) that take into account the word's meaning in relation to other words in a sequence (e.g. the word "bank" can represent a financial institution or land along the edge of a river as in "river bank"). Depending on how we think about a word we may choose to represent it differently. This transcends the limits of traditional word embeddings.

For a detailed walkthrough of the architecture check the notebooks/tutorial.ipynb

Weights and Biases Logs

Weights and Biases is a very powerful tool for MLOps. I integrated it with this project to automatically provide very useful logs and visualizations when training. In fact, you can take a look at how the training looked for the pretrained models at this project link. All logs and visualizations are synced real time to the cloud.

When you start training you will be asked:

wandb: (1) Create W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 

For creating and syncing the visualizations to the cloud you will need a W&B account. Creating an account and using it won't take you more than a minute and it's free. If don't want to visualize results select option 3.

Citation

Please use this bibtex if you want to cite this repository:

@misc{Koch2021attentionisallyouneed,
  author = {Koch, Brando},
  title = {attention-is-all-you-need},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/bkoch4142/MISSING}},
}

License

This repository is under an MIT License

License: MIT

Owner
Brando Koch
Machine Learning Engineer with experience in ML, DL , NLP & CV specializing in ConversationalAI & NLP.
Brando Koch
SPRING is a seq2seq model for Text-to-AMR and AMR-to-Text (AAAI2021).

SPRING This is the repo for SPRING (Symmetric ParsIng aNd Generation), a novel approach to semantic parsing and generation, presented at AAAI 2021. Wi

Sapienza NLP group 98 Dec 21, 2022
SCI-AIDE : High-fidelity Few-shot Histopathology Image Synthesis for Rare Cancer Diagnosis

SCI-AIDE : High-fidelity Few-shot Histopathology Image Synthesis for Rare Cancer Diagnosis Pretrained Models In this work, we created synthetic tissue

Emirhan Kurtuluş 1 Feb 07, 2022
Second-Order Neural ODE Optimizer, NeurIPS 2021 spotlight

Second-order Neural ODE Optimizer (NeurIPS 2021 Spotlight) [arXiv] ✔️ faster convergence in wall-clock time | ✔️ O(1) memory cost | ✔️ better test-tim

Guan-Horng Liu 39 Oct 22, 2022
Meaningful titles for tabs and PDF downloads! Also supports tab search.

arxiv-utils If you are a researcher that reads a lot on ArXiv, you'll benefit a lot from this web extension. Renames the title of PDF page to the pape

Johnson 174 Dec 20, 2022
Transfer style api - An API to use with Tranfer Style App, where you can use two image and transfer the style

Transfer Style API It's an API to use with Tranfer Style App, where you can use

Brian Alejandro 1 Feb 13, 2022
Implementation for Simple Spectral Graph Convolution in ICLR 2021

Simple Spectral Graph Convolutional Overview This repo contains an example implementation of the Simple Spectral Graph Convolutional (S^2GC) model. Th

allenhaozhu 64 Dec 31, 2022
Fast image augmentation library and easy to use wrapper around other libraries. Documentation: https://albumentations.ai/docs/ Paper about library: https://www.mdpi.com/2078-2489/11/2/125

Albumentations Albumentations is a Python library for image augmentation. Image augmentation is used in deep learning and computer vision tasks to inc

11.4k Jan 09, 2023
Conditional Gradients For The Approximately Vanishing Ideal

Conditional Gradients For The Approximately Vanishing Ideal Code for the paper: Wirth, E., and Pokutta, S. (2022). Conditional Gradients for the Appro

IOL Lab @ ZIB 0 May 25, 2022
Build Graph Nets in Tensorflow

Graph Nets library Graph Nets is DeepMind's library for building graph networks in Tensorflow and Sonnet. Contact DeepMind 5.2k Jan 05, 2023

Creative Applications of Deep Learning w/ Tensorflow

Creative Applications of Deep Learning w/ Tensorflow This repository contains lecture transcripts and homework assignments as Jupyter Notebooks for th

Parag K Mital 1.5k Dec 30, 2022
[CVPR'22] Weakly Supervised Semantic Segmentation by Pixel-to-Prototype Contrast

wseg Overview The Pytorch implementation of Weakly Supervised Semantic Segmentation by Pixel-to-Prototype Contrast. [arXiv] Though image-level weakly

Ye Du 96 Dec 30, 2022
Official implementation for “Unsupervised Low-Light Image Enhancement via Histogram Equalization Prior”

HEP Unsupervised Low-Light Image Enhancement via Histogram Equalization Prior Implementation Python3 PyTorch=1.0 NVIDIA GPU+CUDA Training process The

FengZhang 34 Dec 04, 2022
Distance Encoding for GNN Design

Distance-encoding for GNN design This repository is the official PyTorch implementation of the DEGNN and DEAGNN framework reported in the paper: Dista

172 Nov 08, 2022
Efficient semidefinite bounds for multi-label discrete graphical models.

Low rank solvers #################################### benchmark/ : folder with the random instances used in the paper. ############################

1 Dec 08, 2022
Learnable Motion Coherence for Correspondence Pruning

Learnable Motion Coherence for Correspondence Pruning Yuan Liu, Lingjie Liu, Cheng Lin, Zhen Dong, Wenping Wang Project Page Any questions or discussi

liuyuan 41 Nov 30, 2022
Chess reinforcement learning by AlphaGo Zero methods.

About Chess reinforcement learning by AlphaGo Zero methods. This project is based on these main resources: DeepMind's Oct 19th publication: Mastering

Samuel 2k Dec 29, 2022
Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism

Period-alternatives-of-Softmax Experimental Demo for our paper 'Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechani

slwang9353 0 Sep 06, 2021
End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)

PDVC Official implementation for End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021) [paper] [valse论文速递(Chinese)] This repo supports:

Teng Wang 118 Dec 16, 2022
Range Image-based LiDAR Localization for Autonomous Vehicles Using Mesh Maps

Range Image-based 3D LiDAR Localization This repo contains the code for our ICRA2021 paper: Range Image-based LiDAR Localization for Autonomous Vehicl

Photogrammetry & Robotics Bonn 208 Dec 15, 2022
Implementation of paper "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal"

Patch-wise Adversarial Removal Implementation of paper "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal

4 Oct 12, 2022