Train a GPT-3 model on a V100 (16GB memory) using an improved Transformer.

Last update: Sep 11, 2022

Overview

Pytorch GPT-X

My Own Pytorch GPT-X

1. Abstract

Train a GPT-3 model on a V100 (16GB memory) using an improved Transformer.

2. Model

Transformer

Additional Module

① ReZero

ReZero is All You Need: Fast Convergence at Large Depth
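
The idea behind ReZero can be sketched as a residual branch scaled by a learnable scalar that starts at zero, so each layer begins as the identity. The snippet below is a minimal illustration, not this repository's code; the class name `ReZeroBlock` and the small feed-forward sublayer are assumptions made for the example.

```python
import torch
import torch.nn as nn


class ReZeroBlock(nn.Module):
    """Residual wrapper computing x + alpha * sublayer(x), with alpha starting at 0.

    Hypothetical helper for illustration only.
    """

    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        # One learnable scalar per residual branch, initialised to zero
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.sublayer(x)


d_model = 16  # toy size, chosen only for the example
feed_forward = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)
block = ReZeroBlock(feed_forward)

x = torch.randn(2, 5, d_model)  # (batch, sequence, d_model)
y = block(x)
```

Because `alpha` is zero at initialisation, the block passes its input through unchanged until training moves `alpha` away from zero.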

② Explicit Sparse Transformer

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
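
Explicit selection can be illustrated as keeping only the k largest attention scores for each query and masking the rest before the softmax. The following is a rough sketch under that reading; the function name `explicit_sparse_attention` and the tensor sizes are assumptions, and the causal mask a GPT-style decoder would also need is left out to keep the example short.

```python
import math
import torch


def explicit_sparse_attention(q, k, v, topk):
    """Scaled dot-product attention that keeps only the top-k scores per query."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    topk = min(topk, scores.size(-1))
    # The k-th largest score in each row acts as the selection threshold
    kth_value = scores.topk(topk, dim=-1).values[..., -1:]
    # Everything below the threshold is excluded from the softmax
    scores = scores.masked_fill(scores < kth_value, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


torch.manual_seed(0)
# (batch, heads, sequence, d_head), toy sizes chosen for the example
q = torch.randn(1, 2, 6, 8)
k = torch.randn(1, 2, 6, 8)
v = torch.randn(1, 2, 6, 8)
out, weights = explicit_sparse_attention(q, k, v, topk=3)
```

Each row of `weights` still sums to one, but only the selected keys receive non-zero weight.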

③ Macaron Architecture

Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
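
A Macaron-style layer sandwiches attention between two feed-forward sublayers, each applied with a 0.5 scale on its residual branch. The sketch below shows that ordering; the class name `MacaronLayer`, the pre-norm placement of `LayerNorm`, and the use of `nn.MultiheadAttention` with a causal mask are assumptions for illustration, not this repository's implementation.

```python
import torch
import torch.nn as nn


def make_feed_forward(d_model, mult=4):
    """Position-wise feed-forward network (hypothetical helper)."""
    return nn.Sequential(
        nn.Linear(d_model, mult * d_model),
        nn.GELU(),
        nn.Linear(mult * d_model, d_model),
    )


class MacaronLayer(nn.Module):
    """Half-step FFN -> self-attention -> half-step FFN."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ff_in = make_feed_forward(d_model)
        self.ff_out = make_feed_forward(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_ff_in = nn.LayerNorm(d_model)
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ff_out = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # True marks positions a query may NOT attend to (future tokens)
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        x = x + 0.5 * self.ff_in(self.norm_ff_in(x))    # first half-step FFN
        h = self.norm_attn(x)
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        x = x + 0.5 * self.ff_out(self.norm_ff_out(x))  # second half-step FFN
        return x


torch.manual_seed(0)
layer = MacaronLayer(d_model=32, n_heads=4)
x = torch.randn(2, 10, 32)  # (batch, sequence, d_model)
y = layer(x)
```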

④ RealFormer, Residual Attention

RealFormer: Transformer Likes Residual Attention
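
Residual attention can be sketched as adding the previous layer's pre-softmax attention scores to the current layer's scores, then passing the sum on to the next layer. The function name `residual_attention` and the toy tensors below are assumptions for illustration, and the example assumes self-attention with equal query and key lengths.

```python
import math
import torch


def residual_attention(q, k, v, prev_scores=None):
    """Causal attention whose raw scores accumulate across layers."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if prev_scores is not None:
        # Residual connection on the pre-softmax attention scores
        scores = scores + prev_scores
    seq_len = scores.size(-1)
    causal = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device),
        diagonal=1,
    )
    weights = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
    # Return the unmasked scores so the next layer can add to them
    return weights @ v, scores


torch.manual_seed(0)
# (batch, heads, sequence, d_head), toy sizes chosen for the example
q = torch.randn(1, 2, 5, 8)
k = torch.randn(1, 2, 5, 8)
v = torch.randn(1, 2, 5, 8)

out1, scores1 = residual_attention(q, k, v)                       # first layer
out2, scores2 = residual_attention(q, k, v, prev_scores=scores1)  # next layer
```

In a real model each layer would have its own projections for `q`, `k`, and `v`; the same tensors are reused here only to keep the example small.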

Train

DeepSpeed
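
For orientation, a DeepSpeed run is driven by a JSON configuration. The sketch below builds one as a Python dictionary with fp16 and ZeRO stage 2 with optimizer offload to CPU, which is a common way to fit larger models on a 16GB GPU. All values are assumptions chosen for illustration and are not taken from this repository.

```python
import json

# Illustrative values only; tune them for the actual model and hardware
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # Keep optimizer states in CPU memory to save GPU memory
        "offload_optimizer": {"device": "cpu"},
    },
}

ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```

The dictionary (or a file containing the printed JSON) can then be supplied as the `config` argument of `deepspeed.initialize`.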

TODO

~~ReZero~~
RealFormer, Residual Attention
~~Macaron architectures~~
~~Macaron architectures - layer Scale 0.5~~
~~Explicit Sparse Transformer~~
PyTorch Lightning
DeepSpeed training on a single GPU
DeepSpeed parallel training on 2 V100 GPUs (16GB memory each)

Parameter For Few-shot

The 175B-parameter model is very large, but few-shot learning needs a large model. This repository therefore tries to use DeepSpeed to train extremely large models.

GPT-3 Config

model_name	n_params	n_layer	d_model	n_heads	d_head	batch_size	learning_rate
GPT-3 175B	175B	96	12288	96	128	3.2M	0.6 x 10^-4
GPT-3 13B	13B	40	5140	40	128	2M	1.0 x 10^-4
GPT-3 6.7B	6.7B	32	4096	32	128	2M	1.2 x 10^-4
GPT-3 2.7B	2.7B	32	2560	32	80	1M	1.6 x 10^-4
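
As a sanity check on the table, the parameter counts can be approximated from `n_layer` and `d_model` with the common rule of thumb of roughly 12 * n_layer * d_model^2 non-embedding parameters. The sketch below applies it, using d_model = 2560 for the 2.7B model (32 heads x 80 dimensions per head), and also estimates the size of the fp16 weights alone. The helper name `approx_params` is an assumption for illustration.

```python
def approx_params(n_layer, d_model):
    """Rough non-embedding parameter count: 12 * n_layer * d_model^2."""
    return 12 * n_layer * d_model ** 2


# (n_layer, d_model) for each model in the table
configs = {
    "GPT-3 175B": (96, 12288),
    "GPT-3 13B": (40, 5140),
    "GPT-3 6.7B": (32, 4096),
    "GPT-3 2.7B": (32, 2560),
}

for name, (n_layer, d_model) in configs.items():
    n_params = approx_params(n_layer, d_model)
    fp16_gib = n_params * 2 / 2 ** 30  # 2 bytes per fp16 weight
    print(f"{name}: ~{n_params / 1e9:.1f}B parameters, ~{fp16_gib:.1f} GiB of fp16 weights")
```

The estimate ignores embeddings, gradients, optimizer states, and activations, so actual training memory is several times larger than the weight size alone.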

References

Transformer

lucidrains/x-transformers

DeepSpeed

ReZero

majumderb/rezero

Explicit Sparse Transformer

x-transformer: explicit_sparse_transformer

Macaron Architecture

Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

Owner

Seonghwan Kim
