A fast and easy implementation of Transformer with PyTorch.

Last update: Jul 18, 2022

Overview

FasySeq

FasySeq is short for Fast and easy Sequential modeling toolkit. It aims to give researchers and developers a seq2seq model that can be trained efficiently and modified easily. The toolkit is based on the Transformer (Vaswani et al.); more seq2seq models will be added in the future.

Dependency

PyTorch >= 1.4
NLTK

Result

...

Structure

...

To Be Updated

top-k and top-p sampling
multi-GPU inference
length penalty in beam search
...

Preprocess

Build Vocabulary

createVocab.py

Named Arguments	Description
-f/--file	The files used to build the vocabulary. `Type: List`
--vocab_num	The maximum vocabulary size; the least frequent words beyond this size are discarded. `Type: Int` `Default: -1`
--min_freq	The minimum frequency of a token in the vocabulary. Words with a frequency below min_freq are discarded. `Type: Int` `Default: 0`
--lower	Whether to convert all words to lowercase.
--save_path	The path to save the vocabulary. `Type: str`
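
The interaction of `--vocab_num`, `--min_freq` and `--lower` can be illustrated with a minimal sketch. This is not the code in createVocab.py: the function name `build_vocab`, the whitespace tokenization, the alphabetical tie-breaking, and the reading of `-1` as "no size limit" are all assumptions made for illustration, and special tokens are left out.

```python
from collections import Counter


def build_vocab(sentences, vocab_num=-1, min_freq=0, lower=False):
    """Return tokens ordered by descending frequency (hypothetical helper)."""
    counter = Counter()
    for sentence in sentences:
        if lower:
            sentence = sentence.lower()
        counter.update(sentence.split())  # assumed whitespace tokenization

    # Drop tokens that occur fewer than min_freq times.
    kept = [(tok, n) for tok, n in counter.items() if n >= min_freq]
    # Most frequent first; ties broken alphabetically for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    # vocab_num = -1 is assumed to mean "no size limit".
    if vocab_num >= 0:
        kept = kept[:vocab_num]
    return [tok for tok, _ in kept]


corpus = ["The cat sat", "the cat ran", "A dog ran"]
vocab = build_vocab(corpus, vocab_num=3, min_freq=2, lower=True)
print(vocab)  # ['cat', 'ran', 'the']
```

Without `lower=True`, "The" and "the" would be counted as two different tokens and neither would reach a frequency of 2.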

Process Data

preprocess.py

Named Arguments	Description
--source	The path to the source file. `Type: str`
[--target]	The path to the target file. `Type: str`
--src_vocab	The path to the source vocabulary. `Type: str`
[--tgt_vocab]	The path to the target vocabulary. `Type: str`
--save_path	The path to save the processed data. `Type: str`
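
As a rough picture of what this step typically involves, the sketch below maps a tokenized sentence to vocabulary indices. It is an illustration only: the on-disk format written by preprocess.py is not described here, and the function `encode`, the `<unk>` token name, and the whitespace tokenization are assumptions.

```python
def encode(sentence, token_to_id, unk_id):
    """Map a whitespace-tokenized sentence to a list of integer ids."""
    return [token_to_id.get(tok, unk_id) for tok in sentence.split()]


# Hypothetical vocabulary; "<unk>" is an assumed name for the unknown token.
vocab = ["<unk>", "the", "cat", "sat"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

ids = encode("the cat sat down", token_to_id, unk_id=token_to_id["<unk>"])
print(ids)  # [1, 2, 3, 0]
```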

Train

train.py

Named Arguments	Description
Model	-
--share_embed	Source and target share the same vocabulary and word embedding. With shared embedding, the maximum embedding position is max(max_src_position, max_tgt_position).
--max_src_position	The maximum source position. Source-target pairs whose source sentence is longer than max_src_position are cut or discarded. If max_src_position exceeds the longest source length, it is set to the longest source length. `Type: Int` `Default: inf`
--max_tgt_position	The maximum target position. Source-target pairs whose target sentence is longer than max_tgt_position are cut or discarded. If max_tgt_position exceeds the longest target length, it is set to the longest target length. `Type: Int` `Default: inf`
--position_method	The method used to introduce positional information. `Option: encoding/embedding`
--normalize_before	Use pre-layer normalization (layer normalization applied before each sub-layer). See Xiong et al.
Checkpoint	-
--checkpoint_path	The path to save checkpoint files. `Type: str` `Default: None`
--restore_file	The checkpoint file to load. `Type: str` `Default: None`
--checkpoint_num	Keep only the latest checkpoint_num checkpoints. `Type: Int` `Default: inf`
Data	-
--vocab	Vocabulary path. If you use shared embedding, the vocabulary is loaded from this path. `Type: str` `Default: None`
--src_vocab	Source vocabulary path. `Type: str` `Default: None`
--tgt_vocab	Target vocabulary path. `Type: str` `Default: None`
--file	The training data file. `Type: str`
--max_tokens	The maximum number of tokens in each batch. `Type: Int` `Default: 1000`
--discard_invalid_data	With this option, pairs whose source or target length exceeds the maximum position are discarded; otherwise, long sentences are cut to the maximum position.
Train	-
--cuda_num	The GPU device IDs. `Type: List`
--grad_accumulate	The number of gradient accumulation steps. `Type: Int` `Default: 1`
--epoch	The total number of epochs to train. `Type: Int` `Default: inf`
--batch_print_info	The number of batches between printouts of training information. `Type: Int` `Default: 1000`
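
The length limits and the token budget described in the table can be sketched as follows. This is a hedged illustration, not the logic in train.py: the helper names `limit_length` and `batch_by_tokens` are hypothetical, and the sketch assumes that `--max_tokens` counts padded source tokens (longest sentence in the batch times the number of sentences) and that sentences are sorted by length before batching. The toolkit may count and order differently.

```python
def limit_length(pairs, max_src, max_tgt, discard=False):
    """Apply the position limits to (source, target) token-list pairs."""
    result = []
    for src, tgt in pairs:
        too_long = len(src) > max_src or len(tgt) > max_tgt
        if too_long and discard:
            continue  # behaviour assumed for --discard_invalid_data
        # Otherwise over-long sentences are cut to the maximum position.
        result.append((src[:max_src], tgt[:max_tgt]))
    return result


def batch_by_tokens(pairs, max_tokens):
    """Group pairs so the padded source size stays within max_tokens."""
    batches, current, longest = [], [], 0
    # Sorting by source length keeps padding small (an assumption).
    for pair in sorted(pairs, key=lambda p: len(p[0])):
        new_longest = max(longest, len(pair[0]))
        # Padded size if this pair joined the current batch.
        if current and new_longest * (len(current) + 1) > max_tokens:
            batches.append(current)
            current, new_longest = [], len(pair[0])
        current.append(pair)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


pairs = [
    (["a"] * 3, ["x"] * 2),
    (["a"] * 5, ["x"] * 9),  # exceeds both limits below
    (["a"] * 2, ["x"] * 2),
]
cut = limit_length(pairs, max_src=4, max_tgt=4)
kept = limit_length(pairs, max_src=4, max_tgt=4, discard=True)
batches = batch_by_tokens(cut, max_tokens=6)
print(len(cut), len(kept), [len(b) for b in batches])  # 3 2 [2, 1]
```

In this sketch a single sentence longer than `max_tokens` still forms a batch of its own rather than being dropped.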

Inference

generator.py

Named Arguments	Description
--cuda_num	The GPU device IDs. `Type: List`
--file	The inference data file, already preprocessed. `Type: str`
--raw_file	The raw inference data file, which is preprocessed before generation. `Type: str`
--ref_file	The reference file. `Type: str`
--max_length --max_alpha --max_add_token	Maximum generated length = min(max_length, max_alpha * max_src_len, max_add_token + max_src_token). `Type: Int` `Default: inf`
--max_tokens	The maximum number of tokens in each batch. `Type: Int` `Default: 1000`
--src_vocab	Source vocabulary path. `Type: str` `Default: None`
--tgt_vocab	Target vocabulary path. `Type: str` `Default: None`
--vocab	Vocabulary path. If you use shared embedding, the vocabulary is loaded from this path. `Type: str` `Default: None`
--model_path	The path to the pre-trained model. `Type: str`
--output_path	The output path. The result is saved to `output_path/result.txt`. `Type: str`
--decode_method	The decoding method. `Option: greedy/beam`
--beam	Beam size. `Type: Int` `Default: 5`
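
The length bound controlled by `--max_length`, `--max_alpha` and `--max_add_token` can be worked through in a few lines. The sketch below is an illustration rather than the code in generator.py: the function name is hypothetical, and it assumes that `max_src_len` and `max_src_token` in the formula both refer to the longest source length in the batch, and that the source length is positive.

```python
import math


def max_generated_length(max_src_len, max_length=math.inf,
                         max_alpha=math.inf, max_add_token=math.inf):
    """min(max_length, max_alpha * max_src_len, max_add_token + max_src_len).

    All three limits default to infinity, mirroring `Default: inf`.
    """
    return min(max_length,
               max_alpha * max_src_len,
               max_add_token + max_src_len)


# Longest source sentence has 20 tokens:
# min(100, 1.5 * 20, 5 + 20) = min(100, 30.0, 25)
limit = max_generated_length(20, max_length=100, max_alpha=1.5, max_add_token=5)
print(limit)  # 25
```

With all three options left at their defaults the bound is infinite, so in this sketch only an explicitly set option limits the output length.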

Postprocess

avg_param.py

The parameter-averaging code is the same as the one used in fairseq.
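
For readers unfamiliar with the idea, checkpoint averaging takes the element-wise mean of each parameter across several saved checkpoints. The sketch below shows the general technique only; it is not a copy of avg_param.py or of fairseq's script, and the function name `average_checkpoints` is hypothetical. Plain floats stand in for parameters here; values that support `+` and `/`, such as tensors, would be handled the same way.

```python
def average_checkpoints(state_dicts):
    """Average parameters with the same name across several checkpoints."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise KeyError("checkpoints have different parameter names")
    n = len(state_dicts)
    # Sum each parameter over all checkpoints, then divide by their count.
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}


# Two toy "checkpoints" with scalar parameters.
ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 1.0}]
avg = average_checkpoints(ckpts)
print(avg)  # {'w': 2.0, 'b': 0.5}
```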

License

FasySeq(-py) is released under the Apache-2.0 License. The license applies to the pre-trained models as well.

Releases(checkpoint)

checkpoint(Aug 27, 2021)

Owner

宁羽
