Implementation of the GBST block from the Charformer paper, in Pytorch

Last update: Dec 26, 2022

Overview

Charformer - Pytorch

Implementation of the GBST (gradient-based subword tokenization) module from the Charformer paper, in Pytorch. The paper proposes a module that automatically learns subword representations, obviating the need for tokenizers in the encoder setting.

AI Coffee Break with Letitia video

Install

$ pip install charformer-pytorch

Usage

import torch
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 257,             # number of tokens, should be 256 for byte encoding (+ 1 special token for padding in this example)
    dim = 512,                    # dimension of token and intra-block positional embedding
    max_block_size = 4,           # maximum block size
    downsample_factor = 4,        # factor by which the sequence length is reduced
    score_consensus_attn = True   # whether to do the cheap score consensus (aka attention) as in eq. 5 in the paper
)

tokens = torch.randint(0, 257, (1, 1023)) # uneven number of tokens (1023)
mask   = torch.ones(1, 1023).bool()

# both tokens and mask will be appropriately downsampled

tokens, mask = tokenizer(tokens, mask = mask) # (1, 256, 512), (1, 256)

# now pass this on to your transformer
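
The shapes in the comment above (1023 tokens in, 256 positions out) are consistent with the sequence being padded up to a multiple of the downsample factor before pooling. A minimal sketch of that arithmetic, assuming ceiling division; `gbst_output_length` is a hypothetical helper and not part of the library:

```python
import math

def gbst_output_length(seq_len: int, downsample_factor: int) -> int:
    """Expected sequence length after downsampling.

    Assumption: the input is padded up to the next multiple of
    downsample_factor, so the result is a ceiling division.
    """
    return math.ceil(seq_len / downsample_factor)

print(gbst_output_length(1023, 4))
```

This is handy for sizing the positional embeddings of the transformer that consumes the downsampled sequence.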

Citations

@misc{tay2021charformer,
    title   = {Charformer: Fast Character Transformers via Gradient-based Subword Tokenization}, 
    author  = {Yi Tay and Vinh Q. Tran and Sebastian Ruder and Jai Gupta and Hyung Won Chung and Dara Bahri and Zhen Qin and Simon Baumgartner and Cong Yu and Donald Metzler},
    year    = {2021},
    eprint  = {2106.12672},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}

Comments

positional embedding

In section 2.1.1 of the paper, the authors claim that by adding intra-block positional embeddings (https://github.com/lucidrains/charformer-pytorch/blob/main/charformer_pytorch/charformer_pytorch.py#L90-L96), the block representations will be aware of the position of each character. However, if one mean-pools each block as the authors propose, wouldn't this amount to just adding the mean of the positional embeddings to every block? If anyone has any insights, please leave a comment.
help wanted

opened by lucidrains 3
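
The concern raised in this comment can be checked numerically. The sketch below uses made-up toy vectors and a hypothetical `mean_pool` helper, and assumes that nothing other than the addition of positional embeddings happens between embedding and pooling. Under that assumption, mean pooling is linear, so the pooled block equals the mean of the characters plus the mean of the positions, and reordering the characters inside a block does not change the result:

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

# toy data: one block of 4 characters, embedding dimension 2 (illustrative values)
chars = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
positions = [[0.1, 0.0], [0.2, 0.5], [0.3, 1.0], [0.4, 1.5]]

# pool after adding positions, as described in the comment
pooled = mean_pool([add(c, p) for c, p in zip(chars, positions)])

# pool characters and positions separately, then add
separate = add(mean_pool(chars), mean_pool(positions))

# same characters in reverse order, same positional embeddings
pooled_reversed = mean_pool([add(c, p) for c, p in zip(chars[::-1], positions)])

same_as_separate = all(math.isclose(a, b) for a, b in zip(pooled, separate))
order_invariant = all(math.isclose(a, b) for a, b in zip(pooled, pooled_reversed))
print(same_as_separate, order_invariant)
```

Any position-dependent layer applied before the pooling would break this equality, so the sketch only covers the plain "add, then average" case.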
Cannot tokenize on GPU

Hi,

I'm using Charformer to do some error correction on Colab. But I found that after I move the tokens to CUDA and start tokenizing, this shows up:

Did I do something wrong?

opened by Shamepoo 2

example of how to read in/tokenize a text file, for use with HuggingFace Transformers?

Hello, I was attempting to adapt this guide for use with Charformer Pytorch. Colab notebook for that guide is here.

I'd like to be able to use GBST on the same data, https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt, but I'm not sure how to pass that in.

I tried looking at the source code, and the other issues here, but haven't yet found the details.

Some specific questions:

how do I "train" this tokenizer on a .txt file?
is it compatible with this section of the HF notebook, aka can it be passed into LineByLineTextDataset?

from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)

When I ran that snippet, I got the following error:

/usr/local/lib/python3.7/dist-packages/transformers/data/datasets/language_modeling.py:124: FutureWarning: This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py
  FutureWarning,

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-38-1688c68b48be> in <module>()
      5     tokenizer=tokenizer,
      6     file_path="./oscar.eo.txt",
----> 7     block_size=128,
      8 )

1 frames

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

TypeError: forward() got an unexpected keyword argument 'add_special_tokens'

opened by cdleong 0
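
The traceback shows the call ending up in a PyTorch module's `forward`, which suggests the GBST module expects tensors of token ids rather than raw text. One way to prepare a text file for it is to turn each line into UTF-8 byte ids yourself. The following is a minimal sketch, not the library's API: `encode_lines` and `PAD_ID` are hypothetical names, and using id 256 for padding is an assumption taken from `num_tokens = 257` in the usage example.

```python
PAD_ID = 256  # assumption: the extra 257th token id is used for padding

def encode_lines(lines, max_len):
    """Convert text lines to fixed-length lists of byte ids plus boolean masks.

    Each line is encoded as UTF-8, truncated to max_len bytes, and padded
    with PAD_ID. Truncation happens at the byte level, so it may cut a
    multi-byte character in half.
    """
    ids, mask = [], []
    for line in lines:
        byte_ids = list(line.encode("utf-8"))[:max_len]
        pad = max_len - len(byte_ids)
        ids.append(byte_ids + [PAD_ID] * pad)
        mask.append([True] * len(byte_ids) + [False] * pad)
    return ids, mask

ids, mask = encode_lines(["Saluton", "ĉu"], max_len=8)
print(ids)
print(mask)
```

The resulting lists can be converted with `torch.tensor(ids)` and `torch.tensor(mask)` and passed as `tokenizer(tokens, mask = mask)`, as in the usage example. Lines could come from iterating over a file opened with `open(path, encoding="utf-8")`.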

Sequence Length Problem in NMT

After downsampling, the sequence is shorter. How can I restore it to its original length, since I may need to generate sentences for error correction?

Thank you!

opened by Shamepoo 1
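
One simple option, offered here only as an illustration and not taken from the paper or the library, is to repeat each downsampled position `downsample_factor` times and trim to the original length, so that every original character position receives the representation of the block it fell into. `upsample_by_repetition` is a hypothetical helper working on plain Python lists:

```python
def upsample_by_repetition(blocks, downsample_factor, original_length):
    """Map each original position i to the block it was pooled into.

    Assumption: block j covers original positions
    [j * downsample_factor, (j + 1) * downsample_factor).
    """
    return [blocks[i // downsample_factor] for i in range(original_length)]

# three block representations (strings stand in for vectors), original length 10
restored = upsample_by_repetition(["a", "b", "c"], downsample_factor=4, original_length=10)
print(restored)
```

Repetition only restores the length; a per-character output layer on top would still be needed to tell positions within a block apart.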
Bytes vs. Characters

The authors address the difference between bytes and characters in footnote 2. It seems that the byte is just the char embedding with dimension of 256. However, the last sentence says: "For other languages, each character corresponds to 2–3 bytes in general. For simplicity and to align with prior work, we will generally talk about characters unless stated otherwise." In the example, 子词分词 becomes 子子子词词词分分分词词词, with 3 bytes for every character.

What I want to know is: does 3 bytes mean that we replicate every single character three times and then feed it into the embedding? If so, how is the number of bytes decided?

Thank you.

opened by jamfly 0
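
The number of bytes per character is fixed by the text encoding rather than chosen by the model. With UTF-8, the example string can be inspected directly; the sketch below uses only the Python standard library:

```python
text = "子词分词"

# UTF-8 byte values for the whole string; these are the ids a byte-level model sees
byte_ids = list(text.encode("utf-8"))

# number of UTF-8 bytes each character occupies
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]

print(len(byte_ids))
print(bytes_per_char)
print([hex(b) for b in byte_ids[:3]])
```

Each of the three bytes of a character is a different value, so the characters are not literally copied three times. A plausible reading of the paper's 子子子 notation is that it marks which character each byte belongs to.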

Releases (0.0.4)

0.0.4 (Jul 15, 2021)
0.0.3 (Jul 8, 2021)
0.0.2 (Jun 30, 2021)
0.0.1 (Jun 30, 2021)

Owner

Phil Wang

Working with Attention

