Implementation of the GBST block from the Charformer paper, in Pytorch

Overview

Charformer - Pytorch

Implementation of the GBST (gradient-based subword tokenization) module from the Charformer paper, in Pytorch. The paper proposes a module that automatically learns subword representations, obviating the need for tokenizers in the encoder setting.

AI Coffee Break with Letitia video

Install

$ pip install charformer-pytorch

Usage

import torch
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 257,             # number of tokens, should be 256 for byte encoding (+ 1 special token for padding in this example)
    dim = 512,                    # dimension of token and intra-block positional embedding
    max_block_size = 4,           # maximum block size
    downsample_factor = 4,        # the final downsample factor by which the sequence length will decrease by
    score_consensus_attn = True   # whether to do the cheap score consensus (aka attention) as in eq. 5 in the paper
)

tokens = torch.randint(0, 257, (1, 1023)) # uneven number of tokens (1023)
mask   = torch.ones(1, 1023).bool()

# both tokens and mask will be appropriately downsampled

tokens, mask = tokenizer(tokens, mask = mask) # (1, 256, 512), (1, 256)

# now pass this on to your transformer

Citations

@misc{tay2021charformer,
    title   = {Charformer: Fast Character Transformers via Gradient-based Subword Tokenization}, 
    author  = {Yi Tay and Vinh Q. Tran and Sebastian Ruder and Jai Gupta and Hyung Won Chung and Dara Bahri and Zhen Qin and Simon Baumgartner and Cong Yu and Donald Metzler},
    year    = {2021},
    eprint  = {2106.12672},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
Comments
  • positional embedding

    positional embedding

    Screenshot from 2021-06-30 12-12-17

    in section 2.1.1 in the paper, the authors claim that by adding intra-block positional embeddings https://github.com/lucidrains/charformer-pytorch/blob/main/charformer_pytorch/charformer_pytorch.py#L90-L96 the block representations will be aware of the position of each character. however, if one were to be doing mean pooling as the author propose, wouldn't this amount to just adding the mean of the positional embeddings for every block? If anyone has any insights, please leave a comment

    help wanted 
    opened by lucidrains 3
  • Cannot tokenize on GPU

    Cannot tokenize on GPU

    Hi,

    I'm using Charformer to do some error corrections on Colab. But I found that after I pass tokens to CUDA and start tokenizing, this would show up: image

    Did I do it in a wrong way?

    opened by Shamepoo 2
  • example of how to read in/tokenize a text file, for use with HuggingFace Transformers?

    example of how to read in/tokenize a text file, for use with HuggingFace Transformers?

    Hello, I was attempting to adapt this guide for use with Charformer Pytorch. Colab notebook for that guide is here.

    I'd like to be able to use GBST on the same data, https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt, but I'm not sure how to pass that in.

    I tried looking at the source code, and the other issues here, but haven't yet found the details.

    Some specific questions:

    • how do I "train" this tokenizer on a .txt file?
    • is it compatible with this section of the HF notebook, aka can it be passed into LineByLineTextDataset?
    from transformers import LineByLineTextDataset
    
    dataset = LineByLineTextDataset(
        tokenizer=tokenizer,
        file_path="./oscar.eo.txt",
        block_size=128,
    )
    

    When I tried doing that line, I got the following error:

    /usr/local/lib/python3.7/dist-packages/transformers/data/datasets/language_modeling.py:124: FutureWarning: This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py
      FutureWarning,
    
    ---------------------------------------------------------------------------
    
    TypeError                                 Traceback (most recent call last)
    
    <ipython-input-38-1688c68b48be> in <module>()
          5     tokenizer=tokenizer,
          6     file_path="./oscar.eo.txt",
    ----> 7     block_size=128,
          8 )
    
    1 frames
    
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    TypeError: forward() got an unexpected keyword argument 'add_special_tokens'
    
    opened by cdleong 0
  • Sequence Length Problem in NMT

    Sequence Length Problem in NMT

    After downsampling, the length of the sequence has been shortened. But how can I return the sequence to its original length since I may need to do sentence generation in error correction?

    Thank you!

    opened by Shamepoo 1
  • Bytes vs. Characters

    Bytes vs. Characters

    The authors address the difference between bytes and characters in footnote 2, it seems like the byte is just the char embedding with dimension of 256. However, in the last sentence, For other languages, each character corresponds to 2–3 bytes in general. For simplicity and to align with prior work, we will generally talk about characters unless stated otherwise. and the example 子词分词, it becomes 子子子词词词分分分词词词, with the 3 bytes in every character.

    What I want to know is, 3 bytes mean we replicate three times for every single character, then feed into embedding? If so, how to decide the number of bytes.

    Thank you.

    opened by jamfly 0
Releases(0.0.4)
Owner
Phil Wang
Working with Attention
Phil Wang
Scheme for training and applying a label propagation framework

Factorisation-based Image Labelling Overview This is a scheme for training and applying the factorisation-based image labelling (FIL) framework. Some

Wellcome Centre for Human Neuroimaging 2 Dec 17, 2021
Cross-Task Consistency Learning Framework for Multi-Task Learning

Cross-Task Consistency Learning Framework for Multi-Task Learning Tested on numpy(v1.19.1) opencv-python(v4.4.0.42) torch(v1.7.0) torchvision(v0.8.0)

Aki Nakano 2 Jan 08, 2022
Black-Box-Tuning - Black-Box Tuning for Language-Model-as-a-Service

Black-Box-Tuning Source code for paper "Black-Box Tuning for Language-Model-as-a-Service". Being busy recently, the code in this repo and this tutoria

Tianxiang Sun 149 Jan 04, 2023
TensorFlow CNN for fast style transfer

Fast Style Transfer in TensorFlow Add styles from famous paintings to any photo in a fraction of a second! It takes 100ms on a 2015 Titan X to style t

1 Dec 14, 2021
Implementation for the IJCAI2021 work "Beyond the Spectrum: Detecting Deepfakes via Re-synthesis"

Beyond the Spectrum Implementation for the IJCAI2021 work "Beyond the Spectrum: Detecting Deepfakes via Re-synthesis" by Yang He, Ning Yu, Margret Keu

Yang He 27 Jan 07, 2023
An Extendible (General) Continual Learning Framework based on Pytorch - official codebase of Dark Experience for General Continual Learning

Mammoth - An Extendible (General) Continual Learning Framework for Pytorch NEWS STAY TUNED: We are working on an update of this repository to include

AImageLab 277 Dec 28, 2022
Codes and scripts for "Explainable Semantic Space by Grounding Languageto Vision with Cross-Modal Contrastive Learning"

Visually Grounded Bert Language Model This repository is the official implementation of Explainable Semantic Space by Grounding Language to Vision wit

17 Dec 17, 2022
Semi-supervised semantic segmentation needs strong, varied perturbations

Semi-supervised semantic segmentation using CutMix and Colour Augmentation Implementations of our papers: Semi-supervised semantic segmentation needs

146 Dec 20, 2022
Implementation for our AAAI2021 paper (Entity Structure Within and Throughout: Modeling Mention Dependencies for Document-Level Relation Extraction).

SSAN Introduction This is the pytorch implementation of the SSAN model (see our AAAI2021 paper: Entity Structure Within and Throughout: Modeling Menti

benfeng 69 Nov 15, 2022
Face and Pose detector that emits MQTT events when a face or human body is detected and not detected.

Face Detect MQTT Face or Pose detector that emits MQTT events when a face or human body is detected and not detected. I built this as an alternative t

Jacob Morris 38 Oct 21, 2022
Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data

Real-ESRGAN Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data Ported from https://github.com/xinntao/Real-ESRGAN Depend

Holy Wu 44 Dec 27, 2022
Code and data of the Fine-Grained R2R Dataset proposed in paper Sub-Instruction Aware Vision-and-Language Navigation

Fine-Grained R2R Code and data of the Fine-Grained R2R Dataset proposed in the EMNLP2020 paper Sub-Instruction Aware Vision-and-Language Navigation. C

YicongHong 34 Nov 15, 2022
QKeras: a quantization deep learning library for Tensorflow Keras

QKeras github.com/google/qkeras QKeras 0.8 highlights: Automatic quantization using QKeras; Stochastic behavior (including stochastic rouding) is disa

Google 437 Jan 03, 2023
Multi-task Learning of Order-Consistent Causal Graphs (NeuRIPs 2021)

Multi-task Learning of Order-Consistent Causal Graphs (NeuRIPs 2021) Authors: Xinshi Chen, Haoran Sun, Caleb Ellington, Eric Xing, Le Song Link to pap

Xinshi Chen 2 Dec 20, 2021
Export CenterPoint PonintPillars ONNX Model For TensorRT

CenterPoint-PonintPillars Pytroch model convert to ONNX and TensorRT Welcome to CenterPoint! This project is fork from tianweiy/CenterPoint. I impleme

CarkusL 149 Dec 13, 2022
Recognize numbers from an (28 x 28) image using neural networks

Number recognition Recognize numbers from a 28 x 28 image using neural networks Usage This is an example of a simple usage of number-recognition NOTE:

Mauro Baladés 2 Dec 29, 2021
This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" on Semantic Segmentation.

Swin Transformer for Semantic Segmentation of satellite images This repo contains the supported code and configuration files to reproduce semantic seg

23 Oct 10, 2022
SuRE Evaluation: A Supplementary Material

SuRE Evaluation: A Supplementary Material This repository contains supplementary material regarding the evaluations presented in the paper Visual Expl

NYU Visualization Lab 0 Dec 14, 2021
Implementation of Kalman Filter in Python

Kalman Filter in Python This is a basic example of how Kalman filter works in Python. I do plan on refactoring and expanding this repo in the future.

Enoch Kan 35 Sep 11, 2022
Implementation of the master's thesis "Temporal copying and local hallucination for video inpainting".

Temporal copying and local hallucination for video inpainting This repository contains the implementation of my master's thesis "Temporal copying and

David Álvarez de la Torre 1 Dec 02, 2022