💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Overview




Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

  • Train new vocabularies and tokenize, using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs (see the sketch after this list).
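
For example, a minimal sketch of the pre-processing and alignment-tracking features (the file name and pad id below are illustrative, assuming a tokenizer trained as in the quick example further down):

from tokenizers import Tokenizer

# Load a previously trained tokenizer (illustrative file name).
tokenizer = Tokenizer.from_file("tokenizer-wiki.json")

# Pre-processing: truncate long sequences and pad short ones with [PAD].
tokenizer.enable_truncation(max_length=128)
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)          # tokens, including any padding
print(output.attention_mask)  # 1 for real tokens, 0 for padding
# Alignment tracking: offsets map each token back to a span of the original text.
print(output.offsets[0])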

Bindings

We provide bindings to the following languages (more to come!):

  • Rust (the original implementation)
  • Python
  • Node.js

A community-maintained Ruby port is also linked from the repository.

Quick example using Python:

Choose your model from Byte-Pair Encoding, WordPiece, or Unigram and instantiate a tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
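
The same pattern should work for the other models; a quick sketch (WordPiece usually wants an unknown token):

from tokenizers import Tokenizer
from tokenizers.models import WordPiece, Unigram

wp_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
uni_tokenizer = Tokenizer(Unigram())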

You can customize how pre-tokenization (e.g., splitting into words) is done:

from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
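
Pre-tokenizers can also be chained; for instance, a sketch that splits on whitespace and isolates digits (assuming you want each digit as its own piece):

from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Whitespace, Digits

tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [Whitespace(), Digits(individual_digits=True)]
)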

Then training your tokenizer on a set of files just takes two lines of code:

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
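
If your corpus lives in memory rather than in files, training from an iterator should also work (a sketch; train_from_iterator is the in-memory counterpart mentioned in the 0.10.0 release notes below):

# Any iterator over strings works: a list, a generator, a datasets column, ...
corpus = ["Hello, y'all!", "How are you 😁 ?"]
tokenizer.train_from_iterator(corpus, trainer=trainer)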

Once your tokenizer is trained, encode any text with just one line:

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]

Check the Python documentation or the Python quicktour to learn more!

You might also like...
🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch, TensorFlow, and JAX.

State-of-the-art Natural Language Processing for JAX, PyTorch and TensorFlow. 🤗 Transformers provides thousands of pretrained mo

Learning the meanings behind words is a key element in NLP. This project concentrates on the disambiguation of preposition senses. To that end, we train a BERT transformer model and surpass the state of the art.

New State-of-the-Art in Preposition Sense Disambiguation Supervisor: Prof. Dr. Alexander Mehler Alexander Henlein Institutions: Goethe University TTLa

A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5

NLP-Summarizer Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5 This project aimed to provide in

Easy to use, state-of-the-art Neural Machine Translation for 100+ languages

EasyNMT - Easy to use, state-of-the-art Neural Machine Translation This package provides easy to use, state-of-the-art machine translation for more th

A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. IMPORTANT: (30.08.2020) We moved our models

State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide


Comments
  • Can't import any modules


    What it says on the tin. Every module I try importing into a script spits out a "module not found" error.

    Traceback (most recent call last):
      File "ab2.py", line 3, in <module>
        from tokenizers.tools import BertWordPieceTokenizer
    ImportError: cannot import name 'BertWordPieceTokenizer' from 'tokenizers.tools' (/home/../anaconda3/envs/tokenizers/lib/python3.7/site-packages/tokenizers/tools/__init__.py)

    Traceback (most recent call last):
      File "ab2.py", line 3, in <module>
        from transformers import BertWordPieceTokenizer
    ImportError: cannot import name 'BertWordPieceTokenizer' from 'transformers' (/home/../anaconda3/envs/tokenizers/lib/python3.7/site-packages/transformers/__init__.py)

    I've tried:

    import BertWordPieceTokenizer
    from tokenizers.toold import AutoTokenizer
    from tokenizers import BartTokenizer

    to illustrate a few examples.

    I've installed Tokenizers in an anaconda3 venv via pip, via conda forge, and compiled from source.

    I've tried installing Transformers as well and get the same errors. I've tried installing Tokenizers and then installing Transformers and got the same errors.

    I've tried installing Transformers and then Tokenizers and gotten the same error.

    I've looked through the Tokenizers code and, unless I'm missing something (entirely possible), AutoTokenizer isn't even part of the package? I'll admit I'm not a very experienced programmer, but I'll be damned if I can find it.

    Help would be appreciated.

    System specs are:

    Linux Mint 21.1, RTX 2080 Ti, i7-8700K

    cuDNN 8.1.1, CUDA 11.2.0, TensorRT 7.2.3, Python 3.7 (by the way, figuring out what was needed here, finding the files, and actually installing them was beyond arduous. There has to be a better way. It's the only way I could get anything at all to work, though).
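
    For reference, a sketch of imports that should work with recent releases (BertWordPieceTokenizer is exported at the top level of tokenizers, while AutoTokenizer comes from transformers):

    # BertWordPieceTokenizer lives in tokenizers; AutoTokenizer lives in transformers.
    from tokenizers import BertWordPieceTokenizer
    from transformers import AutoTokenizer

    bert_tok = BertWordPieceTokenizer()  # can be trained, or built from a vocab file
    auto_tok = AutoTokenizer.from_pretrained("bert-base-uncased")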

    opened by kronkinatorix 1
  • How to decode with the existing tokenizer


    I trained the tokenizer following the HuggingFace tutorial:

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace
    
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer.pre_tokenizer = Whitespace()
    files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
    tokenizer.train(files, trainer)
    tokenizer.save("tokenizer-wiki.json")
    

    But I don't know how to use the existing tokenizer for decoding:

    tokenizer = Tokenizer.from_file("tokenizer-wiki.json")
    o=tokenizer.encode("sd jk sds  sds")
    tokenizer.decode(o.ids)
    # s d j k s ds s ds
    

    I know we can recover the output with o.offsets, but what if we do not know the offsets, e.g. when decoding from a language model or an NMT system?
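
    One possible approach (a sketch, not an official answer): train the BPE model with an explicit end-of-word suffix and attach the matching decoder, so decode() can re-join subwords into words:

    from tokenizers import Tokenizer, decoders
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    # The suffix marks word endings so the decoder can restore spaces.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]", end_of_word_suffix="</w>"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
        end_of_word_suffix="</w>",
    )
    files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
    tokenizer.train(files, trainer)

    tokenizer.decoder = decoders.BPEDecoder(suffix="</w>")
    output = tokenizer.encode("sd jk sds  sds")
    print(tokenizer.decode(output.ids))  # subwords are merged back into words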

    opened by ZhiYuanZeng 4
  • Is there any support for 'google/tapas-mini-finetuned-wtq' tokenizer?


    I'm trying to run a tokenizer in Java and eventually compile it to run on Android for an open-domain question answering project. I'm wondering why 'google/tapas-mini-finetuned-wtq' doesn't work with DeepJavaLibrary. For more popular models the tokenizer works. I'm assuming there is no fast tokenizer for TAPAS, so I was wondering if anyone had any advice on how to go about running the TAPAS tokenizer and model on Android/Java?

    opened by memetrusidovski 4
  • OpenSSL internal error when importing tokenizers module


    When importing tokenizers 0.13.2 or 0.13.1 in a FIPS-mode-enabled environment on Red Hat Enterprise Linux 8.6 (Ootpa), we see this error:

    sh-4.4# python3 -c "import tokenizers"
    fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE
    Aborted (core dumped)
    

    Additional info:

    No errors when using tokenizers==0.13.0 or tokenizers==0.11.4
    Python 3.8.13
    OpenSSL 1.1.1g FIPS  21 Apr 2020 or OpenSSL 1.1.1k  FIPS 25 Mar 2021
    
    opened by wai25 3
Releases(v0.13.2)
  • v0.13.2(Nov 7, 2022)

  • python-v0.13.2(Nov 7, 2022)

  • node-v0.13.2(Nov 7, 2022)

  • v0.13.1(Oct 6, 2022)

  • python-v0.13.1(Oct 6, 2022)

  • node-v0.13.1(Oct 6, 2022)

  • python-v0.13.0(Sep 21, 2022)

    [0.13.0]

    • [#956] PyO3 version upgrade
    • [#1055] M1 automated builds
    • [#1008] Decoder is now a composable trait, while remaining backward compatible
    • [#1047, #1051, #1052] Processor is now a composable trait, while remaining backward compatible

    Both trait changes warrant a "major" version bump since, despite our best efforts not to break backward compatibility, the code is different enough that we cannot be completely sure.

    Source code(tar.gz)
    Source code(zip)
  • v0.13.0(Sep 19, 2022)

    [0.13.0]

    • [#1009] unstable_wasm feature to support building on Wasm (it's unstable!)
    • [#1008] Decoder is now a composable trait, while remaining backward compatible
    • [#1047, #1051, #1052] Processor is now a composable trait, while remaining backward compatible

    Both trait changes warrant a "major" version bump since, despite our best efforts not to break backward compatibility, the code is different enough that we cannot be completely sure.

    Source code(tar.gz)
    Source code(zip)
  • node-v0.13.0(Sep 19, 2022)

    [0.13.0]

    • [#1008] Decoder is now a composable trait, while remaining backward compatible
    • [#1047, #1051, #1052] Processor is now a composable trait, while remaining backward compatible
    Source code(tar.gz)
    Source code(zip)
  • python-v0.12.1(Apr 13, 2022)

  • v0.12.0(Mar 31, 2022)

    [0.12.0]

    Bump minor version because of a breaking change.

    The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

    The decision was to roll back that breaking change and figure out a different way to make this modification later.

    • [#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.

    • [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

    • [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

    • [#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.

    • [#961] Added link for Ruby port of tokenizers

    • [#960] Feature gate for cli and its clap dependency

    Source code(tar.gz)
    Source code(zip)
  • python-v0.12.0(Mar 31, 2022)

    [0.12.0]

    The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

    The decision was to roll back that breaking change and figure out a different way to make this modification later.

    Bump minor version because of a breaking change.

    • [#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.

    • [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

    • [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

    • [#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.

    • [#962] Fix tests for python 3.10

    • [#961] Added link for Ruby port of tokenizers

    Source code(tar.gz)
    Source code(zip)
  • node-v0.12.0(Mar 31, 2022)

    [0.12.0]

    The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

    The decision was to roll back that breaking change and figure out a different way to make this modification later.

    Bump minor version because of a breaking change. Using 0.12 to match other bindings.

    • [#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.

    • [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

    • [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

    • [#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.

    • [#961] Added link for Ruby port of tokenizers

    Source code(tar.gz)
    Source code(zip)
  • v0.11.2(Feb 28, 2022)

  • python-v0.11.6(Feb 28, 2022)

  • node-v0.8.3(Feb 28, 2022)

  • python-v0.11.5(Feb 16, 2022)

  • v0.11.1(Jan 17, 2022)

    • [#882] Fixing Punctuation deserialize without argument.
    • [#868] Fixing missing direction in TruncationParams
    • [#860] Adding TruncationSide to TruncationParams
    Source code(tar.gz)
    Source code(zip)
  • python-v0.11.3(Jan 17, 2022)

    • [#882] Fixing Punctuation deserialize without argument.
    • [#868] Fixing missing direction in TruncationParams
    • [#860] Adding TruncationSide to TruncationParams
    Source code(tar.gz)
    Source code(zip)
  • node-v0.8.2(Jan 17, 2022)

  • node-v0.8.1(Jan 17, 2022)

  • python-v0.11.4(Jan 17, 2022)

  • python-v0.11.2(Jan 4, 2022)

  • python-v0.11.1(Dec 28, 2021)

  • python-v0.11.0(Dec 24, 2021)

    Fixed

    • [#585] Conda version should now work on old CentOS
    • [#844] Fixing interaction between is_pretokenized and trim_offsets.
    • [#851] Doc links

    Added

    • [#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
    • [#845]: Documentation for Decoders.

    Changed

    • [#850]: Added a feature gate to enable disabling http features
    • [#718]: Fix WordLevel tokenizer determinism during training
    • [#762]: Add a way to specify the unknown token in SentencePieceUnigramTokenizer
    • [#770]: Improved documentation for UnigramTrainer
    • [#780]: Add Tokenizer.from_pretrained to load tokenizers from the Hugging Face Hub
    • [#793]: Saving a pretty JSON file by default when saving a tokenizer
    Source code(tar.gz)
    Source code(zip)
  • node-v0.8.0(Sep 2, 2021)

    BREAKING CHANGES

    • Many improvements on the Trainer (#519). The files must now be provided first when calling tokenizer.train(files, trainer).

    Features

    • Adding the TemplateProcessing
    • Add WordLevel and Unigram models (#490)
    • Add nmtNormalizer and precompiledNormalizer normalizers (#490)
    • Add templateProcessing post-processor (#490)
    • Add digitsPreTokenizer pre-tokenizer (#490)
    • Add support for mapping to sequences (#506)
    • Add splitPreTokenizer pre-tokenizer (#542)
    • Add behavior option to the punctuationPreTokenizer (#657)
    • Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)

    Fixes

    • Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
    • Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)
    Source code(tar.gz)
    Source code(zip)
  • python-v0.10.3(May 24, 2021)

    Fixed

    • [#686]: Fix SPM conversion process for whitespace deduplication
    • [#707]: Fix stripping strings containing Unicode characters

    Added

    • [#693]: Add a CTC Decoder for Wav2Vec2 models

    Removed

    • [#714]: Removed support for Python 3.5
    Source code(tar.gz)
    Source code(zip)
  • python-v0.10.2(Apr 5, 2021)

    Fixed

    • [#652]: Fix offsets for Precompiled corner case
    • [#656]: Fix BPE continuing_subword_prefix
    • [#674]: Fix Metaspace serialization problems
    Source code(tar.gz)
    Source code(zip)
  • python-v0.10.1(Feb 4, 2021)

    Fixed

    • [#616]: Fix SentencePiece tokenizers conversion
    • [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
    • [#618]: Fix Normalizer.normalize with PyNormalizedStringRefMut
    • [#620]: Fix serialization/deserialization for overlapping models
    • [#621]: Fix ByteLevel instantiation from a previously saved state (using __getstate__())
    Source code(tar.gz)
    Source code(zip)
  • python-v0.10.0(Jan 12, 2021)

    Added

    • [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
    • [#519]: Add a WordLevelTrainer used to train a WordLevel model
    • [#533]: Add support for conda builds
    • [#542]: Add Split pre-tokenizer to easily split using a pattern
    • [#544]: Ability to train from memory. This also improves the integration with datasets
    • [#590]: Add getters/setters for components on BaseTokenizer
    • [#574]: Add fuse_unk option to SentencePieceBPETokenizer

    Changed

    • [#509]: Automatically stubbing the .pyi files
    • [#519]: Each Model can return its associated Trainer with get_trainer()
    • [#530]: The various attributes on each component can now be read and set (e.g. tokenizer.model.dropout = 0.1)
    • [#538]: The API Reference has been improved and is now up-to-date.

    Fixed

    • [#519]: During training, the Model is now trained in-place. This fixes several bugs that forced reloading the Model after training.
    • [#539]: Fix BaseTokenizer enable_truncation docstring
    Source code(tar.gz)
    Source code(zip)
Owner
Hugging Face
Solving NLP, one commit at a time!
Open source code for AlphaFold.

AlphaFold This package provides an implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP

DeepMind 9.7k Jan 02, 2023
Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

FCS-applications Source code for CsiNet and CRNet using the Fully Connected Layer-Shared feedback architecture. Introduction This repository contains

Boyuan Zhang 4 Oct 07, 2022
An implementation of the Pay Attention when Required transformer

Pay Attention when Required (PAR) Transformer-XL An implementation of the Pay Attention when Required transformer from the paper: https://arxiv.org/pd

7 Aug 11, 2022
Some embedding layer implementation using ivy library

ivy-manual-embeddings Some embedding layer implementation using ivy library. Just for fun. It is based on NYCTaxiFare dataset from kaggle (cut down to

Ishtiaq Hussain 2 Feb 10, 2022
A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN

artificial intelligence cosmic love and attention fire in the sky a pyramid made of ice a lonely house in the woods marriage in the mountains lantern

Phil Wang 2.3k Jan 01, 2023
Facilitating the design, comparison and sharing of deep text matching models.

MatchZoo Facilitating the design, comparison and sharing of deep text matching models. MatchZoo is a general-purpose text matching toolkit that aims to make it easy to quickly implement, compare, and share the latest deep text matching models. 🔥 News

Neural Text Matching Community 3.7k Jan 02, 2023
NLP Core Library and Model Zoo based on PaddlePaddle 2.0

PaddleNLP 2.0 offers a rich model zoo, clean and easy-to-use APIs, and high-performance distributed training, aiming to improve text modeling efficiency for PaddlePaddle developers and to provide NLP best practices based on PaddlePaddle 2.0.

6.9k Jan 01, 2023
A Pytorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).

Splitter ⠀⠀ A PyTorch implementation of Splitter: Learning Node Representations that Capture Multiple Social Contexts (WWW 2019). Abstract Recent inte

Benedek Rozemberczki 201 Nov 09, 2022
Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision Training Efficiency We show the training efficiency of our DSLP model b

Chenyang Huang 37 Jan 04, 2023
A sample project that exists for PyPUG's "Tutorial on Packaging and Distributing Projects"

A sample Python project A sample project that exists as an aid to the Python Packaging User Guide's Tutorial on Packaging and Distributing Projects. T

Python Packaging Authority 4.5k Dec 30, 2022
Twitter-NLP-Analysis - Twitter Natural Language Processing Analysis

Twitter-NLP-Analysis Business Problem I got last @turk_politika 3000 tweets with

Çağrı Karadeniz 7 Mar 12, 2022
A notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository

We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository. The notebook also shows how to segment the corpus using BPE tokenizatio

Computation for Indian Language Technology (CFILT) 9 Oct 13, 2022
Universal Adversarial Triggers for Attacking and Analyzing NLP (EMNLP 2019)

Universal Adversarial Triggers for Attacking and Analyzing NLP This is the official code for the EMNLP 2019 paper, Universal Adversarial Triggers for

Eric Wallace 248 Dec 17, 2022
Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG)

Indobenchmark Toolkit Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources fo

Samuel Cahyawijaya 11 Aug 26, 2022
Twitter-Sentiment-Analysis - Twitter sentiment analysis for india's top online retailers(2019 to 2022)

Twitter-Sentiment-Analysis Twitter sentiment analysis for india's top online retailers(2019 to 2022) Project Overview : Sentiment Analysis helps us to

Balaji R 1 Jan 01, 2022
DVC-NLP-Simple-usecase

dvc-NLP-simple-usecase DVC NLP project Reference repository: official reference repo DVC STUDIO MY View Bag of Words- Krish Naik TF-IDF- Krish Naik ST

SUNNY BHAVEEN CHANDRA 2 Oct 02, 2022
The source code of HeCo

HeCo This repo is for source code of KDD 2021 paper "Self-supervised Heterogeneous Graph Neural Network with Co-contrastive Learning". Paper Link: htt

Nian Liu 106 Dec 27, 2022
A BERT-based reverse-dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈: modeling / data collection / project design / back-end. 김종윤: data collection / project design / front-end. Quick Start C

Eu-Bin KIM 94 Dec 08, 2022
Extracting Summary Knowledge Graphs from Long Documents

GraphSum This repo contains the data and code for the G2G model in the paper: Extracting Summary Knowledge Graphs from Long Documents. The other basel

Zeqiu (Ellen) Wu 10 Oct 21, 2022
Training and evaluation codes for the BertGen paper (ACL-IJCNLP 2021)

BERTGEN This repository is the implementation of the paper "BERTGEN: Multi-task Generation through BERT" (https://arxiv.org/abs/2106.03484). The codeb

9 Oct 26, 2022