BPEer

Train BPE with fastBPE and load the result into a Huggingface Tokenizer.

Description

The Huggingface BpeTrainer consumes a lot of memory when training on a large corpus (e.g. 50000 merges on a 20GB corpus); I ran into an out-of-memory error.

So I use fastBPE (implemented in C++) instead, which outputs the list of merge operations.

However, I still want to use the Huggingface Tokenizer API, so I wrote a simple converter that generates the JSON file a Huggingface Tokenizer can load.
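
In outline, the conversion reads fastBPE's merge list and emits the tokenizer JSON. Below is a minimal sketch of the idea, not the actual convertjs.py: it assumes each line of the learnbpe output is "<left> <right> <count>", and it glosses over details such as base-character vocabulary entries and fastBPE's end-of-word markers.

    from tokenizers import Tokenizer
    from tokenizers.models import BPE

    # Assumption: one merge per line, "<left> <right> <count>".
    merges = []
    with open("allvocab") as f:  # the learnbpe output (see Usage below)
        for line in f:
            left, right, _count = line.split()
            merges.append((left, right))

    # Give every merge's parts and product an id; a real converter
    # also needs the base (character-level) symbols in the vocab.
    vocab = {}
    for left, right in merges:
        for sym in (left, right, left + right):
            if sym not in vocab:
                vocab[sym] = len(vocab)

    tokenizer = Tokenizer(BPE(vocab=vocab, merges=merges))
    tokenizer.save("tokenizer.json")  # the JSON file for Huggingface Tokenizer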

Usage

Train BPE:

cd fastBPE
./fast learnbpe 50000 train.txt > allvocab

Here 50000 is the number of merge operations and train.txt is the training corpus; allvocab receives the learned merges.

Convert to json:

python convertjs.py
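
The generated file can then be loaded through the Huggingface API, e.g. via transformers (a sketch: "tokenizer.json" stands for whatever filename convertjs.py actually writes):

    from transformers import PreTrainedTokenizerFast

    # "tokenizer.json" is a placeholder for the converter's output file.
    tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
    print(tokenizer.tokenize("hello world"))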

Warning

This tokenizer does not mark where a word starts.

E.g. the BPE results for "I am" and "Iam" may be identical. Split the text on whitespace before tokenizing, then mark the first subword of each word:

    words = "I am".split()
    for word in words:
        subs = tokenizer.tokenize(word)
        subs[0] = "<begin>" + subs[0]

This yields ["<begin>I", "am"] for "Iam" but ["<begin>I", "<begin>am"] for "I am", so the two inputs are distinguished.
