Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

This is not an official Google product.

Technical highlights

  • Purely data driven: SentencePiece trains tokenization and detokenization models from sentences. Pre-tokenization (Moses tokenizer/MeCab/KyTea) is not always required.
  • Language independent: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
  • Multiple subword algorithms: BPE [Sennrich et al.] and unigram language model [Kudo.] are supported.
  • Subword regularization: SentencePiece implements subword sampling for subword regularization and BPE-dropout, which help to improve the robustness and accuracy of NMT models.
  • Fast and lightweight: Segmentation speed is around 50k sentences/sec, and memory footprint is around 6MB.
  • Self-contained: The same tokenization/detokenization is obtained as long as the same model file is used.
  • Direct vocabulary id generation: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.
  • NFKC-based normalization: SentencePiece performs NFKC-based text normalization.

For those unfamiliar with SentencePiece as software or as an algorithm, a gentle introduction is available here.

Comparisons with other implementations

| Feature | SentencePiece | subword-nmt | WordPiece |
|---|---|---|---|
| Supported algorithm | BPE, unigram, char, word | BPE | BPE* |
| OSS? | Yes | Yes | Google internal |
| Subword regularization | Yes | No | No |
| Python Library (pip) | Yes | No | N/A |
| C++ Library | Yes | No | N/A |
| Pre-segmentation required? | No | Yes | Yes |
| Customizable normalization (e.g., NFKC) | Yes | No | N/A |
| Direct id generation | Yes | No | N/A |

Note that the BPE algorithm used in WordPiece (marked * above) is slightly different from the original BPE.

Overview

What is SentencePiece?

SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]. Here are the high-level differences from other implementations.

The number of unique tokens is predetermined

Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

Note that SentencePiece specifies the final vocabulary size for training, which is different from subword-nmt that uses the number of merge operations. The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.
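The relationship between a target vocabulary size and the number of BPE merges can be sketched with a toy trainer. This is only an illustration under simplifying assumptions (pre-split words, no whitespace symbol, a made-up tie-breaking rule); the function name train_toy_bpe is hypothetical and this is not SentencePiece's trainer. It stops when the vocabulary reaches the requested size instead of after a fixed number of merges.

```python
from collections import Counter


def train_toy_bpe(words, vocab_size):
    """Toy BPE: merge the most frequent adjacent pair until the
    vocabulary holds `vocab_size` symbols (or no pair is left)."""
    # Each word is a tuple of symbols, mapped to its frequency.
    corpus = Counter(tuple(word) for word in words)
    vocab = {symbol for word in corpus for symbol in word}
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Assumed tie-break: highest count, then lexicographically largest pair.
        (a, b), _ = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merges.append((a, b))
        vocab.add(a + b)
        # Rewrite every word with the chosen pair merged.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges


words = ["low", "low", "lower", "newest", "newest", "widest"]
vocab, merges = train_toy_bpe(words, vocab_size=14)
print(len(vocab), len(merges))
```

In this toy corpus the base alphabet has 10 characters, so a target size of 14 corresponds to 4 merges. With a different corpus the same target size would imply a different merge count, which is why the vocabulary size is the more portable parameter.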

Trains from raw sentences

Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but makes the preprocessing complicated as we have to run language dependent tokenizers in advance. The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese where no explicit spaces exist between words.

Whitespace is treated as a basic symbol

The first step of natural language processing is text tokenization. For example, a standard English tokenizer would segment the text "Hello World." into the following three tokens.

[Hello] [World] [.]

One observation is that the original input and the tokenized sequence are NOT reversibly convertible. For instance, the information that there is no space between “World” and “.” is dropped from the tokenized sequence, since e.g., Tokenize(“World.”) == Tokenize(“World .”).
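The loss of information can be reproduced with a small stand-in tokenizer. The regular expression below is a hypothetical simplification of a "standard English tokenizer", chosen only to illustrate the point.

```python
import re


def simple_tokenize(text):
    """Hypothetical word tokenizer: runs of word characters, or single
    punctuation marks. Whitespace is discarded."""
    return re.findall(r"\w+|[^\w\s]", text)


with_space = simple_tokenize("World .")
without_space = simple_tokenize("World.")
# Both inputs give the same tokens, so the original spacing cannot be recovered.
print(with_space == without_space)
```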

SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.

  detokenized = ''.join(pieces).replace('▁', ' ')

This feature makes it possible to perform detokenization without relying on language-specific resources.
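The round trip described above can be checked with a minimal sketch. The helper names are hypothetical, and the sketch leaves out things the library may do in practice, such as normalization; it only shows that escaping the whitespace makes the segmentation reversible.

```python
META = "\u2581"  # the meta symbol "▁" (U+2581)


def escape_whitespace(text):
    """Replace each space with the meta symbol."""
    return text.replace(" ", META)


def detokenize(pieces):
    """Concatenate the pieces and turn the meta symbol back into a space."""
    return "".join(pieces).replace(META, " ")


text = "Hello World."
escaped = escape_whitespace(text)
# One possible segmentation of the escaped text, taken from the example above.
pieces = ["Hello", META + "Wor", "ld", "."]
restored = detokenize(pieces)
print(restored == text)
```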

Note that we cannot apply the same lossless conversions when splitting the sentence with standard word segmenters, since they treat the whitespace as a special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.

  • (en) Hello World. → [Hello] [World] [.] (A space between Hello and World)
  • (ja) こんにちは世界。 → [こんにちは] [世界] [。] (No space between こんにちは and 世界)

Subword regularization and BPE-dropout

Subword regularization [Kudo.] and BPE-dropout [Provilkov et al.] are simple regularization methods that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as robustness of NMT models.

To enable subword regularization, integrate the SentencePiece library (C++/Python) into the NMT system so that it samples one segmentation for each parameter update, which differs from standard off-line data preparation. Here is an example using the Python library. 'New York' is segmented differently on each SampleEncode (C++) or encode with enable_sampling=True (Python) call. The details of the sampling parameters are found in sentencepiece_processor.h.

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
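To give a feel for what sampling does, here is a toy sketch that does not use the library. It enumerates every segmentation of a short string under a made-up unigram vocabulary and draws one with weight proportional to its probability raised to the power alpha. The log-probabilities, helper names, and the exhaustive enumeration are all assumptions for illustration; the library's sampler is more efficient and may differ in detail.

```python
import math
import random

# Hypothetical unigram log-probabilities, invented for this example.
LOGP = {
    "▁New": -2.0, "▁": -3.0, "New": -3.5, "▁N": -6.0,
    "N": -5.0, "e": -4.0, "w": -5.0, "ew": -6.0,
}


def segmentations(text, vocab):
    """Return every way to split `text` into pieces found in `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


def sample_segmentation(text, logp, alpha, rng):
    """Draw one segmentation with weight P(segmentation) ** alpha."""
    candidates = segmentations(text, logp)
    weights = [math.exp(alpha * sum(logp[p] for p in seg)) for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
for _ in range(5):
    print(sample_segmentation("▁New", LOGP, alpha=0.1, rng=rng))
```

A small alpha flattens the distribution so that unlikely segmentations are drawn more often, while a large alpha concentrates the samples on the most probable segmentation.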

Installation

Python module

SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation. You can install the Python binary package of SentencePiece with:

% pip install sentencepiece

For more detail, see Python module.

Build and install SentencePiece command line tools from C++ source

The following tools and libraries are required to build SentencePiece:

  • cmake
  • C++11 compiler
  • gperftools library (optional, 10-40% performance improvement can be obtained.)

On Ubuntu, the build tools can be installed with apt-get:

% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Then, you can build and install command line tools as follows.

% git clone https://github.com/google/sentencepiece.git 
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v

On OSX/macOS, replace the last command with sudo update_dyld_shared_cache.

Build and install using vcpkg

You can download and install sentencepiece using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Usage instructions

Train SentencePiece Model

% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
  • --input: one-sentence-per-line raw corpus file. No need to run tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
  • --model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
  • --vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
  • --character_coverage: fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set such as Japanese or Chinese, and 1.0 for other languages with a small character set.
  • --model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.
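As a rough illustration of what --character_coverage means, the sketch below keeps the most frequent characters of a corpus until the requested fraction of all character occurrences is covered. This is a simplified reading of the flag, and the function name covered_alphabet is hypothetical; the trainer's actual selection logic may differ.

```python
from collections import Counter


def covered_alphabet(lines, character_coverage):
    """Keep the most frequent characters until they account for at least
    `character_coverage` of all character occurrences."""
    counts = Counter(ch for line in lines for ch in line)
    total = sum(counts.values())
    kept, accumulated = [], 0
    for ch, n in counts.most_common():
        if accumulated / total >= character_coverage:
            break
        kept.append(ch)
        accumulated += n
    return kept


# 98 occurrences of "a" and one each of "b" and "c".
corpus = ["a" * 98 + "bc"]
print(covered_alphabet(corpus, 0.95))
print(covered_alphabet(corpus, 1.0))
```

With a coverage below 1.0 the rare characters fall outside the alphabet, which is why a slightly lower value suits languages with very large character sets.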

Use --help flag to display all parameters for training, or see here for an overview.

Encode raw text into sentence pieces/ids

% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output

Use --extra_options flag to insert the BOS/EOS markers or reverse the input sequence.

% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
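The effect of these options on an id sequence can be sketched in plain Python. The sketch assumes the colon-separated options are applied left to right, as the descriptions above suggest, and uses the default ids of 1 for <s> and 2 for </s>. The function apply_extra_options is hypothetical and not part of the library.

```python
def apply_extra_options(ids, extra_options, bos_id=1, eos_id=2):
    """Apply colon-separated options to a list of ids, left to right."""
    result = list(ids)
    for option in extra_options.split(":"):
        if not option:
            continue  # tolerate an empty option string
        if option == "reverse":
            result.reverse()
        elif option == "bos":
            result.insert(0, bos_id)
        elif option == "eos":
            result.append(eos_id)
        else:
            raise ValueError(f"unknown option: {option}")
    return result


ids = [9, 459, 11]
print(apply_extra_options(ids, "eos"))
print(apply_extra_options(ids, "bos:eos"))
print(apply_extra_options(ids, "reverse:bos:eos"))
```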

SentencePiece supports nbest segmentation and segmentation sampling with --output_format=(nbest|sample)_(piece|id) flags.

% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output

Decode sentence pieces/ids into raw text

% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output

Use --extra_options flag to decode the text in reverse order.

% spm_decode --extra_options=reverse < input > output

End-to-End Example

% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.

You can see that the original input sentence is restored from the vocabulary id sequence.

Export vocabulary list

% spm_export_vocab --model=<model_file> --output=<output file>

<output file> stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
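A file of this kind can be read back into a piece-to-id mapping with a few lines of Python. The sketch assumes each line holds a piece and its score separated by a tab, and that line numbering starts at 0 so that it lines up with the default special token ids. The sample content is made up.

```python
def parse_vocab(text):
    """Parse lines of the form 'piece<TAB>score' into (piece, score) pairs."""
    entries = []
    for line in text.splitlines():
        piece, score = line.split("\t")
        entries.append((piece, float(score)))
    return entries


# Made-up file content for illustration.
sample = "<unk>\t0\n<s>\t0\n</s>\t0\n\u2581a\t-3.5\n"
entries = parse_vocab(sample)
# The id of a piece is its 0-based line number (assumed).
piece_to_id = {piece: i for i, (piece, _) in enumerate(entries)}
print(piece_to_id)
```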

Redefine special meta tokens

By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.

% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...

When an id is set to -1, e.g., bos_id=-1, that special token is disabled. Note that the unknown id cannot be disabled. We can define an id for padding (<pad>) with --pad_id=3.

If you want to assign other special tokens, please see Use custom symbols.

Vocabulary restriction

spm_encode accepts a --vocabulary and a --vocabulary_threshold option so that spm_encode will only produce symbols which also appear in the vocabulary (with at least some frequency). The background of this feature is described on the subword-nmt page.

The usage is basically the same as that of subword-nmt. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model, and get resulting vocabulary for each:

% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2

The shuffle command is used just in case, because spm_train loads the first 10M lines of the corpus by default.

Then segment the train/test corpus with the --vocabulary option.

% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
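The filtering step behind --vocabulary_threshold can be sketched as follows. The sketch assumes each line of the generated vocabulary file holds a piece and its frequency separated by a tab, and that pieces with a frequency at or above the threshold are kept. The function name and the sample data are hypothetical.

```python
def load_restricted_vocabulary(lines, threshold):
    """Return the set of pieces whose frequency is at least `threshold`."""
    allowed = set()
    for line in lines:
        piece, freq = line.rstrip("\n").split("\t")
        if int(freq) >= threshold:
            allowed.add(piece)
    return allowed


# Made-up vocabulary lines for illustration.
vocab_lines = ["\u2581the\t1200\n", "\u2581tele\t49\n", "scope\t50\n"]
allowed = load_restricted_vocabulary(vocab_lines, threshold=50)
print(sorted(allowed))
```

Pieces that fall below the threshold are not emitted for that language, so the encoder has to fall back to smaller pieces for the affected words.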

Advanced topics

Comments
  • Pip install sentencepiece failure

    Hi, pip install sentencepiece fails. This is the log I get:

    pip install sentencepiece 7.4.0
    Collecting sentencepiece
      Using cached https://files.pythonhosted.org/packages/fd/45/6d0eb609d5cd81df094aab71a867b2ab6b315ffd592e78fb94a625c4d6aa/sentencepiece-0.1.3.tar.gz
        ERROR: Complete output from command python setup.py egg_info:
        ERROR: /bin/sh: 1: pkg-config: not found
        Failed to find sentencepiece pkgconfig
        ----------------------------------------
    ERROR: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-463tj_x8/sentencepiece/

    opened by saareliad 31
  • Compatibility with Tensorflow Serving

    Any idea how to best integrate the tensorflow op with tensorflow serving?

    Currently, if this op is used for training, a runtime error will obviously occur when the TensorFlow graph is exported to a servable and run with TensorFlow Serving.

    For example a model trained with this op trying to be loaded into tensorflow serving will result in:

    Loading servable: {name: xling } failed: Not Found: Op tyope not registered `SentencepieceEncodeSparse' in binary...
    
    opened by r-wheeler 31
  • pip install failed on linux cluster

    System Info: Linux version 4.14.0-115.7.1.el7a.ppc64le ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC))

    I tried both installing from PyPI and installing from source file, but neither of them worked.

    When installing from PyPI:

    $ pip install sentencepiece
    Collecting sentencepiece
      Using cached https://files.pythonhosted.org/packages/1b/87/c3c2fa8cbec61fffe031ca9f0da512747520bec9be7f886f748457daac31/sentencepiece-0.1.83.tar.gz
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-install-t33o0yz4/sentencepiece/setup.py", line 29, in <module>
            with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
          File "/opt/anaconda3/lib/python3.6/codecs.py", line 897, in open
            file = builtins.open(filename, mode, buffering)
        FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'
    
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-t33o0yz4/sentencepiece/
    

    I then manually downloaded the tar.gz source file, uncompressed it, changed the directory to "./python", and tried to install directly from the setup.py:

    $ python setup.py install
    Package sentencepiece was not found in the pkg-config search path.
    Perhaps you should add the directory containing `sentencepiece.pc'
    to the PKG_CONFIG_PATH environment variable
    No package 'sentencepiece' found
    Failed to find sentencepiece pkgconfig
    

    However pip install . gives a different error message:

    $ pip install .
    Processing <...>/sentencepiece-0.1.83/python
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-req-build-209jgy5x/setup.py", line 29, in <module>
            with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
          File "/opt/anaconda3/lib/python3.6/codecs.py", line 897, in open
            file = builtins.open(filename, mode, buffering)
        FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'
    
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-req-build-209jgy5x/
    

    Does anyone know what might be wrong and how to fix it? Thank you!

    opened by wendywangwwt 24
  • undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs

    Hi , When I am trying to import "tf_sentencepiece" . I am getting following error:

    NotFoundError Traceback (most recent call last)
    in
    import tf_sentencepiece as tfs

    ~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/__init__.py in
    from __future__ import print_function
    from tf_sentencepiece.sentencepiece_processor_ops import *

    ~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/sentencepiece_processor_ops.py in
    _gen_sentencepiece_processor_op = tf.load_op_library(so_file)

    ~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py in load_op_library(library_filename)
    RuntimeError: when unable to load the library or get the python wrappers.
    """
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
    op_list_str = py_tf.TF_GetOpList(lib_handle)

    NotFoundError: /home/user/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/_sentencepiece_processor_ops.so.1.12.0: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs

    Help me out in resolving this issue. Thanks in advance.

    opened by ramreddyyasa 21
  • Add Mac M1 Compatibility

    Hi,

    Like most Python libraries, SentencePiece won't install on the Mac M1 architecture... "A revolution in data science" they said... what a joke, every data science library is a real pain to install! Do you plan to make a compatible version of SentencePiece?

    Thank you!

    opened by pierreia 19
  • Issue in installing.

    Python 3.7.3 OS: Redhat

    I am getting following error message while installing:

    I already tried installing wheel but getting message:

    (tanveer) [[email protected] tanveer]$ pip install sentencepiece-0.1.85-cp38-cp38-manylinux1_i686.whl
    ERROR: sentencepiece-0.1.85-cp38-cp38-manylinux1_i686.whl is not a supported wheel on this platform.
    
    > Using cached sentencepiece-0.1.83.tar.gz (497 kB)
    >   ERROR: Command errored out with exit status 1:
    >    command: /power8nfs/home/ai_u/.conda/envs/tanveer/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-6kz16kgn/sentencepiece/setup.py'"'"'; __file__='"'"'/tmp/pip-install-6kz16kgn/sentencepiece/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-6kz16kgn/sentencepiece/pip-egg-info
    >        cwd: /tmp/pip-install-6kz16kgn/sentencepiece/
    >   Complete output (7 lines):
    >   Traceback (most recent call last):
    >     File "<string>", line 1, in <module>
    >     File "/tmp/pip-install-6kz16kgn/sentencepiece/setup.py", line 29, in <module>
    >       with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
    >     File "/power8nfs/home/ai_u/.conda/envs/tanveer/lib/python3.7/codecs.py", line 904, in open
    >       file = builtins.open(filename, mode, buffering)
    >   FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'
    >   ----------------------------------------
    > ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    > 
    
    opened by tkhan3 19
  • `sentencepiece==0.1.92` seems breaking something

    with newly released sentencepiece==0.1.92

    Python 3.6.9 (default, Nov  7 2019, 10:44:02)
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import transformers, torch
    >>> transformers.__version__
    '2.9.1'
    >>> torch.__version__
    '1.4.0'
    >>> torch.rand(3)
    Segmentation fault (core dumped)
    

    However, downgrading to sentencepiece==0.1.91 solves this issue.

    opened by boy2000-007man 16
  • terminate called after throwing an instance of 'std::bad_alloc'

    I'm running a sentencepiece model and getting an std::bad_alloc error when I increase the training size from 5M to 10M sentences. (it works fine for 5M sentences). Here's how I'm calling the function:

    spm_train --input=input.txt --vocab_size=32000 --character_coverage=1.0
        --model_type=unigram --input_sentence_size=10000000 --num_threads=32
    

    here's the specific error:

    trainer_interface.cc(317) LOG(INFO) Sampled 10000000 sentences from 283087079 sentences.
    trainer_interface.cc(321) LOG(INFO) Skipped 209436 too long sentences.
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <unk>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <s>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: </s>
    trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
    trainer_interface.cc(384) LOG(INFO) all chars count=3460742236
    trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
    trainer_interface.cc(402) LOG(INFO) Alphabet size=25
    trainer_interface.cc(403) LOG(INFO) Final character coverage=1
    trainer_interface.cc(435) LOG(INFO) Done! preprocessed 10000000 sentences.
    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
    

    I've tried compiling SentencePiece with and without gperftools, and get the same error message. Compiled with gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), in case that matters. (Edit: also tried a more recent gcc 8.2.0 with the same results.) I doubt that it's a RAM limitation, I'm running this on a pretty beefy compute node with 768 GB of memory, and watching memory utilization as the program is running (even at 5M input sentences) I never get close to maxing out. Any thoughts why I might be getting this error message?

    opened by pstjohn 15
  • FileNotFoundError: [Errno 2] No such file or directory: '..\\VERSION'

    Hi,

    I opened an issue relating to the pytorch-transformers library but was redirected here. For the sake of clarity here's all the relevant info:

    OS: Windows10 Python: 3.5.2. Error when trying pip install sentencepiece:

        ERROR: Command errored out with exit status 1:
         command: 'c:\users\pawel.lonca\appdata\local\programs\python\python35\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\PAWEL~1.LON\\AppData\\Local\\Temp\\pip-install-ibsvnyrj\\sentencepiece\\setup.py'"'"'; __file__='"'"'C:\\Users\\PAWEL~1.LON\\AppData\\Local\\Temp\\pip-install-ibsvnyrj\\sentencepiece\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
             cwd: C:\Users\PAWEL~1.LON\AppData\Local\Temp\pip-install-ibsvnyrj\sentencepiece\
        Complete output (7 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "C:\Users\PAWEL~1.LON\AppData\Local\Temp\pip-install-ibsvnyrj\sentencepiece\setup.py", line 29, in <module>
            with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
          File "c:\users\pawel.lonca\appdata\local\programs\python\python35\lib\codecs.py", line 895, in open
            file = builtins.open(filename, mode, buffering)
        FileNotFoundError: [Errno 2] No such file or directory: '..\\VERSION'
        ----------------------------------------
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    
    opened by balkon16 14
  • Subword regularization on BPE models

    As described by @eric-haibin-lin in https://github.com/google/sentencepiece/issues/335 it is currently not possible to use SampleEncodeAsPieces, SampleEncodeAs{Pieces,Ids} on a BPE model (displays model_interface.h(85) LOG(ERROR) Not implemented. error and returns an empty list).

    Do you plan to support it in the near future?

    (and thank you for this great tool BTW!)

    opened by nicolaspanel 13
  • Cannot install sentencepiece with Python 3.9 on Windows

    Currently adding Python 3.9 support for pytorch/text and ran into an issue installing sentencepiece for Python 3.9 on windows. (CircleCI logs)

      ERROR: Failed building wheel for sentencepiece
        ERROR: Command errored out with exit status 1:
         command: 'C:\Users\circleci\project\env\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\circleci\\AppData\\Local\\Temp\\pip-install-trvw9qva\\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\\setup.py'"'"'; __file__='"'"'C:\\Users\\circleci\\AppData\\Local\\Temp\\pip-install-trvw9qva\\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\circleci\AppData\Local\Temp\pip-record-xi27zjv8\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\circleci\project\env\Include\sentencepiece'
             cwd: C:\Users\circleci\AppData\Local\Temp\pip-install-trvw9qva\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\
        Complete output (20 lines):
        running install
        running build
        running build_py
        creating build
        creating build\lib.win-amd64-3.9
        creating build\lib.win-amd64-3.9\sentencepiece
        copying src\sentencepiece/__init__.py -> build\lib.win-amd64-3.9\sentencepiece
        copying src\sentencepiece/sentencepiece_model_pb2.py -> build\lib.win-amd64-3.9\sentencepiece
        copying src\sentencepiece/sentencepiece_pb2.py -> build\lib.win-amd64-3.9\sentencepiece
        running build_ext
        building 'sentencepiece._sentencepiece' extension
        creating build\temp.win-amd64-3.9
        creating build\temp.win-amd64-3.9\Release
        creating build\temp.win-amd64-3.9\Release\src
        creating build\temp.win-amd64-3.9\Release\src\sentencepiece
        C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\circleci\project\env\include -IC:\Users\circleci\project\env\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include -IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /Tpsrc/sentencepiece/sentencepiece_wrap.cxx /Fobuild\temp.win-amd64-3.9\Release\src/sentencepiece/sentencepiece_wrap.obj /MT /I..\build\root\include
        cl : Command line warning D9025 : overriding '/MD' with '/MT'
        sentencepiece_wrap.cxx
        src/sentencepiece/sentencepiece_wrap.cxx(2777): fatal error C1083: Cannot open include file: 'sentencepiece_processor.h': No such file or directory
        error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
    

    This is a duplicate of #452, but no real solution to building from source seems to have come from that so I have opened a new issue

    Is there a workaround for getting this dependency?

    cc @taku910

    opened by seemethere 12
  • IndexError: Out of range: piece id is out of range.

    After training an epoch and generating/scoring the predictions it throws the exception:

    Traceback (most recent call last):
      File "C:/Users/51650/PycharmProjects/speechbrain/22222.py", line 7, in <module>
        sp.decode_ids([10000])
      File "C:\Users\51650\.conda\envs\speechbrain\lib\site-packages\sentencepiece\__init__.py", line 837, in DecodeIds
        return self.Decode(input=input, out_type=out_type, **kwargs)
      File "C:\Users\51650\.conda\envs\speechbrain\lib\site-packages\sentencepiece\__init__.py", line 780, in Decode
        return self._DecodeIds(input)
      File "C:\Users\51650\.conda\envs\speechbrain\lib\site-packages\sentencepiece\__init__.py", line 337, in _DecodeIds
        return _sentencepiece.SentencePieceProcessor__DecodeIds(self, ids)
    IndexError: Out of range: piece id is out of range.
    

    The vocab size is 500. How can I set things up to solve this problem? Thanks in advance.

    opened by lytgyx 0
  • self-implemented add_tokens by requesting pb, encountered "Runtime error: X is already defined" when loading the sp model file

    def add_tokens(self, tokens, vocab_file, model_file):
        m = proto_model.ModelProto()
        m.ParseFromString(open(model_file, "rb").read())
        for token in tokens:
            print(self.sp.piece_to_id(token))
            if self.sp.piece_to_id(token) == 0:
                new_token = proto_model.ModelProto().SentencePiece()
                new_token.piece = token
                new_token.score = 0
                m.pieces.append(new_token)
            if token not in self.encoder:
                self.encoder[token] = max(list(self.decoder.keys())) + 1
                self.decoder[self.encoder[token]] = token
        with open(model_file, 'wb') as f:
            f.write(m.SerializeToString())
        self.sp = spm.SentencePieceProcessor(model_file=model_file)
        with open(vocab_file, "w") as f:
            json.dump(self.encoder,f) 
    

    Then I run:

    tokenizer.add_tokens(['<que>', '<desc>', '<kwd>', '<ans>'], os.path.join(args.tokenizer_path, 'vocab.json'), os.path.join(args.tokenizer_path, 'chinese_vocab.model'))
    

    So far, it successfully updates model_file to local. But when I rerun:

    self.sp = spm.SentencePieceProcessor(model_file=model_file)
    

    It reports:

    File "finetune_psyqa.py", line 279, in main File "finetune_psyqa.py", line 408, in return self.LoadFromFile(model_file) File "/scratch/mihalcea_root/mihalcea0/lsiyang/miniconda3/envs/psyqa_cpm/lib/python3.8/site-packages/sentencepiece/init.py", line 310, in LoadFromFile tokenizer = GPT2Tokenizer(os.path.join(args.tokenizer_path, 'vocab.json'), os.path.join(args.tokenizer_path, 'chinese_vocab.model')) File "/gpfs/accounts/mihalcea_root/mihalcea0/lsiyang/template-is-token/CPM-1-Finetune/data_utils/tokenization_gpt2.py", line 42, in init self.sp = spm.SentencePieceProcessor(model_file=model_file) File "/scratch/mihalcea_root/mihalcea0/lsiyang/miniconda3/envs/psyqa_cpm/lib/python3.8/site-packages/sentencepiece/init.py", line 447, in Init return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) RuntimeError: Internal: is already defined.

    opened by lsy641 0
  • Cannot install sentencepiece with Python 3.11 on Windows

    The error is back on Windows 10 with Python 3.10.7:

     Attempting uninstall: sentencepiece
        Found existing installation: sentencepiece 0.1.97
        Uninstalling sentencepiece-0.1.97:
          Successfully uninstalled sentencepiece-0.1.97
      Running setup.py install for sentencepiece ... error
      error: subprocess-exited-with-error
    
      × Running setup.py install for sentencepiece did not run successfully.
      │ exit code: 1
      ╰─> [24 lines of output]
          C:\Python310\lib\site-packages\setuptools\dist.py:771: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
            warnings.warn(
          running install
          C:\Python310\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
            warnings.warn(
          running build
          running build_py
          creating build
          creating build\lib.win-amd64-cpython-310
          creating build\lib.win-amd64-cpython-310\sentencepiece
          copying src\sentencepiece/__init__.py -> build\lib.win-amd64-cpython-310\sentencepiece
          copying src\sentencepiece/sentencepiece_model_pb2.py -> build\lib.win-amd64-cpython-310\sentencepiece
          copying src\sentencepiece/sentencepiece_pb2.py -> build\lib.win-amd64-cpython-310\sentencepiece
          running build_ext
          building 'sentencepiece._sentencepiece' extension
          creating build\temp.win-amd64-cpython-310
          creating build\temp.win-amd64-cpython-310\Release
          creating build\temp.win-amd64-cpython-310\Release\src
          creating build\temp.win-amd64-cpython-310\Release\src\sentencepiece
          "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Python310\include -IC:\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tpsrc/sentencepiece/sentencepiece_wrap.cxx /Fobuild\temp.win-amd64-cpython-310\Release\src/sentencepiece/sentencepiece_wrap.obj /MT /I..\build\root\include
          cl : Command line warning D9025 : overriding '/MD' with '/MT'
          sentencepiece_wrap.cxx
          src/sentencepiece/sentencepiece_wrap.cxx(2809): fatal error C1083: Cannot open include file: 'sentencepiece_processor.h': No such file or directory
          error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.29.30037\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
          [end of output]
    
      note: This error originates from a subprocess, and is likely not a problem with pip.
      Rolling back uninstall of sentencepiece
      Moving to c:\python310\lib\site-packages\sentencepiece-0.1.97.dist-info\
       from C:\Python310\Lib\site-packages\~entencepiece-0.1.97.dist-info
      Moving to c:\python310\lib\site-packages\sentencepiece\
       from C:\Python310\Lib\site-packages\~entencepiece
    error: legacy-install-failure
    
    × Encountered error while trying to install package.
    ╰─> sentencepiece
    
    note: This is an issue with the package mentioned above, not pip.
    hint: See above for output from the failure.
    
    Edit:
    This path: "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\" exists, and cl.exe is there too.
    

    Originally posted by @cibernicola in https://github.com/google/sentencepiece/issues/591#issuecomment-1250851548

    opened by kbatsuren 1
  • Build with protobuf in system

    When building against the system protobuf library (i.e., with SPM_USE_BUILTIN_PROTOBUF=OFF instead of the bundled third_party/protobuf-lite), a hard-coded header include causes a build error.

    In init.h, line 21:

    #include "third_party/protobuf-lite/google/protobuf/message_lite.h"
    

    It should be:

    #include "google/protobuf/message_lite.h"
    
    opened by acane77 0
  • split_by_number doesn't match documentation?

    The help text for the split_by_number flag says "split tokens by numbers (0-9)", but the test cases treat "$10" as a valid token when split_by_number is set.

    Is this the intended behavior or a bug?

    With 'split_by_number' set, "(2", "5|4", and "64*32+1!!" are all valid tokens which seems ... odd?
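
To make the question concrete, here is a small sketch of what a literal reading of "split tokens by numbers (0-9)" would produce: every boundary between an ASCII digit and a non-digit becomes a split point. This only illustrates the expectation behind the question; it is not SentencePiece's implementation, and the function name `split_at_digit_boundaries` is made up for this example.

```python
import re
from typing import List


def split_at_digit_boundaries(token: str) -> List[str]:
    """Split a token wherever ASCII digits meet non-digits.

    Illustrative only: this is the literal reading of the help text,
    not how SentencePiece actually segments text.
    """
    # Each match is either a maximal run of digits or a maximal run of non-digits.
    return re.findall(r"[0-9]+|[^0-9]+", token)


for example in ["$10", "(2", "5|4", "64*32+1!!"]:
    print(example, "->", split_at_digit_boundaries(example))
```

Under this reading, none of the four examples would survive as a single token, which is the mismatch the report describes.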

    opened by ywrt 0
  • bazel support for C++ API

    Hello all,

    Thank you for developing the sentencepiece library! I am using Bazel and want to incorporate sentencepiece into my project and use the C++ API. I could not find any BUILD support for Bazel. I tried to set it up on my own but got stuck at some point.

    Could you provide support for Bazel?

    Thank you!

    feature request 
    opened by BBerabi 1
Releases(v0.1.97)
Owner
Google
Google ❤️ Open Source
BPE algorithm can finetune tokenizer

bpe_algorithm_can_finetune_tokenizer: this is an implementation for https://github

张博 1 Feb 2, 2022
Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE)

VK.com 843 Nov 30, 2022
A Japanese tokenizer based on recurrent neural networks

Nagisa is a python module for Japanese word segmentation/POS-tagging. It is designed to be a simple and easy-to-use tool. This tool has the following

null 321 Nov 30, 2022
Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Japanese-LUW-Tokenizer Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫 Basic Usage >>> from transformers import RemBertToken

Koichi Yasuoka 3 Dec 22, 2021
iBOT: Image BERT Pre-Training with Online Tokenizer

Image BERT Pre-Training with iBOT Official PyTorch implementation and pretrained models for paper iBOT: Image BERT Pre-Training with Online Tokenizer.

Bytedance Inc. 429 Dec 7, 2022
Train BPE with fastBPE, and load to Huggingface Tokenizer.

BPEer Train BPE with fastBPE, and load to Huggingface Tokenizer. Description The BPETrainer of Huggingface consumes a lot of memory when I am training

Lizhuo 1 Dec 23, 2021
Tokenizer - Python module for syntactic and grammar analysis, tokenization

Tokenizer The Tokenizer is a lexical analyzer; like Flex and Yacc, for example, it lets you tokenize code, that is, transform code

Manolo 1 Aug 15, 2022
Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation Official Code Repository for the paper "Unsupervised Documen

NLP*CL Laboratory 2 Oct 26, 2021
Phrase-Based & Neural Unsupervised Machine Translation

Unsupervised Machine Translation This repository contains the original implementation of the unsupervised PBSMT and NMT models presented in Phrase-Bas

Facebook Research 1.5k Nov 22, 2022
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code, or quickly tr

Max Woolf 4.8k Nov 29, 2022
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

CRNN paper: An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 1. create your ow

Tsukinousag1 3 Apr 2, 2022
SimCTG - A Contrastive Framework for Neural Text Generation

A Contrastive Framework for Neural Text Generation Authors: Yixuan Su, Tian Lan,

Yixuan Su 332 Dec 7, 2022
Implementation of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation This is the implementation of our paper: Bridging the

hezw.tkcw 19 Nov 14, 2022
A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS.

A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS.

Keon Lee 226 Dec 7, 2022
neural network based speaker embedder

Content What is deepaudio-speaker? Installation Get Started Model Architecture How to contribute to deepaudio-speaker? Acknowledge What is deepaudio-s

null 19 Mar 28, 2022