Making text a first-class citizen in TensorFlow.

Last update: Dec 26, 2022

Overview

TensorFlow Text - Text processing in TensorFlow

IMPORTANT: When installing TF Text with pip install, note the version of TensorFlow you are running and specify the corresponding minor version of TF Text (e.g. for tensorflow==2.3.x use tensorflow_text==2.3.x).

INDEX

Introduction
Unicode
Normalization
Tokenization
Other Text Ops
- Wordshape
- N-grams & Sliding Window
Installation
- Install using PIP
- Build from source steps:

Introduction

TensorFlow Text provides a collection of text-related classes and ops ready to use with TensorFlow 2.0. The library can perform the preprocessing regularly required by text-based models, and includes other features useful for sequence modeling that core TensorFlow does not provide.

The benefit of using these ops in your text preprocessing is that they run inside the TensorFlow graph. You do not need to worry about tokenization in training differing from tokenization at inference, or about managing separate preprocessing scripts.

Unicode

Most ops expect strings to be in UTF-8. If you're using a different encoding, you can use the core TensorFlow transcode op (tf.strings.unicode_transcode) to transcode into UTF-8. You can also use the same op to coerce your string to structurally valid UTF-8 if your input could be invalid.

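import tensorflow as tf
import tensorflow_text as text  # used by the examples that follow

# Two documents encoded as UTF-16-BE, transcoded to UTF-8.
docs = tf.constant([u'Everything not saved will be lost.'.encode('UTF-16-BE'),
                    u'Sad☹'.encode('UTF-16-BE')])
utf8_docs = tf.strings.unicode_transcode(docs, input_encoding='UTF-16-BE',
                                         output_encoding='UTF-8')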
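To make the transcoding and coercion step concrete, here is a minimal plain-Python sketch of the same idea. It does not call TensorFlow: the helper name transcode_to_utf8 is hypothetical, and replacing invalid bytes with U+FFFD is an assumed coercion policy, not a statement about the op's defaults.

```python
def transcode_to_utf8(raw: bytes, input_encoding: str) -> bytes:
    """Decode `raw` from `input_encoding` and re-encode it as UTF-8.

    Bytes that are invalid in the input encoding are replaced with
    U+FFFD (an assumed policy for this sketch).
    """
    return raw.decode(input_encoding, errors="replace").encode("utf-8")


# Transcode UTF-16-BE documents to UTF-8.
docs = ["Everything not saved will be lost.".encode("utf-16-be"),
        "Sad\u2639".encode("utf-16-be")]
utf8_docs = [transcode_to_utf8(d, "utf-16-be") for d in docs]
print(utf8_docs)

# Coerce possibly invalid UTF-8 into structurally valid UTF-8.
coerced = transcode_to_utf8(b"bad\xff", "utf-8")
print(coerced)
```

The second call shows the coercion case: the stray 0xFF byte cannot appear in UTF-8, so it is swapped for the replacement character while the rest of the string is preserved.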

Normalization

When dealing with different sources of text, it's important that the same words are recognized to be identical. A common technique for case-insensitive matching in Unicode is case folding (similar to lower-casing). (Note that case folding internally applies NFKC normalization.)

We also provide Unicode normalization ops for transforming strings into a canonical representation of characters, with Normalization Form KC being the default (NFKC).

print(text.case_fold_utf8(['Everything not saved will be lost.']))
print(text.normalize_utf8(['Äffin']))
print(text.normalize_utf8(['Äffin'], 'nfkd'))

tf.Tensor(['everything not saved will be lost.'], shape=(1,), dtype=string)
tf.Tensor(['\xc3\x84ffin'], shape=(1,), dtype=string)
tf.Tensor(['A\xcc\x88ffin'], shape=(1,), dtype=string)
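The difference between the composed and decomposed forms can be reproduced with Python's standard unicodedata module. This is only an illustration of the normalization forms themselves, not of how TF Text implements them, and str.casefold stands in for case folding as an assumption.

```python
import unicodedata

word = "\u00c4ffin"  # 'Äffin' written with a precomposed A-umlaut

# Case folding, similar to lower-casing.
folded = "Everything not saved will be lost.".casefold()

# NFKC keeps the composed character; NFKD splits it into 'A' + combining diaeresis.
nfkc = unicodedata.normalize("NFKC", word).encode("utf-8")
nfkd = unicodedata.normalize("NFKD", word).encode("utf-8")

print(folded)
print(nfkc)
print(nfkd)
```

The NFKD bytes are one byte longer because the single two-byte character becomes an ASCII letter followed by a two-byte combining mark.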

Tokenization

Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation.

The main interfaces are Tokenizer and TokenizerWithOffsets, which each have a single method: tokenize and tokenize_with_offsets, respectively. Multiple tokenizers are available. Each of them implements TokenizerWithOffsets (which extends Tokenizer), so each includes an option for getting byte offsets into the original string. This lets the caller know which bytes of the original string each token was created from.

All of the tokenizers return RaggedTensors with the inner-most dimension of tokens mapping to the original individual strings. As a result, the resulting shape's rank is increased by one. Please review the ragged tensor guide if you are unfamiliar with them: https://www.tensorflow.org/guide/ragged_tensors
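As a rough illustration of why the rank grows by one, the sketch below tokenizes a batch of strings with plain str.split and stores the result as flat values plus row splits, which is one of the encodings described in the ragged tensor guide. The helper name to_values_and_row_splits is hypothetical and no TensorFlow call is involved.

```python
def to_values_and_row_splits(rows):
    """Flatten a list of token lists into (values, row_splits).

    row_splits[i]:row_splits[i + 1] is the slice of `values`
    that belongs to row i.
    """
    values, row_splits = [], [0]
    for row in rows:
        values.extend(row)
        row_splits.append(len(values))
    return values, row_splits


batch = ["everything not saved", "Sad"]   # rank 1: [batch]
tokens = [s.split() for s in batch]       # rank 2: [batch, (tokens)]
values, row_splits = to_values_and_row_splits(tokens)
print(values)
print(row_splits)
```

Each input string becomes a row of variable length, so the batch dimension is kept and a new ragged token dimension is added inside it.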

WhitespaceTokenizer

This is a basic tokenizer that splits UTF-8 strings on ICU-defined whitespace characters (e.g. space, tab, new line).

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())

[['everything', 'not', 'saved', 'will', 'be', 'lost.'], ['Sad\xe2\x98\xb9']]

UnicodeScriptTokenizer

This tokenizer splits UTF-8 strings based on Unicode script boundaries. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

In practice, this is similar to the WhitespaceTokenizer. The most apparent difference is that it splits punctuation (USCRIPT_COMMON) from language texts (e.g. USCRIPT_LATIN, USCRIPT_CYRILLIC, etc.) while also separating language texts from each other.

tokenizer = text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())

[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],
 ['Sad', '\xe2\x98\xb9']]

Unicode split

When tokenizing languages without whitespace to segment words, it is common to just split by character, which can be accomplished using the unicode_split op found in core.

tokens = tf.strings.unicode_split([u"仅今年前".encode('UTF-8')], 'UTF-8')
print(tokens.to_list())

[['\xe4\xbb\x85', '\xe4\xbb\x8a', '\xe5\xb9\xb4', '\xe5\x89\x8d']]
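For comparison, the same per-character split can be sketched in plain Python by decoding the bytes and re-encoding each character. This is only an illustration of the result, not the op's implementation; the byte start positions are an extra added for this sketch.

```python
text_utf8 = "仅今年前".encode("utf-8")

# Split into characters, then re-encode each one as UTF-8 bytes.
chars = [ch.encode("utf-8") for ch in text_utf8.decode("utf-8")]

# Byte position at which each character starts in the original string.
starts, position = [], 0
for ch in chars:
    starts.append(position)
    position += len(ch)

print(chars)
print(starts)
```

Every character here takes three bytes in UTF-8, which is why the start positions advance in steps of three.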

Offsets

When tokenizing strings, you often need to know where in the original string each token came from. For this reason, each tokenizer which implements TokenizerWithOffsets has a tokenize_with_offsets method that returns the byte offsets along with the tokens. The start_offsets list the byte in the original string at which each token starts (inclusive), and the end_offsets list the byte at which each token ends (exclusive, i.e., the first byte after the token).

tokenizer = text.UnicodeScriptTokenizer()
(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(
    ['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
print(start_offsets.to_list())
print(end_offsets.to_list())

[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],
 ['Sad', '\xe2\x98\xb9']]
[[0, 11, 15, 21, 26, 29, 33], [0, 3]]
[[10, 14, 20, 25, 28, 33, 34], [3, 6]]
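The inclusive-start, exclusive-end convention means that slicing the original bytes with each offset pair recovers the token. The sketch below shows this with a hypothetical plain-Python whitespace tokenizer; it splits on ASCII whitespace only, which is a simplification of the ICU-defined whitespace the library uses.

```python
import re


def whitespace_tokenize_with_offsets(data: bytes):
    """Return (tokens, start_offsets, end_offsets) for one UTF-8 byte string.

    Starts are inclusive and ends are exclusive, both measured in bytes.
    """
    tokens, starts, ends = [], [], []
    for match in re.finditer(rb"\S+", data):
        tokens.append(match.group())
        starts.append(match.start())
        ends.append(match.end())
    return tokens, starts, ends


doc = "everything not saved will be lost.".encode("utf-8")
tokens, starts, ends = whitespace_tokenize_with_offsets(doc)

# Slicing with [start:end] gives back exactly the token bytes.
recovered = [doc[s:e] for s, e in zip(starts, ends)]
print(tokens)
print(starts)
print(ends)
```

Because this sketch splits on whitespace rather than script boundaries, the final token is 'lost.' and its end offset is 34, the length of the string.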

TF.Data Example

Tokenizers work as expected with the tf.data API. A simple example is provided below.

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'],
                                           ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
# In TF 2.x a Dataset is a Python iterable; make_one_shot_iterator() is TF 1.x only.
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())

[['Never', 'tell', 'me', 'the', 'odds.']]
[["It's", 'a', 'trap!']]

Keras API

When you use different tokenizers and ops to preprocess your data, the resulting outputs are Ragged Tensors. The Keras API makes it easy to train a model on Ragged Tensors without having to worry about padding or masking the data. You can either use the ToDense layer, which handles both for you, or rely on the built-in Keras layers that natively support ragged data.

model = tf.keras.Sequential([
  tf.keras.layers.InputLayer(input_shape=(None,), dtype='int32', ragged=True),
  text.keras.layers.ToDense(pad_value=0, mask=True),
  tf.keras.layers.Embedding(100, 16),
  tf.keras.layers.LSTM(32),
  tf.keras.layers.Dense(32, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])
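Conceptually, densifying ragged data means padding every row to the length of the longest one and remembering which positions are real. The sketch below shows that idea on nested Python lists. The function to_dense_with_mask is hypothetical and is not how the ToDense layer is implemented.

```python
def to_dense_with_mask(rows, pad_value=0):
    """Pad variable-length rows to a rectangle and return (dense, mask).

    mask[i][j] is True where dense[i][j] holds real data and
    False where it holds padding.
    """
    width = max((len(row) for row in rows), default=0)
    dense = [list(row) + [pad_value] * (width - len(row)) for row in rows]
    mask = [[True] * len(row) + [False] * (width - len(row)) for row in rows]
    return dense, mask


token_ids = [[3, 1, 4], [1], []]
dense, mask = to_dense_with_mask(token_ids, pad_value=0)
print(dense)
print(mask)
```

The mask matters because the pad value can also be a legitimate token id; downstream layers can use it to ignore padded positions.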

Other Text Ops

TF.Text packages other useful preprocessing ops. We will review a couple below.

Wordshape

A common feature used in some natural language understanding models is to see if the text string has a certain property. For example, a sentence breaking model might contain features which check for word capitalization or if a punctuation character is at the end of a string.

Wordshape defines a variety of useful regular expression based helper functions for matching various relevant patterns in your input text. Here are a few examples.

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])

# Is capitalized?
f1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)
# Are all letters uppercased?
f2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)
# Does the token contain punctuation?
f3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)
# Is the token a number?
f4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)

print(f1.to_list())
print(f2.to_list())
print(f3.to_list())
print(f4.to_list())

[[True, False, False, False, False, False], [True]]
[[False, False, False, False, False, False], [False]]
[[False, False, False, False, False, True], [True]]
[[False, False, False, False, False, False], [False]]
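To show the kind of check a wordshape feature performs, here is a small plain-Python sketch. The patterns and function names are hypothetical, chosen for illustration; they are not the patterns TF Text uses and only handle ASCII letters.

```python
import re
import unicodedata

# Hypothetical patterns for illustration only; TF Text defines its own.
TITLE_CASE = re.compile(r"^[A-Z][a-z]")
ALL_UPPER = re.compile(r"^[A-Z]+$")


def has_title_case(token: str) -> bool:
    """True if the token starts with an uppercase then a lowercase ASCII letter."""
    return TITLE_CASE.search(token) is not None


def is_uppercase(token: str) -> bool:
    """True if the token consists only of uppercase ASCII letters."""
    return ALL_UPPER.search(token) is not None


def has_punct_or_symbol(token: str) -> bool:
    """True if any character is in a Unicode punctuation (P*) or symbol (S*) category."""
    return any(unicodedata.category(ch)[0] in "PS" for ch in token)


tokens = ["Everything", "not", "saved", "will", "be", "lost.", "Sad\u2639"]
title = [has_title_case(t) for t in tokens]
upper = [is_uppercase(t) for t in tokens]
punct = [has_punct_or_symbol(t) for t in tokens]
print(title)
print(upper)
print(punct)
```

Each helper maps a token to a boolean, so applying one to a batch of tokens gives a feature with the same ragged shape as the tokens.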

N-grams & Sliding Window

N-grams are runs of n consecutive tokens, produced by sliding a window of size n over the sequence. When combining the tokens, there are three reduction mechanisms supported. For text, you would want to use Reduction.STRING_JOIN, which appends the strings to each other. The default separator character is a space, but this can be changed with the string_separator argument.

The other two reduction methods are most often used with numerical values, and these are Reduction.SUM and Reduction.MEAN.

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])

# Ngrams, in this case bi-gram (n = 2)
bigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)

print(bigrams.to_list())

[['Everything not', 'not saved', 'saved will', 'will be', 'be lost.'], []]
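The sliding window and the three reductions can be sketched in a few lines of plain Python. The function sliding_ngrams and its reduce parameter are hypothetical names for this illustration, not part of the TF Text API.

```python
def sliding_ngrams(tokens, width, reduce=" ".join):
    """Apply `reduce` to every window of `width` consecutive tokens.

    A sequence shorter than `width` yields no n-grams.
    """
    return [reduce(tokens[i:i + width])
            for i in range(len(tokens) - width + 1)]


words = ["Everything", "not", "saved", "will", "be", "lost."]

# String join, the reduction you would use for text.
bigrams = sliding_ngrams(words, 2)
# A single token cannot form a bigram.
too_short = sliding_ngrams(["Sad\u2639"], 2)

# Sum and mean, the reductions typically used with numbers.
sums = sliding_ngrams([1, 2, 3, 4], 2, reduce=sum)
means = sliding_ngrams([1, 2, 3, 4], 2, reduce=lambda w: sum(w) / len(w))

print(bigrams)
print(too_short)
print(sums)
print(means)
```

A sequence of k tokens yields k - n + 1 n-grams, which is why the one-token document produces an empty list.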

Installation

Install using PIP

When installing TF Text with pip install, please note the version of TensorFlow you are running, as you should specify the corresponding version of TF Text. For example, if you're using TF 2.0, install the 2.0 version of TF Text, and if you're using TF 1.15, install the 1.15 version of TF Text.

pip install -U tensorflow-text==<version>
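One way to avoid a mismatch is to derive the TF Text requirement from the TensorFlow version string. The helper below is a hypothetical sketch: it assumes the version string starts with "major.minor" and that pinning to the same minor series is what you want. In practice you might pass it tf.__version__.

```python
def matching_text_requirement(tf_version: str) -> str:
    """Build a pip requirement that pins TF Text to TensorFlow's minor series."""
    major, minor = tf_version.split(".")[:2]
    return f"tensorflow-text=={major}.{minor}.*"


requirement = matching_text_requirement("2.3.1")
print(requirement)
print(matching_text_requirement("1.15.0"))
```

The resulting string can be passed to pip install in place of the placeholder version.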

Build from source steps:

Note that TF Text needs to be built in the same environment as TensorFlow. Thus, if you manually build TF Text, it is highly recommended that you also build TensorFlow.

If building on macOS, you must have coreutils installed. It is probably easiest to install it with Homebrew.

Build and install TensorFlow.
Clone the TF Text repo: git clone https://github.com/tensorflow/text.git
Run the build script to create a pip package: ./oss_scripts/run_build.sh

Comments

Add Apple Silicon support.

Description

Add Apple Silicon support. By installing tensorflow-macos 2.6.0 and tensorflow-metal 0.2.0, TF.Text can be compiled from source code. (Currently Apple does not provide a preview version of tensorflow-macos and tf-nightly, so TF.Text can only be built against a stable version.)

Fixes #538
cla: yes

opened by sun1638650145 96

Error loading '_text_similarity_metric_ops.so' when running unit tests

Running Python 3.7 on Mac OS X 10.14.6. Not sure if there is some dependency or build step I am missing, but I cannot seem to run the unit tests without the code failing to load this file. I have tried with TensorFlow 1.x and 2.x. The stack trace is below. Maybe I am just missing something simple?

Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/helpers/pycharm/_jb_unittest_runner.py", line 35, in <module>
    main(argv=args, module=None, testRunner=unittestpy.TeamcityTestRunner, buffer=not JB_DISABLE_BUFFERING)
  File "/miniconda3/envs/tf2/lib/python3.7/unittest/main.py", line 100, in __init__
    self.parseArgs(argv)
  File "/miniconda3/envs/tf2/lib/python3.7/unittest/main.py", line 147, in parseArgs
    self.createTests()
  File "/miniconda3/envs/tf2/lib/python3.7/unittest/main.py", line 159, in createTests
    self.module)
  File "/miniconda3/envs/tf2/lib/python3.7/unittest/loader.py", line 220, in loadTestsFromNames
    suites = [self.loadTestsFromName(name, module) for name in names]
  File "/miniconda3/envs/tf2/lib/python3.7/unittest/loader.py", line 220, in <listcomp>
    suites = [self.loadTestsFromName(name, module) for name in names]
  File "/miniconda3/envs/tf2/lib/python3.7/unittest/loader.py", line 154, in loadTestsFromName
    module = __import__(module_name)
  File "/Users/dittmar/Development/text/tensorflow_text/python/ops/bert_tokenizer_test.py", line 32, in <module>
    from tensorflow_text.python.ops import bert_tokenizer
  File "/Users/dittmar/Development/text/tensorflow_text/__init__.py", line 21, in <module>
    from tensorflow_text.python import metrics
  File "/Users/dittmar/Development/text/tensorflow_text/python/metrics/__init__.py", line 20, in <module>
    from tensorflow_text.python.metrics.text_similarity_metric_ops import *
  File "/Users/dittmar/Development/text/tensorflow_text/python/metrics/text_similarity_metric_ops.py", line 28, in <module>
    gen_text_similarity_metric_ops = load_library.load_op_library(resource_loader.get_path_to_datafile('_text_similarity_metric_ops.so'))
  File "/miniconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: dlopen(/Users/dittmar/Development/text/tensorflow_text/python/metrics/_text_similarity_metric_ops.so, 6): image not found

bug

opened by GeorgeDittmar 26

Universal distribution / Windows binaries

Hello,

Is it possible to add Windows binaries / a universal distribution? I couldn't install this library on Windows, which is no wonder, since there are no binaries for Windows on PyPI.

The tensorflow-probability project provides a universal distribution. See PyPI.

I don't know if it's a problem with how bazel build is configured or something else. It would be great to have it on all platforms.

Many Thanks.

enhancement

opened by sbarman-mi9 25

No matching distribution found for tensorflow-text

I failed to install tensorflow-text.

When I enter pip install -U tensorflow-text

There was an error:

Could not find a version that satisfies the requirement tensorflow-text (from versions: )
No matching distribution found for tensorflow-text

Python 3.5.4 [MSC v.1900 64 bit (AMD64)] on win32

Tensorflow 2.0.0rc0

opened by xiaoshuwen1995 21

tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.6/dist-packages/tensorflow_text/python/metrics/_text_similarity_metric_ops.so: undefined symbol: _ZN10tensorflow8OpKernel11TraceStringEPNS_15OpKernelContextEb

Hello!

Could you please help me with this issue? I am using the TensorFlow image from Docker Hub (tensorflow==2.3.0). My Dockerfile looks like this:

Locally I installed tensorflow==2.3.0 and tensorflow-text==2.3.0 and everything works fine. But when I am using docker image + tensorflow_text I get this issue.

This is my full log:

  File "", line 3, in <module>
    import tensorflow_text
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_text/__init__.py", line 21, in <module>
    from tensorflow_text.python import metrics
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_text/python/metrics/__init__.py", line 20, in <module>
    from tensorflow_text.python.metrics.text_similarity_metric_ops import *
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_text/python/metrics/text_similarity_metric_ops.py", line 28, in <module>
    gen_text_similarity_metric_ops = load_library.load_op_library(resource_loader.get_path_to_datafile('_text_similarity_metric_ops.so'))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.6/dist-packages/tensorflow_text/python/metrics/_text_similarity_metric_ops.so: undefined symbol: _ZN10tensorflow8OpKernel11TraceStringEPNS_15OpKernelContextEb

opened by Ecclesiast 19

Error on saving keras custom layer with tensorflow_text.BertTokenizer

Trying to save a Keras custom layer with a tokenizer in it fails. Version info:

tensorflow==2.1.0 tensorflow-text==2.1.1

Code to reproduce:


import tensorflow_text
import tensorflow as tf


class TokenizationLayer(tf.keras.layers.Layer):
    def __init__(self, vocab_path, **kwargs):
        self.vocab_path =vocab_path
        self.tokenizer = tensorflow_text.BertTokenizer(vocab_path, token_out_type=tf.int64)
        super(TokenizationLayer, self).__init__(**kwargs)
        
    def get_config(self):
        config = super(TokenizationLayer, self).get_config()
        config.update({
            'vocab_path': self.vocab_path,
        })
        return config

    def call(self,inputs):
        return self.tokenizer.tokenize(inputs).to_tensor()


vocab_path = r"/home/resources/bert_en_uncased_L-12_H-768_A-12/1/assets/vocab.txt"
# tensorflow_text.BertTokenizer(vocab_lookup_table = vocab_path, token_out_type=tf.int64)
inputs = tf.keras.layers.Input(shape=(), dtype=tf.string)
tokenization_layer = TokenizationLayer(vocab_path)
outputs = tokenization_layer(inputs)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)

model.save("./test")

It also gives error on

 def call(self,inputs):
        return self.tokenizer.tokenize(inputs)

Error:

AssertionError                            Traceback (most recent call last)
<ipython-input-55-e49dd5ac9a41> in <module>
----> 1 model.save("./test")

~/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/network.py in save(self, filepath, overwrite, include_optimizer, save_format, signatures, options)
   1006     """
   1007     save.save_model(self, filepath, overwrite, include_optimizer, save_format,
-> 1008                     signatures, options)
   1009 
   1010   def save_weights(self, filepath, overwrite=True, save_format=None):

~/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/saving/save.py in save_model(model, filepath, overwrite, include_optimizer, save_format, signatures, options)
    113   else:
    114     saved_model_save.save(model, filepath, overwrite, include_optimizer,
--> 115                           signatures, options)
    116 
    117 

~/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/saving/saved_model/save.py in save(model, filepath, overwrite, include_optimizer, signatures, options)
     76     # we use the default replica context here.
     77     with distribution_strategy_context._get_default_replica_context():  # pylint: disable=protected-access
---> 78       save_lib.save(model, filepath, signatures, options)
     79 
     80   if not include_optimizer:

~/.local/lib/python3.6/site-packages/tensorflow_core/python/saved_model/save.py in save(obj, export_dir, signatures, options)
    907   object_saver = util.TrackableSaver(checkpoint_graph_view)
    908   asset_info, exported_graph = _fill_meta_graph_def(
--> 909       meta_graph_def, saveable_view, signatures, options.namespace_whitelist)
    910   saved_model.saved_model_schema_version = (
    911       constants.SAVED_MODEL_SCHEMA_VERSION)

~/.local/lib/python3.6/site-packages/tensorflow_core/python/saved_model/save.py in _fill_meta_graph_def(meta_graph_def, saveable_view, signature_functions, namespace_whitelist)
    585 
    586   with exported_graph.as_default():
--> 587     signatures = _generate_signatures(signature_functions, resource_map)
    588     for concrete_function in saveable_view.concrete_functions:
    589       concrete_function.add_to_graph()

~/.local/lib/python3.6/site-packages/tensorflow_core/python/saved_model/save.py in _generate_signatures(signature_functions, resource_map)
    456             argument_inputs, signature_key, function.name))
    457     outputs = _call_function_with_mapped_captures(
--> 458         function, mapped_inputs, resource_map)
    459     signatures[signature_key] = signature_def_utils.build_signature_def(
    460         _tensor_dict_to_tensorinfo(exterior_argument_placeholders),

~/.local/lib/python3.6/site-packages/tensorflow_core/python/saved_model/save.py in _call_function_with_mapped_captures(function, args, resource_map)
    408   """Calls `function` in the exported graph, using mapped resource captures."""
    409   export_captures = _map_captures_to_created_tensors(
--> 410       function.graph.captures, resource_map)
    411   # Calls the function quite directly, since we have new captured resource
    412   # tensors we need to feed in which weren't part of the original function

~/.local/lib/python3.6/site-packages/tensorflow_core/python/saved_model/save.py in _map_captures_to_created_tensors(original_captures, resource_map)
    330            "be tracked by assigning them to an attribute of a tracked object "
    331            "or assigned to an attribute of the main object directly.")
--> 332           .format(interior))
    333     export_captures.append(mapped_resource)
    334   return export_captures

AssertionError: Tried to export a function which references untracked object Tensor("StatefulPartitionedCall/args_1:0", shape=(), dtype=resource).TensorFlow objects (e.g. tf.Variable) captured by functions must be tracked by assigning them to an attribute of a tracked object or assigned to an attribute of the main object directly.

opened by galfridman 19

mac os wheel broken for 2.2.0rc2

I'm getting an error running tokenize with tensorflow-text==2.2.0rc2 that I can only reproduce on macs. (same error on rc1, and possibly earlier versions)

Steps to reproduce:

Setup:

python3 -m venv .test_venv 
source .test_venv/bin/activate
pip install --upgrade pip
pip install tensorflow==2.2.0rc3
pip install tensorflow-text==2.2.0rc2

Download vocab.txt into the dir you plan to run the test: aws s3 cp s3://models.huggingface.co/bert/bert-base-uncased-vocab.txt ./vocab.txt
And then run these 5 lines in python

import tensorflow as tf
from tensorflow_text import BertTokenizer
tokenizer = BertTokenizer('./vocab.txt')
test2 = tf.convert_to_tensor(
    'Hello', dtype=tf.string
)
tokenizer.tokenize(test2)

Works on Linux (returns <tf.RaggedTensor [[[100]]]>). On Mac, it throws an error. I've run it on two separate Macs (one with totally fresh installs).

2020-04-16 13:18:07.892934: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at wordpiece_kernel.cc:204 : Invalid argument: Trying to access resource using the wrong type. Expected N10tensorflow6lookup15LookupInterfaceE got N10tensorflow6lookup15LookupInterfaceE
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sylvia/Desktop/workspace/.tfvenv/lib/python3.7/site-packages/tensorflow_text/python/ops/bert_tokenizer.py", line 222, in tokenize
    return self._wordpiece_tokenizer.tokenize(tokens)
  File "/Users/sylvia/Desktop/workspace/.tfvenv/lib/python3.7/site-packages/tensorflow_text/python/ops/wordpiece_tokenizer.py", line 100, in tokenize
    subword, _, _ = self.tokenize_with_offsets(input)
  File "/Users/sylvia/Desktop/workspace/.tfvenv/lib/python3.7/site-packages/tensorflow_text/python/ops/wordpiece_tokenizer.py", line 156, in tokenize_with_offsets
    tokens.flat_values)
  File "/Users/sylvia/Desktop/workspace/.tfvenv/lib/python3.7/site-packages/tensorflow_text/python/ops/wordpiece_tokenizer.py", line 182, in tokenize_with_offsets
    **kwargs))
  File "<string>", line 141, in wordpiece_tokenize_with_offsets
  File "/Users/sylvia/Desktop/workspace/.tfvenv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Trying to access resource using the wrong type. Expected N10tensorflow6lookup15LookupInterfaceE got N10tensorflow6lookup15LookupInterfaceE [Op:WordpieceTokenizeWithOffsets]

running on python 3.7.6

opened by sylviawhoa 18

import fails: "undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs"

I encountered this bug which is most probably a duplicate of #30 that has been closed. Is it related to https://github.com/tensorflow/text/issues/160#issuecomment-556558082 ?

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04 LTS
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: no
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 2.0.0
Python version: Anaconda python 3.7.5
CUDA/cuDNN version: None
GPU model and memory: None

Describe the current behavior: Error on importing tensorflow-text, making it impossible to import.

Describe the expected behavior: Library can be effortlessly imported and used.

Code to reproduce the issue: Provide a reproducible test case that is the bare minimum necessary to generate the problem.

I created a new minimal environment using

conda create -n tf-test tensorflow python=3.7
conda activate tf-test
pip install tensorflow-text

then, when trying to import tensorflow_text the following error appears

$ python
Python 3.7.5 (default, Oct 25 2019, 15:51:11) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> import tensorflow_text as text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mathieu/miniconda3/envs/tf-test/lib/python3.7/site-packages/tensorflow_text/__init__.py", line 21, in <module>
    from tensorflow_text.python import metrics
  File "/home/mathieu/miniconda3/envs/tf-test/lib/python3.7/site-packages/tensorflow_text/python/metrics/__init__.py", line 20, in <module>
    from tensorflow_text.python.metrics.text_similarity_metric_ops import *
  File "/home/mathieu/miniconda3/envs/tf-test/lib/python3.7/site-packages/tensorflow_text/python/metrics/text_similarity_metric_ops.py", line 28, in <module>
    gen_text_similarity_metric_ops = load_library.load_op_library(resource_loader.get_path_to_datafile('_text_similarity_metric_ops.so'))
  File "/home/mathieu/miniconda3/envs/tf-test/lib/python3.7/site-packages/tensorflow_core/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/mathieu/miniconda3/envs/tf-test/lib/python3.7/site-packages/tensorflow_text/python/metrics/_text_similarity_metric_ops.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs

opened by moreymat 18

Error while importing Tensorflow-text: undefined symbol

Tensorflow-text: 2.9.0
Tensorflow: 2.9.1
Python: 3.7.13

Happens only with 2.9.0; 2.10.0b2 and 2.8.2rc1 seem to be working.

In Google Colab.

Similar to #325, which was closed two years ago.

import tensorflow_text as text

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-30-40513338c6fc> in <module>()
      1 import tensorflow_hub as hub
      2 import pandas as pd
----> 3 import tensorflow_text as text
      4 import matplotlib.pyplot as plt
      5 from sklearn.model_selection import train_test_split

/usr/local/lib/python3.7/dist-packages/tensorflow_text/__init__.py in <module>()
     18 
     19 # pylint: disable=wildcard-import
---> 20 from tensorflow_text.core.pybinds import tflite_registrar
     21 from tensorflow_text.python import keras
     22 from tensorflow_text.python import metrics

ImportError: /usr/local/lib/python3.7/dist-packages/tensorflow_text/core/pybinds/tflite_registrar.so: undefined symbol: _ZN4absl12lts_2021110220raw_logging_internal21internal_log_functionB5cxx11E

opened by Tangogow 14

Can not build on Apple Silicon from source.

I have an Apple M1 Mac, and TensorFlow can run on it (tensorflow_macos). TF Text has no pre-compiled packages on PyPI, so I need to build it myself. I know that TF Text needs to be built in a TensorFlow environment, but the package name is tensorflow_macos on an Apple M1 Mac. What should I do?

enhancement

opened by sun1638650145 13

2.4.0rc1 windows wheel broken due to some binary incompatibility (Op:WordpieceTokenizeWithOffsets fails on LookupInterface parameter check)

Seems very similar to https://github.com/tensorflow/text/issues/272#issue-601267559

import tensorflow as tf
from tensorflow_text import BertTokenizer
tokenizer = BertTokenizer('./vocab.txt')
tokenizer.tokenize('Test')

Fails with InvalidArgumentError: Trying to access resource using the wrong type. Expected class tensorflow::lookup::LookupInterface got class tensorflow::lookup::LookupInterface [Op:WordpieceTokenizeWithOffsets]

Any workaround would be greatly appreciated.

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
 in
      2 from tensorflow_text import BertTokenizer
      3 tokenizer = BertTokenizer('./vocab.txt')
----> 4 tokenizer.tokenize('Test')

C:\Anaconda3\lib\site-packages\tensorflow_text\python\ops\bert_tokenizer.py in tokenize(self, text_input)
    224     """
    225     tokens = self._basic_tokenizer.tokenize(text_input)
--> 226     return self._wordpiece_tokenizer.tokenize(tokens)

C:\Anaconda3\lib\site-packages\tensorflow_text\python\ops\wordpiece_tokenizer.py in tokenize(self, input)
    119     of the jth token in input[i1...iN]
    120     """
--> 121     subword, _, _ = self.tokenize_with_offsets(input)
    122     return subword
    123

C:\Anaconda3\lib\site-packages\tensorflow_text\python\ops\wordpiece_tokenizer.py in tokenize_with_offsets(self, input)
    175     tokens = ragged_tensor.RaggedTensor.from_tensor(
    176         tokens, ragged_rank=rank - 1)
--> 177     wordpieces, starts, ends = self.tokenize_with_offsets(
    178         tokens.flat_values)
    179     wordpieces = wordpieces.with_row_splits_dtype(tokens.row_splits.dtype)

C:\Anaconda3\lib\site-packages\tensorflow_text\python\ops\wordpiece_tokenizer.py in tokenize_with_offsets(self, input)
    193     # Tokenize the tokens into subwords
    194     values, row_splits, starts, ends = (
--> 195         gen_wordpiece_tokenizer.wordpiece_tokenize_with_offsets(
    196             input_values=tokens,
    197             vocab_lookup_table=self._vocab_lookup_table.resource_handle,

 in wordpiece_tokenize_with_offsets(input_values, vocab_lookup_table, suffix_indicator, max_bytes_per_word, use_unknown_token, unknown_token, max_chars_per_token, split_unknown_characters, output_row_partition_type, name)

C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in raise_from_not_ok_status(e, name)
   6860   message = e.message + (" name: " + name if name is not None else "")
   6861   # pylint: disable=protected-access
-> 6862   six.raise_from(core._status_to_exception(e.code, message), None)
   6863   # pylint: enable=protected-access
   6864

C:\Anaconda3\lib\site-packages\six.py in raise_from(value, from_value)

InvalidArgumentError: Trying to access resource using the wrong type. Expected class tensorflow::lookup::LookupInterface got class tensorflow::lookup::LookupInterface [Op:WordpieceTokenizeWithOffsets]

bug

opened by eugene-shnitko 13

build from source master branch fails

Compiling the master branch fails with the following errors

./oss_scripts/run_build.sh
platform: Mac M1 (ARM 64)
bazel: 5.3.0

Has anyone built a text wheel that works with the Apple M1 Max?

-- Error ---

In file included from tensorflow_text/core/kernels/byte_splitter_kernel.cc:15:
In file included from ./tensorflow_text/core/kernels/byte_splitter_kernel.h:19:
./tensorflow_text/core/kernels/byte_splitter_kernel_template.h:137:37: error: no member named 'FillOutputTensor' in 'tensorflow::text::ByteSplitterWithOffsetsOp<tflite::shim::Runtime::kTf>'
  SH_RETURN_IF_ERROR(this->template FillOutputTensor<unsigned char, uint8_t>(
  ~~~~ ^
bazel-out/darwin_arm64-opt/bin/external/local_config_tf/include/tensorflow/lite/kernels/shim/status_macros.h:53:31: note: expanded from macro 'SH_RETURN_IF_ERROR'
  ::absl::Status _status = (__VA_ARGS__);
  ^~~~~~~~~~~
bazel-out/darwin_arm64-opt/bin/external/local_config_tf/include/tensorflow/lite/kernels/shim/op_kernel.h:202:45: note: in instantiation of member function 'tensorflow::text::ByteSplitterWithOffsetsOp<tflite::shim::Runtime::kTf>::Invoke' requested here
  return static_cast<SubType&>(*this).Invoke(ctx);
  ^
bazel-out/darwin_arm64-opt/bin/external/local_config_tf/include/tensorflow/lite/kernels/shim/tf_op_shim.h:105:59: note: in instantiation of member function 'tflite::shim::OpKernelShim<tensorflow::text::ByteSplitterWithOffsetsOp, tflite::shim::Runtime::kTf>::Invoke' requested here
  OP_REQUIRES_OK(c, ::tensorflow::FromAbslStatus(impl->Invoke(&ctx)));
  ^
tensorflow_text/core/kernels/byte_splitter_kernel.cc:24:25: note: in instantiation of member function 'tflite::shim::TfOpKernel<tensorflow::text::ByteSplitterWithOffsetsOp>::Compute' requested here
  ByteSplitterWithOffsetsOpKernel);
  ^
In file included from tensorflow_text/core/kernels/byte_splitter_kernel.cc:15:
In file included from ./tensorflow_text/core/kernels/byte_splitter_kernel.h:19:
./tensorflow_text/core/kernels/byte_splitter_kernel_template.h:139:37: error: no member named 'FillOutputTensor' in 'tensorflow::text::ByteSplitterWithOffsetsOp<tflite::shim::Runtime::kTf>'
  SH_RETURN_IF_ERROR(this->template FillOutputTensor<int64_t, int64_t>(
  ~~~~ ^
bazel-out/darwin_arm64-opt/bin/external/local_config_tf/include/tensorflow/lite/kernels/shim/status_macros.h:53:31: note: expanded from macro 'SH_RETURN_IF_ERROR'
  ::absl::Status _status = (__VA_ARGS__);
  ^~~~~~~~~~~
In file included from tensorflow_text/core/kernels/byte_splitter_kernel.cc:15:
In file included from ./tensorflow_text/core/kernels/byte_splitter_kernel.h:19:
./tensorflow_text/core/kernels/byte_splitter_kernel_template.h:141:37: error: no member named 'FillOutputTensor' in 'tensorflow::text::ByteSplitterWithOffsetsOp<tflite::shim::Runtime::kTf>'
  SH_RETURN_IF_ERROR(this->template FillOutputTensor<int32_t, int32_t>(
  ~~~~ ^
bazel-out/darwin_arm64-opt/bin/external/local_config_tf/include/tensorflow/lite/kernels/shim/status_macros.h:53:31: note: expanded from macro 'SH_RETURN_IF_ERROR'
  ::absl::Status _status = (__VA_ARGS__);
  ^~~~~~~~~~~
In file included from tensorflow_text/core/kernels/byte_splitter_kernel.cc:15:
In file included from ./tensorflow_text/core/kernels/byte_splitter_kernel.h:19:
./tensorflow_text/core/kernels/byte_splitter_kernel_template.h:143:37: error: no member named 'FillOutputTensor' in 'tensorflow::text::ByteSplitterWithOffsetsOp<tflite::shim::Runtime::kTf>'
  SH_RETURN_IF_ERROR(this->template FillOutputTensor<int32_t, int32_t>(
  ~~~~ ^
bazel-out/darwin_arm64-opt/bin/external/local_config_tf/include/tensorflow/lite/kernels/shim/status_macros.h:53:31: note: expanded from macro 'SH_RETURN_IF_ERROR'
  ::absl::Status _status = (__VA_ARGS__);
  ^~~~~~~~~~~
In file included from tensorflow_text/core/kernels/byte_splitter_kernel.cc:15:
In file included from ./tensorflow_text/core/kernels/byte_splitter_kernel.h:19:
./tensorflow_text/core/kernels/byte_splitter_kernel_template.h:305:22: error: no member named 'FillOutputTensor' in 'tensorflow::text::ByteSplitByOffsetsOp<tflite::shim::Runtime::kTf>'
  this->template FillOutputTensor<absl::string_view, tensorflow::tstring>(
  ~~~~ ^
bazel-out/darwin_arm64-opt/bin/external/local_config_tf/include/tensorflow/lite/kernels/shim/status_macros.h:53:31: note: expanded from macro 'SH_RETURN_IF_ERROR'
  ::absl::Status _status = (__VA_ARGS__);
  ^~~~~~~~~~~
bazel-out/darwin_arm64-opt/bin/external/local_config_tf/include/tensorflow/lite/kernels/shim/op_kernel.h:202:45: note: in instantiation of member function 'tensorflow::text::ByteSplitByOffsetsOp<tflite::shim::Runtime::kTf>::Invoke' requested here
  return static_cast<SubType&>(*this).Invoke(ctx);
  ^
bazel-out/darwin_arm64-opt/bin/external/local_config_tf/include/tensorflow/lite/kernels/shim/tf_op_shim.h:105:59: note: in instantiation of member function 'tflite::shim::OpKernelShim<tensorflow::text::ByteSplitByOffsetsOp, tflite::shim::Runtime::kTf>::Invoke' requested here
  OP_REQUIRES_OK(c, ::tensorflow::FromAbslStatus(impl->Invoke(&ctx)));
  ^
tensorflow_text/core/kernels/byte_splitter_kernel.cc:28:25: note: in instantiation of member function 'tflite::shim::TfOpKernel<tensorflow::text::ByteSplitByOffsetsOp>::Compute' requested here
  ByteSplitByOffsetsOpKernel);
  ^
In file included from tensorflow_text/core/kernels/byte_splitter_kernel.cc:15:
In file included from ./tensorflow_text/core/kernels/byte_splitter_kernel.h:19:
./tensorflow_text/core/kernels/byte_splitter_kernel_template.h:307:37: error: no member named 'FillOutputTensor' in 'tensorflow::text::ByteSplitByOffsetsOp<tflite::shim::Runtime::kTf>'
  SH_RETURN_IF_ERROR(this->template FillOutputTensor<int32_t, int64_t>(
  ~~~~ ^
bazel-out/darwin_arm64-opt/bin/external/local_config_tf/include/tensorflow/lite/kernels/shim/status_macros.h:53:31: note: expanded from macro 'SH_RETURN_IF_ERROR'
  ::absl::Status _status = (__VA_ARGS__);
^~~~~~~~~~~ 6 errors generated.

opened by ashsha21 3

Releases(v2.11.0)

v2.11.0(Nov 21, 2022)
Release 2.11.0

Major Features and Improvements

Added op for converting to/from BOISE labels to offsets
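To make the BOISE scheme concrete (Begin, Outside, Inside, Singleton, End), here is a minimal plain-Python sketch of turning entity span offsets into per-token tags. It is an illustration only, not the library's kernel: the function name `spans_to_boise`, its argument layout, and the rule that a token belongs to a span only when it lies fully inside it are all assumptions made for this example.

```python
def spans_to_boise(token_starts, token_ends, spans):
    """Tag each token with a BOISE label.

    spans is a list of (start, end, type) tuples using the same offset
    units as the token offsets (hypothetical layout for this sketch).
    """
    tags = ["O"] * len(token_starts)
    for span_start, span_end, span_type in spans:
        # Assumption: a token is part of a span only if fully contained in it.
        inside = [
            i
            for i, (ts, te) in enumerate(zip(token_starts, token_ends))
            if ts >= span_start and te <= span_end
        ]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "S-" + span_type
        else:
            tags[inside[0]] = "B-" + span_type
            tags[inside[-1]] = "E-" + span_type
            for i in inside[1:-1]:
                tags[i] = "I-" + span_type
    return tags


# Tokens of "Alice visited New York City" with character offsets.
token_starts = [0, 6, 14, 18, 23]
token_ends = [5, 13, 17, 22, 27]
spans = [(0, 5, "PER"), (14, 27, "LOC")]
print(spans_to_boise(token_starts, token_ends, spans))
# ['S-PER', 'O', 'B-LOC', 'I-LOC', 'E-LOC']
```

The reverse direction (tags to offsets) would walk the tags and emit a span at each `S-` tag or each `B-` ... `E-` run.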

Bug Fixes and Other Changes

tensorflow:

Moving logging.h and bitmap from tf/core to tf/tsl.

BOISE TF op:

Add main C++ functions for converting to/from BOISE labels to offsets

Add kernel code and Python API for OffsetsToBoiseTags op

Other:

Add link to KPLs, fix typo in Neural machine translation with attention tutorial

Update README.md

Publish the tensorflow_models.nlp guide docs to tensorflow.org

Add missing dependency to constrained sequence kernel.

Add missing absl status dependency to sentence breaking utils.

Another missing absl status dependency, this time for sentence fragmenter.

Add absl status to sentence fragmenter v2.

Update pybind11 to 2.10.0 to match tensorflow.

Better error message for WordPiece when the vocabulary file has unicode issues.

Update Transformer tutorial with Keras MultiHeadAttention

transformers.ipynb: fix length filter and target slicing

transformers.ipynb: cleanup wording, create a PositionalEmbedding layer.

Replace tensorflow::Status::OK() with tensorflow::OkStatus().

Update README with note about various OS releases.

Cast the step type.

Reactivate TFLite ByteSplitter test.

Modify tokenizer to process pt_examples to tokenizers.pt

fix words alignment in documentation

Update nmt_with_attention:

transformers.ipynb: Factor out CrossAttention, GlobalSelfAttention, and CausalSelfAttention layers.

Switch the transformer to train with Model.fit.

Whitespace changes to force republish.

Fix tutorial display, again.

Update version

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

satojkovic
v2.11.0-rc0(Oct 19, 2022)
Release 2.11.0-rc0

Bug Fixes and Other Changes

tensorflow:

Moving logging.h and bitmap from tf/core to tf/tsl.

BOISE TF op:

Add main C++ functions for converting to/from BOISE labels to offsets

Add kernel code and Python API for OffsetsToBoiseTags op

Other:

Add link to KPLs, fix typo in Neural machine translation with attention tutorial

Update README.md

Publish the tensorflow_models.nlp guide docs to tensorflow.org

Add missing dependency to constrained sequence kernel.

Add missing absl status dependency to sentence breaking utils.

Another missing absl status dependency, this time for sentence fragmenter.

Add absl status to sentence fragmenter v2.

Update pybind11 to 2.10.0 to match tensorflow.

Better error message for WordPiece when the vocabulary file has unicode issues.

Update Transformer tutorial with Keras MultiHeadAttention

transformers.ipynb: fix length filter and target slicing

transformers.ipynb: cleanup wording, create a PositionalEmbedding layer.

Replace tensorflow::Status::OK() with tensorflow::OkStatus().

Update README with note about various OS releases.

Cast the step type.

Reactivate TFLite ByteSplitter test.

Modify tokenizer to process pt_examples to tokenizers.pt

fix words alignment in documentation

Update nmt_with_attention:

transformers.ipynb: Factor out CrossAttention, GlobalSelfAttention, and CausalSelfAttention layers.

Switch the transformer to train with Model.fit.

Whitespace changes to force republish.

Add a phrase-based tokenizer

Fix tutorial display, again.

Update version

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

satojkovic
v2.10.0(Sep 8, 2022)
Release 2.10.0

Major Features and Improvements

New ByteSplitter which tokenizes strings into bytes.

New tutorial: Fine tune BERT with Orbit [will be added to tensorflow.org/text soon].

Fixed an issue where dynamic TF Lite tensors were not getting resized correctly.
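As a rough picture of what "tokenizes strings into bytes" means, the sketch below splits a string into its UTF-8 byte values together with start and end offsets. It is plain Python written for illustration, not the ByteSplitter op itself; the function name `split_bytes_with_offsets` and the choice of returning three parallel lists are assumptions.

```python
def split_bytes_with_offsets(text):
    """Return the UTF-8 byte values of text plus per-byte [start, end) offsets."""
    data = text.encode("utf-8")
    values = list(data)               # one integer in 0..255 per byte
    starts = list(range(len(data)))   # each byte starts at its own index
    ends = [s + 1 for s in starts]    # and spans exactly one byte
    return values, starts, ends


values, starts, ends = split_bytes_with_offsets("hi☹")
print(values)  # [104, 105, 226, 152, 185]
print(starts)  # [0, 1, 2, 3, 4]
print(ends)    # [1, 2, 3, 4, 5]
```

Note that the single character "☹" becomes three byte tokens, which is why byte-level splitting never needs an out-of-vocabulary token.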

Bug Fixes and Other Changes

Fix typo error in subwords_tokenizer guide with text.WordpieceTokenizer

Fixes prepare_tf_dep.sh for OSX.

Add cross-links to tensorflow_models.nlp API reference.

(Generated change) Update tf.Text versions and/or docs.

Update shape inference of kernel template for fast wordpiece and activate the op test.

Update configure.sh for Apple Silicon.

Export Trimmer ABC to be usable as tf_text.Trimmer

Fix TensorFlow checkpoint and trackable imports.

Correct tutorial explanation: meaning of attention weights

Modernize fine_tune_bert.

Lint and update the Fine-tuning a BERT model tutorial

Use pointer for pointer math instead of iterator. Fixes C++17 compilation for regex_split on Windows.

Add install_bazel.sh script to make it easy to install the required version of Bazel. (#946)

Make install_bazel.sh script executable.

Prevent runtime errors from happening due to invalid regular expressions using regex_split & RegexSplitter.

Centralize tensorflow-models docs into a top-level docs/ directory.

Remove link to non-existent section on tf.org.

Move fine_tune_bert guide.

Fixed spelling mistakes in subwords_tokenizer.ipynb

Fixes a bug caused by passing an empty tensor into SentencepieceTokenizer's detokenize method.

Update build for Sentencepiece. Darts was not properly being depended on.

Improve Sentencepiece build by adding missing dependency - str_format.

Fix typos and lint Neural machine translation with attention tutorial

Fix external link formatting, lint NMT with attention tutorial

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

gadagashwini, mnahinkhan, Steve R. Sun, synandi
v2.10.0-rc0(Aug 4, 2022)
Release 2.10.0-rc0

Major Features and Improvements

New ByteSplitter which tokenizes strings into bytes.

New tutorial: Fine tune BERT with Orbit [will be added to tensorflow.org/text soon].

Fixed an issue where dynamic TF Lite tensors were not getting resized correctly.

Bug Fixes and Other Changes

Fix typo error in subwords_tokenizer guide with text.WordpieceTokenizer

Fixes prepare_tf_dep.sh for OSX.

Add cross-links to tensorflow_models.nlp API reference.

(Generated change) Update tf.Text versions and/or docs.

Update shape inference of kernel template for fast wordpiece and activate the op test.

Update configure.sh for Apple Silicon.

Export Trimmer ABC to be usable as tf_text.Trimmer

Fix TensorFlow checkpoint and trackable imports.

Correct tutorial explanation: meaning of attention weights

Modernize fine_tune_bert.

Lint and update the Fine-tuning a BERT model tutorial

Use pointer for pointer math instead of iterator. Fixes C++17 compilation for regex_split on Windows.

Add install_bazel.sh script to make it easy to install the required version of Bazel. (#946)

Make install_bazel.sh script executable.

Prevent runtime errors from happening due to invalid regular expressions using regex_split & RegexSplitter.

Centralize tensorflow-models docs into a top-level docs/ directory.

Remove link to non-existent section on tf.org.

Move fine_tune_bert guide.

Fixed spelling mistakes in subwords_tokenizer.ipynb

Fixes a bug caused by passing an empty tensor into SentencepieceTokenizer's detokenize method.

Update build for Sentencepiece. Darts was not properly being depended on.

Improve Sentencepiece build by adding missing dependency - str_format.

Fix typos and lint Neural machine translation with attention tutorial

Fix external link formatting, lint NMT with attention tutorial

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

gadagashwini, mnahinkhan, Steve R. Sun, synandi
v2.9.0(May 18, 2022)
Release 2.9

Major Features and Improvements

New FastBertNormalizer that improves speed for BERT normalization and is convertible to TF Lite.

New FastBertTokenizer that combines FastBertNormalizer and FastWordpieceTokenizer.

New ngrams kernel for handling STRING_JOIN reductions.
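For readers unfamiliar with the STRING_JOIN reduction, the idea is a sliding window over a token sequence whose contents are joined with a separator. The following plain-Python sketch shows that behavior under stated assumptions: the function name `ngrams_string_join` and its parameters are made up for this example and are not the kernel's interface.

```python
def ngrams_string_join(tokens, width, separator=" "):
    """Slide a window of `width` tokens and join each window into one string."""
    return [
        separator.join(tokens[i:i + width])
        for i in range(len(tokens) - width + 1)
    ]


print(ngrams_string_join(["a", "b", "c", "d"], width=2))
# ['a b', 'b c', 'c d']
print(ngrams_string_join(["a", "b"], width=3))
# []
```

A sequence shorter than the window yields no n-grams in this sketch; whether the real op pads such inputs depends on its own options.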

Bug Fixes and Other Changes

NgramsStringJoin shape inference fixed to handle unranked tensors

Upgrade pybind11 and reenable tests that were broken.

Rename a couple files to match the naming of the other tflite kernels. Also adds some deps to tflite_ops that were missing and causing an error when testing :all.

Add to TF Lite documentation that ngrams is a convertible op.

Fix public access and missing ICU data to build_fast_bert_normalizer_model and enable the disabled tests.

Update the doc for FastWordpieceTokenizer.

Refine the doc for FastWordpieceTokenizer.

Bug fix: make BertTokenizer work for RaggedTensors with row_splits_dtype=int32

Fix typo in text.WordpieceTokenizer

Added comma at missing places in emoticons for normalizer

Refactor build and test scripts to use prepare_tf_dep.sh

Fixes prepare_tf_dep.sh for OSX.

Fixed bug in setup.py that was requiring the wrong version.

Updated package with the correct versions of Python we release on.

Update documentation on TF Lite convertible ops.

Transition to use TF's version of bazel.

Transition to use TF's bazel configuration.

Add missing symbols for tokenization layers

Fix typo in text_generation.ipynb

Fix grammar typo

Allow fast wordpiece tokenizer to take in external wordpiece model.

Internal change

Improvement to guide where mean call is redundant. See https://github.com/tensorflow/text/issues/810 for more info.

Update broken link and fix typo in BERT-SNGP demo notebook

Consolidate disparate test-related files into a single testing_infra folder.

Pin tf-text version to guides & tutorials.

Fix bug in constrained sequence op. Added a check for the edge case where num_steps = 0, which should do nothing, to prevent SIGSEGV crashes.

Remove outdated Keras tests because Keras no longer makes the testing utilities available.

Update bert preprocessing by padding correct tensors

Update tensorflow-text notebooks from 2.7 to 2.8

Optimize FastWordPiece to only generate requested outputs.

Add a note about byte-indexing vs character indexing.

Add a MAX_TOKENS to the transformer tutorial.

Only export tensorflow symbols from shared libs.

(Generated change) Update tf.Text versions and/or docs.

Do not run the prepare_tf_dep script for Apple M1 macs.

Update text_classification_rnn.ipynb

Fix the exported symbols for the linker test. By adding it to the shared objects instead of the C++ code, it allows for the code to be compiled together in one large shared lib.

Implement FastBertNormalizer based on codepoint-wise mappings.

Add pybind for fast_bert_normalizer_model_builder.

Remove unused comments related to Python 2 compatibility.

update transformer.ipynb

Update toolchain & temporarily disable tf lite tests.

Define manylinux2014 for the new toolchain target, and have presubmits use it.

Move tflite build deps to custom target.

Add FastBertTokenizer.

Update bazel version to 5.1.0

Update TF Text to use new Ngrams kernel.

Don't try to set dimension if shape is unknown for ngrams.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Aflah, Connor Brinton, devnev39, Janak Ramakrishnan, Martin, Nathan Luehr, Pierre Dulac, Rabin Adhikari, gadagashwini, mohantym, rtg0795
v2.10.0-b2(May 12, 2022)
Release 2.10.0-b2

Major Features and Improvements

Added FastSentencepieceTokenizer which is convertible to TF Lite. Please note the op name in the graph will change, so any models trained with this version will need to be retrained when the release candidate for 2.10 is released.

Important Notes

This beta release is outside the normal release cycle and is meant to work with TF versions 2.8.x.

Again, the op name for FSP will change in future releases.

v2.8.2(Apr 21, 2022)
Release 2.8.2

Major Features and Improvements

📦️ Fix macOS packaging so it works with package managers like Poetry (#838)

Bug Fixes and Other Changes

Package metadata updated with the correct available python versions.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Connor Brinton
v2.9.0-rc1(Apr 15, 2022)
Release 2.9.0-rc1

Major Features and Improvements

New FastBertNormalizer that improves speed for BERT normalization and is convertible to TF Lite.

New FastBertTokenizer that combines FastBertNormalizer and FastWordpieceTokenizer.

New ngrams kernel for handling STRING_JOIN reductions.

Bug Fixes and Other Changes

Fixed bug in setup.py that was requiring the wrong version.

Updated package with the correct versions of Python we release on.

Update documentation on TF Lite convertible ops.

Transition to use TF's version of bazel.

Transition to use TF's bazel configuration.

Add missing symbols for tokenization layers

Fix typo in text_generation.ipynb

Fix grammar typo

Allow fast wordpiece tokenizer to take in external wordpiece model.

Internal change

Improvement to guide where mean call is redundant. See https://github.com/tensorflow/text/issues/810 for more info.

Update broken link and fix typo in BERT-SNGP demo notebook

Consolidate disparate test-related files into a single testing_infra folder.

Pin tf-text version to guides & tutorials.

Fix bug in constrained sequence op. Added a check for the edge case where num_steps = 0, which should do nothing, to prevent SIGSEGV crashes.

Remove outdated Keras tests because Keras no longer makes the testing utilities available.

Update bert preprocessing by padding correct tensors

Update tensorflow-text notebooks from 2.7 to 2.8

Optimize FastWordPiece to only generate requested outputs.

Add a note about byte-indexing vs character indexing.

Add a MAX_TOKENS to the transformer tutorial.

Only export tensorflow symbols from shared libs.

(Generated change) Update tf.Text versions and/or docs.

Do not run the prepare_tf_dep script for Apple M1 macs.

Update text_classification_rnn.ipynb

Fix the exported symbols for the linker test. By adding it to the shared objects instead of the C++ code, it allows for the code to be compiled together in one large shared lib.

Implement FastBertNormalizer based on codepoint-wise mappings.

Add pybind for fast_bert_normalizer_model_builder.

Remove unused comments related to Python 2 compatibility.

update transformer.ipynb

Update toolchain & temporarily disable tf lite tests.

Define manylinux2014 for the new toolchain target, and have presubmits use it.

Move tflite build deps to custom target.

Add FastBertTokenizer.

Update bazel version to 5.1.0

Update TF Text to use new Ngrams kernel.

Don't try to set dimension if shape is unknown for ngrams.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Aflah, Connor Brinton, devnev39, Janak Ramakrishnan, Martin, Nathan Luehr, Pierre Dulac, Rabin Adhikari
v2.9.0-rc0(Apr 14, 2022)
Release 2.9.0-rc0

Major Features and Improvements

New FastBertNormalizer that improves speed for BERT normalization and is convertible to TF Lite.

New FastBertTokenizer that combines FastBertNormalizer and FastWordpieceTokenizer.

New ngrams kernel for handling STRING_JOIN reductions.

Bug Fixes and Other Changes

Add missing symbols for tokenization layers

Fix typo in text_generation.ipynb

Fix grammar typo

Allow fast wordpiece tokenizer to take in external wordpiece model.

Internal change

Improvement to guide where mean call is redundant. See https://github.com/tensorflow/text/issues/810 for more info.

Update broken link and fix typo in BERT-SNGP demo notebook

Consolidate disparate test-related files into a single testing_infra folder.

Pin tf-text version to guides & tutorials.

Fix bug in constrained sequence op. Added a check for the edge case where num_steps = 0, which should do nothing, to prevent SIGSEGV crashes.

Remove outdated Keras tests because Keras no longer makes the testing utilities available.

Update bert preprocessing by padding correct tensors

Update tensorflow-text notebooks from 2.7 to 2.8

Optimize FastWordPiece to only generate requested outputs.

Add a note about byte-indexing vs character indexing.

Add a MAX_TOKENS to the transformer tutorial.

Only export tensorflow symbols from shared libs.

(Generated change) Update tf.Text versions and/or docs.

Do not run the prepare_tf_dep script for Apple M1 macs.

Update text_classification_rnn.ipynb

Fix the exported symbols for the linker test. By adding it to the shared objects instead of the C++ code, it allows for the code to be compiled together in one large shared lib.

Implement FastBertNormalizer based on codepoint-wise mappings.

Add pybind for fast_bert_normalizer_model_builder.

Remove unused comments related to Python 2 compatibility.

update transformer.ipynb

Update toolchain & temporarily disable tf lite tests.

Define manylinux2014 for the new toolchain target, and have presubmits use it.

Move tflite build deps to custom target.

Add FastBertTokenizer.

Update bazel version to 5.1.0

Update TF Text to use new Ngrams kernel.

Don't try to set dimension if shape is unknown for ngrams.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Aflah, Connor Brinton, devnev39, Janak Ramakrishnan, Martin, Nathan Luehr, Pierre Dulac, Rabin Adhikari
v2.8.1(Feb 4, 2022)
Release 2.8.1

Major Features and Improvements

Upgrade Sentencepiece to v0.1.96

Adds new trimmer ShrinkLongestTrimmer
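The name ShrinkLongestTrimmer suggests the strategy: when several segments together exceed a length budget, shorten whichever segment is currently longest until the total fits. The plain-Python sketch below illustrates that idea only. The function name `shrink_longest`, the one-element-at-a-time loop, and the tie-breaking rule (the earliest segment shrinks first) are assumptions of this example and may differ from the library's behavior.

```python
def shrink_longest(segments, max_total):
    """Trim segments from the right so their combined length fits max_total."""
    max_total = max(max_total, 0)
    lengths = [len(s) for s in segments]
    while sum(lengths) > max_total:
        # Pick the longest remaining segment; on a tie, max() keeps the
        # first one (an assumption of this sketch).
        longest = max(range(len(lengths)), key=lambda i: lengths[i])
        lengths[longest] -= 1
    return [s[:n] for s, n in zip(segments, lengths)]


segments = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]
print(shrink_longest(segments, max_total=6))
# [[1, 2], [6, 7], [8, 9]]
```

Compared with trimming every segment to the same fixed quota, this approach leaves short segments untouched and spends the budget where the text actually is.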

Bug Fixes and Other Changes

Upgrade bazel to 4.2.2

Create .bazelversion file to guarantee using correct version

Update tf.Text versions and docs.

Add Apple Silicon support for manual builds.

Update configure.sh

Only Apple Silicon will be installed with tensorflow-macos

Fix merge error & add SP patch for building on Windows

Fix inclusion of missing libraries for Mac & Windows

Update word_embeddings.ipynb

Update classify_text_with_bert.ipynb

Update tensorflow_text tutorials to new preprocessing layer symbol path

Fixes typo in guide

Update Apple Silicon's requires.

release script to use tf nightly

Fix typo in ragged tensor link.

Update requires for setup. It wasn't catching non-M1 Macs.

Add missing symbols for tokenization layers

Fix typo in text_generation.ipynb

Fix grammar typo

Allow fast word piece tokenizer to take in external word piece model.

Update guide with redundant mean call.

Update broken link and fix typo in BERT-SNGP demo notebook.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Abhijeet Manhas, chunduriv, Dean Wyatte, Feiteng, jaymessina3, Mao, Olivier Bacs, RenuPatelGoogle, Steve R. Sun, Stonepia, sun1638650145, Tharaka De Silva, thuang513, Xiaoquan Kong, devnev39, Janak Ramakrishnan, Pierre Dulac
v2.8.0-rc0(Jan 31, 2022)
Release 2.8.0-rc0

Major Features and Improvements

Upgrade Sentencepiece to v0.1.96

Adds new trimmer ShrinkLongestTrimmer

Bug Fixes and Other Changes

Upgrade bazel to 4.2.2

Create .bazelversion file to guarantee using correct version

(Generated change) Update tf.Text versions and/or docs.

Add Apple Silicon support for manual builds.

Update configure.sh

Only Apple Silicon will be installed with tensorflow-macos

Fix merge error & add SP patch for building on Windows

Fix inclusion of missing libraries for Mac & Windows

Update word_embeddings.ipynb

Update classify_text_with_bert.ipynb

Update tensorflow_text tutorials to new preprocessing layer symbol path

Fixes typo in guide

Update Apple Silicon's requires.

release script to use tf nightly

Fix typo in ragged tensor link.

Update requires for setup. It wasn't catching non-M1 Macs.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Abhijeet Manhas, chunduriv, Dean Wyatte, Feiteng, jaymessina3, Mao, Olivier Bacs, RenuPatelGoogle, Steve R. Sun, Stonepia, sun1638650145, Tharaka De Silva, thuang513, Xiaoquan Kong
v2.7.3(Nov 19, 2021)
Bug Fixes and Other Changes

Fixed broken packages for MacOS & Windows

v2.7.0(Nov 12, 2021)
Release 2.7.0

Major Features and Improvements

Added new tokenizer: FastWordpieceTokenizer that is considerably faster than the original WordpieceTokenizer

WhitespaceTokenizer was rewritten to increase speed and reduce kernel size

Ability to convert WhitespaceTokenizer & FastWordpieceTokenizer to TF Lite

Added Keras layers for tokenizers: UnicodeScript, Whitespace, & Wordpiece
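For context on what a WordPiece tokenizer computes, the sketch below shows the commonly described greedy longest-match-first procedure for a single word. It is a plain-Python illustration, not the FastWordpieceTokenizer algorithm mentioned above (this naive loop is quadratic in the word length). The function name `wordpiece`, the `##` continuation prefix, and the `[UNK]` token are assumptions chosen to match common BERT-style vocabularies.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark non-initial pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


vocab = {"un", "##aff", "##able"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece("xyz", vocab))        # ['[UNK]']
```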

Bug Fixes and Other Changes

(Generated change) Update tf.Text versions and/or docs.

tiny change for variable name in transformer tutorial

Update nmt_with_attention.ipynb

Add vocab_size for wordpiece tokenizer to have consistency with sentence piece.

This is a general clean up to the build files. The previous tf_deps paradigm was confusing. By encapsulating everything into a single call lib, I'm hoping this makes it easier to understand and follow.

This adds the builder for the new WhitespaceTokenizer config cache. This is the first in a series of changes to update the WST for mobile.

C++ API for new WhitespaceTokenizer. The updated API is more useful (accepts strings instead of ints), faster, and smaller in size.

Adds pywrap for WhitespaceTokenizer config builder.

Simplify the configure.bzl. Since for each platform we build with C++14, let's just make it easier to default to it across the board. This should be easier to understand and maintain.

Remove most of the default oss deps for kernels as they are no longer required for building.

Updating this BERT tutorial to use model subclassing (easier for students to hack on it this way).

Adds kernels for TF & TFLite for the new WhitespaceTokenizer.

Fix a problem with the WST template that was causing members to be exported as undefined symbols. After this change they become a unique global symbol in the shared object file.

Update whitespace op to use new kernel. This change still allows for building the old kernel as well so current users can continue to use it, even though we cannot make new calls to it.

Convert the TFLite kernel for ngram with STRING_JOIN mode to use tfshim so the same code is now used for TF and TFLite kernels.

fix: masked_ids -> masked_lm_ids

Save the transformer.

Remove the sentencepiece patch in OSS

fix vocab_table arg is not used in bert_pretrain_preprocess()

Disable TSAN for one more tutorial test that may run for >900sec when TSAN is

internal

Update deps to fix broken build.

Remove --gen_report flag.

Small typo fixed

Explain that all heads are handled with a single Dense layer

internal change, should be a noop in github.

Creates tf Lite registrar and adds TF Lite tests for mobile ops.

Fix nmt_with_attention start_index

Export LD_LIBRARY_PATH when configuring for build.

Update tf lite test to use the function rather than having to globally share the linked library symbols so the interpreter can find the name since this is only available on linux.

Temporarily switch to the definition of REGISTER_TF_OP_SHIM while it updates.

Update REGISTER_TF_OP_SHIM macro to remove unnecessary parameter.

Remove temporary code and set back to using the op shim macro.

Updated import statement

Internal change

pushed back forward compatibility date for tf_text.WhitespaceTokenizer.

Add .gitignore

The --keep_going flag will make bazel run all tests instead of stopping

Add missing blank line between test and doctest.

Adds a regression test for model server for the replaced WST op. This ensures that current models using the old kernel will continue to work.

Fix the build by adding a new dependency required by TF to kernel targets.

Add sentencepiece detokenize op to stateful allowlist.

Fix broken build. This occurred because of a change on TF that updated the compiler infra version (https://github.com/tensorflow/tensorflow/commit/e0940f269a10f409466b6fef4ef531aec81f9afa).

Clean up code now that the build horizon has passed.

Add pywrap dependency for tflite ops.

Update TextVectorization layer

Allows overridden get_selectable to be used.

fix: masked_input_ids is not used in bert_pretrain_preprocess()

Update word_embeddings.ipynb

Fixed a value where the training accuracy was shown instead of the validation accuracy

Mark old SP targets

Create a single SELECT_TFTEXT_OPS for registering all of the TF Text ops with TF Lite interpreter. Also adds a single target for building to them.

Add TF Lite op for RaggedTensorToTensor.

Adds a new guide for using select TF Text ops in TF Lite models for mobile.

Switch FastWordpieceTokenizer to default to running pre-tokenization, and rename the end_to_end parameter to no_pretokenization. This should be a no-op. The flatbuffer is not changed so as to not affect any models already using FWP currently. Only the python API is updated.

Update version

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Aaron Siddhartha Mondal, Abhijeet Manhas, Dominik Schlösser, jaymessina3, Mao, Xiaoquan Kong, Yasir Modak, Olivier Bacs, Tharaka De Silva
v2.7.0-rc1(Nov 4, 2021)
Release 2.7.0-rc1

Major Features and Improvements

Added new tokenizer: FastWordpieceTokenizer that is considerably faster than the original WordpieceTokenizer

WhitespaceTokenizer was rewritten to increase speed and reduce kernel size

Ability to convert WhitespaceTokenizer & FastWordpieceTokenizer to TF Lite

Added Keras layers for tokenizers: UnicodeScript, Whitespace, & Wordpiece

Bug Fixes and Other Changes

(Generated change) Update tf.Text versions and/or docs.

tiny change for variable name in transformer tutorial

Update nmt_with_attention.ipynb

Add vocab_size for wordpiece tokenizer to have consistency with sentence piece.

This is a general clean up to the build files. The previous tf_deps paradigm was confusing. By encapsulating everything into a single call lib, I'm hoping this makes it easier to understand and follow.

This adds the builder for the new WhitespaceTokenizer config cache. This is the first in a series of changes to update the WST for mobile.

C++ API for new WhitespaceTokenizer. The updated API is more useful (accepts strings instead of ints), faster, and smaller in size.

Adds pywrap for WhitespaceTokenizer config builder.

Simplify the configure.bzl. Since for each platform we build with C++14, let's just make it easier to default to it across the board. This should be easier to understand and maintain.

Remove most of the default oss deps for kernels as they are no longer required for building.

Updating this BERT tutorial to use model subclassing (easier for students to hack on it this way).

Adds kernels for TF & TFLite for the new WhitespaceTokenizer.

Fix a problem with the WST template that was causing members to be exported as undefined symbols. After this change they become a unique global symbol in the shared object file.

Update whitespace op to use new kernel. This change still allows for building the old kernel as well so current users can continue to use it, even though we cannot make new calls to it.

Convert the TFLite kernel for ngram with STRING_JOIN mode to use tfshim so the same code is now used for TF and TFLite kernels.

fix: masked_ids -> masked_lm_ids

Save the transformer.

Remove the sentencepiece patch in OSS

fix vocab_table arg is not used in bert_pretrain_preprocess()

Disable TSAN for one more tutorial test that may run for >900sec when TSAN is

internal

Update deps to fix broken build.

Remove --gen_report flag.

Small typo fixed

Explain that all heads are handled with a single Dense layer

internal change, should be a noop in github.

Creates a TF Lite registrar and adds TF Lite tests for mobile ops.

Fix nmt_with_attention start_index

Export LD_LIBRARY_PATH when configuring for build.

Update the TF Lite test to use the function rather than globally sharing the linked library symbols so the interpreter can find the name, since global symbol sharing is only available on Linux.

Temporarily switch to the definition of REGISTER_TF_OP_SHIM while it updates.

Update REGISTER_TF_OP_SHIM macro to remove unnecessary parameter.

Remove temporary code and set back to using the op shim macro.

Updated import statement

Internal change

pushed back forward compatibility date for tf_text.WhitespaceTokenizer.

Add .gitignore

The --keep_going flag will make bazel run all tests instead of stopping

Add missing blank line between test and doctest.

Adds a regression test for model server for the replaced WST op. This ensures that current models using the old kernel will continue to work.

Fix the build by adding a new dependency required by TF to kernel targets.

Add sentencepiece detokenize op to stateful allowlist.

Fix broken build. This occurred because of a change on TF that updated the compiler infra version (https://github.com/tensorflow/tensorflow/commit/e0940f269a10f409466b6fef4ef531aec81f9afa).

Clean up code now that the build horizon has passed.

Add pywrap dependency for tflite ops.

Update TextVectorization layer

Allows overridden get_selectable to be used.

Update version

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Aaron Siddhartha Mondal, Abhijeet Manhas, Dominik Schlösser, jaymessina3, Mao, Xiaoquan Kong, Yasir Modak
v2.7.0-rc0 (Oct 15, 2021)
Release 2.7.0-rc0

Major Features and Improvements

WhitespaceTokenizer was rewritten to increase speed and reduce kernel size.

Ability to convert some ops to TF Lite

Bug Fixes and Other Changes

(Generated change) Update tf.Text versions and/or docs.

tiny change for variable name in transformer tutorial

Update nmt_with_attention.ipynb

Add vocab_size for wordpiece tokenizer to have consistency with sentence piece.

This is a general clean up to the build files. The previous tf_deps paradigm was confusing. By encapsulating everything into a single call lib, I'm hoping this makes it easier to understand and follow.

This adds the builder for the new WhitespaceTokenizer config cache. This is the first in a series of changes to update the WST for mobile.

C++ API for new WhitespaceTokenizer. The updated API is more useful (accepts strings instead of ints), faster, and smaller in size.

Adds pywrap for WhitespaceTokenizer config builder.

Simplify the configure.bzl. Since for each platform we build with C++14, let's just make it easier to default to it across the board. This should be easier to understand and maintain.

Remove most of the default oss deps for kernels as they are no longer required for building.

Updating this BERT tutorial to use model subclassing (easier for students to hack on it this way).

Adds kernels for TF & TFLite for the new WhitespaceTokenizer.

Fix a problem with the WST template that was causing members to be exported as undefined symbols. After this change they become a unique global symbol in the shared object file.

Update whitespace op to use new kernel. This change still allows for building the old kernel as well so current users can continue to use it, even though we cannot make new calls to it.

Convert the TFLite kernel for ngram with STRING_JOIN mode to use tfshim so the same code is now used for TF and TFLite kernels.

fix: masked_ids -> masked_lm_ids

Save the transformer.

Remove the sentencepiece patch in OSS

Fix: the vocab_table arg was not used in bert_pretrain_preprocess().

Disable TSAN for one more tutorial test that may run for >900sec when TSAN is enabled.

internal

Update deps to fix broken build.

Remove --gen_report flag.

Small typo fixed

Explain that all heads are handled with a single Dense layer

internal change, should be a noop in github.

Creates a TF Lite registrar and adds TF Lite tests for mobile ops.

Fix nmt_with_attention start_index

Export LD_LIBRARY_PATH when configuring for build.

Update the TF Lite test to use the function rather than globally sharing the linked library symbols so the interpreter can find the name, since global symbol sharing is only available on Linux.

Temporarily switch to the definition of REGISTER_TF_OP_SHIM while it updates.

Update REGISTER_TF_OP_SHIM macro to remove unnecessary parameter.

Remove temporary code and set back to using the op shim macro.

Updated import statement

Internal change

pushed back forward compatibility date for tf_text.WhitespaceTokenizer.

Add .gitignore

The --keep_going flag will make bazel run all tests instead of stopping

Add missing blank line between test and doctest.

Adds a regression test for model server for the replaced WST op. This ensures that current models using the old kernel will continue to work.

Fix the build by adding a new dependency required by TF to kernel targets.

Add sentencepiece detokenize op to stateful allowlist.

Fix broken build. This occurred because of a change on TF that updated the compiler infra version (https://github.com/tensorflow/tensorflow/commit/e0940f269a10f409466b6fef4ef531aec81f9afa).

Clean up code now that the build horizon has passed.

Add pywrap dependency for tflite ops.

Update TextVectorization layer

Allows overridden get_selectable to be used.

Update version

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Aaron Siddhartha Mondal, Dominik Schlösser, Xiaoquan Kong, Yasir Modak
v2.6.0 (Aug 18, 2021)
Release 2.6.0

Bug Fixes and Other Changes

Update __init__.py: Added a __version__ variable

Fixes the benchmark suite for graph mode. While using tf.function prevented caching, it was also causing the graph being tested to rebuild each time. Using placeholder instead fixes this.

Pin nightly version.

Remove TF patch as it is not needed anymore. The code is in core TF.

Typos

Format and lint NBs, add images

Add a couple notes to the BertTokenizer docs.

Narrative docs migration: TF Core -> TF Text

Update nmt_with_attention

Moved examples of a few API docs above the args sections to better match other formats.

Fix NBs

Update Installation from source instruction.

Add SplitterWithOffsets as an exported symbol.

Fix a note to the BertTokenizer docs.

Remove unused index.md

Convert tensorflow_text to use public TF if possible.

Fix failing notebooks.

Create user_ops BUILD file.

Remove unnecessary METADATA.

Replace tf.compat.v2.xxx with tf.xxx, since tf_text is using tf2 only.

Fix load_data function in nmt tutorial

Update tf.data.AUTOTUNE in Fine-tuning a BERT model

Switch TF to OSS keras (1/N).

added subspaces

Disable TSAN for tutorial tests that may run for >900sec when TSAN is enabled.

Adds a short description to the main landing page of our GitHub repo to point users to the tf.org subsite.

Phrasing fix to TF Transformer tutorial.

Disable RTTI when building Tf.Text kernels for mobile

Migrate the references in third_party/toolchains directory as it is going to be deleted soon.

Fix bug in RoundRobinTrimmer. Previously the stopping condition was merging and combining from across different batches. Instead now the stopping condition is first determined in each batch, then aggregated.

Set mask_token='' to make it work with TF 2.6.0

Builds TF Text with C++14 by default. This is already done by TensorFlow, and the TF Lite shim has C++14 features used within; thus, this is needed to build kernels against it.

This is a general clean up to the build files. The previous tf_deps paradigm was confusing. By encapsulating everything into a single call lib, I'm hoping this makes it easier to understand and follow.

Update the WORKSPACE to not use the same "workspace" name when initializing TensorFlow.
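The RoundRobinTrimmer fix above concerns where the stopping condition is evaluated. As a rough illustration only, in plain Python rather than the library's kernel or API, round-robin trimming with a budget decided separately for each batch item might look like the sketch below. The names `round_robin_trim` and `trim_batch` are hypothetical, and the nested-list batch layout is an assumption made for readability.

```python
from typing import List, Sequence


def round_robin_trim(segments: Sequence[Sequence[str]], max_len: int) -> List[List[str]]:
    """Trim the segments of ONE batch item so their combined length is <= max_len.

    The budget is handed out one token at a time, cycling over the segments,
    until it is used up or every segment is fully kept.
    """
    kept = [0] * len(segments)  # number of tokens kept from each segment
    budget = max_len
    progressed = True
    while budget > 0 and progressed:
        progressed = False
        for i, seg in enumerate(segments):
            if budget == 0:
                break
            if kept[i] < len(seg):
                kept[i] += 1
                budget -= 1
                progressed = True
    return [list(seg[:k]) for seg, k in zip(segments, kept)]


def trim_batch(batch: Sequence[Sequence[Sequence[str]]], max_len: int) -> List[List[List[str]]]:
    """Apply the trimmer to each batch item independently (hypothetical layout)."""
    # The stopping condition is evaluated per item, never shared across the batch.
    return [round_robin_trim(item, max_len) for item in batch]


batch = [
    [["a", "b", "c"], ["d", "e"]],
    [["x"], ["y", "z", "w"]],
]
trimmed = trim_batch(batch, max_len=4)
```

Because each item carries its own budget, a short first segment in one item does not change how many tokens another item keeps.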

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

8bitmp3, akiprasad, bongbonglemon, Jules Gagnon-Marchand, Stonepia
v2.6.0-rc0 (Jul 2, 2021)
Release 2.6.0-rc0

Bug Fixes and Other Changes

Update __init__.py: Added a __version__ variable

Fixes the benchmark suite for graph mode. While using tf.function prevented caching, it was also causing the graph being tested to rebuild each time. Using placeholder instead fixes this.

Pin nightly version.

Remove TF patch as it is not needed anymore. The code is in core TF.

Typos

Format and lint NBs, add images

Add a couple notes to the BertTokenizer docs.

Narrative docs migration: TF Core -> TF Text

Update nmt_with_attention

Moved examples of a few API docs above the args sections to better match other formats.

Fix NBs

Update Installation from source instruction.

Add SplitterWithOffsets as an exported symbol.

Fix a note to the BertTokenizer docs.

Remove unused index.md

Convert tensorflow_text to use public TF if possible.

Fix failing notebooks.

Create user_ops BUILD file.

Remove unnecessary METADATA.

Replace tf.compat.v2.xxx with tf.xxx, since tf_text is using tf2 only.

Fix load_data function in nmt tutorial

Update tf.data.AUTOTUNE in Fine-tuning a BERT model

Switch TF to OSS keras (1/N).

added subspaces

Disable TSAN for tutorial tests that may run for >900sec when TSAN is enabled.

Adds a short description to the main landing page of our GitHub repo to point users to the tf.org subsite.

Phrasing fix to TF Transformer tutorial.

Disable RTTI when building Tf.Text kernels for mobile
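With the `__version__` variable added in this release, a script can check that the installed TF Text matches the TensorFlow minor version, as the install note recommends. The following is a minimal sketch, assuming version strings of the form `major.minor.patch` with an optional suffix such as `-rc0`. The helpers `major_minor` and `versions_match` are hypothetical names, not part of either library.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a string such as '2.6.0' or '2.6.0-rc0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_match(tf_version: str, tf_text_version: str) -> bool:
    """True when both packages share the same major and minor version."""
    return major_minor(tf_version) == major_minor(tf_text_version)


# In a real environment the two arguments would typically be
# tf.__version__ and tensorflow_text.__version__.
ok = versions_match("2.6.0", "2.6.0-rc0")
mismatch = versions_match("2.5.0", "2.6.0")
```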

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

8bitmp3, akiprasad, bongbonglemon, Jules Gagnon-Marchand, Stonepia
v2.5.0 (May 24, 2021)
Release 2.5.0

We want to particularly point out that guides, tutorials, and API docs are currently being published to http://tensorflow.org/text ! This should make it easier for users to find our documentation. We worked hard on improving docs across the board, so feel free to let us know if further clarification is needed.

Major Features and Improvements

API docs, guides, & tutorials are now available on http://tensorflow.org/text

New guides & tutorials including: tokenizers, subwords tokenizer, and BERT text preprocessing guide.

Add RoundRobinTrimmer

Add a function to generate a BERT vocab from a tf.data.Dataset.

Add detokenize methods for BertTokenizer and WordpieceTokenizer.

Enable NFD and NFKD in NormalizeWithOffset op
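To illustrate what detokenizing wordpieces involves, here is a plain-Python sketch of the general idea: pieces that start with a suffix indicator are glued onto the preceding piece. This is not the library's implementation. The function name `detokenize_wordpieces` is hypothetical, and the `##` indicator is assumed to be the one in use.

```python
from typing import List, Sequence


def detokenize_wordpieces(pieces: Sequence[str], suffix_indicator: str = "##") -> List[str]:
    """Merge wordpieces back into whole words."""
    words: List[str] = []
    for piece in pieces:
        if piece.startswith(suffix_indicator) and words:
            # Continuation piece: strip the indicator and append to the last word.
            words[-1] += piece[len(suffix_indicator):]
        else:
            # A piece without the indicator starts a new word.
            words.append(piece)
    return words


words = detokenize_wordpieces(["they", "##'", "##re", "the", "great", "##est"])
```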

Bug Fixes and Other Changes

Many API updates (eg. adding descriptions & examples) to various ops.

Let SentencePieceTokenizer optionally return the nbest tokenizations instead of sampling from them.

Fix a bug in split mode tokenizers that caused tests to fail on Windows.

Fix broadcasting bugs in RoundRobinTrimmer

Add WordpieceTokenizeWithOffsets with ALLOW_STATEFUL_OP_FOR_DATASET_FUNCTIONS for tf.data

Remove PersistentTensor from sentencepiece_kernels.cc

Document examples are now tested.

Fix benchmarking of graph mode ops through use of tf.function.

Set the default for mask_token for StringLookup and IntegerLookup to None

Update the sentence_breaking_ops docstring to indicate that it's deprecated.

Adding an i18n-friendly BasicTokenizer that can preserve accents

For Windows, always include ICU data files since they need to be built in statically.

Rename documentation file WordShape.md to WordShape_cls.md. Fix #361.

Convert input to tensor to allow for numpy inputs to state based sentence breaker.

Add classifiers to py packages and fix header image.

Fix for the model server test.

Update regression test for break_sentences_with_offsets.

Add a shape attribute to the ToDense Keras layer.

Add support for [batch, 1] shaped inputs in StateBasedSentenceBreaker

Refactor saved_model.py to make it easier to comment out blocks of related code to identify problems.

Add regression test for Find Source Offsets

Fix unselectable_ids shape check in ItemSelector.

Switch out architecture image in tf.Text documentation.

Fix regression test for state_based_sentence_breaker_v2

Update run_build with enable_runfiles flag.

Update the version of bazel_skylib to match TF's and fix a possible visibility issue.

Simplify tf-text WORKSPACE, by relying on tf_workspace().

Update transformer.ipynb to use a saved text.BertTokenizer

Update mobile targets to use :mobile rather than separate :android & :ios targets.

Make tools part of the tensorflow_text pip package.

Import tools from the tf-text package, instead of cloning the git repo.

Minor cleanups to make some code compile on the android build system.

Fix pip install command in readme

Fix tools pip package inclusion.

A tensorflow.org compatible docs generator for tf-text.

Sample random tokens correctly during MLM.

Treat Sentencepiece ops as stateful in tf.data pipelines.

Replacing use of TFT's deprecated dataset_schema.from_feature_spec with its replacement schema_utils.schema_from_feature_spec.

v2.5.0-rc0 (Apr 6, 2021)
Release 2.5.0-rc0

Major Features and Improvements

Add a subwords tokenizer tutorial to text/examples.

Add a function to generate a BERT vocab from a tf.data.Dataset.

Add detokenize methods for BertTokenizer and WordpieceTokenizer.

Let SentencePieceTokenizer optionally return the nbest tokenizations instead of sampling from them.

Enable NFD and NFKD in NormalizeWithOffset op

Adding an i18n-friendly BasicTokenizer that can preserve accents

Create guide for tokenizers.

Breaking Changes

Bug Fixes and Other Changes

Other:

For Windows, always include ICU data files since they need to be built in statically.

Patches TF to fix windows builds to not look for a python3 executable.

Rename documentation file WordShape.md to WordShape_cls.md. The problem is on MacOS (and maybe Windows) this filename collides with wordshape.md, because the filesystem does not differentiate cases for the files. This is purely a QOL change for anybody checking out the library on a non-Linux platform. Fix #361.

Convert input to tensor to allow for numpy inputs to state based sentence breaker.

Add classifiers to py packages and fix header image.

fix bad rendering for add_eos add_bos description in SentencepieceTokenizer.md

Fix for the model server test. Make sure our test tensors have the expected

Update regression test for break_sentences_with_offsets.

Add a shape attribute to the ToDense Keras layer.

Add support for [batch, 1] shaped inputs in StateBasedSentenceBreaker

Fix for the model server test. The result of the tokenize() method of

Refactor saved_model.py to make it easier to comment out blocks of related code to identify problems. Also moved out the vocab for Wordpiece due to a tf bug.

Update documentation for SplitMergeFromLogitsTokenizer

Add regression test for Find Source Offsets

Fix unselectable_ids shape check in ItemSelector.

changing two tests, to debug failure on Kokoro Windows build.

Switch out architecture image in tf.Text documentation.

Fix regression test for state_based_sentence_breaker_v2

Update run_build with enable_runfiles flag.

Update the version of bazel_skylib to match TF's and fix a possible visibility issue.

Simplify tf-text WORKSPACE, by relying on tf_workspace().

Update transformer.ipynb to use a saved text.BertTokenizer

typos

Update mobile targets to use :mobile rather than separate :android & :ios targets.

Make tools part of the tensorflow_text pip package.

Import tools from the tf-text package, instead of cloning the git repo.

Minor cleanups to make some code compile on the android build system.

Fix pip install command in readme

Fix tools pip package inclusion.

Clear outputs

A tensorflow.org compatible docs generator for tf-text.

Formatting fixes for tensorflow.org

Sample random tokens correctly during MLM.

Internal repo change

Treat Sentencepiece ops as stateful in tf.data pipelines.

Reduce the critical section range. Because the options are

Replacing use of TFT's deprecated dataset_schema.from_feature_spec with its replacement schema_utils.schema_from_feature_spec.

Updating guide with new template

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Rens, Samuel Marks, thuang513
v2.4.3 (Jan 13, 2021)
Release 2.4.3

Bug Fixes and Other Changes

Fix export as saved model of hub_module_splitter

Fix bug in regex_split_with_offsets when input.ragged_rank > 1

Convert input to tensor to allow for numpy inputs in state based sentence breaker.

Add more classifiers to py packages.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

fsx950223
v2.4.2 (Dec 23, 2020)
Release 2.4.2

Major Features and Improvements

We are now building a nightly package - tensorflow-text-nightly. This is available for Linux immediately, with other platforms to be added soon.

Bug Fixes and Other Changes

Fixes a bug which prevented the sentence_fragmenter from being able to process tensors with a rank > 1.

Update documentation filenames to prevent collisions when checking out the code on filesystems that do not have case sensitivity.

v2.4.1 (Dec 17, 2020)
Release 2.4.1

Major Features and Improvements

New APIs proposed in RFC: End-to-end text preprocessing with TF.Text #283 have been added, including:

Splitter

RegexSplitter

StateBasedSentenceBreaker

Trimmer

WaterfallTrimmer

RoundRobinTrimmer

ItemSelector

RandomItemSelector

FirstNItemSelector

MaskValuesChooser

mask_language_model()

combine_segments()

pad_model_inputs()

Windows support!

Released our first TF Hub module for Chinese segmentation! Please visit the hub module page here for more info including instructions on how to use the model.

Added Splitter / SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer / TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface do not tokenize into words (sentences, subwords, etc).

With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.

Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.

Added new SplitMergeFromLogitsTokenizer which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.

Added normalize_utf8_with_offsets and find_source_offsets ops.

Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.

Added string_to_id to SentencepieceTokenizer.

Support Android build.

RegexSplit op now caches regular expressions between calls.
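The switch from "limit" to "end" for token offsets, noted above, reflects that the value is a position in the input and not a length. A small plain-Python sketch makes the distinction concrete. It assumes byte offsets into the UTF-8 input, and `whitespace_tokenize_with_offsets` is a hypothetical stand-in, not a library function.

```python
import re
from typing import List, Tuple


def whitespace_tokenize_with_offsets(text: str) -> Tuple[List[str], List[int], List[int]]:
    """Split on ASCII whitespace and return tokens with start/end byte offsets."""
    data = text.encode("utf-8")
    tokens: List[str] = []
    starts: List[int] = []
    ends: List[int] = []
    for match in re.finditer(rb"\S+", data):
        tokens.append(match.group().decode("utf-8"))
        starts.append(match.start())  # position of the token's first byte
        ends.append(match.end())      # position just past its last byte, not a length
    return tokens, starts, ends


tokens, starts, ends = whitespace_tokenize_with_offsets("hello world")
# The token length, when needed, is end - start.
lengths = [end - start for start, end in zip(starts, ends)]
```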

Bug Fixes and Other Changes

Add a minimal count_words function to wordpiece_vocabulary_learner.

Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.

Add dep on tensorflow_hub in pip_package/setup.py

Add filegroup BUILD target for test_data segmentation Hub module.

Extend documentation for class HubModuleSplitter.

Read SP model file in bytes mode in tests.

Update intro.ipynb colab.

Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.

Update StateBasedSentenceBreaker handling of text input tensors.

Reduce over-broad dependencies in regex_split library.

Fix broken builds.

Fix comparison between signed and unsigned int in FindNextFragmentBoundary.

Update README regarding versions.

Fixed bug in WordpieceTokenizer so end offset is preserved when an unknown token of long size is found.

Convert non-tensor inputs in pad along dimension op.

Note in the build instructions that coreutils must be installed when building on macOS.

Add long and long long overloads for RegexSplit so as to be TF agnostic c++ api.

Add Splitter / SplitterWithOffsets abstract base classes.

Update setup.py. TensorFlow has switched to the default package being GPU, and having users explicitly call out when wanting just CPU.

Change variable names for token offsets: "limit" -> "end".

Fix presubmit failure on macOS.

Allow dense tensor inputs for RegexSplit.

Fix imports in tools/.

BertTokenizer: Error out if the user passes a normalization_form that will be ignored.

Update documentation for Sentencepiece.tokenize_with_offsets.

Let WordpieceTokenizer read vocabulary files.

Numerous build improvements / adjustments (mostly to support Windows):

Patch out googletest & glog dependencies from Sentencepiece.

Switch to using Bazel's internal patching.

ICU data is built statically for Windows.

Remove reliance on tf_kernel_library.

Patch TF to fix problematic Python executable searching.

Various other updates to .bazelrc, build_pip_package, and configuration to support Windows.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Pranay Joshi, Siddharths8212376, Vincent Bodin
v2.4.0-rc1 (Dec 8, 2020)
Release 2.4.0-rc1

Major Features and Improvements

Windows support!

Released our first TF Hub module for Chinese segmentation! Please visit the hub module page here for more info including instructions on how to use the model.

Added Splitter / SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer / TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface do not tokenize into words (sentences, subwords, etc).

With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.

Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.

Added new SplitMergeFromLogitsTokenizer which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.

Added normalize_utf8_with_offsets and find_source_offsets ops.

Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.

Added string_to_id to SentencepieceTokenizer.

Support Android build.

RegexSplit op now caches regular expressions between calls.
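The RegexSplit op itself is a compiled kernel, and these notes do not describe how its cache works. Purely as an illustration of the general idea of reusing a compiled pattern across calls, a Python sketch could memoize compilation as below. The names `get_pattern` and `split_on` are hypothetical and unrelated to the library's API.

```python
import functools
import re
from typing import List


@functools.lru_cache(maxsize=128)
def get_pattern(pattern: str) -> "re.Pattern":
    """Compile a pattern once and reuse the compiled object on later calls."""
    return re.compile(pattern)


def split_on(text: str, delim_pattern: str) -> List[str]:
    """Split text on the delimiter pattern, dropping empty pieces."""
    return [piece for piece in get_pattern(delim_pattern).split(text) if piece]


first = split_on("a b  c", r"\s+")
second = split_on("d e", r"\s+")  # served from the cache, no recompilation
```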

Bug Fixes and Other Changes

Add a minimal count_words function to wordpiece_vocabulary_learner.

Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.

Add dep on tensorflow_hub in pip_package/setup.py

Add filegroup BUILD target for test_data segmentation Hub module.

Extend documentation for class HubModuleSplitter.

Read SP model file in bytes mode in tests.

Update intro.ipynb colab.

Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.

Update StateBasedSentenceBreaker handling of text input tensors.

Reduce over-broad dependencies in regex_split library.

Fix broken builds.

Fix comparison between signed and unsigned int in FindNextFragmentBoundary.

Update README regarding versions.

Fixed bug in WordpieceTokenizer so end offset is preserved when an unknown token of long size is found.

Convert non-tensor inputs in pad along dimension op.

Note in the build instructions that coreutils must be installed when building on macOS.

Add long and long long overloads for RegexSplit so as to be TF agnostic c++ api.

Add Splitter / SplitterWithOffsets abstract base classes.

Update setup.py. TensorFlow has switched to the default package being GPU, and having users explicitly call out when wanting just CPU.

Change variable names for token offsets: "limit" -> "end".

Fix presubmit failure on macOS.

Allow dense tensor inputs for RegexSplit.

Fix imports in tools/.

BertTokenizer: Error out if the user passes a normalization_form that will be ignored.

Update documentation for Sentencepiece.tokenize_with_offsets.

Let WordpieceTokenizer read vocabulary files.

Numerous build improvements / adjustments (mostly to support Windows):

Patch out googletest & glog dependencies from Sentencepiece.

Switch to using Bazel's internal patching.

ICU data is built statically for Windows.

Remove reliance on tf_kernel_library.

Patch TF to fix problematic Python executable searching.

Various other updates to .bazelrc, build_pip_package, and configuration to support Windows.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Pranay Joshi, Siddharths8212376, Vincent Bodin
v2.4.0-rc0 (Nov 18, 2020)
Release 2.4.0-rc0

Major Features and Improvements

Released our first TF Hub module for Chinese segmentation! Please visit the hub module page here for more info including instructions on how to use the model.

Added Splitter / SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer / TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface do not tokenize into words (sentences, subwords, etc).

With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.

Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.

Added new SplitMergeFromLogitsTokenizer which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.

Added normalize_utf8_with_offsets and find_source_offsets ops.

Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.

Added string_to_id to SentencepieceTokenizer.

Support Android build.

Support Windows build (Py3.6 & Py3.7 this release).

RegexSplit op now caches regular expressions between calls.

Bug Fixes and Other Changes

Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.

Add dep on tensorflow_hub in pip_package/setup.py

Add filegroup BUILD target for test_data segmentation Hub module.

Extend documentation for class HubModuleSplitter.

Read SP model file in bytes mode in tests.

Update intro.ipynb colab.

Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.

Update StateBasedSentenceBreaker handling of text input tensors.

Reduce over-broad dependencies in regex_split library.

Fix broken builds.

Fix comparison between signed and unsigned int in FindNextFragmentBoundary.

Update README regarding versions.

Fixed bug in WordpieceTokenizer so end offset is preserved when an unknown token of long size is found.

Convert non-tensor inputs in pad along dimension op.

Note in the build instructions that coreutils must be installed when building on macOS.

Add long and long long overloads for RegexSplit so as to be TF agnostic c++ api.

Add Splitter / SplitterWithOffsets abstract base classes.

Update setup.py. TensorFlow has switched to the default package being GPU, and having users explicitly call out when wanting just CPU.

Change variable names for token offsets: "limit" -> "end".

Fix presubmit failure on macOS.

Allow dense tensor inputs for RegexSplit.

Fix imports in tools/.

BertTokenizer: Error out if the user passes a normalization_form that will be ignored.

Update documentation for Sentencepiece.tokenize_with_offsets.

Let WordpieceTokenizer read vocabulary files.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Pranay Joshi, Siddharths8212376, Vincent Bodin
v2.4.0-b0 (Oct 23, 2020)
Release 2.4.0-b0

Please note that this is a pre-release and meant to run with TF v2.3.x. We wanted to give access to some of the features we were adding to 2.4.x, but did not want to wait for the TF release.

Major Features and Improvements

Released our first TF Hub module for Chinese segmentation! Please visit the hub module page here for more info including instructions on how to use the model.

Added Splitter / SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer / TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface do not tokenize into words (sentences, subwords, etc).

With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.

Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.

Added new SplitMergeFromLogitsTokenizer which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.

Bug Fixes and Other Changes

Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.

Add dep on tensorflow_hub in pip_package/setup.py

Add filegroup BUILD target for test_data segmentation Hub module.

Extend documentation for class HubModuleSplitter.

Read SP model file in bytes mode in tests.

Thanks to our Contributors
v2.3.0(Jul 28, 2020)
Release 2.3.0

Major Features and Improvements

Added UnicodeCharacterTokenizer

Tokenizers are now tf.Modules and can be saved from within Keras layers.
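As a rough illustration of what character-level tokenization means, the sketch below splits a string into Unicode code points and joins them back in plain Python. It does not call TF Text; the helper names are assumptions for this example, and the library's tokenizer operates on tensors rather than Python strings.

```python
def tokenize_codepoints(text):
    """Split a string into a list of Unicode code points (integers)."""
    return [ord(ch) for ch in text]


def detokenize_codepoints(codepoints):
    """Join a list of Unicode code points back into a string."""
    return "".join(chr(cp) for cp in codepoints)


codepoints = tokenize_codepoints("Sad☹")
print(codepoints)
print(detokenize_codepoints(codepoints))
```

Each character becomes one token regardless of how many bytes it takes in UTF-8, so the round trip through detokenize_codepoints recovers the original string.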

Bug Fixes and Other Changes

Allow wordpiece_tokenizer to output int32 tokens natively.

Tracks the Sentencepiece model resource via a TrackableResource.

oss-segmenter:

fix end-offset error in split_merge_tokenizer_kernel.

TensorFlow text python ops wordshape:

More comprehensive emoji handling

Other:

Unref lookup_table in wordpiece_kernel fixing a possible memory leak.

Add missing LICENSE file for third_party/tensorflow_text/core/kernels.

Add normalize kernels test.

Fix Sentencepiece tests.

Add some metric logs to tokenizers.

Fix documentation formatting for SplitMergeTokenizer

Bug fix: make sure tokenize() method does not ignore itself.

Improve logging efficiency.

Update tf.text's regression test model for model server. Without the asserts, errors are erroneously swallowed by tensorflow. I also added tf.unicode_script test just to ensure that ICU is working correctly from within model server.

Add the ability to define a user-defined destination directory to make testing easier.

Fix typo in documentation of BertTokenizer

Clarify docstring of UnicodeScriptTokenizer about splitting on space

Add executable flag to the run_build.sh script.

Clarify docstring of WordpieceTokenizer on unknown_token

Update protobuf library and point HEAD to build on tf 2.3.0-rc0

Thanks to our Contributors
v2.3.0-rc1(Jul 15, 2020)
Release 2.3.0-rc1

Major Features and Improvements

Added UnicodeCharacterTokenizer

Bug Fixes and Other Changes

oss-segmenter:

fix end-offset error in split_merge_tokenizer_kernel.

TensorFlow text python ops wordshape:

More comprehensive emoji handling

Other:

Unref lookup_table in wordpiece_kernel fixing a possible memory leak.

Add missing LICENSE file for third_party/tensorflow_text/core/kernels.

Add normalize kernels test.

Add some metric logs to tokenizers.

Fix documentation formatting for SplitMergeTokenizer

Bug fix: make sure tokenize() method does not ignore itself.

Improve logging efficiency.

Update tf.text's regression test model for model server. Without the asserts, errors are erroneously swallowed by tensorflow. I also added tf.unicode_script test just to ensure that ICU is working correctly from within model server.

Add the ability to define a user-defined destination directory to make testing easier.

Fix typo in documentation of BertTokenizer

Clarify docstring of UnicodeScriptTokenizer about splitting on space

Add executable flag to the run_build.sh script.

Clarify docstring of WordpieceTokenizer on unknown_token

Update protobuf library and point HEAD to build on tf 2.3.0-rc0

Thanks to our Contributors
v2.2.1(Jun 4, 2020)
Release 2.2

Major Features and Improvements

Python 3.8 release builds added

Bug Fixes and Other Changes

Add backup storage locations for some dependencies.

v2.2.0(May 11, 2020)
Release 2.2

Major Features and Improvements

Breaking Changes

Bug Fixes and Other Changes

Update version

Thanks to our Contributors
v2.2.0-rc2(Apr 10, 2020)
Bug fixes

Force MacOS builds to build for OSX 10.9 so they can be installed to a wider range of MacOS versions.


