ICE Tokenizer

Token IDs in [0, 20000) are image tokens.
Token IDs in [20000, 20100) are common tokens, mainly punctuation. E.g., icetk[20000] == ' ', icetk[20003] == ' ', icetk[20006] == ','.
Token IDs in [20100, 83823) are English tokens.
Token IDs in [83823, 145653) are Chinese tokens.
Token IDs in [145653, 150000) are rare tokens. E.g., icetk[145803] == 'α'.
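
The ranges above are half-open and contiguous, so a token ID can be mapped to its category with a binary search over the range boundaries. The following is a minimal sketch that does not use icetk itself; the function name token_category and the category labels are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Exclusive upper bound of each ID range, in the order listed above.
_UPPER_BOUNDS = [20000, 20100, 83823, 145653, 150000]
# Illustrative labels (not part of the icetk API).
_CATEGORIES = ["image", "common", "english", "chinese", "rare"]


def token_category(token_id: int) -> str:
    """Return the category label of the range that contains token_id."""
    if not 0 <= token_id < _UPPER_BOUNDS[-1]:
        raise ValueError(f"token id {token_id} is outside [0, {_UPPER_BOUNDS[-1]})")
    # bisect_right counts how many upper bounds are <= token_id,
    # which is exactly the index of the half-open range it falls into.
    return _CATEGORIES[bisect_right(_UPPER_BOUNDS, token_id)]


if __name__ == "__main__":
    for tid in (12738, 20006, 39316, 94874, 145803):
        print(tid, token_category(tid))
```

For instance, 20006 (the ',' token) falls in the common range and 145803 (the 'α' token) in the rare range.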

You can install the package via

pip install icetk

Tokenization

from icetk import icetk
tokens = icetk.tokenize('Hello World! I am icetk.')
# tokens == ['▁Hello', '▁World', '!', '▁I', '▁am', '▁ice', 'tk', '.']
ids = icetk.encode('Hello World! I am icetk.')
# ids == [39316, 20932, 20035, 20115, 20344, 22881, 35955, 20007]
en = icetk.decode(ids)
# en == 'Hello World! I am icetk.'  # always recovered exactly (if the text contains no unknown tokens)

ids = icetk.encode('你好世界！这里是 icetk。')
# ids == [20005, 94874, 84097, 20035, 94947, 22881, 35955, 83823]

ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)
# ids == tensor([[12738, 12430, 10398,  ...,  7236, 12844, 12386]], device='cuda:0')
# ids.shape == torch.Size([1, 1024])
img = icetk.decode(image_ids=ids, compress_rate=8)
# img.shape == torch.Size([1, 3, 256, 256])
from torchvision.utils import save_image
save_image(img, 'recover.jpg')
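
In the image example above, a 256x256 image with compress_rate=8 yields 1024 token IDs, which matches a 32x32 grid of tokens. The sketch below states that relationship as a small helper. It is an assumption inferred from this single example (one token per compress_rate x compress_rate patch of a square image), not a documented guarantee of icetk, and the name image_token_count is hypothetical.

```python
def image_token_count(image_size: int, compress_rate: int) -> int:
    """Estimate the number of image tokens for a square image.

    Assumes one token per (compress_rate x compress_rate) patch, which is
    consistent with the 256 / 8 -> 1024 example but not otherwise verified.
    """
    if image_size <= 0 or compress_rate <= 0:
        raise ValueError("image_size and compress_rate must be positive")
    if image_size % compress_rate != 0:
        raise ValueError("image_size must be divisible by compress_rate")
    side = image_size // compress_rate  # tokens along one edge
    return side * side


if __name__ == "__main__":
    print(image_token_count(256, 8))  # 32 * 32 tokens
```

Under this assumption, raising compress_rate shortens the token sequence quadratically, at the cost of a coarser reconstruction.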

A unified tokenization tool for Images, Chinese and English.

Owner: THUDM
