ICE Tokenizer

Token ids in [0, 20000) are image tokens.
Token ids in [20000, 20100) are common tokens, mainly punctuation. E.g., icetk[20000] == '<unk>', icetk[20003] == '<pad>', icetk[20006] == ','.
Token ids in [20100, 83823) are English tokens.
Token ids in [83823, 145653) are Chinese tokens.
Token ids in [145653, 150000) are rare tokens. E.g., icetk[145803] == 'α'.
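The ranges above are half-open and contiguous, so a token id can be mapped to its category with a sorted lookup. The following is a minimal sketch in plain Python that does not call icetk; the function name `token_category` and the category labels are hypothetical names chosen for illustration, not part of the package.

```python
from bisect import bisect_right

# Half-open [start, end) ranges copied from the list above.
# The labels are hypothetical names chosen for this sketch.
RANGE_STARTS = [0, 20000, 20100, 83823, 145653]
RANGE_LABELS = ["image", "common", "english", "chinese", "rare"]
VOCAB_SIZE = 150000


def token_category(token_id: int) -> str:
    """Return the label of the id range that contains token_id."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token id {token_id} is outside [0, {VOCAB_SIZE})")
    # bisect_right counts the range starts that are <= token_id;
    # subtracting one gives the index of the containing range.
    return RANGE_LABELS[bisect_right(RANGE_STARTS, token_id) - 1]


# Ids taken from the examples in this README.
print(token_category(20006))   # id of ','
print(token_category(39316))   # id of '▁Hello'
print(token_category(145803))  # id of 'α'
```

A lookup like this can be handy for separating image tokens from text tokens in a mixed sequence before decoding them.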

You can install the package via

pip install icetk

Tokenization

from icetk import icetk
tokens = icetk.tokenize('Hello World! I am icetk.')
# tokens == ['▁Hello', '▁World', '!', '▁I', '▁am', '▁ice', 'tk', '.']
ids = icetk.encode('Hello World! I am icetk.')
# ids == [39316, 20932, 20035, 20115, 20344, 22881, 35955, 20007]
en = icetk.decode(ids)
# en == 'Hello World! I am icetk.'  # the text is always recovered exactly (if it contains no <unk>)

ids = icetk.encode('你好世界！这里是 icetk。')
# ids == [20005, 94874, 84097, 20035, 94947, 22881, 35955, 83823]

ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)
# ids == tensor([[12738, 12430, 10398,  ...,  7236, 12844, 12386]], device='cuda:0')
# ids.shape == torch.Size([1, 1024])
img = icetk.decode(image_ids=ids, compress_rate=8)
# img.shape == torch.Size([1, 3, 256, 256])
from torchvision.utils import save_image
save_image(img, 'recover.jpg')
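The shapes in the image example are consistent with one token per compress_rate x compress_rate patch: a 256 x 256 image with compress_rate=8 gives a 32 x 32 grid, that is, 1024 tokens. The sketch below only reproduces that arithmetic. It assumes square images and an image size divisible by the compress rate, and the helper name `image_token_count` is hypothetical, not an icetk function.

```python
def image_token_count(image_size: int, compress_rate: int) -> int:
    """Token count for a square image, assuming one token per
    compress_rate x compress_rate patch (an assumption inferred
    from the shapes shown above, not a documented guarantee)."""
    if image_size <= 0 or compress_rate <= 0:
        raise ValueError("image_size and compress_rate must be positive")
    if image_size % compress_rate != 0:
        raise ValueError("image_size must be a multiple of compress_rate")
    side = image_size // compress_rate  # tokens along one edge
    return side * side


print(image_token_count(256, 8))  # 1024, matching ids.shape == [1, 1024]
```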

A unified tokenization tool for Images, Chinese and English.

Owner

THUDM
