Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Overview

Japanese-LUW-Tokenizer

Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫

Basic Usage

>>> from transformers import RemBertTokenizerFast
>>> tokenizer=RemBertTokenizerFast.from_pretrained("Japanese-LUW-Tokenizer")
>>> tokenizer.tokenize("全学年にわたって小学校の国語の教科書に大量の挿し絵が用いられている")
['全', '学年', 'にわたって', '小学校', 'の', '国語', 'の', '教科書', 'に', '大量', 'の', '挿し', '絵', 'が', '用い', 'られ', 'ている']

Installation

pip3 install 'transformers>=4.10.0' --user
git clone --depth=1 https://github.com/KoichiYasuoka/Japanese-LUW-Tokenizer
Owner
Koichi Yasuoka
Professor of Digital Humanities
Koichi Yasuoka
txtai: Build AI-powered semantic search applications in Go

txtai: Build AI-powered semantic search applications in Go txtai executes machine-learning workflows to transform data and build AI-powered semantic s

NeuML 49 Dec 06, 2022
SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples

SNCSE SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples This is the repository for SNCSE. SNCSE aims to allev

Sense-GVT 59 Jan 02, 2023
Trained T5 and T5-large model for creating keywords from text

text to keywords Trained T5-base and T5-large model for creating keywords from text. Supported languages: ru Pretraining Large version | Pretraining B

Danil 61 Nov 24, 2022
Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingwai

TextCortex - HemingwAI Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingw

TextCortex AI 27 Nov 28, 2022
Header-only C++ HNSW implementation with python bindings

Hnswlib - fast approximate nearest neighbor search Header-only C++ HNSW implementation with python bindings. NEWS: version 0.6 Thanks to (@dyashuni) h

2.3k Jan 05, 2023
EMNLP'2021: Can Language Models be Biomedical Knowledge Bases?

BioLAMA BioLAMA is biomedical factual knowledge triples for probing biomedical LMs. The triples are collected and pre-processed from three sources: CT

DMIS Laboratory - Korea University 41 Nov 18, 2022
AudioCLIP Extending CLIP to Image, Text and Audio

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

458 Jan 02, 2023
Watson Natural Language Understanding and Knowledge Studio

Material de demonstração dos serviços: Watson Natural Language Understanding e Knowledge Studio Visão Geral: https://www.ibm.com/br-pt/cloud/watson-na

Vanderlei Munhoz 4 Oct 24, 2021
Document processing using transformers

Doc Transformers Document processing using transformers. This is still in developmental phase, currently supports only extraction of form data i.e (ke

Vishnu Nandakumar 13 Dec 21, 2022
[Preprint] Escaping the Big Data Paradigm with Compact Transformers, 2021

Compact Transformers Preprint Link: Escaping the Big Data Paradigm with Compact Transformers By Ali Hassani[1]*, Steven Walton[1]*, Nikhil Shah[1], Ab

SHI Lab 367 Dec 31, 2022
FastFormers - highly efficient transformer models for NLU

FastFormers FastFormers provides a set of recipes and methods to achieve highly efficient inference of Transformer models for Natural Language Underst

Microsoft 678 Jan 05, 2023
CoSENT 比Sentence-BERT更有效的句向量方案

CoSENT 比Sentence-BERT更有效的句向量方案

苏剑林(Jianlin Su) 201 Dec 12, 2022
A python package to fine-tune transformer-based models for named entity recognition (NER).

nerblackbox A python package to fine-tune transformer-based language models for named entity recognition (NER). Resources Source Code: https://github.

Felix Stollenwerk 13 Jul 30, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Jan 03, 2023
Understanding the Difficulty of Training Transformers

Admin Understanding the Difficulty of Training Transformers Guided by our analyses, we propose Adaptive Model Initialization (Admin), which successful

Liyuan Liu 300 Dec 29, 2022
A design of MIDI language for music generation task, specifically for Natural Language Processing (NLP) models.

MIDI Language Introduction Reference Paper: Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions: code This

Robert Bogan Kang 3 May 25, 2022
Abhijith Neil Abraham 2 Nov 05, 2021
NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT

NeuralQA: A Usable Library for (Extractive) Question Answering on Large Datasets with BERT Still in alpha, lots of changes anticipated. View demo on n

Victor Dibia 220 Dec 11, 2022
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large 💻 GitHub Repository 📚 Documentat

Xing Han Lu 244 Dec 30, 2022
Long text token classification using LongFormer

Long text token classification using LongFormer

abhishek thakur 161 Aug 07, 2022