Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Last update: Dec 22, 2021

Related tags

Overview

Japanese-LUW-Tokenizer

Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫

Basic Usage

>>> from transformers import RemBertTokenizerFast
>>> tokenizer=RemBertTokenizerFast.from_pretrained("Japanese-LUW-Tokenizer")
>>> tokenizer.tokenize("全学年にわたって小学校の国語の教科書に大量の挿し絵が用いられている")
['全', '学年', 'にわたって', '小学校', 'の', '国語', 'の', '教科書', 'に', '大量', 'の', '挿し', '絵', 'が', '用い', 'られ', 'ている']

Installation

pip3 install 'transformers>=4.10.0' --user
git clone --depth=1 https://github.com/KoichiYasuoka/Japanese-LUW-Tokenizer

Owner

Koichi Yasuoka

Professor of Digital Humanities

GitHub Repository

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Deepvoice3_pytorch PyTorch implementation of convolutional networks-based text-to-speech synthesis models: arXiv:1710.07654: Deep Voice 3: Scaling Tex

1.8k Dec 30, 2022

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

MMF is a modular framework for vision and language multimodal research from Facebook AI Research. MMF contains reference implementations of state-of-t

5.1k Dec 26, 2022

基于“Seq2Seq+前缀树”的知识图谱问答

KgCLUE-bert4keras 基于“Seq2Seq+前缀树”的知识图谱问答简介博客：https://kexue.fm/archives/8802 环境软件：bert4keras=0.10.8 硬件：目前的结果是用一张Titan RTX（24G）跑出来的。运行第一次运行的时候，会给知

65 Dec 12, 2022

This is a really simple text-to-speech app made with python and tkinter.

Tkinter Text-to-Speech App by Souvik Roy This is a really simple tkinter app which converts the text you have entered into a speech. It is created wit

1 Dec 21, 2021

Train GPT-3 model on V100(16GB Mem) Using improved Transformer.

GPT-X using transformer pytorch

24 Sep 11, 2022

An ActivityWatch watcher to pose questions to the user and record her answers.

aw-watcher-ask An ActivityWatch watcher to pose questions to the user and record her answers. This watcher uses Zenity to present dialog boxes to the

33 Dec 03, 2022

🧪 Cutting-edge experimental spaCy components and features

spacy-experimental: Cutting-edge experimental spaCy components and features This package includes experimental components and features for spaCy v3.x,

65 Dec 30, 2022

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

GenSen Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning Sandeep Subramanian, Adam Trischler, Yoshua B

309 Oct 19, 2022

This is a modification of the OpenAI-CLIP repository of moein-shariatnia

2 Mar 04, 2022

WikiPron - a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary

WikiPron WikiPron is a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary, as well as a database of pronuncia

213 Jan 01, 2023

Fidibo.com comments Sentiment Analyser

Fidibo.com comments Sentiment Analyser Introduction This project first asynchronously grab Fidibo.com books comment data using grabber.py and then sav

3 Apr 15, 2022

🏖 Easy training and deployment of seq2seq models.

Headliner Headliner is a sequence modeling library that eases the training and in particular, the deployment of custom sequence models for both resear

231 Nov 18, 2022

NLP Text Classification

多标签文本分类任务近年来随着深度学习的发展，模型参数的数量飞速增长。为了训练这些参数，需要更大的数据集来避免过拟合。然而，对于大部分NLP任务来说，构建大规模的标注数据集非常困难（成本过高），特别是对于句法和语义相关的任务。相比之下，大规模的未标注语料库的构建则相对容易。为了利用这些数据，我们可以

1 Nov 11, 2021

This repository contains Python scripts for extracting linguistic features from Filipino texts.

Filipino Text Linguistic Feature Extractors This repository contains scripts for extracting linguistic features from Filipino texts. The scripts were

1 Oct 05, 2021

Gold standard corpus annotated with verb-preverb connections for Hungarian.

Hungarian Preverb Corpus A gold standard corpus manually annotated with verb-preverb connections for Hungarian. corpus The corpus consist of the follo

3 Jan 27, 2022

Train 🤗-transformers model with Poutyne.

poutyne-transformers Train 🤗 -transformers models with Poutyne. Installation pip install poutyne-transformers Example import torch from transformers

2 Dec 18, 2022

Pretrain CPM - 大规模预训练语言模型的预训练代码

CPM-Pretrain 版本更新记录为了促进中文自然语言处理研究的发展，本项目提供了大规模预训练语言模型的预训练代码。项目主要基于DeepSpeed、Megatron实现，可以支持数据并行、模型加速、流水并行的代码。安装 1、首先安装pytorch等基础依赖，再安装APEX以支持fp16。 p

37 Dec 06, 2022

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in

241 Jan 04, 2023

PyABSA - Open & Efficient for Framework for Aspect-based Sentiment Analysis

567 Jan 07, 2023

Smart discord chatbot integrated with Dialogflow to manage different classrooms and assist in teaching!

smart-school-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

5 Oct 24, 2022