jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese.

Overview

jel: Japanese Entity Linker

  • jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese.

Usage

  • Currently, link and question methods are supported.

el.link

  • This returnes named entity and its candidate ones from Wikipedia titles.
from jel import EntityLinker
el = EntityLinker()

el.link('今日は東京都のマックにアップルを買いに行き、スティーブジョブスとドナルドに会い、堀田区に引っ越した。')
>> [
    {
        "text": "東京都",
        "label": "GPE",
        "span": [
            3,
            6
        ],
        "predicted_normalized_entities": [
            [
                "東京都庁",
                0.1084
            ],
            [
                "東京",
                0.0633
            ],
            [
                "国家地方警察東京都本部",
                0.0604
            ],
            [
                "東京都",
                0.0598
            ],
            ...
        ]
    },
    {
        "text": "アップル",
        "label": "ORG",
        "span": [
            11,
            15
        ],
        "predicted_normalized_entities": [
            [
                "アップル",
                0.2986
            ],
            [
                "アップル インコーポレイテッド",
                0.1792
            ],
            …
        ]
    }

el.question

  • This returnes candidate entity for any question from Wikipedia titles.
>>> linker.question('日本の総理大臣は?')
[('菅内閣', 0.05791765857101555), ('枢密院', 0.05592481946602986), ('党', 0.05430194711042564), ('総選挙', 0.052795400668513175)]

Setup

$ pip install jel
$ python -m spacy download ja_core_news_md

Run as API

$ uvicorn jel.api.server:app --reload --port 8000 --host 0.0.0.0 --log-level trace

Example

# link
$ curl localhost:8000/link -X POST -H "Content-Type: application/json" \
    -d '{"sentence": "日本の総理は菅総理だ。"}'

# question
$ curl localhost:8000/question -X POST -H "Content-Type: application/json" \
    -d '{"sentence": "日本で有名な総理は?"}

Test

$ python pytest

Notes

  • faiss==1.5.3 from pip causes error _swigfaiss.
  • To solve this, see this issue.

LICENSE

Apache 2.0 License.

CITATION

@INPROCEEDINGS{manabe2019chive,
    author    = {真鍋陽俊, 岡照晃, 海川祥毅, 髙岡一馬, 内田佳孝, 浅原正幸},
    title     = {複数粒度の分割結果に基づく日本語単語分散表現},
    booktitle = "言語処理学会第25回年次大会(NLP2019)",
    year      = "2019",
    pages     = "NLP2019-P8-5",
    publisher = "言語処理学会",
}
You might also like...
Japanese synonym library

chikkarpy chikkarpyはchikkarのPython版です。 chikkarpy is a Python version of chikkar. chikkarpy は Sudachi 同義語辞書を利用し、SudachiPyの出力に同義語展開を追加するために開発されたライブラリです。

AllenNLP integration for Shiba: Japanese CANINE model

Allennlp Integration for Shiba allennlp-shiab-model is a Python library that provides AllenNLP integration for shiba-model. SHIBA is an approximate re

Codes to pre-train Japanese T5 models

t5-japanese Codes to pre-train a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts. The model is available at https://hug

Auto translate textbox from Japanese to English or Indonesia
Auto translate textbox from Japanese to English or Indonesia

priconne-auto-translate Auto translate textbox from Japanese to English or Indonesia How to use Install python first, Anaconda is recommended Install

Code for evaluating Japanese pretrained models provided by NTT Ltd.

japanese-dialog-transformers 日本語の説明文はこちら This repository provides the information necessary to evaluate the Japanese Transformer Encoder-decoder dialo

Script to download some free japanese lessons in portuguse from NHK
Script to download some free japanese lessons in portuguse from NHK

Nihongo_nhk This is a script to download some free japanese lessons in portuguese from NHK. It can be executed by installing the packages with: pip in

An open collection of annotated voices in Japanese language

声庭 (Koniwa): オープンな日本語音声とアノテーションのコレクション Koniwa (声庭): An open collection of annotated voices in Japanese language 概要 Koniwa(声庭)は利用・修正・再配布が自由でオープンな音声とアノテ

Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Japanese-LUW-Tokenizer Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫 Basic Usage from transformers import RemBertToken

aMLP Transformer Model for Japanese

aMLP-japanese Japanese aMLP Pretrained Model aMLPとは、Liu, Daiらが提案する、Transformerモデルです。 ざっくりというと、BERTの代わりに使えて、より性能の良いモデルです。 詳しい解説は、こちらの記事などを参考にしてください。 この

Comments
  • ModuleNotFoundError

    ModuleNotFoundError

    Traceback (most recent call last):
      File "scripts/biencoder_training_check.py", line 1, in <module>
        from jel.biencoder.train import biencoder_training
    ModuleNotFoundError: No module named 'jel'
    
    
    opened by izuna385 1
  • Separate Estimation Model and DB

    Separate Estimation Model and DB

    Because the inference model and knowledge base are currently loaded together, it takes 30 seconds to load the model. To prevent this, we will separate the DB into a separate container.

    opened by izuna385 0
Releases(v0.1.1)
Owner
izuna385
izuna385[_@_]gmail.com
izuna385
Pretty-doc - Composable text objects with python

pretty-doc from __future__ import annotations from dataclasses import dataclass

Taine Zhao 2 Jan 17, 2022
Natural Language Processing Specialization

Natural Language Processing Specialization In this folder, Natural Language Processing Specialization projects and notes can be found. WHAT I LEARNED

Kaan BOKE 3 Oct 06, 2022
Code for using and evaluating SpanBERT.

SpanBERT This repository contains code and models for the paper: SpanBERT: Improving Pre-training by Representing and Predicting Spans. If you prefer

Meta Research 798 Dec 30, 2022
GCRC: A Gaokao Chinese Reading Comprehension dataset for interpretable Evaluation

GCRC GCRC: A New Challenging MRC Dataset from Gaokao Chinese for Explainable Eva

Yunxiao Zhao 5 Nov 04, 2022
:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

(Framework for Adapting Representation Models) What is it? FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built u

deepset 1.6k Dec 27, 2022
Partially offline multi-language translator built upon Huggingface transformers.

Translate Command-line interface to translation pipelines, powered by Huggingface transformers. This tool can download translation models, and then us

Richard Jarry 8 Oct 25, 2022
Contains links to publicly available datasets for modeling health outcomes using speech and language.

speech-nlp-datasets Contains links to publicly available datasets for modeling various health outcomes using speech and language. Speech-based Corpora

Tuka Alhanai 77 Dec 07, 2022
API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend

gpt-j-api 🦜 An API to interact with the GPT-J language model. You can use and test the model in two different ways: Streamlit web app at http://api.v

Víctor Gallego 276 Dec 31, 2022
Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Machel Reid 82 Dec 19, 2022
Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks.

KB-NER: a Knowledge-based System for Multilingual Complex Named Entity Recognition The code is for the winner system (DAMO-NLP) of SemEval 2022 MultiC

116 Dec 27, 2022
基于GRU网络的句子判断程序/A program based on GRU network for judging sentences

SentencesJudger SentencesJudger 是一个基于GRU神经网络的句子判断程序,基本的功能是判断文章中的某一句话是否为一个优美的句子。 English 如何使用SentencesJudger 确认Python运行环境 安装pyTorch与LTP python3 -m pip

8 Mar 24, 2022
Multilingual finetuning of Machine Translation model on low-resource languages. Project for Deep Natural Language Processing course.

Low-resource-Machine-Translation This repository contains the code for the project relative to the course Deep Natural Language Processing. The goal o

Andrea Cavallo 3 Jun 22, 2022
Azure Text-to-speech service for Home Assistant

Azure Text-to-speech service for Home Assistant The Azure text-to-speech platform uses online Azure Text-to-Speech cognitive service to read a text wi

Yassine Selmi 2 Aug 06, 2022
Faster, modernized fork of the language identification tool langid.py

py3langid py3langid is a fork of the standalone language identification tool langid.py by Marco Lui. Original license: BSD-2-Clause. Fork license: BSD

Adrien Barbaresi 12 Nov 05, 2022
AI-Broad-casting - AI Broad casting with python

Basic Code 1. Use The Code Configuration Environment conda create -n code_base p

Constituency Tree Labeling Tool

Constituency Tree Labeling Tool The purpose of this package is to solve the constituency tree labeling problem. Look from the dataset labeled by NLTK,

张宇 6 Dec 20, 2022
XLNet: Generalized Autoregressive Pretraining for Language Understanding

Introduction XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective.

Zihang Dai 6k Jan 07, 2023
[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

Cambridge Language Technology Lab 61 Dec 10, 2022
A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models

wav2vec-toolkit A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models This repository accompanies the

Anton Lozhkov 29 Oct 23, 2022
Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Pulkit Kathuria 173 Jan 04, 2023