Nested Named Entity Recognition for Chinese Biomedical Text

Last update: Dec 25, 2022

Overview

CBio-NAMER

CBioNAMER (Nested nAMed Entity Recognition for Chinese Biomedical Text) is our method for the nested named entity recognition benchmark of CBLUE (Chinese Biomedical Language Understanding Evaluation). We won 2nd prize on the benchmark as of 2021/12/07. The single CBioNAMER model also reaches the top 20 in CBLUE. The score of CBioNAMER surpasses human performance (67.0 F1-score).

Result

Results of our method:

Results of our single model CBioNAMER:

Approach

CBioNAMER is one sub-model of the system behind our result. It is based on GlobalPointer, a powerful open-source model that we rewrote in PyTorch (thanks to its author), and MacBERT.

Usage

First, install PyTorch>=1.7.0. There's no restriction on GPU or CUDA.

Then, install this repo as a Python package:

$ pip install CBioNAMER

The Python package transformers==4.6.1 will be installed automatically as well.

API

The CBioNAMER package provides the following methods:

CBioNAMER.load_NER(model_save_path='./checkpoint/macbert-large_dict.pth', maxlen=512, c_size=9, id2c=_id2c, c2c=_c2c)

Returns the pretrained model, downloading it if necessary. The model uses the first CUDA device if one is available and falls back to the CPU otherwise.

The model_save_path argument specifies the path of the pretrained model weights.

The maxlen argument specifies the maximum length of input sentences. Sentences longer than maxlen are truncated.

The c_size argument specifies the number of entity classes. For CBLUE it is 9.

The id2c argument specifies the mapping between id and entity class. By default, the id2c argument for CBLUE is:

_id2c = {0: 'dis', 1: 'sym', 2: 'pro', 3: 'equ', 4: 'dru', 5: 'ite', 6: 'bod', 7: 'dep', 8: 'mic'}

The c2c argument specifies the mapping between entity class and its Chinese meaning. By default, the c2c argument for CBLUE is:

_c2c = {'dis': "疾病", 'sym': "临床表现", 'pro': "医疗程序", 'equ': "医疗设备", 'dru': "药物", 'ite': "医学检验项目", 'bod': "身体", 'dep': "科室", 'mic': "微生物类"}
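As a small illustration of how the two default mappings fit together, the sketch below chains id2c and c2c to go from a class id to its Chinese label. The helper name describe_class is hypothetical and not part of the package, and the check that c_size equals the number of entries in id2c is an assumption based on the description of c_size above.

```python
# Default mappings copied from this README.
id2c = {0: 'dis', 1: 'sym', 2: 'pro', 3: 'equ', 4: 'dru',
        5: 'ite', 6: 'bod', 7: 'dep', 8: 'mic'}
c2c = {'dis': "疾病", 'sym': "临床表现", 'pro': "医疗程序",
       'equ': "医疗设备", 'dru': "药物", 'ite': "医学检验项目",
       'bod': "身体", 'dep': "科室", 'mic': "微生物类"}


def describe_class(class_id):
    """Hypothetical helper: map a class id to (class code, Chinese label)."""
    code = id2c[class_id]
    return code, c2c[code]


# Assumption: c_size should equal the number of entity classes in id2c.
c_size = len(id2c)

# Every class code in id2c should have a Chinese label in c2c.
missing = [code for code in id2c.values() if code not in c2c]

print(c_size, describe_class(0), missing)
```

If you pass custom mappings for another dataset, a consistency check like this one is a cheap way to catch a class code that has no Chinese label.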

The model returned by CBioNAMER.load_NER() supports the following methods:

model.recognize(text: str, threshold=0)

Given a sentence, returns a list of dictionaries, one per recognized entity. Each dictionary has the format {'start_idx': entity's starting index, 'end_idx': entity's ending index, 'type': entity class, 'Chinese_type': Chinese meaning of entity class, 'entity': recognized entity}. The threshold argument filters the result: only entities whose confidence score is higher than threshold are returned.
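To give an intuition for what threshold does, here is a minimal sketch of GlobalPointer-style span decoding in plain Python. It is not the package's actual implementation, which this README does not show. The function decode_spans, the toy score values, and the two-class toy_id2c mapping are all made up for illustration. The idea is that each class has a score for every (start, end) pair, and every pair scoring above the threshold becomes an entity, which is why nested entities can be returned together.

```python
def decode_spans(scores, text, id2c, threshold=0.0):
    """Collect every span whose score exceeds the threshold.

    scores[c][i][j] is the (toy) score that text[i..j], inclusive,
    is an entity of class c.
    """
    entities = []
    for c, matrix in enumerate(scores):
        for i, row in enumerate(matrix):
            for j in range(i, len(row)):  # only spans with start <= end
                if row[j] > threshold:
                    entities.append({
                        'start_idx': i,
                        'end_idx': j,
                        'type': id2c[c],
                        'entity': text[i:j + 1],
                        'score': row[j],
                    })
    return entities


# Toy data, invented for this sketch.
text = "遗传病诊治"
toy_id2c = {0: 'dis', 1: 'pro'}
n = len(text)
scores = [[[-1.0] * n for _ in range(n)] for _ in toy_id2c]
scores[0][0][1] = 0.4   # weak nested candidate "遗传"
scores[0][0][2] = 3.2   # strong candidate "遗传病"
scores[1][3][4] = 1.5   # candidate "诊治"

loose = [e['entity'] for e in decode_spans(scores, text, toy_id2c, 0)]
strict = [e['entity'] for e in decode_spans(scores, text, toy_id2c, 1.0)]
print(loose)
print(strict)
```

Raising the threshold drops the weak nested candidate and keeps the two stronger spans, which mirrors the precision and recall trade-off you would expect from the threshold argument.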

model.predict_to_file(in_file: str, out_file: str)

Given the paths of an input and an output .json file, the model runs inference on in_file and saves the recognized entities to out_file. The output file can be submitted to CBLUE. The input file has the following format:

[
  {
    "text": "该技术的应用使某些遗传病的诊治水平得到显著提高。"
  },
    ...
  {
    "text": "There is a sentence."
  }
]
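If your sentences are not already in this format, a file like the one above can be produced with the standard json module. The sketch below assumes only the structure shown here (a list of objects with a "text" key). The file name my_input.json and the temporary directory are arbitrary choices for the example.

```python
import json
import os
import tempfile

sentences = [
    "该技术的应用使某些遗传病的诊治水平得到显著提高。",
    "There is a sentence.",
]

# One {"text": ...} object per sentence, as in the format shown above.
records = [{"text": s} for s in sentences]

# Arbitrary location for this example.
in_file = os.path.join(tempfile.mkdtemp(), "my_input.json")

# ensure_ascii=False keeps the Chinese characters readable in the file.
with open(in_file, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Read the file back to confirm it round-trips.
with open(in_file, encoding="utf-8") as f:
    loaded = json.load(f)

print(len(loaded), loaded[0]["text"])
```

The resulting path can then be passed as in_file to model.predict_to_file().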

Examples

import CBioNAMER

NER = CBioNAMER.load_NER()
in_file = './CMeEE_test.json'
out_file = './CMeEE_test_answer.json'
NER.predict_to_file(in_file, out_file)

import CBioNAMER

NER = CBioNAMER.load_NER()
text = "该技术的应用使某些遗传病的诊治水平得到显著提高。"
recognized_entity = NER.recognize(text)
print(recognized_entity)
# output:[{'start_idx': 9, 'end_idx': 11, 'type': 'dis', 'Chinese_type': '疾病', 'entity': '遗传病'}]
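The output above suggests that end_idx is inclusive: characters 9 through 11 of the sentence are "遗传病". The sketch below checks this against the documented output and groups entities by type. It uses a copy of the output shown above instead of calling the model, and the helper span_text is hypothetical.

```python
from collections import defaultdict

text = "该技术的应用使某些遗传病的诊治水平得到显著提高。"

# Copied from the example output above, so no model download is needed here.
recognized_entity = [
    {'start_idx': 9, 'end_idx': 11, 'type': 'dis',
     'Chinese_type': '疾病', 'entity': '遗传病'},
]


def span_text(text, entity):
    """Hypothetical helper: slice the text, treating end_idx as inclusive."""
    return text[entity['start_idx']:entity['end_idx'] + 1]


by_type = defaultdict(list)
for ent in recognized_entity:
    # The slice should reproduce the 'entity' field exactly.
    assert span_text(text, ent) == ent['entity']
    by_type[ent['type']].append(ent['entity'])

print(dict(by_type))
```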

Releases(v0.0.1)

v0.0.1(Nov 24, 2021)

Please use macbert-large_dict.pth as the pretrained model weights and load them with load_state_dict().

The source code attached to this release is out of date.
macbert-large_dict.pth(1246.43 MB)
