Chinese named entity recognition (bert/roberta/macbert/bert_wwm with Keras)

Overview

Chinese named entity recognition with Keras, supporting BERT, RoBERTa, MacBERT, and BERT-wwm pretrained weights.

Project Structure

./
├── DataProcess
│   ├── __pycache__
│   ├── convert2bio.py
│   ├── convert_jsonl.py
│   ├── handle_numbers.py
│   ├── load_data.py
│   └── statistic.py
├── README.md
├── __pycache__
├── chinese_L-12_H-768_A-12                                    BERT weights
│   ├── bert_config.json
│   ├── bert_model.ckpt.data-00000-of-00001
│   ├── bert_model.ckpt.index
│   ├── bert_model.ckpt.meta
│   └── vocab.txt
├── chinese_bert_wwm                                           BERT-wwm weights
│   ├── bert_config.json
│   ├── bert_model.ckpt.data-00000-of-00001
│   ├── bert_model.ckpt.index
│   ├── bert_model.ckpt.meta
│   └── vocab.txt
├── chinese_macbert_base                                       MacBERT weights
│   ├── chinese_macbert_base.ckpt.data-00000-of-00001
│   ├── chinese_macbert_base.ckpt.index
│   ├── chinese_macbert_base.ckpt.meta
│   ├── macbert_base_config.json
│   └── vocab.txt
├── chinese_roberta_wwm_ext_L-12_H-768_A-12                    RoBERTa weights
│   ├── bert_config.json
│   ├── bert_model.ckpt.data-00000-of-00001
│   ├── bert_model.ckpt.index
│   ├── bert_model.ckpt.meta
│   └── vocab.txt
├── config                                                     
│   ├── __pycache__
│   ├── config.py                                              configuration file
│   └── pulmonary_label2id.json                                label-to-id mapping
├── data                                                       dataset
│   ├── pulmonary.test
│   ├── pulmonary.train
│   └── sict_train.txt
├── environment.yaml                                           conda environment file
├── evaluate.py
├── generator_train.py
├── keras_bert                                                 keras_bert (also installable via pip)
├── keras_contrib                                              keras_contrib (also installable via pip)
├── log                                                        nohup training logs
│   ├── chinese_L-12_H-768_A-12.out
│   ├── chinese_macbert_base.out
│   ├── chinese_roberta_wwm_ext_L-12_H-768_A-12.out
│   └── electra_180g_base.out
├── model.py                                                   model definition
├── models                                                     saved model weights
│   ├── pulmonary_chinese_L-12_H-768_A-12_ner.h5
│   ├── pulmonary_chinese_bert_wwm_ner.h5
│   ├── pulmonary_chinese_macbert_base_ner.h5
│   └── pulmonary_chinese_roberta_wwm_ext_L-12_H-768_A-12_ner.h5
├── predict.py                                                 prediction script
├── report                                                     entity-level F1 evaluation reports
│   ├── pulmonary_chinese_L-12_H-768_A-12_evaluate.txt
│   ├── pulmonary_chinese_L-12_H-768_A-12_predict.json
│   ├── pulmonary_chinese_bert_wwm_evaluate.txt
│   ├── pulmonary_chinese_bert_wwm_predict.json
│   ├── pulmonary_chinese_macbert_base_evaluate.txt
│   ├── pulmonary_chinese_macbert_base_predict.json
│   ├── pulmonary_chinese_roberta_wwm_ext_L-12_H-768_A-12_evaluate.txt
│   └── pulmonary_chinese_roberta_wwm_ext_L-12_H-768_A-12_predict.json
├── requirements.txt                                           pip requirements
├── test.py                                                    
├── train.py                                                   training script
└── utils                                                      
    ├── FGM.py                                                 FGM adversarial training
    ├── __pycache__
    └── path.py                                                all paths

56 directories, 193 files

Dataset

Pulmonary nodule dataset from Grade-A tertiary hospitals, 20,000+ characters, in BIO format, e.g.:

中	B-ORG
共	I-ORG
中	I-ORG
央	I-ORG
致	O
中	B-ORG
国	I-ORG
致	I-ORG
公	I-ORG
党	I-ORG
十	I-ORG
一	I-ORG
大	I-ORG
的	O
贺	O
词	O

ATTENTION: when preparing your own dataset, note that:

  • each character and its label are separated by a single space
  • sentences are separated by a blank line
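
For reference, here is a minimal loader for this format (a sketch only; the project's real loader is DataProcess/load_data.py, and the function name here is hypothetical):

    def load_bio(path):
        """Parse a BIO file into (chars, labels) sentence pairs."""
        sentences, chars, labels = [], [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:                   # a blank line ends a sentence
                    if chars:
                        sentences.append((chars, labels))
                        chars, labels = [], []
                    continue
                char, label = line.split()     # character and tag, space-separated
                chars.append(char)
                labels.append(label)
        if chars:                              # file may end without a blank line
            sentences.append((chars, labels))
        return sentences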

Steps

  1. Replace the dataset with your own
  2. Run DataProcess/load_data.py to generate the label2id file (pulmonary_label2id.json under config/ in this repo; see the sketch after this list)
  3. Set MAX_SEQ_LEN in config/config.py (longer sentences are truncated, shorter ones padded; ideally use the length of the longest sentence in the training and test sets)
  4. Download the pretrained weights and place them in the project
  5. Update the paths in utils/path.py
  6. Adjust the model structure in model.py as needed
  7. Adjust the parameters in config/config.py
  8. Before training, debug to check that input_train_labels and result_train are correct and that input_train_types is all zeros
  9. Train
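
A hedged sketch of step 2, deriving the label-to-id mapping from the training file and writing it as JSON (the real logic lives in DataProcess/load_data.py; the function name and paths here are assumptions based on the repo layout above):

    import json

    def build_label2id(bio_path, out_path):
        labels = set()
        with open(bio_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:                                  # skip sentence separators
                    labels.add(line.split()[-1])
        label2id = {label: i for i, label in enumerate(sorted(labels))}
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(label2id, f, ensure_ascii=False, indent=2)
        return label2id

    # e.g. build_label2id("data/pulmonary.train", "config/pulmonary_label2id.json")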

Model

BERT

roberta

macBERT

BERT_wwm
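
All four variants share the same downstream pipeline and differ only in the pretrained checkpoint they load. Below is a minimal sketch in the spirit of model.py, assuming a BERT encoder with a CRF tagging head built from the bundled keras_bert and keras_contrib; the paths, MAX_SEQ_LEN, and num_labels values are placeholders, and the exact head in model.py may differ:

    from keras.layers import Dense, TimeDistributed
    from keras.models import Model
    from keras_bert import load_trained_model_from_checkpoint
    from keras_contrib.layers import CRF

    config_path = "chinese_L-12_H-768_A-12/bert_config.json"
    ckpt_path = "chinese_L-12_H-768_A-12/bert_model.ckpt"
    MAX_SEQ_LEN = 128      # see config/config.py
    num_labels = 29        # len(label2id); dataset-dependent

    bert = load_trained_model_from_checkpoint(
        config_path, ckpt_path, seq_len=MAX_SEQ_LEN, trainable=True)

    x = TimeDistributed(Dense(num_labels))(bert.output)  # per-token label scores
    crf = CRF(num_labels, sparse_target=True)            # CRF decodes tag sequences
    output = crf(x)

    model = Model(bert.inputs, output)
    model.compile(optimizer="adam", loss=crf.loss_function, metrics=[crf.accuracy])

Swapping the checkpoint directory (e.g. to chinese_roberta_wwm_ext_L-12_H-768_A-12) is enough to switch variants, since all four release BERT-compatible weight files.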

Train

Run train.py (the log/ directory holds nohup output from past runs, e.g. log/chinese_L-12_H-768_A-12.out).

Evaluate

Run evaluate.py
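
The reports below are entity-level precision/recall/F1 in the format of a seqeval-style classification report. A minimal sketch of producing such a report (whether evaluate.py actually uses seqeval is an assumption):

    from seqeval.metrics import classification_report

    y_true = [["B-ORG", "I-ORG", "O"], ["B-ORG", "O"]]
    y_pred = [["B-ORG", "I-ORG", "O"], ["O", "O"]]
    print(classification_report(y_true, y_pred, digits=4))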

BERT

           precision    recall  f1-score   support

     SIGN     0.6651    0.7354    0.6985       189
  ANATOMY     0.8333    0.8409    0.8371       220
 DIAMETER     1.0000    1.0000    1.0000        16
  DISEASE     0.4915    0.6744    0.5686        43
 QUANTITY     0.8837    0.9157    0.8994        83
TREATMENT     0.3571    0.5556    0.4348         9
  DENSITY     1.0000    1.0000    1.0000         8
    ORGAN     0.4500    0.6923    0.5455        13
LUNGFIELD     1.0000    0.5000    0.6667         6
    SHAPE     0.5714    0.5714    0.5714         7
   NATURE     1.0000    1.0000    1.0000         6
 BOUNDARY     1.0000    0.6250    0.7692         8
   MARGIN     0.8333    0.8333    0.8333         6
  TEXTURE     1.0000    0.8571    0.9231         7

micro avg     0.7436    0.7987    0.7702       621
macro avg     0.7610    0.7987    0.7760       621

roberta

           precision    recall  f1-score   support

  ANATOMY     0.8624    0.8545    0.8584       220
  DENSITY     0.8000    1.0000    0.8889         8
     SIGN     0.7347    0.7619    0.7481       189
 QUANTITY     0.8977    0.9518    0.9240        83
  DISEASE     0.5690    0.7674    0.6535        43
 DIAMETER     1.0000    1.0000    1.0000        16
TREATMENT     0.3333    0.5556    0.4167         9
 BOUNDARY     1.0000    0.6250    0.7692         8
LUNGFIELD     1.0000    0.6667    0.8000         6
   MARGIN     0.8333    0.8333    0.8333         6
  TEXTURE     1.0000    0.8571    0.9231         7
    SHAPE     0.5714    0.5714    0.5714         7
   NATURE     1.0000    1.0000    1.0000         6
    ORGAN     0.6250    0.7692    0.6897        13

micro avg     0.7880    0.8261    0.8066       621
macro avg     0.8005    0.8261    0.8104       621

macBERT

           precision    recall  f1-score   support

  ANATOMY     0.8773    0.8773    0.8773       220
     SIGN     0.6538    0.7196    0.6851       189
  DISEASE     0.5893    0.7674    0.6667        43
 QUANTITY     0.9070    0.9398    0.9231        83
    ORGAN     0.5882    0.7692    0.6667        13
  TEXTURE     1.0000    0.8571    0.9231         7
 DIAMETER     1.0000    1.0000    1.0000        16
TREATMENT     0.3750    0.6667    0.4800         9
LUNGFIELD     1.0000    0.5000    0.6667         6
    SHAPE     0.4286    0.4286    0.4286         7
   NATURE     1.0000    1.0000    1.0000         6
  DENSITY     1.0000    1.0000    1.0000         8
 BOUNDARY     1.0000    0.6250    0.7692         8
   MARGIN     0.8333    0.8333    0.8333         6

micro avg     0.7697    0.8180    0.7931       621
macro avg     0.7846    0.8180    0.7977       621

BERT_wwm

           precision    recall  f1-score   support

  DISEASE     0.5667    0.7907    0.6602        43
  ANATOMY     0.8676    0.8636    0.8656       220
 QUANTITY     0.8966    0.9398    0.9176        83
     SIGN     0.7358    0.7513    0.7435       189
LUNGFIELD     1.0000    0.6667    0.8000         6
TREATMENT     0.3571    0.5556    0.4348         9
 DIAMETER     0.9375    0.9375    0.9375        16
 BOUNDARY     1.0000    0.6250    0.7692         8
  TEXTURE     1.0000    0.8571    0.9231         7
   MARGIN     0.8333    0.8333    0.8333         6
    ORGAN     0.5882    0.7692    0.6667        13
  DENSITY     1.0000    1.0000    1.0000         8
   NATURE     1.0000    1.0000    1.0000         6
    SHAPE     0.5000    0.5714    0.5333         7

micro avg     0.7889    0.8245    0.8063       621
macro avg     0.8020    0.8245    0.8104       621

Predict

Run predict.py
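
A minimal prediction sketch in the same spirit, reusing the `model` built in the Model sketch above; the tokenization details, input sentence, and alignment are assumptions:

    import json
    import numpy as np
    from keras_bert import Tokenizer, load_vocabulary

    MAX_SEQ_LEN = 128  # must match training

    tokenizer = Tokenizer(load_vocabulary("chinese_L-12_H-768_A-12/vocab.txt"))

    with open("config/pulmonary_label2id.json", encoding="utf-8") as f:
        label2id = json.load(f)
    id2label = {v: k for k, v in label2id.items()}

    # Rebuild `model` as in the Model sketch, then load the saved weights:
    model.load_weights("models/pulmonary_chinese_L-12_H-768_A-12_ner.h5")

    text = "左肺上叶见磨玻璃结节"  # hypothetical input sentence
    token_ids, segment_ids = tokenizer.encode(text, max_len=MAX_SEQ_LEN)
    probs = model.predict([np.array([token_ids]), np.array([segment_ids])])[0]
    tags = [id2label[i] for i in probs.argmax(axis=-1)]

    # Drop the [CLS]/[SEP] positions and align tags to characters.
    print(list(zip(text, tags[1:len(text) + 1])))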
