PyTorch impelementations of BERT-based Spelling Error Correction Models.

Overview

BertBasedCorrectionModels

基于BERT的文本纠错模型,使用PyTorch实现

数据准备

  1. http://nlp.ee.ncu.edu.tw/resource/csc.html下载SIGHAN数据集
  2. 解压上述数据集并将文件夹中所有 ''.sgml'' 文件复制至 datasets/csc/ 目录
  3. 复制 ''SIGHAN15_CSC_TestInput.txt'' 和 ''SIGHAN15_CSC_TestTruth.txt'' 至 datasets/csc/ 目录
  4. 下载 https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml 至 datasets/csc 目录
  5. 请确保以下文件在 datasets/csc 中
    train.sgml
    B1_training.sgml
    C1_training.sgml  
    SIGHAN15_CSC_A2_Training.sgml  
    SIGHAN15_CSC_B2_Training.sgml  
    SIGHAN15_CSC_TestInput.txt
    SIGHAN15_CSC_TestTruth.txt
    

环境准备

  1. 使用已有编码环境或通过 conda create -n python=3.7 创建一个新环境(推荐)
  2. 克隆本项目并进入项目根目录
  3. 安装所需依赖 pip install -r requirements.txt
  4. 如果出现报错 GLIBC 版本过低的问题(GLIBC 的版本更迭容易出事故,不推荐更新),openCC 改为安装较低版本(例如 1.1.0)
  5. 在当前终端将此目录加入环境变量 export PYTHONPATH=.

训练

运行以下命令以训练模型,首次运行会自动处理数据。

python tools/train_csc.py --config_file csc/train_SoftMaskedBert.yml

可选择不同配置文件以训练不同模型,目前支持以下配置文件:

  • train_bert4csc.yml
  • train_macbert4csc.yml
  • train_SoftMaskedBert.yml

如有其他需求,可根据需要自行调整配置文件中的参数。

实验结果

SoftMaskedBert

component sentence level acc p r f
Detection 0.5045 0.8252 0.8416 0.8333
Correction 0.8055 0.9395 0.8748 0.9060

Bert类

char level

MODEL p r f
BERT4CSC 0.9269 0.8651 0.8949
MACBERT4CSC 0.9380 0.8736 0.9047

sentence level

model acc p r f
BERT4CSC 0.7990 0.8482 0.7214 0.7797
MACBERT4CSC 0.8027 0.8525 0.7251 0.7836

推理

方法一,使用inference脚本:

python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --texts "我今天很高心"
# 或给出line by line格式的文本地址
python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --text_file /ml/data/text.txt

其中/ml/data/text.txt文本如下:

我今天很高心
你这个辣鸡模型只能做错别字纠正

方法二,直接调用

texts = ['今天我很高心', '测试', '继续测试']
model.predict(texts)

方法三、导出bert权重,使用transformers或pycorrector调用

  1. 使用convert_to_pure_state_dict.py导出bert权重
  2. 后续步骤参考https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/README.md

引用

如果你在研究中使用了本项目,请按如下格式引用:

@article{cai2020pre,
  title={BERT Based Correction Models},
  author={Cai, Heng and Chen, Dian},
  journal={GitHub. Note: https://github.com/gitabtion/BertBasedCorrectionModels},
  year={2020}
}

License

本源代码的授权协议为 Apache License 2.0,可免费用做商业用途。请在产品说明中附加本项目的链接和授权协议。本项目受版权法保护,侵权必究。

更新记录

20210618

  1. 修复数据处理的编码报错问题

20210518

  1. 将BERT4CSC检错任务改为使用FocalLoss
  2. 更新修改后的模型实验结果
  3. 降低数据处理时保留原文的概率

20210517

  1. 对BERT4CSC模型新增检错任务
  2. 新增基于LineByLine文件的inference

References

  1. Spelling Error Correction with Soft-Masked BERT
  2. http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
  3. https://github.com/wdimmy/Automatic-Corpus-Generation
  4. transformers
  5. https://github.com/sunnyqiny/Confusionset-guided-Pointer-Networks-for-Chinese-Spelling-Check
  6. SoftMaskedBert-PyTorch
  7. Deep-Learning-Project-Template
  8. https://github.com/lonePatient/TorchBlocks
  9. https://github.com/shibing624/pycorrector
Owner
Heng Cai
NLPer
Heng Cai
hashily is a Python module that provides a variety of text decoding and encoding operations.

hashily is a python module that performs a variety of text decoding and encoding functions. It also various functions for encrypting and decrypting text using various ciphers.

DevMysT 5 Jul 17, 2022
A Semi-Intelligent ChatBot filled with statistical and economical data for the Premier League.

MONEYBALL - ChatBot Module: 4006CEM, Class: B, Group: 5 Contributors: Jonas Djondo Roshan Kc Cole Samson Daniel Rodrigues Ihteshaam Naseer Kind remind

Jonas Djondo 1 Nov 18, 2021
Bu Chatbot, Konya Bilim Merkezi Yen için tasarlanmış olan bir projedir.

chatbot Bu Chatbot, Konya Bilim Merkezi Yeni Ufuklar Sergisi için 2021 Yılında tasarlanmış olan bir projedir. Chatbot Python ortamında yazılmıştır. Sö

Emre Özkul 1 Feb 23, 2022
Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech

epub2audiobook Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech Input examples qual a pasta do seu

7 Aug 25, 2022
Legal text retrieval for python

legal-text-retrieval Overview This system contains 2 steps: generate training data containing negative sample found by mixture score of cosine(tfidf)

Nguyễn Minh Phương 22 Dec 06, 2022
Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE)

VK.com 847 Dec 19, 2022
Installation, test and evaluation of Scribosermo speech-to-text engine

Scribosermo STT Setup Scribosermo is a LGPL licensed, open-source speech recognition engine to "Train fast Speech-to-Text networks in different langua

Florian Quirin 3 Jun 20, 2022
A simple visual front end to the Maya UE4 RBF plugin delivered with MetaHumans

poseWrangler Overview PoseWrangler is a simple UI to create and edit pose-driven relationships in Maya using the MayaUE4RBF plugin. This plugin is dis

Christopher Evans 105 Dec 18, 2022
Paddle2.x version AI-Writer

Paddle2.x 版本AI-Writer 用魔改 GPT 生成网文。Tuned GPT for novel generation.

yujun 74 Jan 04, 2023
This repository contains examples of Task-Informed Meta-Learning

Task-Informed Meta-Learning This repository contains examples of Task-Informed Meta-Learning (paper). We consider two tasks: Crop Type Classification

10 Dec 19, 2022
뉴스 도메인 질의응답 시스템 (21-1학기 졸업 프로젝트)

뉴스 도메인 질의응답 시스템 본 프로젝트는 뉴스기사에 대한 질의응답 서비스 를 제공하기 위해서 진행한 프로젝트입니다. 약 3개월간 ( 21. 03 ~ 21. 05 ) 진행하였으며 Transformer 아키텍쳐 기반의 Encoder를 사용하여 한국어 질의응답 데이터셋으로

TaegyeongEo 4 Jul 08, 2022
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language mod

20.5k Jan 08, 2023
DeLighT: Very Deep and Light-Weight Transformers

DeLighT: Very Deep and Light-weight Transformers This repository contains the source code of our work on building efficient sequence models: DeFINE (I

Sachin Mehta 440 Dec 18, 2022
自然言語で書かれた時間情報表現を抽出/規格化するルールベースの解析器

ja-timex 自然言語で書かれた時間情報表現を抽出/規格化するルールベースの解析器 概要 ja-timex は、現代日本語で書かれた自然文に含まれる時間情報表現を抽出しTIMEX3と呼ばれるアノテーション仕様に変換することで、プログラムが利用できるような形に規格化するルールベースの解析器です。

Yuki Okuda 116 Nov 09, 2022
PyTorch code for EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers".

LXMERT: Learning Cross-Modality Encoder Representations from Transformers Our servers break again :(. I have updated the links so that they should wor

Hao Tan 838 Dec 19, 2022
Beautiful visualizations of how language differs among document types.

Scattertext 0.1.0.0 A tool for finding distinguishing terms in corpora and displaying them in an interactive HTML scatter plot. Points corresponding t

Jason S. Kessler 2k Dec 27, 2022
Write Python in Urdu - اردو میں کوڈ لکھیں

UrduPython Write simple Python in Urdu. How to Use Write Urdu code in سامپل۔پے The mappings are as following: "۔": ".", "،":

Saad A. Bazaz 26 Nov 27, 2022
open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

中文开放信息抽取系统, open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

7 Nov 02, 2022
Generate a cool README/About me page for your Github Profile

Github Profile README/ About Me Generator 💯 This webapp lets you build a cool README for your profile. A few inputs + ~15 mins = Your Github Profile

Rahul Banerjee 179 Jan 07, 2023