[EMNLP 2021] LM-Critic: Language Models for Unsupervised Grammatical Error Correction

Overview

LM-Critic: Language Models for Unsupervised Grammatical Error Correction

This repo provides the source code & data of our paper: LM-Critic: Language Models for Unsupervised Grammatical Error Correction (EMNLP 2021).

@InProceedings{yasunaga2021language,
  author =  {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title =   {LM-Critic: Language Models for Unsupervised Grammatical Error Correction},
  year =    {2021},  
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},  
}

Overview

We developed a new method to use a pretrained language model (e.g. GPT2) to predict if a sentence is grammatical, which we call LM-Critic. You can play with this LM-Critic as described in Section 1. below. The idea is to deem a sentence to be grammatical if the language model assigns it a higher probability than candidates in its local neighborhood.

We then use the LM-Critic to generate training data for grammatical error correction (GEC) from unlabeled raw text, using the BIFI algorithm. This allows us to train GEC models in an unsupervised way. See Section 2. below.

How LM-Critic works

LM-Critic for GEC: We use LM-Critic to learn GEC models

0. Dependencies

Run the following commands to create a conda environment (assuming CUDA10.1):

conda create -n lm-critic python=3.8
conda activate lm-critic
pip install torch==1.6.0 torchvision==0.7.0
pip install transformers==4.3.3 datasets==1.3.0 absl-py rouge-score
pip install nltk wandb editdistance spacy==3.0.5
python3 -m nltk.downloader punkt

To use the ERRANT scorer for GEC evaluation, create another conda environment separately, as follows:

conda create -n errant200 python=3.6
conda activate errant200
pip3 install errant==2.0.0
python3 -m spacy download en

1. Use LM-Critic

The LM-Critic is defined in critic/critic.py. To play with it, you can run:

CUDA_VISIBLE_DEVICES=0 python3 critic/critic.py

This will prompt you for a sentence input, and returns the judgment (Good: grammatical, Bad: ungrammatical) along with the probability score of the input sentence. For example,

Enter a sentence: I like apple.
Bad! Your sentence log(p) = -22.333
Neighbor sentence with highest log(p): I like apples. (= -19.570)

Enter a sentence: I like apples.
Good! Your sentence log(p) = -19.570

To run intrinsic evaluation of LM-Critic on a test suite, run:

CUDA_VISIBLE_DEVICES=0 python3 eval_critic/eval_critic.py

You can import the LM-Critic function (from critic.critic import gpt2_critic) for your own code as done in this script.

2. Train/run grammatical error correction models

Change the working directory to gec/. First, download all the data (GEC benchmarks and training data) by running ./download_data.sh.

Round 0

Here we train an initial fixer on synthetic GEC data. Run the commands in src/run-round0.sh.

  • This corresponds to the "Transformer" baseline in the paper Table 4.
  • The original synthetic data was dowloaded from here, and our processed data is available at data/round0__synthetic/synthetic_paired_data_9M.json

Round 1

Here we use the BIFI algorithm and unlabeled text data to train an improved fixer. Run the commands in src/run-round1.sh.

  • Specifically, we perform the following four steps: (a) apply the current fixer (from Round 0) to unlabeled sentences and keep outputs that LM-Critic judges as good; (b) train a breaker on the paired data generated in Step (a); (c) apply the trained breaker on unlabeled sentences and keep outputs that LM-Critic judges as bad; (d) train the fixer on the paired data generated so far (Step (a) + Step (c) + synthetic data from Round0).
  • This corresponds to the "+ BIFI" in the paper Table 4.
  • The original unlabeled text data was downloaded from Yahoo! Answer dataset and Wikipedia revision dataset (we take sentences pre revision). Our processed paired data used in Step (d) is available at data/round1__BIFI/BIFI_paired_data_9M.json

For evaluation, we use ERRANT and M^2Scorer. ERRANT is set up in the conda environment described above (errant200) and M^2Scorer is set up in the download script.

Owner
Michihiro Yasunaga
PhD Student in Computer Science
Michihiro Yasunaga
Resources for "Natural Language Processing" Coursera course.

Natural Language Processing course resources This github contains practical assignments for Natural Language Processing course by Higher School of Eco

Advanced Machine Learning specialisation by HSE 1.1k Jan 01, 2023
Crowd sourced training data for Rasa NLU models

NLU Training Data Crowd-sourced training data for the development and testing of Rasa NLU models. If you're interested in grabbing some data feel free

Rasa 169 Dec 26, 2022
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.9k Dec 31, 2022
An end to end ASR Transformer model training repo

END TO END ASR TRANSFORMER 本项目基于transformer 6*encoder+6*decoder的基本结构构造的端到端的语音识别系统 Model Instructions 1.数据准备: 自行下载数据,遵循文件结构如下: ├── data │ ├── train │

旷视天元 MegEngine 10 Jul 19, 2022
Meta learning algorithms to train cross-lingual NLI (multi-task) models

Meta learning algorithms to train cross-lingual NLI (multi-task) models

M.Hassan Mojab 4 Nov 20, 2022
Code and dataset for the EMNLP 2021 Finding paper "Can NLI Models Verify QA Systems’ Predictions?"

Code and dataset for the EMNLP 2021 Finding paper "Can NLI Models Verify QA Systems’ Predictions?"

Jifan Chen 22 Oct 21, 2022
SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples

SNCSE SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples This is the repository for SNCSE. SNCSE aims to allev

Sense-GVT 59 Jan 02, 2023
A simple implementation of N-gram language model.

About A simple implementation of N-gram language model. Requirements numpy Data preparation Corpus Training data for the N-gram model, a text file lik

4 Nov 24, 2021
Coreference resolution for English, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, msg systems ag 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 German 1.2.3 Polish 1

msg systems ag 169 Dec 21, 2022
Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks.

KB-NER: a Knowledge-based System for Multilingual Complex Named Entity Recognition The code is for the winner system (DAMO-NLP) of SemEval 2022 MultiC

116 Dec 27, 2022
Chinese named entity recognization (bert/roberta/macbert/bert_wwm with Keras)

Chinese named entity recognization (bert/roberta/macbert/bert_wwm with Keras)

2 Jul 05, 2022
Finally, some decent sample sentences

tts-dataset-prompts This repository aims to be a decent set of sentences for people looking to clone their own voices (e.g. using Tacotron 2). Each se

hecko 19 Dec 13, 2022
Textpipe: clean and extract metadata from text

textpipe: clean and extract metadata from text textpipe is a Python package for converting raw text in to clean, readable text and extracting metadata

Textpipe 298 Nov 21, 2022
Almost State-of-the-art Text Generation library

Ps: we are adding transformer model soon Text Gen 🐐 Almost State-of-the-art Text Generation library Text gen is a python library that allow you build

Emeka boris ama 63 Jun 24, 2022
Few-shot Natural Language Generation for Task-Oriented Dialog

Few-shot Natural Language Generation for Task-Oriented Dialog This repository contains the dataset, source code and trained model for the following pa

172 Dec 13, 2022
CJK computer science terms comparison / 中日韓電腦科學術語對照 / 日中韓のコンピュータ科学の用語対照 / 한·중·일 전산학 용어 대조

CJK computer science terms comparison This repository contains the source code of the website. You can see the website from the following link: Englis

Hong Minhee (洪 民憙) 88 Dec 23, 2022
A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

GuwenModels: 古文自然语言处理模型合集, 收录互联网上的古文相关模型及资源. A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

Ethan 66 Dec 26, 2022
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

What is MUSE? MUSE stands for Multilingual Universal Sentence Encoder - multilingual extension (16 languages) of Universal Sentence Encoder (USE). MUS

Dani El-Ayyass 47 Sep 05, 2022
Trained T5 and T5-large model for creating keywords from text

text to keywords Trained T5-base and T5-large model for creating keywords from text. Supported languages: ru Pretraining Large version | Pretraining B

Danil 61 Nov 24, 2022
This repository details the steps in creating a Part of Speech tagger using Trigram Hidden Markov Models and the Viterbi Algorithm without using external libraries.

POS-Tagger This repository details the creation of a Part-of-Speech tagger using Trigram Hidden Markov Models to predict word tags in a word sequence.

Raihan Ahmed 1 Dec 09, 2021