[EMNLP 2021] LM-Critic: Language Models for Unsupervised Grammatical Error Correction

Overview

LM-Critic: Language Models for Unsupervised Grammatical Error Correction

This repo provides the source code & data of our paper: LM-Critic: Language Models for Unsupervised Grammatical Error Correction (EMNLP 2021).

@InProceedings{yasunaga2021language,
  author =  {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title =   {LM-Critic: Language Models for Unsupervised Grammatical Error Correction},
  year =    {2021},  
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},  
}

Overview

We developed a new method to use a pretrained language model (e.g. GPT2) to predict if a sentence is grammatical, which we call LM-Critic. You can play with this LM-Critic as described in Section 1. below. The idea is to deem a sentence to be grammatical if the language model assigns it a higher probability than candidates in its local neighborhood.

We then use the LM-Critic to generate training data for grammatical error correction (GEC) from unlabeled raw text, using the BIFI algorithm. This allows us to train GEC models in an unsupervised way. See Section 2. below.

How LM-Critic works

LM-Critic for GEC: We use LM-Critic to learn GEC models

0. Dependencies

Run the following commands to create a conda environment (assuming CUDA10.1):

conda create -n lm-critic python=3.8
conda activate lm-critic
pip install torch==1.6.0 torchvision==0.7.0
pip install transformers==4.3.3 datasets==1.3.0 absl-py rouge-score
pip install nltk wandb editdistance spacy==3.0.5
python3 -m nltk.downloader punkt

To use the ERRANT scorer for GEC evaluation, create another conda environment separately, as follows:

conda create -n errant200 python=3.6
conda activate errant200
pip3 install errant==2.0.0
python3 -m spacy download en

1. Use LM-Critic

The LM-Critic is defined in critic/critic.py. To play with it, you can run:

CUDA_VISIBLE_DEVICES=0 python3 critic/critic.py

This will prompt you for a sentence input, and returns the judgment (Good: grammatical, Bad: ungrammatical) along with the probability score of the input sentence. For example,

Enter a sentence: I like apple.
Bad! Your sentence log(p) = -22.333
Neighbor sentence with highest log(p): I like apples. (= -19.570)

Enter a sentence: I like apples.
Good! Your sentence log(p) = -19.570

To run intrinsic evaluation of LM-Critic on a test suite, run:

CUDA_VISIBLE_DEVICES=0 python3 eval_critic/eval_critic.py

You can import the LM-Critic function (from critic.critic import gpt2_critic) for your own code as done in this script.

2. Train/run grammatical error correction models

Change the working directory to gec/. First, download all the data (GEC benchmarks and training data) by running ./download_data.sh.

Round 0

Here we train an initial fixer on synthetic GEC data. Run the commands in src/run-round0.sh.

  • This corresponds to the "Transformer" baseline in the paper Table 4.
  • The original synthetic data was dowloaded from here, and our processed data is available at data/round0__synthetic/synthetic_paired_data_9M.json

Round 1

Here we use the BIFI algorithm and unlabeled text data to train an improved fixer. Run the commands in src/run-round1.sh.

  • Specifically, we perform the following four steps: (a) apply the current fixer (from Round 0) to unlabeled sentences and keep outputs that LM-Critic judges as good; (b) train a breaker on the paired data generated in Step (a); (c) apply the trained breaker on unlabeled sentences and keep outputs that LM-Critic judges as bad; (d) train the fixer on the paired data generated so far (Step (a) + Step (c) + synthetic data from Round0).
  • This corresponds to the "+ BIFI" in the paper Table 4.
  • The original unlabeled text data was downloaded from Yahoo! Answer dataset and Wikipedia revision dataset (we take sentences pre revision). Our processed paired data used in Step (d) is available at data/round1__BIFI/BIFI_paired_data_9M.json

For evaluation, we use ERRANT and M^2Scorer. ERRANT is set up in the conda environment described above (errant200) and M^2Scorer is set up in the download script.

Owner
Michihiro Yasunaga
PhD Student in Computer Science
Michihiro Yasunaga
To be a next-generation DL-based phenotype prediction from genome mutations.

Sequence -----------+-- 3D_structure -- 3D_module --+ +-- ? | |

Eric Alcaide 18 Jan 11, 2022
Using context-free grammar formalism to parse English sentences to determine their structure to help computer to better understand the meaning of the sentence.

Sentance Parser Executing the Program Make sure Python 3.6+ is installed. Install requirements $ pip install requirements.txt Run the program:

Vaibhaw 12 Sep 28, 2022
NLP-SentimentAnalysis - Coursera Course ( Duration : 5 weeks ) offered by DeepLearning.AI

Coursera Natural Language Processing Specialization This repository contains material related to Coursera Natural Language Processing Specialization.

Nishant Sharma 1 Jun 05, 2022
Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

FCS-applications Source code for CsiNet and CRNet using the Fully Connected Layer-Shared feedback architecture. Introduction This repository contains

Boyuan Zhang 4 Oct 07, 2022
Google AI 2018 BERT pytorch implementation

BERT-pytorch Pytorch implementation of Google AI's 2018 BERT, with simple annotation BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers f

Junseong Kim 5.3k Jan 07, 2023
Neural-Machine-Translation - Implementation of revolutionary machine translation models

Neural Machine Translation Framework: PyTorch Repository contaning my implementa

Utkarsh Jain 1 Feb 17, 2022
Vad-sli-asr - A Python scripts for a speech processing pipeline with Voice Activity Detection (VAD)

VAD-SLI-ASR Python scripts for a speech processing pipeline with Voice Activity

Dynamics of Language 14 Dec 09, 2022
The ability of computer software to identify words and phrases in spoken language and convert them to human-readable text

speech-recognition-py Speech recognition is the ability of computer software to identify words and phrases in spoken language and convert them to huma

Deepangshi 1 Apr 03, 2022
189 Jan 02, 2023
Nystromformer: A Nystrom-based Algorithm for Approximating Self-Attention

Nystromformer: A Nystrom-based Algorithm for Approximating Self-Attention April 6, 2021 We extended segment-means to compute landmarks without requiri

Zhanpeng Zeng 322 Jan 01, 2023
Code for the Findings of NAACL 2022(Long Paper): AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks

AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks arXiv link: upcoming To be published in Findings of NA

Allen 16 Nov 12, 2022
Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Machel Reid 82 Dec 19, 2022
Convolutional Neural Networks for Sentence Classification

Convolutional Neural Networks for Sentence Classification Code for the paper Convolutional Neural Networks for Sentence Classification (EMNLP 2014). R

Yoon Kim 2k Jan 02, 2023
PyJPBoatRace: Python-based Japanese boatrace tools 🚤

pyjpboatrace :speedboat: provides you with useful tools for data analysis and auto-betting for boatrace.

5 Oct 29, 2022
Code for "Finetuning Pretrained Transformers into Variational Autoencoders"

transformers-into-vaes Code for Finetuning Pretrained Transformers into Variational Autoencoders (our submission to NLP Insights Workshop 2021). Gathe

Seongmin Park 22 Nov 26, 2022
The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models

Graformer The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models Graformer (also named BridgeTransformer in t

22 Dec 14, 2022
中文医疗信息处理基准CBLUE: A Chinese Biomedical LanguageUnderstanding Evaluation Benchmark

English | 中文说明 CBLUE AI (Artificial Intelligence) is playing an indispensabe role in the biomedical field, helping improve medical technology. For fur

452 Dec 30, 2022
edge-SR: Super-Resolution For The Masses

edge-SR: Super Resolution For The Masses Citation Pablo Navarrete Michelini, Yunhua Lu and Xingqun Jiang. "edge-SR: Super-Resolution For The Masses",

Pablo 40 Nov 10, 2022
A benchmark for evaluation and comparison of various NLP tasks in Persian language.

Persian NLP Benchmark The repository aims to track existing natural language processing models and evaluate their performance on well-known datasets.

Mofid AI 68 Dec 19, 2022