[NeurIPS 2021] COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Last update: Dec 12, 2022

Overview

COCO-LM

This repository contains the scripts for fine-tuning COCO-LM pretrained models on GLUE and SQuAD 2.0 benchmarks.

Paper: COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Overview

We provide the scripts in two versions, based on two widely used open-source codebases: the Fairseq library and the Huggingface Transformers library. The two versions are mostly equivalent in functionality, and you are free to use either. Note, however, that the fairseq version is the one we used in our experiments and will best reproduce the results in the paper; the huggingface version was implemented later for compatibility with the Huggingface Transformers library and may yield slightly different results.

Please follow the README files under the two directories for running the code.

GLUE Fine-Tuning Results

The General Language Understanding Evaluation (GLUE) benchmark is a collection of sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems.

GLUE dev set results of COCO-LM base++ and large++ models are as follows (median of 5 different random seeds):

Model	MNLI-m/mm	QQP	QNLI	SST-2	CoLA	RTE	MRPC	STS-B	AVG
COCO-LM base++	90.2/90.0	92.2	94.2	94.6	67.3	87.4	91.2	91.8	88.6
COCO-LM large++	91.4/91.6	92.8	95.7	96.9	73.9	91.0	92.2	92.7	90.8
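The "median of 5 different random seeds" aggregation can be sketched as follows. This is only an illustration: the per-seed scores are hypothetical values, not results from the paper, and `median_over_seeds` is a helper name introduced here rather than something from the repository.

```python
from statistics import median

# Hypothetical dev scores for one task from five fine-tuning runs,
# keyed by random seed. Illustrative values only, not reported results.
scores_by_seed = {1: 80.1, 2: 79.8, 3: 80.4, 4: 80.0, 5: 80.3}


def median_over_seeds(scores):
    """Return the median score across seeds, rounded to one decimal place."""
    return round(median(scores.values()), 1)


print(median_over_seeds(scores_by_seed))
```

Reporting the median rather than the best run keeps a single lucky or unlucky seed from dominating the reported number.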

GLUE test set results of COCO-LM base++ and large++ models are as follows (without ensembling, task-specific tricks, etc.):

Model	MNLI-m/mm	QQP	QNLI	SST-2	CoLA	RTE	MRPC	STS-B	AVG
COCO-LM base++	89.8/89.3	89.8	94.2	95.6	68.6	82.3	88.5	90.3	87.4
COCO-LM large++	91.6/91.1	90.5	95.8	96.7	70.5	89.2	88.4	91.8	89.3
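As a sanity check, the AVG column can be recomputed from the per-task scores in the dev and test tables. The sketch below assumes that the two MNLI scores are first averaged into a single score, so that AVG is a mean over eight tasks; this convention is an assumption, and using MNLI-m alone gives the same rounded averages for these four rows. The name `glue_avg` is introduced here for illustration.

```python
# Scores copied from the tables above:
# (MNLI-m, MNLI-mm), then QQP, QNLI, SST-2, CoLA, RTE, MRPC, STS-B.
DEV = {
    "COCO-LM base++": ((90.2, 90.0), [92.2, 94.2, 94.6, 67.3, 87.4, 91.2, 91.8]),
    "COCO-LM large++": ((91.4, 91.6), [92.8, 95.7, 96.9, 73.9, 91.0, 92.2, 92.7]),
}
TEST = {
    "COCO-LM base++": ((89.8, 89.3), [89.8, 94.2, 95.6, 68.6, 82.3, 88.5, 90.3]),
    "COCO-LM large++": ((91.6, 91.1), [90.5, 95.8, 96.7, 70.5, 89.2, 88.4, 91.8]),
}


def glue_avg(mnli, others):
    """Average over eight tasks, rounded to one decimal place.

    Assumption: MNLI-m and MNLI-mm are averaged into one MNLI score first.
    """
    mnli_score = sum(mnli) / len(mnli)
    return round((mnli_score + sum(others)) / (1 + len(others)), 1)


for split_name, table in (("dev", DEV), ("test", TEST)):
    for model, (mnli, others) in table.items():
        print(split_name, model, glue_avg(mnli, others))
```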

SQuAD 2.0 Fine-Tuning Results

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is either a segment of text (a span) from the corresponding reading passage, or the question is unanswerable.

SQuAD 2.0 dev set results of COCO-LM base++ and large++ models are as follows (median of 5 different random seeds):

Model	EM	F1
COCO-LM base++	85.4	88.1
COCO-LM large++	88.2	91.0
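For readers unfamiliar with the EM and F1 columns, here is a simplified sketch of how exact match and token-level F1 are commonly computed for a single prediction. It is not the official SQuAD evaluation script: it compares against one reference answer only (the official evaluation takes the best score over several references), and the function names are chosen here for illustration. An empty string stands for "unanswerable".

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    # If either side is empty (unanswerable), F1 is 1 only when both are empty.
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower."))
print(f1_score("in Paris France", "Paris"))
```

Dataset-level EM and F1 are then the averages of these per-question scores, expressed as percentages.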

Citation

If you find the code and models useful for your research, please cite the following paper:

@inproceedings{meng2021cocolm,
  title={{COCO-LM}: Correcting and contrasting text sequences for language model pretraining},
  author={Meng, Yu and Xiong, Chenyan and Bajaj, Payal and Tiwary, Saurabh and Bennett, Paul and Han, Jiawei and Song, Xia},
  booktitle={Conference on Neural Information Processing Systems},
  year={2021}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Comments

code for pre-training

Hi,

Thanks for your great work!

I want to further train my language model with COCO-LM objectives. I didn't find the code for the further pre-training. Will you provide the code?

opened by chiyuzhang94 28

Add fix for supporting offline models

A small fix so that offline local models can also be used; otherwise cocolm defaults to downloading from Huggingface. This is useful, for example, in Kaggle competitions.

opened by gauravbrills 1

Releases (v0.1.0)

v0.1.0 (Dec 2, 2021)

We release two pretrained COCO-LM model checkpoints and one dictionary file:

cocolm-base.tar.gz contains the COCO-LM base++ model; you need to extract the model from the archive.

cocolm-large.tar.gz contains the COCO-LM large++ model; you need to extract the model from the archive.

dict.tar.gz contains the sentencepiece model (sp.model) and the vocabulary file (dict.txt).

Source code (tar.gz)
Source code (zip)
cocolm-base.tar.gz (548.80 MB)
cocolm-large.tar.gz (1129.84 MB)
dict.tar.gz (935.94 KB)
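Extracting one of the released archives can be sketched with Python's standard `tarfile` module. This is a minimal example under assumptions: the archive is assumed to have been downloaded already, the destination directory name `cocolm_dict` is arbitrary, and `extract_archive` is a helper introduced here, not part of the repository. The internal layout of the archives is not assumed; the function simply returns whatever member names it finds.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Newer Python versions offer extraction filters that reject unsafe
        # paths; use the "data" filter when the running version supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return names


if __name__ == "__main__":
    # Assumes dict.tar.gz was downloaded into the current directory.
    archive = Path("dict.tar.gz")
    if archive.exists():
        print(extract_archive(archive, "cocolm_dict"))
```

The same call works for cocolm-base.tar.gz and cocolm-large.tar.gz; only the archive path and destination change.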

Owner

Microsoft

Open source projects and samples from Microsoft
