SAS: Self-Augmentation Strategy for Language Model Pre-training

This repository contains the official PyTorch implementation of the paper "SAS: Self-Augmentation Strategy for Language Model Pre-training". It is based on Hugging Face transformers version 4.3.0.

Only the SAS variant without the disentangled attention mechanism is released for now. The repository will be updated later.

File structure

train.py: The script for pre-training.
run_glue.py: The script for fine-tuning.
models
- modeling_sas.py: The main SAS algorithm.
- trainer_sas.py: Inherited from Hugging Face transformers and modified mainly for data processing.
utils: All utilities.
- data_collator_sas.py: The details of the self-augmentations.
The remaining files are supporting code.

How to

Download and Install

Clone this repository.
Download the wiki-corpus dataset and store it in the data folder. Currently, we only provide a trial dataset with 1 million sentences. The full dataset can be pre-processed in the same way as for BERT. Details are to be released.

(Optional) Create a conda environment from the provided environment.yml.
- You can also install the packages manually:
  - Python==3.9, pytorch==1.10.0, transformers==4.3.0, etc.

    # Clone package
    git clone [email protected]:fei960922/SAS-Self-Augmentation-Strategy.git
    cd SAS-Self-Augmentation-Strategy

    # Establish the environment.
    conda env create -f environment.yml 
    conda activate cssl

    # Download the trial dataset
    wget http://www.stat.ucla.edu/~yifeixu/sas/wiki_corpus_1M.npy
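
After activating the environment, it can be useful to confirm that the installed packages match the versions listed above. The following is a minimal sketch, not part of this repository: the helper name `find_mismatches` is hypothetical, and it assumes PyTorch is registered under the distribution name `torch`, which may not hold for every install method.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the manual-install list above.
# Assumption: "torch" is the distribution name under which PyTorch is installed.
PINNED = {"torch": "1.10.0", "transformers": "4.3.0"}


def find_mismatches(pinned=PINNED):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        # Prefix comparison so local build tags such as "1.10.0+cu113" still pass.
        if found is None or not found.startswith(expected):
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches().items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

If the script prints nothing, both pins are satisfied. The prefix comparison is deliberately loose, so treat it as a quick sanity check rather than a strict version test.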

Train from scratch

    # Run default setting 
    bash script/pretrain.sh

    # Run custom setting
    python train.py

    # Start from a checkpoint
    python train.py --start_from_checkpoint 1 --pretrain_path {PATH_TO_CHECKPOINT}

Calculate GLUE scores

    # Running this script downloads the GLUE dataset automatically.
    bash finetune.sh MNLI 0 sas-base output_dir 5e-5 32 4 42
    bash finetune.sh MNLI 0 sas-small output_dir 1e-4 32 4 42
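
To fine-tune on several GLUE tasks in turn, the command above can be generated per task. The sketch below is illustrative only: the helper `build_finetune_commands` is hypothetical, the trailing arguments are copied verbatim from the sas-base example without interpreting them, and any task name other than MNLI is an assumption about what finetune.sh accepts.

```python
import shlex

# Arguments copied verbatim from the sas-base example above; their meaning is
# defined by finetune.sh, not by this sketch.
BASE_ARGS = ["0", "sas-base", "output_dir", "5e-5", "32", "4", "42"]


def build_finetune_commands(tasks, base_args=BASE_ARGS):
    """Return one `bash finetune.sh <task> ...` argument list per task name."""
    return [["bash", "finetune.sh", task, *base_args] for task in tasks]


if __name__ == "__main__":
    # Assumption: "RTE" is accepted by finetune.sh; only MNLI appears in this README.
    for cmd in build_finetune_commands(["MNLI", "RTE"]):
        # Dry run: print the command instead of launching it.
        print(shlex.join(cmd))
        # To launch for real, use: subprocess.run(cmd, check=True)
```

The sketch only prints the commands, so the task list can be checked before anything runs. Because every command reuses the same `output_dir` argument, consider passing a different value per task to avoid overwriting results.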


Owner

Alibaba
