Semi-supervised Learning for Sentiment Analysis

Overview

Neural-Semi-supervised-Learning-for-Text-Classification-Under-Large-Scale-Pretraining

Code, models and Datasets for《Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining》.

Download Models and Dataset

Datasets and Models are found in the follwing list.

  • Download 3.4M IMDB movie reviews. Save the data at [REVIEWS_PATH]. You can download the dataset HERE.
  • Download the vanilla RoBERTa-large model released by HuggingFace. Save the model at [VANILLA_ROBERTA_LARGE_PATH]. You can download the model HERE.
  • Download in-domain pretrained models in the paper and save the model at [PRETRAIN_MODELS]. We provide three following models. You can download HERE.
    • init-roberta-base: RoBERTa-base model(U) trained over 3.4M movie reviews from scratch.
    • semi-roberta-base: RoBERTa-base model(Large U + U) trained over 3.4M movie reviews from the open-domain pretrained model RoBERTa-base model.
    • semi-roberta-large: RoBERTa-large model(Large U + U) trained over 3.4M movie reviews from the open-domain pretrained model RoBERTa-large model.
  • Download the 1M (D` + D) training dataset for the student model, save the data at [STUDENT_DATA_PATH]. You can download it HERE.
    • student_data_base: student training data generated by roberta-base teacher model
    • student_data_large: student training data generated by roberta-large teacher model
  • Download the IMDB dataset from Andrew Maas' paper. Save the data at [IMDB_DATA_PATH]. For IMDB, The training data and test data are saved in two separate files, each line in the file corresponds to one IMDB sample. You can download HERE.
  • Download shannon_preprocssor.whl to install a binarize tool. Save the .whl file at [SHANNON_PREPROCESS_WHL_PATH]. You can download HERE
  • Download the teacher model and student model that we trained. Save them at [CHECKPOINTS]. You can download HERE
    • roberta-base: teacher and student model checkpoint for roberta-base
    • roberta-large: teacher and student model checkpoint for roberta-large

Installation

pip install -r requirements.txt
pip install [SHANNON_PREPROCESS_WHL_PATH]

Quick Tour

train the roberta-large teacher model

Use the roberta model we pretrained over 3.4M reviews data to train teacher model.
Our teacher model had an accuracy rate of 96.2% on the test set.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_teacher \
roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] \
--precision 16 \
--batch_size 10 \
--min_epochs 10 \
--patience 3 \
--lr 3e-5  

train the roberta-large student model

Use the roberta model we pretrained over 3.4M reviews data to train student model.
Our student model had an accuracy rate of 96.8% on the test set.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_student \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--student_data_path [STUDENT_DATA_PATH]/student_data_large/bin \
--save_path [ROOT_SAVE_PATH] \
--batch_size=10 \
--precision 16 \
--lr=2e-5 \
--warmup_steps 40000 \
--gpus=0,1,2,3,4,5,6,7 \
--accumulate_grad_batches=50

evaluate the student model on the test set

Load student model checkpoint to evaluate over test set to reproduce our result.

cd sstc/tasks/semi-roberta
python evaluate.py \
--checkpoint_path [CHECKPOINTS]/roberta-large/train_student_checkpoint/***.ckpt \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--batch_size=10 \
--gpus=0,

Reproduce paper results step by step

1.Train in-domain LM based on RoBERTa

1.1 binarize 3.4M reviews data

You should modify the shell according to your paths. The result binarize data will be saved in [REVIEWS_PATH]/bin

cd sstc/tasks/roberta_lm
bash binarize.sh

1.2 train RoBERTa-large (or small, as you wish) over 3.4M reviews data

cd sstc/tasks/roberta_lm
python trainer.py \
--roberta_path [VANILLA_ROBERTA_LARGE_PATH] \
--data_dir [REVIEWS_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [PRETRAIN_ROBERTA_CK_PATH] \
--val_check_interval 0.1 \
--precision 16 \
--batch_size 10 \
--distributed_backend=ddp \
--accumulate_grad_batches=50 \
--adam_epsilon 1e-6 \
--weight_decay 0.01 \
--warmup_steps 10000 \
--workers 8 \
--lr 2e-5

Training checkpoints will be saved in [PRETRAIN_ROBERTA_CK_PATH], find the best checkpoint and convert it to HuggingFace bin format, The relevant code can be found in sstc/tasks/roberta_lm/trainer.py. Save the pretrain bin model at [PRETRAIN_MODELS]\semi-roberta-large, or you can just download the model we trained.

2.train the teacher model

2.1 binarize IMDB dataset.

cd sstc/tasks/semi_roberta/scripts
bash binarize_imdb.sh

You can run the above code to binarize IMDB data, or you can just use the file we binarized in [IMDB_DATA_PATH]\bin

2.2 train the teacher model

cd sstc/tasks/semi_roberta
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] \
--precision 16 \
--batch_size 10 \
--min_epochs 10 \
--patience 3 \
--lr 3e-5  

After training, teacher model checkpoint will be save in [ROOT_SAVE_PATH]/train_teacher_checkpoint. The teacher model we trained had an accuracy rate of 96.2% on the test set. The download link of teacher model checkpoint can be found in quick tour part.

3.label the unlabeled in-domain data U

3.1 label 3.4M data

Use the teacher model that you trained in previous step to label 3.4M reviews data, notice that [ROOT_SAVE_PATH] should be the same as previous setting. The labeled data will be save in [ROOT_SAVE_PATH]\predictions.

cd sstc/tasks/roberta_lm
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_ROBERTA_PATH] \
--reviews_data_path [REVIEWS_PATH]/bin \
--best_teacher_checkpoint_path [CHECKPOINTS]/roberta-large/train_teacher_checkpoint/***.ckpt \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] 

3.2 select the top-K data points

Firstly, we random sample 3M data from 3.4M reviews data as U', then we select 1M data from U' with the highest score as D', finally, we concat the IMDB train data(D) and D' as train data for student model. The student train data will be saved in [ROOT_SAVE_PATH]\student_data\train.txt, or you can use the data we provide in [STUDENT_DATA_PATH]/student_data_large

cd sstc/tasks/roberta_lm
python data_selector.py \
--imdb_data_path [IMDB_DATA_PATH] \
--save_path [ROOT_SAVE_PATH] 

4.train the student model

4.1 binarize the dataset

You can use the same script in 3.1 to binarize student train data in [ROOT_SAVE_PATH]\student_data\train.txt

4.1 train the student model

use can use the training data we provide in [STUDENT_DATA_PATH]/student_data_large/bin or use your own training data in [ROOT_SAVE_PATH]\student_data\bin, make sure you set the right student_data_path.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_student \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--student_data_path [STUDENT_DATA_PATH]/student_data_large/bin \
--save_path [ROOT_SAVE_PATH] \
--batch_size=10 \
--precision 16 \
--lr=2e-5 \
--warmup_steps 40000 \
--gpus=0,1,2,3,4,5,6,7 \
--accumulate_grad_batches=50

After training, student model checkpoint will be save in [ROOT_SAVE_PATH]/train_student_checkpoint. The student model we trained had an accuracy rate of 96.6% on the test set. The download link of student model checkpoint can be found in Quick tour part.

Bayesian Optimization Library for Medical Image Segmentation.

bayesmedaug: Bayesian Optimization Library for Medical Image Segmentation. bayesmedaug optimizes your data augmentation hyperparameters for medical im

Şafak Bilici 7 Feb 10, 2022
Implementation for paper: Self-Regulation for Semantic Segmentation

Self-Regulation for Semantic Segmentation This is the PyTorch implementation for paper Self-Regulation for Semantic Segmentation, ICCV 2021. Citing SR

Dong ZHANG 30 Nov 21, 2022
FFCV: Fast Forward Computer Vision (and other ML workloads!)

Fast Forward Computer Vision: train models at a fraction of the cost with accele

FFCV 2.3k Jan 03, 2023
A new codebase for Group Activity Recognition. It contains codes for ICCV 2021 paper: Spatio-Temporal Dynamic Inference Network for Group Activity Recognition and some other methods.

Spatio-Temporal Dynamic Inference Network for Group Activity Recognition The source codes for ICCV2021 Paper: Spatio-Temporal Dynamic Inference Networ

40 Dec 12, 2022
Strongly local p-norm-cut algorithms for semi-supervised learning and local graph clustering

Strongly local p-norm-cut algorithms for semi-supervised learning and local graph clustering

Meng Liu 2 Jul 19, 2022
Qt-GUI implementation of the YOLOv5 algorithm (ver.6 and ver.5)

YOLOv5-GUI 🎉 YOLOv5算法(ver.6及ver.5)的Qt-GUI实现 🎉 Qt-GUI implementation of the YOLOv5 algorithm (ver.6 and ver.5). 基于YOLOv5的v5版本和v6版本及Javacr大佬的UI逻辑进行编写

EricFang 12 Dec 28, 2022
An unsupervised learning framework for depth and ego-motion estimation from monocular videos

SfMLearner This codebase implements the system described in the paper: Unsupervised Learning of Depth and Ego-Motion from Video Tinghui Zhou, Matthew

Tinghui Zhou 1.8k Dec 30, 2022
Channel Pruning for Accelerating Very Deep Neural Networks (ICCV'17)

Channel Pruning for Accelerating Very Deep Neural Networks (ICCV'17)

Yihui He 1k Jan 03, 2023
RoboDesk A Multi-Task Reinforcement Learning Benchmark

RoboDesk A Multi-Task Reinforcement Learning Benchmark If you find this open source release useful, please reference in your paper: @misc{kannan2021ro

Google Research 66 Oct 07, 2022
Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs, ICCV 2021

Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs, ICCV 2021 Global Pooling, More than Meets the Eye: Posi

Md Amirul Islam 32 Apr 24, 2022
Leveraging Two Types of Global Graph for Sequential Fashion Recommendation, ICMR 2021

This is the repo for the paper: Leveraging Two Types of Global Graph for Sequential Fashion Recommendation Requirements OS: Ubuntu 16.04 or higher ver

Yujuan Ding 10 Oct 10, 2022
Fantasy Points Prediction and Dream Team Formation

Fantasy-Points-Prediction-and-Dream-Team-Formation Collected Data from open source resources that have over 100 Parameters for predicting cricket play

Akarsh Singh 2 Sep 13, 2022
Time Series Cross-Validation -- an extension for scikit-learn

TSCV: Time Series Cross-Validation This repository is a scikit-learn extension for time series cross-validation. It introduces gaps between the traini

Wenjie Zheng 222 Jan 01, 2023
Code release for "Transferable Semantic Augmentation for Domain Adaptation" (CVPR 2021)

Transferable Semantic Augmentation for Domain Adaptation Code release for "Transferable Semantic Augmentation for Domain Adaptation" (CVPR 2021) Paper

66 Dec 16, 2022
This is an example of a reproducible modelling project

An example of a reproducible modelling project What are we doing? This example was created for the 2021 fall lecture series of Stanford's Center for O

Armin Thomas 2 Oct 26, 2021
People log into different sites every day to get information and browse through these sites one by one

HyperLink People log into different sites every day to get information and browse through these sites one by one. And they are exposed to advertisemen

0 Feb 17, 2022
f-BRS: Rethinking Backpropagating Refinement for Interactive Segmentation

f-BRS: Rethinking Backpropagating Refinement for Interactive Segmentation [Paper] [PyTorch] [MXNet] [Video] This repository provides code for training

Visual Understanding Lab @ Samsung AI Center Moscow 516 Dec 21, 2022
Implementation of paper "DeepTag: A General Framework for Fiducial Marker Design and Detection"

Implementation of paper DeepTag: A General Framework for Fiducial Marker Design and Detection. Project page: https://herohuyongtao.github.io/research/

Yongtao Hu 46 Dec 12, 2022
Joint Learning of 3D Shape Retrieval and Deformation, CVPR 2021

Joint Learning of 3D Shape Retrieval and Deformation Joint Learning of 3D Shape Retrieval and Deformation Mikaela Angelina Uy, Vladimir G. Kim, Minhyu

Mikaela Uy 38 Oct 18, 2022
Implementation of TabTransformer, attention network for tabular data, in Pytorch

Tab Transformer Implementation of Tab Transformer, attention network for tabular data, in Pytorch. This simple architecture came within a hair's bread

Phil Wang 420 Jan 05, 2023