Skip to content

VIPL-SLP/VAC_CSLR

Repository files navigation

VAC_CSLR

PWC

This repo holds the code of the paper: Visual Alignment Constraint for Continuous Sign Language Recognition.(ICCV 2021) [paper]

framework


Update (2022.05.14)

In recent experiments, we found an implementation improvement about the proposed method. In our early experiments, we adopt nn.DataParallel to parallel the visual feature extractor on multiple GPUs. However, only statistic updated on device 0 is kept during training (Dataparallel), which leads to unstable training results (results may be different when adopting different numbers of GPUs and batch sizes). Therefore, we adopt syncBN in this update, the training schedule can be shorten to 40 epochs, and the relevant results are also provided. Experimental results on other datasets will be provided in our future journal version.

from modules.sync_batchnorm import convert_model

def model_to_device(self, model):
    model = model.to(self.device.output_device)
    if len(self.device.gpu_list) > 1:
        model.conv2d = nn.DataParallel(
            model.conv2d,
            device_ids=self.device.gpu_list,
            output_device=self.device.output_device)
    model = convert_model(model)
    model.cuda()
    return model

With the provided code, the updated results are expected as:

Backbone WER on Dev WER on Test Pretrained model
ResNet18 (baseline) 23.8 25.4 [Baidu] [GoogleDrive]
ResNet18+VAC (CTC only) 21.5 22.1 [Baidu] [GoogleDrive]
ResNet18+VAC+SMKD 19.8 20.5 [Baidu] [GoogleDrive]

The VAC result is corresponding to the setting ofloss_weights: SeqCTC: 1.0, ConvCTC: 1.0. In addition to that, the VAC+SMKD adopt the setting of model_args: share_classifier: True, weight_norm: True.

If you find this repo useful in your research works, please consider cite our papers VAC and SMKD.


Prerequisites

  • This project is implemented in Pytorch (>1.8). Thus please install Pytorch first.

  • ctcdecode==0.4 [parlance/ctcdecode],for beam search decode.

  • [Optional] sclite [kaldi-asr/kaldi], install kaldi tool to get sclite for evaluation. After installation, create a soft link toward the sclite:
    ln -s PATH_TO_KALDI/tools/sctk-2.4.10/bin/sclite ./software/sclite We also provide a python version evaluation tool for convenience, but sclite can provide more detailed statistics.

  • [Optional] SeanNaren/warp-ctc At the beginning of this research, we adopt warp-ctc for supervision, and we recently find that pytorch version CTC can reach similar results.

Data Preparation

  1. Download the RWTH-PHOENIX-Weather 2014 Dataset [download link]. Our experiments based on phoenix-2014.v3.tar.gz.

  2. After finishing dataset download, extract it to ./dataset/phoenix, it is suggested to make a soft link toward downloaded dataset.
    ln -s PATH_TO_DATASET/phoenix2014-release ./dataset/phoenix2014

  3. The original image sequence is 210x260, we resize it to 256x256 for augmentation. Run the following command to generate gloss dict and resize image sequence.

    cd ./preprocess
    python data_preprocess.py --process-image --multiprocessing

Inference

​ We provide the pretrained models for inference, you can download them from:

Backbone WER on Dev WER on Test Pretrained model
ResNet18 21.2% 22.3% [Baidu] (passwd: qi83)
[Dropbox]

​ To evaluate the pretrained model, run the command below:
python main.py --load-weights resnet18_slr_pretrained.pt --phase test

​ (When evaluating the SMKD pretrained model, please modify the weight_norm and share_classifier in config files as True).

Training

The priorities of configuration files are: command line > config file > default values of argparse. To train the SLR model on phoenix14, run the command below:

python main.py --work-dir PATH_TO_SAVE_RESULTS --config PATH_TO_CONFIG_FILE --device AVAILABLE_GPUS

Feature Extraction

We also provide feature extraction function to extract frame-wise features for other research purpose, which can be achieved by:

python main.py --load-weights PATH_TO_PRETRAINED_MODEL --phase features

To Do List

  • Pure python implemented evaluation tools.
  • WAR and WER calculation scripts.

Citation

If you find this repo useful in your research works, please consider citing:

@InProceedings{Min_2021_ICCV,
    author    = {Min, Yuecong and Hao, Aiming and Chai, Xiujuan and Chen, Xilin},
    title     = {Visual Alignment Constraint for Continuous Sign Language Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {11542-11551}
}

Self-Mutual Distillation Learning for Continuous Sign Language Recognition [paper]

@InProceedings{Hao_2021_ICCV,
    author    = {Hao, Aiming and Min, Yuecong and Chen, Xilin},
    title     = {Self-Mutual Distillation Learning for Continuous Sign Language Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {11303-11312}
}

Acknowledge

We appreciate the help from Runpeng Cui, Hao Zhou@Rhythmblue and Xinzhe Han@GeraldHan :)