Goal of the project : Detecting Temporal Boundaries in Sign Language videos

Last update: Dec 21, 2022

Overview

MVA RecVis course final project :

Goal of the project : Detecting Temporal Boundaries in Sign Language videos.

Automatic indexing of sign language is an important challenge for developing better communication tools for the deaf community. However, annotated datasets for sign language are limited, and few people have the skills to annotate such data, which makes it hard to train well-performing machine learning models. Two important goals are therefore to:

Increase available training datasets.
Make labeling easier for professionals, to reduce the risk of annotation errors.

In this context, techniques have emerged to perform automatic sign segmentation, marking the boundaries between individual signs in sign language videos. The development of such tools has the potential to alleviate the limited supply of labelled datasets currently available for sign language research.

Previous work and personal contribution :

This repository provides code for the Object Recognition & Computer Vision (RecVis) course final project. For more details, please refer to the project report, report.pdf. In this project, we reproduced the results of the following paper (using the code from this repository):

Katrin Renz, Nicolaj C. Stache, Samuel Albanie and Gül Varol, Sign language segmentation with temporal convolutional networks, ICASSP 2021. [arXiv]

We used the pre-extracted frame-level features, obtained by applying the I3D model to the videos, to retrain the MS-TCN architecture for frame-level binary classification and reproduce the paper's results. The tests folder contains a notebook for reproducing the original paper's results, with a mean F1B (mF1B) of 68.68 on the evaluation set of the BSL Corpus.
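To make the boundary metric more concrete, here is a minimal sketch of how a boundary-level F1 score could be computed from frame-level binary predictions. It is not the evaluation code of this repository. It assumes that boundary frames carry label 1, that each run of boundary frames is summarized by its midpoint, that a predicted boundary counts as correct when it lies within a tolerance (in frames) of an unmatched ground-truth boundary, and that the mean is taken over tolerances 1 to 4. The function names are hypothetical.

```python
from typing import List, Sequence


def boundary_midpoints(labels: Sequence[int]) -> List[float]:
    """Return the midpoint index of every run of boundary frames (label 1).

    Assumption: 1 marks a boundary frame, 0 marks a frame inside a sign.
    """
    midpoints = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a new run of boundary frames begins
        elif label != 1 and start is not None:
            midpoints.append((start + i - 1) / 2)  # run covered start..i-1
            start = None
    if start is not None:  # run reaches the last frame
        midpoints.append((start + len(labels) - 1) / 2)
    return midpoints


def boundary_f1(pred: Sequence[float], true: Sequence[float], tolerance: float) -> float:
    """F1 score where each prediction may match one ground-truth boundary."""
    unmatched = list(true)
    true_positives = 0
    for p in pred:
        # Closest ground-truth boundary that has not been matched yet.
        best = min(unmatched, key=lambda t: abs(t - p), default=None)
        if best is not None and abs(best - p) <= tolerance:
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


def mean_boundary_f1(pred, true, tolerances=(1, 2, 3, 4)) -> float:
    """Average the boundary F1 over several tolerances (assumed range)."""
    return sum(boundary_f1(pred, true, t) for t in tolerances) / len(tolerances)


if __name__ == "__main__":
    gt = boundary_midpoints([0, 0, 1, 1, 1, 0, 0, 0, 1, 0])
    pred = boundary_midpoints([0, 1, 1, 0, 0, 0, 0, 0, 0, 1])
    print(gt, pred, mean_boundary_f1(pred, gt))
```

The greedy one-to-one matching is a simplification chosen for readability; the original evaluation may resolve competing matches differently.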

We then implemented new models to improve on this result. We wanted to try attention-based models, as they have recently attracted a great deal of interest in the vision research community. We first trained a vanilla Transformer encoder from scratch, but the results were not satisfactory.

Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017).

We then implemented the ASFormer model (Transformer for Action Segmentation), using this code: a hybrid transformer model that borrows some ideas from the MS-TCN architecture. The motivations behind the model and its architecture are detailed in the following paper:

ASFormer: Transformer for Action Segmentation, Fangqiu Yi, Hongyu Wen, Tingting Jiang (2021).

We trained this model on the extracted I3D features and obtained an improvement over the MS-TCN architecture. The results are given in the following table:

ID	Model	mF1B	mF1S
1	MS-TCN	68.68 ± 0.6	47.71 ± 0.8
2	Transformer Encoder	60.28 ± 0.3	42.70 ± 0.2
3	ASFormer	69.79 ± 0.2	49.23 ± 1.2

Setup
Data and models
Demo
Training
Citation
License
Acknowledgements

Setup

# Clone this repository
git clone https://github.com/loubnabnl/Sign-Segmentation-with-Transformers.git
cd Sign-Segmentation-with-Transformers/
# Create signseg_env environment
conda env create -f environment.yml
conda activate signseg_env

Data and models

You can download the pretrained models (I3D and MS-TCN) (models.zip [302MB]) and the data (data.zip [5.5GB]) used in the experiments here, or by executing download/download_*.sh. The unzipped data/ and models/ folders should be located in the root directory of the repository (to run the demo, downloading the models folder is sufficient).

You can download our best pretrained ASFormer model weights here.

Data:

Please cite the original datasets when using the data: BSL Corpus. The authors of github.com/RenzKa/sign-segmentation provided the pre-extracted features and metadata. See here for a detailed description of the data files.

Features: data/features/*/*/features.mat
Metadata: data/info/*/info.pkl
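As an illustration of the layout above, the following sketch lists the feature and metadata files under the data/ folder and shows one possible way to open them. The helper names are hypothetical, the use of scipy.io.loadmat is an assumption (SciPy must be installed, and loadmat does not read MATLAB v7.3 files), and the keys stored inside each file are not described here, so inspect them before relying on any particular field.

```python
import pickle
from pathlib import Path


def list_feature_files(data_root="data"):
    """Return all files matching data/features/*/*/features.mat, sorted."""
    return sorted(Path(data_root).glob("features/*/*/features.mat"))


def list_info_files(data_root="data"):
    """Return all files matching data/info/*/info.pkl, sorted."""
    return sorted(Path(data_root).glob("info/*/info.pkl"))


def load_features(path):
    """Load one features.mat file as a dict (assumes SciPy is available)."""
    from scipy.io import loadmat  # imported lazily: optional dependency

    return loadmat(str(path))


def load_info(path):
    """Load one info.pkl file; only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    for feature_file in list_feature_files():
        print(feature_file)
    for info_file in list_info_files():
        print(info_file)
```

Running the script from the repository root prints every matching file, which is a quick way to confirm that data.zip was unpacked in the expected place.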

Models:

I3D weights, trained for sign classification: models/i3d/*.pth.tar
MS-TCN weights for the demo (see tables below for links to the other models): models/ms-tcn/*.model
ASFormer weights of our best model: models/asformer/*.model

The folder structure should be as below:

sign-segmentation/models/
  i3d/
    i3d_kinetics_bslcp.pth.tar
  ms-tcn/
    mstcn_bslcp_i3d_bslcp.model
  asformer/
    best_asformer_bslcp.model
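Before running the demo, it can help to confirm that the model files are where the tree above expects them. The sketch below is a hypothetical helper, not part of the repository; it assumes the models folder sits in the current working directory and uses the file names from the tree above.

```python
from pathlib import Path

# File names copied from the folder structure above.
EXPECTED_MODEL_FILES = [
    "i3d/i3d_kinetics_bslcp.pth.tar",
    "ms-tcn/mstcn_bslcp_i3d_bslcp.model",
    "asformer/best_asformer_bslcp.model",
]


def missing_model_files(models_root="models"):
    """Return the expected model files that are absent under models_root."""
    root = Path(models_root)
    return [rel for rel in EXPECTED_MODEL_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:")
        for rel in missing:
            print("  models/" + rel)
    else:
        print("All expected model files were found.")
```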

Demo

The demo folder contains a sample script to estimate the segments of a given sign language video. Run demo.py to get a visualization on a sample video.

cd demo
python demo.py

The demo will:

use the models/i3d/i3d_kinetics_bslcp.pth.tar pretrained I3D model to extract features,
use the models/asformer/best_asformer_model.model pretrained ASFormer model to predict the segments from the features,
save the results.

Training

To train I3D, please refer to github.com/RenzKa/sign-segmentation. To train ASFormer on the pre-extracted I3D features, run main.py; you can change the hyperparameters in the arguments inside the file. Alternatively, you can run the notebook in the test_asformer folder.

Citation

If you use this code and data, please cite the following original papers:

@inproceedings{Renz2021signsegmentation_a,
    author       = "Katrin Renz and Nicolaj C. Stache and Samuel Albanie and G{\"u}l Varol",
    title        = "Sign Language Segmentation with Temporal Convolutional Networks",
    booktitle    = "ICASSP",
    year         = "2021",
}

@article{yi2021asformer,
  title={Asformer: Transformer for action segmentation},
  author={Yi, Fangqiu and Wen, Hongyu and Jiang, Tingting},
  journal={arXiv preprint arXiv:2110.08568},
  year={2021}
}

License

The license in this repository only covers the code. For data.zip and models.zip, we refer to the terms and conditions of the original datasets.

Acknowledgements

The code builds on the github.com/RenzKa/sign-segmentation and github.com/ChinaYi/ASFormer repositories.

Owner

Loubna Ben Allal
