Semantic Segmentation on MIT ADE20K dataset in PyTorch

This is a PyTorch implementation of semantic segmentation models on the MIT ADE20K scene parsing dataset (http://sceneparsing.csail.mit.edu/).

ADE20K is the largest open source dataset for semantic segmentation and scene parsing, released by the MIT Computer Vision team. Follow the link below to find the repository for our dataset and implementations in Caffe and Torch7: https://github.com/CSAILVision/sceneparsing

If you simply want to play with our demo, please try this link: http://scenesegmentation.csail.mit.edu. You can upload your own photo and parse it!

You can also use the Colab notebook playground to tinker with the code for segmenting an image.

All pretrained models can be found at: http://sceneparsing.csail.mit.edu/model/pytorch

[From left to right: Test Image, Ground Truth, Predicted Result]

Color encoding of semantic categories can be found here: https://docs.google.com/spreadsheets/d/1se8YEtb2detS7OuPE86fXGyD269pMycAWe2mtKUj2W8/edit?usp=sharing

Updates

HRNet model is now supported.
We use configuration files to store most options which were in argument parser. The definitions of options are detailed in config/defaults.py.
We conform to PyTorch practice in data preprocessing (RGB [0, 1], subtract mean, divide by std).
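The preprocessing convention in the last item can be sketched for a single pixel as follows. This is only an illustration: the mean and std values are the ImageNet statistics commonly used with torchvision models and are assumed here rather than taken from this repository, and the name `normalize_pixel` is hypothetical.

```python
# Per-channel statistics commonly used with torchvision models.
# ASSUMPTION: these are not read from this repository's configuration.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Map one 8-bit RGB pixel to its normalized float representation."""
    out = []
    for value, mean, std in zip(rgb_uint8, MEAN, STD):
        scaled = value / 255.0             # RGB in [0, 1]
        out.append((scaled - mean) / std)  # subtract mean, divide by std
    return tuple(out)


print(normalize_pixel((255, 128, 0)))
```

In practice the same arithmetic is applied to whole image tensors at once; the per-pixel form is used here only to keep the sketch dependency-free.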

Highlights

Synchronized Batch Normalization on PyTorch

This module computes the mean and standard deviation across all devices during training. We empirically find that a reasonably large batch size is important for segmentation. We thank Jiayuan Mao for his kind contributions; please refer to Synchronized-BatchNorm-PyTorch for details.

The implementation is easy to use as:

It is pure Python, with no extra C++ extension libraries.
It is completely compatible with PyTorch's implementation. Specifically, it uses unbiased variance to update the moving average, and uses sqrt(max(var, eps)) instead of sqrt(var + eps).
It is efficient, only 20% to 30% slower than UnsyncBN.
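The idea behind the points above can be sketched without any GPU code. In this minimal illustration each device contributes a count, a sum, and a sum of squares for one channel, from which batch-wide statistics are derived. The function names are hypothetical, and the eps default of 1e-5 is an assumption; this is not the module's actual implementation.

```python
import math


def global_batch_stats(per_device_values):
    """Combine per-device activations of one channel into batch-wide statistics."""
    # Each device only needs to communicate three numbers per channel.
    count = sum(len(v) for v in per_device_values)
    total = sum(sum(v) for v in per_device_values)
    total_sq = sum(sum(x * x for x in v) for v in per_device_values)

    mean = total / count
    biased_var = total_sq / count - mean * mean      # normalizes the current batch
    unbiased_var = biased_var * count / (count - 1)  # feeds the moving average
    return mean, biased_var, unbiased_var


def inv_std(var, eps=1e-5):
    # Clamp the variance instead of adding eps to it.
    return 1.0 / math.sqrt(max(var, eps))


devices = [[1.0, 2.0], [3.0, 4.0]]  # one channel's activations on two GPUs
mean, var, unbiased = global_batch_stats(devices)
print(mean, var, unbiased, inv_std(var))
```

Without synchronization, each GPU would normalize with statistics from its own two samples only, which is why small per-GPU batches hurt.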

Dynamic scales of input for training with multiple GPUs

For the task of semantic segmentation, it is good to keep the aspect ratio of images during training. So we re-implement the DataParallel module and make it support distributing data to multiple GPUs in a Python dict, so that each GPU can process images of different sizes. At the same time, the dataloader also operates differently.
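One way to keep the aspect ratio is to apply a single scale factor to both sides, chosen so that the shorter side reaches a target size unless that would push the longer side past a cap. The sketch below illustrates this under assumptions: the function name, the default sizes, and the rounding to a multiple of 8 are illustrative choices, not values confirmed by this document.

```python
def resize_keep_aspect(height, width, short_size=450, max_size=1000, multiple=8):
    """Return a (height, width) that preserves the aspect ratio up to rounding."""
    # One scale factor for both sides: reach short_size on the shorter side
    # unless that would push the longer side past max_size.
    scale = min(short_size / min(height, width), max_size / max(height, width))
    new_h, new_w = int(height * scale), int(width * scale)

    # Round up to a multiple so that strided layers divide the input evenly.
    def round_up(x):
        return ((x - 1) // multiple + 1) * multiple

    return round_up(new_h), round_up(new_w)


print(resize_keep_aspect(600, 800))
print(resize_keep_aspect(400, 2000))
```

Because each image ends up with its own size, images on different GPUs cannot be stacked into one tensor, which is the reason for passing them as separate dict entries.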

Now the batch size of a dataloader always equals the number of GPUs, and each element is sent to one GPU. It is also compatible with multi-processing. Note that the file index for the multi-processing dataloader is stored on the master process, which contradicts our goal that each worker maintains its own file list. So we use a trick: although the master process still gives the dataloader an index for the __getitem__ function, we ignore the request and send a random batch dict. Also, the multiple workers forked by the dataloader all have the same seed, so they would yield exactly the same data if we used the above-mentioned trick directly. Therefore, we add one line of code that sets the default seed for numpy.random before activating multiple workers in the dataloader.
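The index-ignoring trick and the seeding problem can be illustrated with a toy dataset. This sketch uses the standard library's `random` module in place of numpy.random, and the class name `RandomBatchDataset` and its parameters are hypothetical; it shows the idea rather than the repository's dataset class.

```python
import random


class RandomBatchDataset:
    """Toy dataset whose __getitem__ ignores the index it is given."""

    def __init__(self, file_list, batch_per_gpu=2, seed=None):
        self.file_list = list(file_list)
        self.batch_per_gpu = batch_per_gpu
        # Each worker should own a differently seeded generator; otherwise
        # forked workers replay the same sequence of "random" batches.
        self.rng = random.Random(seed)

    def __getitem__(self, index):
        # The index supplied by the master process is deliberately unused.
        files = self.rng.sample(self.file_list, self.batch_per_gpu)
        return {"files": files}

    def __len__(self):
        return 10 ** 6  # effectively endless; iterations are counted elsewhere


files = ["img_%d.jpg" % i for i in range(100)]
# Two "workers" created with the same seed produce identical batches.
same_seed = [RandomBatchDataset(files, seed=0)[0] for _ in range(2)]
print(same_seed[0] == same_seed[1])
```

With PyTorch's DataLoader, per-worker seeding of this kind is typically done through its `worker_init_fn` argument.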

State-of-the-Art models

PSPNet is a scene parsing network that aggregates global representation with a Pyramid Pooling Module (PPM). It is the winning model of the ILSVRC'16 MIT Scene Parsing Challenge. Please refer to https://arxiv.org/abs/1612.01105 for details.
UPerNet is a model based on Feature Pyramid Network (FPN) and Pyramid Pooling Module (PPM). It doesn't need dilated convolution, an operator that is time- and memory-consuming. Without bells and whistles, it is comparable to or even better than PSPNet, while requiring much shorter training time and less GPU memory. Please refer to https://arxiv.org/abs/1807.10221 for details.
HRNet is a recently proposed model that retains high-resolution representations throughout the model, without the traditional bottleneck design. It achieves state-of-the-art performance on a series of pixel labeling tasks. Please refer to https://arxiv.org/abs/1904.04514 for details.
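To make the pyramid pooling idea concrete, here is a minimal pure-Python sketch of the pooling step alone: one feature map is average-pooled into grids of several sizes, so that coarse grids summarize global context and fine grids keep local detail. The bin sizes (1, 2, 3, 6) follow the PSPNet paper; the helper name `adaptive_avg_pool` is hypothetical, and the convolution, upsampling, and concatenation that follow in a real PPM are omitted.

```python
import math


def adaptive_avg_pool(grid, bins):
    """Average-pool a 2D list into a bins x bins grid."""
    height, width = len(grid), len(grid[0])
    pooled = []
    for i in range(bins):
        # Window boundaries: floor for the start, ceil for the end.
        r0, r1 = (i * height) // bins, math.ceil((i + 1) * height / bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * width) // bins, math.ceil((j + 1) * width / bins)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


# A 6x6 single-channel feature map with values 0..35.
feature = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
pyramid = {bins: adaptive_avg_pool(feature, bins) for bins in (1, 2, 3, 6)}
print(pyramid[1])
print(pyramid[2])
```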

Supported models

We split our models into encoder and decoder, where encoders are usually modified directly from classification networks, and decoders consist of final convolutions and upsampling. We have provided some pre-configured models in the config folder.

Encoder:

MobileNetV2dilated
ResNet18/ResNet18dilated
ResNet50/ResNet50dilated
ResNet101/ResNet101dilated
HRNetV2 (W48)

Decoder:

C1 (one convolution module)
C1_deepsup (C1 + deep supervision trick)
PPM (Pyramid Pooling Module, see PSPNet paper for details.)
PPM_deepsup (PPM + deep supervision trick)
UPerNet (Pyramid Pooling + FPN head, see the UPerNet paper for details.)
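The deep supervision trick in the `_deepsup` decoders attaches an auxiliary classifier to an intermediate feature map and adds its loss to the main loss during training. A minimal sketch of the combination follows; the function name is hypothetical and the auxiliary weight of 0.4 is an assumption for illustration, not a value stated in this document.

```python
def deep_supervision_loss(main_loss, aux_loss, aux_weight=0.4):
    """Combine the final-layer loss with an auxiliary intermediate-layer loss."""
    # ASSUMPTION: aux_weight=0.4 is illustrative; check the config for the real value.
    return main_loss + aux_weight * aux_loss


print(deep_supervision_loss(1.20, 0.90))
```

The auxiliary branch is only needed for training; at inference time the main prediction is used on its own.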

Performance:

IMPORTANT: The base ResNet in our repository is customized (different from the one in torchvision). The base models will be automatically downloaded when needed.

Architecture	MultiScale Testing	Mean IoU	Pixel Accuracy(%)	Overall Score	Inference Speed(fps)
MobileNetV2dilated + C1_deepsup	No	34.84	75.75	54.07	17.2
MobileNetV2dilated + C1_deepsup	Yes	33.84	76.80	55.32	10.3
MobileNetV2dilated + PPM_deepsup	No	35.76	77.77	56.27	14.9
MobileNetV2dilated + PPM_deepsup	Yes	36.28	78.26	57.27	6.7
ResNet18dilated + C1_deepsup	No	33.82	76.05	54.94	13.9
ResNet18dilated + C1_deepsup	Yes	35.34	77.41	56.38	5.8
ResNet18dilated + PPM_deepsup	No	38.00	78.64	58.32	11.7
ResNet18dilated + PPM_deepsup	Yes	38.81	79.29	59.05	4.2
ResNet50dilated + PPM_deepsup	No	41.26	79.73	60.50	8.3
ResNet50dilated + PPM_deepsup	Yes	42.14	80.13	61.14	2.6
ResNet101dilated + PPM_deepsup	No	42.19	80.59	61.39	6.8
ResNet101dilated + PPM_deepsup	Yes	42.53	80.91	61.72	2.0
UPerNet50	No	40.44	79.80	60.12	8.4
UPerNet50	Yes	41.55	80.23	60.89	2.9
UPerNet101	No	42.00	80.79	61.40	7.8
UPerNet101	Yes	42.66	81.01	61.84	2.3
HRNetV2	No	42.03	80.77	61.40	5.8
HRNetV2	Yes	43.20	81.47	62.34	1.9

The training is benchmarked on a server with 8 NVIDIA Pascal Titan Xp GPUs (12GB GPU memory). The inference speed is benchmarked on a single NVIDIA Pascal Titan Xp GPU, without visualization.
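For readers unfamiliar with the two metrics in the table, the following sketch computes pixel accuracy and mean IoU on flattened label lists. It is a per-image illustration under assumptions: the function names and the ignore label of -1 are hypothetical, and an evaluation script would normally accumulate intersections and unions over the whole validation set rather than per image.

```python
def pixel_accuracy(pred, target, ignore=-1):
    """Fraction of labeled pixels whose predicted class is correct."""
    valid = [(p, t) for p, t in zip(pred, target) if t != ignore]
    return sum(p == t for p, t in valid) / len(valid)


def mean_iou(pred, target, num_classes, ignore=-1):
    """Average intersection-over-union across classes that occur."""
    ious = []
    for cls in range(num_classes):
        inter = union = 0
        for p, t in zip(pred, target):
            if t == ignore:
                continue
            inter += (p == cls and t == cls)
            union += (p == cls or t == cls)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, -1]  # the last pixel is unlabeled
print(pixel_accuracy(pred, target), mean_iou(pred, target, num_classes=3))
```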

Environment

The code is developed under the following configurations.

Hardware: >=4 GPUs for training, >=1 GPU for testing (set [--gpus GPUS] accordingly)
Software: Ubuntu 16.04.3 LTS, CUDA>=8.0, Python>=3.5, PyTorch>=0.4.0
Dependencies: numpy, scipy, opencv, yacs, tqdm

Quick start: Test on an image using our trained model

Here is a simple demo to do inference on a single image:

chmod +x demo_test.sh
./demo_test.sh

This script downloads a trained model (ResNet50dilated + PPM_deepsup) and a test image, runs the test script, and saves the predicted segmentation (.png) to the working directory.

To test on an image or a folder of images ($PATH_IMG), you can simply do the following:

python3 -u test.py --imgs $PATH_IMG --gpu $GPU --cfg $CFG

Training

Download the ADE20K scene parsing dataset:

chmod +x download_ADE20K.sh
./download_ADE20K.sh

Train a model by selecting the GPUs ($GPUS) and configuration file ($CFG) to use. During training, checkpoints are saved in the ckpt folder by default.

python3 train.py --gpus $GPUS --cfg $CFG

To choose which GPUs to use, you can pass either a range, such as --gpus 0-7, or a comma-separated list, such as --gpus 0,2,4,6.
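The two accepted forms can be expanded into a list of device indices along these lines. This is a sketch of the idea, not the repository's parser: the function name `parse_gpu_spec` is hypothetical, and accepting a mix of ranges and single indices is a choice made here.

```python
def parse_gpu_spec(spec):
    """Expand '0-7' or '0,2,4,6' (or a mix of both) into a list of GPU indices."""
    gpus = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            gpus.extend(range(start, end + 1))  # ranges are inclusive
        else:
            gpus.append(int(part))
    return gpus


print(parse_gpu_spec("0-7"))
print(parse_gpu_spec("0,2,4,6"))
```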

For example, you can start with our provided configurations:

Train MobileNetV2dilated + C1_deepsup

python3 train.py --gpus $GPUS --cfg config/ade20k-mobilenetv2dilated-c1_deepsup.yaml

Train ResNet50dilated + PPM_deepsup

python3 train.py --gpus $GPUS --cfg config/ade20k-resnet50dilated-ppm_deepsup.yaml

Train UPerNet101

python3 train.py --gpus $GPUS --cfg config/ade20k-resnet101-upernet.yaml

You can also override options on the command line, for example: python3 train.py TRAIN.num_epoch 10
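Trailing KEY VALUE pairs like these are usually merged into the loaded configuration after the YAML file has been read; yacs, listed among the dependencies, offers `CfgNode.merge_from_list` for this purpose. The dependency-free sketch below mimics that behavior on a nested dict. The function is a simplified stand-in and the default values are made up for illustration.

```python
def merge_from_list(config, overrides):
    """Apply ['SECTION.key', 'value', ...] pairs to a nested dict in place."""
    if len(overrides) % 2:
        raise ValueError("overrides must be KEY VALUE pairs")
    for key, raw in zip(overrides[0::2], overrides[1::2]):
        node = config
        *parents, leaf = key.split(".")
        for name in parents:
            node = node[name]               # KeyError on unknown sections
        if leaf not in node:
            raise KeyError(key)             # reject options that have no default
        node[leaf] = type(node[leaf])(raw)  # cast to the default's type
    return config


# Hypothetical defaults; only the option name TRAIN.num_epoch comes from the text.
defaults = {"TRAIN": {"num_epoch": 20, "lr_encoder": 0.02}}
merge_from_list(defaults, ["TRAIN.num_epoch", "10"])
print(defaults)
```

Casting through the default's type is a simplification: it would turn the string "False" into True for boolean options, so a real parser needs dedicated handling for those.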

Evaluation

Evaluate a trained model on the validation set. Add VAL.visualize True to the arguments to output visualizations like the ones shown in the teaser.

For example:

Evaluate MobileNetV2dilated + C1_deepsup

python3 eval_multipro.py --gpus $GPUS --cfg config/ade20k-mobilenetv2dilated-c1_deepsup.yaml

Evaluate ResNet50dilated + PPM_deepsup

python3 eval_multipro.py --gpus $GPUS --cfg config/ade20k-resnet50dilated-ppm_deepsup.yaml

Evaluate UPerNet101

python3 eval_multipro.py --gpus $GPUS --cfg config/ade20k-resnet101-upernet.yaml

Integration with other projects

This library can be installed via pip to integrate easily with another codebase:

pip install git+https://github.com/CSAILVision/[email protected]

The library can then be used programmatically. For example:

from mit_semseg.config import cfg
from mit_semseg.dataset import TestDataset
from mit_semseg.models import ModelBuilder, SegmentationModule

Reference

If you find the code or pre-trained models useful, please cite the following papers:

Semantic Understanding of Scenes through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso and A. Torralba. International Journal of Computer Vision (IJCV), 2018. (https://arxiv.org/pdf/1608.05442.pdf)

@article{zhou2018semantic,
  title={Semantic understanding of scenes through the ade20k dataset},
  author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Xiao, Tete and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
  journal={International Journal of Computer Vision},
  year={2018}
}

Scene Parsing through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. Computer Vision and Pattern Recognition (CVPR), 2017. (http://people.csail.mit.edu/bzhou/publication/scene-parse-camera-ready.pdf)

@inproceedings{zhou2017scene,
    title={Scene Parsing through ADE20K Dataset},
    author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
    year={2017}
}

Owner

MIT CSAIL Computer Vision
