Scene parsing segments an image into regions, each associated with a semantic category such as sky, road, person, or bed. The MIT Scene Parsing Benchmark (SceneParse150) provides a standard training and evaluation platform for scene parsing algorithms. The data for this benchmark comes from the ADE20K Dataset, which contains more than 20K scene-centric images exhaustively annotated with objects and object parts. Specifically, the benchmark is divided into 20K images for training, 2K images for validation, and another batch of held-out images for testing. In total, 150 semantic categories are included for evaluation, covering stuff classes such as sky, road, and grass as well as discrete objects such as person, car, and bed. Note that the distribution of objects across the images is non-uniform, mimicking the natural occurrence of objects in everyday scenes.

For each image, segmentation algorithms will produce a semantic segmentation mask, predicting the semantic category for each pixel in the image. Performance is evaluated as the mean of two measures: the pixel-wise accuracy and the Intersection over Union (IoU) averaged over all 150 semantic categories.

The data in the benchmark was used in the Scene Parsing Challenge 2016, held jointly with ILSVRC'16, and in the Places Challenge 2017, held jointly with the COCO Challenge. A demo of scene parsing is available, and the pre-trained models and demo code are released.


Scene Parsing

Data: [train/val (922MB)] [test (203MB)]
Demo: Scene Parsing web demo.
Toolkit: Development tool for Scene Parsing.
Model Zoo: Pre-trained models.
Codebase: [Caffe/Torch7] [PyTorch]

Instance Segmentation

Data: [Images (851MB)] [Annotations (86MB)] [test (203MB)]
Toolkit: Development tool for Instance Segmentation.
Training set
20,210 images (browse)

Validation set
2,000 images (browse)

Test set


To evaluate the segmentation algorithms, we will take the mean of the pixel-wise accuracy and the class-wise IoU as the final score. Pixel-wise accuracy is the ratio of pixels that are correctly predicted, while class-wise IoU is the Intersection over Union of pixels, averaged over all 150 semantic categories. Refer to the Development Kit for details.
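The scoring rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Development Kit's code: the function name benchmark_score is hypothetical, and the sketch assumes that masks are integer arrays with labels 1..150, that label 0 marks unlabeled pixels excluded from scoring, that intersections and unions are accumulated over all images before dividing, and that classes with an empty union are skipped when averaging. Check each of these against the Development Kit before relying on the numbers.

```python
import numpy as np

NUM_CLASSES = 150  # number of semantic categories stated by the benchmark


def benchmark_score(pairs, num_classes=NUM_CLASSES):
    """Return (pixel_accuracy, mean_iou, final_score) for (pred, label) mask pairs.

    Assumptions (not confirmed by the page): classes are 1..num_classes,
    label 0 means "unlabeled" and is ignored, and per-class intersections
    and unions are summed over all images before the IoU is computed.
    """
    correct = 0
    labeled = 0
    inter = np.zeros(num_classes, dtype=np.float64)
    union = np.zeros(num_classes, dtype=np.float64)

    for pred, label in pairs:
        pred = np.asarray(pred, dtype=np.int64)
        label = np.asarray(label, dtype=np.int64)
        valid = label > 0                 # pixels that carry a ground-truth class
        hit = valid & (pred == label)     # correctly predicted pixels

        correct += int(hit.sum())
        labeled += int(valid.sum())

        # Per-class pixel counts; index 0 (unlabeled) is sliced away.
        def count(values):
            return np.bincount(values, minlength=num_classes + 1)[1:num_classes + 1]

        area_inter = count(label[hit])
        area_pred = count(pred[valid])
        area_label = count(label[valid])
        inter += area_inter
        union += area_pred + area_label - area_inter

    pixel_acc = correct / labeled if labeled else 0.0
    present = union > 0                   # skip classes never seen or predicted
    mean_iou = float((inter[present] / union[present]).mean()) if present.any() else 0.0
    return pixel_acc, mean_iou, (pixel_acc + mean_iou) / 2.0


if __name__ == "__main__":
    # Toy 2x3 example with 3 classes; the pixel labeled 0 is ignored.
    label = np.array([[1, 1, 2], [2, 0, 3]])
    pred = np.array([[1, 2, 2], [2, 1, 1]])
    acc, miou, score = benchmark_score([(pred, label)], num_classes=3)
    print(f"{acc:.3f} {miou:.3f} {score:.3f}")  # prints "0.600 0.333 0.467"
```

In the toy example, 3 of the 5 labeled pixels are correct (accuracy 0.6), and the per-class IoUs are 1/3, 2/3, and 0 (mean 1/3), so the final score is their average. Skipping classes with an empty union only matters for small subsets of the data, where some of the 150 categories may never appear.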

You can submit prediction results for the test set to the Evaluation Server. To prevent overfitting, submissions are limited to two per week.



Bolei Zhou

Hang Zhao

Xavier Puig

Sanja Fidler
University of Toronto

Adela Barriuso

Antonio Torralba


If you find this scene parsing challenge or the data useful, please cite the following papers:

Scene Parsing through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. Computer Vision and Pattern Recognition (CVPR), 2017. [PDF] [bib]

Semantic Understanding of Scenes through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso and A. Torralba. International Journal of Computer Vision (IJCV). [PDF] [bib]