The goal of this challenge is to segment and parse an image into different image regions associated with semantic categories, such as sky, road, person, and bed. The data for this challenge comes from the ADE20K dataset (the full dataset will be released after the challenge), which contains more than 20K scene-centric images exhaustively annotated with objects and object parts. Specifically, the challenge data is divided into 20K images for training, 2K images for validation, and another batch of held-out images for testing. In total, 150 semantic categories are included in the challenge for evaluation, covering stuff categories like sky, road, and grass, as well as discrete objects like person, car, and bed. Note that the distribution of objects occurring in the images is non-uniform, mimicking the natural object occurrence in daily scenes.

For each image, segmentation algorithms will produce a semantic segmentation mask, predicting the semantic category for each pixel in the image. The performance of the algorithms will be evaluated on the mean of pixel-wise accuracy and the Intersection over Union (IoU) averaged over all the 150 semantic categories.

This scene parsing challenge is held jointly with ILSVRC'16. Winners will be invited to present at the ILSVRC and COCO joint workshop at ECCV 2016. See also the related Places2 Scene Recognition Challenge 2016.

A demo of scene parsing is available. The pre-trained models and demo code for scene parsing have been released.


  • June 22, 2016: Development kit, data, and evaluation software made available.
  • Sep. 4, 2016: Testing set will be released.
  • Sep. 9, 2016, 5pm PDT Extended to Sep. 16, 2016, 5pm PDT: Submission deadline.
    NOTICE (Sep. 15, 9:00 am EST): the ILSVRC submission server is back up now. The new deadline is September 18, 5pm PST.
  • Sep. 23, 2016: Challenge results released
  • Oct. 8, 2016: Winner(s) present at the ECCV 2016 ImageNet and COCO Joint Visual Recognition Workshop
Downloads


    Development toolkit

    Training set
    20,210 images (browse)

    Validation set
    2,000 images (browse)

    Test set


    To evaluate the segmentation algorithms, we will take the mean of the pixel-wise accuracy and the class-wise IoU as the final score. Pixel-wise accuracy is the ratio of pixels that are correctly predicted, while class-wise IoU is the Intersection over Union of pixels averaged over all 150 semantic categories. Refer to the development kit for details.
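    The metric above can be sketched as follows. This is a minimal NumPy sketch for illustration only, not the official evaluation code (the development kit contains the reference implementation); the function name and the convention that label 0 marks unlabeled pixels are assumptions here.

    ```python
    import numpy as np

    def parsing_score(pred, gt, num_classes=150, ignore_label=0):
        """Sketch of the challenge metric: mean of pixel accuracy and class-wise IoU.

        pred, gt: integer label maps of the same shape, labels 1..num_classes,
        with `ignore_label` marking unlabeled pixels (excluded from scoring).
        """
        valid = gt != ignore_label
        # Pixel-wise accuracy: fraction of labeled pixels predicted correctly.
        pixel_acc = np.mean(pred[valid] == gt[valid])

        # Class-wise IoU, averaged over classes present in prediction or ground truth.
        ious = []
        for c in range(1, num_classes + 1):
            p = (pred == c) & valid
            g = (gt == c) & valid
            inter = (p & g).sum()
            union = p.sum() + g.sum() - inter
            if union > 0:  # skip classes absent from both maps
                ious.append(inter / union)
        mean_iou = np.mean(ious)

        # Final score: mean of the two measures.
        return (pixel_acc + mean_iou) / 2.0
    ```

    For example, with a 2-class toy pair where one of four pixels is mispredicted, the function returns the average of a 0.75 pixel accuracy and the per-class IoU mean.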

    NOTICE FOR PARTICIPANTS: In the challenge, you may use any pre-trained models as the initialization, but you must state in your submission description which models were used. There is only a "provided data" track for the scene parsing challenge at ILSVRC'16, which means you may only use the images and annotations provided; you may not use any other images or segmentation annotations, such as Pascal or Cityscapes. Scene labels will not be provided for the testing images.

    UPDATE (August 27, 2016): Some participants have raised concerns about the definition of pre-trained models in the "provided data" track. Since this is a "provided data" track, any pre-trained models must be trained ONLY on the data provided in ILSVRC'16: you may use models pre-trained on the standard 1.2M ImageNet data and the Places365 data, but not models pre-trained on Cityscapes, Pascal, COCO, or other external or internal datasets.


    Bolei Zhou

    Hang Zhao

    Xavier Puig

    Sanja Fidler
    University of Toronto

    Adela Barriuso

    Antonio Torralba


    If you find this scene parsing challenge or the data useful, please cite the following paper:

    Semantic Understanding of Scenes through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:1608.05442