Open Vocabulary Scene Parsing

Hang Zhao¹, Xavier Puig¹, Bolei Zhou¹, Sanja Fidler², Antonio Torralba¹
¹Massachusetts Institute of Technology, ²University of Toronto

Abstract

Recognizing arbitrary objects in the wild has been a challenging problem due to the limitations of existing classification models and datasets. In this paper, we propose a new task that aims at parsing scenes with a large and open vocabulary, and we explore several evaluation metrics for this problem. Our approach is a framework that jointly embeds image pixels and word concepts, where word concepts are connected by semantic relations. We validate the open vocabulary prediction ability of our framework on the ADE20K dataset, which covers a wide variety of scenes and objects. We further explore the trained joint embedding space to show its interpretability.
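
To make the idea of a joint embedding space concrete, here is a minimal sketch of how a pixel could be labeled by its nearest word concept. Everything in it is an assumption for illustration: the three-dimensional toy vectors, the names `CONCEPTS`, `rank_concepts`, and `parse_pixels`, and the use of cosine similarity as a stand-in scoring function. This page does not describe the scoring function or the way semantic relations between concepts are used, so neither is modeled here.

```python
import math

# Hypothetical toy word-concept embeddings. In a real system these would be
# learned jointly with the pixel embeddings, not written by hand.
CONCEPTS = {
    "chair": [0.9, 0.1, 0.0],
    "table": [0.7, 0.6, 0.1],
    "tree": [0.0, 0.2, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_concepts(pixel_embedding, concepts=CONCEPTS):
    """Return (concept, score) pairs sorted from most to least similar."""
    scores = [(name, cosine(pixel_embedding, vec)) for name, vec in concepts.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def parse_pixels(pixel_embeddings, concepts=CONCEPTS):
    """Label every pixel embedding with its top-ranked concept."""
    return [rank_concepts(p, concepts)[0][0] for p in pixel_embeddings]


if __name__ == "__main__":
    # Two assumed pixel embeddings, one near "chair" and one near "tree".
    print(parse_pixels([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]))
```

Because the vocabulary in this sketch is just a dictionary of vectors, adding a concept means adding an entry to `CONCEPTS` without touching the pixel side, which is one simple way to picture an open vocabulary.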

Paper and Dataset

Read our ICCV paper HERE.
Download the concept graph for the ADE20K dataset HERE.

Citation

          @inproceedings{openvoc2017,
            title = {Open Vocabulary Scene Parsing},
            author = {Zhao, Hang and Puig, Xavier and Zhou, Bolei and Fidler, Sanja and Torralba, Antonio},
            booktitle = {International Conference on Computer Vision (ICCV)},
            year = {2017}}

Acknowledgement: This work was supported by Samsung and NSF grant No. 1524817 to AT. SF acknowledges the support from NSERC. BZ is supported by a Facebook Fellowship. We thank Wei-Chiu Ma and Yusuf Aytar for insightful discussions.