Object-Oriented Representation Learning - A Primer



Generative models (and most other deep models) are currently not very explainable. We cannot pinpoint which part of a learned representation is responsible for which characteristic of an object, or which component produces which part or segment of the output. This is because the information in the data is encoded implicitly rather than in separate, identifiable components.

To solve this, we need to learn disentangled representations of the concepts or objects present in the input (say, an image). This breaks down into the following sub-problems.

  1. We need to determine how many concepts are present and how many of them we care about.
  1. We need to be able to manipulate those concepts.
  1. If possible, the model should be self-supervised, i.e. no external labels should be used.
  1. If possible, we should also be able to model interactions: if a ball is placed on a bedsheet, the bedsheet should show folds consistent with the ball.

The advantages of such a model are:

  1. Our models become more explainable.
  1. We get composability, which lets us generate many synthetic examples by recombining objects.
  1. We can learn representations that are transferable.

This primer is an introduction to such models and a way to get up to speed on the current literature (as of December 2020).


  1. Scene - a setting of a group of objects. For example, your table (with a laptop, a coffee mug, a monitor, etc.) can be a scene.


The papers I will be reviewing here are

  1. SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition

SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition


This paper deals primarily with pictures that contain more than a few objects (we will refer to them as scenes from now on, because a scene can have many pictures taken from different angles). We would like an object-level representation of each object rather than one big representation for the entire scene.
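To make "object-level representation" concrete, here is a minimal sketch of what a scene could look like as a list of per-object latents instead of one big vector. The class `ObjectLatent`, its fields `where` and `what`, and the helper `move_object` are hypothetical names chosen for illustration; they are not taken from the paper's code.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class ObjectLatent:
    """Hypothetical per-object representation (illustrative only)."""
    where: Tuple[float, float, float, float]  # bounding box as (cx, cy, w, h)
    what: Tuple[float, ...]                   # appearance code for this object


def move_object(scene: List[ObjectLatent], index: int,
                dx: float, dy: float) -> List[ObjectLatent]:
    """Return a new scene in which only object `index` has been shifted."""
    cx, cy, w, h = scene[index].where
    moved = replace(scene[index], where=(cx + dx, cy + dy, w, h))
    return scene[:index] + [moved] + scene[index + 1:]


# A toy scene with two objects; the numbers are made up.
scene = [
    ObjectLatent(where=(0.25, 0.5, 0.1, 0.1), what=(0.3, -1.2, 0.8)),  # e.g. a mug
    ObjectLatent(where=(0.75, 0.5, 0.2, 0.2), what=(1.1, 0.4, -0.5)),  # e.g. a laptop
]

# Move the first object to the right; the second object is left untouched.
edited = move_object(scene, 0, dx=0.25, dy=0.0)
```

Because each object has its own entry, an edit touches exactly one element. With a single entangled vector for the whole scene, there would be no obvious coordinate to change in order to move just the mug.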

What was wrong with earlier methods?

This method can separate foreground objects as well as background segments. Earlier approaches, based on either spatial attention or scene mixtures, were good at one of these but not both. They also did not scale well to scenes with many objects.
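One way to picture handling foreground and background together is a per-pixel blend: a foreground mask decides where objects are drawn, and the remaining pixels are filled by a weighted mixture of background segments. The sketch below is a simplified illustration in NumPy under that assumption, not the paper's implementation. The function `composite` and all array names are hypothetical.

```python
import numpy as np


def composite(fg, alpha, bg_segments, bg_weights):
    """Blend a foreground image with a mixture of background segments.

    fg:          (H, W, 3) rendered foreground objects
    alpha:       (H, W, 1) foreground mask in [0, 1]
    bg_segments: (K, H, W, 3) one image per background segment
    bg_weights:  (K, H, W, 1) per-pixel segment weights, summing to 1 over K
    """
    # Mixture over the K background segments.
    background = (bg_weights * bg_segments).sum(axis=0)
    # Foreground where alpha is high, background elsewhere.
    return alpha * fg + (1.0 - alpha) * background


H = W = 4

# Toy foreground: a white 2x2 object in the middle of the image.
fg = np.ones((H, W, 3))
alpha = np.zeros((H, W, 1))
alpha[1:3, 1:3] = 1.0

# Two toy background segments: one owns the left half, one the right half.
bg_segments = np.stack([np.full((H, W, 3), 0.2), np.full((H, W, 3), 0.6)])
bg_weights = np.zeros((2, H, W, 1))
bg_weights[0, :, :2] = 1.0
bg_weights[1, :, 2:] = 1.0

image = composite(fg, alpha, bg_segments, bg_weights)
```

In this toy setup the object pixels come from `fg`, while the corner pixels come from whichever background segment owns that side of the image.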