Object-Oriented Representation Learning - A Primer

Currently, generative (and other) models are not very explainable: we cannot pinpoint which part of a learned representation is responsible for which characteristics of an object, because the information in the data is encoded implicitly. We cannot identify which component of the representation corresponds to which part or segment of the output.

In order to solve this issue, we need to learn disentangled representations of the concepts/objects present in the input (say, an image). This poses several sub-problems:

  1. We need to identify how many concepts are present and which ones we care about.
  1. We need to be able to manipulate those concepts.
  1. If possible, the method should be self-supervised, i.e. no external labels should be used.
  1. If possible, we should also be able to model interactions, e.g. if a ball is placed on a bedsheet, the bedsheet should have consistent folds in it due to the ball.

The advantages of such a model are:

  1. Our models become more explainable.
  1. We get composability, from which we can generate a lot of synthetic examples.
  1. We can learn representations that are transferable.

This primer is primarily an introduction to such models, aimed at getting you up to speed on the current literature (as of December 2020).

NOTE: Whenever possible, I will be using the notation followed by the paper. If you want to dive deep into this field, please do read the papers. This blog post is meant for those who would like to get the gist of a paper without reading it in detail.


  1. Scene - a setting of a group of objects. For example, your table (which has a laptop, coffee mug, monitor, etc.) can be a scene.


The papers I will be reviewing here are

  1. SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition
  1. SCALOR: Generative World Models With Scalable Object Representations

SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition


This paper primarily deals with pictures (we will refer to them as scenes from now on, because a scene can have many pictures taken from different angles) that contain more than a few objects. We would like to have object-level representations of each object rather than one big representation for the entire scene.

What was wrong with earlier methods?

This method can separate foreground objects as well as background segments. Earlier methods, such as spatial-attention and scene-mixture models, were good at one or the other but not both. Earlier methods also didn't scale well to scenes with many objects.

What is this method?

SPACE (Spatially Parallel Attention and Component Extraction) provides us with disentangled representations of each object in the scene. It captures each foreground object with a bounding box, decomposes the background into $K$ components, and gives representations of each of those components too. It processes the foreground objects in parallel, which alleviates the scalability problem that other models had. The representations capture information such as position and scale explicitly. It uses probabilistic latent variable modelling (I will explain this briefly) to do this.
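To make the "explicit information" idea concrete, here is a minimal illustrative container for a per-cell foreground latent. The field names loosely follow the SPACE paper's factorization (presence, position/scale, depth, appearance), but the shapes and dtypes here are my own simplifications, not the paper's exact parameterization:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ForegroundCell:
    """Illustrative per-cell foreground latent, loosely following SPACE.

    Field names mirror the paper's factorization; shapes are simplifications.
    """
    z_pres: float        # Bernoulli: is there an object in this cell?
    z_where: np.ndarray  # bounding box (center x, center y, width, height)
    z_depth: float       # relative depth, used to resolve overlapping boxes
    z_what: np.ndarray   # appearance code, decoded into the object's glimpse

# One hypothetical cell containing an object near the image center
cell = ForegroundCell(z_pres=1.0,
                      z_where=np.array([0.5, 0.5, 0.2, 0.3]),
                      z_depth=0.7,
                      z_what=np.zeros(32))
```

Because position and scale live in `z_where` explicitly, moving or resizing an object is a direct edit to that vector rather than a blind perturbation of an entangled code.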

Some assumptions that this paper makes

The foreground and background are combined to form a distribution over each pixel, from which the image is generated, using the following equation:

$$p(x \mid z^{fg}, z^{bg}) = \textcolor{green}{\alpha\, p(x \mid z^{fg})} + \textcolor{blue}{(1 - \alpha) \sum_{k=1}^{K} \pi_{k}\, p(x \mid z_{k}^{bg})}$$

Green denotes the effect of the foreground on the pixel distribution and blue denotes the effect of the background. $\pi_{k}$ denotes the weight given to the $k$-th background component; these weights sum to one, i.e. $\sum_{k=1}^{K} \pi_{k} = 1$.

$\alpha$ denotes the probability (or the amount of importance) we want to give to the foreground when generating the pixel. The authors give precedence to the foreground, and then give the remaining $1 - \alpha$ to the background.
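As a concrete sketch, the mixture above can be evaluated per pixel with NumPy. The Gaussian components, their means, and the noise scale `sigma` below are made-up stand-ins for the decoder outputs, chosen only to show the structure of the equation:

```python
import numpy as np

def pixel_likelihood(x, alpha, fg_mean, bg_means, pi, sigma=0.1):
    """Mixture likelihood p(x | z_fg, z_bg) for a single pixel value x.

    alpha    : foreground mixing probability in [0, 1]
    fg_mean  : mean of the foreground component p(x | z_fg)
    bg_means : length-K array of background component means p(x | z_k_bg)
    pi       : length-K array of background weights, summing to 1
    """
    def gauss(v, mu):
        return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    fg = alpha * gauss(x, fg_mean)                       # green term
    bg = (1 - alpha) * np.sum(pi * gauss(x, bg_means))   # blue term
    return fg + bg

# Toy numbers: the foreground component explains this pixel well
p = pixel_likelihood(x=0.8, alpha=0.9, fg_mean=0.8,
                     bg_means=np.array([0.1, 0.5]), pi=np.array([0.6, 0.4]))
```

With `alpha` close to 1, the likelihood is dominated by the green foreground term, matching the precedence the authors give to the foreground.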

We will now look at how it models the foreground and the background portion of the scenes and also how we can train this beast.


I would definitely urge you to check out the paper for the experimental results and the qualitative and quantitative comparisons with other models. To know more about VAEs, I would suggest you watch this video by Pieter Abbeel.



SCALOR: Generative World Models With Scalable Object Representations

This model aims to make object-oriented representation learning efficient and scalable in temporal (changing with time) scenes.

What was wrong with the earlier methods?

Scalability is an issue when learning object-oriented representations: the denser the objects in a scene, the slower and more inefficient existing algorithms become. Earlier methods also couldn't model complex backgrounds separately.

What is this method?

SCALOR parallelizes both propagation (carrying information forward from the previous frame) and discovery (finding new objects in the current frame). It reduces the time complexity of processing each image from $\mathcal{O}(N)$ to $\mathcal{O}(1)$, where $N$ is the number of objects.

SQAIR, a previous model used for this task, was based on an RNN, which increased computation time considerably. Removing the need for an RNN makes the process much faster.

Now we will see how SCALOR works. First comes the generative process, i.e. how the representations are generated, which includes the proposal-rejection mechanism that makes SCALOR scale. Then we will look at some problems with this process and how they are avoided or solved.
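Before diving in, here is a loose sketch of the proposal-rejection idea: newly discovered boxes that overlap too heavily with objects propagated from the previous frame are rejected, so the same object is not tracked twice. The box format, the IoU criterion, and the threshold value below are my own assumptions for illustration, not SCALOR's exact rule:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def reject_duplicates(propagated, proposals, thresh=0.5):
    """Keep only proposals that do not heavily overlap any propagated object."""
    return [p for p in proposals
            if all(iou(p, q) < thresh for q in propagated)]

propagated = [(0.0, 0.0, 1.0, 1.0)]      # object carried over from frame t-1
proposals = [(0.1, 0.1, 0.9, 0.9),       # overlaps the propagated box -> rejected
             (2.0, 2.0, 3.0, 3.0)]       # genuinely new object -> kept
kept = reject_duplicates(propagated, proposals)
```

Because each proposal is checked independently, this filtering step is also easy to parallelize across proposals, which fits the scalability theme above.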