# Introduction

Current generative (and other) models are not very explainable: we cannot pinpoint which part of a learned representation is responsible for which characteristic of the object, because the information in the data is encoded in an implicit form.

In order to solve this issue, we need to learn disentangled representations of the concepts/objects present in the input (say an image). This poses several sub-problems.

1. We need to understand how many concepts are present and how many of them we care about.
1. We need to be able to manipulate those concepts.
1. If possible, the method should be self-supervised, i.e. no external labels should be used.
1. If possible, we should also be able to model interactions, e.g. if a ball is placed on a bedsheet, the sheet should have consistent folds in it due to the ball.

The advantages of such a model are

1. Our models become more explainable.
1. We get composability, with which we can generate a lot of synthetic examples.
1. We can learn representations that are transferable.

This primer is an introduction to such models, meant to get you up to speed on the current literature (as of December 2020).

NOTE: Whenever possible, I will use the notation followed by the paper. If you want to dive deep into this field, please do read the papers. This blog post is meant for those who would like to get the gist of a paper without reading it in detail.

# Terminology

1. Scene - a setting of a group of objects. For example, your table (which has a laptop, coffee mug, monitor, etc.) can be a scene.

# Papers

The papers I will be reviewing here are

1. SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition
1. SCALOR: Generative World Models With Scalable Object Representations

## SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition

### Aim

This paper deals primarily with pictures (we will refer to them as scenes from now on, because a scene can be captured in many pictures taken from different angles) that contain more than a few objects. We would like to have object-level representations of each object rather than one big representation for the entire scene.

### What was wrong with earlier methods?

This method can separate foreground objects as well as background segments. Earlier methods, based on either spatial attention or scene mixtures, were good at one or the other but not both. Earlier methods also didn't scale well to scenes with many objects.

### What is this method?

SPACE (Spatially Parallel Attention and Component Extraction) provides us with disentangled representations of each object in the scene. It captures each foreground object with a bounding box, then decomposes the background into $K$ components and gives representations of those components too. It processes the foreground objects in parallel, which alleviates the scalability problem of earlier models. The representations capture information such as position and scale explicitly. It uses probabilistic latent variable modelling (explained briefly below) to do this.

Some assumptions that this paper makes

• Scene X is decomposed into two independent factors, foreground ($z^{fg}$﻿) and background ($z^{bg}$﻿).
• The background is decomposed further into $K$﻿ background segments $z^{bg} = z^{bg}_{1:K}$﻿ where $K$﻿ is known beforehand.

The foreground and background are combined to form a distribution over each pixel, from which the image is generated, using the following equation (reconstructed from the paper):

$$p(x \mid z^{fg}, z^{bg}) = \alpha\, p(x \mid z^{fg}) + (1 - \alpha) \sum_{k=1}^{K} \pi_{k}\, p(x \mid z^{bg}_{k})$$

The first term denotes the effect of the foreground on the pixel distribution and the second denotes the effect of the background. $\pi_{k}$ denotes the weight given to background component $k$, and these weights sum to one, i.e. $\sum_{k=1}^{K}\pi_{k} = 1$.

$\alpha$ denotes the probability (or the amount of importance) we want to give to the foreground when generating the pixel. The authors give precedence to the foreground, and then give the remaining $1 - \alpha$ to the background.
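As a toy illustration (not the authors' code), the per-pixel foreground/background mixture described above can be sketched in numpy, with random stand-ins for the component likelihoods:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, K = 4, 4, 3  # illustrative sizes, not values from the paper

alpha = rng.random((H, W))                  # foreground weight per pixel
pi = rng.random((H, W, K))
pi = pi / pi.sum(axis=-1, keepdims=True)    # background weights sum to 1 per pixel

p_fg = rng.random((H, W))                   # stand-in for p(x | z_fg) per pixel
p_bg = rng.random((H, W, K))                # stand-in for p(x | z_bg_k) per pixel

# Pixel-wise mixture: alpha * foreground + (1 - alpha) * weighted background
p_x = alpha * p_fg + (1 - alpha) * (pi * p_bg).sum(axis=-1)
```

Because $\alpha$ and the $\pi_k$ are valid probabilities, the result stays a valid per-pixel mixture.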

We will now look at how it models the foreground and the background portion of the scenes and also how we can train this beast.


• Foreground

Each image is divided into $H \times W$ cells.

Now, each cell is associated with four latents. Those are

1. $z^{pres}$ - This denotes the presence ($=1$) or absence ($=0$) of an object in the cell. It is a binary random variable.
1. $z^{where}$ - This denotes the location (relative to the cell) and the size of the object; modelled as a Gaussian.
1. $z^{depth}$ - This denotes the depth of the object, so that we can model which object occludes which others in the scene; modelled as a Gaussian.
1. $z^{what}$ - This denotes which object it is, and therefore the mask and appearance of the object (whether it is a square, circle, cone, etc.); modelled as a Gaussian.

These four latents, combined over all the cells, give us $z^{fg}$. The prior factorizes as follows (reconstructed from the paper), with $z^{pres}$ gating the other latents:

$$p(z^{fg}) = \prod_{i=1}^{H \times W} p(z_{i}^{pres}) \left( p(z_{i}^{where})\, p(z_{i}^{depth})\, p(z_{i}^{what}) \right)^{z_{i}^{pres}}$$

Now $z^{fg}$ is used to compute the image, i.e. $p(x|z^{fg})$, which is modelled as a Gaussian distribution $\mathcal{N}(\mu^{fg}, \sigma_{fg}^{2})$. Here, $\sigma_{fg}$ is treated as a hyper-parameter and the model estimates only $\mu^{fg}$.
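To make the cell-latent structure concrete, here is a minimal sampling sketch; the grid size, the $z^{what}$ dimensionality, and the presence probability are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 4          # grid of cells (illustrative)
D_WHAT = 8           # dimensionality of z_what (illustrative)

# z_pres is a Bernoulli per cell; the other latents are diagonal Gaussians.
z_pres = rng.random((H, W)) < 0.3                # is an object present in the cell?
z_where = rng.normal(size=(H, W, 4))             # (offset_x, offset_y, scale_x, scale_y)
z_depth = rng.normal(size=(H, W, 1))             # relative depth for occlusion ordering
z_what = rng.normal(size=(H, W, D_WHAT))         # appearance/identity code

# Only cells with z_pres = 1 contribute objects to the rendered foreground.
n_objects = int(z_pres.sum())
```

Each cell carries its own full set of latents; $z^{pres}$ simply decides whether they are used.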

• Background

For modelling the background, it models each $z_{k}^{bg}$ as a combination $(\textbf{z}_{k}^{m}, \textbf{z}_{k}^{c})$. Their meanings are as follows:

1. $z_{k}^{m}$ - This models the mixing probability $\pi_{k}$ for each component.
1. $z_{k}^{c}$ - This models the RGB distribution $p(x|z_{k}^{bg})$ of the $k^{th}$ background component, as a Gaussian $\mathcal{N}(\mu_{k}^{bg}, \sigma_{bg}^{2})$.

Now, the background latents' joint distribution is modelled by combining these per-component latents $(\textbf{z}_{k}^{m}, \textbf{z}_{k}^{c})$ across the $K$ components.

• Inference and Training

Since all of these distributions are intractable (i.e. they cannot be evaluated exactly with reasonable compute) because the latents are continuous, we borrow a page from the VAE book.

This model is trained using variational approximation, similar to the Variational AutoEncoder (VAE), where we use a surrogate (inference) network to infer the most useful latents from the data $X$ itself. We then maximize the ELBO (Evidence Lower BOund) as the training objective.
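For intuition, a single-sample ELBO for this kind of setup (a Gaussian decoder with fixed $\sigma$, a diagonal-Gaussian posterior, and a standard-normal prior) can be computed as below. This is a generic VAE sketch, not SPACE's exact objective:

```python
import numpy as np

def gaussian_elbo(x, mu_x, mu_q, logvar_q, sigma=0.1):
    """One-sample ELBO: reconstruction log-likelihood under a Gaussian
    decoder N(mu_x, sigma^2) minus the KL from the diagonal-Gaussian
    posterior q = N(mu_q, exp(logvar_q)) to a standard-normal prior."""
    recon = -0.5 * np.sum((x - mu_x) ** 2 / sigma**2 + np.log(2 * np.pi * sigma**2))
    kl = 0.5 * np.sum(np.exp(logvar_q) + mu_q**2 - 1.0 - logvar_q)
    return recon - kl
```

When the posterior equals the prior ($\mu_q = 0$, $\log\sigma_q^2 = 0$), the KL term vanishes and the ELBO reduces to the reconstruction term alone.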

But where is the "parallelism" in all this? It comes from the mean-field approximation the model uses when it infers the cell latents, which allows each cell's latents to be inferred independently of every other cell's. To learn more about mean-field approximation, I would recommend this article from Brian Keng. So essentially,

".....this allows each cell to act as an independent object detector, spatially attending to its own local region in parallel."

I would definitely urge you to check out the paper for the experimental results and qualitative and quantitative comparisons with other models. To know more about VAEs, I would suggest you watch this video by Pieter Abbeel.

## SCALOR: Generative World Models With Scalable Object Representations

### Aim

This model aims to make object-oriented representation learning efficient and scalable in temporal (time-varying) scenes.

### What was wrong with the earlier methods?

Scalability is an issue when learning object-oriented representations: the denser the objects in the scene, the slower and less efficient existing algorithms become. Earlier methods also couldn't model complex backgrounds separately.

### What is this method?

SCALOR parallelizes both propagation (carrying object information over from the previous frame) and discovery (finding new objects in the current frame). It reduces the time complexity of processing each frame from $\mathcal{O}(N)$ to $\mathcal{O}(1)$, where $N$ is the number of objects.

SQAIR, a previous model for this task, was based on an RNN that processed objects sequentially, which increased computation time by a lot. Removing the need for this sequential processing makes SCALOR much faster.

• Predecessor SQAIR

SQAIR models a sequence of images as $X = X_{1:T}$ over $T$ timesteps, where $X_{t}$ is the frame at timestep $t \le T$. Each frame is generated from the latent variable $z_{t}^{o} = \{z_{t,n}\}_{n \in O_{t}}$, where $O_{t}$ is the set of objects present at timestep $t$. Each $z_{t,n}$ has three factored latents.

• $z_{t,n}^{pres}$﻿ - This indicates the presence of an object.
• $z_{t,n}^{where}$﻿ - This represents pose and location of the object.
• $z_{t,n}^{what}$﻿ - This represents the identity or the appearance of the object.

SQAIR models both propagation and discovery, so an object can appear or disappear at any time $t > 0$. Combining propagation and discovery, the objects present at each timestep are modelled. The overall generative process factorizes roughly as follows (reconstructed; writing $z^{P}_{t}$ for propagated and $z^{D}_{t}$ for newly discovered latents):

$$p(x_{1:T}, z^{\mathbf{O}}_{1:T}) = \prod_{t=1}^{T} p(z^{P}_{t} \mid z^{\mathbf{O}}_{t-1})\; p(z^{D}_{t} \mid z^{P}_{t})\; p(x_{t} \mid z^{\mathbf{O}}_{t})$$

The first factor is propagation, carrying objects over from the foreground of the previous frame. The second is discovery: given the propagated objects, it determines how many new objects appear in the current frame and their latents. The third generates the current frame from that frame's foreground latents, $z^{\mathbf{O}}_{t} = (z^{P}_{t}, z^{D}_{t})$.

For SQAIR, $z^{\mathbf{O}}_t = z_{t}^{fg}$.

Like SCALOR, SQAIR was trained using a VAE-esque objective (more specifically, the importance-weighted autoencoder, IWAE).

Now we will see how SCALOR works. First comes the generative process, i.e. how the representations are generated, which includes the proposal-rejection mechanism that makes SCALOR scale. Then we will look at some problems with this process and how they are avoided or solved.

• Generative Process

In SCALOR, an image $\mathbf{x}_t$ is generated by the background and foreground latents $z_{t}^{bg}$ and $z_{t}^{fg}$ respectively. The foreground is broken into $z_t^{fg}= \{z_{t,n}\}_{n \in O_t}$, where $O_t$ denotes the objects at time $t$.

Each object latent is again factored into three parts, $z_{t,n}^{pres}$, $z_{t,n}^{where}$, and $z_{t,n}^{what}$, just as in SQAIR. $z_{t,n}^{where}$ is further broken into $z_{t,n}^{pos}, z_{t,n}^{scale}, z_{t,n}^{depth}$, which represent centre position, scale, and depth. So in a nutshell, this is how the representations are factored for each frame.

• Propagation

The propagation part is modelled as follows: the presence of an object carried over from the previous frame is a Bernoulli random variable with parameter $\beta_{t,n}$, computed from an RNN as $\beta_{t,n} = f_{mlp}(h_{t,n})$, where $h_{t,n}$ is the object's RNN hidden state. The remaining latents of an object are updated only when that particular object is propagated (i.e. its presence variable is 1). Unlike SQAIR, this is done completely in parallel across objects.
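A minimal numpy sketch of this parallel presence update, with a single linear layer standing in for $f_{mlp}$ (all shapes are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

N, D = 5, 16                          # N tracked objects, hidden size D (illustrative)
h = rng.normal(size=(N, D))           # per-object RNN hidden states h_{t,n}
w_out = rng.normal(size=(D,)) / np.sqrt(D)

# beta_{t,n} = f_mlp(h_{t,n}); here a single linear layer + sigmoid stands in.
beta = sigmoid(h @ w_out)
z_pres = rng.random(N) < beta         # Bernoulli sample: which objects survive
```

All $N$ objects go through one batched matrix product, so there is no sequential loop over objects.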

• Proposal - Rejection

This is the main mechanism that makes the model scalable. First, the frame is divided into $H \times W$ cells, similar to SPACE, and each cell is associated with a latent variable $\tilde{z}_{t,h,w}$. In the proposal phase, the model looks at the propagation latents coming from the previous frames and decides whether a new object is discovered at that position. If a 'discovered' object largely overlaps with an object propagated from the previous frames, that 'discovered' object is rejected. This is done using a mask variable $\textbf{m}_{t,n}$ for the $n^{th}$ object at timestep $t$: if the IoU (Intersection over Union) of a discovered object with a propagated object exceeds a threshold $\tau$, the discovered object is ignored. A mask decoder is used to upsample from the latent variable to the image space.
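The rejection step can be sketched with plain IoU logic. Note this is a toy version operating on axis-aligned boxes; the paper computes overlap from decoded masks, and the threshold here is an arbitrary choice:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def reject_proposals(proposed, propagated, tau=0.5):
    """Keep a proposed object only if it overlaps no propagated object above tau."""
    return [p for p in proposed
            if all(iou(p, q) <= tau for q in propagated)]
```

A proposal that coincides with a propagated object is dropped, while one in empty space survives as a genuine discovery.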

• Background Transition

One of the specialities of this model over SQAIR is that it models the background too. This is done in a very straightforward fashion: the current background latent is conditioned on the previous background and the current foreground latents, i.e. $p(z^{bg}_t \mid z^{bg}_{t-1}, z^{fg}_t)$.

• Learning and Inference
• Posterior Propagation

For posterior propagation, in addition to the latent variables from the previous frames, the images from previous timesteps are also provided, not naively but via attention over them.

• Posterior Discovery

This is done spatially in parallel to make the model faster, conditioned on the previous observations $\mathbf{x}_{<t}$: each of the $H \times W$ cells is conditioned only on the corresponding cell of the previous frames, e.g. the cell at position (2,3) is conditioned only on the cell at the same position in the previous frames. This removes the time complexity induced by auto-regressive processing.

One problem with posterior discovery is the model's inherent tendency to discover new objects even when the object is already being propagated. This is because the model doesn't care where an object comes from (propagation or discovery); all it cares about is reducing the cost function (or loss). To solve this 'discovery mode-collapse', the network has to be regularized. Two techniques are employed:

1. The initial network parameters are biased to favour propagation as much as possible.
1. The discovery module is conditioned on the propagation module, so it does not explain what is already explained.
• Posterior Background

The posterior background is conditioned on the input image and the currently existing objects: the foreground objects are given, so the background module explains the remaining part of the image.

• Training

The model is trained by maximizing the evidence lower bound (ELBO), as in VAEs.

I would definitely urge you to check out the paper for the experimental results and qualitative and quantitative comparisons with other models. To know more about VAEs, I would suggest you watch this video by Pieter Abbeel.