Picture: A Probabilistic Programming Language for Scene Perception

by Tejas Kulkarni, Pushmeet Kohli, Joshua B. Tenenbaum & Vikas Mansinghka
CVPR, 2015

Short Summary

One approach for inferring a label of interest (such as pose, object category, etc.) is to create a generative model that can render an image (I_R) from the label (S). This rendering can then be compared to the observed image (I_D) to estimate the likelihood P(I_D | S), and inference boils down to solving max_S P(I_D | S). The paper addresses the two major barriers to implementing this approach: (1) a high-fidelity generative model is hard to construct, so an approximate renderer is used in its place; (2) since S can be high-dimensional, data-driven sampling is used to speed up the inference of S.
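The analysis-by-synthesis formulation above can be sketched as follows. This is a toy illustration, not the paper's code: `render` and `features` are hypothetical stand-ins for the approximate renderer and the feature extractor (e.g. a pretrained ConvNet), and the search over S is plain random search rather than the paper's inference machinery.

```python
import numpy as np

def render(S):
    # Toy "renderer": the scene description S is a vector and the
    # rendered image is a deterministic nonlinear function of it.
    return np.tanh(S)

def features(image):
    # Toy feature extractor; in practice this would be ConvNet features.
    return image

def log_likelihood(I_D, S, sigma=0.1):
    # P(I_D | S) modeled as a Gaussian in feature space between the
    # rendering I_R = render(S) and the observed image I_D.
    I_R = render(S)
    diff = features(I_D) - features(I_R)
    return -0.5 * np.sum(diff ** 2) / sigma ** 2

# Inference = max_S P(I_D | S): here a crude random search over
# candidate scene descriptions, purely for illustration.
rng = np.random.default_rng(0)
S_true = rng.normal(size=4)
I_D = render(S_true)
candidates = rng.normal(size=(1000, 4))
best = max(candidates, key=lambda S: log_likelihood(I_D, S))
```

Random search over a high-dimensional S is of course hopeless in practice, which is exactly the motivation for the data-driven proposals described below.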


  • A powerful and generic method for performing inference.


  • Inference of S relies on the output of the approximate renderer I_R being close to I_D in some abstract feature space (e.g., ConvNet features). It is unclear whether approximate renderers satisfying this constraint can even be constructed for complex natural scenes.

Detailed Review

In order to solve the sampling problem, the generative model is run without any constraints many times (typically 100K). The visual representations (such as those extracted from a pretrained ConvNet) of each I_R and the parameters used to generate it (i.e. S) are stored in memory. To perform inference, the visual features of the image of interest are computed and used to index into this memory to find a set of likely S (inspired by the Wake-Sleep algorithm; called data-driven proposals in the paper). One can think of these as seed values of S for bootstrapping the inference procedure. Standard inference techniques (MCMC, blocked sampling, etc.) can then be used to move around in the space of S (via proposal moves), using the seed S as the starting point. The S that produces the highest likelihood of the data is the label prediction. The procedure of caching the representations can be seen as a classic example of trading memory for computation.
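The cache-then-refine pipeline described above can be sketched in a few lines. Again this is an illustrative toy, not the paper's implementation: `render` and `features` are hypothetical stand-ins, the cache is small, and the refinement step is a simple random-walk Metropolis sampler standing in for the paper's richer proposal moves.

```python
import numpy as np

rng = np.random.default_rng(1)

def render(S):
    return np.tanh(S)          # stand-in approximate renderer

def features(image):
    return image               # stand-in for ConvNet features

def log_likelihood(I_D, S, sigma=0.1):
    diff = features(I_D) - features(render(S))
    return -0.5 * np.sum(diff ** 2) / sigma ** 2

# 1) Offline: run the generative model freely and cache (features, S) pairs.
cache_S = rng.normal(size=(10_000, 4))
cache_feats = np.array([features(render(S)) for S in cache_S])

# 2) Online: index into the cache to find seed values of S whose
#    cached features are closest to the observed image's features.
I_D = render(rng.normal(size=4))
dists = np.sum((cache_feats - features(I_D)) ** 2, axis=1)
seeds = cache_S[np.argsort(dists)[:5]]

# 3) Refine each seed with random-walk Metropolis proposal moves,
#    keeping the best state visited.
def mh_refine(I_D, S, steps=200, step_size=0.05):
    ll = log_likelihood(I_D, S)
    best_S, best_ll = S, ll
    for _ in range(steps):
        S_new = S + step_size * rng.normal(size=S.shape)
        ll_new = log_likelihood(I_D, S_new)
        if np.log(rng.uniform()) < ll_new - ll:
            S, ll = S_new, ll_new
            if ll > best_ll:
                best_S, best_ll = S, ll
    return best_S, best_ll

best_S, best_ll = max((mh_refine(I_D, S) for S in seeds), key=lambda t: t[1])
```

The memory-for-computation trade-off is visible directly: step 1 is paid once offline, and step 2 replaces many cold-start MCMC iterations with a nearest-neighbor lookup.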

Suggestion for Future Work

Intuitively it makes sense to bias the proposal moves based on the difference in representations of I_R and I_D. The current paper does this by using importance sampling to choose the next S, where the importance weights are proportional to the likelihood of the data. An alternative to this sampling strategy would be to use a function approximator (such as a NNet) to predict the next move from the difference in representations of I_R and I_D together with the current S. This could significantly reduce inference time by reducing the amount of sampling required.
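The suggested learned proposal can be sketched as follows. Everything here is a hypothetical illustration of the idea, not anything from the paper: `render` and `features` are toy stand-ins, and a linear least-squares regressor substitutes for the suggested NNet to keep the example dependency-free.

```python
import numpy as np

rng = np.random.default_rng(2)

def render(S):
    return np.tanh(S)          # stand-in approximate renderer

def features(image):
    return image               # stand-in for ConvNet features

# Training data: pairs of (current S, target S), with the input being
# the feature difference plus the current S, and the output the move
# delta that would take the current S to the target.
S_target = rng.normal(size=(5000, 4))
S_current = S_target + 0.3 * rng.normal(size=(5000, 4))
feat_diff = features(render(S_target)) - features(render(S_current))
X = np.hstack([feat_diff, S_current])   # inputs: feature diff + current S
Y = S_target - S_current                # outputs: the move to propose
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def propose_move(I_D, S):
    # Predict the next move from (features(I_D) - features(I_R), S).
    x = np.concatenate([features(I_D) - features(render(S)), S])
    return S + x @ W
```

Each call to `propose_move` amortizes what would otherwise take many importance-sampling draws into a single learned prediction, which is the source of the claimed speedup.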

