# Picture: A Probabilistic Programming Language for Scene Perception

by Tejas Kulkarni, Pushmeet Kohli, Joshua B. Tenenbaum & Vikash Mansinghka
CVPR, 2015

### Short Summary

One approach for inferring a label of interest (such as pose, object category, etc.) is to create a generative model that can render an image ($I_R$) from the label ($S$). This rendering can then be compared to the observed image ($I_D$) to estimate the likelihood $P(I_D \mid S)$, and inference boils down to $\arg\max_S P(I_D \mid S)$. The paper addresses the two major barriers to implementing this approach: (1) a high-fidelity generative model is hard to construct, so an approximate renderer is used in its place; (2) $S$ can be high-dimensional, which makes inference slow, so data-driven sampling is used to speed up inference of $S$.
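As a minimal sketch of how such a likelihood might be scored, assuming a Gaussian error model in an abstract feature space (here `render` and `features` are hypothetical stand-ins for the approximate renderer and the ConvNet feature extractor, neither of which is specified in this form in the paper):

```python
import numpy as np

def log_likelihood(S, I_D, render, features, sigma=1.0):
    """Approximate log P(I_D | S): render the scene described by S,
    then compare the rendering and the observed image in a learned
    feature space rather than in raw pixel space."""
    I_R = render(S)                       # approximate rendering of scene S
    diff = features(I_D) - features(I_R)  # residual in abstract feature space
    return -np.sum(diff ** 2) / (2.0 * sigma ** 2)
```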

### Pros

• A powerful and generic method for performing inference.

### Cons

• Inference of $S$ relies on the assumption that the output of the approximate renderer $I_R$ is close to $I_D$ in some abstract feature space (such as ConvNet features). It is unclear whether it is even possible to build approximate renderers for complex natural scenes that satisfy this constraint.

### Detailed Review

In order to solve the sampling problem, the generative model is run without any constraints many times (typically ~100K). The visual representations (such as those extracted from a pretrained ConvNet) of each $I_R$ and the parameters used to generate it (i.e. $S$) are stored in memory. To perform inference, the visual features of the image of interest are computed and used to index into this memory to find a set of likely $S$ (inspired by the Wake-Sleep algorithm; called data-driven proposals in the paper). One can think of these as seed values of $S$ for bootstrapping the inference procedure. Standard inference techniques (e.g. MCMC, blocked sampling) are then used to move around in the space of $S$ (via so-called proposal moves), using a seed $S$ as the starting point. The $S$ which produces the highest likelihood of the data is the label prediction. The procedure of caching the representations can be seen as a classic example of spending memory to save computation at inference time.
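The pipeline can be sketched in a few functions, under the same assumptions as before (`sample_prior`, `render`, `features`, `perturb`, and `log_likelihood` are hypothetical placeholders; the paper's actual inference engine is more general than this plain Metropolis-Hastings loop):

```python
import numpy as np

def build_memory(sample_prior, render, features, n=100_000):
    """Run the generative model unconditionally many times and cache
    (feature vector, scene parameters) pairs."""
    memory = []
    for _ in range(n):
        S = sample_prior()               # unconstrained forward simulation
        memory.append((features(render(S)), S))
    return memory

def data_driven_seeds(I_D, memory, features, k=10):
    """Index the cache with the observed image's features to retrieve
    k likely scene parameters (the data-driven proposals)."""
    f_D = features(I_D)
    dists = [np.linalg.norm(f - f_D) for f, _ in memory]
    return [memory[i][1] for i in np.argsort(dists)[:k]]

def metropolis_hastings(S0, I_D, log_likelihood, perturb, steps=1000):
    """Refine a seed S0 with symmetric random-walk proposal moves,
    keeping track of the best log-likelihood found."""
    S, ll = S0, log_likelihood(S0, I_D)
    for _ in range(steps):
        S_new = perturb(S)               # a proposal move in the space of S
        ll_new = log_likelihood(S_new, I_D)
        if np.log(np.random.rand()) < ll_new - ll:  # accept/reject
            S, ll = S_new, ll_new
    return S, ll
```

Seeding the chain from the cached nearest neighbors, rather than from the prior, is what makes the memory worth its storage cost: the chain starts in a high-likelihood region and needs far fewer moves to converge.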

### Suggestion for Future Work

Intuitively, it makes sense to bias the proposal moves based on the difference between the representations of $I_R$ and $I_D$. In the current paper, this is done by using importance sampling to choose the next $S$, where the importance weights are proportional to the likelihood of the data. An alternative to this sampling strategy would be to use a function approximator (such as a neural network) to predict the next move from the difference in representations of $I_R$ and $I_D$ together with the current $S$. This could significantly reduce inference time by reducing the amount of sampling required; a sketch of such a predictor follows.
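A minimal sketch of this idea, assuming $S$ is vector-valued and that the network is trained (e.g. by regression on pairs generated from the forward model) to map the feature-space residual and current state to an update. This is a hypothetical design, not something proposed in the paper:

```python
import torch
import torch.nn as nn

class ProposalNet(nn.Module):
    """Hypothetical move predictor: maps the feature-space residual
    between I_R and I_D, together with the current S, to a proposed
    next S (an alternative to importance-sampling the next move)."""
    def __init__(self, feat_dim, s_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + s_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, s_dim),
        )

    def forward(self, feat_residual, S):
        # feat_residual = features(I_R) - features(I_D);
        # predict a delta and apply it as a residual update to S.
        return S + self.net(torch.cat([feat_residual, S], dim=-1))
```

Such a learned proposal could still be wrapped in an accept/reject step so that the overall procedure remains a valid sampler while spending far fewer renderer evaluations per accepted move.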