by Tejas Kulkarni, Pushmeet Kohli, Joshua B. Tenenbaum & Vikash Mansinghka
One approach to inferring a label of interest $S$ (such as pose or object category) is to build a generative model that renders an image $I_R$ from $S$. The rendering can then be compared with the observed image $I_D$ to estimate the likelihood $P(I_D \mid S)$, so inference boils down to finding $S^* = \arg\max_S P(I_D \mid S)$. The paper addresses two major barriers to implementing this approach: (1) a high-fidelity generative model is hard to construct, so an approximate renderer is used in its place; (2) $S$ can be high-dimensional, which makes sampling slow, so data-driven sampling is used to speed up inference of $S$.
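As one concrete reading of the comparison step, the likelihood can be taken to be a Gaussian on the distance between feature vectors. This form is an assumption made here for illustration and is not stated in the summary above: $S$ is the label, $I_R(S)$ its rendering, $I_D$ the observed image, $\nu(\cdot)$ a hypothetical feature extractor (for example, ConvNet activations), and $\sigma$ a tolerance parameter.

$$
P(I_D \mid S) \;\propto\; \exp\!\left(-\frac{\lVert \nu(I_D) - \nu(I_R(S)) \rVert^2}{2\sigma^2}\right)
$$

Under this assumption, maximizing the likelihood is the same as minimizing the feature-space distance between the rendering and the observed image.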
- A powerful and generic method for performing inference.
- Inference of the label $S$ relies on the output of the approximate renderer, $I_R$, being close to the observed image $I_D$ in some abstract feature space (such as ConvNet features). It is unclear whether approximate renderers that satisfy this constraint can even be built for complex natural scenes.
To solve the sampling problem, the generative model is first run many times without any constraints (typically 100K times). The visual representations of the rendered images $I_R$ (such as features extracted from a pretrained ConvNet) and the parameters $S$ used to generate them are stored in memory. To perform inference, the visual features of the image of interest $I_D$ are computed and used to index into this memory to find a set of likely $S$ (these are called data-driven proposals in the paper and are inspired by the Wake-Sleep algorithm). One can think of them as seed values of $S$ for bootstrapping the inference procedure. Standard inference techniques (MCMC, blocked sampling, etc.) can then move around in the space of $S$ (these moves are called proposal moves), using a seed as the starting point. The $S$ that produces the highest likelihood of the data is the label prediction. Caching the representations is a classic example of trading memory for computation.
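The procedure above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: `render` is a hypothetical renderer that draws a Gaussian blob at position $S = (x, y)$, `features` is average pooling standing in for ConvNet features, the likelihood is an assumed Gaussian in feature space, and the refinement step is plain random-walk Metropolis-Hastings. All function names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID = np.linspace(-1.0, 1.0, 16)

def render(s):
    """Hypothetical approximate renderer: 16x16 image of a blob centred at s = (x, y)."""
    gx = np.exp(-((GRID - s[0]) ** 2) / 0.1)
    gy = np.exp(-((GRID - s[1]) ** 2) / 0.1)
    return np.outer(gy, gx)

def features(img):
    """Stand-in for ConvNet features: 4x4 average pooling, flattened to 16 values."""
    return img.reshape(4, 4, 4, 4).mean(axis=(1, 3)).ravel()

def log_likelihood(s, target_feat, sigma=0.05):
    """Assumed Gaussian log-likelihood (up to a constant) in feature space."""
    diff = features(render(s)) - target_feat
    return -0.5 * np.sum(diff ** 2) / sigma ** 2

def build_cache(n):
    """Run the generative model unconditionally n times; store parameters and features."""
    cache_S = rng.uniform(-1.0, 1.0, size=(n, 2))
    cache_F = np.stack([features(render(s)) for s in cache_S])
    return cache_S, cache_F

def data_driven_proposals(target_feat, cache_S, cache_F, k=5):
    """Return the k cached parameter settings whose features are nearest the target."""
    dist = np.sum((cache_F - target_feat) ** 2, axis=1)
    return cache_S[np.argsort(dist)[:k]]

def refine(seed, target_feat, steps=200, step_size=0.05):
    """Random-walk Metropolis-Hastings from a seed (flat prior); track the best state."""
    s, ll = seed.copy(), log_likelihood(seed, target_feat)
    best_s, best_ll = s.copy(), ll
    for _ in range(steps):
        cand = s + step_size * rng.normal(size=2)
        cand_ll = log_likelihood(cand, target_feat)
        if np.log(rng.uniform()) < cand_ll - ll:  # symmetric proposal, so ratio of likelihoods
            s, ll = cand, cand_ll
            if ll > best_ll:
                best_s, best_ll = s.copy(), ll
    return best_s, best_ll

def infer(image, cache_S, cache_F):
    """Seed from the cache, refine each seed, return the highest-likelihood parameters."""
    target_feat = features(image)
    seeds = data_driven_proposals(target_feat, cache_S, cache_F)
    results = [refine(seed, target_feat) for seed in seeds]
    return max(results, key=lambda r: r[1])[0]

cache_S, cache_F = build_cache(2000)
true_s = np.array([0.3, -0.4])
estimate = infer(render(true_s), cache_S, cache_F)
print("true:", true_s, "estimate:", estimate)
```

The cache is built once and reused for every query image, which is where the memory cost buys back computation: each query needs only a nearest-neighbour lookup plus a short chain per seed, rather than a long chain from a random start.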
Suggestion for Future Work
Intuitively, it makes sense to bias the proposal moves based on the difference between the representations of the observed image $I_D$ and the current rendering $I_R$. The current paper instead uses importance sampling to choose the next $S$, with importance weights proportional to the likelihood of the data. An alternative to this sampling strategy would be to use a function approximator (such as a neural network) to predict the next move from the difference between the representations of $I_D$ and the current $I_R$. This could significantly reduce inference time by reducing the amount of sampling required.
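A rough sketch of this suggestion on a toy problem follows. Everything in it is an assumption for illustration: `render` is a hypothetical blob renderer with parameters $S = (x, y)$, `features` is average pooling standing in for ConvNet features, and a linear least-squares regressor is swapped in for the neural network to keep the example short. The regressor is trained on simulated pairs to map a feature difference to a parameter update.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID = np.linspace(-1.0, 1.0, 16)

def render(s):
    """Hypothetical approximate renderer: 16x16 image of a blob centred at s = (x, y)."""
    gx = np.exp(-((GRID - s[0]) ** 2) / 0.1)
    gy = np.exp(-((GRID - s[1]) ** 2) / 0.1)
    return np.outer(gy, gx)

def features(img):
    """Stand-in for ConvNet features: 4x4 average pooling, flattened to 16 values."""
    return img.reshape(4, 4, 4, 4).mean(axis=(1, 3)).ravel()

def feat(s):
    return features(render(s))

def make_training_set(n=3000):
    """Simulate (feature difference, parameter move) pairs from the generative model."""
    cur = rng.uniform(-0.5, 0.5, size=(n, 2))
    delta = rng.normal(scale=0.1, size=(n, 2))
    X = np.stack([feat(c + d) - feat(c) for c, d in zip(cur, delta)])
    return X, delta

# Fit a linear map W (16 x 2) so that (feature difference) @ W approximates the move.
X, Y = make_training_set()
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def predicted_move(s_current, target_feat):
    """Predict the next move from the difference between target and current features."""
    return (target_feat - feat(s_current)) @ W

true_s = np.array([0.3, -0.4])
target_feat = feat(true_s)
s = np.array([-0.3, 0.2])          # deliberately poor starting point
for _ in range(3):
    s = s + predicted_move(s, target_feat)
final_error = np.linalg.norm(s - true_s)
print("final error:", final_error)
```

Here the predicted move is applied deterministically for brevity. In a sampler it would more plausibly serve as the mean of a proposal distribution, with an accept/reject step retained so that an inaccurate prediction cannot bias the result.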