by Alexey Dosovitskiy & Thomas Brox
The objective of the paper is to study image representations, e.g., features generated using a convolutional neural network, by inverting them. Re-generating the image from which the features were extracted helps to understand what information from the image gets preserved in the features. This work has a similar goal to the work of Mahendran and Vedaldi [CVPR 2015], where the authors try to optimize for an image which produces similar features to a given one. In contrast, this paper trains a fully supervised up-convolutional (or de-convolutional) neural network to generate an image given input features. A number of experiments are presented which explore the different factors of variation that are preserved in the feature domain and provide insights regarding widely used CNN features.
The approach adopted by this paper has the following advantages. One of the challenges in feature inversion is that many inputs may map to the same or a similar feature vector. This is because features are typically lower-dimensional than the image and are designed to be invariant to certain variations. While Mahendran and Vedaldi [CVPR 2015] ignore this property and directly optimize the distance between the original and reconstructed features, this paper addresses the problem directly by learning to generate the original image. Moreover, the method requires just a forward pass at test time to invert the features.
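The many-to-one issue can be illustrated with a toy example. The sketch below is not the paper's network: `feature` is a made-up extractor that keeps only the mean intensity, and the "images" are hypothetical 4-pixel vectors. It only shows a general fact about squared-error training: when several images share one feature, the reconstruction with the lowest total squared error is their average.

```python
import numpy as np

def feature(x):
    """Toy many-to-one 'feature extractor' (assumption): keeps only the mean intensity."""
    return float(np.mean(x))

def l2_optimal_reconstruction(images):
    """Return the image minimising the summed squared error to all given images.

    For a squared-error loss this minimiser is the pixel-wise mean.
    """
    return np.mean(np.stack(images), axis=0)

def total_squared_error(candidate, images):
    """Summed squared error between one candidate and every image in the set."""
    return float(sum(np.sum((candidate - img) ** 2) for img in images))

# Two different hypothetical 'images' that map to the same feature value.
img_a = np.array([1.0, 0.0, 1.0, 0.0])
img_b = np.array([0.0, 1.0, 0.0, 1.0])
same_feature = feature(img_a) == feature(img_b)

recon = l2_optimal_reconstruction([img_a, img_b])
err_mean = total_squared_error(recon, [img_a, img_b])
err_sharp = total_squared_error(img_a, [img_a, img_b])
print(same_feature, recon, err_mean, err_sharp)
```

In this toy setting the averaged reconstruction has a lower total error than either sharp original, which is why a decoder trained this way has no incentive to commit to one of the candidate pre-images.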
The paper performs rigorous experiments to analyze the properties of CNN-based features. All experiments use AlexNet pre-trained on ImageNet. The paper shows that the colors of objects in the image are preserved even in the features of higher layers such as FC8, and can be recovered using the proposed technique. Similarly, the position of objects in the image is preserved in the features of all layers. Consistent with previous related work, it is shown that the precise values of the features do not carry much information, since binarizing the features approximately preserves the reconstruction. In another interesting experiment, images are reconstructed from interpolated and independently sampled features. These reconstructions look reasonable and follow natural image statistics. This brings out an important point about the process: inverting these features is possible, despite the non-bijective mapping between image and features, because the reconstruction is restricted to natural images. In this paper, the network learns to enforce the natural image prior, unlike Mahendran and Vedaldi [CVPR 2015], where the natural image prior is added explicitly. However, all the reconstructed images are blurrier than the originals, and the paper suggests that this may be a result of the L2 loss function, which usually favors over-smoothed solutions.
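The binarization and interpolation experiments can be sketched as simple operations on a feature vector. This is a minimal illustration under stated assumptions: the helper names `binarize` and `interpolate` are hypothetical, the norm-preserving scaling is one plausible reading of "binarization" rather than a confirmed detail of the paper, and the example vectors are made up. In the actual experiments the modified features would then be fed to the trained inversion network.

```python
import numpy as np

def binarize(features):
    """Keep the sign pattern of the features but give every non-zero entry
    the same magnitude, chosen so the Euclidean norm stays unchanged.
    (The norm-preserving choice is an assumption of this sketch.)"""
    features = np.asarray(features, dtype=float)
    signs = np.sign(features)
    nonzero = np.count_nonzero(signs)
    if nonzero == 0:
        return signs  # all-zero input stays all-zero
    scale = np.linalg.norm(features) / np.sqrt(nonzero)
    return signs * scale

def interpolate(f_a, f_b, alpha):
    """Linear blend of two feature vectors: alpha=0 gives f_a, alpha=1 gives f_b."""
    f_a = np.asarray(f_a, dtype=float)
    f_b = np.asarray(f_b, dtype=float)
    return (1.0 - alpha) * f_a + alpha * f_b

# Hypothetical ReLU-style feature vector (non-negative, partly zero).
f = np.array([0.0, 3.0, 0.0, 4.0])
f_bin = binarize(f)

# Midpoint between two hypothetical feature vectors.
f_mid = interpolate([0.0, 2.0], [2.0, 0.0], 0.5)
print(f_bin, f_mid)
```

After binarization only the pattern of active units survives, so a good reconstruction from `f_bin` suggests that this pattern, rather than the exact activation values, carries most of the information.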
The experiments in the paper are quite satisfactory and lead to further insightful discussion. Results such as the binarization finding have also been validated before for AlexNet-like architectures. A significant question is whether these results hold across architectures such as VGG and GoogLeNet, or whether they are specific to AlexNet. This experiment would certainly add more value to the paper. The paper also presents a good autoencoder baseline in which the ImageNet-pretrained encoder is fine-tuned. It would be good to know what happens if the encoder part of the autoencoder is instead trained from scratch. The reconstructions from fc8 with the top 5 highest-scoring classes muted are really impressive, even though the network was not trained for exactly that setting. This is also consistent with the distillation idea proposed in earlier work (Hinton et al., 2015). However, the complementary experiment, in which the paper tries to reconstruct from just the top 5 activations, could be explored further. The dimensionality of the embedding differs greatly between the two settings being compared (5 versus 995), and the network was never trained or fine-tuned for such a low-dimensional input. It could be that the top 5 activations actually contain more information than the rest, but because the inversion network was trained using all activations, it is incapable of inverting properly from the top 5 activations alone. Thus, it would be appealing to know what happens if the network is trained to invert from the top 5 activations. Further, the images obtained by inverting randomly sampled features are quite blurred, so it is difficult to tell whether an image is natural or not. It might nonetheless be interesting to see which classes these randomly sampled features would be classified into, and whether the reconstructed images make sense given these classifications.
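The two fc8 settings compared above can be made concrete with a small masking sketch. The function name `split_top_k` is hypothetical, the 1000-dimensional vector is random stand-in data rather than real AlexNet output, and muting by setting entries to zero is an assumption about how the experiment is implemented.

```python
import numpy as np

def split_top_k(scores, k=5):
    """Split a score vector into two masked copies of the same length.

    only_top:    the k largest entries kept, everything else set to zero.
    without_top: the k largest entries set to zero, everything else kept.
    """
    scores = np.asarray(scores, dtype=float)
    top_idx = np.argsort(scores)[-k:]          # indices of the k largest scores
    top_mask = np.zeros(scores.shape, dtype=bool)
    top_mask[top_idx] = True
    only_top = np.where(top_mask, scores, 0.0)
    without_top = np.where(top_mask, 0.0, scores)
    return only_top, without_top

# Stand-in for a 1000-way fc8 vector; the +1.0 offset keeps every entry non-zero
# so that counting non-zeros below reflects the masks exactly.
rng = np.random.default_rng(0)
fc8 = rng.random(1000) + 1.0

only_top, without_top = split_top_k(fc8, k=5)
n_top = int(np.count_nonzero(only_top))
n_rest = int(np.count_nonzero(without_top))
print(n_top, n_rest)
```

Both masked vectors keep the full 1000 entries that the inversion network expects, but one carries 5 informative values and the other 995, which is the asymmetry that makes the comparison hard to interpret without retraining.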
This is one of the first few works to show that CNNs can be trained to generate images, albeit from a particular kind of vector, i.e., features. Owing to its simplicity and good results, the method certainly encourages computer vision research on generating natural-looking images.
 Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).