by Xiaolong Wang & Abhinav Gupta
The aim of this paper is to explore learning visual representations (a ConvNet) without supervision. Specifically, the paper compares pretraining on a visual tracking task against supervised pretraining on ImageNet, evaluating both on detection on the PASCAL VOC 2012 dataset. The paper attempts to answer the compelling question of whether you can learn effective visual representations *without* semantic supervision, using image patch tracking. Other recent work has similarly attempted to learn effective representations from context [1] and egomotion [2,3].
The unsupervised learning in this work consists of two main steps: (1) generating the dataset of image patch “tracks” and (2) training a ConvNet on that dataset using a “ranking” loss.
The first step of the unsupervised learning method generates a dataset of image patch “tracks.” The proposed method first selects video sequences to use (those with sufficient motion in the scene, but not camera motion), then uses SURF key points and the KCF tracker to obtain pairs of image patches that are 30 frames apart. This automatic dataset generation process creates a supervisory signal on an auxiliary task, giving the network information on how things move; however, the videos were selected using human queries, which introduces bias into the dataset. The paper also specifically selects against video segments with camera movement, even though image changes due to camera movement provide very valuable information (as demonstrated by [2,3]). It is unclear why the method couldn’t also have tracked random small patches that moved in the image due to the camera’s change in perspective. Finally, the dataset might benefit significantly from some notion of scale. The paper points out that humans learn visual representations without strong supervision, but humans also have access to depth information (from stereo), which provides a rough measure of the size of a tracked object. Figure 3 shows that the scales of image patches are highly variable. We think that their approach would benefit from some sort of multiscale processing.
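To make the shape of the resulting training data concrete, here is a minimal sketch of the pairing step only. It assumes a tracker has already produced one bounding box per frame; the `patch_pairs` function and the `(x, y, w, h)` box layout are hypothetical names chosen for illustration, and the video selection and tracking themselves are not shown.

```python
def patch_pairs(track, gap=30):
    """Pair each tracked box with the box `gap` frames later in the same track.

    `track` is a list with one (x, y, w, h) box per frame (assumed layout).
    Pairs are taken at non-overlapping offsets: (0, gap), (gap, 2 * gap), ...
    """
    return [(track[i], track[i + gap]) for i in range(0, len(track) - gap, gap)]


# A 61-frame track in which the patch drifts one pixel to the right per frame.
track = [(i, 0, 10, 10) for i in range(61)]
pairs = patch_pairs(track)
```

Each pair is one (query, positive) example for the ranking loss; a track shorter than `gap + 1` frames yields no pairs.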
The second step is to train a “Siamese-triplet” network using a ranking loss. Their approach is simple and straightforward to implement. We believe that the method could do a more thorough job of hard negative mining by considering more than just 100 patches (which is their batch size). The choice of this hyperparameter was not discussed.
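As a rough illustration of what a triplet ranking loss with hard negative mining looks like, here is a small pure-Python sketch operating on feature vectors. The cosine distance, the margin of 0.5, and all function names are assumptions made for this example, not details taken from this review; the candidate pool can be any size, which is the knob our comment about mining beyond a single batch refers to.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def ranking_loss(query, positive, negative, margin=0.5):
    """Hinge loss: zero once the negative is farther than the positive by `margin`."""
    gap = cosine_distance(query, positive) - cosine_distance(query, negative)
    return max(0.0, gap + margin)


def hardest_negative(query, positive, candidates, margin=0.5):
    """Return the candidate with the largest ranking loss (assumed mining rule)."""
    return max(candidates, key=lambda n: ranking_loss(query, positive, n, margin))


query, positive = [1.0, 0.0], [1.0, 0.0]
pool = [[0.0, 1.0], [1.0, 0.1]]  # an easy negative and a hard one
hard = hardest_negative(query, positive, pool)
```

Here the orthogonal vector contributes zero loss, so only the near-duplicate `[1.0, 0.1]` would produce a gradient; a larger pool simply gives `hardest_negative` more candidates to choose from.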
The paper presents two main experiments, the first on detection with the PASCAL VOC 2012 dataset. The goal of the experiments is to show that the representation learned from tracking patches is useful for other visual tasks. To do so, they fine-tune on the training set of VOC 2012. The results show a moderate improvement over training from scratch (47.0 mAP vs. 44.0 mAP from scratch). The results also show an impressive improvement from training an ensemble of networks, though there is no comparison to an ensemble of networks trained from scratch (with different random initializations). The paper also provides a comparison to initializing from an ImageNet-trained network, which performs slightly better (54.4 mAP with ImageNet pretraining vs. 52 mAP with the proposed method).
We would be very interested in seeing the performance *without* fine-tuning end-to-end, simply training a linear SVM on top of the learned representation. This might be more indicative of the quality of the learned representation.
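The evaluation we have in mind can be sketched as follows: keep the features fixed and fit only a linear classifier on top of them. This is a toy, pure-Python stand-in for an off-the-shelf linear SVM; the feature values, hyperparameters, and function names are all made up for illustration, and in practice the inputs would be activations from the pretrained network with its weights frozen.

```python
def train_linear_svm(features, labels, lr=0.1, reg=0.01, epochs=100):
    """Fit (w, b) by subgradient descent on the L2-regularised hinge loss.

    `features` are fixed inputs (never updated); labels must be +1 or -1.
    """
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            # Weight decay is applied on every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            # The data term is applied only when the margin is violated.
            if y * score < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the sign of the linear score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy "frozen features": two linearly separable classes.
frozen_features = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(frozen_features, labels)
predictions = [predict(w, b, x) for x in frozen_features]
```

Because only `w` and `b` are learned, the accuracy of such a classifier reflects how linearly separable the classes already are in the learned feature space, rather than how well the network adapts during fine-tuning.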
The experiments on surface normal estimation show findings similar to those on PASCAL VOC.
Lastly, interesting future work would consider the questions: why does supervised ImageNet pretraining perform better? Is the performance gap caused by a fundamental limitation regarding supervision via tracking? Is it caused by an issue with the quantity or diversity of the collected videos? Or is it caused by issues with the proposed data-extraction approach and training method?
[1] Unsupervised Visual Representation Learning by Context Prediction / Carl Doersch, Abhinav Gupta, Alexei A. Efros http://arxiv.org/abs/1505.05192
[2] Learning to See by Moving / Pulkit Agrawal, Joao Carreira, Jitendra Malik http://arxiv.org/abs/1505.01596
[3] Learning image representations equivariant to ego-motion / Dinesh Jayaraman and Kristen Grauman http://arxiv.org/abs/1505.02206