by Karel Lenc & Andrea Vedaldi
This paper analyzes image feature representations by studying their mathematical properties, namely equivariance, invariance, and equivalence. The paper focuses on how simple transforms in the input space, such as vertical and horizontal flipping, rotations, and scaling, can be recovered on the output space by applying a simple affine transform in the feature space. The paper studies both “shallow” representations like HOG and intermediate representations from CNNs. The paper also analyzes the compatibility of feature representations by fusing two halves of different CNNs, “Franken-CNNs”. Overall, the paper provides a novel approach for analysis of feature representations, in particular giving an empirical evaluation of equivariance in deep nets. This complements previous analysis on image representations, such as feature visualization and inversion methods.
Equivariance to a transform means that applying the transform to the input produces a corresponding, recoverable transform in the output space. As such a mapping trivially exists if a feature representation is invertible, the paper is specifically interested in the existence of “simple” or affine transforms on the output space. In Section 3.4, the paper provides an example where being able to recover an affine transform on the output space can result in a speedup for a pose prediction task. In Section 2.1, the paper explores several methods to uncover an affine equivariant mapping. Regularization is needed to induce sparsity in the final solution due to the high dimensionality of the problem. Two sparsity regularizers are introduced: one encourages sparsity in the transform overall, and the other specifically incorporates prior knowledge of the input transform. The experiments shown in Figure 3 on a “shallow” HOG representation confirm that regularization is needed, and that using prior knowledge of the transform works better. Figure 5 provides a qualitative example of the learned equivariant mapping.
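To make the fitting problem concrete, here is a minimal sketch of estimating an affine map (A, b) such that φ(gx) ≈ Aφ(x) + b. It is not the paper's procedure: it substitutes a plain ridge (L2) penalty for the sparsity regularizers, uses the identity as a toy representation, and takes g to be a 1-D flip. The names `fit_affine_map` and `lam` are hypothetical.

```python
import numpy as np


def fit_affine_map(feats_x, feats_gx, lam=1e-6):
    """Ridge-regress (A, b) so that feats_gx ~= feats_x @ A.T + b.

    feats_x, feats_gx: arrays of shape (n_samples, n_features).
    lam: L2 penalty on A (an assumption; the paper uses sparsity penalties).
    """
    n, d = feats_x.shape
    X = np.hstack([feats_x, np.ones((n, 1))])  # extra column models the bias b
    reg = lam * np.eye(d + 1)
    reg[-1, -1] = 0.0                          # leave the bias unpenalized
    W = np.linalg.solve(X.T @ X + reg, X.T @ feats_gx)
    return W[:-1].T, W[-1]                     # A has shape (d, d), b has shape (d,)


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))   # toy "images": 200 signals of length 8
phi_x = x                       # toy representation: the identity
phi_gx = x[:, ::-1]             # representation of the flipped input

A, b = fit_affine_map(phi_x, phi_gx)
flip = np.eye(8)[::-1]          # permutation matrix that reverses a vector
print(np.abs(A - flip).max(), np.abs(b).max())
```

In this toy setting the fitted A is very close to the flip permutation matrix and b is close to zero, which is the kind of structured, sparse solution that prior knowledge of the input transform would favor. For a real representation such as HOG or a CNN layer, `phi_x` and `phi_gx` would be replaced by features extracted from original and transformed images.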
In addition to studying HOG features, the paper also provides extensive experiments on deep CNN architectures. The paper also explores how well an equivariant mapping can maintain classification performance. The results in Figure 6 and Table 2 provide an evaluation on how well equivariant mappings can be found in different layers in the CNN. A few intuitions are confirmed numerically: CNN representations are invariant to horizontal flipping, but sensitive to vertical flipping and rotations. In this section, the reader might also benefit from some qualitative examples of equivariant mappings, in particular, examples where the equivariant mapping is unable to recover the feature representation.
The paper does not present a study of the translation equivariance of CNNs, a property that is important for object detection and segmentation. Although translation equivariance at a coarse level (stride ~32) follows directly from the convolution operation, it is not necessarily implied at a fine level. For example, shifting the input by 4 pixels usually does not simply shift the segmentation output of a Fully Convolutional Net by 4 pixels. It seems that in practice such high-resolution equivariance holds (or can be implicitly learned in detection and segmentation tasks), but it would be interesting to see some quantitative analysis.
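The coarse-versus-fine distinction can be illustrated with a small sketch. It assumes a 1-D signal, circular shifts, and a single non-overlapping average-pooling layer standing in for a strided network; `avg_pool` is a hypothetical helper, not anything from the paper.

```python
import numpy as np


def avg_pool(x, stride):
    """Non-overlapping 1-D average pooling with the given stride."""
    return x.reshape(-1, stride).mean(axis=1)


stride = 4
x = np.arange(64.0) ** 2                  # deterministic toy signal

y = avg_pool(x, stride)
y_coarse = avg_pool(np.roll(x, stride), stride)  # input shifted by one full stride
y_fine = avg_pool(np.roll(x, 1), stride)         # input shifted by a single sample

# Shifting by a full stride shifts the output by exactly one cell.
coarse_ok = np.array_equal(y_coarse, np.roll(y, 1))
# Shifting by one sample yields neither the shifted nor the unshifted output.
fine_ok = np.allclose(y_fine, np.roll(y, 1)) or np.allclose(y_fine, y)
print(coarse_ok, fine_ok)
```

Here the first check holds exactly while the second fails: a sub-stride shift changes the pooled values themselves rather than translating them. Whether a trained network compensates for this is the empirical question raised above.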
The paper also provides a quantitative measure of invariance. The paper notes that, surprisingly, invariance is not monotonically increasing for different transforms and argues that invariance to certain factors may be lost due to pooling. An unmentioned reason for the loss of invariance is the asymmetry of the convolutional kernels themselves. The network is shown to be invariant to horizontal flipping, which is unsurprising, as it is directly built in during training. An experiment that studies the quality of an equivariant mapping on a network trained without data augmentation would be interesting to see.
Finally, the paper presents four different CNNs with similar architectures, trained on a variety of domains, such as ILSVRC12, Places, and both datasets combined. The paper then composes a new network which stitches together the two halves from two different networks, with an “equivalence mapping” in the middle. This is an interesting application of the established framework. A minor issue is that the definition of equivalence should be symmetric; otherwise, consider the case when Φ’ is the all-zero vector for all x. Then any representation Φ will be equivalent to Φ’. An additional possible experiment is to look at transformations between networks trained with different random seeds. Since training and averaging two models almost always helps, observing the learned transforms might help shed light on what one model learns that the other does not.
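The asymmetry in the all-zero example can be checked numerically. The sketch below is a simplification under stated assumptions: random vectors stand in for Φ, the equivalence mapping is a plain least-squares linear map rather than the paper's stitching layer, and `stitching_residual` is a hypothetical name.

```python
import numpy as np


def stitching_residual(src, dst):
    """Residual norm of the best linear map E with src @ E ~= dst."""
    E, *_ = np.linalg.lstsq(src, dst, rcond=None)
    return float(np.linalg.norm(src @ E - dst))


rng = np.random.default_rng(0)
phi = rng.normal(size=(100, 16))      # stand-in features Phi(x) for 100 inputs
phi_prime = np.zeros((100, 16))       # degenerate Phi'(x) = 0 for all x

forward = stitching_residual(phi, phi_prime)   # map Phi -> Phi'
backward = stitching_residual(phi_prime, phi)  # map Phi' -> Phi
print(forward, backward)
```

The forward residual is zero because the zero map sends any Φ to Φ’, while the backward residual is large because no map can recover Φ from a constant. Requiring a good fit in both directions would rule out this degenerate case.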
One natural question is how to take the insights from this work and apply them. Currently, we augment the training sets to accommodate some transformations. As a result, the top layers of commonly used networks are invariant to those transformations, while the middle layers are equivariant to them, as t-SNE embeddings of pool5 features have shown. Overall, the method in this paper seems better than a data augmentation approach since some transformations are very unlikely to happen; for example, humans are very bad at recognizing vertically inverted faces.
Since publication, the idea of modeling equivariance has been applied in “Learning image representations equivariant to ego-motion” by Jayaraman and Grauman, where a feature representation is learned to be equivariant to input transforms caused by ego-motion (Section 5.3). In “Cross Modal Distillation for Supervision Transfer”, Gupta et al. learn intermediate feature representations which are invariant to domain shifts, specifically between the RGB and depth domains, as a method of domain adaptation.
One concern is that given the high dimensionality of the feature space, it is always possible to find an equivariant mapping regardless of the features. The use of the strong regularizer suggests that insufficient amounts of data were used. Consequently, it would be interesting to see if this is still possible on sufficient amounts of data without using a strong regularizer.
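A small sketch of this concern, under assumptions chosen only for illustration: source and target "features" are independent Gaussian noise, the number of training pairs is smaller than the feature dimension, and the map is fit by unregularized least squares. None of these sizes or names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, dim = 20, 200, 50    # fewer training samples than dimensions

# Source and target "features" are independent noise, so no true map exists.
src_train = rng.normal(size=(n_train, dim))
dst_train = rng.normal(size=(n_train, dim))
src_test = rng.normal(size=(n_test, dim))
dst_test = rng.normal(size=(n_test, dim))

# Unregularized least squares (minimum-norm solution when underdetermined).
A, *_ = np.linalg.lstsq(src_train, dst_train, rcond=None)


def rel_error(src, dst):
    """Relative residual of the fitted map on the given feature pairs."""
    return float(np.linalg.norm(src @ A - dst) / np.linalg.norm(dst))


train_err = rel_error(src_train, dst_train)
test_err = rel_error(src_test, dst_test)
print(train_err, test_err)
```

The training error is essentially zero even though the features are unrelated, while the held-out error is large. This suggests that a low training residual alone says little in this regime, and that held-out error as a function of training-set size would be the more informative quantity to report.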
Overall, this paper conducts a novel study on characterizing feature representations. Conducting a similar study on more interesting transforms, such as out-of-plane 3D rotations, would be beneficial. It would also be interesting to learn feature spaces which are equivariant to the desired transforms while performing classification.