Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

by Sergey Ioffe & Christian Szegedy
ArXiv, 2015

This paper proposes a normalization scheme used inside deep network architectures to handle the internal covariate shift problem. Updating the parameters of a deep neural network changes the distribution of each layer’s inputs, which in turn forces the use of lower learning rates and slows training. This observation motivates the proposed “Batch Normalization (BN)” scheme, which aims to reduce internal covariate shift by fixing the mean and variance of the layer inputs. In addition, the simplicity of BN and its ease of implementation can benefit many applications of deep networks. Most (all?) deep nets learn features that are poorly scaled with respect to one another, so improvements through normalization could generalize to many problems.

To address the covariate shift problem and accelerate training, layer inputs are normalized: full whitening (a linear transform to zero mean and unit variance with decorrelated dimensions) is approximated by normalizing each scalar feature independently. Because this normalization is differentiable and relies on mini-batch statistics, it does not require analyzing the entire training set after every parameter update. The transform is applied to the input of the nonlinearity (e.g., sigmoid or ReLU) and, in convolutional networks, is shared across all spatial locations of a feature map; a learned scale and shift for each feature preserve the information in the network and its representational ability.
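
For concreteness, below is a minimal NumPy sketch of the BN forward pass for a fully connected layer, using mini-batch statistics and the learned scale and shift (gamma, beta). Variable names are ours, and the inference-time switch to population statistics is omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x:     (batch_size, num_features) pre-activations
    gamma: (num_features,) learned scale
    beta:  (num_features,) learned shift
    """
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # learned scale/shift restores representational power

# Example: 32 examples, 4 badly scaled features
x = np.random.randn(32, 4) * 3.0 + 5.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))  # approximately 0 and 1
```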

Using Batch Normalization in deep neural networks reduces the dependence of the gradients flowing through the network on the scale of the parameters or their initial values, and hence allows much higher learning rates without the risk of divergence. As shown in the paper, the proposed normalization scheme reaches the same accuracy with significantly fewer training steps on image classification compared to state-of-the-art models. One point not discussed in the experiments is that, despite requiring fewer training steps, the extra cost added to each training step may partially offset the accelerating effect of BN.

Batch Normalization also makes the model more generalizable by reducing covariate shift and normalizing the activations. This property has motivated its use for reducing overfitting in deep models. Although this paper approaches overfitting from a different perspective than dropout, both seek more generalizable models and hence are comparable. As mentioned in the paper, BN regularizes the model and reduces the need for dropout. However, this benefit would be clearer in practice if more careful experiments on the overfitting aspect had been included in the paper.

The proposed method’s effectiveness is investigated on MNIST and ImageNet classification, as well as with an ensemble of batch-normalized networks, and it is shown that the normalization scheme accelerates training and significantly stabilizes the input distributions within the network.

As an alternative to normalizing with batch statistics, one could L2-normalize the features and learn a separate scale. This is another baseline to batch normalization that requires no collected statistics and has no shift parameter. It is explored to good effect by ParseNet [1] and helps with learning jet / hypercolumn nets.
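
A minimal sketch of that alternative, assuming it takes the form we describe (names and shapes are ours, not ParseNet’s exact implementation): each feature vector is L2-normalized and multiplied by a learned per-channel scale, with no mean subtraction and no batch statistics.

```python
import numpy as np

def l2_normalize_and_scale(x, scale, eps=1e-12):
    """L2-normalize each feature vector, then apply a learned per-channel scale.

    x:     (batch_size, num_channels) features
    scale: (num_channels,) learned scale parameters (trained with the rest of the net)
    """
    norms = np.sqrt((x ** 2).sum(axis=1, keepdims=True)) + eps
    return (x / norms) * scale
```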

Another approach to adjusting the scale and bias in response to changes in the layer distributions is to reparameterize the network by constraining the weights to have unit norm and introducing separate parameters that scale the normalized weights. This can be another baseline for Batch Normalization, as discussed in [2]. However, it would be more desirable and practical to have a method that handles the reparameterization without resorting to SVD.
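
A brief sketch of one common form of this reparameterization (our notation, not necessarily the exact formulation referenced above): each output unit’s effective weight vector is a unit-norm direction scaled by a separately learned parameter.

```python
import numpy as np

def weight_normalized_linear(x, v, g, b):
    """Linear layer whose effective weights are g * v / ||v|| (row-wise).

    x: (batch_size, in_dim) inputs
    v: (out_dim, in_dim) unconstrained direction parameters
    g: (out_dim,) learned scales, one per output unit
    b: (out_dim,) biases
    """
    v_norm = np.linalg.norm(v, axis=1, keepdims=True)  # (out_dim, 1)
    w = g[:, None] * v / v_norm                        # unit-norm rows, rescaled
    return x @ w.T + b
```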

Batch normalization is already popular in the literature despite its short history, although it is still unclear whether it leads to better solutions in general. For example, [3] used batch normalization, but only because the learning problem was an extremely hard one and the network did not train without it (it is interesting to note that [3] does not use the scale and shift, so that layer activations cannot collapse to zero). More recent empirical work [4] suggests that batch normalization provides only small boosts in performance when training on relatively small datasets such as PASCAL classification and detection, but that similar algorithms which aim to make every layer of the network train at the same rate do boost performance.

[1] Liu, Rabinovich, and Berg. “ParseNet: Looking Wider to See Better.” http://arxiv.org/abs/1506.04579

[2] Szegedy et al. “Intriguing Properties of Neural Networks.” http://arxiv.org/abs/1312.6199

[3] Doersch et al. “Unsupervised Visual Representation Learning by Context Prediction.” http://arxiv.org/abs/1505.05192

[4] Krähenbühl et al. “Data-dependent Initializations of Convolutional Neural Networks.” http://arxiv.org/abs/1511.06856

Understanding Image Representations by Measuring Their Equivariance and Equivalence

by Karel Lenc & Andrea Vedaldi
CVPR 2015

This paper analyzes image feature representations by studying their mathematical properties, namely equivariance, invariance, and equivalence. The paper focuses on how simple transforms of the input, such as vertical and horizontal flipping, rotation, and scaling, can be reproduced in the feature space by a simple affine transform. The paper studies both “shallow” representations like HOG and intermediate representations from CNNs. It also analyzes the compatibility of feature representations by fusing two halves of different CNNs into “Franken-CNNs”. Overall, the paper provides a novel approach to the analysis of feature representations, in particular giving an empirical evaluation of equivariance in deep nets. This complements previous analyses of image representations, such as feature visualization and inversion methods.

Equivariance to a transform is the ability to recover the corresponding transform in the feature space, given the transform applied to the input. Since such a mapping trivially exists whenever the feature representation is invertible, the paper is specifically interested in the existence of “simple” (affine) transforms in the output space. In Section 3.4, the paper provides an example where recovering an affine transform in the output space yields a speedup for a pose prediction task. In Section 2.1, the paper explores several methods to uncover an affine equivariant mapping. Regularization is needed to induce sparsity in the final solution because of the high dimensionality of the problem. Two sparsity regularizers are introduced: one encourages sparsity in the transform overall, and the other specifically incorporates prior knowledge of the input transform. The experiments in Figure 3 on a “shallow” HOG representation confirm that regularization is needed and that using prior knowledge of the transform works better. Figure 5 provides a qualitative example of the learned equivariant mapping.
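
As a simplified illustration of the objective in Section 2.1: given features of original and transformed images for a fixed transform g, an affine map (A, b) can be fit by regularized least squares. Plain ridge regression is used here in place of the paper’s structured sparsity regularizers, purely for brevity.

```python
import numpy as np

def fit_affine_equivariant_map(F, Fg, lam=1e-2):
    """Fit A, b minimizing sum_i ||A @ F[i] + b - Fg[i]||^2 + lam * ||A||_F^2.

    F:  (n, d) features of the original images, Phi(x)
    Fg: (n, d) features of the transformed images, Phi(g x)
    """
    n, d = F.shape
    X = np.hstack([F, np.ones((n, 1))])           # append a bias column
    reg = lam * np.eye(d + 1)
    reg[-1, -1] = 0.0                             # do not penalize the bias
    W = np.linalg.solve(X.T @ X + reg, X.T @ Fg)  # (d + 1, d)
    A, b = W[:d].T, W[d]
    return A, b

# The residual ||A @ Phi(x) + b - Phi(g x)|| then measures how equivariant Phi is to g.
```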

In addition to studying HOG features, the paper provides extensive experiments on deep CNN architectures and explores how well an equivariant mapping can maintain classification performance. The results in Figure 6 and Table 2 evaluate how well equivariant mappings can be found at different layers of the CNN. A few intuitions are confirmed numerically: CNN representations are invariant to horizontal flipping, but sensitive to vertical flipping and rotations. In this section, the reader might also benefit from some qualitative examples of equivariant mappings, in particular examples where the equivariant mapping is unable to recover the feature representation.

A study of translation equivariance of CNNs, which matters for object detection and segmentation, is not presented in this paper. Although translation equivariance at a coarse level (stride ~32) follows from the convolution operation itself, it is not necessarily implied at a fine level: most often one does not simply get a 4-pixel shift in a fully convolutional segmentation output given an input shifted by 4 pixels. In practice such high-resolution equivariance seems to hold (or can be implicitly learned in detection and segmentation tasks), but it would be interesting to see some quantitative analysis.

The paper also provides a quantitative measure of invariance. It notes that, surprisingly, invariance does not increase monotonically for some transforms and argues that invariance to certain factors may be lost due to pooling. An unmentioned reason for the loss of invariance is the asymmetry of the convolutional kernels themselves. The network is shown to be invariant to horizontal flipping, which is unsurprising, as this is directly built in during training through data augmentation. An experiment studying the quality of an equivariant mapping on a network trained without data augmentation would be interesting to see.

Finally, the paper presents four different CNNs with similar architectures, trained on a variety of domains, such as ILSVRC12, Places, and both datasets combined. The paper then composes a new network that stitches together two halves from two different networks, with an “equivalence mapping” in the middle. This is an interesting application of the established framework. A minor issue is that the definition of equivalence should be symmetric; otherwise, consider the case where Φ’ maps every x to the all-zero vector: any representation Φ would then be deemed equivalent to Φ’, since a mapping onto zero always exists. An additional possible experiment is to look at transformations between networks trained with different random seeds. Since training and averaging two models almost always helps, observing the learned transforms might shed light on what one model learns that the other does not.

One natural question is how to take the insights from this work and apply them. Currently, we augment training sets to accommodate some transformations. As a result, the top layers of commonly used networks are invariant to those transformations while mid layers are equivariant to them, as t-SNE embeddings of pool5 features have shown. Overall, the approach in this paper seems preferable to pure data augmentation, since some transformations are very unlikely to occur in practice; for example, humans are very bad at recognizing vertically inverted faces.

Since publication, the idea of modeling equivariance has been applied in “Learning image representations equivariant to ego-motion” by Jayaraman and Grauman, where a feature representation is trained to be equivariant to input transforms caused by ego-motion (Section 5.3). In “Cross Modal Distillation for Supervision Transfer,” Gupta et al. learn intermediate feature representations that are invariant to domain shifts, specifically between the RGB and depth domains, as a method of domain adaptation.

One concern is that, given the high dimensionality of the feature space, it may always be possible to find an equivariant mapping regardless of the features. The use of a strong regularizer suggests that insufficient amounts of data were used. Consequently, it would be interesting to see whether such mappings can still be found on sufficient amounts of data without a strong regularizer.

Overall, this paper conducts a novel study on characterizing feature representations. Conducting a similar study on more interesting transforms, such as out-of-plane 3D rotations, would be beneficial. It would also be interesting to learn feature spaces that are equivariant to the desired transforms while performing classification.

Unsupervised Learning of Visual Representations using Videos

by Xiaolong Wang & Abhinav Gupta
ArXiv, 2015

The aim of this paper is to explore the idea of learning visual representations (a ConvNet) with unsupervised learning. Specifically, the paper compares pretraining on the task of visual tracking with supervised pretraining on ImageNet for detection on the PASCAL VOC 2012 dataset. The paper attempts to answer the compelling question of whether one can learn effective visual representations *without* semantic supervision, using image patch tracking. Other recent work has similarly attempted to learn effective representations from context [1] and egomotion [2,3].

The unsupervised learning in this work consists of two main steps: (1) generating the dataset of image patch “tracks” and (2) training a ConvNet on that dataset using a “ranking” loss.

The first step of the unsupervised learning method consists of generating a dataset of image patch “tracks.” The proposed method first selects video sequences to use (based on sufficient motion in the scene, but little camera motion) and then uses SURF keypoints and the KCF tracker to obtain pairs of image patches that are 30 frames apart. This automatic dataset-generation process creates a supervisory signal on an auxiliary task, providing the network with information on how things move; however, the videos were selected using human queries, introducing bias into the dataset. The paper also specifically selects against video segments with camera movement, even though image changes due to camera movement also provide very valuable information (as demonstrated by [2,3]). It is unclear why the method could not have also tracked random small patches that move in the image due to the change in camera perspective. Finally, the dataset might benefit significantly from some notion of scale. The paper makes the point that humans learn visual representations without strong supervision, though humans also have access to depth information (from stereo), providing a rough measure of the size of a tracked object. Figure 3 shows that the scales of image patches are highly variable; we think the approach would benefit from some sort of multiscale processing.

The second step is to train a “Siamese-triplet” network using a ranking loss. The approach is simple and straightforward to implement. We believe the method could do a more thorough job of hard-negative mining by considering more than just 100 patches (their batch size); this hyperparameter was not discussed.
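
A minimal sketch of the ranking objective as we understand it (the margin value and the cosine distance are illustrative choices): the query patch’s features should be closer to those of its tracked counterpart than to those of a random negative patch, by a margin.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-8):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def triplet_ranking_loss(f_query, f_tracked, f_negative, margin=0.5):
    """Hinge loss encouraging D(query, tracked) + margin <= D(query, negative).

    Each argument is a feature vector produced by one branch of the
    shared-weight (Siamese) network.
    """
    d_pos = cosine_distance(f_query, f_tracked)
    d_neg = cosine_distance(f_query, f_negative)
    return max(0.0, d_pos - d_neg + margin)
```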

The paper presents two main experiments, the first on detection with the PASCAL VOC 2012 dataset. The goal is to show that the representation learned by tracking patches is useful for other visual tasks. To do so, the network is fine-tuned on the VOC 2012 training set. The results show a moderate improvement over training from scratch (47.0 mAP with unsupervised pretraining vs. 44.0 mAP from scratch). They also show an impressive improvement from training an ensemble of networks, though there is no comparison to an ensemble of networks trained from scratch (with different random initializations). The paper also provides a comparison to initializing from an ImageNet-trained network, which performs slightly better (54.4 mAP with ImageNet pretraining vs. 52 mAP with the proposed method).

We would be very interested in seeing the performance *without* finetuning end-to-end, and simply training a linear SVM on top of the learned representation. This might be more indicative of the quality of the learned representation.

The experiments with surface normal estimation show findings similar to those on PASCAL VOC.

Lastly, interesting future work would consider the questions: why does supervised ImageNet pretraining perform better? Is the performance gap caused by a fundamental limitation regarding supervision via tracking? Is it caused by an issue with the quantity or diversity of the collected videos? Or is it caused by issues with the proposed data-extraction approach and training method?

 

[1] Unsupervised Visual Representation Learning by Context Prediction / Carl Doersch, Abhinav Gupta, Alexei A. Efros http://arxiv.org/abs/1505.05192

[2] Learning to See by Moving / Pulkit Agrawal, Joao Carreira, Jitendra Malik http://arxiv.org/abs/1505.01596

[3] Learning image representations equivariant to ego-motion / Dinesh Jayaraman and Kristen Grauman http://arxiv.org/abs/1505.02206

Making Working Memory Work: A Computational Model of Learning in the Prefrontal Cortex and Basal Ganglia

by Randall C. O’Reilly & Michael J. Frank
Neural Computation, 2006

This paper presents a biologically plausible model of working memory. The motivation is two-fold: first, to understand how working memory functions in the human (or monkey) brain, and second, to create a neural net capable of solving problems that require short-term storage of information, such as memorizing a sequence of letters and then repeating them. Traditional recurrent neural nets do not do well on this kind of problem since they lack a stable mechanism for storing information across arbitrary time delays.

The model essentially consists of two parts: a memory bank (in the Prefrontal Cortex) and a gating mechanism that controls when memories can be written (controlled by the Basal Ganglia and other midbrain structures). The gating system is modeled as an actor/critic architecture: the critic evaluates which stimuli are task-relevant, and the actor uses this knowledge to open and close memory gates.

The paper describes a reinforcement learning algorithm that can learn to solve various working memory tasks given this architecture. The algorithm learns which stimuli are task-relevant for which memories, and how to combine the stimuli and memories to produce a desired output. To do so, the algorithm must solve both the temporal and structural credit assignment problems. The former refers to figuring out when in time the relevant information for a task was given and the latter refers to figuring out what aspect of that information is relevant.

To solve the temporal credit assignment problem, the paper introduces the Primary Value Learned Value (PVLV) learning algorithm, which is closely related to Temporal Differences (TD) learning. The goal is to associate an event at one point in time with a reward that occurs some time later. The algorithm learns two associations: the Primary Value (PV) system learns to anticipate primary rewards at the time they actually occur, while the Learned Value (LV) system learns an association between an earlier stimulus and the reward, and is trained only when a primary reward is present or expected. Together these links associate the earlier event with future rewards. Unlike TD learning, there is no propagation of the reward signal through multiple time steps, and consequently the paper argues that the PVLV algorithm may work better when the intervening events are chaotic and unpredictable. The PVLV algorithm also maps well onto known biology, which may be less true of the TD algorithm.
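
A highly simplified sketch of the two delta-rule associations as we read them (our abstraction, omitting the gating machinery and the full dopamine-signal computation): PV is trained toward the actual reward at every step, while LV is updated only when a primary reward is present or expected.

```python
import numpy as np

def pvlv_step(w_pv, w_lv, stimulus, reward, reward_present_or_expected, lr=0.1):
    """One simplified PVLV update (delta rules only).

    w_pv:     weights of the Primary Value system (predicts reward at the current step)
    w_lv:     weights of the Learned Value system (associates stimuli with reward)
    stimulus: feature vector of the currently active inputs
    reward:   scalar primary reward delivered now (0 if none)
    """
    pv = w_pv @ stimulus
    lv = w_lv @ stimulus

    # PV: delta rule toward the actual reward at every time step.
    w_pv = w_pv + lr * (reward - pv) * stimulus

    # LV: trained only when a primary reward is present (or expected),
    # so stimulus-onset associations are not unlearned during the delay period.
    if reward_present_or_expected:
        w_lv = w_lv + lr * (reward - lv) * stimulus

    return w_pv, w_lv
```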

The model is tested through several experiments in which it learns to solve various working memory tasks. Each task requires maintaining knowledge of task state across arbitrary time delays. For example, one task, termed 1-2-AX, involves taking a different action in response to presented letters depending on whether the most recently observed number in the sequence was a “1” or a “2”. These experiments demonstrate that adaptive gating is critical to the tasks: models with such a mechanism, the proposed model as well as LSTM, outperformed recurrent neural net models without gating. The experiments could be improved by adding stronger generalization tests. Two of the three experiments only report training-set performance (epochs to reach a performance criterion on the training set); the last experiment includes a generalization analysis on a test set. It would be useful to see the same kind of analysis for all the experiments.

While the experiments are convincing at showing the power of adaptive gating, it is not clear if the proposed method is advantageous over other gating methods like LSTM. Both methods perform similarly on all tasks evaluated, with a slight advantage to the proposed method. Given the current popularity of LSTM, it would be interesting to investigate in greater detail how the algorithm proposed in this paper compares. In particular, it would be great to see experiments on harder tasks or more practical tasks, like language translation. The network implemented in the paper is very small and it remains unclear if it would scale well to the size required for solving these practical tasks.

The paper argues that whether or not their method gets better results than other algorithms, it is worth studying since it is more biologically plausible than the alternatives. This point deserves further analysis: what is the advantage of biological plausibility? While this is not a question that can be answered in the scope of a single paper, it will be important for future work to a) use the model to make novel discoveries about biology and b) use the biological inspirations to achieve computational results that beat the alternative approaches.

Inverting Convolutional Networks with Convolutional Networks

by Alexey Dosovitskiy & Thomas Brox
ArXiv, 2015

The objective of the paper is to study image representations, e.g., features generated using a convolutional neural network, by inverting them. Re-generating the image from which the features were extracted helps to understand what information from the image gets preserved in the features. This work has a similar goal to the work of Mahendran and Vedaldi [CVPR 2015], where the authors try to optimize for an image which produces similar features to a given one. In contrast, this paper trains a fully supervised up-convolutional (or de-convolutional) neural network to generate an image given input features. A number of experiments are presented which explore the different factors of variation that are preserved in the feature domain and provide insights regarding widely used CNN features.

The approach adopted by this paper has the following advantages. One of the challenges in feature inversion is that many inputs may map to the same or similar feature vectors, because features are typically lower-dimensional and designed to be invariant to certain variations. While Mahendran and Vedaldi [CVPR 2015] ignore this property and directly optimize the distance between the original and reconstructed features, this paper addresses the problem directly by learning to generate the original image. Moreover, the method requires just a forward pass at test time to invert the features.
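
To make the setup concrete, here is a toy up-convolutional decoder of the kind involved, written in PyTorch and trained with an L2 reconstruction loss; the actual architecture, layer sizes, and training details in the paper differ, so treat this purely as a sketch.

```python
import torch
import torch.nn as nn

class UpConvDecoder(nn.Module):
    """Toy decoder: feature vector -> 64x64 RGB image (all sizes illustrative)."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(feat_dim, 256 * 4 * 4), nn.LeakyReLU(0.2))
        self.upconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),   # 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),    # 32x32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),                        # 64x64
        )

    def forward(self, feats):
        h = self.fc(feats).view(-1, 256, 4, 4)
        return self.upconv(h)

# One training step with an L2 (MSE) reconstruction loss against the original image.
decoder = UpConvDecoder()
feats = torch.randn(8, 4096)        # stand-in for CNN features of 8 images
images = torch.randn(8, 3, 64, 64)  # stand-in for the corresponding images
loss = nn.functional.mse_loss(decoder(feats), images)
loss.backward()
```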

The paper performs rigorous experiments to analyze different properties of CNN-based features. All experiments use AlexNet pre-trained on ImageNet. It shows that the colors of objects in the image are preserved even in the features of higher layers like FC8 and can be recovered using the proposed technique. Similarly, the positions of objects in the image are preserved in the features of all layers. Consistent with previous related work, it is shown that the precise values of the features do not carry much information, since binarizing them approximately preserves the reconstruction. In another interesting experiment, images are reconstructed from interpolated and independently sampled features. These reconstructions look reasonable and follow natural image statistics, which brings out an important point about the process: inversion of these features is possible, despite the non-bijective mapping between images and features, because the reconstruction is limited to the embedding of natural images. Here the network learns to enforce the natural image prior, unlike Mahendran and Vedaldi [CVPR 2015], where the natural image prior is added explicitly. However, all the reconstructed images are somewhat blurry, and the paper suggests this may be a result of the L2 reconstruction loss, which usually favors over-smoothed solutions.

The experiments in the paper are quite satisfactory and lead to further insightful discussion. Results like the binarization finding have also been validated before for AlexNet-like architectures. A significant question is whether these results hold across architectures like VGG, GoogLeNet, etc., or whether they are AlexNet-specific; this experiment would certainly add more value to the paper. Also, the paper presents a good autoencoder baseline by allowing the ImageNet-pretrained encoder to be fine-tuned. It would be good to know what happens if the encoder part of the autoencoder is trained from scratch as well. The experiments showing reconstruction from fc8 while muting the top-5 highest-scoring classes are really impressive, even though the network was not trained exactly for that; this is also consistent with the distillation idea proposed in earlier work [1]. However, the complementary experiment, which tries to reconstruct from just the top-5 activations, could be further explored. The dimensionality of the embedding in the two settings differs greatly (5 vs. 995), and the network was never trained or fine-tuned for such a low-dimensional input. It could be that the top-5 activations actually contain more information than the rest, but because the inversion network was trained using all activations, it cannot invert properly from only the top 5. It would therefore be interesting to know what happens if the network is trained to invert from the top-5 activations alone. Further, in the experiment where images are obtained by inverting randomly sampled features, the reconstructions are quite blurred, and it is a bit difficult to tell whether an image is natural or not. It might be interesting to see which classes these randomly sampled features are classified into and whether the reconstructed images make sense given those classifications.

This is one of the first works to show that CNNs can be trained to generate images, albeit from a particular kind of vector, i.e., features. Owing to its simplicity and good results, this method certainly encourages computer vision research on generating natural-looking images.

References:

[1] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).

End-to-End Memory Networks

by Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston & Rob Fergus
ArXiv, 2015

There has been a recent surge of interest in computational models with easily accessible and scalable memory, which would allow them to perform tasks that require multi-hop computation and to process variable-length inputs. In particular, this work adds to the efforts to allow neural networks to capture long-term dependencies in sequential data.

The standard RNN or LSTM architectures, while performing well on various sequential tasks like captioning and translation, can only capture ‘memory’ via a latent state that is unstable over long timescales. This prevents such models from being applied in settings that require facts to be recalled after a long time period (e.g., question answering after reading a story). To overcome this limitation and obtain a more explicit and longer-term memory representation, this work leverages explicit storage and attention-based ideas that have been explored previously, in particular by two related recent methods, ‘Neural Turing Machines’ [1] and ‘Memory Networks’ [2]. While [1] demonstrates applications of memory models to tasks like sorting and exploits both memory-read and memory-write operations, this work and [2] address text-based reasoning and focus only on memory-read operations. This paper relaxes the strong supervision requirement of [2] (where the index of the memory to be looked up was required during training) and shows interesting quantitative as well as qualitative results for the tasks addressed.

Given a sequence of sentences and a question, the aim is to produce an answer from a vocabulary V. The main module of the architecture is a memory layer which takes an input u (for example, an embedded question) and has access to the stored sequential input (e.g., the sentences). This layer first linearly embeds all the sentences and computes a memory output as a weighted average of the embedded sentences. The weights model soft attention and are computed using a softmax over the similarity of the embedded sentences to the layer input. The final output of this layer is the sum of its input and the memory output.

The architecture for the question answering tasks is as follows: the input sentences are stored, and the question is linearly embedded and serves as input to the first of K (K <= 3) memory layers. The output of the k-th memory layer serves as input to the (k+1)-th memory layer. Finally, the output of the last memory layer, followed by a linear map and a softmax, gives a probability for each word in V.
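
A minimal sketch of one memory hop as described above (bag-of-words sentence vectors and plain embedding matrices; position encoding, temporal features, and weight tying across hops are omitted).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def memory_hop(u, sentences_bow, A, C):
    """One memory layer of an end-to-end memory network (simplified).

    u:             (d,) layer input, e.g. the embedded question
    sentences_bow: (n, vocab) bag-of-words vectors of the stored sentences
    A:             (vocab, d) input-memory embedding matrix
    C:             (vocab, d) output-memory embedding matrix
    """
    m = sentences_bow @ A   # (n, d) memories used for matching
    c = sentences_bow @ C   # (n, d) memories used for the output
    p = softmax(m @ u)      # soft attention over the n sentences
    o = p @ c               # weighted average of output memories
    return u + o            # the sum feeds the next hop

# Stacking K such hops and applying a final softmax over a linear map of the last
# output yields the answer distribution over the vocabulary V described above.
```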

There are several finer points in the implementation which are crucial for performance, including parameter sharing across layers (Sec 2.2), representing sentences as bags of words vs. position-encoded vectors (Sec 4.1), modifying embeddings by adding a sentence-index-specific bias (Sec 4.1), adding random noise (Sec 4.1), and the use of a particular training schedule (Sec 4.2).

The paper first evaluates on the QA task of [2] and demonstrates improvements over an LSTM variant, as well as empirically validating the various training choices. The quantitative results, while still short of the strongly supervised method of [2], highlight the importance of allowing multiple memory-read operations, and the qualitative visualizations of the soft attention provide additional insight. The paper further adapts the architecture to a word-prediction task, demonstrates improvements over standard approaches, and explores the effect of a larger memory.

The paper also mentions that the model’s performance has high variance and that the run with the lowest training error is chosen; it would be instructive to know whether a similar strategy was required for the baseline methods. Another claim is that joint training across tasks helps performance, but this seems not to hold in the regime of large training data, and the paper could speculate on why. It would also be interesting to see some examples and more details on how the system deals with variable-length sequential output; the exposition in the paper does not make this immediately clear. This work could also be extended to leverage the sequential nature of the task more explicitly: the current method has a fixed “Position Encoding” of sentences and a sentence-index-based bias feature in the memory module, and these could be replaced by learned RNNs, possibly similar to [2]. The inclusion of a simple sentence-parsing and logical-reasoning baseline for answering the questions would further highlight the benefits of the learning-based methods presented.

Overall, this is an interesting direction. The soft-attention based memory lookup used relaxes strong supervision requirements and can potentially allow applications to real question answering – it would be really exciting to see further developments along these directions.

 

References:

[1] Graves, Alex, Greg Wayne, and Ivo Danihelka. “Neural Turing Machines.” arXiv preprint arXiv:1410.5401 (2014).

[2] Weston, Jason, Sumit Chopra, and Antoine Bordes. “Memory networks.” arXiv preprint arXiv:1410.3916 (2014).

Classifier Adaptation at Prediction Time

by Amelie Royer & Christoph H. Lampert
CVPR, 2015

This paper aims to provide more accurate image classification by creating an adaptive component that can be added onto any pre-existing classifier; instead of assuming that the test data are i.i.d., it assumes they are semantically related.

The first part of the paper presents a classifier adaptation system that relies on estimating the class proportions in the test data, using a symmetric Dirichlet distribution as the prior. The “n” term of this distribution is computed differently depending on the prediction scenario (online, bandit feedback, or unsupervised) and on whether the class distribution varies over time. For time-varying distributions, the same terms are used as in the non-time-varying cases, except that a sliding window allows the estimate to change over time.
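
A hedged sketch of the basic adaptation step as we read it (our notation, assuming the standard prior-shift correction of a calibrated posterior): class counts observed on the test stream are smoothed with a symmetric Dirichlet prior to estimate the test-time class proportions, which then reweight the base classifier’s predictions.

```python
import numpy as np

def adapt_posterior(p_y_given_x, counts, train_prior, alpha=1.0):
    """Reweight a fixed classifier's posterior by estimated test-time class proportions.

    p_y_given_x: (K,) calibrated posterior from the pre-existing classifier
    counts:      (K,) class counts seen so far on the test stream (true labels in the
                 online setting, predicted labels in the unsupervised setting)
    train_prior: (K,) class proportions the classifier was trained under
    alpha:       symmetric Dirichlet smoothing parameter
    """
    test_prior = (counts + alpha) / (counts.sum() + alpha * len(counts))
    adapted = p_y_given_x * test_prior / train_prior   # prior-shift correction
    return adapted / adapted.sum()
```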

The second half of the paper proposes a way to test adaptive systems that does not rely on i.i.d. test data. Specifically, the paper focuses on methods to generate non-i.i.d. test sequences. The proposed methods use random walks through WordNet, computed via either multidimensional scaling or kernelized sorting, and allow for random jumps between contexts. The authors also use noun sequences from books on Project Gutenberg to generate additional image sequences for testing. The paper does not make clear the benefit of using the two word embeddings over using the calculated distances directly. It also seems that the fictional and older texts present on Project Gutenberg would create rather artificial sequences whose noun-class distributions differ from those seen in the world today. It would be interesting to see the paper’s approaches applied to more realistic data, such as first-person video capturing common tasks and situations.

In the unsupervised prediction case, feeding the predicted outputs back in as a prior, as suggested in the paper, can lead the system into overconfidence in a few categories. Since the test instances are not assumed to be i.i.d., it is possible that instances from a particular class come first in the test set. This would make the estimated frequency of that class high, and after adapting to these few instances the classifier would almost never predict other classes, which would be problematic.

Nevertheless, the experiments in the paper compare the performance of these adaptive classifiers with normal, non-adaptive classifiers (trained CNN and SVM models) on the various types of generated image sequences. The results show that the adaptive classifiers work better on all image sequences except a sequence of random images, for which the baseline classifiers work best.

Since the online prediction scenario has access to the labels for all previous test instances, it would be interesting to see a comparison of the methods proposed in this paper with continuing to run SGD on the test instances.

This paper shows promising results for the effect of changing the prior distribution over classes to better match a situation; however, feature adaptation could also be a useful approach. It is not clear that people use such class priors: a foreign object can appear in a scene, and a human will identify it even when adapted to the current objects. Contextual priors, on the other hand, are used heavily, such as when disambiguating a low-resolution patch based on what is around it. Using scene-specific models is useful, but achieving this only by modifying the class priors seems superficial.