by Sergey Ioffe & Christian Szegedy
This paper proposes a normalization scheme used within deep model architectures to address the internal covariate shift problem. Updating the parameters of a deep neural network changes the distribution of each layer's inputs, which in turn forces lower learning rates and slows training. This observation motivates the proposed “Batch Normalization (BN)” scheme, which aims to reduce internal covariate shift by fixing the mean and variance of the layer inputs. In addition, the simplicity of the BN model and its ease of implementation can benefit many applications of deep networks. Most (all?) deep nets learn features that are poorly scaled with respect to one another, so improvements through normalization could generalize to many problems.
To address the covariate shift problem and accelerate training, layer inputs are normalized: rather than jointly whitening the inputs, each scalar feature is independently transformed to have zero mean and unit variance. Because the normalization uses mini-batch statistics and the transform is differentiable, gradients can be backpropagated through it, and no analysis of the entire training set is required after every parameter update. A learned scale and shift allow the transform to represent the identity, so the information in the network and its representational ability are preserved. In convolutional networks, the BN transform is applied per feature map, normalizing jointly over spatial locations before the nonlinearity (e.g., sigmoid or ReLU).
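The per-feature transform described above can be sketched in a few lines of NumPy. This is a minimal forward-pass illustration, not the paper's full algorithm (it omits the running statistics used at inference time); the names `gamma` and `beta` correspond to the learned scale and shift.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Per-feature batch normalization over a mini-batch.

    x has shape (batch, features); gamma and beta are the learned
    scale and shift from the paper. Mini-batch statistics replace
    full-training-set statistics, keeping the transform cheap and
    differentiable.
    """
    mu = x.mean(axis=0)                    # mini-batch mean per feature
    var = x.var(axis=0)                    # mini-batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta            # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(64, 8))     # shifted, scaled inputs
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# with gamma=1, beta=0 each feature is ~zero-mean, unit-variance
```

With `gamma` and `beta` trained rather than fixed, the network can undo the normalization where that is the optimal thing to do, which is how representational power is preserved.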
Using Batch Normalization in deep neural networks reduces the dependence of the gradients flowing through the network on the scale of the parameters or their initial values, and hence allows a much higher learning rate without the risk of divergence. As shown in the paper, the proposed normalization scheme achieves the same accuracy on image classification with significantly fewer training steps than state-of-the-art models. One point the experiments do not discuss is whether the extra cost added to each training step might offset the acceleration from requiring fewer steps.
Batch normalization makes the model generalize better by reducing covariate shift and normalizing the activations, which has motivated its use for reducing overfitting in deep models. Although this paper approaches overfitting from a different perspective than dropout, both seek more generalizable models and are therefore comparable. As mentioned in the paper, the BN scheme regularizes the model and reduces the need for dropout. However, this benefit would be clearer in practice if more careful experiments on the overfitting aspect had been included in the paper.
The proposed method’s effectiveness is demonstrated on MNIST and ImageNet classification, as well as with an ensemble of batch-normalized networks; the experiments show that this normalization scheme accelerates training and significantly stabilizes the input distributions within the network.
As an alternative to normalizing with batch statistics, one could L2-normalize the features and learn a separate scale. This is another baseline for batch normalization that does not collect statistics and has no shift term. It is explored to good effect by ParseNet and helps with learning jet / hypercolumn nets.
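The L2-normalize-and-scale idea can be sketched as follows. This is an illustrative simplification, assuming a per-sample feature vector and a per-channel scale; in practice the scale is a learned parameter trained by SGD, and ParseNet applies the normalization per spatial position of a feature map.

```python
import numpy as np

def l2_normalize_with_scale(x, scale, eps=1e-12):
    """L2-normalize each feature vector, then multiply by a scale.

    x has shape (batch, features). Unlike batch normalization, no
    mini-batch statistics are collected and there is no shift term;
    `scale` stands in for the learned per-channel scale.
    """
    norms = np.sqrt((x ** 2).sum(axis=1, keepdims=True)) + eps
    return (x / norms) * scale

x = np.array([[3.0, 4.0], [6.0, 8.0]])   # rows with norms 5 and 10
y = l2_normalize_with_scale(x, scale=np.array([10.0, 10.0]))
# both rows end up with the same magnitude regardless of input scale
```

Because the output magnitude is fixed by the learned scale alone, features from differently-scaled layers (as in hypercolumn-style nets) become directly comparable.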
Another approach to adjusting the scale and bias in response to changes in the layer distributions is to reparametrize the net: constrain the weights to have norm 1 and introduce separate parameters that scale the normalized weights. This can be another potential baseline for Batch Normalization as discussed in . However, it would be more desirable and practical to have a method that handles the re-parameterization while avoiding SVD to solve the problem.
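The reparameterization described above can be sketched as follows; the names `v` and `g` are illustrative, standing for the unconstrained direction parameters and the separate learned magnitude that together replace the raw weight vector.

```python
import numpy as np

def reparam_weights(v, g):
    """Weight reparameterization: w = g * v / ||v||.

    The direction v is normalized to unit norm and g is a separate
    learned scale, so the effective weight magnitude is decoupled
    from the scale of v. Both v and g would be trained by SGD;
    here they are fixed values for illustration.
    """
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])      # any direction, norm 5 here
w = reparam_weights(v, g=2.0)
# ||w|| equals g exactly, independent of the scale of v
```

Decoupling direction from magnitude this way gives the optimizer direct control over the effective weight scale, which is the property the review suggests as a baseline for BN.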
Batch normalization is already popular in the literature despite its short history, although it is still unclear whether it leads to better solutions in general. For example,  used batch normalization, but only because the learning problem was an extremely hard one and the network did not train without it (though it is interesting to note that  does not use the scale and shift, so that layer activations cannot go to zero). More recent empirical work  suggests that batch normalization provides only small boosts in performance when training on relatively small datasets like Pascal classification and detection, but similar algorithms that aim to make every layer of the network train at the same rate do boost performance.
Liu, Rabinovich, and Berg, “ParseNet: Looking Wider to See Better.” http://arxiv.org/abs/1506.04579
Szegedy et al., “Intriguing Properties of Neural Networks.”
Doersch et al., “Unsupervised Visual Representation Learning by Context Prediction.”
Krähenbühl et al., “Data-dependent Initializations of Convolutional Neural Networks.”