Medium Last updated on May 7, 2022, 1:15 a.m.
Batch Normalization is a method of normalizing the features at each layer of a neural network. It reduces sensitivity to poor weight initialization and mitigates internal covariate shift, that is, the change in the distribution of each layer’s inputs as the preceding layers are updated during training. It allows each layer of a network to learn a little more independently of the other layers.
Flexibility to use a higher learning rate: Batch normalization keeps each layer’s outputs from becoming extremely large or small, so larger learning rates can be used safely. It also helps with vanishing and exploding gradients.
Works as regularization: It has been observed that batch normalization reduces overfitting, as it has a slight regularization effect. Similar to dropout, it adds some noise to each hidden layer’s activations, because the normalization statistics are computed on a different mini-batch at every step. Therefore, dropout can sometimes be reduced or skipped when using batch normalization, which may help because less information is lost. However, in practice the two methods are often used together.
Improves gradient flow through the network
Explain the intuition behind Batch Normalization?
In the section above, we saw what Batch Normalization is and why it is beneficial for neural networks. In this section, we will see how it is implemented mathematically.
Normalizing data means computing its mean and variance and transforming it so that it has zero mean and unit variance. In our case, we want to normalize each hidden unit’s activation. To do this, we first need to calculate the mean and variance of that hidden unit’s activations over the current mini-batch.
Note that simply normalizing each input of a layer may change what the layer can represent. For example, normalizing the inputs of a sigmoid would constrain them to the nearly linear region of the function. To remove this constraint, the parameters β and γ are introduced and learned as part of the training process. These parameters shift and scale the normalized values, so the network can recover the original representation when that is the better choice.
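For reference, the batch normalization transform for a single hidden unit is commonly written as below. This is a sketch of the standard formulation, not necessarily the exact notation of the original figure: $m$ (the mini-batch size), $x_i$ (the unit's activation for sample $i$), and $\epsilon$ (a small constant added for numerical stability) are notation assumed here.

$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2
$$

$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta
$$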
As we can see in the algorithm above, $\mu_B$ and $\sigma_B^2$ are the mean and variance of the current batch, respectively. This step ensures that each hidden unit’s input is converted to zero mean and unit variance before it is fed into the activation function. However, we do not want the values to be forced to stay at zero mean and unit variance, since that would limit what the layer can represent. The main objective is to let the network learn and adapt the mean and variance for each layer.
For this, two new variables, γ and β, are introduced to scale and shift the normalized values, respectively. These parameters are learned and updated along with the weights and biases during training. The final normalized, scaled, and shifted version of the hidden activation for the current hidden unit is then fed into the next layer.
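The training-time steps described above can be sketched in plain Python for a single hidden unit. This is a minimal illustration, not a reference implementation: the function name `batch_norm_train`, the default values of `gamma` and `beta`, and the stability constant `eps` are assumptions made for the example.

```python
import math


def batch_norm_train(activations, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one hidden unit's activations over a mini-batch, then scale and shift.

    `activations` holds the unit's value for every sample in the batch.
    `eps` is an assumed small constant that guards against division by zero.
    """
    m = len(activations)
    mean = sum(activations) / m
    # Biased (population) variance of the mini-batch.
    var = sum((x - mean) ** 2 for x in activations) / m
    # Zero mean, (approximately) unit variance.
    normalized = [(x - mean) / math.sqrt(var + eps) for x in activations]
    # Learnable scale (gamma) and shift (beta).
    outputs = [gamma * x_hat + beta for x_hat in normalized]
    return outputs, mean, var


# Hypothetical activations of one hidden unit for a batch of four samples.
batch = [2.0, 4.0, 6.0, 8.0]
outputs, batch_mean, batch_var = batch_norm_train(batch)
```

With `gamma=1.0` and `beta=0.0` the outputs are simply the normalized values; during training both parameters would be updated by gradient descent like any other weight.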
During the testing or inference phase, we cannot apply batch normalization the same way as during training, because we might pass only one sample at a time, and computing the mean and variance of a single sample is not meaningful. For this reason, we compute a running average of the mean and variance during training and use those values, together with the trained batch-norm parameters, during testing or inference.
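The mean and variance for the test phase can be obtained from the per-batch statistics stored during training, either by picking the statistics of one stored batch at random or, as is far more common, by averaging them over all batches (for example, with an exponential moving average).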
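One common way to maintain such an average is an exponential moving average that is updated after every training batch. The sketch below assumes that approach; the function name `update_running_stats`, the `momentum` value of 0.9, the initial values, and the batch statistics are all illustrative, and frameworks differ in which term the momentum multiplies.

```python
def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.9):
    """Blend the stored statistics with those of the latest batch.

    `momentum` is the weight kept on the old running value (an assumed convention).
    """
    new_mean = momentum * running_mean + (1.0 - momentum) * batch_mean
    new_var = momentum * running_var + (1.0 - momentum) * batch_var
    return new_mean, new_var


# Assumed initial values before any batch has been seen.
running_mean, running_var = 0.0, 1.0

# Hypothetical (mean, variance) pairs from three training batches.
for batch_mean, batch_var in [(5.0, 4.0), (5.2, 3.8), (4.8, 4.2)]:
    running_mean, running_var = update_running_stats(
        running_mean, running_var, batch_mean, batch_var
    )
```

After many batches the running values approach the typical batch statistics, and the influence of the initial values fades.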
At inference, the input is normalized with these stored mean and variance values, and it is then scaled and shifted with the same formula used during training, where β and γ are the parameters learned during training.
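The inference-time computation can be sketched as follows for a single activation of one hidden unit. The function name `batch_norm_inference`, the constant `eps`, and the stored statistics and parameter values are assumptions chosen only to make the example concrete.

```python
import math


def batch_norm_inference(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """Normalize a single activation with stored statistics, then scale and shift."""
    x_hat = (x - running_mean) / math.sqrt(running_var + eps)
    return gamma * x_hat + beta


# Hypothetical stored statistics and learned parameters for one hidden unit.
y = batch_norm_inference(x=7.0, running_mean=5.0, running_var=4.0, gamma=1.5, beta=0.5)
```

Because no batch statistics are computed here, the result for a given sample does not depend on which other samples are processed with it.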