Batch Normalization

Overview

Batch Normalization (BatchNorm) is a technique used in training deep neural networks to improve the stability and performance of the model. By normalizing the inputs to each layer, it addresses issues like internal covariate shift, leading to several benefits:

Stabilizes Learning: Reduces internal covariate shift by normalizing inputs, making the learning process more stable.
Speeds Up Training: Models with batch normalization can converge faster and are less sensitive to the initial learning rate.
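Improves Generalization: Acts as a mild regularizer, because each example is normalized with statistics that depend on the other examples in its mini-batch. This can reduce the need for other regularization techniques such as dropout.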

Motivation

Batch normalization normalizes the activations of a layer over the mini-batch in three steps:

Compute Mean and Standard Deviation: Calculate the mean \(\mu\) and standard deviation \(\sigma\) for each feature in the mini-batch.
Normalize: Subtract the mean and divide by the standard deviation:

\[\hat{x} = \frac{x - \mu}{\sigma}\]

Scale and Shift: Apply learnable parameters \(\gamma\) (scale) and \(\beta\) (shift) to the normalized output:

\[y = \gamma \hat{x} + \beta\]

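Here, \(\gamma\) and \(\beta\) are parameters learned during training. They let the network preserve its representational power: with \(\gamma = \sigma\) and \(\beta = \mu\), the layer recovers the original, unnormalized activations.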
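The three steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `batch_norm`, the input layout (rows are examples, columns are features), and the small constant `eps` added to the variance for numerical stability are assumptions that do not appear in the formulas above.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) of x over the batch (rows), then scale and shift."""
    mu = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                     # per-feature (biased) variance, sigma squared
    x_hat = (x - mu) / np.sqrt(var + eps)   # eps is an assumed stabilizer against division by zero
    return gamma * x_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))  # 64 examples, 4 features

# With gamma = 1 and beta = 0, each feature has roughly zero mean and unit variance.
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))

# With gamma = sigma and beta = mu, the layer approximately undoes the normalization.
y_identity = batch_norm(x, gamma=x.std(axis=0), beta=x.mean(axis=0))
```

The second call illustrates why the scale and shift matter: the layer can learn its way back to the unnormalized activations if that is what training favors.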

Train Time

During training, the normalization process uses the statistics of the current mini-batch:

Calculate Batch Statistics: Compute the mean and variance for each feature in the mini-batch.
Normalize: Use these statistics to normalize the inputs of each layer.
Apply Learnable Parameters: Adjust the normalized inputs using the learnable parameters \(\gamma\) and \(\beta\).
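A training-mode forward pass might look like the sketch below. The class `BatchNorm1D`, its attribute names, and the exponential-moving-average update with a `momentum` of 0.1 are hypothetical choices made for illustration. The text only says that running estimates are accumulated during training, not how. The sketch also uses the biased batch variance for the running estimate, which is a simplification.

```python
import numpy as np

class BatchNorm1D:
    """Hypothetical minimal layer; names and the momentum update are illustrative assumptions."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)          # learnable scale
        self.beta = np.zeros(num_features)          # learnable shift
        self.running_mean = np.zeros(num_features)  # estimate used later at inference
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def forward_train(self, x):
        # 1. Batch statistics, one value per feature.
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # 2. Normalize with the statistics of the current mini-batch.
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        # Accumulate running estimates (assumed exponential moving average).
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        # 3. Apply the learnable scale and shift.
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
layer = BatchNorm1D(num_features=3)
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(32, 3))
    out = layer.forward_train(batch)
```

After enough batches, `layer.running_mean` and `layer.running_var` drift toward the mean and variance of the data, while each individual output is still normalized with its own batch's statistics.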

Test Time

During inference (test time), batch normalization uses running estimates of the mean and variance, accumulated during training:

Running Statistics: Use the running mean and running variance instead of batch statistics.
Normalize: Normalize inputs using these running statistics to ensure consistency and stability.

\[\hat{x} = \frac{x - \text{running mean}}{\sqrt{\text{running variance}}}\]

Because these statistics are fixed after training, the output for a given input at inference does not depend on the size or composition of the batch it is evaluated in.
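Inference-mode normalization can be sketched as below. It divides by the square root of the running variance, that is, by the running standard deviation. The function name `batch_norm_inference`, the constant `eps`, and the numeric running statistics are made up for illustration and stand in for values accumulated during training.

```python
import numpy as np

def batch_norm_inference(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """Normalize with fixed running statistics instead of the statistics of x."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# Hypothetical running statistics, as if accumulated during training.
running_mean = np.array([5.0, -1.0])
running_var = np.array([4.0, 0.25])
gamma, beta = np.ones(2), np.zeros(2)

example = np.array([[7.0, -0.5]])
others = np.array([[0.0, 3.0], [9.0, -2.0]])

# The same example evaluated alone and as part of a larger batch.
alone = batch_norm_inference(example, running_mean, running_var, gamma, beta)
in_batch = batch_norm_inference(np.vstack([example, others]),
                                running_mean, running_var, gamma, beta)
```

Here `alone[0]` and `in_batch[0]` match, since nothing in the computation depends on the other rows of the batch.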