BATCH NORMALIZATION: IS IT ALL YOU NEED AT THE LAST LAYERS IN DEEP NEURAL NETWORKS?

Sumanth

This is my first blog on Medium, and I wanted to share my knowledge before it becomes stale. Once you read the entire blog you will understand the underlying purpose of batch normalization. I will try to keep the concepts clear and crisp.

What you will learn:

> What is Normalization and why do we need it?

> Why do we need Batch Normalization?

> How does the test data get treated when we have a Batch Normalization layer?

What is Normalization and Why do we need it?

Normalization scales the values of a feature to the range 0 to 1, and this can be done using the “MinMaxScaler” available in sklearn. Internally, “MinMaxScaler” works using the formula below.

x_norm = (x - x_min) / (x_max - x_min)

The formula generates a normalized feature by subtracting the minimum value of the feature from each value and dividing by the difference between the maximum and minimum values of the feature, which makes the normalized feature range between 0 and 1.

All our features come from different measurements. For example, in predicting the T-shirt size of a person we can have features like weight (in kg), height (in cm), age, etc. Since the input features come on different scales, the model gives more importance to the features with bigger values and becomes biased towards them. Moreover, while performing SGD, the gradients of the loss function (which is built on top of these differently scaled features) vary widely across the weights, so convergence becomes much slower.

Summary of why we need Normalization:

* Features on different scales introduce bias in the model.

* Convergence of the weights becomes slower.

To compensate for the above problems we should perform Normalization, which brings all the features to the same scale.

Below is a small snippet of code showing how normalization works.

>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]  # creating a 2d-array
>>> scaler = MinMaxScaler()  # creating an instance
>>> print(scaler.fit(data))  # fitting to the data
MinMaxScaler()
>>> print(scaler.transform(data))  # transforming the values to the 0-1 range
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
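
To connect the code back to the formula, here is a minimal NumPy sketch (using the same illustrative data as above, not part of sklearn itself) that reproduces the scaler's output by hand:

import numpy as np

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# apply x_norm = (x - x_min) / (x_max - x_min) column-wise
col_min = data.min(axis=0)
col_max = data.max(axis=0)
normalized = (data - col_min) / (col_max - col_min)

print(normalized)  # matches the MinMaxScaler output above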

Why do we need Batch Normalization?

We normalize the features before we feed them to the neural network, so why do we need to add a batch norm layer inside the network again?

To answer the above question, think of what happens when the data enters the neural network. The initial layers see the normalized data, but as the data traverses deeper it undergoes lots of mathematical operations like the dot product (W.X) and the activation F(W.X). Hence the data seen by the later layers is no longer normalized, and its distribution keeps changing compared to what the initial layers saw. This triggers the problem of internal covariate shift, which means the distribution of the data seen by the initial layers and the later layers is entirely different.

Hence we should normalize the data again at the deeper layers of the network. Since we send the data through the network in batches, this normalization is done per batch, which is why it is called batch normalization.

A batch norm layer added to the deeper layers of a deep neural network looks like the snippet below.

# example of batch normalization for an MLP
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import BatchNormalization

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(10,)))  # input_shape of 10 features is just illustrative
model.add(Dense(32, activation='relu'))
model.add(BatchNormalization())  # normalizes the activations of the previous layer
model.add(Dense(1))  # output layer

Diving deep into understanding how batch norm works.

The batch norm layer actually applies the formulae below to each batch before the batch is fed to the next layer.

uB = (1/m) * sum(xi)   (batch mean)

sigma²B = (1/m) * sum((xi - uB)²)   (batch variance)

x^i = (xi - uB) / sqrt(sigma²B + epsilon)

yi = gamma * x^i + beta

xi → each point in the batch, m → batch size, uB → batch mean, sigma²B → batch variance, epsilon → a small constant to avoid a division-by-zero error if sigma²B = 0. Once x^i is calculated, it is scaled by gamma and shifted by beta to get the output yi, and this output is fed to the next layer after the batch norm layer. gamma & beta are learnt through backpropagation.
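
To make the formulae concrete, here is a minimal NumPy sketch (the batch values, gamma, and beta below are made up purely for illustration; in a real network gamma & beta are learnt) that applies batch normalization to one feature of a single batch:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])     # one feature across a batch of 4 samples
gamma, beta, epsilon = 1.0, 0.0, 1e-3  # gamma & beta would be learnt in practice

uB = x.mean()                                   # batch mean
sigma2B = x.var()                               # batch variance
x_hat = (x - uB) / np.sqrt(sigma2B + epsilon)   # normalize
y = gamma * x_hat + beta                        # scale and shift

print(y)  # roughly [-1.34, -0.45, 0.45, 1.34]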

How does the test data get treated when we have a Batch Normalization layer?

Have you ever wondered how the test data is handled by the batch norm layer?


The solution is a moving-mean & moving-variance, which are updated for every batch until model training is completed.

The moving-mean & moving-variance are calculated and updated at each batch norm layer using the formulae below.

moving-mean = (momentum * moving-mean) + (1 - momentum) * batch-mean

moving-variance = (momentum * moving-variance) + (1 - momentum) * batch-variance

Let's dive deeper into how the moving-mean & moving-variance work.

Initially, for the first batch, the moving-mean is 0, momentum is a hyperparameter initialized with some value (the Keras default is 0.99), and the batch mean comes from the first batch itself. Plugging these into the formula gives us the moving-mean, and the moving-variance is obtained similarly.

Now, the moving-mean & moving-variance are updated for every batch at each batch norm layer until model training is completed.
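
A minimal sketch of these updates, assuming a momentum of 0.99 (the Keras default) and a few made-up batches of a single feature:

import numpy as np

momentum = 0.99
moving_mean, moving_variance = 0.0, 1.0  # typical starting values (zeros / ones)

np.random.seed(0)
batches = [np.random.randn(32) * 2 + 5 for _ in range(100)]  # made-up batches (true mean ~5, variance ~4)

for batch in batches:
    moving_mean = momentum * moving_mean + (1 - momentum) * batch.mean()
    moving_variance = momentum * moving_variance + (1 - momentum) * batch.var()

print(moving_mean, moving_variance)  # gradually drifting towards the true statistics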

Now, during test time, whenever the test data reaches the batch norm layer, it applies the moving-mean and moving-variance as follows (followed by the usual scaling by gamma and shifting by beta):

x^i_test = (xi_test - moving-mean) / sqrt(moving-variance + epsilon)
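
Continuing in the same spirit, here is a minimal sketch of the test-time computation (the moving statistics and test values below are made-up stand-ins for what training would have produced; the learned gamma & beta are applied exactly as during training):

import numpy as np

moving_mean, moving_variance = 5.0, 4.0  # frozen after training (illustrative values)
gamma, beta, epsilon = 1.0, 0.0, 1e-3    # gamma & beta come from training

x_test = np.array([3.0, 5.0, 7.0])       # made-up test inputs

x_test_hat = (x_test - moving_mean) / np.sqrt(moving_variance + epsilon)
y_test = gamma * x_test_hat + beta

print(y_test)  # roughly [-1.0, 0.0, 1.0]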

That’s all folks, you have reached the end of the batch normalization story. I tried to explain as much as I can to give more insight into how the batch norm layer works. Thanks for your time and have a great day; I will engage with you in another story.

