When I first looked into batch normalization, I was turned off. It looked like a hack. I had a tough time figuring out exactly how it worked.

Since then I’ve come to love it. It's a wildly effective method. Here’s the tutorial I wish I had at the outset.
https://e2eml.school/batch_normalization.html
At its core, it's a rolling scale and shift that gives each element of a layer's output a mean of zero and a variance of one across the batch.
There is an additional scale and shift that can be learned, but this is optional. The real magic is in the normalization.
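Here's a minimal NumPy sketch of that core step. The names (gamma, beta, eps) are my own illustrative choices, not anything specific from the tutorial.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) of a batch to zero mean and unit
    variance, then apply the optional learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                        # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # the normalization: the real magic
    return gamma * x_hat + beta                # optional learned scale and shift

# A batch of 4 examples with 3 features on wildly different scales
batch = np.array([[1.0, 200.0, -3.0],
                  [2.0, 220.0, -3.5],
                  [3.0, 180.0, -2.5],
                  [4.0, 210.0, -4.0]])
print(batch_norm(batch).mean(axis=0))  # ~0 for every feature
print(batch_norm(batch).var(axis=0))   # ~1 for every feature
```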
There's some confusion over exactly how batch normalization speeds up training and improves performance as much as it does, but everyone seems to agree that it smooths the loss landscape and gives it a similar curvature in all directions.
What really sold me on batch normalization are its signal processing properties. It gracefully handles inputs that are (see the sketch after this list):
multi-modal,
constant,
two-level,
rarely active, or
drifting.
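A rough demonstration of why those cases are easy to handle: the same zero-mean, unit-variance step works on all of them, with the small eps term covering the constant case. This is my own toy example, not code from the tutorial.

```python
import numpy as np

def normalize(x, eps=1e-5):
    # Zero-mean, unit-variance; eps keeps a constant input from dividing by zero.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
signals = {
    "multi-modal":   np.concatenate([rng.normal(-5, 1, 64), rng.normal(5, 1, 64)]),
    "constant":      np.full(128, 7.3),                     # maps to all zeros
    "two-level":     rng.choice([0.0, 10.0], size=128),
    "rarely active": np.where(rng.random(128) < 0.05, 1.0, 0.0),
    "drifting":      np.linspace(0.0, 50.0, 128) + rng.normal(0, 1, 128),
}
for name, s in signals.items():
    out = normalize(s)
    print(f"{name:13s} mean={out.mean():+.3f} std={out.std():.3f}")
```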
Batch normalization is a fascinating case study in the interaction of our algorithms and our hardware constraints. The whole notion of batches is closely tied to the way GPUs process training examples in parallel.
On non-batching hardware, batch normalization has to be adapted. My favorite example of this is Online Normalization, which does something very similar but is well suited to single-core processing.
http://papers.nips.cc/paper/9051-online-normalization-for-training-neural-networks.pdf
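To give the flavor of sample-at-a-time normalization, here's a simplified sketch that maintains exponential running estimates of the mean and variance. It captures the general idea only; it is not the exact algorithm from the paper.

```python
import numpy as np

class RunningNorm:
    """Normalize one sample at a time using running statistics (a simplification,
    not the Online Normalization paper's exact method)."""
    def __init__(self, num_features, momentum=0.99, eps=1e-5):
        self.mean = np.zeros(num_features)
        self.var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x):
        # Normalize with the current running stats, then update them.
        out = (x - self.mean) / np.sqrt(self.var + self.eps)
        delta = x - self.mean
        self.mean = self.mean + (1 - self.momentum) * delta
        self.var = self.momentum * self.var + (1 - self.momentum) * delta ** 2
        return out

norm = RunningNorm(num_features=3)
for x in np.random.default_rng(1).normal(10.0, 2.0, size=(1000, 3)):
    y = norm(x)  # one example at a time, no batch required
```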