Normalization Methods in Deep Learning

2018-05-01


This post summarizes the normalization methods commonly used in deep learning, including Batch Normalization, Layer Normalization, Group Normalization, and Local Response Normalization.

Batch Normalization

Feature Scaling

Feature scaling means normalizing the different features so that they all share the same scale.

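As the original figure illustrates, the two input dimensions x1 and x2 differ greatly in scale. Getting good training results would then require a different learning rate for each dimension, which is hard to do in practice. Feature scaling solves this problem.
How to do the scaling: for each dimension, compute the mean and std over the data, then subtract the mean and divide by the std.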
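A minimal NumPy sketch of per-dimension standardization (subtract the mean, divide by the standard deviation). The function name `standardize`, the small `eps` guard, and the sample values are illustrative assumptions, not taken from the original post.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Scale each feature (column) of X to zero mean and unit variance."""
    mean = X.mean(axis=0)   # per-feature mean over all samples
    std = X.std(axis=0)     # per-feature standard deviation
    return (X - mean) / (std + eps), mean, std

# Two features on very different scales (made-up values).
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_scaled, mean, std = standardize(X)
```

After scaling, both columns have the same spread, so a single learning rate is a reasonable fit for both dimensions.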

Internal Covariate Shift

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.
Reducing Internal Covariate Shift helps training.

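For a deep learning network, feature scaling can be applied to every layer.
However, the simple feature scaling described above runs into a problem:
During training, the parameters of every intermediate layer keep changing, so each layer's output changes too, and its mean and std keep shifting. This makes it hard to simply precompute a mean and std for feature scaling. Batch Normalization solves this problem by making the normalization part of the training process.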

Batch Normalization

Why add the gamma and beta parameters

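Adding the gamma and beta parameters makes the whole BN step trainable, which distinguishes it from ordinary normalization done as preprocessing.
They also ensure that the model's expressive power is not reduced by the normalization.
Source: 详解深度学习中的Normalization，不只是BN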
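One way to see why gamma and beta preserve expressive power: if the network learns gamma equal to the batch standard deviation and beta equal to the batch mean, the affine step undoes the normalization and recovers the original activations. The sketch below checks this numerically; the input shape and the random values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))   # 8 samples, 4 features
eps = 1e-5

mean = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var + eps)           # normalized activations

# With gamma = sqrt(var + eps) and beta = mean,
# the affine step exactly inverts the normalization.
gamma = np.sqrt(var + eps)
beta = mean
out = gamma * x_hat + beta
```

So the identity mapping stays reachable, and the layer can settle anywhere between "fully normalized" and "unchanged".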

forward

Normalization is applied to each batch.
training

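import numpy as np

# Per-feature statistics over the batch (axis 0 is the batch axis)
sample_mean = np.mean(x, axis=0)
sample_var = np.var(x, axis=0)
x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
out = gamma * x_hat + beta
# Moving average of the mean
running_mean = momentum * running_mean + (1 - momentum) * sample_mean
# Moving average of the variance
running_var = momentum * running_var + (1 - momentum) * sample_var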

test
There is a problem at test time: batch information is not available, so the mean and std cannot be computed from a batch.
Solutions:
1. Ideal solution: compute the mean and std using the whole training dataset.
2. Practical solution: compute a moving average of the batch means and stds during training.

temp = (x - running_mean) / np.sqrt(running_var + eps)
out = gamma * temp + beta
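The training and test branches can be combined into one function. This is a minimal sketch assuming a 2-D input of shape (N, D); the function name `batchnorm_forward`, the `mode` argument, and the default values are illustrative assumptions. It follows the convention used in this post, where `momentum` weights the old running value. PyTorch's `BatchNorm` layers use the opposite convention, where `momentum` weights the new batch statistic.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      mode="train", momentum=0.9, eps=1e-5):
    """Batch-norm forward pass for x of shape (N, D).

    Returns the output and the (possibly updated) running statistics.
    """
    if mode == "train":
        sample_mean = x.mean(axis=0)
        sample_var = x.var(axis=0)
        x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
        # Exponential moving averages, used later at test time.
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
    elif mode == "test":
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    out = gamma * x_hat + beta
    return out, running_mean, running_var

rng = np.random.default_rng(0)
D = 3
gamma, beta = np.ones(D), np.zeros(D)
running_mean, running_var = np.zeros(D), np.ones(D)

# Train on many batches drawn from the same distribution ...
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, D))
    _, running_mean, running_var = batchnorm_forward(
        batch, gamma, beta, running_mean, running_var, mode="train")

# ... then normalize a single test sample with the running statistics.
x_test = rng.normal(loc=5.0, scale=2.0, size=(1, D))
out_test, _, _ = batchnorm_forward(
    x_test, gamma, beta, running_mean, running_var, mode="test")
```

At test time a single sample can be normalized because nothing depends on the current batch, only on the statistics accumulated during training.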

backward

Understanding the backward pass through Batch Normalization Layer
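For reference, here is a compact sketch of the backward pass, obtained by applying the chain rule through the mean, the variance, and the normalized value. It assumes a 2-D input of shape (N, D). The function names and the cached tuple are illustrative assumptions and are not taken from the linked article, which walks through the same computation step by step.

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    """Training-mode forward pass that also returns values needed for backward."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mean) * inv_std
    out = gamma * x_hat + beta
    cache = (x_hat, gamma, inv_std)
    return out, cache

def batchnorm_backward(dout, cache):
    """Gradients of the loss w.r.t. x, gamma and beta, given dout = dL/dout."""
    x_hat, gamma, inv_std = cache
    N = dout.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)
    dx_hat = dout * gamma
    # Combines the direct path with the paths through the batch mean and variance.
    dx = (inv_std / N) * (
        N * dx_hat
        - dx_hat.sum(axis=0)
        - x_hat * (dx_hat * x_hat).sum(axis=0)
    )
    return dx, dgamma, dbeta
```

A quick way to gain confidence in such a formula is to compare `dx` against a finite-difference estimate on a small random input.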

benefit

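1. By reducing Internal Covariate Shift, BN lets us use a higher learning rate, which speeds up training.
2. BN helps with exploding and vanishing gradients, especially with activation functions such as tanh and sigmoid.
Reason: BN keeps the inputs to these functions near 0, where their slope is relatively large and away from the saturated regions, which makes vanishing and exploding gradients less likely.
3. Training becomes less sensitive to parameter initialization.
4. BN also has a regularization effect, which helps prevent overfitting.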
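The slope argument in point 2 can be checked numerically for sigmoid. This small sketch only illustrates the saturation effect; the helper names and the chosen input values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

grad_near_zero = sigmoid_grad(0.0)    # 0.25, the maximum slope
grad_saturated = sigmoid_grad(10.0)   # roughly 4.5e-5
```

An input far from 0 gives a gradient several thousand times smaller than one at 0, which is why keeping activations centered helps gradients flow through such layers.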

References

1. Why-does-batch-normalization-help
2. Understanding the backward pass through Batch Normalization Layer
3. Hung-yi Lee's lecture on BN (all figures are screenshots from this video)
4. PyTorch documentation
5. The BN paper
6. 详解深度学习中的Normalization，不只是BN

Other Normalization

Layer Normalization

The difference between Layer Normalization and BN: LN normalizes each individual sample on its own, whereas BN normalizes across the samples of a mini-batch.
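In code, the difference comes down to the axis over which the statistics are taken. A minimal sketch, assuming a 2-D input of shape (N, D) and leaving out the learnable scale and shift; the variable names and the random input are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(4, 6))   # (batch N, features D)
eps = 1e-5

# Batch Normalization: one mean/var per feature, computed across the batch (axis 0).
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer Normalization: one mean/var per sample, computed across its features (axis 1).
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

# LN of a single sample does not depend on the rest of the batch.
ln_single = (x[:1] - x[:1].mean(axis=1, keepdims=True)) / np.sqrt(
    x[:1].var(axis=1, keepdims=True) + eps)
```

Because `ln_single` matches the first row of `ln`, LN behaves the same at training and test time and needs no running statistics.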

Group Normalization

GN divides the channels into groups and computes within each group the mean and variance for normalization. GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes.
It can be used as a replacement for BN when batch_size is small.
GN implementation
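A minimal NumPy sketch of Group Normalization, assuming an input of shape (N, C, H, W) and per-channel gamma and beta. The function name `group_norm` and its argument names are illustrative assumptions; this is not the linked implementation.

```python
import numpy as np

def group_norm(x, gamma, beta, num_groups, eps=1e-5):
    """Group Normalization for x of shape (N, C, H, W)."""
    N, C, H, W = x.shape
    if C % num_groups != 0:
        raise ValueError("C must be divisible by num_groups")
    # Split channels into groups and normalize within each (sample, group).
    xg = x.reshape(N, num_groups, C // num_groups, H, W)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((xg - mean) / np.sqrt(var + eps)).reshape(N, C, H, W)
    # Per-channel learnable scale and shift.
    return gamma.reshape(1, C, 1, 1) * x_hat + beta.reshape(1, C, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))
gamma, beta = np.ones(8), np.zeros(8)
y = group_norm(x, gamma, beta, num_groups=4)
```

No statistic is taken across the batch axis, so the result for a given sample is the same whatever the batch size, which is what makes GN usable with small batches.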

Fengyang
