This post reviews "Why Does Unsupervised Pre-training Help Deep Learning?" by Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pierre-Antoine Manzagol, published in 2010.

I. Introduction

Recent deep architectures such as Deep Belief Networks and stacks of auto-encoder variants all follow essentially the same recipe: first train the network with an unsupervised criterion (pre-training), then fine-tune it with a supervised objective. This combination has produced state-of-the-art results. A minimal sketch of the two-phase recipe is given below.
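To make the recipe concrete, here is a minimal PyTorch sketch of greedy layer-wise pre-training with denoising auto-encoders followed by supervised fine-tuning. This is an illustration under simplifying assumptions, not the paper's exact setup: the layer sizes, masking-noise level, learning rates and the synthetic data are all placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(1024, 784)              # stand-in for MNIST-like inputs in [0, 1]
y = torch.randint(0, 10, (1024,))      # stand-in class labels

layer_sizes = [784, 500, 500]
encoders = [nn.Linear(d_in, d_out) for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def pretrain_layer(encoder, data, noise=0.25, epochs=20, lr=0.1):
    """Greedy denoising-autoencoder pre-training of one layer (the decoder is discarded)."""
    decoder = nn.Linear(encoder.out_features, encoder.in_features)
    opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        corrupted = data * (torch.rand_like(data) > noise).float()         # masking noise
        recon = torch.sigmoid(decoder(torch.sigmoid(encoder(corrupted))))
        loss = nn.functional.binary_cross_entropy(recon, data)             # reconstruct the clean input
        opt.zero_grad(); loss.backward(); opt.step()

# Phase 1: unsupervised pre-training, layer by layer; each layer is trained on the
# representation produced by the (already pre-trained) layer below it.
h = X
for enc in encoders:
    pretrain_layer(enc, h)
    h = torch.sigmoid(enc(h)).detach()

# Phase 2: supervised fine-tuning of the pre-trained encoders plus an output layer.
model = nn.Sequential(encoders[0], nn.Sigmoid(),
                      encoders[1], nn.Sigmoid(),
                      nn.Linear(layer_sizes[-1], 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(20):
    loss = nn.functional.cross_entropy(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

The question the paper asks is precisely why starting phase 2 from the pre-trained encoders, rather than from random weights, improves generalization.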

II. Experimental Observations

1. Better Generalization

When choosing the number of units per layer, the learning rate and the number of training iterations to optimize classification error on the validation set, unsupervised pre-training gives substantially lower test classification error than no pre-training, for the same depth or for smaller depth on various vision data sets.

2. More Robust

These experiments show that the variance of final test error with respect to initialization random seed is larger without pre-training, and this effect is magnified for deeper architectures. It should however be noted that there is a limit to the success of this technique: performance degrades for 5 layers on this problem.

3. In Short

Better generalization that seems to be robust to random initializations is indeed achieved by pre-trained models, which indicates that unsupervised learning of P(X) is helpful in learning P(Y|X).

III. Explaining the Reasons

(1) Does it come from better conditioning? That is, does unsupervised pre-training simply find a better distribution from which to draw the initial weights than the usual uniform distribution on [−1/√k, 1/√k] (k being the layer's fan-in)?

The answer is no.

By "conditioning" the authors mean re-drawing initial parameters that mimic the pre-trained ones, layer by layer: for each layer, they build an empirical marginal distribution (a histogram) of that layer's pre-trained weights and biases, then sample every parameter independently from its layer's histogram to form the initialization (a small sketch of this resampling step follows the quote below). In the authors' words:

“By conditioning, we mean the range and marginal distribution from which we draw initial weights. In other words, could we get the same performance advantage as unsupervised pre-training if we were still drawing the initial weights independently, but from a more suitable distribution than the uniform [−1/√k, 1/√k]?

To verify this, we performed unsupervised pre-training, and computed marginal histograms for each layer’s pre-trained weights and biases (one histogram per each layer’s weights and biases). We then resampled new “initial” random weights and biases according to these histograms (independently for each parameter), and performed fine-tuning from there.

The resulting parameters have the same marginal statistics as those obtained after unsupervised pre-training, but not the same joint distribution. ”
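The following numpy sketch shows what this control amounts to, given the description above: build a histogram of each layer's pre-trained weights, then re-draw every weight independently from it, so that the marginal statistics are preserved but the joint structure is thrown away. The function name, bin count and the stand-in "pre-trained" weights are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_from_marginal(pretrained_w, n_bins=50):
    """Draw new "initial" weights i.i.d. from the empirical histogram of pretrained_w.

    The result matches the marginal distribution of the pre-trained weights, but any
    joint structure (correlations between weights) is destroyed.
    """
    counts, edges = np.histogram(pretrained_w.ravel(), bins=n_bins)
    probs = counts / counts.sum()
    bin_idx = rng.choice(n_bins, size=pretrained_w.shape, p=probs)   # one bin per weight
    return rng.uniform(edges[bin_idx], edges[bin_idx + 1])           # uniform within that bin

fan_in, fan_out = 784, 500
# Baseline the paper compares against: W ~ U[-1/sqrt(k), 1/sqrt(k)], k = fan-in.
W_uniform = rng.uniform(-1 / np.sqrt(fan_in), 1 / np.sqrt(fan_in), size=(fan_out, fan_in))
# Stand-in for one layer's pre-trained weights (illustrative only).
W_pretrained = rng.normal(0.0, 0.05, size=(fan_out, fan_in))
W_histogram = resample_from_marginal(W_pretrained)
```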

So what were the results?

With the standard uniform initialization, the average test error is 1.77 (std 0.10). With the histogram-based initialization (Histogram), it is 1.69 (std 0.11). With unsupervised pre-training (Unsup. pre), it is 1.37 (std 0.09). The histogram initialization is only slightly better than the uniform one, so preconditioning cannot explain the benefit.

(2) Can it be explained by unsupervised pre-training lowering the training error ("The Effect of Pre-training on Training Error")? Again, the answer is no:

“The remarkable observation is rather that, at a same training cost level, the pre-trained models systematically yield a lower test cost than the randomly initialized ones. The advantage appears to be one of better generalization rather than merely a better optimization procedure. ”

(3) The authors argue that unsupervised pre-training provides a prior, or regularizer, on the parameters. Unlike classical regularization, it has no explicit penalty term in the objective, and it is discovered automatically from the data. In the authors' words:

“unsupervised pre-training appears to have a similar effect to that of a good regularizer or a good “prior” on the parameters, even though no explicit regularization term is apparent in the cost being optimized. As we stated in the hypothesis, it might be reasoned that restricting the possible starting points in parameter space to those that minimize the unsupervised pre-training criterion (as with the SDAE), does in effect restrict the set of possible final configurations for parameter values. Like regularizers in general, unsupervised pre-training (in this case, with denoising auto-encoders) might thus be seen as decreasing the variance and introducing a bias (towards parameter configurations suitable for performing denoising). Unlike ordinary regularizers, unsupervised pre-training does so in a data-dependent manner.”

The next question is where this peculiar regularizer, which has no explicit form and depends on the data, comes from.

(4) If the effect really is regularization, then one characteristic property of regularizers should hold: their effectiveness should increase as the model's capacity increases. The authors put it as follows:

Another signature characteristic of regularization is that the effectiveness of regularization increases as capacity (e.g., the number of hidden units) increases, effectively trading off one constraint on the model complexity for another. In this experiment we explore the relationship between the number of units per layer and the effectiveness of unsupervised pre-training. The hypothesis that unsupervised pre-training acts as a regularizer would suggest that we should see a trend of increasing effectiveness of unsupervised pre-training as the number of units per layer are increased.

The experiments, however, show that the benefit of unsupervised pre-training grows with capacity only when the layers are large enough (on the order of a hundred hidden units per layer) and the network is deep enough. For networks that are too small, unsupervised pre-training is not only superfluous but can even hurt. This was an unexpected experimental finding.

“What we observe is a more systematic effect: while unsupervised pre-training helps for larger layers and deeper networks, it also appears to hurt for too small networks.”

“As the model size decreases from 800 hidden units, the generalization error increases, and it increases more with unsupervised pre-training presumably because of the extra regularization effect: small networks have a limited capacity already so further restricting it (or introducing an additional bias) can harm generalization. ”

Besides the generic explanation above (a small model has limited capacity to begin with, so it needs no further regularization), the authors offer the following:

The effect can be explained in terms of the role of unsupervised pre-training as promoting input transformations (in the hidden layers) that are useful at capturing the main variations in the input distribution P(X). It may be that only a small subset of these variations are relevant for predicting the class label Y. When the hidden layers are small it is less likely that the transformations for predicting Y are included in the lot learned by unsupervised pre-training.

Put simply: in a small network, the unsupervised phase transforms X without knowing which features matter for Y, so it may filter out exactly the features most useful for predicting Y. A larger network can keep more of the candidate transformations, making it more likely that the ones relevant to Y survive. This seems reasonable.

(5) The authors do not believe the effectiveness comes from better optimization ("Challenging the Optimization Hypothesis"). The intuition behind that hypothesis is that deep networks are notoriously hard to train, so unsupervised pre-training might simply guide the optimizer to a local minimum with a lower training cost.

The authors question the experimental design of Bengio et al. (2007), which used early stopping. They argue that early stopping is itself a regularizer, and that once it is removed the optimization-based conclusion no longer holds.

“Figure 10 shows what happens without early stopping. The training error is still higher for pre-trained networks even though the generalization error is lower. This result now favors the regularization hypothesis against the optimization story. What may have happened is that early stopping prevented the networks without pre-training from moving too much towards their apparent local minimum.”

Since the pre-trained networks end up with better generalization yet higher training error, the optimization hypothesis does not hold. And because Bengio et al. (2007) used early stopping, the networks without pre-training were effectively regularized as well, which kept them from moving too far towards their apparent local minimum (a minimal sketch of early stopping follows).
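For reference, here is a minimal sketch of early stopping, the implicit regularizer in question: training stops (and the parameters are rolled back) once the validation cost stops improving, rather than when the training cost is minimized. The model, synthetic data and patience value are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X_tr, y_tr = torch.rand(512, 784), torch.randint(0, 10, (512,))   # stand-in training set
X_va, y_va = torch.rand(128, 784), torch.randint(0, 10, (128,))   # stand-in validation set

model = nn.Sequential(nn.Linear(784, 100), nn.Sigmoid(), nn.Linear(100, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

best_val, best_state, patience, bad_epochs = float("inf"), None, 10, 0
for epoch in range(500):
    loss = nn.functional.cross_entropy(model(X_tr), y_tr)
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        val = nn.functional.cross_entropy(model(X_va), y_va).item()
    if val < best_val:
        best_val, bad_epochs = val, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:      # validation cost stopped improving: stop early
            break
model.load_state_dict(best_state)       # roll back to the best validation point
```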

Now the question becomes: we know the magic of unsupervised pre-training comes from some prior, or regularizer, but since this regularizer has no explicit penalty term, it is hard to say what it actually is, so the next experiments try to pin it down. Recall (see The Elements of Statistical Learning) that regularization can be derived from Bayes' theorem: assume a prior distribution over the parameters, apply Bayes' rule to obtain their posterior, and different priors yield different penalty terms. The familiar L1 and L2 penalties correspond to zero-mean priors (for instance, a zero-mean Gaussian gives L2); the zero mean encodes a preference for models that are as simple as possible. Why insist on simple models? Because of an important principle in machine learning, Occam's razor. With that background (a short derivation is sketched below), let us see whether the implicit regularizer induced by unsupervised pre-training is just L1 or L2. (Spoiler: it is definitely not that simple.)
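For completeness, the standard MAP derivation behind that statement (a textbook fact, not something from the paper) is:

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\, p(\theta \mid D)
  = \arg\min_{\theta}\, \Big[\, \underbrace{-\log p(D \mid \theta)}_{\text{training loss}}
                                \; \underbrace{-\,\log p(\theta)}_{\text{regularizer}} \,\Big]
```

A zero-mean Gaussian prior p(θ) ∝ exp(−λ‖θ‖²₂) turns the second term into an L2 (weight-decay) penalty, while a zero-mean Laplace prior p(θ) ∝ exp(−λ‖θ‖₁) gives an L1 penalty.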

(6) Unmasking the regularizer: is it L1 or L2?

The authors compare networks trained with explicit L1/L2 penalties added to the supervised objective against networks initialized by unsupervised pre-training.

“We found that while in the case of MNIST a small penalty can in principle help, the gain is nowhere near as large as it is with pre-training. For InfiniteMNIST, the optimal amount of L1 and L2 regularization is zero”

So on a simple task like MNIST, such an explicit penalty still helps a little; on the harder InfiniteMNIST task, it is essentially worthless (a sketch of this explicit-penalty baseline follows).
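A minimal sketch of what such an explicit-penalty baseline looks like, assuming plain SGD on a small feed-forward network; the λ values and the synthetic batch are illustrative (and, per the paper, the best amount of penalty on InfiniteMNIST was zero).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(64, 784)                    # stand-in mini-batch
y = torch.randint(0, 10, (64,))

model = nn.Sequential(nn.Linear(784, 500), nn.Sigmoid(), nn.Linear(500, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
lambda_l1, lambda_l2 = 1e-5, 1e-4          # illustrative penalty strengths

loss = nn.functional.cross_entropy(model(x), y)
l1 = sum(p.abs().sum() for p in model.parameters())     # explicit L1 penalty
l2 = sum(p.pow(2).sum() for p in model.parameters())    # explicit L2 penalty
loss = loss + lambda_l1 * l1 + lambda_l2 * l2            # fixed, data-independent regularizer
opt.zero_grad(); loss.backward(); opt.step()
```

Unlike these fixed penalties, the "regularizer" induced by pre-training is shaped by the input distribution P(X) itself.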

The authors' comment:

“This is not an entirely surprising finding: not all regularizers are created equal and these results are consistent with the literature on semi-supervised training that shows that unsupervised learning can be exploited as a particularly effective form of regularization. ”

Not all regularizers are created equal.

(7) Summary

In short: the effect is regularization, and not a generic kind of regularization at that; it cannot be explained by the optimization hypothesis, nor by the marginal distribution of the initial weights. It comes from a particular, data-dependent prior, whose behavior is closer in spirit to early stopping and to semi-supervised learning.
