论文题目:Reducing the Dimensionality of Data with Neural Networks
论文来源:10.1126/science.1127647
翻译人:BDML@CQUT实验室

Reducing the Dimensionality of Data with Neural Networks

利用神经网络降低数据的维度

G. E. Hinton* and R. R. Salakhutdinov

High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such ‘‘autoencoder’’ networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

通过训练一个具有小中心层的多层神经网络来重建高维输入向量,可以将高维数据转换为低维代码。梯度下降法可以用于微调这种“自动编码器”网络中的权重,但只有当初始权重接近一个好的解时,这种方法才有很好的效果。我们描述了一种有效的权重初始化方法,它使深度自动编码器网络能够学习低维代码;作为降低数据维度的工具,这种低维代码比主成分分析的效果好得多。


正文

Dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data. A simple and widely used method is principal components analysis (PCA), which finds the directions of greatest variance in the data set and represents each data point by its coordinates along each of these directions. We describe a nonlinear generalization of PCA that uses an adaptive, multilayer “encoder” network to transform the high-dimensional data into a low-dimensional code and a similar “decoder” network to recover the data from the code.

Starting with random weights in the two networks, they can be trained together by minimizing the discrepancy between the original data and its reconstruction. The required gradients are easily obtained by using the chain rule to backpropagate error derivatives first through the decoder network and then through the encoder network (1). The whole system is called an “autoencoder” and is depicted in Fig. 1.

降维有利于高维数据的分类、可视化、通信和存储。一种简单而广泛使用的方法是主成分分析(PCA),它找出数据集中方差最大的方向,并用每个数据点沿这些方向的坐标来表示该数据点。我们描述了PCA的一种非线性推广,它使用一个自适应的多层“编码器”网络将高维数据转换为低维代码,并使用一个类似的“解码器”网络从代码中恢复数据。

从两个网络的随机权重开始,可以通过最小化原始数据与其重建之间的差异来一起训练它们。所需的梯度很容易获得:利用链式法则,先通过解码器网络、再通过编码器网络反向传播误差导数(1)。整个系统称为“自动编码器”,如图1所示。
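To make the encoder/decoder idea concrete, here is a minimal NumPy sketch (not the authors' code) of a single-hidden-layer autoencoder trained by plain gradient descent on the squared reconstruction error; the layer sizes, learning rate, and function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_autoencoder(n_vis=784, n_code=30, seed=0):
    """Small random initial weights (sizes are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": 0.01 * rng.standard_normal((n_vis, n_code)),  # encoder weights
        "b1": np.zeros(n_code),
        "W2": 0.01 * rng.standard_normal((n_code, n_vis)),  # decoder weights
        "b2": np.zeros(n_vis),
    }

def train_step(params, x, lr=0.1):
    """One gradient-descent step on the squared reconstruction error for one case x."""
    # Forward pass: encode to the small central layer, then decode.
    code = sigmoid(x @ params["W1"] + params["b1"])
    recon = sigmoid(code @ params["W2"] + params["b2"])
    err = recon - x
    # Backpropagate error derivatives first through the decoder ...
    d_out = err * recon * (1.0 - recon)
    # ... and then through the encoder (chain rule).
    d_code = (d_out @ params["W2"].T) * code * (1.0 - code)
    # Gradient-descent updates (in place).
    params["W2"] -= lr * np.outer(code, d_out)
    params["b2"] -= lr * d_out
    params["W1"] -= lr * np.outer(x, d_code)
    params["b1"] -= lr * d_code
    return float(np.mean(err ** 2))
```

With `params = init_autoencoder()`, repeatedly calling `train_step(params, x)` on training vectors in [0, 1] drives the reconstruction error down. The paper's deep autoencoders stack many more layers and rely on the pretraining described below to make this fine-tuning work.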



Fig. 1. Pretraining consists of learning a stack of restricted Boltzmann machines (RBMs), each having only one layer of feature detectors. The learned feature activations of one RBM are used as the ‘‘data’’ for training the next RBM in the stack. After the pretraining, the RBMs are ‘‘unrolled’’ to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives.

图1 预训练包括学习一叠限制性玻尔兹曼机(RBM),每个RBM只有一层特征检测器。一个RBM学习到的特征激活被用作训练堆栈中下一个RBM的“数据”。预训练之后,这些RBM被“展开”以构成一个深度自动编码器,然后利用误差导数的反向传播对其进行微调。


It is difficult to optimize the weights in nonlinear autoencoders that have multiple hidden layers (2–4). With large initial weights, autoencoders typically find poor local minima; with small initial weights, the gradients in the early layers are tiny, making it infeasible to train autoencoders with many hidden layers. If the initial weights are close to a good solution, gradient descent works well, but finding such initial weights requires a very different type of algorithm that learns one layer of features at a time. We introduce this “pretraining” procedure for binary data, generalize it to real-valued data, and show that it works well for a variety of data sets.

An ensemble of binary vectors (e.g., images) can be modeled using a two-layer network called a “restricted Boltzmann machine” (RBM) (5, 6) in which stochastic, binary pixels are connected to stochastic, binary feature detectors using symmetrically weighted connections. The pixels correspond to “visible” units of the RBM because their states are observed; the feature detectors correspond to “hidden” units. A joint configuration (v, h) of the visible and hidden units has an energy (7) given by
$E(v,h) = -\sum_{i \in \text{pixels}} b_i v_i - \sum_{j \in \text{features}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}$    (1)
where $v_i$ and $h_j$ are the binary states of pixel i and feature j, $b_i$ and $b_j$ are their biases, and $w_{ij}$ is the weight between them. The network assigns a probability to every possible image via this energy function, as explained in (8). The probability of a training image can be raised by adjusting the weights and biases to lower the energy of that image and to raise the energy of similar, “confabulated” images that the network would prefer to the real data. Given a training image, the binary state $h_j$ of each feature detector j is set to 1 with probability $\sigma(b_j + \sum_i v_i w_{ij})$, where $\sigma(x)$ is the logistic function $1/[1 + \exp(-x)]$, $b_j$ is the bias of j, $v_i$ is the state of pixel i, and $w_{ij}$ is the weight between i and j. Once binary states have been chosen for the hidden units, a “confabulation” is produced by setting each $v_i$ to 1 with probability $\sigma(b_i + \sum_j h_j w_{ij})$, where $b_i$ is the bias of i. The states of the hidden units are then updated once more so that they represent features of the confabulation. The change in a weight is given by
$\Delta w_{ij} = \varepsilon\,(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}})$    (2)
where $\varepsilon$ is a learning rate, $\langle v_i h_j \rangle_{\text{data}}$ is the fraction of times that pixel i and feature detector j are on together when the feature detectors are being driven by data, and $\langle v_i h_j \rangle_{\text{recon}}$ is the corresponding fraction for confabulations. A simplified version of the same learning rule is used for the biases. The learning works well even though it is not exactly following the gradient of the log probability of the training data (6).

在具有多个隐藏层的非线性自动编码器中,优化权重是很困难的(2–4)。初始权重较大时,自动编码器通常会陷入很差的局部最小值;初始权重较小时,靠前各层的梯度极小,使得训练具有多个隐藏层的自动编码器不可行。如果初始权重接近一个好的解,梯度下降法就能很好地工作,但要找到这样的初始权重,需要一种非常不同的算法,它一次只学习一层特征。我们针对二进制数据介绍这种“预训练”过程,将其推广到实值数据,并表明它对各种数据集都有良好的效果。

二进制向量(如图像)的集合可以用一种称为“限制性玻尔兹曼机”(RBM)的两层网络来建模(5,6),其中随机的二进制像素通过对称加权的连接与随机的二进制特征检测器相连。像素对应于RBM的“可见”单元,因为它们的状态是被观察到的;特征检测器则对应于“隐藏”单元。可见单元和隐藏单元的联合配置(v,h)的能量(7)由下式给出
$E(v,h) = -\sum_{i \in \text{pixels}} b_i v_i - \sum_{j \in \text{features}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}$
其中$v_i$和$h_j$是像素i和特征j的二元状态,$b_i$和$b_j$是它们的偏置,$w_{ij}$是它们之间的权重。网络通过这个能量函数给每一幅可能的图像分配一个概率,如(8)中所解释。通过调整权重和偏置来降低某一训练图像的能量,同时提高网络相比真实数据更偏好的、与之类似的“混淆”图像的能量,就可以提高该训练图像的概率。给定一幅训练图像,每个特征检测器j的二元状态$h_j$以概率$\sigma(b_j+\sum_{i}v_i w_{ij})$被设为1,其中$\sigma(x)$是逻辑函数$1/[1+\exp(-x)]$,$b_j$是j的偏置,$v_i$是像素i的状态,$w_{ij}$是i和j之间的权重。一旦为隐藏单元选定了二元状态,就以概率$\sigma(b_i+\sum_{j}h_j w_{ij})$将每个$v_i$设为1,从而产生一幅“混淆”图像,其中$b_i$是i的偏置;然后再更新一次隐藏单元的状态,使其表示该混淆图像的特征。权重的变化由$\Delta w_{ij}=\varepsilon(\langle v_i h_j\rangle_{\text{data}}-\langle v_i h_j\rangle_{\text{recon}})$给出,其中$\varepsilon$是学习率,$\langle v_i h_j\rangle_{\text{data}}$是特征检测器由数据驱动时像素i和特征检测器j同时为1的次数占比,$\langle v_i h_j\rangle_{\text{recon}}$是混淆图像对应的占比。偏置使用同一学习规则的简化版本。尽管这种学习并不精确地沿训练数据对数概率的梯度前进,但效果很好(6)。
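As a concrete illustration of Eqs. 1 and 2, here is a minimal NumPy sketch of one contrastive-divergence (CD-1) update for a binary RBM. It is not the authors' code: the learning rate, the use of hidden probabilities rather than binary states in the statistics, and all names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b_vis, b_hid, lr=0.1, rng=None):
    """One contrastive-divergence (CD-1) update of a binary RBM, following Eq. 2.

    v_data       : (n_vis,) binary pixel vector
    W            : (n_vis, n_hid) symmetric weights w_ij
    b_vis, b_hid : biases of the visible and hidden units
    """
    rng = rng or np.random.default_rng(0)
    # Drive the feature detectors with data: h_j = 1 with prob. sigma(b_j + sum_i v_i w_ij).
    p_h_data = sigmoid(b_hid + v_data @ W)
    h_data = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    # Produce a "confabulation": v_i = 1 with prob. sigma(b_i + sum_j h_j w_ij).
    v_recon = (rng.random(v_data.shape) < sigmoid(b_vis + h_data @ W.T)).astype(float)
    # Update the hidden units once more so they represent features of the confabulation.
    p_h_recon = sigmoid(b_hid + v_recon @ W)
    # Eq. 2: the learning signal is the difference of the pairwise statistics
    # measured on the data and on the confabulation.
    W += lr * (np.outer(v_data, p_h_data) - np.outer(v_recon, p_h_recon))
    b_vis += lr * (v_data - v_recon)       # simplified rule for visible biases
    b_hid += lr * (p_h_data - p_h_recon)   # simplified rule for hidden biases
    return W, b_vis, b_hid
```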


Fig. 2. (A) Top to bottom: Random samples of curves from the test data set; reconstructions produced by the six-dimensional deep autoencoder; reconstructions by ‘‘logistic PCA’’ (8) using six components; reconstructions by logistic PCA and standard PCA using 18 components. The average squared error per image for the last four rows is 1.44, 7.64, 2.45, 5.90. (B) Top to bottom: A random test image from each class; reconstructions by the 30-dimensional autoencoder; reconstructions by 30-dimensional logistic PCA and standard PCA. The average squared errors for the last three rows are 3.00, 8.01, and 13.87. (C) Top to bottom: Random samples from the test data set; reconstructions by the 30-dimensional autoencoder; reconstructions by 30-dimensional PCA. The average squared errors are 126 and 135.

图2 (A) 从上到下:测试数据集中随机抽取的曲线样本;六维深度自动编码器产生的重建;使用六个分量的“逻辑PCA”(8)的重建;使用18个分量的逻辑PCA和标准PCA的重建。后四行每幅图像的平均平方误差分别为1.44、7.64、2.45、5.90。(B) 从上到下:每个类别的一幅随机测试图像;30维自动编码器的重建;30维逻辑PCA和标准PCA的重建。后三行的平均平方误差分别为3.00、8.01、13.87。(C) 从上到下:测试数据集的随机样本;30维自动编码器的重建;30维PCA的重建。平均平方误差分别为126和135。


A single layer of binary features is not the best way to model the structure in a set of images. After learning one layer of feature detectors, we can treat their activities—when they are being driven by the data—as data for learning a second layer of features. The first layer of feature detectors then become the visible units for learning the next RBM. This layer-by-layer learning can be repeated as many times as desired. It can be shown that adding an extra layer always improves a lower bound on the log probability that the model assigns to the training data, provided the number of feature detectors per layer does not decrease and their weights are initialized correctly (9). This bound does not apply when the higher layers have fewer feature detectors, but the layer-by-layer learning algorithm is nonetheless a very effective way to pretrain the weights of a deep autoencoder. Each layer of features captures strong, high-order correlations between the activities of units in the layer below. For a wide variety of data sets, this is an efficient way to progressively reveal low-dimensional, nonlinear structure.

After pretraining multiple layers of feature detectors, the model is “unfolded” (Fig. 1) to produce encoder and decoder networks that initially use the same weights. The global finetuning stage then replaces stochastic activities by deterministic, real-valued probabilities and uses backpropagation through the whole autoencoder to fine-tune the weights for optimal reconstruction.

For continuous data, the hidden units of the first-level RBM remain binary, but the visible units are replaced by linear units with Gaussian noise (10). If this noise has unit variance, the stochastic update rule for the hidden units remains the same and the update rule for visible unit i is to sample from a Gaussian with unit variance and mean $b_i + \sum_j h_j w_{ij}$.

单层的二进制特征并不是对一组图像中的结构进行建模的最佳方式。在学习了一层特征检测器之后,我们可以把它们在数据驱动下的活动作为学习第二层特征的数据。这样,第一层特征检测器就成为学习下一个RBM的可见单元。这种逐层学习可以根据需要重复任意多次。可以证明,只要每层的特征检测器数量不减少且权重初始化正确,增加一个额外的层总是能提高模型赋予训练数据的对数概率的下界(9)。当较高层的特征检测器较少时,这个下界不再适用,但逐层学习算法仍然是预训练深度自动编码器权重的一种非常有效的方法。每一层特征都捕捉到下面一层中单元活动之间很强的高阶相关性。对于各种各样的数据集,这是一种逐步揭示低维非线性结构的有效方法。
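A sketch of this greedy, layer-by-layer procedure, reusing the hypothetical cd1_update helper from the RBM sketch above; the layer sizes, epoch count, and learning rate are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def pretrain_stack(data, layer_sizes, epochs=10, lr=0.1, seed=0):
    """Greedy layer-by-layer pretraining of a stack of RBMs (sketch).

    data        : (n_cases, n_vis) matrix of training vectors
    layer_sizes : e.g. [784, 1000, 500, 250, 30] (illustrative)
    Returns one (W, b_vis, b_hid) triple per learned layer of feature detectors.
    """
    rng = np.random.default_rng(seed)
    rbms, layer_input = [], data
    for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = 0.01 * rng.standard_normal((n_vis, n_hid))
        b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
        for _ in range(epochs):
            for v in layer_input:
                cd1_update(v, W, b_vis, b_hid, lr=lr, rng=rng)  # updates W and biases in place
        rbms.append((W, b_vis, b_hid))
        # The feature activation probabilities of this RBM become the "data" for the next RBM.
        layer_input = 1.0 / (1.0 + np.exp(-(layer_input @ W + b_hid)))
    return rbms
```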

在预训练多层特征检测器后,模型被“展开”(图1),产生初始权重相同的编码器和解码器网络。然后,全局微调阶段用确定性的实值概率取代随机活动,并通过整个自动编码器的反向传播来微调权重,以获得最佳重建。
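A sketch of this "unrolling" step and of the deterministic forward pass used during fine-tuning, assuming each pretrained RBM is stored as a (W, b_vis, b_hid) triple as in the earlier sketches.

```python
import numpy as np

def unroll(rbms):
    """Unfold a stack of pretrained RBMs into initial encoder/decoder weight lists."""
    encoder = [(W, b_hid) for (W, b_vis, b_hid) in rbms]
    # The decoder initially reuses the same weights, transposed and in reverse order,
    # together with the visible biases; fine-tuning then adjusts the two halves separately.
    decoder = [(W.T.copy(), b_vis.copy()) for (W, b_vis, b_hid) in reversed(rbms)]
    return encoder, decoder

def autoencode(x, encoder, decoder):
    """Deterministic forward pass for fine-tuning (real-valued probabilities).
    For brevity every layer is logistic here; in the paper the central code layer is linear."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    code = x
    for W, b in encoder:
        code = sigmoid(code @ W + b)
    recon = code
    for W, b in decoder:
        recon = sigmoid(recon @ W + b)
    return code, recon
```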

对于连续数据,第一层RBM的隐藏单元仍然是二进制的,但可见单元被带有高斯噪声的线性单元取代(10)。如果该噪声具有单位方差,则隐藏单元的随机更新规则保持不变,而可见单元i的更新规则是从均值为$b_i+\sum_{j}h_j w_{ij}$、方差为1的高斯分布中采样。
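A minimal sketch of the only change needed for real-valued data, assuming unit-variance noise as stated above; the function name and arguments are hypothetical.

```python
import numpy as np

def sample_gaussian_visible(h, W, b_vis, rng=None):
    """Visible-unit update for real-valued data: a Gaussian with unit variance
    and mean b_i + sum_j h_j w_ij (the hidden units stay binary/logistic)."""
    rng = rng or np.random.default_rng(0)
    mean = b_vis + h @ W.T
    return mean + rng.standard_normal(mean.shape)
```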

In all our experiments, the visible units of every RBM had real-valued activities, which were in the range [0, 1] for logistic units. While training higher level RBMs, the visible units were set to the activation probabilities of the hidden units in the previous RBM, but the hidden units of every RBM except the top one had stochastic binary values. The hidden units of the top RBM had stochastic real-valued states drawn from a unit variance Gaussian whose mean was determined by the input from that RBM's logistic visible units. This allowed the low-dimensional codes to make good use of continuous variables and facilitated comparisons with PCA. Details of the pretraining and fine-tuning can be found in (8).

To demonstrate that our pretraining algorithm allows us to fine-tune deep networks efficiently, we trained a very deep autoencoder on a synthetic data set containing images of “curves” that were generated from three randomly chosen points in two dimensions (8). For this data set, the true intrinsic dimensionality is known, and the relationship between the pixel intensities and the six numbers used to generate them is highly nonlinear. The pixel intensities lie between 0 and 1 and are very non-Gaussian, so we used logistic output units in the autoencoder, and the fine-tuning stage of the learning minimized the cross-entropy error $[-\sum_i p_i \log \hat{p}_i - \sum_i (1 - p_i)\log(1 - \hat{p}_i)]$, where $p_i$ is the intensity of pixel i and $\hat{p}_i$ is the intensity of its reconstruction.

The autoencoder consisted of an encoder with layers of size (28 × 28)-400-200-100-50-25-6 and a symmetric decoder. The six units in the code layer were linear and all the other units were logistic. The network was trained on 20,000 images and tested on 10,000 new images. The autoencoder discovered how to convert each 784-pixel image into six real numbers that allow almost perfect reconstruction (Fig. 2A). PCA gave much worse reconstructions. Without pretraining, the very deep autoencoder always reconstructs the average of the training data, even after prolonged fine-tuning (8). Shallower autoencoders with a single hidden layer between the data and the code can learn without pretraining, but pretraining greatly reduces their total training time (8). When the number of parameters is the same, deep autoencoders can produce lower reconstruction errors on test data than shallow ones, but this advantage disappears as the number of parameters increases (8).

在我们所有的实验中,每个RBM的可见单元都有实值的活动,对于逻辑单元来说,其活动范围为[0,1]。在训练更高级别的RBM时,可见单元被设置为前一个RBM中隐藏单元的激活概率,但除了顶部的RBM外,每个RBM的隐藏单元都有随机的二进制值。顶部RBM的隐藏单元具有随机实值状态,从一个单位方差高斯中抽取,其平均值由该RBM 的逻辑可见单元的输入决定。这使得低维代码能够很好地利用连续变量,并有利于与PCA进行比较。预训练和微调的细节可以在(8)中找到。

为了证明我们的预训练算法能让我们有效地微调深度网络,我们在一个包含“曲线”图像的合成数据集上训练了一个非常深的自动编码器,这些图像由二维平面上三个随机选取的点生成(8)。对于这个数据集,真实的内在维度是已知的,而像素强度与用于生成它们的六个数字之间的关系是高度非线性的。像素强度位于0和1之间且远非高斯分布,所以我们在自动编码器中使用了逻辑输出单元,微调阶段的学习使交叉熵误差$[-\sum_{i}p_i \log\hat{p}_i-\sum_{i}(1-p_i)\log(1-\hat{p}_i)]$最小化,其中$p_i$为像素i的强度,$\hat{p}_i$为其重建的强度。
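For reference, the cross-entropy error written above can be transcribed directly; the eps clipping is an added numerical-stability detail, not part of the paper.

```python
import numpy as np

def cross_entropy(p, p_hat, eps=1e-12):
    """Pixel-wise cross-entropy error minimized during fine-tuning on the curves data.
    eps guards the logarithms against zero; it is an implementation detail, not in the paper."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    return float(-np.sum(p * np.log(p_hat) + (1.0 - p) * np.log(1.0 - p_hat)))
```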

该自动编码器由一个层大小为(28 × 28)-400-200-100-50-25-6的编码器和一个对称的解码器组成。代码层中的6个单元是线性的,其他所有单元都是逻辑单元。该网络在20,000幅图像上训练,并在10,000幅新图像上测试。自动编码器学会了如何把每幅784像素的图像转换成六个实数,从而实现几乎完美的重建(图2A)。PCA给出的重建效果要差得多。在没有预训练的情况下,这个非常深的自动编码器总是重建训练数据的平均值,即使经过长时间的微调也是如此(8)。在数据和代码之间只有一个隐藏层的较浅自动编码器可以在没有预训练的情况下学习,但预训练大大减少了它们的总训练时间(8)。当参数数量相同时,深层自动编码器在测试数据上产生的重建误差比浅层自动编码器低,但随着参数数量的增加,这一优势会消失(8)。



Fig. 3. (A) The two-dimensional codes for 500 digits of each class produced by taking the first two principal components of all 60,000 training images. (B) The two-dimensional codes found by a 784-1000-500-250-2 autoencoder. For an alternative visualization, see (8).
图3. (A) 取全部60,000幅训练图像的前两个主成分,为每类500个数字生成的二维代码;(B) 由784-1000-500-250-2自动编码器得到的二维代码。另一种可视化方式见(8)。


Next, we used a 784-1000-500-250-30 autoencoder to extract codes for all the handwritten digits in the MNIST training set (11). The Matlab code that we used for the pretraining and fine-tuning is available in (8). Again, all units were logistic except for the 30 linear units in the code layer. After fine-tuning on all 60,000 training images, the autoencoder was tested on 10,000 new images and produced much better reconstructions than did PCA (Fig. 2B). A two-dimensional autoencoder produced a better visualization of the data than did the first two principal components (Fig. 3).

We also used a 625-2000-1000-500-30 autoencoder with linear input units to discover 30-dimensional codes for grayscale image patches that were derived from the Olivetti face data set (12). The autoencoder clearly outperformed PCA (Fig. 2C).

接下来,我们使用一个784-1000-500-250-30自动编码器来为MNIST训练集(11)中所有手写数字提取代码。我们用于预训练和微调的Matlab代码见(8)。同样,除代码层中的30个线性单元外,其余单元都是逻辑单元。在全部60,000幅训练图像上微调之后,自动编码器在10,000幅新图像上进行了测试,产生的重建效果比PCA好得多(图2B)。二维自动编码器产生的数据可视化效果比前两个主成分更好(图3)。

我们还使用了一个具有线性输入单元的625-2000-1000-500-30自动编码器,来为取自Olivetti人脸数据集(12)的灰度图像块发现30维代码。自动编码器的表现明显优于PCA(图2C)。



Fig. 4. (A) The fraction of retrieved documents in the same class as the query when a query document from the test set is used to retrieve other test set documents, averaged over all 402,207 possible queries. (B) The codes produced by two-dimensional LSA. (C) The codes produced by a 2000-500-250-125-2 autoencoder.
图4. (A) 当使用测试集中的一个查询文档检索其他测试集文档时,检索到的文档中与查询属于同一类别的比例,对全部402,207个可能的查询取平均。(B) 二维LSA产生的代码。(C) 由2000-500-250-125-2自动编码器产生的代码。


When trained on documents, autoencoders produce codes that allow fast retrieval. We represented each of 804,414 newswire stories (13) as a vector of document-specific probabilities of the 2000 commonest word stems, and we trained a 2000-500-250-125-10 autoencoder on half of the stories with the use of the multiclass cross-entropy error function $[-\sum_i p_i \log \hat{p}_i]$ for the fine-tuning. The 10 code units were linear and the remaining hidden units were logistic. When the cosine of the angle between two codes was used to measure similarity, the autoencoder clearly outperformed latent semantic analysis (LSA) (14), a well-known document retrieval method based on PCA (Fig. 4). Autoencoders (8) also outperform local linear embedding, a recent nonlinear dimensionality reduction algorithm (15).

Layer-by-layer pretraining can also be used for classification and regression. On a widely used version of the MNIST handwritten digit recognition task, the best reported error rates are 1.6% for randomly initialized backpropagation and 1.4% for support vector machines. After layer-by-layer pretraining in a 784-500-500-2000-10 network, backpropagation using steepest descent and a small learning rate achieves 1.2% (8). Pretraining helps generalization because it ensures that most of the information in the weights comes from modeling the images. The very limited information in the labels is used only to slightly adjust the weights found by pretraining.

It has been obvious since the 1980s that backpropagation through deep autoencoders would be very effective for nonlinear dimensionality reduction, provided that computers were fast enough, data sets were big enough, and the initial weights were close enough to a good solution. All three conditions are now satisfied. Unlike nonparametric methods (15, 16), autoencoders give mappings in both directions between the data and code spaces, and they can be applied to very large data sets because both the pretraining and the fine-tuning scale linearly in time and space with the number of training cases.

当对文档进行训练时,自动编码器产生的代码可以支持快速检索。我们将804,414篇新闻专线报道(13)中的每一篇表示为一个向量,其分量为2000个最常见词干在该文档中的出现概率;我们在其中一半的报道上训练了一个2000-500-250-125-10自动编码器,微调时使用多类交叉熵误差函数$[-\sum_{i}p_i \log\hat{p}_i]$。10个代码单元为线性单元,其余隐藏单元为逻辑单元。当使用两个代码之间夹角的余弦来衡量相似度时,自动编码器的表现明显优于潜在语义分析(LSA)(14),后者是一种著名的基于PCA的文档检索方法(图4)。自动编码器(8)的表现也优于局部线性嵌入,一种较新的非线性降维算法(15)。
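A sketch of the retrieval step described above: documents are ranked by the cosine of the angle between their autoencoder codes and the query code. The function and array names, and the choice of top_k, are hypothetical.

```python
import numpy as np

def retrieve(query_code, db_codes, top_k=10):
    """Rank stored documents by the cosine of the angle between their codes
    and the query document's code; returns the indices of the best matches."""
    q = query_code / np.linalg.norm(query_code)
    db = db_codes / np.linalg.norm(db_codes, axis=1, keepdims=True)
    return np.argsort(-(db @ q))[:top_k]
```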

逐层预训练也可以用于分类和回归。在一个广泛使用的MNIST手写数字识别任务版本上,已报道的最佳错误率为:随机初始化的反向传播1.6%,支持向量机1.4%。在784-500-500-2000-10网络中进行逐层预训练后,使用最速下降法和较小学习率的反向传播可以达到1.2%(8)。预训练有助于泛化,因为它确保权重中的大部分信息来自对图像的建模;标签中非常有限的信息只用来轻微调整预训练得到的权重。

自20世纪80年代以来人们就已清楚,只要计算机足够快、数据集足够大、初始权重足够接近一个好的解,通过深度自动编码器进行反向传播对非线性降维将非常有效。现在这三个条件都已满足。与非参数方法(15,16)不同,自动编码器给出了数据空间与代码空间之间双向的映射,而且由于预训练和微调的时间与空间开销都随训练样本数量线性增长,它们可以应用于非常大的数据集。


References and Notes

  1. D. C. Plaut, G. E. Hinton, Comput. Speech Lang. 2, 35 (1987).
  2. D. DeMers, G. Cottrell, Advances in Neural Information Processing Systems 5 (Morgan Kaufmann, San Mateo, CA, 1993), pp. 580–587.
  3. R. Hecht-Nielsen, Science 269, 1860 (1995).
  4. N. Kambhatla, T. Leen, Neural Comput. 9, 1493 (1997).
  5. P. Smolensky, Parallel Distributed Processing: Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, Eds. (MIT Press, Cambridge, 1986), pp. 194–281.
  6. G. E. Hinton, Neural Comput. 14, 1711 (2002).
  7. J. J. Hopfield, Proc. Natl. Acad. Sci. U.S.A. 79, 2554 (1982).
  8. See supporting material on Science Online.
  9. G. E. Hinton, S. Osindero, Y. W. Teh, Neural Comput. 18, 1527 (2006).
  10. M. Welling, M. Rosen-Zvi, G. Hinton, Advances in Neural Information Processing Systems 17 (MIT Press, Cambridge, MA, 2005), pp. 1481–1488.
  11. The MNIST data set is available at http://yann.lecun.com/exdb/mnist/index.html.
  12. The Olivetti face data set is available at www.cs.toronto.edu/~roweis/data.html.
  13. The Reuter Corpus Volume 2 is available at http://trec.nist.gov/data/reuters/reuters.html.
  14. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, R. A. Harshman, J. Am. Soc. Inf. Sci. 41, 391 (1990).
  15. S. T. Roweis, L. K. Saul, Science 290, 2323 (2000).
  16. J. B. Tenenbaum, V. de Silva, J. C. Langford, Science 290, 2319 (2000).
  17. We thank D. Rumelhart, M. Welling, S. Osindero, and S. Roweis for helpful discussions, and the Natural Sciences and Engineering Research Council of Canada for funding. G.E.H. is a fellow of the Canadian Institute for Advanced Research.
