(For recognizing hand-written numbers) Far better results can be obtained by adopting a machine learning approach in which a large set of N digits {x_1, ..., x_N} called a training set is used to tune the parameters of an adaptive model. The categories of the digits in the training set are known in advance, typically by inspecting them individually and hand-labelling them. We can express the category of a digit using a target vector t, which represents the identity of the corresponding digit. Suitable techniques for representing categories in terms of vectors will be discussed later. Note that there is one such target vector t for each digit image x.
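Although the book leaves the vector representation of categories for later, a common scheme is 1-of-K ("one-hot") coding: the target vector t is all zeros except for a single 1 marking the class. A minimal sketch (the function name and default class count are illustrative choices, not from the text):

```python
import numpy as np

def one_hot(digit, num_classes=10):
    """Encode a digit label as a 1-of-K target vector t:
    all zeros except a single 1 at the position of the class."""
    t = np.zeros(num_classes)
    t[digit] = 1.0
    return t

# The digit 3 maps to (0, 0, 0, 1, 0, 0, 0, 0, 0, 0).
```

One such vector is produced for each labelled training image.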

The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and generates an output vector y, encoded in the same way as the target vectors. The precise form of the function y(x) is determined during the training phase, also known as the learning phase, on the basis of the training data. Once the model is trained it can then determine the identity of new digit images, which are said to comprise a test set. The ability to correctly categorize new examples that differ from those used for training is known as generalization. In practical applications, the variability of the input vectors will be such that the training data can comprise only a tiny fraction of all possible input vectors, and so generalization is a central goal in pattern recognition.

For most practical applications, the original input variables are typically preprocessed to transform them into some new space of variables where, it is hoped, the pattern recognition problem will be easier to solve. For instance, in the digit recognition problem, the images of the digits are typically translated and scaled so that each digit is contained within a box of a fixed size. This greatly reduces the variability within each digit class, because the location and scale of all the digits are now the same, which makes it much easier for a subsequent pattern recognition algorithm to distinguish between the different classes. This pre-processing stage is sometimes also called feature extraction. Note that new test data must be pre-processed using the same steps as the training data.
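The translate-and-scale pre-processing described above can be sketched in a few lines. This is an illustrative implementation only, assuming digit images stored as NumPy arrays with background value 0; a real system would typically use an image library with proper interpolation rather than nearest-neighbour resampling:

```python
import numpy as np

def normalize_digit(img, out_size=16):
    """Crop the digit's bounding box and rescale it to a fixed-size
    square, removing variation in position and scale (the box size
    16 is an arbitrary illustrative choice)."""
    rows = np.any(img > 0, axis=1)
    cols = np.any(img > 0, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    crop = img[r0:r1 + 1, c0:c1 + 1]
    # Nearest-neighbour resampling onto an out_size x out_size grid.
    ri = np.arange(out_size) * crop.shape[0] // out_size
    ci = np.arange(out_size) * crop.shape[1] // out_size
    return crop[np.ix_(ri, ci)]
```

Because the same function is applied to every image, training and test data receive identical pre-processing, as the text requires.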

Pre-processing might also be performed in order to speed up computation. For example, if the goal is real-time face detection in a high-resolution video stream, the computer must handle huge numbers of pixels per second, and presenting these directly to a complex pattern recognition algorithm may be computationally infeasible. Instead, the aim is to find useful features that are fast to compute, and also preserve useful discriminatory information enabling faces to be distinguished from non-faces. These features are then used as the inputs to the pattern recognition algorithm. For instance, the average value of the image intensity over a rectangular subregion can be evaluated extremely efficiently (Viola and Jones, 2004), and a set of such features can prove very effective in fast face detection. Because the number of such features is smaller than the number of pixels, this kind of pre-processing represents a form of dimensionality reduction. Care must be taken during pre-processing because often information is discarded, and if this information is important to the solution of the problem then the overall accuracy of the system can suffer.
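The rectangular-mean feature is fast because it can be read off a summed-area table (the "integral image" used by Viola and Jones) with four array lookups, independent of the rectangle's size. A minimal sketch:

```python
import numpy as np

def integral_image(img):
    """Summed-area table S with S[i, j] = sum of img[:i, :j]."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def mean_intensity(S, r0, c0, r1, c1):
    """Mean intensity over img[r0:r1, c0:c1] using four lookups."""
    total = S[r1, c1] - S[r0, c1] - S[r1, c0] + S[r0, c0]
    return total / ((r1 - r0) * (c1 - c0))
```

Building S costs one pass over the image; after that, any number of rectangular features can be evaluated in constant time each, which is what makes this pre-processing suitable for real-time detection.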

Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems. Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the yield in a chemical manufacturing process in which the inputs consist of the concentrations of reactants, the temperature, and the pressure.
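As an illustration of the regression setting, one simple choice of model is a linear least-squares fit of yield against reactant concentration, temperature, and pressure. The numbers below are entirely made up; only the mechanics matter:

```python
import numpy as np

# Hypothetical supervised data: each row of X is one observation of
# (reactant concentration, temperature, pressure); y holds the yields.
X = np.array([[0.5, 300.0, 1.0],
              [0.8, 310.0, 1.2],
              [0.6, 305.0, 1.1],
              [0.9, 320.0, 1.3]])
y = np.array([10.0, 16.0, 12.0, 18.0])

# Fit a linear model y ≈ X w + b by least squares; the appended
# column of ones absorbs the bias term b.
A = np.column_stack([X, np.ones(len(X))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
predicted = A @ w
```

Here the targets are continuous values rather than discrete category labels, which is exactly the distinction between regression and classification drawn above.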

In other pattern recognition problems, the training data consists of a set of input vectors x without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
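A minimal sketch of clustering, using the k-means algorithm (one standard approach; the text does not prescribe a particular method). Given only unlabelled input vectors, it groups similar examples together:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means: partition the rows of X into k clusters of
    similar examples, without any target values."""
    rng = np.random.default_rng(seed)
    # Initialize the cluster centres at k randomly chosen data points.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Move each centre to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

No target vectors appear anywhere: the structure is discovered from the inputs alone, which is the defining feature of the unsupervised setting.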

Finally, the technique of reinforcement learning (Sutton and Barto, 1998) is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward. Here the learning algorithm is not given examples of optimal outputs, in contrast to supervised learning, but must instead discover them by a process of trial and error. Typically there is a sequence of states and actions in which the learning algorithm is interacting with its environment. In many cases, the current action not only affects the immediate reward but also has an impact on the reward at all subsequent time steps. For example, by using appropriate reinforcement learning techniques a neural network can learn to play the game of backgammon to a high standard (Tesauro, 1994).

Here the network must learn to take a board position as input, along with the result of a dice throw, and produce a strong move as the output. This is done by having the network play against a copy of itself for perhaps a million games. A major challenge is that a game of backgammon can involve dozens of moves, and yet it is only at the end of the game that the reward, in the form of victory, is achieved. The reward must then be attributed appropriately to all of the moves that led to it, even though some moves will have been good ones and others less so. This is an example of a credit assignment problem. A general feature of reinforcement learning is the trade-off between exploration, in which the system tries out new kinds of actions to see how effective they are, and exploitation, in which the system makes use of actions that are known to yield a high reward. Too strong a focus on either exploration or exploitation will yield poor results. Reinforcement learning continues to be an active area of machine learning research. However, a detailed treatment lies beyond the scope of this book.

Figure 1.2 Plot of a training data set of N = 10 points, shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the data. Our goal is to predict the value of t for some new value of x, without knowledge of the green curve.

The figure shows an example of a regression problem.

Although each of these tasks needs its own tools and techniques, many of the key ideas that underpin them are common to all such problems. One of the main goals of this chapter is to introduce, in a relatively informal way, several of the most important of these concepts and to illustrate them using simple examples. Later in the book we shall see these same ideas re-emerge in the context of more sophisticated models that are applicable to real-world pattern recognition applications. This chapter also provides a self-contained introduction to three important tools that will be used throughout the book, namely probability theory, decision theory, and information theory. Although these might sound like daunting topics, they are in fact straightforward, and a clear understanding of them is essential if machine learning techniques are to be used to best effect in practical applications.

1.1. Example: Polynomial Curve Fitting

We begin by introducing a simple regression problem, which we shall use as a running example throughout this chapter to motivate a number of key concepts. Suppose we observe a real-valued input variable x and we wish to use this observation to predict the value of a real-valued target variable t. For the present purposes, it is instructive to consider an artificial example using synthetically generated data because we then know the precise process that generated the data for comparison against any learned model. The data for this example is generated from the function sin(2πx) with random noise included in the target values, as described in detail in Appendix A.

Now suppose that we are given a training set comprising N observations of x, written x ≡ (x_1, ..., x_N)^T, together with corresponding observations of the values of t, denoted t ≡ (t_1, ..., t_N)^T. Figure 1.2 shows a plot of a training set comprising N = 10 data points. The input data set x in Figure 1.2 was generated by choosing values of x_n, for n = 1, ..., N, spaced uniformly in the range [0, 1], and the target data set t was obtained by first computing the corresponding values of the function sin(2πx) and then adding a small level of random noise having a Gaussian distribution (the Gaussian distribution is discussed in Section 1.2.4) to each such point in order to obtain the corresponding value t_n. By generating data in this way, we are capturing a property of many real data sets, namely that they possess an underlying regularity, which we wish to learn, but that individual observations are corrupted by random noise. This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay but more typically is due to there being sources of variability that are themselves unobserved.
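The data-generation recipe just described can be reproduced in a few lines. The noise standard deviation of 0.3 below is an illustrative choice; the exact setup is specified in Appendix A of the book:

```python
import numpy as np

# N = 10 inputs spaced uniformly in [0, 1]; targets are sin(2*pi*x)
# corrupted by small Gaussian noise, mimicking an underlying
# regularity observed through noisy measurements.
N = 10
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, size=N)
```

Plotting t against x reproduces the qualitative picture of Figure 1.2: noisy points scattered around the green sinusoid.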

Our goal is to exploit this training set in order to make predictions of the value t̂ of the target variable for some new value x̂ of the input variable. As we shall see later, this involves implicitly trying to discover the underlying function sin(2πx). This is intrinsically a difficult problem as we have to generalize from a finite data set. Furthermore the observed data are corrupted with noise, and so for a given x̂ there is uncertainty as to the appropriate value for t̂. Probability theory, discussed in Section 1.2, provides a framework for expressing such uncertainty in a precise and quantitative manner, and decision theory, discussed in Section 1.5, allows us to exploit this probabilistic representation in order to make predictions that are optimal according to appropriate criteria.

For the moment, however, we shall proceed rather informally and consider a simple approach based on curve fitting. In particular, we shall fit the data using a polynomial function of the form

y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{j=0}^{M} w_j x^j        (1.1)

where M is the order of the polynomial.
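Fitting such a polynomial by minimizing a sum-of-squares error (the approach taken in the rest of the section) reduces to linear least squares in the coefficients w, since y(x, w) is linear in w even though it is nonlinear in x. A minimal sketch:

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of an order-M polynomial
    y(x, w) = w_0 + w_1 x + ... + w_M x**M."""
    # Design matrix with columns 1, x, x**2, ..., x**M.
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def predict(w, x):
    """Evaluate the fitted polynomial y(x, w) at the inputs x."""
    return np.vander(x, len(w), increasing=True) @ w
```

Applying `fit_polynomial` to the noisy sin(2πx) training set for various orders M reproduces the under- and over-fitting behaviour examined in the rest of the chapter.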
