文章目录

  • Mind Map
  • CODE WORKS
  • CONTENTS
    • Learning Algorithms
    • The Task, T
    • The Performance Measure, P
    • The Experience, E
    • Example: Linear Regression

Mind Map

CODE WORKS

Work Here!

CONTENTS

Learning Algorithms

  • A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? Mitchell (1997) provides the definition: "A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$."

The Task, T

  • In this relatively formal definition of the word “task,” the process of learning itself is not the task.
    Machine learning tasks are usually described in terms of how the machine learning system should process an example.
An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process. We typically represent an example as a vector $\boldsymbol{x} \in \mathbb{R}^{n}$, where each entry $x_{i}$ of the vector is another feature. For example, the features of an image are usually the values of the pixels in the image.
    Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:
  • Classification: In this type of task, the computer program is asked to specify which of $k$ categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function $f: \mathbb{R}^{n} \rightarrow\{1, \ldots, k\}$. When $y=f(\boldsymbol{x})$, the model assigns an input described by vector $\boldsymbol{x}$ to a category identified by numeric code $y$. There are other variants of the classification task, for example, where $f$ outputs a probability distribution over classes. (A minimal code sketch of such a function appears after this list.)
  • Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. When every input is always available, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions. Each function corresponds to classifying $\boldsymbol{x}$ with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all of the relevant variables, then solve the classification task by marginalizing out the missing variables.
  • Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$. This type of task is similar to classification, except that the format of the output is different.
  • Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format).
  • Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.
  • Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category, and subsumes the transcription and translation tasks described above, as well as many other tasks. One example is parsing: mapping a natural language sentence into a tree that describes its grammatical structure, with the nodes of the tree tagged as verbs, nouns, adverbs, and so on.
  • Anomaly detection: In this type of task, the computer program sifts through a set of events or objects, and flags some of them as being unusual or atypical. An example of an anomaly detection task is credit card fraud detection.
  • Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data.
  • Denoising: In this type of task, the machine learning algorithm is given as input a corrupted example $\tilde{\boldsymbol{x}} \in \mathbb{R}^{n}$ obtained by an unknown corruption process from a clean example $\boldsymbol{x} \in \mathbb{R}^{n}$. The learner must predict the clean example $\boldsymbol{x}$ from its corrupted version $\tilde{\boldsymbol{x}}$, or more generally predict the conditional probability distribution $p(\boldsymbol{x} \mid \tilde{\boldsymbol{x}})$.
  • Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function $p_{\text{model}}: \mathbb{R}^{n} \rightarrow \mathbb{R}$, where $p_{\text{model}}(\boldsymbol{x})$ can be interpreted as a probability density function (if $\mathrm{x}$ is continuous) or a probability mass function (if $\mathrm{x}$ is discrete) on the space that the examples were drawn from. To do such a task well (we will specify exactly what that means when we discuss performance measures $P$), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur. Most of the tasks described above require the learning algorithm to at least implicitly capture the structure of the probability distribution. Density estimation allows us to explicitly capture that distribution. In principle, we can then perform computations on that distribution in order to solve the other tasks as well.
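To make the classification task concrete, here is a minimal sketch of a function $f: \mathbb{R}^{n} \rightarrow\{1, \ldots, k\}$, implemented as a nearest-centroid rule. The centroids here are hypothetical placeholders, not parameters learned by any particular algorithm above.

```python
import numpy as np

# Hypothetical centroids for k = 3 categories in R^2.
centroids = np.array([[0.0, 0.0],   # category 1
                      [5.0, 5.0],   # category 2
                      [0.0, 5.0]])  # category 3

def f(x):
    """Map an input vector x to the numeric code of the nearest centroid."""
    distances = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(distances)) + 1  # categories numbered 1..k

print(f(np.array([4.5, 5.5])))  # -> 2
```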

The Performance Measure, P

  • In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure $P$ is specific to the task $T$ being carried out by the system.
  • The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.
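As one concrete example of a performance measure, for classification tasks we often measure the accuracy of the model: the proportion of examples for which it produces the correct output. A minimal sketch, using hypothetical label arrays:

```python
import numpy as np

# Hypothetical true labels and model predictions for five test examples.
y_true = np.array([1, 2, 2, 3, 1])
y_pred = np.array([1, 2, 1, 3, 1])

accuracy = np.mean(y_true == y_pred)  # fraction of correct predictions
print(accuracy)  # 0.8
```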

The Experience, E

  • Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.

  • Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. In the context of deep learning, we usually want to learn the entire probability distribution that generated a dataset, whether explicitly as in density estimation or implicitly for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the dataset into clusters of similar examples.

  • Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.

  • Roughly speaking, unsupervised learning involves observing several examples of a random vector $\mathbf{x}$ and attempting to implicitly or explicitly learn the probability distribution $p(\mathbf{x})$, or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector $\mathbf{x}$ and an associated value or vector $\mathbf{y}$, and learning to predict $\mathbf{y}$ from $\mathbf{x}$, usually by estimating $p(\mathbf{y} \mid \mathbf{x})$.

  • Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred. Many machine learning technologies can be used to perform both tasks. For example, the chain rule of probability states that for a vector $\mathbf{x} \in \mathbb{R}^{n}$, the joint distribution can be decomposed as
    $$p(\mathbf{x})=\prod_{i=1}^{n} p\left(\mathrm{x}_{i} \mid \mathrm{x}_{1}, \ldots, \mathrm{x}_{i-1}\right)$$
    This decomposition means that we can solve the ostensibly unsupervised problem of modeling $p(\mathbf{x})$ by splitting it into $n$ supervised learning problems. Alternatively, we can solve the supervised learning problem of learning $p(y \mid \mathbf{x})$ by using traditional unsupervised learning technologies to learn the joint distribution $p(\mathbf{x}, y)$ and inferring
    $$p(y \mid \mathbf{x})=\frac{p(\mathbf{x}, y)}{\sum_{y^{\prime}} p\left(\mathbf{x}, y^{\prime}\right)}$$
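The second identity above is easy to verify numerically. A minimal sketch, using a hypothetical joint distribution $p(\mathbf{x}, y)$ stored as a table:

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index 3 values of x,
# columns index 2 values of y; all entries sum to 1.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])

# p(y | x) = p(x, y) / sum_{y'} p(x, y'), computed row by row.
p_y_given_x = joint / joint.sum(axis=1, keepdims=True)

print(p_y_given_x)              # conditional distribution for each x
print(p_y_given_x.sum(axis=1))  # [1. 1. 1.] -- each row sums to 1
```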

  • Traditionally, people refer to regression, classification and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered unsupervised learning.

  • Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences.

  • One common way of describing a dataset is with a design matrix. A design matrix is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature. For instance, the Iris dataset contains 150 examples with four features for each example. This means we can represent the dataset with a design matrix $\boldsymbol{X} \in \mathbb{R}^{150 \times 4}$, where $X_{i, 1}$ is the sepal length of plant $i$, $X_{i, 2}$ is the sepal width of plant $i$, etc. We will describe most of the learning algorithms in this book in terms of how they operate on design matrix datasets. Of course, to describe a dataset as a design matrix, it must be possible to describe each example as a vector, and each of these vectors must be the same size. This is not always possible. For example, if you have a collection of photographs with different widths and heights, then different photographs will contain different numbers of pixels, so not all of the photographs may be described with a vector of the same length.

  • In the case of supervised learning, the example contains a label or target as well as a collection of features. For example, if we want to use a learning algorithm to perform object recognition from photographs, we need to specify which object appears in each of the photos. We might do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying a cat, etc. Often when working with a dataset containing a design matrix of feature observations $\boldsymbol{X}$, we also provide a vector of labels $\boldsymbol{y}$, with $y_{i}$ providing the label for example $i$.

    Of course, sometimes the label may be more than just a single number. For example, if we want to train a speech recognition system to transcribe entire sentences, then the label for each example sentence is a sequence of words.
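A minimal sketch of a design matrix and its label vector, assuming scikit-learn is available for loading the Iris dataset:

```python
from sklearn.datasets import load_iris  # assumes scikit-learn is installed

iris = load_iris()
X = iris.data    # design matrix: one example per row, one feature per column
y = iris.target  # label vector: y[i] is the species code of example i

print(X.shape)   # (150, 4) -- 150 examples, 4 features each
print(X[0, 0])   # sepal length of plant 0
```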

  • Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences. The structures described here cover most cases, but it is always possible to design new ones for new applications.

Example: Linear Regression

  • As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector $\boldsymbol{x} \in \mathbb{R}^{n}$ as input and predict the value of a scalar $y \in \mathbb{R}$ as its output. In the case of linear regression, the output is a linear function of the input. Let $\hat{y}$ be the value that our model predicts $y$ should take on. We define the output to be
    $$\hat{y}=\boldsymbol{w}^{\top} \boldsymbol{x}$$
    where $\boldsymbol{w} \in \mathbb{R}^{n}$ is a vector of parameters.

  • Parameters are values that control the behavior of the system. In this case, $w_{i}$ is the coefficient that we multiply by feature $x_{i}$ before summing up the contributions from all the features. We can think of $\boldsymbol{w}$ as a set of weights that determine how each feature affects the prediction. If a feature $x_{i}$ receives a positive weight $w_{i}$, then increasing the value of that feature increases the value of our prediction $\hat{y}$. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature's weight is large in magnitude, then it has a large effect on the prediction. If a feature's weight is zero, it has no effect on the prediction.

  • We thus have a definition of our task $T$: to predict $y$ from $\boldsymbol{x}$ by outputting $\hat{y}=\boldsymbol{w}^{\top} \boldsymbol{x}$. Next we need a definition of our performance measure, $P$.

  • Suppose that we have a design matrix of $m$ example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of $y$ for each of these examples. Because this dataset will only be used for evaluation, we call it the test set. We refer to the design matrix of inputs as $\boldsymbol{X}^{(\text{test})}$ and the vector of regression targets as $\boldsymbol{y}^{(\text{test})}$.

  • One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If $\hat{\boldsymbol{y}}^{(\text{test})}$ gives the predictions of the model on the test set, then the mean squared error is given by
    $$\mathrm{MSE}_{\text{test}}=\frac{1}{m} \sum_{i}\left(\hat{\boldsymbol{y}}^{(\text{test})}-\boldsymbol{y}^{(\text{test})}\right)_{i}^{2}$$
    Intuitively, one can see that this error measure decreases to 0 when $\hat{\boldsymbol{y}}^{(\text{test})}=\boldsymbol{y}^{(\text{test})}$. We can also see that
    $$\mathrm{MSE}_{\text{test}}=\frac{1}{m}\left\|\hat{\boldsymbol{y}}^{(\text{test})}-\boldsymbol{y}^{(\text{test})}\right\|_{2}^{2}$$
    so the error increases whenever the Euclidean distance between the predictions and the targets increases.
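Computed directly, the mean squared error is a one-liner. A minimal numpy sketch with hypothetical test predictions and targets:

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error: (1/m) * ||y_hat - y||_2^2."""
    return np.mean((y_hat - y) ** 2)

# Hypothetical test targets and model predictions.
y_test = np.array([1.0, 2.0, 3.0])
y_hat_test = np.array([1.1, 1.9, 3.2])
print(mse(y_hat_test, y_test))  # approximately 0.02
```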

  • To make a machine learning algorithm, we need to design an algorithm that will improve the weights $\boldsymbol{w}$ in a way that reduces $\mathrm{MSE}_{\text{test}}$ when the algorithm is allowed to gain experience by observing a training set $\left(\boldsymbol{X}^{(\text{train})}, \boldsymbol{y}^{(\text{train})}\right)$. One intuitive way of doing this is just to minimize the mean squared error on the training set, $\mathrm{MSE}_{\text{train}}$. To minimize $\mathrm{MSE}_{\text{train}}$, we can simply solve for where its gradient is $\mathbf{0}$:
    $$\begin{gathered} \nabla_{\boldsymbol{w}} \mathrm{MSE}_{\text{train}}=0 \\ \Rightarrow \nabla_{\boldsymbol{w}} \frac{1}{m}\left\|\hat{\boldsymbol{y}}^{(\text{train})}-\boldsymbol{y}^{(\text{train})}\right\|_{2}^{2}=0 \\ \Rightarrow \frac{1}{m} \nabla_{\boldsymbol{w}}\left\|\boldsymbol{X}^{(\text{train})} \boldsymbol{w}-\boldsymbol{y}^{(\text{train})}\right\|_{2}^{2}=0 \\ \Rightarrow \nabla_{\boldsymbol{w}}\left(\boldsymbol{X}^{(\text{train})} \boldsymbol{w}-\boldsymbol{y}^{(\text{train})}\right)^{\top}\left(\boldsymbol{X}^{(\text{train})} \boldsymbol{w}-\boldsymbol{y}^{(\text{train})}\right)=0 \\ \Rightarrow \nabla_{\boldsymbol{w}}\left(\boldsymbol{w}^{\top} \boldsymbol{X}^{(\text{train})\top} \boldsymbol{X}^{(\text{train})} \boldsymbol{w}-2 \boldsymbol{w}^{\top} \boldsymbol{X}^{(\text{train})\top} \boldsymbol{y}^{(\text{train})}+\boldsymbol{y}^{(\text{train})\top} \boldsymbol{y}^{(\text{train})}\right)=0 \\ \Rightarrow 2 \boldsymbol{X}^{(\text{train})\top} \boldsymbol{X}^{(\text{train})} \boldsymbol{w}-2 \boldsymbol{X}^{(\text{train})\top} \boldsymbol{y}^{(\text{train})}=0 \\ \Rightarrow \boldsymbol{w}=\left(\boldsymbol{X}^{(\text{train})\top} \boldsymbol{X}^{(\text{train})}\right)^{-1} \boldsymbol{X}^{(\text{train})\top} \boldsymbol{y}^{(\text{train})} \end{gathered}$$
    The last equation is known as the normal equations.
    It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter: an intercept term $b$. In this model
    $$\hat{y}=\boldsymbol{w}^{\top} \boldsymbol{x}+b$$
    so the mapping from parameters to predictions is still a linear function, but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model's predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter $b$, one can continue to use the model with only weights but augment $\boldsymbol{x}$ with an extra entry that is always set to 1. The weight corresponding to the extra 1 entry plays the role of the bias parameter. We will frequently use the term "linear" when referring to affine functions throughout this book. (A minimal code sketch after the next paragraph implements both the normal equations and this augmentation trick.)

    The intercept term $b$ is often called the bias parameter of the affine transformation. This terminology derives from the point of view that the output of the transformation is biased toward being $b$ in the absence of any input. This term is different from the idea of a statistical bias, in which a statistical estimation algorithm's expected estimate of a quantity is not equal to the true quantity.
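Putting the pieces together, here is a minimal numpy sketch that fits $\boldsymbol{w}$ via the normal equations on hypothetical synthetic data, absorbing the bias parameter $b$ through the extra always-1 feature described above:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3

# Hypothetical synthetic training data: y = X w + b plus small noise.
X_train = rng.normal(size=(m, n))
true_w = np.array([2.0, -1.0, 0.5])
y_train = X_train @ true_w + 4.0 + 0.01 * rng.normal(size=m)

# Augment each example with a constant 1 so the last weight acts as the bias b.
X_aug = np.hstack([X_train, np.ones((m, 1))])

# Normal equations: w = (X^T X)^{-1} X^T y.
# np.linalg.solve is preferred over forming the inverse explicitly.
w = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_train)

print(w[:-1])  # close to [2.0, -1.0, 0.5]
print(w[-1])   # close to 4.0 (the bias)
```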
