CS 229 - Machine Learning - Supervised Learning cheatsheet
I strongly recommend reading the English original referenced in the title above.
This comes from a translation project I took part in earlier; I was responsible for the supervised learning part.



  1. Supervised Learning cheatsheet

⟶ 监督学习简明指南

  1. Introduction to Supervised Learning

⟶ 监督学习简介

  1. Given a set of data points {x(1),…,x(m)} associated to a set of outcomes {y(1),…,y(m)}, we want to build a classifier that learns how to predict y from x.

⟶ 给定一组数据点 {x(1),…,x(m)} 和与其对应的输出 {y(1),…,y(m)} , 我们想要建立一个分类器,学习如何从 x 预测 y。

  1. Type of prediction ― The different types of predictive models are summed up in the table below:

⟶ 预测类型 - 不同类型的预测模型总结如下表:

  1. [Regression, Classifier, Outcome, Examples]

⟶ [回归,分类,输出,例子]

  1. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]

⟶ [连续,类,线性回归,Logistic回归,SVM,朴素贝叶斯]

  1. Type of model ― The different models are summed up in the table below:

⟶ 模型类型 - 不同类型的模型总结如下表:

  1. [Discriminative model, Generative model, Goal, What’s learned, Illustration, Examples]

⟶ [判别模型,生成模型,目标,所学内容,例图,示例]

  1. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]

⟶ [直接估计P(y|x),估计P(x|y) 然后推导 P(y|x),决策边界,数据的概率分布,回归,SVMs,GDA,朴素贝叶斯]

  1. Notations and general concepts

⟶ 符号和一般概念

  1. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).

⟶ 假设 - 假设(hypothesis)记为 hθ,即我们所选择的模型。对于给定的输入数据 x(i),模型的预测输出为 hθ(x(i))。

  1. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:

⟶ 损失函数 - 损失函数是一个 L:(z,y)∈R×Y⟼L(z,y)∈R 的函数,其将真实数据值 y 和其预测值 z 作为输入,输出它们的不同程度。 常见的损失函数总结如下表:

  1. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]

⟶ [最小二乘误差,Logistic损失,铰链损失,交叉熵]

  1. [Linear regression, Logistic regression, SVM, Neural Network]

⟶ [线性回归,Logistic回归,SVM,神经网络]
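
To make the loss table above concrete, here is a minimal NumPy sketch of the four losses (my own illustrative code, not part of the original cheatsheet); the label conventions noted in the comments are assumptions.

```python
import numpy as np

# Sketches of the losses listed above. z is the prediction, y the label.
# For the logistic and hinge losses, labels are assumed to be in {-1, +1};
# for the cross-entropy, y is in {0, 1} and z is interpreted as a probability.

def least_squares_loss(z, y):
    return 0.5 * (z - y) ** 2

def logistic_loss(z, y):
    return np.log(1.0 + np.exp(-y * z))

def hinge_loss(z, y):
    return np.maximum(0.0, 1.0 - y * z)

def cross_entropy_loss(z, y):
    return -(y * np.log(z) + (1 - y) * np.log(1 - z))
```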

  1. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:

⟶ 成本函数 - 成本函数 J 通常用于评估模型的性能,使用损失函数 L 定义如下:

  1. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:

⟶ 梯度下降 - 记学习率为 α∈R,梯度下降的更新规则使用学习率和成本函数 J 表示如下:

  1. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.

⟶ 备注:随机梯度下降(SGD)是根据每个训练样本进行参数更新,而批量梯度下降是在一批训练样本上进行更新。
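
As an illustration of the remark above, the sketch below contrasts a batch update with a stochastic (per-example) pass for linear regression's squared-error cost; the function names, array shapes, and learning-rate handling are my own assumptions.

```python
import numpy as np

# Batch vs. stochastic gradient descent for the hypothesis h_theta(x) = theta^T x.
# X has shape (m, n), y has shape (m,), alpha is the learning rate.

def batch_gradient_step(theta, X, y, alpha):
    m = X.shape[0]
    grad = X.T @ (X @ theta - y) / m      # gradient of J(theta) = (1/2m) * sum((X theta - y)^2)
    return theta - alpha * grad            # theta <- theta - alpha * grad J(theta)

def sgd_epoch(theta, X, y, alpha):
    for i in np.random.permutation(X.shape[0]):
        grad_i = (X[i] @ theta - y[i]) * X[i]   # gradient on a single training example
        theta = theta - alpha * grad_i
    return theta
```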

  1. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:

⟶ 似然 - 给定参数 θ 的模型的似然 L(θ) 用于通过最大化似然来找到最佳参数 θ。在实践中,我们使用更容易优化的对数似然 ℓ(θ)=log(L(θ))。我们有:

  1. Newton’s algorithm ― The Newton’s algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:

⟶ 牛顿算法 - 牛顿算法是一种数值方法,目的是找到一个 θ 使得 ℓ′(θ)=0。其更新规则如下:

  1. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:

⟶ 备注:多维泛化,也称为 Newton-Raphson 方法,具有以下更新规则:
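
A minimal sketch of both update rules, assuming the caller supplies the derivatives of ℓ (the helper names and signatures here are hypothetical):

```python
import numpy as np

def newton_step_1d(theta, ell_prime, ell_double_prime):
    # theta <- theta - ell'(theta) / ell''(theta)
    return theta - ell_prime(theta) / ell_double_prime(theta)

def newton_raphson_step(theta, gradient, hessian):
    # Multidimensional generalization: theta <- theta - H^{-1} grad
    return theta - np.linalg.solve(hessian(theta), gradient(theta))
```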

  1. Linear models

⟶ 线性模型

  1. Linear regression

⟶ 线性回归

  1. We assume here that y|x;θ∼N(μ,σ2)

⟶ 我们假设 y|x;θ∼N(μ,σ2)

  1. Normal equations ― By noting X the design matrix, the value of θ that minimizes the cost function is a closed-form solution such that:

⟶ 正规方程 - 记 X 为设计矩阵,使成本函数最小化的 θ 值有如下闭式解:
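
A one-function NumPy sketch of the closed-form solution θ=(XᵀX)⁻¹Xᵀy, assuming X is the design matrix and XᵀX is invertible:

```python
import numpy as np

# Closed-form least-squares solution theta = (X^T X)^{-1} X^T y.
# Solving the linear system avoids forming the explicit matrix inverse.
def normal_equations(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)
```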

  1. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:

⟶ LMS算法 - 记学习率为 α,对于包含 m 个数据点的训练集,最小均方(LMS)算法的更新规则(也称为 Widrow-Hoff 学习规则)如下:

  1. Remark: the update rule is a particular case of the gradient ascent.

⟶ 备注:更新规则是梯度上升的特定情况。

  1. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:

⟶ LWR - 局部加权回归(LWR)是线性回归的一种变体,它在成本函数中通过 w(i)(x) 对每个训练样本加权,w(i)(x) 由参数 τ∈R 定义如下:
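
A short sketch of LWR under these definitions, assuming a weighted least-squares fit is solved around each query point (the function name and array shapes are my own choices):

```python
import numpy as np

# Locally weighted regression: each training point x_i is weighted by
# w_i(x) = exp(-||x_i - x||^2 / (2 tau^2)), then a weighted least-squares
# problem is solved to predict at the query point.
def lwr_predict(X, y, x_query, tau):
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```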

  1. Classification and logistic regression

⟶ 分类和逻辑回归

  1. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:

⟶ Sigmoid函数 - sigmoid 函数 g,也称为逻辑函数,定义如下:

  1. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:

⟶ 逻辑回归 - 我们假设 y|x;θ∼Bernoulli(ϕ) 。 我们有以下形式:

  1. Remark: there is no closed form solution for the case of logistic regressions.

⟶ 备注:对于逻辑回归的情况,没有闭式解。
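
Because there is no closed form, logistic regression is usually fit iteratively. Below is a minimal gradient-ascent sketch on the log-likelihood, assuming labels in {0, 1} and a design matrix X of shape (m, n); the hyperparameter values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Average gradient of the log-likelihood: X^T (y - sigmoid(X theta)) / m
        grad = X.T @ (y - sigmoid(X @ theta)) / X.shape[0]
        theta += alpha * grad          # ascent step (maximizing the likelihood)
    return theta
```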

  1. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:

⟶ Softmax回归 - 当存在超过2个结果类时,使用softmax回归(也称为多类逻辑回归)来推广逻辑回归。 按照惯例,我们设置 θK=0,使得每个类 i 的伯努利参数 ϕi 等于:
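
A small sketch of the resulting class probabilities ϕi = exp(θiᵀx) / Σj exp(θjᵀx); the max-subtraction is my own addition purely for numerical stability and does not change the result.

```python
import numpy as np

# Theta is a (K, n) matrix of per-class parameters, x a feature vector of shape (n,).
def softmax_probabilities(Theta, x):
    scores = Theta @ x
    scores -= scores.max()          # numerical stability only
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```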

  1. Generalized Linear Models

⟶ 广义线性模型

  1. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:

⟶ 指数分布族 - 如果一类分布可以用自然参数 η(也称为规范参数或链接函数)、充分统计量 T(y) 和对数配分函数 a(η) 写成如下形式,则称其属于指数分布族:

  1. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.

⟶ 备注:我们经常会有 T(y)=y。 此外,exp(−a(η)) 可以看作是归一化参数,确保概率总和为1。

  1. Here are the most common exponential distributions summed up in the following table:

⟶ 下表总结了最常见的指数分布:

  1. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]

⟶ [分布,伯努利,高斯,泊松,几何]

  1. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x∈Rn+1 and rely on the following 3 assumptions:

⟶ GLM的假设 - 广义线性模型(GLM)旨在将随机变量 y 预测为 x∈Rn+1 的函数,并依赖于以下3个假设:

  1. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.

⟶ 备注:普通最小二乘法和逻辑回归是广义线性模型的特例。

  1. Support Vector Machines

⟶ 支持向量机

  1. The goal of support vector machines is to find the line that maximizes the minimum distance to the line.

⟶ 支持向量机的目标是找到一条决策线,使训练样本到该线的最小距离最大化。

  1. Optimal margin classifier ― The optimal margin classifier h is such that:

⟶ 最优间隔分类器 - 最优间隔分类器 h 是这样的:

  1. where (w,b)∈Rn×R is the solution of the following optimization problem:

⟶ 其中 (w,b)∈Rn×R 是以下优化问题的解:

  1. such that

⟶ 使得

  1. support vectors

⟶ 支持向量

  1. Remark: the line is defined as wTx−b=0.

⟶ 备注:该线定义为 wTx−b=0。

  1. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:

⟶ 合页损失 - 合页损失用于SVM,定义如下:

  1. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:

⟶ 核 - 给定特征映射 ϕ,我们定义核 K 为:

  1. In practice, the kernel K defined by K(x,z)=exp(−‖x−z‖²/(2σ²)) is called the Gaussian kernel and is commonly used.

⟶ 在实践中,由 K(x,z)=exp(−‖x−z‖²/(2σ²)) 定义的核 K 被称为高斯核,这种核很常用。
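
A direct transcription of the Gaussian kernel formula as a NumPy function (illustrative only):

```python
import numpy as np

# Gaussian (RBF) kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
def gaussian_kernel(x, z, sigma):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
```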

  1. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]

⟶ [非线性可分性,核映射的使用,原始空间中的决策边界]

  1. Remark: we say that we use the “kernel trick” to compute the cost function using the kernel because we actually don’t need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.

⟶ 备注:我们说使用“核技巧”来计算成本函数,是因为我们实际上并不需要知道通常非常复杂的显式映射 ϕ,只需要知道 K(x,z) 的值。

  1. Lagrangian ― We define the Lagrangian L(w,b) as follows:

⟶ 拉格朗日 - 我们将拉格朗日 L(w,b) 定义如下:

  1. Remark: the coefficients βi are called the Lagrange multipliers.

⟶ 备注:系数 βi 称为拉格朗日乘子。

  1. Generative Learning

⟶ 生成学习

  1. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes’ rule.

⟶ 生成模型首先通过估计 P(x|y) 来学习数据是如何生成的,然后利用贝叶斯法则来估计 P(y|x)。

  1. Gaussian Discriminant Analysis

⟶ 高斯判别分析

  1. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:

⟶ 设置 - 高斯判别分析假设 y、x|y=0 和 x|y=1 满足:

  1. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:

⟶ 估计 - 下表总结了我们在最大化似然时的估计值:
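
For intuition, here is a sketch of how those maximum-likelihood estimates are typically computed from data, assuming binary labels and a shared covariance matrix (the variable names are mine):

```python
import numpy as np

# GDA estimates: class prior phi, per-class means mu_0 / mu_1, shared covariance Sigma.
def gda_estimates(X, y):
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)   # subtract each point's class mean
    Sigma = centered.T @ centered / X.shape[0]
    return phi, mu0, mu1, Sigma
```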

  1. Naive Bayes

⟶ 朴素贝叶斯

  1. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:

⟶ 假设 - 朴素贝叶斯模型假设每个数据点的特征都是独立的:

  1. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]

⟶ 解 - 最大化对数似然可得到以下解,其中 k∈{0,1},l∈[[1,L]]

  1. Remark: Naive Bayes is widely used for text classification and spam detection.

⟶ 备注:朴素贝叶斯广泛用于文本分类和垃圾邮件检测。
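
For intuition, here is a counting-based sketch of the Naive Bayes solutions for binary features; the Laplace smoothing term is my own addition to avoid zero probabilities and is not part of the cheatsheet's maximum-likelihood solution.

```python
import numpy as np

# Bernoulli Naive Bayes parameter estimation. X is a binary (m, n) feature matrix,
# y a binary label vector; the returned values are the estimated probabilities
# P(y=1), P(x_j=1 | y=0), and P(x_j=1 | y=1).
def naive_bayes_fit(X, y, laplace=1.0):
    phi_y = np.mean(y == 1)
    phi_j_given_1 = (X[y == 1].sum(axis=0) + laplace) / ((y == 1).sum() + 2 * laplace)
    phi_j_given_0 = (X[y == 0].sum(axis=0) + laplace) / ((y == 0).sum() + 2 * laplace)
    return phi_y, phi_j_given_0, phi_j_given_1
```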

  1. Tree-based and ensemble methods

⟶ 基于树的方法和集成方法

  1. These methods can be used for both regression and classification problems.

⟶ 这些方法可用于回归和分类问题。

  1. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.

⟶ CART - 分类和回归树(CART),通常称为决策树,可以表示为二叉树。它们的优点是具有很强的可解释性。

  1. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.

⟶ 随机森林 - 这是一种基于树的技术,它使用大量由随机选取的特征集合构建的决策树。与简单的决策树相反,它的可解释性很差,但其普遍良好的性能使其成为一种流行的算法。

  1. Remark: random forests are a type of ensemble methods.

⟶ 备注:随机森林是一种集成方法。

  1. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:

⟶ 提升 - 提升方法的思想是将一些弱学习器结合起来形成一个更强大的学习器。 主要内容总结在下表中:

  1. [Adaptive boosting, Gradient boosting]

⟶ [自适应增强, 梯度提升]

  1. High weights are put on errors to improve at the next boosting step

⟶ 对分错的样本赋予高权重,以便在下一轮提升步骤中加以改进

  1. Weak learners trained on remaining errors

⟶ 在剩余的误差上训练弱学习器

  1. Other non-parametric approaches

⟶ 其他非参数方法

  1. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.

⟶ k-最近邻 - k-最近邻算法(通常称为 k-NN)是一种非参数方法,其中数据点的输出由训练集中与其最近的 k 个邻居的性质决定。它可用于分类和回归。

  1. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.

⟶ 备注:参数 k 越高,偏差越大,参数 k 越低,方差越大。
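
A minimal k-NN classification sketch under these definitions, using Euclidean distance and a majority vote (both are common choices assumed here rather than taken from the cheatsheet):

```python
import numpy as np

# Predict the label of x_query as the majority label among its k nearest training points.
def knn_predict(X_train, y_train, x_query, k=5):
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```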

  1. Learning Theory

⟶ 学习理论

  1. Union bound ― Let A1,…,Ak be k events. We have:

⟶ 联合界 - 设 A1,…,Ak 为 k 个事件。我们有:

  1. Hoeffding inequality ― Let Z1,…,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:

⟶ Hoeffding不等式 - 设 Z1,…,Zm 是从参数为 ϕ 的伯努利分布中独立同分布抽取的 m 个变量。设 ˆϕ 为它们的样本均值,并固定 γ>0。我们有:

  1. Remark: this inequality is also known as the Chernoff bound.

⟶ 备注:这个不等式也被称为 Chernoff 界。
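
A quick simulation that checks the bound P(|ϕ−ˆϕ|>γ) ≤ 2exp(−2γ²m) empirically; the particular values of ϕ, γ, m and the number of trials are arbitrary choices of mine.

```python
import numpy as np

# Draw many samples of m Bernoulli(phi) variables and compare the empirical
# probability of a deviation larger than gamma with the Hoeffding bound.
rng = np.random.default_rng(0)
phi, gamma, m, trials = 0.3, 0.05, 500, 20000
phi_hat = rng.binomial(1, phi, size=(trials, m)).mean(axis=1)
empirical = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma ** 2 * m)
print(f"empirical probability = {empirical:.4f}, Hoeffding bound = {bound:.4f}")
```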

  1. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:

⟶ 训练误差 - 对于给定的分类器 h,我们定义训练误差 ˆϵ(h)(也称为经验风险或经验误差)如下:

  1. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:

⟶ 可能近似正确 (PAC) - PAC是一个框架,在该框架下证明了许多学习理论的结果,并具有以下假设:

  1. the training and testing sets follow the same distribution

⟶ 训练和测试集遵循相同的分布

  1. the training examples are drawn independently

⟶ 训练样本是相互独立的

  1. Shattering ― Given a set S={x(1),…,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),…,y(d)}, we have:

⟶ 打散 - 给定集合 S={x(1),…,x(d)} 和一组分类器 H,如果对于任意一组标签 {y(1),…,y(d)},H 中都存在分类器能将其完全正确分类,则称 H 打散 S,即我们有:

  1. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:

⟶ 上界定理 - 设 H 是一个有限假设类,满足 |H|=k,并固定 δ 和样本数 m。那么,以至少 1−δ 的概率,我们有:

  1. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.

⟶ VC维 - 给定的无限假设类 H 的 Vapnik-Chervonenkis(VC)维,记为 VC(H),是能被 H 打散的最大集合的大小。

  1. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.

⟶ 备注:H = {2维线性分类器集} 的 VC 维数为3。

  1. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:

⟶ 定理 (Vapnik) - 给定 H,设 VC(H)=d,m 为训练样本数。则以至少 1−δ 的概率,我们有:

  1. [Introduction, Type of prediction, Type of model]

⟶ [简介,预测类型,模型类型]

  1. [Notations and general concepts, loss function, gradient descent, likelihood]

⟶ [符号和一般概念,损失函数,梯度下降,似然]

  1. [Linear models, linear regression, logistic regression, generalized linear models]

⟶ [线性模型,线性回归,逻辑回归,广义线性模型]

  1. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]

⟶ [支持向量机,最优间隔分类器,合页损失,核]

  1. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]

⟶ [生成学习,高斯判别分析,朴素贝叶斯]

  1. [Trees and ensemble methods, CART, Random forest, Boosting]

⟶ [树和集成方法,CART,随机森林,提升]

  1. [Other methods, k-NN]

⟶ [其他方法,k-NN]

  1. [Learning theory, Hoeffding inequality, PAC, VC dimension]

⟶ [学习理论,Hoeffding不等式,PAC,VC维]
