Table of Contents

I. Let's first talk about linear classifiers

1. Basic concepts

2. A few things to note

3. A closer look at linear classifiers

II. The linearly separable SVM: algorithm flow

III. The Softmax classifier

Softmax illustration

SVM vs. Softmax


I. Let's first talk about linear classifiers

1. Basic concepts

The following drawbacks of the comparison-based (Nearest Neighbor) classifier motivate the parametric, linear approach:

  • The classifier must remember all of the training data and store it for future comparisons with the test data. This is space-inefficient, because datasets may easily be gigabytes in size.
  • Classifying a test image is expensive, since it requires a comparison against all training images.

2. A few things to note

There are a few things to note:

  • First, note that the single matrix multiplication Wxi is effectively evaluating 10 separate classifiers in parallel (one for each class), where each classifier is a row of W (see the sketch after this list).
  • Notice also that we think of the input data (xi, yi) as given and fixed, but we have control over the setting of the parameters W, b. Our goal will be to set these in such a way that the computed scores match the ground-truth labels across the whole training set. We will go into much more detail about how this is done, but intuitively we wish that the correct class has a score that is higher than the scores of the incorrect classes.
  • An advantage of this approach is that the training data is used to learn the parameters W, b, but once learning is complete we can discard the entire training set and keep only the learned parameters. That is because a new test image can simply be forwarded through the function and classified based on the computed scores.
  • Lastly, note that classifying a test image involves a single matrix multiplication and addition, which is significantly faster than comparing the test image against all training images.
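As a rough sketch of what this score function looks like in code (the shapes roughly follow the CIFAR-10 setup referenced later in this post — 10 classes, 32x32x3 images — with the bias kept separate rather than folded into W; the variable names and random values are purely illustrative):

import numpy as np

num_classes, num_pixels = 10, 3072             # CIFAR-10: 10 classes, 32*32*3 = 3072 pixel values
W = np.random.randn(num_classes, num_pixels) * 0.001   # one row of W per class
b = np.zeros(num_classes)                      # one bias per class
x = np.random.randn(num_pixels)                # a flattened test image

scores = W.dot(x) + b          # 10 scores at once: each row of W acts as its own classifier
predicted_class = np.argmax(scores)            # classification is one multiply-add plus an argmax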

Notice that a linear classifier computes the score of a class as a weighted sum of all of its pixel values across all 3 of its color channels. Depending on precisely what values we set for these weights, the function has the capacity to like or dislike (depending on the sign of each weight) certain colors at certain positions in the image. For instance, you can imagine that the "ship" class might be more likely if there is a lot of blue on the sides of an image (which could likely correspond to water). You might then expect the "ship" classifier to have a lot of positive weights across its blue-channel weights (presence of blue increases the score of "ship"), and negative weights in the red/green channels (presence of red/green decreases the score of "ship").

An example of mapping an image to class scores. For visualization, we assume the image has only 4 pixels (4 monochrome pixels; we ignore color channels in this example for simplicity), and that there are 3 classes (red = cat, green = dog, blue = ship). We stretch the image pixels into a column and perform a matrix multiplication to get the score for each class. Note that this particular set of weights W is not good at all: it assigns our cat image a very low cat score. In particular, this W seems convinced that the image is a dog.
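To make the figure concrete, here is that 4-pixel, 3-class setup as a small numeric sketch (the particular numbers are illustrative):

import numpy as np

# 3 classes (cat, dog, ship), 4-pixel monochrome image; all numbers are made up.
W = np.array([[ 0.2, -0.5,  0.1,  2.0],   # "cat" row of W
              [ 1.5,  1.3,  2.1,  0.0],   # "dog" row of W
              [ 0.0,  0.25, 0.2, -0.3]])  # "ship" row of W
b = np.array([1.1, 3.2, -1.2])
x = np.array([56.0, 231.0, 24.0, 2.0])    # image pixels stretched into a column

scores = W.dot(x) + b   # one score per class
# scores comes out to roughly [-96.8, 437.9, 60.75]: the "cat" score is lowest
# even though the image is a cat, so this particular W is a poor classifier.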

3. A closer look at linear classifiers

Interpretation of linear classifiers as template matching. Another interpretation of the weights W is that each row of W corresponds to a template (sometimes also called a prototype) for one of the classes. The score of each class for an image is then obtained by comparing each template with the image using an inner product (or dot product), one by one, to find the one that "fits" best. With this terminology, the linear classifier is doing template matching, where the templates are learned. Another way to think of it is that we are still effectively doing Nearest Neighbor, but instead of having thousands of training images we are only using a single image per class (although we will learn it, and it does not necessarily have to be one of the images in the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance.

Looking ahead a bit, a neural network will be able to develop intermediate neurons in its hidden layers that detect specific car types (e.g. a green car facing left, a blue car facing front, etc.), and neurons on the next layer can combine these into a more accurate car score through a weighted sum of the individual car detectors.

II. The linearly separable SVM: algorithm flow

The SVM loss is set up so that the SVM "wants" the correct class for each image to have a score higher than the incorrect classes by some fixed margin Δ. Notice that it is sometimes helpful to anthropomorphize the loss function in this way: the SVM "wants" a certain outcome in the sense that that outcome would yield a lower loss (which is good).

The Multiclass SVM loss for the i-th example is formalized as follows:

L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)

where s_j = f(x_i; W)_j is the score of the j-th class.

Since the score function here is linear, each score s_j is simply the j-th row of W dotted with x_i, so the loss can be written directly in terms of the rows of W. However, this will not necessarily be the case once we start to consider more complex forms of the score function f.

The threshold at zero, max(0, ·), is often called the hinge loss. You'll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which uses the form max(0, ·)^2 and penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but in some datasets the squared hinge loss can work better. This can be determined during cross-validation.

This loss function is called the "hinge loss": for a sample x_i with scores s, if the score of the correct class s_{y_i} exceeds the score of an incorrect class s_j by at least a fixed safety margin (the 1, or Δ, in the formula above), then that term contributes zero loss; otherwise, the loss is however far short of the margin we fall.

Does the safety margin have to be 1?

The exact value of this safety margin does not really matter; it can be set to any positive number. We only care whether a sample is classified correctly, and the absolute magnitudes of the scores carry little meaning on their own — what matters is their relative ordering. If we scale the weights W up by a factor of two, the scores s also double, but the relationships between them are unchanged; this is equivalent to changing the safety margin from 1 to 2. Different safety margins therefore only rescale the problem, and the resulting classifiers are essentially the same.
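A quick numeric sketch of the hinge loss for a single example (the scores are made up; Δ = 1 and the correct class is index 0). This is the same computation that the half-vectorized code further below performs:

import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # made-up class scores for one example
y, delta = 0, 1.0                     # correct class is index 0

margins = np.maximum(0, scores - scores[y] + delta)
margins[y] = 0                        # the correct class never contributes to the loss
loss_i = np.sum(margins)              # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9 + 0 = 2.9
loss_i_squared = np.sum(margins ** 2) # the squared hinge (L2-SVM) variant mentioned above: 8.41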

The loss function quantifies our unhappiness with the predictions on the training set.

The Multiclass Support Vector Machine "wants" the score of the correct class to be higher than all other scores by at least a margin of delta. If any class has a score inside the red region (or higher), then there will be accumulated loss. Otherwise the loss will be zero. Our objective will be to find the weights that simultaneously satisfy this constraint for all examples in the training data and give a total loss that is as low as possible.

Regularization. There is one bug with the loss function presented above. Suppose that we have a dataset and a set of parameters W that correctly classify every example (i.e. all scores are such that all the margins are met, and L_i = 0 for all i). The issue is that this W is not necessarily unique: there might be many similar W that correctly classify the examples. One easy way to see this is that if some parameters W correctly classify all examples (so the loss is zero for each example), then any multiple λW of these parameters with λ > 1 will also give zero loss, because this transformation uniformly stretches all score magnitudes and hence also their absolute differences.
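A tiny numeric illustration of this ambiguity (all numbers made up): scaling W uniformly stretches every score difference, so any margin that was already met stays met and the data loss stays zero; only a regularization term such as the one introduced below can distinguish the two solutions.

import numpy as np

W = np.array([[1.0, -2.0],
              [0.5,  0.5]])                   # made-up 2-class, 2-feature weights
x = np.array([3.0, 1.0])

print(W.dot(x))                               # scores [1., 2.]: score difference 1.0
print((2 * W).dot(x))                         # scores [2., 4.]: every difference doubles, so met margins stay met
print(np.sum(W ** 2), np.sum((2 * W) ** 2))   # L2 penalties 5.5 vs 22.0: the smaller W is preferred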

In other words, we wish to encode some preference for a certain set of weights W over others to remove this ambiguity. We can do so by extending the loss function with a regularization penalty R(W). The most common regularization penalty is the L2 norm, which discourages large weights through an elementwise quadratic penalty over all parameters:

R(W) = \sum_k \sum_l W_{k,l}^2

In the expression above, we are summing up all the squared elements of W. Notice that the regularization function is not a function of the data; it is based only on the weights. Including the regularization penalty completes the full Multiclass Support Vector Machine loss, which is made up of two components: the data loss (the average loss L_i over all examples) and the regularization loss. That is, the full Multiclass SVM loss becomes:

L = \frac{1}{N} \sum_i L_i + \lambda R(W)

Or expanding this out in its full form:

L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \max\left(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta\right) + \lambda \sum_k \sum_l W_{k,l}^2

where N is the number of training examples. As you can see, we append the regularization penalty to the loss objective, weighted by a hyperparameter λ. There is no simple way of setting this hyperparameter, and it is usually determined by cross-validation.

Here is the loss function (without regularization) implemented in Python, in both unvectorized and
half-vectorized form:

import numpy as np

def L_i(x, y, W):
    """
    Unvectorized version. Compute the multiclass SVM loss for a single example (x, y)
    - x is a column vector representing an image (e.g. 3073 x 1 in CIFAR-10)
      with an appended bias dimension in the 3073-rd position (i.e. bias trick)
    - y is an integer giving the index of the correct class (e.g. between 0 and 9 in CIFAR-10)
    - W is the weight matrix (e.g. 10 x 3073 in CIFAR-10)
    """
    delta = 1.0  # see notes about delta later in this section
    scores = W.dot(x)  # scores becomes of size 10 x 1, the scores for each class
    correct_class_score = scores[y]
    D = W.shape[0]  # number of classes, e.g. 10
    loss_i = 0.0
    for j in range(D):  # iterate over all wrong classes
        if j == y:
            # skip the true class to only loop over incorrect classes
            continue
        # accumulate loss for the i-th example
        loss_i += max(0, scores[j] - correct_class_score + delta)
    return loss_i

def L_i_vectorized(x, y, W):
    """
    A faster half-vectorized implementation. "Half-vectorized" refers to the fact that
    for a single example the implementation contains no for loops, but there is still
    one loop over the examples (outside this function).
    """
    delta = 1.0
    scores = W.dot(x)
    # compute the margins for all classes in one vector operation
    margins = np.maximum(0, scores - scores[y] + delta)
    # on the y-th position scores[y] - scores[y] canceled and gave delta. We want
    # to ignore the y-th position and only count margins on the incorrect classes
    margins[y] = 0
    loss_i = np.sum(margins)
    return loss_i

def L(X, y, W):
    """
    Fully-vectorized implementation:
    - X holds all the training examples as columns (e.g. 3073 x 50,000 in CIFAR-10)
    - y is an array of integers specifying the correct class (e.g. a 50,000-D array)
    - W are the weights (e.g. 10 x 3073)
    """
    # evaluate loss over all examples in X without using any for loops
    # left as an exercise to the reader in the assignment
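The fully-vectorized version is left as an exercise above. One possible sketch, assuming examples are stored as columns of X as described in the docstring (this also folds in the L2 regularization term λ·R(W) discussed earlier; the function name L_full and the reg/delta arguments are my own, not from the original notes):

import numpy as np

def L_full(X, y, W, reg=0.0, delta=1.0):
    """
    One possible fully-vectorized sketch (not the official assignment solution).
    X: D x N examples as columns, y: N-vector of correct class indices,
    W: C x D weight matrix, reg: regularization strength lambda.
    """
    num_train = X.shape[1]
    scores = W.dot(X)                                  # C x N matrix of class scores
    correct_scores = scores[y, np.arange(num_train)]   # score of the correct class for each example
    margins = np.maximum(0, scores - correct_scores + delta)
    margins[y, np.arange(num_train)] = 0               # correct classes contribute no loss
    data_loss = np.sum(margins) / num_train            # average hinge loss over all examples
    reg_loss = reg * np.sum(W * W)                     # L2 regularization penalty lambda * R(W)
    return data_loss + reg_loss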

The takeaway from this section is that the SVM loss takes one particular approach to measuring how consistent the predictions on the training data are with the ground-truth labels. Additionally, making good predictions on the training set is equivalent to minimizing the loss.

Comparing L1 and L2 regularization

L1 is the sum of the absolute values of all components of the weight matrix w; L2 is the sum of the squares of all components of w (analogous to the Manhattan and Euclidean distances, respectively).

The L1 penalty is linear in each deviation: 10 data points each off by 1 unit cost exactly the same as 1 data point off by 10 units.

Because of the squaring, the L2 penalty punishes large deviations much more heavily: 1 data point off by 10 units costs far more than 10 data points each off by 1 unit.

For this reason, L2 regularization tends to spread the weight evenly across all components, while L1 regularization tends to zero out most weights and keep only a few large ones; the two define the "complexity" of a model differently. For example (see the numeric sketch below):

For w1: with L1 regularization, R(w1) = 1; with L2 regularization, R(w1) = 1.

For w2: with L1 regularization, R(w2) = 1; with L2 regularization, R(w2) = 0.25.

To minimize the loss function, under L1 regularization w1 and w2 are equally good; under L2 regularization, w2 is preferred.
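A minimal numeric sketch of this example, assuming (as in the commonly used illustration) an input x = [1, 1, 1, 1] with w1 = [1, 0, 0, 0] and w2 = [0.25, 0.25, 0.25, 0.25], so that both weight vectors produce exactly the same score:

import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])        # assumed input
w1 = np.array([1.0, 0.0, 0.0, 0.0])        # all weight on one component
w2 = np.array([0.25, 0.25, 0.25, 0.25])    # weight spread evenly

print(w1.dot(x), w2.dot(x))                # both scores are 1.0, so the data loss is identical
print(np.abs(w1).sum(), np.abs(w2).sum())  # L1 penalties: 1.0 and 1.0 -> L1 cannot tell them apart
print((w1**2).sum(), (w2**2).sum())        # L2 penalties: 1.0 and 0.25 -> L2 prefers the diffuse w2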

III. The Softmax classifier

If you have heard of the binary Logistic Regression classifier before, the Softmax classifier is its generalization to multiple classes. Unlike the SVM, which treats the outputs f(x_i, W) as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, the function mapping f(x_i; W) = W x_i stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss of the form:

L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)

where we use the notation f_j to mean the j-th element of the vector of class scores f. As before, the full loss for the dataset is the mean of L_i over all training examples together with a regularization term R(W). The function

f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}

is called the softmax function: it takes a vector of arbitrary real-valued scores (in z) and squashes it into a vector of values between zero and one that sum to one.
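A minimal sketch of the softmax function and the resulting cross-entropy loss for a single example (the scores are made up; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not something required by the formula):

import numpy as np

f = np.array([3.2, 5.1, -1.7])        # made-up unnormalized class scores for one example
y = 0                                 # index of the correct class

f_shifted = f - np.max(f)             # subtract the max for numerical stability with large scores
probs = np.exp(f_shifted) / np.sum(np.exp(f_shifted))   # softmax: positive values that sum to 1
loss_i = -np.log(probs[y])            # cross-entropy loss for this example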

Information theory view. The cross-entropy between a "true" distribution p and an estimated distribution q is defined as:

H(p, q) = -\sum_x p(x) \log q(x)

The cross-entropy objective therefore wants all of the mass of the predicted distribution to sit on the correct answer.

Probabilistic interpretation. Looking at the expression

P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}

we see that it can be interpreted as the (normalized) probability assigned to the correct label y_i given the image x_i, parameterized by W. To see this, remember that the Softmax classifier interprets the scores as unnormalized log probabilities, so exponentiating them and normalizing yields a proper probability distribution over the classes.
