







SVM VS Softmax



  • The classifier must remember all of the training data and store it for future comparisons with the test data. This is space inefficient because datasets may easily be gigabytes in size.
  • Classifying a test image is expensive since it requires a comparison to all training images.




There are a few things to note:

  • First, note that the single matrix multiplication Wxi is effectively evaluating 10 separate classifiers in parallel (one for each class), where each classifier is a row of W.
  • Notice also that we think of the input data (xi, yi) as given and fixed, but we have control over the setting of the parameters W, b. Our goal will be to set these in such a way that the computed scores match the ground truth labels across the whole training set. We will go into much more detail about how this is done, but intuitively we wish that the correct class has a score that is higher than the scores of incorrect classes.
  • An advantage of this approach is that the training data is used to learn the parameters W, b, but once the learning is complete we can discard the entire training set and only keep the learned parameters. That is because a new test image can be simply forwarded through the function and classified based on the computed scores.
  • Lastly, note that classifying the test image involves a single matrix multiplication and addition, which is significantly faster than comparing a test image to all training images.
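The points above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming CIFAR-10-sized shapes (10 classes, 3072 pixel values) and random stand-in values for W, b and the image; none of these numbers come from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes, num_pixels = 10, 3072   # assumed CIFAR-10 sizes, for illustration only
W = 0.01 * rng.standard_normal((num_classes, num_pixels))  # one row per class
b = np.zeros(num_classes)            # one bias per class
x = rng.random(num_pixels)           # stand-in for a flattened test image

# One matrix multiplication and one addition give all 10 class scores.
scores = W.dot(x) + b

# The same scores, computed one classifier (one row of W) at a time.
row_by_row = np.array([W[k].dot(x) + b[k] for k in range(num_classes)])

# Classification needs only W and b, not the training set.
predicted_class = int(np.argmax(scores))
```

Both ways of computing the scores agree; the matrix form simply evaluates all rows at once.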





Notice that a linear classifier computes the score of a class as a weighted sum of all of its pixel values across all 3 of its color channels. Depending on precisely what values we set for these weights, the function has the capacity to like or dislike (depending on the sign of each weight) certain colors at certain positions in the image. For instance, you can imagine that the “ship” class might be more likely if there is a lot of blue on the sides of an image (which could likely correspond to water). You might expect that the “ship” classifier would then have a lot of positive weights across its blue channel weights (presence of blue increases score of ship), and negative weights in the red/green channels (presence of red/green decreases the score of ship).
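A toy version of the “ship” example can make the sign argument concrete. The sketch below uses hypothetical 2x2 RGB images and made-up weights (positive on blue, negative on red and green); it is only meant to show how the weighted sum reacts to color.

```python
import numpy as np

# Hypothetical 2x2 RGB images, stored as (height, width, channel) with values in [0, 1].
blue_image = np.zeros((2, 2, 3))
blue_image[:, :, 2] = 1.0            # every pixel is pure blue
red_image = np.zeros((2, 2, 3))
red_image[:, :, 0] = 1.0             # every pixel is pure red

# Made-up "ship" weights: negative for red and green, positive for blue.
w_ship = np.zeros((2, 2, 3))
w_ship[:, :, 0] = -1.0
w_ship[:, :, 1] = -1.0
w_ship[:, :, 2] = 1.0

def ship_score(image):
    """Weighted sum over every pixel and every color channel."""
    return float(np.sum(w_ship * image))

blue_score = ship_score(blue_image)  # 4 blue pixels times +1.0 -> 4.0
red_score = ship_score(red_image)    # 4 red pixels times -1.0 -> -4.0
```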




Interpretation of linear classifiers as template matching. Another interpretation for the weights W is that each row of W corresponds to a template (sometimes also called a prototype) for one of the classes. The score of each class for an image is then obtained by comparing each template with the image using an inner product (or dot product), one by one, to find the one that “fits” best. With this terminology, the linear classifier is doing template matching, where the templates are learned. Another way to think of it is that we are still effectively doing Nearest Neighbor, but instead of having thousands of training images we are only using a single image per class (although we will learn it, and it does not necessarily have to be one of the images in the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance.
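One way to see why a (negative) inner product can stand in for a distance is to expand the squared L2 distance between an image x and a template. The notation w_j for the j-th row of W, written as a column vector, is an assumption introduced here for illustration:

$$\|x - w_j\|_2^2 = \|x\|_2^2 - 2\, w_j^\top x + \|w_j\|_2^2$$

For a fixed image the first term is the same for every class, so when the templates have similar norms, a larger inner product corresponds to a smaller L2 distance. Learned templates need not have equal norms, so this is an intuition rather than an exact equivalence.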




Looking ahead a bit, a neural network will be able to develop intermediate neurons in its hidden layers that could detect specific car types (e.g. green car facing left, blue car facing front, etc.), and neurons on the next layer could combine these into a more accurate car score through a weighted sum of the individual car detectors.



The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin Δ. Notice that it’s sometimes helpful to anthropomorphise the loss functions as we did above: the SVM “wants” a certain outcome in the sense that the outcome would yield a lower loss (which is good).


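The Multiclass SVM loss for the i-th example is formalized as follows, writing s = f(xi, W) for the vector of class scores:

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$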

However, this will not necessarily be the case once we start to consider more complex forms of the score function f.

The threshold at zero $\max(0, -)$ function is often called the hinge loss. You’ll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which uses the form $\max(0, -)^2$ that penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but in some datasets the squared hinge loss can work better. This can be determined during cross-validation.
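The difference between the two forms is easy to see on a small example. The sketch below uses made-up scores and a margin of 10 chosen purely for illustration; `hinge_terms` is a hypothetical helper, not a library function.

```python
import numpy as np

def hinge_terms(scores, y, delta):
    """Per-class terms max(0, s_j - s_y + delta), with the true class zeroed out."""
    terms = np.maximum(0.0, scores - scores[y] + delta)
    terms[y] = 0.0
    return terms

scores = np.array([13.0, -7.0, 11.0])    # made-up class scores
y = 0                                     # index of the correct class
delta = 10.0                              # margin, chosen for illustration

terms = hinge_terms(scores, y, delta)     # [0, 0, 8]: only the third class violates the margin
hinge_loss = terms.sum()                  # 8.0
squared_hinge_loss = (terms ** 2).sum()   # 64.0
```

The single violated margin of 8 costs 8 under the hinge loss and 64 under the squared hinge loss, which is the "quadratically instead of linearly" behavior described above.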


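The loss function is called the “hinge loss”. For a sample xi with scores s, if the score of the correct class s_yi exceeds the score of an incorrect class s_j by at least a safety margin (the 1 in the formula), that term contributes zero loss. If the condition is not met, the loss for that term is exactly the amount by which it falls short.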



The loss function quantifies our unhappiness with predictions on the training set.

The Multiclass Support Vector Machine "wants" the score of the correct class to be higher than all other scores by at least a margin of delta. If any class has a score inside the red region (or higher), then there will be accumulated loss. Otherwise the loss will be zero. Our objective will be to find the weights that will simultaneously satisfy this constraint for all examples in the training data and give a total loss that is as low as possible.


Regularization. There is one bug with the loss function we presented above. Suppose that we have a dataset and a set of parameters W that correctly classify every example (i.e. all scores are such that all the margins are met, and Li = 0 for all i). The issue is that this set of W is not necessarily unique: there might be many similar W that correctly classify the examples. One easy way to see this is that if some parameters W correctly classify all examples (so loss is zero for each example), then any multiple of these parameters λW where λ > 1 will also give zero loss, because this transformation uniformly stretches all score magnitudes and hence also their absolute differences.
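The scaling argument can be checked numerically. The following is a small sketch with a hand-picked W and a single made-up example; `svm_loss_single` is a hypothetical helper that mirrors the per-example hinge loss with a margin of 1.

```python
import numpy as np

def svm_loss_single(x, y, W, delta=1.0):
    """Multiclass SVM (hinge) loss for one example; W has one row per class."""
    scores = W.dot(x)
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0
    return float(margins.sum())

# Hand-picked weights and one example whose correct class is 0.
W = np.array([[3.0, 0.0],
              [1.0, 0.0],
              [0.0, 0.0]])
x = np.array([1.0, 1.0])
y = 0

loss_W = svm_loss_single(x, y, W)          # scores [3, 1, 0]: all margins met
loss_2W = svm_loss_single(x, y, 2.0 * W)   # scores [6, 2, 0]: margins still met

gap_W = W.dot(x)[0] - W.dot(x)[1]                  # 2.0
gap_2W = (2.0 * W).dot(x)[0] - (2.0 * W).dot(x)[1] # 4.0, the gap is stretched
```

Both weight settings give zero loss, so the data loss alone cannot tell them apart.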


In other words, we wish to encode some preference for a certain set of weights W over others to remove this ambiguity. We can do so by extending the loss function with a regularization penalty R(W). The most common regularization penalty is the L2 norm that discourages large weights through an elementwise quadratic penalty over all parameters:

$$R(W) = \sum_k \sum_l W_{k,l}^2$$



In the expression above, we are summing up all the squared elements of W. Notice that the regularization function is not a function of the data; it is based only on the weights. Including the regularization penalty completes the full Multiclass Support Vector Machine loss, which is made up of two components: the data loss (which is the average loss Li over all examples) and the regularization loss. That is, the full Multiclass SVM loss becomes:

$$L = \frac{1}{N} \sum_i L_i + \lambda R(W)$$


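Or expanding this out in its full form:

$$L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \max\left(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta\right) + \lambda \sum_k \sum_l W_{k,l}^2$$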

Where N is the number of training examples. As you can see, we append the regularization penalty to the loss objective, weighted by a hyperparameter λ. There is no simple way of setting this hyperparameter and it is usually determined by cross-validation.
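As a rough sketch of how the two components combine, the code below averages the per-example hinge loss and adds an L2 penalty weighted by a regularization strength. The names `full_svm_loss` and `lam`, the tiny dataset, and the value 0.1 are all assumptions made for illustration.

```python
import numpy as np

def svm_loss_single(x, y, W, delta=1.0):
    """Multiclass SVM (hinge) loss for one example; W has one row per class."""
    scores = W.dot(x)
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0
    return float(margins.sum())

def full_svm_loss(X, y, W, lam, delta=1.0):
    """Data loss (average over examples) plus lam times the L2 penalty.

    X holds one example per column, y holds the correct class indices.
    """
    num_examples = X.shape[1]
    data_loss = sum(svm_loss_single(X[:, i], y[i], W, delta)
                    for i in range(num_examples)) / num_examples
    reg_loss = lam * np.sum(W * W)   # sum of all squared elements of W
    return data_loss + reg_loss

# Tiny made-up dataset: two examples (columns), two classes.
X = np.array([[2.0, 0.0],
              [0.0, 0.5]])
y = np.array([0, 1])
W = np.eye(2)

total = full_svm_loss(X, y, W, lam=0.1)   # data loss 0.25 + regularization 0.2
```

Note that the regularization term depends only on W, so it is computed once rather than per example.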


Here is the loss function (without regularization) implemented in Python, in both unvectorized and half-vectorized form:

import numpy as np

def L_i(x, y, W):
    """
    unvectorized version. Compute the multiclass svm loss for a single example (x,y)
    - x is a column vector representing an image (e.g. 3073 x 1 in CIFAR-10)
      with an appended bias dimension in the 3073-rd position (i.e. bias trick)
    - y is an integer giving index of correct class (e.g. between 0 and 9 in CIFAR-10)
    - W is the weight matrix (e.g. 10 x 3073 in CIFAR-10)
    """
    delta = 1.0 # see notes about delta later in this section
    scores = W.dot(x) # scores becomes of size 10 x 1, the scores for each class
    correct_class_score = scores[y]
    D = W.shape[0] # number of classes, e.g. 10
    loss_i = 0.0
    for j in range(D): # iterate over all classes
        if j == y:
            # skip for the true class to only loop over incorrect classes
            continue
        # accumulate loss for the i-th example
        loss_i += max(0, scores[j] - correct_class_score + delta)
    return loss_i

def L_i_vectorized(x, y, W):
    """
    A faster half-vectorized implementation. half-vectorized
    refers to the fact that for a single example the implementation contains
    no for loops, but there is still one loop over the examples (outside this function)
    """
    delta = 1.0
    scores = W.dot(x)
    # compute the margins for all classes in one vector operation
    margins = np.maximum(0, scores - scores[y] + delta)
    # on y-th position scores[y] - scores[y] canceled and gave delta. We want
    # to ignore the y-th position and only consider margin on max wrong class
    margins[y] = 0
    loss_i = np.sum(margins)
    return loss_i

def L(X, y, W):
    """
    fully-vectorized implementation :
    - X holds all the training examples as columns (e.g. 3073 x 50,000 in CIFAR-10)
    - y is array of integers specifying correct class (e.g. 50,000-D array)
    - W are weights (e.g. 10 x 3073)
    """
    # evaluate loss over all examples in X without using any for loops
    # left as exercise to reader in the assignment
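The fully vectorized version is left as an exercise in the notes. One possible sketch is shown below; it is not the assignment's reference solution. It assumes the same layout as above (examples as columns of X, one row of W per class) and, as a further assumption, returns the average loss over examples rather than the sum. The function name `svm_loss_vectorized` is hypothetical.

```python
import numpy as np

def svm_loss_vectorized(X, y, W, delta=1.0):
    """Average multiclass SVM loss over all columns of X, without Python loops."""
    num_examples = X.shape[1]
    cols = np.arange(num_examples)
    scores = W.dot(X)                  # shape (num_classes, num_examples)
    correct = scores[y, cols]          # correct-class score of each example
    # Subtracting a length-N vector broadcasts across every row of scores.
    margins = np.maximum(0.0, scores - correct + delta)
    margins[y, cols] = 0.0             # the true class contributes no loss
    return margins.sum() / num_examples

# Tiny made-up dataset: two examples (columns), two classes.
X = np.array([[2.0, 0.0],
              [0.0, 0.5]])
y = np.array([0, 1])
W = np.eye(2)

loss = svm_loss_vectorized(X, y, W)    # (0.0 + 0.5) / 2
```

The key step is the integer-array indexing `scores[y, cols]`, which picks one entry per column, so the whole computation stays inside NumPy.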

The takeaway from this section is that the SVM loss takes one particular approach to measuring how consistent the predictions on training data are with the ground truth labels. Additionally, making good predictions on the training set is equivalent to minimizing the loss.



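L1: the sum of the absolute values of all entries of the weight matrix W. L2: the sum of the squares of all entries of the weight matrix W. (These seem similar to the Manhattan and Euclidean distances, respectively.)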


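Because of the squaring, L2 regularization penalizes especially large deviations very heavily. The penalty for one data point with an error of 10 units (10² = 100) is far greater than the penalty for ten data points with an error of 1 unit each (10 × 1² = 10).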


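For w1: with L1 regularization, R(w1) = 1; with L2 regularization, R(w1) = 1.

For w2: with L1 regularization, R(w2) = 1; with L2 regularization, R(w2) = 0.25.

To minimize the loss function, w1 and w2 are equally acceptable under L1 regularization, whereas under L2 regularization w2 is better.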
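The comparison above can be reproduced numerically. The vectors used below (x = [1, 1, 1, 1], w1 = [1, 0, 0, 0], w2 = [0.25, 0.25, 0.25, 0.25]) are an assumption: they are not given in these notes, but they are consistent with the penalty values quoted above.

```python
import numpy as np

# Assumed example vectors, chosen to match the penalty values quoted above.
x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

def l1_penalty(w):
    """Sum of absolute values of all weights."""
    return float(np.sum(np.abs(w)))

def l2_penalty(w):
    """Sum of squares of all weights."""
    return float(np.sum(w ** 2))

score_w1, score_w2 = float(w1.dot(x)), float(w2.dot(x))   # both 1.0
l1_w1, l1_w2 = l1_penalty(w1), l1_penalty(w2)             # 1.0 and 1.0
l2_w1, l2_w2 = l2_penalty(w1), l2_penalty(w2)             # 1.0 and 0.25
```

Both weight vectors produce the same score, so the data loss cannot distinguish them. The L1 penalty is also tied, while the L2 penalty prefers the more spread-out w2.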


If you’ve heard of the binary Logistic Regression classifier before, the Softmax classifier is its generalization to multiple classes. Unlike the SVM, which treats the outputs f(xi, W) as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, the function mapping f(xi; W) = Wxi stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$


where we are using the notation fj to mean the j-th element of the vector of class scores f. As before, the full loss for the dataset is the mean of Li over all training examples together with a regularization term R(W). The function

$$f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

is called the softmax function: it takes a vector of arbitrary real-valued scores (in z) and squashes it to a vector of values between zero and one that sum to one.
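A minimal sketch of the softmax function follows. Subtracting the maximum score before exponentiating is a common numerical-stability measure added here as an assumption; it does not change the result, because the same constant factor cancels in the numerator and denominator. The input scores are made up.

```python
import numpy as np

def softmax(z):
    """Map a vector of real-valued scores to values in [0, 1] that sum to one."""
    shifted = z - np.max(z)       # same output, but avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

z = np.array([1.0, -2.0, 0.0])    # made-up class scores
probs = softmax(z)

# Without the shift, exp(1000) would overflow; with it the result is well defined.
large = softmax(np.array([1000.0, 1000.0]))
```

The largest score always receives the largest probability, and adding the same constant to every score leaves the output unchanged.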


Information theory view. The cross-entropy between a “true” distribution p and an estimated distribution q is defined as:

$$H(p, q) = -\sum_x p(x) \log q(x)$$
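The definition can be tried out on a small example. The sketch below assumes that the “true” distribution puts all of its mass on the correct class, in which case the cross-entropy reduces to the negative log of the probability that q assigns to that class. The probabilities in q are made up, and `cross_entropy` is a hypothetical helper.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); terms with p(x) = 0 are skipped."""
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

q = np.array([0.7, 0.2, 0.1])   # estimated class probabilities (made up)
p = np.array([0.0, 1.0, 0.0])   # assumed "true" distribution: all mass on class 1

loss = cross_entropy(p, q)      # equals -log(0.2)
```

Under this assumption the loss depends only on the probability assigned to the correct class, and it shrinks toward zero as that probability approaches one.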


Probabilistic interpretation. Looking at the expression, we see that

$$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$$

can be interpreted as the (normalized) probability assigned to the correct label yi given the image xi and parameterized by W. To see this, remember that the Softmax classi


