Andrew Ng's "Machine Learning": Logistic Classification

Logistic classification applies to binary classification problems: given features $X$, the label $y$ is either 0 or 1.

Model hypothesis

$$h_\theta(x) = g(\theta^Tx)$$
$$z = \theta^Tx$$
$$g(z) = \frac{1}{1 + e^{-z}}$$
The graph of $g(z)$, the sigmoid function, is an S-shaped curve that increases from 0 toward 1 and passes through 0.5 at $z = 0$.

$h_\theta(x)$ takes values in $(0, 1)$ and is interpreted as the probability that $y = 1$, that is, $h_\theta(x) = P(y = 1 \mid x, \theta) = 1 - P(y = 0 \mid x, \theta)$.
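The hypothesis can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the course: the names `sigmoid` and `hypothesis` are made up here, and the design matrix is assumed to carry a leading column of ones so that $\theta_0$ acts as the intercept.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x), evaluated for every row x of X."""
    return sigmoid(X @ theta)

# Two made-up samples; the leading 1 multiplies the intercept theta_0.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 2.0, 1.0]])
theta = np.array([0.0, 1.0, -1.0])

p = hypothesis(theta, X)
print(p)        # estimated P(y = 1 | x, theta) for each sample
print(1.0 - p)  # estimated P(y = 0 | x, theta) for each sample
```

The first sample has $\theta^Tx = 0$, so its probability sits exactly at the midpoint 0.5.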

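Decision boundary

As noted above, $h_\theta(x)$ is the probability that $y = 1$. When it is greater than 0.5 we predict $y = 1$; when it is less than 0.5 we predict $y = 0$; when it equals exactly 0.5 we adopt the convention $y = 1$. Since $g(z) \ge 0.5$ exactly when $z \ge 0$, we predict $y = 1$ when $\theta^Tx \ge 0$ and $y = 0$ when $\theta^Tx < 0$.
In the linear case, $\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2$, the decision boundary $\theta^Tx = 0$ is a straight line.

In the nonlinear case, $\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2$, the decision boundary is a curve such as a circle or an ellipse.

More complex boundaries can be obtained by making $\theta^Tx$ a higher-order polynomial in the features.
The goal of logistic classification is to find the decision boundary.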
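To make the decision rule concrete, here is a small sketch of the nonlinear case. The parameter values are hypothetical, chosen so that $\theta^Tx = -1 + x_1^2 + x_2^2$ and the boundary is the unit circle; `predict` and `quadratic_features` are illustrative names.

```python
import numpy as np

def predict(theta, X):
    """Predict 1 when theta^T x >= 0 (i.e. h_theta(x) >= 0.5), else 0."""
    return (X @ theta >= 0).astype(int)

def quadratic_features(x1, x2):
    """Map (x1, x2) to the feature vector [1, x1, x2, x1^2, x2^2]."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 ** 2, x2 ** 2])

# Hypothetical parameters: theta^T x = -1 + x1^2 + x2^2,
# so the decision boundary is the unit circle x1^2 + x2^2 = 1.
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

x1 = np.array([0.0, 0.5, 1.0, 2.0])
x2 = np.array([0.0, 0.5, 0.0, 0.0])
labels = predict(theta, quadratic_features(x1, x2))
print(labels)  # points inside the circle get 0, points on or outside get 1
```

The point $(1, 0)$ lies exactly on the boundary and is assigned $y = 1$, following the convention for the 0.5 case.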

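Cost function

For a single sample we have:
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

Combining the two cases:
$$\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$$

The cost function is therefore, with $m$ the number of training samples and $(x_i, y_i)$ the $i$-th sample:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m\left(y_i\log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i))\right)$$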
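A minimal NumPy sketch of the cost, taken here to be the average negative log-likelihood $J(\theta) = -\frac{1}{m}\sum_i (y_i\log h_\theta(x_i) + (1 - y_i)\log(1 - h_\theta(x_i)))$, which is the form the gradient derivation in this post starts from. The clipping constant and the toy data are assumptions added for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h))."""
    h = sigmoid(X @ theta)
    eps = 1e-12                      # assumption: clip h to avoid log(0)
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Made-up data: an intercept column of ones plus one feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

cost_at_zero = cost(np.zeros(2), X, y)           # h = 0.5 everywhere, so J = log(2)
cost_at_good = cost(np.array([0.0, 3.0]), X, y)  # a theta that separates the data
print(cost_at_zero, cost_at_good)
```

A parameter vector that classifies the samples confidently and correctly drives the cost toward zero, while an uninformative $\theta = 0$ costs $\log 2$.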

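Gradient descent

We minimize the cost function with gradient descent, repeatedly updating every $\theta_j$ simultaneously, where $\alpha$ is the learning rate:
$$\theta_j := \theta_j - \alpha\frac{\partial J(\theta)}{\partial \theta_j}$$

We now derive $\frac{\partial J(\theta)}{\partial \theta_j}$.
Before the derivation proper, we first differentiate $g(z) = \frac{1}{1 + e^{-z}}$:
$$\frac{\partial g(z)}{\partial z} = -1 \cdot \frac{1}{(1 + e^{-z})^2} \cdot e^{-z} \cdot (-1) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = g(z)(1 - g(z))$$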
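The identity $g'(z) = g(z)(1 - g(z))$ is easy to sanity-check numerically. The sketch below compares it with a central finite difference; the step size and the grid of test points are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# Central finite difference approximation of dg/dz.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
# Closed form derived above: g(z) * (1 - g(z)).
analytic = sigmoid(z) * (1.0 - sigmoid(z))

max_error = np.max(np.abs(numeric - analytic))
print(max_error)
```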
Now the derivation proper, where $x_i^j$ denotes the $j$-th feature of the $i$-th sample:

\begin{eqnarray}
\frac{\partial J(\theta)}{\partial \theta_j} &=& -\frac{1}{m}\sum_{i=1}^m\left(y_i\frac{1}{h_\theta(x_i)}h_\theta(x_i)(1-h_\theta(x_i))x_i^j+(1-y_i)\frac{1}{1-h_\theta(x_i)}\left(-h_\theta(x_i)(1-h_\theta(x_i))\right)x_i^j\right) \\
&=& -\frac{1}{m}\sum_{i=1}^m\left(y_i(1-h_\theta(x_i))x_i^j+(1-y_i)(-h_\theta(x_i))x_i^j\right) \\
&=& -\frac{1}{m}\sum_{i=1}^m\left(y_ix_i^j-y_ih_\theta(x_i)x_i^j-h_\theta(x_i)x_i^j+y_ih_\theta(x_i)x_i^j\right) \\
&=& \frac{1}{m}\sum_{i=1}^m(h_\theta(x_i)-y_i)x_i^j
\end{eqnarray}
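This gives the gradient descent update, with learning rate $\alpha$ and all $\theta_j$ updated simultaneously:
$$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x_i)-y_i)x_i^j$$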
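The derived gradient translates directly into a vectorized update. The following is a minimal sketch, assuming a fixed learning rate `alpha`, a fixed iteration count, and made-up toy data; none of these values come from the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """dJ/dtheta_j = (1/m) * sum_i (h(x_i) - y_i) * x_i^j, for all j at once."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Repeat theta := theta - alpha * gradient, updating every theta_j together."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= alpha * gradient(theta, X, y)
    return theta

# Made-up, non-separable data: an intercept column of ones plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0],
              [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
print(theta)
print((sigmoid(X @ theta) >= 0.5).astype(int))  # predictions on the training set
```

Because the whole vector `theta` is updated in one statement, the simultaneous-update requirement is satisfied automatically.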

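Optimization algorithms

More advanced optimizers, such as the conjugate gradient method, are beyond the scope of the course according to Prof. Ng. Their implementation is best left to numerical-computing specialists; what we need is to know how to call the interface.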
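As one possible way of "calling the interface", the sketch below hands the cost and gradient to SciPy's `scipy.optimize.minimize` with `method="CG"` (conjugate gradient). SciPy is a choice made here for illustration, not something the course prescribes; the helper name `cost_and_grad` and the toy data are also assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return (J(theta), gradient); minimize(..., jac=True) expects this pair."""
    h = np.clip(sigmoid(X @ theta), 1e-12, 1.0 - 1e-12)
    J = -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    grad = X.T @ (h - y) / len(y)
    return J, grad

# Made-up, non-separable data: an intercept column of ones plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0],
              [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# The optimizer picks its own step sizes, so no learning rate is needed.
result = minimize(cost_and_grad, x0=np.zeros(2), args=(X, y),
                  method="CG", jac=True)
print(result.x, result.fun)
```

The only things the caller supplies are the cost, its gradient, and a starting point, which is the division of labor the paragraph above describes.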

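Multi-class classification

Logistic classification is a binary classifier, but it extends to multiple classes through the 1-vs-all method: for each class, split the training set into that class versus everything else and train a logistic classifier on it. With $k$ classes this yields $k$ classifiers; a new input is passed to all $k$ of them, and the class with the highest probability is chosen.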
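The 1-vs-all procedure can be sketched from scratch as follows. This is an illustration under stated assumptions: each binary classifier is trained with plain gradient descent, the learning rate and iteration count are arbitrary, and the three-cluster data set is synthetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, n_iters=1000):
    """Plain gradient descent for a single binary logistic classifier."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= alpha * (X.T @ (sigmoid(X @ theta) - y)) / len(y)
    return theta

def fit_one_vs_all(X, y, n_classes):
    """Train one classifier per class: class k versus everything else."""
    return np.array([fit_binary(X, (y == k).astype(float))
                     for k in range(n_classes)])

def predict_one_vs_all(Theta, X):
    """Run all k classifiers and pick the class with the highest probability."""
    return np.argmax(sigmoid(X @ Theta.T), axis=1)

# Synthetic data: three well-separated 2-D clusters plus an intercept column.
rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]])
y = np.repeat(np.arange(3), 30)
points = centers[y] + 0.5 * rng.standard_normal((90, 2))
X = np.column_stack([np.ones(90), points])

Theta = fit_one_vs_all(X, y, n_classes=3)
accuracy = np.mean(predict_one_vs_all(Theta, X) == y)
print(Theta.shape, accuracy)
```

Each row of `Theta` holds the parameters of one binary classifier, so prediction is a single matrix product followed by an `argmax`.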

softmax

To be filled in later…

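Overfitting

When too many features are used, the cost function becomes very small on the training set, but the model generalizes poorly to new examples. Regularization counters overfitting by adding to the cost function $\lambda$ times the 2-norm or the 1-norm of $\theta$ (excluding $\theta_0$).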
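A sketch of the 2-norm variant is shown below. The penalty is written as $\frac{\lambda}{2m}\sum_{j \ge 1}\theta_j^2$; that scaling and the use of the squared norm are assumptions made here for a convenient gradient, and the data, learning rate, and function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost_and_grad(theta, X, y, lam):
    """Cross-entropy cost plus (lam / (2m)) * sum_{j>=1} theta_j^2, with gradient."""
    m = len(y)
    h = np.clip(sigmoid(X @ theta), 1e-12, 1.0 - 1e-12)
    penalized = np.r_[0.0, theta[1:]]   # theta_0 is excluded from the penalty
    J = -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    J += lam / (2.0 * m) * np.sum(penalized ** 2)
    grad = X.T @ (h - y) / m + lam / m * penalized
    return J, grad

def fit(X, y, lam, alpha=0.1, n_iters=3000):
    """Gradient descent on the regularized cost."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= alpha * regularized_cost_and_grad(theta, X, y, lam)[1]
    return theta

# Made-up, non-separable data: an intercept column of ones plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0],
              [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta_plain = fit(X, y, lam=0.0)
theta_reg = fit(X, y, lam=10.0)
print(theta_plain, theta_reg)  # the penalized weight is pulled toward zero
```

Raising `lam` shrinks the non-intercept weights, which is how regularization trades training fit for smoother decision boundaries.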

sklearn logistic

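The regularized cost functions implemented in sklearn are as follows, for the L2 and L1 penalties respectively, with labels $y_i \in \{-1, 1\}$, weights $w$, and intercept $c$:
$$\min_{w, c} \frac{1}{2}w^Tw + C\sum_{i=1}^n\log\left(\exp\left(-y_i(X_i^Tw + c)\right) + 1\right)$$
$$\min_{w, c} \|w\|_1 + C\sum_{i=1}^n\log\left(\exp\left(-y_i(X_i^Tw + c)\right) + 1\right)$$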

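C is the inverse of the regularization strength: the larger C is, the less the penalty on w matters and the weaker the regularization; conversely, the smaller C is, the stronger the regularization.
In LogisticRegression, the parameter C sets the regularization strength, penalty selects whether the 2-norm or the 1-norm is added to the cost function, and tol is the convergence tolerance of the optimizer.
Example code: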

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
X, y = digits.data, digits.target
X = StandardScaler().fit_transform(X)

# Turn the task into a binary one: digits 0-4 versus digits 5-9.
# (np.int was removed from NumPy; the built-in int does the same job.)
y = (y > 4).astype(int)

for i, C in enumerate((100, 1, 0.01)):
    # The default solver in recent scikit-learn releases does not support
    # the L1 penalty, so liblinear is selected explicitly for both models.
    clf_l1_LR = LogisticRegression(C=C, penalty='l1', tol=0.01, solver='liblinear')
    clf_l2_LR = LogisticRegression(C=C, penalty='l2', tol=0.01, solver='liblinear')
    clf_l1_LR.fit(X, y)
    clf_l2_LR.fit(X, y)

    coef_l1_LR = clf_l1_LR.coef_.ravel()
    coef_l2_LR = clf_l2_LR.coef_.ravel()

    # Percentage of coefficients that are exactly zero.
    sparsity_l1_LR = np.mean(coef_l1_LR == 0) * 100
    sparsity_l2_LR = np.mean(coef_l2_LR == 0) * 100

    print("C = %.2f" % C)
    print("Sparsity with L1 penalty: %.2f%%" % sparsity_l1_LR)
    print("Score with L1 penalty: %.4f" % clf_l1_LR.score(X, y))
    print("Sparsity with L2 penalty: %.2f%%" % sparsity_l2_LR)
    print("Score with L2 penalty: %.4f" % clf_l2_LR.score(X, y))

    # One row per value of C: L1 coefficients on the left, L2 on the right.
    l1_plot = plt.subplot(3, 2, 2 * i + 1)
    l2_plot = plt.subplot(3, 2, 2 * (i + 1))
    if i == 0:
        l1_plot.set_title("L1 penalty")
        l2_plot.set_title("L2 penalty")

    l1_plot.imshow(np.abs(coef_l1_LR.reshape(8, 8)), interpolation='nearest',
                   cmap='binary', vmax=1, vmin=0)
    l2_plot.imshow(np.abs(coef_l2_LR.reshape(8, 8)), interpolation='nearest',
                   cmap='binary', vmax=1, vmin=0)
    plt.text(-8, 3, "C = %.2f" % C)

    l1_plot.set_xticks(())
    l1_plot.set_yticks(())
    l2_plot.set_xticks(())
    l2_plot.set_yticks(())

plt.show()

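Result: a 3 × 2 grid of 8 × 8 coefficient heat maps, with the L1 penalty in the left column and the L2 penalty in the right column, one row per value of C.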

Multi-class example
Multinomial logistic regression (softmax regression) versus 1-vs-rest:

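import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three Gaussian blobs, skewed by a linear transformation.
centers = [[-5, 0], [0, 1.5], [5, -1]]
X, y = make_blobs(n_samples=1000, centers=centers, random_state=40)
transformation = [[0.4, 0.2], [-0.4, 1.2]]
X = np.dot(X, transformation)

for multi_class in ('multinomial', 'ovr'):
    # The multi_class argument is deprecated in recent scikit-learn releases,
    # so 1-vs-rest is built with OneVsRestClassifier here. With the 'sag'
    # solver, recent releases fit the multinomial model by default; on older
    # releases, pass multi_class=multi_class to LogisticRegression instead.
    base = LogisticRegression(solver='sag', max_iter=100, random_state=42)
    if multi_class == 'ovr':
        clf = OneVsRestClassifier(base).fit(X, y)
        coef = np.vstack([est.coef_ for est in clf.estimators_])
        intercept = np.hstack([est.intercept_ for est in clf.estimators_])
    else:
        clf = base.fit(X, y)
        coef = clf.coef_
        intercept = clf.intercept_
    print("training score : %.3f (%s)" % (clf.score(X, y), multi_class))

    # Evaluate the classifier on a grid to draw the decision surface.
    h = .02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure()
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
    plt.title("Decision surface of LogisticRegression (%s)" % multi_class)
    plt.axis('tight')

    # Plot the training points, one color per class.
    colors = 'bry'
    for i, color in zip(clf.classes_, colors):
        mask = y == i
        plt.scatter(X[mask, 0], X[mask, 1], color=color, edgecolor='black', s=20)

    xmin, xmax = plt.xlim()
    ymin, ymax = plt.ylim()

    # Draw the line coef[c] . x + intercept[c] = 0 for each class c.
    def plot_hyperplane(c, color):
        def line(x0):
            return (-(x0 * coef[c, 0]) - intercept[c]) / coef[c, 1]
        plt.plot([xmin, xmax], [line(xmin), line(xmax)], ls="--", color=color)

    for i, color in zip(clf.classes_, colors):
        plot_hyperplane(i, color)

plt.show()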
