Based on 《机器学习实战》 (Hands-On Machine Learning), 2nd edition.

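Most of the topics discussed in this chapter are essential for understanding, building, and training neural networks.

The goal is to understand how these systems work. That understanding helps you quickly home in on an appropriate model, the right training algorithm, and a good set of hyperparameters. It also lets you debug issues and perform error analysis more efficiently later on.

We will start with one of the simplest models, the linear regression model, and look at two very different ways to train it:

Using a "closed-form" equation that directly computes the model parameters that best fit the training set (that is, the parameters that minimize the cost function over the training set).
Using an iterative optimization approach, gradient descent (GD), which gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same parameters as the first method. We will also look at several variants of gradient descent: batch gradient descent, mini-batch gradient descent, and stochastic gradient descent.

Next we turn to polynomial regression, a more complex model that can fit nonlinear datasets. Because this model has more parameters than linear regression, it is more prone to overfitting the training data, so we will use learning curves to detect whether that is happening. We will then look at several regularization techniques that reduce the risk of overfitting the training data.

Finally, we will look at two models commonly used for classification tasks: Logistic regression and Softmax regression.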

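1. Linear regression

1.1. Formula: linear regression model prediction

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

$\hat{y}$: the predicted value
$n$: the number of features
$x_i$: the i-th feature value
$\theta_j$: the j-th model parameter (the bias term $\theta_0$ and the feature weights $\theta_1, \theta_2, \cdots, \theta_n$)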

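1.2. Formula: linear regression model prediction (vectorized form)

$$\hat{y} = h_\theta(\vec{x}) = \vec\theta \cdot \vec{x}$$

$\vec\theta$: the model's parameter vector, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$
$\vec{x}$: the instance's feature vector, containing $x_0$ to $x_n$, with $x_0$ always equal to 1
$\vec\theta \cdot \vec{x}$: the dot product of the two vectors
$h_\theta$: the hypothesis function, using the model parameters $\vec\theta$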
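As a small illustration of why the two forms agree, the sketch below evaluates both for a single instance. The parameter and feature values are made up for this example and do not come from the text.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
theta = np.array([4.0, 3.0, -2.0])   # theta_0 (bias), theta_1, theta_2
features = np.array([1.5, 0.5])      # x_1, x_2 for one instance

# Explicit sum form: theta_0 + theta_1 * x_1 + ... + theta_n * x_n
y_hat_sum = theta[0] + sum(t * x for t, x in zip(theta[1:], features))

# Vectorized form: prepend x_0 = 1, then take the dot product.
x = np.concatenate(([1.0], features))
y_hat_dot = theta.dot(x)

print(y_hat_sum, y_hat_dot)  # both are 4 + 3*1.5 - 2*0.5 = 7.5
```

Prepending the constant $x_0 = 1$ is what lets the bias term be handled by the same dot product as the feature weights.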

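1.3. Formula: MSE cost function for a linear regression model

A common performance measure for regression models is the root mean square error (RMSE). To train a linear regression model, you therefore need to find the value of $\vec\theta$ that minimizes the RMSE. In practice it is simpler to minimize the mean squared error (MSE), and the result is the same, because the value that minimizes a non-negative function also minimizes its square root.

The MSE of a linear regression hypothesis $h_\theta$ on a training set $X$ with $m$ instances is computed as:

$$\mathrm{MSE}(X, h_\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \vec\theta^T \vec{x}^{(i)} - y^{(i)} \right)^2$$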
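The MSE can be sketched in a few lines of NumPy. The helper name `mse` and the tiny data set are assumptions made for this illustration; the data follow y = 1 + 2x exactly so the results can be checked by hand.

```python
import numpy as np

def mse(X_b, y, theta):
    """Mean squared error of the linear model theta on (X_b, y).

    X_b must already contain the x_0 = 1 column.
    """
    errors = X_b.dot(theta) - y
    return float(np.mean(errors ** 2))

# Hand-checkable data (illustrative values): y = 1 + 2x exactly.
X_b = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])

perfect = np.array([[1.0], [2.0]])   # the true parameters
off = np.array([[0.0], [2.0]])       # every prediction is 1 too low

print(mse(X_b, y, perfect))          # 0.0
print(mse(X_b, y, off))              # 1.0
print(np.sqrt(mse(X_b, y, off)))     # RMSE, also 1.0
```

Because the square root is monotonic, whichever parameter vector has the lower MSE also has the lower RMSE.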

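1.4. Formula: the Normal Equation

To find the value of $\vec\theta$ that minimizes the cost function, there is a closed-form solution, that is, a mathematical equation that gives the result directly. This is the Normal Equation:

$$\hat{\vec\theta} = (X^T X)^{-1} X^T y$$

$\hat{\vec\theta}$: the value of $\vec\theta$ that minimizes the cost function
$y$: the vector of target values, containing $y^{(1)}$ to $y^{(m)}$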

1.5. Testing the formula above

Generate some random, roughly linear data to test the formula above:

import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1) * 2
y = 4 + 3 * X + np.random.randn(100, 1)

import matplotlib.pyplot as plt

plt.plot(X, y, "b.")
plt.xlabel("X", fontsize=18)
plt.ylabel("y", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.show()

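Now compute $\hat{\vec\theta}$ with the Normal Equation, using the inv() function from NumPy's linear algebra module np.linalg to invert the matrix and the dot() method for matrix multiplication:

X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
theta_best  # parameters that minimize the MSE cost function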

array([[4.21509616],[2.77011339]])

From the formula used to generate y above, we would hope for $\theta_0 = 4$ and $\theta_1 = 3$, but we got $\theta_0 = 4.215$ and $\theta_1 = 2.770$. The noise makes it impossible to recover the exact parameters of the original function.
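One way to confirm that the noise is what causes the gap is to solve the same problem with and without it. The sketch below assumes the same data-generating formula as above; the helper name `normal_equation` is introduced only for this illustration.

```python
import numpy as np

def normal_equation(X, y):
    """Solve for theta with the Normal Equation, adding the x_0 = 1 column."""
    X_b = np.c_[np.ones((len(X), 1)), X]
    return np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

rng = np.random.RandomState(42)
X = rng.rand(100, 1) * 2
noise = rng.randn(100, 1)

theta_clean = normal_equation(X, 4 + 3 * X)          # no noise
theta_noisy = normal_equation(X, 4 + 3 * X + noise)  # same data plus noise

print(theta_clean.ravel())  # matches [4, 3] up to floating-point error
print(theta_noisy.ravel())  # pulled away from [4, 3] by the noise
```

With more instances, the noisy estimate would be expected to move closer to the true parameters, since the noise averages out.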

Now you can make predictions using $\hat{\theta}$:

X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
y_predict  # predicted y for the two x values

array([[4.21509616],[9.75532293]])

plt.plot(X, y, "b.")
plt.plot(X_new, y_predict, "r-")
plt.axis([0, 2, 0, 15])
plt.show()

1.6. Using Scikit-Learn

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_

(array([4.21509616]), array([[2.77011339]]))

lin_reg.predict(X_new)

array([[4.21509616],[9.75532293]])

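The LinearRegression class is based on the scipy.linalg.lstsq() function (least squares). NumPy's np.linalg.lstsq() solves the same least-squares problem and can be called directly: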

theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
theta_best_svd

array([[4.21509616],[2.77011339]])

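This function computes $\hat{\theta} = X^{+}y$, where $X^{+}$ is the pseudoinverse of $X$. The pseudoinverse itself is computed using a standard matrix factorization technique called singular value decomposition (SVD). You can compute it directly with np.linalg.pinv():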

np.linalg.pinv(X_b).dot(y)

array([[4.21509616],[2.77011339]])
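To make the role of SVD concrete, the following sketch builds the pseudoinverse by hand from `np.linalg.svd()`. The cutoff `tol` for treating a singular value as zero is an assumption of this example, not necessarily the threshold NumPy uses internally.

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.rand(100, 1) * 2
y = 4 + 3 * X + rng.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

# Thin SVD: X_b = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X_b, full_matrices=False)

# Pseudoinverse: invert the non-negligible singular values, leave the rest at zero.
tol = max(X_b.shape) * np.finfo(float).eps * s.max()  # assumed cutoff
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]
X_pinv = Vt.T.dot(np.diag(s_inv)).dot(U.T)

theta_svd = X_pinv.dot(y)
print(theta_svd.ravel())
```

Because small singular values are zeroed rather than inverted, this construction stays defined even when $X^T X$ is not invertible, which is a case the Normal Equation cannot handle.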

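1.7. Computational complexity

The Normal Equation computes the inverse of $X^T X$, which is an (n+1)×(n+1) matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about $O(n^{2.4})$ to $O(n^3)$, depending on the implementation. In other words, if you double the number of features, the computation time is multiplied by roughly $2^{2.4} \approx 5.3$ to $2^3 = 8$.

The SVD approach used by Scikit-Learn's LinearRegression class is about $O(n^2)$: doubling the number of features multiplies the computation time by roughly 4.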

2. Gradient descent


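2.1. Batch gradient descent

To implement gradient descent, you need to compute the gradient of the cost function with respect to each model parameter $\theta_j$, in other words the partial derivative with respect to $\theta_j$:

Formula:

$$\frac{\partial}{\partial \theta_j} \mathrm{MSE}(\theta) = \frac{2}{m} \sum_{i=1}^{m} \left( \vec\theta^T \vec{x}^{(i)} - y^{(i)} \right) x_j^{(i)}$$

Formula (vectorized):

$$\nabla_{\theta} \mathrm{MSE}(\theta) = \frac{2}{m} X^T (X \vec\theta - y)$$

Once you have the gradient vector, which points uphill, go downhill by subtracting $\nabla_{\theta} \mathrm{MSE}(\theta)$ from $\theta$. This is where the learning rate η comes in: multiply the gradient vector by η to determine the size of the downhill step:

$$\theta^{(\text{next step})} = \theta - \eta \nabla_{\theta} \mathrm{MSE}(\theta)$$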
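A common sanity check for a hand-derived gradient is to compare it against finite differences. The sketch below does this for the vectorized MSE gradient and then takes a single descent step; the function names and the random data are assumptions made for this illustration.

```python
import numpy as np

def mse(theta, X_b, y):
    return float(np.mean((X_b.dot(theta) - y) ** 2))

def mse_gradient(theta, X_b, y):
    """Vectorized gradient: (2/m) * X^T (X theta - y)."""
    m = len(X_b)
    return 2 / m * X_b.T.dot(X_b.dot(theta) - y)

def numerical_gradient(theta, X_b, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (mse(theta + step, X_b, y) - mse(theta - step, X_b, y)) / (2 * eps)
    return grad

rng = np.random.RandomState(0)
X_b = np.c_[np.ones((50, 1)), rng.rand(50, 1) * 2]
y = 4 + 3 * X_b[:, 1:] + rng.randn(50, 1)
theta = rng.randn(2, 1)

analytic = mse_gradient(theta, X_b, y)
numeric = numerical_gradient(theta, X_b, y)
print(np.abs(analytic - numeric).max())  # tiny: the two gradients agree

# One gradient descent step with a small learning rate lowers the cost.
eta = 0.1
theta_next = theta - eta * analytic
print(mse(theta, X_b, y), "->", mse(theta_next, X_b, y))
```

If the two gradients disagreed noticeably, that would point to a sign, scaling, or indexing mistake in the analytic formula.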

Implementation:

eta = 0.1  # learning rate
n_iterations = 1000  # number of gradient descent steps
m = len(X)  # number of instances in X (100)
s = {}  # log of every step, keyed by step number
theta = np.random.rand(2, 1)  # random initialization
s[0] = "{} {}".format(theta[0], theta[1])
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
    s[iteration + 1] = "({:>2}){} {}\t-> {} {}".format(iteration, gradients[0], gradients[1], theta[0], theta[1])

theta

array([[4.21509616],[2.77011339]])

X_new_b.dot(theta)

array([[4.21509616],[9.75532293]])

# Show the first few steps: (iteration)gradients -> updated theta
for i in range(20):
    print(s[i])

[0.7948113] [0.50263709]
( 0)[-11.10506446] [-12.03209351]   -> [1.90531775] [1.70584644]
( 1)[-6.6211481] [-6.97223824]  -> [2.56743256] [2.40307027]
( 2)[-3.98563362] [-4.00520407] -> [2.96599592] [2.80359067]
( 3)[-2.43523896] [-2.26654005] -> [3.20951982] [3.03024468]
( 4)[-1.52191778] [-1.24882174] -> [3.3617116] [3.15512685]
( 5)[-0.98266545] [-0.65419726] -> [3.45997814] [3.22054658]
( 6)[-0.66309598] [-0.30783247] -> [3.52628774] [3.25132983]
( 7)[-0.47258202] [-0.1071039]  -> [3.57354594] [3.26204022]
( 8)[-0.35792234] [0.00822477]  -> [3.60933818] [3.26121774]
( 9)[-0.28788473] [0.07350896]  -> [3.63812665] [3.25386684]
(10)[-0.24413278] [0.10949922]  -> [3.66253993] [3.24291692]
(11)[-0.21589999] [0.12837323]  -> [3.68412993] [3.2300796]
(12)[-0.19686344] [0.13727654]  -> [3.70381627] [3.21635195]
(13)[-0.18330867] [0.14040094]  -> [3.72214714] [3.20231185]
(14)[-0.17305246] [0.14020451]  -> [3.73945238] [3.1882914]
(15)[-0.16481055] [0.13812767]  -> [3.75593344] [3.17447863]
(16)[-0.15782643] [0.13501363]  -> [3.77171608] [3.16097727]
(17)[-0.15165347] [0.13135508]  -> [3.78688143] [3.14784176]
(18)[-0.14602703] [0.12743903]  -> [3.80148413] [3.13509786]

theta_path_bgd = []

def plot_gradient_descent(theta, eta, theta_path=None):
    m = len(X_b)
    plt.plot(X, y, "b.")
    n_iterations = 1000
    for iteration in range(n_iterations):
        if iteration < 10:  # draw the first ten lines
            y_predict = X_new_b.dot(theta)
            style = "b-" if iteration > 0 else "r--"  # first line is red dashed, the rest are solid blue
            plt.plot(X_new, y_predict, style)
        gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
        if theta_path is not None:
            theta_path.append(theta)
    plt.xlabel("$x_1$", fontsize=18)
    plt.axis([0, 2, 0, 15])
    plt.title(r"$\eta = {}$".format(eta), fontsize=16)

np.random.seed(42)
theta = np.random.randn(2,1)

plt.figure(figsize=(10,4))
plt.subplot(131); plot_gradient_descent(theta, eta=0.02)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(132); plot_gradient_descent(theta, eta=0.1, theta_path=theta_path_bgd)
plt.subplot(133); plot_gradient_descent(theta, eta=0.5)
plt.show()

2.2. Stochastic gradient descent

Its advantage is speed: each step picks a single random instance to compute the gradient (rather than using all instances).

np.random.seed(42)
theta_path_sgd = []
m = len(X_b)
n_epochs = 50   # number of epochs
t0, t1 = 5, 50  # learning schedule hyperparameters
s = dict()

def learning_schedule(t):
    """Learning schedule: the learning rate at step t."""
    return t0 / (t + t1)

theta = np.random.randn(2,1)  # random initialization
s[0] = "{} {}".format(theta[0], theta[1])
for epoch in range(n_epochs):
    for i in range(m):
        if epoch == 0 and i < 20:  # draw only the first 20 lines
            y_predict = X_new_b.dot(theta)
            style = "b-" if i > 0 else "r--"
            plt.plot(X_new, y_predict, style)
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]  # x of one randomly chosen instance
        yi = y[random_index:random_index+1]    # y of the same instance
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)  # gradually reduce the learning rate
        theta = theta - eta * gradients
        theta_path_sgd.append(theta)
        s[epoch * m + i + 1] = "({}, {:>2} -> {:.5f}){} {}\t-> {} {}".format(epoch, i, eta, gradients[0], gradients[1], theta[0], theta[1])

plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.show()

theta

array([[4.21076011],[2.74856079]])

X_new_b.dot(theta)

array([[4.21076011],[9.7078817 ]])

# Show the first 20 steps: (epoch, i -> eta)gradients -> updated theta
for i in range(20):
    print(s[i])

[0.49671415] [-0.1382643]
(0,  0 -> 0.10000)[-6.86014779] [-2.72643789]    -> [1.18272893] [0.13437949]
(0,  1 -> 0.09804)[-8.52324201] [-6.62558121]    -> [2.01834089] [0.78394627]
(0,  2 -> 0.09615)[-9.97915432] [-12.21154892]   -> [2.97787496] [1.95813367]
(0,  3 -> 0.09434)[-1.04064913] [-0.68869748]    -> [3.07604941] [2.02310513]
(0,  4 -> 0.09259)[-7.01608648] [-10.23796008]   -> [3.72568705] [2.9710644]
(0,  5 -> 0.09091)[-1.13142499] [-1.59951213]    -> [3.82854386] [3.11647459]
(0,  6 -> 0.08929)[-0.5145743] [-0.72746125] -> [3.874488] [3.18142649]
(0,  7 -> 0.08772)[1.85301988] [2.36281333]  -> [3.71194239] [2.97416216]
(0,  8 -> 0.08621)[-2.24163244] [-0.48370585]    -> [3.90518657] [3.01586094]
(0,  9 -> 0.08475)[0.30279077] [0.22186197]  -> [3.87952633] [2.99705908]
(0, 10 -> 0.08333)[-0.43307984] [-0.63402364]    -> [3.91561632] [3.04989438]
(0, 11 -> 0.08197)[-0.66303807] [-0.18497948]    -> [3.9699637] [3.06505663]
(0, 12 -> 0.08065)[-0.01489278] [-0.0279835] -> [3.97116473] [3.06731337]
(0, 13 -> 0.07937)[0.79632729] [1.51415949]  -> [3.90796416] [2.94714198]
(0, 14 -> 0.07812)[1.32249284] [1.68633038]  -> [3.8046444] [2.81539742]
(0, 15 -> 0.07692)[-1.23311237] [-0.11455716]    -> [3.8994992] [2.82420951]
(0, 16 -> 0.07576)[-5.19616514] [-1.01504087]    -> [4.29314807] [2.90110654]
(0, 17 -> 0.07463)[0.80823315] [1.53679763]  -> [4.23283217] [2.78642015]
(0, 18 -> 0.07353)[2.62261168] [1.87125088]  -> [4.03999307] [2.64882817]
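The learning rates in the log above come straight from the learning schedule. As a minimal sketch, assuming the same `t0, t1 = 5, 50` hyperparameters, the schedule can be evaluated on its own to see how quickly the step size shrinks over the 50 epochs of 100 steps each:

```python
t0, t1 = 5, 50  # same learning schedule hyperparameters as the SGD code

def learning_schedule(t):
    """Learning rate at step t: starts at t0 / t1 and decays toward 0."""
    return t0 / (t + t1)

m, n_epochs = 100, 50
for t in (0, 1, 18, m, n_epochs * m - 1):
    print(t, round(learning_schedule(t), 5))
# 0 0.1
# 1 0.09804
# 18 0.07353
# 100 0.03333
# 4999 0.00099
```

The large early steps let the parameters move quickly toward the minimum, while the small late steps damp the bouncing caused by using one random instance per step.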

2.3. Mini-batch gradient descent

The main advantage of mini-batch gradient descent over stochastic gradient descent is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

theta_path_mgd = []
n_iterations = 50  # number of epochs
minibatch_size = 20
s = dict()

np.random.seed(42)
theta = np.random.randn(2,1)  # random initialization

t0, t1 = 200, 1000  # learning schedule hyperparameters
def learning_schedule(t):
    return t0 / (t + t1)

t = 0
s[0] = "{} {}".format(theta[0], theta[1])
for epoch in range(n_iterations):
    # shuffle the training set at the start of each epoch
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_b_shuffled[i:i+minibatch_size]
        yi = y_shuffled[i:i+minibatch_size]
        gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)
        theta = theta - eta * gradients
        theta_path_mgd.append(theta)
        s[t] = "({}, {:>2} -> {:.5f}){} {}\t-> {} {}".format(epoch, i, eta, gradients[0], gradients[1], theta[0], theta[1])

theta

array([[4.25214635],[2.7896408 ]])

# Show the first 20 steps: (epoch, batch start -> eta)gradients -> updated theta
for i in range(20):
    print(s[i])

[0.49671415] [-0.1382643]
(0,  0 -> 0.19980)[-13.79245989] [-16.28677942]  -> [3.25245039] [3.11583748]
(0, 20 -> 0.19960)[-1.60389476] [-1.03975953]    -> [3.57258907] [3.32337431]
(0, 40 -> 0.19940)[0.69601284] [0.98286925]  -> [3.43380286] [3.12738842]
(0, 60 -> 0.19920)[-1.07936129] [-0.37991824]    -> [3.64881507] [3.20306935]
(0, 80 -> 0.19900)[-0.32903101] [-0.00213246]    -> [3.71429388] [3.20349372]
(1,  0 -> 0.19881)[-0.64125174] [-0.2601402] -> [3.84177931] [3.25521145]
(1, 20 -> 0.19861)[0.34496049] [0.39959046]  -> [3.7732668] [3.1758489]
(1, 40 -> 0.19841)[-0.41821607] [-0.17210907]    -> [3.85624618] [3.20999752]
(1, 60 -> 0.19822)[0.48598075] [1.058177]    -> [3.759917] [3.00024985]
(1, 80 -> 0.19802)[-0.29018363] [-0.16202744]    -> [3.8173791] [3.03233449]
(2,  0 -> 0.19782)[-0.56934643] [-0.1323744] -> [3.93000945] [3.05852132]
(2, 20 -> 0.19763)[0.56698072] [0.62460951]  -> [3.81795793] [2.9350807]
(2, 40 -> 0.19743)[-0.60787495] [-0.14207289]    -> [3.93797273] [2.96313063]
(2, 60 -> 0.19724)[0.06525087] [0.0170397]   -> [3.92510273] [2.95976974]
(2, 80 -> 0.19704)[-0.65616741] [-0.64608421]    -> [4.0543968] [3.08707698]
(3,  0 -> 0.19685)[0.37921206] [0.60912079]  -> [3.97974876] [2.96717131]
(3, 20 -> 0.19666)[-0.5824764] [-0.59470588] -> [4.09429672] [3.08412429]
(3, 40 -> 0.19646)[0.19575307] [0.36578361]  -> [4.05583836] [3.0122611]
(3, 60 -> 0.19627)[0.20283418] [0.37892469]  -> [4.01602792] [2.93788923]

theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)

plt.figure(figsize=(14,8))
plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic")  # stochastic gradient descent
plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=2, label="Mini-batch")  # mini-batch gradient descent
plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=3, label="Batch")       # batch gradient descent
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$   ", fontsize=20, rotation=0)
plt.axis([2.5, 4.5, 2.3, 3.9])
plt.show()

# Last 20 steps of each path
plt.figure(figsize=(14,8))
plt.plot(theta_path_sgd[-20:, 0], theta_path_sgd[-20:, 1], "r-s", linewidth=1, label="Stochastic")  # stochastic gradient descent
plt.plot(theta_path_mgd[-20:, 0], theta_path_mgd[-20:, 1], "g-+", linewidth=2, label="Mini-batch")  # mini-batch gradient descent
plt.plot(theta_path_bgd[-20:, 0], theta_path_bgd[-20:, 1], "b-o", linewidth=3, label="Batch")       # batch gradient descent
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$   ", fontsize=20, rotation=0)
# plt.axis([2.5, 4.5, 2.3, 3.9])
plt.show()

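From the plots above, mini-batch gradient descent gives the best result in this experiment.

2.4. Exploration: batch gradient descent (constant vs. decaying learning rate)

The experiments below show that:

When the learning rate is within a reasonable range, gradually decreasing it has very little effect on batch gradient descent and can be ignored.
When the learning rate starts above the reasonable range, gradually decreasing it does matter: it brings the learning rate back into the range where gradient descent converges.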
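The "reasonable range" can be made concrete for this cost function. For a quadratic cost such as the MSE, batch gradient descent with a constant learning rate converges only when η is below 2 divided by the largest eigenvalue of the Hessian, which here is $\frac{2}{m} X^T X$. The sketch below computes that bound for the linear data set used in this section (regenerated with the same seed) and checks it against η = 0.1 and η = 0.5; the helper `batch_gd` and the zero initialization are assumptions of this example.

```python
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1) * 2
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

# Hessian of the MSE cost; a constant learning rate converges only if eta < 2 / lambda_max.
hessian = 2 / m * X_b.T.dot(X_b)
eta_max = 2 / np.linalg.eigvalsh(hessian).max()
print("largest stable learning rate:", eta_max)

def batch_gd(eta, n_iterations=1000):
    """Plain batch gradient descent with a constant learning rate."""
    theta = np.zeros((2, 1))
    for _ in range(n_iterations):
        gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
    return theta

theta_exact = np.linalg.pinv(X_b).dot(y)
for eta in (0.1, 0.5):
    theta = batch_gd(eta)
    print(eta, "below bound:", eta < eta_max, "converged:", np.allclose(theta, theta_exact))
```

This is consistent with the second observation: a schedule that starts at 0.5 and decays eventually drops below the bound, after which the iterations can converge.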

gradients_bgd_1 = []
theta_bgd_1 = []

def plot_gradient_descent_1(theta, eta, gradients_bgd, theta_bgd):
    """Batch gradient descent with a constant learning rate."""
    m = len(X)
    n_iterations = 1000
    for iteration in range(n_iterations):
        gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
        gradients_bgd.append(gradients)
        theta_bgd.append(theta)

gradients_bgd_2 = []
theta_bgd_2 = []

def plot_gradient_descent_2(theta, t0, t1, gradients_bgd, theta_bgd):
    """Batch gradient descent with a gradually decreasing learning rate."""
    t = 0
    m = len(X)
    n_iterations = t1
    minibatch_size = t0
    for epoch in range(n_iterations):
        for i in range(0, m, minibatch_size):
            t += 1
            gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)  # still uses the full training set
            eta = t0 / (t + t1)
            theta = theta - eta * gradients
            gradients_bgd.append(gradients)
            theta_bgd.append(theta)

np.random.seed(42)
theta = np.random.randn(2,1)
plot_gradient_descent_1(theta, 0.1, gradients_bgd_1, theta_bgd_1)

np.random.seed(42)
theta = np.random.randn(2,1)
plot_gradient_descent_2(theta, 100, 1000, gradients_bgd_2, theta_bgd_2)

gradients_bgd_1 = np.array(gradients_bgd_1)
theta_bgd_1 = np.array(theta_bgd_1)
gradients_bgd_2 = np.array(gradients_bgd_2)
theta_bgd_2 = np.array(theta_bgd_2)

plt.figure(figsize=(20, 16))

plt.subplot(221)
plt.plot(theta_bgd_1[:, 0], theta_bgd_1[:, 1], "r-s", linewidth=1, label="Normal")            # constant learning rate
plt.plot(theta_bgd_2[:, 0], theta_bgd_2[:, 1], "b-o", linewidth=1, label="Gradually reduce")  # decaying learning rate
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$   ", fontsize=20, rotation=0)

plt.subplot(222)
plt.plot(theta_bgd_1[:, 0], theta_bgd_1[:, 1], "r-s", linewidth=1, label="Normal")            # constant learning rate
plt.plot(theta_bgd_2[:, 0], theta_bgd_2[:, 1], "b-o", linewidth=1, label="Gradually reduce")  # decaying learning rate
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$   ", fontsize=20, rotation=0)
plt.axis([3.5, 4.3, 2.7, 3.3])  # zoom in

plt.subplot(223)
plt.plot(gradients_bgd_1[:, 0], gradients_bgd_1[:, 1], "r-s", linewidth=1, label="Normal")            # constant learning rate
plt.plot(gradients_bgd_2[:, 0], gradients_bgd_2[:, 1], "b-o", linewidth=1, label="Gradually reduce")  # decaying learning rate
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\nabla_{\theta} MSE_0$", fontsize=20)
plt.ylabel(r"$\nabla_{\theta} MSE_1$   ", fontsize=20, rotation=0)

plt.subplot(224)
plt.plot(gradients_bgd_1[:, 0], gradients_bgd_1[:, 1], "r-s", linewidth=1, label="Normal")            # constant learning rate
plt.plot(gradients_bgd_2[:, 0], gradients_bgd_2[:, 1], "b-o", linewidth=1, label="Gradually reduce")  # decaying learning rate
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\nabla_{\theta} MSE_0$", fontsize=20)
plt.ylabel(r"$\nabla_{\theta} MSE_1$   ", fontsize=20, rotation=0)
plt.axis([-0.2, 0.05, -0.1, 0.2])  # zoom in

plt.show()

gradients_bgd_3 = []
theta_bgd_3 = []
np.random.seed(42)
theta = np.random.randn(2,1)
plot_gradient_descent_1(theta, 0.5, gradients_bgd_3, theta_bgd_3)

gradients_bgd_4 = []
theta_bgd_4 = []
np.random.seed(42)
theta = np.random.randn(2,1)
plot_gradient_descent_2(theta, 500, 1000, gradients_bgd_4, theta_bgd_4)

gradients_bgd_3 = np.array(gradients_bgd_3)
theta_bgd_3 = np.array(theta_bgd_3)
gradients_bgd_4 = np.array(gradients_bgd_4)
theta_bgd_4 = np.array(theta_bgd_4)

plt.figure(figsize=(20, 16))

plt.subplot(221)
plt.plot(theta_bgd_3[:, 0], theta_bgd_3[:, 1], "r-s", linewidth=1, label="Normal")            # constant learning rate
plt.plot(theta_bgd_4[:, 0], theta_bgd_4[:, 1], "b-o", linewidth=1, label="Gradually reduce")  # decaying learning rate
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$   ", fontsize=20, rotation=0)

plt.subplot(222)
plt.plot(theta_bgd_3[:, 0], theta_bgd_3[:, 1], "r-s", linewidth=1, label="Normal")            # constant learning rate
plt.plot(theta_bgd_4[:, 0], theta_bgd_4[:, 1], "b-o", linewidth=1, label="Gradually reduce")  # decaying learning rate
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$   ", fontsize=20, rotation=0)
plt.axis([4.1, 4.3, 2.7, 2.9])  # zoom in

plt.subplot(223)
plt.plot(gradients_bgd_3[:, 0], gradients_bgd_3[:, 1], "r-s", linewidth=1, label="Normal")            # constant learning rate
plt.plot(gradients_bgd_4[:, 0], gradients_bgd_4[:, 1], "b-o", linewidth=1, label="Gradually reduce")  # decaying learning rate
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\nabla_{\theta} MSE_0$", fontsize=20)
plt.ylabel(r"$\nabla_{\theta} MSE_1$   ", fontsize=20, rotation=0)

plt.subplot(224)
plt.plot(gradients_bgd_3[:, 0], gradients_bgd_3[:, 1], "r-s", linewidth=1, label="Normal")            # constant learning rate
plt.plot(gradients_bgd_4[:, 0], gradients_bgd_4[:, 1], "b-o", linewidth=1, label="Gradually reduce")  # decaying learning rate
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\nabla_{\theta} MSE_0$", fontsize=20)
plt.ylabel(r"$\nabla_{\theta} MSE_1$   ", fontsize=20, rotation=0)
plt.axis([-0.15, 0.05, -0.1, 0.1])  # zoom in

plt.show()

theta_bgd_3[-1], theta_bgd_4[-1]

(array([[-7.05138935e+27],[-7.98621001e+27]]),array([[4.21509616],[2.77011339]]))

3. Polynomial regression

First, let's generate some nonlinear data based on a simple quadratic equation (plus some noise):

np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

from sklearn.preprocessing import PolynomialFeatures
# PolynomialFeatures transforms the training data, adding the square of each feature as a new feature
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
plt.show()

X[0], X[0] ** 2, X_poly[0]

(array([-0.75275929]), array([0.56664654]), array([-0.75275929,  0.56664654]))

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_

(array([1.78134581]), array([[0.93366893, 0.56456263]]))

X_new=np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([-3, 3, 0, 10])
plt.show()

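Fitted model: $\hat{y} = 0.56x^2 + 0.93x + 1.78$
Original function: $y = 0.50x^2 + 1.00x + 2.00 + \text{Gaussian noise}$

When there are multiple features, PolynomialFeatures also adds all combinations of features up to the given degree. For example, with two features a and b and degree=3, it adds not only $a^2$, $a^3$, $b^2$, $b^3$ but also the combinations $ab$, $a^2b$, $ab^2$. In general, PolynomialFeatures(degree=d) transforms an array with n features into one with $\frac{(n+d)!}{d!\,n!}$ features (counting the bias term), so beware of the combinatorial explosion in the number of features.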
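The feature count can be checked directly. The sketch below compares the number of columns PolynomialFeatures produces for n = 2, d = 3 with the closed-form count, then evaluates the formula for a few larger settings; the sample instance and the variable names are made up for this illustration.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n, d = 2, 3
X_demo = np.array([[2.0, 3.0]])  # one instance with features a = 2, b = 3

poly = PolynomialFeatures(degree=d)  # include_bias=True by default, so the constant 1 is counted
X_demo_poly = poly.fit_transform(X_demo)

# (n + d)! / (d! n!) is the binomial coefficient C(n + d, d).
print(X_demo_poly.shape[1], comb(n + d, d))  # 10 10

# The count grows quickly with both the number of features and the degree.
for n_features, degree in [(10, 2), (10, 5), (100, 3)]:
    print(n_features, degree, comb(n_features + degree, degree))
```

With include_bias=False, as used earlier in this section, the constant column is dropped and the count is one lower.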
