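1 Classification and regression

1.1 k-nearest-neighbor regression

1.2 Linear regression model

2 Linear regression analysis with two grounding features

2.1 Linear regression

2.2 Ridge regression

2.3 Lasso regression

2.4 Logistic regression

2.5 Linear support vector machine

3 k-nearest-neighbor classification and regression with two grounding features

3.1 k-nearest-neighbor classification with two grounding features

3.2 k-nearest-neighbor regression with two grounding features

3 Decision trees and random forests

3.1 Decision tree

3.2 Random forest

4 Machine learning in practice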

When the available data carry explicit label information and a model is trained on those labeled data, the process is called supervised learning.

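1 Classification and regression

The goal of machine learning is to classify newly obtained data or to predict values for it by regression. Classification produces discrete outputs, for example sorting a pile of fruit into three categories: apples, pears, and bananas. Regression predicts a value that may be continuous. Three concepts come up constantly: overfitting, underfitting, and generalization. If a model classifies or fits the existing data very well but performs poorly on new data, its generalization ability is weak, and this is generally called overfitting. If a model still classifies or fits poorly after being trained on the existing data, this is called underfitting. Underfitting is mainly caused by a model that is too simple, and an underfitted model generalizes even worse.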
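The difference between underfitting and overfitting can be seen by comparing training and test error as model complexity grows. The sketch below is only an illustration on hypothetical data (a noisy sine curve, not the grounding data), using NumPy polynomial fits of increasing degree. The training error can only go down as the degree rises, while the test error depends on the data and may rise again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy sine curve, split into training and test halves.
x = np.sort(rng.uniform(0.0, 1.0, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, errors[degree])  # (training error, test error)
```

A degree-1 fit is too simple for this curve and has a large error even on the training data (underfitting). A large gap between a low training error and a higher test error is the usual sign of overfitting.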

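1.1 k-nearest-neighbor regression

The core idea of k-nearest-neighbor methods is: given a training set and a new sample, find the k training instances closest to the new sample. For classification, the new sample is assigned the class of those neighbors; for regression, it is assigned a value computed from their targets (by default, the mean of the neighbors' target values). The k-nearest-neighbor algorithm for regression is implemented in the KNeighborsRegressor class in sklearn.neighbors.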
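To make the idea concrete, here is a minimal from-scratch sketch of k-nearest-neighbor regression in NumPy. It is not the scikit-learn implementation; the function name knn_regress and the sample values are made up for illustration, and Euclidean distance with an unweighted mean is assumed.

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=1):
    """Predict a value for x_new as the mean target of its k nearest training samples."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to every sample
    nearest = np.argsort(distances)[:k]                  # indices of the k closest samples
    return float(np.mean(y_train[nearest]))

# Hypothetical two-feature samples (not taken from ground_feature.csv).
X_train = np.array([[0.9, 11.0], [0.8, 8.0], [0.3, 2.0], [0.2, 1.0]])
y_train = np.array([1.0, 1.0, 0.0, 0.0])

print(knn_regress(X_train, y_train, np.array([0.85, 10.0]), k=1))  # 1.0
print(knn_regress(X_train, y_train, np.array([0.5, 5.0]), k=4))    # 0.5
```

With k=1 the prediction simply copies the target of the closest sample; with k equal to the training set size it degenerates to the overall mean.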

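Single-phase grounding faults occur frequently in power systems. Some feature vectors recorded when such faults occurred have been collected (saved in ground_feature.csv). When a new vector arrives, we use k-nearest-neighbor regression to see whether it can be analyzed by regression.

Download ground_feature.csv:

Link: https://pan.baidu.com/s/1sNz3RqyiAU7V8djcXVPOtA
Extraction code: whj6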

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
df1 = pd.read_csv(r'E:\PYTHON\ground_feature.csv') # read the feature vectors (raw string keeps the backslashes literal)
data = df1.values
X = data[:,0:3] # features: the first three columns
y = data[:,3] # target: the fourth column
# split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
# instantiate the regression model with the number of neighbors set to 1
reg = KNeighborsRegressor(n_neighbors=1)
# fit the model on the training split
reg.fit(X_train,y_train)
# predict on the test set
print(reg.predict(X_test))

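The output is as follows:

[1. 0. 1. 0. 0. 1.]

Figure: regression prediction with the KNeighborsRegressor class (screenshot of the run)

The following statement scores the model. The quality of a regression model is judged mainly by the coefficient of determination $R^{2}$: the larger it is (with a maximum of 1), the better the model.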

print('Test set R^2:{:.2f}'.format(reg.score(X_test,y_test)))

The output is as follows:

Test set R^2:1.00
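For regressors, the score method returns the coefficient of determination, $R^{2}=1-\sum (y-\hat{y})^{2}/\sum (y-\bar{y})^{2}$. The sketch below computes it by hand in plain Python; the helper name r2_score_manual is made up for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1, 0, 1, 0, 0, 1]
print(r2_score_manual(y_true, [1, 0, 1, 0, 0, 1]))  # 1.0: perfect predictions
print(r2_score_manual(y_true, [0.5] * 6))           # 0.0: always predicting the mean
```

A score of 1.00 therefore means every test target was reproduced exactly, while a model that always predicts the mean scores 0.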

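The following statements show the regression result for a new feature vector.

gf_new = np.array([[0.57,3.1,9.6]]) # new feature vector
gf_prediction = reg.predict(gf_new) # predict for the new vector
print(gf_prediction)

The output is as follows: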

[1.]

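The k-nearest-neighbor regression model is simple, but the quality of its classification or regression results depends heavily on its most important parameter, the number of neighbors. You can change n_neighbors and rerun the program to compare the results. When there are many features and the sample space is large, k-nearest-neighbor prediction becomes slow.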

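1.2 Linear regression model

Let the weight vector W and the feature vector X be

$W=\begin{bmatrix} W_{0}\\ W_{1}\\ \vdots \\ W_{P} \end{bmatrix},X=\begin{bmatrix} X_{0}\\ X_{1}\\ \vdots \\ X_{P} \end{bmatrix}$

With $W^{T}$ denoting the transpose of W, the linear regression model can be written as:

$\hat{y}=W^{T}X+b$

With a single feature, the model reduces to the equation of a straight line:

$\hat{y}=W_{0}X_{0}+b$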
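As a rough illustration of how W and b are estimated, the sketch below fits the model by ordinary least squares with NumPy. The data are hypothetical and noise-free, generated from known weights so the fit can be checked; the names true_w and true_b are made up for the example.

```python
import numpy as np

# Hypothetical noise-free data generated from known weights.
true_w = np.array([0.6, -0.1, 0.05])
true_b = -0.3
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(20, 3))
y = X @ true_w + true_b

# Append a column of ones so the bias b is estimated together with W.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = solution[:-1], solution[-1]

y_hat = X @ w_hat + b_hat  # the model y_hat = W^T X + b applied to every row
print(w_hat, b_hat)
```

Because the data contain no noise, the recovered weights match the ones used to generate them; with real measurements the fit minimizes the squared error instead.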

The linear regression model for single-phase grounding is analyzed below:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df1 = pd.read_csv(r'E:\PYTHON\ground_feature.csv') # read the feature vectors
data = df1.values
X = data[:,0:3] # features: the first three columns
y = data[:,3] # target: the fourth column
# split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
gf = LinearRegression().fit(X_train,y_train) # train the model
print('gf.coef_:{}'.format(gf.coef_))
print('gf.intercept_:{}'.format(gf.intercept_))
print('Training set scores: {:.2f}'.format(gf.score(X_train,y_train)))
print('Test set scores: {:.2f}'.format(gf.score(X_test,y_test)))

The output is as follows:

gf.coef_:[ 0.61494852 -0.00442976  0.05630708]
gf.intercept_:-0.32285384955966934
Training set scores: 0.68
Test set scores: 0.84

Here coef_ holds the values of the W parameters (the weights) and intercept_ holds the offset (intercept).


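2 Linear regression analysis with two grounding features

2.1 Linear regression

To make the classification behavior of the different algorithms easier to see from their decision boundaries, from this section on the three features in the single-phase grounding fault data file are reduced to two, so that the boundaries can be drawn in a two-dimensional plot.

Download ground_feature0.csv:

Link: https://pan.baidu.com/s/1EvmoNvcm83o3goJ8c_yavQ
Extraction code: whj6

The contents of ground_feature0.csv are as follows: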

Drop percentage,"   Ascending mutation value","   target"
0.89," 11.4","   1"
0.89," 0.4","    0"
0.75," 8.6","    1"
0.23," 4.3","    0"
0.5,"  3.2","    1"
0.49," 0.7","    0"
0.29," 2.2","    0"
0.31," 3.5","    0"
0.95," 10.9","   1"
0.93," 5.6","    1"
0.82," 7.8","    1"
0.82," 0.9","    0"
0.15," 3.3","    0"
0.43," 0.5","    0"
0.55," 2.9","    1"
0.55," -0.5","   0"
0.78," 6.9","    1"
0.76," 0.8","    0"
0.88," 10.5","   1"
0.88," 0.3","    0"
0.67," 5.9","    1"
0.56,12,1

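The first column is the electric-field drop percentage, the second column is the current mutation (sudden-change) value, and the third column is the training target: 0 means no grounding and 1 means a grounding fault.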
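The quoted fields in this file carry leading spaces, so it can help to check how they parse before training. The sketch below uses only the standard-library csv module on a few rows copied from the listing; it is one possible way to clean the values, not the loading method used in this article (which calls pandas.read_csv).

```python
import csv
import io

# A few rows copied from the file listing; the quoted fields carry leading spaces.
sample = '''Drop percentage,"   Ascending mutation value","   target"
0.89," 11.4","   1"
0.89," 0.4","    0"
0.56,12,1
'''

reader = csv.reader(io.StringIO(sample))
header = [name.strip() for name in next(reader)]  # strip the padding from column names
# float() and int() ignore surrounding whitespace, so the padded fields convert cleanly.
rows = [(float(drop), float(mutation), int(target)) for drop, mutation, target in reader]

print(header)
print(rows)
```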

Next, train a linear regression on the two grounding features:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from matplotlib.colors import ListedColormap
df1 = pd.read_csv(r'E:\PYTHON\ground_feature0.csv') # read the feature vectors
data = df1.values
X = data[:,0:2] # features: the first two columns
y = data[:,2] # target: the third column
# split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=10)
def plot_decision_regions(X,y,classifier,test_idx=None,resolution=0.02):
    # plot the decision regions of a two-class problem
    # set up marker generator and color map
    markers = ('s','x','o','^','v')
    colors = ('red','blue','lightgreen','gray','cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min,x1_max = X[:,0].min()-1,X[:,0].max()+1
    x2_min,x2_max = X[:,1].min()-1,X[:,1].max()+1
    xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,resolution),
                          np.arange(x2_min,x2_max,resolution))
    Z = classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1,xx2,Z,alpha=0.4,cmap=cmap)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())
    # plot all samples
    X_test,y_test = X[test_idx,:],y[test_idx]
    for idx,cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y==cl,0],y=X[y==cl,1],alpha=0.8,
                    color=cmap(idx),marker=markers[idx],label=cl)
gf = LinearRegression().fit(X_train,y_train)
print('gf.coef_:{}'.format(gf.coef_))
print('gf.intercept_:{}'.format(gf.intercept_))
print('Training set scores: {:.2f}'.format(gf.score(X_train, y_train)))
print('Test set scores: {:.2f}'.format(gf.score(X_test, y_test)))
plot_decision_regions(X,y,classifier=gf,test_idx=range(16,22))
plt.xlabel('Drop percentage LinearRegressor')
plt.ylabel('Ascending mutation value')
plt.legend(loc='upper right')
plt.show()

The output is as follows:

gf.coef_:[0.57924565 0.08177374]
gf.intercept_:-0.24511049471851187
Training set scores: 0.72
Test set scores: 0.46

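Continue by running the following statements:

gf_new = np.array([[0.590,15.9]]) # new feature vector with two features, the same one used in the following sections
print('LinearRegression New array Prediction: {}'.format(gf.predict(gf_new)))

The output is as follows: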

LinearRegression New array Prediction: [1.39684691]

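The predicted value is greater than 1, which indicates that the vector is classified as a single-phase grounding fault, but the test score is low, indicating underfitting. In the regression plot, the straight contour lines do not separate the two classes in the sample space. This regression model is not suitable for classifying single-phase grounding faults.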
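The printed prediction can be reproduced by hand from the printed coefficients, which shows what the fitted model actually computes. The sketch assumes the new vector is [0.590, 15.9], the one used in the following sections. The 0.5 threshold for turning the regression output into a class label is also an assumption made here for illustration; the article does not state a threshold.

```python
coef = [0.57924565, 0.08177374]    # gf.coef_ as printed above
intercept = -0.24511049471851187   # gf.intercept_ as printed above
x_new = [0.590, 15.9]              # assumed new feature vector

# y_hat = w0*x0 + w1*x1 + b
y_hat = sum(w * x for w, x in zip(coef, x_new)) + intercept

# Assumption: read outputs of at least 0.5 as class 1 (grounding fault).
label = 1 if y_hat >= 0.5 else 0
print(round(y_hat, 6), label)
```

The value exceeds 1 because a linear regression output is unbounded; nothing in the model restricts it to the interval between the two class labels.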

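2.2 Ridge regression

Ridge regression is an improved least-squares estimation method that addresses collinearity among the independent variables in linear regression analysis. As a linear model with a regularization constraint, it can also be used for regression analysis of the data.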
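The effect of the regularization constraint can be sketched with the standard closed-form ridge estimator, $w=(X^{T}X+\alpha I)^{-1}X^{T}y$, applied to centered data so that the intercept is not penalized. The data below are hypothetical, with two nearly collinear features, and ridge_fit is a made-up helper name rather than scikit-learn code.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution on centered data; the intercept is left unpenalized."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - X_mean @ w
    return w, b

# Hypothetical data with two nearly collinear features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=30)
X = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=30)])
y = 3.0 * x1 + rng.normal(scale=0.1, size=30)

for alpha in (0.0, 0.25, 25.0):
    w, b = ridge_fit(X, y, alpha)
    print(alpha, w, b)
```

Increasing alpha shrinks the coefficient vector toward zero, which stabilizes the estimate when features are collinear, at the cost of some bias.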

The ridge regression program is as follows (run in PyCharm or Spyder):

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
from sklearn.linear_model import Ridge # import the ridge regression class
df1 = pd.read_csv(r'E:\PYTHON\ground_feature0.csv') # read the feature vectors
data = df1.values
X = data[:,0:2] # features: the first two columns
y = data[:,2] # target: the third column
# split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=10)
def plot_decision_regions(X,y,classifier,test_idx=None,resolution=0.02):
    # plot the decision regions of a two-class problem
    # set up marker generator and color map
    markers = ('s','x','o','^','v')
    colors = ('red','blue','lightgreen','gray','cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min,x1_max = X[:,0].min()-1,X[:,0].max()+1
    x2_min,x2_max = X[:,1].min()-1,X[:,1].max()+1
    xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,resolution),
                          np.arange(x2_min,x2_max,resolution))
    Z = classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1,xx2,Z,alpha=0.4,cmap=cmap)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())
    # plot all samples
    X_test,y_test = X[test_idx,:],y[test_idx]
    for idx,cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y==cl,0],y=X[y==cl,1],alpha=0.8,
                    color=cmap(idx),marker=markers[idx],label=cl)
# changing alpha adjusts the regression behavior of the model
gf_ridge = Ridge(alpha=25).fit(X_train,y_train)
plot_decision_regions(X,y,classifier=gf_ridge,test_idx=range(16,22))
plt.xlabel("Drop percentage Ridge")
plt.ylabel("Ascending mutation value")
plt.legend(loc='upper right')
plt.title('One-phase ground')
plt.show()
gf_new = np.array([[0.590,15.9]])
print('Ridge Training set scores: {:.2f}'.format(gf_ridge.score(X_train, y_train)))
print('Ridge Test set scores: {:.2f}'.format(gf_ridge.score(X_test, y_test)))
print('Ridge New array Prediction: {}'.format(gf_ridge.predict(gf_new)))

The output is as follows:

Ridge Training set scores: 0.65
Ridge Test set scores: 0.50
Ridge New array Prediction: [1.49866647]

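Judging from the results, the scores on both the training set and the test set are low, and the decision boundary does not separate the sample space, so ridge regression is not well suited to this kind of problem.

Figure: ridge regression decision boundary (alpha=25)

With alpha adjusted to 0.25, the output is as follows: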

Ridge Training set scores: 0.72
Ridge Test set scores: 0.49
Ridge New array Prediction: [1.44066823]

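Figure: ridge regression decision boundary (alpha=0.25)

With alpha=0.25 the ridge decision boundary is essentially the same as that of plain linear regression, and the test score improves only slightly (0.49 versus 0.46). When only a few of the features are important, the Lasso model is usually the better choice among linear models.

2.3 Lasso regression

Lasso regression is sometimes described as linear regression with L1 regularization. It differs from ridge regression mainly in the regularization term: ridge uses L2 regularization, while Lasso uses L1 regularization. Lasso shrinks some coefficients and can drive coefficients with small absolute values exactly to 0, so it is particularly suited to reducing the number of parameters and selecting among them, and is commonly used to estimate linear models with sparse parameters. The example below applies it to single-phase grounding fault identification only to demonstrate its usage.

L1 regularization penalizes the L1 norm of the coefficients, that is, the sum of their absolute values. L2 regularization penalizes the L2 norm, that is, the square root of the sum of their squares (ridge regression uses the square of this norm as its penalty).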
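The two norms are easy to compute directly. The short sketch below uses plain Python on a made-up weight vector; l1_norm and l2_norm are illustrative helper names.

```python
import math

def l1_norm(weights):
    """Sum of absolute values."""
    return sum(abs(w) for w in weights)

def l2_norm(weights):
    """Square root of the sum of squares."""
    return math.sqrt(sum(w * w for w in weights))

weights = [3.0, -4.0, 0.0]   # hypothetical coefficient vector
print(l1_norm(weights))      # 7.0
print(l2_norm(weights))      # 5.0
```

Because the L1 penalty grows at a constant rate even for very small coefficients, minimizing it tends to push small coefficients exactly to zero, which is where Lasso's sparsity comes from.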

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
from sklearn.linear_model import Lasso # import the Lasso class
df1 = pd.read_csv(r'E:\PYTHON\ground_feature0.csv') # read the feature vectors
data = df1.values
X = data[:,0:2] # features: the first two columns
y = data[:,2] # target: the third column
# split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=10)
def plot_decision_regions(X,y,classifier,test_idx=None,resolution=0.02):
    # plot the decision regions of a two-class problem
    # set up marker generator and color map
    markers = ('s','x','o','^','v')
    colors = ('red','blue','lightgreen','gray','cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min,x1_max = X[:,0].min()-1,X[:,0].max()+1
    x2_min,x2_max = X[:,1].min()-1,X[:,1].max()+1
    xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,resolution),
                          np.arange(x2_min,x2_max,resolution))
    Z = classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1,xx2,Z,alpha=0.4,cmap=cmap)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())
    # plot all samples
    X_test,y_test = X[test_idx,:],y[test_idx]
    for idx,cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y==cl,0],y=X[y==cl,1],alpha=0.8,
                    color=cmap(idx),marker=markers[idx],label=cl)
# changing alpha and max_iter adjusts the regression behavior of the model
gf_lasso = Lasso(alpha=0.01,max_iter=100000).fit(X_train,y_train)
plot_decision_regions(X,y,classifier=gf_lasso,test_idx=range(16,22))
plt.xlabel("Drop percentage Lasso Regression")
plt.ylabel("Ascending mutation value")
plt.legend(loc='upper right')
plt.title('One-phase ground')
plt.show()
gf_new = np.array([[0.590,15.9]])
print('Lasso Regression Training set scores: {:.2f}'.format(gf_lasso.score(X_train, y_train)))
print('Lasso Regression Test set scores: {:.2f}'.format(gf_lasso.score(X_test, y_test)))
print('Lasso Regression New array Prediction: {}'.format(gf_lasso.predict(gf_new)))

The output is as follows:

Lasso Regression Training set scores: 0.71
Lasso Regression Test set scores: 0.49
Lasso Regression New array Prediction: [1.45308051]

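Figure: Lasso regression decision boundary

Although the Lasso training parameters were adjusted to reduce underfitting, the scores on the training and test sets did not improve noticeably. This regression model is therefore also unsuitable for classifying single-phase grounding faults.

2.4 Logistic regression

Logistic regression is a generalized linear regression and has much in common with multiple linear regression. The two models have essentially the same form, ${\omega }'x+b$, where $\omega$ and b are the parameters to be estimated. They differ in the dependent variable: multiple linear regression uses ${\omega }'x+b$ directly as the dependent variable, that is, $y={\omega }'x+b$, whereas logistic regression maps ${\omega }'x+b$ through a function L to a hidden state p, $p=L\left ( {\omega }' x+b\right )$, and then determines the value of the dependent variable by comparing p with 1-p. If L is the logistic function, the model is logistic regression; if L is a polynomial function, it is polynomial regression.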
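The mapping from the linear score to a class label can be sketched in a few lines of plain Python. The weights and bias below are hypothetical values chosen for illustration, not parameters fitted to the grounding data.

```python
import math

def logistic(z):
    """Logistic function L(z) = 1 / (1 + exp(-z)), mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(weights, bias, x):
    """Compute p = L(w'x + b) and choose class 1 when p >= 1 - p."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = logistic(z)
    return (1 if p >= 1.0 - p else 0), p

weights, bias = [2.0, 0.5], -3.0   # hypothetical parameters
print(predict_label(weights, bias, [0.59, 15.9]))  # large positive score: class 1
print(predict_label(weights, bias, [0.2, 0.5]))    # negative score: class 0
```

Comparing p with 1-p is the same as checking whether p is at least 0.5, which in turn is the same as checking whether the linear score is non-negative, so the decision boundary is a straight line.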

The logistic regression program is as follows (run in PyCharm or Spyder):

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
from sklearn.linear_model import LogisticRegression
df1 = pd.read_csv(r'E:\PYTHON\ground_feature0.csv') # read the feature vectors
data = df1.values
X = data[:,0:2] # features: the first two columns
y = data[:,2] # target: the third column
# split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=10)
def plot_decision_regions(X,y,classifier,test_idx=None,resolution=0.02):
    # plot the decision regions of a two-class problem
    # set up marker generator and color map
    markers = ('s','x','o','^','v')
    colors = ('red','blue','lightgreen','gray','cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min,x1_max = X[:,0].min()-1,X[:,0].max()+1
    x2_min,x2_max = X[:,1].min()-1,X[:,1].max()+1
    xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,resolution),
                          np.arange(x2_min,x2_max,resolution))
    Z = classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1,xx2,Z,alpha=0.4,cmap=cmap)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())
    # plot all samples
    X_test,y_test = X[test_idx,:],y[test_idx]
    for idx,cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y==cl,0],y=X[y==cl,1],alpha=0.8,
                    color=cmap(idx),marker=markers[idx],label=cl)
# changing C (the inverse of the regularization strength) adjusts the classifier
gf_logRegression = LogisticRegression(C=1000).fit(X_train,y_train)
plot_decision_regions(X,y,classifier=gf_logRegression,test_idx=range(16,22))
plt.xlabel("Drop percentage LogisticRegression")
plt.ylabel("Ascending mutation value")
plt.legend(loc='upper right')
plt.title('One-phase ground')
plt.show()
gf_new = np.array([[0.590,15.9]])
print('LogisticRegression Training set scores: {:.2f}'.format(gf_logRegression.score(X_train, y_train)))
print('LogisticRegression Test set scores: {:.2f}'.format(gf_logRegression.score(X_test, y_test)))
print('LogisticRegression New array Prediction: {}'.format(gf_logRegression.predict(gf_new)))

The output is as follows:

LogisticRegression Training set scores: 1.00
LogisticRegression Test set scores: 1.00
LogisticRegression New array Prediction: [1.]

Figure: logistic regression decision boundary

The algorithm separates the two classes of samples, if only narrowly, which is a good result: it leaves room at both the low end and the high end of the data region for recognizing new vectors. The scores on the training and test sets are also good and the test vector is classified correctly, so this classifier is quite satisfactory. The example also illustrates the difference between regression and classification models: regression estimates a continuous (analog) quantity, while classification divides a data set into distinct categories.

2.5 Linear support vector machine

The support vector machine (Support Vector Machine, SVM) shows many distinctive advantages in small-sample, nonlinear, and high-dimensional pattern recognition. Based on limited sample information, it seeks the best trade-off between model complexity and learning ability in order to obtain the best generalization.

The SVM program for separating feature vectors recorded during a single-phase grounding fault from those recorded in the normal state is as follows:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
from sklearn.svm import LinearSVC
df1 = pd.read_csv(r'E:\PYTHON\ground_feature0.csv') # read the feature vectors
data = df1.values
X = data[:,0:2] # features: the first two columns
y = data[:,2] # target: the third column
# split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=10)
def plot_decision_regions(X,y,classifier,test_idx=None,resolution=0.02):
    # plot the decision regions of a two-class problem
    # set up marker generator and color map
    markers = ('s','x','o','^','v')
    colors = ('red','blue','lightgreen','gray','cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min,x1_max = X[:,0].min()-1,X[:,0].max()+1
    x2_min,x2_max = X[:,1].min()-1,X[:,1].max()+1
    xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,resolution),
                          np.arange(x2_min,x2_max,resolution))
    Z = classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1,xx2,Z,alpha=0.4,cmap=cmap)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())
    # plot all samples
    X_test,y_test = X[test_idx,:],y[test_idx]
    for idx,cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y==cl,0],y=X[y==cl,1],alpha=0.8,
                    color=cmap(idx),marker=markers[idx],label=cl)
# changing C (the inverse of the regularization strength) adjusts the classifier
gf_LinearSVC = LinearSVC(C=10000).fit(X_train,y_train)
plot_decision_regions(X,y,classifier=gf_LinearSVC,test_idx=range(16,22))
plt.xlabel("Drop percentage LinearSVC")
plt.ylabel("Ascending mutation value")
plt.legend(loc='upper right')
plt.title('One-phase ground')
plt.show()
gf_new = np.array([[0.590,15.9]])
print('LinearSVC Training set scores: {:.2f}'.format(gf_LinearSVC.score(X_train, y_train)))
print('LinearSVC Test set scores: {:.2f}'.format(gf_LinearSVC.score(X_test, y_test)))
print('LinearSVC New array Prediction: {}'.format(gf_LinearSVC.predict(gf_new)))

The output is as follows:

LinearSVC Training set scores: 0.94
LinearSVC Test set scores: 0.83
LinearSVC New array Prediction: [1.]

Figure: linear SVM decision boundary

3 k-nearest-neighbor classification and regression with two grounding features

3.1 k-nearest-neighbor classification with two grounding features

Now use the k-nearest-neighbor algorithm to classify the vector space with two grounding features and see how it performs.

The example below uses the KNeighborsClassifier class:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
from sklearn.neighbors import KNeighborsClassifier
df1 = pd.read_csv(r'E:\PYTHON\ground_feature0.csv') # read the feature vectors
data = df1.values
X = data[:,0:2] # features: the first two columns
y = data[:,2] # target: the third column
# split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=10)
def plot_decision_regions(X,y,classifier,test_idx=None,resolution=0.02):
    # plot the decision regions of a two-class problem
    # set up marker generator and color map
    markers = ('s','x','o','^','v')
    colors = ('red','blue','lightgreen','gray','cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min,x1_max = X[:,0].min()-1,X[:,0].max()+1
    x2_min,x2_max = X[:,1].min()-1,X[:,1].max()+1
    xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,resolution),
                          np.arange(x2_min,x2_max,resolution))
    Z = classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1,xx2,Z,alpha=0.4,cmap=cmap)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())
    # plot all samples
    X_test,y_test = X[test_idx,:],y[test_idx]
    for idx,cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y==cl,0],y=X[y==cl,1],alpha=0.8,
                    color=cmap(idx),marker=markers[idx],label=cl)
# changing n_neighbors adjusts the classifier
gf_knn =KNeighborsClassifier(n_neighbors=1)
gf_knn.fit(X_train,y_train)
gf_new = np.array([[0.59,15.9]])
gf_prediction = gf_knn.predict(gf_new)
plot_decision_regions(X,y,classifier=gf_knn,test_idx=range(16,22))
plt.xlabel("Drop percentage KNN")
plt.ylabel("Ascending mutation value")
plt.legend(loc='upper right')
plt.title('One-phase ground')
plt.show()
print('KNN Training set scores: {:.2f}'.format(gf_knn.score(X_train, y_train)))
print('KNN Test set scores: {:.2f}'.format(gf_knn.score(X_test, y_test)))
print('KNN New array Prediction: {}'.format(gf_knn.predict(gf_new)))

The output is as follows:

KNN Training set scores: 1.00
KNN Test set scores: 1.00
KNN New array Prediction: [1.]

Figure: k-nearest-neighbor decision boundary with two grounding features

Judging from the decision boundary, the algorithm classifies the feature space correctly, and the boundary is a fairly complex curve. However, it does not handle new vectors near the boundary well, that is, its generalization ability is weak, because the k-nearest-neighbor model processes feature vectors in a fairly simple way. For the new vector in this example, both the predicted class and the actual class are [1].

3.2 k-nearest-neighbor regression with two grounding features

The example below uses the KNeighborsRegressor class:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
from sklearn.neighbors import KNeighborsRegressor
df1 = pd.read_csv(r'E:\PYTHON\ground_feature0.csv') # read the feature vectors
data = df1.values
X = data[:,0:2] # features: the first two columns
y = data[:,2] # target: the third column
# split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=10)
def plot_decision_regions(X,y,classifier,test_idx=None,resolution=0.02):
    # plot the decision regions of a two-class problem
    # set up marker generator and color map
    markers = ('s','x','o','^','v')
    colors = ('red','blue','lightgreen','gray','cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min,x1_max = X[:,0].min()-1,X[:,0].max()+1
    x2_min,x2_max = X[:,1].min()-1,X[:,1].max()+1
    xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,resolution),
                          np.arange(x2_min,x2_max,resolution))
    Z = classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1,xx2,Z,alpha=0.4,cmap=cmap)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())
    # plot all samples
    X_test,y_test = X[test_idx,:],y[test_idx]
    for idx,cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y==cl,0],y=X[y==cl,1],alpha=0.8,
                    color=cmap(idx),marker=markers[idx],label=cl)
# changing n_neighbors adjusts the regression behavior of the model
reg = KNeighborsRegressor(n_neighbors=1)
reg.fit(X_train,y_train)
gf_new = np.array([[0.59,15.9]])
gf_prediction = reg.predict(gf_new)
plot_decision_regions(X,y,classifier=reg,test_idx=range(16,22))
plt.xlabel("Drop percentage KNN_Regressor")
plt.ylabel("Ascending mutation value")
plt.legend(loc='upper right')
plt.title('One-phase ground')
plt.show()
print('X_test Prediction: {}'.format(reg.predict(X_test)))
print('Test set R^2 :{:.2f}'.format(reg.score(X_test, y_test)))
print('KNeighborsRegressor New array Prediction: {}'.format(reg.predict(gf_new)))

The output is as follows:

X_test Prediction: [1. 0. 1. 0. 0. 0.]
Test set R^2 :1.00
KNeighborsRegressor New array Prediction: [1.]

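Figure: k-nearest-neighbor regression plot with two grounding features

Comparing k-nearest-neighbor classification and k-nearest-neighbor regression on the single-phase grounding fault feature space, the regression variant brings essentially no improvement, but in both cases the nonlinear boundary separates the two classes.

3 Decision trees and random forests

3.1 Decision tree

Decision tree learning essentially induces a set of classification rules from the training data. There may be several decision trees that classify the training data correctly, or none. When choosing a tree, pick one that fits the training data well and also generalizes well. Single-phase grounding fault classification can be implemented with a decision tree model. Assuming all collected data are highly accurate and reliable, the operating state of a 10 kV overhead line can be described by the decision tree model shown in the figure.

Figure: decision model

Starting from the root node, split the data on the feature that yields the largest information gain. The procedure is applied iteratively: each child node is split again by the same criterion until no further split is possible. Assume the decision tree is implemented as a binary tree.

The information entropy of node k is defined as:

$L_{H}\left ( k \right )=-\sum_{i=1}^{C}p\left ( i|k \right )\log_{2} p\left ( i|k \right )$

where C is the number of classes and p(i|k) is the proportion of the samples in node k that belong to class i. The information gain objective function can then be defined as:

$IG(D_{P},F)=I(D_{P})-\sum_{j=1}^{m}\frac{N_{j}}{N_{P}}I(D_{j})$

where IG is the information gain; F is the feature used for the split; $D_{P}$ and $D_{j}$ are the data sets of the parent node and the j-th child node; $N_{P}$ and $N_{j}$ are the numbers of samples in the parent node and the j-th child node; I is the impurity measure, here the entropy $L_{H}$; and m is the number of child nodes (m=2 for a binary tree). To prevent overfitting, some branches of the tree can be removed, which is called pruning.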
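The entropy and information gain formulas above translate almost directly into code. The sketch below uses plain Python with hypothetical label lists; entropy and information_gain are illustrative helper names, not scikit-learn functions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a node: -sum_i p(i|k) * log2 p(i|k) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = I(D_P) - sum_j (N_j / N_P) * I(D_j), with entropy as the impurity I."""
    n_parent = len(parent)
    weighted = sum(len(child) / n_parent * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [1, 1, 1, 0, 0, 0]   # hypothetical labels: three faults, three normal samples
print(entropy(parent))                                       # 1.0 for a 50/50 node
print(information_gain(parent, [[1, 1, 1], [0, 0, 0]]))      # 1.0: a perfect split
print(information_gain(parent, [[1, 1, 0], [1, 0, 0]]))      # small gain: a poor split
```

A split that produces pure child nodes removes all of the parent's entropy, so the tree-building procedure prefers it over a split that leaves the classes mixed.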

The decision tree classification code for single-phase grounding fault features is as follows:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
from sklearn.tree import DecisionTreeClassifier
df1 = pd.read_csv(r'E:\PYTHON\ground_feature0.csv') # read the feature vectors
data = df1.values
X = data[:,0:2] # features: the first two columns
y = data[:,2] # target: the third column
# split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=10)
gf_tree = DecisionTreeClassifier(criterion='entropy',max_depth=3,random_state=0)
gf_tree.fit(X_train,y_train)
print('Tree Training set scores: {:.2f}'.format(gf_tree.score(X_train, y_train)))
print('Tree Test set scores: {:.2f}'.format(gf_tree.score(X_test, y_test)))
gf_new =np.array([[0.490,1.1]]) # new feature vector
print('Tree New array Prediction: {}'.format(gf_tree.predict(gf_new)))
def plot_decision_regions(X,y,classifier,test_idx=None,resolution=0.02):
    # plot the decision regions of a two-class problem
    # set up marker generator and color map
    markers = ('s','x','o','^','v')
    colors = ('red','blue','lightgreen','gray','cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min,x1_max = X[:,0].min()-1,X[:,0].max()+1
    x2_min,x2_max = X[:,1].min()-1,X[:,1].max()+1
    xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,resolution),
                          np.arange(x2_min,x2_max,resolution))
    Z = classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1,xx2,Z,alpha=0.4,cmap=cmap)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())
    # plot all samples
    X_test,y_test = X[test_idx,:],y[test_idx]
    for idx,cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y==cl,0],y=X[y==cl,1],alpha=0.8,
                    color=cmap(idx),marker=markers[idx],label=cl)
plot_decision_regions(X,y,classifier=gf_tree,test_idx=range(16,22))
plt.xlabel('Drop percentage')
plt.ylabel('Ascending mutation value')
plt.legend(loc='upper right')
plt.show()

The output is as follows:

Tree Training set scores: 1.00
Tree Test set scores: 1.00
Tree New array Prediction: [0.]

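Figure: decision tree decision boundary

This is a staircase-shaped decision boundary that classifies all data in the sample space correctly. The score is 1 on both the training set and the test set, so the model fits this data set perfectly.

3.2 Random forest

A random forest is an ensemble of decision trees: it combines weak classifiers into a more robust strong classifier that generalizes better and is less prone to overfitting. The number of trees has a large effect on the result and should be chosen with care. The basic steps of the random forest algorithm are:

(1) Randomly draw n samples from the training set with replacement, using bootstrap sampling.

(2) Build a decision tree from the drawn samples. Each node is split as follows: a. randomly select k features without replacement; b. split the node on the selected features according to the objective function, here by maximizing information gain.

(3) Repeat the steps above 1 to 2000 times.

(4) Aggregate the classification results of all trees and make the final decision by majority vote.

The bootstrap method mentioned above is a nonparametric way of estimating population quantities from a small sample, and it is very effective when the grounding sample set is small.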
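Steps (1) and (4) can be sketched with the standard library alone. The code below only illustrates bootstrap sampling and majority voting; it does not build any trees, and the helper names and sample values are made up for the example.

```python
import random
from collections import Counter

def bootstrap_sample(samples, rng):
    """Step (1): draw len(samples) items with replacement."""
    return [rng.choice(samples) for _ in samples]

def majority_vote(predictions):
    """Step (4): return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)          # fixed seed so the draw is repeatable
samples = list(range(10))       # hypothetical sample indices
boot = bootstrap_sample(samples, rng)
print(boot)                     # some indices repeat, others are left out

tree_votes = [1, 0, 1, 1, 0]    # hypothetical predictions from five trees
print(majority_vote(tree_votes))  # 1
```

Because each tree sees a different bootstrap sample, the trees make different errors, and voting averages those errors out.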

The random forest code for single-phase grounding fault classification is as follows; it mainly uses the RandomForestClassifier class:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
from sklearn.ensemble import RandomForestClassifier
df1 = pd.read_csv(r'E:\PYTHON\ground_feature0.csv') # read the feature vectors
data = df1.values
X = data[:,0:2] # features: the first two columns
y = data[:,2] # target: the third column
# split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=10)
def plot_decision_regions(X,y,classifier,test_idx=None,resolution=0.02):
    # plot the decision regions of a two-class problem
    # set up marker generator and color map
    markers = ('s','x','o','^','v')
    colors = ('red','blue','lightgreen','gray','cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min,x1_max = X[:,0].min()-1,X[:,0].max()+1
    x2_min,x2_max = X[:,1].min()-1,X[:,1].max()+1
    xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,resolution),
                          np.arange(x2_min,x2_max,resolution))
    Z = classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1,xx2,Z,alpha=0.4,cmap=cmap)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())
    # plot all samples
    X_test,y_test = X[test_idx,:],y[test_idx]
    for idx,cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y==cl,0],y=X[y==cl,1],alpha=0.8,
                    color=cmap(idx),marker=markers[idx],label=cl)
# n_estimators is the number of trees in the forest
gf_forest = RandomForestClassifier(criterion='entropy',n_estimators=10,random_state=1,n_jobs=2)
gf_forest.fit(X_train,y_train)
plot_decision_regions(X,y,classifier=gf_forest,test_idx=range(16,22))
plt.xlabel("Drop percentage RandomForestClassifier")
plt.ylabel("Ascending mutation value")
plt.legend(loc='upper right')
plt.title('One-phase ground')
plt.show()
print('RandomForestClassifier Training set scores: {:.2f}'.format(gf_forest.score(X_train, y_train)))
print('RandomForestClassifier Test set scores: {:.2f}'.format(gf_forest.score(X_test, y_test)))

The output is as follows:

RandomForestClassifier Training set scores: 1.00
RandomForestClassifier Test set scores: 1.00

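Figure: random forest decision boundary

The decision boundary is approximately formed by two mutually perpendicular straight lines. With a boundary like this, the classifier clearly generalizes better to new feature vectors.

This example uses 10 trees, and the decision boundary with 10 trees is quite good. The number of trees in a random forest directly affects the shape of the decision boundary. Choosing it matters, but more trees are not always better. The figure below shows the decision boundary with 1000 trees: its classification performance is almost the same as that of the single decision tree in section 3.1, with no improvement.

Figure: random forest decision boundary (1000 trees)

4 Machine learning in practice

The models introduced above are all used in much the same way:

(1) Data preprocessing and exploration: clean the data and convert it into a format the model can use.

(2) Feature engineering.

(3) Build the model: for example, model=LinearRegression() creates a linear regression model.

(4) Train the model: model.fit(x,y).

(5) Predict with the model: model.predict([[a]]).

(6) Evaluate the model: assess the predictions visually.

Suppose the price of a house depends only on its area, ignoring location and other factors. The table gives areas and prices for several houses; compute the price of a 40 m^2 house.

Area and price data
Area (m^2)	56	32	78	160	240	89	91	69	43
Price (10,000 yuan)	90	65	125	272	312	147	159	109	78

First visualize the distribution of the data with a scatter plot to examine how price changes with area.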

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
# configure fonts so that Chinese text displays correctly in figures
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# set the theme used for the figures
import seaborn as sns
sns.set(font="Kaiti",style="ticks",font_scale=1.4)
# font="Kaiti" selects the font for text in the figure, style="ticks" sets the axes style, font_scale=1.4 scales the font size
x= np.array([56,32,78,160,240,89,91,69,43])
y= np.array([90,65,125,272,312,147,159,109,78])
# prepare the data and explore it
X=x.reshape(-1,1)
Y=y.reshape(-1,1) # fixed: the original reshaped x here instead of y
plt.figure(figsize=(10,6)) # create the figure window
plt.scatter(X,Y,s=50) # scatter plot of the raw data
plt.title("原始数据图") # title: "raw data plot"
plt.show()
# train the model and predict
model = LinearRegression()
model.fit(X,Y)
x1=np.array([40,]).reshape(-1,1) # area to predict
x1_pre=model.predict(np.array(x1)) # predicted price
print(x1_pre)
# visualize the data together with the prediction
plt.figure(figsize=(10,8))
plt.scatter(X,Y)
b=model.intercept_ # intercept
a=model.coef_ # slope
y=a*X+b # fitted line evaluated at the original areas
plt.plot(X,y)
y1=a*x1+b
plt.scatter(x1,y1,color="r")
plt.show()

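Running the program displays two figures: the scatter plot of the raw data, and the fitted line with the predicted point for 40 m^2 marked in red.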
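As a cross-check that does not depend on scikit-learn, the slope and intercept of a one-feature least-squares line can be computed from the textbook formulas, slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x). The sketch below applies them to the table data in plain Python; the variable names are chosen for illustration.

```python
areas = [56, 32, 78, 160, 240, 89, 91, 69, 43]        # m^2, from the table
prices = [90, 65, 125, 272, 312, 147, 159, 109, 78]   # 10,000 yuan, from the table

n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n

# Sums of cross-deviations and squared deviations.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices))
s_xx = sum((x - mean_x) ** 2 for x in areas)

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
price_40 = slope * 40 + intercept   # predicted price for a 40 m^2 house

print(slope, intercept, price_40)
```

Under this simple model the predicted price for a 40 m^2 house comes out a little under 80 (in units of 10,000 yuan), which should agree with model.predict up to floating-point rounding.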

Notes excerpted from 《Python机器学习教程》 (Python Machine Learning Tutorial) by 顾涛 (Gu Tao).
