1. The Concept of Nearest Neighbors

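sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two forms: classification for data with discrete labels, and regression for data with continuous labels.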

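The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to a new observation, and to predict the label of the new observation from them. The number of samples can be a user-defined constant (k-nearest neighbor learning), or it can vary with the local density of points (radius-based neighbor learning). In general, the distance can be any metric; standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply "remember" all of their training data (possibly transformed into a fast indexing structure such as a ball tree or KD tree).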
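The difference between a fixed neighbor count and a fixed radius can be sketched as follows. This is only an illustration: the toy points, the query, and the radius of 0.5 are made-up values, not taken from the text above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy training set (hypothetical): a dense cluster plus one isolated point.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
query = np.array([[0.05, 0.05]])

nn = NearestNeighbors().fit(X_train)

# k-nearest neighbors: always returns exactly k training points per query.
k_dist, k_idx = nn.kneighbors(query, n_neighbors=2)

# Radius-based neighbors: returns however many points fall inside the radius,
# so the count depends on the local density around the query.
r_dist, r_idx = nn.radius_neighbors(query, radius=0.5)

print("k-neighbors found:", k_idx.shape[1])
print("radius neighbors found:", len(r_idx[0]))
```

With these values the radius query picks up the whole dense cluster and leaves out the isolated point, while the k query stops at two points regardless of how many are nearby.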

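Despite its simplicity, nearest neighbors has been applied successfully to a large number of classification and regression problems, including handwritten digits and satellite image scenes. Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.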

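The classes in sklearn.neighbors accept either NumPy arrays or scipy.sparse matrices as input. For dense matrices, a large number of distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics are supported for searches.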
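As a small illustration of the two input types, the sketch below fits the same toy data once as a dense NumPy array and once as a scipy.sparse CSR matrix. The data and variable names are made up for illustration, and algorithm='brute' is used here on the assumption that the tree-based indexes are not applicable to sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data, stored both densely and sparsely.
X_dense = np.array([[-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0],
                    [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]])
X_sparse = csr_matrix(X_dense)

# Default metric is Minkowski with p=2, i.e. Euclidean distance.
nn_dense = NearestNeighbors(n_neighbors=2, algorithm="brute").fit(X_dense)
nn_sparse = NearestNeighbors(n_neighbors=2, algorithm="brute").fit(X_sparse)

dist_dense, _ = nn_dense.kneighbors(X_dense)
dist_sparse, _ = nn_sparse.kneighbors(X_sparse)

# Both representations should give the same neighbor distances
# up to floating-point rounding.
print(np.allclose(dist_dense, dist_sparse, atol=1e-6))
```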

2. Unsupervised Nearest Neighbors

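NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm. The neighbor search algorithm is selected with the keyword 'algorithm', which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the default value 'auto' is passed, the algorithm attempts to determine the best nearest neighbor search approach from the training data.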
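A quick way to see that the 'algorithm' keyword only changes how the search is carried out, not what it returns, is to run all four settings on the same data. The random data set below is a made-up example, and a small tolerance is used in the comparison to allow for floating-point rounding.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical random data: 200 points in 3 dimensions.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)

results = {}
for algorithm in ["auto", "ball_tree", "kd_tree", "brute"]:
    nbrs = NearestNeighbors(n_neighbors=3, algorithm=algorithm).fit(X)
    distances, _ = nbrs.kneighbors(X)
    results[algorithm] = distances

# All searches are exact, so the neighbor distances should agree.
for algorithm, distances in results.items():
    max_diff = np.abs(distances - results["brute"]).max()
    print(algorithm, "max difference from brute force:", max_diff)
```

Distances rather than indices are compared because ties between equidistant points may be broken differently by different search structures.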

2.1 Finding the Nearest Neighbors

For the simple task of finding the nearest neighbors between two sets of data, the unsupervised algorithms defined in sklearn.neighbors can be used:

>>> from sklearn.neighbors import NearestNeighbors
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
>>> distances, indices = nbrs.kneighbors(X)
>>> indices
array([[0, 1],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4]]...)
>>> distances
array([[ 0.        ,  1.        ],
       [ 0.        ,  1.        ],
       [ 0.        ,  1.41421356],
       [ 0.        ,  1.        ],
       [ 0.        ,  1.        ],
       [ 0.        ,  1.41421356]])

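Because the query set matches the training set, the nearest neighbor of each point is the point itself, at a distance of zero. It is also possible to efficiently produce a sparse graph showing the connections between neighboring points: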

>>> nbrs.kneighbors_graph(X).toarray()
array([[ 1.,  1.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  1.]])

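Our dataset is structured such that points nearby in index order are also nearby in parameter space, leading to an approximately block-diagonal matrix of K-nearest neighbors. Such a sparse graph is useful in a variety of circumstances that exploit the sparse relationships between points for unsupervised learning. In particular, see sklearn.manifold.Isomap, sklearn.manifold.LocallyLinearEmbedding, and sklearn.cluster.SpectralClustering.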
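The graph can also carry distances instead of 0/1 connectivity flags. The sketch below reuses the six points from the example above and passes mode='distance'; the variable names are chosen here for illustration only.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
nbrs = NearestNeighbors(n_neighbors=2).fit(X)

# mode='distance' stores the distance to each neighbor as the edge weight
# instead of a 1 marking the connection.
graph = nbrs.kneighbors_graph(X, mode="distance")

print(issparse(graph))   # the result is a scipy sparse matrix
dense = graph.toarray()  # densify only for inspection on tiny data
print(dense[2])          # row 2: distance from point 2 to its neighbors
```

Because each point is its own nearest neighbor here, the self-edge has weight zero and the only visible entry in each row of the dense view is the distance to the other neighbor.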

2.2 KDTree and BallTree

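Alternatively, you can use the KDTree or BallTree classes directly to find nearest neighbors. This is the functionality wrapped by NearestNeighbors. KDTree and BallTree have the same interface; here is an example using KDTree: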

>>> from sklearn.neighbors import KDTree
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> kdt = KDTree(X, leaf_size=30, metric='euclidean')
>>> kdt.query(X, k=2, return_distance=False)
array([[0, 1],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4]]...)

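Refer to the KDTree and BallTree class documentation for more information on the options available for nearest neighbor searches, including the specification of query strategies and of various distance metrics. For a list of available distance metrics, see the documentation of the DistanceMetric class.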
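As a sketch of those options, the example below builds a BallTree with a non-default metric and runs both a k-neighbors query and a radius query. The choice of the Manhattan metric, the leaf_size of 2, and the radius of 1.5 are arbitrary values picked for illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

# Same constructor arguments as KDTree, here with the Manhattan (L1) metric.
tree = BallTree(X, leaf_size=2, metric="manhattan")

# The three nearest neighbors of the first point, with their L1 distances.
dist, ind = tree.query(X[:1], k=3)

# All points within L1 distance 1.5 of the first point.
within = tree.query_radius(X[:1], r=1.5)

print(ind[0], dist[0])
print(sorted(within[0].tolist()))
```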

3. Nearest Neighbors Classification

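Neighbors-based classification is a type of instance-based learning, or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the class that has the most representatives among its nearest neighbors.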

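scikit-learn implements two nearest neighbors classifiers. KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.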

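Of the two techniques, k-neighbors classification in KNeighborsClassifier is the more commonly used. The optimal choice of k is highly data-dependent: in general a larger k suppresses the effects of noise, but makes the classification boundaries less distinct. When the data is not uniformly sampled, radius-based neighbors classification in RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radius $r$, so that points in sparser neighborhoods use fewer nearest neighbors for the classification. For high-dimensional parameter spaces, this method becomes less effective due to the so-called curse of dimensionality.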
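The two classifiers can be compared side by side on a tiny problem. The one-dimensional data, the value k=3, and the radius of 1.0 below are made-up values for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# Hypothetical 1-D data: class 0 on the left, class 1 on the right.
X = np.array([[0.0], [0.5], [1.0], [4.0], [4.5], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Vote among the 3 closest training points.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Vote among all training points within distance 1.0 of the query.
rnn = RadiusNeighborsClassifier(radius=1.0).fit(X, y)

queries = np.array([[0.8], [4.2]])
knn_pred = knn.predict(queries)
rnn_pred = rnn.predict(queries)
print(knn_pred, rnn_pred)
```

Both queries lie well inside one of the clusters, so the two classifiers agree here. They differ for queries far from any training point, where a fixed radius may contain no neighbors at all while a fixed k always finds some.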

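The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of its nearest neighbors. Under some circumstances, it is better to weight the neighbors so that nearer neighbors contribute more to the fit. This can be done through the weights keyword. The default value, weights='uniform', assigns uniform weights to each neighbor. weights='distance' assigns each neighbor a weight equal to the inverse of its distance from the query point. Alternatively, a user-defined function of the distance can be supplied to compute the weights.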
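The effect of the weights keyword is easiest to see on a case where the closest neighbor is outvoted. In the sketch below, the data are made up, and gaussian_weights is a hypothetical user-defined weighting function with an arbitrary bandwidth of 0.5; it is not part of scikit-learn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: one class-0 point very close to the query,
# two class-1 points farther away.
X = np.array([[0.0], [1.0], [1.2]])
y = np.array([0, 1, 1])
query = np.array([[0.1]])


def gaussian_weights(distances):
    """Hypothetical custom weighting: Gaussian kernel, bandwidth 0.5."""
    return np.exp(-(distances ** 2) / (2 * 0.5 ** 2))


predictions = {}
for name, weights in [("uniform", "uniform"),
                      ("distance", "distance"),
                      ("gaussian", gaussian_weights)]:
    clf = KNeighborsClassifier(n_neighbors=3, weights=weights).fit(X, y)
    predictions[name] = int(clf.predict(query)[0])

print(predictions)
```

With uniform weights the two distant class-1 points win the vote two to one. With inverse-distance weights the close class-0 point receives weight 1/0.1 = 10, against roughly 1/0.9 + 1/1.1 ≈ 2.0 for class 1, so the prediction flips to class 0; the Gaussian weighting reaches the same result.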


import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

n_neighbors = 15

# import some data to play with
iris = datasets.load_iris()
# we only take the first two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target

h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()

4. Nearest Neighbors Regression

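Neighbors-based regression can be used when the data labels are continuous rather than discrete values. The label assigned to a query point is computed from the mean of the labels of its nearest neighbors. scikit-learn implements two different neighbors regressors. KNeighborsRegressor implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsRegressor implements learning based on the neighbors within a fixed radius $r$ of each query point, where $r$ is a floating-point value specified by the user.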

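The basic nearest neighbors regression uses uniform weights: that is, each point in the local neighborhood contributes equally to the prediction for a query point. Under some circumstances, it is better to weight the neighbors so that nearer points contribute more to the regression than distant ones. This can be done through the weights keyword. The default value, weights='uniform', assigns equal weights to all neighbors. weights='distance' assigns each neighbor a weight equal to the inverse of its distance from the query point. Alternatively, a user-defined function of the distance can be supplied to compute the weights.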
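The two weighting schemes can be checked by hand on a tiny example. The four training points and the query at 1.2 below are made-up values chosen so that the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D training data with targets y = x**2.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
query = np.array([[1.2]])

uniform = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X, y)
distance = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)

# The two nearest neighbors of 1.2 are x=1 (distance 0.2, target 1)
# and x=2 (distance 0.8, target 4).
uniform_pred = uniform.predict(query)[0]    # plain mean: (1 + 4) / 2
distance_pred = distance.predict(query)[0]  # inverse-distance weighted mean

# Reproduce the weighted mean by hand: weights are 1/0.2 and 1/0.8.
w = np.array([1 / 0.2, 1 / 0.8])
manual = (w * np.array([1.0, 4.0])).sum() / w.sum()

print(uniform_pred, distance_pred, manual)
```

The uniform prediction is 2.5, while the distance-weighted prediction is pulled toward the closer neighbor: (5 × 1 + 1.25 × 4) / 6.25 = 1.6.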


import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors

np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()

# Add noise to targets
y[::5] += 1 * (0.5 - np.random.rand(8))

# Fit regression model
n_neighbors = 5

for i, weights in enumerate(['uniform', 'distance']):
    knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
    y_ = knn.fit(X, y).predict(T)

    plt.subplot(2, 1, i + 1)
    plt.scatter(X, y, c='k', label='data')
    plt.plot(T, y_, c='g', label='prediction')
    plt.axis('tight')
    plt.legend()
    plt.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors, weights))

plt.show()

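Multi-output nearest neighbors regression is demonstrated in the following example: Face completion with a multi-output estimators. In this example, the input X is the pixels of the upper half of a face and the output Y is the pixels of the lower half.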
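Before the full face-completion script, the multi-output behavior can be seen on a much smaller problem. In this sketch the inputs and the three target columns are synthetic values invented for illustration; the point is only that a two-dimensional target array is accepted and that predictions have one column per target.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic data: 50 samples, 4 input features, 3 targets.
rng = np.random.RandomState(0)
X = rng.rand(50, 4)
Y = np.column_stack([X.sum(axis=1), X[:, 0] - X[:, 1], 2 * X[:, 2]])

# Fitting with a 2-D target array yields a multi-output regressor.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, Y)

# Each prediction row holds one value per target column.
Y_pred = knn.predict(X[:3])
print(Y_pred.shape)
```

In the face example below, the same mechanism is applied with thousands of target columns, one per pixel of the lower half of the image.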


import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn.utils.validation import check_random_state

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV

# Load the faces datasets
data = fetch_olivetti_faces()
targets = data.target

data = data.images.reshape((len(data.images), -1))
train = data[targets < 30]
test = data[targets >= 30]  # Test on independent people

# Test on a subset of people
n_faces = 5
rng = check_random_state(4)
face_ids = rng.randint(test.shape[0], size=(n_faces, ))
test = test[face_ids, :]

n_pixels = data.shape[1]
# Integer division keeps the slice bounds integral; np.ceil and np.floor
# return floats, which cannot be used as slice indices.
X_train = train[:, :(n_pixels + 1) // 2]  # Upper half of the faces
y_train = train[:, n_pixels // 2:]  # Lower half of the faces
X_test = test[:, :(n_pixels + 1) // 2]
y_test = test[:, n_pixels // 2:]

# Fit estimators
ESTIMATORS = {
    "Extra trees": ExtraTreesRegressor(n_estimators=10, max_features=32,
                                       random_state=0),
    "K-nn": KNeighborsRegressor(),
    "Linear regression": LinearRegression(),
    "Ridge": RidgeCV(),
}

y_test_predict = dict()
for name, estimator in ESTIMATORS.items():
    estimator.fit(X_train, y_train)
    y_test_predict[name] = estimator.predict(X_test)

# Plot the completed faces
image_shape = (64, 64)

n_cols = 1 + len(ESTIMATORS)
plt.figure(figsize=(2. * n_cols, 2.26 * n_faces))
plt.suptitle("Face completion with multi-output estimators", size=16)

for i in range(n_faces):
    true_face = np.hstack((X_test[i], y_test[i]))

    if i:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1)
    else:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1,
                          title="true faces")

    sub.axis("off")
    sub.imshow(true_face.reshape(image_shape),
               cmap=plt.cm.gray,
               interpolation="nearest")

    for j, est in enumerate(sorted(ESTIMATORS)):
        completed_face = np.hstack((X_test[i], y_test_predict[est][i]))

        if i:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j)
        else:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j,
                              title=est)

        sub.axis("off")
        sub.imshow(completed_face.reshape(image_shape),
                   cmap=plt.cm.gray,
                   interpolation="nearest")

plt.show()

5. Nearest Neighbor Algorithms

5.1 Brute Force

5.2 K-D Tree

5.3 Ball Tree

6. Nearest Centroid Classifier

7. Approximate Nearest Neighbors

To be continued...
