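Introduction to the KNN Algorithm

The nearest-neighbor algorithm, or k-nearest-neighbor (kNN) classification, is one of the simplest classification methods in data mining. "K nearest neighbors" means the $k$ closest neighbors: each sample can be represented by the $k$ samples nearest to it. KNN is simple and effective. It is a lazy-learning algorithm: the classifier has no training phase beyond storing the training set, so training costs essentially nothing, while the cost of classifying a sample grows in proportion to the number of training samples.

Because KNN decides the class from a limited number of nearby samples rather than by discriminating class regions, it is better suited than other methods to sample sets whose class regions cross or overlap heavily.

When the hyperparameter $k=1$, the class of a test sample is determined directly by the single closest training sample, and the algorithm is then called the nearest-neighbor algorithm. The choice of $k$ is an important factor in the accuracy of KNN.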

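KNN Algorithm Flow

The basic idea of KNN is to compute the distance from the test sample to every training sample, select the $k$ training samples with the smallest distances, and let them vote on the class of the test sample.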
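The flow above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the experiments; the function name `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np


def knn_predict(x_train, y_train, query, k):
    """Classify one query vector by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training sample
    dists = np.sqrt(((x_train - query) ** 2).sum(axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]


# hypothetical toy data: two small clusters in the plane
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(x_train, y_train, np.array([0.15, 0.1]), k=3))  # 0
print(knn_predict(x_train, y_train, np.array([0.95, 1.0]), k=3))  # 1
```

Sorting all distances is the simplest choice here; a partial sort is enough when only the $k$ smallest are needed.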

Distance Metrics in KNN

The distance between two feature vectors reflects how similar the samples are. Common choices are the Euclidean distance, the Manhattan distance, and the general $L_p$ distance.

Let the sample space be $X \subseteq \mathbb{R}^n$, where $n$ is the dimension of the feature vectors. The $L_p$ distance is defined as
$$L_p(\boldsymbol{x}_i, \boldsymbol{x}_j) = \left(\sum_{l=1}^{n}\left|x_i^{(l)} - x_j^{(l)}\right|^p\right)^{1/p}$$
where $p \geqslant 1$. When $p=2$ it is the Euclidean distance
$$L_2(\boldsymbol{x}_i, \boldsymbol{x}_j) = \left(\sum_{l=1}^{n}\left|x_i^{(l)} - x_j^{(l)}\right|^2\right)^{1/2}$$

When $p=1$ it is the Manhattan distance
$$L_1(\boldsymbol{x}_i, \boldsymbol{x}_j) = \sum_{l=1}^{n}\left|x_i^{(l)} - x_j^{(l)}\right|$$

When $p=\infty$ it is the largest difference over all coordinates, i.e.
$$L_{\infty}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \max\limits_{l}\left|x_i^{(l)} - x_j^{(l)}\right|$$

Of these, the Euclidean distance is the most commonly used.
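As a small sketch of the three definitions above, the helper below evaluates the $L_p$ distance for one pair of vectors. The name `lp_distance` is hypothetical and the vectors are chosen only so that the results are easy to check by hand.

```python
import numpy as np


def lp_distance(a, b, p):
    """L_p distance between two vectors; p may be np.inf."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(p):
        # L-infinity: the largest coordinate difference
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)


a = [0.0, 0.0]
b = [3.0, 4.0]
print(lp_distance(a, b, 1))       # 7.0 (Manhattan)
print(lp_distance(a, b, 2))       # 5.0 (Euclidean)
print(lp_distance(a, b, np.inf))  # 4.0 (largest coordinate difference)
```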

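Choosing $k$ in KNN

The most important hyperparameter of KNN is $k$, and its value strongly affects accuracy. If $k$ is small, each prediction relies on very few samples and is sensitive to noise, so the model overfits more easily. If $k$ is large, samples far from the query also take part in the vote, which raises the chance of a wrong prediction.

In practice, one starts from small values of $k$ and picks the best one by cross-validation. Even values of $k$ are usually avoided because they can produce tied votes, most obviously in two-class problems.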
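Selecting $k$ by cross-validation can be sketched as follows. This is an illustration under stated assumptions, not the experiment code: it uses leave-one-out on hypothetical synthetic data (two overlapping Gaussian blobs), Euclidean distance, and integer labels; `loo_accuracy` is a made-up helper name.

```python
import numpy as np


def loo_accuracy(x, y, k):
    """Leave-one-out accuracy of a k-NN classifier with Euclidean distance."""
    # squared distances between all pairs of samples
    d = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    # a held-out sample must not vote for itself
    np.fill_diagonal(d, np.inf)
    # indices of the k nearest other samples, one row per sample
    nearest = np.argsort(d, axis=1)[:, :k]
    # majority vote per row (labels are assumed to be small non-negative ints)
    pred = np.array([np.bincount(row).argmax() for row in y[nearest]])
    return float((pred == y).mean())


# hypothetical synthetic data: two overlapping Gaussian blobs
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)),
               rng.normal(2.0, 1.0, size=(60, 2))])
y = np.array([0] * 60 + [1] * 60)

candidates = range(1, 22, 2)  # odd values only
scores = {k: loo_accuracy(x, y, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Because every sample is held out once against all others, the full pairwise distance matrix can be computed a single time and reused for every candidate $k$.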

Classification Decision Rule in KNN

KNN classifies a test sample by majority vote: given its $k$ nearest training samples, the class that occurs most often among them is assigned to the test sample.

That is, for an $S$-class problem, if the $k$ nearest neighbors of a test sample are the training samples $\boldsymbol{x}_i, i = 1,2,\ldots,k$ with labels $y_i$, then the predicted label $y_p$ is
$$y_p = \mathop{\arg\max}\limits_{c_j} \sum_{i=1}^{k} I(y_i = c_j)$$
where $c_j \in \left\{c_1, c_2, \ldots, c_S\right\}$ ranges over the classes and $I(\cdot)$ is the indicator function.
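The decision rule above amounts to counting labels. A minimal sketch using only the standard library is given below; `majority_vote` is a hypothetical name, and the tie-breaking shown is a property of `collections.Counter`, not something the rule itself prescribes.

```python
from collections import Counter


def majority_vote(neighbor_labels):
    """Return the most frequent label among the k neighbors."""
    counts = Counter(neighbor_labels)
    # most_common(1) gives [(label, count)] for the top label;
    # ties are resolved in favor of the label encountered first
    return counts.most_common(1)[0][0]


print(majority_vote(["M", "R", "M"]))       # M
print(majority_vote(["R", "M", "M", "R"]))  # tie with k=4: R, the first seen
```

The second call shows why an even $k$ is awkward in a two-class problem: the vote is split and the result depends on an arbitrary tie-breaking rule.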

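Datasets

USPS Handwritten Digit Dataset

The USPS dataset is a set of handwritten digits in 10 classes, the digits 0 to 9. Each sample is a $16 \times 16$ grayscale image, so the sample space has 256 dimensions. The dataset has 9298 samples, already split into 7291 training samples and 2007 test samples. [Figure: one example image from the dataset]

UCI-sonar Dataset

UCI-sonar is a dataset for distinguishing rocks from mines using sonar returns. It has two classes, "M" (mine) and "R" (rock), and a 60-dimensional sample space made up of 60 sonar readings. The experiments here use 207 samples, 111 of class "M" and 96 of class "R". The original UCI file has 208 rows (111 "M", 97 "R"); the loading script in the appendix reads its first row as a header, which accounts for the difference.

UCI-iris Dataset

UCI-iris is a dataset for classifying iris flowers. It has three classes and a four-dimensional sample space describing four features of the flower, with 150 samples in total.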

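Experimental Setup

Classification accuracy on all three datasets is estimated by cross-validation. The two UCI datasets are small, so leave-one-out is used; the USPS dataset uses 5-fold cross-validation.

Hardware: Intel® Core™ i7-9750H CPU @ 2.60GHz.

Software: Python 3.6, numpy=1.19.4, sklearn=0.21.2.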

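Results and Analysis

For each of the three datasets, all three distance metrics are tried with $k=1,3,5,\ldots,49$, and accuracy is plotted against $k$ to find the best hyperparameters.

USPS Handwritten Digit Dataset

On USPS, $k=1$ is already optimal, with a best accuracy of 0.9639; accuracy drops markedly as $k$ grows. Among the three distances, $L_{\infty}$ is clearly less accurate than the other two, and $L_2$ is slightly better than $L_1$. A likely reason is that the USPS features are raw pixels: each single feature carries little information and the features are highly correlated, so $L_{\infty}$, which looks only at the largest coordinate difference, discards much of the information.

UCI-sonar Dataset

On UCI-sonar, $k=1$ is again optimal, with a best accuracy of 0.8550, and there is a small rebound at $k=5$. As $k$ increases further, accuracy drops markedly and stays below 0.7 from $k=15$ on. Among the three distances, $L_{\infty}$ is still less accurate than the other two, but here $L_1$ is slightly better than $L_2$. The reason is similar to USPS: the sonar features are readings from many sensors, no single feature is distinctive, and the features are highly correlated.

UCI-iris Dataset

On UCI-iris, the best accuracy, 0.98, appears around $k=20$. For other values of $k$ the accuracy fluctuates but mostly stays above 0.93, and the three distance metrics behave similarly. A likely reason is that the iris features are hand-measured properties of the flower: each of the four dimensions has a concrete meaning, and the three classes overlap little in feature space and are clearly separable.

Overall, classification on sonar is clearly worse than on the other two datasets, possibly because it has few samples relative to its 60 dimensions and its two classes are heavily intermixed. Accuracy on iris is affected little by the hyperparameters, probably because, as noted above, the data are easy to separate.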

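Reflections and Improvements

The kd Tree

Because the distance from a test sample to every training sample must be computed, brute-force KNN is expensive at prediction time. The kd tree markedly improves its efficiency: by building a tree structure similar to a binary search tree, it can complete a nearest-neighbor query in lower time complexity.

Let $n$ be the dataset size and $d$ the feature dimension. Analysis shows that the kd tree is clearly faster than brute-force KNN when $n \gg 2^d$. In this experiment USPS is fairly large, but its dimension is also large ($d=256$), so the kd tree is not suitable; the other two datasets are small, so the kd tree is not suitable for them either. With sklearn.neighbors.KNeighborsClassifier, the search algorithm can be chosen through the `algorithm` constructor argument. Running this on USPS, the kd tree took 11.20 s while brute-force KNN took only 0.86 s. This confirms the judgement above that the kd tree is inefficient on USPS.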
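To make the condition $n \gg 2^d$ concrete, the worked numbers below plug in the dataset sizes and dimensions quoted in this report (powers of two are rounded):

$$\begin{aligned}
\text{USPS:}\quad & n = 7291, & 2^{d} &= 2^{256} \approx 1.16 \times 10^{77} \\
\text{sonar:}\quad & n = 207, & 2^{d} &= 2^{60} \approx 1.15 \times 10^{18} \\
\text{iris:}\quad & n = 150, & 2^{d} &= 2^{4} = 16
\end{aligned}$$

For USPS and sonar the condition fails by many orders of magnitude. For iris $n$ is only about nine times $2^d$, and with 150 samples a brute-force search is already cheap, so there is little for a tree to save.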

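Speeding Up KNN Itself

Let the training set have $n$ samples, the test set also be of size $O(n)$, and the feature dimension be $d$. Computing the matrix of pairwise distances between training and test samples then costs $O(n^2d)$. This step is the efficiency bottleneck of KNN.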
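As a rough sense of scale, using the USPS split sizes given earlier (7291 training samples, 2007 test samples, $d = 256$) and counting one multiply-add per coordinate of each pair:

$$7291 \times 2007 \times 256 = 3\,746\,057\,472 \approx 3.7 \times 10^{9}$$

This is only an operation count under that simple assumption; actual running time depends on how the operations are carried out.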

While running the kd-tree experiment above, I found that my own KNN, which loops over the test set and predicts each sample separately, took tens of times longer than the library to classify the whole USPS test set (2007 samples) once. Reading the library source shows that sklearn optimizes the distance-matrix computation as follows:

1. In the spirit of blocked matrix computation, the training and test samples are both split into equally sized slices that are processed chunk by chunk.

2. When computing the $L_2$ distance, the square is expanded, i.e.
$$\begin{aligned} L_2(\boldsymbol{x}_i, \boldsymbol{x}_j) &= \left(\sum_{l=1}^{d}\left|x_i^{(l)} - x_j^{(l)}\right|^2\right)^{1/2}\\ &= \left(\boldsymbol{x}_i \cdot \boldsymbol{x}_i - 2\boldsymbol{x}_i \cdot \boldsymbol{x}_j + \boldsymbol{x}_j \cdot \boldsymbol{x}_j\right)^{1/2} \end{aligned}$$
where $(\,\cdot\,)$ denotes the vector dot product.
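The identity in item 2 can be checked numerically. The sketch below compares the direct form with the expanded form on hypothetical random data; the array names and sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(size=(5, 4))   # 5 training samples, 4 features
x_test = rng.normal(size=(3, 4))    # 3 test samples

# direct form: subtract every pair, square, sum over features
direct = ((x_train[:, None, :] - x_test[None, :, :]) ** 2).sum(axis=2)

# expanded form: x.x - 2 x.y + y.y, dominated by one matrix product
expanded = ((x_train ** 2).sum(axis=1)[:, None]
            - 2.0 * x_train @ x_test.T
            + (x_test ** 2).sum(axis=1)[None, :])
# rounding can leave tiny negative values, so clip before the square root
expanded = np.maximum(expanded, 0.0)

print(np.allclose(np.sqrt(direct), np.sqrt(expanded)))  # True
```

The two forms agree up to floating-point rounding; the expanded form is not bit-for-bit identical to the direct one.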

The key library source is shown below. The function euclidean_distances is in sklearn.metrics.pairwise and computes the distance matrix.

# Pairwise distances
def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
                        X_norm_squared=None):
    """
    Considering the rows of X (and Y=X) as vectors, compute the
    distance matrix between each pair of vectors.

    For efficiency reasons, the euclidean distance between a pair of row
    vector x and y is computed as::

        dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))

    This formulation has two advantages over other ways of computing distances.
    First, it is computationally efficient when dealing with sparse data.
    Second, if one argument varies but the other remains unchanged, then
    `dot(x, x)` and/or `dot(y, y)` can be pre-computed.

    However, this is not the most precise way of doing this computation, and
    the distance matrix returned by this function may not be exactly
    symmetric as required by, e.g., ``scipy.spatial.distance`` functions.

    Read more in the :ref:`User Guide <metrics>`.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape (n_samples_1, n_features)

    Y : {array-like, sparse matrix}, shape (n_samples_2, n_features)

    Y_norm_squared : array-like, shape (n_samples_2, ), optional
        Pre-computed dot-products of vectors in Y (e.g.,
        ``(Y**2).sum(axis=1)``)
        May be ignored in some cases, see the note below.

    squared : boolean, optional
        Return squared Euclidean distances.

    X_norm_squared : array-like, shape = [n_samples_1], optional
        Pre-computed dot-products of vectors in X (e.g.,
        ``(X**2).sum(axis=1)``)
        May be ignored in some cases, see the note below.

    Notes
    -----
    To achieve better accuracy, `X_norm_squared` and `Y_norm_squared` may be
    unused if they are passed as ``float32``.

    Returns
    -------
    distances : array, shape (n_samples_1, n_samples_2)

    Examples
    --------
    >>> from sklearn.metrics.pairwise import euclidean_distances
    >>> X = [[0, 1], [1, 1]]
    >>> # distance between rows of X
    >>> euclidean_distances(X, X)
    array([[0., 1.],
           [1., 0.]])
    >>> # get distance to origin
    >>> euclidean_distances(X, [[0, 0]])
    array([[1.        ],
           [1.41421356]])

    See also
    --------
    paired_distances : distances betweens pairs of elements of X and Y.
    """
    X, Y = check_pairwise_arrays(X, Y)

    # If norms are passed as float32, they are unused. If arrays are passed as
    # float32, norms needs to be recomputed on upcast chunks.
    # TODO: use a float64 accumulator in row_norms to avoid the latter.
    if X_norm_squared is not None:
        XX = check_array(X_norm_squared)
        if XX.shape == (1, X.shape[0]):
            XX = XX.T
        elif XX.shape != (X.shape[0], 1):
            raise ValueError(
                "Incompatible dimensions for X and X_norm_squared")
        if XX.dtype == np.float32:
            XX = None
    elif X.dtype == np.float32:
        XX = None
    else:
        XX = row_norms(X, squared=True)[:, np.newaxis]

    if X is Y and XX is not None:
        # shortcut in the common case euclidean_distances(X, X)
        YY = XX.T
    elif Y_norm_squared is not None:
        YY = np.atleast_2d(Y_norm_squared)

        if YY.shape != (1, Y.shape[0]):
            raise ValueError(
                "Incompatible dimensions for Y and Y_norm_squared")
        if YY.dtype == np.float32:
            YY = None
    elif Y.dtype == np.float32:
        YY = None
    else:
        YY = row_norms(Y, squared=True)[np.newaxis, :]

    if X.dtype == np.float32:
        # To minimize precision issues with float32, we compute the distance
        # matrix on chunks of X and Y upcast to float64
        distances = _euclidean_distances_upcast(X, XX, Y, YY)
    else:
        # if dtype is already float64, no need to chunk and upcast
        distances = - 2 * safe_sparse_dot(X, Y.T, dense_output=True)
        distances += XX
        distances += YY
    np.maximum(distances, 0, out=distances)

    # Ensure that distances between vectors and themselves are set to 0.0.
    # This may not be the case due to floating point rounding errors.
    if X is Y:
        np.fill_diagonal(distances, 0)

    return distances if squared else np.sqrt(distances, out=distances)

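As its name says, this function computes pairwise Euclidean distances. Its docstring also describes the optimization and its advantages. After a series of checks, float32 input is passed on to _euclidean_distances_upcast, which performs the actual distance computation in chunks; float64 input is handled directly in the function body.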

def _euclidean_distances_upcast(X, XX=None, Y=None, YY=None, batch_size=None):
    """Euclidean distances between X and Y

    Assumes X and Y have float32 dtype.
    Assumes XX and YY have float64 dtype or are None.

    X and Y are upcast to float64 by chunks, which size is chosen to limit
    memory increase by approximately 10% (at least 10MiB).
    """
    n_samples_X = X.shape[0]
    n_samples_Y = Y.shape[0]
    n_features = X.shape[1]

    distances = np.empty((n_samples_X, n_samples_Y), dtype=np.float32)

    if batch_size is None:
        x_density = X.nnz / np.prod(X.shape) if issparse(X) else 1
        y_density = Y.nnz / np.prod(Y.shape) if issparse(Y) else 1

        # Allow 10% more memory than X, Y and the distance matrix take (at
        # least 10MiB)
        maxmem = max(
            ((x_density * n_samples_X + y_density * n_samples_Y) * n_features
             + (x_density * n_samples_X * y_density * n_samples_Y)) / 10,
            10 * 2 ** 17)

        # The increase amount of memory in 8-byte blocks is:
        # - x_density * batch_size * n_features (copy of chunk of X)
        # - y_density * batch_size * n_features (copy of chunk of Y)
        # - batch_size * batch_size (chunk of distance matrix)
        # Hence x² + (xd+yd)kx = M, where x=batch_size, k=n_features, M=maxmem
        #                                 xd=x_density and yd=y_density
        tmp = (x_density + y_density) * n_features
        batch_size = (-tmp + np.sqrt(tmp ** 2 + 4 * maxmem)) / 2
        batch_size = max(int(batch_size), 1)

    x_batches = gen_batches(n_samples_X, batch_size)

    for i, x_slice in enumerate(x_batches):
        X_chunk = X[x_slice].astype(np.float64)
        if XX is None:
            XX_chunk = row_norms(X_chunk, squared=True)[:, np.newaxis]
        else:
            XX_chunk = XX[x_slice]

        y_batches = gen_batches(n_samples_Y, batch_size)

        for j, y_slice in enumerate(y_batches):
            if X is Y and j < i:
                # when X is Y the distance matrix is symmetric so we only need
                # to compute half of it.
                d = distances[y_slice, x_slice].T

            else:
                Y_chunk = Y[y_slice].astype(np.float64)
                if YY is None:
                    YY_chunk = row_norms(Y_chunk, squared=True)[np.newaxis, :]
                else:
                    YY_chunk = YY[:, y_slice]

                d = -2 * safe_sparse_dot(X_chunk, Y_chunk.T, dense_output=True)
                d += XX_chunk
                d += YY_chunk

            distances[x_slice, y_slice] = d.astype(np.float32, copy=False)

    return distances
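The chunking idea can be illustrated with a much simpler sketch than the library routine. The function below is a hypothetical simplification, not sklearn's code: it splits only the first matrix into row blocks, works in float64 throughout, takes a fixed `batch_size` instead of deriving one from a memory budget, and returns squared distances.

```python
import numpy as np


def chunked_sq_distances(x, y, batch_size):
    """Squared Euclidean distance matrix, filled one block of rows at a time."""
    out = np.empty((x.shape[0], y.shape[0]))
    # squared norms of the rows of y are shared by every block
    yy = (y ** 2).sum(axis=1)[None, :]
    for start in range(0, x.shape[0], batch_size):
        block = x[start:start + batch_size]
        xx = (block ** 2).sum(axis=1)[:, None]
        # expanded form of the squared distance for this block of rows
        out[start:start + batch_size] = xx - 2.0 * block @ y.T + yy
    # rounding can leave tiny negative values
    return np.maximum(out, 0.0)


rng = np.random.default_rng(1)
a = rng.normal(size=(7, 3))
b = rng.normal(size=(4, 3))
reference = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
print(np.allclose(chunked_sq_distances(a, b, batch_size=3), reference))  # True
```

Processing the rows in blocks bounds the size of the temporaries, which is the motivation stated in the library docstring above.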

Here the row_norms function computes $\boldsymbol{x}_i \cdot \boldsymbol{x}_i$, relying mainly on np.einsum.

def row_norms(X, squared=False):
    """Row-wise (squared) Euclidean norm of X.

    Equivalent to np.sqrt((X * X).sum(axis=1)), but also supports sparse
    matrices and does not create an X.shape-sized temporary.

    Performs no input validation.

    Parameters
    ----------
    X : array_like
        The input array
    squared : bool, optional (default = False)
        If True, return squared norms.

    Returns
    -------
    array_like
        The row-wise (squared) Euclidean norm of X.
    """
    if sparse.issparse(X):
        if not isinstance(X, sparse.csr_matrix):
            X = sparse.csr_matrix(X)
        norms = csr_row_norms(X)
    else:
        norms = np.einsum('ij,ij->i', X, X)

    if not squared:
        np.sqrt(norms, norms)
    return norms
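The einsum subscript string used in row_norms may be unfamiliar, so here is a tiny worked example on a hypothetical 2×3 matrix. It shows that `'ij,ij->i'` yields the squared norm of each row, the same values as `(X * X).sum(axis=1)`.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# 'ij,ij->i': multiply X by itself element-wise and sum over j for each row i
sq_norms = np.einsum('ij,ij->i', X, X)
print(sq_norms)                                       # [14. 77.]
print(np.array_equal(sq_norms, (X * X).sum(axis=1)))  # True
```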

The safe_sparse_dot function computes $\boldsymbol{x}_i \cdot \boldsymbol{x}_j$; for dense input it is essentially a matrix multiplication with np.dot.

def safe_sparse_dot(a, b, dense_output=False):
    """Dot product that handle the sparse matrix case correctly

    Uses BLAS GEMM as replacement for numpy.dot where possible
    to avoid unnecessary copies.

    Parameters
    ----------
    a : array or sparse matrix
    b : array or sparse matrix
    dense_output : boolean, default False
        When False, either ``a`` or ``b`` being sparse will yield sparse
        output. When True, output will always be an array.

    Returns
    -------
    dot_product : array or sparse matrix
        sparse if ``a`` or ``b`` is sparse and ``dense_output=False``.
    """
    if sparse.issparse(a) or sparse.issparse(b):
        ret = a * b
        if dense_output and hasattr(ret, "toarray"):
            ret = ret.toarray()
        return ret
    else:
        return np.dot(a, b)

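It is easy to see that all the terms $\boldsymbol{x}_i \cdot \boldsymbol{x}_i$ can be computed in $O(nd)$ time, and that the terms $\boldsymbol{x}_i \cdot \boldsymbol{x}_j$ are exactly the product of the training matrix and the transposed test matrix. This product still costs $O(n^2d)$, but NumPy's matrix multiplication is heavily optimized, so this formulation can be tens of times faster than looping over the test set in Python.

I therefore followed sklearn and implemented the $k$-nearest-neighbor search with the expanded $L_2$ distance. Tested on USPS as well, this version predicts the whole test set in only 0.34 s, whereas the version that loops over the 2007 test samples and predicts each one separately takes 13.05 s.

In summary, tree structures such as the kd tree accelerate the search for the $k$ nearest neighbors, but they are not suitable for any of the three datasets used here. Speeding up brute-force KNN means expanding the $L_2$ distance so that the work is done by NumPy's matrix multiplication, which greatly improves efficiency. Both approaches still search for the exact $k$ nearest neighbors, so, up to floating-point rounding, they do not change the predictions and leave the accuracy unchanged.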

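Summary

KNN is a supervised classification method based on distances between samples; in essence it is still pattern matching. In this experiment KNN achieved high accuracy on USPS and iris but performed relatively poorly on sonar. KNN is also computationally expensive, though its efficiency can be improved by introducing a tree structure or by changing how distances are computed. Whether the goal is to use a tree structure or to raise accuracy, the key is whether the extracted features are accurate and easy to separate.

The experiment also tried other common classifiers such as SVM and random forest. For lack of time (the report would never get finished otherwise), they are not covered here.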

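Appendix

The appendix consists of code: first the KNN class I wrote, then the experiments on the three datasets, and finally the comparison between the kd tree and KNN.

KNN Class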

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class KNN(BaseEstimator, ClassifierMixin):
    def __init__(self, k, method=2, quick_L_2=True):
        # method: 2 => L2 distance, 1 => L1 distance, 0 => L-infinity distance
        # quick_L_2: use the expanded, matrix-product form of the L2 distance
        self.k = k
        self.method = method
        self.quick_L_2 = quick_L_2

    def fit(self, x, y):
        # lazy learner: fitting only stores the training data
        self.x = x
        self.y = y
        self.labels = np.unique(y)
        # self.y = np.array([self.labels[self.labels == i][0] for i in self.y])
        return self

    def predict(self, a):
        y_label = []
        if self.method == 2 and self.quick_L_2:
            # squared L2 distances via x.x - 2 x.a + a.a, shape (n_train, n_test)
            dis = -2 * np.dot(self.x, a.T)
            dis += np.einsum('ij,ij->i', self.x, self.x)[:, np.newaxis]
            dis += np.einsum('ij,ij->i', a, a)[np.newaxis, :]
            # kth=k-1 puts the k smallest distances in the first k rows
            # and stays valid when k equals the number of training samples
            idx = np.argpartition(dis, kth=self.k - 1, axis=0)[0:self.k, :]
            for i in range(a.shape[0]):
                # plain integer counters, so string labels work as well
                vote = {label: 0 for label in self.labels}
                for j in range(self.k):
                    vote[self.y[idx[j, i]]] += 1
                y_label.append(max(vote, key=vote.get))
            return y_label
        # slow path: one test sample at a time
        for i in range(a.shape[0]):
            if self.method == 0:
                idx = np.argsort(np.max(np.abs(self.x - a[i, :]), axis=1))
            elif self.method == 1:
                idx = np.argsort(np.sum(np.abs(self.x - a[i, :]), axis=1))
            else:
                idx = np.argsort(np.sum((self.x - a[i, :]) ** 2, axis=1))
            vote = {label: 0 for label in self.labels}
            for j in range(self.k):
                vote[self.y[idx[j]]] += 1
            y_label.append(max(vote, key=vote.get))
        return y_label

USPS Experiment

import h5py
import numpy as np
import cv2
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC
from KNN import KNN
import time

# load data
# data from https://www.kaggle.com/bistaumanga/usps-dataset?select=usps.h5
# another data source https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#usps
path = 'usps.h5'
with h5py.File(path, 'r') as hf:
    train = hf.get('train')
    x_tr = train.get('data')[:]
    y_tr = train.get('target')[:]
    test = hf.get('test')
    x_te = test.get('data')[:]
    y_te = test.get('target')[:]


def check_pic(data, label, idx):
    # show one sample as a 16x16 image, titled with its label
    pic = data[idx, :]
    pic = pic.reshape((16, 16))
    cv2.imshow(str(label[idx]), pic)
    cv2.waitKey(0)


# check picture
# check_pic(x_tr, y_tr, 10)


# RandomForest
def random_forest():
    randomForest = RandomForestClassifier()
    print(np.mean(cross_val_score(randomForest, np.concatenate((x_tr, x_te)),
                                  np.concatenate((y_tr, y_te)), cv=KFold(n_splits=5), n_jobs=8)))


def svm():
    SVM = SVC(gamma='scale', C=1.0, decision_function_shape='ovr', kernel='rbf')
    print(np.mean(cross_val_score(SVM, np.concatenate((x_tr, x_te)),
                                  np.concatenate((y_tr, y_te)), cv=KFold(n_splits=5), n_jobs=8)))


# KNN
def knn():
    ks = np.arange(1, 50, 2)
    # rows are indexed by method: 0 => L-infinity, 1 => L1, 2 => L2
    acc = np.zeros((3, ks.shape[0]))
    for method in [2, 1, 0]:
        for (i, k) in enumerate(ks):
            acc[method, i] = (np.mean(cross_val_score(KNN(k=k, method=method), np.concatenate((x_tr, x_te)),
                                                      np.concatenate((y_tr, y_te)), cv=KFold(n_splits=5), n_jobs=8)))
            print(k, acc[method, i])
    np.save('acc.npy', acc)
    # raw strings keep the backslash in \infty intact
    plt.plot(ks, acc[2, :], label=r'$L_2$ distance')
    plt.plot(ks, acc[1, :], label=r'$L_1$ distance')
    plt.plot(ks, acc[0, :], label=r'$L_\infty$ distance')
    plt.legend()
    plt.xlabel('k')
    plt.ylabel('Accuracy')
    plt.show()
    print(np.max(acc))
    return acc


def compare_knn_and_fast_knn():
    # time.clock() was removed in Python 3.8; perf_counter() also works on 3.6
    knn1 = KNN(k=1, quick_L_2=True)
    knn1.fit(x_tr, y_tr)
    tim = time.perf_counter()
    knn1.predict(x_te)
    print(time.perf_counter() - tim)
    knn2 = KNN(k=1, quick_L_2=False)
    knn2.fit(x_tr, y_tr)
    tim = time.perf_counter()
    knn2.predict(x_te)
    print(time.perf_counter() - tim)


if __name__ == '__main__':
    random_forest()
    svm()
    knn()

sonar Experiment

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, LeaveOneOut
import matplotlib.pyplot as plt
from KNN import KNN

# load data
# data from http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/
path = 'sonar.all-data'
# NOTE: the file has no header row, so read_csv as called here uses the first
# of its 208 rows as column names and 207 samples remain, which matches the
# counts in the dataset description; pass header=None to keep all 208 rows.
data = pd.read_csv(path).values
labels = data[:, -1]
# .values is an object array because of the label column; cast features to float
data = data[:, :-1].astype(np.float64)

ks = np.arange(1, 50, 2)
# rows are indexed by method: 0 => L-infinity, 1 => L1, 2 => L2
acc = np.zeros((3, ks.shape[0]))
for method in range(3):
    for (i, k) in enumerate(ks):
        acc[method, i] = np.mean(cross_val_score(KNN(k=k, method=method), data, labels, cv=LeaveOneOut()))
plt.plot(ks, acc[2, :], label=r'$L_2$ distance')
plt.plot(ks, acc[1, :], label=r'$L_1$ distance')
plt.plot(ks, acc[0, :], label=r'$L_\infty$ distance')
plt.legend()
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.show()
print(np.max(acc))

iris Experiment

from sklearn import datasets
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
import matplotlib.pyplot as plt
from KNN import KNN

iris = datasets.load_iris()
data = iris['data']
labels = iris['target']

ks = np.arange(1, 50, 2)
# rows are indexed by method: 0 => L-infinity, 1 => L1, 2 => L2
acc = np.zeros((3, ks.shape[0]))
for method in range(3):
    for (i, k) in enumerate(ks):
        acc[method, i] = np.mean(cross_val_score(KNN(k=k, method=method), data, labels, cv=LeaveOneOut()))
plt.plot(ks, acc[2, :], label=r'$L_2$ distance')
plt.plot(ks, acc[1, :], label=r'$L_1$ distance')
plt.plot(ks, acc[0, :], label=r'$L_\infty$ distance')
plt.legend()
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.show()
print(np.max(acc))

KNN vs. kd Tree Comparison

import h5py
import numpy as np
import cv2
from sklearn.neighbors import KNeighborsClassifier
import time

# load data
# data from https://www.kaggle.com/bistaumanga/usps-dataset?select=usps.h5
# another data source https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#usps
path = 'usps.h5'
with h5py.File(path, 'r') as hf:
    train = hf.get('train')
    x_tr = train.get('data')[:]
    y_tr = train.get('target')[:]
    test = hf.get('test')
    x_te = test.get('data')[:]
    y_te = test.get('target')[:]

# add random noise to avoid computing on a sparse matrix
x_tr += np.random.rand(*x_tr.shape) * 0.001
x_te += np.random.rand(*x_te.shape) * 0.001


def check_pic(data, label, idx):
    # show one sample as a 16x16 image, titled with its label
    pic = data[idx, :]
    pic = pic.reshape((16, 16))
    cv2.imshow(str(label[idx]), pic)
    cv2.waitKey(0)


# check picture
# check_pic(x_tr, y_tr, 10)


# KNN
def knn():
    # brute-force search; the timing covers both fit and predict
    knn = KNeighborsClassifier(n_neighbors=5, algorithm='brute', n_jobs=1)
    # time.clock() was removed in Python 3.8; perf_counter() also works on 3.6
    tic = time.perf_counter()
    knn.fit(x_tr, y_tr)
    res = knn.predict(x_te)
    print(time.perf_counter() - tic)
    return res


def kd():
    # kd-tree search; the timing covers tree construction and predict
    kdt = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', n_jobs=1)
    tic = time.perf_counter()
    kdt.fit(x_tr, y_tr)
    res = kdt.predict(x_te)
    print(time.perf_counter() - tic)
    return res


if __name__ == '__main__':
    # both searches should return the same predictions
    assert np.all(kd() == knn())
