K-Nearest Neighbors Algorithm

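We start with a sample data set in which every record carries a label, so we know which class each record belongs to. When new, unlabeled data comes in, each of its features is compared with the corresponding features of the samples, and the algorithm extracts the class labels of the most similar (nearest) samples. Only the k most similar samples are considered, and the class that occurs most often among them becomes the prediction.

A kNN model is determined by three basic elements: the distance metric, the choice of k, and the classification decision rule.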
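To make the first of these elements concrete, here is a minimal sketch of a distance metric. The function name minkowski_distance and the sample points are illustrative assumptions, not part of the original notes; the classifier later in this post uses the Euclidean case (p=2).

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two feature vectors (hypothetical helper).

    p=2 gives the Euclidean distance, p=1 the Manhattan distance.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

x = [0.0, 0.0]
y = [3.0, 4.0]
euclidean = minkowski_distance(x, y, p=2)   # sqrt(3^2 + 4^2)
manhattan = minkowski_distance(x, y, p=1)   # |3| + |4|
print(euclidean, manhattan)
```

Changing the metric changes which samples count as "nearest", so it can change the prediction even when k and the voting rule stay the same.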

1. Importing data with Python

from numpy import *
import operator  # provides the functions needed for sorting

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()
group
array([[1. , 1.1],
       [1. , 1.1],
       [0. , 0. ],
       [0. , 0.1]])
labels
['A', 'A', 'B', 'B']

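2. Implementing the kNN algorithm

The pseudocode is as follows:

  1. Compute the distance between the current point and every point in the labeled data set
  2. Sort the distances in increasing order
  3. Select the k points closest to the current point
  4. Count how often each class occurs among those k points
  5. Return the most frequent class among those k points as the prediction for the current point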
def classify0(inX, dataSet, labels, k):
    # inX is the input vector to classify
    dataSetSize = dataSet.shape[0]                    # shape[0] gives the number of samples
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5                      # Euclidean distance to every sample
    sortedDistIndicies = distances.argsort()          # indices that sort the distances ascending
    classCount = {}
    for i in range(k):                                # take the k nearest samples
        voteIlabel = labels[sortedDistIndicies[i]]    # label of this sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # tally the label
    # sort the label counts in descending order and return the most frequent label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

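Notes:

1. numpy.tile() repeats an array vertically or horizontally; the direction depends on the second argument. Here inX is stacked into four rows (one per sample) and dataSet is subtracted from it, so a single subtraction computes the difference between inX and every sample. For inX = [x1,x2], tile gives [[x1,x2],[x1,x2],[x1,x2],[x1,x2]], and dataSet has the form [[a1,b1],[a2,b2],[a3,b3],[a4,b4]]. Squaring the differences, summing each row and taking the square root then gives the Euclidean distance to every sample.

2. numpy.argsort() returns the indices that would sort the array in ascending order.

3. classCount.get(voteIlabel, 0) uses the dict method get(), whose signature is dict.get(key, default=None). key is the key to look up; if it is not in the dictionary, default is returned.

4. The label counts are sorted with Python's built-in sorted() (it is not a NumPy function), which differs from list.sort(): list.sort() is a method of list objects and sorts in place, while sorted() accepts any iterable and returns a new list. In Python 2 the same code would call classCount.iteritems(), which returns an iterator over the dictionary's (key, value) pairs; it was removed in Python 3, so items() is used here. key=operator.itemgetter(1) builds a function that picks the element at index 1 of each (label, count) pair, that is, the count. itemgetter() returns a function because the key argument expects a callable, such as a function or a lambda expression.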
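The building blocks described in these notes can be tried one at a time. The following is a small sketch using the same toy data as above; the variable names (in_x, data, votes, ranked) are my own, and the broadcasting line is an alternative to tile() that is not used in the original code.

```python
import operator
import numpy as np

in_x = np.array([0.0, 0.0])
data = np.array([[1.0, 1.1], [1.0, 1.1], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

# tile() stacks in_x once per sample so it can be subtracted row by row
tiled = np.tile(in_x, (data.shape[0], 1))
diff_tiled = tiled - data
# NumPy broadcasting gives the same differences without the explicit copy
diff_broadcast = in_x - data

distances = np.sqrt((diff_tiled ** 2).sum(axis=1))
order = distances.argsort()          # sample indices from nearest to farthest

# dict.get(key, 0) returns 0 for a label that has not been counted yet
votes = {}
for i in order[:3]:
    votes[labels[i]] = votes.get(labels[i], 0) + 1

# itemgetter(1) sorts the (label, count) pairs by count
ranked = sorted(votes.items(), key=operator.itemgetter(1), reverse=True)
print(ranked[0][0])
```

With k=3 the two 'B' samples and one 'A' sample are the nearest neighbors, so 'B' wins the vote, which matches the classify0 call below.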

classify0([0,0], group, labels, 3)
'B'
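As a cross-check, the same procedure can be written more compactly. This is only a sketch under my own naming (knn_predict is hypothetical, not from the original notes); it swaps tile() for broadcasting and the hand-written tally for collections.Counter.

```python
from collections import Counter
import numpy as np

def knn_predict(query, samples, sample_labels, k):
    """Majority vote among the k samples closest to query (Euclidean distance)."""
    samples = np.asarray(samples, dtype=float)
    query = np.asarray(query, dtype=float)
    distances = np.sqrt(((samples - query) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    counts = Counter(sample_labels[i] for i in nearest)
    return counts.most_common(1)[0][0]

group = [[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]]
labels = ['A', 'A', 'B', 'B']

for k in (1, 3):
    print(k, knn_predict([0, 0], group, labels, k),
          knn_predict([1.0, 1.0], group, labels, k))
```

On this four-point data set k=4 would produce a 2-2 tie for any query, and the winner would then depend on how ties are broken, which is one reason k is usually chosen to be smaller than the data set and, for two classes, odd.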

3. Example 1: using kNN to improve matches on a dating site

Step 1: Prepare the data by parsing it from a text file

def file2matrix(filename):
    with open(filename) as fr:
        arrayOLines = fr.readlines()         # list containing every line of the file
    numberOfLines = len(arrayOLines)         # number of lines
    returnMat = zeros((numberOfLines, 3))    # NumPy matrix to return
    classLabelVector = []                    # class labels
    labels = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}
    index = 0
    for line in arrayOLines:
        line = line.strip()                  # strip the trailing newline
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]              # fill one row of returnMat
        classLabelVector.append(labels[listFromLine[-1]])    # label of this sample
        index += 1
    return returnMat, classLabelVector
datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
datingDataMat
array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
       [1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
       [2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
       ...,
       [2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
       [4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
       [4.3757000e+04, 7.8826010e+00, 1.3324460e+00]])
datingLabels[0:20]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]

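Notes:

1. readlines() is a method of Python file objects; it returns a list containing every line of the file.

2. split() is a string method that splits a string at the given separator and returns a list of strings. In Python 3 its signature is str.split(sep=None, maxsplit=-1): sep is the separator, such as '\n' or '\t', and maxsplit is the maximum number of splits. The default of -1 means no limit; if maxsplit is given, the result has at most maxsplit+1 elements.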
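Here is how strip() and split() behave on a single record. The sample line is an assumption on my part: it is rebuilt from the first row of the output above (label 3, i.e. 'largeDoses') and may not match the file character for character.

```python
# One tab-separated record, reconstructed for illustration only
line = "40920\t8.326976\t0.953952\tlargeDoses\n"

fields = line.strip().split('\t')            # strip the newline, then split on tabs
features = [float(v) for v in fields[0:3]]   # the first three columns are numeric
label_name = fields[-1]                      # the last column is the class name

# maxsplit=1 performs a single split, so at most two pieces come back
first_only = line.strip().split('\t', 1)

print(fields)
print(first_only)
```

file2matrix does the same thing line by line, except that NumPy converts the three numeric strings to floats when they are assigned into returnMat.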

Step 2: Analyze the data with a Matplotlib scatter plot

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:,0], datingDataMat[:,1],15.0*array(datingLabels),15.0*array(datingLabels))
plt.show()

Step 3: Normalize the feature values

def autoNorm(dataset):
    minVals = dataset.min(0)
    maxVals = dataset.max(0)            # per-column minimum and maximum
    ranges = maxVals - minVals          # per-column range
    m = dataset.shape[0]                # number of samples
    normDataSet = dataset - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))   # (value - min) / range, using tile() again
    return normDataSet, ranges, minVals
normMat, ranges, minVals = autoNorm(datingDataMat)
normMat
array([[0.44832535, 0.39805139, 0.56233353],
       [0.15873259, 0.34195467, 0.98724416],
       [0.28542943, 0.06892523, 0.47449629],
       ...,
       [0.29115949, 0.50910294, 0.51079493],
       [0.52711097, 0.43665451, 0.4290048 ],
       [0.47940793, 0.3768091 , 0.78571804]])
ranges
array([9.1273000e+04, 2.0919349e+01, 1.6943610e+00])
minVals
array([0.      , 0.      , 0.001156])
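The function above applies min-max normalization, newValue = (oldValue - min) / (max - min), column by column. The returned ranges and minVals matter because a new, unseen sample has to be scaled with the same values before it is classified. A minimal sketch of that idea follows; min_max_normalize and the toy numbers are made up for illustration, and broadcasting is used in place of tile().

```python
import numpy as np

def min_max_normalize(data):
    """Scale every column to [0, 1]; returns the scaled data, ranges and minimums."""
    data = np.asarray(data, dtype=float)
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals   # assumes no column is constant (range > 0)
    return (data - min_vals) / ranges, ranges, min_vals

# Made-up values: two features on very different scales
toy = np.array([[10.0, 0.5],
                [20.0, 1.5],
                [40.0, 1.0]])
norm, ranges, min_vals = min_max_normalize(toy)

# A new sample must be scaled with the training set's min and range
new_sample = np.array([25.0, 1.0])
new_norm = (new_sample - min_vals) / ranges
print(norm)
print(new_norm)
```

Without this step the first feature in the dating data (values in the tens of thousands) would dominate the Euclidean distance and the other two features would barely influence the result.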

Step 4: Test the classifier

def datingClassTest():
    hoRatio = 0.10                                                  # fraction held out for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')  # read the file
    normMat, ranges, minVals = autoNorm(datingDataMat)              # normalize
    m = normMat.shape[0]                                            # total number of samples
    numTestVecs = int(m*hoRatio)                                    # first part for testing, the rest for training
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)  # classification result
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0                                       # record the error
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))  # error rate
datingClassTest()
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 1
the total error rate is: 0.050000
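The reported rate can be checked by hand: the log above has 100 test lines, and 5 of them show a prediction that differs from the real answer, so the error rate is 5/100 = 0.05. The sketch below does the same arithmetic; error_rate is a hypothetical helper, and the 95 matching pairs are placeholders that stand in for the correct predictions rather than the actual logged values.

```python
import numpy as np

def error_rate(predicted, actual):
    """Fraction of predictions that differ from the true labels."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted != actual))

# The five mismatched (prediction, answer) pairs from the log,
# followed by 95 placeholder pairs that agree
predicted = [3, 3, 3, 2, 3] + [1] * 95
actual = [2, 1, 1, 3, 1] + [1] * 95
print(error_rate(predicted, actual))
```

Note that datingClassTest takes the first 10% of the file as the test set, which only gives a fair estimate if the records are not ordered by class.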

Reposted from: https://www.cnblogs.com/patrolli/p/11319846.html
