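[Machine Learning in Action, Part 1]: Implementing the k-Nearest Neighbors (KNN) Algorithm in C++

This article does not dwell on the theory behind KNN. It focuses on concrete problems, the design of the algorithm, and annotated code.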

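The KNN algorithm:

Pros: high accuracy, insensitive to outliers, no assumptions about the input data.

Cons: high computational complexity, high space complexity.

Works with: numeric and nominal values.

How it works: we start with a set of sample data, also called the training set, in which every sample carries a label (its class), so the mapping from each sample to its class is known. When new, unlabeled data (testData) arrives, each of its features is compared with the corresponding feature of every sample in the training set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general only the k most similar samples are considered, which is where the k in k-nearest neighbors comes from; k is usually an integer no larger than 20. Finally, the class that occurs most often among those k samples is assigned to the new data.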
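For reference, the rule described above can be written compactly. This is a sketch that assumes Euclidean distance, which is the distance the code later in this article computes. The symbols x (the new sample), x_i and y_i (the i-th training sample and its label), n (the number of features) and N_k(x) (the indices of the k training samples closest to x) are notation introduced here for illustration.

```latex
d(x, x_i) = \sqrt{\sum_{j=1}^{n} \left( x^{(j)} - x_i^{(j)} \right)^2},
\qquad
\hat{y}(x) = \arg\max_{c} \sum_{i \in N_k(x)} \mathbf{1}\left[ y_i = c \right]
```

The first expression ranks the training samples by similarity; the second picks the class c that receives the most votes among the k nearest ones.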

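General workflow of the k-nearest neighbors algorithm:

(1) Collect data: any method can be used.

(2) Prepare data: numeric values are needed for the distance calculation, preferably in a structured format.

(3) Analyze data: any method can be used.

(4) Train the algorithm: this step does not apply to k-nearest neighbors.

(5) Test the algorithm: compute the error rate.

(6) Use the algorithm: first feed in the sample data and the structured expected output, then run k-nearest neighbors to decide which class each input belongs to, and finally let the application perform any follow-up processing on the computed class.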

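Problem 1: Suppose we want to classify points in the plane, as shown in the figure below:

The figure contains 12 points. Each point has coordinates (x, y) and a class, A or B. Given a new point (x1, y1), the task is to decide whether it belongs to class A or class B.

All of the points are stored in the file data.txt: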

0.0 1.1 A
1.0 1.0 A
2.0 1.0 B
0.5 0.5 A
2.5 0.5 B
0.0 0.0 A
1.0 0.0 A
2.0 0.0 B
3.0 0.0 B
0.0 -1.0 A
1.0 -1.0 A
2.0 -1.0 B

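step1: The class constructor initializes the training set dataSet and the test point testData.

step2: get_distance() computes the distance between testData and each training sample dataSet[index]. The map map_index_dis stores the key-value pairs <index, distance>, where index identifies the training sample and distance is the distance between that sample and the test point.

step3: Sort map_index_dis by value (the distance) in ascending order, take the k smallest entries, and use map_label_freq to count how often each class label occurs among them.

step4: Traverse map_label_freq and return the key with the largest value. That key is the class of the test point.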
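Step 3 only needs the k smallest distances, so a full sort is not strictly required. The sketch below shows one alternative to the map-then-sort approach, using std::partial_sort. The function name k_nearest_indices is hypothetical and does not appear in KNN_0.cc, and the code assumes a C++11 compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of KNN_0.cc): returns the indices of the k
// smallest distances, ordered from nearest to farthest.
std::vector<int> k_nearest_indices(const std::vector<double>& distances, std::size_t k)
{
    assert(!distances.empty());

    std::vector<std::pair<double, int>> by_distance;  // <distance, index>
    by_distance.reserve(distances.size());
    for (std::size_t i = 0; i < distances.size(); ++i)
        by_distance.emplace_back(distances[i], static_cast<int>(i));

    // Never ask for more neighbours than there are samples.
    k = std::min(k, by_distance.size());

    // Only the first k elements end up sorted; the order of the rest is unspecified.
    // Pairs compare by distance first, then by index, so ties go to the lower index.
    std::partial_sort(by_distance.begin(), by_distance.begin() + k, by_distance.end());

    std::vector<int> nearest;
    for (std::size_t i = 0; i < k; ++i)
        nearest.push_back(by_distance[i].second);
    return nearest;
}
```

Storing the distance first in each pair lets the default pair comparison do the ordering, so no custom comparator such as CmpByValue is needed in this variant.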

Here is the code, KNN_0.cc:

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

using namespace std;

typedef char tLabel;
typedef double tData;
typedef pair<int, double> PAIR;

const int colLen = 2;   // number of features per sample
const int rowLen = 12;  // number of training samples

ifstream fin;

class KNN
{
private:
    tData dataSet[rowLen][colLen];    // training data
    tLabel labels[rowLen];            // class label of each training sample
    tData testData[colLen];           // the point to classify
    int k;                            // number of nearest neighbours that vote
    map<int, double> map_index_dis;   // <index, distance between dataSet[index] and testData>
    map<tLabel, int> map_label_freq;  // <label, count among the k nearest neighbours>

    double get_distance(tData *d1, tData *d2);

public:
    KNN(int k);
    void get_all_distance();
    void get_max_freq_label();

    struct CmpByValue {
        bool operator()(const PAIR &lhs, const PAIR &rhs) const
        {
            return lhs.second < rhs.second;
        }
    };
};

KNN::KNN(int k)
{
    if (k < 1 || k > rowLen) {
        cout << "k must be between 1 and " << rowLen << endl;
        exit(1);
    }
    this->k = k;

    fin.open("data.txt");
    if (!fin) {
        cout << "can not open the file data.txt" << endl;
        exit(1);
    }

    /* read the training set: colLen features followed by a label on each line */
    for (int i = 0; i < rowLen; i++) {
        for (int j = 0; j < colLen; j++)
            fin >> dataSet[i][j];
        fin >> labels[i];
    }

    /* read the test point from standard input */
    cout << "please input the test data :" << endl;
    for (int i = 0; i < colLen; i++)
        cin >> testData[i];
}

/* Euclidean distance between two samples */
double KNN::get_distance(tData *d1, tData *d2)
{
    double sum = 0;
    for (int i = 0; i < colLen; i++)
        sum += pow(d1[i] - d2[i], 2);
    return sqrt(sum);
}

/* compute the distance between the test point and every training sample */
void KNN::get_all_distance()
{
    for (int i = 0; i < rowLen; i++)
        map_index_dis[i] = get_distance(dataSet[i], testData);

    // print every index and its distance
    map<int, double>::const_iterator it = map_index_dis.begin();
    while (it != map_index_dis.end()) {
        cout << "index = " << it->first << " distance = " << it->second << endl;
        it++;
    }
}

/* classify the test point by majority vote among its k nearest neighbours */
void KNN::get_max_freq_label()
{
    // copy the map into a vector so it can be sorted by distance, ascending
    vector<PAIR> vec_index_dis(map_index_dis.begin(), map_index_dis.end());
    sort(vec_index_dis.begin(), vec_index_dis.end(), CmpByValue());

    // count the labels of the k nearest samples
    for (int i = 0; i < k; i++) {
        int idx = vec_index_dis[i].first;
        cout << "the index = " << idx
             << " the distance = " << vec_index_dis[i].second
             << " the label = " << labels[idx]
             << " the coordinate ( " << dataSet[idx][0] << "," << dataSet[idx][1] << " )" << endl;
        map_label_freq[labels[idx]]++;
    }

    // the label with the highest count is the predicted class
    map<tLabel, int>::const_iterator map_it = map_label_freq.begin();
    tLabel label = 0;
    int max_freq = 0;
    while (map_it != map_label_freq.end()) {
        if (map_it->second > max_freq) {
            max_freq = map_it->second;
            label = map_it->first;
        }
        map_it++;
    }
    cout << "The test data belongs to the " << label << " label" << endl;
}

int main()
{
    int k;
    cout << "please input the k value : " << endl;
    cin >> k;
    KNN knn(k);
    knn.get_all_distance();
    knn.get_max_freq_label();
    // system("pause");  // Windows only: keeps the console window open
    return 0;
}

Let's test the classifier (k=5):

testData(5.0,5.0):

testData(-5.0,-5.0):

testData(1.6,0.5):

The results can be checked against the coordinate plot, and all three are correct.
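The three queries can also be checked without interactive input. The sketch below embeds the 12 points from data.txt and classifies a point by majority vote among its k nearest neighbours. The names Point, kTrainingPoints and classify_point are hypothetical and are not part of KNN_0.cc; a C++11 compiler is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct Point { double x, y; char label; };

// The 12 training points from data.txt, embedded so the sketch needs no file.
const std::vector<Point> kTrainingPoints = {
    {0.0, 1.1, 'A'}, {1.0, 1.0, 'A'}, {2.0, 1.0, 'B'}, {0.5, 0.5, 'A'},
    {2.5, 0.5, 'B'}, {0.0, 0.0, 'A'}, {1.0, 0.0, 'A'}, {2.0, 0.0, 'B'},
    {3.0, 0.0, 'B'}, {0.0, -1.0, 'A'}, {1.0, -1.0, 'A'}, {2.0, -1.0, 'B'}};

// Hypothetical helper: label of (x, y) by majority vote among the k nearest points.
char classify_point(double x, double y, std::size_t k)
{
    assert(k >= 1 && k <= kTrainingPoints.size());

    // Pair every training point's Euclidean distance with its label, then sort by distance.
    std::vector<std::pair<double, char>> dist_label;
    for (const Point& p : kTrainingPoints)
        dist_label.emplace_back(std::hypot(p.x - x, p.y - y), p.label);
    std::sort(dist_label.begin(), dist_label.end());

    // Count the labels of the k nearest points.
    std::map<char, int> votes;
    for (std::size_t i = 0; i < k; ++i)
        ++votes[dist_label[i].second];

    // Pick the label with the most votes, starting from the nearest point's label.
    char best = dist_label[0].second;
    for (const auto& v : votes)
        if (v.second > votes[best]) best = v.first;
    return best;
}
```

Called with k = 5, classify_point(5.0, 5.0, 5) returns 'B', classify_point(-5.0, -5.0, 5) returns 'A', and classify_point(1.6, 0.5, 5) returns 'B'. For the last query the five nearest points are (2,1) B, (2,0) B, (1,1) A, (1,0) A and (2.5,0.5) B, so B wins three votes to two.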

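Problem 2: Using k-nearest neighbors to improve matches from a dating site

The scenario: my friend Helen has been using an online dating site to find a suitable partner. Although the site recommends different people, she has not found anyone she likes. After some reflection she realized that the people she had dated fall into three types:

>People she did not like

>People of average charm

>People of great charm

Even after noticing this pattern, Helen still cannot sort the matches recommended by the site into the right category. She feels she could date people of average charm from Monday to Friday, while on weekends she would rather spend time with the very charming ones. Helen hopes our classifier can help her assign each match to the right category. She has also collected some data that the dating site does not record, which she believes will help with the classification.

Helen has been collecting data for a while. She keeps it in the text file datingTestSet.txt (file link: http://yunpan.cn/QUL6SxtiJFPfN, extraction code: f246), with one sample per line and 1000 lines in total. Each sample has 3 features:

>Frequent flyer miles earned per year

>Percentage of time spent playing video games

>Liters of ice cream consumed per week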

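Data preprocessing: normalizing the data

The frequent flyer miles earned per year will affect the computed distance far more than the other two features, and the only reason is that its values are much larger than theirs. The three features are equally important, however, so as one of three equally weighted features, frequent flyer miles should not dominate the result like this.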
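A small worked example makes the imbalance concrete. The two samples below are hypothetical values chosen only for illustration, not rows from datingTestSet.txt: one person with (20000 miles, 0% game time, 1.1 liters) and another with (32000 miles, 67% game time, 0.1 liters).

```latex
d = \sqrt{(20000 - 32000)^2 + (0 - 67)^2 + (1.1 - 0.1)^2}
  = \sqrt{144000000 + 4489 + 1}
  \approx 12000.19
```

The miles term contributes 144000000 to the sum under the square root, while the other two features together contribute 4490, so the distance is almost exactly the difference in miles.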

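When features have such different ranges, the usual approach is to normalize the values, for example by rescaling them to the range 0 to 1 or -1 to 1.

The formula is: newValue = (oldValue - min) / (max - min)

Here min and max are the smallest and largest values of that feature in the data set. We add an auto_norm_data function to normalize the data.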
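As a standalone illustration of the formula, here is a minimal sketch of column-wise min-max normalization. The names Matrix and min_max_normalize are hypothetical and differ from the auto_norm_data member used in this article, and mapping a constant column to 0 is an assumption made here to avoid dividing by zero. A C++11 compiler is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double>> Matrix;  // rows = samples, columns = features

// Hypothetical helper: rescales every column of `rows` to [0, 1] in place using
// newValue = (oldValue - min) / (max - min).
void min_max_normalize(Matrix& rows)
{
    assert(!rows.empty());
    const std::size_t cols = rows[0].size();

    for (std::size_t j = 0; j < cols; ++j) {
        // Find the smallest and largest value in column j.
        double lo = rows[0][j], hi = rows[0][j];
        for (const auto& r : rows) {
            lo = std::min(lo, r[j]);
            hi = std::max(hi, r[j]);
        }

        const double range = hi - lo;
        for (auto& r : rows)
            // Assumption: a constant column (range 0) is mapped to 0.
            r[j] = (range == 0.0) ? 0.0 : (r[j] - lo) / range;
    }
}
```

For example, a miles column holding 10000, 30000 and 20000 becomes 0, 1 and 0.5, which puts it on the same scale as the other two features.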

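We also need a get_error_rate function to compute the classification error rate. It uses 10% of the data as test data and the remaining 90% as training data; a different split can of course be chosen.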
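The hold-out idea can be sketched independently of the classifier. In the sketch below the first test_count rows are the test set and the remaining rows are the training set, which mirrors the split described above. The names Sample, Classifier and holdout_error_rate are hypothetical, and the classifier is passed in as a callable so that any implementation can be plugged in. A C++11 compiler is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

typedef std::vector<double> Sample;

// Hypothetical classifier signature:
// (training samples, training labels, query sample) -> predicted label.
typedef std::function<std::string(const std::vector<Sample>&,
                                  const std::vector<std::string>&,
                                  const Sample&)> Classifier;

// Fraction of the first `test_count` rows that `classify` labels incorrectly
// when it may only look at the remaining rows.
double holdout_error_rate(const std::vector<Sample>& data,
                          const std::vector<std::string>& labels,
                          std::size_t test_count,
                          const Classifier& classify)
{
    assert(data.size() == labels.size());
    assert(test_count >= 1 && test_count < data.size());

    // Rows [test_count, end) form the training set.
    const std::vector<Sample> train(data.begin() + test_count, data.end());
    const std::vector<std::string> train_labels(labels.begin() + test_count, labels.end());

    std::size_t errors = 0;
    for (std::size_t i = 0; i < test_count; ++i)
        if (classify(train, train_labels, data[i]) != labels[i])
            ++errors;

    return static_cast<double>(errors) / static_cast<double>(test_count);
}
```

With 1000 rows and a 10% split, test_count would be 100. Keeping the test rows out of the training set matters: a test sample that is also a training sample is its own nearest neighbour at distance 0, which makes the error rate look better than it is.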

The rest of the algorithm design is similar to Problem 1.

The code, KNN_2.cc (k=7):

/* add the get_error_rate function */
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using namespace std;

typedef string tLabel;
typedef double tData;
typedef pair<int, double> PAIR;

const int MaxColLen = 10;
const int MaxRowLen = 10000;

ifstream fin;
ofstream fout;

class KNN
{
private:
    tData dataSet[MaxRowLen][MaxColLen];
    tLabel labels[MaxRowLen];
    tData testData[MaxColLen];
    int rowLen;
    int colLen;
    int k;
    int test_data_num;  // the first test_data_num rows are used as test data
    map<int, double> map_index_dis;
    map<tLabel, int> map_label_freq;

    double get_distance(tData *d1, tData *d2);

public:
    KNN(int k, int rowLen, int colLen, const char *filename);
    void get_all_distance();
    tLabel get_max_freq_label();
    void auto_norm_data();
    void get_error_rate();

    struct CmpByValue {
        bool operator()(const PAIR &lhs, const PAIR &rhs) const
        {
            return lhs.second < rhs.second;
        }
    };

    ~KNN();
};

KNN::~KNN()
{
    fin.close();
    fout.close();
    map_index_dis.clear();
    map_label_freq.clear();
}

KNN::KNN(int k, int row, int col, const char *filename)
{
    if (k < 1 || row < 1 || row > MaxRowLen || col < 1 || col > MaxColLen) {
        cout << "invalid k, row or col" << endl;
        exit(1);
    }
    this->rowLen = row;
    this->colLen = col;
    this->k = k;
    test_data_num = 0;

    fin.open(filename);
    fout.open("result.txt");
    if (!fin || !fout) {
        cout << "can not open the file" << endl;
        exit(1);
    }

    // read the data set and echo it to result.txt
    for (int i = 0; i < rowLen; i++) {
        for (int j = 0; j < colLen; j++) {
            fin >> dataSet[i][j];
            fout << dataSet[i][j] << " ";
        }
        fin >> labels[i];
        fout << labels[i] << endl;
    }
}

/* classify each of the first test_data_num rows against the remaining rows */
void KNN::get_error_rate()
{
    int i, j, count = 0;
    tLabel label;
    cout << "please input the number of test data : " << endl;
    cin >> test_data_num;
    if (test_data_num < 1 || test_data_num >= rowLen) {
        cout << "the number of test data must be between 1 and " << rowLen - 1 << endl;
        exit(1);
    }

    for (i = 0; i < test_data_num; i++) {
        for (j = 0; j < colLen; j++)
            testData[j] = dataSet[i][j];
        get_all_distance();
        label = get_max_freq_label();
        if (label != labels[i])
            count++;
        map_index_dis.clear();
        map_label_freq.clear();
    }
    cout << "the error rate is = " << (double)count / (double)test_data_num << endl;
}

double KNN::get_distance(tData *d1, tData *d2)
{
    double sum = 0;
    for (int i = 0; i < colLen; i++)
        sum += pow(d1[i] - d2[i], 2);
    return sqrt(sum);
}

/* distances are computed against the training rows only (test_data_num .. rowLen-1) */
void KNN::get_all_distance()
{
    for (int i = test_data_num; i < rowLen; i++)
        map_index_dis[i] = get_distance(dataSet[i], testData);
}

tLabel KNN::get_max_freq_label()
{
    // sort the <index, distance> pairs by distance, ascending
    vector<PAIR> vec_index_dis(map_index_dis.begin(), map_index_dis.end());
    sort(vec_index_dis.begin(), vec_index_dis.end(), CmpByValue());

    // vote among the k nearest training rows (fewer if the training set is smaller than k)
    for (int i = 0; i < k && i < (int)vec_index_dis.size(); i++) {
        int idx = vec_index_dis[i].first;
        cout << "the index = " << idx
             << " the distance = " << vec_index_dis[i].second
             << " the label = " << labels[idx]
             << " the coordinate ( ";
        int j;
        for (j = 0; j < colLen - 1; j++)
            cout << dataSet[idx][j] << ",";
        cout << dataSet[idx][j] << " )" << endl;
        map_label_freq[labels[idx]]++;
    }

    // the label with the highest count is the predicted class
    map<tLabel, int>::const_iterator map_it = map_label_freq.begin();
    tLabel label;
    int max_freq = 0;
    while (map_it != map_label_freq.end()) {
        if (map_it->second > max_freq) {
            max_freq = map_it->second;
            label = map_it->first;
        }
        map_it++;
    }
    cout << "The test data belongs to the " << label << " label" << endl;
    return label;
}

/* min-max normalize every column of dataSet */
void KNN::auto_norm_data()
{
    tData maxa[MaxColLen];
    tData mina[MaxColLen];
    tData range[MaxColLen];
    int i, j;

    // per-column minimum and maximum
    for (j = 0; j < colLen; j++)
        maxa[j] = mina[j] = dataSet[0][j];
    for (i = 1; i < rowLen; i++) {
        for (j = 0; j < colLen; j++) {
            if (dataSet[i][j] > maxa[j])
                maxa[j] = dataSet[i][j];
            else if (dataSet[i][j] < mina[j])
                mina[j] = dataSet[i][j];
        }
    }
    for (j = 0; j < colLen; j++)
        range[j] = maxa[j] - mina[j];

    // testData needs no separate normalization here: get_error_rate() copies it
    // from rows that are already normalized. A constant column (range 0) maps to 0.
    for (i = 0; i < rowLen; i++) {
        for (j = 0; j < colLen; j++)
            dataSet[i][j] = (range[j] == 0) ? 0 : (dataSet[i][j] - mina[j]) / range[j];
    }
}

int main(int argc, char **argv)
{
    if (argc != 5) {
        cout << "The input should be like this : ./a.out k row col filename" << endl;
        exit(1);
    }
    int k = atoi(argv[1]);
    int row = atoi(argv[2]);
    int col = atoi(argv[3]);
    const char *filename = argv[4];

    // a KNN object is roughly 1 MB, which fits the default stack on Linux
    KNN knn(k, row, col, filename);
    knn.auto_norm_data();
    knn.get_error_rate();
    return 0;
}

makefile:

target:
	g++ KNN_2.cc
	./a.out 7 1000 3 datingTestSet.txt

Result:

With 10% of the data used for testing and 90% for training, the error rate is 4%, which is reasonably accurate.

Building a complete, usable system:

Now that the classifier has been tested on the data, we can use it to classify people for Helen.

The code, KNN_1.cc (k=7):

/* add the auto_norm_data */
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using namespace std;

typedef string tLabel;
typedef double tData;
typedef pair<int, double> PAIR;

const int MaxColLen = 10;
const int MaxRowLen = 10000;

ifstream fin;
ofstream fout;

class KNN
{
private:
    tData dataSet[MaxRowLen][MaxColLen];
    tLabel labels[MaxRowLen];
    tData testData[MaxColLen];
    int rowLen;
    int colLen;
    int k;
    map<int, double> map_index_dis;
    map<tLabel, int> map_label_freq;

    double get_distance(tData *d1, tData *d2);

public:
    KNN(int k, int rowLen, int colLen, const char *filename);
    void get_all_distance();
    tLabel get_max_freq_label();
    void auto_norm_data();

    struct CmpByValue {
        bool operator()(const PAIR &lhs, const PAIR &rhs) const
        {
            return lhs.second < rhs.second;
        }
    };

    ~KNN();
};

KNN::~KNN()
{
    fin.close();
    fout.close();
    map_index_dis.clear();
    map_label_freq.clear();
}

KNN::KNN(int k, int row, int col, const char *filename)
{
    if (row < 1 || row > MaxRowLen || col < 1 || col > MaxColLen || k < 1 || k > row) {
        cout << "invalid k, row or col" << endl;
        exit(1);
    }
    this->rowLen = row;
    this->colLen = col;
    this->k = k;

    fin.open(filename);
    fout.open("result.txt");
    if (!fin || !fout) {
        cout << "can not open the file" << endl;
        exit(1);
    }

    // read the training set and echo it to result.txt
    for (int i = 0; i < rowLen; i++) {
        for (int j = 0; j < colLen; j++) {
            fin >> dataSet[i][j];
            fout << dataSet[i][j] << " ";
        }
        fin >> labels[i];
        fout << labels[i] << endl;
    }

    // read the test sample (the prompts assume the three dating features, i.e. col == 3)
    cout << "frequent flier miles earned per year?";
    cin >> testData[0];
    cout << "percentage of time spent playing video games?";
    cin >> testData[1];
    cout << "liters of ice cream consumed per week?";
    cin >> testData[2];
}

double KNN::get_distance(tData *d1, tData *d2)
{
    double sum = 0;
    for (int i = 0; i < colLen; i++)
        sum += pow(d1[i] - d2[i], 2);
    return sqrt(sum);
}

void KNN::get_all_distance()
{
    for (int i = 0; i < rowLen; i++)
        map_index_dis[i] = get_distance(dataSet[i], testData);
}

tLabel KNN::get_max_freq_label()
{
    // sort the <index, distance> pairs by distance, ascending
    vector<PAIR> vec_index_dis(map_index_dis.begin(), map_index_dis.end());
    sort(vec_index_dis.begin(), vec_index_dis.end(), CmpByValue());

    // count the labels of the k nearest samples
    for (int i = 0; i < k; i++)
        map_label_freq[labels[vec_index_dis[i].first]]++;

    /* traverse the map_label_freq to get the most frequent label */
    map<tLabel, int>::const_iterator map_it = map_label_freq.begin();
    tLabel label;
    int max_freq = 0;
    while (map_it != map_label_freq.end()) {
        if (map_it->second > max_freq) {
            max_freq = map_it->second;
            label = map_it->first;
        }
        map_it++;
    }
    return label;
}

/* min-max normalize the training set and the test sample */
void KNN::auto_norm_data()
{
    tData maxa[MaxColLen];
    tData mina[MaxColLen];
    tData range[MaxColLen];
    int i, j;

    // per-column minimum and maximum of the training set
    for (j = 0; j < colLen; j++)
        maxa[j] = mina[j] = dataSet[0][j];
    for (i = 1; i < rowLen; i++) {
        for (j = 0; j < colLen; j++) {
            if (dataSet[i][j] > maxa[j])
                maxa[j] = dataSet[i][j];
            else if (dataSet[i][j] < mina[j])
                mina[j] = dataSet[i][j];
        }
    }

    // normalize the test sample with the training set's min and range;
    // a constant column (range 0) maps to 0
    for (j = 0; j < colLen; j++) {
        range[j] = maxa[j] - mina[j];
        testData[j] = (range[j] == 0) ? 0 : (testData[j] - mina[j]) / range[j];
    }

    // normalize the training set
    for (i = 0; i < rowLen; i++) {
        for (j = 0; j < colLen; j++)
            dataSet[i][j] = (range[j] == 0) ? 0 : (dataSet[i][j] - mina[j]) / range[j];
    }
}

int main(int argc, char **argv)
{
    if (argc != 5) {
        cout << "The input should be like this : ./a.out k row col filename" << endl;
        exit(1);
    }
    int k = atoi(argv[1]);
    int row = atoi(argv[2]);
    int col = atoi(argv[3]);
    const char *filename = argv[4];

    KNN knn(k, row, col, filename);
    knn.auto_norm_data();
    knn.get_all_distance();
    cout << "You will probably like this person : " << knn.get_max_freq_label() << endl;
    return 0;
}

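makefile:

target:
	g++ KNN_1.cc
	./a.out 7 1000 3 datingTestSet.txt

Result:

The difference between KNN_1.cc and KNN_2.cc is that the latter analyzes the classifier's performance (its classification error rate), while the former classifies an actual, concrete sample.

Source: http://blog.csdn.net/lavorange/article/details/16924705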
