



2 Movie genre analysis


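3 Summary of the KNN algorithm

  1. Compute the distance between the current point and every point in the labeled dataset.
  2. Sort the points by increasing distance.
  3. Take the k points closest to the current point.
  4. Count how often each class occurs among those k points.
  5. Return the most frequent class among those k points as the predicted class of the current point.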
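The five steps can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the function name `knn_predict` is made up for this example, Euclidean distance is assumed, and the six points are the (fight scenes, kissing scenes) counts from the movie example in this article, with the genre labels written in English.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every labeled point (Euclidean here).
    distances = [math.dist(point, query) for point in train_points]
    # Step 2: sort the training indices by increasing distance.
    order = sorted(range(len(train_points)), key=lambda i: distances[i])
    # Step 3: keep the k closest points.
    nearest = order[:k]
    # Step 4: count how often each class occurs among them.
    votes = Counter(train_labels[i] for i in nearest)
    # Step 5: the most frequent class is the prediction.
    return votes.most_common(1)[0][0]


# (fight scenes, kissing scenes) for the six movies of the movie example.
points = [(36, 1), (43, 2), (0, 10), (59, 1), (1, 15), (2, 19)]
labels = ["action", "action", "romance", "action", "romance", "romance"]

print(knn_predict(points, labels, (100, 3), k=5))  # action
print(knn_predict(points, labels, (1, 10), k=5))   # romance
```

With only six training samples and k = 5, each vote is 3 against 2, so the result is sensitive to k; on real data k is usually tuned.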



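  • n_neighbors: int, optional (default = 5). Number of neighbors used by default for kneighbors queries.
  • algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional. Algorithm used to compute the nearest neighbors: 'ball_tree' uses BallTree, 'kd_tree' uses KDTree, 'brute' uses a brute-force search, and 'auto' tries to pick the most appropriate algorithm from the values passed to the fit method. (The choice affects running time, not the result.)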






import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

if __name__ == '__main__':
    # 1. Read the data
    movies = pd.read_excel('I:/AI_Data/movies.xlsx', sheet_name=1)
    print("movies = \n", movies)
    # 2. Feature engineering: split into feature values and target values
    x_data = movies.iloc[:, 1:3]  # column 0 (the movie title) is not a feature
    y_data = movies['分类情况']
    print('特征数据值:x_data =\n', x_data)
    print('目标值:y_data =\n', y_data)
    # 3. Algorithm
    # 3.1 Instantiate a k-nearest-neighbors estimator
    knn = KNeighborsClassifier(n_neighbors=5)
    # 3.2 Train the estimator on x_data and y_data
    knn.fit(x_data, y_data)
    # 4. Using the model
    # 4.1 Build a test set of movie features
    x_test = pd.DataFrame({'武打镜头': [100, 67, 1], '接吻镜头': [3, 2, 10]})
    print('测试电影特征数据集:x_test = \n', x_test)
    # 4.2 Predict the genre of each test movie
    y_test = knn.predict(x_test)
    print('测试电影目标值:y_test = ', y_test)


movies = 
     电影名称  武打镜头  接吻镜头 分类情况
0   大话西游    36     1  动作片
1    杀破狼    43     2  动作片
2    前任3     0    10  爱情片
3    战狼2    59     1  动作片
4  泰坦尼克号     1    15  爱情片
5   星语心愿     2    19  爱情片
特征数据值:x_data =
   武打镜头  接吻镜头
0    36     1
1    43     2
2     0    10
3    59     1
4     1    15
5     2    19
目标值:y_data =
0    动作片
1    动作片
2    爱情片
3    动作片
4    爱情片
5    爱情片
Name: 分类情况, dtype: object
测试电影特征数据集:x_test = 
   武打镜头  接吻镜头
0   100     3
1    67     2
2     1    10
测试电影目标值:y_test =  ['动作片' '动作片' '爱情片']


import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

if __name__ == '__main__':
    # 1. Load the data
    iris = datasets.load_iris()
    print('iris =\n', iris)
    X_data = iris.data  # 4 columns = 4 features: sepal length, sepal width, petal length, petal width
    y_data = iris.target   # 0 = setosa, 1 = versicolor, 2 = virginica
    # 2. Feature engineering
    # 2.1 Standardize the features first; the training and test sets must share one scaler
    #     rather than being standardized separately (note: on this dataset, standardization lowered accuracy)
    # std = StandardScaler()
    # X_data = std.fit_transform(X_data)
    # print('标准化后的特征数据值:X_data =\n', X_data)
    # 2.2 Shuffle the sample order
    index = np.arange(150)
    np.random.shuffle(index)
    print("index =\n", index)
    # 2.3 Split into training features, test features, training targets and test targets
    X_train = X_data[index[:120]]
    X_test = X_data[index[120:]]
    y_train = y_data[index[:120]]
    y_test = y_data[index[120:]]
    print('特征数据值of训练集:X_train =\n', X_train)
    print('特征数据值of训练集:X_test =\n', X_test)  # the printed label says 训练集, but X_test is the test set
    print('目标值of训练集:y_train =\n', y_train)
    print('目标值of测试集:y_test =\n', y_test)
    # 3. Algorithm
    # 3.1 Instantiate a k-nearest-neighbors estimator
    # p = 1 selects Manhattan distance, p = 2 selects Euclidean distance; n_jobs is the number of parallel jobs
    # Rule of thumb: n_neighbors should not exceed the square root of the number of samples.
    knn = KNeighborsClassifier(n_neighbors=5, weights='distance', p=1, n_jobs=4)
    # 3.2 Train the estimator on X_train and y_train
    knn.fit(X_train, y_train)
    # 4. Model evaluation
    # 4.1 Predict class probabilities for the test set and derive the classes from them
    y_proba = knn.predict_proba(X_test)
    print('预测测试集中各个鸢尾花的分类概率:y_proba =\n', y_proba)
    y_proba_predict = y_proba.argmax(axis=1)
    print('根据概率计算各个鸢尾花的分类:y_proba_predict =\n', y_proba_predict)
    # 4.2 Predict the test set directly with knn.predict()
    y_predict = knn.predict(X_test)
    print('直接用knn.predict()方法预测测试集中各个鸢尾花的分类:y_predict =\n', y_predict)
    print('实际测试集中各个鸢尾花的分类:y_test =\n', y_test)
    # 4.3 Accuracy of the knn model
    predict_score = knn.score(X_test, y_test)
    print('knn模型准确度:predict_score =\n', predict_score)


iris ={'data': array([[5.1, 3.5, 1.4, 0.2],[4.9, 3. , 1.4, 0.2],[4.7, 3.2, 1.3, 0.2],[4.6, 3.1, 1.5, 0.2],[5. , 3.6, 1.4, 0.2],[5.4, 3.9, 1.7, 0.4],[4.6, 3.4, 1.4, 0.3],[5. , 3.4, 1.5, 0.2],[4.4, 2.9, 1.4, 0.2],[4.9, 3.1, 1.5, 0.1],[5.4, 3.7, 1.5, 0.2],[4.8, 3.4, 1.6, 0.2],[4.8, 3. , 1.4, 0.1],[4.3, 3. , 1.1, 0.1],[5.8, 4. , 1.2, 0.2],[5.7, 4.4, 1.5, 0.4],[5.4, 3.9, 1.3, 0.4],[5.1, 3.5, 1.4, 0.3],[5.7, 3.8, 1.7, 0.3],[5.1, 3.8, 1.5, 0.3],[5.4, 3.4, 1.7, 0.2],[5.1, 3.7, 1.5, 0.4],[4.6, 3.6, 1. , 0.2],[5.1, 3.3, 1.7, 0.5],[4.8, 3.4, 1.9, 0.2],[5. , 3. , 1.6, 0.2],[5. , 3.4, 1.6, 0.4],[5.2, 3.5, 1.5, 0.2],[5.2, 3.4, 1.4, 0.2],[4.7, 3.2, 1.6, 0.2],[4.8, 3.1, 1.6, 0.2],[5.4, 3.4, 1.5, 0.4],[5.2, 4.1, 1.5, 0.1],[5.5, 4.2, 1.4, 0.2],[4.9, 3.1, 1.5, 0.1],[5. , 3.2, 1.2, 0.2],[5.5, 3.5, 1.3, 0.2],[4.9, 3.1, 1.5, 0.1],[4.4, 3. , 1.3, 0.2],[5.1, 3.4, 1.5, 0.2],[5. , 3.5, 1.3, 0.3],[4.5, 2.3, 1.3, 0.3],[4.4, 3.2, 1.3, 0.2],[5. , 3.5, 1.6, 0.6],[5.1, 3.8, 1.9, 0.4],[4.8, 3. , 1.4, 0.3],[5.1, 3.8, 1.6, 0.2],[4.6, 3.2, 1.4, 0.2],[5.3, 3.7, 1.5, 0.2],[5. , 3.3, 1.4, 0.2],[7. , 3.2, 4.7, 1.4],[6.4, 3.2, 4.5, 1.5],[6.9, 3.1, 4.9, 1.5],[5.5, 2.3, 4. , 1.3],[6.5, 2.8, 4.6, 1.5],[5.7, 2.8, 4.5, 1.3],[6.3, 3.3, 4.7, 1.6],[4.9, 2.4, 3.3, 1. ],[6.6, 2.9, 4.6, 1.3],[5.2, 2.7, 3.9, 1.4],[5. , 2. , 3.5, 1. ],[5.9, 3. , 4.2, 1.5],[6. , 2.2, 4. , 1. ],[6.1, 2.9, 4.7, 1.4],[5.6, 2.9, 3.6, 1.3],[6.7, 3.1, 4.4, 1.4],[5.6, 3. , 4.5, 1.5],[5.8, 2.7, 4.1, 1. ],[6.2, 2.2, 4.5, 1.5],[5.6, 2.5, 3.9, 1.1],[5.9, 3.2, 4.8, 1.8],[6.1, 2.8, 4. , 1.3],[6.3, 2.5, 4.9, 1.5],[6.1, 2.8, 4.7, 1.2],[6.4, 2.9, 4.3, 1.3],[6.6, 3. , 4.4, 1.4],[6.8, 2.8, 4.8, 1.4],[6.7, 3. , 5. , 1.7],[6. , 2.9, 4.5, 1.5],[5.7, 2.6, 3.5, 1. ],[5.5, 2.4, 3.8, 1.1],[5.5, 2.4, 3.7, 1. ],[5.8, 2.7, 3.9, 1.2],[6. , 2.7, 5.1, 1.6],[5.4, 3. , 4.5, 1.5],[6. , 3.4, 4.5, 1.6],[6.7, 3.1, 4.7, 1.5],[6.3, 2.3, 4.4, 1.3],[5.6, 3. , 4.1, 1.3],[5.5, 2.5, 4. , 1.3],[5.5, 2.6, 4.4, 1.2],[6.1, 3. , 4.6, 1.4],[5.8, 2.6, 4. , 1.2],[5. , 2.3, 3.3, 1. 
],[5.6, 2.7, 4.2, 1.3],[5.7, 3. , 4.2, 1.2],[5.7, 2.9, 4.2, 1.3],[6.2, 2.9, 4.3, 1.3],[5.1, 2.5, 3. , 1.1],[5.7, 2.8, 4.1, 1.3],[6.3, 3.3, 6. , 2.5],[5.8, 2.7, 5.1, 1.9],[7.1, 3. , 5.9, 2.1],[6.3, 2.9, 5.6, 1.8],[6.5, 3. , 5.8, 2.2],[7.6, 3. , 6.6, 2.1],[4.9, 2.5, 4.5, 1.7],[7.3, 2.9, 6.3, 1.8],[6.7, 2.5, 5.8, 1.8],[7.2, 3.6, 6.1, 2.5],[6.5, 3.2, 5.1, 2. ],[6.4, 2.7, 5.3, 1.9],[6.8, 3. , 5.5, 2.1],[5.7, 2.5, 5. , 2. ],[5.8, 2.8, 5.1, 2.4],[6.4, 3.2, 5.3, 2.3],[6.5, 3. , 5.5, 1.8],[7.7, 3.8, 6.7, 2.2],[7.7, 2.6, 6.9, 2.3],[6. , 2.2, 5. , 1.5],[6.9, 3.2, 5.7, 2.3],[5.6, 2.8, 4.9, 2. ],[7.7, 2.8, 6.7, 2. ],[6.3, 2.7, 4.9, 1.8],[6.7, 3.3, 5.7, 2.1],[7.2, 3.2, 6. , 1.8],[6.2, 2.8, 4.8, 1.8],[6.1, 3. , 4.9, 1.8],[6.4, 2.8, 5.6, 2.1],[7.2, 3. , 5.8, 1.6],[7.4, 2.8, 6.1, 1.9],[7.9, 3.8, 6.4, 2. ],[6.4, 2.8, 5.6, 2.2],[6.3, 2.8, 5.1, 1.5],[6.1, 2.6, 5.6, 1.4],[7.7, 3. , 6.1, 2.3],[6.3, 3.4, 5.6, 2.4],[6.4, 3.1, 5.5, 1.8],[6. , 3. , 4.8, 1.8],[6.9, 3.1, 5.4, 2.1],[6.7, 3.1, 5.6, 2.4],[6.9, 3.1, 5.1, 2.3],[5.8, 2.7, 5.1, 1.9],[6.8, 3.2, 5.9, 2.3],[6.7, 3.3, 5.7, 2.5],[6.7, 3. , 5.2, 2.3],[6.3, 2.5, 5. , 1.9],[6.5, 3. , 5.2, 2. ],[6.2, 3.4, 5.4, 2.3],[5.9, 3. 
, 5.1, 1.8]]), 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'), 'DESCR': 'Iris Plants Database\n====================\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal length in cm\n        - sepal width in cm\n        - petal length in cm\n        - petal width in cm\n        - class:\n                - Iris-Setosa\n                - Iris-Versicolour\n                - Iris-Virginica\n    :Summary Statistics:\n\n    ============== ==== ==== ======= ===== ====================\n                    Min  Max   Mean    SD   Class Correlation\n    ============== ==== ==== ======= ===== ====================\n    sepal length:   4.3  7.9   5.84   0.83    0.7826\n    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)\n    ============== ==== ==== ======= ===== ====================\n\n    :Missing Attribute Values: None\n    :Class Distribution: 33.3% for each of 3 classes.\n    :Creator: R.A. 
Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThis is a copy of UCI ML iris datasets.\nhttp://archive.ics.uci.edu/ml/datasets/Iris\n\nThe famous Iris database, first used by Sir R.A Fisher\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\nReferences\n----------\n   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"\n     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n     Mathematical Statistics" (John Wiley, NY, 1950).\n   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n     Structure and Classification Rule for Recognition in Partially Exposed\n     Environments".  IEEE Transactions on Pattern Analysis and Machine\n     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n     on Information Theory, May 1972, 431-433.\n   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n     conceptual clustering system finds 3 classes in the data.\n   - Many, many more ...\n', 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']}
index =[147  89  67  57  26 130 113  87  48  38 110 118  65  33  14  94  47  70138 132  82 103  60  46 121  84 115  64  25 136  50  43  56  20  62  1213  88  39 119   4  59  31  66 100 108   5  15  10 129 114 144  45  3616  98 124 104 105 109 133  21  37   2 135  28  71  51 120 141 140 10630 146  81   8  96  90  40  24 139  92  55  83  17   0  68  61  73  6999 148  77 143  23  27  19  53  18  41  80   9 142 112 101  54  95  5211  58   1 131  79  44 127 145  29  91  86  74 111 102  76 128   6  7275 137  22  34  32 116   7  49 149 117  42  97  63  35  93   3 126 134107 125 123  78 122  85]
特征数据值of训练集:X_train =[[6.5 3.  5.2 2. ][5.5 2.5 4.  1.3][5.8 2.7 4.1 1. ][4.9 2.4 3.3 1. ][5.  3.4 1.6 0.4][7.4 2.8 6.1 1.9][5.7 2.5 5.  2. ][6.3 2.3 4.4 1.3][5.3 3.7 1.5 0.2][4.4 3.  1.3 0.2][6.5 3.2 5.1 2. ][7.7 2.6 6.9 2.3][6.7 3.1 4.4 1.4][5.5 4.2 1.4 0.2][5.8 4.  1.2 0.2][5.6 2.7 4.2 1.3][4.6 3.2 1.4 0.2][5.9 3.2 4.8 1.8][6.  3.  4.8 1.8][6.4 2.8 5.6 2.2][5.8 2.7 3.9 1.2][6.3 2.9 5.6 1.8][5.  2.  3.5 1. ][5.1 3.8 1.6 0.2][5.6 2.8 4.9 2. ][5.4 3.  4.5 1.5][6.4 3.2 5.3 2.3][5.6 2.9 3.6 1.3][5.  3.  1.6 0.2][6.3 3.4 5.6 2.4][7.  3.2 4.7 1.4][5.  3.5 1.6 0.6][6.3 3.3 4.7 1.6][5.4 3.4 1.7 0.2][6.  2.2 4.  1. ][4.8 3.  1.4 0.1][4.3 3.  1.1 0.1][5.6 3.  4.1 1.3][5.1 3.4 1.5 0.2][6.  2.2 5.  1.5][5.  3.6 1.4 0.2][5.2 2.7 3.9 1.4][5.4 3.4 1.5 0.4][5.6 3.  4.5 1.5][6.3 3.3 6.  2.5][6.7 2.5 5.8 1.8][5.4 3.9 1.7 0.4][5.7 4.4 1.5 0.4][5.4 3.7 1.5 0.2][7.2 3.  5.8 1.6][5.8 2.8 5.1 2.4][6.7 3.3 5.7 2.5][4.8 3.  1.4 0.3][5.5 3.5 1.3 0.2][5.4 3.9 1.3 0.4][5.1 2.5 3.  1.1][6.7 3.3 5.7 2.1][6.5 3.  5.8 2.2][7.6 3.  6.6 2.1][7.2 3.6 6.1 2.5][6.3 2.8 5.1 1.5][5.1 3.7 1.5 0.4][4.9 3.1 1.5 0.1][4.7 3.2 1.3 0.2][7.7 3.  6.1 2.3][5.2 3.4 1.4 0.2][6.1 2.8 4.  1.3][6.4 3.2 4.5 1.5][6.9 3.2 5.7 2.3][6.9 3.1 5.1 2.3][6.7 3.1 5.6 2.4][4.9 2.5 4.5 1.7][4.8 3.1 1.6 0.2][6.3 2.5 5.  1.9][5.5 2.4 3.7 1. ][4.4 2.9 1.4 0.2][5.7 2.9 4.2 1.3][5.5 2.6 4.4 1.2][5.  3.5 1.3 0.3][4.8 3.4 1.9 0.2][6.9 3.1 5.4 2.1][5.8 2.6 4.  1.2][5.7 2.8 4.5 1.3][6.  2.7 5.1 1.6][5.1 3.5 1.4 0.3][5.1 3.5 1.4 0.2][6.2 2.2 4.5 1.5][5.9 3.  4.2 1.5][6.1 2.8 4.7 1.2][5.6 2.5 3.9 1.1][5.7 2.8 4.1 1.3][6.2 3.4 5.4 2.3][6.7 3.  5.  1.7][6.8 3.2 5.9 2.3][5.1 3.3 1.7 0.5][5.2 3.5 1.5 0.2][5.1 3.8 1.5 0.3][5.5 2.3 4.  1.3][5.7 3.8 1.7 0.3][4.5 2.3 1.3 0.3][5.5 2.4 3.8 1.1][4.9 3.1 1.5 0.1][5.8 2.7 5.1 1.9][6.8 3.  5.5 2.1][5.8 2.7 5.1 1.9][6.5 2.8 4.6 1.5][5.7 3.  4.2 1.2][6.9 3.1 4.9 1.5][4.8 3.4 1.6 0.2][6.6 2.9 4.6 1.3][4.9 3.  1.4 0.2][7.9 3.8 6.4 2. ][5.7 2.6 3.5 1. ][5.1 3.8 1.9 0.4][6.1 3.  4.9 1.8][6.7 3.  
5.2 2.3][4.7 3.2 1.6 0.2][6.1 3.  4.6 1.4][6.7 3.1 4.7 1.5][6.4 2.9 4.3 1.3]]
特征数据值of训练集:X_test =[[6.4 2.7 5.3 1.9][7.1 3.  5.9 2.1][6.8 2.8 4.8 1.4][6.4 2.8 5.6 2.1][4.6 3.4 1.4 0.3][6.3 2.5 4.9 1.5][6.6 3.  4.4 1.4][6.4 3.1 5.5 1.8][4.6 3.6 1.  0.2][4.9 3.1 1.5 0.1][5.2 4.1 1.5 0.1][6.5 3.  5.5 1.8][5.  3.4 1.5 0.2][5.  3.3 1.4 0.2][5.9 3.  5.1 1.8][7.7 3.8 6.7 2.2][4.4 3.2 1.3 0.2][6.2 2.9 4.3 1.3][6.1 2.9 4.7 1.4][5.  3.2 1.2 0.2][5.  2.3 3.3 1. ][4.6 3.1 1.5 0.2][6.2 2.8 4.8 1.8][6.1 2.6 5.6 1.4][7.3 2.9 6.3 1.8][7.2 3.2 6.  1.8][6.3 2.7 4.9 1.8][6.  2.9 4.5 1.5][7.7 2.8 6.7 2. ][6.  3.4 4.5 1.6]]
目标值of训练集:y_train =[2 1 1 1 0 2 2 1 0 0 2 2 1 0 0 1 0 1 2 2 1 2 1 0 2 1 2 1 0 2 1 0 1 0 1 0 01 0 2 0 1 0 1 2 2 0 0 0 2 2 2 0 0 0 1 2 2 2 2 2 0 0 0 2 0 1 1 2 2 2 2 0 21 0 1 1 0 0 2 1 1 1 0 0 1 1 1 1 1 2 1 2 0 0 0 1 0 0 1 0 2 2 2 1 1 1 0 1 02 1 0 2 2 0 1 1 1]
目标值of测试集:y_test =[2 2 1 2 0 1 1 2 0 0 0 2 0 0 2 2 0 1 1 0 1 0 2 2 2 2 2 1 2 1]
预测测试集中各个鸢尾花的分类概率:y_proba =[[0.   0.1  0.9 ][0.   0.   1.  ][0.   0.75 0.25][0.   0.   1.  ][1.   0.   0.  ][0.   0.55 0.45][0.   0.85 0.15][0.   0.1  0.9 ][1.   0.   0.  ][1.   0.   0.  ][1.   0.   0.  ][0.   0.1  0.9 ][1.   0.   0.  ][1.   0.   0.  ][0.   0.3  0.7 ][0.   0.   1.  ][1.   0.   0.  ][0.   1.   0.  ][0.   0.85 0.15][1.   0.   0.  ][0.   1.   0.  ][1.   0.   0.  ][0.   0.5  0.5 ][0.   0.2  0.8 ][0.   0.   1.  ][0.   0.   1.  ][0.   0.5  0.5 ][0.   0.9  0.1 ][0.   0.   1.  ][0.   0.9  0.1 ]]
根据概率计算各个鸢尾花的分类:y_proba_predict =[2 2 1 2 0 1 1 2 0 0 0 2 0 0 2 2 0 1 1 0 1 0 1 2 2 2 1 1 2 1]
直接用knn.predict()方法预测测试集中各个鸢尾花的分类:y_predict =[2 2 1 2 0 1 1 2 0 0 0 2 0 0 2 2 0 1 1 0 1 0 1 2 2 2 1 1 2 1]
实际测试集中各个鸢尾花的分类:y_test =[2 2 1 2 0 1 1 2 0 0 0 2 0 0 2 2 0 1 1 0 1 0 2 2 2 2 2 1 2 1]
knn模型准确度:predict_score =0.9333333333333333
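The fractional values in y_proba (0.55 against 0.45, for instance) come from weights='distance': each neighbor votes with the inverse of its distance instead of with a full vote. The sketch below shows that idea together with the Minkowski distance selected by p. It is a plain-Python illustration under stated assumptions, not scikit-learn's code: the helper names and the three neighbor points are hypothetical, and a neighbor at distance 0 is not handled.

```python
from collections import defaultdict


def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def weighted_class_probabilities(neighbors, query, p):
    """Inverse-distance-weighted class shares among already selected neighbors.

    `neighbors` is a list of (point, label) pairs. An exact match
    (distance 0) would need special handling and is not covered here.
    """
    weights = defaultdict(float)
    for point, label in neighbors:
        weights[label] += 1.0 / minkowski(point, query, p)
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


# Hypothetical neighbors of a query flower, using two features only.
neighbors = [((6.0, 3.0), 1), ((6.4, 3.0), 2), ((6.2, 3.4), 2)]
query = (6.1, 3.0)

print(minkowski((0, 0), (3, 4), p=1))  # 7.0
print(minkowski((0, 0), (3, 4), p=2))  # 5.0
print(weighted_class_probabilities(neighbors, query, p=1))
```

In this made-up case class 2 has two of the three neighbors, yet class 1 gets the larger share because its single neighbor is much closer. With uniform weights the same neighbors would favor class 2.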


import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

if __name__ == '__main__':
    # 1. Load the data
    iris = datasets.load_iris()
    X_data = iris.data  # 4 columns = 4 features: sepal length, sepal width, petal length, petal width
    y_data = iris.target  # 0 = setosa, 1 = versicolor, 2 = virginica
    # 2. Feature engineering
    # 2.1 Reduce the dimensionality (by slicing the ndarray)
    X_data = X_data[:, :2]  # keep only the first 2 features
    # 2.2 Scatter plot
    plt.scatter(x=X_data[:, 0], y=X_data[:, 1], c=y_data)  # color c follows the target value
    # 2.3 Shuffle the sample order
    index = np.arange(150)
    np.random.shuffle(index)
    # 2.4 Split into training features, test features, training targets and test targets
    X_train = X_data[index[:120]]
    X_test = X_data[index[120:]]
    y_train = y_data[index[:120]]
    y_test = y_data[index[120:]]
    # 3. Algorithm
    # 3.1 Instantiate a k-nearest-neighbors estimator
    # p = 1 selects Manhattan distance, p = 2 selects Euclidean distance; n_jobs is the number of parallel jobs.
    # Rule of thumb: n_neighbors should not exceed the square root of the number of samples.
    knn = KNeighborsClassifier(n_neighbors=5, weights='distance', p=1, n_jobs=4)
    # 3.2 Train the estimator on X_train and y_train
    knn.fit(X_train, y_train)
    # 4. Model evaluation
    # 4.1 Predict class probabilities for the test set and derive the classes from them
    y_proba = knn.predict_proba(X_test)  # class probabilities of each test flower
    y_proba_predict = y_proba.argmax(axis=1)  # class with the highest probability
    # 4.2 Predict the test set directly with knn.predict()
    y_predict = knn.predict(X_test)
    print('用knn.predict()方法预测测试集中各个鸢尾花的分类:y_predict =\n', y_predict)
    print('实际测试集中各个鸢尾花的分类:y_test =\n', y_test)
    # 4.3 Accuracy of the knn model
    predict_score = knn.score(X_test, y_test)
    print('knn模型准确度:predict_score =\n', predict_score)
    # 5. Plotting
    # 5.1 Build the grid of sample points
    # N, M: number of samples per axis. Larger values give a sharper decision boundary
    # (N, M = 500, 500 works well); N, M = 5, 5 is used here only to keep the arrays easy to inspect.
    N, M = 5, 5
    x1_min, x1_max = X_train[:, 0].min(), X_train[:, 0].max()  # range of column 0 of X_train
    x2_min, x2_max = X_train[:, 1].min(), X_train[:, 1].max()  # range of column 1 of X_train
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    print('type(t1) =', type(t1), '----t1.shape =', t1.shape, "----t1 =\n", t1)
    print('type(t2) =', type(t2), '----t2.shape =', t2.shape, "----t2 =\n", t2)
    x1, x2 = np.meshgrid(t1, t2)  # numpy.meshgrid() builds the coordinate matrices of the grid
    print('生成的网格采样点:x1:', '----type(x1) =', type(x1), '----x1.shape =', x1.shape, "----x1 =\n", x1)
    print('生成的网格采样点:x2:', '----type(x2) =', type(x2), '----x2.shape =', x2.shape, "----x2 =\n", x2)
    x1_ravel = x1.ravel()  # flatten the 2D array to 1D
    x2_ravel = x2.ravel()  # flatten the 2D array to 1D
    print('将生成的网格采样点x1转换为1D的迭代器----x1_ravel', '----type(x1_ravel) =', type(x1_ravel), '----x1_ravel.shape =', x1_ravel.shape, "----x1_ravel =\n", x1_ravel)
    print('将生成的网格采样点x2转换为1D的迭代器----x2_ravel', '----type(x2_ravel) =', type(x2_ravel), '----x2_ravel.shape =', x2_ravel.shape, "----x2_ravel =\n", x2_ravel)
    # 5.2 Build the points to be plotted
    # Stack x1_ravel and x2_ravel, each of shape (N*M,), into test samples of shape (N*M, 2)
    x_test = np.stack((x1_ravel, x2_ravel), axis=1)
    print('通过numpy.stack()堆叠生成的测试点:x_test:', '----type(x_test) =', type(x_test), '----x_test.shape =', x_test.shape, "----x_test =\n", x_test)
    # 5.3 Predict the class of every grid point
    y_hat = knn.predict(x_test)  # predicted values
    print('y_hat:', '----type(y_hat) =', type(y_hat), '----y_hat.shape =', y_hat.shape, "----y_hat =\n", y_hat)
    y_hat = y_hat.reshape(x1.shape)  # same shape as the grid
    print('y_hat:', '----type(y_hat) =', type(y_hat), '----y_hat.shape =', y_hat.shape, "----y_hat =\n", y_hat)
    # 5.4 Figure setup
    cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    mpl.rcParams['font.sans-serif'] = [u'simHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(facecolor='w')
    plt.xlabel(u'花萼长度', fontsize=14)
    plt.ylabel(u'花萼宽度', fontsize=14)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.grid()
    patchs = [mpatches.Patch(color='#77E0A0', label='Iris-setosa'),
              mpatches.Patch(color='#FF8080', label='Iris-versicolor'),
              mpatches.Patch(color='#A0A0FF', label='Iris-virginica')]
    plt.legend(handles=patchs, fancybox=True, framealpha=0.8)
    plt.title(u'鸢尾花k-近邻三分类效果', fontsize=17)
    # 5.5 Draw
    plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)  # predicted class of each grid point
    # c=y_data is required for cmap to take effect on the sample points
    plt.scatter(X_data[:, 0], X_data[:, 1], c=y_data, edgecolors='k', s=30, cmap=cm_dark)
    plt.show()


用knn.predict()方法预测测试集中各个鸢尾花的分类:y_predict =[0 0 2 2 0 2 0 0 2 2 0 2 0 2 1 0 1 1 1 0 2 0 0 2 1 1 1 1 0 2]
实际测试集中各个鸢尾花的分类:y_test =[0 0 1 1 0 1 0 0 2 1 0 2 0 2 1 0 2 2 1 0 1 0 0 2 1 1 1 1 0 2]
knn模型准确度:predict_score =0.7666666666666667
type(t1) = <class 'numpy.ndarray'> ----t1.shape = (5,) ----t1 =[4.4   5.275 6.15  7.025 7.9  ]
type(t2) = <class 'numpy.ndarray'> ----t2.shape = (5,) ----t2 =[2.  2.6 3.2 3.8 4.4]
生成的网格采样点:x1: ----type(x1) = <class 'numpy.ndarray'> ----x1.shape = (5, 5) ----x1 =[[4.4   5.275 6.15  7.025 7.9  ][4.4   5.275 6.15  7.025 7.9  ][4.4   5.275 6.15  7.025 7.9  ][4.4   5.275 6.15  7.025 7.9  ][4.4   5.275 6.15  7.025 7.9  ]]
生成的网格采样点:x2: ----type(x2) = <class 'numpy.ndarray'> ----x2.shape = (5, 5) ----x2 =[[2.  2.  2.  2.  2. ][2.6 2.6 2.6 2.6 2.6][3.2 3.2 3.2 3.2 3.2][3.8 3.8 3.8 3.8 3.8][4.4 4.4 4.4 4.4 4.4]]
将生成的网格采样点x1转换为1D的迭代器----x1_ravel ----type(x1_ravel) = <class 'numpy.ndarray'> ----x1_ravel.shape = (25,) ----x1_ravel =[4.4   5.275 6.15  7.025 7.9   4.4   5.275 6.15  7.025 7.9   4.4   5.275  6.15  7.025 7.9   4.4   5.275 6.15  7.025 7.9   4.4   5.275 6.15  7.0257.9  ]
将生成的网格采样点x2转换为1D的迭代器----x2_ravel ----type(x2_ravel) = <class 'numpy.ndarray'> ----x2_ravel.shape = (25,) ----x2_ravel =[2.  2.  2.  2.  2.  2.6 2.6 2.6 2.6 2.6 3.2 3.2 3.2 3.2 3.2 3.8 3.8 3.8  3.8 3.8 4.4 4.4 4.4 4.4 4.4]
通过numpy.stack()堆叠生成的测试点:x_test: ----type(x_test) = <class 'numpy.ndarray'> ----x_test.shape = (25, 2) ----x_test =[[4.4   2.   ][5.275 2.   ][6.15  2.   ][7.025 2.   ][7.9   2.   ][4.4   2.6  ][5.275 2.6  ][6.15  2.6  ][7.025 2.6  ][7.9   2.6  ][4.4   3.2  ][5.275 3.2  ][6.15  3.2  ][7.025 3.2  ][7.9   3.2  ][4.4   3.8  ][5.275 3.8  ][6.15  3.8  ][7.025 3.8  ][7.9   3.8  ][4.4   4.4  ][5.275 4.4  ][6.15  4.4  ][7.025 4.4  ][7.9   4.4  ]]
y_hat: ----type(y_hat) = <class 'numpy.ndarray'> ----y_hat.shape = (25,) ----y_hat =[1 1 1 1 2 0 1 2 2 2 0 0 1 1 2 0 0 0 2 2 0 0 0 2 2]
y_hat: ----type(y_hat) = <class 'numpy.ndarray'> ----y_hat.shape = (5, 5) ----y_hat =[[1 1 1 1 2][0 1 2 2 2][0 0 1 1 2][0 0 0 2 2][0 0 0 2 2]]
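The printed arrays show how meshgrid, ravel, stack and reshape fit together. The same relationships can be reproduced with plain lists, which may make the row-by-row ordering easier to see. This is only an illustration: the `linspace` helper is a simplified stand-in for numpy.linspace, the axis limits are copied from the printed t1 and t2, and the list of integers stands in for the output of knn.predict.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (num >= 2)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


t1 = linspace(4.4, 7.9, 5)   # sepal-length axis
t2 = linspace(2.0, 4.4, 5)   # sepal-width axis

# meshgrid: every row of x1 repeats t1, row r of x2 repeats t2[r].
x1 = [[a for a in t1] for _ in t2]
x2 = [[b for _ in t1] for b in t2]

# ravel + stack: flatten row by row and pair the coordinates up.
grid_points = [(x1[r][c], x2[r][c])
               for r in range(len(t2)) for c in range(len(t1))]

# reshape: a flat list of 25 predictions goes back to 5 rows of 5,
# so entry [r][c] belongs to the grid point (x1[r][c], x2[r][c]).
flat = list(range(len(grid_points)))   # stand-in for knn.predict(x_test)
reshaped = [flat[r * len(t1):(r + 1) * len(t1)] for r in range(len(t2))]

print(len(grid_points))
print(grid_points[0], grid_points[5])
print(reshaped[1])
```

The first five grid points share the lowest sepal width and sweep the sepal-length axis, which is why the first row of the reshaped y_hat corresponds to the bottom edge of the plot.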


import numpy as np
import matplotlib.pylab as pyb
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from matplotlib.colors import ListedColormap

if __name__ == '__main__':
    # 1. Load the data
    X, y = datasets.load_iris(return_X_y=True)
    # 4 attributes, i.e. 4-dimensional data; 150 is the number of samples
    print('X.shape =', X.shape)
    # 2. Feature engineering
    # Dimensionality reduction by slicing: crude, and it discards information
    X = X[:, :2]
    print('X.shape =', X.shape)
    pyb.scatter(X[:, 0], X[:, 1], c=y)
    # 3. Algorithm
    knn = KNeighborsClassifier(n_neighbors=5)  # train with the KNN algorithm
    knn.fit(X, y)  # all 150 samples are used as training data
    # 4. Plotting
    # Num: number of samples per axis. Larger values give a sharper decision boundary
    # (500 works well, giving 500 * 500 test points); 5 is used here only to keep the arrays easy to inspect.
    # The test points are the background grid built with meshgrid, shape (?, 2):
    # x from 4 to 8, y from 2 to 4.5.
    Num = 5
    x1 = np.linspace(4, 8, Num)
    y1 = np.linspace(2, 4.5, Num)
    X1, Y1 = np.meshgrid(x1, y1)
    print('x1 =\n', x1)
    print('y1 =\n', y1)
    print('X1 =\n', X1)
    print('Y1 =\n', Y1)
    # Alternative: flatten with reshape and concatenate
    # X1 = X1.reshape(-1,1)
    # Y1 = Y1.reshape(-1,1)
    # X_test = np.concatenate([X1,Y1],axis = 1)
    # print(X_test.shape)
    X_test = np.c_[X1.ravel(), Y1.ravel()]
    print('X_test.shape = ', X_test.shape)
    y_ = knn.predict(X_test)
    print('y_ =\n', y_)
    lc = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    lc2 = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
    pyb.scatter(X_test[:, 0], X_test[:, 1], c=y_, cmap=lc)  # predicted class of each grid point
    pyb.scatter(X[:, 0], X[:, 1], c=y, cmap=lc2)  # the samples
    pyb.contourf(X1, Y1, y_.reshape(Num, Num), cmap=lc)
    pyb.scatter(X[:, 0], X[:, 1], c=y, cmap=lc2)
    pyb.show()  # display the figure when run as a script


X.shape = (150, 4)
X.shape = (150, 2)
x1 =[4. 5. 6. 7. 8.]
y1 =[2.    2.625 3.25  3.875 4.5  ]
X1 =[[4. 5. 6. 7. 8.][4. 5. 6. 7. 8.][4. 5. 6. 7. 8.][4. 5. 6. 7. 8.][4. 5. 6. 7. 8.]]
Y1 =[[2.    2.    2.    2.    2.   ][2.625 2.625 2.625 2.625 2.625][3.25  3.25  3.25  3.25  3.25 ][3.875 3.875 3.875 3.875 3.875][4.5   4.5   4.5   4.5   4.5  ]]
X_test.shape =  (25, 2)
y_ =[0 1 1 1 2 0 1 1 2 2 0 0 1 2 2 0 0 0 2 2 0 0 0 2 2]


import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
# precision, recall and F1 score (their harmonic mean)
from sklearn.metrics import classification_report

if __name__ == '__main__':
    # 1. Read the data
    # 1.1 Build the column names
    columns = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
    # 1.2 Read the data
    data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', names=columns)
    print('data.shape = ', data.shape, '----type(data) = ', type(data), '----data.head() = \n', data.head())
    # 1.3 Drop the rows that contain '?'
    data = data.replace(to_replace='?', value=np.nan)
    data = data.dropna()
    # 2. Feature engineering
    # 2.1 Extract the feature values and the target values
    x_data = data.iloc[:, 1:-1]
    y_data = data.iloc[:, -1]
    print('x_data.shape = ', x_data.shape, '----type(x_data) = ', type(x_data), '----x_data.head() =\n', x_data.head())
    print('y_data.shape = ', y_data.shape, '----type(y_data) = ', type(y_data), '----y_data.head() =\n', y_data.head())
    # 2.2 Normalize or standardize the feature values
    # # 2.2.1 Min-max normalization by hand
    # x_data_normal01 = (x_data - x_data.min()) / (x_data.max() - x_data.min())
    # print('x_data_normal01.shape = ', x_data_normal01.shape, '----type(x_data_normal01) = ', type(x_data_normal01), '----x_data_normal01 =\n', x_data_normal01)
    # # 2.2.2 Min-max normalization with MinMaxScaler
    # mms = MinMaxScaler()
    # x_data_normal02 = mms.fit_transform(x_data)
    # print('x_data_normal02.shape = ', x_data_normal02.shape, '----type(x_data_normal02) = ', type(x_data_normal02), '----x_data_normal02 =\n', x_data_normal02)
    # # 2.2.3 Standardization by hand
    # x_data_standard01 = (x_data - x_data.mean()) / x_data.std()
    # print('x_data_standard01.shape = ', x_data_standard01.shape, '----type(x_data_standard01) = ', type(x_data_standard01), '----x_data_standard01 =\n', x_data_standard01)
    # 2.2.4 Standardization with StandardScaler
    # Note: the scaler is fitted on all samples before the split; fitting it on the
    # training set only would avoid leaking test-set statistics into training.
    std = StandardScaler()
    x_data_standard02 = std.fit_transform(x_data)
    print('x_data_standard02.shape = ', x_data_standard02.shape, '----type(x_data_standard02) = ', type(x_data_standard02), '----x_data_standard02 =\n', x_data_standard02)
    x_data = x_data_standard02
    # 3. Algorithm
    # 3.1 Split into a training set and a test set
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, train_size=0.8)
    # 3.2 Instantiate the knn estimator
    knn = KNeighborsClassifier()
    # 3.3 Train the estimator on the training features and targets
    knn.fit(x_train, y_train)
    # 4. Model evaluation
    y_predict = knn.predict(x_test)
    print('y_predict =\n', y_predict)
    test_score = knn.score(x_test, y_test)
    print('test_score = ', test_score)
    # 5. Model tuning (grid search)
    params = {'n_neighbors': list(range(1, 30)), 'weights': ['uniform', 'distance'], 'p': [1, 2]}
    gcv = GridSearchCV(knn, params, scoring='accuracy', cv=10)
    gcv.fit(x_train, y_train)
    y_predict_gcv = gcv.predict(x_test)
    print('y_predict_gcv =\n', y_predict_gcv)
    # Inspect the best model
    best_params_ = gcv.best_params_
    best_estimator_ = gcv.best_estimator_
    best_score_ = gcv.best_score_
    print('best_params_ = ', best_params_)
    print('best_estimator_ = ', best_estimator_)
    print('best_score_ = ', best_score_)
    # Accuracy, method 1: accuracy_score() on the predictions of gcv
    accuracy_score01 = accuracy_score(y_test, y_predict_gcv)
    print('accuracy_score01 = ', accuracy_score01)
    # Accuracy, method 2: take the best model out and predict with it
    knn_best = gcv.best_estimator_
    y_predict_best = knn_best.predict(x_test)
    accuracy_score02 = accuracy_score(y_test, y_predict_best)
    print('accuracy_score02 = ', accuracy_score02)
    # Accuracy, method 3: GridSearchCV.score() gives the same result
    score_gcv = gcv.score(x_test, y_test)
    print('score_gcv = ', score_gcv)
    # 6. Cross tabulation
    print('目标值大小:y_test.shape =', y_test.shape)
    cros_tab = pd.crosstab(index=y_test, columns=y_predict_best, rownames=['True'], colnames=['Predict'], margins=True)
    print('cros_tab =\n', cros_tab)
    # 7. Confusion matrix
    confu_matrix = confusion_matrix(y_test, y_predict_best)
    print('混淆矩阵:confu_matrix =\n', confu_matrix)
    # 8. Evaluation metrics
    print('y_test.value_counts() =\n', y_test.value_counts())
    class_report = classification_report(y_test, y_predict_best, target_names=['B', 'M'])
    print('class_report = \n', class_report)


data.shape =  (699, 11) ----type(data) =  <class 'pandas.core.frame.DataFrame'> ----data.head() = 
   Sample code number  Clump Thickness  ...    Mitoses  Class
0             1000025                5  ...          1      2
1             1002945                5  ...          1      2
2             1015425                3  ...          1      2
3             1016277                6  ...          1      2
4             1017023                4  ...          1      2
[5 rows x 11 columns]
x_data.shape =  (683, 9) ----type(x_data) =  <class 'pandas.core.frame.DataFrame'> ----x_data.head() =
   Clump Thickness  Uniformity of Cell Size   ...     Normal Nucleoli  Mitoses
0                5                        1   ...                   1        1
1                5                        4   ...                   2        1
2                3                        1   ...                   1        1
3                6                        8   ...                   7        1
4                4                        1   ...                   1        1
[5 rows x 9 columns]
y_data.shape =  (683,) ----type(y_data) =  <class 'pandas.core.series.Series'> ----y_data.head() =
0    2
1    2
2    2
3    2
4    2
Name: Class, dtype: int64x_data_normal.shape =  (683, 9) ----type(x_data_normal) =  <class 'numpy.ndarray'> ----x_data_normal =[[0.44444444 0.         0.         ... 0.22222222 0.         0.        ][0.44444444 0.33333333 0.33333333 ... 0.22222222 0.11111111 0.        ][0.22222222 0.         0.         ... 0.22222222 0.         0.        ]...[0.44444444 1.         1.         ... 0.77777778 1.         0.11111111][0.33333333 0.77777778 0.55555556 ... 1.         0.55555556 0.        ][0.33333333 0.77777778 0.77777778 ... 1.         0.33333333 0.        ]]x_data_standard.shape =  (683, 9) ----type(x_data_standard) =  <class 'numpy.ndarray'> ----x_data_standard =[[ 0.19790469 -0.70221201 -0.74177362 ... -0.18182716 -0.61292736-0.34839971][ 0.19790469  0.27725185  0.26278299 ... -0.18182716 -0.28510482-0.34839971][-0.51164337 -0.70221201 -0.74177362 ... -0.18182716 -0.61292736-0.34839971]...[ 0.19790469  2.23617957  2.2718962  ...  1.86073779  2.337475540.22916583][-0.15686934  1.58320366  0.93248739 ...  2.67776377  1.02618536-0.34839971][-0.15686934  1.58320366  1.6021918  ...  2.67776377  0.37054027-0.34839971]]y_predict =[4 4 4 2 2 2 4 4 2 2 2 4 4 2 4 4 4 4 4 2 2 2 4 2 2 2 2 2 4 4 4 2 2 2 4 4 22 2 2 2 2 4 2 2 2 4 2 2 2 4 4 4 4 4 4 4 2 4 2 4 4 2 2 2 2 4 2 2 2 2 2 4 24 2 2 2 4 2 2 4 2 4 2 4 2 4 2 2 4 2 2 4 4 2 4 2 4 2 2 2 4 2 4 2 2 4 4 2 22 2 2 2 2 4 2 4 2 2 4 2 4 2 4 4 4 2 4 2 4 2 2 2 2 2]test_score =  0.9635036496350365y_predict_gcv =[4 4 4 2 2 2 4 4 2 2 2 4 4 2 4 4 4 4 4 2 2 2 4 2 2 2 2 2 2 4 4 2 2 2 4 4 22 2 2 2 2 4 2 2 2 4 2 2 2 4 4 4 4 4 4 4 2 4 2 4 4 2 2 2 2 4 2 2 2 2 2 4 24 2 2 2 4 2 2 4 2 4 2 4 2 4 2 2 4 2 2 4 4 2 2 2 4 2 2 2 4 2 4 2 2 4 4 2 22 2 2 2 2 4 2 4 2 2 4 2 4 2 4 4 4 2 4 2 4 2 2 2 2 2]best_params_ =  {'n_neighbors': 1, 'p': 1, 'weights': 'uniform'}
best_estimator_ =  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=1, p=1,  weights='uniform')
best_score_ =  0.978021978021978
accuracy_score01 =  0.948905109489051
accuracy_score02 =  0.948905109489051
score_gcv =  0.948905109489051
目标值大小:y_test.shape = (137,)
cros_tab =
Predict   2   4  All
2        79   3   82
4         4  51   55
All      83  54  137
混淆矩阵:confu_matrix =
[[79  3]
 [ 4 51]]
y_test.value_counts() =
2    82
4    55
Name: Class, dtype: int64
class_report = 
             precision    recall  f1-score   support
          B       0.95      0.96      0.96        82
          M       0.94      0.93      0.94        55
avg / total       0.95      0.95      0.95       137
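The figures in class_report can be recomputed by hand from the confusion matrix, which helps when checking what precision and recall mean for each class. The sketch below uses only the four counts printed above and plain Python. The variable names are made up for this example, and the mapping of class 2 to B (benign) and class 4 to M (malignant) follows the target_names passed in the code.

```python
# Confusion matrix from the output above: rows are true classes, columns are
# predicted classes, in the order [2 (B, benign), 4 (M, malignant)].
matrix = [[79, 3],
          [4, 51]]
names = ["B", "M"]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total

report = {}
for i, name in enumerate(names):
    true_positive = matrix[i][i]
    predicted = sum(row[i] for row in matrix)   # column sum: predicted as this class
    actual = sum(matrix[i])                     # row sum: truly this class (support)
    precision = true_positive / predicted
    recall = true_positive / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (precision, recall, f1, actual)

print(f"accuracy = {accuracy:.4f}")
for name, (precision, recall, f1, support) in report.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")
```

The rounded values agree with the printed report (B: 0.95, 0.96, 0.96; M: 0.94, 0.93, 0.94), and the accuracy of 130/137 matches accuracy_score01.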

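The standardization step in the example above can also be written out for a single feature column. This plain-Python sketch makes two points. First, the mean and standard deviation are taken from the training values and then reused for the test values. Second, StandardScaler divides by the population standard deviation, while pandas' .std() defaults to the sample standard deviation, so the manual pandas formula in the commented-out code gives slightly different numbers. The helper names and the data are hypothetical.

```python
import statistics


def fit_standardizer(column):
    """Return (mean, population standard deviation) of one feature column."""
    return statistics.fmean(column), statistics.pstdev(column)


def standardize(column, mean, std):
    """Apply (value - mean) / std to every value of the column."""
    return [(value - mean) / std for value in column]


train = [5.0, 3.0, 6.0, 4.0, 2.0]   # hypothetical training values of one feature
test = [1.0, 7.0]                   # hypothetical test values of the same feature

mean, std = fit_standardizer(train)          # statistics from the training set only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # the same statistics are reused

print(mean, std)
print(statistics.stdev(train))   # sample standard deviation, slightly larger
print(test_scaled)
```

Because the test values are scaled with the training statistics, they are not guaranteed to have mean 0 and standard deviation 1, which is expected.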

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier


def knncls():
    """Predict users' check-in locations with k-nearest neighbors."""
    # 1. Read the data (pandas)
    myDataFrame = pd.read_csv("I:/AI_Data/facebook-v-predicting-check-ins/train.csv")
    print('\nmyDataFrame.head(5) = \n', myDataFrame.head(5))
    # 2. Process the data (pandas)
    # 2.1 Shrink the data by filtering on a small x/y window
    myDataFrame = myDataFrame.sort_values(by='row_id').query("x > 1.0 &  x < 1.25 & y > 2.5 & y < 2.75")
    print('\nmyDataFrame.count() = \n', myDataFrame.count())
    # Convert the timestamps to a Series of datetimes (index = row index, value = yyyy-mm-dd hh:mm:ss)
    dateTimeSeries = pd.to_datetime(myDataFrame['time'], unit='s')
    print('\ntype(dateTimeSeries) = ', type(dateTimeSeries))
    print('\ndateTimeSeries.head(5) = \n', dateTimeSeries.head(5))
    # 2.2 Convert dateTimeSeries to a DatetimeIndex, which exposes month, day, hour, ... as attributes
    dateTimeIndexMap = pd.DatetimeIndex(dateTimeSeries)
    print('\ntype(dateTimeIndexMap) = ', type(dateTimeIndexMap))
    print('\ndateTimeIndexMap = \n', dateTimeIndexMap)
    # 2.3 Build some features (adding a feature may raise or lower the accuracy)
    # myDataFrame['year'] = dateTimeIndexMap.year   # the year is useless: future data will never repeat it
    myDataFrame['month'] = dateTimeIndexMap.month
    myDataFrame['day'] = dateTimeIndexMap.day
    myDataFrame['hour'] = dateTimeIndexMap.hour
    # myDataFrame['minute'] = dateTimeIndexMap.minute
    # myDataFrame['weekday'] = dateTimeIndexMap.weekday
    # 2.4 Drop the timestamp feature
    myDataFrame = myDataFrame.drop(['time'], axis=1)
    print('\nmyDataFrame.head(5) = \n', myDataFrame.head(5))
    # 2.5 Drop the target places that have too few check-ins
    # Group and count: place_id becomes the index and every other column holds the group size
    placeCountDataFrame = myDataFrame.groupby('place_id').count()
    print('\ntype(placeCountDataFrame) = \n', type(placeCountDataFrame))
    print('\nplaceCountDataFrame = \n', placeCountDataFrame.head(5))
    # Keep the groups with more than 3 members, then use reset_index to turn the
    # place_id index back into a regular column; the new index is 0, 1, 2, 3, ...
    placeCountDataFrame = placeCountDataFrame[placeCountDataFrame.row_id > 3].reset_index()
    print('\nplaceCountDataFrame = \n', placeCountDataFrame.head(5))
    # Keep only the samples of myDataFrame whose place_id is in placeCountDataFrame
    myDataFrame = myDataFrame[myDataFrame['place_id'].isin(placeCountDataFrame.place_id)]
    print('\nmyDataFrame.head(5) = \n', myDataFrame.head(5))
    # 2.6 Separate the feature values from the target values
    y_Series = myDataFrame['place_id']  # target values
    x_DataFrame = myDataFrame.drop(['place_id'], axis=1)  # feature values
    # 2.7 Drop features that carry no information, to improve accuracy
    x_DataFrame = x_DataFrame.drop(['row_id'], axis=1)
    print('\n特征数据值:x_DataFrame = \n', type(x_DataFrame), '\n', x_DataFrame.head(5))
    print('\n目标值:y_Series = \n', type(y_Series), '\n', y_Series.head(5))
    # 3. Feature engineering (scikit-learn)
    # 3.1 Preprocessing: standardize the feature values (the targets are not standardized)
    #     so that no single feature dominates the distance, which improves accuracy.
    std = StandardScaler()
    x_DataFrame = std.fit_transform(x_DataFrame)
    print('\n标准化后的x_DataFrame:\n', x_DataFrame)
    # 3.2 Split the data into a training set and a test set
    x_train_DataFrame, x_test_DataFrame, y_train_Series, y_test_Series = train_test_split(x_DataFrame, y_Series, test_size=0.25)
    print('\n特征数据值of训练集 x_train_DataFrame:\n', x_train_DataFrame)
    print('\n特征数据值of测试集 x_test_DataFrame:\n', x_test_DataFrame)
    print('\n目标值of训练集 y_train_Series:\n', y_train_Series.head(5))
    print('\n目标值of测试集 y_test_Series:\n', y_test_Series.head(5))
    # 4. Algorithm
    # 4.1 Instantiate a k-nearest-neighbors estimator
    knn_estimator = KNeighborsClassifier(n_neighbors=5)
    # 4.2 Call fit to train it
    knn_estimator.fit(x_train_DataFrame, y_train_Series)
    # 5. Model evaluation
    # 5.1 Predict on the test set
    predictTestSeries = knn_estimator.predict(x_test_DataFrame)
    print('\n预测的目标值的:\n', predictTestSeries)
    print('\n预测值与真实值的对比情况:\n', predictTestSeries == y_test_Series)
    # 5.2 Compute the accuracy
    predictScore = knn_estimator.score(x_test_DataFrame, y_test_Series)  # takes the test features and test targets
    print('\n预测的准确率为:\n', predictScore)


if __name__ == "__main__":
    knncls()


myDataFrame.head(5) =
   row_id       x       y  accuracy    time    place_id
0       0  0.7941  9.0809        54  470702  8523065625
1       1  5.9567  4.7968        13  186555  1757726713
2       2  8.3078  7.0407        74  322648  1137537235
3       3  7.3665  2.5165        65  704587  6567393236
4       4  4.0961  1.1307        31  472130  7440663949
myDataFrame.count() =
row_id      17710
x           17710
y           17710
accuracy    17710
time        17710
place_id    17710
dtype: int64
type(dateTimeSeries) =  <class 'pandas.core.series.Series'>
dateTimeSeries.head(5) =
600    1970-01-01 18:09:40
957    1970-01-10 02:11:10
4345   1970-01-05 15:08:02
4735   1970-01-06 23:03:03
5580   1970-01-09 11:26:50
Name: time, dtype: datetime64[ns]
type(dateTimeIndexMap) =  <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
dateTimeIndexMap =
DatetimeIndex(['1970-01-01 18:09:40', '1970-01-10 02:11:10',
               '1970-01-05 15:08:02', '1970-01-06 23:03:03',
               '1970-01-09 11:26:50', '1970-01-02 16:25:07',
               '1970-01-04 15:52:57', '1970-01-01 10:13:36',
               '1970-01-09 15:26:06', '1970-01-08 23:52:02',
               ...
               '1970-01-07 10:03:36', '1970-01-09 11:44:34',
               '1970-01-04 08:07:44', '1970-01-04 15:47:47',
               '1970-01-08 01:24:11', '1970-01-01 10:33:56',
               '1970-01-07 23:22:04', '1970-01-08 15:03:14',
               '1970-01-04 00:53:41', '1970-01-08 23:01:07'],
              dtype='datetime64[ns]', name='time', length=17710, freq=None)
myDataFrame.head(5) =
      row_id       x       y  accuracy    place_id  month  day  hour
600      600  1.2214  2.7023        17  6683426742      1    1    18
957      957  1.1832  2.6891        58  6683426742      1   10     2
4345    4345  1.1935  2.6550        11  6889790653      1    5    15
4735    4735  1.1452  2.6074        49  6822359752      1    6    23
5580    5580  1.0089  2.7287        19  1527921905      1    9    11
type(placeCountDataFrame) = <class 'pandas.core.frame.DataFrame'>
placeCountDataFrame =
            row_id     x     y  accuracy  month   day  hour
1012023972       1     1     1         1      1     1     1
1057182134       1     1     1         1      1     1     1
1059958036       3     3     3         3      3     3     3
1085266789       1     1     1         1      1     1     1
1097200869    1044  1044  1044      1044   1044  1044  1044
placeCountDataFrame =
     place_id  row_id     x     y  accuracy  month   day  hour
0  1097200869    1044  1044  1044      1044   1044  1044  1044
1  1228935308     120   120   120       120    120   120   120
2  1267801529      58    58    58        58     58    58    58
3  1278040507      15    15    15        15     15    15    15
4  1285051622      21    21    21        21     21    21    21
myDataFrame.head(5) =
      row_id       x       y  accuracy    place_id  month  day  hour
600      600  1.2214  2.7023        17  6683426742      1    1    18
957      957  1.1832  2.6891        58  6683426742      1   10     2
4345    4345  1.1935  2.6550        11  6889790653      1    5    15
4735    4735  1.1452  2.6074        49  6822359752      1    6    23
5580    5580  1.0089  2.7287        19  1527921905      1    9    11
Feature values: x_DataFrame =
<class 'pandas.core.frame.DataFrame'>
           x       y  accuracy  month  day  hour
600   1.2214  2.7023        17      1    1    18
957   1.1832  2.6891        58      1   10     2
4345  1.1935  2.6550        11      1    5    15
4735  1.1452  2.6074        49      1    6    23
5580  1.0089  2.7287        19      1    9    11
Target values: y_Series =
<class 'pandas.core.series.Series'>
600     6683426742
957     6683426742
4345    6889790653
4735    6822359752
5580    1527921905
Name: place_id, dtype: int64
x_DataFrame after standardization:
[[ 1.27892477  0.9941573  -0.58835492  0.         -1.50340614  0.94055369]
 [ 0.78467442  0.80524744 -0.21403874  0.          1.80968818 -1.36413448]
 [ 0.91794088  0.31723029 -0.6431329   0.         -0.03091978  0.50842466]
 ...
 [-1.27513331  1.3018514  -0.17752009  0.          1.07344499  0.50842466]
 [ 1.04344424  0.66928958  0.05072149  0.         -0.39904137 -1.65222051]
 [-0.20123858 -1.30138377  0.88152082  0.          1.07344499  1.66076875]]
Training-set feature values x_train_DataFrame:
[[-0.35003122  0.37161343  0.70805723  0.         -0.03091978 -1.22009147]
 [-1.17162539  0.56195443 -0.1501311   0.         -1.50340614 -1.22009147]
 [ 1.40572198  1.21455214 -0.09535312  0.          1.07344499 -1.5081775 ]
 ...
 [ 0.47932604  0.75945111  7.38184088  0.         -0.03091978 -0.35583341]
 [ 0.69539883 -1.6548742  -0.11361244  0.          1.07344499 -0.35583341]
 [-1.38122894  0.26141601 -0.61574391  0.         -0.76716296  0.79651068]]
Test-set feature values x_test_DataFrame:
[[-0.42636832 -1.18259954 -0.51531762  0.          1.44156658  0.22033864]
 [ 1.04991348 -1.82804157  0.75370554  0.         -0.03091978  0.22033864]
 [ 0.02647886  0.63208006 -0.66139222  0.         -1.50340614 -0.64391943]
 ...
 [ 1.30221405  1.40775541 -0.19577941  0.          0.7053234   0.79651068]
 [ 0.65011412 -1.87670017 -0.34185402  0.          1.07344499  0.50842466]
 [-0.07314752  0.54334967 -0.12274211  0.         -0.76716296  0.36438165]]
Training-set target values y_train_Series:
9283986     5270522918
6133610     3312463746
12660108    8048985799
7743611     9980711012
5403574     6424972551
Name: place_id, dtype: int64
Test-set target values y_test_Series:
18871651    5035268417
2439401     3741484405
19545960    2215268322
7519621     8048985799
8706487     5270522918
Name: place_id, dtype: int64
Predicted target values:
[4932578245 1267801529 1435128522 ... 8048985799 1228935308 3312463746]
Predicted vs. actual values:
18871651    False
2439401     False
19545960    False
7519621     False
8706487     False
19539181     True
22742928    False
1397064      True
7166613     False
25205240     True
4711428      True
1250160      True
16672285     True
20566737     True
22569239     True
21670631    False
9025634     False
21535360     True
21590563     True
22327471    False
19630021    False
21627845     True
11118565    False
17705845    False
1378529      True
21488775    False
20664329    False
16538012     True
5969839     False
1040459      True
            ...
14696886     True
3034395      True
25438288     True
7242661     False
2836178      True
8692726     False
11822141     True
1088451     False
24377288    False
13031289    False
6547530     False
15231249     True
13378120     True
24832202    False
5806980     False
23706074    False
16672238    False
7546070      True
8210913      True
5644848      True
6458142     False
1752820      True
21721506    False
16328983    False
15980516     True
26763008     True
8184431     False
17790064     True
27489405     True
13893226     True
Name: place_id, Length: 4230, dtype: bool
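For a scikit-learn classifier, `score` reports the mean accuracy, which is the share of `True` values in a comparison like the one printed above. A minimal sketch of that calculation in plain Python follows; the function name `accuracy` and the place_id values are hypothetical and chosen only for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = [p == a for p, a in zip(predicted, actual)]
    return sum(matches) / len(matches)


# Hypothetical place_id values, not taken from the data set.
predicted = [101, 102, 103, 104]
actual = [101, 999, 103, 888]
print(accuracy(predicted, actual))  # 0.5
```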


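  1. The value of n_neighbors in the step knn_estimator = KNeighborsClassifier(n_neighbors=5): different values of n_neighbors give different results. Adjusting n_neighbors to get a better result is what "tuning" (hyperparameter tuning) means;
  2. The features constructed in step "2.3 Construct some features": a feature that carries no information about the target should be removed, otherwise it lowers the model's accuracy;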
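Tuning n_neighbors amounts to trying several values and comparing a validation score. The sketch below does this with leave-one-out validation on the six movies listed earlier in this document, using a small hand-written k-NN instead of scikit-learn. The English genre labels and the helper names `knn_predict` and `leave_one_out_accuracy` are assumptions made for this example.

```python
from collections import Counter
import math

# (fight scenes, kiss scenes, genre) for the six movies shown earlier
MOVIES = [
    (36, 1, "action"), (43, 2, "action"), (0, 10, "romance"),
    (59, 1, "action"), (1, 15, "romance"), (2, 19, "romance"),
]


def knn_predict(train, point, k):
    """Majority vote among the k training rows closest to point."""
    nearest = sorted(train, key=lambda row: math.dist(row[:2], point))[:k]
    votes = Counter(row[2] for row in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Hold out each row in turn, predict it from the rest, and average."""
    hits = 0
    for i, row in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, row[:2], k) == row[2]
    return hits / len(data)


scores = {k: leave_one_out_accuracy(MOVIES, k) for k in (1, 3, 5)}
print(scores)  # {1: 1.0, 3: 1.0, 5: 0.0}
```

With k=5 every held-out movie is outvoted by the other genre, because the five remaining movies always contain three of the opposite class. With scikit-learn, the same kind of search is usually run with cross-validation, for example through `GridSearchCV`.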



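  1. Choosing a small K means predicting from the training instances in a small neighborhood. The approximation error of "learning" decreases, because only training instances close to (similar to) the input instance affect the prediction. The price is that the estimation error of "learning" increases. In other words, a smaller K makes the overall model more complex and prone to overfitting.

  2. Choosing a large K means predicting from the training instances in a larger neighborhood. The advantage is a smaller estimation error; the drawback is a larger approximation error. Training instances far from (dissimilar to) the input instance now also influence the prediction and can make it wrong. A larger K makes the overall model simpler and prone to underfitting.

  3. K = N (where N is the number of training samples) is never advisable: whatever the input instance is, the model simply predicts the most frequent class in the training set. Such a model is too simple and ignores a great deal of useful information in the training instances.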
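The K = N case can be checked directly. In the sketch below the training points and the labels "A" and "B" are hypothetical; the only point is that with k equal to the size of the training set, the vote no longer depends on the query.

```python
from collections import Counter
import math

# Hypothetical training set: four "A" points and two "B" points.
TRAIN = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"), ((0.3, 0.3), "A"),
    ((5.0, 5.0), "B"), ((5.1, 4.9), "B"),
]


def knn_predict(point, k):
    """Majority vote among the k training points closest to point."""
    nearest = sorted(TRAIN, key=lambda item: math.dist(item[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


query = (5.0, 5.0)  # sits exactly on a "B" training point
print(knn_predict(query, k=1))           # B
print(knn_predict(query, k=len(TRAIN)))  # A
```

Even a query that coincides with a "B" sample is labeled "A" once all six samples vote, because "A" is the majority class.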



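  • Approximation error can be understood as the training error on the existing training set; it focuses on the training set.
  • If the approximation error is too small, overfitting may occur: the model predicts the existing training set very well but gives predictions with large deviations on unseen test samples.
  • The model itself is then not the one closest to the best model.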


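  • Estimation error can be understood as the test error on the test set; it focuses on the test set.
  • A small estimation error means the model predicts unseen data well.
  • The model itself is then the one closest to the best model.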
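Following the loose reading used above (approximation error as training error, estimation error as test error), the effect of K on both can be seen on a toy example. The one-dimensional data below are hypothetical, with one deliberately noisy label at x = 4; this is a sketch of the idea, not a formal definition of the two errors.

```python
from collections import Counter

# Hypothetical 1-D data as (feature, label); the point at 4 has a noisy label.
TRAIN = [(1, 0), (2, 0), (3, 0), (4, 1), (8, 1), (9, 1), (10, 1)]
TEST = [(3.8, 0), (4.3, 0), (8.5, 1)]


def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(TRAIN, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(rows, k):
    """Share of rows whose predicted label matches the given label."""
    return sum(knn_predict(x, k) == label for x, label in rows) / len(rows)


for k in (1, 3):
    print(k, round(accuracy(TRAIN, k), 3), round(accuracy(TEST, k), 3))
# 1 1.0 0.333
# 3 0.857 1.0
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect while the noisy point misleads two of the three test queries. With k=3 the noisy point is outvoted: the training accuracy drops slightly and the test accuracy improves.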



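  Advantages of KNN:

  • Simple and effective

  • No parameters to estimate

  • No training (no iterations) required

  • Low cost of retraining

  • Suited to samples whose class regions cross or overlap

  • Suited to automatic classification of large sample sets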
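The "no training" and "low cost of retraining" points can be made concrete with a toy classifier whose `fit` only stores the samples. The class `TinyKNN` and its `add_samples` method are hypothetical names for this sketch and are not scikit-learn's implementation; the data are the six movies shown earlier, with English genre labels assumed for readability.

```python
from collections import Counter
import math


class TinyKNN:
    """Illustrative k-NN classifier that memorises its training data."""

    def __init__(self, k=3):
        self.k = k
        self.points = []
        self.labels = []

    def fit(self, points, labels):
        # "Training" only stores the samples; no parameters are estimated.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def add_samples(self, points, labels):
        # Updating the model is just appending the new samples.
        self.points.extend(points)
        self.labels.extend(labels)

    def predict(self, point):
        # All the work happens at query time: measure every distance, then vote.
        order = sorted(range(len(self.points)),
                       key=lambda i: math.dist(self.points[i], point))
        votes = Counter(self.labels[i] for i in order[:self.k])
        return votes.most_common(1)[0][0]


model = TinyKNN(k=3).fit(
    [(36, 1), (43, 2), (59, 1), (0, 10), (1, 15), (2, 19)],
    ["action", "action", "action", "romance", "romance", "romance"],
)
print(model.predict((100, 3)))  # action
print(model.predict((1, 10)))   # romance
model.add_samples([(50, 0)], ["action"])  # no refit needed
```

The flip side is visible in `predict`: every query scans all stored samples, which is where the computation and memory costs of KNN come from.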


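  Disadvantages of KNN:

  • Lazy learning
    KNN is a lazy learning method (it does essentially no learning up front); some eager learning algorithms are much faster

  • Class scores are not normalized

  • The output is not very interpretable

  • Handles imbalanced samples poorly

  • Heavy computation and high memory cost

  • K must be specified, and a poorly chosen K gives no guarantee of classification accuracy.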
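One common way to soften the imbalance problem is to weight each neighbor's vote by the inverse of its distance, which is the idea behind the `weights='distance'` option of `KNeighborsClassifier` used earlier in this document. The sketch below is a hand-written illustration of that idea on hypothetical one-dimensional data, not scikit-learn's code; the function name `vote` and the labels are assumptions.

```python
from collections import defaultdict

# Hypothetical 1-D samples: two "rare" points near 0, five "common" points farther out.
SAMPLES = [(-1, "rare"), (1, "rare"), (-4, "common"), (4, "common"),
           (5, "common"), (6, "common"), (7, "common")]


def vote(x, k, weighted):
    """Return the winning label among the k nearest samples to x."""
    nearest = sorted(SAMPLES, key=lambda s: abs(s[0] - x))[:k]
    totals = defaultdict(float)
    for value, label in nearest:
        distance = abs(value - x)
        if weighted and distance == 0:
            return label  # an exact match decides the vote outright
        # Weighted: closer neighbors count more. Unweighted: one vote each.
        totals[label] += 1 / distance if weighted else 1.0
    return max(totals, key=totals.get)


print(vote(0, k=5, weighted=False))  # common
print(vote(0, k=5, weighted=True))   # rare
```

With plain voting the three "common" neighbors outnumber the two "rare" ones. With inverse-distance weights the two close "rare" points total 2.0 against 0.7 for the three distant "common" points.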




