More raw data files and Jupyter notebooks:
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python

Datacamp track: Data Scientist with Python - Course 21 (Chapter 1)

Exercise

k-Nearest Neighbors: Fit

Having explored the Congressional voting records dataset, it is time now to build your first classifier. In this exercise, you will fit a k-Nearest Neighbors classifier to the voting dataset, which has once again been pre-loaded for you into a DataFrame df.

In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman’s voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array X and response variable y: This is in accordance with the common scikit-learn practice.
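A minimal sketch of this shape contract (assuming df is the pre-loaded voting DataFrame, as in the solution below):

# X must have shape (n_samples, n_features); y must have shape (n_samples,)
X = df.drop('party', axis=1).values
y = df['party'].values
print(X.shape)  # (435, 16): one row per Congressman, one column per vote
print(y.shape)  # (435,): one party label per Congressman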

Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the n_neighbors parameter) and then fit it to the data. The data has been pre-loaded into a DataFrame called df.

Instruction

  • Import KNeighborsClassifier from sklearn.neighbors.
  • Create arrays X and y for the features and the target variable. Here this has been done for you. Note the use of .drop() to drop the target variable 'party' from the feature array X, as well as the use of the .values attribute to ensure X and y are NumPy arrays. Without .values, X and y are a DataFrame and a Series respectively; the scikit-learn API also accepts them in this form, as long as they have the right shape.
  • Instantiate a KNeighborsClassifier called knn with 6 neighbors by specifying the n_neighbors parameter.
  • Fit the classifier to the data using the .fit() method.
import pandas as pd

df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/votes-ch1.csv')

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

# Create arrays for the features and the response variable
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=6, p=2,
                     weights='uniform')

Exercise

k-Nearest Neighbors: Predict

Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the .predict() method on the X that was used to fit the model, but it is not a good indicator of the model’s ability to generalize to new, unseen data.

In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as X_new. You will use your classifier to predict the label for this new data point, as well as on the training data X that the model has already seen. Using .predict() on X_new will generate 1 prediction, while using it on X will generate 435 predictions: 1 for each sample.

The DataFrame has been pre-loaded as df. This time, you will create the feature array X and target variable array y yourself.

Instruction

  • Create arrays for the features and the target variable from df. As a reminder, the target variable is 'party'.
  • Instantiate a KNeighborsClassifier with 6 neighbors.
  • Fit the classifier to the data.
  • Predict the labels of the training data, X.
  • Predict the label of the new data point X_new.
import numpy as np

X_new = pd.DataFrame(np.random.rand(1, 16))  # modified by Jinny

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))
Prediction: ['democrat']
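As a quick check of the counts mentioned above (a small sketch, reusing y_pred and new_prediction from this exercise):

print(y_pred.shape)          # (435,): one prediction per training sample
print(new_prediction.shape)  # (1,): a single prediction for X_new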

Exercise

The digits recognition dataset

Up until now, you have been performing binary classification, since the target variable had two possible outcomes. Hugo, however, got to perform multi-class classification in the videos, where the target variable could take on three possible outcomes. Why does he get to have all the fun?! In the following exercises, you’ll be working with the MNIST digits recognition dataset, which has 10 classes, the digits 0 through 9! A reduced version of the MNIST dataset is one of scikit-learn’s included datasets, and that is the one we will use in this exercise.

Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black. Recall that scikit-learn's built-in datasets are of type Bunch, which are dictionary-like objects. Helpfully for the MNIST dataset, scikit-learn provides an 'images' key in addition to the 'data' and 'target' keys that you have seen with the Iris data. Because it is a 2D array of the images corresponding to each sample, this 'images' key is useful for visualizing the images, as you'll see in this exercise (for more on plotting 2D arrays, see Chapter 2 of DataCamp's course on Data Visualization with Python). On the other hand, the 'data' key contains the feature array - that is, the images as a flattened array of 64 pixels.

Notice that you can access the keys of these Bunch objects in two different ways: by using the . notation, as in digits.images, or the [] notation, as in digits['images'].
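To see how the 'images' and 'data' keys relate, here is a minimal sketch (it assumes digits = datasets.load_digits() has already been run, as in the solution below):

import numpy as np

# Each 8x8 image, flattened row by row, is exactly one row of the feature array
print(np.array_equal(digits.images[0].ravel(), digits.data[0]))  # True

# Both access styles return the same underlying array
print(digits['images'] is digits.images)  # True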

For more on the MNIST data, check out this exercise in Part 1 of DataCamp’s Importing Data in Python course. There, the full version of the MNIST dataset is used, in which the images are 28x28. It is a famous dataset in machine learning and computer vision, and frequently used as a benchmark to evaluate the performance of a new model.

Instruction

  • Import datasets from sklearn and matplotlib.pyplot as plt.
  • Load the digits dataset using the .load_digits() method on datasets.
  • Print the keys and DESCR of digits.
  • Print the shape of images and data keys using the . notation.
  • Display the 1011th image (index 1010) using plt.imshow(). This has been done for you, so hit 'Submit Answer' to see which handwritten digit this happens to be!
# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset: digits
digits = datasets.load_digits()

# Print the keys and DESCR of the dataset
print(digits.keys())
print(digits['DESCR'])

# Print the shape of the images and data keys
print(digits.images.shape)
print(digits.data.shape)

# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

.. topic:: References

  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionality reduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.
(1797, 8, 8)
(1797, 64)

Exercise

Train/Test Split + Fit/Predict/Accuracy

Now that you have learned about the importance of splitting your data into training and test sets, it’s time to practice doing this on the digits dataset! After creating arrays for the features and target variable, you will split them into training and test sets, fit a k-NN classifier to the training data, and then compute its accuracy using the .score() method.

Instruction

  • Import KNeighborsClassifier from sklearn.neighbors and train_test_split from sklearn.model_selection.
  • Create an array for the features using digits.data and an array for the target using digits.target.
  • Create stratified training and test sets using 0.2 for the size of the test set. Use a random state of 42. Stratify the split according to the labels so that they are distributed in the training and test sets as they are in the original dataset.
  • Create a k-NN classifier with 7 neighbors and fit it to the training data.
  • Compute and print the accuracy of the classifier’s predictions using the .score() method.
# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
0.9833333333333333
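To see what stratify=y guarantees, here is a short sketch (reusing y, y_train, and y_test from above): the class proportions in the training and test sets mirror those of the full dataset.

import numpy as np

# Fraction of samples per digit class: full data vs. the two splits
print(np.bincount(y) / len(y))
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))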

Exercise

Overfitting and underfitting

Remember the model complexity curve that Hugo showed in the video? You will now construct such a curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.

The training and testing sets are available to you in the workspace as X_train, X_test, y_train, y_test. In addition, KNeighborsClassifier has been imported from sklearn.neighbors.

Instruction

Inside the for loop:

  • Set up a k-NN classifier with the number of neighbors equal to k.
  • Fit the classifier with k neighbors to the training data.
  • Compute accuracy scores on the training set and test set separately using the .score() method, and assign the results to the train_accuracy and test_accuracy arrays respectively.
# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)

    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
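Reading the curve: training accuracy falls as k grows, while testing accuracy typically peaks at an intermediate k; very small k overfits and very large k underfits. As a small sketch, the best k according to test accuracy can be pulled out of the arrays above:

# k with the highest test accuracy (a rough model-selection heuristic)
best_k = neighbors[np.argmax(test_accuracy)]
print("Best k by test accuracy: {}".format(best_k))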
