Hiring notes: Machine Learning fundamentals (19 questions, with reference answers)

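This chapter introduces many fundamental concepts (and jargon) that every data scientist should know by heart. Chapter 1 is only a high-level overview (the only chapter that contains no code) and is fairly simple, but make sure every point is clear to you before moving on to the rest of the book. Grab a cup of coffee and let's get started!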

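Tip: if you already know all the Machine Learning basics, you can skip straight to Chapter 2. If you are not sure, try to answer the questions listed at the end of this chapter before moving on.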

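The exercises and the answers are kept apart on purpose, to encourage independent thinking and active learning rather than memorizing answers! The first half of this post is a translation, and corrections are welcome in the comments; the English original follows below.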


Exercises

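In this chapter we covered some of the most important concepts in Machine Learning. In the next chapter we will go deeper and write some code. Before starting it, make sure you can answer the following questions: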

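How would you define Machine Learning?
Which four types of problems is Machine Learning well suited to?
What is a labeled training set?
What are the two most common supervised tasks?
Can you name four common unsupervised tasks?
What type of Machine Learning algorithm would you use to let a robot walk over various unknown terrains?
What type of algorithm would you use to segment your customers into groups?
Is spam detection a supervised learning problem or an unsupervised learning problem?
What is an online learning system?
What is out-of-core learning?
What type of learning algorithm relies on a similarity measure to make predictions?
What is the difference between a model parameter and a learning algorithm's hyperparameter?
What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
What are the main challenges in Machine Learning?
If a model performs well on the training set but generalizes poorly to new instances, what is the problem? Give three possible solutions.
What is a test set, and why would you use one?
What is the purpose of a validation set?
What can go wrong if you tune hyperparameters using the test set?
What is cross-validation, and why would you prefer it to a validation set?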

Reference answers

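Machine Learning is about building systems that can learn from data. Learning means getting better at some task, according to some performance measure.
Machine Learning is well suited to:

complex problems for which we have no algorithmic solution;
replacing long lists of hand-tuned rules;
building systems that adapt to fluctuating environments;
and finally, helping humans learn (e.g., data mining). Tip: do not force every problem into Machine Learning, for example building a web page, or a problem that already has an efficient algorithm (such as checking graph connectivity).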

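A labeled training set is a training set that contains the desired solution (i.e., the label) for each instance.
The two most common supervised tasks are regression and classification.
Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.
Reinforcement Learning is likely to perform best if we want a robot to learn to walk over various unknown terrains, since this is typically the type of problem that Reinforcement Learning tackles. The problem could be expressed as a supervised or semi-supervised learning problem, but that would be less natural.
If you don't know how to define the groups, you can use a clustering algorithm (unsupervised learning) to segment your customers into clusters of similar customers. If you do know which groups you want, you can feed many examples of each group to a classification algorithm (supervised learning), and it will classify all your customers into these groups.
Spam detection is a typical supervised learning problem: the algorithm is fed many emails along with their labels (spam or not spam).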
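An online learning system can learn incrementally, as opposed to a batch learning system. This lets it adapt rapidly to changing data and to autonomous systems, and train on very large quantities of data.
Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer's main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.
An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.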
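A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters so that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).
Model-based learning algorithms search for optimal values of the model parameters such that the model generalizes well to new instances. We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized. To make predictions, we feed the new instance's features into the model's prediction function, using the parameter values found by the learning algorithm.
Some of the main challenges in Machine Learning are the lack of data, poor data quality, nonrepresentative data, uninformative features, excessively simple models that underfit the training data, and excessively complex models that overfit the data.
If a model performs very well on the training data but poorly on new instances, the model is likely overfitting the training data (or we got extremely lucky on the training data). Possible solutions to overfitting are getting more data, simplifying the model (selecting a simpler algorithm, reducing the number of parameters or features used, or regularizing the model), or reducing the noise in the training data.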
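A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.
A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).
Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data.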

Exercises

In this chapter we have covered some of the most important concepts in Machine Learning. In the next chapters we will dive deeper and write more code, but before we do, make sure you know how to answer the following questions:

How would you define Machine Learning?
Can you name four types of problems where it shines?
What is a labeled training set?
What are the two most common supervised tasks?
Can you name four common unsupervised tasks?
What type of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?
What type of algorithm would you use to segment your customers into multiple groups?
Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?
What is an online learning system?
What is out-of-core learning?
What type of learning algorithm relies on a similarity measure to make predictions?
What is the difference between a model parameter and a learning algorithm’s hyperparameter?
What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
Can you name four of the main challenges in Machine Learning?
If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
What is a test set and why would you want to use it?
What is the purpose of a validation set?
What can go wrong if you tune hyperparameters using the test set?
What is cross-validation and why would you prefer it to a validation set?

Exercise Solutions

Machine Learning is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.
Machine Learning is great for complex problems for which we have no algorithmic solution, to replace long lists of hand-tuned rules, to build systems that adapt to fluctuating environments, and finally to help humans learn (e.g., data mining).
A labeled training set is a training set that contains the desired solution (a.k.a. a label) for each instance.
The two most common supervised tasks are regression and classification.
Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.
Reinforcement Learning is likely to perform best if we want a robot to learn to walk in various unknown terrains since this is typically the type of problem that Reinforcement Learning tackles. It might be possible to express the problem as a supervised or semisupervised learning problem, but it would be less natural.
If you don’t know how to define the groups, then you can use a clustering algorithm (unsupervised learning) to segment your customers into clusters of similar customers. However, if you know what groups you would like to have, then you can feed many examples of each group to a classification algorithm (supervised learning), and it will classify all your customers into these groups.
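
As a rough illustration of the clustering route, here is a minimal k-means sketch in plain Python. The customer features (visits per month, average basket value), the starting centroids, and the helper name `kmeans` are all made up for this example; a real project would more likely use a library implementation.

```python
import math

def kmeans(points, centroids, n_iters=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)

    def nearest(point):
        # Index of the centroid closest to this point (Euclidean distance).
        return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, [nearest(p) for p in points]

# Hypothetical customers: (visits per month, average basket value).
customers = [(1, 20), (2, 25), (1, 22), (10, 200), (12, 210), (11, 190)]
centroids, labels = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)
```

No labels are supplied here: the two groups emerge from the data alone, which is what makes this unsupervised.
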
Spam detection is a typical supervised learning problem: the algorithm is fed many emails along with their label (spam or not spam).
An online learning system can learn incrementally, as opposed to a batch learning system. This makes it capable of adapting rapidly to both changing data and autonomous systems, and of training on very large quantities of data.
Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer’s main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.
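
The idea can be sketched as follows, under some loud assumptions: `data_stream` is a hypothetical stand-in for rows read lazily from disk, the model is a one-feature linear regression, and the learning rate and batch size are arbitrary. Only one mini-batch is held in memory at a time.

```python
def mini_batches(stream, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch

def sgd_step(w, b, batch, lr):
    """One gradient step on the mean squared error of y = w*x + b."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return w - lr * grad_w, b - lr * grad_b

def data_stream(n_rows):
    """Hypothetical data source: noise-free rows of y = 3x + 1, generated lazily."""
    for i in range(n_rows):
        x = (i % 10) / 10
        yield x, 3 * x + 1

w, b = 0.0, 0.0
for batch in mini_batches(data_stream(20000), batch_size=10):
    w, b = sgd_step(w, b, batch, lr=0.1)  # learn from this batch, then discard it
print(round(w, 3), round(b, 3))
```

Because the generator never materializes the full dataset, the same loop would work for a file far larger than main memory.
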
An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
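
A k-nearest-neighbors classifier is one common example of this kind of system. The sketch below assumes Euclidean distance as the (dis)similarity measure and uses made-up two-feature instances; `knn_predict` is a name chosen for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among the k closest training instances."""
    # "Learning by heart": train is simply the stored list of (features, label) pairs.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training instances with two numeric features each.
train = [
    ((1.0, 1.0), "ham"), ((1.2, 0.8), "ham"), ((0.9, 1.1), "ham"),
    ((5.0, 5.0), "spam"), ((5.2, 4.9), "spam"), ((4.8, 5.1), "spam"),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.0, 5.1)))
```

There is no training step beyond storing the data; all the work happens at prediction time.
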
A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).
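
To make the distinction concrete, here is a small sketch assuming a one-feature linear model without intercept, fitted by least squares with an L2 penalty. The function name `fit_slope` and the data are illustrative only: `alpha` is the hyperparameter we pick, `slope` is the model parameter the algorithm finds.

```python
def fit_slope(xs, ys, alpha=0.0):
    """Minimize sum((y - slope*x)**2) + alpha * slope**2 in closed form.

    alpha is a hyperparameter: it is chosen before training.
    slope is a model parameter: it is computed from the data.
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

slope_plain = fit_slope(xs, ys, alpha=0.0)   # no regularization
slope_ridge = fit_slope(xs, ys, alpha=14.0)  # strong regularization shrinks the slope
print(slope_plain, slope_ridge)
```

Changing `alpha` changes which slope the algorithm returns, but `alpha` itself is never learned from the training data.
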
Model-based learning algorithms search for an optimal value for the model parameters such that the model will generalize well to new instances. We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized. To make predictions, we feed the new instance’s features into the model’s prediction function, using the parameter values found by the learning algorithm.
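
The two pieces named in this answer, a cost function and a prediction function, might look like this for a linear model. Mean squared error and an L2 penalty on the slope are assumptions chosen for the sketch, as are the names `predict` and `cost`.

```python
def predict(params, x):
    """Prediction function of a linear model; params = (slope, intercept)."""
    slope, intercept = params
    return slope * x + intercept

def cost(params, data, alpha=0.0):
    """Mean squared error on the training data plus an L2 complexity penalty."""
    mse = sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)
    return mse + alpha * params[0] ** 2

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # generated by y = 2x + 1
good, bad = (2.0, 1.0), (0.0, 0.0)

print(cost(good, data), cost(bad, data))  # the learning algorithm prefers the lower cost
print(predict(good, 3.0))                 # prediction for a new instance
```

A learning algorithm such as gradient descent would search over `params` to drive `cost` down; prediction then reuses whichever values it found.
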
Some of the main challenges in Machine Learning are the lack of data, poor data quality, nonrepresentative data, uninformative features, excessively simple models that underfit the training data, and excessively complex models that overfit the data.
If a model performs great on the training data but generalizes poorly to new instances, the model is likely overfitting the training data (or we got extremely lucky on the training data). Possible solutions to overfitting are getting more data, simplifying the model (selecting a simpler algorithm, reducing the number of parameters or features used, or regularizing the model), or reducing the noise in the training data.
A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.
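
Holding out a test set can be as simple as the hand-rolled helper below. The name `split_train_test`, the 20% ratio, and the fixed seed are assumptions for illustration, not a reference to any library function.

```python
import random

def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and hold out the first test_ratio share as the test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is reproducible
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]

rows = list(range(100))
train_rows, test_rows = split_train_test(rows, test_ratio=0.2, seed=42)
print(len(train_rows), len(test_rows))
```

The model is trained only on `train_rows`; `test_rows` is set aside and evaluated once, at the end, to estimate the error on unseen instances.
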
A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).
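
One way to keep the test set clean is to pick hyperparameters on a separate validation set, as in this sketch. The tiny datasets, the candidate `alpha` values, and the helpers `fit_slope` and `mse` are all hypothetical.

```python
def fit_slope(train, alpha):
    """Least-squares slope for y = slope*x with an L2 penalty alpha * slope**2."""
    return sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + alpha)

def mse(slope, rows):
    """Mean squared error of the model y = slope*x on rows."""
    return sum((slope * x - y) ** 2 for x, y in rows) / len(rows)

# Hypothetical, already-split data: (x, y) pairs.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
valid = [(4.0, 8.0), (5.0, 10.1)]
test = [(6.0, 11.9), (7.0, 14.2)]

# Tune the hyperparameter on the validation set only.
candidates = [0.0, 1.0, 10.0]
best_alpha = min(candidates, key=lambda a: mse(fit_slope(train, a), valid))

# Touch the test set once, after all choices are made.
final_slope = fit_slope(train, best_alpha)
test_error = mse(final_slope, test)
print(best_alpha, test_error)
```

Had `best_alpha` been chosen by minimizing the error on `test` instead, `test_error` would no longer be an unbiased estimate of performance on new data.
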
Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data.
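
A minimal k-fold sketch, assuming contiguous folds over rows that are already shuffled. The helper names and the toy "predict the training mean" model are invented for the example.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs; every row is used for validation exactly once."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size

def cross_val_score(rows, k, fit, score):
    """Average the validation score of a fit/score pair over k folds."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx])
        scores.append(score(model, [rows[i] for i in valid_idx]))
    return sum(scores) / len(scores)

def fit_mean(train):
    """Toy model: always predict the mean of the training targets."""
    return sum(train) / len(train)

def mse(mean, valid):
    """Mean squared error of the constant prediction on the validation fold."""
    return sum((v - mean) ** 2 for v in valid) / len(valid)

rows = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_error = cross_val_score(rows, k=3, fit=fit_mean, score=mse)
print(cv_error)
```

Each row serves for training in k-1 folds and for validation in one, so no data has to be permanently set aside for model comparison.
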