算法题指南书

Today, we will see how popular classification algorithms work and help us, for example, to pick out and sort wonderful, juicy tomatoes.

今天，我们将了解流行的分类算法是如何工作的，例如，帮助我们挑选出美味多汁的西红柿并进行分类。

I will remind you that no algorithm is optimal over the set of all possible situations. Machine learning algorithms are delicate instruments that you tune based on the problem set, especially in supervised machine learning.

我会提醒您，在所有可能的情况下，没有一种算法是最优的。机器学习算法是您根据问题集调整的精密工具，尤其是在有监督的机器学习中。

分类工作原理 (How Classification Works)

We predict whether a thing can be referred to as a particular class every day. To give an example, classification helps us make decisions when picking tomatoes in a supermarket (“green,” “perfect,” “rotten”). In machine learning terms, we assign a label of one of the classes to every tomato we hold in our hands.

我们预测每天是否可以将某事物称为特定类别。举个例子，分类有助于我们在超市采摘西红柿时做出决定(“绿色”，“完美”，“烂”)。用机器学习的术语来说，我们为手中的每个西红柿分配一个类别的标签。

The efficiency of your tomato picking contest (some would call it a classification model) depends on the accuracy of its results. The more often you go to the supermarket yourself (instead of sending your parents or your significant other), the better you will become at picking out tomatoes that are fresh and yummy.

您的番茄采摘比赛(有些人将其称为分类模型)的效率取决于其结果的准确性。您自己去超市的次数越多(而不是送父母或重要的他人)，您越会挑选出新鲜美味的西红柿。

Computers are just the same. For a classification model to learn to predict outcomes accurately, it needs a lot of training examples.

电脑是一样的。为了使分类模型学习准确地预测结果，它需要大量的训练示例。

4种分类类型 (4 Types of Classification)

二元 (Binary)

Binary classification means there are two classes to work with that relate to one another as true and false. Imagine you have a huge lug box in front of you with yellow and red tomatoes. But your fancy Italian pasta recipe says that you only need the red ones.

二进制分类意味着有两个类别可以使用，它们相互关联，分别为真和假。想象一下，您面前有一个巨大的接线盒，上面放着黄色和红色的西红柿。但是您喜欢的意大利面食食谱说您只需要红色的意大利面食。

What do you do? Obviously, you use label-encoding and, in this case, assign 1 to “red” and 0 to “not red.” Sorting tomatoes has never been easier.

你是做什么？显然，您使用标签编码，在这种情况下，将1分配给“红色”，将0分配给“非红色”。西红柿分类从未如此简单。

多类 (Multiclass)

What do you see in this photo?

您在这张照片中看到了什么？

Red beefsteak tomatoes. Cherry tomatoes. Cocktail tomatoes. Heirloom tomatoes.

红色的牛排番茄。樱桃西红柿。鸡尾酒番茄。传家宝西红柿。

There is no black and white here, no normal and abnormal like in binary classification. We welcome all sorts of wonderful vegetables (or berries) to our table.

这里没有黑白，没有像二进制分类那样的正常和异常。我们欢迎各种精美的蔬菜(或浆果)进入我们的餐桌。

What you probably don’t know if you are not a fan of cooking with tomatoes is that not all the tomatoes are equally good for the same dish. Red beefsteak tomatoes are perfect for salsa but you do not pickle them. Cherry tomatoes work for salads but not for pasta. So it’s important to know what you are dealing with.

您可能不知道自己是否不喜欢用西红柿做饭，是因为并非所有的西红柿都同样适合同一道菜。红色牛排番茄非常适合萨尔萨酱，但是您不要腌制它们。樱桃番茄可用于沙拉，但不适用于意大利面。因此了解您要处理的内容很重要。

Multiclass classification helps us to sort all the tomatoes from the box regardless of how many classes there are.

多类分类可帮助我们从包装盒中对所有西红柿进行分类，而不管有多少类。

多标签 (Multi-label)

Multi-label classification is applied when one input can belong to more than one class, like a person who is a citizen of two countries.

当一个输入可以属于一个以上的类别时(例如，作为两个国家的公民的人)，将应用多标签分类。

To work with this type of classification, you need to build a model that can predict multiple outputs.

要使用这种类型的分类，您需要构建一个可以预测多个输出的模型。

You need a multi-label classification for object recognition in photos, for example, when you need to identify not only tomatoes but also other, different kinds of objects in the same image: apples, zucchinis, onions, etc.

您需要对照片中的对象进行识别的多标签分类，例如，当您不仅需要识别西红柿，还需要识别同一图像中其他不同类型的对象：苹果，西葫芦，洋葱等时。

Important note for all tomato lovers: You cannot just take a binary or multiclass classification algorithm and apply it directly to multi-label classification. But you can use:

对所有番茄爱好者的重要说明 ：您不能仅采用二进制或多类分类算法并将其直接应用于多标签分类。但是您可以使用：

You can also try to use a separate algorithm for each class to predict labels for each category.

您也可以尝试对每个类别使用单独的算法来预测每个类别的标签。

不平衡 (Imbalanced)

We work with imbalanced classification when examples in each class are unequally distributed.

当每个类别中的示例分布不均时，我们将使用不平衡分类。

Imbalanced classification is used for fraud detection software and medical diagnosis. Finding rare and exquisite biologically grown tomatoes that are accidentally spilled in a large pile of supermarket tomatoes is an example of imbalanced classification.

不平衡分类用于欺诈检测软件和医疗诊断。寻找偶然散落在一大堆超市番茄中的稀有和精美的生物种植番茄是分类失衡的一个例子。

I recommend you visit the fantastic blog of Machine Learning Mastery, where you can read about the different types of classification and study many more machine learning materials.

我建议您访问梦幻般的Machine Learning Mastery博客，您可以在其中阅读有关分类的不同类型并学习更多的机器学习材料。

建立分类模型的步骤 (Steps to Build a Classification Model)

Once you know what kind of classification task you are dealing with, it’s time to build a model.

一旦知道要处理的分类任务，就可以构建模型了。

Select the classifier. You need to choose one of the ML algorithms that you will apply to your data.选择分类器。您需要选择一种将应用于数据的ML算法。
Train it. You have to prepare a training data set with labeled results (the more examples, the better).训练吧。您必须准备带有标记结果的训练数据集(示例越多越好)。
Predict the output. Use the model to get some results.预测输出。使用该模型可以获得一些结果。
Evaluate the classifier model. It’s a good idea to prepare a validation set of data that you have not used in training to check the results.评估分类器模型。准备一个在训练中没有使用过的验证数据集来检查结果是一个好主意。

Let us now take a look at the most widely-used classification algorithms.

现在，让我们看一下使用最广泛的分类算法。

最受欢迎的分类算法 (The Most Popular Classification Algorithms)

Scikit-learn is one of the top ML libraries for Python. So if you want to build your model, check it out. It provides access to widely-used classifiers.

Scikit-learn是用于Python的顶级ML库之一。因此，如果要构建模型，请签出。它提供了对广泛使用的分类器的访问。

逻辑回归 (Logistic regression)

Logistic regression is used for binary classification.

Logistic回归用于二进制分类。

This algorithm employs a logistic function to model the probability of an outcome happening. It is most useful when you want to understand how several independent variables affect a single outcome variable.

该算法采用逻辑函数对结果发生的概率进行建模。当您想了解几个独立变量如何影响单个结果变量时，它非常有用。

Example question: Will the precipitation levels and the soil composition lead to a tomato’s prosperity or its untimely death?

示例问题：降水量和土壤成分会导致番茄的繁荣或过早地死亡吗？

Logistic regression has limitations; all predictors should be independent, and there should be no missing values. This algorithm will fail when there is no linear separation of values.

逻辑回归有局限性。所有预测变量都应独立，并且不应缺少任何值。没有值的线性分隔时，该算法将失败。

朴素贝叶斯 (Naive Bayes)

The Naive Bayes algorithm is based on Bayes’ theorem. You can apply this algorithm for binary and multiclass classification and classify data based on historical results.

朴素贝叶斯算法基于贝叶斯定理。您可以将此算法应用于二进制和多类分类，并根据历史结果对数据进行分类。

Example task: I need to separate rotten tomatoes from the fresh ones based on their look.

示例任务：我需要根据它们的外观将腐烂的西红柿与新鲜的西红柿分开。

The advantage of Naive Bayes is that these algorithms are fast to build: They do not require an extensive training set and are also fast compared to other methods. However, since the performance of Bayesian algorithms depends on the accuracy of its strong assumptions, the results can potentially turn out very bad.

朴素贝叶斯的优点是这些算法的构建速度很快：它们不需要大量的训练，并且与其他方法相比也很快。但是，由于贝叶斯算法的性能取决于其强大假设的准确性，因此结果可能会变得非常糟糕。

Using Bayes’ theorem, it is possible to tell how the occurrence of an event impacts the probability of another event.

使用贝叶斯定理，可以判断一个事件的发生如何影响另一个事件的概率。

k最近邻居 (k-nearest neighbors)

kNN stands for “k-nearest neighbor” and is one of the simplest classification algorithms.

kNN代表“ k最近邻居”，是最简单的分类算法之一。

The algorithm assigns objects to the class that most of its nearest neighbors in the multidimensional feature space belong to. The number k is the number of neighboring objects in the feature space that are compared with the classified object.

该算法将对象分配给其在多维要素空间中大多数最近邻居所属的类。数字k是特征空间中与分类对象比较的相邻对象的数量。

Example: I want to predict the species of the tomato from the species of tomatoes similar to it.

示例：我想根据与之相似的西红柿的种类来预测西红柿的种类。

To classify the inputs using k-nearest neighbors, you need to perform a set of actions:

要使用k近邻对输入进行分类，您需要执行一组操作：

Calculate the distance to each of the objects in the training sample.计算到训练样本中每个对象的距离。
Select k objects of the training sample, the distance to which is minimal.选择训练样本的k个对象，该对象的距离最小。
The class of the object to be classified is the class that occurs most frequently among the k-nearest neighbors.要分类的对象的类别是在k个最近邻居中最频繁出现的类别。

决策树 (Decision tree)

Decision trees are probably the most intuitive way to visualize a decision-making process. To predict a class label of input, we start from the root of the tree. You need to divide the possibility space into smaller subsets based on a decision rule that you have for each node.

决策树可能是可视化决策过程的最直观的方法。为了预测输入的类标签，我们从树的根开始。您需要根据每个节点的决策规则将可能性空间划分为较小的子集。

Here is an example:

这是一个例子：

You keep breaking up the possibility space until you reach the bottom of the tree. Every decision node has two or more branches. The leaves in the model above contain the decision about whether a person is or isn’t fit.

您一直在打破可能的空间，直到到达树的底部。每个决策节点都有两个或多个分支。上面模型中的叶子包含有关一个人是否合适的决定。

Example: You have a basket of different tomatoes and want to choose the correct one to enhance your dish.

示例：您有一篮子不同的西红柿，并且想要选择正确的西红柿来增强菜肴。

Types of decision trees

决策树的类型

There are two types of trees. They are based on the nature of the target variable:

有两种类型的树。它们基于目标变量的性质：

Categorical variable decision tree分类变量决策树
Continuous variable decision tree连续变量决策树

Therefore, decision trees work quite well with both numerical and categorical data. Another plus of using decision trees is that they require little data preparation.

因此，决策树在数值和分类数据上都可以很好地工作。使用决策树的另一个好处是它们几乎不需要数据准备。

However, decision trees can become too complicated, which leads to overfitting. A significant disadvantage of these algorithms is that small variations in training data make them unstable and lead to entirely new trees.

但是，决策树可能变得过于复杂，从而导致过度拟合。这些算法的显着缺点是训练数据的微小变化使它们不稳定，并导致了全新的树木。

随机森林 (Random forest)

Random forest classifiers use several different decision trees on various sub-samples of datasets. The average result is taken as the model’s prediction, which improves the predictive accuracy of the model in general and combats overfitting.

随机森林分类器在数据集的各个子样本上使用几种不同的决策树。将平均结果作为模型的预测，可以从总体上提高模型的预测精度并避免过拟合。

Consequently, random forests can be used to solve complex machine learning problems without compromising the accuracy of the results. Nonetheless, they demand more time to form a prediction and are more challenging to implement.

因此，随机森林可用于解决复杂的机器学习问题，而不会影响结果的准确性。尽管如此，他们需要更多的时间来进行预测，并且实施起来更具挑战性。

Read more about how random forests work in the Towards Data Science blog.

在Towards Data Science博客中了解有关随机森林如何工作的更多信息。

支持向量机 (Support vector machine)

Support vector machines use a hyperplane in an N-dimensional space to classify the data points. N here is the number of features. It can be, basically, any number, but the bigger it is, the harder it becomes to build a model.

支持向量机在N维空间中使用超平面对数据点进行分类。这里的N是要素数量。基本上可以是任何数字，但是越大，建立模型就越困难。

One can imagine the hyperplane as a line (for a two-dimensional space). Once you pass three-dimensional space, it becomes hard for us to visualize the model.

可以将超平面想象成一条线(对于二维空间)。一旦您通过了三维空间，我们就很难将模型可视化。

Data points that fall on different sides of the hyperplane are attributed to different classes.

落在超平面的不同面上的数据点被归于不同的类。

Example: An automatic system that sorts tomatoes based on their shape, weight, and color.

示例：一个自动系统，可根据西红柿的形状，重量和颜色对西红柿进行分类。

The hyperplane that we choose directly affects the accuracy of the results. So we search for the plane that has the maximum distance between data points of both classes.

我们选择的超平面直接影响结果的准确性。因此，我们搜索两个类的数据点之间具有最大距离的平面。

SVMs show accurate results with minimal computation power when you have a lot of features.

当您具有许多功能时，SVM可以以最小的计算能力显示准确的结果。

加起来 (Summing Up)

As you can see, machine learning can be as simple as picking up vegetables in the shop. But there are many details to keep in mind if you don’t want to mess it up.

如您所见，机器学习就像在商店里捡菜一样简单。但是，如果您不想弄乱它，则有许多细节需要牢记。

翻译自: https://medium.com/better-programming/a-guide-to-classification-algorithms-fdaabb538b26