Deep Learning Algorithms and Machine Learning Algorithms

I am a firm believer that the step before building your own machine learning algorithm, or any predictive model, in code is understanding the basics and knowing how to interpret the model rationally.

We often assume that building a machine learning or data analytics model is a difficult task because we associate it with coding, which becomes one more obstacle to overcome. But that’s not necessarily the case if you focus first on understanding the theory behind it. Here’s a short guide to help you through that process.

Table of Contents:

1. Importance of Data Analytics. (1 min read)

2. Machine Learning Contribution. (1 min read)

3. Understanding the Basics: Data-validation process, imbalanced datasets, supervised and unsupervised methods. (2 min read)

4. Introduction to Decision Trees and Random Forest. (2 min read)

5. Application using Orange. (6 min read)

1. Why Data Analytics?

Thousands of data sources exist nowadays from which we can extract, transform and load data, ranging from stock prices, medical records, surveys and population censuses to logged behaviors, among others. There is also a huge variety of fields in which we can apply these techniques, and a wide range of useful applications within each field, such as fraud detection, credit scoring and asset allocation in the finance domain.

But how much can I contribute to a company with this knowledge? A lot! Just put yourself in the position of a credit risk analyst at a bank: “Should I lend money to this client or should I reject the application? How much information should I request from him or her without risking the loss of the interest associated with the loan? Are periodic payslips enough? Or should I also ask for credit records from other financial institutions to guarantee repayment?” Data analytics and machine learning models play a major role in automating routine tasks such as this one, handling large volumes of information and optimizing metrics to enhance business sustainability.

The ultimate goal is to make meaningful and interpretable inferences about data, extract relationships between variables and detect patterns to forecast the outcome of a variable.

Consider the case of tech companies. To grow their business value, they must focus on improving business metrics and delighting users. Data analysis provides these companies with constantly updated insights and metrics that allow them to build even better products. The mission: understand users, how the product fits into their lives, what motivates them and what their experience was like, in order to improve it. All this and much more can be attained with the use of data.

2. What role does Machine Learning play in all of this?

Truth be told, there’s probably no need for Machine Learning in your company budget. Why is that? Because the majority of companies need improvements in processes, customer experience, cost reduction and decision making, all of which can be attained by implementing traditional data analysis models, without resorting to more complex ML applications.

That said, traditional data analytics models are static and of limited use with unstructured data inputs that change rapidly and constantly. That’s when the need emerges for automated processes capable of analyzing tens of inputs and variables.

In addition, the way a problem is solved differs greatly between the two approaches. An ML model receives a defined goal from the user and learns from the fast-changing data which factors are important in achieving that goal, instead of relying on the user to set the factors that determine the outcome of the target variable.

This allows the algorithm not only to make predictions, but also to compare its predictions against the actual outcomes and adjust itself to improve accuracy.

3. Understanding the Basics:

Data-validation process: When running a machine learning algorithm and selecting the best way to analyze the data, we split the data into two subsets, a training subset and a testing subset. We fit the model on the training data and make predictions on the test data, emulating how the model would face real-life problems.

How we split the data is not trivial, as we don’t want to bias either subset. For example, when processing data from a sample of a company’s clients, we want each category to be proportionally represented in both the train and test subsets. That is why data splitting must be performed randomly and in a stratified way.
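For readers curious about what a random, stratified split looks like in code, here is a minimal sketch in plain Python. It is only an illustration of the idea, not what Orange runs internally; the function name `stratified_split`, the toy `clients` list and the 70/30 fraction used as a default are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.7, seed=0):
    """Randomly split rows into train/test, preserving class proportions."""
    rng = random.Random(seed)

    # Group the rows by class label so that each class is split separately.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)  # random order within the class
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Toy data (made up): 10 clients who defaulted (1) and 90 who did not (0).
clients = [{"id": i, "default": 1 if i < 10 else 0} for i in range(100)]
train, test = stratified_split(clients, lambda r: r["default"])
```

Because each class is shuffled and cut on its own, both subsets keep roughly the same share of each class as the full sample.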

Imbalanced Datasets: Data is said to be imbalanced when instances of one class outnumber the other(s) by a large proportion. During classification, the model might not see enough instances of a minority class to learn about it, which biases the analysis.

There are several methods to deal with this issue, including Undersampling, Oversampling, Synthetic Data Generation and Cost-Sensitive Learning. In this article, I’ll focus on Oversampling.

Supervised and Unsupervised Models:

Supervised Learning: We provide the model with the labels we want it to predict for each observation in the training dataset.

Unsupervised Learning: As we don’t know the labels, we ask the model to group elements of the dataset based on the most distinctive features of each element.

4. Introduction to Decision Trees and Random Forest

Decision Tree algorithms are structured as a hierarchy of questions and answers about the observations in a dataset, which helps the model make classifications. An example is the following scheme, a simplified structure of questions to determine the salary of a baseball player:

In the graph, we see a two-level decision tree in which the first split is based on the number of years an individual has played professionally and, conditional on the answer to that question, the second split is based on the number of hits per season. In the example, the salary of each player in the league is determined by following this model.

The most suitable model will be the one that best represents the actual relation among the variables under study (if the salary of a baseball player and years of experience were linearly correlated, a linear regression would probably be the most suitable model). Random Forest is one of a variety of techniques that allow us to represent more complex relations between variables, which are not necessarily linear, exponential or logarithmic.

For example, for a player with less than 4.5 years of experience, the salary decision depends solely on his professional experience, not on the number of hits.
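The two-level tree described above can be read as a pair of nested questions. The sketch below writes it out in Python under stated assumptions: the 4.5-year split comes from the example, while the hits threshold (`hits_split=100.0`) and the group names are hypothetical values chosen only for illustration.

```python
def salary_group(years, hits, years_split=4.5, hits_split=100.0):
    """Walk a two-level decision tree and return the leaf a player lands in.

    years_split comes from the example in the text; hits_split is a
    hypothetical value chosen only for illustration.
    """
    if years < years_split:
        # Less experienced players: the number of hits is never consulted.
        return "low experience"
    if hits < hits_split:
        return "experienced, fewer hits"
    return "experienced, more hits"
```

Note how a player with fewer than 4.5 years lands in the same group whatever his number of hits, which is exactly the behavior described in the paragraph above.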

The ultimate goal of the scheme of questions is to split the observations so that the resulting groups are as different from each other as possible. Observations are finally organized into sections according to which conditions they meet:

Decision Tree classification example— Image by Author

Random Forest

Random Forest classification is based on “ensembles” of a large number of individual decision trees. Let’s use a simple metaphor to illustrate the concept:

Each blind man has a task: deduce the animal based on the part of the body that he touches. For the purpose of this article, each blind man is a model and the elephant is the value to predict. If they all touched the same part, they would probably deduce the wrong animal. It is therefore better to have them spread out so they can “learn” from different sets of “information”. Furthermore, we’ll combine the models (the blind men), which are not effective predictors on their own, in order to improve the overall prediction.

Instead of touching a part of an animal, each model actually analyzes a dataset drawn from the original data by Bootstrapping (sampling with replacement). Unfortunately, this method alone doesn’t guarantee that the resulting trees are uncorrelated, mainly because “strong” predictors generally dominate over other features. That’s where Random Forest comes into play: its main feature is that each split considers only a random subset of the features, so “strong” predictors cannot dominate every tree.
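To make the three ingredients of the ensemble more concrete (bootstrapped samples, random feature subsets and a combined vote), here is a small sketch in plain Python. It does not build actual trees; the helper names and the feature names are hypothetical and serve only to illustrate the mechanics.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) observations with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(features, k, rng):
    """Pick k distinct features to consider at one split."""
    return rng.sample(features, k)


def majority_vote(predictions):
    """Combine the individual trees' predictions into one answer."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)
rows = list(range(10))                       # stand-in for 10 observations
features = ["years", "hits", "age", "team"]  # hypothetical feature names

sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(features, 2, rng)
vote = majority_vote(["default", "no default", "no default"])
```

Each “blind man” would be trained on its own `sample`, restricted to a random `subset` of features at each split, and the forest’s answer is the `vote` across all of them.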

5. What tool are we going to use?

Orange is an open-source tool that lets us perform a wide range of data tasks such as visualization, exploration, preprocessing and model creation without writing Python, R or any other code. It’s ideal if you’re taking your first steps on this long learning path.

It’s also suitable for more advanced users, as it includes widgets for entering Python scripts to complement the other widgets it offers. Go to the following link to proceed with the installation of the program.

1. Open a new file

Initial user interface — Image by Author

2. Drag the File widget to the canvas and browse to the dataset on your local machine by double-clicking the File widget.

In this case, I’ll use a dataset containing a sample of 150,000 clients of a financial institution. The column “SeriousDlqin2yrs” will be the target variable when preparing our model.

You can find the dataset in this link to my GitHub.

Note: Don’t worry about the gray “Apply” button, as it’s only used to confirm changes made to the values of each feature, e.g. after modifying “Role” or “Values”.

3. Visualize default features and distributions of the dataset

Drag the “Feature Statistics” and “Distribution” widgets from the “Data”, “Model” and “Visualize” sections of the left-side panel. With these tools, you’ll get a better view of the descriptive statistics of each feature in the dataset, such as mean, dispersion, minimum, maximum and missing values.

Feature Statistics Widget — Image by Author

4. Select Rows

Filter the data in each column to exclude incorrect values that would interfere with the accuracy of the analysis. We can set conditions on the features, such as values below an amount X or equal to an amount Y.

5. Select Columns

Select the important features for your analysis from the original dataset and create a new dataset containing only those features using the “Select Columns” widget. In this widget you choose the columns to keep and indicate the target variable for further analysis.

Select Columns interface— Image by Author

6. Data Sampler

The Data Sampler widget is used to split the filtered dataset into train and test subsets. In Orange’s interface we can select a “Sampling Type” to choose our desired sampling method. Specifically, I selected 70% of the data as the “Train sample”, leaving the remaining 30% as the “Test sample”. As mentioned earlier in the article, the data for the subsets is selected randomly and with stratified samples, as Orange’s interface reflects.

Data Sampler interface — Image by Author

7. Imbalanced Dataset Resolution

To solve the Imbalanced Dataset problem explained above, I decided to apply random Oversampling instead of SMOTE, as I believe a widget for SMOTE is not included in Orange.

1. Select the “Default” values, which are the minority class in the training subset, as we want to randomly replicate these observations to balance the dataset. In the canvas, you will see that the links (edges) between widgets have labels, in which you must indicate what data you want to pass to the “receiving” widget. In this case, the “Select Rows” widget receives the Train subset, from which the “Matching Data” (the “Defaults”) is sent on. The “Unmatched Data”, which are the “No defaults” from the Train subset, is sent directly to the “Concatenate” widget.

An advantage of Oversampling is that, unlike Undersampling, it leads to no information loss. Its disadvantage is that, since it simply adds replicated observations to the original dataset, the model sees many copies of the same observations, which can lead to overfitting.

2. The Data Sampler widget randomly replicates a fixed number of observations.

3. The Concatenate widget joins the new observations with the “old” ones to finally obtain a balanced dataset to submit to our model.
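The three steps above (select the minority class, replicate it at random, concatenate with the rest) can be sketched in a few lines of plain Python. This is an illustration of random oversampling under simple assumptions, not a description of Orange’s internals; the function name `oversample_minority` and the toy `train` list are made up for the example.

```python
import random


def oversample_minority(rows, label_of, minority_label, seed=0):
    """Balance two classes by randomly replicating minority rows."""
    rng = random.Random(seed)

    # Step 1: separate the minority class from the rest.
    minority = [r for r in rows if label_of(r) == minority_label]
    majority = [r for r in rows if label_of(r) != minority_label]

    # Step 2: draw, with replacement, enough copies to match the majority.
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]

    # Step 3: concatenate the "old" observations with the replicated ones.
    return majority + minority + extra


# Toy train subset (made up): 5 defaults (1) and 45 non-defaults (0).
train = [{"default": 1}] * 5 + [{"default": 0}] * 45
balanced = oversample_minority(train, lambda r: r["default"], minority_label=1)
```

Only the train subset should be balanced this way; the test subset is left untouched so that it still reflects the real class proportions.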

Orange Canvas — Image by Author

8. Perform prediction on balanced train dataset with Random Forest

Let’s move on to the fun part: modeling with Random Forest. To perform this task, select the “Random Forest” widget from the “Model” section and link it to the balanced train dataset.

We will test the “depth” hyperparameter in an effort to optimize the model. Hyperparameters are “settings” of the model that can be adjusted to enhance performance. In the case of Random Forest, hyperparameters include:

Number of decision trees in the forest.
Number of features considered by each tree when splitting a node.
Maximum depth of each individual tree, i.e. how far each tree is allowed to grow.

As shown in the images below, the first model limits the depth of individual trees to 3 levels, and the second places no limit on how “deep” each tree can grow.

Random Forest with 3-trees depth limit — Image by Author

Random Forest with no depth limit — Image by Author

9. Test & Score widget

This widget is used to evaluate the results of the model on the training dataset. It performs cross-validation based on the number of folds defined. The folds are the subsets into which the train sample is divided; in each round, one fold is held out for evaluation while the model is trained on the rest, so that the whole dataset is eventually evaluated.

The resulting interface lists the models used and compares their performance on the metrics obtained. In the next step I’ll explain the meaning of each metric.
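As a rough illustration of how folds rotate during cross-validation, the sketch below generates the row indices used in each round. It is a simplified, non-stratified version written in plain Python; the function name `k_fold_indices` is an assumption for the example and this is not Orange’s implementation.

```python
import random


def k_fold_indices(n_rows, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs, one per cross-validation round."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]

    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training in round i.
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


rounds = list(k_fold_indices(n_rows=10, n_folds=5))
```

Across the five rounds, every row is used for evaluation exactly once and for training in the other four.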

Test & Score interface — Image by Author

10. Confusion Matrix

To simplify the exercise, I’ll explain it using the more effective of the two models we ran. The Confusion Matrix is a performance-measurement tool used to evaluate a machine learning model based on predetermined metrics. The output is a table with the following combination of values:

True Positive results (TP): The model correctly predicted the positive outcome (e.g. it predicted the client would “Not Default” and the client did not default).

True Negative results (TN): The model correctly predicted the negative outcome (e.g. it predicted the client would “Default” and the client did default).

False Positive results (FP): The model incorrectly predicted the positive outcome (e.g. it predicted the client would “Not Default” but the client defaulted).

False Negative results (FN): The model incorrectly predicted the negative outcome (e.g. it predicted the client would “Default” but the client did not default).
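For reference, the four cells of the table can be counted with a short loop. The sketch below is a minimal plain-Python illustration that treats “no default” as the positive class, as in the True Positive example above; the function name and the toy labels are made up.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, TN, FP and FN for a binary classification."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual class is positive too.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual class was positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Toy labels (made up) for five clients.
actual = ["no default", "no default", "default", "default", "no default"]
predicted = ["no default", "default", "default", "no default", "no default"]
counts = confusion_counts(actual, predicted, positive="no default")
```

With these toy labels the result is 2 true positives, 1 true negative, 1 false positive and 1 false negative.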

This matrix is included as a widget in Orange and has the following interface:

Confusion Matrix interface — Image by Author

It is extremely useful for measuring Recall, Precision, F1 Score, Accuracy and the AUC-ROC Curve:

Accuracy: Proportion of all predictions that the model classified correctly.

Precision: Proportion of positive predictions that were actually correct.

Recall: Proportion of actual positive outcomes that the model predicted correctly.

F1 Score: A combination of precision and recall (their harmonic mean), also used to measure a test’s accuracy.
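These four metrics follow directly from the confusion-matrix counts. Here is a minimal sketch using the standard definitions; the function name and the counts passed to it are made up for illustration and are not results from the model in this article.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or seen.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=80, tn=10, fp=5, fn=5)
```

On an imbalanced dataset, accuracy alone can look good even when the minority class is poorly predicted, which is why precision, recall and F1 are worth reading alongside it.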

Conclusion

The motivation for this article was to show how to apply sophisticated Machine Learning algorithms without a single line of code, but I also ended up seeing it as an introduction to the theory that will hopefully motivate everyone who reads this post.

Thanks for taking the time to read my article! If you have any questions, suggestions or comments, feel free to contact me: herrera.ajulian@gmail.com

Translated from: https://towardsdatascience.com/is-it-possible-to-make-machine-learning-algorithms-without-coding-cb1aadf72f5a