ds数据库

数据科学 (Data Science)

These days most companies are moving towards the concept of “Citizen Data Scientists” by giving tools to subject matter experts (SME) to create their own machine learning models. The advantage of this approach is that it provides a meaningful interpretation of the results and nothing gets lost in translation between the data scientist and the SME.

如今，大多数公司通过向主题专家(SME)提供工具来创建自己的机器学习模型，正在朝着“ 市民数据科学家 ”的概念发展。这种方法的优势在于，它可以对结果进行有意义的解释，并且不会在数据科学家和SME之间的翻译中丢失任何内容。

While some citizen-data-scientists get trained in Python and R to develop models, these languages require some level of coding that some people may not be comfortable with. Data modeling platforms like JMP, Knime and Alteryx provide alternatives to programming and help you develop machine learning models simply by working with their inbuilt functionalities. In this article let’s discuss modeling with Alteryx.

虽然一些公民数据科学家接受过Python和R训练以开发模型，但这些语言需要某些人可能不满意的某种程度的编码。诸如JMP，Knime和Alteryx之类的数据建模平台提供了编程的替代方法，并通过使用其内置功能来帮助您开发机器学习模型。在本文中，我们讨论使用Alteryx进行建模。

The approach we will use here is the ‘SEMMA’ approach. ‘SEMMA’ stands for Sample, Explore, Modify, Model, and Assess. It is a list of sequential steps developed by the SAS Institute, one of the largest producers of statistics and business intelligence software. As with my previous articles, I will be using the Boston Housing Property 2019 dataset to model with Alteryx.

我们将在这里使用的方法是“ SEMMA ”方法。 “ SEMMA”代表“ 采样” ，“ 探索” ，“ 修改” ，“ 模型 ”和“ 评估” 。它是SAS Institute (最大的统计和商业智能软件生产商之一)开发的一系列步骤的列表。与之前的文章一样，我将使用Boston Housing Property 2019数据集与Alteryx进行建模。

第1步-示例： (STEP 1 — Sample:)

The process starts with data sampling, i.e., selecting the appropriate data set for modeling. The data set should be large enough to contain sufficient information to model effectively, yet small enough to be used efficiently.

该过程从数据采样开始，即选择合适的数据集进行建模。数据集应足够大以包含足以有效建模的信息，而又要足够小以有效使用。

In the ‘Sample’ stage, we will load the data into Alteryx, filter by property type = 101 (single-family homes), and drop columns that are not required for analysis(Like the name, address, and rows not associated with single-family homes).

在“ 采样 ”阶段，我们将数据加载到Alteryx中，按属性类型= 101 (单户住宅)进行过滤，并删除分析不需要的列(例如名称，地址和与单行无关的行) -家庭住宅)。

Sample data with Alteryx. Alteryx tool used- Input Data, Filter and select

Using the Input Data tool, I will import the excel dataset into Alteryx. Next using the filter tool will sort the rows containing only PTYPE = 101 as shown below.

使用输入数据工具，我将excel数据集导入Alteryx。接下来，使用过滤器工具将仅包含PTYPE = 101的行排序，如下所示。

Select Data and filter property type 101

Next, select the rows required for analysis. (Remove fields like ST_Name, Owner, etc)

接下来，选择分析所需的行。 (删除诸如ST_Name，Owner等的字段)

Select Tool: Unselect columns not required for analysis.

Alteryx allows you to create ‘containers’ where you can group all the blocks for a function together like shown above. I created a container called “Find columns with Missing values” for doing the task the name suggests. I have reused this container multiple times to check if all the missing values have been imputed

Alteryx允许您创建“容器”，您可以在其中将功能的所有块组合在一起，如上所示。我创建了一个名为“查找缺少值的列”的容器，用于执行顾名思义的任务。我已多次重复使用此容器，以检查是否所有缺失值均已估算

Columns with Missing values (Output of Container)

第2步-探索： (STEP 2 — Explore:)

This phase covers understanding the data by exploring anticipated and unanticipated relationships between the variables and discover any abnormalities with the help of data visualization.

此阶段包括通过探索变量之间的预期和意外关系来理解数据，并借助数据可视化发现任何异常。

In this phase, we will impute all the missing data to prepare it for modeling. In the last step, we identified 7 columns with some missing values. In this step, we will impute these columns.

在这一阶段，我们将估算所有缺失的数据，以准备进行建模。在最后一步中，我们确定了7列缺少值的列。在这一步中，我们将估算这些列。

Imputation Strategy:

插补策略：

a) Land_SF: This is a continuous variable and I will impute this with mean of column

a)Land_SF：这是一个连续变量，我将用列均值来估算

b) YR_REMOD: This is the year the house was remodeled, if this is null, we will replace it with year built.

b)YR_REMOD：这是房屋改建的年份，如果为空，我们将用建造年份代替。

c) STRUCTURE_CLASS: This is the categorical variable. We will replace this with the mode of the column.

c)STRUCTURE_CLASS：这是类别变量。我们将其替换为列的模式。

d) R_BTH_STYLE2 and R_BTH_STYLE3: We will calculate total_bath (total number of a full bath and half bath) parameter if the total bath is less than ‘2’, I will impute STYLE2 with ‘N/A’ similarly is total is less than ‘3’ STYLE3 will be ‘N/A’.

d)R_BTH_STYLE2和R_BTH_STYLE3：如果总浴池小于'2'，我们将计算total_bath(全浴池和半浴池的总数)参数，类似地，我将STYLE2推算为'N / A'，总和小于' 3'STYLE3将为'N / A'。

e) R_KITCH_STYLE2 and R_KITCH_STYLE3: Similar to the baths, if the number of the kitchen is less than 2 or 3, we will impute the values with,N/A’.

e)R_KITCH_STYLE2和R_KITCH_STYLE3：类似于浴室，如果厨房的数量少于2或3，我们将用N / A'来估算值。

All this can be done by using the Formula column easily.

所有这些都可以通过轻松地使用“公式”列来完成。

1) Impute LAND_SF with mean. 2) YR_REMOD = YR_BUILT, STRUCTURE_CLASS Impute with Mode. 3) Impute Bath styles and kitchen styles.

第3步-修改： (STEP 3 — Modify:)

In this step, we will

在这一步中，我们将

Transform Data/Create New Attributes: We will create a few new variables for our analysis:

转换数据/创建新属性：我们将为分析创建一些新变量：

a. Age: Calculate the age of the house(Current year- YR_BUILT)

一个。年龄：计算房屋的年龄(当前年份-YR_BUILT)

b. YR_Since_Remod: Current Year — YR_REMOD)

b。 YR_Since_Remod：当前年份-YR_REMOD)

c. LogAV_Total = Log transform target variable

C。 LogAV_Total =对数转换目标变量

Adding new variables required for analysis.

2. Attribute encoding:

2.属性编码：

We will check all the variable to make sure the data types are correct, delete original variables like YR_REMOD, YR_BUILT (using the select tool)as we have calculated new variables

我们将检查所有变量以确保数据类型正确，并在计算新变量时删除原始变量，例如YR_REMOD，YR_BUILT(使用选择工具)

3. Partition Data into training and validation set

3.将数据划分为训练和验证集

步骤4 —模型： (STEP 4 — Model:)

Model: In the Model phase the focus is on applying various modeling (data mining) techniques on the prepared variables in order to create models that possibly provide the desired outcome.

模型：在模型阶段，重点是对准备好的变量应用各种建模(数据挖掘)技术，以创建可能提供所需结果的模型。

We will fit 3 different models to our data: Decision Tree, Linear Regression, and Random Forest model on the training dataset and validate the model using the validation set.

我们将3种不同的模型拟合到我们的数据中：训练数据集上的决策树，线性回归和随机森林模型，并使用验证集对模型进行验证。

The image below shows the settings for cross-validation and score blocks. In cross-validation, you have to select the Number of folds and a seed. Score block lets you select if the predictor variable has been log-transformed or not.

下图显示了交叉验证和分数块的设置。在交叉验证中，您必须选择倍数和种子。得分块使您可以选择是否将预测变量进行对数转换。

Cross-validation and Score block in Alteryx

A) Decision Tree Model

A)决策树模型

We will fit a decision tree on the training data and then use cross-validation on both training to compare ‘Mean Absolute Error’. To fit the decision tree model, we will use the decision tree block to select the target variable and predictor variables

我们将在训练数据上拟合决策树，然后在两次训练中使用交叉验证来比较“均值绝对误差”。为了适应决策树模型，我们将使用决策树块来选择目标变量和预测变量

Cross-Validation and %Error on training and validation set.

The cross-validation results show that Avg_MAE (Average Mean Absolute Error) is comparable hence this model is valid. The browse added after the decision tree model has more information and graphs.

交叉验证结果表明，Avg_MAE(平均平均绝对误差)具有可比性，因此该模型有效。在决策树模型之后添加的浏览器具有更多信息和图形。

b) Linear Regression

b)线性回归

The first step is to select the Target variable and predictor variable.

第一步是选择目标变量和预测变量。

Once you run the Linear regression model, you will get the results shown below: Using the browse tool, you get various other graphs, I have shown only one of them here.

一旦运行了线性回归模型，您将得到如下所示的结果：使用浏览器工具，您可以获得各种其他图形，这里仅显示其中一个。

Once we have the linear regression model, we will score it to calculate Training and Validation error as shown below.

有了线性回归模型后，我们将对其评分以计算训练和验证误差，如下所示。

Training and Validation dataset Score from Linear Regression Model

c) Random Forest

c)随机森林

The first step in Random Forest Modeling is to select the target variable and the predictor variable from the tool configuration. The next step is to select the number of trees. I chose to go with “100 trees”. If you are unsure of the number of trees, you can add multiple Random Forest blocks and test different tree depth and then select the tree depth which has the least Mean Absolute Error.

随机森林建模的第一步是从工具配置中选择目标变量和预测变量。下一步是选择树木的数量。我选择了“ 100棵树”。如果不确定树的数量，可以添加多个“随机森林”块并测试不同的树深度，然后选择“平均绝对误差”最小的树深度。

Once we have a final random forest model, we will cross-validate the results on the training and validation dataset and calculate the percentage error. The results show that the training and validation set have a similar Mean Absolute error. The model outputs lots of other information that can be used to see if the model is valid.

一旦有了最终的随机森林模型，我们将对训练和验证数据集上的结果进行交叉验证，并计算百分比误差。结果表明，训练和验证集具有相似的平均绝对误差。模型输出许多其他信息，可用于查看模型是否有效。

MAE and % Error on training and validation set.

第5步-评估： (STEP 5 — Assess:)

In this step, we will compare the different models using the Alteryx model comparison tool. We will compare the ‘MAE’ for the training and validation model and we will also look at the Predicted vs Actual graph.

在这一步中，我们将使用Alteryx模型比较工具比较不同的模型。我们将比较训练和验证模型的“ MAE”，还将查看“预测与实际”图。

Predicted vs Actual on Training and validation data.

From the graphs and the table, we can see that the Random forest model has the lowest Mean Absolute error, followed by the Linear regression model.

从图表和表格中，我们可以看到随机森林模型的平均绝对误差最低，其次是线性回归模型。

Also from the cross-validation results, we can see that Mean Absolute error for training and validation test is similar, Hence there is no over fit and the model can be used for prediction.

从交叉验证结果还可以看出，训练和验证测试的平均绝对误差相似，因此不会出现过度拟合，并且该模型可用于预测。

结论 (Conclusion)

Alteryx gives you a very quick and powerful way to prepare the data for analysis and test machine learning models. Depending on the data size, cross-validation could take a bit longer to execute.

Alteryx为您提供了一种非常快速而强大的方法来准备用于分析和测试机器学习模型的数据。根据数据大小，交叉验证可能需要更长的时间才能执行。

To speed up the modeling process in Alteryx, once you split and save the data into training and validation sets, run each model in a different workspace.

为了加快Alteryx中的建模过程，将数据拆分并保存到训练和验证集中后，请在不同的工作空间中运行每个模型。

You can learn more about Alteryx modeling from the Learning Guide in Alteryx official website as well as scores of tutorials and online help forums available. You can find the code for the example above on GITHUB.

您可以从Alteryx官方网站上的《 学习指南》以及许多可用的教程和在线帮助论坛中了解有关Alteryx建模的更多信息。您可以在GITHUB上找到上述示例的代码。

翻译自: https://medium.com/towards-artificial-intelligence/ds-101-alteryx-for-citizen-data-scientists-bc03134d84e2

ds数据库

查看全文

http://www.taodudu.cc/news/show-3651703.html

大数据挖掘与分析平台整体解决方案
matlab 找到数组中第一个不连续点_1.基础知识(1) Matlab基础知识
按方位角标注边长lisp_附合与闭合水准测量内业计算
《数据仓库与数据挖掘教程》ch02数据仓库原理章节整理
MADlib——基于SQL的数据挖掘解决方案（16）——回归之弹性网络回归
Vue3解析markdown解析并实现代码高亮显示
java se说明文档_Java笔记---Java官方API文档的下载及使用
ccf 201703-3 Markdown （ java）
新一代Java模板引擎Thymeleaf
CCF CSP认证题解：201703-3 Markdown（Java语言原创）
第一章java入门：Markdown语法
sql 连表修改数据
sql修改某一个字段部分数据
SQL批量修改纪要
SQL Server中怎样用命令修改数据类型
sql一次修改多条数据
SQL无法修改数据类型-解决方法
sql批量修改数据
SQL数据库——修改数据
SQL语句修改表操作
SQL数据属性的修改方法
SQL篇-修改字段、数据
sql添加、修改、删除数据，修改删除数据库表。
SQL语句对表中数据进行增删改操作[DML]
SQL新增、修改和删除数据
SQL数据库修改表结构
SQL增加、修改、删除数据
SQL server 修改表数据
SQL数据修改
SQL Server 修改数据

ds数据库_面向公民数据科学家的ds 101 Alteryx相关推荐

德国公民信用相关数据_作为公民数据科学家，没有任何事
德国公民信用相关数据数据科学,意见(Data Science, Opinion) Dear Aspiring Data Scientist, 亲爱的有抱负的数据科学家, Before you sta ...
数据结构堆栈内存堆栈_零堆栈数据科学家第二部分秋天
数据结构堆栈内存堆栈 In Hollywood, it is known that the sequels are rarely better than the original movie/par ...
分步式数据库_创建真实数据科学档案项目的分步指南
分步式数据库 As an inspiring data scientist, building interesting portfolio projects is key to showcase yo ...
算命数据_未来的数据科学家或算命精神向导
算命数据 Real Estate Sale Prices, Regression, and Classification: Data Science is the Future of Fortune ...
学习sql注入:猜测数据库_面向数据科学家SQL：学习简单方法
学习sql注入:猜测数据库 We don't pick a hammer and look for nails - that would be an unusual way of solving pr ...
分步式数据库_创建真实数据科学项目的分步指南
分步式数据库 As an inspiring data scientist, building interesting portfolio projects is key to showcase yo ...
vue取数据第一个数据_我作为数据科学家的第一个月
vue取数据第一个数据 A lot. 很多. I landed my first job as a Data Scientist at the beginning of August, and lik ...
cassandra可视化工具_一位数据科学家的私房工具清单
作为一位万人敬仰的数据科学家,不但需要培育一棵参天技能树,私人武器库里没有一票玩得转的大火力工具也是没法在江湖中呼风唤雨的. 近日北卡来罗纳大学CTO,一位数据科学家Jefferson Heard分享 ...
caj文件打不开显示内存不足_面向大数据的高效存储容量缩减技术研究
面向大数据的高效存储容量缩减技术研究 [摘要]:随着信息化的发展,全球数据量呈指数式增长,数据中心存储规模快速迈向了PB级甚至是EB级,其中包含了大量的冗余数据.这些冗余数据占用了大量的存储资源,导致 ...

ds数据库_面向公民数据科学家的ds 101 Alteryx