Machine Learning Is Not Always the Solution to a Data Problem

Overview

I was given a large dataset of files, what some would call big data, and was told to come up with a solution to the data problem. People often associate big data with machine learning and automatically jump to a machine-learning solution. However, after working with this dataset, I realised that machine learning was not the solution. The dataset was provided to me as a case study to complete as part of a four-step interview process.

Task description

The dataset consists of a set of files tracking email activity across multiple construction projects. All data has been anonymised. The task was to explore the dataset and report back with any insights. I was informed that clients are concerned with project duration and with the number of site instructions and variations seen on a project, as these typically cost money.

Data ETL

First, the correspondence data was read in and appended together. The data was checked for duplicates; none were found.

As clients are concerned with project duration, the difference in days between the response-required-by date and the sent date was calculated. However, there were quite a few missing values for the response-required-by date and some for the sent date. These records were excluded, reducing the dataset from 20,006,768 records across 7 variables to 3,895,037. The correspondence data was then combined with the mail types file to determine whether the type of correspondence has an impact on duration. Finally, a file containing the number of records for each project was joined on.

Usually, it is not good practice to exclude data without a valid reason; however, as this dataset was assigned to me as part of a job application process, I did not have the opportunity to understand it well enough to impute the missing dates.

As you can see from the code below used to import the dataset, we do not have much information on the emails other than projectID, number of records, typeID and typeName.

As a large number of .csv files had to be read in from a single directory, I used lapply with the fread function to read in the files and appended the resulting list of tables using rbindlist.
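
The import step described above can be sketched as follows. This is a minimal sketch, not the author's original code (which was not included in the text): the directory path is a placeholder, and the use of the data.table package is inferred from the functions named (fread, rbindlist).

```r
# Sketch of the import step: read every .csv in a directory and append.
# "data/correspondence" is a placeholder path, not from the original code.
library(data.table)

csv_files <- list.files("data/correspondence", pattern = "\\.csv$",
                        full.names = TRUE)
correspondence <- rbindlist(lapply(csv_files, fread),
                            use.names = TRUE, fill = TRUE)

# Duplicate check described earlier: count fully duplicated rows
sum(duplicated(correspondence))
```

Setting use.names and fill guards against files whose columns arrive in a different order or with columns missing.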

It is good practice to save your consolidated dataset object as an R object to avoid having to rerun the import code at a later date, as this process can be quite time-consuming depending on the number of files.
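
In R this is typically done with saveRDS/readRDS; the filename below is an assumption for illustration.

```r
# Save the consolidated object once, then reload it in later sessions
# instead of re-reading thousands of .csv files.
saveRDS(correspondence, "correspondence.rds")

# Later session:
correspondence <- readRDS("correspondence.rds")
```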

Feature engineering

New variables were created to understand project duration: the duration in days, and whether the project was submitted after the response-required-by date. If so, it was late; otherwise, it was early or on time.
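
A sketch of these two derived variables, assuming data.table syntax; the column names (sentDate, responseRequiredBy, submittedDate) are hypothetical, since the real names were not given in the text.

```r
library(data.table)

# Duration in days: response-required-by date minus sent date
correspondence[, duration_days := as.numeric(
  as.IDate(responseRequiredBy) - as.IDate(sentDate))]

# Late flag: submitted after the response-required-by date
correspondence[, status := fifelse(
  as.IDate(submittedDate) > as.IDate(responseRequiredBy),
  "late", "early_or_on_time")]
```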

The unique correspondence dataset was then joined to the mail types file on correspondence type ID (the primary key). This was then joined to the main file on project ID.
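
The two joins can be sketched with base-R merge; the table names (mail_types, project_counts) and key column names are assumptions for illustration.

```r
# Left join mail types on the correspondence type ID (primary key)
correspondence <- merge(correspondence, mail_types,
                        by = "correspondenceTypeID", all.x = TRUE)

# Then join the per-project record counts on project ID
correspondence <- merge(correspondence, project_counts,
                        by = "projectID", all.x = TRUE)
```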

My initial insights after the joins are as follows:

  • Correspondence ID is unique, so no aggregation is required; it will not assist the analysis.
  • There are too many organisation IDs and no sensible way to group them, so they were excluded as a predictor.
  • There are too many user IDs for them to be a useful predictor, especially as some have a frequency count of only 1. Also excluded as a predictor.

To create a unique row for each correspondence ID, project ID and typeName, I needed to aggregate the other features. I did this by calculating frequencies (counts of late and early submissions) and summary statistics such as the maximum, minimum and mean duration in days.
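
The aggregation might look like the following data.table sketch; the grouping keys and derived columns (status, duration_days) are assumed names, not taken from the original code.

```r
library(data.table)

# One row per grouping key, with frequency counts and summary statistics
agg <- correspondence[, .(
  n_late        = sum(status == "late"),
  n_early       = sum(status != "late"),
  mean_duration = mean(duration_days, na.rm = TRUE),
  min_duration  = min(duration_days, na.rm = TRUE),
  max_duration  = max(duration_days, na.rm = TRUE)
), by = .(projectID, typeName)]
```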

Modelling

Grouping the data reduced it to 51,156 observations across 7 variables. This sample size is now rather small; however, a large majority of the data had already been dropped due to missing response dates, and keeping every single record per project ID would amount to a comparison per organisation and user ID. The client appears to be interested in site instructions and variations, which can probably be found in correspondence type ID and typeName, and record-level data would be too granular for the task requirements.

We do have an issue in that we do not know what each ID stands for or whether it is important.

Two types of models were run:

A GLM (Gaussian) was fitted to determine the linear combination of predictors likely to drive an increase or decrease in average duration in days.

A GBM (Gaussian) was run to again identify the top predictors and how they relate to average project duration (days).

It is good practice to run multiple models in order to ensure that the model with greater accuracy and interpretability is selected.

I first partitioned the dataset into training (70%) and test (30%) sets using a random split. In retrospect, I could have split by date instead, to determine how well the model predicts average duration into the future.
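
A random 70/30 split in base R might look like this; the seed value and the table name agg are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed, for reproducibility

train_idx <- sample(seq_len(nrow(agg)), size = floor(0.7 * nrow(agg)))
train <- agg[train_idx]
test  <- agg[-train_idx]
```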

I used the glm and gbm functions from the H2O package to make use of the parallel processing it provides, as my laptop was quite slow. I ran 5-fold cross-validation on the training set: the training data is partitioned into 5 equal folds; in each round, one fold is held out for validation and the model is trained on the remaining four; this repeats until every fold has served once as the validation set. The accuracy of the five resulting models is then averaged into a single accuracy metric.
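
A hedged sketch of the H2O workflow described above; the target column mean_duration and seed are assumptions, and the predictor set would need adjusting to match the actual aggregated table.

```r
library(h2o)
h2o.init()  # starts a local H2O cluster

train_h2o <- as.h2o(train)
test_h2o  <- as.h2o(test)
predictors <- setdiff(names(train), "mean_duration")

# Gaussian GLM with 5-fold cross-validation
glm_fit <- h2o.glm(x = predictors, y = "mean_duration",
                   training_frame = train_h2o,
                   family = "gaussian", nfolds = 5, seed = 123)

# Gaussian GBM with 5-fold cross-validation
gbm_fit <- h2o.gbm(x = predictors, y = "mean_duration",
                   training_frame = train_h2o,
                   nfolds = 5, seed = 123)

# Test-set accuracy metrics (RMSE, R-squared, etc.)
h2o.performance(glm_fit, newdata = test_h2o)
h2o.performance(gbm_fit, newdata = test_h2o)
```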

Output from the glm model is shown below. For continuous target variables, RMSE and R-squared are commonly used as accuracy metrics. We want the RMSE to be as close to 0 as possible and the R-squared value to be as close to 1 as possible. In our case, we can see that model accuracy is awful on both metrics.

Now, let’s look at the output from the gbm. The results from this model are marginally better but nothing to rave about.

Results show that of the 1,500 predictors entered into the model (1,500 due to binarisation of the categorical variables), 672 have some degree of influence. Only 2 iterations were run, despite the model having the option of many more to produce the best output, because the best value of lambda was reached after two iterations, giving a poor goodness-of-fit score of 0.38% R-squared.

Though mean number of records did not come out as a significant predictor in the glm, it appears to be the most important in the gbm, followed by correspondence type and typeName.

Due to the low accuracy for both models, I wouldn’t want to make any deductions from the output.

I decided to investigate the output further by plotting the top predictors. From the box plots below, we can see that average duration is longest, and variation highest, for PM request for approval sample, compared to email and fax, which typically have durations close to zero.
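
Box plots like the ones described can be produced in a few lines; ggplot2 is one option (the original plotting code was not shown, so this is an assumed approach using the aggregated table and column names sketched earlier).

```r
library(ggplot2)

# Distribution of average duration by correspondence type
ggplot(agg, aes(x = typeName, y = mean_duration)) +
  geom_boxplot() +
  coord_flip() +  # horizontal boxes keep long type names readable
  labs(x = NULL, y = "Average duration (days)")
```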

In the box plots below, we can see that average duration is higher for payment claim than for design query and non-conformance notice.

Conclusion

The quality of a model depends on the quality of the data and its features. In this case, we started with a very small number of features and a very poor understanding of the dataset, which led to the removal of a large proportion of the data and to poor model accuracy despite some feature engineering.

In this example, we found that a mere exploration of the dataset would have answered the business question about what impacts project duration: we would have found that payment claim and PM request for approval sample can lead to an increase in duration.

The number of site instructions and variations per project could likewise have been explored by calculating frequencies by project ID and typeName.

This is an example of a data problem where a predictive model was not required. To build a model with better accuracy, additional features and datasets would be required, along with a better understanding of the data.

I would love to hear what you think and whether I could have approached this problem differently! :)

All code can be found here: https://github.com/shedoesdatascience/email_analysis/blob/master/email_analysis_documentation.Rmd

Translated from: https://towardsdatascience.com/machine-learning-is-not-always-the-solution-to-a-data-problem-7f07c000f15
