Knowing when to stop

In predictive analytics, it can be a tricky thing to know when to stop.

Unlike many of life’s activities, there’s no definitive finish line after which you can say “tick, I’m done”. The possibility always remains that a little more work will yield an improvement to your model. With so many variables to tweak, it’s easy to end up obsessing over tenths of a percentage point, pouring huge amounts of effort into the details before looking up and wondering “Where did the time go?”

Iterating on your model via feature engineering, model selection and hyper-parameter tuning is a key skill for any data scientist. But knowing when to stop is rarely addressed, and it can vastly alter the cost of model development and the ROI of a data science project.

I’m not talking about over-fitting vs under-fitting here. Over-fitting is where your model is fit too closely to your training data, and it can be detected by comparing the training set error with the validation set error. There are many great tutorials on Medium and elsewhere which explain all this in much more detail.

I’m referring to the time you spend working on the entire modelling pipeline, and how you quantify the rewards and justify the cost.

Strategies

Some strategies that can help you decide when to wrap things up might be:

Set a deadline: Parkinson’s law states that “work expands so as to fill the time available for its completion”. An open-ended time-frame invites you to procrastinate by spending time on things that ultimately don’t add much value to the end result. Setting yourself a deadline keeps costs low and predictable by forcing you to prioritise effectively. The downside, of course, is that if you set your deadline too aggressively, you may deliver a model of poor quality.

Acceptable error rate: You could decide beforehand on an acceptable error rate and stop once you reach it. For example, a self-driving car might need to identify cyclists with 99.99% accuracy. The difficulty with this approach is that before you start experimenting, it’s very hard to set expectations as to how accurate your model could be. Your desired accuracy might be impossible, given the level of irreducible error. On the other hand, you might stop prematurely whilst there is still room to easily improve your model.

Value gradient method: By plotting the real-world cost of the error in your model against the effort required to reduce it, you gain an understanding of the return on investment for each incremental improvement. This allows you to keep developing your model, stopping only when the predicted value of additional tuning falls below the value of your time.

The law of diminishing returns

As you invest time in tweaking your model, you may find that your progress is fast in the beginning but quickly plateaus. You’ll likely make the most obvious improvements first, but as time goes by you’ll end up working harder and harder for smaller gains. Within the data itself, irreducible error puts an upper limit on the level of accuracy that your model can achieve, however much of the reducible error you remove.

In a learning exercise or a Kaggle competition, you can iterate to your heart’s content, chasing those incremental improvements further and further down the decimal places.

However, for a commercial project, the cost of tuning this model climbs linearly with respect to the amount of time you have invested. This means there comes a point where scraping out an extra 0.1% will not be worth the investment.

This varies from project to project. If you’re working with supermarket data, given the huge number of purchases on a daily basis, an additional hundredth of a percentage point of accuracy might be worth a lot of money. This puts a strong ROI on continuing efforts to improve your model. But for projects of more modest scale, you might have to draw the line a bit sooner.

The real-world cost of model error

When tuning a model, the values you’re likely to be paying attention to are statistical in nature. MSE, % accuracy, R² and AIC are defined by their mathematical formulae, and are indifferent to the real-world problem you’re attempting to solve.

Rather than considering statistical measures of accuracy and error alone, you should convert them into something that can be weighed against the time you’re investing, i.e. money.

Let’s say we run an ice-cream kiosk, and we’re trying to predict how many ice-creams we’ll sell on a daily basis, using variables like the weather, day of week, time of year etc.

No model we create will be perfect, and for any given day it will usually either:

Overestimate: we buy more ingredients than we need for the number of ice-creams sold.
Underestimate: we run out of stock and lose out on potential business.

Both of these types of error introduce a monetary cost to the business. If we run out of stock at midday, we’ve lost the margin on half a day’s sales. And if we overestimate, we may end up spending money on ingredients that end up being thrown away.

We can introduce business rules on top of our model to help reduce some of this loss. The cost of losing 1 ice-cream’s worth of sales is likely higher than the cost of throwing away 1 ice-cream’s worth of out-of-date milk (given we’re hopefully making a profit). Therefore, we’ll want to be biased in favour of over-stocking, for example by holding 20% more ingredients than suggested by the model’s prediction. This will greatly reduce the frequency and cost of stock outages, at the expense of having to throw out a few bottles of out-of-date milk.
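
As a rough illustration of the asymmetric cost described above, here is a minimal Python sketch. The function name `daily_error_cost` and the dollar figures (a $2.00 margin per lost sale, $0.50 of wasted ingredients per unsold ice-cream) are hypothetical values chosen for the example, not numbers from a real kiosk. Only the 20% buffer comes from the text.

```python
def daily_error_cost(predicted, actual, buffer=0.2,
                     margin_per_sale=2.0, waste_per_unit=0.5):
    """Estimated cost in dollars of one day's forecast error.

    All monetary parameters are assumed, illustrative values.
    """
    # Business rule: stock more than the model suggests.
    stocked = predicted * (1 + buffer)
    if stocked < actual:
        # Under-stocked: we lose the margin on every sale we could not make.
        return (actual - stocked) * margin_per_sale
    # Over-stocked: the surplus ingredients are thrown away.
    return (stocked - actual) * waste_per_unit


# Model predicts 100 sales; demand turns out to be 130, then 90.
print(round(daily_error_cost(100, 130), 2))
print(round(daily_error_cost(100, 90), 2))
```

Because a lost sale is assumed to cost more than a wasted unit, the same size of error is cheaper on the over-stocking side, which is what motivates the buffer.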

Optimising this 20% rule falls under the umbrella of Prescriptive Analytics. Using the training set, we can tweak the rule until the average estimated real-world cost of the model’s error is at its lowest.
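
One simple way to do this tuning is a grid search over candidate buffers, sketched below. The helper names (`mean_cost`, `best_buffer`), the cost figures and the toy prediction and sales lists are all assumptions made up for illustration. In practice you would plug in your own cost function and training-set predictions.

```python
def mean_cost(buffer, predictions, actuals,
              margin_per_sale=2.0, waste_per_unit=0.5):
    """Average daily cost of error for a given stock buffer (assumed costs)."""
    total = 0.0
    for predicted, actual in zip(predictions, actuals):
        stocked = predicted * (1 + buffer)
        if stocked < actual:
            total += (actual - stocked) * margin_per_sale   # lost sales
        else:
            total += (stocked - actual) * waste_per_unit    # wasted stock
    return total / len(predictions)


def best_buffer(predictions, actuals, candidates):
    """Return the candidate buffer with the lowest average cost."""
    return min(candidates, key=lambda b: mean_cost(b, predictions, actuals))


# Toy training-set predictions and actual sales (made-up numbers).
predictions = [100, 120, 80, 150, 90]
actuals = [110, 115, 95, 160, 85]
candidates = [i / 20 for i in range(11)]  # 0%, 5%, ..., 50%

chosen = best_buffer(predictions, actuals, candidates)
print(chosen, round(mean_cost(chosen, predictions, actuals), 2))
```

A grid search is only one option. With a smooth cost function, a numerical optimiser would do the same job.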

The value gradient method

Now that we have an estimated real-world cost for the model’s error, we can gauge what the time we’re investing in the model is worth. With each iteration, we subtract the new version’s real-world cost from that of the previous version to work out the value added by our extra effort. From there, we can extrapolate to an ROI over a given time window.

For example, your validation set may contain 1,000 rows and your latest model saved $40 vs the previous iteration. If you are expecting to collect 100,000 data-points per year, then you can multiply the added value by 100 to get an annual rate. Therefore, the work you put in to produce the latest version of the model gives a return of $4,000 per year.

Comparing this to the cost of our time gives us an expected return on investment. E.g. if the above enhancement required a day’s work for someone earning $400 per day, it pays for itself very quickly.
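
The arithmetic in this example can be written out as a small Python sketch. The function names `annual_value` and `payback_days` are hypothetical, and the calculation assumes the saving measured on the validation set scales linearly with the number of data-points collected.

```python
def annual_value(saving_on_validation, validation_rows, rows_per_year):
    """Scale a saving measured on the validation set up to a yearly figure."""
    return saving_on_validation * rows_per_year / validation_rows


def payback_days(annual_value_usd, effort_cost_usd):
    """Days of operation needed for an improvement to repay its cost."""
    return effort_cost_usd / (annual_value_usd / 365)


value = annual_value(40, 1_000, 100_000)  # $40 saved on 1,000 rows
cost = 1 * 400                            # one day's work at $400 per day

print(value)                                  # 4000.0
print(f"{payback_days(value, cost):.1f}")     # about 36.5 days
```

Under these numbers the day of work repays itself in a little over a month.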

However, as the law of diminishing returns eats away at our rate of improvement, our margin will begin to fall. When it approaches zero, it’s time to take what we have and move on to the next stage in our project.
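
One possible way to turn this into a stopping rule is sketched below. It assumes the next iteration will gain no more than the last one did, which is a simplification, and the function name `should_continue`, the one-year horizon and the cost history are made-up values for illustration.

```python
def should_continue(cost_history, effort_cost_per_iteration, horizon_years=1.0):
    """Decide whether another tuning iteration looks worthwhile.

    cost_history holds the estimated annual real-world cost of model
    error after each iteration, oldest first.
    """
    if len(cost_history) < 2:
        return True  # nothing to compare against yet
    # Value added by the most recent iteration over the chosen horizon.
    last_gain = (cost_history[-2] - cost_history[-1]) * horizon_years
    # Margin = value added minus the cost of the time spent.
    return last_gain - effort_cost_per_iteration > 0


# Annual cost of error after each iteration (hypothetical figures).
history = [10_000, 6_000, 5_000, 4_800]
for step in range(1, len(history) + 1):
    print(step, should_continue(history[:step], effort_cost_per_iteration=400))
```

With these figures the last iteration saved $200 per year for $400 of effort, so the rule suggests stopping.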

Of course, this is an inexact science. It assumes that improvements to our model will occur in a smooth and predictable way and that future gains will be smaller than previous ones. Whenever you call it a day, there is a possibility that a significant breakthrough lies just around the corner.

But it’s always a good idea to keep a commercial eye on the time you’re investing in a model, allowing you to do more valuable work by keeping costs down and freeing up time to spend on the most important things.

Coming soon: A Python module which takes the predictions, actual values and a cost-function and outputs the expected ROI for any model — allowing you to integrate the above decision making into your model tuning process.

Translated from: https://towardsdatascience.com/knowing-when-to-stop-b73ceeec7d9f
