
When I was an undergraduate student studying Data Science, one of my professors always asked the same question for every data set we worked with — “What does zero mean?”


On the surface, this seems trivial. If the scenario is how many apples each student has, zero means that a student has no apples.

Then why ask the question?


Well, zero can mean zero… but it can also mean a slew of other things, and if you’re not careful, ignoring it could come back to haunt you.


It can be easy to disregard missing data. Whether the values are 0, NULL, NA, or blank, we are often quick to ignore these records because they “lack information”. However, these data points can be critical to our problem, and the “lack” of information is itself information.

Let’s consider the following scenario. We are consulting for a bank and they want us to determine if a customer is likely to default on their credit card payments. Below is a sample of the data we are given to evaluate this problem.

We see that there are five variables we could use to make predictions and of those five, four contain values of 0, blank, or in some cases both.


It might be easy to ignore these values or chalk them up as some kind of error in the bank’s system but let’s take a closer look and see if there might be more to the story.


Our first variable with missing data is Credit Score. Two of the eight customers have no credit score value. While it may seem like these values were simply skipped, notice that these two customers are 23 and 20 years old. Since they are relatively young, there is a good chance they only recently opened a credit card and consequently do not have a credit score yet. It might be tempting to fill these records with a value of 0, but that would not make sense, given that we don’t know how these customers will actually perform. How would we handle this in a real-life scenario? One approach would be to find the average score of individuals of similar ages and use that value for our missing records.
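
As a rough illustration of that last idea, here is a minimal plain-Python sketch. The records, the function name `impute_by_similar_age`, and the five-year `window` that defines “similar age” are all assumptions made up for this example; they are not the bank’s data or a recommended threshold.

```python
from statistics import mean

# Hypothetical records (not the bank's data); None marks a missing credit score.
customers = [
    {"age": 20, "credit_score": None},
    {"age": 22, "credit_score": 640},
    {"age": 23, "credit_score": None},
    {"age": 25, "credit_score": 660},
    {"age": 41, "credit_score": 720},
    {"age": 58, "credit_score": 780},
]


def impute_by_similar_age(records, window=5):
    """Fill missing scores with the mean score of customers within `window` years of age.

    The window size is an assumption chosen for illustration.
    """
    filled = []
    for rec in records:
        new = dict(rec)  # copy so the original records stay untouched
        # Remember that the value was missing; that fact is information too.
        new["score_was_missing"] = rec["credit_score"] is None
        if rec["credit_score"] is None:
            peers = [
                other["credit_score"]
                for other in records
                if other["credit_score"] is not None
                and abs(other["age"] - rec["age"]) <= window
            ]
            # Leave the gap in place when no similar-age peer has a score.
            new["credit_score"] = mean(peers) if peers else None
        filled.append(new)
    return filled


imputed = impute_by_similar_age(customers)
for row in imputed:
    print(row)
```

Keeping a `score_was_missing` flag next to the imputed value means a later model can still tell the estimated scores apart from the observed ones.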

The next column with missing data is Missed Payments. This time we have records with both missing data and values of 0. Intuitively, we can infer that a value of 0 indicates a customer has never missed a payment. In this case, 0 really does mean 0. What about the missing values? Well, as we discussed for Credit Score, there might be other factors affecting this variable. Notice again that our missing record is for a customer who is only 20 years old. Given that they also do not have a credit score, we might infer that they have never missed a payment because they have not yet had the chance to make one. If our customer only opened their account this month, then they would not have had to make a payment and consequently could not have missed one either.

Moving on to our final two variables, Credit Limit and Payment Due, we see there are again both missing values and 0 values. The values of 0 are fairly intuitive: $0 is a plausible amount in these cases. Our missing data, however, poses a bigger question. How can an individual have no value for a credit limit? Are they allowed to spend as much as they want? This same individual also has no payment due… how does that work?

Let’s approach this the same way as our other scenarios, by looking at the rest of the customer’s information. First, we can see this individual has a much lower score than all our other customers, including the 23-year-old (higher scores are better for credit). We also see that they have missed 8 payments, double the next-highest count. Hmmm… so why would they have no credit limit?

One possible answer — this individual had performed so poorly that the bank decided to terminate their account. As a result, the customer is still in our database but is not active with the bank any longer. Consequently, they cannot spend any money and would not be subject to making payments either.
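
If that explanation were confirmed, one way to act on it would be to set such records aside rather than treat them like customers with a genuine $0 limit. The sketch below assumes, purely for illustration, that a record with both Credit Limit and Payment Due missing is a candidate closed account; the records and the function name `split_possibly_closed` are hypothetical.

```python
# Hypothetical records; None marks a missing value, while 0 is a real zero.
accounts = [
    {"id": "A", "credit_limit": 5000, "payment_due": 120},
    {"id": "B", "credit_limit": 0, "payment_due": 0},
    {"id": "C", "credit_limit": None, "payment_due": None},
]


def split_possibly_closed(records):
    """Separate records whose limit and payment due are both missing from the rest."""
    active, review = [], []
    for rec in records:
        if rec["credit_limit"] is None and rec["payment_due"] is None:
            # Candidate closed account: an assumption to confirm with the bank.
            review.append(rec)
        else:
            # Includes genuine $0 limits and $0 payments.
            active.append(rec)
    return active, review


active, review = split_possibly_closed(accounts)
print([rec["id"] for rec in active])
print([rec["id"] for rec in review])
```

The comparison uses `is None` rather than a truthiness check on purpose, so that a real 0 is never mistaken for a missing value.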

All of the above are obviously hypothetical scenarios and explanations as to why data might be missing, why it might be zero, and how we might handle it. Data can be missing for a number of reasons but understanding why it is missing or zero can be critical in learning more and making better decisions. In a real-world scenario, you might be able to go back to the bank and ask clarifying questions around the data to verify if your assumptions are correct. Of course, there are plenty of times where that will not be an option either.

In the modern world of data science and machine learning, we often work with models that cannot handle missing values, so we are forced to deal with this data in another way. While it can be easy to simply drop these records or impute averages or medians, we should also take time to consider what these missing values represent. Imagine if we had simply imputed a value of 0 for every blank credit score in our bank example. A model may have made horrible predictions because we falsely assumed that 0 and missing were the same when, in this case, they were not. Similarly, if we had filled our missing credit limit values with 0, we would have been training a model to predict default on at least one customer who had already defaulted.
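
To see how much a zero-fill can distort even a simple summary, consider the small sketch below. The scores are invented for illustration and are not taken from the bank’s data.

```python
from statistics import mean

# Hypothetical credit scores; None marks a missing value.
scores = [710, None, 680, 540, None, 750]

# Average over the customers who actually have a score.
observed = [s for s in scores if s is not None]
mean_observed = mean(observed)

# Average when every missing score is treated as a real 0.
zero_filled = [0 if s is None else s for s in scores]
mean_zero_filled = mean(zero_filled)

# Keep "was missing" as its own signal instead of hiding it inside a 0.
missing_flags = [s is None for s in scores]

print(mean_observed, round(mean_zero_filled, 1), missing_flags)
```

With these made-up numbers, the zero-filled average falls far below the average of the observed scores, even though nothing about the customers changed. Only our assumption about what “missing” means did.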

Sometimes zero really is zero. Sometimes a missing value is simply human error in a data entry job. Sometimes, there’s a much deeper story going on. To quote my professor, “What the heck does zero mean?”

Original article: https://towardsdatascience.com/what-the-heck-does-zero-mean-8c5f42266dc6




