What does and doesn’t follow Benford’s law

Like many others, the first time I heard about Benford’s law, I thought: “What? That’s weird! What’s the trick?” And then, there is no trick. It is just there. It is a law that applies for no apparent good reason.

If you have never heard of it, let’s look at what it is: Imagine a set of numbers from some real-life phenomenon. Say, for example, the populations of all inhabited places on Earth. There would be thousands of numbers, some very big, some very small. Those numbers exist not because of some systematic process but somehow emerged out of the thousands of years of the lives of billions of people. You would therefore expect them to be almost completely random. Thinking about those numbers, if I asked you: “How many of them start with the digit 1, compared to 2, 3, 4, etc.”, intuitively you would probably say: “more or less just as many”. You would be wrong. The answer is: significantly more.

According to Benford’s law (which should really be called the Newcomb–Benford law, see below), in large sets of naturally occurring numbers spanning multiple orders of magnitude, the leading digit of any number is much more likely to be small than big. How much more likely? Formally, the probability P(d) that a number starts with the digit d is given by:

P(d) = log10(1+1/d)

That means that in those sets of naturally occurring numbers, the probability that a number will start with 1 is just over 30%, while the probability that it will start with 9 is just under 5%. Weird, right?

Probability of a digit starting a number according to Benford’s law.
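
The values behind that figure can be recomputed directly from the formula. The short sketch below is only an illustration (the function name `benford_probability` is made up for this example); it prints the expected share of each leading digit, from about 0.301 for 1 down to about 0.046 for 9.

```python
import math


def benford_probability(d):
    """Probability that the leading digit is d (1 to 9) under Benford's law."""
    return math.log10(1 + 1 / d)


# Print the expected share of each leading digit, rounded to 3 decimals.
for d in range(1, 10):
    print(d, round(benford_probability(d), 3))
```

The nine probabilities sum to 1, since the product of the terms (1 + 1/d) for d from 1 to 9 is exactly 10.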

This strange natural/mathematical phenomenon was first discovered by Simon Newcomb (hence the complete name), who noticed that the pages at the beginning of books of logarithmic tables, which cover numbers starting with 1, were much more worn than the pages at the end (covering numbers starting with 9). Based on this observation, which showed that people tended to need logarithmic tables more often for numbers starting with 1, he first formulated what is now known as the Newcomb-Benford law, although with a slightly different formula for the probability of the first digit. It was rediscovered by Frank Benford in 1938, 57 years after Newcomb’s 1881 note, and tested on several different things, including populations in the US. Since then, it has been tested on and applied to many things, from financial fraud detection to code, parking spaces or even COVID. The idea is that if a set of numbers occurs or emerges naturally, without being doctored or somehow artificially constrained, there is a good chance it will follow Benford’s law. If it doesn’t follow the law, there is something fishy.

But is this really universally true, and to what extent? We can assume that numbers representing different phenomena follow Benford’s law to different extents: sometimes more, sometimes less, and sometimes not at all. So the question is:

What follows Benford’s law, and perhaps more importantly, what does not?

To answer that, we need a lot of those sets of naturally-occurring numbers.

Wikidata to the rescue

Newcomb and Benford were not quite as lucky as we are. To find sets of numbers on which to test their law, they had to manually collect them from whatever source was available. Nowadays, not only do we have a universally accessible encyclopedia of everything, we have a data version of it: Wikidata.

Wikidata is the Wikipedia of data. It is a crowdsourced database of, if not everything, quite a big part of it. With Wikidata it is possible, for example, to quickly obtain the population of every US city with a relatively simple query, along with many, many more things. It should therefore also be possible to obtain all the sets of numbers it contains.

To do that, we use the RDF-based representation of Wikidata. RDF (Resource Description Framework) is a graph-based data representation for the web. Basically, things in RDF are represented by URIs, and connected by labelled edges to other things or values. For example, the figure below shows a simplified extract of what the representation of the city of Galway, located in Ireland and with a population of 79,504 people looks like in Wikidata’s RDF.

Graph representation of some information about Galway from Wikidata.

The nice thing about RDF is that even a very, very large graph can be represented by a set of triples of the form <subject,predicate,object>. Each of those triples corresponds to an edge in the graph, and represents one atomic piece of information.

```
<http://www.wikidata.org/entity/Q129610> <http://www.w3.org/2000/01/rdf-schema#label> "Galway" .
<http://www.wikidata.org/entity/Q129610> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.wikidata.org/entity/Q515> .
<http://www.wikidata.org/entity/Q515> <http://www.w3.org/2000/01/rdf-schema#label> "City" .
<http://www.wikidata.org/entity/Q129610> <http://www.wikidata.org/entity/P17> <http://www.wikidata.org/entity/Q27> .
<http://www.wikidata.org/entity/Q27> <http://www.w3.org/2000/01/rdf-schema#label> "Ireland" .
<http://www.wikidata.org/entity/Q129610> <http://www.wikidata.org/entity/P1082> "79504"^^<http://www.w3.org/2001/XMLSchema#integer> .
```
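
To make the triple structure concrete, here is a minimal sketch of how one such line could be split into its three parts in Python. It is not a full N-Triples parser: the pattern `TRIPLE` and the function `parse_triple` are illustrative names, and the sketch assumes that subject and predicate are both URIs and that the line is well formed (blank nodes and comments are not handled).

```python
import re

# One simple N-Triples line: <subject> <predicate> object .
# Assumption: subject and predicate are URIs; the object is kept as raw text.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')


def parse_triple(line):
    """Return (subject, predicate, object) or None if the line does not match."""
    match = TRIPLE.match(line)
    if match is None:
        return None
    return match.groups()


line = ('<http://www.wikidata.org/entity/Q129610> '
        '<http://www.wikidata.org/entity/P17> '
        '<http://www.wikidata.org/entity/Q27> .')
subject, predicate, obj = parse_triple(line)
print(predicate)  # the labelled edge between Galway and Ireland
```

Because every triple sits on its own line, this kind of line-by-line processing never needs the whole graph in memory.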

So the first step in collecting sets of numbers from Wikidata is to extract all the triples for which the object part is a number.

We start by downloading the full dump of the entire Wikidata database as a compressed NTriples (NT) file. NT is a terribly inefficient representation of RDF where each triple is written on its own line. The GZipped file to download (latest-all.nt.gz) is quite large (143GB) and I would not recommend trying to uncompress it. However, because each triple is represented on one line, completely independently from the rest, this format makes it very easy to filter the data with basic Linux command-line tools without having to load the whole thing in memory. So, to extract the triples whose objects are numbers, we use zgrep (grep that works on GZipped files) to find triples that have a reference to the decimal, integer or double types in the following way:

```shell
zgrep "XMLSchema#decimal" latest-all.nt.gz > numbers.decimal.nt
zgrep "XMLSchema#double" latest-all.nt.gz > numbers.double.nt
zgrep "XMLSchema#integer" latest-all.nt.gz > numbers.integer.nt
```

Then we can put all of those together using the cat command:

```shell
cat numbers.decimal.nt numbers.double.nt numbers.integer.nt > numbers.nt
```
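
For readers who would rather stay in Python, the three zgrep passes and the cat could in principle be replaced by a single streaming pass over the compressed file. The sketch below is an alternative, not what was run for this article: `filter_numeric_triples` and `NUMERIC_MARKERS` are hypothetical names, and unlike zgrep followed by cat it writes each matching line once and keeps the original line order.

```python
import gzip

# Datatype markers that flag a numeric literal (the same strings zgrep looks for).
NUMERIC_MARKERS = ("XMLSchema#decimal", "XMLSchema#double", "XMLSchema#integer")


def filter_numeric_triples(src_path, dst_path):
    """Copy the lines mentioning a numeric datatype from a gzipped N-Triples file.

    The file is read line by line, so it is never fully uncompressed on disk
    or loaded in memory. Returns the number of lines kept.
    """
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if any(marker in line for marker in NUMERIC_MARKERS):
                dst.write(line)
                kept += 1
    return kept
```

A call such as `filter_numeric_triples("latest-all.nt.gz", "numbers.nt")` would then produce the combined file directly; on a dump of this size it is likely to be slower than zgrep.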

And check how many triples this 110GB file ends up containing by counting the number of lines in it:

```
wc -l numbers.nt
730238932 numbers.nt
```

Each line is a triple. Each triple has a number as its value (object). That’s a lot of numbers.

The next step is to organise those numbers into meaningful sets and count how many in each set start with 1, with 2, with 3, etc. Here, we use the “predicate” part of the triple. There are, for example, 621,574 triples in this file that have <http://www.wikidata.org/entity/P1082> as predicate. http://www.wikidata.org/entity/P1082 is the property Wikidata uses to represent the population of inhabited places. So we can group all of those, and make it into the set of all populations known to Wikidata. That will be one of the naturally occurring sets of numbers that we will test.

The simple Python script below creates a JSON file with the list of properties and, for each of them, the number of triples of which it is the predicate, the minimum and maximum numbers those triples have as values (from which the number of orders of magnitude they cover can be derived), and the number of numbers starting with 1, 2, 3, etc.

```python
import json
import re
import time

data = {}
st = time.time()
count = 0

with open("numbers.nt") as f:
    for line in f:
        parts = line.split()
        p = parts[1]  # predicate: identifies the set the number belongs to
        digits = re.findall(r"\d+", parts[2])
        if not digits:
            continue  # skip objects that contain no digits at all
        val = int(digits[0])  # first run of digits found in the object
        val1 = str(val)[0]  # leading digit
        if p not in data:
            # i: minimum, a: maximum, c: number of triples, ns: leading-digit counts
            data[p] = {"i": val, "a": val, "c": 0, "ns": {}}
        if val < data[p]["i"]:
            data[p]["i"] = val
        if val > data[p]["a"]:
            data[p]["a"] = val
        if val1 not in data[p]["ns"]:
            data[p]["ns"][val1] = 0
        data[p]["ns"][val1] += 1
        data[p]["c"] += 1
        count += 1
        if count % 1000000 == 0:
            # progress: millions of triples read, properties seen, seconds elapsed
            print(str(count/1000000)+" "+str(len(data.keys()))+" "+str(time.time()-st))
            st = time.time()
        if count % 10000000 == 0:
            # periodic checkpoint of the partial results
            with open("numbers.json", "w") as f2:
                json.dump(data, f2)

with open("numbers.json", "w") as f2:
    json.dump(data, f2)
```
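
One detail worth noting: the script takes the first run of digits in the object, so a value such as 0.05 is counted under the leading digit 0 rather than 5. A possible refinement, shown here only as a sketch, is to look for the first significant digit instead. The function name `leading_digit` is hypothetical, and it assumes it is given the lexical form of the literal (for example `+79504` or `0.05`) without the surrounding quotes and datatype.

```python
from decimal import Decimal, InvalidOperation


def leading_digit(lexical):
    """Return the first significant digit of a numeric literal as a string.

    Returns None for zero, for non-finite values (NaN, infinity) and for
    text that is not a number.
    """
    try:
        value = Decimal(lexical)
    except InvalidOperation:
        return None
    if not value.is_finite() or value == 0:
        return None
    # The coefficient digits of a Decimal never start with 0 for non-zero values.
    return str(value.as_tuple().digits[0])


print(leading_digit("+79504"))  # 7
print(leading_digit("0.05"))    # 5
```

Whether this changes the results noticeably depends on how many values below 1 each property contains, which is not something the figures in this article can tell us.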

We obtain 1,582 properties in total, representing as many sets of numbers to be tested against Benford’s law. We reduce this to 505 properties, as Wikidata represents the same relation redundantly through several properties. In another script, we also extract the label (name) and description of each property, so we don’t have to look them up later.

Testing for Benfordness

Now that we have many sets of numbers, and their distributions according to their leading digits, we can check how closely they follow Benford’s law. Several statistical tests can be used to do this. Here, we use a relatively simple one, the Chi-Squared test. The value of this test for a set of numbers is given by the formula

χ² = Σᵢ(oᵢ-eᵢ)² / eᵢ

Where i is the leading digit under consideration (1 to 9), oᵢ is the observed value for i (the proportion of numbers in the set that start with i) and eᵢ is the expected value (the proportion of numbers that should start with i according to Benford’s law). The smaller the result, the more Benford the set of numbers is. The script below calculates the Chi-Squared test on each set of numbers created with the previous script, to check how well they fit Benford’s law.

```python
import json
import math
import sys

if len(sys.argv) != 2:
    print("provide filename")
    sys.exit(-1)

# Expected proportion of each leading digit according to Benford's law
es = {str(d): math.log10(1. + 1. / d) for d in range(1, 10)}
print("expected values: "+str(es))

with open(sys.argv[1]) as f:
    data = json.load(f)

for p in data:
    # Number of values whose leading digit is 1 to 9 (a leading "0" is ignored)
    total = 0
    for n in es:
        if n in data[p]["ns"]:
            total += data[p]["ns"][n]
    cs = 0.
    for n in es:
        e = es[n]
        a = 0.
        if n in data[p]["ns"]:
            a = float(data[p]["ns"][n])/float(total)
        cs += (((a-e)**2)/e)  # chi-square test
    data[p]["f"] = cs

with open(sys.argv[1]+".fit.json", "w") as f:
    json.dump(data, f)
```
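
Note that the value computed above works on proportions, so it does not depend on the size of the set. The textbook Pearson statistic is computed on counts instead, and is equal to the proportion-based value multiplied by the number of values in the set; it is that scaled value that would be compared against a chi-squared distribution with 8 degrees of freedom. The sketch below returns both, assuming leading-digit counts shaped like the `ns` dictionaries produced earlier (`benford_fit` is an illustrative name, not part of the original scripts).

```python
import math

# Expected proportion of each leading digit according to Benford's law.
EXPECTED = {str(d): math.log10(1 + 1 / d) for d in range(1, 10)}


def benford_fit(counts):
    """Return (fit on proportions, fit scaled by sample size), or None if empty.

    counts maps a leading digit ("1" to "9") to the number of values that
    start with it; any other key, such as "0", is ignored.
    """
    total = sum(counts.get(d, 0) for d in EXPECTED)
    if total == 0:
        return None
    fit = sum((counts.get(d, 0) / total - e) ** 2 / e for d, e in EXPECTED.items())
    return fit, total * fit


# A set spread evenly over the nine digits is far from Benford's law.
print(benford_fit({str(d): 100 for d in range(1, 10)}))
```

The first element is the quantity used for the rankings in the rest of this article; the second is only there for readers who want to relate it to the standard test.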

So, is it real?

The results obtained are available in a Google Spreadsheet for convenience. The scripts and results are also available on Github.

The first thing to notice when looking at the results is that our favourite example, population, does very very well. It is in fact the second best fit for Benford’s law with a Chi-Squared value of 0.000445. There are over 600K numbers in this, which just happen to exist, and they follow almost exactly what Benford’s law predicted. With such a low value for the Chi-Squared test and such a large sample, the chances that this could be a coincidence are so small, they are truly impossible to contemplate. It is real.

Unsurprisingly, several other properties closely related to population also end up in the top 10 best fits for Benford’s law, including literate/illiterate populations, male/female populations and number of households.

The question I’m sure everybody is dying to see answered is “which property is first then?”, since population is only second. With a Chi-Squared of 0.000344, the first place actually goes to a property called number of visitors which is described as the “number of people visiting a location or an event each year” (48,908 numbers in total).

Amongst the very highly Benford properties, we also find area (the area occupied by an object) and total valid votes (for elections; the number of blank votes also does well on Benfordness).

There also seem to be quite a few disease-related properties among the highly Benford ones in Wikidata, including number of cases, number of recoveries, number of clinical tests, and number of deaths.

Numbers related to companies also appear very strongly amongst the top Benford properties. The property employees (the number of employees of a company) is the strongest among those, but we also see patronage, net income, operating income, and total revenue.

Sports statistics make a good appearance, with total shots in career, career plus-minus rating and total points in career, together with several biology- and other nature-related topics, such as wingspan (of aeroplanes or animals), proper motion (of stars), topographic prominence (i.e. the height of a mountain or hill) or distance from Earth (of astronomical objects).

There are of course many more properties that fit Benford’s law very well: the ones above only cover the very best-fitting sets of numbers (with a Chi-Squared below 0.01). They are also not particularly surprising, as they match very well the characteristics of sets of numbers that should normally follow Benford’s law: they are large (the smallest, number of blank votes, still contains 886 numbers), cover several orders of magnitude (from 3 to 80) and, more importantly, are naturally occurring: they are not generated through any systematic process. They just emerged.

There is one very significant exception to this, however. With a Chi-Squared value of 0.00407, and 3,259 numbers covering 7 orders of magnitude, we find the property Alexa rank. This corresponds to the ranking of websites from the Alexa internet service, which provides information about websites based on traffic and audiences. It is very hard to explain how it could fit so well since, being a ranking, it should normally be uniformly distributed from 1 to its maximum value. There are, however, two possible explanations as to why this happened: (1) for a given website, several rankings might be available for several years, and (2) not all websites ranked by Alexa are in Wikidata. In other words, it is not the ranking itself that follows Benford’s law, but the naturally occurring selection of rankings in Wikidata. Worryingly, the same kind of thing might affect other results too, and it is a good example of how any dataset might be biased in a way that seriously affects the results of statistical analyses.
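
To see why a complete ranking should not be Benford, consider a made-up ranking that contains every position from 1 to 999 exactly once. The sketch below (with the illustrative helper `leading_digit_shares`) counts leading digits for such a ranking: each digit from 1 to 9 starts exactly 111 of the 999 ranks, one ninth each, against the roughly 30% that Benford’s law predicts for the digit 1.

```python
from collections import Counter


def leading_digit_shares(numbers):
    """Share of each leading digit (1 to 9) among positive integers."""
    counts = Counter(str(n)[0] for n in numbers)
    total = sum(counts.values())
    return {str(d): counts.get(str(d), 0) / total for d in range(1, 10)}


# A complete, hypothetical ranking with positions 1 to 999.
shares = leading_digit_shares(range(1, 1000))
print(shares)
```

The exact shares depend on where the ranking stops (a ranking from 1 to 1,999 would be dominated by the digit 1), but in no case do they follow the logarithmic curve by construction.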

How about the bad ones?

So, we have verified that Benford’s law indeed applies to many naturally occurring sets of numbers, and even sometimes to naturally occurring selections of non-naturally occurring numbers. What about the cases when it does not work?

First, we can eliminate all the sets that are too small (fewer than 100 numbers) or that cover too few orders of magnitude (fewer than 3). Those, in most cases, would not fit at all, and if they do, we cannot rule out the possibility that it is just a coincidence.
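
This filtering step could look like the sketch below, applied to the dictionary produced by the collection script (keys `c` for the count, `i` for the minimum and `a` for the maximum). How orders of magnitude were actually counted for this article is not shown, so the definition used here, the number of digit lengths spanned between the minimum and the maximum, is an assumption, and both function names are hypothetical.

```python
def orders_of_magnitude(smallest, largest):
    """Number of digit lengths spanned by non-negative integers.

    Assumed definition: values from 4 to 1016 span lengths 1 to 4, so 4.
    """
    return len(str(largest)) - len(str(smallest)) + 1


def keep_testable(data, min_count=100, min_orders=3):
    """Keep only the properties with enough values and enough spread."""
    return {
        p: stats for p, stats in data.items()
        if stats["c"] >= min_count
        and orders_of_magnitude(stats["i"], stats["a"]) >= min_orders
    }


# Made-up statistics for three properties, in the shape of numbers.json.
sample = {
    "wide_and_large": {"i": 4, "a": 1016, "c": 1783, "ns": {}},
    "too_narrow": {"i": 1, "a": 50, "c": 5000, "ns": {}},
    "too_small": {"i": 1, "a": 1000000, "c": 20, "ns": {}},
}
print(list(keep_testable(sample)))
```

Only the first property survives: the second covers too few orders of magnitude and the third has too few values.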

The worst-fitting property of the whole set we looked at is lowest atmospheric pressure, described as the “minimum pressure measured or estimated for a storm (a measure of strength for tropical cyclones)”, with a Chi-Squared of 12.6. The set contains 1,783 numbers varying from 4 to 1,016. While this is a naturally occurring set of numbers that matches the needed characteristics, it is easy to see why it does not fit. Atmospheric pressure does not usually vary that much, and it can be expected that most of the values are actually close to the average sea-level pressure (1013 mbar). It is even possible that the value of 4 is simply an error in the data.

Many of the other non-fitting properties can be explained similarly: their values usually don’t vary enough for them to follow Benford’s law, but exceptions and outliers mean that they still span several orders of magnitude. Those include wheelbase (distance between front wheels and rear wheels on a vehicle), life expectancy (of species), field of view (of a device for example), or mains voltage (in a country or region).

Interestingly (maybe because I know nothing about astronomy), while related to natural phenomena, many of the badly fitting properties correspond to measures related to planets or space: semi-major axis of an orbit, effective temperature (of a star or planet), apoapsis (distance at which a celestial body is the farthest from the object it orbits), periapsis (distance at which a celestial body is the closest to the object it orbits), metallicity (abundance of elements that are heavier than hydrogen or helium in an astronomical object) and orbital period (the time taken for a given astronomical object to make one complete orbit about another object). Maybe Benford’s law is a law of Earth’s nature, or even of human nature, rather than a law of the universe.

If that’s the case however, there is an interesting exception here: Within the really badly fitting properties, we find number of viewers/listeners (number of viewers of a television or broadcasting program; web traffic on websites). This set includes 248 numbers, ranging from 5,318 to 6 billion. This feels very paradoxical considering that the Alexa rank (which is related to the size of the audience for websites) was also our exception for well-fitting properties. Maybe the same explanation applies however: If we had a complete set of numbers of subscribers/viewers for everything, it might follow Benford’s law very well, but we don’t and the bias introduced by the selection of those 248 ones, which might focus on the most noticeable websites and programs, is a sufficiently unnatural constraint that it makes it lose its Benfordness.

So?

Benford’s law is weird. It is unexpected and not well explained, but somehow actually works. There is clearly a category of sets of numbers that are supposed to follow Benford’s law, and in effect very much do so. There are also others that apparently don’t, and in most cases it is relatively easy to see why. Interestingly, however, there are a few cases where numbers don’t behave the way they should with respect to Benford’s law. As mentioned at the beginning of this article, this phenomenon has been extensively used to detect when numbers have been doctored, by checking for numbers that should follow Benford’s law and in effect do not. It seems that there is another category of applications, looking at selections of numbers that should not follow Benford’s law. If they do, it might be an indication that what may very well look like random sampling has in fact been biased by the tendency of uncontrolled human and natural processes to produce highly Benford numbers. Knowing this, and which category a dataset is supposed to fall in, could be very useful for testing data against such biases and for checking the representativeness of data samples.

And of course, there is the case of astronomy. I suspect the answer would simply emphasize my own ignorance in the matter, but I would really like to know why all those measures related to astronomical bodies so stubbornly refuse to obey Benford’s law.

Translated from: https://towardsdatascience.com/what-does-doesnt-follow-benford-s-law-7d0b3c14afa5
