Will the sun rise tomorrow?

Laplace, Bayes, and machine learning today

It may not be a question that you were worrying much about. After all, it appears to happen every day without fail.

But what is the probability the sun will rise tomorrow?

Believe it or not, this question was considered by one of the all-time greats of mathematics, Pierre-Simon Laplace, in his pioneering 1814 work “Essai philosophique sur les probabilités”.

Fundamentally, Laplace’s treatment of the question was intended to illustrate a more general concept. It was not a serious attempt to estimate whether the sun will, in fact, rise.

In his essay, Laplace describes a framework for probabilistic reasoning that today we recognise as Bayesian.

The Bayesian approach forms a keystone in many modern machine learning algorithms. But the computational power required to make use of these methods has only been available since the latter half of the 20th Century.

(So far, it appears current state-of-the-art AI is keeping quiet on the issue of tomorrow’s sunrise.)

Laplace’s ideas are still relevant today, despite being developed more than two centuries ago. This article will review some of these ideas, and show how they are used in modern applications, perhaps envisaged by Laplace’s contemporaries.

Pierre-Simon Laplace

Born in the small Normandy commune of Beaumont-en-Auge in 1749, Pierre-Simon Laplace was initially marked out to become a theologian.

However, while studying at the University of Caen, he discovered a brilliant aptitude for mathematics. He transferred to Paris, where he impressed the great mathematician and physicist Jean le Rond d’Alembert.

At the age of 24, Laplace was elected to the prestigious Académie des Sciences.

Laplace was an astonishingly prolific scientist and mathematician. Amongst his many contributions, his work on probability, planetary motion, and mathematical physics stands out. He counted figures such as Antoine Lavoisier, Jean d’Alembert, Siméon Poisson, and even Napoleon Bonaparte as his collaborators, advisers, and students.

Laplace’s “Essai philosophique sur les probabilités” was based upon a lecture he delivered in 1795. It provided a general overview of ideas contained within his work “Théorie analytique des probabilités”, published two years earlier in 1812.

In “Essai philosophique”, Laplace provides ten principles of probability. The first few cover basic definitions, and how to calculate probabilities relating to independent and dependent events.

Principles eight, nine, and ten concern the application of probability to what we might describe today as cost-benefit analysis.

The sixth is an important generalization of Thomas Bayes’ eponymous theorem of 1763.

It states that, for a given event, the likelihood of each possible cause is found by multiplying the prior probability of that cause by a fraction.

This fraction is the probability of the event arising from that particular cause, divided by the probability of the event occurring by any cause.
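
The sixth principle can be sketched in a few lines of Python. This is only an illustration: the function name, the two-urn scenario, and the numbers are assumptions made up for the example, not taken from Laplace.

```python
def posterior_over_causes(priors, likelihoods):
    """Return P(cause | event) for each cause.

    priors[c]      -- prior probability of cause c
    likelihoods[c] -- probability of the event given cause c
    """
    # Probability of the event occurring by any cause.
    evidence = sum(priors[c] * likelihoods[c] for c in priors)
    # Prior multiplied by the fraction likelihood / evidence.
    return {c: priors[c] * likelihoods[c] / evidence for c in priors}


# Hypothetical example: two urns, equally likely beforehand, and a red ball is drawn.
priors = {"urn_A": 0.5, "urn_B": 0.5}
likelihoods = {"urn_A": 0.9, "urn_B": 0.3}  # assumed P(red | urn)
posterior = posterior_over_causes(priors, likelihoods)
print(posterior)  # urn_A comes out at roughly 0.75, urn_B at roughly 0.25
```

Seeing a red ball shifts belief towards the urn that produces red balls more often, exactly as the prior-times-fraction recipe prescribes.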

This theorem’s influence within machine learning cannot be overstated.

The seventh principle is the one that has caused the most controversy since its publication. However, the actual wording is innocuous enough.

Rather, it is Laplace’s choice of discussing the probability of the sun rising the next day by way of illustrative example that has in turn drawn derision and objection over the following two centuries.

The rule that this example was meant to illustrate, now known as the rule of succession, is still used today under various guises, and sometimes in the form Laplace originally described.

In fact, the rule of succession represents an important early step in applying Bayesian thinking to systems for which we have very limited data and little or no prior knowledge. This is a starting point often faced in modern machine learning problems.

Laplace’s rule of succession

The seventh principle of probability given in Laplace’s “Essai philosophique” is, in essence, straightforward.

It states that the probability of a given event occurring is found by summing the probability of each of its potential causes multiplied by the probability of that cause giving rise to the event in question.
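
In modern notation this statement reads as the law of total probability. The symbols are chosen here for illustration: E is the event and C_1, …, C_k are its mutually exclusive potential causes.

$$P(E) = \sum_{i=1}^{k} P(C_i)\, P(E \mid C_i)$$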

Laplace then proceeds to outline an example based upon drawing balls from urns. So far, so good. Nothing contentious yet.

However, he then describes how to proceed with estimating the probability of an event occurring in situations where we have limited (or indeed no) prior knowledge about what that probability might be.

“On trouve ainsi qu’un événement étant arrivé de suite un nombre quelconque de fois, la probabilité qu’il arrivera encore la fois suivante est égale à ce nombre augmenté de l’unité, divisé par le même nombre augmenté de deux unités.”

Which translates into English as: “So, one finds that for an event which has occurred any number of times in succession, the probability it will occur again the next time is equal to this number increased by one, divided by the same number increased by two”.

Or, in math notation:

$$P(\text{success on trial } n+1 \mid s \text{ successes in } n \text{ trials}) = \frac{s+1}{n+2}$$

That is, given s successes out of n trials, the probability of success on the next trial is approximately (s+1)/(n+2).

To make his point, Laplace doesn’t hold back:

“… par exemple, remonter la plus ancienne époque de l’histoire à cinq mille ans, ou à 1,826,213 jours, et le soleil s’étant levé constamment dans cet intervalle, à chaque révolution de vingt-quatre heures, il y a 1,826,214 à parier contre un qu’il se lèvera encore demain”

Which translates as: “…for example, given the sun has risen every day for the last 5000 years — or 1,826,213 days — the probability it will rise tomorrow is 1,826,214 / 1,826,215”.

At more than 99.9999%, that’s a pretty certain bet. And it only becomes more certain with each day the sun continues to rise.
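
Laplace’s figure is easy to check. The sketch below uses exact fractions so that no rounding creeps in; the function name is an assumption made for this illustration.

```python
from fractions import Fraction


def rule_of_succession(successes, trials):
    """Laplace's estimate (s + 1) / (n + 2) as an exact fraction."""
    return Fraction(successes + 1, trials + 2)


days = 1_826_213  # Laplace's count of days in 5,000 years
p_sunrise = rule_of_succession(days, days)
print(p_sunrise)             # 1826214/1826215
print(float(1 - p_sunrise))  # chance of no sunrise, roughly 5.5e-07
```

With no observations at all, `rule_of_succession(0, 0)` returns 1/2, the rule’s starting point before any evidence arrives.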

Yet Laplace acknowledges that, for someone who understands the mechanism by which the sun rises and sees no reason why it should cease to function, even this probability is unreasonably low.

And it turns out that this qualification is perhaps just as important as the actual rule itself. After all, it hints at the fact that our prior knowledge of a system is encoded in the assumptions we make when assigning probabilities to each of its potential outcomes.

This is true in machine learning today, especially when we try learning from limited or incomplete training data.

But what is the rationale behind Laplace’s rule of succession, and how does it live on in some of today’s most popular machine learning algorithms?

Nothing is impossible?

To better understand the significance of Laplace’s rule, we need to consider what it means to have very little prior knowledge about a system.

Say you have one of Laplace’s urns, which you know to contain at least one red ball. You know nothing else about the contents of the urn “system”. Perhaps it contains many different colors, perhaps it only contains that one red ball.

Draw one ball from the urn. You know the probability that it will be red is greater than zero, and less than or equal to one.

But, as you don’t know whether the urn contains other colors, you cannot say for certain that the probability of drawing red equals one. You simply cannot rule out any other possibility.

So, how do you estimate the probability of drawing a red ball from the urn?

Well, according to Laplace’s rule of succession, you can model drawing a ball from the urn as a Bernoulli trial with two possible outcomes: “red” and “not-red”.

Before we’ve drawn anything from the urn, we’ve already allowed for two potential outcomes to exist. In so doing, we’ve effectively “pseudo-counted” two imaginary draws from the urn, observing each outcome once.

This gives each outcome (“red” and “not-red”) a probability of 1/2.

As the number of draws from the urn increases, the effect of these pseudo-counts becomes less and less important. If the first ball drawn is red, you update the probability of the next one being red to (1+1)/(1+2) = 2/3.

If the next ball is red, the probability updates to 3/4. If you keep drawing red, the probability reaches ever closer to 1.
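
The sequence of updates can be reproduced with a short loop. This is a minimal sketch that assumes, for illustration, that the first three balls drawn are all red.

```python
from fractions import Fraction

# Start with the two pseudo-counts: one imaginary "red", one imaginary "not-red".
red, total = 1, 2
estimates = [Fraction(red, total)]

# Draw three balls in a row, all of which happen to be red.
for _ in range(3):
    red += 1
    total += 1
    estimates.append(Fraction(red, total))

print([str(p) for p in estimates])  # ['1/2', '2/3', '3/4', '4/5']
```

Each extra red ball nudges the estimate closer to one, but it never quite gets there.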

In today’s language, probability concerns a sample space. This is a mathematical set of all possible outcomes for a given “experiment” (a process that selects one of the outcomes).

Probability was put on a formal axiomatic basis by Andrey Kolmogorov in the 1930s. Kolmogorov’s axioms make it easy to prove that a sample space must contain at least one element.

Kolmogorov also defines probability as a measure that assigns a real number between zero and one to every event, that is, to a set of outcomes drawn from the sample space.

Naturally, probability makes a useful way to model real world systems, especially when you assume complete knowledge about the contents of the sample space.

But when we don’t understand the system at hand, we don’t know the sample space — apart from that it must contain at least one element. This is a common starting point in many machine learning contexts. We have to learn the contents of the sample space as we go.

Therefore, we ought to allow the sample space to contain at least one extra, catch-all element: the “unknown unknown”, if you like. Laplace’s rule of succession tells us to assign the “unknown unknown” a probability of 1/(n+2), after n repeated observations of known events.

Although in many cases it is convenient to ignore the possibility of unknown unknowns, there are epistemological grounds for always allowing such eventualities to exist.

One such argument is known as Cromwell’s Rule, coined by the late Dennis Lindley. Quoting the 17th century’s Oliver Cromwell:

“I beseech you, in the bowels of Christ, think it possible that you may be mistaken”

This rather dramatic statement asks us to allow a remote possibility for the unexpected to occur. In the language of Bayesian probability, this amounts to requiring us to always consider a non-zero prior.

Because if your prior probability is set to zero, no amount of evidence will ever convince you otherwise. After all, even the strongest evidence to the contrary will still yield a posterior probability of zero, when multiplied by zero.
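
A quick numerical sketch makes the point. The function names and the 99-to-1 strength of the evidence are assumptions chosen for illustration.

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """One step of Bayes' theorem for a yes/no hypothesis."""
    numerator = prior * p_evidence_if_true
    evidence = numerator + (1 - prior) * p_evidence_if_false
    return numerator / evidence


def after_repeated_evidence(prior, rounds=20):
    """Apply the same strongly supportive observation several times over."""
    belief = prior
    for _ in range(rounds):
        # Each observation is 99 times likelier if the hypothesis is true.
        belief = update(belief, 0.99, 0.01)
    return belief


print(after_repeated_evidence(0.0))   # 0.0: a zero prior never moves
print(after_repeated_evidence(1e-6))  # close to 1 despite the tiny prior
```

A prior of one in a million is overturned by the evidence, while a prior of exactly zero is immune to it.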

Objections, and a defence of Laplace

It may come as little surprise to learn that Laplace’s sunrise example attracted much criticism from his contemporaries.

People objected to the perceived simplicity — naivety, even — of Laplace’s assumptions. The idea that there was a 1/1,826,215 probability the sun would not rise the following day seemed absurd.

It is tempting to believe that, given a large number of trials, a non-zero probability event must happen. And therefore, observing so many consecutive sunrises without a single failure surely implies Laplace’s estimate is an overestimate?

For example, you might expect that after a million trials, you’d have observed a one-in-a-million event — almost guaranteed by definition! What’s the probability of doing otherwise?

Well, you wouldn’t be astonished if you tossed a fair coin twice without landing heads. Nor would it be cause for concern if you rolled a die six times, and never saw the number six. These are events with probability 1/2 and 1/6 respectively, but that absolutely does not guarantee their occurrence in the first two and the first six trials, respectively.

A result attributed to Bernoulli back in the 17th Century gives the limit as the number of trials n grows very large, and the probability 1/n correspondingly small:

$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n} = \frac{1}{e} \approx 0.368$$

Although you would expect, on average, to see one occurrence of an event with probability 1/n in n trials, for large n there is still a greater than 1/3 chance that you will see none at all.
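
The chance of never seeing the event can be computed directly. In this sketch the function name is an assumption; the values of n are the coin, the die, and a one-in-a-million event from the examples above.

```python
import math


def prob_never_seen(n):
    """Chance that an event of probability 1/n fails to occur in n independent trials."""
    return (1 - 1 / n) ** n


for n in (2, 6, 1_000_000):
    print(n, prob_never_seen(n))

print(1 / math.e)  # the limiting value, about 0.3679
```

For the coin (n = 2) the chance of no heads is exactly 0.25. From n = 6 onwards the value sits above 1/3 and creeps towards 1/e.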

Likewise, if the true probability of the sun failing to rise were indeed 1/1,826,215, then we perhaps shouldn’t be so surprised such an occurrence has never been recorded in history.

And, arguably, Laplace’s qualification is too generous.

It is true that, for a person who claims to understand the mechanism by which the sun rises every day, the probability of it failing to do so must be much closer to zero.

Yet to assume an understanding of such a mechanism requires us to possess prior knowledge of the system, beyond that which we have observed. This is because such a mechanism is implicitly assumed constant — in other words, true for all time.

This assumption lets us, in a sense, “conjure up” an unlimited number of observations — on top of those we have actually observed. It’s an assumption called for by none other than Isaac Newton, at the beginning of the third book in his famous “Philosophiae Naturalis Principia Mathematica”.

Newton outlines four “Rules of Reasoning in Philosophy”. The fourth rule claims we can regard propositions derived from previous observations as “very nearly true”, until contradicted by future observations.

Such an assumption was crucial for the scientific revolution, despite being a kick in the teeth for philosophers such as David Hume, who famously posed the problem of induction.

It is this epistemological compromise that lets us do useful science and, in turn, invent technology. Somewhere along the line, as we see the estimated probability of the sun failing to rise diminish ever closer to zero, we allow ourselves to “round down” and claim a fully fledged scientific truth.

But all of this presumably lies beyond the scope of the point Laplace originally sought to make.

Indeed, his choice of a sunrise example is unfortunate. The rule of succession really comes into its own when applied to completely unknown “black-box” systems for which we have zero (or very few) observations.

This is because the rule of succession offers an early example of a non-informative prior.

How to assume as little as possible

Bayesian probability is a keystone concept in modern machine learning. Algorithms such as Naive Bayes classification, Expectation Maximisation, Variational Inference and Markov Chain Monte Carlo are amongst the most popular in use today.

Bayesian probability generally refers to an interpretation of probability where you update your (often subjective) belief in the light of new evidence.

Two key concepts are prior and posterior probabilities.

Posterior probabilities are those we hold after updating our beliefs in the face of new evidence.

Prior probabilities (or ‘priors’) are those we hold to be true before seeing new evidence.

Data scientists are interested in how we assign prior probabilities to events in the absence of any previous knowledge at all. This is a typical starting point for many problems in machine learning and predictive analytics.

Priors can be informative, in the sense they come with “opinions” about the probability of different events. These “opinions” can be strong or weak, and are usually based on past observations or otherwise reasonable assumptions. These are invaluable in situations where we want to train our machine learning model quickly.

However, priors can also be non-informative. This means they assume as little as possible about the respective probabilities of different events. These are useful in situations where we want our machine learning model to learn from a blank slate.

So we must ask: how do you measure how “informative” a prior probability distribution is?

Information theory provides an answer. This is a branch of mathematics that concerns how information is measured and communicated.

Information can be thought of in terms of certainty, or a lack thereof.

After all, in an everyday sense, the more information you have about some event, the more certain you are about its outcome. Less information equates to less certainty. This means that information theory and probability theory are inextricably linked.

Information entropy is a fundamental concept in information theory. It serves as a measure of the uncertainty inherent to a given probability distribution. A probability distribution with high entropy is one for which the outcome is more uncertain.

Perhaps intuitively, you can reason that a uniform probability distribution — a distribution for which each event is equally likely — has the highest possible entropy. For example, if you flipped a fair coin and a biased coin, which outcome would you be least certain about?
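
The coin comparison can be made concrete with Shannon’s entropy formula. In this sketch the 90/10 bias of the second coin is an assumed value, picked only for illustration.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair = entropy_bits([0.5, 0.5])    # exactly 1 bit
biased = entropy_bits([0.9, 0.1])  # about 0.47 bits
print(fair, biased)
```

The fair coin carries the full one bit of uncertainty. The biased coin carries less than half of that, because its outcome is easier to guess.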

Information entropy provides a formal means of quantifying this, and if you know some calculus, you can check out the proof here.

So the uniform distribution is, in a very real sense, the least informative distribution possible. And for that reason, it makes an obvious choice for an uninformative prior.

Perhaps you’ve spotted how Laplace’s rule of succession effectively amounts to using a uniform prior? By adding one success and one failure before we’ve even observed any outcomes, we’re using a uniform probability distribution to represent our “prior” belief about the system.

Then, as we observe more and more outcomes, the weight of the evidence increasingly overpowers the prior.

Case study: Naive Bayes classification

Today, Laplace’s rule of succession is generalised to additive smoothing and pseudo-counting.

These are techniques which allow us to use non-zero probabilities for events not observed in training data. This is an essential part of how machine learning algorithms are able to generalize when faced with inputs not seen previously.

For instance, take Naive Bayes classification.

This is a simple yet effective algorithm that can classify textual and other suitably tokenized data, using Bayes’ theorem.

The algorithm is trained on a corpus of pre-classified data, in which each document consists of a set of words or “features”. The algorithm begins by estimating the probability of each feature, given a certain class.

Using Bayes’ theorem (and some very naive assumptions about feature independence), the algorithm can then approximate the relative probabilities of each class, given the features observed in a previously unseen document.

An important step in Naive Bayes classification is estimating the probability of a feature being observed within a given class. This can be done by calculating the frequency at which the feature is observed in each of that class’s records in the training data.

For instance, the word “Python” might appear in 12% of all documents classed as “programming”, compared to 1% of all documents classed as “start-up”. The word “learn” might appear in 10% of programming documents and 20% of all start-up documents.

Take the sentence “learn Python”.

Using these frequencies, and treating the two classes as equally likely beforehand, we find the score for the sentence being classed as “programming” is 0.12 × 0.10 = 0.012, and the score for “start-up” is 0.01 × 0.20 = 0.002. These scores are proportional to the class probabilities.

Therefore, “programming” is the more likely of these two classes.
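
The “learn Python” calculation can be written out as a toy classifier. This is a minimal sketch rather than a full Naive Bayes implementation: it uses the word frequencies quoted above, and it assumes equal class priors. The function and variable names are illustrative.

```python
# Word frequencies per class, taken from the example in the text.
word_given_class = {
    "programming": {"python": 0.12, "learn": 0.10},
    "start-up": {"python": 0.01, "learn": 0.20},
}


def class_scores(words, word_given_class):
    """Multiply per-word frequencies for each class (equal class priors assumed)."""
    scores = {}
    for label, freqs in word_given_class.items():
        score = 1.0
        for word in words:
            score *= freqs[word]
        scores[label] = score
    return scores


scores = class_scores(["learn", "python"], word_given_class)
total = sum(scores.values())
normalised = {label: s / total for label, s in scores.items()}
best = max(scores, key=scores.get)
print(scores, normalised, best)
```

Normalising the two scores so they sum to one gives “programming” about 86% of the probability, against about 14% for “start-up”.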

But this frequency-based approach runs into trouble whenever we consider a feature which never occurs in a given class. This would mean it has a frequency of zero.

Naive Bayes classification requires us to multiply probabilities, but multiplying anything by zero will, of course, always yield zero.

So, what happens if a previously unseen document does contain a word never observed in a given class in the training data? That class will be deemed impossible — no matter how frequently every other word in the document occurs in that class.

Additive smoothing

An approach called additive smoothing offers a solution. Instead of allowing zero frequencies, we add a small constant to every count in the numerator, and enlarge the denominator to match so that the probabilities still sum to one. This prevents unseen class/feature combinations from derailing the classifier.

When this constant equals one, additive smoothing is the same as applying Laplace’s rule of succession.
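
As a rough sketch, additive smoothing for a single count might look like this. The function name, the parameter names, and the 50-document example are assumptions made for illustration.

```python
def smoothed_probability(count, total, num_outcomes, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * num_outcomes)."""
    return (count + alpha) / (total + alpha * num_outcomes)


# A word seen in 0 of 50 documents of a class; the outcomes are "present" and "absent".
unsmoothed = 0 / 50
smoothed = smoothed_probability(0, 50, num_outcomes=2)  # (0 + 1) / (50 + 2)
print(unsmoothed, smoothed)
```

With `alpha=1.0` and two outcomes the formula reduces to (s + 1)/(n + 2), so the unseen word gets a small but non-zero probability of 1/52 instead of zero.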

As well as Naive Bayes classification, additive smoothing is used in other probabilistic machine learning contexts. Examples include problems in language modelling, neural networks, and hidden Markov models.

In mathematical terms, additive smoothing amounts to using a beta distribution as a conjugate prior for carrying out Bayesian inference with binomial and geometric distributions.

The beta distribution is a family of probability distributions defined over the interval [0,1]. It takes two shape parameters, α and β. Laplace’s rule of succession corresponds to setting α = 1 and β = 1.

As discussed above, the beta(1,1) distribution is the one for which information entropy is maximised. However, there are alternative priors for cases in which the assumption of one success and one failure are not valid.

For instance, Haldane’s prior is defined as a beta(0,0) distribution. It applies in cases when we are not even sure if we can allow for a binary outcome. Haldane’s prior places an infinite amount of “weight” on zero and one.

Jeffreys’ prior, the beta(0.5, 0.5) distribution, is another non-informative prior. It has the helpful property that it remains invariant under reparameterization. Its derivation is beyond the scope of this article, but if you are interested, check out this thread.
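
To compare the three priors, one can look at the posterior mean each produces from the same data. With a beta(α, β) prior and s successes in n trials, the posterior is beta(α + s, β + n − s), whose mean is (s + α)/(n + α + β). The sketch below assumes three successes in three trials, chosen only for illustration.

```python
def posterior_mean(successes, trials, alpha, beta):
    """Mean of the Beta(alpha + s, beta + n - s) posterior after s successes in n trials."""
    return (successes + alpha) / (trials + alpha + beta)


priors = {
    "Laplace (uniform)": (1.0, 1.0),
    "Jeffreys": (0.5, 0.5),
    "Haldane": (0.0, 0.0),  # needs at least one observation, or the mean is undefined
}

s, n = 3, 3  # three successes in three trials
results = {name: posterior_mean(s, n, a, b) for name, (a, b) in priors.items()}
print(results)
```

Laplace’s prior gives 0.8, Jeffreys’ gives 0.875, and Haldane’s gives 1.0, which is simply the observed frequency. The smaller the pseudo-counts, the more readily the estimate follows the data.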

The legacy of ideas

Personally, I find it fascinating how some of the earliest ideas in probability and statistics have survived years of contention, and still find widespread use in modern machine learning.

It is extraordinary to realise that the influence of ideas developed more than two centuries ago is still being felt today. Machine learning and data science have gained real mainstream momentum in the last decade or so. But the foundations upon which they are built were laid long before the first computers were even close to realization.

It’s no coincidence that such ideas border on the philosophy of knowledge. This becomes especially relevant as machines become more and more intelligent. At what point might the focus shift onto our philosophy of consciousness?

Finally, what would Laplace and his contemporaries make of machine learning today? It’s tempting to suggest they’d be astounded by the progress that’s been made.

But that would probably be a disservice to their foresight. After all, the French philosopher René Descartes had written of a mechanistic philosophy back in the 17th Century. Describing a hypothetical machine:

“Je désire que vous considériez … toutes les fonctions que j’ai attribuées à cette machine, comme … la réception de la lumière, des sons, des odeurs, des goûts … l’empreinte de ces idées dans la mémoire … et enfin les mouvements extérieurs … qu’ils imitent le plus parfaitement possible ceux d’un vrai homme … considériez que ces fonctions … de la seule disposition de ses organes, ni plus ni moins que font les mouvements d’une horloge … de celle de ses contrepoids et de ses roues”

Which translates as: “I desire that you consider that all the functions I’ve attributed to this machine such as… the reception of light, sound, smell and taste… the imprint of these ideas in the memory… and finally the external movements which imitate as perfectly as possible those of a true human… Consider that these functions are only under the control of the organs, no more or less than the movements of a clock are to its counterweights and wheels”

The passage above describes a hypothetical machine capable of responding to stimuli and behaving like a “true human”. It was published in Descartes’ 1664 work “Traité de l’homme” — a full 150 years before Laplace’s “Essai philosophique sur les probabilités”.

Indeed, the 18th and early 19th Centuries saw the construction of incredibly sophisticated automata by inventors such as Pierre Jaquet-Droz and Henri Maillardet. These clockwork androids could be “programmed” to write, draw, and play music.

So there is no doubting that Laplace and his contemporaries could conceive of the notion of an intelligent machine. And it surely would not have escaped their notice how progress made in the field of probability might be applied to machine intelligence.

Right at the beginning of “Essai philosophique”, Laplace writes of a hypothetical super-intelligence, retrospectively named “Laplace’s Demon”:

“Une intelligence qui, pour un instant donné, connaîtrait toutes les forces dont la nature est animée, et la situation respective des êtres qui la composent, si d’ailleurs elle était assez vaste pour soumettre ces données à l’analyse … rien ne serait incertain pour elle, et l’avenir comme le passé, serait présent à ses yeux”

Which translates as: “An intelligence, which in a given moment, knows all the forces by which nature is animated, and the respective situation of the beings which compose it, and if it were large enough to submit these data to analysis … nothing would be uncertain to it, and the future as the past, would be present in its eyes”.

Could Laplace’s Demon be realized as one of Descartes’ intelligent machines? Modern sensibilities overwhelmingly suggest no.

Yet Laplace’s premise envisaged on a smaller scale may soon become a reality, thanks in no small part to his own pioneering work in the field of probability.

Meanwhile, the sun will (probably) continue to rise.

Translated from: https://www.freecodecamp.org/news/will-the-sun-rise-tomorrow-255afc810682/
