Statistical Probability Distributions

Random variables follow different types of distributions in probability space. The distribution determines how a variable behaves and helps us make predictions about it.

Table of contents:

Introduction
Gaussian/Normal Distribution
Binomial Distribution
Bernoulli Distribution
Log Normal Distribution
Power Law/Pareto Distribution
Uses of Distributions

Introduction

Whenever we study an experiment in probability, we talk about a random variable, which is simply a variable that takes on the possible outcomes of that experiment. For example, when we roll a die, we expect a value from the set {1,2,3,4,5,6}. So we define a random variable X which takes one of these values every time we roll.

Depending on the experiment, a random variable can take either discrete or continuous values. The die example is a discrete random variable because it can only take separate, countable values. But if we are talking about the prices of houses in a particular town, the associated random variable can take continuous values (e.g. $550,000, $1,200,523.54, etc).

When we plot the values taken by a random variable against the frequency of their appearance in an experiment, we get a frequency distribution plot in the form of a histogram. After smoothing this histogram with Kernel Density Estimation, we get a smooth curve. This curve is referred to as the “Distribution”.
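The frequency part of this idea can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name relative_frequencies, the seed and the number of rolls are made up for the example, and the smoothing step (Kernel Density Estimation) is left out because it needs a continuous variable.

```python
import random
from collections import Counter

def relative_frequencies(num_rolls, seed=0):
    """Roll a fair six-sided die num_rolls times and return {face: relative frequency}."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(num_rolls))
    return {face: counts[face] / num_rolls for face in range(1, 7)}

freqs = relative_frequencies(60_000)  # 60,000 rolls is an arbitrary choice
for face, freq in freqs.items():
    # Crude text histogram: one '#' per percentage point of frequency
    print(face, f"{freq:.3f}", "#" * round(freq * 100))
```

With a fair die every bar should come out close to 1/6, which is the flat shape of a discrete uniform distribution.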

Gaussian/Normal Distribution

The Gaussian/Normal distribution is a continuous probability distribution in which the random variable is spread symmetrically around a mean (μ), with the amount of spread set by the variance (σ²).

General expression for the Gaussian distribution curve:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Mean (μ): It decides the position of the peak on the X-axis. The curve is symmetric about the line X = μ. In the image, the Blue, Red and Yellow curves are centered at X = 0, while the Green curve is centered at X = -2. So just by looking at the curves, we can say that the mean of Blue, Red and Yellow is 0 whereas that of Green is -2.

Variance (σ²): It decides the spread and height of the curve. Variance is the square of the standard deviation, and the spread can be intuitively understood as the standard deviation. The image gives the σ² values for all four curves, but even without looking at them we can say that the Yellow curve, which has the lowest height and the widest spread, has the maximum variance of the four. Similarly, the Blue curve has the minimum.

If we set μ = 0 and σ = 1, the Normal distribution is called the Standard Normal Distribution or Standard Normal Variate, and the general expression changes to:

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$$

Now one may ask: what does the denominator, σ√(2π) in general and √(2π) in the standard case, signify? It is there to ensure that the area under the curve of the Normal distribution is always equal to 1.
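We can check this claim about the area numerically. The sketch below writes out the usual Gaussian density, f(x) = exp(-(x-μ)²/(2σ²)) / (σ√(2π)), and integrates it with the trapezoidal rule. The function names, the integration range of ±10 standard deviations and the three (μ, σ) pairs are our own choices for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))  # the normalizing denominator
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def area_under_curve(mu, sigma, half_width=10.0, steps=20_000):
    """Trapezoidal estimate of the area from mu - half_width*sigma to mu + half_width*sigma."""
    lo = mu - half_width * sigma
    hi = mu + half_width * sigma
    dx = (hi - lo) / steps
    total = 0.5 * (normal_pdf(lo, mu, sigma) + normal_pdf(hi, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(lo + i * dx, mu, sigma)
    return total * dx

# Hypothetical parameter choices: the area should be about 1 for each of them
for mu, sigma in [(0.0, 1.0), (-2.0, 0.5), (3.0, 4.0)]:
    print(mu, sigma, round(area_under_curve(mu, sigma), 6))
```

Changing μ only shifts the curve and changing σ only stretches it, so the area stays at 1 in every case.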

The Normal Distribution gives us a lot of useful information about how the data is segmented. Look at the image:

Values segmentation diagram for Normal Distribution

As you can see, this distribution holds 34.1% of the total mass between the mean and one standard deviation to its right, (34.1 + 13.6) = 47.7% of the mass within 2 standard deviations to the right of the mean, and 49.8% within 3 standard deviations to the right. Since the curve is symmetrical, the same holds on the left side, so roughly 68%, 95% and 99.7% of the mass lies within 1, 2 and 3 standard deviations of the mean respectively.
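These percentages can be reproduced with Python's standard library. The following is a small sketch using statistics.NormalDist; the helper name mass_right_of_mean is ours.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def mass_right_of_mean(k):
    """Fraction of the total mass between the mean and k standard deviations to its right."""
    return std_normal.cdf(k) - std_normal.cdf(0.0)

for k in (1, 2, 3):
    one_side = mass_right_of_mean(k)
    print(f"{k} sd: one side = {one_side:.4f}, both sides = {2 * one_side:.4f}")
```

The one-sided values come out at about 0.341, 0.477 and 0.499, which agrees with the figures in the diagram up to rounding.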

So if we know that some property follows a Normal distribution, e.g. the weights of the people in a town, we can estimate a lot of values without actually performing extensive analysis. This is the power of the Normal Distribution.

Binomial Distribution

As the name suggests, there is a “Bi” in it. This “Bi” stands for the 2 possible outcomes of an experiment: Yes or No, Pass or Fail, 1 or 0, etc. In the simplest terms, this is the distribution of the number of successes across multiple repeated experiments, where the outcome of each experiment is either “Success” or “Failure”.

As you can observe from the image, it is a discrete probability distribution. Its main parameters are n (number of trials) and p (probability of success).

Now suppose an event has probability p of SUCCESS. Then the probability of FAILURE is (1-p). Let us say you repeat the experiment n times (number of trials = n). Then the probability of getting exactly k successes in n independent Bernoulli trials is:

Probability Mass Function of Binomial Distribution:

$$P(X = k) = \binom{n}{k}\, p^k (1-p)^{n-k}$$

where k is an integer in the range [0,n] and:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
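The probability mass function translates almost directly into code. Here is a minimal sketch, assuming the standard form P(X = k) = nCk · p^k · (1-p)^(n-k); the function name binomial_pmf and the values n = 10, p = 0.5 are hypothetical and chosen only for the demonstration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    if not 0 <= k <= n:
        return 0.0  # impossible outcomes carry no probability
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

n, p = 10, 0.5  # hypothetical example: 10 fair coin flips
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k, prob in enumerate(pmf):
    print(k, f"{prob:.4f}")
print("total:", round(sum(pmf), 10))
```

The probabilities over k = 0..n add up to 1, as they must for any probability mass function.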

Note: We will see what a Bernoulli trial is in the next section.

Let me ask a simple question. Suppose there is a cricket match going on between India and Australia. Rohit Sharma has already scored 151* and from experience you know that after 150, Rohit has a probability of 0.3 of hitting a six on any ball. It’s the last over and your father asks you what the chances are that Rohit will hit 4 sixes. How would you find out?

This is a typical example of Binomial trials, with n = 6 balls in the over, k = 4 sixes and p = 0.3. So, the solution is:

$$P(X = 4) = \binom{6}{4} (0.3)^4 (0.7)^2 \approx 0.0595$$

Note: The 6 and 4 in the big bracket is nothing but 6C4, the number of ways to choose which 4 of the 6 balls are hit for six.
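The same calculation can be done in Python. This sketch assumes Rohit faces all 6 balls of the over and that each ball is an independent trial with p = 0.3; the function name prob_sixes is ours. It also answers a slightly different question, "at least 4 sixes", for comparison.

```python
import math

def prob_sixes(k, balls=6, p_six=0.3):
    """Binomial probability of exactly k sixes in `balls` independent deliveries."""
    return math.comb(balls, k) * p_six ** k * (1.0 - p_six) ** (balls - k)

exactly_four = prob_sixes(4)
at_least_four = sum(prob_sixes(k) for k in range(4, 7))  # k = 4, 5 or 6
print(f"exactly 4 sixes:  {exactly_four:.6f}")
print(f"at least 4 sixes: {at_least_four:.6f}")
```

So there is roughly a 6% chance of exactly 4 sixes, and about 7% if 5 or 6 sixes also count.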

Bernoulli Distribution

The Binomial Distribution has a special case known as the Bernoulli Distribution, where n=1, which means just a single trial is conducted in the binomial experiment. When we put n=1 in the PMF (Probability Mass Function) of the Binomial, nCk becomes equal to 1 and the function becomes:

$$P(X = k) = p^k (1-p)^{1-k}$$

where k = {0,1}.
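A Bernoulli trial is easy to simulate. The sketch below assumes the usual form p^k (1-p)^(1-k) for the PMF; the function names, the seed, the number of repetitions and the value p = 0.7 are our own illustrative choices.

```python
import random

def bernoulli_pmf(k, p):
    """Probability of outcome k (1 = success, 0 = failure) in a single trial."""
    if k not in (0, 1):
        raise ValueError("k must be 0 or 1")
    return p ** k * (1.0 - p) ** (1 - k)

def bernoulli_trial(p, rng):
    """Run one trial: return 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

p = 0.7  # hypothetical success probability
rng = random.Random(42)  # fixed seed so the run is repeatable
outcomes = [bernoulli_trial(p, rng) for _ in range(100_000)]
success_rate = sum(outcomes) / len(outcomes)
print("P(1) =", bernoulli_pmf(1, p))
print("P(0) =", round(bernoulli_pmf(0, p), 10))
print("observed success rate:", round(success_rate, 3))
```

Over many repetitions the observed success rate settles close to p, which is what the PMF predicts for k = 1.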

Now let’s go back to the India vs Australia match. Let’s say that when Rohit hits a ton (a century), the chances of India winning are 0.7. So you can simply tell your father that there is a 70% chance that India will win. This is nothing but a very basic Bernoulli trial.

Log Normal Distribution

We have seen the nature of the Normal distribution, and at first glance many would say that the Log Normal curve looks somewhat like a Normal distribution that is right skewed.

Suppose there is a random variable X which follows a Log Normal distribution with parameters μ and σ². Say we observe n values of X (x1,x2,x3…..xn). Now take the natural log of all the X values and create a new random variable Y = [log(x1),log(x2),log(x3)……log(xn)]. This random variable Y will be Normally distributed, with mean μ and variance σ².

In other words, if Y is Normally distributed and we take its exponential X = exp(Y), then X follows a Log Normal distribution. In simple language, as the name suggests, the Log Normal distribution is the distribution of a random variable whose natural log is Normally distributed.
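This relationship is simple to verify by simulation. In the sketch below we draw Y from a Normal distribution, exponentiate it to get X, and then check that the log of X has the mean and standard deviation we started with. The values μ = 0.5 and σ = 0.4, the seed and the sample size are hypothetical. The last two lines compare the sample mean of X with exp(μ + σ²/2), the known mean of a Log Normal variable.

```python
import math
import random
from statistics import fmean, pstdev

mu, sigma = 0.5, 0.4  # hypothetical parameters of the underlying Normal
rng = random.Random(7)  # fixed seed so the run is repeatable

y = [rng.gauss(mu, sigma) for _ in range(200_000)]  # Y is Normal
x = [math.exp(v) for v in y]                        # X = exp(Y) is Log Normal
log_x = [math.log(v) for v in x]                    # taking logs recovers Y

print("mean of log(X):", round(fmean(log_x), 3))
print("sd of log(X):  ", round(pstdev(log_x), 3))
print("mean of X:     ", round(fmean(x), 3))
print("exp(mu + sigma^2 / 2):", round(math.exp(mu + sigma ** 2 / 2), 3))
```

Notice that the mean of X itself is not μ: the right skew pulls it above exp(μ).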

It has the same parameters as the Gaussian: mean (μ) and variance (σ²). Note that these are the mean and variance of log(X), not of X itself.

Power Law/Pareto Distribution

A Power Law is a relationship between two quantities in which a relative change in one quantity produces a proportional relative change in the other, i.e. one quantity varies as a power of the other. It is commonly associated with the 80–20 rule, which says: in the top 20% of values, we will find roughly 80% of the mass. As you can see in the image, the slightly darker left portion is 80% of the mass and the bright yellow right portion is 20%.
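The 80–20 split is not automatic: it holds only for one particular value of the shape parameter α of the Pareto distribution. For a Pareto law with α > 1, the top fraction q of values holds a share q^(1 - 1/α) of the total. The sketch below uses that closed form; the function name top_share and the sample α values are our own, for illustration only.

```python
import math

def top_share(top_fraction, alpha):
    """Share of the total held by the top `top_fraction` of values under a Pareto law."""
    if alpha <= 1.0:
        raise ValueError("the total is infinite for alpha <= 1")
    return top_fraction ** (1.0 - 1.0 / alpha)

# The alpha for which the top 20% hold exactly 80%: log(5) / log(4), about 1.161
alpha_80_20 = math.log(5) / math.log(4)
print(round(alpha_80_20, 3), round(top_share(0.2, alpha_80_20), 3))

# Larger alpha means a lighter tail, so the top 20% hold a smaller share
for alpha in (1.5, 2.0, 3.0):
    print(alpha, round(top_share(0.2, alpha), 3))
```

So the 80–20 rule is best read as a rule of thumb for heavy-tailed data with α a little above 1.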

When a probability distribution follows a power law, we say it is a Pareto Distribution.

The Pareto distribution is controlled by two parameters: x_m and α.

x_m can be thought of as the mean, in that it controls the scale of the curve, and α can be thought of as σ, in that it controls the shape of the curve. (Note: x_m is not the mean and α is not σ. I am speaking intuitively for understanding. Strictly, x_m is the smallest value the variable can take.)

Now, as we can see in the image, all four curves have their peak located at x=1. So, we can say that x_m = 1 for all the curves.

As we can observe from the image, as α increases the peak also goes up, and in the extreme case of α tending to infinity, the curve collapses into a vertical spike at x = x_m. This is called a Dirac Delta Function.

As α decreases, the curve becomes flatter.
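The effect of α can be seen directly from the density. The sketch below assumes the standard Pareto density f(x) = α · x_m^α / x^(α+1) for x ≥ x_m (and 0 below x_m); the function name pareto_pdf and the α values are our own choices and are not necessarily the ones in the image.

```python
def pareto_pdf(x, x_m=1.0, alpha=1.0):
    """Pareto density: zero below x_m, then decaying as a power of x."""
    if x < x_m:
        return 0.0
    return alpha * x_m ** alpha / x ** (alpha + 1.0)

x_m = 1.0
for alpha in (1.0, 2.0, 3.0, 10.0):
    peak = pareto_pdf(x_m, x_m, alpha)  # height at x = x_m, equal to alpha / x_m
    tail = pareto_pdf(3.0, x_m, alpha)  # height further out, at x = 3
    print(f"alpha={alpha}: peak={peak:.3f}, density at x=3: {tail:.5f}")
```

The peak height is α / x_m, so it grows with α, while the density away from x_m shrinks. That is the sharpening towards a spike described above, and the flattening when α is small.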

Uses of Distributions

If we know that a particular property follows a certain distribution, then we can take a sample, estimate the parameters involved, and plot the probability distribution function to answer a lot of questions.

For example: in a town of 100,000 people, we have to do a height analysis, but we cannot survey such a large population. So, we select a random sample and find its sample mean and sample standard deviation.

Now suppose a doctor or expert tells us that height follows a Normal distribution. Then we can easily answer many questions.
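Here is a rough sketch of that workflow in Python. Everything numeric in it is made up for illustration: the sample of 500 heights is simulated rather than surveyed, and the 180 cm and 165–175 cm questions are just examples of what we might ask. statistics.NormalDist.from_samples estimates the mean and standard deviation from the sample.

```python
import random
from statistics import NormalDist

rng = random.Random(1)  # fixed seed so the run is repeatable
# Hypothetical sample: 500 heights in cm, simulated here in place of a real survey
sample = [rng.gauss(170.0, 8.0) for _ in range(500)]

# Fit a Normal distribution using the sample mean and sample standard deviation
fitted = NormalDist.from_samples(sample)
print("sample mean:", round(fitted.mean, 2), "sample sd:", round(fitted.stdev, 2))

# Questions we can now answer without surveying the whole town
taller_than_180 = 1.0 - fitted.cdf(180.0)
between_165_175 = fitted.cdf(175.0) - fitted.cdf(165.0)
print("fraction taller than 180 cm:", round(taller_than_180, 3))
print("fraction between 165 and 175 cm:", round(between_165_175, 3))
print("estimated people over 180 cm in the town:", round(100_000 * taller_than_180))
```

The answers are only as good as the assumption that height is Normally distributed and that the sample is truly random.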

Translated from: https://medium.com/analytics-vidhya/important-distributions-in-probability-statistics-a868283fa127
