腾讯哈勃

Simple OLS Regression, Pairs Bootstrap Resampling, and Hypothesis Testing to observe the effect of Hubble’s Law in Python.

通过简单的OLS回归，配对Bootstrap重采样和假设检验来观察哈勃定律在Python中的效果。

In this post, we will revisit Hubble’s Law and examine the original dataset he used by running an Ordinary Least Squares Linear Regression on the 24 measurements of distances and recessional velocities of extra-galactic nebulae. Then, we will use a pairs bootstrap resampling to calculate the RSS Minima and perform a hypothesis test on the measured effect of galactic distance on recessional velocities.

在本文中，我们将回顾哈勃定律，并通过对银河外星云的距离和后退速度的24个测量值进行普通最小二乘线性回归来检验他使用的原始数据集。然后，我们将使用成对的自举重采样来计算RSS最小值，并对银河距离对后退速度的测量影响进行假设检验。

Based on the results of the hypothesis test we can conclude with a high degree of statistical signficance that distance has an observed effect on the recessional velocity of galaxies. This is concrete evidence of Hubble’s Law that the universe is constantly expanding.

根据假设检验的结果，我们可以得出高度的统计意义，即距离对星系的后退速度有明显影响。这是哈勃定律证明宇宙不断膨胀的具体证据。

Before we get into that let’s familiarize ourselves with Hubble’s Law.

在开始讨论之前，让我们熟悉哈勃定律。

哈勃定律 (Hubble’s Law)

In Edwin Hubble’s famous PNAS article “A relation between distance and radial velocity among extra-galactic nebulae” (1), Hubble provided evidence for one of science’s greatest discoveries: the expanding universe. Hubble demonstrated that galaxies are moving away from Earth with a recession velocity that is correlated to their distance from Earth. In other words, galaxies that are further away from Earth move away faster than nearby galaxies. This is commonly referred to as Hubble’s Law. Hubble’s classic graph of observed velocity vs. distance for nearby galaxies (presented above) visualizes this phenomenon. This graph has become a milestone in the scientific community, as it displays the linear relationship between galactic recessional velocity (v) and distance from Earth (d):

在埃德温·哈勃(Edwin Hubble)着名的PNAS文章“银河外星云之间的距离与径向速度之间的关系”(1)中，哈勃为科学上最伟大的发现之一：膨胀中的宇宙提供了证据。哈勃证明，星系正在以与地球到地球的距离相关的衰退速度离开地球。换句话说，距离地球较远的星系比附近的星系移动得更快。这通常称为哈勃定律。哈勃(Hubble)关于附近星系观测到的速度与距离的经典关系图(如上所示)将这一现象形象化。这张图已成为科学界的里程碑，因为它显示了银河退缩速度(v)与距地球的距离(d)之间的线性关系：

v = Ho x d

v = Ho xd

Here v is the galaxy’s recessional velocity and d is the galaxy’s distance from Earth. Ho is an empirically determined constant called Hubble’s constant. Even though the expansion rate is persistent in all directions at any given time, it changes throughout the lifetime of the universe. The well-calibrated expansion rate at the present time, Ho, is about 70 kilometers per second per megaparsec (note on units used here: recession velocity is in kilometers per second and distance is in megaparsec, 1 megaparsec = 1M parsecs, 1 parsec = 3.26 light-years). (2)

这里v是星系的后退速度， d是星系到地球的距离。 Ho是根据经验确定的常数，称为哈勃常数。即使在任何给定时间在所有方向上都具有持久的膨胀率，它在整个宇宙的生命周期中都会发生变化。目前，经过良好校准的扩展速度Ho约为每秒每兆帕秒70公里(此处使用的单位请注意：后退速度以千米每秒为单位，距离以兆帕秒为单位，1兆帕秒= 1M帕秒，1帕秒= 3.26光年。 (2)

Hubble used the Hooker Telescope at Mount Wilson Observatory for some of his most important discoveries. © Emilio Segrè Visual Archives / American Institute of Physics / Photo Researchers, Inc.

Hubble’s remarkable feat was obtained using a very small sample of measurements of velocities and distances for 24 nearby galaxies. The distances to these galaxies were inaccurately measured from the visible brightness of their stars. In addition to plotting all of the individual 24 galaxies in the diagram, Hubble also grouped them into 9 clusters (open circles on Hubble’s diagram) based on their closeness in direction and distance, as a means of minimizing the scatter. Hubble’s experiment was conclusive in convincing the scientific community of the existence of the expanding universe. (2)

哈勃的非凡成就是使用非常小的24个附近星系的速度和距离测量样本获得的。距这些星系的距离是根据其恒星的可见亮度进行的不准确测量。除了在图中绘制所有24个星系之外，哈勃还根据方向和距离的紧密程度将它们分为9个簇(哈勃图上的空心圆)，以最大程度地减少散射。哈勃的实验在说服科学界相信不断膨胀的宇宙的存在方面是结论性的。 (2)

Hubble’s diagram of galactic recessional velocity versus distance. (Hubble, Proceedings of the National Academy of Sciences, 1929, 15, 168)

Hubble’s diagram shows a strong linear relationship between velocity and distance. What makes this graph profound is the extensive implications of the observed trend: we live in a large, dynamically evolving universe that is expanding all directions. It is not the type of universe that Albert Einstein assumed in 1917. In fact, Einstein factored in a cosmological constant into his equations to keep the universe static, as it was believed to be at the time. Contrary to Einstein’s beliefs, Hubble’s results suggested that the universe has been expanding for billions of years, from an early beginning of the “Big Bang” up until the present. (2)

哈勃图显示了速度和距离之间的强线性关系。使该图更深刻的是观察到的趋势的广泛含义：我们生活在一个巨大的，动态演化的宇宙中，宇宙在向各个方向扩展。这不是阿尔伯特·爱因斯坦(Albert Einstein)在1917年所假定的那种宇宙。实际上，爱因斯坦将宇宙学常数纳入其方程式中，以保持宇宙的静态性，这在当时被认为是这样。与爱因斯坦的看法相反，哈勃的结果表明，从“大爆炸”的早期开始到现在，宇宙已经膨胀了数十亿年。 (2)

Although Hubble successfully displayed the beautiful linear relationship in his diagram, Hubble’s values for his distances in 1929 were too small by a factor of ~7. The expansion rate Ho was also too large by the same factor. However, despite this large imprecision and its great ramifications for the expansion rate and age of the universe, Hubble’s discovery of the expanding universe is not affected. The underlying linear equation of v ∼ d still holds true! (2)

尽管哈勃在图表中成功显示出漂亮的线性关系，但哈勃在1929年的距离值太小了约7倍。出于相同的原因，膨胀率Ho也太大。但是，尽管存在很大的不精确性，并且对宇宙的膨胀率和年龄有很大的影响，但哈勃关于宇宙膨胀的发现并没有受到影响。 v的基本线性方程〜d仍然适用！ (2)

Note that Einstein’s theory of relativity forecasts deviations from a strictly linear interpretation of Hubble’s law. The amount of deviation depends on the total mass of the universe. A greater understanding of Hubble’s law can inform us about the amount of total matter in the universe. It might also provide more information about dark matter… (3)

请注意，爱因斯坦的相对论预测偏离哈勃定律的严格线性解释。偏差量取决于宇宙的总质量。对哈勃定律有更深入的了解可以使我们了解宇宙中总物质的数量。它还可能提供有关暗物质的更多信息……(3)

Hubble’s Law was the primary observational evidence in support of the Big Bang theory. Hubble was well renown for his discoveries and in 1990 NASA named the Hubble space telescope after him. (4)

哈勃定律是支持大爆炸理论的主要观察证据。哈勃因其发现而享誉世界。1990年，美国国家航空航天局(NASA)以他的名字命名了哈勃太空望远镜。 (4)

Excellent! With that out of the way, now we can start diving into all the fun we’re going to have with hacker stats and Ordinary Least Squares (OLS) Regression. Let’s get started.

优秀的！有了这种方式，现在我们就可以开始研究黑客统计数据和普通最小二乘(OLS)回归带来的所有乐趣。让我们开始吧。

实验设计(方法论) (Experimental Design (Methodology))

Exploratory Data Analysis (EDA).探索性数据分析(EDA)。
Adjust galactic distances by a factor of 7.将银河距离调整7倍。
OLS using the original Hubble dataset of 24 measurements of galactic distances and recession velocities.OLS使用最初的哈勃数据集，其中包含24个银河距离和后退速度测量值。
Pairs Bootstrap Resampling of 24 measurements.配对Bootstrap重采样24个测量值。
Hypothesis Test → measure the effect of distance on recession velocities.假设检验→测量距离对衰退速度的影响。

关于数据 (About the Data)

Source: “A relation between distance and radial velocity among extra-galactic nebulae” by Edwin Hubble. (1)

资料来源：埃德温·哈勃(Edwin Hubble)的“银河外星云之间的距离与径向速度之间的关系”。 (1)

Object Name: Name of the galaxy.

对象名称：星系的名称。
Distance [Mpc] (r): Distance from Earth in megaparsecs. 1 megaparsec = 1M parsecs, 1 parsec = 3.26 light-years.

距离[Mpc](r)：距地球的距离，单位为兆帕。 1兆帕秒= 1M帕秒，1帕秒= 3.26光年。
Velocity [Km/second] (v): Recessional velocity, how fast a galaxy is moving away from Earth. Recessional velocity was recorded in kilometers per second.

速度[Km / second](v)：衰退速度，银河系离开地球移动的速度。衰退速度以公里每秒记录。

Hubble’s 24 measurements of galactic distances and recession velocities. (Hubble, Proceedings of the National Academy of Sciences, 1929, 15, 168)

**Note to my technical readers: If you are interested in the Python code that I used to generate the plots, calculations, etc. feel free to check out my GitHub repo.**

**我的技术读者注意：如果您对我用来生成绘图，计算等的Python代码感兴趣，请随时查看我的 GitHub存储库 。**

EDA (EDA)

The first thing we will do is look at the normalized deviations of distances and recessional velocities to examine their relationship. The mean describes the center of the data. The standard deviation describes the spread of the data. It is convenient to normalize two variables in order to perform a fair comparison.

我们要做的第一件事是查看距离和后退速度的归一化偏差，以检查它们之间的关系。平均值描述数据的中心。标准差描述数据的传播。标准化两个变量以便进行公平比较很方便。

With the exception of the first couple of measurements, upon visual inspection of the two normalized arrays of the deviations the galactic distances and their recession velocities seem to be highly correlated. Let’s adjust the distance values and generate summary statistics of our data.

除了前几对测量值以外，在目视检查两个标准化的偏差阵列后，银河距离及其后退速度似乎高度相关。让我们调整距离值并生成数据的摘要统计信息。

Adjust Distances by Factor of 7

以7的系数调整距离

Now we are going to adjust the distance values by multiplying them by a factor of 7. We can look at the result of our adjustment through descriptive statistics.

现在，我们将距离值乘以7来调整距离值。我们可以通过描述性统计数据查看调整结果。

Descriptive Statistics of Hubble’s 24 observations of galactic distances and recession velocities, including adjusted distances ie. distances7

Despite the increase in distance by a large factor of 7, recessional velocity and galactic distance are still highly correlated. Let’s calculate the Pearson correlation coefficient and visualize the correlation with the adjusted distance variable via a scatter plot.

尽管距离增加了7倍，但后退速度和银河距离仍然高度相关。让我们计算皮尔逊相关系数，并通过散点图可视化与调整后的距离变量的相关性。

Correlation

相关性

Aside from the adjustment of the x-axis, this graph doesn’t look much different than the one that Hubble created. The data exhibits a strong linear relationship with a Pearson correlation coefficient of ~0.8. Next we will perform an Ordinary Least Squares Regression to further understand this relationship as a result of a linear function.

除了调整x轴外，此图看起来与哈勃创建的图没有太大不同。数据表现出很强的线性关系，皮尔逊相关系数约为0.8。接下来，我们将执行普通最小二乘回归，以进一步了解线性函数的关系。

最小二乘 (OLS)

The regression results below were generated via the statsmodels ols() API in Python.

下面的回归结果是通过Python中的statsmodels ols()API生成的。

statsmodels ols() results: velocities ~ distances7

For every unit increase in distance, recessional velocity increases by 64.88 km per second.

每增加单位距离，后退速度将增加64.88 km / s。

According to the R-squared value, 62% of the variance of recession velocities are explained by distances.

根据R平方值，用距离解释了衰退速度变化的62％。

We’ve already observed Hubble’s Law with a couple of lines of Python code. We examined correlation and concluded with enough confidence that the majority of the variance can be explained by the model. Technically, we could stop at this point and call it day. But let’s take this a step further and understand the residuals like any good scientist would.

我们已经用几行Python代码观察了哈勃定律。我们检查了相关性，并以足够的信心得出结论，该模型可以解释大部分方差。从技术上讲，我们可以在这一点上停下来并将其命名为“ day”。但是，让我们更进一步，像任何优秀科学家一样理解残差。

Residuals, RSS and RMSE

残差，RSS和RMSE

If we interpret R-squared as the variances that can be explained by our OLS model, the residual sum of squares (RSS) represents the amount of errors that are not explained by the model.

如果我们将R平方解释为可以由我们的OLS模型解释的方差，则残差平方和(RSS)表示该模型无法解释的误差量。

The solution of OLS regression is the set of coefficient values for which the RSS is minimal. We’ll revisit this topic when we look at bootstrap resampling in the next section.

OLS回归的解决方案是RSS最小的一组系数值。在下一节中，我们将在介绍引导程序重采样时重新讨论该主题。

Here we have Root Mean Square Error (RMSE) of ~223, which can be interpreted as the spread of prediction errors, or how concentrated the data is around the line of best fit. Let’s look at a probability plot to visualize the spread of residuals.

在这里，我们的均方根误差(RMSE)为〜223，可以解释为预测误差的散布，或者数据在最佳拟合线附近的集中程度。让我们看一下概率图，以可视化残差的分布。

The probability plot of the residuals of our OLS model is approximately linear, supporting the assumption that the error terms are normally distributed.

我们的OLS模型残差的概率图近似线性，支持误差项呈正态分布的假设。

Again we could also stop right here, but we’re going to keep moving and generate some bootstrap replicates to validate some of the conclusions we’ve witnessed from OLS Regression and uncover a couple of new ones of our own.

同样，我们也可以在这里停止，但是我们将继续前进并生成一些引导程序副本，以验证从OLS Regression见证的一些结论，并发现我们自己的一些新结论。

使用双自举重采样 (Resampling with Pairs Bootstraps)

Pairs bootstrap involves resampling pairs of data with replacement. Each collection of pairs fit with a regression model. We will do this again, and again, and again n number of times generating bootstrap n sample statistics from the explanatory and dependent variables, in addition to model parameter estimates after running the OLS model n number of times. We will also calculate the RSS Minima using Bootstrap Resampling to identify the linear equation that best minimizes the errors.

Pairs bootstrap涉及重新采样数据对并进行替换。对的每个集合均符合回归模型。我们会再次做到这一点，又一次，又一次次从生成的解释变量和因变量的自举n个采样统计n个，除了模型参数估计运行时间的OLS模型n个后。我们还将使用Bootstrap重采样来计算RSS最小值，以识别最能使误差最小的线性方程。

The goal is to use bootstrap resampling to compute one mean for each sample and create a distribution of sample means and then compute the standard error to quantify the uncertainty in the sample statistic as an estimator for the population average and standard deviation. This comes in very handy since we don’t know the true values for the population average or standard deviation. Instead, we will infer it using bootstrap resampling.

目标是使用自举重采样为每个样本计算一个均值，并创建样本均值的分布，然后计算标准误差以量化样本统计数据中的不确定性，作为总体平均值和标准偏差的估计量。这非常方便，因为我们不知道总体平均值或标准差的真实值。相反，我们将使用引导重采样来推断它。

According to the central limit theorem, if we generate enough replicates the resampled distributions will follow a normal distribution, which is one of the assumptions for a hypothesis test. More on that in the next section. For now, let’s generate 1,000 paired replicates for each variable.

根据中心极限定理，如果我们生成足够多的重复项，则重新采样的分布将遵循正态分布，这是假设检验的假设之一。下一节将对此进行更多介绍。现在，让我们为每个变量生成1,000个成对的重复。

Through way of bootstrap, we inferred that the expected average value of galactic distances is 6.47 Mpc with an uncertainty of about 1 Mpc. This is really close to the sample mean and standard deviation we generated early on. In addition, we can infer with 95% confidence that the true population average lies somewhere between 4.78 and 8.16 Mpc, based on the data provided.

通过自举，我们推断银河距离的预期平均值为6.47 Mpc，不确定性约为1 Mpc。这确实接近我们早期生成的样本均值和标准差。此外，根据提供的数据，我们可以以95％的置信度推断出真正的人口平均数介于4.78和8.16 Mpc之间。

Notice we have a black line in the middle to mark the expected value. Uncertainty here is just one measure of the spread of the distribution of sample means. Moreover, notice the uncertainty we computed also fits inside the confidence interval. You can think of the uncertainty as the one-sigma confidence interval.

注意，中间有一条黑线标记期望值。这里的不确定度只是衡量样本均值分布范围的一种方法。此外，请注意，我们计算出的不确定性也适合置信区间内。您可以将不确定性视为一个1西格玛的置信区间。

In addition, the vertical red lines mark the 5th (left) and 95th (right) percentiles, which denote the extent of the confidence interval or the range of values containing the inner 95% of sample means.

此外，垂直红线标记第5个(左)和第95个(右)百分位数，表示置信区间的范围或包含内部95％样本均值的值的范围。

Similarly for velocities, we inferred that the expected average value of velocities is about 378 km per second with an uncertainty of about 74 km per second. In addition, we can infer with 95% confidence that the true population average lies somewhere between 238 and 526 km per second, based on the data provided.

同样，对于速度，我们推断速度的期望平均值约为每秒378公里，不确定度约为每秒74公里。此外，根据提供的数据，我们可以以95％的置信度推断出真实的平均人口数量在每秒238至526 km之间。

Now we’re going to conduct a similar exercise, this time with the model slope and intercept parameters. That’s right! You can also use bootstrap resampling to compute the estimate, standard error, and confidence interval for OLS model parameters, all thanks to the central limit theorem. We’re basically going to use each pairs bootstrap replicate as an input into an OLS model to generate bootstrap slope and intercept estimates. Let’s give it a try.

现在，我们将使用模型斜率和截距参数进行类似的练习。那就对了！您还可以使用引导重采样来计算OLS模型参数的估计值，标准误差和置信区间，这全都归功于中心极限定理。基本上，我们将使用每对引导复制作为OLS模型的输入，以生成引导斜率和截距估计。试一试吧。

We inferred that the estimate of the slope is 65.17 km per second/Mpc with a standard error of 10.33 km per second/Mpc. We are 95% confident that the true slope lies somewhere between 46.33 and 87.34 km per second/Mpc, based on the data provided.

我们推断斜率的估计值为65.17 km / s / Mpc，标准误差为10.33 km / s / Mpc。根据提供的数据，我们有95％的把握是真实的斜率在46.33和87.34 km /秒/ Mpc之间。

Note that this is very close to the summary output of statsmodels ols().

请注意，这与statsmodels ols()的摘要输出非常接近。

We inferred that the estimate of the intercept is -43.23 km per second with a standard error of 78.44 km per second. We are 95% confident that the true intercept lies somewhere between -200.99 and 104.23 km per second, based on the data provided.

我们推断，截距的估计值为每秒-43.23 km，标准误为每秒78.44 km。根据提供的数据，我们有95％的信心确定真正的截距在每秒-200.99至104.23 km之间。

Now we’re going to generate the RSS Minima via Pairs Bootstrap Resampling.

现在，我们将通过Pairs Bootstrap重采样来生成RSS最小值。

Visualizing the RSS Minima

可视化RSS最小值

Recall when we looked at RSS before, the solution of OLS is the set of coefficient values for which the RSS is minimal. Now we’re going to use the same replicates we generated to visualize the RSS Minima. Then we’re going to retrieve the model parameters (slope and intercept) that generated the RSS Minima.

回想一下我们以前看过RSS时，OLS的解是RSS最小的一组系数值。现在，我们将使用生成的相同副本来可视化RSS Minima。然后，我们将检索生成RSS最小值的模型参数(坡度和截距)。

Amazing! The best slope and intercept are the ones out of arrays of slopes and intercepts that yielded the minimum RSS value. Notice that our slope value is almost equivalent to the well-calibrated expansion rate (Ho) at the present time.

惊人！最佳斜率和截距是那些产生最小RSS值的斜率和截距数组中的斜率和截距。请注意，目前我们的斜率值几乎等于经过良好校准的膨胀率( Ho )。

Behind the scenes, we used the 95% confidence intervals that we generated for the slope and intercept estimates to filter out model parameter values that weren’t within range.

在幕后，我们使用为斜率生成的95％置信区间并截取估计值，以过滤掉不在范围内的模型参数值。

Now that we have the RSS Minima and the model parameters that yielded it, we can visualize the new model with a scatter plot.

现在我们有了RSS Minima和产生它的模型参数，我们可以用散点图可视化新模型了。

If we compare this scatter plot to the one we generated earlier during EDA, there’s a slight difference as the red line is a bit steeper. It doesn’t pass through the second galaxy from the top at 14 Mpc but rather is slightly above it. We can consider this an improvement in the overall fit of the model!

如果将散布图与我们在EDA期间生成的散布图进行比较，则会有一点差异，因为红线更陡一些。它没有以14 Mpc的速度从顶部穿过第二个星系，而是略高于它。我们可以认为这是模型整体拟合的改进！

In the final section, we will conduct a hypothesis test to examine the theory that the length of galactic distance from Earth has an effect on the galaxy’s recessional velocity. We’ve already used a number of tools in our hacker stats toolbox to examine Hubble’s Law. Let’s put the finishing touches on the icing of the cake!

在最后一节中，我们将进行假设检验，以检验银河系与地球之间的距离长度对银河系的后退速度有影响的理论。我们已经在黑客统计信息工具箱中使用了许多工具来检查哈勃定律。让我们为蛋糕锦上添花！

假设检验→银河星云的距离是否对其后退速度有观察到的影响？ (Hypothesis Test → Do the distances of galactic nebulae have an observed effect on their recession velocities?)

Recall that we used the assumption of the central limit theorem to generate enough replicates to obtain paired resampled normal distributions of galactic distances and recessional velocities. Data that is normally distributed is one of the assumptions required for a hypothesis test.

回想一下，我们使用中心极限定理的假设来生成足够的重复项，以获得成对的重新采样的银河距离和后退速度正态分布。正态分布的数据是假设检验所需的假设之一。

Now we will test whether the length of galactic distance has an observed effect on recessional velocity. We will define short and long distances of galaxies from planet Earth. Then we will resample and shuffle the velocities and take the difference in resampled means as a test statistic. In other words, if the test statistic distribution truly exhibits a difference in effect (ie. mean difference of velocities > 0 ) then we can reject the null hypothesis, and conclude with enough power that the results are statistically significant.

现在，我们将测试银河距离的长度是否对后退速度有观察到的影响。我们将定义星系与地球之间的短距离和长距离。然后，我们将对速度进行重新采样和改组，并将重新采样的均值之差作为检验统计量。换句话说，如果检验统计量分布确实显示出效果上的差异(即，速度的平均差异> 0)，那么我们可以拒绝原假设，并以足够的能力得出结论，该结果在统计上是有意义的。

See the null and alternative hypotheses below:

请参见下面的原假设和替代假设：

Null Hypothesis

零假设

The length of distance has no effect on the recession velocity of Extra-Galactic Nebulae.

距离的长度对银河外星云的后退速度没有影响。

Alternative Hypothesis

替代假设

The length of distance has an observed effect on the recession velocity velocities of Extra-Galactic Nebulae.

距离的长度对银河外星云的后退速度有影响。

Assumptions

假设条件

For our experiment, we will use a 95% significance level, which will make our alpha value 0.05. We define short distances as distances less than 7 Mpc; Conversely, we define long distances as distances that are greater or equal to 7 Mpc → Note that this will be done with our adjusted values for distances.

对于我们的实验，我们将使用95％的显着性水平，这将使我们的alpha值为0.05。我们将短距离定义为小于7 Mpc的距离；相反，我们将长距离定义为大于或等于7 Mpc的距离→请注意，这将通过我们对距离的调整值来完成。

We’re going to use a T-test since we do not know the true standard deviation of the population; We will use 1,000 bootstrap replicates of galactic recessional velocities.

因为我们不知道总体的真实标准偏差，所以我们将使用T检验。我们将使用1,000个银河衰退速度的自举复制。

The test statistic is the difference between a recession velocity drawn from shorter distances and one drawn from longer distances. The distribution of difference values is built up by subtracting each point in the shorter range with one from the longer range, to see if the mean difference is greater than zero, also known as the effect size.

检验统计量是从较短距离得出的衰退速度与从较长距离得出的衰退速度之差。差值的分布是通过将较短范围内的每个点减去较长范围中的一个点而建立的，以查看平均差是否大于零(也称为效果大小)。

And there we have it! The mean of the test statistic is not zero (denoted by the shaded region in gray), which tells us that there is on average an 83.76 km per second difference in velocities when comparing short and long galactic distances. Again, we refer to this as our effect size. In other words, galaxies that are closer to Earth are moving away at a much slower rate than galaxies that are a lot further away from Earth. The increase in the galactic distance from Earth had an observed effect on recessional velocity. The standard error of the test statistic distribution is also not zero, so there is uncertainty in the size of the effect.

我们终于得到它了！测试统计的平均值不为零(由灰色阴影区域表示)，这告诉我们在比较短银河和长银河距离时平均每秒速度相差83.76 km。同样，我们将此称为效应大小。换句话说，离地球更近的星系的移动速度比离地球更远的星系移动的速度要慢得多。到地球的银河距离的增加对后退速度有观察到的影响。测试统计量分布的标准误差也不是零，因此影响的大小存在不确定性。

It’s also worthwhile to mention that shuffling the resampled data points had an effect on the randomness of our experiment. We shuffled the data in order to make sure that each sample is composed of random and independent data points, which are other assumptions required for a hypothesis test. If we didn’t shuffle the data then the effect size would be much greater due to the time-ordered effect on the mean.

还值得一提的是，对重新采样的数据点进行混洗会影响我们实验的随机性。我们对数据进行了混洗，以确保每个样本均由随机和独立的数据点组成，这是假设检验所需的其他假设。如果我们不对数据进行混洗，那么由于均值的时间顺序影响，影响大小会更大。

Finally, our P-value is extremely small → 0.0000001

最后，我们的P值非常小→0.0000001

Thus, we can conclude with a high degree of statistical significance that the distances of galaxies from Earth have an effect on their recessional velocities, which is observational evidence of Hubble’s Law. The universe is constantly expanding all around us!

因此，我们可以得出具有高度统计意义的结论，即星系与地球的距离会影响它们的后退速度，这是哈勃定律的观测证据。宇宙不断在我们周围扩展！

P.S. ~ Note that we didn’t use power analysis to determine the sample size for the hypothesis test upfront, since we weren’t privy to the standard effect size at the beginning of the experiment. According to traditional stats textbooks, one should determine the needed sample size prior to performing a hypothesis test. Instead, we opted for the hacker stats approach: we ran a hypothesis test with 1,000 samples, retrieved the effect size and standard error of the effect, and hacked the needed sample size. The result was roughly 910 observations, which worked well in our favor. Got to love hacker stats!

PS〜请注意，由于我们在实验开始时并不熟悉标准效应量，因此我们并未使用功效分析来预先确定假设检验的样本量。根据传统的统计教科书，应该在执行假设检验之前确定所需的样本量。相反，我们选择了黑客统计方法：我们对1,000个样本进行了假设检验，检索了效应大小和效应的标准误，然后黑客入侵了所需的样本量。结果大约是910个观测值，对我们有利。爱上了黑客统计信息！

下一步 (Next Steps)

Add 22 estimated distances for the T-test.为T检验加上22个估计的距离。
Identify Nebulae Clusters with KMeans.用KMeans识别星云团。
Use other data sources of galactic distances & recession velocities…使用银河距离和后退速度的其他数据源…

We could have used the 22 estimated distances for our T-test as well, bringing the total number of observations up to 46.

我们也可以将22个估计的距离用于T检验，从而使观测的总数达到46个。

Hubble grouped the 24 galaxies into 9 clusters. This would be an interesting exercise to see if we get the same cluster centroids with KMeans or Agglomerative clustering.

哈勃将24个星系分为9个星团。看看我们是否获得具有KMeans或聚集聚类的相同聚类质心，这将是一个有趣的练习。

Finally, we can also use data from the Hubble telescope to examine Hubble’s law.

最后，我们还可以使用哈勃望远镜的数据检查哈勃定律。

As Hubble concludes in his PNAS paper, “The results establish a roughly linear relation between velocities and distances among nebulae for which velocities have been previously published, and the [relationship] appears to dominate the distribution of velocities…..New data to be expected in the near future may modify the significance of the present investigation or, if confirmatory, will lead to a solution many times the weight.” (1)

正如哈勃在其PNAS论文中总结的那样：“结果建立了速度与星云之间的距离之间的大致线性关系，先前已经针对该距离发布了速度，[关系]似乎主导了速度的分布……..新数据有望得到预料在不久的将来可能会改变当前调查的意义，或者，如果证实这一点，将导致解决方案的重量增加很多倍。” (1)

翻译自: https://medium.com/datadriveninvestor/revisiting-hubbles-law-with-hacker-stats-in-python-9b56604916c1

腾讯哈勃

查看全文

http://www.taodudu.cc/news/show-997353.html

如何使用Picterra的地理空间平台分析卫星图像
hopper_如何利用卫星收集的遥感数据轻松对蚱hopper中的站点进行建模
华为开源构建工具_为什么我构建了用于大数据测试和质量控制的开源工具
数据科学项目_完整的数据科学组合项目
uni-app清理缓存数据_数据清理-从哪里开始？
bigquery_如何在BigQuery中进行文本相似性搜索和文档聚类
vlookup match_INDEX-MATCH — VLOOKUP功能的升级
flask redis_在Flask应用程序中将Redis队列用于异步任务
前馈神经网络中的前馈_前馈神经网络在基于趋势的交易中的有效性（1）
hadoop将消亡_数据科学家：适应还是消亡！
数据科学领域有哪些技术_领域知识在数据科学中到底有多重要？
初创公司怎么做销售数据分析_为什么您的初创企业需要数据科学来解决这一危机...
r软件时间序列分析论文_高度比较的时间序列分析-一篇论文评论
selenium抓取_使用Selenium的网络抓取电子商务网站
裁判打分_内在的裁判偏见
从Jupyter Notebook切换到脚本的5个理由
ip登录打印机怎么打印_不要打印，登录。
机器学习模型非线性模型_调试机器学习模型的终极指南
您的第一个简单的机器学习项目
鸽子为什么喜欢盘旋_如何为鸽子回避系统设置数据收集
追求卓越追求完美规范学习_追求新的黄金比例
周末想找个地方敲代码_观看我们的代码游戏，全周末直播
javascript 开发_25个新JavaScript开发人员的免费资源
感谢您的提问_感谢您的反馈，我们正在改进的5种方法
堆叠自编码器中的微调解释_25种深刻漫画中的编码解释
Free Code Camp现在有本地组
递归javascript_JavaScript中的递归
判断一个指针有没有free_Free Code Camp的每个人现在都有一个档案袋
使您的Java代码闻起来很新鲜
Stack Overflow 2016年对50,000名开发人员进行的调查得出的见解

腾讯哈勃_用Python的黑客统计资料重新审视哈勃定律相关推荐

python爬取腾讯视频弹幕_用Python爬取腾讯视频弹幕
原标题:用Python爬取腾讯视频弹幕 via:菜J学Python 1.网页分析本文以爬取<脱口秀大会第3季>最后一期视频弹幕为例,首先通过以下步骤找到存放弹幕的真实url. 通过删减 ...
python实现批量下载视频_利用Python实现批量下载腾讯视频！
原标题:利用Python实现批量下载腾讯视频! 导语利用Python下载腾讯非VIP视频,也就是可以免费观看的视频.做这个的起因是最近在看一个叫"请吃红小豆吧"的动漫,一共三分钟 ...
python集群_使用Python集群文档
python集群 Natural Language Processing has made huge advancements in the last years. Currently, variou ...
spotify音乐下载_使用Python和R对音乐进行聚类以在Spotify上创建播放列表。
spotify音乐下载 Spotify is one of the most famous Music Platforms to discover new music. The company use ...
网络安全用python吗_使用Python进行网络安全渗透——密码攻击测试器
相关文章: 本篇将会涉及: HTTP 基本认证对HTTP Basic认证进行密码暴力攻击测试什么是HTTP 基本认证 HTTP基本认证(HTTP Basic Authentication)是HTT ...
python交互式和文件式_使用Python创建和自动化交互式仪表盘
python交互式和文件式 In this tutorial, I will be creating an automated, interactive dashboard of Texas COVI ...
python 网页编程_通过Python编程检索网页
python 网页编程 The internet and the World Wide Web (WWW), is probably the most prominent source of info ...
python 时间序列预测_使用Python进行动手时间序列预测
python 时间序列预测 Time series analysis is the endeavor of extracting meaningful summary and statistical ...
python做审计底稿视频_最新Python教学视频，每天自学俩小时，让你offer拿到手软...
2020最新Python零基础到精通资料教材,干货分享,新基础Python教材,看这里,这里有你想要的所有资源哦,最强笔记,教你怎么入门提升!让你对自己更加有信心,重点是资料都是免费的,免费!!! 如 ...

腾讯哈勃_用Python的黑客统计资料重新审视哈勃定律