
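1. Some simple summary statistics: Mean, Median, Mode, Percentiles, Variance, Standard Deviation

2. Numerical measures that describe the shape of a distribution: z-score, skewness

3. Finally, measures of association between two variables: Covariance and Correlation Coefficient.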


If the measures are computed for data from a sample, they are called sample statistics.

If the measures are computed for data from a population, they are called population parameters.

3.1 Measure of Location

Sample Mean: $\bar{x} = \frac{\sum x_i}{n}$

Population Mean: $\mu = \frac{\sum x_i}{N}$

Median: Arrange the data in ascending order (smallest value to largest value).

(a) For an odd number of observations, the median is the middle value.

(b) For an even number of observations, the median is the average of the two middle values.

Mode: The mode is the value that occurs with the greatest frequency.
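The three measures of location above can be checked with Python's standard `statistics` module. This is a minimal sketch; the data values are made up for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical sample (illustrative values only)
data = [46, 54, 42, 46, 32]

sample_mean = mean(data)        # sum of the observations divided by n
sample_median = median(data)    # middle value of the sorted data (odd n)
sample_modes = multimode(data)  # every value tied for the highest frequency

# Even number of observations: the median averages the two middle values
even_median = median([1, 3, 5, 7])

print(sample_mean, sample_median, sample_modes, even_median)
```

`multimode` returns a list because a data set can have more than one mode.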

Percentiles: The pth percentile is a value such that at least p percent of the observations are less than or equal to this value and at least (100 - p) percent of the observations are greater than or equal to this value.

How to compute a percentile:

Step 1. Arrange the data in ascending order (smallest value to largest value).

Step 2. Compute an index $i = \frac{p}{100} \cdot n$, where p is the percentile of interest and n is the number of observations.

Step 3. (a) If i is not an integer, round up. The next integer greater than i denotes the position of the pth percentile.

(b) If i is an integer, the pth percentile is the average of the values in positions i and i+1.

For example, with n = 12 observations, the 85th percentile gives i = 10.2, which rounds up to position 11; the 25th percentile gives i = 3, so it is the average of the values in positions 3 and 4.

Quartiles:

Q1 = first quartile, or 25th percentile

Q2 = second quartile, or 50th percentile (also the median)

Q3 = third quartile, or 75th percentile
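The percentile procedure and the quartile definitions above can be sketched in a few lines of Python. The function name `percentile` and the data are illustrative assumptions. Note that library routines such as `numpy.percentile` use interpolation rules by default and may return slightly different values.

```python
import math

def percentile(data, p):
    """Return the pth percentile (0 < p < 100) using the three-step rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)          # Step 1: ascending order
    n = len(ordered)
    i = p * n / 100                 # Step 2: index i = (p / 100) * n
    if not float(i).is_integer():   # Step 3(a): round up to the next position
        return ordered[math.ceil(i) - 1]   # positions are 1-based
    i = int(i)                      # Step 3(b): average positions i and i+1
    return (ordered[i - 1] + ordered[i]) / 2

# Hypothetical data with n = 12 observations
data = [15, 20, 35, 40, 50, 55, 60, 70, 80, 85, 90, 95]

p85 = percentile(data, 85)   # i = 10.2 -> position 11
q1 = percentile(data, 25)    # i = 3    -> average of positions 3 and 4
q2 = percentile(data, 50)    # i = 6    -> average of positions 6 and 7
q3 = percentile(data, 75)    # i = 9    -> average of positions 9 and 10
print(p85, q1, q2, q3)
```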

3.2 Measures of Variability

Range = Largest Value - Smallest Value

Interquartile Range = IQR = Q3 - Q1

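Population Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$, where $\mu$ is the population mean.

Sample Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean.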

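Why does the sample variance divide by n - 1?

Dividing by n tends to underestimate the population variance, because the deviations are measured from the sample mean rather than from the true population mean. Dividing by n - 1 corrects this bias. The result can also be proved mathematically; see: https://www.zhihu.com/question/20099757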
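The bias can be seen exactly on a tiny example. The sketch below assumes a made-up population of three values and enumerates every ordered sample of size 2 drawn with replacement. Averaged over all samples, the n - 1 version matches the population variance, while the n version falls short.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population, small enough to enumerate every sample
population = [1, 2, 3]
N = len(population)
mu = Fraction(sum(population), N)
pop_var = sum((x - mu) ** 2 for x in population) / N

def sum_of_squares(sample):
    """Sum of squared deviations from the sample mean (exact arithmetic)."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

n = 2
samples = list(product(population, repeat=n))  # all 9 ordered samples

avg_div_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print(pop_var, avg_div_n, avg_div_n_minus_1)
```

`Fraction` is used so the comparison is exact rather than subject to floating-point rounding.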

Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Population standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$

Coefficient of Variation = (Standard Deviation / Mean) * 100%. This number can be used to compare the variability of different data sets.
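A minimal sketch of the variability measures above, using Python's `statistics` module on made-up data. `variance` and `stdev` divide by n - 1 (sample versions), while `pvariance` and `pstdev` divide by N (population versions).

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Hypothetical data (illustrative values only)
data = [46, 54, 42, 46, 32]

data_range = max(data) - min(data)   # largest value - smallest value

sample_var = variance(data)          # divides by n - 1
pop_var = pvariance(data)            # divides by N
sample_sd = stdev(data)              # square root of the sample variance
pop_sd = pstdev(data)                # square root of the population variance

# Coefficient of variation, expressed as a percentage of the mean
cv = sample_sd / mean(data) * 100

print(data_range, sample_var, pop_var, sample_sd, pop_sd, round(cv, 2))
```

The interquartile range is not shown here; it follows from Q3 - Q1 once the quartiles have been computed with the percentile procedure.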

3.3 Measures of Distribution Shape, Relative Location, and Detecting Outliers

1. Distribution Shape

Skewed Left (Skewness < 0): the mean is usually less than the median

Skewed Right (Skewness > 0): the mean is usually greater than the median

Symmetric (Skewness = 0): the mean equals the median
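The notes do not give a formula for skewness. The sketch below assumes the common adjusted sample skewness, $\frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$, and uses made-up data to show the sign of the skewness alongside the mean and median.

```python
from statistics import mean, median, stdev

def sample_skewness(data):
    """Adjusted sample skewness (an assumed formula, not taken from the notes)."""
    n = len(data)
    xbar = mean(data)
    s = stdev(data)  # sample standard deviation (n - 1 divisor)
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in data)

# Hypothetical data sets
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]   # one large value stretches the right tail
symmetric = [1, 2, 3, 4, 5]

skew_right = sample_skewness(right_skewed)
skew_sym = sample_skewness(symmetric)

print(skew_right, mean(right_skewed), median(right_skewed))
print(skew_sym, mean(symmetric), median(symmetric))
```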

2. Z-score:

The z-score determines how far a particular value is from the mean: $z_i = \frac{x_i - \bar{x}}{s}$

The z-score $z_i$ can be interpreted as the number of standard deviations $x_i$ is from the mean $\bar{x}$.


3. Chebyshev's Theorem

At least $(1 - 1/z^2)$ of the data values must be within z standard deviations of the mean, where z is any value greater than 1.
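As a quick check of the bound above, the sketch below evaluates $1 - 1/z^2$ for a few values of z and compares it with the actual share of a made-up, skewed data set that lies within z standard deviations of its mean. The function names are illustrative.

```python
from statistics import mean, pstdev

def chebyshev_lower_bound(z):
    """Minimum share of the data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("the bound is only stated for z > 1")
    return 1 - 1 / z ** 2

def share_within(data, z):
    """Observed share of values within z standard deviations of the mean."""
    m, sd = mean(data), pstdev(data)
    return sum(abs(x - m) <= z * sd for x in data) / len(data)

# Hypothetical skewed data; the bound holds for any distribution shape
data = [1, 2, 2, 3, 3, 3, 4, 10]

for z in (2, 3, 4):
    print(z, chebyshev_lower_bound(z), share_within(data, z))
```

For z = 2, 3, and 4 the bound gives at least 75%, about 89%, and about 94% of the data.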

4. Empirical Rule

For data having a bell-shaped distribution:

• Approximately 68% of the data values will be within one standard deviation of the mean.

• Approximately 95% of the data values will be within two standard deviations of the mean.

• Almost all of the data values will be within three standard deviations of the mean.
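The three percentages in the empirical rule correspond to areas under a normal curve. Assuming an exactly normal distribution, they can be reproduced with `statistics.NormalDist` (Python 3.8+):

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within_k_sd(k):
    """Probability that a normal value lies within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = share_within_k_sd(1)   # roughly 68%
within_2 = share_within_k_sd(2)   # roughly 95%
within_3 = share_within_k_sd(3)   # almost all

print(round(within_1, 4), round(within_2, 4), round(within_3, 4))
```

Real data that are only approximately bell-shaped will match these shares only approximately.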

5. Detecting Outliers

We recommend treating any data value with a z-score less than -3 or greater than +3 as an outlier.
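A minimal sketch of this rule, computing z-scores with the sample mean and sample standard deviation. The function names and the data are made up for illustration.

```python
from statistics import mean, stdev

def z_scores(data):
    """z-score of each value: (x - sample mean) / sample standard deviation."""
    xbar = mean(data)
    s = stdev(data)
    return [(x - xbar) / s for x in data]

def find_outliers(data, threshold=3):
    """Values whose z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]

# Hypothetical data: 19 values near 50 plus one extreme value
data = [50, 52, 48, 51, 49, 50, 53, 47, 50, 51,
        49, 52, 48, 50, 51, 49, 50, 52, 48, 120]

outliers = find_outliers(data)
print(outliers)
```

In very small samples no z-score can reach 3 in absolute value (the largest possible is (n - 1) / sqrt(n)), so this rule is only useful when there are more than about ten observations.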

3.4 Exploratory Data Analysis

1. Five Number Summary

  1. Smallest value

  2. First quartile (Q1)

  3. Median (Q2)

  4. Third quartile (Q3)

  5. Largest value
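The five-number summary can be sketched by combining the minimum, the maximum, and the three quartiles. The helper below assumes the round-up/average percentile rule from section 3.1; the names and data are illustrative.

```python
import math

def percentile(data, p):
    """pth percentile via the round-up / average-two-positions rule."""
    ordered = sorted(data)
    i = p * len(ordered) / 100
    if not float(i).is_integer():
        return ordered[math.ceil(i) - 1]   # non-integer index: round up
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2   # integer index: average i and i+1

def five_number_summary(data):
    """(smallest value, Q1, median, Q3, largest value)."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))

# Hypothetical data with n = 12 observations
data = [15, 20, 35, 40, 50, 55, 60, 70, 80, 85, 90, 95]

summary = five_number_summary(data)
print(summary)
```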

2. Box Plot

3.5 Measures of Association Between Two Variables

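Sample Covariance: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

Population Covariance: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$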


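Covariance has two limitations. First, its magnitude is unbounded and can become very large; a measure confined to the interval [-1, 1] would be easier to interpret. Second, it depends on the units of measurement: converting x from meters to centimeters changes the covariance, even though the strength of the relationship is the same.

This is why we use the Correlation Coefficient.

Correlation Coefficient: $r_{xy} = \frac{s_{xy}}{s_x s_y}$, where $s_x$ and $s_y$ are the sample standard deviations of x and y. The correlation coefficient ranges from -1 to +1. Values close to -1 or +1 indicate a strong linear relationship. The closer the correlation is to zero, the weaker the relationship.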
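The unit problem can be seen in a short sketch. The heights and weights below are made up, and the function names are illustrative. Converting heights from meters to centimeters multiplies the sample covariance by 100 but leaves the correlation coefficient unchanged.

```python
from math import sqrt
from statistics import mean

def sample_covariance(x, y):
    """Sum of cross-deviations from the means, divided by n - 1."""
    xbar, ybar = mean(x), mean(y)
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    s_x = sqrt(sample_covariance(x, x))   # covariance of x with itself is its variance
    s_y = sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical paired observations
height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weight_kg = [50, 58, 63, 72, 80]
height_cm = [h * 100 for h in height_m]   # same heights, different unit

cov_m = sample_covariance(height_m, weight_kg)
cov_cm = sample_covariance(height_cm, weight_kg)
r_m = correlation(height_m, weight_kg)
r_cm = correlation(height_cm, weight_kg)

print(cov_m, cov_cm)   # covariance scales with the unit
print(r_m, r_cm)       # correlation does not
```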

3.6 The Weighted Mean and Working with Grouped Data

1. Weighted Mean

2. Grouped Data
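The notes give no formulas for this section, so the sketch below relies on the standard textbook definitions as assumptions: the weighted mean is $\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$, and for grouped data each class is represented by its midpoint $M_i$ with frequency $f_i$, giving $\bar{x} = \frac{\sum f_i M_i}{n}$ and $s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$. All data values are made up.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# 1. Weighted mean: hypothetical grades weighted by credit hours
grades = [90, 80, 70]
credit_hours = [3, 2, 1]
weighted_grade = weighted_mean(grades, credit_hours)

# 2. Grouped data: hypothetical frequency table (lower limit, upper limit, frequency)
classes = [(10, 14, 4), (15, 19, 8), (20, 24, 5), (25, 29, 2), (30, 34, 1)]
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
frequencies = [f for _, _, f in classes]
n = sum(frequencies)

# The grouped mean is a weighted mean of midpoints with frequencies as weights
grouped_mean = weighted_mean(midpoints, frequencies)
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for m, f in zip(midpoints, frequencies)
) / (n - 1)

print(weighted_grade, grouped_mean, grouped_variance)
```

Because the original observations inside each class are replaced by the class midpoint, the grouped mean and variance are approximations of the values computed from the raw data.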

