Statistical Functions of pandas Series (7)

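pandas is a Python library commonly used to process and analyze large data sets. Since that work is essentially mathematical statistics, this chapter takes a brief look at some of pandas' statistical functions, using Series as the example.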

The sum function

The sum function returns the sum of the values in a Series. NaN values are skipped by default.

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, None, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.sum())  # the None entry becomes NaN and is skipped
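
As a small sketch of how the missing value affects the total, the snippet below reuses the same data and compares the default behaviour with skipna=False. The variable names are only illustrative.

```python
import math
import pandas as pd

# Same data as the example above; None is stored as NaN (dtype float64).
t = pd.Series([1000, 201, None, 104], index="hello the cruel world".split())

total_default = t.sum()             # skipna=True by default, NaN is ignored
total_strict = t.sum(skipna=False)  # NaN propagates into the result
n_valid = t.count()                 # number of non-NaN values

print(total_default)
print(math.isnan(total_strict))
print(n_valid)
```

count is useful alongside sum because it shows how many values actually contributed to the total.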

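The mean function

The mean function returns the mean μ. Pay attention to NaN values: by default skipna=True, so NaN values are skipped. With skipna=False the NaN entries take part in the computation, and the result is NaN whenever the Series contains one. In practice NaN values are usually ignored.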

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, None, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.mean())              # NaN skipped: mean of the three valid values
print(t.mean(skipna=False))  # NaN included: the result is NaN
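
To make the default skipna=True behaviour concrete, this sketch checks that the mean equals the sum of the valid values divided by their count. It assumes the same data as above.

```python
import pandas as pd

t = pd.Series([1000, 201, None, 104], index="hello the cruel world".split())

mean_default = t.mean()          # NaN is skipped
by_hand = t.sum() / t.count()    # sum of valid values / number of valid values

print(mean_default)
print(by_hand)
```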

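The quantile function

Quantiles are a concept from statistics; readers can look up the details on their own. Called without an argument, quantile uses q=0.5 and returns the median.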

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.quantile())      # default q=0.5, the median
print(t.quantile(0.5))
print(t.quantile(0.25))  # first quartile
print(t.quantile(0.75))  # third quartile
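
The following sketch reproduces these quantiles by hand. It assumes pandas' default interpolation="linear", and linear_quantile is a hypothetical helper written for this illustration, not part of pandas.

```python
import math
import pandas as pd

def linear_quantile(values, q):
    """Hypothetical helper: quantile by linear interpolation on sorted data."""
    data = sorted(values)
    pos = q * (len(data) - 1)  # fractional position in the sorted data
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)

val = [1000, 201, 333, 104]
t = pd.Series(val, index="hello the cruel world".split())

for q in (0.25, 0.5, 0.75):
    print(q, linear_quantile(val, q), t.quantile(q))
```

quantile also accepts a list such as t.quantile([0.25, 0.5, 0.75]) and then returns a Series indexed by q.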

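The describe function

describe returns a set of summary statistics for the Series: count, mean, standard deviation, minimum, quartiles, and maximum.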

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.describe())

Program output:

count       4.000000
mean      409.500000
std       404.699477
min       104.000000
25%       176.750000
50%       267.000000
75%       499.750000
max      1000.000000
dtype: float64
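
Because describe returns a Series, individual statistics can be read by label. The sketch below assumes the same data; the percentiles argument is an existing describe parameter, and the values chosen here are only an example.

```python
import pandas as pd

t = pd.Series([1000, 201, 333, 104], index="hello the cruel world".split())

desc = t.describe()
print(desc["mean"])               # same value as t.mean()
print(desc["50%"] == t.median())  # the 50% row is the median

# Request other percentiles instead of the default quartiles.
custom = t.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())
```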

The max and idxmax functions

max returns the largest value in the Series, while idxmax returns the index label of that value.

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.max())     # largest value
print(t.idxmax())  # label of the largest value

Program output:

hello    1000
the       201
cruel     333
world     104
dtype: int64 <- t
1000
hello

The min and idxmin functions work the same way for the smallest value.
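
A minimal sketch of min and idxmin on the same data, showing that the label returned by idxmin can be used to look the value up again:

```python
import pandas as pd

t = pd.Series([1000, 201, 333, 104], index="hello the cruel world".split())

smallest = t.min()            # smallest value
smallest_label = t.idxmin()   # index label of the smallest value

print(smallest)
print(smallest_label)
print(t[smallest_label] == smallest)
```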

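Variance-related statistical functions

The var function computes the variance. Variance reflects the error between each output of a model and the expected output of the model (the mean), in other words the stability of the model. For a pandas Series it can be computed with var.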

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.var(), "\t<- var")

Program output:

hello    1000
the       201
cruel     333
world     104
dtype: int64 <- t
163781.66666666666  <- var

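The variance is computed as follows:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu)^2$$

Here μ is the mean, which can be obtained with the mean function, and n is the number of values. By default pandas divides by n - 1 (ddof=1). The following Python code checks that var agrees with the formula above: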

import pandas as pd
import numpy as np
idx =  "hello the cruel world".split()
val =  [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.var(), "\t<- var")
x = val
mu = t.mean()
y = [np.square(v - mu) for v in x]  # squared deviations from the mean
print(np.sum(y) / 3)                # divide by n - 1 = 3

Program output:

hello    1000
the       201
cruel     333
world     104
dtype: int64 <- t
163781.66666666666  <- var
163781.66666666666

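The std function returns the standard deviation, which is the square root of the variance:

import pandas as pd
import numpy as np
idx = "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index=idx)
print(t, "<- t")
print(t.var(), "\t<- var")
x = val
mu = t.mean()
y = [np.square(v - mu) for v in x]  # squared deviations from the mean
delta2 = np.sum(y) / 3              # sample variance: divide by n - 1 = 3
print(delta2)
print(np.sqrt(delta2))              # standard deviation computed by hand
print(t.std(), "\t<- std")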

Program output:

hello    1000
the       201
cruel     333
world     104
dtype: int64 <- t
163781.66666666666  <- var
163781.66666666666
404.6994769784941
404.6994769784941   <- std
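
The division by n - 1 in var and std is controlled by the ddof parameter, which defaults to 1 in pandas. As a sketch, the snippet below compares it with ddof=0 (division by n), which is the default of NumPy's np.var and np.std.

```python
import numpy as np
import pandas as pd

val = [1000, 201, 333, 104]
t = pd.Series(val, index="hello the cruel world".split())

sample_var = t.var()            # ddof=1: divide by n - 1
population_var = t.var(ddof=0)  # ddof=0: divide by n
population_std = t.std(ddof=0)

print(sample_var, population_var)
print(np.isclose(population_var, np.var(val)))  # NumPy defaults to ddof=0
print(np.isclose(population_std, np.std(val)))
```

This difference in defaults is a common reason why pandas and NumPy report different variances for the same data.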

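The mean absolute deviation is the average absolute distance between each value and the mean. Older pandas versions provide it as the mad function; it was removed in pandas 2.0, so the last line below computes it from the definition with pandas operations:

import pandas as pd
import numpy as np
idx = "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index=idx)
x = val
mu = t.mean()
y = [np.abs(v - mu) for v in x]  # absolute deviations from the mean
md = np.sum(y) / 4               # divide by n = 4, not n - 1
print(md)
# Equivalent of the removed Series.mad()
print((t - t.mean()).abs().mean(), "\t<- mad")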

Program output:

295.25
295.25  <- mad

The cov function computes the sample covariance of two Series, again dividing by n - 1:

import pandas as pd
import numpy as np
idx = "hello the cruel world".split()
val = [1000, 201, 333, 104]
x = pd.Series(val, index=idx)
van = [1100, 221, 303, 84]
y = pd.Series(van, index=idx)
xt = val
mux = x.mean()
yt = van
muy = y.mean()
xx = [v - mux for v in xt]  # deviations of x from its mean
yy = [v - muy for v in yt]  # deviations of y from its mean
print(xx)
print(yy)
print(np.dot(xx, yy) / 3)   # sum of products of deviations / (n - 1)
print(x.cov(y), "\t<- cov")

Program output:

[590.5, -208.5, -76.5, -305.5]
[673.0, -206.0, -124.0, -343.0]
184876.66666666666
184876.66666666666  <- cov
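
As a cross-check, NumPy's np.cov also divides by n - 1 by default and returns the full 2x2 covariance matrix. The sketch below reads the off-diagonal entry and also shows that the covariance of a Series with itself is its variance.

```python
import numpy as np
import pandas as pd

idx = "hello the cruel world".split()
val = [1000, 201, 333, 104]
van = [1100, 221, 303, 84]
x = pd.Series(val, index=idx)
y = pd.Series(van, index=idx)

cov_matrix = np.cov(val, van)  # [[var(x), cov(x, y)], [cov(x, y), var(y)]]
cov_xy = cov_matrix[0, 1]

print(cov_xy, x.cov(y))
print(np.isclose(x.cov(x), x.var()))
```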

The corr function computes the Pearson correlation coefficient: the sum of the products of the deviations divided by the product of the square roots of the summed squared deviations:

import pandas as pd
import numpy as np
idx = "hello the cruel world".split()
val = [1000, 201, 333, 104]
x = pd.Series(val, index=idx)
van = [1100, 221, 303, 84]
y = pd.Series(van, index=idx)
xt = val
mux = x.mean()
yt = van
muy = y.mean()
xx = [v - mux for v in xt]
yy = [v - muy for v in yt]
xx2 = [np.square(v - mux) for v in xt]
yy2 = [np.square(v - muy) for v in yt]
# Unnormalized covariance; the 1 / (n - 1) factors cancel in the ratio.
cov = np.dot(xx, yy)
muxy = np.sqrt(np.sum(xx2)) * np.sqrt(np.sum(yy2))
print(cov / muxy)
print(x.corr(y), "\t<- corr")

Program output:

0.998149178876946
0.9981491788769461  <- corr
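
The same coefficient can be built from functions already covered. This sketch divides cov by the product of the two std values and compares the result with corr and with NumPy's np.corrcoef.

```python
import numpy as np
import pandas as pd

idx = "hello the cruel world".split()
val = [1000, 201, 333, 104]
van = [1100, 221, 303, 84]
x = pd.Series(val, index=idx)
y = pd.Series(van, index=idx)

r_from_cov = x.cov(y) / (x.std() * y.std())  # covariance / (std_x * std_y)
r_numpy = np.corrcoef(val, van)[0, 1]        # off-diagonal entry of the 2x2 matrix

print(r_from_cov)
print(np.isclose(r_from_cov, x.corr(y)))
print(np.isclose(r_numpy, x.corr(y)))
```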

1). Use kurt to compute the kurtosis.

import pandas as pd
import numpy as np
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
x = pd.Series(val, index = idx)
n = 4
mu = x.mean()
delta = x.std()
xu = [np.power((v - mu), 4) for v in val]
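# The correction term is 3(n-1)^2 / ((n-2)(n-3)); the denominator needs its own parentheses.
print((1.0 * n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3)) * np.sum(xu) / delta ** 4 - 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3)), "<-python")
print(x.kurt(), "<- kurt")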

Program output:

2.93023293658 <-python
2.93023293658 <- kurt
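
The hand computation above appears to follow the bias-corrected sample excess kurtosis. Written out, with n the number of values, μ the mean, and s the sample standard deviation returned by std (these symbol names are chosen here for illustration):

```latex
G_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \cdot \frac{\sum_{i=1}^{n}(x_i-\mu)^4}{s^4}
      - \frac{3(n-1)^2}{(n-2)(n-3)}
```

With n = 4 the factor (n - 3) equals 1, so the result for this data does not depend on whether (n - 3) sits in the denominator of the last term; for larger n it does.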

2). The skew function computes the skewness.

import pandas as pd
import numpy as np
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
x = pd.Series(val, index = idx)
n = 4
mu = x.mean()
delta = x.std()
xu = [np.power((v - mu), 3) for v in val]
print((1.0 * n) / ((n - 1) * (n - 2)) * np.sum(xu) / delta ** 3, "<- python")
print(x.skew(), "<- skew")

Program output:

1.68850911034 <- python
1.68850911034 <- skew
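
The expression checked above corresponds to the bias-adjusted sample skewness. As a formula, with n the number of values, μ the mean, and s the sample standard deviation returned by std (symbol names chosen here for illustration):

```latex
G_1 = \frac{n}{(n-1)(n-2)} \cdot \frac{\sum_{i=1}^{n}(x_i-\mu)^3}{s^3}
```

A positive value, as for this data, indicates a longer tail on the right, caused here by the single large value 1000.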

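Cumulative statistical functions

cumsum, cumprod, and cummin return the running sum, running product, and running minimum of a Series, one result per element.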

import pandas as pd
idx =  "hello the cruel world".split()
val = [1000, 201, 333, 104]
t = pd.Series(val, index = idx)
print(t, "<- t")
print(t.cumsum(), "\t<- cumsum")
print(t.cumprod(), "\t<- cumprod")
print(t.cummin(), "\t<- cummin")

Program output:

hello    1000
the       201
cruel     333
world     104
dtype: int64 <- t
hello    1000
the      1201
cruel    1534
world    1638
dtype: int64    <- cumsum
hello          1000
the          201000
cruel      66933000
world    6961032000
dtype: int64    <- cumprod
hello    1000
the       201
cruel     201
world     104
dtype: int64    <- cummin
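
The counterpart cummax works the same way. As a small sketch on the same data, the snippet below also checks that the last element of each cumulative result equals the corresponding whole-Series statistic.

```python
import pandas as pd

t = pd.Series([1000, 201, 333, 104], index="hello the cruel world".split())

running_max = t.cummax()  # running maximum; the first value is already the largest
print(running_max.tolist())

# The last element of a cumulative result equals the overall statistic.
print(t.cumsum().iloc[-1] == t.sum())
print(t.cumprod().iloc[-1] == t.prod())
print(t.cummin().iloc[-1] == t.min())
```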