Chapter 10: Data Aggregation and Group Operations
- Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)
- Compute group summary statistics, such as count, mean, standard deviation, or a user-defined function
- Apply within-group transformations or other manipulations, such as normalization, linear regression, ranking, or subset selection
- Compute pivot tables and cross-tabulations
- Perform quantile analysis and other statistical group analyses
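The split-apply-combine idea behind these bullet points can be sketched with a tiny self-contained example (toy data, not from the book):

```python
import pandas as pd

# Split rows by 'key', apply a sum to each group's 'value' column,
# then combine the per-group results into a new Series.
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})
result = df.groupby('key')['value'].sum()
# result is indexed by the group keys: 'a' -> 4, 'b' -> 6
```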
10.1 GroupBy Mechanics
import numpy as np
import pandas as pd
df = pd.DataFrame({'data1':np.random.randn(5),'data2':np.random.randn(5),'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one']})
df
|   | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 0 | 0.405121 | 0.233504 | a | one |
| 1 | -0.419340 | 0.207110 | a | two |
| 2 | 0.871916 | -0.476933 | b | one |
| 3 | -0.311263 | 0.614344 | b | two |
| 4 | 0.498925 | 0.502884 | a | one |
grouped = df['data1'].groupby(df['key1'])
grouped.mean()
key1
a 0.161569
b 0.280326
Name: data1, dtype: float64
means = df['data1'].groupby([df['key1'],df['key2']]).mean()
means
key1  key2
a     one     0.452023
      two    -0.419340
b     one     0.871916
      two    -0.311263
Name: data1, dtype: float64
means.unstack()
| key1 \ key2 | one | two |
|---|---|---|
| a | 0.452023 | -0.419340 |
| b | 0.871916 | -0.311263 |
states = np.array(['ohio','california','california','ohio','ohio'])
years = np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()
california  2005   -0.419340
            2006    0.871916
ohio        2005    0.046929
            2006    0.498925
Name: data1, dtype: float64
# You can pass column names (whether those are strings, numbers, or other Python objects) as the group keys
df.groupby('key1').mean()
| key1 | data1 | data2 |
|---|---|---|
| a | 0.161569 | 0.314499 |
| b | 0.280326 | 0.068706 |
df.groupby(['key1','key2']).mean()
| key1 | key2 | data1 | data2 |
|---|---|---|---|
| a | one | 0.452023 | 0.368194 |
| a | two | -0.419340 | 0.207110 |
| b | one | 0.871916 | -0.476933 |
| b | two | -0.311263 | 0.614344 |
# A generally useful GroupBy method is size, which returns a Series containing group sizes
# Note that any missing values in a group key are excluded from the result
df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
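The note above about missing group keys can be checked with a small sketch; the `dropna` keyword on `groupby` (an assumption here: pandas 1.1 or newer) keeps the NaN group instead of dropping it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', np.nan, 'b'], 'val': [1, 2, 3, 4]})
default_sizes = df.groupby('key').size()             # the NaN-keyed row is dropped
kept_sizes = df.groupby('key', dropna=False).size()  # NaN kept as its own group
```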
10.1.1 Iterating over Groups
- The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.
for name, group in df.groupby('key1'):
    print(name)
    print(group)
a
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
1 -0.419340  0.207110    a  two
4  0.498925  0.502884    a  one
b
      data1     data2 key1 key2
2  0.871916 -0.476933    b  one
3 -0.311263  0.614344    b  two
df
|   | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 0 | 0.405121 | 0.233504 | a | one |
| 1 | -0.419340 | 0.207110 | a | two |
| 2 | 0.871916 | -0.476933 | b | one |
| 3 | -0.311263 | 0.614344 | b | two |
| 4 | 0.498925 | 0.502884 | a | one |
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)
('a', 'one')
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
4  0.498925  0.502884    a  one
('a', 'two')
     data1    data2 key1 key2
1 -0.41934  0.20711    a  two
('b', 'one')
      data1     data2 key1 key2
2  0.871916 -0.476933    b  one
('b', 'two')
      data1     data2 key1 key2
3 -0.311263  0.614344    b  two
pieces = dict(list(df.groupby('key1')))
pieces['b']
|   | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 2 | 0.871916 | -0.476933 | b | one |
| 3 | -0.311263 | 0.614344 | b | two |
# By default groupby groups on axis=0, but you can group on any of the other axes
df.dtypes
data1 float64
data2 float64
key1 object
key2 object
dtype: object
for dtype, group in grouped:
    print(dtype)
    print(group)
a
0 0.405121
1 -0.419340
4 0.498925
Name: data1, dtype: float64
b
2 0.871916
3 -0.311263
Name: data1, dtype: float64
10.1.2 Selecting a Column or Subset of Columns
- Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation.
- "Syntactic sugar is a term coined by the British computer scientist Peter J. Landin. It refers to syntax added to a language that does not change what the language can do, but makes it more convenient for programmers. Using syntactic sugar usually improves readability and thereby reduces the chance of errors in a program."
df.groupby('key1')['data1']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000259B9B6D760>
df.groupby('key1')[['data2']]
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B9B70370>
df['data1'].groupby(df['key1'])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000259B9B70610>
df[['data2']].groupby(df['key1'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B9B70FA0>
df.groupby(['key1','key2'])[['data2']].mean()
| key1 | key2 | data2 |
|---|---|---|
| a | one | 0.368194 |
| a | two | 0.207110 |
| b | one | -0.476933 |
| b | two | 0.614344 |
s_grouped = df.groupby(['key1','key2'])['data2']
s_grouped.mean()
key1  key2
a     one     0.368194
      two     0.207110
b     one    -0.476933
      two     0.614344
Name: data2, dtype: float64
10.1.3 Grouping with Dictionaries and Series
people = pd.DataFrame(np.random.randn(5,5),columns = ['a','b','c','d','e'],index = ['joe','steve','wes','jim','travis'])
people
|   | a | b | c | d | e |
|---|---|---|---|---|---|
| joe | 1.176984 | -0.779043 | 1.232603 | 0.811508 | -0.114978 |
| steve | -0.162964 | 0.936905 | -0.671911 | -0.660697 | 0.297713 |
| wes | -1.154320 | -0.351915 | -2.286200 | -0.200603 | 0.990375 |
| jim | -0.496740 | 0.474151 | 0.058887 | 2.775079 | -0.615440 |
| travis | -0.737061 | 1.470497 | -0.173087 | 0.766692 | -0.234068 |
people.iloc[2:3,[1,4]] = np.nan
people
|   | a | b | c | d | e |
|---|---|---|---|---|---|
| joe | 1.176984 | -0.779043 | 1.232603 | 0.811508 | -0.114978 |
| steve | -0.162964 | 0.936905 | -0.671911 | -0.660697 | 0.297713 |
| wes | -1.154320 | NaN | -2.286200 | -0.200603 | NaN |
| jim | -0.496740 | 0.474151 | 0.058887 | 2.775079 | -0.615440 |
| travis | -0.737061 | 1.470497 | -0.173087 | 0.766692 | -0.234068 |
mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
by_column = people.groupby(mapping,axis=1)
by_column.sum()
|   | blue | red |
|---|---|---|
| joe | 2.044111 | 0.282963 |
| steve | -1.332608 | 1.071654 |
| wes | -2.486803 | -1.154320 |
| jim | 2.833967 | -0.638030 |
| travis | 0.593605 | 0.499368 |
map_series = pd.Series(mapping)
map_series
a red
b red
c blue
d blue
e red
f orange
dtype: object
people.groupby(map_series,axis=1).count()
|   | blue | red |
|---|---|---|
| joe | 2 | 3 |
| steve | 2 | 3 |
| wes | 2 | 1 |
| jim | 2 | 3 |
| travis | 2 | 3 |
10.1.4 Grouping with Functions
people
|   | a | b | c | d | e |
|---|---|---|---|---|---|
| joe | 1.176984 | -0.779043 | 1.232603 | 0.811508 | -0.114978 |
| steve | -0.162964 | 0.936905 | -0.671911 | -0.660697 | 0.297713 |
| wes | -1.154320 | NaN | -2.286200 | -0.200603 | NaN |
| jim | -0.496740 | 0.474151 | 0.058887 | 2.775079 | -0.615440 |
| travis | -0.737061 | 1.470497 | -0.173087 | 0.766692 | -0.234068 |
people.groupby(len).sum()
|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 3 | -0.474076 | -0.304892 | -0.994710 | 3.385984 | -0.730418 |
| 5 | -0.162964 | 0.936905 | -0.671911 | -0.660697 | 0.297713 |
| 6 | -0.737061 | 1.470497 | -0.173087 | 0.766692 | -0.234068 |
key_list = ['one','one','one','two','two']
people.groupby([len,key_list]).min()
|   |   | a | b | c | d | e |
|---|---|---|---|---|---|---|
| 3 | one | -1.154320 | -0.779043 | -2.286200 | -0.200603 | -0.114978 |
| 3 | two | -0.496740 | 0.474151 | 0.058887 | 2.775079 | -0.615440 |
| 5 | one | -0.162964 | 0.936905 | -0.671911 | -0.660697 | 0.297713 |
| 6 | two | -0.737061 | 1.470497 | -0.173087 | 0.766692 | -0.234068 |
10.1.5 Grouping by Index Levels
hire_df = pd.DataFrame(np.random.randn(4,5),columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3]],names = ['cty','tenor']))
hire_df
| (cty, tenor) | (US, 1) | (US, 3) | (US, 5) | (JP, 1) | (JP, 3) |
|---|---|---|---|---|---|
| 0 | 0.840245 | 2.187988 | -0.393045 | -0.150897 | 1.573875 |
| 1 | 1.329124 | 0.070597 | -0.246468 | -0.147710 | -0.397490 |
| 2 | 0.362198 | 0.404290 | -0.233373 | -0.406261 | 0.812492 |
| 3 | 1.104010 | -0.485481 | 0.439878 | -1.948048 | 0.784266 |
hire_df.groupby(level='cty',axis=1).count()
| cty | JP | US |
|---|---|---|
| 0 | 2 | 3 |
| 1 | 2 | 3 |
| 2 | 2 | 3 |
| 3 | 2 | 3 |
10.2 Data Aggregation
| Function | Description |
|---|---|
| count | Number of non-NA values in the group |
| sum | Sum of non-NA values |
| mean | Mean of non-NA values |
| median | Arithmetic median of non-NA values |
| std, var | Unbiased (n - 1 denominator) standard deviation and variance |
| min, max | Minimum and maximum of non-NA values |
| prod | Product of non-NA values |
| first, last | First and last non-NA values |
df
|   | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 0 | 0.405121 | 0.233504 | a | one |
| 1 | -0.419340 | 0.207110 | a | two |
| 2 | 0.871916 | -0.476933 | b | one |
| 3 | -0.311263 | 0.614344 | b | two |
| 4 | 0.498925 | 0.502884 | a | one |
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)
key1
a 0.480164
b 0.753598
Name: data1, dtype: float64
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)
| key1 | data1 | data2 |
|---|---|---|
| a | 0.918265 | 0.295774 |
| b | 1.183179 | 1.091277 |
grouped.describe()
The columns of the describe result form a two-level index of (column, statistic); split into one table per column here for readability.

data1:

| key1 | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| a | 3.0 | 0.161569 | 0.505263 | -0.419340 | -0.007109 | 0.405121 | 0.452023 | 0.498925 |
| b | 2.0 | 0.280326 | 0.836634 | -0.311263 | -0.015468 | 0.280326 | 0.576121 | 0.871916 |

data2:

| key1 | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| a | 3.0 | 0.314499 | 0.163679 | 0.207110 | 0.220307 | 0.233504 | 0.368194 | 0.502884 |
| b | 2.0 | 0.068706 | 0.771649 | -0.476933 | -0.204113 | 0.068706 | 0.341525 | 0.614344 |
10.2.1 Column-Wise and Multiple Function Application
tips = pd.read_csv('examples/tips.csv')
tips.head()
|   | total_bill | tip | smoker | day | time | size |
|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 |
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips[:8]
|   | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 | 0.059447 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 | 0.160542 |
| 2 | 21.01 | 3.50 | No | Sun | Dinner | 3 | 0.166587 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 | 0.139780 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 | 0.146808 |
| 5 | 25.29 | 4.71 | No | Sun | Dinner | 4 | 0.186240 |
| 6 | 8.77 | 2.00 | No | Sun | Dinner | 2 | 0.228050 |
| 7 | 26.88 | 3.12 | No | Sun | Dinner | 4 | 0.116071 |
grouped = tips.groupby(['day','smoker'])
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B96733A0>
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
# If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions
grouped_pct.agg(['mean','std',peak_to_peak])
| day | smoker | mean | std | peak_to_peak |
|---|---|---|---|---|
| Fri | No | 0.151650 | 0.028123 | 0.067349 |
| Fri | Yes | 0.174783 | 0.051293 | 0.159925 |
| Sat | No | 0.158048 | 0.039767 | 0.235193 |
| Sat | Yes | 0.147906 | 0.061375 | 0.290095 |
| Sun | No | 0.160113 | 0.042347 | 0.193226 |
| Sun | Yes | 0.187250 | 0.154134 | 0.644685 |
| Thur | No | 0.160298 | 0.038774 | 0.193350 |
| Thur | Yes | 0.163863 | 0.039389 | 0.151240 |
grouped_pct.agg([('foo','mean'),('bar',np.std)])
| day | smoker | foo | bar |
|---|---|---|---|
| Fri | No | 0.151650 | 0.028123 |
| Fri | Yes | 0.174783 | 0.051293 |
| Sat | No | 0.158048 | 0.039767 |
| Sat | Yes | 0.147906 | 0.061375 |
| Sun | No | 0.160113 | 0.042347 |
| Sun | Yes | 0.187250 | 0.154134 |
| Thur | No | 0.160298 | 0.038774 |
| Thur | Yes | 0.163863 | 0.039389 |
functions = ['count','mean','max']
result = grouped[['tip_pct','total_bill']].agg(functions)
result
| day | smoker | (tip_pct, count) | (tip_pct, mean) | (tip_pct, max) | (total_bill, count) | (total_bill, mean) | (total_bill, max) |
|---|---|---|---|---|---|---|---|
| Fri | No | 4 | 0.151650 | 0.187735 | 4 | 18.420000 | 22.75 |
| Fri | Yes | 15 | 0.174783 | 0.263480 | 15 | 16.813333 | 40.17 |
| Sat | No | 45 | 0.158048 | 0.291990 | 45 | 19.661778 | 48.33 |
| Sat | Yes | 42 | 0.147906 | 0.325733 | 42 | 21.276667 | 50.81 |
| Sun | No | 57 | 0.160113 | 0.252672 | 57 | 20.506667 | 48.17 |
| Sun | Yes | 19 | 0.187250 | 0.710345 | 19 | 24.120000 | 45.35 |
| Thur | No | 45 | 0.160298 | 0.266312 | 45 | 17.113111 | 41.19 |
| Thur | Yes | 17 | 0.163863 | 0.241255 | 17 | 19.190588 | 43.11 |
|
result['tip_pct']
| day | smoker | count | mean | max |
|---|---|---|---|---|
| Fri | No | 4 | 0.151650 | 0.187735 |
| Fri | Yes | 15 | 0.174783 | 0.263480 |
| Sat | No | 45 | 0.158048 | 0.291990 |
| Sat | Yes | 42 | 0.147906 | 0.325733 |
| Sun | No | 57 | 0.160113 | 0.252672 |
| Sun | Yes | 19 | 0.187250 | 0.710345 |
| Thur | No | 45 | 0.160298 | 0.266312 |
| Thur | Yes | 17 | 0.163863 | 0.241255 |
ftuples = [('Durch','mean'),('abwe',np.var)]
grouped[['tip_pct','total_bill']].agg(ftuples)
| day | smoker | (tip_pct, Durch) | (tip_pct, abwe) | (total_bill, Durch) | (total_bill, abwe) |
|---|---|---|---|---|---|
| Fri | No | 0.151650 | 0.000791 | 18.420000 | 25.596333 |
| Fri | Yes | 0.174783 | 0.002631 | 16.813333 | 82.562438 |
| Sat | No | 0.158048 | 0.001581 | 19.661778 | 79.908965 |
| Sat | Yes | 0.147906 | 0.003767 | 21.276667 | 101.387535 |
| Sun | No | 0.160113 | 0.001793 | 20.506667 | 66.099980 |
| Sun | Yes | 0.187250 | 0.023757 | 24.120000 | 109.046044 |
| Thur | No | 0.160298 | 0.001503 | 17.113111 | 59.625081 |
| Thur | Yes | 0.163863 | 0.001551 | 19.190588 | 69.808518 |
# The DataFrame has hierarchical columns only if multiple functions are applied to at least one column
grouped.agg({'tip':np.max,'size':'sum'})
| day | smoker | tip | size |
|---|---|---|---|
| Fri | No | 3.50 | 9 |
| Fri | Yes | 4.73 | 31 |
| Sat | No | 9.00 | 115 |
| Sat | Yes | 10.00 | 104 |
| Sun | No | 6.00 | 167 |
| Sun | Yes | 6.50 | 49 |
| Thur | No | 6.70 | 112 |
| Thur | Yes | 5.00 | 40 |
10.2.2 Returning Aggregated Data Without Row Indexes
- You can obtain the same result by calling reset_index on the result; using as_index=False avoids some unnecessary computation.
tips.groupby(['day','smoker'],as_index=False).mean()
|   | day | smoker | total_bill | tip | size | tip_pct |
|---|---|---|---|---|---|---|
| 0 | Fri | No | 18.420000 | 2.812500 | 2.250000 | 0.151650 |
| 1 | Fri | Yes | 16.813333 | 2.714000 | 2.066667 | 0.174783 |
| 2 | Sat | No | 19.661778 | 3.102889 | 2.555556 | 0.158048 |
| 3 | Sat | Yes | 21.276667 | 2.875476 | 2.476190 | 0.147906 |
| 4 | Sun | No | 20.506667 | 3.167895 | 2.929825 | 0.160113 |
| 5 | Sun | Yes | 24.120000 | 3.516842 | 2.578947 | 0.187250 |
| 6 | Thur | No | 17.113111 | 2.673778 | 2.488889 | 0.160298 |
| 7 | Thur | Yes | 19.190588 | 3.030000 | 2.352941 | 0.163863 |
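The reset_index equivalence mentioned above can be verified on toy data (standing in for the tips dataset):

```python
import pandas as pd

df = pd.DataFrame({'day': ['Fri', 'Fri', 'Sat'], 'tip': [1.0, 3.0, 2.0]})
# as_index=False keeps the group keys as ordinary columns ...
a = df.groupby('day', as_index=False)['tip'].mean()
# ... which is the same frame you get by resetting the index afterwards.
b = df.groupby('day')['tip'].mean().reset_index()
```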
10.3 Apply: General split-apply-combine
import pandas as pd
tips = pd.read_csv('examples/tips.csv')
tips.head()
|   | total_bill | tip | smoker | day | time | size |
|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 |
tips['tip_pct'] = tips['tip']/(tips['total_bill']-tips['tip'])
# Suppose you wanted to select the five rows with the highest tip percentage (tip_pct) within each group.
# First, write a function that selects the rows with the largest values in a particular column:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]
top(tips,n=6)
|   | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|
| 109 | 14.31 | 4.00 | Yes | Sat | Dinner | 2 | 0.387973 |
| 183 | 23.17 | 6.50 | Yes | Sun | Dinner | 4 | 0.389922 |
| 232 | 11.61 | 3.39 | No | Sat | Dinner | 2 | 0.412409 |
| 67 | 3.07 | 1.00 | Yes | Sat | Dinner | 1 | 0.483092 |
| 178 | 9.60 | 4.00 | Yes | Sun | Dinner | 2 | 0.714286 |
| 172 | 7.25 | 5.15 | Yes | Sun | Dinner | 2 | 2.452381 |
tips.groupby('smoker').apply(top)
# The top function is called on each row group of the DataFrame, and then the
# results are glued together using pandas.concat, labeling the pieces with the group names.
# The result therefore has a hierarchical index whose inner level contains index values from the original DataFrame
| smoker |   | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|---|
| No | 88 | 24.71 | 5.85 | No | Thur | Lunch | 2 | 0.310180 |
| No | 185 | 20.69 | 5.00 | No | Sun | Dinner | 5 | 0.318674 |
| No | 51 | 10.29 | 2.60 | No | Sun | Dinner | 2 | 0.338101 |
| No | 149 | 7.51 | 2.00 | No | Thur | Lunch | 2 | 0.362976 |
| No | 232 | 11.61 | 3.39 | No | Sat | Dinner | 2 | 0.412409 |
| Yes | 109 | 14.31 | 4.00 | Yes | Sat | Dinner | 2 | 0.387973 |
| Yes | 183 | 23.17 | 6.50 | Yes | Sun | Dinner | 4 | 0.389922 |
| Yes | 67 | 3.07 | 1.00 | Yes | Sat | Dinner | 1 | 0.483092 |
| Yes | 178 | 9.60 | 4.00 | Yes | Sun | Dinner | 2 | 0.714286 |
| Yes | 172 | 7.25 | 5.15 | Yes | Sun | Dinner | 2 | 2.452381 |
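The apply-then-concat behavior described in the comments above can be checked on toy data; `top2` here is a hypothetical stand-in for the chapter's `top` function:

```python
import pandas as pd

df = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [3, 1, 2]})
top2 = lambda d: d.sort_values('v')[-2:]
# apply calls top2 once per group ...
via_apply = df.groupby('g').apply(top2)
# ... and glues the pieces together with concat, keyed by the group names.
via_concat = pd.concat({name: top2(grp) for name, grp in df.groupby('g')})
```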
# If you pass other arguments or keywords to apply besides the function, you can pass them after the function
tips.groupby(['smoker','day']).apply(top,n=1,column='total_bill')
| smoker | day |   | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|---|---|
| No | Fri | 94 | 22.75 | 3.25 | No | Fri | Dinner | 2 | 0.166667 |
| No | Sat | 212 | 48.33 | 9.00 | No | Sat | Dinner | 4 | 0.228833 |
| No | Sun | 156 | 48.17 | 5.00 | No | Sun | Dinner | 6 | 0.115821 |
| No | Thur | 142 | 41.19 | 5.00 | No | Thur | Lunch | 5 | 0.138160 |
| Yes | Fri | 95 | 40.17 | 4.73 | Yes | Fri | Dinner | 4 | 0.133465 |
| Yes | Sat | 170 | 50.81 | 10.00 | Yes | Sat | Dinner | 3 | 0.245038 |
| Yes | Sun | 182 | 45.35 | 3.50 | Yes | Sun | Dinner | 3 | 0.083632 |
| Yes | Thur | 197 | 43.11 | 5.00 | Yes | Thur | Lunch | 4 | 0.131199 |
result = tips.groupby('smoker')['tip_pct'].describe()
result
| smoker | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| No | 151.0 | 0.192237 | 0.057665 | 0.060217 | 0.158622 | 0.184308 | 0.227015 | 0.412409 |
| Yes | 93.0 | 0.218176 | 0.254295 | 0.036955 | 0.119534 | 0.181818 | 0.242326 | 2.452381 |
result.unstack('smoker')
       smoker
count  No        151.000000
       Yes        93.000000
mean   No          0.192237
       Yes         0.218176
std    No          0.057665
       Yes         0.254295
min    No          0.060217
       Yes         0.036955
25%    No          0.158622
       Yes         0.119534
50%    No          0.184308
       Yes         0.181818
75%    No          0.227015
       Yes         0.242326
max    No          0.412409
       Yes         2.452381
dtype: float64
# Inside GroupBy, when you invoke a method like describe, it is actually shorthand for the following
f = lambda x :x.describe()
grouped.apply(f)
| day | smoker | stat | total_bill | tip | size | tip_pct |
|---|---|---|---|---|---|---|
| Fri | No | count | 4.000000 | 4.000000 | 4.00 | 4.000000 |
| Fri | No | mean | 18.420000 | 2.812500 | 2.25 | 0.151650 |
| Fri | No | std | 5.059282 | 0.898494 | 0.50 | 0.028123 |
| Fri | No | min | 12.460000 | 1.500000 | 2.00 | 0.120385 |
| Fri | No | 25% | 15.100000 | 2.625000 | 2.00 | 0.137239 |
| ... | ... | ... | ... | ... | ... | ... |
| Thur | Yes | min | 10.340000 | 2.000000 | 2.00 | 0.090014 |
| Thur | Yes | 25% | 13.510000 | 2.000000 | 2.00 | 0.148038 |
| Thur | Yes | 50% | 16.470000 | 2.560000 | 2.00 | 0.153846 |
| Thur | Yes | 75% | 19.810000 | 4.000000 | 2.00 | 0.194837 |
| Thur | Yes | max | 43.110000 | 5.000000 | 4.00 | 0.241255 |

64 rows × 4 columns
10.3.1 Suppressing the Group Keys
- You can see that the resulting object has a hierarchical index formed from the group keys along with the index of each original piece.
- You can disable this by passing group_keys=False to groupby.
tips.groupby('smoker').apply(top)
| smoker |   | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|---|
| No | 88 | 24.71 | 5.85 | No | Thur | Lunch | 2 | 0.310180 |
| No | 185 | 20.69 | 5.00 | No | Sun | Dinner | 5 | 0.318674 |
| No | 51 | 10.29 | 2.60 | No | Sun | Dinner | 2 | 0.338101 |
| No | 149 | 7.51 | 2.00 | No | Thur | Lunch | 2 | 0.362976 |
| No | 232 | 11.61 | 3.39 | No | Sat | Dinner | 2 | 0.412409 |
| Yes | 109 | 14.31 | 4.00 | Yes | Sat | Dinner | 2 | 0.387973 |
| Yes | 183 | 23.17 | 6.50 | Yes | Sun | Dinner | 4 | 0.389922 |
| Yes | 67 | 3.07 | 1.00 | Yes | Sat | Dinner | 1 | 0.483092 |
| Yes | 178 | 9.60 | 4.00 | Yes | Sun | Dinner | 2 | 0.714286 |
| Yes | 172 | 7.25 | 5.15 | Yes | Sun | Dinner | 2 | 2.452381 |
tips.groupby('smoker',group_keys=False).apply(top)
|   | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|
| 88 | 24.71 | 5.85 | No | Thur | Lunch | 2 | 0.310180 |
| 185 | 20.69 | 5.00 | No | Sun | Dinner | 5 | 0.318674 |
| 51 | 10.29 | 2.60 | No | Sun | Dinner | 2 | 0.338101 |
| 149 | 7.51 | 2.00 | No | Thur | Lunch | 2 | 0.362976 |
| 232 | 11.61 | 3.39 | No | Sat | Dinner | 2 | 0.412409 |
| 109 | 14.31 | 4.00 | Yes | Sat | Dinner | 2 | 0.387973 |
| 183 | 23.17 | 6.50 | Yes | Sun | Dinner | 4 | 0.389922 |
| 67 | 3.07 | 1.00 | Yes | Sat | Dinner | 1 | 0.483092 |
| 178 | 9.60 | 4.00 | Yes | Sun | Dinner | 2 | 0.714286 |
| 172 | 7.25 | 5.15 | Yes | Sun | Dinner | 2 | 2.452381 |
10.3.2 Quantile and Bucket Analysis
- pandas has some tools, in particular cut and qcut, for slicing data up into buckets with bins of your choosing or by sample quantiles.
- Combining these functions with groupby makes it convenient to perform bucket or quantile analysis on a dataset.
frame = pd.DataFrame({'data1':np.random.randn(1000),'data2':np.random.randn(1000)})
frame.head()
|   | data1 | data2 |
|---|---|---|
| 0 | 1.038225 | -0.225594 |
| 1 | -1.142048 | 0.762311 |
| 2 | -0.383751 | 0.319326 |
| 3 | 0.122939 | -1.205274 |
| 4 | -1.557777 | -0.495632 |
quertitles = pd.cut(frame.data1,4)
quertitles[:10]
0 (-0.175, 1.544]
1 (-1.894, -0.175]
2 (-1.894, -0.175]
3 (-0.175, 1.544]
4 (-1.894, -0.175]
5 (-1.894, -0.175]
6 (-1.894, -0.175]
7 (-3.62, -1.894]
8 (-1.894, -0.175]
9 (-0.175, 1.544]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.62, -1.894] < (-1.894, -0.175] < (-0.175, 1.544] < (1.544, 3.263]]
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}
grouped = frame.data2.groupby(quertitles)
grouped.apply(get_stats).unstack()
| data1 | min | max | count | mean |
|---|---|---|---|---|
| (-3.62, -1.894] | -1.687871 | 1.887558 | 30.0 | -0.210904 |
| (-1.894, -0.175] | -2.898182 | 3.015219 | 412.0 | 0.043604 |
| (-0.175, 1.544] | -2.829533 | 2.779859 | 495.0 | -0.040127 |
| (1.544, 3.263] | -1.966623 | 2.059859 | 63.0 | 0.051666 |
# Passing labels=False returns bucket numbers instead of interval labels
grouping = pd.cut(frame.data1,10,labels=False)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
| data1 | min | max | count | mean |
|---|---|---|---|---|
| 0 | -1.351679 | 0.479551 | 3.0 | -0.162551 |
| 1 | -1.396584 | 1.887558 | 11.0 | 0.134897 |
| 2 | -2.122874 | 1.734941 | 47.0 | -0.178326 |
| 3 | -2.020175 | 3.015219 | 136.0 | -0.004133 |
| 4 | -2.898182 | 2.468129 | 245.0 | 0.079938 |
| 5 | -2.758906 | 2.779859 | 253.0 | -0.035156 |
| 6 | -2.829533 | 2.439201 | 187.0 | -0.045802 |
| 7 | -2.250796 | 2.128096 | 86.0 | -0.033331 |
| 8 | -1.605125 | 2.059859 | 29.0 | 0.083334 |
| 9 | -0.029929 | 0.692138 | 3.0 | 0.433789 |
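qcut, mentioned at the start of this section but not shown above, buckets by sample quantiles rather than cut's equal-width bins, so each bucket gets (nearly) the same number of observations; a minimal sketch (the seed is an assumption, added for reproducibility):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, not in the original notes
data = pd.Series(rng.standard_normal(1000))
# qcut picks bin edges at the sample quantiles, so the four buckets
# each end up with 1000 / 4 = 250 observations.
quartile_buckets = pd.qcut(data, 4, labels=False)
counts = quartile_buckets.value_counts().sort_index()
```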
10.3.3 Example: Filling Missing Values with Group-Specific Values
s = pd.Series(np.random.randn(6))
s
0 -1.825708
1 -0.406503
2 0.319659
3 0.077313
4 -0.304391
5 0.369055
dtype: float64
s[::2] = np.nan
s
0 NaN
1 -0.406503
2 NaN
3 0.077313
4 NaN
5 0.369055
dtype: float64
s.fillna(s.mean())
0 0.013288
1 -0.406503
2 0.013288
3 0.077313
4 0.013288
5 0.369055
dtype: float64
# Suppose you need the fill value to vary by group.
# One way to do this is to group the data and use apply with a function that calls fillna on each data chunk.
states = ['ohio','new york','vermont','florida','oregon','nevada','california','idaho']
states
['ohio','new york','vermont','florida','oregon','nevada','california','idaho']
group_key = ['east']*4+['west']*4
group_key
['east', 'east', 'east', 'east', 'west', 'west', 'west', 'west']
data = pd.Series(np.random.randn(8),index=states)
data
ohio -1.196387
new york 1.116637
vermont -2.096959
florida -0.482581
oregon -0.426209
nevada 0.494309
california -1.444239
idaho 2.388043
dtype: float64
# Set some values in the data to be missing
data[['vermont','nevada','idaho']] = np.nan
data
ohio -1.196387
new york 1.116637
vermont NaN
florida -0.482581
oregon -0.426209
nevada NaN
california -1.444239
idaho NaN
dtype: float64
data.groupby(group_key).mean()
east -0.187444
west -0.935224
dtype: float64
# Fill the NA values using the group means
fill_mean = lambda g :g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
ohio -1.196387
new york 1.116637
vermont -0.187444
florida -0.482581
oregon -0.426209
nevada -0.935224
california -1.444239
idaho -0.935224
dtype: float64
fill_values = {'east':0.5,'west':-1}
fill_func = lambda g :g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
ohio -1.196387
new york 1.116637
vermont 0.500000
florida -0.482581
oregon -0.426209
nevada -1.000000
california -1.444239
idaho -1.000000
dtype: float64
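The apply/fillna pattern above can also be written with transform, which broadcasts each group's mean back onto the original index; a sketch with a reduced version of the data:

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.0, np.nan],
                 index=['ohio', 'vermont', 'oregon', 'nevada'])
group_key = ['east', 'east', 'west', 'west']
# transform('mean') returns a Series aligned with data holding each
# position's group mean, which fillna then uses for the NaN slots.
filled = data.fillna(data.groupby(group_key).transform('mean'))
```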
10.3.4 Example: Random Sampling and Permutation
# Hearts, Spades, Clubs, Diamonds
suits = ['H','S','C','D']
card_val = (list(range(1,11))+[10]*3)*4
card_val[-5:]
[9, 10, 10, 10, 10]
base_names = ['A']+list(range(2,11))+['J','K','Q']
base_names
['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

deck = pd.Series(card_val, index=cards)
deck.head()
AH 1
2H 2
3H 3
4H 4
5H 5
dtype: int64
deck[:13]
AH 1
2H 2
3H 3
4H 4
5H 5
6H 6
7H 7
8H 8
9H 9
10H 10
JH 10
KH 10
QH 10
dtype: int64
def draw(deck, n=5):
    return deck.sample(n)
draw(deck)
9S 9
8S 8
3S 3
2D 2
9H 9
dtype: int64
# Suppose you wanted two random cards from each suit
get_suit = lambda card:card[-1]
deck.groupby(get_suit).apply(draw,n=2)
C  JC     10
   3C      3
   9C      9
   QC     10
   8C      8
D  6D      6
   QD     10
   7D      7
   9D      9
   4D      4
H  4H      4
   6H      6
   2H      2
   10H    10
   9H      9
S  10S    10
   QS     10
   8S      8
   KS     10
   5S      5
dtype: int64
deck.groupby(get_suit,group_keys=False).apply(draw,n=2)
QC 10
AC 1
JC 10
KC 10
4C 4
JD 10
AD 1
6D 6
5D 5
7D 7
KH 10
2H 2
6H 6
5H 5
8H 8
2S 2
5S 5
QS 10
4S 4
JS 10
dtype: int64
10.3.5 Example: Group Weighted Average and Correlation
df = pd.DataFrame({'category':['a','a','a','a','b','b','b','b'],'data':np.random.randn(8),'weights':np.random.rand(8)})
df
|   | category | data | weights |
|---|---|---|---|
| 0 | a | -0.538213 | 0.890005 |
| 1 | a | 1.036831 | 0.119786 |
| 2 | a | -1.706140 | 0.086685 |
| 3 | a | -1.687288 | 0.418367 |
| 4 | b | 0.574443 | 0.372295 |
| 5 | b | -1.791063 | 0.994798 |
| 6 | b | 1.419661 | 0.715527 |
| 7 | b | 0.674622 | 0.338912 |
# The weighted average by category is then:
grouped = df.groupby('category')
get_wavg = lambda g :np.average(g['data'],weights=g['weights'])
grouped.apply(get_wavg)
category
a -0.797849
b -0.133568
dtype: float64
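The lambda above leans on np.average's weights argument; the same group-wise weighted mean can be spelled out as sum(data × weights) / sum(weights), shown here on fixed toy values rather than the random data above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['a', 'a', 'b'],
                   'data': [1.0, 3.0, 2.0],
                   'weights': [1.0, 3.0, 2.0]})
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
by_average = df.groupby('category').apply(get_wavg)
# Manual equivalent: weighted sum over weight sum, per group.
manual = ((df['data'] * df['weights']).groupby(df['category']).sum()
          / df['weights'].groupby(df['category']).sum())
```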
close_px = pd.read_csv(r'D:\PythonFlie\python\利用python进行数据分析(书籍笔记)\examples\stock_px_2.csv',parse_dates=True,index_col=0)
close_px.head()
|   | AAPL | MSFT | XOM | SPX |
|---|---|---|---|---|
| 2003-01-02 | 7.40 | 21.11 | 29.22 | 909.03 |
| 2003-01-03 | 7.45 | 21.14 | 29.24 | 908.59 |
| 2003-01-06 | 7.45 | 21.52 | 29.96 | 929.01 |
| 2003-01-07 | 7.43 | 21.93 | 28.95 | 922.93 |
| 2003-01-08 | 7.28 | 21.31 | 28.83 | 909.93 |
close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   AAPL    2214 non-null   float64
 1   MSFT    2214 non-null   float64
 2   XOM     2214 non-null   float64
 3   SPX     2214 non-null   float64
dtypes: float64(4)
memory usage: 86.5 KB
close_px[-4:]
|   | AAPL | MSFT | XOM | SPX |
|---|---|---|---|---|
| 2011-10-11 | 400.29 | 27.00 | 76.27 | 1195.54 |
| 2011-10-12 | 402.19 | 26.96 | 77.16 | 1207.25 |
| 2011-10-13 | 408.43 | 27.18 | 76.37 | 1203.66 |
| 2011-10-14 | 422.00 | 27.27 | 78.11 | 1224.58 |
spx_corr = lambda x :x.corrwith(x['SPX'])
rets = close_px.pct_change().dropna()
get_year = lambda x :x.year
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)
|   | AAPL | MSFT | XOM | SPX |
|---|---|---|---|---|
| 2003 | 0.541124 | 0.745174 | 0.661265 | 1.0 |
| 2004 | 0.374283 | 0.588531 | 0.557742 | 1.0 |
| 2005 | 0.467540 | 0.562374 | 0.631010 | 1.0 |
| 2006 | 0.428267 | 0.406126 | 0.518514 | 1.0 |
| 2007 | 0.508118 | 0.658770 | 0.786264 | 1.0 |
| 2008 | 0.681434 | 0.804626 | 0.828303 | 1.0 |
| 2009 | 0.707103 | 0.654902 | 0.797921 | 1.0 |
| 2010 | 0.710105 | 0.730118 | 0.839057 | 1.0 |
| 2011 | 0.691931 | 0.800996 | 0.859975 | 1.0 |
by_year.apply(lambda g:g['AAPL'].corr(g['MSFT']))
2003 0.480868
2004 0.259024
2005 0.300093
2006 0.161735
2007 0.417738
2008 0.611901
2009 0.432738
2010 0.571946
2011 0.581987
dtype: float64
10.3.6 Example: Group-Wise Linear Regression
import statsmodels.api as sm
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params
by_year.apply(regress,'AAPL',['SPX'])
|   | SPX | intercept |
|---|---|---|
| 2003 | 1.195406 | 0.000710 |
| 2004 | 1.363463 | 0.004201 |
| 2005 | 1.766415 | 0.003246 |
| 2006 | 1.645496 | 0.000080 |
| 2007 | 1.198761 | 0.003438 |
| 2008 | 0.968016 | -0.001110 |
| 2009 | 0.879103 | 0.002954 |
| 2010 | 1.052608 | 0.001261 |
| 2011 | 0.806605 | 0.001514 |
10.4 Pivot Tables and Cross-Tabulation
- Pivot tables in pandas are made possible through the groupby facility described in this chapter, combined with reshape operations utilizing hierarchical indexing.
- DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function.
- In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.
- pivot_table options:

| Option | Description |
|---|---|
| values | Column name or names to aggregate; by default, aggregates all numeric columns |
| index | Column names or other group keys to group on the rows of the resulting pivot table |
# Suppose you wanted to compute a table of group means (the default pivot_table aggregation type) arranged by day and smoker on the rows
tips.pivot_table(index=['day','smoker'])
| day | smoker | size | tip | tip_pct | total_bill |
|---|---|---|---|---|---|
| Fri | No | 2.250000 | 2.812500 | 0.179740 | 18.420000 |
| Fri | Yes | 2.066667 | 2.714000 | 0.216293 | 16.813333 |
| Sat | No | 2.555556 | 3.102889 | 0.190412 | 19.661778 |
| Sat | Yes | 2.476190 | 2.875476 | 0.179833 | 21.276667 |
| Sun | No | 2.929825 | 3.167895 | 0.193617 | 20.506667 |
| Sun | Yes | 2.578947 | 3.516842 | 0.322021 | 24.120000 |
| Thur | No | 2.488889 | 2.673778 | 0.193424 | 17.113111 |
| Thur | Yes | 2.352941 | 3.030000 | 0.198508 | 19.190588 |
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker')
| time | day | (size, No) | (size, Yes) | (tip_pct, No) | (tip_pct, Yes) |
|---|---|---|---|---|---|
| Dinner | Fri | 2.000000 | 2.222222 | 0.162612 | 0.202545 |
| Dinner | Sat | 2.555556 | 2.476190 | 0.190412 | 0.179833 |
| Dinner | Sun | 2.929825 | 2.578947 | 0.193617 | 0.322021 |
| Dinner | Thur | 2.000000 | NaN | 0.190114 | NaN |
| Lunch | Fri | 3.000000 | 1.833333 | 0.231125 | 0.236915 |
| Lunch | Thur | 2.500000 | 2.352941 | 0.193499 | 0.198508 |
# Augment this table to include partial totals by passing margins=True.
# The All values are means without taking into account smoker versus non-smoker (the All columns) or any of the two levels of grouping on the rows (the All row).
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker',margins=True)
| time | day | (size, No) | (size, Yes) | (size, All) | (tip_pct, No) | (tip_pct, Yes) | (tip_pct, All) |
|---|---|---|---|---|---|---|---|
| Dinner | Fri | 2.000000 | 2.222222 | 2.166667 | 0.162612 | 0.202545 | 0.192562 |
| Dinner | Sat | 2.555556 | 2.476190 | 2.517241 | 0.190412 | 0.179833 | 0.185305 |
| Dinner | Sun | 2.929825 | 2.578947 | 2.842105 | 0.193617 | 0.322021 | 0.225718 |
| Dinner | Thur | 2.000000 | NaN | 2.000000 | 0.190114 | NaN | 0.190114 |
| Lunch | Fri | 3.000000 | 1.833333 | 2.000000 | 0.231125 | 0.236915 | 0.236088 |
| Lunch | Thur | 2.500000 | 2.352941 | 2.459016 | 0.193499 | 0.198508 | 0.194895 |
| All |   | 2.668874 | 2.408602 | 2.569672 | 0.192237 | 0.218176 | 0.202123 |
tips.pivot_table('tip_pct',index=['time','smoker'],aggfunc=len,margins=True)
| time | smoker | tip_pct |
|---|---|---|
| Dinner | No | 106.0 |
| Dinner | Yes | 70.0 |
| Lunch | No | 45.0 |
| Lunch | Yes | 23.0 |
| All |   | 244.0 |
# If some combinations are empty (or otherwise NA), you may wish to pass a fill_value
tips.pivot_table('tip_pct',index=['time','size','smoker'],aggfunc='mean',fill_value=0)
| time | size | smoker | tip_pct |
|---|---|---|---|
| Dinner | 1 | No | 0.160000 |
| Dinner | 1 | Yes | 0.483092 |
| Dinner | 2 | No | 0.200221 |
| Dinner | 2 | Yes | 0.239672 |
| Dinner | 3 | No | 0.183436 |
| Dinner | 3 | Yes | 0.179271 |
| Dinner | 4 | No | 0.176033 |
| Dinner | 4 | Yes | 0.169406 |
| Dinner | 5 | No | 0.263344 |
| Dinner | 5 | Yes | 0.094779 |
| Dinner | 6 | No | 0.115821 |
| Lunch | 1 | No | 0.222087 |
| Lunch | 1 | Yes | 0.288288 |
| Lunch | 2 | No | 0.201503 |
| Lunch | 2 | Yes | 0.201043 |
| Lunch | 3 | No | 0.138483 |
| Lunch | 3 | Yes | 0.257941 |
| Lunch | 4 | No | 0.161573 |
| Lunch | 4 | Yes | 0.186592 |
| Lunch | 5 | No | 0.138160 |
| Lunch | 6 | No | 0.211191 |
10.4.1 Cross-Tabulations: crosstab
- A cross-tabulation (crosstab for short) is a special case of a pivot table that computes group frequencies.
10.5 Chapter Summary
data = pd.DataFrame({'Sample':range(1,11),'nationality':['USA','Japan','USA','Japan','Japan','Japan','USA','USA','Japan','USA'],'handedness':['right','left','right','right','left','right','right','left','right','right']})
data
|   | Sample | nationality | handedness |
|---|---|---|---|
| 0 | 1 | USA | right |
| 1 | 2 | Japan | left |
| 2 | 3 | USA | right |
| 3 | 4 | Japan | right |
| 4 | 5 | Japan | left |
| 5 | 6 | Japan | right |
| 6 | 7 | USA | right |
| 7 | 8 | USA | left |
| 8 | 9 | Japan | right |
| 9 | 10 | USA | right |
data.pivot_table(index = 'nationality',columns='handedness',margins=True,aggfunc='count')
| nationality | left | right | All |
|---|---|---|---|
| Japan | 2 | 3 | 5 |
| USA | 1 | 4 | 5 |
| All | 3 | 7 | 10 |
pd.crosstab(data.nationality,data.handedness,margins=True)
| nationality \ handedness | left | right | All |
|---|---|---|---|
| Japan | 2 | 3 | 5 |
| USA | 1 | 4 | 5 |
| All | 3 | 7 | 10 |
# The first two arguments to crosstab can each be an array, a Series, or a list of arrays
pd.crosstab([tips.time,tips.day],tips.smoker,margins=True)
| time | day | No | Yes | All |
|---|---|---|---|---|
| Dinner | Fri | 3 | 9 | 12 |
| Dinner | Sat | 45 | 42 | 87 |
| Dinner | Sun | 57 | 19 | 76 |
| Dinner | Thur | 1 | 0 | 1 |
| Lunch | Fri | 1 | 6 | 7 |
| Lunch | Thur | 44 | 17 | 61 |
| All |   | 151 | 93 | 244 |
Chapter 11: Time Series
11.1 Date and Time Data Types and Tools
- The Python standard library includes data types for date and time data, as well as calendar-related functionality.
- The datetime, time, and calendar modules are the main places to start when handling time data.
- The datetime.datetime type, or simply datetime, is the most widely used.
- Types in the datetime module:
| Type | Description |
|---|---|
| date | Stores calendar date (year, month, day) using the Gregorian calendar |
| time | Stores time of day as hours, minutes, seconds, and microseconds |
| datetime | Stores both date and time |
| timedelta | Represents the difference between two datetime values (as days, seconds, and microseconds) |
| tzinfo | Base type for storing time zone information |
from datetime import datetime
# datetime stores both the date and the time down to the microsecond
now = datetime.now()
now
datetime.datetime(2021, 12, 28, 14, 25, 24, 8294)
now.year,now.month,now.day
(2021, 12, 28)
# timedelta represents the temporal difference between two datetime objects
delta = datetime(2011,1,7) - datetime(2008,6,24,8,15)
delta
datetime.timedelta(days=926, seconds=56700)
delta.days,delta.seconds
(926, 56700)
# You can add (or subtract) a timedelta, or a multiple of one, to a datetime object to yield a new shifted datetime object
from datetime import timedelta
start = datetime(2011,1,7)
start+timedelta(12)
datetime.datetime(2011, 1, 19, 0, 0)
start-2*timedelta(12)
datetime.datetime(2010, 12, 14, 0, 0)
11.1.1 Converting Between String and datetime
- You can format datetime objects and pandas Timestamp objects as strings using str, or by passing a format specification to the strftime method.
- datetime format specification (ISO C89 compatible):
| Type | Description |
|---|---|
| %Y | Four-digit year |
| %y | Two-digit year |
| %m | Two-digit month [01, 12] |
| %d | Two-digit day [01, 31] |
| %H | Hour (24-hour clock) [00, 23] |
| %I | Hour (12-hour clock) [01, 12] |
| %M | Two-digit minute [00, 59] |
| %S | Second [00, 61]; seconds 60 and 61 account for leap seconds |
| %w | Weekday as integer [0 (Sunday), 6] |
| %U | Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of the year are "week 0" |
| %W | Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of the year are "week 0" |
| %z | UTC time zone offset as +HHMM or -HHMM; empty if the time is time zone naive |
| %F | Shortcut for %Y-%m-%d (e.g., 2012-04-18) |
| %D | Shortcut for %m/%d/%y (e.g., 04/18/12) |
stamp = datetime(2011,1,3)
stamp
datetime.datetime(2011, 1, 3, 0, 0)
str(stamp)
'2011-01-03 00:00:00'
stamp.strftime('%Y-%m-%d')
'2011-01-03'
# You can use datetime.strptime with these format codes to convert strings to dates
value = '2011-01-03'
datetime.strptime(value,'%Y-%m-%d')
datetime.datetime(2011, 1, 3, 0, 0)
datestrs = ['7/6/2011','8/6/2011']
[datetime.strptime(x,'%m/%d/%Y') for x in datestrs]
[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
from dateutil.parser import parse
parse('2011-01-03')
datetime.datetime(2011, 1, 3, 0, 0)
# dateutil is capable of parsing most human-intelligible date representations
parse('jan 31,1997 10:45 am')
datetime.datetime(1997, 1, 31, 10, 45)
# In international locales, the day frequently appears before the month, so you can pass dayfirst=True to indicate this
parse('6/12/2011',dayfirst=True)
datetime.datetime(2011, 12, 6, 0, 0)
#to_datetime方法可以转换很多不同的日期表示格式
datestrs = ['2011-07-06 12:00:00','2011-08-06 00:00:00']
pd.to_datetime(datestrs)
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
#to_datetime方法还可以处理那些被认为是缺失值的值(None、空字符串等)
idx = pd.to_datetime(datestrs+[None])
idx
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
#NaT(Not a Time)是pandas中时间戳数据的null值。
idx[2]
NaT
pd.isnull(idx)
array([False, False, True])
类型
|
描述
|
%a
|
缩写的工作日名称
|
%A
|
全写的工作日名称
|
%b
|
简写的月份名称
|
%B
|
全写的月份名称
|
%c
|
完整的日期和时间(例如,'Tue 01 May 2012 04:20:57 PM')
|
%p
|
AM或PM的地区等效
|
%x
|
适合地区的格式化日期(例如,在美国,May 1, 2012会生成'05/01/2012')
|
%X
|
适合地区的时间(例如,’ 04:24:12 PM’ )
|
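上表中与名称相关的格式代码(如%a、%A、%B)依赖于系统的区域设置。下面是一个小演示(假设运行环境为默认的英文/C区域设置,因此输出为英文名称):

```python
from datetime import datetime

# 假设默认的英文(C)区域设置,名称类格式代码输出英文
stamp = datetime(2012, 5, 1)
print(stamp.strftime('%a'))          # 缩写的工作日名称: 'Tue'
print(stamp.strftime('%A'))          # 全写的工作日名称: 'Tuesday'
print(stamp.strftime('%B %d, %Y'))   # 全写的月份名称: 'May 01, 2012'
```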
11.2 时间序列基础
- pandas中的基础时间序列种类是由时间戳索引的Series,在pandas外部则通常表示为Python字符串或datetime对象
from datetime import datetime
dates = [datetime(2011,1,2),datetime(2011,1,5),datetime(2011,1,7),datetime(2011,1,8),datetime(2011,1,10),datetime(2011,1,12)]
dates
[datetime.datetime(2011, 1, 2, 0, 0),datetime.datetime(2011, 1, 5, 0, 0),datetime.datetime(2011, 1, 7, 0, 0),datetime.datetime(2011, 1, 8, 0, 0),datetime.datetime(2011, 1, 10, 0, 0),datetime.datetime(2011, 1, 12, 0, 0)]
ts = pd.Series(np.random.randn(6),index=dates)
ts
2011-01-02 -0.388904
2011-01-05 1.029932
2011-01-07 -0.186521
2011-01-08 -0.310343
2011-01-10 0.710121
2011-01-12 -0.053242
dtype: float64
#这些datetime对象可以被放入DatetimeIndex中
ts.index
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08','2011-01-10', '2011-01-12'],dtype='datetime64[ns]', freq=None)
ts[::2]+ts
2011-01-02 -0.777807
2011-01-05 NaN
2011-01-07 -0.373041
2011-01-08 NaN
2011-01-10 1.420241
2011-01-12 NaN
dtype: float64
#pandas使用NumPy的datetime64数据类型在纳秒级的分辨率下存储时间戳
ts.index.dtype
dtype('<M8[ns]')
#DatetimeIndex中的标量值是pandas的Timestamp对象
stamp = ts.index[0]
stamp
Timestamp('2011-01-02 00:00:00')
#所有可以使用datetime对象的地方都可以用Timestamp。
#此外,Timestamp还可以存储频率信息(如果有的话),并且知道如何进行时区转换和其他类型的操作
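Timestamp与datetime可以互换使用,并可以用to_pydatetime转回原生datetime(to_pydatetime是pandas的标准方法;以下是一个最小草图):

```python
from datetime import datetime
import pandas as pd

# Timestamp可以出现在任何接受datetime的地方
stamp = pd.Timestamp('2011-01-02')
print(stamp.to_pydatetime())         # 转回原生datetime对象
print(stamp - datetime(2011, 1, 1))  # 与datetime直接做算术,得到Timedelta('1 days')
```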
11.2.1 索引、选择、子集
- 当你基于标签进行索引和选择时,时间序列的行为和其他的pandas.Series类似
stamp = ts.index[2]
ts[stamp]
-0.18652074734702304
#为了方便,你还可以传递一个能解释为日期的字符串
ts['1/10/2011']
0.7101205694131453
ts['20110110']
0.7101205694131453
#对一个长的时间序列,可以传递一个年份或一个年份和月份来轻松地选择数据的切片
longer_ts = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))
longer_ts
2000-01-01 1.399510
2000-01-02 1.167288
2000-01-03 -1.557553
2000-01-04 0.905812
2000-01-05 -1.204531...
2002-09-22 -0.055722
2002-09-23 0.913953
2002-09-24 3.299072
2002-09-25 -1.384872
2002-09-26 0.034663
Freq: D, Length: 1000, dtype: float64
longer_ts['2001']
2001-01-01 -0.098773
2001-01-02 0.299508
2001-01-03 -0.111169
2001-01-04 0.545366
2001-01-05 0.559263...
2001-12-27 0.383230
2001-12-28 0.274157
2001-12-29 -1.571527
2001-12-30 -0.066029
2001-12-31 0.860201
Freq: D, Length: 365, dtype: float64
longer_ts['2001-05']
2001-05-01 0.952182
2001-05-02 -1.637248
2001-05-03 -1.673597
2001-05-04 -1.257016
2001-05-05 -0.992064
2001-05-06 0.874060
2001-05-07 -0.491860
2001-05-08 0.479921
2001-05-09 0.286766
2001-05-10 2.026310
2001-05-11 -0.547724
2001-05-12 1.313638
2001-05-13 -0.624659
2001-05-14 0.003374
2001-05-15 -0.097401
2001-05-16 -0.733808
2001-05-17 -0.557165
2001-05-18 1.592470
2001-05-19 -1.030133
2001-05-20 1.789641
2001-05-21 0.821066
2001-05-22 0.954955
2001-05-23 0.537921
2001-05-24 -0.016609
2001-05-25 1.806582
2001-05-26 0.323395
2001-05-27 -0.144737
2001-05-28 0.783111
2001-05-29 0.673603
2001-05-30 -1.775568
2001-05-31 -0.553773
Freq: D, dtype: float64
ts[datetime(2011,1,7):]
2011-01-07 -0.186521
2011-01-08 -0.310343
2011-01-10 0.710121
2011-01-12 -0.053242
dtype: float64
ts
2011-01-02 -0.388904
2011-01-05 1.029932
2011-01-07 -0.186521
2011-01-08 -0.310343
2011-01-10 0.710121
2011-01-12 -0.053242
dtype: float64
ts['1/6/2011':'1/11/2011']
2011-01-07 -0.186521
2011-01-08 -0.310343
2011-01-10 0.710121
dtype: float64
#请记住通过这种方式的切片产生了原时间序列的视图,类似于NumPy的数组。
#这意味着没有数据被复制,并且在切片上的修改会反映在原始数据上。
#有一个等价实例方法,truncate,它可以在两个日期间对Series进行切片
ts.truncate(after='1/9/2011')
2011-01-02 -0.388904
2011-01-05 1.029932
2011-01-07 -0.186521
2011-01-08 -0.310343
dtype: float64
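上面提到这种切片产生的是原时间序列的视图,与NumPy数组的行为一致。下面用NumPy数组演示同样的视图语义(一个最小草图):

```python
import numpy as np

arr = np.arange(4.0)
sl = arr[1:3]    # 切片是视图,不复制数据
sl[0] = 99.0     # 修改视图
print(arr)       # 原数组同样被修改: [ 0. 99.  2.  3.]
```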
dates = pd.date_range('1/1/2000',periods=100,freq='W-WED')
dates[:5]
DatetimeIndex(['2000-01-05', '2000-01-12', '2000-01-19', '2000-01-26','2000-02-02'],dtype='datetime64[ns]', freq='W-WED')
long_df = pd.DataFrame(np.random.randn(100,4),index=dates,columns = ['colorado','texas','new york','ohio'])
long_df
|
colorado
|
texas
|
new york
|
ohio
|
2000-01-05
|
-0.401749
|
-0.717368
|
-1.281874
|
1.274557
|
2000-01-12
|
1.038078
|
0.915171
|
-0.080030
|
-0.280146
|
2000-01-19
|
-1.080790
|
0.255669
|
-0.987855
|
-0.662729
|
2000-01-26
|
-0.360513
|
1.133701
|
0.678780
|
0.421880
|
2000-02-02
|
-0.636087
|
1.338383
|
-0.247395
|
1.089950
|
...
|
...
|
...
|
...
|
...
|
2001-10-31
|
1.464741
|
-0.390381
|
2.867647
|
-0.299401
|
2001-11-07
|
-0.341461
|
-0.358244
|
-0.766587
|
0.728394
|
2001-11-14
|
-1.010205
|
-0.055786
|
-0.574341
|
-0.652109
|
2001-11-21
|
-0.130517
|
0.929190
|
0.102579
|
-0.677333
|
2001-11-28
|
-0.349123
|
1.238194
|
-0.058024
|
-3.170347
|
100 rows × 4 columns
long_df.loc['5-2001']
|
colorado
|
texas
|
new york
|
ohio
|
2001-05-02
|
0.054055
|
-0.491146
|
0.241249
|
0.310428
|
2001-05-09
|
-1.556219
|
-0.716196
|
0.514238
|
1.694993
|
2001-05-16
|
-1.382658
|
0.459362
|
0.732855
|
0.644502
|
2001-05-23
|
-0.342885
|
0.544961
|
-1.133561
|
1.351170
|
2001-05-30
|
-2.051397
|
0.595807
|
0.536894
|
0.584369
|
11.2.2 含有重复索引的时间序列
dates = pd.DatetimeIndex(['1/1/2000','1/2/2000','1/2/2000','1/2/2000','1/3/2000'])
dates
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02','2000-01-03'],dtype='datetime64[ns]', freq=None)
dup_ts = pd.Series(np.arange(5),index=dates)
dup_ts
2000-01-01 0
2000-01-02 1
2000-01-02 2
2000-01-02 3
2000-01-03 4
dtype: int32
#通过检查索引的is_unique属性,我们可以看出索引并不是唯一的
dup_ts.index.is_unique
False
#对上面的Series进行索引,结果是标量值还是Series切片取决于是否有时间戳是重复的
dup_ts['1/3/2000']
4
dup_ts['1/2/2000']
2000-01-02 1
2000-01-02 2
2000-01-02 3
dtype: int32
#假设你想要聚合含有非唯一时间戳的数据。一种方式就是使用groupby并传递level=0
grouped = dup_ts.groupby(level=0)
grouped.mean()
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int32
dup_ts.groupby(level=-1).mean()
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int32
grouped.count()
2000-01-01 1
2000-01-02 3
2000-01-03 1
dtype: int64
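除了按level=0聚合,有时只想直接去掉重复的时间戳。一个补充性的草图:利用索引的duplicated方法(pandas索引的标准方法)只保留每个时间戳的第一条观测值:

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)

# 只保留每个重复时间戳的第一个观测值
deduped = dup_ts[~dup_ts.index.duplicated(keep='first')]
print(deduped)
```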
11.3 日期范围、频率和移位
- pandas拥有一整套标准的时间序列频率和工具用于重新采样、推断频率以及生成固定频率的数据范围。
ts
2011-01-02 -0.388904
2011-01-05 1.029932
2011-01-07 -0.186521
2011-01-08 -0.310343
2011-01-10 0.710121
2011-01-12 -0.053242
dtype: float64
#字符串’D’被解释为每日频率。
resampler = ts.resample('D')
11.3.1 生成日期范围
- pandas.date_range是用于根据特定频率生成指定长度的DatetimeIndex
#默认情况下,date_range生成的是每日的时间戳。
index = pd.date_range('2012-04-01','2012-06-01')
index
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04','2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08','2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12','2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16','2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20','2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24','2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28','2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02','2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06','2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10','2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14','2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18','2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22','2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26','2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30','2012-05-31', '2012-06-01'],dtype='datetime64[ns]', freq='D')
#如果你只传递一个起始或结尾日期,你必须传递一个用于生成范围的数字
pd.date_range(start='2012-04-01',periods=200)
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04','2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08','2012-04-09', '2012-04-10',...'2012-10-08', '2012-10-09', '2012-10-10', '2012-10-11','2012-10-12', '2012-10-13', '2012-10-14', '2012-10-15','2012-10-16', '2012-10-17'],dtype='datetime64[ns]', length=200, freq='D')
#可以传递'BM'频率(business month end,月末工作日;参考表11-4的频率列表)
pd.date_range('2000-01-01','2000-12-01',freq='BM')
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28','2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31','2000-09-29', '2000-10-31', '2000-11-30'],dtype='datetime64[ns]', freq='BM')
别名
|
偏置类型
|
描述
|
D
|
Day
|
日历日的每天
|
B
|
BusinessDay
|
工作日的每天
|
H
|
Hour
|
每小时
|
T或min
|
Minute
|
每分钟
|
S
|
Second
|
每秒
|
L或ms
|
Milli
|
每毫秒(1/1,000 秒)
|
U
|
Micro
|
每微秒(1/1,000,000秒)
|
M
|
MonthEnd
|
日历日的月底日期
|
BM
|
BusinessMonthEnd
|
工作日的月底日期
|
MS
|
MonthBegin
|
日历日的月初日期
|
BMS
|
BusinessMonthBegin
|
工作日的月初日期
|
W-MON, W-TUE, …
|
Week
|
按照给定的星期日期按每周取日期(MON, TUE, WED, THU,FRI, SAT或SUN)
|
WOM-1MON, WOM-2MON, …
|
WeekOfMonth
|
在本月的第一,二,三或四周创建按周分隔的日期(例如WOM-3FRI代表每月的第三个星期五)
|
Q-JAN,Q-FEB, …
|
QuarterEnd
|
每个月最后一个日历日的季度日期,以表示月份结束的年份(JAN, FEB,MAR, APR, MAY,JUN, JUL, AUG, SEP, OCT, NOV或DEC)
|
BQ-JAN,BQ-FEB,…
|
BusinessQuarterEnd
|
每个月最后一个工作日对应的季度日期,以表示月份结束的年份
|
QS-JAN, QS-FEB, …
|
QuarterBegin
|
每个月第一个日历日对应的季度日期,以表示月份结束的年份
|
BQS-JAN,BQS-FEB, …
|
BusinessQuarterBegin
|
每个月第一个工作日对应的季度日期,以表示月份结束的年份
|
A-JAN,A-FEB, …
|
YearEnd
|
给定月份所在的月的最后一个日历日所对应的年度日期(JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV或DEC)
|
BA-JAN,BA-FEB, …
|
BusinessYearEnd
|
给定月份所在的月的最后一个工作日所对应的年度日期
|
AS-JAN,AS-FEB, …
|
YearBegin
|
给定月份所在的月的第一个日历日所对应的年度日期
|
BAS-JAN, BAS-FEB, …
|
BusinessYearBegin
|
给定月份所在月的第一个工作日所对应的年度日期
|
#默认情况下,date_range保留开始或结束时间戳的时间(如果有的话)
pd.date_range('2012-05-02 12:56:31',periods=5)
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31','2012-05-04 12:56:31', '2012-05-05 12:56:31','2012-05-06 12:56:31'],dtype='datetime64[ns]', freq='D')
pd.date_range('2012-05-02 12:56:31',periods=5,normalize=True)
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05','2012-05-06'],dtype='datetime64[ns]', freq='D')
11.3.2 频率和日期偏置
- pandas中的频率是由基础频率和倍数组成的。
- 基础频率通常会有字符串别名,例如’M’代表每月,'H’代表每小时。对于每个基础频率,都有一个对象可以被用于定义日期偏置。
from pandas.tseries.offsets import Hour,Minute
hour = Hour()
hour
<Hour>
#你可以传递一个整数来定义偏置量的倍数
four_hours = Hour(4)
four_hours
<4 * Hours>
pd.date_range('2000-01-01','2000-01-03 23:59',freq='4h')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00','2000-01-01 08:00:00', '2000-01-01 12:00:00','2000-01-01 16:00:00', '2000-01-01 20:00:00','2000-01-02 00:00:00', '2000-01-02 04:00:00','2000-01-02 08:00:00', '2000-01-02 12:00:00','2000-01-02 16:00:00', '2000-01-02 20:00:00','2000-01-03 00:00:00', '2000-01-03 04:00:00','2000-01-03 08:00:00', '2000-01-03 12:00:00','2000-01-03 16:00:00', '2000-01-03 20:00:00'],dtype='datetime64[ns]', freq='4H')
#多个偏置可以通过加法进行联合
Hour(2)+Minute(30)
<150 * Minutes>
pd.date_range('2000-01-01','2000-01-03 23:59',freq='1h30min')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00','2000-01-01 03:00:00', '2000-01-01 04:30:00','2000-01-01 06:00:00', '2000-01-01 07:30:00','2000-01-01 09:00:00', '2000-01-01 10:30:00','2000-01-01 12:00:00', '2000-01-01 13:30:00','2000-01-01 15:00:00', '2000-01-01 16:30:00','2000-01-01 18:00:00', '2000-01-01 19:30:00','2000-01-01 21:00:00', '2000-01-01 22:30:00','2000-01-02 00:00:00', '2000-01-02 01:30:00','2000-01-02 03:00:00', '2000-01-02 04:30:00','2000-01-02 06:00:00', '2000-01-02 07:30:00','2000-01-02 09:00:00', '2000-01-02 10:30:00','2000-01-02 12:00:00', '2000-01-02 13:30:00','2000-01-02 15:00:00', '2000-01-02 16:30:00','2000-01-02 18:00:00', '2000-01-02 19:30:00','2000-01-02 21:00:00', '2000-01-02 22:30:00','2000-01-03 00:00:00', '2000-01-03 01:30:00','2000-01-03 03:00:00', '2000-01-03 04:30:00','2000-01-03 06:00:00', '2000-01-03 07:30:00','2000-01-03 09:00:00', '2000-01-03 10:30:00','2000-01-03 12:00:00', '2000-01-03 13:30:00','2000-01-03 15:00:00', '2000-01-03 16:30:00','2000-01-03 18:00:00', '2000-01-03 19:30:00','2000-01-03 21:00:00', '2000-01-03 22:30:00'],dtype='datetime64[ns]', freq='90T')
#有些频率所描述的时间点并不是均匀分隔的。
#例如,'M'(日历月末)和'BM'(月内最后一个工作日)的具体日期取决于当月的天数;对'BM'而言,还取决于月末是否恰逢周末。
#我们将这些日期称为锚定偏置量。
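锚定偏置生成的时间点间隔并不相等。下面的草图可以直观看到'M'(日历月末)的相邻日期相差的天数各不相同(2000年是闰年,因此2月有29天):

```python
import pandas as pd

rng = pd.date_range('2000-01-01', periods=4, freq='M')  # 日历月末
print(rng)
# 相邻月末之间的间隔: 29天、31天、30天
print(rng.to_series().diff().dropna())
```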
11.3.2.1 月中某星期的日期
#允许你可以获取每月第三个星期五这样的日期
rng = pd.date_range('2012-01-01','2012-09-01',freq='WOM-3FRI')
rng
DatetimeIndex(['2012-01-20', '2012-02-17', '2012-03-16', '2012-04-20','2012-05-18', '2012-06-15', '2012-07-20', '2012-08-17'],dtype='datetime64[ns]', freq='WOM-3FRI')
11.3.3 移位(前向和后向)日期
- "移位"是指将日期按时间向前移动或向后移动。
- Series和DataFrame都有一个shift方法用于进行简单的前向或后向移位,而不改变索引
ts = pd.Series(np.random.randn(4),index=pd.date_range('1/1/2000',freq='M',periods=4))
ts
2000-01-31 -0.209127
2000-02-29 0.998652
2000-03-31 0.006782
2000-04-30 0.558995
Freq: M, dtype: float64
ts.shift(2)
2000-01-31 NaN
2000-02-29 NaN
2000-03-31 -0.209127
2000-04-30 0.998652
Freq: M, dtype: float64
ts.shift(-2)
2000-01-31 0.006782
2000-02-29 0.558995
2000-03-31 NaN
2000-04-30 NaN
Freq: M, dtype: float64
#shift常用于计算时间序列或DataFrame多列时间序列的百分比变化
ts/ts.shift(1)-1
2000-01-31 NaN
2000-02-29 -5.775325
2000-03-31 -0.993209
2000-04-30 81.428620
Freq: M, dtype: float64
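补充一点:ts/ts.shift(1)-1计算的就是逐期百分比变化,pandas也提供了等价的pct_change方法。下面是一个小验证草图(数据为假设的示例值):

```python
import pandas as pd

ts = pd.Series([100.0, 110.0, 99.0],
               index=pd.date_range('1/1/2000', periods=3, freq='M'))

# 手工计算的百分比变化与pct_change结果一致(首个位置均为NaN)
manual = ts / ts.shift(1) - 1
print(manual.equals(ts.pct_change()))
```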
#如果频率是已知的,则可以将频率传递给shift来推移时间戳而不是简单的数据
ts.shift(2,freq='M')
2000-03-31 -0.209127
2000-04-30 0.998652
2000-05-31 0.006782
2000-06-30 0.558995
Freq: M, dtype: float64
#其他的频率也可以传递,为你前移和后移数据提供灵活性
ts.shift(3,freq='D')
2000-02-03 -0.209127
2000-03-03 0.998652
2000-04-03 0.006782
2000-05-03 0.558995
dtype: float64
#这里的T代表分钟
ts.shift(1,freq='90T')
2000-01-31 01:30:00 -0.209127
2000-02-29 01:30:00 0.998652
2000-03-31 01:30:00 0.006782
2000-04-30 01:30:00 0.558995
dtype: float64
11.3.3.1 使用偏置进行日期移位
- pandas的日期偏置也可以配合datetime或Timestamp对象使用
from pandas.tseries.offsets import Day,MonthEnd
now = datetime(2011,11,17)
now+3*Day()
Timestamp('2011-11-20 00:00:00')
now+MonthEnd()
Timestamp('2011-11-30 00:00:00')
now+MonthEnd(2)
Timestamp('2011-12-31 00:00:00')
#锚定偏置可以使用rollforward和rollback分别显式地将日期向前或向后"滚动"
offset = MonthEnd()
offset.rollforward(now)
Timestamp('2011-11-30 00:00:00')
offset.rollback(now)
Timestamp('2011-10-31 00:00:00')
ts = pd.Series(np.random.randn(20),index=pd.date_range('1/15/2000',periods=20,freq='4d'))
ts
2000-01-15 0.165827
2000-01-19 -0.564001
2000-01-23 -1.297098
2000-01-27 0.461758
2000-01-31 0.104934
2000-02-04 -0.197272
2000-02-08 -0.238585
2000-02-12 1.822792
2000-02-16 1.053136
2000-02-20 0.114542
2000-02-24 -0.719873
2000-02-28 0.938727
2000-03-03 0.250604
2000-03-07 1.692985
2000-03-11 0.249253
2000-03-15 -2.015027
2000-03-19 1.274159
2000-03-23 0.593201
2000-03-27 -0.340764
2000-03-31 -0.806581
Freq: 4D, dtype: float64
ts.resample('M').mean()
2000-01-31 -0.225716
2000-02-29 0.396209
2000-03-31 0.112229
Freq: M, dtype: float64
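上面的按月resample结果也可以用锚定偏置的rollforward配合groupby得到:先把每个时间戳滚动到所在月的月末,再按月末分组聚合。下面是这一等价写法的草图:

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

ts = pd.Series(np.arange(20, dtype=float),
               index=pd.date_range('1/15/2000', periods=20, freq='4d'))

# rollforward把每个时间戳滚动到所在月的月末,作为分组键
offset = MonthEnd()
by_month = ts.groupby(offset.rollforward).mean()
print(by_month)
```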
11.4 时区处理
- 在Python语言中,时区信息来源于第三方库pytz(可以使用pip或conda安装),其中公开了Olson数据库,这是世界时区信息的汇编。
import pytz
pytz.common_timezones[-5:]
['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']
#要获得pytz的时区对象,可使用pytz.timezone
tz = pytz.timezone('America/New_York')
tz
<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>
11.4.1 时区的本地化和转换
rng = pd.date_range('3/9/2012 9:30',periods=6,freq='D')
rng
DatetimeIndex(['2012-03-09 09:30:00', '2012-03-10 09:30:00','2012-03-11 09:30:00', '2012-03-12 09:30:00','2012-03-13 09:30:00', '2012-03-14 09:30:00'],dtype='datetime64[ns]', freq='D')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2012-03-09 09:30:00 -0.804325
2012-03-10 09:30:00 0.060626
2012-03-11 09:30:00 -1.531345
2012-03-12 09:30:00 0.133690
2012-03-13 09:30:00 0.173941
2012-03-14 09:30:00 -1.416870
Freq: D, dtype: float64
#索引的tz属性是None
print(ts.index.tz)
None
pd.date_range('3/9/2012 9:30',periods=10,freq='D',tz='UTC')
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00','2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00','2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00','2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00','2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],dtype='datetime64[ns, UTC]', freq='D')
#使用tz_localize方法可以从简单时区转换到本地化时区
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-09 09:30:00+00:00 -0.804325
2012-03-10 09:30:00+00:00 0.060626
2012-03-11 09:30:00+00:00 -1.531345
2012-03-12 09:30:00+00:00 0.133690
2012-03-13 09:30:00+00:00 0.173941
2012-03-14 09:30:00+00:00 -1.416870
Freq: D, dtype: float64
ts_utc.index
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00','2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00','2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],dtype='datetime64[ns, UTC]', freq='D')
#一旦时间序列被本地化为某个特定的时区,则可以通过tz_convert将其转换为另一个时区
ts_utc.tz_convert('America/New_York')
2012-03-09 04:30:00-05:00 -0.804325
2012-03-10 04:30:00-05:00 0.060626
2012-03-11 05:30:00-04:00 -1.531345
2012-03-12 05:30:00-04:00 0.133690
2012-03-13 05:30:00-04:00 0.173941
2012-03-14 05:30:00-04:00 -1.416870
Freq: D, dtype: float64
ts_eastern = ts.tz_localize('America/New_York')
ts_eastern.tz_convert('UTC')
2012-03-09 14:30:00+00:00 -0.804325
2012-03-10 14:30:00+00:00 0.060626
2012-03-11 13:30:00+00:00 -1.531345
2012-03-12 13:30:00+00:00 0.133690
2012-03-13 13:30:00+00:00 0.173941
2012-03-14 13:30:00+00:00 -1.416870
dtype: float64
ts_eastern.tz_convert('Europe/Berlin')
2012-03-09 15:30:00+01:00 -0.804325
2012-03-10 15:30:00+01:00 0.060626
2012-03-11 14:30:00+01:00 -1.531345
2012-03-12 14:30:00+01:00 0.133690
2012-03-13 14:30:00+01:00 0.173941
2012-03-14 14:30:00+01:00 -1.416870
dtype: float64
#tz_localize和tz_convert也是DatetimeIndex的实例方法
ts.index.tz_localize('Asia/Shanghai')
DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00','2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00','2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],dtype='datetime64[ns, Asia/Shanghai]', freq=None)
11.4.2 时区感知时间戳对象的操作
stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('America/New_York')
Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')
stamp_moscow = pd.Timestamp('2011-03-12 04:00',tz='Europe/Moscow')
stamp_moscow
Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')
stamp_utc.value
1299902400000000000
stamp_utc.tz_convert('America/New_York').value
1299902400000000000
from pandas.tseries.offsets import Hour
stamp = pd.Timestamp('2012-03-12 01:30',tz='US/Eastern')
stamp
Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')
stamp+Hour()
Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')
stamp = pd.Timestamp('2012-11-04 00:30',tz='US/Eastern')
stamp
Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')
stamp+2*Hour()
Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')
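上面两个例子体现了夏令时切换:时区感知的Timestamp算术遵守绝对时间,跨越DST边界时墙上时钟会相应跳变。下面的草图在"春季前拨"当天(US/Eastern在2012-03-11凌晨2点直接拨快到3点)再演示一次:

```python
import pandas as pd
from pandas.tseries.offsets import Hour

# 2012-03-11 02:00在US/Eastern不存在(时钟直接跳到03:00)
stamp = pd.Timestamp('2012-03-11 01:30', tz='US/Eastern')
print(stamp)           # 2012-03-11 01:30:00-05:00
print(stamp + Hour())  # 加1个绝对小时,落入夏令时: 2012-03-11 03:30:00-04:00
```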
11.4.3 不同时区间的操作
rng = pd.date_range('3/7/2012 9:30',periods=10,freq='B')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2012-03-07 09:30:00 0.386445
2012-03-08 09:30:00 -0.300690
2012-03-09 09:30:00 0.759709
2012-03-12 09:30:00 1.257558
2012-03-13 09:30:00 -0.505203
2012-03-14 09:30:00 -0.002910
2012-03-15 09:30:00 -0.196304
2012-03-16 09:30:00 0.440398
2012-03-19 09:30:00 0.218168
2012-03-20 09:30:00 1.265953
Freq: B, dtype: float64
ts1 = ts[:7].tz_localize('Europe/London')
ts1
2012-03-07 09:30:00+00:00 0.386445
2012-03-08 09:30:00+00:00 -0.300690
2012-03-09 09:30:00+00:00 0.759709
2012-03-12 09:30:00+00:00 1.257558
2012-03-13 09:30:00+00:00 -0.505203
2012-03-14 09:30:00+00:00 -0.002910
2012-03-15 09:30:00+00:00 -0.196304
dtype: float64
ts2 = ts[:7].tz_localize('Europe/Moscow')
ts2
2012-03-07 09:30:00+04:00 0.386445
2012-03-08 09:30:00+04:00 -0.300690
2012-03-09 09:30:00+04:00 0.759709
2012-03-12 09:30:00+04:00 1.257558
2012-03-13 09:30:00+04:00 -0.505203
2012-03-14 09:30:00+04:00 -0.002910
2012-03-15 09:30:00+04:00 -0.196304
dtype: float64
result = ts1 + ts2
result.index
DatetimeIndex(['2012-03-07 05:30:00+00:00', '2012-03-07 09:30:00+00:00','2012-03-08 05:30:00+00:00', '2012-03-08 09:30:00+00:00','2012-03-09 05:30:00+00:00', '2012-03-09 09:30:00+00:00','2012-03-12 05:30:00+00:00', '2012-03-12 09:30:00+00:00','2012-03-13 05:30:00+00:00', '2012-03-13 09:30:00+00:00','2012-03-14 05:30:00+00:00', '2012-03-14 09:30:00+00:00','2012-03-15 05:30:00+00:00', '2012-03-15 09:30:00+00:00'],dtype='datetime64[ns, UTC]', freq=None)
11.5 时间区间和区间算术
p = pd.Period(2007,freq='A-DEC')
p
Period('2007', 'A-DEC')
p+5
Period('2012', 'A-DEC')
p-2
Period('2005', 'A-DEC')
#如果两个区间拥有相同的频率,则它们的差是它们之间的单位数
pd.Period(2014,freq='A-DEC')-p
<7 * YearEnds: month=12>
#使用period_range函数可以构造规则区间序列
rng = pd.period_range('2000-01-01','2000-06-30',freq='M')
rng
PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')
#PeriodIndex类存储的是区间的序列,可以作为任意pandas数据结构的轴索引
pd.Series(np.random.randn(6),index=rng)
2000-01 -1.474403
2000-02 0.146328
2000-03 1.247877
2000-04 0.797235
2000-05 -1.583312
2000-06 1.525540
Freq: M, dtype: float64
#如果你有一个字符串数组,你也可以使用PeriodIndex类
values = ['2001Q3','2002Q2','2003Q1']
index = pd.PeriodIndex(values,freq='Q-DEC')
index
PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')
11.5.1 区间频率转换
- 使用asfreq可以将区间和PeriodIndex对象转换为其他的频率。
p = pd.Period(2007,freq='A-DEC')
p
Period('2007', 'A-DEC')
p.asfreq('M',how='start')
Period('2007-01', 'M')
p.asfreq('M',how='end')
Period('2007-12', 'M')
p = pd.Period(2007,freq='A-JUN')
p
Period('2007', 'A-JUN')
p.asfreq('M',how='start')
Period('2006-07', 'M')
p.asfreq('M',how='end')
Period('2007-06', 'M')
p = pd.Period('Aug-2007','M')
p
Period('2007-08', 'M')
p.asfreq('A-JUN')
Period('2008', 'A-JUN')
rng = pd.period_range('2006','2009',freq='A-DEC')
rng
PeriodIndex(['2006', '2007', '2008', '2009'], dtype='period[A-DEC]', freq='A-DEC')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2006 -0.784284
2007 -1.058489
2008 -0.151931
2009 0.443975
Freq: A-DEC, dtype: float64
ts.asfreq('M',how='start')
2006-01 -0.784284
2007-01 -1.058489
2008-01 -0.151931
2009-01 0.443975
Freq: M, dtype: float64
ts.asfreq('B',how='end')
2006-12-29 -0.784284
2007-12-31 -1.058489
2008-12-31 -0.151931
2009-12-31 0.443975
Freq: B, dtype: float64
11.5.2 季度区间频率
p = pd.Period(2007,freq='Q-JAN')
p
Period('2007Q4', 'Q-JAN')
p.asfreq('D','start')
Period('2006-11-01', 'D')
p.asfreq('D','end')
Period('2007-01-31', 'D')
#获取在季度倒数第二个工作日下午4点的时间戳
p4pm = (p.asfreq('B','e')-1).asfreq('T','S')+16*60
p4pm
Period('2007-01-30 16:00', 'T')
p4pm.to_timestamp()
Timestamp('2007-01-30 16:00:00')
rng = pd.period_range('2011Q3','2012Q4',freq='Q-JAN')
ts = pd.Series(np.arange(len(rng)),index=rng)
ts
2011Q3 0
2011Q4 1
2012Q1 2
2012Q2 3
2012Q3 4
2012Q4 5
Freq: Q-JAN, dtype: int32
new_rng = (rng.asfreq('B','e')-1).asfreq('T','S')+16*60
ts.index = new_rng.to_timestamp()
ts
2010-10-28 16:00:00 0
2011-01-28 16:00:00 1
2011-04-28 16:00:00 2
2011-07-28 16:00:00 3
2011-10-28 16:00:00 4
2012-01-30 16:00:00 5
dtype: int32
11.5.3 将时间戳转换为区间(以及逆转换)
rng = pd.date_range('2000-01-01',periods=3,freq='M')
ts = pd.Series(np.random.randn(3),index=rng)
ts
2000-01-31 -0.622352
2000-02-29 -1.830166
2000-03-31 -0.601708
Freq: M, dtype: float64
#通过时间戳索引的Series和DataFrame可以被to_period方法转换为区间
pts = ts.to_period()
pts
2000-01 -0.622352
2000-02 -1.830166
2000-03 -0.601708
Freq: M, dtype: float64
rng = pd.date_range('1/29/2000',periods=6,freq='D')
rng
DatetimeIndex(['2000-01-29', '2000-01-30', '2000-01-31', '2000-02-01','2000-02-02', '2000-02-03'],dtype='datetime64[ns]', freq='D')
ts2 = pd.Series(np.random.randn(6),index=rng)
ts2
2000-01-29 0.056749
2000-01-30 -0.310291
2000-01-31 1.369577
2000-02-01 -0.361215
2000-02-02 0.426119
2000-02-03 -0.010565
Freq: D, dtype: float64
ts2.to_period('M')
2000-01 0.056749
2000-01 -0.310291
2000-01 1.369577
2000-02 -0.361215
2000-02 0.426119
2000-02 -0.010565
Freq: M, dtype: float64
#使用to_timestamp可以将区间再转换为时间戳
pts = ts2.to_period()
pts
2000-01-29 0.056749
2000-01-30 -0.310291
2000-01-31 1.369577
2000-02-01 -0.361215
2000-02-02 0.426119
2000-02-03 -0.010565
Freq: D, dtype: float64
pts.to_timestamp(how='end')
2000-01-29 23:59:59.999999999 0.056749
2000-01-30 23:59:59.999999999 -0.310291
2000-01-31 23:59:59.999999999 1.369577
2000-02-01 23:59:59.999999999 -0.361215
2000-02-02 23:59:59.999999999 0.426119
2000-02-03 23:59:59.999999999 -0.010565
Freq: D, dtype: float64
11.5.4 从数组生成PeriodIndex
data = pd.read_csv('examples/macrodata.csv')
data.head(5)
|
year
|
quarter
|
realgdp
|
realcons
|
realinv
|
realgovt
|
realdpi
|
cpi
|
m1
|
tbilrate
|
unemp
|
pop
|
infl
|
realint
|
0
|
1959.0
|
1.0
|
2710.349
|
1707.4
|
286.898
|
470.045
|
1886.9
|
28.98
|
139.7
|
2.82
|
5.8
|
177.146
|
0.00
|
0.00
|
1
|
1959.0
|
2.0
|
2778.801
|
1733.7
|
310.859
|
481.301
|
1919.7
|
29.15
|
141.7
|
3.08
|
5.1
|
177.830
|
2.34
|
0.74
|
2
|
1959.0
|
3.0
|
2775.488
|
1751.8
|
289.226
|
491.260
|
1916.4
|
29.35
|
140.5
|
3.82
|
5.3
|
178.657
|
2.74
|
1.09
|
3
|
1959.0
|
4.0
|
2785.204
|
1753.7
|
299.356
|
484.052
|
1931.3
|
29.37
|
140.0
|
4.33
|
5.6
|
179.386
|
0.27
|
4.06
|
4
|
1960.0
|
1.0
|
2847.699
|
1770.5
|
331.722
|
462.199
|
1955.5
|
29.54
|
139.6
|
3.50
|
5.2
|
180.007
|
2.31
|
1.19
|
data.year
0 1959.0
1 1959.0
2 1959.0
3 1959.0
4 1960.0...
198 2008.0
199 2008.0
200 2009.0
201 2009.0
202 2009.0
Name: year, Length: 203, dtype: float64
data.quarter
0 1.0
1 2.0
2 3.0
3 4.0
4 1.0...
198 3.0
199 4.0
200 1.0
201 2.0
202 3.0
Name: quarter, Length: 203, dtype: float64
#通过将这些数组和频率传递给PeriodIndex,你可以联合这些数组形成DataFrame的索引
index = pd.PeriodIndex(year=data.year,quarter=data.quarter,freq='Q-DEC')
index
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2','1960Q3', '1960Q4', '1961Q1', '1961Q2',...'2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3','2008Q4', '2009Q1', '2009Q2', '2009Q3'],dtype='period[Q-DEC]', length=203, freq='Q-DEC')
data.index = index
data.head()
|
year
|
quarter
|
realgdp
|
realcons
|
realinv
|
realgovt
|
realdpi
|
cpi
|
m1
|
tbilrate
|
unemp
|
pop
|
infl
|
realint
|
1959Q1
|
1959.0
|
1.0
|
2710.349
|
1707.4
|
286.898
|
470.045
|
1886.9
|
28.98
|
139.7
|
2.82
|
5.8
|
177.146
|
0.00
|
0.00
|
1959Q2
|
1959.0
|
2.0
|
2778.801
|
1733.7
|
310.859
|
481.301
|
1919.7
|
29.15
|
141.7
|
3.08
|
5.1
|
177.830
|
2.34
|
0.74
|
1959Q3
|
1959.0
|
3.0
|
2775.488
|
1751.8
|
289.226
|
491.260
|
1916.4
|
29.35
|
140.5
|
3.82
|
5.3
|
178.657
|
2.74
|
1.09
|
1959Q4
|
1959.0
|
4.0
|
2785.204
|
1753.7
|
299.356
|
484.052
|
1931.3
|
29.37
|
140.0
|
4.33
|
5.6
|
179.386
|
0.27
|
4.06
|
1960Q1
|
1960.0
|
1.0
|
2847.699
|
1770.5
|
331.722
|
462.199
|
1955.5
|
29.54
|
139.6
|
3.50
|
5.2
|
180.007
|
2.31
|
1.19
|
data.infl
1959Q1 0.00
1959Q2 2.34
1959Q3 2.74
1959Q4 0.27
1960Q1 2.31...
2008Q3 -3.16
2008Q4 -8.79
2009Q1 0.94
2009Q2 3.37
2009Q3 3.56
Freq: Q-DEC, Name: infl, Length: 203, dtype: float64
11.6 重新采样与频率转换
- 重新采样是指将时间序列从一个频率转换为另一个频率的过程。
- 将更高频率的数据聚合到低频率被称为向下采样,而从低频率转换到高频率称为向上采样。
- 并不是所有的重新采样都属于上面说的两类
- pandas对象都配有resample方法,该方法是所有频率转换的工具函数。
- resample拥有类似于groupby的API;你调用resample对数据分组,之后再调用聚合函数
rng = pd.date_range('2000-01-01',periods=100,freq='D')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2000-01-01 0.037518
2000-01-02 -0.067833
2000-01-03 -0.704365
2000-01-04 -0.368819
2000-01-05 -1.208229...
2000-04-05 1.085903
2000-04-06 0.746400
2000-04-07 0.517810
2000-04-08 -0.363853
2000-04-09 0.057838
Freq: D, Length: 100, dtype: float64
ts.resample('M').mean()
2000-01-31 -0.108144
2000-02-29 -0.055158
2000-03-31 0.358646
2000-04-30 0.773431
Freq: M, dtype: float64
ts.resample('M',kind='period').mean()
2000-01 -0.108144
2000-02 -0.055158
2000-03 0.358646
2000-04 0.773431
Freq: M, dtype: float64
参数
|
描述
|
freq
|
表明所需采样频率的字符串或DateOffset对象(例如,'M'、'5min'或Second(1))
|
axis
|
需要采样的轴向;默认是axis=0
|
fill_method
|
向上采样时的插值方式,'ffill'或'bfill';默认不插值
|
closed
|
向下采样中,每段间隔的哪一端是封闭的(包含的),'right'或'left'
|
label
|
向下采样中,如何用’right’或’left’的箱标签标记聚合结果(例如,9:30到9:35的五分钟间隔可以被标记为9:30或9:35)
|
loffset
|
对箱标签进行时间调校,例如’-1s’ /Second (-1) 可以将聚合标签向前移动一秒
|
limit
|
在前向或后向填充时,填充区间的最大值
|
kind
|
对区间('period')或时间戳('timestamp')的聚合;默认为时间序列索引的类型
|
convention
|
在对区间重新采样时,用于将低频周期转换为高频的约定(‘start’ 或’end’);默认为’end’
|
11.6.1 向下采样
- 在使用resample进行向下采样数据时有些事情需要考虑:
- 每段间隔的哪一边是闭合的
- 如何在间隔的起始或结束位置标记每个已聚合的箱体
rng = pd.date_range('2000-01-01',periods=12,freq='T')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2000-01-01 00:00:00 0.579555
2000-01-01 00:01:00 0.210504
2000-01-01 00:02:00 0.135227
2000-01-01 00:03:00 -0.125649
2000-01-01 00:04:00 1.511186
2000-01-01 00:05:00 -1.384490
2000-01-01 00:06:00 -0.938106
2000-01-01 00:07:00 2.057013
2000-01-01 00:08:00 -0.737751
2000-01-01 00:09:00 1.469319
2000-01-01 00:10:00 -0.257609
2000-01-01 00:11:00 -0.975761
Freq: T, dtype: float64
ts.resample('5min',closed='right',label='right').sum()
2000-01-01 00:00:00 0.579555
2000-01-01 00:05:00 0.346778
2000-01-01 00:10:00 1.592866
2000-01-01 00:15:00 -0.975761
Freq: 5T, dtype: float64
ts.resample('5min',closed='left').sum()
2000-01-01 00:00:00 2.310823
2000-01-01 00:05:00 0.465985
2000-01-01 00:10:00 -1.233370
Freq: 5T, dtype: float64
ts.resample('5min').sum()
2000-01-01 00:00:00 2.310823
2000-01-01 00:05:00 0.465985
2000-01-01 00:10:00 -1.233370
Freq: 5T, dtype: float64
#你可能需要将结果索引移动一定的数量,例如从右边缘减去一秒,以使其更清楚地表明时间戳所指的间隔。
#要实现这个功能,向loffset传递字符串或日期偏置
ts.resample('5min',closed='right',label='right',loffset='-1s').sum()
#你也可以通过在结果上调用shift方法来完成loffset的效果。
<ipython-input-407-0b956efa5584>:3: FutureWarning: 'loffset' in .resample() and in Grouper() is deprecated.
>>> df.resample(freq="3s", loffset="8H")
becomes:
>>> from pandas.tseries.frequencies import to_offset
>>> df = df.resample(freq="3s").mean()
>>> df.index = df.index.to_timestamp() + to_offset("8H")
1999-12-31 23:59:59 0.579555
2000-01-01 00:04:59 0.346778
2000-01-01 00:09:59 1.592866
2000-01-01 00:14:59 -0.975761
Freq: 5T, dtype: float64
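如上面的警告所示,loffset已被弃用;在较新版本的pandas中,可以先聚合,再直接平移结果的索引来达到同样效果。一个等价写法的草图(数据为假设的示例值):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='T')
ts = pd.Series(np.arange(12, dtype=float), index=rng)

# 等价于 loffset='-1s':先聚合,再把结果索引整体前移1秒
result = ts.resample('5min', closed='right', label='right').sum()
result.index = result.index + pd.Timedelta('-1s')
print(result)
```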
11.6.1.1 开端-峰值-谷值-结束(OHLC)重新采样
- 在金融中,为每个数据桶计算四个值是一种流行的时间序列聚合方法:第一个值(开端)、最后一个值(结束)、最大值(峰值)和最小值(谷值)。通过使用ohlc聚合函数你将会获得包含四种聚合值列的DataFrame,这些值在数据的单次扫描中被高效计算
ts.resample('5min').ohlc()
|
open
|
high
|
low
|
close
|
2000-01-01 00:00:00
|
0.579555
|
1.511186
|
-0.125649
|
1.511186
|
2000-01-01 00:05:00
|
-1.384490
|
2.057013
|
-1.384490
|
1.469319
|
2000-01-01 00:10:00
|
-0.257609
|
-0.257609
|
-0.975761
|
-0.975761
|
11.6.2 向上采样与插值
- 当从低频率转换为高频率时,并不需要任何聚合
- 由于区间涉及时间范围,向上采样和向下采样就更为严格:
- 在向下采样中,目标频率必须是原频率的子区间。
- 在向上采样中,目标频率必须是原频率的父区间。
frame = pd.DataFrame(np.random.randn(2,4),index=pd.date_range('1/1/2000',periods=2,freq='W-WED'),columns=['colorado','texas','new york','ohio'])
frame
|
colorado
|
texas
|
new york
|
ohio
|
2000-01-05
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-12
|
0.160683
|
1.629834
|
-0.340546
|
-0.767973
|
#使用asfreq方法在不聚合的情况下转换到高频率
df_daily = frame.resample('D').asfreq()
df_daily
|
colorado
|
texas
|
new york
|
ohio
|
2000-01-05
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-06
|
NaN
|
NaN
|
NaN
|
NaN
|
2000-01-07
|
NaN
|
NaN
|
NaN
|
NaN
|
2000-01-08
|
NaN
|
NaN
|
NaN
|
NaN
|
2000-01-09
|
NaN
|
NaN
|
NaN
|
NaN
|
2000-01-10
|
NaN
|
NaN
|
NaN
|
NaN
|
2000-01-11
|
NaN
|
NaN
|
NaN
|
NaN
|
2000-01-12
|
0.160683
|
1.629834
|
-0.340546
|
-0.767973
|
#fillna和reindex方法中可用的填充或插值方法可用于重采样
frame.resample('D').ffill()
|
colorado
|
texas
|
new york
|
ohio
|
2000-01-05
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-06
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-07
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-08
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-09
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-10
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-11
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-12
|
0.160683
|
1.629834
|
-0.340546
|
-0.767973
|
#你可以同样选择仅向前填充一定数量的区间,以限制继续使用观测值的时距
frame.resample('D').ffill(limit=2)
|
colorado
|
texas
|
new york
|
ohio
|
2000-01-05
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-06
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-07
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-08
|
NaN
|
NaN
|
NaN
|
NaN
|
2000-01-09
|
NaN
|
NaN
|
NaN
|
NaN
|
2000-01-10
|
NaN
|
NaN
|
NaN
|
NaN
|
2000-01-11
|
NaN
|
NaN
|
NaN
|
NaN
|
2000-01-12
|
0.160683
|
1.629834
|
-0.340546
|
-0.767973
|
#请注意,新的日期索引根本不需要与旧的索引重叠
frame.resample('W-THU').ffill()
|
colorado
|
texas
|
new york
|
ohio
|
2000-01-06
|
-1.050656
|
0.063228
|
-1.179070
|
1.070880
|
2000-01-13
|
0.160683
|
1.629834
|
-0.340546
|
-0.767973
|
11.6.3 使用区间进行重新采样
frame = pd.DataFrame(np.random.randn(24,4),index=pd.period_range('1-2000','12-2001',freq='M'),columns=['colorado','texas','new york','ohio'])
frame.head()
|         | colorado  | texas     | new york  | ohio      |
|---------|-----------|-----------|-----------|-----------|
| 2000-01 | 0.827955  | 0.212008  | 0.136026  | -1.278525 |
| 2000-02 | 0.745499  | 1.909823  | 0.819960  | -0.569550 |
| 2000-03 | 0.893031  | 0.512417  | 1.035980  | -0.923079 |
| 2000-04 | -0.223343 | 0.536235  | -1.872214 | -0.210844 |
| 2000-05 | -2.059027 | -2.296580 | -1.082294 | -0.352297 |
annual_frame = frame.resample('A-DEC').mean()
annual_frame
|      | colorado | texas    | new york  | ohio     |
|------|----------|----------|-----------|----------|
| 2000 | 0.062947 | 0.204868 | -0.149046 | 0.072700 |
| 2001 | 0.433417 | 0.014037 | -0.483594 | 0.242176 |
annual_frame.resample('Q-DEC').ffill()
|        | colorado | texas    | new york  | ohio     |
|--------|----------|----------|-----------|----------|
| 2000Q1 | 0.062947 | 0.204868 | -0.149046 | 0.072700 |
| 2000Q2 | 0.062947 | 0.204868 | -0.149046 | 0.072700 |
| 2000Q3 | 0.062947 | 0.204868 | -0.149046 | 0.072700 |
| 2000Q4 | 0.062947 | 0.204868 | -0.149046 | 0.072700 |
| 2001Q1 | 0.433417 | 0.014037 | -0.483594 | 0.242176 |
| 2001Q2 | 0.433417 | 0.014037 | -0.483594 | 0.242176 |
| 2001Q3 | 0.433417 | 0.014037 | -0.483594 | 0.242176 |
| 2001Q4 | 0.433417 | 0.014037 | -0.483594 | 0.242176 |
annual_frame.resample('Q-MAR').ffill()
|        | colorado | texas    | new york  | ohio     |
|--------|----------|----------|-----------|----------|
| 2000Q4 | 0.062947 | 0.204868 | -0.149046 | 0.072700 |
| 2001Q1 | 0.062947 | 0.204868 | -0.149046 | 0.072700 |
| 2001Q2 | 0.062947 | 0.204868 | -0.149046 | 0.072700 |
| 2001Q3 | 0.062947 | 0.204868 | -0.149046 | 0.072700 |
| 2001Q4 | 0.433417 | 0.014037 | -0.483594 | 0.242176 |
| 2002Q1 | 0.433417 | 0.014037 | -0.483594 | 0.242176 |
| 2002Q2 | 0.433417 | 0.014037 | -0.483594 | 0.242176 |
| 2002Q3 | 0.433417 | 0.014037 | -0.483594 | 0.242176 |
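Upsampling periods also honors the `convention` argument, which controls whether each source period maps to the start or the end of its span (a sketch on a hypothetical two-point annual series):

```python
import pandas as pd

# Two annual periods, upsampled to quarterly frequency.
p = pd.Series([1.0, 2.0], index=pd.period_range('2000', '2001', freq='A-DEC'))

# convention='start' (the default) places each annual value in its
# first quarter; ffill then carries it through the year.
by_start = p.resample('Q-DEC').ffill()

# convention='end' places each value in the last quarter instead,
# so the resampled index begins at 2000Q4.
by_end = p.resample('Q-DEC', convention='end').ffill()
```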
11.7 Moving Window Functions
close_px_all = pd.read_csv('examples/stock_px_2.csv',parse_dates=True,index_col=0)
close_px_all
|            | AAPL   | MSFT  | XOM   | SPX     |
|------------|--------|-------|-------|---------|
| 2003-01-02 | 7.40   | 21.11 | 29.22 | 909.03  |
| 2003-01-03 | 7.45   | 21.14 | 29.24 | 908.59  |
| 2003-01-06 | 7.45   | 21.52 | 29.96 | 929.01  |
| 2003-01-07 | 7.43   | 21.93 | 28.95 | 922.93  |
| 2003-01-08 | 7.28   | 21.31 | 28.83 | 909.93  |
| ...        | ...    | ...   | ...   | ...     |
| 2011-10-10 | 388.81 | 26.94 | 76.28 | 1194.89 |
| 2011-10-11 | 400.29 | 27.00 | 76.27 | 1195.54 |
| 2011-10-12 | 402.19 | 26.96 | 77.16 | 1207.25 |
| 2011-10-13 | 408.43 | 27.18 | 76.37 | 1203.66 |
| 2011-10-14 | 422.00 | 27.27 | 78.11 | 1224.58 |

2214 rows × 4 columns
close_px = close_px_all[['AAPL','MSFT','XOM']]
close_px
|            | AAPL   | MSFT  | XOM   |
|------------|--------|-------|-------|
| 2003-01-02 | 7.40   | 21.11 | 29.22 |
| 2003-01-03 | 7.45   | 21.14 | 29.24 |
| 2003-01-06 | 7.45   | 21.52 | 29.96 |
| 2003-01-07 | 7.43   | 21.93 | 28.95 |
| 2003-01-08 | 7.28   | 21.31 | 28.83 |
| ...        | ...    | ...   | ...   |
| 2011-10-10 | 388.81 | 26.94 | 76.28 |
| 2011-10-11 | 400.29 | 27.00 | 76.27 |
| 2011-10-12 | 402.19 | 26.96 | 77.16 |
| 2011-10-13 | 408.43 | 27.18 | 76.37 |
| 2011-10-14 | 422.00 | 27.27 | 78.11 |

2214 rows × 3 columns
close_px = close_px.resample('B').ffill()
close_px
|            | AAPL   | MSFT  | XOM   |
|------------|--------|-------|-------|
| 2003-01-02 | 7.40   | 21.11 | 29.22 |
| 2003-01-03 | 7.45   | 21.14 | 29.24 |
| 2003-01-06 | 7.45   | 21.52 | 29.96 |
| 2003-01-07 | 7.43   | 21.93 | 28.95 |
| 2003-01-08 | 7.28   | 21.31 | 28.83 |
| ...        | ...    | ...   | ...   |
| 2011-10-10 | 388.81 | 26.94 | 76.28 |
| 2011-10-11 | 400.29 | 27.00 | 76.27 |
| 2011-10-12 | 402.19 | 26.96 | 77.16 |
| 2011-10-13 | 408.43 | 27.18 | 76.37 |
| 2011-10-14 | 422.00 | 27.27 | 78.11 |

2292 rows × 3 columns
close_px.AAPL.plot()
close_px.AAPL.rolling(250).mean().plot()
app1_std250 = close_px.AAPL.rolling(250,min_periods=10).std()
app1_std250[5:12]
2003-01-09 NaN
2003-01-10 NaN
2003-01-13 NaN
2003-01-14 NaN
2003-01-15 0.077496
2003-01-16 0.074760
2003-01-17 0.112368
Freq: B, Name: AAPL, dtype: float64
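The `min_periods` behavior above can be checked on a simple integer series (values are illustrative, not stock prices):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(20, dtype=float))

# With min_periods=10 the first 9 results are NaN; from the 10th point
# on, the mean is computed over up to 15 trailing observations.
r = s.rolling(15, min_periods=10).mean()
```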
app1_std250.plot()
expanding_mean = app1_std250.expanding().mean()
close_px.rolling(60).mean().plot(logy=True)
close_px.rolling('20D').mean()
|            | AAPL       | MSFT      | XOM       |
|------------|------------|-----------|-----------|
| 2003-01-02 | 7.400000   | 21.110000 | 29.220000 |
| 2003-01-03 | 7.425000   | 21.125000 | 29.230000 |
| 2003-01-06 | 7.433333   | 21.256667 | 29.473333 |
| 2003-01-07 | 7.432500   | 21.425000 | 29.342500 |
| 2003-01-08 | 7.402000   | 21.402000 | 29.240000 |
| ...        | ...        | ...       | ...       |
| 2011-10-10 | 389.351429 | 25.602143 | 72.527857 |
| 2011-10-11 | 388.505000 | 25.674286 | 72.835000 |
| 2011-10-12 | 388.531429 | 25.810000 | 73.400714 |
| 2011-10-13 | 388.826429 | 25.961429 | 73.905000 |
| 2011-10-14 | 391.038000 | 26.048667 | 74.185333 |

2292 rows × 3 columns
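The `'20D'` window is measured in calendar time rather than a fixed number of observations, so irregular gaps in the index change how many points each window contains. A small sketch with a deliberately irregular index (hypothetical dates and values):

```python
import pandas as pd

# Irregular dates: a long gap before the final observation.
idx = pd.to_datetime(['2003-01-02', '2003-01-03', '2003-01-06', '2003-01-30'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# count() shows how many observations fall inside each 20-day window;
# the last window contains only its own point because of the gap.
counts = s.rolling('20D').count()
```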
11.7.1 Exponentially Weighted Functions
- An alternative to a static window size with equally weighted observations is to specify a constant decay factor, which gives more weight to more recent observations
appl_px = close_px.AAPL['2006':'2007']
ma30 = appl_px.rolling(30,min_periods=20).mean()
ewma30 = appl_px.ewm(span=30).mean()
import matplotlib.pyplot as plt
ma30.plot(style='k--',label='simple ma')
ewma30.plot(style='k-',label='ew ma')
plt.legend()
<matplotlib.legend.Legend at 0x259c45d36d0>
(figure: simple moving average versus exponentially weighted moving average of AAPL)
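For ewm, `span` is just one parameterization of the decay factor: `alpha = 2 / (span + 1)`, so `span=30` corresponds to `alpha ≈ 0.0645` and either spelling gives the same result:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(50, dtype=float))

# span and alpha are two parameterizations of the same decay:
# alpha = 2 / (span + 1)
by_span = s.ewm(span=30).mean()
by_alpha = s.ewm(alpha=2 / 31).mean()
```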
11.7.2 Binary Moving Window Functions
spx_px = close_px_all['SPX']
spx_rets = spx_px.pct_change()
returns = close_px.pct_change()
corr = returns.AAPL.rolling(125,min_periods=100).corr(spx_rets)
corr.plot()
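The same rolling-correlation pattern can be checked on synthetic data (series and coefficients are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.standard_normal(300))
# y is mostly a scaled copy of x plus a little noise, so the true
# correlation is about 0.98.
y = 0.5 * x + 0.1 * pd.Series(rng.standard_normal(300))

# The first 99 results are NaN because the 100-observation window
# is not yet full.
corr = x.rolling(100).corr(y)
```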
11.7.3 User-Defined Moving Window Functions
from scipy.stats import percentileofscore
score_at_2percent = lambda x:percentileofscore(x,0.2)
result = returns.AAPL.rolling(250).apply(score_at_2percent)
result.plot()
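If SciPy is unavailable, a stand-in for `percentileofscore` can be written directly. This sketch scores 0 rather than 0.2 and uses a simple fraction-at-or-below definition, which differs slightly from `percentileofscore`'s default `'rank'` method:

```python
import numpy as np
import pandas as pd

# Share of window values at or below zero, as a percentage.
score_at_zero = lambda x: 100.0 * (x <= 0).mean()

s = pd.Series([-1.0, 1.0, -2.0, 3.0, 0.5, -0.5])
# raw=True passes plain NumPy arrays to the function, which is faster.
result = s.rolling(4).apply(score_at_zero, raw=True)
```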
Chapter 12: Advanced pandas
12.1 Categorical Data
12.1.1 Background and Motivation
import numpy as np
import pandas as pd
values = pd.Series(['apple','orange','apple','apple']*2)
values
0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
dtype: object
pd.unique(values)
array(['apple', 'orange'], dtype=object)
pd.value_counts(values)
apple 6
orange 2
dtype: int64
values = pd.Series([0,1,0,0]*2)
values
0 0
1 1
2 0
3 0
4 0
5 1
6 0
7 0
dtype: int64
dim = pd.Series(['apple', 'orange'])
dim
0 apple
1 orange
dtype: object
# The take method can be used to restore the original Series of strings
dim.take(values)
0 apple
1 orange
0 apple
0 apple
0 apple
1 orange
0 apple
0 apple
dtype: object
# This representation using integers is called the categorical or dictionary-encoded representation
12.1.2 Categorical Type in pandas
- pandas has a special Categorical type for holding data that uses the integer-based categorical representation, or encoding.
fruits = ['apple','orange','apple','apple']*2
fruits
['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
N = len(fruits)
N
8
df = pd.DataFrame({'fruit':fruits,'basket_id':np.arange(N),'count':np.random.randint(3,15,size=N),'weight':np.random.uniform(0,4,size=N)})
df

|   | fruit  | basket_id | count | weight   |
|---|--------|-----------|-------|----------|
| 0 | apple  | 0         | 14    | 3.004288 |
| 1 | orange | 1         | 9     | 1.432892 |
| 2 | apple  | 2         | 10    | 2.859906 |
| 3 | apple  | 3         | 6     | 2.328174 |
| 4 | apple  | 4         | 12    | 3.558704 |
| 5 | orange | 5         | 13    | 1.199922 |
| 6 | apple  | 6         | 5     | 0.090920 |
| 7 | apple  | 7         | 8     | 2.336066 |
# df['fruit'] is an array of Python string objects.
# We can convert it to a Categorical object by calling a conversion function
fruit_cat = df['fruit'].astype('category')
fruit_cat
# The values of fruit_cat are not a NumPy array but an instance of pandas.Categorical
0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
c = fruit_cat.values
c
['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']
type(c)
pandas.core.arrays.categorical.Categorical
# The Categorical object has categories and codes attributes
c.categories
Index(['apple', 'orange'], dtype='object')
c.codes
array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
# You can convert a DataFrame column to Categorical by assigning the converted result
df['fruit'] = df['fruit'].astype('category')
df.fruit
0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
# You can also create pandas.Categorical directly from other Python sequence types
my_categories = pd.Categorical(['foo','bar','baz','foo','bar'])
my_categories
['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']
# If you have obtained categorical encoded data from another source, you can use the from_codes constructor
categories = ['foo','bar','baz']
codes = [0,1,2,0,0,1]
my_cats_2 = pd.Categorical.from_codes(codes,categories)
my_cats_2
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']
# When using from_codes or any other constructor, you can specify a meaningful ordering for the categories
ordered_cat = pd.Categorical.from_codes(codes,categories,ordered=True)
ordered_cat
# The output [foo < bar < baz] indicates that 'foo' precedes 'bar' in the ordering, and so on
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
# An unordered categorical instance can be made ordered with as_ordered
my_cats_2.as_ordered()
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
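Ordering matters for computations: an ordered categorical supports comparisons, min/max, and sorting by the declared category order rather than alphabetically. A minimal sketch:

```python
import pandas as pd

# 'low' < 'medium' < 'high' by declaration, not by alphabet.
vals = pd.Categorical(['low', 'high', 'medium', 'low'],
                      categories=['low', 'medium', 'high'],
                      ordered=True)
s = pd.Series(vals)
# sort_values follows the category order.
ordered = s.sort_values()
```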
12.1.3 Computations with Categoricals
np.random.seed(12345)
draws = np.random.randn(1000)
draws[:5]
array([-0.20470766, 0.47894334, -0.51943872, -0.5557303 , 1.96578057])
# Let's compute a quartile binning of this data and extract some statistics
bins = pd.cut(draws,4)
bins
[(-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (0.489, 2.208], ..., (-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (0.489, 2.208], (0.489, 2.208]]
Length: 1000
Categories (4, interval[float64]): [(-2.956, -1.23] < (-1.23, 0.489] < (0.489, 2.208] < (2.208, 3.928]]
bins = pd.qcut(draws,4,labels=['Q1','Q2','Q3','Q4'])
bins
['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q3', 'Q2', 'Q1', 'Q3', 'Q4']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
bins = pd.Series(bins,name='quartile')
results = (pd.Series(draws).groupby(bins).agg(['count','min','max']).reset_index())
results
|   | quartile | count | min       | max       |
|---|----------|-------|-----------|-----------|
| 0 | Q1       | 250   | -2.949343 | -0.685484 |
| 1 | Q2       | 250   | -0.683066 | -0.010115 |
| 2 | Q3       | 250   | -0.010032 | 0.628894  |
| 3 | Q4       | 250   | 0.634238  | 3.927528  |
results['quartile']
0 Q1
1 Q2
2 Q3
3 Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
12.1.3.1 Better Performance with Categoricals
N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo','bar','baz','qux']*(N//4))
categories = labels.astype('category')
# labels uses significantly more memory than categories
labels.memory_usage()
80000128
categories.memory_usage()
10000332
# The conversion to category is not free, of course, but it is a one-time cost
%timeit labels.astype('category')
299 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
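The shallow `memory_usage` numbers above actually understate the gap; with `deep=True` the object Series also counts per-string overhead (a sketch re-creating the same data):

```python
import pandas as pd

N = 1_000_000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
cats = labels.astype('category')

# deep=True adds the size of each string object, not just the 8-byte
# pointers; the categorical stores one int8 code per row plus a tiny
# categories index, so the ratio grows well beyond 8x.
ratio = labels.memory_usage(deep=True) / cats.memory_usage(deep=True)
```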
12.1.4 Categorical Methods

| Method                   | Description                                                                           |
|--------------------------|---------------------------------------------------------------------------------------|
| add_categories           | Append new (unused) categories at the end of the existing ones                         |
| as_ordered               | Make the categories ordered                                                            |
| as_unordered             | Make the categories unordered                                                          |
| remove_categories        | Remove categories, setting any removed values to null                                  |
| remove_unused_categories | Remove any categories that do not appear in the data                                   |
| rename_categories        | Replace the categories with a new set of names; the number of categories is unchanged  |
| reorder_categories       | Like rename_categories, but the result has ordered categories                          |
| set_categories           | Replace the categories with a new set; can add or remove categories                    |
s = pd.Series(['a','b','c','d']*2)
cat_s = s.astype('category')
cat_s
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
# The special attribute cat provides access to the categorical methods
cat_s.cat.codes
0 0
1 1
2 2
3 3
4 0
5 1
6 2
7 3
dtype: int8
cat_s.cat.categories
Index(['a', 'b', 'c', 'd'], dtype='object')
# The set_categories method can be used to change the categories
actual_categories = ['a','b','c','d','e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
cat_s.value_counts()
a 2
b 2
c 2
d 2
dtype: int64
cat_s2.value_counts()
a 2
b 2
c 2
d 2
e 0
dtype: int64
# The remove_unused_categories method can be used to trim unobserved categories
cat_s3 = cat_s[cat_s2.isin(['a','b'])]
cat_s3
0 a
1 b
4 a
5 b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
cat_s3.cat.remove_unused_categories()
0 a
1 b
4 a
5 b
dtype: category
Categories (2, object): ['a', 'b']
12.1.4.1 Creating Dummy Variables for Modeling
- When using statistics or machine learning tools, you will often transform categorical data into dummy variables, also known as one-hot encoding.
- This produces a DataFrame with a column for each distinct category; these columns contain 1 for occurrences of a given category and 0 otherwise.
cat_s = pd.Series(['a','b','c','d']*2,dtype='category')
cat_s
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
# The pandas.get_dummies function converts one-dimensional categorical data into a DataFrame of dummy variables
pd.get_dummies(cat_s)
|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 |
| 7 | 0 | 0 | 0 | 1 |
12.2 Advanced GroupBy Use
12.2.1 Group Transforms and "Unwrapped" GroupBys
- The built-in transform method:
- transform can produce a scalar value to be broadcast to the shape of the group
- transform can produce an object of the same shape as the input group
- transform must not mutate its input
df = pd.DataFrame({'key':['a','b','c']*4,'value':np.arange(12.)})
df
|    | key | value |
|----|-----|-------|
| 0  | a   | 0.0   |
| 1  | b   | 1.0   |
| 2  | c   | 2.0   |
| 3  | a   | 3.0   |
| 4  | b   | 4.0   |
| 5  | c   | 5.0   |
| 6  | a   | 6.0   |
| 7  | b   | 7.0   |
| 8  | c   | 8.0   |
| 9  | a   | 9.0   |
| 10 | b   | 10.0  |
| 11 | c   | 11.0  |
g = df.groupby('key').value
g.mean()
key
a 4.5
b 5.5
c 6.5
Name: value, dtype: float64
g.transform(lambda x:x.mean())
0 4.5
1 5.5
2 6.5
3 4.5
4 5.5
5 6.5
6 4.5
7 5.5
8 6.5
9 4.5
10 5.5
11 6.5
Name: value, dtype: float64
# For built-in aggregation functions, we can pass a string alias, as with the GroupBy agg method
g.transform('mean')
0 4.5
1 5.5
2 6.5
3 4.5
4 5.5
5 6.5
6 4.5
7 5.5
8 6.5
9 4.5
10 5.5
11 6.5
Name: value, dtype: float64
g.transform(lambda x:x*2)
0 0.0
1 2.0
2 4.0
3 6.0
4 8.0
5 10.0
6 12.0
7 14.0
8 16.0
9 18.0
10 20.0
11 22.0
Name: value, dtype: float64
# We can compute the ranks in descending order for each group
g.transform(lambda x:x.rank(ascending=False))
0 4.0
1 4.0
2 4.0
3 3.0
4 3.0
5 3.0
6 2.0
7 2.0
8 2.0
9 1.0
10 1.0
11 1.0
Name: value, dtype: float64
# Consider a group transformation function composed of simple aggregations
def normalize(x):
    return (x - x.mean()) / x.std()
g.transform(normalize)
0 -1.161895
1 -1.161895
2 -1.161895
3 -0.387298
4 -0.387298
5 -0.387298
6 0.387298
7 0.387298
8 0.387298
9 1.161895
10 1.161895
11 1.161895
Name: value, dtype: float64
g.apply(normalize)
0 -1.161895
1 -1.161895
2 -1.161895
3 -0.387298
4 -0.387298
5 -0.387298
6 0.387298
7 0.387298
8 0.387298
9 1.161895
10 1.161895
11 1.161895
Name: value, dtype: float64
g.transform('mean')
0 4.5
1 5.5
2 6.5
3 4.5
4 5.5
5 6.5
6 4.5
7 5.5
8 6.5
9 4.5
10 5.5
11 6.5
Name: value, dtype: float64
normalized = (df['value']-g.transform('mean'))/g.transform('std')
normalized
0 -1.161895
1 -1.161895
2 -1.161895
3 -0.387298
4 -0.387298
5 -0.387298
6 0.387298
7 0.387298
8 0.387298
9 1.161895
10 1.161895
11 1.161895
Name: value, dtype: float64
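The equivalence of the wrapped and unwrapped versions can be verified directly (re-creating the same df as above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': np.arange(12.)})
g = df.groupby('key')['value']

# "Unwrapped" group operation: broadcast group statistics with
# transform, then do the arithmetic as ordinary vectorized operations.
unwrapped = (df['value'] - g.transform('mean')) / g.transform('std')
wrapped = g.transform(lambda x: (x - x.mean()) / x.std())
```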
12.2.2 Grouped Time Resampling
- For time series data, the resample method is semantically a group operation based on a time partitioning
N = 15
times = pd.date_range('2017-05-20 00:00',freq='1min',periods=N)
times
DatetimeIndex(['2017-05-20 00:00:00', '2017-05-20 00:01:00',
               '2017-05-20 00:02:00', '2017-05-20 00:03:00',
               '2017-05-20 00:04:00', '2017-05-20 00:05:00',
               '2017-05-20 00:06:00', '2017-05-20 00:07:00',
               '2017-05-20 00:08:00', '2017-05-20 00:09:00',
               '2017-05-20 00:10:00', '2017-05-20 00:11:00',
               '2017-05-20 00:12:00', '2017-05-20 00:13:00',
               '2017-05-20 00:14:00'],
              dtype='datetime64[ns]', freq='T')
df = pd.DataFrame({'time':times,'value':np.arange(N)})
df
|    | time                | value |
|----|---------------------|-------|
| 0  | 2017-05-20 00:00:00 | 0     |
| 1  | 2017-05-20 00:01:00 | 1     |
| 2  | 2017-05-20 00:02:00 | 2     |
| 3  | 2017-05-20 00:03:00 | 3     |
| 4  | 2017-05-20 00:04:00 | 4     |
| 5  | 2017-05-20 00:05:00 | 5     |
| 6  | 2017-05-20 00:06:00 | 6     |
| 7  | 2017-05-20 00:07:00 | 7     |
| 8  | 2017-05-20 00:08:00 | 8     |
| 9  | 2017-05-20 00:09:00 | 9     |
| 10 | 2017-05-20 00:10:00 | 10    |
| 11 | 2017-05-20 00:11:00 | 11    |
| 12 | 2017-05-20 00:12:00 | 12    |
| 13 | 2017-05-20 00:13:00 | 13    |
| 14 | 2017-05-20 00:14:00 | 14    |
df.set_index('time').resample('5min').count()
| time                | value |
|---------------------|-------|
| 2017-05-20 00:00:00 | 5     |
| 2017-05-20 00:05:00 | 5     |
| 2017-05-20 00:10:00 | 5     |
df2 = pd.DataFrame({'time':times.repeat(3),'key':np.tile(['a','b','c'],N),'value':np.arange(N*3.0)})
df2[:7]
|   | time                | key | value |
|---|---------------------|-----|-------|
| 0 | 2017-05-20 00:00:00 | a   | 0.0   |
| 1 | 2017-05-20 00:00:00 | b   | 1.0   |
| 2 | 2017-05-20 00:00:00 | c   | 2.0   |
| 3 | 2017-05-20 00:01:00 | a   | 3.0   |
| 4 | 2017-05-20 00:01:00 | b   | 4.0   |
| 5 | 2017-05-20 00:01:00 | c   | 5.0   |
| 6 | 2017-05-20 00:02:00 | a   | 6.0   |
- pd.TimeGrouper can no longer be called directly in recent pandas versions
- https://blog.csdn.net/weixin_51701683/article/details/110710765
- from pandas.core import resample
from pandas.core import resample
# To do the same resampling for each value of 'key', we can use the pandas TimeGrouper object
# One limitation of using TimeGrouper is that the times must be the index of the Series or DataFrame.
time_key = resample.TimeGrouper('5min')
time_key
TimeGrouper(freq=<5 * Minutes>, axis=0, sort=True, closed='left', label='left', how='mean', convention='e', origin='start_day')
resampled = (df2.set_index('time').groupby(['key',time_key]).sum())
resampled
| key | time                | value |
|-----|---------------------|-------|
| a   | 2017-05-20 00:00:00 | 30.0  |
|     | 2017-05-20 00:05:00 | 105.0 |
|     | 2017-05-20 00:10:00 | 180.0 |
| b   | 2017-05-20 00:00:00 | 35.0  |
|     | 2017-05-20 00:05:00 | 110.0 |
|     | 2017-05-20 00:10:00 | 185.0 |
| c   | 2017-05-20 00:00:00 | 40.0  |
|     | 2017-05-20 00:05:00 | 115.0 |
|     | 2017-05-20 00:10:00 | 190.0 |
resampled.reset_index()
|   | key | time                | value |
|---|-----|---------------------|-------|
| 0 | a   | 2017-05-20 00:00:00 | 30.0  |
| 1 | a   | 2017-05-20 00:05:00 | 105.0 |
| 2 | a   | 2017-05-20 00:10:00 | 180.0 |
| 3 | b   | 2017-05-20 00:00:00 | 35.0  |
| 4 | b   | 2017-05-20 00:05:00 | 110.0 |
| 5 | b   | 2017-05-20 00:10:00 | 185.0 |
| 6 | c   | 2017-05-20 00:00:00 | 40.0  |
| 7 | c   | 2017-05-20 00:05:00 | 115.0 |
| 8 | c   | 2017-05-20 00:10:00 | 190.0 |
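In recent pandas, pd.Grouper is the supported replacement for the removed pd.TimeGrouper; re-creating df2 and grouping with it reproduces the result above:

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.0)})

# pd.Grouper(freq=...) groups by 5-minute buckets of the time index,
# combined here with the 'key' column.
time_key = pd.Grouper(freq='5min')
resampled = df2.set_index('time').groupby(['key', time_key]).sum()
```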
12.3 Techniques for Method Chaining
- In-place assignment may be faster than using assign, but assign enables more convenient method chaining
12.3.1 The pipe Method
- The expressions f(df) and df.pipe(f) are equivalent, but pipe makes chained invocation easier
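A minimal sketch of pipe with a hypothetical helper function add_col (not a pandas API):

```python
import pandas as pd

# Hypothetical helper: return a copy with one extra constant column.
def add_col(df, name, value):
    return df.assign(**{name: value})

df = pd.DataFrame({'a': [1, 2]})

# f(df) and df.pipe(f) are equivalent; pipe keeps the chain readable
# when the function is not a DataFrame method.
chained = df.pipe(add_col, 'b', 10).pipe(add_col, 'c', 20)
```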
12.4 Conclusion