pandas: groupby operations

Lab objective
Become proficient with groupby operations in pandas.

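Background
groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False)
Parameters:
by: what to group on (a column label, list, dict, function, tuple, or Series)
axis: the axis to split along (0 for rows, 1 for columns)
level: group by one or more index levels
sort: whether to sort the group keys; sort=True (the default) sorts them in ascending order, sort=False keeps the order in which the keys first appear
as_index: whether the group keys become the index of the result; the default is as_index=True
group_keys: when calling apply, add the group keys to the index so that each piece can be identified
squeeze: reduce the dimensionality of the return type if possible, otherwise return a consistent type
Note: this signature matches the older pandas release used in this lab. squeeze was removed in pandas 2.0, and axis has been deprecated since pandas 2.1.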
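
The different kinds of value accepted by by can be compared side by side. The following is a minimal sketch on hypothetical toy data (the frame, the mapping, and the key names are assumptions made for illustration, not part of the exercises):

```python
import pandas as pd

# Hypothetical toy data, not taken from the exercises
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

# by as a dict: each index label is mapped to a group name
mapping = {"a": "first", "b": "first", "c": "second", "d": "second"}
by_dict = df.groupby(mapping)["value"].sum()

# by as a function: it is called once per index label
by_func = df.groupby(lambda label: "early" if label in "ab" else "late")["value"].sum()

# by as a Series: its values are used as keys, aligned on the index
keys = pd.Series(["x", "y", "x", "y"], index=df.index)
by_series = df.groupby(keys)["value"].sum()

print(by_dict, by_func, by_series, sep="\n\n")
```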

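Grouping operations (split-apply-combine)
Grouping and aggregating data: what is the groupby technique?
In data analysis we often need to split the data and run a computation within each group, for example computing the average income of a city's working population by education level and age bracket.
groupby in pandas provides an efficient way to run such grouped computations.
We split the data by one or more categorical variables and then run the required computation on each piece separately.
The process can be understood as three steps:

1. Split the data (split)
2. Apply a function to each piece (apply)
3. Combine the results (combine)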
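
The three steps can be written out by hand and compared with a single groupby call. This is only a sketch on made-up data (the education and income columns are hypothetical); it illustrates the idea and says nothing about how pandas implements it internally:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "education": ["BSc", "MSc", "BSc", "MSc"],
    "income": [10, 20, 30, 40],
})

# 1. split: one sub-frame per distinct key
pieces = {key: df[df["education"] == key] for key in df["education"].unique()}
# 2. apply: compute the mean income inside each piece
means = {key: piece["income"].mean() for key, piece in pieces.items()}
# 3. combine: collect the per-group results into one Series
manual = pd.Series(means).sort_index()

# The same computation expressed with groupby
grouped = df.groupby("education")["income"].mean()

print(manual)
print(grouped)
```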

Lab environment
Python 3.6.1
Jupyter
Lab content
Practice groupby operations in pandas through worked examples

import pandas as pd
import numpy as np

1. Create a DataFrame df

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})
df

	A	B	C	D
0	foo	one	1.210314	0.704632
1	bar	one	-0.062250	1.446661
2	foo	two	-1.389148	0.741158
3	bar	three	-1.095487	1.759002
4	foo	two	-0.964502	-0.564613
5	bar	two	0.829750	-2.951202
6	foo	one	-0.516992	-0.058681
7	foo	three	-2.728634	0.250330

2. Group df by column A

df.groupby('A')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000004E64130>

3. Group df by columns A and B

df.groupby(['A','B'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000004E68F10>
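
Calling groupby only returns a grouping object, as the output above shows; nothing is aggregated yet. A small sketch of how such an object can be inspected, using a hypothetical fixed frame instead of the random one above:

```python
import pandas as pd

# Hypothetical fixed data so that the result is reproducible
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "one"],
    "C": [1, 2, 3, 4],
})

by_a = df.groupby("A")
by_ab = df.groupby(["A", "B"])

# ngroups reports how many distinct keys were found
n_a = by_a.ngroups
n_ab = by_ab.ngroups

# with two grouping columns every key is a tuple
keys_ab = sorted(by_ab.groups)

print(n_a, n_ab, keys_ab)
```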

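4. Grouping with a custom function: define a function and pass it to groupby so that the columns of df are grouped according to the condition the function encodes.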

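def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

# axis=1 applies the function to the column labels instead of the row labels
# (axis=1 is deprecated since pandas 2.1)
grouped = df.groupby(get_letter_type, axis=1)
for group in grouped:
    print(group)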

('consonant',        B         C         D
0    one  1.210314  0.704632
1    one -0.062250  1.446661
2    two -1.389148  0.741158
3  three -1.095487  1.759002
4    two -0.964502 -0.564613
5    two  0.829750 -2.951202
6    one -0.516992 -0.058681
7  three -2.728634  0.250330)
('vowel',      A
0  foo
1  bar
2  foo
3  bar
4  foo
5  bar
6  foo
7  foo)
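
Newer pandas releases warn about axis=1 in groupby. One possible workaround, sketched here under the assumption that grouping the rows of the transposed frame is acceptable, is to call groupby on df.T. The small frame below is hypothetical:

```python
import pandas as pd

def get_letter_type(letter):
    # classify a column label as vowel or consonant
    return "vowel" if letter.lower() in "aeiou" else "consonant"

# Hypothetical small frame for illustration
df = pd.DataFrame({"A": ["foo", "bar"], "B": ["one", "two"], "C": [1.0, 2.0]})

# The columns of df are the rows of df.T, so the function now sees column labels
columns_by_type = {
    name: list(part.index) for name, part in df.T.groupby(get_letter_type)
}

print(columns_by_type)
```

Each part can be transposed back with part.T if the original orientation is needed.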

5. Create a Series named s and group it by its index with groupby. This returns a SeriesGroupBy object, on which we then call first, last, and sum.

lst = [1, 2, 3, 1, 2, 3]
s = pd.Series([1, 2, 3, 10, 20, 30], lst)
s

1     1
2     2
3     3
1    10
2    20
3    30
dtype: int64

grouped = s.groupby(level=0)
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000000004E714C0>

# First value of each group
grouped.first()

1    1
2    2
3    3
dtype: int64

# Last value of each group
grouped.last()

1    10
2    20
3    30
dtype: int64

# Sum of each group
grouped.sum()

1    11
2    22
3    33
dtype: int64

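6. Sorting groups: by default groupby sorts the group keys in ascending order. Passing sort=False turns this off, so the groups appear in the order in which their keys first occur in the data (this is not a descending sort).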

df2=pd.DataFrame({'X':['B','B','A','A'],'Y':[1,2,3,4]})
df2

	X	Y
0	B	1
1	B	2
2	A	3
3	A	4

# Group df2 by column X and sum each group
df2.groupby(['X']).sum()

	Y
X
A	7
B	3

# Group df2 by column X without sorting the keys, and sum each group
df2.groupby(['X'],sort=False).sum()

	Y
X
B	3
A	7
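
The effect of sort is easier to see with three keys. The sketch below uses a hypothetical frame and also shows one way to get a descending order, by sorting the aggregated result afterwards:

```python
import pandas as pd

# Hypothetical data: keys first appear in the order B, A, C
df2 = pd.DataFrame({"X": ["B", "B", "A", "A", "C"], "Y": [1, 2, 3, 4, 5]})

ascending = df2.groupby("X").sum()               # keys sorted: A, B, C
appearance = df2.groupby("X", sort=False).sum()  # order of first appearance: B, A, C
descending = df2.groupby("X").sum().sort_index(ascending=False)  # C, B, A

print(ascending, appearance, descending, sep="\n\n")
```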

7. Use the get_group method to retrieve the rows of a single group.

df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3

	X	Y
0	A	1
1	B	4
2	A	3
3	B	2

# Group df3 by column X and get the rows of group A
df3.groupby(['X']).get_group('A')

	X	Y
0	A	1
2	A	3

# Group df3 by column X and get the rows of group B
df3.groupby(['X']).get_group('B')

	X	Y
1	B	4
3	B	2
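
For a single key, get_group selects the same rows as a boolean mask on the grouping column. A short sketch of that comparison, reusing the values of df3:

```python
import pandas as pd

df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

# Rows of group A via the groupby object
via_group = df3.groupby("X").get_group("A")

# The same rows via a boolean mask
via_mask = df3[df3["X"] == "A"]

print(via_group)
print(via_mask)
```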

8. Use the groups attribute to see which index labels belong to each group.

df.groupby('A').groups

{'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}

df.groupby(['A','B']).groups

{('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
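
groups maps each key to index labels, while the related indices attribute maps each key to integer positions. The difference only shows when the index is not 0..n-1, so this sketch uses a hypothetical frame with string labels:

```python
import pandas as pd

# Hypothetical frame with a non-default index
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"]}, index=["w", "x", "y", "z"])
grouped = df.groupby("A")

# groups: key -> index labels of the rows in that group
labels = {key: list(idx) for key, idx in grouped.groups.items()}

# indices: key -> integer positions of the rows in that group
positions = {key: pos.tolist() for key, pos in grouped.indices.items()}

print(labels)
print(positions)
```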

9. List the methods available on a groupby object

1) Group df by column A with groupby and assign the result to grouped

grouped=df.groupby(['A'])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000008070B80>

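2) In an interactive session, type grouped. (note the trailing ".") and press Tab to list the available methods. help(grouped) prints the documentation for all of them.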

help(grouped)
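
Outside an interactive session the same list can be approximated with dir(). This is a sketch on a hypothetical frame; the exact set of names depends on the installed pandas version:

```python
import pandas as pd

# Hypothetical frame for illustration
df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1, 2, 3]})
grouped = df.groupby("A")

# Public attribute names of the groupby object (methods, properties, column names)
public = [name for name in dir(grouped) if not name.startswith("_")]

print(public)
```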

10. Grouping on a MultiIndex: create a Series with a two-level index, then group and sum it in two ways (by level number and by level name).

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
arrays

[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

index=pd.MultiIndex.from_arrays(arrays,names=['first','second'])
index

MultiIndex([('bar', 'one'),('bar', 'two'),('baz', 'one'),('baz', 'two'),('foo', 'one'),('foo', 'two'),('qux', 'one'),('qux', 'two')],names=['first', 'second'])

s=pd.Series(np.random.randn(8),index=index)
s

first  second
bar    one       0.528530
       two       0.083659
baz    one      -1.561120
       two      -1.276969
foo    one      -0.487720
       two       0.339357
qux    one       0.198976
       two      -0.379343
dtype: float64

s.groupby(level=0).sum()

first
bar    0.612189
baz   -2.838089
foo   -0.148363
qux   -0.180368
dtype: float64

s.groupby(level='second').sum()

second
one   -1.321335
two   -1.233296
dtype: float64

11. Grouping by several levels: group s by first and second and sum.

s.groupby(level=['first', 'second']).sum()

first  second
bar    one       0.528530
       two       0.083659
baz    one      -1.561120
       two      -1.276969
foo    one      -0.487720
       two       0.339357
qux    one       0.198976
       two      -0.379343
dtype: float64

12. Grouping by an index level and a column: create a DataFrame df, then group it by one index level together with one column.

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3], 'B': np.arange(8)},index=index)
df

		A	B
first	second
bar	one	1	0
bar	two	1	1
baz	one	1	2
baz	two	1	3
foo	one	2	4
foo	two	2	5
qux	one	3	6
qux	two	3	7

df.groupby([pd.Grouper(level=1),'A']).sum()

		B
second	A
one	1	2
	2	4
	3	6
two	1	4
	2	5
	3	7
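
When the index levels are named, the level name can also be passed as a plain string next to the column name. The sketch below, on a smaller hypothetical frame, compares that spelling with the pd.Grouper one:

```python
import pandas as pd

# Hypothetical smaller frame with a named two-level index
arrays = [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [0, 1, 2, 3]}, index=index)

# Explicit: an index level through pd.Grouper plus a column label
with_grouper = df.groupby([pd.Grouper(level="second"), "A"]).sum()

# Shorter: the level name and the column label as plain strings
with_name = df.groupby(["second", "A"]).sum()

print(with_grouper)
print(with_name)
```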

13. Group df by column A, select column C of the grouped object as grouped_C, and count the entries in each group.

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})
df

	A	B	C	D
0	foo	one	0.240436	0.178188
1	bar	one	0.078877	-0.667510
2	foo	two	0.287559	-1.029024
3	bar	three	0.275751	0.685817
4	foo	two	-0.469280	-1.583382
5	bar	two	0.182907	-0.306387
6	foo	one	-0.930772	0.231160
7	foo	three	-0.826608	1.170842

grouped=df.groupby(['A'])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000009540760>

grouped_C=grouped['C']
grouped_C

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000000095350D0>

grouped_C.count()

A
bar    3
foo    5
Name: C, dtype: int64

14. Take column C of the df created above, group it by the values of column A, and sum.

df['C'].groupby(df['A']).sum()

A
bar    0.537535
foo   -1.698664
Name: C, dtype: float64
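
The same result can be reached by grouping the frame first and selecting the column afterwards. A sketch on hypothetical fixed values:

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]})

# Select the column first, then group it by another Series
series_first = df["C"].groupby(df["A"]).sum()

# Group the frame first, then select the column
select_after = df.groupby("A")["C"].sum()

print(series_first)
print(select_after)
```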

15. Iterating over groups: group df by columns A and B; each group name is then a tuple.

for name, group in df.groupby(['A', 'B']):
    print(name)
    print(group)

('bar', 'one')
     A    B         C        D
1  bar  one  0.078877 -0.66751
('bar', 'three')
     A      B         C         D
3  bar  three  0.275751  0.685817
('bar', 'two')
     A    B         C         D
5  bar  two  0.182907 -0.306387
('foo', 'one')
     A    B         C         D
0  foo  one  0.240436  0.178188
6  foo  one -0.930772  0.231160
('foo', 'three')
     A      B         C         D
7  foo  three -0.826608  1.170842
('foo', 'two')
     A    B         C         D
2  foo  two  0.287559 -1.029024
4  foo  two -0.469280 -1.583382
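
Iteration is also handy for collecting something about each group instead of printing it. A small sketch on hypothetical data that records the shape of every group:

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "one"],
    "C": [1, 2, 3, 4],
})

shapes = {}
for name, group in df.groupby(["A", "B"]):
    # name is a tuple such as ('foo', 'one'); group is the matching sub-frame
    shapes[name] = group.shape

print(shapes)
```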

16. Group df by column A and view the 'bar' group.

df.groupby(['A']).get_group(('bar'))

	A	B	C	D
1	bar	one	0.078877	-0.667510
3	bar	three	0.275751	0.685817
5	bar	two	0.182907	-0.306387

17. Group df by columns A and B and view the group whose key is ('bar', 'one').

df.groupby(['A','B']).get_group(('bar','one'))

	A	B	C	D
1	bar	one	0.078877	-0.66751

Note: when grouping by two columns, the key passed to get_group must be a tuple containing one value for each grouping column.

2. Aggregation

1. Aggregation: group df by column A and use aggregate to sum each group.

grouped=df.groupby(['A'])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000008F95A00>

grouped.aggregate(np.sum)

	C	D
A
bar	0.537535	-0.288080
foo	-1.698664	-1.032218

2. Group df by columns A and B and use aggregate to sum each group.

grouped=df.groupby(['A','B'])
grouped.aggregate(np.sum)

		C	D
A	B
bar	one	0.078877	-0.667510
	three	0.275751	0.685817
	two	0.182907	-0.306387
foo	one	-0.690335	0.409347
	three	-0.826608	1.170842
	two	-0.181721	-2.612407

Note: as the result above shows, after aggregation the group keys become the new index. Pass as_index=False to keep them as ordinary columns instead.

3. With as_index=True, the keys used in groupby become the index of the resulting DataFrame. Group df by columns A and B, this time with as_index=False, then use aggregate to sum each group.

grouped=df.groupby(['A','B'],as_index=False)
grouped.aggregate(np.sum)

	A	B	C	D
0	bar	one	0.078877	-0.667510
1	bar	three	0.275751	0.685817
2	bar	two	0.182907	-0.306387
3	foo	one	-0.690335	0.409347
4	foo	three	-0.826608	1.170842
5	foo	two	-0.181721	-2.612407

4. Calling reset_index on the aggregated result gives the same output as as_index=False.

df.groupby(['A','B']).sum().reset_index()

	A	B	C	D
0	bar	one	0.078877	-0.667510
1	bar	three	0.275751	0.685817
2	bar	two	0.182907	-0.306387
3	foo	one	-0.690335	0.409347
4	foo	three	-0.826608	1.170842
5	foo	two	-0.181721	-2.612407
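
The equivalence of the two spellings can be checked directly. The sketch below does so on a hypothetical frame with fixed values:

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({
    "A": ["foo", "bar", "foo"],
    "B": ["one", "one", "one"],
    "C": [1.0, 2.0, 3.0],
})

# Keys kept as columns from the start
flat = df.groupby(["A", "B"], as_index=False).sum()

# Keys moved out of the index after aggregating
reset = df.groupby(["A", "B"]).sum().reset_index()

print(flat)
print(reset)
```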

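5. Aggregation: group df by columns A and B and use the size method to get the size of each group. It returns a Series whose index holds the group names and whose values are the group sizes.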

grouped=df.groupby(['A','B'])
grouped.size()

A    B
bar  one      1
     three    1
     two      1
foo  one      2
     three    1
     two      2
dtype: int64
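
size and count give the same numbers here only because the data has no missing values. A sketch with a hypothetical frame containing one NaN shows the difference: size counts rows, count counts non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group x
df = pd.DataFrame({"A": ["x", "x", "y"], "C": [1.0, np.nan, 3.0]})

sizes = df.groupby("A").size()         # number of rows per group
counts = df.groupby("A")["C"].count()  # number of non-missing C values per group

print(sizes)
print(counts)
```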

6. Aggregation: compute descriptive statistics for grouped.

grouped.describe()

		C								D
		count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
A	B
bar	one	1.0	0.078877	NaN	0.078877	0.078877	0.078877	0.078877	0.078877	1.0	-0.667510	NaN	-0.667510	-0.667510	-0.667510	-0.667510	-0.667510
	three	1.0	0.275751	NaN	0.275751	0.275751	0.275751	0.275751	0.275751	1.0	0.685817	NaN	0.685817	0.685817	0.685817	0.685817	0.685817
	two	1.0	0.182907	NaN	0.182907	0.182907	0.182907	0.182907	0.182907	1.0	-0.306387	NaN	-0.306387	-0.306387	-0.306387	-0.306387	-0.306387
foo	one	2.0	-0.345168	0.828169	-0.930772	-0.637970	-0.345168	-0.052366	0.240436	2.0	0.204674	0.037457	0.178188	0.191431	0.204674	0.217917	0.231160
	three	1.0	-0.826608	NaN	-0.826608	-0.826608	-0.826608	-0.826608	-0.826608	1.0	1.170842	NaN	1.170842	1.170842	1.170842	1.170842	1.170842
	two	2.0	-0.090861	0.535166	-0.469280	-0.280070	-0.090861	0.098349	0.287559	2.0	-1.306203	0.391990	-1.583382	-1.444793	-1.306203	-1.167614	-1.029024

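Note: aggregation functions reduce the dimensionality of the data, since each group collapses to one value per column. Commonly used ones include mean, sum, size, count, std, var, sem, describe, first, last, nth, min, and max.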
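
Several of these functions can be requested at once by name. A sketch on hypothetical fixed values, limited to five of the functions listed above:

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 5.0, 4.0]})

# One output column per requested aggregation, one row per group
summary = df.groupby("A")["C"].agg(["min", "max", "mean", "first", "last"])

print(summary)
```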

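7. Applying multiple functions to one grouped result: on a grouped Series, passing a list or dict of aggregation functions returns a DataFrame.

1) Group df by column A, then apply the three aggregations sum, mean, and std to column C of the grouped result.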

grouped=df.groupby('A')
grouped['C'].agg([np.sum,np.mean,np.std])

	sum	mean	std
A
bar	0.537535	0.179178	0.098490
foo	-1.698664	-0.339733	0.577332

2) Use the agg method to apply sum, mean, and std to every column of grouped.

grouped.agg([np.sum,np.mean,np.std])

	C			D
	sum	mean	std	sum	mean	std
A
bar	0.537535	0.179178	0.098490	-0.288080	-0.096027	0.700758
foo	-1.698664	-0.339733	0.577332	-1.032218	-0.206444	1.096466

3) Apply sum, mean, and std to column C of grouped, then use rename to change the column names of the result, replacing sum, mean, and std with foo, bar, and baz.

grouped['C'].agg([np.sum,np.mean,np.std]).rename(columns={'sum':'foo','mean':'bar','std':'baz'})

	foo	bar	baz
A
bar	0.537535	0.179178	0.098490
foo	-1.698664	-0.339733	0.577332
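
An alternative to agg followed by rename is named aggregation, where each keyword argument becomes an output column. This is a sketch on hypothetical fixed data and assumes pandas 0.25 or later, where this form is available for a grouped Series:

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]})

# keyword = output column name, value = aggregation to apply
named = df.groupby("A")["C"].agg(foo="sum", bar="mean", baz="std")

print(named)
```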

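8. Applying different aggregation functions to different columns of a DataFrame: pass agg a dict that maps each column name to the function to apply to it.

1) Aggregate grouped with sum on column C and std with ddof=1 on column D.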

grouped.agg({'C':np.sum,'D':lambda x:np.std(x,ddof=1)})

	C	D
A
bar	0.537535	0.700758
foo	-1.698664	1.096466

grouped.agg({'C':np.sum,'D':np.std})

	C	D
A
bar	0.537535	0.700758
foo	-1.698664	1.096466
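
The two results above are identical, which is consistent with pandas computing the sample standard deviation (ddof=1) by default. To make the role of ddof visible, this sketch compares it with the population standard deviation (ddof=0) on hypothetical data, computed explicitly on the underlying NumPy array:

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({"A": ["a", "a", "b", "b"], "D": [1.0, 3.0, 2.0, 6.0]})
grouped = df.groupby("A")["D"]

# pandas' own std: sample standard deviation, ddof=1
sample_std = grouped.agg("std")

# population standard deviation, ddof=0, on the raw array of each group
population_std = grouped.agg(lambda x: np.std(x.to_numpy(), ddof=0))

print(sample_std)
print(population_std)
```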

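9. transform: applies a transformation within each group rather than across the whole object; the result has the same index as the input.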

index = pd.date_range('10/1/1999', periods=1100)
index

DatetimeIndex(['1999-10-01', '1999-10-02', '1999-10-03', '1999-10-04','1999-10-05', '1999-10-06', '1999-10-07', '1999-10-08','1999-10-09', '1999-10-10',...'2002-09-25', '2002-09-26', '2002-09-27', '2002-09-28','2002-09-29', '2002-09-30', '2002-10-01', '2002-10-02','2002-10-03', '2002-10-04'],dtype='datetime64[ns]', length=1100, freq='D')

ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
ts

1999-10-01   -0.308860
1999-10-02    0.378851
1999-10-03   -0.420424
1999-10-04    1.147322
1999-10-05   -2.050742
                ...
2002-09-30   -4.233787
2002-10-01    0.569052
2002-10-02   -2.565333
2002-10-03   -2.945548
2002-10-04   -2.746983
Freq: D, Length: 1100, dtype: float64

ts = ts.rolling(window=100,min_periods=100).mean().dropna()
ts.head()

2000-01-08    0.615245
2000-01-09    0.610235
2000-01-10    0.605533
2000-01-11    0.590685
2000-01-12    0.603397
Freq: D, dtype: float64

ts.tail()

2002-09-30    0.538173
2002-10-01    0.529140
2002-10-02    0.508942
2002-10-03    0.481543
2002-10-04    0.474645
Freq: D, dtype: float64

key=lambda x:x.year
zscore=lambda x:(x-x.mean())/x.std()
ts.groupby(key).transform(zscore)

2000-01-08    0.450007
2000-01-09    0.416414
2000-01-10    0.384888
2000-01-11    0.285345
2000-01-12    0.370571
                ...
2002-09-30   -0.281554
2002-10-01   -0.332813
2002-10-02   -0.447437
2002-10-03   -0.602917
2002-10-04   -0.642061
Freq: D, Length: 1001, dtype: float64
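
One way to check that the standardization happened per year is to group the transformed values again: each year should then have mean 0 and standard deviation 1. A sketch on a hypothetical six-point series:

```python
import pandas as pd

# Hypothetical series covering two years
dates = pd.to_datetime([
    "2000-01-01", "2000-06-01", "2000-12-01",
    "2001-01-01", "2001-06-01", "2001-12-01",
])
ts = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0, 30.0], index=dates)

key = lambda x: x.year                        # group by the year of each timestamp
zscore = lambda x: (x - x.mean()) / x.std()   # standardize within one group

transformed = ts.groupby(key).transform(zscore)

# Per-year mean and std of the transformed values
check = transformed.groupby(key).agg(["mean", "std"])

print(transformed)
print(check)
```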

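10. filter: the filter method returns a subset of the original object.

1) Create a Series with the values [1,1,2,2,3,3,3] named sf, group sf by its own values, and keep only the groups whose sum is greater than 2.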

sf=pd.Series([1,1,2,2,3,3,3])
sf

0    1
1    1
2    2
3    2
4    3
5    3
6    3
dtype: int64

sf.groupby(sf).filter(lambda x:x.sum()>2)

2    2
3    2
4    3
5    3
6    3
dtype: int64
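
filter works the same way on a DataFrame: the function receives each group as a sub-frame and returns True to keep all of its rows. A sketch on hypothetical data that keeps only groups with more than one row:

```python
import pandas as pd

# Hypothetical fixed data: foo has two rows, bar and baz have one each
df = pd.DataFrame({"A": ["foo", "bar", "foo", "baz"], "C": [1, 2, 3, 4]})

# Keep the rows of every group that has more than one member
kept = df.groupby("A").filter(lambda g: len(g) > 1)

print(kept)
```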