Data Aggregation Techniques (Aggregation)

Import the required libraries

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

Load the data

df=pd.read_csv('../homework/city_weather.csv')

g=df.groupby('city')

Aggregate the data

Use the agg function to find the maximum of each column within each group

g.agg('max')

	date	temperature	wind
city
BJ	31/01/2016	19	5
GZ	31/07/2016	25	5
SH	27/03/2016	20	5
SZ	25/09/2016	20	4
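
Note that the date column in the output above is compared as text, not as a calendar date: for BJ the reported maximum is 31/01/2016 even though 13/03/2016 is later. The following is a minimal sketch of one way around this, assuming the dates are day-first strings as in the table. It uses a small hand-built frame (the names demo and parsed are hypothetical) rather than the CSV file.

```python
import pandas as pd

# Hypothetical miniature of some BJ rows; values copied from the table in this post.
demo = pd.DataFrame({
    "date": ["03/01/2016", "31/01/2016", "13/03/2016"],
    "city": ["BJ", "BJ", "BJ"],
})

# As plain strings, max() compares character by character.
max_as_text = demo.groupby("city")["date"].max()

# Parsed as day-first dates, max() returns the latest calendar date.
demo["parsed"] = pd.to_datetime(demo["date"], format="%d/%m/%Y")
max_as_date = demo.groupby("city")["parsed"].max()

print(max_as_text["BJ"])  # string maximum
print(max_as_date["BJ"])  # calendar maximum
```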

Use the agg function to find the minimum of each column within each group

g.agg('min')

	date	temperature	wind
city
BJ	03/01/2016	-3	2
GZ	14/08/2016	-1	2
SH	03/07/2016	-10	2
SZ	11/09/2016	-10	1
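
The maximum and minimum can also be requested in a single call, since agg accepts a list of function names. This is a minimal sketch on a small hand-built frame (demo is a hypothetical stand-in for the CSV data, and the numeric columns are selected explicitly); the result has two column levels, the original column name and the function name.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the weather data.
demo = pd.DataFrame({
    "city": ["BJ", "BJ", "BJ", "SZ", "SZ"],
    "temperature": [8, 12, -3, 20, -10],
    "wind": [5, 2, 3, 1, 4],
})

# One call, two aggregations per column.
both = demo.groupby("city")[["temperature", "wind"]].agg(["max", "min"])
print(both)

# Individual cells are addressed with a (column, function) pair.
print(both.loc["BJ", ("temperature", "max")])
```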

Define a custom aggregation function

def foo(attr):
    print(type(attr))
    return np.nan

Pass this function to the agg method

g.agg(foo)

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

	date	temperature	wind
city
BJ	NaN	NaN	NaN
GZ	NaN	NaN	NaN
SH	NaN	NaN	NaN
SZ	NaN	NaN	NaN
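
The printed types above show that agg hands the custom function one Series at a time, and that each Series is one column of one group. The sketch below makes that visible by recording the column name and length of every Series it receives. The frame demo and the function record are hypothetical, and the exact number of calls can differ between pandas versions, so only the set of distinct (name, length) pairs is collected.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of the weather data.
demo = pd.DataFrame({
    "city": ["BJ", "BJ", "BJ", "SZ", "SZ"],
    "temperature": [8, 12, -3, 20, -10],
    "wind": [5, 2, 3, 1, 4],
})

seen = set()

def record(attr):
    # attr is one column of one group, passed in as a Series.
    seen.add((attr.name, len(attr)))
    return np.nan

result = demo.groupby("city").agg(record)
print(sorted(seen))
print(result)
```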

Print each Series as well, to see what it contains

def foo(attr):
    print(type(attr))
    print(attr)
    return np.nan

g.agg(foo)

<class 'pandas.core.series.Series'>
Series([], Name: date, dtype: object)
<class 'pandas.core.series.Series'>
0    03/01/2016
1    17/01/2016
2    31/01/2016
3    14/02/2016
4    28/02/2016
5    13/03/2016
Name: date, dtype: object
<class 'pandas.core.series.Series'>
14    17/07/2016
15    31/07/2016
16    14/08/2016
17    28/08/2016
Name: date, dtype: object
<class 'pandas.core.series.Series'>
6     27/03/2016
7     10/04/2016
8     24/04/2016
9     08/05/2016
10    22/05/2016
11    05/06/2016
12    19/06/2016
13    03/07/2016
Name: date, dtype: object
<class 'pandas.core.series.Series'>
18    11/09/2016
19    25/09/2016
Name: date, dtype: object
<class 'pandas.core.series.Series'>
Series([], Name: temperature, dtype: int64)
<class 'pandas.core.series.Series'>
0     8
1    12
2    19
3    -3
4    19
5     5
Name: temperature, dtype: int64
<class 'pandas.core.series.Series'>
14    10
15    -1
16     1
17    25
Name: temperature, dtype: int64
<class 'pandas.core.series.Series'>
6     -4
7     19
8     20
9     17
10     4
11   -10
12     0
13    -9
Name: temperature, dtype: int64
<class 'pandas.core.series.Series'>
18    20
19   -10
Name: temperature, dtype: int64
<class 'pandas.core.series.Series'>
Series([], Name: wind, dtype: int64)
<class 'pandas.core.series.Series'>
0    5
1    2
2    2
3    3
4    2
5    3
Name: wind, dtype: int64
<class 'pandas.core.series.Series'>
14    2
15    5
16    5
17    4
Name: wind, dtype: int64
<class 'pandas.core.series.Series'>
6     4
7     3
8     3
9     3
10    2
11    4
12    5
13    5
Name: wind, dtype: int64
<class 'pandas.core.series.Series'>
18    1
19    4
Name: wind, dtype: int64

	date	temperature	wind
city
BJ	NaN	NaN	NaN
GZ	NaN	NaN	NaN
SH	NaN	NaN	NaN
SZ	NaN	NaN	NaN

Change the function to return the maximum of attr minus its minimum. For example, for wind in Beijing (BJ): max - min = 5 - 2 = 3

def foo(attr):
    return attr.max() - attr.min()

g.agg(foo)

	temperature	wind
city
BJ	22	3
GZ	26	3
SH	30	3
SZ	30	3
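
The date column does not appear in the result above because subtraction is not defined for strings. Depending on the pandas version, such a column is either dropped silently or the call raises an error, so it can be safer to select the numeric columns explicitly. A minimal sketch, assuming a hand-built frame (demo and value_range are hypothetical names) with the BJ and SZ rows from this post:

```python
import pandas as pd

# BJ and SZ rows copied from the table in this post.
demo = pd.DataFrame({
    "date": ["03/01/2016", "17/01/2016", "31/01/2016", "14/02/2016",
             "28/02/2016", "13/03/2016", "11/09/2016", "25/09/2016"],
    "city": ["BJ"] * 6 + ["SZ"] * 2,
    "temperature": [8, 12, 19, -3, 19, 5, 20, -10],
    "wind": [5, 2, 2, 3, 2, 3, 1, 4],
})

def value_range(attr):
    # Spread of one column within one group.
    return attr.max() - attr.min()

# Selecting the numeric columns first keeps the string column out of the call.
spread = demo.groupby("city")[["temperature", "wind"]].agg(value_range)
print(spread)
```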

View the original DataFrame

df

	date	city	temperature	wind
0	03/01/2016	BJ	8	5
1	17/01/2016	BJ	12	2
2	31/01/2016	BJ	19	2
3	14/02/2016	BJ	-3	3
4	28/02/2016	BJ	19	2
5	13/03/2016	BJ	5	3
6	27/03/2016	SH	-4	4
7	10/04/2016	SH	19	3
8	24/04/2016	SH	20	3
9	08/05/2016	SH	17	3
10	22/05/2016	SH	4	2
11	05/06/2016	SH	-10	4
12	19/06/2016	SH	0	5
13	03/07/2016	SH	-9	5
14	17/07/2016	GZ	10	2
15	31/07/2016	GZ	-1	5
16	14/08/2016	GZ	1	5
17	28/08/2016	GZ	25	4
18	11/09/2016	SZ	20	1
19	25/09/2016	SZ	-10	4

Call groupby on df with two columns, 'city' and 'wind', to create a new groupby object

g_new=df.groupby(['city','wind'])
g_new

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017B3ADF1E08>

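Inspect the groups of the new groupby object. It has more groups than the original g, because wind is now part of the key: the rows of each city are split further by their distinct wind values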

g_new.groups

{('BJ', 2): Int64Index([1, 2, 4], dtype='int64'),
 ('BJ', 3): Int64Index([3, 5], dtype='int64'),
 ('BJ', 5): Int64Index([0], dtype='int64'),
 ('GZ', 2): Int64Index([14], dtype='int64'),
 ('GZ', 4): Int64Index([17], dtype='int64'),
 ('GZ', 5): Int64Index([15, 16], dtype='int64'),
 ('SH', 2): Int64Index([10], dtype='int64'),
 ('SH', 3): Int64Index([7, 8, 9], dtype='int64'),
 ('SH', 4): Int64Index([6, 11], dtype='int64'),
 ('SH', 5): Int64Index([12, 13], dtype='int64'),
 ('SZ', 1): Int64Index([18], dtype='int64'),
 ('SZ', 4): Int64Index([19], dtype='int64')}

g.groups

{'BJ': Int64Index([0, 1, 2, 3, 4, 5], dtype='int64'),
 'GZ': Int64Index([14, 15, 16, 17], dtype='int64'),
 'SH': Int64Index([6, 7, 8, 9, 10, 11, 12, 13], dtype='int64'),
 'SZ': Int64Index([18, 19], dtype='int64')}
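
To compare the two groupings without reading the full dictionaries, the number of groups and the size of each group can be queried directly. The sketch below rebuilds only the city and wind columns from the table in this post (demo, by_city and by_city_wind are hypothetical names) and uses the ngroups attribute and the size method of the groupby object.

```python
import pandas as pd

# city and wind columns copied from the original DataFrame shown in this post.
city = ["BJ"] * 6 + ["SH"] * 8 + ["GZ"] * 4 + ["SZ"] * 2
wind = [5, 2, 2, 3, 2, 3, 4, 3, 3, 3, 2, 4, 5, 5, 2, 5, 5, 4, 1, 4]
demo = pd.DataFrame({"city": city, "wind": wind})

by_city = demo.groupby("city")
by_city_wind = demo.groupby(["city", "wind"])

# Number of groups for one key versus two keys.
print(by_city.ngroups, by_city_wind.ngroups)

# Number of rows in each (city, wind) group.
sizes = by_city_wind.size()
print(sizes)
```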

Get the BJ group from g

g.get_group('BJ')

	date	temperature	wind
0	03/01/2016	8	5
1	17/01/2016	12	2
2	31/01/2016	19	2
3	14/02/2016	-3	3
4	28/02/2016	19	2
5	13/03/2016	5	3

Get the group with city BJ and wind 2 from g_new

g_new.get_group(('BJ',2))

	date	city	temperature	wind
1	17/01/2016	BJ	12	2
2	31/01/2016	BJ	19	2
4	28/02/2016	BJ	19	2

Get the group with city BJ and wind 3 from g_new

g_new.get_group(('BJ',3))

	date	city	temperature	wind
3	14/02/2016	BJ	-3	3
5	13/03/2016	BJ	5	3
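
With several grouping columns, get_group takes a tuple with one value per column. As a cross-check, the same rows can be selected with an ordinary boolean mask. A minimal sketch on the BJ rows from this post (demo, picked and masked are hypothetical names):

```python
import pandas as pd

# BJ rows copied from the original DataFrame shown in this post.
demo = pd.DataFrame({
    "city": ["BJ"] * 6,
    "temperature": [8, 12, 19, -3, 19, 5],
    "wind": [5, 2, 2, 3, 2, 3],
})

# The key is a tuple: one value for each grouping column, in order.
picked = demo.groupby(["city", "wind"]).get_group(("BJ", 2))

# Equivalent row selection with a boolean mask.
masked = demo[(demo["city"] == "BJ") & (demo["wind"] == 2)]

print(list(picked.index), list(masked.index))
```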

Use a for loop to visit every group of the groupby built from multiple columns

for (name_1, name_2), group in g_new:
    print(name_1, name_2)
    print(group)

BJ 2
         date city  temperature  wind
1  17/01/2016   BJ           12     2
2  31/01/2016   BJ           19     2
4  28/02/2016   BJ           19     2
BJ 3
         date city  temperature  wind
3  14/02/2016   BJ           -3     3
5  13/03/2016   BJ            5     3
BJ 5
         date city  temperature  wind
0  03/01/2016   BJ            8     5
GZ 2
          date city  temperature  wind
14  17/07/2016   GZ           10     2
GZ 4
          date city  temperature  wind
17  28/08/2016   GZ           25     4
GZ 5
          date city  temperature  wind
15  31/07/2016   GZ           -1     5
16  14/08/2016   GZ            1     5
SH 2
          date city  temperature  wind
10  22/05/2016   SH            4     2
SH 3
         date city  temperature  wind
7  10/04/2016   SH           19     3
8  24/04/2016   SH           20     3
9  08/05/2016   SH           17     3
SH 4
          date city  temperature  wind
6   27/03/2016   SH           -4     4
11  05/06/2016   SH          -10     4
SH 5
          date city  temperature  wind
12  19/06/2016   SH            0     5
13  03/07/2016   SH           -9     5
SZ 1
          date city  temperature  wind
18  11/09/2016   SZ           20     1
SZ 4
          date city  temperature  wind
19  25/09/2016   SZ          -10     4
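
The same loop can collect a value per group instead of printing it. In this minimal sketch, each iteration yields the key tuple and the sub-DataFrame of that group, and the mean temperature is stored in a dictionary keyed by (city, wind). The frame demo and the dictionary mean_temp are hypothetical, and only a few rows from this post are used.

```python
import pandas as pd

# A few rows copied from the original DataFrame shown in this post.
demo = pd.DataFrame({
    "city": ["BJ", "BJ", "BJ", "SZ", "SZ"],
    "temperature": [12, 19, 19, 20, -10],
    "wind": [2, 2, 2, 1, 4],
})

mean_temp = {}
for (city, wind), group in demo.groupby(["city", "wind"]):
    # group is the sub-DataFrame holding the rows for this (city, wind) pair.
    mean_temp[(city, wind)] = group["temperature"].mean()

for key, value in mean_temp.items():
    print(key, value)
```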
