2021-03-05 pandas (merging, grouping and aggregation, MultiIndex)

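Merging data

join

By default, join combines rows that share the same row index label and uses the calling object's rows as the reference (a left join).
Rows that exist only in the other object are dropped; rows that exist in the calling object but have no match in the other object are filled with NaN.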

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: d1 = pd.DataFrame(np.ones((2,4)),index=['A','B'],columns=list('abcd'))
In [4]: d1
Out[4]:
     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0

In [6]: d2 = pd.DataFrame(np.zeros((3,3)),index=['A','B','C'],columns=list('xyz'))
In [7]: d2
Out[7]:
     x    y    z
A  0.0  0.0  0.0
B  0.0  0.0  0.0
C  0.0  0.0  0.0

In [8]: d1.join(d2)
Out[8]:
     a    b    c    d    x    y    z
A  1.0  1.0  1.0  1.0  0.0  0.0  0.0
B  1.0  1.0  1.0  1.0  0.0  0.0  0.0

In [9]: d2.join(d1)
Out[9]:
     x    y    z    a    b    c    d
A  0.0  0.0  0.0  1.0  1.0  1.0  1.0
B  0.0  0.0  0.0  1.0  1.0  1.0  1.0
C  0.0  0.0  0.0  NaN  NaN  NaN  NaN
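The session above only shows the default behaviour. As a minimal sketch, reusing the same d1 and d2, the how argument of join switches between keeping the caller's rows, the union of both indexes, or only the shared labels:

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame(np.ones((2, 4)), index=['A', 'B'], columns=list('abcd'))
d2 = pd.DataFrame(np.zeros((3, 3)), index=['A', 'B', 'C'], columns=list('xyz'))

# Default: left join on the row index, so row 'C' (only in d2) is dropped.
left = d1.join(d2)

# how='outer' keeps the union of both indexes; d1's columns are NaN for 'C'.
outer = d1.join(d2, how='outer')

# how='inner' keeps only the labels present in both frames.
inner = d1.join(d2, how='inner')

print(left.shape, outer.shape, inner.shape)
```

With how='outer', d1.join(d2) and d2.join(d1) contain the same rows and differ only in column order.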

Merge

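This section is quoted from the CSDN blogger "Catherine_In_Data".
----------------
Copyright notice: this is an original article by the CSDN blogger "Catherine_In_Data", licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/zhouwenyuan1015/article/details/77334889

merge combines two datasets. By default it joins on the columns the two datasets have in common (the intersection of their column names).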

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: a1 = pd.DataFrame({'name':['max','kate','candy','sour'],'age':[20,21,22,23]})
In [4]: a2 = pd.DataFrame({'name':['max','candy','tom'],'score':[100,99,98]})
In [17]: a3 = pd.DataFrame({'call_name':['max','kate','tom'],'age':[20,22,24]})
In [5]: a1
Out[5]:
    name  age
0    max   20
1   kate   21
2  candy   22
3   sour   23

In [6]: a2
Out[6]:
    name  score
0    max    100
1  candy     99
2    tom     98

In [7]: pd.merge(a1,a2)
Out[7]:
    name  age  score
0    max   20    100
1  candy   22     99

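on: the column name(s) to join on. They must exist in both DataFrame objects. If on is not specified, the intersection of the column names of left and right is used as the join key.
- When the key columns are named differently on the two sides, use left_on and right_on, then drop the duplicated key column after merging
- left_on: column(s) of the left DataFrame to use as the join key
- right_on: column(s) of the right DataFrame to use as the join key
- drop('column_to_remove', axis=0 or 1): remove a row label (axis=0) or a column (axis=1)
- left_index: if True, use the left DataFrame's row index as the join key
- right_index: if True, use the right DataFrame's row index as the join key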

In [8]: pd.merge(a1,a2,on='name')
Out[8]:
    name  age  score
0    max   20    100
1  candy   22     99

In [17]: pd.merge(a2,a3,left_on='name',right_on='call_name')
Out[17]:
  name  score call_name  age
0  max    100       max   20
1  tom     98       tom   24

In [18]: pd.merge(a2,a3,left_on='name',right_on='call_name').drop('call_name',axis=1)
Out[18]:
  name  score  age
0  max    100   20
1  tom     98   24
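The left_index and right_index options are listed above but not demonstrated. A minimal sketch, assuming a hypothetical variant of a2 (called a2_indexed here) whose key is stored in the row index instead of a column:

```python
import pandas as pd

a1 = pd.DataFrame({'name': ['max', 'kate', 'candy', 'sour'], 'age': [20, 21, 22, 23]})
a2 = pd.DataFrame({'name': ['max', 'candy', 'tom'], 'score': [100, 99, 98]})

# Hypothetical variant of a2: the key lives in the row index, not in a column.
a2_indexed = a2.set_index('name')

# Match the 'name' column on the left against the row index on the right.
merged = pd.merge(a1, a2_indexed, left_on='name', right_index=True)
print(merged)
```

The join is still an inner join by default, so only the names present on both sides survive.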

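how: the join type. The default is 'inner'; the other options are 'outer', 'left' and 'right'
- how='inner': inner join, keeps the intersection of the keys
- how='outer': outer join, keeps the union of the keys and fills missing values with NaN
- how='left': left join, keeps all rows from the left side and only the matching rows from the right side
- how='right': right join, keeps only the matching rows from the left side and all rows from the right side
- sort: sort the merged result by the join key
- suffixes: a tuple of strings appended to overlapping column names, default ('_x', '_y'). For example, if both sides have a non-key column named data, the result contains 'data_x' and 'data_y'
- copy: setting it to False can avoid copying data into the result in some special cases; by default the data is always copied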

how='inner': inner join, keeps the intersection of the keys

In [23]: pd.merge(a1,a2,on='name',how='inner')
Out[23]:
    name  age  score
0    max   20    100
1  candy   22     99

how='outer': outer join, keeps the union of the keys and fills missing values with NaN

In [25]: pd.merge(a1,a2,on='name',how='outer')
Out[25]:
    name   age  score
0    max  20.0  100.0
1   kate  21.0    NaN
2  candy  22.0   99.0
3   sour  23.0    NaN
4    tom   NaN   98.0

how='left': left join, keeps all rows from the left side and only the matching rows from the right side

In [26]: pd.merge(a1,a2,on='name',how='left')
Out[26]:
    name  age  score
0    max   20  100.0
1   kate   21    NaN
2  candy   22   99.0
3   sour   23    NaN

how='right': right join, keeps only the matching rows from the left side and all rows from the right side

In [27]: pd.merge(a1,a2,on='name',how='right')
Out[27]:
    name   age  score
0    max  20.0    100
1  candy  22.0     99
2    tom   NaN     98

sort: sort the merged result by the join key

In [28]: pd.merge(a1,a2,on='name',how='right',sort=False)
Out[28]:
    name   age  score
0    max  20.0    100
1  candy  22.0     99
2    tom   NaN     98

In [29]: pd.merge(a1,a2,on='name',how='right',sort=True)
Out[29]:
    name   age  score
0  candy  22.0     99
1    max  20.0    100
2    tom   NaN     98
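The suffixes parameter from the list above has no example in the sessions. The sketch below uses two hypothetical frames (left and right) that share the key name and also a non-key column age. It additionally shows indicator=True, a merge option that the text above does not cover, which records where each row came from.

```python
import pandas as pd

# Hypothetical frames sharing the key 'name' and a non-key column 'age'.
left = pd.DataFrame({'name': ['max', 'kate'], 'age': [20, 21]})
right = pd.DataFrame({'name': ['max', 'tom'], 'age': [30, 24]})

# Default suffixes: the overlapping column becomes age_x / age_y.
default = pd.merge(left, right, on='name', how='outer')

# Custom suffixes make the origin of each column explicit.
custom = pd.merge(left, right, on='name', how='outer',
                  suffixes=('_left', '_right'))

# indicator=True adds a '_merge' column: left_only, right_only or both.
flagged = pd.merge(left, right, on='name', how='outer', indicator=True)
print(flagged)
```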

concat

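This section is quoted from the CSDN blogger "Catherine_In_Data".
----------------
Copyright notice: this is an original article by the CSDN blogger "Catherine_In_Data", licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/zhouwenyuan1015/article/details/77334889

concat with axis=0 stacks the frames vertically. It works cleanly when the DataFrame objects share the same column names; a column missing from one frame is filled with NaN.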

In [38]: a4 = pd.DataFrame({'name':['henry','suly','sherry','cherry'],'age':[12,13,14,15]})
In [41]: pd.concat([a1,a4],axis=0)
Out[41]:
     name  age
0     max   20
1    kate   21
2   candy   22
3    sour   23
0   henry   12
1    suly   13
2  sherry   14
3  cherry   15

concat with axis=1 joins the frames side by side; rows are aligned on the index by default

In [45]: a5 = pd.DataFrame({'name':['henry','sherry'],'age':[20,25]})
In [46]: pd.concat([a1,a5],axis=1)
Out[46]:
    name  age    name   age
0    max   20   henry  20.0
1   kate   21  sherry  25.0
2  candy   22     NaN   NaN
3   sour   23     NaN   NaN

In [47]: pd.concat([a1,a5],axis=1).reset_index(drop=True)
Out[47]:
    name  age    name   age
0    max   20   henry  20.0
1   kate   21  sherry  25.0
2  candy   22     NaN   NaN
3   sour   23     NaN   NaN
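In the axis=0 output above the row labels 0 to 3 appear twice, and in the axis=1 output the unmatched rows are padded with NaN. A minimal sketch of two concat options that address this, reusing a1, a4 and a5 from the sessions: ignore_index=True renumbers the rows, and join='inner' keeps only the labels found in every frame.

```python
import pandas as pd

a1 = pd.DataFrame({'name': ['max', 'kate', 'candy', 'sour'], 'age': [20, 21, 22, 23]})
a4 = pd.DataFrame({'name': ['henry', 'suly', 'sherry', 'cherry'], 'age': [12, 13, 14, 15]})
a5 = pd.DataFrame({'name': ['henry', 'sherry'], 'age': [20, 25]})

# axis=0 keeps each frame's own labels, so 0..3 appears twice.
stacked = pd.concat([a1, a4], axis=0)

# ignore_index=True renumbers the rows 0..7 instead.
renumbered = pd.concat([a1, a4], axis=0, ignore_index=True)

# axis=1 aligns on the index; join='inner' keeps only labels present in both.
side_by_side = pd.concat([a1, a5], axis=1, join='inner')

print(renumbered.index.tolist(), side_by_side.shape)
```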

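Grouping and aggregation

DataFrameGroupBy
- can be iterated over
- supports aggregation methods
- grouping by several keys can return a Series
  - two grouping keys are supplied, but only one column of data is selected (single brackets []), so the result is still a Series
- grouping by several keys can return a DataFrame
  - use double brackets ([[]]), that is, a list nested inside []
  - the result is a DataFrame; the obvious sign is that it has a column index
  - the first two columns are the two grouping keys; each key becomes one index level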
import pandas as pd
import numpy as np

file_path = 'starbucks_store_worldwide.csv'
df = pd.read_csv(file_path)

# Group by several keys, returning a Series
grouped = df['Brand'].groupby(by=[df['Country'], df['State/Province']]).count()
#print(grouped)
#print(type(grouped))
'''
Two grouping keys are supplied, but only one column of data is selected (single brackets []), so the result is still a Series
Country  State/Province
AD       7                    1
AE       AJ                   2
         AZ                  48
         DU                  82
'''

# Group by several keys, returning a DataFrame
grouped_1 = df[['Brand']].groupby(by=[df['Country'], df['State/Province']]).count()
grouped_2 = df.groupby(by=[df['Country'], df['State/Province']])[['Brand']].count()
grouped_3 = df.groupby(by=[df['Country'], df['State/Province']]).count()[['Brand']]
print(grouped_1)
print(type(grouped_1))
print('*'*100)
print(grouped_2)
print(type(grouped_2))
print('*'*100)
print(grouped_3)
print(type(grouped_3))
'''
Use double brackets ([[]]), that is, a list nested inside []
The result is a DataFrame; the obvious sign is that it has a column index
The first two columns are the two grouping keys; each key becomes one index level
                        Brand
Country State/Province
AD      7                   1
AE      AJ                  2
        AZ                 48
        DU                 82
        FU                  2
'''
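The script above needs starbucks_store_worldwide.csv to run. The same Series versus DataFrame distinction can be checked on a small hypothetical stand-in (the rows below are made up for illustration). This sketch passes the key column names as a list, which groups the same way as passing the Series themselves:

```python
import pandas as pd

# Hypothetical stand-in for starbucks_store_worldwide.csv (made-up rows).
df = pd.DataFrame({
    'Brand': ['Starbucks', 'Starbucks', 'Teavana', 'Starbucks', 'Starbucks'],
    'Country': ['CN', 'CN', 'US', 'US', 'US'],
    'State/Province': ['11', '31', 'CA', 'CA', 'NY'],
})

keys = ['Country', 'State/Province']

# Single brackets select one column, so the aggregate is a Series.
as_series = df.groupby(keys)['Brand'].count()

# Double brackets select a list of columns, so the aggregate is a DataFrame.
as_frame = df.groupby(keys)[['Brand']].count()

print(type(as_series).__name__, type(as_frame).__name__)

# reset_index() turns the two index levels back into ordinary columns.
print(as_frame.reset_index())
```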

Index and MultiIndex

Basic index operations

# Get the index
In [4]: df1 = pd.DataFrame(np.ones((2,4)),index=['A','B'],columns=list('abcd'))
In [5]: df1
Out[5]:
     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0

In [6]: df1.index
Out[6]: Index(['A', 'B'], dtype='object')

# Assign a new index
In [8]: df1.index = ['a','b']
In [9]: df1.index
Out[9]: Index(['a', 'b'], dtype='object')

# Reindex
# This re-selects rows from the DataFrame and returns a new object; the original DataFrame is not modified
In [10]: df1.reindex(list('ad'))
Out[10]:
     a    b    c    d
a  1.0  1.0  1.0  1.0
d  NaN  NaN  NaN  NaN

In [11]: df1
Out[11]:
     a    b    c    d
a  1.0  1.0  1.0  1.0
b  1.0  1.0  1.0  1.0

# Use a column as the index; drop=False keeps the column that becomes the index
# (df2 is not defined in this session; Out[16] implies a 2x4 frame of zeros with index ['a','b'] and columns 'abcd')
In [15]: dfc = pd.concat([df1,df2],axis=0)
In [16]: dfc
Out[16]:
     a    b    c    d
a  1.0  1.0  1.0  1.0
b  1.0  1.0  1.0  1.0
a  0.0  0.0  0.0  0.0
b  0.0  0.0  0.0  0.0

In [18]: dfc.set_index('d',drop=False)
Out[18]:
       a    b    c    d
d
1.0  1.0  1.0  1.0  1.0
1.0  1.0  1.0  1.0  1.0
0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0

In [21]: dfc.set_index('d',drop=False).index
Out[21]: Float64Index([1.0, 1.0, 0.0, 0.0], dtype='float64', name='d')

# Using two or more columns as the index produces a MultiIndex
In [23]: dfc.set_index(['c','d'])
Out[23]:
           a    b
c   d
1.0 1.0  1.0  1.0
    1.0  1.0  1.0
0.0 0.0  0.0  0.0
    0.0  0.0  0.0

In [24]: dfc.set_index(['c','d']).index
Out[24]:
MultiIndex(levels=[[0.0, 1.0], [0.0, 1.0]],codes=[[1, 1, 0, 0], [1, 1, 0, 0]],names=['c', 'd'])

# Return the unique values of the index (deduplicated)
In [25]: dfc.set_index('c').index.unique()
Out[25]: Float64Index([1.0, 0.0], dtype='float64', name='c')

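Series MultiIndex

Usage:

set_index([]): when several columns are used as the index, wrap them in square brackets
Write the index labels directly inside the square brackets when selecting
swaplevel(): swap the order of the index levels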
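The three usages listed above can be sketched on a small hypothetical frame (the columns a, b, c, d and their values are made up for illustration). Selecting one column after set_index gives a Series that carries the MultiIndex:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': list('hjklmno'),
})

# set_index with a list of columns, then pick one column:
# the result is a Series with a two-level MultiIndex (c, d).
s = df.set_index(['c', 'd'])['a']

# Labels are written directly in the brackets: outer level, then inner level.
print(s['one', 'j'])

# Only the outer label: every row under 'one'.
print(s['one'])

# The inner level cannot be used alone, so swap the levels first.
print(s.swaplevel()['h'])
```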

DataFrame MultiIndex

Usage:

loc[]
swaplevel().loc[]
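A minimal sketch of both usages on a hypothetical frame (made-up columns and values): loc[] selects by the outer level or by a full (outer, inner) tuple, and swaplevel().loc[] makes the inner level selectable on its own.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': list('hjklmno'),
})

x = df.set_index(['c', 'd'])[['a', 'b']]

# Outer label only: a DataFrame with the rows under 'one'.
under_one = x.loc['one']

# Full (outer, inner) tuple: a single row, returned as a Series.
row = x.loc[('one', 'j')]

# Swap the levels so the inner label 'j' can be selected directly.
by_inner = x.swaplevel().loc['j']

print(under_one)
print(row)
print(by_inner)
```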

Hands-on practice

Counting movies by genre

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

file_path = 'IMDB-Movie-Data.csv'
df = pd.read_csv(file_path)
#print(df['Genre'])

# Build the list of genres for each movie
temp_list = df['Genre'].str.split(',').tolist()  # nested list: [[],[],[]]
# Deduplicate the genres (set() removes duplicates) and collect them in a list
genre_list = list(set([i for j in temp_list for i in j]))
# Create a DataFrame filled with zeros
zeros_df = pd.DataFrame(np.zeros((df.shape[0],len(genre_list))),columns=genre_list)
#print(zeros_df)

# Set 1 at every position where a movie belongs to a genre
# Iterate over every row of df; shape[0] is the number of rows
for i in range(df.shape[0]):
    # loc[] gets or sets data by label
    # zeros_df.loc[i (row label), temp_list[i] (genre columns of that row)]
    zeros_df.loc[i,temp_list[i]] = 1
print(zeros_df.head(3))

# Sum each genre column
genre_sum = zeros_df.sum(axis=0)
# Sort
genre_sum = genre_sum.sort_values()
print(genre_sum)

# Plot
plt.figure(figsize=(20,8),dpi=80)
_x = genre_sum.index
_y = genre_sum.values
#plt.bar(_x,_y)
plt.barh(_x,_y)
plt.show()
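The zeros-matrix loop above can also be written with Series.str.get_dummies, which builds the same 0/1 indicator columns in one call. This is an alternative technique, not the method used in the script, and the genre values below are a hypothetical stand-in for the Genre column of IMDB-Movie-Data.csv:

```python
import pandas as pd

# Hypothetical stand-in for the Genre column (made-up values).
genre = pd.Series(['Action,Sci-Fi', 'Comedy', 'Action,Comedy,Drama'])

# One 0/1 indicator column per distinct genre, split on ','.
dummies = genre.str.get_dummies(sep=',')

# Column sums give the number of movies per genre; sort ascending as above.
genre_sum = dummies.sum(axis=0).sort_values()
print(genre_sum)
```

The resulting genre_sum can be passed to plt.barh in the same way as in the script.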

Counting Starbucks stores in the US and China

import pandas as pd
import numpy as np

file_path = 'starbucks_store_worldwide.csv'
df = pd.read_csv(file_path)
#print(df.head(1))
#print(df.info())

groupdf = df.groupby(by='Country')
#print(groupdf)
#<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f8aeb687128>
'''
A DataFrameGroupBy can be iterated over and supports aggregation methods.
Iteration:
each item pairs a country code with the rows for that country (a DataFrame)
the groupby object is iterable, and every item in it is a tuple
the first value is the group key (the country code), the second is the matching data (a DataFrame)
for i,j in groupdf:
    print(i)
    print('-'*100)
    print(j)
    print('*'*100)
'''
# A simpler second approach: boolean indexing. Note that == compares values; a single = raises an error
country_us_ = df[df['Country'] == 'US']
print(country_us_.head(2))
############################################################
# Aggregate: count the rows
gro_count = groupdf['Brand'].count()
print(gro_count['US'])  #13608
print(gro_count['CN'])  #2734
print('*'*100)

# Count only the Starbucks brand
brand_starbucks = df[df['Brand'] == 'Starbucks']
starbucks_groupby = brand_starbucks.groupby(by='Country')
brand__count = starbucks_groupby['Brand'].count()
print(brand__count['US'])   #13311
print(brand__count['CN'])   #2734
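The iteration described in the docstring above can be tried without the CSV file. A minimal sketch on hypothetical rows (made up for illustration), showing that each item of the groupby is a (key, DataFrame) pair and that filtering before grouping changes the counts:

```python
import pandas as pd

# Hypothetical stand-in for starbucks_store_worldwide.csv (made-up rows).
df = pd.DataFrame({
    'Brand': ['Starbucks', 'Teavana', 'Starbucks', 'Starbucks', 'Starbucks'],
    'Country': ['US', 'US', 'US', 'CN', 'CN'],
})

# Each item yielded by the groupby is a (group key, sub-DataFrame) tuple.
sizes = {country: len(rows) for country, rows in df.groupby('Country')}

# Filter with a boolean mask first, then count per country.
starbucks_only = df[df['Brand'] == 'Starbucks'].groupby('Country')['Brand'].count()

print(sizes)
print(starbucks_only['US'], starbucks_only['CN'])
```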

Using matplotlib to show the 10 countries with the most stores

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

file_path = 'starbucks_store_worldwide.csv'
df = pd.read_csv(file_path)

data_ten = df.groupby(by='Country').count()['Brand'].sort_values(ascending=False)[:10]
_x = data_ten.index
_y = data_ten.values

#plt.bar(_x,_y) also works: the values can be passed directly
# A more rigorous way
plt.bar(range(len(_x)),_y)
plt.xticks(range(len(_x)),_x)
plt.show()
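The ranking step in the script (group, count, sort descending, slice) can be checked on hypothetical data, and compared with value_counts, which returns counts already sorted in descending order. This sketch leaves out the plotting; the rows are made up, and the two approaches agree here because the Brand column has no missing values (count() skips NaN in the counted column).

```python
import pandas as pd

# Hypothetical stand-in data: 4 US stores, 3 CN, 2 CA, 1 JP.
df = pd.DataFrame({
    'Country': ['US'] * 4 + ['CN'] * 3 + ['CA'] * 2 + ['JP'],
    'Brand': ['Starbucks'] * 10,
})

# Same pipeline as the script: count per country, sort descending, take the top 3.
top_by_groupby = df.groupby('Country')['Brand'].count().sort_values(ascending=False)[:3]

# value_counts sorts by frequency in descending order by default.
top_by_value_counts = df['Country'].value_counts().head(3)

print(top_by_groupby)
print(top_by_value_counts)
```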

Using matplotlib to show the number of stores in each Chinese city

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import font_manager

my_font = font_manager.FontProperties(fname='/System/Library/Fonts/PingFang.ttc')

file_path = 'starbucks_store_worldwide.csv'
df = pd.read_csv(file_path)
df = df[df['Country'] == 'CN']

data_city = df.groupby(by='City').count()['Brand'].sort_values(ascending=False)[:25]
_x = data_city.index
_y = data_city.values

plt.figure(figsize=(20,8),dpi=80)
plt.bar(range(len(_x)),_y)
plt.xticks(range(len(_x)),_x,fontproperties=my_font)
plt.show()
