Overview

1. groupby

2. Aggregation, filtering, and transformation

3. apply

Details

1. groupby

# groupby
# Group by a single column
grouped_single = df.groupby('School')
# Select one specific group after grouping
grouped_single.get_group('S_1').head()
# Group by several columns
grouped_mul = df.groupby(['School','Class'])
# With several keys, a group is addressed by a tuple
grouped_mul.get_group(('S_2','C_4'))
# Number of rows in each group
grouped_single.size()
grouped_mul.size()
# Number of groups
grouped_single.ngroups
grouped_mul.ngroups
# Iterate over all groups (display is provided by IPython/Jupyter)
for name, group in grouped_single:
    print(name)
    display(group.head())
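The notes never show how `df` is built, so the calls above cannot be run as they stand. Below is a minimal sketch on a small made-up table; the values are hypothetical and only the column names follow the notes.

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the ones used in these notes.
df = pd.DataFrame(
    {
        "School": ["S_1", "S_1", "S_1", "S_2", "S_2", "S_2"],
        "Class": ["C_1", "C_1", "C_2", "C_4", "C_4", "C_1"],
        "Gender": ["M", "F", "M", "F", "M", "F"],
        "Math": [34.0, 32.5, 87.2, 80.4, 84.8, 58.8],
        "Height": [173, 192, 186, 167, 159, 161],
    },
    index=pd.Index([1101, 1102, 1201, 2401, 2402, 2101], name="ID"),
)

grouped_single = df.groupby("School")
grouped_mul = df.groupby(["School", "Class"])

group_sizes = grouped_single.size()            # number of rows per school
n_multi_groups = grouped_mul.ngroups           # number of distinct (School, Class) pairs
s2_c4 = grouped_mul.get_group(("S_2", "C_4"))  # one group, returned as a DataFrame

# Iterating yields (group key, sub-DataFrame) pairs.
for name, group in grouped_single:
    print(name)
    print(group.head(2))
```

Outside a notebook, `print` stands in for `display`.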
# Group by an index level: here Gender is level 0 and School is level 1
# (axis=0 is the default, so it is omitted)
df.set_index(['Gender','School']).groupby(level=0).get_group('M').head(40)
# Methods of the groupby object
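# List the public attributes and methods of the groupby object
print([attr for attr in dir(grouped_single) if not attr.startswith('_')])
# head(n) on a groupby object returns the first n rows of every group
grouped_single.head(2)
# first() returns the first non-null value of each column in every group
grouped_single.first()
# Grouping keys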
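# Besides an existing column, any array with one entry per row can serve as the key,
# for example a freshly generated one
df.groupby(np.random.choice(['a','b','c'],df.shape[0])).get_group('a').head(40)
# When a function is passed, groupby calls it on each index label and uses the
# return value as the group key, which is why the lambda receives the index.
# print returns None, so this call is only useful for inspecting what is passed in.
df[:5].groupby(lambda x:print(x)).head(10)
# Group by odd and even row position (counting rows from 1)
df.groupby(lambda x:'odd row' if df.index.get_loc(x)%2==0 else 'even row').groups
# With a MultiIndex the function receives each index tuple, e.g. ('M', 'S_1').
# Here the key becomes (index tuple, label), where the label says whether that
# (Gender, School) combination has a mean Math score of at least 60.
math_score = df.set_index(['Gender','School'])['Math'].sort_index()
grouped_score = df.set_index(['Gender','School']).sort_index().groupby(lambda x:(x,'mean passes' if math_score[x].mean()>=60 else 'mean fails'))
for name,_ in grouped_score:
    print(name)
# Use [] to select one or several columns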
df.groupby(['Gender','School'])['Math'].mean()
df.groupby(['Gender','School'])['Math'].mean()>=60
df.groupby(['Gender','School'])[['Math','Height']].mean()
# Grouping by a continuous variable
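# pd.cut returns a Series of intervals aligned with df, and groupby accepts it as a key.
# A Series is only one option: groupby also takes column labels, arrays, dicts, and functions.
bins = [0,40,60,80,90,100]
cuts = pd.cut(df['Math'],bins=bins) # the optional labels argument sets custom bin names
df.groupby(cuts)['Math'].count()
# In IPython/Jupyter, cuts? shows help for the object. The ? syntax belongs to IPython
# rather than to Python, and it does not seem to accept a trailing # comment on the same line.
cuts?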
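The notes above ask why a function passed to `groupby` receives the index, and whether the key has to be a Series. The sketch below, on hypothetical data, tries three kinds of key side by side: a function (called on each index label), a plain NumPy array, and the Series returned by `pd.cut`. The helper name `first_digit` is made up for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first digit of the index encodes the school.
df = pd.DataFrame(
    {"Math": [34.0, 32.5, 87.2, 80.4, 84.8, 58.8]},
    index=[1101, 1102, 1201, 2401, 2402, 2101],
)

# 1) A function key is applied to the index labels; its return value is the group key.
seen = []

def first_digit(label):
    seen.append(label)  # record what groupby passes in
    return f"school_{str(label)[0]}"

by_function = df.groupby(first_digit)["Math"].mean()

# 2) An array key only needs one entry per row; no index alignment is involved.
parity = np.where(np.arange(len(df)) % 2 == 0, "odd row", "even row")
by_array = df.groupby(parity).size()

# 3) A Series key, here the bins produced by pd.cut.
#    observed=False keeps empty bins in the result.
bins = [0, 40, 60, 80, 90, 100]
cuts = pd.cut(df["Math"], bins=bins)
by_bins = df.groupby(cuts, observed=False)["Math"].count()

print(by_function)
print(by_array)
print(by_bins)
```

All three forms give each row a key; they differ only in where the key comes from.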

2. Aggregation, filtering, and transformation

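# mean/sum/size/count/std/var/sem/describe/first/last/nth/min/max are all aggregation functions
# A single aggregation function
# (the original example is missing, and group_m is never defined in these notes;
# it is assumed here to be the Math column of grouped_single)
group_m = grouped_single['Math']
group_m.mean()
# Several aggregation functions at once
group_m.agg(['sum','mean','std'])
# Rename the output columns: agg also accepts (name, function) tuples,
# and the first element of each tuple becomes the column name
group_m.agg([('rename_sum','sum'),('rename_mean','mean')])
# Choose which aggregations apply to which columns
grouped_mul.agg({'Math':['mean','max'],'Height':'var'})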
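# Custom functions: agg passes each group's values to the function as a Series.
# Printing inside the function shows what is passed in.
grouped_single['Math'].agg(lambda x:print(x.head(),'separator'))
grouped_single['Math'].agg(lambda x:x.max()-x.min())
def R1(x):
    return x.max()-x.min()
def R2(x):
    return x.max()-x.median()
# Named aggregation on a single column: each keyword is an output name and its value is the function.
# pd.NamedAgg(column=..., aggfunc=...) is meant for DataFrame groupby objects, where a column must be named.
grouped_single['Math'].agg(min_score1=R1,max_score1='max',range_score2=R2).head()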
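Named aggregation looks slightly different on a single selected column and on a whole DataFrame. A minimal sketch of both forms follows, assuming a small made-up table; `score_range` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "School": ["S_1", "S_1", "S_1", "S_2", "S_2", "S_2"],
        "Math": [34.0, 32.5, 87.2, 80.4, 84.8, 58.8],
        "Height": [173, 192, 186, 167, 159, 161],
    }
)

def score_range(x):
    """Spread between the largest and smallest value in a group."""
    return x.max() - x.min()

# On one selected column: keyword = output name, value = function or function name.
series_result = df.groupby("School")["Math"].agg(
    math_range=score_range,
    math_max="max",
)

# On the whole frame: each keyword also has to say which column it reads from.
frame_result = df.groupby("School").agg(
    math_range=pd.NamedAgg(column="Math", aggfunc=score_range),
    height_mean=pd.NamedAgg(column="Height", aggfunc="mean"),
)

print(series_result)
print(frame_result)
```

Either way the keywords become the output column names, so no renaming step is needed afterwards.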
# Pass extra arguments to the function through agg
def f(s,low,high):
    return s.between(low,high).max()
grouped_single['Math'].agg(f,50,52)
# Several functions with arguments: wrap each one so that agg sees a named one-argument function
def f_test(s,low,high):
    return s.between(low,high).max()
def agg_f(f_mul,name,*args,**kwargs):
    def wrapper(x):
        return f_mul(x,*args,**kwargs)
    wrapper.__name__ = name
    return wrapper
new_f = agg_f(f_test,'at_least_one_in_50_52',50,52)
grouped_single['Math'].agg([new_f,'mean']).head()
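The wrapper pattern is easier to follow when it runs end to end. The sketch below uses hypothetical data; `any_between` and `with_args` are illustrative names that play the roles of `f_test` and `agg_f` in the notes.

```python
import pandas as pd

# Hypothetical sample data: only S_2 has a Math score between 50 and 52.
df = pd.DataFrame(
    {
        "School": ["S_1", "S_1", "S_1", "S_2", "S_2", "S_2"],
        "Math": [34.0, 32.5, 87.2, 80.4, 51.0, 58.8],
    }
)
grouped = df.groupby("School")["Math"]

def any_between(s, low, high):
    """True if at least one value of s lies in [low, high]."""
    return s.between(low, high).max()

# A single function: extra positional arguments are forwarded by agg.
direct = grouped.agg(any_between, 50, 52)

def with_args(func, name, *args, **kwargs):
    """Bind arguments to func and give the result a name agg can use as a column label."""
    def wrapper(x):
        return func(x, *args, **kwargs)
    wrapper.__name__ = name
    return wrapper

# A list of functions: agg has nowhere to put per-function arguments,
# so they are bound in advance.
named = with_args(any_between, "at_least_one_in_50_52", 50, 52)
table = grouped.agg([named, "mean"])

print(direct)
print(table)
```

Setting `__name__` matters because a list-style `agg` takes the output column label from the function name.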

# Filtering: the function returns True or False per group, and every row of a kept group is returned
grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>32).all()).head()
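To make the whole-group behaviour concrete, here is a small sketch on made-up data in which one school contains a single score that fails the condition. All values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: S_2 contains one Math score (31.5) that is not above 32.
df = pd.DataFrame(
    {
        "School": ["S_1", "S_1", "S_1", "S_2", "S_2", "S_2"],
        "Math": [34.0, 32.5, 87.2, 80.4, 31.5, 58.8],
        "Physics": ["A+", "B+", "B", "A", "C", "B-"],
    }
)

# The function sees each group as a DataFrame and must return one boolean.
kept = df.groupby("School")[["Math", "Physics"]].filter(
    lambda g: (g["Math"] > 32).all()
)

print(kept)
```

All three S_2 rows are dropped, including the two whose own score is above 32, because the decision is made once per group rather than once per row.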

# Use transform for column-wise computation within each group
grouped_single[['Math','Height']].transform(lambda x:x-x.min()).head()
# If the function returns a scalar, it is broadcast to every row of the group
grouped_single[['Math','Height']].transform(lambda x:x.mean()).head()
# Standardize within each group
grouped_single[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()
# Fill missing values with the group mean
df_nan = df[['Math','School']].copy().reset_index()
df_nan.loc[np.random.randint(0,df.shape[0],25),['Math']]=np.nan
df_nan.head()
df_nan.groupby('School').transform(lambda x: x.fillna(x.mean())).join(df.reset_index()['School']).head()
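Because the example above blanks out random rows, its output changes from run to run. The following sketch shows the same group-mean fill on a fixed, hypothetical table, where the filled values can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing Math score per school.
df = pd.DataFrame(
    {
        "School": ["S_1", "S_1", "S_1", "S_2", "S_2", "S_2"],
        "Math": [34.0, np.nan, 87.2, 80.4, 84.8, np.nan],
    }
)

# transform returns a result aligned with the original rows,
# so it can be assigned straight back as a column.
filled = df.copy()
filled["Math"] = df.groupby("School")["Math"].transform(
    lambda s: s.fillna(s.mean())
)

# A scalar result is broadcast to every row of its group.
group_mean = df.groupby("School")["Math"].transform("mean")

print(filled)
print(group_mean)
```

Selecting the `Math` column before `transform` keeps `School` in the frame, which avoids joining it back afterwards.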

3. The apply function

# apply calls the function on each group, passing the group as a DataFrame
df.groupby('School').apply(lambda x:print(x.head(1)))
# The return value can take several forms
# A Series per group: here the maximum of every column
df[['School','Math','Height']].groupby('School').apply(lambda x:x.max())
# A result shaped like the input: here each column minus its group minimum
df[['School','Math','Height']].groupby('School').apply(lambda x:x-x.min()).head()
# A newly built DataFrame that holds the returned values
df[['School','Math','Height']].groupby('School').apply(lambda x:pd.DataFrame({'col1':x['Math']-x['Math'].max(),'col2':x['Math']-x['Math'].min(),'col3':x['Height']-x['Height'].max(),'col4':x['Height']-x['Height'].min()})).head()
# Compute several statistics at once
from collections import OrderedDict
def f(df):
    data = OrderedDict()
    data['M_sum'] = df['Math'].sum()
    data['W_var'] = df['Weight'].var()
    data['H_mean'] = df['Height'].mean()
    return pd.Series(data)
grouped_single.apply(f)
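The multi-statistic pattern can be checked on a small made-up table. This is a sketch under assumptions: the data is hypothetical, `summarize` is an illustrative name for the function `f` above, and the columns are selected before `apply` so that the function never sees the grouping column.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "School": ["S_1", "S_1", "S_1", "S_2", "S_2", "S_2"],
        "Math": [34.0, 32.5, 87.2, 80.4, 84.8, 58.8],
        "Weight": [63, 73, 82, 81, 64, 61],
        "Height": [173, 192, 186, 167, 159, 161],
    }
)

def summarize(group):
    """Return one labelled value per statistic; apply stacks these into a table."""
    return pd.Series(
        {
            "M_sum": group["Math"].sum(),
            "W_var": group["Weight"].var(),
            "H_mean": group["Height"].mean(),
        }
    )

# One row per school, one column per statistic.
summary = df.groupby("School")[["Math", "Weight", "Height"]].apply(summarize)

print(summary)
```

A plain dict preserves insertion order in current Python, so `OrderedDict` is optional here.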

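Ref: https://github.com/yeayee/joyful-pandas An excellent pandas tutorial; starring the project is strongly recommended. Its author has read many of the original English-language books on pandas. The early posts in this series are mainly a cheat sheet based on that tutorial; later posts will add commonly used functions and coding techniques from other books and tutorials.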

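In addition, the plan is to reorganize this cheat sheet later: instead of following the original order, it will be arranged around a complete data analysis or data mining workflow. The structure still needs to be worked out.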
