Pandas Study Notes (3) --- Numerical Operations
- Numerical computation and statistics basics
  - Common math and statistical methods
  - Basic parameters: axis, skipna
  - Main math methods for Series and DataFrame (1)
  - Main math methods for Series and DataFrame (2): cumulative sum and product
  - Unique values: .unique()
  - Value counts (frequency): .value_counts()
  - Membership: .isin()
  - In-class exercise
- Working with text data
  - Access via .str, automatically skipping missing/NA values
  - Common string methods (1) - lower, upper, len, startswith, endswith
  - Common string methods (2) - strip
  - Common string methods (3) - replace
  - Common string methods (4) - split, rsplit
  - String indexing
  - In-class exercise
- Merging: merge, join
  - merge → similar to Excel's VLOOKUP
  - merge → parameter how → join type
  - merge → parameters left_on, right_on, left_index, right_index → set left and right keys separately when the key is not a shared column
  - merge → parameter sort
  - .join() → join directly on the index
  - In-class exercise
- Concatenation and patching: concat, combine_first
  - Concatenation: concat
  - Join options: join, join_axes
  - Overriding column names (!!!)
  - Patching: combine_first()
  - In-class exercise
- Deduplication and replacement: .duplicated / .replace
  - Deduplication: .duplicated
  - Replacement: .replace
- Grouping data (important!)
  - Grouping with groupby
  - Groups as iterable objects
  - Grouping on other axes
  - Grouping by dict or Series
  - Grouping by function
  - Group computation methods
  - Multiple aggregations per group: agg()
  - In-class exercise
- Group transforms and general split-apply-combine
  - Group transform: transform
  - General groupby method: apply
  - In-class exercise
- Pivot tables and crosstabs
  - Pivot tables: pivot_table
  - Crosstabs: crosstab
  - In-class exercise
- Reading data
  - Reading delimited text: read_table
  - Reading csv data: read_csv
  - Reading excel data: read_excel
Numerical computation and statistics basics
'''
【Course 2.14】 Numerical computation and statistics basics: common math and statistical methods
'''
Common math and statistical methods
Basic parameters: axis, skipna
# Basic parameters: axis, skipna
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1':[4,5,3,np.nan,2],
                   'key2':[1,2,np.nan,4,5],
                   'key3':[1,2,3,'j','k']},
                  index = ['a','b','c','d','e'])
print(df)
print(df['key1'].dtype,df['key2'].dtype,df['key3'].dtype)
print('-----')

m1 = df.mean()
print(m1,type(m1))
print('Single column:',df['key2'].mean())
print('-----')
# np.nan: missing value
# .mean() computes the mean
# Only numeric columns are included
# A single column can be computed on its own via indexing

m2 = df.mean(axis=1)
print(m2)
print('-----')
# axis: defaults to 0 (aggregate each column); axis=1 aggregates each row

m3 = df.mean(skipna=False)
print(m3)
print('-----')
# skipna: whether to ignore NaN; defaults to True. With skipna=False, any column containing NaN yields NaN
key1 key2 key3
a 4.0 1.0 1
b 5.0 2.0 2
c 3.0 NaN 3
d NaN 4.0 j
e 2.0 5.0 k
float64 float64 object
-----
key1 3.5
key2 3.0
dtype: float64 <class 'pandas.core.series.Series'>
Single column: 3.0
-----
a 2.5
b 3.5
c 3.0
d 4.0
e 3.5
dtype: float64
-----
key1 NaN
key2 NaN
dtype: float64
-----
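The axis/skipna behaviour above can be condensed into a tiny self-checking sketch (the frame and variable names here are my own, not from the course):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [4.0, 5.0, 6.0]})

col_means = df.mean()              # axis=0 (default): one value per column, NaN skipped
row_means = df.mean(axis=1)        # axis=1: one value per row
strict    = df.mean(skipna=False)  # a column containing NaN now yields NaN

print(col_means['x'])  # mean of [1.0, 3.0] = 2.0
print(row_means[0])    # mean of [1.0, 4.0] = 2.5
print(strict['x'])     # nan
```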
Main math methods for Series and DataFrame (1)
# Main math methods for Series and DataFrame (1)
df = pd.DataFrame({'key1':np.arange(10),'key2':np.random.rand(10)*10})
print(df)
print('-----')

print(df.count(),'→ count: number of non-NA values\n')
print(df.min(),'→ min: minimum\n',df['key2'].max(),'→ max: maximum\n')
print(df.quantile(q=0.75),'→ quantile: the quantile at position q\n')
print(df.sum(),'→ sum: total\n')
print(df.mean(),'→ mean: average\n')
print(df.median(),'→ median: arithmetic median, the 50% quantile\n')
print(df.std(),'\n',df.var(),'→ std, var: standard deviation and variance\n')
print(df.skew(),'→ skew: sample skewness\n')
print(df.kurt(),'→ kurt: sample kurtosis\n')
key1 key2
0 0 0.327398
1 1 0.959262
2 2 6.455080
3 3 6.275359
4 4 6.138641
5 5 8.853716
6 6 4.525300
7 7 9.740657
8 8 9.229833
9 9 0.949789
-----
key1    10
key2    10
dtype: int64 → count: number of non-NA values

key1    0.000000
key2    0.327398
dtype: float64 → min: minimum
 9.740656570973671 → max: maximum

key1    6.750000
key2    8.254057
Name: 0.75, dtype: float64 → quantile: the quantile at position q

key1    45.000000
key2    53.455034
dtype: float64 → sum: total

key1    4.500000
key2    5.345503
dtype: float64 → mean: average

key1    4.500
key2    6.207
dtype: float64 → median: arithmetic median, the 50% quantile

key1    3.027650
key2    3.556736
dtype: float64 
 key1     9.166667
key2    12.650371
dtype: float64 → std, var: standard deviation and variance

key1    0.000000
key2   -0.329924
dtype: float64 → skew: sample skewness

key1   -1.200000
key2   -1.430276
dtype: float64 → kurt: sample kurtosis
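Most of the statistics above (count, mean, std, quantiles, min, max) can also be pulled in one call with `.describe()`; a small sketch with a deterministic column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': np.arange(10)})
summary = df.describe()  # count, mean, std, min, 25%/50%/75%, max in one DataFrame
print(summary)
```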
Main math methods for Series and DataFrame (2): cumulative sum and product
# Main math methods for Series and DataFrame (2)
df['key1_s'] = df['key1'].cumsum()
df['key2_s'] = df['key2'].cumsum()
print(df,'→ cumsum: cumulative sum\n')

df['key1_p'] = df['key1'].cumprod()
df['key2_p'] = df['key2'].cumprod()
print(df,'→ cumprod: cumulative product\n')

print(df.cummax(),'\n',df.cummin(),'→ cummax, cummin: cumulative maximum and minimum\n')
# The key1 and key2 columns are carried along
key1 key2 key1_s key2_s
0 0 0.327398 0 0.327398
1 1 0.959262 1 1.286660
2 2 6.455080 3 7.741740
3 3 6.275359 6 14.017099
4 4 6.138641 10 20.155740
5 5 8.853716 15 29.009456
6 6 4.525300 21 33.534756
7 7 9.740657 28 43.275412
8 8 9.229833 36 52.505245
9     9  0.949789      45  53.455034 → cumsum: cumulative sum

   key1      key2  key1_s     key2_s  key1_p         key2_p
0 0 0.327398 0 0.327398 0 0.327398
1 1 0.959262 1 1.286660 0 0.314061
2 2 6.455080 3 7.741740 0 2.027286
3 3 6.275359 6 14.017099 0 12.721946
4 4 6.138641 10 20.155740 0 78.095454
5 5 8.853716 15 29.009456 0 691.434982
6 6 4.525300 21 33.534756 0 3128.950808
7 7 9.740657 28 43.275412 0 30478.035251
8 8 9.229833 36 52.505245 0 281307.179260
9     9  0.949789      45  53.455034       0  267182.375541 → cumprod: cumulative product

   key1      key2  key1_s     key2_s  key1_p         key2_p
0 0.0 0.327398 0.0 0.327398 0.0 0.327398
1 1.0 0.959262 1.0 1.286660 0.0 0.327398
2 2.0 6.455080 3.0 7.741740 0.0 2.027286
3 3.0 6.455080 6.0 14.017099 0.0 12.721946
4 4.0 6.455080 10.0 20.155740 0.0 78.095454
5 5.0 8.853716 15.0 29.009456 0.0 691.434982
6 6.0 8.853716 21.0 33.534756 0.0 3128.950808
7 7.0 9.740657 28.0 43.275412 0.0 30478.035251
8 8.0 9.740657 36.0 52.505245 0.0 281307.179260
9   9.0  9.740657    45.0  53.455034     0.0  281307.179260 
    key1      key2  key1_s     key2_s  key1_p         key2_p
0 0.0 0.327398 0.0 0.327398 0.0 0.327398
1 0.0 0.327398 0.0 0.327398 0.0 0.314061
2 0.0 0.327398 0.0 0.327398 0.0 0.314061
3 0.0 0.327398 0.0 0.327398 0.0 0.314061
4 0.0 0.327398 0.0 0.327398 0.0 0.314061
5 0.0 0.327398 0.0 0.327398 0.0 0.314061
6 0.0 0.327398 0.0 0.327398 0.0 0.314061
7 0.0 0.327398 0.0 0.327398 0.0 0.314061
8 0.0 0.327398 0.0 0.327398 0.0 0.314061
9   0.0  0.327398     0.0   0.327398     0.0       0.314061 → cummax, cummin: cumulative maximum and minimum
Unique values: .unique()
# Unique values: .unique()
s = pd.Series(list('asdvasdcfgg'))
sq = s.unique()
print(s)
print(sq,type(sq))
print(pd.Series(sq))
# Returns an ndarray of unique values
# Wrap it in pd.Series to get a new Series

sq.sort()
print(sq)
# Sorted in place
0 a
1 s
2 d
3 v
4 a
5 s
6 d
7 c
8 f
9 g
10 g
dtype: object
['a' 's' 'd' 'v' 'c' 'f' 'g'] <class 'numpy.ndarray'>
0 a
1 s
2 d
3 v
4 c
5 f
6 g
dtype: object
['a' 'c' 'd' 'f' 'g' 's' 'v']
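When only the number of distinct values matters, `.nunique()` skips the intermediate array; a minimal sketch on the same data:

```python
import pandas as pd

s = pd.Series(list('asdvasdcfgg'))
n = s.nunique()  # number of distinct values (NaN excluded by default)
print(n)         # 7 distinct letters: a, s, d, v, c, f, g
```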
Value counts (frequency): .value_counts()
# Value counts: .value_counts()
sc = s.value_counts(sort = False)  # equivalent: pd.value_counts(s, sort = False)
print(sc)
# Returns a new Series with the frequency of each distinct value
# sort parameter: defaults to True (sort by count)
a 2
d 2
v 1
g 2
s 2
f 1
c 1
dtype: int64
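`value_counts` can also report relative frequencies instead of raw counts via `normalize=True`; a small sketch:

```python
import pandas as pd

s = pd.Series(list('asdvasdcfgg'))
freq = s.value_counts(normalize=True)  # fractions instead of raw counts
print(freq['g'])  # 'g' appears 2 times out of 11 values
```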
Membership: .isin()
# Membership: .isin()
s = pd.Series(np.arange(10,15))
df = pd.DataFrame({'key1':list('asdcbvasd'),'key2':np.arange(4,13)})
print(s)
print(df)
print('-----')

print(s.isin([5,14]))
print(df.isin(['a','bc','10',8]))
# Takes a list of candidate values
# Returns a boolean Series or DataFrame
0 10
1 11
2 12
3 13
4 14
dtype: int32
  key1  key2
0 a 4
1 s 5
2 d 6
3 c 7
4 b 8
5 v 9
6 a 10
7 s 11
8 d 12
-----
0 False
1 False
2 False
3 False
4 True
dtype: bool
    key1   key2
0 True False
1 False False
2 False False
3 False False
4 False True
5 False False
6 True False
7 False False
8 False False
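The boolean result of `.isin()` is most often used as a row filter; a minimal sketch (the values are my own):

```python
import pandas as pd

df = pd.DataFrame({'key1': list('asdcb'), 'key2': range(5)})
subset = df[df['key1'].isin(['a', 'b'])]  # keep only rows whose key1 is 'a' or 'b'
print(subset)
```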
In-class exercise
ts1 = pd.DataFrame(np.random.rand(5,2)*100,columns=['key1','key2'])
print("The created DataFrame is:")
print(ts1)
print('------')
print("Mean of df['key1']:")
print(ts1['key1'].mean())
print('------')
print("Median of df['key1']:")
print(ts1['key1'].median())
print('------')
print("Mean of df['key2']:")
print(ts1['key2'].mean())
print('------')
print("Median of df['key2']:")
print(ts1['key2'].median())
print('------')
print("Cumulative sums:")
ts1['key1_cumsum'] = ts1['key1'].cumsum()
ts1['key2_cumsum'] = ts1['key2'].cumsum()
print(ts1)
The created DataFrame is:
        key1       key2
0   0.445031  70.879116
1  40.164080   8.052621
2   4.118756  72.932482
3  46.818794  12.744497
4  37.192819  18.393109
------
Mean of df['key1']:
25.747896160805663
------
Median of df['key1']:
37.192819239210486
------
Mean of df['key2']:
36.60036488397306
------
Median of df['key2']:
18.39310866824474
------
Cumulative sums:
        key1       key2  key1_cumsum  key2_cumsum
0   0.445031  70.879116     0.445031    70.879116
1  40.164080   8.052621    40.609112    78.931737
2   4.118756  72.932482    44.727868   151.864219
3  46.818794  12.744497    91.546662   164.608716
4  37.192819  18.393109   128.739481   183.001824
# Exercise 2: read a set of elements typed by the user into a Series, then write
# a function that reports whether the Series contains only unique values

def f(s):
    s2 = s.unique()
    if len(s) == len(s2):
        print('------\nAll values in the array are unique')
    else:
        print('------\nThe array contains duplicate values')

d = input('Enter a set of elements separated by commas (ASCII):\n')
lst = d.split(',')
ds = pd.Series(lst)
f(ds)
Enter a set of elements separated by commas (ASCII):
a,sc,2,2,2,d,s,s,a
------
The array contains duplicate values
Working with text data
'''
【Course 2.15】 Text data: Pandas ships a set of string methods that make it easy to operate on every element of an array
'''
Access via .str, automatically skipping missing/NA values
# Access via .str; missing/NA values are skipped automatically
s = pd.Series(['A','bB','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),'key2':['hee','fv','w','hija','123',np.nan]})
print(s)
print(df)
print('-----')

print(s.str.count('b'))  # case sensitive
print(df['key2'].str.upper())  # does not modify the original values
print(df['key2'])
print('-----')
# Call string methods directly through .str
# Works on Series and DataFrame columns
# NaN values are skipped automatically

df.columns = df.columns.str.upper()
print(df)
# df.columns is an Index object and also supports .str
0 A
1 bB
2 C
3 bbhello
4 123
5 NaN
6 hj
dtype: object
  key1  key2
0 a hee
1 b fv
2 c w
3 d hija
4 e 123
5 f NaN
-----
0 0.0
1 1.0
2 0.0
3 2.0
4 0.0
5 NaN
6 0.0
dtype: float64
0 HEE
1 FV
2 W
3 HIJA
4 123
5 NaN
Name: key2, dtype: object
  key1  key2
0 a hee
1 b fv
2 c w
3 d hija
4 e 123
5 f NaN
-----
  KEY1  KEY2
0 a hee
1 b fv
2 c w
3 d hija
4 e 123
5 f NaN
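Another common `.str` method is `contains`, which builds a boolean mask; the `na=` parameter controls what NaN maps to (values below are my own):

```python
import numpy as np
import pandas as pd

s = pd.Series(['hee', 'fv', 'w', 'hija', np.nan])
mask = s.str.contains('h', na=False)  # na=False turns NaN into False instead of NaN
print(s[mask])
```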
Common string methods (1) - lower, upper, len, startswith, endswith
# Common string methods (1) - lower, upper, len, startswith, endswith
s = pd.Series(['A','b','bbhello','123',np.nan])

print(s.str.lower(),'→ lower: lowercase\n')
print(s.str.upper(),'→ upper: uppercase\n')
print(s.str.len(),'→ len: string length\n')
print(s.str.startswith('b'),'→ starts with b?\n')
print(s.str.endswith('3'),'→ ends with 3?\n')
0 a
1 b
2 bbhello
3 123
4 NaN
dtype: object → lower: lowercase

0          A
1          B
2    BBHELLO
3        123
4        NaN
dtype: object → upper: uppercase

0    1.0
1    1.0
2    7.0
3    3.0
4    NaN
dtype: float64 → len: string length

0    False
1     True
2     True
3    False
4      NaN
dtype: object → starts with b?

0    False
1    False
2    False
3     True
4      NaN
dtype: object → ends with 3?
Common string methods (2) - strip
# Common string methods (2) - strip
s = pd.Series([' jack', 'ji ll ', ' jesse ', 'frank'])
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '], index=range(3))
print(s)
print(df)
print('-----')

print(s.str.strip())   # remove leading and trailing whitespace
print(s.str.lstrip())  # remove leading whitespace only
print(s.str.rstrip())  # remove trailing whitespace only

df.columns = df.columns.str.strip()
print(df)
# Leading/trailing spaces are removed from the column names; inner spaces remain
0 jack
1 ji ll
2 jesse
3 frank
dtype: object
    Column A    Column B 
0 1.178373 -0.770705
1 0.611277 0.705297
2 -1.106696 1.455232
-----
0 jack
1 ji ll
2 jesse
3 frank
dtype: object
0 jack
1 ji ll
2 jesse
3 frank
dtype: object
0 jack
1 ji ll
2 jesse
3 frank
dtype: object
   Column A  Column B
0 1.178373 -0.770705
1 0.611277 0.705297
2 -1.106696 1.455232
Common string methods (3) - replace
# Common string methods (3) - replace
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '], index=range(3))
df.columns = df.columns.str.replace(' ','-')
print(df)
# Replace every space with a dash

df.columns = df.columns.str.replace('-','hehe',n=1)
print(df)
# n: maximum number of replacements
-Column-A- -Column-B-
0 -1.140552 -2.215192
1 -0.386697 1.323757
2 -0.288860  1.405160
  heheColumn-A-  heheColumn-B-
0 -1.140552 -2.215192
1 -0.386697 1.323757
2 -0.288860 1.405160
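`str.replace` also accepts a regular expression when `regex=True` is passed explicitly (the default for this flag has changed across pandas versions, so it is safest to state it); a small sketch with my own values:

```python
import pandas as pd

s = pd.Series(['a-1', 'b--2', 'c---3'])
cleaned = s.str.replace('-+', ':', regex=True)  # collapse any run of dashes into one colon
print(cleaned)
```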
Common string methods (4) - split, rsplit
# Common string methods (4) - split, rsplit
s = pd.Series(['a,b,c','1,2,3',['a,,,c'],np.nan])
print(s)
print(s.str.split(','),type(s.str.split(',')))  # ['a,,,c'] is a list, so its split result is NaN; the result is a Series
print('-----')
# Works like the built-in str.split

print(s.str.split(',')[0],type(s.str.split(',')[0]))  # indexing the Series returns the first row
print('-----')
# Direct indexing yields a list

print(s.str.split(',').str)  # .str on a Series whose elements are lists
print(s.str.split(',').str[1],type(s.str.split(',').str[1]))  # .str[1] takes element 1 of each list
print(s.str.split(',').str.get(1))  # .get(1) is equivalent and also returns a Series
print('-----')
# Elements of the split lists are accessed with .get() or []

print(s.str.split(',', expand=True))
print(s.str.split(',', expand=True, n = 1))
print(s.str.rsplit(',', expand=True, n = 1))
print('-----')
# expand=True expands the result into a DataFrame
# n limits the number of splits
# rsplit works like split but from the end of the string toward the start

df = pd.DataFrame({'key1':['a,b,c','1,2,3',[':,., ']],'key2':['a-b-c','1-2-3',[':-.- ']]})
print(df['key2'].str.split('-'))
# split on a DataFrame column
0 a,b,c
1 1,2,3
2 [a,,,c]
3 NaN
dtype: object
0 [a, b, c]
1 [1, 2, 3]
2 NaN
3 NaN
dtype: object <class 'pandas.core.series.Series'>
-----
['a', 'b', 'c'] <class 'list'>
-----
<pandas.core.strings.StringMethods object at 0x000001A8D1163828>
0 b
1 2
2 NaN
3 NaN
dtype: object <class 'pandas.core.series.Series'>
0 b
1 2
2 NaN
3 NaN
dtype: object
-----
     0    1    2
0 a b c
1 1 2 3
2 NaN NaN NaN
3  NaN  NaN
      0    1
0 a b,c
1 1 2,3
2 NaN NaN
3  NaN  NaN
      0  1
0 a,b c
1 1,2 3
2 NaN NaN
3 NaN NaN
-----
0 [a, b, c]
1 [1, 2, 3]
2 NaN
Name: key2, dtype: object
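The DataFrame returned by `expand=True` gets default integer column names; in practice they are usually renamed right away. A minimal sketch (column names are my own):

```python
import pandas as pd

s = pd.Series(['a,b,c', '1,2,3'])
parts = s.str.split(',', expand=True)  # one column per piece
parts.columns = ['first', 'second', 'third']
print(parts)
```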
String indexing
# String indexing
s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),'key2':['hee','fv','w','hija','123',np.nan]})

print(s.str[0])   # first character of each string
print(s.str[:2])  # first two characters of each string
print(df['key2'].str[0])
# After .str, indexing works the same as on a plain string
0 A
1 b
2 C
3 b
4 1
5 NaN
6 h
dtype: object
0 A
1 b
2 C
3 bb
4 12
5 NaN
6 hj
dtype: object
0 h
1 f
2 w
3 h
4 1
5 NaN
Name: key2, dtype: object
In-class exercise
df = pd.DataFrame({'name':['jack','tom','Marry','zack','heheda'],
                   'gender':['M ','M',' F',' M ',' F'],
                   'score':['90-92-89','89-78-88','90-92-95','78-88-76','60-60-67']})
print(df)
df['gender'] = df['gender'].str.strip()
df['name'] = df['name'].str.capitalize()  # capitalize the first letter
df = df.reindex(['gender','name','score'],axis=1)

sf = df['score'].str.split('-', expand=True)  # expand=True on a single column returns a DataFrame
print(sf,type(sf))
print(sf[0],type(sf[0]))

df['math'] = sf[0]
df['english'] = sf[1]
df['art'] = sf[2]
del df['score']
print(df)
# Key takeaways:
# Series split with expand=False returns a Series; se[0] is the first row, while se.str[0] takes the first element of each list and is still a Series
# Series split with expand=True returns a DataFrame
name gender score
0 jack M 90-92-89
1 tom M 89-78-88
2 Marry F 90-92-95
3 zack M 78-88-76
4  heheda      F  60-60-67
   0   1   2
0 90 92 89
1 89 78 88
2 90 92 95
3 78 88 76
4 60 60 67 <class 'pandas.core.frame.DataFrame'>
0 90
1 89
2 90
3 78
4 60
Name: 0, dtype: object <class 'pandas.core.series.Series'>
  gender    name math english art
0 M Jack 90 92 89
1 M Tom 89 78 88
2 F Marry 90 92 95
3 M Zack 78 88 76
4 F Heheda 60 60 67
Merging: merge, join
'''
【Course 2.16】 Merging with merge and join. Pandas provides full-featured, high-performance in-memory join operations, very similar to relational databases such as SQL.
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
'''
merge → similar to Excel's VLOOKUP
# merge → similar to Excel's VLOOKUP
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df3 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df4 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                    'key2': ['K0', 'K0', 'K0', 'K0'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
print(pd.merge(df1, df2, on='key'))
print('------')
# left: the first df
# right: the second df
# on: the join key

print(pd.merge(df3, df4, on=['key1','key2']))
# multiple join keys
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
------
  key1 key2   A   B   C   D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
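When auditing a VLOOKUP-style merge, the `indicator=True` parameter adds a `_merge` column recording where each row came from; a sketch with my own keys:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K4'], 'A': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'B': [10, 20, 30]})
m = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(m)  # the _merge column marks each row as both / left_only / right_only
```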
merge → parameter how → join type
# Parameter how → join type
print(pd.merge(df3, df4, on=['key1','key2'], how = 'inner'))
print('------')
# inner: the default, intersection of keys

print(pd.merge(df3, df4, on=['key1','key2'], how = 'outer'))
print('------')
# outer: union of keys, missing cells filled with NaN

print(df3)
print(df4)
print(pd.merge(df3, df4, on=['key1','key2'], how = 'left'))
print('------')
# left: keep all keys of df3, missing cells filled with NaN

print(pd.merge(df3, df4, on=['key1','key2'], how = 'right'))
# right: keep all keys of df4, missing cells filled with NaN
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
------
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
------
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
------
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
merge → parameters left_on, right_on, left_index, right_index → set left and right keys separately when the key is not a shared column
# Parameters left_on, right_on, left_index, right_index → set left and right keys separately
df1 = pd.DataFrame({'lkey':list('bbacaab'),'data1':range(7)})
df2 = pd.DataFrame({'rkey':list('abd'),'date2':range(3)})
print(pd.merge(df1, df2, left_on='lkey', right_on='rkey'))
print('------')
# df1 joins on 'lkey', df2 joins on 'rkey'

df1 = pd.DataFrame({'key':list('abcdfeg'),'data1':range(7)})
df2 = pd.DataFrame({'date2':range(100,105)},index = list('abcde'))
print(pd.merge(df1, df2, left_on='key', right_index=True))
# df1 joins on 'key', df2 joins on its index
# left_index: if True, the first df joins on its index; defaults to False
# right_index: if True, the second df joins on its index; defaults to False
# So left_on, right_on, left_index, right_index can be combined:
# left_on + right_on, left_on + right_index, left_index + right_on, left_index + right_index
lkey data1 rkey date2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
------
  key  data1  date2
0 a 0 100
1 b 1 101
2 c 2 102
3 d 3 103
5 e 5 104
merge → parameter sort
# Parameter sort
df1 = pd.DataFrame({'key':list('bbacaab'),'data1':[1,3,2,4,5,9,7]})
df2 = pd.DataFrame({'key':list('abd'),'date2':[11,2,33]})
x1 = pd.merge(df1,df2, on = 'key', how = 'outer')
x2 = pd.merge(df1,df2, on = 'key', sort=True, how = 'outer')
print(x1)
print(x2)
print('------')
# sort: sort the result DataFrame lexicographically by the join key.
# Defaults to False; leaving it False substantially improves performance

print(x2.sort_values('data1'))
# The usual DataFrame sorting methods also work: sort_values, sort_index
key data1 date2
0 b 1.0 2.0
1 b 3.0 2.0
2 b 7.0 2.0
3 a 2.0 11.0
4 a 5.0 11.0
5 a 9.0 11.0
6 c 4.0 NaN
7    d    NaN   33.0
  key  data1  date2
0 a 2.0 11.0
1 a 5.0 11.0
2 a 9.0 11.0
3 b 1.0 2.0
4 b 3.0 2.0
5 b 7.0 2.0
6 c 4.0 NaN
7 d NaN 33.0
------
  key  data1  date2
3 b 1.0 2.0
0 a 2.0 11.0
4 b 3.0 2.0
6 c 4.0 NaN
1 a 5.0 11.0
5 b 7.0 2.0
2 a 9.0 11.0
7 d NaN 33.0
.join() → join directly on the index
# .join() → join directly on the index
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
print(left)
print(right)
print(left.join(right))
print(left.join(right, how='outer'))
print('-----')
# Equivalent to: pd.merge(left, right, left_index=True, right_index=True, how='outer')

df1 = pd.DataFrame({'key':list('bbacaab'),'data1':[1,3,2,4,5,9,7]})
df2 = pd.DataFrame({'key':list('abd'),'date2':[11,2,33]})
print(df1)
print(df2)
print(pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_1', '_2')))
print(df1.join(df2['date2']))
print('-----')
# suffixes defaults to ('_x', '_y')

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'C': ['C0', 'C1'],
                      'D': ['D0', 'D1']},
                     index=['K0', 'K1'])
print(left)
print(right)
print(left.join(right, on = 'key'))
# Equivalent to pd.merge(left, right, left_on='key', right_index=True, how='left', sort=False)
# left's 'key' column against right's index
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3
-----
  key  data1
0   b      1
1   b      3
2   a      2
3   c      4
4   a      5
5   a      9
6   b      7
  key  date2
0   a     11
1   b      2
2   d     33
  key_1  data1 key_2  date2
0     b      1     a     11
1     b      3     b      2
2     a      2     d     33
  key  data1  date2
0   b      1   11.0
1   b      3    2.0
2   a      2   33.0
3   c      4    NaN
4   a      5    NaN
5   a      9    NaN
6   b      7    NaN
-----
    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K0
3  A3  B3  K1
     C   D
K0  C0  D0
K1  C1  D1
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1
In-class exercise
df1 = pd.DataFrame(np.random.rand(3),columns=['values1'])
df1['key'] = list('abc')
print('df1:')
print(df1)
df2 = pd.DataFrame(np.random.rand(3),columns=['values2'])
print('df2:')
df2['key'] = list('bcd')
print(df2)

df3 = pd.merge(df1,df2,on='key',how='outer')
print('Merged df3 (outer join, union of keys):')
print(df3)
df1:
    values1 key
0  0.363363   a
1  0.705128   b
2  0.514941   c
df2:
    values2 key
0  0.305494   b
1  0.243707   c
2  0.816473   d
Merged df3 (outer join, union of keys):
    values1 key   values2
0  0.363363   a       NaN
1  0.705128   b  0.305494
2  0.514941   c  0.243707
3       NaN   d  0.816473
df1 = pd.DataFrame(np.random.rand(3),columns=['values1'])
df1['lkey'] = list('abc')
print('df1:')
print(df1)
df2 = pd.DataFrame(np.random.rand(3),columns=['values2'])
print('df2:')
df2['rkey'] = list('bcd')
print(df2)

df3 = pd.merge(df1,df2,left_on='lkey',right_on='rkey',how='left')
print('Merged df3 (left join, keep all of left):')
print(df3)
df1:
    values1 lkey
0  0.625525    a
1  0.121965    b
2  0.114507    c
df2:
    values2 rkey
0  0.406097    b
1  0.922127    c
2  0.326960    d
Merged df3 (left join, keep all of left):
    values1 lkey   values2 rkey
0  0.625525    a       NaN  NaN
1  0.121965    b  0.406097    b
2  0.114507    c  0.922127    c
df1 = pd.DataFrame(np.random.rand(3),columns=['values1'])
df1['lkey'] = list('abc')
print('df1:')
print(df1)
df2 = pd.DataFrame(np.random.rand(3),columns=['values2'],index=list('bcd'))
print('df2:')
df2['value3'] = [5,6,7]
print(df2)

df3 = pd.merge(df1,df2,left_on='lkey',right_index=True,how='inner')
print('Merged df3 (inner join, intersection of keys):')
print(df3)
df1:
    values1 lkey
0  0.509719    a
1  0.157929    b
2  0.392352    c
df2:
    values2  value3
b  0.805541       5
c  0.897287       6
d  0.093350       7
Merged df3 (inner join, intersection of keys):
    values1 lkey   values2  value3
1  0.157929    b  0.805541       5
2  0.392352    c  0.897287       6
Concatenation and patching: concat, combine_first
'''
【Course 2.17】 Concatenation and patching with concat and combine_first. Concatenation joins objects along an axis.
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
'''
Concatenation: concat
# Concatenation: concat
s1 = pd.Series([1,2,3])
s2 = pd.Series([2,3,4])
s3 = pd.Series([1,2,3],index = ['a','c','h'])
s4 = pd.Series([2,3,4],index = ['b','e','d'])
print(pd.concat([s1,s2]))
print(pd.concat([s3,s4]).sort_index())
print('-----')
# Default axis=0: rows are stacked on rows

print(pd.concat([s3,s4], axis=1))
print('-----')
# axis=1: columns side by side, producing a DataFrame aligned on the (outer-joined) index
0 1
1 2
2 3
0 2
1 3
2 4
dtype: int64
a 1
b 2
c 2
d 4
e 3
h 3
dtype: int64
-----
     0    1
a 1.0 NaN
b NaN 2.0
c 2.0 NaN
d NaN 4.0
e NaN 3.0
h 3.0 NaN
-----
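Two further concat parameters worth knowing are `ignore_index` (rebuild a clean 0..n-1 index instead of keeping duplicated labels) and `keys` (shown in the next section); a minimal sketch of the former:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5])
stacked = pd.concat([s1, s2], ignore_index=True)  # discard the old labels, relabel 0..n-1
print(stacked)
```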
Join options: join, join_axes
# Join options: join, join_axes
s5 = pd.Series([1,2,3],index = ['a','b','c'])
s6 = pd.Series([2,3,4],index = ['b','c','d'])
print(pd.concat([s5,s6], axis= 1))
print(pd.concat([s5,s6], axis= 1, join='inner'))
print(pd.concat([s5,s6], axis= 1, join_axes=[['a','b','d']]))
# join: {'inner','outer'}, defaults to 'outer'. How to handle indexes on the other axis: outer takes the union, inner the intersection
# join_axes: specify the exact index to align on
0 1
a 1.0 NaN
b 2.0 2.0
c 3.0 3.0
d  NaN  4.0
   0  1
b  2  2
c  3  3
     0    1
a  1.0  NaN
b  2.0  2.0
d  NaN  4.0
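Note that `join_axes` was removed in pandas 1.0; on current versions the same result is obtained by reindexing after the concat. A sketch under that assumption:

```python
import pandas as pd

s5 = pd.Series([1, 2, 3], index=list('abc'))
s6 = pd.Series([2, 3, 4], index=list('bcd'))
# modern replacement for join_axes=[['a','b','d']]
out = pd.concat([s5, s6], axis=1).reindex(['a', 'b', 'd'])
print(out)
```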
Overriding column names (!!!)
# Overriding column names
sre = pd.concat([s5,s6], keys = ['one','two'])
print(sre,type(sre))
print(sre.index)
print('-----')
# keys: sequence, defaults to None. The passed keys become the outermost level of a hierarchical index

sre = pd.concat([s5,s6], axis=1, keys = ['one','two'])
print(sre,type(sre))
# With axis=1 the keys override the column names
one  a    1
     b    2
     c    3
two  b    2
     c    3
     d    4
dtype: int64 <class 'pandas.core.series.Series'>
MultiIndex(levels=[['one', 'two'], ['a', 'b', 'c', 'd']],
           codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 1, 2, 3]])
-----
   one  two
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0 <class 'pandas.core.frame.DataFrame'>
Patching: combine_first()
# Patching: combine_first()
df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, np.nan, np.nan],[np.nan, 7., np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]],index=[1, 2])
print(df1)
print(df2)
print(df1.combine_first(df2))
print('-----')
# Aligned on the index: df1's NaN cells are filled from df2
# If df2 has index labels that df1 lacks, they are added to the result, e.g. index=['a',1]

df1.update(df2)
print(df1)
# update: df2 overwrites df1 outright at matching index positions
      0    1    2
0   NaN  3.0  5.0
1  -4.6  NaN  NaN
2   NaN  7.0  NaN
      0    1    2
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0
      0    1    2
0   NaN  3.0  5.0
1  -4.6  NaN -8.2
2  -5.0  7.0  4.0
-----
      0    1    2
0   NaN  3.0  5.0
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0
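The contrast between combine_first (fill only the holes) and update (overwrite outright) is easy to verify on a tiny deterministic frame of my own:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [np.nan, 2.0], 'y': [3.0, np.nan]})
df2 = pd.DataFrame({'x': [10.0, 20.0], 'y': [30.0, 40.0]})
patched = df1.combine_first(df2)  # keep df1's value where present, fall back to df2
print(patched)
```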
In-class exercise
df1 = pd.DataFrame(np.random.rand(4,2),index=list('abcd'),columns=['value1','value2'])
print("df1:")
print(df1)
print('----------')

df2 = pd.DataFrame(np.random.rand(4,2),index=list('efgh'),columns=['value1','value2'])
print("df2:")
print(df2)
print('----------')

df3 = pd.concat([df1,df2])
print('Stacked into df3:')
print(df3)
df1:
     value1    value2
a  0.261681  0.109421
b  0.782509  0.374875
c  0.447257  0.056709
d  0.349732  0.669266
----------
df2:
     value1    value2
e  0.902231  0.531241
f  0.818947  0.537972
g  0.052821  0.696736
h  0.098303  0.911916
----------
Stacked into df3:
     value1    value2
a  0.261681  0.109421
b  0.782509  0.374875
c  0.447257  0.056709
d  0.349732  0.669266
e  0.902231  0.531241
f  0.818947  0.537972
g  0.052821  0.696736
h  0.098303  0.911916
data = np.random.rand(4,2)
data[1:3,0] = np.NAN
df1 = pd.DataFrame(data,index=list('abcd'),columns=['value1','value2'])
print("df1:")
print(df1)
print('----------')

df2 = pd.DataFrame(np.arange(8).reshape(4,2),index=list('abcd'),columns=['value1','value2'])
print("df2:")
print(df2)
print('----------')

df3 = df1.combine_first(df2)
print('df1 after patching:')
print(df3)
df1:
     value1    value2
a  0.451591  0.556266
b       NaN  0.943348
c       NaN  0.944175
d  0.273202  0.594670
----------
df2:
   value1  value2
a       0       1
b       2       3
c       4       5
d       6       7
----------
df1 after patching:
     value1    value2
a  0.451591  0.556266
b  2.000000  0.943348
c  4.000000  0.944175
d  0.273202  0.594670
Deduplication and replacement: .duplicated / .replace
'''
【Course 2.18】 Deduplication and replacement: .duplicated / .replace
'''
Deduplication: .duplicated
# Deduplication: .duplicated
s = pd.Series([1,1,1,1,2,2,2,3,4,5,5,5,5])
print(s.duplicated())
print(s[s.duplicated() == False])
print('-----')
# Flags repeated values
# Boolean indexing then yields the distinct values

s_re = s.drop_duplicates()
print(s_re)
print('-----')
# .drop_duplicates() removes the duplicates directly
# inplace parameter: whether to modify the original, defaults to False

df = pd.DataFrame({'key1':['a','a',3,4,5],'key2':['a','a','b','b','c']})
print(df.duplicated())
print(df['key2'].duplicated())
# duplicated on a DataFrame
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 True
dtype: bool
0 1
4 2
7 3
8 4
9 5
dtype: int64
-----
0 1
4 2
7 3
8 4
9 5
dtype: int64
-----
0 False
1 True
2 False
3 False
4 False
dtype: bool
0 False
1 True
2 False
3 True
4 False
Name: key2, dtype: bool
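`drop_duplicates` also accepts `subset` (which columns to compare) and `keep` ('first', 'last', or False); a minimal sketch with my own values:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'], 'key2': [1, 2, 2]})
d_first = df.drop_duplicates(subset=['key1'])               # dedupe on key1, keep the first hit
d_last  = df.drop_duplicates(subset=['key1'], keep='last')  # keep the last hit instead
print(d_first)
print(d_last)
```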
Replacement: .replace
# Replacement: .replace
s = pd.Series(list('ascaazsd'))
print(s.replace('a', np.nan))
print(s.replace(['a','s'] ,np.nan))
print(s.replace({'a':'hello world!','s':123}))
# Replaces one value or several values at once
# Accepts a list or a dict
0 NaN
1 s
2 c
3 NaN
4 NaN
5 z
6 s
7 d
dtype: object
0 NaN
1 NaN
2 c
3 NaN
4 NaN
5 z
6 NaN
7 d
dtype: object
0 hello world!
1 123
2 c
3 hello world!
4 hello world!
5 z
6 123
7 d
dtype: object
Grouping data (important!)
'''
【Course 2.19】 Grouping data with groupby:
① split the data into groups based on some criteria
② apply a function to each group independently
③ combine the results into a single data structure
A DataFrame can be grouped along rows (axis=0) or columns (axis=1); a function is applied to each group to produce a new value, and the per-group results are combined into the final object.
df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
'''
Grouping with groupby
# Grouping
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
print(df)
print('------')

print(df.groupby('A'), type(df.groupby('A')))
print('------')
# Grouping alone yields a groupby object, an intermediate result; nothing is computed yet

a = df.groupby('A').mean()
b = df.groupby(['A','B']).mean()
c = df.groupby(['A'])['D'].mean()  # group by A, then average D
print(a,type(a),'\n',a.columns)
print(b,type(b),'\n',b.columns)
print(c,type(c))
# Computing on the groups yields a new DataFrame
# Default axis=0: group by rows
# Can group by a single column or several (pass a list)
A B C D
0 foo one 0.172157 1.118132
1 bar one 0.323895 1.188046
2 foo two -1.048614 -0.747383
3 bar three 0.338934 1.587185
4 foo two 0.423342 -1.542578
5 bar two 0.255962 1.337651
6 foo one 0.225461 0.557273
7 foo three -0.748118 0.418550
------
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002362C23ADD8> <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
------
            C         D
A
bar  0.306263  1.370960
foo -0.195154 -0.039201 <class 'pandas.core.frame.DataFrame'> 
 Index(['C', 'D'], dtype='object')
                  C         D
A   B
bar one    0.323895  1.188046
    three  0.338934  1.587185
    two    0.255962  1.337651
foo one    0.198809  0.837702
    three -0.748118  0.418550
    two   -0.312636 -1.144981 <class 'pandas.core.frame.DataFrame'> 
 Index(['C', 'D'], dtype='object')
A
bar    1.370960
foo   -0.039201
Name: D, dtype: float64 <class 'pandas.core.series.Series'>
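By default the group keys become the result's index; passing `as_index=False` keeps them as an ordinary column, which is often more convenient for further processing. A sketch with my own values:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'C': [1.0, 2.0, 3.0]})
g = df.groupby('A', as_index=False)['C'].mean()  # the group key stays a regular column
print(g)
```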
Groups as iterable objects
# Groups as iterable objects
df = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
print(df)
print(df.groupby('X'), type(df.groupby('X')))
print('-----')

print(list(df.groupby('X')), '→ iterable; can be turned directly into a list\n')
print(list(df.groupby('X'))[0], '→ each item is a (name, group) tuple\n')
for n,g in df.groupby('X'):
    print(n)
    print(g)
    print('###')
print('-----')
# n is the group name, g is the grouped DataFrame

print(df.groupby(['X']).get_group('A'),'\n')
print(df.groupby(['X']).get_group('B'),'\n')
print('-----')
# .get_group() extracts a single group

grouped = df.groupby(['X'])
print(grouped.groups)
print(grouped.groups['A'])  # same as: df.groupby('X').groups['A']
print('-----')
# .groups returns the groups as a dict
# Dict indexing then inspects the members of a group

sz = grouped.size()
print(sz,type(sz))
print('-----')
# .size(): length of each group

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
grouped = df.groupby(['A','B']).groups
print(df)
print(grouped)
print(grouped[('foo', 'three')])
# Grouping on two columns
X Y
0 A 1
1 B 4
2 A 3
3 B 2
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002362C6C09B0> <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
-----
[('A',    X  Y
0  A  1
2  A  3), ('B',    X  Y
1  B  4
3  B  2)] → iterable; can be turned directly into a list

('A',    X  Y
0  A  1
2  A  3) → each item is a (name, group) tuple

A
   X  Y
0  A  1
2  A  3
###
B
   X  Y
1  B  4
3  B  2
###
-----
   X  Y
0  A  1
2  A  3 

   X  Y
1  B  4
3  B  2 

-----
{'A': Int64Index([0, 2], dtype='int64'), 'B': Int64Index([1, 3], dtype='int64')}
Int64Index([0, 2], dtype='int64')
-----
X
A 2
B 2
dtype: int64 <class 'pandas.core.series.Series'>
-----
     A      B         C         D
0 foo one 0.981468 0.473817
1 bar one -1.236826 0.028449
2 foo two -1.611723 1.444489
3 bar three 1.136316 0.881776
4 foo two 0.523383 0.707726
5 bar two -2.196340 -0.201260
6 foo one 1.014091 0.256455
7 foo three -1.700698 1.217236
{('bar', 'one'): Int64Index([1], dtype='int64'), ('bar', 'three'): Int64Index([3], dtype='int64'), ('bar', 'two'): Int64Index([5], dtype='int64'), ('foo', 'one'): Int64Index([0, 6], dtype='int64'), ('foo', 'three'): Int64Index([7], dtype='int64'), ('foo', 'two'): Int64Index([2, 4], dtype='int64')}
Int64Index([7], dtype='int64')
Grouping on other axes
# Grouping on other axes
df = pd.DataFrame({'data1':np.random.rand(2),
                   'data2':np.random.rand(2),
                   'key1':[1,'b'],
                   'key2':['one','two']})
print(df)
print(df.dtypes,type(df.dtypes))  # df.dtypes is a Series
print('-----')
for n,p in df.groupby(df.dtypes, axis=1):
    print(n)
    print(p)
    print('##')
# Columns split by value type
data1 data2 key1 key2
0 0.572579 0.924789 1 one
1 0.575395 0.814979 b two
data1 float64
data2 float64
key1 object
key2 object
dtype: object <class 'pandas.core.series.Series'>
-----
float64
      data1     data2
0 0.572579 0.924789
1 0.575395 0.814979
##
object
  key1 key2
0 1 one
1 b two
##
Grouping by dict or Series
# Grouping by dict or Series
df = pd.DataFrame(np.arange(16).reshape(4,4),columns = ['a','b','c','d'])
print(df)
print('-----')
# A dict can collapse several columns into one custom group
mapping = {'a':'one','b':'one','c':'two','d':'two','e':'three'}
by_column = df.groupby(mapping, axis = 1)
print(by_column.sum())
print('-----')
# In mapping, columns a and b map to one, c and d map to two: grouping by dict

s = pd.Series(mapping)
print(s,'\n')
print(s.groupby(s).count())
# In s, index labels a and b map to one, c and d to two: grouping by Series
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
-----
   one  two
0 1 5
1 9 13
2 17 21
3 25 29
-----
a one
b one
c two
d two
e three
dtype: object 
one      2
three 1
two 2
dtype: int64
Grouping by function
# Grouping by function
df = pd.DataFrame(np.arange(16).reshape(4,4),
                  columns = ['a','b','c','d'],
                  index = ['abc','bcd','aa','b'])
print(df,'\n')
print(df.groupby(len).sum())
# The function receives the index values by default
# Here the rows are grouped by index label length
a b c d
abc 0 1 2 3
bcd 4 5 6 7
aa 8 9 10 11
b    12  13  14  15 
     a   b   c   d
1 12 13 14 15
2 8 9 10 11
3 4 6 8 10
Common aggregation methods on groups
# Common aggregation methods on groups
s = pd.Series([1, 2, 3, 10, 20, 30], index = [1, 2, 3, 1, 2, 3])
grouped = s.groupby(level=0)  # with a duplicated index, .groupby(level=0) puts rows sharing an index label into one group
print(grouped)
print(grouped.first(),'→ first: first non-NaN value\n')
print(grouped.last(),'→ last: last non-NaN value\n')
print(grouped.sum(),'→ sum: sum of non-NaN values\n')
print(grouped.mean(),'→ mean: mean of non-NaN values\n')
print(grouped.median(),'→ median: median of non-NaN values\n')
print(grouped.count(),'→ count: number of non-NaN values\n')
print(grouped.min(),'→ min/max: min and max of non-NaN values\n')
print(grouped.std(),'→ std/var: standard deviation and variance of non-NaN values\n')
print(grouped.prod(),'→ prod: product of non-NaN values\n')
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002362C6D8F60>
1 1
2 2
3 3
dtype: int64 → first: first non-NaN value

1    10
2    20
3    30
dtype: int64 → last: last non-NaN value

1    11
2    22
3    33
dtype: int64 → sum: sum of non-NaN values

1     5.5
2    11.0
3    16.5
dtype: float64 → mean: mean of non-NaN values

1     5.5
2    11.0
3    16.5
dtype: float64 → median: median of non-NaN values

1    2
2    2
3    2
dtype: int64 → count: number of non-NaN values

1    1
2    2
3    3
dtype: int64 → min/max: min and max of non-NaN values

1     6.363961
2    12.727922
3    19.091883
dtype: float64 → std/var: standard deviation and variance of non-NaN values

1    10
2    40
3    90
dtype: int64 → prod: product of non-NaN values
Multiple functions per group: agg()
# Multiple functions per group: agg()
df = pd.DataFrame({'a':[1,1,2,2],'b':np.random.rand(4),'c':np.random.rand(4),'d':np.random.rand(4)})
print(df)
print(df.groupby('a').agg(['mean',np.sum]))
print(df.groupby('a')['b'].agg({'result1':np.mean,'result2':np.sum}))  # deprecated: a dict on a SeriesGroupBy raises a FutureWarning
# functions can be given as strings or as np. methods
# pass a list or a dict; with a dict, the keys become the output column names
a b c d
0 1 0.456934 0.286735 0.889033
1 1 0.354812 0.117281 0.476132
2 2 0.958267 0.239303 0.276428
3  2  0.840423  0.544267  0.514867

          b                   c                   d          
       mean       sum      mean       sum      mean       sum
a
1 0.405873 0.811746 0.202008 0.404016 0.682582 1.365165
2  0.899345  1.798690

    result1   result2
a
1 0.405873 0.811746
2  0.899345  1.798690

D:\python\Anaconda3\lib\site-packages\ipykernel_launcher.py:10: FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version
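The dict form that triggers this FutureWarning has a supported replacement: named aggregation (pandas ≥ 0.25). A sketch with fixed numbers so the result is predictable:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1.0, 3.0, 2.0, 4.0]})

# Named aggregation: keyword names become output columns,
# values are the aggregation functions (strings or callables)
out = df.groupby('a')['b'].agg(result1='mean', result2='sum')
print(out)
```

Group 1 has b = [1, 3] so result1 = 2.0 and result2 = 4.0; group 2 gives 3.0 and 6.0.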
Exercise
df = pd.DataFrame({'A' : ['one', 'two', 'three', 'one','two', 'three', 'one', 'two'],'B' : ['h', 'h', 'h', 'h', 'f', 'f', 'f', 'f'],'C' : np.arange(10,26,2),'D' : np.random.randn(8),'E':np.random.rand(8)})
print(df)
df2 = df.groupby(['A'])[['C','D']].mean()
print(df2)
df2 = df.groupby(['A','B'])[['D','E']].sum()
print(df2)
dica = df.groupby(['A']).groups
print(dica)
# dtypes = df.dtypes
# print(dtypes)
dt = df.groupby(df.dtypes, axis=1).sum()  # axis=1 groups by columns
print(dt)
print('---------')
mapping = {'C':'one','D':'one'}
dm = df.groupby(mapping,axis=1).groups
print(dm)  # this grouping only contains 'one'
dmap = df.groupby(mapping,axis=1).sum()  # with axis=1 the grouping runs over columns; this adds columns C and D
print(dmap)
print('-------')
print(df.groupby(mapping,axis=1).get_group('one'))
print('-------')
dcd = df.groupby(mapping,axis=1).get_group('one').sum()  # pulls columns C and D out of group 'one' and sums each; note the result is a Series
print(dcd)  # its index is C, D
print('-------')
db = df.groupby(['B']).agg(['mean',np.sum,'max','min'])
print(db)
A B C D E
0 one h 10 1.006026 0.133697
1 two h 12 -0.359184 0.976752
2 three h 14 0.066493 0.933959
3 one h 16 -1.462475 0.614514
4 two f 18 2.007785 0.458461
5 three f 20 -1.650301 0.805937
6 one f 22 -0.197564 0.760070
7    two  f  24 -1.654774  0.633005

        C         D
A
one 16 -0.218005
three 17 -0.791904
two     18 -0.002057

                D         E
A B
one   f -0.197564  0.760070
      h -0.456449  0.748211
three f -1.650301  0.805937
      h  0.066493  0.933959
two   f  0.353012  1.091466
      h -0.359184  0.976752
{'one': Int64Index([0, 3, 6], dtype='int64'), 'three': Int64Index([2, 5], dtype='int64'), 'two': Int64Index([1, 4, 7], dtype='int64')}

   int32   float64  object
0 10 1.139723 oneh
1 12 0.617569 twoh
2 14 1.000452 threeh
3 16 -0.847962 oneh
4 18 2.466246 twof
5 20 -0.844364 threef
6 22 0.562506 onef
7 24 -1.021768 twof
---------
{'one': Index(['C', 'D'], dtype='object')}

         one
0 11.006026
1 11.640816
2 14.066493
3 14.537525
4 20.007785
5 18.349699
6 21.802436
7 22.345226
-------
    C         D
0 10 1.006026
1 12 -0.359184
2 14 0.066493
3 16 -1.462475
4 18 2.007785
5 20 -1.650301
6 22 -0.197564
7 24 -1.654774
-------
C 136.000000
D -2.243994
dtype: float64
-------
      C                   D                                       E  \
   mean sum max min    mean       sum       max       min      mean
B
f 21 84 24 18 -0.373713 -1.494854 2.007785 -1.654774 0.664369
h  13  52  16  10 -0.187285 -0.749141  1.006026 -1.462475  0.664731

        sum       max       min
B
f 2.657474 0.805937 0.458461
h 2.658922 0.976752 0.133697
Group transforms and the general "split-apply-combine"
'''
[Lesson 2.20] Group transforms and the general "split-apply-combine": transform / apply
'''
Group transforms: transform
# Group transforms: transform
df = pd.DataFrame({'data1':np.random.rand(5),'data2':np.random.rand(5),'key1':list('aabba'),'key2':['one','two','one','two','one']})
k_mean = df.groupby('key1').mean()
print(df)
print(k_mean)
print(pd.merge(df,k_mean,left_on='key1',right_index=True).add_prefix('mean_'))  # .add_prefix('mean_') adds a prefix to every column name
print('-----')
# grouping then merging yields a DataFrame that carries the group means
print(df.groupby('key2').mean())  # mean per key2 group
print(df.groupby('key2').transform(np.mean))
# each data1/data2 element is replaced by the mean of its group
# string columns cannot take part in the computation
data1 data2 key1 key2
0 0.234441 0.600356 a one
1 0.773225 0.730067 a two
2 0.483987 0.637845 b one
3 0.243679 0.997665 b two
4  0.882532  0.617680    a  one

         data1     data2
key1
a 0.630066 0.649368
b     0.363833  0.817755

   mean_data1_x  mean_data2_x mean_key1 mean_key2  mean_data1_y  mean_data2_y
0 0.234441 0.600356 a one 0.630066 0.649368
1 0.773225 0.730067 a two 0.630066 0.649368
4 0.882532 0.617680 a one 0.630066 0.649368
2 0.483987 0.637845 b one 0.363833 0.817755
3 0.243679 0.997665 b two 0.363833 0.817755
-----
          data1     data2
key2
one 0.533653 0.618627
two   0.508452  0.863866

      data1     data2
0 0.533653 0.618627
1 0.508452 0.863866
2 0.533653 0.618627
3 0.508452 0.863866
4 0.533653 0.618627
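Because `transform` keeps the original shape, its result can be assigned straight back as a new column. A minimal sketch with fixed values (`data1_groupmean` is an illustrative column name):

```python
import pandas as pd

df = pd.DataFrame({'data1': [1.0, 3.0, 2.0, 4.0],
                   'key1': ['a', 'a', 'b', 'b']})

# transform returns a result aligned row-for-row with the original frame,
# so each row receives the mean of its own group
df['data1_groupmean'] = df.groupby('key1')['data1'].transform('mean')
print(df)
```

Group 'a' has mean 2.0 and group 'b' has mean 3.0, broadcast back to every member row.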
The general GroupBy method: apply
# The general GroupBy method: apply
df = pd.DataFrame({'data1':np.random.rand(5),'data2':np.random.rand(5),'key1':list('aabba'),'key2':['one','two','one','two','one']})
print(df.groupby('key1').apply(lambda x: x.describe()))
# apply runs the given function directly on each group
# here an anonymous function that describes each group's statistics

def f_df1(d,n):
    return(d.sort_index()[:n])
def f_df2(d,k1):
    return(d[k1])
print(df.groupby('key1').apply(f_df1,2),'\n')
print(df.groupby('key1').apply(f_df2,'data2'))
print(type(df.groupby('key1').apply(f_df2,'data2')))
# f_df1: returns the first n rows after sorting by index
# f_df2: returns column k1 of each group; the result is a Series with a hierarchical index
# extra arguments go after the function, or as keywords: .apply(f_df1, n = 2)
data1 data2
key1
a    count  3.000000  3.000000
     mean   0.545712  0.522070
     std    0.315040  0.463898
     min    0.202184  0.026861
     25%    0.408011  0.309841
     50%    0.613838  0.592821
     75%    0.717477  0.769675
     max    0.821116  0.946529
b    count  2.000000  2.000000
     mean   0.446845  0.399589
     std    0.311004  0.160466
     min    0.226932  0.286123
     25%    0.336888  0.342856
     50%    0.446845  0.399589
     75%    0.556801  0.456322
     max    0.666758  0.513056

         data1     data2 key1 key2
key1
a    0  0.202184  0.946529    a  one
     1  0.821116  0.592821    a  two
b    2  0.226932  0.286123    b  one
     3  0.666758  0.513056    b  two 

key1
a     0    0.946529
      1    0.592821
      4    0.026861
b     2    0.286123
      3    0.513056
Name: data2, dtype: float64
<class 'pandas.core.series.Series'>
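`apply` also accepts functions that reduce each group to a scalar, in which case the result is a plain Series. A sketch computing each group's range (max minus min) on fixed data:

```python
import pandas as pd

df = pd.DataFrame({'data2': [1.0, 5.0, 2.0, 8.0],
                   'key1': ['a', 'a', 'b', 'b']})

# The lambda collapses each group's Series to one number,
# so apply returns a Series indexed by the group keys
spread = df.groupby('key1')['data2'].apply(lambda s: s.max() - s.min())
print(spread)
```

Group 'a' spans 1.0 to 5.0 (range 4.0) and group 'b' spans 2.0 to 8.0 (range 6.0).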
Exercise
df = pd.DataFrame({'data1':np.random.rand(8),'data2':np.random.rand(8),'key':list('aabbabab')})
print('df created:\n',df,'\n------')
df2 = df.groupby(['key']).mean()
print(df2)
df3 = pd.merge(df,df2,left_on='key',right_index=True).add_prefix('mean_')
print('Result after aggregating and merging:')
print(df3)
# df_ = df.groupby('key').transform(np.mean)
# print('Result after aggregating and merging:\n',df.join(df_,rsuffix='_mean'),'\n------')
df created:
       data1     data2 key
0 0.841120 0.987305 a
1 0.965404 0.734070 a
2 0.511385 0.044053 b
3 0.912349 0.828049 b
4 0.819506 0.131610 a
5 0.723875 0.642737 b
6 0.822328 0.457494 a
7 0.107970 0.936853 b
------
          data1     data2
key
a 0.862090 0.577619
b 0.563895 0.612923
Result after aggregating and merging:
   mean_data1_x  mean_data2_x mean_key  mean_data1_y  mean_data2_y
0 0.841120 0.987305 a 0.862090 0.577619
1 0.965404 0.734070 a 0.862090 0.577619
4 0.819506 0.131610 a 0.862090 0.577619
6 0.822328 0.457494 a 0.862090 0.577619
2 0.511385 0.044053 b 0.563895 0.612923
3 0.912349 0.828049 b 0.563895 0.612923
5 0.723875 0.642737 b 0.563895 0.612923
7 0.107970 0.936853 b 0.563895 0.612923
Pivot tables and cross tables
'''
[Lesson 2.21] Pivot tables and cross tables, similar to Excel pivots - pivot table / crosstab
'''
Pivot table: pivot_table
# Pivot table: pivot_table
# pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
date = ['2017-5-1','2017-5-2','2017-5-3']*3
rng = pd.to_datetime(date)
df = pd.DataFrame({'date':rng,'key':list('abcdabcda'),'values':np.random.rand(9)*10})
print(df)
print('-----')
# roughly equivalent to a groupby over index, aggregating values with aggfunc; columns explained below
print(pd.pivot_table(df, values = 'values', index = 'date', aggfunc=np.sum))  # aggfunc='sum' also works
print(pd.pivot_table(df, values = 'values', index = 'date', columns = 'key', aggfunc=np.sum))  # aggfunc='sum' also works
# with columns set, the distinct values of that column become the columns of the result,
# and each cell holds the values located by the (index, columns) pair in the source table
print('-----')
# data: the DataFrame
# values: the column or list of columns to aggregate
# index: the pivot table's index, chosen from the source columns
# columns: the pivot table's columns, chosen from the source columns
# aggfunc: the aggregation function, numpy.mean by default; numpy methods are supported
print(pd.pivot_table(df, values = 'values', index = ['date','key'], aggfunc=len))
print('-----')
# here date and key together form the pivot index, with values as the value: how many values fall under each (date, key)
# aggfunc=len (or 'count'): counting
date key values
0 2017-05-01 a 1.573759
1 2017-05-02 b 3.750596
2 2017-05-03 c 4.958902
3 2017-05-01 d 0.797226
4 2017-05-02 a 5.757876
5 2017-05-03 b 0.082909
6 2017-05-01 c 3.799717
7 2017-05-02 d 0.754402
8 2017-05-03 a 3.117813
-----
              values
date
2017-05-01 6.170701
2017-05-02 10.262874
2017-05-03 8.159623
key a b c d
date
2017-05-01 1.573759 NaN 3.799717 0.797226
2017-05-02 5.757876 3.750596 NaN 0.754402
2017-05-03 3.117813 0.082909 4.958902 NaN
-----
                values
date key
2017-05-01 a      1.0
           c      1.0
           d      1.0
2017-05-02 a      1.0
           b      1.0
           d      1.0
2017-05-03 a      1.0
           b      1.0
           c      1.0
-----
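Two more `pivot_table` parameters from the signature above, `fill_value` and `margins`, can be seen on a tiny fixed frame (the `d1`/`d2` labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'date': ['d1', 'd1', 'd2'],
                   'key': ['a', 'b', 'a'],
                   'values': [1.0, 2.0, 3.0]})

# fill_value replaces the NaN holes (d2 has no key 'b'),
# margins=True adds an 'All' subtotal row and column
pt = pd.pivot_table(df, values='values', index='date', columns='key',
                    aggfunc='sum', fill_value=0, margins=True)
print(pt)
```

The (d2, b) cell becomes 0 instead of NaN, and the bottom-right 'All' cell holds the grand total 6.0.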
Cross table: crosstab
# Cross table: crosstab
# by default crosstab computes a frequency table of the factors, e.g. for pivot-style analysis of string data
# pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)
df = pd.DataFrame({'A': [1, 2, 2, 2, 2],'B': [3, 3, 4, 4, 4],'C': [1, 1, np.nan, 1, 1]})
print(df)
print('-----')
print(pd.crosstab(df['A'],df['B']))
print('-----')
# given only two Series, crosstab returns a frequency table:
# for each unique value of A, count the occurrences of each unique value of B
print(pd.crosstab(df['A'],df['B'],normalize=True))
print('-----')
# normalize: False by default; when True, every cell is divided by the grand total → shown as proportions
print(pd.crosstab(df['A'],df['B'],values=df['C'],aggfunc=np.sum))
print('-----')
# values: optional array of values to aggregate according to the factors
# aggfunc: optional; without values a frequency table is computed, with values the given aggregation is applied
# here A and B define the groups and the third series C is aggregated inside each group
print(pd.crosstab(df['A'],df['B'],values=df['C'],aggfunc=np.sum, margins=True))
print('-----')
# margins: bool, False by default; adds row/column margins (subtotals)
A B C
0 1 3 1.0
1 2 3 1.0
2 2 4 NaN
3 2 4 1.0
4 2 4 1.0
-----
B 3 4
A
1 1 0
2 1 3
-----
B 3 4
A
1 0.2 0.0
2 0.2 0.6
-----
B 3 4
A
1 1.0 NaN
2 1.0 2.0
-----
B 3 4 All
A
1 1.0 NaN 1.0
2 1.0 2.0 3.0
All 2.0 2.0 4.0
-----
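Besides `normalize=True`, `normalize` also accepts `'index'` or `'columns'` to normalize per row or per column. A sketch on the same A/B data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 2, 2], 'B': [3, 3, 4, 4, 4]})

# normalize='index' turns each row into proportions that sum to 1
ct = pd.crosstab(df['A'], df['B'], normalize='index')
print(ct)
```

Row A=2 has one B=3 and three B=4 observations, so it shows 0.25 and 0.75.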
Exercise
df = pd.DataFrame({'A' : ['one', 'two', 'three', 'one','two', 'three', 'one', 'two'],'B' : ['h', 'h', 'h', 'h', 'f', 'f', 'f', 'f'],'C' : np.arange(10,26,2),'D' : np.random.randn(8),'E':np.random.rand(8)})
print(df)
print('---------')
print(pd.pivot_table(df,index=['A'],values=['C','D'],aggfunc='mean'))
print('---------')
print(pd.pivot_table(df,index=['A','B'],values=['D','E'],aggfunc=['mean','sum']))
print('---------')
print(pd.pivot_table(df,index=['B'],values=['C'],columns=['A'],aggfunc='count'))  # or use the crosstab below
print('---------')
print(pd.crosstab(df['B'],df['A']))  # a crosstab is the usual way to compute frequencies
A B C D E
0 one h 10 1.801648 0.234444
1 two h 12 1.015224 0.473324
2 three h 14 1.145384 0.423148
3 one h 16 0.782241 0.053959
4 two f 18 -0.015952 0.669829
5 three f 20 -0.356324 0.455806
6 one f 22 -1.555999 0.136985
7 two f 24 1.791435 0.448069
---------
           C         D
A
one 16 0.342630
three 17 0.394530
two 18 0.930235
---------
             mean                 sum          
                D         E         D         E
A B
one   f -1.555999  0.136985 -1.555999  0.136985
      h  1.291944  0.144202  2.583889  0.288403
three f -0.356324  0.455806 -0.356324  0.455806
      h  1.145384  0.423148  1.145384  0.423148
two   f  0.887741  0.558949  1.775482  1.117898
      h  1.015224  0.473324  1.015224  0.473324
---------
    C          
A one three two
B
f 1 1 2
h 2 1 1
---------
A one three two
B
f 1 1 2
h 2 1 1
Data reading
'''
[Lesson 2.22] Data reading. Core functions: read_table, read_csv, read_excel
'''
Reading delimited data: read_table
# Reading delimited data: read_table
# handles txt and csv
import os
os.chdir('D:/data/pandasData/')
data1 = pd.read_table('data1.txt', delimiter=',',header = 0, index_col=1)
print(data1)
# delimiter: the character to split on; sep works the same way: sep = ','
# header: row number to use as the column names, 0 (the first row) by default
# index_col: column to use as the row index; otherwise a default 0, 1, ... index is created
# read_table mainly reads simple data, txt/csv
va1 va3 va4
va2
2 1 3 4
3 2 4 5
4 3 5 6
5    4    6    7

D:\python\Anaconda3\lib\site-packages\ipykernel_launcher.py:7: FutureWarning: read_table is deprecated, use read_csv instead.
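Since `read_table` is deprecated, the same call is usually written with `read_csv` and `sep`. A self-contained sketch that reads from a `StringIO` buffer instead of a file on disk (the sample text mirrors the data1.txt layout above):

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for data1.txt: comma-separated, header on the first row
text = "va1,va2,va3,va4\n1,2,3,4\n2,3,4,5"

# sep replaces delimiter; index_col=1 makes the second column (va2) the index
data1 = pd.read_csv(StringIO(text), sep=',', header=0, index_col=1)
print(data1)
```

Passing a real file path instead of the `StringIO` object works identically.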
Reading csv data: read_csv
# Reading csv data: read_csv
# first get familiar with exporting csv from Excel
data2 = pd.read_csv('data2.csv',engine = 'python')
print(data2.head())
# engine: parser engine, C or python; the C engine is faster, the Python engine is more feature-complete
# encoding: character encoding, usually 'utf-8'
# in most cases the Excel file is exported to csv first and then read
省级政区代码 省级政区名称 地市级政区代码 地市级政区名称 年份 党委书记姓名 出生年份 出生月份 籍贯省份代码 籍贯省份名称 \
0 130000 河北省 130100 石家庄市 2000 陈来立 NaN NaN NaN NaN
1 130000 河北省 130100 石家庄市 2001 吴振华 NaN NaN NaN NaN
2 130000 河北省 130100 石家庄市 2002 吴振华 NaN NaN NaN NaN
3 130000 河北省 130100 石家庄市 2003 吴振华 NaN NaN NaN NaN
4 130000 河北省 130100 石家庄市 2004 吴振华 NaN NaN NaN NaN ... 民族 教育 是否是党校教育(是=1,否=0) 专业:人文 专业:社科 专业:理工 专业:农科 专业:医科 入党年份 工作年份
0 ... NaN 硕士 1.0 NaN NaN NaN NaN NaN NaN NaN
1 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
2 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
3 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
4  ...  NaN  本科              0.0   0.0   0.0   1.0   0.0   0.0  NaN  NaN 

[5 rows x 23 columns]
Reading excel data: read_excel
# Reading excel data: read_excel
data3 = pd.read_excel('地市级党委书记数据库(2000-10).xlsx',sheet_name='中国人民共和国地市级党委书记数据库(2000-10)',header=0)
print(data3)
# io: the file path
# sheet_name: use sheet_name=[0,1] for several sheets, sheet_name=None for all sheets → an int/string returns a DataFrame, None or a list returns a dict
# header: row to use as the column names, 0 (the first row) by default
# index_col: column to use as the index; a column label string also works
      省级政区代码    省级政区名称  地市级政区代码 地市级政区名称    年份 党委书记姓名  出生年份  出生月份  籍贯省份代码  ...  入党年份  工作年份
0     130000       河北省   130100    石家庄市  2000    陈来立   NaN   NaN     NaN  ...   NaN   NaN
1     130000       河北省   130100    石家庄市  2001    吴振华   NaN   NaN     NaN  ...   NaN   NaN
2     130000       河北省   130100    石家庄市  2002    吴振华   NaN   NaN     NaN  ...   NaN   NaN
...      ...       ...      ...     ...   ...    ...   ...   ...     ...  ...   ...   ...
3661  650000  新疆维吾尔自治区   654300   阿勒泰地区  2009    NaN   NaN   NaN     NaN  ...   NaN   NaN
3662  650000  新疆维吾尔自治区   654300   阿勒泰地区  2010    NaN   NaN   NaN     NaN  ...   NaN   NaN

[3663 rows x 23 columns]