Python Data Analysis, Part 3 – A Brief Overview of Pandas

Table of Contents
- Python Data Analysis, Part 3 – A Brief Overview of Pandas
- 1. Pandas Data Structures
- 1.1 Series
- 1.2 DataFrame
- 1.3 Index
- 2. Pandas Data Operations
- 2.1 Indexing
- 2.1.1 Series indexing
- 2.1.2 DataFrame indexing
- 2.1.3 Three indexing styles (label-based, integer-position, mixed)
- 2.2 Arithmetic and alignment
- 2.3 Applying functions
- Sorting
- 2.4 Handling missing data
- 3. Pandas Statistics and Description
- 4. Grouping and Aggregation
- 4.1 Grouping (groupby)
- 4.1.1 The groupby object
- 4.1.2 Iterating over a GroupBy object
- 4.1.3 Other grouping methods
- 4.2 Aggregation
- 4.3 Grouped transformations
- 4.3.1 The transform method
- 4.3.2 apply
- 4.3.3 Suppressing group keys: group_keys=False
- 5. Data Cleaning, Merging, Transformation and Reshaping
- 5.1 Data cleaning
- 5.2 Joining data
- 5.3 Concatenation
- 5.3.1 NumPy concatenate
- 5.3.2 concat on Series
- 5.3.3 concat on DataFrames
- 5.4 Reshaping
- 5.5 Data transformation
- 5.5.1 Duplicate data
- 5.5.2 The map function
- 5.5.3 Replacing values with replace
1. Pandas Data Structures
Pandas is a powerful toolkit for analyzing structured data.
It is built on NumPy, which provides high-performance array operations,
and adds data-cleaning and data-preparation features on top.
1.1 Series
A Series is a one-dimensional, array-like object.
import pandas as pd

# Build a Series from a sequence
ser_obj = pd.Series(range(10, 20))
print(type(ser_obj))
# Get the values
print(ser_obj.values)
# Get the index
print(ser_obj.index)
[10 11 12 13 14 15 16 17 18 19]
RangeIndex(start=0, stop=10, step=1)
# Preview the first few rows
print(ser_obj.head(3))
0 10
1 11
2 12
dtype: int64
print(ser_obj)
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64
# Access elements by index
print(ser_obj[0])
print(ser_obj[8])
10
18
# The index-to-value mapping is preserved through array operations
print(ser_obj * 2)
print(ser_obj > 15)
0 20
1 22
2 24
3 26
4 28
5 30
6 32
7 34
8 36
9 38
dtype: int64
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
8 True
9 True
dtype: bool
# Build a Series from a dict
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)
print(ser_obj2.head())
print(ser_obj2.index)
2001 17.8
2002 20.1
2003 16.5
dtype: float64
Int64Index([2001, 2002, 2003], dtype='int64')
# The name attribute
ser_obj2.name = 'temp'
ser_obj2.index.name = 'year'
print(ser_obj2.head())
year
2001 17.8
2002 20.1
2003 16.5
Name: temp, dtype: float64
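Building on the dict example above, passing an explicit index to the constructor reindexes the dict: labels missing from the dict are filled with NaN. A minimal sketch (the 2004 entry is hypothetical, added only to show the NaN fill):

```python
import pandas as pd

# Hypothetical yearly temperatures; dict keys become index labels
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser = pd.Series(year_data, index=[2002, 2003, 2004])

# 2004 is absent from the dict, so its value is NaN
print(ser)
```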
1.2 DataFrame
A DataFrame is a two-dimensional, table-like structure with both a row index and a column index.
import numpy as np

# Build a DataFrame from an ndarray
array = np.random.randn(5, 4)
print(array)
df_obj = pd.DataFrame(array)
print(df_obj.head())
[[-0.81322788 -1.51137049 -1.06359499  1.28648638]
 [ 1.31333252  0.25461725 -0.7915332   0.75370275]
 [-0.09241866 -1.29103143 -0.32951959 -0.76066101]
 [-0.40326652  1.04621326  0.99044616 -0.35491204]
 [ 2.8170386  -0.58422902  2.84249502 -0.62538581]]
          0         1         2         3
0 -0.813228 -1.511370 -1.063595 1.286486
1 1.313333 0.254617 -0.791533 0.753703
2 -0.092419 -1.291031 -0.329520 -0.760661
3 -0.403267 1.046213 0.990446 -0.354912
4 2.817039 -0.584229 2.842495 -0.625386
# Build a DataFrame from a dict
dict_data = {'A': 1.,
             'B': pd.Timestamp('20161217'),
             'C': pd.Series(1, index=list(range(4)), dtype='float32'),
             'D': np.array([3] * 4, dtype='int32'),
             'E': pd.Categorical(["Python", "Java", "C++", "C#"]),
             'F': 'ChinaHadoop'}
# print(dict_data)
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2.head())
A B C D E F
0 1.0 2016-12-17 1.0 3 Python ChinaHadoop
1 1.0 2016-12-17 1.0 3 Java ChinaHadoop
2 1.0 2016-12-17 1.0 3 C++ ChinaHadoop
3 1.0 2016-12-17 1.0 3 C# ChinaHadoop
# Select a column by its label
print(df_obj2['A'])
print(type(df_obj2['A']))
print(df_obj2.A)
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64
<class 'pandas.core.series.Series'>
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64
# Add a column
df_obj2['G'] = df_obj2['D'] + 4
print(df_obj2.head())
A B C D E F G
0 1.0 2016-12-17 1.0 3 Python ChinaHadoop 7
1 1.0 2016-12-17 1.0 3 Java ChinaHadoop 7
2 1.0 2016-12-17 1.0 3 C++ ChinaHadoop 7
3 1.0 2016-12-17 1.0 3 C# ChinaHadoop 7
# Delete a column
del df_obj2['G']
print(df_obj2.head())
A B C D E F
0 1.0 2016-12-17 1.0 3 Python ChinaHadoop
1 1.0 2016-12-17 1.0 3 Java ChinaHadoop
2 1.0 2016-12-17 1.0 3 C++ ChinaHadoop
3 1.0 2016-12-17 1.0 3 C# ChinaHadoop
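The add/delete column pattern above is destructive. A non-destructive alternative, sketched on a tiny made-up frame, is `drop(columns=...)`, which returns a new DataFrame and leaves the original untouched:

```python
import pandas as pd

df = pd.DataFrame({'D': [3, 3, 3, 3]})

# Add a derived column in place, as above
df['G'] = df['D'] + 4
# drop() returns a copy without the column instead of mutating df
df2 = df.drop(columns=['G'])

print(list(df.columns))
print(list(df2.columns))
```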
1.3 Index
The indexes of both Series and DataFrame are Index objects. They are immutable, which keeps the underlying data safe.
Common Index types:
Index
Int64Index
MultiIndex (hierarchical index)
DatetimeIndex (timestamps)
print(type(ser_obj.index))
print(type(df_obj2.index))
print(df_obj2.index)
<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.indexes.numeric.Int64Index'>
Int64Index([0, 1, 2, 3], dtype='int64')
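The immutability claim above can be checked directly: assigning to an element of an Index raises a TypeError, a small sketch:

```python
import pandas as pd

idx = pd.Index([0, 1, 2, 3])

# Index objects are immutable: element assignment raises TypeError
try:
    idx[0] = 99
    mutated = True
except TypeError:
    mutated = False

print(mutated)
```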
2. Pandas Data Operations
2.1 Indexing
Note on slicing: with positional slices the end bound is excluded; with label slices the end bound is included.
2.1.1 Series indexing
import pandas as pd

ser_obj = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
print(ser_obj.head())
a 0
b 1
c 2
d 3
e 4
dtype: int32
# Indexing by row label or position
print(ser_obj['a'])
print(ser_obj[0])
0
0
# Slicing
print(ser_obj[1:3])
print(ser_obj['b':'d'])
b 1
c 2
dtype: int32
b 1
c 2
d 3
dtype: int32
# Fancy (non-contiguous) indexing
print(ser_obj[[0, 2, 4]])
print(ser_obj[['a', 'e']])
a 0
c 2
e 4
dtype: int32
a 0
e 4
dtype: int32
# Boolean indexing
ser_bool = ser_obj > 2
print(ser_bool)
print(ser_obj[ser_bool])
print(ser_obj[ser_obj > 2])
a False
b False
c False
d True
e True
dtype: bool
d 3
e 4
dtype: int32
d 3
e 4
dtype: int32
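The inclusive-versus-exclusive end bound noted at the top of this section is easy to verify on the same Series: the positional slice `1:3` yields two elements while the label slice `'b':'d'` yields three. A minimal sketch:

```python
import pandas as pd

ser = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

# Positional slice: end bound excluded -> rows 1 and 2
pos = ser[1:3]
# Label slice: end bound included -> rows 'b', 'c' and 'd'
lab = ser['b':'d']

print(len(pos), len(lab))
```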
2.1.2 DataFrame indexing
import numpy as np

df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
print(df_obj.head())
a b c d
0 -0.595692 0.813699 -0.551327 -0.059703
1 0.339194 -2.335579 0.230472 -0.680213
2 -0.252306 0.212406 -0.979523 0.408522
3 0.216677 0.574524 -0.819607 2.170009
4 -1.099175 -0.665488 0.391421 -0.400642
# Column indexing
print('Column indexing')
print(df_obj['a'])          # returns a Series
print(type(df_obj[['a']]))  # a list of labels returns a DataFrame
# Fancy indexing (note: selecting columns by integer position via [] was
# removed in modern pandas; use column labels, or iloc for positions)
print('Fancy indexing')
print(df_obj[['a', 'c']])
print(df_obj[['b', 'd']])
Column indexing
0   -0.595692
1    0.339194
2   -0.252306
3    0.216677
4   -1.099175
Name: a, dtype: float64
<class 'pandas.core.frame.DataFrame'>
Fancy indexing
          a         c
0 -0.595692 -0.551327
1  0.339194  0.230472
2 -0.252306 -0.979523
3  0.216677 -0.819607
4 -1.099175  0.391421
          b         d
0  0.813699 -0.059703
1 -2.335579 -0.680213
2  0.212406  0.408522
3  0.574524  2.170009
4 -0.665488 -0.400642
2.1.3 Three indexing styles (label-based, integer-position, mixed)
# Label-based indexing: loc
# Series
print(ser_obj['b':'d'])
print(ser_obj.loc['b':'d'])
# DataFrame
print(df_obj['a'])
print(df_obj.loc[0:2, 'a'])
b 1
c 2
d 3
dtype: int32
b 1
c 2
d 3
dtype: int32
0 -0.595692
1 0.339194
2 -0.252306
3 0.216677
4 -1.099175
Name: a, dtype: float64
0 -0.595692
1 0.339194
2 -0.252306
Name: a, dtype: float64
# Integer-position indexing: iloc
print(ser_obj[1:3])
print(ser_obj.iloc[1:3])
# DataFrame
print(df_obj.iloc[0:2, 0])  # note the difference from df_obj.loc[0:2, 'a']
b 1
c 2
dtype: int32
b 1
c 2
dtype: int32
0 -0.595692
1 0.339194
Name: a, dtype: float64
# Mixed indexing: ix (deprecated since pandas 0.20 and removed in 1.0; prefer loc/iloc)
print(ser_obj.ix[1:3])
print(ser_obj.ix['b':'c'])
# DataFrame
print(df_obj.ix[0:2, 0])  # tries label-based indexing first, then falls back to positional
b 1
c 2
dtype: int32
b 1
c 2
dtype: int32
0 -0.595692
1 0.339194
2 -0.252306
Name: a, dtype: float64
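Since ix no longer exists in current pandas, the two calls above should be written explicitly with loc and iloc. A sketch of the modern equivalents on a small deterministic frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=['a', 'b', 'c', 'd'])

# ix is gone in modern pandas; be explicit instead:
by_label = df.loc[0:2, 'a']  # label slice, end included -> 3 rows
by_pos = df.iloc[0:2, 0]     # positional slice, end excluded -> 2 rows

print(len(by_label), len(by_pos))
```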
2.2 Arithmetic and alignment
s1 = pd.Series(range(10, 20), index=range(10))
s2 = pd.Series(range(20, 25), index=range(5))
print('s1: ')
print(s1)
print('')
print('s2: ')
print(s2)
s1:
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int32
s2:
0 20
1 21
2 22
3 23
4 24
dtype: int32
# Series alignment: unmatched indices produce NaN
s1 + s2
0 30.0
1 32.0
2 34.0
3 36.0
4 38.0
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
import numpy as np

df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
print('df1: ')
print(df1)
print('')
print('df2: ')
print(df2)
df1:
     a    b
0  1.0  1.0
1  1.0  1.0
df2:
     a    b    c
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0
# DataFrame alignment
df1 + df2
     a    b   c
0  2.0  2.0 NaN
1  2.0  2.0 NaN
2  NaN  NaN NaN
# Fill unaligned positions with a value before computing
print(s1)
print(s2)
s1.add(s2, fill_value=-1)
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64
0 20
1 21
2 22
3 23
4 24
dtype: int64
0 30.0
1 32.0
2 34.0
3 36.0
4 38.0
5 14.0
6 15.0
7 16.0
8 17.0
9 18.0
dtype: float64
df1.sub(df2, fill_value=2.)
     a    b    c
0  0.0  0.0  1.0
1  0.0  0.0  1.0
2  1.0  1.0  1.0
# Filling NaN after the fact
s3 = s1 + s2
s3.fillna(-1, inplace=True)
print(s3)
0 30.0
1 32.0
2 34.0
3 36.0
4 38.0
5 -1.0
6 -1.0
7 -1.0
8 -1.0
9 -1.0
dtype: float64
df3 = df1 + df2
print(df3)
a b c
0 2.0 2.0 NaN
1 2.0 2.0 NaN
2 NaN NaN NaN
df3.fillna(100, inplace = True)
print(df3)
a b c
0 2.0 2.0 100.0
1 2.0 2.0 100.0
2 100.0 100.0 100.0
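The `fill_value` behavior used above can be pinned down with a deterministic sketch: positions present on both sides are added normally, while positions present on only one side use the fill value in place of the missing operand:

```python
import pandas as pd

s1 = pd.Series(range(10, 20), index=range(10))
s2 = pd.Series(range(20, 25), index=range(5))

# fill_value substitutes 0 for the missing side before adding,
# so indices 5-9 keep s1's values instead of becoming NaN
total = s1.add(s2, fill_value=0)

print(total)
```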
2.3 Applying functions
# NumPy ufuncs work element-wise on DataFrames
df = pd.DataFrame(np.random.randn(5, 4) - 1)
print(df)
print(np.abs(df))
0 1 2 3
0 -2.193022 -2.090432 -2.288651 -0.026022
1 -0.720957 -1.501025 -1.734828 -1.858286
2 0.300216 -3.391127 -0.872570 -0.686669
3 -2.552131 -1.452268 -1.188845 -0.597845
4  2.111044 -1.203676 -1.143487 -0.542755
          0         1         2         3
0 2.193022 2.090432 2.288651 0.026022
1 0.720957 1.501025 1.734828 1.858286
2 0.300216 3.391127 0.872570 0.686669
3 2.552131 1.452268 1.188845 0.597845
4 2.111044 1.203676 1.143487 0.542755
# apply works on rows or columns; the default is to apply down each column
# f = lambda x: x.max()
print(df.apply(lambda x: x.max()))
0 2.111044
1 -1.203676
2 -0.872570
3 -0.026022
dtype: float64
# Specify the axis: with axis=1 the function is applied to each row
print(df.apply(lambda x: x.max(), axis=1))
# applymap applies a function to every element
f2 = lambda x : '%.2f' % x
print(df.applymap(f2))
0 1 2 3
0 -2.19 -2.09 -2.29 -0.03
1 -0.72 -1.50 -1.73 -1.86
2 0.30 -3.39 -0.87 -0.69
3 -2.55 -1.45 -1.19 -0.60
4 2.11 -1.20 -1.14 -0.54
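The column-wise versus row-wise distinction above is the part that trips people up, so here is a tiny deterministic sketch of `apply` on both axes:

```python
import pandas as pd

df = pd.DataFrame([[1, -2], [-3, 4]], columns=['x', 'y'])

col_max = df.apply(lambda s: s.max())          # per column (default axis=0)
row_max = df.apply(lambda s: s.max(), axis=1)  # per row

print(col_max.tolist(), row_max.tolist())
```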
Sorting
s4 = pd.Series(range(10, 15), index = np.random.randint(5, size=5))
print(s4)
4 10
3 11
1 12
4 13
4 14
dtype: int32
# Sort by index
s4.sort_index()
1 12
3 11
4 10
4 13
4 14
dtype: int32
df4 = pd.DataFrame(np.random.randn(3, 4),
                   index=np.random.randint(3, size=3),
                   columns=np.random.randint(4, size=4))
print(df4)
3 2 2 1
2 0.244068 -1.977220 0.045238 -2.064546
2 0.218196 -0.419284 -0.698839 0.241649
2 0.296747 -0.021311 0.225724 -0.325439
# df4.sort_index(ascending=False)
df4.sort_index(axis=1)
          1         2         2         3
2 -2.064546 -1.977220  0.045238  0.244068
2  0.241649 -0.419284 -0.698839  0.218196
2 -0.325439 -0.021311  0.225724  0.296747
# Sort by values
df4.sort_values(by=1)
          3         2         2         1
2  0.244068 -1.977220  0.045238 -2.064546
2  0.296747 -0.021311  0.225724 -0.325439
2  0.218196 -0.419284 -0.698839  0.241649
2.4 Handling missing data
df_data = pd.DataFrame([np.random.randn(3),
                        [1., np.nan, np.nan],
                        [4., np.nan, np.nan],
                        [1., np.nan, 2.]])
df_data.head()
          0         1         2
0  0.106763  1.521225 -0.654447
1  1.000000       NaN       NaN
2  4.000000       NaN       NaN
3  1.000000       NaN  2.000000
# isnull
df_data.isnull()
       0      1      2
0  False  False  False
1  False   True   True
2  False   True   True
3  False   True  False
# dropna: drop rows containing NaN
df_data.dropna()
# df_data.dropna(axis=1)
          0         1         2
0  0.106763  1.521225 -0.654447
# dropna along columns
# df_data.dropna()
df_data.dropna(axis=1)
          0
0  0.106763
1  1.000000
2  4.000000
3  1.000000
# fillna
df_data.fillna(-100.)
          0           1           2
0  1.619463    0.548047   -1.027003
1  1.000000 -100.000000 -100.000000
2  4.000000 -100.000000 -100.000000
3  1.000000 -100.000000    2.000000
3. Pandas Statistics and Description
import numpy as np
import pandas as pd
3.1 Common statistical computations
df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
df_obj
          a         b         c         d
0 -0.767557  0.637179  1.650174  1.046862
1  1.675605  0.438124 -0.119837 -0.637159
2 -0.367849  1.044564  0.977799 -0.185916
3 -0.453807 -0.786276 -1.917200  0.502135
4  0.230846 -0.056462  1.612659  0.426120
df_obj.sum()  # computed column-wise by default
a 0.317239
b 1.277129
c 2.203596
d 1.152041
dtype: float64
df_obj.max()
a 1.675605
b 1.044564
c 1.650174
d 1.046862
dtype: float64
df_obj.min(axis=1)  # axis=1 computes along each row
0 -0.767557
1 -0.637159
2 -0.367849
3 -1.917200
4 -0.056462
dtype: float64
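The axis convention used by sum/max/min above is worth fixing in the mind with a deterministic sketch: the default collapses the rows (one result per column), while `axis=1` collapses the columns (one result per row):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

col_sum = df.sum()        # default: collapse rows, one result per column
row_min = df.min(axis=1)  # axis=1: collapse columns, one result per row

print(col_sum.tolist(), row_min.tolist())
```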
3.2 Descriptive statistics
df_obj.describe()
               a         b         c         d
count   5.000000  5.000000  5.000000  5.000000
mean    0.882532 -0.574613  0.004122 -0.455327
std     1.052045  0.887115  0.456436  0.646042
min    -0.713423 -1.600219 -0.628493 -1.403887
25%     0.715594 -1.483364 -0.211933 -0.664303
50%     0.783228 -0.053138  0.021936 -0.424985
75%     1.580704  0.123322  0.276417 -0.103682
max     2.046556  0.140333  0.562683  0.320219
4. Grouping and Aggregation
4.1 Grouping (groupby)
Split the dataset into groups, then compute statistics on each group; pandas supports complex grouped computation through groupby.
The grouping process:
split -> apply -> combine
Split: the criterion used to form the groups
Apply: the computation run on each group
Combine: merge each group's results back together
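The split-apply-combine steps above can be sketched manually on a tiny frame, with the one-liner groupby equivalent alongside for comparison:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

# split -> apply -> combine done by hand, to mirror what groupby does:
pieces = {k: g for k, g in df.groupby('key')}             # split
applied = {k: g['val'].sum() for k, g in pieces.items()}  # apply
combined = pd.Series(applied)                             # combine

print(combined)
print(df.groupby('key')['val'].sum())  # the one-liner equivalent
```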
4.1.1 The groupby object
import pandas as pd
import numpy as np

dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
            'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
data1 data2 key1 key2
0 -0.943078 0.820645 a one
1 -1.429043 0.142617 b one
2 0.832261 0.843898 a two
3 0.906262 0.688165 b three
4 0.541173 0.117232 a two
5 -0.213385 -0.098734 b two
6 -1.291468 -1.186638 a one
7 1.186941 0.809122 a three
# Group the DataFrame by key1
print(type(df_obj.groupby('key1')))
<class 'pandas.core.groupby.DataFrameGroupBy'>
# Group the data1 column by key1
print(type(df_obj['data1'].groupby(df_obj['key1'])))
<class 'pandas.core.groupby.SeriesGroupBy'>
# Grouped computation
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())
grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())
print(grouped2.mean())
data1 data2
key1
a 0.065166 0.280852
b -0.245389 0.244016
key1
a 0.065166
b -0.245389
Name: data1, dtype: float64
# size
print(grouped1.size())
print(grouped2.size())
key1
a 5
b 3
dtype: int64
key1
a 5
b 3
dtype: int64
# Group by column name
df_obj.groupby('key1')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002091C820550>
# Group by a custom key (a list)
self_def_key = [1, 1, 2, 2, 2, 1, 1, 1]
df_obj.groupby(self_def_key).size()
1 5
2 3
dtype: int64
# Group by multiple custom keys (a list of Series)
df_obj.groupby([df_obj['key1'], df_obj['key2']]).size()
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
# Hierarchical grouping by multiple columns
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
# Hierarchical grouping follows the order of the keys
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
print()
print(grouped3.mean().unstack())
               data1     data2
key2  key1
one   a    -1.117273 -0.182997
      b    -1.429043  0.142617
three a     1.186941  0.809122
      b     0.906262  0.688165
two   a     0.686717  0.480565
      b    -0.213385 -0.098734

          data1               data2
key1          a         b         a         b
key2
one   -1.117273 -1.429043 -0.182997  0.142617
three  1.186941  0.906262  0.809122  0.688165
two    0.686717 -0.213385  0.480565 -0.098734
4.1.2 Iterating over a GroupBy object
# Single-level grouping
for group_name, group_data in grouped1:
    print(group_name)
    print(group_data)
a
      data1     data2 key1   key2
0 -0.943078  0.820645    a    one
2  0.832261  0.843898    a    two
4  0.541173  0.117232    a    two
6 -1.291468 -1.186638    a    one
7  1.186941  0.809122    a  three
b
      data1     data2 key1   key2
1 -1.429043  0.142617    b    one
3  0.906262  0.688165    b  three
5 -0.213385 -0.098734    b    two
# Multi-level grouping: the group name is a tuple of keys
for group_name, group_data in grouped2:
    print(group_name)
    print(group_data)
('a', 'one')
      data1     data2 key1  key2
0 -0.943078  0.820645    a   one
6 -1.291468 -1.186638    a   one
('a', 'three')
      data1     data2 key1   key2
7  1.186941  0.809122    a  three
('a', 'two')
      data1     data2 key1  key2
2  0.832261  0.843898    a   two
4  0.541173  0.117232    a   two
('b', 'one')
      data1     data2 key1  key2
1 -1.429043  0.142617    b   one
('b', 'three')
      data1     data2 key1   key2
3  0.906262  0.688165    b  three
('b', 'two')
      data1     data2 key1  key2
5 -0.213385 -0.098734    b   two
# Convert a GroupBy object to a list of (key, group) pairs
list(grouped1)
[('a',       data1     data2 key1   key2
  0 -0.943078  0.820645    a    one
  2  0.832261  0.843898    a    two
  4  0.541173  0.117232    a    two
  6 -1.291468 -1.186638    a    one
  7  1.186941  0.809122    a  three),
 ('b',       data1     data2 key1   key2
  1 -1.429043  0.142617    b    one
  3  0.906262  0.688165    b  three
  5 -0.213385 -0.098734    b    two)]
# Convert a GroupBy object to a dict
dict(list(grouped1))
{'a':       data1     data2 key1   key2
 0 -0.943078  0.820645    a    one
 2  0.832261  0.843898    a    two
 4  0.541173  0.117232    a    two
 6 -1.291468 -1.186638    a    one
 7  1.186941  0.809122    a  three,
 'b':       data1     data2 key1   key2
 1 -1.429043  0.142617    b    one
 3  0.906262  0.688165    b  three
 5 -0.213385 -0.098734    b    two}
# Group the columns by dtype
print(df_obj.dtypes)
df_obj.groupby(df_obj.dtypes, axis=1).size()
df_obj.groupby(df_obj.dtypes, axis=1).sum()
data1    float64
data2    float64
key1      object
key2      object
dtype: object
    float64   object
0 -0.122433     aone
1 -1.286426     bone
2  1.676158     atwo
3  1.594427   bthree
4  0.658404     atwo
5 -0.312119     btwo
6 -2.478106     aone
7  1.996064   athree
4.1.3 Other grouping methods
df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['A', 'B', 'C', 'D', 'E'])
df_obj2.iloc[1, 1:4] = np.nan
df_obj2
   a    b    c    d  e
A  1  1.0  1.0  6.0  5
B  2  NaN  NaN  NaN  6
C  5  5.0  7.0  5.0  7
D  2  8.0  5.0  6.0  2
E  5  1.0  4.0  4.0  4
# Group the columns via a dict mapping
mapping_dict = {'a': 'python', 'b': 'python', 'c': 'java', 'd': 'C', 'e': 'java'}
df_obj2.groupby(mapping_dict, axis=1).size()
df_obj2.groupby(mapping_dict, axis=1).count()  # number of non-NaN values
df_obj2.groupby(mapping_dict, axis=1).sum()
     C  java  python
A  6.0   6.0     2.0
B  NaN   6.0     2.0
C  5.0  14.0    10.0
D  6.0   7.0    10.0
E  4.0   8.0     6.0
# Group via a function
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['AA', 'BBB', 'CC', 'D', 'EE'])
# df_obj3

def group_key(idx):
    """idx is a row index label"""
    # return idx
    return len(idx)

df_obj3.groupby(group_key).size()
# The custom function above is equivalent to
# df_obj3.groupby(len).size()
1    1
2    3
3    1
dtype: int64
# Group by index level
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                     ['A', 'A', 'B', 'C', 'B']],
                                    names=['language', 'index'])
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
df_obj4
language  Python  Java  Python  Java  Python
index          A     A       B     C       B
0              1     6       4     7       2
1              9     7       2     2       4
2              3     9       9     7       5
3              1     6       1     6       6
4              5     1       7     3       6
# Group by the 'language' level
df_obj4.groupby(level='language', axis=1).sum()
df_obj4.groupby(level='index', axis=1).sum()
index   A   B  C
0       7   6  7
1      16   6  2
2      12  14  7
3       7   7  6
4       6  13  3
4.2 Aggregation
Aggregation runs a computation on each group that reduces an array to a scalar, e.g. mean() or count().
dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)
   data1  data2 key1   key2
0 4 2 a one
1 7 1 b one
2 2 8 a two
3 9 4 b three
4 3 2 a two
5 8 5 b two
6 6 8 a one
7 9 3 a three
# Built-in aggregation functions
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())
      data1  data2
key1
a        24     23
b        24     10
      data1  data2 key2
key1
a         9      8  two
b         9      5  two
      data1  data2 key2
key1
a         2      2  one
b         7      1  one
      data1     data2
key1
a       4.8  4.600000
b       8.0  3.333333
key1
a    5
b    3
dtype: int64
      data1  data2  key2
key1
a         5      5     5
b         3      3     3
               data1     data2
key1
a    count  5.000000  5.000000
     mean   4.800000  4.600000
     std    2.774887  3.130495
     min    2.000000  2.000000
     25%    3.000000  2.000000
     50%    4.000000  3.000000
     75%    6.000000  8.000000
     max    9.000000  8.000000
b    count  3.000000  3.000000
     mean   8.000000  3.333333
     std    1.000000  2.081666
     min    7.000000  1.000000
     25%    7.500000  2.500000
     50%    8.000000  4.000000
     75%    8.500000  4.500000
     max    9.000000  5.000000
# Custom aggregation functions
def peak_range(df):
    """Return the range (max - min) of the values"""
    # print(type(df))  # the argument is the group of records for each key
    return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df: df.max() - df.min()))
      data1  data2
key1
a         7      6
b         2      4
         data1     data2
key1
a     2.478410  2.030536
b     2.335305  0.786899
# Apply multiple aggregation functions at once
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range]))  # columns default to the function names
         data1                               data2
          mean       std count peak_range     mean       std count peak_range
key1
a     0.065166  1.110226     5   2.478410  0.280852  0.875752     5   2.030536
b    -0.245389  1.167982     3   2.335305  0.244016  0.403130     3   0.786899
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)]))  # a (name, function) tuple supplies the column name
         data1                          data2
          mean       std count     range     mean       std count     range
key1
a     0.065166  1.110226     5  2.478410  0.280852  0.875752     5  2.030536
b    -0.245389  1.167982     3  2.335305  0.244016  0.403130     3  0.786899
# Apply a different aggregation function to each column
dict_mapping = {'data1': 'mean', 'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
         data2     data1
key1
a     1.404259  0.065166
b     0.732047 -0.245389
dict_mapping = {'data1': ['mean', 'max'], 'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
         data2     data1
           sum      mean       max
key1
a     1.404259  0.065166  1.186941
b     0.732047 -0.245389  0.906262
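Modern pandas (0.25+) also supports "named aggregation", a keyword-based alternative to the (name, function) tuples used above; a sketch on a small deterministic frame:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'b', 'a', 'b'],
                   'data1': [1, 2, 3, 4],
                   'data2': [5, 6, 7, 8]})

# Each keyword names an output column; the value is (input column, function)
out = df.groupby('key1').agg(
    d1_mean=('data1', 'mean'),
    d2_range=('data2', lambda s: s.max() - s.min()),
)

print(out)
```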
4.3 Grouped transformations
import pandas as pd
import numpy as np

# Grouped computation that preserves the original shape
dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
df_obj
  key1   key2  data1  data2
0    a    one      3      9
1    b    one      4      5
2    a    two      8      9
3    b  three      4      8
4    a    two      5      2
5    b    two      3      2
6    a    one      4      9
7    a  three      9      4
# Group by key1, compute sums of data1/data2, then join them back onto the original table
k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_')
k1_sum
      sum_data1  sum_data2
key1
a            29         33
b            11         15
# Method 1: use merge
pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)
  key1   key2  data1  data2  sum_data1  sum_data2
0    a    one      3      9         29         33
2    a    two      8      9         29         33
4    a    two      5      2         29         33
6    a    one      4      9         29         33
7    a  three      9      4         29         33
1    b    one      4      5         11         15
3    b  three      4      8         11         15
5    b    two      3      2         11         15
4.3.1 transform 方法
# 方法2,使用transform
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
df_obj
  key1   key2  data1  data2           sum_key2  sum_data1  sum_data2
0    a    one      3      9  onetwotwoonethree         29         33
1    b    one      4      5        onethreetwo         11         15
2    a    two      8      9  onetwotwoonethree         29         33
3    b  three      4      8        onethreetwo         11         15
4    a    two      5      2  onetwotwoonethree         29         33
5    b    two      3      2        onethreetwo         11         15
6    a    one      4      9  onetwotwoonethree         29         33
7    a  three      9      4  onetwotwoonethree         29         33
|
# Pass a custom function to transform
def diff_mean(s):
    """Return each value's difference from its group mean"""
    return s - s.mean()

df_obj.groupby('key1').transform(diff_mean)
      data1  data2  sum_data1  sum_data2
0 -2.800000    2.4          0          0
1  0.333333    0.0          0          0
2  2.200000    2.4          0          0
3  0.333333    3.0          0          0
4 -0.800000   -4.6          0          0
5 -0.666667   -3.0          0          0
6 -1.800000    2.4          0          0
7  3.200000   -2.6          0          0
dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek', 'TotalHours', 'APM'])
4.3.2 apply
def top_n(df, n=3, column='APM'):
    """Return the top n rows of each group, sorted by `column`"""
    return df.sort_values(by=column, ascending=False)[:n]

df_data.groupby('LeagueIndex').apply(top_n)
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1           2214            1  20.0          12.0       730.0  172.9530
            2246            1  27.0           8.0       250.0  141.6282
            1753            1  20.0          28.0       100.0  139.6362
2           3062            2  20.0           6.0       100.0  179.6250
            3229            2  16.0          24.0       110.0  156.7380
            1520            2  29.0           6.0       250.0  151.6470
3           1557            3  22.0           6.0       200.0  226.6554
            484             3  19.0          42.0       450.0  220.0692
            2883            3  16.0           8.0       800.0  208.9500
4           2688            4  26.0          24.0       990.0  249.0210
            1759            4  16.0           6.0        75.0  229.9122
            2637            4  23.0          24.0       650.0  227.2272
5           3277            5  18.0          16.0       950.0  372.6426
            93              5  17.0          36.0       720.0  335.4990
            202             5  37.0          14.0       800.0  327.7218
6           734             6  16.0          28.0       730.0  389.8314
            2746            6  16.0          28.0      4000.0  350.4114
            1810            6  21.0          14.0       730.0  323.2506
7           3127            7  23.0          42.0      2000.0  298.7952
            104             7  21.0          24.0      1000.0  286.4538
            1654            7  18.0          98.0       700.0  236.0316
8           3393            8   NaN           NaN         NaN  375.8664
            3373            8   NaN           NaN         NaN  364.8504
            3372            8   NaN           NaN         NaN  355.3518
# Extra arguments passed to apply are forwarded to the custom function
df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age')
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1           3146            1  40.0          12.0       150.0   38.5590
            3040            1  39.0          10.0       500.0   29.8764
2           920             2  43.0          10.0       730.0   86.0586
            2437            2  41.0           4.0       200.0   54.2166
3           1258            3  41.0          14.0       800.0   77.6472
            2972            3  40.0          10.0       500.0   60.5970
4           1696            4  44.0           6.0       500.0   89.5266
            1729            4  39.0           8.0       500.0   86.7246
5           202             5  37.0          14.0       800.0  327.7218
            2745            5  37.0          18.0      1000.0  123.4098
6           3069            6  31.0           8.0       800.0  133.1790
            2706            6  31.0           8.0       700.0   66.9918
7           2813            7  26.0          36.0      1300.0  188.5512
            1992            7  26.0          24.0      1000.0  219.6690
8           3340            8   NaN           NaN         NaN  189.7404
4.3.3 Suppressing the group keys: group_keys=False
df_data.groupby('LeagueIndex', group_keys=False).apply(top_n)
      LeagueIndex   Age  HoursPerWeek  TotalHours       APM
2214            1  20.0          12.0       730.0  172.9530
2246            1  27.0           8.0       250.0  141.6282
1753            1  20.0          28.0       100.0  139.6362
3062            2  20.0           6.0       100.0  179.6250
3229            2  16.0          24.0       110.0  156.7380
1520            2  29.0           6.0       250.0  151.6470
1557            3  22.0           6.0       200.0  226.6554
484             3  19.0          42.0       450.0  220.0692
2883            3  16.0           8.0       800.0  208.9500
2688            4  26.0          24.0       990.0  249.0210
1759            4  16.0           6.0        75.0  229.9122
2637            4  23.0          24.0       650.0  227.2272
3277            5  18.0          16.0       950.0  372.6426
93              5  17.0          36.0       720.0  335.4990
202             5  37.0          14.0       800.0  327.7218
734             6  16.0          28.0       730.0  389.8314
2746            6  16.0          28.0      4000.0  350.4114
1810            6  21.0          14.0       730.0  323.2506
3127            7  23.0          42.0      2000.0  298.7952
104             7  21.0          24.0      1000.0  286.4538
1654            7  18.0          98.0       700.0  236.0316
3393            8   NaN           NaN         NaN  375.8664
3373            8   NaN           NaN         NaN  364.8504
3372            8   NaN           NaN         NaN  355.3518
5. Data Cleaning, Merging, Transformation and Reshaping
5.1 Data cleaning
5.2 Joining data
import pandas as pd
import numpy as np

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data2': np.random.randint(0, 10, 3)})
print(df_obj1)
print(df_obj2)
data1 key
0 5 b
1 9 b
2 1 a
3 0 c
4 3 a
5 9 a
6      0   b
   data2 key
0 9 a
1 3 b
2 8 d
# By default, merge joins on the overlapping column name as the key
pd.merge(df_obj1, df_obj2)
   data1 key  data2
0      5   b      3
1      9   b      3
2      0   b      3
3      1   a      9
4      3   a      9
5      9   a      9
# Specify the join key explicitly with on
pd.merge(df_obj1, df_obj2, on='key')
   data1 key  data2
0      5   b      3
1      9   b      3
2      0   b      3
3      1   a      9
4      3   a      9
5      9   a      9
# left_on / right_on name the key columns of the left and right tables
# Rename the key columns first
df_obj1 = df_obj1.rename(columns={'key': 'key1'})
df_obj2 = df_obj2.rename(columns={'key': 'key2'})
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2')
   data1 key1  data2 key2
0      5    b      3    b
1      9    b      3    b
2      0    b      3    b
3      1    a      9    a
4      3    a      9    a
5      9    a      9    a
# Outer join
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer')
   data1 key1  data2 key2
0    5.0    b    3.0    b
1    9.0    b    3.0    b
2    0.0    b    3.0    b
3    1.0    a    9.0    a
4    3.0    a    9.0    a
5    9.0    a    9.0    a
6    0.0    c    NaN  NaN
7    NaN  NaN    8.0    d
# Left join
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left')
   data1 key1  data2 key2
0      5    b    3.0    b
1      9    b    3.0    b
2      1    a    9.0    a
3      0    c    NaN  NaN
4      3    a    9.0    a
5      9    a    9.0    a
6      0    b    3.0    b
# Right join
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right')
   data1 key1  data2 key2
0    5.0    b      3    b
1    9.0    b      3    b
2    0.0    b      3    b
3    1.0    a      9    a
4    3.0    a      9    a
5    9.0    a      9    a
6    NaN  NaN      8    d
# Handle duplicate column names with suffixes
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data': np.random.randint(0, 10, 3)})
pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right'))
   data_left key  data_right
0          4   b           5
1          2   b           5
2          5   b           5
3          9   a           7
4          6   a           7
5          6   a           7
# Join on the index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'data2': np.random.randint(0, 10, 3)}, index=['a', 'b', 'd'])
pd.merge(df_obj1, df_obj2, left_on='key', right_index=True)
   data1 key  data2
0      6   b      5
1      7   b      5
6      9   b      5
2      2   a      3
4      4   a      3
5      4   a      3
5.3 Concatenation
import numpy as np
import pandas as pd
5.3.1 NumPy concatenate
arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))
print(arr1)
print(arr2)
[[8 5 2 4]
 [2 9 9 0]
 [6 2 1 3]]
[[0 4 5 3]
 [6 4 1 3]
 [1 2 9 9]]
np.concatenate([arr1, arr2])
array([[8, 5, 2, 4],
       [2, 9, 9, 0],
       [6, 2, 1, 3],
       [0, 4, 5, 3],
       [6, 4, 1, 3],
       [1, 2, 9, 9]])
np.concatenate([arr1, arr2], axis=1)
array([[8, 5, 2, 4, 0, 4, 5, 3],
       [2, 9, 9, 0, 6, 4, 1, 3],
       [6, 2, 1, 3, 1, 2, 9, 9]])
5.3.2 Series 上的concat
# index 没有重复的情况
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0,5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5,9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9,12))print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
0 9
1 9
2 5
3 6
4 4
dtype: int32
5 1
6 9
7 2
8 5
dtype: int32
9 4
10 7
11 4
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3])
0 9
1 9
2 5
3 6
4 4
5 1
6 9
7 2
8 5
9 4
10 7
11 4
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1)
      0    1    2
0   9.0  NaN  NaN
1   9.0  NaN  NaN
2   5.0  NaN  NaN
3   6.0  NaN  NaN
4   4.0  NaN  NaN
5   NaN  1.0  NaN
6   NaN  9.0  NaN
7   NaN  2.0  NaN
8   NaN  5.0  NaN
9   NaN  NaN  4.0
10  NaN  NaN  7.0
11  NaN  NaN  4.0
# Case: overlapping index values
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))
print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
0 7
1 3
2 9
3 1
4 7
dtype: int32
0 6
1 1
2 4
3 7
dtype: int32
0 7
1 2
2 3
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3])
0 7
1 3
2 9
3 1
4 7
0 6
1 1
2 4
3 7
0 7
1 2
2 3
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner')
   0  1  2
0  7  6  7
1  3  1  2
2  9  4  3
5.3.3 concat on DataFrames
df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],
                       columns=['A', 'B'])
df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],
                       columns=['C', 'D'])
print(df_obj1)
print(df_obj2)
   A  B
a  1  6
b  1  0
c  1  6
   C  D
a  2  1
b  7  4
pd.concat([df_obj1, df_obj2])
     A    B    C    D
a  1.0  6.0  NaN  NaN
b  1.0  0.0  NaN  NaN
c  1.0  6.0  NaN  NaN
a  NaN  NaN  2.0  1.0
b  NaN  NaN  7.0  4.0
pd.concat([df_obj1, df_obj2], axis=1)
   A  B    C    D
a  1  6  2.0  1.0
b  1  0  7.0  4.0
c  1  6  NaN  NaN
5.4 Reshaping
5.4.1 stack
import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randint(0, 10, (5, 2)), columns=['data1', 'data2'])
df_obj
   data1  data2
0      0      8
1      9      5
2      1      1
3      8      5
4      9      4
stacked = df_obj.stack()
print(stacked)
0  data1    0
   data2    8
1  data1    9
   data2    5
2  data1    1
   data2    1
3  data1    8
   data2    5
4  data1    9
   data2    4
dtype: int32
# By default unstack operates on the innermost index level
stacked.unstack()
   data1  data2
0      0      8
1      9      5
2      1      1
3      8      5
4      9      4
# Use level to choose which index level to unstack
stacked.unstack(level=0)
       0  1  2  3  4
data1  0  9  1  8  9
data2  8  5  1  5  4
5.5 Data transformation
import numpy as np
import pandas as pd
5.5.1 Duplicate data
df_obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
                       'data2': np.random.randint(0, 4, 8)})
df_obj
df_obj
|
data1
|
data2
|
0
|
a
|
1
|
1
|
a
|
3
|
2
|
a
|
3
|
3
|
a
|
3
|
4
|
b
|
3
|
5
|
b
|
3
|
6
|
b
|
1
|
7
|
b
|
1
|
df_obj.duplicated()
0 False
1 False
2 True
3 True
4 False
5 True
6 False
7 True
dtype: bool
df_obj.drop_duplicates()
  data1  data2
0     a      1
1     a      3
4     b      3
6     b      1
# Drop duplicates considering only the data2 column
df_obj.drop_duplicates('data2')
5.5.2 The map function
ser_obj = pd.Series(np.random.randint(0,10,10))
ser_obj
0 6
1 5
2 4
3 4
4 8
5 1
6 6
7 5
8 4
9 1
dtype: int32
ser_obj.map(lambda x : x ** 2)
0 36
1 25
2 16
3 16
4 64
5 1
6 36
7 25
8 16
9 1
dtype: int64
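Besides a function, as above, `map` also accepts a dict: values found in the dict are replaced by their mapping, and values missing from it become NaN. A minimal sketch:

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

# Values missing from the dict map to NaN
out = ser.map({1: 'one', 2: 'two'})

print(out.tolist())
```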
5.5.3 Replacing values with replace
# Replace a single value (this series happens to contain no 0s, so the output is unchanged)
ser_obj.replace(0, -100)
0 6
1 5
2 4
3 4
4 8
5 1
6 6
7 5
8 4
9 1
dtype: int32
# Replace several values with one value
ser_obj.replace([0, 2], -100)
0 6
1 5
2 4
3 4
4 8
5 1
6 6
7 5
8 4
9 1
dtype: int32
# Replace several values, each with its own replacement
ser_obj.replace([0, 2], [-100, -200])
0 6
1 5
2 4
3 4
4 8
5 1
6 6
7 5
8 4
9 1
dtype: int32
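The list forms above can also be written as a dict, which pairs each old value with its replacement in one call; a deterministic sketch on a series that actually contains the targets:

```python
import pandas as pd

ser = pd.Series([0, 1, 2, 1, 0])

# A dict maps each old value to its own replacement
out = ser.replace({0: -100, 2: -200})

print(out.tolist())
```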