Python Data Analysis, Part 3 – A Brief Introduction to Pandas

Table of Contents

  • Python Data Analysis, Part 3 – A Brief Introduction to Pandas
    • 1. Pandas Data Structures
      • 1.1 Series
      • 1.2 DataFrame
      • 1.3 Index
    • 2. Pandas Data Operations
      • 2.1 Indexing
        • 2.1.1 Series Indexing
        • 2.1.2 DataFrame Indexing
        • 2.1.3 Three Indexing Styles (label, integer position, mixed)
      • 2.2 Arithmetic and Alignment
      • 2.3 Applying Functions
      • Sorting
      • 2.4 Handling Missing Data
    • 3. Pandas Statistical Computation and Description
      • 3.1 Common Statistical Computations
      • 3.2 Statistical Description
    • 4. Grouping and Aggregation
      • 4.1 Grouping (groupby)
        • 4.1.1 The GroupBy Object
        • 4.1.2 Iterating over a GroupBy Object
        • 4.1.3 Other Grouping Methods
      • 4.2 Aggregation
      • 4.3 Group-wise Computation
        • 4.3.1 The transform Method
        • 4.3.2 apply
        • 4.3.3 Suppressing Group Keys: group_keys = False
    • 5. Data Cleaning, Merging, Transformation, and Reshaping
      • 5.1 Data Cleaning
      • 5.2 Joining Data
      • 5.3 Concatenating Data
        • 5.3.1 NumPy concatenate
        • 5.3.2 concat on Series
        • 5.3.3 concat on DataFrames
      • 5.4 Reshaping Data
        • 5.4.1 stack
      • 5.5 Data Transformation
        • 5.5.1 Duplicate Data
        • 5.5.2 The map Function
        • 5.5.3 Replacing Values with replace

1. Pandas Data Structures

Pandas is a powerful toolkit for analyzing structured data.

It is built on NumPy and provides high-performance array operations.

It also provides data-cleaning facilities.

1.1 Series

A Series is a one-dimensional, array-like object.

import pandas as pd

# Build a Series from a list-like range
ser_obj = pd.Series(range(10, 20))
print(type(ser_obj))
# Get the values
print(ser_obj.values)
# Get the index
print(ser_obj.index)
<class 'pandas.core.series.Series'>
[10 11 12 13 14 15 16 17 18 19]
RangeIndex(start=0, stop=10, step=1)
# Preview the data
print(ser_obj.head(3))
0    10
1    11
2    12
dtype: int64
print(ser_obj)
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64
# Access data by index
print(ser_obj[0])
print(ser_obj[8])
10
18
# The index-to-value correspondence is preserved in the results of array operations
print(ser_obj * 2)
print(ser_obj > 15)
0    20
1    22
2    24
3    26
4    28
5    30
6    32
7    34
8    36
9    38
dtype: int64
0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8     True
9     True
dtype: bool
# Build a Series from a dict
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)
print(ser_obj2.head())
print(ser_obj2.index)
2001    17.8
2002    20.1
2003    16.5
dtype: float64
Int64Index([2001, 2002, 2003], dtype='int64')
# The name attribute
ser_obj2.name = 'temp'
ser_obj2.index.name = 'year'
print(ser_obj2.head())
year
2001    17.8
2002    20.1
2003    16.5
Name: temp, dtype: float64

1.2 DataFrame

A DataFrame is a table-like, two-dimensional structure with both a row index and a column index.

import numpy as np

# Build a DataFrame from an ndarray
array = np.random.randn(5,4)
print(array)

df_obj = pd.DataFrame(array)
print(df_obj.head())
[[-0.81322788 -1.51137049 -1.06359499  1.28648638]
 [ 1.31333252  0.25461725 -0.7915332   0.75370275]
 [-0.09241866 -1.29103143 -0.32951959 -0.76066101]
 [-0.40326652  1.04621326  0.99044616 -0.35491204]
 [ 2.8170386  -0.58422902  2.84249502 -0.62538581]]
          0         1         2         3
0 -0.813228 -1.511370 -1.063595  1.286486
1  1.313333  0.254617 -0.791533  0.753703
2 -0.092419 -1.291031 -0.329520 -0.760661
3 -0.403267  1.046213  0.990446 -0.354912
4  2.817039 -0.584229  2.842495 -0.625386
# Build a DataFrame from a dict
dict_data = {'A': 1.,
             'B': pd.Timestamp('20161217'),
             'C': pd.Series(1, index=list(range(4)), dtype='float32'),
             'D': np.array([3] * 4, dtype='int32'),
             'E': pd.Categorical(["Python", "Java", "C++", "C#"]),
             'F': 'ChinaHadoop'}
# print(dict_data)
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2.head())
   A          B    C  D       E            F
0  1.0 2016-12-17  1.0  3  Python  ChinaHadoop
1  1.0 2016-12-17  1.0  3    Java  ChinaHadoop
2  1.0 2016-12-17  1.0  3     C++  ChinaHadoop
3  1.0 2016-12-17  1.0  3      C#  ChinaHadoop
# Access a column by its label
print(df_obj2['A'])
print(type(df_obj2['A']))
print(df_obj2.A)
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64
<class 'pandas.core.series.Series'>
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64
# Add a column
df_obj2['G'] = df_obj2['D'] + 4
print(df_obj2.head())
    A          B    C  D       E            F  G
0  1.0 2016-12-17  1.0  3  Python  ChinaHadoop  7
1  1.0 2016-12-17  1.0  3    Java  ChinaHadoop  7
2  1.0 2016-12-17  1.0  3     C++  ChinaHadoop  7
3  1.0 2016-12-17  1.0  3      C#  ChinaHadoop  7
# Delete a column
del df_obj2['G']
print(df_obj2.head())
     A          B    C  D       E            F
0  1.0 2016-12-17  1.0  3  Python  ChinaHadoop
1  1.0 2016-12-17  1.0  3    Java  ChinaHadoop
2  1.0 2016-12-17  1.0  3     C++  ChinaHadoop
3  1.0 2016-12-17  1.0  3      C#  ChinaHadoop

1.3 Index

The index of a Series or DataFrame is an Index object. Index objects are immutable, which keeps the underlying data safe.

Common kinds of Index:

Index

Int64Index

MultiIndex, a "hierarchical" index

DatetimeIndex, for timestamps

print(type(ser_obj.index))
print(type(df_obj2.index))
print(df_obj2.index)
<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.indexes.numeric.Int64Index'>
Int64Index([0, 1, 2, 3], dtype='int64')
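Because Index objects are immutable, relabeling means replacing the whole index rather than mutating it in place. A minimal sketch (the Series here is illustrative, not from the examples above):

```python
import pandas as pd

ser = pd.Series(range(3), index=['a', 'b', 'c'])

# Elements of an Index cannot be assigned in place
try:
    ser.index[0] = 'x'
except TypeError as err:
    print('Index is immutable:', err)

# To relabel, assign a complete new index instead
ser.index = ['x', 'y', 'z']
print(ser.index.tolist())
```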

2. Pandas Data Operations

2.1 Indexing

A note on slicing: with integer-position slices the end point is excluded, while with label slices the end point is included.
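The end-point rule can be checked directly; a quick sketch:

```python
import pandas as pd

ser = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

# Positional slice: the end point is excluded, so two elements come back
print(ser.iloc[1:3].index.tolist())    # ['b', 'c']

# Label slice: the end point is included, so three elements come back
print(ser.loc['b':'d'].index.tolist())  # ['b', 'c', 'd']
```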

2.1.1 Series Indexing

import pandas as pd

ser_obj = pd.Series(range(5), index = ['a', 'b', 'c', 'd', 'e'])
print(ser_obj.head())
a    0
b    1
c    2
d    3
e    4
dtype: int32
# Row indexing
print(ser_obj['a'])
print(ser_obj[0])
0
0
# Slicing
print(ser_obj[1:3])
print(ser_obj['b':'d'])
b    1
c    2
dtype: int32
b    1
c    2
d    3
dtype: int32
# Fancy (non-contiguous) indexing
print(ser_obj[[0, 2, 4]])
print(ser_obj[['a', 'e']])
a    0
c    2
e    4
dtype: int32
a    0
e    4
dtype: int32
# Boolean indexing
ser_bool = ser_obj > 2
print(ser_bool)
print(ser_obj[ser_bool])
print(ser_obj[ser_obj > 2])
a    False
b    False
c    False
d     True
e     True
dtype: bool
d    3
e    4
dtype: int32
d    3
e    4
dtype: int32

2.1.2 DataFrame Indexing

import numpy as np

df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
print(df_obj.head())
         a         b         c         d
0 -0.595692  0.813699 -0.551327 -0.059703
1  0.339194 -2.335579  0.230472 -0.680213
2 -0.252306  0.212406 -0.979523  0.408522
3  0.216677  0.574524 -0.819607  2.170009
4 -1.099175 -0.665488  0.391421 -0.400642
# Column indexing
print('column indexing')
print(df_obj['a'])        # returns a Series
print(type(df_obj[[0]]))  # returns a DataFrame (positional column lists only work in older pandas)

# Fancy indexing
print('fancy indexing')
print(df_obj[['a','c']])
print(df_obj[[1, 3]])
column indexing
0   -0.595692
1    0.339194
2   -0.252306
3    0.216677
4   -1.099175
Name: a, dtype: float64
<class 'pandas.core.frame.DataFrame'>
fancy indexing
          a         c
0 -0.595692 -0.551327
1  0.339194  0.230472
2 -0.252306 -0.979523
3  0.216677 -0.819607
4 -1.099175  0.391421
          b         d
0  0.813699 -0.059703
1 -2.335579 -0.680213
2  0.212406  0.408522
3  0.574524  2.170009
4 -0.665488 -0.400642

2.1.3 Three Indexing Styles (label, integer position, mixed)

# Label-based indexing: loc
# Series
print(ser_obj['b':'d'])
print(ser_obj.loc['b':'d'])

# DataFrame
print(df_obj['a'])
print(df_obj.loc[0:2, 'a'])
b    1
c    2
d    3
dtype: int32
b    1
c    2
d    3
dtype: int32
0   -0.595692
1    0.339194
2   -0.252306
3    0.216677
4   -1.099175
Name: a, dtype: float64
0   -0.595692
1    0.339194
2   -0.252306
Name: a, dtype: float64
# Integer-position indexing: iloc
print(ser_obj[1:3])
print(ser_obj.iloc[1:3])

# DataFrame
print(df_obj.iloc[0:2, 0])  # note the difference from df_obj.loc[0:2, 'a']
b    1
c    2
dtype: int32
b    1
c    2
dtype: int32
0   -0.595692
1    0.339194
Name: a, dtype: float64
# Mixed indexing: ix
print(ser_obj.ix[1:3])
print(ser_obj.ix['b':'c'])

# DataFrame
print(df_obj.ix[0:2, 0])  # tries label-based indexing first, then falls back to positional
b    1
c    2
dtype: int32
b    1
c    2
dtype: int32
0   -0.595692
1    0.339194
2   -0.252306
Name: a, dtype: float64
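Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the calls above fail on current versions. The same results can be obtained by choosing .loc or .iloc explicitly; a sketch with the same shapes of data as above:

```python
import numpy as np
import pandas as pd

ser_obj = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])

# ser_obj.ix[1:3]      -> positional, use iloc
print(ser_obj.iloc[1:3])

# ser_obj.ix['b':'c']  -> label-based, use loc
print(ser_obj.loc['b':'c'])

# df_obj.ix[0:2, 0]    -> use loc with the row labels and the column name
print(df_obj.loc[0:2, 'a'])
```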

2.2 Arithmetic and Alignment

s1 = pd.Series(range(10, 20), index = range(10))
s2 = pd.Series(range(20, 25), index = range(5))

print('s1: ')
print(s1)
print('')
print('s2: ')
print(s2)
s1:
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32
s2: 
0    20
1    21
2    22
3    23
4    24
dtype: int32
# Series alignment
s1 + s2
0    30.0
1    32.0
2    34.0
3    36.0
4    38.0
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
dtype: float64
import numpy as np

df1 = pd.DataFrame(np.ones((2,2)), columns = ['a', 'b'])
df2 = pd.DataFrame(np.ones((3,3)), columns = ['a', 'b', 'c'])

print('df1: ')
print(df1)
print('')
print('df2: ')
print(df2)
df1: 
     a    b
0  1.0  1.0
1  1.0  1.0

df2: 
     a    b    c
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0
# DataFrame alignment
df1 + df2
a b c
0 2.0 2.0 NaN
1 2.0 2.0 NaN
2 NaN NaN NaN
# Fill unaligned positions before computing
print(s1)
print(s2)

s1.add(s2, fill_value = -1)
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64
0    20
1    21
2    22
3    23
4    24
dtype: int64
Out[9]:
0    30.0
1    32.0
2    34.0
3    36.0
4    38.0
5    14.0
6    15.0
7    16.0
8    17.0
9    18.0
dtype: float64
df1.sub(df2, fill_value = 2.)
a b c
0 0.0 0.0 1.0
1 0.0 0.0 1.0
2 1.0 1.0 1.0
# Filling NaN
s3 = s1 + s2
s3.fillna(-1., inplace = True)
print(s3)
0    30.0
1    32.0
2    34.0
3    36.0
4    38.0
5    -1.0
6    -1.0
7    -1.0
8    -1.0
9    -1.0
dtype: float64
df3 = df1 + df2
print(df3)
     a    b   c
0  2.0  2.0 NaN
1  2.0  2.0 NaN
2  NaN  NaN NaN
df3.fillna(100, inplace = True)
print(df3)
       a      b      c
0    2.0    2.0  100.0
1    2.0    2.0  100.0
2  100.0  100.0  100.0

2.3 Applying Functions

# NumPy ufuncs apply element-wise to a DataFrame
df = pd.DataFrame(np.random.randn(5,4) - 1)
print(df)
print(np.abs(df))
          0         1         2         3
0 -2.193022 -2.090432 -2.288651 -0.026022
1 -0.720957 -1.501025 -1.734828 -1.858286
2  0.300216 -3.391127 -0.872570 -0.686669
3 -2.552131 -1.452268 -1.188845 -0.597845
4  2.111044 -1.203676 -1.143487 -0.542755
          0         1         2         3
0  2.193022  2.090432  2.288651  0.026022
1  0.720957  1.501025  1.734828  1.858286
2  0.300216  3.391127  0.872570  0.686669
3  2.552131  1.452268  1.188845  0.597845
4  2.111044  1.203676  1.143487  0.542755
# apply runs a function on each row or column; without an axis argument it operates on columns
#f = lambda x : x.max()
print(df.apply(lambda x : x.max()))
0    2.111044
1   -1.203676
2   -0.872570
3   -0.026022
dtype: float64
# Specify the axis; with axis=1 the function is applied to each row
print(df.apply(lambda x : x.max(), axis=1))
# applymap applies a function to every element
f2 = lambda x : '%.2f' % x
print(df.applymap(f2))
       0      1      2      3
0  -2.19  -2.09  -2.29  -0.03
1  -0.72  -1.50  -1.73  -1.86
2   0.30  -3.39  -0.87  -0.69
3  -2.55  -1.45  -1.19  -0.60
4   2.11  -1.20  -1.14  -0.54

Sorting

s4 = pd.Series(range(10, 15), index = np.random.randint(5, size=5))
print(s4)
4    10
3    11
1    12
4    13
4    14
dtype: int32
# Sort by index
s4.sort_index()
1    12
3    11
4    10
4    13
4    14
dtype: int32
df4 = pd.DataFrame(np.random.randn(3, 4),
                   index=np.random.randint(3, size=3),
                   columns=np.random.randint(4, size=4))
print(df4)
          3         2         2         1
2  0.244068 -1.977220  0.045238 -2.064546
2  0.218196 -0.419284 -0.698839  0.241649
2  0.296747 -0.021311  0.225724 -0.325439
#df4.sort_index(ascending=False)
df4.sort_index(axis=1)
1 2 2 3
2 -2.064546 -1.977220 0.045238 0.244068
2 0.241649 -0.419284 -0.698839 0.218196
2 -0.325439 -0.021311 0.225724 0.296747
# Sort by values
df4.sort_values(by=1)
3 2 2 1
2 0.244068 -1.977220 0.045238 -2.064546
2 0.296747 -0.021311 0.225724 -0.325439
2 0.218196 -0.419284 -0.698839 0.241649

2.4 Handling Missing Data

df_data = pd.DataFrame([np.random.randn(3), [1., np.nan, np.nan],
                        [4., np.nan, np.nan], [1., np.nan, 2.]])
df_data.head()
0 1 2
0 0.106763 1.521225 -0.654447
1 1.000000 NaN NaN
2 4.000000 NaN NaN
3 1.000000 NaN 2.000000
# isnull
df_data.isnull()
0 1 2
0 False False False
1 False True True
2 False True True
3 False True False
# dropna
df_data.dropna()
#df_data.dropna(axis=1)
0 1 2
0 0.106763 1.521225 -0.654447
# dropna
#df_data.dropna()
df_data.dropna(axis=1)
0
0 0.106763
1 1.000000
2 4.000000
3 1.000000
# fillna
df_data.fillna(-100.)
0 1 2
0 1.619463 0.548047 -1.027003
1 1.000000 -100.000000 -100.000000
2 4.000000 -100.000000 -100.000000
3 1.000000 -100.000000 2.000000

3. Pandas Statistical Computation and Description

import numpy as np
import pandas as pd

3.1 Common Statistical Computations

df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
df_obj
a b c d
0 -0.767557 0.637179 1.650174 1.046862
1 1.675605 0.438124 -0.119837 -0.637159
2 -0.367849 1.044564 0.977799 -0.185916
3 -0.453807 -0.786276 -1.917200 0.502135
4 0.230846 -0.056462 1.612659 0.426120
df_obj.sum()   # column-wise by default
a    0.317239
b    1.277129
c    2.203596
d    1.152041
dtype: float64
df_obj.max()
a    1.675605
b    1.044564
c    1.650174
d    1.046862
dtype: float64
df_obj.min(axis=1)   # axis=1 computes along rows
0   -0.767557
1   -0.637159
2   -0.367849
3   -1.917200
4   -0.056462
dtype: float64

3.2 Statistical Description

df_obj.describe()
a b c d
count 5.000000 5.000000 5.000000 5.000000
mean 0.882532 -0.574613 0.004122 -0.455327
std 1.052045 0.887115 0.456436 0.646042
min -0.713423 -1.600219 -0.628493 -1.403887
25% 0.715594 -1.483364 -0.211933 -0.664303
50% 0.783228 -0.053138 0.021936 -0.424985
75% 1.580704 0.123322 0.276417 -0.103682
max 2.046556 0.140333 0.562683 0.320219

4. Grouping and Aggregation

4.1 Grouping (groupby)

Grouping partitions a dataset and then runs statistics on each group; pandas supports quite sophisticated group-wise computation through groupby.

The grouping process:

split -> apply -> combine

Split: the criterion by which rows are assigned to groups

Apply: the computation run on each group

Combine: the per-group results merged into one result
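The three steps can be seen in a minimal sketch (the key/val column names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'val': [1, 2, 3, 4]})

# split: partition the rows by 'key'
grouped = df.groupby('key')

# apply: run a reduction on each group
# combine: the per-group results come back as one Series indexed by key
result = grouped['val'].sum()
print(result)   # a -> 4, b -> 6
```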

4.1.1 The GroupBy Object

import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
            'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
    data1     data2 key1   key2
0 -0.943078  0.820645    a    one
1 -1.429043  0.142617    b    one
2  0.832261  0.843898    a    two
3  0.906262  0.688165    b  three
4  0.541173  0.117232    a    two
5 -0.213385 -0.098734    b    two
6 -1.291468 -1.186638    a    one
7  1.186941  0.809122    a  three
# Group the DataFrame by key1
print(type(df_obj.groupby('key1')))
<class 'pandas.core.groupby.DataFrameGroupBy'>
# Group the data1 column by key1
print(type(df_obj['data1'].groupby(df_obj['key1'])))
<class 'pandas.core.groupby.SeriesGroupBy'>
# Group-wise computation
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())  # recent pandas versions need grouped1.mean(numeric_only=True)

grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())
         data1     data2
key1
a     0.065166  0.280852
b    -0.245389  0.244016
key1
a    0.065166
b   -0.245389
Name: data1, dtype: float64
# size
print(grouped1.size())
print(grouped2.size())
key1
a    5
b    3
dtype: int64
key1
a    5
b    3
dtype: int64
# Group by a column name
df_obj.groupby('key1')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002091C820550>
# Group by a custom key (a list)
self_def_key = [1, 1, 2, 2, 2, 1, 1, 1]
df_obj.groupby(self_def_key).size()
1    5
2    3
dtype: int64
# Group by multiple custom keys (a list of Series)
df_obj.groupby([df_obj['key1'], df_obj['key2']]).size()
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
# Group by multiple columns (hierarchical grouping)
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
# Multi-level grouping follows the order of the keys
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
print()
print(grouped3.mean().unstack())
               data1     data2
key2  key1
one   a    -1.117273 -0.182997
      b    -1.429043  0.142617
three a     1.186941  0.809122
      b     0.906262  0.688165
two   a     0.686717  0.480565
      b    -0.213385 -0.098734

          data1               data2          
key1          a         b         a         b
key2
one   -1.117273 -1.429043 -0.182997  0.142617
three  1.186941  0.906262  0.809122  0.688165
two    0.686717 -0.213385  0.480565 -0.098734

4.1.2 Iterating over a GroupBy Object

# Iterate over single-level groups
for group_name, group_data in grouped1:
    print(group_name)
    print(group_data)
a
      data1     data2 key1   key2
0 -0.943078  0.820645    a    one
2  0.832261  0.843898    a    two
4  0.541173  0.117232    a    two
6 -1.291468 -1.186638    a    one
7  1.186941  0.809122    a  three
b
      data1     data2 key1   key2
1 -1.429043  0.142617    b    one
3  0.906262  0.688165    b  three
5 -0.213385 -0.098734    b    two
# Iterate over multi-level groups; the group name is a tuple of key values
for group_name, group_data in grouped2:
    print(group_name)
    print(group_data)
('a', 'one')
      data1     data2 key1 key2
0 -0.943078  0.820645    a  one
6 -1.291468 -1.186638    a  one
('a', 'three')
      data1     data2 key1   key2
7  1.186941  0.809122    a  three
('a', 'two')
      data1     data2 key1 key2
2  0.832261  0.843898    a  two
4  0.541173  0.117232    a  two
('b', 'one')
      data1     data2 key1 key2
1 -1.429043  0.142617    b  one
('b', 'three')
      data1     data2 key1   key2
3  0.906262  0.688165    b  three
('b', 'two')
      data1     data2 key1 key2
5 -0.213385 -0.098734    b  two
# Convert a GroupBy object to a list
list(grouped1)
[('a',        data1     data2 key1   key2
  0 -0.943078  0.820645    a    one
  2  0.832261  0.843898    a    two
  4  0.541173  0.117232    a    two
  6 -1.291468 -1.186638    a    one
  7  1.186941  0.809122    a  three),
 ('b',        data1     data2 key1   key2
  1 -1.429043  0.142617    b    one
  3  0.906262  0.688165    b  three
  5 -0.213385 -0.098734    b    two)]
# Convert a GroupBy object to a dict
dict(list(grouped1))
{'a':        data1     data2 key1   key2
 0 -0.943078  0.820645    a    one
 2  0.832261  0.843898    a    two
 4  0.541173  0.117232    a    two
 6 -1.291468 -1.186638    a    one
 7  1.186941  0.809122    a  three,
 'b':        data1     data2 key1   key2
 1 -1.429043  0.142617    b    one
 3  0.906262  0.688165    b  three
 5 -0.213385 -0.098734    b    two}
# Group columns by their dtype
print(df_obj.dtypes)

df_obj.groupby(df_obj.dtypes, axis=1).size()
df_obj.groupby(df_obj.dtypes, axis=1).sum()
data1    float64
data2    float64
key1      object
key2      object
dtype: object
    float64  object
0 -0.122433    aone
1 -1.286426    bone
2  1.676158    atwo
3  1.594427  bthree
4  0.658404    atwo
5 -0.312119    btwo
6 -2.478106    aone
7  1.996064  athree

4.1.3 Other Grouping Methods

df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['A', 'B', 'C', 'D', 'E'])
df_obj2.iloc[1, 1:4] = np.nan  # the original used .ix, which modern pandas has removed
df_obj2
a b c d e
A 1 1.0 1.0 6.0 5
B 2 NaN NaN NaN 6
C 5 5.0 7.0 5.0 7
D 2 8.0 5.0 6.0 2
E 5 1.0 4.0 4.0 4
# Group columns via a dict mapping
mapping_dict = {'a':'python', 'b':'python', 'c':'java', 'd':'C', 'e':'java'}
df_obj2.groupby(mapping_dict, axis=1).size()
df_obj2.groupby(mapping_dict, axis=1).count()  # number of non-NaN values
df_obj2.groupby(mapping_dict, axis=1).sum()
C java python
A 6.0 6.0 2.0
B NaN 6.0 2.0
C 5.0 14.0 10.0
D 6.0 7.0 10.0
E 4.0 8.0 6.0
# Group via a function
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['AA', 'BBB', 'CC', 'D', 'EE'])
#df_obj3

def group_key(idx):
    """idx is a row (or column) index label"""
    #return idx
    return len(idx)

df_obj3.groupby(group_key).size()

# The custom function above is equivalent to
#df_obj3.groupby(len).size()
1    1
2    3
3    1
dtype: int64
# Group by index level
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                     ['A', 'A', 'B', 'C', 'B']],
                                    names=['language', 'index'])
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
df_obj4
language Python Java Python Java Python
index A A B C B
0 1 6 4 7 2
1 9 7 2 2 4
2 3 9 9 7 5
3 1 6 1 6 6
4 5 1 7 3 6
# Group by the language level
df_obj4.groupby(level='language', axis=1).sum()
df_obj4.groupby(level='index', axis=1).sum()
index A B C
0 7 6 7
1 16 6 2
2 12 14 7
3 7 7 6
4 6 13 3

4.2 Aggregation

Aggregation is a computation on grouped data that reduces an array to a scalar, such as mean() or count().

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1,10, 8),
            'data2': np.random.randint(1,10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)
   data1  data2 key1   key2
0      4      2    a    one
1      7      1    b    one
2      2      8    a    two
3      9      4    b  three
4      3      2    a    two
5      8      5    b    two
6      6      8    a    one
7      9      3    a  three
# Built-in aggregation functions
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())
      data1  data2
key1
a        24     23
b        24     10

      data1  data2 key2
key1
a         9      8  two
b         9      5  two

      data1  data2 key2
key1
a         2      2  one
b         7      1  one

      data1     data2
key1
a       4.8  4.600000
b       8.0  3.333333

key1
a    5
b    3
dtype: int64

      data1  data2  key2
key1
a         5      5     5
b         3      3     3

            data1     data2
key1
a    count  5.000000  5.000000
     mean   4.800000  4.600000
     std    2.774887  3.130495
     min    2.000000  2.000000
     25%    3.000000  2.000000
     50%    4.000000  3.000000
     75%    6.000000  8.000000
     max    9.000000  8.000000
b    count  3.000000  3.000000
     mean   8.000000  3.333333
     std    1.000000  2.081666
     min    7.000000  1.000000
     25%    7.500000  2.500000
     50%    8.000000  4.000000
     75%    8.500000  4.500000
     max    9.000000  5.000000
# Custom aggregation function
def peak_range(df):
    """Return the range (max - min) of the values"""
    #print(type(df))  # the argument holds each group's records
    return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df : df.max() - df.min()))
      data1  data2
key1
a         7      6
b         2      4

         data1     data2
key1
a     2.478410  2.030536
b     2.335305  0.786899
# Apply several aggregation functions at once
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range]))  # column names default to the function names
         data1                                data2                          
          mean       std count peak_range      mean       std count peak_range
key1
a     0.065166  1.110226     5   2.478410  0.280852  0.875752     5   2.030536
b    -0.245389  1.167982     3   2.335305  0.244016  0.403130     3   0.786899
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)]))  # supply a new column name via a tuple
         data1                               data2                         
          mean       std count     range      mean       std count     range
key1
a     0.065166  1.110226     5  2.478410  0.280852  0.875752     5  2.030536
b    -0.245389  1.167982     3  2.335305  0.244016  0.403130     3  0.786899
# Apply a different aggregation function to each column
dict_mapping = {'data1':'mean','data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
         data2     data1
key1
a     1.404259  0.065166
b     0.732047 -0.245389
dict_mapping = {'data1':['mean','max'],'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
         data2     data1          
           sum      mean       max
key1
a     1.404259  0.065166  1.186941
b     0.732047 -0.245389  0.906262

4.3 Group-wise Computation

import pandas as pd
import numpy as np

# Group-wise computation that keeps the original shape
dict_obj = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
df_obj
key1 key2 data1 data2
0 a one 3 9
1 b one 4 5
2 a two 8 9
3 b three 4 8
4 a two 5 2
5 b two 3 2
6 a one 4 9
7 a three 9 4
# Group by key1, compute statistics for data1 and data2, and attach them back to the original table
k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_')
k1_sum
sum_data1 sum_data2
key1
a 29 33
b 11 15
# Method 1: merge
pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)
key1 key2 data1 data2 sum_data1 sum_data2
0 a one 3 9 29 33
2 a two 8 9 29 33
4 a two 5 2 29 33
6 a one 4 9 29 33
7 a three 9 4 29 33
1 b one 4 5 11 15
3 b three 4 8 11 15
5 b two 3 2 11 15

4.3.1 The transform Method

# Method 2: transform
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
df_obj
key1 key2 data1 data2 sum_key2 sum_data1 sum_data2
0 a one 3 9 onetwotwoonethree 29 33
1 b one 4 5 onethreetwo 11 15
2 a two 8 9 onetwotwoonethree 29 33
3 b three 4 8 onethreetwo 11 15
4 a two 5 2 onetwotwoonethree 29 33
5 b two 3 2 onethreetwo 11 15
6 a one 4 9 onetwotwoonethree 29 33
7 a three 9 4 onetwotwoonethree 29 33
# Pass a custom function to transform
def diff_mean(s):
    """Return each value's difference from the group mean"""
    return s - s.mean()

df_obj.groupby('key1').transform(diff_mean)
data1 data2 sum_data1 sum_data2
0 -2.800000 2.4 0 0
1 0.333333 0.0 0 0
2 2.200000 2.4 0 0
3 0.333333 3.0 0 0
4 -0.800000 -4.6 0 0
5 -0.666667 -3.0 0 0
6 -1.800000 2.4 0 0
7 3.200000 -2.6 0 0
dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek', 'TotalHours', 'APM'])

4.3.2 apply

def top_n(df, n=3, column='APM'):
    """Return each group's top n rows, sorted by column"""
    return df.sort_values(by=column, ascending=False)[:n]

df_data.groupby('LeagueIndex').apply(top_n)
LeagueIndex Age HoursPerWeek TotalHours APM
LeagueIndex
1 2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
2 3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
3 1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
4 2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
5 3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
6 734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
7 3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
8 3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518
# Extra keyword arguments to apply are passed through to the custom function
df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age')
LeagueIndex Age HoursPerWeek TotalHours APM
LeagueIndex
1 3146 1 40.0 12.0 150.0 38.5590
3040 1 39.0 10.0 500.0 29.8764
2 920 2 43.0 10.0 730.0 86.0586
2437 2 41.0 4.0 200.0 54.2166
3 1258 3 41.0 14.0 800.0 77.6472
2972 3 40.0 10.0 500.0 60.5970
4 1696 4 44.0 6.0 500.0 89.5266
1729 4 39.0 8.0 500.0 86.7246
5 202 5 37.0 14.0 800.0 327.7218
2745 5 37.0 18.0 1000.0 123.4098
6 3069 6 31.0 8.0 800.0 133.1790
2706 6 31.0 8.0 700.0 66.9918
7 2813 7 26.0 36.0 1300.0 188.5512
1992 7 26.0 24.0 1000.0 219.6690
8 3340 8 NaN NaN NaN 189.7404

4.3.3 Suppressing Group Keys: group_keys = False

df_data.groupby('LeagueIndex', group_keys=False).apply(top_n)
LeagueIndex Age HoursPerWeek TotalHours APM
2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518

5. Data Cleaning, Merging, Transformation, and Reshaping

5.1 Data Cleaning
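Data cleaning combines tools shown elsewhere in this post: isnull/dropna/fillna for missing values (2.4) and duplicated/drop_duplicates for repeats (5.5.1). A combined sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 1.0, 4.0],
                   'b': ['x', 'y', 'x', None]})

# 1. count missing values per column
print(df.isnull().sum())

# 2. fill the gaps (or drop those rows with dropna)
cleaned = df.fillna({'a': df['a'].mean(), 'b': 'unknown'})

# 3. drop duplicate rows
cleaned = cleaned.drop_duplicates()
print(cleaned)
```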

5.2 Joining Data

import pandas as pd
import numpy as np

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data2' : np.random.randint(0,10,3)})

print(df_obj1)
print(df_obj2)
  data1 key
0      5   b
1      9   b
2      1   a
3      0   c
4      3   a
5      9   a
6      0   b
  data2 key
0      9   a
1      3   b
2      8   d
# By default, merge joins on the overlapping column names
pd.merge(df_obj1, df_obj2)
data1 key data2
0 5 b 3
1 9 b 3
2 0 b 3
3 1 a 9
4 3 a 9
5 9 a 9
# Specify the join key explicitly with on
pd.merge(df_obj1, df_obj2, on='key')
data1 key data2
0 5 b 3
1 9 b 3
2 0 b 3
3 1 a 9
4 3 a 9
5 9 a 9
# left_on / right_on specify the join keys of the left and right tables separately

# Rename the columns
df_obj1 = df_obj1.rename(columns={'key':'key1'})
df_obj2 = df_obj2.rename(columns={'key':'key2'})
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2')
data1 key1 data2 key2
0 5 b 3 b
1 9 b 3 b
2 0 b 3 b
3 1 a 9 a
4 3 a 9 a
5 9 a 9 a
# Outer join
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer')
data1 key1 data2 key2
0 5.0 b 3.0 b
1 9.0 b 3.0 b
2 0.0 b 3.0 b
3 1.0 a 9.0 a
4 3.0 a 9.0 a
5 9.0 a 9.0 a
6 0.0 c NaN NaN
7 NaN NaN 8.0 d
# Left join
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left')
data1 key1 data2 key2
0 5 b 3.0 b
1 9 b 3.0 b
2 1 a 9.0 a
3 0 c NaN NaN
4 3 a 9.0 a
5 9 a 9.0 a
6 0 b 3.0 b
# Right join
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right')
data1 key1 data2 key2
0 5.0 b 3 b
1 9.0 b 3 b
2 0.0 b 3 b
3 1.0 a 9 a
4 3.0 a 9 a
5 9.0 a 9 a
6 NaN NaN 8 d
# Handling duplicate column names
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data' : np.random.randint(0,10,3)})

pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right'))
data_left key data_right
0 4 b 5
1 2 b 5
2 5 b 5
3 9 a 7
4 6 a 7
5 6 a 7
# Join on the index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'data2' : np.random.randint(0,10,3)}, index=['a', 'b', 'd'])
pd.merge(df_obj1, df_obj2, left_on='key', right_index=True)
data1 key data2
0 6 b 5
1 7 b 5
6 9 b 5
2 2 a 3
4 4 a 3
5 4 a 3

5.3 Concatenating Data

import numpy as np
import pandas as pd

5.3.1 NumPy concatenate

arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))

print(arr1)
print(arr2)
[[8 5 2 4]
 [2 9 9 0]
 [6 2 1 3]]
[[0 4 5 3]
 [6 4 1 3]
 [1 2 9 9]]
np.concatenate([arr1, arr2])
array([[8, 5, 2, 4],
       [2, 9, 9, 0],
       [6, 2, 1, 3],
       [0, 4, 5, 3],
       [6, 4, 1, 3],
       [1, 2, 9, 9]])
np.concatenate([arr1, arr2], axis=1)
array([[8, 5, 2, 4, 0, 4, 5, 3],
       [2, 9, 9, 0, 6, 4, 1, 3],
       [6, 2, 1, 3, 1, 2, 9, 9]])

5.3.2 concat on Series

# Case: the indexes do not overlap
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0,5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5,9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9,12))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
0    9
1    9
2    5
3    6
4    4
dtype: int32
5    1
6    9
7    2
8    5
dtype: int32
9     4
10    7
11    4
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3])
0     9
1     9
2     5
3     6
4     4
5     1
6     9
7     2
8     5
9     4
10    7
11    4
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1)
0 1 2
0 9.0 NaN NaN
1 9.0 NaN NaN
2 5.0 NaN NaN
3 6.0 NaN NaN
4 4.0 NaN NaN
5 NaN 1.0 NaN
6 NaN 9.0 NaN
7 NaN 2.0 NaN
8 NaN 5.0 NaN
9 NaN NaN 4.0
10 NaN NaN 7.0
11 NaN NaN 4.0
# Case: the indexes overlap
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
0    7
1    3
2    9
3    1
4    7
dtype: int32
0    6
1    1
2    4
3    7
dtype: int32
0    7
1    2
2    3
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3])
0    7
1    3
2    9
3    1
4    7
0    6
1    1
2    4
3    7
0    7
1    2
2    3
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner')
0 1 2
0 7 6 7
1 3 1 2
2 9 4 3

5.3.3 concat on DataFrames

df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],columns=['A', 'B'])
df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],columns=['C', 'D'])
print(df_obj1)
print(df_obj2)
   A  B
a  1  6
b  1  0
c  1  6
   C  D
a  2  1
b  7  4
pd.concat([df_obj1, df_obj2])
A B C D
a 1.0 6.0 NaN NaN
b 1.0 0.0 NaN NaN
c 1.0 6.0 NaN NaN
a NaN NaN 2.0 1.0
b NaN NaN 7.0 4.0
pd.concat([df_obj1, df_obj2], axis=1)
A B C D
a 1 6 2.0 1.0
b 1 0 7.0 4.0
c 1 6 NaN NaN

5.4 Reshaping Data

5.4.1 stack

import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randint(0,10, (5,2)), columns=['data1', 'data2'])
df_obj
data1 data2
0 0 8
1 9 5
2 1 1
3 8 5
4 9 4
stacked = df_obj.stack()
print(stacked)
0  data1    0
   data2    8
1  data1    9
   data2    5
2  data1    1
   data2    1
3  data1    8
   data2    5
4  data1    9
   data2    4
dtype: int32
# By default, unstack operates on the innermost index level
stacked.unstack()
data1 data2
0 0 8
1 9 5
2 1 1
3 8 5
4 9 4
# Use level to choose which index level to operate on
stacked.unstack(level=0)
0 1 2 3 4
data1 0 9 1 8 9
data2 8 5 1 5 4

5.5 Data Transformation

import numpy as np
import pandas as pd

5.5.1 Duplicate Data

df_obj = pd.DataFrame({'data1' : ['a'] * 4 + ['b'] * 4,
                       'data2' : np.random.randint(0, 4, 8)})
df_obj
data1 data2
0 a 1
1 a 3
2 a 3
3 a 3
4 b 3
5 b 3
6 b 1
7 b 1
df_obj.duplicated()
0    False
1    False
2     True
3     True
4    False
5     True
6    False
7     True
dtype: bool
df_obj.drop_duplicates()
data1 data2
0 a 1
1 a 3
4 b 3
6 b 1
df_obj.drop_duplicates('data2')
data1 data2
0 a 1
1 a 3

5.5.2 The map Function

ser_obj = pd.Series(np.random.randint(0,10,10))
ser_obj
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32
ser_obj.map(lambda x : x ** 2)
0    36
1    25
2    16
3    16
4    64
5     1
6    36
7    25
8    16
9     1
dtype: int64

5.5.3 Replacing Values with replace

# Replace a single value
ser_obj.replace(0, -100)
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32
# Replace several values with the same replacement
ser_obj.replace([0, 2], -100)
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32
# Replace several values, each with its own replacement (again, neither 0 nor 2 appears here)
ser_obj.replace([0, 2], [-100, -200])
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32
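replace also accepts a dict form, equivalent to the paired-lists call above. A sketch using data that actually contains the target values, so the effect is visible:

```python
import pandas as pd

ser = pd.Series([0, 1, 2, 1, 0])

# A dict maps each old value to its replacement in one call
replaced = ser.replace({0: -100, 2: -200})
print(replaced.tolist())  # [-100, 1, -200, 1, -100]
```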

