Python数据分析基础

  • 一、导入常用库
  • 二、创建对象
  • 三、查看数据
  • 四、选取
  • 五、通过标签选取
  • 六、通过位置选取
  • 七、布尔索引
  • 八、赋值
  • 九、缺失值处理
  • 十、运算与统计
  • 十一、Apply函数的作用
  • 十二、频数统计
  • 十三、字符串方法
  • 十四、合并
  • 十五、分组
  • 十六、变形之堆叠
  • 十七、透视数据表
  • 十八、时间序列
  • 十九、分类
  • 二十、绘图
  • 二十一、获取数据的I/O
  • 后记

一、导入常用库

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

二、创建对象

(1)通过传递一个list来创建Series,pandas会默认创建整型索引

代码如下:

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

运行结果如下:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

说明:Series:一种类似于一维数组的对象,是由一组数据(各种NumPy数据类型)以及一组与之相关的数据标签(即索引)组成。仅由一组数据也可产生简单的Series对象。注意:Series中的索引值是可以重复的。

(2)通过传递一个numpy array,日期索引以及列标签来创建一个DataFrame

代码如下:

dates = pd.date_range('20210101', periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)

运行结果如下:

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04','2021-01-05', '2021-01-06'],dtype='datetime64[ns]', freq='D')A         B         C         D
2021-01-01  0.332594  1.151379  0.155028  1.914757
2021-01-02  1.681969 -1.157997  0.340231  2.054462
2021-01-03 -1.351239 -0.636244 -1.670505 -0.577896
2021-01-04  1.091745 -0.265513 -0.585511  0.462430
2021-01-05 -0.945369 -0.416703 -0.833535  0.446753
2021-01-06  0.712920  0.223502 -1.158448  0.046709

(3)通过传递一个能够被转换为类似series的dict对象来创建一个DataFrame

代码如下:

df2 = pd.DataFrame({'A': 1.,'B': pd.Timestamp('20200629'),'C': pd.Series(1, index=list(range(4)), dtype='float32'),'D': np.array([3]*4, dtype='int32'),'E': pd.Categorical(["test", "train", "test", "train"]),'F': 'foo'})
print(df2)
print(df2.dtypes)

运行结果如下:

     A          B    C  D      E    F
0  1.0 2020-06-29  1.0  3   test  foo
1  1.0 2020-06-29  1.0  3  train  foo
2  1.0 2020-06-29  1.0  3   test  foo
3  1.0 2020-06-29  1.0  3  train  foo
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

三、查看数据

(1)查看frame中头部和尾部的几行

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.head())    # 查看头部几行
print(df.tail(3))   # 查看尾部几行

运行结果如下:

                   A         B         C         D
2021-01-01 -0.854651 -0.610823 -0.167534  0.792160
2021-01-02  0.493142  0.580007  0.204097 -0.461438
2021-01-03  0.281382 -1.412539  3.594873 -0.130037
2021-01-04 -0.020957  0.013987 -2.404149 -0.277812
2021-01-05 -1.464734  0.144639 -0.667339  0.917941A         B         C         D
2021-01-04 -0.020957  0.013987 -2.404149 -0.277812
2021-01-05 -1.464734  0.144639 -0.667339  0.917941
2021-01-06  1.813891 -1.379392 -1.490363 -0.954958

(2)显示索引、列名以及底层的numpy数据

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.index)    # 显示索引
print(df.columns)   # 显示列名
print(df.values)   # 底层numpy数据

运行结果如下:

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04','2021-01-05', '2021-01-06'],dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[ 0.54760061 -1.52821104  0.98156267 -1.3086871 ][ 1.08947302 -0.27957116 -0.99159702  0.56656625][-0.56661193  0.73175369  0.84474106 -0.01924194][ 0.40981963 -1.23025219  1.31332923  1.16469658][-0.0996665   0.42960539  0.15250292  0.5774405 ][-0.05484658  0.70352716  0.88048923  1.0268004 ]]

(3)describe()能对数据做一个快速统计汇总

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.describe())    # 快速统计汇总

运行结果如下:

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.037825 -0.684544  0.074748 -0.393371
std    1.221381  0.651434  0.445844  1.016243
min   -1.638185 -1.327833 -0.581986 -1.639139
25%   -0.866597 -1.061551 -0.204719 -1.129647
50%   -0.116033 -0.838348  0.128008 -0.347528
75%    0.977540 -0.567642  0.455081  0.123801
max    1.418020  0.510622  0.525979  1.083412

(4)对数据做转置

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.T)    # 对数据做转置

运行结果如下:

   2021-01-01  2021-01-02  2021-01-03  2021-01-04  2021-01-05  2021-01-06
A   -0.924087    1.036039    2.380802    0.396621   -1.344227   -0.060524
B   -1.223318    1.818642   -1.659037    0.440670   -2.068355    0.660393
C    0.028661   -0.739622   -0.702494    0.767255   -0.027886    0.712692
D   -0.352344    1.421342   -0.915466    0.192375    1.665294   -0.865071

(5)按轴进行排序

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.sort_index(axis=1, ascending=False))    # 按轴进行排序

运行结果如下:

                   D         C         B         A
2021-01-01  0.147621  0.372596  0.403472 -1.462936
2021-01-02 -0.433823 -0.649770 -1.840609 -0.191425
2021-01-03  0.279578  1.917370  0.931369 -0.226179
2021-01-04  1.420825  1.596096 -1.250926  0.597007
2021-01-05  0.928866  0.465932 -1.089402 -0.060359
2021-01-06  0.175272 -0.152957  0.535680  1.290633

(5)按值进行排序

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.sort_values(by='B'))    # 按值进行排序

运行结果如下:

                   A         B         C         D
2021-01-02  1.103506 -0.595645  0.666151  1.309689
2021-01-03  0.021516 -0.091451  0.024281 -0.598654
2021-01-01 -1.565367  0.163802  0.425172 -2.247528
2021-01-05  3.003356  0.336145 -1.738533 -0.084639
2021-01-04  0.699287  0.706519  0.891762 -1.278873
2021-01-06  0.987927  1.177693 -0.741832  1.223762

注意:虽然标准的Python/Numpy的表达式能完成选择与赋值等功能,但我们仍推荐使用优化过的pandas数据访问方法:.at,.iat,.loc,.iloc和.ix

四、选取

(1)选择某一列数据,它会返回一个Series,等同于df.A

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.A)

运行结果如下:

2021-01-01   -0.111543
2021-01-02   -0.656968
2021-01-03   -0.688010
2021-01-04   -1.589676
2021-01-05   -0.678847
2021-01-06    2.115350
Freq: D, Name: A, dtype: float64

(2)通过使用[]进行切片选取

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df[0:3])
print(df['20210102':'20210104'])

运行结果如下:

                   A         B         C         D
2021-01-01 -0.047732  0.552092 -0.729498 -0.714394
2021-01-02  0.591364 -1.105802 -0.762140 -0.612312
2021-01-03 -0.065074 -0.839530 -1.497781  0.126298A         B         C         D
2021-01-02  0.738849 -1.043999 -0.521313 -0.224035
2021-01-03  0.111772 -1.778993  2.102982  0.245293
2021-01-04  0.715842  0.664216  0.229961 -1.134740

五、通过标签选取

(1)通过标签进行交叉选取

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# 通过标签进行交叉选取
print(df.loc[dates[0]])

运行结果如下:

A    0.419647
B   -0.213496
C   -0.247529
D   -1.832256
Name: 2021-01-01 00:00:00, dtype: float64

(2)使用标签对多个轴进行选取

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
#  使用标签对多个轴进行选取
print(df.loc[:, ['A', 'B']])
print(df.loc[:, ['A', 'B']][:3])

运行结果如下:

                   A         B
2021-01-01  2.682915 -0.914341
2021-01-02  0.583982  0.282933
2021-01-03 -0.191259  0.195227
2021-01-04 -1.560690  0.035329
2021-01-05 -1.130526  2.553366
2021-01-06 -0.021148  0.385572A         B
2021-01-01  2.682915 -0.914341
2021-01-02  0.583982  0.282933
2021-01-03 -0.191259  0.195227

(3)进行标签切片,包含两个端点

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# 进行标签切片,包含两个端点
print(df.loc['20210102':'20210104', ['A', 'B']])

运行结果如下:

                   A         B
2021-01-02  0.583693 -1.117799
2021-01-03  1.105072 -1.793949
2021-01-04 -1.167001 -0.817904

(4)对返回的对象进行降维处理

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# 对返回的对象进行降维处理
print(df.loc['20210102', ['A', 'B']])

运行结果如下:

A   -0.778794
B   -0.015910
Name: 2021-01-02 00:00:00, dtype: float64

(5)获取一个标量

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# 获取一个标量
print(df.loc[dates[0], 'A'])

运行结果如下:

1.428208723016515

(6)快速获取标量(与上面方法等价)

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# 快速获取标量(与上面方法等价)
print(df.at[dates[0], 'A'])

运行结果如下:

1.428208723016515

六、通过位置选取

(1)通过传递整型的位置进行选取

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iloc[3])

运行结果如下:

A    0.935744
B    0.460515
C   -0.636717
D    0.918826
Name: 2021-01-04 00:00:00, dtype: float64

(2)通过整型的位置切片进行选取(与python/numpy形式相同)

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iloc[3:5, 0:2])

运行结果如下:

                   A         B
2021-01-04 -0.731813 -0.007271
2021-01-05 -0.098682 -1.033287

(3)只对行进行切片

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iloc[1:3, :])

运行结果如下:

                   A         B         C         D
2021-01-02  1.725870  0.316616 -0.226371  2.271909
2021-01-03 -0.701184 -0.101915  1.670719 -1.069785

(4)只对列进行切片

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iloc[:, 1:3])

运行结果如下:

                   B         C
2021-01-01 -0.344991  0.714762
2021-01-02  0.756099 -0.716836
2021-01-03 -0.253883  1.408437
2021-01-04  0.617495 -0.370847
2021-01-05  0.361932 -0.149773
2021-01-06 -0.203682 -1.166916

(5)只获取某个值

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iloc[1, 1])

运行结果如下:

0.5140577371616526

(6)快速获取某个值(与上面方法等价)

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iat[1, 1])

运行结果如下:

-0.8025050449303489

七、布尔索引

(1)用某列的值来选取数据

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df[df.A > 0])

运行结果如下:

                   A         B         C         D
2021-01-01  1.246744 -0.178323  1.584207  0.451347
2021-01-02  1.119580 -0.278993 -0.975688  0.857890
2021-01-04  0.128081  0.126333 -0.413096  1.912839
2021-01-05  0.315206 -0.997872 -0.315139 -1.187635

(2)where操作来选取数据

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df[df > 0])

运行结果如下:

                   A         B         C         D
2021-01-01  1.050739       NaN       NaN  0.527370
2021-01-02       NaN  0.946090  0.267921  1.673618
2021-01-03       NaN       NaN       NaN       NaN
2021-01-04       NaN  0.643557  0.372513       NaN
2021-01-05       NaN       NaN       NaN  0.808884
2021-01-06       NaN  0.721053       NaN  0.522357

(3)用isin()方法来过滤数据

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
print(df2)
print(df2[df2['E'].isin(['two', 'four'])])

运行结果如下:

                   A         B         C         D      E
2021-01-01 -0.574905 -0.003417 -0.344360 -0.119831    one
2021-01-02 -0.324021  0.954608 -0.596665  0.827242    one
2021-01-03  1.216822 -0.479907  0.721729 -1.394054    two
2021-01-04  1.337284 -0.526787 -0.346786  2.736462  three
2021-01-05 -0.292888 -0.177181  0.113743 -0.606479   four
2021-01-06 -0.117398  0.664194  0.301029  1.171757  threeA         B         C         D     E
2021-01-03  1.216822 -0.479907  0.721729 -1.394054   two
2021-01-05 -0.292888 -0.177181  0.113743 -0.606479  four

八、赋值

(1)赋值一个新的列,通过索引来自动对齐数据

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
print(s1)
df['F'] = s1
print(df)

运行结果如下:

2021-01-02    1
2021-01-03    2
2021-01-04    3
2021-01-05    4
2021-01-06    5
2021-01-07    6
Freq: D, dtype: int64A         B         C         D    F
2021-01-01 -1.298911 -0.133224 -0.557085 -0.142379  NaN
2021-01-02  0.238531  0.433289  1.494014 -0.588631  1.0
2021-01-03  2.522138  0.688398  0.767005  0.123376  2.0
2021-01-04  0.678053  0.140091  1.117512 -0.555320  3.0
2021-01-05 -0.447904  0.353646 -1.198465 -1.003590  4.0
2021-01-06 -0.861330 -0.812971  1.317353 -0.978052  5.0Process finished with exit code 0

(2)通过标签赋值

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df.at[dates[0], 'A'] = 0
print(df)

运行结果如下:

                   A         B         C         D
2021-01-01  0.000000  1.702900 -0.020916 -0.617243
2021-01-02 -0.544483  1.276033 -1.070828  0.416703
2021-01-03  1.214969  1.411715 -0.200606  2.133288
2021-01-04  0.710090 -0.290432  0.243515  0.356134
2021-01-05 -0.868281 -0.043208  0.436506  1.252045
2021-01-06 -0.040620 -0.559917  0.083952  1.074859

(3)通过位置赋值

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df.iat[0, 1] = 0
print(df)

运行结果如下:

                   A         B         C         D
2021-01-01 -0.995703  0.000000 -0.497130  0.292860
2021-01-02 -0.370458 -0.450762  1.235836 -1.117611
2021-01-03  0.563000 -0.529552  1.012462  0.351527
2021-01-04  0.002556  2.456097 -1.275803  0.243018
2021-01-05  0.958823 -1.869412  0.638924 -0.468291
2021-01-06 -0.975528 -0.271083  0.245019  0.922966

(4)通过传递numpy array赋值

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df.loc[:, 'D'] = np.array([5] * len(df))
print(df)

运行结果如下:

                   A         B         C  D
2021-01-01 -1.304089 -0.376122 -0.108818  5
2021-01-02  0.557907  0.666416 -1.335505  5
2021-01-03  1.312906 -0.920788  0.217328  5
2021-01-04 -1.191590  0.643327 -0.572647  5
2021-01-05 -1.114065 -1.957133 -0.254868  5
2021-01-06  1.881592  2.020586 -0.368924  5

(5)通过where操作来赋值

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df2 = df.copy()
df2[df2 > 0] = -df2
print(df2)

运行结果如下:

                   A         B         C         D
2021-01-01 -0.489553 -1.543897 -0.678892 -0.487907
2021-01-02 -0.273650 -1.775763 -0.627094 -0.361745
2021-01-03 -0.131556 -1.010881 -0.653446 -0.996312
2021-01-04 -1.030592 -0.961013 -2.088085 -0.275543
2021-01-05 -0.219124 -1.220296 -1.156944 -0.015766
2021-01-06 -0.134167 -1.470275 -0.892858 -0.575739

九、缺失值处理

在pandas中,用np.nan来代表缺失值,这些值默认不会参与运算。

(1)reindex()允许你修改、增加、删除指定轴上的索引,并返回一个数据副本。

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1df1 = df.reindex(index=dates[0:4], columns = list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
print(df1)

运行结果如下:

                   A         B         C         D    F    E
2021-01-01  0.544711 -0.053116 -0.226346 -0.763461  NaN  1.0
2021-01-02  1.717452  0.819771 -0.601411  0.108737  1.0  1.0
2021-01-03 -1.342919  0.032636  1.850492  1.482909  2.0  NaN
2021-01-04  0.613216 -0.637186 -0.888018 -0.387602  3.0  NaN

(2)剔除所有包含缺失值的行数据

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1df1 = df.reindex(index=dates[0:4], columns = list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
# 剔除含有缺失值数据
print(df1.dropna(how='any'))

运行结果如下:

                   A        B         C         D    F    E
2021-01-02 -0.125127 -0.14816 -0.491284 -0.777581  1.0  1.0

(3)填充缺失值

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1df1 = df.reindex(index=dates[0:4], columns = list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
# 填充缺失值
print(df1.fillna(value=5))

运行结果如下:

                   A         B         C         D    F    E
2021-01-01 -2.382767  0.500487 -0.217522 -1.805155  5.0  1.0
2021-01-02 -1.183837 -0.391934  1.215758 -1.513532  1.0  1.0
2021-01-03  1.588544  0.169653 -0.299395 -0.884112  2.0  5.0
2021-01-04  2.493201 -0.059129 -1.738894 -1.887012  3.0  5.0

(4)获取值为nan的布尔标记

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1df1 = df.reindex(index=dates[0:4], columns = list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1print(pd.isnull(df1))

运行结果如下:

                A      B      C      D      F      E
2021-01-01  False  False  False  False   True  False
2021-01-02  False  False  False  False  False  False
2021-01-03  False  False  False  False  False   True
2021-01-04  False  False  False  False  False   True

十、运算与统计

运算过程中,通常不包含缺失值。

(1)进行描述性统计

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1print(df.mean())

运行结果如下:

A   -0.406563
B   -0.464531
C   -0.649678
D    0.081891
F    3.000000
dtype: float64

(2)对其他轴进行同样的运算

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1print(df.mean(1))

运行结果如下:

2021-01-01   -0.000218
2021-01-02   -0.192326
2021-01-03    1.105851
2021-01-04    0.646850
2021-01-05    0.424485
2021-01-06    0.947733
Freq: D, dtype: float64

(3)对于拥有不同维度的对象进行运算时需要对齐。除此之外,pandas会自动沿着指定维度计算

代码如下:

dates = pd.date_range('20210101', periods=6)
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
print(s)
print(df.sub(s, axis='index'))
print(df.mean())

运行结果如下:

2021-01-01    NaN
2021-01-02    NaN
2021-01-03    1.0
2021-01-04    3.0
2021-01-05    5.0
2021-01-06    NaN
Freq: D, dtype: float64A         B         C         D    F
2021-01-01       NaN       NaN       NaN       NaN  NaN
2021-01-02       NaN       NaN       NaN       NaN  NaN
2021-01-03 -1.256636 -3.067793 -1.168880 -1.753398  1.0
2021-01-04 -2.012296 -2.662295 -3.183522 -3.437285  0.0
2021-01-05 -8.293072 -4.674037 -4.800625 -4.896442 -1.0
2021-01-06       NaN       NaN       NaN       NaN  NaNProcess finished with exit code 0

十一、Apply函数的作用

(1)通过apply()对函数作用

代码如下:

dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1
print(df.apply(np.cumsum))
print("---------------------")
print(df.apply(lambda x:x.max()-x.min()))

运行结果如下:

                   A         B         C         D     F
2021-01-01 -2.311017  1.587081  1.588087 -1.099827   NaN
2021-01-02 -1.833219  1.234021  3.044834 -0.172339   1.0
2021-01-03 -1.721813  0.638080  3.887059  0.134763   3.0
2021-01-04 -1.685430  0.846965  5.226440  0.207427   6.0
2021-01-05 -0.877456  0.032083  5.021183  0.264057  10.0
2021-01-06 -2.354379 -0.559002  3.641077  0.236156  15.0
---------------------
A    3.118991
B    2.401963
C    2.968192
D    2.027316
F    4.000000
dtype: float64Process finished with exit code 0

十二、频数统计

代码如下:

s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
print(s.value_counts())

运行结果如下:

0    0
1    4
2    5
3    2
4    3
5    0
6    2
7    5
8    2
9    2
dtype: int32
2    4
5    2
0    2
4    1
3    1
dtype: int64Process finished with exit code 0

十三、字符串方法

对于Series对象,在其str属性中有着一系列的字符串处理方法。就如同下段代码一样,能很方便的对array中各个元素进行运算。值得注意的是,在str属性中的模式匹配默认使用正则表达式。

代码如下:

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CACA', 'dog', 'cat'])
s.str.lower
print(s)

运行结果如下:

0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CACA
7     dog
8     cat
dtype: object

十四、合并

(1)Concat连接
pandas中提供了大量的方法能够轻松对Series,DataFrame和Panel对象进行不同满足逻辑关系的合并操作。
通过concat()来连接pandas对象:

代码如下:

df = pd.DataFrame(np.random.randn(10, 4))
print(df)
print("---------------")
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
print(pieces)
print("---------------")
print(pd.concat(pieces))

运行结果如下:

          0         1         2         3
0 -0.434515 -0.573066  0.862445  1.036136
1  0.667840 -0.097729 -1.652916 -0.722389
2 -0.220746 -1.049863 -0.114852 -2.581321
3  0.384615  0.639548 -0.739006  0.481056
4 -1.523937  0.690972  1.291165  0.290739
5  0.231251  0.803626 -1.693163  0.256039
6 -0.114371  0.519657  0.231674  0.456883
7 -1.121592 -0.430156 -0.563986  0.168413
8 -0.465606  0.476165 -1.314072  0.196124
9  0.243630  0.865871 -0.645785  0.753181
---------------
[          0         1         2         3
0 -0.434515 -0.573066  0.862445  1.036136
1  0.667840 -0.097729 -1.652916 -0.722389
2 -0.220746 -1.049863 -0.114852 -2.581321,           0         1         2         3
3  0.384615  0.639548 -0.739006  0.481056
4 -1.523937  0.690972  1.291165  0.290739
5  0.231251  0.803626 -1.693163  0.256039
6 -0.114371  0.519657  0.231674  0.456883,           0         1         2         3
7 -1.121592 -0.430156 -0.563986  0.168413
8 -0.465606  0.476165 -1.314072  0.196124
9  0.243630  0.865871 -0.645785  0.753181]
---------------0         1         2         3
0 -0.434515 -0.573066  0.862445  1.036136
1  0.667840 -0.097729 -1.652916 -0.722389
2 -0.220746 -1.049863 -0.114852 -2.581321
3  0.384615  0.639548 -0.739006  0.481056
4 -1.523937  0.690972  1.291165  0.290739
5  0.231251  0.803626 -1.693163  0.256039
6 -0.114371  0.519657  0.231674  0.456883
7 -1.121592 -0.430156 -0.563986  0.168413
8 -0.465606  0.476165 -1.314072  0.196124
9  0.243630  0.865871 -0.645785  0.753181Process finished with exit code 0

(2)Join合并
类似于SQL中的合并(merge)

代码如下:

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
print(left)
print("---------------------")
right = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [4, 5]})
print(right)
print("---------------------")
print(pd.merge(left, right, on='key'))

运行结果如下:

   key  lval
0  foo     1
1  foo     2
---------------------key  lval
0  foo     4
1  foo     5
---------------------key  lval_x  lval_y
0  foo       1       4
1  foo       1       5
2  foo       2       4
3  foo       2       5Process finished with exit code 0

(3)Append添加
将若干行添加到dataFrame后面。

代码如下:

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
print(df)
print("-----------------")
s = df.iloc[3]
print(s)
print("-----------------")
print(df.append(s, ignore_index=True))

运行结果如下:

          A         B         C         D
0  0.774301 -0.146980 -0.867190  0.804019
1  0.504305  1.186497 -0.281873  1.243404
2  1.369683  0.805037  0.231694  0.392675
3 -0.200875  0.330411 -0.478353 -0.740152
4  1.042464 -0.138162 -1.513976 -0.666396
5  0.132588 -0.187199 -1.451298  0.983176
6  1.677020 -1.505520  0.314352  0.467116
7  0.926760 -2.036741 -0.182761 -0.167417
-----------------
A   -0.200875
B    0.330411
C   -0.478353
D   -0.740152
Name: 3, dtype: float64
-----------------A         B         C         D
0  0.774301 -0.146980 -0.867190  0.804019
1  0.504305  1.186497 -0.281873  1.243404
2  1.369683  0.805037  0.231694  0.392675
3 -0.200875  0.330411 -0.478353 -0.740152
4  1.042464 -0.138162 -1.513976 -0.666396
5  0.132588 -0.187199 -1.451298  0.983176
6  1.677020 -1.505520  0.314352  0.467116
7  0.926760 -2.036741 -0.182761 -0.167417
8 -0.200875  0.330411 -0.478353 -0.740152Process finished with exit code 0

十五、分组

对于“group by”操作,我们通常是指以下一个或几个步骤:

  • 划分 按照某些标准将数据分为不同的组
  • 应用 对每组数据分别执行一个函数
  • 组合 将结果组合到一个数据结构

代码如下:

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'bar'],'B': ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],'C': np.random.randn(8),'D': np.random.randn(8)})
print(df)
print("------------------")
print(df.groupby('A').sum())    # 分组并对每个分组应用sum函数
print("------------------")
print(df.groupby(['A', 'B']).sum())    # 按多个列分组形成层级索引,然后应用函数

运行结果如下:

     A      B         C         D
0  foo    one -0.521933  0.030400
1  bar    one  2.046228  0.611504
2  foo    two -0.650801  1.682347
3  bar  three -0.121637 -1.130325
4  foo    two  0.040135 -0.495454
5  bar    two  1.736218 -0.774311
6  foo    one  0.081882  0.691103
7  bar  three -0.612624 -1.388700
------------------C         D
A
bar  3.048185 -2.681832
foo -1.050718  1.908396
------------------C         D
A   B
bar one    2.046228  0.611504three -0.734261 -2.519025two    1.736218 -0.774311
foo one   -0.440052  0.721503two   -0.610666  1.186893Process finished with exit code 0

十六、变形之堆叠

(1)堆叠

代码如下:

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
print(df2)

运行结果如下:

                     A         B
first second
bar   one    -0.127506  0.941748two    -0.565635  0.251350
baz   one     0.077156 -1.003484two    -0.412658 -0.557502Process finished with exit code 0

(2)stack()方法对DataFrame的列“压缩”一个层级

代码如下:

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
stacked = df2.stack()
print(stacked)

运行结果如下:

                     A         B
first  second
bar    one     A    0.640794B    1.611218two     A   -1.731782B    0.997328
baz    one     A   -1.639688B    0.942692two     A    0.094491B   -0.364335
dtype: float64

(3)对于一个“堆叠过的”DataFrame或者Series(拥有MultiIndex作为索引),stack()的逆操作是unstack(),默认反堆叠到上一个层级

代码如下:

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
stacked = df2.stack()
print(stacked.unstack())
print("-----------------------------")
print(stacked.unstack(1))    # 反堆叠第1列
print("-----------------------------")
print(stacked.unstack(0))    # 反堆叠第0列

运行结果如下:

                     A         B
first second
bar   one    -1.220559  1.272748two     2.373165  1.359084
baz   one     0.594712  0.567112two    -0.870067  1.412194
-----------------------------
second        one       two
first
bar   A -1.220559  2.373165B  1.272748  1.359084
baz   A  0.594712 -0.870067B  0.567112  1.412194
-----------------------------
first          bar       baz
second
one    A -1.220559  0.594712B  1.272748  0.567112
two    A  2.373165 -0.870067B  1.359084  1.412194Process finished with exit code 0

十七、透视数据表

代码如下:

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,'B': ['A', 'B', 'C'] * 4,'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,'D': np.random.randn(12),'E': np.random.randn(12)})
print(df)
df1 = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
print("--------------------------------")
print(df1)

运行结果如下:

        A  B    C         D         E
0     one  A  foo  0.489705 -0.724392
1     one  B  foo  0.077034  0.942014
2     two  C  foo  1.915312 -0.641194
3   three  A  bar  0.275519 -0.014924
4     one  B  bar  1.239626 -1.418770
5     one  C  bar  0.468554  0.778672
6     two  A  foo -0.911088  0.411054
7   three  B  foo  0.728673 -0.941020
8     one  C  foo -0.090592 -1.599612
9     one  A  bar  0.279766  1.578581
10    two  B  bar  1.452452  0.117850
11  three  C  bar -0.839334  0.679560
--------------------------------
C             bar       foo
A     B
one   A  0.279766  0.489705B  1.239626  0.077034C  0.468554 -0.090592
three A  0.275519       NaNB       NaN  0.728673C -0.839334       NaN
two   A       NaN -0.911088B  1.452452       NaNC       NaN  1.915312Process finished with exit code 0

十八、时间序列

pandas在对频率转换进行重新采样时拥有着简单,强大而且高效的功能(例如把按秒采样的数据转换为按5分钟采样的数据)。这在金融领域很常见,但又不限于此。

(1)初识时间序列

代码如下:

rng = pd.date_range('1/1/2012', periods=100, freq='S')
print(rng)
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
print(ts)
print(ts.resample('5min').sum())

运行结果如下:

DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01','2012-01-01 00:00:02', '2012-01-01 00:00:03','2012-01-01 00:00:04', '2012-01-01 00:00:05','2012-01-01 00:00:06', '2012-01-01 00:00:07','2012-01-01 00:00:08', '2012-01-01 00:00:09','2012-01-01 00:00:10', '2012-01-01 00:00:11','2012-01-01 00:00:12', '2012-01-01 00:00:13','2012-01-01 00:00:14', '2012-01-01 00:00:15','2012-01-01 00:00:16', '2012-01-01 00:00:17','2012-01-01 00:00:18', '2012-01-01 00:00:19','2012-01-01 00:00:20', '2012-01-01 00:00:21','2012-01-01 00:00:22', '2012-01-01 00:00:23','2012-01-01 00:00:24', '2012-01-01 00:00:25','2012-01-01 00:00:26', '2012-01-01 00:00:27','2012-01-01 00:00:28', '2012-01-01 00:00:29','2012-01-01 00:00:30', '2012-01-01 00:00:31','2012-01-01 00:00:32', '2012-01-01 00:00:33','2012-01-01 00:00:34', '2012-01-01 00:00:35','2012-01-01 00:00:36', '2012-01-01 00:00:37','2012-01-01 00:00:38', '2012-01-01 00:00:39','2012-01-01 00:00:40', '2012-01-01 00:00:41','2012-01-01 00:00:42', '2012-01-01 00:00:43','2012-01-01 00:00:44', '2012-01-01 00:00:45','2012-01-01 00:00:46', '2012-01-01 00:00:47','2012-01-01 00:00:48', '2012-01-01 00:00:49','2012-01-01 00:00:50', '2012-01-01 00:00:51','2012-01-01 00:00:52', '2012-01-01 00:00:53','2012-01-01 00:00:54', '2012-01-01 00:00:55','2012-01-01 00:00:56', '2012-01-01 00:00:57','2012-01-01 00:00:58', '2012-01-01 00:00:59','2012-01-01 00:01:00', '2012-01-01 00:01:01','2012-01-01 00:01:02', '2012-01-01 00:01:03','2012-01-01 00:01:04', '2012-01-01 00:01:05','2012-01-01 00:01:06', '2012-01-01 00:01:07','2012-01-01 00:01:08', '2012-01-01 00:01:09','2012-01-01 00:01:10', '2012-01-01 00:01:11','2012-01-01 00:01:12', '2012-01-01 00:01:13','2012-01-01 00:01:14', '2012-01-01 00:01:15','2012-01-01 00:01:16', '2012-01-01 00:01:17','2012-01-01 00:01:18', '2012-01-01 00:01:19','2012-01-01 00:01:20', '2012-01-01 00:01:21','2012-01-01 00:01:22', '2012-01-01 00:01:23','2012-01-01 00:01:24', '2012-01-01 00:01:25','2012-01-01 00:01:26', '2012-01-01 00:01:27','2012-01-01 00:01:28', '2012-01-01 00:01:29','2012-01-01 00:01:30', '2012-01-01 00:01:31','2012-01-01 00:01:32', '2012-01-01 00:01:33','2012-01-01 00:01:34', '2012-01-01 00:01:35','2012-01-01 00:01:36', '2012-01-01 00:01:37','2012-01-01 00:01:38', '2012-01-01 00:01:39'],dtype='datetime64[ns]', freq='S')
2012-01-01 00:00:00    124
2012-01-01 00:00:01    231
2012-01-01 00:00:02    298
2012-01-01 00:00:03    398
2012-01-01 00:00:04    418...
2012-01-01 00:01:35    313
2012-01-01 00:01:36    157
2012-01-01 00:01:37    424
2012-01-01 00:01:38    105
2012-01-01 00:01:39     72
Freq: S, Length: 100, dtype: int32
2012-01-01    24430
Freq: 5T, dtype: int32Process finished with exit code 0

(2)时区表示

代码如下:

rng = pd.date_range('3/6/2012', periods=5, freq='D')
print(rng)
print("----------------------------")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)
print("----------------------------")
ts_utc = ts.tz_localize('UTC')
print(ts_utc)

运行结果如下:

DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09','2012-03-10'],dtype='datetime64[ns]', freq='D')
----------------------------
2012-03-06    0.083046
2012-03-07    1.300931
2012-03-08   -0.172009
2012-03-09   -0.500776
2012-03-10   -0.561864
Freq: D, dtype: float64
----------------------------
2012-03-06 00:00:00+00:00    0.083046
2012-03-07 00:00:00+00:00    1.300931
2012-03-08 00:00:00+00:00   -0.172009
2012-03-09 00:00:00+00:00   -0.500776
2012-03-10 00:00:00+00:00   -0.561864
Freq: D, dtype: float64Process finished with exit code 0

(3)时区转换

代码如下:

rng = pd.date_range('3/6/2012', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts_utc = ts.tz_localize('UTC')
print(ts_utc.tz_convert('US/Eastern'))

运行结果如下:

2012-03-05 19:00:00-05:00    0.375370
2012-03-06 19:00:00-05:00   -0.341717
2012-03-07 19:00:00-05:00    0.152345
2012-03-08 19:00:00-05:00    1.537487
2012-03-09 19:00:00-05:00   -1.145042
Freq: D, dtype: float64

(4)时间跨度转换

代码如下:

rng = pd.date_range('1/1/2012', periods=5, freq='M')
print(rng)
print("----------------------------")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)
print("----------------------------")
ps = ts.to_period()
print(ps)
print("----------------------------")
print(ps.to_timestamp())

运行结果如下:

DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30','2012-05-31'],dtype='datetime64[ns]', freq='M')
----------------------------
2012-01-31    1.161573
2012-02-29    1.481427
2012-03-31    1.681822
2012-04-30   -0.796045
2012-05-31   -0.214463
Freq: M, dtype: float64
----------------------------
2012-01    1.161573
2012-02    1.481427
2012-03    1.681822
2012-04   -0.796045
2012-05   -0.214463
Freq: M, dtype: float64
----------------------------
2012-01-01    1.161573
2012-02-01    1.481427
2012-03-01    1.681822
2012-04-01   -0.796045
2012-05-01   -0.214463
Freq: MS, dtype: float64Process finished with exit code 0

(5)日期与时间戳之间的转换使得可以使用一些方便的算术函数。
例如,我们把以11月为年底的季度数据转换为当前季度末月底为始的数据

代码如下:

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
print(prng)
print("-----------------------------")
ts = pd.Series(np.random.randn(len(prng)), index=prng)
print(ts)
print("-----------------------------")
ts.index = (prng.asfreq('M', 'end')).asfreq('H', 'start') + 9
print(ts)

运行结果如下:

PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2','1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4','1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2','1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4','1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2','1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4','1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2','2000Q3', '2000Q4'],dtype='period[Q-NOV]', freq='Q-NOV')
-----------------------------
1990Q1   -1.137006
1990Q2   -0.623052
1990Q3   -0.034055
1990Q4    0.045069
1991Q1    0.394846
1991Q2    1.194376
1991Q3   -0.808152
1991Q4    2.840198
1992Q1   -0.389149
1992Q2   -1.915210
1992Q3   -2.631764
1992Q4    0.297902
1993Q1   -0.819829
1993Q2   -0.065494
1993Q3   -1.171204
1993Q4    1.774212
1994Q1    1.735605
1994Q2    1.101451
1994Q3   -2.097832
1994Q4   -1.519787
1995Q1   -0.592369
1995Q2   -0.069788
1995Q3   -0.111981
1995Q4   -0.725699
1996Q1   -0.258395
1996Q2   -0.144076
1996Q3   -0.312234
1996Q4   -0.205665
1997Q1   -1.190604
1997Q2    0.849108
1997Q3    0.666772
1997Q4    0.507039
1998Q1    0.603365
1998Q2    0.954041
1998Q3   -0.856542
1998Q4   -0.353008
1999Q1   -0.215423
1999Q2    0.127024
1999Q3    1.137850
1999Q4    0.879086
2000Q1   -0.241292
2000Q2    1.918176
2000Q3    0.900579
2000Q4    1.366803
Freq: Q-NOV, dtype: float64
-----------------------------
1990-02-01 09:00   -1.137006
1990-05-01 09:00   -0.623052
1990-08-01 09:00   -0.034055
1990-11-01 09:00    0.045069
1991-02-01 09:00    0.394846
1991-05-01 09:00    1.194376
1991-08-01 09:00   -0.808152
1991-11-01 09:00    2.840198
1992-02-01 09:00   -0.389149
1992-05-01 09:00   -1.915210
1992-08-01 09:00   -2.631764
1992-11-01 09:00    0.297902
1993-02-01 09:00   -0.819829
1993-05-01 09:00   -0.065494
1993-08-01 09:00   -1.171204
1993-11-01 09:00    1.774212
1994-02-01 09:00    1.735605
1994-05-01 09:00    1.101451
1994-08-01 09:00   -2.097832
1994-11-01 09:00   -1.519787
1995-02-01 09:00   -0.592369
1995-05-01 09:00   -0.069788
1995-08-01 09:00   -0.111981
1995-11-01 09:00   -0.725699
1996-02-01 09:00   -0.258395
1996-05-01 09:00   -0.144076
1996-08-01 09:00   -0.312234
1996-11-01 09:00   -0.205665
1997-02-01 09:00   -1.190604
1997-05-01 09:00    0.849108
1997-08-01 09:00    0.666772
1997-11-01 09:00    0.507039
1998-02-01 09:00    0.603365
1998-05-01 09:00    0.954041
1998-08-01 09:00   -0.856542
1998-11-01 09:00   -0.353008
1999-02-01 09:00   -0.215423
1999-05-01 09:00    0.127024
1999-08-01 09:00    1.137850
1999-11-01 09:00    0.879086
2000-02-01 09:00   -0.241292
2000-05-01 09:00    1.918176
2000-08-01 09:00    0.900579
2000-11-01 09:00    1.366803
Freq: H, dtype: float64Process finished with exit code 0

十九、分类

从版本0.15开始,pandas在DataFrame中开始包括分类数据

(1)把raw_grade转换为分类类型

代码如下:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'e', 'e']})
print(df)
df["grade"] = df["raw_grade"].astype("category")
print("---------------------------")
print(df["grade"])

输出结果如下:

   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         e
5   6         e
---------------------------
0    a
1    b
2    b
3    a
4    e
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']Process finished with exit code 0

(2)重命名类别名为更有意义的名称并对分类重新排序,并添加缺失的分类

代码如下:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'e', 'e']})
print(df)
df["grade"] = df["raw_grade"].astype("category")
print("---------------------------")
# 重命名类别名为更有意义的名称
df["grade"].cat.categories = ["very good", "good", "very bad"]
# 对分类重新排序,并添加缺失的分类
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
print(df["grade"])

输出结果如下:

   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         e
5   6         e
---------------------------
0    very good
1         good
2         good
3    very good
4     very bad
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']Process finished with exit code 0

(3)排序是按照分类的顺序进行的,而不是字典序

代码如下:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'e', 'e']})
print(df)
df["grade"] = df["raw_grade"].astype("category")
print("---------------------------")
# 重命名类别名为更有意义的名称
df["grade"].cat.categories = ["very good", "good", "very bad"]
# 对分类重新排序,并添加缺失的分类
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
print(df.sort_values(by="grade"))   # 按照分类排序

输出结果如下:

   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         e
5   6         e
---------------------------id raw_grade      grade
4   5         e   very bad
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very goodProcess finished with exit code 0

(4)按分类分组时,也会显示空的分类

代码如下:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'e', 'e']})
print(df)
df["grade"] = df["raw_grade"].astype("category")
print("---------------------------")
# 重命名类别名为更有意义的名称
df["grade"].cat.categories = ["very good", "good", "very bad"]
# 对分类重新排序,并添加缺失的分类
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
print(df.groupby("grade").size())   # 按照分类分组显示空的类

输出结果如下:

   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         e
5   6         e
---------------------------
grade
very bad     2
bad          0
medium       0
good         2
very good    2
dtype: int64Process finished with exit code 0

二十、绘图

(1)初识matplotlib

代码如下:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
plt.show()

输出结果如下:

(2)对于DataFrame类型,plot()能很方便地画出所有列及其标签

代码如下:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()
plt.legend(loc='best')
plt.show()

运行结果如下:

二十一、获取数据的I/O

(1)CSV数据

写入一个csv文件:

代码如下:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.to_csv('data/foo.csv')   # 写入一个csv文件

生成的csv文件如下:

从一个csv文件读入:

代码如下:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
print(pd.read_csv('foo.csv'))    # 读入一个csv文件

输出结果如下:

     Unnamed: 0         A          B          C          D
0    2000-01-01 -1.050730  -0.093286   0.845858  -0.613594
1    2000-01-02 -3.349290   0.123289  -0.665181  -0.066789
2    2000-01-03 -5.499037   1.363600  -2.832010   1.519423
3    2000-01-04 -4.594498   1.883442  -2.438111   0.488036
4    2000-01-05 -5.885108   2.386728  -1.897067  -0.489968
..          ...       ...        ...        ...        ...
995  2002-09-22 -0.637796  12.922019 -24.813859  29.094194
996  2002-09-23 -0.268030  12.872831 -24.495352  28.371192
997  2002-09-24 -0.429714  14.442154 -24.049543  28.734404
998  2002-09-25 -1.868194  12.465456 -22.799273  29.116129
999  2002-09-26 -2.172118  12.427707 -24.062128  28.882752[1000 rows x 5 columns]Process finished with exit code 0

(2)HDF5数据

写入一个HDF5 Store:

代码如下:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.to_hdf('foo.h5', 'df')     # 写入

生成文件如下:

从一个HDF5 Store读入:

代码如下:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.to_hdf('foo.h5', 'df')     # 写入
pd.read_hdf('foo.h5', 'df')    # 读入

输出结果如下:

                A        B          C            D
2000-01-01  0.544649    -0.048527   -0.069001   0.615770
2000-01-02  -1.375622   -1.001399   -0.461100   -0.314656
2000-01-03  -1.080360   -0.491494   0.058478    -0.263584
2000-01-04  -0.781192   -1.366070   0.713983    0.276506
2000-01-05  -1.726031   -2.226089   1.372384    -1.010998
2000-01-06  -1.029440   -1.116124   3.213516    -1.070904
2000-01-07  -1.132326   0.549860    4.047112    -2.089782
2000-01-08  -1.918885   -0.496898   5.011556    -0.398244
2000-01-09  -2.689141   -0.353073   5.508442    -1.777378
2000-01-10  -0.574410   0.424270    6.674230    -0.836982
2000-01-11  0.968438    0.130614    7.331227    -1.299968
2000-01-12  1.187288    0.447871    8.748361    -2.459602
2000-01-13  1.889565    0.109931    10.110426   -2.873714
2000-01-14  1.549227    0.892440    11.540149   -2.708039
2000-01-15  1.230412    3.760428    11.407844   -3.150406
2000-01-16  0.057691    4.539744    9.338813    -3.744172
2000-01-17  0.004047    4.752471    10.309765   -4.507723
2000-01-18  0.092892    4.859409    9.760544    -3.024875
2000-01-19  1.756680    5.184557    8.991681    -4.547709
2000-01-20  3.133134    5.517767    8.498777    -5.635272
2000-01-21  2.889005    6.491254    8.638766    -6.878007
2000-01-22  2.614277    4.514062    9.359361    -6.409828
2000-01-23  2.004339    5.711879    9.398218    -5.936106
2000-01-24  2.186803    4.852240    8.156034    -6.388658
2000-01-25  3.085273    4.543388    6.914151    -5.814296
2000-01-26  5.203147    4.600532    6.475757    -5.283672
2000-01-27  5.751381    4.626839    8.047942    -3.977368
2000-01-28  5.675581    3.608191    6.809387    -2.812447
2000-01-29  5.401486    2.936898    7.269270    -2.104369
2000-01-30  5.553712    5.005159    8.387554    -2.762008
... ... ... ... ...
2002-08-28  -47.209369  7.254643    8.048536    25.650071
2002-08-29  -46.758124  6.307750    8.852335    25.292648
2002-08-30  -45.177406  5.847630    8.134441    25.595963
2002-08-31  -44.555625  4.738374    9.103920    25.938850
2002-09-01  -44.594843  4.847349    7.607951    26.767106
2002-09-02  -45.468864  3.460726    7.441725    27.277645
2002-09-03  -48.126574  4.654244    5.223401    27.618957
2002-09-04  -47.503283  4.500056    6.162534    28.210921
2002-09-05  -47.770849  3.965948    6.850322    28.129603
2002-09-06  -47.103058  3.908913    7.081636    29.309787
2002-09-07  -48.252013  4.328563    8.561459    29.842983
2002-09-08  -48.335899  2.360573    8.865642    30.591404
2002-09-09  -46.875850  2.844337    7.152740    31.220225
2002-09-10  -47.242826  2.538062    6.462508    30.843580
2002-09-11  -47.881749  3.812996    6.520225    32.369875
2002-09-12  -46.864357  4.713924    6.569562    32.144355
2002-09-13  -45.546403  2.981736    8.046595    33.097245
2002-09-14  -44.470824  4.739932    7.934668    32.488292
2002-09-15  -44.206498  3.851915    6.901387    31.004478
2002-09-16  -45.192152  2.235635    7.017709    30.362812
2002-09-17  -45.775304  3.109701    5.925081    30.872055
2002-09-18  -45.652522  4.743547    4.843658    29.608422
2002-09-19  -47.494023  4.842967    3.590295    29.586813
2002-09-20  -47.098042  5.926378    4.235130    29.989704
2002-09-21  -48.317910  4.805615    5.094592    30.270280
2002-09-22  -48.432906  5.759228    4.651891    30.817247
2002-09-23  -47.112905  6.014631    4.202600    29.703626
2002-09-24  -46.085425  6.468472    3.649689    29.517390
2002-09-25  -44.384026  5.569878    3.598782    29.982023
2002-09-26  -42.653887  5.947389    4.319416    31.901769
1000 rows × 4 columns

(3)Excel数据

写入一个Excel文件:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()df.to_excel('foo.xlsx', sheet_name='Sheet1')    # 写入excel

注意:要导入相关依赖库和module。

生成excel文件如下:

从一个Excel文件读入:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()print(pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']))

运行结果如下:

    Unnamed: 0          A          B          C          D
0   2000-01-01   1.161650   0.711842   0.355790  -0.465520
1   2000-01-02   0.303399   0.266687   0.029220  -0.304709
2   2000-01-03   0.525579   1.834383  -1.003671   0.202279
3   2000-01-04   0.733337   3.045663  -2.245458   2.095680
4   2000-01-05   0.107569   3.380991  -4.067166   0.423934
..         ...        ...        ...        ...        ...
995 2002-09-22 -16.517889 -14.676465  18.354474  72.024858
996 2002-09-23 -16.714840 -13.742384  17.133618  72.441209
997 2002-09-24 -15.751055 -15.707861  17.496880  73.232321
998 2002-09-25 -14.977084 -16.827697  17.091075  72.890022
999 2002-09-26 -15.173852 -17.319843  16.150347  71.989576[1000 rows x 5 columns]Process finished with exit code 0

注意:
我写代码时报错:pandas无法打开.xlsx文件,xlrd.biffh.XLRDError: Excel xlsx file; not supported
原因:原因是最近xlrd更新到了2.0.1版本,只支持.xls文件。所以pandas.read_excel(‘xxx.xlsx’)会报错。可以安装旧版xlrd。
解决办法:在window10下,进入项目工程文件夹D:\dream\venv\Scripts(我的路径是这个)使用cmd进入该目录下安装。
pip install xlrd==1.2.0
或者也可以用openpyxl代替xlrd打开.xlsx文件:
df=pandas.read_excel(‘data.xlsx’,engine=‘openpyxl’)

后记

最后学习python推荐大家使用anaconda,它集成了ipython,jupyter notebook等,编写代码方便,我在写这题的时候用来pycharm,有些库用的发生了问题,而使用jupyter notebook却运行正常。好了,今天就到这里,明天继续学习。

Python数据分析pandas入门(一)------十分钟入门pandas相关推荐

  1. linux pandas教程_十分钟入门 Pandas

    # 十分钟入门 Pandas 本节是帮助 Pandas 新手快速上手的简介.烹饪指南里介绍了更多实用案例. 本节以下列方式导入 Pandas 与 NumPy: In [1]: import numpy ...

  2. 自学python编程免费教程-Python十分钟入门 自学python基础教程送你参考

    python十分钟入门.简介Python是一种动态解释型的编程语言.Python可以在Windows.UNIX.MAC等多种操作系统上使用,也可以在Java..NET开发平台上使用. 特点 1 Pyt ...

  3. 【Python】【进阶篇】十二、Python爬虫的Xpath简明教程(十分钟入门)

    目录 十二.Python爬虫的Xpath简明教程(十分钟入门) 12.1 Xpath表达式 12.2 Xpath节点 12.3 节点关系 12.4 Xpath基本语法 12.4.1 基本语法使用 12 ...

  4. Azure IoT Hub 十分钟入门系列 (2)- 使用模拟设备发送设备到云(d2c)的消息

    本文主要分享一个案例: 10分钟- 使用Python 示例代码和SDK向IoT Hub 发送遥测消息 本文主要有如下内容: 了解C2D/D2C消息: 了解IoT Hub中Device的概念 了解并下载 ...

  5. Python3快速入门(十四)——Pandas数据读取

    Python3快速入门(十四)--Pandas数据读取 一.DataFrame IO 1.CSV文件 pandas.read_csv(filepath_or_buffer, na_values='NA ...

  6. “易语言.飞扬”十分钟入门教程(修订版1,update for EF1.1.0)

    "易语言.飞扬"十分钟入门教程 (修订版1,update for EF1.1.0) 作者:liigo,2007.8.12 本文地址:http://blog.csdn.net/lii ...

  7. “易语言.飞扬”十分钟入门教程

    "易语言.飞扬"十分钟入门教程 作者:liigo 2007.1.1 原文链接:http://blog.csdn.net/liigo/archive/2007/01/01/14720 ...

  8. Azure IoT Hub 十分钟入门系列 (1)- 10分钟带你了解Azure IoT Hub 并创建IoT Hub

    建议您先对<Azure 上 IoT 整体解决方案概览 >进行了解. 本文主要分享一个案例: 10分钟-了解Azure IoT Hub并创建Azure IoT Hub 本文主要有如下内容: ...

  9. 快速入门:十分钟学会Python

    初试牛刀 假设你希望学习Python这门语言,却苦于找不到一个简短而全面的入门教程.那么本教程将花费十分钟的时间带你走入Python的大门.本文的内容介于教程(Toturial)和速查手册(Cheat ...

最新文章

  1. Python高级编程(二)
  2. 粒子群算法(PSO)Matlab实现(两种解法)
  3. 电商泛滥的时代,我们的出路在哪里?
  4. sublime代码片段
  5. JAVA学习日志(7-1-继承)
  6. Essential MSBuild: .NET 工具生成引擎概述
  7. Python——三级菜单
  8. docker run 与docker start的区别
  9. FreeMarker语言概述(1)
  10. LQR轨迹跟踪——基于ROS系统和全向车实验平台
  11. 华硕主板如何用u盘启动计算机,华硕主板怎么设置u盘启动
  12. java编译的类包含美元符号 $
  13. python中格式化输出是什么意思_Python中 {:.0f} 格式化输出,{0:^30}什么意思 . format(name))...
  14. 五种提前还款方式那种更划算
  15. oracle使用(五)表空间创建、删除以及删除后数据文件还存在的问题
  16. 王亮 中国科学院自动化研究所
  17. Python全栈工程师(30:html)
  18. 嵌入式学习⑩——STM的PWM和DAC
  19. 什么是白帽/黑帽SEO?一次性看懂
  20. 【Waves12】waves安装教程

热门文章

  1. 耿建超英语语法---定语从句
  2. 全文检索系统技术架构及流程说明
  3. 根据视频地址获取视频的第一帧画面做为封面 IllegalArgumentException
  4. 儿童使用显微镜有好处吗?
  5. 如何用 IT 业者能听懂的话介绍量子计算的原理?
  6. Spring Boot + Spring Security + JWT + 微信小程序登录
  7. [从头读历史] 第289节 神之物语 忒修斯的故事
  8. SPSS多元线性回归结果分析
  9. 多元线性回归分析spss结果解读_SPSS--回归-多元线性回归模型案例解析
  10. 【定量分析、量化金融与统计学】多元回归模型与回归推理