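Python Data Analysis: Getting Started with pandas (Part 1) ------ 10 Minutes to pandas
Python Data Analysis Basics
- 1. Importing common libraries
- 2. Creating objects
- 3. Viewing data
- 4. Selection
- 5. Selection by label
- 6. Selection by position
- 7. Boolean indexing
- 8. Setting values
- 9. Handling missing data
- 10. Operations and statistics
- 11. Applying functions
- 12. Frequency counts
- 13. String methods
- 14. Merging
- 15. Grouping
- 16. Reshaping: stacking
- 17. Pivot tables
- 18. Time series
- 19. Categoricals
- 20. Plotting
- 21. Getting data in and out (I/O)
- Afterword
1. Importing common libraries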
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. Creating objects
(1) Create a Series by passing a list; pandas creates a default integer index.
Code:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
Output:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
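Note: A Series is a one-dimensional, array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (the index). A simple Series can also be created from the data alone. Index values in a Series are allowed to repeat.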
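The point about repeated index values can be checked with a small sketch. The labels and values here are made up for illustration and are not from the original post.

```python
import pandas as pd

# A Series whose index contains the label "a" twice (illustrative data).
s = pd.Series([10, 20, 30], index=["a", "b", "a"])

is_unique = s.index.is_unique  # False, because "a" appears twice
dup = s.loc["a"]               # a repeated label returns a Series of all matches
single = s.loc["b"]            # a unique label returns a scalar

print(is_unique)
print(dup)
print(single)
```

Selecting a repeated label returns every matching row, so code that expects a scalar should check `s.index.is_unique` first.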
(2) Create a DataFrame by passing a NumPy array, a datetime index, and column labels.
Code:
dates = pd.date_range('20210101', periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)
Output:
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06'], dtype='datetime64[ns]', freq='D')
A B C D
2021-01-01 0.332594 1.151379 0.155028 1.914757
2021-01-02 1.681969 -1.157997 0.340231 2.054462
2021-01-03 -1.351239 -0.636244 -1.670505 -0.577896
2021-01-04 1.091745 -0.265513 -0.585511 0.462430
2021-01-05 -0.945369 -0.416703 -0.833535 0.446753
2021-01-06 0.712920 0.223502 -1.158448 0.046709
(3) Create a DataFrame by passing a dict of objects that can be converted to Series-like structures.
Code:
df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20200629'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo',
})
print(df2)
print(df2.dtypes)
Output:
A B C D E F
0 1.0 2020-06-29 1.0 3 test foo
1 1.0 2020-06-29 1.0 3 train foo
2 1.0 2020-06-29 1.0 3 test foo
3 1.0 2020-06-29 1.0 3 train foo
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
3. Viewing data
(1) View the first and last few rows of the frame.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.head())  # first rows (5 by default)
print(df.tail(3))  # last 3 rows
Output:
A B C D
2021-01-01 -0.854651 -0.610823 -0.167534 0.792160
2021-01-02 0.493142 0.580007 0.204097 -0.461438
2021-01-03 0.281382 -1.412539 3.594873 -0.130037
2021-01-04 -0.020957 0.013987 -2.404149 -0.277812
2021-01-05 -1.464734 0.144639 -0.667339 0.917941
A B C D
2021-01-04 -0.020957 0.013987 -2.404149 -0.277812
2021-01-05 -1.464734 0.144639 -0.667339 0.917941
2021-01-06 1.813891 -1.379392 -1.490363 -0.954958
(2) Display the index, the column names, and the underlying NumPy data.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.index)  # the index
print(df.columns)  # the column names
print(df.values)  # the underlying NumPy data
Output:
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06'], dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[ 0.54760061 -1.52821104 0.98156267 -1.3086871 ]
 [ 1.08947302 -0.27957116 -0.99159702 0.56656625]
 [-0.56661193 0.73175369 0.84474106 -0.01924194]
 [ 0.40981963 -1.23025219 1.31332923 1.16469658]
 [-0.0996665 0.42960539 0.15250292 0.5774405 ]
 [-0.05484658 0.70352716 0.88048923 1.0268004 ]]
(3) describe() gives a quick statistical summary of the data.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.describe())  # quick statistical summary
Output:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.037825 -0.684544 0.074748 -0.393371
std 1.221381 0.651434 0.445844 1.016243
min -1.638185 -1.327833 -0.581986 -1.639139
25% -0.866597 -1.061551 -0.204719 -1.129647
50% -0.116033 -0.838348 0.128008 -0.347528
75% 0.977540 -0.567642 0.455081 0.123801
max 1.418020 0.510622 0.525979 1.083412
(4) Transpose the data.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.T)  # transpose rows and columns
Output:
2021-01-01 2021-01-02 2021-01-03 2021-01-04 2021-01-05 2021-01-06
A -0.924087 1.036039 2.380802 0.396621 -1.344227 -0.060524
B -1.223318 1.818642 -1.659037 0.440670 -2.068355 0.660393
C 0.028661 -0.739622 -0.702494 0.767255 -0.027886 0.712692
D -0.352344 1.421342 -0.915466 0.192375 1.665294 -0.865071
(5) Sort by an axis.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.sort_index(axis=1, ascending=False))  # sort the column labels in descending order
Output:
D C B A
2021-01-01 0.147621 0.372596 0.403472 -1.462936
2021-01-02 -0.433823 -0.649770 -1.840609 -0.191425
2021-01-03 0.279578 1.917370 0.931369 -0.226179
2021-01-04 1.420825 1.596096 -1.250926 0.597007
2021-01-05 0.928866 0.465932 -1.089402 -0.060359
2021-01-06 0.175272 -0.152957 0.535680 1.290633
(6) Sort by values.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.sort_values(by='B'))  # sort the rows by the values in column B
Output:
A B C D
2021-01-02 1.103506 -0.595645 0.666151 1.309689
2021-01-03 0.021516 -0.091451 0.024281 -0.598654
2021-01-01 -1.565367 0.163802 0.425172 -2.247528
2021-01-05 3.003356 0.336145 -1.738533 -0.084639
2021-01-04 0.699287 0.706519 0.891762 -1.278873
2021-01-06 0.987927 1.177693 -0.741832 1.223762
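Note: Standard Python/NumPy expressions for selecting and setting values work, but the optimized pandas data access methods .at, .iat, .loc, and .iloc are recommended. The older .ix accessor was deprecated and has been removed since pandas 1.0, so it should no longer be used.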
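As a rough sketch of how the four recommended accessors relate, the snippet below reads the same cell in four ways. The small deterministic frame is an illustrative assumption, not data from the original post.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=3)
# Deterministic values 0.0 .. 5.0 so the result is easy to check.
df = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=dates, columns=["A", "B"])

by_label = df.loc[dates[1], "B"]    # label-based selection
by_position = df.iloc[1, 1]         # integer-position selection
fast_label = df.at[dates[1], "B"]   # scalar-only access by label
fast_position = df.iat[1, 1]        # scalar-only access by position

print(by_label, by_position, fast_label, fast_position)
```

.loc and .iloc also accept slices and lists, while .at and .iat only handle a single cell, which is why they are the faster choice for scalar lookups.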
4. Selection
(1) Select a single column, which returns a Series; df.A is equivalent to df['A'].
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.A)
Output:
2021-01-01 -0.111543
2021-01-02 -0.656968
2021-01-03 -0.688010
2021-01-04 -1.589676
2021-01-05 -0.678847
2021-01-06 2.115350
Freq: D, Name: A, dtype: float64
(2) Select rows by slicing with [].
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df[0:3])
print(df['20210102':'20210104'])
Output:
A B C D
2021-01-01 -0.047732 0.552092 -0.729498 -0.714394
2021-01-02 0.591364 -1.105802 -0.762140 -0.612312
2021-01-03 -0.065074 -0.839530 -1.497781 0.126298
A B C D
2021-01-02 0.738849 -1.043999 -0.521313 -0.224035
2021-01-03 0.111772 -1.778993 2.102982 0.245293
2021-01-04 0.715842 0.664216 0.229961 -1.134740
5. Selection by label
(1) Get a cross section using a label.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# Select the row whose label is the first date
print(df.loc[dates[0]])
Output:
A 0.419647
B -0.213496
C -0.247529
D -1.832256
Name: 2021-01-01 00:00:00, dtype: float64
(2) Select on multiple axes by label.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# All rows, columns A and B; the second print keeps only the first three rows
print(df.loc[:, ['A', 'B']])
print(df.loc[:, ['A', 'B']][:3])
Output:
A B
2021-01-01 2.682915 -0.914341
2021-01-02 0.583982 0.282933
2021-01-03 -0.191259 0.195227
2021-01-04 -1.560690 0.035329
2021-01-05 -1.130526 2.553366
2021-01-06 -0.021148 0.385572
A B
2021-01-01 2.682915 -0.914341
2021-01-02 0.583982 0.282933
2021-01-03 -0.191259 0.195227
(3) Slice by label; both endpoints are included.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# Label slicing includes both endpoints
print(df.loc['20210102':'20210104', ['A', 'B']])
Output:
A B
2021-01-02 0.583693 -1.117799
2021-01-03 1.105072 -1.793949
2021-01-04 -1.167001 -0.817904
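The inclusive-endpoint behaviour of label slicing is easiest to see next to positional slicing. The following is a minimal sketch using a deterministic frame chosen for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slice: both '20210102' and '20210104' are included.
label_slice = df.loc["20210102":"20210104"]
# Position slice: follows Python rules, so the stop position 3 is excluded.
position_slice = df.iloc[1:3]

print(len(label_slice), len(position_slice))
```

The label slice returns three rows and the position slice returns two, even though both start at the second row.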
(4) Reduce the dimensionality of the returned object.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# A single row label returns a Series instead of a DataFrame
print(df.loc['20210102', ['A', 'B']])
Output:
A -0.778794
B -0.015910
Name: 2021-01-02 00:00:00, dtype: float64
(5) Get a scalar value.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# Get a scalar value
print(df.loc[dates[0], 'A'])
Output:
1.428208723016515
(6) Get a scalar value quickly (equivalent to the method above).
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# Fast scalar access (equivalent to the method above)
print(df.at[dates[0], 'A'])
Output:
1.428208723016515
6. Selection by position
(1) Select by passing an integer position.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iloc[3])
Output:
A 0.935744
B 0.460515
C -0.636717
D 0.918826
Name: 2021-01-04 00:00:00, dtype: float64
(2) Select by integer position slices (same form as Python/NumPy).
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iloc[3:5, 0:2])
Output:
A B
2021-01-04 -0.731813 -0.007271
2021-01-05 -0.098682 -1.033287
(3) Slice rows only.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iloc[1:3, :])
Output:
A B C D
2021-01-02 1.725870 0.316616 -0.226371 2.271909
2021-01-03 -0.701184 -0.101915 1.670719 -1.069785
(4) Slice columns only.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iloc[:, 1:3])
Output:
B C
2021-01-01 -0.344991 0.714762
2021-01-02 0.756099 -0.716836
2021-01-03 -0.253883 1.408437
2021-01-04 0.617495 -0.370847
2021-01-05 0.361932 -0.149773
2021-01-06 -0.203682 -1.166916
(5) Get a single value.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iloc[1, 1])
Output:
0.5140577371616526
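(6) Get a single value quickly (equivalent to the method above; the printed value differs only because each example regenerates the random data).
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df.iat[1, 1])
Output: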
-0.8025050449303489
7. Boolean indexing
(1) Select data using the values of a single column.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df[df.A > 0])
Output:
A B C D
2021-01-01 1.246744 -0.178323 1.584207 0.451347
2021-01-02 1.119580 -0.278993 -0.975688 0.857890
2021-01-04 0.128081 0.126333 -0.413096 1.912839
2021-01-05 0.315206 -0.997872 -0.315139 -1.187635
(2) Select data with a where operation.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df[df > 0])
Output:
A B C D
2021-01-01 1.050739 NaN NaN 0.527370
2021-01-02 NaN 0.946090 0.267921 1.673618
2021-01-03 NaN NaN NaN NaN
2021-01-04 NaN 0.643557 0.372513 NaN
2021-01-05 NaN NaN NaN 0.808884
2021-01-06 NaN 0.721053 NaN 0.522357
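Indexing with a boolean DataFrame behaves like DataFrame.where: cells that fail the condition become NaN. A small sketch with illustrative values shows the equivalence and how where() can substitute a different fill value.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

masked = df[df > 0]                  # failing cells become NaN
same = df.where(df > 0)              # explicit form of the same operation
filled = df.where(df > 0, 0.0)       # failing cells replaced by 0.0 instead of NaN

print(masked)
print(filled)
```

The shape of the frame is preserved in every case, which is the main difference from row filtering such as df[df.A > 0].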
(3) Filter data with the isin() method.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
print(df2)
print(df2[df2['E'].isin(['two', 'four'])])
Output:
A B C D E
2021-01-01 -0.574905 -0.003417 -0.344360 -0.119831 one
2021-01-02 -0.324021 0.954608 -0.596665 0.827242 one
2021-01-03 1.216822 -0.479907 0.721729 -1.394054 two
2021-01-04 1.337284 -0.526787 -0.346786 2.736462 three
2021-01-05 -0.292888 -0.177181 0.113743 -0.606479 four
2021-01-06 -0.117398 0.664194 0.301029 1.171757 three
A B C D E
2021-01-03 1.216822 -0.479907 0.721729 -1.394054 two
2021-01-05 -0.292888 -0.177181 0.113743 -0.606479 four
8. Setting values
(1) Assign a new column; the data is aligned automatically by index.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
print(s1)
df['F'] = s1
print(df)
Output:
2021-01-02 1
2021-01-03 2
2021-01-04 3
2021-01-05 4
2021-01-06 5
2021-01-07 6
Freq: D, dtype: int64
A B C D F
2021-01-01 -1.298911 -0.133224 -0.557085 -0.142379 NaN
2021-01-02 0.238531 0.433289 1.494014 -0.588631 1.0
2021-01-03 2.522138 0.688398 0.767005 0.123376 2.0
2021-01-04 0.678053 0.140091 1.117512 -0.555320 3.0
2021-01-05 -0.447904 0.353646 -1.198465 -1.003590 4.0
2021-01-06 -0.861330 -0.812971 1.317353 -0.978052 5.0
(2) Set a value by label.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df.at[dates[0], 'A'] = 0
print(df)
Output:
A B C D
2021-01-01 0.000000 1.702900 -0.020916 -0.617243
2021-01-02 -0.544483 1.276033 -1.070828 0.416703
2021-01-03 1.214969 1.411715 -0.200606 2.133288
2021-01-04 0.710090 -0.290432 0.243515 0.356134
2021-01-05 -0.868281 -0.043208 0.436506 1.252045
2021-01-06 -0.040620 -0.559917 0.083952 1.074859
(3) Set a value by position.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df.iat[0, 1] = 0
print(df)
Output:
A B C D
2021-01-01 -0.995703 0.000000 -0.497130 0.292860
2021-01-02 -0.370458 -0.450762 1.235836 -1.117611
2021-01-03 0.563000 -0.529552 1.012462 0.351527
2021-01-04 0.002556 2.456097 -1.275803 0.243018
2021-01-05 0.958823 -1.869412 0.638924 -0.468291
2021-01-06 -0.975528 -0.271083 0.245019 0.922966
(4) Set values by assigning a NumPy array.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df.loc[:, 'D'] = np.array([5] * len(df))
print(df)
Output:
A B C D
2021-01-01 -1.304089 -0.376122 -0.108818 5
2021-01-02 0.557907 0.666416 -1.335505 5
2021-01-03 1.312906 -0.920788 0.217328 5
2021-01-04 -1.191590 0.643327 -0.572647 5
2021-01-05 -1.114065 -1.957133 -0.254868 5
2021-01-06 1.881592 2.020586 -0.368924 5
(5) Set values with a where operation.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df2 = df.copy()
df2[df2 > 0] = -df2
print(df2)
Output:
A B C D
2021-01-01 -0.489553 -1.543897 -0.678892 -0.487907
2021-01-02 -0.273650 -1.775763 -0.627094 -0.361745
2021-01-03 -0.131556 -1.010881 -0.653446 -0.996312
2021-01-04 -1.030592 -0.961013 -2.088085 -0.275543
2021-01-05 -0.219124 -1.220296 -1.156944 -0.015766
2021-01-06 -0.134167 -1.470275 -0.892858 -0.575739
9. Handling missing data
pandas uses np.nan to represent missing data; by default these values are excluded from computations.
(1) reindex() lets you change, add, or delete the index on a specified axis, and returns a copy of the data.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
print(df1)
Output:
A B C D F E
2021-01-01 0.544711 -0.053116 -0.226346 -0.763461 NaN 1.0
2021-01-02 1.717452 0.819771 -0.601411 0.108737 1.0 1.0
2021-01-03 -1.342919 0.032636 1.850492 1.482909 2.0 NaN
2021-01-04 0.613216 -0.637186 -0.888018 -0.387602 3.0 NaN
(2) Drop every row that contains missing data.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
# Drop rows that contain any missing value
print(df1.dropna(how='any'))
Output:
A B C D F E
2021-01-02 -0.125127 -0.14816 -0.491284 -0.777581 1.0 1.0
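The how argument controls how aggressive dropna() is. The sketch below, on a tiny made-up frame, contrasts how='any' with how='all'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, np.nan], "y": [1.0, 2.0, np.nan]})

any_dropped = df.dropna(how="any")  # drop rows with at least one NaN
all_dropped = df.dropna(how="all")  # drop rows only if every value is NaN

print(any_dropped)
print(all_dropped)
```

Row 1 has one missing value, so it is removed by how='any' but kept by how='all'; row 2 is missing everywhere and is removed in both cases.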
(3) Fill missing data.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
# Fill missing values with 5
print(df1.fillna(value=5))
Output:
A B C D F E
2021-01-01 -2.382767 0.500487 -0.217522 -1.805155 5.0 1.0
2021-01-02 -1.183837 -0.391934 1.215758 -1.513532 1.0 1.0
2021-01-03 1.588544 0.169653 -0.299395 -0.884112 2.0 5.0
2021-01-04 2.493201 -0.059129 -1.738894 -1.887012 3.0 5.0
(4) Get a boolean mask of the values that are NaN.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
print(pd.isnull(df1))
Output:
A B C D F E
2021-01-01 False False False False True False
2021-01-02 False False False False False False
2021-01-03 False False False False False True
2021-01-04 False False False False False True
10. Operations and statistics
Operations generally exclude missing data.
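A quick sketch of what "excluding missing data" means in practice: reductions such as mean() skip NaN unless skipna=False is passed. The values are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

default_mean = s.mean()             # NaN is skipped: (1 + 3) / 2
strict_mean = s.mean(skipna=False)  # NaN propagates into the result

print(default_mean, strict_mean)
```

Passing skipna=False is a simple way to make a calculation fail loudly (with NaN) when the input is incomplete.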
(1) Compute a descriptive statistic.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1
print(df.mean())
Output:
A -0.406563
B -0.464531
C -0.649678
D 0.081891
F 3.000000
dtype: float64
(2) Perform the same operation on the other axis.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1
print(df.mean(axis=1))
Output:
2021-01-01 -0.000218
2021-01-02 -0.192326
2021-01-03 1.105851
2021-01-04 0.646850
2021-01-05 0.424485
2021-01-06 0.947733
Freq: D, dtype: float64
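(3) Operating on objects of different dimensionality requires alignment; pandas then broadcasts automatically along the specified dimension.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
print(s)
print(df.sub(s, axis='index'))
Output: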
2021-01-01 NaN
2021-01-02 NaN
2021-01-03 1.0
2021-01-04 3.0
2021-01-05 5.0
2021-01-06 NaN
Freq: D, dtype: float64
A B C D F
2021-01-01 NaN NaN NaN NaN NaN
2021-01-02 NaN NaN NaN NaN NaN
2021-01-03 -1.256636 -3.067793 -1.168880 -1.753398 1.0
2021-01-04 -2.012296 -2.662295 -3.183522 -3.437285 0.0
2021-01-05 -8.293072 -4.674037 -4.800625 -4.896442 -1.0
2021-01-06 NaN NaN NaN NaN NaN
11. Applying functions
(1) Use apply() to apply a function to the data.
Code:
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210102', periods=6))
df['F'] = s1
print(df.apply(np.cumsum))
print("---------------------")
print(df.apply(lambda x:x.max()-x.min()))
Output:
A B C D F
2021-01-01 -2.311017 1.587081 1.588087 -1.099827 NaN
2021-01-02 -1.833219 1.234021 3.044834 -0.172339 1.0
2021-01-03 -1.721813 0.638080 3.887059 0.134763 3.0
2021-01-04 -1.685430 0.846965 5.226440 0.207427 6.0
2021-01-05 -0.877456 0.032083 5.021183 0.264057 10.0
2021-01-06 -2.354379 -0.559002 3.641077 0.236156 15.0
---------------------
A 3.118991
B 2.401963
C 2.968192
D 2.027316
F 4.000000
dtype: float64
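By default apply() passes each column to the function; with axis=1 it passes each row instead. The following sketch uses small integer data chosen for illustration so the results can be checked by hand.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 40]})

# Default axis=0: the lambda receives one column at a time.
col_range = df.apply(lambda x: x.max() - x.min())
# axis=1: the lambda receives one row at a time.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

print(col_range)
print(row_range)
```

The column-wise result is indexed by column name, while the row-wise result is indexed by the original row labels.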
12. Frequency counts
Code:
s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
print(s.value_counts())
Output:
0 0
1 4
2 5
3 2
4 3
5 0
6 2
7 5
8 2
9 2
dtype: int32
2 4
5 2
0 2
4 1
3 1
dtype: int64
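13. String methods
A Series exposes a set of string processing methods through its str attribute, which makes it easy to operate on each element of the array, as in the code below. Note that pattern matching in str generally uses regular expressions by default.
Code:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CACA', 'dog', 'cat'])
# str.lower is a method, so it must be called; it returns a new Series and leaves s unchanged
print(s.str.lower())
Output:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caca
7 dog
8 cat
dtype: object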
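To illustrate the remark about regular expressions, here is a minimal sketch using str.contains(), which treats its pattern as a regex unless regex=False is passed. The pattern "a.a" and the sample strings are illustrative choices.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Aaba", "Baca", np.nan, "dog"])

# As a regex, "." matches any character, so "aba" and "aca" both match.
as_regex = s.str.contains("a.a", na=False)
# As a literal string, "a.a" must appear verbatim, so nothing matches.
as_literal = s.str.contains("a.a", regex=False, na=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

The na=False argument makes missing values count as "no match" so that the result is a plain boolean Series.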
14. Merging
(1) Concat
pandas provides many ways to combine Series and DataFrame objects according to different kinds of set logic. (Panel, which older documentation also lists, was removed in pandas 0.25.)
Concatenate pandas objects with concat():
Code:
df = pd.DataFrame(np.random.randn(10, 4))
print(df)
print("---------------")
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
print(pieces)
print("---------------")
print(pd.concat(pieces))
Output:
0 1 2 3
0 -0.434515 -0.573066 0.862445 1.036136
1 0.667840 -0.097729 -1.652916 -0.722389
2 -0.220746 -1.049863 -0.114852 -2.581321
3 0.384615 0.639548 -0.739006 0.481056
4 -1.523937 0.690972 1.291165 0.290739
5 0.231251 0.803626 -1.693163 0.256039
6 -0.114371 0.519657 0.231674 0.456883
7 -1.121592 -0.430156 -0.563986 0.168413
8 -0.465606 0.476165 -1.314072 0.196124
9 0.243630 0.865871 -0.645785 0.753181
---------------
[ 0 1 2 3
0 -0.434515 -0.573066 0.862445 1.036136
1 0.667840 -0.097729 -1.652916 -0.722389
2 -0.220746 -1.049863 -0.114852 -2.581321, 0 1 2 3
3 0.384615 0.639548 -0.739006 0.481056
4 -1.523937 0.690972 1.291165 0.290739
5 0.231251 0.803626 -1.693163 0.256039
6 -0.114371 0.519657 0.231674 0.456883, 0 1 2 3
7 -1.121592 -0.430156 -0.563986 0.168413
8 -0.465606 0.476165 -1.314072 0.196124
9 0.243630 0.865871 -0.645785 0.753181]
---------------
0 1 2 3
0 -0.434515 -0.573066 0.862445 1.036136
1 0.667840 -0.097729 -1.652916 -0.722389
2 -0.220746 -1.049863 -0.114852 -2.581321
3 0.384615 0.639548 -0.739006 0.481056
4 -1.523937 0.690972 1.291165 0.290739
5 0.231251 0.803626 -1.693163 0.256039
6 -0.114371 0.519657 0.231674 0.456883
7 -1.121592 -0.430156 -0.563986 0.168413
8 -0.465606 0.476165 -1.314072 0.196124
9 0.243630 0.865871 -0.645785 0.753181
(2) Join
SQL-style merges with merge().
Code:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
print(left)
print("---------------------")
right = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [4, 5]})
print(right)
print("---------------------")
print(pd.merge(left, right, on='key'))
Output:
key lval
0 foo 1
1 foo 2
---------------------
key lval
0 foo 4
1 foo 5
---------------------
key lval_x lval_y
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
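merge() performs an inner join by default; the how parameter selects other SQL-style joins. The sketch below uses made-up keys ("bar", "baz") and a hypothetical rval column to show the difference between inner and outer joins.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

inner = pd.merge(left, right, on="key")               # keys present in both frames
outer = pd.merge(left, right, on="key", how="outer")  # keys from either frame

print(inner)
print(outer)
```

In the outer join, rows whose key exists on only one side are kept and the columns from the other side are filled with NaN.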
(3) Append
Add rows to the end of a DataFrame.
Code:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
print(df)
print("-----------------")
s = df.iloc[3]
print(s)
print("-----------------")
# DataFrame.append was removed in pandas 2.0; pd.concat appends the row instead
print(pd.concat([df, s.to_frame().T], ignore_index=True))
Output:
A B C D
0 0.774301 -0.146980 -0.867190 0.804019
1 0.504305 1.186497 -0.281873 1.243404
2 1.369683 0.805037 0.231694 0.392675
3 -0.200875 0.330411 -0.478353 -0.740152
4 1.042464 -0.138162 -1.513976 -0.666396
5 0.132588 -0.187199 -1.451298 0.983176
6 1.677020 -1.505520 0.314352 0.467116
7 0.926760 -2.036741 -0.182761 -0.167417
-----------------
A -0.200875
B 0.330411
C -0.478353
D -0.740152
Name: 3, dtype: float64
-----------------
A B C D
0 0.774301 -0.146980 -0.867190 0.804019
1 0.504305 1.186497 -0.281873 1.243404
2 1.369683 0.805037 0.231694 0.392675
3 -0.200875 0.330411 -0.478353 -0.740152
4 1.042464 -0.138162 -1.513976 -0.666396
5 0.132588 -0.187199 -1.451298 0.983176
6 1.677020 -1.505520 0.314352 0.467116
7 0.926760 -2.036741 -0.182761 -0.167417
8 -0.200875 0.330411 -0.478353 -0.740152
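15. Grouping
By "group by" we mean a process involving one or more of the following steps:
- Splitting: divide the data into groups based on some criteria
- Applying: run a function on each group independently
- Combining: combine the results into a data structure
Code:
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})
print(df)
print("------------------")
# Group by A and sum the numeric columns (selected explicitly so the string column B is not summed)
print(df.groupby('A')[['C', 'D']].sum())
print("------------------")
# Grouping by several columns forms a hierarchical index; then apply the function
print(df.groupby(['A', 'B']).sum())
Output: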
A B C D
0 foo one -0.521933 0.030400
1 bar one 2.046228 0.611504
2 foo two -0.650801 1.682347
3 bar three -0.121637 -1.130325
4 foo two 0.040135 -0.495454
5 bar two 1.736218 -0.774311
6 foo one 0.081882 0.691103
7 bar three -0.612624 -1.388700
------------------
            C         D
A
bar  3.048185 -2.681832
foo -1.050718  1.908396
------------------
                  C         D
A   B
bar one    2.046228  0.611504
    three -0.734261 -2.519025
    two    1.736218 -0.774311
foo one   -0.440052  0.721503
    two   -0.610666  1.186893
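The split, apply, and combine steps listed above can be written out by hand to see what groupby().sum() does. This is only an illustrative sketch on a small made-up frame, not how pandas implements grouping internally.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]})

# Split: iterating a GroupBy yields (key, sub-frame) pairs.
# Apply: sum column C within each sub-frame.
parts = {key: group["C"].sum() for key, group in df.groupby("A")}
# Combine: collect the per-group results into one Series.
manual = pd.Series(parts, name="C")

# The built-in form performs all three steps in one call.
built_in = df.groupby("A")["C"].sum()

print(manual)
print(built_in)
```

Both results contain the same totals per group; the built-in version additionally names the index "A".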
16. Reshaping: stacking
(1) Stacking: a sample DataFrame with a MultiIndex
Code:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
print(df2)
Output:
                     A         B
first second
bar   one    -0.127506  0.941748
      two    -0.565635  0.251350
baz   one     0.077156 -1.003484
      two    -0.412658 -0.557502
(2) The stack() method "compresses" a level in the DataFrame's columns.
Code:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
stacked = df2.stack()
print(stacked)
Output:
first  second
bar    one     A    0.640794
               B    1.611218
       two     A   -1.731782
               B    0.997328
baz    one     A   -1.639688
               B    0.942692
       two     A    0.094491
               B   -0.364335
dtype: float64
(3) For a "stacked" DataFrame or Series (one with a MultiIndex as its index), the inverse of stack() is unstack(), which by default unstacks the last level.
Code:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
stacked = df2.stack()
print(stacked.unstack())
print("-----------------------------")
print(stacked.unstack(1))  # unstack index level 1 ('second')
print("-----------------------------")
print(stacked.unstack(0))  # unstack index level 0 ('first')
Output:
                     A         B
first second
bar   one    -1.220559  1.272748
      two     2.373165  1.359084
baz   one     0.594712  0.567112
      two    -0.870067  1.412194
-----------------------------
second        one       two
first
bar   A -1.220559  2.373165
      B  1.272748  1.359084
baz   A  0.594712 -0.870067
      B  0.567112  1.412194
-----------------------------
first          bar       baz
second
one    A -1.220559  0.594712
       B  1.272748  0.567112
two    A  2.373165 -0.870067
       B  1.359084  1.412194
17. Pivot tables
Code:
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,'B': ['A', 'B', 'C'] * 4,'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,'D': np.random.randn(12),'E': np.random.randn(12)})
print(df)
df1 = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
print("--------------------------------")
print(df1)
Output:
A B C D E
0 one A foo 0.489705 -0.724392
1 one B foo 0.077034 0.942014
2 two C foo 1.915312 -0.641194
3 three A bar 0.275519 -0.014924
4 one B bar 1.239626 -1.418770
5 one C bar 0.468554 0.778672
6 two A foo -0.911088 0.411054
7 three B foo 0.728673 -0.941020
8 one C foo -0.090592 -1.599612
9 one A bar 0.279766 1.578581
10 two B bar 1.452452 0.117850
11 three C bar -0.839334 0.679560
--------------------------------
C             bar       foo
A     B
one   A  0.279766  0.489705
      B  1.239626  0.077034
      C  0.468554 -0.090592
three A  0.275519       NaN
      B       NaN  0.728673
      C -0.839334       NaN
two   A       NaN -0.911088
      B  1.452452       NaN
      C       NaN  1.915312
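In the table above each cell holds a single value, so the aggregation step is invisible. pivot_table() aggregates with the mean by default when several rows fall into the same cell, and aggfunc selects a different aggregation. A small sketch with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two"],
    "C": ["foo", "foo", "bar"],
    "D": [1.0, 3.0, 5.0],
})

# Two rows share (A='one', C='foo'), so they must be aggregated.
mean_table = pd.pivot_table(df, values="D", index="A", columns="C")                # default: mean
sum_table = pd.pivot_table(df, values="D", index="A", columns="C", aggfunc="sum")  # explicit sum

print(mean_table)
print(sum_table)
```

Combinations that never occur in the data, such as (A='one', C='bar') here, appear as NaN in the mean table.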
18. Time series
pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting data sampled every second into data sampled every 5 minutes). This is very common in finance, but not limited to it.
(1) A first look at time series
Code:
rng = pd.date_range('1/1/2012', periods=100, freq='S')
print(rng)
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
print(ts)
print(ts.resample('5min').sum())
Output (the 100-entry DatetimeIndex is abridged here with "..."):
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01', '2012-01-01 00:00:02', ..., '2012-01-01 00:01:37', '2012-01-01 00:01:38', '2012-01-01 00:01:39'], dtype='datetime64[ns]', freq='S')
2012-01-01 00:00:00 124
2012-01-01 00:00:01 231
2012-01-01 00:00:02 298
2012-01-01 00:00:03 398
2012-01-01 00:00:04 418
...
2012-01-01 00:01:35 313
2012-01-01 00:01:36 157
2012-01-01 00:01:37 424
2012-01-01 00:01:38 105
2012-01-01 00:01:39 72
Freq: S, Length: 100, dtype: int32
2012-01-01 24430
Freq: 5T, dtype: int32
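Because the example above uses random data that all falls into one 5-minute bin, the effect of resampling is hard to verify. The sketch below uses deterministic per-minute data (an illustrative assumption) so that two bins appear and the sums can be checked by hand.

```python
import pandas as pd

# Ten observations, one per minute, with values 0..9.
rng = pd.date_range("2012-01-01", periods=10, freq="min")
ts = pd.Series(range(10), index=rng)

# Each 5-minute bin collects five consecutive observations.
five_min = ts.resample("5min").sum()

print(five_min)
```

The first bin covers 00:00 to 00:04 (0+1+2+3+4) and the second covers 00:05 to 00:09 (5+6+7+8+9).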
(2) Time zone representation
Code:
rng = pd.date_range('3/6/2012', periods=5, freq='D')
print(rng)
print("----------------------------")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)
print("----------------------------")
ts_utc = ts.tz_localize('UTC')
print(ts_utc)
Output:
DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09','2012-03-10'],dtype='datetime64[ns]', freq='D')
----------------------------
2012-03-06 0.083046
2012-03-07 1.300931
2012-03-08 -0.172009
2012-03-09 -0.500776
2012-03-10 -0.561864
Freq: D, dtype: float64
----------------------------
2012-03-06 00:00:00+00:00 0.083046
2012-03-07 00:00:00+00:00 1.300931
2012-03-08 00:00:00+00:00 -0.172009
2012-03-09 00:00:00+00:00 -0.500776
2012-03-10 00:00:00+00:00 -0.561864
Freq: D, dtype: float64
(3) Converting to another time zone
Code:
rng = pd.date_range('3/6/2012', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts_utc = ts.tz_localize('UTC')
print(ts_utc.tz_convert('US/Eastern'))
Output:
2012-03-05 19:00:00-05:00 0.375370
2012-03-06 19:00:00-05:00 -0.341717
2012-03-07 19:00:00-05:00 0.152345
2012-03-08 19:00:00-05:00 1.537487
2012-03-09 19:00:00-05:00 -1.145042
Freq: D, dtype: float64
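tz_localize() attaches a time zone to naive timestamps, while tz_convert() re-expresses the same instants in another zone. A minimal sketch, assuming the zone name "America/New_York" (the canonical name for the US Eastern zone) and a single illustrative timestamp:

```python
import pandas as pd

ts = pd.Series([1.0], index=pd.DatetimeIndex(["2012-03-06 00:00"]))

utc = ts.tz_localize("UTC")                    # interpret the naive time as UTC
eastern = utc.tz_convert("America/New_York")   # same instant, different wall-clock time

print(utc.index[0])
print(eastern.index[0])
```

Midnight UTC on 6 March is 19:00 on 5 March in New York, yet the two timestamps still compare equal because they denote the same moment.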
(4) Converting between time span representations
Code:
rng = pd.date_range('1/1/2012', periods=5, freq='M')
print(rng)
print("----------------------------")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)
print("----------------------------")
ps = ts.to_period()
print(ps)
print("----------------------------")
print(ps.to_timestamp())
Output:
DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30','2012-05-31'],dtype='datetime64[ns]', freq='M')
----------------------------
2012-01-31 1.161573
2012-02-29 1.481427
2012-03-31 1.681822
2012-04-30 -0.796045
2012-05-31 -0.214463
Freq: M, dtype: float64
----------------------------
2012-01 1.161573
2012-02 1.481427
2012-03 1.681822
2012-04 -0.796045
2012-05 -0.214463
Freq: M, dtype: float64
----------------------------
2012-01-01 1.161573
2012-02-01 1.481427
2012-03-01 1.681822
2012-04-01 -0.796045
2012-05-01 -0.214463
Freq: MS, dtype: float64
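(5) Converting between period and timestamp makes some convenient arithmetic functions available.
For example, the code below takes a quarterly series whose fiscal year ends in November and re-indexes each quarter to 9 a.m. on the first day of that quarter's final month.
Code:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
print(prng)
print("-----------------------------")
ts = pd.Series(np.random.randn(len(prng)), index=prng)
print(ts)
print("-----------------------------")
ts.index = (prng.asfreq('M', 'end')).asfreq('H', 'start') + 9
print(ts)
Output: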
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2','1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4','1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2','1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4','1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2','1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4','1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2','2000Q3', '2000Q4'],dtype='period[Q-NOV]', freq='Q-NOV')
-----------------------------
1990Q1 -1.137006
1990Q2 -0.623052
1990Q3 -0.034055
1990Q4 0.045069
1991Q1 0.394846
1991Q2 1.194376
1991Q3 -0.808152
1991Q4 2.840198
1992Q1 -0.389149
1992Q2 -1.915210
1992Q3 -2.631764
1992Q4 0.297902
1993Q1 -0.819829
1993Q2 -0.065494
1993Q3 -1.171204
1993Q4 1.774212
1994Q1 1.735605
1994Q2 1.101451
1994Q3 -2.097832
1994Q4 -1.519787
1995Q1 -0.592369
1995Q2 -0.069788
1995Q3 -0.111981
1995Q4 -0.725699
1996Q1 -0.258395
1996Q2 -0.144076
1996Q3 -0.312234
1996Q4 -0.205665
1997Q1 -1.190604
1997Q2 0.849108
1997Q3 0.666772
1997Q4 0.507039
1998Q1 0.603365
1998Q2 0.954041
1998Q3 -0.856542
1998Q4 -0.353008
1999Q1 -0.215423
1999Q2 0.127024
1999Q3 1.137850
1999Q4 0.879086
2000Q1 -0.241292
2000Q2 1.918176
2000Q3 0.900579
2000Q4 1.366803
Freq: Q-NOV, dtype: float64
-----------------------------
1990-02-01 09:00 -1.137006
1990-05-01 09:00 -0.623052
1990-08-01 09:00 -0.034055
1990-11-01 09:00 0.045069
1991-02-01 09:00 0.394846
1991-05-01 09:00 1.194376
1991-08-01 09:00 -0.808152
1991-11-01 09:00 2.840198
1992-02-01 09:00 -0.389149
1992-05-01 09:00 -1.915210
1992-08-01 09:00 -2.631764
1992-11-01 09:00 0.297902
1993-02-01 09:00 -0.819829
1993-05-01 09:00 -0.065494
1993-08-01 09:00 -1.171204
1993-11-01 09:00 1.774212
1994-02-01 09:00 1.735605
1994-05-01 09:00 1.101451
1994-08-01 09:00 -2.097832
1994-11-01 09:00 -1.519787
1995-02-01 09:00 -0.592369
1995-05-01 09:00 -0.069788
1995-08-01 09:00 -0.111981
1995-11-01 09:00 -0.725699
1996-02-01 09:00 -0.258395
1996-05-01 09:00 -0.144076
1996-08-01 09:00 -0.312234
1996-11-01 09:00 -0.205665
1997-02-01 09:00 -1.190604
1997-05-01 09:00 0.849108
1997-08-01 09:00 0.666772
1997-11-01 09:00 0.507039
1998-02-01 09:00 0.603365
1998-05-01 09:00 0.954041
1998-08-01 09:00 -0.856542
1998-11-01 09:00 -0.353008
1999-02-01 09:00 -0.215423
1999-05-01 09:00 0.127024
1999-08-01 09:00 1.137850
1999-11-01 09:00 0.879086
2000-02-01 09:00 -0.241292
2000-05-01 09:00 1.918176
2000-08-01 09:00 0.900579
2000-11-01 09:00 1.366803
Freq: H, dtype: float64
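To see why 1990Q1 lands on 1990-02-01 09:00, it may help to follow a single period through the conversion. The sketch below is not the exact chain used above: it replaces the hourly `asfreq` step with a `Timestamp` plus `Timedelta`, an assumption made here so the example does not depend on which hourly frequency alias the installed pandas accepts.

```python
import pandas as pd

# One quarter from a year that ends in November.
q = pd.Period('1990Q1', freq='Q-NOV')

# The quarter spans December 1989 through February 1990.
first_month = q.asfreq('M', 'start')
last_month = q.asfreq('M', 'end')

# Timestamp-based stand-in for the hourly step:
# start of the last month plus 9 hours.
stamp = last_month.to_timestamp(how='start') + pd.Timedelta(hours=9)

print(first_month, last_month, stamp)
```

With a November year end, the first quarter of 1990 begins in December 1989, which is why its last month is February 1990.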
19. Categoricals
Since version 0.15, pandas can include categorical data in a DataFrame.
(1) Convert raw_grade to a categorical type
The code is as follows:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'e', 'e']})
print(df)
df["grade"] = df["raw_grade"].astype("category")
print("---------------------------")
print(df["grade"])
The output is as follows:
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 e
5 6 e
---------------------------
0 a
1 b
2 b
3 a
4 e
5 e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
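A categorical column stores each distinct label once and keeps a small integer code per row. A minimal sketch using the same raw grades, where the variable names `labels` and `codes` are chosen here for illustration:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'b', 'a', 'e', 'e']).astype('category')

# The distinct labels, stored once.
labels = list(s.cat.categories)

# One small integer per row, pointing into the label list.
codes = list(s.cat.codes)

print(labels)
print(codes)
```

Inspecting `.cat.categories` and `.cat.codes` is a quick way to check what a conversion actually produced.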
(2) Rename the categories to more meaningful names, reorder them, and add the missing categories
The code is as follows:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'e', 'e']})
print(df)
df["grade"] = df["raw_grade"].astype("category")
print("---------------------------")
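# Rename the categories to more meaningful names
# (rename_categories is used because newer pandas no longer allows assigning to .cat.categories)
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
# Reorder the categories and add the missing ones
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])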
print(df["grade"])
The output is as follows:
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 e
5 6 e
---------------------------
0 very good
1 good
2 good
3 very good
4 very bad
5 very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
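Two details of renaming and resetting categories are easy to miss. The sketch below, with variable names made up for illustration, passes a dict to `rename_categories` so the mapping does not depend on the existing category order, and shows that `set_categories` turns any value whose category is left out into NaN:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'b', 'a', 'e', 'e']).astype('category')

# A dict maps old labels to new ones explicitly,
# independent of the order of the existing categories.
renamed = s.cat.rename_categories({'a': 'very good', 'b': 'good', 'e': 'very bad'})

# set_categories can add categories that do not occur in the data ...
widened = renamed.cat.set_categories(['very bad', 'bad', 'medium', 'good', 'very good'])

# ... but any value whose category is left out becomes NaN.
narrowed = renamed.cat.set_categories(['good', 'very good'])

print(list(widened.cat.categories))
print(narrowed.isna().sum())
```

Here the two 'very bad' rows become missing values in `narrowed`, so it is worth checking the category list before calling `set_categories`.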
(3) Sorting follows the order of the categories, not lexical order
The code is as follows:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'e', 'e']})
print(df)
df["grade"] = df["raw_grade"].astype("category")
print("---------------------------")
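# Rename the categories to more meaningful names
# (rename_categories is used because newer pandas no longer allows assigning to .cat.categories)
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
# Reorder the categories and add the missing ones
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
print(df.sort_values(by="grade"))  # sort by category order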
The output is as follows:
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 e
5 6 e
---------------------------
id raw_grade grade
4 5 e very bad
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
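The difference from plain string sorting can be shown side by side. This is a small sketch with three made-up grade values; it also sets `ordered=True` (an addition not used in the example above) so that `min()` and `max()` are defined on the column:

```python
import pandas as pd

grades = ['good', 'very good', 'very bad']

# Plain strings sort alphabetically.
as_text = pd.Series(grades).sort_values().tolist()

# A categorical sorts by the declared category order instead.
order = ['very bad', 'bad', 'medium', 'good', 'very good']
as_cat = pd.Series(pd.Categorical(grades, categories=order, ordered=True))
by_category = as_cat.sort_values().tolist()

print(as_text)
print(by_category)
print(as_cat.min(), as_cat.max())
```

Alphabetically, 'good' comes before 'very bad'; by category order it comes after.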
(4) Grouping by a categorical column also shows empty categories
The code is as follows:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'e', 'e']})
print(df)
df["grade"] = df["raw_grade"].astype("category")
print("---------------------------")
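# Rename the categories to more meaningful names
# (rename_categories is used because newer pandas no longer allows assigning to .cat.categories)
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
# Reorder the categories and add the missing ones
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
# Group by category; observed=False keeps empty categories regardless of the pandas default
print(df.groupby("grade", observed=False).size())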
The output is as follows:
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 e
5 6 e
---------------------------
grade
very bad 2
bad 0
medium 0
good 2
very good 2
dtype: int64
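Whether empty categories appear in a grouped result is controlled by the `observed` argument of `groupby`. A minimal sketch comparing both settings on the same grades, with the categorical built directly (the names `all_groups` and `seen_groups` are illustrative):

```python
import pandas as pd

order = ['very bad', 'bad', 'medium', 'good', 'very good']
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'grade': pd.Categorical(
        ['very good', 'good', 'good', 'very good', 'very bad', 'very bad'],
        categories=order,
    ),
})

# observed=False keeps every declared category, including empty ones.
all_groups = df.groupby('grade', observed=False).size()

# observed=True keeps only categories that actually occur in the data.
seen_groups = df.groupby('grade', observed=True).size()

print(all_groups)
print(seen_groups)
```

Passing `observed` explicitly avoids relying on the default, which has differed between pandas versions.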
20. Plotting
(1) A first look at matplotlib
The code is as follows:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
plt.show()
The output is a line chart of the cumulative sum over the date index.
(2) For a DataFrame, plot() conveniently draws every column together with its label
The code is as follows:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()
plt.legend(loc='best')
plt.show()
The output is a line chart with one line per column (A to D) and a legend.
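When the code runs as a script or on a machine without a display, saving the chart to a file can be more practical than `plt.show()`. The following is a rough sketch under a few assumptions: the non-interactive `Agg` backend is selected, the data is shortened to 100 rows, and the temporary output path is purely illustrative.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend: render without opening a window
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=100)
df = pd.DataFrame(np.random.randn(100, 4), index=index, columns=['A', 'B', 'C', 'D']).cumsum()

# DataFrame.plot() returns the matplotlib Axes it drew on.
ax = df.plot()
ax.legend(loc='best')

# Save through the Axes' parent figure; the temporary path is illustrative.
out_path = os.path.join(tempfile.mkdtemp(), 'plot.png')
ax.figure.savefig(out_path)
print(out_path)
```

Working with the returned Axes object also makes it easy to adjust titles or labels before saving.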
21. Data I/O
(1) CSV data
Writing to a CSV file:
The code is as follows:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.to_csv('foo.csv')  # write to a CSV file
This writes foo.csv to the current working directory.
Reading from a CSV file:
The code is as follows:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
print(pd.read_csv('foo.csv'))  # read the CSV file
The output is as follows:
Unnamed: 0 A B C D
0 2000-01-01 -1.050730 -0.093286 0.845858 -0.613594
1 2000-01-02 -3.349290 0.123289 -0.665181 -0.066789
2 2000-01-03 -5.499037 1.363600 -2.832010 1.519423
3 2000-01-04 -4.594498 1.883442 -2.438111 0.488036
4 2000-01-05 -5.885108 2.386728 -1.897067 -0.489968
.. ... ... ... ... ...
995 2002-09-22 -0.637796 12.922019 -24.813859 29.094194
996 2002-09-23 -0.268030 12.872831 -24.495352 28.371192
997 2002-09-24 -0.429714 14.442154 -24.049543 28.734404
998 2002-09-25 -1.868194 12.465456 -22.799273 29.116129
999 2002-09-26 -2.172118 12.427707 -24.062128 28.882752
[1000 rows x 5 columns]
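The `Unnamed: 0` column in the output is the saved date index, read back as an ordinary column. A minimal sketch of restoring it with `index_col` and `parse_dates`, assuming a small three-row frame and an in-memory buffer in place of foo.csv:

```python
import io

import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6]},
    index=pd.date_range('2000-01-01', periods=3),
)

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf)

# Default read: the saved index comes back as an ordinary column.
buf.seek(0)
plain = pd.read_csv(buf)

# index_col=0 restores the index; parse_dates=True converts it back to dates.
buf.seek(0)
restored = pd.read_csv(buf, index_col=0, parse_dates=True)

print(plain.columns.tolist())
print(restored.index)
```

The same two arguments work with a file path such as 'foo.csv'.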
(2) HDF5 data
Writing to an HDF5 store:
The code is as follows:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.to_hdf('foo.h5', key='df')  # write to the HDF5 store
This generates the file foo.h5.
Reading from an HDF5 store:
The code is as follows:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.to_hdf('foo.h5', key='df')  # write
pd.read_hdf('foo.h5', key='df')  # read
The output is as follows:
A B C D
2000-01-01 0.544649 -0.048527 -0.069001 0.615770
2000-01-02 -1.375622 -1.001399 -0.461100 -0.314656
2000-01-03 -1.080360 -0.491494 0.058478 -0.263584
2000-01-04 -0.781192 -1.366070 0.713983 0.276506
2000-01-05 -1.726031 -2.226089 1.372384 -1.010998
2000-01-06 -1.029440 -1.116124 3.213516 -1.070904
2000-01-07 -1.132326 0.549860 4.047112 -2.089782
2000-01-08 -1.918885 -0.496898 5.011556 -0.398244
2000-01-09 -2.689141 -0.353073 5.508442 -1.777378
2000-01-10 -0.574410 0.424270 6.674230 -0.836982
2000-01-11 0.968438 0.130614 7.331227 -1.299968
2000-01-12 1.187288 0.447871 8.748361 -2.459602
2000-01-13 1.889565 0.109931 10.110426 -2.873714
2000-01-14 1.549227 0.892440 11.540149 -2.708039
2000-01-15 1.230412 3.760428 11.407844 -3.150406
2000-01-16 0.057691 4.539744 9.338813 -3.744172
2000-01-17 0.004047 4.752471 10.309765 -4.507723
2000-01-18 0.092892 4.859409 9.760544 -3.024875
2000-01-19 1.756680 5.184557 8.991681 -4.547709
2000-01-20 3.133134 5.517767 8.498777 -5.635272
2000-01-21 2.889005 6.491254 8.638766 -6.878007
2000-01-22 2.614277 4.514062 9.359361 -6.409828
2000-01-23 2.004339 5.711879 9.398218 -5.936106
2000-01-24 2.186803 4.852240 8.156034 -6.388658
2000-01-25 3.085273 4.543388 6.914151 -5.814296
2000-01-26 5.203147 4.600532 6.475757 -5.283672
2000-01-27 5.751381 4.626839 8.047942 -3.977368
2000-01-28 5.675581 3.608191 6.809387 -2.812447
2000-01-29 5.401486 2.936898 7.269270 -2.104369
2000-01-30 5.553712 5.005159 8.387554 -2.762008
... ... ... ... ...
2002-08-28 -47.209369 7.254643 8.048536 25.650071
2002-08-29 -46.758124 6.307750 8.852335 25.292648
2002-08-30 -45.177406 5.847630 8.134441 25.595963
2002-08-31 -44.555625 4.738374 9.103920 25.938850
2002-09-01 -44.594843 4.847349 7.607951 26.767106
2002-09-02 -45.468864 3.460726 7.441725 27.277645
2002-09-03 -48.126574 4.654244 5.223401 27.618957
2002-09-04 -47.503283 4.500056 6.162534 28.210921
2002-09-05 -47.770849 3.965948 6.850322 28.129603
2002-09-06 -47.103058 3.908913 7.081636 29.309787
2002-09-07 -48.252013 4.328563 8.561459 29.842983
2002-09-08 -48.335899 2.360573 8.865642 30.591404
2002-09-09 -46.875850 2.844337 7.152740 31.220225
2002-09-10 -47.242826 2.538062 6.462508 30.843580
2002-09-11 -47.881749 3.812996 6.520225 32.369875
2002-09-12 -46.864357 4.713924 6.569562 32.144355
2002-09-13 -45.546403 2.981736 8.046595 33.097245
2002-09-14 -44.470824 4.739932 7.934668 32.488292
2002-09-15 -44.206498 3.851915 6.901387 31.004478
2002-09-16 -45.192152 2.235635 7.017709 30.362812
2002-09-17 -45.775304 3.109701 5.925081 30.872055
2002-09-18 -45.652522 4.743547 4.843658 29.608422
2002-09-19 -47.494023 4.842967 3.590295 29.586813
2002-09-20 -47.098042 5.926378 4.235130 29.989704
2002-09-21 -48.317910 4.805615 5.094592 30.270280
2002-09-22 -48.432906 5.759228 4.651891 30.817247
2002-09-23 -47.112905 6.014631 4.202600 29.703626
2002-09-24 -46.085425 6.468472 3.649689 29.517390
2002-09-25 -44.384026 5.569878 3.598782 29.982023
2002-09-26 -42.653887 5.947389 4.319416 31.901769
1000 rows × 4 columns
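HDF5 support in pandas relies on the optional PyTables package, so the calls above fail with an import error if it is missing. Below is a hedged sketch of a round trip that tolerates this; the three-row frame and the temporary file path are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]},
    index=pd.date_range('2000-01-01', periods=3),
)

# Illustrative temporary path; mode='w' overwrites any existing file.
path = os.path.join(tempfile.mkdtemp(), 'foo.h5')

try:
    df.to_hdf(path, key='df', mode='w')
    restored = pd.read_hdf(path, key='df')
except ImportError:
    # pandas needs the optional PyTables package ("tables") for HDF5.
    restored = None

print(restored)
```

Unlike CSV, the HDF5 round trip keeps the index as an index, so no `index_col` argument is needed when reading.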
(3) Excel data
Writing to an Excel file:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
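df = df.cumsum()
df.to_excel('foo.xlsx', sheet_name='Sheet1')  # write to Excel
Note: the required dependency libraries and modules must be installed first (for .xlsx files, an Excel engine such as openpyxl).
This generates the Excel file foo.xlsx.
Reading from an Excel file: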
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
print(pd.read_excel('foo.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA']))
The output is as follows:
Unnamed: 0 A B C D
0 2000-01-01 1.161650 0.711842 0.355790 -0.465520
1 2000-01-02 0.303399 0.266687 0.029220 -0.304709
2 2000-01-03 0.525579 1.834383 -1.003671 0.202279
3 2000-01-04 0.733337 3.045663 -2.245458 2.095680
4 2000-01-05 0.107569 3.380991 -4.067166 0.423934
.. ... ... ... ... ...
995 2002-09-22 -16.517889 -14.676465 18.354474 72.024858
996 2002-09-23 -16.714840 -13.742384 17.133618 72.441209
997 2002-09-24 -15.751055 -15.707861 17.496880 73.232321
998 2002-09-25 -14.977084 -16.827697 17.091075 72.890022
999 2002-09-26 -15.173852 -17.319843 16.150347 71.989576
[1000 rows x 5 columns]
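Note:
When I ran this code I got an error: pandas could not open the .xlsx file and raised xlrd.biffh.XLRDError: Excel xlsx file; not supported
Cause: recent xlrd releases (2.0.1 at the time of writing) support only .xls files, so pd.read_excel('xxx.xlsx') fails when xlrd is used as the engine. One workaround is to install an older xlrd.
Fix: on Windows 10, open cmd in the project's virtual-environment Scripts folder (D:\dream\venv\Scripts in my case) and install it there:
pip install xlrd==1.2.0
Alternatively, use openpyxl instead of xlrd to open .xlsx files:
df = pd.read_excel('data.xlsx', engine='openpyxl')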
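To check that the openpyxl route works end to end, a small round trip can be run without touching the disk. This is a sketch under assumptions: a three-row frame, an in-memory buffer standing in for foo.xlsx, and a fallback for the case where openpyxl is not installed.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6]},
    index=pd.date_range('2000-01-01', periods=3),
)

buf = io.BytesIO()  # in-memory stand-in for foo.xlsx
try:
    df.to_excel(buf, sheet_name='Sheet1', engine='openpyxl')
    buf.seek(0)
    # index_col=0 turns the first sheet column back into the index.
    restored = pd.read_excel(buf, sheet_name='Sheet1', index_col=0, engine='openpyxl')
except ImportError:
    # openpyxl is an optional dependency and may not be installed.
    restored = None

print(restored)
```

Passing `index_col=0` here also avoids the extra `Unnamed: 0` column seen in the output above.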
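Afterword
Finally, I recommend Anaconda for learning Python: it bundles IPython, Jupyter Notebook and more, which makes writing code convenient. I used PyCharm while writing this post and ran into problems with some libraries, whereas the same code ran fine in Jupyter Notebook. That's all for today; more tomorrow.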