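NumPy and pandas operations
- Introduction to pandas
- NumPy: introduction and usage
- About NumPy
- Using NumPy
- pandas operations
- pandas data structures
- Series: a list-like structure that creates an index automatically
- Creating a Series
- Verifying Series properties
- DataFrame: a two-dimensional table
- Creating a DataFrame
- Inserting into a DataFrame
- Deleting from a DataFrame
- Looking up data in a DataFrame
- Other DataFrame attributes
- Selecting data in a DataFrame
- Modifying table data
- Panel: a three-dimensional array
- Creating a Panel
- pandas in the notebook
- Basic pandas operations
- Reindexing
- Reindexing with a default value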
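Introduction to pandas
pandas is a Python toolkit for analyzing structured data.
It is built on NumPy, which provides high-performance array and matrix operations.
The plotting library matplotlib provides data visualization.
NumPy: introduction and usage
About NumPy
(1) High performance: NumPy is the foundational package for scientific computing and data analysis, and the higher-level data analysis tools are built on top of it.
(2) Array-oriented: NumPy encourages thinking in terms of whole-array operations.
Using NumPy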
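(1) Define a one-dimensional array and show its shape and data type (the default integer dtype is platform-dependent; it is int32 in this session)
In [1]: import numpy as np
In [2]: data = np.array([1,3,4,8])
In [3]: data
Out[3]: array([1, 3, 4, 8])
In [4]: data.shape
Out[4]: (4,)
In [5]: data.dtype
Out[5]: dtype('int32')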
(2) Indexing an array and assigning to an element
In [7]: data[1]
Out[7]: 3
In [8]: data[1]=9
In [9]: data
Out[9]: array([1, 9, 4, 8])
(3) Define a two-dimensional array, show its shape, and read a single element
In [11]: data = np.array([[1,2,3],[4,5,6]])
In [12]: data
Out[12]:
array([[1, 2, 3],
       [4, 5, 6]])
In [13]: data.shape
Out[13]: (2, 3)
In [14]: data[0,0]
Out[14]: 1
(4) The arange function compared with range
In [16]: np.arange(10)
Out[16]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # returns an ndarray
In [17]: range(10)
Out[17]: range(0, 10)  # in Python 3 this is a lazy range object, not a list
In [18]: np.arange(5,15)
Out[18]: array([ 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
In [19]: data = np.arange(100, step=10)
In [20]: data
Out[20]: array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
In [21]: data[2]
Out[21]: 20
In [22]: data[2:5]
Out[22]: array([20, 30, 40])
In [23]: data[:3]
Out[23]: array([ 0, 10, 20])
In [24]: data[5:]=-1
In [25]: data
Out[25]: array([ 0, 10, 20, 30, 40, -1, -1, -1, -1, -1])
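(5) Reshaping an array
- reshape does not copy the data here. It returns a view of the array whenever possible (as it is for this contiguous array), so when a value in the original array changes, the reshaped array changes as well
In [21]: data = np.arange(10)
In [22]: data
Out[22]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [23]: data.reshape(2,5)
Out[23]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])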
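The view behavior described for reshape can be checked directly. The following is a minimal sketch; the variable names (`view`, `independent`, and so on) are illustrative and not part of the original session.

```python
import numpy as np

data = np.arange(10)
view = data.reshape(2, 5)          # a view onto the same buffer for this contiguous array

data[0] = 100                      # write through the original array
shares = np.shares_memory(data, view)
first_after_write = view[0, 0]     # the reshaped array sees the new value

# An explicit copy is isolated from later writes to the original.
independent = data.reshape(2, 5).copy()
data[1] = -1
```

If isolation is needed, calling `.copy()` on the reshaped result is the usual way to get it.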
(6) Create a three-dimensional array
In [24]: np.ones((2,3,3))
Out[24]:
array([[[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]]])
(7) Create an identity matrix (ones on the diagonal)
In [26]: np.eye(4)
Out[26]:
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
(8) Reading and selecting values in a two-dimensional array
In [35]: data = np.arange(16).reshape(4,4)
In [36]: data
Out[36]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
In [37]: data[1]
Out[37]: array([4, 5, 6, 7])
In [38]: data[1:3]
Out[38]:
array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [39]: data[:,2:4]
Out[39]:
array([[ 2,  3],
       [ 6,  7],
       [10, 11],
       [14, 15]])
In [40]: data[1:3,2:4]
Out[40]:
array([[ 6,  7],
       [10, 11]])
In [41]: data[[1,3],[2,3]]
Out[41]: array([ 6, 15])
In [42]: data>10
Out[42]:
array([[False, False, False, False],
       [False, False, False, False],
       [False, False, False,  True],
       [ True,  True,  True,  True]])
In [43]: idx = data>10
In [44]: idx
Out[44]:
array([[False, False, False, False],
       [False, False, False, False],
       [False, False, False,  True],
       [ True,  True,  True,  True]])
In [45]: data[idx]
Out[45]: array([11, 12, 13, 14, 15])
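One detail of the session above is easy to misread: `data[[1,3],[2,3]]` pairs the two index lists element by element, so it returns the entries at (1, 2) and (3, 3) rather than a 2x2 block. The sketch below contrasts this with `np.ix_`, which is one way to select the full block of rows 1, 3 and columns 2, 3. The variable names are illustrative.

```python
import numpy as np

data = np.arange(16).reshape(4, 4)

# Paired (fancy) indexing: elements at (1, 2) and (3, 3).
pairs = data[[1, 3], [2, 3]]

# Cross product of rows [1, 3] and columns [2, 3]: a 2x2 block.
block = data[np.ix_([1, 3], [2, 3])]

# A boolean mask always returns a flattened 1-D result.
masked = data[data > 10]
```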
(9) Adding, multiplying, and dividing arrays
In [46]: x = np.arange(1,5).reshape(2,2)
In [47]: y = np.arange(5,9).reshape(2,2)
In [48]: x+y
Out[48]:
array([[ 6,  8],
       [10, 12]])
In [49]: np.add(x,y)
Out[49]:
array([[ 6,  8],
       [10, 12]])
In [50]: x*y  # element-wise multiplication
Out[50]:
array([[ 5, 12],
       [21, 32]])
In [51]: x.dot(y)  # matrix multiplication
Out[51]:
array([[19, 22],
       [43, 50]])
In [53]: x = np.array(x,dtype=float)
In [54]: y = np.array(y,dtype=float)
In [55]: x/y  # element-wise division
Out[55]:
array([[0.2       , 0.33333333],
       [0.42857143, 0.5       ]])
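Because `*` and `dot` are easy to confuse, here is a small sketch that puts the two side by side using the same `x` and `y` as above. The `@` operator is the standard Python matrix-multiplication operator and gives the same result as `x.dot(y)` for 2-D arrays.

```python
import numpy as np

x = np.arange(1, 5).reshape(2, 2)   # [[1, 2], [3, 4]]
y = np.arange(5, 9).reshape(2, 2)   # [[5, 6], [7, 8]]

elementwise = x * y      # multiplies matching entries
matrix_product = x @ y   # rows of x times columns of y, same as x.dot(y)
```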
(10) Other operations (square root, transpose, linspace)
In [56]: np.sqrt(x)
Out[56]:
array([[1.        , 1.41421356],
       [1.73205081, 2.        ]])
In [57]: x.T
Out[57]:
array([[1., 3.],
       [2., 4.]])
In [58]: np.linspace(1,10)  # 50 evenly spaced points by default; useful for sampling functions such as sin
Out[58]:
array([ 1. , 1.18367347, 1.36734694, 1.55102041, 1.73469388,1.91836735, 2.10204082, 2.28571429, 2.46938776, 2.65306122,2.83673469, 3.02040816, 3.20408163, 3.3877551 , 3.57142857,3.75510204, 3.93877551, 4.12244898, 4.30612245, 4.48979592,4.67346939, 4.85714286, 5.04081633, 5.2244898 , 5.40816327,5.59183673, 5.7755102 , 5.95918367, 6.14285714, 6.32653061,6.51020408, 6.69387755, 6.87755102, 7.06122449, 7.24489796,7.42857143, 7.6122449 , 7.79591837, 7.97959184, 8.16326531,8.34693878, 8.53061224, 8.71428571, 8.89795918, 9.08163265,9.26530612, 9.44897959, 9.63265306, 9.81632653, 10.
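pandas operations
pandas data structures
Series: a list-like structure that creates an index automatically
A Series is a one-dimensional labeled array that can hold any data type, including integers, floats, and Python objects.
Basic form: s = pd.Series(data, index=index)
Here data can be a Python dict, a NumPy ndarray, or a scalar.
Properties of a Series:
(1) ndarray-like
(2) dict-like
(3) Supports label-aligned operations
Creating a Series
(1) Create a Series from array-like data (here a list that contains NaN)
In [64]: s = pd.Series([1,3,5,np.nan,8,4])
In [65]: s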
Out[65]:
0 1.0
1 3.0
2 5.0
3 NaN
4 8.0
5 4.0
dtype: float64
(2) Create a Series from a dict; the dict keys become the index labels, and labels missing from the dict get NaN
In [258]: d = {'a':0,'b':1,'d':3}
In [259]: s = pd.Series(d,index=list('abcd'))
In [260]: s
Out[260]:
a 0.0
b 1.0
c NaN
d 3.0
dtype: float64
(3) Create a Series from a scalar
In [261]: s = pd.Series(5, index=list('abcd'))
In [262]: s
Out[262]:
a 5
b 5
c 5
d 5
dtype: int64
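Verifying Series properties
(1) ndarray-like behavior
[1] Supports ndarray-style indexing (in recent pandas releases, integer-position access such as s[0] on a labeled index is deprecated; s.iloc[0] is the explicit positional form)
In [263]: s[0]
Out[263]: 5
In [264]: s[:3]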
Out[264]:
a 5
b 5
c 5
dtype: int64
In [265]: s[2:5]
Out[265]:
c 5
d 5
dtype: int64
[2] NumPy functions can be applied directly
In [267]: np.sin(s)
Out[267]:
a -0.958924
b -0.958924
c -0.958924
d -0.958924
dtype: float64
In [268]: np.exp(s)
Out[268]:
a 148.413159
b 148.413159
c 148.413159
d 148.413159
dtype: float64
(2) dict-like behavior
[1] Access values the way you would with a dict
In [269]: s = pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
In [270]: s
Out[270]:
a -0.029259
b 0.162623
c -1.261891
d -0.849811
e 0.756510
dtype: float64
In [271]: s['a']
Out[271]: -0.02925920853872483
In [281]: print(s.get('f'))
None
In [282]: print(s.get('f',0))
0
[2] Assign values the way you would with a dict
In [272]: s['b']=3
In [273]: s
Out[273]:
a -0.029259
b 3.000000
c -1.261891
d -0.849811
e 0.756510
[3] Add a new element the way you would with a dict
In [274]: s['g']=100
In [275]: s
Out[275]:
a -0.029259
b 3.000000
c -1.261891
d -0.849811
e 0.756510
g 100.000000
dtype: float64
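(3) Label alignment: arithmetic aligns the operands on their labels automatically, and labels present in only one operand produce NaN
In [283]: s1 = pd.Series(np.random.randn(3),index=['a','c','e'])
In [284]: s2 = pd.Series(np.random.randn(3),index=['a','d','e'])
In [285]: print('{0}\n\n{1}'.format(s1,s2))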
a -0.833574
c -1.132164
e 0.582486
dtype: float64

a -1.710644
d -0.118072
e -0.034564
dtype: float64
In [286]: s1+s2
Out[286]:
a -2.544218
c NaN
d NaN
e 0.547923
dtype: float64
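When the NaN results from alignment are not wanted, `Series.add` with `fill_value` treats a label missing from one side as that value instead. This is a minimal sketch with fixed numbers (the session above uses random data); the variable names are illustrative.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'c', 'e'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'd', 'e'])

plain = s1 + s2                      # 'c' and 'd' exist on one side only, so they become NaN
filled = s1.add(s2, fill_value=0)    # a missing operand is treated as 0 instead
```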
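DataFrame: a two-dimensional table
A DataFrame is a two-dimensional labeled array with both row labels and column labels. It can be thought of as an Excel sheet or a SQL table, or as a dict of Series in which each row or column is a Series object. It is the most commonly used data structure in pandas.
Basic form: df = pd.DataFrame(data, index=index, columns=columns)
Here data can be a dict of one-dimensional ndarrays, lists, or Series; a two-dimensional ndarray; a single Series; or another DataFrame.
Creating a DataFrame
- Method 1: the DataFrame constructor with a two-dimensional array
In [66]: dates = pd.date_range('20200417',periods=6)
In [67]: dates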
Out[67]:
DatetimeIndex(['2020-04-17', '2020-04-18', '2020-04-19', '2020-04-20',
               '2020-04-21', '2020-04-22'],
              dtype='datetime64[ns]', freq='D')
In [68]: data = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
In [69]: data
Out[69]:
A B C D
2020-04-17 -0.301320 0.802654 -0.167558 0.126571
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932
2020-04-19 1.769593 0.667576 -0.371545 1.122983
2020-04-20 0.778702 0.246110 0.279248 -0.836504
2020-04-21 -0.004121 -0.783791 -0.452399 0.054387
2020-04-22 -2.138328 0.752788 0.102256 0.742620
- Method 2: from a dict
# A dict that mixes a scalar, a Timestamp, a range, and an ndarray
In [70]: d = {'A':1,'B':pd.Timestamp('20200417'),'C':range(4),'D':np.arange(4)}
In [71]: d
Out[71]:
{'A': 1,
 'B': Timestamp('2020-04-17 00:00:00'),
 'C': range(0, 4),
 'D': array([0, 1, 2, 3])}
In [72]: df = pd.DataFrame(d)
In [73]: df
Out[73]:
A B C D
0 1 2020-04-17 0 0
1 1 2020-04-17 1 1
2 1 2020-04-17 2 2
3 1 2020-04-17 3 3
# A dict of Series
In [292]: d = {'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['a','b','c','d'])}
In [293]: df=pd.DataFrame(d)
In [294]: df
Out[294]:
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
# A dict of lists
# With a dict of Series the Series may have different lengths, but with a dict of lists every list must have the same length
In [289]: d={'one':[1,2,3,4],'two':[21,22,23,24]}
In [290]: df = pd.DataFrame(d)
In [291]: df
Out[291]:
one two
0 1 21
1 2 22
2 3 23
3 4 24
- Method 3: from a list
# The list elements are tuples
In [295]: data = [(1,2.2,'Hello'),(2,3,'World')]
In [296]: df = pd.DataFrame(data,index=['one','two'],columns=list('ABC'))
In [297]: df
Out[297]:
A B C
one 1 2.2 Hello
two 2 3.0 World
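# The list elements are dicts. When the requested column labels cover only some of the keys, the data is aligned on those labels automatically; a column label that does not exist in the data is filled with NaN
In [299]: data = [{'a':1,'b':2},{'a':5,'b':10,'c':20}]
In [305]: df = pd.DataFrame(data,index=['A','B'],columns=['a','b'])
In [306]: df
Out[306]:
a b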
A 1 2
B 5 10
In [307]: df = pd.DataFrame(data,index=['A','B'],columns=['a','b','e'])
In [308]: df
Out[308]:
a b e
A 1 2 NaN
B 5 10 NaN
- Method 4: from a Series
In [312]: s = pd.Series(np.random.rand(5),index=['a','b','c','d','e'])
In [313]: pd.DataFrame(s, columns=['A'])
Out[313]:
A
a 0.084951
b 0.010217
c 0.409527
d 0.775261
e 0.451806
In [314]: pd.DataFrame(s, columns=['A'],index=list('acd'))
Out[314]:
A
a 0.084951
c 0.409527
d 0.775261
- Nested structures (tuple keys produce hierarchical row and column labels)
In [309]: d = {('a','b'):{('A','B'):1,('A','C'):2},('a','a'):{('A','C'):3,('A','B'):4},('a','c'):{('A','C'):5,('A','C'):6}}
In [310]: df = pd.DataFrame(d)
In [311]: df
Out[311]:
    a
    b a c
A B 1 4 NaN
  C 2 3 6.0
Inserting into a DataFrame
(1) Append a column at the end
In [316]: df = pd.DataFrame(np.random.randn(6,4),columns=['one','two','three','four'])
In [317]: df
Out[317]:
one two three four
0 -0.752271 -0.282806 0.528453 -0.966714
1 -1.545213 -0.758960 2.155316 -0.447701
2 1.067247 0.845392 1.795486 0.501453
3 -0.319423 -0.691444 1.026918 -1.324076
4 -0.373745 0.155677 0.932794 -1.704437
5 0.251379 -1.479839 1.466653 0.234889
In [318]: df['five']=5
In [319]: df
Out[319]:
one two three four five
0 -0.752271 -0.282806 0.528453 -0.966714 5
1 -1.545213 -0.758960 2.155316 -0.447701 5
2 1.067247 0.845392 1.795486 0.501453 5
3 -0.319423 -0.691444 1.026918 -1.324076 5
4 -0.373745 0.155677 0.932794 -1.704437 5
5 0.251379 -1.479839 1.466653 0.234889 5
(2) Insert a column at a given position
- insert: modifies df in place
In [322]: df.insert(1,'bar',df['one']+df['two'])
In [323]: df
Out[323]:
one bar two three five
0 -0.752271 -1.035077 -0.282806 0.528453 5
1 -1.545213 -2.304173 -0.758960 2.155316 5
2 1.067247 1.912640 0.845392 1.795486 5
3 -0.319423 -1.010867 -0.691444 1.026918 5
4 -0.373745 -0.218068 0.155677 0.932794 5
5 0.251379 -1.228459 -1.479839 1.466653 5
- assign: leaves df unchanged and returns a copy with the new column
In [324]: df.assign(Ratio = df['one']/df['two'])
Out[324]:
one bar two three five Ratio
0 -0.752271 -1.035077 -0.282806 0.528453 5 2.660020
1 -1.545213 -2.304173 -0.758960 2.155316 5 2.035962
2 1.067247 1.912640 0.845392 1.795486 5 1.262428
3 -0.319423 -1.010867 -0.691444 1.026918 5 0.461965
4 -0.373745 -0.218068 0.155677 0.932794 5 -2.400776
5 0.251379 -1.228459 -1.479839 1.466653 5 -0.169869
- assign also accepts a function, which is called with the DataFrame to compute the new column
In [329]: df.assign(Ratio = lambda x: x.one-x.two)
Out[329]:
one bar two five Ratio
0 -0.752271 -1.035077 -0.282806 5 -0.469464
1 -1.545213 -2.304173 -0.758960 5 -0.786254
2 1.067247 1.912640 0.845392 5 0.221855
3 -0.319423 -1.010867 -0.691444 5 0.372021
4 -0.373745 -0.218068 0.155677 5 -0.529421
5 0.251379 -1.228459 -1.479839 5 1.731218
- Chaining assign calls
In [330]: df.assign(ABRatio = df.one/df.two).assign(BarValue = lambda x: x.ABRatio * x.bar)
Out[330]:
one bar two five ABRatio BarValue
0 -0.752271 -1.035077 -0.282806 5 2.660020 -2.753326
1 -1.545213 -2.304173 -0.758960 5 2.035962 -4.691209
2 1.067247 1.912640 0.845392 5 1.262428 2.414569
3 -0.319423 -1.010867 -0.691444 5 0.461965 -0.466985
4 -0.373745 -0.218068 0.155677 5 -2.400776 0.523533
5 0.251379 -1.228459 -1.479839 5 -0.169869 0.208678
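The difference between `insert` (in place) and `assign` (returns a copy) can be verified with a small frame. This is a sketch with made-up values; the names `with_ratio` and `columns_before` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [2.0, 4.0, 6.0]})

# assign returns a new DataFrame; df itself is untouched.
with_ratio = df.assign(ratio=lambda d: d['one'] / d['two'])
columns_before = list(df.columns)

# insert changes df in place and returns None.
df.insert(1, 'bar', df['one'] + df['two'])
```

Passing a lambda to `assign` matters mainly in chains, where a later step needs a column created by an earlier one.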
Deleting from a DataFrame
In [320]: s = df.pop('four')
In [321]: df
Out[321]:
one two three five
0 -0.752271 -0.282806 0.528453 5
1 -1.545213 -0.758960 2.155316 5
2 1.067247 0.845392 1.795486 5
3 -0.319423 -0.691444 1.026918 5
4 -0.373745 0.155677 0.932794 5
5 0.251379 -1.479839 1.466653 5
# or
In [327]: del df['three']
In [328]: df
Out[328]:
one bar two five
0 -0.752271 -1.035077 -0.282806 5
1 -1.545213 -2.304173 -0.758960 5
2 1.067247 1.912640 0.845392 5
3 -0.319423 -1.010867 -0.691444 5
4 -0.373745 -0.218068 0.155677 5
5 0.251379 -1.228459 -1.479839 5
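Both `pop` and `del` modify the DataFrame in place. When the original should stay intact, `DataFrame.drop` returns a new frame instead. A minimal sketch with made-up values follows; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]})

# drop returns a copy without the column; df still has all three columns here.
trimmed = df.drop(columns=['three'])

# pop removes the column from df in place and returns it as a Series.
popped = df.pop('two')
```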
Looking up data in a DataFrame
(1) data.head: returns the first rows, five by default
In [74]: data.head()
Out[74]:
A B C D
2020-04-17 -0.301320 0.802654 -0.167558 0.126571
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932
2020-04-19 1.769593 0.667576 -0.371545 1.122983
2020-04-20 0.778702 0.246110 0.279248 -0.836504
2020-04-21 -0.004121 -0.783791 -0.452399 0.054387
In [75]: data.head(2)
Out[75]:
A B C D
2020-04-17 -0.301320 0.802654 -0.167558 0.126571
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932
(2) data.tail: returns the last rows, five by default
In [76]: data.tail()
Out[76]:
A B C D
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932
2020-04-19 1.769593 0.667576 -0.371545 1.122983
2020-04-20 0.778702 0.246110 0.279248 -0.836504
2020-04-21 -0.004121 -0.783791 -0.452399 0.054387
2020-04-22 -2.138328 0.752788 0.102256 0.742620
In [77]: data.tail(2)
Out[77]:
A B C D
2020-04-21 -0.004121 -0.783791 -0.452399 0.054387
2020-04-22 -2.138328 0.752788 0.102256 0.742620
Other DataFrame attributes
(1) data.index: the row labels
In [78]: data.index
Out[78]:
DatetimeIndex(['2020-04-17', '2020-04-18', '2020-04-19', '2020-04-20','2020-04-21', '2020-04-22'],dtype='datetime64[ns]', freq='D')
(2) data.columns: the column labels
In [79]: data.columns
Out[79]: Index(['A', 'B', 'C', 'D'], dtype='object')
(3) data.values: the underlying values as an ndarray
data.values
Out[80]:
array([[-0.3013205 , 0.80265403, -0.16755782, 0.1265712 ],[-0.45784158, -1.11922575, 1.01342568, -0.74293211],[ 1.769593 , 0.6675759 , -0.3715449 , 1.12298311],[ 0.7787021 , 0.24610972, 0.27924835, -0.83650411],[-0.00412084, -0.78379106, -0.4523986 , 0.05438711],[-2.13832768, 0.75278802, 0.10225592, 0.7426
(4) data.describe: summary statistics for each column
In [81]: data.describe()
Out[81]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.058886 0.094352 0.067238 0.077854
std 1.310718 0.840328 0.540209 0.780625
min -2.138328 -1.119226 -0.452399 -0.836504
25% -0.418711 -0.526316 -0.320548 -0.543602
50% -0.152721 0.456843 -0.032651 0.090479
75% 0.582996 0.731485 0.235000 0.588608
max 1.769593 0.802654 1.013426 1.122983
(5) data.T: transpose
data.T
Out[82]:
2020-04-17 2020-04-18 2020-04-19 2020-04-20 2020-04-21 2020-04-22
A -0.301320 -0.457842 1.769593 0.778702 -0.004121 -2.138328
B 0.802654 -1.119226 0.667576 0.246110 -0.783791 0.752788
C -0.167558 1.013426 -0.371545 0.279248 -0.452399 0.102256
D 0.126571 -0.742932 1.122983 -0.836504 0
(6) data.sort_index: sort by labels
- axis=1: sort by the column labels
In [83]: data.sort_index(axis=1)
Out[83]:
A B C D
2020-04-17 -0.301320 0.802654 -0.167558 0.126571
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932
2020-04-19 1.769593 0.667576 -0.371545 1.122983
2020-04-20 0.778702 0.246110 0.279248 -0.836504
2020-04-21 -0.004121 -0.783791 -0.452399 0.054387
2020-04-22 -2.138328 0.752788 0.102256 0.742620
- ascending=False: sort in descending order
In [86]: data.sort_index(axis=1,ascending=False)
Out[86]:
D C B A
2020-04-17 0.126571 -0.167558 0.802654 -0.301320
2020-04-18 -0.742932 1.013426 -1.119226 -0.457842
2020-04-19 1.122983 -0.371545 0.667576 1.769593
2020-04-20 -0.836504 0.279248 0.246110 0.778702
2020-04-21 0.054387 -0.452399 -0.783791 -0.004121
2020-04-22 0.742620 0.102256 0.752788 -2.138328
- axis=0: sort by the row labels
In [87]: data.sort_index(axis=0,ascending=False)
Out[87]:
A B C D
2020-04-22 -2.138328 0.752788 0.102256 0.742620
2020-04-21 -0.004121 -0.783791 -0.452399 0.054387
2020-04-20 0.778702 0.246110 0.279248 -0.836504
2020-04-19 1.769593 0.667576 -0.371545 1.122983
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932
2020-04-17 -0.301320 0.802654 -0.167558 0.126571
- data.sort_values(by='A'): sort by the values in one column
In [88]: data.sort_values(by='A')
Out[88]:
A B C D
2020-04-22 -2.138328 0.752788 0.102256 0.742620
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932
2020-04-17 -0.301320 0.802654 -0.167558 0.126571
2020-04-21 -0.004121 -0.783791 -0.452399 0.054387
2020-04-20 0.778702 0.246110 0.279248 -0.836504
2020-04-19 1.769593 0.667576 -0.371545 1.122983
Selecting data in a DataFrame
(1) Select by column
In [89]: data['A']
Out[89]:
2020-04-17 -0.301320
2020-04-18 -0.457842
2020-04-19 1.769593
2020-04-20 0.778702
2020-04-21 -0.004121
2020-04-22 -2.138328
Freq: D, Name: A, dtype: float64
In [90]: data.A
Out[90]:
2020-04-17 -0.301320
2020-04-18 -0.457842
2020-04-19 1.769593
2020-04-20 0.778702
2020-04-21 -0.004121
2020-04-22 -2.138328
Freq: D, Name: A, dtype: float64
(2) Select by row
# By row position
In [91]: data[2:4]
Out[91]:
A B C D
2020-04-19 1.769593 0.667576 -0.371545 1.122983
2020-04-20 0.778702 0.246110 0.279248 -0.836504
# By row label
In [93]: data['20200417':'20200418']
Out[93]:
A B C D
2020-04-17 -0.301320 0.802654 -0.167558 0.126571
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932
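# With the label-based indexer loc, which is more efficient because pandas does not have to work out whether the argument is a position or a label
In [94]: data.loc['20200417':'20200418']
Out[94]:
A B C D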
2020-04-17 -0.301320 0.802654 -0.167558 0.126571
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932
# With the position-based indexer iloc
In [95]: data.iloc[2:4]
Out[95]:
A B C D
2020-04-19 1.769593 0.667576 -0.371545 1.122983
2020-04-20 0.778702 0.246110 0.279248 -0.836504
(3) Select several columns with loc
In [96]: data.loc[:,['B','C']]
Out[96]:
B C
2020-04-17 0.802654 -0.167558
2020-04-18 -1.119226 1.013426
2020-04-19 0.667576 -0.371545
2020-04-20 0.246110 0.279248
2020-04-21 -0.783791 -0.452399
2020-04-22 0.752788 0.102256
(4) Select by row and column together
In [101]: data.loc['20200417':'20200418',['B','C']]
Out[101]:
B C
2020-04-17 0.802654 -0.167558
2020-04-18 -1.119226 1.013426
(5) Access a single value
In [102]: data.loc['20200417','B']
Out[102]: 0.8026540334880221
# at is faster, but it needs the label in its native type (here a Timestamp)
In [103]: data.at[pd.Timestamp('20200417'),'B']
Out[103]: 0.8026540334880221
(6) Access values by integer position
In [104]: data.iloc[1]
Out[104]:
A -0.457842
B -1.119226
C 1.013426
D -0.742932
Name: 2020-04-18 00:00:00, dtype: float64
In [105]: data.iloc[1:3]
Out[105]:
A B C D
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932
2020-04-19 1.769593 0.667576 -0.371545 1.122983
In [106]: data.iloc[1:3,2:4]
Out[106]:
C D
2020-04-18 1.013426 -0.742932
2020-04-19 -0.371545 1.122983
In [107]: data.iloc[1,1]
Out[107]: -1.1192257494013678
In [108]: data.iat[1,1]
Out[108]: -1.1192257494013678
# iat is faster for single values
In [111]: %timeit df.iloc[1,1]
9.13 µs ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [112]: %timeit df.iat[1,1]
6.41 µs ± 108 ns per loop (mean ± std. dev. of 7 run
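The four accessors used above differ along two axes: label versus position, and general selection versus single-value access. The sketch below uses a small frame with known values (not the random data from the session) to show that they agree on the same cell, and that label slices include the end point while position slices do not. All names are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200417', periods=3)
data = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list('ABCD'))

by_label = data.loc[pd.Timestamp('20200418'), 'B']     # label-based, general
by_position = data.iloc[1, 1]                          # position-based, general
scalar_label = data.at[pd.Timestamp('20200418'), 'B']  # label-based, single value only
scalar_pos = data.iat[1, 1]                            # position-based, single value only

label_slice = data.loc['20200417':'20200418']  # end label is included
pos_slice = data.iloc[0:2]                     # end position is excluded
```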
(7) Select data with a boolean condition
data[data.A>0]
Out[113]:
A B C D
2020-04-19 1.769593 0.667576 -0.371545 1.122983
2020-04-20 0.778702 0.246110 0.279248 -0.836504
In [114]: data[data>0]
Out[114]:
A B C D
2020-04-17 NaN 0.802654 NaN 0.126571
2020-04-18 NaN NaN 1.013426 NaN
2020-04-19 1.769593 0.667576 NaN 1.122983
2020-04-20 0.778702 0.246110 0.279248 NaN
2020-04-21 NaN NaN NaN 0.054387
2020-04-22 NaN 0.752788 0.102256 0.742620
(8) Filter rows with isin
tag = ['a']*2 + ['b']*2 + ['c']*2
In [120]: data2['TAG']=tag
In [121]: data2[data2.TAG.isin(['a','c'])]
Out[121]:
A B C D TAG
2020-04-17 -0.301320 0.802654 -0.167558 0.126571 a
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932 a
2020-04-21 -0.004121 -0.783791 -0.452399 0.054387 c
2020-04-22 -2.138328 0.752788 0.102256 0.742620 c
Modifying table data
(1) Modify a single element
In [122]: data.iat[0,0]=100
In [123]: data
Out[123]:
A B C D
2020-04-17 100.000000 0.802654 -0.167558 0.126571
2020-04-18 -0.457842 -1.119226 1.013426 -0.742932
2020-04-19 1.769593 0.667576 -0.371545 1.122983
2020-04-20 0.778702 0.246110 0.279248 -0.836504
2020-04-21 -0.004121 -0.783791 -0.452399 0.054387
2020-04-22 -2.138328 0.752788 0.102256 0.742620
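(2) Modify a whole column; a sequence must have as many elements as there are rows, while a scalar is broadcast to every row
In [124]: data.A=range(6)
In [125]: data
Out[125]:
A B C D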
2020-04-17 0 0.802654 -0.167558 0.126571
2020-04-18 1 -1.119226 1.013426 -0.742932
2020-04-19 2 0.667576 -0.371545 1.122983
2020-04-20 3 0.246110 0.279248 -0.836504
2020-04-21 4 -0.783791 -0.452399 0.054387
2020-04-22 5 0.752788 0.102256 0.742620
In [126]: data.B=200
In [127]: data
Out[127]:
A B C D
2020-04-17 0 200 -0.167558 0.126571
2020-04-18 1 200 1.013426 -0.742932
2020-04-19 2 200 -0.371545 1.122983
2020-04-20 3 200 0.279248 -0.836504
2020-04-21 4 200 -0.452399 0.054387
2020-04-22 5 200 0.102256 0.742620
(3) Modify several rows and columns at once
In [128]: data.iloc[:,2:5]=1000
In [129]: data
Out[129]:
A B C D
2020-04-17 0 200 1000 1000
2020-04-18 1 200 1000 1000
2020-04-19 2 200 1000 1000
2020-04-20 3 200 1000 1000
2020-04-21 4 200 1000 1000
2020-04-22 5 200 1000 1000
Panel: a three-dimensional array
items: axis 0; each item corresponds to a DataFrame
major_axis: axis 1; the row labels of each DataFrame
minor_axis: axis 2; the column labels of each DataFrame
Creating a Panel
data = {'Item1':pd.DataFrame(np.random.randn(4,3)),'Item2':pd.DataFrame(np.random.randn(4,2))}
In [335]: pn = pd.Panel(data)
pn['Item1']
Out[336]:
0 1 2
0 -1.327305 1.073906 -0.873040
1 0.424617 -0.048509 1.354535
2 -0.752743 0.728434 0.143341
3 1.599821 -0.269235 1.571243
In [337]: pn.to_frame()
Out[337]:
Item1 Item2
major minor
0 0 -1.327305 -0.509633
  1 1.073906 -0.471738
1 0 0.424617 -0.236080
  1 -0.048509 -0.317929
2 0 -0.752743 -0.927855
  1 0.728434 0.417202
3 0 1.599821 -0.273714
  1 -0.269235 1.021836
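Note that pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so the Panel session above only runs on older releases. One common substitute is a DataFrame with a MultiIndex, built by concatenating the per-item frames. The sketch below is one possible approach, not the only one; it uses fixed values instead of random data, and the names `frames`, `stacked`, and `item1` are illustrative.

```python
import numpy as np
import pandas as pd

frames = {
    'Item1': pd.DataFrame(np.arange(12.0).reshape(4, 3)),
    'Item2': pd.DataFrame(np.arange(8.0).reshape(4, 2)),
}

# The dict keys become the outer level of the row MultiIndex.
# Item2 has no column 2, so that column is NaN for its rows.
stacked = pd.concat(frames)

# Selecting on the outer level recovers one item, much like pn['Item1'].
item1 = stacked.loc['Item1']
```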
pandas in the notebook
(1) Handling missing data
- dropna(): drops the rows that contain missing values and returns a copy
In [140]: df1
Out[140]:
A B C D E
2020-04-17 1.741634 0.080174 -1.484283 -0.779988 NaN
2020-04-18 1.102289 -1.411300 0.448187 0.823690 2.0
2020-04-19 0.787129 0.693279 -0.981830 0.102260 2.0
2020-04-20 1.925935 1.299940 0.816230 1.060300 NaN
In [141]: df1.dropna()
Out[141]:
A B C D E
2020-04-18 1.102289 -1.411300 0.448187 0.82369 2.0
2020-04-19 0.787129 0.693279 -0.981830 0.10226 2.0
- fillna(): replaces missing values with a given value and returns a copy
In [142]: df1.fillna(value=5)
Out[142]:
A B C D E
2020-04-17 1.741634 0.080174 -1.484283 -0.779988 5.0
2020-04-18 1.102289 -1.411300 0.448187 0.823690 2.0
2020-04-19 0.787129 0.693279 -0.981830 0.102260 2.0
2020-04-20 1.925935 1.299940 0.816230 1.060300 5.0
- Check whether the data set contains any missing values
In [143]: pd.isnull(df1)
Out[143]:
A B C D E
2020-04-17 False False False False True
2020-04-18 False False False False False
2020-04-19 False False False False False
2020-04-20 False False False False True
In [144]: pd.isnull(df1).any()
Out[144]:
A False
B False
C False
D False
E True
dtype: bool
In [145]: pd.isnull(df1).any().any()
Out[145]: True
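The three missing-data tools above can be exercised on a tiny frame with known values. This is a minimal sketch; `df1` here is a made-up stand-in for the frame in the session, and the other names are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'E': [np.nan, 2.0, np.nan]})

dropped = df1.dropna()              # keeps only rows with no NaN; df1 is unchanged
filled = df1.fillna(value=5)        # NaN replaced by 5; df1 is unchanged
has_missing = df1.isna().any().any()
per_column_mean = df1.mean()        # NaN entries are skipped by default
```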
(2) Statistics; missing values are excluded from the calculation
- Mean
In [146]: df1.mean()
Out[146]:
A 1.389247
B 0.165523
C -0.300424
D 0.301566
E 2.000000
dtype: float64
In [147]: df1.mean(axis=1)
Out[147]:
2020-04-17 -0.110616
2020-04-18 0.592573
2020-04-19 0.520168
2020-04-20 1.275601
Freq: D, dtype: float64
- Cumulative sum, method 1: call cumsum directly
In [148]: df1.cumsum()
Out[148]:
A B C D E
2020-04-17 1.741634 0.080174 -1.484283 -0.779988 NaN
2020-04-18 2.843923 -1.331127 -1.036097 0.043702 2.0
2020-04-19 3.631052 -0.637848 -2.017926 0.145962 4.0
2020-04-20 5.556987 0.662092 -1.201696 1.206262 NaN
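Note on broadcasting: when a two-dimensional object is added to or subtracted from a one-dimensional one, the one-dimensional data is expanded to match
In [149]: a = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)
In [150]: a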
Out[150]:
2020-04-17 NaN
2020-04-18 NaN
2020-04-19 1.0
2020-04-20 3.0
2020-04-21 5.0
2020-04-22 NaN
Freq: D, dtype: float64
In [152]: df.sub(a,axis='index')
Out[152]:
A B C D
2020-04-17 NaN NaN NaN NaN
2020-04-18 NaN NaN NaN NaN
2020-04-19 -0.212871 -0.306721 -1.981830 -0.89774
2020-04-20 -1.074065 -1.700060 -2.183770 -1.93970
2020-04-21 -5.375537 -4.485144 -5.373087 -4.83622
2020-04-22 NaN NaN NaN NaN
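To make the broadcasting rule concrete, the sketch below subtracts a Series from every column of a small frame, matching on the row labels as `axis='index'` requests. The values are made up so the result can be checked by hand; the names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10.0, 20.0, 30.0], 'B': [1.0, 2.0, 3.0]})
offsets = pd.Series([1.0, np.nan, 3.0])

# Each row's offset is subtracted from every column in that row.
# A NaN offset makes the whole row NaN.
result = df.sub(offsets, axis='index')
```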
- Cumulative sum, method 2: pass the cumulative-sum function to apply
In [154]: df.apply(np.cumsum)
Out[154]:
A B C D
2020-04-17 1.741634 0.080174 -1.484283 -0.779988
2020-04-18 2.843923 -1.331127 -1.036097 0.043702
2020-04-19 3.631052 -0.637848 -2.017926 0.145962
2020-04-20 5.556987 0.662092 -1.201696 1.206262
2020-04-21 5.181451 1.176948 -1.574783 1.370042
2020-04-22 2.290288 1.436656 -0.816895 0.859868
- value_counts(): counts how many times each value occurs
In [157]: s = pd.Series(np.random.randint(10,20,size=20))
In [158]: s
Out[158]:
0 14
1 10
2 17
3 15
4 18
5 14
6 18
7 15
8 14
9 19
10 15
11 18
12 16
13 11
14 13
15 19
16 11
17 15
18 19
19 18
dtype: int32
In [159]: s.value_counts()
Out[159]:
18 4
15 4
19 3
14 3
11 2
17 1
16 1
13 1
10 1
dtype: int64
- mode(): returns the most frequent value or values
In [160]: s.mode()
Out[160]:
0 15
1 18
dtype: int32
(3) The apply function
In [155]: df.apply(lambda x : x.max() - x.min())
Out[155]:
A 4.817097
B 2.711241
C 2.300513
D 1.840288
dtype: float64
(4) Combining data
- Method 1: concat
In [161]: df = pd.DataFrame(np.random.randn(10,4),columns=list('ABCD'))
In [162]: df
Out[162]:
A B C D
0 -0.836643 1.389491 -0.346328 -0.579350
1 -0.033253 -0.100592 -0.187342 0.077368
2 -0.492430 0.187553 0.282775 0.486783
3 -2.189806 0.527236 1.742809 0.341727
4 -0.610598 -1.127726 0.321672 1.154310
5 0.050277 0.000600 0.580841 2.216663
6 0.216428 -0.369792 -0.527231 0.252527
7 0.331808 -0.136121 0.056004 -1.125896
8 0.177764 -1.976727 -0.137631 -0.188156
9 -0.117127 -0.591025 -1.415468 0.451023
In [165]: df.iloc[:3]
Out[165]:
A B C D
0 -0.836643 1.389491 -0.346328 -0.579350
1 -0.033253 -0.100592 -0.187342 0.077368
2 -0.492430 0.187553 0.282775 0.486783
In [166]: df.iloc[3:7]
Out[166]:
A B C D
3 -2.189806 0.527236 1.742809 0.341727
4 -0.610598 -1.127726 0.321672 1.154310
5 0.050277 0.000600 0.580841 2.216663
6 0.216428 -0.369792 -0.527231 0.252527
In [167]: df.iloc[7:]
Out[167]:
A B C D
7 0.331808 -0.136121 0.056004 -1.125896
8 0.177764 -1.976727 -0.137631 -0.188156
9 -0.117127 -0.591025 -1.415468 0.451023
In [170]: df1 = pd.concat([df.iloc[:3],df.iloc[3:7],df.iloc[7:]])
In [171]: df1
Out[171]:
A B C D
0 -0.836643 1.389491 -0.346328 -0.579350
1 -0.033253 -0.100592 -0.187342 0.077368
2 -0.492430 0.187553 0.282775 0.486783
3 -2.189806 0.527236 1.742809 0.341727
4 -0.610598 -1.127726 0.321672 1.154310
5 0.050277 0.000600 0.580841 2.216663
6 0.216428 -0.369792 -0.527231 0.252527
7 0.331808 -0.136121 0.056004 -1.125896
8 0.177764 -1.976727 -0.137631 -0.188156
9 -0.117127 -0.591025 -1.415468 0.451023
In [172]: df==df1
Out[172]:
A B C D
0 True True True True
1 True True True True
2 True True True True
3 True True True True
4 True True True True
5 True True True True
6 True True True True
7 True True True True
8 True True True True
9 True True True True
In [173]: (df==df1).all()
Out[173]:
A True
B True
C True
D True
dtype: bool
In [174]: (df==df1).all().all()
Out[174]: True
- Method 2: merge
In [175]: left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
In [176]: right = pd.DataFrame({'key':['foo','foo'],'lval':[4,5]})
In [177]: left
Out[177]:
key lval
0 foo 1
1 foo 2
In [178]: right
Out[178]:
key lval
0 foo 4
1 foo 5
# Equivalent to the SQL statement:
# SELECT * FROM left INNER JOIN right ON left.key = right.key;
In [180]: pd.merge(left,right,on='key')
Out[180]:
key lval_x lval_y
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
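The example above uses the default inner join with duplicate keys, which is why it yields four rows. The sketch below uses distinct keys to show how the `how` parameter changes which rows survive. The frames and the `rval` column name are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# Inner join (the default): only keys present on both sides.
inner = pd.merge(left, right, on='key')

# Outer join: all keys from either side, with NaN where a side has no match.
outer = pd.merge(left, right, on='key', how='outer')
```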
- Method 3: append a single row; this returns a copy and leaves df unchanged
In [181]: s = pd.Series(np.random.randint(1,5,size=4),index=list('ABCD'))
In [182]: s
Out[182]:
A 3
B 2
C 1
D 3
dtype: int32
In [183]: df.append(s,ignore_index=True)
Out[183]:
A B C D
0 -0.836643 1.389491 -0.346328 -0.579350
1 -0.033253 -0.100592 -0.187342 0.077368
2 -0.492430 0.187553 0.282775 0.486783
3 -2.189806 0.527236 1.742809 0.341727
4 -0.610598 -1.127726 0.321672 1.154310
5 0.050277 0.000600 0.580841 2.216663
6 0.216428 -0.369792 -0.527231 0.252527
7 0.331808 -0.136121 0.056004 -1.125896
8 0.177764 -1.976727 -0.137631 -0.188156
9 -0.117127 -0.591025 -1.415468 0.451023
10 3.000000 2.000000 1.000000 3.000000
In [184]: s = pd.Series(np.random.randint(1,5,size=5),index=list('ABCDE'))
In [185]: df.append(s,ignore_index=True)
Out[185]:
A B C D E
0 -0.836643 1.389491 -0.346328 -0.579350 NaN
1 -0.033253 -0.100592 -0.187342 0.077368 NaN
2 -0.492430 0.187553 0.282775 0.486783 NaN
3 -2.189806 0.527236 1.742809 0.341727 NaN
4 -0.610598 -1.127726 0.321672 1.154310 NaN
5 0.050277 0.000600 0.580841 2.216663 NaN
6 0.216428 -0.369792 -0.527231 0.252527 NaN
7 0.331808 -0.136121 0.056004 -1.125896 NaN
8 0.177764 -1.976727 -0.137631 -0.188156 NaN
9 -0.117127 -0.591025 -1.415468 0.451023 NaN
10 4.000000 2.000000 3.000000 2.000000 4.0
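DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the calls above fail on current releases. The same row-append can be written with `pd.concat` by first turning the Series into a one-row frame. This is a sketch with fixed values; the name `appended` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 4)), columns=list('ABCD'))
s = pd.Series([3, 2, 1, 3], index=list('ABCD'))

# to_frame().T turns the Series into a single row whose columns are the Series labels.
appended = pd.concat([df, s.to_frame().T], ignore_index=True)
```

As with the old `append`, `df` itself is left unchanged and a new frame is returned.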
(5) Grouping and aggregation
Grouping by a single key
In [187]: df = pd.DataFrame({'A':['foo','bar','foo','bar','foo','bar','foo','foo'],'B':['one','two','three','two','two','three','one','one'],'C':np.random.randn(8),'D':np.random.randn(8)})
In [188]: df
Out[188]:
A B C D
0 foo one 0.601375 0.862727
1 bar two -1.414134 0.664557
2 foo three -0.106683 -0.926406
3 bar two 1.085907 -0.827664
4 foo two -0.697518 -0.669789
5 bar three -0.137290 0.042817
6 foo one 0.084015 0.767032
7 foo one 1.428494 -2.223576
In [189]: df.groupby('A').sum()
Out[189]:
C D
A
bar -0.465517 -0.120290
foo 1.309682 -2.190011
Grouping by two keys
In [191]: df.groupby(['A','B']).sum()
Out[191]:
C D
A B
bar three -0.137290 0.042817
    two -0.328227 -0.163107
foo one 2.113884 -0.593817
    three -0.106683 -0.926406
    two -0.697518 -0.669789
In [192]: df.groupby(['B','A']).sum()
Out[192]:
C D
B A
one foo 2.113884 -0.593817
three bar -0.137290 0.042817
      foo -0.106683 -0.926406
two bar -0.328227 -0.163107
    foo -0.697518 -0.669789
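On recent pandas releases, calling `sum()` on the whole grouped frame also tries to aggregate non-numeric columns, so it is often clearer to select the numeric column explicitly. The sketch below does that on a small frame with fixed values, and also shows `agg` computing several statistics at once. The names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'two', 'one', 'two'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# Single key: one sum per value of A.
by_a = df.groupby('A')['C'].sum()

# Two keys: the result is indexed by a (A, B) MultiIndex.
by_ab = df.groupby(['A', 'B'])['C'].agg(['sum', 'mean'])
```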
(5) Reshaping data
In [194]: tuples = list(zip(*[['bar','bar','baz','baz','foo','foo','qux','qux'],['one','two','one','two','one','two','one','two']]))
In [195]: tuples
Out[195]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
In [198]: index = pd.MultiIndex.from_tuples(tuples,names=['first','second'])
In [199]: index
Out[199]:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
In [200]: df = pd.DataFrame(np.random.randn(8,2),index=index,columns=['A','B'])
In [201]: df
Out[201]:
A B
first second
bar one -0.040812 -0.470402
    two -1.077897 0.426708
baz one -0.672487 0.022344
    two -0.515349 -0.701331
foo one -0.331600 1.568290
    two -0.673093 0.853233
qux one -1.193980 0.658491
    two 0.685413 -1.692874
In [202]: stacked = df.stack()
In [203]: stacked
Out[203]:
first second
bar one A -0.040812
        B -0.470402
    two A -1.077897
        B 0.426708
baz one A -0.672487
        B 0.022344
    two A -0.515349
        B -0.701331
foo one A -0.331600
        B 1.568290
    two A -0.673093
        B 0.853233
qux one A -1.193980
        B 0.658491
    two A 0.685413
        B -1.692874
dtype: float64
In [204]: stacked.index
Out[204]:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two'], ['A', 'B']],
           codes=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second', None])
In [205]: stacked.unstack()
Out[205]:
A B
first second
bar one -0.040812 -0.470402
    two -1.077897 0.426708
baz one -0.672487 0.022344
    two -0.515349 -0.701331
foo one -0.331600 1.568290
    two -0.673093 0.853233
qux one -1.193980 0.658491
    two 0.685413 -1.692874
In [206]: stacked.unstack().unstack()
Out[206]:
A B
second one two one two
first
bar -0.040812 -1.077897 -0.470402 0.426708
baz -0.672487 -0.515349 0.022344 -0.701331
foo -0.331600 -0.673093 1.568290 0.853233
qux -1.193980 0.685413 0.658491 -1.692874
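The round trip shown above (stack moves the column labels into the innermost row level, unstack moves the innermost row level back into the columns) can be checked on a smaller frame with known values. This is a minimal sketch; the index built with `from_product` and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['A', 'B'])

stacked = df.stack()        # Series with a three-level index: first, second, column label
restored = stacked.unstack()  # innermost level back to columns: same layout as df
wide = restored.unstack()     # 'second' also moved into the columns
```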
(6) Pivot tables
- When several values map to the same cell of the pivot table, their mean is used; when no value maps to a cell, it is NaN
[1] Each cell has at most one value, and some cells have none
In [207]: df = pd.DataFrame({'A':['one','one','two','three']*3,'B':['A','B','C']*4,'C':['foo','foo','foo','bar','bar','bar']*2,'D':np.random.randn(12),'E':np.random.randn(12)})
In [208]: df
Out[208]:
A B C D E
0 one A foo -1.658244 -1.000468
1 one B foo -1.268684 0.866844
2 two C foo 0.362601 0.881982
3 three A bar 0.530524 -0.856335
4 one B bar -0.711668 -1.042455
5 one C bar -0.424758 -1.493948
6 two A foo 1.105030 0.292543
7 three B foo 0.627286 0.601456
8 one C foo -1.357240 0.816328
9 one A bar 0.261102 0.173785
10 two B bar -0.357801 1.930272
11 three C bar -1.417299 -1.118276
In [211]: df.pivot_table(values=['D'],index=['A','B'],columns=['C'])
Out[211]:
D
C bar foo
A B
one A 0.261102 -1.658244
    B -0.711668 -1.268684
    C -0.424758 -1.357240
three A 0.530524 NaN
      B NaN 0.627286
      C -1.417299 NaN
two A NaN 1.105030
    B -0.357801 NaN
    C NaN 0.362601
[2] When several values map to one cell, the mean is taken
In [214]: df.pivot_table(values=['E'],index=['A'],columns=['C'])
Out[214]:
E
C bar foo
A
one -0.787540 0.227568
three -0.987305 0.601456
two 1.930272 0.587263
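Taking the mean is only the default. `pivot_table` accepts an `aggfunc` argument to choose another aggregation and a `fill_value` argument to replace empty cells. The sketch below uses a small frame with fixed values so both tables can be checked by hand; the names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['one', 'one', 'two', 'two'],
    'C': ['foo', 'foo', 'foo', 'bar'],
    'E': [1.0, 3.0, 5.0, 7.0],
})

# Default aggregation: the two ('one', 'foo') values are averaged; ('one', 'bar') is empty.
mean_table = df.pivot_table(values='E', index='A', columns='C')

# Sum instead of mean, with empty cells filled by 0.
sum_table = df.pivot_table(values='E', index='A', columns='C',
                           aggfunc='sum', fill_value=0)
```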
(7) Time series functions
[1] Resampling a time series
In [216]: rng = pd.date_range('20200417',periods=600,freq='s')
In [217]: rng
Out[217]:
DatetimeIndex(['2020-04-17 00:00:00', '2020-04-17 00:00:01',
               '2020-04-17 00:00:02', '2020-04-17 00:00:03',
               '2020-04-17 00:00:04', '2020-04-17 00:00:05',
               '2020-04-17 00:00:06', '2020-04-17 00:00:07',
               '2020-04-17 00:00:08', '2020-04-17 00:00:09',
               ...
               '2020-04-17 00:09:50', '2020-04-17 00:09:51',
               '2020-04-17 00:09:52', '2020-04-17 00:09:53',
               '2020-04-17 00:09:54', '2020-04-17 00:09:55',
               '2020-04-17 00:09:56', '2020-04-17 00:09:57',
               '2020-04-17 00:09:58', '2020-04-17 00:09:59'],
              dtype='datetime64[ns]', length=600, freq='S')
In [218]: s = pd.Series(np.random.randint(0,500,len(rng)),index=rng)
In [219]: s
Out[219]:
2020-04-17 00:00:00 301
2020-04-17 00:00:01 173
2020-04-17 00:00:02 126
2020-04-17 00:00:03 490
2020-04-17 00:00:04 182
2020-04-17 00:00:05 260
2020-04-17 00:00:06 224
2020-04-17 00:00:07 127
2020-04-17 00:00:08 33
2020-04-17 00:00:09 154
2020-04-17 00:00:10 145
2020-04-17 00:00:11 379
2020-04-17 00:00:12 66
2020-04-17 00:00:13 116
2020-04-17 00:00:14 119
2020-04-17 00:00:15 491
2020-04-17 00:00:16 84
2020-04-17 00:00:17 239
2020-04-17 00:00:18 171
2020-04-17 00:00:19 327
2020-04-17 00:00:20 165
2020-04-17 00:00:21 448
2020-04-17 00:00:22 205
2020-04-17 00:00:23 179
2020-04-17 00:00:24 158
2020-04-17 00:00:25 383
2020-04-17 00:00:26 139
2020-04-17 00:00:27 161
2020-04-17 00:00:28 141
2020-04-17 00:00:29 156
...
2020-04-17 00:09:30 76
2020-04-17 00:09:31 397
2020-04-17 00:09:32 437
2020-04-17 00:09:33 470
2020-04-17 00:09:34 433
2020-04-17 00:09:35 288
2020-04-17 00:09:36 367
2020-04-17 00:09:37 351
2020-04-17 00:09:38 407
2020-04-17 00:09:39 28
2020-04-17 00:09:40 259
2020-04-17 00:09:41 291
2020-04-17 00:09:42 42
2020-04-17 00:09:43 250
2020-04-17 00:09:44 284
2020-04-17 00:09:45 93
2020-04-17 00:09:46 356
2020-04-17 00:09:47 154
2020-04-17 00:09:48 275
2020-04-17 00:09:49 75
2020-04-17 00:09:50 369
2020-04-17 00:09:51 409
2020-04-17 00:09:52 330
2020-04-17 00:09:53 200
2020-04-17 00:09:54 158
2020-04-17 00:09:55 335
2020-04-17 00:09:56 296
2020-04-17 00:09:57 197
2020-04-17 00:09:58 399
2020-04-17 00:09:59 22
Freq: S, Length: 600, dtype: int32

# Resample into 2-minute bins and take the mean of each bin.
# The old how='mean' argument is deprecated; call .mean() on the resampler instead.
In [220]: s.resample('2Min').mean()
Out[220]:
2020-04-17 00:00:00 234.408333
2020-04-17 00:02:00 239.350000
2020-04-17 00:04:00 246.108333
2020-04-17 00:06:00 267.491667
2020-04-17 00:08:00 256.266667
Freq: 2T, dtype: float64
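Because the series above is random, the bin means cannot be checked by eye. As a rough sketch, the same resampling on a deterministic series (the values 0..599 are an assumption made for illustration) gives results that are easy to verify:

```python
import numpy as np
import pandas as pd

# One value per second for ten minutes; the values are simply 0..599.
rng = pd.date_range("2020-04-17", periods=600, freq="s")
s = pd.Series(np.arange(600), index=rng)

# Group into 2-minute bins (120 samples each) and average each bin.
resampled = s.resample("2min").mean()

print(resampled)
```

Each bin holds 120 consecutive integers, so the first mean is 59.5 and every following bin is 120 higher.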
【2】period_range
In [221]: rng = pd.period_range('2000Q1', '2016Q1', freq='Q')

In [222]: rng
Out[222]:
PeriodIndex(['2000Q1', '2000Q2', '2000Q3', '2000Q4', '2001Q1', '2001Q2',
             '2001Q3', '2001Q4', '2002Q1', '2002Q2', '2002Q3', '2002Q4',
             '2003Q1', '2003Q2', '2003Q3', '2003Q4', '2004Q1', '2004Q2',
             '2004Q3', '2004Q4', '2005Q1', '2005Q2', '2005Q3', '2005Q4',
             '2006Q1', '2006Q2', '2006Q3', '2006Q4', '2007Q1', '2007Q2',
             '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3', '2008Q4',
             '2009Q1', '2009Q2', '2009Q3', '2009Q4', '2010Q1', '2010Q2',
             '2010Q3', '2010Q4', '2011Q1', '2011Q2', '2011Q3', '2011Q4',
             '2012Q1', '2012Q2', '2012Q3', '2012Q4', '2013Q1', '2013Q2',
             '2013Q3', '2013Q4', '2014Q1', '2014Q2', '2014Q3', '2014Q4',
             '2015Q1', '2015Q2', '2015Q3', '2015Q4', '2016Q1'],
            dtype='period[Q-DEC]', freq='Q-DEC')
In [223]: rng.to_timestamp()
Out[223]:
DatetimeIndex(['2000-01-01', '2000-04-01', '2000-07-01', '2000-10-01',
               '2001-01-01', '2001-04-01', '2001-07-01', '2001-10-01',
               '2002-01-01', '2002-04-01', '2002-07-01', '2002-10-01',
               '2003-01-01', '2003-04-01', '2003-07-01', '2003-10-01',
               '2004-01-01', '2004-04-01', '2004-07-01', '2004-10-01',
               '2005-01-01', '2005-04-01', '2005-07-01', '2005-10-01',
               '2006-01-01', '2006-04-01', '2006-07-01', '2006-10-01',
               '2007-01-01', '2007-04-01', '2007-07-01', '2007-10-01',
               '2008-01-01', '2008-04-01', '2008-07-01', '2008-10-01',
               '2009-01-01', '2009-04-01', '2009-07-01', '2009-10-01',
               '2010-01-01', '2010-04-01', '2010-07-01', '2010-10-01',
               '2011-01-01', '2011-04-01', '2011-07-01', '2011-10-01',
               '2012-01-01', '2012-04-01', '2012-07-01', '2012-10-01',
               '2013-01-01', '2013-04-01', '2013-07-01', '2013-10-01',
               '2014-01-01', '2014-04-01', '2014-07-01', '2014-10-01',
               '2015-01-01', '2015-04-01', '2015-07-01', '2015-10-01',
               '2016-01-01'],
              dtype='datetime64[ns]', freq='QS-OCT')
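A shorter range makes the period-to-timestamp conversion easier to follow. This is a minimal sketch; the five-quarter range and the names `quarters` and `starts` are chosen only for illustration.

```python
import pandas as pd

# Five quarterly periods, 2000Q1 through 2001Q1.
quarters = pd.period_range("2000Q1", "2001Q1", freq="Q")

# to_timestamp() maps each period to the timestamp at which it starts.
starts = quarters.to_timestamp()

for period, start in zip(quarters, starts):
    print(period, "->", start.date())
```

A period stands for a whole span of time (here a quarter), while a timestamp is a single instant, which is why the conversion has to pick one point of the span, by default its start.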
【3】Time arithmetic
In [224]: pd.Timestamp('20200417') - pd.Timestamp('20200317')
Out[224]: Timedelta('31 days 00:00:00')

In [226]: pd.Timestamp('20200417') + pd.Timedelta(days=5)
Out[226]: Timestamp('2020-04-22 00:00:00')
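The same arithmetic also applies to a whole index at once. The sketch below repeats the two operations above and then shifts a small `DatetimeIndex`; the variable names and the 12-hour offset are illustrative assumptions.

```python
import pandas as pd

start = pd.Timestamp("2020-03-17")
end = pd.Timestamp("2020-04-17")

# Timestamp - Timestamp gives a Timedelta.
elapsed = end - start

# Timestamp + Timedelta gives a new Timestamp.
deadline = end + pd.Timedelta(days=5)

# Adding a Timedelta to a DatetimeIndex shifts every element.
idx = pd.date_range(start, periods=3, freq="D")
shifted = idx + pd.Timedelta(hours=12)

print(elapsed.days, deadline, list(shifted))
```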
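(8) Categorical data: sorting a categorical column follows the order of its categories, not the alphabetical order of the labels
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'd']})

In [228]: df
Out[228]:
id raw_grade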
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 d

In [229]: df['grade'] = df.raw_grade.astype('category')

In [230]: df
Out[230]:
id raw_grade grade
0 1 a a
1 2 b b
2 3 b b
3 4 a a
4 5 a a
5 6 d d

In [231]: df.grade
Out[231]:
0 a
1 b
2 b
3 a
4 a
5 d
Name: grade, dtype: category
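Categories (3, object): [a, b, d]

In [232]: df.grade.cat.categories
Out[232]: Index(['a', 'b', 'd'], dtype='object')

# Rename the categories. Assigning to .cat.categories directly is no longer
# supported in recent pandas versions; rename_categories does the same job.
In [234]: df['grade'] = df.grade.cat.rename_categories(['very good', 'good', 'bad'])

In [235]: df
Out[235]:
id raw_grade grade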
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 d bad

In [236]: df.sort_values(by='grade', ascending=False)
Out[236]:
id raw_grade grade
5 6 d bad
2 3 b good
1 2 b good
4 5 a very good
3 4 a very good
0 1 a very good
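To see that the sort follows category order rather than label text, compare it with a plain string sort. This sketch rebuilds the frame from the example; `rename_categories` is used for the renaming step, and the `by_label` comparison is an addition for illustration.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "d"]})

# Categories are created in sorted order of the raw values: a, b, d.
df["grade"] = df["raw_grade"].astype("category")

# Renaming keeps the positions: a -> very good, b -> good, d -> bad.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "bad"])

# Sorting the categorical column follows the category order.
by_category = df.sort_values(by="grade")["grade"].tolist()

# Sorting the same labels as plain strings is alphabetical instead.
by_label = sorted(df["grade"].astype(str))

print(by_category)
print(by_label)
```

The categorical sort puts "very good" first because it occupies the first category slot, whereas the string sort starts with "bad".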
(9) Reading and writing data
In [249]: df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

In [250]: df
Out[250]:
A B C D
0 -0.083321 -0.658209 -0.651313 0.709735
1 -0.477651 -1.144574 -0.424812 -2.597468
2 0.287646 0.465932 0.945663 -1.010607
3 0.638164 0.413887 0.143015 0.919356
4 -0.173809 2.310479 0.453230 0.400560
5 0.946080 -0.616891 -0.240875 -0.010665
6 0.026942 1.909838 0.993354 0.545339
7 1.585801 -1.026446 -1.568373 0.625591
8 -0.163700 0.270450 -0.131551 0.060142
9 0.529707 -0.423947 -0.416857 0.756358
10 0.068693 -0.142339 0.681443 0.245802
11 -1.070199 0.100707 -1.710141 -0.194633
12 -0.245831 0.049669 -1.648418 -1.496346
13 -0.232378 -1.347531 -1.612275 -0.429955
14 -0.728380 0.937256 -1.413875 0.806011
15 0.605852 -0.378269 -0.750708 0.075920
16 -1.610980 0.870669 0.278972 -0.948627
17 2.130313 -1.653204 1.402163 -0.170074
18 -0.655884 -0.585036 0.278200 -0.888715
19 -0.549018 0.195922 0.598360 -2.693013
20 -2.155961 0.847062 0.027725 1.469359
21 0.220326 0.557393 -1.017209 -0.907464
22 -0.756740 -0.997176 1.160416 -1.400701
23 -0.140183 -0.765795 0.333937 0.412158
24 -0.219093 0.026067 0.489387 -1.576976
25 -1.786427 0.242143 -2.148370 0.394657
26 0.005408 0.634260 1.099559 0.794257
27 0.171895 -0.149267 0.839116 -0.497465
28 0.804562 0.007311 0.088189 -1.777547
29 1.370591 0.234319 -2.708779 0.942077
.. ... ... ... ...
70 -0.583113 0.510314 0.777001 2.011854
71 0.077530 0.376037 0.312966 0.903666
72 1.948758 -0.757751 0.142735 -1.013880
73 -0.166173 -0.675107 -0.291241 0.634107
74 0.135204 -0.745272 0.985285 -2.196835
75 -0.215731 -0.481880 0.674186 0.056815
76 0.215328 1.222422 -1.631149 -0.999990
77 0.675607 0.666754 0.125106 0.310992
78 1.712664 1.568540 0.980551 1.275940
79 0.852778 0.947336 -0.677572 1.360434
80 -1.489507 0.110790 1.868481 0.339136
81 0.293409 1.250446 1.749058 -1.033468
82 1.082102 0.666194 1.247482 -0.644338
83 -2.916044 0.230605 0.750991 0.802369
84 1.989383 0.031088 0.390258 -0.003017
85 1.153904 -0.808503 -0.226332 -0.145706
86 -0.225708 -0.961442 -0.534315 -0.178530
87 -0.379955 1.432803 0.120019 0.422698
88 1.173942 -0.017247 0.509582 -0.063431
89 -1.491573 -0.089146 0.745232 0.674076
90 0.208917 -1.160621 1.063769 -0.082351
91 0.395819 1.232898 -0.068886 0.295321
92 -1.162220 -2.487554 3.304217 -0.017615
93 -0.513441 1.200510 -0.818692 1.625370
94 1.697458 0.278362 0.508730 -0.384557
95 -0.209887 -0.620793 0.011167 1.256941
96 -0.474955 0.223494 1.607891 1.392664
97 1.083230 -0.498918 -0.153474 -1.029604
98 -1.514452 0.741072 -0.924284 0.522594
99 0.159945 -0.077083 -0.655568 1.216421
[100 rows x 4 columns]

In [253]: df.to_csv('data.csv')

In [257]: pd.read_csv('data.csv', index_col=0)
Out[257]:
A B C D
0 -0.083321 -0.658209 -0.651313 0.709735
1 -0.477651 -1.144574 -0.424812 -2.597468
2 0.287646 0.465932 0.945663 -1.010607
3 0.638164 0.413887 0.143015 0.919356
4 -0.173809 2.310479 0.453230 0.400560
5 0.946080 -0.616891 -0.240875 -0.010665
6 0.026942 1.909838 0.993354 0.545339
7 1.585801 -1.026446 -1.568373 0.625591
8 -0.163700 0.270450 -0.131551 0.060142
9 0.529707 -0.423947 -0.416857 0.756358
10 0.068693 -0.142339 0.681443 0.245802
11 -1.070199 0.100707 -1.710141 -0.194633
12 -0.245831 0.049669 -1.648418 -1.496346
13 -0.232378 -1.347531 -1.612275 -0.429955
14 -0.728380 0.937256 -1.413875 0.806011
15 0.605852 -0.378269 -0.750708 0.075920
16 -1.610980 0.870669 0.278972 -0.948627
17 2.130313 -1.653204 1.402163 -0.170074
18 -0.655884 -0.585036 0.278200 -0.888715
19 -0.549018 0.195922 0.598360 -2.693013
20 -2.155961 0.847062 0.027725 1.469359
21 0.220326 0.557393 -1.017209 -0.907464
22 -0.756740 -0.997176 1.160416 -1.400701
23 -0.140183 -0.765795 0.333937 0.412158
24 -0.219093 0.026067 0.489387 -1.576976
25 -1.786427 0.242143 -2.148370 0.394657
26 0.005408 0.634260 1.099559 0.794257
27 0.171895 -0.149267 0.839116 -0.497465
28 0.804562 0.007311 0.088189 -1.777547
29 1.370591 0.234319 -2.708779 0.942077
.. ... ... ... ...
70 -0.583113 0.510314 0.777001 2.011854
71 0.077530 0.376037 0.312966 0.903666
72 1.948758 -0.757751 0.142735 -1.013880
73 -0.166173 -0.675107 -0.291241 0.634107
74 0.135204 -0.745272 0.985285 -2.196835
75 -0.215731 -0.481880 0.674186 0.056815
76 0.215328 1.222422 -1.631149 -0.999990
77 0.675607 0.666754 0.125106 0.310992
78 1.712664 1.568540 0.980551 1.275940
79 0.852778 0.947336 -0.677572 1.360434
80 -1.489507 0.110790 1.868481 0.339136
81 0.293409 1.250446 1.749058 -1.033468
82 1.082102 0.666194 1.247482 -0.644338
83 -2.916044 0.230605 0.750991 0.802369
84 1.989383 0.031088 0.390258 -0.003017
85 1.153904 -0.808503 -0.226332 -0.145706
86 -0.225708 -0.961442 -0.534315 -0.178530
87 -0.379955 1.432803 0.120019 0.422698
88 1.173942 -0.017247 0.509582 -0.063431
89 -1.491573 -0.089146 0.745232 0.674076
90 0.208917 -1.160621 1.063769 -0.082351
91 0.395819 1.232898 -0.068886 0.295321
92 -1.162220 -2.487554 3.304217 -0.017615
93 -0.513441 1.200510 -0.818692 1.625370
94 1.697458 0.278362 0.508730 -0.384557
95 -0.209887 -0.620793 0.011167 1.256941
96 -0.474955 0.223494 1.607891 1.392664
97 1.083230 -0.498918 -0.153474 -1.029604
98 -1.514452 0.741072 -0.924284 0.522594
99 0.159945 -0.077083 -0.655568 1.216421
[100 rows x 4 columns]
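The `index_col=0` argument matters because `to_csv` writes the index as the first, unnamed column. A minimal sketch of the round trip follows; it writes to a temporary directory rather than to `data.csv`, and the small frame is made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4), columns=list("ABCD"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    df.to_csv(path)  # the index is written as the first column

    # index_col=0 turns that first column back into the index.
    with_index = pd.read_csv(path, index_col=0)

    # Without it, the old index comes back as an extra data column.
    without_index = pd.read_csv(path)

print(with_index.shape, without_index.shape)
print(list(without_index.columns))
```

An alternative is to call `df.to_csv(path, index=False)` when the index carries no information, so nothing extra is written in the first place.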
Basic pandas operations
Reindexing
In [338]: s = pd.Series([1, 3, 5, 6, 8], index=list('acefh'))

In [339]: s
Out[339]:
a 1
c 3
e 5
f 6
h 8
dtype: int64

In [340]: s.index
Out[340]: Index(['a', 'c', 'e', 'f', 'h'], dtype='object')

In [341]: s.reindex(list('abcdefgh'))
Out[341]:
a 1.0
b NaN
c 3.0
d NaN
e 5.0
f 6.0
g NaN
h 8.0
dtype: float64
Reindexing with a default value
(1) Reindexing a Series
In [342]: s.reindex(list('abcdefgh'),fill_value=0)
Out[342]:
a 1
b 0
c 3
d 0
e 5
f 6
g 0
h 8
dtype: int64
Alternatively, fill each missing label with the previous value (forward fill)
In [343]: s.reindex(list('abcdefgh'),method='ffill')
Out[343]:
a 1
b 1
c 3
d 3
e 5
f 6
g 6
h 8
dtype: int64
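The three variants can be put side by side in one short script. This is just a recap sketch of the calls shown above; the variable names are chosen for illustration.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 6, 8], index=list("acefh"))
new_index = list("abcdefgh")

# Missing labels become NaN, which forces the dtype to float64.
plain = s.reindex(new_index)

# fill_value supplies a constant for missing labels; the dtype stays integer.
zero_filled = s.reindex(new_index, fill_value=0)

# method='ffill' copies the last preceding value (the index must be sorted).
forward_filled = s.reindex(new_index, method="ffill")

print(plain.dtype, zero_filled.tolist(), forward_filled.tolist())
```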
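(2) Reindexing a DataFrame
Note: reindex works on a copy of the values and returns a new object; the original DataFrame is not modified.
# Reindex the rows
In [347]: df.reindex(index=list('ABCDEFGH'), fill_value=0)
Out[347]:
one two three four five six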
A -0.031925 -0.547648 -1.445043 -1.674583 0.897179 -0.091507
B 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
D 2.360810 1.156675 0.379871 -0.686860 1.723596 0.614264
E 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
F -1.982705 0.738971 -0.769946 0.994929 -0.741597 0.149975
G 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
H -1.868423 -0.684187 -0.479999 -0.768320 0.530670 -2.508482
# Reindex the columns (the method fill option only takes effect for rows, not for columns)
In [349]: df.reindex(columns=['one', 'two', 'three', 'seven'], fill_value=0)
Out[349]:
one two three seven
A -0.031925 -0.547648 -1.445043 0
D 2.360810 1.156675 0.379871 0
F -1.982705 0.738971 -0.769946 0
H -1.868423 -0.684187 -0.479999 0
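That reindexing returns a new object can be checked by modifying the result and looking at the original. The frame below is a small stand-in with made-up integer values, not the random `df` used above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same row labels as the example above.
df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  index=list("ADFH"), columns=["one", "two"])

# New row labels are filled with 0; existing rows keep their values.
rows = df.reindex(index=list("ABCDEFGH"), fill_value=0)

# Columns absent from the original ('seven') are created and filled with 0;
# columns not listed ('two') are dropped.
cols = df.reindex(columns=["one", "seven"], fill_value=0)

# Changing the reindexed result leaves the original untouched.
rows.loc["A", "one"] = 100

print(rows)
print(cols)
print(df.loc["A", "one"])
```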
(3) apply (operates column by column by default)
In [351]: df = pd.DataFrame(np.arange(12).reshape(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))

In [352]: df
Out[352]:
A B C
one 0 1 2
two 3 4 5
three 6 7 8
four 9 10 11

In [353]: df.apply(lambda x: x.max() - x.min())
Out[353]:
A 9
B 9
C 9
dtype: int64

In [354]: df.apply(lambda x: x.max() - x.min(), axis=1)
Out[354]:
one 2
two 2
three 2
four 2
dtype: int64

In [355]: def min_max(x):
     ...:     return pd.Series([x.min(), x.max()], index=['min', 'max'])

In [357]: df.apply(min_max, axis=1)
Out[357]:
min max
one 0 2
two 3 5
three 6 8
four 9 11
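For simple reductions like the range, the result of `apply` can be cross-checked against the built-in vectorized methods. A short sketch using the same frame; the `vectorized_*` names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=["one", "two", "three", "four"], columns=list("ABC"))

# apply passes each column (axis=0, the default) or each row (axis=1)
# to the function as a Series.
col_range = df.apply(lambda x: x.max() - x.min())
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

# The same results without apply, using whole-frame reductions.
vectorized_col = df.max() - df.min()
vectorized_row = df.max(axis=1) - df.min(axis=1)

print(col_range.tolist(), row_range.tolist())
```

When a built-in reduction exists it is usually the faster choice; `apply` is most useful for functions that have no vectorized equivalent, such as `min_max` above.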
(4) applymap (calls the function on every single element of the DataFrame)
In [362]: df = pd.DataFrame(np.random.randn(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))

In [363]: df
Out[363]:
A B C
one -1.165380 0.992127 -0.577970
two -0.250636 0.337495 1.385477
three -1.471206 -0.631274 -1.700984
four -1.570891 0.623608 2.603508

In [364]: formater = lambda x: '%.03f' % x

In [365]: df.applymap(formater)
Out[365]:
A B C
one -1.165 0.992 -0.578
two -0.251 0.337 1.385
three -1.471 -0.631 -1.701
four -1.571 0.624 2.604

# Or, using str.format:
In [366]: formater = '{0:.03f}'.format

In [367]: df.applymap(formater)
Out[367]:
A B C
one -1.165 0.992 -0.578
two -0.251 0.337 1.385
three -1.471 -0.631 -1.701
four -1.571 0.624 2.604
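Newer pandas releases (2.1 and later) provide `DataFrame.map` as the element-wise method and deprecate `applymap`. The sketch below picks whichever one is available; the small frame and the name `elementwise` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1.23456, -0.5], [2.0, 3.14159]], columns=["A", "B"])

# Format every element to three decimal places.
fmt = "{:.3f}".format

# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
formatted = elementwise(fmt)

print(formatted)
```

Note that the result holds strings, not numbers, so it is suited to display rather than further computation.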
(5) Ranking
In [370]: s.rank()
Out[370]:
0 2.0
1 4.5
2 1.0
3 4.5
4 3.0
dtype: float64

In [371]: s.rank(method='first')
Out[371]:
0 2.0
1 4.0
2 1.0
3 5.0
4 3.0
dtype: float64
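The series being ranked is not shown above. The sketch below uses a hypothetical series whose values were chosen so that it yields the same ranks, which makes the tie handling visible; `method='min'` is added as one more option for comparison.

```python
import pandas as pd

# Hypothetical values: the two 6s are tied for 4th and 5th place.
s = pd.Series([3, 6, 2, 6, 4])

# Default method='average': tied values share the mean of their positions.
average_rank = s.rank()

# method='first': ties are broken by order of appearance.
first_rank = s.rank(method="first")

# method='min': tied values all get the lowest position of the group.
min_rank = s.rank(method="min")

print(average_rank.tolist())
print(first_rank.tolist())
print(min_rank.tolist())
```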
(6) Unique values
In [372]: s = pd.Series(list('abcdabcddbc'))

In [373]: s
Out[373]:
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
8 d
9 b
10 c
dtype: object

In [374]: s.value_counts()
Out[374]:
d 3
c 3
b 3
a 2
dtype: int64

In [375]: s.unique()
Out[375]: array(['a', 'b', 'c', 'd'], dtype=object)

In [377]: s.isin(['a', 'b', 'd'])
Out[377]:
0 True
1 True
2 False
3 True
4 True
5 True
6 False
7 True
8 True
9 True
10 False
dtype: bool

In [378]: s.isin(s.unique())
Out[378]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
dtype: bool
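The boolean Series returned by `isin` is typically used as a filter. A brief sketch on the same data; the names `counts` and `filtered` are illustrative.

```python
import pandas as pd

s = pd.Series(list("abcdabcddbc"))

# How often each distinct value occurs.
counts = s.value_counts()

# The distinct values, in order of first appearance.
distinct = s.unique().tolist()

# Use the isin mask to keep only the rows whose value is in the list.
filtered = s[s.isin(["a", "b", "d"])]

print(counts.to_dict())
print(distinct)
print(filtered.tolist())
```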