原文：

https://github.com/catalystfrank/Python4DataScience.CH

和大熊猫们(Pandas)一起游戏吧！

Pandas是Python的一个用于数据分析的库： http://pandas.pydata.org

API速查：http://pandas.pydata.org/pandas-docs/stable/api.html

基于NumPy,SciPy的功能，在其上补充了大量的数据操作（Data Manipulation）功能。

统计、分组、排序、透视表自由转换，如果你已经很熟悉结构化数据库（RDBMS）与Excel的功能，就会知道Pandas有过之而无不及！

0. 上手玩：Why Pandas?

普通的程序员看到一份数据会怎么做？

In [1]:

import codecs

import requests

import numpy as np

import scipy as sp

import scipy.stats as spstat

import pandas as pd

import datetime

import json

In [2]:

r = requests.get("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")

with codecs.open('S1EP3_Iris.txt','w',encoding='utf-8') as f:

    f.write(r.text)

In [3]:

with codecs.open('S1EP3_Iris.txt','r',encoding='utf-8') as f:

    lines = f.readlines()

for idx,line in enumerate(lines):

    print line,

    if idx==10:

        break

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa

Pandas的意义就在于

快速的识别结构化数据

In [19]:

import pandas as pd

irisdata = pd.read_csv('S1EP3_Iris.txt',header = None, encoding='utf-8')

#irisdata

快速的操作元数据

In [22]:

cnames = ['sepal_length','sepal_width','petal_length','petal_width','class']

irisdata.columns = cnames

#irisdata

快速过滤

In [11]:

#irisdata[irisdata['petal_width']==irisdata.petal_width.max()]

irisdata[irisdata['sepal_length']==irisdata.sepal_length.max()]

Out[11]:

	sepal_length	sepal_width	petal_length	petal_width	class
131	7.9	3.8	6.4	2	Iris-virginica

快速切片

In [23]:

#irisdata.iloc[::30,:2]

irisdata.iloc[::40,0:3]

Out[23]:

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
40	5.0	3.5	1.3
80	5.5	2.4	3.8
120	6.9	3.2	5.7

快速统计

In [31]:

#print irisdata['class'].value_counts()

for x in xrange(4):

    s = irisdata.iloc[:,x]

    print '{0:<12}'.format(s.name.upper()), " Statistics: ", \

    '{0:>5}  {1:>5}  {2:>5}  {3:>5}'.format(s.max(), s.min(), round(s.mean(),2),round(s.std(),2))

SEPAL_LENGTH  Statistics:    7.9    4.3   5.84   0.83
SEPAL_WIDTH   Statistics:    4.4    2.0   3.05   0.43
PETAL_LENGTH  Statistics:    6.9    1.0   3.76   1.76
PETAL_WIDTH   Statistics:    2.5    0.1    1.2   0.76

快速“MapReduce”

In [9]:

slogs = lambda x:sp.log(x)*x

entpy = lambda x:sp.exp((slogs(x.sum())-x.map(slogs).sum())/x.sum())

irisdata.groupby('class').agg(entpy)

Out[9]:

	sepal_length	sepal_width	petal_length	petal_width
class
Iris-setosa	49.878745	49.695242	49.654909	45.810069
Iris-versicolor	49.815081	49.680665	49.694505	49.452305
Iris-virginica	49.772059	49.714500	49.761700	49.545918

1. 欢迎来到大熊猫世界

Pandas的重要数据类型

DataFrame(二维表)
Series(一维序列)
Index(行索引，行级元数据)

1.1 Series：pandas的长枪(数据表中的一列或一行,观测向量,一维数组...)

数据世界中对于任意一个个体的全面观测，或者对于任意一组个体某一属性的观测，全部可以抽象为Series的概念。

用值构建一个Series：

由默认index和values组成。

In [10]:

Series1 = pd.Series(np.random.randn(4))

print Series1,type(Series1)

print Series1.index

print Series1.values

0   -0.909672
1    1.739425
2   -1.163028
3    0.408693
dtype: float64 <class 'pandas.core.series.Series'>
Int64Index([0, 1, 2, 3], dtype='int64')
[-0.90967166  1.73942495 -1.1630284   0.40869312]

Series支持过滤的原理就如同NumPy：

In [11]:

print Series1>0

print Series1[Series1>0]

0    False
1     True
2    False
3     True
dtype: bool
1    1.739425
3    0.408693
dtype: float64

当然也支持Broadcasting：

In [12]:

print Series1*2

print Series1+5

0   -1.819343
1    3.478850
2   -2.326057
3    0.817386
dtype: float64
0    4.090328
1    6.739425
2    3.836972
3    5.408693
dtype: float64

以及Universal Function：

In [13]:

print np.exp(Series1)

#NumPy Universal Function

f_np = np.frompyfunc(lambda x:np.exp(x*2+5),1,1)

print f_np(Series1)

0    0.402656
1    5.694068
2    0.312538
3    1.504850
dtype: float64
0    24.06255
1    4811.913
2    14.49702
3    336.0924
dtype: object

在序列上就使用行标，而不是创建一个2列的数据表，能够轻松辨别哪里是数据，哪里是元数据：

In [14]:

Series2 = pd.Series(Series1.values,index=['norm_'+unicode(i) for i in xrange(4)])

print Series2,type(Series2)

print Series2.index

print type(Series2.index)

print Series2.values

norm_0   -0.909672
norm_1    1.739425
norm_2   -1.163028
norm_3    0.408693
dtype: float64 <class 'pandas.core.series.Series'>
Index([u'norm_0', u'norm_1', u'norm_2', u'norm_3'], dtype='object')
<class 'pandas.core.index.Index'>
[-0.90967166  1.73942495 -1.1630284   0.40869312]

虽然行是有顺序的，但是仍然能够通过行级的index来访问到数据：

（当然也不尽然像Ordered Dict，因为行索引甚至可以重复，不推荐重复的行索引不代表不能用）

In [15]:

print Series2[['norm_0','norm_3']]

norm_0   -0.909672
norm_3    0.408693
dtype: float64

In [16]:

print 'norm_0' in Series2

print 'norm_6' in Series2

True
False

默认行索引就像行号一样：

In [17]:

print Series1.index

Int64Index([0, 1, 2, 3], dtype='int64')

从Key不重复的Ordered Dict或者从Dict来定义Series就不需要担心行索引重复：

In [18]:

Series3_Dict = {"Japan":"Tokyo","S.Korea":"Seoul","China":"Beijing"}

Series3_pdSeries = pd.Series(Series3_Dict)

print Series3_pdSeries

print Series3_pdSeries.values

print Series3_pdSeries.index

China      Beijing
Japan        Tokyo
S.Korea      Seoul
dtype: object
['Beijing' 'Tokyo' 'Seoul']
Index([u'China', u'Japan', u'S.Korea'], dtype='object')

与Dict区别一：有序

In [19]:

Series4_IndexList = ["Japan","China","Singapore","S.Korea"]

Series4_pdSeries = pd.Series( Series3_Dict ,index = Series4_IndexList)

print Series4_pdSeries

print Series4_pdSeries.values

print Series4_pdSeries.index

print Series4_pdSeries.isnull()

print Series4_pdSeries.notnull()

Japan          Tokyo
China        Beijing
Singapore        NaN
S.Korea        Seoul
dtype: object
['Tokyo' 'Beijing' nan 'Seoul']
Index([u'Japan', u'China', u'Singapore', u'S.Korea'], dtype='object')
Japan        False
China        False
Singapore     True
S.Korea      False
dtype: bool
Japan         True
China         True
Singapore    False
S.Korea       True
dtype: bool

与Dict区别二： index内值可以重复，尽管不推荐。

In [20]:

Series5_IndexList = ['A','B','B','C']

Series5 = pd.Series(Series1.values,index = Series5_IndexList)

print Series5

print Series5[['B','A']]

A   -0.909672
B    1.739425
B   -1.163028
C    0.408693
dtype: float64
B    1.739425
B   -1.163028
A   -0.909672
dtype: float64

整个序列级别的元数据信息：name

当数据序列以及index本身有了名字，就可以更方便的进行后续的数据关联啦！

In [21]:

print Series4_pdSeries.name

print Series4_pdSeries.index.name

None
None

In [22]:

Series4_pdSeries.name = "Capital Series"

Series4_pdSeries.index.name = "Nation"

print Series4_pdSeries

pd.DataFrame(Series4_pdSeries)

Nation
Japan          Tokyo
China        Beijing
Singapore        NaN
S.Korea        Seoul
Name: Capital Series, dtype: object

Out[22]:

	Capital Series
Nation
Japan	Tokyo
China	Beijing
Singapore	NaN
S.Korea	Seoul

1.2 DataFrame：pandas的战锤(数据表，二维数组)

Series的有序集合，就像R的DataFrame一样方便。

仔细想想，绝大部分的数据形式都可以表现为DataFrame。

从NumPy二维数组、从文件或者从数据库定义：数据虽好，勿忘列名

In [23]:

dataNumPy = np.asarray([('Japan','Tokyo',4000),\

                ('S.Korea','Seoul',1300),('China','Beijing',9100)])

DF1 = pd.DataFrame(dataNumPy,columns=['nation','capital','GDP'])

DF1

Out[23]:

	nation	capital	GDP
0	Japan	Tokyo	4000
1	S.Korea	Seoul	1300
2	China	Beijing	9100

等长的列数据保存在一个字典里（JSON）：很不幸，字典key是无序的

In [24]:

dataDict = {'nation':['Japan','S.Korea','China'],\

        'capital':['Tokyo','Seoul','Beijing'],'GDP':[4900,1300,9100]}

DF2 = pd.DataFrame(dataDict)

DF2

Out[24]:

	GDP	capital	nation
0	4900	Tokyo	Japan
1	1300	Seoul	S.Korea
2	9100	Beijing	China

从另一个DataFrame定义DataFrame：啊，强迫症犯了！

In [25]:

DF21 = pd.DataFrame(DF2,columns=['nation','capital','GDP'])

DF21

Out[25]:

	nation	capital	GDP
0	Japan	Tokyo	4900
1	S.Korea	Seoul	1300
2	China	Beijing	9100

In [26]:

DF22 = pd.DataFrame(DF2,columns=['nation','capital','GDP'],index = [2,0,1])

DF22

Out[26]:

	nation	capital	GDP
2	China	Beijing	9100
0	Japan	Tokyo	4900
1	S.Korea	Seoul	1300

从DataFrame中取出列？两种方法（与JavaScript完全一致！）

'.'的写法容易与其他预留关键字产生冲突
'[ ]'的写法最安全。

In [27]:

print DF22.nation

print DF22.capital

print DF22['GDP']

2      China
0      Japan
1    S.Korea
Name: nation, dtype: object
2    Beijing
0      Tokyo
1      Seoul
Name: capital, dtype: object
2    9100
0    4900
1    1300
Name: GDP, dtype: int64

从DataFrame中取出行？（至少）两种方法：

In [28]:

print DF22[0:1] #给出的实际是DataFrame

print DF22.ix[0] #通过对应Index给出行

  nation  capital   GDP
2  China  Beijing  9100
nation     Japan
capital    Tokyo
GDP         4900
Name: 0, dtype: object

像NumPy切片一样的终极招式：iloc

In [29]:

print DF22.iloc[0,:]

print DF22.iloc[:,0]

nation       China
capital    Beijing
GDP           9100
Name: 2, dtype: object
2      China
0      Japan
1    S.Korea
Name: nation, dtype: object

听说你从Alter Table地狱来，大熊猫笑了

然而动态增加列无法用"."的方式完成，只能用"[ ]"

In [30]:

DF22['population'] = [1600,130,55]

DF22['region'] = 'East_Asian'

DF22

Out[30]:

	nation	capital	GDP	population	region
2	China	Beijing	9100	1600	East_Asian
0	Japan	Tokyo	4900	130	East_Asian
1	S.Korea	Seoul	1300	55	East_Asian

In [ ]:

1.3 Index：pandas进行数据操纵的鬼牌（行级索引）

行级索引是

元数据
可能由真实数据产生，因此可以视作数据
可以由多重索引也就是多个列组合而成
可以和列名进行交换，也可以进行堆叠和展开，达到Excel透视表效果

Index有四种...哦不，很多种写法，一些重要的索引类型包括

pd.Index（普通）
Int64Index（数值型索引）
MultiIndex（多重索引，在数据操纵中更详细描述）
DatetimeIndex（以时间格式作为索引）
PeriodIndex （含周期的时间格式作为索引）

直接定义普通索引，长得就和普通的Series一样

In [31]:

index_names = ['a','b','c']

Series_for_Index = pd.Series(index_names)

print pd.Index(index_names)

print pd.Index(Series_for_Index)

Index([u'a', u'b', u'c'], dtype='object')
Index([u'a', u'b', u'c'], dtype='object')

可惜Immutable，牢记！

In [32]:

index_names = ['a','b','c']

index0 = pd.Index(index_names)

print index0.get_values()

index0[2] = 'd'

['a' 'b' 'c']

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-32-f34da0a8623c> in <module>()
 2 index0 = pd.Index(index_names)  3 print index0.get_values() ----> 4 index0[2] = 'd' /Users/wangweiyang/anaconda/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in __setitem__(self, key, value)  1055   1056 def __setitem__(self, key, value): -> 1057 raise TypeError("Indexes does not support mutable operations")  1058   1059 def __getitem__(self, key): TypeError: Indexes does not support mutable operations

扔进去一个含有多元组的List，就有了MultiIndex

可惜，如果这个List Comprehension改成小括号，就不对了。

In [33]:

#print [('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(4) for y in xrange(4)]

multi1 = pd.Index([('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(4) for y in xrange(4)])

multi1.name = ['index1','index2']

print multi1

MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])

对于Series来说，如果拥有了多重Index，数据，变形！

下列代码说明：

二重MultiIndex的Series可以unstack()成DataFrame
DataFrame可以stack成拥有二重MultiIndex的Series

In [34]:

data_for_multi1 = pd.Series(xrange(0,16),index=multi1)

data_for_multi1

Out[34]:

Row_1  Col_1     0Col_2     1Col_3     2Col_4     3
Row_2  Col_1     4Col_2     5Col_3     6Col_4     7
Row_3  Col_1     8Col_2     9Col_3    10Col_4    11
Row_4  Col_1    12Col_2    13Col_3    14Col_4    15
dtype: int64

In [35]:

data_for_multi1.unstack()

Out[35]:

	Col_1	Col_2	Col_3	Col_4
Row_1	0	1	2	3
Row_2	4	5	6	7
Row_3	8	9	10	11
Row_4	12	13	14	15

In [36]:

data_for_multi1.unstack().stack()

Out[36]:

Row_1  Col_1     0Col_2     1Col_3     2Col_4     3
Row_2  Col_1     4Col_2     5Col_3     6Col_4     7
Row_3  Col_1     8Col_2     9Col_3    10Col_4    11
Row_4  Col_1    12Col_2    13Col_3    14Col_4    15
dtype: int64

我们来看一下非平衡数据的例子：

Row_1,2,3,4和Col_1,2,3,4并不是全组合的。

In [37]:

multi2 = pd.Index([('Row_'+str(x),'Col_'+str(y+1)) \

                   for x in xrange(5) for y in xrange(x)])

multi2

Out[37]:

MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],labels=[[0, 1, 1, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]])

In [38]:

data_for_multi2 = pd.Series(np.arange(10),index = multi2)

data_for_multi2

Out[38]:

Row_1  Col_1    0
Row_2  Col_1    1Col_2    2
Row_3  Col_1    3Col_2    4Col_3    5
Row_4  Col_1    6Col_2    7Col_3    8Col_4    9
dtype: int64

In [39]:

data_for_multi2.unstack()

Out[39]:

	Col_1	Col_2	Col_3	Col_4
Row_1	0	NaN	NaN	NaN
Row_2	1	2	NaN	NaN
Row_3	3	4	5	NaN
Row_4	6	7	8	9

In [40]:

data_for_multi2.unstack().stack()

Out[40]:

Row_1  Col_1    0
Row_2  Col_1    1Col_2    2
Row_3  Col_1    3Col_2    4Col_3    5
Row_4  Col_1    6Col_2    7Col_3    8Col_4    9
dtype: float64

DateTime标准库如此好用，你值得拥有

In [41]:

dates = [datetime.datetime(2015,1,1),datetime.datetime(2015,1,8),datetime.datetime(2015,1,30)]

pd.DatetimeIndex(dates)

Out[41]:

DatetimeIndex(['2015-01-01', '2015-01-08', '2015-01-30'], dtype='datetime64[ns]', freq=None, tz=None)

如果你不仅需要时间格式统一，时间频率也要统一的话

In [42]:

periodindex1 = pd.period_range('2015-01','2015-04',freq='M')

print periodindex1

PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04'], dtype='int64', freq='M')

月级精度和日级精度如何转换？

有的公司统一以1号代表当月，有的公司统一以最后一天代表当月，转化起来很麻烦，可以asfreq

In [43]:

print periodindex1.asfreq('D',how='start')

print periodindex1.asfreq('D',how='end')

PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01'], dtype='int64', freq='D')
PeriodIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30'], dtype='int64', freq='D')

最后的最后，我要真正把两种频率的时间精度匹配上？

In [44]:

periodindex_mon = pd.period_range('2015-01','2015-03',freq='M').asfreq('D',how='start')

periodindex_day = pd.period_range('2015-01-01','2015-03-31',freq='D')

print periodindex_mon

print periodindex_day

PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01'], dtype='int64', freq='D')
PeriodIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04','2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08','2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12','2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16','2015-01-17', '2015-01-18', '2015-01-19', '2015-01-20','2015-01-21', '2015-01-22', '2015-01-23', '2015-01-24','2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28','2015-01-29', '2015-01-30', '2015-01-31', '2015-02-01','2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05','2015-02-06', '2015-02-07', '2015-02-08', '2015-02-09','2015-02-10', '2015-02-11', '2015-02-12', '2015-02-13','2015-02-14', '2015-02-15', '2015-02-16', '2015-02-17','2015-02-18', '2015-02-19', '2015-02-20', '2015-02-21','2015-02-22', '2015-02-23', '2015-02-24', '2015-02-25','2015-02-26', '2015-02-27', '2015-02-28', '2015-03-01','2015-03-02', '2015-03-03', '2015-03-04', '2015-03-05','2015-03-06', '2015-03-07', '2015-03-08', '2015-03-09','2015-03-10', '2015-03-11', '2015-03-12', '2015-03-13','2015-03-14', '2015-03-15', '2015-03-16', '2015-03-17','2015-03-18', '2015-03-19', '2015-03-20', '2015-03-21','2015-03-22', '2015-03-23', '2015-03-24', '2015-03-25','2015-03-26', '2015-03-27', '2015-03-28', '2015-03-29','2015-03-30', '2015-03-31'],dtype='int64', freq='D')

粗粒度数据＋reindex＋ffill/bfill

In [45]:

#print pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day)

full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day)

full_ts

Out[45]:

2015-01-01    2015-01-01
2015-01-02           NaN
2015-01-03           NaN
2015-01-04           NaN
2015-01-05           NaN
2015-01-06           NaN
2015-01-07           NaN
2015-01-08           NaN
2015-01-09           NaN
2015-01-10           NaN
2015-01-11           NaN
2015-01-12           NaN
2015-01-13           NaN
2015-01-14           NaN
2015-01-15           NaN
2015-01-16           NaN
2015-01-17           NaN
2015-01-18           NaN
2015-01-19           NaN
2015-01-20           NaN
2015-01-21           NaN
2015-01-22           NaN
2015-01-23           NaN
2015-01-24           NaN
2015-01-25           NaN
2015-01-26           NaN
2015-01-27           NaN
2015-01-28           NaN
2015-01-29           NaN
2015-01-30           NaN...
2015-03-02           NaN
2015-03-03           NaN
2015-03-04           NaN
2015-03-05           NaN
2015-03-06           NaN
2015-03-07           NaN
2015-03-08           NaN
2015-03-09           NaN
2015-03-10           NaN
2015-03-11           NaN
2015-03-12           NaN
2015-03-13           NaN
2015-03-14           NaN
2015-03-15           NaN
2015-03-16           NaN
2015-03-17           NaN
2015-03-18           NaN
2015-03-19           NaN
2015-03-20           NaN
2015-03-21           NaN
2015-03-22           NaN
2015-03-23           NaN
2015-03-24           NaN
2015-03-25           NaN
2015-03-26           NaN
2015-03-27           NaN
2015-03-28           NaN
2015-03-29           NaN
2015-03-30           NaN
2015-03-31           NaN
Freq: D, dtype: object

In [46]:

full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day,method='ffill')

full_ts

Out[46]:

2015-01-01    2015-01-01
2015-01-02    2015-01-01
2015-01-03    2015-01-01
2015-01-04    2015-01-01
2015-01-05    2015-01-01
2015-01-06    2015-01-01
2015-01-07    2015-01-01
2015-01-08    2015-01-01
2015-01-09    2015-01-01
2015-01-10    2015-01-01
2015-01-11    2015-01-01
2015-01-12    2015-01-01
2015-01-13    2015-01-01
2015-01-14    2015-01-01
2015-01-15    2015-01-01
2015-01-16    2015-01-01
2015-01-17    2015-01-01
2015-01-18    2015-01-01
2015-01-19    2015-01-01
2015-01-20    2015-01-01
2015-01-21    2015-01-01
2015-01-22    2015-01-01
2015-01-23    2015-01-01
2015-01-24    2015-01-01
2015-01-25    2015-01-01
2015-01-26    2015-01-01
2015-01-27    2015-01-01
2015-01-28    2015-01-01
2015-01-29    2015-01-01
2015-01-30    2015-01-01...
2015-03-02    2015-03-01
2015-03-03    2015-03-01
2015-03-04    2015-03-01
2015-03-05    2015-03-01
2015-03-06    2015-03-01
2015-03-07    2015-03-01
2015-03-08    2015-03-01
2015-03-09    2015-03-01
2015-03-10    2015-03-01
2015-03-11    2015-03-01
2015-03-12    2015-03-01
2015-03-13    2015-03-01
2015-03-14    2015-03-01
2015-03-15    2015-03-01
2015-03-16    2015-03-01
2015-03-17    2015-03-01
2015-03-18    2015-03-01
2015-03-19    2015-03-01
2015-03-20    2015-03-01
2015-03-21    2015-03-01
2015-03-22    2015-03-01
2015-03-23    2015-03-01
2015-03-24    2015-03-01
2015-03-25    2015-03-01
2015-03-26    2015-03-01
2015-03-27    2015-03-01
2015-03-28    2015-03-01
2015-03-29    2015-03-01
2015-03-30    2015-03-01
2015-03-31    2015-03-01
Freq: D, dtype: object

关于索引，方便的操作有？

前面描述过了，索引有序，重复，但一定程度上又能通过key来访问，也就是说，某些集合操作都是可以支持的。

In [47]:

index1 = pd.Index(['A','B','B','C','C'])

index2 = pd.Index(['C','D','E','E','F'])

index3 = pd.Index(['B','C','A'])

print index1.append(index2)

print index1.difference(index2)

print index1.intersection(index2)

print index1.union(index2) # Support unique-value Index well

print index1.isin(index2)

print index1.delete(2)

print index1.insert(0,'K') # Not suggested

print index3.drop('A') # Support unique-value Index well

print index1.is_monotonic,index2.is_monotonic,index3.is_monotonic

print index1.is_unique,index2.is_unique,index3.is_unique

Index([u'A', u'B', u'B', u'C', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')
Index([u'A', u'B'], dtype='object')
Index([u'C', u'C'], dtype='object')
Index([u'A', u'B', u'B', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')
[False False False  True  True]
Index([u'A', u'B', u'C', u'C'], dtype='object')
Index([u'K', u'A', u'B', u'B', u'C', u'C'], dtype='object')
Index([u'B', u'C'], dtype='object')
True True False
False False True

2. 大熊猫世界来去自如：Pandas的I/O

老生常谈，从基础来看，我们仍然关心pandas对于与外部数据是如何交互的。

2.1 结构化数据输入输出

read_csv与to_csv 是一对输入输出的工具，read_csv直接返回pandas.DataFrame，而to_csv只要执行命令即可写文件
- read_table：功能类似
- read_fwf：操作fixed width file
read_excel与to_excel方便的与excel交互

还记得刚开始的例子吗？

header 表示数据中是否存在列名，如果在第0行就写就写0，并且开始读数据时跳过相应的行数，不存在可以写none
names 表示要用给定的列名来作为最终的列名
encoding 表示数据集的字符编码，通常而言一份数据为了方便的进行文件传输都以utf-8作为标准

提问：下列例子中，header=4，names=cnames时，究竟会读到怎样的数据？

In [48]:

print cnames

irisdata = pd.read_csv('S1EP3_Iris.txt',header = None, names = cnames,\

                       encoding='utf-8')

irisdata[::30]

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

Out[48]:

	sepal_length	sepal_width	petal_length	petal_width	class
0	5.1	3.5	1.4	0.2	Iris-setosa
30	4.8	3.1	1.6	0.2	Iris-setosa
60	5.0	2.0	3.5	1.0	Iris-versicolor
90	5.5	2.6	4.4	1.2	Iris-versicolor
120	6.9	3.2	5.7	2.3	Iris-virginica

希望了解全部参数的请移步API：

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

这里介绍一些常用的参数：

读取处理：

skiprows：跳过一定的行数
nrows：仅读取一定的行数
skipfooter：尾部有固定的行数永不读取
skip_blank_lines：空行跳过

内容处理：

sep/delimiter：分隔符很重要，常见的有逗号，空格和Tab('\t')
na_values：指定应该被当作na_values的数值
thousands：处理数值类型时，每千位分隔符并不统一 (1.234.567,89或者1,234,567.89都可能)，此时要把字符串转化为数字需要指明千位分隔符

收尾处理：

index_col：将真实的某列（列的数目，甚至列名）当作index
squeeze：仅读到一列时，不再保存为pandas.DataFrame而是pandas.Series

2.1.x Excel ... ?

对于存储着极为规整数据的Excel而言，其实是没必要一定用Excel来存，尽管Pandas也十分友好的提供了I/O接口。

In [49]:

irisdata.to_excel('S1EP3_irisdata.xls',index = None,encoding='utf-8')

irisdata_from_excel = pd.read_excel('S1EP3_irisdata.xls',header=0, encoding='utf-8')

irisdata_from_excel[::30]

Out[49]:

	sepal_length	sepal_width	petal_length	petal_width	class
0	5.1	3.5	1.4	0.2	Iris-setosa
30	4.8	3.1	1.6	0.2	Iris-setosa
60	5.0	2.0	3.5	1.0	Iris-versicolor
90	5.5	2.6	4.4	1.2	Iris-versicolor
120	6.9	3.2	5.7	2.3	Iris-virginica

唯一重要的参数：sheetname=k，标志着一个excel的第k个sheet页将会被取出。（从0开始）

2.2 半结构化数据

JSON：网络传输中常用的一种数据格式。

仔细看一下，实际上这就是我们平时收集到异源数据的风格是一致的：

列名不能完全匹配
关联键可能并不唯一
元数据被保存在数据里

In [50]:

json_data = [{'name':'Wang','sal':50000,'job':'VP'},\

             {'name':'Zhang','job':'Manager','report':'VP'},\

             {'name':'Li','sal':5000,'report':'Manager'}]

data_employee = pd.read_json(json.dumps(json_data))

data_employee_ri = data_employee.reindex(columns=['name','job','sal','report'])

data_employee_ri

Out[50]:

	name	job	sal	report
0	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager

2.3 数据库连接流程（Optional）

使用下列包，通过数据库配置建立Connection

pymysql
pyODBC
cx_Oracle

通过pandas.read_sql_query,read_sql_table,to_sql进行数据库操作。

Python与数据库的交互方案有很多种，从数据分析师角度看pandas方案比较适合，之后的讲义中会结合SQL语法进行讲解。

进行数据库连接首先你需要类似的这样一组信息：

In [ ]:

IP = '127.0.0.1'

us = 'root'

pw = '123456'

举例说明如果是MySQL：

In [ ]:

import pymysql

import pymysql.cursors

connection = pymysql.connect(host=IP,\

                             user=us,\

                             password=pw,\

                             charset='utf8mb4',\

                             cursorclass=pymysql.cursors.DictCursor)

#pd.read_sql_query("sql",connection)

#df.to_sql('tablename',connection,flavor='mysql')

3. 深入Pandas数据操纵

在第一部分的基础上，数据会有更多种操纵方式：

通过列名、行index来取数据，结合ix、iloc灵活的获取数据的一个子集（第一部分已经介绍）
按记录拼接（就像Union All）或者关联（join）
方便的自定义函数映射
排序
缺失值处理
与Excel一样灵活的数据透视表（在第四部分更详细介绍）

3.1 数据整合：方便灵活

3.1.1 横向拼接：直接DataFrame

In [51]:

pd.DataFrame([np.random.rand(2),np.random.rand(2),np.random.rand(2)],columns=['C1','C2'])

Out[51]:

	C1	C2
0	0.958384	0.826066
1	0.607771	0.687302
2	0.943502	0.647464

3.1.2 横向拼接：Concatenate

In [52]:

pd.concat([data_employee_ri,data_employee_ri,data_employee_ri])

Out[52]:

	name	job	sal	report
0	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager
0	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager
0	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager

In [53]:

pd.concat([data_employee_ri,data_employee_ri,data_employee_ri],ignore_index=True)

Out[53]:

	name	job	sal	report
0	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager
3	Wang	VP	50000	NaN
4	Zhang	Manager	NaN	VP
5	Li	NaN	5000	Manager
6	Wang	VP	50000	NaN
7	Zhang	Manager	NaN	VP
8	Li	NaN	5000	Manager

3.1.3 纵向拼接：Merge

根据数据列关联，使用on关键字

可以指定一列或多列
可以使用left_on和right_on

In [54]:

pd.merge(data_employee_ri,data_employee_ri,on='name')

Out[54]:

	name	job_x	sal_x	report_x	job_y	sal_y	report_y
0	Wang	VP	50000	NaN	VP	50000	NaN
1	Zhang	Manager	NaN	VP	Manager	NaN	VP
2	Li	NaN	5000	Manager	NaN	5000	Manager

In [55]:

pd.merge(data_employee_ri,data_employee_ri,on=['name','job'])

Out[55]:

	name	job	sal_x	report_x	sal_y	report_y
0	Wang	VP	50000	NaN	50000	NaN
1	Zhang	Manager	NaN	VP	NaN	VP
2	Li	NaN	5000	Manager	5000	Manager

根据index关联，可以直接使用left_index和right_index

In [56]:

data_employee_ri.index.name = 'index1'

pd.merge(data_employee_ri,data_employee_ri,\

         left_index='index1',right_index='index1')

Out[56]:

	name_x	job_x	sal_x	report_x	name_y	job_y	sal_y	report_y
index1
0	Wang	VP	50000	NaN	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager	Li	NaN	5000	Manager

TIPS: 增加how关键字，并指定

how = 'inner'
how = 'left'
how = 'right'
how = 'outer'

结合how，可以看到merge基本再现了SQL应有的功能，并保持代码整洁。

In [57]:

DF31xA = pd.DataFrame({'name':[u'老王',u'老张',u'老李'],'sal':[5000,3000,1000]})

DF31xA

Out[57]:

	name	sal
0	老王	5000
1	老张	3000
2	老李	1000

In [58]:

DF31xB = pd.DataFrame({'name':[u'老王',u'老刘'],'job':['VP','Manager']})

DF31xB

Out[58]:

	job	name
0	VP	老王
1	Manager	老刘

how='left': 保留左表信息

In [59]:

pd.merge(DF31xA,DF31xB,on='name',how='left')

Out[59]:

	name	sal	job
0	老王	5000	VP
1	老张	3000	NaN
2	老李	1000	NaN

how='right': 保留右表信息

In [60]:

pd.merge(DF31xA,DF31xB,on='name',how='right')

Out[60]:

	name	sal	job
0	老王	5000	VP
1	老刘	NaN	Manager

how='inner': 保留两表交集信息，这样尽量避免出现缺失值

In [61]:

pd.merge(DF31xA,DF31xB,on='name',how='inner')

Out[61]:

	name	sal	job
0	老王	5000	VP

how='outer': 保留两表并集信息，这样会导致缺失值，但最大程度的整合了已有信息

In [62]:

pd.merge(DF31xA,DF31xB,on='name',how='outer')

Out[62]:

	name	sal	job
0	老王	5000	VP
1	老张	3000	NaN
2	老李	1000	NaN
3	老刘	NaN	Manager

3.2 数据清洗三剑客

接下来的三个功能，map,applymap,apply,功能，是绝大多数数据分析师在数据清洗这一步骤中的必经之路。

他们分别回答了以下问题：

我想根据一列数据新做一列数据，怎么办？（Series->Series）
我想根据整张表的数据新做整张表，怎么办？（DataFrame->DataFrame）
我想根据很多列的数据新做一列数据，怎么办？（DataFrame->Series）

不要再写什么for循环了！改变思维，提高编码和执行效率

In [63]:

dataNumPy32 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])

DF32 = pd.DataFrame(dataNumPy,columns=['nation','capital','GDP'])

DF32

Out[63]:

	nation	capital	GDP
0	Japan	Tokyo	4000
1	S.Korea	Seoul	1300
2	China	Beijing	9100

map: 以相同规则将一列数据作一个映射，也就是进行相同函数的处理

In [64]:

def GDP_Factorize(v):

    fv = np.float64(v)

    if fv > 6000.0:

        return 'High'

    elif fv < 2000.0:

        return 'Low'

    else:

        return 'Medium'

DF32['GDP_Level'] = DF32['GDP'].map(GDP_Factorize)

DF32['NATION'] = DF32.nation.map(str.upper)

DF32

Out[64]:

	nation	capital	GDP	GDP_Level	NATION
0	Japan	Tokyo	4000	Medium	JAPAN
1	S.Korea	Seoul	1300	Low	S.KOREA
2	China	Beijing	9100	High	CHINA

类似的功能还有applymap，可以对一个dataframe里面每一个元素像map那样全局操作

In [65]:

DF32.applymap(lambda x: float(x)*2 if x.isdigit() else x.upper())

Out[65]:

	nation	capital	GDP	GDP_Level	NATION
0	JAPAN	TOKYO	8000	MEDIUM	JAPAN
1	S.KOREA	SEOUL	2600	LOW	S.KOREA
2	CHINA	BEIJING	18200	HIGH	CHINA

apply则可以对一个DataFrame操作得到一个Series

他会有点像我们后面介绍的agg,但是apply可以按行操作和按列操作，用axis控制即可。

In [66]:

DF32.apply(lambda x:x['nation']+x['capital']+'_'+x['GDP'],axis=1)

Out[66]:

0      JapanTokyo_4000
1    S.KoreaSeoul_1300
2    ChinaBeijing_9100
dtype: object

3.3 数据排序

sort: 按一列或者多列的值进行行级排序
sort_index: 根据index里的取值进行排序，而且可以根据axis决定是重排行还是列

In [67]:

dataNumPy33 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])

DF33 = pd.DataFrame(dataNumPy33,columns=['nation','capital','GDP'])

DF33

Out[67]:

	nation	capital	GDP
0	Japan	Tokyo	4000
1	S.Korea	Seoul	1300
2	China	Beijing	9100

In [68]:

DF33.sort(['capital','nation'])

Out[68]:

	nation	capital	GDP
2	China	Beijing	9100
1	S.Korea	Seoul	1300
0	Japan	Tokyo	4000

In [69]:

DF33.sort('GDP',ascending=False)

Out[69]:

	nation	capital	GDP
2	China	Beijing	9100
0	Japan	Tokyo	4000
1	S.Korea	Seoul	1300

In [70]:

DF33.sort('GDP').sort(ascending=False)

Out[70]:

	nation	capital	GDP
2	China	Beijing	9100
1	S.Korea	Seoul	1300
0	Japan	Tokyo	4000

In [71]:

DF33.sort_index(axis=1,ascending=True)

Out[71]:

	GDP	capital	nation
0	4000	Tokyo	Japan
1	1300	Seoul	S.Korea
2	9100	Beijing	China

一个好用的功能：Rank

In [72]:

DF33

Out[72]:

	nation	capital	GDP
0	Japan	Tokyo	4000
1	S.Korea	Seoul	1300
2	China	Beijing	9100

In [73]:

DF33.rank()

Out[73]:

	nation	capital	GDP
0	2	3	2
1	3	2	1
2	1	1	3

In [74]:

DF33.rank(ascending=False)

Out[74]:

	nation	capital	GDP
0	2	1	2
1	1	2	3
2	3	3	1

注意tied data（相同值）的处理：

method = 'average'
method = 'min'
method = 'max'
method = 'first'

In [75]:

DF33x = pd.DataFrame({'name':[u'老王',u'老张',u'老李',u'老刘'],'sal':np.array([5000,3000,5000,9000])})

DF33x

Out[75]:

	name	sal
0	老王	5000
1	老张	3000
2	老李	5000
3	老刘	9000

DF33x.rank()默认使用method='average'，两条数据相等时，处理排名时大家都用平均值

In [76]:

DF33x.sal.rank()

Out[76]:

0    2.5
1    1.0
2    2.5
3    4.0
Name: sal, dtype: float64

method='min'，处理排名时大家都用最小值

In [77]:

DF33x.sal.rank(method='min')

Out[77]:

0    2
1    1
2    2
3    4
Name: sal, dtype: float64

method='max'，处理排名时大家都用最大值

In [78]:

DF33x.sal.rank(method='max')

Out[78]:

0    3
1    1
2    3
3    4
Name: sal, dtype: float64

method='first'，处理排名时谁先出现就先给谁较小的数值。

In [79]:

DF33x.sal.rank(method='first')

Out[79]:

0    2
1    1
2    3
3    4
Name: sal, dtype: float64

3.4 缺失数据处理

In [80]:

DF34 = data_for_multi2.unstack()

DF34

Out[80]:

	Col_1	Col_2	Col_3	Col_4
Row_1	0	NaN	NaN	NaN
Row_2	1	2	NaN	NaN
Row_3	3	4	5	NaN
Row_4	6	7	8	9

忽略缺失值：

In [81]:

DF34.mean(skipna=True)

Out[81]:

Col_1    2.500000
Col_2    4.333333
Col_3    6.500000
Col_4    9.000000
dtype: float64

In [82]:

DF34.mean(skipna=False)

Out[82]:

Col_1    2.5
Col_2    NaN
Col_3    NaN
Col_4    NaN
dtype: float64

如果不想忽略缺失值的话，就需要祭出fillna了：

In [83]:

DF34

Out[83]:

	Col_1	Col_2	Col_3	Col_4
Row_1	0	NaN	NaN	NaN
Row_2	1	2	NaN	NaN
Row_3	3	4	5	NaN
Row_4	6	7	8	9

In [84]:

DF34.fillna(0).mean(axis=1,skipna=False)

Out[84]:

Row_1    0.00
Row_2    0.75
Row_3    3.00
Row_4    7.50
dtype: float64

4. “一组”大熊猫：Pandas的groupby

groupby的功能类似SQL的group by关键字：

Split-Apply-Combine

Split，就是按照规则分组
Apply，通过一定的agg函数来获得输入pd.Series返回一个值的效果
Combine，把结果收集起来

Pandas的groupby的灵活性：

分组的关键字可以来自于index，也可以来自于真实的列数据
分组规则可以通过一列或者多列

In [85]:

from IPython.display import Image

Image(filename="S1EP3_group.png")

Out[85]:

分组的具体逻辑

In [86]:

irisdata_group = irisdata.groupby('class')

irisdata_group

Out[86]:

<pandas.core.groupby.DataFrameGroupBy object at 0x10a543b10>

In [87]:

for level,subsetDF in irisdata_group:

    print level

    print subsetDF[::20]

Iris-setosasepal_length  sepal_width  petal_length  petal_width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
20           5.4          3.4           1.7          0.2  Iris-setosa
40           5.0          3.5           1.3          0.3  Iris-setosa
Iris-versicolorsepal_length  sepal_width  petal_length  petal_width            class
50           7.0          3.2           4.7          1.4  Iris-versicolor
70           5.9          3.2           4.8          1.8  Iris-versicolor
90           5.5          2.6           4.4          1.2  Iris-versicolor
Iris-virginicasepal_length  sepal_width  petal_length  petal_width           class
100           6.3          3.3           6.0          2.5  Iris-virginica
120           6.9          3.2           5.7          2.3  Iris-virginica
140           6.7          3.1           5.6          2.4  Iris-virginica

分组可以快速实现MapReduce的逻辑

Map: 指定分组的列标签，不同的值就会被扔到不同的分组处理
Reduce: 输入多个值，返回一个值，一般可以通过agg实现，agg能接受一个函数

In [88]:

irisdata.groupby('class').agg(\

    lambda x:((x-x.mean())**3).sum()*(len(x)-0.0)/\

                (len(x)-1.0)/(len(x)-2.0)/(x.std()*np.sqrt((len(x)-0.0)/(len(x)-1.0)))**3 if len(x)>2 else None)

Out[88]:

	sepal_length	sepal_width	petal_length	petal_width
class
Iris-setosa	0.116502	0.103857	0.069702	1.161506
Iris-versicolor	0.102232	-0.352014	-0.588404	-0.030249
Iris-virginica	0.114492	0.355026	0.533044	-0.125612

In [89]:

irisdata.groupby('class').agg(spstat.skew)

Out[89]:

	sepal_length	sepal_width	petal_length	petal_width
class
Iris-setosa	0.116454	0.103814	0.069673	1.161022
Iris-versicolor	0.102190	-0.351867	-0.588159	-0.030236
Iris-virginica	0.114445	0.354878	0.532822	-0.125560

汇总之后的广播操作

在OLAP数据库上，为了避免groupby+join的二次操作，提出了sum()over(partition by)的开窗操作。

在Pandas中，这种操作能够进一步被transform所取代。

In [90]:

pd.concat([irisdata,irisdata.groupby('class').transform('mean')],axis=1)[::20]

Out[90]:

	sepal_length	sepal_width	petal_length	petal_width	class	sepal_length	sepal_width	petal_length	petal_width
0	5.1	3.5	1.4	0.2	Iris-setosa	5.006	3.418	1.464	0.244
20	5.4	3.4	1.7	0.2	Iris-setosa	5.006	3.418	1.464	0.244
40	5.0	3.5	1.3	0.3	Iris-setosa	5.006	3.418	1.464	0.244
60	5.0	2.0	3.5	1.0	Iris-versicolor	5.936	2.770	4.260	1.326
80	5.5	2.4	3.8	1.1	Iris-versicolor	5.936	2.770	4.260	1.326
100	6.3	3.3	6.0	2.5	Iris-virginica	6.588	2.974	5.552	2.026
120	6.9	3.2	5.7	2.3	Iris-virginica	6.588	2.974	5.552	2.026
140	6.7	3.1	5.6	2.4	Iris-virginica	6.588	2.974	5.552	2.026

产生 MultiIndex（多列分组）后的数据透视表操作

一般来说，多列groupby的一个副作用就是.groupby().agg()之后你的行index已经变成了一个多列分组的分级索引。

如果我们希望达到Excel的数据透视表的效果，行和列的索引自由交换，达到统计目的，究竟应该怎么办呢？

In [91]:

factor1 = np.random.randint(0,3,50)

factor2 = np.random.randint(0,2,50)

factor3 = np.random.randint(0,3,50)

values = np.random.randn(50)

In [92]:

hierindexDF = pd.DataFrame({'F1':factor1,'F2':factor2,'F3':factor3,'F4':values})

hierindexDF

Out[92]:

	F1	F2	F3	F4
0	1	0	1	-0.083928
1	0	0	0	0.949984
2	2	1	0	-0.037692
3	0	1	1	-0.518305
4	1	0	0	0.963678
5	1	1	1	-0.284463
6	0	0	1	0.412449
7	2	0	0	-0.277126
8	0	0	2	-2.488946
9	0	1	0	0.167413
10	0	1	1	1.520003
11	0	1	2	-0.665450
12	1	1	2	0.783906
13	1	1	1	-0.077717
14	2	0	2	-2.018863
15	0	1	1	-0.993884
16	2	0	2	-1.051174
17	0	0	2	-0.723457
18	1	1	0	-0.783521
19	1	0	2	0.355093
20	1	0	0	-0.252415
21	0	0	2	0.600679
22	2	1	2	-0.660498
23	0	0	2	-0.301613
24	0	1	2	-1.270339
25	0	0	0	0.074086
26	1	0	0	-0.043586
27	0	1	2	-2.000161
28	2	1	1	-1.041330
29	2	1	1	0.984101
30	2	0	0	0.160715
31	1	1	2	0.903035
32	1	1	1	-1.105763
33	1	0	0	0.850310
34	1	1	0	-0.062946
35	1	1	0	-0.507763
36	1	0	1	0.112303
37	2	1	2	-0.663782
38	2	1	1	1.147671
39	0	0	2	-1.552327
40	1	1	1	-1.166517
41	0	1	0	-0.622502
42	0	1	2	1.620961
43	1	1	0	-0.354238
44	1	0	2	-0.233783
45	0	0	0	0.131051
46	2	1	1	-1.301164
47	1	0	2	-1.013341
48	2	0	1	-0.508761
49	0	0	1	-1.104457

In [93]:

hierindexDF_gbsum = hierindexDF.groupby(['F1','F2','F3']).sum()

hierindexDF_gbsum

Out[93]:

			F4
F1	F2	F3
0	0	0	1.155121
		1	-0.692009
		2	-4.465664
	1	0	-0.455089
		1	0.007813
		2	-2.314989
1	0	0	1.517987
		1	0.028374
		2	-0.892030
	1	0	-1.708468
		1	-2.634460
		2	1.686941
2	0	0	-0.116412
		1	-0.508761
		2	-3.070037
	1	0	-0.037692
		1	-0.210722
		2	-1.324281

观察Index：

In [94]:

hierindexDF_gbsum.index

Out[94]:

MultiIndex(levels=[[0, 1, 2], [0, 1], [0, 1, 2]],labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],names=[u'F1', u'F2', u'F3'])

unstack：

无参数时，把最末index置换到column上
有数字参数时，把指定位置的index置换到column上
有列表参数时，依次把特定位置的index置换到column上

In [95]:

hierindexDF_gbsum.unstack()

Out[95]:

		F4
	F3	0	1	2
F1	F2
0	0	1.155121	-0.692009	-4.465664
0	1	-0.455089	0.007813	-2.314989
1	0	1.517987	0.028374	-0.892030
1	1	-1.708468	-2.634460	1.686941
2	0	-0.116412	-0.508761	-3.070037
2	1	-0.037692	-0.210722	-1.324281

In [96]:

hierindexDF_gbsum.unstack(0)

Out[96]:

		F4
	F1	0	1	2
F2	F3
0	0	1.155121	1.517987	-0.116412
	1	-0.692009	0.028374	-0.508761
	2	-4.465664	-0.892030	-3.070037
1	0	-0.455089	-1.708468	-0.037692
	1	0.007813	-2.634460	-0.210722
	2	-2.314989	1.686941	-1.324281

In [97]:

hierindexDF_gbsum.unstack(1)

Out[97]:

		F4
	F2	0	1
F1	F3
0	0	1.155121	-0.455089
	1	-0.692009	0.007813
	2	-4.465664	-2.314989
1	0	1.517987	-1.708468
	1	0.028374	-2.634460
	2	-0.892030	1.686941
2	0	-0.116412	-0.037692
	1	-0.508761	-0.210722
	2	-3.070037	-1.324281

In [98]:

hierindexDF_gbsum.unstack([2,0])

Out[98]:

	F4
F3	0	1	2	0	1	2	0	1	2
F1	0	0	0	1	1	1	2	2	2
F2
0	1.155121	-0.692009	-4.465664	1.517987	0.028374	-0.892030	-0.116412	-0.508761	-3.070037
1	-0.455089	0.007813	-2.314989	-1.708468	-2.634460	1.686941	-0.037692	-0.210722	-1.324281

更进一步的，stack的功能是和unstack对应，把column上的多级索引换到index上去

In [99]:

hierindexDF_gbsum.unstack([2,0]).stack([1,2])

Out[99]:

			F4
F2	F3	F1
0	0	0	1.155121
		1	1.517987
		2	-0.116412
	1	0	-0.692009
		1	0.028374
		2	-0.508761
	2	0	-4.465664
		1	-0.892030
		2	-3.070037
1	0	0	-0.455089
		1	-1.708468
		2	-0.037692
	1	0	0.007813
		1	-2.634460
		2	-0.210722
	2	0	-2.314989
		1	1.686941
		2	-1.324281

Pandas python相关推荐

python根据频率画出词云_利用pandas+python制作100G亚马逊用户评论数据词云
原标题:利用pandas+python制作100G亚马逊用户评论数据词云数据挖掘入门与实战公众号: datadw 我们手里面有一个差不多100G的亚马逊用户在购买商品后留下的评论数据(数据格式为j ...
基于pandas python sklearn 的美团某商家的评论分类(文本分类）
基于pandas python sklearn 的美团某商家的评论分类(文本分类) 美团店铺评价语言处理以及分类(NLP) 第一篇数据分析部分第二篇可视化部分, 本文是该系列第三篇,文本分类主 ...
pandas python csv_python:pandas合并csv文件的方法(图书数据集成)
数据集成:将不同表的数据通过主键进行连接起来,方便对数据进行整体的分析. 两张表:ReaderInformation.csv,ReaderRentRecode.csv ReaderInformatio ...
python菜鸟教程官网pandas,python菜鸟教程官网numpy
求比较好的Python练习网站适合 Python 入门学习的5个网站1. Python官方网站最权威的 Python 教程.2. 菜鸟教程 Python 教程基本入门级教程,还有一些简单的进阶教程. ...
python可视化模块pandas,python数据可视化软件
Python中数据可视化经典库有哪些? Python有很多经典的数据可视化库,比较经典的数据可视化库有下面几个.matplotlib是Python编程语言及其数值数学扩展包 NumPy 的可视化操作界 ...
python数据分页pandas,Python数据分析[3] - Pandas包
Series Series是一个一维的向量,每个值都会有对应标签,该标签我们称之为Index Obj = Series([4, 5, -7, 8]) Obj Out[5]: 0 4 1 5 2 -7 ...
pandas——python
目录 0.导引 1.Series 1.1 Series的创建 1.2 Series的索引 1.3 Series的切片 1.4 Series常用属性和方法 1.5 Series的运算 1.6 Serie ...
pandas / python使用技巧-统计频次，超级实用
pandas使用技巧-统计元素频数之所以会写这篇文章是因为经常会做词云图,而做词云图一个很重要的因素就是要事先知道每个元素的出现的频数:也就是说通过它们出现次数的多少来决定它们的重要性,所以需要事先 ...
pandas python groupby_pandas之分组groupby()的使用整理与总结
前言在使用pandas的时候,有些场景需要对数据内部进行分组处理,如一组全校学生成绩的数据,我们想通过班级进行分组,或者再对班级分组后的性别进行分组来进行分析,这时通过pandas下的groupby ...
python保存为xlsb_Read XLSB File in Pandas Python
问题 There are many questions on this, but there has been no simple answer on how to read an xlsb file ...

Pandas python

和大熊猫们(Pandas)一起游戏吧！

0. 上手玩：Why Pandas?

快速的识别结构化数据

快速的操作元数据

快速过滤

快速切片

快速统计

快速“MapReduce”

1. 欢迎来到大熊猫世界

1.1 Series：pandas的长枪(数据表中的一列或一行,观测向量,一维数组...)

Series支持过滤的原理就如同NumPy：

当然也支持Broadcasting：

以及Universal Function：

1.2 DataFrame：pandas的战锤(数据表，二维数组)

从NumPy二维数组、从文件或者从数据库定义：数据虽好，勿忘列名

等长的列数据保存在一个字典里（JSON）：很不幸，字典key是无序的

从另一个DataFrame定义DataFrame：啊，强迫症犯了！

从DataFrame中取出列？两种方法（与JavaScript完全一致！）

从DataFrame中取出行？（至少）两种方法：

像NumPy切片一样的终极招式：iloc

听说你从Alter Table地狱来，大熊猫笑了

1.3 Index：pandas进行数据操纵的鬼牌（行级索引）

直接定义普通索引，长得就和普通的Series一样

可惜Immutable，牢记！

扔进去一个含有多元组的List，就有了MultiIndex

对于Series来说，如果拥有了多重Index，数据，变形！

DateTime标准库如此好用，你值得拥有

如果你不仅需要时间格式统一，时间频率也要统一的话

月级精度和日级精度如何转换？

最后的最后，我要真正把两种频率的时间精度匹配上？

粗粒度数据＋reindex＋ffill/bfill

关于索引，方便的操作有？

2. 大熊猫世界来去自如：Pandas的I/O

2.1 结构化数据输入输出

2.1.x Excel ... ?

2.2 半结构化数据

2.3 数据库连接流程（Optional）

3. 深入Pandas数据操纵

3.1 数据整合：方便灵活

3.1.1 横向拼接：直接DataFrame

3.1.2 横向拼接：Concatenate

3.1.3 纵向拼接：Merge

3.2 数据清洗三剑客

map: 以相同规则将一列数据作一个映射，也就是进行相同函数的处理

类似的功能还有applymap，可以对一个dataframe里面每一个元素像map那样全局操作

apply则可以对一个DataFrame操作得到一个Series

3.3 数据排序

3.4 缺失数据处理

4. “一组”大熊猫：Pandas的groupby

汇总之后的广播操作

产生 MultiIndex（多列分组）后的数据透视表操作

Pandas python相关推荐

最新文章

热门文章