1 pandas Quick Start (Part 1)

The pd.Series() constructor. A Series consists of a one-dimensional array of values together with an index that labels each value.

import pandas as pd
import numpy as np
s = pd.Series([1, 3, 5, np.nan, 8, 4]) # np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
print(s)

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    8.0
5    4.0
dtype: float64
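
As a small illustration of the data-plus-index structure, the sketch below builds a Series with explicit labels and pulls the two parts apart. The name s2 and the labels 'a' to 'd' are made up for this example. Because NaN is a floating-point value, the integers are stored as float64, which is also why the output above shows 1.0 rather than 1.

```python
import numpy as np
import pandas as pd

# Hypothetical labels 'a'..'d', chosen only for illustration
s2 = pd.Series([1, 3, np.nan, 8], index=['a', 'b', 'c', 'd'])

values = s2.to_numpy()            # the data part, as a NumPy array
labels = list(s2.index)           # the index part
by_label = s2['b']                # look a value up by its label
n_missing = int(s2.isna().sum())  # count the missing entries

print(s2.dtype, labels, by_label, n_missing)
```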

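date_range is one of the most frequently used pandas functions, especially when working with time series data. It generates a DatetimeIndex, which serves as the index of a time series.

dates = pd.date_range('20200301', periods = 6) # first argument: start date; periods: number of dates to generate
print(dates)

Output: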

DatetimeIndex(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04','2020-03-05', '2020-03-06'],dtype='datetime64[ns]', freq='D')
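
A brief sketch of other ways to call date_range, assuming the same start date as above: the range can be bounded by an end date instead of a count, and freq controls the step. The variable names are illustrative only.

```python
import pandas as pd

by_count = pd.date_range('20200301', periods=6)               # start + number of dates
by_end = pd.date_range(start='2020-03-01', end='2020-03-06')  # start + end (end is included)
every_other = pd.date_range('2020-03-01', periods=3, freq='2D')  # step of two days

print(by_count.equals(by_end))
print(list(every_other.strftime('%Y-%m-%d')))
```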

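The pd.DataFrame() constructor. It creates a two-dimensional, table-like data structure with ordered columns, where each column can hold a different data type.

# First argument: table contents; index: row labels; columns: column labels
data = pd.DataFrame(np.random.randn(6, 4), index = dates, columns = list('ABCD'))
print(data)

Output: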

                   A         B         C         D
2020-03-01  0.167420  0.008361  0.706377  0.752018
2020-03-02 -0.942199  0.064732 -0.193355 -0.096331
2020-03-03  0.450057  0.692476 -0.015185  0.093720
2020-03-04 -0.040322 -0.108812  0.138713  1.137867
2020-03-05 -0.492778 -1.677214 -0.039888  0.336530
2020-03-06  0.098786 -0.904987 -1.102924 -0.415507

Check the shape of data:

data.shape

Output:

(6, 4)

Inspect the underlying values of data:

data.values

Output:

array([[ 0.16741957,  0.00836142,  0.70637657,  0.7520177 ],
       [-0.94219939,  0.064732  , -0.19335507, -0.09633117],
       [ 0.45005695,  0.69247647, -0.01518453,  0.09371978],
       [-0.04032154, -0.10881161,  0.13871262,  1.13786689],
       [-0.49277839, -1.67721437, -0.03988837,  0.33653038],
       [ 0.0987858 , -0.90498673, -1.10292439, -0.41550716]])

data.head() returns the first 5 rows by default:

data.head()

Output:

                A            B           C           D
2020-03-01  0.167420    0.008361    0.706377    0.752018
2020-03-02  -0.942199   0.064732    -0.193355   -0.096331
2020-03-03  0.450057    0.692476    -0.015185   0.093720
2020-03-04  -0.040322   -0.108812   0.138713    1.137867
2020-03-05  -0.492778   -1.677214   -0.039888   0.336530

Return a specified number of rows:

data.head(2) # return the first two rows

Output:

            A           B           C           D
2020-03-01  0.167420    0.008361    0.706377    0.752018
2020-03-02  -0.942199   0.064732    -0.193355   -0.096331

data.tail() returns the last 5 rows by default:

data.tail()

Output:

            A           B           C           D
2020-03-02  -0.942199   0.064732    -0.193355   -0.096331
2020-03-03  0.450057    0.692476    -0.015185   0.093720
2020-03-04  -0.040322   -0.108812   0.138713    1.137867
2020-03-05  -0.492778   -1.677214   -0.039888   0.336530
2020-03-06  0.098786    -0.904987   -1.102924   -0.415507

View the row labels:

data.index # returns the row labels

Output:

DatetimeIndex(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04','2020-03-05', '2020-03-06'],dtype='datetime64[ns]', freq='D')

View the column labels:

data.columns # returns the column labels

Output:

Index(['A', 'B', 'C', 'D'], dtype='object')

Get summary statistics for the data:

data.describe() # summary statistics for each column

Output:

               A          B          C          D
count   6.000000   6.000000   6.000000   6.000000
mean   -0.126506  -0.320907  -0.084377   0.301383
std     0.505275   0.837829   0.588412   0.569077
min    -0.942199  -1.677214  -1.102924  -0.415507
25%    -0.379664  -0.705943  -0.154988  -0.048818
50%     0.029232  -0.050225  -0.027536   0.215125
75%     0.150261   0.050639   0.100238   0.648146
max     0.450057   0.692476   0.706377   1.137867
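
The rows of a describe() table can be checked by hand. The sketch below uses a small made-up column (not the random data above) and recomputes std to show that it is the sample standard deviation, with n - 1 in the denominator, and that the 50% row is the median.

```python
import numpy as np
import pandas as pd

col = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values
summary = col.describe()

# Sample standard deviation: divide the squared deviations by n - 1
sample_std = np.sqrt(((col - col.mean()) ** 2).sum() / (len(col) - 1))
# Population standard deviation (divide by n), shown for contrast
population_std = np.sqrt(((col - col.mean()) ** 2).mean())

print(summary['std'], sample_std, population_std)
print(summary['50%'], col.median())
```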

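count: number of non-missing values; mean: arithmetic mean; std: standard deviation (not the variance); min: minimum; 25%, 50%, 75%: first quartile, median, and third quartile; max: maximum.

Transpose the data:

data.T

Output: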

2020-03-01   2020-03-02  2020-03-03  2020-03-04  2020-03-05  2020-03-06
A   0.167420    -0.942199   0.450057    -0.040322   -0.492778   0.098786
B   0.008361    0.064732    0.692476    -0.108812   -1.677214   -0.904987
C   0.706377    -0.193355   -0.015185   0.138713    -0.039888   -1.102924
D   0.752018    -0.096331   0.093720    1.137867    0.336530    -0.415507

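Sorting (the outputs from here on come from a separate run, so the random values differ from those shown above):

data.sort_index(axis = 1) # sort by column labels

Output:

            A           B           C           D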
2020-03-01  0.147174    -0.605480   1.043737    0.005772
2020-03-02  0.074604    -1.100579   0.450711    0.857264
2020-03-03  -0.246770   -0.369136   -0.109472   0.709246
2020-03-04  -0.061607   -0.398656   -1.007450   0.263131
2020-03-05  -1.298202   0.449078    1.079647    -0.647769
2020-03-06  -1.112085   2.234422    -0.257315   -0.015560

Sort in descending order:

data.sort_index(axis = 1, ascending = False) # sort by column labels, descending

Output:

            D           C           B           A
2020-03-01  0.005772    1.043737    -0.605480   0.147174
2020-03-02  0.857264    0.450711    -1.100579   0.074604
2020-03-03  0.709246    -0.109472   -0.369136   -0.246770
2020-03-04  0.263131    -1.007450   -0.398656   -0.061607
2020-03-05  -0.647769   1.079647    0.449078    -1.298202
2020-03-06  -0.015560   -0.257315   2.234422    -1.112085

Sort by the values in a column:

data.sort_values(by = 'A') # sort the rows by the values in column A

Output:

            A           B           C           D
2020-03-05  -1.298202   0.449078    1.079647    -0.647769
2020-03-06  -1.112085   2.234422    -0.257315   -0.015560
2020-03-03  -0.246770   -0.369136   -0.109472   0.709246
2020-03-04  -0.061607   -0.398656   -1.007450   0.263131
2020-03-02  0.074604    -1.100579   0.450711    0.857264
2020-03-01  0.147174    -0.605480   1.043737    0.005772

Selecting data:

data['A'] # select column A
# data.A is equivalent

Output:

2020-03-01    0.147174
2020-03-02    0.074604
2020-03-03   -0.246770
2020-03-04   -0.061607
2020-03-05   -1.298202
2020-03-06   -1.112085
Freq: D, Name: A, dtype: float64

Select rows by position with a slice:

data[2:4]

Output:

            A           B           C           D
2020-03-03  -0.246770   -0.369136   -0.109472   0.709246
2020-03-04  -0.061607   -0.398656   -1.007450   0.263131

Rows can also be selected with a slice of row labels:

data['20200301':'20200304']

Output:

            A           B           C           D
2020-03-01  0.147174    -0.605480   1.043737    0.005772
2020-03-02  0.074604    -1.100579   0.450711    0.857264
2020-03-03  -0.246770   -0.369136   -0.109472   0.709246
2020-03-04  -0.061607   -0.398656   -1.007450   0.263131

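The [] syntax above works, but it is ambiguous (a single label selects a column, while a slice selects rows), so the explicit .loc accessor is recommended. .loc accepts only row and column labels, not integer positions: with this DatetimeIndex, data.loc[2:4] raises a TypeError.

data.loc['20200301':'20200304']

Output:

            A           B           C           D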
2020-03-01  0.147174    -0.605480   1.043737    0.005772
2020-03-02  0.074604    -1.100579   0.450711    0.857264
2020-03-03  -0.246770   -0.369136   -0.109472   0.709246
2020-03-04  -0.061607   -0.398656   -1.007450   0.263131

.iloc accepts only integer positions, not row or column labels:

data.iloc[2:5]

Output:

            A           B           C           D
2020-03-03  -0.246770   -0.369136   -0.109472   0.709246
2020-03-04  -0.061607   -0.398656   -1.007450   0.263131
2020-03-05  -1.298202   0.449078    1.079647    -0.647769
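
One difference between the two accessors is easy to miss: a label slice with .loc includes its end label, while a position slice with .iloc excludes its end position. A minimal sketch, using a frame of consecutive integers (a stand-in for the random data) so the rows are easy to recognise:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200301', periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_label = frame.loc['20200303':'20200305']  # end label included: 3 rows
by_position = frame.iloc[2:5]                # end position excluded: rows 2, 3, 4

print(by_label.equals(by_position))
print(len(by_label), len(frame.iloc[2:4]))
```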

Other uses of .loc:

data.loc[:, ['B', 'C']] # select only columns B and C

Output:

            B           C
2020-03-01  -0.605480   1.043737
2020-03-02  -1.100579   0.450711
2020-03-03  -0.369136   -0.109472
2020-03-04  -0.398656   -1.007450
2020-03-05  0.449078    1.079647
2020-03-06  2.234422    -0.257315

data.loc['20200301':'20200304', 'A':'D']

Output:

            A           B           C           D
2020-03-01  0.147174    -0.605480   1.043737    0.005772
2020-03-02  0.074604    -1.100579   0.450711    0.857264
2020-03-03  -0.246770   -0.369136   -0.109472   0.709246
2020-03-04  -0.061607   -0.398656   -1.007450   0.263131

Access a single value:

data.loc['20200304', 'B'] # access a single scalar value

Output:

-0.39865622221605224

.at is faster for label-based access to a single value:

data.at[pd.Timestamp('20200304'), 'B'] # faster access to a single value by label

Output:

-0.39865622221605224

For position-based access to a single value, .iat is recommended:

data.iat[1, 1] # row 1, column 1 (both zero-based)

Output:

-1.1005786367873263
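
The scalar accessors can be compared side by side. In this sketch (again on a stand-in frame of consecutive integers) the same cell is read three ways, and .at is also used to write a value, which the text above does not show.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200301', periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

via_loc = frame.loc['20200304', 'B']               # general label-based lookup
via_at = frame.at[pd.Timestamp('20200304'), 'B']   # scalar lookup by label
via_iat = frame.iat[3, 1]                          # scalar lookup by position

frame.at[pd.Timestamp('20200304'), 'B'] = -1       # .at can also assign a single cell
print(via_loc, via_at, via_iat, frame.iat[3, 1])
```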

Boolean indexing

data[data.A > 0] # select the rows where column A is greater than 0

Output:

            A           B           C           D
2020-03-01  0.147174    -0.605480   1.043737    0.005772
2020-03-02  0.074604    -1.100579   0.450711    0.857264

data[data > 0] # keep the values greater than 0; every other entry becomes NaN

Output:

                   A         B         C         D
2020-03-01  0.147174       NaN  1.043737  0.005772
2020-03-02  0.074604       NaN  0.450711  0.857264
2020-03-03       NaN       NaN       NaN  0.709246
2020-03-04       NaN       NaN       NaN  0.263131
2020-03-05       NaN  0.449078  1.079647       NaN
2020-03-06       NaN  2.234422       NaN       NaN
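
Boolean conditions can be combined. The sketch below uses a tiny made-up frame; note that each condition needs its own parentheses and that & and | are used rather than and and or. isin is included as a common companion, although the text above does not cover it.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, -2, 3, -4], 'B': [-1, 2, 3, -4]})  # made-up values

both = frame[(frame.A > 0) & (frame.B > 0)]    # rows where both columns are positive
either = frame[(frame.A > 0) | (frame.B > 0)]  # rows where at least one is positive
picked = frame[frame.A.isin([1, 3])]           # rows whose A value is in a given list
masked = frame.where(frame > 0)                # same result as frame[frame > 0]

print(list(both.index), list(either.index), list(picked.index))
```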

2 pandas Quick Start (Part 2)

# Embed matplotlib figures directly in the notebook
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20200301', periods = 6)
df = pd.DataFrame(np.random.randn(6, 4), index = dates, columns = list('ABCD'))
print(df)

                   A         B         C         D
2020-03-01 -0.339025  1.321283  1.563992 -1.757175
2020-03-02 -0.808769  1.927426 -1.080492  0.403419
2020-03-03 -2.812467  0.227888 -0.487071 -0.413189
2020-03-04 -1.856601  1.019911  1.791216  2.702585
2020-03-05 -0.581288  1.300641  0.129494 -0.897040
2020-03-06 -0.340724  0.086645 -0.380084  0.960427

# Create some missing data
df1 = df.reindex(index = dates[0:4], columns = list(df.columns) + ['E']) # keep the first 4 rows and add a new column E
df1

            A           B           C           D           E
2020-03-01  -0.339025   1.321283    1.563992    -1.757175   NaN
2020-03-02  -0.808769   1.927426    -1.080492   0.403419    NaN
2020-03-03  -2.812467   0.227888    -0.487071   -0.413189   NaN
2020-03-04  -1.856601   1.019911    1.791216    2.702585    NaN

df1.loc[dates[1:3], 'E'] = 2 # assign a value to some of the missing entries
df1
# This gives a two-dimensional DataFrame in which some values are missing
Out[7]:
            A           B           C           D           E
2020-03-01  -0.339025   1.321283    1.563992    -1.757175   NaN
2020-03-02  -0.808769   1.927426    -1.080492   0.403419    2.0
2020-03-03  -2.812467   0.227888    -0.487071   -0.413189   2.0
2020-03-04  -1.856601   1.019911    1.791216    2.702585    NaN

Two ways to handle missing data:
dropna()
fillna(value = num)

df1.dropna() # drop every row that contains a missing value
Out[12]:
            A           B           C           D           E
2020-03-02  -0.808769   1.927426    -1.080492   0.403419    2.0
2020-03-03  -2.812467   0.227888    -0.487071   -0.413189   2.0

df1.fillna(value = 5) # replace missing values with the given value
Out[13]:
            A           B           C           D           E
2020-03-01  -0.339025   1.321283    1.563992    -1.757175   5.0
2020-03-02  -0.808769   1.927426    -1.080492   0.403419    2.0
2020-03-03  -2.812467   0.227888    -0.487071   -0.413189   2.0
2020-03-04  -1.856601   1.019911    1.791216    2.702585    5.0

pd.isnull(df1) # test each element for missing data
Out[14]:
            A       B       C       D       E
2020-03-01  False   False   False   False   True
2020-03-02  False   False   False   False   False
2020-03-03  False   False   False   False   False
2020-03-04  False   False   False   False   True

# In a large table it is hard to see at a glance whether any data is missing:
pd.isnull(df1).any()
Out[17]:
A    False
B    False
C    False
D    False
E     True
dtype: bool

# With many columns, even the per-column result is hard to scan
pd.isnull(df1).any().any()
Out[18]:
True

df1
Out[19]:
            A           B           C           D           E
2020-03-01  -0.339025   1.321283    1.563992    -1.757175   NaN
2020-03-02  -0.808769   1.927426    -1.080492   0.403419    2.0
2020-03-03  -2.812467   0.227888    -0.487071   -0.413189   2.0
2020-03-04  -1.856601   1.019911    1.791216    2.702585    NaN

df1.mean() # missing values are excluded from the calculation
Out[21]:
A   -1.454216
B    1.124127
C    0.446911
D    0.233910
E    2.000000
dtype: float64
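
The missing-data tools above can be combined in a few other ways. The following sketch uses a small made-up frame with the same pattern of gaps as column E; the choice to fill with the column mean is an assumption for illustration, not something the original notebook does.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                      'E': [np.nan, 2.0, 2.0, np.nan]})  # made-up values

missing_per_column = frame.isna().sum()    # number of missing values in each column
dropped_rows = frame.dropna()              # drop rows that contain any NaN
dropped_cols = frame.dropna(axis=1)        # drop columns that contain any NaN
filled_mean = frame.fillna(frame.mean())   # fill each gap with its column mean

print(missing_per_column['E'], len(dropped_rows), list(dropped_cols.columns))
```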

The apply() function

df.apply(np.cumsum) # cumulative sum down each column
Out[22]:
            A           B           C           D
2020-03-01  -0.339025   1.321283    1.563992    -1.757175
2020-03-02  -1.147794   3.248709    0.483500    -1.353756
2020-03-03  -3.960262   3.476597    -0.003572   -1.766945
2020-03-04  -5.816863   4.496508    1.787645    0.935640
2020-03-05  -6.398151   5.797148    1.917139    0.038600
2020-03-06  -6.738874   5.883793    1.537055    0.999027

df
Out[23]:
            A           B           C           D
2020-03-01  -0.339025   1.321283    1.563992    -1.757175
2020-03-02  -0.808769   1.927426    -1.080492   0.403419
2020-03-03  -2.812467   0.227888    -0.487071   -0.413189
2020-03-04  -1.856601   1.019911    1.791216    2.702585
2020-03-05  -0.581288   1.300641    0.129494    -0.897040
2020-03-06  -0.340724   0.086645    -0.380084   0.960427

df.apply(lambda x : x.max() - x.min()) # range (maximum minus minimum) of each column
Out[24]:
A    2.473442
B    1.840781
C    2.871708
D    4.459761
dtype: float64

s = pd.Series(np.random.randint(10, 20, size = 20))
s
Out[25]:
0     19
1     13
2     19
3     19
4     12
5     19
6     18
7     15
8     14
9     18
10    16
11    14
12    18
13    10
14    17
15    18
16    10
17    16
18    10
19    12
dtype: int32

s.value_counts() # count how many times each value occurs
Out[27]:
19    4
18    4
10    3
16    2
14    2
12    2
17    1
15    1
13    1
dtype: int64

s.mode() # the most frequent value(s); ties are all returned
Out[28]:
0    18
1    19
dtype: int32
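
The relationship between value_counts() and mode() is easier to see on a short, fixed series. In this sketch (values made up, not the random draw above) two values tie for the highest count, so mode() returns both, sorted in ascending order.

```python
import pandas as pd

rolls = pd.Series([19, 13, 19, 18, 18, 10])  # made-up values with a tie

counts = rolls.value_counts()                # occurrences of each value
shares = rolls.value_counts(normalize=True)  # the same as fractions of the total
modes = rolls.mode()                         # every value with the highest count

print(counts[19], counts[18], modes.tolist())
```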

Merging data

df = pd.DataFrame(np.random.randn(10, 4), columns = list('ABCD'))
df
Out[29]:
    A           B           C           D
0   -1.289507   -1.002505   1.792938    -1.885870
1   1.024196    -0.978207   0.990827    0.467831
2   0.287490    1.029234    -0.788564   0.508841
3   -1.971881   1.151978    -1.276380   3.042233
4   -0.706756   2.127796    0.255050    -0.649438
5   0.961700    0.329416    0.003750    0.516274
6   0.105380    0.399627    -1.472621   -0.605783
7   0.791631    0.707824    1.587626    -0.033991
8   -0.336135   -0.483174   0.100718    0.243218
9   -0.272511   -1.086092   0.650176    -0.106609

df.iloc[:3]
Out[30]:
    A           B           C           D
0   -1.289507   -1.002505   1.792938    -1.885870
1   1.024196    -0.978207   0.990827    0.467831
2   0.287490    1.029234    -0.788564   0.508841

df.iloc[3:7]
Out[31]:
    A           B           C           D
3   -1.971881   1.151978    -1.276380   3.042233
4   -0.706756   2.127796    0.255050    -0.649438
5   0.961700    0.329416    0.003750    0.516274
6   0.105380    0.399627    -1.472621   -0.605783

df.iloc[7:]
Out[32]:
    A           B           C           D
7   0.791631    0.707824    1.587626    -0.033991
8   -0.336135   -0.483174   0.100718    0.243218
9   -0.272511   -1.086092   0.650176    -0.106609

df1 = pd.concat([df.iloc[:3], df.iloc[3:7], df.iloc[7:]]) # concatenate the three pieces above
df1
Out[34]:
    A           B           C           D
0   -1.289507   -1.002505   1.792938    -1.885870
1   1.024196    -0.978207   0.990827    0.467831
2   0.287490    1.029234    -0.788564   0.508841
3   -1.971881   1.151978    -1.276380   3.042233
4   -0.706756   2.127796    0.255050    -0.649438
5   0.961700    0.329416    0.003750    0.516274
6   0.105380    0.399627    -1.472621   -0.605783
7   0.791631    0.707824    1.587626    -0.033991
8   -0.336135   -0.483174   0.100718    0.243218
9   -0.272511   -1.086092   0.650176    -0.106609

df == df1 # element-wise comparison of the two tables
Out[35]:
    A       B       C       D
0   True    True    True    True
1   True    True    True    True
2   True    True    True    True
3   True    True    True    True
4   True    True    True    True
5   True    True    True    True
6   True    True    True    True
7   True    True    True    True
8   True    True    True    True
9   True    True    True    True

(df == df1).all()
Out[37]:
A    True
B    True
C    True
D    True
dtype: bool

(df == df1).all().all()
Out[38]:
True

left = pd.DataFrame({'key':['foo', 'foo'], 'lval':[1, 2]})
right = pd.DataFrame({'key':['foo', 'foo'], 'rval':[4, 5]})
left
Out[39]:
   key  lval
0  foo     1
1  foo     2

right
Out[40]:
   key  rval
0  foo     4
1  foo     5

pd.merge(left, right, on = 'key') # join the two tables on the key column
Out[41]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

pd.merge(right, left, on = 'key')
Out[42]:
   key  rval  lval
0  foo     4     1
1  foo     4     2
2  foo     5     1
3  foo     5     2

s = pd.Series(np.random.randint(1, 5, size = 4), index = list('ABCD'))
s
Out[43]:
A    1
B    2
C    4
D    2
dtype: int32

df
Out[44]:
    A           B           C           D
0   -1.289507   -1.002505   1.792938    -1.885870
1   1.024196    -0.978207   0.990827    0.467831
2   0.287490    1.029234    -0.788564   0.508841
3   -1.971881   1.151978    -1.276380   3.042233
4   -0.706756   2.127796    0.255050    -0.649438
5   0.961700    0.329416    0.003750    0.516274
6   0.105380    0.399627    -1.472621   -0.605783
7   0.791631    0.707824    1.587626    -0.033991
8   -0.336135   -0.483174   0.100718    0.243218
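9   -0.272511   -1.086092   0.650176    -0.106609

# Append s to df as a new row, ignoring its index.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used in place of df.append(s, ignore_index = True).
pd.concat([df, s.to_frame().T], ignore_index = True)
Out[45]:
    A           B           C           D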
0   -1.289507   -1.002505   1.792938    -1.885870
1   1.024196    -0.978207   0.990827    0.467831
2   0.287490    1.029234    -0.788564   0.508841
3   -1.971881   1.151978    -1.276380   3.042233
4   -0.706756   2.127796    0.255050    -0.649438
5   0.961700    0.329416    0.003750    0.516274
6   0.105380    0.399627    -1.472621   -0.605783
7   0.791631    0.707824    1.587626    -0.033991
8   -0.336135   -0.483174   0.100718    0.243218
9   -0.272511   -1.086092   0.650176    -0.106609
10  1.000000    2.000000    4.000000    2.000000

df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B':['one', 'one', 'two', 'three', 'two', 'one', 'three', 'one'],
                   'C':np.random.randn(8),
                   'D':np.random.randn(8)})
df
Out[47]:
     A      B         C         D
0  foo    one  0.769949  0.918423
1  bar    one  0.187173 -0.555428
2  foo    two -0.272259 -0.781077
3  bar  three  0.613351 -0.300836
4  foo    two  1.581734 -0.884281
5  bar    one -2.433477 -0.077995
6  foo  three -0.809238 -1.526005
7  foo    one -1.327003 -0.801657

df.groupby('A').sum(numeric_only = True) # sum the numeric columns within each group of A; numeric_only = True keeps the text column B out of the sum on pandas 2.0 and later
Out[50]:
            C         D
A
bar -1.632954 -0.934259
foo -0.056816 -3.074597

df.groupby(['A', 'B']).sum()
Out[51]:
                  C         D
A   B
bar one   -2.246305 -0.633423
    three  0.613351 -0.300836
foo one   -0.557053  0.116766
    three -0.809238 -1.526005
    two    1.309475 -1.665359

df.groupby(['B', 'A']).sum()
Out[52]:
                  C         D
B     A
one   bar -2.246305 -0.633423
      foo -0.557053  0.116766
three bar  0.613351 -0.300836
      foo -0.809238 -1.526005
two   foo  1.309475 -1.665359
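
groupby is not limited to a single aggregation. As a sketch on a small made-up frame (fixed numbers, so the results can be checked by hand), agg can compute several statistics at once, and selecting the columns first avoids relying on how text columns are handled.

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'B': ['one', 'one', 'two', 'one'],
                      'C': [1.0, 2.0, 3.0, 4.0],
                      'D': [10.0, 20.0, 30.0, 40.0]})  # made-up values

# Several aggregations at once; the result has two-level column labels
stats = frame.groupby('A')[['C', 'D']].agg(['sum', 'mean'])
# Grouping by two keys gives a Series with a two-level row index
per_pair = frame.groupby(['A', 'B'])['C'].sum()

print(stats.loc['foo', ('C', 'sum')], stats.loc['bar', ('D', 'mean')])
print(per_pair[('bar', 'one')])
```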

3 pandas Quick Start (Part 3)

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Reshaping data

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
tuples
Out[2]:
[('bar', 'one'),('bar', 'two'),('baz', 'one'),('baz', 'two'),('foo', 'one'),('foo', 'two'),('qux', 'one'),('qux', 'two')]

index = pd.MultiIndex.from_tuples(tuples, names = ['first', 'second'])
index
Out[3]:
MultiIndex([('bar', 'one'),('bar', 'two'),('baz', 'one'),('baz', 'two'),('foo', 'one'),('foo', 'two'),('qux', 'one'),('qux', 'two')],names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8, 2), index = index, columns = ['A', 'B'])
df
Out[4]:
                     A         B
first second
bar   one    -1.527931 -1.753414
      two     0.403523 -0.946154
baz   one     1.580554 -0.452768
      two     0.157092  0.118553
foo   one     0.205942  0.420926
      two     0.311821  2.747632
qux   one     0.225572 -1.252437
      two    -0.680653 -1.427652

stacked = df.stack() # move the column labels A and B into a new innermost level of the row index
stacked
Out[5]:
first  second
bar    one     A   -1.527931
               B   -1.753414
       two     A    0.403523
               B   -0.946154
baz    one     A    1.580554
               B   -0.452768
       two     A    0.157092
               B    0.118553
foo    one     A    0.205942
               B    0.420926
       two     A    0.311821
               B    2.747632
qux    one     A    0.225572
               B   -1.252437
       two     A   -0.680653
               B   -1.427652
dtype: float64

stacked.index # view the full index
Out[6]:
MultiIndex([('bar', 'one', 'A'),('bar', 'one', 'B'),('bar', 'two', 'A'),('bar', 'two', 'B'),('baz', 'one', 'A'),('baz', 'one', 'B'),('baz', 'two', 'A'),('baz', 'two', 'B'),('foo', 'one', 'A'),('foo', 'one', 'B'),('foo', 'two', 'A'),('foo', 'two', 'B'),('qux', 'one', 'A'),('qux', 'one', 'B'),('qux', 'two', 'A'),('qux', 'two', 'B')],names=['first', 'second', None])

stacked.unstack() # convert back to the original layout
Out[7]:
                     A         B
first second
bar   one    -1.527931 -1.753414
      two     0.403523 -0.946154
baz   one     1.580554 -0.452768
      two     0.157092  0.118553
foo   one     0.205942  0.420926
      two     0.311821  2.747632
qux   one     0.225572 -1.252437
      two    -0.680653 -1.427652

stacked.unstack().unstack()
Out[8]:
               A                   B
second       one       two       one       two
first
bar    -1.527931  0.403523 -1.753414 -0.946154
baz     1.580554  0.157092 -0.452768  0.118553
foo     0.205942  0.311821  0.420926  2.747632
qux     0.225572 -0.680653 -1.252437 -1.427652
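
The stack and unstack round trip can be verified on a small frame of consecutive integers (a stand-in for the random data). The sketch also shows that unstack accepts a level name, so a level other than the innermost one can be moved into the columns.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
frame = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['A', 'B'])

stacked = frame.stack()                 # columns A, B become the innermost row level
restored = stacked.unstack()            # innermost row level moves back to the columns
wide = stacked.unstack(level='second')  # move the 'second' level to the columns instead

print(stacked.index.nlevels, restored.shape, wide.shape)
```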

Pivot tables

df = pd.DataFrame({'A':['one', 'two', 'three'] * 4,
                   'B':['A', 'B', 'C'] * 4,
                   'C':['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D':np.random.randn(12),
                   'E':np.random.randn(12)})
df
Out[10]:
        A  B    C         D         E
0     one  A  foo -0.860000 -1.395613
1     two  B  foo  2.222074  0.878310
2   three  C  foo  0.487662 -0.996239
3     one  A  bar  0.816647  0.072039
4     two  B  bar  0.413811  0.256600
5   three  C  bar  0.246407  2.223985
6     one  A  foo  1.615020  1.034235
7     two  B  foo  2.035441  0.658418
8   three  C  foo  1.325854 -0.363353
9     one  A  bar  0.451436  0.300830
10    two  B  bar  1.344891  0.890003
11  three  C  bar -0.944687  1.161672

df.pivot_table(values = 'D', index = ['A', 'B'], columns = 'C') # rows indexed by A and B, columns by C, cells hold the mean of D (mean is the default aggregation)
Out[11]:
C             bar       foo
A     B
one   A  0.634041  0.377510
three C -0.349140  0.906758
two   B  0.879351  2.128758
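
Each cell of the pivot table above is the mean of the matching D values; for example, (one, A, bar) is the average of 0.816647 and 0.451436. The sketch below, on made-up numbers, shows the default mean, an explicit aggfunc, and the equivalent groupby plus unstack formulation.

```python
import pandas as pd

frame = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'one'],
                      'C': ['foo', 'bar', 'foo', 'bar', 'foo'],
                      'D': [1.0, 2.0, 3.0, 4.0, 5.0]})  # made-up values

mean_table = frame.pivot_table(values='D', index='A', columns='C')  # default: mean
sum_table = frame.pivot_table(values='D', index='A', columns='C', aggfunc='sum')
by_groupby = frame.groupby(['A', 'C'])['D'].mean().unstack()        # same numbers

print(mean_table.loc['one', 'foo'], sum_table.loc['one', 'foo'])
```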

Time series

rng = pd.date_range('20200301', periods = 600, freq = 's')
rng
Out[12]:
DatetimeIndex(['2020-03-01 00:00:00', '2020-03-01 00:00:01','2020-03-01 00:00:02', '2020-03-01 00:00:03','2020-03-01 00:00:04', '2020-03-01 00:00:05','2020-03-01 00:00:06', '2020-03-01 00:00:07','2020-03-01 00:00:08', '2020-03-01 00:00:09',...'2020-03-01 00:09:50', '2020-03-01 00:09:51','2020-03-01 00:09:52', '2020-03-01 00:09:53','2020-03-01 00:09:54', '2020-03-01 00:09:55','2020-03-01 00:09:56', '2020-03-01 00:09:57','2020-03-01 00:09:58', '2020-03-01 00:09:59'],dtype='datetime64[ns]', length=600, freq='S')

s = pd.Series(np.random.randint(0, 500, size = len(rng)), index = rng)
s
Out[13]:
2020-03-01 00:00:00    261
2020-03-01 00:00:01    215
2020-03-01 00:00:02    108
2020-03-01 00:00:03    348
2020-03-01 00:00:04    365
                      ...
2020-03-01 00:09:55    231
2020-03-01 00:09:56    385
2020-03-01 00:09:57     82
2020-03-01 00:09:58    235
2020-03-01 00:09:59    475
Freq: S, Length: 600, dtype: int32

# The old form s.resample('2Min', how='sum') fails on current pandas with
# TypeError: resample() got an unexpected keyword argument 'how'
# Call the aggregation on the resampler instead:
s.resample('2min').sum() # sum the values in each 2-minute bin

rng = pd.period_range('2000Q1', '2016Q1', freq = 'Q') # from the first quarter of 2000 to the first quarter of 2016
rng
Out[22]:
PeriodIndex(['2000Q1', '2000Q2', '2000Q3', '2000Q4', '2001Q1', '2001Q2',
             '2001Q3', '2001Q4', '2002Q1', '2002Q2', '2002Q3', '2002Q4',
             '2003Q1', '2003Q2', '2003Q3', '2003Q4', '2004Q1', '2004Q2',
             '2004Q3', '2004Q4', '2005Q1', '2005Q2', '2005Q3', '2005Q4',
             '2006Q1', '2006Q2', '2006Q3', '2006Q4', '2007Q1', '2007Q2',
             '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3', '2008Q4',
             '2009Q1', '2009Q2', '2009Q3', '2009Q4', '2010Q1', '2010Q2',
             '2010Q3', '2010Q4', '2011Q1', '2011Q2', '2011Q3', '2011Q4',
             '2012Q1', '2012Q2', '2012Q3', '2012Q4', '2013Q1', '2013Q2',
             '2013Q3', '2013Q4', '2014Q1', '2014Q2', '2014Q3', '2014Q4',
             '2015Q1', '2015Q2', '2015Q3', '2015Q4', '2016Q1'],
            dtype='period[Q-DEC]', freq='Q-DEC')

rng.to_timestamp() # convert each quarter to the timestamp at which it starts
Out[24]:
DatetimeIndex(['2000-01-01', '2000-04-01', '2000-07-01', '2000-10-01',
               '2001-01-01', '2001-04-01', '2001-07-01', '2001-10-01',
               '2002-01-01', '2002-04-01', '2002-07-01', '2002-10-01',
               '2003-01-01', '2003-04-01', '2003-07-01', '2003-10-01',
               '2004-01-01', '2004-04-01', '2004-07-01', '2004-10-01',
               '2005-01-01', '2005-04-01', '2005-07-01', '2005-10-01',
               '2006-01-01', '2006-04-01', '2006-07-01', '2006-10-01',
               '2007-01-01', '2007-04-01', '2007-07-01', '2007-10-01',
               '2008-01-01', '2008-04-01', '2008-07-01', '2008-10-01',
               '2009-01-01', '2009-04-01', '2009-07-01', '2009-10-01',
               '2010-01-01', '2010-04-01', '2010-07-01', '2010-10-01',
               '2011-01-01', '2011-04-01', '2011-07-01', '2011-10-01',
               '2012-01-01', '2012-04-01', '2012-07-01', '2012-10-01',
               '2013-01-01', '2013-04-01', '2013-07-01', '2013-10-01',
               '2014-01-01', '2014-04-01', '2014-07-01', '2014-10-01',
               '2015-01-01', '2015-04-01', '2015-07-01', '2015-10-01',
               '2016-01-01'],
              dtype='datetime64[ns]', freq='QS-OCT')

pd.Timestamp('20200301') - pd.Timestamp('20200201') # compute the difference between two timestamps
Out[25]:
Timedelta('29 days 00:00:00')

pd.Timestamp('20200301') + pd.Timedelta(days = 6)
Out[26]:
Timestamp('2020-03-07 00:00:00')

df = pd.DataFrame({'id':[1, 2, 3, 4, 5, 6], 'row_grade':['a', 'b', 'b', 'a', 'a', 'd']})
df
Out[27]:
id  row_grade
0   1   a
1   2   b
2   3   b
3   4   a
4   5   a
5   6   d

df['grade'] = df.row_grade.astype('category') # convert row_grade to the category dtype
df
Out[29]:
id  row_grade   grade
0   1   a   a
1   2   b   b
2   3   b   b
3   4   a   a
4   5   a   a
5   6   d   d
In [30]:
df.grade
Out[30]:
0    a
1    b
2    b
3    a
4    a
5    d
Name: grade, dtype: category
Categories (3, object): [a, b, d]
In [32]:
df.grade.cat.categories
Out[32]:
Index(['a', 'b', 'd'], dtype='object')
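In [33]:
# Rename the categories of grade. Assigning to df.grade.cat.categories no longer works in pandas 2.0 and later, so rename_categories is used.
df['grade'] = df.grade.cat.rename_categories(['very good', 'good', 'bad'])
df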
Out[33]:
id  row_grade   grade
0   1   a   very good
1   2   b   good
2   3   b   good
3   4   a   very good
4   5   a   very good
5   6   d   bad
In [34]:
df.sort_values(by = 'grade', ascending = True) # sort by grade in ascending category order (very good, good, bad), not alphabetically
Out[34]:
id  row_grade   grade
0   1   a   very good
3   4   a   very good
4   5   a   very good
1   2   b   good
2   3   b   good
5   6   d   bad
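
The sort above follows the order of the categories rather than the alphabet. As a sketch of how that order can be controlled explicitly, the example below renames the categories with a mapping and then declares an ordered scale; the particular ordering from 'bad' to 'very good' is an assumption made for illustration.

```python
import pandas as pd

grades = pd.Series(['a', 'b', 'b', 'a', 'a', 'd']).astype('category')

# Rename with a mapping so each old label is tied to its new name explicitly
renamed = grades.cat.rename_categories({'a': 'very good', 'b': 'good', 'd': 'bad'})
# Declare an explicit order, lowest first (an assumed scale for this example)
ordered = renamed.cat.set_categories(['bad', 'good', 'very good'], ordered=True)

sorted_grades = ordered.sort_values()
print(list(renamed.cat.categories))
print(sorted_grades.tolist())
```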
Data visualization
In [36]:
s = pd.Series(np.random.randn(1000), index = pd.date_range('20000101', periods = 1000))
s
Out[36]:
2000-01-01    0.181931
2000-01-02    0.133446
2000-01-03   -0.014128
2000-01-04   -0.755955
2000-01-05    0.847678
                ...
2002-09-22   -0.153774
2002-09-23    1.409455
2002-09-24    1.177651
2002-09-25   -0.449985
2002-09-26    0.700871
Freq: D, Length: 1000, dtype: float64
In [37]:
s = s.cumsum() # cumulative sum
s
Out[37]:
2000-01-01     0.181931
2000-01-02     0.315377
2000-01-03     0.301249
2000-01-04    -0.454706
2000-01-05     0.392972
                ...
2002-09-22   -17.139739
2002-09-23   -15.730284
2002-09-24   -14.552633
2002-09-25   -15.002618
2002-09-26   -14.301747
Freq: D, Length: 1000, dtype: float64
In [38]:
s.plot()
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x139a78cba8>

Reading and writing data
In [40]:
df = pd.DataFrame(np.random.randn(100, 4), columns = list('ABCD'))
df
Out[40]:
A   B   C   D
0   -1.253046   -1.345258   -0.040495   0.390861
1   0.274316    1.073607    -0.110999   0.962318
2   1.328825    -1.236292   0.959564    -0.368799
3   0.451488    -0.646667   -1.248890   0.711037
4   0.299365    -0.485150   -0.855304   -0.355098
... ... ... ... ...
95  -0.623444   -0.490164   -0.804014   -0.827152
96  -0.271459   -0.217562   1.569923    -0.851998
97  -0.532696   1.299053    -0.858330   1.225276
98  -0.227407   -2.551994   0.291486    0.879787
99  -0.981516   0.343269    0.644074    -0.188729
100 rows × 4 columns
In [41]:
df.to_csv('data.csv') # save the data
In [42]:
%ls
 Volume in drive C is Windows8_OS
 Volume Serial Number is 2CD3-41A8

 Directory of C:\Users\wangzhaohui\Desktop\机器学习---数据科学包\第2天

2020/04/01  10:55    <DIR>          .
2020/04/01  10:55    <DIR>          ..
2020/04/01  08:57    <DIR>          .ipynb_checkpoints
2020/03/31  10:37            51,435 01.ipynb
2020/03/31  19:33            69,769 02.ipynb
2020/04/01  10:55             8,245 data.csv
2020/04/01  10:55            68,999 Untitled1.ipynb
               4 File(s)        198,448 bytes
               3 Dir(s)  13,441,826,816 bytes free
In [43]:
%more data.csv
In [44]:
pd.read_csv('data.csv') # read the data back; the saved index comes back as an extra unnamed column
Out[44]:
Unnamed: 0  A   B   C   D
0   0   -1.253046   -1.345258   -0.040495   0.390861
1   1   0.274316    1.073607    -0.110999   0.962318
2   2   1.328825    -1.236292   0.959564    -0.368799
3   3   0.451488    -0.646667   -1.248890   0.711037
4   4   0.299365    -0.485150   -0.855304   -0.355098
... ... ... ... ... ...
95  95  -0.623444   -0.490164   -0.804014   -0.827152
96  96  -0.271459   -0.217562   1.569923    -0.851998
97  97  -0.532696   1.299053    -0.858330   1.225276
98  98  -0.227407   -2.551994   0.291486    0.879787
99  99  -0.981516   0.343269    0.644074    -0.188729
100 rows × 5 columns
In [46]:
pd.read_csv('data.csv', index_col = 0) # use column 0 as the index
Out[46]:
A   B   C   D
0   -1.253046   -1.345258   -0.040495   0.390861
1   0.274316    1.073607    -0.110999   0.962318
2   1.328825    -1.236292   0.959564    -0.368799
3   0.451488    -0.646667   -1.248890   0.711037
4   0.299365    -0.485150   -0.855304   -0.355098
... ... ... ... ...
95  -0.623444   -0.490164   -0.804014   -0.827152
96  -0.271459   -0.217562   1.569923    -0.851998
97  -0.532696   1.299053    -0.858330   1.225276
98  -0.227407   -2.551994   0.291486    0.879787
99  -0.981516   0.343269    0.644074    -0.188729
100 rows × 4 columns
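
The extra "Unnamed: 0" column appears because to_csv writes the index as the first column. The round trip can be reproduced without touching the disk; this sketch uses an in-memory io.StringIO buffer as a stand-in for data.csv and also shows index=False, which skips writing the index in the first place.

```python
import io
import pandas as pd

frame = pd.DataFrame({'A': [1.5, 2.5], 'B': [3.0, 4.0]})  # made-up values

buffer = io.StringIO()          # in-memory stand-in for a file such as data.csv
frame.to_csv(buffer)
text = buffer.getvalue()

with_extra = pd.read_csv(io.StringIO(text))             # index returns as 'Unnamed: 0'
restored = pd.read_csv(io.StringIO(text), index_col=0)  # first column used as the index
no_index_text = frame.to_csv(index=False)               # do not write the index at all

print(list(with_extra.columns), list(restored.columns))
```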

4 MovieLens Movie Data Analysis

import pandas as pd
In [2]:
# Read the data into users and give each column a name
unames = ['user_id', 'gender', 'age', 'occupatation', 'zip']
# The file has no header row, so header = None. The multi-character separator '::' is not supported by the default c engine,
# so engine = 'python' is passed explicitly, which also avoids the ParserWarning about falling back to the python engine.
users = pd.read_table('ml-1m/users.dat', sep = '::', header = None, names = unames, engine = 'python')
In [3]:
print(len(users))
users.head(10)
6040
Out[3]:
user_id gender  age occupatation    zip
0   1   F   1   10  48067
1   2   M   56  16  70072
2   3   M   25  15  55117
3   4   M   45  7   02460
4   5   M   25  20  55455
5   6   F   50  9   55117
6   7   M   35  1   06810
7   8   M   25  12  11413
8   9   M   25  17  61614
9   10  F   35  1   95370
In [4]:
rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep = '::', header = None, names = rating_names, engine = 'python') # engine = 'python' avoids the ParserWarning triggered by the '::' separator
In [5]:
print(len(ratings))
ratings.head(10)
1000209
Out[5]:
user_id movie_id    rating  timestamp
0   1   1193    5   978300760
1   1   661 3   978302109
2   1   914 3   978301968
3   1   3408    4   978300275
4   1   2355    5   978824291
5   1   1197    3   978302268
6   1   1287    5   978302039
7   1   2804    5   978300719
8   1   594 4   978302268
9   1   919 4   978301368
In [6]:
movie_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep = '::', header = None, names = movie_names, engine = 'python') # engine = 'python' avoids the ParserWarning triggered by the '::' separator
In [7]:
print(len(movies))
movies.head(10)
3883
Out[7]:
movie_id    title   genres
0   1   Toy Story (1995)    Animation|Children's|Comedy
1   2   Jumanji (1995)  Adventure|Children's|Fantasy
2   3   Grumpier Old Men (1995) Comedy|Romance
3   4   Waiting to Exhale (1995)    Comedy|Drama
4   5   Father of the Bride Part II (1995)  Comedy
5   6   Heat (1995) Action|Crime|Thriller
6   7   Sabrina (1995)  Comedy|Romance
7   8   Tom and Huck (1995) Adventure|Children's
8   9   Sudden Death (1995) Action
9   10  GoldenEye (1995)    Action|Adventure|Thriller
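
The parsing step can be tried without downloading the dataset. This sketch feeds two lines, copied from the movies output above and written in the '::'-delimited layout, through the same kind of call using an in-memory buffer. Splitting genres on '|' is an extra step that is not part of the original notebook.

```python
import io
import pandas as pd

# Two records in the '::'-delimited layout, taken from the rows shown above
sample = ("1::Toy Story (1995)::Animation|Children's|Comedy\n"
          "2::Jumanji (1995)::Adventure|Children's|Fantasy\n")

movies_sample = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                            names=['movie_id', 'title', 'genres'],
                            engine='python')  # multi-character separators need this engine

genre_lists = movies_sample['genres'].str.split('|')  # one list of genres per movie
print(movies_sample.shape, genre_lists[1])
```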
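Merge the three tables
In [8]:
# With no on argument, pd.merge joins on the columns the two tables share: user_id first, then movie_id
data = pd.merge(pd.merge(users, ratings), movies)
print(len(data))
data.head(10)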
1000209
Out[8]:
user_id gender  age occupatation    zip movie_id    rating  timestamp   title   genres
0   1   F   1   10  48067   1193    5   978300760   One Flew Over the Cuckoo's Nest (1975) Drama
1   2   M   56  16  70072   1193    5   978298413   One Flew Over the Cuckoo's Nest (1975) Drama
2   12  M   25  12  32793   1193    4   978220179   One Flew Over the Cuckoo's Nest (1975) Drama
3   15  M   25  7   22903   1193    4   978199279   One Flew Over the Cuckoo's Nest (1975) Drama
4   17  M   50  1   95350   1193    5   978158471   One Flew Over the Cuckoo's Nest (1975) Drama
5   18  F   18  3   95825   1193    4   978156168   One Flew Over the Cuckoo's Nest (1975) Drama
6   19  M   1   10  48073   1193    5   982730936   One Flew Over the Cuckoo's Nest (1975) Drama
7   24  F   25  7   10023   1193    5   978136709   One Flew Over the Cuckoo's Nest (1975) Drama
8   28  F   25  1   14607   1193    3   978125194   One Flew Over the Cuckoo's Nest (1975) Drama
9   33  M   45  3   55421   1193    5   978557765   One Flew Over the Cuckoo's Nest (1975) Drama
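
The nested merge relies on pandas finding the shared column names by itself. A more explicit sketch, on three tiny tables built from values that appear in the outputs above, names the join keys with on and confirms that both forms give the same result.

```python
import pandas as pd

# Tiny stand-ins for the three tables, using values from the outputs above
users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
ratings = pd.DataFrame({'user_id': [1, 1, 2],
                        'movie_id': [1193, 661, 1193],
                        'rating': [5, 3, 5]})
movies = pd.DataFrame({'movie_id': [1193, 661],
                       'title': ["One Flew Over the Cuckoo's Nest (1975)",
                                 'James and the Giant Peach (1996)']})

explicit = users.merge(ratings, on='user_id').merge(movies, on='movie_id')
implicit = pd.merge(pd.merge(users, ratings), movies)  # keys inferred from shared names

print(len(explicit), explicit.equals(implicit))
```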
In [9]:
data[data.user_id == 1] # view all records for the user whose id is 1
Out[9]:
user_id gender  age occupatation    zip movie_id    rating  timestamp   title   genres
0   1   F   1   10  48067   1193    5   978300760   One Flew Over the Cuckoo's Nest (1975) Drama
1725    1   F   1   10  48067   661 3   978302109   James and the Giant Peach (1996)    Animation|Children's|Musical
2250    1   F   1   10  48067   914 3   978301968   My Fair Lady (1964) Musical|Romance
2886    1   F   1   10  48067   3408    4   978300275   Erin Brockovich (2000)  Drama
4201    1   F   1   10  48067   2355    5   978824291   Bug's Life, A (1998)   Animation|Children's|Comedy
5904    1   F   1   10  48067   1197    3   978302268   Princess Bride, The (1987)  Action|Adventure|Comedy|Romance
8222    1   F   1   10  48067   1287    5   978302039   Ben-Hur (1959)  Action|Adventure|Drama
8926    1   F   1   10  48067   2804    5   978300719   Christmas Story, A (1983)   Comedy|Drama
10278   1   F   1   10  48067   594 4   978302268   Snow White and the Seven Dwarfs (1937)  Animation|Children's|Musical
11041   1   F   1   10  48067   919 4   978301368   Wizard of Oz, The (1939)    Adventure|Children's|Drama|Musical
12759   1   F   1   10  48067   595 5   978824268   Beauty and the Beast (1991) Animation|Children's|Musical
13819   1   F   1   10  48067   938 4   978301752   Gigi (1958) Musical
14006   1   F   1   10  48067   2398    4   978302281   Miracle on 34th Street (1947)   Drama
14386   1   F   1   10  48067   2918    4   978302124   Ferris Bueller's Day Off (1986)    Comedy
15859   1   F   1   10  48067   1035    5   978301753   Sound of Music, The (1965)  Musical
16741   1   F   1   10  48067   2791    4   978302188   Airplane! (1980)    Comedy
18472   1   F   1   10  48067   2687    3   978824268   Tarzan (1999)   Animation|Children's
18914   1   F   1   10  48067   2018    4   978301777   Bambi (1942)    Animation|Children's
19503   1   F   1   10  48067   3105    5   978301713   Awakenings (1990)   Drama
20183   1   F   1   10  48067   2797    4   978302039   Big (1988)  Comedy|Fantasy
21674   1   F   1   10  48067   2321    3   978302205   Pleasantville (1998)    Comedy
22832   1   F   1   10  48067   720 3   978300760   Wallace & Gromit: The Best of Aardman Animatio...   Animation
23270   1   F   1   10  48067   1270    5   978300055   Back to the Future (1985)   Comedy|Sci-Fi
25853   1   F   1   10  48067   527 5   978824195   Schindler's List (1993)    Drama|War
28157   1   F   1   10  48067   2340    3   978300103   Meet Joe Black (1998)   Romance
28501   1   F   1   10  48067   48  5   978824351   Pocahontas (1995)   Animation|Children's|Musical|Romance
28883   1   F   1   10  48067   1097    4   978301953   E.T. the Extra-Terrestrial (1982)   Children's|Drama|Fantasy|Sci-Fi
31152   1   F   1   10  48067   1721    4   978300055   Titanic (1997)  Drama|Romance
32698   1   F   1   10  48067   1545    4   978824139   Ponette (1996)  Drama
32771   1   F   1   10  48067   745 3   978824268   Close Shave, A (1995)   Animation|Comedy|Thriller
33428   1   F   1   10  48067   2294    4   978824291   Antz (1998) Animation|Children's
34073   1   F   1   10  48067   3186    4   978300019   Girl, Interrupted (1999)    Drama
34504   1   F   1   10  48067   1566    4   978824330   Hercules (1997) Adventure|Animation|Children's|Comedy|Musical
34973   1   F   1   10  48067   588 4   978824268   Aladdin (1992)  Animation|Children's|Comedy|Musical
36324   1   F   1   10  48067   1907    4   978824330   Mulan (1998)    Animation|Children's
36814   1   F   1   10  48067   783 4   978824291   Hunchback of Notre Dame, The (1996) Animation|Children's|Musical
37204   1   F   1   10  48067   1836    5   978300172   Last Days of Disco, The (1998)  Drama
37339   1   F   1   10  48067   1022    5   978300055   Cinderella (1950)   Animation|Children's|Musical
37916   1   F   1   10  48067   2762    4   978302091   Sixth Sense, The (1999) Thriller
40375   1   F   1   10  48067   150 5   978301777   Apollo 13 (1995)    Drama
41626   1   F   1   10  48067   1   5   978824268   Toy Story (1995)    Animation|Children's|Comedy
43703   1   F   1   10  48067   1961    5   978301590   Rain Man (1988) Drama
45033   1   F   1   10  48067   1962    4   978301753   Driving Miss Daisy (1989)   Drama
45685   1   F   1   10  48067   2692    4   978301570   Run Lola Run (Lola rennt) (1998)    Action|Crime|Romance
46757   1   F   1   10  48067   260 4   978300760   Star Wars: Episode IV - A New Hope (1977)   Action|Adventure|Fantasy|Sci-Fi
49748   1   F   1   10  48067   1028    5   978301777   Mary Poppins (1964) Children's|Comedy|Musical
50759   1   F   1   10  48067   1029    5   978302205   Dumbo (1941)    Animation|Children's|Musical
51327   1   F   1   10  48067   1207    4   978300719   To Kill a Mockingbird (1962)    Drama
52255   1   F   1   10  48067   2028    5   978301619   Saving Private Ryan (1998)  Action|Drama|War
54908   1   F   1   10  48067   531 4   978302149   Secret Garden, The (1993)   Children's|Drama
55246   1   F   1   10  48067   3114    4   978302174   Toy Story 2 (1999)  Animation|Children's|Comedy
56831   1   F   1   10  48067   608 4   978301398   Fargo (1996)    Crime|Drama|Thriller
59344   1   F   1   10  48067   1246    4   978302091   Dead Poets Society (1989)   Drama
In [10]:
# Mean rating of each movie, computed separately for female and male viewers
ratings_by_gender = data.pivot_table(values = 'rating', index = 'title', columns = 'gender', aggfunc = 'mean')
ratings_by_gender.head(10)
Out[10]:
gender  F   M
title
$1,000,000 Duck (1971)  3.375000    2.761905
'Night Mother (1986)   3.388889    3.352941
'Til There Was You (1997)  2.675676    2.733333
'burbs, The (1989) 2.793478    2.962085
...And Justice for All (1979)   3.828571    3.689024
1-900 (1994)    2.000000    3.000000
10 Things I Hate About You (1999)   3.646552    3.311966
101 Dalmatians (1961)   3.791444    3.500000
101 Dalmatians (1996)   3.240000    2.911215
12 Angry Men (1957) 4.184397    4.328421
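A pivot table with one `index` column, one `columns` column and `aggfunc = 'mean'` is a grouped mean reshaped into a grid. The sketch below uses a small made-up frame (hypothetical data) and builds the table both ways; the two results should contain the same numbers.

```python
import pandas as pd

# Hypothetical ratings: movie B has no male rating at all
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'rating': [4, 2, 4, 5, 3],
})

by_pivot = toy.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
# Group by both keys, average, then move 'gender' from the row index to the columns
by_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack('gender')

print(by_pivot)
print(by_groupby)
```

A title/gender combination with no ratings shows up as NaN in both tables.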
In [11]:
ratings_by_gender['diff'] = ratings_by_gender.F - ratings_by_gender.M # add a column holding the female mean minus the male mean
ratings_by_gender.head(10)
Out[11]:
gender  F   M   diff
title
$1,000,000 Duck (1971)  3.375000    2.761905    0.613095
'Night Mother (1986)   3.388889    3.352941    0.035948
'Til There Was You (1997)  2.675676    2.733333    -0.057658
'burbs, The (1989) 2.793478    2.962085    -0.168607
...And Justice for All (1979)   3.828571    3.689024    0.139547
1-900 (1994)    2.000000    3.000000    -1.000000
10 Things I Hate About You (1999)   3.646552    3.311966    0.334586
101 Dalmatians (1961)   3.791444    3.500000    0.291444
101 Dalmatians (1996)   3.240000    2.911215    0.328785
12 Angry Men (1957) 4.184397    4.328421    -0.144024
In [12]:
ratings_by_gender.sort_values(by = 'diff', ascending = True).head(10)
# ratings_by_gender.head(10)
Out[12]:
gender  F   M   diff
title
Tigrero: A Film That Was Never Made (1994)  1.0 4.333333    -3.333333
Neon Bible, The (1995)  1.0 4.000000    -3.000000
Enfer, L' (1994)   1.0 3.750000    -2.750000
Stalingrad (1993)   1.0 3.593750    -2.593750
Killer: A Journal of Murder (1995)  1.0 3.428571    -2.428571
Dangerous Ground (1997) 1.0 3.333333    -2.333333
In God's Hands (1998)  1.0 3.333333    -2.333333
Rosie (1998)    1.0 3.333333    -2.333333
Flying Saucer, The (1950)   1.0 3.300000    -2.300000
Jamaica Inn (1939)  1.0 3.142857    -2.142857
In [13]:
data.head(10)
Out[13]:
user_id gender  age occupatation    zip movie_id    rating  timestamp   title   genres
0   1   F   1   10  48067   1193    5   978300760   One Flew Over the Cuckoo's Nest (1975) Drama
1   2   M   56  16  70072   1193    5   978298413   One Flew Over the Cuckoo's Nest (1975) Drama
2   12  M   25  12  32793   1193    4   978220179   One Flew Over the Cuckoo's Nest (1975) Drama
3   15  M   25  7   22903   1193    4   978199279   One Flew Over the Cuckoo's Nest (1975) Drama
4   17  M   50  1   95350   1193    5   978158471   One Flew Over the Cuckoo's Nest (1975) Drama
5   18  F   18  3   95825   1193    4   978156168   One Flew Over the Cuckoo's Nest (1975) Drama
6   19  M   1   10  48073   1193    5   982730936   One Flew Over the Cuckoo's Nest (1975) Drama
7   24  F   25  7   10023   1193    5   978136709   One Flew Over the Cuckoo's Nest (1975) Drama
8   28  F   25  1   14607   1193    3   978125194   One Flew Over the Cuckoo's Nest (1975) Drama
9   33  M   45  3   55421   1193    5   978557765   One Flew Over the Cuckoo's Nest (1975) Drama
In [14]:
rating_by_title = data.groupby('title').size() # number of ratings each movie received
rating_by_title.head(10)
Out[14]:
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64
In [15]:
rating_by_title.sort_values(ascending = False).head(10) # sort by number of ratings, most-rated first
Out[15]:
title
American Beauty (1999)                                   3428
Star Wars: Episode IV - A New Hope (1977)                2991
Star Wars: Episode V - The Empire Strikes Back (1980)    2990
Star Wars: Episode VI - Return of the Jedi (1983)        2883
Jurassic Park (1993)                                     2672
Saving Private Ryan (1998)                               2653
Terminator 2: Judgment Day (1991)                        2649
Matrix, The (1999)                                       2590
Back to the Future (1985)                                2583
Silence of the Lambs, The (1991)                         2578
dtype: int64
In [16]:
mean_ratings = data.pivot_table(values = 'rating', index = 'title', aggfunc = 'mean') # mean rating of each movie
mean_ratings.head(10)
Out[16]:
rating
title
$1,000,000 Duck (1971)  3.027027
'Night Mother (1986)   3.371429
'Til There Was You (1997)  2.692308
'burbs, The (1989) 2.910891
...And Justice for All (1979)   3.713568
1-900 (1994)    2.500000
10 Things I Hate About You (1999)   3.422857
101 Dalmatians (1961)   3.596460
101 Dalmatians (1996)   3.046703
12 Angry Men (1957) 4.295455
In [17]:
mean_ratings.sort_values(by = 'rating', ascending = False).head(10) # sort movies by mean rating, highest first
Out[17]:
rating
title
Ulysses (Ulisse) (1954) 5.0
Lured (1947)    5.0
Follow the Bitch (1998) 5.0
Bittersweet Motel (2000)    5.0
Song of Freedom (1936)  5.0
One Little Indian (1973)    5.0
Smashing Time (1967)    5.0
Schlafes Bruder (Brother of Sleep) (1995)   5.0
Gate of Heavenly Peace, The (1995)  5.0
Baby, The (1973)    5.0
In [18]:
top_10_hot = rating_by_title.sort_values(ascending = False).head(10)
top_10_hot
Out[18]:
title
American Beauty (1999)                                   3428
Star Wars: Episode IV - A New Hope (1977)                2991
Star Wars: Episode V - The Empire Strikes Back (1980)    2990
Star Wars: Episode VI - Return of the Jedi (1983)        2883
Jurassic Park (1993)                                     2672
Saving Private Ryan (1998)                               2653
Terminator 2: Judgment Day (1991)                        2649
Matrix, The (1999)                                       2590
Back to the Future (1985)                                2583
Silence of the Lambs, The (1991)                         2578
dtype: int64
In [32]:
mean_ratings.loc[top_10_hot.index, :]
Out[32]:
rating
title
American Beauty (1999)  4.317386
Star Wars: Episode IV - A New Hope (1977)   4.453694
Star Wars: Episode V - The Empire Strikes Back (1980)   4.292977
Star Wars: Episode VI - Return of the Jedi (1983)   4.022893
Jurassic Park (1993)    3.763847
Saving Private Ryan (1998)  4.337354
Terminator 2: Judgment Day (1991)   4.058513
Matrix, The (1999)  4.315830
Back to the Future (1985)   3.990321
Silence of the Lambs, The (1991)    4.351823
In [31]:
print(type(rating_by_title))
<class 'pandas.core.series.Series'>
In [24]:
top_20_score = mean_ratings.sort_values(by = 'rating', ascending = False).head(20) # the 20 movies with the highest mean rating
top_20_score
Out[24]:
rating
title
Ulysses (Ulisse) (1954) 5.000000
Lured (1947)    5.000000
Follow the Bitch (1998) 5.000000
Bittersweet Motel (2000)    5.000000
Song of Freedom (1936)  5.000000
One Little Indian (1973)    5.000000
Smashing Time (1967)    5.000000
Schlafes Bruder (Brother of Sleep) (1995)   5.000000
Gate of Heavenly Peace, The (1995)  5.000000
Baby, The (1973)    5.000000
I Am Cuba (Soy Cuba/Ya Kuba) (1964) 4.800000
Lamerica (1994) 4.750000
Apple, The (Sib) (1998) 4.666667
Sanjuro (1962)  4.608696
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 4.560510
Shawshank Redemption, The (1994)    4.554558
Godfather, The (1972)   4.524966
Close Shave, A (1995)   4.520548
Usual Suspects, The (1995)  4.517106
Schindler's List (1993)    4.510417
In [25]:
rating_by_title[top_20_score.index] # number of ratings behind each of those 20 movies
Out[25]:
title
Ulysses (Ulisse) (1954)                                                   1
Lured (1947)                                                              1
Follow the Bitch (1998)                                                   1
Bittersweet Motel (2000)                                                  1
Song of Freedom (1936)                                                    1
One Little Indian (1973)                                                  1
Smashing Time (1967)                                                      2
Schlafes Bruder (Brother of Sleep) (1995)                                 1
Gate of Heavenly Peace, The (1995)                                        3
Baby, The (1973)                                                          1
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                                       5
Lamerica (1994)                                                           8
Apple, The (Sib) (1998)                                                   9
Sanjuro (1962)                                                           69
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)     628
Shawshank Redemption, The (1994)                                       2227
Godfather, The (1972)                                                  2223
Close Shave, A (1995)                                                   657
Usual Suspects, The (1995)                                             1783
Schindler's List (1993)                                                2304
dtype: int64
In [26]:
hot_movies = rating_by_title[rating_by_title > 1000] # movies with more than 1000 ratings
print(len(hot_movies))
hot_movies.head(10)
207
Out[26]:
title
2001: A Space Odyssey (1968)    1716
Abyss, The (1989)               1715
African Queen, The (1951)       1057
Air Force One (1997)            1076
Airplane! (1980)                1731
Aladdin (1992)                  1351
Alien (1979)                    2024
Aliens (1986)                   1820
Amadeus (1984)                  1382
American Beauty (1999)          3428
dtype: int64
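The cells above show that a ranking by mean rating alone is dominated by movies with only one or two ratings. One possible way to combine the two views is to compute the count and the mean together, keep only movies above a count threshold, and then sort by the mean. This is a sketch on made-up data; the threshold `min_ratings` is a hypothetical name, and its value here is tiny compared with the 1000 used on the full dataset.

```python
import pandas as pd

# Hypothetical ratings: B has a perfect score but only a single rating
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [4, 5, 3, 5, 2, 4],
})

# One row per movie with both the number of ratings and the mean rating
stats = toy.groupby('title')['rating'].agg(['count', 'mean'])

min_ratings = 2  # assumed cut-off, chosen only for this toy example
popular = stats[stats['count'] >= min_ratings].sort_values(by='mean', ascending=False)
print(popular)
```

Movie B drops out despite its perfect score because a single rating says little about how it is received in general.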
In [28]:
mean_ratings.head(10)
Out[28]:
rating
title
$1,000,000 Duck (1971)  3.027027
'Night Mother (1986)   3.371429
'Til There Was You (1997)  2.692308
'burbs, The (1989) 2.910891
...And Justice for All (1979)   3.713568
1-900 (1994)    2.500000
10 Things I Hate About You (1999)   3.422857
101 Dalmatians (1961)   3.596460
101 Dalmatians (1996)   3.046703
12 Angry Men (1957) 4.295455

5 Core pandas data structures

import pandas as pd
import numpy as np
Creating a Series object
In [2]:
s = pd.Series(np.random.randn(5), index = ['a', 'b', 'c', 'd', 'e'])
s
Out[2]:
a   -2.208328
b    0.687184
c   -1.615629
d    0.049625
e    0.073550
dtype: float64
In [3]:
s.index # view the index of s
Out[3]:
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
In [4]:
s = pd.Series(np.random.randn(5)) # without an explicit index, pandas assigns a default integer index
s
Out[4]:
0   -1.068177
1    1.408667
2   -0.003749
3   -0.752304
4    0.379617
dtype: float64
Creating a Series from a dict
In [5]:
d = {'a':1, 'b':2, 'd':4}
s = pd.Series(d, index = list('abcd')) # 'c' is not a key of d, so its value is NaN
s
Out[5]:
a    1.0
b    2.0
c    NaN
d    4.0
dtype: float64
Creating a Series from a scalar
In [6]:
s = pd.Series(5, index = list('abcd'))
s
Out[6]:
a    5
b    5
c    5
d    5
dtype: int64
Properties of Series objects
1. ndarray-like behavior
In [7]:
s = pd.Series(np.random.randn(5))
s
Out[7]:
0    1.508598
1    1.647663
2    0.820469
3    0.502848
4    1.358940
dtype: float64
In [8]:
s[0]
Out[8]:
1.5085975977102866
In [9]:
s[:3]
Out[9]:
0    1.508598
1    1.647663
2    0.820469
dtype: float64
In [10]:
s[2:5]
Out[10]:
2    0.820469
3    0.502848
4    1.358940
dtype: float64
In [11]:
s[[1, 3, 4]] # a list of integer labels is also supported
Out[11]:
1    1.647663
3    0.502848
4    1.358940
dtype: float64
In [12]:
np.sin(s)
Out[12]:
0    0.998066
1    0.997047
2    0.731466
3    0.481923
4    0.977642
dtype: float64
Dict-like behavior
In [13]:
s = pd.Series(np.random.randn(5), index = list('abcde'))
s
Out[13]:
a   -1.215274
b    2.374261
c    1.050139
d    1.091067
e    0.982012
dtype: float64
In [14]:
s['a']
Out[14]:
-1.21527427068595
In [15]:
s['b'] = 3
In [16]:
s
Out[16]:
a   -1.215274
b    3.000000
c    1.050139
d    1.091067
e    0.982012
dtype: float64
In [17]:
s['f'] = 100 # assigning to a new label adds an entry
In [18]:
s
Out[18]:
a     -1.215274
b      3.000000
c      1.050139
d      1.091067
e      0.982012
f    100.000000
dtype: float64
Label alignment
In [19]:
s1 = pd.Series(np.random.randn(3), index = list('abe'))
s2 = pd.Series(np.random.randn(3), index = list('ace'))
s1
Out[19]:
a   -0.654712
b    1.445863
e   -0.929808
dtype: float64
In [20]:
s2
Out[20]:
a   -1.239739
c    0.024641
e    0.748008
dtype: float64
In [21]:
s1 + s2
Out[21]:
a   -1.894451
b         NaN
c         NaN
e   -0.181800
dtype: float64
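Arithmetic between two Series aligns on labels first, so a label present on only one side yields NaN. If a missing value should count as zero instead, `Series.add` takes a `fill_value` argument. A small sketch with fixed numbers (made up so the result is easy to check by hand):

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=list('abe'))
s2 = pd.Series([10.0, 20.0, 30.0], index=list('ace'))

print(s1 + s2)                    # 'b' and 'c' exist on one side only, so they become NaN
total = s1.add(s2, fill_value=0)  # a label missing on one side is treated as 0
print(total)
```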
DataFrame
1. Creating a DataFrame from a dict
In [22]:
d = {'one':pd.Series([1, 2, 3], index = list('abc')), 'two':pd.Series([1, 2, 3, 4], index = list('abcd'))}
d
Out[22]:
{'one': a    1
 b    2
 c    3
 dtype: int64, 'two': a    1
 b    2
 c    3
 d    4
 dtype: int64}
In [23]:
df = pd.DataFrame(d)
df
Out[23]:
one two
a   1.0 1
b   2.0 2
c   3.0 3
d   NaN 4
In [24]:
df = pd.DataFrame(d, index = ['a', 'b', 'd'])
df
Out[24]:
one two
a   1.0 1
b   2.0 2
d   NaN 4
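In [25]:
# Build a DataFrame from a dict of lists. All lists must have the same length, otherwise an error is raised. This differs from a dict of Series, which is aligned on the index.
d = {'one':[1, 2, 3, 4], 'two':[11, 22, 33, 44]}
df = pd.DataFrame(d)
df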
Out[25]:
one two
0   1   11
1   2   22
2   3   33
3   4   44
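To make the length rule concrete, here is a short sketch that tries both variants: a dict of lists with mismatched lengths, which raises a `ValueError`, and a dict of Series with mismatched lengths, which is aligned on the index and padded with NaN. The values are made up for illustration.

```python
import pandas as pd

# Lists of different lengths cannot be combined into columns
try:
    pd.DataFrame({'one': [1, 2, 3], 'two': [11, 22, 33, 44]})
except ValueError as exc:
    print('ValueError:', exc)

# Series of different lengths are aligned on their index instead
df_ok = pd.DataFrame({'one': pd.Series([1, 2, 3]), 'two': pd.Series([11, 22, 33, 44])})
print(df_ok)
```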
Creating a DataFrame from a list
In [26]:
data = [(0, 1, 2), ('one', 3, 4)] # each tuple becomes one row
df = pd.DataFrame(data)
df
Out[26]:
0   1   2
0   0   1   2
1   one 3   4
Creating a DataFrame from a list of dicts
In [27]:
data = [{'a':1, 'b':2}, {'a':3, 'b':4, 'c':5}] # the dict keys become the column labels
df = pd.DataFrame(data)
df
Out[27]:
a   b   c
0   1   2   NaN
1   3   4   5.0
Creating a DataFrame from a Series
In [28]:
data = pd.Series(np.random.randn(5), index = list('abcde'))
df = pd.DataFrame(data)
df
Out[28]:
0
a   -0.236843
b   -0.108453
c   -0.434020
d   1.244950
e   -0.453697
DataFrame features
Selecting, adding and deleting columns
In [29]:
df = pd.DataFrame(np.random.randn(6, 4), columns = ['one', 'two', 'three', 'four'])
df
Out[29]:
one two three   four
0   -0.297524   0.297013    -1.079953   1.046785
1   0.160032    0.444978    0.301297    0.636168
2   0.187298    -1.486748   0.228950    1.823566
3   0.528394    1.282184    1.157520    -0.848585
4   -0.010840   -0.347936   -0.501200   -1.252119
5   0.946331    -0.978641   0.442339    1.023809
In [30]:
# Select a column
df['one']
Out[30]:
0   -0.297524
1    0.160032
2    0.187298
3    0.528394
4   -0.010840
5    0.946331
Name: one, dtype: float64
In [31]:
# Select a row by its label
df.loc[1]
Out[31]:
one      0.160032
two      0.444978
three    0.301297
four     0.636168
Name: 1, dtype: float64
In [32]:
# Assign to a column (this overwrites the existing 'three')
df['three'] = df['one'] + df['two']
df
Out[32]:
one two three   four
0   -0.297524   0.297013    -0.000510   1.046785
1   0.160032    0.444978    0.605009    0.636168
2   0.187298    -1.486748   -1.299450   1.823566
3   0.528394    1.282184    1.810577    -0.848585
4   -0.010840   -0.347936   -0.358776   -1.252119
5   0.946331    -0.978641   -0.032310   1.023809
In [33]:
df
Out[33]:
one two three   four
0   -0.297524   0.297013    -0.000510   1.046785
1   0.160032    0.444978    0.605009    0.636168
2   0.187298    -1.486748   -1.299450   1.823566
3   0.528394    1.282184    1.810577    -0.848585
4   -0.010840   -0.347936   -0.358776   -1.252119
5   0.946331    -0.978641   -0.032310   1.023809
In [34]:
del df['three']
df
Out[34]:
one two four
0   -0.297524   0.297013    1.046785
1   0.160032    0.444978    0.636168
2   0.187298    -1.486748   1.823566
3   0.528394    1.282184    -0.848585
4   -0.010840   -0.347936   -1.252119
5   0.946331    -0.978641   1.023809
In [35]:
# Add a column
df['flag'] = df['one'] > 0.2
df
Out[35]:
one two four    flag
0   -0.297524   0.297013    1.046785    False
1   0.160032    0.444978    0.636168    False
2   0.187298    -1.486748   1.823566    False
3   0.528394    1.282184    -0.848585   True
4   -0.010840   -0.347936   -1.252119   False
5   0.946331    -0.978641   1.023809    True
In [36]:
df.pop('four')
Out[36]:
0    1.046785
1    0.636168
2    1.823566
3   -0.848585
4   -1.252119
5    1.023809
Name: four, dtype: float64
In [37]:
df
Out[37]:
one two flag
0   -0.297524   0.297013    False
1   0.160032    0.444978    False
2   0.187298    -1.486748   False
3   0.528394    1.282184    True
4   -0.010840   -0.347936   False
5   0.946331    -0.978641   True
In [38]:
df.insert(1, 'bar', df['one'] + df['two']) # plain column assignment always appends at the end; insert lets you choose the position
df
Out[38]:
one bar two flag
0   -0.297524   -0.000510   0.297013    False
1   0.160032    0.605009    0.444978    False
2   0.187298    -1.299450   -1.486748   False
3   0.528394    1.810577    1.282184    True
4   -0.010840   -0.358776   -0.347936   False
5   0.946331    -0.032310   -0.978641   True
In [39]:
df.assign(Ratio = df['one'] / df['two']) # assign returns a new DataFrame and leaves df unchanged; insert modifies df in place
Out[39]:
one bar two flag    Ratio
0   -0.297524   -0.000510   0.297013    False   -1.001718
1   0.160032    0.605009    0.444978    False   0.359640
2   0.187298    -1.299450   -1.486748   False   -0.125978
3   0.528394    1.810577    1.282184    True    0.412105
4   -0.010840   -0.358776   -0.347936   False   0.031156
5   0.946331    -0.032310   -0.978641   True    -0.966985
In [40]:
# assign also accepts a function
df.assign(ratio = lambda x : x.one - x.two)
Out[40]:
one bar two flag    ratio
0   -0.297524   -0.000510   0.297013    False   -0.594537
1   0.160032    0.605009    0.444978    False   -0.284946
2   0.187298    -1.299450   -1.486748   False   1.674046
3   0.528394    1.810577    1.282184    True    -0.753790
4   -0.010840   -0.358776   -0.347936   False   0.337095
5   0.946331    -0.032310   -0.978641   True    1.924972
In [41]:
df.one
Out[41]:
0   -0.297524
1    0.160032
2    0.187298
3    0.528394
4   -0.010840
5    0.946331
Name: one, dtype: float64
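In [42]:
# The second assign operates on the copy returned by the first one. That intermediate DataFrame has no name, so it can only be reached through a function such as a lambda.
df.assign(abratio = lambda x : x.one - x.two).assign(abvalue = lambda x : x.abratio * 10)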
Out[42]:
one bar two flag    abratio abvalue
0   -0.297524   -0.000510   0.297013    False   -0.594537   -5.945371
1   0.160032    0.605009    0.444978    False   -0.284946   -2.849461
2   0.187298    -1.299450   -1.486748   False   1.674046    16.740456
3   0.528394    1.810577    1.282184    True    -0.753790   -7.537898
4   -0.010840   -0.358776   -0.347936   False   0.337095    3.370953
5   0.946331    -0.032310   -0.978641   True    1.924972    19.249725
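The chained form can often be written as a single `assign` call, because later keyword arguments may refer to columns created by earlier ones. This is a sketch under the assumption of a reasonably recent pandas (that behavior was added in pandas 0.23); the column names `diff` and `diff_x10` are made up. Note that the column is read as `x['diff']`, since `x.diff` would refer to the `DataFrame.diff` method.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]})

result = df.assign(
    diff=lambda x: x.one - x.two,
    diff_x10=lambda x: x['diff'] * 10,  # refers to the column created on the previous line
)
print(result)
print(df.columns.tolist())  # df itself still has only 'one' and 'two'
```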
DataFrame indexing
In [43]:
df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index = list('abcdef'), columns = list('ABCD'))
df
Out[43]:
A   B   C   D
a   3   3   2   4
b   2   8   4   3
c   2   5   2   1
d   6   2   2   9
e   1   3   6   1
f   3   4   2   4
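With a frame like this, the usual indexers are `loc` (by label), `iloc` (by integer position) and a boolean mask. The sketch below uses a deterministic frame built with `np.arange` rather than the random one above, an assumption made only so the results are reproducible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4), index=list('abcdef'), columns=list('ABCD'))

by_label = df.loc['a':'c', ['A', 'C']]  # label-based: the slice end 'c' is included
by_position = df.iloc[0:3, [0, 2]]      # position-based: the slice end 3 is excluded
masked = df[df['A'] > 10]               # boolean mask: rows where column A exceeds 10

print(by_label)
print(by_position)
print(masked)
```

The two slices select the same three rows: a label slice includes its end point, a positional slice does not.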

6 Basic pandas operations

import pandas as pd
import numpy as np
Reindexing
In [2]:
s = pd.Series([1, 3, 5, 7, 9], index = list('acefh'))
s
Out[2]:
a    1
c    3
e    5
f    7
h    9
dtype: int64
In [3]:
s.index
Out[3]:
Index(['a', 'c', 'e', 'f', 'h'], dtype='object')
In [4]:
s.reindex(list('abcdefgh')) # add new rows; labels not present in s get NaN
Out[4]:
a    1.0
b    NaN
c    3.0
d    NaN
e    5.0
f    7.0
g    NaN
h    9.0
dtype: float64
In [5]:
s.reindex(list('abcdefgh'), fill_value = 0) # give the newly added rows a default value
Out[5]:
a    1
b    0
c    3
d    0
e    5
f    7
g    0
h    9
dtype: int64
In [7]:
s.reindex(list('abcdefgh'), method = 'ffill') # fill each new row with the value from the preceding row
# ffill only fills newly added rows; it has no effect on newly added columns
Out[7]:
a    1
b    1
c    3
d    3
e    5
f    7
g    7
h    9
dtype: int64
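The mirror image of `ffill` is `bfill`, which fills each new label from the next existing one. Both fill methods require the original index to be sorted (monotonic). A short sketch on the same Series:

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9], index=list('acefh'))

# 'b' takes the value of 'c', 'd' takes 'e', 'g' takes 'h'
filled = s.reindex(list('abcdefgh'), method='bfill')
print(filled)
```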
In [8]:
df = pd.DataFrame(np.random.randn(4, 6), index = list('ACEG'), columns = ['one', 'two', 'three', 'four', 'five', 'sex'])
df
Out[8]:
one two three   four    five    sex
A   0.459135    -1.267627   0.579744    0.238482    -2.049527   -1.365882
C   -0.350446   -1.427801   0.574660    0.374694    0.647261    0.155820
E   0.321467    -0.066514   -0.433891   -0.130726   0.043650    0.936114
G   0.585555    0.997346    -3.290051   1.298093    -0.661426   0.312379
In [9]:
df2 = df.reindex(index = list('ABCDEFG')) # reindex returns a new DataFrame; df itself is not modified
df2
Out[9]:
one two three   four    five    sex
A   0.459135    -1.267627   0.579744    0.238482    -2.049527   -1.365882
B   NaN NaN NaN NaN NaN NaN
C   -0.350446   -1.427801   0.574660    0.374694    0.647261    0.155820
D   NaN NaN NaN NaN NaN NaN
E   0.321467    -0.066514   -0.433891   -0.130726   0.043650    0.936114
F   NaN NaN NaN NaN NaN NaN
G   0.585555    0.997346    -3.290051   1.298093    -0.661426   0.312379
In [10]:
df
Out[10]:
one two three   four    five    sex
A   0.459135    -1.267627   0.579744    0.238482    -2.049527   -1.365882
C   -0.350446   -1.427801   0.574660    0.374694    0.647261    0.155820
E   0.321467    -0.066514   -0.433891   -0.130726   0.043650    0.936114
G   0.585555    0.997346    -3.290051   1.298093    -0.661426   0.312379
In [12]:
df2 = df.reindex(index = list('ABCDEFG'), fill_value = 0)
df2
Out[12]:
one two three   four    five    sex
A   0.459135    -1.267627   0.579744    0.238482    -2.049527   -1.365882
B   0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
C   -0.350446   -1.427801   0.574660    0.374694    0.647261    0.155820
D   0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
E   0.321467    -0.066514   -0.433891   -0.130726   0.043650    0.936114
F   0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
G   0.585555    0.997346    -3.290051   1.298093    -0.661426   0.312379
In [13]:
# Reindex the columns
df.reindex(columns = ['one', 'three', 'five', 'seven'])
Out[13]:
one three   five    seven
A   0.459135    0.579744    -2.049527   NaN
C   -0.350446   0.574660    0.647261    NaN
E   0.321467    -0.433891   0.043650    NaN
G   0.585555    -3.290051   -0.661426   NaN
Dropping a row
In [14]:
df
Out[14]:
one two three   four    five    sex
A   0.459135    -1.267627   0.579744    0.238482    -2.049527   -1.365882
C   -0.350446   -1.427801   0.574660    0.374694    0.647261    0.155820
E   0.321467    -0.066514   -0.433891   -0.130726   0.043650    0.936114
G   0.585555    0.997346    -3.290051   1.298093    -0.661426   0.312379
In [15]:
df.drop('A')
Out[15]:
one two three   four    five    sex
C   -0.350446   -1.427801   0.574660    0.374694    0.647261    0.155820
E   0.321467    -0.066514   -0.433891   -0.130726   0.043650    0.936114
G   0.585555    0.997346    -3.290051   1.298093    -0.661426   0.312379
In [16]:
df
Out[16]:
one two three   four    five    sex
A   0.459135    -1.267627   0.579744    0.238482    -2.049527   -1.365882
C   -0.350446   -1.427801   0.574660    0.374694    0.647261    0.155820
E   0.321467    -0.066514   -0.433891   -0.130726   0.043650    0.936114
G   0.585555    0.997346    -3.290051   1.298093    -0.661426   0.312379
Dropping columns
In [17]:
df.drop(['two', 'four'], axis = 1)
Out[17]:
one three   five    sex
A   0.459135    0.579744    -2.049527   -1.365882
C   -0.350446   0.574660    0.647261    0.155820
E   0.321467    -0.433891   0.043650    0.936114
G   0.585555    -3.290051   -0.661426   0.312379
In [18]:
df
Out[18]:
one two three   four    five    sex
A   0.459135    -1.267627   0.579744    0.238482    -2.049527   -1.365882
C   -0.350446   -1.427801   0.574660    0.374694    0.647261    0.155820
E   0.321467    -0.066514   -0.433891   -0.130726   0.043650    0.936114
G   0.585555    0.997346    -3.290051   1.298093    -0.661426   0.312379
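`drop` also accepts the keyword arguments `index=` and `columns=`, which some readers find clearer than `axis`. Like the calls above, it returns a new object and leaves the original alone. A sketch on a small deterministic frame (made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('ACE'),
                  columns=['one', 'two', 'three', 'four'])

no_row = df.drop(index='A')                 # same as df.drop('A')
no_cols = df.drop(columns=['two', 'four'])  # same as df.drop(['two', 'four'], axis=1)

print(no_row)
print(no_cols)
print(df.shape)  # the original is untouched
```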
apply and applymap
apply passes a whole row or a whole column to the function
In [19]:
df = pd.DataFrame(np.arange(12).reshape(4, 3), index = ['one', 'two', 'three', 'four'], columns = list('ABC'))
df
Out[19]:
A   B   C
one 0   1   2
two 3   4   5
three   6   7   8
four    9   10  11
In [20]:
df.apply(lambda x : x.max() - x.min()) # by default the function is applied to each column
Out[20]:
A    9
B    9
C    9
dtype: int64
In [21]:
df.apply(lambda x : x.max() - x.min(), axis = 1) # apply the function to each row
Out[21]:
one      2
two      2
three    2
four     2
dtype: int64
In [22]:
def min_max(x):
    return pd.Series([x.min(), x.max()], index = ['min', 'max'])

df.apply(min_max)
Out[22]:
A   B   C
min 0   1   2
max 9   10  11
In [23]:
def min_max(x):
    return pd.Series([x.min(), x.max()], index = ['min', 'max'])

df.apply(min_max, axis = 1)
Out[23]:
min max
one 0   2
two 3   5
three   6   8
four    9   11
applymap passes each individual element to the function (since pandas 2.1 it is deprecated in favor of DataFrame.map)
In [24]:
df = pd.DataFrame(np.random.randn(4, 3), index = ['one', 'two', 'three', 'four'], columns = list('ABC'))
df
Out[24]:
A   B   C
one -0.697469   -1.224409   -0.887984
two 1.268382    0.999928    1.028207
three   -1.159685   -1.631597   -0.778589
four    0.158716    0.617732    -0.742091
In [27]:
# Show every value in df with three digits after the decimal point
# formater = lambda x : '%.03f' % x # x is the single value at one row and column
formater = '{0:.03f}'.format
df.applymap(formater)
Out[27]:
A   B   C
one -0.697  -1.224  -0.888
two 1.268   1.000   1.028
three   -1.160  -1.632  -0.779
four    0.159   0.618   -0.742
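The formatted result holds strings, not numbers, which is fine for display but awkward for further arithmetic. If the values should stay numeric, `DataFrame.round` is one alternative. The sketch below compares the two on a couple of made-up rows. The string version is built with `apply` plus `Series.map`, a choice made here only so the example avoids `applymap` and its deprecation warning on newer pandas.

```python
import pandas as pd

df = pd.DataFrame({'A': [-0.697469, 1.268382], 'B': [-1.224409, 0.999928]})

# Element-wise formatting: every cell becomes a string such as '1.000'
as_text = df.apply(lambda col: col.map('{0:.3f}'.format))
# Rounding: the cells remain floats
as_numbers = df.round(3)

print(as_text)
print(as_numbers.dtypes)
```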
Sorting and ranking
In [30]:
s = pd.Series([2, 6, 3, 6, 1, 0])
s
Out[30]:
0    2
1    6
2    3
3    6
4    1
5    0
dtype: int64
In [31]:
s.rank()
Out[31]:
0    3.0
1    5.5
2    4.0
3    5.5
4    2.0
5    1.0
dtype: float64
In [32]:
s.rank(method = 'first')
Out[32]:
0    3.0
1    5.0
2    4.0
3    6.0
4    2.0
5    1.0
dtype: float64
In [33]:
s.rank(method = 'average')
Out[33]:
0    3.0
1    5.5
2    4.0
3    5.5
4    2.0
5    1.0
dtype: float64
In [34]:
df
Out[34]:
A   B   C
one -0.697469   -1.224409   -0.887984
two 1.268382    0.999928    1.028207
three   -1.159685   -1.631597   -0.778589
four    0.158716    0.617732    -0.742091
In [35]:
df.rank(method = 'first')
Out[35]:
A   B   C
one 2.0 2.0 1.0
two 4.0 4.0 4.0
three   1.0 1.0 2.0
four    3.0 3.0 3.0
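The `method` argument only matters when values tie. Besides `'average'` (the default) and `'first'`, `rank` also supports `'min'`, `'max'` and `'dense'`. A sketch on a small made-up Series with one tie in the middle puts the five methods side by side:

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])  # the two 20s tie for ranks 2 and 3

ranks = pd.DataFrame({
    'average': s.rank(method='average'),  # tied values share the mean of their ranks
    'min': s.rank(method='min'),          # tied values share the lowest rank
    'max': s.rank(method='max'),          # tied values share the highest rank
    'dense': s.rank(method='dense'),      # like 'min', but the next rank is not skipped
    'first': s.rank(method='first'),      # ties are broken by order of appearance
})
print(ranks)
```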
Unique values
In [36]:
s = pd.Series(list('abbcdeefggghhi'))
s
Out[36]:
0     a
1     b
2     b
3     c
4     d
5     e
6     e
7     f
8     g
9     g
10    g
11    h
12    h
13    i
dtype: object
In [37]:
s.value_counts() # count how many times each value occurs
Out[37]:
g    3
e    2
h    2
b    2
d    1
a    1
c    1
f    1
i    1
dtype: int64
In [38]:
s.unique() # return the distinct values, without duplicates
Out[38]:
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], dtype=object)
In [39]:
s.isin(['a', 'j', 'd']) # test whether each element of s is in the given list
Out[39]:
0      True
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
dtype: bool
In [40]:
s.isin(s.unique())
Out[40]:
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
dtype: bool
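The boolean Series returned by `isin` is mostly used as a mask for filtering. A short sketch on the same data, which also inverts the mask with `~`:

```python
import pandas as pd

s = pd.Series(list('abbcdeefggghhi'))
mask = s.isin(['a', 'j', 'd'])

print(s[mask])            # the entries whose value is in the list
print(s[~mask].unique())  # the distinct values that are not in the list
```

The value `'j'` never occurs in `s`, so it simply matches nothing; `isin` does not raise for values that are absent.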