Machine Learning: Data Science Packages, Day 2
1 pandas Quick Start (Part 1)
pd.Series(): a Series consists of a set of values together with an associated index.
import pandas as pd
import numpy as np
s = pd.Series([1, 3, 5, np.nan, 8, 4]) # np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
print(s)
Output:
0 1.0
1 3.0
2 5.0
3 NaN
4 8.0
5 4.0
dtype: float64
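date_range is one of the most frequently used pandas functions, especially when working with time-series data. It generates a DatetimeIndex, which serves as the index of a time series.
dates = pd.date_range('20200301', periods = 6) # first argument: start date; periods: number of dates to generate
print(dates)
Output: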
DatetimeIndex(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04','2020-03-05', '2020-03-06'],dtype='datetime64[ns]', freq='D')
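As a small sketch of how the same index can be requested in other ways, the example below uses the `end` and `freq` parameters of `date_range`. The variable names are made up for illustration.

```python
import pandas as pd

# Six consecutive days starting on 2020-03-01 (daily frequency is the default).
daily = pd.date_range("20200301", periods=6)

# The same index expressed with an explicit end date instead of a count.
by_end = pd.date_range(start="2020-03-01", end="2020-03-06")

# Every second day within the same window.
every_other = pd.date_range("2020-03-01", "2020-03-06", freq="2D")

print(daily.equals(by_end))
print(list(every_other.strftime("%Y-%m-%d")))
```

Giving `periods` is convenient when the length matters; giving `end` is convenient when the window matters.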
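pd.DataFrame() creates a tabular data structure with ordered columns, where each column can hold a different value type.
# first argument: table contents; index: row labels; columns: column labels
data = pd.DataFrame(np.random.randn(6, 4), index = dates, columns = list('ABCD'))
print(data)
Output: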
A B C D
2020-03-01 0.167420 0.008361 0.706377 0.752018
2020-03-02 -0.942199 0.064732 -0.193355 -0.096331
2020-03-03 0.450057 0.692476 -0.015185 0.093720
2020-03-04 -0.040322 -0.108812 0.138713 1.137867
2020-03-05 -0.492778 -1.677214 -0.039888 0.336530
2020-03-06 0.098786 -0.904987 -1.102924 -0.415507
Check the shape of data:
data.shape
Output:
(6, 4)
View the values of data as a NumPy array:
data.values
Output:
array([[ 0.16741957, 0.00836142, 0.70637657, 0.7520177 ],
       [-0.94219939, 0.064732 , -0.19335507, -0.09633117],
       [ 0.45005695, 0.69247647, -0.01518453, 0.09371978],
       [-0.04032154, -0.10881161, 0.13871262, 1.13786689],
       [-0.49277839, -1.67721437, -0.03988837, 0.33653038],
       [ 0.0987858 , -0.90498673, -1.10292439, -0.41550716]])
data.head() returns the first 5 rows by default:
data.head()
Output:
A B C D
2020-03-01 0.167420 0.008361 0.706377 0.752018
2020-03-02 -0.942199 0.064732 -0.193355 -0.096331
2020-03-03 0.450057 0.692476 -0.015185 0.093720
2020-03-04 -0.040322 -0.108812 0.138713 1.137867
2020-03-05 -0.492778 -1.677214 -0.039888 0.336530
Return a specified number of rows:
data.head(2) # return the first two rows
Output:
A B C D
2020-03-01 0.167420 0.008361 0.706377 0.752018
2020-03-02 -0.942199 0.064732 -0.193355 -0.096331
data.tail() returns the last 5 rows by default:
data.tail()
Output:
A B C D
2020-03-02 -0.942199 0.064732 -0.193355 -0.096331
2020-03-03 0.450057 0.692476 -0.015185 0.093720
2020-03-04 -0.040322 -0.108812 0.138713 1.137867
2020-03-05 -0.492778 -1.677214 -0.039888 0.336530
2020-03-06 0.098786 -0.904987 -1.102924 -0.415507
View the row labels:
data.index # returns the row labels
Output:
DatetimeIndex(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04','2020-03-05', '2020-03-06'],dtype='datetime64[ns]', freq='D')
View the column labels:
data.columns # returns the column labels
Output:
Index(['A', 'B', 'C', 'D'], dtype='object')
Get summary statistics for the data:
data.describe() # summary statistics for each column
Output:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.126506 -0.320907 -0.084377 0.301383
std 0.505275 0.837829 0.588412 0.569077
min -0.942199 -1.677214 -1.102924 -0.415507
25% -0.379664 -0.705943 -0.154988 -0.048818
50% 0.029232 -0.050225 -0.027536 0.215125
75% 0.150261 0.050639 0.100238 0.648146
max 0.450057 0.692476 0.706377 1.137867
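count: number of non-missing values; mean: average; std: standard deviation (not the variance); min: minimum; 25%, 50%, 75%: first quartile, median, third quartile; max: maximum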
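To make the meaning of the `std` and `50%` rows concrete, here is a minimal sketch that recomputes them by hand. The seeded generator and the name `frame` are only there for reproducibility and are not part of the original notes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, chosen only for reproducibility
frame = pd.DataFrame(rng.standard_normal((6, 4)), columns=list("ABCD"))

summary = frame.describe()

# describe() reports the sample standard deviation (ddof=1), not the variance.
manual_std = frame["A"].to_numpy().std(ddof=1)

# The 50% row is the median.
manual_median = frame["A"].median()

print(np.isclose(summary.loc["std", "A"], manual_std))
print(np.isclose(summary.loc["50%", "A"], manual_median))
```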
Transpose the data:
data.T
Output:
2020-03-01 2020-03-02 2020-03-03 2020-03-04 2020-03-05 2020-03-06
A 0.167420 -0.942199 0.450057 -0.040322 -0.492778 0.098786
B 0.008361 0.064732 0.692476 -0.108812 -1.677214 -0.904987
C 0.706377 -0.193355 -0.015185 0.138713 -0.039888 -1.102924
D 0.752018 -0.096331 0.093720 1.137867 0.336530 -0.415507
Sorting:
data.sort_index(axis = 1) # sort by column labels
Output:
A B C D
2020-03-01 0.147174 -0.605480 1.043737 0.005772
2020-03-02 0.074604 -1.100579 0.450711 0.857264
2020-03-03 -0.246770 -0.369136 -0.109472 0.709246
2020-03-04 -0.061607 -0.398656 -1.007450 0.263131
2020-03-05 -1.298202 0.449078 1.079647 -0.647769
2020-03-06 -1.112085 2.234422 -0.257315 -0.015560
Sorting in descending order:
data.sort_index(axis = 1, ascending = False) # sort by column labels, descending
Output:
D C B A
2020-03-01 0.005772 1.043737 -0.605480 0.147174
2020-03-02 0.857264 0.450711 -1.100579 0.074604
2020-03-03 0.709246 -0.109472 -0.369136 -0.246770
2020-03-04 0.263131 -1.007450 -0.398656 -0.061607
2020-03-05 -0.647769 1.079647 0.449078 -1.298202
2020-03-06 -0.015560 -0.257315 2.234422 -1.112085
Sorting by values:
data.sort_values(by = 'A') # sort rows by the values in column A
Output:
A B C D
2020-03-05 -1.298202 0.449078 1.079647 -0.647769
2020-03-06 -1.112085 2.234422 -0.257315 -0.015560
2020-03-03 -0.246770 -0.369136 -0.109472 0.709246
2020-03-04 -0.061607 -0.398656 -1.007450 0.263131
2020-03-02 0.074604 -1.100579 0.450711 0.857264
2020-03-01 0.147174 -0.605480 1.043737 0.005772
Selecting data:
data['A'] # select column A
# data.A also works
Output:
2020-03-01 0.147174
2020-03-02 0.074604
2020-03-03 -0.246770
2020-03-04 -0.061607
2020-03-05 -1.298202
2020-03-06 -1.112085
Freq: D, Name: A, dtype: float64
Select rows by position:
data[2:4]
Output:
A B C D
2020-03-03 -0.246770 -0.369136 -0.109472 0.709246
2020-03-04 -0.061607 -0.398656 -1.007450 0.263131
Rows can also be selected by row label:
data['20200301':'20200304']
Output:
A B C D
2020-03-01 0.147174 -0.605480 1.043737 0.005772
2020-03-02 0.074604 -1.100579 0.450711 0.857264
2020-03-03 -0.246770 -0.369136 -0.109472 0.709246
2020-03-04 -0.061607 -0.398656 -1.007450 0.263131
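The approach above is less efficient, however, and .loc is recommended instead. .loc accepts only row and column labels, not integer positions, so with this date index data.loc[2:4] raises an error.
data.loc['20200301':'20200304']
Output: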
A B C D
2020-03-01 0.147174 -0.605480 1.043737 0.005772
2020-03-02 0.074604 -1.100579 0.450711 0.857264
2020-03-03 -0.246770 -0.369136 -0.109472 0.709246
2020-03-04 -0.061607 -0.398656 -1.007450 0.263131
.iloc accepts only integer positions, not row or column labels:
data.iloc[2:5]
Output:
A B C D
2020-03-03 -0.246770 -0.369136 -0.109472 0.709246
2020-03-04 -0.061607 -0.398656 -1.007450 0.263131
2020-03-05 -1.298202 0.449078 1.079647 -0.647769
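One difference between the two accessors is easy to miss: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position. The sketch below uses a small frame filled with `np.arange` (an assumption made so the values are predictable) to show this.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200301", periods=6)
data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = data.loc["20200301":"20200304"]  # label slice: the end label is included
by_position = data.iloc[0:4]                # position slice: the end position is excluded

print(len(by_label), len(by_position))
print(by_label.equals(by_position))

# The two styles can be chained: rows by position, then columns by label.
subset = data.iloc[2:5].loc[:, ["B", "C"]]
print(subset.shape)
```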
Other uses of .loc:
data.loc[:, ['B', 'C']] # select only columns B and C
Output:
B C
2020-03-01 -0.605480 1.043737
2020-03-02 -1.100579 0.450711
2020-03-03 -0.369136 -0.109472
2020-03-04 -0.398656 -1.007450
2020-03-05 0.449078 1.079647
2020-03-06 2.234422 -0.257315
data.loc['20200301':'20200304', 'A':'D']
Output:
A B C D
2020-03-01 0.147174 -0.605480 1.043737 0.005772
2020-03-02 0.074604 -1.100579 0.450711 0.857264
2020-03-03 -0.246770 -0.369136 -0.109472 0.709246
2020-03-04 -0.061607 -0.398656 -1.007450 0.263131
Access a single value:
data.loc['20200304', 'B'] # access one specific value by label
Output:
-0.39865622221605224
.at is faster for accessing a single value by label:
data.at[pd.Timestamp('20200304'), 'B'] # scalar access by label
Output:
-0.39865622221605224
To access a single value by integer position, .iat is recommended:
data.iat[1, 1] # scalar access by position (row 1, column 1)
Output:
-1.1005786367873263
Boolean indexing:
data[data.A > 0] # select the rows where column A is greater than 0
Output:
A B C D
2020-03-01 0.147174 -0.605480 1.043737 0.005772
2020-03-02 0.074604 -1.100579 0.450711 0.857264
data[data > 0] # keep the values greater than 0; all other cells become NaN
Output:
A B C D
2020-03-01 0.147174 NaN 1.043737 0.005772
2020-03-02 0.074604 NaN 0.450711 0.857264
2020-03-03 NaN NaN NaN 0.709246
2020-03-04 NaN NaN NaN 0.263131
2020-03-05 NaN 0.449078 1.079647 NaN
2020-03-06 NaN 2.234422 NaN NaN
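Boolean masks can also be combined. The following is a minimal sketch on a hand-written frame (the values are invented so the result is easy to check) showing `&`, `isin()`, and `where()` with a replacement value.

```python
import pandas as pd

dates = pd.date_range("20200301", periods=6)
data = pd.DataFrame(
    {"A": [1, -2, 3, -4, 5, -6], "B": [-1, 2, -3, 4, 5, -6]},
    index=dates,
)

# Each condition needs its own parentheses; use & (and), | (or), ~ (not).
both_positive = data[(data["A"] > 0) & (data["B"] > 0)]

# isin() tests membership against a list of values.
selected = data[data["A"].isin([1, 5])]

# where() keeps the shape and replaces non-matching cells, here with 0 instead of NaN.
clipped = data.where(data > 0, 0)

print(both_positive)
print(selected)
print(clipped)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.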
2 pandas Quick Start (Part 2)
# embed matplotlib figures directly in the notebook
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = pd.date_range('20200301', periods = 6)
df = pd.DataFrame(np.random.randn(6, 4), index = dates, columns = list('ABCD'))
print(df)
A B C D
2020-03-01 -0.339025 1.321283 1.563992 -1.757175
2020-03-02 -0.808769 1.927426 -1.080492 0.403419
2020-03-03 -2.812467 0.227888 -0.487071 -0.413189
2020-03-04 -1.856601 1.019911 1.791216 2.702585
2020-03-05 -0.581288 1.300641 0.129494 -0.897040
2020-03-06 -0.340724 0.086645 -0.380084 0.960427
# create some missing data
df1 = df.reindex(index = dates[0:4], columns = list(df.columns) + ['E']) # take the first 4 rows and add a new column E
df1
A B C D E
2020-03-01 -0.339025 1.321283 1.563992 -1.757175 NaN
2020-03-02 -0.808769 1.927426 -1.080492 0.403419 NaN
2020-03-03 -2.812467 0.227888 -0.487071 -0.413189 NaN
2020-03-04 -1.856601 1.019911 1.791216 2.702585 NaN
df1.loc[dates[1:3], 'E'] = 2 # assign values to some of the missing cells
df1
# this gives a two-dimensional DataFrame in which some data is missing
Out[7]:
A B C D E
2020-03-01 -0.339025 1.321283 1.563992 -1.757175 NaN
2020-03-02 -0.808769 1.927426 -1.080492 0.403419 2.0
2020-03-03 -2.812467 0.227888 -0.487071 -0.413189 2.0
2020-03-04 -1.856601 1.019911 1.791216 2.702585 NaN
Two ways to handle missing data:
dropna()
fillna(value = num)
df1.dropna() # drop the rows that contain missing data
Out[12]:
A B C D E
2020-03-02 -0.808769 1.927426 -1.080492 0.403419 2.0
2020-03-03 -2.812467 0.227888 -0.487071 -0.413189 2.0
df1.fillna(value = 5) # replace missing data with the given value
Out[13]:
A B C D E
2020-03-01 -0.339025 1.321283 1.563992 -1.757175 5.0
2020-03-02 -0.808769 1.927426 -1.080492 0.403419 2.0
2020-03-03 -2.812467 0.227888 -0.487071 -0.413189 2.0
2020-03-04 -1.856601 1.019911 1.791216 2.702585 5.0
pd.isnull(df1) # test each cell for missing data
Out[14]:
A B C D E
2020-03-01 False False False False True
2020-03-02 False False False False False
2020-03-03 False False False False False
2020-03-04 False False False False True
# in a large table it is hard to spot missing values by eye, so reduce to one flag per column:
pd.isnull(df1).any()
Out[17]:
A False
B False
C False
D False
E True
dtype: bool
# with many columns even that is hard to scan, so reduce again to a single flag:
pd.isnull(df1).any().any()
Out[18]:
True
df1
Out[19]:
A B C D E
2020-03-01 -0.339025 1.321283 1.563992 -1.757175 NaN
2020-03-02 -0.808769 1.927426 -1.080492 0.403419 2.0
2020-03-03 -2.812467 0.227888 -0.487071 -0.413189 2.0
2020-03-04 -1.856601 1.019911 1.791216 2.702585 NaN
df1.mean() # missing values are excluded from the calculation
Out[21]:
A -1.454216
B 1.124127
C 0.446911
D 0.233910
E 2.000000
dtype: float64
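The missing-data helpers above take a few options that are worth knowing. This is a minimal sketch on a tiny hand-written frame (the values are invented) covering `dropna(how=...)`, `dropna(subset=...)`, filling with per-column means, and `skipna`.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "E": [np.nan, 2.0, 2.0, np.nan]}
)

# Drop a row only when every value in it is missing (none here, so nothing is dropped).
kept_all = df1.dropna(how="all")

# Drop rows that are missing a value in column E specifically.
complete_e = df1.dropna(subset=["E"])

# Fill each column with its own mean instead of a single constant.
filled = df1.fillna(df1.mean())

# mean() skips NaN by default; skipna=False propagates it instead.
print(df1["E"].mean())
print(df1["E"].mean(skipna=False))
```

None of these calls modify `df1` in place; each returns a new object.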
The apply() function:
df.apply(np.cumsum) # cumulative sum down each column
Out[22]:
A B C D
2020-03-01 -0.339025 1.321283 1.563992 -1.757175
2020-03-02 -1.147794 3.248709 0.483500 -1.353756
2020-03-03 -3.960262 3.476597 -0.003572 -1.766945
2020-03-04 -5.816863 4.496508 1.787645 0.935640
2020-03-05 -6.398151 5.797148 1.917139 0.038600
2020-03-06 -6.738874 5.883793 1.537055 0.999027
df
Out[23]:
A B C D
2020-03-01 -0.339025 1.321283 1.563992 -1.757175
2020-03-02 -0.808769 1.927426 -1.080492 0.403419
2020-03-03 -2.812467 0.227888 -0.487071 -0.413189
2020-03-04 -1.856601 1.019911 1.791216 2.702585
2020-03-05 -0.581288 1.300641 0.129494 -0.897040
2020-03-06 -0.340724 0.086645 -0.380084 0.960427
df.apply(lambda x : x.max() - x.min()) # range (max minus min) of each column
Out[24]:
A 2.473442
B 1.840781
C 2.871708
D 4.459761
dtype: float64
s = pd.Series(np.random.randint(10, 20, size = 20))
s
Out[25]:
0 19
1 13
2 19
3 19
4 12
5 19
6 18
7 15
8 14
9 18
10 16
11 14
12 18
13 10
14 17
15 18
16 10
17 16
18 10
19 12
dtype: int32
s.value_counts() # count how many times each value occurs
Out[27]:
19 4
18 4
10 3
16 2
14 2
12 2
17 1
15 1
13 1
dtype: int64
s.mode() # the most frequent value(s); ties are all returned
Out[28]:
0 18
1 19
dtype: int32
Combining data:
df = pd.DataFrame(np.random.randn(10, 4), columns = list('ABCD'))
df
Out[29]:
A B C D
0 -1.289507 -1.002505 1.792938 -1.885870
1 1.024196 -0.978207 0.990827 0.467831
2 0.287490 1.029234 -0.788564 0.508841
3 -1.971881 1.151978 -1.276380 3.042233
4 -0.706756 2.127796 0.255050 -0.649438
5 0.961700 0.329416 0.003750 0.516274
6 0.105380 0.399627 -1.472621 -0.605783
7 0.791631 0.707824 1.587626 -0.033991
8 -0.336135 -0.483174 0.100718 0.243218
9 -0.272511 -1.086092 0.650176 -0.106609
df.iloc[:3]
Out[30]:
A B C D
0 -1.289507 -1.002505 1.792938 -1.885870
1 1.024196 -0.978207 0.990827 0.467831
2 0.287490 1.029234 -0.788564 0.508841
df.iloc[3:7]
Out[31]:
A B C D
3 -1.971881 1.151978 -1.276380 3.042233
4 -0.706756 2.127796 0.255050 -0.649438
5 0.961700 0.329416 0.003750 0.516274
6 0.105380 0.399627 -1.472621 -0.605783
df.iloc[7:]
Out[32]:
A B C D
7 0.791631 0.707824 1.587626 -0.033991
8 -0.336135 -0.483174 0.100718 0.243218
9 -0.272511 -1.086092 0.650176 -0.106609
df1 = pd.concat([df.iloc[:3], df.iloc[3:7], df.iloc[7:]]) # concatenate the three parts above
df1
Out[34]:
A B C D
0 -1.289507 -1.002505 1.792938 -1.885870
1 1.024196 -0.978207 0.990827 0.467831
2 0.287490 1.029234 -0.788564 0.508841
3 -1.971881 1.151978 -1.276380 3.042233
4 -0.706756 2.127796 0.255050 -0.649438
5 0.961700 0.329416 0.003750 0.516274
6 0.105380 0.399627 -1.472621 -0.605783
7 0.791631 0.707824 1.587626 -0.033991
8 -0.336135 -0.483174 0.100718 0.243218
9 -0.272511 -1.086092 0.650176 -0.106609
df == df1 # element-wise comparison of the two tables
Out[35]:
A B C D
0 True True True True
1 True True True True
2 True True True True
3 True True True True
4 True True True True
5 True True True True
6 True True True True
7 True True True True
8 True True True True
9 True True True True
(df == df1).all()
Out[37]:
A True
B True
C True
D True
dtype: bool
(df == df1).all().all()
Out[38]:
True
left = pd.DataFrame({'key':['foo', 'foo'], 'lval':[1, 2]})
right = pd.DataFrame({'key':['foo', 'foo'], 'rval':[4, 5]})
left
Out[39]:
key lval
0 foo 1
1 foo 2
right
Out[40]:
key rval
0 foo 4
1 foo 5
pd.merge(left, right, on = 'key') # join the two tables on key; every pair of rows with a matching key is combined
Out[41]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
pd.merge(right, left, on = 'key')
Out[42]:
key rval lval
0 foo 4 1
1 foo 4 2
2 foo 5 1
3 foo 5 2
s = pd.Series(np.random.randint(1, 5, size = 4), index = list('ABCD'))
s
Out[43]:
A 1
B 2
C 4
D 2
dtype: int32
df
Out[44]:
A B C D
0 -1.289507 -1.002505 1.792938 -1.885870
1 1.024196 -0.978207 0.990827 0.467831
2 0.287490 1.029234 -0.788564 0.508841
3 -1.971881 1.151978 -1.276380 3.042233
4 -0.706756 2.127796 0.255050 -0.649438
5 0.961700 0.329416 0.003750 0.516274
6 0.105380 0.399627 -1.472621 -0.605783
7 0.791631 0.707824 1.587626 -0.033991
8 -0.336135 -0.483174 0.100718 0.243218
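9 -0.272511 -1.086092 0.650176 -0.106609
# DataFrame.append was removed in pandas 2.0; concatenating a one-row frame gives the same result
pd.concat([df, s.to_frame().T], ignore_index = True) # add s to df as a new row and renumber the index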
Out[45]:
A B C D
0 -1.289507 -1.002505 1.792938 -1.885870
1 1.024196 -0.978207 0.990827 0.467831
2 0.287490 1.029234 -0.788564 0.508841
3 -1.971881 1.151978 -1.276380 3.042233
4 -0.706756 2.127796 0.255050 -0.649438
5 0.961700 0.329416 0.003750 0.516274
6 0.105380 0.399627 -1.472621 -0.605783
7 0.791631 0.707824 1.587626 -0.033991
8 -0.336135 -0.483174 0.100718 0.243218
9 -0.272511 -1.086092 0.650176 -0.106609
10 1.000000 2.000000 4.000000 2.000000
df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B':['one', 'one', 'two', 'three', 'two', 'one', 'three', 'one'],
                   'C':np.random.randn(8),
                   'D':np.random.randn(8)})
df
Out[47]:
A B C D
0 foo one 0.769949 0.918423
1 bar one 0.187173 -0.555428
2 foo two -0.272259 -0.781077
3 bar three 0.613351 -0.300836
4 foo two 1.581734 -0.884281
5 bar one -2.433477 -0.077995
6 foo three -0.809238 -1.526005
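7 foo one -1.327003 -0.801657
df.groupby('A')[['C', 'D']].sum() # sum the numeric columns within each group of A (selecting them explicitly keeps the string column B out of the result)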
Out[50]:
C D
A
bar -1.632954 -0.934259
foo -0.056816 -3.074597
df.groupby(['A', 'B']).sum()
Out[51]:
C D
A B
bar one -2.246305 -0.633423
three 0.613351 -0.300836
foo one -0.557053 0.116766
three -0.809238 -1.526005
two 1.309475 -1.665359
df.groupby(['B', 'A']).sum()
Out[52]:
C D
B A
one bar -2.246305 -0.633423
foo -0.557053 0.116766
three bar 0.613351 -0.300836
foo -0.809238 -1.526005
two foo 1.309475 -1.665359
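A grouped frame can be aggregated with more than `sum()`. Below is a minimal sketch on a small hand-written frame (values invented for easy checking) that selects the numeric columns explicitly and applies a different function per column with `agg()`. As far as I know, recent pandas versions no longer drop string columns silently in `sum()`, which is one reason to select columns explicitly.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1.0, 2.0, 3.0, 4.0],
        "D": [10.0, 20.0, 30.0, 40.0],
    }
)

# Select the numeric columns explicitly so string columns are not aggregated.
totals = df.groupby("A")[["C", "D"]].sum()

# agg() applies a different function per column; as_index=False keeps A as a column.
summary = df.groupby("A", as_index=False).agg({"C": "mean", "D": "max"})

print(totals)
print(summary)
```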
3 pandas Quick Start (Part 3)
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Reshaping data:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
tuples
Out[2]:
[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names = ['first', 'second'])
index
Out[3]:
MultiIndex([('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index = index, columns = ['A', 'B'])
df
Out[4]:
A B
first second
bar one -1.527931 -1.753414
two 0.403523 -0.946154
baz one 1.580554 -0.452768
two 0.157092 0.118553
foo one 0.205942 0.420926
two 0.311821 2.747632
qux one 0.225572 -1.252437
two -0.680653 -1.427652
stacked = df.stack() # move the column labels A and B into the innermost level of the row index
stacked
Out[5]:
first second
bar one A -1.527931
B -1.753414
two A 0.403523
B -0.946154
baz one A 1.580554
B -0.452768
two A 0.157092
B 0.118553
foo one A 0.205942
B 0.420926
two A 0.311821
B 2.747632
qux one A 0.225572
B -1.252437
two A -0.680653
B -1.427652
dtype: float64
stacked.index # view the full index
Out[6]:
MultiIndex([('bar', 'one', 'A'), ('bar', 'one', 'B'), ('bar', 'two', 'A'), ('bar', 'two', 'B'), ('baz', 'one', 'A'), ('baz', 'one', 'B'), ('baz', 'two', 'A'), ('baz', 'two', 'B'), ('foo', 'one', 'A'), ('foo', 'one', 'B'), ('foo', 'two', 'A'), ('foo', 'two', 'B'), ('qux', 'one', 'A'), ('qux', 'one', 'B'), ('qux', 'two', 'A'), ('qux', 'two', 'B')], names=['first', 'second', None])
stacked.unstack() # convert back to the original layout
Out[7]:
A B
first second
bar one -1.527931 -1.753414
two 0.403523 -0.946154
baz one 1.580554 -0.452768
two 0.157092 0.118553
foo one 0.205942 0.420926
two 0.311821 2.747632
qux one 0.225572 -1.252437
two -0.680653 -1.427652
stacked.unstack().unstack()
Out[8]:
A B
second one two one two
first
bar -1.527931 0.403523 -1.753414 -0.946154
baz 1.580554 0.157092 -0.452768 0.118553
foo 0.205942 0.311821 0.420926 2.747632
qux 0.225572 -0.680653 -1.252437 -1.427652
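The round trip between `stack()` and `unstack()` can be checked on a smaller frame. This is a sketch that uses `MultiIndex.from_product` and integer values (both are choices made here for brevity, not taken from the notes) and also shows that `unstack()` can target a level by name.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

stacked = df.stack()                       # column labels A/B become the innermost row level
round_trip = stacked.unstack()             # the innermost row level moves back to the columns
by_first = stacked.unstack(level="first")  # any level can be moved by name

print(stacked.index.nlevels)
print(round_trip)
print(by_first.columns.tolist())
```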
Pivot tables:
df = pd.DataFrame({'A':['one', 'two', 'three'] * 4,
                   'B':['A', 'B', 'C'] * 4,
                   'C':['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D':np.random.randn(12),
                   'E':np.random.randn(12)})
df
Out[10]:
A B C D E
0 one A foo -0.860000 -1.395613
1 two B foo 2.222074 0.878310
2 three C foo 0.487662 -0.996239
3 one A bar 0.816647 0.072039
4 two B bar 0.413811 0.256600
5 three C bar 0.246407 2.223985
6 one A foo 1.615020 1.034235
7 two B foo 2.035441 0.658418
8 three C foo 1.325854 -0.363353
9 one A bar 0.451436 0.300830
10 two B bar 1.344891 0.890003
11 three C bar -0.944687 1.161672
df.pivot_table(values = 'D', index = ['A', 'B'], columns = 'C') # rows indexed by A and B, columns by C; each cell holds the mean of D
Out[11]:
C bar foo
A B
one A 0.634041 0.377510
three C -0.349140 0.906758
two B 0.879351 2.128758
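By default `pivot_table` averages the values that fall into the same cell. The sketch below, on an invented four-row frame, contrasts that default with `aggfunc="sum"` and uses `fill_value` to replace empty cells.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "two"],
        "C": ["foo", "foo", "bar", "foo"],
        "D": [1.0, 3.0, 5.0, 7.0],
    }
)

# Default aggregation: the mean of D within each (A, C) cell.
means = df.pivot_table(values="D", index="A", columns="C")

# aggfunc changes the aggregation; fill_value replaces cells with no data.
sums = df.pivot_table(values="D", index="A", columns="C", aggfunc="sum", fill_value=0)

print(means)
print(sums)
```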
Time series:
rng = pd.date_range('20200301', periods = 600, freq = 's')
rng
Out[12]:
DatetimeIndex(['2020-03-01 00:00:00', '2020-03-01 00:00:01','2020-03-01 00:00:02', '2020-03-01 00:00:03','2020-03-01 00:00:04', '2020-03-01 00:00:05','2020-03-01 00:00:06', '2020-03-01 00:00:07','2020-03-01 00:00:08', '2020-03-01 00:00:09',...'2020-03-01 00:09:50', '2020-03-01 00:09:51','2020-03-01 00:09:52', '2020-03-01 00:09:53','2020-03-01 00:09:54', '2020-03-01 00:09:55','2020-03-01 00:09:56', '2020-03-01 00:09:57','2020-03-01 00:09:58', '2020-03-01 00:09:59'],dtype='datetime64[ns]', length=600, freq='S')
s = pd.Series(np.random.randint(0, 500, size = len(rng)), index = rng)
s
Out[13]:
2020-03-01 00:00:00 261
2020-03-01 00:00:01 215
2020-03-01 00:00:02 108
2020-03-01 00:00:03 348
2020-03-01 00:00:04 365
...
2020-03-01 00:09:55 231
2020-03-01 00:09:56 385
2020-03-01 00:09:57 82
2020-03-01 00:09:58 235
2020-03-01 00:09:59 475
Freq: S, Length: 600, dtype: int32
# s.resample('2Min', how='sum') raises "TypeError: resample() got an unexpected keyword argument 'how'" because the how keyword was removed; call the aggregation on the resampler instead:
s.resample('2min').sum()
rng = pd.period_range('2000Q1', '2016Q1', freq = 'Q') # from the first quarter of 2000 to the first quarter of 2016
rng
Out[22]:
PeriodIndex(['2000Q1', '2000Q2', '2000Q3', '2000Q4', '2001Q1', '2001Q2',
             '2001Q3', '2001Q4', '2002Q1', '2002Q2', '2002Q3', '2002Q4',
             '2003Q1', '2003Q2', '2003Q3', '2003Q4', '2004Q1', '2004Q2',
             '2004Q3', '2004Q4', '2005Q1', '2005Q2', '2005Q3', '2005Q4',
             '2006Q1', '2006Q2', '2006Q3', '2006Q4', '2007Q1', '2007Q2',
             '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3', '2008Q4',
             '2009Q1', '2009Q2', '2009Q3', '2009Q4', '2010Q1', '2010Q2',
             '2010Q3', '2010Q4', '2011Q1', '2011Q2', '2011Q3', '2011Q4',
             '2012Q1', '2012Q2', '2012Q3', '2012Q4', '2013Q1', '2013Q2',
             '2013Q3', '2013Q4', '2014Q1', '2014Q2', '2014Q3', '2014Q4',
             '2015Q1', '2015Q2', '2015Q3', '2015Q4', '2016Q1'],
            dtype='period[Q-DEC]', freq='Q-DEC')
rng.to_timestamp() # convert each quarter to the date on which it starts
Out[24]:
DatetimeIndex(['2000-01-01', '2000-04-01', '2000-07-01', '2000-10-01',
               '2001-01-01', '2001-04-01', '2001-07-01', '2001-10-01',
               '2002-01-01', '2002-04-01', '2002-07-01', '2002-10-01',
               '2003-01-01', '2003-04-01', '2003-07-01', '2003-10-01',
               '2004-01-01', '2004-04-01', '2004-07-01', '2004-10-01',
               '2005-01-01', '2005-04-01', '2005-07-01', '2005-10-01',
               '2006-01-01', '2006-04-01', '2006-07-01', '2006-10-01',
               '2007-01-01', '2007-04-01', '2007-07-01', '2007-10-01',
               '2008-01-01', '2008-04-01', '2008-07-01', '2008-10-01',
               '2009-01-01', '2009-04-01', '2009-07-01', '2009-10-01',
               '2010-01-01', '2010-04-01', '2010-07-01', '2010-10-01',
               '2011-01-01', '2011-04-01', '2011-07-01', '2011-10-01',
               '2012-01-01', '2012-04-01', '2012-07-01', '2012-10-01',
               '2013-01-01', '2013-04-01', '2013-07-01', '2013-10-01',
               '2014-01-01', '2014-04-01', '2014-07-01', '2014-10-01',
               '2015-01-01', '2015-04-01', '2015-07-01', '2015-10-01',
               '2016-01-01'],
              dtype='datetime64[ns]', freq='QS-OCT')
pd.Timestamp('20200301') - pd.Timestamp('20200201') # compute the difference between two timestamps
Out[25]:
Timedelta('29 days 00:00:00')
pd.Timestamp('20200301') + pd.Timedelta(days = 6)
Out[26]:
Timestamp('2020-03-07 00:00:00')
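Resampling a per-second series into coarser bins is written as a method chain in recent pandas versions: `resample()` builds the bins and an aggregation method reduces each bin. The sketch below uses a series of ones (an assumption made so the expected totals are obvious) with the same 600-second index as in the time-series example.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("20200301", periods=600, freq="s")
s = pd.Series(np.ones(len(rng), dtype=int), index=rng)

# Group the per-second values into 2-minute bins, then aggregate each bin.
two_minute_sum = s.resample("2min").sum()
two_minute_mean = s.resample("2min").mean()

print(two_minute_sum)
print(two_minute_mean)
```

Ten minutes of data split into 2-minute bins gives five rows, each covering 120 seconds.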
df = pd.DataFrame({'id':[1, 2, 3, 4, 5, 6], 'row_grade':['a', 'b', 'b', 'a', 'a', 'd']})
df
Out[27]:
id row_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 d
df['grade'] = df.row_grade.astype('category')
df
Out[29]:
id row_grade grade
0 1 a a
1 2 b b
2 3 b b
3 4 a a
4 5 a a
5 6 d d
In [30]:
df.grade
Out[30]:
0 a
1 b
2 b
3 a
4 a
5 d
Name: grade, dtype: category
Categories (3, object): [a, b, d]
In [32]:
df.grade.cat.categories
Out[32]:
Index(['a', 'b', 'd'], dtype='object')
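In [33]:
# assigning to .cat.categories no longer works in pandas 2.0 and later; rename_categories returns a renamed copy
df['grade'] = df.grade.cat.rename_categories(['very good', 'good', 'bad']) # relabel the categories of grade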
df
Out[33]:
id row_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 d bad
In [34]:
df.sort_values(by = 'grade', ascending = True) # sort by grade in ascending category order (categories sort by their position, not alphabetically)
Out[34]:
id row_grade grade
0 1 a very good
3 4 a very good
4 5 a very good
1 2 b good
2 3 b good
5 6 d bad
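The sort order of a categorical column follows the order of its categories. Here is a minimal sketch, assuming the same `row_grade` letters as above, that renames the categories with a mapping and then declares an explicit quality order with `set_categories(..., ordered=True)` so that sorting and comparisons follow it.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "row_grade": list("abbaad")})
df["grade"] = df["row_grade"].astype("category")

# rename_categories returns a new Series; assign it back to update the column.
df["grade"] = df["grade"].cat.rename_categories(
    {"a": "very good", "b": "good", "d": "bad"}
)

# Declare an explicit order from worst to best.
df["grade"] = df["grade"].cat.set_categories(
    ["bad", "good", "very good"], ordered=True
)

best_first = df.sort_values(by="grade", ascending=False)
print(best_first)

# With an ordered categorical, comparisons against a category also work.
at_least_good = df[df["grade"] >= "good"]
print(len(at_least_good))
```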
Data visualization
In [36]:
s = pd.Series(np.random.randn(1000), index = pd.date_range('20000101', periods = 1000))
s
Out[36]:
2000-01-01 0.181931
2000-01-02 0.133446
2000-01-03 -0.014128
2000-01-04 -0.755955
2000-01-05 0.847678
...
2002-09-22 -0.153774
2002-09-23 1.409455
2002-09-24 1.177651
2002-09-25 -0.449985
2002-09-26 0.700871
Freq: D, Length: 1000, dtype: float64
In [37]:
s = s.cumsum() # cumulative sum
s
Out[37]:
2000-01-01 0.181931
2000-01-02 0.315377
2000-01-03 0.301249
2000-01-04 -0.454706
2000-01-05 0.392972
...
2002-09-22 -17.139739
2002-09-23 -15.730284
2002-09-24 -14.552633
2002-09-25 -15.002618
2002-09-26 -14.301747
Freq: D, Length: 1000, dtype: float64
In [38]:
s.plot()
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x139a78cba8>
Reading and writing data
In [40]:
df = pd.DataFrame(np.random.randn(100, 4), columns = list('ABCD'))
df
Out[40]:
A B C D
0 -1.253046 -1.345258 -0.040495 0.390861
1 0.274316 1.073607 -0.110999 0.962318
2 1.328825 -1.236292 0.959564 -0.368799
3 0.451488 -0.646667 -1.248890 0.711037
4 0.299365 -0.485150 -0.855304 -0.355098
... ... ... ... ...
95 -0.623444 -0.490164 -0.804014 -0.827152
96 -0.271459 -0.217562 1.569923 -0.851998
97 -0.532696 1.299053 -0.858330 1.225276
98 -0.227407 -2.551994 0.291486 0.879787
99 -0.981516 0.343269 0.644074 -0.188729
100 rows × 4 columns
In [41]:
df.to_csv('data.csv') # save the data to a CSV file
In [42]:
%ls
Volume in drive C is Windows8_OS
Volume Serial Number is 2CD3-41A8
Directory of C:\Users\wangzhaohui\Desktop\机器学习---数据科学包\第2天
2020/04/01 10:55 <DIR> .
2020/04/01 10:55 <DIR> ..
2020/04/01 08:57 <DIR> .ipynb_checkpoints
2020/03/31 10:37 51,435 01.ipynb
2020/03/31 19:33 69,769 02.ipynb
2020/04/01 10:55 8,245 data.csv
2020/04/01 10:55 68,999 Untitled1.ipynb
4 File(s) 198,448 bytes
3 Dir(s) 13,441,826,816 bytes free
In [43]:
%more data.csv
In [44]:
pd.read_csv('data.csv') # read the data back; the saved index comes back as an extra unnamed column
Out[44]:
Unnamed: 0 A B C D
0 0 -1.253046 -1.345258 -0.040495 0.390861
1 1 0.274316 1.073607 -0.110999 0.962318
2 2 1.328825 -1.236292 0.959564 -0.368799
3 3 0.451488 -0.646667 -1.248890 0.711037
4 4 0.299365 -0.485150 -0.855304 -0.355098
... ... ... ... ... ...
95 95 -0.623444 -0.490164 -0.804014 -0.827152
96 96 -0.271459 -0.217562 1.569923 -0.851998
97 97 -0.532696 1.299053 -0.858330 1.225276
98 98 -0.227407 -2.551994 0.291486 0.879787
99 99 -0.981516 0.343269 0.644074 -0.188729
100 rows × 5 columns
In [46]:
pd.read_csv('data.csv', index_col = 0) # use column 0 as the index
Out[46]:
A B C D
0 -1.253046 -1.345258 -0.040495 0.390861
1 0.274316 1.073607 -0.110999 0.962318
2 1.328825 -1.236292 0.959564 -0.368799
3 0.451488 -0.646667 -1.248890 0.711037
4 0.299365 -0.485150 -0.855304 -0.355098
... ... ... ... ...
95 -0.623444 -0.490164 -0.804014 -0.827152
96 -0.271459 -0.217562 1.569923 -0.851998
97 -0.532696 1.299053 -0.858330 1.225276
98 -0.227407 -2.551994 0.291486 0.879787
99 -0.981516 0.343269 0.644074 -0.188729
100 rows × 4 columns
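The extra `Unnamed: 0` column can also be avoided at write time. The sketch below writes to an in-memory `io.StringIO` buffer instead of a real file (a choice made here so the example leaves nothing on disk) and passes `index=False` to `to_csv`.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8.0).reshape(2, 4), columns=list("ABCD"))

# Write to an in-memory buffer; index=False omits the row labels from the output.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
print(restored.equals(df))
```

Whether to drop the index depends on the data: a default 0..n-1 index carries no information, while a date index usually should be saved and restored with `index_col`.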
4 MovieLens Movie Data Analysis
import pandas as pd
In [2]:
# read the data into users and give each column a name
unames = ['user_id', 'gender', 'age', 'occupatation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep = '::', header = None, names = unames, engine = 'python') # the file has no header row, so header = None; the multi-character separator '::' needs the Python engine, and naming it explicitly avoids a ParserWarning
In [3]:
print(len(users))
users.head(10)
6040
Out[3]:
user_id gender age occupatation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455
5 6 F 50 9 55117
6 7 M 35 1 06810
7 8 M 25 12 11413
8 9 M 25 17 61614
9 10 F 35 1 95370
In [4]:
rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep = '::', header = None, names = rating_names, engine = 'python')
In [5]:
print(len(ratings))
ratings.head(10)
1000209
Out[5]:
user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
5 1 1197 3 978302268
6 1 1287 5 978302039
7 1 2804 5 978300719
8 1 594 4 978302268
9 1 919 4 978301368
In [6]:
movie_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep = '::', header = None, names = movie_names, engine = 'python')
In [7]:
print(len(movies))
movies.head(10)
3883
Out[7]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children's
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
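To try the `::`-separated format without downloading the dataset, the sketch below parses two rows copied from the `users.head(10)` output from an in-memory buffer. The `dtype={"zip": str}` argument and the corrected column name `occupation` are additions made here, not part of the original notebook.

```python
import io

import pandas as pd

# Two rows in the users.dat layout (fields separated by '::').
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")

unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_csv(
    sample,
    sep="::",
    header=None,
    names=unames,
    engine="python",     # multi-character separators need the Python engine
    dtype={"zip": str},  # keep ZIP codes as text so leading zeros survive
)

print(users)
```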
Merge the three tables (pd.merge joins on the column names the tables share: user_id, then movie_id):
In [8]:
data = pd.merge(pd.merge(users, ratings), movies)
print(len(data))
data.head(10)
1000209
Out[8]:
user_id gender age occupatation zip movie_id rating timestamp title genres
0 1 F 1 10 48067 1193 5 978300760 One Flew Over the Cuckoo's Nest (1975) Drama
1 2 M 56 16 70072 1193 5 978298413 One Flew Over the Cuckoo's Nest (1975) Drama
2 12 M 25 12 32793 1193 4 978220179 One Flew Over the Cuckoo's Nest (1975) Drama
3 15 M 25 7 22903 1193 4 978199279 One Flew Over the Cuckoo's Nest (1975) Drama
4 17 M 50 1 95350 1193 5 978158471 One Flew Over the Cuckoo's Nest (1975) Drama
5 18 F 18 3 95825 1193 4 978156168 One Flew Over the Cuckoo's Nest (1975) Drama
6 19 M 1 10 48073 1193 5 982730936 One Flew Over the Cuckoo's Nest (1975) Drama
7 24 F 25 7 10023 1193 5 978136709 One Flew Over the Cuckoo's Nest (1975) Drama
8 28 F 25 1 14607 1193 3 978125194 One Flew Over the Cuckoo's Nest (1975) Drama
9 33 M 45 3 55421 1193 5 978557765 One Flew Over the Cuckoo's Nest (1975) Drama
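Relying on shared column names works here, but the join keys can be stated explicitly. This is a minimal sketch on three tiny invented tables (the titles `Movie X` and `Movie Y` are placeholders) that names the keys with `on=` and uses `validate=` to check the expected relationship.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
ratings = pd.DataFrame(
    {"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [5, 3, 4]}
)
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie X", "Movie Y"]})

# Name the join keys explicitly rather than relying on shared column names.
data = (
    users.merge(ratings, on="user_id", how="inner")
         .merge(movies, on="movie_id", how="inner")
)

# validate= raises an error if the relationship is not the expected one.
checked = ratings.merge(movies, on="movie_id", validate="many_to_one")

print(len(data))
print(data.columns.tolist())
```

One row per rating survives the merge, which matches the 1000209 rows seen for both `ratings` and the merged `data` above.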
In [9]:
data[data.user_id == 1] # view the records of the user whose id is 1
Out[9]:
user_id gender age occupatation zip movie_id rating timestamp title genres
0 1 F 1 10 48067 1193 5 978300760 One Flew Over the Cuckoo's Nest (1975) Drama
1725 1 F 1 10 48067 661 3 978302109 James and the Giant Peach (1996) Animation|Children's|Musical
2250 1 F 1 10 48067 914 3 978301968 My Fair Lady (1964) Musical|Romance
2886 1 F 1 10 48067 3408 4 978300275 Erin Brockovich (2000) Drama
4201 1 F 1 10 48067 2355 5 978824291 Bug's Life, A (1998) Animation|Children's|Comedy
5904 1 F 1 10 48067 1197 3 978302268 Princess Bride, The (1987) Action|Adventure|Comedy|Romance
8222 1 F 1 10 48067 1287 5 978302039 Ben-Hur (1959) Action|Adventure|Drama
8926 1 F 1 10 48067 2804 5 978300719 Christmas Story, A (1983) Comedy|Drama
10278 1 F 1 10 48067 594 4 978302268 Snow White and the Seven Dwarfs (1937) Animation|Children's|Musical
11041 1 F 1 10 48067 919 4 978301368 Wizard of Oz, The (1939) Adventure|Children's|Drama|Musical
12759 1 F 1 10 48067 595 5 978824268 Beauty and the Beast (1991) Animation|Children's|Musical
13819 1 F 1 10 48067 938 4 978301752 Gigi (1958) Musical
14006 1 F 1 10 48067 2398 4 978302281 Miracle on 34th Street (1947) Drama
14386 1 F 1 10 48067 2918 4 978302124 Ferris Bueller's Day Off (1986) Comedy
15859 1 F 1 10 48067 1035 5 978301753 Sound of Music, The (1965) Musical
16741 1 F 1 10 48067 2791 4 978302188 Airplane! (1980) Comedy
18472 1 F 1 10 48067 2687 3 978824268 Tarzan (1999) Animation|Children's
18914 1 F 1 10 48067 2018 4 978301777 Bambi (1942) Animation|Children's
19503 1 F 1 10 48067 3105 5 978301713 Awakenings (1990) Drama
20183 1 F 1 10 48067 2797 4 978302039 Big (1988) Comedy|Fantasy
21674 1 F 1 10 48067 2321 3 978302205 Pleasantville (1998) Comedy
22832 1 F 1 10 48067 720 3 978300760 Wallace & Gromit: The Best of Aardman Animatio... Animation
23270 1 F 1 10 48067 1270 5 978300055 Back to the Future (1985) Comedy|Sci-Fi
25853 1 F 1 10 48067 527 5 978824195 Schindler's List (1993) Drama|War
28157 1 F 1 10 48067 2340 3 978300103 Meet Joe Black (1998) Romance
28501 1 F 1 10 48067 48 5 978824351 Pocahontas (1995) Animation|Children's|Musical|Romance
28883 1 F 1 10 48067 1097 4 978301953 E.T. the Extra-Terrestrial (1982) Children's|Drama|Fantasy|Sci-Fi
31152 1 F 1 10 48067 1721 4 978300055 Titanic (1997) Drama|Romance
32698 1 F 1 10 48067 1545 4 978824139 Ponette (1996) Drama
32771 1 F 1 10 48067 745 3 978824268 Close Shave, A (1995) Animation|Comedy|Thriller
33428 1 F 1 10 48067 2294 4 978824291 Antz (1998) Animation|Children's
34073 1 F 1 10 48067 3186 4 978300019 Girl, Interrupted (1999) Drama
34504 1 F 1 10 48067 1566 4 978824330 Hercules (1997) Adventure|Animation|Children's|Comedy|Musical
34973 1 F 1 10 48067 588 4 978824268 Aladdin (1992) Animation|Children's|Comedy|Musical
36324 1 F 1 10 48067 1907 4 978824330 Mulan (1998) Animation|Children's
36814 1 F 1 10 48067 783 4 978824291 Hunchback of Notre Dame, The (1996) Animation|Children's|Musical
37204 1 F 1 10 48067 1836 5 978300172 Last Days of Disco, The (1998) Drama
37339 1 F 1 10 48067 1022 5 978300055 Cinderella (1950) Animation|Children's|Musical
37916 1 F 1 10 48067 2762 4 978302091 Sixth Sense, The (1999) Thriller
40375 1 F 1 10 48067 150 5 978301777 Apollo 13 (1995) Drama
41626 1 F 1 10 48067 1 5 978824268 Toy Story (1995) Animation|Children's|Comedy
43703 1 F 1 10 48067 1961 5 978301590 Rain Man (1988) Drama
45033 1 F 1 10 48067 1962 4 978301753 Driving Miss Daisy (1989) Drama
45685 1 F 1 10 48067 2692 4 978301570 Run Lola Run (Lola rennt) (1998) Action|Crime|Romance
46757 1 F 1 10 48067 260 4 978300760 Star Wars: Episode IV - A New Hope (1977) Action|Adventure|Fantasy|Sci-Fi
49748 1 F 1 10 48067 1028 5 978301777 Mary Poppins (1964) Children's|Comedy|Musical
50759 1 F 1 10 48067 1029 5 978302205 Dumbo (1941) Animation|Children's|Musical
51327 1 F 1 10 48067 1207 4 978300719 To Kill a Mockingbird (1962) Drama
52255 1 F 1 10 48067 2028 5 978301619 Saving Private Ryan (1998) Action|Drama|War
54908 1 F 1 10 48067 531 4 978302149 Secret Garden, The (1993) Children's|Drama
55246 1 F 1 10 48067 3114 4 978302174 Toy Story 2 (1999) Animation|Children's|Comedy
56831 1 F 1 10 48067 608 4 978301398 Fargo (1996) Crime|Drama|Thriller
59344 1 F 1 10 48067 1246 4 978302091 Dead Poets Society (1989) Drama
In [10]:
# for each movie, compute the mean rating given by female viewers and by male viewers
ratings_by_gender = data.pivot_table(values = 'rating', index = 'title', columns = 'gender', aggfunc = 'mean')
ratings_by_gender.head(10)
Out[10]:
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
1-900 (1994) 2.000000 3.000000
10 Things I Hate About You (1999) 3.646552 3.311966
101 Dalmatians (1961) 3.791444 3.500000
101 Dalmatians (1996) 3.240000 2.911215
12 Angry Men (1957) 4.184397 4.328421
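To see what pivot_table is doing here, it can help to run it on a table small enough to check by hand. The sketch below uses a made-up three-column frame (the name toy and its values are hypothetical, not MovieLens data) and shows that the same result can be obtained with groupby followed by unstack.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the MovieLens data
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'rating': [4, 2, 4, 5, 3],
})

# One row per title, one column per gender, each cell holds the mean rating
by_gender = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')

# The same table built step by step: group, average, then move gender to columns
by_gender_alt = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(by_gender)
```

Movie B has no male ratings in this toy data, so its M cell is NaN. The same thing happens in the real data for titles rated by only one gender.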
In [11]:
ratings_by_gender['diff'] = ratings_by_gender.F - ratings_by_gender.M # Add a column holding the female-minus-male rating difference
ratings_by_gender.head(10)
Out[11]:
gender F M diff
title
$1,000,000 Duck (1971) 3.375000 2.761905 0.613095
'Night Mother (1986) 3.388889 3.352941 0.035948
'Til There Was You (1997) 2.675676 2.733333 -0.057658
'burbs, The (1989) 2.793478 2.962085 -0.168607
...And Justice for All (1979) 3.828571 3.689024 0.139547
1-900 (1994) 2.000000 3.000000 -1.000000
10 Things I Hate About You (1999) 3.646552 3.311966 0.334586
101 Dalmatians (1961) 3.791444 3.500000 0.291444
101 Dalmatians (1996) 3.240000 2.911215 0.328785
12 Angry Men (1957) 4.184397 4.328421 -0.144024
In [12]:
ratings_by_gender.sort_values(by = 'diff', ascending = True).head(10)
# ratings_by_gender.head(10)
Out[12]:
gender F M diff
title
Tigrero: A Film That Was Never Made (1994) 1.0 4.333333 -3.333333
Neon Bible, The (1995) 1.0 4.000000 -3.000000
Enfer, L' (1994) 1.0 3.750000 -2.750000
Stalingrad (1993) 1.0 3.593750 -2.593750
Killer: A Journal of Murder (1995) 1.0 3.428571 -2.428571
Dangerous Ground (1997) 1.0 3.333333 -2.333333
In God's Hands (1998) 1.0 3.333333 -2.333333
Rosie (1998) 1.0 3.333333 -2.333333
Flying Saucer, The (1950) 1.0 3.300000 -2.300000
Jamaica Inn (1939) 1.0 3.142857 -2.142857
In [13]:
data.head(10)
Out[13]:
user_id gender age occupatation zip movie_id rating timestamp title genres
0 1 F 1 10 48067 1193 5 978300760 One Flew Over the Cuckoo's Nest (1975) Drama
1 2 M 56 16 70072 1193 5 978298413 One Flew Over the Cuckoo's Nest (1975) Drama
2 12 M 25 12 32793 1193 4 978220179 One Flew Over the Cuckoo's Nest (1975) Drama
3 15 M 25 7 22903 1193 4 978199279 One Flew Over the Cuckoo's Nest (1975) Drama
4 17 M 50 1 95350 1193 5 978158471 One Flew Over the Cuckoo's Nest (1975) Drama
5 18 F 18 3 95825 1193 4 978156168 One Flew Over the Cuckoo's Nest (1975) Drama
6 19 M 1 10 48073 1193 5 982730936 One Flew Over the Cuckoo's Nest (1975) Drama
7 24 F 25 7 10023 1193 5 978136709 One Flew Over the Cuckoo's Nest (1975) Drama
8 28 F 25 1 14607 1193 3 978125194 One Flew Over the Cuckoo's Nest (1975) Drama
9 33 M 45 3 55421 1193 5 978557765 One Flew Over the Cuckoo's Nest (1975) Drama
In [14]:
rating_by_title = data.groupby('title').size() # Number of ratings each movie received
rating_by_title.head(10)
Out[14]:
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
1-900 (1994) 2
10 Things I Hate About You (1999) 700
101 Dalmatians (1961) 565
101 Dalmatians (1996) 364
12 Angry Men (1957) 616
dtype: int64
In [15]:
rating_by_title.sort_values(ascending = False).head(10) # Sort by number of ratings
Out[15]:
title
American Beauty (1999) 3428
Star Wars: Episode IV - A New Hope (1977) 2991
Star Wars: Episode V - The Empire Strikes Back (1980) 2990
Star Wars: Episode VI - Return of the Jedi (1983) 2883
Jurassic Park (1993) 2672
Saving Private Ryan (1998) 2653
Terminator 2: Judgment Day (1991) 2649
Matrix, The (1999) 2590
Back to the Future (1985) 2583
Silence of the Lambs, The (1991) 2578
dtype: int64
In [16]:
mean_ratings = data.pivot_table(values = 'rating', index = 'title', aggfunc = 'mean') # Mean rating of each movie
mean_ratings.head(10)
Out[16]:
rating
title
$1,000,000 Duck (1971) 3.027027
'Night Mother (1986) 3.371429
'Til There Was You (1997) 2.692308
'burbs, The (1989) 2.910891
...And Justice for All (1979) 3.713568
1-900 (1994) 2.500000
10 Things I Hate About You (1999) 3.422857
101 Dalmatians (1961) 3.596460
101 Dalmatians (1996) 3.046703
12 Angry Men (1957) 4.295455
In [17]:
mean_ratings.sort_values(by = 'rating', ascending = False).head(10) # Sort movies by mean rating
Out[17]:
rating
title
Ulysses (Ulisse) (1954) 5.0
Lured (1947) 5.0
Follow the Bitch (1998) 5.0
Bittersweet Motel (2000) 5.0
Song of Freedom (1936) 5.0
One Little Indian (1973) 5.0
Smashing Time (1967) 5.0
Schlafes Bruder (Brother of Sleep) (1995) 5.0
Gate of Heavenly Peace, The (1995) 5.0
Baby, The (1973) 5.0
In [18]:
top_10_hot = rating_by_title.sort_values(ascending = False).head(10)
top_10_hot
Out[18]:
title
American Beauty (1999) 3428
Star Wars: Episode IV - A New Hope (1977) 2991
Star Wars: Episode V - The Empire Strikes Back (1980) 2990
Star Wars: Episode VI - Return of the Jedi (1983) 2883
Jurassic Park (1993) 2672
Saving Private Ryan (1998) 2653
Terminator 2: Judgment Day (1991) 2649
Matrix, The (1999) 2590
Back to the Future (1985) 2583
Silence of the Lambs, The (1991) 2578
dtype: int64
In [32]:
mean_ratings.loc[top_10_hot.index, :]
Out[32]:
rating
title
American Beauty (1999) 4.317386
Star Wars: Episode IV - A New Hope (1977) 4.453694
Star Wars: Episode V - The Empire Strikes Back (1980) 4.292977
Star Wars: Episode VI - Return of the Jedi (1983) 4.022893
Jurassic Park (1993) 3.763847
Saving Private Ryan (1998) 4.337354
Terminator 2: Judgment Day (1991) 4.058513
Matrix, The (1999) 4.315830
Back to the Future (1985) 3.990321
Silence of the Lambs, The (1991) 4.351823
In [31]:
print(type(rating_by_title))
<class 'pandas.core.series.Series'>
In [24]:
top_20_score = mean_ratings.sort_values(by = 'rating', ascending = False).head(20) # Get the 20 highest-rated movies
top_20_score
Out[24]:
rating
title
Ulysses (Ulisse) (1954) 5.000000
Lured (1947) 5.000000
Follow the Bitch (1998) 5.000000
Bittersweet Motel (2000) 5.000000
Song of Freedom (1936) 5.000000
One Little Indian (1973) 5.000000
Smashing Time (1967) 5.000000
Schlafes Bruder (Brother of Sleep) (1995) 5.000000
Gate of Heavenly Peace, The (1995) 5.000000
Baby, The (1973) 5.000000
I Am Cuba (Soy Cuba/Ya Kuba) (1964) 4.800000
Lamerica (1994) 4.750000
Apple, The (Sib) (1998) 4.666667
Sanjuro (1962) 4.608696
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 4.560510
Shawshank Redemption, The (1994) 4.554558
Godfather, The (1972) 4.524966
Close Shave, A (1995) 4.520548
Usual Suspects, The (1995) 4.517106
Schindler's List (1993) 4.510417
In [25]:
rating_by_title[top_20_score.index] # Number of ratings for each of the 20 highest-rated movies
Out[25]:
title
Ulysses (Ulisse) (1954) 1
Lured (1947) 1
Follow the Bitch (1998) 1
Bittersweet Motel (2000) 1
Song of Freedom (1936) 1
One Little Indian (1973) 1
Smashing Time (1967) 2
Schlafes Bruder (Brother of Sleep) (1995) 1
Gate of Heavenly Peace, The (1995) 3
Baby, The (1973) 1
I Am Cuba (Soy Cuba/Ya Kuba) (1964) 5
Lamerica (1994) 8
Apple, The (Sib) (1998) 9
Sanjuro (1962) 69
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 628
Shawshank Redemption, The (1994) 2227
Godfather, The (1972) 2223
Close Shave, A (1995) 657
Usual Suspects, The (1995) 1783
Schindler's List (1993) 2304
dtype: int64
In [26]:
hot_movies = rating_by_title[rating_by_title > 1000] # Movies with more than 1000 ratings
print(len(hot_movies))
hot_movies.head(10)
207
Out[26]:
title
2001: A Space Odyssey (1968) 1716
Abyss, The (1989) 1715
African Queen, The (1951) 1057
Air Force One (1997) 1076
Airplane! (1980) 1731
Aladdin (1992) 1351
Alien (1979) 2024
Aliens (1986) 1820
Amadeus (1984) 1382
American Beauty (1999) 3428
dtype: int64
In [28]:
mean_ratings.head(10)
Out[28]:
rating
title
$1,000,000 Duck (1971) 3.027027
'Night Mother (1986) 3.371429
'Til There Was You (1997) 2.692308
'burbs, The (1989) 2.910891
...And Justice for All (1979) 3.713568
1-900 (1994) 2.500000
10 Things I Hate About You (1999) 3.422857
101 Dalmatians (1961) 3.596460
101 Dalmatians (1996) 3.046703
12 Angry Men (1957) 4.295455
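The rating counts above show why a plain sort on the mean is misleading: most of the 5.0 titles were rated by a single user. One way to combine the two tables is to keep only the well-rated titles before sorting. The sketch below is a minimal illustration on made-up data; the names toy, min_votes, and top are hypothetical, and the threshold of 2 stands in for the 1000 used above.

```python
import pandas as pd

# Hypothetical toy ratings; B has the best mean but only one vote
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [4, 5, 3, 5, 2, 4],
})

counts = toy.groupby('title').size()                                   # votes per title
means = toy.pivot_table(values='rating', index='title', aggfunc='mean')  # mean per title

min_votes = 2                                # assumed threshold for this toy data
popular = counts[counts >= min_votes]        # keep titles with enough votes

# Look up the means of the popular titles only, then sort them
top = means.loc[popular.index].sort_values(by='rating', ascending=False)
print(top)
```

On the real data the same pattern would read mean_ratings.loc[hot_movies.index], reusing the two objects built in the cells above.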
5 pandas core data structures
import pandas as pd
import numpy as np
Creating a Series object
In [2]:
s = pd.Series(np.random.randn(5), index = ['a', 'b', 'c', 'd', 'e'])
s
Out[2]:
a -2.208328
b 0.687184
c -1.615629
d 0.049625
e 0.073550
dtype: float64
In [3]:
s.index # Inspect the index of s
Out[3]:
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
In [4]:
s = pd.Series(np.random.randn(5)) # Without an explicit index, an integer index is assigned automatically
s
Out[4]:
0 -1.068177
1 1.408667
2 -0.003749
3 -0.752304
4 0.379617
dtype: float64
Creating a Series from a dict
In [5]:
d = {'a':1, 'b':2, 'd':4}
s = pd.Series(d, index = list('abcd'))
s
Out[5]:
a 1.0
b 2.0
c NaN
d 4.0
dtype: float64
Creating a Series from a scalar
In [6]:
s = pd.Series(5, index = list('abcd'))
s
Out[6]:
a 5
b 5
c 5
d 5
dtype: int64
Properties of a Series
1. ndarray-like behavior
In [7]:
s = pd.Series(np.random.randn(5))
s
Out[7]:
0 1.508598
1 1.647663
2 0.820469
3 0.502848
4 1.358940
dtype: float64
In [8]:
s[0]
Out[8]:
1.5085975977102866
In [9]:
s[:3]
Out[9]:
0 1.508598
1 1.647663
2 0.820469
dtype: float64
In [10]:
s[2:5]
Out[10]:
2 0.820469
3 0.502848
4 1.358940
dtype: float64
In [11]:
s[[1, 3, 4]] # Indexing with a list of integers is supported
Out[11]:
1 1.647663
3 0.502848
4 1.358940
dtype: float64
In [12]:
np.sin(s)
Out[12]:
0 0.998066
1 0.997047
2 0.731466
3 0.481923
4 0.977642
dtype: float64
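One caveat on the s[0] style used above: when the index itself holds integers, s[0] looks up the label 0, which happens to coincide with the first position only because the default index is 0, 1, 2, and so on. A small sketch with a deliberately shuffled integer index (the values are made up) shows the difference, and how .loc and .iloc make the intent explicit.

```python
import pandas as pd

# Integer labels that do not match the positions
s = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the first element, whatever its label

print(by_label, by_position)
```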
Dict-like behavior
In [13]:
s = pd.Series(np.random.randn(5), index = list('abcde'))
s
Out[13]:
a -1.215274
b 2.374261
c 1.050139
d 1.091067
e 0.982012
dtype: float64
In [14]:
s['a']
Out[14]:
-1.21527427068595
In [15]:
s['b'] = 3
In [16]:
s
Out[16]:
a -1.215274
b 3.000000
c 1.050139
d 1.091067
e 0.982012
dtype: float64
In [17]:
s['f'] = 100 # Add a new element
In [18]:
s
Out[18]:
a -1.215274
b 3.000000
c 1.050139
d 1.091067
e 0.982012
f 100.000000
dtype: float64
Label alignment
In [19]:
s1 = pd.Series(np.random.randn(3), index = list('abe'))
s2 = pd.Series(np.random.randn(3), index = list('ace'))
s1
Out[19]:
a -0.654712
b 1.445863
e -0.929808
dtype: float64
In [20]:
s2
Out[20]:
a -1.239739
c 0.024641
e 0.748008
dtype: float64
In [21]:
s1 + s2
Out[21]:
a -1.894451
b NaN
c NaN
e -0.181800
dtype: float64
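The NaN entries for b and c appear because each label exists in only one of the two Series. If a missing label should instead count as zero, Series.add with fill_value is one option. The sketch below uses fixed values rather than random ones so the result can be checked by eye.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=list('abe'))
s2 = pd.Series([10.0, 20.0, 30.0], index=list('ace'))

plain = s1 + s2                     # labels present on one side only become NaN
filled = s1.add(s2, fill_value=0)   # the missing side is treated as 0 instead

print(pd.DataFrame({'plain': plain, 'filled': filled}))
```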
DataFrame
1. Creating a DataFrame from a dict
In [22]:
d = {'one':pd.Series([1, 2, 3], index = list('abc')), 'two':pd.Series([1, 2, 3, 4], index = list('abcd'))}
d
Out[22]:
{'one': a    1
 b    2
 c    3
 dtype: int64,
 'two': a    1
 b    2
 c    3
 d    4
 dtype: int64}
In [23]:
df = pd.DataFrame(d)
df
Out[23]:
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
In [24]:
df = pd.DataFrame(d, index = ['a', 'b', 'd'])
df
Out[24]:
one two
a 1.0 1
b 2.0 2
d NaN 4
In [25]:
# Creating a DataFrame from a dict of lists: the lists must all have the same length, otherwise an error is raised. This differs from building it from Series.
d = {'one':[1, 2, 3, 4], 'two':[11, 22, 33, 44]}
df = pd.DataFrame(d)
df
Out[25]:
one two
0 1 11
1 2 22
2 3 33
3 4 44
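The length rule mentioned in the comment above is easy to confirm. The sketch below passes lists of different lengths and catches the resulting ValueError, then shows that wrapping the same data in Series works because the Series are aligned on their index and padded with NaN. The variable names raised and df_ok are only for illustration.

```python
import pandas as pd

# Lists of unequal length are rejected
try:
    pd.DataFrame({'one': [1, 2, 3], 'two': [11, 22, 33, 44]})
    raised = False
except ValueError as exc:
    raised = True
    print('ValueError:', exc)

# Series of unequal length are aligned by index; the gap becomes NaN
df_ok = pd.DataFrame({'one': pd.Series([1, 2, 3]),
                      'two': pd.Series([11, 22, 33, 44])})
print(df_ok)
```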
Creating a DataFrame from a list
In [26]:
data = [(0, 1, 2), ('one', 3, 4)]
df = pd.DataFrame(data)
df
Out[26]:
0 1 2
0 0 1 2
1 one 3 4
Creating a DataFrame from a list of dicts
In [27]:
data = [{'a':1, 'b':2}, {'a':3, 'b':4, 'c':5}] # The dict keys become the column labels
df = pd.DataFrame(data)
df
Out[27]:
a b c
0 1 2 NaN
1 3 4 5.0
Creating a DataFrame from a Series
In [28]:
data = pd.Series(np.random.randn(5), index = list('abcde'))
df = pd.DataFrame(data)
df
Out[28]:
0
a -0.236843
b -0.108453
c -0.434020
d 1.244950
e -0.453697
Properties of a DataFrame
Selecting, adding, and deleting columns
In [29]:
df = pd.DataFrame(np.random.randn(6, 4), columns = ['one', 'two', 'three', 'four'])
df
Out[29]:
one two three four
0 -0.297524 0.297013 -1.079953 1.046785
1 0.160032 0.444978 0.301297 0.636168
2 0.187298 -1.486748 0.228950 1.823566
3 0.528394 1.282184 1.157520 -0.848585
4 -0.010840 -0.347936 -0.501200 -1.252119
5 0.946331 -0.978641 0.442339 1.023809
In [30]:
# Select a column
df['one']
Out[30]:
0 -0.297524
1 0.160032
2 0.187298
3 0.528394
4 -0.010840
5 0.946331
Name: one, dtype: float64
In [31]:
# Select a row
df.loc[1]
Out[31]:
one 0.160032
two 0.444978
three 0.301297
four 0.636168
Name: 1, dtype: float64
In [32]:
# Assignment
df['three'] = df['one'] + df['two']
df
Out[32]:
one two three four
0 -0.297524 0.297013 -0.000510 1.046785
1 0.160032 0.444978 0.605009 0.636168
2 0.187298 -1.486748 -1.299450 1.823566
3 0.528394 1.282184 1.810577 -0.848585
4 -0.010840 -0.347936 -0.358776 -1.252119
5 0.946331 -0.978641 -0.032310 1.023809
In [33]:
df
Out[33]:
one two three four
0 -0.297524 0.297013 -0.000510 1.046785
1 0.160032 0.444978 0.605009 0.636168
2 0.187298 -1.486748 -1.299450 1.823566
3 0.528394 1.282184 1.810577 -0.848585
4 -0.010840 -0.347936 -0.358776 -1.252119
5 0.946331 -0.978641 -0.032310 1.023809
In [34]:
del df['three']
df
Out[34]:
one two four
0 -0.297524 0.297013 1.046785
1 0.160032 0.444978 0.636168
2 0.187298 -1.486748 1.823566
3 0.528394 1.282184 -0.848585
4 -0.010840 -0.347936 -1.252119
5 0.946331 -0.978641 1.023809
In [35]:
# Add a column
df['flag'] = df['one'] > 0.2
df
Out[35]:
one two four flag
0 -0.297524 0.297013 1.046785 False
1 0.160032 0.444978 0.636168 False
2 0.187298 -1.486748 1.823566 False
3 0.528394 1.282184 -0.848585 True
4 -0.010840 -0.347936 -1.252119 False
5 0.946331 -0.978641 1.023809 True
In [36]:
df.pop('four')
Out[36]:
0 1.046785
1 0.636168
2 1.823566
3 -0.848585
4 -1.252119
5 1.023809
Name: four, dtype: float64
In [37]:
df
Out[37]:
one two flag
0 -0.297524 0.297013 False
1 0.160032 0.444978 False
2 0.187298 -1.486748 False
3 0.528394 1.282184 True
4 -0.010840 -0.347936 False
5 0.946331 -0.978641 True
In [38]:
df.insert(1, 'bar', df['one'] + df['two']) # Assigning a new column directly always appends it at the end; insert lets you choose the position
df
Out[38]:
one bar two flag
0 -0.297524 -0.000510 0.297013 False
1 0.160032 0.605009 0.444978 False
2 0.187298 -1.299450 -1.486748 False
3 0.528394 1.810577 1.282184 True
4 -0.010840 -0.358776 -0.347936 False
5 0.946331 -0.032310 -0.978641 True
In [39]:
df.assign(Ratio = df['one'] / df['two']) # assign returns a new copy of the DataFrame, so df itself is unchanged; insert modifies df in place
Out[39]:
one bar two flag Ratio
0 -0.297524 -0.000510 0.297013 False -1.001718
1 0.160032 0.605009 0.444978 False 0.359640
2 0.187298 -1.299450 -1.486748 False -0.125978
3 0.528394 1.810577 1.282184 True 0.412105
4 -0.010840 -0.358776 -0.347936 False 0.031156
5 0.946331 -0.032310 -0.978641 True -0.966985
In [40]:
# assign also accepts a function
df.assign(ratio = lambda x : x.one - x.two)
Out[40]:
one bar two flag ratio
0 -0.297524 -0.000510 0.297013 False -0.594537
1 0.160032 0.605009 0.444978 False -0.284946
2 0.187298 -1.299450 -1.486748 False 1.674046
3 0.528394 1.810577 1.282184 True -0.753790
4 -0.010840 -0.358776 -0.347936 False 0.337095
5 0.946331 -0.032310 -0.978641 True 1.924972
In [41]:
df.one
Out[41]:
0 -0.297524
1 0.160032
2 0.187298
3 0.528394
4 -0.010840
5 0.946331
Name: one, dtype: float64
In [42]:
# The second assign operates on the copy returned by the first; that intermediate copy has no name, so it can only be referenced through a lambda
df.assign(abratio = lambda x : x.one - x.two).assign(abvalue = lambda x : x.abratio * 10)
Out[42]:
one bar two flag abratio abvalue
0 -0.297524 -0.000510 0.297013 False -0.594537 -5.945371
1 0.160032 0.605009 0.444978 False -0.284946 -2.849461
2 0.187298 -1.299450 -1.486748 False 1.674046 16.740456
3 0.528394 1.810577 1.282184 True -0.753790 -7.537898
4 -0.010840 -0.358776 -0.347936 False 0.337095 3.370953
5 0.946331 -0.032310 -0.978641 True 1.924972 19.249725
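The difference between assign and insert described in the cells above can be checked directly by looking at the column list before and after each call. This is a minimal sketch on a two-row frame with made-up values; the names with_ratio, cols_after_assign, and cols_after_insert are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0], 'two': [4.0, 8.0]})

# assign returns a new DataFrame; df keeps its original columns
with_ratio = df.assign(ratio=lambda x: x.one / x.two)
cols_after_assign = list(df.columns)

# insert changes df itself and places the new column at position 1
df.insert(1, 'total', df['one'] + df['two'])
cols_after_insert = list(df.columns)

print(cols_after_assign, cols_after_insert)
```

To keep the result of assign, bind it to a name (or rebind df), as with with_ratio here.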
DataFrame indexing
In [43]:
df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index = list('abcdef'), columns = list('ABCD'))
df
Out[43]:
A B C D
a 3 3 2 4
b 2 8 4 3
c 2 5 2 1
d 6 2 2 9
e 1 3 6 1
f 3 4 2 4
6 Basic pandas operations
import pandas as pd
import numpy as np
Reindexing
In [2]:
s = pd.Series([1, 3, 5, 7, 9], index = list('acefh'))
s
Out[2]:
a 1
c 3
e 5
f 7
h 9
dtype: int64
In [3]:
s.index
Out[3]:
Index(['a', 'c', 'e', 'f', 'h'], dtype='object')
In [4]:
s.reindex(list('abcdefgh')) # Add some rows
Out[4]:
a 1.0
b NaN
c 3.0
d NaN
e 5.0
f 7.0
g NaN
h 9.0
dtype: float64
In [5]:
s.reindex(list('abcdefgh'), fill_value = 0) # Give the added rows a default value
Out[5]:
a 1
b 0
c 3
d 0
e 5
f 7
g 0
h 9
dtype: int64
In [7]:
s.reindex(list('abcdefgh'), method = 'ffill') # Fill each added row with the preceding value
# ffill has no effect on newly added columns; it only fills newly added rows
Out[7]:
a 1
b 1
c 3
d 3
e 5
f 7
g 7
h 9
dtype: int64
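For comparison, the method argument also accepts 'bfill', which copies the next existing value instead of the previous one. The sketch below reuses the same Series as above and puts both results side by side; the names forward and backward are only for illustration.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9], index=list('acefh'))
new_index = list('abcdefgh')

forward = s.reindex(new_index, method='ffill')   # copy the previous existing value
backward = s.reindex(new_index, method='bfill')  # copy the next existing value

print(pd.DataFrame({'ffill': forward, 'bfill': backward}))
```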
In [8]:
df = pd.DataFrame(np.random.randn(4, 6), index = list('ACEG'), columns = ['one', 'two', 'three', 'four', 'five', 'sex'])
df
Out[8]:
one two three four five sex
A 0.459135 -1.267627 0.579744 0.238482 -2.049527 -1.365882
C -0.350446 -1.427801 0.574660 0.374694 0.647261 0.155820
E 0.321467 -0.066514 -0.433891 -0.130726 0.043650 0.936114
G 0.585555 0.997346 -3.290051 1.298093 -0.661426 0.312379
In [9]:
df2 = df.reindex(index = list('ABCDEFG')) # Returns a reindexed copy; df itself is not modified
df2
Out[9]:
one two three four five sex
A 0.459135 -1.267627 0.579744 0.238482 -2.049527 -1.365882
B NaN NaN NaN NaN NaN NaN
C -0.350446 -1.427801 0.574660 0.374694 0.647261 0.155820
D NaN NaN NaN NaN NaN NaN
E 0.321467 -0.066514 -0.433891 -0.130726 0.043650 0.936114
F NaN NaN NaN NaN NaN NaN
G 0.585555 0.997346 -3.290051 1.298093 -0.661426 0.312379
In [10]:
df
Out[10]:
one two three four five sex
A 0.459135 -1.267627 0.579744 0.238482 -2.049527 -1.365882
C -0.350446 -1.427801 0.574660 0.374694 0.647261 0.155820
E 0.321467 -0.066514 -0.433891 -0.130726 0.043650 0.936114
G 0.585555 0.997346 -3.290051 1.298093 -0.661426 0.312379
In [12]:
df2 = df.reindex(index = list('ABCDEFG'), fill_value = 0)
df2
Out[12]:
one two three four five sex
A 0.459135 -1.267627 0.579744 0.238482 -2.049527 -1.365882
B 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C -0.350446 -1.427801 0.574660 0.374694 0.647261 0.155820
D 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
E 0.321467 -0.066514 -0.433891 -0.130726 0.043650 0.936114
F 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
G 0.585555 0.997346 -3.290051 1.298093 -0.661426 0.312379
In [13]:
# Reindex the columns
df.reindex(columns = ['one', 'three', 'five', 'seven'])
Out[13]:
one three five seven
A 0.459135 0.579744 -2.049527 NaN
C -0.350446 0.574660 0.647261 NaN
E 0.321467 -0.433891 0.043650 NaN
G 0.585555 -3.290051 -0.661426 NaN
Dropping a row
In [14]:
df
Out[14]:
one two three four five sex
A 0.459135 -1.267627 0.579744 0.238482 -2.049527 -1.365882
C -0.350446 -1.427801 0.574660 0.374694 0.647261 0.155820
E 0.321467 -0.066514 -0.433891 -0.130726 0.043650 0.936114
G 0.585555 0.997346 -3.290051 1.298093 -0.661426 0.312379
In [15]:
df.drop('A')
Out[15]:
one two three four five sex
C -0.350446 -1.427801 0.574660 0.374694 0.647261 0.155820
E 0.321467 -0.066514 -0.433891 -0.130726 0.043650 0.936114
G 0.585555 0.997346 -3.290051 1.298093 -0.661426 0.312379
In [16]:
df
Out[16]:
one two three four five sex
A 0.459135 -1.267627 0.579744 0.238482 -2.049527 -1.365882
C -0.350446 -1.427801 0.574660 0.374694 0.647261 0.155820
E 0.321467 -0.066514 -0.433891 -0.130726 0.043650 0.936114
G 0.585555 0.997346 -3.290051 1.298093 -0.661426 0.312379
Dropping columns
In [17]:
df.drop(['two', 'four'], axis = 1)
Out[17]:
one three five sex
A 0.459135 0.579744 -2.049527 -1.365882
C -0.350446 0.574660 0.647261 0.155820
E 0.321467 -0.433891 0.043650 0.936114
G 0.585555 -3.290051 -0.661426 0.312379
In [18]:
df
Out[18]:
one two three four five sex
A 0.459135 -1.267627 0.579744 0.238482 -2.049527 -1.365882
C -0.350446 -1.427801 0.574660 0.374694 0.647261 0.155820
E 0.321467 -0.066514 -0.433891 -0.130726 0.043650 0.936114
G 0.585555 0.997346 -3.290051 1.298093 -0.661426 0.312379
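As the repeated df output shows, drop returns a new object and leaves df untouched. The axis argument can also be replaced by the index and columns keywords, which some readers find easier to follow. The sketch below is a small illustration on a made-up integer frame; no_row and no_cols are hypothetical names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list('ACE'),
                  columns=['one', 'two', 'three', 'four'])

no_row = df.drop(index='A')                # same effect as df.drop('A')
no_cols = df.drop(columns=['two', 'four']) # same effect as df.drop([...], axis=1)

print(no_row)
print(no_cols)
print(df.shape)  # df still has all rows and columns
```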
apply and applymap
apply passes an entire row or column to the function
In [19]:
df = pd.DataFrame(np.arange(12).reshape(4, 3), index = ['one', 'two', 'three', 'four'], columns = list('ABC'))
df
Out[19]:
A B C
one 0 1 2
two 3 4 5
three 6 7 8
four 9 10 11
In [20]:
df.apply(lambda x : x.max() - x.min()) # Operates column by column by default
Out[20]:
A 9
B 9
C 9
dtype: int64
In [21]:
df.apply(lambda x : x.max() - x.min(), axis = 1) # Operates row by row
Out[21]:
one 2
two 2
three 2
four 2
dtype: int64
In [22]:
def min_max(x):
    return pd.Series([x.min(), x.max()], index = ['min', 'max'])

df.apply(min_max)
Out[22]:
A B C
min 0 1 2
max 9 10 11
In [23]:
def min_max(x):
    return pd.Series([x.min(), x.max()], index = ['min', 'max'])

df.apply(min_max, axis = 1)
Out[23]:
min max
one 0 2
two 3 5
three 6 8
four 9 11
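For simple reductions such as max minus min, apply is not strictly needed: the built-in DataFrame reductions give the same numbers and are usually faster on large frames. The sketch below rebuilds the same 4x3 frame and compares the two approaches; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=['one', 'two', 'three', 'four'],
                  columns=list('ABC'))

# Column-wise range, with apply and with built-in reductions
col_range = df.apply(lambda x: x.max() - x.min())
col_range_vec = df.max() - df.min()

# Row-wise range, with apply and with built-in reductions
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)
row_range_vec = df.max(axis=1) - df.min(axis=1)

print(col_range_vec)
print(row_range_vec)
```

apply remains the right tool when the function has no vectorized equivalent, as with min_max above, which returns several values per column.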
applymap passes each individual element to the function
In [24]:
df = pd.DataFrame(np.random.randn(4, 3), index = ['one', 'two', 'three', 'four'], columns = list('ABC'))
df
Out[24]:
A B C
one -0.697469 -1.224409 -0.887984
two 1.268382 0.999928 1.028207
three -1.159685 -1.631597 -0.778589
four 0.158716 0.617732 -0.742091
In [27]:
# Show every value in df with 3 decimal places
# formatter = lambda x : '%.03f' % x # x is the individual value at a given row and column
formatter = '{0:.03f}'.format
df.applymap(formatter)
Out[27]:
A B C
one -0.697 -1.224 -0.888
two 1.268 1.000 1.028
three -1.160 -1.632 -0.779
four 0.159 0.618 -0.742
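Two things are worth knowing about this formatting step. First, the result holds strings, not numbers, so it is suited to display only. Second, recent pandas releases (2.1 and later, as far as I know) provide DataFrame.map as the element-wise method and deprecate applymap. The sketch below picks whichever is available; the elementwise helper name and the sample values are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [-0.697469, 1.268382],
                   'B': [-1.224409, 0.999928]})

formatter = '{0:.03f}'.format

# Use DataFrame.map when this pandas version has it, else fall back to applymap
elementwise = df.map if hasattr(df, 'map') else df.applymap
as_text = elementwise(formatter)   # every cell is now a str

# round keeps the values numeric, which is preferable for further computation
rounded = df.round(3)

print(as_text)
print(rounded.dtypes)
```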
Sorting and ranking
In [30]:
s = pd.Series([2, 6, 3, 6, 1, 0])
s
Out[30]:
0 2
1 6
2 3
3 6
4 1
5 0
dtype: int64
In [31]:
s.rank()
Out[31]:
0 3.0
1 5.5
2 4.0
3 5.5
4 2.0
5 1.0
dtype: float64
In [32]:
s.rank(method = 'first')
Out[32]:
0 3.0
1 5.0
2 4.0
3 6.0
4 2.0
5 1.0
dtype: float64
In [33]:
s.rank(method = 'average')
Out[33]:
0 3.0
1 5.5
2 4.0
3 5.5
4 2.0
5 1.0
dtype: float64
In [34]:
df
Out[34]:
A B C
one -0.697469 -1.224409 -0.887984
two 1.268382 0.999928 1.028207
three -1.159685 -1.631597 -0.778589
four 0.158716 0.617732 -0.742091
In [35]:
df.rank(method = 'first')
Out[35]:
A B C
one 2.0 2.0 1.0
two 4.0 4.0 4.0
three 1.0 1.0 2.0
four 3.0 3.0 3.0
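Besides 'average' and 'first', rank accepts 'min', 'max', and 'dense', which differ only in how tied values are numbered. The sketch below uses a slightly different made-up Series (one value larger than the tied pair) so that the three methods produce visibly different results.

```python
import pandas as pd

# Hypothetical values: the two 6s are tied, and 7 comes after them
s = pd.Series([2, 6, 3, 6, 1, 7])

by_min = s.rank(method='min')      # ties share the lowest rank of the group
by_max = s.rank(method='max')      # ties share the highest rank of the group
by_dense = s.rank(method='dense')  # like 'min', but no gap after a tie
descending = s.rank(method='min', ascending=False)  # largest value gets rank 1

print(pd.DataFrame({'value': s, 'min': by_min, 'max': by_max,
                    'dense': by_dense, 'desc': descending}))
```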
Uniqueness of data
In [36]:
s = pd.Series(list('abbcdeefggghhi'))
s
Out[36]:
0 a
1 b
2 b
3 c
4 d
5 e
6 e
7 f
8 g
9 g
10 g
11 h
12 h
13 i
dtype: object
In [37]:
s.value_counts() # Count how many times each value occurs
Out[37]:
g 3
e 2
h 2
b 2
d 1
a 1
c 1
f 1
i 1
dtype: int64
In [38]:
s.unique() # Return the distinct values
Out[38]:
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], dtype=object)
In [39]:
s.isin(['a', 'j', 'd']) # Check whether each element of s is in the given list
Out[39]:
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
dtype: bool
In [40]:
s.isin(s.unique())
Out[40]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
dtype: bool
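The boolean Series returned by isin is most useful as a filter. The sketch below reuses the same letters and shows that pattern, together with two small companions of unique and value_counts: nunique for the number of distinct values and idxmax for the most frequent one. The names picked, n_distinct, and most_common are only for illustration.

```python
import pandas as pd

s = pd.Series(list('abbcdeefggghhi'))

# Keep only the elements that appear in the given list
picked = s[s.isin(['a', 'j', 'd'])]

n_distinct = s.nunique()                  # number of distinct values
most_common = s.value_counts().idxmax()   # value with the highest count

print(picked)
print(n_distinct, most_common)
```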