大家早上好,本人姓吴,如果觉得文章写得还行的话也可以叫我吴老师。欢迎大家跟我一起走进数据分析的世界,一起学习!
感兴趣的朋友可以关注我的数据分析专栏,里面有许多优质的文章跟大家分享哦。
本篇博客将会给出大家平时使用pandas的时候经常需要用到的功能代码,同时也会给出运行结果,以帮助大家更进一步的理解。
另外,我也以注释的形式更进一步的补充说明代码的功能及其作用,需要本篇博文中用到的文档文件以及代码的朋友,也可以三连支持一下,并评论留下你的邮箱,我会在看到后的第一时间发送给你。
当然啦,你也可以把本篇博文当作一本小小的pandas书籍,当需要用到pandas哪些知识的时候,Ctrl+F就可以搜索到啦,现在不看的话就先收藏着。
更新的另外一篇,欢迎先来点击收藏:学习pandas全套代码【超详细】分箱操作、分组聚合、时间序列、数据可视化
目录
- 第一部分:pandas数据结构
- 第二部分:数据查看
- 第三部分:数据输入与输出
- 3.1 csv
- 3.2 Excel
- 3.3 HDF5
- 3.4 SQL
- 第四部分:数据选取
- 4.1 获取数据
- 4.2 标签选择
- 4.3 位置选择
- 4.4 boolean索引
- 4.5 赋值操作
- 第五部分:数据集成
- 5.1 concat数据串联
- 5.2 数据插入
- 5.3 Join SQL风格合并
- 第六部分:数据清洗
- 第七部分:数据转换
- 7.1 轴和元素转换
- 7.2 map映射元素转变
- 7.3 apply映射元素转变
- 7.4 transform元素转变
- 7.5 重排随机抽样哑变量
- 第八部分:数据重塑
- 第九部分:数学和统计方法
- 9.1 简单统计指标
- 9.2 索引标签、位置获取
- 9.3 更多统计指标
- 9.4 高级统计指标
- 第十部分:排序
- 结束语
第一部分:pandas数据结构
import numpy as np
import pandas as pd # pandas基于NumPy,升级
pandas的主要数据结构是 Series(⼀维数据)与 DataFrame(二维数据)。
1.1 Series
# Series
l = np.array([1,2,3,6,9]) # NumPy数组s1 = pd.Series(data = l)
display(l,s1) # Series是一维的数组,和NumPy数组不一样:Series多了索引
array([1, 2, 3, 6, 9])0 1
1 2
2 3
3 6
4 9
dtype: int64
s2 = pd.Series(data = l,index = list('ABCDE'))
s2
A 1
B 2
C 3
D 6
E 9
dtype: int64
s3 = pd.Series(data = {'A':149,'B':130,'C':118,'D':99,'E':66})
s3
A 149
B 130
C 118
D 99
E 66
dtype: int64
1.2 DataFrame
# Series是一维的,功能比较少
# DataFrame是二维的,多个Series公用索引,组成了DataFrame
# 像不像 Excel,所有数据,结构化
df1 = pd.DataFrame(data = np.random.randint(0,151,size = (10,3)),index = list('ABCDEFHIJK'), # 行索引columns=['Python','Math','En'],dtype=np.float16) # 列索引
df1
|
Python
|
Math
|
En
|
A
|
113.0
|
37.0
|
70.0
|
B
|
92.0
|
22.0
|
11.0
|
C
|
0.0
|
9.0
|
66.0
|
D
|
40.0
|
145.0
|
23.0
|
E
|
25.0
|
133.0
|
108.0
|
F
|
124.0
|
16.0
|
130.0
|
H
|
121.0
|
85.0
|
133.0
|
I
|
84.0
|
125.0
|
39.0
|
J
|
111.0
|
36.0
|
137.0
|
K
|
55.0
|
26.0
|
85.0
|
df2 = pd.DataFrame(data = {'Python':[66,99,128],'Math':[88,65,137],'En':[100,121,45]})
df2 # 字典,key作为列索引,不指定index默认从0开始索引,自动索引一样
|
Python
|
Math
|
En
|
0
|
66
|
88
|
100
|
1
|
99
|
65
|
121
|
2
|
128
|
137
|
45
|
第二部分:数据查看
df = pd.DataFrame(data = np.random.randint(0,151,size = (100,3)),columns=['Python','Math','En'])
df
|
Python
|
Math
|
En
|
0
|
133
|
139
|
141
|
1
|
82
|
17
|
130
|
2
|
51
|
51
|
145
|
3
|
127
|
70
|
11
|
4
|
93
|
60
|
91
|
...
|
...
|
...
|
...
|
95
|
57
|
133
|
96
|
96
|
91
|
21
|
134
|
97
|
76
|
109
|
113
|
98
|
99
|
82
|
29
|
99
|
28
|
54
|
88
|
100 rows × 3 columns
df.shape # 查看DataFrame形状
(100, 3)
df.head(n = 3) # 显示前N个,默认N = 5
|
Python
|
Math
|
En
|
0
|
133
|
139
|
141
|
1
|
82
|
17
|
130
|
2
|
51
|
51
|
145
|
df.tail() # 显示后n个
|
Python
|
Math
|
En
|
95
|
57
|
133
|
96
|
96
|
91
|
21
|
134
|
97
|
76
|
109
|
113
|
98
|
99
|
82
|
29
|
99
|
28
|
54
|
88
|
df.dtypes # 数据类型
Python int64
Math int64
En int64
dtype: object
df.info() # 比较详细信息
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):# Column Non-Null Count Dtype
--- ------ -------------- -----0 Python 100 non-null int641 Math 100 non-null int642 En 100 non-null int64
dtypes: int64(3)
memory usage: 2.5 KB
df.describe() # 描述:平均值、标准差、中位数、四等分、最大值,最小值
|
Python
|
Math
|
En
|
count
|
100.000000
|
100.000000
|
100.000000
|
mean
|
85.790000
|
77.410000
|
67.630000
|
std
|
41.375173
|
44.905309
|
43.883835
|
min
|
3.000000
|
0.000000
|
3.000000
|
25%
|
54.500000
|
40.250000
|
31.250000
|
50%
|
84.500000
|
81.000000
|
58.500000
|
75%
|
123.000000
|
113.250000
|
103.000000
|
max
|
149.000000
|
149.000000
|
147.000000
|
df.values # 值,返回的是NumPy数组
array([[133, 139, 141],[ 82, 17, 130],[ 51, 51, 145],[127, 70, 11],[ 93, 60, 91],[103, 110, 103],[ 27, 133, 32],[148, 99, 128],[139, 97, 44],[ 64, 85, 71],[147, 94, 37],[114, 12, 16],[ 16, 54, 44],[123, 3, 76],[137, 97, 123],[149, 113, 74],[ 69, 38, 7],[ 68, 122, 4],[ 53, 13, 47],[113, 127, 124],[ 55, 139, 47],[140, 114, 14],[ 84, 111, 115],[ 65, 5, 136],[ 96, 50, 89],[145, 130, 15],[111, 30, 66],[132, 122, 144],[ 79, 5, 45],[115, 29, 49],[ 27, 55, 83],[ 29, 74, 38],[ 87, 100, 45],[132, 147, 119],[ 66, 90, 40],[ 67, 108, 48],[ 78, 28, 46],[105, 137, 110],[132, 119, 55],[117, 23, 79],[ 12, 29, 12],[114, 58, 119],[139, 0, 42],[ 61, 69, 142],[141, 73, 107],[ 49, 12, 19],[ 8, 1, 75],[134, 60, 25],[138, 80, 79],[112, 115, 26],[ 77, 4, 120],[140, 100, 35],[ 82, 129, 4],[100, 8, 25],[ 77, 97, 78],[ 55, 113, 53],[ 45, 73, 37],[ 44, 0, 80],[ 26, 74, 52],[ 99, 75, 147],[111, 8, 144],[ 55, 146, 15],[140, 106, 74],[ 91, 78, 92],[130, 108, 41],[ 34, 41, 136],[ 3, 139, 4],[123, 93, 4],[ 24, 103, 3],[ 44, 122, 92],[ 83, 45, 50],[ 46, 149, 103],[ 48, 127, 92],[ 3, 51, 57],[136, 136, 82],[ 65, 102, 16],[ 23, 61, 118],[138, 15, 6],[ 83, 91, 4],[109, 24, 54],[ 40, 43, 125],[103, 123, 141],[116, 113, 38],[137, 71, 126],[ 69, 143, 83],[ 8, 60, 60],[ 40, 22, 95],[ 73, 19, 17],[137, 129, 103],[109, 142, 94],[ 85, 105, 10],[ 97, 107, 19],[ 79, 12, 27],[143, 74, 18],[ 32, 114, 52],[ 57, 133, 96],[ 91, 21, 134],[ 76, 109, 113],[ 99, 82, 29],[ 28, 54, 88]])
df.columns # 列索引
Index(['Python', 'Math', 'En'], dtype='object')
df.index # 行索引 0 ~ 99
RangeIndex(start=0, stop=100, step=1)
第三部分:数据输入与输出
3.1 csv
df = pd.DataFrame(data = np.random.randint(0,151,size = (100,3)),columns=['Python','Math','En'])
df # 行索引,列索引
|
Python
|
Math
|
En
|
0
|
10
|
128
|
54
|
1
|
45
|
47
|
74
|
2
|
103
|
33
|
133
|
3
|
70
|
24
|
81
|
4
|
90
|
143
|
121
|
...
|
...
|
...
|
...
|
95
|
145
|
25
|
139
|
96
|
53
|
51
|
109
|
97
|
35
|
7
|
130
|
98
|
86
|
51
|
20
|
99
|
149
|
66
|
75
|
100 rows × 3 columns
df.to_csv('./data.csv',sep = ',',index = True, # 保存行索引header=True) # 保存列索引
df.to_csv('./data2.csv',sep = ',',index = False, # 不保存行索引header=False) # 不保存列索引
pd.read_csv('./data.csv',index_col=0) # 第一列作为行索引
|
Python
|
Math
|
En
|
0
|
10
|
128
|
54
|
1
|
45
|
47
|
74
|
2
|
103
|
33
|
133
|
3
|
70
|
24
|
81
|
4
|
90
|
143
|
121
|
...
|
...
|
...
|
...
|
95
|
145
|
25
|
139
|
96
|
53
|
51
|
109
|
97
|
35
|
7
|
130
|
98
|
86
|
51
|
20
|
99
|
149
|
66
|
75
|
100 rows × 3 columns
pd.read_csv('./data2.csv',header =None)
|
0
|
1
|
2
|
0
|
10
|
128
|
54
|
1
|
45
|
47
|
74
|
2
|
103
|
33
|
133
|
3
|
70
|
24
|
81
|
4
|
90
|
143
|
121
|
...
|
...
|
...
|
...
|
95
|
145
|
25
|
139
|
96
|
53
|
51
|
109
|
97
|
35
|
7
|
130
|
98
|
86
|
51
|
20
|
99
|
149
|
66
|
75
|
100 rows × 3 columns
3.2 Excel
df
|
Python
|
Math
|
En
|
0
|
10
|
128
|
54
|
1
|
45
|
47
|
74
|
2
|
103
|
33
|
133
|
3
|
70
|
24
|
81
|
4
|
90
|
143
|
121
|
...
|
...
|
...
|
...
|
95
|
145
|
25
|
139
|
96
|
53
|
51
|
109
|
97
|
35
|
7
|
130
|
98
|
86
|
51
|
20
|
99
|
149
|
66
|
75
|
100 rows × 3 columns
df.to_excel('./data.xls')
pd.read_excel('./data.xls',index_col=0) # 第一列作为行索引
|
Python
|
Math
|
En
|
0
|
10
|
128
|
54
|
1
|
45
|
47
|
74
|
2
|
103
|
33
|
133
|
3
|
70
|
24
|
81
|
4
|
90
|
143
|
121
|
...
|
...
|
...
|
...
|
95
|
145
|
25
|
139
|
96
|
53
|
51
|
109
|
97
|
35
|
7
|
130
|
98
|
86
|
51
|
20
|
99
|
149
|
66
|
75
|
100 rows × 3 columns
3.3 HDF5
df.to_hdf('./data.h5',key = 'score')
df2 = pd.DataFrame(data = np.random.randint(6,100,size = (1000,5)),columns=['计算机','化工','生物','工程','教师'])
df2
|
计算机
|
化工
|
生物
|
工程
|
教师
|
0
|
64
|
22
|
16
|
68
|
60
|
1
|
95
|
47
|
72
|
76
|
37
|
2
|
88
|
48
|
92
|
50
|
37
|
3
|
75
|
38
|
8
|
63
|
83
|
4
|
62
|
14
|
20
|
21
|
45
|
...
|
...
|
...
|
...
|
...
|
...
|
995
|
89
|
94
|
88
|
97
|
27
|
996
|
52
|
68
|
21
|
8
|
50
|
997
|
76
|
99
|
10
|
92
|
56
|
998
|
66
|
31
|
55
|
65
|
94
|
999
|
8
|
21
|
38
|
89
|
14
|
1000 rows × 5 columns
df2.to_hdf('./data.h5',key = 'salary')
pd.read_hdf('./data.h5',key = 'salary')
|
计算机
|
化工
|
生物
|
工程
|
教师
|
0
|
64
|
22
|
16
|
68
|
60
|
1
|
95
|
47
|
72
|
76
|
37
|
2
|
88
|
48
|
92
|
50
|
37
|
3
|
75
|
38
|
8
|
63
|
83
|
4
|
62
|
14
|
20
|
21
|
45
|
...
|
...
|
...
|
...
|
...
|
...
|
995
|
89
|
94
|
88
|
97
|
27
|
996
|
52
|
68
|
21
|
8
|
50
|
997
|
76
|
99
|
10
|
92
|
56
|
998
|
66
|
31
|
55
|
65
|
94
|
999
|
8
|
21
|
38
|
89
|
14
|
1000 rows × 5 columns
3.4 SQL
from sqlalchemy import create_engine # 数据库引擎,构建和数据库的连接
# PyMySQL
# 类似网页地址
engine = create_engine('mysql+pymysql://root:12345678@localhost/pandas?charset=utf8')
df2.to_sql('salary',engine,index=False) # 将Python中数据DataFrame保存到Mysql
df3 = pd.read_sql('select * from salary limit 50',con = engine)
df3
|
计算机
|
化工
|
生物
|
工程
|
教师
|
0
|
64
|
22
|
16
|
68
|
60
|
1
|
95
|
47
|
72
|
76
|
37
|
2
|
88
|
48
|
92
|
50
|
37
|
3
|
75
|
38
|
8
|
63
|
83
|
4
|
62
|
14
|
20
|
21
|
45
|
5
|
95
|
41
|
84
|
37
|
16
|
6
|
34
|
45
|
11
|
93
|
94
|
7
|
99
|
10
|
57
|
30
|
63
|
8
|
59
|
60
|
12
|
37
|
93
|
9
|
41
|
58
|
15
|
67
|
70
|
10
|
19
|
8
|
63
|
96
|
64
|
11
|
75
|
61
|
78
|
49
|
89
|
12
|
86
|
84
|
31
|
68
|
27
|
13
|
42
|
98
|
24
|
20
|
85
|
14
|
95
|
15
|
97
|
80
|
87
|
15
|
52
|
22
|
44
|
35
|
74
|
16
|
67
|
20
|
65
|
24
|
10
|
17
|
9
|
46
|
41
|
62
|
66
|
18
|
86
|
76
|
80
|
72
|
19
|
19
|
61
|
81
|
64
|
26
|
6
|
20
|
77
|
92
|
84
|
18
|
7
|
21
|
87
|
16
|
75
|
14
|
34
|
22
|
23
|
82
|
92
|
42
|
32
|
23
|
61
|
89
|
28
|
21
|
40
|
24
|
22
|
12
|
38
|
89
|
14
|
25
|
77
|
12
|
46
|
89
|
12
|
26
|
60
|
45
|
52
|
71
|
67
|
27
|
29
|
76
|
94
|
26
|
91
|
28
|
14
|
60
|
82
|
88
|
60
|
29
|
56
|
36
|
44
|
60
|
37
|
30
|
63
|
77
|
43
|
42
|
82
|
31
|
25
|
71
|
36
|
51
|
21
|
32
|
76
|
86
|
87
|
83
|
8
|
33
|
93
|
59
|
86
|
25
|
78
|
34
|
73
|
40
|
12
|
86
|
66
|
35
|
10
|
30
|
54
|
13
|
71
|
36
|
9
|
48
|
58
|
75
|
85
|
37
|
81
|
41
|
61
|
12
|
55
|
38
|
80
|
68
|
66
|
92
|
84
|
39
|
53
|
36
|
84
|
26
|
66
|
40
|
19
|
62
|
63
|
47
|
45
|
41
|
89
|
39
|
91
|
31
|
86
|
42
|
57
|
43
|
53
|
48
|
19
|
43
|
66
|
16
|
23
|
19
|
10
|
44
|
46
|
28
|
78
|
81
|
21
|
45
|
38
|
53
|
76
|
49
|
8
|
46
|
55
|
94
|
70
|
6
|
44
|
47
|
56
|
33
|
92
|
17
|
84
|
48
|
69
|
68
|
23
|
87
|
90
|
49
|
12
|
47
|
32
|
80
|
15
|
第四部分:数据选取
4.1 获取数据
df = pd.DataFrame(np.random.randint(0,151,size = (10,3)),index=list('ABCDEFHIJK'),columns=['Python','Math','En'])
df
|
Python
|
Math
|
En
|
A
|
88
|
52
|
48
|
B
|
78
|
62
|
94
|
C
|
9
|
14
|
71
|
D
|
86
|
15
|
21
|
E
|
1
|
71
|
71
|
F
|
123
|
138
|
55
|
H
|
59
|
17
|
140
|
I
|
68
|
8
|
58
|
J
|
100
|
70
|
63
|
K
|
79
|
37
|
72
|
df['Python'] # 获取数据Series
A 88
B 78
C 9
D 86
E 1
F 123
H 59
I 68
J 100
K 79
Name: Python, dtype: int64
df.Python # 属性,DataFrame中列索引,表示属性
A 88
B 78
C 9
D 86
E 1
F 123
H 59
I 68
J 100
K 79
Name: Python, dtype: int64
df[['Python','En']] # 获取多列数据
|
Python
|
En
|
A
|
88
|
48
|
B
|
78
|
94
|
C
|
9
|
71
|
D
|
86
|
21
|
E
|
1
|
71
|
F
|
123
|
55
|
H
|
59
|
140
|
I
|
68
|
58
|
J
|
100
|
63
|
K
|
79
|
72
|
4.2 标签选择
# 标签,就是行索引 location = loc 位置
df.loc['A']
Python 88
Math 52
En 48
Name: A, dtype: int64
df.loc[['A','F','K']]
|
Python
|
Math
|
En
|
A
|
88
|
52
|
48
|
F
|
123
|
138
|
55
|
K
|
79
|
37
|
72
|
df.loc['A','Python']
88
df.loc[['A','C','F'],'Python']
A 88
C 9
F 123
Name: Python, dtype: int64
df.loc['A'::2,['Math','En']]
|
Math
|
En
|
A
|
52
|
48
|
C
|
14
|
71
|
E
|
71
|
71
|
H
|
17
|
140
|
J
|
70
|
63
|
df.loc['A':'D',:]
|
Python
|
Math
|
En
|
A
|
88
|
52
|
48
|
B
|
78
|
62
|
94
|
C
|
9
|
14
|
71
|
D
|
86
|
15
|
21
|
4.3 位置选择
df.iloc[0]
Python 88
Math 52
En 48
Name: A, dtype: int64
df.iloc[[0,2,4]]
|
Python
|
Math
|
En
|
A
|
88
|
52
|
48
|
C
|
9
|
14
|
71
|
E
|
1
|
71
|
71
|
df.iloc[0:4,[0,2]]
|
Python
|
En
|
A
|
88
|
48
|
B
|
78
|
94
|
C
|
9
|
71
|
D
|
86
|
21
|
df.iloc[3:8:2]
|
Python
|
Math
|
En
|
D
|
86
|
15
|
21
|
F
|
123
|
138
|
55
|
I
|
68
|
8
|
58
|
4.4 boolean索引
cond = df.Python > 80 # 将Python大于80分的成绩获取
df[cond]
|
Python
|
Math
|
En
|
A
|
88
|
52
|
48
|
D
|
86
|
15
|
21
|
F
|
123
|
138
|
55
|
J
|
100
|
70
|
63
|
cond = df.mean(axis = 1) > 75 # 平均分大于75,优秀,筛选出来
df[cond]
|
Python
|
Math
|
En
|
B
|
78
|
62
|
94
|
F
|
123
|
138
|
55
|
J
|
100
|
70
|
63
|
cond = (df.Python > 70) & (df.Math > 70)
df[cond]
|
Python
|
Math
|
En
|
F
|
123
|
138
|
55
|
cond = df.index.isin(['C','E','H','K']) # 判断数据是否在数组中
df[cond] # 删选出来了符合条件的数据
|
Python
|
Math
|
En
|
C
|
9
|
14
|
71
|
E
|
1
|
71
|
71
|
H
|
59
|
17
|
140
|
K
|
79
|
37
|
72
|
4.5 赋值操作
df['Python']['A'] = 150 # 修改某个位置的值
df
|
Python
|
Math
|
En
|
A
|
150
|
52
|
48
|
B
|
78
|
62
|
94
|
C
|
9
|
14
|
71
|
D
|
86
|
15
|
21
|
E
|
1
|
71
|
71
|
F
|
123
|
138
|
55
|
H
|
59
|
17
|
140
|
I
|
68
|
8
|
58
|
J
|
100
|
70
|
63
|
K
|
79
|
37
|
72
|
df['Java'] = np.random.randint(0,151,size = 10) # 新增加一列
df
|
Python
|
Math
|
En
|
Java
|
A
|
150
|
52
|
48
|
65
|
B
|
78
|
62
|
94
|
25
|
C
|
9
|
14
|
71
|
82
|
D
|
86
|
15
|
21
|
139
|
E
|
1
|
71
|
71
|
67
|
F
|
123
|
138
|
55
|
145
|
H
|
59
|
17
|
140
|
53
|
I
|
68
|
8
|
58
|
141
|
J
|
100
|
70
|
63
|
11
|
K
|
79
|
37
|
72
|
127
|
df.loc[['C','D','E'],'Math'] = 147 # 修改多个人的成绩
df
|
Python
|
Math
|
En
|
Java
|
A
|
150
|
52
|
48
|
65
|
B
|
78
|
62
|
94
|
25
|
C
|
9
|
147
|
71
|
82
|
D
|
86
|
147
|
21
|
139
|
E
|
1
|
147
|
71
|
67
|
F
|
123
|
138
|
55
|
145
|
H
|
59
|
17
|
140
|
53
|
I
|
68
|
8
|
58
|
141
|
J
|
100
|
70
|
63
|
11
|
K
|
79
|
37
|
72
|
127
|
cond = df < 60
df[cond] = 60 # where 条件操作,符合这条件值,修改,不符合,不改变
df
|
Python
|
Math
|
En
|
Java
|
A
|
150
|
60
|
60
|
65
|
B
|
78
|
62
|
94
|
60
|
C
|
60
|
147
|
71
|
82
|
D
|
86
|
147
|
60
|
139
|
E
|
60
|
147
|
71
|
67
|
F
|
123
|
138
|
60
|
145
|
H
|
60
|
60
|
140
|
60
|
I
|
68
|
60
|
60
|
141
|
J
|
100
|
70
|
63
|
60
|
K
|
79
|
60
|
72
|
127
|
df.iloc[3::3,[0,2]] += 100
df
|
Python
|
Math
|
En
|
Java
|
A
|
150
|
60
|
60
|
65
|
B
|
78
|
62
|
94
|
60
|
C
|
60
|
147
|
71
|
82
|
D
|
186
|
147
|
160
|
139
|
E
|
60
|
147
|
71
|
67
|
F
|
123
|
138
|
60
|
145
|
H
|
160
|
60
|
240
|
60
|
I
|
68
|
60
|
60
|
141
|
J
|
100
|
70
|
63
|
60
|
K
|
179
|
60
|
172
|
127
|
第五部分:数据集成
5.1 concat数据串联
# np.concatenate NumPy数据集成
df1 = pd.DataFrame(np.random.randint(0,151,size = (10,3)),columns=['Python','Math','En'],index = list('ABCDEFHIJK'))
df2 = pd.DataFrame(np.random.randint(0,151,size = (10,3)),columns = ['Python','Math','En'],index = list('QWRTUYOPLM'))
df3 = pd.DataFrame(np.random.randint(0,151,size = (10,2)),columns=['Java','Chinese'],index = list('ABCDEFHIJK'))
pd.concat([df1,df2],axis = 0) # axis = 0变是行合并,行增加
|
Python
|
Math
|
En
|
A
|
108
|
74
|
53
|
B
|
98
|
16
|
47
|
C
|
71
|
77
|
128
|
D
|
9
|
123
|
131
|
E
|
25
|
90
|
132
|
F
|
105
|
106
|
86
|
H
|
146
|
42
|
81
|
I
|
83
|
4
|
36
|
J
|
102
|
79
|
8
|
K
|
92
|
11
|
47
|
Q
|
119
|
59
|
43
|
W
|
20
|
62
|
106
|
R
|
77
|
82
|
128
|
T
|
44
|
119
|
15
|
U
|
49
|
149
|
62
|
Y
|
94
|
90
|
88
|
O
|
105
|
72
|
133
|
P
|
87
|
109
|
123
|
L
|
125
|
140
|
149
|
M
|
148
|
22
|
102
|
pd.concat([df1,df3],axis = 1) # axis = 1表示列增加
|
Python
|
Math
|
En
|
Java
|
Chinese
|
A
|
108
|
74
|
53
|
61
|
81
|
B
|
98
|
16
|
47
|
117
|
117
|
C
|
71
|
77
|
128
|
48
|
4
|
D
|
9
|
123
|
131
|
149
|
115
|
E
|
25
|
90
|
132
|
113
|
73
|
F
|
105
|
106
|
86
|
140
|
26
|
H
|
146
|
42
|
81
|
117
|
118
|
I
|
83
|
4
|
36
|
103
|
91
|
J
|
102
|
79
|
8
|
43
|
20
|
K
|
92
|
11
|
47
|
93
|
72
|
df1.append(df2) # append追加,在行后面直接进行追加
|
Python
|
C++
|
Math
|
En
|
A
|
108
|
59.0
|
74
|
53
|
B
|
98
|
4.0
|
16
|
47
|
C
|
71
|
27.0
|
77
|
128
|
D
|
9
|
17.0
|
123
|
131
|
E
|
25
|
60.0
|
90
|
132
|
F
|
105
|
136.0
|
106
|
86
|
H
|
146
|
112.0
|
42
|
81
|
I
|
83
|
120.0
|
4
|
36
|
J
|
102
|
28.0
|
79
|
8
|
K
|
92
|
53.0
|
11
|
47
|
Q
|
119
|
NaN
|
59
|
43
|
W
|
20
|
NaN
|
62
|
106
|
R
|
77
|
NaN
|
82
|
128
|
T
|
44
|
NaN
|
119
|
15
|
U
|
49
|
NaN
|
149
|
62
|
Y
|
94
|
NaN
|
90
|
88
|
O
|
105
|
NaN
|
72
|
133
|
P
|
87
|
NaN
|
109
|
123
|
L
|
125
|
NaN
|
140
|
149
|
M
|
148
|
NaN
|
22
|
102
|
df1.append(df3) # 出现空数据,原因在于:df1的列索引和df3列索引不一致
|
Python
|
C++
|
Math
|
En
|
Java
|
Chinese
|
A
|
108.0
|
59.0
|
74.0
|
53.0
|
NaN
|
NaN
|
B
|
98.0
|
4.0
|
16.0
|
47.0
|
NaN
|
NaN
|
C
|
71.0
|
27.0
|
77.0
|
128.0
|
NaN
|
NaN
|
D
|
9.0
|
17.0
|
123.0
|
131.0
|
NaN
|
NaN
|
E
|
25.0
|
60.0
|
90.0
|
132.0
|
NaN
|
NaN
|
F
|
105.0
|
136.0
|
106.0
|
86.0
|
NaN
|
NaN
|
H
|
146.0
|
112.0
|
42.0
|
81.0
|
NaN
|
NaN
|
I
|
83.0
|
120.0
|
4.0
|
36.0
|
NaN
|
NaN
|
J
|
102.0
|
28.0
|
79.0
|
8.0
|
NaN
|
NaN
|
K
|
92.0
|
53.0
|
11.0
|
47.0
|
NaN
|
NaN
|
A
|
NaN
|
NaN
|
NaN
|
NaN
|
61.0
|
81.0
|
B
|
NaN
|
NaN
|
NaN
|
NaN
|
117.0
|
117.0
|
C
|
NaN
|
NaN
|
NaN
|
NaN
|
48.0
|
4.0
|
D
|
NaN
|
NaN
|
NaN
|
NaN
|
149.0
|
115.0
|
E
|
NaN
|
NaN
|
NaN
|
NaN
|
113.0
|
73.0
|
F
|
NaN
|
NaN
|
NaN
|
NaN
|
140.0
|
26.0
|
H
|
NaN
|
NaN
|
NaN
|
NaN
|
117.0
|
118.0
|
I
|
NaN
|
NaN
|
NaN
|
NaN
|
103.0
|
91.0
|
J
|
NaN
|
NaN
|
NaN
|
NaN
|
43.0
|
20.0
|
K
|
NaN
|
NaN
|
NaN
|
NaN
|
93.0
|
72.0
|
pd.concat([df1,df3],axis = 0)
|
Python
|
C++
|
Math
|
En
|
Java
|
Chinese
|
A
|
108.0
|
59.0
|
74.0
|
53.0
|
NaN
|
NaN
|
B
|
98.0
|
4.0
|
16.0
|
47.0
|
NaN
|
NaN
|
C
|
71.0
|
27.0
|
77.0
|
128.0
|
NaN
|
NaN
|
D
|
9.0
|
17.0
|
123.0
|
131.0
|
NaN
|
NaN
|
E
|
25.0
|
60.0
|
90.0
|
132.0
|
NaN
|
NaN
|
F
|
105.0
|
136.0
|
106.0
|
86.0
|
NaN
|
NaN
|
H
|
146.0
|
112.0
|
42.0
|
81.0
|
NaN
|
NaN
|
I
|
83.0
|
120.0
|
4.0
|
36.0
|
NaN
|
NaN
|
J
|
102.0
|
28.0
|
79.0
|
8.0
|
NaN
|
NaN
|
K
|
92.0
|
53.0
|
11.0
|
47.0
|
NaN
|
NaN
|
A
|
NaN
|
NaN
|
NaN
|
NaN
|
61.0
|
81.0
|
B
|
NaN
|
NaN
|
NaN
|
NaN
|
117.0
|
117.0
|
C
|
NaN
|
NaN
|
NaN
|
NaN
|
48.0
|
4.0
|
D
|
NaN
|
NaN
|
NaN
|
NaN
|
149.0
|
115.0
|
E
|
NaN
|
NaN
|
NaN
|
NaN
|
113.0
|
73.0
|
F
|
NaN
|
NaN
|
NaN
|
NaN
|
140.0
|
26.0
|
H
|
NaN
|
NaN
|
NaN
|
NaN
|
117.0
|
118.0
|
I
|
NaN
|
NaN
|
NaN
|
NaN
|
103.0
|
91.0
|
J
|
NaN
|
NaN
|
NaN
|
NaN
|
43.0
|
20.0
|
K
|
NaN
|
NaN
|
NaN
|
NaN
|
93.0
|
72.0
|
5.2 数据插入
df1
|
Python
|
Math
|
En
|
A
|
108
|
74
|
53
|
B
|
98
|
16
|
47
|
C
|
71
|
77
|
128
|
D
|
9
|
123
|
131
|
E
|
25
|
90
|
132
|
F
|
105
|
106
|
86
|
H
|
146
|
42
|
81
|
I
|
83
|
4
|
36
|
J
|
102
|
79
|
8
|
K
|
92
|
11
|
47
|
df1.insert(loc = 1, # 插入位置column='C++', # 插入一列,这一列名字value = np.random.randint(0,151,size = 10)) # 插入的值
df1
|
Python
|
C++
|
Math
|
En
|
A
|
108
|
59
|
74
|
53
|
B
|
98
|
4
|
16
|
47
|
C
|
71
|
27
|
77
|
128
|
D
|
9
|
17
|
123
|
131
|
E
|
25
|
60
|
90
|
132
|
F
|
105
|
136
|
106
|
86
|
H
|
146
|
112
|
42
|
81
|
I
|
83
|
120
|
4
|
36
|
J
|
102
|
28
|
79
|
8
|
K
|
92
|
53
|
11
|
47
|
5.3 Join SQL风格合并
df1 = pd.DataFrame(data = {'name':['softpo','Brandon','Ella','Daniel','张三'],'height':[175,180,169,177,168]}) # 身高
df2 = pd.DataFrame(data = {'name':['softpo','Brandon','Ella','Daniel','李四'],'weight':[70,65,74,63,88]}) # 体重
df3 = pd.DataFrame(data = {'名字':['softpo','Brandon','Ella','Daniel','张三'],'salary':np.random.randint(20,100,size = 5)}) # 薪水
display(df1,df2,df3)
|
name
|
height
|
0
|
softpo
|
175
|
1
|
Brandon
|
180
|
2
|
Ella
|
169
|
3
|
Daniel
|
177
|
4
|
张三
|
168
|
|
name
|
weight
|
0
|
softpo
|
70
|
1
|
Brandon
|
65
|
2
|
Ella
|
74
|
3
|
Daniel
|
63
|
4
|
李四
|
88
|
|
名字
|
salary
|
0
|
softpo
|
64
|
1
|
Brandon
|
48
|
2
|
Ella
|
25
|
3
|
Daniel
|
26
|
4
|
张三
|
96
|
pd.concat([df1,df2],axis = 1)
|
name
|
height
|
name
|
weight
|
0
|
softpo
|
175
|
softpo
|
70
|
1
|
Brandon
|
180
|
Brandon
|
65
|
2
|
Ella
|
169
|
Ella
|
74
|
3
|
Daniel
|
177
|
Daniel
|
63
|
4
|
张三
|
168
|
李四
|
88
|
# 根据共同的属性,合并数据
# df1 和 df2 共同属性:name
# 数据库,合并join 共同key
# inner内合并
pd.merge(df1,df2,how = 'inner') # 根据共同name进行合并,两表合并,外键
|
name
|
height
|
weight
|
0
|
softpo
|
175
|
70
|
1
|
Brandon
|
180
|
65
|
2
|
Ella
|
169
|
74
|
3
|
Daniel
|
177
|
63
|
pd.merge(df1,df2,how = 'outer') # 外合并,所有数据保留,不对应位置,填充了空数据
|
name
|
height
|
weight
|
0
|
softpo
|
175.0
|
70.0
|
1
|
Brandon
|
180.0
|
65.0
|
2
|
Ella
|
169.0
|
74.0
|
3
|
Daniel
|
177.0
|
63.0
|
4
|
张三
|
168.0
|
NaN
|
5
|
李四
|
NaN
|
88.0
|
pd.merge(df1,df2,how = 'left')
|
name
|
height
|
weight
|
0
|
softpo
|
175
|
70.0
|
1
|
Brandon
|
180
|
65.0
|
2
|
Ella
|
169
|
74.0
|
3
|
Daniel
|
177
|
63.0
|
4
|
张三
|
168
|
NaN
|
pd.merge(df1,df3,left_on='name',right_on='名字')
|
name
|
height
|
名字
|
salary
|
0
|
softpo
|
175
|
softpo
|
64
|
1
|
Brandon
|
180
|
Brandon
|
48
|
2
|
Ella
|
169
|
Ella
|
25
|
3
|
Daniel
|
177
|
Daniel
|
26
|
4
|
张三
|
168
|
张三
|
96
|
df4 = pd.DataFrame(data = np.random.randint(0,151,size = (10,3)),columns=['Python','Math','En'],index = list('ABCDEFHIJK'))
df4
|
Python
|
Math
|
En
|
A
|
71
|
7
|
89
|
B
|
145
|
116
|
40
|
C
|
56
|
150
|
139
|
D
|
88
|
66
|
41
|
E
|
87
|
139
|
117
|
F
|
141
|
45
|
18
|
H
|
93
|
119
|
114
|
I
|
110
|
89
|
2
|
J
|
2
|
35
|
96
|
K
|
125
|
59
|
9
|
score_mean = df4.mean(axis = 1).round(1)
score_mean
A 55.7
B 100.3
C 115.0
D 65.0
E 114.3
F 68.0
H 108.7
I 67.0
J 44.3
K 64.3
dtype: float64
df4.insert(loc = 2,column='平均分',value=score_mean)
df4
|
Python
|
Math
|
平均分
|
En
|
A
|
71
|
7
|
55.7
|
89
|
B
|
145
|
116
|
100.3
|
40
|
C
|
56
|
150
|
115.0
|
139
|
D
|
88
|
66
|
65.0
|
41
|
E
|
87
|
139
|
114.3
|
117
|
F
|
141
|
45
|
68.0
|
18
|
H
|
93
|
119
|
108.7
|
114
|
I
|
110
|
89
|
67.0
|
2
|
J
|
2
|
35
|
44.3
|
96
|
K
|
125
|
59
|
64.3
|
9
|
df5 = df4.iloc[:,[0,1,3]]
df5
|
Python
|
Math
|
En
|
A
|
71
|
7
|
89
|
B
|
145
|
116
|
40
|
C
|
56
|
150
|
139
|
D
|
88
|
66
|
41
|
E
|
87
|
139
|
117
|
F
|
141
|
45
|
18
|
H
|
93
|
119
|
114
|
I
|
110
|
89
|
2
|
J
|
2
|
35
|
96
|
K
|
125
|
59
|
9
|
score_mean.name = '平均分'
score_mean
A 55.7
B 100.3
C 115.0
D 65.0
E 114.3
F 68.0
H 108.7
I 67.0
J 44.3
K 64.3
Name: 平均分, dtype: float64
df5
|
Python
|
Math
|
En
|
A
|
71
|
7
|
89
|
B
|
145
|
116
|
40
|
C
|
56
|
150
|
139
|
D
|
88
|
66
|
41
|
E
|
87
|
139
|
117
|
F
|
141
|
45
|
18
|
H
|
93
|
119
|
114
|
I
|
110
|
89
|
2
|
J
|
2
|
35
|
96
|
K
|
125
|
59
|
9
|
pd.merge(df5,score_mean,left_index=True, # 数据合并根据行索引,对应right_index=True) # 右边数据根据行索引,对应
|
Python
|
Math
|
En
|
平均分
|
A
|
71
|
7
|
89
|
55.7
|
B
|
145
|
116
|
40
|
100.3
|
C
|
56
|
150
|
139
|
115.0
|
D
|
88
|
66
|
41
|
65.0
|
E
|
87
|
139
|
117
|
114.3
|
F
|
141
|
45
|
18
|
68.0
|
H
|
93
|
119
|
114
|
108.7
|
I
|
110
|
89
|
2
|
67.0
|
J
|
2
|
35
|
96
|
44.3
|
K
|
125
|
59
|
9
|
64.3
|
第六部分:数据清洗
df = pd.DataFrame(data = {'color':['red','blue','red','green','green','blue',None,np.NaN,'green'],'price':[20,15,20,18,18,22,30,30,22]})
df
|
color
|
price
|
0
|
red
|
20
|
1
|
blue
|
15
|
2
|
red
|
20
|
3
|
green
|
18
|
4
|
green
|
18
|
5
|
blue
|
22
|
6
|
None
|
30
|
7
|
NaN
|
30
|
8
|
green
|
22
|
# 重复数据删除
df.drop_duplicates() # 非重复数据,索引7和索引6重复数据,None和NaN一回事
|
color
|
price
|
0
|
red
|
20
|
1
|
blue
|
15
|
3
|
green
|
18
|
5
|
blue
|
22
|
6
|
None
|
30
|
8
|
green
|
22
|
df
|
color
|
price
|
0
|
red
|
20
|
1
|
blue
|
15
|
2
|
red
|
20
|
3
|
green
|
18
|
4
|
green
|
18
|
5
|
blue
|
22
|
6
|
None
|
30
|
7
|
NaN
|
30
|
8
|
green
|
22
|
df.dropna() # 空数据过滤
|
color
|
price
|
0
|
red
|
20
|
1
|
blue
|
15
|
2
|
red
|
20
|
3
|
green
|
18
|
4
|
green
|
18
|
5
|
blue
|
22
|
8
|
green
|
22
|
# 删除行,或者列
df.drop(labels=[2,4,6,8]) # 默认情况下删除行
|
color
|
price
|
0
|
red
|
20
|
1
|
blue
|
15
|
3
|
green
|
18
|
5
|
blue
|
22
|
7
|
NaN
|
30
|
# 删除指定的列
df.drop(labels='color',axis = 1) # 删除列,axis = 1
|
price
|
0
|
20
|
1
|
15
|
2
|
20
|
3
|
18
|
4
|
18
|
5
|
22
|
6
|
30
|
7
|
30
|
8
|
22
|
df.filter(items=['price']) # 参数意思,保留数据price
|
price
|
0
|
20
|
1
|
15
|
2
|
20
|
3
|
18
|
4
|
18
|
5
|
22
|
6
|
30
|
7
|
30
|
8
|
22
|
df['size'] = 1024 # 广播
df
|
color
|
price
|
size
|
0
|
red
|
20
|
1024
|
1
|
blue
|
15
|
1024
|
2
|
red
|
20
|
1024
|
3
|
green
|
18
|
1024
|
4
|
green
|
18
|
1024
|
5
|
blue
|
22
|
1024
|
6
|
None
|
30
|
1024
|
7
|
NaN
|
30
|
1024
|
8
|
green
|
22
|
1024
|
df.filter(like = 'i') # 模糊匹配,保留了带有i这个字母的索引
|
price
|
size
|
0
|
20
|
1024
|
1
|
15
|
1024
|
2
|
20
|
1024
|
3
|
18
|
1024
|
4
|
18
|
1024
|
5
|
22
|
1024
|
6
|
30
|
1024
|
7
|
30
|
1024
|
8
|
22
|
1024
|
df['hello'] = 512
df
|
color
|
price
|
size
|
hello
|
0
|
red
|
20
|
1024
|
512
|
1
|
blue
|
15
|
1024
|
512
|
2
|
red
|
20
|
1024
|
512
|
3
|
green
|
18
|
1024
|
512
|
4
|
green
|
18
|
1024
|
512
|
5
|
blue
|
22
|
1024
|
512
|
6
|
None
|
30
|
1024
|
512
|
7
|
NaN
|
30
|
1024
|
512
|
8
|
green
|
22
|
1024
|
512
|
# 正则表达式,方式很多
df.filter(regex = 'e$') # 正则表达式,正则表达式,限制e必须在最后
|
price
|
size
|
0
|
20
|
1024
|
1
|
15
|
1024
|
2
|
20
|
1024
|
3
|
18
|
1024
|
4
|
18
|
1024
|
5
|
22
|
1024
|
6
|
30
|
1024
|
7
|
30
|
1024
|
8
|
22
|
1024
|
df.filter(regex='e') # 只要带有e全部选出来
|
price
|
size
|
hello
|
0
|
20
|
1024
|
512
|
1
|
15
|
1024
|
512
|
2
|
20
|
1024
|
512
|
3
|
18
|
1024
|
512
|
4
|
18
|
1024
|
512
|
5
|
22
|
1024
|
512
|
6
|
30
|
1024
|
512
|
7
|
30
|
1024
|
512
|
8
|
22
|
1024
|
512
|
# 异常值过滤
a = np.random.randint(0,1000,size = 200)
a
array([647, 871, 35, 738, 789, 587, 413, 559, 648, 993, 579, 129, 825,904, 356, 316, 997, 800, 35, 601, 1, 208, 465, 614, 680, 619,922, 346, 994, 135, 5, 650, 165, 475, 95, 194, 225, 455, 634,717, 836, 678, 156, 203, 263, 180, 143, 248, 407, 56, 202, 947,46, 408, 686, 530, 545, 273, 125, 964, 323, 775, 313, 238, 242,804, 228, 322, 322, 768, 556, 9, 629, 938, 932, 859, 955, 707,729, 541, 280, 493, 255, 681, 428, 992, 420, 650, 267, 32, 662,185, 756, 319, 313, 271, 229, 711, 803, 85, 527, 853, 670, 685,423, 458, 628, 701, 253, 495, 548, 879, 503, 115, 90, 978, 665,532, 198, 482, 412, 850, 879, 913, 96, 177, 778, 337, 502, 128,49, 747, 591, 22, 557, 105, 136, 775, 626, 515, 959, 869, 245,437, 51, 236, 438, 489, 854, 49, 163, 687, 488, 175, 428, 517,493, 377, 100, 728, 717, 926, 689, 186, 777, 639, 79, 83, 620,623, 931, 918, 721, 315, 133, 423, 161, 999, 341, 55, 837, 582,530, 805, 22, 301, 177, 322, 708, 14, 50, 864, 889, 929, 967,497, 624, 127, 539, 14])
# 异常值,大于800,小于 100算作异常,认为定义的。根据实际情况。
cond = (a <=800) & (a >=100)
a[cond]
array([647, 738, 789, 587, 413, 559, 648, 579, 129, 356, 316, 800, 601,208, 465, 614, 680, 619, 346, 135, 650, 165, 475, 194, 225, 455,634, 717, 678, 156, 203, 263, 180, 143, 248, 407, 202, 408, 686,530, 545, 273, 125, 323, 775, 313, 238, 242, 228, 322, 322, 768,556, 629, 707, 729, 541, 280, 493, 255, 681, 428, 420, 650, 267,662, 185, 756, 319, 313, 271, 229, 711, 527, 670, 685, 423, 458,628, 701, 253, 495, 548, 503, 115, 665, 532, 198, 482, 412, 177,778, 337, 502, 128, 747, 591, 557, 105, 136, 775, 626, 515, 245,437, 236, 438, 489, 163, 687, 488, 175, 428, 517, 493, 377, 100,728, 717, 689, 186, 777, 639, 620, 623, 721, 315, 133, 423, 161,341, 582, 530, 301, 177, 322, 708, 497, 624, 127, 539])
# 正态分布,平均值是0,标准差是1
b = np.random.randn(100000)
b
array([-1.17335196, 2.02215212, -0.29891071, ..., -1.6762474 ,-1.27071523, -1.15187761])
# 过滤异常值
cond = np.abs(b) > 3*1 # 这些异常值,找到了
b[cond]
array([ 3.46554243, 3.08127362, 3.55119821, 3.62774922, 3.11823028,3.22620922, -3.10381164, -3.20067563, -3.04607325, -3.04427703,3.09111414, -3.28220862, 3.00499105, -3.06179762, -3.17331972,-3.37172359, 3.93766782, -3.22895232, -3.13737479, 3.07612751,-3.43215209, -3.27660651, -3.35116041, 4.74328695, 3.25586636,-3.54090785, 3.08881127, 3.00635551, 3.5018534 , -3.14463788,-3.0182886 , -3.12145648, -3.24276219, 3.08087834, 3.04820238,-3.24173442, -3.14648209, 3.87748281, -3.07660111, -3.16083928,3.32641202, -3.05228179, 3.04924043, 3.02825131, -3.08360056,-3.04890894, -3.27258041, -3.07339115, -3.38375287, -3.14267022,-3.7207377 , 3.4813841 , -3.12866105, -3.17122631, 3.0599701 ,3.12393087, 3.20253178, -3.05221958, -3.35532417, 3.02450167,-3.28385568, 3.3422833 , -3.11052755, -3.09647003, 3.32353664,-3.70215812, -3.07916575, -3.13546874, 3.20575826, -3.67982084,-3.17055893, 3.4836615 , -3.30039879, -3.27774497, 3.02125912,3.12332885, 3.01456477, 3.15958151, -3.34101369, 3.32444673,3.06479889, 3.14506863, 3.15670827, 3.15066995, 3.14705869,-3.20526898, -3.0761338 , 3.20716127, -3.20941307, -3.7212859 ,-3.51785834, -3.06096986, -3.05425748, -3.47049261, 3.22285172,-3.32233224, -3.04630606, 3.41215312, -3.16482337, -3.01813609,-3.05441573, -3.10394416, 3.03469642, 3.01493847, -3.11901071,3.5996865 , 3.48194227, -3.77734847, 3.04588004, 3.10611158,-3.20473003, -3.4377999 , 3.22680244, -3.1536921 , -3.22798726,3.34569796, 3.06046948, -3.16955677, 3.12613756, 3.04286964,3.01148054, 3.18525226, -4.08971624, -3.55427596, -5.39879049,3.05203254, 3.08944491, -3.02258209, 3.17316913, -3.1615401 ,3.17205118, -3.24221772, -3.14421237, -3.74675036, 3.61678522,3.59097443, -3.0302881 , 3.23236707, -3.00850012, 3.33608986,-3.02859152, -3.7000766 , -3.10992575, -3.00412636, -3.05657102,-3.05208781, 3.14017797, 3.46457731, 3.15619413, -3.43236114,3.08259529, -3.84578168, 3.04203424, -3.29444028, -3.01764756,3.11300256, 3.23071233, 3.20785451, -3.15668756, 3.44176099,-3.19985577, -3.14126853, -3.26482841, -3.62208271, -3.55305069,3.09639491, -3.18178713, -3.03662021, 3.17247227, 3.3908074 ,-3.63563705, -3.56417097, 3.02823554, -3.06955375, 3.74305364,3.63993306, -3.14193492, -3.04032527, -3.28310908, -3.37949723,-3.25915912, -3.01206123, -3.10871377, -3.22982732, 3.8136103 ,3.48893313, 3.9918267 , 3.4526763 , -3.46595488, -3.29996013,-3.42965097, 3.151502 , 3.10548689, -3.44707735, 3.21881565,3.50932999, -3.12410382, 3.30296386, 3.02454576, -3.20072608,3.54339754, -3.17847739, -3.21475045, 3.03546088, -3.06225619,3.48158164, 3.15243123, -3.06358376, 3.27300242, 3.32577453,3.23535167, -3.04681725, 3.33439387, 3.10620079, 3.52883469,-3.1790272 , 3.02641222, -3.45636819, 3.21009424, 3.08045954,-3.59721754, 3.24693695, 3.05920919, -3.43674159, -3.00370946,-3.48031594, -3.28748467, 3.42581649, 3.46912521, -3.28384157,3.76358974, -3.34035865, 3.12978233, 3.44856854, -3.04074246,3.50018071, 3.33188267, -3.09775514, -3.49356906, -3.09902374,3.12068562, -3.1776565 , -3.44282129, 3.19286374, -3.28304596,-3.10080963, -3.37189709, 3.77743156, 3.03547536, 3.22045459,-3.44007263, 3.01331408, 3.49733677, 3.28831922, 3.62147013,3.03458981, 3.15447237, -3.33931478, 3.09858431, -3.23592306,3.3144797 , 3.37067342, -3.18749118, 3.09319307, -3.34390567,3.29819563, 3.3120354 , 3.04166958, -3.00975323, 3.0347423 ,-3.82502331, -3.13125028, -3.0876424 , 3.13929221, 3.570775 ,-3.37420738, 3.17527797, 3.13396148, -3.70088631, -3.04054948,3.05399103, 3.24908851, 3.19666266, -3.64071456, -3.85271081,3.06864652, 3.53367592, 3.54650649, 3.6355438 , 3.657715 ,4.03831601, 3.61651925])
第七部分:数据转换
7.1 轴和元素转换
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,10,size = (10,3)),columns=['Python','Tensorflow','Keras'],index = list('ABCDEFHIJK'))
df
|
Python
|
Tensorflow
|
Keras
|
A
|
2
|
5
|
3
|
B
|
5
|
0
|
0
|
C
|
7
|
0
|
4
|
D
|
0
|
4
|
7
|
E
|
8
|
6
|
9
|
F
|
8
|
2
|
6
|
H
|
6
|
7
|
8
|
I
|
7
|
6
|
9
|
J
|
4
|
7
|
9
|
K
|
6
|
7
|
1
|
df.rename(index = {'A':'X','K':'Y'}, # 行索引columns={'Python':'人工智能'}, # 列索引修改inplace=True) # 替换原数据
df.replace(5,50,inplace=True)
df
|
人工智能
|
Tensorflow
|
Keras
|
X
|
2
|
50
|
3
|
B
|
50
|
0
|
0
|
C
|
7
|
0
|
4
|
D
|
0
|
4
|
7
|
E
|
8
|
6
|
9
|
F
|
8
|
2
|
6
|
H
|
6
|
7
|
8
|
I
|
7
|
6
|
9
|
J
|
4
|
7
|
9
|
Y
|
6
|
7
|
1
|
df.replace([2,7],1024,inplace=True)
df
|
人工智能
|
Tensorflow
|
Keras
|
X
|
1024
|
50
|
3
|
B
|
50
|
0
|
0
|
C
|
1024
|
0
|
4
|
D
|
0
|
4
|
1024
|
E
|
8
|
6
|
9
|
F
|
8
|
1024
|
6
|
H
|
6
|
1024
|
8
|
I
|
1024
|
6
|
9
|
J
|
4
|
1024
|
9
|
Y
|
6
|
1024
|
1
|
df.iloc[4,2] = np.NaN # 空数据
df.replace({0:2048,np.nan:-100},inplace=True)
df
|
人工智能
|
Tensorflow
|
Keras
|
X
|
1024
|
50
|
3.0
|
B
|
50
|
2048
|
2048.0
|
C
|
1024
|
2048
|
4.0
|
D
|
2048
|
4
|
1024.0
|
E
|
8
|
6
|
-100.0
|
F
|
8
|
1024
|
6.0
|
H
|
6
|
1024
|
8.0
|
I
|
1024
|
6
|
9.0
|
J
|
4
|
1024
|
9.0
|
Y
|
6
|
1024
|
1.0
|
df.replace({'Tensorflow':1024},-1024) # 指定某一列,进行数据替换
|
人工智能
|
Tensorflow
|
Keras
|
X
|
1024
|
50
|
3.0
|
B
|
50
|
2048
|
2048.0
|
C
|
1024
|
2048
|
4.0
|
D
|
2048
|
4
|
1024.0
|
E
|
8
|
6
|
-100.0
|
F
|
8
|
-1024
|
6.0
|
H
|
6
|
-1024
|
8.0
|
I
|
1024
|
6
|
9.0
|
J
|
4
|
-1024
|
9.0
|
Y
|
6
|
-1024
|
1.0
|
7.2 map映射元素转变
# map 只能针对一列,就是Series
# 有一些没有对应,那么返回就是空数据
df['人工智能'].map({1024:3.14,2048:2.718,6:1108}) # 跟据字典对数据进行改变
X 3.140
B NaN
C 3.140
D 2.718
E NaN
F NaN
H 1108.000
I 3.140
J NaN
Y 1108.000
Name: 人工智能, dtype: float64
df['Keras'].map(lambda x :True if x > 0 else False) # 如果大于 0 返回True,不然返回False
X True
B True
C True
D True
E False
F True
H True
I True
J True
Y True
Name: Keras, dtype: bool
def convert(x):if x >= 1024:return Trueelse:return False
df['level'] = df['Tensorflow'].map(convert) # map映射,映射是Tensorflow中这一列中每一个数据,传递到方法中
df
|
人工智能
|
Tensorflow
|
Keras
|
level
|
X
|
1024
|
50
|
3.0
|
False
|
B
|
50
|
2048
|
2048.0
|
True
|
C
|
1024
|
2048
|
4.0
|
True
|
D
|
2048
|
4
|
1024.0
|
False
|
E
|
8
|
6
|
-100.0
|
False
|
F
|
8
|
1024
|
6.0
|
True
|
H
|
6
|
1024
|
8.0
|
True
|
I
|
1024
|
6
|
9.0
|
False
|
J
|
4
|
1024
|
9.0
|
True
|
Y
|
6
|
1024
|
1.0
|
True
|
7.3 apply映射元素转变
# 既可以操作Series又可以操作DataFrame
df['人工智能'].apply(lambda x : x + 100)
X 1124
B 150
C 1124
D 2148
E 108
F 108
H 106
I 1124
J 104
Y 106
Name: 人工智能, dtype: int64
df['level'].apply(lambda x:1 if x else 0)
X 0
B 1
C 1
D 0
E 0
F 1
H 1
I 0
J 1
Y 1
Name: level, dtype: int64
df.apply(lambda x : x + 1000) # apply对 所有的数据进行映射
|
人工智能
|
Tensorflow
|
Keras
|
level
|
X
|
2024
|
1050
|
1003.0
|
1000
|
B
|
1050
|
3048
|
3048.0
|
1001
|
C
|
2024
|
3048
|
1004.0
|
1001
|
D
|
3048
|
1004
|
2024.0
|
1000
|
E
|
1008
|
1006
|
900.0
|
1000
|
F
|
1008
|
2024
|
1006.0
|
1001
|
H
|
1006
|
2024
|
1008.0
|
1001
|
I
|
2024
|
1006
|
1009.0
|
1000
|
J
|
1004
|
2024
|
1009.0
|
1001
|
Y
|
1006
|
2024
|
1001.0
|
1001
|
def convert(x):return (x.median(),x.count(),x.min(),x.max(),x.std()) # 返回中位数,返回的是计数
df.apply(convert).round(1) # 默认操作列数据
|
人工智能
|
Tensorflow
|
Keras
|
level
|
0
|
29.0
|
1024.0
|
7.0
|
1
|
1
|
10.0
|
10.0
|
10.0
|
10
|
2
|
4.0
|
4.0
|
-100.0
|
False
|
3
|
2048.0
|
2048.0
|
2048.0
|
True
|
4
|
717.8
|
800.4
|
694.9
|
0.516398
|
df
|
人工智能
|
Tensorflow
|
Keras
|
level
|
X
|
1024
|
50
|
3.0
|
False
|
B
|
50
|
2048
|
2048.0
|
True
|
C
|
1024
|
2048
|
4.0
|
True
|
D
|
2048
|
4
|
1024.0
|
False
|
E
|
8
|
6
|
-100.0
|
False
|
F
|
8
|
1024
|
6.0
|
True
|
H
|
6
|
1024
|
8.0
|
True
|
I
|
1024
|
6
|
9.0
|
False
|
J
|
4
|
1024
|
9.0
|
True
|
Y
|
6
|
1024
|
1.0
|
True
|
df.apply(convert,axis = 1) # axis = 1,操作数据就是行数据
X (26.5, 4, False, 1024, 503.68732033541073)
B (1049.0, 4, True, 2048, 1167.8622564326668)
C (514.0, 4, True, 2048, 979.1007353689405)
D (514.0, 4, False, 2048, 979.3623776042588)
E (3.0, 4, -100.0, 8, 52.443620520834884)
F (7.0, 4, True, 1024, 509.5085049993441)
H (7.0, 4, True, 1024, 509.5085049993441)
I (7.5, 4, False, 1024, 509.51373877453)
J (6.5, 4, True, 1024, 509.6773489179208)
Y (3.5, 4, 1.0, 1024, 510.6721061503164)
dtype: object
7.4 transform元素转变
df = pd.DataFrame(np.random.randint(0,10,size = (10,3)),columns=['Python','Tensorflow','Keras'],index = list('ABCDEFHIJK'))
display(df)
# 可以针对一列数据,Series进行运算
df['Python'].transform(lambda x : 1024 if x > 5 else -1024) # 这个功能和map,apply类似的
|
Python
|
Tensorflow
|
Keras
|
A
|
1
|
1
|
9
|
B
|
6
|
9
|
1
|
C
|
1
|
4
|
6
|
D
|
9
|
5
|
1
|
E
|
4
|
1
|
8
|
F
|
2
|
5
|
7
|
H
|
7
|
2
|
3
|
I
|
7
|
8
|
9
|
J
|
5
|
4
|
2
|
K
|
7
|
0
|
7
|
A -1024
B 1024
C -1024
D 1024
E -1024
F -1024
H 1024
I 1024
J -1024
K 1024
Name: Python, dtype: int64
df['Tensorflow'].apply([np.sqrt,np.square,np.cumsum]) # 针对一列,进行不同的操作
|
sqrt
|
square
|
cumsum
|
A
|
1.000000
|
1
|
1
|
B
|
3.000000
|
81
|
10
|
C
|
2.000000
|
16
|
14
|
D
|
2.236068
|
25
|
19
|
E
|
1.000000
|
1
|
20
|
F
|
2.236068
|
25
|
25
|
H
|
1.414214
|
4
|
27
|
I
|
2.828427
|
64
|
35
|
J
|
2.000000
|
16
|
39
|
K
|
0.000000
|
0
|
39
|
df['Tensorflow'].transform([np.sqrt,np.square,np.cumsum]) # 针对一列,进行不同的操作
|
sqrt
|
square
|
cumsum
|
A
|
1.000000
|
1
|
1
|
B
|
3.000000
|
81
|
10
|
C
|
2.000000
|
16
|
14
|
D
|
2.236068
|
25
|
19
|
E
|
1.000000
|
1
|
20
|
F
|
2.236068
|
25
|
25
|
H
|
1.414214
|
4
|
27
|
I
|
2.828427
|
64
|
35
|
J
|
2.000000
|
16
|
39
|
K
|
0.000000
|
0
|
39
|
def convert(x):if x > 5:return Trueelse:return False
# 可以针对DataFrame进行运算
df.transform({'Python':np.cumsum,'Tensorflow':np.square,'Keras':convert}) # 对不同的列,执行不同的操作
|
Python
|
Tensorflow
|
Keras
|
A
|
1
|
1
|
True
|
B
|
7
|
81
|
False
|
C
|
8
|
16
|
True
|
D
|
17
|
25
|
False
|
E
|
21
|
1
|
True
|
F
|
23
|
25
|
True
|
H
|
30
|
4
|
False
|
I
|
37
|
64
|
True
|
J
|
42
|
16
|
False
|
K
|
49
|
0
|
True
|
df.apply({'Python':np.cumsum,'Tensorflow':np.square,'Keras':convert}) # 对不同的列,执行不同的操作
|
Python
|
Tensorflow
|
Keras
|
A
|
1
|
1
|
True
|
B
|
7
|
81
|
False
|
C
|
8
|
16
|
True
|
D
|
17
|
25
|
False
|
E
|
21
|
1
|
True
|
F
|
23
|
25
|
True
|
H
|
30
|
4
|
False
|
I
|
37
|
64
|
True
|
J
|
42
|
16
|
False
|
K
|
49
|
0
|
True
|
7.5 重排随机抽样哑变量
df
|
Python
|
Tensorflow
|
Keras
|
A
|
1
|
1
|
9
|
B
|
6
|
9
|
1
|
C
|
1
|
4
|
6
|
D
|
9
|
5
|
1
|
E
|
4
|
1
|
8
|
F
|
2
|
5
|
7
|
H
|
7
|
2
|
3
|
I
|
7
|
8
|
9
|
J
|
5
|
4
|
2
|
K
|
7
|
0
|
7
|
index = np.random.permutation(10) # 返回打乱顺讯的索引
index
array([3, 4, 1, 2, 7, 9, 0, 8, 5, 6])
# 重排,索引打乱
df.take(index)
|
Python
|
Tensorflow
|
Keras
|
D
|
9
|
5
|
1
|
E
|
4
|
1
|
8
|
B
|
6
|
9
|
1
|
C
|
1
|
4
|
6
|
I
|
7
|
8
|
9
|
K
|
7
|
0
|
7
|
A
|
1
|
1
|
9
|
J
|
5
|
4
|
2
|
F
|
2
|
5
|
7
|
H
|
7
|
2
|
3
|
# 从大量数据中随机抽取数据
df.take(np.random.randint(0,10,size = 20)) # 随机抽样20个数据
|
Python
|
Tensorflow
|
Keras
|
J
|
5
|
4
|
2
|
J
|
5
|
4
|
2
|
D
|
9
|
5
|
1
|
K
|
7
|
0
|
7
|
H
|
7
|
2
|
3
|
I
|
7
|
8
|
9
|
J
|
5
|
4
|
2
|
A
|
1
|
1
|
9
|
C
|
1
|
4
|
6
|
J
|
5
|
4
|
2
|
I
|
7
|
8
|
9
|
D
|
9
|
5
|
1
|
I
|
7
|
8
|
9
|
K
|
7
|
0
|
7
|
A
|
1
|
1
|
9
|
B
|
6
|
9
|
1
|
H
|
7
|
2
|
3
|
D
|
9
|
5
|
1
|
B
|
6
|
9
|
1
|
H
|
7
|
2
|
3
|
df2 = pd.DataFrame(data = {'key':['a','b','a','b','c','b','c']})
df2
|
key
|
0
|
a
|
1
|
b
|
2
|
a
|
3
|
b
|
4
|
c
|
5
|
b
|
6
|
c
|
# one-hot,哑变量
# str类型数据,经过哑变量变换可以使用数字表示
pd.get_dummies(df2,prefix='',prefix_sep='') # 1表示,有;0表示,没有
|
a
|
b
|
c
|
0
|
1
|
0
|
0
|
1
|
0
|
1
|
0
|
2
|
1
|
0
|
0
|
3
|
0
|
1
|
0
|
4
|
0
|
0
|
1
|
5
|
0
|
1
|
0
|
6
|
0
|
0
|
1
|
第八部分:数据重塑
df
|
Python
|
Tensorflow
|
Keras
|
A
|
1
|
1
|
9
|
B
|
6
|
9
|
1
|
C
|
1
|
4
|
6
|
D
|
9
|
5
|
1
|
E
|
4
|
1
|
8
|
F
|
2
|
5
|
7
|
H
|
7
|
2
|
3
|
I
|
7
|
8
|
9
|
J
|
5
|
4
|
2
|
K
|
7
|
0
|
7
|
df.T # 转置,行变列,列变行
|
A
|
B
|
C
|
D
|
E
|
F
|
H
|
I
|
J
|
K
|
Python
|
1
|
6
|
1
|
9
|
4
|
2
|
7
|
7
|
5
|
7
|
Tensorflow
|
1
|
9
|
4
|
5
|
1
|
5
|
2
|
8
|
4
|
0
|
Keras
|
9
|
1
|
6
|
1
|
8
|
7
|
3
|
9
|
2
|
7
|
df2 = pd.DataFrame(np.random.randint(0,10,size = (20,3)),columns=['Python','Math','En'],index = pd.MultiIndex.from_product([list('ABCDEFHIJK'),['期中','期末']])) # 多层索引df2
|
|
Python
|
Math
|
En
|
A
|
期中
|
3
|
3
|
0
|
期末
|
6
|
5
|
8
|
B
|
期中
|
5
|
5
|
9
|
期末
|
7
|
5
|
2
|
C
|
期中
|
0
|
7
|
9
|
期末
|
9
|
7
|
5
|
D
|
期中
|
5
|
6
|
5
|
期末
|
7
|
9
|
6
|
E
|
期中
|
7
|
3
|
9
|
期末
|
9
|
1
|
4
|
F
|
期中
|
9
|
9
|
5
|
期末
|
0
|
8
|
9
|
H
|
期中
|
7
|
0
|
0
|
期末
|
1
|
6
|
6
|
I
|
期中
|
8
|
1
|
8
|
期末
|
7
|
9
|
9
|
J
|
期中
|
5
|
0
|
8
|
期末
|
3
|
6
|
6
|
K
|
期中
|
8
|
2
|
2
|
期末
|
3
|
5
|
2
|
df2.unstack(level = 1) # 将行索引变成列索引,-1表示最后一层
|
Python
|
Math
|
En
|
|
期中
|
期末
|
期中
|
期末
|
期中
|
期末
|
A
|
3
|
6
|
3
|
5
|
0
|
8
|
B
|
5
|
7
|
5
|
5
|
9
|
2
|
C
|
0
|
9
|
7
|
7
|
9
|
5
|
D
|
5
|
7
|
6
|
9
|
5
|
6
|
E
|
7
|
9
|
3
|
1
|
9
|
4
|
F
|
9
|
0
|
9
|
8
|
5
|
9
|
H
|
7
|
1
|
0
|
6
|
0
|
6
|
I
|
8
|
7
|
1
|
9
|
8
|
9
|
J
|
5
|
3
|
0
|
6
|
8
|
6
|
K
|
8
|
3
|
2
|
5
|
2
|
2
|
df2.unstack(level = -1) # 将行索引变成列索引,-1表示最后一层
|
Python
|
Math
|
En
|
|
期中
|
期末
|
期中
|
期末
|
期中
|
期末
|
A
|
3
|
6
|
3
|
5
|
0
|
8
|
B
|
5
|
7
|
5
|
5
|
9
|
2
|
C
|
0
|
9
|
7
|
7
|
9
|
5
|
D
|
5
|
7
|
6
|
9
|
5
|
6
|
E
|
7
|
9
|
3
|
1
|
9
|
4
|
F
|
9
|
0
|
9
|
8
|
5
|
9
|
H
|
7
|
1
|
0
|
6
|
0
|
6
|
I
|
8
|
7
|
1
|
9
|
8
|
9
|
J
|
5
|
3
|
0
|
6
|
8
|
6
|
K
|
8
|
3
|
2
|
5
|
2
|
2
|
df2.stack() # 列变成行了
A 期中 Python 3Math 3En 0期末 Python 6Math 5En 8
B 期中 Python 5Math 5En 9期末 Python 7Math 5En 2
C 期中 Python 0Math 7En 9期末 Python 9Math 7En 5
D 期中 Python 5Math 6En 5期末 Python 7Math 9En 6
E 期中 Python 7Math 3En 9期末 Python 9Math 1En 4
F 期中 Python 9Math 9En 5期末 Python 0Math 8En 9
H 期中 Python 7Math 0En 0期末 Python 1Math 6En 6
I 期中 Python 8Math 1En 8期末 Python 7Math 9En 9
J 期中 Python 5Math 0En 8期末 Python 3Math 6En 6
K 期中 Python 8Math 2En 2期末 Python 3Math 5En 2
dtype: int64
df2.unstack().stack(level = 0)
|
|
期中
|
期末
|
A
|
En
|
0
|
8
|
Math
|
3
|
5
|
Python
|
3
|
6
|
B
|
En
|
9
|
2
|
Math
|
5
|
5
|
Python
|
5
|
7
|
C
|
En
|
9
|
5
|
Math
|
7
|
7
|
Python
|
0
|
9
|
D
|
En
|
5
|
6
|
Math
|
6
|
9
|
Python
|
5
|
7
|
E
|
En
|
9
|
4
|
Math
|
3
|
1
|
Python
|
7
|
9
|
F
|
En
|
5
|
9
|
Math
|
9
|
8
|
Python
|
9
|
0
|
H
|
En
|
0
|
6
|
Math
|
0
|
6
|
Python
|
7
|
1
|
I
|
En
|
8
|
9
|
Math
|
1
|
9
|
Python
|
8
|
7
|
J
|
En
|
8
|
6
|
Math
|
0
|
6
|
Python
|
5
|
3
|
K
|
En
|
2
|
2
|
Math
|
2
|
5
|
Python
|
8
|
3
|
df2.mean() # 计算的是 列
Python 5.45
Math 4.85
En 5.60
dtype: float64
df2.mean(axis = 1)
A 期中 2.000000期末 6.333333
B 期中 6.333333期末 4.666667
C 期中 5.333333期末 7.000000
D 期中 5.333333期末 7.333333
E 期中 6.333333期末 4.666667
F 期中 7.666667期末 5.666667
H 期中 2.333333期末 4.333333
I 期中 5.666667期末 8.333333
J 期中 4.333333期末 5.000000
K 期中 4.000000期末 3.333333
dtype: float64
df2.mean(level=1) # 计算期中期末所有学生的平均分
|
Python
|
Math
|
En
|
期中
|
5.7
|
3.6
|
5.5
|
期末
|
5.2
|
6.1
|
5.7
|
df2.mean(level = 0) # 计算每位学生期中和期末平均分
|
Python
|
Math
|
En
|
A
|
4.5
|
4.0
|
4.0
|
B
|
6.0
|
5.0
|
5.5
|
C
|
4.5
|
7.0
|
7.0
|
D
|
6.0
|
7.5
|
5.5
|
E
|
8.0
|
2.0
|
6.5
|
F
|
4.5
|
8.5
|
7.0
|
H
|
4.0
|
3.0
|
3.0
|
I
|
7.5
|
5.0
|
8.5
|
J
|
4.0
|
3.0
|
7.0
|
K
|
5.5
|
3.5
|
2.0
|
第九部分:数学和统计方法
9.1 简单统计指标
df = pd.DataFrame(np.random.randint(0,10,size = (20,3)),columns=['Python','Math','En'],index = list('QWERTYUIOPASDFGHJKLZ'))
df
|
Python
|
Math
|
En
|
Q
|
1
|
4
|
3
|
W
|
8
|
2
|
0
|
E
|
3
|
0
|
9
|
R
|
9
|
8
|
9
|
T
|
9
|
1
|
3
|
Y
|
5
|
3
|
1
|
U
|
5
|
8
|
0
|
I
|
1
|
8
|
3
|
O
|
0
|
3
|
5
|
P
|
6
|
6
|
1
|
A
|
0
|
4
|
0
|
S
|
3
|
9
|
4
|
D
|
5
|
2
|
8
|
F
|
2
|
9
|
2
|
G
|
1
|
9
|
8
|
H
|
9
|
5
|
2
|
J
|
5
|
7
|
5
|
K
|
2
|
6
|
5
|
L
|
2
|
7
|
3
|
Z
|
2
|
9
|
8
|
df.iloc[6,2] = np.NAN
display(df)
|
Python
|
Math
|
En
|
Q
|
1
|
4
|
3.0
|
W
|
8
|
2
|
0.0
|
E
|
3
|
0
|
9.0
|
R
|
9
|
8
|
9.0
|
T
|
9
|
1
|
3.0
|
Y
|
5
|
3
|
1.0
|
U
|
5
|
8
|
NaN
|
I
|
1
|
8
|
3.0
|
O
|
0
|
3
|
5.0
|
P
|
6
|
6
|
1.0
|
A
|
0
|
4
|
0.0
|
S
|
3
|
9
|
4.0
|
D
|
5
|
2
|
8.0
|
F
|
2
|
9
|
2.0
|
G
|
1
|
9
|
8.0
|
H
|
9
|
5
|
2.0
|
J
|
5
|
7
|
5.0
|
K
|
2
|
6
|
5.0
|
L
|
2
|
7
|
3.0
|
Z
|
2
|
9
|
8.0
|
df.count() # 统计非空数据数量
Python 20
Math 20
En 19
dtype: int64
display(df.mean(),df.median()) # 平均值,中位数
Python 3.900000
Math 5.500000
En 4.157895
dtype: float64Python 3.0
Math 6.0
En 3.0
dtype: float64
display(df.min(),df.max()) # 最小值,最大值
Python 0.0
Math 0.0
En 0.0
dtype: float64Python 9.0
Math 9.0
En 9.0
dtype: float64
df['Python'].unique() # 去除重复数据
array([1, 8, 3, 9, 5, 0, 6, 2])
df['Math'].value_counts() # 统计出现的频次
9 4
8 3
7 2
6 2
4 2
3 2
2 2
5 1
1 1
0 1
Name: Math, dtype: int64
df.quantile(q = [0,0.25,0.5,0.75,1]) # 百分位数
|
Python
|
Math
|
En
|
0.00
|
0.00
|
0.0
|
0.0
|
0.25
|
1.75
|
3.0
|
2.0
|
0.50
|
3.00
|
6.0
|
3.0
|
0.75
|
5.25
|
8.0
|
6.5
|
1.00
|
9.00
|
9.0
|
9.0
|
df.describe().round(1)
|
Python
|
Math
|
En
|
count
|
20.0
|
20.0
|
19.0
|
mean
|
3.9
|
5.5
|
4.2
|
std
|
3.0
|
2.9
|
3.0
|
min
|
0.0
|
0.0
|
0.0
|
25%
|
1.8
|
3.0
|
2.0
|
50%
|
3.0
|
6.0
|
3.0
|
75%
|
5.2
|
8.0
|
6.5
|
max
|
9.0
|
9.0
|
9.0
|
9.2 索引标签、位置获取
df['Python'].argmax() # 返回最大值索引
3
df['En'].argmin() # 最小值索引
1
df.idxmax() # 返回最大值的标签
Python R
Math S
En E
dtype: object
df.idxmin() # 返回最小值标签
Python O
Math E
En W
dtype: object
9.3 更多统计指标
df.cumsum() # 累加和
|
Python
|
Math
|
En
|
Q
|
1
|
4
|
3.0
|
W
|
9
|
6
|
3.0
|
E
|
12
|
6
|
12.0
|
R
|
21
|
14
|
21.0
|
T
|
30
|
15
|
24.0
|
Y
|
35
|
18
|
25.0
|
U
|
40
|
26
|
NaN
|
I
|
41
|
34
|
28.0
|
O
|
41
|
37
|
33.0
|
P
|
47
|
43
|
34.0
|
A
|
47
|
47
|
34.0
|
S
|
50
|
56
|
38.0
|
D
|
55
|
58
|
46.0
|
F
|
57
|
67
|
48.0
|
G
|
58
|
76
|
56.0
|
H
|
67
|
81
|
58.0
|
J
|
72
|
88
|
63.0
|
K
|
74
|
94
|
68.0
|
L
|
76
|
101
|
71.0
|
Z
|
78
|
110
|
79.0
|
df.cumprod() # 累乘和
|
Python
|
Math
|
En
|
Q
|
1
|
4
|
3.0
|
W
|
8
|
8
|
0.0
|
E
|
24
|
0
|
0.0
|
R
|
216
|
0
|
0.0
|
T
|
1944
|
0
|
0.0
|
Y
|
9720
|
0
|
0.0
|
U
|
48600
|
0
|
NaN
|
I
|
48600
|
0
|
0.0
|
O
|
0
|
0
|
0.0
|
P
|
0
|
0
|
0.0
|
A
|
0
|
0
|
0.0
|
S
|
0
|
0
|
0.0
|
D
|
0
|
0
|
0.0
|
F
|
0
|
0
|
0.0
|
G
|
0
|
0
|
0.0
|
H
|
0
|
0
|
0.0
|
J
|
0
|
0
|
0.0
|
K
|
0
|
0
|
0.0
|
L
|
0
|
0
|
0.0
|
Z
|
0
|
0
|
0.0
|
df.cummin() # 累计最小值
|
Python
|
Math
|
En
|
Q
|
1
|
4
|
3.0
|
W
|
1
|
2
|
0.0
|
E
|
1
|
0
|
0.0
|
R
|
1
|
0
|
0.0
|
T
|
1
|
0
|
0.0
|
Y
|
1
|
0
|
0.0
|
U
|
1
|
0
|
NaN
|
I
|
1
|
0
|
0.0
|
O
|
0
|
0
|
0.0
|
P
|
0
|
0
|
0.0
|
A
|
0
|
0
|
0.0
|
S
|
0
|
0
|
0.0
|
D
|
0
|
0
|
0.0
|
F
|
0
|
0
|
0.0
|
G
|
0
|
0
|
0.0
|
H
|
0
|
0
|
0.0
|
J
|
0
|
0
|
0.0
|
K
|
0
|
0
|
0.0
|
L
|
0
|
0
|
0.0
|
Z
|
0
|
0
|
0.0
|
df.cummax() # 累计最大值
|
Python
|
Math
|
En
|
Q
|
1
|
4
|
3.0
|
W
|
8
|
4
|
3.0
|
E
|
8
|
4
|
9.0
|
R
|
9
|
8
|
9.0
|
T
|
9
|
8
|
9.0
|
Y
|
9
|
8
|
9.0
|
U
|
9
|
8
|
NaN
|
I
|
9
|
8
|
9.0
|
O
|
9
|
8
|
9.0
|
P
|
9
|
8
|
9.0
|
A
|
9
|
8
|
9.0
|
S
|
9
|
9
|
9.0
|
D
|
9
|
9
|
9.0
|
F
|
9
|
9
|
9.0
|
G
|
9
|
9
|
9.0
|
H
|
9
|
9
|
9.0
|
J
|
9
|
9
|
9.0
|
K
|
9
|
9
|
9.0
|
L
|
9
|
9
|
9.0
|
Z
|
9
|
9
|
9.0
|
df.std() # 标准差
Python 3.041814
Math 2.946898
En 3.004869
dtype: float64
df.var()
Python 9.252632
Math 8.684211
En 9.029240
dtype: float64
df.diff() # 差分,当前数据减去上一个的差值
|
Python
|
Math
|
En
|
Q
|
NaN
|
NaN
|
NaN
|
W
|
7.0
|
-2.0
|
-3.0
|
E
|
-5.0
|
-2.0
|
9.0
|
R
|
6.0
|
8.0
|
0.0
|
T
|
0.0
|
-7.0
|
-6.0
|
Y
|
-4.0
|
2.0
|
-2.0
|
U
|
0.0
|
5.0
|
NaN
|
I
|
-4.0
|
0.0
|
NaN
|
O
|
-1.0
|
-5.0
|
2.0
|
P
|
6.0
|
3.0
|
-4.0
|
A
|
-6.0
|
-2.0
|
-1.0
|
S
|
3.0
|
5.0
|
4.0
|
D
|
2.0
|
-7.0
|
4.0
|
F
|
-3.0
|
7.0
|
-6.0
|
G
|
-1.0
|
0.0
|
6.0
|
H
|
8.0
|
-4.0
|
-6.0
|
J
|
-4.0
|
2.0
|
3.0
|
K
|
-3.0
|
-1.0
|
0.0
|
L
|
0.0
|
1.0
|
-2.0
|
Z
|
0.0
|
2.0
|
5.0
|
df.pct_change().round(3) # 计算百分比变化
|
Python
|
Math
|
En
|
Q
|
NaN
|
NaN
|
NaN
|
W
|
7.000
|
-0.500
|
-1.000
|
E
|
-0.625
|
-1.000
|
inf
|
R
|
2.000
|
inf
|
0.000
|
T
|
0.000
|
-0.875
|
-0.667
|
Y
|
-0.444
|
2.000
|
-0.667
|
U
|
0.000
|
1.667
|
0.000
|
I
|
-0.800
|
0.000
|
2.000
|
O
|
-1.000
|
-0.625
|
0.667
|
P
|
inf
|
1.000
|
-0.800
|
A
|
-1.000
|
-0.333
|
-1.000
|
S
|
inf
|
1.250
|
inf
|
D
|
0.667
|
-0.778
|
1.000
|
F
|
-0.600
|
3.500
|
-0.750
|
G
|
-0.500
|
0.000
|
3.000
|
H
|
8.000
|
-0.444
|
-0.750
|
J
|
-0.444
|
0.400
|
1.500
|
K
|
-0.600
|
-0.143
|
0.000
|
L
|
0.000
|
0.167
|
-0.400
|
Z
|
0.000
|
0.286
|
1.667
|
9.4 高级统计指标
df.cov() # 协方差:自己和别人计算
|
Python
|
Math
|
En
|
Python
|
9.252632
|
-2.157895
|
-0.695906
|
Math
|
-2.157895
|
8.684211
|
1.160819
|
En
|
-0.695906
|
1.160819
|
9.029240
|
df.var() # 方差: 自己和自己计算
Python 9.252632
Math 8.684211
En 9.029240
dtype: float64
df['Python'].cov(df['Math'])
-2.157894736842105
df.corr() # 相关性系数 -1 ~ 1
|
Python
|
Math
|
En
|
Python
|
1.000000
|
-0.240731
|
-0.074376
|
Math
|
-0.240731
|
1.000000
|
0.130217
|
En
|
-0.074376
|
0.130217
|
1.000000
|
df.corrwith(df['En']) # 一列的相关性系数
Python -0.074376
Math 0.130217
En 1.000000
dtype: float64
第十部分:排序
df = pd.DataFrame(np.random.randint(0,20,size = (20,3)),columns=['Python','Tensorflow','Keras'],index = list('QWERTYUIOPASDFGHJKLZ'))
df
|
Python
|
Tensorflow
|
Keras
|
Q
|
17
|
3
|
4
|
W
|
13
|
18
|
7
|
E
|
12
|
11
|
0
|
R
|
3
|
5
|
14
|
T
|
11
|
15
|
7
|
Y
|
5
|
15
|
4
|
U
|
18
|
2
|
7
|
I
|
7
|
3
|
6
|
O
|
1
|
18
|
5
|
P
|
12
|
6
|
0
|
A
|
4
|
18
|
4
|
S
|
15
|
5
|
8
|
D
|
8
|
11
|
14
|
F
|
3
|
2
|
17
|
G
|
4
|
17
|
8
|
H
|
12
|
1
|
4
|
J
|
1
|
2
|
6
|
K
|
17
|
9
|
16
|
L
|
11
|
14
|
4
|
Z
|
16
|
13
|
4
|
df.sort_index(axis = 0,ascending=False) # 降序
|
Python
|
Tensorflow
|
Keras
|
Z
|
16
|
13
|
4
|
Y
|
5
|
15
|
4
|
W
|
13
|
18
|
7
|
U
|
18
|
2
|
7
|
T
|
11
|
15
|
7
|
S
|
15
|
5
|
8
|
R
|
3
|
5
|
14
|
Q
|
17
|
3
|
4
|
P
|
12
|
6
|
0
|
O
|
1
|
18
|
5
|
L
|
11
|
14
|
4
|
K
|
17
|
9
|
16
|
J
|
1
|
2
|
6
|
I
|
7
|
3
|
6
|
H
|
12
|
1
|
4
|
G
|
4
|
17
|
8
|
F
|
3
|
2
|
17
|
E
|
12
|
11
|
0
|
D
|
8
|
11
|
14
|
A
|
4
|
18
|
4
|
df.sort_index(ascending=True) # 升序
|
Python
|
Tensorflow
|
Keras
|
A
|
4
|
18
|
4
|
D
|
8
|
11
|
14
|
E
|
12
|
11
|
0
|
F
|
3
|
2
|
17
|
G
|
4
|
17
|
8
|
H
|
12
|
1
|
4
|
I
|
7
|
3
|
6
|
J
|
1
|
2
|
6
|
K
|
17
|
9
|
16
|
L
|
11
|
14
|
4
|
O
|
1
|
18
|
5
|
P
|
12
|
6
|
0
|
Q
|
17
|
3
|
4
|
R
|
3
|
5
|
14
|
S
|
15
|
5
|
8
|
T
|
11
|
15
|
7
|
U
|
18
|
2
|
7
|
W
|
13
|
18
|
7
|
Y
|
5
|
15
|
4
|
Z
|
16
|
13
|
4
|
df.sort_values(by = 'Python',ascending=True) # 根据Python属性进行升序排列
|
Python
|
Tensorflow
|
Keras
|
J
|
1
|
2
|
6
|
O
|
1
|
18
|
5
|
R
|
3
|
5
|
14
|
F
|
3
|
2
|
17
|
A
|
4
|
18
|
4
|
G
|
4
|
17
|
8
|
Y
|
5
|
15
|
4
|
I
|
7
|
3
|
6
|
D
|
8
|
11
|
14
|
T
|
11
|
15
|
7
|
L
|
11
|
14
|
4
|
P
|
12
|
6
|
0
|
E
|
12
|
11
|
0
|
H
|
12
|
1
|
4
|
W
|
13
|
18
|
7
|
S
|
15
|
5
|
8
|
Z
|
16
|
13
|
4
|
K
|
17
|
9
|
16
|
Q
|
17
|
3
|
4
|
U
|
18
|
2
|
7
|
df.sort_values(by = ['Python','Tensorflow'],ascending=True)
# 先根据Python进行排序,如果相等在根据Tensorflow排序
|
Python
|
Tensorflow
|
Keras
|
J
|
1
|
2
|
6
|
O
|
1
|
18
|
5
|
F
|
3
|
2
|
17
|
R
|
3
|
5
|
14
|
G
|
4
|
17
|
8
|
A
|
4
|
18
|
4
|
Y
|
5
|
15
|
4
|
I
|
7
|
3
|
6
|
D
|
8
|
11
|
14
|
L
|
11
|
14
|
4
|
T
|
11
|
15
|
7
|
H
|
12
|
1
|
4
|
P
|
12
|
6
|
0
|
E
|
12
|
11
|
0
|
W
|
13
|
18
|
7
|
S
|
15
|
5
|
8
|
Z
|
16
|
13
|
4
|
Q
|
17
|
3
|
4
|
K
|
17
|
9
|
16
|
U
|
18
|
2
|
7
|
df.nlargest(n = 5,columns='Python') # 根据Python进行排序,获取最大的5个数值
|
Python
|
Tensorflow
|
Keras
|
U
|
18
|
2
|
7
|
Q
|
17
|
3
|
4
|
K
|
17
|
9
|
16
|
Z
|
16
|
13
|
4
|
S
|
15
|
5
|
8
|
df.nsmallest(5,columns='Keras') # 根据Keras进行排序,获取最小的5个
|
Python
|
Tensorflow
|
Keras
|
E
|
12
|
11
|
0
|
P
|
12
|
6
|
0
|
Q
|
17
|
3
|
4
|
Y
|
5
|
15
|
4
|
A
|
4
|
18
|
4
|
结束语
本篇博文的代码是在jupyter上运行的,不过具体在哪运行都没什么大的区别。
感谢收看,祝学业和工作进步! 需要本文资料的话,欢迎关注评论留下你的邮箱。
推荐关注的专栏
学习pandas全套代码【超详细】数据查看、输入输出、选取、集成、清洗、转换、重塑、数学和统计方法、排序相关推荐
- UNIX网络编程学习笔记(代码超详细解析)(持续更新)
1. 其他函数准备 1. TCP 回射服务器程序: str_echo 函数 #include "unp.h"void str_echo(int sockfd) {ssize_t n ...
- Hbase学习文档(超详细单机安装)
Hbase学习文档(超详细单机安装) 一.前言 1.1简述 本文分为五个部分:linux主机名的设置.jdk的安装.hadoop的安装.单机模式下hbase的安装.hbase的shell常用命令及ja ...
- [扩展阅读] EasyGUI 学习文档【超详细中文版】
[扩展阅读] EasyGUI 学习文档[超详细中文版] 0. 安装 EasyGUI 官网:https://github.com/robertlugg/easygui python查看内置的所有模块 h ...
- bat脚本常用命令及亲测示例代码超详细讲解
这篇文章主要介绍了bat脚本常用命令及亲测示例代码超详细讲解,在这里需要注意编辑bat文件请使用ANSI编码,不然容易出现中文乱码,需要的朋友可以参考下 目录一 1.语句注释 2.暂停 3.输出和换行 ...
- 一、css清除浮动方法学习笔记总结(超详细,简单易懂)
** css清除浮动方法学习笔记总结(超详细,简单易懂) ** 问题: 上图中,由于container(父级元素)未设置高度,其内部子元素设置了float浮动,导致与container同级(也就是co ...
- WebRTC VideoEngine超详细教程(三)——集成X264编码和ffmpeg解码
转自:http://blog.csdn.net/nonmarking/article/details/47958395 本系列目前共三篇文章,后续还会更新 WebRTC VideoEngine超详细教 ...
- openCV4.0 C++ 快速入门30讲学习笔记(自用 代码+注释)详细版
课程来源:哔哩哔哩 环境:OpenCV4.5.1 + VS2019 目录 002.图像色彩空间转换 003.图像对象的创建与赋值 004.图像像素的读写操作 005.图像像素的算术操作(加减乘除4种不 ...
- python的 numpy库学习总结和介绍(超详细)模块
目录 前言 numpy是什么? 一.创建数据容器 1.np.array(),通过传递一个列表,元祖来创建 2.np.zeros(),np.ones(),np.empty(),np.full(),传递对 ...
- BF算法代码(超详细)
在此之前需要用到的知识: 头文件<string.h>中包含strlen()函数计算字符串的长度,遇到结束性标 志'\0'时停止: 例如,字符串为"lx adore zlr&quo ...
最新文章
- c# Bitmap byte[] Stream 文件相互转换
- 早停 tf.keras.callbacks.EarlyStopping() 详解【TensorFlow2入门手册】
- fuse的API修改
- 从函数中返回多个值的方法
- pytorch 训练过程acc_Pytorch之Softmax多分类任务
- app.vue里使用data_在电脑使用讯飞有声,通过python自动化朗读
- cursor -- 定义鼠标样式
- 闭包 python_根据面试题谈谈 Python 闭包
- HDU 4054 Number String
- ibm服务器做系统视频,智慧运算 IBM x3650 M4服务器拆机(视频)
- php 数组中连续的数字,php数组中最近的次要数字
- android实现翻书效果(文本和图片的翻书)
- 令人厌恶的错误MSB3721,以及win10,VS2019,YOLO V4 环境搭建
- Android 时间格式转换
- PDF加页码怎么设置?这里有你想知道的答案
- Tomcat的中文乱码设置
- O2O优惠券使用预测
- @OneToMany---ManyToOne
- sv中program和module区别
- is are am 三者的区别
热门文章
- python正则表达式查找(findall)
- Numpy和spicy函数复习1
- UNIZEN每周洞察第四期:Unizen生态迎来丰富进展
- 尽己力,听天命。无愧于心,不惑于情
- 多自由度有阻尼matlab,多自由度阻尼系统固有振型的MATLAB求解程序
- 20170923 HHC
- 百度联盟广告代码php,Javascript 实现广告后加载 可加载百度谷歌联盟广告[原创]_javascript技巧...
- 2022最新微信表白墙小程序源码+美观好看
- 如何取消系统“嘟”的蜂鸣声(报警声)
- 一站式、整套智能家居解决方案——HomeKit?绿米?华为还是智汀?