19. Data Wrangling (Part 1)
Author: Chris Albon
Translator: 飞龙
License: CC BY-NC-SA 4.0
Applying Functions by Group in Pandas
import pandas as pd

# Create an example DataFrame
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
        'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
df
| | Casualties | Platoon |
|---|---|---|
| 0 | 1 | A |
| 1 | 4 | A |
| 2 | 5 | A |
| 3 | 7 | A |
| 4 | 5 | A |
| 5 | 5 | A |
| 6 | 6 | B |
| 7 | 1 | B |
| 8 | 4 | B |
| 9 | 5 | B |
| 10 | 6 | B |
| 11 | 7 | C |
| 12 | 4 | C |
| 13 | 6 | C |
| 14 | 4 | C |
| 15 | 6 | C |
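# Group df by df.Platoon,
# then apply a rolling-mean lambda function to df.Casualties
# (group_keys=False keeps the original flat index in the result on newer pandas versions)
df.groupby('Platoon', group_keys=False)['Casualties'].apply(lambda x: x.rolling(center=False, window=2).mean())
'''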
0 NaN
1 2.5
2 4.5
3 6.0
4 6.0
5 5.0
6 NaN
7 3.5
8 2.5
9 4.5
10 5.5
11 NaN
12 5.5
13 5.0
14 5.0
15 5.0
dtype: float64
'''
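The same per-group rolling mean can also be sketched with transform(), which returns a result aligned to the original index so it can be stored as a column. This is an alternative to the apply() call above, not the author's method, and the column name rolling_mean is an assumption made for illustration.

```python
import pandas as pd

# Same example data as in the section above
df = pd.DataFrame({
    'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
    'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6],
})

# transform() applies the function to each group and returns a Series
# with the same index as df, so the first row of every platoon is NaN
# (hypothetical column name: rolling_mean)
df['rolling_mean'] = (
    df.groupby('Platoon')['Casualties']
      .transform(lambda s: s.rolling(window=2).mean())
)

print(df.head(8))
```

Because the window restarts inside each group, row 6 (the first row of platoon B) is NaN rather than the mean of rows 5 and 6.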
Applying Operations to Groups in Pandas
# Import modules
import pandas as pd

# Create a DataFrame
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
| | regiment | company | name | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
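# Create a groupby variable that groups preTestScore by regiment
groupby_regiment = df['preTestScore'].groupby(df['regiment'])
groupby_regiment
# <pandas.core.groupby.SeriesGroupBy object at 0x113ddb550>
"This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups." – PyDA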
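To make the point about the GroupBy object concrete, here is a minimal sketch that inspects one before any aggregation is requested. It uses a shortened version of the example data (an assumption made to keep it small), and the variable names grouped, n_groups, group_labels and scouts are illustrative.

```python
import pandas as pd

# Shortened example data (assumption: two rows per regiment)
df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts'],
    'preTestScore': [4, 24, 3, 4, 2, 3],
})

# Creating the GroupBy object only records how rows map to groups
grouped = df['preTestScore'].groupby(df['regiment'])

# Number of distinct groups
n_groups = grouped.ngroups

# Mapping from each group key to the row labels that belong to it
group_labels = {key: list(index) for key, index in grouped.groups.items()}

# Pull out the rows of a single group
scouts = grouped.get_group('Scouts')

# An aggregation such as mean() is only computed when it is called
means = grouped.mean()

print(n_groups, group_labels, means, sep='\n')
```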
Use list() to see what the grouping looks like.
list(df['preTestScore'].groupby(df['regiment']))
'''
[('Dragoons', 4     3
5     4
6    24
7    31
Name: preTestScore, dtype: int64), ('Nighthawks', 0     4
1    24
2    31
3     2
Name: preTestScore, dtype: int64), ('Scouts', 8     2
9     3
10    2
11    3
Name: preTestScore, dtype: int64)]
'''
df['preTestScore'].groupby(df['regiment']).describe()
| regiment | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Dragoons | 4.0 | 15.50 | 14.153916 | 3.0 | 3.75 | 14.0 | 25.75 | 31.0 |
| Nighthawks | 4.0 | 15.25 | 14.453950 | 2.0 | 3.50 | 14.0 | 25.75 | 31.0 |
| Scouts | 4.0 | 2.50 | 0.577350 | 2.0 | 2.00 | 2.5 | 3.00 | 3.0 |
# Mean preTestScore for each regiment
groupby_regiment.mean()
'''
regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64
'''
df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
'''
regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64
'''
df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
| regiment / company | 1st | 2nd |
|---|---|---|
| Dragoons | 3.5 | 27.5 |
| Nighthawks | 14.0 | 16.5 |
| Scouts | 2.5 | 2.5 |
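The groupby, mean and unstack chain above produces a regiment-by-company grid. As a hedged sketch, the same grid can be built with pivot_table(). The names by_unstack and by_pivot are illustrative, and this is an alternative route rather than the approach used in the original text.

```python
import pandas as pd

# The three columns of the example data that this comparison needs
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd'] * 3,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Route 1: group by both keys, average, then move 'company' into the columns
by_unstack = df.groupby(['regiment', 'company'])['preTestScore'].mean().unstack()

# Route 2: describe the grid directly (rows, columns, values, aggregation)
by_pivot = df.pivot_table(values='preTestScore', index='regiment',
                          columns='company', aggfunc='mean')

print(by_pivot)
```

Both routes give one row per regiment and one column per company. pivot_table() states the row, column and value roles in a single call, which some readers find easier to scan.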
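# Group the entire DataFrame by regiment and company
# (numeric_only=True skips the string column 'name', which newer pandas versions may otherwise raise on)
df.groupby(['regiment', 'company']).mean(numeric_only=True)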
| regiment | company | preTestScore | postTestScore |
|---|---|---|---|
| Dragoons | 1st | 3.5 | 47.5 |
| Dragoons | 2nd | 27.5 | 75.5 |
| Nighthawks | 1st | 14.0 | 59.5 |
| Nighthawks | 2nd | 16.5 | 59.5 |
| Scouts | 1st | 2.5 | 66.0 |
| Scouts | 2nd | 2.5 | 66.0 |
# Number of observations for each regiment and company
df.groupby(['regiment', 'company']).size()
'''
regiment    company
Dragoons    1st        2
            2nd        2
Nighthawks  1st        2
            2nd        2
Scouts      1st        2
            2nd        2
dtype: int64
'''
# Group the DataFrame by regiment, and for each regiment,
for name, group in df.groupby('regiment'):
    # print the regiment's name
    print(name)
    # print its data
    print(group)
'''
Dragoons
   regiment company    name  preTestScore  postTestScore
4  Dragoons     1st   Cooze             3             70
5  Dragoons     1st   Jacon             4             25
6  Dragoons     2nd  Ryaner            24             94
7  Dragoons     2nd    Sone            31             57
Nighthawks
     regiment company      name  preTestScore  postTestScore
0  Nighthawks     1st    Miller             4             25
1  Nighthawks     1st  Jacobson            24             94
2  Nighthawks     2nd       Ali            31             57
3  Nighthawks     2nd    Milner             2             62
Scouts
   regiment company   name  preTestScore  postTestScore
8    Scouts     1st  Sloan             2             62
9    Scouts     1st  Piger             3             70
10   Scouts     2nd  Riani             2             62
11   Scouts     2nd    Ali             3             70
'''
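Grouping by columns:
In this particular case, group the columns by data type (i.e. axis=1), then use list() to see what the grouping looks like.
# Note: the axis=1 argument to groupby is deprecated in recent pandas releases
list(df.groupby(df.dtypes, axis=1))
'''
[(dtype('int64'),     preTestScore  postTestScore
0              4             25
1             24             94
2             31             57
3              2             62
4              3             70
5              4             25
6             24             94
7             31             57
8              2             62
9              3             70
10             2             62
11             3             70), (dtype('O'),       regiment company      name
0   Nighthawks     1st    Miller
1   Nighthawks     1st  Jacobson
2   Nighthawks     2nd       Ali
3   Nighthawks     2nd    Milner
4     Dragoons     1st     Cooze
5     Dragoons     1st     Jacon
6     Dragoons     2nd    Ryaner
7     Dragoons     2nd      Sone
8       Scouts     1st     Sloan
9       Scouts     1st     Piger
10      Scouts     2nd     Riani
11      Scouts     2nd       Ali)]
'''
# Mean of each numeric column per regiment, with 'mean_' prepended to the column names
# (numeric_only=True skips the string columns, which newer pandas versions may otherwise raise on)
df.groupby('regiment').mean(numeric_only=True).add_prefix('mean_')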
| regiment | mean_preTestScore | mean_postTestScore |
|---|---|---|
| Dragoons | 15.50 | 61.5 |
| Nighthawks | 15.25 | 59.5 |
| Scouts | 2.50 | 66.0 |
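# Create a function that returns summary statistics for a group
def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

# Bin postTestScore into labeled categories
bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)

# Apply get_stats to each category, then unstack the statistics into columns
df['postTestScore'].groupby(df['categories']).apply(get_stats).unstack()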
| categories | count | max | mean | min |
|---|---|---|---|---|
| Good | 8.0 | 70.0 | 63.75 | 57.0 |
| Great | 2.0 | 94.0 | 94.00 | 94.0 |
| Low | 2.0 | 25.0 | 25.00 | 25.0 |
| Okay | 0.0 | NaN | NaN | NaN |
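The binned statistics above can also be sketched with agg() and a list of built-in aggregation names instead of a custom function. This is an alternative shown for comparison: the names scores, categories and stats are illustrative, and passing observed=False is an explicit choice made here so that the empty 'Okay' bin stays in the result.

```python
import pandas as pd

# postTestScore values from the example data
scores = pd.Series([25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
                   name='postTestScore')

# Same bins and labels as in the example above
bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
categories = pd.cut(scores, bins, labels=group_names)

# observed=False keeps categories that have no rows (here: 'Okay')
stats = scores.groupby(categories, observed=False).agg(['min', 'max', 'count', 'mean'])

print(stats)
```

Unlike the unstacked table above, the rows here follow the category order (Low, Okay, Good, Great) rather than alphabetical order.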
Applying Operations over a Pandas DataFrame
# Import modules
import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
| | coverage | name | reports | year |
|---|---|---|---|---|
| Cochice | 25 | Jason | 4 | 2012 |
| Pima | 94 | Molly | 24 | 2012 |
| Santa Cruz | 57 | Tina | 31 | 2013 |
| Maricopa | 62 | Jake | 2 | 2014 |
| Yuma | 70 | Amy | 3 | 2014 |
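# Create a lambda function that converts a string to uppercase
capitalizer = lambda x: x.upper()
Apply the capitalizer function to the name column.
apply() can apply a function along any axis of a DataFrame.
df['name'].apply(capitalizer)
'''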
Cochice JASON
Pima MOLLY
Santa Cruz TINA
Maricopa JAKE
Yuma AMY
Name: name, dtype: object
'''
Map the capitalizer lambda function over each element in the Series name.
map() applies an operation to each element of a Series.
df['name'].map(capitalizer)
'''
Cochice JASON
Pima MOLLY
Santa Cruz TINA
Maricopa JAKE
Yuma AMY
Name: name, dtype: object
'''
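On a single Series, apply() and map() give the same result for an element-wise function like this one. The sketch below compares them with the built-in string accessor .str.upper(), which is a third option not used in the original text. The variable names are illustrative.

```python
import pandas as pd

# The 'name' column from the example data
names = pd.Series(['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                  index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
                  name='name')

capitalizer = lambda x: x.upper()

# Element-wise Python function, applied two ways
via_apply = names.apply(capitalizer)
via_map = names.map(capitalizer)

# Built-in vectorized string method, no Python function needed
via_str = names.str.upper()

print(via_str)
```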
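Apply a square-root function to every cell in the whole DataFrame.
applymap() applies a function to every element of the whole DataFrame.
# Drop the string variable so that applymap() can run
df = df.drop('name', axis=1)

# Return the square root of every cell in the DataFrame
df.applymap(np.sqrt)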
| | coverage | reports | year |
|---|---|---|---|
| Cochice | 5.000000 | 2.000000 | 44.855323 |
| Pima | 9.695360 | 4.898979 | 44.855323 |
| Santa Cruz | 7.549834 | 5.567764 | 44.866469 |
| Maricopa | 7.874008 | 1.414214 | 44.877611 |
| Yuma | 8.366600 | 1.732051 | 44.877611 |
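Apply a function over a DataFrame.
# Create a function called times100
def times100(x):
    # If x is a string,
    if type(x) is str:
        # return it unchanged
        return x
    # If it is not a string but is truthy, return it multiplied by 100
    elif x:
        return 100 * x
    # Otherwise (e.g. 0 or None), return None
    else:
        return

df.applymap(times100)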
| | coverage | reports | year |
|---|---|---|---|
| Cochice | 2500 | 400 | 201200 |
| Pima | 9400 | 2400 | 201200 |
| Santa Cruz | 5700 | 3100 | 201300 |
| Maricopa | 6200 | 200 | 201400 |
| Yuma | 7000 | 300 | 201400 |
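When every column is numeric, the two element-wise operations above can also be written as whole-DataFrame expressions. The following is a minimal sketch of that alternative, assuming only the numeric columns reports and coverage are present. The names roots and scaled are illustrative.

```python
import numpy as np
import pandas as pd

# Numeric columns of the example data
df = pd.DataFrame({'reports': [4, 24, 31, 2, 3],
                   'coverage': [25, 94, 57, 62, 70]},
                  index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])

# NumPy functions accept a DataFrame and return a DataFrame
roots = np.sqrt(df)

# Arithmetic operators act on every cell at once
scaled = df * 100

print(roots)
print(scaled)
```

One difference from times100: that function returns None for falsy values such as 0, whereas df * 100 keeps them as 0. The example data contains no zeros, so both give the same numbers here.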
Assigning a New Column to a Pandas DataFrame
import pandas as pd

# Create an empty DataFrame
df = pd.DataFrame()

# Create a column
df['name'] = ['John', 'Steve', 'Sarah']

# View the DataFrame
df
| | name |
|---|---|
| 0 | John |
| 1 | Steve |
| 2 | Sarah |
# Assign a new column named age to df, containing a list of ages
df.assign(age = [31, 32, 19])
| | name | age |
|---|---|---|
| 0 | John | 31 |
| 1 | Steve | 32 |
| 2 | Sarah | 19 |
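One detail worth showing is that assign() returns a new DataFrame and leaves the original untouched. The sketch below demonstrates this and also passes a callable to assign(). The column is_adult and the threshold 21 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Steve', 'Sarah']})

# assign() returns a copy with the extra column; df itself is not modified
with_age = df.assign(age=[31, 32, 19])

# A callable receives the DataFrame being assigned to, so a new column
# can be derived from existing ones (hypothetical column: is_adult)
with_flag = with_age.assign(is_adult=lambda d: d['age'] >= 21)

print(list(df.columns))         # original columns
print(list(with_flag.columns))  # columns after both assignments
```

To keep the new column, bind the result to a name, for example df = df.assign(age=[31, 32, 19]).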
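Breaking a List into Chunks of Size N
In this snippet we take a list and break it up into chunks of size n. This is a very common practice when dealing with APIs that have a maximum request size.
This nifty function was contributed by Ned Batchelder and posted on StackOverflow.
# Create a list of first names
first_names = ['Steve', 'Jane', 'Sara', 'Mary','Jack','Bob', 'Bily', 'Boni', 'Chris','Sori', 'Will', 'Won','Li']

# Create a function called chunks with two arguments, l and n
def chunks(l, n):
    # For each start index i from 0 up to len(l), stepping by n,
    for i in range(0, len(l), n):
        # yield the slice of up to n items that starts at i
        yield l[i:i+n]

# Create a list from the results of the function chunks
list(chunks(first_names, 5))
'''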
[['Steve', 'Jane', 'Sara', 'Mary', 'Jack'],
 ['Bob', 'Bily', 'Boni', 'Chris', 'Sori'],
 ['Will', 'Won', 'Li']]
'''
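As a sketch of the API use case mentioned above, the generator can feed a loop that sends one batch at a time. The function send_batch is a hypothetical stand-in for a real API call and the limit of 5 items per request is an assumption. Neither comes from the original text.

```python
def chunks(l, n):
    # Yield successive slices of l, each holding at most n items
    for i in range(0, len(l), n):
        yield l[i:i+n]

def send_batch(batch):
    # Hypothetical stand-in for an API request; it only reports the batch size
    return len(batch)

first_names = ['Steve', 'Jane', 'Sara', 'Mary', 'Jack', 'Bob', 'Bily',
               'Boni', 'Chris', 'Sori', 'Will', 'Won', 'Li']

# Assumed maximum number of items per request
MAX_REQUEST_SIZE = 5

# Because chunks() is a generator, only one batch is built at a time
batch_sizes = [send_batch(batch) for batch in chunks(first_names, MAX_REQUEST_SIZE)]

print(batch_sizes)
```

The last batch is shorter than the others whenever the list length is not a multiple of n, so the receiving code should not assume full batches.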
Breaking Up a String into Columns Using Regex in Pandas
# Import modules
import re
import pandas as pd

# Create a DataFrame with a single column of strings
data = {'raw': ['Arizona 1 2014-12-23 3242.0',
                'Iowa 1 2010-02-23 3453.7',
                'Oregon 0 2014-06-20 2123.0',
                'Maryland 0 2014-03-14 1123.6',
                'Florida 1 2013-01-15 2134.0',
                'Georgia 0 2012-07-14 2345.6']}
df = pd.DataFrame(data, columns = ['raw'])
df
| | raw |
|---|---|
| 0 | Arizona 1 2014-12-23 3242.0 |
| 1 | Iowa 1 2010-02-23 3453.7 |
| 2 | Oregon 0 2014-06-20 2123.0 |
| 3 | Maryland 0 2014-03-14 1123.6 |
| 4 | Florida 1 2013-01-15 2134.0 |
| 5 | Georgia 0 2012-07-14 2345.6 |
# Which rows of df['raw'] contain 'xxxx-xx-xx'?
df['raw'].str.contains('....-..-..', regex=True)
'''
0 True
1 True
2 True
3 True
4 True
5 True
Name: raw, dtype: bool
'''
# In the raw column, extract the first single digit in the string
df['female'] = df['raw'].str.extract(r'(\d)', expand=True)
df['female']
'''
0 1
1 1
2 0
3 0
4 1
5 0
Name: female, dtype: object
'''
# In the raw column, extract xxxx-xx-xx from the string
df['date'] = df['raw'].str.extract('(....-..-..)', expand=True)
df['date']
'''
0 2014-12-23
1 2010-02-23
2 2014-06-20
3 2014-03-14
4 2013-01-15
5 2012-07-14
Name: date, dtype: object
'''
# In the raw column, extract ####.# from the string
df['score'] = df['raw'].str.extract(r'(\d\d\d\d\.\d)', expand=True)
df['score']
'''
0 3242.0
1 3453.7
2 2123.0
3 1123.6
4 2134.0
5 2345.6
Name: score, dtype: object
'''
# In the raw column, extract the capitalized word from the string
df['state'] = df['raw'].str.extract(r'([A-Z]\w{0,})', expand=True)
df['state']
'''
0 Arizona
1 Iowa
2 Oregon
3 Maryland
4 Florida
5 Georgia
Name: state, dtype: object
'''
df
| | raw | female | date | score | state |
|---|---|---|---|---|---|
| 0 | Arizona 1 2014-12-23 3242.0 | 1 | 2014-12-23 | 3242.0 | Arizona |
| 1 | Iowa 1 2010-02-23 3453.7 | 1 | 2010-02-23 | 3453.7 | Iowa |
| 2 | Oregon 0 2014-06-20 2123.0 | 0 | 2014-06-20 | 2123.0 | Oregon |
| 3 | Maryland 0 2014-03-14 1123.6 | 0 | 2014-03-14 | 1123.6 | Maryland |
| 4 | Florida 1 2013-01-15 2134.0 | 1 | 2013-01-15 | 2134.0 | Florida |
| 5 | Georgia 0 2012-07-14 2345.6 | 0 | 2012-07-14 | 2345.6 | Georgia |
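The four extractions above can also be sketched as a single pass with named capture groups, which str.extract() turns into column names. The pattern below is an assumption about the layout of the raw strings (state, flag, date, score, separated by whitespace), and the type conversions at the end are an optional extra that the original text does not perform.

```python
import pandas as pd

df = pd.DataFrame({'raw': ['Arizona 1 2014-12-23 3242.0',
                           'Iowa 1 2010-02-23 3453.7',
                           'Oregon 0 2014-06-20 2123.0',
                           'Maryland 0 2014-03-14 1123.6',
                           'Florida 1 2013-01-15 2134.0',
                           'Georgia 0 2012-07-14 2345.6']})

# One named group per field; \d{4}-\d{2}-\d{2} is stricter than '....-..-..'
pattern = (r'(?P<state>[A-Z]\w*)\s+'
           r'(?P<female>\d)\s+'
           r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
           r'(?P<score>\d+\.\d+)')

# Each named group becomes a column in the returned DataFrame
parsed = df['raw'].str.extract(pattern)

# Extracted values are strings; convert them to useful types
parsed['female'] = parsed['female'].astype(int)
parsed['date'] = pd.to_datetime(parsed['date'])
parsed['score'] = parsed['score'].astype(float)

print(parsed)
```

A row that does not match the whole pattern yields NaN in every column, which makes malformed input easier to spot than with four independent extractions.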
Finding the Columns Shared by Two DataFrames
# Import library
import pandas as pd

# Create a DataFrame
dataframe_one = pd.DataFrame()
dataframe_one['1'] = ['1', '1', '1']
dataframe_one['B'] = ['b', 'b', 'b']

# Create a second DataFrame
dataframe_two = pd.DataFrame()
dataframe_two['2'] = ['2', '2', '2']
dataframe_two['B'] = ['b', 'b', 'b']

# Convert each DataFrame's columns to a set,
# then find the intersection of the two sets.
# This is the set of columns shared by both DataFrames.
set.intersection(set(dataframe_one), set(dataframe_two))
# {'B'}
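A closely related sketch uses the intersection() method of the column Index instead of Python sets. It returns an Index rather than a set, so the result can be used directly to select the shared columns. The names shared, shared_list and common_one are illustrative.

```python
import pandas as pd

dataframe_one = pd.DataFrame({'1': ['1', '1', '1'], 'B': ['b', 'b', 'b']})
dataframe_two = pd.DataFrame({'2': ['2', '2', '2'], 'B': ['b', 'b', 'b']})

# Index.intersection returns the labels present in both column indexes
shared = dataframe_one.columns.intersection(dataframe_two.columns)
shared_list = list(shared)

# The resulting Index can select the shared columns from either DataFrame
common_one = dataframe_one[shared]

print(shared_list)
```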
Building a Dictionary from Multiple Lists
# Create a list of officer names
officer_names = ['Sodoni Dogla', 'Chris Jefferson', 'Jessica Billars', 'Michael Mulligan', 'Steven Johnson']

# Create a list of officer armies
officer_armies = ['Purple Army', 'Orange Army', 'Green Army', 'Red Army', 'Blue Army']

# Create a dictionary that is the zip of the two lists
dict(zip(officer_names, officer_armies))
'''
{'Chris Jefferson': 'Orange Army',
 'Jessica Billars': 'Green Army',
 'Michael Mulligan': 'Red Army',
 'Sodoni Dogla': 'Purple Army',
 'Steven Johnson': 'Blue Army'}
'''
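One behavior of zip() worth knowing for this pattern: it stops at the shorter input, so a missing value silently drops a key. The sketch below shows the truncation and a small guard against it. The helper zip_to_dict is hypothetical and not part of the original text.

```python
officer_names = ['Sodoni Dogla', 'Chris Jefferson', 'Jessica Billars']
# One entry short on purpose, to show the truncation
officer_armies = ['Purple Army', 'Orange Army']

# zip() stops at the shorter list, so the third officer is dropped
truncated = dict(zip(officer_names, officer_armies))

def zip_to_dict(keys, values):
    # Hypothetical helper: refuse to build the dictionary if the lengths differ
    if len(keys) != len(values):
        raise ValueError('keys and values must have the same length')
    return dict(zip(keys, values))

try:
    zip_to_dict(officer_names, officer_armies)
    length_check_failed = False
except ValueError:
    length_check_failed = True

print(truncated)
print(length_check_failed)
```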
Converting a CSV into Python Code That Recreates It
# Import the pandas package
import pandas as pd

# Load the csv file as a DataFrame
df_original = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')
df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')

# Print the code that creates the DataFrame
print('==============================')
print('RUN THE CODE BELOW THIS LINE')
print('==============================')
print('raw_data =', df.to_dict(orient='list'))
print('df = pd.DataFrame(raw_data, columns = ' + str(list(df_original)) + ')')
'''
==============================
RUN THE CODE BELOW THIS LINE
==============================
raw_data = {'Sepal.Length': [5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 
7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 
1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 
3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 
'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 
2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150]}
df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])
'''
# If you want to check the result:
# 1. Paste the code generated by the cell above into this cell
raw_data = {'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 
2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 
3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 
'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150], 'Sepal.Length': [5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 
5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 
1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996]}
df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])
# View the first few rows of the original dataframe
df.head()
|
Unnamed: 0
|
Sepal.Length
|
Sepal.Width
|
Petal.Length
|
Petal.Width
|
Species
|
0
|
1
|
5.1
|
3.5
|
1.4
|
0.2
|
setosa
|
1
|
2
|
4.9
|
3.0
|
1.4
|
0.2
|
setosa
|
2
|
3
|
4.7
|
3.2
|
1.3
|
0.2
|
setosa
|
3
|
4
|
4.6
|
3.1
|
1.5
|
0.2
|
setosa
|
4
|
5
|
5.0
|
3.6
|
1.4
|
0.2
|
setosa
|
# View the first few rows of the dataframe created with our code
df_original.head()
|
Unnamed: 0
|
Sepal.Length
|
Sepal.Width
|
Petal.Length
|
Petal.Width
|
Species
|
0
|
1
|
5.1
|
3.5
|
1.4
|
0.2
|
setosa
|
1
|
2
|
4.9
|
3.0
|
1.4
|
0.2
|
setosa
|
2
|
3
|
4.7
|
3.2
|
1.3
|
0.2
|
setosa
|
3
|
4
|
4.6
|
3.1
|
1.5
|
0.2
|
setosa
|
4
|
5
|
5.0
|
3.6
|
1.4
|
0.2
|
setosa
|
Convert a categorical variable into dummy variables
# Import modules
import pandas as pd
# Create a dataframe
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'sex': ['male', 'female', 'male', 'female', 'female']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'sex'])
df
|
first_name
|
last_name
|
sex
|
0
|
Jason
|
Miller
|
male
|
1
|
Molly
|
Jacobson
|
female
|
2
|
Tina
|
Ali
|
male
|
3
|
Jake
|
Milner
|
female
|
4
|
Amy
|
Cooze
|
female
|
# Create a set of dummy variables from the sex variable
df_sex = pd.get_dummies(df['sex'])
# Join the dummy variables to the main dataframe
df_new = pd.concat([df, df_sex], axis=1)
df_new
|
first_name
|
last_name
|
sex
|
female
|
male
|
0
|
Jason
|
Miller
|
male
|
0.0
|
1.0
|
1
|
Molly
|
Jacobson
|
female
|
1.0
|
0.0
|
2
|
Tina
|
Ali
|
male
|
0.0
|
1.0
|
3
|
Jake
|
Milner
|
female
|
1.0
|
0.0
|
4
|
Amy
|
Cooze
|
female
|
1.0
|
0.0
|
# Alternative way to join the new columns
df_new = df.join(df_sex)
df_new
|
first_name
|
last_name
|
sex
|
female
|
male
|
0
|
Jason
|
Miller
|
male
|
0.0
|
1.0
|
1
|
Molly
|
Jacobson
|
female
|
1.0
|
0.0
|
2
|
Tina
|
Ali
|
male
|
0.0
|
1.0
|
3
|
Jake
|
Milner
|
female
|
1.0
|
0.0
|
4
|
Amy
|
Cooze
|
female
|
1.0
|
0.0
|
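The two steps above (build the dummies, then join them) can also be done in a single call. The following is a minimal sketch, assuming a reasonably recent pandas; the `sex_` prefix that pandas adds and the integer `dtype` are choices made for this illustration, not part of the original example.

```python
import pandas as pd

# Same sample data as in the section above
df = pd.DataFrame({'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
                   'sex': ['male', 'female', 'male', 'female', 'female']})

# Encode the 'sex' column in place of the original column.
# The new columns are named '<column>_<value>', here sex_female and sex_male.
df_encoded = pd.get_dummies(df, columns=['sex'], dtype=int)
```

Unlike the join-based version, this replaces the `sex` column with the dummy columns rather than keeping both.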
Convert a categorical variable into dummy variables with patsy
# Import modules
import pandas as pd
import patsy
# Create a dataframe
raw_data = {'countrycode': [1, 2, 3, 2, 1]}
df = pd.DataFrame(raw_data, columns = ['countrycode'])
df
|
countrycode
|
0
|
1
|
1
|
2
|
2
|
3
|
3
|
2
|
4
|
1
|
# Convert the countrycode variable into three binary variables
patsy.dmatrix('C(countrycode)-1', df, return_type='dataframe')
|
C(countrycode)[1]
|
C(countrycode)[2]
|
C(countrycode)[3]
|
0
|
1.0
|
0.0
|
0.0
|
1
|
0.0
|
1.0
|
0.0
|
2
|
0.0
|
0.0
|
1.0
|
3
|
0.0
|
1.0
|
0.0
|
4
|
1.0
|
0.0
|
0.0
|
Convert a string categorical variable to a numeric variable
# Import modules
import pandas as pd
# Create a dataframe
raw_data = {'patient': [1, 1, 1, 2, 2], 'obs': [1, 2, 3, 1, 2], 'treatment': [0, 1, 0, 1, 0],'score': ['strong', 'weak', 'normal', 'weak', 'strong']}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
df
|
patient
|
obs
|
treatment
|
score
|
0
|
1
|
1
|
0
|
strong
|
1
|
1
|
2
|
1
|
weak
|
2
|
1
|
3
|
0
|
normal
|
3
|
2
|
1
|
1
|
weak
|
4
|
2
|
2
|
0
|
strong
|
# Create a function that converts every value of df['score'] to a number
def score_to_numeric(x):
    if x == 'strong':
        return 3
    if x == 'normal':
        return 2
    if x == 'weak':
        return 1

# Apply the function to the score column
df['score_num'] = df['score'].apply(score_to_numeric)
df
|
patient
|
obs
|
treatment
|
score
|
score_num
|
0
|
1
|
1
|
0
|
strong
|
3
|
1
|
1
|
2
|
1
|
weak
|
1
|
2
|
1
|
3
|
0
|
normal
|
2
|
3
|
2
|
1
|
1
|
weak
|
1
|
4
|
2
|
2
|
0
|
strong
|
3
|
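For a fixed lookup like this one, a dictionary passed to `Series.map` is a common alternative to a hand-written function. This is a sketch under the same assumptions as the example above; the name `score_map` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'patient': [1, 1, 1, 2, 2],
                   'score': ['strong', 'weak', 'normal', 'weak', 'strong']})

# Hypothetical lookup table from label to number
score_map = {'strong': 3, 'normal': 2, 'weak': 1}

# Values missing from the dictionary would become NaN
df['score_num'] = df['score'].map(score_map)
```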
Convert a variable to a time series
# Import libraries
import pandas as pd
# Create a dataset with the dates stored as strings
raw_data = {'date': ['2014-06-01T01:21:38.004053', '2014-06-02T01:21:38.004053', '2014-06-03T01:21:38.004053'],'score': [25, 94, 57]}
df = pd.DataFrame(raw_data, columns = ['date', 'score'])
df
|
date
|
score
|
0
|
2014-06-01T01:21:38.004053
|
25
|
1
|
2014-06-02T01:21:38.004053
|
94
|
2
|
2014-06-03T01:21:38.004053
|
57
|
# Convert the date column to datetime and set it as the index
df["date"] = pd.to_datetime(df["date"])
df = df.set_index(df["date"])
df
| date | date | score |
| 2014-06-01 01:21:38.004053 | 2014-06-01 01:21:38.004053 | 25 |
| 2014-06-02 01:21:38.004053 | 2014-06-02 01:21:38.004053 | 94 |
| 2014-06-03 01:21:38.004053 | 2014-06-03 01:21:38.004053 | 57 |
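Once the index holds datetimes, rows can be selected by date. The sketch below assumes the same three observations; it passes the column name to `set_index` so the date is not kept twice, and uses a partial date string with `.loc`.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2014-06-01T01:21:38.004053',
                            '2014-06-02T01:21:38.004053',
                            '2014-06-03T01:21:38.004053'],
                   'score': [25, 94, 57]})

# Parse the strings and move the column into the index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# Select every observation that falls on 2 June 2014
one_day = df.loc['2014-06-02']
```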
Count values in a pandas dataframe
# Import libraries
import pandas as pd
# Create the yearly death counts for each corps
year = pd.Series([1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894])
guardCorps = pd.Series([0,2,2,1,0,0,1,1,0,3,0,2,1,0,0,1,0,1,0,1])
corps1 = pd.Series([0,0,0,2,0,3,0,2,0,0,0,1,1,1,0,2,0,3,1,0])
corps2 = pd.Series([0,0,0,2,0,2,0,0,1,1,0,0,2,1,1,0,0,2,0,0])
corps3 = pd.Series([0,0,0,1,1,1,2,0,2,0,0,0,1,0,1,2,1,0,0,0])
corps4 = pd.Series([0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,0,1,1,0,0])
corps5 = pd.Series([0,0,0,0,2,1,0,0,1,0,0,1,0,1,1,1,1,1,1,0])
corps6 = pd.Series([0,0,1,0,2,0,0,1,2,0,1,1,3,1,1,1,0,3,0,0])
corps7 = pd.Series([1,0,1,0,0,0,1,0,1,1,0,0,2,0,0,2,1,0,2,0])
corps8 = pd.Series([1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,1,0,1])
corps9 = pd.Series([0,0,0,0,0,2,1,1,1,0,2,1,1,0,1,2,0,1,0,0])
corps10 = pd.Series([0,0,1,1,0,1,0,2,0,2,0,0,0,0,2,1,3,0,1,1])
corps11 = pd.Series([0,0,0,0,2,4,0,1,3,0,1,1,1,1,2,1,3,1,3,1])
corps14 = pd.Series([ 1,1,2,1,1,3,0,4,0,1,0,3,2,1,0,2,1,1,0,0])
corps15 = pd.Series([0,1,0,0,0,0,0,1,0,1,1,0,0,0,2,2,0,0,0,0])
# Collect the series into a dictionary
variables = dict(guardCorps = guardCorps, corps1 = corps1, corps2 = corps2, corps3 = corps3, corps4 = corps4, corps5 = corps5, corps6 = corps6, corps7 = corps7, corps8 = corps8, corps9 = corps9, corps10 = corps10, corps11 = corps11 , corps14 = corps14, corps15 = corps15)
# Build the dataframe with a fixed column order
horsekick = pd.DataFrame(variables, columns = ['guardCorps', 'corps1', 'corps2', 'corps3', 'corps4', 'corps5', 'corps6', 'corps7', 'corps8', 'corps9', 'corps10', 'corps11', 'corps14', 'corps15'])
# Use the years as the index
horsekick.index = [1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894]
horsekick
|
guardCorps
|
corps1
|
corps2
|
corps3
|
corps4
|
corps5
|
corps6
|
corps7
|
corps8
|
corps9
|
corps10
|
corps11
|
corps14
|
corps15
|
1875
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
1
|
1
|
0
|
0
|
0
|
1
|
0
|
1876
|
2
|
0
|
0
|
0
|
1
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
1
|
1
|
1877
|
2
|
0
|
0
|
0
|
0
|
0
|
1
|
1
|
0
|
0
|
1
|
0
|
2
|
0
|
1878
|
1
|
2
|
2
|
1
|
1
|
0
|
0
|
0
|
0
|
0
|
1
|
0
|
1
|
0
|
1879
|
0
|
0
|
0
|
1
|
1
|
2
|
2
|
0
|
1
|
0
|
0
|
2
|
1
|
0
|
1880
|
0
|
3
|
2
|
1
|
1
|
1
|
0
|
0
|
0
|
2
|
1
|
4
|
3
|
0
|
1881
|
1
|
0
|
0
|
2
|
1
|
0
|
0
|
1
|
0
|
1
|
0
|
0
|
0
|
0
|
1882
|
1
|
2
|
0
|
0
|
0
|
0
|
1
|
0
|
1
|
1
|
2
|
1
|
4
|
1
|
1883
|
0
|
0
|
1
|
2
|
0
|
1
|
2
|
1
|
0
|
1
|
0
|
3
|
0
|
0
|
1884
|
3
|
0
|
1
|
0
|
0
|
0
|
0
|
1
|
0
|
0
|
2
|
0
|
1
|
1
|
1885
|
0
|
0
|
0
|
0
|
0
|
0
|
1
|
0
|
0
|
2
|
0
|
1
|
0
|
1
|
1886
|
2
|
1
|
0
|
0
|
1
|
1
|
1
|
0
|
0
|
1
|
0
|
1
|
3
|
0
|
1887
|
1
|
1
|
2
|
1
|
0
|
0
|
3
|
2
|
1
|
1
|
0
|
1
|
2
|
0
|
1888
|
0
|
1
|
1
|
0
|
0
|
1
|
1
|
0
|
0
|
0
|
0
|
1
|
1
|
0
|
1889
|
0
|
0
|
1
|
1
|
0
|
1
|
1
|
0
|
0
|
1
|
2
|
2
|
0
|
2
|
1890
|
1
|
2
|
0
|
2
|
0
|
1
|
1
|
2
|
0
|
2
|
1
|
1
|
2
|
2
|
1891
|
0
|
0
|
0
|
1
|
1
|
1
|
0
|
1
|
1
|
0
|
3
|
3
|
1
|
0
|
1892
|
1
|
3
|
2
|
0
|
1
|
1
|
3
|
0
|
1
|
1
|
0
|
1
|
1
|
0
|
1893
|
0
|
1
|
0
|
0
|
0
|
1
|
0
|
2
|
0
|
0
|
1
|
3
|
0
|
0
|
1894
|
1
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
1
|
0
|
1
|
1
|
0
|
0
|
# Count how many times each death count occurs within each corps
result = horsekick.apply(pd.Series.value_counts).fillna(0)
result
| | guardCorps | corps1 | corps2 | corps3 | corps4 | corps5 | corps6 | corps7 | corps8 | corps9 | corps10 | corps11 | corps14 | corps15 |
| 0 | 9.0 | 11.0 | 12.0 | 11.0 | 12.0 | 10.0 | 9.0 | 11.0 | 13.0 | 10.0 | 10.0 | 6 | 6 | 14.0 |
| 1 | 7.0 | 4.0 | 4.0 | 6.0 | 8.0 | 9.0 | 7.0 | 6.0 | 7.0 | 7.0 | 6.0 | 8 | 8 | 4.0 |
| 2 | 3.0 | 3.0 | 4.0 | 3.0 | 0.0 | 1.0 | 2.0 | 3.0 | 0.0 | 3.0 | 3.0 | 2 | 3 | 2.0 |
| 3 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 | 2 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.0 |
# Count how many times each yearly death total appears in guardCorps
horsekick['guardCorps'].value_counts().sort_index()
'''
0 9
1 7
2 3
3 1
dtype: int64
'''
# List the unique values in guardCorps
horsekick['guardCorps'].unique()
# array([0, 2, 1, 3])
Create a pipeline in pandas
Pandas' pipe feature lets you chain Python functions together to build a data-processing pipeline.
import pandas as pd
# Create an empty dataframe
df = pd.DataFrame()
# Create the columns
df['name'] = ['John', 'Steve', 'Sarah']
df['gender'] = ['Male', 'Male', 'Female']
df['age'] = [31, 32, 19]
# View the dataframe
df
|
name
|
gender
|
age
|
0
|
John
|
Male
|
31
|
1
|
Steve
|
Male
|
32
|
2
|
Sarah
|
Female
|
19
|
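# Create a function that
def mean_age_by_group(dataframe, col):
    # groups the data by a column and returns the mean of each group's numeric columns
    return dataframe.groupby(col).mean(numeric_only=True)

# Create a function that
def uppercase_column_name(dataframe):
    # uppercases all the column headers
    dataframe.columns = dataframe.columns.str.upper()
    # and returns the dataframe
    return dataframe

# Create a pipeline that applies the mean_age_by_group function
(df.pipe(mean_age_by_group, col='gender')
   # and then applies the uppercase_column_name function
   .pipe(uppercase_column_name)
)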
|
AGE
|
gender
|
|
Female
|
19.0
|
Male
|
31.5
|
Create a pandas column with a for loop
import pandas as pd
import numpy as np
# Create a dataframe in which the last test score is missing
raw_data = {'student_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 'test_score': [76, 88, 84, 67, 53, 96, 64, 91, 77, 73, 52, np.nan]}
df = pd.DataFrame(raw_data, columns = ['student_name', 'test_score'])
# Create a list to store the data
grades = []
# For each row in the column
for row in df['test_score']:
    # if the score is above a threshold, append the matching letter grade
    if row > 95:
        grades.append('A')
    # otherwise, try the next lower threshold
    elif row > 90:
        grades.append('A-')
    elif row > 85:
        grades.append('B')
    elif row > 80:
        grades.append('B-')
    elif row > 75:
        grades.append('C')
    elif row > 70:
        grades.append('C-')
    elif row > 65:
        grades.append('D')
    elif row > 60:
        grades.append('D-')
    # otherwise, append a failing grade
    # (a missing score also lands here, because every comparison with NaN is False)
    else:
        grades.append('Failed')
# Create a column from the list
df['grades'] = grades
# View the new dataframe
df
|
student_name
|
test_score
|
grades
|
0
|
Miller
|
76.0
|
C
|
1
|
Jacobson
|
88.0
|
B
|
2
|
Ali
|
84.0
|
B-
|
3
|
Milner
|
67.0
|
D
|
4
|
Cooze
|
53.0
|
Failed
|
5
|
Jacon
|
96.0
|
A
|
6
|
Ryaner
|
64.0
|
D-
|
7
|
Sone
|
91.0
|
A-
|
8
|
Sloan
|
77.0
|
C
|
9
|
Piger
|
73.0
|
C-
|
10
|
Riani
|
52.0
|
Failed
|
11
|
Ali
|
NaN
|
Failed
|
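The same thresholds can be applied without an explicit loop by binning the scores. This is a sketch using `pd.cut` with the cut-offs from the loop above; note one behavioral difference, stated as an observation rather than a recommendation: a missing score stays missing here instead of being labeled `Failed`.

```python
import numpy as np
import pandas as pd

scores = pd.Series([76, 88, 84, 67, 53, 96, 64, 91, np.nan])

# Bin edges mirror the loop: each bin is (lower, upper], so 'row > 95' maps to (95, inf]
bins = [-np.inf, 60, 65, 70, 75, 80, 85, 90, 95, np.inf]
labels = ['Failed', 'D-', 'D', 'C-', 'C', 'B-', 'B', 'A-', 'A']

# Assign each score to its bin; NaN scores yield NaN grades
grades = pd.cut(scores, bins=bins, labels=labels)
```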
Create a counter of items
from collections import Counter
# Create a counter of the fruits eaten today
fruit_eaten = Counter(['Apple', 'Apple', 'Apple', 'Banana', 'Pear', 'Pineapple'])
# View the counter
fruit_eaten
# Counter({'Apple': 3, 'Banana': 1, 'Pear': 1, 'Pineapple': 1})
# Update the count for pineapple (because you just ate one)
fruit_eaten.update(['Pineapple'])
# View the counter
fruit_eaten
# Counter({'Apple': 3, 'Banana': 1, 'Pear': 1, 'Pineapple': 2})
# View the three items with the highest counts
fruit_eaten.most_common(3)
# [('Apple', 3), ('Pineapple', 2), ('Banana', 1)]
Create a column based on a conditional
# Import required modules
import pandas as pd
import numpy as np
# Create a dataframe
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, 2, 3],'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
|
name
|
age
|
preTestScore
|
postTestScore
|
0
|
Jason
|
42
|
4
|
25
|
1
|
Molly
|
52
|
24
|
94
|
2
|
Tina
|
36
|
31
|
57
|
3
|
Jake
|
24
|
2
|
62
|
4
|
Amy
|
73
|
3
|
70
|
# Create a new column called df.elderly
# whose value is yes if df.age is at least 50, and no otherwise
df['elderly'] = np.where(df['age']>=50, 'yes', 'no')
# View the dataframe
df
|
name
|
age
|
preTestScore
|
postTestScore
|
elderly
|
0
|
Jason
|
42
|
4
|
25
|
no
|
1
|
Molly
|
52
|
24
|
94
|
yes
|
2
|
Tina
|
36
|
31
|
57
|
no
|
3
|
Jake
|
24
|
2
|
62
|
no
|
4
|
Amy
|
73
|
3
|
70
|
yes
|
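`np.where` handles a single yes/no condition. When there are more than two outcomes, `np.select` takes a list of conditions and a matching list of values. The sketch below is illustrative only: the age thresholds and the labels `senior`, `middle`, and `young` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'age': [42, 52, 36, 24, 73]})

# Conditions are checked in order; the first one that is True wins
conditions = [df['age'] >= 65, df['age'] >= 40]
choices = ['senior', 'middle']

# Rows matching no condition receive the default value
df['age_group'] = np.select(conditions, choices, default='young')
```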
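Create lists from dictionary keys and values
# Create a dictionary (named fire_data here so it does not shadow the built-in dict)
fire_data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'], 'year': [2012, 2012, 2013, 2014, 2014], 'fireReports': [4, 24, 31, 2, 3]}
# Create a list of the keys (Python 3.7+ keeps insertion order)
list(fire_data.keys())
# ['county', 'year', 'fireReports']
# Create a list of the values
list(fire_data.values())
'''
[['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],[2012, 2012, 2013, 2014, 2014],[4, 24, 31, 2, 3]]
'''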
Crosstabs in pandas
# Import libraries
import pandas as pd
# Create a dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 'company': ['infantry', 'infantry', 'cavalry', 'cavalry', 'infantry', 'infantry', 'cavalry', 'cavalry','infantry', 'infantry', 'cavalry', 'cavalry'], 'experience': ['veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie','veteran', 'rookie', 'veteran', 'rookie'],'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'experience', 'name', 'preTestScore', 'postTestScore'])
df
|
regiment
|
company
|
experience
|
name
|
preTestScore
|
postTestScore
|
0
|
Nighthawks
|
infantry
|
veteran
|
Miller
|
4
|
25
|
1
|
Nighthawks
|
infantry
|
rookie
|
Jacobson
|
24
|
94
|
2
|
Nighthawks
|
cavalry
|
veteran
|
Ali
|
31
|
57
|
3
|
Nighthawks
|
cavalry
|
rookie
|
Milner
|
2
|
62
|
4
|
Dragoons
|
infantry
|
veteran
|
Cooze
|
3
|
70
|
5
|
Dragoons
|
infantry
|
rookie
|
Jacon
|
4
|
25
|
6
|
Dragoons
|
cavalry
|
veteran
|
Ryaner
|
24
|
94
|
7
|
Dragoons
|
cavalry
|
rookie
|
Sone
|
31
|
57
|
8
|
Scouts
|
infantry
|
veteran
|
Sloan
|
2
|
62
|
9
|
Scouts
|
infantry
|
rookie
|
Piger
|
3
|
70
|
10
|
Scouts
|
cavalry
|
veteran
|
Riani
|
2
|
62
|
11
|
Scouts
|
cavalry
|
rookie
|
Ali
|
3
|
70
|
Create a crosstab of regiment by company, counting the number of observations in each combination.
pd.crosstab(df.regiment, df.company, margins=True)
company
|
cavalry
|
infantry
|
All
|
regiment
|
|
|
|
Dragoons
|
2
|
2
|
4
|
Nighthawks
|
2
|
2
|
4
|
Scouts
|
2
|
2
|
4
|
All
|
6
|
6
|
12
|
# Create a crosstab of company and experience for each regiment
pd.crosstab([df.company, df.experience], df.regiment, margins=True)
|
regiment
|
Dragoons
|
Nighthawks
|
Scouts
|
All
|
company
|
experience
|
|
|
|
|
cavalry
|
rookie
|
1
|
1
|
1
|
3
|
|
veteran
|
1
|
1
|
1
|
3
|
infantry
|
rookie
|
1
|
1
|
1
|
3
|
|
veteran
|
1
|
1
|
1
|
3
|
All
|
|
4
|
4
|
4
|
12
|
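Crosstabs can also report proportions instead of raw counts. The sketch below rebuilds just the two columns it needs and uses the `normalize='index'` option of `pd.crosstab`, so each row sums to 1.

```python
import pandas as pd

df = pd.DataFrame({'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
                   'company': ['infantry', 'infantry', 'cavalry', 'cavalry'] * 3})

# Share of each regiment's observations that fall in each company
shares = pd.crosstab(df['regiment'], df['company'], normalize='index')
```

Because every regiment in this sample has two infantry and two cavalry observations, every cell comes out as 0.5.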
Drop duplicates
# Import modules
import pandas as pd
# Create a dataframe that contains duplicate rows
raw_data = {'first_name': ['Jason', 'Jason', 'Jason','Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Miller', 'Miller','Ali', 'Milner', 'Cooze'], 'age': [42, 42, 1111111, 36, 24, 73], 'preTestScore': [4, 4, 4, 31, 2, 3],'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
|
first_name
|
last_name
|
age
|
preTestScore
|
postTestScore
|
0
|
Jason
|
Miller
|
42
|
4
|
25
|
1
|
Jason
|
Miller
|
42
|
4
|
25
|
2
|
Jason
|
Miller
|
1111111
|
4
|
25
|
3
|
Tina
|
Ali
|
36
|
31
|
57
|
4
|
Jake
|
Milner
|
24
|
2
|
62
|
5
|
Amy
|
Cooze
|
73
|
3
|
70
|
# Identify which observations are duplicates
df.duplicated()
'''
0 False
1 True
2 False
3 False
4 False
5 False
dtype: bool
'''
# Drop the duplicate rows
df.drop_duplicates()
|
first_name
|
last_name
|
age
|
preTestScore
|
postTestScore
|
0
|
Jason
|
Miller
|
42
|
4
|
25
|
2
|
Jason
|
Miller
|
1111111
|
4
|
25
|
3
|
Tina
|
Ali
|
36
|
31
|
57
|
4
|
Jake
|
Milner
|
24
|
2
|
62
|
5
|
Amy
|
Cooze
|
73
|
3
|
70
|
# Drop duplicates in the first_name column,
# but keep the last observation in each set of duplicates
df.drop_duplicates(['first_name'], keep='last')
|
first_name
|
last_name
|
age
|
preTestScore
|
postTestScore
|
2
|
Jason
|
Miller
|
1111111
|
4
|
25
|
3
|
Tina
|
Ali
|
36
|
31
|
57
|
4
|
Jake
|
Milner
|
24
|
2
|
62
|
5
|
Amy
|
Cooze
|
73
|
3
|
70
|
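Before dropping anything, it can help to look at every row that belongs to a duplicate set. The sketch below assumes that a person is identified by first and last name (an assumption made for this illustration) and uses `keep=False` so that all members of a set are flagged, not just the repeats.

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['Jason', 'Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
                   'last_name': ['Miller', 'Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
                   'age': [42, 42, 1111111, 36, 24, 73]})

# Flag every row whose (first_name, last_name) pair occurs more than once
same_person = df.duplicated(subset=['first_name', 'last_name'], keep=False)

# Inspect the flagged rows before deciding which to keep
suspects = df[same_person]
```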
Descriptive statistics for pandas dataframes
# Import modules
import pandas as pd
# Create a dataframe
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, 2, 3],'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
|
name
|
age
|
preTestScore
|
postTestScore
|
0
|
Jason
|
42
|
4
|
25
|
1
|
Molly
|
52
|
24
|
94
|
2
|
Tina
|
36
|
31
|
57
|
3
|
Jake
|
24
|
2
|
62
|
4
|
Amy
|
73
|
3
|
70
|
5 rows × 4 columns
# Sum of all the ages
df['age'].sum()
# 227
# Mean of preTestScore
df['preTestScore'].mean()
# 12.800000000000001
# Cumulative sum of preTestScore
df['preTestScore'].cumsum()
'''
0 4
1 28
2 59
3 61
4 64
Name: preTestScore, dtype: int64
'''
# Summary statistics for preTestScore
df['preTestScore'].describe()
'''
count 5.000000
mean 12.800000
std 13.663821
min 2.000000
25% 3.000000
50% 4.000000
75% 24.000000
max 31.000000
Name: preTestScore, dtype: float64
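'''
# Count of non-missing observations
df['preTestScore'].count()
# 5
# Minimum value
df['preTestScore'].min()
# 2
# Maximum value
df['preTestScore'].max()
# 31
# Median
df['preTestScore'].median()
# 4.0
# Sample variance
df['preTestScore'].var()
# 186.69999999999999
# Sample standard deviation
df['preTestScore'].std()
# 13.663820841916802
# Skewness
df['preTestScore'].skew()
# 0.74334524573267591
# Kurtosis
df['preTestScore'].kurt()
# -2.4673543738411525
# Correlation matrix of the numeric columns
df.corr(numeric_only=True)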
|
age
|
preTestScore
|
postTestScore
|
age
|
1.000000
|
-0.105651
|
0.328852
|
preTestScore
|
-0.105651
|
1.000000
|
0.378039
|
postTestScore
|
0.328852
|
0.378039
|
1.000000
|
3 rows × 3 columns
# Covariance matrix of the numeric columns
df.cov(numeric_only=True)
|
age
|
preTestScore
|
postTestScore
|
age
|
340.80
|
-26.65
|
151.20
|
preTestScore
|
-26.65
|
186.70
|
128.65
|
postTestScore
|
151.20
|
128.65
|
620.30
|
3 rows × 3 columns
Drop rows or columns
# Import modules
import pandas as pd
# Create a dataframe
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
|
name
|
reports
|
year
|
Cochice
|
Jason
|
4
|
2012
|
Pima
|
Molly
|
24
|
2012
|
Santa Cruz
|
Tina
|
31
|
2013
|
Maricopa
|
Jake
|
2
|
2014
|
Yuma
|
Amy
|
3
|
2014
|
# Drop observations (rows)
df.drop(['Cochice', 'Pima'])
|
name
|
reports
|
year
|
Santa Cruz
|
Tina
|
31
|
2013
|
Maricopa
|
Jake
|
2
|
2014
|
Yuma
|
Amy
|
3
|
2014
|
# Drop a variable (column)
# Note: `axis = 1` means we are referring to columns, not rows
df.drop('reports', axis=1)
|
name
|
year
|
Cochice
|
Jason
|
2012
|
Pima
|
Molly
|
2012
|
Santa Cruz
|
Tina
|
2013
|
Maricopa
|
Jake
|
2014
|
Yuma
|
Amy
|
2014
|
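Drop a row if it contains a certain value (in this case, Tina).
Specifically: select the rows whose value in the name column is not equal to Tina. The expression returns a new dataframe and leaves df itself unchanged.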
df[df.name != 'Tina']
|
name
|
reports
|
year
|
Cochice
|
Jason
|
4
|
2012
|
Pima
|
Molly
|
24
|
2012
|
Maricopa
|
Jake
|
2
|
2014
|
Yuma
|
Amy
|
3
|
2014
|
Drop a row by row number (in this case, the third row).
Note that pandas uses zero-based numbering, so 0 is the first row, 1 is the second row, and so on.
df.drop(df.index[2])
|
name
|
reports
|
year
|
Cochice
|
Jason
|
4
|
2012
|
Pima
|
Molly
|
24
|
2012
|
Maricopa
|
Jake
|
2
|
2014
|
Yuma
|
Amy
|
3
|
2014
|
This extends to a list of row positions.
df.drop(df.index[[2,3]])
|
name
|
reports
|
year
|
Cochice
|
Jason
|
4
|
2012
|
Pima
|
Molly
|
24
|
2012
|
Yuma
|
Amy
|
3
|
2014
|
Or drop a row relative to the end of the dataframe.
df.drop(df.index[-2])
|
name
|
reports
|
year
|
Cochice
|
Jason
|
4
|
2012
|
Pima
|
Molly
|
24
|
2012
|
Santa Cruz
|
Tina
|
31
|
2013
|
Yuma
|
Amy
|
3
|
2014
|
You can also select ranges relative to the start or the end.
df[:3] # keep the first three rows
|
name
|
reports
|
year
|
Cochice
|
Jason
|
4
|
2012
|
Pima
|
Molly
|
24
|
2012
|
Santa Cruz
|
Tina
|
31
|
2013
|
df[:-3] # drop the last three rows
|
name
|
reports
|
year
|
Cochice
|
Jason
|
4
|
2012
|
Pima
|
Molly
|
24
|
2012
|
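Enumerate a list
# Create a list of strings
data = ['One','Two','Three','Four','Five']
# For each item in enumerate(data)
for item in enumerate(data):
    # print the whole enumerated element, an (index, value) tuple
    print(item)
    # print only the value (without the index)
    print(item[1])
'''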
(0, 'One')
One
(1, 'Two')
Two
(2, 'Three')
Three
(3, 'Four')
Four
(4, 'Five')
Five
'''
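In practice the `(index, value)` tuple is usually unpacked directly in the loop header, and `enumerate` accepts a `start` argument when counting should not begin at zero. A small sketch, with the name `labelled` introduced for illustration:

```python
data = ['One', 'Two', 'Three', 'Four', 'Five']

# Unpack each (position, value) pair and count from 1 instead of 0
labelled = [f"{position}: {value}" for position, value in enumerate(data, start=1)]
```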
Expand cells containing lists into their own variables in pandas
# Import pandas
import pandas as pd
# Create a dataset
raw_data = {'score': [1,2,3], 'tags': [['apple','pear','guava'],['truck','car','plane'],['cat','dog','mouse']]}
df = pd.DataFrame(raw_data, columns = ['score', 'tags'])
# View the dataset
df
|
score
|
tags
|
0
|
1
|
[apple, pear, guava]
|
1
|
2
|
[truck, car, plane]
|
2
|
3
|
[cat, dog, mouse]
|
# Expand df.tags into its own dataframe
tags = df['tags'].apply(pd.Series)
# Rename each variable with a tag_ prefix
tags = tags.rename(columns = lambda x : 'tag_' + str(x))
# View the tags dataframe
tags
|
tag_0
|
tag_1
|
tag_2
|
0
|
apple
|
pear
|
guava
|
1
|
truck
|
car
|
plane
|
2
|
cat
|
dog
|
mouse
|
# Join the tags dataframe back to the original dataframe
pd.concat([df[:], tags[:]], axis=1)
|
score
|
tags
|
tag_0
|
tag_1
|
tag_2
|
0
|
1
|
[apple, pear, guava]
|
apple
|
pear
|
guava
|
1
|
2
|
[truck, car, plane]
|
truck
|
car
|
plane
|
2
|
3
|
[cat, dog, mouse]
|
cat
|
dog
|
mouse
|
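An alternative to `apply(pd.Series)` is to build the new dataframe straight from the list of lists, which tends to be faster on large columns. This is a sketch assuming every cell holds a list; `add_prefix` produces the same `tag_0`, `tag_1`, `tag_2` names as above.

```python
import pandas as pd

df = pd.DataFrame({'score': [1, 2, 3],
                   'tags': [['apple', 'pear', 'guava'],
                            ['truck', 'car', 'plane'],
                            ['cat', 'dog', 'mouse']]})

# Build one column per list position, reusing the original index
tags = pd.DataFrame(df['tags'].tolist(), index=df.index).add_prefix('tag_')

# Attach the new columns to the original dataframe
expanded = df.join(tags)
```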
Filter pandas dataframes
# Import modules
import pandas as pd
# Create a dataframe
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3],'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
|
coverage
|
name
|
reports
|
year
|
Cochice
|
25
|
Jason
|
4
|
2012
|
Pima
|
94
|
Molly
|
24
|
2012
|
Santa Cruz
|
57
|
Tina
|
31
|
2013
|
Maricopa
|
62
|
Jake
|
2
|
2014
|
Yuma
|
70
|
Amy
|
3
|
2014
|
# View a column
df['name']
'''
Cochice Jason
Pima Molly
Santa Cruz Tina
Maricopa Jake
Yuma Amy
Name: name, dtype: object
'''
# View two columns
df[['name', 'reports']]
|
name
|
reports
|
Cochice
|
Jason
|
4
|
Pima
|
Molly
|
24
|
Santa Cruz
|
Tina
|
31
|
Maricopa
|
Jake
|
2
|
Yuma
|
Amy
|
3
|
# View the first two rows
df[:2]
|
coverage
|
name
|
reports
|
year
|
Cochice
|
25
|
Jason
|
4
|
2012
|
Pima
|
94
|
Molly
|
24
|
2012
|
# View the rows where coverage is greater than 50
df[df['coverage'] > 50]
|
coverage
|
name
|
reports
|
year
|
Pima
|
94
|
Molly
|
24
|
2012
|
Santa Cruz
|
57
|
Tina
|
31
|
2013
|
Maricopa
|
62
|
Jake
|
2
|
2014
|
Yuma
|
70
|
Amy
|
3
|
2014
|
# View the rows where coverage is greater than 50 and reports is less than 4
df[(df['coverage'] > 50) & (df['reports'] < 4)]
|
coverage
|
name
|
reports
|
year
|
Maricopa
|
62
|
Jake
|
2
|
2014
|
Yuma
|
70
|
Amy
|
3
|
2014
|
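The combined boolean mask above can also be written as a string expression with `DataFrame.query`, which some readers find easier to scan when several conditions are involved. A sketch on the same data:

```python
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])

# Same filter as the boolean mask: coverage above 50 and fewer than 4 reports
filtered = df.query('coverage > 50 and reports < 4')
```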
Find the largest value in a dataframe column
# Import modules
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create a dataframe
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, 2, 3],'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
|
first_name
|
last_name
|
age
|
preTestScore
|
postTestScore
|
0
|
Jason
|
Miller
|
42
|
4
|
25
|
1
|
Molly
|
Jacobson
|
52
|
24
|
94
|
2
|
Tina
|
Ali
|
36
|
31
|
57
|
3
|
Jake
|
Milner
|
24
|
2
|
62
|
4
|
Amy
|
Cooze
|
73
|
3
|
70
|
# Get the index of the largest value in the preTestScore column
df['preTestScore'].idxmax()
# 2
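The label returned by `idxmax` is most useful when passed back to `.loc` to retrieve the whole row. A short sketch on a trimmed copy of the data; the names `top_label` and `top_row` are introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'preTestScore': [4, 24, 31, 2, 3]})

# Index label of the row with the highest preTestScore
top_label = df['preTestScore'].idxmax()

# Look up the full row for that label
top_row = df.loc[top_label]
```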
Find unique values in a dataframe
import pandas as pd
import numpy as np
# Create a dataframe
raw_data = {'regiment': ['51st', '29th', '2nd', '19th', '12th', '101st', '90th', '30th', '193th', '1st', '94th', '91th'], 'trucks': ['MAZ-7310', np.nan, 'MAZ-7310', 'MAZ-7310', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'ZIS-150', 'Tatra 810', 'ZIS-150', 'ZIS-150'],'tanks': ['Merkava Mark 4', 'Merkava Mark 4', 'Merkava Mark 4', 'Leopard 2A6M', 'Leopard 2A6M', 'Leopard 2A6M', 'Arjun MBT', 'Leopard 2A6M', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT'],'aircraft': ['none', 'none', 'none', 'Harbin Z-9', 'Harbin Z-9', 'none', 'Harbin Z-9', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk']}
df = pd.DataFrame(raw_data, columns = ['regiment', 'trucks', 'tanks', 'aircraft'])
# View the first few rows
df.head()
|
regiment
|
trucks
|
tanks
|
aircraft
|
0
|
51st
|
MAZ-7310
|
Merkava Mark 4
|
none
|
1
|
29th
|
NaN
|
Merkava Mark 4
|
none
|
2
|
2nd
|
MAZ-7310
|
Merkava Mark 4
|
none
|
3
|
19th
|
MAZ-7310
|
Leopard 2A6M
|
Harbin Z-9
|
4
|
12th
|
Tatra 810
|
Leopard 2A6M
|
Harbin Z-9
|
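# Create a list of unique values
# by converting the pandas column into a set (the order is not guaranteed)
list(set(df.trucks))
# [nan, 'Tatra 810', 'MAZ-7310', 'ZIS-150']
# Create a list of the unique values in df.trucks, in order of first appearance
list(df['trucks'].unique())
# ['MAZ-7310', nan, 'Tatra 810', 'ZIS-150']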
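Geocoding and reverse geocoding
When working with geographic data, geocoding (converting a physical address or location into latitude and longitude) and reverse geocoding (converting latitude and longitude into a physical address or location) are common tasks.
Python offers a number of packages that make these tasks very easy. In the tutorial below, I use pygeocoder (a wrapper for Google's geo-API) for both geocoding and reverse geocoding.
First, we load the packages we want to use in the script. Specifically, I am loading pygeocoder for its geographic functions, pandas for its dataframe structure, and numpy for its missing value (np.nan) support.
# Load packages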
from pygeocoder import Geocoder
import pandas as pd
import numpy as np
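Geographic data comes in many forms. In this case we have a Python dictionary of five latitude and longitude strings, with each coordinate pair separated by a comma.
# Create a dictionary of raw data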
data = {'Site 1': '31.336968, -109.560959','Site 2': '31.347745, -108.229963','Site 3': '32.277621, -107.734724','Site 4': '31.655494, -106.420484','Site 5': '30.295053, -104.014528'}
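Although it is technically unnecessary, I started out in R and am a big fan of dataframes, so let's turn the dictionary of simulated data into a dataframe.
# Convert the dictionary into a pandas dataframe
df = pd.DataFrame.from_dict(data, orient='index')
# View the dataframe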
df
|
0
|
Site 1
|
31.336968, -109.560959
|
Site 2
|
31.347745, -108.229963
|
Site 3
|
32.277621, -107.734724
|
Site 4
|
31.655494, -106.420484
|
Site 5
|
30.295053, -104.014528
|
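You can now see that we have a dataframe with five rows, each containing a latitude and longitude string. Before we can work with the data, we need to 1) split each string into latitude and longitude and 2) convert them to floats. The code below does just that.
# Create two lists for the loop results
lat = []
lon = []
# For each row in the variable
for row in df[0]:
    # Try to
    try:
        # split the row on the comma
        lat_str, lon_str = row.split(',')
        # and convert both parts to floats
        lat_value, lon_value = float(lat_str), float(lon_str)
    # But if the row cannot be split or converted
    except (ValueError, AttributeError):
        # use missing values for both
        lat_value, lon_value = np.nan, np.nan
    # Append everything before the comma to lat and everything after it to lon
    lat.append(lat_value)
    lon.append(lon_value)
# Create two new columns from lat and lon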
df['latitude'] = lat
df['longitude'] = lon
Let's take a look at what we have now.
# View the dataframe
df
|
0
|
latitude
|
longitude
|
Site 1
|
31.336968, -109.560959
|
31.336968
|
-109.560959
|
Site 2
|
31.347745, -108.229963
|
31.347745
|
-108.229963
|
Site 3
|
32.277621, -107.734724
|
32.277621
|
-107.734724
|
Site 4
|
31.655494, -106.420484
|
31.655494
|
-106.420484
|
Site 5
|
30.295053, -104.014528
|
30.295053
|
-104.014528
|
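The loop that splits the coordinate strings can also be expressed with pandas string methods. This sketch assumes the same "lat, lon" string layout, shown here with two of the sites; `errors='coerce'` plays the role of the missing-value fallback by turning unparseable entries into NaN.

```python
import pandas as pd

data = {'Site 1': '31.336968, -109.560959',
        'Site 2': '31.347745, -108.229963'}
df = pd.DataFrame.from_dict(data, orient='index')

# Split each string on the comma into two text columns
parts = df[0].str.split(',', expand=True)

# Strip whitespace and convert to floats; bad values become NaN
df['latitude'] = pd.to_numeric(parts[0].str.strip(), errors='coerce')
df['longitude'] = pd.to_numeric(parts[1].str.strip(), errors='coerce')
```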
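Great. This is exactly what we want to see: one column of floats for latitude and one column of floats for longitude.
To reverse geocode, we pass a specific latitude and longitude pair (here the first row, at position 0) to pygeocoder's reverse_geocode function.
# Convert the latitude and longitude into a location
results = Geocoder.reverse_geocode(df['latitude'].iloc[0], df['longitude'].iloc[0])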
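Now we can start pulling out the data we want.
# Print the latitude and longitude
results.coordinates
# (31.3372728, -109.5609559)
# Print the city
results.city
# 'Douglas'
# Print the country
results.country
# 'United States'
# Print the street address (if available)
results.street_address
# Print the administrative area
results.administrative_area_level_1
# 'Arizona'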
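To geocode, we pass a string containing an address or location (such as a city) into the geocoding function. However, not every string is formatted in a way that Google's geo-API can understand. We can check whether an address can be converted by using .geocode().valid_address.
# Verify that the address is valid (i.e. it is in Google's system)
Geocoder.geocode("4207 N Washington Ave, Douglas, AZ 85607").valid_address
# True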
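Because the output is True, we know this is a valid address, so we can print the latitude and longitude coordinates.
# Print the latitude and longitude
results.coordinates
# (31.3372728, -109.5609559)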
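More interestingly, once the address has been processed by the Google geo-API, we can parse it and easily separate out the street number, street name, and so on.
# Look up the latitude and longitude of a specific address
result = Geocoder.geocode("7250 South Tucson Boulevard, Tucson, AZ 85756")
# Print the street number
result.street_number
# '7250'
# Print the street name
result.route
# 'South Tucson Boulevard'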
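And there you have it. Python makes the whole process easy, and the analysis takes only a few minutes to complete. Good luck!
Geolocate a city and country
This tutorial creates a function that takes a city and country and tries to return their latitude and longitude. When no city is given (which is often the case), it returns the latitude and longitude of the center of the country instead.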
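from geopy.geocoders import Nominatim
import numpy as np
# Recent geopy releases require a user_agent; the name used here is only a placeholder
geolocator = Nominatim(user_agent="geolocate-tutorial")

def geolocate(city=None, country=None):
    '''
    Takes a city and country, or only a country. Returns the latitude and
    longitude of the city if a city is given, otherwise returns the latitude
    and longitude of the center of the country.
    '''
    # If a city was given
    if city is not None:
        # Try to
        try:
            # geolocate the city and country
            loc = geolocator.geocode(str(city + ',' + country))
            # and return the latitude and longitude
            return (loc.latitude, loc.longitude)
        # Otherwise
        except Exception:
            # return a missing value
            return np.nan
    # If no city was given
    else:
        # Try to
        try:
            # geolocate the center of the country
            loc = geolocator.geocode(country)
            # and return the latitude and longitude
            return (loc.latitude, loc.longitude)
        # Otherwise
        except Exception:
            # return a missing value
            return np.nan

# Geolocate a city and country
geolocate(city='Austin', country='USA')
# (30.2711286, -97.7436995)
# Geolocate only a country
geolocate(country='USA')
# (39.7837304, -100.4458824)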
Group a time series with pandas
# Import required modules
import pandas as pd
import numpy as np
# Create a dataframe of random troop counts with an hourly index
df = pd.DataFrame()
df['german_army'] = np.random.randint(low=20000, high=30000, size=100)
df['allied_army'] = np.random.randint(low=20000, high=40000, size=100)
df.index = pd.date_range('1/1/2014', periods=100, freq='H')
df.head()
| | german_army | allied_army |
| 2014-01-01 00:00:00 | 28755 | 33938 |
| 2014-01-01 01:00:00 | 25176 | 28631 |
| 2014-01-01 02:00:00 | 23261 | 39685 |
| 2014-01-01 03:00:00 | 28686 | 27756 |
| 2014-01-01 04:00:00 | 24588 | 25681 |
# Truncate the dataframe to the rows between two dates
df.truncate(before='1/2/2014', after='1/3/2014')
|
german_army
|
allied_army
|
2014-01-02 00:00:00
|
26401
|
20189
|
2014-01-02 01:00:00
|
29958
|
23934
|
2014-01-02 02:00:00
|
24492
|
39075
|
2014-01-02 03:00:00
|
25707
|
39262
|
2014-01-02 04:00:00
|
27129
|
35961
|
2014-01-02 05:00:00
|
27903
|
25418
|
2014-01-02 06:00:00
|
20409
|
25163
|
2014-01-02 07:00:00
|
25736
|
34794
|
2014-01-02 08:00:00
|
24057
|
27209
|
2014-01-02 09:00:00
|
26875
|
33402
|
2014-01-02 10:00:00
|
23963
|
38575
|
2014-01-02 11:00:00
|
27506
|
31859
|
2014-01-02 12:00:00
|
23564
|
25750
|
2014-01-02 13:00:00
|
27958
|
24365
|
2014-01-02 14:00:00
|
24915
|
38866
|
2014-01-02 15:00:00
|
23538
|
33820
|
2014-01-02 16:00:00
|
23361
|
30080
|
2014-01-02 17:00:00
|
27284
|
22922
|
2014-01-02 18:00:00
|
24176
|
32155
|
2014-01-02 19:00:00
|
23924
|
27763
|
2014-01-02 20:00:00
|
23111
|
32343
|
2014-01-02 21:00:00
|
20348
|
28907
|
2014-01-02 22:00:00
|
27136
|
38634
|
2014-01-02 23:00:00
|
28649
|
29950
|
2014-01-03 00:00:00
|
21292
|
26395
|
# Shift the dataframe's index forward by 4 months and 5 days
df.index = df.index + pd.DateOffset(months=4, days=5)
df.head()
|
german_army
|
allied_army
|
2014-05-06 00:00:00
|
28755
|
33938
|
2014-05-06 01:00:00
|
25176
|
28631
|
2014-05-06 02:00:00
|
23261
|
39685
|
2014-05-06 03:00:00
|
28686
|
27756
|
2014-05-06 04:00:00
|
24588
|
25681
|
# Shift the values forward by one hour (each row now shows the previous hour's value)
df.shift(1).head()
|
german_army
|
allied_army
|
2014-05-06 00:00:00
|
NaN
|
NaN
|
2014-05-06 01:00:00
|
28755.0
|
33938.0
|
2014-05-06 02:00:00
|
25176.0
|
28631.0
|
2014-05-06 03:00:00
|
23261.0
|
39685.0
|
2014-05-06 04:00:00
|
28686.0
|
27756.0
|
# Shift the values back by one hour (each row now shows the next hour's value)
df.shift(-1).tail()
|
german_army
|
allied_army
|
2014-05-09 23:00:00
|
26903.0
|
39144.0
|
2014-05-10 00:00:00
|
27576.0
|
39759.0
|
2014-05-10 01:00:00
|
25232.0
|
35246.0
|
2014-05-10 02:00:00
|
23391.0
|
21044.0
|
2014-05-10 03:00:00
|
NaN
|
NaN
|
# Aggregate into days by summing the hourly observations
df.resample('D').sum()
|
german_army
|
allied_army
|
2014-05-06
|
605161
|
755962
|
2014-05-07
|
608100
|
740396
|
2014-05-08
|
589744
|
700297
|
2014-05-09
|
607092
|
719283
|
2014-05-10
|
103102
|
135193
|
# Aggregate into days by averaging the hourly observations
df.resample('D').mean()
|
german_army
|
allied_army
|
2014-05-06
|
25215.041667
|
31498.416667
|
2014-05-07
|
25337.500000
|
30849.833333
|
2014-05-08
|
24572.666667
|
29179.041667
|
2014-05-09
|
25295.500000
|
29970.125000
|
2014-05-10
|
25775.500000
|
33798.250000
|
# Aggregate into days by taking the minimum of the hourly observations
df.resample('D').min()
# Aggregate into days by taking the median of the hourly observations
df.resample('D').median()
|
german_army
|
allied_army
|
2014-05-06
|
24882.0
|
31310.0
|
2014-05-07
|
25311.0
|
30969.5
|
2014-05-08
|
24422.5
|
28318.0
|
2014-05-09
|
24941.5
|
32082.5
|
2014-05-10
|
26067.5
|
37195.0
|
# Aggregate into days by taking the first of the hourly observations
df.resample('D').first()
|
german_army
|
allied_army
|
2014-05-06
|
28755
|
33938
|
2014-05-07
|
26401
|
20189
|
2014-05-08
|
21292
|
26395
|
2014-05-09
|
25764
|
22613
|
2014-05-10
|
26903
|
39144
|
# Aggregate into days by taking the last of the hourly observations
df.resample('D').last()
|
german_army
|
allied_army
|
2014-05-06
|
28214
|
32110
|
2014-05-07
|
28649
|
29950
|
2014-05-08
|
28379
|
32600
|
2014-05-09
|
26752
|
22379
|
2014-05-10
|
23391
|
21044
|
# Aggregate into days by taking the open (first), high, low, and close (last) of the hourly observations
df.resample('D').ohlc()
|
german_army
|
allied_army
|
|
open
|
high
|
2014-05-06
|
28755
|
29206
|
2014-05-07
|
26401
|
29958
|
2014-05-08
|
21292
|
29786
|
2014-05-09
|
25764
|
29952
|
2014-05-10
|
26903
|
27576
|
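Several of the daily summaries above can be computed in one pass by handing a list of function names to `agg`. The sketch below uses a small deterministic series instead of random data so the result is easy to check by hand; the column name `troops` is made up for this illustration, and the hourly frequency is given as a `Timedelta` to avoid depending on a particular frequency alias.

```python
import pandas as pd

# 48 hourly observations (two full days) with values 0..47
index = pd.date_range('2014-01-01', periods=48, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({'troops': range(48)}, index=index)

# One row per day, one column per summary statistic
daily = df['troops'].resample('D').agg(['min', 'max', 'mean'])
```

The first day covers the values 0 through 23 and the second day 24 through 47, so the daily means are 11.5 and 35.5.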
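Group data by time
On March 13, 2016, pandas version 0.18.0 was released, with significant changes to how resampling works. This tutorial follows v0.18.0 and will not work with earlier versions of pandas.
First, let's load the modules we care about.
# Import required modules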
import pandas as pd
import datetime
import numpy as np
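Next, let's create some sample data that we can group by time. In this example, I create a dataframe with two columns and 365 rows. One column is a date, the second is a numeric value.
# Create a datetime variable for today
base = datetime.datetime.today()
# Create a list of datetime values
# covering the past 365 days
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
# Create a list of 365 numeric values
score_list = list(np.random.randint(low=1, high=1000, size=365))
# Create an empty dataframe
df = pd.DataFrame()
# Create a column from the datetime variable
df['datetime'] = date_list
# Convert that column to datetime type
df['datetime'] = pd.to_datetime(df['datetime'])
# Set the datetime column as the index
df.index = df['datetime']
# Create a column for the numeric score variable
df['score'] = score_list
# Let's take a look at the data
df.head()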
| datetime (index) | datetime | score |
| --- | --- | --- |
| 2016-06-02 09:57:54.793972 | 2016-06-02 09:57:54.793972 | 900 |
| 2016-06-01 09:57:54.793972 | 2016-06-01 09:57:54.793972 | 121 |
| 2016-05-31 09:57:54.793972 | 2016-05-31 09:57:54.793972 | 547 |
| 2016-05-30 09:57:54.793972 | 2016-05-30 09:57:54.793972 | 504 |
| 2016-05-29 09:57:54.793972 | 2016-05-29 09:57:54.793972 | 304 |
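A similar frame can be built more compactly with `pd.date_range`. This is only a sketch, not the tutorial's method: the fixed end date, the seed, and the name `demo` are assumptions chosen for reproducibility, and unlike the frame above the dates run in ascending order and sit only in the index.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed end date so the example is reproducible
end = pd.Timestamp('2016-06-02')

# 365 consecutive days ending on `end`
dates = pd.date_range(end=end, periods=365, freq='D')

# Seeded generator; the seed value is arbitrary
rng = np.random.default_rng(0)
scores = rng.integers(low=1, high=1000, size=365)

# Build the frame with the dates as its index in one step
demo = pd.DataFrame({'score': scores}, index=dates)
demo.index.name = 'datetime'
print(demo.head())
```

Keeping the dates only in the index avoids carrying a duplicate `datetime` column through later aggregations.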
In pandas, the most common way to group by time is the .resample() function. As of v0.18.0 this function is two-stage: df.resample('M') creates an object, to which we can then apply other functions (mean, count, sum, etc.).
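The two stages can be made visible by storing the intermediate object. The sketch below uses hypothetical data (`demo`, a counter over 90 days) and the month-start alias `'MS'` instead of `'M'`, since recent pandas releases have renamed the month-end alias and `'MS'` should behave the same across versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one observation per day for 90 days
index = pd.date_range('2016-01-01', periods=90, freq='D')
demo = pd.DataFrame({'score': np.arange(90)}, index=index)

# Stage 1: resample() only defines the bins; nothing is aggregated yet
monthly = demo.resample('MS')
print(type(monthly).__name__)

# Stage 2: apply one or more aggregations to the same Resampler object
monthly_mean = monthly.mean()
monthly_count = monthly.count()
monthly_both = monthly.agg(['mean', 'count'])
print(monthly_both)
```

Reusing one Resampler for several aggregations keeps the binning rule in a single place.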
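# Group the data by month and take the mean of each group (i.e. each month)
# (the numeric column is selected explicitly, since recent pandas versions no longer drop the datetime column silently)
df[['score']].resample('M').mean()
| datetime | score |
| --- | --- |
| 2015-06-30 | 513.629630 |
| 2015-07-31 | 561.516129 |
| 2015-08-31 | 448.032258 |
| 2015-09-30 | 548.000000 |
| 2015-10-31 | 480.419355 |
| 2015-11-30 | 487.033333 |
| 2015-12-31 | 499.935484 |
| 2016-01-31 | 429.193548 |
| 2016-02-29 | 520.413793 |
| 2016-03-31 | 349.806452 |
| 2016-04-30 | 395.500000 |
| 2016-05-31 | 503.451613 |
| 2016-06-30 | 510.500000 |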
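# Group the data by month and take the sum of each group (i.e. each month)
# (the numeric column is selected explicitly, since a datetime column cannot be summed)
df[['score']].resample('M').sum()
| datetime | score |
| --- | --- |
| 2015-06-30 | 13868 |
| 2015-07-31 | 17407 |
| 2015-08-31 | 13889 |
| 2015-09-30 | 16440 |
| 2015-10-31 | 14893 |
| 2015-11-30 | 14611 |
| 2015-12-31 | 15498 |
| 2016-01-31 | 13305 |
| 2016-02-29 | 15092 |
| 2016-03-31 | 10844 |
| 2016-04-30 | 11865 |
| 2016-05-31 | 15607 |
| 2016-06-30 | 1021 |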
There are many options for grouping. You can learn more about them in pandas' time series documentation, but for your convenience I have also listed them below.
| Value | Description |
| --- | --- |
| B | business day frequency |
| C | custom business day frequency (experimental) |
| D | calendar day frequency |
| W | weekly frequency |
| M | month end frequency |
| BM | business month end frequency |
| CBM | custom business month end frequency |
| MS | month start frequency |
| BMS | business month start frequency |
| Q | quarter end frequency |
| BQ | business quarter end frequency |
| QS | quarter start frequency |
| BQS | business quarter start frequency |
| A | year end frequency |
| BA | business year end frequency |
| AS | year start frequency |
| BAS | business year start frequency |
| BH | business hour frequency |
| H | hourly frequency |
| T | minutely frequency |
| S | secondly frequency |
| L | milliseconds |
| U | microseconds |
| N | nanoseconds |
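To see how the alias changes the bins, the same series can be resampled at several frequencies. This is a minimal sketch on hypothetical data (`ones`, the value 1 for every day of 2015), so each sum is simply the number of days in the bin. It sticks to `'W'`, `'MS'`, and `'QS'`; as far as I know, newer pandas releases have renamed some of the other aliases in the table (for example `'M'` to `'ME'`, `'H'` to `'h'`, `'T'` to `'min'`), so check the documentation for the version you run.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the value 1 on every day of 2015 (365 rows)
index = pd.date_range('2015-01-01', '2015-12-31', freq='D')
ones = pd.Series(np.ones(len(index), dtype=int), index=index)

# The same series summed at three different frequencies
by_week = ones.resample('W').sum()      # weekly bins
by_month = ones.resample('MS').sum()    # bins labelled by month start
by_quarter = ones.resample('QS').sum()  # bins labelled by quarter start

print(by_month)
print(by_quarter)
```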
Group Data by Hour
# Import libraries
import pandas as pd
import numpy as np
# Create a time series of 2000 elements,
# one every five minutes, starting on 1/1/2000
time = pd.date_range('1/1/2000', periods=2000, freq='5min')
# Create a pandas series of random integers from 0 to 99,
# using time as the index
series = pd.Series(np.random.randint(100, size=2000), index=time)
# View the first few rows
series[0:10]
'''
2000-01-01 00:00:00 40
2000-01-01 00:05:00 13
2000-01-01 00:10:00 99
2000-01-01 00:15:00 72
2000-01-01 00:20:00 4
2000-01-01 00:25:00 36
2000-01-01 00:30:00 24
2000-01-01 00:35:00 20
2000-01-01 00:40:00 83
2000-01-01 00:45:00 44
Freq: 5T, dtype: int64
'''
# Group the data by the hour value of the index, then aggregate by taking the mean
series.groupby(series.index.hour).mean()
'''
0 50.380952
1 49.380952
2 49.904762
3 53.273810
4 47.178571
5 46.095238
6 49.047619
7 44.297619
8 53.119048
9 48.261905
10 45.166667
11 54.214286
12 50.714286
13 56.130952
14 50.916667
15 42.428571
16 46.880952
17 56.892857
18 54.071429
19 47.607143
20 50.940476
21 50.511905
22 44.550000
23 50.250000
dtype: float64
'''
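Grouping by `series.index.hour` pools the same clock hour across all days, which is different from resampling to hourly bins. The sketch below contrasts the two on hypothetical counter data covering exactly two days; `'60min'` is used as the hourly frequency string because it should be accepted by both older and newer pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 576 five-minute observations, i.e. exactly two days
time = pd.date_range('1/1/2000', periods=576, freq='5min')
series = pd.Series(np.arange(576), index=time)

# groupby on index.hour pools the same clock hour across all days: 24 groups
by_clock_hour = series.groupby(series.index.hour).mean()

# resample keeps each hour of each day separate: 48 bins
by_each_hour = series.resample('60min').mean()

print(len(by_clock_hour), len(by_each_hour))
```

Use the groupby form for a typical daily profile and the resample form for a chronological hourly series.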
Group Rows
# Import modules
import pandas as pd
# Example dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
| | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
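# Create a groupby object. In other words,
# create an object that represents this particular grouping.
# Here we group the pre-test scores by regiment.
regiment_preScore = df['preTestScore'].groupby(df['regiment'])
# Show the mean pre-test score for each regiment
regiment_preScore.mean()
'''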
regiment
Dragoons 15.50
Nighthawks 15.25
Scouts 2.50
Name: preTestScore, dtype: float64
'''
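The groupby object can be reused for more than one statistic, and the grouping can use more than one key. The sketch below works on a reduced, hypothetical version of the regiment data (six rows instead of twelve), so its numbers differ from the output above.

```python
import pandas as pd

# A reduced, hypothetical version of the regiment data
df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts'],
    'company': ['1st', '2nd', '1st', '2nd', '1st', '2nd'],
    'preTestScore': [4, 31, 3, 24, 2, 2],
})

# Equivalent spelling of the grouping: select the column after grouping
by_regiment = df.groupby('regiment')['preTestScore']

# Several summary statistics at once
summary = by_regiment.agg(['mean', 'min', 'max', 'count'])
print(summary)

# Grouping by two keys gives a result indexed by (regiment, company)
by_both = df.groupby(['regiment', 'company'])['preTestScore'].mean()
print(by_both)
```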
Hierarchical Data in Pandas
# Import modules
import pandas as pd
# Create the dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
| | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
# Set the hierarchical index but leave the columns in place
df = df.set_index(['regiment', 'company'], drop=False)
df
| regiment (index) | company (index) | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| Nighthawks | 1st | Nighthawks | 1st | Miller | 4 | 25 |
| Nighthawks | 1st | Nighthawks | 1st | Jacobson | 24 | 94 |
| Nighthawks | 2nd | Nighthawks | 2nd | Ali | 31 | 57 |
| Nighthawks | 2nd | Nighthawks | 2nd | Milner | 2 | 62 |
| Dragoons | 1st | Dragoons | 1st | Cooze | 3 | 70 |
| Dragoons | 1st | Dragoons | 1st | Jacon | 4 | 25 |
| Dragoons | 2nd | Dragoons | 2nd | Ryaner | 24 | 94 |
| Dragoons | 2nd | Dragoons | 2nd | Sone | 31 | 57 |
| Scouts | 1st | Scouts | 1st | Sloan | 2 | 62 |
| Scouts | 1st | Scouts | 1st | Piger | 3 | 70 |
| Scouts | 2nd | Scouts | 2nd | Riani | 2 | 62 |
| Scouts | 2nd | Scouts | 2nd | Ali | 3 | 70 |
# Set the hierarchical index to regiment, then company
df = df.set_index(['regiment', 'company'])
df
| regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- |
| Nighthawks | 1st | Miller | 4 | 25 |
| Nighthawks | 1st | Jacobson | 24 | 94 |
| Nighthawks | 2nd | Ali | 31 | 57 |
| Nighthawks | 2nd | Milner | 2 | 62 |
| Dragoons | 1st | Cooze | 3 | 70 |
| Dragoons | 1st | Jacon | 4 | 25 |
| Dragoons | 2nd | Ryaner | 24 | 94 |
| Dragoons | 2nd | Sone | 31 | 57 |
| Scouts | 1st | Sloan | 2 | 62 |
| Scouts | 1st | Piger | 3 | 70 |
| Scouts | 2nd | Riani | 2 | 62 |
| Scouts | 2nd | Ali | 3 | 70 |
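Once the index is hierarchical, rows can be selected by one level or by a full combination of levels. The following is a sketch on a reduced, hypothetical version of the regiment data; `.loc` and `.xs` are standard pandas accessors, while the variable names are only illustrative.

```python
import pandas as pd

# A reduced, hypothetical version of the regiment data
df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons'],
    'company': ['1st', '2nd', '1st', '2nd'],
    'name': ['Miller', 'Ali', 'Cooze', 'Ryaner'],
    'preTestScore': [4, 31, 3, 24],
}).set_index(['regiment', 'company'])

# Sort the index; lookups on an unsorted MultiIndex can trigger performance warnings
df = df.sort_index()

# All rows of one regiment (outer level)
nighthawks = df.loc['Nighthawks']

# One exact (regiment, company) combination, all columns
miller = df.loc[('Nighthawks', '1st'), :]

# All rows of one company across regiments (inner level)
first_companies = df.xs('1st', level='company')
print(first_companies)
```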
# View the index
df.index
'''
MultiIndex(levels=[['Dragoons', 'Nighthawks', 'Scouts'], ['1st', '2nd']],
           labels=[[1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]],
           names=['regiment', 'company'])
'''
# Swap the levels in the index
df.swaplevel('regiment', 'company')
| company | regiment | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- |
| 1st | Nighthawks | Miller | 4 | 25 |
| 1st | Nighthawks | Jacobson | 24 | 94 |
| 2nd | Nighthawks | Ali | 31 | 57 |
| 2nd | Nighthawks | Milner | 2 | 62 |
| 1st | Dragoons | Cooze | 3 | 70 |
| 1st | Dragoons | Jacon | 4 | 25 |
| 2nd | Dragoons | Ryaner | 24 | 94 |
| 2nd | Dragoons | Sone | 31 | 57 |
| 1st | Scouts | Sloan | 2 | 62 |
| 1st | Scouts | Piger | 3 | 70 |
| 2nd | Scouts | Riani | 2 | 62 |
| 2nd | Scouts | Ali | 3 | 70 |
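Note that swaplevel only reorders the index levels; the rows stay where they were. If the goal is to see the rows grouped under the new outer level, a sort_index call can follow. The sketch below shows this on a reduced, hypothetical version of the regiment data.

```python
import pandas as pd

# A reduced, hypothetical version of the regiment data
df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons'],
    'company': ['1st', '2nd', '1st', '2nd'],
    'preTestScore': [4, 31, 3, 24],
}).set_index(['regiment', 'company'])

# swaplevel changes the order of the levels but not the order of the rows
swapped = df.swaplevel('regiment', 'company')
print(list(swapped.index.names))

# sort_index then regroups the rows under the new outer level
regrouped = swapped.sort_index()
print(regrouped)
```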
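# Sum the data by index level (here, by regiment)
# (DataFrame.sum(level=...) was removed in pandas 2.0; groupby(level=...) is the equivalent)
df.groupby(level='regiment', sort=False)[['preTestScore', 'postTestScore']].sum()
| regiment | preTestScore | postTestScore |
| --- | --- | --- |
| Nighthawks | 61 | 238 |
| Dragoons | 62 | 246 |
| Scouts | 10 | 264 |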