十九、数据整理(上)

作者:Chris Albon

译者:飞龙

协议:CC BY-NC-SA 4.0

在 Pandas 中通过分组应用函数

import pandas as pd

# 创建示例数据帧
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
df
Casualties Platoon
0 1 A
1 4 A
2 5 A
3 7 A
4 5 A
5 5 A
6 6 B
7 1 B
8 4 B
9 5 B
10 6 B
11 7 C
12 4 C
13 6 C
14 4 C
15 6 C
# 按照 df.platoon 对 df 分组
# 然后将滚动平均 lambda 函数应用于 df.casualties
df.groupby('Platoon')['Casualties'].apply(lambda x: x.rolling(center=False, window=2).mean())
'''
0     NaN
1     2.5
2     4.5
3     6.0
4     6.0
5     5.0
6     NaN
7     3.5
8     2.5
9     4.5
10    5.5
11    NaN
12    5.5
13    5.0
14    5.0
15    5.0
dtype: float64
'''
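
下面是一个大致等价的写法的示意(假设所用的 pandas 版本支持在 GroupBy 对象上直接调用 rolling,并沿用上面创建的 df),结果会带有 (Platoon, 原行索引) 的多级索引:

# 直接在分组对象上做窗口为 2 的滚动平均
df.groupby('Platoon')['Casualties'].rolling(window=2).mean()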

在 Pandas 中向分组应用操作

# 导入模块
import pandas as pd

# 创建数据帧
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
4 Dragoons 1st Cooze 3 70
5 Dragoons 1st Jacon 4 25
6 Dragoons 2nd Ryaner 24 94
7 Dragoons 2nd Sone 31 57
8 Scouts 1st Sloan 2 62
9 Scouts 1st Piger 3 70
10 Scouts 2nd Riani 2 62
11 Scouts 2nd Ali 3 70
# 创建一个 groupby 变量,按团队(regiment)对 preTestScores 分组
groupby_regiment = df['preTestScore'].groupby(df['regiment'])
groupby_regiment
# <pandas.core.groupby.SeriesGroupBy object at 0x113ddb550>

“这个分组变量现在是GroupBy对象。 除了分组的键df['key1']的一些中间数据之外,它实际上还没有计算任何东西。 我们的想法是,该对象具有将所有操作应用于每个分组所需的所有信息。” – PyDA

使用list()显示分组的样子。

list(df['preTestScore'].groupby(df['regiment']))
'''
[('Dragoons', 4      3
5      4
6     24
7     31
Name: preTestScore, dtype: int64), ('Nighthawks', 0     4
1    24
2    31
3     2
Name: preTestScore, dtype: int64), ('Scouts', 8     2
9     3
10    2
11    3
Name: preTestScore, dtype: int64)]
'''

df['preTestScore'].groupby(df['regiment']).describe()
count mean std min 25% 50% 75% max
regiment
Dragoons 4.0 15.50 14.153916 3.0 3.75 14.0 25.75 31.0
Nighthawks 4.0 15.25 14.453950 2.0 3.50 14.0 25.75 31.0
Scouts 4.0 2.50 0.577350 2.0 2.00 2.5 3.00 3.0
# 每个团队的 preTestScore 均值
groupby_regiment.mean()
'''
regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64
'''

df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
'''
regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64
'''

df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
company 1st 2nd
regiment
Dragoons 3.5 27.5
Nighthawks 14.0 16.5
Scouts 2.5 2.5
# 按团队和公司(company)对整个数据帧分组
df.groupby(['regiment', 'company']).mean()
preTestScore postTestScore
regiment company
Dragoons 1st 3.5 47.5
2nd 27.5 75.5
Nighthawks 1st 14.0 59.5
2nd 16.5 59.5
Scouts 1st 2.5 66.0
2nd 2.5 66.0
# 每个团队和公司的观测数量
df.groupby(['regiment', 'company']).size()
'''
regiment    company
Dragoons    1st        2
            2nd        2
Nighthawks  1st        2
            2nd        2
Scouts      1st        2
            2nd        2
dtype: int64
'''

# 按团队对数据帧分组,然后对于每个团队
for name, group in df.groupby('regiment'):
    # 打印团队名称
    print(name)
    # 打印它的数据
    print(group)
'''
Dragoons
     regiment company    name  preTestScore  postTestScore
4  Dragoons     1st   Cooze             3             70
5  Dragoons     1st   Jacon             4             25
6  Dragoons     2nd  Ryaner            24             94
7  Dragoons     2nd    Sone            31             57
Nighthawks
     regiment company      name  preTestScore  postTestScore
0  Nighthawks     1st    Miller             4             25
1  Nighthawks     1st  Jacobson            24             94
2  Nighthawks     2nd       Ali            31             57
3  Nighthawks     2nd    Milner             2             62
Scouts
   regiment company   name  preTestScore  postTestScore
8    Scouts     1st  Sloan             2             62
9    Scouts     1st  Piger             3             70
10   Scouts     2nd  Riani             2             62
11   Scouts     2nd    Ali             3             70
'''

按列分组:

具体来说,这里按数据类型对列分组(即 axis=1),然后使用 list() 查看分组的样子。

list(df.groupby(df.dtypes, axis=1))
'''
[(dtype('int64'),     preTestScore  postTestScore
  0              4             25
  1             24             94
  2             31             57
  3              2             62
  4              3             70
  5              4             25
  6             24             94
  7             31             57
  8              2             62
  9              3             70
  10             2             62
  11             3             70),
 (dtype('O'),       regiment company      name
  0   Nighthawks     1st    Miller
  1   Nighthawks     1st  Jacobson
  2   Nighthawks     2nd       Ali
  3   Nighthawks     2nd    Milner
  4     Dragoons     1st     Cooze
  5     Dragoons     1st     Jacon
  6     Dragoons     2nd    Ryaner
  7     Dragoons     2nd      Sone
  8       Scouts     1st     Sloan
  9       Scouts     1st     Piger
  10      Scouts     2nd     Riani
  11      Scouts     2nd       Ali)]
'''

df.groupby('regiment').mean().add_prefix('mean_')
mean_preTestScore mean_postTestScore
regiment
Dragoons 15.50 61.5
Nighthawks 15.25 59.5
Scouts 2.50 66.0
# 创建获取分组状态的函数
def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']

df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)
df['postTestScore'].groupby(df['categories']).apply(get_stats).unstack()
count max mean min
categories
Good 8.0 70.0 63.75 57.0
Great 2.0 94.0 94.00 94.0
Low 2.0 25.0 25.00 25.0
Okay 0.0 NaN NaN NaN
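
如果只需要这几个统计量,也可以不自定义 get_stats,改用 agg 一次计算,下面是一个示意(沿用上面的 df 和 categories 列,结果列的顺序可能与上表不同):

# 按分箱分组,一次计算多个统计量
df['postTestScore'].groupby(df['categories']).agg(['min', 'max', 'count', 'mean'])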

在 Pandas 数据帧上应用操作

# 导入模块
import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3],'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
coverage name reports year
Cochice 25 Jason 4 2012
Pima 94 Molly 24 2012
Santa Cruz 57 Tina 31 2013
Maricopa 62 Jake 2 2014
Yuma 70 Amy 3 2014
# 创建大写转换的 lambda 函数
capitalizer = lambda x: x.upper()

将 capitalizer 函数应用于 name 列。

apply()可以沿数据帧的任意轴应用函数。

df['name'].apply(capitalizer)
'''
Cochice       JASON
Pima          MOLLY
Santa Cruz     TINA
Maricopa       JAKE
Yuma            AMY
Name: name, dtype: object
'''

将 capitalizer lambda 函数映射到序列 name 中的每个元素。

map()对序列的每个元素应用操作。

df['name'].map(capitalizer)
'''
Cochice       JASON
Pima          MOLLY
Santa Cruz     TINA
Maricopa       JAKE
Yuma            AMY
Name: name, dtype: object
'''

将平方根函数应用于整个数据帧中的每个单元格。

applymap()将函数应用于整个数据帧中的每个元素。

# 删除字符串变量,以便 applymap() 可以运行
df = df.drop('name', axis=1)

# 返回数据帧每个单元格的平方根
df.applymap(np.sqrt)
coverage reports year
Cochice 5.000000 2.000000 44.855323
Pima 9.695360 4.898979 44.855323
Santa Cruz 7.549834 5.567764 44.866469
Maricopa 7.874008 1.414214 44.877611
Yuma 8.366600 1.732051 44.877611

在数据帧上应用函数。

# 创建叫做 times100 的函数
def times100(x):
    # 如果 x 是字符串,
    if type(x) is str:
        # 原样返回它
        return x
    # 如果不是,返回它乘上 100
    elif x:
        return 100 * x
    # 并留下其它东西
    else:
        return

df.applymap(times100)
coverage reports year
Cochice 2500 400 201200
Pima 9400 2400 201200
Santa Cruz 5700 3100 201300
Maricopa 6200 200 201400
Yuma 7000 300 201400

向 Pandas 数据帧赋予新列

import pandas as pd

# 创建空数据帧
df = pd.DataFrame()

# 创建一列
df['name'] = ['John', 'Steve', 'Sarah']

# 查看数据帧
df
name
0 John
1 Steve
2 Sarah
# 将一个新列赋予名为 age 的 df,它包含年龄列表
df.assign(age = [31, 32, 19])
name age
0 John 31
1 Steve 32
2 Sarah 19
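
注意 assign 返回的是新数据帧,上面的 df 本身并没有被修改;如果想保留这一列,需要把结果赋回去,示意如下(沿用上面的 df):

# 把 assign 的结果赋回 df,age 列才会真正保存下来
df = df.assign(age=[31, 32, 19])
df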

将列表拆分为大小为 N 的分块

在这个片段中,我们接受一个列表并将其分解为大小为 n 的块。 在处理具有最大请求大小的 API 时,这是一种非常常见的做法。

这个漂亮的函数由 Ned Batchelder 贡献,发布于 StackOverflow。

# 创建名称列表
first_names = ['Steve', 'Jane', 'Sara', 'Mary','Jack','Bob', 'Bily', 'Boni', 'Chris','Sori', 'Will', 'Won','Li']

# 创建叫做 chunks 的函数,有两个参数 l 和 n
def chunks(l, n):
    # 对于 0 到 len(l) 之间、步长为 n 的每个索引 i
    for i in range(0, len(l), n):
        # 产生 l 中从 i 开始、大小为 n 的分块
        yield l[i:i+n]

# 从函数 chunks 的结果创建一个列表
list(chunks(first_names, 5))
'''
[['Steve', 'Jane', 'Sara', 'Mary', 'Jack'],['Bob', 'Bily', 'Boni', 'Chris', 'Sori'],['Will', 'Won', 'Li']]
'''

在 Pandas 中使用正则表达式将字符串分解为列

# 导入模块
import re
import pandas as pd

# 创建带有一列字符串的数据帧
data = {'raw': ['Arizona 1 2014-12-23       3242.0','Iowa 1 2010-02-23       3453.7','Oregon 0 2014-06-20       2123.0','Maryland 0 2014-03-14       1123.6','Florida 1 2013-01-15       2134.0','Georgia 0 2012-07-14       2345.6']}
df = pd.DataFrame(data, columns = ['raw'])
df
raw
0 Arizona 1 2014-12-23 3242.0
1 Iowa 1 2010-02-23 3453.7
2 Oregon 0 2014-06-20 2123.0
3 Maryland 0 2014-03-14 1123.6
4 Florida 1 2013-01-15 2134.0
5 Georgia 0 2012-07-14 2345.6
# df['raw'] 的哪些行包含 'xxxx-xx-xx'?
df['raw'].str.contains('....-..-..', regex=True)
'''
0    True
1    True
2    True
3    True
4    True
5    True
Name: raw, dtype: bool
'''

# 在 raw 列中,提取字符串中的单个数字
df['female'] = df['raw'].str.extract('(\d)', expand=True)
df['female']
'''
0    1
1    1
2    0
3    0
4    1
5    0
Name: female, dtype: object
'''

# 在 raw 列中,提取字符串中的 xxxx-xx-xx
df['date'] = df['raw'].str.extract('(....-..-..)', expand=True)
df['date']
'''
0    2014-12-23
1    2010-02-23
2    2014-06-20
3    2014-03-14
4    2013-01-15
5    2012-07-14
Name: date, dtype: object
'''

# 在 raw 列中,提取字符串中的 ####.##
df['score'] = df['raw'].str.extract('(\d\d\d\d\.\d)', expand=True)
df['score']
'''
0    3242.0
1    3453.7
2    2123.0
3    1123.6
4    2134.0
5    2345.6
Name: score, dtype: object
'''

# 在 raw 列中,提取字符串中的单词
df['state'] = df['raw'].str.extract('([A-Z]\w{0,})', expand=True)
df['state']
'''
0     Arizona
1        Iowa
2      Oregon
3    Maryland
4     Florida
5     Georgia
Name: state, dtype: object
'''

df
raw female date score state
0 Arizona 1 2014-12-23 3242.0 1 2014-12-23 3242.0 Arizona
1 Iowa 1 2010-02-23 3453.7 1 2010-02-23 3453.7 Iowa
2 Oregon 0 2014-06-20 2123.0 0 2014-06-20 2123.0 Oregon
3 Maryland 0 2014-03-14 1123.6 0 2014-03-14 1123.6 Maryland
4 Florida 1 2013-01-15 2134.0 1 2013-01-15 2134.0 Florida
5 Georgia 0 2012-07-14 2345.6 0 2012-07-14 2345.6 Georgia
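
作为一个示意,也可以用一次 str.extract 配合命名分组,同时提取上面这几列(假设 raw 列的格式与本例一致,组名 state、female、date、score 只是举例):

# 用命名分组一次提取 state、female、date、score 四列
parts = df['raw'].str.extract(
    r'(?P<state>[A-Z]\w*)\s+(?P<female>\d)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<score>\d{4}\.\d)',
    expand=True)
parts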

寻找两个数据帧共享的列

# 导入库
import pandas as pd

# 创建数据帧
dataframe_one = pd.DataFrame()
dataframe_one['1'] = ['1', '1', '1']
dataframe_one['B'] = ['b', 'b', 'b']

# 创建第二个数据帧
dataframe_two = pd.DataFrame()
dataframe_two['2'] = ['2', '2', '2']
dataframe_two['B'] = ['b', 'b', 'b']

# 将每个数据帧的列转换为集合,
# 然后找到这两个集合的交集。
# 这将是两个数据帧共享的列的集合。
set.intersection(set(dataframe_one), set(dataframe_two))
# {'B'}

从多个列表构建字典

# 创建军官姓名的列表
officer_names = ['Sodoni Dogla', 'Chris Jefferson', 'Jessica Billars', 'Michael Mulligan', 'Steven Johnson']

# 创建军官所属军队的列表
officer_armies = ['Purple Army', 'Orange Army', 'Green Army', 'Red Army', 'Blue Army']

# 创建字典,它是两个列表的 zip
dict(zip(officer_names, officer_armies))
'''
{'Chris Jefferson': 'Orange Army','Jessica Billars': 'Green Army','Michael Mulligan': 'Red Army','Sodoni Dogla': 'Purple Army','Steven Johnson': 'Blue Army'}
'''

将 CSV 转换为 Python 代码来重建它

# 导入 pandas 包
import pandas as pd

# 将 csv 文件加载为数据帧
df_original = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')
df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')

# 打印创建数据帧的代码
print('==============================')
print('RUN THE CODE BELOW THIS LINE')
print('==============================')
print('raw_data =', df.to_dict(orient='list'))
print('df = pd.DataFrame(raw_data, columns = ' + str(list(df_original)) + ')')
'''
==============================
RUN THE CODE BELOW THIS LINE
==============================
raw_data = {'Sepal.Length': [5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 
0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 
3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150]}
df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])
'''

# 如果你打算检查结果:
# 1. 把上面单元格生成的代码粘贴到这个单元格中
raw_data = {'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 
2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150], 'Sepal.Length': 
[5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 
4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996]}
df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])

# 查看用生成的代码重建的数据帧的前几行
df.head()
Unnamed: 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 1 5.1 3.5 1.4 0.2 setosa
1 2 4.9 3.0 1.4 0.2 setosa
2 3 4.7 3.2 1.3 0.2 setosa
3 4 4.6 3.1 1.5 0.2 setosa
4 5 5.0 3.6 1.4 0.2 setosa
# 查看原始数据帧的前几行
df_original.head()
Unnamed: 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 1 5.1 3.5 1.4 0.2 setosa
1 2 4.9 3.0 1.4 0.2 setosa
2 3 4.7 3.2 1.3 0.2 setosa
3 4 4.6 3.1 1.5 0.2 setosa
4 5 5.0 3.6 1.4 0.2 setosa

将分类变量转换为虚拟变量

# 导入模块
import pandas as pd

# 创建数据帧
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'sex': ['male', 'female', 'male', 'female', 'female']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'sex'])
df
first_name last_name sex
0 Jason Miller male
1 Molly Jacobson female
2 Tina Ali male
3 Jake Milner female
4 Amy Cooze female
# 从 sex 变量创建一组虚拟变量
df_sex = pd.get_dummies(df['sex'])

# 将虚拟变量连接到主数据帧
df_new = pd.concat([df, df_sex], axis=1)
df_new
first_name last_name sex female male
0 Jason Miller male 0.0 1.0
1 Molly Jacobson female 1.0 0.0
2 Tina Ali male 0.0 1.0
3 Jake Milner female 1.0 0.0
4 Amy Cooze female 1.0 0.0
# 连接新列的替代方案
df_new = df.join(df_sex)
df_new
first_name last_name sex female male
0 Jason Miller male 0.0 1.0
1 Molly Jacobson female 1.0 0.0
2 Tina Ali male 0.0 1.0
3 Jake Milner female 1.0 0.0
4 Amy Cooze female 1.0 0.0
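
一个更简短写法的示意:直接把整个数据帧和 columns 参数传给 pd.get_dummies,它会自动删除原来的 sex 列,并生成带前缀的虚拟变量列(列名形如 sex_female、sex_male):

# 一步完成:对 sex 列做独热编码,自动加上 sex_ 前缀
pd.get_dummies(df, columns=['sex'])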

将分类变量转换为虚拟变量

# 导入模块
import pandas as pd
import patsy

# 创建数据帧
raw_data = {'countrycode': [1, 2, 3, 2, 1]}
df = pd.DataFrame(raw_data, columns = ['countrycode'])
df
countrycode
0 1
1 2
2 3
3 2
4 1
# 将 countrycode 变量转换为三个二元变量
patsy.dmatrix('C(countrycode)-1', df, return_type='dataframe')
C(countrycode)[1] C(countrycode)[2] C(countrycode)[3]
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
3 0.0 1.0 0.0
4 1.0 0.0 0.0

将字符串分类变量转换为数字变量

# 导入模块
import pandas as pd

raw_data = {'patient': [1, 1, 1, 2, 2], 'obs': [1, 2, 3, 1, 2], 'treatment': [0, 1, 0, 1, 0],'score': ['strong', 'weak', 'normal', 'weak', 'strong']}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
df
patient obs treatment score
0 1 1 0 strong
1 1 2 1 weak
2 1 3 0 normal
3 2 1 1 weak
4 2 2 0 strong
# 创建一个函数,将 df['score'] 的所有值转换为数字
def score_to_numeric(x):
    if x == 'strong':
        return 3
    if x == 'normal':
        return 2
    if x == 'weak':
        return 1

df['score_num'] = df['score'].apply(score_to_numeric)
df
patient obs treatment score score_num
0 1 1 0 strong 3
1 1 2 1 weak 1
2 1 3 0 normal 2
3 2 1 1 weak 1
4 2 2 0 strong 3
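
一个等价的简短写法的示意:用 map 加一个字典完成同样的转换(沿用上面的 df):

# 用字典映射把字符串分数转换为数字
df['score_num'] = df['score'].map({'strong': 3, 'normal': 2, 'weak': 1})
df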

将变量转换为时间序列

# 导入库
import pandas as pd

# 创建带有日期字符串和分数的数据集
raw_data = {'date': ['2014-06-01T01:21:38.004053', '2014-06-02T01:21:38.004053', '2014-06-03T01:21:38.004053'],'score': [25, 94, 57]}
df = pd.DataFrame(raw_data, columns = ['date', 'score'])
df
date score
0 2014-06-01T01:21:38.004053 25
1 2014-06-02T01:21:38.004053 94
2 2014-06-03T01:21:38.004053 57
# 将 date 列转换为 datetime 类型,并将其设为数据帧的索引
df["date"] = pd.to_datetime(df["date"])
df = df.set_index(df["date"])
df
date score
date
2014-06-01 01:21:38.004053 2014-06-01 01:21:38.004053 25
2014-06-02 01:21:38.004053 2014-06-02 01:21:38.004053 94
2014-06-03 01:21:38.004053 2014-06-03 01:21:38.004053 57
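
把 date 设为 DatetimeIndex 之后,就可以用日期字符串直接按时间选取行,下面是一个示意(沿用上面的 df):

# 选取 2014-06-02 这一天的所有观测
df.loc['2014-06-02']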

在 Pandas 数据帧中计数

# 导入库
import pandas as pd

year = pd.Series([1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894])
guardCorps = pd.Series([0,2,2,1,0,0,1,1,0,3,0,2,1,0,0,1,0,1,0,1])
corps1 = pd.Series([0,0,0,2,0,3,0,2,0,0,0,1,1,1,0,2,0,3,1,0])
corps2 = pd.Series([0,0,0,2,0,2,0,0,1,1,0,0,2,1,1,0,0,2,0,0])
corps3 = pd.Series([0,0,0,1,1,1,2,0,2,0,0,0,1,0,1,2,1,0,0,0])
corps4 = pd.Series([0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,0,1,1,0,0])
corps5 = pd.Series([0,0,0,0,2,1,0,0,1,0,0,1,0,1,1,1,1,1,1,0])
corps6 = pd.Series([0,0,1,0,2,0,0,1,2,0,1,1,3,1,1,1,0,3,0,0])
corps7 = pd.Series([1,0,1,0,0,0,1,0,1,1,0,0,2,0,0,2,1,0,2,0])
corps8 = pd.Series([1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,1,0,1])
corps9 = pd.Series([0,0,0,0,0,2,1,1,1,0,2,1,1,0,1,2,0,1,0,0])
corps10 = pd.Series([0,0,1,1,0,1,0,2,0,2,0,0,0,0,2,1,3,0,1,1])
corps11 = pd.Series([0,0,0,0,2,4,0,1,3,0,1,1,1,1,2,1,3,1,3,1])
corps14 = pd.Series([ 1,1,2,1,1,3,0,4,0,1,0,3,2,1,0,2,1,1,0,0])
corps15 = pd.Series([0,1,0,0,0,0,0,1,0,1,1,0,0,0,2,2,0,0,0,0])

variables = dict(guardCorps = guardCorps, corps1 = corps1, corps2 = corps2, corps3 = corps3, corps4 = corps4, corps5 = corps5, corps6 = corps6, corps7 = corps7, corps8 = corps8, corps9 = corps9, corps10 = corps10, corps11 = corps11 , corps14 = corps14, corps15 = corps15)

horsekick = pd.DataFrame(variables, columns = ['guardCorps', 'corps1', 'corps2', 'corps3', 'corps4', 'corps5', 'corps6', 'corps7', 'corps8', 'corps9', 'corps10', 'corps11', 'corps14', 'corps15'])

horsekick.index = [1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894]

horsekick
guardCorps corps1 corps2 corps3 corps4 corps5 corps6 corps7 corps8 corps9 corps10 corps11 corps14 corps15
1875 0 0 0 0 0 0 0 1 1 0 0 0 1 0
1876 2 0 0 0 1 0 0 0 0 0 0 0 1 1
1877 2 0 0 0 0 0 1 1 0 0 1 0 2 0
1878 1 2 2 1 1 0 0 0 0 0 1 0 1 0
1879 0 0 0 1 1 2 2 0 1 0 0 2 1 0
1880 0 3 2 1 1 1 0 0 0 2 1 4 3 0
1881 1 0 0 2 1 0 0 1 0 1 0 0 0 0
1882 1 2 0 0 0 0 1 0 1 1 2 1 4 1
1883 0 0 1 2 0 1 2 1 0 1 0 3 0 0
1884 3 0 1 0 0 0 0 1 0 0 2 0 1 1
1885 0 0 0 0 0 0 1 0 0 2 0 1 0 1
1886 2 1 0 0 1 1 1 0 0 1 0 1 3 0
1887 1 1 2 1 0 0 3 2 1 1 0 1 2 0
1888 0 1 1 0 0 1 1 0 0 0 0 1 1 0
1889 0 0 1 1 0 1 1 0 0 1 2 2 0 2
1890 1 2 0 2 0 1 1 2 0 2 1 1 2 2
1891 0 0 0 1 1 1 0 1 1 0 3 3 1 0
1892 1 3 2 0 1 1 3 0 1 1 0 1 1 0
1893 0 1 0 0 0 1 0 2 0 0 1 3 0 0
1894 1 0 0 0 0 0 0 0 1 0 1 1 0 0
# 计算每个团队中每个死亡人数的次数
result = horsekick.apply(pd.value_counts).fillna(0); result

| | guardCorps | corps1 | corps2 | corps3 | corps4 | corps5 | corps6 | corps7 | corps8 | corps9 | corps10 | corps11 | corps14 | corps15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 9.0 | 11.0 | 12.0 | 11.0 | 12.0 | 10.0 | 9.0 | 11.0 | 13.0 | 10.0 | 10.0 | 6 | 6 | 14.0 |
| 1 | 7.0 | 4.0 | 4.0 | 6.0 | 8.0 | 9.0 | 7.0 | 6.0 | 7.0 | 7.0 | 6.0 | 8 | 8 | 4.0 |
| 2 | 3.0 | 3.0 | 4.0 | 3.0 | 0.0 | 1.0 | 2.0 | 3.0 | 0.0 | 3.0 | 3.0 | 2 | 3 | 2.0 |
| 3 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 | 2 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.0 |

# 计算每个月死亡总数出现在 guardCorps 的次数
pd.value_counts(horsekick['guardCorps'].values, sort=False)
'''
0    9
1    7
2    3
3    1
dtype: int64
'''

horsekick['guardCorps'].unique()
# array([0, 2, 1, 3])

在 Pandas 中创建流水线

Pandas 的流水线功能允许你将 Python 函数串联在一起,来构建数据处理流水线。

import pandas as pd

# 创建空数据帧
df = pd.DataFrame()

# 创建几列
df['name'] = ['John', 'Steve', 'Sarah']
df['gender'] = ['Male', 'Male', 'Female']
df['age'] = [31, 32, 19]

# 查看数据帧
df
name gender age
0 John Male 31
1 Steve Male 32
2 Sarah Female 19
# 创建函数,
def mean_age_by_group(dataframe, col):
    # 它按列分组数据,并返回每组的均值
    return dataframe.groupby(col).mean()

# 创建函数,
def uppercase_column_name(dataframe):
    # 它大写所有列标题
    dataframe.columns = dataframe.columns.str.upper()
    # 并返回它
    return dataframe

# 创建流水线,它先应用 mean_age_by_group 函数
(df.pipe(mean_age_by_group, col='gender')
   # 之后应用 uppercase_column_name 函数
   .pipe(uppercase_column_name)
)
AGE
gender
Female 19.0
Male 31.5
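
pipe 会把前一步的结果作为第一个参数传给下一个函数,因此上面的流水线大致等价于下面这种嵌套调用(示意,沿用上面定义的两个函数):

# 不用 pipe 的等价写法:函数嵌套调用,可读性较差
uppercase_column_name(mean_age_by_group(df, col='gender'))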

使用for循环创建 Pandas 列

import pandas as pd
import numpy as np

raw_data = {'student_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 'test_score': [76, 88, 84, 67, 53, 96, 64, 91, 77, 73, 52, np.NaN]}
df = pd.DataFrame(raw_data, columns = ['student_name', 'test_score'])

# 创建列表来储存数据
grades = []

# 对于列中的每一行
for row in df['test_score']:
    # 如果大于某个值,添加对应的字母分数
    if row > 95:
        grades.append('A')
    elif row > 90:
        grades.append('A-')
    elif row > 85:
        grades.append('B')
    elif row > 80:
        grades.append('B-')
    elif row > 75:
        grades.append('C')
    elif row > 70:
        grades.append('C-')
    elif row > 65:
        grades.append('D')
    elif row > 60:
        grades.append('D-')
    # 否则添加不及格分数
    else:
        grades.append('Failed')

# 从列表创建一列
df['grades'] = grades

# 查看新数据帧
df
student_name test_score grades
0 Miller 76.0 C
1 Jacobson 88.0 B
2 Ali 84.0 B-
3 Milner 67.0 D
4 Cooze 53.0 Failed
5 Jacon 96.0 A
6 Ryaner 64.0 D-
7 Sone 91.0 A-
8 Sloan 77.0 C
9 Piger 73.0 C-
10 Riani 52.0 Failed
11 Ali NaN Failed
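
作为一个向量化写法的示意,同样的分级也可以用 pd.cut 完成(分界沿用上面循环里的阈值;注意 NaN 在 pd.cut 中会保持为 NaN 而不是 'Failed',需要单独处理):

import numpy as np

# 用 pd.cut 按分数区间直接打等级
bins = [-np.inf, 60, 65, 70, 75, 80, 85, 90, 95, np.inf]
labels = ['Failed', 'D-', 'D', 'C-', 'C', 'B-', 'B', 'A-', 'A']
df['grades_cut'] = pd.cut(df['test_score'], bins=bins, labels=labels)
df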

创建项目计数

from collections import Counter

# 创建一个今天吃的水果的计数器
fruit_eaten = Counter(['Apple', 'Apple', 'Apple', 'Banana', 'Pear', 'Pineapple'])

# 查看计数器
fruit_eaten
# Counter({'Apple': 3, 'Banana': 1, 'Pear': 1, 'Pineapple': 1})

# 更新菠萝的计数(因为你刚吃了一个菠萝)
fruit_eaten.update(['Pineapple'])

# 查看计数器
fruit_eaten
# Counter({'Apple': 3, 'Banana': 1, 'Pear': 1, 'Pineapple': 2})

# 查看计数最大的三个项目
fruit_eaten.most_common(3)
# [('Apple', 3), ('Pineapple', 2), ('Banana', 1)]

基于条件创建一列

# 导入所需模块
import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, 2, 3],'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
name age preTestScore postTestScore
0 Jason 42 4 25
1 Molly 52 24 94
2 Tina 36 31 57
3 Jake 24 2 62
4 Amy 73 3 70
# 创建一个名为 df.elderly 的新列
# 如果 df.age 大于等于 50 则值为 yes,否则为 no
df['elderly'] = np.where(df['age']>=50, 'yes', 'no')

# 查看数据帧
df
name age preTestScore postTestScore elderly
0 Jason 42 4 25 no
1 Molly 52 24 94 yes
2 Tina 36 31 57 no
3 Jake 24 2 62 no
4 Amy 73 3 70 yes
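
np.where 适用于单个条件;如果需要多个条件,可以参考下面这个 np.select 的示意写法(这里的 age_group 列和分组名称只是举例,并非原始数据的一部分):

# 多条件版本:65 岁及以上记为 senior,50 到 64 记为 elderly,其余为 no
conditions = [df['age'] >= 65, df['age'] >= 50]
choices = ['senior', 'elderly']
df['age_group'] = np.select(conditions, choices, default='no')
df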

从字典的键和值创建列表

# 创建字典
dict = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'], 'year': [2012, 2012, 2013, 2014, 2014], 'fireReports': [4, 24, 31, 2, 3]}

# 创建键的列表
list(dict.keys())
# ['fireReports', 'year', 'county']

# 创建值的列表
list(dict.values())
'''
[[4, 24, 31, 2, 3],[2012, 2012, 2013, 2014, 2014],['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']]
'''

Pandas 中的交叉表

# 导入库
import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 'company': ['infantry', 'infantry', 'cavalry', 'cavalry', 'infantry', 'infantry', 'cavalry', 'cavalry','infantry', 'infantry', 'cavalry', 'cavalry'], 'experience': ['veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie','veteran', 'rookie', 'veteran', 'rookie'],'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'experience', 'name', 'preTestScore', 'postTestScore'])
df
regiment company experience name preTestScore postTestScore
0 Nighthawks infantry veteran Miller 4 25
1 Nighthawks infantry rookie Jacobson 24 94
2 Nighthawks cavalry veteran Ali 31 57
3 Nighthawks cavalry rookie Milner 2 62
4 Dragoons infantry veteran Cooze 3 70
5 Dragoons infantry rookie Jacon 4 25
6 Dragoons cavalry veteran Ryaner 24 94
7 Dragoons cavalry rookie Sone 31 57
8 Scouts infantry veteran Sloan 2 62
9 Scouts infantry rookie Piger 3 70
10 Scouts cavalry veteran Riani 2 62
11 Scouts cavalry rookie Ali 3 70

按公司和团队创建交叉表。按公司和团队计算观测数量。

pd.crosstab(df.regiment, df.company, margins=True)
company cavalry infantry All
regiment
Dragoons 2 2 4
Nighthawks 2 2 4
Scouts 2 2 4
All 6 6 12
# 为每个团队创建公司和经验的交叉表
pd.crosstab([df.company, df.experience], df.regiment,  margins=True)
regiment Dragoons Nighthawks Scouts All
company experience
cavalry rookie 1 1 1 3
veteran 1 1 1 3
infantry rookie 1 1 1 3
veteran 1 1 1 3
All 4 4 4 12
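
除了计数之外,crosstab 还可以把结果按行或按列归一化成比例,下面是一个示意(假设所用 pandas 版本支持 normalize 参数):

# 每个团队内部,各兵种所占的比例
pd.crosstab(df.regiment, df.company, normalize='index')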

删除重复

# 导入模块
import pandas as pd

raw_data = {'first_name': ['Jason', 'Jason', 'Jason','Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Miller', 'Miller','Ali', 'Milner', 'Cooze'], 'age': [42, 42, 1111111, 36, 24, 73], 'preTestScore': [4, 4, 4, 31, 2, 3],'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
first_name last_name age preTestScore postTestScore
0 Jason Miller 42 4 25
1 Jason Miller 42 4 25
2 Jason Miller 1111111 4 25
3 Tina Ali 36 31 57
4 Jake Milner 24 2 62
5 Amy Cooze 73 3 70
# 确定哪些观测是重复的
df.duplicated()
'''
0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool
'''

df.drop_duplicates()
first_name last_name age preTestScore postTestScore
0 Jason Miller 42 4 25
2 Jason Miller 1111111 4 25
3 Tina Ali 36 31 57
4 Jake Milner 24 2 62
5 Amy Cooze 73 3 70
# 删除 first_name 列中的重复项
# 但保留重复集中的最后一个观测
df.drop_duplicates(['first_name'], keep='last')
first_name last_name age preTestScore postTestScore
2 Jason Miller 1111111 4 25
3 Tina Ali 36 31 57
4 Jake Milner 24 2 62
5 Amy Cooze 73 3 70

Pandas 数据帧的描述性统计

# 导入模块
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, 2, 3],'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
name age preTestScore postTestScore
0 Jason 42 4 25
1 Molly 52 24 94
2 Tina 36 31 57
3 Jake 24 2 62
4 Amy 73 3 70

5 rows × 4 columns

# 所有年龄之和
df['age'].sum()
# 227

df['preTestScore'].mean()
# 12.800000000000001

df['preTestScore'].cumsum()
'''
0     4
1    28
2    59
3    61
4    64
Name: preTestScore, dtype: int64
'''

df['preTestScore'].describe()
'''
count     5.000000
mean     12.800000
std      13.663821
min       2.000000
25%       3.000000
50%       4.000000
75%      24.000000
max      31.000000
Name: preTestScore, dtype: float64
'''

df['preTestScore'].count()
# 5

df['preTestScore'].min()
# 2

df['preTestScore'].max()
# 31

df['preTestScore'].median()
# 4.0

df['preTestScore'].var()
# 186.69999999999999

df['preTestScore'].std()
# 13.663820841916802

df['preTestScore'].skew()
# 0.74334524573267591

df['preTestScore'].kurt()
# -2.4673543738411525

df.corr()
age preTestScore postTestScore
age 1.000000 -0.105651 0.328852
preTestScore -0.105651 1.000000 0.378039
postTestScore 0.328852 0.378039 1.000000

3 rows × 3 columns

# 协方差矩阵
df.cov()
age preTestScore postTestScore
age 340.80 -26.65 151.20
preTestScore -26.65 186.70 128.65
postTestScore 151.20 128.65 620.30

3 rows × 3 columns
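
除了逐列调用这些统计函数,也可以用 describe 一次查看所有数值列的汇总统计(示意,沿用上面的 df,非数值列会被自动跳过):

# 所有数值列的描述性统计
df.describe()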

丢弃行或者列

# 导入模块
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
name reports year
Cochice Jason 4 2012
Pima Molly 24 2012
Santa Cruz Tina 31 2013
Maricopa Jake 2 2014
Yuma Amy 3 2014
# 丢弃观测(行)
df.drop(['Cochice', 'Pima'])
name reports year
Santa Cruz Tina 31 2013
Maricopa Jake 2 2014
Yuma Amy 3 2014
# 丢弃变量(列)
# 注意:`axis = 1`表示我们指的是列,而不是行
df.drop('reports', axis=1)
name year
Cochice Jason 2012
Pima Molly 2012
Santa Cruz Tina 2013
Maricopa Jake 2014
Yuma Amy 2014

如果某一行包含特定的值(这里是Tina),就丢弃它。

具体来说:选出 name 列的值不等于Tina的所有行,得到一个新的数据帧。

df[df.name != 'Tina']
name reports year
Cochice Jason 4 2012
Pima Molly 24 2012
Maricopa Jake 2 2014
Yuma Amy 3 2014

按照行号丢弃一行(在本例中为第 3 行)。

请注意,Pandas使用从零开始的编号,因此 0 是第一行,1 是第二行,等等。

df.drop(df.index[2])
name reports year
Cochice Jason 4 2012
Pima Molly 24 2012
Maricopa Jake 2 2014
Yuma Amy 3 2014

可以扩展到范围。

df.drop(df.index[[2,3]])
name reports year
Cochice Jason 4 2012
Pima Molly 24 2012
Yuma Amy 3 2014

或相对于 DF 的末尾来丢弃。

df.drop(df.index[-2])
name reports year
Cochice Jason 4 2012
Pima Molly 24 2012
Santa Cruz Tina 31 2013
Yuma Amy 3 2014

你也可以选择相对于起始或末尾的范围。

df[:3] # 保留前三个
name reports year
Cochice Jason 4 2012
Pima Molly 24 2012
Santa Cruz Tina 31 2013
df[:-3] # 丢掉后三个
name reports year
Cochice Jason 4 2012
Pima Molly 24 2012

枚举列表

# 创建字符串列表
data = ['One','Two','Three','Four','Five']

# 对于 enumerate(data) 中的每个项目
for item in enumerate(data):
    # 打印整个枚举的元素
    print(item)
    # 只打印值(没有索引)
    print(item[1])
'''
(0, 'One')
One
(1, 'Two')
Two
(2, 'Three')
Three
(3, 'Four')
Four
(4, 'Five')
Five
'''

在 Pandas 中将包含列表的单元扩展为自己的变量

# 导入 pandas
import pandas as pd

# 创建数据集
raw_data = {'score': [1,2,3], 'tags': [['apple','pear','guava'],['truck','car','plane'],['cat','dog','mouse']]}
df = pd.DataFrame(raw_data, columns = ['score', 'tags'])

# 查看数据集
df
score tags
0 1 [apple, pear, guava]
1 2 [truck, car, plane]
2 3 [cat, dog, mouse]
# 将 df.tags 扩展为自己的数据帧
tags = df['tags'].apply(pd.Series)

# 将每个新列重命名为 tag_ 加编号的形式
tags = tags.rename(columns = lambda x : 'tag_' + str(x))

# 查看 tags 数据帧
tags
tag_0 tag_1 tag_2
0 apple pear guava
1 truck car plane
2 cat dog mouse
# 将 tags 数据帧添加回原始数据帧
pd.concat([df[:], tags[:]], axis=1)
score tags tag_0 tag_1 tag_2
0 1 [apple, pear, guava] apple pear guava
1 2 [truck, car, plane] truck car plane
2 3 [cat, dog, mouse] cat dog mouse
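
如果想把列表拆成多行(长格式)而不是多列,较新版本的 pandas(0.25 及以上)提供了 explode,示意如下(沿用上面的 df):

# 每个标签占一行的长格式
df.explode('tags')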

过滤 pandas 数据帧

# 导入模块
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3],'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
coverage name reports year
Cochice 25 Jason 4 2012
Pima 94 Molly 24 2012
Santa Cruz 57 Tina 31 2013
Maricopa 62 Jake 2 2014
Yuma 70 Amy 3 2014
# 查看列
df['name']
'''
Cochice       Jason
Pima          Molly
Santa Cruz     Tina
Maricopa       Jake
Yuma            Amy
Name: name, dtype: object
'''

df[['name', 'reports']]
name reports
Cochice Jason 4
Pima Molly 24
Santa Cruz Tina 31
Maricopa Jake 2
Yuma Amy 3
# 查看前两行
df[:2]
coverage name reports year
Cochice 25 Jason 4 2012
Pima 94 Molly 24 2012
# 查看 Coverage 大于 50 的行
df[df['coverage'] > 50]
coverage name reports year
Pima 94 Molly 24 2012
Santa Cruz 57 Tina 31 2013
Maricopa 62 Jake 2 2014
Yuma 70 Amy 3 2014
# 查看 Coverage 大于 50 并且 Reports 小于 4 的行
df[(df['coverage']  > 50) & (df['reports'] < 4)]
coverage name reports year
Maricopa 62 Jake 2 2014
Yuma 70 Amy 3 2014
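
同样的条件筛选也可以用 query 以字符串表达式来写,示意如下(沿用上面的 df):

# 用 query 表达同样的复合条件
df.query('coverage > 50 and reports < 4')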

寻找数据帧的列中的最大值

# 导入模块
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# 创建数据帧
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, 2, 3],'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
first_name last_name age preTestScore postTestScore
0 Jason Miller 42 4 25
1 Molly Jacobson 52 24 94
2 Tina Ali 36 31 57
3 Jake Milner 24 2 62
4 Amy Cooze 73 3 70
# 获取 preTestScore 列中的最大值的索引
df['preTestScore'].idxmax()
# 2

寻找数据帧中的唯一值

import pandas as pd
import numpy as np

raw_data = {'regiment': ['51st', '29th', '2nd', '19th', '12th', '101st', '90th', '30th', '193th', '1st', '94th', '91th'], 'trucks': ['MAZ-7310', np.nan, 'MAZ-7310', 'MAZ-7310', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'ZIS-150', 'Tatra 810', 'ZIS-150', 'ZIS-150'],'tanks': ['Merkava Mark 4', 'Merkava Mark 4', 'Merkava Mark 4', 'Leopard 2A6M', 'Leopard 2A6M', 'Leopard 2A6M', 'Arjun MBT', 'Leopard 2A6M', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT'],'aircraft': ['none', 'none', 'none', 'Harbin Z-9', 'Harbin Z-9', 'none', 'Harbin Z-9', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk']}

df = pd.DataFrame(raw_data, columns = ['regiment', 'trucks', 'tanks', 'aircraft'])

# 查看前几行
df.head()
regiment trucks tanks aircraft
0 51st MAZ-7310 Merkava Mark 4 none
1 29th NaN Merkava Mark 4 none
2 2nd MAZ-7310 Merkava Mark 4 none
3 19th MAZ-7310 Leopard 2A6M Harbin Z-9
4 12th Tatra 810 Leopard 2A6M Harbin Z-9
# 通过将 pandas 列转换为集合
# 创建唯一值的列表
list(set(df.trucks))
# [nan, 'Tatra 810', 'MAZ-7310', 'ZIS-150']

# 创建 df.trucks 中的唯一值的列表
list(df['trucks'].unique())
# ['MAZ-7310', nan, 'Tatra 810', 'ZIS-150']

地理编码和反向地理编码

在使用地理数据时,地理编码(将物理地址或位置转换为经纬度)和反向地理编码(将经纬度转换为物理地址或位置)是常见任务。

Python 提供了许多软件包,使任务变得异常简单。 在下面的教程中,我使用 pygeocoder(Google 的 geo-API 的包装器)来进行地理编码和反向地理编码。

首先,我们要加载我们想要在脚本中使用的包。 具体来说,我正在为地理函数加载 pygeocoder,为数据帧结构加载 pandas,为缺失值(np.nan)函数加载 numpy。

# 加载包
from pygeocoder import Geocoder
import pandas as pd
import numpy as np

地理数据有多种形式,在这种情况下,我们有一个 Python 字典,包含五个经纬度的字符串,每个坐标在逗号分隔的坐标对中。

# 创建原始数据的字典
data = {'Site 1': '31.336968, -109.560959','Site 2': '31.347745, -108.229963','Site 3': '32.277621, -107.734724','Site 4': '31.655494, -106.420484','Site 5': '30.295053, -104.014528'}

虽然技术上没必要,因为我最初使用 R,我是数据帧的忠实粉丝,所以让我们把模拟的数据字典变成数据帧。

# 将字典转换为 pandas 数据帧
df = pd.DataFrame.from_dict(data, orient='index')

# 查看数据帧
df
0
Site 1 31.336968, -109.560959
Site 2 31.347745, -108.229963
Site 3 32.277621, -107.734724
Site 4 31.655494, -106.420484
Site 5 30.295053, -104.014528

你现在可以看到,我们有了包含五行的数据帧,每行包含一个经纬度字符串。在处理这些数据之前,我们需要 1)把字符串拆分为纬度和经度,2)把它们转换为浮点数。下面的代码就是做这件事的。

# 为循环创建两个列表
lat = []
lon = []

# 对于变量中的每一行
for row in df[0]:
    # 尝试
    try:
        # 用逗号分隔行,转换为浮点
        # 并将逗号前的所有内容追加到 lat
        lat.append(float(row.split(',')[0]))
        # 用逗号分隔行,转换为浮点
        # 并将逗号后的所有内容追加到 lon
        lon.append(float(row.split(',')[1]))
    # 但是如果你得到了错误
    except:
        # 向 lat 添加缺失值
        lat.append(np.NaN)
        # 向 lon 添加缺失值
        lon.append(np.NaN)

# 从 lat 和 lon 创建新的两列
df['latitude'] = lat
df['longitude'] = lon
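上面的循环写法很直观,不过也可以用 pandas 的字符串方法一次性完成。这不是原文的做法,只是一个等价的向量化写法,供参考:

# 用逗号把字符串拆成两列,再转换为浮点数
latlon = df[0].str.split(',', expand=True).astype(float)
df['latitude'] = latlon[0]
df['longitude'] = latlon[1]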

让我们看看现在有了什么。

# 查看数据帧
df
0 latitude longitude
Site 1 31.336968, -109.560959 31.336968 -109.560959
Site 2 31.347745, -108.229963 31.347745 -108.229963
Site 3 32.277621, -107.734724 32.277621 -107.734724
Site 4 31.655494, -106.420484 31.655494 -106.420484
Site 5 30.295053, -104.014528 30.295053 -104.014528

真棒。这正是我们想要的:一列是纬度的浮点数,另一列是经度的浮点数。

为了进行反向地理编码,我们将特定的经纬度对(这里为第一行,索引为 0)传给 pygeocoder 的 reverse_geocode 函数。

# 将经度和纬度转换为某个位置
results = Geocoder.reverse_geocode(df['latitude'][0], df['longitude'][0])

现在我们可以开始提取我们想要的数据了。

# 打印经纬度
results.coordinates
# (31.3372728, -109.5609559)

# 打印城市
results.city
# 'Douglas'

# 打印国家/地区
results.country
# 'United States'

# 打印街道地址(如果可用)
results.street_address

# 打印行政区
results.administrative_area_level_1
# 'Arizona'

对于地理编码,我们需要把包含地址或位置(例如城市)的字符串传入地理编码函数。 但是,并非所有字符串的格式都是 Google 的 geo-API 可以理解的。 我们可以用 .geocode(...).valid_address 来检验地址是否有效。

# 验证地址是否有效(即在 Google 的系统中)
Geocoder.geocode("4207 N Washington Ave, Douglas, AZ 85607").valid_address# True

因为输出是 True,我们现在知道这是一个有效的地址,因此可以对它进行地理编码,并打印它的经纬度坐标。

# 对该地址进行地理编码,并打印经纬度
Geocoder.geocode("4207 N Washington Ave, Douglas, AZ 85607").coordinates
# (31.3372728, -109.5609559)

但更有趣的是,一旦地址由 Google 地理 API 处理,我们就可以解析它并轻松地分隔街道号码,街道名称等。

# 寻找特定地址中的经纬度
result = Geocoder.geocode("7250 South Tucson Boulevard, Tucson, AZ 85756")

# 打印街道号码
result.street_number
# '7250'

# 打印街道名
result.route
# 'South Tucson Boulevard'

大功告成。Python 使整个过程变得简单,只需几分钟即可完成分析。祝好运!

地理定位城市和国家

本教程创建一个函数,尝试获取城市和国家并返回其经纬度。 但是当城市不可用时(通常是这种情况),则返回该国中心的经纬度。

from geopy.geocoders import Nominatim
geolocator = Nominatim()
import numpy as np

def geolocate(city=None, country=None):
    '''
    输入城市和国家,或仅输入国家。
    如果可以的话,返回城市的经纬度坐标,否则返回该国家中心的经纬度。
    '''
    # 如果城市存在
    if city != None:
        # 尝试
        try:
            # 地理定位城市和国家
            loc = geolocator.geocode(str(city + ',' + country))
            # 并返回经纬度
            return (loc.latitude, loc.longitude)
        # 否则
        except:
            # 返回缺失值
            return np.nan
    # 如果城市不存在
    else:
        # 尝试
        try:
            # 地理定位国家中心
            loc = geolocator.geocode(country)
            # 返回经纬度
            return (loc.latitude, loc.longitude)
        # 否则
        except:
            # 返回缺失值
            return np.nan

# 地理定位城市和国家
geolocate(city='Austin', country='USA')
# (30.2711286, -97.7436995)

# 仅仅地理定位国家
geolocate(country='USA')
# (39.7837304, -100.4458824)
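一个小提示(非原文内容):较新版本的 geopy(2.x)要求在创建 Nominatim 时显式提供 user_agent 参数,否则会报错;名称可以自定,下面的字符串只是示意:

# 在较新版本的 geopy 中创建地理编码器
# "my_geocoding_example" 只是一个随意取的示例名称
geolocator = Nominatim(user_agent="my_geocoding_example")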

使用 pandas 分组时间序列

# 导入所需模块
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['german_army'] = np.random.randint(low=20000, high=30000, size=100)
df['allied_army'] = np.random.randint(low=20000, high=40000, size=100)
df.index = pd.date_range('1/1/2014', periods=100, freq='H')

df.head()
german_army allied_army
2014-01-01 00:00:00 28755 33938
2014-01-01 01:00:00 25176 28631
2014-01-01 02:00:00 23261 39685
2014-01-01 03:00:00 28686 27756
2014-01-01 04:00:00 24588 25681

截断数据帧

df.truncate(before='1/2/2014', after='1/3/2014')
german_army allied_army
2014-01-02 00:00:00 26401 20189
2014-01-02 01:00:00 29958 23934
2014-01-02 02:00:00 24492 39075
2014-01-02 03:00:00 25707 39262
2014-01-02 04:00:00 27129 35961
2014-01-02 05:00:00 27903 25418
2014-01-02 06:00:00 20409 25163
2014-01-02 07:00:00 25736 34794
2014-01-02 08:00:00 24057 27209
2014-01-02 09:00:00 26875 33402
2014-01-02 10:00:00 23963 38575
2014-01-02 11:00:00 27506 31859
2014-01-02 12:00:00 23564 25750
2014-01-02 13:00:00 27958 24365
2014-01-02 14:00:00 24915 38866
2014-01-02 15:00:00 23538 33820
2014-01-02 16:00:00 23361 30080
2014-01-02 17:00:00 27284 22922
2014-01-02 18:00:00 24176 32155
2014-01-02 19:00:00 23924 27763
2014-01-02 20:00:00 23111 32343
2014-01-02 21:00:00 20348 28907
2014-01-02 22:00:00 27136 38634
2014-01-02 23:00:00 28649 29950
2014-01-03 00:00:00 21292 26395
# 将数据帧的索引向后偏移 4 个月零 5 天
df.index = df.index + pd.DateOffset(months=4, days=5)

df.head()
german_army allied_army
2014-05-06 00:00:00 28755 33938
2014-05-06 01:00:00 25176 28631
2014-05-06 02:00:00 23261 39685
2014-05-06 03:00:00 28686 27756
2014-05-06 04:00:00 24588 25681
# 将变量滞后一小时:每一行显示的是前一小时的值
df.shift(1).head()
german_army allied_army
2014-05-06 00:00:00 NaN NaN
2014-05-06 01:00:00 28755.0 33938.0
2014-05-06 02:00:00 25176.0 28631.0
2014-05-06 03:00:00 23261.0 39685.0
2014-05-06 04:00:00 28686.0 27756.0
# 将变量超前一小时:每一行显示的是后一小时的值
df.shift(-1).tail()
german_army allied_army
2014-05-09 23:00:00 26903.0 39144.0
2014-05-10 00:00:00 27576.0 39759.0
2014-05-10 01:00:00 25232.0 35246.0
2014-05-10 02:00:00 23391.0 21044.0
2014-05-10 03:00:00 NaN NaN
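shift 的一个常见用途(原文没有展示,这里只是一个补充示例)是计算相邻时刻之间的变化量:

# 计算 german_army 相对前一小时的变化量
# 等价于 df['german_army'] - df['german_army'].shift(1)
df['german_army'].diff(1).head()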
# 对每小时观测值求和来按天汇总
df.resample('D').sum()
german_army allied_army
2014-05-06 605161 755962
2014-05-07 608100 740396
2014-05-08 589744 700297
2014-05-09 607092 719283
2014-05-10 103102 135193
# 对每小时观测值求平均来按天汇总
df.resample('D').mean()
german_army allied_army
2014-05-06 25215.041667 31498.416667
2014-05-07 25337.500000 30849.833333
2014-05-08 24572.666667 29179.041667
2014-05-09 25295.500000 29970.125000
2014-05-10 25775.500000 33798.250000
# 对每小时观测值求最小值来按天汇总
df.resample('D').min()
# 对每小时观测值求中值来按天汇总
df.resample('D').median()
german_army allied_army
2014-05-06 24882.0 31310.0
2014-05-07 25311.0 30969.5
2014-05-08 24422.5 28318.0
2014-05-09 24941.5 32082.5
2014-05-10 26067.5 37195.0
# 对每小时观测值取第一个值来按天汇总
df.resample('D').first()
german_army allied_army
2014-05-06 28755 33938
2014-05-07 26401 20189
2014-05-08 21292 26395
2014-05-09 25764 22613
2014-05-10 26903 39144
# 对每小时观测值取最后一个值来按天汇总
df.resample('D').last()
german_army allied_army
2014-05-06 28214 32110
2014-05-07 28649 29950
2014-05-08 28379 32600
2014-05-09 26752 22379
2014-05-10 23391 21044
# 对每小时观测值取第一个值,最后一个值,最高值,最低值来按天汇总
df.resample('D').ohlc()
german_army
open high
2014-05-06 28755 29206
2014-05-07 26401 29958
2014-05-08 21292 29786
2014-05-09 25764 29952
2014-05-10 26903 27576
(其余列略)
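ohlc() 返回的结果带有分层列:每支军队各有 open、high、low、close 四列。如果只关心某一支军队,可以像下面这样取出它的列(补充示例,非原文内容):

# 只查看 german_army 的开盘、最高、最低、收盘值
df.resample('D').ohlc()['german_army'].head()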

按时间分组数据

2016 年 3 月 13 日,Pandas 版本 0.18.0 发布,重采样功能的运行方式发生了重大变化。 本教程遵循 v0.18.0,不适用于以前版本的 pandas。

首先让我们加载我们关心的模块。

# 导入所需模块
import pandas as pd
import datetime
import numpy as np

接下来,让我们创建一些可以按时间分组的样例数据。 在这个例子中,我创建了一个包含两列、365 行的数据帧:一列是日期,第二列是数值。

# 为今天创建 datetime 变量
base = datetime.datetime.today()
# 创建一列变量
# 包含 365 天的 datetime 值
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]

# 创建 365 个数值的列表
score_list = list(np.random.randint(low=1, high=1000, size=365))

# 创建空数据帧
df = pd.DataFrame()

# 从 datetime 变量创建一列
df['datetime'] = date_list
# 将列转换为 datetime 类型
df['datetime'] = pd.to_datetime(df['datetime'])
# 将 datetime 列设为索引
df.index = df['datetime']
# 为数值得分变量创建一列
df['score'] = score_list

# 让我们看看数据
df.head()
datetime score
datetime
2016-06-02 09:57:54.793972 2016-06-02 09:57:54.793972 900
2016-06-01 09:57:54.793972 2016-06-01 09:57:54.793972 121
2016-05-31 09:57:54.793972 2016-05-31 09:57:54.793972 547
2016-05-30 09:57:54.793972 2016-05-30 09:57:54.793972 504
2016-05-29 09:57:54.793972 2016-05-29 09:57:54.793972 304

在 pandas 中,按时间分组的最常用方法是使用 .resample() 函数。 在 v0.18.0 中,此函数是两阶段的。 这意味着 df.resample('M') 创建了一个对象,我们可以对其应用其他函数(mean、count、sum 等)。

# 按月对数据分组,并取每组(即每个月)的平均值
df.resample('M').mean()
score
datetime
2015-06-30 513.629630
2015-07-31 561.516129
2015-08-31 448.032258
2015-09-30 548.000000
2015-10-31 480.419355
2015-11-30 487.033333
2015-12-31 499.935484
2016-01-31 429.193548
2016-02-29 520.413793
2016-03-31 349.806452
2016-04-30 395.500000
2016-05-31 503.451613
2016-06-30 510.500000
# 按月对数据分组,并获取每组(即每个月)的总和
df.resample('M').sum()
score
datetime
2015-06-30 13868
2015-07-31 17407
2015-08-31 13889
2015-09-30 16440
2015-10-31 14893
2015-11-30 14611
2015-12-31 15498
2016-01-31 13305
2016-02-29 15092
2016-03-31 10844
2016-04-30 11865
2016-05-31 15607
2016-06-30 1021
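如果想一次得到多个统计量,也可以在重采样对象上使用 agg(补充示例,非原文内容,在较新版本的 pandas 中可用):

# 按月分组,同时计算 score 列的均值、总和与观测数
df['score'].resample('M').agg(['mean', 'sum', 'count'])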

分组有很多选项。 你可以在 Pandas 的时间序列文档中了解它们的更多信息,但是,为了你的方便,我也在下面列出了它们。

别名 描述
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
BMS business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA business year end frequency
AS year start frequency
BAS business year start frequency
BH business hour frequency
H hourly frequency
T minutely frequency
S secondly frequency
L milliseconds
U microseconds
N nanoseconds
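这些别名还可以加上数字前缀表示倍数,例如 '2W' 表示每两周,'10T' 表示每十分钟。下面用前面的 df 给出一个按周汇总的小例子(补充示例,非原文内容):

# 按周对 score 列分组并求均值
df['score'].resample('W').mean()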

按小时分组数据

# 导入库
import pandas as pd
import numpy as np

# 创建 2000 个元素的时间序列
# 每五分钟一个元素,起始于 2000.1.1
time = pd.date_range('1/1/2000', periods=2000, freq='5min')

# 创建 pandas 序列,带有 0 到 100 的随机值
# 将 time 用于索引
series = pd.Series(np.random.randint(100, size=2000), index=time)

# 查看前几行
series[0:10]
'''
2000-01-01 00:00:00    40
2000-01-01 00:05:00    13
2000-01-01 00:10:00    99
2000-01-01 00:15:00    72
2000-01-01 00:20:00     4
2000-01-01 00:25:00    36
2000-01-01 00:30:00    24
2000-01-01 00:35:00    20
2000-01-01 00:40:00    83
2000-01-01 00:45:00    44
Freq: 5T, dtype: int64
'''

# 按索引的小时值对数据分组,然后按平均值进行汇总
series.groupby(series.index.hour).mean()
'''
0     50.380952
1     49.380952
2     49.904762
3     53.273810
4     47.178571
5     46.095238
6     49.047619
7     44.297619
8     53.119048
9     48.261905
10    45.166667
11    54.214286
12    50.714286
13    56.130952
14    50.916667
15    42.428571
16    46.880952
17    56.892857
18    54.071429
19    47.607143
20    50.940476
21    50.511905
22    44.550000
23    50.250000
dtype: float64
'''
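同样的思路也可以扩展到多个时间维度,例如同时按星期几和小时分组(补充示例,非原文内容):

# 按“星期几 + 小时”分组,然后求每组的均值
series.groupby([series.index.dayofweek, series.index.hour]).mean()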

对行分组

# 导入模块
import pandas as pd

# 示例数据帧
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
4 Dragoons 1st Cooze 3 70
5 Dragoons 1st Jacon 4 25
6 Dragoons 2nd Ryaner 24 94
7 Dragoons 2nd Sone 31 57
8 Scouts 1st Sloan 2 62
9 Scouts 1st Piger 3 70
10 Scouts 2nd Riani 2 62
11 Scouts 2nd Ali 3 70
# 创建分组对象。 换句话说,
# 创建一个表示该特定分组的对象。
# 这里,我们按照团队来分组 pre-test 得分。
regiment_preScore = df['preTestScore'].groupby(df['regiment'])

# 展示每个团队的 pre-test 得分的均值
regiment_preScore.mean()
'''
regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64
'''
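在分组对象上也可以一次应用多个聚合函数(补充示例,非原文内容):

# 每个团队 pre-test 得分的均值、最小值和最大值
regiment_preScore.agg(['mean', 'min', 'max'])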

Pandas 中的分层数据

# 导入模块
import pandas as pd

# 创建数据帧
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
4 Dragoons 1st Cooze 3 70
5 Dragoons 1st Jacon 4 25
6 Dragoons 2nd Ryaner 24 94
7 Dragoons 2nd Sone 31 57
8 Scouts 1st Sloan 2 62
9 Scouts 1st Piger 3 70
10 Scouts 2nd Riani 2 62
11 Scouts 2nd Ali 3 70
# 设置分层索引但将列保留在原位
df = df.set_index(['regiment', 'company'], drop=False)
df
regiment company name preTestScore postTestScore
regiment company
Nighthawks 1st Nighthawks 1st Miller 4 25
1st Nighthawks 1st Jacobson 24 94
2nd Nighthawks 2nd Ali 31 57
2nd Nighthawks 2nd Milner 2 62
Dragoons 1st Dragoons 1st Cooze 3 70
1st Dragoons 1st Jacon 4 25
2nd Dragoons 2nd Ryaner 24 94
2nd Dragoons 2nd Sone 31 57
Scouts 1st Scouts 1st Sloan 2 62
1st Scouts 1st Piger 3 70
2nd Scouts 2nd Riani 2 62
2nd Scouts 2nd Ali 3 70
# 将分层索引设置为团队然后公司
df = df.set_index(['regiment', 'company'])
df
name preTestScore postTestScore
regiment company
Nighthawks 1st Miller 4 25
1st Jacobson 24 94
2nd Ali 31 57
2nd Milner 2 62
Dragoons 1st Cooze 3 70
1st Jacon 4 25
2nd Ryaner 24 94
2nd Sone 31 57
Scouts 1st Sloan 2 62
1st Piger 3 70
2nd Riani 2 62
2nd Ali 3 70
# 查看索引
df.index

'''
MultiIndex(levels=[['Dragoons', 'Nighthawks', 'Scouts'], ['1st', '2nd']],
           labels=[[1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]],
           names=['regiment', 'company'])
'''

# 交换索引中的级别
df.swaplevel('regiment', 'company')
name preTestScore postTestScore
company regiment
1st Nighthawks Miller 4 25
Nighthawks Jacobson 24 94
2nd Nighthawks Ali 31 57
Nighthawks Milner 2 62
1st Dragoons Cooze 3 70
Dragoons Jacon 4 25
2nd Dragoons Ryaner 24 94
Dragoons Sone 31 57
1st Scouts Sloan 2 62
Scouts Piger 3 70
2nd Scouts Riani 2 62
Scouts Ali 3 70
# 按团队(regiment)对数据求和
df.sum(level='regiment')
preTestScore postTestScore
regiment
Nighthawks 61 238
Dragoons 62 246
Scouts 10 264
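有了分层索引,还可以方便地按某一层选取数据;另外,在较新版本的 pandas 中,sum(level=...) 这种写法已被弃用,推荐改用 groupby(level=...)。以下为补充示例,非原文内容:

# 选取 Nighthawks 团队的所有行
df.loc['Nighthawks']

# 按 company 这一层选取所有团队的 1st 连
df.xs('1st', level='company')

# 与 df.sum(level='regiment') 等价的新写法
df.groupby(level='regiment')[['preTestScore', 'postTestScore']].sum()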
