Dataset: https://download.csdn.net/download/weixin_44827418/12548095
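Pandas advanced processing
    Missing-value handling
    Data discretization
    Merging
    Crosstabs and pivot tables
    Grouping and aggregation
    Comprehensive case study
4.6 Advanced processing: missing-value handling
1) How to handle missing values. Two approaches:
    1) Drop the samples that contain missing values
    2) Replace / impute
4.6.1 How to handle NaN
1) Check whether the data contains NaN
    pd.isnull(df)
    pd.notnull(df)
2) Drop the samples that contain missing values
    df.dropna(inplace=False)
   Replace / impute
    df.fillna(value, inplace=False)
4.6.2 Missing values marked with a default placeholder instead of NaN
1) Replace ? -> np.nan
    df.replace(to_replace="?", value=np.nan)
2) Then follow the steps for handling np.nan
2) Missing-value handling examples
4.7 Advanced processing: data discretization
    Sex  Age
A   1    23
B   2    30
C   1    18

    Species  Hair
A   1
B   2
C   3

    Male  Female  Age
A   1     0       23
B   0     1       30
C   1     0       18

    Dog  Pig  Mouse  Hair
A   1    0    0      2
B   0    1    0      1
C   0    0    1      1
One-hot encoding & dummy variables
4.7.1 What is data discretization
    Original height data: 165, 174, 160, 180, 159, 163, 192, 184
4.7.2 Why discretize
4.7.3 How to discretize data
1) Binning
    Automatic binning: sr = pd.qcut(data, bins)
    Custom binning: sr = pd.cut(data, [])
2) Convert the binned result to one-hot encoding
    pd.get_dummies(sr, prefix=)
4.8 Advanced processing: merging
    numpy
        np.concatenate((a, b), axis=)
        Horizontal concatenation: np.hstack()
        Vertical concatenation: np.vstack()
1) Concatenate along an axis
    pd.concat([data1, data2], axis=1)
2) Join on keys: merging with pd.merge
    pd.merge(left, right, how="inner", on=[key columns])
4.9 Advanced processing: crosstabs and pivot tables
    Find and explore the relationship between two variables
4.9.1 What crosstabs and pivot tables are for
4.9.2 Using crosstab
    pd.crosstab(value1, value2)
4.9.3 pivot_table
4.10 Advanced processing: grouping and aggregation
4.10.1 What grouping and aggregation are
4.10.2 Grouping and aggregation API
    DataFrame
    Series (sr)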
4.6.1 How to handle NaN
import pandas as pd
movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie
| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| 1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| 2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| 3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| 4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 996 | Secret in Their Eyes | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo... | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... | 2015 | 111 | 6.2 | 27585 | NaN | 45.0 |
| 996 | 997 | Hostel: Part II | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 | 17.54 | 46.0 |
| 997 | 998 | Step Up 2: The Streets | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 | 58.01 | 50.0 |
| 998 | 999 | Search Party | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
| 999 | 1000 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins... | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
1000 rows × 12 columns
# 1. Check for NaN missing values; True marks a missing value
movie.isnull()
| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | False | False | False | False | False | False | False | False | False | False | True | False |
| 996 | False | False | False | False | False | False | False | False | False | False | False | False |
| 997 | False | False | False | False | False | False | False | False | False | False | False | False |
| 998 | False | False | False | False | False | False | False | False | False | False | True | False |
| 999 | False | False | False | False | False | False | False | False | False | False | False | False |
1000 rows × 12 columns
import numpy as np
# any() returns True as soon as at least one element is True
# A result of True means the data contains missing values
np.any(movie.isnull())
True
# Here False marks a missing value
pd.notnull(movie)
| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | True | True | True | True | True | True | True | True | True | True | True |
| 1 | True | True | True | True | True | True | True | True | True | True | True | True |
| 2 | True | True | True | True | True | True | True | True | True | True | True | True |
| 3 | True | True | True | True | True | True | True | True | True | True | True | True |
| 4 | True | True | True | True | True | True | True | True | True | True | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | True | True | True | True | True | True | True | True | True | True | False | True |
| 996 | True | True | True | True | True | True | True | True | True | True | True | True |
| 997 | True | True | True | True | True | True | True | True | True | True | True | True |
| 998 | True | True | True | True | True | True | True | True | True | True | False | True |
| 999 | True | True | True | True | True | True | True | True | True | True | True | True |
1000 rows × 12 columns
# all() returns False as soon as at least one element is False
# A result of False means the data contains missing values
np.all(pd.notnull(movie))
False
pd.isnull(movie).any()
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) True
Metascore True
dtype: bool
pd.notnull(movie).all()
Rank True
Title True
Genre True
Description True
Director True
Actors True
Year True
Runtime (Minutes) True
Rating True
Votes True
Revenue (Millions) False
Metascore False
dtype: bool
# Handling the missing values
# Approach 1: drop the samples that contain missing values
movie_full = movie.dropna()
movie_full.isnull().any()
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) False
Metascore False
dtype: bool
# Approach 2: replace
movie.head()
| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| 1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| 2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| 3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| 4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
movie["Revenue (Millions)"].mean()
82.95637614678897
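# Columns that contain missing values (False in the notnull().all() output above):
# Revenue (Millions)
# Metascore
# Fill with the column mean and assign the result back; recent pandas versions warn
# that fillna(inplace=True) on a selected column may not update the original DataFrame
movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())
movie["Revenue (Millions)"].isnull().any()
False
movie["Metascore"] = movie["Metascore"].fillna(movie["Metascore"].mean())
movie["Metascore"].isnull().any()
False
movie.isnull().any() # all missing values have now been handled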
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) False
Metascore False
dtype: bool
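The two approaches above (dropping and mean-filling) can be compared side by side on a small table. The following is a minimal sketch on a hypothetical toy DataFrame (the `toy` data and column names are made up for illustration and are not the movie dataset); it also shows `isnull().sum()` as one way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movie table
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue": [10.0, np.nan, 30.0, np.nan],
    "Metascore": [50.0, 70.0, np.nan, 90.0],
})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()

# Approach 1: drop every row that contains at least one NaN
dropped = toy.dropna()

# Approach 2: fill each numeric column with its own mean
numeric_cols = ["Revenue", "Metascore"]
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping keeps only the complete rows, which can discard a lot of data; filling keeps every row but changes the distribution of the filled column, so the choice depends on the analysis.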
4.6.2 Missing values marked with a default placeholder instead of NaN
data = pd.read_csv("./datas/GBvideos.csv",encoding="GBK")
data
| | video_id | title | channel_title | category_id | tags | views | likes | dislikes | comment_total | thumbnail_link | date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | jt2OHQh0HoQ | Live Apple Event - Apple September Event 2017 ... | Apple Event | 28 | apple events\|apple event\|iphone 8\|iphone x\|iph... | 7426393 | 78240 | 13548 | 705 | https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv... | 13.09 |
| 1 | AqokkXoa7uE | Holly and Phillip Meet Samantha the Sex Robot ... | This Morning | 24 | this morning\|interview\|holly willoughby\|philli... | 494203 | 2651 | 1309 | 0 | https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg | 13.09 |
| 2 | YPVcg45W0z4 | My DNA Test Results? I'm WHAT?? | emmablackery | 24 | emmablackery\|emma blackery\|emma\|blackery\|briti... | 142819 | 13119 | 151 | 1141 | https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg | 13.09 |
| 3 | T_PuZBdT2iM | getting into a conversation in a language you ... | ProZD | 1 | skit\|korean\|language\|conversation\|esl\|japanese... | 1580028 | 65729 | 1529 | 3598 | https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg | 13.09 |
| 4 | NsjsmgmbCfc | Baby Name Challenge? | Sprinkleofglitter | 26 | sprinkleofglitter\|sprinkle of glitter\|baby gli... | 40592 | 5019 | 57 | 490 | https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg | 13.09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1595 | w8fAellnPns | Juicy Chicken Breast - You Suck at Cooking (ep... | You Suck At Cooking | 26 | how to\|cooking\|recipe\|kitchen\|chicken\|chicken ... | 788466 | 31945 | 945 | 2274 | https://i.ytimg.com/vi/w8fAellnPns/default.jpg | 20.09 |
| 1596 | RsG37JcEQNw | Weezer - Beach Boys | weezer | 10 | weezer\|pacific daydream\|pacificdaydream\|beach ... | 107927 | 2435 | 412 | 641 | https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg | 20.09 |
| 1597 | htSiIA2g7G8 | Berry Frozen Yogurt Bark Recipe | SORTEDfood | 26 | frozen yogurt bark\|frozen yoghurt bark\|frozen ... | 109222 | 4840 | 35 | 212 | https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg | 20.09 |
| 1598 | ZQK1F0wz6z4 | What Do You Want to Eat?? | Wong Fu Productions | 24 | panda\|what should we eat\|buzzfeed\|comedy\|boyfr... | 626223 | 22962 | 532 | 1559 | https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg | 20.09 |
| 1599 | DuPXdnSWoLk | The Child in Time: Trailer - BBC One | BBC | 24 | BBC\|iPlayer\|bbc one\|bbc 1\|bbc1\|trailer\|the chi... | 99228 | 1699 | ? | 135 | https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg | 20.09 |
1600 rows × 11 columns
# 1. Replace ? with np.nan
new_data = data.replace(to_replace="?",value=np.nan)
new_data
| | video_id | title | channel_title | category_id | tags | views | likes | dislikes | comment_total | thumbnail_link | date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | jt2OHQh0HoQ | Live Apple Event - Apple September Event 2017 ... | Apple Event | 28 | apple events\|apple event\|iphone 8\|iphone x\|iph... | 7426393 | 78240 | 13548 | 705 | https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv... | 13.09 |
| 1 | AqokkXoa7uE | Holly and Phillip Meet Samantha the Sex Robot ... | This Morning | 24 | this morning\|interview\|holly willoughby\|philli... | 494203 | 2651 | 1309 | 0 | https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg | 13.09 |
| 2 | YPVcg45W0z4 | My DNA Test Results? I'm WHAT?? | emmablackery | 24 | emmablackery\|emma blackery\|emma\|blackery\|briti... | 142819 | 13119 | 151 | 1141 | https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg | 13.09 |
| 3 | T_PuZBdT2iM | getting into a conversation in a language you ... | ProZD | 1 | skit\|korean\|language\|conversation\|esl\|japanese... | 1580028 | 65729 | 1529 | 3598 | https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg | 13.09 |
| 4 | NsjsmgmbCfc | Baby Name Challenge? | Sprinkleofglitter | 26 | sprinkleofglitter\|sprinkle of glitter\|baby gli... | 40592 | 5019 | 57 | 490 | https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg | 13.09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1595 | w8fAellnPns | Juicy Chicken Breast - You Suck at Cooking (ep... | You Suck At Cooking | 26 | how to\|cooking\|recipe\|kitchen\|chicken\|chicken ... | 788466 | 31945 | 945 | 2274 | https://i.ytimg.com/vi/w8fAellnPns/default.jpg | 20.09 |
| 1596 | RsG37JcEQNw | Weezer - Beach Boys | weezer | 10 | weezer\|pacific daydream\|pacificdaydream\|beach ... | 107927 | 2435 | 412 | 641 | https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg | 20.09 |
| 1597 | htSiIA2g7G8 | Berry Frozen Yogurt Bark Recipe | SORTEDfood | 26 | frozen yogurt bark\|frozen yoghurt bark\|frozen ... | 109222 | 4840 | 35 | 212 | https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg | 20.09 |
| 1598 | ZQK1F0wz6z4 | What Do You Want to Eat?? | Wong Fu Productions | 24 | panda\|what should we eat\|buzzfeed\|comedy\|boyfr... | 626223 | 22962 | 532 | 1559 | https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg | 20.09 |
| 1599 | DuPXdnSWoLk | The Child in Time: Trailer - BBC One | BBC | 24 | BBC\|iPlayer\|bbc one\|bbc 1\|bbc1\|trailer\|the chi... | 99228 | 1699 | NaN | 135 | https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg | 20.09 |
1600 rows × 11 columns
new_data.isnull().any() # the ? in the dislikes column has been replaced with NaN
video_id False
title False
channel_title False
category_id False
tags False
views False
likes False
dislikes True
comment_total False
thumbnail_link False
date False
dtype: bool
new_data.dropna(inplace=True)
new_data.isnull().any()
video_id False
title False
channel_title False
category_id False
tags False
views False
likes False
dislikes False
comment_total False
thumbnail_link False
date False
dtype: bool
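One detail worth checking after replacing a placeholder is the column type: a numeric column that contained "?" is read as text, so it usually needs an explicit conversion before any arithmetic. The sketch below uses a small hypothetical in-memory CSV (`csv_text` is made up, it is not GBvideos.csv) and also shows the `na_values` parameter of `pd.read_csv` as an alternative that marks the placeholder as missing at load time.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content with "?" used as a missing-value marker
csv_text = "video_id,likes,dislikes\na,10,3\nb,20,?\nc,30,5\n"

# Route 1: read as-is, replace the marker, then convert the column to numbers
raw = pd.read_csv(io.StringIO(csv_text))
cleaned = raw.replace(to_replace="?", value=np.nan)
cleaned["dislikes"] = pd.to_numeric(cleaned["dislikes"])

# Route 2: tell read_csv which strings mean "missing"
direct = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

print(cleaned["dislikes"])
print(direct["dislikes"])
```

Either route leaves a numeric column with NaN in place of the marker, after which the `dropna` / `fillna` steps apply unchanged.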
4.7 Advanced processing: data discretization
import pandas as pd
# Prepare the data
data = pd.Series([165,174,160,180,159,163,192,184],index=["No1:165","No2:174","No3:160","No4:180","No5:159","No6:163","No7:192","No8:184"])
data
No1:165 165
No2:174 174
No3:160 160
No4:180 180
No5:159 159
No6:163 163
No7:192 192
No8:184 184
dtype: int64
Automatic binning
# 1. Binning
# Automatic binning
# qcut(data, number of bins)
sr = pd.qcut(data,3)
sr
No1:165 (163.667, 178.0]
No2:174 (163.667, 178.0]
No3:160 (158.999, 163.667]
No4:180 (178.0, 192.0]
No5:159 (158.999, 163.667]
No6:163 (158.999, 163.667]
No7:192 (178.0, 192.0]
No8:184 (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]
# Check how many values fall into each bin
sr.value_counts()
(178.0, 192.0] 3
(158.999, 163.667] 3
(163.667, 178.0] 2
dtype: int64
type(sr)
pandas.core.series.Series
# 2. Convert the binned result to one-hot encoding
# prefix sets the prefix of the column names
pd.get_dummies(sr,prefix="height")
| | height_(158.999, 163.667] | height_(163.667, 178.0] | height_(178.0, 192.0] |
|---|---|---|---|
| No1:165 | 0 | 1 | 0 |
| No2:174 | 0 | 1 | 0 |
| No3:160 | 1 | 0 | 0 |
| No4:180 | 0 | 0 | 1 |
| No5:159 | 1 | 0 | 0 |
| No6:163 | 1 | 0 | 0 |
| No7:192 | 0 | 0 | 1 |
| No8:184 | 0 | 0 | 1 |
Custom binning
# Custom binning
# pd.cut(data, list containing all the bin edges)
sr = pd.cut(data,[150,165,180,195])
sr
No1:165 (150, 165]
No2:174 (165, 180]
No3:160 (150, 165]
No4:180 (165, 180]
No5:159 (150, 165]
No6:163 (150, 165]
No7:192 (180, 195]
No8:184 (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
sr.value_counts()
(150, 165] 4
(180, 195] 2
(165, 180] 2
dtype: int64
pd.get_dummies(sr,prefix="身高")
| | 身高_(150, 165] | 身高_(165, 180] | 身高_(180, 195] |
|---|---|---|---|
| No1:165 | 1 | 0 | 0 |
| No2:174 | 0 | 1 | 0 |
| No3:160 | 1 | 0 | 0 |
| No4:180 | 0 | 1 | 0 |
| No5:159 | 1 | 0 | 0 |
| No6:163 | 1 | 0 | 0 |
| No7:192 | 0 | 0 | 1 |
| No8:184 | 0 | 0 | 1 |
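Interval names such as `(150, 165]` make awkward column names, so it can help to give the bins readable labels. The following is a minimal sketch using the same height values; the label names `short`, `medium` and `tall` are illustrative assumptions. Recent pandas versions return booleans from `get_dummies` by default, so `dtype=int` is passed here to get the 0/1 form shown above.

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# Custom bins (150, 165], (165, 180], (180, 195] with illustrative labels
binned = pd.cut(heights, bins=[150, 165, 180, 195],
                labels=["short", "medium", "tall"])

# How many values fall into each labelled bin
counts = binned.value_counts()

# One-hot encode the labelled bins as 0/1 integers
one_hot = pd.get_dummies(binned, prefix="height", dtype=int)

print(counts)
print(one_hot)
```

Because the bins are right-closed by default, 165 lands in `short` and 180 in `medium`, matching the custom-binning output above.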
4.8 Advanced processing: merging
4.8.1 Merging with pd.concat (concatenating along an axis)
data1 = np.arange(0,20,1).reshape(4,5)
data1 = pd.DataFrame(data1)
data1
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 | 9 |
| 2 | 10 | 11 | 12 | 13 | 14 |
| 3 | 15 | 16 | 17 | 18 | 19 |
data2 = np.arange(100,120,1).reshape(4,5)
data2 = pd.DataFrame(data2)
data2
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 100 | 101 | 102 | 103 | 104 |
| 1 | 105 | 106 | 107 | 108 | 109 |
| 2 | 110 | 111 | 112 | 113 | 114 |
| 3 | 115 | 116 | 117 | 118 | 119 |
# Concatenate data1 and data2 horizontally
data_concat = pd.concat([data1,data2],axis=1)
data_concat
| | 0 | 1 | 2 | 3 | 4 | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 | 100 | 101 | 102 | 103 | 104 |
| 1 | 5 | 6 | 7 | 8 | 9 | 105 | 106 | 107 | 108 | 109 |
| 2 | 10 | 11 | 12 | 13 | 14 | 110 | 111 | 112 | 113 | 114 |
| 3 | 15 | 16 | 17 | 18 | 19 | 115 | 116 | 117 | 118 | 119 |
data2.T
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 100 | 105 | 110 | 115 |
| 1 | 101 | 106 | 111 | 116 |
| 2 | 102 | 107 | 112 | 117 |
| 3 | 103 | 108 | 113 | 118 |
| 4 | 104 | 109 | 114 | 119 |
# Concatenate data1 and data2.T vertically; data2.T has no column 4, so those cells become NaN
data_concat1 = pd.concat([data1,data2.T],axis=0)
data_concat1
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4.0 |
| 1 | 5 | 6 | 7 | 8 | 9.0 |
| 2 | 10 | 11 | 12 | 13 | 14.0 |
| 3 | 15 | 16 | 17 | 18 | 19.0 |
| 0 | 100 | 105 | 110 | 115 | NaN |
| 1 | 101 | 106 | 111 | 116 | NaN |
| 2 | 102 | 107 | 112 | 117 | NaN |
| 3 | 103 | 108 | 113 | 118 | NaN |
| 4 | 104 | 109 | 114 | 119 | NaN |
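The vertical result above repeats the index labels 0 to 3 and pads the unmatched column with NaN. Two `pd.concat` parameters address this: `ignore_index=True` renumbers the rows, and `join="inner"` keeps only the columns that every input shares. A minimal sketch with the same shapes (the variable names `a` and `b` are illustrative):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(0, 20).reshape(4, 5))
b = pd.DataFrame(np.arange(100, 120).reshape(4, 5))

# Horizontal: rows are aligned on the index, columns are placed side by side
side_by_side = pd.concat([a, b], axis=1)

# Vertical with a fresh 0..n-1 index instead of repeated labels
stacked = pd.concat([a, b], axis=0, ignore_index=True)

# Vertical with mismatched columns: keep only the columns common to both
stacked_common = pd.concat([a, b.T], axis=0, join="inner", ignore_index=True)

print(side_by_side.shape, stacked.shape, stacked_common.shape)
```

With `join="inner"` the result has no NaN padding, at the cost of dropping column 4 from the first table.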
4.8.2 Merging with pd.merge (joining on key columns)
left=pd.DataFrame({'key1':['K0','K0','K1','K2'],
'key2':['K0','K1','K0','K1'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
left
| | key1 | key2 | A | B |
|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 |
| 1 | K0 | K1 | A1 | B1 |
| 2 | K1 | K0 | A2 | B2 |
| 3 | K2 | K1 | A3 | B3 |
right=pd.DataFrame({'key1':['K0','K1','K1','K2'], 'key2':['K0','K0','K0','K0'], 'C':['Co','C1','C2','C3'],'D':['DO','D1','D2','D3']})
right
| | key1 | key2 | C | D |
|---|---|---|---|---|
| 0 | K0 | K0 | Co | DO |
| 1 | K1 | K0 | C1 | D1 |
| 2 | K1 | K0 | C2 | D2 |
| 3 | K2 | K0 | C3 | D3 |
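# The default is an inner join (how="inner")
# inner keeps only the key combinations that appear in both tables
result = pd.merge(left,right,on=['key1','key2'],how="inner")
result
| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | Co | DO |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |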
# left: left join
# Every key combination of the left table is kept; the merge is driven by the left table
result_left = pd.merge(left,right,on=['key1','key2'],how="left")
result_left
| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | Co | DO |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
# right: right join
# Every key combination of the right table is kept; the merge is driven by the right table
result_right = pd.merge(left,right,on=['key1','key2'],how="right")
result_right
| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | Co | DO |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
| 3 | K2 | K0 | NaN | NaN | C3 | D3 |
# outer: outer join
# Every key combination from either table is kept
result_outer = pd.merge(left,right,on=['key1','key2'],how="outer")
result_outer
| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | Co | DO |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
| 5 | K2 | K0 | NaN | NaN | C3 | D3 |
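To see where each row of an outer join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only` or `both`. A minimal sketch with tables shaped like the ones above (the cell values are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                     "key2": ["K0", "K1", "K0", "K1"],
                     "A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"]})
right = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                      "key2": ["K0", "K0", "K0", "K0"],
                      "C": ["C0", "C1", "C2", "C3"],
                      "D": ["D0", "D1", "D2", "D3"]})

# Outer join that also records the origin of every row
outer = pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)
origin_counts = outer["_merge"].value_counts()

# The inner join is exactly the set of rows marked "both"
inner = pd.merge(left, right, on=["key1", "key2"], how="inner")

print(outer)
print(origin_counts)
```

The rows marked `both` correspond to the inner join, `both` plus `left_only` to the left join, and `both` plus `right_only` to the right join.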
4.9 Advanced processing: crosstabs and pivot tables
4.9.2 Using crosstab
data = pd.read_excel("./datas/szfj_baoan.xls")
data
| | district | roomnum | hall | AREA | C_floor | floor_num | school | subway | per_price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | baoan | 3 | 2 | 89.3 | middle | 31 | 0 | 0 | 7.0773 |
| 1 | baoan | 4 | 2 | 127.0 | high | 31 | 0 | 0 | 6.9291 |
| 2 | baoan | 1 | 1 | 28.0 | low | 39 | 0 | 0 | 3.9286 |
| 3 | baoan | 1 | 1 | 28.0 | middle | 30 | 0 | 0 | 3.3568 |
| 4 | baoan | 2 | 2 | 78.0 | middle | 8 | 1 | 1 | 5.0769 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1246 | baoan | 4 | 2 | 89.3 | low | 8 | 0 | 0 | 4.2553 |
| 1247 | baoan | 2 | 1 | 67.0 | middle | 30 | 0 | 0 | 3.8060 |
| 1248 | baoan | 2 | 2 | 67.4 | middle | 29 | 1 | 0 | 5.3412 |
| 1249 | baoan | 2 | 2 | 73.1 | low | 15 | 1 | 0 | 5.9508 |
| 1250 | baoan | 3 | 2 | 86.2 | middle | 32 | 0 | 1 | 4.5244 |
1251 rows × 9 columns
time = "2020-06-23"
# Convert the string to a pandas datetime type (Timestamp)
date = pd.to_datetime(time)
date
Timestamp('2020-06-23 00:00:00')
type(date)
pandas._libs.tslibs.timestamps.Timestamp
date.year
2020
date.month
6
data["week"] = date.weekday()  # call the method; without () the bound method itself would be stored
data.drop("week",axis=1,inplace=True)
data
| | district | roomnum | hall | AREA | C_floor | floor_num | school | subway | per_price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | baoan | 3 | 2 | 89.3 | middle | 31 | 0 | 0 | 7.0773 |
| 1 | baoan | 4 | 2 | 127.0 | high | 31 | 0 | 0 | 6.9291 |
| 2 | baoan | 1 | 1 | 28.0 | low | 39 | 0 | 0 | 3.9286 |
| 3 | baoan | 1 | 1 | 28.0 | middle | 30 | 0 | 0 | 3.3568 |
| 4 | baoan | 2 | 2 | 78.0 | middle | 8 | 1 | 1 | 5.0769 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1246 | baoan | 4 | 2 | 89.3 | low | 8 | 0 | 0 | 4.2553 |
| 1247 | baoan | 2 | 1 | 67.0 | middle | 30 | 0 | 0 | 3.8060 |
| 1248 | baoan | 2 | 2 | 67.4 | middle | 29 | 1 | 0 | 5.3412 |
| 1249 | baoan | 2 | 2 | 73.1 | low | 15 | 1 | 0 | 5.9508 |
| 1250 | baoan | 3 | 2 | 86.2 | middle | 32 | 0 | 1 | 4.5244 |
1251 rows × 9 columns
data["feature"] = np.where(data["per_price"] > 5.0000,1,0)
data
| | district | roomnum | hall | AREA | C_floor | floor_num | school | subway | per_price | feature |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | baoan | 3 | 2 | 89.3 | middle | 31 | 0 | 0 | 7.0773 | 1 |
| 1 | baoan | 4 | 2 | 127.0 | high | 31 | 0 | 0 | 6.9291 | 1 |
| 2 | baoan | 1 | 1 | 28.0 | low | 39 | 0 | 0 | 3.9286 | 0 |
| 3 | baoan | 1 | 1 | 28.0 | middle | 30 | 0 | 0 | 3.3568 | 0 |
| 4 | baoan | 2 | 2 | 78.0 | middle | 8 | 1 | 1 | 5.0769 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1246 | baoan | 4 | 2 | 89.3 | low | 8 | 0 | 0 | 4.2553 | 0 |
| 1247 | baoan | 2 | 1 | 67.0 | middle | 30 | 0 | 0 | 3.8060 | 0 |
| 1248 | baoan | 2 | 2 | 67.4 | middle | 29 | 1 | 0 | 5.3412 | 1 |
| 1249 | baoan | 2 | 2 | 73.1 | low | 15 | 1 | 0 | 5.9508 | 1 |
| 1250 | baoan | 3 | 2 | 86.2 | middle | 32 | 0 | 1 | 4.5244 | 0 |
1251 rows × 10 columns
# Crosstab
# Look at the relationship between floor_num and whether the price per square metre exceeds the 50000 threshold (encoded above as per_price > 5.0)
# The result gives, for each floor_num, the number of rows with feature 0 and with feature 1
data0 = pd.crosstab(data["floor_num"],data["feature"])
data0
| floor_num | feature=0 | feature=1 |
|---|---|---|
| 1 | 6 | 8 |
| 3 | 0 | 1 |
| 4 | 0 | 10 |
| 6 | 3 | 7 |
| 7 | 16 | 25 |
| 8 | 19 | 32 |
| 9 | 2 | 11 |
| 10 | 4 | 9 |
| 11 | 8 | 11 |
| 12 | 1 | 3 |
| 13 | 4 | 20 |
| 14 | 0 | 5 |
| 15 | 8 | 33 |
| 16 | 9 | 19 |
| 17 | 20 | 21 |
| 18 | 17 | 35 |
| 19 | 11 | 5 |
| 20 | 2 | 4 |
| 21 | 1 | 6 |
| 22 | 0 | 1 |
| 23 | 4 | 8 |
| 24 | 10 | 26 |
| 25 | 4 | 37 |
| 26 | 9 | 57 |
| 27 | 5 | 38 |
| 28 | 6 | 35 |
| 29 | 26 | 68 |
| 30 | 30 | 78 |
| 31 | 4 | 151 |
| 32 | 21 | 126 |
| 33 | 34 | 20 |
| 34 | 1 | 5 |
| 35 | 1 | 2 |
| 36 | 0 | 4 |
| 37 | 1 | 1 |
| 38 | 0 | 1 |
| 39 | 5 | 10 |
| 40 | 1 | 3 |
| 43 | 0 | 1 |
| 44 | 0 | 6 |
| 45 | 0 | 7 |
| 47 | 0 | 1 |
| 50 | 0 | 1 |
| 51 | 0 | 3 |
| 52 | 0 | 2 |
| 53 | 0 | 1 |
data0.sum(axis=1) # sum across each row
floor_num
1 14
3 1
4 10
6 10
7 41
8 51
9 13
10 13
11 19
12 4
13 24
14 5
15 41
16 28
17 41
18 52
19 16
20 6
21 7
22 1
23 12
24 36
25 41
26 66
27 43
28 41
29 94
30 108
31 155
32 147
33 54
34 6
35 3
36 4
37 2
38 1
39 15
40 4
43 1
44 6
45 7
47 1
50 1
51 3
52 2
53 1
dtype: int64
data0.div(data0.sum(axis=1),axis=0) # divide each row by its row total
| floor_num | feature=0 | feature=1 |
|---|---|---|
| 1 | 0.428571 | 0.571429 |
| 3 | 0.000000 | 1.000000 |
| 4 | 0.000000 | 1.000000 |
| 6 | 0.300000 | 0.700000 |
| 7 | 0.390244 | 0.609756 |
| 8 | 0.372549 | 0.627451 |
| 9 | 0.153846 | 0.846154 |
| 10 | 0.307692 | 0.692308 |
| 11 | 0.421053 | 0.578947 |
| 12 | 0.250000 | 0.750000 |
| 13 | 0.166667 | 0.833333 |
| 14 | 0.000000 | 1.000000 |
| 15 | 0.195122 | 0.804878 |
| 16 | 0.321429 | 0.678571 |
| 17 | 0.487805 | 0.512195 |
| 18 | 0.326923 | 0.673077 |
| 19 | 0.687500 | 0.312500 |
| 20 | 0.333333 | 0.666667 |
| 21 | 0.142857 | 0.857143 |
| 22 | 0.000000 | 1.000000 |
| 23 | 0.333333 | 0.666667 |
| 24 | 0.277778 | 0.722222 |
| 25 | 0.097561 | 0.902439 |
| 26 | 0.136364 | 0.863636 |
| 27 | 0.116279 | 0.883721 |
| 28 | 0.146341 | 0.853659 |
| 29 | 0.276596 | 0.723404 |
| 30 | 0.277778 | 0.722222 |
| 31 | 0.025806 | 0.974194 |
| 32 | 0.142857 | 0.857143 |
| 33 | 0.629630 | 0.370370 |
| 34 | 0.166667 | 0.833333 |
| 35 | 0.333333 | 0.666667 |
| 36 | 0.000000 | 1.000000 |
| 37 | 0.500000 | 0.500000 |
| 38 | 0.000000 | 1.000000 |
| 39 | 0.333333 | 0.666667 |
| 40 | 0.250000 | 0.750000 |
| 43 | 0.000000 | 1.000000 |
| 44 | 0.000000 | 1.000000 |
| 45 | 0.000000 | 1.000000 |
| 47 | 0.000000 | 1.000000 |
| 50 | 0.000000 | 1.000000 |
| 51 | 0.000000 | 1.000000 |
| 52 | 0.000000 | 1.000000 |
| 53 | 0.000000 | 1.000000 |
data_percent = data0.div(data0.sum(axis=1),axis=0)
data_percent
| floor_num | feature=0 | feature=1 |
|---|---|---|
| 1 | 0.428571 | 0.571429 |
| 3 | 0.000000 | 1.000000 |
| 4 | 0.000000 | 1.000000 |
| 6 | 0.300000 | 0.700000 |
| 7 | 0.390244 | 0.609756 |
| 8 | 0.372549 | 0.627451 |
| 9 | 0.153846 | 0.846154 |
| 10 | 0.307692 | 0.692308 |
| 11 | 0.421053 | 0.578947 |
| 12 | 0.250000 | 0.750000 |
| 13 | 0.166667 | 0.833333 |
| 14 | 0.000000 | 1.000000 |
| 15 | 0.195122 | 0.804878 |
| 16 | 0.321429 | 0.678571 |
| 17 | 0.487805 | 0.512195 |
| 18 | 0.326923 | 0.673077 |
| 19 | 0.687500 | 0.312500 |
| 20 | 0.333333 | 0.666667 |
| 21 | 0.142857 | 0.857143 |
| 22 | 0.000000 | 1.000000 |
| 23 | 0.333333 | 0.666667 |
| 24 | 0.277778 | 0.722222 |
| 25 | 0.097561 | 0.902439 |
| 26 | 0.136364 | 0.863636 |
| 27 | 0.116279 | 0.883721 |
| 28 | 0.146341 | 0.853659 |
| 29 | 0.276596 | 0.723404 |
| 30 | 0.277778 | 0.722222 |
| 31 | 0.025806 | 0.974194 |
| 32 | 0.142857 | 0.857143 |
| 33 | 0.629630 | 0.370370 |
| 34 | 0.166667 | 0.833333 |
| 35 | 0.333333 | 0.666667 |
| 36 | 0.000000 | 1.000000 |
| 37 | 0.500000 | 0.500000 |
| 38 | 0.000000 | 1.000000 |
| 39 | 0.333333 | 0.666667 |
| 40 | 0.250000 | 0.750000 |
| 43 | 0.000000 | 1.000000 |
| 44 | 0.000000 | 1.000000 |
| 45 | 0.000000 | 1.000000 |
| 47 | 0.000000 | 1.000000 |
| 50 | 0.000000 | 1.000000 |
| 51 | 0.000000 | 1.000000 |
| 52 | 0.000000 | 1.000000 |
| 53 | 0.000000 | 1.000000 |
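The count-then-divide steps can also be collapsed into a single call: `pd.crosstab` has a `normalize` parameter, and `normalize="index"` divides every row by its row total. A minimal sketch on a hypothetical toy table (the `toy` values are made up, they are not the housing data) that checks both routes agree:

```python
import pandas as pd

# Hypothetical toy data: a grouping column and a 0/1 flag
toy = pd.DataFrame({"floor_num": [1, 1, 1, 2, 2, 3],
                    "feature":   [0, 1, 1, 0, 0, 1]})

# Two-step route: raw counts, then divide each row by its total
counts = pd.crosstab(toy["floor_num"], toy["feature"])
manual = counts.div(counts.sum(axis=1), axis=0)

# One-step route: let crosstab normalise each row
normalized = pd.crosstab(toy["floor_num"], toy["feature"], normalize="index")

print(manual)
print(normalized)
```

Each row of either result sums to 1, which is what makes the stacked bar chart fill the full height for every floor.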
# stacked=True stacks the two feature bars of each floor_num instead of drawing them side by side
data_percent.plot(kind="bar",stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x24719dd7488>
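4.9.3 Using pivot_table
# With a pivot table the whole process becomes simpler
# The result is directly the fraction of rows whose feature is 1 (the default aggregation is the mean)
data.pivot_table(["feature"],index=["floor_num"])
| floor_num | feature |
|---|---|
| 1 | 0.571429 |
| 3 | 1.000000 |
| 4 | 1.000000 |
| 6 | 0.700000 |
| ... | ... |
| 50 | 1.000000 |
| 51 | 1.000000 |
| 52 | 1.000000 |
| 53 | 1.000000 |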
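The pivot table gives the fraction of ones because the mean of a 0/1 column equals the share of rows that are 1. A minimal sketch on a hypothetical toy table (values made up for illustration) that spells out the aggregation and compares it with the equivalent `groupby`:

```python
import pandas as pd

# Hypothetical toy data: a grouping column and a 0/1 flag
toy = pd.DataFrame({"floor_num": [1, 1, 1, 2, 2, 3],
                    "feature":   [0, 1, 1, 0, 0, 1]})

# Pivot table with the aggregation written out explicitly
pivot = toy.pivot_table(values="feature", index="floor_num", aggfunc="mean")

# The same numbers from a plain group-by
by_group = toy.groupby("floor_num")["feature"].mean()

print(pivot)
print(by_group)
```

Passing a different `aggfunc` (for example `"sum"` or `"count"`) would return the number of ones or the group size instead of the fraction.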
4.10 Advanced processing: grouping and aggregation
4.10.2 Grouping and aggregation API
col = pd.DataFrame({'color':['white','red','green','red','green'],'object':["pen","pencil","pencil","ashtray","pen"],'price1':[4.56,4.20,1.30,0.56,2.75],'price2':[4.75,4.12,1.68,0.75,3.15]})
col
| | color | object | price1 | price2 |
|---|---|---|---|---|
| 0 | white | pen | 4.56 | 4.75 |
| 1 | red | pencil | 4.20 | 4.12 |
| 2 | green | pencil | 1.30 | 1.68 |
| 3 | red | ashtray | 0.56 | 0.75 |
| 4 | green | pen | 2.75 | 3.15 |
# Group by color and aggregate price1
# Grouping with the DataFrame method
col.groupby(by="color")["price1"].max()
color
green 2.75
red 4.20
white 4.56
Name: price1, dtype: float64
# Grouping with the Series method
col['price1'].groupby(col["color"])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002471D178D08>
col['price1'].groupby(col["color"]).max()
color
green 2.75
red 4.20
white 4.56
Name: price1, dtype: float64
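When several statistics are needed at once, named aggregation lets one `groupby` produce them all with explicit output column names. A minimal sketch on the same `col` table; the output names `max_price1`, `mean_price2` and `n_items` are chosen here for illustration.

```python
import pandas as pd

col = pd.DataFrame({"color": ["white", "red", "green", "red", "green"],
                    "object": ["pen", "pencil", "pencil", "ashtray", "pen"],
                    "price1": [4.56, 4.20, 1.30, 0.56, 2.75],
                    "price2": [4.75, 4.12, 1.68, 0.75, 3.15]})

# One pass over the groups, several aggregations with explicit names
summary = col.groupby("color").agg(
    max_price1=("price1", "max"),
    mean_price2=("price2", "mean"),
    n_items=("object", "count"),
)

print(summary)
```

The `max_price1` column reproduces the `max()` result shown above, while the other two columns come from the same grouping at no extra cost.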
4.11 Comprehensive case study
# 1. Prepare the data
movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie
| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| 1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| 2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| 3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| 4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 996 | Secret in Their Eyes | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo... | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... | 2015 | 111 | 6.2 | 27585 | NaN | 45.0 |
| 996 | 997 | Hostel: Part II | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 | 17.54 | 46.0 |
| 997 | 998 | Step Up 2: The Streets | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 | 58.01 | 50.0 |
| 998 | 999 | Search Party | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
| 999 | 1000 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins... | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
1000 rows × 12 columns
# Question 1: we want to know the average rating, the number of directors and similar information for this movie data.
# How do we obtain it?
movie["Rating"].mean()
6.723200000000003
movie["Director"]
0 James Gunn
1 Ridley Scott
2 M. Night Shyamalan
3 Christophe Lourdelet
4 David Ayer
...
995 Billy Ray
996 Eli Roth
997 Jon M. Chu
998 Scot Armstrong
999 Barry Sonnenfeld
Name: Director, Length: 1000, dtype: object
# np.unique() removes duplicates, since one director may have directed several movies
np.unique(movie["Director"])
array(['Aamir Khan', 'Abdellatif Kechiche', 'Adam Leon', 'Adam McKay','Adam Shankman', 'Adam Wingard', 'Afonso Poyart', 'Aisling Walsh','Akan Satayev', 'Akiva Schaffer', 'Alan Taylor', 'Albert Hughes','Alejandro Amenábar', 'Alejandro González Iñárritu',...'Tomas Alfredson', 'Tony Gilroy', 'Tony Scott', 'Travis Knight','Tyler Shields', 'Wally Pfister', 'Walt Dohrn', 'Walter Hill','Warren Beatty', 'Werner Herzog', 'Wes Anderson', 'Wes Ball','Wes Craven', 'Whit Stillman', 'Will Gluck', 'Will Slocombe','William Brent Bell', 'William Oldroyd', 'Woody Allen','Xavier Dolan', 'Yimou Zhang', 'Yorgos Lanthimos', 'Zack Snyder','Zackary Adler'], dtype=object)
# Number of directors
np.unique(movie["Director"]).size
644
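pandas also offers this count directly: `Series.nunique()` returns the number of distinct values, and `Series.value_counts()` additionally shows how many rows each value has. A minimal sketch on a hypothetical short list of directors (not the full dataset):

```python
import pandas as pd

# Hypothetical sample of a Director column
directors = pd.Series(["James Gunn", "Ridley Scott", "Ridley Scott",
                       "David Ayer", "James Gunn", "Ridley Scott"])

# Number of distinct directors
n_directors = directors.nunique()

# Number of movies per director, most frequent first
films_per_director = directors.value_counts()

print(n_directors)
print(films_per_director)
```

On the movie table the equivalent call would be `movie["Director"].nunique()`, which should agree with the `np.unique(...).size` result as long as the column has no missing values.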
# Question 2: for this movie data, how should we present the data to see the distributions of Rating and Runtime?
movie["Rating"].plot(kind="hist",figsize=(20,8),fontsize=40)
<matplotlib.axes._subplots.AxesSubplot at 0x2471ce18708>
import matplotlib.pyplot as plt
# 1. Create the figure
plt.figure(figsize=(20,8),dpi=100)
# 2. Draw the histogram
plt.hist(movie["Rating"],20)
# Adjust the x-axis ticks
plt.xticks(np.linspace(movie["Rating"].min(),movie["Rating"].max(),21))
# Add a grid
plt.grid(linestyle="--",alpha=0.5)
# 3. Show the figure
plt.show()
movie["Rating"]
0 8.1
1 7.0
2 7.3
3 7.2
4 6.2
...
995 6.2
996 5.5
997 6.2
998 5.6
999 5.3
Name: Rating, Length: 1000, dtype: float64
# Question 3: for this movie data, how should we process the data to get statistics on the movie genres?
# First find out which genres exist
movie_genre = [i.split(",") for i in movie["Genre"]]
movie_genre
[['Action', 'Adventure', 'Sci-Fi'],['Adventure', 'Mystery', 'Sci-Fi'],['Horror', 'Thriller'],['Animation', 'Comedy', 'Family'],['Action', 'Adventure', 'Fantasy'],...['Horror'],['Drama', 'Music', 'Romance'],['Adventure', 'Comedy'],['Comedy', 'Family', 'Fantasy']]
[j for i in movie_genre for j in i]
['Action','Adventure','Sci-Fi','Adventure','Mystery','Sci-Fi',
...'Animation','Action','Adventure','Action','Adventure','Drama',...]
movie_class = np.unique([j for i in movie_genre for j in i])
movie_class
array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime','Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music','Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller','War', 'Western'], dtype='<U9')
len(movie_class) # 20 movie genres
20
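The split, flatten, and deduplicate steps can also be expressed with pandas alone, using `Series.str.split` followed by `Series.explode` (available in pandas 0.25 and later). This is a sketch on a hypothetical three-row `genre` Series, not the author's method.

```python
import pandas as pd

# Hypothetical stand-in for movie["Genre"]
genre = pd.Series(["Action,Adventure,Sci-Fi", "Horror,Thriller", "Action,Horror"])

# Split each string into a list, then give every list element its own row
flat = genre.str.split(",").explode()

# Distinct genre names in sorted order
classes = sorted(flat.unique())
print(classes)
```

The exploded Series keeps the original row index, so each genre can still be traced back to its movie.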
# Count how many movies fall into each genre
# First create a DataFrame filled with zeros (one row per movie, one column per genre)
count = pd.DataFrame(np.zeros(shape=[1000,20],dtype="int32"),columns=movie_class)
count.head()
|  | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | History | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
count.loc[0,movie_genre[0]]
Action 0
Adventure 0
Sci-Fi 0
Name: 0, dtype: int32
movie_genre[0]
['Action', 'Adventure', 'Sci-Fi']
# Fill in the table: set each movie's genre columns to 1
for i in range(1000):
    count.loc[i,movie_genre[i]] = 1
count
|  | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | History | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western |
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 996 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 997 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 998 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 999 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1000 rows × 20 columns
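The fill loop above hard-codes 1000 rows and 20 columns. A sketch that derives both sizes from the data, so it should work for other inputs as well, might look like this; the three-row `genre` Series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie["Genre"]
genre = pd.Series(["Action,Adventure", "Horror", "Action,Horror"])

movie_genre = [g.split(",") for g in genre]
movie_class = np.unique([j for i in movie_genre for j in i])

# Size the zero table from the data instead of hard-coding 1000 x 20
count = pd.DataFrame(
    np.zeros((len(movie_genre), len(movie_class)), dtype="int32"),
    columns=movie_class,
)

# Mark each movie's genres with 1, one row per movie
for i, genres in enumerate(movie_genre):
    count.loc[i, genres] = 1

print(count)
```

Using `enumerate` also removes the need to index `movie_genre` by position inside the loop.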
# Sum each column to get the number of movies per genre
count.sum(axis=0)
Action 303
Adventure 259
Animation 49
Biography 81
Comedy 279
Crime 150
Drama 513
Family 51
Fantasy 101
History 29
Horror 119
Music 16
Musical 5
Mystery 106
Romance 141
Sci-Fi 120
Sport 18
Thriller 195
War 13
Western 7
dtype: int64
count.sum(axis=0).sort_values(ascending=False)
Drama 513
Action 303
Comedy 279
Adventure 259
Thriller 195
Crime 150
Romance 141
Sci-Fi 120
Horror 119
Mystery 106
Fantasy 101
Biography 81
Family 51
Animation 49
History 29
Sport 18
Music 16
War 13
Western 7
Musical 5
dtype: int64
count.sum(axis=0).sort_values(ascending=False).plot(kind="bar",fontsize=20,figsize=(20,9),colormap="cool")
<matplotlib.axes._subplots.AxesSubplot at 0x2472450c1c8>
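The zero table plus fill loop builds a 0/1 indicator matrix by hand. pandas can build the same kind of matrix in one call with `Series.str.get_dummies(sep=",")`. This is a sketch of that alternative on a hypothetical `genre` Series, not the approach used above.

```python
import pandas as pd

# Hypothetical stand-in for movie["Genre"]
genre = pd.Series(["Action,Adventure", "Horror", "Action,Horror"])

# One 0/1 column per distinct genre, split on commas
dummies = genre.str.get_dummies(sep=",")

# Column sums give the number of movies per genre, largest first
totals = dummies.sum(axis=0).sort_values(ascending=False)
print(totals)
```

Applied to the real `movie["Genre"]` column, `totals` could be passed to `.plot(kind="bar")` in the same way as the summed `count` table.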