4.6 Advanced Processing - Handling Missing Values

Click the title to get the source code and notes for this article.
Dataset: https://download.csdn.net/download/weixin_44827418/12548095

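Outline: Pandas advanced processing
  • Handling missing values
  • Data discretization
  • Merging
  • Crosstabs and pivot tables
  • Grouping and aggregation
  • Comprehensive case study

4.6 Advanced Processing - Handling Missing Values
  1) How to handle missing values
     Two approaches:
       1) Delete the samples that contain missing values
       2) Replace / impute
     4.6.1 How to handle NaN
       1) Check whether the data contains NaN
            pd.isnull(df)
            pd.notnull(df)
       2) Delete the samples that contain missing values
            df.dropna(inplace=False)
          Replace / impute
            df.fillna(value, inplace=False)
     4.6.2 Missing values that are not NaN but carry a default marker
       1) Replace ? -> np.nan
            df.replace(to_replace="?", value=np.nan)
       2) Then follow the steps for handling np.nan
  2) Worked example of missing-value handling

4.7 Advanced Processing - Data Discretization
  One-hot encoding & dummy variables

  Gender and age, before encoding:
       Gender  Age
    A       1   23
    B       2   30
    C       1   18

  Gender and age, after one-hot encoding:
       Male  Female  Age
    A     1       0   23
    B     0       1   30
    C     1       0   18

  Species and fur, before encoding:
       Species  Fur
    A        1
    B        2
    C        3

  Species and fur, after one-hot encoding:
       Dog  Pig  Mouse  Fur
    A    1    0      0    2
    B    0    1      0    1
    C    0    0      1    1

  4.7.1 What is data discretization
        Raw height data: 165,174,160,180,159,163,192,184
  4.7.2 Why discretize
  4.7.3 How to discretize data
    1) Binning
         Automatic binning: sr=pd.qcut(data, bins)
         Custom binning: sr=pd.cut(data, [])
    2) Convert the binned result to one-hot encoding
         pd.get_dummies(sr, prefix=)

4.8 Advanced Processing - Merging
  numpy
    np.concatenate((a, b), axis=)
    Horizontal concatenation: np.hstack()
    Vertical concatenation: np.vstack()
  1) Concatenate along a direction
       pd.concat([data1, data2], axis=1)
  2) Join on key columns: merging with pd.merge
       pd.merge(left, right, how="inner", on=[key columns])

4.9 Advanced Processing - Crosstabs and Pivot Tables
  Find and explore the relationship between two variables
  4.9.1 What crosstabs and pivot tables are for
  4.9.2 Implementing with crosstab
          pd.crosstab(value1, value2)
  4.9.3 pivot_table

4.10 Advanced Processing - Grouping and Aggregation
  4.10.1 What grouping and aggregation are
  4.10.2 Grouping and aggregation API
           DataFrame
           Series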

4.6.1 How to handle NaN

import pandas as pd
movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0
... ... ... ... ... ... ... ... ... ... ... ... ...
995 996 Secret in Their Eyes Crime,Drama,Mystery A tight-knit team of rising investigators, alo... Billy Ray Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... 2015 111 6.2 27585 NaN 45.0
996 997 Hostel: Part II Horror Three American college students studying abroa... Eli Roth Lauren German, Heather Matarazzo, Bijou Philli... 2007 94 5.5 73152 17.54 46.0
997 998 Step Up 2: The Streets Drama,Music,Romance Romantic sparks occur between two dance studen... Jon M. Chu Robert Hoffman, Briana Evigan, Cassie Ventura,... 2008 98 6.2 70699 58.01 50.0
998 999 Search Party Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 NaN 22.0
999 1000 Nine Lives Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 19.64 11.0

1000 rows × 12 columns

# 1. Check for missing values of type NaN; entries that are True are missing
movie.isnull()
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 False False False False False False False False False False False False
1 False False False False False False False False False False False False
2 False False False False False False False False False False False False
3 False False False False False False False False False False False False
4 False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ...
995 False False False False False False False False False False True False
996 False False False False False False False False False False False False
997 False False False False False False False False False False False False
998 False False False False False False False False False False True False
999 False False False False False False False False False False False False

1000 rows × 12 columns

import numpy as np
# any() returns True as soon as one element is True
# A result of True means the data contains missing values
np.any(movie.isnull())
True
# Entries that are False are missing values
pd.notnull(movie)
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 True True True True True True True True True True True True
1 True True True True True True True True True True True True
2 True True True True True True True True True True True True
3 True True True True True True True True True True True True
4 True True True True True True True True True True True True
... ... ... ... ... ... ... ... ... ... ... ... ...
995 True True True True True True True True True True False True
996 True True True True True True True True True True True True
997 True True True True True True True True True True True True
998 True True True True True True True True True True False True
999 True True True True True True True True True True True True

1000 rows × 12 columns

# all() returns False as soon as one element is False
# A result of False means the data contains missing values
np.all(pd.notnull(movie))
False
pd.isnull(movie).any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool
pd.notnull(movie).all()
Rank                   True
Title                  True
Genre                  True
Description            True
Director               True
Actors                 True
Year                   True
Runtime (Minutes)      True
Rating                 True
Votes                  True
Revenue (Millions)    False
Metascore             False
dtype: bool
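Beyond a yes/no answer per column, it often helps to know how many values are missing. The sketch below shows one way to count them. It uses a small made-up table (`toy` and its column names are assumptions for illustration, not the IMDB file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movie table
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue": [10.0, np.nan, 30.0, np.nan],
    "Metascore": [70.0, 65.0, np.nan, 50.0],
})

# isnull() gives a boolean table; summing it counts the True entries per column
missing_count = toy.isnull().sum()

# The mean of a boolean column is the fraction of entries that are missing
missing_ratio = toy.isnull().mean()

print(missing_count)
print(missing_ratio)
```

The ratio is a quick way to decide between the two approaches: a column with only a few gaps is a reasonable candidate for imputation, while one that is mostly empty may not be worth keeping.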
# Handling missing values
# Method 1: delete the samples that contain missing values
movie_full = movie.dropna()
movie_full.isnull().any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool
# Method 2: replace
movie.head()
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0
movie["Revenue (Millions)"].mean()
82.95637614678897
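# Columns that contain missing values (False in the notnull().all() result):
# Revenue (Millions)    False
# Metascore             False
# Fill with the column mean and assign the result back to the column
# (inplace=True on a selected column is not reliable in newer pandas versions)
movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())
movie["Revenue (Millions)"].isnull().any()
False
# Do the same for Metascore
movie["Metascore"] = movie["Metascore"].fillna(movie["Metascore"].mean())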
movie["Metascore"].isnull().any()
False
movie.isnull().any() # all missing values have now been handled
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool
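The two approaches can also be applied more selectively. The following is a minimal sketch, again on a made-up table (`toy`, `dropped`, `fill_values`, and `filled` are names chosen for illustration), that drops rows based on a single column and fills several columns in one call:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movie table
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue": [10.0, np.nan, 30.0, np.nan],
    "Metascore": [70.0, 65.0, np.nan, 50.0],
})

# Option 1: drop a row only when a chosen column is missing
dropped = toy.dropna(subset=["Revenue"])

# Option 2: fill each column with its own statistic in a single call
fill_values = {
    "Revenue": toy["Revenue"].mean(),        # (10 + 30) / 2 = 20
    "Metascore": toy["Metascore"].median(),  # median of 50, 65, 70 is 65
}
filled = toy.fillna(value=fill_values)

print(dropped)
print(filled)
```

Both calls return new objects and leave `toy` unchanged, which matches the default `inplace=False` behaviour listed in the outline.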

4.6.2 Handling missing values that are not NaN but carry a default marker

data = pd.read_csv("./datas/GBvideos.csv",encoding="GBK")
data
video_id title channel_title category_id tags views likes dislikes comment_total thumbnail_link date
0 jt2OHQh0HoQ Live Apple Event - Apple September Event 2017 ... Apple Event 28 apple events|apple event|iphone 8|iphone x|iph... 7426393 78240 13548 705 https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv... 13.09
1 AqokkXoa7uE Holly and Phillip Meet Samantha the Sex Robot ... This Morning 24 this morning|interview|holly willoughby|philli... 494203 2651 1309 0 https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg 13.09
2 YPVcg45W0z4 My DNA Test Results? I'm WHAT?? emmablackery 24 emmablackery|emma blackery|emma|blackery|briti... 142819 13119 151 1141 https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg 13.09
3 T_PuZBdT2iM getting into a conversation in a language you ... ProZD 1 skit|korean|language|conversation|esl|japanese... 1580028 65729 1529 3598 https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg 13.09
4 NsjsmgmbCfc Baby Name Challenge? Sprinkleofglitter 26 sprinkleofglitter|sprinkle of glitter|baby gli... 40592 5019 57 490 https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg 13.09
... ... ... ... ... ... ... ... ... ... ... ...
1595 w8fAellnPns Juicy Chicken Breast - You Suck at Cooking (ep... You Suck At Cooking 26 how to|cooking|recipe|kitchen|chicken|chicken ... 788466 31945 945 2274 https://i.ytimg.com/vi/w8fAellnPns/default.jpg 20.09
1596 RsG37JcEQNw Weezer - Beach Boys weezer 10 weezer|pacific daydream|pacificdaydream|beach ... 107927 2435 412 641 https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg 20.09
1597 htSiIA2g7G8 Berry Frozen Yogurt Bark Recipe SORTEDfood 26 frozen yogurt bark|frozen yoghurt bark|frozen ... 109222 4840 35 212 https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg 20.09
1598 ZQK1F0wz6z4 What Do You Want to Eat?? Wong Fu Productions 24 panda|what should we eat|buzzfeed|comedy|boyfr... 626223 22962 532 1559 https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg 20.09
1599 DuPXdnSWoLk The Child in Time: Trailer - BBC One BBC 24 BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi... 99228 1699 ? 135 https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg 20.09

1600 rows × 11 columns

# 1. Replace ? with np.nan
new_data = data.replace(to_replace="?",value=np.nan)
new_data
video_id title channel_title category_id tags views likes dislikes comment_total thumbnail_link date
0 jt2OHQh0HoQ Live Apple Event - Apple September Event 2017 ... Apple Event 28 apple events|apple event|iphone 8|iphone x|iph... 7426393 78240 13548 705 https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv... 13.09
1 AqokkXoa7uE Holly and Phillip Meet Samantha the Sex Robot ... This Morning 24 this morning|interview|holly willoughby|philli... 494203 2651 1309 0 https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg 13.09
2 YPVcg45W0z4 My DNA Test Results? I'm WHAT?? emmablackery 24 emmablackery|emma blackery|emma|blackery|briti... 142819 13119 151 1141 https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg 13.09
3 T_PuZBdT2iM getting into a conversation in a language you ... ProZD 1 skit|korean|language|conversation|esl|japanese... 1580028 65729 1529 3598 https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg 13.09
4 NsjsmgmbCfc Baby Name Challenge? Sprinkleofglitter 26 sprinkleofglitter|sprinkle of glitter|baby gli... 40592 5019 57 490 https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg 13.09
... ... ... ... ... ... ... ... ... ... ... ...
1595 w8fAellnPns Juicy Chicken Breast - You Suck at Cooking (ep... You Suck At Cooking 26 how to|cooking|recipe|kitchen|chicken|chicken ... 788466 31945 945 2274 https://i.ytimg.com/vi/w8fAellnPns/default.jpg 20.09
1596 RsG37JcEQNw Weezer - Beach Boys weezer 10 weezer|pacific daydream|pacificdaydream|beach ... 107927 2435 412 641 https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg 20.09
1597 htSiIA2g7G8 Berry Frozen Yogurt Bark Recipe SORTEDfood 26 frozen yogurt bark|frozen yoghurt bark|frozen ... 109222 4840 35 212 https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg 20.09
1598 ZQK1F0wz6z4 What Do You Want to Eat?? Wong Fu Productions 24 panda|what should we eat|buzzfeed|comedy|boyfr... 626223 22962 532 1559 https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg 20.09
1599 DuPXdnSWoLk The Child in Time: Trailer - BBC One BBC 24 BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi... 99228 1699 NaN 135 https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg 20.09

1600 rows × 11 columns

new_data.isnull().any() # the ? in the dislikes column has been replaced with NaN
video_id          False
title             False
channel_title     False
category_id       False
tags              False
views             False
likes             False
dislikes           True
comment_total     False
thumbnail_link    False
date              False
dtype: bool
new_data.dropna(inplace=True)
new_data.isnull().any()
video_id          False
title             False
channel_title     False
category_id       False
tags              False
views             False
likes             False
dislikes          False
comment_total     False
thumbnail_link    False
date              False
dtype: bool
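A column that contains a marker such as `?` is usually read as text, so after the replacement it may still need converting to numbers. The sketch below shows two possible routes on an in-memory CSV. The `csv_text` content is made up for illustration and is not the GBvideos file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content with "?" used as the missing-value marker
csv_text = "video_id,likes,dislikes\na1,10,3\nb2,20,?\nc3,30,7\n"

# Route 1: replace the marker after loading, then convert the column to numbers
raw = pd.read_csv(io.StringIO(csv_text))
cleaned = raw.replace(to_replace="?", value=np.nan)
cleaned["dislikes"] = pd.to_numeric(cleaned["dislikes"])

# Route 2: tell read_csv to treat "?" as missing while parsing
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(cleaned)
print(parsed)
```

Route 2 is shorter when the marker is known before loading. Route 1 is the one used above and works on data that is already in memory.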

4.7 Advanced Processing - Data Discretization

import pandas as pd
# Prepare the data
data = pd.Series([165,174,160,180,159,163,192,184],index=["No1:165","No2:174","No3:160","No4:180","No5:159","No6:163","No7:192","No8:184"])
data
No1:165    165
No2:174    174
No3:160    160
No4:180    180
No5:159    159
No6:163    163
No7:192    192
No8:184    184
dtype: int64

Automatic binning

# 1. Binning
# Automatic binning
# qcut(data, number of bins)
sr = pd.qcut(data,3)
sr
No1:165      (163.667, 178.0]
No2:174      (163.667, 178.0]
No3:160    (158.999, 163.667]
No4:180        (178.0, 192.0]
No5:159    (158.999, 163.667]
No6:163    (158.999, 163.667]
No7:192        (178.0, 192.0]
No8:184        (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]
# Check how many samples fall into each bin
sr.value_counts()
(178.0, 192.0]        3
(158.999, 163.667]    3
(163.667, 178.0]      2
dtype: int64
type(sr)
pandas.core.series.Series
# 2. Convert the binned result to one-hot encoding
# prefix sets the prefix of the column names
pd.get_dummies(sr,prefix="height")
height_(158.999, 163.667] height_(163.667, 178.0] height_(178.0, 192.0]
No1:165 0 1 0
No2:174 0 1 0
No3:160 1 0 0
No4:180 0 0 1
No5:159 1 0 0
No6:163 1 0 0
No7:192 0 0 1
No8:184 0 0 1
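To see what `qcut` actually computed, it can return the bin edges together with a plain bin number for each sample. A minimal sketch, assuming the same eight heights (the variable names `heights`, `codes`, `edges`, and `per_bin` are chosen here for illustration):

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# labels=False returns the bin number (0, 1, 2) instead of an interval;
# retbins=True also returns the bin edges that qcut derived from the quantiles
codes, edges = pd.qcut(heights, 3, labels=False, retbins=True)

# Each bin should hold roughly the same number of samples
per_bin = codes.value_counts().sort_index()

print(per_bin)
print(edges)
```

The edges come from the data's quantiles, which is why the bins are balanced in size (3, 2, 3) rather than equal in width.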

Custom binning

# Custom binning
# pd.cut(data, list containing all the bin edges)
sr = pd.cut(data,[150,165,180,195])
sr
No1:165    (150, 165]
No2:174    (165, 180]
No3:160    (150, 165]
No4:180    (165, 180]
No5:159    (150, 165]
No6:163    (150, 165]
No7:192    (180, 195]
No8:184    (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
sr.value_counts()
(150, 165]    4
(180, 195]    2
(165, 180]    2
dtype: int64
pd.get_dummies(sr,prefix="身高")
身高_(150, 165] 身高_(165, 180] 身高_(180, 195]
No1:165 1 0 0
No2:174 0 1 0
No3:160 1 0 0
No4:180 0 1 0
No5:159 1 0 0
No6:163 1 0 0
No7:192 0 0 1
No8:184 0 0 1
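Interval names such as `(150, 165]` make awkward column names. One option is to give the bins readable labels. In the sketch below the label names `short`, `medium`, and `tall` are made up for illustration:

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# labels= names each bin; there must be one label per interval
groups = pd.cut(heights, bins=[150, 165, 180, 195], labels=["short", "medium", "tall"])

# Bins are right-closed by default, so 165 falls in (150, 165] and 180 in (165, 180]
dummies = pd.get_dummies(groups, prefix="height")

print(groups)
print(dummies)
```

The one-hot columns are then named `height_short`, `height_medium`, and `height_tall`, with the same 4 / 2 / 2 split as the `value_counts()` result above.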

4.8 Advanced Processing - Merging

4.8.1 Merging with pd.concat (concatenating along a direction)

data1 = np.arange(0,20,1).reshape(4,5)
data1 = pd.DataFrame(data1)
data1
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
data2 = np.arange(100,120,1).reshape(4,5)
data2 = pd.DataFrame(data2)
data2
0 1 2 3 4
0 100 101 102 103 104
1 105 106 107 108 109
2 110 111 112 113 114
3 115 116 117 118 119
# Concatenate data1 and data2 horizontally
data_concat = pd.concat([data1,data2],axis=1)
data_concat
0 1 2 3 4 0 1 2 3 4
0 0 1 2 3 4 100 101 102 103 104
1 5 6 7 8 9 105 106 107 108 109
2 10 11 12 13 14 110 111 112 113 114
3 15 16 17 18 19 115 116 117 118 119
data2.T
0 1 2 3
0 100 105 110 115
1 101 106 111 116
2 102 107 112 117
3 103 108 113 118
4 104 109 114 119
# Concatenate data1 and the transposed data2 vertically
data_concat1 = pd.concat([data1,data2.T],axis=0)
data_concat1
0 1 2 3 4
0 0 1 2 3 4.0
1 5 6 7 8 9.0
2 10 11 12 13 14.0
3 15 16 17 18 19.0
0 100 105 110 115 NaN
1 101 106 111 116 NaN
2 102 107 112 117 NaN
3 103 108 113 118 NaN
4 104 109 114 119 NaN
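The NaN values in the last result appear because `data2.T` has no column 4, and the row labels 0 to 3 are repeated. If that is not wanted, `pd.concat` has parameters for both. A minimal sketch rebuilding the same two tables (`stacked` is a name chosen for illustration):

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.arange(0, 20).reshape(4, 5))
data2 = pd.DataFrame(np.arange(100, 120).reshape(4, 5))

# join="inner" keeps only the columns both tables share (0..3 here), so no NaN appears;
# ignore_index=True renumbers the rows 0..8 instead of repeating the old labels
stacked = pd.concat([data1, data2.T], axis=0, join="inner", ignore_index=True)

print(stacked)
```

The default `join="outer"` keeps every column and pads with NaN, which also turns column 4 into floats, as seen in the output above.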

4.8.2 Merging with pd.merge (joining on key columns)

left=pd.DataFrame({'key1':['K0','K0','K1','K2'],
'key2':['K0','K1','K0','K1'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right=pd.DataFrame({'key1':['K0','K1','K1','K2'], 'key2':['K0','K0','K0','K0'], 'C':['Co','C1','C2','C3'],'D':['DO','D1','D2','D3']})
right
key1 key2 C D
0 K0 K0 Co DO
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
# The default is an inner join
# inner keeps only the keys present in both tables
result = pd.merge(left,right,on=['key1','key2'],how="inner")
result
key1 key2 A B C D
0 K0 K0 A0 B0 Co DO
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
# left: left join
# All keys from the left table are kept; the merge is driven by the left table
result_left = pd.merge(left,right,on=['key1','key2'],how="left")
result_left
key1 key2 A B C D
0 K0 K0 A0 B0 Co DO
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
# right: right join
# All keys from the right table are kept; the merge is driven by the right table
result_right = pd.merge(left,right,on=['key1','key2'],how="right")
result_right
key1 key2 A B C D
0 K0 K0 A0 B0 Co DO
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
# outer: outer join
# All keys from both tables are kept and merged
result_outer = pd.merge(left,right,on=['key1','key2'],how="outer")
result_outer
key1 key2 A B C D
0 K0 K0 A0 B0 Co DO
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
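When comparing the four join types it can help to see where each output row came from. `pd.merge` can add that information through `indicator=True`. A minimal sketch with tables shaped like the ones above (the cell values are illustrative placeholders):

```python
import pandas as pd

left = pd.DataFrame({
    "key1": ["K0", "K0", "K1", "K2"],
    "key2": ["K0", "K1", "K0", "K1"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})
right = pd.DataFrame({
    "key1": ["K0", "K1", "K1", "K2"],
    "key2": ["K0", "K0", "K0", "K0"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
outer = pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)
origin_counts = outer["_merge"].value_counts()

print(outer)
print(origin_counts)
```

The rows marked `both` are exactly the inner-join result; adding the `left_only` rows gives the left join, and adding the `right_only` rows gives the right join.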

4.9 Advanced Processing - Crosstabs and Pivot Tables

  • Used to explore the relationship between two variables

4.9.2 Implementing with crosstab

data = pd.read_excel("./datas/szfj_baoan.xls")
data
district roomnum hall AREA C_floor floor_num school subway per_price
0 baoan 3 2 89.3 middle 31 0 0 7.0773
1 baoan 4 2 127.0 high 31 0 0 6.9291
2 baoan 1 1 28.0 low 39 0 0 3.9286
3 baoan 1 1 28.0 middle 30 0 0 3.3568
4 baoan 2 2 78.0 middle 8 1 1 5.0769
... ... ... ... ... ... ... ... ... ...
1246 baoan 4 2 89.3 low 8 0 0 4.2553
1247 baoan 2 1 67.0 middle 30 0 0 3.8060
1248 baoan 2 2 67.4 middle 29 1 0 5.3412
1249 baoan 2 2 73.1 low 15 1 0 5.9508
1250 baoan 3 2 86.2 middle 32 0 1 4.5244

1251 rows × 9 columns

time = "2020-06-23"
# pandas date type
date = pd.to_datetime(time)
date
Timestamp('2020-06-23 00:00:00')
type(date)
pandas._libs.tslibs.timestamps.Timestamp
date.year
2020
date.month
6
# weekday() is a method and must be called; Monday is 0, so 2020-06-23 (a Tuesday) gives 1
data["week"] = date.weekday()
data.drop("week",axis=1,inplace=True)
data
district roomnum hall AREA C_floor floor_num school subway per_price
0 baoan 3 2 89.3 middle 31 0 0 7.0773
1 baoan 4 2 127.0 high 31 0 0 6.9291
2 baoan 1 1 28.0 low 39 0 0 3.9286
3 baoan 1 1 28.0 middle 30 0 0 3.3568
4 baoan 2 2 78.0 middle 8 1 1 5.0769
... ... ... ... ... ... ... ... ... ...
1246 baoan 4 2 89.3 low 8 0 0 4.2553
1247 baoan 2 1 67.0 middle 30 0 0 3.8060
1248 baoan 2 2 67.4 middle 29 1 0 5.3412
1249 baoan 2 2 73.1 low 15 1 0 5.9508
1250 baoan 3 2 86.2 middle 32 0 1 4.5244

1251 rows × 9 columns

data["feature"] = np.where(data["per_price"] > 5.0000,1,0)
data
district roomnum hall AREA C_floor floor_num school subway per_price feature
0 baoan 3 2 89.3 middle 31 0 0 7.0773 1
1 baoan 4 2 127.0 high 31 0 0 6.9291 1
2 baoan 1 1 28.0 low 39 0 0 3.9286 0
3 baoan 1 1 28.0 middle 30 0 0 3.3568 0
4 baoan 2 2 78.0 middle 8 1 1 5.0769 1
... ... ... ... ... ... ... ... ... ... ...
1246 baoan 4 2 89.3 low 8 0 0 4.2553 0
1247 baoan 2 1 67.0 middle 30 0 0 3.8060 0
1248 baoan 2 2 67.4 middle 29 1 0 5.3412 1
1249 baoan 2 2 73.1 low 15 1 0 5.9508 1
1250 baoan 3 2 86.2 middle 32 0 1 4.5244 0

1251 rows × 10 columns

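# Crosstab
# Look at the relationship between the floor number and whether the price per square metre exceeds 50,000 (the feature column, per_price > 5.0)
# The result gives, for each floor number, the count of 0s and the count of 1s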
data0 = pd.crosstab(data["floor_num"],data["feature"])
data0
feature 0 1
floor_num
1 6 8
3 0 1
4 0 10
6 3 7
7 16 25
8 19 32
9 2 11
10 4 9
11 8 11
12 1 3
13 4 20
14 0 5
15 8 33
16 9 19
17 20 21
18 17 35
19 11 5
20 2 4
21 1 6
22 0 1
23 4 8
24 10 26
25 4 37
26 9 57
27 5 38
28 6 35
29 26 68
30 30 78
31 4 151
32 21 126
33 34 20
34 1 5
35 1 2
36 0 4
37 1 1
38 0 1
39 5 10
40 1 3
43 0 1
44 0 6
45 0 7
47 0 1
50 0 1
51 0 3
52 0 2
53 0 1
data0.sum(axis=1) # sum across each row
floor_num
1      14
3       1
4      10
6      10
7      41
8      51
9      13
10     13
11     19
12      4
13     24
14      5
15     41
16     28
17     41
18     52
19     16
20      6
21      7
22      1
23     12
24     36
25     41
26     66
27     43
28     41
29     94
30    108
31    155
32    147
33     54
34      6
35      3
36      4
37      2
38      1
39     15
40      4
43      1
44      6
45      7
47      1
50      1
51      3
52      2
53      1
dtype: int64
data0.div(data0.sum(axis=1),axis=0) # divide each row by its row total
feature 0 1
floor_num
1 0.428571 0.571429
3 0.000000 1.000000
4 0.000000 1.000000
6 0.300000 0.700000
7 0.390244 0.609756
8 0.372549 0.627451
9 0.153846 0.846154
10 0.307692 0.692308
11 0.421053 0.578947
12 0.250000 0.750000
13 0.166667 0.833333
14 0.000000 1.000000
15 0.195122 0.804878
16 0.321429 0.678571
17 0.487805 0.512195
18 0.326923 0.673077
19 0.687500 0.312500
20 0.333333 0.666667
21 0.142857 0.857143
22 0.000000 1.000000
23 0.333333 0.666667
24 0.277778 0.722222
25 0.097561 0.902439
26 0.136364 0.863636
27 0.116279 0.883721
28 0.146341 0.853659
29 0.276596 0.723404
30 0.277778 0.722222
31 0.025806 0.974194
32 0.142857 0.857143
33 0.629630 0.370370
34 0.166667 0.833333
35 0.333333 0.666667
36 0.000000 1.000000
37 0.500000 0.500000
38 0.000000 1.000000
39 0.333333 0.666667
40 0.250000 0.750000
43 0.000000 1.000000
44 0.000000 1.000000
45 0.000000 1.000000
47 0.000000 1.000000
50 0.000000 1.000000
51 0.000000 1.000000
52 0.000000 1.000000
53 0.000000 1.000000
data_percent = data0.div(data0.sum(axis=1),axis=0)
data_percent
feature 0 1
floor_num
1 0.428571 0.571429
3 0.000000 1.000000
4 0.000000 1.000000
6 0.300000 0.700000
7 0.390244 0.609756
8 0.372549 0.627451
9 0.153846 0.846154
10 0.307692 0.692308
11 0.421053 0.578947
12 0.250000 0.750000
13 0.166667 0.833333
14 0.000000 1.000000
15 0.195122 0.804878
16 0.321429 0.678571
17 0.487805 0.512195
18 0.326923 0.673077
19 0.687500 0.312500
20 0.333333 0.666667
21 0.142857 0.857143
22 0.000000 1.000000
23 0.333333 0.666667
24 0.277778 0.722222
25 0.097561 0.902439
26 0.136364 0.863636
27 0.116279 0.883721
28 0.146341 0.853659
29 0.276596 0.723404
30 0.277778 0.722222
31 0.025806 0.974194
32 0.142857 0.857143
33 0.629630 0.370370
34 0.166667 0.833333
35 0.333333 0.666667
36 0.000000 1.000000
37 0.500000 0.500000
38 0.000000 1.000000
39 0.333333 0.666667
40 0.250000 0.750000
43 0.000000 1.000000
44 0.000000 1.000000
45 0.000000 1.000000
47 0.000000 1.000000
50 0.000000 1.000000
51 0.000000 1.000000
52 0.000000 1.000000
53 0.000000 1.000000
# stacked=True stacks the bars for each floor instead of placing them side by side
data_percent.plot(kind="bar",stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x24719dd7488>
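The sum-then-divide step can also be left to `pd.crosstab` itself through its `normalize` parameter. A minimal sketch on a tiny made-up table (the six rows in `toy` are assumptions for illustration, not the housing data):

```python
import pandas as pd

# Hypothetical toy data with the same two columns used above
toy = pd.DataFrame({
    "floor_num": [1, 1, 1, 2, 2, 3],
    "feature":   [0, 1, 1, 0, 0, 1],
})

# Manual route: counts first, then divide each row by its total
counts = pd.crosstab(toy["floor_num"], toy["feature"])
manual = counts.div(counts.sum(axis=1), axis=0)

# Direct route: normalize="index" makes every row sum to 1
direct = pd.crosstab(toy["floor_num"], toy["feature"], normalize="index")

print(manual)
print(direct)
```

Both routes give the same proportions; `normalize="columns"` or `normalize="all"` would normalise along the other axis or over the whole table instead.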

4.9.3 Implementing with pivot_table

# With a pivot table the whole process becomes simpler
# The result is directly the proportion of rows whose feature value is 1
data.pivot_table(["feature"],index=["floor_num"])
feature
floor_num
1 0.571429
3 1.000000
4 1.000000
6 0.700000
...
50 1.000000
51 1.000000
52 1.000000
53 1.000000
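The pivot table gives the proportion directly because its default aggregation is the mean, and the mean of a 0/1 column is the share of 1s. The sketch below spells this out on a tiny made-up table (`toy` is an assumption for illustration) and compares it with a plain groupby:

```python
import pandas as pd

# Hypothetical toy data with the same two columns used above
toy = pd.DataFrame({
    "floor_num": [1, 1, 1, 2, 2, 3],
    "feature":   [0, 1, 1, 0, 0, 1],
})

# aggfunc="mean" is the default; it is written out here to make the step visible
pivot = toy.pivot_table(values="feature", index="floor_num", aggfunc="mean")

# The same numbers from grouping and averaging
grouped = toy.groupby("floor_num")["feature"].mean()

print(pivot)
print(grouped)
```

Passing a different `aggfunc`, for example `"sum"`, would return the number of 1s per floor instead of their share.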

4.10 Advanced Processing - Grouping and Aggregation

4.10.2 Grouping and aggregation API

col = pd.DataFrame({'color':['white','red','green','red','green'],'object':["pen","pencil","pencil","ashtray","pen"],'price1':[4.56,4.20,1.30,0.56,2.75],'price2':[4.75,4.12,1.68,0.75,3.15]})
col
color object price1 price2
0 white pen 4.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.68
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15
# Group by color and aggregate price1
# Grouping with the DataFrame method
col.groupby(by="color")["price1"].max()
color
green    2.75
red      4.20
white    4.56
Name: price1, dtype: float64
# Grouping with the Series method
col['price1'].groupby(col["color"])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002471D178D08>
col['price1'].groupby(col["color"]).max()
color
green    2.75
red      4.20
white    4.56
Name: price1, dtype: float64
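A grouped object is not limited to a single statistic. The sketch below shows two possible ways to compute several at once with `agg`; the result names `max_price1` and `mean_price2` are chosen here for illustration:

```python
import pandas as pd

col = pd.DataFrame({
    "color": ["white", "red", "green", "red", "green"],
    "object": ["pen", "pencil", "pencil", "ashtray", "pen"],
    "price1": [4.56, 4.20, 1.30, 0.56, 2.75],
    "price2": [4.75, 4.12, 1.68, 0.75, 3.15],
})

# Several statistics for one column
price1_stats = col.groupby("color")["price1"].agg(["max", "mean"])

# Named aggregation: a different statistic per column, with explicit result names
summary = col.groupby("color").agg(
    max_price1=("price1", "max"),
    mean_price2=("price2", "mean"),
)

print(price1_stats)
print(summary)
```

The `max` column of `price1_stats` reproduces the result above (green 2.75, red 4.20, white 4.56).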

4.11 Comprehensive Case Study

# 1. Prepare the data
movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0
... ... ... ... ... ... ... ... ... ... ... ... ...
995 996 Secret in Their Eyes Crime,Drama,Mystery A tight-knit team of rising investigators, alo... Billy Ray Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... 2015 111 6.2 27585 NaN 45.0
996 997 Hostel: Part II Horror Three American college students studying abroa... Eli Roth Lauren German, Heather Matarazzo, Bijou Philli... 2007 94 5.5 73152 17.54 46.0
997 998 Step Up 2: The Streets Drama,Music,Romance Romantic sparks occur between two dance studen... Jon M. Chu Robert Hoffman, Briana Evigan, Cassie Ventura,... 2008 98 6.2 70699 58.01 50.0
998 999 Search Party Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 NaN 22.0
999 1000 Nine Lives Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 19.64 11.0

1000 rows × 12 columns

# Question 1: how do we get the average rating, the number of directors,
# and similar information from this movie data?
movie["Rating"].mean()
6.723200000000003
movie["Director"]
0                James Gunn
1              Ridley Scott
2        M. Night Shyamalan
3      Christophe Lourdelet
4                David Ayer
...
995               Billy Ray
996                Eli Roth
997              Jon M. Chu
998          Scot Armstrong
999        Barry Sonnenfeld
Name: Director, Length: 1000, dtype: object
# np.unique() removes duplicates, because one director may have directed several movies
np.unique(movie["Director"])
array(['Aamir Khan', 'Abdellatif Kechiche', 'Adam Leon', 'Adam McKay','Adam Shankman', 'Adam Wingard', 'Afonso Poyart', 'Aisling Walsh','Akan Satayev', 'Akiva Schaffer', 'Alan Taylor', 'Albert Hughes','Alejandro Amenábar', 'Alejandro González Iñárritu',...'Tomas Alfredson', 'Tony Gilroy', 'Tony Scott', 'Travis Knight','Tyler Shields', 'Wally Pfister', 'Walt Dohrn', 'Walter Hill','Warren Beatty', 'Werner Herzog', 'Wes Anderson', 'Wes Ball','Wes Craven', 'Whit Stillman', 'Will Gluck', 'Will Slocombe','William Brent Bell', 'William Oldroyd', 'Woody Allen','Xavier Dolan', 'Yimou Zhang', 'Yorgos Lanthimos', 'Zack Snyder','Zackary Adler'], dtype=object)
# Number of directors
np.unique(movie["Director"]).size
644
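pandas also has its own helpers for this kind of count. A minimal sketch on a short made-up list (the `directors` Series below reuses names from the table, but the duplicate entry is an assumption for illustration):

```python
import pandas as pd

# Hypothetical short list; "James Gunn" is repeated on purpose
directors = pd.Series(["James Gunn", "Ridley Scott", "James Gunn", "David Ayer"])

# nunique() counts the distinct values directly
n_directors = directors.nunique()

# value_counts() additionally shows how many movies each director has
films_per_director = directors.value_counts()

print(n_directors)
print(films_per_director)
```

On the real data, `movie["Director"].nunique()` would be the pandas counterpart of `np.unique(movie["Director"]).size`.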
# Question 2: for this movie dataset, if we want to see how rating and runtime are distributed, how should we present the data?
movie["Rating"].plot(kind="hist",figsize=(20,8),fontsize=40)
<matplotlib.axes._subplots.AxesSubplot at 0x2471ce18708>

import matplotlib.pyplot as plt
# 1. Create the figure
plt.figure(figsize=(20,8),dpi=100)
# 2. Draw the histogram
plt.hist(movie["Rating"],20)
# Adjust the x-axis ticks
plt.xticks(np.linspace(movie["Rating"].min(),movie["Rating"].max(),21))
# Add a grid
plt.grid(linestyle="--",alpha=0.5)
# 3. Show the figure
plt.show()

movie["Rating"]
0      8.1
1      7.0
2      7.3
3      7.2
4      6.2
...
995    6.2
996    5.5
997    6.2
998    5.6
999    5.3
Name: Rating, Length: 1000, dtype: float64
# Question 3: for this movie dataset, if we want statistics on the movie genres (Genre), how should we process the data?
# First find out which genres exist
movie_genre = [i.split(",") for i in movie["Genre"]]
movie_genre
[['Action', 'Adventure', 'Sci-Fi'],['Adventure', 'Mystery', 'Sci-Fi'],['Horror', 'Thriller'],['Animation', 'Comedy', 'Family'],['Action', 'Adventure', 'Fantasy'],...['Horror'],['Drama', 'Music', 'Romance'],['Adventure', 'Comedy'],['Comedy', 'Family', 'Fantasy']]
[j for i in movie_genre for j in i]
['Action','Adventure','Sci-Fi','Adventure','Mystery','Sci-Fi',
...'Animation','Action','Adventure','Action','Adventure','Drama',...]
movie_class = np.unique([j for i in movie_genre for j in i])
movie_class
array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime','Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music','Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller','War', 'Western'], dtype='<U9')
len(movie_class) # 20 genres
20
# Count how many movies fall into each genre
# First create a DataFrame filled with zeros
count = pd.DataFrame(np.zeros(shape=[1000,20],dtype="int32"),columns=movie_class)
count.head()
Action Adventure Animation Biography Comedy Crime Drama Family Fantasy History Horror Music Musical Mystery Romance Sci-Fi Sport Thriller War Western
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
count.loc[0,movie_genre[0]]
Action       0
Adventure    0
Sci-Fi       0
Name: 0, dtype: int32
movie_genre[0]
['Action', 'Adventure', 'Sci-Fi']
# Fill in the table: mark each movie's genres with 1
for i in range(1000):
    count.loc[i,movie_genre[i]] = 1
count
Action Adventure Animation Biography Comedy Crime Drama Family Fantasy History Horror Music Musical Mystery Romance Sci-Fi Sport Thriller War Western
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
3 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0
996 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
997 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
998 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
999 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0

1000 rows × 20 columns

# Sum down each column
count.sum(axis=0)
Action       303
Adventure    259
Animation     49
Biography     81
Comedy       279
Crime        150
Drama        513
Family        51
Fantasy      101
History       29
Horror       119
Music         16
Musical        5
Mystery      106
Romance      141
Sci-Fi       120
Sport         18
Thriller     195
War           13
Western        7
dtype: int64
count.sum(axis=0).sort_values(ascending=False)
Drama        513
Action       303
Comedy       279
Adventure    259
Thriller     195
Crime        150
Romance      141
Sci-Fi       120
Horror       119
Mystery      106
Fantasy      101
Biography     81
Family        51
Animation     49
History       29
Sport         18
Music         16
War           13
Western        7
Musical        5
dtype: int64
count.sum(axis=0).sort_values(ascending=False).plot(kind="bar",fontsize=20,figsize=(20,9),colormap="cool")
<matplotlib.axes._subplots.AxesSubplot at 0x2472450c1c8>
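The zero table and the fill loop can likely be replaced by pandas string methods. The sketch below shows two possible shortcuts on four sample genre strings taken from the table above (`genre` stands in for `movie["Genre"]`):

```python
import pandas as pd

# Four sample rows standing in for movie["Genre"]
genre = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Adventure,Mystery,Sci-Fi",
    "Horror,Thriller",
    "Horror",
])

# Shortcut 1: str.get_dummies splits on the separator and builds the 0/1 table in one step
indicator = genre.str.get_dummies(sep=",")
genre_counts = indicator.sum(axis=0).sort_values(ascending=False)

# Shortcut 2: split into lists, put one genre per row, then count
exploded_counts = genre.str.split(",").explode().value_counts()

print(indicator)
print(genre_counts)
```

Shortcut 1 produces the same kind of indicator table as `count` above, so the later `sum` and `plot` calls would work on it unchanged. Shortcut 2 skips the table when only the totals are needed.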
