[Datawhale] [task2] 2.3 Code Examples
2.3 Code Examples
2.3.1 Load the data science and visualization libraries
# coding: utf-8
# Import the warnings module and use a filter to suppress warning messages.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import os
# Path where results are saved
output_path='G:/newjourney/Datawhale/output'
if not os.path.exists(output_path):
    os.makedirs(output_path)
2.3.2 Load the data
## 1) Load the training and test sets
path='G:/newjourney/Datawhale/'
Train_data=pd.read_csv(path+'used_car_train_20200313.csv',sep=' ')
Test_data=pd.read_csv(path+'used_car_testB_20200421.csv',sep=' ')
## 2) Take a quick look at the data (head() + shape)
Train_data.head().append(Train_data.tail())
# Nice trick I picked up here: use append to show head and tail together
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
149995 | 149995 | 163978 | 20000607 | 121.0 | 10 | 4.0 | 0.0 | 1.0 | 163 | 15.0 | ... | 0.280264 | 0.000310 | 0.048441 | 0.071158 | 0.019174 | 1.988114 | -2.983973 | 0.589167 | -1.304370 | -0.302592 |
149996 | 149996 | 184535 | 20091102 | 116.0 | 11 | 0.0 | 0.0 | 0.0 | 125 | 10.0 | ... | 0.253217 | 0.000777 | 0.084079 | 0.099681 | 0.079371 | 1.839166 | -2.774615 | 2.553994 | 0.924196 | -0.272160 |
149997 | 149997 | 147587 | 20101003 | 60.0 | 11 | 1.0 | 1.0 | 0.0 | 90 | 6.0 | ... | 0.233353 | 0.000705 | 0.118872 | 0.100118 | 0.097914 | 2.439812 | -1.630677 | 2.290197 | 1.891922 | 0.414931 |
149998 | 149998 | 45907 | 20060312 | 34.0 | 10 | 3.0 | 1.0 | 0.0 | 156 | 15.0 | ... | 0.256369 | 0.000252 | 0.081479 | 0.083558 | 0.081498 | 2.075380 | -2.633719 | 1.414937 | 0.431981 | -1.659014 |
149999 | 149999 | 177672 | 19990204 | 19.0 | 28 | 6.0 | 0.0 | 1.0 | 193 | 12.5 | ... | 0.284475 | 0.000000 | 0.040072 | 0.062543 | 0.025819 | 1.978453 | -3.179913 | 0.031724 | -1.483350 | -0.342674 |
10 rows × 31 columns
Train_data.shape
(150000, 31)
Test_data.head().append(Test_data.tail())
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 200000 | 133777 | 20000501 | 67.0 | 0 | 1.0 | 0.0 | 0.0 | 101 | 15.0 | ... | 0.236520 | 0.000241 | 0.105319 | 0.046233 | 0.094522 | 3.619512 | -0.280607 | -2.019761 | 0.978828 | 0.803322 |
1 | 200001 | 61206 | 19950211 | 19.0 | 6 | 2.0 | 0.0 | 0.0 | 73 | 6.0 | ... | 0.261518 | 0.000000 | 0.120323 | 0.046784 | 0.035385 | 2.997376 | -1.406705 | -1.020884 | -1.349990 | -0.200542 |
2 | 200002 | 67829 | 20090606 | 5.0 | 5 | 4.0 | 0.0 | 0.0 | 120 | 5.0 | ... | 0.261691 | 0.090836 | 0.000000 | 0.079655 | 0.073586 | -3.951084 | -0.433467 | 0.918964 | 1.634604 | 1.027173 |
3 | 200003 | 8892 | 20020601 | 22.0 | 9 | 1.0 | 0.0 | 0.0 | 58 | 15.0 | ... | 0.236050 | 0.101777 | 0.098950 | 0.026830 | 0.096614 | -2.846788 | 2.800267 | -2.524610 | 1.076819 | 0.461610 |
4 | 200004 | 76998 | 20030301 | 46.0 | 6 | 0.0 | NaN | 0.0 | 116 | 15.0 | ... | 0.257000 | 0.000000 | 0.066732 | 0.057771 | 0.068852 | 2.839010 | -1.659801 | -0.924142 | 0.199423 | 0.451014 |
49995 | 249995 | 111443 | 20041005 | 4.0 | 4 | 0.0 | NaN | 1.0 | 150 | 15.0 | ... | 0.263668 | 0.000292 | 0.141804 | 0.076393 | 0.039272 | 2.072901 | -2.531869 | 1.716978 | -1.063437 | 0.326587 |
49996 | 249996 | 152834 | 20130409 | 65.0 | 1 | 0.0 | 0.0 | 0.0 | 179 | 4.0 | ... | 0.255310 | 0.000991 | 0.155868 | 0.108425 | 0.067841 | 1.358504 | -3.290295 | 4.269809 | 0.140524 | 0.556221 |
49997 | 249997 | 132531 | 20041211 | 4.0 | 4 | 0.0 | 0.0 | 1.0 | 147 | 12.5 | ... | 0.262933 | 0.000318 | 0.141872 | 0.071968 | 0.042966 | 2.165658 | -2.417885 | 1.370612 | -1.073133 | 0.270602 |
49998 | 249998 | 143405 | 20020702 | 40.0 | 1 | 4.0 | 0.0 | 1.0 | 176 | 15.0 | ... | 0.282106 | 0.000023 | 0.067483 | 0.067526 | 0.009006 | 2.030114 | -2.939244 | 0.569078 | -1.718245 | 0.316379 |
49999 | 249999 | 78202 | 20090708 | 32.0 | 8 | 1.0 | 0.0 | 0.0 | 0 | 3.0 | ... | 0.231449 | 0.103947 | 0.096027 | 0.062328 | 0.110180 | -3.689090 | 2.032376 | 0.109157 | 2.202828 | 0.847469 |
10 rows × 30 columns
Test_data.shape
(50000, 30)
Get into the habit of checking a dataset's head() and shape; it makes every later step more reassuring.
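DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent pandas the head/tail calls above raise AttributeError. Below is a minimal sketch of the same preview using pd.concat. The frame `demo` and the helper `preview` are illustrative names, not part of the original notebook.

```python
import pandas as pd

def preview(df, n=5):
    """Return the first and last n rows of df stacked into one frame."""
    return pd.concat([df.head(n), df.tail(n)])

# Hypothetical stand-in for Train_data.
demo = pd.DataFrame({"SaleID": range(20), "price": range(100, 120)})

print(preview(demo, n=2))
print(demo.shape)
```

With the real data this would be called as preview(Train_data) and preview(Test_data).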
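2.3.3 Overview of the data
1. describe() gives summary statistics for each column: count, mean, standard deviation (std), minimum (min), the 25%, 50% (median), and 75% quantiles, and the maximum (max). This quickly shows the rough range of the data and helps you judge outliers in each column. For example, values such as 999, 9999, or -1 sometimes show up; these are often just another way of encoding NaN, so keep an eye out for them.
2. info() shows the dtype of each column, which helps reveal whether anomalous special symbols other than NaN are present.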
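The placeholder check described in point 1 can be sketched as a simple count per column. The sentinel list (-1, 999, 9999) comes from the text above; the helper name `count_sentinels` and the small frame `demo` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def count_sentinels(df, sentinels=(-1, 999, 9999)):
    """Count, per numeric column, how often each suspicious placeholder value occurs."""
    numeric = df.select_dtypes(include=[np.number])
    # One column per sentinel value, one row per numeric feature.
    return pd.DataFrame({s: numeric.eq(s).sum() for s in sentinels})

# Hypothetical data, not taken from the competition files.
demo = pd.DataFrame({
    "power": [60, 999, 163, 999],
    "kilometer": [12.5, 15.0, -1.0, 5.0],
})

sentinel_counts = count_sentinels(demo)
print(sentinel_counts)
```

A non-zero count is only a hint: whether 999 is a placeholder or a legitimate value still has to be decided from the meaning of the column.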
## 1) Use describe() to get familiar with the summary statistics
Train_data.describe()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 150000.000000 | 150000.000000 | 1.500000e+05 | 149999.000000 | 150000.000000 | 145494.000000 | 141320.000000 | 144019.000000 | 150000.000000 | 150000.000000 | ... | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 |
mean | 74999.500000 | 68349.172873 | 2.003417e+07 | 47.129021 | 8.052733 | 1.792369 | 0.375842 | 0.224943 | 119.316547 | 12.597160 | ... | 0.248204 | 0.044923 | 0.124692 | 0.058144 | 0.061996 | -0.001000 | 0.009035 | 0.004813 | 0.000313 | -0.000688 |
std | 43301.414527 | 61103.875095 | 5.364988e+04 | 49.536040 | 7.864956 | 1.760640 | 0.548677 | 0.417546 | 177.168419 | 3.919576 | ... | 0.045804 | 0.051743 | 0.201410 | 0.029186 | 0.035692 | 3.772386 | 3.286071 | 2.517478 | 1.288988 | 1.038685 |
min | 0.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9.168192 | -5.558207 | -9.639552 | -4.153899 | -6.546556 |
25% | 37499.750000 | 11156.000000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 75.000000 | 12.500000 | ... | 0.243615 | 0.000038 | 0.062474 | 0.035334 | 0.033930 | -3.722303 | -1.951543 | -1.871846 | -1.057789 | -0.437034 |
50% | 74999.500000 | 51638.000000 | 2.003091e+07 | 30.000000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 110.000000 | 15.000000 | ... | 0.257798 | 0.000812 | 0.095866 | 0.057014 | 0.058484 | 1.624076 | -0.358053 | -0.130753 | -0.036245 | 0.141246 |
75% | 112499.250000 | 118841.250000 | 2.007111e+07 | 66.000000 | 13.000000 | 3.000000 | 1.000000 | 0.000000 | 150.000000 | 15.000000 | ... | 0.265297 | 0.102009 | 0.125243 | 0.079382 | 0.087491 | 2.844357 | 1.255022 | 1.776933 | 0.942813 | 0.680378 |
max | 149999.000000 | 196812.000000 | 2.015121e+07 | 247.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 19312.000000 | 15.000000 | ... | 0.291838 | 0.151420 | 1.404936 | 0.160791 | 0.222787 | 12.357011 | 18.819042 | 13.847792 | 11.147669 | 8.658418 |
8 rows × 30 columns
Test_data.describe()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 5.000000e+04 | 50000.00000 | 50000.000000 | 48496.000000 | 47076.000000 | 48032.000000 | 50000.000000 | 50000.000000 | ... | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 |
mean | 224999.500000 | 68505.606100 | 2.003401e+07 | 47.64948 | 8.087140 | 1.793736 | 0.376498 | 0.226953 | 119.766960 | 12.598260 | ... | 0.248147 | 0.044624 | 0.124693 | 0.058198 | 0.062113 | 0.019633 | 0.002759 | 0.004342 | 0.004570 | -0.007209 |
std | 14433.901067 | 61032.124271 | 5.351615e+04 | 49.90741 | 7.899648 | 1.764970 | 0.549281 | 0.418866 | 206.313348 | 3.912519 | ... | 0.045836 | 0.051664 | 0.201440 | 0.029171 | 0.035723 | 3.764095 | 3.289523 | 2.515912 | 1.287194 | 1.044718 |
min | 200000.000000 | 1.000000 | 1.991000e+07 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9.119719 | -5.662163 | -8.291868 | -4.157649 | -6.098192 |
25% | 212499.750000 | 11315.000000 | 1.999100e+07 | 11.00000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 75.000000 | 12.500000 | ... | 0.243436 | 0.000035 | 0.062519 | 0.035413 | 0.033880 | -3.675196 | -1.963928 | -1.865406 | -1.048722 | -0.440706 |
50% | 224999.500000 | 52215.000000 | 2.003091e+07 | 30.00000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 110.000000 | 15.000000 | ... | 0.257818 | 0.000801 | 0.095880 | 0.056804 | 0.058749 | 1.632134 | -0.375537 | -0.138943 | -0.036352 | 0.136849 |
75% | 237499.250000 | 118710.750000 | 2.007110e+07 | 66.00000 | 13.000000 | 3.000000 | 1.000000 | 0.000000 | 150.000000 | 15.000000 | ... | 0.265263 | 0.101654 | 0.125470 | 0.079387 | 0.087624 | 2.846205 | 1.263451 | 1.775632 | 0.945239 | 0.685555 |
max | 249999.000000 | 196808.000000 | 2.015121e+07 | 246.00000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 19211.000000 | 15.000000 | ... | 0.291176 | 0.153403 | 1.411559 | 0.157458 | 0.211304 | 12.177864 | 18.789496 | 13.384828 | 5.635374 | 2.649768 |
8 rows × 29 columns
## 2) Use info() to get familiar with the data types
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID 150000 non-null int64
name 150000 non-null int64
regDate 150000 non-null int64
model 149999 non-null float64
brand 150000 non-null int64
bodyType 145494 non-null float64
fuelType 141320 non-null float64
gearbox 144019 non-null float64
power 150000 non-null int64
kilometer 150000 non-null float64
notRepairedDamage 150000 non-null object
regionCode 150000 non-null int64
seller 150000 non-null int64
offerType 150000 non-null int64
creatDate 150000 non-null int64
price 150000 non-null int64
v_0 150000 non-null float64
v_1 150000 non-null float64
v_2 150000 non-null float64
v_3 150000 non-null float64
v_4 150000 non-null float64
v_5 150000 non-null float64
v_6 150000 non-null float64
v_7 150000 non-null float64
v_8 150000 non-null float64
v_9 150000 non-null float64
v_10 150000 non-null float64
v_11 150000 non-null float64
v_12 150000 non-null float64
v_13 150000 non-null float64
v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID 50000 non-null int64
name 50000 non-null int64
regDate 50000 non-null int64
model 50000 non-null float64
brand 50000 non-null int64
bodyType 48496 non-null float64
fuelType 47076 non-null float64
gearbox 48032 non-null float64
power 50000 non-null int64
kilometer 50000 non-null float64
notRepairedDamage 50000 non-null object
regionCode 50000 non-null int64
seller 50000 non-null int64
offerType 50000 non-null int64
creatDate 50000 non-null int64
v_0 50000 non-null float64
v_1 50000 non-null float64
v_2 50000 non-null float64
v_3 50000 non-null float64
v_4 50000 non-null float64
v_5 50000 non-null float64
v_6 50000 non-null float64
v_7 50000 non-null float64
v_8 50000 non-null float64
v_9 50000 non-null float64
v_10 50000 non-null float64
v_11 50000 non-null float64
v_12 50000 non-null float64
v_13 50000 non-null float64
v_14 50000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB
2.3.4 Check for missing values and anomalies
## 1) Check each column for NaN values
Train_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 1
brand 0
bodyType 4506
fuelType 8680
gearbox 5981
power 0
kilometer 0
notRepairedDamage 0
regionCode 0
seller 0
offerType 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
Test_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 0
brand 0
bodyType 1504
fuelType 2924
gearbox 1968
power 0
kilometer 0
notRepairedDamage 0
regionCode 0
seller 0
offerType 0
creatDate 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
# Visualize the NaN counts
missing=Train_data.isnull().sum()
missing=missing[missing>0]
missing.sort_values(inplace=True)
missing.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x263bf61cda0>
[Figure: output_17_1.png]
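The statements above make it easy to see which columns contain NaN and to print how many NaN values each one has. The main goal is to judge whether the number of NaN values is really large. If it is small, you would usually fill them in. If you use a tree model such as LightGBM (lgb), you can leave them missing and let the trees handle them. If a column has too many NaN values, consider dropping it.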
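The decision rule described above (fill when few values are missing, consider dropping when many are) can be sketched as a small report. The 50% cut-off, the helper name `missing_report`, and the frame `demo` are assumptions for illustration; the original text does not state a threshold.

```python
import numpy as np
import pandas as pd

def missing_report(df, drop_threshold=0.5):
    """Summarise missing values per column and attach a rough suggestion."""
    report = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        "missing_ratio": df.isnull().mean(),
    })
    # Keep only columns that actually have missing values.
    report = report[report["missing_count"] > 0].sort_values("missing_ratio")
    report["suggestion"] = np.where(
        report["missing_ratio"] > drop_threshold,
        "consider dropping",
        "fill, or leave as NaN for tree models",
    )
    return report

# Hypothetical data, not taken from the competition files.
demo = pd.DataFrame({
    "bodyType": [1.0, np.nan, 3.0, 4.0],
    "fuelType": [np.nan, np.nan, np.nan, 0.0],
    "brand": [6, 1, 15, 10],
})

report = missing_report(demo)
print(report)
```

The threshold is a judgment call and depends on how informative the column is expected to be.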
# Visualize the missing values with missingno (msno)
msno.matrix(Train_data.sample(250)) # a random sample of 250 rows is shown here
<matplotlib.axes._subplots.AxesSubplot at 0x263bf70dcc0>
[Figure: output_19_1.png]
msno.matrix(Train_data)
<matplotlib.axes._subplots.AxesSubplot at 0x263c17172e8>
[Figure: output_20_1.png]
msno.bar(Train_data.sample(1000))
<matplotlib.axes._subplots.AxesSubplot at 0x263bf973358>
[Figure: output_21_1.png]
# Visualize the missing values
msno.matrix(Test_data)
<matplotlib.axes._subplots.AxesSubplot at 0x263c19801d0>
[Figure: output_22_1.png]
msno.bar(Test_data)
<matplotlib.axes._subplots.AxesSubplot at 0x263bf9d3be0>
[Figure: output_23_1.png]
The test set and the training set have similar missing-value patterns.
Training set: model 1, brand 0, bodyType 4506, fuelType 8680, gearbox 5981.
Test set: bodyType 1504, fuelType 2924, gearbox 1968.
In both sets, fuelType has the most missing values.
## 2) Check for anomalous values
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID 150000 non-null int64
name 150000 non-null int64
regDate 150000 non-null int64
model 149999 non-null float64
brand 150000 non-null int64
bodyType 145494 non-null float64
fuelType 141320 non-null float64
gearbox 144019 non-null float64
power 150000 non-null int64
kilometer 150000 non-null float64
notRepairedDamage 150000 non-null object
regionCode 150000 non-null int64
seller 150000 non-null int64
offerType 150000 non-null int64
creatDate 150000 non-null int64
price 150000 non-null int64
v_0 150000 non-null float64
v_1 150000 non-null float64
v_2 150000 non-null float64
v_3 150000 non-null float64
v_4 150000 non-null float64
v_5 150000 non-null float64
v_6 150000 non-null float64
v_7 150000 non-null float64
v_8 150000 non-null float64
v_9 150000 non-null float64
v_10 150000 non-null float64
v_11 150000 non-null float64
v_12 150000 non-null float64
v_13 150000 non-null float64
v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
Only notRepairedDamage (150000 non-null object) has dtype object; every other column is numeric. Let's look at the values of notRepairedDamage in detail.
Train_data['notRepairedDamage'].value_counts()
0.0 111361
- 24324
1.0 14315
Name: notRepairedDamage, dtype: int64
The value "-" is also a missing-value marker. Because many models handle NaN directly, we first replace "-" with NaN here.
Train_data['notRepairedDamage']=Train_data['notRepairedDamage'].replace('-',np.nan)
Train_data['notRepairedDamage'].value_counts()
0.0 111361
1.0 14315
Name: notRepairedDamage, dtype: int64
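After the replacement the column may still carry dtype object, since info() reported it as object and the remaining values look like the strings "0.0" and "1.0". A minimal sketch of converting it to a numeric column follows, assuming the values are stored as strings; `raw` and `cleaned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Train_data['notRepairedDamage'].
raw = pd.Series(["0.0", "-", "1.0", "0.0"], name="notRepairedDamage")

# Replace the "-" marker with NaN, then parse the remaining strings as floats.
cleaned = pd.to_numeric(raw.replace("-", np.nan))

print(cleaned)
print(cleaned.isnull().sum())
```

pd.to_numeric(raw, errors="coerce") would do both steps at once, but it would also silently turn any other unexpected string into NaN, so the explicit replace is the more cautious choice.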
Train_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 1
brand 0
bodyType 4506
fuelType 8680
gearbox 5981
power 0
kilometer 0
notRepairedDamage 24324
regionCode 0
seller 0
offerType 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
Test_data['notRepairedDamage'].value_counts()
0.0 37224
- 8069
1.0 4707
Name: notRepairedDamage, dtype: int64
Test_data['notRepairedDamage']=Test_data['notRepairedDamage'].replace('-',np.nan)
Test_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 0
brand 0
bodyType 1504
fuelType 2924
gearbox 1968
power 0
kilometer 0
notRepairedDamage 8069
regionCode 0
seller 0
offerType 0
creatDate 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
Test_data['notRepairedDamage'].value_counts()
0.0 37224
1.0 4707
Name: notRepairedDamage, dtype: int64
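The "seller" and "offerType" features are severely imbalanced and generally will not help prediction, so we drop them here. You could dig into them further, but it is usually not worthwhile.
# One question, though: how were these two features found to be so skewed? By trying them one by one?
# We can look at the actual distribution of these two features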
Train_data['seller'].value_counts()
0 149999
1 1
Name: seller, dtype: int64
Train_data['offerType'].value_counts()
0 150000
Name: offerType, dtype: int64
# So drop them, from both the training set and the test set
del Train_data['seller']
del Test_data['seller']
del Train_data['offerType']
del Test_data['offerType']
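One possible answer to the question of how such skewed features can be found without checking columns one by one: compute, for every column, the share of rows taken by its most frequent value. The sketch below is an assumption about how this could be done, not the method used by the original tutorial; `dominant_share`, `demo`, and the 0.75 cut-off are illustrative.

```python
import pandas as pd

def dominant_share(df):
    """For each column, return the fraction of rows taken by its most frequent value."""
    shares = {
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in df.columns
    }
    return pd.Series(shares).sort_values(ascending=False)

# Hypothetical data, not taken from the competition files.
demo = pd.DataFrame({
    "seller": [0, 0, 0, 1],
    "offerType": [0, 0, 0, 0],
    "brand": [6, 1, 15, 10],
})

shares = dominant_share(demo)
skewed_columns = shares[shares >= 0.75].index.tolist()
print(shares)
print(skewed_columns)
```

On the real training set, a column such as seller (149999 rows with value 0 out of 150000) would come out with a share very close to 1.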
2.3.5 Understand the distribution of the target
Train_data['price']
0 1850
1 3600
2 6222
3 2400
4 5200
5 8000
6 3500
7 1000
8 2850
9 650
10 3100
11 5450
12 1600
13 3100
14 6900
15 3200
16 10500
17 3700
18 790
19 1450
20 990
21 2800
22 350
23 599
24 9250
25 3650
26 2800
27 2399
28 4900
29 2999
...
149970 900
149971 3400
149972 999
149973 3500
149974 4500
149975 3990
149976 1200
149977 330
149978 3350
149979 5000
149980 4350
149981 9000
149982 2000
149983 12000
149984 6700
149985 4200
149986 2800
149987 3000
149988 7500
149989 1150
149990 450
149991 24950
149992 950
149993 4399
149994 14780
149995 5900
149996 9500
149997 7500
149998 4999
149999 4700
Name: price, Length: 150000, dtype: int64
Train_data['price'].value_counts()
500 2337
1500 2158
1200 1922
1000 1850
2500 1821
600 1535
3500 1533
800 1513
2000 1378
999 1356
750 1279
4500 1271
650 1257
1800 1223
2200 1201
850 1198
700 1174
900 1107
1300 1105
950 1104
3000 1098
1100 1079
5500 1079
1600 1074
300 1071
550 1042
350 1005
1250 1003
6500 973
1999 929
...
21560 1
7859 1
3120 1
2279 1
6066 1
6322 1
4275 1
10420 1
43300 1
305 1
1765 1
15970 1
44400 1
8885 1
2992 1
31850 1
15413 1
13495 1
9525 1
7270 1
13879 1
3760 1
24250 1
11360 1
10295 1
25321 1
8886 1
8801 1
37920 1
8188 1
Name: price, Length: 3763, dtype: int64
## 1) Overall distribution (unbounded Johnson distribution (johnsonsu)? normal (norm)? log-normal (lognorm)? etc.)
import scipy.stats as st
y=Train_data['price']
plt.figure(1);plt.title('Johnson SU')
sns.distplot(y,kde=False,fit=st.johnsonsu)
plt.figure(2);plt.title('Normal')
sns.distplot(y,kde=False,fit=st.norm)
plt.figure(3);plt.title('Log Normal')
sns.distplot(y,kde=False,fit=st.lognorm)
# Why use sns.distplot() for these plots?
# Answer: Seaborn is a Python visualization library built on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. In effect, Seaborn wraps matplotlib in a higher-level API, which makes plotting easier and gives polished figures without much tuning.
<matplotlib.axes._subplots.AxesSubplot at 0x263b6bbd080>
[Figure: output_42_1.png]
[Figure: output_42_2.png]
[Figure: output_42_3.png]
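Price does not follow a normal distribution, so it needs to be transformed before regression. Although the log transform does a good job, the best fit is the unbounded Johnson distribution.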
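One way to back the visual comparison with a number is to fit each candidate by maximum likelihood and compare the resulting log-likelihoods (higher is better). The sketch below does this on synthetic right-skewed data rather than on the competition prices; `compare_fits` and `sample` are illustrative names, and the ranking it prints says nothing about the real price column.

```python
import numpy as np
import scipy.stats as st

def compare_fits(values, candidates=(st.johnsonsu, st.norm, st.lognorm)):
    """Fit each candidate distribution by MLE and return its total log-likelihood."""
    scores = {}
    for dist in candidates:
        params = dist.fit(values)  # shape parameters (if any), then loc and scale
        scores[dist.name] = float(np.sum(dist.logpdf(values, *params)))
    return scores

# Synthetic, right-skewed sample standing in for Train_data['price'].
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=0.75, size=2000)

scores = compare_fits(sample)
for name, loglik in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(name, round(loglik, 1))
```

Log-likelihood favours distributions with more parameters, so a fairer comparison would use a penalised criterion such as AIC; the sketch keeps to the raw value for brevity.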
## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis:%f" % Train_data['price'].kurt())
Skewness: 3.346487
Kurtosis:18.995183
[Figure: output_44_1.png]
# Skewness and kurtosis of every feature in the training set; good to know
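For reading the numbers above: pandas' skew() and kurt() return bias-corrected sample estimates, and kurt() reports excess kurtosis, so a normal distribution scores about 0 on both. The sketch below checks this against scipy.stats on synthetic data; `sample` is an illustrative name and the data is not from the competition.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Synthetic right-skewed sample standing in for Train_data['price'].
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=1000))

pandas_skew = sample.skew()
pandas_kurt = sample.kurt()

# The same estimators in scipy: bias correction on, Fisher (excess) definition.
scipy_skew = st.skew(sample.to_numpy(), bias=False)
scipy_kurt = st.kurtosis(sample.to_numpy(), fisher=True, bias=False)

print("Skewness: %f (pandas) vs %f (scipy)" % (pandas_skew, scipy_skew))
print("Kurtosis: %f (pandas) vs %f (scipy)" % (pandas_kurt, scipy_kurt))
```

A positive skewness, as for price, means a long right tail; a large positive excess kurtosis means heavier tails than a normal distribution.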
Train_data.skew(),Train_data.kurt()
(SaleID 6.017846e-17
name 5.576058e-01
regDate 2.849508e-02
model 1.484388e+00
brand 1.150760e+00
bodyType 9.915299e-01
fuelType 1.595486e+00
gearbox 1.317514e+00
power 6.586318e+01
kilometer -1.525921e+00
notRepairedDamage 2.430640e+00
regionCode 6.888812e-01
creatDate -7.901331e+01
price 3.346487e+00
v_0 -1.316712e+00
v_1 3.594543e-01
v_2 4.842556e+00
v_3 1.062920e-01
v_4 3.679890e-01
v_5 -4.737094e+00
v_6 3.680730e-01
v_7 5.130233e+00
v_8 2.046133e-01
v_9 4.195007e-01
v_10 2.522046e-02
v_11 3.029146e+00
v_12 3.653576e-01
v_13 2.679152e-01
v_14 -1.186355e+00
dtype: float64,
SaleID -1.200000
name -1.039945
regDate -0.697308
model 1.740483
brand 1.076201
bodyType 0.206937
fuelType 5.880049
gearbox -0.264161
power 5733.451054
kilometer 1.141934
notRepairedDamage 3.908072
regionCode -0.340832
creatDate 6881.080328
price 18.995183
v_0 3.993841
v_1 -1.753017
v_2 23.860591
v_3 -0.418006
v_4 -0.197295
v_5 22.934081
v_6 -1.742567
v_7 25.845489
v_8 -0.636225
v_9 -0.321491
v_10 -0.577935
v_11 12.568731
v_12 0.268937
v_13 -0.438274
v_14 2.393526
dtype: float64)
sns.distplot(Train_data.skew(),color='blue',axlabel='Skewness')
<matplotlib.axes._subplots.AxesSubplot at 0x263c3402f28>
[Figure: output_46_1.png]
sns.distplot(Train_data.kurt(),color='orange',axlabel='Kurtosis')
<matplotlib.axes._subplots.AxesSubplot at 0x263db8ed240>
[Figure: output_47_1.png]
## 3) Check the frequency of the target values
plt.hist(Train_data['price'],orientation='vertical',histtype='bar',color='red');
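# Tip: in a notebook, ending a plt statement with ";" suppresses the text output of its return value so only the figure is shown; you can also add a plt.show() call instead
[Figure: output_48_0.png]
Looking at the frequencies, very few values exceed 20000. These could also be treated as special values (outliers) and either filled in or removed.
# After a log transform the distribution is much more even, so we can predict on the log scale; this is a common trick in prediction problems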
plt.hist(np.log(Train_data['price']),orientation='vertical',histtype='bar',color='red');
[Figure: output_50_0.png]
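When a model is trained on the log of the price, its predictions have to be mapped back to the original scale. A minimal sketch of that round trip follows, using the first five prices from the output above. It uses np.log1p and np.expm1 instead of np.log as a precaution so that a price of 0 would not produce -inf; that substitution is a choice of this sketch, not something the original does.

```python
import numpy as np

# First five prices from the Train_data['price'] output above.
prices = np.array([1850, 3600, 6222, 2400, 5200], dtype=float)

# Forward transform: what the model would be trained on.
log_prices = np.log1p(prices)

# Inverse transform: applied to model predictions to get prices back.
restored = np.expm1(log_prices)

print(log_prices)
print(restored)
```

Errors measured on the log scale correspond to relative rather than absolute price errors, which is worth keeping in mind if the evaluation metric is defined on raw prices.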
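2.3.6 Split the features into categorical and numeric, and inspect the unique-value distribution of the categorical features
Data types
Columns
# Separate the label, i.e. the target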
Y_train=Train_data['price']
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6','v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],dtype='object')
'''
This way of splitting works for data that has not already been label-encoded.
It does not apply here; the features have to be split by hand according to their actual meaning.
+ Numeric features
numeric_features=Train_data.select_dtypes(include=[np.number])
numeric_features.columns
+ Categorical features
categorical_features=Train_data.select_dtypes(include=['object'])
categorical_features.columns'''
numeric_features=['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
                  'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
categorical_features=['name', 'model', 'brand', 'bodyType', 'fuelType',
                      'gearbox', 'notRepairedDamage', 'regionCode']
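The reason the dtype-based split does not apply here can be seen on a tiny example: a label-encoded category such as brand is stored as an integer, so select_dtypes counts it as numeric. The frame `demo` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training data.
demo = pd.DataFrame({
    "brand": [6, 1, 15],                       # category, but label-encoded as int
    "power": [60, 0, 163],                     # genuinely numeric
    "notRepairedDamage": ["0.0", "-", "1.0"],  # stored as strings
})

# Split purely by dtype.
auto_numeric = demo.select_dtypes(include=[np.number]).columns.tolist()
auto_categorical = demo.select_dtypes(exclude=[np.number]).columns.tolist()

print(auto_numeric)      # brand ends up here although it is a category
print(auto_categorical)
```

This is why the lists numeric_features and categorical_features are written out by hand above.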
# Unique-value distribution of the features; training set
for cat_fea in categorical_features:
    print(cat_fea+" feature distribution:")
    print("{} has {} distinct values".format(cat_fea,Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
name feature distribution:
name has 99662 distinct values
708 282
387 282
55 280
1541 263
203 233
53 221
713 217
290 197
1186 184
911 182
2044 176
1513 160
1180 158
631 157
893 153
2765 147
473 141
1139 137
1108 132
444 129
306 127
2866 123
2402 116
533 114
1479 113
422 113
4635 110
725 110
964 109
1373 104
...
89083 1
95230 1
164864 1
173060 1
179207 1
181256 1
185354 1
25564 1
19417 1
189324 1
162719 1
191373 1
193422 1
136082 1
140180 1
144278 1
146327 1
148376 1
158621 1
1404 1
15319 1
46022 1
64463 1
976 1
3025 1
5074 1
7123 1
11221 1
13270 1
174485 1
Name: name, Length: 99662, dtype: int64
model feature distribution:
model has 248 distinct values
0.0 11762
19.0 9573
4.0 8445
1.0 6038
29.0 5186
48.0 5052
40.0 4502
26.0 4496
8.0 4391
31.0 3827
13.0 3762
17.0 3121
65.0 2730
49.0 2608
46.0 2454
30.0 2342
44.0 2195
5.0 2063
10.0 2004
21.0 1872
73.0 1789
11.0 1775
23.0 1696
22.0 1524
69.0 1522
63.0 1469
7.0 1460
16.0 1349
88.0 1309
66.0 1250
...
141.0 37
133.0 35
216.0 30
202.0 28
151.0 26
226.0 26
231.0 23
234.0 23
233.0 20
198.0 18
224.0 18
227.0 17
237.0 17
220.0 16
230.0 16
239.0 14
223.0 13
236.0 11
241.0 10
232.0 10
229.0 10
235.0 7
246.0 7
243.0 4
244.0 3
245.0 2
209.0 2
240.0 2
242.0 2
247.0 1
Name: model, Length: 248, dtype: int64
brand feature distribution:
brand has 40 distinct values
0 31480
4 16737
14 16089
10 14249
1 13794
6 10217
9 7306
5 4665
13 3817
11 2945
3 2461
7 2361
16 2223
8 2077
25 2064
27 2053
21 1547
15 1458
19 1388
20 1236
12 1109
22 1085
26 966
30 940
17 913
24 772
28 649
32 592
29 406
37 333
2 321
31 318
18 316
36 228
34 227
33 218
23 186
35 180
38 65
39 9
Name: brand, dtype: int64
bodyType feature distribution:
bodyType has 8 distinct values
0.0 41420
1.0 35272
2.0 30324
3.0 13491
4.0 9609
5.0 7607
6.0 6482
7.0 1289
Name: bodyType, dtype: int64
fuelType feature distribution:
fuelType has 7 distinct values
0.0 91656
1.0 46991
2.0 2212
3.0 262
4.0 118
5.0 45
6.0 36
Name: fuelType, dtype: int64
gearbox feature distribution:
gearbox has 2 distinct values
0.0 111623
1.0 32396
Name: gearbox, dtype: int64
notRepairedDamage feature distribution:
notRepairedDamage has 2 distinct values
0.0 111361
1.0 14315
Name: notRepairedDamage, dtype: int64
regionCode feature distribution:
regionCode has 7905 distinct values
419 369
764 258
125 137
176 136
462 134
428 132
24 130
1184 130
122 129
828 126
70 125
827 120
207 118
1222 117
2418 117
85 116
2615 115
2222 113
759 112
188 111
1757 110
1157 109
2401 107
1069 107
3545 107
424 107
272 107
451 106
450 105
129 105
...
6324 1
7372 1
7500 1
8107 1
2453 1
7942 1
5135 1
6760 1
8070 1
7220 1
8041 1
8012 1
5965 1
823 1
7401 1
8106 1
5224 1
8117 1
7507 1
7989 1
6505 1
6377 1
8042 1
7763 1
7786 1
6414 1
7063 1
4239 1
5931 1
7267 1
Name: regionCode, Length: 7905, dtype: int64
# Unique-value distribution of the features; test set
for cat_fea in categorical_features:
    print(cat_fea+" feature distribution:")
    print("{} has {} distinct values".format(cat_fea,Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())
name feature distribution:
name has 37536 distinct values
387 94
55 93
1541 86
708 85
203 78
713 75
911 72
1180 71
53 68
290 68
631 67
1186 60
473 54
306 53
2866 52
2044 50
422 49
893 47
1513 46
2765 45
533 44
964 44
1139 41
1479 41
2825 38
444 37
4635 37
984 37
282 35
691 33
...
9747 1
7857 1
75120 1
144754 1
15731 1
66932 1
76360 1
66082 1
89231 1
93561 1
161146 1
21886 1
42368 1
101765 1
89653 1
38278 1
89645 1
60809 1
62858 1
195979 1
185951 1
81299 1
168479 1
28057 1
30106 1
97691 1
155039 1
44449 1
112034 1
105129 1
Name: name, Length: 37536, dtype: int64
model feature distribution:
model has 245 distinct values
0.0 3772
19.0 3226
4.0 2790
1.0 1981
29.0 1778
48.0 1711
40.0 1524
26.0 1512
8.0 1464
31.0 1281
13.0 1214
17.0 1033
65.0 918
49.0 880
46.0 871
30.0 793
44.0 731
5.0 677
21.0 628
10.0 625
23.0 583
11.0 562
73.0 561
69.0 531
63.0 515
16.0 506
22.0 482
7.0 442
88.0 416
66.0 395
...
157.0 12
151.0 12
141.0 12
193.0 12
89.0 12
68.0 11
233.0 11
226.0 11
133.0 11
227.0 8
198.0 8
18.0 8
224.0 7
237.0 7
239.0 6
231.0 6
235.0 6
220.0 6
246.0 4
234.0 4
230.0 4
223.0 3
236.0 3
232.0 3
245.0 3
229.0 2
209.0 2
242.0 1
241.0 1
244.0 1
Name: model, Length: 245, dtype: int64
brand feature distribution:
brand has 40 distinct values
0 10473
4 5532
14 5345
10 4713
1 4627
6 3500
9 2360
5 1485
13 1386
11 942
3 820
16 770
25 728
7 727
8 708
27 623
21 543
15 476
19 473
20 411
12 399
22 358
26 328
30 321
17 312
24 248
28 216
32 183
29 139
37 117
2 115
31 113
18 107
33 84
35 75
34 75
36 72
23 60
38 31
39 5
Name: brand, dtype: int64
bodyType feature distribution:
bodyType has 8 distinct values
0.0 13765
1.0 11960
2.0 9886
3.0 4491
4.0 3258
5.0 2494
6.0 2212
7.0 430
Name: bodyType, dtype: int64
fuelType feature distribution:
fuelType has 7 distinct values
0.0 30489
1.0 15708
2.0 736
3.0 78
4.0 31
5.0 18
6.0 16
Name: fuelType, dtype: int64
gearbox feature distribution:
gearbox has 2 distinct values
0.0 37131
1.0 10901
Name: gearbox, dtype: int64
notRepairedDamage feature distribution:
notRepairedDamage has 2 distinct values
0.0 37224
1.0 4707
Name: notRepairedDamage, dtype: int64
regionCode feature distribution:
regionCode has 6998 distinct values
419 120
764 98
176 48
3304 45
85 45
2222 45
3545 44
462 42
1000 42
2154 42
24 41
2775 41
70 41
309 40
1688 40
188 40
792 40
955 39
172 39
3573 39
122 39
759 38
60 38
2418 38
256 38
1483 38
2690 37
125 37
827 37
450 37
...
1521 1
7602 1
5523 1
7538 1
5459 1
7410 1
6630 1
6374 1
6342 1
1010 1
6897 1
5104 1
7089 1
4069 1
6993 1
2052 1
4944 1
2867 1
4912 1
2771 1
6310 1
6865 1
6833 1
4656 1
6609 1
2451 1
4231 1
6513 1
6481 1
6061 1
Name: regionCode, Length: 6998, dtype: int64
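2.3.7 Numeric feature analysis
# Add the target
# Caution: run this cell only once. The output below lists 'price' twice, apparently because the cell was executed twice, and that duplicate is what leads to the errors in the correlation step further down.
numeric_features.append('price')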
numeric_features
['power','kilometer','v_0','v_1','v_2','v_3','v_4','v_5','v_6','v_7','v_8','v_9','v_10','v_11','v_12','v_13','v_14','price','price']
Train_data.head().append(Train_data.tail())
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
149995 | 149995 | 163978 | 20000607 | 121.0 | 10 | 4.0 | 0.0 | 1.0 | 163 | 15.0 | ... | 0.280264 | 0.000310 | 0.048441 | 0.071158 | 0.019174 | 1.988114 | -2.983973 | 0.589167 | -1.304370 | -0.302592 |
149996 | 149996 | 184535 | 20091102 | 116.0 | 11 | 0.0 | 0.0 | 0.0 | 125 | 10.0 | ... | 0.253217 | 0.000777 | 0.084079 | 0.099681 | 0.079371 | 1.839166 | -2.774615 | 2.553994 | 0.924196 | -0.272160 |
149997 | 149997 | 147587 | 20101003 | 60.0 | 11 | 1.0 | 1.0 | 0.0 | 90 | 6.0 | ... | 0.233353 | 0.000705 | 0.118872 | 0.100118 | 0.097914 | 2.439812 | -1.630677 | 2.290197 | 1.891922 | 0.414931 |
149998 | 149998 | 45907 | 20060312 | 34.0 | 10 | 3.0 | 1.0 | 0.0 | 156 | 15.0 | ... | 0.256369 | 0.000252 | 0.081479 | 0.083558 | 0.081498 | 2.075380 | -2.633719 | 1.414937 | 0.431981 | -1.659014 |
149999 | 149999 | 177672 | 19990204 | 19.0 | 28 | 6.0 | 0.0 | 1.0 | 193 | 12.5 | ... | 0.284475 | 0.000000 | 0.040072 | 0.062543 | 0.025819 | 1.978453 | -3.179913 | 0.031724 | -1.483350 | -0.342674 |
10 rows × 29 columns
## 1) Correlation analysis
price_numeric=Train_data[numeric_features]
correlation=price_numeric.corr()
correlation
print(correlation['price'])
price price
power 0.219834 0.219834
kilometer -0.440519 -0.440519
v_0 0.628397 0.628397
v_1 0.060914 0.060914
v_2 0.085322 0.085322
v_3 -0.730946 -0.730946
v_4 -0.147085 -0.147085
v_5 0.164317 0.164317
v_6 0.068970 0.068970
v_7 -0.053024 -0.053024
v_8 0.685798 0.685798
v_9 -0.206205 -0.206205
v_10 -0.246175 -0.246175
v_11 -0.275320 -0.275320
v_12 0.692823 0.692823
v_13 -0.013993 -0.013993
v_14 0.035911 0.035911
price 1.000000 1.000000
price 1.000000 1.000000
print(correlation['price'].sort_values(ascending=True))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-81-e5597112ec71> in <module>
3 correlation=price_numeric.corr()
4 correlation
----> 5 print(correlation['price'].sort_values(ascending=True))
TypeError: sort_values() missing 1 required positional argument: 'by'
print(correlation['price'].sort_values(by=correlation['price'],ascending = False),'\n')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-74-a6cc18605a79> in <module>
2 price_numeric=Train_data[numeric_features]
3 correlation=price_numeric.corr()
----> 4 print(correlation['price'].sort_values(by=correlation['price'],ascending = False),'\n')
G:\baidudownload2\anaconda\lib\site-packages\pandas\core\frame.py in sort_values(self, by, axis, ascending, inplace, kind, na_position)
4717
4718 by = by[0]
-> 4719 k = self._get_label_or_level_values(by, axis=axis)
4720
4721 if isinstance(ascending, (tuple, list)):
G:\baidudownload2\anaconda\lib\site-packages\pandas\core\generic.py in _get_label_or_level_values(self, key, axis)
1704 values = self.axes[axis].get_level_values(key)._values
1705 else:
-> 1706 raise KeyError(key)
1707
1708 # Check for duplicates
KeyError: price price
power 0.219834 0.219834
kilometer -0.440519 -0.440519
v_0 0.628397 0.628397
v_1 0.060914 0.060914
v_2 0.085322 0.085322
v_3 -0.730946 -0.730946
v_4 -0.147085 -0.147085
v_5 0.164317 0.164317
v_6 0.068970 0.068970
v_7 -0.053024 -0.053024
v_8 0.685798 0.685798
v_9 -0.206205 -0.206205
v_10 -0.246175 -0.246175
v_11 -0.275320 -0.275320
v_12 0.692823 0.692823
v_13 -0.013993 -0.013993
v_14 0.035911 0.035911
price 1.000000 1.000000
price 1.000000 1.000000
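Both tracebacks above are consistent with 'price' appearing twice in numeric_features (the printed output has two identical price columns): correlation['price'] then returns a two-column DataFrame rather than a Series, and DataFrame.sort_values requires `by`. A minimal sketch of one way around it, using a small made-up frame (`demo`, `features`, and `unique_features` are illustrative names, not from the original notebook):

```python
import pandas as pd

# Toy data standing in for Train_data (values are made up for illustration).
demo = pd.DataFrame({
    "power": [60, 0, 163, 193, 68],
    "kilometer": [12.5, 15.0, 12.5, 15.0, 5.0],
    "price": [1850, 3600, 6222, 2400, 5200],
})

# A feature list that accidentally contains 'price' twice.
features = ["power", "kilometer", "price", "price"]

# Selecting with a duplicated label yields duplicated columns, so
# corr()['price'] is a DataFrame and sort_values() would demand `by`.
dup_corr = demo[features].corr()
print(type(dup_corr["price"]).__name__)

# Drop duplicate names while preserving order, then recompute.
unique_features = list(dict.fromkeys(features))
correlation = demo[unique_features].corr()

# Now correlation['price'] is a Series and sorts without `by`.
price_corr = correlation["price"].sort_values(ascending=False)
print(price_corr)
```

Deduplicating the feature list once, right after it is built, would also remove the repeated price rows in the later outputs.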
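print(data.corr()) # correlation matrix: the correlation coefficient between every pair of dishes
print("Correlation between 'Steamed Chicken Feet with Lily Sauce' and the other dishes:")
print(data.corr()[u'price']) # show only the correlations of 'Steamed Chicken Feet with Lily Sauce' with the other dishes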
f,ax=plt.subplots(figsize=(7,7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square=True,vmax=0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x263c33cfa90>
[Figure: output_67_1.png (heatmap, "Correlation of Numeric Features with Price")]
del price_numeric['price']
## 2) Check the skewness and kurtosis of the features
for col in numeric_features:
    print('{:15}'.format(col),
          'Skewness:{:05.2f}'.format(Train_data[col].skew()),
          ' ',
          'kurtosis:{:06.2f}'.format(Train_data[col].kurt()))
power Skewness:65.86 kurtosis:5733.45
kilometer Skewness:-1.53 kurtosis:001.14
v_0 Skewness:-1.32 kurtosis:003.99
v_1 Skewness:00.36 kurtosis:-01.75
v_2 Skewness:04.84 kurtosis:023.86
v_3 Skewness:00.11 kurtosis:-00.42
v_4 Skewness:00.37 kurtosis:-00.20
v_5 Skewness:-4.74 kurtosis:022.93
v_6 Skewness:00.37 kurtosis:-01.74
v_7 Skewness:05.13 kurtosis:025.85
v_8 Skewness:00.20 kurtosis:-00.64
v_9 Skewness:00.42 kurtosis:-00.32
v_10 Skewness:00.03 kurtosis:-00.58
v_11 Skewness:03.03 kurtosis:012.57
v_12 Skewness:00.37 kurtosis:000.27
v_13 Skewness:00.27 kurtosis:-00.44
v_14 Skewness:-1.19 kurtosis:002.39
price Skewness:03.35 kurtosis:019.00
price Skewness:03.35 kurtosis:019.00
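To help read these numbers: pandas' skew() is near 0 for a symmetric distribution, and kurt() reports excess kurtosis, so a normal distribution also scores near 0. A small sketch on synthetic data (the samples, sizes, and seed are assumptions for illustration only) shows how a long right tail, like the one price has, pushes both values up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Symmetric reference sample and a right-skewed (log-normal) sample.
normal = pd.Series(rng.normal(size=10_000))
skewed = pd.Series(rng.lognormal(size=10_000))

# Same print layout as the loop above, applied to the two samples.
for name, s in [("normal", normal), ("lognormal", skewed)]:
    print('{:15}'.format(name),
          'Skewness:{:05.2f}'.format(s.skew()),
          ' ',
          'kurtosis:{:06.2f}'.format(s.kurt()))
```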
## 3) Visualize the distribution of each numeric feature
f=pd.melt(Train_data,value_vars=numeric_features)
g=sns.FacetGrid(f,col="variable",col_wrap=2,sharex=False,sharey=False)
g=g.map(sns.distplot,"value")
[Figure: output_70_0.png (distribution of each numeric feature)]
The anonymous features appear to be relatively evenly distributed.
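The pd.melt call in step 3 is what makes the grid possible: it reshapes the wide table into two columns, `variable` (the feature name) and `value`, so FacetGrid can draw one panel per feature. A minimal sketch on a made-up two-feature frame (`demo` and `long_form` are illustrative names):

```python
import pandas as pd

# Tiny stand-in for Train_data with two numeric features.
demo = pd.DataFrame({"power": [60, 0, 163], "kilometer": [12.5, 15.0, 12.5]})

# Wide -> long: one row per (feature, value) pair.
long_form = pd.melt(demo, value_vars=["power", "kilometer"])
print(long_form)

# Each feature contributes as many rows as the original frame had.
print(long_form["variable"].value_counts())
```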
## 4) Visualize the pairwise relationships between numeric features
sns.set()
columns=['price','power', 'v_0', 'v_1', 'v_2', 'v_5', 'v_6','v_8','v_12', 'v_14'] # why these particular features?
sns.pairplot(Train_data[columns],height=2,kind='scatter',diag_kind='kde'); # height replaces the older size argument (seaborn >= 0.9)
[Figure: output_72_0.png (pair plot of the selected numeric features)]
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6','v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],dtype='object')
Y_train
0 1850
1 3600
2 6222
3 2400
4 5200
5 8000
6 3500
7 1000
8 2850
9 650
10 3100
11 5450
12 1600
13 3100
14 6900
15 3200
16 10500
17 3700
18 790
19 1450
20 990
21 2800
22 350
23 599
24 9250
25 3650
26 2800
27 2399
28 4900
29 2999
...
149970 900
149971 3400
149972 999
149973 3500
149974 4500
149975 3990
149976 1200
149977 330
149978 3350
149979 5000
149980 4350
149981 9000
149982 2000
149983 12000
149984 6700
149985 4200
149986 2800
149987 3000
149988 7500
149989 1150
149990 450
149991 24950
149992 950
149993 4399
149994 14780
149995 5900
149996 9500
149997 7500
149998 4999
149999 4700
Name: price, Length: 150000, dtype: int64
## 5) Visualize the regression relationship between price and several features
fig,((ax1,ax2),(ax3,ax4),(ax5,ax6),(ax7,ax8),(ax9,ax10))=plt.subplots(nrows=5,ncols=2,figsize=(24,20))
# ['v_12', 'v_8' , 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train,Train_data['v_12']],axis = 1)
sns.regplot(x='v_12',y = 'price', data = v_12_scatter_plot,scatter= True, fit_reg=True, ax=ax1)

v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)

v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)

power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)

v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)

v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)

v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)

v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)

v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)

v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
<matplotlib.axes._subplots.AxesSubplot at 0x263d7d11cf8>
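The ten near-identical regplot calls could also be written as a loop over the axes. The following is only a sketch under assumptions: it uses a small random stand-in (`demo`) for Train_data, only three of the feature names, a made-up linear relation to price, and the non-interactive Agg backend so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Stand-in for Train_data: three features plus a noisy linear 'price'.
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["v_12", "v_8", "v_0"])
demo["price"] = 3 * demo["v_12"] + rng.normal(size=200)

features = ["v_12", "v_8", "v_0"]
fig, axes = plt.subplots(nrows=1, ncols=len(features), figsize=(12, 4))

# One regression scatter plot per feature, each drawn on its own axis.
for ax, feature in zip(axes, features):
    sns.regplot(x=feature, y="price", data=demo,
                scatter=True, fit_reg=True, ax=ax)
```

With the real data, `features` would hold the ten column names and `axes.flatten()` would walk the 5x2 grid in the same order as ax1 to ax10.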
2.3.8 Categorical feature analysis
## 1) Number of unique values per feature
for fea in categorical_features:
    print(Train_data[fea].nunique())
99662
248
40
8
7
2
2
7905
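The bare numbers above are easier to read when paired with their column names. Calling nunique() on the sub-frame returns a labeled Series in one step; a sketch with a made-up frame (`demo`, `categorical`, and the values are illustrative):

```python
import pandas as pd

# Toy stand-in for a few categorical columns of Train_data.
demo = pd.DataFrame({
    "brand": [6, 1, 15, 10, 5, 1],
    "gearbox": [0.0, 0.0, 0.0, 1.0, 0.0, None],
    "fuelType": [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
})
categorical = ["brand", "gearbox", "fuelType"]

# One labeled count per column; NaN is not counted as a category by default.
counts = demo[categorical].nunique()
print(counts)
```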
categorical_features
['name','model','brand','bodyType','fuelType','gearbox','notRepairedDamage','regionCode']
## 2) Box plots of the categorical features
# name and regionCode have far too many sparse categories, so only the less sparse features are plotted here
categorical_features = ['model','brand','bodyType','fuelType','gearbox','notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5) # height replaces the older size argument (seaborn >= 0.9)
g = g.map(boxplot, "value", "price")
[Figure: output_79_0.png (box plots of price for each categorical feature)]
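The add_categories step in the box plot preparation matters because a categorical column rejects fillna with a value that is not already one of its categories. A minimal sketch on a made-up Series (`s` is an illustrative name):

```python
import pandas as pd

# A categorical column with one missing entry.
s = pd.Series([1.0, 0.0, None, 1.0]).astype("category")

# Register the placeholder first, then fill the missing entries with it.
s = s.cat.add_categories(["MISSING"])
s = s.fillna("MISSING")

print(s.value_counts())
```

This keeps rows with missing values visible as their own box instead of silently dropping them from the plot.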
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6','v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],dtype='object')
## 3) Violin plots of the categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list :
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()
[Figures: output_81_0.png to output_81_5.png (violin plots of price, one per categorical feature)]
categorical_features = ['model','brand','bodyType','fuelType','gearbox','notRepairedDamage']
## 4) Bar plots of the categorical features
def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5) # height replaces the older size argument (seaborn >= 0.9)
g = g.map(bar_plot, "value", "price")
[Figure: output_83_0.png (bar plots of price for each categorical feature)]
## 5) Frequency of each category within the categorical features (count_plot)
def count_plot(x, **kwargs):
    sns.countplot(x=x)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data, value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5) # height replaces the older size argument (seaborn >= 0.9)
g = g.map(count_plot, "value")
[Figure: output_84_0.png (count plots of each categorical feature)]
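The bar plot and count plot have simple tabular counterparts: sns.barplot draws the mean of price per category by default, and sns.countplot draws the number of rows per category. A sketch with made-up data (`demo` and `summary` are illustrative names) that computes both with groupby:

```python
import pandas as pd

# Toy stand-in: one categorical feature and the target.
demo = pd.DataFrame({
    "gearbox": ["0.0", "0.0", "1.0", "1.0", "MISSING"],
    "price": [1850, 3600, 2400, 5900, 4700],
})

# mean  -> bar heights in the bar plot
# count -> bar heights in the count plot
summary = demo.groupby("gearbox")["price"].agg(["mean", "count"])
print(summary)
```

Checking the table alongside the figure is useful when a category has very few rows, since its mean price is then much less reliable than the bar suggests.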
2.3.9 Generating a data report with pandas_profiling
Use pandas_profiling to generate a fairly comprehensive visual data report; open the resulting HTML file to view it.
import pandas_profiling
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-101-6a00893fb3e1> in <module>
----> 1 import pandas_profiling

ModuleNotFoundError: No module named 'pandas_profiling'
pfr=pandas_profiling.ProfileReport(Train_data)
pfr.to_file(os.path.join(output_path,'example.html'))
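The ModuleNotFoundError above means the package is not installed in this environment, so the ProfileReport and to_file calls would fail as well until it is. The project has since been renamed ydata-profiling. A guarded sketch, assuming either package may or may not be present (`write_report` is a hypothetical helper, not part of either library):

```python
import os

# Try the current package name first, then the older one used in this post.
try:
    from ydata_profiling import ProfileReport
except ImportError:
    try:
        from pandas_profiling import ProfileReport
    except ImportError:
        ProfileReport = None  # neither package is installed


def write_report(df, output_dir, filename="example.html"):
    """Write an HTML profile of df, or return None if no profiler is available."""
    if ProfileReport is None:
        print("Install ydata-profiling (or pandas-profiling) to generate the report.")
        return None
    os.makedirs(output_dir, exist_ok=True)
    target = os.path.join(output_dir, filename)
    ProfileReport(df).to_file(target)
    return target
```

With the notebook's variables this would be called as write_report(Train_data, output_path).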