2.3 Code Examples

2.3.1 Loading the data science and visualization libraries

# coding: utf-8
# Import the warnings module and use a filter to suppress warning messages.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import os
# Directory where results are saved
output_path='G:/newjourney/Datawhale/output'
if not os.path.exists(output_path):
    os.makedirs(output_path)

2.3.2 Loading the data

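## 1) Load the training and test sets
path='G:/newjourney/Datawhale/'
Train_data=pd.read_csv(path+'used_car_train_20200313.csv',sep=' ')
Test_data=pd.read_csv(path+'used_car_testB_20200421.csv',sep=' ')
## 2) Take a quick look at the data (head() + shape)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
pd.concat([Train_data.head(),Train_data.tail()])
# Nice trick worth remembering: stacking head() and tail() shows both ends of the table at once.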
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 ... 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 ... 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 ... 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 ... 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 ... 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482
149995 149995 163978 20000607 121.0 10 4.0 0.0 1.0 163 15.0 ... 0.280264 0.000310 0.048441 0.071158 0.019174 1.988114 -2.983973 0.589167 -1.304370 -0.302592
149996 149996 184535 20091102 116.0 11 0.0 0.0 0.0 125 10.0 ... 0.253217 0.000777 0.084079 0.099681 0.079371 1.839166 -2.774615 2.553994 0.924196 -0.272160
149997 149997 147587 20101003 60.0 11 1.0 1.0 0.0 90 6.0 ... 0.233353 0.000705 0.118872 0.100118 0.097914 2.439812 -1.630677 2.290197 1.891922 0.414931
149998 149998 45907 20060312 34.0 10 3.0 1.0 0.0 156 15.0 ... 0.256369 0.000252 0.081479 0.083558 0.081498 2.075380 -2.633719 1.414937 0.431981 -1.659014
149999 149999 177672 19990204 19.0 28 6.0 0.0 1.0 193 12.5 ... 0.284475 0.000000 0.040072 0.062543 0.025819 1.978453 -3.179913 0.031724 -1.483350 -0.342674

10 rows × 31 columns

Train_data.shape
(150000, 31)
pd.concat([Test_data.head(),Test_data.tail()])
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 200000 133777 20000501 67.0 0 1.0 0.0 0.0 101 15.0 ... 0.236520 0.000241 0.105319 0.046233 0.094522 3.619512 -0.280607 -2.019761 0.978828 0.803322
1 200001 61206 19950211 19.0 6 2.0 0.0 0.0 73 6.0 ... 0.261518 0.000000 0.120323 0.046784 0.035385 2.997376 -1.406705 -1.020884 -1.349990 -0.200542
2 200002 67829 20090606 5.0 5 4.0 0.0 0.0 120 5.0 ... 0.261691 0.090836 0.000000 0.079655 0.073586 -3.951084 -0.433467 0.918964 1.634604 1.027173
3 200003 8892 20020601 22.0 9 1.0 0.0 0.0 58 15.0 ... 0.236050 0.101777 0.098950 0.026830 0.096614 -2.846788 2.800267 -2.524610 1.076819 0.461610
4 200004 76998 20030301 46.0 6 0.0 NaN 0.0 116 15.0 ... 0.257000 0.000000 0.066732 0.057771 0.068852 2.839010 -1.659801 -0.924142 0.199423 0.451014
49995 249995 111443 20041005 4.0 4 0.0 NaN 1.0 150 15.0 ... 0.263668 0.000292 0.141804 0.076393 0.039272 2.072901 -2.531869 1.716978 -1.063437 0.326587
49996 249996 152834 20130409 65.0 1 0.0 0.0 0.0 179 4.0 ... 0.255310 0.000991 0.155868 0.108425 0.067841 1.358504 -3.290295 4.269809 0.140524 0.556221
49997 249997 132531 20041211 4.0 4 0.0 0.0 1.0 147 12.5 ... 0.262933 0.000318 0.141872 0.071968 0.042966 2.165658 -2.417885 1.370612 -1.073133 0.270602
49998 249998 143405 20020702 40.0 1 4.0 0.0 1.0 176 15.0 ... 0.282106 0.000023 0.067483 0.067526 0.009006 2.030114 -2.939244 0.569078 -1.718245 0.316379
49999 249999 78202 20090708 32.0 8 1.0 0.0 0.0 0 3.0 ... 0.231449 0.103947 0.096027 0.062328 0.110180 -3.689090 2.032376 0.109157 2.202828 0.847469

10 rows × 30 columns

Test_data.shape
(50000, 30)
Make it a habit to check head() and shape on every dataset; it makes each later step more reassuring.
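The habit described above can be wrapped in a small helper. This is only a sketch: the function name `peek` and the toy frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def peek(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Print the shape and return the first and last n rows stacked together."""
    print("shape:", df.shape)
    return pd.concat([df.head(n), df.tail(n)])

# Toy frame standing in for Train_data (values are made up).
toy = pd.DataFrame({"SaleID": range(10), "price": range(100, 1100, 100)})
preview = peek(toy, n=2)
print(preview)
```

Calling `peek(Train_data)` after loading would show both ends of the table and the shape in one step.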

2.3.3 Data overview

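1. describe() gives the summary statistics of each column: the count, the mean, the standard deviation (std), the minimum, the 25%, 50% (median) and 75% quantiles, and the maximum. These give a quick sense of each column's range and help spot anomalous values. For example, values such as 999, 9999 or -1 sometimes turn up; these are often just another way of encoding NaN, so watch out for them.
2. info() shows the dtype of each column, which helps reveal whether a column contains special placeholder symbols other than NaN.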
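The point about placeholder codes can be checked mechanically. The sketch below assumes a hypothetical list of sentinel values (999, 9999, -1, taken from the examples mentioned above) and uses a toy frame; which codes actually mean "missing" depends on the dataset.

```python
import pandas as pd

# Hypothetical sentinel codes; the real list depends on the dataset.
SENTINELS = [999, 9999, -1]

# Toy data, not the competition data.
toy = pd.DataFrame({
    "power": [60, 0, 163, 9999, 68],
    "kilometer": [12.5, 15.0, -1, 15.0, 5.0],
})

# Count how often each column contains one of the sentinel codes.
sentinel_counts = toy.isin(SENTINELS).sum()
print(sentinel_counts)

# Replace the codes with NaN so that describe() no longer counts them.
cleaned = toy.mask(toy.isin(SENTINELS))
print(cleaned.describe())
```

After masking, the maximum of `power` and the minimum of `kilometer` reflect real values instead of the placeholder codes.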
## 1) Use describe() to get familiar with the summary statistics
Train_data.describe()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
count 150000.000000 150000.000000 1.500000e+05 149999.000000 150000.000000 145494.000000 141320.000000 144019.000000 150000.000000 150000.000000 ... 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000
mean 74999.500000 68349.172873 2.003417e+07 47.129021 8.052733 1.792369 0.375842 0.224943 119.316547 12.597160 ... 0.248204 0.044923 0.124692 0.058144 0.061996 -0.001000 0.009035 0.004813 0.000313 -0.000688
std 43301.414527 61103.875095 5.364988e+04 49.536040 7.864956 1.760640 0.548677 0.417546 177.168419 3.919576 ... 0.045804 0.051743 0.201410 0.029186 0.035692 3.772386 3.286071 2.517478 1.288988 1.038685
min 0.000000 0.000000 1.991000e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.500000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 -9.168192 -5.558207 -9.639552 -4.153899 -6.546556
25% 37499.750000 11156.000000 1.999091e+07 10.000000 1.000000 0.000000 0.000000 0.000000 75.000000 12.500000 ... 0.243615 0.000038 0.062474 0.035334 0.033930 -3.722303 -1.951543 -1.871846 -1.057789 -0.437034
50% 74999.500000 51638.000000 2.003091e+07 30.000000 6.000000 1.000000 0.000000 0.000000 110.000000 15.000000 ... 0.257798 0.000812 0.095866 0.057014 0.058484 1.624076 -0.358053 -0.130753 -0.036245 0.141246
75% 112499.250000 118841.250000 2.007111e+07 66.000000 13.000000 3.000000 1.000000 0.000000 150.000000 15.000000 ... 0.265297 0.102009 0.125243 0.079382 0.087491 2.844357 1.255022 1.776933 0.942813 0.680378
max 149999.000000 196812.000000 2.015121e+07 247.000000 39.000000 7.000000 6.000000 1.000000 19312.000000 15.000000 ... 0.291838 0.151420 1.404936 0.160791 0.222787 12.357011 18.819042 13.847792 11.147669 8.658418

8 rows × 30 columns

Test_data.describe()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
count 50000.000000 50000.000000 5.000000e+04 50000.00000 50000.000000 48496.000000 47076.000000 48032.000000 50000.000000 50000.000000 ... 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000
mean 224999.500000 68505.606100 2.003401e+07 47.64948 8.087140 1.793736 0.376498 0.226953 119.766960 12.598260 ... 0.248147 0.044624 0.124693 0.058198 0.062113 0.019633 0.002759 0.004342 0.004570 -0.007209
std 14433.901067 61032.124271 5.351615e+04 49.90741 7.899648 1.764970 0.549281 0.418866 206.313348 3.912519 ... 0.045836 0.051664 0.201440 0.029171 0.035723 3.764095 3.289523 2.515912 1.287194 1.044718
min 200000.000000 1.000000 1.991000e+07 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.500000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 -9.119719 -5.662163 -8.291868 -4.157649 -6.098192
25% 212499.750000 11315.000000 1.999100e+07 11.00000 1.000000 0.000000 0.000000 0.000000 75.000000 12.500000 ... 0.243436 0.000035 0.062519 0.035413 0.033880 -3.675196 -1.963928 -1.865406 -1.048722 -0.440706
50% 224999.500000 52215.000000 2.003091e+07 30.00000 6.000000 1.000000 0.000000 0.000000 110.000000 15.000000 ... 0.257818 0.000801 0.095880 0.056804 0.058749 1.632134 -0.375537 -0.138943 -0.036352 0.136849
75% 237499.250000 118710.750000 2.007110e+07 66.00000 13.000000 3.000000 1.000000 0.000000 150.000000 15.000000 ... 0.265263 0.101654 0.125470 0.079387 0.087624 2.846205 1.263451 1.775632 0.945239 0.685555
max 249999.000000 196808.000000 2.015121e+07 246.00000 39.000000 7.000000 6.000000 1.000000 19211.000000 15.000000 ... 0.291176 0.153403 1.411559 0.157458 0.211304 12.177864 18.789496 13.384828 5.635374 2.649768

8 rows × 29 columns

## 2) Use info() to get familiar with the data types
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID               50000 non-null int64
name                 50000 non-null int64
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             48496 non-null float64
fuelType             47076 non-null float64
gearbox              48032 non-null float64
power                50000 non-null int64
kilometer            50000 non-null float64
notRepairedDamage    50000 non-null object
regionCode           50000 non-null int64
seller               50000 non-null int64
offerType            50000 non-null int64
creatDate            50000 non-null int64
v_0                  50000 non-null float64
v_1                  50000 non-null float64
v_2                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_6                  50000 non-null float64
v_7                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
v_13                 50000 non-null float64
v_14                 50000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB

2.3.4 Checking for missing and anomalous values

## 1) Check how many NaN values each column contains
Train_data.isnull().sum()
SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
Test_data.isnull().sum()
SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1504
fuelType             2924
gearbox              1968
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
# Visualize the NaN counts
missing=Train_data.isnull().sum()
missing=missing[missing>0]
missing.sort_values(inplace=True)
missing.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x263bf61cda0>

[Figure: bar chart of missing-value counts per column (output_17_1.png)]

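The statements above make it easy to see which columns contain NaN and to print how many each one has. The main point is to judge whether the number of NaN values is really large. If it is small, the usual choice is to fill them in; if a tree model such as LightGBM (lgb) is used, the values can be left missing so that the tree handles them itself; if there are too many, consider dropping them.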
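The decision rule in the paragraph above can be sketched as a small report. The helper name `triage_missing` and both thresholds (5% and 50%) are assumptions chosen for illustration; the text gives no fixed cut-offs.

```python
import numpy as np
import pandas as pd

def triage_missing(df, fill_below=0.05, drop_above=0.5):
    """Label each column that has missing values by its share of NaN.

    The two thresholds are illustrative assumptions, not fixed rules.
    """
    ratio = df.isnull().mean()
    ratio = ratio[ratio > 0].sort_values()
    action = pd.Series("keep NaN for a tree model", index=ratio.index)
    action[ratio < fill_below] = "fill"
    action[ratio > drop_above] = "consider dropping"
    return pd.DataFrame({"missing_ratio": ratio, "action": action})

# Toy data, not the competition data.
toy = pd.DataFrame({
    "a": [1.0] * 99 + [np.nan],        # 1% missing
    "b": [1.0] * 80 + [np.nan] * 20,   # 20% missing
    "c": [1.0] * 30 + [np.nan] * 70,   # 70% missing
    "d": [1.0] * 100,                  # complete
})
report = triage_missing(toy)
print(report)
```

Columns without missing values are left out of the report, so it stays short even on wide tables.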
# Visualize the missing values with missingno (msno)
msno.matrix(Train_data.sample(250)) # 250 sampled rows are shown here
<matplotlib.axes._subplots.AxesSubplot at 0x263bf70dcc0>

[Figure: msno.matrix of 250 sampled training rows (output_19_1.png)]

msno.matrix(Train_data)
<matplotlib.axes._subplots.AxesSubplot at 0x263c17172e8>

[Figure: msno.matrix of the full training set (output_20_1.png)]

msno.bar(Train_data.sample(1000))
<matplotlib.axes._subplots.AxesSubplot at 0x263bf973358>

[Figure: msno.bar of 1000 sampled training rows (output_21_1.png)]

# Visualize the missing values in the test set
msno.matrix(Test_data)
<matplotlib.axes._subplots.AxesSubplot at 0x263c19801d0>

[Figure: msno.matrix of the test set (output_22_1.png)]

msno.bar(Test_data)
<matplotlib.axes._subplots.AxesSubplot at 0x263bf9d3be0>

[Figure: msno.bar of the test set (output_23_1.png)]

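The test set and the training set have similar patterns of missing values.

Training set: model 1, brand 0, bodyType 4506, fuelType 8680, gearbox 5981
Test set: bodyType 1504, fuelType 2924, gearbox 1968

In both sets, fuelType has the most missing values.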
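Raw counts are hard to compare because the two sets differ in size, so a quick way to back up the "similar" claim is to turn the counts into percentages. The sketch below only uses the counts and row totals printed earlier in this section; the variable names are made up.

```python
import pandas as pd

# Missing counts copied from the isnull().sum() outputs above.
train_missing = pd.Series({"model": 1, "bodyType": 4506, "fuelType": 8680, "gearbox": 5981})
test_missing = pd.Series({"model": 0, "bodyType": 1504, "fuelType": 2924, "gearbox": 1968})

rates = pd.DataFrame({
    "train_pct": train_missing / 150000 * 100,  # 150000 training rows
    "test_pct": test_missing / 50000 * 100,     # 50000 test rows
}).round(2)
rates["abs_diff"] = (rates["train_pct"] - rates["test_pct"]).abs().round(2)
print(rates)
```

For every column the two percentages differ by less than 0.1 percentage points, which is what one would hope for when the test set is drawn from the same source as the training set.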

## 2) Check for anomalous values
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

Every column is numeric except notRepairedDamage, which has dtype object (150000 non-null object). Let's look at its values in detail.

Train_data['notRepairedDamage'].value_counts()
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

The "-" entries are also missing values. Because many models handle NaN directly, we replace "-" with NaN here.

Train_data['notRepairedDamage']=Train_data['notRepairedDamage'].replace('-',np.nan)
Train_data['notRepairedDamage'].value_counts()
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
Train_data.isnull().sum()
SaleID                   0
name                     0
regDate                  0
model                    1
brand                    0
bodyType              4506
fuelType              8680
gearbox               5981
power                    0
kilometer                0
notRepairedDamage    24324
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
dtype: int64
Test_data['notRepairedDamage'].value_counts()
0.0    37224
-       8069
1.0     4707
Name: notRepairedDamage, dtype: int64
Test_data['notRepairedDamage']=Test_data['notRepairedDamage'].replace('-',np.nan)
Test_data.isnull().sum()
SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1504
fuelType             2924
gearbox              1968
power                   0
kilometer               0
notRepairedDamage    8069
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
Test_data['notRepairedDamage'].value_counts()
0.0    37224
1.0     4707
Name: notRepairedDamage, dtype: int64
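After the replacement the column still has dtype object. A possible follow-up, shown here on a toy Series, is to convert it to a numeric dtype in the same step. The sketch assumes the column holds strings such as "0.0" and "1.0", as the object dtype above suggests; `pd.to_numeric` with `errors="coerce"` is an alternative to replace(), not what the original notebook does.

```python
import pandas as pd

# Toy column mimicking notRepairedDamage as read from the CSV (assumed to be strings).
raw = pd.Series(["0.0", "-", "1.0", "0.0", "-"], name="notRepairedDamage")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_float = pd.to_numeric(raw, errors="coerce")
print(as_float.value_counts(dropna=False))
print(as_float.dtype)
```

The "-" entries become NaN and the remaining values become floats, so the column can be fed to numeric routines directly.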
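
"seller" and "offerType" are severely imbalanced, so they are unlikely to help the prediction and are dropped here. They could be explored further, but that is rarely worthwhile.
# One open question: how does one notice that these two columns are so imbalanced? By checking every column one at a time?

# Look at the value counts of these two columns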
Train_data['seller'].value_counts()
0    149999
1         1
Name: seller, dtype: int64
Train_data['offerType'].value_counts()
0    150000
Name: offerType, dtype: int64
# So delete them, from both the training set and the test set
del Train_data['seller']
del Test_data['seller']
del Train_data['offerType']
del Test_data['offerType']
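One possible answer to the question raised above is to scan every column at once instead of inspecting them one by one. The sketch below computes, for each column, the share of rows taken by its most frequent value. The helper name `dominant_share`, the toy data and the 99% cut-off are all assumptions made for illustration.

```python
import pandas as pd

def dominant_share(df: pd.DataFrame) -> pd.Series:
    """Share of rows taken by the most frequent value of each column."""
    return df.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

# Toy data shaped like the situation above: two nearly constant columns.
toy = pd.DataFrame({
    "seller": [0] * 999 + [1],
    "offerType": [0] * 1000,
    "brand": list(range(10)) * 100,
})
share = dominant_share(toy).sort_values(ascending=False)
print(share)

# Hypothetical cut-off: flag columns where one value covers over 99% of rows.
skewed = share[share > 0.99].index.tolist()
print(skewed)
```

Run on the training set before the deletion, a scan like this should put seller and offerType at the top, since their value counts above show one value covering all or all but one of the 150000 rows.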

2.3.5 Understanding the distribution of the target

Train_data['price']
0          1850
1          3600
2          6222
3          2400
4          5200
5          8000
6          3500
7          1000
8          2850
9           650
10         3100
11         5450
12         1600
13         3100
14         6900
15         3200
16        10500
17         3700
18          790
19         1450
20          990
21         2800
22          350
23          599
24         9250
25         3650
26         2800
27         2399
28         4900
29         2999
...
149970      900
149971     3400
149972      999
149973     3500
149974     4500
149975     3990
149976     1200
149977      330
149978     3350
149979     5000
149980     4350
149981     9000
149982     2000
149983    12000
149984     6700
149985     4200
149986     2800
149987     3000
149988     7500
149989     1150
149990      450
149991    24950
149992      950
149993     4399
149994    14780
149995     5900
149996     9500
149997     7500
149998     4999
149999     4700
Name: price, Length: 150000, dtype: int64
Train_data['price'].value_counts()
500      2337
1500     2158
1200     1922
1000     1850
2500     1821
600      1535
3500     1533
800      1513
2000     1378
999      1356
750      1279
4500     1271
650      1257
1800     1223
2200     1201
850      1198
700      1174
900      1107
1300     1105
950      1104
3000     1098
1100     1079
5500     1079
1600     1074
300      1071
550      1042
350      1005
1250     1003
6500      973
1999      929
...
21560       1
7859        1
3120        1
2279        1
6066        1
6322        1
4275        1
10420       1
43300       1
305         1
1765        1
15970       1
44400       1
8885        1
2992        1
31850       1
15413       1
13495       1
9525        1
7270        1
13879       1
3760        1
24250       1
11360       1
10295       1
25321       1
8886        1
8801        1
37920       1
8188        1
Name: price, Length: 3763, dtype: int64
## 1) Overall distribution (unbounded Johnson SU (johnsonsu)? normal (norm)? log-normal (lognorm)?)
import scipy.stats as st
y=Train_data['price']
plt.figure(1);plt.title('Johnson SU')
sns.distplot(y,kde=False,fit=st.johnsonsu)
plt.figure(2);plt.title('Normal')
sns.distplot(y,kde=False,fit=st.norm)
plt.figure(3);plt.title('Log Normal')
sns.distplot(y,kde=False,fit=st.lognorm)
# Why plot with sns.distplot() here?
# Answer: Seaborn is a Python visualization library built on matplotlib. It wraps matplotlib in a higher-level API for drawing attractive statistical graphics, so good-looking plots need far less manual tuning. Note that distplot has been deprecated since seaborn 0.11.
<matplotlib.axes._subplots.AxesSubplot at 0x263b6bbd080>

[Figures: price histogram with a Johnson SU fit (output_42_1.png), a normal fit (output_42_2.png) and a log-normal fit (output_42_3.png)]

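Price does not follow a normal distribution, so it has to be transformed before regression. The log transform works quite well, but the best fit is the unbounded Johnson SU distribution.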

## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis:%f" % Train_data['price'].kurt())
Skewness: 3.346487
Kurtosis:18.995183

[Figure: distribution plot of price (output_44_1.png)]
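To read the two numbers above it helps to know the reference point: pandas reports excess kurtosis (Fisher's definition), so a normal sample gives values near 0 for both skew() and kurt(). The sketch below compares a synthetic normal sample with a synthetic right-skewed one; neither is the competition data, and the sample size and seed are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
normal = pd.Series(rng.normal(size=100_000))
heavy = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=100_000))

# pandas reports excess kurtosis, so a normal sample lands near 0.
print("normal:    skew=%.3f kurt=%.3f" % (normal.skew(), normal.kurt()))
print("lognormal: skew=%.3f kurt=%.3f" % (heavy.skew(), heavy.kurt()))
```

Against that baseline, a skewness of about 3.35 and a kurtosis of about 19 for price indicate a long right tail and far heavier tails than a normal distribution.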

# Skewness and kurtosis of every feature in the training set; a useful thing to learn
Train_data.skew(),Train_data.kurt()
(SaleID               6.017846e-17
 name                 5.576058e-01
 regDate              2.849508e-02
 model                1.484388e+00
 brand                1.150760e+00
 bodyType             9.915299e-01
 fuelType             1.595486e+00
 gearbox              1.317514e+00
 power                6.586318e+01
 kilometer           -1.525921e+00
 notRepairedDamage    2.430640e+00
 regionCode           6.888812e-01
 creatDate           -7.901331e+01
 price                3.346487e+00
 v_0                 -1.316712e+00
 v_1                  3.594543e-01
 v_2                  4.842556e+00
 v_3                  1.062920e-01
 v_4                  3.679890e-01
 v_5                 -4.737094e+00
 v_6                  3.680730e-01
 v_7                  5.130233e+00
 v_8                  2.046133e-01
 v_9                  4.195007e-01
 v_10                 2.522046e-02
 v_11                 3.029146e+00
 v_12                 3.653576e-01
 v_13                 2.679152e-01
 v_14                -1.186355e+00
 dtype: float64,
 SaleID                 -1.200000
 name                   -1.039945
 regDate                -0.697308
 model                   1.740483
 brand                   1.076201
 bodyType                0.206937
 fuelType                5.880049
 gearbox                -0.264161
 power                5733.451054
 kilometer               1.141934
 notRepairedDamage       3.908072
 regionCode             -0.340832
 creatDate            6881.080328
 price                  18.995183
 v_0                     3.993841
 v_1                    -1.753017
 v_2                    23.860591
 v_3                    -0.418006
 v_4                    -0.197295
 v_5                    22.934081
 v_6                    -1.742567
 v_7                    25.845489
 v_8                    -0.636225
 v_9                    -0.321491
 v_10                   -0.577935
 v_11                   12.568731
 v_12                    0.268937
 v_13                   -0.438274
 v_14                    2.393526
 dtype: float64)
sns.distplot(Train_data.skew(),color='blue',axlabel='Skewness')
<matplotlib.axes._subplots.AxesSubplot at 0x263c3402f28>

[Figure: distribution of the per-column skewness values (output_46_1.png)]

sns.distplot(Train_data.kurt(),color='orange',axlabel='Kurtosis')
<matplotlib.axes._subplots.AxesSubplot at 0x263db8ed240>

[Figure: distribution of the per-column kurtosis values (output_47_1.png)]

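## 3) Check the frequency counts of the target
plt.hist(Train_data['price'],orientation='vertical',histtype='bar',color='red');
# Tip: in a notebook the figure renders either way; the trailing ";" only suppresses the printed return value. Calling plt.show() afterwards also works.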

[Figure: histogram of price (output_48_0.png)]

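The frequency counts show that very few prices exceed 20000. These could be treated as special (anomalous) values and either filled in or removed.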
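The two options mentioned above can be sketched on a toy Series. The 20000 threshold comes from the text; reading "filled in" as capping at the threshold is an assumption made here, and other fill strategies are equally possible.

```python
import pandas as pd

# Toy prices; 20000 mirrors the threshold mentioned above.
price = pd.Series([1850, 3600, 6222, 24950, 99999, 650], name="price")
THRESHOLD = 20000

is_extreme = price > THRESHOLD
print("share above threshold:", is_extreme.mean())

dropped = price[~is_extreme]            # option 1: remove the rows
capped = price.clip(upper=THRESHOLD)    # option 2: cap at the threshold (assumed fill strategy)
print(dropped.tolist())
print(capped.tolist())
```

Dropping changes the number of rows, so features and target must be filtered with the same mask; capping keeps the row count unchanged.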

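# After a log transform the distribution is much more even, so the model can predict the log of the price; this is a common trick in prediction problems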
plt.hist(np.log(Train_data['price']),orientation='vertical',histtype='bar',color='red');

[Figure: histogram of log(price) (output_50_0.png)]
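When the log of the price is used as the training target, predictions have to be mapped back to price units afterwards. The sketch below shows that round trip on synthetic right-skewed data (not the competition data). It uses np.log1p and np.expm1 instead of the plain np.log above, which is an assumption made so that a zero value would not produce -inf.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed prices; not the competition data.
price = pd.Series(np.round(rng.lognormal(mean=8.0, sigma=1.0, size=10_000)) + 1)

log_price = np.log1p(price)     # a regressor would be trained on this target
restored = np.expm1(log_price)  # predictions are mapped back to price units

print("skew before: %.3f" % price.skew())
print("skew after:  %.3f" % log_price.skew())
```

The skewness drops sharply after the transform, and expm1 recovers the original values up to floating-point error.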

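2.3.6 Splitting features into categorical and numeric, and checking the unique-value distribution of the categorical ones

Data types

# Separate the label, i.e. the target to predict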
Y_train=Train_data['price']
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6','v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],dtype='object')
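'''
This way of separating features works for data that has not already been label-encoded.
It does not work here, so the features have to be split by hand according to their meaning.
+ Numeric features
numeric_features=Train_data.select_dtypes(include=[np.number])
numeric_features.columns
+ Categorical features
categorical_features=Train_data.select_dtypes(include=['object'])
categorical_features.columns'''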
numeric_features=['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
categorical_features=['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']
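Since the dtype alone cannot separate label-encoded categories from real numbers, one rough aid is to count distinct values per numeric column. This is only a heuristic sketch: the helper name `low_cardinality_columns` and the cut-off of 50 are assumptions, and high-cardinality codes such as name or regionCode would not be caught, which is why the lists above are still written by hand.

```python
import pandas as pd

def low_cardinality_columns(df: pd.DataFrame, max_unique: int = 50) -> list:
    """Numeric columns with few distinct values: candidates for categorical treatment."""
    numeric = df.select_dtypes(include="number")
    counts = numeric.nunique()
    return counts[counts <= max_unique].index.tolist()

# Toy data, not the competition data.
toy = pd.DataFrame({
    "brand": [0, 4, 14, 10, 1, 6] * 50,   # label-encoded category
    "gearbox": [0.0, 1.0] * 150,          # label-encoded category
    "power": range(300),                  # genuinely numeric
})
candidates = low_cardinality_columns(toy)
print(candidates)
```

The result is a list of candidates to review against the field descriptions, not a final split.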
# Unique-value distribution of each categorical feature: training set
for cat_fea in categorical_features:
    print("Distribution of feature "+cat_fea+":")
    print("{} has {} distinct values".format(cat_fea,Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
Distribution of feature name:
name has 99662 distinct values
708       282
387       282
55        280
1541      263
203       233
53        221
713       217
290       197
1186      184
911       182
2044      176
1513      160
1180      158
631       157
893       153
2765      147
473       141
1139      137
1108      132
444       129
306       127
2866      123
2402      116
533       114
1479      113
422       113
4635      110
725       110
964       109
1373      104
...
89083       1
95230       1
164864      1
173060      1
179207      1
181256      1
185354      1
25564       1
19417       1
189324      1
162719      1
191373      1
193422      1
136082      1
140180      1
144278      1
146327      1
148376      1
158621      1
1404        1
15319       1
46022       1
64463       1
976         1
3025        1
5074        1
7123        1
11221       1
13270       1
174485      1
Name: name, Length: 99662, dtype: int64
Distribution of feature model:
model has 248 distinct values
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
48.0      5052
40.0      4502
26.0      4496
8.0       4391
31.0      3827
13.0      3762
17.0      3121
65.0      2730
49.0      2608
46.0      2454
30.0      2342
44.0      2195
5.0       2063
10.0      2004
21.0      1872
73.0      1789
11.0      1775
23.0      1696
22.0      1524
69.0      1522
63.0      1469
7.0       1460
16.0      1349
88.0      1309
66.0      1250
...
141.0       37
133.0       35
216.0       30
202.0       28
151.0       26
226.0       26
231.0       23
234.0       23
233.0       20
198.0       18
224.0       18
227.0       17
237.0       17
220.0       16
230.0       16
239.0       14
223.0       13
236.0       11
241.0       10
232.0       10
229.0       10
235.0        7
246.0        7
243.0        4
244.0        3
245.0        2
209.0        2
240.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64
Distribution of feature brand:
brand has 40 distinct values
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64
Distribution of feature bodyType:
bodyType has 8 distinct values
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
Distribution of feature fuelType:
fuelType has 7 distinct values
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64
Distribution of feature gearbox:
gearbox has 2 distinct values
0.0    111623
1.0     32396
Name: gearbox, dtype: int64
Distribution of feature notRepairedDamage:
notRepairedDamage has 2 distinct values
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
Distribution of feature regionCode:
regionCode has 7905 distinct values
419     369
764     258
125     137
176     136
462     134
428     132
24      130
1184    130
122     129
828     126
70      125
827     120
207     118
1222    117
2418    117
85      116
2615    115
2222    113
759     112
188     111
1757    110
1157    109
2401    107
1069    107
3545    107
424     107
272     107
451     106
450     105
129     105
...
6324      1
7372      1
7500      1
8107      1
2453      1
7942      1
5135      1
6760      1
8070      1
7220      1
8041      1
8012      1
5965      1
823       1
7401      1
8106      1
5224      1
8117      1
7507      1
7989      1
6505      1
6377      1
8042      1
7763      1
7786      1
6414      1
7063      1
4239      1
5931      1
7267      1
Name: regionCode, Length: 7905, dtype: int64
# Unique-value distribution of each categorical feature: test set
for cat_fea in categorical_features:
    print("Distribution of feature "+cat_fea+":")
    print("{} has {} distinct values".format(cat_fea,Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())
Distribution of feature name:
name has 37536 distinct values
387       94
55        93
1541      86
708       85
203       78
713       75
911       72
1180      71
53        68
290       68
631       67
1186      60
473       54
306       53
2866      52
2044      50
422       49
893       47
1513      46
2765      45
533       44
964       44
1139      41
1479      41
2825      38
444       37
4635      37
984       37
282       35
691       33
...
9747       1
7857       1
75120      1
144754     1
15731      1
66932      1
76360      1
66082      1
89231      1
93561      1
161146     1
21886      1
42368      1
101765     1
89653      1
38278      1
89645      1
60809      1
62858      1
195979     1
185951     1
81299      1
168479     1
28057      1
30106      1
97691      1
155039     1
44449      1
112034     1
105129     1
Name: name, Length: 37536, dtype: int64
Distribution of feature model:
model has 245 distinct values
0.0      3772
19.0     3226
4.0      2790
1.0      1981
29.0     1778
48.0     1711
40.0     1524
26.0     1512
8.0      1464
31.0     1281
13.0     1214
17.0     1033
65.0      918
49.0      880
46.0      871
30.0      793
44.0      731
5.0       677
21.0      628
10.0      625
23.0      583
11.0      562
73.0      561
69.0      531
63.0      515
16.0      506
22.0      482
7.0       442
88.0      416
66.0      395
...
157.0      12
151.0      12
141.0      12
193.0      12
89.0       12
68.0       11
233.0      11
226.0      11
133.0      11
227.0       8
198.0       8
18.0        8
224.0       7
237.0       7
239.0       6
231.0       6
235.0       6
220.0       6
246.0       4
234.0       4
230.0       4
223.0       3
236.0       3
232.0       3
245.0       3
229.0       2
209.0       2
242.0       1
241.0       1
244.0       1
Name: model, Length: 245, dtype: int64
Distribution of feature brand:
brand has 40 distinct values
0     10473
4      5532
14     5345
10     4713
1      4627
6      3500
9      2360
5      1485
13     1386
11      942
3       820
16      770
25      728
7       727
8       708
27      623
21      543
15      476
19      473
20      411
12      399
22      358
26      328
30      321
17      312
24      248
28      216
32      183
29      139
37      117
2       115
31      113
18      107
33       84
35       75
34       75
36       72
23       60
38       31
39        5
Name: brand, dtype: int64
Distribution of feature bodyType:
Feature bodyType has 8 distinct values
0.0    13765
1.0    11960
2.0     9886
3.0     4491
4.0     3258
5.0     2494
6.0     2212
7.0      430
Name: bodyType, dtype: int64
Distribution of feature fuelType:
Feature fuelType has 7 distinct values
0.0    30489
1.0    15708
2.0      736
3.0       78
4.0       31
5.0       18
6.0       16
Name: fuelType, dtype: int64
Distribution of feature gearbox:
Feature gearbox has 2 distinct values
0.0    37131
1.0    10901
Name: gearbox, dtype: int64
Distribution of feature notRepairedDamage:
Feature notRepairedDamage has 2 distinct values
0.0    37224
1.0     4707
Name: notRepairedDamage, dtype: int64
Distribution of feature regionCode:
Feature regionCode has 6998 distinct values
419     120
764      98
176      48
3304     45
85       45
2222     45
3545     44
462      42
1000     42
2154     42
24       41
2775     41
70       41
309      40
1688     40
188      40
792      40
955      39
172      39
3573     39
122      39
759      38
60       38
2418     38
256      38
1483     38
2690     37
125      37
827      37
450      37
        ...
1521      1
7602      1
5523      1
7538      1
5459      1
7410      1
6630      1
6374      1
6342      1
1010      1
6897      1
5104      1
7089      1
4069      1
6993      1
2052      1
4944      1
2867      1
4912      1
2771      1
6310      1
6865      1
6833      1
4656      1
6609      1
2451      1
4231      1
6513      1
6481      1
6061      1
Name: regionCode, Length: 6998, dtype: int64

2.3.7 Numeric feature analysis

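# Add the prediction target (price) to the numeric feature list.
# Note: the output below lists 'price' twice, most likely because this cell was executed more than once;
# that duplicate is what makes the later sort_values() calls fail.
numeric_features.append('price')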
numeric_features
['power','kilometer','v_0','v_1','v_2','v_3','v_4','v_5','v_6','v_7','v_8','v_9','v_10','v_11','v_12','v_13','v_14','price','price']
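# DataFrame.append was removed in pandas 2.0; pd.concat shows head and tail together in the same way
pd.concat([Train_data.head(), Train_data.tail()])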
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 ... 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 ... 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 ... 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 ... 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 ... 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482
149995 149995 163978 20000607 121.0 10 4.0 0.0 1.0 163 15.0 ... 0.280264 0.000310 0.048441 0.071158 0.019174 1.988114 -2.983973 0.589167 -1.304370 -0.302592
149996 149996 184535 20091102 116.0 11 0.0 0.0 0.0 125 10.0 ... 0.253217 0.000777 0.084079 0.099681 0.079371 1.839166 -2.774615 2.553994 0.924196 -0.272160
149997 149997 147587 20101003 60.0 11 1.0 1.0 0.0 90 6.0 ... 0.233353 0.000705 0.118872 0.100118 0.097914 2.439812 -1.630677 2.290197 1.891922 0.414931
149998 149998 45907 20060312 34.0 10 3.0 1.0 0.0 156 15.0 ... 0.256369 0.000252 0.081479 0.083558 0.081498 2.075380 -2.633719 1.414937 0.431981 -1.659014
149999 149999 177672 19990204 19.0 28 6.0 0.0 1.0 193 12.5 ... 0.284475 0.000000 0.040072 0.062543 0.025819 1.978453 -3.179913 0.031724 -1.483350 -0.342674

10 rows × 29 columns

## 1) Correlation analysis
price_numeric=Train_data[numeric_features]
correlation=price_numeric.corr()
correlation
print(correlation['price'])
              price     price
power      0.219834  0.219834
kilometer -0.440519 -0.440519
v_0        0.628397  0.628397
v_1        0.060914  0.060914
v_2        0.085322  0.085322
v_3       -0.730946 -0.730946
v_4       -0.147085 -0.147085
v_5        0.164317  0.164317
v_6        0.068970  0.068970
v_7       -0.053024 -0.053024
v_8        0.685798  0.685798
v_9       -0.206205 -0.206205
v_10      -0.246175 -0.246175
v_11      -0.275320 -0.275320
v_12       0.692823  0.692823
v_13      -0.013993 -0.013993
v_14       0.035911  0.035911
price      1.000000  1.000000
price      1.000000  1.000000
print(correlation['price'].sort_values(ascending=True))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-81-e5597112ec71> in <module>
      3 correlation=price_numeric.corr()
      4 correlation
----> 5 print(correlation['price'].sort_values(ascending=True))

TypeError: sort_values() missing 1 required positional argument: 'by'
print(correlation['price'].sort_values(by=correlation['price'],ascending = False),'\n')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-74-a6cc18605a79> in <module>
      2 price_numeric=Train_data[numeric_features]
      3 correlation=price_numeric.corr()
----> 4 print(correlation['price'].sort_values(by=correlation['price'],ascending = False),'\n')

G:\baidudownload2\anaconda\lib\site-packages\pandas\core\frame.py in sort_values(self, by, axis, ascending, inplace, kind, na_position)
   4717 
   4718             by = by[0]
-> 4719             k = self._get_label_or_level_values(by, axis=axis)
   4720 
   4721             if isinstance(ascending, (tuple, list)):

G:\baidudownload2\anaconda\lib\site-packages\pandas\core\generic.py in _get_label_or_level_values(self, key, axis)
   1704             values = self.axes[axis].get_level_values(key)._values
   1705         else:
-> 1706             raise KeyError(key)
   1707 
   1708         # Check for duplicates

KeyError:               price     price
power      0.219834  0.219834
kilometer -0.440519 -0.440519
v_0        0.628397  0.628397
v_1        0.060914  0.060914
v_2        0.085322  0.085322
v_3       -0.730946 -0.730946
v_4       -0.147085 -0.147085
v_5        0.164317  0.164317
v_6        0.068970  0.068970
v_7       -0.053024 -0.053024
v_8        0.685798  0.685798
v_9       -0.206205 -0.206205
v_10      -0.246175 -0.246175
v_11      -0.275320 -0.275320
v_12       0.692823  0.692823
v_13      -0.013993 -0.013993
v_14       0.035911  0.035911
price      1.000000  1.000000
price      1.000000  1.000000
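
Both tracebacks above trace back to the duplicated 'price' entry in numeric_features: selecting a column label that occurs twice returns a DataFrame rather than a Series, and DataFrame.sort_values requires a `by` argument. The following is only a minimal sketch of one way around this, using a small synthetic frame (`demo`, with made-up values) in place of Train_data; the names `demo`, `features` and `unique_features` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Train_data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'power': rng.normal(size=200),
    'kilometer': rng.normal(size=200),
})
demo['price'] = 2.0 * demo['power'] - demo['kilometer'] + rng.normal(scale=0.1, size=200)

# A feature list in which 'price' was appended twice, as in the output above
features = ['power', 'kilometer', 'price', 'price']

# With the duplicated label, corr()['price'] is a two-column DataFrame,
# which is why sort_values() asked for a `by` argument
dup_corr = demo[features].corr()
print(type(dup_corr['price']).__name__)

# Drop duplicates while preserving order; corr()['price'] is then a Series
unique_features = list(dict.fromkeys(features))
price_corr = demo[unique_features].corr()['price']
print(price_corr.sort_values(ascending=False))
```

In the notebook itself, guarding the append with `if 'price' not in numeric_features:` would keep the list free of duplicates when the cell is re-run.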
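# Reference usage of corr(), adapted from a tutorial on dish-sales data and applied here to price_numeric
print(price_numeric.corr()) # correlation matrix: the coefficient between every pair of columns
print("Correlation between price and the other features:")
print(price_numeric.corr()[u'price']) # only the correlations with price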
f,ax=plt.subplots(figsize=(7,7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square=True,vmax=0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x263c33cfa90>

[Figure: output_67_1.png (heatmap titled "Correlation of Numeric Features with Price")]

del price_numeric['price']
## 2) Check the skewness and kurtosis of each numeric feature
for col in numeric_features:
    print('{:15}'.format(col),
          'Skewness:{:05.2f}'.format(Train_data[col].skew()),
          ' ',
          'kurtosis:{:06.2f}'.format(Train_data[col].kurt()))
power           Skewness:65.86   kurtosis:5733.45
kilometer       Skewness:-1.53   kurtosis:001.14
v_0             Skewness:-1.32   kurtosis:003.99
v_1             Skewness:00.36   kurtosis:-01.75
v_2             Skewness:04.84   kurtosis:023.86
v_3             Skewness:00.11   kurtosis:-00.42
v_4             Skewness:00.37   kurtosis:-00.20
v_5             Skewness:-4.74   kurtosis:022.93
v_6             Skewness:00.37   kurtosis:-01.74
v_7             Skewness:05.13   kurtosis:025.85
v_8             Skewness:00.20   kurtosis:-00.64
v_9             Skewness:00.42   kurtosis:-00.32
v_10            Skewness:00.03   kurtosis:-00.58
v_11            Skewness:03.03   kurtosis:012.57
v_12            Skewness:00.37   kurtosis:000.27
v_13            Skewness:00.27   kurtosis:-00.44
v_14            Skewness:-1.19   kurtosis:002.39
price           Skewness:03.35   kurtosis:019.00
price           Skewness:03.35   kurtosis:019.00
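
To help read the numbers above, here is a small sketch on synthetic data (not the competition data) comparing a roughly symmetric sample with a right-skewed one. pandas' `kurt()` reports excess kurtosis, so a normal sample lands near 0. The column names and the `log1p` step are illustrative assumptions; the log transform is shown only as one common way to reduce right skew, such as the skew of 3.35 reported for price.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Two synthetic samples: one symmetric, one with a long right tail
samples = pd.DataFrame({
    'symmetric': rng.normal(size=10_000),
    'right_skewed': rng.lognormal(mean=0.0, sigma=1.0, size=10_000),
})

for col in samples.columns:
    print('{:15}'.format(col),
          'Skewness: {:6.2f}'.format(samples[col].skew()),
          'Kurtosis: {:8.2f}'.format(samples[col].kurt()))

# log1p compresses the long right tail, so the skewness drops
log_skew = np.log1p(samples['right_skewed']).skew()
print('{:15}'.format('log1p(skewed)'), 'Skewness: {:6.2f}'.format(log_skew))
```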
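## 3) Visualize the distribution of each numeric feature
f=pd.melt(Train_data,value_vars=numeric_features)
g=sns.FacetGrid(f,col="variable",col_wrap=2,sharex=False,sharey=False)
g=g.map(sns.distplot,"value")  # distplot is deprecated since seaborn 0.11; sns.histplot with kde=True is the replacement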

[Figure: output_70_0.png (distribution plot of each numeric feature)]

The plots show that the anonymous features are distributed relatively evenly.

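## 4) Visualize the pairwise relationships between numeric features
sns.set()
columns=['price','power','v_0','v_1','v_2','v_5','v_6','v_8','v_12','v_14']  # why these particular features?
sns.pairplot(Train_data[columns],height=2,kind='scatter',diag_kind='kde');  # `size` was renamed to `height` in seaborn 0.9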

[Figure: output_72_0.png (pair plot of price and the selected numeric features)]

Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6','v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],dtype='object')
Y_train
0          1850
1          3600
2          6222
3          2400
4          5200
5          8000
6          3500
7          1000
8          2850
9           650
10         3100
11         5450
12         1600
13         3100
14         6900
15         3200
16        10500
17         3700
18          790
19         1450
20          990
21         2800
22          350
23          599
24         9250
25         3650
26         2800
27         2399
28         4900
29         2999
          ...
149970      900
149971     3400
149972      999
149973     3500
149974     4500
149975     3990
149976     1200
149977      330
149978     3350
149979     5000
149980     4350
149981     9000
149982     2000
149983    12000
149984     6700
149985     4200
149986     2800
149987     3000
149988     7500
149989     1150
149990      450
149991    24950
149992      950
149993     4399
149994    14780
149995     5900
149996     9500
149997     7500
149998     4999
149999     4700
Name: price, Length: 150000, dtype: int64
## 5) Visualize the regression relationship between price and several variables
fig,((ax1,ax2),(ax3,ax4),(ax5,ax6),(ax7,ax8),(ax9,ax10))=plt.subplots(nrows=5,ncols=2,figsize=(24,20))
# ['v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train,Train_data['v_12']],axis = 1)
sns.regplot(x='v_12',y = 'price', data = v_12_scatter_plot,scatter= True, fit_reg=True, ax=ax1)

v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)

v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)

power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)

v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)

v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)

v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)

v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)

v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)

v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
<matplotlib.axes._subplots.AxesSubplot at 0x263d7d11cf8>

2.3.8 Categorical feature analysis

## 1) Number of unique values per categorical feature
for fea in categorical_features:
    print(Train_data[fea].nunique())
99662
248
40
8
7
2
2
7905
categorical_features
['name','model','brand','bodyType','fuelType','gearbox','notRepairedDamage','regionCode']
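
The counts above are what motivate dropping name and regionCode from the plots that follow: both have thousands of categories. As a rough sketch of how that selection could be automated, the snippet below keeps only the columns whose number of unique values stays under a cut-off. The toy frame `demo` and the threshold `max_categories` are hypothetical and chosen only for illustration; the original notebook picks the features by hand.

```python
import pandas as pd

# Toy frame: 'name' is unique per row, the other columns have few categories
demo = pd.DataFrame({
    'name': [f'car_{i}' for i in range(12)],
    'brand': [0, 1, 2] * 4,
    'gearbox': [0.0, 1.0] * 6,
})

max_categories = 5  # hypothetical cut-off, not taken from the original notebook

# nunique() gives the number of distinct values per column
cardinality = demo.nunique()
dense_features = cardinality[cardinality <= max_categories].index.tolist()

print(cardinality)
print(dense_features)
```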
## 2) Box plots of the categorical features
# name and regionCode have too many sparse categories, so only the denser features are plotted here
categorical_features = ['model','brand','bodyType','fuelType','gearbox','notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)  # `size` was renamed to `height` in seaborn 0.9
g = g.map(boxplot, "value", "price")
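
The loop in the cell above converts each feature to the pandas `category` dtype and gives missing values an explicit 'MISSING' category, so that rows with missing values form their own group instead of being left out. A minimal sketch of that step on a toy frame (the values in `demo` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with one missing fuelType (hypothetical values)
demo = pd.DataFrame({
    'fuelType': [0.0, 1.0, np.nan, 0.0],
    'price': [1850, 3600, 6222, 2400],
})

col = demo['fuelType'].astype('category')
if col.isnull().any():
    # A new category has to be registered before it can be used as a fill value
    col = col.cat.add_categories(['MISSING']).fillna('MISSING')
demo['fuelType'] = col

print(demo['fuelType'].value_counts())
# The missing rows now show up as their own group
print(demo.groupby('fuelType', observed=True)['price'].median())
```

Calling `fillna('MISSING')` without first adding the category raises an error, because a categorical column only accepts values from its declared categories.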

[Figure: output_79_0.png (box plots of price for each categorical feature)]

Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6','v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],dtype='object')
## 3) Violin plots of the categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()

[Figures: output_81_0.png to output_81_5.png (violin plots of price, one per categorical feature)]

categorical_features = ['model','brand','bodyType','fuelType','gearbox','notRepairedDamage']
## 4) Bar plots of the categorical features
def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)  # `size` was renamed to `height` in seaborn 0.9
g = g.map(bar_plot, "value", "price")

[Figure: output_83_0.png (bar plots of price for each categorical feature)]

## 5) Frequency of each category per categorical feature (count_plot)
def count_plot(x,  **kwargs):
    sns.countplot(x=x)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data,  value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)  # `size` was renamed to `height` in seaborn 0.9
g = g.map(count_plot, "value")

[Figure: output_84_0.png (count plots of each categorical feature)]

2.3.9 Generating a data report with pandas_profiling

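pandas_profiling generates a fairly comprehensive report of visualizations and statistics; open the resulting HTML file to view it. The package has to be installed first (for example with pip install pandas-profiling), otherwise the import fails with the ModuleNotFoundError shown below.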

import pandas_profiling
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-101-6a00893fb3e1> in <module>
----> 1 import pandas_profiling

ModuleNotFoundError: No module named 'pandas_profiling'
pfr=pandas_profiling.ProfileReport(Train_data)
pfr.to_file(os.path.join(output_path,'example.html'))
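
The import above failed because the package was not installed in that environment. The project has since been renamed to ydata-profiling, so a notebook may find either module name depending on when the environment was set up. The helper below is only a sketch of a defensive import; `load_profile_report` is a hypothetical name introduced here, and it returns None instead of raising when neither package is available.

```python
import importlib

def load_profile_report():
    """Return the ProfileReport class from whichever profiling package is installed, or None."""
    # Try the newer package name first, then the older one
    for module_name in ('ydata_profiling', 'pandas_profiling'):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        return module.ProfileReport
    return None

ProfileReport = load_profile_report()
if ProfileReport is None:
    print('No profiling package found; install ydata-profiling or pandas-profiling first.')
else:
    print('Profiling package available:', ProfileReport.__module__)
    # Usage would then mirror the cell above, for example:
    # ProfileReport(Train_data).to_file(os.path.join(output_path, 'example.html'))
```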
