Project: Predicting Building Energy Scores in New York City

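Contents

0 Introduction

1 Data cleaning and formatting

1.1 Data overview

1.2 Importing the basic packages

1.3 Data analysis

1.4 Data types and missing values

1.5 Missing-value handling template

2 Exploratory Data Analysis

2.1 Univariate plots

2.2 Removing outliers

2.3 Examining which variables affect the result

3 Feature engineering

3.1 Feature transformations

3.2 Bivariate plots

3.3 Removing collinear features

4 Splitting the dataset

4.1 Splitting the data

4.2 Establishing a baseline

4.3 Saving the results for modeling

5 Building baseline models and trying several algorithms

5.1 Imputing missing values

5.2 Feature scaling and normalization

6 Building baseline models and trying several algorithms (regression)

6.1 Defining the loss function

6.2 Choosing machine learning algorithms

7 Model tuning

7.1 Hyperparameter tuning

7.2 Comparing loss functions

8 Evaluation and testing: plotting predictions against true values

9 Interpreting the model: feature selection based on importance

Main text: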

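0 Introduction

This article walks through a complete machine learning project on a real-world dataset, so that readers can see how all the pieces fit together.

Before writing any code, we need to understand the problem we are trying to solve and the data available. In this project we use publicly available building energy data for New York City. The goal is to use the energy data to build a model that predicts a building's ENERGY STAR Score, and to interpret the results to find the factors that influence the score.

The data includes the ENERGY STAR Score, which makes this a supervised regression machine learning task. Supervised: we know both the features and the target, and the goal is to train a model that learns the mapping between them. Regression: the ENERGY STAR Score is treated as a continuous variable. We want an accurate model whose predicted scores are close to the true values.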

1 Data cleaning and formatting

1.1 Data overview

1.2 Importing the basic packages

import pandas as pd
import numpy as np

# Some APIs are due for an upgrade or are deprecated; turn off the chained-assignment warning if you do not want to see it
pd.options.mode.chained_assignment = None

# head() is used a lot; show up to 60 columns when displaying a DataFrame
pd.set_option('display.max_columns', 60)

import matplotlib.pyplot as plt

# %matplotlib inline works in IPython front ends such as Jupyter Notebook or Jupyter QtConsole.
# It embeds plots in the output, so plt.show() can be omitted.
%matplotlib inline

# pyplot uses rc settings (rc parameters) to define the default properties of figures: figure size, dots per inch, line width, color, style, axes, grid, text, fonts and so on.
# The rc parameters are stored in a dictionary-like variable and are accessed by key.
# Set the global default font size for plots
plt.rcParams['font.size'] = 24

from IPython.core.pylabtools import figsize

# seaborn, a plotting library built on matplotlib
import seaborn as sns
sns.set(font_scale = 2)

from sklearn.model_selection import train_test_split

# Suppress warning messages
import warnings
warnings.filterwarnings("ignore")

1.3 Data analysis

# Load the data
data = pd.read_csv('data/Energy.csv')

# Show the first 3 rows
data.head(3)
Order Property Id Property Name Parent Property Id Parent Property Name BBL - 10 digits NYC Borough, Block and Lot (BBL) self-reported NYC Building Identification Number (BIN) Address 1 (self-reported) Address 2 Postal Code Street Number Street Name Borough DOF Gross Floor Area Primary Property Type - Self Selected List of All Property Use Types at Property Largest Property Use Type Largest Property Use Type - Gross Floor Area (ft²) 2nd Largest Property Use Type 2nd Largest Property Use - Gross Floor Area (ft²) 3rd Largest Property Use Type 3rd Largest Property Use Type - Gross Floor Area (ft²) Year Built Number of Buildings - Self-reported Occupancy Metered Areas (Energy) Metered Areas (Water) ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Weather Normalized Site Electricity Intensity (kWh/ft²) Weather Normalized Site Natural Gas Intensity (therms/ft²) Weather Normalized Source EUI (kBtu/ft²) Fuel Oil #1 Use (kBtu) Fuel Oil #2 Use (kBtu) Fuel Oil #4 Use (kBtu) Fuel Oil #5 & 6 Use (kBtu) Diesel #2 Use (kBtu) District Steam Use (kBtu) Natural Gas Use (kBtu) Weather Normalized Site Natural Gas Use (therms) Electricity Use - Grid Purchase (kBtu) Weather Normalized Site Electricity (kWh) Total GHG Emissions (Metric Tons CO2e) Direct GHG Emissions (Metric Tons CO2e) Indirect GHG Emissions (Metric Tons CO2e) Property GFA - Self-Reported (ft²) Water Use (All Water Sources) (kgal) Water Intensity (All Water Sources) (gal/ft²) Source EUI (kBtu/ft²) Release Date Water Required? DOF Benchmarking Submission Status Unnamed: 54
0 1 13286 201/205 13286 201/205 1013160001 1013160001 1037549 201/205 East 42nd st. Not Available 10017 675 3 AVENUE Manhattan 289356.0 Office Office Office 293447 Not Available Not Available Not Available Not Available 1963 2 100 Whole Building Not Available Not Available 305.6 303.1 37.8 Not Available 614.2 Not Available Not Available Not Available Not Available Not Available 51550675.1 Not Available Not Available 38139374.2 11082770.5 6962.2 0 6962.2 762051 Not Available Not Available 619.4 5/1/17 5:32 PM No In Compliance NaN
1 2 28400 NYP Columbia (West Campus) 28400 NYP Columbia (West Campus) 1021380040 1-02138-0040 1084198; 1084387;1084385; 1084386; 1084388; 10... 622 168th Street Not Available 10032 180 FT WASHINGTON AVENUE Manhattan 3693539.0 Hospital (General Medical & Surgical) Hospital (General Medical & Surgical) Hospital (General Medical & Surgical) 3889181 Not Available Not Available Not Available Not Available 1969 12 100 Whole Building Whole Building 55 229.8 228.8 24.8 2.4 401.1 Not Available 19624847.2 Not Available Not Available Not Available -391414802.6 933073441 9330734.4 332365924 96261312.1 55870.4 51016.4 4854.1 3889181 Not Available Not Available 404.3 4/27/17 11:23 AM No In Compliance NaN
2 3 4778226 MSCHoNY North 28400 NYP Columbia (West Campus) 1021380030 1-02138-0030 1063380 3975 Broadway Not Available 10032 3975 BROADWAY Manhattan 152765.0 Hospital (General Medical & Surgical) Hospital (General Medical & Surgical) Hospital (General Medical & Surgical) 231342 Not Available Not Available Not Available Not Available 1924 1 100 Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available 0 0 0 231342 Not Available Not Available Not Available 4/27/17 11:23 AM No In Compliance NaN
print(data.shape)
(11746, 55)
# Pass n to head() to see the first n rows
data.head(2)

1.4 Data types and missing values

data.info()  # a quick overview of the data types and missing values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11746 entries, 0 to 11745
Data columns (total 55 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Order  11746 non-null  int64
 1   Property Id  11746 non-null  int64
 2   Property Name  11746 non-null  object
 3   Parent Property Id  11746 non-null  object
 4   Parent Property Name  11746 non-null  object
 5   BBL - 10 digits  11746 non-null  object
 6   NYC Borough, Block and Lot (BBL) self-reported  11746 non-null  object
 7   NYC Building Identification Number (BIN)  11746 non-null  object
 8   Address 1 (self-reported)  11746 non-null  object
 9   Address 2  11746 non-null  object
 10  Postal Code  11746 non-null  object
 11  Street Number  11622 non-null  object
 12  Street Name  11624 non-null  object
 13  Borough  11628 non-null  object
 14  DOF Gross Floor Area  11628 non-null  float64
 15  Primary Property Type - Self Selected  11746 non-null  object
 16  List of All Property Use Types at Property  11746 non-null  object
 17  Largest Property Use Type  11746 non-null  object
 18  Largest Property Use Type - Gross Floor Area (ft²)  11746 non-null  object
 19  2nd Largest Property Use Type  11746 non-null  object
 20  2nd Largest Property Use - Gross Floor Area (ft²)  11746 non-null  object
 21  3rd Largest Property Use Type  11746 non-null  object
 22  3rd Largest Property Use Type - Gross Floor Area (ft²)  11746 non-null  object
 23  Year Built  11746 non-null  int64
 24  Number of Buildings - Self-reported  11746 non-null  int64
 25  Occupancy  11746 non-null  int64
 26  Metered Areas (Energy)  11746 non-null  object
 27  Metered Areas  (Water)  11746 non-null  object
 28  ENERGY STAR Score  11746 non-null  object
 29  Site EUI (kBtu/ft²)  11746 non-null  object
 30  Weather Normalized Site EUI (kBtu/ft²)  11746 non-null  object
 31  Weather Normalized Site Electricity Intensity (kWh/ft²)  11746 non-null  object
 32  Weather Normalized Site Natural Gas Intensity (therms/ft²)  11746 non-null  object
 33  Weather Normalized Source EUI (kBtu/ft²)  11746 non-null  object
 34  Fuel Oil #1 Use (kBtu)  11746 non-null  object
 35  Fuel Oil #2 Use (kBtu)  11746 non-null  object
 36  Fuel Oil #4 Use (kBtu)  11746 non-null  object
 37  Fuel Oil #5 & 6 Use (kBtu)  11746 non-null  object
 38  Diesel #2 Use (kBtu)  11746 non-null  object
 39  District Steam Use (kBtu)  11746 non-null  object
 40  Natural Gas Use (kBtu)  11746 non-null  object
 41  Weather Normalized Site Natural Gas Use (therms)  11746 non-null  object
 42  Electricity Use - Grid Purchase (kBtu)  11746 non-null  object
 43  Weather Normalized Site Electricity (kWh)  11746 non-null  object
 44  Total GHG Emissions (Metric Tons CO2e)  11746 non-null  object
 45  Direct GHG Emissions (Metric Tons CO2e)  11746 non-null  object
 46  Indirect GHG Emissions (Metric Tons CO2e)  11746 non-null  object
 47  Property GFA - Self-Reported (ft²)  11746 non-null  int64
 48  Water Use (All Water Sources) (kgal)  11746 non-null  object
 49  Water Intensity (All Water Sources) (gal/ft²)  11746 non-null  object
 50  Source EUI (kBtu/ft²)  11746 non-null  object
 51  Release Date  11746 non-null  object
 52  Water Required?  11628 non-null  object
 53  DOF Benchmarking Submission Status  11716 non-null  object
 54  Unnamed: 54  0 non-null  float64
dtypes: float64(2), int64(6), object(47)
memory usage: 4.9+ MB

1.5 Missing-value handling template

# Convert the 'Not Available' placeholder into np.nan.
# DataFrame.replace() swaps every occurrence of the old value for the new one.
data = data.replace({'Not Available': np.nan})

# In the raw data, columns such as those ending in 'ft²' hold numeric values, but because they also contain other kinds of values, info() reports them as object.
# Quantities such as kBtu/ft² should be float, so cast every column whose name contains ft², kBtu, Metric Tons CO2e, kWh, therms, gal or Score.
for col in list(data.columns):
    if ('ft²' in col or 'kBtu' in col or 'Metric Tons CO2e' in col or
        'kWh' in col or 'therms' in col or 'gal' in col or 'Score' in col):
        # These columns were read as object; force them to float
        data[col] = data[col].astype(float)
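The conversion above can be tried on a tiny frame first. The sketch below is only an illustration: the two-column `toy` frame and its values are made up, and `pd.to_numeric(..., errors='coerce')` is an alternative that the original code does not use.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values): numbers stored as strings, plus the placeholder
toy = pd.DataFrame({
    'Site EUI (kBtu/ft²)': ['305.6', 'Not Available', '229.8'],
    'Borough': ['Manhattan', 'Manhattan', 'Not Available'],
})

# Step 1: turn the placeholder into a real missing value
toy = toy.replace({'Not Available': np.nan})

# Step 2: cast the measurement column to float, as in the loop above
as_float = toy['Site EUI (kBtu/ft²)'].astype(float)

# Alternative (not in the original): coerce anything unparseable to NaN
coerced = pd.to_numeric(toy['Site EUI (kBtu/ft²)'], errors='coerce')

print(as_float.dtype, coerced.dtype)
print(toy.isna().sum())
```

The `astype(float)` route fails loudly if a column still contains an unexpected string, while `errors='coerce'` silently turns it into NaN; which behavior is preferable depends on how much the raw data is trusted.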


print(list(data.columns))
['Order', 'Property Id', 'Property Name', 'Parent Property Id', 'Parent Property Name', 'BBL - 10 digits', 'NYC Borough, Block and Lot (BBL) self-reported', 'NYC Building Identification Number (BIN)', 'Address 1 (self-reported)', 'Address 2', 'Postal Code', 'Street Number', 'Street Name', 'Borough', 'DOF Gross Floor Area', 'Primary Property Type - Self Selected', 'List of All Property Use Types at Property', 'Largest Property Use Type', 'Largest Property Use Type - Gross Floor Area (ft²)', '2nd Largest Property Use Type', '2nd Largest Property Use - Gross Floor Area (ft²)', '3rd Largest Property Use Type', '3rd Largest Property Use Type - Gross Floor Area (ft²)', 'Year Built', 'Number of Buildings - Self-reported', 'Occupancy', 'Metered Areas (Energy)', 'Metered Areas  (Water)', 'ENERGY STAR Score', 'Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)', 'Weather Normalized Site Electricity Intensity (kWh/ft²)', 'Weather Normalized Site Natural Gas Intensity (therms/ft²)', 'Weather Normalized Source EUI (kBtu/ft²)', 'Fuel Oil #1 Use (kBtu)', 'Fuel Oil #2 Use (kBtu)', 'Fuel Oil #4 Use (kBtu)', 'Fuel Oil #5 & 6 Use (kBtu)', 'Diesel #2 Use (kBtu)', 'District Steam Use (kBtu)', 'Natural Gas Use (kBtu)', 'Weather Normalized Site Natural Gas Use (therms)', 'Electricity Use - Grid Purchase (kBtu)', 'Weather Normalized Site Electricity (kWh)', 'Total GHG Emissions (Metric Tons CO2e)', 'Direct GHG Emissions (Metric Tons CO2e)', 'Indirect GHG Emissions (Metric Tons CO2e)', 'Property GFA - Self-Reported (ft²)', 'Water Use (All Water Sources) (kgal)', 'Water Intensity (All Water Sources) (gal/ft²)', 'Source EUI (kBtu/ft²)', 'Release Date', 'Water Required?', 'DOF Benchmarking Submission Status', 'Unnamed: 54']
print(data.columns)
Index(['Order', 'Property Id', 'Property Name', 'Parent Property Id','Parent Property Name', 'BBL - 10 digits','NYC Borough, Block and Lot (BBL) self-reported','NYC Building Identification Number (BIN)', 'Address 1 (self-reported)','Address 2', 'Postal Code', 'Street Number', 'Street Name', 'Borough','DOF Gross Floor Area', 'Primary Property Type - Self Selected','List of All Property Use Types at Property','Largest Property Use Type','Largest Property Use Type - Gross Floor Area (ft²)','2nd Largest Property Use Type','2nd Largest Property Use - Gross Floor Area (ft²)','3rd Largest Property Use Type','3rd Largest Property Use Type - Gross Floor Area (ft²)', 'Year Built','Number of Buildings - Self-reported', 'Occupancy','Metered Areas (Energy)', 'Metered Areas  (Water)', 'ENERGY STAR Score','Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)','Weather Normalized Site Electricity Intensity (kWh/ft²)','Weather Normalized Site Natural Gas Intensity (therms/ft²)','Weather Normalized Source EUI (kBtu/ft²)', 'Fuel Oil #1 Use (kBtu)','Fuel Oil #2 Use (kBtu)', 'Fuel Oil #4 Use (kBtu)','Fuel Oil #5 & 6 Use (kBtu)', 'Diesel #2 Use (kBtu)','District Steam Use (kBtu)', 'Natural Gas Use (kBtu)','Weather Normalized Site Natural Gas Use (therms)','Electricity Use - Grid Purchase (kBtu)','Weather Normalized Site Electricity (kWh)','Total GHG Emissions (Metric Tons CO2e)','Direct GHG Emissions (Metric Tons CO2e)','Indirect GHG Emissions (Metric Tons CO2e)','Property GFA - Self-Reported (ft²)','Water Use (All Water Sources) (kgal)','Water Intensity (All Water Sources) (gal/ft²)','Source EUI (kBtu/ft²)', 'Release Date', 'Water Required?','DOF Benchmarking Submission Status', 'Unnamed: 54'],dtype='object')
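# describe() reports count, mean, std and so on for numeric columns only; object columns are not shown
data.describe()
# Reading scientific notation: 3.20e+05 = 3.20 x 10^5 = 3.20 x 100000 = 320000
# In "E" notation the number before the E is the mantissa and the number after "E+" is the power of ten. For example, 7.8 times 10 to the 7th power, normally written 7.8 x 10^7, is abbreviated as 7.8E+07.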
Order Property Id DOF Gross Floor Area Largest Property Use Type - Gross Floor Area (ft²) 2nd Largest Property Use - Gross Floor Area (ft²) 3rd Largest Property Use Type - Gross Floor Area (ft²) Year Built Number of Buildings - Self-reported Occupancy ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Weather Normalized Site Electricity Intensity (kWh/ft²) Weather Normalized Site Natural Gas Intensity (therms/ft²) Weather Normalized Source EUI (kBtu/ft²) Fuel Oil #1 Use (kBtu) Fuel Oil #2 Use (kBtu) Fuel Oil #4 Use (kBtu) Fuel Oil #5 & 6 Use (kBtu) Diesel #2 Use (kBtu) District Steam Use (kBtu) Natural Gas Use (kBtu) Weather Normalized Site Natural Gas Use (therms) Electricity Use - Grid Purchase (kBtu) Weather Normalized Site Electricity (kWh) Total GHG Emissions (Metric Tons CO2e) Direct GHG Emissions (Metric Tons CO2e) Indirect GHG Emissions (Metric Tons CO2e) Property GFA - Self-Reported (ft²) Water Use (All Water Sources) (kgal) Water Intensity (All Water Sources) (gal/ft²) Source EUI (kBtu/ft²) Unnamed: 54
count 11746.000000 1.174600e+04 1.162800e+04 1.174400e+04 3741.000000 1484.000000 11746.000000 11746.000000 11746.000000 9642.000000 11583.000000 10281.000000 10959.000000 9783.000000 10281.000000 9.000000e+00 2.581000e+03 1.321000e+03 5.940000e+02 1.600000e+01 9.360000e+02 1.030400e+04 9.784000e+03 1.150200e+04 1.096000e+04 1.167200e+04 1.166300e+04 1.168100e+04 1.174600e+04 7.762000e+03 7762.000000 11583.000000 0.0
mean 7185.759578 3.642958e+06 1.732695e+05 1.605524e+05 22778.682010 12016.825270 1948.738379 1.289971 98.762557 59.854594 280.071484 309.747466 11.072643 1.901441 417.915709 3.395398e+06 3.186882e+06 5.294367e+06 2.429105e+06 1.193594e+06 2.868907e+08 5.048543e+07 5.364578e+05 5.965472e+06 1.768752e+06 4.553657e+03 2.477937e+03 2.076339e+03 1.673739e+05 1.591798e+04 136.172432 385.908029 NaN
std 4323.859984 1.049070e+06 3.367055e+05 3.095746e+05 55094.441422 27959.755486 30.576386 4.017484 7.501603 29.993586 8607.178877 9784.731207 127.733868 97.204587 10530.524339 2.213237e+06 5.497154e+06 5.881863e+06 4.442946e+06 3.558178e+06 3.124603e+09 3.914717e+09 4.022606e+07 3.154430e+07 9.389154e+06 2.041639e+05 1.954498e+05 5.931295e+04 3.189238e+05 1.529524e+05 1730.726938 9312.736225 NaN
min 1.000000 7.365000e+03 5.002800e+04 5.400000e+01 0.000000 0.000000 1600.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.085973e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -4.690797e+08 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -2.313430e+04 0.000000e+00 0.000000e+00 0.000000 0.000000 NaN
25% 3428.250000 2.747222e+06 6.524000e+04 6.520100e+04 4000.000000 1720.750000 1927.000000 1.000000 100.000000 37.000000 61.800000 65.100000 3.800000 0.100000 103.500000 1.663594e+06 2.550378e+05 2.128213e+06 0.000000e+00 5.698020e+04 4.320254e+06 1.098251e+06 1.176952e+04 1.043673e+06 3.019974e+05 3.287000e+02 1.474500e+02 9.480000e+01 6.699400e+04 2.595400e+03 27.150000 99.400000 NaN
50% 6986.500000 3.236404e+06 9.313850e+04 9.132400e+04 8654.000000 5000.000000 1941.000000 1.000000 100.000000 65.000000 78.500000 82.500000 5.300000 0.500000 129.400000 4.328815e+06 1.380138e+06 4.312984e+06 0.000000e+00 2.070020e+05 9.931240e+06 4.103962e+06 4.445525e+04 1.855196e+06 5.416312e+05 5.002500e+02 2.726000e+02 1.718000e+02 9.408000e+04 4.692500e+03 45.095000 124.900000 NaN
75% 11054.500000 4.409092e+06 1.596140e+05 1.532550e+05 20000.000000 12000.000000 1966.000000 1.000000 100.000000 85.000000 97.600000 102.500000 9.200000 0.700000 167.200000 4.938947e+06 4.445808e+06 6.514520e+06 4.293825e+06 2.918332e+05 2.064497e+07 6.855070e+06 7.348107e+04 4.370302e+06 1.284677e+06 9.084250e+02 4.475000e+02 4.249000e+02 1.584140e+05 8.031875e+03 70.805000 162.750000 NaN
max 14993.000000 5.991312e+06 1.354011e+07 1.421712e+07 962428.000000 591640.000000 2019.000000 161.000000 100.000000 100.000000 869265.000000 939329.000000 6259.400000 9393.000000 986366.000000 6.275850e+06 1.046849e+08 7.907464e+07 4.410378e+07 1.435178e+07 7.163518e+10 3.942850e+11 3.942852e+09 1.691763e+09 4.958273e+08 2.094340e+07 2.094340e+07 4.764375e+06 1.421712e+07 6.594604e+06 96305.690000 912801.100000 NaN
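Many of the values in this summary are printed in scientific notation. As a small aside, the sketch below shows how Python parses and produces that notation; the numbers are the ones used in the comments above, and the variable names are chosen only for this example.

```python
# Parse a string written in "E" notation
value = float("7.8E+07")      # 7.8 x 10^7
print(value)                  # 78000000.0

# Format a number with two digits after the decimal point
formatted = f"{320000:.2e}"
print(formatted)              # 3.20e+05

# An uppercase E in the format spec gives the uppercase form
upper = f"{78000000:.1E}"
print(upper)                  # 7.8E+07
```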
# A general-purpose template for summarizing missing values
# Define a function that takes a DataFrame
def missing_values_table(df):
    # pandas' isnull() flags missing values; sum them for each column
    mis_val = df.isnull().sum()

    # Percentage of missing values in each column
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Combine the counts and the percentages into one table
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})

    # iloc[:, 1] != 0 keeps only the columns that have missing values;
    # ascending=False sorts them by percentage, largest first
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print the total number of columns and how many of them have missing values
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
           "There are " + str(mis_val_table_ren_columns.shape[0]) +
           " columns that have missing values.")

    return mis_val_table_ren_columns

# The output lists the column name, the number of missing values and the percentage missing. There are 55 columns in total, 40 of which have missing values.
missing_values_table(data)
Your selected dataframe has 55 columns.
There are 40 columns that have missing values.
Missing Values % of Total Values
Unnamed: 54 11746 100.0
Fuel Oil #1 Use (kBtu) 11737 99.9
Diesel #2 Use (kBtu) 11730 99.9
Address 2 11539 98.2
Fuel Oil #5 & 6 Use (kBtu) 11152 94.9
District Steam Use (kBtu) 10810 92.0
Fuel Oil #4 Use (kBtu) 10425 88.8
3rd Largest Property Use Type 10262 87.4
3rd Largest Property Use Type - Gross Floor Area (ft²) 10262 87.4
Fuel Oil #2 Use (kBtu) 9165 78.0
2nd Largest Property Use - Gross Floor Area (ft²) 8005 68.2
2nd Largest Property Use Type 8005 68.2
Metered Areas (Water) 4609 39.2
Water Intensity (All Water Sources) (gal/ft²) 3984 33.9
Water Use (All Water Sources) (kgal) 3984 33.9
ENERGY STAR Score 2104 17.9
Weather Normalized Site Natural Gas Intensity (therms/ft²) 1963 16.7
Weather Normalized Site Natural Gas Use (therms) 1962 16.7
Weather Normalized Source EUI (kBtu/ft²) 1465 12.5
Weather Normalized Site EUI (kBtu/ft²) 1465 12.5
Natural Gas Use (kBtu) 1442 12.3
Weather Normalized Site Electricity Intensity (kWh/ft²) 787 6.7
Weather Normalized Site Electricity (kWh) 786 6.7
Electricity Use - Grid Purchase (kBtu) 244 2.1
Site EUI (kBtu/ft²) 163 1.4
Source EUI (kBtu/ft²) 163 1.4
NYC Building Identification Number (BIN) 162 1.4
Street Number 124 1.1
Street Name 122 1.0
DOF Gross Floor Area 118 1.0
Borough 118 1.0
Water Required? 118 1.0
Direct GHG Emissions (Metric Tons CO2e) 83 0.7
Total GHG Emissions (Metric Tons CO2e) 74 0.6
Indirect GHG Emissions (Metric Tons CO2e) 65 0.6
Metered Areas (Energy) 57 0.5
DOF Benchmarking Submission Status 30 0.3
NYC Borough, Block and Lot (BBL) self-reported 11 0.1
Largest Property Use Type - Gross Floor Area (ft²) 2 0.0
Largest Property Use Type 2 0.0
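# Use 50% as the threshold
missing_df = missing_values_table(data)

# Collect the columns with more than 50% missing values; they are removed with drop() below
missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)

# Of the 55 original columns, 40 have missing values, and the 12 of those with more than 50% missing will be removed
print('We will remove %d columns.' % len(missing_columns))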
Your selected dataframe has 55 columns.
There are 40 columns that have missing values.
We will remove 12 columns.
# Drop every column with more than 50% missing values
data = data.drop(columns = list(missing_columns))
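The threshold rule can be checked on a small example. This is a minimal sketch under assumptions: the `toy` frame is invented, and `isna().mean()` is used as a compact way to get the missing fraction, which is not how the template above computes it.

```python
import numpy as np
import pandas as pd

# Toy frame (invented): 'b' is 75% missing, 'c' is 25% missing
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 4.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

# Percentage of missing values per column
missing_pct = 100 * toy.isna().mean()

# Columns above the 50% threshold are dropped
to_drop = list(missing_pct[missing_pct > 50].index)
cleaned = toy.drop(columns=to_drop)

print(missing_pct.round(1))
print(to_drop)
print(list(cleaned.columns))
```

The 50% cut-off is a judgment call made in this project rather than a fixed rule; a column with many missing values can still be worth keeping if it is known to be informative.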

2 Exploratory Data Analysis

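2.1 Univariate plots

# Set the figure width and height
figsize(8, 8)

# The target Y is the energy score from 1 to 100; rename the column to score
data = data.rename(columns = {'ENERGY STAR Score': 'score'})

# matplotlib ships several styles; each style name gives a different background and look
plt.style.use('fivethirtyeight')

# dropna() filters out the missing values
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');
plt.xlabel('Score'); plt.ylabel('Number of Buildings');
plt.title('Energy Star Score Distribution');
# In the plot, scores of 1 and 100 are far more frequent than the others. The raw data comes from reports that property owners fill in themselves, so the spikes at 1 and 100 look inflated.
# However, our goal is only to predict the score, not to design a better scoring method for buildings. We can note in the report that the scores have a suspicious distribution, but the focus stays on predicting them.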


help(plt.hist)

# The same plot with a different style
figsize(8, 8)
plt.style.use('dark_background')
# hist() draws a histogram. dropna() removes the rows where score is missing, and the remaining values are plotted.
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');
plt.xlabel('Score'); plt.ylabel('Number of Buildings');
plt.title('Energy Star Score Distribution');

# List the available style names
plt.style.available
print(data.columns)

# The same plot again with a larger figure and the original style
figsize(10, 10)
plt.style.use('fivethirtyeight')
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');
plt.xlabel('Score'); plt.ylabel('Number of Buildings');
plt.title('Energy Star Score Distribution');


# Site EUI (kBtu/ft²): site energy use intensity
figsize(8, 8)
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'black'); # black bar edges
plt.xlabel('Site EUI');
plt.ylabel('Count'); plt.title('Site EUI Distribution');
# This reveals another problem: a few buildings have extremely high values, so the plot is heavily skewed. The outliers have to be handled.
# The largest value is clearly abnormal. Outliers have many possible causes: typos, faulty measuring equipment, wrong units, or values that are legitimate but extreme.
# In other words, when many points lie far from the mean, the data contains outliers.

# The same plot with red bar edges
figsize(8, 8)
# edgecolor: the color of the bar edges in the histogram
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'red');
plt.xlabel('Site EUI');
plt.ylabel('Count'); plt.title('Site EUI Distribution');
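data['Site EUI (kBtu/ft²)'].describe()
# The mean is small and the standard deviation is very large, which means some points lie far from the mean, i.e. there are outliers: the minimum is 0 and the maximum is 869265.

# Site energy use intensity (EUI)
# dropna() filters out the missing values.
# sort_values() sorts by value in ascending order by default; the output shows the row label on the left and the value on the right.
# Sort the values and look at the last 10, i.e. the 10 largest
data['Site EUI (kBtu/ft²)'].dropna().sort_values().tail(10)

# To inspect the outlier, select the row whose Site EUI equals 869265
data.loc[data['Site EUI (kBtu/ft²)'] == 869265, :]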


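# The same selection does not work with .ix or .iloc.
# .ix was removed in pandas 1.0, and .iloc is position-based and does not accept a boolean Series as a mask.
# data.ix[data['Site EUI (kBtu/ft²)'] == 869265, :]    # fails: .ix no longer exists
# data.iloc[data['Site EUI (kBtu/ft²)'] == 869265, :]  # fails: boolean Series mask
# Use .loc with the boolean mask instead
data.loc[data['Site EUI (kBtu/ft²)'] == 869265, :]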
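To make the difference between label-based and position-based selection concrete, here is a minimal sketch on an invented three-row frame (the `eui` column and the index labels are made up). It assumes a recent pandas version; converting the mask to a NumPy array is one way to make `.iloc` accept it.

```python
import pandas as pd

# Toy frame (invented) with non-default index labels
toy = pd.DataFrame({'eui': [61.8, 869265.0, 97.6]}, index=[10, 20, 30])

# Boolean mask: True where the value equals the suspected outlier
mask = toy['eui'] == 869265

# .loc accepts the boolean Series directly
by_label = toy.loc[mask, :]

# .iloc needs a plain boolean array rather than a Series
by_position = toy.iloc[mask.to_numpy(), :]

print(by_label)
```

For filtering rows by a condition, `.loc` with a mask (or simply `toy[mask]`) is usually the more readable choice.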

2.2 Removing outliers

# Take the 25% and 75% quantiles from describe()
first_quartile = data['Site EUI (kBtu/ft²)'].describe()['25%']
third_quartile = data['Site EUI (kBtu/ft²)'].describe()['75%']

# Their difference is the interquartile range (IQR)
iqr = third_quartile - first_quartile

# Keep the normal data, Q1 - 3 * IQR < EUI < Q3 + 3 * IQR, and filter out everything outside that range as outliers
data = data[(data['Site EUI (kBtu/ft²)'] > (first_quartile - 3 * iqr)) &
            (data['Site EUI (kBtu/ft²)'] < (third_quartile + 3 * iqr))]
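The same rule can be written as a small reusable helper. This is a sketch rather than the author's code: `iqr_bounds` is a hypothetical name, the toy values are invented, and `quantile()` is used in place of `describe()` (both give the same quartiles with pandas' default interpolation).

```python
import pandas as pd

def iqr_bounds(values, k=3.0):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy series (invented): five ordinary values and one extreme one
toy = pd.Series([60.0, 70.0, 80.0, 90.0, 100.0, 869265.0])

lower, upper = iqr_bounds(toy)
kept = toy[(toy > lower) & (toy < upper)]

print(lower, upper)
print(kept.tolist())
```

A multiplier of 3 marks only extreme outliers; the more common 1.5 would remove more points. Note that rows with a missing value also fail both comparisons, so the filter drops them as well.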


# Site energy use intensity (EUI): after removing the outliers the distribution looks close to normal
figsize(8, 8)
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'black');
plt.xlabel('Site EUI');
plt.ylabel('Count'); plt.title('Site EUI Distribution');

2.3 Examining which variables affect the result


types = data.dropna(subset=['score'])

# Largest Property Use Type: the largest use type of the property
# The column has many categories. The notes list four with more than 100 buildings: Multifamily Housing, Office, Hotel, and "Data Center, Non-Refrigerated Warehouse, Office".
types = types['Largest Property Use Type'].value_counts()

# Use .index to get the category names; without it, list() returns the counts instead
types = list(types[types.values > 100].index)
print(types)
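The difference between taking the values and taking the index of a `value_counts()` result is easy to see on a small series. The category names below are invented for the example, and the threshold is lowered to 1 so that the toy data has something to filter.

```python
import pandas as pd

# Toy series (invented): three Office rows, two Hotel rows, one Hospital row
toy = pd.Series(['Office', 'Office', 'Office', 'Hotel', 'Hotel', 'Hospital'])

# value_counts() returns counts indexed by category, most frequent first
counts = toy.value_counts()

# Keep the categories that appear more than once
frequent = counts[counts > 1]

print(list(frequent))        # the counts: [3, 2]
print(list(frequent.index))  # the names: ['Office', 'Hotel']
```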


# Look for features whose categories differ strongly
# Largest Property Use Type: the largest use type of the property
figsize(12, 10)

# b_type is the loop variable; types holds the building types
for b_type in types:
    # Rows whose Largest Property Use Type equals the current b_type
    subset = data[data['Largest Property Use Type'] == b_type]

    # Density of the scores in this subset; alpha is the transparency
    sns.kdeplot(subset['score'].dropna(), label = b_type, alpha = 0.5);

# The x axis is the energy score and the y axis is the density
plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20);
plt.title('Density Plot of Energy Star Scores by Building Type', size = 28);
plt.legend();
# The red and yellow curves differ a lot

# Print each subset to inspect it. The comments mention 4 types, but the types list actually contains 7 elements, i.e. 7 types; only the first 4 are shown here.
for b_type in types:
    subset = data[data['Largest Property Use Type'] == b_type]
    print(subset)

# Print only the non-missing scores of each subset
for b_type in types:
    subset = data[data['Largest Property Use Type'] == b_type]
    print(subset['score'].dropna())


# See how the score relates to the borough
boroughs = data.dropna(subset=['score'])
# Count the scored buildings in each borough
boroughs = boroughs['Borough'].value_counts()
boroughs = list(boroughs[boroughs.values > 100].index)
print(boroughs)


# The borough curves differ little, so borough is a weak feature for separating scores
# Borough has 5 values: Manhattan, Brooklyn, Queens, Bronx,
# and Staten Island
figsize(12, 10)
# Draw one density curve per borough; x-axis: Energy Star Score, y-axis: density
for borough in boroughs:
    subset = data[data['Borough'] == borough]
    sns.kdeplot(subset['score'].dropna(), label = borough);
plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20);
plt.title('Density Plot of Energy Star Scores by Borough', size = 28);


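# corr() returns the correlation matrix, i.e. the pairwise correlation coefficient between every X and Y
# Most features correlate negatively with score and few positively; values close to 0 (e.g. -0.046605) are candidates for removal
# On pandas 2.0 or later, non-numeric columns must be excluded, e.g. data.corr(numeric_only=True)
correlations_data = data.corr()['score'].sort_values()  # ascending order, smallest first
# The 10 most negative correlations
print(correlations_data.head(10), '\n')
print("---------------------------")
# The 10 most positive correlations
print(correlations_data.tail(10))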

When corr() is called without arguments, it uses method='pearson' by default.
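
To make the default concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data and its column names are assumptions for illustration, not part of the Energy dataset). It compares the default call with an explicit `method='pearson'` and with the rank-based `method='spearman'`.

```python
import pandas as pd

# Hypothetical toy data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})

pearson_default = toy.corr()                   # no arguments: Pearson
pearson_explicit = toy.corr(method="pearson")  # same thing, spelled out
spearman = toy.corr(method="spearman")         # rank-based alternative

print(pearson_default.loc["x", "y"])  # below 1 because the relation is not linear
print(spearman.loc["x", "y"])         # rank order is identical, so close to 1
```

Pearson measures linear association only, which is one reason the next section tries square-root and log transforms of the features.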

3 Feature Engineering

3.1 Feature Transformation

import warnings
warnings.filterwarnings("ignore")
# Keep only the numeric columns; string and other non-numeric columns are left out
numeric_subset = data.select_dtypes('number')
# Loop over every numeric column
for col in numeric_subset.columns:
    # score is the label (y) and is left untransformed; every other column is a feature (x)
    if col == 'score':
        continue
    # For each feature, add its square root and its natural log as new columns
    numeric_subset['sqrt_' + col] = np.sqrt(numeric_subset[col])
    numeric_subset['log_' + col] = np.log(numeric_subset[col])
# Borough: the NYC borough
# Largest Property Use Type: the largest property use type
categorical_subset = data[['Borough', 'Largest Property Use Type']]
print(categorical_subset)
# One-hot encode the categorical columns with get_dummies
categorical_subset = pd.get_dummies(categorical_subset)
print(categorical_subset)
print(numeric_subset)
# Join the numeric columns and the one-hot encoded columns side by side
features = pd.concat([numeric_subset, categorical_subset], axis = 1)
features = features.dropna(subset = ['score'])
# Sort the correlations with score in ascending order
correlations = features.corr()['score'].dropna().sort_values()
print(correlations)


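Two kinds of correlation matter here: the correlation among the features themselves, and the correlation between each feature and score.
Correlation can be linear or non-linear.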
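
The difference between linear and non-linear correlation can be sketched with made-up numbers (the arrays below are assumptions, not values from the dataset). When the target depends on the logarithm of a feature, the Pearson coefficient of the raw feature understates the relationship, while the log-transformed feature correlates almost perfectly.

```python
import numpy as np

# Hypothetical feature spanning several orders of magnitude
x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
# Hypothetical target that is linear in log10(x)
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_raw = np.corrcoef(x, y)[0, 1]          # Pearson r on the raw feature
r_log = np.corrcoef(np.log(x), y)[0, 1]  # Pearson r after a log transform

print(round(r_raw, 3), round(r_log, 3))
```

This is the motivation for adding sqrt_ and log_ columns above: if a transform raises the correlation with score noticeably, the relationship was non-linear in the raw feature.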

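# Transformed columns carry the prefix sqrt_ or log_
# The 15 most negative correlations
correlations.head(15)
# Weather Normalized Site EUI (kBtu/ft²) and its transform sqrt_Weather Normalized Site EUI (kBtu/ft²) correlate with score almost equally, so the transform adds little value
# The values are all similar, with no clear trend
# The last 15 entries are the most positive correlations
correlations.tail(15)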

In general, anything that can be done with head can also be done with tail.

3.2 Two-Variable Plots


import warnings
warnings.filterwarnings("ignore")
figsize(12, 10)
# Relationship between the Energy Star Score and Site EUI, colored by building type
features['Largest Property Use Type'] = data.dropna(subset = ['score'])['Largest Property Use Type']
# isin() takes a list and keeps the rows whose building type appears in types
features = features[features['Largest Property Use Type'].isin(types)]
# hue = 'Largest Property Use Type' draws one color per building type (legend at the lower right)
# Recent seaborn versions require keyword x/y and height (formerly size)
sns.lmplot(x = 'Site EUI (kBtu/ft²)', y = 'score',
           hue = 'Largest Property Use Type', data = features,
           scatter_kws = {'alpha': 0.8, 's': 60}, fit_reg = False,
           height = 12, aspect = 1.2);
# Plot labeling
plt.xlabel("Site EUI", size = 28)
plt.ylabel('Energy Star Score', size = 28)
plt.title('Energy Star Score vs Site EUI', size = 36);

3.3 Removing Collinear Features


# Work on a copy so that the original data stays unchanged
features = data.copy()
# select_dtypes('number') keeps the numeric columns
numeric_subset = data.select_dtypes('number')
# Add a log-transformed copy of every numeric column
for col in numeric_subset.columns:
    # Skip the Energy Star Score, which is the target y
    if col == 'score':
        continue
    numeric_subset['log_' + col] = np.log(numeric_subset[col])
# Borough: the NYC borough
# Largest Property Use Type: multifamily housing, office, hotel, non-refrigerated warehouse, and so on
categorical_subset = data[['Borough', 'Largest Property Use Type']]
# get_dummies is the pandas way to one-hot encode
categorical_subset = pd.get_dummies(categorical_subset)
# Combine the numeric features with the encoded borough and property use type columns
features = pd.concat([numeric_subset, categorical_subset], axis = 1)
features.shape  # 110 columns, more than the original data


# Weather Normalized Site EUI (kBtu/ft²): energy use intensity normalized for weather
# Site EUI: energy use intensity
plot_data = data[['Weather Normalized Site EUI (kBtu/ft²)', 'Site EUI (kBtu/ft²)']].dropna()
# 'bo': blue circle markers
plt.plot(plot_data['Site EUI (kBtu/ft²)'], plot_data['Weather Normalized Site EUI (kBtu/ft²)'], 'bo')
# x-axis: Site EUI, y-axis: Weather Normalized Site EUI
plt.xlabel('Site EUI'); plt.ylabel('Weather Norm EUI')
plt.title('Weather Norm EUI vs Site EUI, R = %0.4f' % np.corrcoef(data[['Weather Normalized Site EUI (kBtu/ft²)', 'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);
# Collinear features: when two features are very highly correlated, this function removes one of them.
# threshold: the cut-off for the absolute correlation, chosen by trial and error.
def remove_collinear_features(x, threshold):
    """Drop one feature from every pair whose absolute correlation exceeds threshold."""
    y = x['score']                  # keep the label aside
    x = x.drop(columns = ['score']) # everything else is a feature
    # Repeat until no pair of features exceeds the threshold
    while True:
        # Pairwise correlation matrix of the remaining features
        corr_matrix = x.corr()
        # Features to remove in this pass
        drop_cols = []
        for col in corr_matrix:
            # corr(A, B) equals corr(B, A); skipping dropped columns avoids removing both A and B
            if col not in drop_cols:
                # Absolute correlations of this column with the others; its own entry (always 1) is excluded
                v = corr_matrix[col].abs().drop(col)
                if v.max() > threshold:
                    # idxmax returns the column name; np.argmax would return a position
                    name = v.idxmax()
                    if name not in drop_cols:
                        drop_cols.append(name)
        # Drop the collected columns, or stop once nothing exceeds the threshold
        if drop_cols:
            x = x.drop(columns=drop_cols)
        else:
            break
    # Put the label back
    x['score'] = y
    return x
help(remove_collinear_features)
# The author could not get the call below to run with the original version of the function and left it commented out.
# A likely cause is that np.argmax returned a position rather than a column name.
# # Threshold 0.6: remove one feature of every pair with absolute correlation above 0.6
# features = remove_collinear_features(features, 0.6);
# Drop the columns that contain only missing values
features  = features.dropna(axis=1, how = 'all')
features.shape  # was 110 before
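
The idea of dropping one feature from each highly correlated pair can be tried on synthetic data first. The following is a minimal sketch under stated assumptions: `drop_collinear` is a hypothetical helper written for this illustration (it is not the function above), and the three columns are randomly generated rather than taken from the Energy dataset.

```python
import numpy as np
import pandas as pd

def drop_collinear(df, threshold):
    """Hypothetical helper: keep the first column of each highly correlated pair."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    to_drop = []
    for i, a in enumerate(cols):
        if a in to_drop:
            continue
        for b in cols[i + 1:]:
            # Drop the later column when the pair is too strongly correlated
            if b not in to_drop and corr.loc[a, b] > threshold:
                to_drop.append(b)
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "eui": base,
    # Near-duplicate of eui, mimicking a weather-normalized variant
    "eui_weather_norm": base * 1.02 + rng.normal(scale=0.01, size=200),
    # Independent column
    "floor_area": rng.normal(size=200),
})

reduced = drop_collinear(toy, threshold=0.6)
print(list(reduced.columns))
```

Keeping the first column of each pair is an arbitrary choice made here for simplicity; which member of a pair to keep is a modeling decision.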

4 Splitting the Dataset

4.1 Splitting the Data

# isna(): True where the value is NaN, otherwise False
no_score = features[features['score'].isna()]
# notnull(): True where the value is not NaN
score = features[features['score'].notnull()]
print(no_score.shape)
print(score.shape)
# features holds the input columns
# targets holds the label, i.e. the building's score
features = score.drop(columns='score')
targets = pd.DataFrame(score['score'])
# np.inf and -np.inf are positive and negative infinity; replace both with NaN
features = features.replace({np.inf: np.nan, -np.inf: np.nan})
# random_state = 42 fixes the seed so that every run produces the same training and test sets.
# Without it the split changes on each run, which makes tuning results impossible to compare.
# Any value works. The split draws pseudo-random numbers to pick rows, so the same seed on the
# same dataset reproduces the same training and test sets.
X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)
print(X.shape)
print(X_test.shape)
print(y.shape)
print(y_test.shape)
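
The effect of a fixed random_state can be checked directly. This is a small sketch on made-up arrays (`X_toy` and `y_toy` are assumptions for illustration): two calls with the same seed return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same seed
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

print(np.array_equal(a_test, b_test))  # same rows end up in the test set
print(a_train.shape, a_test.shape)     # 70% / 30% of the 10 rows
```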


4.2 Establishing a Baseline


# MAE (mean absolute error): the mean of |true value - predicted value|
# abs(): absolute value
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
# Baseline: always predict the median of the training scores
baseline_guess = np.median(y)
print('The baseline guess is a score of %0.2f' % baseline_guess)  # the median is 66
print("Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, baseline_guess))  # MAE = 24.5164
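
The arithmetic behind the baseline is easy to follow on a handful of invented scores (the two arrays below are assumptions, not the real training or test labels). The guess is the training median, and the MAE is the average distance of the test scores from that guess.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true values and predictions."""
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical scores
train_scores = np.array([10, 50, 66, 80, 100])
test_scores = np.array([40, 70, 90])

baseline_guess = np.median(train_scores)          # middle value of the training scores
baseline_mae = mae(test_scores, baseline_guess)   # (26 + 4 + 24) / 3

print(baseline_guess, baseline_mae)
```

Any model trained later should beat this number; otherwise it has learned nothing beyond the median.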


4.3 Saving the Results for Modeling

# Save the no scores, training, and testing data
# to_csv writes each DataFrame to a CSV file, e.g. no_score goes to data/no_score.csv
no_score.to_csv('data/no_score.csv', index = False)
X.to_csv('data/training_features.csv', index = False)
X_test.to_csv('data/testing_features.csv', index = False)
y.to_csv('data/training_labels.csv', index = False)
y_test.to_csv('data/testing_labels.csv', index = False)

5 Building Baseline Models and Trying Several Algorithms

# The earlier sections concentrated on preparing the data; the focus now moves to modeling. Import the required packages.
# Data analysis libraries
import pandas as pd
import numpy as np
# Silence the chained-assignment warning
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', 60)
# Visualization
import matplotlib.pyplot as plt
%matplotlib inline
# Font size
plt.rcParams['font.size'] = 24
from IPython.core.pylabtools import figsize
# Seaborn, a higher-level plotting library
import seaborn as sns
sns.set(font_scale = 2)
# Preprocessing: missing-value imputation and min-max scaling
# The original code was:
# from sklearn.preprocessing import Imputer, MinMaxScaler
# Imputer was removed from sklearn.preprocessing in later scikit-learn releases, so SimpleImputer is used instead
from sklearn.preprocessing import  MinMaxScaler
from sklearn.impute import SimpleImputer
Imputer = SimpleImputer(strategy='median')
# Machine learning algorithms
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
# Hyperparameter tuning tools
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import warnings
warnings.filterwarnings("ignore")
# Read in data into dataframes
train_features = pd.read_csv('data/training_features.csv')
test_features = pd.read_csv('data/testing_features.csv')
train_labels = pd.read_csv('data/training_labels.csv')
test_labels = pd.read_csv('data/testing_labels.csv')
# Display sizes of data
print('Training Feature Size: ', train_features.shape)
print('Testing Feature Size:  ', test_features.shape)
print('Training Labels Size:  ', train_labels.shape)
print('Testing Labels Size:   ', test_labels.shape)

5.1 Imputing Missing Values


# The original code was:
# imputer = Imputer(strategy='median')
# The data contains outliers, so the median is a better fill value than the mean
imputer = SimpleImputer(strategy='median')
# Fit on the training features only
imputer.fit(train_features)
# Fill the training set with the training medians
X = imputer.transform(train_features)
# Fill the test set with the same training medians
X_test = imputer.transform(test_features)
# Check whether any missing values remain in the training and test features
# np.isnan flags missing numeric values
print('Missing values in training features: ', np.sum(np.isnan(X)))  # 0 means the imputation is complete
print('Missing values in testing features:  ', np.sum(np.isnan(X_test)))
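
Why the median is preferred when outliers are present, and how the training medians get reused on the test set, can be seen in a tiny sketch. The arrays are invented for illustration; only `SimpleImputer` itself comes from the code above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data: column 0 contains an outlier (100), column 1 a missing value
train = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [100.0, 30.0]])
# Hypothetical test row with both values missing
test = np.array([[np.nan, np.nan]])

imp = SimpleImputer(strategy="median")
imp.fit(train)                       # learns one median per column from the training data

train_filled = imp.transform(train)
test_filled = imp.transform(test)    # filled with the training medians, not test statistics

print(imp.statistics_)
print(test_filled)
```

With strategy='mean' the first column would be filled with about 34.3, pulled up by the single outlier, whereas the median stays at 2.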

5.2 Feature Scaling (Normalization)


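# feature_range=(0, 1): scale every feature into the range 0 to 1
scaler = MinMaxScaler(feature_range=(0, 1))
# Fit on the training data only
scaler.fit(X)
# Transform the training data into the range (0, 1)
X = scaler.transform(X)
X_test = scaler.transform(X_test)  # test data, scaled with the training minimum and maximum
# The labels were read as single-column DataFrames
# reshape((-1,)) flattens them into one-dimensional arrays
y = np.array(train_labels).reshape((-1,))
y_test = np.array(test_labels).reshape((-1, ))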
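
A short sketch of what min-max scaling does, on invented one-column data (the values are assumptions for illustration). Each value is mapped to (value - min) / (max - min) using the minimum and maximum seen during fit, so test values outside the training range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test values for a single feature
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

toy_scaler = MinMaxScaler(feature_range=(0, 1))
toy_scaler.fit(train)                       # learns min=0 and max=10 from the training data

scaled_train = toy_scaler.transform(train)
scaled_test = toy_scaler.transform(test)    # 20 exceeds the training maximum

print(scaled_train.ravel())
print(scaled_test.ravel())
```

Fitting on the training set alone keeps information about the test set from leaking into preprocessing, which is why both the imputer and the scaler above are fitted on the training data and then applied to the test data.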

6 Building Baseline Models and Trying Several Algorithms (Regression)

6.1 Defining the Loss Function


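# The loss function used here is MAE; abs() takes the absolute value
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
# Train a model on the training set and return its MAE on the test set
def fit_and_evaluate(model):
    # Train the model
    model.fit(X, y)
    # Predict on the test data and compute the error
    model_pred = model.predict(X_test)
    model_mae = mae(y_test, model_pred)
    return model_mae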

6.2 Choosing Machine Learning Algorithms

lr = LinearRegression()  # linear regression
lr_mae = fit_and_evaluate(lr)
print('Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae)
svm = SVR(C = 1000, gamma = 0.1)  # support vector machine
svm_mae = fit_and_evaluate(svm)
print('Support Vector Machine Regression Performance on the test set: MAE = %0.4f' % svm_mae)
random_forest = RandomForestRegressor(random_state=60)  # random forest, an ensemble method
random_forest_mae = fit_and_evaluate(random_forest)
print('Random Forest Regression Performance on the test set: MAE = %0.4f' % random_forest_mae)
gradient_boosted = GradientBoostingRegressor(random_state=60)  # gradient boosted trees
gradient_boosted_mae = fit_and_evaluate(gradient_boosted)
print('Gradient Boosted Regression Performance on the test set: MAE = %0.4f' % gradient_boosted_mae)
knn = KNeighborsRegressor(n_neighbors=10)  # k-nearest neighbors
knn_mae = fit_and_evaluate(knn)
print('K-Nearest Neighbors Regression Performance on the test set: MAE = %0.4f' % knn_mae)
plt.style.use('fivethirtyeight')
figsize(8, 6)
model_comparison = pd.DataFrame({'model': ['Linear Regression', 'Support Vector Machine',
                                           'Random Forest', 'Gradient Boosted',
                                           'K-Nearest Neighbors'],
                                 'mae': [lr_mae, svm_mae, random_forest_mae, gradient_boosted_mae, knn_mae]})
# ascending = False sorts from the largest to the smallest MAE; kind = 'barh' draws a horizontal bar chart
model_comparison.sort_values('mae', ascending = False).plot(x = 'model', y = 'mae', kind = 'barh',
                                                           color = 'red', edgecolor = 'black')
# y-axis: model names, x-axis: MAE; yticks and xticks set the tick label size
plt.ylabel(''); plt.yticks(size = 14); plt.xlabel('Mean Absolute Error'); plt.xticks(size = 14)
plt.title('Model Comparison on Test MAE', size = 20);


7 Hyperparameter Tuning

7.1 Tuning


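# Loss function to optimize
# scikit-learn 1.0 renamed 'ls' to 'squared_error' and 'lad' to 'absolute_error'; the old names were removed in 1.2
loss = ['ls', 'lad', 'huber']
# Number of weak learners (decision trees)
n_estimators = [100, 500, 900, 1100, 1500]
# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]
# Minimum number of samples required at a leaf node
min_samples_leaf = [1, 2, 4, 6, 8]
# Minimum number of samples required to split a node
min_samples_split = [2, 4, 6, 10]
hyperparameter_grid = {'loss': loss,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split}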
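
To see why a random search is used rather than an exhaustive one, the size of this grid can be counted. The sketch below repeats the grid values from above (the name `grid` is local to this example) and uses scikit-learn's `ParameterGrid` and `ParameterSampler` only to enumerate and sample combinations; no model is trained and the loss strings are not validated here.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same candidate values as in the grid above
grid = {
    'loss': ['ls', 'lad', 'huber'],
    'n_estimators': [100, 500, 900, 1100, 1500],
    'max_depth': [2, 3, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'min_samples_split': [2, 4, 6, 10],
}

# Every possible combination: 3 * 5 * 5 * 5 * 4
n_total = len(ParameterGrid(grid))

# A random search with n_iter=25 only tries 25 of them
sampled = list(ParameterSampler(grid, n_iter=25, random_state=42))

print(n_total, len(sampled))
print(sampled[0])
```

With 4-fold cross-validation, 25 sampled combinations mean 100 model fits, compared with 6000 fits for the full grid.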


model = GradientBoostingRegressor(random_state = 42)
random_cv = RandomizedSearchCV(estimator=model,
                               param_distributions=hyperparameter_grid,
                               cv=4, n_iter=25,
                               scoring = 'neg_mean_absolute_error',  # metric used to pick the best result
                               n_jobs = -1, verbose = 1,
                               return_train_score = True,
                               random_state=42)
# Note: this is very slow and takes about 14 minutes
random_cv.fit(X, y)
help(GradientBoostingRegressor)
RandomizedSearchCV(cv=4, error_score='raise-deprecating',estimator=GradientBoostingRegressor(alpha=0.9,criterion='friedman_mse',init=None,learning_rate=0.1,loss='ls', max_depth=3,max_features=None,max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,min_samples_leaf=1,min_samples_split=2,min_weight_fraction_leaf=0.0,n_estimators=100,verbose=0,warm_start=False),iid='warn', n_iter=25, n_jobs=-1,param_distributions={'loss': ['ls', 'lad', 'huber'],'max_depth': [2, 3, 5, 10, 15],'min_samples_leaf': [1, 2, 4, 6, 8],'min_samples_split': [2, 4, 6, 10],'n_estimators': [100, 500, 900, 1100,1500]},pre_dispatch='2*n_jobs', random_state=42, refit=True,return_train_score=True, scoring='neg_mean_absolute_error',verbose=1)
random_cv.best_estimator_  # the best estimator found by the search
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-187-c5a12878b76a> in <module>
----> 1 random_cv.best_estimator_  # the best estimator found by the search
NameError: name 'random_cv' is not defined
# Candidate numbers of trees
trees_grid = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800]}
# Build the model
# lad: least absolute deviation
model = GradientBoostingRegressor(loss = 'lad', max_depth = 5,
                                  min_samples_leaf = 6,
                                  min_samples_split = 6,
                                  max_features = None,
                                  random_state = 42)
# Grid search over the number of trees
grid_search = GridSearchCV(estimator = model, param_grid=trees_grid, cv = 4,
                           scoring = 'neg_mean_absolute_error', verbose = 1,
                           n_jobs = -1, return_train_score = True)
# Takes about 3 minutes
grid_search.fit(X, y)
GridSearchCV(cv=4, error_score='raise-deprecating',estimator=GradientBoostingRegressor(alpha=0.9,criterion='friedman_mse',init=None, learning_rate=0.1,loss='lad', max_depth=5,max_features=None,max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,min_samples_leaf=6,min_samples_split=6,min_weight_fraction_leaf=0.0,n_estimators=100,n_iter_no_change=None,presort='auto',random_state=42, subsample=1.0,tol=0.0001,validation_fraction=0.1,verbose=0, warm_start=False),iid='warn', n_jobs=-1,param_grid={'n_estimators': [100, 150, 200, 250, 300, 350, 400,450, 500, 550, 600, 650, 700, 750,800]},pre_dispatch='2*n_jobs', refit=True, return_train_score=True,scoring='neg_mean_absolute_error', verbose=1)

7.2 Comparing the Losses

# Put the search results into a DataFrame
results = pd.DataFrame(grid_search.cv_results_)
# Plot the training and testing error against the number of trees
figsize(8, 8)
plt.style.use('fivethirtyeight')
plt.plot(results['param_n_estimators'], -1 * results['mean_test_score'], label = 'Testing Error')
plt.plot(results['param_n_estimators'], -1 * results['mean_train_score'], label = 'Training Error')
# x-axis: number of trees, y-axis: mean absolute error
plt.xlabel('Number of Trees'); plt.ylabel('Mean Absolute Error'); plt.legend();
plt.title('Performance vs Number of Trees');
# Overfitting: the blue curve stays flat while the red curve is steeper, and the gap between them keeps widening

8 Evaluation and Testing: Differences Between Predictions and True Values

# Fit a model with default settings for comparison
default_model = GradientBoostingRegressor(random_state = 42)
default_model.fit(X,y)
# Take the best estimator from the grid search
final_model = grid_search.best_estimator_
final_model
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)
print('Default model performance on the test set: MAE = %0.4f.' % mae(y_test, default_pred))
print('Final model performance on the test set:   MAE = %0.4f.' % mae(y_test, final_pred))
# Call figsize; assigning a tuple to it would overwrite the function
figsize(6, 6)
# Residual = prediction - true value; most residuals fall within about ±25
residuals = final_pred - y_test
plt.hist(residuals, color = 'red', bins = 20, edgecolor = 'black')
plt.xlabel('Error'); plt.ylabel('Count')
plt.title('Distribution of Residuals');


9 Interpreting the Model: Feature Selection Based on Importance

import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', 60)
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 24
from IPython.core.pylabtools import figsize
import seaborn as sns
sns.set(font_scale = 2)
# The original code was:
# from sklearn.preprocessing import Imputer, MinMaxScaler
# The imports below are the author's replacement
from sklearn.preprocessing import  MinMaxScaler
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree
import warnings
warnings.filterwarnings("ignore")


# Replace missing values with the median
# The original code was:
# imputer = Imputer(strategy='median')
imputer = SimpleImputer(strategy='median')
# Fit on the training features
imputer.fit(train_features)
X = imputer.transform(train_features)
# The test set is filled with the medians learned from the training set
X_test = imputer.transform(test_features)
y = np.array(train_labels).reshape((-1,))
y_test = np.array(test_labels).reshape((-1,))
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
model = GradientBoostingRegressor(loss='lad', max_depth=5, max_features=None,
                                  min_samples_leaf=6, min_samples_split=6,
                                  n_estimators=800, random_state=42)
model.fit(X, y)
# Use the GBDT model as the final model
model_pred = model.predict(X_test)
print('Final Model Performance on the test set: MAE = %0.4f' % mae(y_test, model_pred))
# Feature importances
feature_results = pd.DataFrame({'feature': list(train_features.columns),  # all training features
                                'importance': model.feature_importances_})
# Sort in descending order and show the 10 most important features
feature_results = feature_results.sort_values('importance', ascending = False).reset_index(drop=True)
feature_results.head(10)
figsize(12, 10)
plt.style.use('fivethirtyeight')
# Plot the 10 most important features
feature_results.loc[:9, :].plot(x = 'feature', y = 'importance', edgecolor = 'k',
                                kind='barh', color = 'blue');  # barh: horizontal bar chart
plt.xlabel('Relative Importance', size = 20); plt.ylabel('')
# The importances come from the gradient boosting model fitted above
plt.title('Feature Importances from Gradient Boosting', size = 30);
most_important_features = feature_results['feature'][:10]  # names of the top 10 features
# indices: column positions of those 10 features in train_features
indices = [list(train_features.columns).index(x) for x in most_important_features]
X_reduced = X[:, indices]
X_test_reduced = X_test[:, indices]
print('Most important training features shape: ', X_reduced.shape)
print('Most important testing  features shape: ', X_test_reduced.shape)
# Compare linear regression on all features with linear regression on the top 10
lr = LinearRegression()
lr.fit(X, y)
lr_full_pred = lr.predict(X_test)
lr.fit(X_reduced, y)
lr_reduced_pred = lr.predict(X_test_reduced)
print('Linear Regression Full Results: MAE =    %0.4f.' % mae(y_test, lr_full_pred))
print('Linear Regression Reduced Results: MAE = %0.4f.' % mae(y_test, lr_reduced_pred))
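
The selection step can be tried in isolation on synthetic data. In this sketch everything is made up for illustration (`X_toy`, `y_toy`, the choice of 50 trees and of keeping 2 columns): the target depends only on the first column, so that column should receive the highest importance, and the reduced matrix is built by indexing the columns in importance order.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical data: 5 features, but only feature 0 drives the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = 3.0 * X_toy[:, 0] + 0.1 * rng.normal(size=300)

gbr = GradientBoostingRegressor(n_estimators=50, random_state=0)
gbr.fit(X_toy, y_toy)

# Column positions sorted from most to least important
order = np.argsort(gbr.feature_importances_)[::-1]
top2 = order[:2]

# Keep only the two most important columns
X_reduced_toy = X_toy[:, top2]

print(order[0], X_reduced_toy.shape)
```

Importances from tree ensembles are relative and sum to 1, so they rank features within one model rather than measure an absolute effect.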

