Code Analysis: Predicting New York City Building Energy Scores
Project: Predicting New York City Building Energy Scores
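Contents
0 Introduction
1 Data cleaning and formatting
1.1 Data overview
1.2 Importing the basic packages
1.3 Data analysis
1.4 Data types and missing values
1.5 Missing-value handling template
2 Exploratory Data Analysis
2.1 Single-variable plots
2.2 Removing outliers
2.3 Finding which variables affect the result
3 Feature engineering
3.1 Feature transformation
3.2 Two-variable plots
3.3 Removing collinear features
4 Splitting the dataset
4.1 Splitting the data
4.2 Establishing a baseline
4.3 Saving the results for modeling
5 Building basic models and trying several algorithms
5.1 Imputing missing values
5.2 Feature scaling and normalization
6 Building basic models and trying several algorithms (regression)
6.1 Defining the loss function
6.2 Choosing machine learning algorithms
7 Model tuning
7.1 Hyperparameter tuning
7.2 Comparing loss functions
8 Evaluation and testing: plotting the difference between predictions and true values
9 Interpreting the model: feature selection based on importance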
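Main text:
0 Introduction
This article walks through a complete machine learning project on a real dataset, so that readers can see how all the pieces fit together.
Before writing any code, we need to understand the problem we are trying to solve and the data available. In this project we use publicly available building energy data for New York City. The goal is to use the energy data to build a model that predicts a building's Energy Star Score, and to interpret the results to find the factors that influence the score.
The data includes the Energy Star Score, which makes this a supervised regression machine learning task. Supervised: we have access to both the features and the target, and our goal is to train a model that learns the mapping between the two. Regression: the Energy Star Score is treated as a continuous variable. We want to develop an accurate model whose predicted Energy Star Score is close to the true value.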
1 Data cleaning and formatting
1.1 Data overview
1.2 Importing the basic packages
import pandas as pd
import numpy as np
# Silence pandas' chained-assignment warning (SettingWithCopyWarning)
pd.options.mode.chained_assignment = None
# head() is used often; display up to 60 columns
pd.set_option('display.max_columns', 60)
import matplotlib.pyplot as plt
# %matplotlib inline works in IPython front ends such as Jupyter Notebook or Jupyter QtConsole. It embeds the plots in the page, so plt.show() can be omitted.
%matplotlib inline
# pyplot uses rc configuration (rc parameters) to customize the default properties of figures. The rc parameters cover figure size, dots per inch, line width, color, style, axes, ticks and grid properties, text, fonts, and so on.
# The rc parameters are stored in a dictionary-like variable and are accessed like a dictionary
# Global plot setting: font size
plt.rcParams['font.size'] = 24
from IPython.core.pylabtools import figsize
# seaborn, a plotting library built on matplotlib
import seaborn as sns
sns.set(font_scale = 2)
from sklearn.model_selection import train_test_split
# Ignore warning messages raised by the code
import warnings
warnings.filterwarnings("ignore")
1.3 Data analysis
# Load the data
data = pd.read_csv('data/Energy.csv')
# Show the first 3 rows
data.head(3)
Order | Property Id | Property Name | Parent Property Id | Parent Property Name | BBL - 10 digits | NYC Borough, Block and Lot (BBL) self-reported | NYC Building Identification Number (BIN) | Address 1 (self-reported) | Address 2 | Postal Code | Street Number | Street Name | Borough | DOF Gross Floor Area | Primary Property Type - Self Selected | List of All Property Use Types at Property | Largest Property Use Type | Largest Property Use Type - Gross Floor Area (ft²) | 2nd Largest Property Use Type | 2nd Largest Property Use - Gross Floor Area (ft²) | 3rd Largest Property Use Type | 3rd Largest Property Use Type - Gross Floor Area (ft²) | Year Built | Number of Buildings - Self-reported | Occupancy | Metered Areas (Energy) | Metered Areas (Water) | ENERGY STAR Score | Site EUI (kBtu/ft²) | Weather Normalized Site EUI (kBtu/ft²) | Weather Normalized Site Electricity Intensity (kWh/ft²) | Weather Normalized Site Natural Gas Intensity (therms/ft²) | Weather Normalized Source EUI (kBtu/ft²) | Fuel Oil #1 Use (kBtu) | Fuel Oil #2 Use (kBtu) | Fuel Oil #4 Use (kBtu) | Fuel Oil #5 & 6 Use (kBtu) | Diesel #2 Use (kBtu) | District Steam Use (kBtu) | Natural Gas Use (kBtu) | Weather Normalized Site Natural Gas Use (therms) | Electricity Use - Grid Purchase (kBtu) | Weather Normalized Site Electricity (kWh) | Total GHG Emissions (Metric Tons CO2e) | Direct GHG Emissions (Metric Tons CO2e) | Indirect GHG Emissions (Metric Tons CO2e) | Property GFA - Self-Reported (ft²) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft²) | Source EUI (kBtu/ft²) | Release Date | Water Required? | DOF Benchmarking Submission Status | Unnamed: 54 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 13286 | 201/205 | 13286 | 201/205 | 1013160001 | 1013160001 | 1037549 | 201/205 East 42nd st. | Not Available | 10017 | 675 | 3 AVENUE | Manhattan | 289356.0 | Office | Office | Office | 293447 | Not Available | Not Available | Not Available | Not Available | 1963 | 2 | 100 | Whole Building | Not Available | Not Available | 305.6 | 303.1 | 37.8 | Not Available | 614.2 | Not Available | Not Available | Not Available | Not Available | Not Available | 51550675.1 | Not Available | Not Available | 38139374.2 | 11082770.5 | 6962.2 | 0 | 6962.2 | 762051 | Not Available | Not Available | 619.4 | 5/1/17 5:32 PM | No | In Compliance | NaN |
1 | 2 | 28400 | NYP Columbia (West Campus) | 28400 | NYP Columbia (West Campus) | 1021380040 | 1-02138-0040 | 1084198; 1084387;1084385; 1084386; 1084388; 10... | 622 168th Street | Not Available | 10032 | 180 | FT WASHINGTON AVENUE | Manhattan | 3693539.0 | Hospital (General Medical & Surgical) | Hospital (General Medical & Surgical) | Hospital (General Medical & Surgical) | 3889181 | Not Available | Not Available | Not Available | Not Available | 1969 | 12 | 100 | Whole Building | Whole Building | 55 | 229.8 | 228.8 | 24.8 | 2.4 | 401.1 | Not Available | 19624847.2 | Not Available | Not Available | Not Available | -391414802.6 | 933073441 | 9330734.4 | 332365924 | 96261312.1 | 55870.4 | 51016.4 | 4854.1 | 3889181 | Not Available | Not Available | 404.3 | 4/27/17 11:23 AM | No | In Compliance | NaN |
2 | 3 | 4778226 | MSCHoNY North | 28400 | NYP Columbia (West Campus) | 1021380030 | 1-02138-0030 | 1063380 | 3975 Broadway | Not Available | 10032 | 3975 | BROADWAY | Manhattan | 152765.0 | Hospital (General Medical & Surgical) | Hospital (General Medical & Surgical) | Hospital (General Medical & Surgical) | 231342 | Not Available | Not Available | Not Available | Not Available | 1924 | 1 | 100 | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | 0 | 0 | 0 | 231342 | Not Available | Not Available | Not Available | 4/27/17 11:23 AM | No | In Compliance | NaN |
print((np.array(data)).shape)
(11746, 55)
# Pass n in the parentheses to see the first n rows of the data
data.head(2)
Order | Property Id | Property Name | Parent Property Id | Parent Property Name | BBL - 10 digits | NYC Borough, Block and Lot (BBL) self-reported | NYC Building Identification Number (BIN) | Address 1 (self-reported) | Address 2 | Postal Code | Street Number | Street Name | Borough | DOF Gross Floor Area | Primary Property Type - Self Selected | List of All Property Use Types at Property | Largest Property Use Type | Largest Property Use Type - Gross Floor Area (ft²) | 2nd Largest Property Use Type | 2nd Largest Property Use - Gross Floor Area (ft²) | 3rd Largest Property Use Type | 3rd Largest Property Use Type - Gross Floor Area (ft²) | Year Built | Number of Buildings - Self-reported | Occupancy | Metered Areas (Energy) | Metered Areas (Water) | ENERGY STAR Score | Site EUI (kBtu/ft²) | Weather Normalized Site EUI (kBtu/ft²) | Weather Normalized Site Electricity Intensity (kWh/ft²) | Weather Normalized Site Natural Gas Intensity (therms/ft²) | Weather Normalized Source EUI (kBtu/ft²) | Fuel Oil #1 Use (kBtu) | Fuel Oil #2 Use (kBtu) | Fuel Oil #4 Use (kBtu) | Fuel Oil #5 & 6 Use (kBtu) | Diesel #2 Use (kBtu) | District Steam Use (kBtu) | Natural Gas Use (kBtu) | Weather Normalized Site Natural Gas Use (therms) | Electricity Use - Grid Purchase (kBtu) | Weather Normalized Site Electricity (kWh) | Total GHG Emissions (Metric Tons CO2e) | Direct GHG Emissions (Metric Tons CO2e) | Indirect GHG Emissions (Metric Tons CO2e) | Property GFA - Self-Reported (ft²) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft²) | Source EUI (kBtu/ft²) | Release Date | Water Required? | DOF Benchmarking Submission Status | Unnamed: 54 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 13286 | 201/205 | 13286 | 201/205 | 1013160001 | 1013160001 | 1037549 | 201/205 East 42nd st. | Not Available | 10017 | 675 | 3 AVENUE | Manhattan | 289356.0 | Office | Office | Office | 293447 | Not Available | Not Available | Not Available | Not Available | 1963 | 2 | 100 | Whole Building | Not Available | Not Available | 305.6 | 303.1 | 37.8 | Not Available | 614.2 | Not Available | Not Available | Not Available | Not Available | Not Available | 51550675.1 | Not Available | Not Available | 38139374.2 | 11082770.5 | 6962.2 | 0 | 6962.2 | 762051 | Not Available | Not Available | 619.4 | 5/1/17 5:32 PM | No | In Compliance | NaN |
1 | 2 | 28400 | NYP Columbia (West Campus) | 28400 | NYP Columbia (West Campus) | 1021380040 | 1-02138-0040 | 1084198; 1084387;1084385; 1084386; 1084388; 10... | 622 168th Street | Not Available | 10032 | 180 | FT WASHINGTON AVENUE | Manhattan | 3693539.0 | Hospital (General Medical & Surgical) | Hospital (General Medical & Surgical) | Hospital (General Medical & Surgical) | 3889181 | Not Available | Not Available | Not Available | Not Available | 1969 | 12 | 100 | Whole Building | Whole Building | 55 | 229.8 | 228.8 | 24.8 | 2.4 | 401.1 | Not Available | 19624847.2 | Not Available | Not Available | Not Available | -391414802.6 | 933073441 | 9330734.4 | 332365924 | 96261312.1 | 55870.4 | 51016.4 | 4854.1 | 3889181 | Not Available | Not Available | 404.3 | 4/27/17 11:23 AM | No | In Compliance | NaN |
1.4 Data types and missing values
data.info() # a quick way to see each column's data type and non-null count
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11746 entries, 0 to 11745
Data columns (total 55 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   Order   11746 non-null   int64
 1   Property Id   11746 non-null   int64
 2   Property Name   11746 non-null   object
 3   Parent Property Id   11746 non-null   object
 4   Parent Property Name   11746 non-null   object
 5   BBL - 10 digits   11746 non-null   object
 6   NYC Borough, Block and Lot (BBL) self-reported   11746 non-null   object
 7   NYC Building Identification Number (BIN)   11746 non-null   object
 8   Address 1 (self-reported)   11746 non-null   object
 9   Address 2   11746 non-null   object
 10  Postal Code   11746 non-null   object
 11  Street Number   11622 non-null   object
 12  Street Name   11624 non-null   object
 13  Borough   11628 non-null   object
 14  DOF Gross Floor Area   11628 non-null   float64
 15  Primary Property Type - Self Selected   11746 non-null   object
 16  List of All Property Use Types at Property   11746 non-null   object
 17  Largest Property Use Type   11746 non-null   object
 18  Largest Property Use Type - Gross Floor Area (ft²)   11746 non-null   object
 19  2nd Largest Property Use Type   11746 non-null   object
 20  2nd Largest Property Use - Gross Floor Area (ft²)   11746 non-null   object
 21  3rd Largest Property Use Type   11746 non-null   object
 22  3rd Largest Property Use Type - Gross Floor Area (ft²)   11746 non-null   object
 23  Year Built   11746 non-null   int64
 24  Number of Buildings - Self-reported   11746 non-null   int64
 25  Occupancy   11746 non-null   int64
 26  Metered Areas (Energy)   11746 non-null   object
 27  Metered Areas (Water)   11746 non-null   object
 28  ENERGY STAR Score   11746 non-null   object
 29  Site EUI (kBtu/ft²)   11746 non-null   object
 30  Weather Normalized Site EUI (kBtu/ft²)   11746 non-null   object
 31  Weather Normalized Site Electricity Intensity (kWh/ft²)   11746 non-null   object
 32  Weather Normalized Site Natural Gas Intensity (therms/ft²)   11746 non-null   object
 33  Weather Normalized Source EUI (kBtu/ft²)   11746 non-null   object
 34  Fuel Oil #1 Use (kBtu)   11746 non-null   object
 35  Fuel Oil #2 Use (kBtu)   11746 non-null   object
 36  Fuel Oil #4 Use (kBtu)   11746 non-null   object
 37  Fuel Oil #5 & 6 Use (kBtu)   11746 non-null   object
 38  Diesel #2 Use (kBtu)   11746 non-null   object
 39  District Steam Use (kBtu)   11746 non-null   object
 40  Natural Gas Use (kBtu)   11746 non-null   object
 41  Weather Normalized Site Natural Gas Use (therms)   11746 non-null   object
 42  Electricity Use - Grid Purchase (kBtu)   11746 non-null   object
 43  Weather Normalized Site Electricity (kWh)   11746 non-null   object
 44  Total GHG Emissions (Metric Tons CO2e)   11746 non-null   object
 45  Direct GHG Emissions (Metric Tons CO2e)   11746 non-null   object
 46  Indirect GHG Emissions (Metric Tons CO2e)   11746 non-null   object
 47  Property GFA - Self-Reported (ft²)   11746 non-null   int64
 48  Water Use (All Water Sources) (kgal)   11746 non-null   object
 49  Water Intensity (All Water Sources) (gal/ft²)   11746 non-null   object
 50  Source EUI (kBtu/ft²)   11746 non-null   object
 51  Release Date   11746 non-null   object
 52  Water Required?   11628 non-null   object
 53  DOF Benchmarking Submission Status   11716 non-null   object
 54  Unnamed: 54   0 non-null   float64
dtypes: float64(2), int64(6), object(47)
memory usage: 4.9+ MB
1.5 Missing-value handling template
# Convert the missing-value marker 'Not Available' to np.nan
# DataFrame.replace() swaps every cell equal to the old value for the new value
data = data.replace({'Not Available': np.nan})
# In the raw data, columns such as those ending in 'ft²' hold numeric values, but because of the 'Not Available' strings info() reports them as object
# Columns such as kBtu/ft² should be float, so cast every column whose name contains ft², kBtu, Metric Tons CO2e, kWh, therms, gal or Score
for col in list(data.columns):
    # Measurement columns were read as object; cast them to float
    if ('ft²' in col or 'kBtu' in col or 'Metric Tons CO2e' in col or 'kWh' in col or 'therms' in col or 'gal' in col or 'Score' in col):
        data[col] = data[col].astype(float)
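The conversion above can be tried on a small frame before running it on the full file. The sketch below is only an illustration: the toy values are made up, and it uses `pd.to_numeric(..., errors="coerce")` as a swapped-in alternative to `astype(float)`, which turns any leftover non-numeric text into NaN instead of raising.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) that mimics the raw file: numbers stored as text.
toy = pd.DataFrame({
    "Site EUI (kBtu/ft²)": ["305.6", "Not Available", "229.8"],
    "Borough": ["Manhattan", "Not Available", "Queens"],
})

# Turn the placeholder string into a real missing value.
toy = toy.replace({"Not Available": np.nan})

# Convert only the columns whose names carry a numeric unit.
unit_markers = ("ft²", "kBtu", "Metric Tons CO2e", "kWh", "therms", "gal", "Score")
for col in toy.columns:
    if any(marker in col for marker in unit_markers):
        # errors="coerce" maps unparseable text to NaN instead of raising.
        toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

Whether to coerce or to fail loudly is a judgment call: `astype(float)` is stricter and exposes unexpected strings early, while coercion keeps a long pipeline running.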
print(list(data.columns))
['Order', 'Property Id', 'Property Name', 'Parent Property Id', 'Parent Property Name', 'BBL - 10 digits', 'NYC Borough, Block and Lot (BBL) self-reported', 'NYC Building Identification Number (BIN)', 'Address 1 (self-reported)', 'Address 2', 'Postal Code', 'Street Number', 'Street Name', 'Borough', 'DOF Gross Floor Area', 'Primary Property Type - Self Selected', 'List of All Property Use Types at Property', 'Largest Property Use Type', 'Largest Property Use Type - Gross Floor Area (ft²)', '2nd Largest Property Use Type', '2nd Largest Property Use - Gross Floor Area (ft²)', '3rd Largest Property Use Type', '3rd Largest Property Use Type - Gross Floor Area (ft²)', 'Year Built', 'Number of Buildings - Self-reported', 'Occupancy', 'Metered Areas (Energy)', 'Metered Areas (Water)', 'ENERGY STAR Score', 'Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)', 'Weather Normalized Site Electricity Intensity (kWh/ft²)', 'Weather Normalized Site Natural Gas Intensity (therms/ft²)', 'Weather Normalized Source EUI (kBtu/ft²)', 'Fuel Oil #1 Use (kBtu)', 'Fuel Oil #2 Use (kBtu)', 'Fuel Oil #4 Use (kBtu)', 'Fuel Oil #5 & 6 Use (kBtu)', 'Diesel #2 Use (kBtu)', 'District Steam Use (kBtu)', 'Natural Gas Use (kBtu)', 'Weather Normalized Site Natural Gas Use (therms)', 'Electricity Use - Grid Purchase (kBtu)', 'Weather Normalized Site Electricity (kWh)', 'Total GHG Emissions (Metric Tons CO2e)', 'Direct GHG Emissions (Metric Tons CO2e)', 'Indirect GHG Emissions (Metric Tons CO2e)', 'Property GFA - Self-Reported (ft²)', 'Water Use (All Water Sources) (kgal)', 'Water Intensity (All Water Sources) (gal/ft²)', 'Source EUI (kBtu/ft²)', 'Release Date', 'Water Required?', 'DOF Benchmarking Submission Status', 'Unnamed: 54']
print(data.columns)
Index(['Order', 'Property Id', 'Property Name', 'Parent Property Id','Parent Property Name', 'BBL - 10 digits','NYC Borough, Block and Lot (BBL) self-reported','NYC Building Identification Number (BIN)', 'Address 1 (self-reported)','Address 2', 'Postal Code', 'Street Number', 'Street Name', 'Borough','DOF Gross Floor Area', 'Primary Property Type - Self Selected','List of All Property Use Types at Property','Largest Property Use Type','Largest Property Use Type - Gross Floor Area (ft²)','2nd Largest Property Use Type','2nd Largest Property Use - Gross Floor Area (ft²)','3rd Largest Property Use Type','3rd Largest Property Use Type - Gross Floor Area (ft²)', 'Year Built','Number of Buildings - Self-reported', 'Occupancy','Metered Areas (Energy)', 'Metered Areas (Water)', 'ENERGY STAR Score','Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)','Weather Normalized Site Electricity Intensity (kWh/ft²)','Weather Normalized Site Natural Gas Intensity (therms/ft²)','Weather Normalized Source EUI (kBtu/ft²)', 'Fuel Oil #1 Use (kBtu)','Fuel Oil #2 Use (kBtu)', 'Fuel Oil #4 Use (kBtu)','Fuel Oil #5 & 6 Use (kBtu)', 'Diesel #2 Use (kBtu)','District Steam Use (kBtu)', 'Natural Gas Use (kBtu)','Weather Normalized Site Natural Gas Use (therms)','Electricity Use - Grid Purchase (kBtu)','Weather Normalized Site Electricity (kWh)','Total GHG Emissions (Metric Tons CO2e)','Direct GHG Emissions (Metric Tons CO2e)','Indirect GHG Emissions (Metric Tons CO2e)','Property GFA - Self-Reported (ft²)','Water Use (All Water Sources) (kgal)','Water Intensity (All Water Sources) (gal/ft²)','Source EUI (kBtu/ft²)', 'Release Date', 'Water Required?','DOF Benchmarking Submission Status', 'Unnamed: 54'],dtype='object')
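# describe() reports count, mean, std, etc. only for numeric columns; object columns are not shown
data.describe()
# 3.20e+05 = 3.20 x 10^5 = 3.20 x 100000 = 320000
# Scientific notation can be abbreviated with an "E": the number before "E" is the mantissa and the number after "E+" is the power of ten, padded to at least two digits. For example, 7.8 times 10 to the 7th power, normally written 7.8 x 10^7, is abbreviated "7.8E+07"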
Order | Property Id | DOF Gross Floor Area | Largest Property Use Type - Gross Floor Area (ft²) | 2nd Largest Property Use - Gross Floor Area (ft²) | 3rd Largest Property Use Type - Gross Floor Area (ft²) | Year Built | Number of Buildings - Self-reported | Occupancy | ENERGY STAR Score | Site EUI (kBtu/ft²) | Weather Normalized Site EUI (kBtu/ft²) | Weather Normalized Site Electricity Intensity (kWh/ft²) | Weather Normalized Site Natural Gas Intensity (therms/ft²) | Weather Normalized Source EUI (kBtu/ft²) | Fuel Oil #1 Use (kBtu) | Fuel Oil #2 Use (kBtu) | Fuel Oil #4 Use (kBtu) | Fuel Oil #5 & 6 Use (kBtu) | Diesel #2 Use (kBtu) | District Steam Use (kBtu) | Natural Gas Use (kBtu) | Weather Normalized Site Natural Gas Use (therms) | Electricity Use - Grid Purchase (kBtu) | Weather Normalized Site Electricity (kWh) | Total GHG Emissions (Metric Tons CO2e) | Direct GHG Emissions (Metric Tons CO2e) | Indirect GHG Emissions (Metric Tons CO2e) | Property GFA - Self-Reported (ft²) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft²) | Source EUI (kBtu/ft²) | Unnamed: 54 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 11746.000000 | 1.174600e+04 | 1.162800e+04 | 1.174400e+04 | 3741.000000 | 1484.000000 | 11746.000000 | 11746.000000 | 11746.000000 | 9642.000000 | 11583.000000 | 10281.000000 | 10959.000000 | 9783.000000 | 10281.000000 | 9.000000e+00 | 2.581000e+03 | 1.321000e+03 | 5.940000e+02 | 1.600000e+01 | 9.360000e+02 | 1.030400e+04 | 9.784000e+03 | 1.150200e+04 | 1.096000e+04 | 1.167200e+04 | 1.166300e+04 | 1.168100e+04 | 1.174600e+04 | 7.762000e+03 | 7762.000000 | 11583.000000 | 0.0 |
mean | 7185.759578 | 3.642958e+06 | 1.732695e+05 | 1.605524e+05 | 22778.682010 | 12016.825270 | 1948.738379 | 1.289971 | 98.762557 | 59.854594 | 280.071484 | 309.747466 | 11.072643 | 1.901441 | 417.915709 | 3.395398e+06 | 3.186882e+06 | 5.294367e+06 | 2.429105e+06 | 1.193594e+06 | 2.868907e+08 | 5.048543e+07 | 5.364578e+05 | 5.965472e+06 | 1.768752e+06 | 4.553657e+03 | 2.477937e+03 | 2.076339e+03 | 1.673739e+05 | 1.591798e+04 | 136.172432 | 385.908029 | NaN |
std | 4323.859984 | 1.049070e+06 | 3.367055e+05 | 3.095746e+05 | 55094.441422 | 27959.755486 | 30.576386 | 4.017484 | 7.501603 | 29.993586 | 8607.178877 | 9784.731207 | 127.733868 | 97.204587 | 10530.524339 | 2.213237e+06 | 5.497154e+06 | 5.881863e+06 | 4.442946e+06 | 3.558178e+06 | 3.124603e+09 | 3.914717e+09 | 4.022606e+07 | 3.154430e+07 | 9.389154e+06 | 2.041639e+05 | 1.954498e+05 | 5.931295e+04 | 3.189238e+05 | 1.529524e+05 | 1730.726938 | 9312.736225 | NaN |
min | 1.000000 | 7.365000e+03 | 5.002800e+04 | 5.400000e+01 | 0.000000 | 0.000000 | 1600.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.085973e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -4.690797e+08 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -2.313430e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | NaN |
25% | 3428.250000 | 2.747222e+06 | 6.524000e+04 | 6.520100e+04 | 4000.000000 | 1720.750000 | 1927.000000 | 1.000000 | 100.000000 | 37.000000 | 61.800000 | 65.100000 | 3.800000 | 0.100000 | 103.500000 | 1.663594e+06 | 2.550378e+05 | 2.128213e+06 | 0.000000e+00 | 5.698020e+04 | 4.320254e+06 | 1.098251e+06 | 1.176952e+04 | 1.043673e+06 | 3.019974e+05 | 3.287000e+02 | 1.474500e+02 | 9.480000e+01 | 6.699400e+04 | 2.595400e+03 | 27.150000 | 99.400000 | NaN |
50% | 6986.500000 | 3.236404e+06 | 9.313850e+04 | 9.132400e+04 | 8654.000000 | 5000.000000 | 1941.000000 | 1.000000 | 100.000000 | 65.000000 | 78.500000 | 82.500000 | 5.300000 | 0.500000 | 129.400000 | 4.328815e+06 | 1.380138e+06 | 4.312984e+06 | 0.000000e+00 | 2.070020e+05 | 9.931240e+06 | 4.103962e+06 | 4.445525e+04 | 1.855196e+06 | 5.416312e+05 | 5.002500e+02 | 2.726000e+02 | 1.718000e+02 | 9.408000e+04 | 4.692500e+03 | 45.095000 | 124.900000 | NaN |
75% | 11054.500000 | 4.409092e+06 | 1.596140e+05 | 1.532550e+05 | 20000.000000 | 12000.000000 | 1966.000000 | 1.000000 | 100.000000 | 85.000000 | 97.600000 | 102.500000 | 9.200000 | 0.700000 | 167.200000 | 4.938947e+06 | 4.445808e+06 | 6.514520e+06 | 4.293825e+06 | 2.918332e+05 | 2.064497e+07 | 6.855070e+06 | 7.348107e+04 | 4.370302e+06 | 1.284677e+06 | 9.084250e+02 | 4.475000e+02 | 4.249000e+02 | 1.584140e+05 | 8.031875e+03 | 70.805000 | 162.750000 | NaN |
max | 14993.000000 | 5.991312e+06 | 1.354011e+07 | 1.421712e+07 | 962428.000000 | 591640.000000 | 2019.000000 | 161.000000 | 100.000000 | 100.000000 | 869265.000000 | 939329.000000 | 6259.400000 | 9393.000000 | 986366.000000 | 6.275850e+06 | 1.046849e+08 | 7.907464e+07 | 4.410378e+07 | 1.435178e+07 | 7.163518e+10 | 3.942850e+11 | 3.942852e+09 | 1.691763e+09 | 4.958273e+08 | 2.094340e+07 | 2.094340e+07 | 4.764375e+06 | 1.421712e+07 | 6.594604e+06 | 96305.690000 | 912801.100000 | NaN |
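Many cells in the table above are printed in E notation. As a quick check of how to read them, the following minimal sketch converts between the two forms in plain Python; the value 7.8e7 is just an illustrative number.

```python
# 7.8e7 means 7.8 x 10^7.
value = 7.8e7

# Format the float in E notation with two digits after the decimal point.
as_text = f"{value:.2e}"
print(as_text)

# Parse an E-notation string, such as one read off the describe() table, back into a float.
parsed = float("3.20e+05")
print(parsed)
```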
# General-purpose template for summarizing missing values
# Define a function that takes a DataFrame
def missing_values_table(df):
    # isnull() flags missing values; sum them for each column
    mis_val = df.isnull().sum()
    # Percentage of missing values in each column
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Combine the counts and the percentages into one table
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    # Keep only the columns that have missing values (iloc[:,1] != 0) and sort by percentage, largest first (ascending=False)
    mis_val_table_ren_columns = mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:,1] != 0].sort_values('% of Total Values', ascending=False).round(1)
    # Print the total number of columns and the number of columns with missing values
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n" "There are " + str(mis_val_table_ren_columns.shape[0]) + " columns that have missing values.")
    return mis_val_table_ren_columns
missing_values_table(data) # index: column name; first column: number of missing values; second column: percentage missing. There are 55 columns in total, 40 of which have missing values
Your selected dataframe has 55 columns.
There are 40 columns that have missing values.
Missing Values | % of Total Values | |
---|---|---|
Unnamed: 54 | 11746 | 100.0 |
Fuel Oil #1 Use (kBtu) | 11737 | 99.9 |
Diesel #2 Use (kBtu) | 11730 | 99.9 |
Address 2 | 11539 | 98.2 |
Fuel Oil #5 & 6 Use (kBtu) | 11152 | 94.9 |
District Steam Use (kBtu) | 10810 | 92.0 |
Fuel Oil #4 Use (kBtu) | 10425 | 88.8 |
3rd Largest Property Use Type | 10262 | 87.4 |
3rd Largest Property Use Type - Gross Floor Area (ft²) | 10262 | 87.4 |
Fuel Oil #2 Use (kBtu) | 9165 | 78.0 |
2nd Largest Property Use - Gross Floor Area (ft²) | 8005 | 68.2 |
2nd Largest Property Use Type | 8005 | 68.2 |
Metered Areas (Water) | 4609 | 39.2 |
Water Intensity (All Water Sources) (gal/ft²) | 3984 | 33.9 |
Water Use (All Water Sources) (kgal) | 3984 | 33.9 |
ENERGY STAR Score | 2104 | 17.9 |
Weather Normalized Site Natural Gas Intensity (therms/ft²) | 1963 | 16.7 |
Weather Normalized Site Natural Gas Use (therms) | 1962 | 16.7 |
Weather Normalized Source EUI (kBtu/ft²) | 1465 | 12.5 |
Weather Normalized Site EUI (kBtu/ft²) | 1465 | 12.5 |
Natural Gas Use (kBtu) | 1442 | 12.3 |
Weather Normalized Site Electricity Intensity (kWh/ft²) | 787 | 6.7 |
Weather Normalized Site Electricity (kWh) | 786 | 6.7 |
Electricity Use - Grid Purchase (kBtu) | 244 | 2.1 |
Site EUI (kBtu/ft²) | 163 | 1.4 |
Source EUI (kBtu/ft²) | 163 | 1.4 |
NYC Building Identification Number (BIN) | 162 | 1.4 |
Street Number | 124 | 1.1 |
Street Name | 122 | 1.0 |
DOF Gross Floor Area | 118 | 1.0 |
Borough | 118 | 1.0 |
Water Required? | 118 | 1.0 |
Direct GHG Emissions (Metric Tons CO2e) | 83 | 0.7 |
Total GHG Emissions (Metric Tons CO2e) | 74 | 0.6 |
Indirect GHG Emissions (Metric Tons CO2e) | 65 | 0.6 |
Metered Areas (Energy) | 57 | 0.5 |
DOF Benchmarking Submission Status | 30 | 0.3 |
NYC Borough, Block and Lot (BBL) self-reported | 11 | 0.1 |
Largest Property Use Type - Gross Floor Area (ft²) | 2 | 0.0 |
Largest Property Use Type | 2 | 0.0 |
# Use 50% as the threshold
missing_df = missing_values_table(data);
# Collect the columns with more than 50% missing values; they are dropped below
missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)
print('We will remove %d columns.' % len(missing_columns))
# Of the 55 original columns, 40 have missing values; the 12 that are more than 50% missing will be removed
Your selected dataframe has 55 columns.
There are 40 columns that have missing values.
We will remove 12 columns.
# Drop every column with more than 50% missing values
data = data.drop(columns = list(missing_columns))
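The threshold-and-drop step can also be written compactly with `isnull().mean()`, which gives the missing fraction per column directly. This is a minimal sketch on a made-up three-column frame, not the author's template; the names `toy`, `missing_pct`, `to_drop` and `trimmed` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with different amounts of missing data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "b": [1.0, 2.0, np.nan, 4.0],         # 25% missing
    "c": [1.0, 2.0, 3.0, 4.0],            # complete
})

# Mean of a boolean mask is the fraction of True values, i.e. the missing rate.
missing_pct = toy.isnull().mean() * 100

# Columns above the 50% threshold are dropped.
to_drop = list(missing_pct[missing_pct > 50].index)
trimmed = toy.drop(columns=to_drop)

print(to_drop)
print(trimmed.columns.tolist())
```

The 50% cutoff is the one used in this article; the right threshold depends on the problem and on how informative a sparse column is.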
2 Exploratory Data Analysis
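2.1 Single-variable plots
# Set the figure width and height
figsize(8, 8)
# The target y is the energy score from 1 to 100; rename it to score
data = data.rename(columns = {'ENERGY STAR Score': 'score'})
# Pick one of matplotlib's built-in styles; each style gives a different background and look
plt.style.use('fivethirtyeight')
# dropna() filters out missing values
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k'); plt.xlabel('Score'); plt.ylabel('Number of Buildings'); plt.title('Energy Star Score Distribution');
# In the plot, scores of 1 and 100 occur far more often than the other values. The score is based on figures that building owners report themselves,
# so the spikes at 1 and 100 look suspicious. However, our goal is only to predict the score, not to design a better building scoring method! We can note in the report that the scores have a suspicious distribution, but our main focus is on predicting them.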
help(plt.hist)
# Set the figure width and height
figsize(8, 8)
# The target y is the energy score from 1 to 100; rename it to score
data = data.rename(columns = {'ENERGY STAR Score': 'score'})
# Try a different matplotlib style
plt.style.use('dark_background')
# hist draws a histogram
# dropna() removes the rows whose score is missing; the remaining values are the data for the plot
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k'); plt.xlabel('Score'); plt.ylabel('Number of Buildings'); plt.title('Energy Star Score Distribution');
# As before, scores of 1 and 100 are over-represented because the underlying figures are self-reported.
# We note the suspicious distribution in the report and keep the focus on predicting the score.
plt.style.available
print(data.columns)
# Set the figure width and height
figsize(10, 10)
# The target y is the energy score from 1 to 100; rename it to score
data = data.rename(columns = {'ENERGY STAR Score': 'score'})
# Switch back to the fivethirtyeight style
plt.style.use('fivethirtyeight')
# dropna() filters out missing values
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k'); plt.xlabel('Score'); plt.ylabel('Number of Buildings'); plt.title('Energy Star Score Distribution');
# The same plot at a larger size: the spikes at 1 and 100 remain, which again points to self-reported figures.
# Our goal is still only to predict the score, not to design a better scoring method.
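# Site EUI (kBtu/ft²): energy use intensity
figsize(8, 8)
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'black'); # black bar edges
plt.xlabel('Site EUI');
plt.ylabel('Count'); plt.title('Site EUI Distribution');
# This reveals another problem: a few buildings with extremely high values skew the plot heavily, so the outliers must be handled.
# The last value is clearly abnormally large. Outliers arise for many reasons: typos, faulty measuring equipment, wrong units, or they may be legitimate but extreme values
# In other words, some points lie very far from the mean, which signals outliers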
# Site EUI (kBtu/ft²): energy use intensity
figsize(8, 8)
# edgecolor: the color of the bar edges in the histogram
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'red'); # red bar edges this time
plt.xlabel('Site EUI');
plt.ylabel('Count'); plt.title('Site EUI Distribution');
# The picture is the same as above: a few extremely high values skew the plot, so the outliers must be handled.
data['Site EUI (kBtu/ft²)'].describe()
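# The mean is small but the standard deviation is very large, which means some points lie far from the mean: there are outliers (the minimum is 0 and the maximum is 869265)
# dropna() filters out missing values
# sort_values() sorts by value, ascending by default, so tail(10) shows the 10 largest values
# Site EUI is the energy use intensity
# In the output, the left column is the row label and the right column is the value
data['Site EUI (kBtu/ft²)'].dropna().sort_values().tail(10)
# How do we filter the outlier? First look at the row whose Site EUI equals 869265
data.loc[data['Site EUI (kBtu/ft²)'] == 869265, :]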
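# Two other attempts at the same lookup fail, for different reasons:
# .ix was removed in pandas 1.0, so this raises an AttributeError on current versions
# data.ix[data['Site EUI (kBtu/ft²)'] == 869265, :]
# .iloc still exists, but it is position-based and does not accept a boolean Series as a mask
# data.iloc[data['Site EUI (kBtu/ft²)'] == 869265, :]
# Converting the mask to a NumPy array makes the .iloc version work
data.iloc[(data['Site EUI (kBtu/ft²)'] == 869265).to_numpy(), :]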
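The difference between label-based and position-based selection is easier to see on a small frame. The sketch below assumes a made-up `eui` column and a non-default index, and shows that `.loc` takes the boolean Series directly while `.iloc` needs a plain boolean array.

```python
import pandas as pd

# Toy frame (hypothetical values) with a non-default index to make labels visible.
toy = pd.DataFrame({"eui": [61.8, 869265.0, 78.5]}, index=[10, 20, 30])

# Boolean mask marking the extreme row.
mask = toy["eui"] == 869265.0

# .loc aligns the mask on index labels.
by_label = toy.loc[mask, :]

# .iloc works on positions, so pass the mask as a NumPy array.
by_position = toy.iloc[mask.to_numpy(), :]

print(by_label)
```

Both calls return the same single row; `.loc` is usually the more readable choice for mask-based filtering.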
2.2 Removing outliers
# Take the 25% and 75% quantiles from describe()
first_quartile = data['Site EUI (kBtu/ft²)'].describe()['25%']
third_quartile = data['Site EUI (kBtu/ft²)'].describe()['75%']
# Their difference is the interquartile range (IQR)
iqr = third_quartile - first_quartile
# Normal data satisfies Q1 - 3*IQR < EUI < Q3 + 3*IQR; keep those rows and filter out the rest as outliers
# Everything between the two bounds counts as a non-outlier, which is the data we want
data = data[(data['Site EUI (kBtu/ft²)'] > (first_quartile - 3 * iqr)) & (data['Site EUI (kBtu/ft²)'] < (third_quartile + 3 * iqr))]
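To see the rule in action, the following sketch wraps the same Q1 - 3*IQR / Q3 + 3*IQR filter in a small helper and applies it to made-up numbers. The function name `drop_extreme_outliers` and the parameter `k` are illustrative, not part of the original notebook.

```python
import pandas as pd

def drop_extreme_outliers(df, column, k=3.0):
    """Keep rows where Q1 - k*IQR < value < Q3 + k*IQR for the given column."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    keep = (df[column] > q1 - k * iqr) & (df[column] < q3 + k * iqr)
    return df[keep]

# Toy data (hypothetical values): five ordinary readings and one extreme one.
toy = pd.DataFrame({"eui": [60.0, 70.0, 80.0, 90.0, 100.0, 869265.0]})
cleaned = drop_extreme_outliers(toy, "eui")

print(len(toy), len(cleaned))
```

Note that comparisons with NaN are False, so this filter, like the original line, also drops rows whose value is missing.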
# Site EUI (energy use intensity) after removing the outliers; the distribution now looks much closer to normal
figsize(8, 8)
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'black');
plt.xlabel('Site EUI');
plt.ylabel('Count'); plt.title('Site EUI Distribution');
2.3 Finding which variables affect the result
types = data.dropna(subset=['score'])
# Largest Property Use Type: the building's largest use type
# This column has many categories. The author lists four with more than 100 records: Multifamily Housing, Office, Hotel,
# and "Data Center, Non-Refrigerated Warehouse, Office"
types = types['Largest Property Use Type'].value_counts()
# Note: without .index this list holds the counts, not the category names
types = list(types[types.values > 100])
print(types)
types = data.dropna(subset=['score'])
# Same selection on Largest Property Use Type, but .index returns the category names that the plots below need
types = types['Largest Property Use Type'].value_counts()
types = list(types[types.values > 100].index)
print(types)
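The two versions above differ only in `.index`, and the effect is easy to miss. A minimal sketch with made-up categories (and a threshold lowered to 1 to suit the toy data) shows what each form returns.

```python
import pandas as pd

# Toy categories (hypothetical values).
kinds = pd.Series(["Office", "Office", "Office", "Hotel", "Hotel", "Bank"])

# value_counts() returns counts indexed by category, sorted from most to least frequent.
counts = kinds.value_counts()
frequent = counts[counts > 1]

# Iterating over the Series yields the counts...
print(list(frequent))
# ...while the index holds the category names.
print(list(frequent.index))
```

Only the second list can be compared against the `Largest Property Use Type` column when selecting rows.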
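# Look for features whose categories differ clearly, as candidates for selection
# Largest Property Use Type: the building's largest use type
figsize(12, 10)
# b_type is the loop variable; types holds the selected building types
for b_type in types:
    # Rows whose Largest Property Use Type equals the current b_type
    subset = data[data['Largest Property Use Type'] == b_type]
    # Density of the scores in this subset; alpha sets the transparency
    sns.kdeplot(subset['score'].dropna(), label = b_type, alpha = 0.5);
# Draw the legend explicitly so that the labels are shown
plt.legend();
# x-axis: energy score; y-axis: density
plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20);
plt.title('Density Plot of Energy Star Scores by Building Type', size = 28);
# The red and yellow curves differ widely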
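# Inspect the rows selected for each building type
# Largest Property Use Type: the building's largest use type
figsize(12, 10)
# b_type is the loop variable. The author notes that types actually holds 7 elements (7 building types), of which only the first 4 were shown
for b_type in types:
    # Rows whose Largest Property Use Type equals the current b_type
    subset = data[data['Largest Property Use Type'] == b_type]
    print(subset)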
# Inspect only the non-missing scores for each building type
# Largest Property Use Type: the building's largest use type
figsize(12, 10)
for b_type in types:
    # Rows whose Largest Property Use Type equals the current b_type
    subset = data[data['Largest Property Use Type'] == b_type]
    print(subset['score'].dropna())
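# See how the score relates to the borough
boroughs = data.dropna(subset=['score'])
# Count the buildings in each borough
boroughs = boroughs['Borough'].value_counts()
# Keep the boroughs with more than 100 buildings
boroughs = list(boroughs[boroughs.values > 100].index)
print(boroughs)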
# The curves differ little from one another, so the borough is a weak feature
# Borough takes 5 values: Manhattan, Brooklyn, Queens, Bronx, Staten Island
figsize(12, 10)
# Draw one density curve per borough; x-axis: Energy Star Score, y-axis: density
for borough in boroughs:
    subset = data[data['Borough'] == borough]
    sns.kdeplot(subset['score'].dropna(), label = borough);
plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20);
plt.title('Density Plot of Energy Star Scores by Borough', size = 28);
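# corr() returns the matrix of pairwise correlation coefficients between the columns.
# Most features are negatively correlated with the score and only a few are positively correlated.
# Features whose coefficient is close to 0 (for example -0.046605) can be dropped.
# select_dtypes('number') restricts the calculation to the numeric columns.
correlations_data = data.select_dtypes('number').corr()['score'].sort_values()  # ascending order
# The 10 most negative correlations
print(correlations_data.head(10), '\n')
print("---------------------------")
# The 10 most positive correlations
print(correlations_data.tail(10))
When corr() is called without arguments, it uses method='pearson' by default.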
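As a small illustration of what the default Pearson coefficient computes, the sketch below evaluates it by hand and compares it with `DataFrame.corr()`. The `toy` table and the helper `pearson_r` are made up for this example and are not part of the project's data or code.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float((x_centered * y_centered).sum() / denominator)

# Hypothetical values: a higher energy use intensity goes with a lower score
toy = pd.DataFrame({'eui': [50.0, 80.0, 120.0, 200.0, 300.0],
                    'score': [95.0, 80.0, 60.0, 30.0, 10.0]})

manual_r = pearson_r(toy['eui'], toy['score'])
pandas_r = toy.corr()['score']['eui']  # method='pearson' is the default
print(round(manual_r, 4), round(pandas_r, 4))
```

Both numbers should agree, and the negative sign matches the pattern described above, where energy-use features correlate negatively with the score.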
3 Feature Engineering
3.1 Feature Transformations
import warnings
warnings.filterwarnings("ignore")
# Keep only the numeric columns; string and other non-numeric columns are left out
numeric_subset = data.select_dtypes('number')
# Loop over every numeric column
for col in numeric_subset.columns:
    # score is the label (the y of the model) and every other column is a feature (an x),
    # so the score itself is not transformed
    if col == 'score':
        continue
    # For every feature, add its square root and its natural log as new columns
    numeric_subset['sqrt_' + col] = np.sqrt(numeric_subset[col])
    numeric_subset['log_' + col] = np.log(numeric_subset[col])
# Categorical columns: Borough and Largest Property Use Type
categorical_subset = data[['Borough', 'Largest Property Use Type']]
print(categorical_subset)
# One-hot encode the categorical columns with get_dummies
categorical_subset = pd.get_dummies(categorical_subset)
print(categorical_subset)
print(numeric_subset)
# Join the numeric columns and the one-hot columns side by side
features = pd.concat([numeric_subset, categorical_subset], axis = 1)
# Drop the buildings without a score
features = features.dropna(subset = ['score'])
# Correlations with the score, sorted in ascending order
correlations = features.corr()['score'].dropna().sort_values()
print(correlations)
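To make the one-hot step concrete, here is a minimal sketch on a made-up three-row table. The values and the name `toy` are illustrative assumptions, not rows from the dataset. `pd.get_dummies` replaces each text column with one indicator column per category.

```python
import pandas as pd

# Hypothetical miniature of the two categorical columns used above
toy = pd.DataFrame({
    'Borough': ['Manhattan', 'Bronx', 'Manhattan'],
    'Largest Property Use Type': ['Office', 'Hotel', 'Office'],
})

encoded = pd.get_dummies(toy)

# One indicator column per (column, category) pair
print(list(encoded.columns))
# Every row activates exactly one indicator per original column
print(encoded.astype(int).sum(axis=1).tolist())
```

The column names are built as `<original column>_<category>`, which is why the feature table in this section ends up much wider than the raw data.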
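Two kinds of correlation matter here: the correlation between one feature and another, and the correlation between each feature and the score.
A relationship can be linear or nonlinear. The Pearson coefficient measures only the linear kind, which is why the sqrt and log transforms are tried.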
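The sketch below shows why a transform can matter. It uses invented numbers in which a hypothetical score falls with the logarithm of a feature. The Pearson coefficient on the raw feature understates the relationship, while on the log-transformed feature it is exactly linear.

```python
import numpy as np

# Hypothetical feature spanning several orders of magnitude
x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
# Hypothetical score that decreases with log10(x): 100, 80, 60, 40, 20
y = 100 - 20 * np.log10(x)

r_raw = np.corrcoef(x, y)[0, 1]          # linear correlation with the raw feature
r_log = np.corrcoef(np.log(x), y)[0, 1]  # linear correlation with the log feature

print(round(r_raw, 3), round(r_log, 3))
```

In the real data the transformed columns did not help much, as the correlation listing in this section shows, but the check is cheap and worth doing.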
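# Transformed columns carry the prefix sqrt_ or log_
# The 15 most negative correlations
correlations.head(15)
# Weather Normalized Site EUI (kBtu/ft²) and sqrt_Weather Normalized Site EUI (kBtu/ft²) have almost the same correlation, so the transform adds little
# The values are all similar and show no clear trend
# The 15 most positive correlations
correlations.tail(15)
Anything that can be done with head() can generally be done with tail() as well.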
3.2 Two-Variable Plots
import warnings
warnings.filterwarnings("ignore")
figsize(12, 10)
# Relationship between the score and Site EUI, coloured by building type
# Bring back the building type column for the buildings that have a score
features['Largest Property Use Type'] = data.dropna(subset = ['score'])['Largest Property Use Type']
# isin() takes a list and keeps the rows whose building type is in types
features = features[features['Largest Property Use Type'].isin(types)]
# hue = 'Largest Property Use Type' draws each building type in its own colour (see the legend)
sns.lmplot(x = 'Site EUI (kBtu/ft²)', y = 'score',
           hue = 'Largest Property Use Type', data = features,
           scatter_kws = {'alpha': 0.8, 's': 60}, fit_reg = False,
           height = 12, aspect = 1.2);
# Plot labeling
plt.xlabel("Site EUI", size = 28)
plt.ylabel('Energy Star Score', size = 28)
plt.title('Energy Star Score vs Site EUI', size = 36);
3.3 Removing Collinear Features
# Work on a copy so that the original data stays unchanged
features = data.copy()
# select_dtypes('number') picks the numeric columns
numeric_subset = data.select_dtypes('number')
# Loop over the numeric columns
for col in numeric_subset.columns:
    # Skip the score, which is the target y
    if col == 'score':
        continue
    # Add the natural log of every other numeric column
    numeric_subset['log_' + col] = np.log(numeric_subset[col])
# Categorical columns: Borough and Largest Property Use Type
# (for example multifamily housing, office, hotel, non-refrigerated warehouse)
categorical_subset = data[['Borough', 'Largest Property Use Type']]
# get_dummies is the pandas way to one-hot encode
categorical_subset = pd.get_dummies(categorical_subset)
# Join the numeric features with the encoded borough and building type columns
features = pd.concat([numeric_subset, categorical_subset], axis = 1)
features.shape  # 110 columns, more than the original data
# Weather Normalized Site EUI (kBtu/ft²): energy use intensity normalized for weather
# Site EUI (kBtu/ft²): energy use intensity
plot_data = data[['Weather Normalized Site EUI (kBtu/ft²)', 'Site EUI (kBtu/ft²)']].dropna()
# 'bo' draws blue circle markers
plt.plot(plot_data['Site EUI (kBtu/ft²)'], plot_data['Weather Normalized Site EUI (kBtu/ft²)'], 'bo')
# x-axis: Site EUI, y-axis: Weather Normalized Site EUI
plt.xlabel('Site EUI'); plt.ylabel('Weather Norm EUI')
plt.title('Weather Norm EUI vs Site EUI, R = %0.4f' % np.corrcoef(data[['Weather Normalized Site EUI (kBtu/ft²)', 'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);
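# Collinear features: when two features are highly correlated with each other, this function drops one of them.
# threshold: the cut-off on the absolute correlation coefficient, chosen by trial and error.
def remove_collinear_features(x, threshold):
    y = x['score']                   # keep the label aside
    x = x.drop(columns = ['score'])  # everything else is a feature
    # Repeat until no pair of features is correlated above the threshold
    while True:
        # Pairwise correlation matrix of the remaining features
        corr_matrix = x.corr()
        # A feature always has correlation 1 with itself, so zero the diagonal
        for i in range(len(corr_matrix)):
            corr_matrix.iloc[i, i] = 0
        # Features to drop in this pass
        drop_cols = []
        # Iterating over a DataFrame yields its column names
        for col in corr_matrix:
            # corr(A, B) equals corr(B, A); skipping columns already marked avoids dropping both
            if col not in drop_cols:
                # Absolute correlations of this column with every other feature
                v = np.abs(corr_matrix[col])
                # If the largest one exceeds the threshold, mark the partner feature
                if np.max(v) > threshold:
                    # idxmax returns the label of the most correlated feature
                    name = v.idxmax()
                    if name not in drop_cols:
                        drop_cols.append(name)
        # Dropping one feature of each highly correlated pair reduces model complexity.
        # Stop once a pass finds nothing to drop.
        if drop_cols:
            x = x.drop(columns = drop_cols)
        else:
            break
    # Put the label back
    x['score'] = y
    return x
help(remove_collinear_features)
# The call below is left commented out, as in the source, where the author reported that it failed to run.
# The results in the following sections were therefore produced without it.
# Threshold 0.6: drop one feature of every pair whose absolute correlation exceeds 0.6
# features = remove_collinear_features(features, 0.6);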
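For comparison, here is a compact sketch of the same idea on a toy table. It is not the author's loop: it is a common single-pass variant that looks only at the upper triangle of the correlation matrix, so each pair is examined once. The function name `drop_collinear` and the toy columns are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def drop_collinear(df, threshold, target='score'):
    """Drop the later feature of every pair whose |Pearson r| exceeds threshold."""
    features = df.drop(columns=[target])
    corr = features.corr().abs()
    # Keep only the entries above the diagonal so that (A, B) is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical data: weather_norm_eui is nearly a copy of site_eui
toy = pd.DataFrame({
    'site_eui': [10.0, 20.0, 30.0, 40.0, 50.0],
    'weather_norm_eui': [11.0, 19.0, 31.0, 41.0, 49.0],
    'floor_area': [5.0, 3.0, 9.0, 1.0, 7.0],
    'score': [90.0, 70.0, 60.0, 40.0, 20.0],
})

reduced, dropped = drop_collinear(toy, threshold=0.6)
print(dropped)
print(list(reduced.columns))
```

The target column is excluded from the correlation check on purpose: a feature that correlates strongly with the score is exactly what the model needs.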
# Drop the columns in which every value is missing
features = features.dropna(axis=1, how = 'all')
features.shape  # 110 columns before this step
4 Splitting the Dataset
4.1 Splitting the Data
# isna() is True where the value is NaN: these are the buildings without a score
no_score = features[features['score'].isna()]
# notnull() is True where the value is not NaN: these are the buildings with a score
score = features[features['score'].notnull()]
print(no_score.shape)
print(score.shape)
# features holds every column except the label
# targets holds the label, which is the building's score
features = score.drop(columns='score')
targets = pd.DataFrame(score['score'])
# np.inf is positive infinity and -np.inf is negative infinity; replace both with NaN
features = features.replace({np.inf: np.nan, -np.inf: np.nan})
# random_state = 42 fixes the seed of the random number generator that picks the rows.
# With the same data and the same random_state, the split is identical on every run.
# Without it, the training and test sets change on every run and tuning results cannot be compared.
# Any integer works, as long as the same value is used every time.
X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)
print(X.shape)
print(X_test.shape)
print(y.shape)
print(y_test.shape)
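The effect of `random_state` can be checked directly. The sketch below splits a small made-up array twice with the same seed and once with a different one. The array contents are placeholders chosen only to make the rows easy to recognise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two splits with the same seed
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

print(a_train.shape, a_test.shape)       # 7 training rows and 3 test rows
print(np.array_equal(a_test, b_test))    # the same seed gives the same test rows
```

With `test_size = 0.3`, 30% of the rows go to the test set and the remaining 70% to the training set, matching the split used above.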
4.2 Establishing a Baseline
# MAE (mean absolute error): the mean of |true value - predicted value|
# abs(): absolute value
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
# The baseline predicts the median training score for every building
baseline_guess = np.median(y)
print('The baseline guess is a score of %0.2f' % baseline_guess)  # the median is 66
print("Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, baseline_guess))  # MAE = 24.5164
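A worked example with a handful of invented scores shows how the baseline MAE is obtained. The arrays `train_scores` and `test_scores` are placeholders, not values from the dataset.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true values and predictions."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Hypothetical scores
train_scores = np.array([10, 50, 66, 80, 100])
test_scores = np.array([20, 60, 90])

# The baseline always predicts the training median
guess = np.median(train_scores)
# Absolute errors are |20-66| = 46, |60-66| = 6, |90-66| = 24, so the MAE is 76 / 3
baseline_mae = mae(test_scores, guess)
print(guess, round(baseline_mae, 4))
```

Any model trained later has to beat this number to be worth keeping.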
4.3 Saving the Results for Modeling
# Save the no scores, training, and testing data
# to_csv writes each DataFrame to a CSV file under data/, for example data/no_score.csv
no_score.to_csv('data/no_score.csv', index = False)
X.to_csv('data/training_features.csv', index = False)
X_test.to_csv('data/testing_features.csv', index = False)
y.to_csv('data/training_labels.csv', index = False)
y_test.to_csv('data/testing_labels.csv', index = False)
5 Building Baseline Models and Trying Several Algorithms
# The earlier sections concentrated on preparing the data; from here on the focus is on modeling. Import the required packages.
# Data analysis libraries
import pandas as pd
import numpy as np
# Turn off the chained-assignment warning
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', 60)
# Visualization
import matplotlib.pyplot as plt
%matplotlib inline
# Default font size
plt.rcParams['font.size'] = 24
from IPython.core.pylabtools import figsize
# Seaborn, a higher-level visualization library
import seaborn as sns
sns.set(font_scale = 2)
# Preprocessing: missing-value imputation and min-max scaling
# The original import, which the author replaced with the three lines below it:
# from sklearn.preprocessing import Imputer, MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
Imputer = SimpleImputer(strategy='median')
# Machine learning algorithms
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
# Hyperparameter tuning tools
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import warnings
warnings.filterwarnings("ignore")
# Read in data into dataframes
train_features = pd.read_csv('data/training_features.csv')
test_features = pd.read_csv('data/testing_features.csv')
train_labels = pd.read_csv('data/training_labels.csv')
test_labels = pd.read_csv('data/testing_labels.csv')
# Display sizes of data
print('Training Feature Size: ', train_features.shape)
print('Testing Feature Size: ', test_features.shape)
print('Training Labels Size: ', train_labels.shape)
print('Testing Labels Size: ', test_labels.shape)
5.1 Imputing Missing Values
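# The original code, which the author replaced with the SimpleImputer line below it:
# imputer = Imputer(strategy='median')
# The data contains outliers, so the median is a better fill value than the mean
imputer = SimpleImputer(strategy='median')
# Fit on the training features only
imputer.fit(train_features)
# Fill the missing values of the training set with the training medians
X = imputer.transform(train_features)
# Fill the missing values of the test set with the same training medians
X_test = imputer.transform(test_features)
# Check whether any missing values remain in the training and test features
# np.isnan flags the NaN entries
print('Missing values in training features: ', np.sum(np.isnan(X)))  # 0 means the imputation is complete
print('Missing values in testing features: ', np.sum(np.isnan(X_test)))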
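The sketch below illustrates two points from this step on a made-up single-column example: the median is barely affected by an outlier while the mean is pulled far away, and the fill value for the test set comes from the training data. The arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training column with one missing value and one outlier (100.0)
train = np.array([[1.0], [2.0], [np.nan], [100.0]])
# Hypothetical test column with one missing value
test = np.array([[np.nan], [5.0]])

imputer = SimpleImputer(strategy='median')
imputer.fit(train)                      # learns the median of 1, 2 and 100

train_filled = imputer.transform(train)
test_filled = imputer.transform(test)   # reuses the training median

print('median used:', imputer.statistics_[0])
print('mean would be:', np.nanmean(train))
print(test_filled.ravel())
```

Fitting on the training set alone keeps information from the test set out of the preprocessing.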
5.2 Feature Scaling and Normalization
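# feature_range=(0, 1): scale every feature into the range 0 to 1
scaler = MinMaxScaler(feature_range=(0, 1))
# Fit on the training data only
scaler.fit(X)
# Transform the training data and the test data with the fitted scaler
X = scaler.transform(X)
X_test = scaler.transform(X_test)
# The labels are stored as a single column; reshape((-1,)) flattens them into a 1-D array.
# reshape changes the shape of an array, and -1 lets NumPy infer the length.
y = np.array(train_labels).reshape((-1,))
y_test = np.array(test_labels).reshape((-1, ))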
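Min-max scaling maps each feature with x' = (x - min) / (max - min), where min and max come from the training data. The NumPy sketch below applies that formula by hand to invented numbers, and also shows what `reshape((-1,))` does to a one-column label array.

```python
import numpy as np

# Hypothetical single-feature training and test data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

# Statistics come from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled.ravel())
print(test_scaled.ravel())   # a test value above the training maximum lands above 1

# A label column of shape (3, 1) becomes a flat array of shape (3,)
labels = np.array([[66], [80], [12]])
flat = labels.reshape((-1,))
print(labels.shape, flat.shape)
```

Test values outside the training range fall outside [0, 1]; that is expected and does not mean the scaler is broken.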
6 Building Baseline Models and Trying Several Algorithms (Regression)
6.1 Defining the Loss Function
# The loss function here is MAE; abs() takes the absolute value
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
# Train a model and evaluate it on the test set
def fit_and_evaluate(model):
    # Train the model on the training data
    model.fit(X, y)
    # Predict on the test data and compute the MAE
    model_pred = model.predict(X_test)
    model_mae = mae(y_test, model_pred)
    return model_mae
6.2 Choosing Machine Learning Algorithms
# Linear regression
lr = LinearRegression()
lr_mae = fit_and_evaluate(lr)
print('Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae)
# Support vector machine
svm = SVR(C = 1000, gamma = 0.1)
svm_mae = fit_and_evaluate(svm)
print('Support Vector Machine Regression Performance on the test set: MAE = %0.4f' % svm_mae)
# Random forest, an ensemble method
random_forest = RandomForestRegressor(random_state=60)
random_forest_mae = fit_and_evaluate(random_forest)
print('Random Forest Regression Performance on the test set: MAE = %0.4f' % random_forest_mae)
# Gradient boosted trees
gradient_boosted = GradientBoostingRegressor(random_state=60)
gradient_boosted_mae = fit_and_evaluate(gradient_boosted)
print('Gradient Boosted Regression Performance on the test set: MAE = %0.4f' % gradient_boosted_mae)
# K-nearest neighbors
knn = KNeighborsRegressor(n_neighbors=10)
knn_mae = fit_and_evaluate(knn)
print('K-Nearest Neighbors Regression Performance on the test set: MAE = %0.4f' % knn_mae)
plt.style.use('fivethirtyeight')
figsize(8, 6)
model_comparison = pd.DataFrame({'model': ['Linear Regression', 'Support Vector Machine',
                                           'Random Forest', 'Gradient Boosted',
                                           'K-Nearest Neighbors'],
                                 'mae': [lr_mae, svm_mae, random_forest_mae, gradient_boosted_mae, knn_mae]})
# ascending = False sorts from the largest MAE to the smallest; kind = 'barh' draws a horizontal bar chart
model_comparison.sort_values('mae', ascending = False).plot(x = 'model', y = 'mae', kind = 'barh',
                                                           color = 'red', edgecolor = 'black')
# y-axis: model names, x-axis: MAE
plt.ylabel(''); plt.yticks(size = 14); plt.xlabel('Mean Absolute Error'); plt.xticks(size = 14)
plt.title('Model Comparison on Test MAE', size = 20);
7 Hyperparameter Tuning
7.1 Tuning
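# Loss functions to try (scikit-learn 1.0 and later name 'ls' and 'lad' as 'squared_error' and 'absolute_error')
loss = ['ls', 'lad', 'huber']
# Number of weak learners (decision trees) to use
n_estimators = [100, 500, 900, 1100, 1500]
# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]
# Minimum number of samples required at a leaf node
min_samples_leaf = [1, 2, 4, 6, 8]
# Minimum number of samples required to split a node
min_samples_split = [2, 4, 6, 10]
hyperparameter_grid = {'loss': loss,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split}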
model = GradientBoostingRegressor(random_state = 42)
random_cv = RandomizedSearchCV(estimator=model,
                               param_distributions=hyperparameter_grid,
                               cv=4, n_iter=25,
                               scoring = 'neg_mean_absolute_error',  # metric used to pick the best result
                               n_jobs = -1, verbose = 1,
                               return_train_score = True,
                               random_state=42)
# Note: this runs very slowly; it took about 14 minutes
random_cv.fit(X, y)
help(GradientBoostingRegressor)
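Because the full search takes several minutes, it can help to try the same API on something tiny first. The sketch below runs `RandomizedSearchCV` on synthetic data with a deliberately small grid. The data, the grid values and the names `X_toy`, `y_toy`, `small_grid` and `search` are assumptions for this example and are unrelated to the project's results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the building features
X_toy, y_toy = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# A deliberately small search space so that the example runs in seconds
small_grid = {'n_estimators': [20, 50], 'max_depth': [2, 3]}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions=small_grid,
    n_iter=3,                            # sample 3 of the 4 possible combinations
    cv=3,                                # 3-fold cross-validation
    scoring='neg_mean_absolute_error',   # scores are negated MAE, so higher is better
    random_state=42,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print('Cross-validated MAE of the best candidate:', -search.best_score_)
```

The scoring is the negated MAE because scikit-learn always maximises the score, which is why the plots later multiply the stored scores by -1.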
RandomizedSearchCV(cv=4, error_score='raise-deprecating',estimator=GradientBoostingRegressor(alpha=0.9,criterion='friedman_mse',init=None,learning_rate=0.1,loss='ls', max_depth=3,max_features=None,max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,min_samples_leaf=1,min_samples_split=2,min_weight_fraction_leaf=0.0,n_estimators=100,verbose=0,warm_start=False),iid='warn', n_iter=25, n_jobs=-1,param_distributions={'loss': ['ls', 'lad', 'huber'],'max_depth': [2, 3, 5, 10, 15],'min_samples_leaf': [1, 2, 4, 6, 8],'min_samples_split': [2, 4, 6, 10],'n_estimators': [100, 500, 900, 1100,1500]},pre_dispatch='2*n_jobs', random_state=42, refit=True,return_train_score=True, scoring='neg_mean_absolute_error',verbose=1)
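random_cv.best_estimator_  # the estimator with the best cross-validation score
# The traceback below shows that random_cv did not exist when this cell ran, so the search cell above has to be run first.
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-187-c5a12878b76a> in <module>
----> 1 random_cv.best_estimator_
NameError: name 'random_cv' is not defined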
# Candidate numbers of trees
trees_grid = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800]}
# Build the model
# 'lad' minimises the absolute deviation (named 'absolute_error' in scikit-learn 1.0 and later)
model = GradientBoostingRegressor(loss = 'lad', max_depth = 5,
                                  min_samples_leaf = 6,
                                  min_samples_split = 6,
                                  max_features = None,
                                  random_state = 42)
# Grid search over the number of trees
grid_search = GridSearchCV(estimator = model, param_grid=trees_grid, cv = 4,
                           scoring = 'neg_mean_absolute_error', verbose = 1,
                           n_jobs = -1, return_train_score = True)
# This takes about 3 minutes
grid_search.fit(X, y)
GridSearchCV(cv=4, error_score='raise-deprecating',estimator=GradientBoostingRegressor(alpha=0.9,criterion='friedman_mse',init=None, learning_rate=0.1,loss='lad', max_depth=5,max_features=None,max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,min_samples_leaf=6,min_samples_split=6,min_weight_fraction_leaf=0.0,n_estimators=100,n_iter_no_change=None,presort='auto',random_state=42, subsample=1.0,tol=0.0001,validation_fraction=0.1,verbose=0, warm_start=False),iid='warn', n_jobs=-1,param_grid={'n_estimators': [100, 150, 200, 250, 300, 350, 400,450, 500, 550, 600, 650, 700, 750,800]},pre_dispatch='2*n_jobs', refit=True, return_train_score=True,scoring='neg_mean_absolute_error', verbose=1)
7.2 Comparing the Loss
# Put the search results into a DataFrame
results = pd.DataFrame(grid_search.cv_results_)
# Plot the training and testing error against the number of trees
figsize(8, 8)
plt.style.use('fivethirtyeight')
plt.plot(results['param_n_estimators'], -1 * results['mean_test_score'], label = 'Testing Error')
plt.plot(results['param_n_estimators'], -1 * results['mean_train_score'], label = 'Training Error')
# x-axis: number of trees, y-axis: MAE
plt.xlabel('Number of Trees'); plt.ylabel('Mean Absolute Error'); plt.legend();
plt.title('Performance vs Number of Trees');
# Overfitting: the blue curve flattens out while the red curve keeps falling steeply,
# so the gap between them grows with the number of trees
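The reading of the plot can be expressed numerically. The sketch below uses invented error values laid out like the columns of `cv_results_` to compute the gap between validation and training error and to pick the tree count with the lowest validation error. None of these numbers come from the actual search.

```python
import numpy as np

# Hypothetical cross-validation results
n_trees = np.array([100, 200, 400, 800])
train_mae = np.array([8.0, 7.0, 6.0, 5.0])   # keeps falling as trees are added
valid_mae = np.array([9.5, 9.2, 9.1, 9.1])   # levels off

# A gap that widens with model size is the signature of overfitting
gap = valid_mae - train_mae

# np.argmin returns the first position of the minimum, i.e. the smallest adequate model
best_n = int(n_trees[np.argmin(valid_mae)])

print(gap)
print(best_n)
```

Choosing by validation error rather than training error is what keeps the selected model from simply being the largest one.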
8 Evaluation and Testing: Differences Between Predictions and True Values
# Fit a model with the default hyperparameters for comparison
default_model = GradientBoostingRegressor(random_state = 42)
default_model.fit(X,y)
# Take the best model found by the grid search
final_model = grid_search.best_estimator_
final_model
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)
print('Default model performance on the test set: MAE = %0.4f.' % mae(y_test, default_pred))
print('Final model performance on the test set: MAE = %0.4f.' % mae(y_test, final_pred))
figsize(6, 6)
# Residuals = predictions - true test values; most of them fall within ±25
residuals = final_pred - y_test
plt.hist(residuals, color = 'red', bins = 20,
         edgecolor = 'black')
plt.xlabel('Error'); plt.ylabel('Count')
plt.title('Distribution of Residuals');
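The statement that most residuals fall within ±25 can be checked with a couple of lines. The sketch below does so on invented predictions; `y_true` and `y_pred` are placeholders rather than model output.

```python
import numpy as np

# Hypothetical true scores and predictions
y_true = np.array([50, 80, 20, 95, 60])
y_pred = np.array([55, 60, 50, 90, 62])

# Same sign convention as above: prediction minus true value
residuals = y_pred - y_true

# Share of predictions that are off by at most 25 points
within_25 = float(np.mean(np.abs(residuals) <= 25))

# The MAE is the mean of the absolute residuals
mae_value = float(np.mean(np.abs(residuals)))

print(residuals.tolist(), within_25, mae_value)
```

A histogram centred near zero with few values beyond ±25 corresponds to a high `within_25` share.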
9 Interpreting the Model: Feature Selection Based on Importance
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', 60)
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 24
from IPython.core.pylabtools import figsize
import seaborn as sns
sns.set(font_scale = 2)
# The original import, which the author replaced with the three lines below it:
# from sklearn.preprocessing import Imputer, MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree
import warnings
warnings.filterwarnings("ignore")
# Replace missing values with the median
# The original code, which the author replaced with the SimpleImputer line below it:
# imputer = Imputer(strategy='median')
imputer = SimpleImputer(strategy='median')
# Fit on the training features
imputer.fit(train_features)
X = imputer.transform(train_features)
# Missing values in the test set are also filled with the training medians
X_test = imputer.transform(test_features)
y = np.array(train_labels).reshape((-1,))
y_test = np.array(test_labels).reshape((-1,))
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
# The gradient boosted trees (GBDT) model is the final model
# ('lad' is named 'absolute_error' in scikit-learn 1.0 and later)
model = GradientBoostingRegressor(loss='lad', max_depth=5, max_features=None,
                                  min_samples_leaf=6, min_samples_split=6,
                                  n_estimators=800, random_state=42)
model.fit(X, y)
model_pred = model.predict(X_test)
print('Final Model Performance on the test set: MAE = %0.4f' % mae(y_test, model_pred))
# Feature importances
feature_results = pd.DataFrame({'feature': list(train_features.columns),  # all training features
                                'importance': model.feature_importances_})
# Sort in descending order and show the 10 most important features
feature_results = feature_results.sort_values('importance', ascending = False).reset_index(drop=True)
feature_results.head(10)
figsize(12, 10)
plt.style.use('fivethirtyeight')
# Plot the 10 most important features; kind='barh' draws a horizontal bar chart
feature_results.loc[:9, :].plot(x = 'feature', y = 'importance',
                                edgecolor = 'k',
                                kind='barh', color = 'blue');
plt.xlabel('Relative Importance', size = 20); plt.ylabel('')
plt.title('Feature Importances from Gradient Boosting', size = 30);
# Names of the 10 most important features
most_important_features = feature_results['feature'][:10]
# Column positions of those 10 features, built with a list comprehension
indices = [list(train_features.columns).index(x) for x in most_important_features]
# Keep only those columns in the training and test arrays
X_reduced = X[:, indices]
X_test_reduced = X_test[:, indices]
print('Most important training features shape: ', X_reduced.shape)
print('Most important testing features shape: ', X_test_reduced.shape)
lr = LinearRegression()
# Fit on all features
lr.fit(X, y)
lr_full_pred = lr.predict(X_test)
# Fit on the reduced feature set
lr.fit(X_reduced, y)
lr_reduced_pred = lr.predict(X_test_reduced)
print('Linear Regression Full Results: MAE = %0.4f.' % mae(y_test, lr_full_pred))
print('Linear Regression Reduced Results: MAE = %0.4f.' % mae(y_test, lr_reduced_pred))
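The selection of the top features can also be done with `np.argsort`, which avoids looking up every name in the column list. The sketch below shows this on invented importances. The feature names, the importance values and `k` are assumptions for illustration.

```python
import numpy as np

# Hypothetical feature names and importances (summing to 1, like feature_importances_)
feature_names = ['site_eui', 'floor_area', 'year_built', 'borough_bronx']
importances = np.array([0.60, 0.10, 0.25, 0.05])

# Hypothetical feature matrix with one column per feature
X_toy = np.arange(12, dtype=float).reshape(3, 4)

k = 2
# argsort is ascending, so reverse it and keep the first k positions
top_idx = np.argsort(importances)[::-1][:k]
top_names = [feature_names[i] for i in top_idx]

# Column selection with an index array keeps the order of top_idx
X_reduced = X_toy[:, top_idx]

print(top_names)
print(X_reduced.shape)
```

The same index array has to be applied to the test matrix so that both sets keep identical columns in identical order.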