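Abstract: The date formats MaxCompute supports are typically zero-padded ones such as 20170725 or 2017/07/25, whereas the dates in this Power AI competition come in an unpadded form such as 2016/1/1. That form is non-standard from MaxCompute's point of view and cannot be handled directly by the date functions in ODPS SQL. In addition, the weather data provided for the competition is not numeric, so many teams left it unused (among the teams that have published their solutions, almost none used the weather condition, wind speed, or wind direction fields), even though weather usually has a large influence on short-term load forecasting. This article explains in detail how to process the competition's non-standard dates with ODPS SQL on MaxCompute and how to process the weather data with OPEN_MR. It then walks through the whole pipeline on MaxCompute, from data preprocessing and feature extraction to the final predictions, using ODPS SQL, OPEN_MR, and PAI commands. It is offered for reference; comments and corrections are welcome.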

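Alibaba Cloud's MaxCompute platform offers powerful features and open interfaces, which make it easy to process many kinds of data and to complete analysis and prediction quickly. Apart from the weather-data processing, which was done earlier in spare moments (less than a day in total), all the code in this article was written in a rush within the two days after the data was swapped for the semi-final round, which shows how capable MaxCompute is. Because the OPEN_MR part cannot currently be integrated into ODPS SQL on the competition platform, the run has to be interrupted once; everything else runs in batch with a single click of the "Run" button, going straight from raw data to the submission. Note that the platform used here is the Tianchi competition platform, a competition-only edition of MaxCompute that Alibaba Cloud has trimmed (restricted) to protect the competition data. The publicly available MaxCompute platform has fewer restrictions and more features.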

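1. Competition Task

The main data source for the competition is the enterprise power consumption table Tianchi_power, which covers more than 1,000 enterprises in the high-tech zone of Yangzhong City (the data has been desensitized). It contains the enterprise ID (anonymized), the date, and the power consumption. The fields are listed in the table below: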

Table: tianchi_power

Column              Type     Meaning             Example
record_date         string   date                20150101
user_id             bigint   enterprise ID       1
power_consumption   bigint   power consumption   1031
...                 ...      ...                 ...

The weather data table is tianchi_weather_data; its contents are shown in the table below:

Submission table: tianchi_power_answer

Column              Type     Meaning                       Example
predict_date        string   date                          2016/9/1
power_consumption   bigint   predicted power consumption   1031

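2. Interpreting the Task

This is a short-term load forecasting problem. In 2010 State Grid issued the enterprise standard Q/GDW 552-2010, 《电网短期超短期负荷预测技术规范》 (technical specification for short-term and ultra-short-term grid load forecasting), which covers the relevant terminology, what is to be forecast, the error formulas, and commonly used forecasting algorithms. Because the forecast serves a different purpose in this competition, the task does not fully follow the standard's requirements on forecast content (time granularity and forecast horizon), and it uses its own error metric. The nature of the problem is unchanged, however: it is still short-term load forecasting.

In our earlier work on ultra-short-term power forecasting for photovoltaic plants, we found that missing values and numerical weather prediction data had the largest effect on forecast accuracy. The State Grid standard also gives a rough overview of the factors that influence load forecasting: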

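Since factors such as social events cannot be known in advance, we focused on missing values and weather data in this competition, concentrating the work in three areas: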

1) encode and transform the official weather data to build a complete set of weather features;

2) build a deliberately overfitted model to fill in missing values;

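3) build model one on the corrected data to predict the trend and model two on the raw data to predict the consumption level (the approximate value), then combine the two models by weighted fusion.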
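The weighted fusion in step 3 can be pictured as a per-day weighted average of the two forecasts. The sketch below is only an illustration under assumptions: the article does not state the weights or the exact fusion formula, so the function name blend, the parameter w_trend, and the sample numbers are all hypothetical.

```python
def blend(trend_pred, level_pred, w_trend=0.5):
    """Weighted average of two daily forecasts.

    w_trend is a hypothetical weight for the trend model (model one);
    the level model (model two) gets 1 - w_trend.
    """
    if len(trend_pred) != len(level_pred):
        raise ValueError("both forecasts must cover the same days")
    if not 0.0 <= w_trend <= 1.0:
        raise ValueError("w_trend must lie in [0, 1]")
    return [w_trend * a + (1.0 - w_trend) * b
            for a, b in zip(trend_pred, level_pred)]


# Made-up daily totals, for illustration only.
model_one = [3_900_000.0, 4_100_000.0]
model_two = [4_000_000.0, 4_000_000.0]
fused = blend(model_one, model_two, w_trend=0.25)
```

A single scalar weight keeps the example small; a per-day or validation-tuned weight would fit the same shape.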

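3. Data Preprocessing

3.1 Handling Non-Standard Dates

The year, month, and day are extracted separately with regexp_extract, the regular-expression string function provided by ODPS SQL, and then combined into a standard date format. The code is as follows: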

-- Build the daily total power consumption
DROP TABLE IF EXISTS t_netivs_daily_sum_consumption;
CREATE TABLE IF NOT EXISTS t_netivs_daily_sum_consumption AS
SELECT
    *
    ,(year*10000+month*100+day) as day_int -- convert to the 20160101 form
    ,(month*100+day) as month_day
    ,(year*100+month) as year_month
    ,((year-2015)*12+month) as month_index
FROM
(
    SELECT
        *
        ,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',1) as bigint) as year   -- extract the year
        ,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',2) as bigint) as month  -- extract the month
        ,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',3) as bigint) as day    -- extract the day
    FROM
    (
        SELECT
            record_date
            ,sum(power_consumption) as power_consumption
        FROM
            odps_tc_257100_f673506e024.tianchi_power2
        GROUP BY
            record_date
    )t2
)t1
;

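With this code, non-standard dates such as 2016/1/1 are easily converted to bigint values such as 20160101. These can later be turned into a date type with to_date(cast(xxx as string),'yyyymmdd'), after which the built-in ODPS SQL functions can be used to extract date features.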
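For readers who want to check the conversion off-platform, the same idea can be sketched in Python. This is not the ODPS SQL above, only a minimal equivalent; the function names to_day_int and to_date are chosen here for illustration, and the stricter pattern (four-digit year, one- or two-digit month and day) is an assumption about the input.

```python
import re
from datetime import date

# Year, then month and day with or without zero padding, e.g. 2016/1/1.
_DATE_RE = re.compile(r"^(\d{4})/(\d{1,2})/(\d{1,2})$")


def to_day_int(record_date):
    """Convert an unpadded date string such as '2016/1/1' to the integer 20160101."""
    match = _DATE_RE.match(record_date.strip())
    if match is None:
        raise ValueError(f"unexpected date format: {record_date!r}")
    year, month, day = (int(part) for part in match.groups())
    date(year, month, day)  # raises ValueError for impossible dates such as 2016/2/30
    return year * 10000 + month * 100 + day


def to_date(day_int):
    """Convert an integer such as 20160101 back to a datetime.date."""
    return date(day_int // 10000, day_int // 100 % 100, day_int % 100)


new_year = to_day_int("2016/1/1")
new_year_date = to_date(new_year)
```

Validating through date() catches malformed rows early, which the plain arithmetic would silently turn into a meaningless integer.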

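3.2 Holidays

Because uploading and downloading data is in principle not allowed during the competition, the proper approach is to handle holidays with case when in ODPS SQL. The code for the holiday and date features is given below: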

-- Build the extended date features
DROP TABLE IF EXISTS t_netivs_date_features;
CREATE TABLE IF NOT EXISTS t_netivs_date_features AS
SELECT
    day_int,day_index,month_index,year_index,month,day
    ,(month*100+day) as month_day
    ,(year*100+month) as year_month
    -- weekday() returns 0 for Monday through 6 for Sunday, so the weekend is (5,6); the original tested (6,7)
    ,case when (weekday in (5,6) and special_workday = 0) or holiday = 1 then 0 else 1 end as workday
    ,weekofyear,day_to_lastday,month_day_num,weekday
    ,holiday,special_workday,special_holiday
    ,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday
    ,day1_before_holiday,day2_before_holiday,day3_before_holiday
    ,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday
    ,day1_after_holiday,day2_after_holiday,day3_after_holiday
FROM
(
    SELECT
        day_int
        ,dt
        ,datediff(dt,to_date('2015-01-01','yyyy-mm-dd'),'dd')+1 as day_index
        ,datediff(dt,to_date('2015-01-01','yyyy-mm-dd'),'mm')+1 as month_index
        ,datepart(dt,'yyyy')-2015+1 as year_index
        ,datepart(dt,'yyyy') as year
        ,datepart(dt,'mm') as month
        ,datepart(dt,'dd') as day
        ,datepart(lastday(dt),'dd') as month_day_num
        ,weekofyear(dt) as weekofyear
        ,datediff(lastday(dt),dt,'dd') as day_to_lastday
        ,weekday(dt) as weekday
        ,holiday
        ,special_workday
        ,special_holiday
        -- Note: dateadd(dt,-n,'dd') looks n days back, so dayN_before_* is set when the listed date falls n days before dt;
        -- dayN_after_* uses dateadd(dt,n,'dd') and is set when the listed date falls n days after dt.
        ,case when cast(to_char(dateadd(dt,-1,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_before_special_holiday
        ,case when cast(to_char(dateadd(dt,-2,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_before_special_holiday
        ,case when cast(to_char(dateadd(dt,-3,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_before_special_holiday
        ,case when cast(to_char(dateadd(dt,-1,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_before_holiday
        ,case when cast(to_char(dateadd(dt,-2,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_before_holiday
        ,case when cast(to_char(dateadd(dt,-3,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_before_holiday
        ,case when cast(to_char(dateadd(dt,1,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_after_special_holiday
        ,case when cast(to_char(dateadd(dt,2,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_after_special_holiday
        ,case when cast(to_char(dateadd(dt,3,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_after_special_holiday
        ,case when cast(to_char(dateadd(dt,1,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day1_after_holiday
        ,case when cast(to_char(dateadd(dt,2,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day2_after_holiday
        ,case when cast(to_char(dateadd(dt,3,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day3_after_holiday
    FROM
    (
        SELECT
            day_int
            ,to_date(to_char(day_int),'yyyymmdd') as dt
            ,case when day_int in (20150101,20150102,20150103,20150218,20150219,20150220,20150221,20150222,20150223,20150224,20150404,20150405,20150406,20150501,20150502,20150503,20150620,20150621,20150622,20150903,20150904,20150905,20150927,20151001,20151002,20151003,20151004,20151005,20151006,20151007,20160101,20160207,20160208,20160209,20160210,20160211,20160212,20160213,20160404,20160501,20160502,20160609,20160610,20160611,20160915,20160916,20160917,20161001,20161002,20161003,20161004,20161005,20161006,20161007) then 1 else 0 end as holiday
            ,case when day_int in (20150104,20150215,20150228,20150906,20151010,20160206,20160214,20160612,20160918,20161008,20161009) then 1 else 0 end as special_workday
            ,case when day_int in (20150101,20150218,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as special_holiday
        FROM
            t_netivs_tianchi_weather_data
    )t1
)t2
;

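This code covers essentially all the common operations for normalizing date formats and extracting date features with ODPS SQL on MaxCompute; anyone doing time-series feature analysis and forecasting on MaxCompute can borrow and refine it to extract their own date features. Analysis of the competition data readily shows that holidays have a very large effect on total daily consumption, and that the effect carries over to neighboring days: consumption may jump or drop sharply as some holidays approach, and may stay higher or lower for several days after others end. Holidays are therefore handled in some detail here, with added features for the 1, 2, and 3 days before and after a holiday.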
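The idea of flagging the days around a holiday can be sketched compactly in Python. The sketch follows the description in the paragraph above (a day is "n days before" a holiday when the holiday starts n days later, and "n days after" when the holiday ended n days earlier), which is an assumption about the intended meaning; the function name holiday_offsets is hypothetical. The two sample dates are the first and last day of the 2016 National Day holiday as they appear in the lists of this section.

```python
from datetime import date, timedelta


def holiday_offsets(day, first_days, last_days, max_offset=3):
    """Return 0/1 flags for being 1..max_offset days before or after a holiday.

    first_days: set of dates on which a holiday starts.
    last_days:  set of dates on which a holiday ends.
    """
    features = {}
    for n in range(1, max_offset + 1):
        # n days before: the holiday starts n days after `day`.
        features[f"day{n}_before_holiday"] = int(day + timedelta(days=n) in first_days)
        # n days after: the holiday ended n days before `day`.
        features[f"day{n}_after_holiday"] = int(day - timedelta(days=n) in last_days)
    return features


FIRST_DAYS = {date(2016, 10, 1)}
LAST_DAYS = {date(2016, 10, 7)}

eve = holiday_offsets(date(2016, 9, 30), FIRST_DAYS, LAST_DAYS)
back = holiday_offsets(date(2016, 10, 9), FIRST_DAYS, LAST_DAYS)
```

Keeping the holiday dates in sets rather than repeating literal lists makes it harder for the individual features to drift apart.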

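3.3 Weather Data Processing

As the weather data above shows, the weather condition, wind speed, and wind direction fields are all strings and must be converted to numeric values before they can be used in a machine learning model. Because the set of distinct strings is small, one option is to sort the strings and use each string's rank as its code, feeding that directly to the model as an input feature. This is simple and easy to implement with the built-in row_number function in ODPS SQL. The drawback is just as obvious: it does not make full use of the relationships between weather categories, for example between 大雨 (heavy rain) and 大到暴雨 (heavy rain to rainstorm). We therefore used OPEN_MR to process the weather data in detail, along the following lines: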

1) list every distinct value that appears in the data table and examine its composition and category;

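2) some entries contain a single weather condition, such as 大雨 (heavy rain), 中雨 (moderate rain), or 小雨 (light rain), while others combine two, such as 大到暴雨 (heavy rain to rainstorm) or 多云转阴 (cloudy turning overcast); all entries are therefore unified so that a single-condition entry is represented as the same condition twice;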

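3) for each weather condition, derive a weather type (sunny, snow, rain, and so on), a weather level (light rain, moderate rain, heavy rain, rainstorm, and so on, numbered from 1), and a weather combination (weather type plus weather level).

The weather data processed this way has the format created by the following ODPS SQL statement, which also serves as the output table for OPEN_MR: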

-- Output table for the map-reduce job that processes the weather data
-- The December weather data provided online was processed together with the rest, so nothing needs to change here
-- DROP TABLE IF EXISTS t_netivs_encode_weather;
CREATE TABLE IF NOT EXISTS t_netivs_encode_weather (
    day_int             bigint
    ,temperature_high   bigint
    ,temperature_low    bigint
    ,weather1           bigint
    ,weather1_level     bigint
    ,weather1_type      bigint
    ,weather2           bigint
    ,weather2_level     bigint
    ,weather2_type      bigint
    ,wind_direction     bigint
    ,wind_speed         double
    ,wind_speed1        double
    ,wind_speed2        double
)
;

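To parse the weather data we wrote an OPEN_MR job. Its core code is as follows:

The raw data is obtained from the mapper, processed, and the result is output to the reducer. The main flow is: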

// Main flow for processing the weather data
public void weather_encode(long day_int, long temperature_high, long temperature_low, String weather, String wind_direction, String wind_speed, Record vals)
{
    m_output_vals = vals;
    m_day_int     = day_int;
    m_temp_high   = temperature_high;
    m_temp_low    = temperature_low;

    reset();
    weather_parser(weather);
    wind_direction_parser(wind_direction);
    wind_speed_parse(wind_speed);

    // Output the features
    output();
}

The code that converts the weather condition into codes is as follows:

// -------------- Re-encode the weather condition ---------------------------------------------//
private void weather_parser(String weather)
{
    String weather1, weather2;

    // A trailing ~ most likely means the record is incomplete, so drop the ~
    if(weather.endsWith("~"))
    {
        // length()-1 removes only the trailing ~ (the original used length()-2, which also dropped the character before it)
        weather = weather.substring(0, weather.length()-1);
    }
    weather = weather.replace("转", "~");

    // Split the entry into its two weather conditions
    if(weather.contains("~"))
    {
        weather1 = weather.split("~")[0];
        weather2 = weather.split("~")[1];
    }
    else
    {
        weather1 = weather;
        weather2 = weather;
    }

    // Parse weather1 and weather2
    // 小雨、小到中雨、中雨、中到大雨、大雨、大到暴雨、暴雨、阵雨、雷雨、雷阵雨、小雪、中雪、大雪、雨夹雪、晴、阴、多云
    m_weather1 = get_weather_index(weather1); // re-encode the weather condition
    m_weather1_level = get_weather_level(weather1);
    m_weather1_type = get_weather_type(weather1);
    m_weather2 = get_weather_index(weather2);
    m_weather2_level = get_weather_level(weather2);
    m_weather2_type = get_weather_type(weather2);
}
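The helpers get_weather_level and get_weather_type are called above but not listed in the article. The Python sketch below shows one way such helpers could work; it is an assumption, not the author's implementation. The type codes, the level numbering, and the function names are all hypothetical, and only the category strings come from the comment in the listing above.

```python
# Rain and snow levels in increasing order; the level is the 1-based position.
RAIN_LEVELS = ["小雨", "小到中雨", "中雨", "中到大雨", "大雨", "大到暴雨", "暴雨"]
SNOW_LEVELS = ["小雪", "中雪", "大雪"]
# Hypothetical type codes: sunny, cloudy, overcast, rain, snow.
TYPE_CODES = {"晴": 1, "多云": 2, "阴": 3, "雨": 4, "雪": 5}


def weather_type(condition):
    """Coarse category of a single condition; 0 means unknown."""
    if condition in TYPE_CODES:
        return TYPE_CODES[condition]
    if condition.endswith("雪"):  # includes 雨夹雪 (sleet)
        return TYPE_CODES["雪"]
    if condition.endswith("雨"):  # includes 阵雨, 雷雨, 雷阵雨
        return TYPE_CODES["雨"]
    return 0


def weather_level(condition):
    """Intensity level starting at 1; 0 when the condition has no graded level."""
    for levels in (RAIN_LEVELS, SNOW_LEVELS):
        if condition in levels:
            return levels.index(condition) + 1
    return 0


def split_weather(weather):
    """Split '多云转阴' or '多云~阴' into two conditions; duplicate a single one."""
    parts = weather.rstrip("~").replace("转", "~").split("~")
    first = parts[0]
    second = parts[1] if len(parts) > 1 and parts[1] else first
    return first, second


def encode(weather):
    """Return (type1, level1, type2, level2) for one raw weather string."""
    first, second = split_weather(weather)
    return (weather_type(first), weather_level(first),
            weather_type(second), weather_level(second))


cloudy_to_overcast = encode("多云转阴")
heavy_rain = encode("大雨")
```

Ordered level lists keep related conditions numerically close, which is the relationship a plain rank-by-sort encoding would lose.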

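3.4 Filling Missing Data with an Overfitted Model

With the code from the previous sections, the load data can be quickly reformatted and the date and weather features extracted. Analysis of the total daily consumption in November 2016 shows that the large customer 1416 has two days of missing consumption, which makes the totals for those two days abnormally low. This suggests: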

1) classify the users and handle each class separately;

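2) fill in the missing consumption of such large customers to offset the effect of incidental events on the consumption trend, build a model on the filled data to predict the daily trend, and combine it with the predictions of a model built on the real (unfilled) consumption to obtain the final result.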

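Because this model is used to fill in missing data rather than to predict the future, it should deliberately exploit the consumption of the same user on both sides of a gap as well as the consumption of other users in the same period, building an overfitted model that "looks across" the day being predicted so that missing values are filled more accurately. The complete code for feature extraction and prediction with this overfitted fill model is as follows: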

-- After detailed analysis, the rules adopted for filling missing data are:
-- 1. if all 30 days of November are missing, set all of the user's historical consumption to 1;
-- 2. apart from users missing all 30 days of November, users with non_default_power_consumption_median<2500 are left untouched;
-- 3. users with more than 30 missing days in total are left untouched;
DROP TABLE IF EXISTS t_netivs_user_missing_info;
CREATE TABLE IF NOT EXISTS t_netivs_user_missing_info AS
select
    case when t11.user_id is not null then t11.user_id else t2.user_id end as user_id
    ,case when t11.missing_day_cnt is null then 0 else t11.missing_day_cnt end as missing_day_cnt
    ,case when t11.first_default_day_int is null then 0 else t11.first_default_day_int end as first_default_day_int
    ,case when t11.last_default_day_int is null then 0 else t11.last_default_day_int end as last_default_day_int
    ,case when t11.last1month_default_day_cnt is null then 0 else t11.last1month_default_day_cnt end as last1month_default_day_cnt
    ,case when t11.last2month_default_day_cnt is null then 0 else t11.last2month_default_day_cnt end as last2month_default_day_cnt
    ,case when t11.last3month_default_day_cnt is null then 0 else t11.last3month_default_day_cnt end as last3month_default_day_cnt
    ,case when t2.power_consumption_avg is null then 0 else t2.power_consumption_avg end as power_consumption_avg
    ,case when t2.power_consumption_median is null then 0 else t2.power_consumption_median end as power_consumption_median
    ,case when t2.power_consumption_max is null then 0 else t2.power_consumption_max end as power_consumption_max
    ,case when t2.power_consumption_min is null then 0 else t2.power_consumption_min end as power_consumption_min
    ,case when t2.first_non_default_day_int is null then 0 else t2.first_non_default_day_int end as first_non_default_day_int
    ,case when t2.last_non_default_day_int is null then 0 else t2.last_non_default_day_int end as last_non_default_day_int
from
(
    select * from
    (
        -- a consumption value of 1 marks a missing day
        select
            user_id
            ,count(*) as missing_day_cnt
            ,min(day_int) as first_default_day_int
            ,max(day_int) as last_default_day_int
            ,SUM(case when day_int>=20161101 and day_int<20161201 then 1 else 0 end) as last1month_default_day_cnt
            ,SUM(case when day_int>=20161001 and day_int<20161101 then 1 else 0 end) as last2month_default_day_cnt
            ,SUM(case when day_int>=20160901 and day_int<20161001 then 1 else 0 end) as last3month_default_day_cnt
        from t_netivs_ext_power
        where power_consumption=1
        group by user_id
    )t1
    where missing_day_cnt>1
)t11
FULL OUTER JOIN
(
    select
        user_id
        ,avg(power_consumption) as power_consumption_avg
        ,median(power_consumption) as power_consumption_median
        ,max(power_consumption) as power_consumption_max
        ,min(power_consumption) as power_consumption_min
        ,min(day_int) as first_non_default_day_int
        ,max(day_int) as last_non_default_day_int
    from t_netivs_ext_power
    where power_consumption<>1
    group by user_id
)t2
ON t11.user_id = t2.user_id
;

-- Build the list of (user_id, day_int) pairs to be filled with xgboost
DROP TABLE IF EXISTS t_netivs_xgb_fill_user_day_list;
DROP TABLE IF EXISTS t_netivs_gbdt_fill_user_day_list;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_user_day_list AS
SELECT
    user_id
    ,day_int
FROM
    t_netivs_ext_power
WHERE
    power_consumption = 1
    and user_id in (SELECT user_id FROM t_netivs_user_missing_info WHERE power_consumption_median>2500 and missing_day_cnt<30 and missing_day_cnt>0)
;

-- Build the list of (user_id, day_int) pairs used to train the xgboost model
DROP TABLE IF EXISTS t_netivs_xgb_fill_train_user_list;
DROP TABLE IF EXISTS t_netivs_gbdt_fill_train_user_list;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_train_user_list AS
SELECT
    user_id
    ,day_int
FROM
    t_netivs_ext_power
WHERE
    power_consumption <> 1
    and user_id in (SELECT user_id FROM t_netivs_user_missing_info WHERE power_consumption_median>2500 and missing_day_cnt<30)
;

-- Build the list of user_ids whose historical data is to be cleared entirely
DROP TABLE IF EXISTS t_netivs_clear_historical_data_user_list;
CREATE TABLE IF NOT EXISTS t_netivs_clear_historical_data_user_list AS
SELECT user_id
FROM t_netivs_user_missing_info
WHERE last1month_default_day_cnt=30
;

-- Build the GBDT training set
DROP TABLE IF EXISTS t_netivs_gbdt_fill_consumption_features;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_consumption_features AS
SELECT
    t1.user_id
    ,t1.day_int
    ,case when t2.weekly_power_consumption_avg is null then 0 else t2.weekly_power_consumption_avg end as weekly_power_consumption_avg
    ,case when t2.weekly_power_consumption_median is null then 0 else t2.weekly_power_consumption_median end as weekly_power_consumption_median
    ,case when t3.monthly_power_consumption_avg is null then 0 else t3.monthly_power_consumption_avg end as monthly_power_consumption_avg
    ,case when t3.monthly_power_consumption_median is null then 0 else t3.monthly_power_consumption_median end as monthly_power_consumption_median
    ,case when t4.last_weekly_power_consumption_avg is null then 0 else t4.last_weekly_power_consumption_avg end as last_weekly_power_consumption_avg
    ,case when t4.last_weekly_power_consumption_median is null then 0 else t4.last_weekly_power_consumption_median end as last_weekly_power_consumption_median
    ,case when t5.last_monthly_power_consumption_avg is null then 0 else t5.last_monthly_power_consumption_avg end as last_monthly_power_consumption_avg
    ,case when t5.last_monthly_power_consumption_median is null then 0 else t5.last_monthly_power_consumption_median end as last_monthly_power_consumption_median
FROM
    t_netivs_ext_power t1
LEFT OUTER JOIN
(
    SELECT
        user_id
        ,weekofyear
        ,avg(power_consumption) as weekly_power_consumption_avg
        ,median(power_consumption) as weekly_power_consumption_median
    FROM t_netivs_ext_power
    WHERE power_consumption<>1
    GROUP BY user_id,weekofyear
)t2
ON t1.user_id = t2.user_id and t1.weekofyear = t2.weekofyear
LEFT OUTER JOIN
(
    SELECT
        user_id
        ,year_month
        ,avg(power_consumption) as monthly_power_consumption_avg
        ,median(power_consumption) as monthly_power_consumption_median
    FROM t_netivs_ext_power
    WHERE power_consumption<>1
    GROUP BY user_id,year_month
)t3
ON t1.user_id = t3.user_id and t1.year_month = t3.year_month
LEFT OUTER JOIN
(
    SELECT
        user_id
        ,weekofyear
        ,avg(power_consumption) as last_weekly_power_consumption_avg
        ,median(power_consumption) as last_weekly_power_consumption_median
    FROM t_netivs_ext_power
    WHERE power_consumption<>1
    GROUP BY user_id,weekofyear
)t4
ON t1.user_id = t4.user_id and t1.weekofyear = t4.weekofyear+1
LEFT OUTER JOIN
(
    SELECT
        user_id
        ,year_month
        ,avg(power_consumption) as last_monthly_power_consumption_avg
        ,median(power_consumption) as last_monthly_power_consumption_median
    FROM t_netivs_ext_power
    WHERE power_consumption<>1
    GROUP BY user_id,year_month
)t5
-- note: year_month+1 does not roll over at year end (201512+1 = 201513), so January rows find no previous month
ON t1.user_id = t5.user_id and t1.year_month = t5.year_month+1
;

DROP TABLE IF EXISTS t_netivs_gbdt_fill_train_features;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_train_features AS
SELECT
    t1.user_id
    ,t1.day_int
    ,t2.temperature_high,t2.temperature_low
    ,t2.weather1,t2.weather1_level,t2.weather1_type
    ,t2.weather2,t2.weather2_level,t2.weather2_type
    ,t2.wind_direction,t2.wind_speed,t2.wind_speed1,t2.wind_speed2
    ,t3.day_index,t3.month_index,t3.year_index,t3.month,t3.day
    ,t3.workday,t3.weekday,t3.holiday,t3.special_workday,t3.special_holiday
    ,t3.day1_before_special_holiday,t3.day2_before_special_holiday,t3.day3_before_special_holiday
    ,t3.day1_before_holiday,t3.day2_before_holiday,t3.day3_before_holiday
    ,t3.day1_after_special_holiday,t3.day2_after_special_holiday,t3.day3_after_special_holiday
    ,t3.day1_after_holiday,t3.day2_after_holiday,t3.day3_after_holiday
    ,t4.weekly_power_consumption_avg,t4.weekly_power_consumption_median
    ,t4.monthly_power_consumption_avg,t4.monthly_power_consumption_median
    ,t4.last_weekly_power_consumption_avg,t4.last_weekly_power_consumption_median
    ,t4.last_monthly_power_consumption_avg,t4.last_monthly_power_consumption_median
    ,t5.power_consumption
FROM
    t_netivs_gbdt_fill_train_user_list t1
LEFT OUTER JOIN
    t_netivs_encode_weather t2
ON t1.day_int = t2.day_int
LEFT OUTER JOIN
    t_netivs_date_features t3
ON t1.day_int = t3.day_int
LEFT OUTER JOIN
    t_netivs_gbdt_fill_consumption_features t4
ON t1.user_id = t4.user_id and t1.day_int = t4.day_int
LEFT OUTER JOIN
    t_netivs_ext_power t5
ON t1.user_id = t5.user_id and t1.day_int = t5.day_int
;

-- Build the test set for the gbdt fill
DROP TABLE IF EXISTS t_netivs_gbdt_fill_test_features;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_test_features AS
SELECT
    t1.user_id
    ,t1.day_int
    ,t2.temperature_high,t2.temperature_low
    ,t2.weather1,t2.weather1_level,t2.weather1_type
    ,t2.weather2,t2.weather2_level,t2.weather2_type
    ,t2.wind_direction,t2.wind_speed,t2.wind_speed1,t2.wind_speed2
    ,t3.day_index,t3.month_index,t3.year_index,t3.month,t3.day
    ,t3.workday,t3.weekday,t3.holiday,t3.special_workday,t3.special_holiday
    ,t3.day1_before_special_holiday,t3.day2_before_special_holiday,t3.day3_before_special_holiday
    ,t3.day1_before_holiday,t3.day2_before_holiday,t3.day3_before_holiday
    ,t3.day1_after_special_holiday,t3.day2_after_special_holiday,t3.day3_after_special_holiday
    ,t3.day1_after_holiday,t3.day2_after_holiday,t3.day3_after_holiday
    ,t4.weekly_power_consumption_avg,t4.weekly_power_consumption_median
    ,t4.monthly_power_consumption_avg,t4.monthly_power_consumption_median
    ,t4.last_weekly_power_consumption_avg,t4.last_weekly_power_consumption_median
    ,t4.last_monthly_power_consumption_avg,t4.last_monthly_power_consumption_median
FROM
    t_netivs_gbdt_fill_user_day_list t1
LEFT OUTER JOIN
    t_netivs_encode_weather t2
ON t1.day_int = t2.day_int
LEFT OUTER JOIN
    t_netivs_date_features t3
ON t1.day_int = t3.day_int
LEFT OUTER JOIN
    t_netivs_gbdt_fill_consumption_features t4
ON t1.user_id = t4.user_id and t1.day_int = t4.day_int
;

-- Produce the fill values with xgboost
DROP TABLE IF EXISTS t_netivs_xgb_fill_prediction_result;
DROP OFFLINEMODEL IF EXISTS m_xgb_fill_model;

-- train
PAI
-name xgboost
-project algo_public
-Deta="0.01"
-Dobjective="reg:linear"
-DitemDelimiter=","
-Dseed="0"
-Dnum_round="3500"
-DlabelColName="power_consumption"
-DinputTableName="t_netivs_gbdt_fill_train_features"
-DenableSparse="false"
-Dmax_depth="8"
-Dsubsample="0.4"
-Dcolsample_bytree="0.6"
-DmodelName="m_xgb_fill_model"
-Dgamma="0"
-Dlambda="50"
-DfeatureColNames="user_id,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,day_index,month_index,year_index,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday,wind_speed,wind_speed1,wind_speed2,weekly_power_consumption_avg,weekly_power_consumption_median,monthly_power_consumption_avg,monthly_power_consumption_median,last_weekly_power_consumption_avg,last_weekly_power_consumption_median,last_monthly_power_consumption_avg,last_monthly_power_consumption_median"
-Dbase_score="0.11"
-Dmin_child_weight="100"
-DkvDelimiter=":";

-- predict
PAI
-name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DappendColNames="user_id,day_int"
-DmodelName="m_xgb_fill_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="t_netivs_xgb_fill_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DfeatureColNames="user_id,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,day_index,month_index,year_index,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday,wind_speed,wind_speed1,wind_speed2,weekly_power_consumption_avg,weekly_power_consumption_median,monthly_power_consumption_avg,monthly_power_consumption_median,last_weekly_power_consumption_avg,last_weekly_power_consumption_median,last_monthly_power_consumption_avg,last_monthly_power_consumption_median"
-DinputTableName="t_netivs_gbdt_fill_test_features"
-DenableSparse="false";

SELECT * FROM t_netivs_xgb_fill_prediction_result ORDER BY prediction_result desc limit 100;

-- Build the corrected per-user daily consumption
-- every user_id in t_netivs_clear_historical_data_user_list is cleared
DROP TABLE IF EXISTS t_netivs_fixed_ext_power;
CREATE TABLE IF NOT EXISTS t_netivs_fixed_ext_power AS
SELECT
    t1.user_id
    ,t1.day_int
    --,cast(case when t2.prediction_result is not null then round(t2.prediction_result,0) when t3.user_id is not null then 1 else power_consumption end as bigint) as power_consumption
    ,cast(case when t2.prediction_result is not null then round(t2.prediction_result,0) else power_consumption end as bigint) as power_consumption
FROM
    t_netivs_ext_power t1
LEFT OUTER JOIN
    -- the prediction step above writes t_netivs_xgb_fill_prediction_result (the original joined a gbdt-named table that is never created)
    t_netivs_xgb_fill_prediction_result t2
ON t1.user_id = t2.user_id and t1.day_int = t2.day_int
LEFT OUTER JOIN
    t_netivs_clear_historical_data_user_list t3
ON t1.user_id = t3.user_id
;

-- Build the corrected daily total consumption
DROP TABLE IF EXISTS t_netivs_fixed_daily_sum_consumption;
CREATE TABLE IF NOT EXISTS t_netivs_fixed_daily_sum_consumption AS
SELECT
    t1.day_int
    ,t1.power_consumption
    ,t2.fixed_power_consumption
    ,t3.day_index,t3.month_index,t3.year_index,t3.month,t3.day,t3.month_day,t3.year_month
    ,t3.workday,t3.weekofyear,t3.day_to_lastday,t3.weekday
    ,t3.holiday,t3.special_workday,t3.special_holiday
    ,t3.day1_before_special_holiday,t3.day2_before_special_holiday,t3.day3_before_special_holiday
    ,t3.day1_before_holiday,t3.day2_before_holiday,t3.day3_before_holiday
    ,t3.day1_after_special_holiday,t3.day2_after_special_holiday,t3.day3_after_special_holiday
    ,t3.day1_after_holiday,t3.day2_after_holiday,t3.day3_after_holiday
FROM
    t_netivs_daily_sum_consumption t1
LEFT OUTER JOIN
(
    SELECT
        day_int
        ,SUM(power_consumption) as fixed_power_consumption
    FROM
        t_netivs_fixed_ext_power
    GROUP BY
        day_int
)t2
ON t1.day_int = t2.day_int
LEFT OUTER JOIN
    t_netivs_date_features t3
ON t1.day_int = t3.day_int
;
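The user-selection conditions in the SQL of this section can be summarized in a few lines of Python. The sketch reads its thresholds from the WHERE clauses above (a median above 2500, fewer than 30 missing days, and all 30 November days missing); the function name classify_user and the three labels are hypothetical. The inputs are the per-user values stored in t_netivs_user_missing_info, where a user with a single missing day is recorded with a count of 0 because of the missing_day_cnt>1 filter.

```python
def classify_user(missing_day_cnt, power_consumption_median, last1month_default_day_cnt):
    """Decide how a user's missing days are treated.

    Returns one of:
      "clear_history": every day of November is missing
      "fill":          large user with a limited number of missing days,
                       whose gaps are predicted by the fill model
      "keep":          data is left as it is
    """
    if last1month_default_day_cnt == 30:
        return "clear_history"
    if power_consumption_median > 2500 and 0 < missing_day_cnt < 30:
        return "fill"
    return "keep"


large_user_with_gap = classify_user(2, 3000.0, 2)
silent_in_november = classify_user(30, 5000.0, 30)
```

A user with 30 missing November days always has at least 30 missing days in total, so the first two branches cannot both apply and their order does not change the result.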

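4. Model Building and Fusion

For this task we settled on using two models to predict the trend and the consumption level separately and then fusing them, as shown in the figure below:

The feature extraction and model building code for model one is as follows: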

-- Extract features of the total daily consumption
DROP TABLE IF EXISTS t_netivs_daily_sum_features;
CREATE TABLE IF NOT EXISTS t_netivs_daily_sum_features AS
SELECT
    t1.day_int
    ,t1.last_month_same_day_consumption
    --,t1.last_year_same_day_consumption
    ,t2.last_month_power_consumption_avg,t2.last_month_power_consumption_median,t2.last_month_power_consumption_stddev
    ,t2.last_month_weekday1_avg,t2.last_month_weekday1_median
    ,t2.last_month_weekday0_avg,t2.last_month_weekday0_median
    ,t2.last_month_workday1_avg,t2.last_month_workday1_median
    ,t2.last_month_workday0_avg,t2.last_month_workday0_median
    ,t2.last_month_last3day_avg,t2.last_month_last3day_median
    ,t2.last_month_last7day_avg,t2.last_month_last7day_median
    ,t2.last_month_first3day_avg,t2.last_month_first3day_median
    ,t2.last_month_first7day_avg,t2.last_month_first7day_median
    ,t2.last_month_middle_avg,t2.last_month_middle_median
FROM
(
    SELECT
        t11.day_int
        ,t21.power_consumption as last_month_same_day_consumption
        --,t31.power_consumption as last_year_same_day_consumption
    FROM
    (
        SELECT
            day_int
            ,day
            ,day_to_lastday
            -- first half of the month: same day number last month; second half: same distance from the end of last month
            ,case when day<=15 then cast(to_char(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'mm'),'yyyymmdd') as bigint)
                  else cast(to_char(dateadd(lastday(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'mm')),-day_to_lastday,'dd'),'yyyymmdd') as bigint) end
                  as last_month_same_day
            --,cast(to_char(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'yyyy'),'yyyymmdd') as bigint) as last_year_same_day
        FROM
            t_netivs_date_features
        WHERE
            day_int>=20150201
    )t11
    LEFT OUTER JOIN
        t_netivs_fixed_daily_sum_consumption t21
    ON t11.last_month_same_day = t21.day_int
    --LEFT OUTER JOIN
    --   t_netivs_fixed_daily_sum_consumption t31
    --ON t11.last_year_same_day = t31.day_int
)t1
LEFT OUTER JOIN
(
    SELECT
        t1.day_int
        ,t2.last_month_power_consumption_avg,t2.last_month_power_consumption_median,t2.last_month_power_consumption_stddev
        ,t2.last_month_weekday1_avg,t2.last_month_weekday1_median
        ,t2.last_month_weekday0_avg,t2.last_month_weekday0_median
        ,t2.last_month_workday1_avg,t2.last_month_workday1_median
        ,t2.last_month_workday0_avg,t2.last_month_workday0_median
        ,t2.last_month_last3day_avg,t2.last_month_last3day_median
        ,t2.last_month_last7day_avg,t2.last_month_last7day_median
        ,t2.last_month_first3day_avg,t2.last_month_first3day_median
        ,t2.last_month_first7day_avg,t2.last_month_first7day_median
        ,t2.last_month_middle_avg,t2.last_month_middle_median
    FROM
        (SELECT * FROM t_netivs_date_features WHERE month_index>1)t1
    LEFT OUTER JOIN
    (
        SELECT
            month_index
            ,avg(fixed_power_consumption) as last_month_power_consumption_avg
            ,median(fixed_power_consumption) as last_month_power_consumption_median
            ,stddev(fixed_power_consumption) as last_month_power_consumption_stddev
            ,avg(case when weekday=1 then fixed_power_consumption else null end) as last_month_weekday1_avg
            ,median(case when weekday=1 then fixed_power_consumption else null end) as last_month_weekday1_median
            ,avg(case when weekday=0 then fixed_power_consumption else null end) as last_month_weekday0_avg
            ,median(case when weekday=0 then fixed_power_consumption else null end) as last_month_weekday0_median
            ,avg(case when workday=1 then fixed_power_consumption else null end) as last_month_workday1_avg
            ,median(case when workday=1 then fixed_power_consumption else null end) as last_month_workday1_median
            ,avg(case when workday=0 then fixed_power_consumption else null end) as last_month_workday0_avg
            ,median(case when workday=0 then fixed_power_consumption else null end) as last_month_workday0_median
            ,avg(case when day_to_lastday<=3 then fixed_power_consumption else null end) as last_month_last3day_avg
            ,median(case when day_to_lastday<=3 then fixed_power_consumption else null end) as last_month_last3day_median
            ,avg(case when day_to_lastday<=7 then fixed_power_consumption else null end) as last_month_last7day_avg
            ,median(case when day_to_lastday<=7 then fixed_power_consumption else null end) as last_month_last7day_median
            ,avg(case when day<=3 then fixed_power_consumption else null end) as last_month_first3day_avg
            ,median(case when day<=3 then fixed_power_consumption else null end) as last_month_first3day_median
            ,avg(case when day<=7 then fixed_power_consumption else null end) as last_month_first7day_avg
            ,median(case when day<=7 then fixed_power_consumption else null end) as last_month_first7day_median
            ,avg(case when day>=14 and day_to_lastday>=14 then fixed_power_consumption else null end) as last_month_middle_avg
            ,median(case when day>=14 and day_to_lastday>=14 then fixed_power_consumption else null end) as last_month_middle_median
        FROM
        (
            SELECT
                t1.day_int,t1.power_consumption,t1.fixed_power_consumption
                ,t2.day_index,t2.month_index,t2.workday,t2.day_to_lastday,t2.day,t2.weekday,t2.holiday
            FROM
                t_netivs_fixed_daily_sum_consumption t1
            LEFT OUTER JOIN
                t_netivs_date_features t2
            ON t1.day_int = t2.day_int
        )t2_1
        GROUP BY
            month_index
    )t2
    ON t1.month_index = t2.month_index+1
)t2
ON t1.day_int = t2.day_int
;

-- Merge the features
DROP TABLE IF EXISTS t_netivs_all_online_features;
CREATE TABLE IF NOT EXISTS t_netivs_all_online_features AS
SELECT
    t1.*
    ,t2.temperature_high,t2.temperature_low
    ,t2.weather1,t2.weather1_level,t2.weather1_type
    ,t2.weather2,t2.weather2_level,t2.weather2_type
    ,t2.wind_direction,t2.wind_speed,t2.wind_speed1,t2.wind_speed2
    ,t3.day_index,t3.month_index,t3.year_index,t3.month,t3.day
    ,t3.workday,t3.weekday,t3.holiday,t3.special_workday,t3.special_holiday
    ,t3.day1_before_special_holiday,t3.day2_before_special_holiday,t3.day3_before_special_holiday
    ,t3.day1_before_holiday,t3.day2_before_holiday,t3.day3_before_holiday
    ,t3.day1_after_special_holiday,t3.day2_after_special_holiday,t3.day3_after_special_holiday
    ,t3.day1_after_holiday,t3.day2_after_holiday,t3.day3_after_holiday
FROM
    t_netivs_daily_sum_features t1
LEFT OUTER JOIN
    t_netivs_encode_weather t2
ON t1.day_int = t2.day_int
LEFT OUTER JOIN
    t_netivs_date_features t3
ON t1.day_int = t3.day_int
;

-- Build the training set
DROP TABLE IF EXISTS t_netivs_online_train_features;
CREATE TABLE IF NOT EXISTS t_netivs_online_train_features AS
SELECT
    t1.*
    ,t2.power_consumption
    ,t2.fixed_power_consumption
FROM
    (SELECT * FROM t_netivs_all_online_features WHERE day_int<20161201)t1
LEFT OUTER JOIN
    t_netivs_fixed_daily_sum_consumption t2
ON t1.day_int = t2.day_int
;

-- Build the test set
DROP TABLE IF EXISTS t_netivs_online_test_features;
CREATE TABLE IF NOT EXISTS t_netivs_online_test_features AS
SELECT * FROM t_netivs_all_online_features WHERE day_int>=20161201
;

-- Run xgboost
DROP TABLE IF EXISTS t_netivs_online_xgb_prediction_result;
DROP OFFLINEMODEL IF EXISTS m_online_xgb_model;

-- train
PAI
-name xgboost
-project algo_public
-Deta="0.01"
-Dobjective="reg:linear"
-DitemDelimiter=","
-Dseed="0"
-Dnum_round="3500"
-DlabelColName="power_consumption"
-DinputTableName="t_netivs_online_train_features"
-DenableSparse="false"
-Dmax_depth="8"
-Dsubsample="0.4"
-Dcolsample_bytree="0.6"
-DmodelName="m_online_xgb_model"
-Dgamma="0"
-Dlambda="50"
-DfeatureColNames="last_month_same_day_consumption,last_month_power_consumption_avg,last_month_power_consumption_median,last_month_power_consumption_stddev,last_month_weekday1_avg,last_month_weekday1_median,last_month_weekday0_avg,last_month_weekday0_median,last_month_workday1_avg,last_month_workday1_median,last_month_workday0_avg,last_month_workday0_median,last_month_last3day_avg,last_month_last3day_median,last_month_last7day_avg,last_month_last7day_median,last_month_first3day_avg,last_month_first3day_median,last_month_first7day_avg,last_month_first7day_median,last_month_middle_avg,last_month_middle_median,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
-Dbase_score="0.11"
-Dmin_child_weight="100"
-DkvDelimiter=":";

-- predict
PAI
-name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DappendColNames="day_int"
-DmodelName="m_online_xgb_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="t_netivs_online_xgb_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DfeatureColNames="last_month_same_day_consumption,last_month_power_consumption_avg,last_month_power_consumption_median,last_month_power_consumption_stddev,last_month_weekday1_avg,last_month_weekday1_median,last_month_weekday0_avg,last_month_weekday0_median,last_month_workday1_avg,last_month_workday1_median,last_month_workday0_avg,last_month_workday0_median,last_month_last3day_avg,last_month_last3day_median,last_month_last7day_avg,last_month_last7day_median,last_month_first3day_avg,last_month_first3day_median,last_month_first7day_avg,last_month_first7day_median,last_month_middle_avg,last_month_middle_median,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
-DinputTableName="t_netivs_online_test_features"
-DenableSparse="false";

select * from t_netivs_online_xgb_prediction_result ORDER BY day_int limit 100;
select avg(prediction_result) from t_netivs_online_xgb_prediction_result;

The feature extraction, model construction, and model fusion code for model two is as follows:

-- Build features from the same period in history (same day last month and last year)
DROP TABLE IF EXISTS t_netivs_same_period_feature;
CREATE TABLE IF NOT EXISTS t_netivs_same_period_feature AS
SELECT
    t1.day_int, t1.day, t1.day_to_lastday,
    t2.power_consumption as last_month_same_day_consumption,
    t3.power_consumption as last_year_same_day_consumption,
    t4.last_month_power_consumption_median, t4.last_month_weekday1_median, t4.last_month_weekday0_median,
    t4.last_month_workday1_median, t4.last_month_last3day_median, t4.last_month_last7day_median,
    t4.last_month_first3day_median, t4.last_month_first7day_median, t4.last_month_middle_median,
    t4.last_month_power_consumption_avg, t4.last_month_weekday1_avg, t4.last_month_weekday0_avg,
    t4.last_month_workday1_avg, t4.last_month_last3day_avg, t4.last_month_last7day_avg,
    t4.last_month_first3day_avg, t4.last_month_first7day_avg, t4.last_month_middle_avg,
    t5.power_consumption, t5.fixed_power_consumption,
    t6.temperature_high, t6.temperature_low,
    t6.weather1, t6.weather1_level, t6.weather1_type,
    t6.weather2, t6.weather2_level, t6.weather2_type,
    t6.wind_direction, t6.wind_speed, t6.wind_speed1, t6.wind_speed2,
    t7.month, t7.workday, t7.weekday, t7.holiday, t7.special_workday, t7.special_holiday,
    t7.day1_before_special_holiday, t7.day2_before_special_holiday, t7.day3_before_special_holiday,
    t7.day1_before_holiday, t7.day2_before_holiday, t7.day3_before_holiday,
    t7.day1_after_special_holiday, t7.day2_after_special_holiday, t7.day3_after_special_holiday,
    t7.day1_after_holiday, t7.day2_after_holiday, t7.day3_after_holiday
FROM
(
    SELECT
        day_int, day, day_to_lastday,
        -- First half of the month: same calendar day of the previous month.
        -- Second half: same distance from the end of the previous month.
        case when day<=15
            then cast(to_char(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'mm'),'yyyymmdd') as bigint)
            else cast(to_char(dateadd(lastday(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'mm')),-day_to_lastday,'dd'),'yyyymmdd') as bigint)
        end as last_month_same_day,
        cast(to_char(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'yyyy'),'yyyymmdd') as bigint) as last_year_same_day
    FROM t_netivs_date_features
    WHERE day_int>=20160101
) t1
LEFT OUTER JOIN t_netivs_fixed_daily_sum_consumption t2
ON t1.last_month_same_day = t2.day_int
LEFT OUTER JOIN t_netivs_fixed_daily_sum_consumption t3
ON t1.last_year_same_day = t3.day_int
LEFT OUTER JOIN t_netivs_dail_sum_features t4
ON t1.day_int = t4.day_int
LEFT OUTER JOIN t_netivs_fixed_daily_sum_consumption t5
ON t1.day_int = t5.day_int
LEFT OUTER JOIN t_netivs_encode_weather t6
ON t1.day_int = t6.day_int
LEFT OUTER JOIN t_netivs_date_features t7
ON t1.day_int = t7.day_int
;

-- Build the training set
DROP TABLE IF EXISTS t_netivs_online_historical_train_features;
CREATE TABLE IF NOT EXISTS t_netivs_online_historical_train_features AS
SELECT * FROM t_netivs_same_period_feature WHERE day_int<20161201
;

-- Build the test set
DROP TABLE IF EXISTS t_netivs_online_historical_test_features;
CREATE TABLE IF NOT EXISTS t_netivs_online_historical_test_features AS
SELECT * FROM t_netivs_same_period_feature WHERE day_int>=20161201
;

-- Run XGBoost
DROP TABLE IF EXISTS t_netivs_online_historical_xgb_prediction_result;
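DROP OFFLINEMODEL IF EXISTS m_online_historical_xgb_model;

-- train
-- Caution: featureColNames below also lists power_consumption and fixed_power_consumption, i.e. the target columns themselves.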
PAI
-name xgboost
-project algo_public
-Deta="0.01"
---Dobjective="reg:linear"
-Dobjective="reg:linear"
-DitemDelimiter=","
-Dseed="0"
-Dnum_round="4000"
-DlabelColName="power_consumption"
-DinputTableName="t_netivs_online_historical_train_features"
-DenableSparse="false"
-Dmax_depth="8"
-Dsubsample="0.8"
-Dcolsample_bytree="0.8"
-DmodelName="m_online_historical_xgb_model"
-Dgamma="0"
-Dlambda="50"
-DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
---Dbase_score="0.11"
-Dmin_child_weight="50"
-DkvDelimiter=":";

-- predict
PAI
-name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DappendColNames="day_int"
-DmodelName="m_online_historical_xgb_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="t_netivs_online_historical_xgb_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
-DinputTableName="t_netivs_online_historical_test_features"
-DenableSparse="false";

select * from t_netivs_online_historical_xgb_prediction_result ORDER BY day_int limit 100;

-- Use the unadjusted power_consumption
DROP TABLE IF EXISTS t_netivs_online_historical_fixed_xgb_prediction_result;
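DROP OFFLINEMODEL IF EXISTS m_online_historical_fixed_xgb_model;

-- train
-- Caution: featureColNames below also lists power_consumption and fixed_power_consumption, i.e. the target columns themselves.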
PAI
-name xgboost
-project algo_public
-Deta="0.01"
---Dobjective="reg:linear"
-Dobjective="reg:linear"
-DitemDelimiter=","
-Dseed="0"
-Dnum_round="4000"
-DlabelColName="fixed_power_consumption"
-DinputTableName="t_netivs_online_historical_train_features"
-DenableSparse="false"
-Dmax_depth="8"
-Dsubsample="0.8"
-Dcolsample_bytree="0.8"
-DmodelName="m_online_historical_fixed_xgb_model"
-Dgamma="0"
-Dlambda="50"
-DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
---Dbase_score="0.11"
-Dmin_child_weight="50"
-DkvDelimiter=":";

-- predict
PAI
-name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DappendColNames="day_int"
-DmodelName="m_online_historical_fixed_xgb_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="t_netivs_online_historical_fixed_xgb_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
-DinputTableName="t_netivs_online_historical_test_features"
-DenableSparse="false";

select * from t_netivs_online_historical_fixed_xgb_prediction_result ORDER BY day_int limit 100;

-- Fuse the two models: model one plus 0.05 times the model trained on fixed_power_consumption
DROP TABLE IF EXISTS t_netivs_xgb_ensemble_result;
CREATE TABLE IF NOT EXISTS t_netivs_xgb_ensemble_result AS
SELECT t1.day_int, t1.prediction_result + t2.prediction_result*0.05 as prediction_result
FROM t_netivs_online_xgb_prediction_result t1
LEFT OUTER JOIN t_netivs_online_historical_fixed_xgb_prediction_result t2
ON t1.day_int = t2.day_int
ORDER BY day_int limit 61
;

SELECT avg(prediction_result) FROM t_netivs_xgb_ensemble_result;
SELECT * FROM t_netivs_xgb_ensemble_result ORDER BY day_int limit 100;

-- Write the submission table, converting day_int back to the unpadded yyyy/m/d format
INSERT OVERWRITE TABLE tianchi_power_answer
SELECT
    concat(to_char(datepart(ds,'yyyy')),'/',to_char(datepart(ds,'mm')),'/',to_char(datepart(ds,'dd'))) as predict_date,
    cast(round(power_consumption,0) as bigint) as power_consumption
FROM
(
    SELECT to_date(cast(day_int as string),'yyyymmdd') as ds, prediction_result as power_consumption
    FROM t_netivs_xgb_ensemble_result
) t1
;
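
The date handling and the fusion step in the SQL above can be hard to follow inside nested function calls, so here is a minimal Python sketch of the same logic for checking values by hand. It is only an illustration, not part of the competition pipeline. The function names are hypothetical, and `days_to_lastday` assumes that the `day_to_lastday` column means "number of days remaining until the last day of the month", which the SQL suggests but this section does not define.

```python
import calendar
from datetime import date, timedelta


def days_to_lastday(d: date) -> int:
    """Assumed meaning of the day_to_lastday column: days left until month end."""
    return calendar.monthrange(d.year, d.month)[1] - d.day


def last_month_same_day(d: date, day_to_lastday: int) -> date:
    """Mirror of the CASE expression that computes last_month_same_day."""
    # Last day of the previous month = first day of this month minus one day.
    prev_month_end = d.replace(day=1) - timedelta(days=1)
    if d.day <= 15:
        # First half of the month: same calendar day, one month earlier.
        return prev_month_end.replace(day=d.day)
    # Second half: keep the same distance from the end of the month.
    return prev_month_end - timedelta(days=day_to_lastday)


def to_answer_date(day_int: int) -> str:
    """Convert 20160901 to '2016/9/1', the unpadded format of predict_date."""
    year, rest = divmod(day_int, 10000)
    month, day = divmod(rest, 100)
    return f"{year}/{month}/{day}"


def blend(model_one: float, model_two_fixed: float, weight: float = 0.05) -> float:
    """Fusion used in t_netivs_xgb_ensemble_result: model one plus a small share of model two."""
    return model_one + weight * model_two_fixed


if __name__ == "__main__":
    d = date(2016, 12, 20)
    print(last_month_same_day(d, days_to_lastday(d)))
    print(to_answer_date(20161201))
    print(blend(100.0, 200.0))
```

Counting from the end of the month in the second half keeps month-end days aligned with month-end days, even when the two months have different lengths.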

5. Summary and Outlook

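Using the Power AI competition hosted on the Alibaba Cloud Tianchi big data platform (https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100066.0.0.3f6e7d83UaNT4W&raceId=231602) as an example, this article walked through the whole workflow of power system load forecasting on Alibaba Cloud MaxCompute and gave all of the core code. Code keeps no secrets: reading through it shows how capable MaxCompute is and how flexible its open interfaces are. Because of the restrictions of the competition platform, many features on the Alibaba Cloud MaxCompute platform that can assist development were not shown here, such as DataV for visualization and the Quick BI business intelligence engine. Combining load forecasting with these products makes it straightforward to build power system applications with polished interfaces and rich functionality.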
