Author: 大树

Updated: 01.14

Email: 59888745@qq.com

Data processing, machine learning


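Case study: the Alibaba Tianchi Dahang Cup "智造扬中" Electricity AI Competition

In this post I work through the Dahang Cup "智造扬中" Electricity AI Competition as a case study, following the usual industry workflow step by step:

Business scenario definition: define the core objective and describe the key scenarios.
Business rule review: extract the business rules and analyze how they interact.
Quantitative data analysis: multi-dimensional analysis and handling of anomalies.
Model design: tailor the model to the application scenario and tune its parameters.
Computation and result analysis: produce the model output and validate it against the business.

For an introduction to the competition, see: https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100066.333.2.mnbu1L&raceId=231602

1. Business scenario definition
a. For an introduction to the competition, see the URL above.
b. The business requirement is to analyze roughly two years of historical electricity consumption by enterprises in the high-tech zone of Yangzhong, Zhenjiang, Jiangsu, and to use that history to predict the consumption of every day in the following month (here, October).
c. Key scenarios related to consumption: this is a high-tech industrial development zone whose users mostly keep office hours, so working days, weekends, public holidays and the season (summer or winter) all matter.
2. Business rule review
a. This is a typical regression problem, very similar to our traffic forecasting. Let us see how to make such a forecast in a data-driven way.
3. Quantitative data analysis
3.1 Load the data and take a first look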

In [2]:

import numpy as np
import pandas as pd

_df = pd.read_csv("tianchi_powerdata/zhenjiang_power.csv")
_df.head()

Out[2]:

	record_date	user_id	power_consumption
0	2015/1/1	1	1135
1	2015/1/2	1	570
2	2015/1/3	1	3418
3	2015/1/4	1	3968
4	2015/1/5	1	3986
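The file holds one row per user per day, while the full code at the end of this post sums consumption over all users for each date. A minimal sketch of that aggregation, using made-up rows in place of the real CSV (the values and the `raw` and `daily` names are assumptions for illustration):

```python
import pandas as pd

# Made-up rows with the same columns as the competition file.
raw = pd.DataFrame({
    "record_date": ["2015/1/1", "2015/1/1", "2015/1/2", "2015/1/2"],
    "user_id": [1, 2, 1, 2],
    "power_consumption": [1135, 200, 570, 300],
})
raw["record_date"] = pd.to_datetime(raw["record_date"])

# One row per day: total consumption summed over all users.
daily = raw.groupby("record_date", as_index=False)["power_consumption"].sum()
print(daily)
```

Working on daily totals keeps the training table small (one row per day) and matches the scale of the predicted values shown later in the post.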

In [ ]:

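3.2 Data cleaning (outliers, default values, nulls, duplicates, date formats, etc.)
Common ways to handle missing values; which one applies depends on the business case:
dropna(), e.g. dropna(axis=0, how='all') drops rows in which every value is missing; thresh=3 instead keeps only rows with at least 3 non-missing values (how and thresh cannot be combined)
fillna(0) fills with a constant; fillna(d.mean()) fills with the column mean
isnull(),
notnull(),
drop_duplicates() removes duplicates, e.g. _df.drop_duplicates(['user_id','record_date'])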
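The methods listed above can be tried on a small frame before touching the real data. This is only a sketch: the rows are made up, and whether to drop or fill depends on the business case.

```python
import numpy as np
import pandas as pd

# Toy frame with the same columns as the competition file (values are made up).
df = pd.DataFrame({
    "record_date": ["2015/1/1", "2015/1/1", "2015/1/2", "2015/1/3"],
    "user_id": [1, 1, 1, 1],
    "power_consumption": [1135.0, 1135.0, np.nan, 3418.0],
})

# Count missing values per column before deciding how to handle them.
null_counts = df.isnull().sum()

# Drop repeats of the (user_id, record_date) key, keeping the first row.
deduped = df.drop_duplicates(["user_id", "record_date"])

# Option A: drop rows that contain any missing value.
dropped = deduped.dropna(axis=0, how="any")

# Option B: fill missing consumption with the column mean.
mean_value = deduped["power_consumption"].mean()
filled = deduped.fillna({"power_consumption": mean_value})

print(null_counts)
print(filled)
```

Note that each of these calls returns a new frame; the result has to be assigned to a name for the change to persist.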

In [11]:

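import numpy as np
import pandas as pd

_df = pd.read_csv("tianchi_powerdata/zhenjiang_power.csv")
_df.head()
#_df.shape
# Drop rows in which every value is missing
_df = _df.dropna(axis=0, how='all')
# Drop repeated (user_id, record_date) rows; the result must be assigned back
_df = _df.drop_duplicates(['user_id','record_date'])
# Parse the date strings into datetime values
_df['record_date'] = pd.to_datetime(_df['record_date'])
_df.head()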

Out[11]:

	record_date	user_id	power_consumption
0	2015-01-01	1	1135
1	2015-01-02	1	570
2	2015-01-03	1	3418
3	2015-01-04	1	3968
4	2015-01-05	1	3986

Building strong time-related features

In [7]:

import numpy as np
import pandas as pd

_df = pd.read_csv("tianchi_powerdata/zhenjiang_power.csv")
_df = _df.dropna(axis=0, how='all')
_df = _df.drop_duplicates(['user_id','record_date'])
_df['record_date'] = pd.to_datetime(_df['record_date'])
_df.head()

# Create placeholder rows for the days to predict, 2016-10-01 to 2016-10-31
test_df = pd.date_range('2016-10-1', periods=31, freq='D')
test_df = pd.DataFrame(test_df, columns=['record_date'])
test_df['power_consumption'] = 0.0
test_df

# The placeholder rows have no user_id, so dropna() must not be applied here:
# it would remove exactly the rows we want to predict
total_df = pd.concat([_df, test_df])
total_df.tail()

# Time-related features
total_df['day_of_week'] = total_df['record_date'].apply(lambda x: x.dayofweek)
total_df['day_of_month'] = total_df['record_date'].apply(lambda x: x.day)
total_df['day_of_year'] = total_df['record_date'].apply(lambda x: x.dayofyear)
total_df['month_of_year'] = total_df['record_date'].apply(lambda x: x.month)
total_df['year'] = total_df['record_date'].apply(lambda x: x.year)

# Working day or weekend: consumption on Saturday and Sunday clearly differs from working days
total_df['holiday'] = 0
total_df['holiday_sat'] = 0
total_df['holiday_sun'] = 0
total_df.loc[total_df.day_of_week == 5, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 5, 'holiday_sat'] = 1
total_df.loc[total_df.day_of_week == 6, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 6, 'holiday_sun'] = 1

# Which of the four weeks of the month the day falls in
def week_of_month(day):
    if day in range(1, 8):
        return 1
    if day in range(8, 15):
        return 2
    if day in range(15, 22):
        return 3
    if day in range(22, 32):
        return 4

total_df['week_of_month'] = total_df['day_of_month'].apply(week_of_month)
total_df.head()

# First, middle or last ten days of the month; some enterprises schedule work
# by these periods, which may also affect consumption
def period_of_month(day):
    if day in range(1, 11):
        return 1
    if day in range(11, 21):
        return 2
    if day in range(21, 32):
        return 3

total_df['period_of_month'] = total_df['day_of_month'].apply(period_of_month)
total_df.head()

# First or second half of the month
def period2_of_month(day):
    if day in range(1, 16):
        return 1
    if day in range(16, 32):
        return 2

total_df['period2_of_month'] = total_df['day_of_month'].apply(period2_of_month)
total_df.head()

# Festival flag, filled in by hand. Public holidays have a large effect: most
# enterprises close and consumption drops sharply. We look up the calendar and
# mark whether each day is a festival. Only the October 2016 National Day week
# is listed; earlier holidays would have to be added the same way for the model
# to learn anything from this flag.
def day_of_festival(day):
    l_festival = ['2016-10-01','2016-10-02','2016-10-03','2016-10-04',
                  '2016-10-05','2016-10-06','2016-10-07']
    # Compare the date as a 'YYYY-MM-DD' string against the list
    return 1 if day.strftime('%Y-%m-%d') in l_festival else 0

total_df['festival_pc'] = 0
# Apply to record_date (the original applied it to the all-zero festival column)
total_df['festival'] = total_df['record_date'].apply(day_of_festival)
total_df.head(20)

Out[7]:

	power_consumption	record_date	user_id	day_of_week	day_of_month	day_of_year	month_of_year	year	holiday	holiday_sat	holiday_sun	week_of_month	period_of_month	period2_of_month
0	1135.0	2015-01-01	1.0	3	1	1	1	2015	0	0	0	1	1	1
1	570.0	2015-01-02	1.0	4	2	2	1	2015	0	0	0	1	1	1
2	3418.0	2015-01-03	1.0	5	3	3	1	2015	1	1	0	1	1	1
3	3968.0	2015-01-04	1.0	6	4	4	1	2015	1	0	1	1	1	1
4	3986.0	2015-01-05	1.0	0	5	5	1	2015	0	0	0	1	1	1
5	4082.0	2015-01-06	1.0	1	6	6	1	2015	0	0	0	1	1	1
6	4172.0	2015-01-07	1.0	2	7	7	1	2015	0	0	0	1	1	1
7	4022.0	2015-01-08	1.0	3	8	8	1	2015	0	0	0	2	1	1
8	4025.0	2015-01-09	1.0	4	9	9	1	2015	0	0	0	2	1	1
9	4047.0	2015-01-10	1.0	5	10	10	1	2015	1	1	0	2	1	1
10	4135.0	2015-01-11	1.0	6	11	11	1	2015	1	0	1	2	2	1
11	4111.0	2015-01-12	1.0	0	12	12	1	2015	0	0	0	2	2	1
12	3926.0	2015-01-13	1.0	1	13	13	1	2015	0	0	0	2	2	1
13	4244.0	2015-01-14	1.0	2	14	14	1	2015	0	0	0	2	2	1
14	4144.0	2015-01-15	1.0	3	15	15	1	2015	0	0	0	3	2	1
15	4269.0	2015-01-16	1.0	4	16	16	1	2015	0	0	0	3	2	2
16	4262.0	2015-01-17	1.0	5	17	17	1	2015	1	1	0	3	2	2
17	2782.0	2015-01-18	1.0	6	18	18	1	2015	1	0	1	3	2	2
18	3327.0	2015-01-19	1.0	0	19	19	1	2015	0	0	0	3	2	2
19	4002.0	2015-01-20	1.0	1	20	20	1	2015	0	0	0	3	2	2
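The same calendar features can also be computed column-wise with the pandas `.dt` accessor instead of `apply` with a Python function. The sketch below is an alternative under the bucket definitions used above (weeks 1 to 4, ten-day periods 1 to 3); the short date range is made up so the month boundary is visible.

```python
import pandas as pd

# Five days spanning the end of September and the start of October 2016.
dates = pd.DataFrame({"record_date": pd.date_range("2016-09-29", periods=5, freq="D")})
d = dates["record_date"].dt

dates["day_of_week"] = d.dayofweek                 # Monday=0 ... Sunday=6
dates["day_of_month"] = d.day
dates["holiday"] = (d.dayofweek >= 5).astype(int)  # Saturday or Sunday

# Days 1-7 -> 1, 8-14 -> 2, 15-21 -> 3, 22-31 -> 4
dates["week_of_month"] = ((d.day - 1) // 7 + 1).clip(upper=4)
# Days 1-10 -> 1, 11-20 -> 2, 21-31 -> 3
dates["period_of_month"] = ((d.day - 1) // 10 + 1).clip(upper=3)

# Festival flag from a hand-maintained list of dates.
festival_days = pd.to_datetime([
    "2016-10-01", "2016-10-02", "2016-10-03", "2016-10-04",
    "2016-10-05", "2016-10-06", "2016-10-07",
])
dates["festival"] = dates["record_date"].isin(festival_days).astype(int)
print(dates)
```

On a table with hundreds of thousands of rows the vectorized form is usually noticeably faster than row-by-row `apply`, and comparing real dates avoids string-format mismatches in the festival lookup.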

In [6]:

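# Feature columns available so far:
# date, consumption, day of week, day of month, day of year, month of year, year,
# weekend flags, week of month, ten-day period of month, half of month, festival flag
col_names = total_df.columns.values
col_names

# Count missing values per column to confirm the training data has none
counts = {}
for name in col_names:
    count = total_df[name].isnull().sum()
    counts[name] = [count]
is_null_filds = pd.DataFrame(counts)
is_null_filds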

Out[6]:

	day_of_month	day_of_week	day_of_year	festival	festival_pc	holiday	holiday_sat	holiday_sun	month_of_year	period2_of_month	period_of_month	power_consumption	record_date	user_id	week_of_month	year
0	0	0	0	0	0	0	0	0	0	0	0	0	0	31	0	0

In [ ]:

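# 4. Model design
Covers tailoring to the application scenario and setting the model parameters.
Splitting the training and test sets
We split the data by date into a training set and a test set for the modeling that follows.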

In [15]:

## Everything except October 2016 is the training set
train_X = total_df[~((total_df.year==2016)&(total_df.month_of_year==10))]
test_X = total_df[((total_df.year==2016)&(total_df.month_of_year==10))]
train_y = train_X.power_consumption
train_X = train_X.drop(['power_consumption','record_date','year'],axis=1)
test_X = test_X.drop(['power_consumption','record_date','year'],axis=1)
train_X.head()

Out[15]:

	user_id	day_of_week	day_of_month	day_of_year	month_of_year	holiday	holiday_sat	holiday_sun	week_of_month	period_of_month	period2_of_month
0	1.0	3	1	1	1	0	0	0	1	1	1
1	1.0	4	2	2	1	0	0	0	1	1	1
2	1.0	5	3	3	1	1	1	0	1	1	1
3	1.0	6	4	4	1	1	0	1	1	1	1
4	1.0	0	5	5	1	0	0	0	1	1	1

In [16]:

train_X.shape

Out[16]:

(885468, 13)

In [ ]:

# 5. Modeling and tuning: grid search with cross-validation to find the best parameters
# DecisionTree

In [17]:

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# max_features as a float is a fraction of the features; 1.0 means all of them
# (the integer 1 would mean a single feature per split)
param_grid = {'max_features': [0.7, 0.8, 0.9, 1.0],
              'max_depth': [3, 5, 7, 9, 12]}
dt = DecisionTreeRegressor()
grid = GridSearchCV(dt, param_grid=param_grid, cv=5, n_jobs=8, refit=True)
grid.fit(train_X, train_y)
best_dt_reg = grid.best_estimator_
print(best_dt_reg)
print(best_dt_reg.score(train_X, train_y))

DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=0.9,max_leaf_nodes=None, min_impurity_decrease=0.0,min_impurity_split=None, min_samples_leaf=1,min_samples_split=2, min_weight_fraction_leaf=0.0,presort=False, random_state=None, splitter='best')
0.919640509654

In [ ]:

# Check how well the model fits the training set

In [18]:

best_dt_reg.score(train_X, train_y)

Out[18]:

0.91964050965438204
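A score computed on the training set only shows how well the model fits data it has already seen. One way to estimate how it might do on a future month is to hold out the most recent month for validation. The sketch below uses a synthetic daily series (made up, not the competition data) and hypothetical names such as `fit_part` and `val_part`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic daily series (made up): weekends consume less than working days.
rng = np.random.default_rng(0)
frame = pd.DataFrame({"record_date": pd.date_range("2015-01-01", "2016-09-30", freq="D")})
frame["day_of_week"] = frame["record_date"].dt.dayofweek
frame["day_of_month"] = frame["record_date"].dt.day
frame["power_consumption"] = (
    4000.0 - 1500.0 * (frame["day_of_week"] >= 5) + rng.normal(0, 100, len(frame))
)

# Hold out the last month (September 2016) rather than a random subset,
# so validation mimics forecasting an unseen future month.
cutoff = pd.Timestamp("2016-09-01")
features = ["day_of_week", "day_of_month"]
fit_part = frame[frame["record_date"] < cutoff]
val_part = frame[frame["record_date"] >= cutoff]

model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(fit_part[features], fit_part["power_consumption"])

train_r2 = model.score(fit_part[features], fit_part["power_consumption"])
val_r2 = model.score(val_part[features], val_part["power_consumption"])
print("train R^2:", round(train_r2, 3), "validation R^2:", round(val_r2, 3))
```

A large gap between the two scores would suggest overfitting; on the real data the held-out month could be September 2016, the month just before the one to predict.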

In [ ]:

# Make the predictions

In [28]:

from datetime import datetime

# Convert a timestamp to the submission date format YYYYMMDD
def dataprocess(t):
    t = str(t)[0:10]
    time = datetime.strptime(t, '%Y-%m-%d')
    res = time.strftime('%Y%m%d')
    return res

# Build the 31 days of October
commit_df = pd.date_range('2016/10/1', periods=31, freq='D')
commit_df = pd.DataFrame(commit_df)
commit_df.columns = ['predict_date']

# Predict with the model.
# The placeholder rows have no user_id; it is filled with day_of_month so the
# model receives a number. This is a stand-in, not a real user.
test_X['user_id'] = test_X['day_of_month']
test_X
prediction = best_dt_reg.predict(test_X.values)
commit_df['predict_power_consumption'] = pd.DataFrame(prediction).astype('int')
commit_df['predict_date'] = commit_df['predict_date'].apply(dataprocess)
commit_df.head()

Out[28]:

	predict_date	predict_power_consumption
0	20161001	3820886
1	20161002	3845830
2	20161003	3845830
3	20161004	3845830
4	20161005	3845830
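The date conversion for the submission file can also be done in one step with `Series.dt.strftime`, without a helper function. A minimal sketch that builds only the date column:

```python
import pandas as pd

# The 31 days of October 2016 as a datetime column.
commit_df = pd.DataFrame({"predict_date": pd.date_range("2016-10-01", periods=31, freq="D")})

# Format every date as YYYYMMDD in a single vectorized call.
commit_df["predict_date"] = commit_df["predict_date"].dt.strftime("%Y%m%d")
print(commit_df.head())
```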

In [ ]:

RandomForest (model ensembling)

In [ ]:

# RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Parameters: how many trees, how deep each tree is (usually no more than 10),
# and what fraction of the features each split samples (depends on how many there are)
param_grid = {'n_estimators': [5, 8, 10, 15, 20, 50, 100, 200],
              'max_depth': [3, 5, 7, 9],
              'max_features': [0.6, 0.7, 0.8, 0.9]}
rf = RandomForestRegressor()
grid = GridSearchCV(rf, param_grid=param_grid, cv=3, n_jobs=8, refit=True)
grid.fit(train_X, train_y)
breg = grid.best_estimator_
print(breg)
print(breg.score(train_X, train_y))

In [ ]:

Predict with the model

In [ ]:

from datetime import datetime

# Convert a timestamp to the submission date format YYYYMMDD
def dataprocess(t):
    t = str(t)[0:10]
    time = datetime.strptime(t, '%Y-%m-%d')
    res = time.strftime('%Y%m%d')
    return res

# Predict with the model; user_id is filled with day_of_month as a stand-in
test_X['user_id'] = test_X['day_of_month']
commit_df = pd.date_range('2016/10/1', periods=31, freq='D')
commit_df = pd.DataFrame(commit_df)
commit_df.columns = ['predict_date']
prediction = breg.predict(test_X)
commit_df['predict_power_consumption'] = pd.DataFrame(prediction).astype('int')
commit_df['predict_date'] = commit_df['predict_date'].apply(dataprocess)
commit_df.head()

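Summary:

This example of forecasting future electricity consumption from historical usage shows that analyzing the business data and extracting features before modeling is crucial, because it largely determines how accurate the prediction can be. Only with a thorough and accurate understanding of the business scenario can you extract good features, and only then, combined with a suitable model, can you get good results.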

Full code

In [ ]:

import numpy as np
import pandas as pd

_df = pd.read_csv("tianchi_powerdata/zhenjiang_power.csv")
_df['record_date'] = pd.to_datetime(_df['record_date'])
_df.head()

# Sum consumption over all users to get one total per day
train_df = _df[['record_date','power_consumption']].groupby(by='record_date').agg('sum')
train_df = train_df.reset_index()
train_df.head()

# Create placeholder rows for the days to predict, 2016-10-01 to 2016-10-31
test_df = pd.date_range('2016-10-1', periods=31, freq='D')
test_df = pd.DataFrame(test_df, columns=['record_date'])
test_df['power_consumption'] = 0.0
test_df

# Concatenate the daily totals (train_df), not the per-user rows (_df);
# the original concatenated _df, which left the aggregation above unused
total_df = pd.concat([train_df, test_df], ignore_index=True)
total_df.tail()

# Time-related features
total_df['day_of_week'] = total_df['record_date'].apply(lambda x: x.dayofweek)
total_df['day_of_month'] = total_df['record_date'].apply(lambda x: x.day)
total_df['day_of_year'] = total_df['record_date'].apply(lambda x: x.dayofyear)
total_df['month_of_year'] = total_df['record_date'].apply(lambda x: x.month)
total_df['year'] = total_df['record_date'].apply(lambda x: x.year)

# Working day or weekend: consumption on Saturday and Sunday clearly differs from working days
total_df['holiday'] = 0
total_df['holiday_sat'] = 0
total_df['holiday_sun'] = 0
total_df.loc[total_df.day_of_week == 5, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 5, 'holiday_sat'] = 1
total_df.loc[total_df.day_of_week == 6, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 6, 'holiday_sun'] = 1

# Which of the four weeks of the month the day falls in
def week_of_month(day):
    if day in range(1, 8):
        return 1
    if day in range(8, 15):
        return 2
    if day in range(15, 22):
        return 3
    if day in range(22, 32):
        return 4

total_df['week_of_month'] = total_df['day_of_month'].apply(week_of_month)

# First, middle or last ten days of the month
def period_of_month(day):
    if day in range(1, 11):
        return 1
    if day in range(11, 21):
        return 2
    if day in range(21, 32):
        return 3

total_df['period_of_month'] = total_df['day_of_month'].apply(period_of_month)

# First or second half of the month
def period2_of_month(day):
    if day in range(1, 16):
        return 1
    if day in range(16, 32):
        return 2

total_df['period2_of_month'] = total_df['day_of_month'].apply(period2_of_month)

# Festival flag, filled in by hand. Public holidays have a large effect: most
# enterprises close and consumption drops sharply. We look up the calendar and
# mark whether each day is a festival.
def day_of_festival(day):
    l_festival = ['2016-10-01','2016-10-02','2016-10-03','2016-10-04',
                  '2016-10-05','2016-10-06','2016-10-07']
    # Compare the date as a 'YYYY-MM-DD' string against the list
    return 1 if day.strftime('%Y-%m-%d') in l_festival else 0

total_df['festival_pc'] = 0
# Apply to record_date (the original applied it to the all-zero festival column)
total_df['festival'] = total_df['record_date'].apply(day_of_festival)
total_df.head(20)

# Feature columns available so far:
# date, consumption, day of week, day of month, day of year, month of year, year,
# weekend flags, week of month, ten-day period of month, half of month, festival flag
col_names = total_df.columns.values
col_names

# Count missing values per column to confirm the training data has none
counts = {}
for name in col_names:
    count = total_df[name].isnull().sum()
    counts[name] = [count]
is_null_filds = pd.DataFrame(counts)
is_null_filds

# One-hot encoding (not applied in this code). Categorical features often get
# special treatment during feature engineering. For day of week, start from a
# length-7 vector [0,0,0,0,0,0,0]:
# Monday    becomes [1,0,0,0,0,0,0]
# Tuesday   becomes [0,1,0,0,0,0,0]
# Wednesday becomes [0,0,1,0,0,0,0]
# Thursday  becomes [0,0,0,1,0,0,0]
# and so on.

# Tree-based modeling. Tree models are among the most widely used machine
# learning algorithms in industry. We learn the best decision paths from the
# training set; the leaf at the end of each path gives the prediction.

# 1. Split the training and test sets
## Everything except October 2016 is the training set
train_X = total_df[~((total_df.year==2016)&(total_df.month_of_year==10))]
test_X = total_df[((total_df.year==2016)&(total_df.month_of_year==10))]
#print(train_X.shape)
#print(test_X.shape)
train_y = train_X.power_consumption
train_X = train_X.drop(['power_consumption','record_date','year'],axis=1)
test_X = test_X.drop(['power_consumption','record_date','year'],axis=1)
train_X.head()

# 2. Modeling and tuning: grid search with cross-validation to find the best parameters
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# 1.0 (float) means all features; the integer 1 would mean a single feature
param_grid = {'max_features': [0.7, 0.8, 0.9, 1.0],
              'max_depth': [3, 5, 7, 9, 12]}
dt = DecisionTreeRegressor()
grid = GridSearchCV(dt, param_grid=param_grid, cv=5, n_jobs=8, refit=True)
grid.fit(train_X, train_y)
best_dt_reg = grid.best_estimator_
print(best_dt_reg)
print(best_dt_reg.score(train_X, train_y))

from datetime import datetime

# Convert a timestamp to the submission date format YYYYMMDD
def dataprocess(t):
    t = str(t)[0:10]
    time = datetime.strptime(t, '%Y-%m-%d')
    res = time.strftime('%Y%m%d')
    return res

# Build the 31 days of October
commit_df = pd.date_range('2016/10/1', periods=31, freq='D')
commit_df = pd.DataFrame(commit_df)
commit_df.columns = ['predict_date']

# Predict with the model
prediction = best_dt_reg.predict(test_X.values)
commit_df['predict_power_consumption'] = pd.DataFrame(prediction).astype('int')
commit_df['predict_date'] = commit_df['predict_date'].apply(dataprocess)
commit_df.head()
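The one-hot encoding described in the comments of the full code is not actually applied there. A minimal sketch of how it could be done with `pd.get_dummies`, on a made-up sample; the `dow` column prefix is an arbitrary choice for illustration:

```python
import pandas as pd

# Made-up sample: Monday, Tuesday, Wednesday, Saturday, Sunday.
sample = pd.DataFrame({"day_of_week": [0, 1, 2, 5, 6]})

# Declare all seven categories so every dummy column is created even when
# some weekdays are absent from the sample.
sample["day_of_week"] = pd.Categorical(sample["day_of_week"], categories=list(range(7)))

# One indicator column per weekday: dow_0 (Monday) ... dow_6 (Sunday).
one_hot = pd.get_dummies(sample["day_of_week"], prefix="dow").astype(int)
encoded = pd.concat([sample, one_hot], axis=1)
print(encoded)
```

Tree models can split on the integer weekday directly, so the encoding matters more for linear models; whether it helps here would need to be checked by validation.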

Feature importance

%matplotlib inline
import matplotlib.pyplot as plt

print("Feature ranking:")
# Take the names from the training frame so they line up with the model's inputs
feature_names = train_X.columns.values
feature_importances = breg.feature_importances_
# Sort feature indices from most to least important
indices = np.argsort(feature_importances)[::-1]
for f in indices:
    print("feature %s (%f)" % (feature_names[f], feature_importances[f]))

plt.figure(figsize=(20,8))
plt.title("Feature importances")
plt.bar(range(len(feature_importances)), feature_importances[indices], color="b", align="center")
plt.xticks(range(len(feature_importances)), np.array(feature_names)[indices])
plt.xlim([-1, train_X.shape[1]])
plt.show()

Remarks:

Model design, in outline (pseudocode):

load data

cross-validation

classifier

model = classifier.fit(x, y)

predict = model.transform(x, y)

predict.filter()

predict.count()

sklearn:

from sklearn.tree import DecisionTreeRegressor

from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestRegressor

import xgboost as xgb

Regression problems, i.e. predicting a continuous value such as the electricity consumption above:

DecisionTreeRegressor()

XGBRegressor()

RandomForestRegressor()

xgb.XGBRegressor()

GridSearchCV(xgb_model, param_grid, n_jobs=8)

# 1.0 (float) means all features; the integer 1 would mean a single feature

param_grid = {'max_features': [0.7, 0.8, 0.9, 1.0], 'max_depth': [3, 5, 7, 9, 12] }

dt = DecisionTreeRegressor()

grid = GridSearchCV(dt, param_grid=param_grid, cv=5, n_jobs=8, refit=True)

grid.fit(train_X, train_y)

best_dt_reg = grid.best_estimator_

best_dt_reg.predict(test_X.values)

rf = RandomForestRegressor()

grid = GridSearchCV(rf, param_grid=param_grid, cv=3, n_jobs=8, refit=True)

grid.fit(train_X, train_y)

best_dt_reg = grid.best_estimator_

best_dt_reg.score(train_X, train_y)

best_dt_reg.predict(test_X.values)

param_grid = { 'max_depth': [3, 4, 5, 7, 9], 'n_estimators': [20, 40, 50, 80, 100, 200, 400, 800, 1000, 1200], 'learning_rate': [0.05, 0.1, 0.2, 0.3], 'subsample': [0.8, 1], 'colsample_bylevel':[0.8, 1] }

# Regression with xgboost's regressor

xgb_model = xgb.XGBRegressor()

# Fit the data

rgs = GridSearchCV(xgb_model, param_grid, n_jobs=8)

rgs.fit(train_X, train_y)

print(rgs.best_score_)

print(rgs.best_params_)

rgs.predict(test_X.values)

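LogisticRegression: logistic regression is used for classification (binary), and it also extends to multi-class problems through the one-vs-rest approach. Its advantage is that every prediction comes with a probability for each class.

GaussianNB: naive Bayes performs well on multi-class classification problems.

kNN (k-nearest neighbors) is often used as one part of a more complex classification algorithm, with its estimate serving as a feature of the object.

DecisionTree: classification and regression trees (CART) are suitable for multi-class classification. Support vector machines (SVM) are used for classification problems.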
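To illustrate the point about per-class probabilities, here is a minimal sketch with scikit-learn's `LogisticRegression` on a made-up one-dimensional data set. It is unrelated to the electricity data and only shows the shape of the output.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: small x values belong to class 0, large ones to class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns one row per sample and one column per class;
# each row sums to 1.
proba = clf.predict_proba(np.array([[0.5], [4.5]]))
print(proba)
```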
