In machine learning, and especially in research on multimedia methods (audio, image, video), you constantly face choices among features and classification models. Take sound recognition as an example: common features include MFCC, LPCC, and spectrogram-like features, while the choice of classifiers is even wider, ranging from traditional models such as SVM, KNN, and Random Forest to the currently popular deep models such as DNN, CNN, and RNN. A single feature or a single model rarely achieves ideal performance on its own. So how can different features and models be exploited efficiently?
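As a concrete illustration of what a spectrogram-like feature looks like, here is a toy sketch in plain NumPy (not any particular library's MFCC pipeline; the frame length, hop size, and the synthetic 440 Hz signal are arbitrary choices for illustration):

```python
import numpy as np

def spectrogram_features(signal, frame_len=256, hop=128):
    """Slice a 1-D signal into overlapping windowed frames and return
    the log-magnitude spectrum of each frame (a spectrogram-like feature)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len // 2 + 1)
    return np.log1p(mag)

# one second of a fake 8 kHz "recording": a pure 440 Hz tone
sig = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
feats = spectrogram_features(sig)
print(feats.shape)  # (61, 129)
```

Each row is one time frame and each column one frequency bin; a classifier (or a CNN treating it as an image) consumes this matrix instead of the raw waveform.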

One important technique is fusion. The two canonical approaches are early fusion and late fusion. As the names suggest, early fusion operates at the feature level: different features are concatenated and fed into a single model for training. Late fusion operates at the score level: several models are trained, each produces a prediction score, and the scores of all models are fused into the final prediction. Common late-fusion schemes include taking the average, the maximum, or a weighted average of the scores; a logistic regression model can also be trained on the scores to perform the fusion. In short, there are many options, and the choice depends on the situation.
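The score-level schemes above take only a few lines. A minimal sketch (the scores below are made-up numbers for three hypothetical models scoring five test clips; the fusion weights would normally be tuned on a validation set):

```python
import numpy as np

# made-up prediction scores of three hypothetical models on five test clips
scores = np.array([
    [0.90, 0.20, 0.60, 0.40, 0.75],   # model A
    [0.80, 0.35, 0.55, 0.30, 0.70],   # model B
    [0.95, 0.10, 0.70, 0.45, 0.65],   # model C
])

avg_fused = scores.mean(axis=0)          # average
max_fused = scores.max(axis=0)           # maximum
weights = np.array([0.5, 0.2, 0.3])      # e.g. tuned on a validation set
wavg_fused = weights @ scores            # weighted average
```

The logistic-regression variant is the same idea one step further: the per-model scores become the input features of a small classifier that learns the fusion weights instead of having them fixed by hand.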

Fusion is an excellent way to boost model performance and is widely used both in everyday projects and in Kaggle competitions; in a competitive setting like Kaggle, virtually every contestant's final result is a fused one. In this context, model fusion is also called ensembling; the two terms are used interchangeably below.

#
# import library
#
import numpy as np
import pandas as pd
# data preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
# models
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

#
# version 29 -> LB:0.6446
#   add more features
#
# version 28 -> LB:0.6445
#   model params 'n_estimators' -> 100
#
# version 26 -> LB:0.6443
#   model params 'n_estimators' -> 50
#

def load_data():
    train_2016 = pd.read_csv('../input/train_2016_v2.csv')
    train_2017 = pd.read_csv('../input/train_2017.csv')
    train = pd.concat([train_2016, train_2017], ignore_index=True)
    properties = pd.read_csv('../input/properties_2017.csv')
    sample = pd.read_csv('../input/sample_submission.csv')

    print("Preprocessing...")
    for c, dtype in zip(properties.columns, properties.dtypes):
        if dtype == np.float64:
            properties[c] = properties[c].astype(np.float32)

    print("Set train/test data...")
    id_feature = ['heatingorsystemtypeid', 'propertylandusetypeid', 'storytypeid',
                  'airconditioningtypeid', 'architecturalstyletypeid',
                  'buildingclasstypeid', 'buildingqualitytypeid',
                  'typeconstructiontypeid']
    for c in properties.columns:
        properties[c] = properties[c].fillna(-1)
        if properties[c].dtype == 'object':
            lbl = LabelEncoder()
            lbl.fit(list(properties[c].values))
            properties[c] = lbl.transform(list(properties[c].values))
        if c in id_feature:
            # label-encode the id column, then one-hot encode it
            lbl = LabelEncoder()
            lbl.fit(list(properties[c].values))
            properties[c] = lbl.transform(list(properties[c].values))
            dum_df = pd.get_dummies(properties[c])
            dum_df = dum_df.rename(columns=lambda x: c + str(x))
            properties = pd.concat([properties, dum_df], axis=1)
            properties = properties.drop([c], axis=1)

    #
    # Add Feature
    #
    # error in calculation of the finished living area of home
    properties['N-LivingAreaError'] = properties['calculatedfinishedsquarefeet'] / properties['finishedsquarefeet12']

    #
    # Make train and test dataframe
    #
    train = train.merge(properties, on='parcelid', how='left')
    sample['parcelid'] = sample['ParcelId']
    test = sample.merge(properties, on='parcelid', how='left')

    # drop outliers
    train = train[train.logerror > -0.4]
    train = train[train.logerror < 0.419]

    train["transactiondate"] = pd.to_datetime(train["transactiondate"])
    train["Month"] = train["transactiondate"].dt.month
    train["quarter"] = train["transactiondate"].dt.quarter
    test["Month"] = 10
    test['quarter'] = 4

    x_train = train.drop(['parcelid', 'logerror', 'transactiondate',
                          'propertyzoningdesc', 'propertycountylandusecode'], axis=1)
    y_train = train["logerror"].values
    x_test = test[x_train.columns]
    del test, train
    print(x_train.shape, y_train.shape, x_test.shape)
    return x_train, y_train, x_test

x_train, y_train, x_test = load_data()


class Ensemble(object):
    def __init__(self, n_splits, stacker, base_models):
        self.n_splits = n_splits
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)

        folds = list(KFold(n_splits=self.n_splits, shuffle=True,
                           random_state=2016).split(X, y))

        # out-of-fold predictions of each base model: features for the stacker
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):
            S_test_i = np.zeros((T.shape[0], self.n_splits))
            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]
                print("Fit Model %d fold %d" % (i, j))
                clf.fit(X_train, y_train)
                S_train[test_idx, i] = clf.predict(X_holdout)
                S_test_i[:, j] = clf.predict(T)
            # average the test-set predictions over the folds
            S_test[:, i] = S_test_i.mean(axis=1)

        # results = cross_val_score(self.stacker, S_train, y, cv=5, scoring='r2')
        # print("Stacker score: %.4f (%.4f)" % (results.mean(), results.std()))

        self.stacker.fit(S_train, y)
        return self.stacker.predict(S_test)

# rf params
rf_params = {}
rf_params['n_estimators'] = 50
rf_params['max_depth'] = 8
rf_params['min_samples_split'] = 100
rf_params['min_samples_leaf'] = 30

# xgb params
xgb_params = {}
#xgb_params['n_estimators'] = 50
xgb_params['min_child_weight'] = 12
xgb_params['learning_rate'] = 0.37
xgb_params['max_depth'] = 6
xgb_params['subsample'] = 0.77
xgb_params['reg_lambda'] = 0.8
xgb_params['reg_alpha'] = 0.4
xgb_params['base_score'] = 0
#xgb_params['seed'] = 400
xgb_params['silent'] = 1

# lgb params
lgb_params = {}
#lgb_params['n_estimators'] = 50
lgb_params['max_bin'] = 8
lgb_params['learning_rate'] = 0.37 # shrinkage_rate
lgb_params['metric'] = 'l1'          # or 'mae'
lgb_params['sub_feature'] = 0.35
lgb_params['bagging_fraction'] = 0.85 # sub_row
lgb_params['bagging_freq'] = 40
lgb_params['num_leaves'] = 512        # num_leaf
lgb_params['min_data'] = 500         # min_data_in_leaf
lgb_params['min_hessian'] = 0.05     # min_sum_hessian_in_leaf
lgb_params['verbose'] = 0
lgb_params['feature_fraction_seed'] = 2
lgb_params['bagging_seed'] = 3

# XGB model
xgb_model = XGBRegressor(**xgb_params)

# lgb model
lgb_model = LGBMRegressor(**lgb_params)

# RF model
rf_model = RandomForestRegressor(**rf_params)

# ET model
et_model = ExtraTreesRegressor()

# SVR model
# SVM is too slow on more than 10000 samples
# svr_model = SVR(kernel='rbf', C=1.0, epsilon=0.05)

# DecisionTree model
dt_model = DecisionTreeRegressor()

# AdaBoost model
ada_model = AdaBoostRegressor()

stack = Ensemble(n_splits=5,
                 stacker=LinearRegression(),
                 base_models=(rf_model, xgb_model, lgb_model, et_model, ada_model))

y_test = stack.fit_predict(x_train, y_train, x_test)

from datetime import datetime
print("submit...")
pre = y_test
sub = pd.read_csv('../input/sample_submission.csv')
for c in sub.columns[sub.columns != 'ParcelId']:
    sub[c] = pre
submit_file = '{}.csv'.format(datetime.now().strftime('%Y%m%d_%H_%M'))
sub.to_csv(submit_file, index=False,  float_format='%.4f')
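The hand-rolled Ensemble class above is classic out-of-fold stacking. For reference, recent scikit-learn versions (0.22+) ship a built-in StackingRegressor that implements the same scheme; a minimal sketch on synthetic data (the dataset and hyperparameters here are made up for illustration, not the competition's):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# base models produce out-of-fold predictions (cv=5), which the
# final_estimator then learns to combine, just like Ensemble.fit_predict
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
                ("et", ExtraTreesRegressor(n_estimators=20, random_state=0))],
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X, y)
pred = stack.predict(X)
```

The main practical difference is that StackingRegressor refits the base models on the full training set after building the out-of-fold matrix, whereas the class above averages the per-fold test predictions instead.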

Reposted from: https://blog.51cto.com/yixianwei/2156798
