Task4 建模与调参

此部分为零基础入门数据挖掘之心电图分类的 Task4 建模调参部分,带你来了解各种模型以及模型的评价和调参策略,欢迎大家后续多多交流。

赛题:零基础入门数据挖掘 - 心电图分类预测

项目地址:

比赛地址:

1 学习目标

  • 学习机器学习模型的建模过程与调参流程
  • 完成相应学习打卡任务

2 内容介绍

  • 逻辑回归模型:

    • 理解逻辑回归模型;
    • 逻辑回归模型的应用;
    • 逻辑回归的优缺点;
  • 树模型:

    • 理解树模型;
    • 树模型的应用;
    • 树模型的优缺点;
  • 集成模型

    • 基于bagging思想的集成模型

      • 随机森林模型
    • 基于boosting思想的集成模型
      • XGBoost模型
      • LightGBM模型
      • CatBoost模型
  • 模型对比与性能评估:

    • 回归模型/树模型/集成模型;
    • 模型评估方法;
    • 模型评价结果;
  • 模型调参:

    • 贪心调参方法;

    • 网格调参方法;

    • 贝叶斯调参方法;

3 代码示例

3.1 导入相关关和相关设置

import pandas as pd
import numpy as np
from sklearn.metrics import f1_scoreimport os
import seaborn as sns
import matplotlib.pyplot as pltimport warnings
warnings.filterwarnings("ignore")

3.2 读取数据

reduce_mem_usage 函数通过调整数据类型,帮助我们减少数据在内存中占用的空间

def reduce_mem_usage(df):start_mem = df.memory_usage().sum() / 1024**2 print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))for col in df.columns:col_type = df[col].dtypeif col_type != object:c_min = df[col].min()c_max = df[col].max()if str(col_type)[:3] == 'int':if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:df[col] = df[col].astype(np.int8)elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:df[col] = df[col].astype(np.int16)elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:df[col] = df[col].astype(np.int32)elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:df[col] = df[col].astype(np.int64)  else:if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:df[col] = df[col].astype(np.float16)elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:df[col] = df[col].astype(np.float32)else:df[col] = df[col].astype(np.float64)else:df[col] = df[col].astype('category')end_mem = df.memory_usage().sum() / 1024**2 print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))return df
# 读取数据
data = pd.read_csv('data/train.csv')
# 简单预处理
data_list = []
for items in data.values:data_list.append([items[0]] + [float(i) for i in items[1].split(',')] + [items[2]])data = pd.DataFrame(np.array(data_list))
data.columns = ['id'] + ['s_'+str(i) for i in range(len(data_list[0])-2)] + ['label']data = reduce_mem_usage(data)
Memory usage of dataframe is 157.93 MB
Memory usage after optimization is: 39.67 MB
Decreased by 74.9%

3.3 简单建模

基于树模型的算法特性,异常值、缺失值处理可以跳过,但是对于业务较为了解的同学也可以自己对缺失异常值进行处理,效果可能会更优于模型处理的结果。

注:以下建模的据集并未构造任何特征,直接使用原特征。本次主要任务还是模建模调参。

建模之前的预操作

from sklearn.model_selection import KFold
# 分离数据集,方便进行交叉验证
X_train = data.drop(['id','label'], axis=1)
y_train = data['label']# 5折交叉验证
folds = 5
seed = 2021
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

因为树模型中没有f1-score评价指标,所以需要自定义评价指标,在模型迭代中返回验证集f1-score变化情况。

def f1_score_vali(preds, data_vali):labels = data_vali.get_label()preds = np.argmax(preds.reshape(4, -1), axis=0)score_vali = f1_score(y_true=labels, y_pred=preds, average='macro')return 'f1_score', score_vali, True

使用Lightgbm进行建模

"""对训练集数据进行划分,分成训练集和验证集,并进行相应的操作"""
from sklearn.model_selection import train_test_split
import lightgbm as lgb
# 数据集划分
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
valid_matrix = lgb.Dataset(X_val, label=y_val)params = {"learning_rate": 0.1,"boosting": 'gbdt',  "lambda_l2": 0.1,"max_depth": -1,"num_leaves": 128,"bagging_fraction": 0.8,"feature_fraction": 0.8,"metric": None,"objective": "multiclass","num_class": 4,"nthread": 10,"verbose": -1,
}"""使用训练集数据进行模型训练"""
model = lgb.train(params, train_set=train_matrix, valid_sets=valid_matrix, num_boost_round=2000, verbose_eval=50, early_stopping_rounds=200,feval=f1_score_vali)
Training until validation scores don't improve for 200 rounds
[50]    valid_0's multi_logloss: 0.0535465 valid_0's f1_score: 0.953675
[100]   valid_0's multi_logloss: 0.0484882 valid_0's f1_score: 0.961373
[150]   valid_0's multi_logloss: 0.0507799 valid_0's f1_score: 0.962653
[200]   valid_0's multi_logloss: 0.0531035 valid_0's f1_score: 0.963224
[250]   valid_0's multi_logloss: 0.0547945 valid_0's f1_score: 0.963721
Early stopping, best iteration is:
[88]    valid_0's multi_logloss: 0.0482441 valid_0's f1_score: 0.959676

对验证集进行预测

val_pre_lgb = model.predict(X_val, num_iteration=model.best_iteration)
preds = np.argmax(val_pre_lgb, axis=1)
score = f1_score(y_true=y_val, y_pred=preds, average='macro')
print('未调参前lightgbm单模型在验证集上的f1:{}'.format(score))
未调参前lightgbm单模型在验证集上的f1:0.9596756568138634

更进一步的,使用5折交叉验证进行模型性能评估

"""使用lightgbm 5折交叉验证进行建模预测"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):print('************************************ {} ************************************'.format(str(i+1)))X_train_split, y_train_split, X_val, y_val = X_train.iloc[train_index], y_train[train_index], X_train.iloc[valid_index], y_train[valid_index]train_matrix = lgb.Dataset(X_train_split, label=y_train_split)valid_matrix = lgb.Dataset(X_val, label=y_val)params = {"learning_rate": 0.1,"boosting": 'gbdt',  "lambda_l2": 0.1,"max_depth": -1,"num_leaves": 128,"bagging_fraction": 0.8,"feature_fraction": 0.8,"metric": None,"objective": "multiclass","num_class": 4,"nthread": 10,"verbose": -1,}model = lgb.train(params, train_set=train_matrix, valid_sets=valid_matrix, num_boost_round=2000, verbose_eval=100, early_stopping_rounds=200,feval=f1_score_vali)val_pred = model.predict(X_val, num_iteration=model.best_iteration)val_pred = np.argmax(val_pred, axis=1)cv_scores.append(f1_score(y_true=y_val, y_pred=val_pred, average='macro'))print(cv_scores)print("lgb_scotrainre_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
...
lgb_scotrainre_list:[0.9674515729721614, 0.9656700872844327, 0.9700043639844769, 0.9655663272378014, 0.9631137190307674]
lgb_score_mean:0.9663612141019279
lgb_score_std:0.0022854824074775683

3.4 模型调参

  • 1. 贪心调参

    先使用当前对模型影响最大的参数进行调优,达到当前参数下的模型最优化,再使用对模型影响次之的参数进行调优,如此下去,直到所有的参数调整完毕。

    这个方法的缺点就是可能会调到局部最优而不是全局最优,但是只需要一步一步的进行参数最优化调试即可,容易理解。

    需要注意的是在树模型中参数调整的顺序,也就是各个参数对模型的影响程度,这里列举一下日常调参过程中常用的参数和调参顺序:

    • ①:max_depth、num_leaves
    • ②:min_data_in_leaf、min_child_weight
    • ③:bagging_fraction、 feature_fraction、bagging_freq
    • ④:reg_lambda、reg_alpha
    • ⑤:min_split_gain
    from sklearn.model_selection import cross_val_score
    # 调objective
    best_obj = dict()
    for obj in objective:model = LGBMRegressor(objective=obj)"""预测并计算roc的相关指标"""score = cross_val_score(model, X_train, y_train, cv=5, scoring='f1').mean()best_obj[obj] = score# num_leaves
    best_leaves = dict()
    for leaves in num_leaves:model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)"""预测并计算roc的相关指标"""score = cross_val_score(model, X_train, y_train, cv=5, scoring='f1').mean()best_leaves[leaves] = score# max_depth
    best_depth = dict()
    for depth in max_depth:model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],max_depth=depth)"""预测并计算roc的相关指标"""score = cross_val_score(model, X_train, y_train, cv=5, scoring='f1').mean()best_depth[depth] = score"""
    可依次将模型的参数通过上面的方式进行调整优化,并且通过可视化观察在每一个最优参数下模型的得分情况
    """
    

    可依次将模型的参数通过上面的方式进行调整优化,并且通过可视化观察在每一个最优参数下模型的得分情况

  • 2. 网格搜索

    sklearn 提供GridSearchCV用于进行网格搜索,只需要把模型的参数输进去,就能给出最优化的结果和参数。相比起贪心调参,网格搜索的结果会更优,但是网格搜索只适合于小数据集,一旦数据的量级上去了,很难得出结果。

    同样以Lightgbm算法为例,进行网格搜索调参:

    """通过网格搜索确定最优参数"""
    from sklearn.model_selection import GridSearchCVdef get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=31, max_depth=-1, bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0, min_data_in_leaf=20, min_child_weight=0.001, min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=None):# 设置5折交叉验证cv_fold = KFold(n_splits=5, shuffle=True, random_state=2021)model_lgb = lgb.LGBMClassifier(learning_rate=learning_rate,n_estimators=n_estimators,num_leaves=num_leaves,max_depth=max_depth,bagging_fraction=bagging_fraction,feature_fraction=feature_fraction,bagging_freq=bagging_freq,min_data_in_leaf=min_data_in_leaf,min_child_weight=min_child_weight,min_split_gain=min_split_gain,reg_lambda=reg_lambda,reg_alpha=reg_alpha,n_jobs= 8)f1 = make_scorer(f1_score, average='micro')grid_search = GridSearchCV(estimator=model_lgb, cv=cv_fold,param_grid=param_grid,scoring=f1)grid_search.fit(X_train, y_train)print('模型当前最优参数为:{}'.format(grid_search.best_params_))print('模型当前最优得分为:{}'.format(grid_search.best_score_))
    
    """以下代码未运行,耗时较长,请谨慎运行,且每一步的最优参数需要在下一步进行手动更新,请注意""""""
    需要注意一下的是,除了获取上面的获取num_boost_round时候用的是原生的lightgbm(因为要用自带的cv)
    下面配合GridSearchCV时必须使用sklearn接口的lightgbm。
    """
    """设置n_estimators 为581,调整num_leaves和max_depth,这里选择先粗调再细调"""
    lgb_params = {'num_leaves': range(10, 80, 5), 'max_depth': range(3,10,2)}
    get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=None, max_depth=None, min_data_in_leaf=20, min_child_weight=0.001,bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0, min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)"""num_leaves为30,max_depth为7,进一步细调num_leaves和max_depth"""
    lgb_params = {'num_leaves': range(25, 35, 1), 'max_depth': range(5,9,1)}
    get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=None, max_depth=None, min_data_in_leaf=20, min_child_weight=0.001,bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0, min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)"""
    确定min_data_in_leaf为45,min_child_weight为0.001 ,下面进行bagging_fraction、feature_fraction和bagging_freq的调参
    """
    lgb_params = {'bagging_fraction': [i/10 for i in range(5,10,1)], 'feature_fraction': [i/10 for i in range(5,10,1)],'bagging_freq': range(0,81,10)}
    get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, min_child_weight=0.001,bagging_fraction=None, feature_fraction=None, bagging_freq=None, min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)"""
    确定bagging_fraction为0.4、feature_fraction为0.6、bagging_freq为 ,下面进行reg_lambda、reg_alpha的调参
    """
    lgb_params = {'reg_lambda': [0,0.001,0.01,0.03,0.08,0.3,0.5], 'reg_alpha': [0,0.001,0.01,0.03,0.08,0.3,0.5]}
    get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, min_child_weight=0.001,bagging_fraction=0.9, feature_fraction=0.9, bagging_freq=40, min_split_gain=0, reg_lambda=None, reg_alpha=None, param_grid=lgb_params)"""
    确定reg_lambda、reg_alpha都为0,下面进行min_split_gain的调参
    """
    lgb_params = {'min_split_gain': [i/10 for i in range(0,11,1)]}
    get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, min_child_weight=0.001,bagging_fraction=0.9, feature_fraction=0.9, bagging_freq=40, min_split_gain=None, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)
    
    """
    参数确定好了以后,我们设置一个比较小的learning_rate 0.005,来确定最终的num_boost_round
    """
    # 设置5折交叉验证
    # cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True, )
    final_params = {'boosting_type': 'gbdt','learning_rate': 0.01,'num_leaves': 29,'max_depth': 7,'objective': 'multiclass','num_class': 4,'min_data_in_leaf':45,'min_child_weight':0.001,'bagging_fraction': 0.9,'feature_fraction': 0.9,'bagging_freq': 40,'min_split_gain': 0,'reg_lambda':0,'reg_alpha':0,'nthread': 6}cv_result = lgb.cv(train_set=lgb_train,early_stopping_rounds=20,num_boost_round=5000,nfold=5,stratified=True,shuffle=True,params=final_params,feval=f1_score_vali,seed=0,)
    

    在实际调整过程中,可先设置一个较大的学习率(上面的例子中0.1),通过Lgb原生的cv函数进行树个数的确定,之后再通过上面的实例代码进行参数的调整优化。

    最后针对最优的参数设置一个较小的学习率(例如0.05),同样通过cv函数确定树的个数,确定最终的参数。

    需要注意的是,针对大数据集,上面每一层参数的调整都需要耗费较长时间,

  • 贝叶斯调参

    在使用之前需要先安装包bayesian-optimization,运行如下命令即可:

    pip install bayesian-optimization
    

    贝叶斯调参的主要思想是:给定优化的目标函数(广义的函数,只需指定输入和输出即可,无需知道内部结构以及数学性质),通过不断地添加样本点来更新目标函数的后验分布(高斯过程,直到后验分布基本贴合于真实分布)。简单的说,就是考虑了上一次参数的信息,从而更好的调整当前的参数。

    贝叶斯调参的步骤如下:

    • 定义优化函数(rf_cv)
    • 建立模型
    • 定义待优化的参数
    • 得到优化结果,并返回要优化的分数指标
    from sklearn.model_selection import cross_val_score"""定义优化函数"""
    def rf_cv_lgb(num_leaves, max_depth, bagging_fraction, feature_fraction, bagging_freq, min_data_in_leaf, min_child_weight, min_split_gain, reg_lambda, reg_alpha):# 建立模型model_lgb = lgb.LGBMClassifier(boosting_type='gbdt', objective='multiclass', num_class=4,learning_rate=0.1, n_estimators=5000,num_leaves=int(num_leaves), max_depth=int(max_depth), bagging_fraction=round(bagging_fraction, 2), feature_fraction=round(feature_fraction, 2),bagging_freq=int(bagging_freq), min_data_in_leaf=int(min_data_in_leaf),min_child_weight=min_child_weight, min_split_gain=min_split_gain,reg_lambda=reg_lambda, reg_alpha=reg_alpha,n_jobs= 8)f1 = make_scorer(f1_score, average='micro')val = cross_val_score(model_lgb, X_train_split, y_train_split, cv=5, scoring=f1).mean()return val
    
    from bayes_opt import BayesianOptimization
    """定义优化参数"""
    bayes_lgb = BayesianOptimization(rf_cv_lgb, {'num_leaves':(10, 200),'max_depth':(3, 20),'bagging_fraction':(0.5, 1.0),'feature_fraction':(0.5, 1.0),'bagging_freq':(0, 100),'min_data_in_leaf':(10,100),'min_child_weight':(0, 10),'min_split_gain':(0.0, 1.0),'reg_alpha':(0.0, 10),'reg_lambda':(0.0, 10),}
    )"""开始优化"""
    bayes_lgb.maximize(n_iter=10)
    
    |   iter    |  target   | baggin... | baggin... | featur... | max_depth | min_ch... | min_da... | min_sp... | num_le... | reg_alpha | reg_la... |
    |  1        |  0.9785   |  0.5174   |  10.78    |  0.8746   |  10.15    |  4.288    |  48.97    |  0.2337   |  42.83    |  6.551    |  9.015    |
    |  2        |  0.9778   |  0.6777   |  41.77    |  0.5291   |  12.15    |  4.16     |  26.39    |  0.2461   |  55.78    |  6.528    |  0.6003   |
    |  3        |  0.9745   |  0.5825   |  68.77    |  0.5932   |  8.36     |  9.296    |  77.74    |  0.7946   |  79.12    |  3.045    |  5.593    |
    |  4        |  0.9802   |  0.9669   |  78.34    |  0.77     |  19.68    |  9.886    |  66.34    |  0.255    |  161.1    |  4.727    |  8.18     |
    |  5        |  0.9836   |  0.9897   |  51.9     |  0.9737   |  16.82    |  2.001    |  42.1     |  0.03563  |  134.2    |  3.437    |  1.368    |
    |  6        |  0.9749   |  0.5575   |  46.2     |  0.6518   |  15.9     |  7.817    |  34.12    |  0.341    |  153.2    |  7.144    |  7.899    |
    |  7        |  0.9793   |  0.9644   |  55.08    |  0.9795   |  18.5     |  2.085    |  41.22    |  0.7031   |  129.9    |  3.369    |  2.717    |
    |  8        |  0.9819   |  0.5926   |  58.23    |  0.6149   |  16.81    |  2.911    |  39.91    |  0.1699   |  137.3    |  2.685    |  2.891    |
    |  9        |  0.983    |  0.7796   |  50.38    |  0.7261   |  17.87    |  3.499    |  37.59    |  0.1404   |  136.1    |  2.442    |  6.621    |
    |  10       |  0.9843   |  0.638    |  49.32    |  0.9282   |  11.33    |  6.504    |  43.21    |  0.288    |  137.7    |  0.2083   |  6.966    |
    |  11       |  0.9798   |  0.8196   |  47.05    |  0.5845   |  9.075    |  2.965    |  46.16    |  0.3984   |  131.6    |  3.634    |  2.601    |
    |  12       |  0.9726   |  0.7688   |  37.57    |  0.9811   |  10.26    |  1.239    |  17.54    |  0.9651   |  46.5     |  8.834    |  6.276    |
    |  13       |  0.9836   |  0.5214   |  48.3     |  0.8203   |  19.13    |  3.129    |  35.47    |  0.08455  |  138.2    |  2.345    |  9.691    |
    |  14       |  0.9738   |  0.5617   |  45.75    |  0.8648   |  18.88    |  4.383    |  46.88    |  0.9315   |  141.8    |  4.968    |  5.563    |
    |  15       |  0.9807   |  0.8046   |  47.05    |  0.6449   |  12.38    |  0.3744   |  41.13    |  0.6808   |  138.7    |  0.8521   |  9.461    |
    =================================================================================================================================================
    
    """显示优化结果"""
    bayes_lgb.max
    
    {'target': 0.9842625,'params': {'bagging_fraction': 0.6379596054685973,'bagging_freq': 49.319589248277715,'feature_fraction': 0.9282486828608231,'max_depth': 11.32826513626976,'min_child_weight': 6.5044214037514845,'min_data_in_leaf': 43.211716584925405,'min_split_gain': 0.28802399981965143,'num_leaves': 137.7332804262704,'reg_alpha': 0.2082701560002398,'reg_lambda': 6.966270735649479}}
    

    参数优化完成后,我们可以根据优化后的参数建立新的模型,降低学习率并寻找最优模型迭代次数

    """调整一个较小的学习率,并通过cv函数确定当前最优的迭代次数"""
    base_params_lgb = {'boosting_type': 'gbdt','objective': 'multiclass','num_class': 4,'learning_rate': 0.01,'num_leaves': 138,'max_depth': 11,'min_data_in_leaf': 43,'min_child_weight':6.5,'bagging_fraction': 0.64,'feature_fraction': 0.93,'bagging_freq': 49,'reg_lambda': 7,'reg_alpha': 0.21,'min_split_gain': 0.288,'nthread': 10,'verbose': -1,
    }cv_result_lgb = lgb.cv(train_set=train_matrix,early_stopping_rounds=1000, num_boost_round=20000,nfold=5,stratified=True,shuffle=True,params=base_params_lgb,feval=f1_score_vali,seed=0
    )
    print('迭代次数{}'.format(len(cv_result_lgb['f1_score-mean'])))
    print('最终模型的f1为{}'.format(max(cv_result_lgb['f1_score-mean'])))
    
    迭代次数4833
    最终模型的f1为0.961641452120875
    

    模型参数已经确定,建立最终模型并对验证集进行验证

    import lightgbm as lgb
    """使用lightgbm 5折交叉验证进行建模预测"""
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):print('************************************ {} ************************************'.format(str(i+1)))X_train_split, y_train_split, X_val, y_val = X_train.iloc[train_index], y_train[train_index], X_train.iloc[valid_index], y_train[valid_index]train_matrix = lgb.Dataset(X_train_split, label=y_train_split)valid_matrix = lgb.Dataset(X_val, label=y_val)params = {'boosting_type': 'gbdt','objective': 'multiclass','num_class': 4,'learning_rate': 0.01,'num_leaves': 138,'max_depth': 11,'min_data_in_leaf': 43,'min_child_weight':6.5,'bagging_fraction': 0.64,'feature_fraction': 0.93,'bagging_freq': 49,'reg_lambda': 7,'reg_alpha': 0.21,'min_split_gain': 0.288,'nthread': 10,'verbose': -1,}model = lgb.train(params, train_set=train_matrix, num_boost_round=4833, valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=200, feval=f1_score_vali)val_pred = model.predict(X_val, num_iteration=model.best_iteration)val_pred = np.argmax(val_pred, axis=1)cv_scores.append(f1_score(y_true=y_val, y_pred=val_pred, average='macro'))print(cv_scores)print("lgb_scotrainre_list:{}".format(cv_scores))
    print("lgb_score_mean:{}".format(np.mean(cv_scores)))
    print("lgb_score_std:{}".format(np.std(cv_scores)))
    
    ...
    lgb_scotrainre_list:[0.9615056903324599, 0.9597829114711733, 0.9644760387635415, 0.9622009947666585, 0.9607941521618003]
    lgb_score_mean:0.9617519574991267
    lgb_score_std:0.0015797109890455313
  • 模型调参小总结

    • 集成模型内置的cv函数可以较快的进行单一参数的调节,一般可以用来优先确定树模型的迭代次数

    • 数据量较大的时候(例如本次项目的数据),网格搜索调参会特别特别慢,不建议尝试

    • 集成模型中原生库和sklearn下的库部分参数不一致,需要注意,具体可以参考xgb和lgb的官方API

      xgb原生库API,sklearn库下xgbAPI

      lgb原生库API, sklearn库下lgbAPI

4 结果讨论

经过几天的学习发现了一些问题需要及时去解决。

  1. 基础知识需要去加强,很多的代码不知道是什么意思,在运行的时候没有什么问题,但是一到自己去写的时候往往需要去百度或者查询相关的知识手册。网上找了一本《利用Python进行数据分析》希望能够在较短的时间内过一遍,巩固一下基础知识
  2. 在各模型的相关原理方面也需要进行加强学习,这里列出了本次课程中推荐的一些博客和教材内容,希望能够学习学习。

逻辑回归模型:https://blog.csdn.net/han_xiaoyang/article/details/49123419
决策树模型:https://blog.csdn.net/c406495762/article/details/76262487
GBDT模型:https://zhuanlan.zhihu.com/p/45145899
XGBoost模型:https://blog.csdn.net/wuzhongqiang/article/details/104854890
LightGBM模型:https://blog.csdn.net/wuzhongqiang/article/details/105350579
Catboost模型:https://mp.weixin.qq.com/s/xloTLr5NJBgBspMQtxPoFA
时间序列模型(选学):
RNN:https://zhuanlan.zhihu.com/p/45289691
LSTM:https://zhuanlan.zhihu.com/p/83496936

Datawhale 零基础入门数据挖掘- 建模与调参相关推荐

  1. Task 3 特征工程 Datawhale零基础入门数据挖掘- 二手车交易价格预测

    Task 3 特征工程 Datawhale零基础入门数据挖掘- 二手车交易价格预测 Tips:此部分为零基础入门数据挖掘的Task3特征工程部分,主要包含各种特征工程以及分析方法 赛题:零基础入没人能 ...

  2. Datawhale零基础入门数据挖掘-Task5模型融合

    Datawhale零基础入门数据挖掘-Task5模型融合 五.模型融合 5.1 模型融合目标 5.2 内容介绍 5.3 Stacking相关理论介绍 5.4 代码示例 5.4.1 回归\分类概率-融合 ...

  3. Datawhale 零基础入门数据挖掘-Task2 数据分析

    数据探索在机器学习中我们一般称为EDA(Exploratory Data Analysis):是指对已有的数据(特别是调查或观察得来的原始数据)在尽量少的先验假定下进行探索,通过作图.制表.方程拟合. ...

  4. Datawhale 零基础入门数据挖掘-Task4 建模调参

    本节内容为各种模型以及模型的评价和调参策略. 一.读取数据 import pandas as pd import numpy as np import warnings warnings.filter ...

  5. 【组队学习】【23期】Datawhale零基础入门数据挖掘(心跳信号分类)

    零基础入门数据挖掘(心跳信号分类) 开源内容:https://github.com/datawhalechina/team-learning-data-mining/tree/master/Heart ...

  6. Datawhale 零基础入门数据挖掘心跳信号分类学习反馈04

    Detail 零基础入门数据挖掘 (心跳信号分类) 学习反馈TASK3 使用语言:python Tas1 – Task5 Task4 建模与调参 本次学习的重点是贪心调参.网格搜索调参.贝叶斯调参共三 ...

  7. Datawhale 零基础入门数据挖掘心跳信号分类学习反馈

    Detail 零基础入门数据挖掘 (心跳信号分类) 学习反馈TASK1 使用语言:python Tas1 – Task5 Task1 赛题理解: 根据给定的数据集,建立模型,预测不同的心跳信号(以预测 ...

  8. Datawhale 零基础入门数据挖掘心跳信号分类学习反馈02

    Detail 零基础入门数据挖掘 (心跳信号分类) 学习反馈TASK2 使用语言:python Tas1 – Task5 Task2_数据探索性分析 涉及函数:总览+判断数据缺失和异常+分布 data ...

  9. Datawhale 零基础入门数据挖掘-Task5 模型融合

    模型融合是比赛后期一个重要的环节,大体来说有如下的类型方式. 一.简单加权融合 1.回归(分类概率):算术平均融合(Arithmetic mean),几何平均融合(Geometric mean): 2 ...

最新文章

  1. Rust和C / C ++的跨语言链接时间优化LTO
  2. 9周的项目实训从今天就开始了,
  3. 解决Linux安装过程中不能安装Grub的问题
  4. python画圆简单代码-Python 用turtle实现用正方形画圆的例子
  5. Ubuntu14.04安装Octave
  6. Linux 发行版与Linux内核
  7. 实录分享 | 计算未来轻沙龙:“法律+AI”前沿研讨会(PPT下载)
  8. JQuery:JQuery添加元素
  9. Doxygen从零学起———安装和配置
  10. k近邻算法(KNN)-分类算法
  11. 你们公司还没使用HTTP3?赶紧来补一补,学习一下如何在Nginx上配置HTTP3。
  12. undefined reference to libiconv_open'
  13. word转网页html,Word转网页html
  14. 神经网络做多元线性回归,神经网络是线性模型吗
  15. 使用百度OCR做答题软件辅助
  16. java调用打印机没反应_java代码调用打印机没反应
  17. 一句话告诉您什么是运维?以及如何运维才能事半功倍?
  18. 读《Jonathan von Neumann and EDVAC》
  19. Linux上音频转换工具mpg123
  20. 蛙蛙推荐:如何编写高质量的python程序

热门文章

  1. Thank you,kobe
  2. SwiftUI 内功教程之Closures 11 Escaping Closures及经典用法
  3. git的镜像下载地址,解决下载慢的问题
  4. MySQL集群搭建——主从同步(一主二从)
  5. 国庆了 推荐你看这几部人工智能电影
  6. 【定时同步系列6】Gardener误差检测算法原理
  7. 【C++11智能指针】shared_ptr的初始化、拷贝构造和拷贝赋值、移动构造和移动赋值
  8. OpenCV3计算机视觉Python语言实现人脸识别笔记
  9. 在双十一创新技术试验场上,阿里用了哪些武器?
  10. 宝塔面板安装软件或扩展一直“等待安装”状态(sleeping)的处理