Using LightGBM

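1. Using categorical_feature (categorical features)
One improvement of LightGBM over XGBoost is its handling of categorical features: they no longer need to be converted to one-hot form. For details, see here.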

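When using the Python API (see the official documentation):
1.1 Store the features X in a pd.DataFrame, one column per feature, and set the categorical columns with X[cat_cols] = X[cat_cols].astype('category'). The model then recognizes the categorical features automatically during fit.
1.2 Alternatively, pass the categorical_feature parameter to the model's fit method to state which columns are categorical.
1.3 Categorical values should be non-negative integers, preferably consecutive integers starting from 0 (0, 1, 2, ...). Negative values are treated as missing values.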
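The points above can be tried out with pandas alone. This is a minimal sketch on a hypothetical toy frame (the column names `city` and `area` are made up for illustration): it marks a string column as categorical and inspects the integer codes pandas assigns, which start at 0 and use -1 for missing entries.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
X = pd.DataFrame({
    "city": ["beijing", "shanghai", None, "beijing"],
    "area": [80.0, 65.5, 120.0, 45.0],
})
cat_cols = ["city"]

# Point 1.1: mark the columns as pandas categoricals (note the assignment back).
X[cat_cols] = X[cat_cols].astype("category")

# pandas encodes each category as a consecutive integer code starting at 0;
# a missing entry gets the code -1.
codes = X["city"].cat.codes.tolist()
print(X.dtypes)
print(codes)
```

With the dtype set this way, fit can be left at its default categorical_feature='auto'. Passing categorical_feature=cat_cols to fit (point 1.2) is the explicit alternative.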

Below is the official documentation's description of the categorical_feature parameter of the fit method:
categorical_feature (list of strings or int, or 'auto', optional (default='auto')) – Categorical features. If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature.

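I previously ran into a problem when using sklearn's GridSearchCV: the cat_cols columns in x_train and x_test had already been set to the category dtype, but after concat their dtype turned back into int. pandas keeps the category dtype through concat only when the inputs have identical categories, so this happens whenever the two frames contain different sets of values: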

x_search = pd.concat([x_train, x_test], axis=0)
y_search = np.concatenate([train_y, test_y], axis=0)
gsearch.fit(x_search, y_search) # search for the best parameters; gsearch is an sklearn GridSearchCV instance

So the dtype has to be set again after the concat:

x_search = pd.concat([x_train, x_test], axis=0)
x_search[cat_cols] = x_search[cat_cols].astype('category')
y_search = np.concatenate([train_y, test_y], axis=0)
gsearch.fit(x_search, y_search) # search for the best parameters
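The dtype loss can be reproduced in isolation. This sketch assumes two toy frames whose category sets differ, which is the situation where pandas falls back to the underlying value dtype on concat. The frames and the column name `c` are hypothetical.

```python
import pandas as pd

cat_cols = ["c"]

# Toy frames: the categories are {0, 1} in one and {1, 2} in the other.
x_train = pd.DataFrame({"c": pd.Series([0, 1, 1], dtype="category")})
x_test = pd.DataFrame({"c": pd.Series([1, 2], dtype="category")})

# Because the category sets differ, the concatenated column is no longer categorical.
x_search = pd.concat([x_train, x_test], axis=0, ignore_index=True)
lost = x_search["c"].dtype
print("after concat:", lost)

# Re-applying the dtype restores it, now with the union of the categories.
x_search[cat_cols] = x_search[cat_cols].astype("category")
restored = x_search["c"].dtype
print("after astype:", restored)
```

If both frames are given the same categories before the concat, the dtype should survive and the extra astype becomes unnecessary. Re-casting afterwards is simply the smallest change to the code above.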

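2. Using init_score
init_score is the estimator's initial score. In regression tasks, supplying init_score can help the model converge faster.
To use it, just pass the init_score parameter to the fit method. At predict time, the init_score that was set has to be added back to the prediction manually, because predict does not include it.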


model.fit(
#        pd.concat([x_train,x_val],0),
#        np.concatenate([train_y, val_y],0),
        x_train, train_y,
        init_score=y_train_base_avg1,
        eval_metric=['mape'],
        eval_set=(x_val, val_y),
        early_stopping_rounds=20,
        eval_init_score=[y_val_base_avg1],
        verbose=True)
...
y_train_pre = model.predict(x_train) + y_train_base_avg1 # add the init_score back

The above covers regression. What about classification? And how does this work together with GridSearchCV, given that the init_score has to be added manually at predict time?
See here and here (init_score is the raw score, before any transformation.).
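For binary classification, the quoted remark suggests that init_score lives on the raw (log-odds) scale rather than the probability scale. The following numpy-only sketch works under that assumption. `base_prob` is a hypothetical baseline, and `raw_model_output` is a hypothetical stand-in for what `model.predict(X, raw_score=True)` would return from a classifier fitted with that init_score.

```python
import numpy as np

def logit(p):
    """Probability -> raw score (log-odds)."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Raw score (log-odds) -> probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical baseline probabilities, e.g. from a simpler model.
base_prob = np.array([0.2, 0.5, 0.9])
init_score = logit(base_prob)  # this is what would be passed as init_score

# Hypothetical stand-in for the model's raw output (before the sigmoid).
raw_model_output = np.array([0.0, 1.0, -1.0])

# Add init_score on the raw scale first, then transform to a probability.
final_prob = sigmoid(raw_model_output + init_score)
print(final_prob)
```

Adding the baseline after the sigmoid, on the probability scale, would not be equivalent. The sum has to happen before the transformation.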

▲▲▲ Use ParameterGrid instead of GridSearchCV, which leaves room for more custom steps in between
1. early_stopping can be used, which GridSearchCV cannot do
2. LightGBM's init_score is supported

...
from sklearn.model_selection import ParameterGrid
...
parameters = {
    'objective': ['regression', 'regression_l1'],
    'max_depth': [2,3,4,5],
    'num_leaves': [20,25,30,35],
    'n_estimators': [20,25,30,35,40,50],
    'min_child_samples': [15,20,25,30],
#        'subsample_freq': [0,2,5],
#        'subsample': [0.7,0.8,0.9,1],
#        'colsample_bytree': [0.8,0.9,1]
}
...
default_params = model.get_params() # get the parameters predefined earlier
best_score = np.inf
best_params = None
best_idx = 0
param_grid = list(ParameterGrid(parameters)) # list of all parameter combinations
for idx, param_ in enumerate(param_grid):
    param = default_params.copy()
    param.update(param_)
    model = LGBMRegressor(**param)
    model.fit(x_train, train_y,
              init_score=y_train_base_avg1,
              eval_metric=['mape'],
              eval_set=(x_val, val_y),
              early_stopping_rounds=20,
              eval_init_score=[y_val_base_avg1],
              verbose=False)
    score_ = model.best_score_['valid_0']['mape'] # best score of the current model on the val set
    print('for %d/%d, score: %.6f, best idx in %d/%d'%(idx+1, len(param_grid), score_, best_idx, len(param_grid)))
    if score_ < best_score:
        best_params = param
        best_score = score_
        best_idx = idx+1
        print('find best score: {}, \nbest params: {}'.format(best_score, best_params))
print('\nbest score: {}, \nbest params: {}\n'.format(best_score, best_params))
#raise ValueError
model = LGBMRegressor(**best_params)
model.fit(x_train, train_y,
          init_score=y_train_base_avg1,
          eval_metric=['mape'],
          eval_set=(x_val, val_y),
          early_stopping_rounds=20,
          eval_init_score=[y_val_base_avg1],
          verbose=True)
