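Feature importance is commonly evaluated in three ways:

① gain: the relative contribution of a feature to the model, computed by taking that feature's contribution in each tree of the model. A higher value than another feature's means the feature matters more for generating predictions.

② cover: the relative number of observations related to a feature. For example, suppose you have 100 observations, 4 features and 3 trees, and feature1 is used to decide the leaf node for 10, 5 and 2 observations in tree1, tree2 and tree3 respectively. The metric then counts the cover of feature1 as 10 + 5 + 2 = 17 observations. The same count is made for all 4 features, and the 17 is expressed as a percentage of the cover summed over all features.

③ freq (frequency): the percentage representing the relative number of times a feature occurs in the trees of the model. In the example above, if feature1 occurs in 2 splits, 1 split and 3 splits in tree1, tree2 and tree3 respectively, its weight is 2 + 1 + 3 = 6. The frequency of feature1 is that weight as a percentage of the weights of all features.

Gain is the most relevant attribute for interpreting the relative importance of each feature.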
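
The arithmetic in the cover and freq examples can be sketched in a few lines of Java. This is only an illustration: feature1's totals (17 and 6) come from the example above, while the totals for feature2 to feature4 and the names ImportanceShare and toShares are made up here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ImportanceShare {

    /** Turns raw per-feature totals into shares of the overall total. */
    public static Map<String, Double> toShares(Map<String, Double> totals) {
        double sum = 0.0;
        for (double v : totals.values()) {
            sum += v;
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            // Guard against an all-zero input instead of dividing by zero.
            shares.put(e.getKey(), sum == 0.0 ? 0.0 : e.getValue() / sum);
        }
        return shares;
    }

    public static void main(String[] args) {
        // Cover: observations routed through splits on each feature, summed over trees.
        Map<String, Double> cover = new LinkedHashMap<>();
        cover.put("feature1", 10.0 + 5.0 + 2.0); // 17, from the example in the text
        cover.put("feature2", 40.0);             // hypothetical
        cover.put("feature3", 25.0);             // hypothetical
        cover.put("feature4", 18.0);             // hypothetical

        // Freq: number of splits that use each feature, summed over trees.
        Map<String, Double> freq = new LinkedHashMap<>();
        freq.put("feature1", 2.0 + 1.0 + 3.0);   // 6, from the example in the text
        freq.put("feature2", 4.0);               // hypothetical
        freq.put("feature3", 3.0);               // hypothetical
        freq.put("feature4", 2.0);               // hypothetical

        System.out.println("cover share: " + toShares(cover));
        System.out.println("freq share:  " + toShares(freq));
    }
}
```

The same normalisation applies to gain: sum the per-split gain for each feature, then divide by the total over all features.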

So how do you get these feature importances from the ml.dmlc.xgboost4j library on Spark?

Reading the source shows that model.booster.getFeatureScore(null) returns only ③ freq. To get ① gain and ② cover, call model.booster.getModelDump(null, true) to dump the trees with their statistics, then post-process the dump.

val modelInfos = model.booster.getModelDump(null, true)
println(modelInfos(0))
val modelDump = XGBoostFeatureImportanciesUtil.getFeatureImportancies(modelInfos)
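
To make the post-processing step concrete: each split node in the text dump carries its feature id, gain and cover on one line, while leaf lines have no bracketed condition. The following is a minimal sketch of extracting those three fields from a single line with a regular expression. It is an alternative illustration, not the parsing used in the utility class below (which relies on String.split); the class name DumpLineParser and the pattern are assumptions, and the two input lines are copied from the sample dump in the utility's main method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DumpLineParser {

    // Matches a split node such as
    // "0:[f46<0.0204612] yes=1,no=2,missing=1,gain=2319.78,cover=114480"
    // and captures the feature id, the gain and the cover.
    private static final Pattern SPLIT_NODE = Pattern.compile(
            "\\[([^<\\]]+)<[^\\]]*\\].*?gain=([^,\\s]+),cover=([^,\\s]+)");

    /** Returns {featureId, gain, cover} for a split node, or null for a leaf line. */
    public static String[] parse(String line) {
        Matcher m = SPLIT_NODE.matcher(line);
        if (!m.find()) {
            return null; // leaf lines contain no "[feature<threshold]" part
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        String splitLine = "0:[f46<0.0204612] yes=1,no=2,missing=1,gain=2319.78,cover=114480";
        String leafLine = "\t\t\t\t\t\t63:leaf=-0.190268,cover=25790.8";

        String[] fields = parse(splitLine);
        System.out.println(fields[0] + " gain=" + fields[1] + " cover=" + fields[2]);
        System.out.println(parse(leafLine) == null ? "leaf line skipped" : "unexpected match");
    }
}
```
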

Below is the core utility class I wrote. Calling getFeatureImportancies(modelInfos) returns one line per feature containing the feature id, ③ freq (number of splits), ① gain (total gain, followed by that gain as a share of the total gain over all features) and ② cover (total cover). The fields are separated by \001. The main method contains a sample of modelInfos.

import ml.dmlc.xgboost4j.java.XGBoostError;

import java.util.HashMap;
import java.util.Map;

public class XGBoostFeatureImportanciesUtil {

    // Field separator used within each output line.
    public static String split = "\001";

    /**
     * Aggregates freq, gain and cover per feature from a text model dump
     * (one array element per tree, produced with statistics enabled).
     * Each output line is: fid, freq, total gain, gain share, total cover.
     */
    public static String getFeatureImportancies(String[] modelInfos) throws XGBoostError {
        Map<String, Integer> featureFreq = new HashMap<>();
        Map<String, Double> featureGain = new HashMap<>();
        Map<String, Double> featureCover = new HashMap<>();

        for (String tree : modelInfos) {
            for (String node : tree.split("\n")) {
                // Split nodes look like "0:[f46<0.0204612] yes=1,...,gain=2319.78,cover=114480".
                // Leaf nodes contain no '[' and are skipped.
                String[] array = node.split("\\[");
                if (array.length != 1) {
                    String fid = array[1].split("\\]")[0];
                    fid = fid.split("<")[0];
                    String gain = array[1].split("gain=")[1];
                    gain = gain.split(",")[0];
                    String cover = array[1].split("cover=")[1];
                    cover = cover.split(",")[0];

                    // Accumulate the split count, gain and cover for this feature.
                    featureFreq.merge(fid, 1, Integer::sum);
                    featureGain.merge(fid, Double.valueOf(gain), Double::sum);
                    featureCover.merge(fid, Double.valueOf(cover), Double::sum);
                }
            }
        }

        double gainSum = 0.0d;
        for (Double gain : featureGain.values()) {
            gainSum = gainSum + gain;
        }
        // No split carried any gain: nothing meaningful to report.
        if (gainSum == 0.0d) {
            return "";
        }

        StringBuilder featureImportancies = new StringBuilder();
        for (String fid : featureFreq.keySet()) {
            featureImportancies.append(fid).append(split)
                    .append(featureFreq.get(fid)).append(split)
                    .append(featureGain.get(fid)).append(split)
                    .append(featureGain.get(fid) / gainSum).append(split)
                    .append(featureCover.get(fid)).append("\n");
        }
        return featureImportancies.toString();
    }

    public static void main(String[] args) {
        // Sample dump of a single tree, as returned by getModelDump(null, true).
        String modelinfo = "0:[f46<0.0204612] yes=1,no=2,missing=1,gain=2319.78,cover=114480\n"
                + "\t1:[f23<0.603893] yes=3,no=4,missing=4,gain=478.741,cover=66524\n"
                + "\t\t3:[f51<0.00335573] yes=7,no=8,missing=8,gain=245.974,cover=62588.2\n"
                + "\t\t\t7:[f20<0.512786] yes=15,no=16,missing=16,gain=60.4821,cover=40874.2\n"
                + "\t\t\t\t15:[f63<0.0108743] yes=31,no=32,missing=31,gain=20.7964,cover=31967.2\n"
                + "\t\t\t\t\t31:[f23<0.537906] yes=63,no=64,missing=64,gain=10.1748,cover=26765\n"
                + "\t\t\t\t\t\t63:leaf=-0.190268,cover=25790.8\n"
                + "\t\t\t\t\t\t64:leaf=-0.178159,cover=974.25\n"
                + "\t\t\t\t\t32:[f49<0.267547] yes=65,no=66,missing=66,gain=15.3241,cover=5202.25\n"
                + "\t\t\t\t\t\t65:leaf=-0.184069,cover=4794.75\n"
                + "\t\t\t\t\t\t66:leaf=-0.161812,cover=407.5\n"
                + "\t\t\t\t16:[f23<0.482164] yes=33,no=34,missing=34,gain=26.911,cover=8907\n"
                + "\t\t\t\t\t33:[f42<1.2857] yes=67,no=68,missing=67,gain=0.971554,cover=2953.75\n"
                + "\t\t\t\t\t\t67:leaf=-0.187414,cover=2946.75\n"
                + "\t\t\t\t\t\t68:leaf=-0.1125,cover=7\n"
                + "\t\t\t\t\t34:[f4<0.999999] yes=69,no=70,missing=69,gain=19.0911,cover=5953.25\n"
                + "\t\t\t\t\t\t69:leaf=-0.183261,cover=2066\n"
                + "\t\t\t\t\t\t70:leaf=-0.170449,cover=3887.25\n"
                + "\t\t\t8:[f49<0.0197215] yes=17,no=18,missing=18,gain=61.3492,cover=21714\n"
                + "\t\t\t\t17:[f4<0.999999] yes=35,no=36,missing=35,gain=21.5379,cover=624.75\n"
                + "\t\t\t\t\t35:[f12<3] yes=71,no=72,missing=72,gain=1.60138,cover=300.75\n"
                + "\t\t\t\t\t\t71:leaf=-0.14714,cover=125.75\n"
                + "\t\t\t\t\t\t72:leaf=-0.171023,cover=175\n"
                + "\t\t\t\t\t36:[f20<0.525825] yes=73,no=74,missing=74,gain=11.2899,cover=324\n"
                + "\t\t\t\t\t\t73:leaf=-0.14304,cover=155.25\n"
                + "\t\t\t\t\t\t74:leaf=-0.103387,cover=168.75\n"
                + "\t\t\t\t18:[f21<0.492682] yes=37,no=38,missing=38,gain=53.6745,cover=21089.2\n"
                + "\t\t\t\t\t37:[f52<1.2884] yes=75,no=76,missing=75,gain=13.3477,cover=5378.25\n"
                + "\t\t\t\t\t\t75:leaf=-0.18332,cover=5364.75\n"
                + "\t\t\t\t\t\t76:leaf=-0.0758621,cover=13.5\n"
                + "\t\t\t\t\t38:[f52<1.2884] yes=77,no=78,missing=77,gain=38.0972,cover=15711\n"
                + "\t\t\t\t\t\t77:leaf=-0.171475,cover=15651.8\n"
                + "\t\t\t\t\t\t78:leaf=-0.0887967,cover=59.25\n"
                + "\t\t4:[f23<0.856171] yes=9,no=10,missing=10,gain=142.686,cover=3935.75\n"
                + "\t\t\t9:[f40<0.0587658] yes=19,no=20,missing=19,gain=36.5799,cover=2969.5\n"
                + "\t\t\t\t19:[f6<0.999999] yes=39,no=40,missing=39,gain=12.9572,cover=2716.75\n"
                + "\t\t\t\t\t39:[f20<0.550559] yes=79,no=80,missing=80,gain=15.747,cover=1731.75\n"
                + "\t\t\t\t\t\t79:leaf=-0.167095,cover=679.75\n"
                + "\t\t\t\t\t\t80:leaf=-0.146154,cover=1052\n"
                + "\t\t\t\t\t40:[f14<2] yes=81,no=82,missing=82,gain=9.42129,cover=985\n"
                + "\t\t\t\t\t\t81:leaf=-0.174622,cover=842.25\n"
                + "\t\t\t\t\t\t82:leaf=-0.142957,cover=142.75\n"
                + "\t\t\t\t20:[f15<0.52803] yes=41,no=42,missing=42,gain=7.56139,cover=252.75\n"
                + "\t\t\t\t\t41:[f27<5e+08] yes=83,no=84,missing=84,gain=6.3671,cover=68.75\n"
                + "\t\t\t\t\t\t83:leaf=-0.0166667,cover=11\n"
                + "\t\t\t\t\t\t84:leaf=-0.101277,cover=57.75\n"
                + "\t\t\t\t\t42:[f43<0.0633769] yes=85,no=86,missing=86,gain=5.73181,cover=184\n"
                + "\t\t\t\t\t\t85:leaf=-0.0285714,cover=6\n"
                + "\t\t\t\t\t\t86:leaf=-0.13352,cover=178\n"
                + "\t\t\t10:[f14<2] yes=21,no=22,missing=22,gain=154.902,cover=966.25\n"
                + "\t\t\t\t21:[f54<0.000459881] yes=43,no=44,missing=43,gain=56.6445,cover=810.5\n"
                + "\t\t\t\t\t43:[f25<1363] yes=87,no=88,missing=88,gain=7.87352,cover=534.75\n"
                + "\t\t\t\t\t\t87:leaf=-0.156702,cover=396.25\n"
                + "\t\t\t\t\t\t88:leaf=-0.125448,cover=138.5\n"
                + "\t\t\t\t\t44:[f21<0.74018] yes=89,no=90,missing=90,gain=55.2292,cover=275.75\n"
                + "\t\t\t\t\t\t89:leaf=-0.13506,cover=143.75\n"
                + "\t\t\t\t\t\t90:leaf=-0.0451128,cover=132\n"
                + "\t\t\t\t22:[f24<872] yes=45,no=46,missing=46,gain=52.8332,cover=155.75\n"
                + "\t\t\t\t\t45:[f33<0.0417348] yes=91,no=92,missing=92,gain=36.2418,cover=125.25\n"
                + "\t\t\t\t\t\t91:leaf=0.0578755,cover=67.25\n"
                + "\t\t\t\t\t\t92:leaf=-0.0491525,cover=58\n"
                + "\t\t\t\t\t46:[f50<1.03741e-06] yes=93,no=94,missing=93,gain=7.27111,cover=30.5\n"
                + "\t\t\t\t\t\t93:leaf=-0.024,cover=5.25\n"
                + "\t\t\t\t\t\t94:leaf=-0.158095,cover=25.25\n"
                + "\t2:[f23<0.482164] yes=5,no=6,missing=6,gain=485.305,cover=47955.8\n"
                + "\t\t5:[f24<871] yes=11,no=12,missing=12,gain=26.8177,cover=9959\n"
                + "\t\t\t11:[f12<2] yes=23,no=24,missing=24,gain=5.38835,cover=3626.25\n"
                + "\t\t\t\t23:[f59<0.961747] yes=47,no=48,missing=48,gain=2.22561,cover=1007.75\n"
                + "\t\t\t\t\t47:[f23<0.317355] yes=95,no=96,missing=96,gain=0.245255,cover=981.25\n"
                + "\t\t\t\t\t\t95:leaf=-0.187817,cover=97.5\n"
                + "\t\t\t\t\t\t96:leaf=-0.168918,cover=883.75\n"
                + "\t\t\t\t\t48:[f25<27260] yes=97,no=98,missing=98,gain=5.89723,cover=26.5\n"
                + "\t\t\t\t\t\t97:leaf=-0.147826,cover=22\n"
                + "\t\t\t\t\t\t98:leaf=-0.0181818,cover=4.5\n"
                + "\t\t\t\t24:[f57<4.02672] yes=49,no=50,missing=49,gain=2.78306,cover=2618.5\n"
                + "\t\t\t\t\t49:[f38<0.483983] yes=99,no=100,missing=100,gain=1.7698,cover=2613.75\n"
                + "\t\t\t\t\t\t99:leaf=-0.174041,cover=761.75\n"
                + "\t\t\t\t\t\t100:leaf=-0.183702,cover=1852\n"
                + "\t\t\t\t\t50:leaf=-0.0782609,cover=4.75\n"
                + "\t\t\t12:[f38<0.623771] yes=25,no=26,missing=26,gain=10.7639,cover=6332.75\n"
                + "\t\t\t\t25:[f25<92] yes=51,no=52,missing=52,gain=8.55349,cover=3259.25\n"
                + "\t\t\t\t\t51:[f47<0.146558] yes=101,no=102,missing=102,gain=7.37567,cover=63.25\n"
                + "\t\t\t\t\t\t101:leaf=0.025,cover=3\n"
                + "\t\t\t\t\t\t102:leaf=-0.128163,cover=60.25\n"
                + "\t\t\t\t\t52:[f21<0.515144] yes=103,no=104,missing=104,gain=3.08302,cover=3196\n"
                + "\t\t\t\t\t\t103:leaf=-0.164141,cover=2863\n"
                + "\t\t\t\t\t\t104:leaf=-0.150299,cover=333\n"
                + "\t\t\t\t26:[f40<0.189202] yes=53,no=54,missing=54,gain=3.86644,cover=3073.5\n"
                + "\t\t\t\t\t53:[f15<0.528013] yes=105,no=106,missing=106,gain=6.87584,cover=115\n"
                + "\t\t\t\t\t\t105:leaf=-0.04,cover=6.5\n"
                + "\t\t\t\t\t\t106:leaf=-0.153425,cover=108.5\n"
                + "\t\t\t\t\t54:[f61<0.550285] yes=107,no=108,missing=108,gain=2.70689,cover=2958.5\n"
                + "\t\t\t\t\t\t107:leaf=-0.172909,cover=2845\n"
                + "\t\t\t\t\t\t108:leaf=-0.150218,cover=113.5\n"
                + "\t\t6:[f12<3] yes=13,no=14,missing=13,gain=353.559,cover=37996.8\n"
                + "\t\t\t13:[f22<0.727618] yes=27,no=28,missing=28,gain=280.015,cover=21181.8\n"
                + "\t\t\t\t27:[f20<0.590193] yes=55,no=56,missing=56,gain=92.3481,cover=19267\n"
                + "\t\t\t\t\t55:[f24<871] yes=109,no=110,missing=110,gain=65.2798,cover=18465\n"
                + "\t\t\t\t\t\t109:leaf=-0.150815,cover=6106.5\n"
                + "\t\t\t\t\t\t110:leaf=-0.137983,cover=12358.5\n"
                + "\t\t\t\t\t56:[f38<0.12501] yes=111,no=112,missing=112,gain=31.4306,cover=802\n"
                + "\t\t\t\t\t\t111:leaf=-0.0633452,cover=139.5\n"
                + "\t\t\t\t\t\t112:leaf=-0.116353,cover=662.5\n"
                + "\t\t\t\t28:[f14<2] yes=57,no=58,missing=58,gain=93.2516,cover=1914.75\n"
                + "\t\t\t\t\t57:[f25<262] yes=113,no=114,missing=114,gain=31.8208,cover=1435.5\n"
                + "\t\t\t\t\t\t113:leaf=-0.0870588,cover=360.25\n"
                + "\t\t\t\t\t\t114:leaf=-0.122044,cover=1075.25\n"
                + "\t\t\t\t\t58:[f29<883] yes=115,no=116,missing=116,gain=50.9932,cover=479.25\n"
                + "\t\t\t\t\t\t115:leaf=-0.110231,cover=150.5\n"
                + "\t\t\t\t\t\t116:leaf=-0.0398787,cover=328.75\n"
                + "\t\t\t14:[f20<0.779492] yes=29,no=30,missing=30,gain=182.847,cover=16815\n"
                + "\t\t\t\t29:[f24<871] yes=59,no=60,missing=60,gain=97.5659,cover=16492.2\n"
                + "\t\t\t\t\t59:[f22<0.46508] yes=117,no=118,missing=118,gain=31.147,cover=5885.5\n"
                + "\t\t\t\t\t\t117:leaf=-0.177988,cover=2302.25\n"
                + "\t\t\t\t\t\t118:leaf=-0.162419,cover=3583.25\n"
                + "\t\t\t\t\t60:[f4<0.999999] yes=119,no=120,missing=120,gain=60.9987,cover=10606.8\n"
                + "\t\t\t\t\t\t119:leaf=-0.160178,cover=5177\n"
                + "\t\t\t\t\t\t120:leaf=-0.144722,cover=5429.75\n"
                + "\t\t\t\t30:[f15<0.528035] yes=61,no=62,missing=62,gain=30.2279,cover=322.75\n"
                + "\t\t\t\t\t61:[f6<-1e-06] yes=121,no=122,missing=122,gain=24.0147,cover=144.25\n"
                + "\t\t\t\t\t\t121:leaf=-0.138144,cover=23.25\n"
                + "\t\t\t\t\t\t122:leaf=-0.0286885,cover=121\n"
                + "\t\t\t\t\t62:[f8<-1e-06] yes=123,no=124,missing=124,gain=11.7343,cover=178.5\n"
                + "\t\t\t\t\t\t123:leaf=-0.00833333,cover=11\n"
                + "\t\t\t\t\t\t124:leaf=-0.115727,cover=167.5";

        String[] modelInfos = new String[1];
        modelInfos[0] = modelinfo;
        try {
            String importance = getFeatureImportancies(modelInfos);
            System.out.println(importance);
        } catch (XGBoostError xgBoostError) {
            xgBoostError.printStackTrace();
        }
    }
}
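
Because the utility returns a single \001-delimited string, a caller usually has to split it again before ranking features. The following is a rough sketch of that step, assuming the field order fid, freq, total gain, gain share, total cover. The class ImportanceRanking, its Row type and the two sample rows are hypothetical and are not output from the model above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ImportanceRanking {

    /** One output line of the utility: fid, freq, total gain, gain share, total cover. */
    public static final class Row {
        public final String fid;
        public final int freq;
        public final double gain;
        public final double gainShare;
        public final double cover;

        Row(String fid, int freq, double gain, double gainShare, double cover) {
            this.fid = fid;
            this.freq = freq;
            this.gain = gain;
            this.gainShare = gainShare;
            this.cover = cover;
        }
    }

    /** Parses the \001-delimited lines and sorts them by gain share, largest first. */
    public static List<Row> parseAndRank(String importances) {
        List<Row> rows = new ArrayList<>();
        for (String line : importances.split("\n")) {
            if (line.isEmpty()) {
                continue; // tolerate an empty input or stray blank lines
            }
            String[] f = line.split("\001");
            rows.add(new Row(f[0], Integer.parseInt(f[1]), Double.parseDouble(f[2]),
                    Double.parseDouble(f[3]), Double.parseDouble(f[4])));
        }
        rows.sort(Comparator.comparingDouble((Row r) -> r.gainShare).reversed());
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical rows in the utility's output format.
        String sample = String.join("\001", "f0", "1", "10.0", "0.25", "50.0") + "\n"
                + String.join("\001", "f1", "2", "30.0", "0.75", "100.0") + "\n";
        for (Row r : parseAndRank(sample)) {
            System.out.println(r.fid + " gainShare=" + r.gainShare + " freq=" + r.freq);
        }
    }
}
```

Sorting by gain share follows the remark above that gain is the most relevant of the three measures; sorting by freq or cover only requires a different comparator.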

I am sharing this openly and hope it is useful. Please credit the source when reposting.
