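Feature selection is an essential step before training a machine learning model. It reduces dimensionality and lowers computational cost. While aiming for the smallest possible feature subset, it should not significantly reduce classification accuracy or distort the class distribution, and the selected subset should stay stable and adaptable.

The ML library (Java-ML) provides the following feature selection methods: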

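1. Recursive feature elimination (RFE):

The main idea of recursive feature elimination is to repeatedly build a model (such as an SVM or a regression model), pick the best (or worst) features, for example by their coefficients, set those features aside, and repeat the process on the remaining features until every feature has been processed. The order in which features are eliminated is the feature ranking. It is therefore a greedy algorithm for finding an optimal feature subset. Reference code: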

package com.gddx;

import java.io.File;

import net.sf.javaml.core.Dataset;
import net.sf.javaml.featureselection.ranking.RecursiveFeatureEliminationSVM;
import net.sf.javaml.tools.data.FileHandler;

public class TutorialFeatureRanking {
    /**
     * Shows the basic steps to create and use a feature ranking algorithm.
     *
     * @author Thomas Abeel
     */
    public static void main(String[] args) throws Exception {
        /* Load the iris data set */
        Dataset data = FileHandler.loadDataset(new File("D:\\tmp\\iris.data"), 4, ",");
        /* Create a feature ranking algorithm */
        RecursiveFeatureEliminationSVM svmrfe = new RecursiveFeatureEliminationSVM(0.2);
        /* Apply the algorithm to the data set */
        svmrfe.build(data);
        /* Print out the rank of each attribute */
        for (int i = 0; i < svmrfe.noAttributes(); i++)
            System.out.println(svmrfe.rank(i));
    }
}
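The elimination loop described above can also be sketched in plain Java, independent of the ML library. This is only a minimal sketch under stated assumptions: `RfeSketch`, `rank`, and the `weigh` callback are hypothetical names, the callback stands in for retraining a model on the remaining features, and exactly one feature (the one with the smallest absolute weight) is dropped per round. It is not how `RecursiveFeatureEliminationSVM` is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class RfeSketch {

    /**
     * Ranks features by elimination order: rank 0 is the feature removed last
     * (most important), rank numFeatures - 1 is the feature removed first.
     *
     * @param weigh hypothetical callback: given the indices of the features still
     *              active, returns one weight per active feature, in the same order
     */
    static int[] rank(int numFeatures, Function<List<Integer>, double[]> weigh) {
        List<Integer> active = new ArrayList<>();
        for (int f = 0; f < numFeatures; f++) {
            active.add(f);
        }
        int[] rank = new int[numFeatures];
        int nextRank = numFeatures - 1;
        while (!active.isEmpty()) {
            // "Retrain" on the remaining features and read off their weights.
            double[] w = weigh.apply(active);
            // Find the active feature with the smallest absolute weight.
            int worst = 0;
            for (int i = 1; i < w.length; i++) {
                if (Math.abs(w[i]) < Math.abs(w[worst])) {
                    worst = i;
                }
            }
            // Eliminate it; features eliminated earlier get a worse (higher) rank.
            int removed = active.remove(worst);
            rank[removed] = nextRank--;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Toy stand-in for a trained model: fixed weights per feature (assumption).
        double[] toy = {0.5, 2.0, 0.1};
        int[] ranks = rank(3, active -> {
            double[] w = new double[active.size()];
            for (int i = 0; i < w.length; i++) {
                w[i] = toy[active.get(i)];
            }
            return w;
        });
        System.out.println(Arrays.toString(ranks));
    }
}
```

With a real model the weights would change after each removal, which is what makes the procedure greedy rather than a single sort.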

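2. Pearson correlation coefficient (Pearson Correlation)

The Pearson correlation coefficient expresses the relationship between a feature and the response variable. It measures the linear correlation between variables and takes values in [-1, 1]: -1 means a perfect negative correlation (as one variable decreases, the other increases), +1 means a perfect positive correlation, and 0 means no linear correlation. Reference code: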

package com.gddx;

import java.io.File;

import net.sf.javaml.core.Dataset;
import net.sf.javaml.distance.PearsonCorrelationCoefficient;
import net.sf.javaml.featureselection.subset.GreedyForwardSelection;
import net.sf.javaml.tools.data.FileHandler;

/**
 * Shows the basic steps to create use a feature subset selection algorithm.
 *
 * @author Thomas Abeel
 */
public class TutorialFeatureSubsetSelection {
    public static void main(String[] args) throws Exception {
        /* Load the iris data set */
        Dataset data = FileHandler.loadDataset(new File("D:\\tmp\\iris.data"), 4, ",");
        /*
         * Construct a greedy forward subset selector that will use the Pearson
         * correlation to determine the relation between each attribute and the
         * class label. The first parameter indicates that only one, i.e. 'the
         * best' attribute will be selected.
         */
        GreedyForwardSelection ga = new GreedyForwardSelection(1, new PearsonCorrelationCoefficient());
        /* Apply the algorithm to the data set */
        ga.build(data);
        /* Print out the attribute that has been selected */
        System.out.println(ga.selectedAttributes());
    }
}
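To make the [-1, 1] range concrete, the coefficient itself can be computed by hand. The following is a minimal plain-Java sketch of the textbook formula (covariance divided by the product of the standard deviations); `PearsonSketch` and `pearson` are hypothetical names and this is not the library's `PearsonCorrelationCoefficient` class.

```java
class PearsonSketch {

    /**
     * Pearson correlation of two equally long samples.
     * Returns NaN when either sample has zero variance.
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("samples must be non-empty and of equal length");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Accumulate the (unnormalised) covariance and variances.
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // The 1/n factors cancel between numerator and denominator.
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        System.out.println(pearson(x, new double[] {2, 4, 6, 8})); // rising together
        System.out.println(pearson(x, new double[] {8, 6, 4, 2})); // one rises, one falls
    }
}
```

A value near 0 only rules out a linear relationship; a feature can still depend on the response in a nonlinear way.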

3. Ensemble feature selection

This combines the rankings produced by several models into one ensemble ranking. Reference code:

package com.gddx;

import java.io.File;

import net.sf.javaml.core.Dataset;
import net.sf.javaml.featureselection.ensemble.LinearRankingEnsemble;
import net.sf.javaml.featureselection.ranking.RecursiveFeatureEliminationSVM;
import net.sf.javaml.tools.data.FileHandler;

/**
 * Tutorial to illustrate ensemble feature selection.
 *
 * @author Thomas Abeel
 */
public class TutorialEnsembleFeatureSelection {
    /**
     * Shows the basic steps to use ensemble feature selection
     *
     * @author Thomas Abeel
     */
    public static void main(String[] args) throws Exception {
        /* Load the iris data set */
        Dataset data = FileHandler.loadDataset(new File("D:\\tmp\\iris.data"), 4, ",");
        /* Create a feature ranking algorithm */
        RecursiveFeatureEliminationSVM[] svmrfes = new RecursiveFeatureEliminationSVM[10];
        for (int i = 0; i < svmrfes.length; i++)
            svmrfes[i] = new RecursiveFeatureEliminationSVM(0.2);
        LinearRankingEnsemble ensemble = new LinearRankingEnsemble(svmrfes);
        /* Build the ensemble */
        ensemble.build(data);
        /* Print out the rank of each attribute */
        for (int i = 0; i < ensemble.noAttributes(); i++)
            System.out.println(ensemble.rank(i));
    }
}
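One simple way to combine several rankings is to add up each feature's rank across all rankers and then re-rank by that sum. The sketch below illustrates that idea in plain Java. It is an assumption made for illustration, not a description of what `LinearRankingEnsemble` does internally, and `RankEnsembleSketch` and `aggregate` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;

class RankEnsembleSketch {

    /**
     * Combines several rankings by summing ranks.
     *
     * @param ranks ranks[r][f] is the rank ranker r gave feature f (0 = best)
     * @return the aggregated rank of each feature (0 = best); ties go to the
     *         feature with the lower index
     */
    static int[] aggregate(int[][] ranks) {
        int numFeatures = ranks[0].length;
        int[] sum = new int[numFeatures];
        for (int[] ranking : ranks) {
            for (int f = 0; f < numFeatures; f++) {
                sum[f] += ranking[f];
            }
        }
        // Sort feature indices by their summed rank (the sort is stable).
        Integer[] order = new Integer[numFeatures];
        for (int f = 0; f < numFeatures; f++) {
            order[f] = f;
        }
        Arrays.sort(order, Comparator.comparingInt(f -> sum[f]));

        // Position in the sorted order becomes the final rank.
        int[] finalRank = new int[numFeatures];
        for (int pos = 0; pos < numFeatures; pos++) {
            finalRank[order[pos]] = pos;
        }
        return finalRank;
    }

    public static void main(String[] args) {
        int[][] ranks = {
            {0, 1, 2},
            {2, 0, 1},
            {1, 0, 2},
        };
        System.out.println(Arrays.toString(aggregate(ranks)));
    }
}
```

Averaging over several rankers tends to smooth out the instability of any single ranking, which is the motivation for using an ensemble here.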

4. Feature scoring:

package com.gddx;

import java.io.File;

import net.sf.javaml.core.Dataset;
import net.sf.javaml.featureselection.scoring.GainRatio;
import net.sf.javaml.tools.data.FileHandler;

public class TutorialFeatureScoring {
    /**
     * Shows the basic steps to create and use a feature scoring algorithm.
     *
     * @author Thomas Abeel
     */
    public static void main(String[] args) throws Exception {
        /* Load the iris data set */
        Dataset data = FileHandler.loadDataset(new File("D:\\tmp\\iris.data"), 4, ",");
        GainRatio ga = new GainRatio();
        /* Apply the algorithm to the data set */
        ga.build(data);
        /* Print out the score of each attribute */
        for (int i = 0; i < ga.noAttributes(); i++)
            System.out.println(ga.score(i));
    }
}
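The score used above is the gain ratio: the information gain of a feature divided by the entropy of the feature's own value distribution (the split information). A minimal plain-Java sketch of that textbook definition follows. It assumes the feature is already discrete, and `GainRatioSketch`, `entropy`, and `gainRatio` are hypothetical names; how the library's `GainRatio` class handles continuous attributes such as those in iris is not covered here.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

class GainRatioSketch {

    /** Shannon entropy, in bits, of a distribution given as raw counts. */
    static double entropy(Collection<Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) {
                continue;
            }
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** Gain ratio of a discrete feature with respect to the class labels. */
    static double gainRatio(int[] feature, int[] label) {
        if (feature.length != label.length || feature.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        int n = feature.length;
        Map<Integer, Integer> classCounts = new HashMap<>();
        Map<Integer, Integer> valueCounts = new HashMap<>();
        Map<Integer, Map<Integer, Integer>> classCountsPerValue = new HashMap<>();
        for (int i = 0; i < n; i++) {
            classCounts.merge(label[i], 1, Integer::sum);
            valueCounts.merge(feature[i], 1, Integer::sum);
            classCountsPerValue.computeIfAbsent(feature[i], k -> new HashMap<>())
                    .merge(label[i], 1, Integer::sum);
        }
        // Expected class entropy after splitting on the feature.
        double conditional = 0.0;
        for (Map.Entry<Integer, Integer> e : valueCounts.entrySet()) {
            int size = e.getValue();
            conditional += (double) size / n
                    * entropy(classCountsPerValue.get(e.getKey()).values(), size);
        }
        double gain = entropy(classCounts.values(), n) - conditional;
        double splitInfo = entropy(valueCounts.values(), n);
        // A constant feature carries no information; avoid dividing by zero.
        return splitInfo == 0.0 ? 0.0 : gain / splitInfo;
    }

    public static void main(String[] args) {
        int[] label = {0, 0, 1, 1};
        System.out.println(gainRatio(new int[] {0, 0, 1, 1}, label)); // feature mirrors the label
        System.out.println(gainRatio(new int[] {0, 1, 0, 1}, label)); // feature unrelated to the label
    }
}
```

Dividing by the split information penalises features with many distinct values, which plain information gain tends to favour.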

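5. WekaAttributeSelection: this one still selects features mainly by gain (the code below uses Weka's GainRatioAttributeEval with the Ranker search), and its output should include both the rank and the score of each attribute. Reference code: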

package com.gddx;

/**
 * This file is part of the Java Machine Learning Library
 *
 * The Java Machine Learning Library is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * The Java Machine Learning Library is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with the Java Machine Learning Library; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
 *
 * Copyright (c) 2006-2012, Thomas Abeel
 *
 * Project: http://java-ml.sourceforge.net/
 */
import java.io.File;
import java.io.IOException;

import net.sf.javaml.core.Dataset;
import net.sf.javaml.tools.data.FileHandler;
import net.sf.javaml.tools.weka.WekaAttributeSelection;
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.ASSearch;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;

/**
 * Tutorial how to use the Bridge to WEKA AS Evaluation, AS Search and
 * Evaluator algorithms in Java-ML
 *
 * @author Irwan Krisna
 */
public class TutorialWekaAttributeSelection {
    public static void main(String[] args) throws IOException {
        /* Load data */
        Dataset data = FileHandler.loadDataset(new File("D:\\tmp\\iris.data"), 4, ",");
        /* Create an AS Evaluation algorithm */
        ASEvaluation eval = new GainRatioAttributeEval();
        /* Create a Weka AS Search algorithm */
        ASSearch search = new Ranker();
        /* Wrap Weka's algorithms in the bridge */
        WekaAttributeSelection wekaattrsel = new WekaAttributeSelection(eval, search);
        /*
         * to apply algorithm to the data set and generate the new data based on
         * the given parameters
         */
        wekaattrsel.build(data);
        /* to retrieve the number of attributes */
        System.out.println("Total number of attributes:  " + wekaattrsel.noAttributes());
        /* to display all the rank and score for each attribute */
        for (int i = 0; i < wekaattrsel.noAttributes() - 1; i++) {
            System.out.println("Attribute  " + i + "  Ranks  " + wekaattrsel.rank(i)
                    + " and Scores " + wekaattrsel.score(i));
        }
    }
}
