Help! How Do I Feature-Select?

Oftentimes, we're not sure how to choose our features. This is just a small guide to help with that choice. (Disclaimer: for now, I'll talk about binary classification.)

Many times, when we're super-excited to predict using a fancy machine-learning algorithm and almost ready to apply our models to analyze and classify the test data set, we don't exactly know which features to pick. Oftentimes the number of features can range from tens to thousands, and it's not exactly clear how to pick relevant features, or how many features we should select. Sometimes it's not a bad idea to combine features together, also known as feature engineering. A common example you've probably heard of in machine learning is principal component analysis (PCA), where the data matrix X is factorized via its singular value decomposition (SVD) as X = U·Σ·Vᵀ, where Σ is a diagonal matrix of singular values, and the number of singular values you keep determines how many principal components you get. You can think of principal components as a way to reduce the dimensionality of your data set. The awesome thing about PCA is that the new engineered features, or "principal components", are linear combinations of the original features. And that's great! We love linear combinations, because they only involve addition and scalar multiplication, and they're not too hard to interpret. For example, if you did PCA on a house-price regression dataset and kept only 2 principal components, the first component PC1 could be c1·(number of bedrooms) + c2·(square footage), and PC2 could be something similar.
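
Here's a minimal sketch of that idea using scikit-learn's PCA on made-up random data; the three columns stand in for hypothetical house features, nothing here comes from a real dataset:

import numpy as np
from sklearn.decomposition import PCA

# Fake data standing in for (bedrooms, square footage, lot size)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

pca = PCA(n_components=2)
X_pc = pca.fit_transform(X)          # each row now has 2 principal components

# Each principal component is a linear combination of the original columns;
# the coefficients (loadings) live in pca.components_
print(pca.components_)               # shape (2, 3): one [c1, c2, c3] per component
print(pca.explained_variance_ratio_) # how much variance each PC captures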

The limitation of principal components is that the new features you make are *only* linear combinations of the old ones. That means you can't take advantage of non-linear combinations of features. This is something neural networks are awesome at: they can create TONS of non-linear combinations/functions of features. But they have an even bigger problem: interpretability of the new features. The engineered features are basically hidden inside the weight-matrix multiplications between different layers of the network (which is just a composition of non-linear functions). And neural networks, with that extra non-linearity, can often be brittle and break under adversarial attacks, such as few-pixel attacks on convolutional neural networks, or tricking a network into classifying a panda with a black square stuck on it as a vulture. Weird, nonsense stuff like that.

So, what to do about features? Well, if the ways we can engineer new features are kind of limited, we could always just select a subset of the features we already have! But you need to be careful. There are many ways to do this, and not all of them are robust and consistent. For example, take random forests. It's true that after fitting the classifier, Python will report the relevant features via a random forest's feature_importances_ attribute. But let's think for a second: random forests work by training a bunch of decision trees, each one on a random subset of the training data. So if you kept re-fitting the RF model, you might get different feature importances each time, and that is neither robust nor consistent. Wouldn't it be confusing, as a data scientist or ML engineer, to see a different set of relevant features pop up each time? You clearly didn't change the data set! So why should you trust different sets of "important" features? The problem is that the "important" features you're picking depend on the random-forest model itself, and even if RFs have high accuracy, it makes more sense to choose features based on the dataset alone rather than routing the choice through a heavy-duty model first.
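
To see the consistency problem concretely, here's a small sketch on a synthetic toy dataset (purely for illustration) that fits the same forest twice without fixing random_state; the two importance vectors typically come out slightly different even though the data never changed:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: 8 features, only 3 of which are actually informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Same data, same model class, two fits: the importance rankings can shift
for run in range(2):
    rf = RandomForestClassifier(n_estimators=50).fit(X, y)
    print(run, rf.feature_importances_.round(3))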

The key to selecting features that are consistent, not confusing, and robust might be this: select features independently of your model. The relevant features you select should stay relevant whether you use a neural network, an RF, logistic regression, or any other supervised learning model. This way, you don't have to worry about the predictive power of your machine learning model at the same time as you're trying to pick features, which would be unreliable.

So, how do you pick features independently of your model? Scikit-Learn has a few options. One of my favorites is called mutual information. It's an important concept from probability theory. Basically, it quantifies the dependence between your feature variables and your label variable, relative to the assumption that they're independent. An easier way of saying that: it measures how much your class labels depend on a specific feature.
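
For intuition, here's a tiny sketch that computes mutual information for one discrete feature straight from the textbook definition, I(X;Y) = Σ p(x,y)·log(p(x,y) / (p(x)·p(y))); the toy arrays below are made up:

import numpy as np

def mutual_information(x, y):
    # Mutual information between two discrete variables, from the definition.
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi  # 0 when x and y are independent, larger when y depends on x

x = np.array([0, 0, 1, 1, 0, 1])
y = np.array([0, 0, 1, 1, 0, 1])      # y copies x exactly
print(mutual_information(x, y))       # ≈ 0.693, i.e. log(2): maximal dependence here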

So, for example, say you're predicting whether someone has a tumor by looking at a bunch of feature columns in your dataset, like geometric area, location, color hue, etc. If you're trying to choose the features relevant to your prediction, you can use mutual information to quantify how much the class label depends on the geometric area, location, and color hue of the tumor. And this is a measurement obtained directly from the data; it never involved a predictive model in the first place.
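
A minimal sketch of that, using scikit-learn's mutual_info_classif on a made-up "tumor" table; the feature names and the data are invented just for illustration:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Fake tumor data: the label is constructed to depend mostly on "area", not "hue"
rng = np.random.default_rng(42)
area = rng.normal(10, 3, 500)
hue = rng.uniform(0, 1, 500)
label = (area + rng.normal(0, 1, 500) > 10).astype(int)

X = np.column_stack([area, hue])
mi = mutual_info_classif(X, label, random_state=0)
print(dict(zip(["area", "hue"], mi.round(3))))  # "area" should score much higher than "hue"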

You can also use Scikit-Learn's chi2, or "chi-squared", to determine feature relevance. What this does is run a chi-squared test between each feature and the label to determine which features are related to the label and which ones are independent of it. You can think of this method as testing a "null hypothesis" H0: is this feature independent of the classification label? To do this, you calculate a chi-squared statistic from the data table, get a p-value, and decide which features look independent. You then throw away the independent features (why? because, according to your test, they're independent of the label, so they give no information) and keep the dependent ones.
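
Here's a short sketch of that workflow with scikit-learn's chi2 on the iris dataset, chosen only because its features are non-negative measurements, which is what chi2 expects:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the strongest chi-squared dependence on the label
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)

scores, p_values = chi2(X, y)
print(p_values.round(4))        # small p-values: reject "this feature is independent of the label"
print(selector.get_support())   # boolean mask of the kept features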

This test is actually based on principles similar to the mutual-information calculation above. However, chi2 does make important assumptions: it is really meant for non-negative, count-like features (think word counts or boolean indicators, rather than arbitrary continuous values like 5.3, pi, or sqrt(2)), and the chi-squared approximation behind the p-values is only trustworthy when you have enough data. Usually for big training sets this isn't a problem, but for small training sets these assumptions might be violated, so calculating mutual information might be more reliable in those cases.

The basic point is this: mutual information and chi-squared ways of feature-selecting are robust to the choice of predictive model. Your predictive model might be wildly inaccurate, but the data you've collected sits in a static table that never changes, so computing feature relevance without the model is more consistent.

Other ways of feature selecting include Recursive Feature Elimination (RFE), which takes a pre-chosen model (say, logistic/linear regression, or a random forest), fits it, drops the weakest feature(s), refits on what remains, and repeats until the desired number of features is left; the "best" features are the ones that survive. (Technically, random forests also expose an attribute in Scikit-Learn called feature_importances_, but I won't get into that here.) RFE can still take a lot of time, because the model has to be refit over and over as features are eliminated, which adds up quickly when you have many features or an expensive model; and exhaustively scoring all of the roughly 2^K subsets of K features would be even worse.
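
For comparison, here's a quick sketch of RFE with a logistic-regression base model on synthetic data, purely for illustration; note that the surviving features are whichever ones that particular model considered strongest:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# RFE refits the estimator, dropping the weakest feature(s) each round,
# so the final selection depends entirely on the chosen model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)   # which features survived
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier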

Another big reason I have against RFE and similar techniques is that it is fundamentally a model-dependent feature-selection technique. If your model is inaccurate, or overfits heavily, or does both and isn't that interpretable to the user, then the features you selected weren't actually chosen by you, but by the model. So the feature importances might not accurately reflect which features are actually predictive based on the dataset alone.

So what can we take away from all this? Well, in the end, feature selection is extremely important, especially if you don't know how to interpret the features you'd engineer with, say, principal component analysis. However, when you do feature selection, it's just as important to pay attention to how you're selecting your features, as well as to computational time. Is your method taking too long to compute? Does your feature selection depend on fitting a particular model first? Ideally, you'd want to feature-select regardless of what model you use, so in your Jupyter Notebook you'd ideally want a feature-selection cell before the model, something like this:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

# "mutual_info_classif" is the mutual-information way of selecting
# the K features that the class label depends on most
K = 3
selector = SelectKBest(mutual_info_classif, k=K)
X = new_df.iloc[:, :-1]   # all columns except the last are features
y = new_df.iloc[:, -1]    # the last column is the label
X_reduced = selector.fit_transform(X, y)

features_selected = selector.get_support()  # boolean mask marking the K selected feature columns

First, I did the feature selection (above).

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, train_size=0.7)

# use logistic regression as a model
logreg = LogisticRegression(C=0.1, max_iter=1000, solver='lbfgs')
logreg.fit(X_train, y_train)

And then I trained the model (above)! :)

Translated from: https://towardsdatascience.com/help-how-do-i-feature-select-eaf37e58fdaf
