Notes on "Machine Learning Done Wrong" (Common Machine Learning Mistakes)
Seen via the "I Love Machine Learning" (52ml) Weibo account: http://www.52ml.net/15845.html
Original article: http://ml.posthaven.com/machine-learning-done-wrong
I have tidied up the original article together with the replies on Weibo. I have made all seven of these mistakes myself, so this is a welcome reminder.
Passages in quotation marks are comments by Zhang Dong (张栋) on Weibo.
Preface
Statistical modeling is a lot like engineering.
In engineering, there are various ways to build a key-value storage, and each design makes a different set of assumptions about the usage pattern.
In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.
When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low.
But as we hit "big data", it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.
As pointed out in my previous post, there are dozens of ways to solve a given modeling problem.
Each model assumes something different, and it’s not obvious how to navigate and identify which assumptions are reasonable.
In industry, most practitioners pick the modeling algorithm they are most familiar with rather than pick the one which best suits the data.
In this post, I would like to share some common mistakes (the don't-s).
I’ll save some of the best practices (the do-s) in a future post.
In the preface, the author observes two things:
1. Computation is now cheap enough that people simply try every method on their data and pick the best one.
2. Practitioners tend to reach for the method they know best rather than the one best suited to the data at hand.
By this point, even a beginner like me can tell the author disapproves of both. The article then analyzes the seven common mistakes.
1. Take default loss function for granted
Many practitioners train and pick the best model using the default loss function (e.g., squared error).
In practice, off-the-shelf loss functions rarely align with the business objective.
Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize the fraud loss.
The off-the-shelf loss function of binary classifiers weighs false positives and false negatives equally.
To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount.
Also, data sets in fraud detection usually contain highly imbalanced labels.
In these cases, bias the loss function in favor of the rare case (e.g., through up/down sampling).
"Machine learning is, at bottom, solving an optimization problem; if the optimization objective (the loss function) is defined wrongly, everything downstream is wrong!"
Using fraud detection as the example, the author shows that the usual off-the-shelf loss (e.g., sum of squared errors) does not fit the business objective.
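A minimal sketch of the point in pure Python. The transactions, amounts, and the fixed $1 review cost for false positives are all made-up numbers, not from the article:

```python
# Two fraud classifiers compared under the default 0-1 loss versus a
# dollar-weighted loss. Labels: 1 = fraud, 0 = legitimate.

def zero_one_loss(y_true, y_pred):
    """Default loss: every misclassification counts the same."""
    return sum(t != p for t, p in zip(y_true, y_pred))

def dollar_loss(y_true, y_pred, amounts, review_cost=1.0):
    """Business loss: a missed fraud (false negative) costs the full
    transaction amount; a false alarm costs a fixed review fee."""
    total = 0.0
    for t, p, a in zip(y_true, y_pred, amounts):
        if t == 1 and p == 0:        # false negative: fraud slips through
            total += a
        elif t == 0 and p == 1:      # false positive: manual review
            total += review_cost
    return total

y_true  = [1, 1, 0, 0, 0]
amounts = [500.0, 20.0, 30.0, 40.0, 10.0]
pred_a  = [0, 1, 0, 0, 0]  # one error: misses the $500 fraud
pred_b  = [1, 0, 0, 1, 0]  # two errors: misses the $20 fraud, one false alarm

print(zero_one_loss(y_true, pred_a), dollar_loss(y_true, pred_a, amounts))  # 1 500.0
print(zero_one_loss(y_true, pred_b), dollar_loss(y_true, pred_b, amounts))  # 2 21.0
```

Under the default loss, classifier A looks better (1 error vs 2); under the business objective, B is far cheaper ($21 vs $500 lost).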
2. Use plain linear models for non-linear interaction
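Only the heading of this section appears in these notes; the classic illustration of the trap is XOR, which no weighted sum of the raw inputs can express, but which becomes exactly linear once the interaction feature x1*x2 is added:

```python
# XOR has no representation y = a*x1 + b*x2 + c, but with the
# interaction feature x1*x2 it is exactly linear:
#   y = x1 + x2 - 2*(x1*x2)
for x1 in (0, 1):
    for x2 in (0, 1):
        assert (x1 ^ x2) == x1 + x2 - 2 * (x1 * x2)
print("XOR is linear in (x1, x2, x1*x2)")
```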
3. Forget about outliers
Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take the example of revenue forecasting. If unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. But if the outliers are due to mechanical error, measurement error or anything else that’s not generalizable, it’s a good idea to filter out these outliers before feeding the data to the modeling algorithm.
Some models are more sensitive to outliers than others. For instance, AdaBoost might treat those outliers as "hard" cases and put tremendous weight on them, while a decision tree might simply count each outlier as one misclassification. If the data set contains a fair amount of outliers, it's important either to use a modeling algorithm robust to outliers or to filter them out.
"In many cases, if outlier data are not filtered out in advance, you must use a model that can handle outliers (or add outlier-handling steps to the training procedure)."
The author singles out AdaBoost; I'm not sure whether similar boosting methods, or random forests, are equally sensitive to outliers.
The author also makes the point that outliers can be valuable in their own right, not something to discard unthinkingly.
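A minimal sketch of the filtering option, using a median-based rule (my choice for illustration, not a method the article specifies; the revenue numbers are invented):

```python
import statistics

def mad_filter(xs, k=3.5):
    """Drop points whose modified z-score exceeds k. The median and the
    median absolute deviation (MAD) are used instead of mean/stdev
    because they are themselves robust; 0.6745 is the usual consistency
    constant that makes the score comparable to a z-score under
    normality."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    if mad == 0:
        return list(xs)
    return [x for x in xs if abs(0.6745 * (x - med) / mad) <= k]

revenue = [10, 12, 11, 13, 9, 11, 1000]   # one suspicious spike
print(mad_filter(revenue))                # [10, 12, 11, 13, 9, 11]
```

The robust statistics matter here: on this same data, a mean/stdev 3-sigma rule would keep the spike, because the spike itself inflates the standard deviation (to roughly 374) enough to mask itself.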
4. Use a high-variance model when n << p
5. L1/L2/... regularization without standardization
Applying L1 or L2 penalties to large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of the importance of standardizing features before applying regularization.
Returning to fraud detection, imagine a linear regression model with a transaction-amount feature. Without regularization, if the unit of transaction amount is dollars, the fitted coefficient will be around 100 times larger than if the unit were cents. With regularization, since L1/L2 penalize larger coefficients more, the transaction amount gets penalized more when the unit is dollars. The regularization is therefore biased: it tends to penalize features measured on smaller scales (which need larger coefficients to compensate). To mitigate the problem, standardize all features as a preprocessing step so they are on an equal footing.
"Feature standardization is an important preprocessing step: when features of different dimensions are combined, it matters that they are comparable on a common scale."
This is the first time I've thought carefully about how regularization interacts with standardization; it deserves attention. The author's example of dollar-denominated versus cent-denominated data is genuinely alarming.
Related reading: http://blog.csdn.net/zouxy09/article/details/24971995
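The dollars-versus-cents effect can be checked directly with a one-feature ridge estimate in closed form, beta = sum(x*y) / (sum(x*x) + lambda). The data and the lambda value are arbitrary numbers chosen for illustration:

```python
def ridge_slope(xs, ys, lam=1.0):
    """One-feature ridge regression (no intercept), closed form."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def standardize(xs):
    """Center to zero mean and scale to unit (population) stdev."""
    mu = sum(xs) / len(xs)
    sd = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sd for x in xs]

dollars = [1.0, 2.0, 3.0]
cents   = [100.0 * d for d in dollars]
y       = [2.0, 4.0, 6.0]            # true effect: 2 per dollar

# Same data, same lambda: the dollar coefficient (true value 2) is shrunk
# noticeably, while the cent coefficient (0.02) is barely touched.
print(ridge_slope(dollars, y))        # ≈ 1.867
print(100 * ridge_slope(cents, y))    # ≈ 2.0, back in per-dollar units

# After standardization, the unit choice no longer matters at all:
zd, zc = standardize(dollars), standardize(cents)
assert all(abs(a - b) < 1e-9 for a, b in zip(zd, zc))
```

The same penalty that shrinks the per-dollar coefficient by about 7% is invisible on the per-cent coefficient, which is exactly the bias the article describes.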
6. Use linear model without considering multi-collinear predictors
7. Interpreting absolute value of coefficients from linear or logistic regression as feature importance
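These notes keep only the heading, but the trap follows directly from the scaling issue in mistake 5: a coefficient's magnitude reflects the feature's units as much as its importance. A sketch with invented numbers:

```python
# The same relationship expressed in two unit systems: the fitted
# coefficient changes by a factor of 100, so its absolute value cannot
# be read as "feature importance" unless the features share a scale.
dollars = [1.0, 2.0, 3.0]
cents   = [100.0 * d for d in dollars]
y       = [2.0 * d for d in dollars]   # effect: 2 units of y per dollar

def ols_slope(xs):
    """Least-squares slope through the origin."""
    return sum(x * v for x, v in zip(xs, y)) / sum(x * x for x in xs)

print(ols_slope(dollars))  # 2.0
print(ols_slope(cents))    # 0.02 -- same model, 100x smaller coefficient
```

Standardizing first makes coefficients comparable in scale, though with correlated predictors (mistake 6) even standardized magnitudes are an unreliable importance ranking.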
Summary
So there you go: 7 common mistakes when doing ML in practice. This list is not meant to be exhaustive but merely to provoke the reader to consider modeling assumptions that may not be applicable to the data at hand. To achieve the best model performance, it is important to pick the modeling algorithm that makes the most fitting assumptions -- not just the one you’re most familiar with.
The author closes by reminding us once again: choose the method to fit the data.