




Statistical modeling is a lot like engineering.

In engineering, there are various ways to build a key-value store, and each design makes a different set of assumptions about the usage pattern.

In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.

When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low.

But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.

As pointed out in my previous post, there are dozens of ways to solve a given modeling problem.

Each model assumes something different, and it’s not obvious how to navigate and identify which assumptions are reasonable.

In industry, most practitioners pick the modeling algorithm they are most familiar with rather than the one that best suits the data.

In this post, I would like to share some common mistakes (the don't-s).

I’ll save some of the best practices (the do-s) for a future post.





1. Take the default loss function for granted

Many practitioners train and pick the best model using the default loss function (e.g., squared error).

In practice, an off-the-shelf loss function rarely aligns with the business objective.

Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize the fraud loss.

The off-the-shelf loss function of binary classifiers weighs false positives and false negatives equally.

To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount.

Also, data sets in fraud detection usually contain highly imbalanced labels.

In these cases, bias the loss function in favor of the rare case (e.g., through up/down sampling).
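As a rough illustration of the idea, here is a minimal sketch of a cost-sensitive log loss in plain Python. The weighting scheme (each missed fraud weighted by its dollar amount, each false alarm given a flat `fp_cost`), the function name and the toy numbers are all assumptions made for this example, not a standard library API.

```python
import math

def weighted_log_loss(y_true, p_pred, amounts, fp_cost=1.0):
    """Average log loss in which a missed fraud is weighted by its dollar amount.

    y_true:  1 = fraud, 0 = legitimate
    p_pred:  predicted probability of fraud
    amounts: transaction amounts in dollars
    fp_cost: flat cost of a false alarm (an assumed constant)
    """
    eps = 1e-12
    total = 0.0
    for y, p, amount in zip(y_true, p_pred, amounts):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            total += amount * -math.log(p)         # missed fraud scales with dollars
        else:
            total += fp_cost * -math.log(1.0 - p)  # false alarms cost a flat amount
    return total / len(y_true)

y_true = [1, 1, 0, 0]
amounts = [500.0, 5.0, 40.0, 40.0]
misses_big = [0.1, 0.9, 0.1, 0.1]    # model A misses the $500 fraud
misses_small = [0.9, 0.1, 0.1, 0.1]  # model B misses the $5 fraud

unit = [1.0] * len(y_true)  # equal weights reproduce the plain log loss
plain_a = weighted_log_loss(y_true, misses_big, unit)
plain_b = weighted_log_loss(y_true, misses_small, unit)
cost_a = weighted_log_loss(y_true, misses_big, amounts)
cost_b = weighted_log_loss(y_true, misses_small, amounts)
print(plain_a, plain_b)  # the plain loss cannot tell the two models apart
print(cost_a, cost_b)    # the weighted loss prefers model B
```

Under the plain loss the two models look equally good, while the dollar-weighted loss clearly prefers the model that catches the large fraud. Many libraries expose per-sample weights that can play the same role.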

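“Machine learning is essentially solving an optimization problem; if the optimization objective (or the loss function) is defined incorrectly, everything that follows is wrong!”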


2. Use plain linear models for non-linear interactions

When building a binary classifier, many practitioners immediately jump to logistic regression because it’s simple. But many also forget that logistic regression is a linear model, so non-linear interactions among predictors need to be encoded manually. Returning to fraud detection, high-order interaction features like "billing address = shipping address and transaction amount < $50" are required for good model performance. So one should prefer non-linear models, such as kernel SVMs or tree-based classifiers, that bake in higher-order interaction features.
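To make the point about manual encoding concrete, the sketch below fits a tiny logistic regression by gradient descent on a toy XOR-style dataset, once on the raw features and once with a hand-built product feature. The dataset, the helper names and the training settings (`steps`, `lr`) are assumptions chosen for illustration. It is not the fraud example itself.

```python
import math

def train_logistic(rows, labels, steps=5000, lr=1.0):
    """Fit logistic regression by batch gradient descent; weights[0] is the bias."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            x = [1.0] + list(row)
            z = sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, v in enumerate(x):
                grad[j] += (p - y) * v / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

def accuracy(weights, rows, labels):
    """Fraction of rows whose predicted class (score > 0 means class 1) is correct."""
    hits = 0
    for row, y in zip(rows, labels):
        z = weights[0] + sum(w * v for w, v in zip(weights[1:], row))
        hits += int((z > 0) == (y == 1))
    return hits / len(rows)

# Toy XOR-style data: the label depends only on the interaction of a and b.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

plain = [(a, b) for a, b in points]
crossed = [(a, b, a * b) for a, b in points]  # hand-encoded interaction feature

acc_plain = accuracy(train_logistic(plain, labels), plain, labels)
acc_crossed = accuracy(train_logistic(crossed, labels), crossed, labels)
print(acc_plain, acc_crossed)
```

On the raw features the model cannot do better than guessing, because no straight line separates the two classes. Adding the product feature makes the same linear model sufficient, which is the kind of work a kernel or a tree does implicitly.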

3. Forget about outliers

Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take the example of revenue forecasting. If unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. But if the outliers are due to mechanical error, measurement error or anything else that’s not generalizable, it’s a good idea to filter out these outliers before feeding the data to the modeling algorithm.

Some models are more sensitive to outliers than others. For instance, AdaBoost might treat those outliers as "hard" cases and put tremendous weight on them, while a decision tree might simply count each outlier as one false classification. If the data set contains a fair number of outliers, it's important to either use a modeling algorithm that is robust against outliers or filter the outliers out.
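One simple way to filter, sketched below, is a modified z-score based on the median absolute deviation (MAD). The constant 0.6745 and the cutoff 3.5 follow a common convention, but they are assumptions here rather than something this post prescribes, and the revenue numbers are made up.

```python
import statistics

def filter_outliers_mad(values, threshold=3.5):
    """Drop points whose MAD-based modified z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return list(values)  # no spread to measure against; keep everything
    return [v for v in values if 0.6745 * abs(v - median) / mad <= threshold]

# Made-up daily revenue; the 1000 stands in for a logging glitch.
daily_revenue = [100, 102, 98, 101, 99, 103, 97, 1000]
cleaned = filter_outliers_mad(daily_revenue)
print(statistics.mean(daily_revenue), statistics.mean(cleaned))
```

A filter like this only makes sense once the outliers are known to be errors. A genuine revenue spike flagged by the same rule deserves investigation rather than deletion.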

“In many cases, if outliers are not filtered out in advance, one has to use a model that can handle outliers (or add an outlier-handling step to the model training process).”



4. Use a high-variance model when n<<p

SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer feature space. Since this powerful feature comes almost for free, most practitioners use a kernel by default when training an SVM model. However, when the data has n<<p (number of samples << number of features) -- common in domains such as medical data -- the richer feature space implies a much higher risk of overfitting the data. In fact, high-variance models should be avoided entirely when n<<p.
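The overfitting risk can be seen even without a kernel. The sketch below uses a plain perceptron as a stand-in for a flexible classifier (an assumption of this example, not an SVM) and trains it on 20 samples with 200 random features and purely random labels. All names, sizes and the random seed are illustrative.

```python
import random

def train_perceptron(rows, labels, max_epochs=100):
    """Classic perceptron; labels are +1 or -1. Stops once an epoch has no mistakes."""
    weights = [0.0] * len(rows[0])
    for _ in range(max_epochs):
        mistakes = 0
        for row, y in zip(rows, labels):
            score = sum(w * v for w, v in zip(weights, row))
            if y * score <= 0:
                weights = [w + y * v for w, v in zip(weights, row)]
                mistakes += 1
        if mistakes == 0:
            break
    return weights

def accuracy(weights, rows, labels):
    """Fraction of rows classified on the correct side of the hyperplane."""
    hits = sum(1 for row, y in zip(rows, labels)
               if y * sum(w * v for w, v in zip(weights, row)) > 0)
    return hits / len(rows)

def random_data(rng, n, p):
    rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    labels = [rng.choice([-1, 1]) for _ in range(n)]  # labels are pure noise
    return rows, labels

rng = random.Random(0)
train_rows, train_labels = random_data(rng, n=20, p=200)  # n << p
test_rows, test_labels = random_data(rng, n=200, p=200)

weights = train_perceptron(train_rows, train_labels)
train_acc = accuracy(weights, train_rows, train_labels)
test_acc = accuracy(weights, test_rows, test_labels)
print(train_acc, test_acc)
```

The model memorizes labels that carry no signal at all, so a perfect training score says nothing about held-out data. Enriching the feature space further with a kernel only makes this kind of memorization easier.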

5. L1/L2/... regularization without standardization

Applying L1 or L2 to penalize large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of the importance of standardizing features before applying such regularization.

Returning to fraud detection, imagine a linear regression model with a transaction amount feature. Without regularization, if the transaction amount is measured in dollars, the fitted coefficient will be around 100 times larger than it would be if the amount were measured in cents. With regularization, since L1 and L2 penalize larger coefficients more, the transaction amount gets penalized more if the unit is dollars. Hence, the regularization is biased and tends to penalize features measured on smaller scales. To mitigate the problem, standardize all the features as a preprocessing step so that they are on an equal footing.
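The effect is easy to reproduce with a one-feature ridge regression in closed form. The sketch below is a toy under stated assumptions: no intercept, a made-up target that is exactly twice the dollar amount, and an arbitrary penalty `lam=10.0`. The function names are hypothetical.

```python
import math
import statistics

def ridge_prediction(xs, ys, lam, x_new):
    """One-feature ridge regression without intercept, solved in closed form."""
    w = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
    return w * x_new

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

dollars = [1.0, 2.0, 3.0, 4.0]
target = [2.0, 4.0, 6.0, 8.0]            # made-up target: exactly 2 * dollars
cents = [100.0 * d for d in dollars]     # same information, different unit

# Same penalty, same data, same query point ($5), different unit.
pred_dollars = ridge_prediction(dollars, target, lam=10.0, x_new=5.0)
pred_cents = ridge_prediction(cents, target, lam=10.0, x_new=500.0)
print(pred_dollars, pred_cents)  # an unpenalized fit would predict 10 in both cases

# After standardization the unit no longer matters.
same_scale = all(math.isclose(a, b, abs_tol=1e-9)
                 for a, b in zip(standardize(dollars), standardize(cents)))
print(same_scale)
```

With the amount in dollars the prediction is shrunk heavily toward zero, while in cents the same penalty barely bites. Standardizing first removes this dependence on the unit.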

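“Feature standardization is an important preprocessing step: when features from many dimensions are combined, it matters that they are comparable on the same scale.”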




6. Use a linear model without considering multi-collinear predictors

Imagine building a linear model with two variables X1 and X2, and suppose the ground truth model is Y=X1+X2. Ideally, if the data is observed with a small amount of noise, the linear regression solution would recover the ground truth. However, if X1 and X2 are collinear (say, X1=X2), then as far as most optimization algorithms are concerned, Y=2*X1, Y=3*X1-X2 or Y=100*X1-98*X2 are all equally good. The problem might not be detrimental, as it doesn't bias the estimation. However, it does make the problem ill-conditioned and the coefficient weights uninterpretable.
“In the vast majority of cases, a truly 'linear correlation' rarely exists (for example, between ad click-through rate and the length of the highlighted text), but one large 'non-linear correlation problem' can be converted into N small 'linear correlation problems'.”
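A small numeric check shows why the optimizer cannot choose between such solutions. The sketch assumes the extreme case where X2 is an exact copy of X1, and compares several coefficient pairs whose entries sum to 2. The data and helper name are made up for illustration.

```python
def squared_error(coefs, rows, ys):
    """Sum of squared residuals of a linear model without intercept."""
    return sum((y - sum(c * x for c, x in zip(coefs, row))) ** 2
               for row, y in zip(rows, ys))

x1 = [1.0, 2.0, 3.0, 4.0]
rows = [(v, v) for v in x1]      # X2 is an exact copy of X1 (perfect collinearity)
ys = [a + b for a, b in rows]    # ground truth: Y = X1 + X2

candidates = [(1, 1), (2, 0), (3, -1), (100, -98)]
errors = [squared_error(c, rows, ys) for c in candidates]

# Determinant of the 2x2 Gram matrix X^T X; zero means the normal equations
# have no unique solution.
s11 = sum(a * a for a, _ in rows)
s22 = sum(b * b for _, b in rows)
s12 = sum(a * b for a, b in rows)
gram_det = s11 * s22 - s12 * s12
print(errors, gram_det)
```

Every candidate fits the data equally well, and the singular Gram matrix is the numerical face of the same problem. With nearly (rather than exactly) collinear features the determinant is tiny instead of zero, which is what makes the fit ill-conditioned.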

7. Interpret the absolute value of coefficients from linear or logistic regression as feature importance

Because many off-the-shelf linear regressors return a p-value for each coefficient, many practitioners believe that for linear models, the bigger the absolute value of the coefficient, the more important the corresponding feature is. This is rarely true because (a) changing the scale of a variable changes the absolute value of its coefficient, and (b) if features are multi-collinear, coefficients can shift from one feature to another. Also, the more features the data set has, the more likely the features are to be multi-collinear and the less reliable it is to interpret feature importance from coefficients.
“The feature weights learned by LR are strongly correlated with feature importance, but they do not fully represent it (many cases need specific consideration).”
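Point (a) can be illustrated with two features on very different scales. In the toy data below the two features are uncorrelated by construction, so the one-feature slopes equal the joint regression coefficients. That property, the feature names and the numbers are all assumptions of this sketch. Multiplying each slope by its feature's standard deviation is just one rough, imperfect way to put coefficients on a comparable footing.

```python
import statistics

def ols_slope(xs, ys):
    """Slope of a one-feature least-squares fit with intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

amount_cents = [1000.0, 1000.0, 9000.0, 9000.0]
new_account = [0.0, 1.0, 0.0, 1.0]
# Made-up response built as 0.001 * amount_cents + 2 * new_account.
y = [0.001 * a + 2.0 * f for a, f in zip(amount_cents, new_account)]

raw = {"amount_cents": ols_slope(amount_cents, y),
       "new_account": ols_slope(new_account, y)}
per_std = {"amount_cents": raw["amount_cents"] * statistics.pstdev(amount_cents),
           "new_account": raw["new_account"] * statistics.pstdev(new_account)}
print(raw)      # raw coefficients: new_account looks far more important
print(per_std)  # effect of a one-standard-deviation change: the ranking flips
```

Ranking by raw coefficient and ranking by per-standard-deviation effect disagree here, so the raw magnitude alone says little about which feature drives the prediction.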


So there you go: 7 common mistakes when doing ML in practice. This list is not meant to be exhaustive but merely to provoke the reader to consider modeling assumptions that may not be applicable to the data at hand. To achieve the best model performance, it is important to pick the modeling algorithm that makes the most fitting assumptions -- not just the one you’re most familiar with.



