http://blog.csdn.net/pipisorry/article/details/46490177

Study notes for the Practical Machine Learning course

Practical Machine Learning

1.1 Prediction motivation

About this course

This course covers the basic ideas behind machine learning/prediction:

  • Study design: training vs. test sets
  • Conceptual issues: out of sample error, ROC curves
  • Practical implementation: the caret package

What this course depends on:

  • The Data Scientist's Toolbox
  • R Programming

What would be useful:

  • Exploratory analysis
  • Reporting Data and Reproducible Research
  • Regression models

What machine learning is used for

  • Local governments -> pension payments
  • Google -> whether you will click on an ad
  • Amazon -> what movies you will watch
  • Insurance companies -> what your risk of death is
  • Johns Hopkins -> who will succeed in their programs

Recommended books and resources

The Elements of Statistical Learning

Machine learning (more advanced material)

List of machine learning resources on Quora
List of machine learning resources from Science
Advanced notes from MIT open courseware
Advanced notes from CMU
Kaggle machine learning competitions

1.2 What is prediction

The central dogma of prediction

An example task (from a scatterplot in the lecture): predict for each of these dots whether it is red or blue.

Choosing the right dataset and knowing what the specific question is are, again, paramount.

Potential problems

An example: the Google Flu Trends algorithm didn't account for the fact that the search terms people use change over time. People might use different terms when searching, which affected the algorithm's performance. Also, the way those terms were actually used inside the algorithm wasn't very well understood, so when the function of a particular search term changed, it caused problems.

Components of a predictor

question -> input data -> features -> algorithm -> parameters -> evaluation

Note: question: What are you trying to predict and what are you trying to predict it with?

An example of prediction: SPAM detection

question -> input data -> features -> algorithm -> parameters -> evaluation

Start with a general question

Can I automatically detect emails that are SPAM and emails that are not?

Make it concrete

Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?

Note: try to make the question as concrete as possible.

Input data: the spam dataset from the kernlab R package:

rss.acs.unt.edu/Rdoc/library/kernlab/html/spam.html

library(kernlab)   # provides the 'spam' dataset: 4601 emails labeled spam/nonspam
data(spam)
head(spam)         # columns are frequencies of words/characters in each email

Features: quantitative characteristics of the emails, such as the frequency of the word 'your'.

Our simple algorithm

  • Find a value C.
  • If the frequency of 'your' is > C, predict "spam".

Note: if the frequency of 'your' is above the best cutoff (about 0.5) we say the email is SPAM, and if it's below 0.5 we say it's HAM.
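A sketch of how one might choose this cutoff by eye (this plot is not in the original notes; it assumes the spam data loaded above and reproduces the kind of density plot the lecture shows):

# Density of the frequency of "your" in non-spam (blue) vs spam (red) emails
plot(density(spam$your[spam$type == "nonspam"]),
     col = "blue", main = "", xlab = "Frequency of 'your' in email")
lines(density(spam$your[spam$type == "spam"]), col = "red")
abline(v = 0.5)   # the candidate cutoff C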

Parameters: the cutoff C is the parameter we choose; here C = 0.5.

Evaluation: apply the rule to the whole dataset and see what fraction of emails it classifies correctly.
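A minimal sketch of these two steps in R (assuming the spam data loaded above; on this dataset the rule gets roughly 75% of the emails right):

C <- 0.5                                      # parameter: the cutoff on the frequency of 'your'
prediction <- ifelse(spam$your > C, "spam", "nonspam")

mean(prediction == spam$type)                 # evaluation: fraction classified correctly, ~0.75 here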

1.3 Relative importance of steps

{about the tradeoffs and the different components of building a machine learning algorithm}

Relative order of importance:question > data > features > algorithms

...

Creating features is an important component: if you don't compress the data in the right way, you might lose all of the relevant and valuable information.

Finally, in my experience the algorithm is often the least important part of building a machine learning algorithm. It can be very important depending on the exact modality of the data you're using; for example, image data and voice data can require certain specialized kinds of prediction algorithms. But in general the choice of algorithm matters less than the question, the data, and the features.

An important point

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.--John Tukey

In other words, an important part of knowing how to do prediction is knowing when to give up: when the data you have are simply not sufficient to answer the question you're trying to answer.

Garbage in = Garbage out

question -> input data -> features -> algorithm -> parameters -> evaluation

  1. May be easy (movie ratings -> new movie ratings)
  2. May be harder (gene expression data -> disease)
  3. Depends on what is a "good prediction".
  4. Often more data > better models
  5. The most important step!

The key point to remember when making that decision is: garbage in = garbage out. If you have bad data, or data that isn't very useful for making predictions, then no matter how good your machine learning algorithm is, you will often get very bad results out.

Features matter!

question -> input data -> features -> algorithm -> parameters -> evaluation

Properties of good features

  • Lead to data compression
  • Retain relevant information
  • Are created based on expert application knowledge

Note: there's a debate in the community about whether it's better to create features automatically or to use expert domain knowledge. In general, expert domain knowledge can help quite a bit in many applications, and it should be consulted when building features for a machine learning algorithm.

Common mistakes

  • Trying to automate feature selection
  • Not paying attention to data-specific quirks
  • Throwing away information unnecessarily

Note:

1. A common mistake is trying to automate feature selection in a way that doesn't allow you to understand how those features are actually being applied to make good predictions. Black box predictions can be very useful and very accurate, but they can also change on a dime if we're not paying attention to how those features actually predict the outcome.

2. A particular dataset may contain outliers or weird behaviors of specific features, and not understanding those can cause problems.

One example: a system was built on a bunch of features collected in an unsupervised way; in other words, it filtered through the data to identify features that might be useful for later predictive algorithms. But even then, the researchers went back, looked at those features, and tried to figure out why they would be predictive. For example, one learned feature (shown as an image in the original lecture, clearly resembling a cat's face) made it obvious why it would be a good predictor for a cat.

Algorithms matter less than you'd think

question -> input data -> features -> algorithm -> parameters -> evaluation
Using a sensible approach will get you a very large share of the way toward solving the problem. Getting the absolute best method can improve on that, but it often doesn't improve much over most good, sensible methods.

Issues to consider when building a machine learning algorithm

Note: scalable means the algorithm is easy to apply to a large dataset, whether because it's very fast or because it's parallelizable across multiple samples, for example.

[Gaining access to the best machine-learning methods]

Prediction is about accuracy tradeoffs

  • Interpretability versus accuracy
  • Speed versus accuracy
  • Simplicity versus accuracy
  • Scalability versus accuracy

Interpretability matters

Note: e.g., in medicine, doctors find models in the form of decision trees easier to understand.

http://www.cs.cornell.edu/~chenhao/pub/mldg-0815.pdf

Scalability matters

http://www.techdirt.com/blog/innovation/articles/20120409/03412518422/

http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html

Note: Netflix never deployed the prize-winning algorithm in production because of scalability problems; it couldn't be applied to large data.

1.4 In and out of sample errors

In sample error: the error rate you get on the same data set you used to build your predictor. Sometimes called resubstitution error.

Out of sample error: the error rate you get on a new data set. Sometimes called generalization error.

Key ideas

  1. Out of sample error is what you care about
  2. In sample error < out of sample error
  3. The reason is overfitting: matching your algorithm to the data you have

Note:

1. In sample error is always going to be a little optimistic compared to the error you would get on a new sample. The reason is that, in your specific sample, your prediction algorithm can tune itself a little to the noise you collected in that particular data set.

2. So sometimes you want to give up a little bit of accuracy in the sample you have in order to get accuracy on new data sets. In other words, when the noise is a little different, your algorithm will be robust.

Overfitting

  • Data have two parts

    • Signal: the part we're trying to use to predict
    • Noise: random variation in the dataset, because the data are measured noisily
  • The goal of a predictor is to find signal
  • You can always design a perfect in-sample predictor; you capture both signal and noise when you do that, so the predictor won't perform as well on new samples

Note: the goal of a predictor is to find the signal and ignore the noise.
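A toy sketch of this effect (hypothetical data, not from the course): an overly flexible model fit to noisy data shows a lower in-sample error than out-of-sample error.

set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- df$x + rnorm(100)                  # signal (x) plus noise
train <- df[1:50, ]; test <- df[51:100, ]

fit <- lm(y ~ poly(x, 15), data = train)   # very flexible model chases the noise

mean((train$y - predict(fit, train))^2)    # in sample (resubstitution) error: optimistic
mean((test$y  - predict(fit, test))^2)     # out of sample (generalization) error: larger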

1.5 Prediction study design

{about how to minimize the problems caused by in sample versus out of sample errors}

  1. Define your error rate
  2. Split data into:
    • Training, Testing, Validation (optional)
  3. On the training set pick features
    • Use cross-validation
  4. On the training set pick prediction function
    • Use cross-validation
  5. If no validation
    • Apply 1x to test set
  6. If validation
    • Apply to test set and refine
    • Apply 1x to validation

Note:

1. (On step 5) Why do we apply the model to the test set only one time? If we applied multiple models to our testing set and then picked the best one, we would be using the test set, in some sense, to train the model. In other words, we would still get an optimistic view of what the error would be on a completely new dataset.
2. (On step 6) Apply your best prediction models to your test set and refine them a little. You might find that some features don't work so well for out-of-sample prediction, and you might refine and adjust your model a bit. But then, again, your test set error will be a slightly optimistic estimate of your actual out of sample error. So we apply the single best model exactly one time to the validation set to get our error estimate.

3. The idea is that one dataset is held out from the very start; you apply exactly one model to it, exactly once, and you never do any training or tuning on it. That gives you a good estimate of your out of sample error rate.

Know the benchmarks

http://www.heritagehealthprize.com/c/hhp/leaderboard

Study design

http://www2.research.att.com/~volinsky/papers/ASAStatComp.pdf

Avoid small sample sizes

  • One classifier is flipping a coin
  • Probability of perfect classification is approximately (1/2)^(sample size):
    • n = 1: flipping a coin gives a 50% chance of 100% accuracy
    • n = 2: flipping a coin gives a 25% chance of 100% accuracy
    • n = 10: flipping a coin gives a 0.10% chance of 100% accuracy

Note: make sure that your test sets in particular are relatively large, so you can be sure you're not just getting good prediction accuracy by chance.
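A quick check of these numbers in R:

0.5 ^ c(1, 2, 10)   # 0.5, 0.25, ~0.00098: 50%, 25%, and about 0.10%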

Rules of thumb for prediction study design

  • If you have a large sample size

    • 60% training
    • 20% test
    • 20% validation
  • If you have a medium sample size
    • 60% training
    • 40% testing
  • If you have a small sample size
    • Do cross validation
    • Report caveat of small sample size

Note:

1. (On the medium sample size rule) You don't get to refine your models on a test set and then apply them to a validation set, but this ensures that your testing set is of sufficient size.

2. (On the small sample size rule) First, reconsider whether you have enough samples to build a prediction algorithm at all. But if you're dead set on building a machine learning algorithm, then do cross validation and report, as a caveat, the small sample size and the fact that you never evaluated the predictions on a held-out, out of sample test set.
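A minimal sketch of the large-sample 60/20/20 split in R (assuming the caret package and the kernlab spam data from earlier; createDataPartition samples within each class, keeping the spam/nonspam proportions balanced):

library(caret)
library(kernlab)
data(spam)

set.seed(32323)
inTrain  <- createDataPartition(y = spam$type, p = 0.6, list = FALSE)   # 60% training
training <- spam[inTrain, ]
rest     <- spam[-inTrain, ]

inTest     <- createDataPartition(y = rest$type, p = 0.5, list = FALSE) # split the rest 20/20
testing    <- rest[inTest, ]
validation <- rest[-inTest, ]

c(nrow(training), nrow(testing), nrow(validation))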

要记住的原则Some principles to remember

  • 1. Set the test/validation set aside and don't look at it
  • 2. In general randomly sample training and test
  • 3. Your data sets must reflect structure of the problem
    • If predictions evolve with time, split train/test in time chunks (called backtesting in finance)
  • 4. All subsets should reflect as much diversity as possible
    • Random assignment does this
    • You can also try to balance by features - but this is tricky

Note:

1. The test set or the validation set should be set aside and never looked at when building your model. In other words, you need one data set to which you apply only one model, only one time, and that data set should be completely independent of anything used to build the prediction model.
2. For example, if you have time series data, that is, data collected over time, you might want to build your training set in chunks of time (but random chunks of time).
3. In other words, if you want to sample from any data set that might have sources of dependence over time or across space, you need to sample your data in chunks. This is called backtesting in finance: the idea is to use chunks of data that consist of consecutive observations over time.
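A hedged sketch of chunked, time-ordered splits using caret's createTimeSlices (the outcome vector here is hypothetical and the window sizes arbitrary):

library(caret)
y <- rnorm(100)                         # hypothetical time-ordered outcomes

slices <- createTimeSlices(1:length(y), initialWindow = 20, horizon = 10)
slices$train[[1]]   # first training chunk: observations 1..20
slices$test[[1]]    # first test chunk: the next 10 observations (21..30)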

1.6 Types of errors

{about the types of errors and the ways to evaluate prediction functions}

Basic terms

For binary classification problems:

In general, Positive = identified and negative = rejected. Therefore:

True positive = correctly identified

False positive = incorrectly identified

True negative = correctly rejected

False negative = incorrectly rejected

Medical testing example:

True positive = Sick people correctly diagnosed as sick

False positive = Healthy people incorrectly identified as sick

True negative = Healthy people correctly identified as healthy

False negative = Sick people incorrectly identified as healthy.

Key quantities

Sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, accuracy.
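In terms of the counts above, the standard definitions (restoring the 2x2 table shown in the original lecture slide) are:

sensitivity (recall)                  = TP / (TP + FN)
specificity                           = TN / (TN + FP)
positive predictive value (precision) = TP / (TP + FP)
negative predictive value             = TN / (TN + FN)
accuracy                              = (TP + TN) / (TP + FP + TN + FN)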

Note:
1. The columns of the table correspond to your true disease status; in this particular example, positive means that you have the disease.
2. The test is our prediction, from our machine learning algorithm; a positive test means we predict that you have the disease.
3. The accuracy is just the probability that we classified you to the correct outcome: the true positives and the true negatives added up, divided by the total.
[Precision/Recall]

[http://en.wikipedia.org/wiki/Sensitivity_and_specificity]

Screening tests


The general population (0.1% prevalence)

If you get a positive test, what is the probability that you actually have the disease? (That is: given the sensitivity and specificity from the problem in the original figure (left panel), and a population infection rate of 0.001, find the positive predictive value.)

Note:
1. Interpreting the sensitivity and specificity: the probability that we get it right if you're diseased is 99%, and the probability that we get it right if you're healthy is 99%.
2. Two other solutions were shown in the original figures {in the right-hand figure, + denotes actually having the disease and T+ denotes the algorithm detecting disease}.
3. The reason the result is so small: 99% of a small number (99 true positives among the 100 sick people in a population of 100,000) is still smaller than 1% of a much bigger number (999 false positives among the 99,900 healthy people).
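A worked version of the calculation via Bayes' rule, using the numbers above:

prevalence  <- 0.001
sensitivity <- 0.99
specificity <- 0.99

ppv <- (sensitivity * prevalence) /
       (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
ppv   # ~0.09: only about a 9% chance of disease given a positive test

# With prevalence <- 0.10 instead, ppv is ~0.92 (see the high-risk case below).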

High risk sub-population (10% prevalence)

If instead we consider a case where 10% of people are actually sick:

Note: in this case the computed positive predictive value is about 92%, clearly much larger.

Conclusion: if you're predicting a rare event, you have to be aware of how rare that event is. This goes back to the idea of knowing what population you're sampling from.
[http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/]

For continuous data
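The measures described in the notes below are the standard mean squared error and its square root (the original formula image is missing; these are the standard definitions):

MSE  = (1/n) * sum over i of (prediction_i - truth_i)^2
RMSE = sqrt(MSE)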


Note:
1. The goal here is to see how close you are to the truth.
2. The squared distance is a little hard to interpret on the same scale as the predictions or the truth, so we take the square root of that quantity (the RMSE).

But RMSE often doesn't work well when there are a lot of outliers, or when the values of the variables have very different scales: one really, really large value can dominate the mean.
常见的误差度量方法 Common error measures

  1. Median absolute deviation

    • Continuous data, often more robust
  2. Sensitivity (recall)
    • If you want few missed positives
  3. Specificity
    • If you want few negatives called positives
  4. Accuracy
    • Weights false positives/negatives equally
  5. Concordance
    • One example is kappa

Note:
1. Median absolute deviation: take the median of the distances between the observed and predicted values, using the absolute value instead of the square. That still makes all the distances positive, but it is more robust to the size of individual errors.
2, 3. Sensitivity and specificity are very commonly used for medical tests, and they are also widely used whenever you care about one type of error more than the other.
4. Accuracy weights false positives and false negatives equally, which matters if there is a very large discrepancy between the number of positive and negative cases.
5. For multiclass cases, one particular measure is kappa. But there is a whole large class of distance measures with different properties that can be used with multiclass data.
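A small sketch of the robustness point, with hypothetical numbers containing one large outlier:

truth <- c(1, 2, 3, 4, 100)      # one extreme value
pred  <- c(1.1, 1.9, 3.2, 4.1, 5)

sqrt(mean((pred - truth)^2))     # RMSE ~42.5: dominated by the single outlier
median(abs(pred - truth))        # median absolute deviation = 0.1: robust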
from: http://blog.csdn.net/pipisorry/article/details/46490177
