Summary: This is the transcript of Lecture 94, "Error Analysis", in Chapter 12, "Machine Learning System Design", of Andrew Ng's (吴恩达) Machine Learning course. I wrote it down while working through the video and lightly corrected it to make it more concise and easier to read, so that I can refer back to it later, and I'm sharing it here. If you spot any mistakes, corrections are very welcome and sincerely appreciated! I also hope it is helpful for your own study.
————————————————

In the last video, I talked about how, when facing a machine learning problem, there are often lots of different ideas for how to improve the algorithm. In this video, let's talk about the concept of error analysis, which will give you a way to make some of these decisions more systematically.

If you're starting work on a machine learning product or building a machine learning application, it's often considered very good practice to start, not by building a very complicated system with lots of complex features and so on, but to instead start by building a very simple algorithm that you can implement quickly. And when I start a learning problem, what I usually do is spend at most one day, literally at most 24 hours, to try to get something really quick and dirty, a frankly not at all sophisticated system. But get something really quick and dirty running, implement it, and then test it on my cross validation data. Once you've done that, you can then plot learning curves. This is what we talked about in the previous set of videos. But plot learning curves of the training and test errors to try to figure out if your learning algorithm may be suffering from high bias or high variance or something else, and use that to try to decide if having more data, more features, and so on is likely to help (a minimal code sketch of this is given after this paragraph). And the reason that this is a good approach is that, often when you're just starting out on a learning problem, there's really no way to tell in advance whether you need more complex features or whether you need more data or something else. It's just very hard to tell in advance; in the absence of evidence, in the absence of seeing a learning curve, it's just incredibly difficult to figure out where you should spend your time. And it's often by implementing even a very quick and dirty implementation and by plotting learning curves that you can make these decisions. So if you like, you can think of this as a way of avoiding what's sometimes called premature optimization in computer programming. And this is the idea that we should let evidence guide our decisions on where to spend our time rather than use gut feeling, which is often wrong. In addition to plotting learning curves, one other thing that's often useful to do is what's called error analysis. And what I mean by that is that when building, say, a spam classifier, I will often look at my cross validation set and manually look at the emails that my algorithm is making errors on. So, look at the spam emails and non-spam emails that the algorithm is misclassifying, and see if you can spot any systematic patterns in what type of examples it is misclassifying. And often by doing that, this is the process that would inspire you to design new features, or it'll tell you what the current shortcomings of the system are, and give you the inspiration you need to come up with improvements to it. Concretely, here's a specific example.
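First, here is a minimal sketch of the learning-curve diagnostic just mentioned, assuming scikit-learn and matplotlib are available; the logistic regression model and the random placeholder data are stand-ins for whatever quick and dirty classifier and dataset you actually have.

```python
# A minimal learning-curve sketch (assumes scikit-learn and matplotlib).
# LogisticRegression and the random data are placeholders for your own
# quick-and-dirty classifier and cross validation data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

rng = np.random.RandomState(0)
X = rng.rand(500, 20)                                    # placeholder features
y = (X[:, 0] + 0.3 * rng.rand(500) > 0.6).astype(int)    # placeholder labels

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 8), scoring="accuracy")

# Convert accuracy to error so the curves match the bias/variance discussion.
plt.plot(train_sizes, 1 - train_scores.mean(axis=1), label="training error")
plt.plot(train_sizes, 1 - val_scores.mean(axis=1), label="cross validation error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.show()
```

A large gap between the two curves suggests high variance, where more data may help, while curves that converge to a high error suggest high bias, where more data alone is unlikely to help.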

Let's say you've built a spam classifier, and you have 500 examples in your cross validation set. And let's say in this example that the algorithm has a very high error rate, and it misclassifies a hundred of these cross validation examples. So what I do is manually examine these 100 errors, and manually categorize them based on things like what type of email it is and what cues you think might have helped the algorithm classify them correctly. So specifically, by what type of email it is: if I look through these 100 errors, I may find that maybe the most common types of spam emails it misclassifies are emails on pharmacy, so basically emails trying to sell drugs, maybe emails that are trying to sell replicas, like those fake watches, fake random things. Maybe some emails are trying to steal passwords; these are also called phishing emails, and that's another big category of emails. And maybe other categories. So, in terms of classifying what type of email it is, I would actually go through and count up my 100 emails. Maybe I find that 12 of the mislabeled emails are pharma emails, and maybe 4 of them are emails trying to sell replicas, like fake watches or something. And maybe I find that 53 of them are what's called phishing emails, basically emails trying to persuade you to give them your password, and 31 emails are other types of emails. And it's by counting up the number of emails in these different categories that you might discover, for example, that the algorithm is doing particularly poorly on emails trying to steal passwords. And that may suggest that it might be worth your effort to look more carefully at that type of email, and see if you can come up with better features to categorize them correctly. And also, what I might do is look at what cues or what features might have helped the algorithm classify the emails. So let's say that some of our hypotheses about things or features that might help us classify emails better are trying to detect deliberate misspellings, versus unusual email routing, versus unusual spamming punctuation, such as people using a lot of exclamation marks. And once again, I would manually go through, and let's say I find 5 cases of the deliberate misspellings, 16 of the unusual routing, 32 of the unusual punctuation, and a bunch of other types of emails as well. And if this is what you get on your cross validation set, then it really tells you that maybe deliberate misspelling is a sufficiently rare phenomenon that it's not really worth a lot of your time trying to write algorithms to detect it. But if you find a lot of spammers are using unusual punctuation, then maybe that's a strong sign that it might actually be worth your while to spend the time to develop more sophisticated features based on the punctuation. So, this sort of error analysis, which is really the process of manually examining the mistakes that the algorithm makes, can often help guide you to the most fruitful avenues to pursue. And this also explains why I recommend implementing a quick and dirty implementation of an algorithm. What we really want to do is figure out what are the most difficult examples for an algorithm to classify. And very often, different learning algorithms will find similar categories of examples difficult. And by having a quick and dirty implementation, that's often a quick way to identify some errors and quickly identify what the hard examples are, so that you can focus your effort on those.
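The bookkeeping described above is simple to do in code. Here is a minimal sketch, assuming the misclassified cross validation emails have already been read and hand-tagged; the category labels and counts are hypothetical and mirror the lecture's example.

```python
# Count hand-assigned categories of misclassified cross validation emails.
# The labels below are hypothetical and mirror the counts used in the lecture.
from collections import Counter

error_categories = (["pharma"] * 12 + ["replica"] * 4 +
                    ["phishing"] * 53 + ["other"] * 31)

for category, n in Counter(error_categories).most_common():
    print(f"{category:10s} {n:3d} / {len(error_categories)}")
# phishing having the largest count suggests that better features for
# password-stealing emails are the most promising thing to work on next.
```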

Lastly, when developing learning algorithms, one other useful tip is to make sure that you have a numerical evaluation of your learning algorithm. And what I mean by that is that if you're developing a learning algorithm, it's often incredibly helpful if you have a way of evaluating your learning algorithm that just gives you back a single real number, maybe accuracy, maybe error, but a single real number that tells you how well your learning algorithm is doing. I'll talk about these specific concepts in later videos, but here's a specific example. Let's say we're trying to decide whether or not we should treat words like "discount", "discounts", "discounter", "discounting" as the same word. Maybe one way to do that is to just look at the first few characters of a word; if you do that, you figure out that maybe all of these words have roughly similar meanings. In natural language processing, the way that this is done is actually using a type of software called stemming software (Chinese: 词干提取软件). If you ever want to do this yourself, search on a web search engine for the Porter Stemmer, and that would be one reasonable piece of software for doing this sort of stemming, which will let you treat all of these words, "discount", "discounts" and so on, as the same word. But using stemming software, which basically looks at the first few letters of a word, more or less, can help, but it can also hurt. And it can hurt because, for example, the software may mistake the words "universe" and "university" as being the same thing, because these two words start off with the same letters. So, if you're trying to decide whether or not to use stemming software for a spam classifier, it's not always easy to tell. And in particular, error analysis may not be helpful for deciding if this sort of stemming idea is a good idea. Instead, the best way to figure out whether using stemming software is going to help your classifier is to have a way to very quickly just try it and see if it works. And in order to do this, having a way to numerically evaluate your algorithm is going to be very helpful. Concretely, maybe the most natural thing to do is to look at the cross validation error of the algorithm's performance with and without stemming. So, if you run your algorithm without stemming and you end up with, let's say, 5% classification error, and you rerun it with stemming and end up with, let's say, 3% classification error, then this decrease in error very quickly allows you to decide that using stemming is a good idea. For this particular problem, there's a very natural single real number evaluation metric, namely the cross validation error. We'll see later examples where coming up with this sort of single real number evaluation metric may need a little bit more work. But as we'll see in a later video, doing so would also then let you make these decisions, such as whether or not to use stemming, much more quickly. And just one more quick example. Let's say you're also trying to decide whether or not to distinguish between upper versus lower case. So, should the word "Mom" with an uppercase "M" and "mom" with a lowercase "m" be treated as the same word or as different words? Should these be treated as the same feature or different features?
So, once again, because we have a way to evaluate our algorithm, if I try this out here and stop distinguishing upper and lower case, maybe I end up with 3.2% error and find that this therefore does worse than if I use stemming only, and so this lets me very quickly decide whether or not to distinguish between upper and lower case. So, when you're developing a learning algorithm, very often you'll be trying out lots of new ideas and lots of new versions of your learning algorithm. If, every time you try out a new idea, you end up manually examining a bunch of examples again to see whether it's better or worse, that's going to make it really hard to make decisions on whether to use stemming or not, or whether to distinguish upper and lower case or not. But by having a single real number evaluation metric, you can just look and see whether the error went up or down. You can use that to try out new ideas much more rapidly and almost right away tell whether your new idea has improved or worsened the performance of the learning algorithm, and this will often let you make much faster progress. So I strongly recommend doing error analysis on the cross validation set rather than on the test set. But there are people who will do this on the test set, even though that's definitely a less mathematically appropriate, and less recommended, thing to do than doing error analysis on your cross validation set.
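Here is a minimal sketch of that kind of comparison, assuming scikit-learn and NLTK's Porter stemmer; load_emails() and load_labels() are hypothetical placeholders for however you load your cross validation emails and their spam labels, and logistic regression stands in for your actual classifier.

```python
# Compare cross validation error with and without Porter stemming.
# load_emails()/load_labels() are hypothetical placeholders; swap in your data.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

stemmer = PorterStemmer()

def stem_tokens(text):
    # "discount", "discounts", "discounting" all map to the same stem.
    return [stemmer.stem(tok) for tok in text.split()]

emails = load_emails()   # hypothetical: list of email strings
labels = load_labels()   # hypothetical: 1 = spam, 0 = non-spam

def cv_error(vectorizer):
    X = vectorizer.fit_transform(emails)
    accuracy = cross_val_score(
        LogisticRegression(max_iter=1000), X, labels, cv=5).mean()
    return 1 - accuracy

print("error without stemming:", cv_error(CountVectorizer()))
print("error with stemming:   ", cv_error(CountVectorizer(tokenizer=stem_tokens)))
```

The same pattern applies to the upper versus lower case decision: rerun with the vectorizer's lowercase option toggled and keep whichever configuration gives the lower cross validation error.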

So, to wrap up this video, when starting on a new machine learning problem, what I almost always recommend is to implement a quick and dirty implementation of your learning algorithm. And I've almost never seen anyone spend too little time on this quick and dirty implementation. I've pretty much only ever seen people spend too much time building their first, supposedly quick and dirty, implementation. So, really don't worry about it being too quick, or don't worry about it being too dirty. But really implement something as quickly as you can, and once you have the initial implementation, it becomes a powerful tool for deciding where to spend your time next. First, you can look at the errors it makes, and do this sort of error analysis to see what mistakes it makes, and use that to inspire further development. And second, assuming your quick and dirty implementation incorporated a single real number evaluation metric, this can then be a vehicle for you to try out different ideas and quickly see whether the different ideas you're trying out are improving the performance of your algorithm, and therefore let you much more quickly make decisions about what things to fold in, and what things to incorporate into your learning algorithm.
