[转]Data mining with WEKA, Part 3: Nearest Neighbor and server-side library
原文地址:
http://www.ibm.com/developerworks/opensource/library/os-weka3/index.html
by :
Michael Abernethy, Freelance Programmer, Freelancer
Introduction
In the previous two articles in this "Data mining with WEKA" series, I introduced the concept of data mining. If you have not yet readPart 1 and Part 2, please read them first because they cover key concepts you should know before moving forward. More importantly, I talked about three techniques used in data mining that can turn your confusing and useless data into meaningful rules and trends. The first of these was regression, which can be used to predict a numerical output (like house value) based on other example data. The second was classification (also known as classification tree or decision tree), which can be used to create an actual branching tree to predict the output value of an unknown data point. (In our example, we predicted the response to a BMW promotion.) Third, I introduced clustering, which can be used to create groups (clusters) of data from which you can identify trends and other rules (BMW sales in our example). These things all were similar in that they could transform your data into useful information, but each did it differently and with different data, which is one of the important aspects of data mining: The right model has to be used on the right data.
This last article will discuss the last of the four common data mining techniques: Nearest Neighbor. You'll see that it is like a combination of classification and clustering, and provides another useful weapon for our mission to destroy data misinformation.
In our previous articles, we use WEKA as a stand-alone application. How useful would that be in the real world? It's not ideal, obviously. Since WEKA is a Java-based application, it has a Java library you can use in our own server-side code. This will likely be the more common use for most people, as you can write code to constantly analyze your data and make adjustments on the fly, rather than rely on someone to extract the data, convert to a WEKA format, then run it through the WEKA Explorer.
Nearest Neighbor
Nearest Neighbor (also known as Collaborative Filtering or Instance-based Learning) is a useful data mining technique that allows you to use your past data instances, with known output values, to predict an unknown output value of a new data instance. So, at this point, this description should sound similar to both regression and classification. How is this different from those two? Well, first off, remember that regression can only be used for numerical outputs. That differentiates it from Nearest Neighbor immediately. Classification, as we saw from the example in the previous article, uses every data instance to create a tree, which we would traverse to find our answer. This can be a serious problem with some data. Think about a company like Amazon and the common "Customers who purchased X also purchased Y" feature. If Amazon were to create a classification tree, how many branches and nodes could it have? There are maybe a few hundred thousand products. How big would that tree be? How accurate do you think a tree that big would be? Even if you got to a single branch, you might be shocked to learn that it only has three products. Amazon's page likes to have 12 products on it to recommend to you. It's a bad data mining model for this data.
You'll find that Nearest Neighbor fixes all those problems in a very efficient manner, especially in the example used above for Amazon. It's not limited to any number of comparisons. It's as scalable for a 20-customer database as it is for a 20 million-customer database, and you can define the number of results you want to find. Seems like a great technique! It really is — and probably will be the most useful for anyone reading this who has an e-commerce store.
Let's delve into the math a little bit, so you can understand the process and also see some of the limitations of this technique.
Math behind Nearest Neighbor
You will see that the math behind the Nearest Neighbor technique is a lot like the math involved with the clustering technique. Taking the unknown data point, the distance between the unknown data point and every known data point needs to be computed. Finding the distance is really quite trivial with a spreadsheet, and a high-powered computer can zip through these calculations nearly instantly. The easiest and most common distance calculation is the "Normalized Euclidian Distance." It sounds much more complicated than it really is. Let's take a look at an example in action and try to figure out what Customer No. 5 is likely to purchase.
Listing 1. Nearest Neighbor math
Customer Age Income Purchased Product 1 45 46k Book 2 39 100k TV 3 35 38k DVD 4 69 150k Car Cover 5 58 51k ???Step 1: Determine Distance Formula Distance = SQRT( ((58 - Age)/(69-35))^2) + ((51000 - Income)/(150000-38000))^2 )Step 2: Calculate the Score Customer Score Purchased Product 1 .385 Book 2 .710 TV 3 .686 DVD 4 .941 Car Cover 5 0.0 ??? |
To answer the question "What is Customer No. 5 most likely to buy?" based on the Nearest Neighbor algorithm we ran through above, the answer would be a book. This is because the distance between Customer No. 5 and Customer No. 1 is less (far less, actually) than the distance between Customer No. 5 and any other customer. Based on this model, we say that the customer most like Customer No. 5 can predict the behavior of Customer No. 5.
However, the positives of Nearest Neighbor don't end there. The Nearest Neighbor algorithm can be expanded beyond the closest match to include any number of closest matches. These are termed "N-Nearest Neighbors" (for example, 3-Nearest Neighbors). Using the above example, if we want to know the two most likely products to be purchased by Customer No. 5, we would conclude that they are books and a DVD. Using the Amazon example from above, if they wanted to know the 12 products most likely to be purchased by a customer, they would want to run a 12-Nearest Neighbor algorithm (though Amazon actually runs something more complicated than just a simple 12-Nearest Neighbor algorithm).
Further, the algorithm shouldn't be constrained to predicting a product to be purchased. It can also be used to predict a Yes/No output value. Considering the above example, if we changed the last column to the following (from customers 1-4), "Yes,No,Yes,No," a 1-Nearest Neighbor model would predict Customer No. 5 to say "Yes" and a 2-Nearest Neighbor would predict a "Yes" (both customer nos. 1 and 3 say "Yes"), and a 3-Nearest Neighbor model would say "Yes." (Customer nos. 1 and 3 say "Yes," customer No. 2 says "No," so the average value of these is "Yes.")
The final question to consider is "How many neighbors should we use in our model?" Ah — not everything can be easy. You'll find that experimentation will be needed to determine the best number of neighbors to use. Also, if you are trying to predict the output of a column with a 0 or 1 value, you'd obviously want to select an odd number of neighbors, in order to break ties.
Data set for WEKA
The data set we'll use for our Nearest Neighbor example should look familiar — it's the same data set we used for our classification example in the previous article. It's about our fictional BMW dealership and the promotional campaign to sell a two-year extended warranty to past customers. To review the data set, here are the specifics I introduced in the last article.
There are 4,500 data points from past sales of extended warranties. The attributes in the data set are Income Bracket [0=$0-$30k, 1=$31k-$40k, 2=$41k-$60k, 3=$61k-$75k, 4=$76k-$100k, 5=$101k-$150k, 6=$151k-$500k, 7=$501k+], the year/month their first BMW was bought, the year/month the most recent BMW was bought, and whether they responded to the extended warranty offer in the past.
Listing 2. Nearest Neighbor WEKA data
@attribute IncomeBracket {0,1,2,3,4,5,6,7} @attribute FirstPurchase numeric @attribute LastPurchase numeric @attribute responded {1,0}@data4,200210,200601,0 5,200301,200601,1 ... |
Nearest Neighbor in WEKA
Why are we using the same data set we used in the classification example? Because, if you remember the results of that model, it was only 59-percent accurate, which wasn't acceptable at all (barely better than guessing). We're going to improve it and give this fictional dealership some useful information.
Load the data file bmw-training.arff into WEKA using the same steps we've used to this point in the Preprocess tab. Your screen should look like Figure 1 after loading in the data.
Figure 1. BMW Nearest Neighbor data in WEKA
Like we did with the regression and classification model in the previous articles, we should next select the Classify tab. On this tab, we should select lazy, then select IBk (the IB stands for Instance-Based, and the k allows us to specify the number of neighbors to examine).
Figure 2. BMW Nearest Neighbor algorithm
At this point, we are ready to create our model in WEKA. Ensure that Use training set is selected so we use the data set we just loaded to create our model. Click Start and let WEKA run. Figure 3 shows a screenshot, and Listing 3 contains the output from this model.
Figure 3. BMW Nearest Neighbor model
Listing 3. Output of IBk calculations
=== Evaluation on training set === === Summary ===Correctly Classified Instances 2663 88.7667 % Incorrectly Classified Instances 337 11.2333 % Kappa statistic 0.7748 Mean absolute error 0.1326 Root mean squared error 0.2573 Relative absolute error 26.522 % Root relative squared error 51.462 % Total Number of Instances 3000 === Detailed Accuracy By Class ===TP Rate FP Rate Precision Recall F-Measure ROC Area Class0.95 0.177 0.847 0.95 0.896 0.972 10.823 0.05 0.941 0.823 0.878 0.972 0 Weighted Avg. 0.888 0.114 0.893 0.888 0.887 0.972=== Confusion Matrix ===a b <-- classified as1449 76 | a = 1261 1214 | b = 0 |
Using WEKA on the server
One of the coolest things about WEKA is that it is not only a stand-alone application but it also is a self-contained Java JAR file that you can throw into your server's lib folder and call from your own server-side code. Think about how many interesting and important things this can bring to your applications. You can add reports that take advantage of all the data mining techniques we've learned so far. You can create a "Product Recommendation" widget for your e-commerce stores similar to the one that Amazon has on its site (because there's no way you'd be able to do this on-demand for each customer, running it through the stand-alone application). The WEKA stand-alone application itself just calls the underlying WEKA Java API, so you've seen the API in action already. Now it's time to see how to integrate it into your own code.
In fact, you've already downloaded the WEKA API JAR; it's the same JAR file you've been invoking to start the WEKA Explorer. To get access to the code, point your Java environment to include this JAR file in the classpath. All the usual steps to using a third-party JAR file in your own code.
As you can imagine, the central building block in the WEKA API is going to be the data. Data mining revolves around the data and, of course, all the algorithms that we've learned about have revolved around the data. So let's see how to get our data into a format that the WEKA API can use. Let's start easy though, let's start with the data from the first article in the series about house values.
NOTE: I would warn you ahead of time that the WEKA API can be difficult to navigate at times. First and foremost, double-check the version of WEKA you're using and the version of the API you're browsing. The API has changed enough between releases that the code can be totally different. Also, while the API is complete, there aren't very many good examples to get started (but that's why you're reading this of course). I am using WEKA V3.6.
Listing 4 shows how the data is formatted to be consumed by WEKA.
Listing 4. Loading data into WEKA
// Define each attribute (or column), and give it a numerical column number // Likely, a better design wouldn't require the column number, but // would instead get it from the index in the container Attribute a1 = new Attribute("houseSize", 0); Attribute a2 = new Attribute("lotSize", 1); Attribute a3 = new Attribute("bedrooms", 2); Attribute a4 = new Attribute("granite", 3); Attribute a5 = new Attribute("bathroom", 4); Attribute a6 = new Attribute("sellingPrice", 5);// Each element must be added to a FastVector, a custom // container used in this version of Weka. // Later versions of Weka corrected this mistake by only // using an ArrayList FastVector attrs = new FastVector(); attrs.addElement(a1); attrs.addElement(a2); attrs.addElement(a3); attrs.addElement(a4); attrs.addElement(a5); attrs.addElement(a6);// Each data instance needs to create an Instance class // The constructor requires the number of columns that // will be defined. In this case, this is a good design, // since you can pass in empty values where they exist. Instance i1 = new Instance(6); i1.setValue(a1, 3529); i1.setValue(a2, 9191); i1.setValue(a3, 6); i1.setValue(a4, 0); i1.setValue(a5, 0); i1.setValue(a6, 205000);....// Each Instance has to be added to a larger container, the // Instances class. In the constructor for this class, you // must give it a name, pass along the Attributes that // are used in the data set, and the number of // Instance objects to be added. Again, probably not ideal design // to require the number of objects to be added in the constructor, // especially since you can specify 0 here, and then add Instance // objects, and it will return the correct value later (so in // other words, you should just pass in '0' here) Instances dataset = new Instances("housePrices", attrs, 7); dataset.add(i1); dataset.add(i2); dataset.add(i3); dataset.add(i4); dataset.add(i5); dataset.add(i6); dataset.add(i7);// In the Instances class, we need to set the column that is // the output (aka the dependent variable). You should remember // that some data mining methods are used to predict an output // variable, and regression is one of them. dataset.setClassIndex(dataset.numAttributes() - 1); |
So now we have the data loaded into WEKA. That's probably a little harder than it should be, but you can see that it would be trivial and very beneficial to write your own wrapper classes to quickly extract data from a database and place it into a WEKA instances class. In fact, I highly recommend if you get involved in using WEKA on your server, you spend some time doing that, since working with data in this way is tedious. Once you get your data in the instances object, you are free to use whatever data mining you want on the data, so you want this step to be as easy as possible.
Let's put our data through the regression model and make sure the output matches the output we computed using the Weka Explorer. It's actually quite easy to put our data through the regression model using the WEKA API, far easier than actually loading the data.
Listing 5. Creating a regression model in WEKA
// Create the LinearRegression model, which is the data mining // model we're using in this example LinearRegression linearRegression = new LinearRegression();// This method does the "magic", and will compute the regression // model. It takes the entire dataset we've defined to this point // When this method completes, all our "data mining" will be complete // and it is up to you to get information from the results linearRegression.buildClassifier(dataset);// We are most interested in the computed coefficients in our model, // since those will be used to compute the output values from an // unknown data instance. double[] coef = linearRegression.coefficients();// Using the values from my house (from the first article), we // plug in the values and multiply them by the coefficients // that the regression model created. Note that we skipped // coefficient[5] as that is 0, because it was the output // variable from our training data double myHouseValue = (coef[0] * 3198) +(coef[1] * 9669) +(coef[2] * 5) +(coef[3] * 3) +(coef[4] * 1) +coef[6];System.out.println(myHouseValue); // outputs 219328.35717359098 // which matches the output from the earlier article |
And that's it! Running a classification, clustering, or Nearest Neighbor isn't quite as easy as a regression model, but they aren't much harder. It's much easier to run the data mining model than to load the data into it.
Ideally, this little section should greatly interest you into looking how to integrate WEKA into your own server-side code. Whether you run an e-commerce shop and want to better recommend products to them, or you have a coupon promotion you want to improve, or you want to optimize your AdWords campaign, or you want to optimize your landing page, these data mining techniques can all improve your results in those areas. Taking advantage of the built-in nature of the WEKA API, you can go so far as to write server-side code to rotate your landing page and constantly analyze the results using data mining to find the most effective landing page. Combine that with data mining analysis on your AdWords, and you can quickly find the best route for getting customers to your site, and converting them into sales.
Conclusion
This article wraps up the three-article series introducing you to the concepts of data mining and especially to the WEKA software. As you've seen, WEKA can do many of the data mining tasks that were previously available only in commercial software packages. WEKA is powerful and 100-percent free to use. You can't beat a deal like that, since you can quickly get WEKA up and running and crunching your data in no time.
This article went over the fourth-common data mining algorithm, "Nearest Neighbor." This algorithm is ideal for finding data points that are close to an unknown data point and using the known output from those values to predict the output for the unknown. I showed how this situation is ideal for a situation you see every time you shop online, the Recommend Products section. Through some data mining, sites like Amazon can quickly (for them at least, with its thousands of computers) tell you what customers like you purchased.
The final section of the article showed that you shouldn't be constrained to using WEKA with the Explorer window as a stand-alone application. WEKA can be used as a stand-alone Java library, which you can drop into your server-side environment and call its API like any other Java library. I showed you how you can load data into the WEKA API (and recommended you spend some time to write a nice wrapper around your database, to make this overly complex process easier). Finally, I showed you how easy it was to create a regression model and get the same results from the API that we got from the stand-alone application.
My final recommendation when working with the API is to read through the documentation and spend some time reading all the available functions offered. I find the API somewhat difficult to work with, so reading it thoroughly can be the difference from using it successfully to throwing into the recycle bin.
Hopefully, after reading this series, you will be inspired to download WEKA and try to find patterns and rules from your own data.
Download
Description | Name | Size | Download method |
---|---|---|---|
Dealer information and Java code | os-weka3-Example.zip | 17KB | HTTP |
转载于:https://www.cnblogs.com/mavlarn/archive/2012/10/19/2731586.html
[转]Data mining with WEKA, Part 3: Nearest Neighbor and server-side library相关推荐
- data Mining with Weka: Trailer More Data Mining with Weka 用weka 进行数据挖掘 Weka 用weka 进行更多数据挖掘...
https://www.youtube.com/user/WekaMOOC 大学公开课 视频教程 weka 入门教程 data Mining with Weka: Trailer More Dat ...
- CRISP-DM (cross-industry standard process for data mining)跨行业数据挖掘过程标准
CRIP-DM+SEMMA CRISP-DM (cross-industry standard process for data mining),即为"跨行业数据挖掘过程标准". ...
- The Proposal of Service Oriented Data Mining System for Solving Real-Life Classification--阅读笔记
The Proposal of Service Oriented Data Mining System for Solving Real-Life Classification and Regress ...
- Our Proof : Page Scraping : Website Data Extraction : Data Mining Analytics : Connotate.com
Our Proof : Page Scraping : Website Data Extraction : Data Mining Analytics : Connotate.com There's ...
- dataframe常用操作_【Data Mining】机器学习三剑客之Pandas常用算法总结上
一.前言 看pandas之前我建议先看我的numpy总结,效果更佳. SEU-AI蜗牛车:[Data Mining]机器学习三剑客之Numpy常用算法总结zhuanlan.zhihu.com 可以 ...
- Data Mining 论文翻译:Deep Learning for Spatio-Temporal Data Mining: A Survey
原文链接:[1906.04928] Deep Learning for Spatio-Temporal Data Mining: A Survey (arxiv.org) IEEE Transacti ...
- [转载]SQL Server 2005 Data Mining简介
简介 企业均在尝试分析其数据时都面临若干问题.通常,并不缺乏数据.事实上,很多企业感觉到他们被数据淹没了:他们没有办法完全利用所有的数据,将其变成信息.为了处理这方面的问题,开发了数据仓库技术, ...
- Collaborative Filter - Data Mining基础(ACM暑校)
2003年,Amazon公司的Linden.Smith.York大佬刊发了一篇名为<Item-to-Item Collaborative Filtering>的文章:这篇文章首次解释了Am ...
- 数据挖掘讲座:我所知道的一点Data Mining
赵民德 2003/01/24 市场竞争的激烈迫使企业高层寻找更为科学.系统.有效的辅助决策技术和整体解决方案来处理日益复杂的公司事务.毫无疑问,数据挖掘就是很好的选择之一.数据挖掘到底是什么?数据挖掘 ...
最新文章
- html滑动直播,HTML5 canvas实现的静态循环滚动播放弹幕
- Ubuntu 使用国内apt源
- 网络营销之下的泡泡玛特盲盒广受年轻用户群体的喜爱与关注
- debian linux
- bootstrap基础
- c语言计算pi后1000位,计算圆周率 Pi (π)值, 精确到小数点后 10000 位
- python能表示多大整数_Python无法表示99999999999999999999这样大的整数。
- DB2sql——DB2临时表的创建
- android弹球动画,Android动画之自定义Evaluator实现弹球效果
- python PEP 487
- Android studio修改Logcat颜色
- SAP License:SAP会计凭证抬头的字段状态控制
- mysql中间件研究(Atlas,cobar,TDDL,Mycat)
- HDU [1529] || POJ [P1275] Cashier Employment
- 【Java_基础】java中static与final关键字的区别
- Arrays.copyOf
- 自动发送企业微信通知,让我来教你真的超简单
- Window10 IoT家庭物联网网关系统设计
- 有一个已排好序的数组,要求输入一个数后,按原来排序的规律将它插入数组中,谭浩强《c语言程序设计》第六章第四题
- php 游戏开发swoole,《基于 Swoole 的对战游戏实践》开课啦