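Introduction to Mahout

Mahout is a powerful data mining tool: a collection of distributed machine learning algorithms, including a distributed collaborative filtering implementation known as Taste, classification, and clustering. Its main advantage is that it is built on Hadoop. Many algorithms that used to run on a single machine have been converted to the MapReduce model, which greatly increases both the volume of data they can handle and their processing performance.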

Hadoop

http://blog.csdn.net/fenglailea/article/details/53318459
风.fox

Environment

CentOS 7 server
Latest Mahout version at the time of writing: 0.12.2

Mahout download

http://archive.apache.org/dist/mahout/
http://archive.apache.org/dist/mahout/0.12.2/

wget http://archive.apache.org/dist/mahout/0.12.2/apache-mahout-distribution-0.12.2.tar.gz
tar -zxvf apache-mahout-distribution-0.12.2.tar.gz
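
The download URL above follows a simple pattern: the version number appears once in the directory and once in the file name. A minimal sketch of building that URL from a version variable is shown here. The function name mahout_url and the variable MAHOUT_VERSION are made up for illustration, and the sketch assumes the release uses the same file naming as 0.12.2, which is not true of every release in the archive.

```shell
# Build the archive URL for a given Mahout release.
# Assumption: the release follows the 0.12.2 naming pattern used above.
mahout_url() {
    printf 'http://archive.apache.org/dist/mahout/%s/apache-mahout-distribution-%s.tar.gz\n' "$1" "$1"
}

MAHOUT_VERSION=0.12.2
url=$(mahout_url "$MAHOUT_VERSION")
echo "$url"

# To fetch and unpack (needs network access):
#   wget "$url" && tar -zxvf "${url##*/}"
```

The expansion ${url##*/} strips everything up to the last slash, leaving only the tarball file name.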

Here the extracted directory is moved into the Hadoop directory:

mv apache-mahout-distribution-0.12.2 /home/hadoop/hadoop/mahout

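Setting the Mahout environment variables

They can be set globally in /etc/bashrc or for the current user in ~/.bashrc.
Here the current user's file is used.

vim ~/.bashrc

Mahout environment variables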

export MAHOUT_HOME=/home/hadoop/hadoop/mahout
export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
export PATH=$MAHOUT_HOME/conf:$MAHOUT_HOME/bin:$PATH

Hadoop environment variables

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_HOME_WARN_SUPPRESS=not_null
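
A typo in any of these paths usually only shows up later as a confusing error, so once the file has been reloaded it can help to confirm that each variable points at a real directory. The following is a minimal sketch, not part of the original setup; the helper name check_dir is made up for illustration.

```shell
# Report whether a variable's value names an existing directory.
# Usage: check_dir NAME VALUE
check_dir() {
    if [ -n "$2" ] && [ -d "$2" ]; then
        echo "OK       $1=$2"
    else
        echo "MISSING  $1=$2"
        return 1
    fi
}

# Check the four directory variables exported above.
# "|| true" keeps going so that every problem is listed in one pass.
check_dir MAHOUT_HOME     "${MAHOUT_HOME:-}"     || true
check_dir MAHOUT_CONF_DIR "${MAHOUT_CONF_DIR:-}" || true
check_dir HADOOP_HOME     "${HADOOP_HOME:-}"     || true
check_dir HADOOP_CONF_DIR "${HADOOP_CONF_DIR:-}" || true
```

Any line starting with MISSING means the variable is unset or the directory does not exist at that path.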

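Apply the environment variables

source  ~/.bashrc  # recommended
or
. ~/.bashrc

In bash, source and the dot command are equivalent: both run the file in the current shell, so the exported variables take effect immediately. Only executing the file as a separate script (for example, bash ~/.bashrc) runs it in a child process, which leaves the current shell unchanged.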
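
In bash, source and the dot command are synonyms, and both read the file into the current shell; only executing the file as a separate process leaves the current shell untouched. A minimal sketch that makes the difference visible is shown here. It uses a throwaway file and a made-up variable name, DEMO_VAR, neither of which is part of the original setup, and it assumes the shell is bash (source is not available in every POSIX shell).

```shell
# Write a tiny script that sets a variable, then load it three ways.
tmp_rc=$(mktemp)
echo 'DEMO_VAR=loaded' > "$tmp_rc"

unset DEMO_VAR
. "$tmp_rc"                       # POSIX dot command: runs in the current shell
dot_result=${DEMO_VAR:-unset}

unset DEMO_VAR
source "$tmp_rc"                  # bash synonym for the dot command
source_result=${DEMO_VAR:-unset}

unset DEMO_VAR
sh "$tmp_rc"                      # child process: its variables do not come back
child_result=${DEMO_VAR:-unset}

rm -f "$tmp_rc"
echo "dot=$dot_result source=$source_result child=$child_result"
```

In bash the first two results are "loaded" and the third is "unset", which is why ~/.bashrc has to be sourced rather than executed.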

Check whether the installation succeeded:

mahout

If output like the following appears, the installation succeeded:

arff.vector: : Generate Vectors from an ARFF file or directory
baumwelch: : Baum-Welch algorithm for unsupervised HMM training
canopy: : Canopy clustering
cat: : Print a file or resource as the logistic regression models would see it
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
clusterpp: : Groups Clustering Output In Clusters
cmdump: : Dump confusion matrix in HTML or text formats
cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
describe: : Describe the fields and target variable in a data set
evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
fkmeans: : Fuzzy K-means clustering
hmmpredict: : Generate random sequence of observations by given HMM
itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
kmeans: : K-means clustering
lucene.vector: : Generate Vectors from a Lucene index
matrixdump: : Dump matrix in CSV format
matrixmult: : Take the product of two matrices
parallelALS: : ALS-WR factorization of a rating matrix
qualcluster: : Runs clustering experiments and summarizes results in a CSV
recommendfactorized: : Compute recommendations using the factorization of a rating matrix
recommenditembased: : Compute recommendations using item-based collaborative filtering
regexconverter: : Convert text files on a per line basis based on regular expressions
resplit: : Splits a set of SequenceFiles into a number of equal splits
rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
runlogistic: : Run a logistic regression model against CSV data
seq2encoded: : Encoded Sparse Vector generation from Text sequence files
seq2sparse: : Sparse Vector generation from Text sequence files
seqdirectory: : Generate sequence files (of Text) from a directory
seqdumper: : Generic Sequence File dumper
seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
seqwiki: : Wikipedia xml dump to sequence file
spectralkmeans: : Spectral k-means clustering
split: : Split Input data into test and train sets
splitDataset: : split a rating dataset into training and probe parts
ssvd: : Stochastic SVD
streamingkmeans: : Streaming k-means clustering
svd: : Lanczos Singular Value Decomposition
testnb: : Test the Vector-based Bayes classifier
trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
trainlogistic: : Train a logistic regression using stochastic gradient descent
trainnb: : Train the Vector-based Bayes classifier
transpose: : Take the transpose of a matrix
validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
vectordump: : Dump vectors from a sequence file to text
viterbi: : Viterbi decoding of hidden states from given output states sequence
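
The listing is long, so instead of reading it by eye it can be scanned for the one program that is needed. The following is a minimal sketch under the assumption that each program is printed on its own line in the form "name: : description"; the helper name has_program is made up for illustration.

```shell
# Succeed if NAME appears as a program in the listing read from stdin.
# Usage: ... | has_program NAME
has_program() {
    awk -v name="$1" '
        { sub(/^[ \t]+/, "") }                      # drop leading indentation
        index($0, name ": : ") == 1 { found = 1 }  # line starts with "NAME: : "
        END { exit (found ? 0 : 1) }
    '
}

# Demo on two entries copied from the listing above.
listing='kmeans: : K-means clustering
fkmeans: : Fuzzy K-means clustering'
if printf '%s\n' "$listing" | has_program kmeans; then
    echo "kmeans is available"
fi
```

Against a real installation this could be used as: mahout 2>&1 | has_program kmeans. The 2>&1 is there only because this sketch does not assume which stream the listing is written to. Matching on the full "name: : " prefix keeps kmeans from matching fkmeans.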

Mahout and Hadoop integration test

First, Hadoop must be installed and running:

http://blog.csdn.net/fenglailea/article/details/53318459

Download the test data

http://archive.ics.uci.edu/ml/databases/synthetic_control/

wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
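
Before uploading, it can be worth confirming that the download is complete and well formed. The UCI description of this dataset gives 600 time series of 60 values each, so the file should contain 600 rows of 60 whitespace-separated numbers. The following is a minimal sketch of such a check; the helper name data_shape is made up for illustration, and the demo runs on a small stand-in file rather than the real data.

```shell
# Print "ROWS COLS" for a whitespace-separated file; fail if the rows
# do not all have the same number of columns.
data_shape() {
    awk '
        NF == 0 { next }                 # ignore blank lines
        cols == 0 { cols = NF }          # remember the width of the first row
        NF != cols { bad = 1 }           # flag any row with a different width
        { rows++ }
        END { print rows + 0, cols + 0; exit (bad ? 1 : 0) }
    ' "$1"
}

# Demo on a small stand-in file with 2 rows of 3 values.
sample=$(mktemp)
printf '1.0 2.0 3.0\n4.0 5.0 6.0\n' > "$sample"
data_shape "$sample"
rm -f "$sample"
```

Running data_shape synthetic_control.data on an intact download should print "600 60" if the file matches the UCI description.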

Upload the test data to Hadoop

hadoop fs -mkdir -p ./testdata
hadoop fs -put synthetic_control.data ./testdata

List the directory and the uploaded file

hadoop fs -ls
hadoop fs -ls ./testdata

Test with Mahout's k-means clustering algorithm

mahout -core  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

When the job finishes, the last few lines of output look like this:

        1.0 : [distance=55.039831561905785]: [33.67,38.675,39.742,41.989,37.291,43.975,31.909,25.878,31.08,15.858,13.95,23.097,19.983,21.692,31.579,38.57,33.376,38.843,41.936,33.534,39.195,32.897,25.343,18.523,15.089,17.771,22.614,25.313,23.687,29.01,41.995,35.712,40.872,41.669,32.156,25.162,24.98,23.705,18.413,20.975,14.906,26.171,30.165,27.818,35.083,39.514,37.851,33.967,32.338,34.977,26.589,28.079,19.597,24.669,23.098,25.685,28.215,34.94,36.91,39.749]
16/11/24 16:47:52 INFO ClusterDumper: Wrote 6 clusters
16/11/24 16:47:52 INFO MahoutDriver: Program took 22175 ms (Minutes: 0.3695833333333333)

View the output

hadoop fs -ls ./output

Found 15 items
-rw-r--r--   1 hadoop supergroup        194 2016-11-24 16:47 output/_policy
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusteredPoints
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-0
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-1
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-10-final
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-2
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-3
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-4
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-5
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-6
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-7
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-8
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-9
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/data
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/random-seeds
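
In this listing each k-means iteration has its own clusters-N directory, and the last one carries a -final suffix (clusters-10-final here). A minimal sketch for picking that directory out of the listing without hard-coding the iteration number is shown below; the helper name final_clusters_dir is made up for illustration.

```shell
# Read `hadoop fs -ls` output on stdin and print the path of the
# clusters-N-final directory, if there is one.
final_clusters_dir() {
    awk '$NF ~ /clusters-[0-9]+-final$/ { print $NF }'
}

# Demo on two lines taken from the listing above.
printf '%s\n' \
  'drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-1' \
  'drwxr-xr-x   - hadoop supergroup          0 2016-11-24 16:47 output/clusters-10-final' |
  final_clusters_dir
```

On a live cluster the same function can be fed directly: hadoop fs -ls ./output | final_clusters_dir. The printed path can then be passed to another hadoop fs -ls to inspect the final clusters.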

View the data

mahout vectordump -i ./output/data/part-m-00000

See also
http://itindex.net/detail/51681-mahout
http://blog.csdn.net/wind520/article/details/38851367

Mahout Installation, Configuration, and a Simple Test