
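Mahout is an open source project of the Apache Software Foundation (ASF) that provides scalable implementations of classic machine learning algorithms, with the aim of helping developers build intelligent applications more quickly and easily. The Apache Mahout project is now in its third year and has had three public releases so far. Mahout includes implementations for clustering, classification, recommendation (collaborative filtering), and frequent itemset mining. In addition, by building on the Apache Hadoop library, Mahout can scale effectively in the cloud.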



Download Hadoop: pick a suitable release from http://labs.renren.com/apache-mirror/hadoop/common/ (this article uses the stable release hadoop- )


For additional functionality you may also need to download Maven and mahout-collections.


(This article uses the synthetic_control data set, synthetic_control.tar.gz.)


To avoid polluting the Linux root environment, this article installs everything under the user's home directory, in $HOME/local.


tar zxvf hadoop- -C ~/local/

cd ~/local

mv hadoop- hadoop

tar zxvf mahout-distribution-0.5.tar.gz -C ~/local/

cd ~/local

mv mahout-distribution-0.5 mahout
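The unpack-and-rename steps above can be wrapped in a small helper. This is only a sketch: the function name install_pkg and the PREFIX variable are made up for illustration, and it assumes that each archive unpacks into a single top-level directory whose entries are not prefixed with "./".

```shell
# install_pkg ARCHIVE NAME
# Unpack ARCHIVE under $PREFIX (default: $HOME/local) and rename its
# top-level directory to NAME. Hypothetical helper, not part of Hadoop or Mahout.
install_pkg() {
    archive=$1
    name=$2
    prefix=${PREFIX:-$HOME/local}

    mkdir -p "$prefix" || return 1
    # Take the first path component of the first entry as the top-level directory.
    top=$(tar tzf "$archive" | head -n 1 | cut -d/ -f1)
    [ -n "$top" ] || return 1
    tar zxf "$archive" -C "$prefix" || return 1
    # Rename only when the unpacked directory is not already called NAME.
    if [ "$top" != "$name" ]; then
        mv "$prefix/$top" "$prefix/$name"
    fi
}

# Usage with the Mahout archive named in this article:
# install_pkg mahout-distribution-0.5.tar.gz mahout
```

The same call would work for the Hadoop archive once its full file name is filled in.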

Edit .bash_profile / .bashrc:

export HADOOP_HOME=$HOME/local/hadoop

export MAHOUT_HOME=$HOME/local/mahout


To make the commands easier to run, add each program's bin directory to $PATH, or simply define aliases.

#Alias for apps

alias mahout='$HOME/local/mahout/bin/mahout'

alias hdp='$HOME/local/hadoop/bin/hadoop'
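Bash does not expand aliases in non-interactive scripts by default, so for scripted use the $PATH route may be more convenient. The following is a minimal sketch, assuming the launchers live at bin/hadoop and bin/mahout under the install directories used in this article; path_prepend is a hypothetical helper name.

```shell
# Install locations used in this article.
export HADOOP_HOME="$HOME/local/hadoop"
export MAHOUT_HOME="$HOME/local/mahout"

# Prepend a directory to PATH unless it is already listed (hypothetical helper).
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
}

path_prepend "$HADOOP_HOME/bin"
path_prepend "$MAHOUT_HOME/bin"
export PATH

# Function equivalent of the hdp alias; unlike an alias it also works in scripts.
hdp() {
    "$HADOOP_HOME/bin/hadoop" "$@"
}
```

Because path_prepend checks for an existing entry, sourcing the profile several times does not keep growing $PATH.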


Enter the command: mahout


Running on hadoop, using HADOOP_HOME=/home/username/local/hadoop


An example program must be given as the first argument.

Valid program names are:

arff.vector: : Generate Vectors from an ARFF file or directory

canopy: : Canopy clustering

cat: : Print a file or resource as the logistic regression models would see it

cleansvd: : Cleanup and verification of SVD output

clusterdump: : Dump cluster output to text

dirichlet: : Dirichlet Clustering

eigencuts: : Eigencuts spectral clustering

evaluateFactorization: : compute RMSE of a rating matrix factorization against probes in memory

evaluateFactorizationParallel: : compute RMSE of a rating matrix factorization against probes

fkmeans: : Fuzzy K-means clustering

fpg: : Frequent Pattern Growth

itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering

kmeans: : K-means clustering

lda: : Latent Dirchlet Allocation

ldatopics: : LDA Print Topics

lucene.vector: : Generate Vectors from a Lucene index

matrixmult: : Take the product of two matrices

meanshift: : Mean Shift clustering

parallelALS: : ALS-WR factorization of a rating matrix

predictFromFactorization: : predict preferences from a factorization of a rating matrix

prepare20newsgroups: : Reformat 20 newsgroups data

recommenditembased: : Compute recommendations using item-based collaborative filtering

rowid: : Map SequenceFile to {SequenceFile, SequenceFile}

rowsimilarity: : Compute the pairwise similarities of the rows of a matrix

runlogistic: : Run a logistic regression model against CSV data

seq2sparse: : Sparse Vector generation from Text sequence files

seqdirectory: : Generate sequence files (of Text) from a directory

seqdumper: : Generic Sequence File dumper

seqwiki: : Wikipedia xml dump to sequence file

spectralkmeans: : Spectral k-means clustering

splitDataset: : split a rating dataset into training and probe parts

ssvd: : Stochastic SVD

svd: : Lanczos Singular Value Decomposition

testclassifier: : Test Bayes Classifier

trainclassifier: : Train Bayes Classifier

trainlogistic: : Train a logistic regression using stochastic gradient descent

transpose: : Take the transpose of a matrix

vectordump: : Dump vectors from a sequence file to text

wikipediaDataSetCreator: : Splits data set of wikipedia wrt feature like country

wikipediaXMLSplitter: : Reads wikipedia data and creates ch



Enter the command: hdp

Usage: hadoop [--config confdir] COMMAND

where COMMAND is one of:

namenode -format format the DFS filesystem

secondarynamenode run the DFS secondary namenode

namenode run the DFS namenode

datanode run a DFS datanode

dfsadmin run a DFS admin client

mradmin run a Map-Reduce admin client

fsck run a DFS filesystem checking utility

fs run a generic filesystem user client

balancer run a cluster balancing utility

fetchdt fetch a delegation token from the NameNode

jobtracker run the MapReduce job Tracker node

pipes run a Pipes job

tasktracker run a MapReduce task Tracker node

historyserver run job history servers as a standalone daemon

job manipulate MapReduce jobs

queue get information regarding JobQueues

version print the version

jar run a jar file

distcp copy file or directories recursively

archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive

classpath prints the class path needed to get the

Hadoop jar and the required libraries

daemonlog get/set the log level for each daemon


CLASSNAME run the class named CLASSNAME

Most commands print help when invoked w/o parameters.




mahout --help

mahout kmeans --input /user/hive/warehouse/tmp_data/complex.seq --clusters 5 --output /home/hadoopuser/1.txt




(You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.)


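$MAHOUT_HOME/bin/mahout seqdirectory \
--input <input directory> --output <output directory> \
<-c <charset of the input documents> {UTF-8|cp1252|ascii…}> \
<-chunk <maximum chunk size in MB> 64>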



mahout seqdirectory --input /hive/hadoopuser/ --output /mahout/seq/ --charset UTF-8
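To try seqdirectory on something small, a directory of plain-text files is enough. The sketch below builds one and runs the conversion only if a mahout launcher is on $PATH. DOCS_DIR, the file names, and the sample sentences are illustrative only. When Mahout runs against a Hadoop cluster, the paths are likely resolved by Hadoop's configured file system, so the directory may need to be uploaded with hdp fs -put first.

```shell
# Build a small directory of plain-text documents as sample input.
# DOCS_DIR and the file names below are illustrative, not from Mahout.
DOCS_DIR=${DOCS_DIR:-$(mktemp -d)}
mkdir -p "$DOCS_DIR/news"
printf 'Hadoop stores data in HDFS.\n' > "$DOCS_DIR/news/doc1.txt"
printf 'Mahout runs clustering jobs on Hadoop.\n' > "$DOCS_DIR/news/doc2.txt"

# Run the conversion only when the mahout launcher is available.
if command -v mahout >/dev/null 2>&1; then
    mahout seqdirectory --input "$DOCS_DIR" --output "$DOCS_DIR-seq" --charset UTF-8
else
    echo "mahout not found; would convert $DOCS_DIR to $DOCS_DIR-seq"
fi
```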




$HADOOP_HOME/bin/hadoop fs -put <path to data> testdata


hdp fs -put ~/datasetsynthetic_controltest/synthetic_control.data ~/local/mahout/testdata/
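Before uploading, it can be worth checking that the data file is a well-formed numeric table. The sketch below uses a hypothetical helper, check_rows, which reports the number of rows and columns and fails if the rows are ragged. The UCI description of the synthetic control data set lists 600 time series of 60 points each, so for synthetic_control.data the expected report would be 600 rows and 60 columns.

```shell
# check_rows FILE
# Print "<rows> <columns>" if every non-empty line of FILE has the same number
# of whitespace-separated fields; return non-zero otherwise.
# Hypothetical helper for illustration.
check_rows() {
    awk 'NF == 0    { next }          # skip blank lines
         cols == 0  { cols = NF }     # remember the width of the first row
         NF != cols { bad = 1 }       # flag rows with a different width
                    { rows++ }
         END { if (bad || rows == 0) exit 1; print rows, cols }' "$1"
}

# Usage:
# check_rows synthetic_control.data
```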


hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job


hdp jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job


hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job


hdp jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job

4: Use the Dirichlet algorithm

hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job


Mean Shift:

hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
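The four example jobs differ only in the package name of the Job class, so the invocations can be generated in a loop. This sketch only prints the commands rather than running them. The JOB variable and the job_command function are names introduced here for illustration, and the defaults for MAHOUT_HOME and MAHOUT_VERSION are assumptions based on the install layout described in this article.

```shell
# Path to the examples job file; the defaults are assumptions, adjust as needed.
MAHOUT_HOME=${MAHOUT_HOME:-$HOME/local/mahout}
MAHOUT_VERSION=${MAHOUT_VERSION:-0.5}
JOB=${JOB:-$MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job}

# job_command ALGO: print the hadoop invocation for one clustering example.
job_command() {
    echo "hdp jar $JOB org.apache.mahout.clustering.syntheticcontrol.$1.Job"
}

# Print one command per algorithm covered in this article.
for algo in kmeans canopy dirichlet meanshift; do
    job_command "$algo"
done
```

To run a job instead of printing it, paste the printed line into a shell where hdp is defined.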


mahout vectordump --seqFile /user/hadoopuser/output/data/part-00000




Use hdp fs -lsr to list all of the output.

The KMeans results are written to output/points.

The Canopy and MeanShift results are written to output/clustered-points.


