Hadoop Series: Using Aggregate

1. Introduction to aggregate
aggregate is a package provided by Hadoop for common computation and aggregation tasks.
Generally speaking, to implement an application using the Map/Reduce model, the developer needs to implement Map and Reduce functions (and possibly a Combine function). However, for many applications related to counting and statistics computing, these functions have very similar characteristics. The aggregate package implements those patterns. In particular, the package provides a generic mapper class, a reducer class and a combiner class, and a set of built-in value aggregators. It also provides a generic utility class, ValueAggregatorJob, that offers a static function that creates map/reduce jobs.
In Streaming, the Aggregate package is typically used as the reducer to compute aggregate statistics.

2. aggregate class summary

DoubleValueSum	This class implements a value aggregator that sums up a sequence of double values. It can be used to compute Top K records, similar to LongValueSum.
LongValueMax	This class implements a value aggregator that maintains the maximum of a sequence of long values.
LongValueMin	This class implements a value aggregator that maintains the minimum of a sequence of long values.
LongValueSum	This class implements a value aggregator that sums up a sequence of long values.
StringValueMax	This class implements a value aggregator that maintains the biggest of a sequence of strings.
StringValueMin	This class implements a value aggregator that maintains the smallest of a sequence of strings.
UniqValueCount	This class implements a value aggregator that dedupes a sequence of objects.
UserDefinedValueAggregatorDescriptor	This class implements a wrapper for a user-defined value aggregator descriptor.
ValueAggregatorBaseDescriptor	This class implements the common functionalities of the subclasses of the ValueAggregatorDescriptor class.
ValueAggregatorCombiner	This class implements the generic combiner of Aggregate.
ValueAggregatorJob	This is the main class for creating a map/reduce job using the Aggregate framework.
ValueAggregatorJobBase	This abstract class implements some common functionalities of the generic mapper, reducer and combiner classes of Aggregate.
ValueAggregatorMapper	This class implements the generic mapper of Aggregate.
ValueAggregatorReducer	This class implements the generic reducer of Aggregate.
ValueHistogram	This class implements a value aggregator that computes the histogram of a sequence of strings.
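
To make the table above more concrete, here is a minimal pure-Python sketch of what several of the built-in value aggregators compute over the values collected for one key. It is only a local stand-in, not Hadoop code: the AGGREGATORS table and the aggregate() helper are hypothetical names introduced for illustration, values are assumed to arrive as strings (as they do in Streaming), and UniqValueCount is assumed to report the number of distinct values.

```python
# Local stand-ins that mimic the documented behaviour of a few
# Aggregate value aggregators. Illustration only, not Hadoop's code.
AGGREGATORS = {
    "LongValueSum": lambda values: sum(int(v) for v in values),
    "LongValueMax": lambda values: max(int(v) for v in values),
    "LongValueMin": lambda values: min(int(v) for v in values),
    "DoubleValueSum": lambda values: sum(float(v) for v in values),
    "StringValueMax": lambda values: max(values),
    "StringValueMin": lambda values: min(values),
    # Assumption: the reported result is the count of distinct values.
    "UniqValueCount": lambda values: len(set(values)),
}


def aggregate(name, values):
    """Apply the named stand-in aggregator to the values of one key."""
    return AGGREGATORS[name](list(values))


if __name__ == "__main__":
    sample = ["3", "17", "3", "9"]
    for name in ("LongValueSum", "LongValueMax", "UniqValueCount"):
        print(name, aggregate(name, sample))
```

Running the sketch on a handful of sample values is a quick way to check which aggregator fits a given statistic before wiring it into a job.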

3. Using aggregate in Streaming

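Prefix each record in the mapper output with the aggregator function to apply, as follows:
function:key\tvalue
e.g.:
LongValueSum:key\tvalue
In addition, set -reducer aggregate. The reducer then applies the aggregate class named by function to the values that share a key. For example, with function set to LongValueSum, the values for each key are summed.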
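
The record format above can be checked locally before submitting a job. The following is a small sketch, assuming the function:key<TAB>value layout described above; parse_record() and simulate_long_value_sum() are hypothetical helper names, and the summing loop only imitates what the aggregate reducer is expected to do for LongValueSum records.

```python
from collections import defaultdict


def parse_record(line):
    """Split one 'function:key<TAB>value' mapper record into its three parts."""
    tagged_key, value = line.rstrip("\n").split("\t", 1)
    function, key = tagged_key.split(":", 1)
    return function, key, value


def simulate_long_value_sum(lines):
    """Sum the values per key, imitating the reducer for LongValueSum records."""
    totals = defaultdict(int)
    for line in lines:
        function, key, value = parse_record(line)
        if function != "LongValueSum":
            raise ValueError("unexpected aggregator: " + function)
        totals[key] += int(value)
    return dict(totals)


if __name__ == "__main__":
    mapper_output = [
        "LongValueSum:apple\t1\n",
        "LongValueSum:pear\t1\n",
        "LongValueSum:apple\t1\n",
    ]
    for key, total in sorted(simulate_long_value_sum(mapper_output).items()):
        print(key + "\t" + str(total))
```

Splitting on the first tab and the first colon only means that keys or values containing further colons or tabs are passed through unchanged in this sketch.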

Below is a Python example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myAggregatorForKeyCount.py \
    -reducer aggregate \
    -file myAggregatorForKeyCount.py \
    -jobconf mapred.reduce.tasks=12

Example Python program myAggregatorForKeyCount.py:

#!/usr/bin/env python3
import sys

def generateLongCountToken(id):
    # Prefix the key with the aggregator name so the reducer sums the values.
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    # Emit one count token per input line, keyed by the first tab-separated field.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        print(generateLongCountToken(fields[0]))

if __name__ == "__main__":
    main(sys.argv)
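
A mapper is not limited to a single aggregator. As a hedged variation on the example above, the sketch below emits two records per input line: a LongValueSum record to count lines per key and a UniqValueCount record to count distinct second-field values per key. The two-column input layout, the count_ and uniq_ key prefixes, and the emit_tokens() and run() names are all assumptions made for illustration. The prefixes keep the two statistics apart in the output, on the assumption that the reducer drops the function name from the output key.

```python
import io
import sys


def emit_tokens(line):
    """Return the aggregate records for one tab-separated input line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        # Skip lines that lack the (assumed) second column.
        return []
    key, value = fields[0], fields[1]
    return [
        "LongValueSum:count_" + key + "\t1",
        "UniqValueCount:uniq_" + key + "\t" + value,
    ]


def run(stream, out=sys.stdout):
    """Read lines from stream and write the aggregate records to out."""
    for line in stream:
        for token in emit_tokens(line):
            out.write(token + "\n")


# Local demonstration on in-memory data instead of real job input.
demo_output = io.StringIO()
run(io.StringIO("user1\tpageA\nuser1\tpageB\nbroken line\n"), demo_output)
```

To use the sketch as a Streaming mapper, one would call run(sys.stdin) from an `if __name__ == "__main__":` guard and pass the script with -mapper, as in the command above.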
