MapReduce: Simplified Data Processing on Large Clusters


1. Abstract

2. Introduction

3. Programming Model

Abstract

(1)MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.


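associated a. connected, related

implementation [ˌɪmplɪmənˈteʃən] n. the act of carrying out; a realization of a design

processing n. handling, treatment (of data)

intermediate [ˌɪntərˈmi:diət] a. in between, in the middle

expressible [ɪk'spresəbəl] a. able to be expressed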

(2)Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.


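functional [ˈfʌŋkʃənəl] a. relating to functions; practical, useful

parallelize ['pærəlelˌaɪz] v. to make (a computation) run in parallel

commodity [kəˈmɑ:dəti] n. something useful or valuable; an article of trade (here: ordinary, inexpensive hardware)

takes care of phr. looks after; handles

required [rɪ'kwaɪəd] a. necessary, needed

inter-machine a. between machines

utilize [ˈjutlˌaɪz] v. to use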

(3)Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable:  a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.


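scalable ['skeɪləbəl] a. able to be scaled up, expandable

computation [ˌkɑ:mpjuˈteɪʃn] n. calculation

terabyte [ˈtɛrəˌbaɪt] n. TB, roughly one trillion bytes

upwards [ˈʌpwərdz] adv. toward a higher level; "upwards of" means "more than"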

1 Introduction

(4)Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure  of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.


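raw data phr. unprocessed data

derived [dɪ'raɪvd] a. obtained or computed from something else

inverted indices phr. plural of inverted index, a mapping from each word to the documents that contain it

conceptually [kənˈseptʃuəli] adv. in terms of concepts

straightforward [ˌstreɪtˈfɔ:rwərd] a. simple, uncomplicated

reasonable [ˈrizənəbəl] a. sensible, acceptable

conspire [kənˈspaɪr] v. to plot secretly; to act together toward a result

obscure [əbˈskjʊr] v. to hide; to make hard to understand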

(5)As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.


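primitive [ˈprɪmɪtɪv] a. original, basic; n. a basic built-in operation

present a. existing, found (in)

intermediate [ˌɪntərˈmi:diət] a. in between, in the middle

appropriately [əˈproʊpriətli] adv. suitably

user-specified a. specified by the user

re-execution n. running again, executing a second time

mechanism [ˈmɛkəˌnɪzəm] n. a means by which something is done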
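The map and reduce primitives mentioned in paragraph (5) exist in most functional-style languages. As a rough illustration only, the sketch below uses Python's built-in `map` and `functools.reduce` as stand-ins for the Lisp primitives; the word list is made up for the example, and this is the language-level idea, not the MapReduce library itself.

```python
from functools import reduce

words = ["map", "reduce", "cluster"]  # made-up sample data

# map: apply a function to every element independently.
lengths = list(map(len, words))

# reduce: fold the mapped values into a single result.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)
```

Because each call to the mapped function depends only on its own element, the calls can run in any order or on different machines, which is the property the paper exploits.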

(6)The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.


scale [skel] n. size, extent, level

combined with phr. together with, joined with

(7)Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.


tailored [ˈteɪlərd] a. custom-made, adapted

refinement [rɪˈfaɪnmənt] n. an improvement, an elaboration

measurement [ˈmeʒərmənt] n. the act or result of measuring

2 Programming Model

(8)The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of  the MapReduce library expresses the computation as two functions: Map and Reduce.


(9)Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.


(10)The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.


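form [fɔ:rm] v. to make up, to produce

typically [ˈtɪpɪklɪ] adv. usually, characteristically

invocation [ˌɪnvəˈkeʃən] n. a call (of a function)

supply [səˈplaɪ] v. to provide

via [ˈvaɪə, ˈviə] prep. by way of, through

iterator [ɪtə'reɪtə] n. an object that steps through a sequence of values one at a time

fit v. to be the right size for; to suit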
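The three steps described in paragraphs (8) to (10) (map each input pair, group the intermediate values by key, reduce each group through an iterator) can be sketched in a few lines of Python. This is a single-process illustration, not the library's implementation: `run_mapreduce`, `map_parity`, and `reduce_sum` are hypothetical names, and everything is held in memory, whereas the real library spreads the work across machines.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by intermediate key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    results = []
    # Keys are visited in sorted order here only to make the output deterministic.
    for ikey in sorted(groups):
        # Values reach the reduce function through an iterator.
        for out in reduce_fn(ikey, iter(groups[ikey])):
            results.append((ikey, out))
    return results

def map_parity(name, numbers):
    # Emit one intermediate pair per number, keyed by parity.
    for n in numbers:
        yield ("even" if n % 2 == 0 else "odd", n)

def reduce_sum(key, values):
    # Merge all values for one key into a single output value.
    yield sum(values)

result = run_mapreduce([("a", [1, 2, 3]), ("b", [4, 5])], map_parity, reduce_sum)
print(result)
```

Passing an iterator rather than a list is what would let a real implementation stream values that do not fit in memory; this sketch keeps the interface but not that benefit.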

2.1 Example

(11)Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:


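occurrence [əˈkɜ:rəns] n. an instance of something happening or appearing

pseudo ['su:doʊ] a. false, not genuine; pseudo-code is informal, code-like notation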

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));


(12)The map function emits each word plus an associated count of occurrences (just ‘1’ in this simple example). The reduce function sums together all counts emitted for a particular word.


emit [ɪˈmɪt] v. to send out, to give off

associated [əˈsoʊʃieɪtɪd] a. connected, accompanying
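For readers who want to run the word-count example, here is a minimal Python sketch that mirrors the pseudo-code, including passing counts as strings. The driver `run_mapreduce` and the two sample documents are assumptions made for illustration; the paper's real program is the C++ code in its Appendix A.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return [(k, out) for k in sorted(groups) for out in reduce_fn(k, iter(groups[k]))]

def map_fn(key, value):
    # key: document name, value: document contents
    for w in value.split():
        yield (w, "1")

def reduce_fn(key, values):
    # key: a word, values: an iterator over counts (as strings)
    result = 0
    for v in values:
        result += int(v)
    yield str(result)

docs = [("doc1", "the quick fox"), ("doc2", "the lazy dog the end")]
counts = dict(run_mapreduce(docs, map_fn, reduce_fn))
print(counts)
```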

(13)In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user’s code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.


tune [tu:n] v. to adjust, to fine-tune

appendix [əˈpɛndɪks] n. supplementary material at the end of a document

2.2 Types

(14)Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

map     (k1, v1)        → list(k2, v2)

reduce  (k2, list(v2))  → list(v2)


map     (key1, value1)        → list(key2, value2)

reduce  (key2, list(value2))  → list(value2)

in terms of phr. with regard to; expressed by means of

(15)I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.


i.e. abbr. that is, in other words

(16)Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.
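The type signatures above can be written down with Python's `typing` module. The sketch below is one possible encoding, not something taken from the paper: the alias names `MapFn` and `ReduceFn`, the driver `run_mapreduce`, and the sample functions are all hypothetical. Here the input domain is (int, str) while the intermediate and output domain is (str, int).

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# map: (k1, v1) -> list(k2, v2)      reduce: (k2, list(v2)) -> list(v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
ReduceFn = Callable[[K2, Iterator[V2]], Iterable[V2]]

def run_mapreduce(inputs: Iterable[Tuple[K1, V1]],
                  map_fn: MapFn, reduce_fn: ReduceFn) -> List[Tuple[K2, V2]]:
    """Hypothetical in-memory driver used only to exercise the types."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return [(k, out) for k in sorted(groups) for out in reduce_fn(k, iter(groups[k]))]

def map_lengths(offset: int, line: str) -> Iterable[Tuple[str, int]]:
    # (k1, v1) = (line offset, line text); emits (first letter, word length).
    for word in line.split():
        yield (word[0], len(word))

def reduce_max(letter: str, lengths: Iterator[int]) -> Iterable[int]:
    # (k2, list(v2)) -> list(v2): the longest word length per first letter.
    yield max(lengths)

result = run_mapreduce([(0, "map merge"), (10, "reduce run")], map_lengths, reduce_max)
print(result)
```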


2.3 More Examples

(17)Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.


(18)Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.


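grep [computing] a command that searches for lines matching a pattern

identity [aɪˈdɛntɪti] n. who or what something is; an identity function returns its input unchanged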
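A distributed grep along these lines might look like the sketch below. The paper does not say what key the map function emits, so keying each match by "file:line number" is an assumption, as are the driver `run_mapreduce`, the pattern, and the sample log contents.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"error")  # assumed search pattern

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return [(k, out) for k in sorted(groups) for out in reduce_fn(k, iter(groups[k]))]

def map_fn(filename, contents):
    # Emit a line only if it matches the supplied pattern.
    for lineno, line in enumerate(contents.splitlines(), start=1):
        if PATTERN.search(line):
            yield (f"{filename}:{lineno}", line)

def reduce_fn(key, values):
    # Identity: copy the intermediate data straight to the output.
    yield from values

files = [("a.log", "ok\nerror: disk\nok"), ("b.log", "error: net")]
matches = run_mapreduce(files, map_fn, reduce_fn)
print(matches)
```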

(19)Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.

frequency [ˈfrikwənsi] n. the rate at which something occurs; a frequency distribution
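The URL-frequency job has the same shape as word count. In the sketch below the log format (a timestamp followed by a URL on each line) is an assumption, since the paper does not specify one, and `run_mapreduce` is a hypothetical in-memory driver.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return [(k, out) for k in sorted(groups) for out in reduce_fn(k, iter(groups[k]))]

def map_fn(log_name, log_text):
    # Assumed line format: "<timestamp> <URL>"; malformed lines are skipped.
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            yield (fields[1], 1)

def reduce_fn(url, counts):
    # Add together all values for the same URL.
    yield sum(counts)

logs = [("log1", "t1 /index.html\nt2 /about.html"), ("log2", "t3 /index.html")]
totals = run_mapreduce(logs, map_fn, reduce_fn)
print(totals)
```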

(20)Reverse Web-Link Graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>.


concatenate [kɑnˈkæt(ə)ˌneɪt] v. to link together in a series
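The reverse web-link graph can be sketched as follows. To keep the example short it assumes the outgoing links of each page have already been extracted into a list; that input shape, the page names, and the driver `run_mapreduce` are hypothetical.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return [(k, out) for k in sorted(groups) for out in reduce_fn(k, iter(groups[k]))]

def map_fn(source, targets):
    # Emit a (target, source) pair for each link found in page `source`.
    for target in targets:
        yield (target, source)

def reduce_fn(target, sources):
    # Concatenate all source URLs that link to this target.
    yield list(sources)

pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
reverse_graph = run_mapreduce(pages, map_fn, reduce_fn)
print(reverse_graph)
```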

(21)Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.


summarize [ˈsʌməˌraɪz] v. to give a brief account of the main points

extract [ɪkˈstrækt] v. to pull out, to obtain

throw away phr. to discard
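One way to sketch the term-vector job is with `collections.Counter`. Several details here are assumptions the paper leaves open: each per-document vector is a raw count of every word, "infrequent" means a total count below the made-up threshold `MIN_COUNT`, and `run_mapreduce` is a hypothetical in-memory driver.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

MIN_COUNT = 2  # assumed cutoff below which a term counts as infrequent

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return [(k, out) for k in sorted(groups) for out in reduce_fn(k, iter(groups[k]))]

def map_fn(url, text):
    # One (hostname, term vector) pair per input document.
    yield (urlparse(url).hostname, Counter(text.lower().split()))

def reduce_fn(hostname, vectors):
    # Add the per-document vectors, then drop infrequent terms.
    total = Counter()
    for vector in vectors:
        total.update(vector)
    yield {term: n for term, n in total.items() if n >= MIN_COUNT}

docs = [
    ("http://example.com/a", "map reduce map"),
    ("http://example.com/b", "reduce cluster"),
    ("http://other.org/x", "sort sort sort merge"),
]
host_vectors = run_mapreduce(docs, map_fn, reduce_fn)
print(host_vectors)
```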

(22)Inverted Index: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.


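parse [pɑ:rs] v. to analyze (text) into its parts

correspond [ˌkɔ:rəˈspɑ:nd] v. to match, to agree with

augment [ɔɡˈmɛnt] v. to extend, to enhance

track v. to follow, to monitor

keep track of phr. to keep a record of; to stay informed about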
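The inverted index can be sketched like this. Splitting on whitespace stands in for real document parsing, and the integer document IDs, the sample texts, and the driver `run_mapreduce` are assumptions for illustration.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return [(k, out) for k in sorted(groups) for out in reduce_fn(k, iter(groups[k]))]

def map_fn(doc_id, text):
    # Emit each distinct word of the document once, paired with the document ID.
    for word in sorted(set(text.lower().split())):
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    # Sort the document IDs for this word.
    yield sorted(doc_ids)

docs = [(2, "large clusters"), (1, "large data sets")]
index = dict(run_mapreduce(docs, map_fn, reduce_fn))
print(index)
```

To track word positions as the paragraph suggests, the map function could emit (doc_id, position) tuples as values instead of bare IDs.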

(23)Distributed Sort: The map function extracts the key from each record, and emits a <key, record> pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.


facility [fəˈsɪləti] n. equipment; a capability or feature

order n. arrangement, sequence; an ordering is the sequence in which items are arranged
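The sort example only works if keys are range-partitioned and each partition is processed in key order, so that concatenating the partitions gives a globally sorted result. The sketch below imitates both properties in one process. The names `partition` and `distributed_sort`, the split points, and the rule that a record's key is its first field are all assumptions, not the paper's API.

```python
import bisect
from itertools import groupby
from operator import itemgetter

def map_fn(_, record):
    # Assumed record layout: the key is the first whitespace-separated field.
    yield (record.split()[0], record)

def reduce_fn(key, values):
    # Emit all records unchanged.
    yield from values

def partition(key, split_points):
    # Range partitioner: every key in partition i sorts before every key in partition i+1.
    return bisect.bisect_right(split_points, key)

def distributed_sort(records, split_points):
    partitions = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        for key, value in map_fn(None, record):
            partitions[partition(key, split_points)].append((key, value))
    output = []
    for part in partitions:
        # Within a partition, pairs are handled in increasing key order.
        part.sort(key=itemgetter(0))
        for key, group in groupby(part, key=itemgetter(0)):
            output.extend(reduce_fn(key, (value for _, value in group)))
    return output

sorted_records = distributed_sort(["pear 3", "apple 1", "zebra 9", "mango 5"], ["m"])
print(sorted_records)
```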

