The parts of the paper covering how MapReduce actually works are mainly Sections 3 and 4, annotated below:

3 Implementation

3.1 Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.

map:

The input is split into M pieces (split according to what? — see step 1 below: by size), and Map is invoked on each piece; the pieces are processed in parallel on different machines.

reduce:

The intermediate key/value pairs that map outputs are divided into R pieces by a partitioning function (e.g. hash(key) mod R), and reduce is invoked once per piece.
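For concreteness, a minimal Go sketch of the default hash(key) mod R scheme (the ihash helper mirrors the one in the 6.824 lab skeleton; the keys and R value here are made up):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ihash maps an intermediate key to a non-negative integer; taking it
// mod R picks the reduce task (and hence the intermediate file region)
// the pair belongs to.
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

func main() {
	const R = 5
	for _, key := range []string{"apple", "banana", "cherry"} {
		fmt.Printf("%q -> reduce task %d of %d\n", key, ihash(key)%R, R)
	}
}
```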

On Figure 1 (the overall execution flow):

Figure 1 shows the overall flow of a MapReduce operation in our implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in Figure 1 correspond to the numbers in the list below):

The flow of one MapReduce operation:

1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.

1. The MapReduce library splits the input into M pieces (piece size is user-controllable) and then starts many copies of the program on the cluster (the "program" being the user's application linked against the MapReduce library? — roughly one instance per machine).

2. One of the copies of the program is special – the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

2. One of these instances is special: the master. The rest are workers, which are assigned work by the master (so master and workers run the same program, and only their role — and therefore their behavior — differs?).

There are M map tasks and R reduce tasks in total; the master picks idle workers and assigns each one a map task or a reduce task.

3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

3. A worker assigned a map task first reads the corresponding input split, parses key/value pairs out of it, and calls the Map function on each pair. The resulting intermediate key/value pairs are buffered in memory.

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The corresponding paths on local disk are passed back to the master, which is responsible for forwarding them to the relevant reduce workers.

5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.

5. When a reduce worker learns the locations of its input from the master, it reads the data from the map workers' local disks via RPC. Once it has read all of its input, it sorts by key so that pairs with the same key are grouped together; the sort is needed because many different keys are mapped by the partitioning function to the same reduce worker. (If the data is too large for memory, an external sort is used.)

6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

6. The reduce worker iterates over the sorted intermediate data; for each unique key it calls the Reduce function once, passing the key and the corresponding set of values. Reduce's output is appended to the final output file for this reduce partition (so each reduce task contributes exactly one output file).

7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

7. When all map and reduce tasks have completed, this MapReduce operation is done and the call returns.

After a successful run, the output sits in R output files (note: one per reduce task). Users typically do not merge these R files by hand; instead they feed them as input to another MapReduce operation.

3.2 Master Data Structures

The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).

The master maintains several data structures. For each map and reduce task, it stores the task's state (idle, in-progress, or completed — idle meaning not yet started, or reset after a failure) and, for non-idle tasks, the identity of the worker machine (completed tasks are non-idle too, so their worker identity is kept as well).

The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.

Intermediate file locations flow from map tasks to reduce tasks through the master. For each completed map task, the master stores the locations and sizes of the R intermediate file regions that task produced (one region per reduce task; "size" here means the number of bytes in each region, not a count of files). These location/size updates arrive as map tasks complete, and the information is pushed incrementally to workers with in-progress reduce tasks.

↑ Worth pinning down what this corresponds to in a concrete implementation.
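A rough Go sketch of what the master's structures could look like — all type and field names here are my guesses, not taken from the paper:

```go
package mapreduce

// TaskState is the per-task status the master tracks.
type TaskState int

const (
	Idle TaskState = iota // not yet started, or reset after a failure
	InProgress
	Completed
)

// RegionInfo describes one of the R intermediate file regions a
// completed map task produced on its worker's local disk.
type RegionInfo struct {
	Location string // path on the map worker's local disk
	Size     int64  // bytes in this region
}

// Master holds per-task state plus, for completed map tasks, the
// intermediate file locations to push to in-progress reduce tasks.
type Master struct {
	// For each map task m: its state, its worker (if non-idle), and,
	// once completed, the R intermediate regions it produced.
	mapState  []TaskState
	mapWorker []string
	regions   [][]RegionInfo // regions[m][r] — O(M*R) entries in total

	// For each reduce task r: its state and worker.
	reduceState  []TaskState
	reduceWorker []string
}
```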

3.3 Fault Tolerance

Since MapReduce's main use is very large computations on many machines, it naturally has to handle machine failures gracefully.

Worker Failure

The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.

The master pings every worker periodically. If no response arrives within a certain amount of time, the master marks that worker as failed. Map tasks the failed worker has completed are reset to idle and become eligible for re-assignment to other workers (a map task's output lives on the worker's local disk, so once the worker fails it can no longer be fetched).

Similarly, any map or reduce task in progress on the failed worker is reset to idle and waits to be rescheduled.

Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.

(This confirms why completed map tasks must be re-executed — the same reasoning as my note above.)

Reduce output, by contrast, is stored in GFS rather than locally, so completed reduce tasks need not be re-executed even when their worker fails.
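A sketch of this reset logic in Go, reusing the TaskState values from the master sketch above (the timeout value and all names are assumptions):

```go
package mapreduce

import "time"

type taskKind int

const (
	mapKind taskKind = iota
	reduceKind
)

// taskRef identifies one task assigned to a worker.
type taskRef struct {
	kind  taskKind
	state TaskState
}

const failureTimeout = 10 * time.Second // assumed value

// markFailed is called from the master's periodic ping loop for a
// worker that has not responded within failureTimeout: its tasks
// become eligible for rescheduling on other workers.
func markFailed(tasks []*taskRef, lastHeard time.Time) {
	if time.Since(lastHeard) < failureTimeout {
		return
	}
	for _, t := range tasks {
		// Completed map tasks are reset: their output lives on the
		// failed machine's local disk and is now unreachable.
		// Completed reduce tasks keep their state: their output is
		// already in the global file system.
		if t.kind == mapKind || t.state == InProgress {
			t.state = Idle
		}
	}
}
```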

When a map task is executed first by worker A and then later executed by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker A will read the data from worker B.

When a map task is executed first by worker A and then by worker B (because A failed), all workers executing reduce tasks must be notified of the change (how is this done in an implementation?).

Any reduce task that has not yet read the data from A should read it from B instead (did my implementation get this right?).

MapReduce is resilient to large-scale worker failures. For example, during one MapReduce operation, network maintenance on a running cluster was causing groups of 80 machines at a time to become unreachable for several minutes. The MapReduce master simply re-executed the work done by the unreachable worker machines, and continued to make forward progress, eventually completing the MapReduce operation.

MapReduce copes well with large-scale worker failures. In the example, network maintenance on a running cluster made groups of 80 machines unreachable for several minutes; the master simply re-executed the work those unreachable workers had done (both their completed map tasks and anything in progress) on other machines, kept making forward progress, and eventually finished the operation.

Master Failure

It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.

It is easy to have the master periodically checkpoint the data structures described above. If the master task (i.e. the master process) dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely (one specific machine failing is far less probable than some machine among thousands of workers failing), so the implementation in the paper simply aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the operation if they wish.

Semantics in the Presence of Failures

(Worth understanding what this heading means: what output guarantees the system gives when failures and re-executions occur.)

When the user-supplied map and reduce operators are deterministic functions of their input values, our distributed implementation produces the same output as would have been produced by a non-faulting sequential execution of the entire program.

If the user-supplied map and reduce functions are deterministic functions of their input values (i.e. the same input always produces the same output), then this distributed execution produces exactly the same output as a non-faulting sequential execution of the entire program.

We rely on atomic commits of map and reduce task outputs to achieve this property. Each in-progress task writes its output to private temporary files. A reduce task produces one such file, and a map task produces R such files (one per reduce task). When a map task completes, the worker sends a message to the master and includes the names of the R temporary files in the message. If the master receives a completion message for an already completed map task, it ignores the message. Otherwise, it records the names of R files in a master data structure.

Achieving that property rests on map and reduce task outputs being committed atomically (a commit either fully takes effect or has no effect at all):

Every in-progress task writes its output to private temporary files; a reduce task produces one such file, and a map task produces R (one per reduce task).

When a map task completes, its worker sends the master a message containing the names of the R temporary files.

If the master receives a completion message for an already-completed map task, it ignores the message (note this case — it arises from duplicate executions).

Otherwise, it records the R file names in its data structures.
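A sketch of how the master-side commit might look, building on the Master sketch above (handleMapDone and the locking scheme are my own):

```go
package mapreduce

import "sync"

var mu sync.Mutex // protects the master state in this sketch

// handleMapDone processes a map-task completion message carrying the
// names of the R intermediate files. A duplicate message for an
// already-completed task is ignored, so exactly one execution's files
// are ever recorded — this is what makes the commit atomic from the
// master's point of view.
func (m *Master) handleMapDone(task int, files []RegionInfo) {
	mu.Lock()
	defer mu.Unlock()
	if m.mapState[task] == Completed {
		return // e.g. a backup or re-executed copy finished later
	}
	m.mapState[task] = Completed
	m.regions[task] = files // the R region locations/sizes
}
```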

When a reduce task completes, the reduce worker atomically renames its temporary output file to the final output file. If the same reduce task is executed on multiple machines, multiple rename calls will be executed for the same final output file. We rely on the atomic rename operation provided by the underlying file system to guarantee that the final file system state contains just the data produced by one execution of the reduce task.

When a reduce task completes, the reduce worker atomically (all-or-nothing) renames its temporary output file to the final output file.

If the same reduce task runs on multiple machines, multiple rename calls will target the same final output file name.

The atomic rename provided by the underlying file system guarantees that the final state contains the data of exactly one execution of the reduce task. (Which execution? Whichever renames last overwrites the file, since rename atomically replaces its target — but the point is that the result is always the complete output of a single execution, never a mix of two.)
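A runnable Go sketch of the commit-by-rename pattern, using a local file in place of GFS (the paper relies on the global file system's atomic rename; os.Rename is atomic on POSIX file systems; file names are made up):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Each execution of a reduce task writes to its own private
	// temporary file...
	tmp, err := os.CreateTemp(".", "mr-reduce-tmp-*")
	if err != nil {
		panic(err)
	}
	fmt.Fprintln(tmp, "the 3")
	tmp.Close()

	// ...and commits by atomically renaming it to the final name.
	// If duplicate executions of the same reduce task all rename to
	// "mr-out-0", each rename atomically replaces the file, so the
	// final state is the complete output of exactly one execution.
	if err := os.Rename(tmp.Name(), "mr-out-0"); err != nil {
		panic(err)
	}
}
```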

The vast majority of our map and reduce operators are deterministic, and the fact that our semantics are equivalent to a sequential execution in this case makes it very easy for programmers to reason about their program's behavior. When the map and/or reduce operators are non-deterministic, we provide weaker but still reasonable semantics. In the presence of non-deterministic operators, the output of a particular reduce task R1 is equivalent to the output for R1 produced by a sequential execution of the non-deterministic program. However, the output for a different reduce task R2 may correspond to the output for R2 produced by a different sequential execution of the non-deterministic program.

The vast majority of map and reduce operators are deterministic, in which case the semantics are equivalent to a sequential execution, making it easy for programmers to reason about their program's behavior.

When map and/or reduce are non-deterministic, a weaker but still reasonable guarantee holds: the output of a particular reduce task R1 equals the output R1 would produce in some sequential execution of the non-deterministic program; however, the output of a different reduce task R2 may correspond to a different sequential execution.

↑ How should "different sequential execution" in that sentence be understood? See below.

Consider map task M and reduce tasks R1 and R2. Let e(Ri) be the execution of Ri that committed (there is exactly one such execution). The weaker semantics arise because e(R1) may have read the output produced by one execution of M and e(R2) may have read the output produced by a different execution of M.

Roughly, this says:

Take map task M and reduce tasks R1 and R2, with M non-deterministic. The semantics are weaker because the committed execution of R1 may have read the output of one execution of M, while the committed execution of R2 read the output of a different execution of M (re-executions happen because of failures and backup tasks). Since M is non-deterministic, those two executions of M can produce different outputs — so R1 and R2 end up consistent with two different sequential runs of the program rather than one.

3.4 Locality

Network bandwidth is a relatively scarce resource in our computing environment. We conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS [8]) is stored on the local disks of the machines that make up our cluster. GFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines. The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data (e.g., on a worker machine that is on the same network switch as the machine containing the data). When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.

Network bandwidth is one of the scarcer resources in this environment.

To conserve it, the implementation exploits the fact that the input data (managed by GFS) is stored on the local disks of the cluster machines (this only helps map tasks, right? — reduce input still has to be fetched from the map workers over the network).

GFS splits each file into 64 MB blocks and keeps several replicas of each block (typically 3) on different machines.

The MapReduce master uses this location information and first tries to schedule each map task on a machine that holds a replica of its input.

Failing that, it tries to schedule the task near a replica (e.g. on a worker attached to the same network switch as the machine holding the data). When a large MapReduce operation runs on a significant fraction of a cluster's machines, most input is read locally and consumes no network bandwidth.

3.5 Task Granularity

We subdivide the map phase into M pieces and the reduce phase into R pieces, as described above. Ideally, M and R should be much larger than the number of worker machines. Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails: the many map tasks it has completed can be spread out across all the other worker machines.

As described above, the map phase is divided into M pieces and the reduce phase into R pieces.

Ideally, M and R should be much larger than the number of worker machines:

Having each worker perform many different tasks improves dynamic load balancing (why? — with many small tasks, faster or less-loaded machines simply pick up more of them, so work shifts to wherever capacity is).

It also speeds up recovery when a worker fails (why? — the many map tasks it completed can be spread across all the other workers and re-executed in parallel, rather than one machine redoing a single huge task).

There are practical bounds on how large M and R can be in our implementation, since the master must make O(M + R) scheduling decisions and keeps O(M ∗ R) state in memory as described above. (The constant factors for memory usage are small however: the O(M ∗ R) piece of the state consists of approximately one byte of data per map task/reduce task pair.)

But M and R cannot be arbitrarily large: the master makes O(M + R) scheduling decisions and keeps O(M × R) state in memory (why O(M × R)? — for each completed map task it stores location/size information for each of the R regions, i.e. one entry per map-task/reduce-task pair). The constant factor is small, though: per the paper, roughly one byte per map/reduce task pair.

Furthermore, R is often constrained by users because the output of each reduce task ends up in a separate output file. In practice, we tend to choose M so that each individual task is roughly 16 MB to 64 MB of input data (so that the locality optimization described above is most effective), and we make R a small multiple of the number of worker machines we expect to use. We often perform MapReduce computations with M = 200,000 and R = 5,000, using 2,000 worker machines.

Moreover, users often constrain R, because one MapReduce run produces R output files (one per reduce task).

In practice, M is chosen so that each task reads roughly 16-64 MB of input (matching GFS's 64 MB block size, which is what makes the locality optimization above most effective), and R is a small multiple of the expected number of workers. A typical configuration: M = 200,000 and R = 5,000 with 2,000 workers.
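A quick sanity check on those numbers: the O(M × R) state at that scale is 200,000 × 5,000 = 10^9 map/reduce task pairs, and at roughly one byte per pair that is on the order of 1 GB of master memory — feasible, but a clear reason why M and R have practical upper bounds.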

3.6 Backup Tasks

One of the common causes that lengthens the total time taken for a MapReduce operation is a "straggler": a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation. Stragglers can arise for a whole host of reasons. For example, a machine with a bad disk may experience frequent correctable errors that slow its read performance from 30 MB/s to 1 MB/s. The cluster scheduling system may have scheduled other tasks on the machine, causing it to execute the MapReduce code more slowly due to competition for CPU, memory, local disk, or network bandwidth. A recent problem we experienced was a bug in machine initialization code that caused processor caches to be disabled: computations on affected machines slowed down by over a factor of one hundred.

A common cause of an overly long MapReduce run is a "straggler": a machine that takes unusually long to finish one of the last few map or reduce tasks.

Stragglers arise for many reasons: a bad disk with frequent correctable errors can cut read throughput from 30 MB/s to 1 MB/s, or the cluster scheduler may have placed other tasks on the machine, and competition for CPU, memory, disk, or network bandwidth slows the MapReduce code down (does that really happen? — yes, the paper cites it directly).

One recent cause the authors hit: a bug in machine-initialization code disabled processor caches, slowing affected machines by more than 100x.

We have a general mechanism to alleviate the problem of stragglers. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes. We have tuned this mechanism so that it typically increases the computational resources used by the operation by no more than a few percent. We have found that this significantly reduces the time to complete large MapReduce operations. As an example, the sort program described in Section 5.3 takes 44% longer to complete when the backup task mechanism is disabled.

The general countermeasure for stragglers:

When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. A task is marked completed whenever either the primary or the backup execution finishes.

Properly tuned, this mechanism adds no more than a few percent of extra resource usage, yet significantly reduces completion time for large operations.

(The last sentence of the quote is the quantitative example: with backup tasks disabled, the sort program of Section 5.3 takes 44% longer.)
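A sketch of the backup-task idea, again building on the Master sketch (the threshold and the assign callback are invented, and only reduce tasks are scanned, for brevity):

```go
package mapreduce

// scheduleBackups: once few enough tasks remain, the master launches a
// backup copy of each in-progress task on another idle worker.
// Whichever copy finishes first marks the task completed; the
// duplicate-completion handling shown earlier discards the loser.
func (m *Master) scheduleBackups(assign func(task int)) {
	remaining := 0
	for _, s := range m.reduceState {
		if s != Completed {
			remaining++
		}
	}
	const backupThreshold = 10 // assumed notion of "close to completion"
	if remaining > backupThreshold {
		return
	}
	for task, s := range m.reduceState {
		if s == InProgress {
			assign(task) // run a second copy of the same task elsewhere
		}
	}
}
```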

4 Refinements

Some useful extensions to the basic model.

4.1 Partitioning Function

The users of MapReduce specify the number of reduce tasks/output files that they desire (R). Data gets partitioned across these tasks using a partitioning function on the intermediate key. A default partitioning function is provided that uses hashing (e.g. "hash(key) mod R"). This tends to result in fairly well-balanced partitions. In some cases, however, it is useful to partition data by some other function of the key. For example, sometimes the output keys are URLs, and we want all entries for a single host to end up in the same output file. To support situations like this, the user of the MapReduce library can provide a special partitioning function. For example, using "hash(Hostname(urlkey)) mod R" as the partitioning function causes all URLs from the same host to end up in the same output file.

The user specifies R.

Intermediate data is partitioned across the reduce tasks by applying a partitioning function to the intermediate key.

The default is hashing (hash(key) mod R), which usually yields fairly well-balanced partitions.

In some situations, though, partitioning by some other function of the key is more useful.

For example, when the output keys are URLs and we want all entries for a single host to land in the same output file.

For this, the user can supply a custom partitioning function, e.g. hash(Hostname(urlkey)) mod R.
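A sketch of such a host-based partitioner in Go (hostPartition and the sample URLs are my own; the paper's Hostname() is approximated with net/url parsing):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/url"
)

// hostPartition implements the spirit of "hash(Hostname(urlkey)) mod R":
// all keys that are URLs on the same host land in the same reduce
// partition, and therefore in the same output file.
func hostPartition(urlKey string, R int) int {
	host := urlKey // fall back to hashing the whole key
	if u, err := url.Parse(urlKey); err == nil && u.Host != "" {
		host = u.Host
	}
	h := fnv.New32a()
	h.Write([]byte(host))
	return int(h.Sum32()&0x7fffffff) % R
}

func main() {
	const R = 4
	for _, k := range []string{
		"http://example.com/a",
		"http://example.com/b",
		"http://other.org/c",
	} {
		fmt.Printf("%-24s -> partition %d\n", k, hostPartition(k, R))
	}
}
```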

4.2 Ordering Guarantees

We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or users of the output find it convenient to have the data sorted.

Within a given partition, intermediate key/value pairs are guaranteed to be processed in increasing key order (I recall mr-sequential does exactly this sort). The guarantee makes it easy to produce a sorted output file per partition, which matters when the output format must support efficient random-access lookups by key, and is generally convenient for consumers of the output.

4.3 Combiner Function

In some cases, there is significant repetition in the intermediate keys produced by each map task, and the user-specified Reduce function is commutative and associative. A good example of this is the word counting example in Section 2.1. Since word frequencies tend to follow a Zipf distribution, each map task will produce hundreds or thousands of records of the form <the, 1>. All of these counts will be sent over the network to a single reduce task and then added together by the Reduce function to produce one number. We allow the user to specify an optional Combiner function that does partial merging of this data before it is sent over the network.

In some cases the intermediate keys each map task produces are highly repetitive, and the user's Reduce function is commutative and associative (so partial results can be merged in any order or grouping without changing the answer).

The classic example is word count (Section 2.1): word frequencies follow a Zipf distribution, so each map task emits hundreds or thousands of <the, 1> pairs.

All of those counts would be shipped over the network to a single reduce task, only to be summed there by the Reduce function.

So the user may specify an optional Combiner function that partially merges this data before it crosses the network.

The Combiner function is executed on each machine that performs a map task. Typically the same code is used to implement both the combiner and the reduce functions. The only difference between a reduce function and a combiner function is how the MapReduce library handles the output of the function. The output of a reduce function is written to the final output file. The output of a combiner function is written to an intermediate file that will be sent to a reduce task.

The Combiner runs on every machine that performs a map task.

Typically the combiner and the reduce function are implemented by the same code; the only difference is how the MapReduce library handles the output ("the only difference"? — yes, per the paper: functionally the code is the same, only the output destination differs).

Reduce output is written to the final output file; combiner output is written to an intermediate file that will later be sent to a reduce task.

Partial combining significantly speeds up certain classes of MapReduce operations. Appendix A contains an example that uses a combiner.

Partial combining like this significantly speeds up certain classes of MapReduce operations; Appendix A has an example.
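A sketch of a word-count combiner (KeyValue with an integer value is a simplification here; the combiner's merge is only valid because the Reduce function — addition — is commutative and associative):

```go
package main

import "fmt"

type KeyValue struct {
	Key   string
	Value int
}

// combine does the partial merge a Combiner performs on the map side:
// collapse the many <word, 1> pairs a map task emits into one
// <word, n> pair per word before anything crosses the network.
func combine(pairs []KeyValue) []KeyValue {
	counts := map[string]int{}
	for _, kv := range pairs {
		counts[kv.Key] += kv.Value
	}
	out := make([]KeyValue, 0, len(counts))
	for k, v := range counts {
		out = append(out, KeyValue{k, v})
	}
	return out
}

func main() {
	pairs := []KeyValue{{"the", 1}, {"quick", 1}, {"the", 1}, {"the", 1}}
	fmt.Println(combine(pairs)) // e.g. [{the 3} {quick 1}] (map order varies)
}
```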

4.4 Input and Output Types

The MapReduce library provides support for reading input data in several different formats. For example, "text" mode input treats each line as a key/value pair: the key is the offset in the file and the value is the contents of the line. Another common supported format stores a sequence of key/value pairs sorted by key. Each input type implementation knows how to split itself into meaningful ranges for processing as separate map tasks (e.g. text mode's range splitting ensures that range splits occur only at line boundaries). Users can add support for a new input type by providing an implementation of a simple reader interface, though most users just use one of a small number of predefined input types.

The library supports reading input in several formats.

"Text" mode treats each line as a key/value pair: the key is the line's byte offset in the file, the value is the line's contents.

Another built-in format stores a sequence of key/value pairs sorted by key (i.e. the file itself is already a sorted key/value sequence — that is all the sentence is saying).

Each input type's implementation knows how to split itself into meaningful ranges for separate map tasks (e.g. text mode splits only at line boundaries).

Users can support a new input type by implementing a simple reader interface, though most just use one of the small number of predefined types.

A reader does not necessarily need to provide data read from a file. For example, it is easy to define a reader that reads records from a database, or from data structures mapped in memory.

A reader need not read from a file: it is easy to define one that reads records from a database or from data structures mapped in memory.
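One possible shape for such a reader interface in Go — the paper gives no signature, so everything here is an assumption:

```go
package mapreduce

// KeyValue is one input record handed to a map task.
type KeyValue struct {
	Key   string
	Value string
}

// RecordReader abstracts one input split. A text-mode reader would
// emit (file offset, line contents) pairs and split only at line
// boundaries; a database-backed reader could emit rows instead.
type RecordReader interface {
	// Next returns the next record in this split, or ok == false when
	// the split is exhausted.
	Next() (kv KeyValue, ok bool, err error)
}
```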

In a similar fashion, we support a set of output types for producing data in different formats and it is easy for user code to add support for new output types.

4.5 Side-effects

In some cases, users of MapReduce have found it convenient to produce auxiliary files as additional outputs from their map and/or reduce operators. We rely on the application writer to make such side-effects atomic and idempotent. Typically the application writes to a temporary file and atomically renames this file once it has been fully generated.

In some cases users find it convenient for their map/reduce operators to produce auxiliary files as additional outputs (log files, say?).

It is up to the application writer to make such side effects atomic and idempotent (idempotent: executing it many times has the same effect as executing it once).

Typically the application writes to a temporary file and atomically renames it to the final name once it is fully generated — the same commit-by-rename pattern as reduce output above.

4.6 Skipping Bad Records

Sometimes there are bugs in user code that cause the Map or Reduce functions to crash deterministically on certain records. Such bugs prevent a MapReduce operation from completing. The usual course of action is to fix the bug, but sometimes this is not feasible; perhaps the bug is in a third-party library for which source code is unavailable. Also, sometimes it is acceptable to ignore a few records, for example when doing statistical analysis on a large data set. We provide an optional mode of execution where the MapReduce library detects which records cause deterministic crashes and skips these records in order to make forward progress.

Sometimes bugs in user code make Map or Reduce crash deterministically on certain records, preventing the operation from completing. The usual course is to fix the bug, but that is not always possible — the bug may live in a third-party library whose source is unavailable. And sometimes ignoring a few records is acceptable, e.g. when doing statistical analysis over a huge data set.

So an optional execution mode is provided: when the library detects that certain records deterministically cause crashes, it skips them and keeps making forward progress.

Each worker process installs a signal handler that catches segmentation violations and bus errors. Before invoking a user Map or Reduce operation, the MapReduce library stores the sequence number of the argument in a global variable. If the user code generates a signal, the signal handler sends a "last gasp" UDP packet that contains the sequence number to the MapReduce master. When the master has seen more than one failure on a particular record, it indicates that the record should be skipped when it issues the next re-execution of the corresponding Map or Reduce task.

Each worker process installs a signal handler that catches segmentation violations and bus errors (a bus error is another kind of memory-access fault, now rare on x86).

Before invoking Map or Reduce on a record, the library stores the sequence number of the argument (i.e. which record within the split is about to be processed) in a global variable. If the user code generates a signal, the handler sends a "last gasp" UDP packet containing that sequence number to the master.

When the master has seen more than one failure on a particular record, it indicates that the record should be skipped on the next re-execution of the corresponding task (did my implementation do this?).
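A master-side sketch of the bookkeeping (the "more than one failure" threshold follows the quoted text; the structure and names are my own):

```go
package mapreduce

// recordID identifies one input record within one task.
type recordID struct {
	taskID int
	seqNo  int // sequence number carried in the last-gasp packet
}

type skipTracker struct {
	failures map[recordID]int
}

// reportFailure is called when a last-gasp packet arrives for a record.
func (s *skipTracker) reportFailure(task, seq int) {
	if s.failures == nil {
		s.failures = map[recordID]int{}
	}
	s.failures[recordID{task, seq}]++
}

// shouldSkip is consulted when the master re-issues the task: records
// that have failed more than once are skipped on the next execution.
func (s *skipTracker) shouldSkip(task, seq int) bool {
	return s.failures[recordID{task, seq}] > 1
}
```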

4.7 Local Execution

Debugging problems in Map or Reduce functions can be tricky, since the actual computation happens in a distributed system, often on several thousand machines, with work assignment decisions made dynamically by the master. To help facilitate debugging, profiling, and small-scale testing, we have developed an alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine. Controls are provided to the user so that the computation can be limited to particular map tasks. Users invoke their program with a special flag and can then easily use any debugging or testing tools they find useful (e.g. gdb).

Debugging Map or Reduce functions is hard because the real computation runs on a distributed system, possibly across thousands of machines, with work assigned dynamically by the master.

To ease debugging, profiling, and small-scale testing, the authors built an alternative implementation of the library that executes all the work of a MapReduce operation sequentially on the local machine.

The user is given controls to restrict the computation to particular map tasks.

Users invoke their program with a special flag and can then use whatever debugging or testing tools they prefer (e.g. gdb).

4.8 Status Information

The master runs an internal HTTP server and exports a set of status pages for human consumption. The status pages show the progress of the computation, such as how many tasks have been completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, processing rates, etc. The pages also contain links to the standard error and standard output files generated by each task. The user can use this data to predict how long the computation will take, and whether or not more resources should be added to the computation. These pages can also be used to figure out when the computation is much slower than expected.

The master runs an internal HTTP server that exposes a set of status pages for humans.

These pages show the computation's progress: how many tasks are completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, processing rates, and so on.

They also link to the standard error and standard output files generated by each task.

Users can use this data to predict how long the computation will take and whether more resources should be added, and to notice when it runs much slower than expected.

In addition, the top-level status page shows which workers have failed, and which map and reduce tasks they were processing when they failed. This information is useful when attempting to diagnose bugs in the user code.

Additionally, the top-level status page shows which workers have failed and which map/reduce tasks they were processing when they failed.

This is useful when diagnosing bugs in user code.

4.9 Counters

The MapReduce library provides a counter facility to count occurrences of various events. For example, user code may want to count total number of words processed or the number of German documents indexed, etc.

MapReduce also provides a counter facility for counting occurrences of various events.

For example, user code may want to count the total number of words processed, or the number of German documents indexed.

To use it, the user creates a named counter object and increments it at the appropriate points in the Map/Reduce code.

The counter values from individual worker machines are periodically propagated to the master (piggybacked on the ping response). The master aggregates the counter values from successful map and reduce tasks and returns them to the user code when the MapReduce operation is completed. The current counter values are also displayed on the master status page so that a human can watch the progress of the live computation. When aggregating counter values, the master eliminates the effects of duplicate executions of the same map or reduce task to avoid double counting. (Duplicate executions can arise from our use of backup tasks and from re-execution of tasks due to failures.)

Each worker's counter values are periodically propagated to the master (piggybacked on the ping response).

The master aggregates the counter values from successful map and reduce tasks and returns them to the user code when the operation completes. Current counter values are also displayed on the master's status page so a human can watch live progress.

How is double counting eliminated? When aggregating, the master counts each task's values only once, discarding reports from duplicate executions of the same task — duplicates arise from backup tasks and from failure-driven re-execution. See the sketch below.

Some counters are maintained automatically by the library, such as the number of input key/value pairs processed and the number of output pairs produced.

Users have found counters handy for sanity-checking MapReduce behavior — e.g. ensuring that the number of output pairs produced exactly equals the number of input pairs processed.
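A sketch of duplicate-free aggregation (structure assumed): the master folds in counters only from the first completion of each task, so backup executions and re-executions don't double count:

```go
package mapreduce

// counterReport is what a worker piggybacks on a task-completion
// message: the task it ran plus that execution's counter values.
type counterReport struct {
	taskID int
	counts map[string]int64 // e.g. "words processed" -> value
}

type counterAggregator struct {
	committed map[int]bool     // task IDs already counted
	totals    map[string]int64 // global counter values
}

// onTaskCompleted folds a report into the global totals, ignoring
// duplicate executions of the same task.
func (a *counterAggregator) onTaskCompleted(r counterReport) {
	if a.committed == nil {
		a.committed = map[int]bool{}
		a.totals = map[string]int64{}
	}
	if a.committed[r.taskID] {
		return // a backup or re-executed copy of a counted task: skip
	}
	a.committed[r.taskID] = true
	for name, v := range r.counts {
		a.totals[name] += v
	}
}
```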
