"Hadoop: The Definitive Guide" Reading Notes, Part 7: Chapter 7 [updating…]

The whole process of MapReduce

At the highest level, there are five independent entities:

  1. The client, which submits the MapReduce job.

  2. The YARN resource manager, which coordinates the allocation of compute resources on the cluster.

  3. The YARN node managers, which launch and monitor the compute containers on machines in the cluster.

  4. The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.

  5. The distributed filesystem (normally HDFS, covered in Chapter 3), which is used for sharing job files between the other entities.

2.Job Submission

  1. Ask the resource manager for a new application ID, which is used as the MapReduce job ID.
  2. Check the output specification of the job: if the output directory has not been specified or already exists, the job is not submitted and an error is thrown.
  3. Compute the input splits for the job. If the splits cannot be computed (for example, because the input paths do not exist), the job is not submitted and an error is thrown.
  4. Copy the resources needed to run the job to a directory named after the job ID, such as the job JAR file, the configuration file, and the computed input splits.
  5. Submit the job by calling submitApplication() on the resource manager. A minimal driver that triggers this path is sketched below.
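
The steps above are what Job.waitForCompletion() performs behind the scenes. As a reference point, here is a minimal sketch of a driver that triggers this submission path (using the stock identity Mapper and Reducer; a real job would substitute its own classes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinimalDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "minimal-example");
        job.setJarByClass(MinimalDriver.class);
        job.setMapperClass(Mapper.class);    // identity mapper; real jobs use their own
        job.setReducerClass(Reducer.class);  // identity reducer; real jobs use their own
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Step 2 above: this directory must not already exist, or submission fails
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() submits the job (steps 1-5) and then polls for status
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```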

3.Job Initialization

  1. When the resource manager receives a call to its submitApplication() method, it hands the request off to the YARN scheduler.

  2. The YARN scheduler allocates a container.

  3. The resource manager then launches the application master's process there, under the node manager's management.

  4. The application master for MapReduce jobs is a Java application whose main class is MRAppMaster.
    It creates a number of bookkeeping objects to keep track of the job's progress, as it will receive progress and completion reports from the tasks.

  5. Next, it retrieves the input splits, computed in the client, from the shared filesystem.

  6. It then creates a map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property. Task IDs are assigned at this point.

  7. The application master must decide how to run the tasks that make up the MapReduce job.
    If the job is small, the application master may choose to run the tasks in the same JVM as itself (see the configuration sketch after this list).

  8. Before any tasks can be run, the application master calls the setupJob() method on the OutputCommitter (the default OutputCommitter is FileOutputCommitter), which creates the final output directory for the job and a temporary working directory for task output.
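
A job that runs in the application master's own JVM is called an uber task. As a hedged aside, these properties are not quoted in the notes above, but they are the standard Hadoop 2 knobs for this behavior (values shown at their documented defaults):

```java
// Inside the driver, given a Job `job`:
Configuration conf = job.getConfiguration();
// Allow small jobs to run "uberized" in the application master's JVM
conf.setBoolean("mapreduce.job.ubertask.enable", true);
// A job only qualifies as "small" under thresholds like these
conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
```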

4.Task Assignment

If the job does not qualify to run as an uber task, its tasks must be run on the cluster. The main steps are as follows:

01.The application master requests containers for all the map and reduce tasks in the job from the resource manager.

02.Requests for map tasks are made first, with a higher priority than those for reduce tasks. Requests for reduce tasks are not made until 5% of the map tasks have completed.

03.Reduce tasks can run anywhere in the cluster, but requests for map tasks have data locality constraints.

5.Task Execution
01.Once the resource manager's scheduler assigns a container on a particular node to a task, the application master starts the container by contacting the node manager.
02.Before it runs the task, the application master localizes the resources that the task needs, including the job configuration, the JAR file, and any files from the distributed cache.
03.In the case of Streaming, while the task is running, the Java process passes input key-value pairs to the external process (the process running the user's own map or reduce code), which passes the output key-value pairs back to the Java process.

6.Job Completion
01.When the application master receives a notification that the last task for a job is complete, it changes the status of the job to "successful".
02.When the Job object polls for status, it learns that the job has completed successfully.
03.It prints a message to tell the user and then returns from the waitForCompletion() method.
04.Job statistics and counters are printed to the console.
05.The application master also sends an HTTP job notification if it is configured to do so => the mapreduce.job.end-notification.url property must be configured (see the sketch below).
06.The application master and the task containers clean up their working state (so intermediate output is deleted).
07.The OutputCommitter's commitJob() method is called.
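
A hedged sketch of configuring the HTTP notification from step 05. The callback URL is a hypothetical endpoint; to my understanding, Hadoop substitutes the $jobId and $jobStatus placeholders before issuing the request:

```java
// Inside the driver, given a Job `job`:
Configuration conf = job.getConfiguration();
// Hadoop replaces $jobId and $jobStatus before sending the HTTP notification
conf.set("mapreduce.job.end-notification.url",
         "http://example.com/jobdone?id=$jobId&status=$jobStatus");
```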

7.Failures
Tasks can fail for many reasons: bugs in user code, process crashes, machine failures, and so on. Hadoop's job is to handle such failures and still let the job complete successfully.
The failure modes to consider are:
01.the task
02.the application master
03.the node manager
04.the resource manager
Each of these is discussed in turn below.

8.Task Failure
01.The task code throws a runtime exception.
02.The task JVM reports the error back to its parent application master.
03.The error ultimately makes it into the user logs.
04.The application master marks the task attempt as failed, and frees up the container so its resources are available for another task.
05.If the application master notices that it hasn't received a progress update for a while, it proceeds to mark the task as failed. => The timeout period is configurable per job (via mapreduce.task.timeout);
setting the timeout to a value of zero disables the timeout, so long-running tasks are never marked as failed.
06.When the application master is notified of a task attempt that has failed, it will reschedule execution of the task.
It tries to avoid rescheduling the task on a node manager where it has previously failed. If a task fails four times, it will not be retried again.

07.Use the properties mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent to control the job's failure tolerance (see the sketch after the list below).

  1. Why may a task attempt also be killed?
    01.It is a speculative duplicate.
    02.The node manager it ran on failed, so the application master marked all the task attempts running on it as killed.
  2. Killed task attempts don't count against the number of attempts to run the task, because it wasn't the task's fault that an attempt was killed.
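
A hedged sketch of the failure-tolerance settings from item 07 above; 10 is an illustrative value only, not a recommendation:

```java
// Inside the driver, given a Job `job`:
Configuration conf = job.getConfiguration();
// Let the job succeed even if up to 10% of map or reduce tasks fail
conf.setInt("mapreduce.map.failures.maxpercent", 10);
conf.setInt("mapreduce.reduce.failures.maxpercent", 10);
```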

9.Application Master Failure
01.Applications in YARN are retried in the event of failure. The maximum number of attempts to run a MapReduce application master is controlled by the mapreduce.am.max-attempts property; the default value is 2 (a configuration sketch follows this list).
02.The way recovery works is as follows:
001.An application master sends periodic heartbeats to the resource manager.
002.In the event of application master failure, the resource manager will detect the failure and start a new instance of the master running in a new container (managed by a node manager).
003.The MapReduce client polls the application master for progress reports, but if the application master fails, the client needs to locate the new instance.
004.During job initialization, the client asks the resource manager for the application master's address, and then caches it so it doesn't overload the resource manager with a request every time it needs to poll the application master.
005.If the application master fails, however, the client will experience a timeout when it issues a status update, at which point the client will go back to the resource manager to ask for the new application master's address. This process is transparent to the user.
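
A one-line sketch of raising the retry limit from item 01; 4 is only an illustration, and the note that a cluster-wide YARN cap (yarn.resourcemanager.am.max-attempts) also applies is my assumption, not something stated in these notes:

```java
// Allow up to 4 attempts for this job's MapReduce application master
job.getConfiguration().setInt("mapreduce.am.max-attempts", 4);
```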

10.Node Manager Failure
01.A failed node manager stops sending heartbeats to the resource manager (or sends them very infrequently).
02.The resource manager will notice a node manager that has stopped sending heartbeats if it hasn't received one for 10 minutes [yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms].
03.The resource manager removes it from its pool of nodes to schedule containers on.
04.The application master arranges for map tasks that ran and completed successfully on the failed node manager to be rerun if they belong to incomplete jobs, since their intermediate output, residing on the failed node manager's local filesystem, may not be accessible to the reduce tasks.
05.Node managers may be blacklisted if the number of failures for the application is high, even if the node manager itself has not failed. Blacklisting is done by the application master (see the sketch below).
06.But the resource manager does not do blacklisting across applications. [In other words, blacklisting is per application: if the master of application A blacklists node manager n1, that does not prevent other applications from being assigned containers on n1.]
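
As a hedged aside, this property is not cited in the notes above, but it is the usual Hadoop 2 knob for the per-application blacklisting threshold in item 05:

```java
// Blacklist a node for this job after 3 task failures on it (3 is the default)
job.getConfiguration().setInt("mapreduce.job.maxtaskfailures.per.tracker", 3);
```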

11.Resource Manager Failure
Failure of the resource manager is serious, because without it, neither jobs nor task containers can be launched.

In the default configuration, the resource manager is a single point of failure, since in the (unlikely) event of machine failure, all running jobs fail and can't be recovered.

To achieve high availability (HA), it is necessary to run a pair of resource managers in an active-standby configuration.
If the active resource manager fails, then the standby can take over without a significant interruption to the client.

What information is stored in the state store? Information about all the running applications.
Note: node manager information is not stored in the state store, since it can be reconstructed relatively quickly by the new resource manager as the node managers send their first heartbeats.

12.Shuffle and Sort
MapReduce makes the guarantee that the input to every reducer is sorted by key.

The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle.

The shuffle is the heart of MapReduce and is where the "magic" happens.

Map side
01.Each map task has a circular memory buffer that it writes output to. When the contents of the buffer reach a threshold, a background thread starts to spill the contents to disk.
02.Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.
03.Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
04.Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record, there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file.

05.Combiners may be run repeatedly over the input without affecting the final result (see the combiner sketch below).
06.Compressing the map output as it is written to disk makes writing to disk faster, saves disk space, and reduces the amount of data to transfer to the reducer.
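
For concreteness, a hedged sketch of items 03, 05, and 06, assuming a job whose map output values are IntWritable counts (IntSumReducer is a stock Hadoop library class, and summing is commutative and associative, so it can safely double as a combiner):

```java
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Inside the driver, given a Job `job`:
job.setCombinerClass(IntSumReducer.class); // run on sorted map output before the shuffle
job.setReducerClass(IntSumReducer.class);
// Item 06: compress map output as it is written to local disk
job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
```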

Reduce side
01.The map output file sits on the local disk of the machine that ran the map task.
02.The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes.

03.The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. [The default is five threads, but this number can be changed by setting the mapreduce.reduce.shuffle.parallelcopies property.]

But there is a question: how do reducers know which machines to fetch map output from?
001.As map tasks complete successfully, they notify their application master using the heartbeat mechanism.
002.The application master therefore knows the mapping between map outputs and hosts. A thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.
003.Hosts do not delete map outputs from disk as soon as the first reducer has retrieved them, as the reducer may subsequently fail. Instead, they wait until they are told to delete them by the application master, which is after the job has completed.

04.Map outputs are copied to the reduce task JVM's memory if they are small enough; otherwise, they are copied to disk.
When the in-memory buffer reaches a threshold size, or reaches a threshold number of map outputs, it is merged and spilled to disk. If a combiner is specified, it will be run during the merge to reduce the amount of data written to disk.

05.As the copies accumulate on disk, a background thread merges them into larger, sorted files.
001.Note that any map outputs that were compressed (by the map task) have to be decompressed in memory in order to perform a merge on them.

06.When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering.

07.For example, if there were 40 map outputs and the merge factor was 10, a naive plan would be four rounds, each merging 10 files into 1, leaving 4 intermediate files plus a final merge. The actual merge is more subtle:

The file-merging algorithm:
001.The goal is to merge the minimum number of files in the first round so that exactly the merge factor's worth of files remains for the final round.
002.With 40 files, the merge proceeds as follows:
step 1: the first round merges only 4 files => 1
step 2: merge 10 files => 1
step 3: merge 10 files => 1
step 4: merge 10 files => 1
After these four rounds, the 4 merged files plus the 6 remaining unmerged files make a total of 10 files for the final round.

This optimization does not change the number of rounds; it reduces the amount of data spilled to disk, because the final round always feeds directly into the reduce. A small sketch of the round-size rule follows.
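A minimal sketch of that rule, assuming only the invariant that merging k files reduces the file count by k - 1 (this mirrors the book's description, not Hadoop's actual merger code):

```java
// How many files should the first round merge so that, after the remaining
// full rounds of `factor` files each, exactly `factor` files are left for
// the final round?
static int firstRoundMergeCount(int numFiles, int factor) {
    if (numFiles <= factor) {
        return numFiles; // everything fits in a single final merge
    }
    int reduction = numFiles - factor;        // merging k files removes k - 1
    int remainder = reduction % (factor - 1);
    return remainder == 0 ? factor : remainder + 1;
}
// firstRoundMergeCount(40, 10) == 4, matching the worked example above
```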

08.During the reduce phase, the reduce function is invoked for each key in the sorted output. The output of reduce is written directly to the output filesystem, typically HDFS.


Tuning MapReduce performance
The general principle is to give the shuffle as much memory as possible.
01.Map side
001.On the map side, the best performance can be obtained by avoiding multiple spills to disk.
002.There is a MapReduce counter (SPILLED_RECORDS; see Counters) that counts the total number of records that were spilled to disk over the course of a job, which can be useful for tuning.

02.Reduce side
001.On the reduce side, the best performance is obtained when the intermediate data can reside entirely in memory. A hedged tuning sketch follows.
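
The following properties are my assumptions for the usual knobs behind this principle (they are not named in these notes), with illustrative values only:

```java
// Inside the driver, given a Job `job`:
Configuration conf = job.getConfiguration();
// Map side: a larger sort buffer means fewer spills (the default is 100 MB)
conf.setInt("mapreduce.task.io.sort.mb", 200);
// Reduce side: let map outputs stay in the reducer's heap during the reduce phase
conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.70f);
```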

Task Execution Environment
Hadoop provides information to a map or reduce task about the environment in which it is running.
For example:
01.a map task can discover the name of the file it is processing;
02.a map or reduce task can find out the attempt number of the task.
Both are sketched below.
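
A minimal sketch of retrieving both pieces of information from inside a new-API Mapper; the cast to FileSplit assumes a file-based input format:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class EnvAwareMapper extends Mapper<Object, Object, Object, Object> {
    @Override
    protected void setup(Context context) {
        // 01: the file this map task is processing (file-based inputs only)
        Path file = ((FileSplit) context.getInputSplit()).getPath();
        // 02: the task attempt ID encodes the attempt number
        int attempt = context.getTaskAttemptID().getId();
        System.out.println("processing " + file + ", attempt " + attempt);
    }
}
```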

Speculative Execution
What is speculative execution? (As the book defines it: when a task runs significantly more slowly than expected, the framework launches another, equivalent attempt of the task as a backup and takes the output of whichever attempt finishes first, killing the other.)
Why use speculative execution?
Why turn speculative execution off?

Output Committers
Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either succeed or fail cleanly.
01.The behavior is implemented by the OutputCommitter in use for the job.
02.In the new MapReduce API, the OutputCommitter is determined by the OutputFormat, via its getOutputCommitter() method.
001.The default is FileOutputCommitter, which is appropriate for file-based MapReduce.
002.You can customize an existing OutputCommitter or even write a new implementation (see the sketch at the end of this section).

03.The setupJob() method is called before the job is run, and is typically used to perform initialization.
04.It creates the final output directory, ${mapreduce.output.fileoutputformat.outputdir}, and a temporary working space for task output, _temporary, as a subdirectory underneath it.
05.If the job succeeds, the commitJob() method is called, which in the default file-based implementation deletes the temporary working space and creates a hidden empty marker file in the output directory called _SUCCESS to indicate to filesystem clients that the job completed successfully.

The operations are similar at the task level.
[The methods above operate at the job level; the ones below operate at the task level.]

01.The setupTask() method is called before the task is run; the default implementation doesn't do anything.
02.If a task succeeds, commitTask() is called, which in the default implementation moves the temporary task output directory (whose name includes the task attempt ID) to the final output location.
03.The framework ensures that in the event of multiple task attempts for a particular task, only one will be committed.
Why might there be multiple attempts for a task?
The first attempt may have failed for some reason; it is then aborted, and a later, successful attempt is committed.
Alternatively, two task attempts may be running concurrently as speculative duplicates.

04.A task may find its working directory by retrieving the value of the mapreduce.task.output.dir property from the job configuration.
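
A minimal sketch of what a custom implementation must supply, based on the abstract methods of org.apache.hadoop.mapreduce.OutputCommitter as I understand them (a no-op committer for illustration, not a drop-in replacement for FileOutputCommitter):

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class NoOpOutputCommitter extends OutputCommitter {
    @Override public void setupJob(JobContext jobContext) throws IOException {
        // job-level initialization, e.g. creating an output directory
    }
    @Override public void setupTask(TaskAttemptContext taskContext) throws IOException {
        // the default FileOutputCommitter also does nothing here
    }
    @Override public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException {
        return false; // nothing to commit, so commitTask() is never called
    }
    @Override public void commitTask(TaskAttemptContext taskContext) throws IOException { }
    @Override public void abortTask(TaskAttemptContext taskContext) throws IOException { }
}
```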
