Notes on the paper "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center"

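These are the notes of a beginner in cloud platform resource management. If you spot any mistakes, please point them out; I would be very grateful.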
Abstract
To support the sophisticated schedulers of today’s frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them.

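As data-processing demands keep growing, new computing frameworks keep appearing. When several frameworks share one cluster, the resource manager has to support increasingly flexible scheduling policies, which makes the scheduler harder to design. Mesos introduces a two-level scheduling mechanism called resource offers to allocate cluster resources. In short, Mesos decides how many resources to offer to each framework; the framework then accepts or rejects the offered resources according to its own needs and decides which computations to run on them.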

1 Introduction
The solutions of choice to share a cluster today are either to statically partition the cluster and run one framework per partition, or allocate a set of VMs to each framework. Unfortunately, these solutions achieve neither high utilization nor efficient data sharing.

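The paper dates from 2010 (the conference version appeared at NSDI 2011). At that time, the usual ways to let several frameworks share one cluster were:

Statically partition the machines, e.g. partition A runs MapReduce and partition B runs Dryad
Run multiple virtual machines on the cluster and allocate a set of VMs to each framework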

The short duration of tasks and the ability to run multiple tasks per node allow jobs to achieve high data locality, as each job will quickly get a chance to run on nodes storing its input data. Short tasks also allow frameworks to achieve high utilization, as jobs can rapidly scale when new nodes become available.

The shorter and more numerous the tasks, the easier it is to achieve high utilization.

In this paper, we propose Mesos, a thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources.

The main design question that Mesos must address is how to match resources with tasks. This is challenging for several reasons. First, a solution will need to support a wide array of both current and future frameworks, each of which will have different scheduling needs based on its programming model, communication pattern, task dependencies, and data placement. Second, the solution must be highly scalable, as modern clusters contain tens of thousands of nodes and have hundreds of jobs with millions of tasks active at a time. Third, the scheduling system must be fault-tolerant and highly available, as all the applications in the cluster depend on it.

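The scheduling design has to meet three requirements:

Support the scheduling needs of both current and future computing frameworks
Scalability
Fault tolerance and high availability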

Instead, Mesos takes a different approach: delegating control over scheduling to the frameworks.

Mesos decides how many resources to offer each framework, based on an organizational policy such as fair sharing, while frameworks decide which resources to accept and which tasks to run on them.

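Mesos adopts a new scheduling approach: delegating control over scheduling to the frameworks.
Mesos is only responsible for offering (pushing) resources to each framework according to an organizational policy such as fair sharing. How the resources are actually used, i.e. which resources to accept and which tasks run on them, is decided by each framework itself.
This keeps Mesos itself relatively simple, because it does little of the fine-grained resource assignment work.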

Mesos’s flexible fine-grained sharing model also has other advantages. First, even organizations that only use one framework can use Mesos to run multiple instances of that framework in the same cluster, or multiple versions of the framework.

Second, by providing a means of sharing resources across frameworks, Mesos allows framework developers to build specialized frameworks targeted at particular problem domains rather than one-size-fits-all abstractions. Frameworks can therefore evolve faster and provide better support for each problem domain.

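Mesos's flexible fine-grained sharing model (flexible by 2010 standards, at least) brings two advantages:

Even an organization that uses only one framework can run multiple instances, or multiple versions, of that framework in the same cluster.
Framework developers can focus on the problem domain their framework targets instead of being bound to one-size-fits-all abstractions.
Put simply, Mesos hands you the resources and you divide and use them as you see fit. How the “cake” is cut is up to you, and that is how fine-grained resource sharing is achieved.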

We have implemented Mesos in 10,000 lines of C++. The system scales to 50,000 (emulated) nodes and uses ZooKeeper [4] for fault tolerance.

we have also built a new framework on top of Mesos called Spark, optimized for iterative jobs where a dataset is reused in many parallel operations, and shown that Spark can outperform Hadoop by 10x in iterative machine learning workloads.

Only 10,000 lines of code! That is what got me interested in reading the paper closely.

2 Target Environment
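The paper uses Facebook's Hadoop data warehouse as a motivating example of the target environment: a cluster whose workload mixes many jobs of different sizes, which is the kind of cluster Mesos is designed to share across multiple computing frameworks.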

3 Architecture

3.1 Design Philosophy
Because cluster frameworks are both highly diverse and rapidly evolving, our overriding design philosophy has been to define a minimal interface that enables efficient resource sharing across frameworks, and otherwise push control of task scheduling and execution to the frameworks.

Although Mesos provides a low-level interface, we expect higher-level libraries implementing common functionality (such as fault tolerance) to be built on top of it. These libraries would be analogous to library OSes in the exokernel [25].

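Mesos deliberately exposes only a low-level interface; developers are expected to build higher-level libraries (for example for fault tolerance) on top of it.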

3.2 Overview

Mesos uses a master/slave architecture.

To support a diverse set of policies, the master employs a modular architecture that makes it easy to add new allocation modules via a plugin mechanism. To make the master fault-tolerant we use ZooKeeper [4] to implement the failover mechanism (see Section 3.6).

A framework running on top of Mesos consists of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on slave nodes to run the framework’s tasks.

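Allocation policies are added to the master as plugin modules, while each framework brings its own scheduler and executor. This design is flexible, but the catch is that a framework author has to write not only the application logic that fits the framework but also a framework-specific scheduler. Mesos gives users more freedom, but in return they have to implement their own scheduling policy, which demands more skill and more work.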

Figure 3 shows an example of how a framework gets scheduled to run a task. In step (1), slave 1 reports to the master that it has 4 CPUs and 4 GB of memory free. The master then invokes the allocation policy module, which tells it that framework 1 should be offered all available resources. In step (2) the master sends a resource offer describing what is available on slave 1 to framework 1. In step (3), the framework’s scheduler replies to the master with information about two tasks to run on the slave, using ⟨2 CPUs, 1 GB RAM⟩ for the first task, and ⟨1 CPUs, 2 GB RAM⟩ for the second task. Finally, in step (4), the master sends the tasks to the slave, which allocates appropriate resources to the framework’s executor, which in turn launches the two tasks (depicted with dotted-line borders in the figure). Because 1 CPU and 1 GB of RAM are still unallocated, the allocation module may now offer them to framework 2. In addition, this resource offer process repeats when tasks finish and new resources become free.
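The arithmetic of the Figure 3 example can be traced in a few lines of Python. This is only an illustrative sketch, not Mesos code: the `Resources` class and the `launch` helper are hypothetical names invented here, and the numbers are the ones from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """A bundle of CPUs and memory in GB (hypothetical helper type)."""
    cpus: int
    mem_gb: int

    def __sub__(self, other: "Resources") -> "Resources":
        return Resources(self.cpus - other.cpus, self.mem_gb - other.mem_gb)

    def fits(self, other: "Resources") -> bool:
        """True if `other` can be carved out of this bundle."""
        return other.cpus <= self.cpus and other.mem_gb <= self.mem_gb


def launch(offer: Resources, tasks: list[Resources]) -> Resources:
    """Subtract each accepted task from the offer and return what is left."""
    remaining = offer
    for task in tasks:
        if not remaining.fits(task):
            raise ValueError(f"task {task} does not fit in {remaining}")
        remaining = remaining - task
    return remaining


# Step (1): slave 1 reports 4 CPUs and 4 GB free.
# Step (2): the master offers all of it to framework 1.
offer = Resources(cpus=4, mem_gb=4)
# Step (3): framework 1 replies with two tasks.
tasks = [Resources(2, 1), Resources(1, 2)]
# Step (4): the tasks are launched; the remainder can be offered to framework 2.
leftover = launch(offer, tasks)
print(leftover)  # Resources(cpus=1, mem_gb=1)
```

The leftover of 1 CPU and 1 GB matches the amount the paper says may now be offered to framework 2.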

In particular, how can a framework achieve data locality without Mesos knowing which nodes store the data required by the framework? Mesos answers these questions by simply giving frameworks the ability to reject offers.

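This is how Mesos achieves data locality in such a simple way: it gives frameworks the right to reject the resources it offers. Suppose the data a framework needs is on machines 1 and 5. If the CPU and memory that Mesos offers are not on machine 1 or 5, the framework may reject the offer and wait for a later offer that does allow data locality. (With resources on machine 1 or 5 it could read the data directly and avoid the cost of transferring it over the network, which is what data locality means.)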
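A framework-side decision rule for this reject-and-wait behaviour might look like the sketch below. The function name `handle_offer` and the `max_wait` limit are assumptions made for illustration (the limit just keeps the framework from waiting forever); this is not the Mesos scheduler API.

```python
def handle_offer(offer_node: str, preferred_nodes: set[str],
                 waited: int, max_wait: int) -> str:
    """Decide whether a framework accepts an offer.

    Accept if the offered node stores the input data, or if the framework
    has already rejected `max_wait` offers (an assumption of this sketch).
    """
    if offer_node in preferred_nodes or waited >= max_wait:
        return "accept"
    return "reject"


preferred = {"node1", "node5"}          # nodes storing the input data
offers = ["node3", "node7", "node5"]    # offers arriving one after another
decisions = [handle_offer(node, preferred, waited=i, max_wait=5)
             for i, node in enumerate(offers)]
print(decisions)  # ['reject', 'reject', 'accept']
```

The master never needs to know where the data lives; all of the locality logic sits in the framework's own accept/reject decision.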

3.3.1 Revocation

However, if a cluster becomes filled by long tasks, e.g., due to a buggy job or a greedy framework, Mesos can also revoke (kill) tasks. Before killing a task, Mesos gives its framework a grace period to clean it up.
Mesos asks the respective executor to kill the task, but kills the entire executor and all its tasks if it does not respond to the request. We leave it up to the allocation module to implement the policy for revoking tasks, but describe two related mechanisms here.

We allow these frameworks to avoid being killed by letting allocation modules expose a guaranteed allocation to each framework – a quantity of resources that the framework may hold without losing tasks.

if a framework is below its guaranteed allocation, none of its tasks should be killed, and if it is above, any of its tasks may be killed.

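When the cluster fills up with long-running work (for instance because of a buggy job or a greedy framework or, in my own examples, a search engine that never stops or a long-lived ticketing service), resources come close to being exhausted, and Mesos can kill some tasks to free them. Before killing a task, Mesos gives the framework a grace period to clean up. Mesos first asks the executor on the node to kill the task, which has two possible outcomes: (1) the executor carries out the request and kills the task; (2) the executor does not respond, in which case Mesos kills the whole executor together with all tasks running in it.

For frameworks whose tasks cannot simply be killed, such as MPI, the allocation module exposes a guaranteed allocation: an amount of resources the framework may hold without losing tasks. While the framework holds less than its guaranteed allocation, none of its tasks should be killed; once it holds more, any of its tasks may be killed.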
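The guaranteed-allocation rule can be written down as a small check. This is a sketch under assumptions: `revocable_frameworks` is a hypothetical helper, resources are counted as plain slot numbers, and a framework holding exactly its guarantee is treated as protected (the paper only spells out the "below" and "above" cases).

```python
def revocable_frameworks(allocated: dict[str, int],
                         guaranteed: dict[str, int]) -> list[str]:
    """Return the frameworks whose tasks may be killed.

    A framework is protected while it holds no more than its guaranteed
    allocation; once it holds more, any of its tasks may be revoked.
    """
    return sorted(name for name, held in allocated.items()
                  if held > guaranteed.get(name, 0))


allocated = {"mpi": 8, "hadoop": 12, "spark": 3}    # slots currently held
guaranteed = {"mpi": 8, "hadoop": 6, "spark": 4}    # guaranteed allocations
print(revocable_frameworks(allocated, guaranteed))  # ['hadoop']
```

Here the MPI framework sits exactly at its guarantee and is left alone, while Hadoop holds twice its guarantee and becomes a candidate for revocation.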

3.4 Isolation (resource isolation strategy)
Our current implementation isolates resources using operating system container technologies, specifically Linux containers [10] and Solaris projects [17].

3.5 Making Resource Offers Scalable and Robust

First, because some frameworks will always reject certain resources, Mesos lets them short-circuit the rejection process and avoid communication by providing filters to the master. We support two types of filters: “only offer nodes from list L” and “only offer nodes with at least R resources free”.

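A framework can register filters with the master so that offers it would always reject are never sent, and it only receives offers for resources that suit it.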
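The two filter types quoted above can be modelled as predicates that the master evaluates before sending an offer. The names `only_nodes`, `at_least` and `nodes_to_offer` are hypothetical and chosen for this sketch; they are not part of the Mesos API.

```python
from typing import Callable

# A filter takes (node name, free resources on that node) and says
# whether the framework is willing to receive an offer for that node.
Filter = Callable[[str, dict[str, float]], bool]


def only_nodes(allowed: set[str]) -> Filter:
    """Filter type 1: only offer nodes from list L."""
    return lambda node, free: node in allowed


def at_least(required: dict[str, float]) -> Filter:
    """Filter type 2: only offer nodes with at least R resources free."""
    return lambda node, free: all(
        free.get(name, 0) >= amount for name, amount in required.items())


def nodes_to_offer(free_by_node: dict[str, dict[str, float]],
                   filters: list[Filter]) -> list[str]:
    """Master side: skip every node that one of the filters rejects."""
    return [node for node, free in free_by_node.items()
            if all(f(node, free) for f in filters)]


free_by_node = {
    "n1": {"cpus": 4, "mem_gb": 4},
    "n2": {"cpus": 1, "mem_gb": 8},
    "n3": {"cpus": 8, "mem_gb": 16},
}
filters = [only_nodes({"n1", "n2"}), at_least({"cpus": 2})]
print(nodes_to_offer(free_by_node, filters))  # ['n1']
```

Because the check runs on the master, the rejected offers for `n2` and `n3` never cost a network round trip.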

Second, because a framework may take time to respond to an offer, Mesos counts resources offered to a framework towards its share of the cluster for the purpose of allocation. This is a strong incentive for frameworks to respond to offers quickly and to filter out resources that they cannot use, so that they can get offers for more suitable resources faster.

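Resources that have been offered to a framework but not yet answered are counted towards that framework's share of the cluster. A framework that sits on an offer therefore looks as if it already holds those resources and is less likely to receive further offers, which pushes it to respond quickly and to filter out resources it cannot use.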
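The incentive becomes visible in a toy allocator that offers next to the framework with the smallest share. This is a sketch under assumptions: shares are plain slot counts, `next_framework` is a hypothetical function, and ties are broken by name; a real allocation module is free to use any policy.

```python
def next_framework(running: dict[str, int], offered: dict[str, int]) -> str:
    """Pick the framework with the smallest share.

    The share counts slots in use plus slots in offers that have not been
    answered yet, which is the accounting rule described in the paper.
    """
    return min(running, key=lambda f: (running[f] + offered.get(f, 0), f))


running = {"A": 2, "B": 4}
# Framework A is sitting on an unanswered offer of 6 slots.
print(next_framework(running, {"A": 6, "B": 0}))  # B
# Once A declines the offer, it is first in line again.
print(next_framework(running, {"A": 0, "B": 0}))  # A
```

A framework that delays its answer is charged for resources it is not even using, so answering (or filtering) quickly is in its own interest.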

Third, if a framework has not responded to an offer for a sufficiently long time, Mesos rescinds the offer and re-offers the resources to other frameworks.

3.6 Fault tolerance
First, we have designed the master to be soft state, i.e., the master can reconstruct completely its internal state from the periodic messages it gets from the slaves, and from the framework schedulers. Second, we have implemented a hot-standby design, where the master is shadowed by several backups that are ready to take over when the master fails.

ZooKeeper is used to manage the master's backups (the failover to a hot standby).

4 Mesos Behavior
We consider two types of frameworks: elastic and rigid. An elastic framework (e.g., Hadoop, Dryad) can scale its resources up and down, i.e., it can start using slots as soon as it acquires them and release slots as soon as its tasks finish. In contrast, a rigid framework (e.g., MPI) can start running its jobs only after it has allocated all its slots, i.e., the minimum allocation of a rigid framework is equal to its full allocation.

System utilization Elastic jobs fully utilize their allocated slots, since they can use every slot as soon as they get it. As a result, assuming infinite demand, a system running elastic jobs is fully utilized. Rigid frameworks provide slightly worse utilizations, as their jobs cannot start before they get their full allocations, and thus they waste the slots acquired early.

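The paper divides computing frameworks into two types. An elastic framework, such as Hadoop, can start running tasks as soon as it obtains part of the resources it needs, and releases a task's resources as soon as that task finishes. A rigid framework, such as MPI, can start its job only after it has acquired all the resources it needs, and can release them only after all of its tasks have finished. Note that a slot in Mesos is not the same concept as a slot in Hadoop 1.0: as I read it, a Mesos slot denotes a group of machines rather than a quantity of CPU or memory. In addition, the paper classifies the resources a framework asks for as mandatory or preferred. A mandatory resource is one the framework must have in order to run; for example, if a job needs a GPU, that requirement is mandatory and Mesos has to provide a GPU. A preferred resource makes the job run faster, but it is not required and an equivalent resource can be used instead.
Framework ramp-up time: the time it takes a new framework to reach its allocation.
Job completion time: the time it takes a job to complete.
System utilization: because an elastic framework can run tasks as soon as it gets any resources (part or all of what it needs), whereas a rigid framework can start only after acquiring everything it needs, an elastic framework can reach 100% utilization of its slots. A rigid framework has an idle period in which it only waits for resources and runs nothing, so the resources acquired early are wasted during that period and utilization drops. In my view, the longer this idle period, the lower the utilization of a rigid framework.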
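The utilization gap can be made concrete with a toy model of my own, which is not the analysis in the paper. Assume a framework needs `k` slots, acquires exactly one slot per time unit, and every slot then does `task_time` units of useful work. An elastic framework starts working on each slot at once; a rigid one leaves slot `i` idle for `k - i` time units until the last slot arrives.

```python
def rigid_utilization(k: int, task_time: int) -> float:
    """Fraction of held slot-time that does useful work for a rigid framework.

    Toy model: slot i (1-based) is acquired at time i and sits idle until
    time k, when the last slot arrives and the whole job starts.
    """
    idle = sum(k - i for i in range(1, k + 1))   # equals k * (k - 1) / 2
    busy = k * task_time
    return busy / (busy + idle)


def elastic_utilization(k: int, task_time: int) -> float:
    """An elastic framework uses every slot immediately, so nothing is idle."""
    return 1.0


print(elastic_utilization(4, 10))            # 1.0
print(round(rigid_utilization(4, 10), 3))    # 0.87
print(rigid_utilization(8, 10) < rigid_utilization(4, 10))  # True
```

In this model the idle slot-time grows as k(k-1)/2, so a longer ramp-up (larger `k`) pushes rigid utilization further down, which is consistent with the observation above.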

4.3 Heterogeneous Tasks
In this section, we discuss heterogeneous tasks, in particular, tasks that are either short and long, where the mean duration of the long tasks is significantly longer than the mean of the short tasks.

==How should resources be divided when long tasks and short tasks are mixed in the same cluster?==

The master can alleviate this by implementing an allocation policy that limits the number of slots on each node that can be used by long tasks, e.g., no more than 50% of the slots on each node can run long tasks.

Note that the master does not need to know whether a task is short or long. By simply using different timeouts to revoke short and long slots, the master incentivizes the frameworks to run long tasks on long slots only. Otherwise, if a framework runs a long task on a short slot, its performance may suffer, as the slot will be most likely revoked before the task finishes.

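Mesos limits the number of slots on each node that may run long tasks, i.e. some slots must be kept for short tasks. This prevents long tasks from occupying every slot and starving short tasks, which would otherwise wait too long to run. The master does not need to know how long a task will run, though. It enforces the limit in a simple way: it revokes short slots and long slots after different timeouts, which encourages frameworks to place long tasks only on slots with a long timeout.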
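A minimal sketch of the timeout idea follows. The helper names and the concrete timeouts (60 s and 3600 s) are assumptions picked for illustration; the paper only gives the 50% example and leaves the values to the allocation policy.

```python
def make_slot_timeouts(slots_per_node: int, long_fraction: float,
                       short_timeout: int, long_timeout: int) -> list[int]:
    """Return the revocation timeout (seconds) of every slot on one node.

    At most `long_fraction` of the slots get the long timeout; the rest
    are short slots that are revoked much sooner.
    """
    n_long = int(slots_per_node * long_fraction)
    n_short = slots_per_node - n_long
    return [long_timeout] * n_long + [short_timeout] * n_short


def finishes_before_revocation(task_seconds: int, slot_timeout: int) -> bool:
    """True if a task of this length completes before its slot is revoked."""
    return task_seconds <= slot_timeout


timeouts = make_slot_timeouts(8, 0.5, short_timeout=60, long_timeout=3600)
print(timeouts.count(3600), timeouts.count(60))   # 4 4
print(finishes_before_revocation(600, 60))        # False
print(finishes_before_revocation(600, 3600))      # True
```

The master only sets timeouts; a framework that knows its own task lengths has every reason to put the 600-second task on a long slot.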

4.4 Framework Incentives

Short tasks
No minimum allocation
Scale down
Do not accept unknown resources.
Frameworks are incentivized not to accept resources that they cannot use because most allocation policies will account for all the resources that a framework owns when deciding which framework to offer resources to next.
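Do not accept resources the framework cannot use, because Mesos takes all the resources each framework holds into account when deciding which framework to offer resources to next.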
Fragmentation
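In a cluster full of tasks with small resource requirements, a framework that suddenly arrives with very large resource requirements may never get what it needs. Whenever a task finishes and frees its resources, the freed amount is too small for this framework, and it is immediately taken by other short tasks with small requirements, ==so the framework starves. How can this starvation be solved?==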
when a cluster is filled by tasks with small resource requirements, a framework f with large resource requirements may starve, because whenever a small task finishes, f cannot accept the resources freed up by it, but other frameworks can.

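To accommodate frameworks with large per-task resource requirements, allocation modules can support a minimum offer size on each slave, and abstain from offering resources on that slave until this minimum amount is free.
The solution is to set a minimum offer size on each slave for the benefit of frameworks with large per-task requirements. As long as the free resources on a slave are below the minimum offer size, Mesos does not offer them to any framework; it waits until at least that amount is free on the slave and only then makes an offer. Small freed fragments are thus accumulated instead of being handed out immediately, which addresses the starvation problem described above.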
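The minimum offer size amounts to a simple threshold check on each slave. In this sketch `offerable` is a hypothetical helper and resources are reduced to a single CPU count per slave for brevity.

```python
def offerable(free_cpus: dict[str, int], min_offer: int) -> dict[str, int]:
    """Return only the slaves whose free CPUs reach the minimum offer size.

    Slaves below the threshold are held back so that their free resources
    can accumulate instead of being handed out in small fragments.
    """
    return {slave: free for slave, free in free_cpus.items()
            if free >= min_offer}


free_cpus = {"s1": 1, "s2": 4, "s3": 2}
print(offerable(free_cpus, min_offer=4))  # {'s2': 4}

# Later another small task finishes on s3 and frees two more CPUs.
free_cpus["s3"] += 2
print(offerable(free_cpus, min_offer=4))  # {'s2': 4, 's3': 4}
```

The trade-off is that the held-back fragments on `s1` and `s3` stay unused for a while, so a larger minimum offer size helps big tasks at some cost in utilization.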

5 Implementation
We have implemented Mesos in about 10,000 lines of C++.

6 Evaluation

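Dedicated cluster: a cluster (or partition) set aside for running a single framework.

Overall, Hadoop and Spark ran faster on Mesos to varying degrees. For Torque and MPI, however, Mesos did not improve speed and was in fact slightly slower.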

6.5 Mesos Scalability

To evaluate Mesos scalability, we emulated large clusters by running up to 50,000 slave daemons on 99 Amazon EC2 nodes, each with 8 CPU cores and 6 GB RAM. We used one EC2 node for the master and the rest of the nodes to run slaves.

8 Conclusion
Mesos is built around two design elements: a fine-grained resource sharing model at the level of tasks within a job, and a decentralized scheduling mechanism called resource offers that lets applications choose which resources to use. Together, these elements let Mesos achieve high utilization, respond rapidly to workload changes, and cater to frameworks with diverse needs, while remaining simple and scalable. We have shown that existing frameworks can effectively share resources with Mesos, that new specialized frameworks, such as Spark, can provide major performance gains, and that Mesos’s simple architecture allows the system to be fault tolerant and to scale to 50,000 (emulated) nodes.
