Contents

  • Abstract
  • 1 - Introduction
  • 2 - Replicated state machines
  • 3 - What’s wrong with Paxos?
  • 4 - Designing for understandability
  • 5 - The Raft consensus algorithm
    • 5.1 - Raft basics
    • 5.2 - Leader election
    • 5.3 - Log replication
    • 5.4 - Safety
      • 5.4.1 - Election restriction
      • 5.4.2 - Committing entries from previous terms
      • 5.4.3 - Safety argument
    • 5.5 - Follower and candidate crashes
    • 5.6 - Timing and availability
  • 11 - Conclusion

Abstract

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.

1 - Introduction

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. Paxos [15, 16] has dominated the discussion of consensus algorithms over the last decade: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus.

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.

After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works.

The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos.

Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov’s Viewstamped Replication [29, 22]), but it has several novel features:

  • Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.

  • Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.

  • Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.

We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms.

The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5–8), evaluates Raft (Section 9), and discusses related work (Section 10).

2 - Replicated state machines

Consensus algorithms typically arise in the context of replicated state machines [37]. In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems. For example, large-scale systems that have a single cluster leader, such as GFS [8], HDFS [38], and RAMCloud [33], typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [11].

Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.

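To make the idea concrete, here is a minimal, self-contained sketch in Go (illustrative, not from the paper): a deterministic key-value state machine replays a log of commands, and any replica that replays the same log necessarily reaches the same state.

```go
package main

import "fmt"

// Command is a toy state-machine command: set Key to Value.
type Command struct {
	Key, Value string
}

// StateMachine is deterministic: applying the same commands in the same
// order always produces the same state and the same outputs.
type StateMachine struct {
	state map[string]string
}

// Apply executes one command and returns its output.
func (sm *StateMachine) Apply(c Command) string {
	sm.state[c.Key] = c.Value
	return fmt.Sprintf("%s=%s", c.Key, c.Value)
}

func main() {
	// Every server stores a replica of the same log...
	log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}

	// ...so replaying it on any replica yields identical state.
	sm := &StateMachine{state: make(map[string]string)}
	for _, c := range log {
		fmt.Println(sm.Apply(c))
	}
}
```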

Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.

Consensus algorithms for practical systems typically have the following properties:

  • They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.

  • They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.

  • They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.

  • In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.

3 - What’s wrong with Paxos?

Over the last ten years, Leslie Lamport’s Paxos protocol [15] has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.

Unfortunately, Paxos has two significant drawbacks. The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [15] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [16, 20, 21]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year.

We hypothesize that Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-Paxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.

The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [26], [39], and [13], but these differ from each other and from Lamport’s sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.

Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.

As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is time-consuming and error-prone, and the difficulties of understanding Paxos exacerbate the problem. Paxos’ formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value. The following comment from the Chubby implementers is typical:

There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. . . . the final system will be based on an unproven protocol [4].

Because of these problems, we concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.

4 - Designing for understandability

We had several goals in designing Raft: it must provide a complete and practical foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.

There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications?

We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques that are generally applicable. The first technique is the well-known approach of problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.

Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion (“choose any; it doesn’t matter”). We used randomization to simplify the Raft leader election algorithm.

5 - The Raft consensus algorithm

Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.

Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.

Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:

  • Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).

  • Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).

  • Safety: the key safety property for Raft is the State Machine Safety Property in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism described in Section 5.2.

After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.

5.1 - Raft basics

A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.

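As a rough sketch of the per-server state this subsection implies, one might model a server as below. The Go names are illustrative assumptions, not prescribed by the paper; `votedFor` anticipates the voting rule of Section 5.2.

```go
// ServerState enumerates the three roles a Raft server can occupy.
type ServerState int

const (
	Follower ServerState = iota
	Candidate
	Leader
)

// server carries the minimal state used by the sketches that follow.
type server struct {
	state       ServerState
	currentTerm int // latest term this server has seen (see below)
	votedFor    int // candidate voted for in currentTerm; -1 if none
}
```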

Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.

Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Terms act as a logical clock [14] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.

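The term rules in this paragraph translate almost mechanically into code. A sketch, building on the illustrative `server` type above; `observeTerm` would be called with the term carried by every incoming RPC or RPC response:

```go
// observeTerm applies the term rules: a larger term updates ours and
// demotes a candidate or leader back to follower; it reports whether the
// incoming term is stale, in which case the caller rejects the request.
func (s *server) observeTerm(rpcTerm int) (stale bool) {
	if rpcTerm > s.currentTerm {
		s.currentTerm = rpcTerm
		s.state = Follower
		s.votedFor = -1 // new term: we have not voted in it yet
	}
	return rpcTerm < s.currentTerm
}
```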

Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Section 7 adds a third RPC for transferring snapshots between servers. Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.

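The two RPCs can be summarized by their argument structures; the fields below follow the condensed summary in the paper's Figure 2 (`PrevLogIndex`/`PrevLogTerm` are explained in Section 5.3, `LastLogIndex`/`LastLogTerm` in Section 5.4.1):

```go
// LogEntry is one slot in the replicated log (Section 5.3).
type LogEntry struct {
	Term    int    // term in which the entry was received by the leader
	Command string // state-machine command; opaque to Raft
}

// RequestVoteArgs is sent by candidates during elections (Section 5.2).
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int
	LastLogIndex int // index of candidate's last log entry
	LastLogTerm  int // term of candidate's last log entry
}

// AppendEntriesArgs is sent by leaders to replicate entries; with an
// empty Entries slice it serves as the heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int
	PrevLogIndex int        // index of the entry preceding the new ones
	PrevLogTerm  int        // term of that entry
	Entries      []LogEntry // empty for heartbeat
	LeaderCommit int        // leader's commit index (Section 5.3)
}
```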

5.2 - Leader election

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.

To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.

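Put together, the candidate transition reads roughly as below. This is a simplified sketch only: it builds on the illustrative `server` type from Section 5.1, assumes `id` and `peers` fields plus a `sendRequestVote` stub for the transport, and elides locking.

```go
// startElection: increment the term, become a candidate, vote for self,
// and solicit votes from every peer in parallel.
// Assumes s.id int, s.peers []int, and a stub
// sendRequestVote(peer, term int) bool -- all illustrative.
func (s *server) startElection() {
	s.currentTerm++
	s.state = Candidate
	s.votedFor = s.id
	term := s.currentTerm

	votes := make(chan bool, len(s.peers))
	for _, p := range s.peers {
		go func(peer int) {
			votes <- s.sendRequestVote(peer, term)
		}(p)
	}

	granted := 1 // our own vote
	for range s.peers {
		if <-votes {
			granted++
		}
		if granted > (len(s.peers)+1)/2 { // majority of the full cluster
			s.state = Leader // outcome (a): won the election
			return
		}
	}
	// Outcome (b) is handled by the AppendEntries handler when another
	// leader's heartbeat arrives; outcome (c) by the election timer.
}
```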

A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.

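A voter's side of this rule might look like the following sketch (the reply is simplified to a bool; `candidateLogUpToDate` is the Section 5.4.1 restriction, defined in the sketch given there):

```go
// handleRequestVote grants at most one vote per term, first-come-
// first-served, subject to the log check added in Section 5.4.1.
func (s *server) handleRequestVote(a RequestVoteArgs) (voteGranted bool) {
	if s.observeTerm(a.Term) {
		return false // stale term: reject outright (Section 5.1)
	}
	if s.votedFor != -1 && s.votedFor != a.CandidateID {
		return false // already voted for someone else this term
	}
	if !s.candidateLogUpToDate(a) {
		return false // Section 5.4.1 election restriction
	}
	s.votedFor = a.CandidateID
	return true
}
```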

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.

The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.

Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. Section 9.3 shows that this approach elects a leader rapidly.

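A sketch of the timeout choice, using the example 150–300ms range from this paragraph:

```go
import (
	"math/rand"
	"time"
)

// electionTimeout picks a fresh randomized timeout in [150ms, 300ms);
// each server re-draws this value whenever it resets its election timer.
func electionTimeout() time.Duration {
	const min, max = 150 * time.Millisecond, 300 * time.Millisecond
	return min + time.Duration(rand.Int63n(int64(max-min)))
}
```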

Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability (a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.

5.3 - Log replication

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.

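A common in-memory representation (an assumption here, mirroring Figure 6 rather than any layout the paper prescribes) keeps the 1-based indexing explicit with a sentinel entry at index 0:

```go
// raftLog stores entries with 1-based indexing: entries[0] is a sentinel
// with Term 0, so the entry at log index i lives at entries[i].
type raftLog struct {
	entries []LogEntry
}

func newRaftLog() *raftLog {
	return &raftLog{entries: []LogEntry{{Term: 0}}} // sentinel
}

// lastIndex returns the index of the last real entry (0 if the log is empty).
func (l *raftLog) lastIndex() int { return len(l.entries) - 1 }

// termAt returns the term of the entry at index i, or -1 if no such entry.
func (l *raftLog) termAt(i int) int {
	if i < 0 || i > l.lastIndex() {
		return -1
	}
	return l.entries[i].Term
}
```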

The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader’s log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).

We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system’s behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:

  • If two entries in different logs have the same index and term, then they store the same command.

  • If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.

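The follower's side of this check, sketched over the `raftLog` type above. Note how the sentinel at index 0 makes an empty log satisfy the base case, and how entries already present are left in place (which is also what makes the RPC idempotent, as Section 5.5 notes):

```go
// handleAppendEntries performs the consistency check, then resolves any
// conflict by truncating our suffix and appending the leader's entries.
func (l *raftLog) handleAppendEntries(a AppendEntriesArgs) bool {
	if l.termAt(a.PrevLogIndex) != a.PrevLogTerm {
		return false // consistency check failed: leader retries earlier
	}
	for i, e := range a.Entries {
		idx := a.PrevLogIndex + 1 + i
		if l.termAt(idx) != e.Term {
			// Conflict (or past our end): drop our suffix, take the leader's.
			l.entries = append(l.entries[:idx], a.Entries[i:]...)
			break
		}
		// Same index and term: entry already present, leave it untouched.
	}
	return true
}
```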

During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers’ logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe when coupled with one more restriction.

To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.

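The leader's retry loop from this paragraph, as a sketch (assuming, illustratively, `s.log *raftLog`, `s.nextIndex map[int]int`, and a `sendAppendEntries` stub that returns the follower's success flag):

```go
// replicateTo drives one follower's log toward the leader's: on each
// rejection, back nextIndex down by one and retry with an earlier
// PrevLogIndex until the consistency check passes.
func (s *server) replicateTo(peer int) {
	for {
		prev := s.nextIndex[peer] - 1
		ok := s.sendAppendEntries(peer, AppendEntriesArgs{
			Term:         s.currentTerm,
			LeaderID:     s.id,
			PrevLogIndex: prev,
			PrevLogTerm:  s.log.termAt(prev),
			Entries:      s.log.entries[prev+1:],
		})
		if ok {
			s.nextIndex[peer] = s.log.lastIndex() + 1 // follower caught up
			return
		}
		s.nextIndex[peer]-- // probe one entry further back
	}
}
```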

If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.

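One possible shape for this optimization (a sketch only; the paper does not pin down the reply format):

```go
// AppendEntriesReply, extended with the optional fast-backup fields.
type AppendEntriesReply struct {
	Success       bool
	ConflictTerm  int // term of the follower's conflicting entry
	ConflictIndex int // first index the follower stores for that term
}

// onReject lets the leader skip an entire conflicting term in one step,
// instead of decrementing nextIndex one entry at a time.
func (s *server) onReject(peer int, r AppendEntriesReply) {
	if !r.Success {
		s.nextIndex[peer] = r.ConflictIndex
	}
}
```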

With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).

This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.

5.4 - Safety

The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries, then it could be elected leader and overwrite these entries with new ones; as a result, different state machines might execute different command sequences.

This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). Given the election restriction, we then make the rules for commitment more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.

5.4.1 - Election restriction

In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication [22], a leader can be elected even if it doesn’t initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity. Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority (where “up-to-date” is defined precisely below), then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.

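This comparison is small enough to state directly in code, together with the check that the RequestVote handler sketch in Section 5.2 deferred to:

```go
// atLeastAsUpToDate reports whether log A (last entry termA at indexA)
// is at least as up-to-date as log B: the later last term wins, and
// equal last terms fall back to log length.
func atLeastAsUpToDate(termA, indexA, termB, indexB int) bool {
	if termA != termB {
		return termA > termB
	}
	return indexA >= indexB
}

// candidateLogUpToDate is the voter-side check used by handleRequestVote
// in the Section 5.2 sketch.
func (s *server) candidateLogUpToDate(a RequestVoteArgs) bool {
	last := s.log.lastIndex()
	return atLeastAsUpToDate(a.LastLogTerm, a.LastLogIndex, s.log.termAt(last), last)
}
```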

5.4.2 - Committing entries from previous terms

As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.

To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property. There are some situations where a leader could safely conclude that an older log entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for simplicity.

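A sketch of the leader's commit rule with this restriction (assuming, as in the earlier sketches, an illustrative `s.matchIndex map[int]int` recording each follower's highest replicated index, and an `s.commitIndex` field):

```go
// maybeAdvanceCommit finds the highest index replicated on a majority,
// but only commits it directly if the entry carries the current term;
// everything before it then commits indirectly (Log Matching Property).
func (s *server) maybeAdvanceCommit() {
	for n := s.log.lastIndex(); n > s.commitIndex; n-- {
		if s.log.termAt(n) != s.currentTerm {
			break // never commit old-term entries by counting replicas
		}
		count := 1 // the leader's own copy
		for _, m := range s.matchIndex {
			if m >= n {
				count++
			}
		}
		if count > (len(s.peers)+1)/2 {
			s.commitIndex = n
			return
		}
	}
}
```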

Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader rereplicates entries from prior “terms,” it must do so with its new “term number.” Raft’s approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs. In addition, new leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant log entries to renumber them before they can be committed).

5.4.3 - Safety argument

Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property holds (this argument is based on the safety proof; see Section 9.2). We assume that the Leader Completeness Property does not hold, then we prove a contradiction. Suppose the leader for term T (leaderT) commits a log entry from its term, but that log entry is not stored by the leader of some future term. Consider the smallest term U > T whose leader (leaderU) does not store the entry.

  1. The committed entry must have been absent from leaderU’s log at the time of its election (leaders never delete or overwrite entries).

  2. leader T replicated the entry on a majority of the cluster, and leader U received votes from a majority of the cluster. Thus, at least one server (“the voter”) both accepted the entry from leader T and voted for leader U, as shown in Figure 9. The voter is key to reaching a contradiction.

  3. The voter must have accepted the committed entry from leader T before voting for leader U; otherwise it would have rejected the AppendEntries request from leader T (its current term would have been higher than T).

  4. The voter still stored the entry when it voted for leader U, since every intervening leader contained the entry (by assumption), leaders never remove entries, and followers only remove entries if they conflict with the leader.

  5. The voter granted its vote to leader U, so leader U’s log must have been as up-to-date as the voter’s. This leads to one of two contradictions.

  6. First, if the voter and leader U shared the same last log term, then leader U’s log must have been at least as long as the voter’s, so its log contained every entry in the voter’s log. This is a contradiction, since the voter contained the committed entry and leader U was assumed not to.

  7. Otherwise, leader U’s last log term must have been larger than the voter’s. Moreover, it was larger than T, since the voter’s last log term was at least T (it contains the committed entry from term T). The earlier leader that created leader U’s last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leader U’s log must also contain the committed entry, which is a contradiction.

  8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.

  9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(d).

Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3, which states that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. At the time a server applies a log entry to its state machine, its log must be identical to the leader’s log up through that entry and the entry must be committed. Now consider the lowest term in which any server applies a given log index; the Leader Completeness Property guarantees that the leaders for all higher terms will store that same log entry, so servers that apply the index in later terms will apply the same value. Thus, the State Machine Safety Property holds.

Finally, Raft requires servers to apply entries in log index order. Combined with the State Machine Safety Property, this means that all servers will apply exactly the same set of log entries to their state machines, in the same order.

5.5 - Follower and candidate crashes

Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it ignores those entries in the new request.

5.6 - Timing and availability

One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.

Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:

broadcastTime ≪ electionTimeout ≪ MTBF

In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout; we would like this to represent only a small fraction of overall time.

The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft’s RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.

broadcastTimeMTBF 是底层系统的属性,而 electionTimeout 是我们必须选择的。Raft 的 RPC 通常要求接收方将信息持久化到日志中,因此 broadcastTime 可能从 0.5ms 到 20ms 不等,具体取决于存储技术。因此,electionTimeout 可设置在 10ms 到 500ms 之间。典型的服务器 MTBF 是几个月或更长时间,这很容易满足时间需求

11 - Conclusion

Algorithms are often designed with correctness, efficiency, and/or conciseness as the primary goals. Although these are all worthy goals, we believe that understandability is just as important. None of the other goals can be achieved until developers render the algorithm into a practical implementation, which will inevitably deviate from and expand upon the published form. Unless developers have a deep understanding of the algorithm and can create intuitions about it, it will be difficult for them to retain its desirable properties in their implementation.

In this paper we addressed the issue of distributed consensus, where a widely accepted but impenetrable algorithm, Paxos, has challenged students and developers for many years. We developed a new algorithm, Raft, which we have shown to be more understandable than Paxos. We also believe that Raft provides a better foundation for system building. Using understandability as the primary design goal changed the way we approached the design of Raft; as the design progressed we found ourselves reusing a few techniques repeatedly, such as decomposing the problem and simplifying the state space. These techniques not only improved the understandability of Raft but also made it easier to convince ourselves of its correctness.
