文章目录

  • ?
  • resource
  • Summary
    • Consistent:
    • Consensus
  • Consistency Guaranteess
  • Linearizability:
    • Implementing Linearizable System
    • The Cost of Linearizability
  • Ordering Guarantees
    • Ordering and Causality
      • The causal order is not a total order
      • Linearizability is stronger than causal consistency
      • Capturing causal dependencies
    • Sequence Number Ordering
      • Lamport timestamps: 保证 causality
    • Total Order Broadcast
  • Distributed Transactions and Consensus
    • Atomic Commit and Two-Phase Commit(2PC)
      • From single-node to distributed atomic commit
      • [Introduction to two-phase cmmit](https://www.w3cschool.cn/architectroad/architectroad-two-phase-commit.html)
      • Coordinator failure
    • Distributed Transactions in Practice
      • Exactly-once message processing
      • XA transaction
      • Holding locks while in doubt
      • Recovering from coordinator failure
      • Limitations of distributed transactions
    • Fault-Tolerant Consensus
      • Consensus algorithm and total order broadcast
        • 如何选举 leader:
      • Limitations of consensus
    • Membership and Coordination Services

?

  • It turns out that there are deep connections between ordering, linearizability, and consensus. 阐明它们之间的关系?
  • 相比于单机事务,分布式事务有何不同?如何实现?
  • 了解 google spanner(事务) 和 zk(consensus) 的实现
  • CAP 定理的理解

resource

http://www.bailis.org/blog/linearizability-versus-serializability

In this chapter, we will talk about some examples of algorithms and protocols for building fault-tolerant(可用性?) distributed systems. 本章讲什么

The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees.

Summary

Consistent:

Linearizability: 提供 stronger consistency, make replicated data appear as though there were only a single copy, and to make all operations act on it atomically

  • 优点: 易于理解
  • 缺点: 对网络问题敏感,性能慢

Causality: which puts all operations in a single, totally orderedtimeline, causality provides us with a weaker consistency model

  • 优点: 对网络问题不敏感
  • 使用场景受限

Consensus

Consensus algorithms are a huge breakthrough for distributed systems: they bring concrete safety properties (agreement, integrity, and validity) to systems where everything else is uncertain, and they nevertheless remain fault-tolerant (able to make progress as long as a majority of nodes are working and reachable). They provide total order broadcast, and therefore they can also implement linearizable atomic operations in a fault-tolerant way.

定义: Deciding something in such a way that all nodes agree on what was decided, and such that the decision is irrevocable.

使用场景:

  • Linearizable compare-and-set registers
  • Atomic transaction commit
  • Total order broadcast
  • Locks and leases
  • Membership/coordination service
  • Uniqueness constraint

Zookeeper:

  • providing an “outsourced” consensus
  • failure detection
  • membership service

Consistency Guaranteess

分布式 db 中,由于网络等因素,数据不一致一定会发生,因此 Most replicated databases provide at least eventual consistency

eventual consistency: 所有 replicas 的数据最终会达到一致。

  1. this is a very weak guarantee—it doesn’t say anything about when the replicas will converge
  2. 难以使用和测试: you need to be constantly aware of its limitations and not accidentally assume too much

stronger consistency: 所有 replicas 数据总是保持一致

  • 缺点: worse performance, less fault-tolerant
  • 优势: easier to use correctly

distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.

Linearizability:

3 个特质:

  • Recency gurantee
  • single operations on singel object
  • time dependency and always move forward in time

定义:

Linearizability(atomic consistency) is a guarantee about single operations on single objects. It provides a real-time (i.e., wall-clock) guarantee on the behavior of a set of single operations (often reads and writes) on a single object。

Linearizability is a recency guarantee(once a new value has been written or read, all subsequent reads see the value that was written, until it is overwritten again) on reads and writes of a register (an individual object). It doesn’t group operations together into transactions.

Vs Serializability:

linearizability can be viewed as a special case of strict serializability where transactions are restricted to consist of a single operation applied to a single object.

[http://www.bailis.org/blog/linearizability-versus-serializability/]

使用场景:

  • Locking and leader election: They use consensus algorithms to implement linearizable operations in a fault-tolerant way
  • Constraints and uniqueness guarantees
  • Cross-channel timing dependencies

Implementing Linearizable System

The most common approach to making a system fault-tolerant is to use replication? 怎么说

  • Single-leader replication(potentially linearizable): If you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable.

  • Consensus algorithms(linearizable): consensus protocols contain measures to prevent split brain and stale replicas.

  • Multi-leader replication(not linearizable): because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes.

  • Leaderless replication(probably not linearizable): sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes (w + r > n)

简单的使用 quorums,即使满足 w + r > n, 也不一定是 linearizable, 如:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-PpyVwbKp-1579403595685)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p45)]

The Cost of Linearizability

The CAP theorem:
一般来说,分区容错无法避免,因此可以认为 CAP 的 P 总是成立。CAP 定理告诉我们,剩下的 C 和 A 无法同时做到。

CAP 的定义比较局限,The CAP theorem as formally defined is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault (network partitions,vi or nodes that are alive but disconnected from each other). It doesn’t say anything about network delays, dead nodes, or other trade-offs.

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-XbD8HtS6-1579403595686)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p46)]

Linearizability and network delays:
A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster, so this trade-off is important for latency-sensitive systems.

Ordering Guarantees

It turns out that there are deep connections between ordering, linearizability, and consensus.

Ordering and Causality

Ordering helps preserve causality:

  • causal dependency: 比如事件的依赖顺序
  • happened before relationship
  • Cross-channel timing dependencies

Causality imposes an ordering on events,These chains of causally dependent operations define the causal order in the system—i.e., what happened before what.

The causal order is not a total order

Causality: 在 Casual 中没有依赖的操作可以并发执行,因此这些操作无法比较,所以 Casual 是 partial order(偏序关系).

Linearizability: In a linearizable system, we have a total order of operations. Therefore, according to this definition, there are no concurrent operations in a linearizable datastore.

Linearizability is stronger than causal consistency

Linearizability implies causality is what makes linearizable systems simple to understand and appealing.

Causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures.

Capturing causal dependencies

In order to maintain causality, you need to know which operation happened before which other operation.

In order to determine causal dependencies, we need some way of describing the “knowledge” of a node in the system.

Causal consistency goes further: it needs to track causal dependencies across the entire database, not just for a single key

In order to determine the causal ordering, the database needs to know which version of the data was read by the application

Sequence Number Ordering

使用 logical clock 给 event 编号,we can use sequence numbers or timestamps to order events, and they provide a total order.

We can create equence numbers in a total order that is consistent with causality,先发生的 event 的 number 更小.

Lamport timestamps: 保证 causality

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-FVTv2jFg-1579403595686)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p50)]

Each node has a unique identifier,and each node keeps a counter of the number of operations it has processed. The Lamport timestamp is then simply a pair of (counter, node ID). It provides total ordering.

原理解释:

  • [https://jameshfisher.com/2017/02/12/what-are-lamport-timestamps/]
  • [https://www.cnblogs.com/bangerlee/p/5448766.html]

Total Order Broadcast

This idea of knowing when your total order is finalized is captured in the topic of total order broadcast.

Total order broadcast is usually described as a protocol for exchanging messages between nodes, Two safety properties always be satisfied:

  1. Reliable delivery: msg 需要被 delivered 到所有 node
  2. Messages are delivered to every node in the same order.

This is no coincidence: it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus. That is, if you can solve one of these problems, you can transform it into a solution for the others. This is quite a profound and surprising insight!

Distributed Transactions and Consensus

Atomic commit,即保证分布式事务的原子性,需要依赖 consensus algo,2PC is a kind of consensus algorithm, which solving atomic commit.

Atomic Commit and Two-Phase Commit(2PC)

Atomicity prevents failed transactions from littering the database with half-finished results and half-updated state.

From single-node to distributed atomic commit

存储硬件层面保证: Thus, it is a single device (the controller of one particular disk drive, attached to one particular node) that makes the commit atomic.

节点层面保证: A node must only commit once it is certain that all other nodes in the transaction are also going to commitA transaction commit must be irrevocable。

应用层面保证: However, from the database’s point of view this is a separate transaction, and thus any cross-transaction correctness requirements are the application’s problem

Introduction to two-phase cmmit

Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes—i.e., to ensure that either all nodes commit or all nodes abort.

在分布式系统里,每个节点都可以知晓自己操作的成功或者失败,却无法知道其他节点操作的成功或失败。当一个事务跨多个节点时,为了保持事务的原子性与一致性,需要引入一个协调(Coordinator)来统一掌控所有参与者(Participant)的操作结果,并指示它们是否要把操作结果进行真正的提交或者回滚.

2PC顾名思义分为两个阶段,其实施思路可概括为:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-A9oElRSG-1579403595686)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p47)]

Much of the performance cost inherent in two-phase commit is due to the additional disk forcing (fsync) that is required for crash recovery [88], and the additional network round-trips.

Coordinator failure

2PC can become stuck waiting for the coordinator to recover.

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-OLcLfF8h-1579403595687)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p48)]

The only way 2PC can complete is by waiting for the coordinator to recover. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants:when the coordinator recovers, it determines the status of all in-doubt transactions by reading its transaction log. Any transactions that don’t have a commit record in the coordinator’s log are aborted. Thus, the commit point of 2PC comes down to a regular single-node atomic commit on the coordinator.

Distributed Transactions in Practice

Two types of distributed transactions are often conflated:

  • Database-internal distributed transactions: work well as usual
  • Heterogeneous distributed transactions: more challenge

Exactly-once message processing

Thus, by atomically committing the message and the side effects of its processing, we can ensure that the message is effectively processed exactly once, even if it required a few retries before it succeeded.

Such a distributed transaction is only possible if all systems affected by the transaction are able to use the same atomic commit protocol.

XA transaction

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-IkjhNBwl-1579403595687)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p49)]

The transaction coordinator implements the XA API.

Holding locks while in doubt

The database cannot release those locks until the transaction commits or aborts, Therefore, when using two-phase commit, a transaction must hold onto the locks throughout the time it is in doubt.

This can cause large parts of your application to become unavailable until the in-doubt transaction is resolved.

Recovering from coordinator failure

Orphaned in-doubt transactions(如transaction log lost) cannot be resolved automatically, so they sit forever in the database, holding locks and blocking other transactions,只能通过管理员手动解决.

Many XA implementations have an emergency escape hatch called heuristic decisions: allowing a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive decision from the coordinator.

Limitations of distributed transactions

对于 XA transactions, the key realization is that the transaction coordinator is itself a kind of database (in which transaction outcomes are stored), and so it needs to be approached with the same care as any other important database.

Fault-Tolerant Consensus

思想: Everyone decides on the same outcome, and once you have decided, you cannot change your mind, a consensus algorithm must satisfy the following properties:

  • Uniform agreement: No two nodes decide differently
  • Integrity: No node decides twice.
  • Validity
  • Termination: the idea of fault tolerance: 当有 node crash,也可以达成决策

Fault Tolerance:

  • 为保证 termination,需要假设一旦 node crash, it suddenly disappears and never comes back, 以避免无限等待 node recover.
  • 为保证 termination, 可以使用 quorum 来允许部分 node crash.
  • 系统一定保证 safety properties,所以 Termination 不满足(如大量 node crash)也不会使系统做出错误的决策。

Consensus algorithm and total order broadcast

Consensu algorithm: vsr, paxos, raft, zab…

这些算法不直接使用上述 Consensus 模型,而是 they decide on a sequence of values, which makes them total order broadcast algorithms.

So, total order broadcast is equivalent to repeated rounds of consensus (each consensus decision corresponding to one message delivery).

例子: [https://www.cnblogs.com/j-well/p/7061091.html]

如何选举 leader:

Consensus algo 使用 leader,但是不保证 leader 唯一,所以需要解决选主问题:

当有多个 leader 时候,use a leader in some form or another, but they don’t guarantee that the leader is unique.

Node 为了确定自己是 leader,每次操作前需要举行投票确认自己的身份,因此 we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal.

Consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected, ensuring that the safety properties are always met.

Limitations of consensus

  • 每次 proposal votes is a kind of synchronous replication. 影响性能
  • 因为要进行 majority votes,所以对机器数量有要求(如:the remaining two out of three form a majority),如果发生 split brain,部分机器就会变得不可用。
  • Most consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can’t just add or remove nodes in the cluster.
  • Consensus systems generally rely on timeouts to detect failed nodes. 在网络不好时候可能导致频繁的选主.
  • Sometimes, consensus algorithms are particularly sensitive to network problems.

Membership and Coordination Services

ZooKeeper 使用场景:

  • Linearizable atomic operations
  • total order broadcast
  • Failure detection
  • Change notifications

Part 2: CHAPTER 9 Consistency and Consensus相关推荐

  1. 一致性(Consistency)介绍

    一致性是分布式领域必须面对的主要问题之一.理想的一致性模型是:当一条更新出现时,所有的观察者都能"看到"这个更新.但是该模型在分布式领域很难实现,对应的理论基础是CAP理论.这里讨 ...

  2. Designing Data-Intensive Applications

    寻找翻译本书后续章节合作者  微信:18600166191 ----------------------------------- Designing Data-Intensive Applicati ...

  3. Raft当初为什么会被命名为Raft?

    点击上方"朱小厮的博客",选择"设为星标" 后台回复"加群"获取公众号专属群聊入口 欢迎跳转到本文的原文链接:https://honeypp ...

  4. 2020 存储技术热点与趋势总结

    作者 | 张凯,SmartX 联合创始人 & CTO. 2 年前我们发表了一遍文章 2018 存储技术热点与趋势总结,受到了不少朋友的关注.2 年过去了,存储行业也在不断的发生着变化.今天,我 ...

  5. 05 共识问题:区块链如何确认记账权?

    文章目录 05 共识问题:区块链如何确认记账权? 区块链的共识 拜占庭将军问题 POW 工作量证明 挖矿的由来 比特币的 POW 实现 区块链分叉和 51% 攻击 POW 机制优缺点 其他共识方法 P ...

  6. Master Plan for becoming a Software Architect : Roadmap of a Software Architect

    软件架构师成长之路: Master Plan for becoming a Software Architect Roadmap of a Software Architect 软件架构师在软件开发团 ...

  7. NoSQL 一致性[详解]更新一致性

    原文:https://cloud.tencent.com/developer/article/1083703 翻译内容: NoSQL Distilled 第五章 Consistency 作者简介: 本 ...

  8. 分布式技术原理与实战45讲--05 第05讲:共识问题:区块链如何确认记账权?

    本课时我们主要讲解"共识问题:区块链如何确认记账权?" 区块链可以说是最近几年最热的技术领域之一,区块链起源于中本聪的比特币,作为比特币的底层技术,本质上是一个去中心化的数据库,其 ...

  9. 分布式共识算法随笔 —— 从 Quorum 到 Paxos

    分布式共识算法随笔 -- 从 Quorum 到 Paxos 概览: 为什么需要共识算法? 昨夜西风凋碧树,独上高楼,望尽天涯路 复制(Replication) 是一种通过将同一份数据在复制在多个服务器 ...

最新文章

  1. 4、Python —— 函数
  2. 【BZOJ3997】[TJOI2015]组合数学 最长反链
  3. 地图按照自己规定路线进行移动
  4. VTK:可视化之SceneBounds
  5. Python进行数据分析—可视化之seaborn
  6. iview-admin框架运行步骤
  7. mysql optimizer mrr_[转] MySQL 的 MRR 到底是什么?
  8. paip.asp vbs的代码折叠代码结构查看
  9. 网站优化快速排名软件大全
  10. JFlash添加芯片
  11. Matlab 简单图像分割实战
  12. 想自学单片机,各位有什么书和板子值得推荐的吗?
  13. 学习笔记 Tianmao 篇 自定义 ToolBar
  14. 版本管理工具-Git
  15. 长期置顶:作为一个技术人,你为什么有时间写博客?准备应对未来的中年危机
  16. 速看!!带你揭秘3D建模行业内幕!!
  17. for循环实现-水仙花
  18. 网络安全——Linux基础
  19. pytorch基于yolo目标检测的智慧课堂系统
  20. 学完软件测试还能找到工作吗?

热门文章

  1. win10设置默认浏览器
  2. 单例模式(中):单例模式的弊端
  3. 操作系统实验(八):文件管理
  4. HTML开心餐厅网页制作,开心餐厅技巧经验总结
  5. 骁龙8gen1和苹果A15评测哪个好
  6. 国内公募REITs火了 5月31号3多亿亿明天开抢!怎么买?买哪只?10问10答解读来了
  7. 通用代码生成器,一键生成前端和后端
  8. 电流电压与欧姆定律与基尔霍夫定律
  9. ERD ONline 为企业数字化转型助力
  10. hdu 6217 BBP Formula