What Is High Availability for Message Queues?

What Is High Availability?

High availability is a term used in computing to describe a system architecture or service that sustains an adequate level of availability.

Availability is now a key part of infrastructure. The unavailability of IT services is estimated to cost companies millions [citation needed], particularly in industry, where the shutdown of a production line can be devastating.

Two complementary approaches are used to improve availability:

Establishing a dedicated physical infrastructure, generally based on hardware redundancy. The result is a high-availability cluster (as opposed to a computing cluster): a cluster of computers whose goal is to provide a service while avoiding downtime.
Establishing appropriate processes to reduce errors and to accelerate recovery when an error occurs. ITIL contains many such processes.

Availability is usually expressed as a percentage made up mostly of nines. Each figure below is the maximum downtime allowed per year:

99%: the service is unavailable for at most about 3.65 days per year
99.9%: at most about 8.76 hours per year
99.99%: at most about 52.6 minutes per year
99.999%: at most about 5.26 minutes per year
99.9999%: at most about 31.5 seconds per year
99.99999%: at most about 3.2 seconds per year, and so on.
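
The figures in this list follow directly from the length of a year. As a minimal sketch, assuming a 365-day year (the function names are illustrative and not part of the original article), the allowed downtime can be computed like this:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def max_downtime_seconds(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in seconds, for a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def humanize(seconds: float) -> str:
    """Format a duration using the largest unit that fits."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.2f} seconds"


if __name__ == "__main__":
    for pct in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
        print(f"{pct}% -> {humanize(max_downtime_seconds(pct))} per year")
```

Each additional nine divides the allowed downtime by ten, which is why every extra nine tends to be much harder to reach than the previous one.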

High availability is often wrongly conflated with disaster recovery. They are two different, complementary tasks that together achieve continuous availability.

Techniques for improving availability

Many techniques are used to improve availability:

Redundant hardware and clustering;
Data security: RAID, snapshots, Oracle Data Guard, BCV (Business Copy Volume), SRDF (Symmetrix Remote Data Facility), DRBD;
The ability to reconfigure the server "hot" (that is, while it is running);
A degraded ("limp") mode or a panic mode;
A contingency plan;
Secured backups: off-site storage, centralization at a third-party site.

High availability also requires suitable premises: power supply, underfloor air conditioning with particulate filtering, a maintenance service, a security service, and protection against theft and malicious acts. The risks of fire and water damage must be considered as well. Power and communication cables must be redundant and buried. They should not be exposed in the building's underground garage, which is too often seen in buildings in Paris. These criteria are the first to consider when choosing a hosting provider (when renting high-availability premises).

For each level of the architecture, each component, and each connection between components, the following must be established:

How is a failure detected? Examples: TCP health checks implemented by an Alteon appliance, a test program invoked periodically (heartbeat), a "diagnostic" interface on the components, etc.
How is the component secured, made redundant, or backed up? Examples: backup server, system cluster, WebSphere clustering, RAID storage, backups, dual SAN attachment, degraded mode, unused spare hardware ready to be installed.
How should the switch to backup or degraded mode be triggered? Manually after analysis? Automatically?
How is the backup system guaranteed to start from a stable, known state? Examples: start from a copy of the database and reapply the archive logs, restart batches from a known state, use two-phase commit for transactions that update multiple data stores, etc.
How does the application restart on the backup system? Examples: restarting the application, restarting interrupted batches, activating a degraded mode, takeover of the failed server's IP address by the backup server, etc.
How are in-flight transactions or sessions resumed? Examples: session persistence on the application server, a mechanism for answering a client about a transaction that completed before the failure but for which the client never received a response, etc.
How does the system return to the nominal situation?
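
The first question in this list, failure detection, can be illustrated with a periodic TCP health check. This is only a sketch: it is not how an Alteon appliance works internally, and the function names and the threshold of three consecutive failures are assumptions chosen for illustration.

```python
import socket


def tcp_health_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def is_failed(results: list[bool], threshold: int = 3) -> bool:
    """Declare a failure only after `threshold` consecutive failed checks.

    `results` is the history of heartbeat outcomes, oldest first.
    Requiring several failures in a row avoids switching to the backup
    because of a single lost probe (the threshold is an assumption).
    """
    return len(results) >= threshold and not any(results[-threshold:])
```

A scheduler would call `tcp_health_check` at a fixed interval, append each result to the history, and trigger the manual or automatic switch discussed above once `is_failed` returns True.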

Examples:

If a degraded mode lets transactions be stored in a file while a database is down, how are those transactions re-applied when the database becomes active again?
If a failed component has been deactivated, how is it reintroduced into active service (e.g., resynchronizing data, retesting the component, etc.)?
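
The first example, storing transactions in a file while the database is down and re-applying them afterwards, might look like the sketch below. The `JournalingStore` class, the JSON-lines journal format, and the use of a plain dictionary as a stand-in for the database are all assumptions made for illustration.

```python
import json
from pathlib import Path


class JournalingStore:
    """Write to the database when it is up, otherwise append to a journal file."""

    def __init__(self, journal_path, database):
        self.journal = Path(journal_path)
        self.database = database  # hypothetical stand-in: any dict-like object
        self.database_up = True

    def write(self, key, value):
        if self.database_up:
            self.database[key] = value
            return
        # Degraded mode: keep the transaction in arrival order for later replay.
        with self.journal.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")

    def recover(self) -> int:
        """Replay journaled transactions in order, then return to nominal mode."""
        replayed = 0
        if self.journal.exists():
            with self.journal.open(encoding="utf-8") as f:
                for line in f:
                    entry = json.loads(line)
                    self.database[entry["key"]] = entry["value"]
                    replayed += 1
            self.journal.unlink()
        self.database_up = True
        return replayed
```

Replaying in arrival order matters: if the same key was written twice during the outage, the last write must win, exactly as it would have if the database had stayed up.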

Dependency on other applications

For an application that calls other applications synchronously through middleware (HTTP web service, Tuxedo, CORBA, EJB), its availability is strongly tied to the availability of the applications it depends on. The sensitivity (criticality) of those applications must therefore be equal to or greater than that of the application itself.

Otherwise, consider:

Using asynchronous middleware: MQ Series, JMS, SonicMQ, CFT
Implementing a degraded mode for when an application we depend on is failing.

For this reason, asynchronous middleware is preferred whenever possible, since it favors good availability.
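
To see why synchronous dependencies weigh so heavily, consider a rough model in which failures are independent (an assumption that the original text does not state). Under that assumption, the availability of an application that needs all of its synchronous dependencies is the product of the individual availabilities:

```python
from math import prod


def chained_availability(own: float, dependencies: list[float]) -> float:
    """Availability of an application that needs every synchronous dependency.

    All values are fractions between 0 and 1. Assumes independent failures.
    """
    return own * prod(dependencies)


if __name__ == "__main__":
    # A 99.9% application calling two 99% services synchronously.
    print(f"{chained_availability(0.999, [0.99, 0.99]):.4f}")
```

In this model the result can never exceed the weakest dependency, which is the reasoning behind requiring dependencies to be at least as sensitive as the application, or decoupling them with asynchronous middleware.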

Load distribution and sensitivity

Sensitivity is often handled with redundant elements behind a load-balancing mechanism (for example, a WebSphere cluster with Alteon load balancing). For such a system to offer a real gain in reliability, check that if one element fails, the remaining elements have enough capacity to deliver the service.

In other words, with two active load-balanced servers, a single server must be able to carry the entire load. With three servers, each server must be able to carry 50% of the load (assuming the probability of two servers failing at the same time is negligible). Good reliability does not require many servers backing each other up. For example, an element that is 99% reliable, made redundant once, gives a reliability of 99.99% (the probability that both elements fail at the same time is 1/100 × 1/100 = 1/10,000).
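
Both rules of thumb from this paragraph can be written down as small formulas. The sketch below assumes independent failures and identical servers; the function names are illustrative.

```python
def required_capacity_share(servers: int, tolerated_failures: int = 1) -> float:
    """Fraction of the total load each server must be able to carry.

    After `tolerated_failures` servers are lost, the survivors share the load.
    """
    survivors = servers - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one server must survive")
    return 1 / survivors


def redundant_reliability(element_reliability: float, copies: int) -> float:
    """Reliability of `copies` identical elements, any one of which suffices.

    The system fails only if every copy fails at the same time.
    """
    return 1 - (1 - element_reliability) ** copies
```

With these definitions, two servers give a required share of 1.0 (the whole load), three servers give 0.5, and two 99% elements give a combined reliability of 0.9999.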

Differential redundancy

Redundancy is usually achieved with multiple identical components. To be effective, this assumes that the failure of one component is random and independent of the failure of the other components. This is the case for hardware failures, for example.

This is not true of all failures. For example, an operating system fault or a software component malfunction can occur on all components at once when the triggering conditions are met. For this reason, when the application is extremely sensitive, consider making elements redundant using components of different kinds that perform the same functions. This can lead to:

Choosing different kinds of servers, with different operating systems and different infrastructure software products;
Developing the same component twice, each time honoring the contract that applies to the component's interface.

Redundancy with a voting system

In this mode, several components process the same inputs and therefore (in principle) produce the same outputs.

The outputs of all components are collected, and an algorithm is applied to produce the final result. The algorithm can be simple (majority vote) or more elaborate (mean, weighted mean, median, etc.). The aim is to eliminate erroneous results caused by a malfunction of one of the components, and/or to make the result more reliable by combining several slightly different results.

This process:

Does not provide load balancing
Introduces the problem of the reliability of the component that manages the voting algorithm

This method is commonly used in the following cases:

Sensor-based systems (e.g., temperature sensors) in which the sensors are redundant
Systems in which several different components perform the same function (see Differential redundancy) and for which a better result can be obtained by combining the components' outputs (e.g., a pattern recognition system that uses several algorithms to achieve a better recognition rate).
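
Two of the voting algorithms mentioned above, majority and median, can be sketched in a few lines. The function names are illustrative, and requiring a strict majority is a design assumption rather than something the original text prescribes.

```python
from collections import Counter
from statistics import median


def majority_vote(outputs):
    """Return the value produced by a strict majority of components."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority among the outputs")
    return value


def median_vote(readings):
    """Combine numeric readings so that a single outlier cannot win."""
    return median(readings)
```

For three redundant temperature sensors, the median discards one wildly wrong reading without having to identify which sensor is faulty, whereas the mean would be pulled toward the faulty value.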

Shadow operations

When a redundant component has malfunctioned and been repaired, we may want to return it to active service and verify that it works correctly, but without its results being used. In this case, the inputs are processed by one or more components known to be reliable, and these produce the results used by the rest of the system. The same inputs are also processed by the reintroduced component, which is said to be in "shadow" mode. Its correct operation can be verified by comparing its results with those produced by the trusted components. This method is often used in voting-based systems, since it is enough to exclude the component in "shadow" mode from the final vote.
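
Shadow mode as described here can be sketched as a wrapper that feeds every input to both components but only returns the trusted results. The function name and the shape of the mismatch records are assumptions for illustration.

```python
def process_with_shadow(inputs, trusted, shadow):
    """Run `shadow` alongside `trusted` without ever using its output.

    Returns (results, mismatches): `results` come only from the trusted
    component, `mismatches` lists (input, expected, observed) triples.
    """
    results, mismatches = [], []
    for item in inputs:
        expected = trusted(item)
        results.append(expected)  # only the trusted output goes downstream
        try:
            observed = shadow(item)
        except Exception as exc:
            # A crash in the shadow component must not affect the service.
            mismatches.append((item, expected, repr(exc)))
            continue
        if observed != expected:
            mismatches.append((item, expected, observed))
    return results, mismatches
```

An empty mismatch list over a representative period is the evidence needed to promote the repaired component back into the active set.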

Processes that improve availability

These processes play two distinct roles.

Processes that reduce the number of failures

Since prevention is better than cure, putting control processes in place that reduce the number of incidents on the system improves availability. Two processes can play this role:

Change management: 60% of errors are related to a recent change. A formalized process, accompanied by adequate tests carried out in a proper pre-production environment, can eliminate many incidents.
Proactive error management: incidents can often be detected before they occur, for example when response times increase. A process dedicated to this task and equipped with adequate tools (measurement, reporting, etc.) can intervene even before the incident happens.

By implementing these processes, many incidents can be avoided.
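
As an illustration of proactive error management, the sketch below raises a warning when the average response time over a sliding window drifts well above a known baseline. The class name, the window size, and the factor of 2 are assumptions; real monitoring tools offer far richer rules.

```python
from collections import deque
from statistics import mean


class ResponseTimeMonitor:
    """Warn when recent response times drift above a baseline."""

    def __init__(self, baseline_ms: float, window: int = 5, factor: float = 2.0):
        self.baseline_ms = baseline_ms
        self.factor = factor
        self.samples = deque(maxlen=window)  # keeps only the latest samples

    def record(self, response_ms: float) -> bool:
        """Store a measurement and return True if a warning should be raised."""
        self.samples.append(response_ms)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and mean(self.samples) > self.baseline_ms * self.factor
```

Averaging over a window rather than reacting to a single slow request keeps the warning focused on sustained degradation, which is the kind of signal that tends to precede an incident.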

Processes that reduce the duration of outages

Since a failure always ends up happening sooner or later, the recovery process is essential for restoring service as quickly as possible. This process must have one goal: letting the user use the service again as quickly as possible. A definitive repair should therefore be avoided at this stage because it takes much longer; the process should instead put a workaround in place.

High-availability clusters

A high-availability cluster (as opposed to a computing cluster) is a cluster of computers whose goal is to provide a service while avoiding downtime.

Source: Wikipedia, the free encyclopedia. The text is available under the Creative Commons license.

Translated from: https://www.eukhost.com/blog/webhosting/what-is-high-availability/
