What Is High Availability for Message Queues?

What Is High Availability?

High availability is a term used in computing to describe a system architecture or service that offers an adequate level of availability.

Availability is now a key concern for any infrastructure. The unavailability of IT services is estimated to cost millions [citation needed], particularly in industry, where the shutdown of a production line can be devastating.

Two complementary approaches are used to improve availability:

  • Setting up a dedicated physical infrastructure, generally based on hardware redundancy. This yields a high-availability cluster (as opposed to a computing cluster): a group of computers whose goal is to provide a service while avoiding downtime.
  • Setting up appropriate processes to reduce errors and to speed up recovery when an error occurs. ITIL contains many such processes.

Availability is often expressed as a percentage made up mostly of nines:

  • 99% means that the service is unavailable for at most about 3.65 days per year
  • 99.9%, at most about 8.76 hours per year
  • 99.99%, at most about 52.6 minutes per year
  • 99.999%, at most about 5.26 minutes per year
  • 99.9999%, at most about 31.5 seconds per year
  • 99.99999%, at most about 3.15 seconds per year, etc.
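
The downtime allowed by each level of "nines" is plain arithmetic: the unavailable fraction multiplied by the length of a year. The sketch below recomputes it, assuming a 365-day year; the function names `max_downtime_seconds` and `format_duration` are illustrative and do not come from the original text.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds; leap years are ignored


def max_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return SECONDS_PER_YEAR * unavailable_fraction


def format_duration(seconds: float) -> str:
    """Render a duration using the largest convenient unit."""
    if seconds >= 86400:
        return f"{seconds / 86400:.2f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.2f} minutes"
    return f"{seconds:.2f} seconds"


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999, 99.9999, 99.99999):
        print(f"{pct}% -> {format_duration(max_downtime_seconds(pct))}")
```

Each additional nine divides the allowed downtime by ten, which is why the higher levels leave only seconds per year.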

High availability is often wrongly conflated with disaster recovery. These are two distinct, complementary activities that together achieve continuous availability.

Techniques for improving availability

Many techniques are used to improve availability:

  • Redundant hardware and clustering;
  • Data protection: RAID, snapshots, Oracle Data Guard, BCV (Business Continuance Volume), SRDF (Symmetrix Remote Data Facility), DRBD;
  • The ability to reconfigure a server “hot” (that is, while it is running);
  • A degraded (“limp”) mode or a panic mode;
  • A contingency plan;
  • Secure backups: offsite storage, centralization at a third-party site.

High availability also requires suitable premises: power supply, underfloor air conditioning with a particulate filter, maintenance service, security service, and protection against theft and malicious acts. Fire and water damage are further risks to consider. Power and communication cables must be redundant and buried. They should not be exposed in the building's underground garage, which is seen too often in buildings in Paris. These criteria are the first to take into account when choosing a hosting provider (when renting high-availability premises).

For each level of the architecture, each component, and each connection between components, the following must be established:

  • How is a failure detected? Examples: TCP health checks performed by an Alteon appliance, a test program invoked periodically (heartbeat), a “diagnostic” interface on the components, etc.
  • How is the component made safe: redundant, backed up, etc.? Examples: backup server, system cluster, WebSphere clustering, RAID storage, backups, dual SAN attachment, degraded mode, unused spare hardware ready to be installed.
  • How should the switch to emergency or degraded mode be triggered: manually after analysis, or automatically?
  • How is it ensured that the backup system starts from a stable, known state? Examples: starting from a copy of the database and reapplying the archive logs, restarting a batch from a known state, two-phase commit for transactions that update multiple data stores, etc.
  • How does the application restart on the backup mechanism? Examples: restarting the application, restarting interrupted batches, activating a degraded mode, takeover of the failed server’s IP address by the backup server, etc.
  • How are in-flight transactions or sessions resumed? Examples: session persistence on the application server, a mechanism for answering a client about a transaction that completed successfully before the failure but for which the client never received a response, etc.
  • How is the nominal situation restored?
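
As an illustration of the failure-detection question above, here is a minimal sketch of a TCP health check combined with a heartbeat counter. It is not how any particular load balancer works; the names `tcp_health_check` and `HeartbeatMonitor`, the timeout, and the threshold of three missed checks are all assumptions made for the example.

```python
import socket


def tcp_health_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


class HeartbeatMonitor:
    """Declare a component failed after `max_misses` consecutive failed checks."""

    def __init__(self, check, max_misses: int = 3):
        self.check = check            # zero-argument callable returning True when healthy
        self.max_misses = max_misses  # hypothetical threshold, tune per component
        self.misses = 0

    def poll(self) -> str:
        """Run one check and return 'up', 'suspect', or 'failed'."""
        if self.check():
            self.misses = 0
            return "up"
        self.misses += 1
        return "failed" if self.misses >= self.max_misses else "suspect"


if __name__ == "__main__":
    # Hypothetical target: a service listening locally on port 8080.
    monitor = HeartbeatMonitor(lambda: tcp_health_check("127.0.0.1", 8080))
    print(monitor.poll())
```

Requiring several consecutive misses before declaring a failure is one way to avoid switching to the backup on a single dropped packet; whether the switch is then automatic or manual is the separate decision listed above.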

Examples:

  1. If a degraded mode lets transactions be stored in a file while the database is down, how are those transactions re-applied when the database becomes active again?
  2. If a failed component has been deactivated, how is it reintroduced into active service (e.g., need to resynchronize data, retest the component, etc.)?
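
The first example can be sketched as follows: while the database is down, transactions are appended to a journal file, and once it is back they are replayed in their original order. This is only an illustration under simple assumptions (a single writer process, transactions that can be serialized as JSON, a database layer that raises `ConnectionError` when it is unavailable); the class `DegradedModeWriter` and its methods are hypothetical names.

```python
import json
from pathlib import Path


class DegradedModeWriter:
    """Apply transactions to the database when it is up; otherwise journal them to a file."""

    def __init__(self, apply_to_db, journal_path: Path):
        self.apply_to_db = apply_to_db    # callable(dict); assumed to raise ConnectionError when the database is down
        self.journal_path = journal_path  # file holding one JSON transaction per line

    def _has_backlog(self) -> bool:
        return self.journal_path.exists() and self.journal_path.stat().st_size > 0

    def submit(self, transaction: dict) -> str:
        # While a backlog exists, keep journaling so that ordering is preserved.
        if not self._has_backlog():
            try:
                self.apply_to_db(transaction)
                return "applied"
            except ConnectionError:
                pass  # fall through to degraded mode
        with self.journal_path.open("a", encoding="utf-8") as journal:
            journal.write(json.dumps(transaction) + "\n")
        return "journaled"

    def replay(self) -> int:
        """Re-apply journaled transactions in order; return how many were applied."""
        if not self._has_backlog():
            return 0
        lines = self.journal_path.read_text(encoding="utf-8").splitlines()
        applied = 0
        for line in lines:
            try:
                self.apply_to_db(json.loads(line))
            except ConnectionError:
                break  # database went down again; keep the rest for later
            applied += 1
        remaining = lines[applied:]
        if remaining:
            self.journal_path.write_text("\n".join(remaining) + "\n", encoding="utf-8")
        else:
            self.journal_path.unlink()
        return applied
```

The sketch leaves out what a real system would also need to decide, such as who triggers `replay` and how duplicate application is avoided if the process crashes mid-replay.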

Dependency on other applications

For an application that calls other applications synchronously through middleware (HTTP web service, Tuxedo, CORBA, EJB), its availability is strongly tied to the availability of the applications it depends on. The sensitivity (required availability level) of those applications must therefore be equal to or greater than that of the application itself.
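
To see why synchronous dependencies weigh so heavily, consider a rough model in which the application works only when it and every synchronous dependency are up, and in which failures are independent. Under those assumptions (an idealization, not a claim from the original text) the availabilities multiply. The function name `chained_availability` is illustrative.

```python
from math import prod


def chained_availability(own: float, dependencies: list[float]) -> float:
    """Availability of an application that needs all its synchronous dependencies up.

    Assumes independent failures; availabilities are fractions between 0 and 1.
    """
    return own * prod(dependencies)


if __name__ == "__main__":
    # Hypothetical figures: a 99.9% application calling a 99.9% and a 99% service.
    print(f"{chained_availability(0.999, [0.999, 0.99]):.4%}")
```

In this model the result can never exceed the weakest link, which is the reasoning behind requiring dependencies to be at least as available as the application itself.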

Otherwise, consider:

  • Using asynchronous middleware: MQ Series, JMS, SonicMQ, CFT
  • Implementing a degraded mode for when an application we depend on fails.

For this reason, asynchronous middleware with good availability should be preferred whenever possible.

Load distribution and sensitivity

Sensitivity is often handled by redundant elements behind a load-balancing mechanism (for example, a WebSphere cluster with an Alteon load balancer). For such a system to offer a real gain in reliability, check that if one element fails, the remaining elements have enough capacity to provide the service.

In other words, with two active load-balanced servers, a single server must be able to carry the entire load. With three servers, one server must be able to carry 50% of the load (assuming that the probability of two servers failing at the same time is negligible). For good reliability, there is no need for many servers backing each other up. For example, an element that is 99% reliable, made redundant once, gives a reliability of 99.99% (the probability that both elements fail at the same time is 1/100 × 1/100 = 1/10,000).
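
Both rules of thumb in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes independent failures and an evenly balanced load; the function names `redundant_availability` and `required_capacity_share` are illustrative.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Probability that at least one of `copies` independent elements is up."""
    return 1.0 - (1.0 - single) ** copies


def required_capacity_share(servers: int, tolerated_failures: int = 1) -> float:
    """Fraction of the total load each server must be able to carry
    so that the survivors still cope after `tolerated_failures` failures."""
    survivors = servers - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one server must survive")
    return 1.0 / survivors


if __name__ == "__main__":
    print(f"{redundant_availability(0.99, 2):.4%}")  # one extra copy of a 99% element
    for n in (2, 3, 4):
        print(n, "servers:", f"{required_capacity_share(n):.0%} of the load each")
```

With two copies of a 99% element the failure probability drops to one in ten thousand, so further copies bring quickly diminishing returns.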

Differential redundancy

Redundancy of an element is usually achieved with multiple identical components. To be effective, this assumes that the failure of one component is random and independent of the failure of the others. This is the case for hardware failures, for example.

It does not hold for all failures: an operating system fault or a software component malfunction can occur on all components at once when the triggering conditions are met. For this reason, when the application is extremely sensitive, consider making elements redundant with components of different kinds that perform the same function. This can lead to:

  • Choosing different types of servers, with different operating systems and different infrastructure software products;
  • Developing the same component twice, with each implementation honoring the contracts of the component's interface.

Redundancy with a voting system

In this mode, several components process the same inputs and therefore (in principle) produce the same output.

The outputs of all components are collected, and an algorithm then produces the final result. The algorithm can be simple (majority vote) or more complex (mean, weighted mean, median, etc.). The aim is to eliminate erroneous results caused by a malfunction of one of the components and/or to obtain a more reliable result by combining several slightly different ones.
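
Two of the voting rules mentioned here, majority and median, can be sketched in a few lines. The function names `majority_vote` and `median_vote` are illustrative, and requiring a strict majority is a choice made for this example rather than something the text prescribes.

```python
from collections import Counter
from statistics import median


def majority_vote(outputs):
    """Return the value produced by a strict majority of components.

    Raises ValueError when there are no outputs or no value wins a strict majority.
    """
    if not outputs:
        raise ValueError("no component outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority among component outputs")
    return value


def median_vote(readings):
    """Combine numeric readings so that a single outlier cannot drag the result."""
    return median(readings)


if __name__ == "__main__":
    print(majority_vote(["OK", "OK", "ERROR"]))  # the faulty component is outvoted
    print(median_vote([20.1, 20.3, 85.0]))       # the faulty sensor reading is ignored
```

The majority rule suits discrete outputs that should be identical, while the median suits numeric readings that differ slightly even when every component is healthy.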

This approach:

  • Does not provide load balancing
  • Introduces the problem of the reliability of the component that runs the voting algorithm

This method is commonly used in the following cases:

  • Sensor-based systems (e.g., temperature sensors) in which the sensors are redundant
  • Systems in which several different components perform the same function (see Differential redundancy) and in which a better result can be obtained by combining their outputs (e.g., a pattern recognition system that uses multiple algorithms to achieve a better recognition rate).

Shadow operations

When a redundant component has malfunctioned and been repaired, we may want to bring it back into active service and check that it works properly, without its results being used yet. In this case, the inputs are processed by one or more components known to be reliable, and these produce the result used by the rest of the system. The same inputs are also processed by the reintroduced component, which is said to run in “shadow” mode. Its proper functioning can be checked by comparing its results with those produced by the trusted components. This method is often used in voting-based systems, since it is enough to exclude the “shadow” component from the final vote.
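
A minimal sketch of shadow mode, assuming components can be modeled as plain functions: the trusted component's result is always the one returned, while the shadow component's result is only compared and counted. The class `ShadowRunner` and its attributes are hypothetical names introduced for this illustration.

```python
_FAILED = object()  # sentinel marking a shadow call that raised an exception


class ShadowRunner:
    """Feed every input to a trusted component and to a component running in shadow mode."""

    def __init__(self, trusted, shadow):
        self.trusted = trusted  # callable whose result is actually used
        self.shadow = shadow    # callable being re-qualified; its result is only compared
        self.total = 0
        self.mismatches = 0

    def process(self, entry):
        result = self.trusted(entry)
        try:
            shadow_result = self.shadow(entry)
        except Exception:
            # A crash in the shadow component must never affect the real result.
            shadow_result = _FAILED
        self.total += 1
        if shadow_result is _FAILED or shadow_result != result:
            self.mismatches += 1
        return result

    def agreement_rate(self) -> float:
        """Share of inputs on which the shadow component agreed with the trusted one."""
        return 1.0 if self.total == 0 else 1.0 - self.mismatches / self.total
```

An operator could then decide to return the component to active service once `agreement_rate()` has stayed at an acceptable level over enough inputs; what counts as acceptable is a judgment the text leaves open.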

Processes that improve availability

These processes play two distinct roles.

Processes that reduce the number of failures

Since prevention is better than cure, implementing control processes that reduce the number of incidents on the system improves availability. Two processes can play this role:

  • Change management: 60% of errors are related to a recent change. A formalized process, accompanied by adequate tests carried out in a proper pre-production environment, can eliminate many incidents.
  • Proactive error management: incidents can often be detected before they occur, for example when response times increase. A process dedicated to this task and equipped with adequate tools (measurement, reporting, etc.) can act even before the incident happens.

By implementing these processes, many incidents can be avoided.

Processes that reduce the duration of outages

A failure always ends up happening. At that point, the recovery process is essential for restoring the service as quickly as possible. This process must have one goal: to let users use the service again as quickly as possible. A definitive repair should be avoided at this stage because it takes much longer; the process will therefore put a workaround in place.

High-availability clusters

A high-availability cluster (as opposed to a computing cluster) is a cluster of computers whose goal is to provide a service while avoiding downtime.

Source: Wikipedia, the free encyclopedia. The text is available under a Creative Commons license.

Translated from: https://www.eukhost.com/blog/webhosting/what-is-high-availability/
