混沌工程是什么

Chaos Engineering isn’t just something that Dr. Robotnik does; it’s a serious and increasingly common part of the development life cycle.

混沌工程不仅仅是Robotnik博士所做的。 这是开发生命周期中重要且日益普遍的部分。

In an ever more dynamic global environment, there is growing pressure to introduce continuous testing to the DevOps toolchain to ensure security and resilience. This is more so the case as the cloud becomes the dominant playground for an increasing number of firms, for where there is emerging tech, there is an emerging threat.

在瞬息万变的全球环境中,向DevOps工具链引入持续测试以确保安全性和弹性的压力越来越大。 当云成为越来越多的公司的主要场所时,情况就更是如此,对于有新兴技术的地方新兴威胁就在这里

In practice, a system that has not been routinely and effectively tested is more prone to downtime, which can lead to disappointment or even lost customers. In ITIC’s 11th annual Hourly Cost of Downtime Survey, it was found that 87% of organisations are now required to be available 99.99% of the time (which — they note — is up 81% in the past 2.5 years). Incredibly, 40% of enterprise organisations indicated that a single hour of downtime can cost them from $1 million to over $5 million.

在实践中,未经例行有效测试的系统更容易发生停机,这可能导致失望甚至失去客户。 在ITIC的第11次年度每小时停机成本调查中 ,发现现在要求99.99%的时间有87%的组织处于可用状态(他们指出,在过去2.5年中,这一比例增加了81%)。 令人难以置信的是,有40%的企业组织表示,一个小时的停机可能使他们蒙受的损失从100万美元到500万美元以上。

So how do you preempt and prepare for the kind of disaster that could cost your company millions? Well, one increasingly popular approach is to productively break your own things before someone else does.

那么,您如何抢占先机并为可能造成公司数百万美元损失的灾难做准备? 好吧,一种越来越流行的方法是在别人做之前有效地破坏自己的东西。

曝光,改善 (Expose, Improve)

To those of us who aren’t wizards of the DevOps world, just the idea of deliberately introducing chaos to your hard work is chilling. Getting to grips with the concept isn’t so hard though, and once you understand the work required, I’m sure you will agree that the outcomes are well worth the resource investment.

对于那些不是DevOps领域向导的人来说,故意将混乱引入您的辛苦工作的想法是令人毛骨悚然的。 不过,掌握这个概念并不难,一旦您了解了所需的工作,我相信您会同意成果值得投入资源。

A fairly traditional example of this would be penetration testing or ethical hacking whereby a company pays [typically] a third-party to try and find the security weaknesses in their infrastructure. Chaos engineering takes this idea of testing security, and dials it up a notch.

一个相当传统的例子是渗透测试道德黑客 ,公司通常会向第三方支付费用,以试图发现其基础架构中的安全漏洞。 混沌工程学采用了测试安全性的想法,并将其提高了一个等级。

To explain at a high-level, you introduce disruption (such as a major server crash, a hack attempt, a blackout) in to the normal production state of a system in order to proactively (rather than reactively) test its resilience.

为了进行高层次的解释,您会在系统的正常生产 状态中引入中断 (例如主要的服务器崩溃,黑客尝试,停电),以便主动 (而不是被动地 )测试其弹性。

Simple view of the development life-cycle (graphic my own)
开发生命周期的简单视图(图形是我自己的)

It is designed to be a replication of a real-life ‘emergency’ situation and differs from traditional and mainstream testing by taking place in the production environment, rather than earlier in the development life-cycle.

它旨在复制现实生活中的“紧急情况”,并且不同于传统测试和主流测试,而是在生产环境中进行,而不是在开发生命周期中进行。

But why would you go to the lengths of trying to tank your own network or fry your own hardware if you don’t need to?

但是,如果您不需要的话,为什么还要花很多钱尝试自己的网络或油炸自己的硬件呢?

Well, it’s sort of like the old saying “failure to prepare is preparation to fail”; if you don’t do it, then someone else probably will. Therefore, to avoid being left in a catastrophic situation with unexpected downtime, it’s best to put any mitigating procedures in place well in advance.

好吧,这有点像古话“准备失败就是准备失败”。 如果您不这样做,那么其他人可能会这样做。 因此,为避免因意外停机而陷入灾难性状况,最好提前采取任何缓解措施。

By exposing any possible weaknesses, you can ultimately improve confidence in any systems for the future.

通过暴露任何可能的弱点,您最终可以提高对未来任何系统的信心。

像科学学科一样对待它 (Treat it like a scientific discipline)

Masters of chaos Gremlin state clearly in their whitepaper that:

混乱的格雷姆林大师在白皮书中明确指出:

“Critical to chaos engineering is that it is treated as a scientific discipline.”

“对混沌工程至关重要的是,它被视为一门科学学科。”

Rather than just hiring someone to try and break your system, chaos engineering done right allows you to revert back to an original state without any impact to customer or employees. From a compliance perspective, it allows you to validate and justify hypothetical defense mechanisms. Gremlin set out six steps that should be followed for successful engineering:

完成混乱的工程设计不仅可以雇用他人尝试破坏系统,还可以使您恢复到原始状态,而对客户或员工没有任何影响。 从合规性角度来看,它使您可以验证和证明假设的防御机制。 Gremlin列出了成功进行工程设计应遵循的六个步骤:

  1. Form a hypothesis形成假设
  2. Plan your experiment计划实验
  3. Minimize the blast radius最小化爆炸半径
  4. Run the experiment运行实验
  5. Celebrate the outcome庆祝结果
  6. Complete the mission完成任务

There are many critics and skeptics of chaos engineering because, of course, there are risks involved. But following these steps and utilizing best practice — including starting with a staging environment, always having a “kill switch”, and planning in advance — drastically reduces the chance of things going awry. Remember: if you don’t test things breaking in a controlled environment, you’re ensuring that you will struggle to fix them when they inevitably go wrong in the future!

混乱工程学受到许多批评家和怀疑论者的关注,因为当然其中涉及风险。 但是,遵循这些步骤并利用最佳实践(包括从临时环境开始,始终具有“致命的转变”以及预先进行计划),可以大大减少发生问题的机会。 切记:如果您不测试在受控环境中发生的故障,那么您将确保在将来不可避免地出问题时要努力修复它们!

Ultimately by simulating events that could be completely overwhelming, you make them non-events that — should they occur in the future — would barely leave a dent.

最终,通过模拟可能完全压倒性的事件,可以使它们成为非事件-如果它们将来会发生-几乎不会留下任何痕迹。

解决假设 (Tackle assumptions)

Another draw of chaos engineering is that it challenges perhaps the biggest blight of the software engineering world: incorrect assumptions. Whether it’s believing that you hold all the power, or that’ll you never run out of space, letting these fallacies take root allows issues to creep in.

混乱工程的另一种吸引之处是,它可能挑战了软件工程领域的最大困境:错误的假设。 无论是相信您拥有所有权力,还是永远不会用完空间,让这些谬论生根发芽会让问题蔓延。

The fallacies of distributed computing, by L. Peter Deutsch (graphic my own)
分布式计算的 谬误 ,作者: L。Peter Deutsch(我自己制图)

If you aren’t continuously testing, and pushing your infrastructure to the limit, then you are relying on discovering your system weaknesses on the fly as they appear in the real world, potentially to the detriment of your customer base.

如果您不进行持续的测试并将基础架构推向极限,那么您将依赖于在现实世界中即时发现系统漏洞,这可能会损害客户群。

Be it major or minor downtime, every minute matters. It’s therefore true, as Andy Grove so excellently puts it, that “Success breeds complacency. Complacency breeds failure”.

无论是重大停机还是次要停机,每分钟都很重要。 因此,正如安迪·格鲁夫(Andy Grove)所说的那样,“成功培养了自满。 自满滋生失败”。

猿面军 (The Simian Army)

Chaos engineering isn’t a new concept. Back in 2011, Netflix started work on a suite of open-source tools referred to as “The Simian Army”. These tools have been designed to test the security, reliability, and resilience of their Amazon Web Services infrastructure and include big names like Chaos Monkey, Chaos Kong, and Janitor Monkey.

混沌工程并不是一个新概念。 早在2011年,Netflix就着手开发一套称为“猿猴军”的开源工具。 这些工具旨在测试其Amazon Web Services基础架构的安全性,可靠性和弹性,其中包括Chaos Monkey,Chaos KongJanitor Monkey等知名品牌

Chaos Monkey was Netflix’s first tool, and it worked by rebooting servers at random. Whilst incredibly innovative, it seems to have developed some infamy online in part because it does deviate from the modern fundamentals of chaos engineering. Generally speaking, it’s not advisable to break things at random in the middle of busy periods, but instead to use controls and measures to understand weaknesses (unless, like Netflix, you have a massive team of engineers lined up to bag said monkey!)

Chaos Monkey是Netflix的第一个工具,它通过随机重启服务器来工作。 尽管具有令人难以置信的创新性,但它似乎在网上发展了一些臭名昭著的部分原因是它确实偏离了混沌工程学的现代基础。 一般而言,不建议在繁忙的时段中随意破坏事物,而应使用控制和措施来了解弱点(除非像Netflix一样,您有庞大的工程师团队排着长队说猴子!)

Netflix subsequently introduced a whole cast of monkeys to work with, including Chaos Kong (which wiped out a whole AWS region *shudders*) and Latency Monkey (which introduces delays in communication to simulate real-world degradation and outages). More recently, the monkeys have been developed separately and many are built into Spinnaker, Netflix’s “open source multi-cloud Continuous Delivery platform”

Netflix随后引入了一系列可与之合作的猴子,包括Chaos Kong (消灭了整个AWS地区* shudders *)和Latency Monkey (引入了通信延迟以模拟现实世界的退化和中断)。 最近,猴子已经单独开发,许多猴子内置在Spinnaker中,这是Netflix的“开源多云连续交付平台”

You can read in detail about the Simian Army and the incredible chaos engineering projects at Netflix on the Netflix Tech Blog.

您可以在Netflix Tech Blog上详细了解有关Simian Army和不可思议的混乱工程项目的信息。

平静中的混乱 (Chaos in the Calm)

Other companies including Salesforce, Amazon, and Uber have been using resilience testing for years to improve reliability, and because so many projects are open source it’s not difficult to embrace the chaos.

包括Salesforce,Amazon和Uber在内的其他公司多年来一直在使用弹性测试来提高可靠性,并且由于有如此多的项目是开源的,因此不难摆脱混乱。

There are great forums (including the Chaos Community) as well as a wealth of offerings, including VMware’s ‘Mangle’ that is described as a tool to enable you:

有很棒的论坛(包括Chaos社区 )以及丰富的产品,包括VMware的“ Mangle ”,它被描述为使您能够使用的工具:

“…to run chaos engineering experiments seamlessly against applications and infrastructure components to assess resiliency and fault tolerance. It is designed to introduce faults with very little pre-configuration…”

“……针对应用程序和基础架构组件无缝运行混乱的工程实验,以评估弹性和容错能力。 它旨在以很少的预配置引入故障……”

There are also legitimate full-scale services like the super cool ‘Gremlin’, which offers a platform that allows you introduce ‘attacks’ across your stack with multiple failure modes.

还有一些合法的全面服务,例如超级酷的“ Gremlin ”,它提供了一个平台,可让您通过多种故障模式在堆栈中引入“攻击”。

As a practice, it’s not quite the standard yet. Even if it doesn’t need to be the most difficult process ever, firms want the assurance of experienced DevOps who can engineer chaos safely, and that can be challenging to 1) find, and 2) invest in.

实际上,这还不是很标准。 即使不需要这是有史以来最困难的过程,公司也希望得​​到经验丰富的DevOps的保证,他们可以安全地制造混乱,这对于1)寻找和2)投资可能是具有挑战性的。

Ultimately though, the need for continuous testing will only become more necessary as emerging threats become more sophisticated; as customer reliance on functional systems grows, the room for error shrinks.

但最终,随着新出现的威胁变得越来越复杂,对连续测试的需求将变得更加必要。 随着客户对功能系统的依赖增加,错误的余地也随之缩小。

As Bill Gates famously warned us 5 years ago:

正如比尔·盖茨5年前著名地警告我们的那样:

“There’s no need to panic… but we need to get going”.

“没有必要惊慌……但我们需要走下去”。

翻译自: https://towardsdatascience.com/chaos-in-the-calm-what-is-chaos-engineering-5311c27b2c7e

混沌工程是什么


http://www.taodudu.cc/news/show-6037840.html

相关文章:

  • 6/14 筑个窝
  • 学习Python的三种境界,你现在是在什么境界?
  • 単語境界/非単語境界(¥b, ¥B)
  • R语言的修仙之道--R语言之后天境界
  • R语言循环函数编写三境界
  • Linux树莓派实战案例论文,树莓派|树莓派使用实例之:2 Pi R
  • 高通FastCV简介
  • 计算机图像处理2000字论文,图像处理计算机技术论文
  • 高通QSC是什么?
  • 计算机视觉知识表征,计算机视觉基础 - 边缘和轮廓检测
  • 章节1 计算机体系结构
  • 计算机底层02-计算机指令与指令集
  • 高通 Hexagon V65 HVX 编程参考手册(1)
  • 计算机视觉初入门
  • 高通ims架构android,深度揭密高通4/5G移动基带消息系统和状态机
  • QXRService:基于高通QXRService获取SLAM Camera图像
  • 高通机器视觉快速指南二
  • 解密weblogic控制台账号密码
  • c语言实现一个密码管理器(更新中)
  • Python-小实战-1.0
  • 栈的压入、弹出序列和栈所有可能的弹出顺序
  • 栈与队列的各种操作
  • 【100题】给定入栈序列,判断一个序列是否可能为输出序列
  • 《剑指offer》:[22]如何判断一个序列是否为栈的弹出序列
  • 栈的push,pop序列
  • 元素出栈入栈顺序是否合法
  • 判断栈的出栈顺序是否正确
  • 剑指Offer面试题22(Java版):栈的压入、弹出序列
  • 卟啉试剂cas40904-90-3/四-(2-吡啶基)卟啉/TPyP(2);氯乙酰基氧基(CLP1-5);锌卟啉-富勒烯配合物(p-OCH3)ZnP-C60]
  • 程序员面试题精选(24):栈的push、pop序列

混沌工程是什么_平静中的混沌:什么是混沌工程?相关推荐

  1. 同创永益CTO郑阳受邀在清华MEM大讲堂分享“混沌工程在复杂业务稳定性中的应用”

    2022年3月30日下午,清华MEM大讲堂第65期在线上举行,同创永益CTO郑阳受邀在大讲堂详细分享企业数字化转型过程中"混沌工程在复杂业务稳定性中的应用".清华MEM大讲堂是清华 ...

  2. 混沌工程是什么_什么是混沌猴子? 混沌工程解释

    混沌工程是什么 在Netflix从分发DVD转变为构建用于流视频的分布式云系统的过程中,Pioneers率先走了出来, Chaos Monkey引入了一种工程原理,该原理已被各种规模和规模的软件开发组 ...

  3. 工程图样中粗实线的用途_《道路工程制图标准》规定:中粗实线的一般用途是( )。_学小易找答案...

    [单选题]( )的主要作用是承受活载压力和土压力等并将其传递给地基,并保证设计流量通过的必要孔径.(3.0分) [判断题]在道路工程图中,图上尺寸数字之后不必注写单位,但在注解及技术要求中要注明尺寸单 ...

  4. 神经网络激活函数对数函数_神经网络中的激活函数

    神经网络激活函数对数函数 Activation function, as the name suggests, decides whether a neuron should be activated ...

  5. 解决朋友圈压缩_朋友中最有趣的朋友[已解决]

    解决朋友圈压缩 We live in uncertain times. 我们生活在不确定的时代. We don't know when we're going back to school or th ...

  6. Win7全自动精简批处理_温柔处理极速修正版/暴力剩女工程测试版

    2011htpcfans 发表于 2012-5-11 http://bbs.wuyou.net/forum.php?mod=viewthread&tid=210269&highligh ...

  7. 混沌序列 java,基于Logistic映射混沌加密算法的研究_韩凤英

    第7卷第1期长沙航空职业技术学院学报 Vo1.7No .1 2007年3月 CHANGS HA AERO NAUT I CAL VOCATI ONAL AND TECHN I CAL COLLEGE ...

  8. 长沙中职英语计算机等级考试查询,(涉外)长沙华中涉外职业高中专业_湖南中职学校一览表...

    (涉外)长沙华中涉外职业高中专业_湖南中职学校一览表h9xh 添加老师微信或者致电招办电话可领取两千元助学金名额! 职业中专是针对专业技术的高中,对专业能力要求较高,他们不但要普通高中的语文.数学等基 ...

  9. 牛客网_剑指Offer_Python实现_更新中

    剑指Offer编程题汇总 第1题_二维数组中的查找 第2题_替换空格 第3题_从尾到头打印链表 第4题_重建二叉树 第5题_用两个栈实现队列 第6题_旋转数组的最小数字 第7题_斐波那契数列 第8题_ ...

最新文章

  1. 近期活动盘点:清华严飞大数据探寻中国文脉讲座、2019前沿信息科技创新论坛...
  2. 惠普ilo管理界面远程安装系统
  3. 工程师忽略的隐形成本
  4. 计算机网络那些事~(一)
  5. 水晶报表提示“需要数字字段”
  6. 解决RM删除没有释放空间问题
  7. gl3520 gl3510_带有gl gl本机的跨平台地理空间可视化
  8. shell 脚本编写 if else then
  9. 解决idea文件名称大小写导致GIT无法提交问题
  10. easyui-datagrid加载时的效率低下,解决方案
  11. Tensorflow 2 Auto-Encoder
  12. (转)深入理解Java的接口和抽象类
  13. PHP两个匿名函数传递性,PHP让人不知道的匿名函数的几种写法(附代码)
  14. 改变自己,永不会晚!
  15. threeJS导入FBX模型
  16. 10个最佳企业移动支付APP应用和酷站欣赏
  17. JavaScript 求平均数的方法(实参个数不确定)
  18. python课程设计 文字游戏 魔塔3
  19. PNP与NPN两种三极管使用方法
  20. linux 嵌入式开发常用网站整理

热门文章

  1. 专访最强夫妻店:“神庙逃亡2”开发背后的故事
  2. 让树莓派变身照相机——摄像头控制
  3. 石家庄联通宽带DNS服务器地址
  4. Mediakit报告设备商的空间不足以执行此操作的纯MAC解法
  5. 存储技术对比:NVMe与SATA孰强孰弱?
  6. 驻极体式MIC电路设计
  7. [转载] excel调用python编程-超简单:用Python让Excel飞起
  8. 【微信朋友圈,如何测】
  9. windows 10 时间同步,时间显示不准自动校准。
  10. 杭州电子科技大学程序设计竞赛(2016’12)- 网络同步赛 1004