by Cory Kennedy-Darby

Retrospective: lessons we learned from 2 major API outages

I belong to a team that provides internal services exposed through an API to various other teams within the company. These teams have become increasingly reliant on the services we provide.

Without sounding egotistical, these services are likely the most critical in the company. When they are not fully operational and stable, nearly all other teams are affected, and there is an enormous direct impact on revenue.

Internally, other teams in the company using our service are referred to as “clients.” For clarity’s sake, this article will treat the terms “clients” and “teams” as interchangeable.

Wednesday — we start receiving an enormous volume of requests

I get into the office, get some coffee, and begin my day by checking our metrics and logs. Most people in the company already knew from the night before that there were a couple of small hiccups with the load balancers that host our service. My manager wants me to focus on an important initiative and to send all other unrelated tasks to him.

A couple of hours into my day, I notice that one of our error-log emails shows a massive spike in a particular validation error. These validation errors are nothing special: we log them to notify any client that sends us payload parameters that don’t match our validation rules.
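
As a rough illustration of the kind of check that produces these log entries, here is a minimal sketch. The framework choice (Flask), the endpoint, the header, and the required fields are all assumptions for illustration, not the actual service.

```python
# Minimal sketch of payload validation that emits the kind of error-log line
# discussed above. Endpoint, field names, and rules are hypothetical.
import logging

from flask import Flask, jsonify, request

app = Flask(__name__)
log = logging.getLogger("api.validation")

REQUIRED_FIELDS = {"account_id", "amount"}  # hypothetical validation rule


@app.route("/v1/charge", methods=["POST"])
def charge():
    payload = request.get_json(silent=True) or {}
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        # This is the log line that later spikes in the error-log email.
        log.warning("validation error: client=%s missing=%s",
                    request.headers.get("X-Client-Id", "unknown"), sorted(missing))
        return jsonify({"error": f"missing fields: {sorted(missing)}"}), 400
    return jsonify({"status": "ok"}), 200
```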

The team I belong to hadn’t deployed any changes, infrastructure was in a healthy state, and nothing was out of place on our side to warrant such a massive increase.

My manager made it clear earlier that I should pass off to him anything that isn’t related to the project I’m currently working on. I quickly realize I’ve made a mistake when I ask him where I can find the team lead of the team that is sending us all these validation errors.

The mistake: the moment I ask him this question, he wants to know why I’m asking. The short version is that he sees this as me not being focused on the initiative, and he explains that he’ll handle the validation errors. Shortly after our conversation, I’m CC’d on an email to the other team pointing out their issue to them.

All of a sudden, there's panic in the office

There is a flurry of emails about issues with our service, and everyone points the finger at the load balancer issue from the night before.

I’m pulled off the initiative to work with operations and the data center crew on the load balancers. After some discussion, they decide to start rerouting portions of our traffic to other load balancers.

Magically the load balancers begin to stabilize.

The calm before the storm

The panic in the office disappears and everything goes back to running smoothly. After a couple of hours, the issue begins to reappear, and once again our service is compromised by an enormous volume of requests.

Ding!

The error-log email comes in, and I notice those validation errors are back. I had been so focused on dealing with the data center that I hadn’t noticed they had stopped when the load balancers became stable.

Hold on! I pull up our Kibana dashboard and compare the timestamps of the validation errors against our total requests over time. They are nearly an exact match.
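
The comparison amounts to bucketing both signals into the same time intervals and eyeballing the overlap. A sketch of that query is below, in elasticsearch-py 7.x style; the index pattern, field names, and the "validation error" match are assumptions, not the real mapping.

```python
# Sketch: bucket validation errors and total requests into 1-minute intervals
# and print them side by side. Index/field names are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def histogram(query):
    body = {
        "size": 0,
        "query": query,
        "aggs": {"per_minute": {"date_histogram": {"field": "@timestamp",
                                                    "fixed_interval": "1m"}}},
    }
    resp = es.search(index="api-logs-*", body=body)
    return {b["key_as_string"]: b["doc_count"]
            for b in resp["aggregations"]["per_minute"]["buckets"]}


errors = histogram({"match": {"message": "validation error"}})
totals = histogram({"match_all": {}})

for ts in sorted(totals):
    print(ts, totals[ts], errors.get(ts, 0))
```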

I am fairly confident this is the issue, and I decide that I’m not repeating the morning’s mistake. I ask a colleague where the other team sits and head there, armed with the correlation between when the spikes happen and when everything becomes unstable.

Within 10 minutes of explaining the situation to the other team, they realize that this lines up exactly with when they deployed a feature, when they disabled the feature, and when it was accidentally redeployed.

They disable the feature on their side, and our service returns to stability. The feature they had deployed had a bug that would call our service in all situations, even when it wasn’t required.

Monday — ~600% of the usual amount of requests to an endpoint

The weekend passes, and there haven’t been any issues with our service since the incident on Wednesday. Monday starts off without any issues, but by the end of the work day our service begins to become unstable.

Quickly glancing through the Kibana events and our access logs, I notice what appears to be an insane amount of requests coming from a single team. The same team from Wednesday.

I rush off to that team’s area in the company, explain the details, and ask if they relaunched Wednesday’s buggy feature by mistake.

To be sure that the feature hadn’t been deployed again by mistake, they reviewed their deployment logs and verified on git that the feature hadn’t mistakenly been merged into another branch.

The feature from Wednesday had not been redeployed.

I return to my desk and begin chatting with the data center operations team. The situation at this point has become severe because of how long the service has been impacted. The operations team is investigating, and I’m sitting waiting for an answer when the VP of operations shows up.

He’s clearly not happy because our API impacts a lot of products and services that are from the operations side of the business. He has some questions about bypassing our service, utilizing some fallback methods, and an ETA on the service returning to normal conditions.

A couple more minutes pass and then everything stabilizes.

Fixed? Not unless I believe in unicorns and mythical creatures

The data center operations team explains that a feature on our staging server was causing the load balancer processes to keep crashing. I call BS on their answer because this feature had been running successfully without any issues for nearly two months straight. The other reason I wasn’t willing to accept their answer was that I had noticed on our Kibana dashboard that our total requests had roughly quadrupled.

It seemed too crazy that such a stable feature on our staging servers would randomly start causing such large issues.

Tuesday — Stakeholders Want Answers

Tuesday morning, my manager asks me to provide a clear outline of Monday’s events so he can write up an incident report for the stakeholders.

The night before, I had spent a lot of time reflecting on the recent incidents, and I realized that a significant pain point was our Kibana dashboard. It’s amazing when you want to see all requests, but with our current mapping, it’s hard to drill down and isolate requests from a specific client.

The problem? It’s going to take a bit of work, nobody has assigned me to do this, and this isn’t the original task given to me. On the spot I decide that I’m going to build this out. High performing individuals don’t ask permission to do their job. They go and deliver results.

High performing individuals don’t ask permission to do their job. They go and deliver results.

- Cory Darby

I begin to hack together a script that runs regular expressions against our Elasticsearch data and passes the results into Highcharts. During this, our CTO stops by to ask about the situation that unfolded on Monday. I explain that there’s no evidence at the moment, but my gut feeling, from what I saw in the logs, graphs, and metrics, is that the other team’s cache died. If their cache died, they would be forced to make an enormous number of requests to our service.

He leaves to get answers from the other team, and I finish hacking the script together. It isn’t elegant code but it gets the job done, and it means we’re not in the dark anymore.
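
Roughly, the throwaway script looked something like the sketch below (a reconstruction, not the original code): pull recent access-log documents from Elasticsearch, group them per client with a regex over the logged request line, and emit Highcharts-ready series. The index name, log format, client-identifying regex, and time window are all assumptions.

```python
# Rough reconstruction of the quick-and-dirty script: group requests per client
# with a regex over the access-log line and print Highcharts series as JSON.
# Index name, log format, and the regex itself are assumptions.
import json
import re
from collections import defaultdict

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
CLIENT_RE = re.compile(r"client=(\w+)")          # hypothetical log format

buckets = defaultdict(lambda: defaultdict(int))  # client -> minute -> count

# Pull the last few hours of access logs (elasticsearch-py 7.x style query).
resp = es.search(index="api-access-*", body={
    "size": 5000,
    "query": {"range": {"@timestamp": {"gte": "now-6h"}}},
    "_source": ["@timestamp", "message"],
})
for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    match = CLIENT_RE.search(src.get("message", ""))
    client = match.group(1) if match else "unknown"
    minute = src["@timestamp"][:16]              # truncate timestamp to the minute
    buckets[client][minute] += 1

# One Highcharts series per client: {"name": ..., "data": [[minute, count], ...]}
series = [{"name": client, "data": sorted(counts.items())}
          for client, counts in buckets.items()]
print(json.dumps(series, indent=2))
```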

Just 10 minutes after finishing, our CTO returns. He explains that the other team can’t find any issues on their side. At this point I show him the graph, which shows every request grouped by every team using our service:

The giant spike in the graph above is the other team that had just explained that everything looked healthy on their side.

Further investigation by the other team showed that a new feature they launched required a significant amount of caching. The rule they had in place for their cache was to evict the oldest cached objects to make room for new ones. The evicted data happened to be all the cached data for our service.

Their cache was working, but it didn’t have any of our service cached anymore. The other team was hitting us for nearly every request.
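
In effect, the failure mode was something like the toy sketch below (this is not their implementation): once the new feature's objects fill a size-bounded LRU cache, our service's entries are the oldest and get evicted, so every later lookup misses and falls through to our API.

```python
# Toy illustration of the failure mode, not the other team's actual cache.
# When the new feature fills a bounded LRU cache, the oldest entries (our
# service's responses) are evicted, so every later lookup misses and the
# caller falls back to calling our API.
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                      # miss -> caller hits our API instead
        self.data.move_to_end(key)
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict the oldest entry


cache = LRUCache(capacity=3)
cache.put("our-service:user:42", {"name": "example"})  # cached API response
for i in range(3):                                     # new feature floods the cache
    cache.put(f"new-feature:item:{i}", i)

print(cache.get("our-service:user:42"))                # None -> another request to our API
```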

Areas of Improvement

After any major engineering incident, it’s important to cover the cause, solution, prevention, and any pain points the incident exposed.

Insightful metrics & monitoring

Clients using our services need better insights into their usage. On our side, we will be providing easier ways to drill down for specific metrics.

After these events, it’s clear that we will need to create an actual dashboard for both the clients and ourselves. The dashboard will replace the script I built to push data into the Highcharts graph. The dashboard graphs will have monitoring on the data being received so they can provide the earliest possible alerting.
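
One way that monitoring could work, sketched under made-up numbers (the 5x threshold and the idea of fixed-size windows are assumptions): compare each client's request count in the latest window against its recent baseline and alert on large deviations.

```python
# Sketch of the earliest-alerting idea: flag a client whose request count in
# the latest window deviates sharply from its recent baseline. The 5x
# threshold and the window data are made up; assumes at least two windows.
from statistics import mean


def spike_alerts(per_client_counts, threshold=5.0):
    """per_client_counts: {client: [counts per window, newest last]}"""
    alerts = []
    for client, counts in per_client_counts.items():
        baseline = mean(counts[:-1]) or 1    # guard against a zero baseline
        latest = counts[-1]
        if latest / baseline >= threshold:
            alerts.append((client, baseline, latest))
    return alerts


recent = {"team-a": [120, 130, 115, 125], "team-b": [40, 35, 50, 600]}
for client, baseline, latest in spike_alerts(recent):
    print(f"ALERT: {client} jumped from ~{baseline:.0f} to {latest} requests per window")
```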

Reduce team silos

Ideally, we would allow other teams in the company to push pinpoints to the dashboard when they deploy to their production environments. If the requests from a single team increase substantially after a deployment, that deployment is most likely responsible for the spike.
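
For example, a deploy pipeline could push a pinpoint with a small HTTP call at the end of each production rollout. The endpoint URL and payload shape below are hypothetical, not a real internal API.

```python
# Hypothetical deploy hook: push a "pinpoint" to the metrics dashboard after a
# production deploy so request spikes can be lined up with releases.
# The URL and payload shape are assumptions.
import datetime

import requests


def push_pinpoint(team, service, version):
    requests.post(
        "https://dashboard.internal/api/pinpoints",   # hypothetical endpoint
        json={
            "team": team,
            "service": service,
            "version": version,
            "deployed_at": datetime.datetime.utcnow().isoformat() + "Z",
        },
        timeout=5,
    )


# Called from the deploy pipeline after a successful production rollout.
push_pinpoint(team="payments", service="checkout", version="2024.03.1")
```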

Communication

During these outages, stakeholders created a lot of distraction throughout the company. These distractions take the form of emails, chat messages, and individuals showing up at the team’s space to ask about the service.

To reduce the number of distractions, we will have an internal dashboard that includes the current status and the events as they unfold. The events will include time of creation, a description of the situation, and an estimated time of recovery (if known).
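
A minimal sketch of the event record such a status page might show, mirroring the fields listed above (the field names are assumptions):

```python
# Minimal sketch of the incident-event record for the internal status page;
# the fields mirror the ones described above, and the names are assumptions.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class IncidentEvent:
    created_at: datetime
    description: str
    estimated_recovery: Optional[datetime] = None   # omitted when unknown


event = IncidentEvent(
    created_at=datetime.utcnow(),
    description="API latency elevated; investigating load balancer instability.",
)
```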

Wrapping up

It’s never a question of “if” there will be an outage. It’s a question of “when.”

Every incident — no matter how small — is a learning opportunity to prevent things like this from happening again, reduce its negative impact, and become better-prepared.

Disclaimer: These are my opinions, nobody else’s, and they do not reflect my employer’s.

Translated from: https://www.freecodecamp.org/news/insights-into-two-recent-service-outage-incidents-6e9a9c93c225/
