Software System Scalability: How I Scaled a Software System's Performance by 35,000%

Processing over $20,000,000 in a single day

A previous company I worked for built payment systems and giving day software intended for massive giving days, where we would receive tens of thousands of donations for a single campaign.

One of my responsibilities at that company was to scale the system and ensure it didn’t topple over. At its worst, it would crash on just 3–5 requests per second.

Due to inefficient architecture, questionable technology choices, and rushed development, it had many constraints and was a patchwork of band-aids and gaping performance gaps. A combination of magical spells and incantations would keep the server running throughout the day.

By the time I was done with the platform, it had the potential to manage several thousand requests per second and run thousands of campaigns simultaneously, all for roughly the same operational cost.

How? I’ll tell you!

Analyzing the usage patterns

Before we dive into how I optimized this system, we have to understand its usage patterns and the specific circumstances and constraints we were optimizing under — to do otherwise would be to shoot in the dark.

Giving days have defined starts and stops

RPS: Giving days started and ended suddenly.

Giving days are massive planned events, scheduled months in advance. They start and stop at very specific dates and times. Sometimes these dates are moveable; other times they are not.

There's an emphasis on sharing

During the campaign, the effort to get the word out to donate can be intense.

Our system might send out hundreds of thousands of emails at the very beginning of the day, with regular follow-up emails throughout the campaign encouraging people to visit, engage, share, and donate.

Social media links are posted everywhere on every networking platform in existence — some I've never even heard of.

There are even physical posters, booths, and flyers all around campus. Some customers even do a televised special for the entire 24–48 hour period.

Activity can be both spiky and constant

Given the above, our resource usage can best be described as both spiky and constant.

CPU: mostly constant resource usage with occasional spikes in activity.

During certain portions of the giving day, such as the very beginning of the day and during coordinated social media pushes, we can see activity spike massively. We can go from 0 requests per second to 150 requests per second for a single campaign in a fraction of a second. This lack of ramp-up has behavioral characteristics that can at times be indistinguishable from a DDoS.

Outside of those events, the resource usage is constant. We’ll see donations and activity come in as users engage with the site.

Finally, once the day ends and the campaigns all close, the activity drops as suddenly as it starts.

The advantage of foresight

Because the start / end dates are known, and we work closely with customers to figure out what their day’s game plan is, it provides a lot of predictability in our server activity. This predictability allows the load to be planned for.

If we know the various targets the customer is aiming for with their giving day, we can prepare for it by making performance optimizations and tweaking server settings to best manage the expected load. Much of this can be estimated relatively precisely through some basic calculations.

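As a rough illustration of what those basic calculations look like, here is a hypothetical back-of-envelope estimate (the donor counts, pages per visit, and peak share below are made-up planning inputs, not real customer numbers):

# Hypothetical sizing inputs gathered from the customer's goals.
expected_donors  = 20_000   # donors the customer hopes to reach
pages_per_donor  = 5        # assumed page views per donor visit
peak_share       = 0.3      # assumed fraction of traffic landing in the first hour
peak_window_secs = 3600.0

peak_rps = (expected_donors * pages_per_donor * peak_share) / peak_window_secs
puts "Estimated peak requests/sec: #{peak_rps.round(1)}"   # => 8.3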

The business side can have a large impact on ensuring system stability as well. From a business perspective, we can stagger the start / end dates of customers to ensure minimal overlap between the larger customers, improving reliability as much as possible.

What are we trying to optimize for?

Now that we know what kind of usage we have to deal with, let's briefly go over some of the metrics we had available to us. Remember — we should benchmark and measure everything we can before we optimize.

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

- Donald Knuth

As they say: “measure twice, cut once.”

For our system, we can split the metrics into two categories:

  • metrics that measured activity
  • metrics that measured performance

Measuring activity

Measuring activity is important. It’s the input to your server’s performance.

Requests per second was simple. It was a matter of asking how many requests our server was handling every second. More meant more activity.

CPU usage was another metric we kept an eye on to detect system unavailability. Intensive calculations would cause the system to back up, and the system shouldn’t be doing intensive calculations on a web request in the first place.

Memory usage was a make-it-or-break-it metric. We only had so much capacity on our servers. Some inefficient code paths were memory hogs, instantiating hundreds of thousands of objects into memory. These memory leaks were found and squashed.

Connection count was something to keep an eye on since we were using cloud providers that had limitations on connection count.

Measuring performance

The biggest measure of performance was response time. Lowering it meant we were doing well, raising it meant we were not. APM tooling like DataDog or NewRelic could show us layer-level response times, which we could use to pinpoint bottlenecks.

The holistic request response time on Heroku was technically limited by its 30-second timeout, and realistically we wanted most of our requests for customer-facing pages to finish under 3 seconds. I personally considered anything above 8 seconds an outage.

The 50th percentile was often under 100ms, because many of the requests were API endpoints that completed rapidly.

The 99th percentile could exceed 20 seconds without a problem, since some admin pages just took a while to finish.

What I truly cared about was the 95th percentile — we wanted 95% of requests to finish under 3 seconds. This 95% represented the bulk of customer requests and engagement, and reflected what donors would actually experience.

Low-hanging optimizations

Let’s take a look at what the low-hanging optimization fruits were:

  • vertical and horizontal scaling
  • N+1 queries
  • inefficient code
  • backgrounding
  • asset minification
  • memory leaks
  • co-location

Vertical and horizontal scaling

Vertical scaling

One of the first things I did was to increase the power of each server — achieving performance through vertical scaling. I gave each server more memory and processing resources to help serve and fulfill requests faster.

Here, New Relic is showing a large spike in request queue time. In this case, it was time spent waiting for more resources to be allocated to our server.

However, vertical scaling has a couple downsides. One is that there is a practical limit to how much you can vertically scale a single instance.

The second downside is that vertical scaling can get very expensive. When you don’t have infinite resources, cost becomes a major concern and a factor in determining tradeoffs.

Horizontal scaling

If one server can fulfill 10 user requests per second, then a rough ballpark estimation shows that 10 servers can fulfill 100 requests per second. It doesn’t quite scale linearly in practice, but is fine for a hypothetical. This is known as horizontal scaling.

We configured our servers to automatically scale out depending on various metrics. As servers spun up to handle any increased activity, we saw a typical small spike in wait delay/queuing time. Once the additional servers were fully spun up, traffic request queue time went down as the system adapted to the increased load.

As activity increased, we automatically spun up more servers, which allowed us to handle the increased activity.

A couple of challenges

Horizontal scaling wasn’t entirely a smooth ride.

There were a lot of practices in the codebase that were not thread-safe. For example, using class instance variables as shared state was hugely popular, which caused threads to overwrite each other. I had to spend a lot of time going through it and modifying algorithms and code to manage data in a manner that was safe for a multi-threaded environment.

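To make the problem concrete, here is a hedged sketch of the kind of pattern involved; the class and method names are hypothetical stand-ins, not the actual code:

# Unsafe: class-level instance variables are shared by every thread in the process.
class DonationTotaler
  def self.total_for(campaign)
    @amounts = []                                  # shared mutable state
    campaign.donations.each { |d| @amounts << d.amount }
    @amounts.sum                                   # another thread may have clobbered this
  end
end

# Safer: keep the state local to the call so each thread works on its own objects.
class SafeDonationTotaler
  def self.total_for(campaign)
    campaign.donations.map(&:amount).sum
  end
end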

I also had to implement better connection pooling and management techniques — we would often run out of connections to our various stores because many were hard-coded and established direct connections upon instantiation, which meant the application instance would be unable to process any transactions if no connections were available.

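A minimal sketch of the direction this took, using the connection_pool gem around a Redis client (the pool size and timeout are placeholders rather than our production values):

require 'connection_pool'
require 'redis'

# One bounded, shared pool instead of each object opening its own connection.
REDIS_POOL = ConnectionPool.new(size: 10, timeout: 5) do
  Redis.new(url: ENV['REDIS_URL'])
end

# Callers borrow a connection only for the duration of the block.
REDIS_POOL.with do |redis|
  redis.get('campaign:12345:total_raised')
end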

Scaling on Heroku

While you can and should set up scaling on other platforms, we were using Heroku, and Heroku makes scaling easy.

You can control the number of dynos as well as increase the power of each individual dyno. If you need more fine-grained control, easily integrated vendors like HireFire provide scaling configuration options that give you that power and flexibility.

There are also things you can set up around web server concurrency. We were using Puma, which had options to change not just the number of worker processes via the WEB_CONCURRENCY flag, but also the number of threads in each process.

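A typical config/puma.rb along those lines looks roughly like this (the numbers are illustrative defaults, not our tuned settings):

# config/puma.rb
workers Integer(ENV.fetch('WEB_CONCURRENCY', 2))    # worker processes per dyno

max_threads = Integer(ENV.fetch('RAILS_MAX_THREADS', 5))
threads max_threads, max_threads                    # threads per worker process

preload_app!                                        # load the app once, then fork workers

port        ENV.fetch('PORT', 3000)
environment ENV.fetch('RACK_ENV', 'development')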

The results

The combination of customizable vertical and horizontal scaling gave us significant flexibility in preparing the site for the various performance characteristics.

This was a long-term effort. I had to play around a lot with the scaling thresholds until we settled on a set that balanced cost, performance, and resource usage at acceptable levels. Since acceptable levels vary with a company and its circumstances, I recommend making it a practice to constantly test and tune your scaling configuration.

N+1 Queries

N+1 queries are queries that require other queries to get a complete picture of the data. They are often a result of inattention to data retrieval considerations or architecture issues.

For example, suppose you have an endpoint that needs to return donations and the donors that donated. An N+1 query might be hidden within it — first a query must be made to retrieve all of the donations, and then for each donation, the donor record must also be retrieved.

Oftentimes, the additional query will be hidden in a serializer behind a retrieval, especially with Ruby on Rails:

class DonationsController
  def index
    donations = Donation.all
  end
end

class DonationSerializer
  belongs_to :donor
  # This will result in a N+1 query (see above)
  # because the query it is being used on doesn't load donors.
end

The solution to N+1 queries usually involves eager loading the related records and ensuring that it is fetched in the initial query:

Donation.all.includes(:donor)

Finding the hidden N+1 queries reduced our response times, sometimes drastically.

Inefficient code

There were a lot of instances in the code where it was doing resource-intensive things when it didn’t need to.

Moving to faster libraries

Some of the libraries available out there are very slow.

For serialization, using faster libraries like oj can go a long way towards improving performance when serializing larger collections.

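For example, if the oj gem is available, wiring it in on Rails can be as small as the following (a sketch that assumes oj is already in the Gemfile):

# config/initializers/oj.rb
require 'oj'

# Swap Rails' default JSON encoding/decoding for Oj's faster,
# Rails-compatible implementation.
Oj.optimize_rails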

Streaming

We dealt a lot with Excel spreadsheets and other bulk data reports and uploads. A lot of the code was initially written to first load the entire spreadsheet into memory and then manipulate it, which could consume significant amounts of time, CPU, and memory.

A lot of the pre-existing code tried to be smart and optimize without truly understanding the problem at hand. These solutions would often work by loading the entire sheet into memory and pushing things into a memory cache, which caused significant issues since the sheet was still in memory. They treated a symptom, not the cause, which let the issue fester.

I had to rewrite a lot of code and its algorithms to support streaming, minimizing the memory and CPU footprint. Because the algorithms and code no longer had to hold the whole spreadsheet in memory, this had a significant effect on speeding things up.

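As a simplified illustration of the idea, using a CSV upload rather than our actual Excel-handling code (process_donation is a stand-in for whatever work each row needs):

require 'csv'

def process_donation(row)
  # stand-in for whatever work each row actually needs
end

# Before: slurps every row into memory at once.
rows = CSV.read('donations_upload.csv', headers: true)
rows.each { |row| process_donation(row) }

# After: streams one row at a time, keeping the memory footprint flat.
CSV.foreach('donations_upload.csv', headers: true) do |row|
  process_donation(row)
end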

Moving collection traversal to the database

There was a lot of code doing work in the application that the database could easily handle. Examples include iterating over thousands of records to add something up instead of calculating the sum in the database, or eager-loading entire documents to access a single field.

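A small ActiveRecord illustration of the pattern (the model is hypothetical; the principle is the one described above):

# In the application: instantiates every donation object just to add up one column.
total = campaign.donations.map(&:amount).sum

# In the database: a single SUM() query, no objects instantiated.
total = campaign.donations.sum(:amount)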

One specific code optimization I made involved replacing a long-running calculation that took multiple seconds and ran multiple queries with a single aggregate database query.

The query in question was pulling every single user that made a donation, iterating through each record, pulling the associated tags from that user (e.g. "Student", "Alumni", etc.), combining them all, and then reducing the result into a set of distinct tags.

It looked something like below:

def get_unique_tags
  all_tags = []
  @cause.donations.each { |donation|
    donation.cause.account.tags.each { |tag|
      all_tags << tag if donation.tags.include?(tag.value)
    }
  }
  unique_tags = []
  all_tags.each { |tag|
    unique_tags << tag unless unique_tags.include?(tag)
  }
  unique_tags
end

This code, hidden in the deepest part of the campaign page render lifecycle, was being called on every single request.

Much of the time spent loading the campaign page was spent in the database (brown).

For smaller giving days with only a couple of tags, this was never an issue. However, something new that year was that some of our larger clients had uploaded tens of thousands of different tags during the giving day.

I moved that logic into a single aggregation query and, as you can see below, the results were instantaneous:

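Conceptually, the replacement looked like the sketch below: a single query that lets the database do the joining and de-duplication. The association and column names here are hypothetical, not the real schema, and the real query also filtered account tags against each donation's own tag values:

def get_unique_tags
  # One round trip: join donations to their account's tags and have the
  # database return only the distinct tag values.
  @cause.donations
        .joins(cause: { account: :tags })
        .distinct
        .pluck('tags.value')
end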

A code optimization I did reduced the load time of most campaign pages to 447ms, down from 2500ms.

Backgrounding

Some things don’t need to happen immediately within a web request — things like sending emails can be delayed for a few seconds or handled by a different part of the system entirely.

Known as "backgrounding", this takes steps that would otherwise be executed sequentially within the request and runs them in parallel.

If you’re able to make a part of a request cycle asynchronous, that means the response will return to the user faster, resulting in fewer resources being used.

I backgrounded everything that wasn’t critical to the core lifecycle: email sending, uploading, report generation, etc.

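In Rails terms this mostly meant pushing work through ActiveJob instead of doing it inline; a sketch (ReceiptMailer and ReportGenerationJob are stand-ins for the kinds of work we deferred):

# Inline: the donor's request waits while the email is rendered and sent.
ReceiptMailer.donation_receipt(donation).deliver_now

# Backgrounded: enqueue and let the response return immediately.
ReceiptMailer.donation_receipt(donation).deliver_later

# The same applies to heavier work such as report generation.
ReportGenerationJob.perform_later(campaign.id)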

Asset minification

It turns out that a lot of our front-end assets weren’t gzipped or optimized. This was a fairly easy change that improved load times by as much as 70% for those assets.

We had a deployment script that would push our front-end assets to AWS S3. All I had to do was also generate and upload compressed, gzipped versions of them, while telling S3 to serve gzip by setting the content encoding and content type.

A Webpack configuration like below would do this:

plugins.push(new CompressionPlugin({
  test: /\.(js|css)$/,
}));

let s3Plugin = new S3Plugin({
  s3Options: {
    accessKeyId: <ACCESS_KEY_ID>,
    secretAccessKey: <SECRET_ACCESS_KEY>,
    region: <REGION>
  },
  s3UploadOptions: {
    Bucket: <BUCKET>,
    asset: '[path][query]',
    ContentEncoding(fileName) {
      if (/\.gz/.test(fileName)) {
        return 'gzip'
      }
    },
    ContentType(fileName) {
      if (/\.css/.test(fileName)) {
        return 'text/css'
      }
      if (/\.js/.test(fileName)) {
        return 'text/javascript'
      }
    }
  },
});

plugins.push(s3Plugin);

Memory Leaks

I spent a significant amount of time hunting down memory leaks, which greatly crippled performance (curse you, R14 errors) when we started hitting swap memory.

We did the traditional "restart the server at a specific frequency" band-aid while we hunted down the actual cause of the leaks. I tweaked settings aggressively: we changed garbage collection timings, swapped our serializer libraries, and even switched Ruby's memory allocator to jemalloc.

The subject of memory leaks is an article all on its own, but here are two very helpful links to save you time and effort:

  • How I spent two weeks hunting a memory leak in Ruby
  • Improve your ruby application's memory usage and performance with jemalloc

Co-location

Certain services we were using were hosted in regions different from the one where our servers were located.

Our servers were in N. Virginia (us-east-1), but some services such as S3 were in Oregon (us-west-2). When a workflow that executed many operations had to communicate with that service, the resulting latency added up quickly.

A few milliseconds here and a few milliseconds there add up quickly. By ensuring our services were located in the same region, we got rid of that unnecessary latency, greatly speeding up queries and operations.

Pareto strikes again

The sections above illustrate the various performance levers I pulled to improve performance. However, as I quickly discovered, they were only the low-hanging fruit.

While tweaking and pulling the levers led to significant performance and stability improvements, it quickly became apparent that there was a single part of the system that was responsible for a vast majority of the performance, stability, and scaling issues. It was the 80/20 rule in full force.

This was the bottleneck. This was my white whale.

Anatomy of downtime

Shortly after I joined, towards the end of one giving day, we suddenly received a massive spike of error alerts and frantic messages from our customer success team.

The SOS was clear: the site was down and unusable.

The pale green section is request queuing time.

The graph above illustrates what happened: significantly increased load rendered the site unusable for a long period of time.

As the database usage went up (yellow area), the amount of time each request took to process also went up, causing other requests to start backing up and queuing (pale-green area).

What was impressive was the speed at which downtime occurred. Things backed up very, very quickly. All signals were fine during the day, and then suddenly the server was overwhelmed.

Outdated Incident Response Playbooks

We performed the standard operating procedure at the time, which was to spin up more servers.

Unfortunately, that had zero impact — increasing the number of application servers didn’t solve the issue since all of the web requests were being delayed by extensive calculations.

Counter-intuitively, it actually made the issue even worse — providing more requests to the server put even more strain on the database.

What caused it?

What happened? We had a cache system, which, by all accounts, had been working fine.

Digging deeper, I found multiple glaring issues with how caching was implemented: significant holes that made the caching system the single point of failure for the entire platform.

Cache is King

Let’s dive into how our caching system worked.

class Campaign
  cache_fields :first_name, :total_raised

  def total_raised
    # ...complex calculation here
  end
end

cache_fields would call a mixin that would wrap the property access in a function that would first look at the cache before attempting to access the property (or function result).

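I don't have the original mixin to show, but conceptually it behaved like this reconstruction: each declared field gets wrapped so the cache is consulted before the real calculation runs (a sketch, not the actual implementation; in this version the macro must be called after the method it wraps is defined):

module CacheFields
  def cache_fields(*fields)
    fields.each do |field|
      uncached = "uncached_#{field}"
      alias_method uncached, field

      # Redefine the reader so it checks the cache first.
      define_method(field) do
        Rails.cache.fetch("#{self.class.name}:#{id}:#{field}") do
          public_send(uncached)    # cache miss: run the expensive calculation
        end
      end
    end
  end
end

class Campaign < ApplicationRecord
  extend CacheFields

  def total_raised
    donations.sum(:amount)         # placeholder for the complex calculation
  end

  cache_fields :total_raised
end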

However, what would happen if a value wasn’t present in the Redis cache for one reason or another?

Dealing with cache misses

Like all cache misses, it would attempt to recalculate the value in real-time and provide it, saving the newly calculated value to the cache.

However, this had some problems. If there was a cache-miss, requests would force resource-intensive calculations during a high-load time.

It was clear the previous developer had thought about this — the code already had an attempt at a solution in place: scheduled caching.

Caching on a schedule

Every 5 minutes, a CacheUpdateJob would be run that would update all of the fields that were set to be cached.

This caching system worked well in theory — by caching regularly, the system would be able to keep things in the cache.

However, it had a bunch of problems in practice, which we found out during several of our giving days.

Cache updates

A primary cause of issues was the timing at which the cache was populated and updated.

CacheUpdateJob would run every 5 minutes, dutifully calculating values and setting expirations for 5 minutes from the time of calculation.

This was a hidden problem. It essentially guaranteed that CacheUpdateJob would always be updating only after a value fell out of the cache.

Dog-piling on cache misses

When users attempted to access a value after a value fell out of the cache but before the CacheUpdateJob could cache the new value, it would result in a cache-miss, which then caused it to be calculated in real-time.

This was acceptable for a low volume of people, but on major giving days, it would perform the recalculation for every single request.

Cache failures led to increased 500 Internal Server Error responses — a result of timeouts.

After a cache miss, up until the point where one request succeeded in inserting the value into the cache, every request accessing that data would perform a resource-intensive query, significantly increasing usage, especially on the database CPU.

For a value that was intensive to calculate, that meant it could quickly clog up the resources of the database:

When multiple cache misses occurred, the database could get overwhelmed quickly.

User behavior then compounded the problem and made the whole thing even worse. When a user encountered a delay, they would refresh the page and try again, causing even more additional load:

Long-running database queries, retried repeatedly, caused us to lose our ability to read from the database.

The first third of the solution — vertical scaling

One of the first solutions I implemented was vertical scaling — improving the resourcing of the database.

Scaling the database was only a band-aid to the problem. At some point of increased load, we would once again encounter this issue.

It was also an expensive solution — spending thousands of dollars to vertically scale the database cluster was not a reasonable spend.

The second third of the solution — horizontal scaling

We had a database cluster where the read replicas weren't being used at all. We could transition long-running reports and other queries that weren't time-sensitive to run on the read replicas instead of the primary, distributing the load across the entire cluster instead of concentrating it on one node.

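On a recent Rails version, routing that work to a replica can look like the sketch below (it assumes a reading role is configured in database.yml; the mechanism we actually used may have differed):

# Run a long, non-time-sensitive report against the read replica
# instead of the primary.
ActiveRecord::Base.connected_to(role: :reading) do
  DonationReport.generate_for(campaign)   # hypothetical reporting class
end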

The final third of the solution — prevent race conditions

We needed a way to keep the system from overloading itself by stopping it from recalculating the exact same data over and over.

I solved this by adding the capability to return stale data if multiple requests requested a cache regeneration at the same time.

Only a single request would cause a recalculation, and the rest would serve the stale data until that calculation was done instead of triggering the same calculation over and over.

Rails supported this through a combination of the race_condition_ttl and expires_in parameters:

Rails.cache.fetch(cache_key,
                  race_condition_ttl: 30.seconds,
                  expires_in: 15.minutes)

The trains weren't running on time

As we grew in success, so did the number of campaigns we ran. This in turn made the CacheUpdateJob take longer and longer to run through the thousands of campaigns.

One giving day, I was notified of a potential bug encountered by the team. They had queued up emails hours ago, and nobody had received them. I checked and realized that the queue, which traditionally held only a few jobs, had hundreds of thousands of jobs in it — all CacheUpdateJob.

Further investigation showed what had happened. CacheUpdateJob had gotten to the point where the job would take longer to run than the frequency at which it ran.

This meant that while CacheUpdateJob ran every 5 minutes, it would take more than 10 minutes to finish. During this time, values were falling out of cache, and jobs were stacking up in the queue. It also meant CacheUpdateJob was running all the time, racking up fairly significant usage charges.

It was preventing all of the other jobs from going through.

Separating into multiple queues

The solution here was to separate the various jobs we had into multiple queues that we could scale independently.

Mailers and other user-triggered bulk jobs were placed in one queue. Transactional jobs were placed in another. Expensive reporting jobs were placed in a third queue. Jobs that kept the system running, like CacheUpdateJob, were placed in a highly resourced queue.

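With ActiveJob, that separation is mostly a matter of assigning each job class to its own queue and then scaling workers per queue; a sketch (the queue names are illustrative):

class CacheUpdateJob < ApplicationJob
  queue_as :system          # keep-the-lights-on work gets a well-resourced queue
end

class BulkMailerJob < ApplicationJob
  queue_as :mailers         # user-triggered bulk email
end

class PaymentFollowUpJob < ApplicationJob
  queue_as :transactional   # anything tied to money movement
end

class CampaignReportJob < ApplicationJob
  queue_as :reports         # slow, expensive reporting work
end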

This helped ensure that backups in any one queue didn’t greatly impact the rest of the system, and provided us the ability to turn off unneeded parts of the system in the event of an emergency.

Separating trigger from execution

One of the other changes we made was to ensure CacheUpdateJob didn’t do the work itself, and passed that responsibility onto other jobs that it queued. This also gave us the ability to check for the existence of a repeat job prior to enqueuing it. If we already had a cache update queued up for a campaign, there was no sense in adding a second job to the queue to cache the same campaign.

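A rough sketch of that split, with a short-lived dedupe key standing in for whatever bookkeeping the real job used (Campaign.active and CampaignCacheRefreshJob are hypothetical names):

class CacheUpdateJob < ApplicationJob
  queue_as :system

  # This job only decides what needs refreshing; the actual cache work happens
  # in CampaignCacheRefreshJob, which can be scaled independently.
  def perform
    Campaign.active.find_each do |campaign|
      # Claim a short-lived dedupe key; skip if a refresh is already pending.
      claimed = Rails.cache.write("cache-refresh-pending:#{campaign.id}", true,
                                  expires_in: 5.minutes, unless_exist: true)
      CampaignCacheRefreshJob.perform_later(campaign.id) if claimed
    end
  end
end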

This ensured that we could parallelize and independently scale the processing of the cache updates from the thing that triggered them, and do so in an optimal way.

Batching where needed

I realized that the overhead of separating into individual jobs was negating some of the benefits of splitting them out in the first place.

We implemented batching so that the CacheUpdateJob didn’t create a new job every single record, but grouped records in customizable groups of around 100 or so. This ensured that batches were small and completed quickly, while still giving us the separation we were looking for.

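In ActiveRecord terms, the batched fan-out looked roughly like this (the batch size is shown as a constant for illustration; ours was configurable, and the job name is hypothetical):

BATCH_SIZE = 100

Campaign.active.in_batches(of: BATCH_SIZE) do |batch|
  # One job per batch of ~100 campaigns instead of one job per campaign.
  CampaignCacheRefreshBatchJob.perform_later(batch.ids)
end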

Caching only what was needed

We also looked at the CacheUpdateJob and realized it was updating caches indiscriminately — even campaigns that had run years ago were being cached.

I created a settings mechanism to allow us to determine the frequency at which things were cached for each campaign.

For older campaigns that weren't accessed frequently, we didn't bother to update those values. For campaigns in the middle of active giving days, we updated more frequently and gave them a higher caching priority.

Running out of memory

As we ran more giving days, we started seeing more and more success as a business. The increased business meant that previously acceptable memory allocations were suddenly reaching their limits.

This meant that at a certain point, we would suddenly start seeing failures when adding items to the cache, which would bring the whole house of cards down.

Key evictions

We identified one of the causes — our cache server was not configured correctly.

Our key eviction policy was set to never evict, and instead to throw an error once the memory limit was reached. This was what was biting us when increased load pushed us up against our memory limits.

The solution seemed simple — set the key eviction policy on our Redis cache server to volatile-lru. In theory, this would ensure that only keys with a TTL would ever be evicted.

If only it were that easy

This led to other challenges the system was never designed for. We had a lot of values that depended on other values to recalculate, and those values in turn were being used to calculate other values.

Because caching was built ad-hoc and rather haphazardly, some of these items were expected to be cached and others were not, and they all had different TTLs.

The behavior of evicting a key that hasn’t been used in a while could trigger a cascade of regeneration failures, grinding the system to a halt.

We had a conundrum:

  • we needed to evict keys to ensure we didn't run out of memory
  • if we evicted arbitrary keys, we would cause value regeneration failures
  • architecturally, we couldn't transition off of these queries
  • we were constrained by operating costs, so we couldn't scale through $

This seemingly intractable problem had a simple, albeit hacky solution.

Fallback caches

I implemented fallback caches in the database layer.

For every field we cached via cache_fields, we also added an accompanying timestamp and cache value:

cache_fields :total_raised

The cache_fields function would create and update two extra properties every time the cached field was updated:

  • cached_total_raised
  • cached_timestamp_total_raised

Whenever the value wasn’t found in the Redis cache, it would use the value stored in the database, which never expired. The resulting fetch was slower than fetching from Redis, but much, much faster than recalculating.

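Conceptually the read path became a three-level lookup: Redis first, then the persisted fallback column, then a full recalculation. A reconstruction of what that accessor might look like (not the actual code; calculate_total_raised is a placeholder for the expensive query):

def total_raised
  # 1. Fast path: the Redis-backed cache.
  cached = Rails.cache.read("campaign:#{id}:total_raised")
  return cached unless cached.nil?

  # 2. Fallback: the copy persisted on the record itself, which never expires.
  return cached_total_raised unless cached_total_raised.nil?

  # 3. Last resort: recalculate, then persist to both layers.
  value = calculate_total_raised
  Rails.cache.write("campaign:#{id}:total_raised", value, expires_in: 5.minutes)
  update_columns(cached_total_raised: value,
                 cached_timestamp_total_raised: Time.current)
  value
end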

If there was no cached value in the database, it would recalculate the value.

This ensured that in almost every case, a cached value was present in one form or another, preventing the calculation from running except when the value was forcibly updated by the CacheUpdateJob or requested to be manually updated by the customer success team.

Stale caches

All of this caching caused a problem — we would often encounter data that was stale and no longer accurate. We often couldn't tell what level it was cached at.

A small example

A situation we encountered will show you some of the consequences of this.

Account.find('12345a').campaigns.limit(10)
Account.find('12345a').campaigns.limit(20)

Due to what I can only describe as overly-aggressive query caching or a bug in the ORM, the above commands returned the same results if run in succession.

If you ran the following immediately after, you got even more interesting results:

Account.find('12345a').campaigns.limit(20).count
Account.find('12345a').campaigns.limit(20).to_a.length

Oddly, the #count would return 20, but the #to_a would return 10.

It made for terrible user experiences

From a user experience perspective, it was unacceptable. When people made a donation, they’d expect to see the new donation reflected in the total immediately. They don’t think “oh, this system must have cached the previous value.”

Likewise, the cache would have to update frequently enough to track progress of the fundraiser. The Customer Success Management team was in close communication with the clients every single day, and had to give progress reports. They couldn’t do that if the reports were outdated.

It made for some very serious potential errors

Imagine if you were scoping a collection for a bulk delete. You’d think you were deleting the 20 records, but you were actually deleting the prior set of records that a similar query returned.

That’s the stuff of nightmares, and I hope you have good backups and audit tables.

Solution — cache bust tools

I built multiple tools that customer success could use to force cache refreshes to occur on a special queue. This would ensure whenever they needed the most recent data, they would have it.

By changing the cached property accessor to accept and use an optional set of parameters, I could now force a cache refresh any time I wanted:

@campaign.total_raised(force_refresh: true)

In freshness-sensitive operations, it would ensure I was dealing with the right kind of data each time.

I also made sure that features like critical reporting used thin cache layers and leveraged recent data as much as possible.

The end result

By the end of all of the optimizations, we had a system that could handle the next order of magnitude of load we anticipated — 2000+ requests per second, thousands of concurrent campaigns. Most donor-facing endpoints were loading in less than 50ms, with customer-facing pages loading within 300ms.

It was a long journey, with many high-pressure deployments, but the end result spoke for itself. We finally had a system that we could ignore during a giving day — for the most part.

Original article: https://medium.com/swlh/how-i-scaled-a-software-systems-performance-by-35-000-6dacd63732df
