It began, as so many investigations do, with a bug report.

The name of the bug report was simple enough: “iter_content slow with large chunk size on HTTPS connection”. This is the kind of bug report title that immediately fires alarm bells in my head, for two reasons. Firstly, it’s remarkably difficult to quantify: what does “slow” mean here? How slow? How large is “large”? Secondly, it’s the kind of thing where, if the effect were severe, it seems we’d have heard about it by now. We’ve had the iter_content method for a very long time: surely a meaningful slowdown in a reasonably common use mode would already have been reported.

Quickly leaping into the initial report, the original reporter provides relatively little detail, but does say this: “This causes 100% CPU and slows down throughput to less than 1MB/s.”. That catches my eye, because it seems like it cannot possibly be true. The idea that merely downloading data with minimal processing could be that slow: surely not!

However, all bugs deserve investigation before they can be ruled out. With some further back-and-forth between myself and the original poster, we got ourselves to a reproduction scenario: if you used Requests with PyOpenSSL and ran the following code, you’d pin a CPU to 100% and find your data throughput had dropped down to extremely minimal amounts:

import requests

https = requests.get("https://az792536.vo.msecnd.net/vms/VMBuild_20161102/VirtualBox/MSEdge/MSEdge.Win10_preview.VirtualBox.zip", stream=True)
for content in https.iter_content(100 * 2 ** 20):  # 100MB chunk size
    pass

This is a great repro scenario, because it points the finger so clearly into the Requests stack. There is no user-supplied code running here: all of the code is shipped as part of the Requests library or one of its dependencies, so there is no risk that the user wrote some wacky low-performance code. This is really fantastic. Even more fantastic, this is a repro scenario that uses a public URL, which means I can run it! And when I did, I reproduced the bug. Every time.

There was one other bit of tantalising detail:

At 10MB, there is no noticeable increase in CPU load, and no impact on throughput. At 1GB, CPU load is at 100%, just like with 100MB, but throughput is reduced to below 100KB/s, compared to 1MB/s at 100MB.

This is a really interesting data point, because it implies the literal value of the chunk size is affecting the workload. When we combine this information with the fact that the problem only occurs with PyOpenSSL, and the fact that the stack spends most of its time in a single buffer-allocation line of code, the problem becomes clear.
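That buffer-allocation hot spot can be mimicked with nothing but the standard library: ctypes' create_string_buffer, like CFFI's FFI.new discussed below, hands back zero-initialized memory, and producing those zeroes costs time linear in the buffer size. A minimal sketch (the 100MB size mirrors the repro above):

```python
import ctypes

# Allocate a fresh, zero-initialized receive buffer, as the hot path did for
# every single chunk. Like CFFI's FFI.new("char[]", n), ctypes zeroes it.
bufsiz = 100 * 2 ** 20  # the 100MB chunk size from the repro
buf = ctypes.create_string_buffer(bufsiz)

# Every byte reads back as zero -- and writing all those zeroes is work
# proportional to bufsiz.
sample = ctypes.string_at(buf, 16)
```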

Some further investigation determined that CFFI’s default behaviour for FFI.new is to return zeroed memory. This meant that there was linear overhead in the allocation size: for bigger allocations we had to spend more time zeroing data. Hence the bad behaviour with large allocations. We used a CFFI feature to disable the zeroing of memory for these buffers, and the problem went away. Problem solved, right?

Wrong.

The Real Bug

All joking aside, this genuinely did resolve the problem. However, a few days later, Nathaniel Smith asked a very insightful question: why was the memory being actively zeroed at all? To understand this question, we need to digress a bit into memory allocation in POSIX systems.

mallocs and callocs and vm_allocs, oh my!

Many programmers are familiar with the standard way to ask your operating system for memory. That mechanism is through using the C standard library function malloc (you can read documentation about it on your system by typing man 3 malloc for the manual page). This function takes a single argument, a number of bytes to allocate memory for. The C standard library will use one of many different strategies for allocating this memory, but one way or another it will return a pointer to a bit of memory that is at least as large as the amount of memory you asked for.

By the standard, malloc returns uninitialized memory. This means that the C standard library locates some memory and immediately passes it to your program, without changing what is already there. This means that in standard use malloc can and will return a buffer to your program that your program has already written data into. This behaviour is a common source of nasty bugs in languages that are memory-unsafe, like C, and in general reading from uninitialized memory is an extremely dangerous pattern.

However, malloc has a friend, documented right alongside it in the same manual page: calloc. Now, calloc’s most obvious difference from malloc is that it takes two arguments: a count, and a size. That is, when you call malloc you ask the C standard library “please allocate at least n bytes”, whereas when you call calloc you ask the C standard library “please allocate enough memory for n objects of size m bytes”. It is clear that the original intent of calloc is to allocate heap memory for arrays of objects in a safe way.
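The distinction is easy to poke at from Python with ctypes; a minimal sketch, assuming a POSIX system where ctypes.CDLL(None) resolves the C library’s symbols:

```python
import ctypes

libc = ctypes.CDLL(None)  # POSIX: dlopen(NULL) exposes libc's symbols
libc.calloc.restype = ctypes.c_void_p
libc.calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

# "Please allocate enough memory for 8 objects of size 16 bytes" -- zeroed.
p = libc.calloc(8, 16)
contents = ctypes.string_at(p, 8 * 16)
libc.free(p)

# Unlike malloc(count * size), calloc checks the multiplication for overflow
# and fails cleanly (NULL, which ctypes maps to None) rather than returning
# a too-small buffer.
overflowed = libc.calloc(2 ** 63, 2 ** 63)
```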

But calloc has an extra side effect, related to its original purpose to allocate arrays, and mentioned very quietly in the manual page:

The allocated memory is filled with bytes of value zero.

This goes hand-in-hand with calloc’s original purpose. If you were, for example, allocating an array of values, it is often very helpful for your array to begin in a default state. Some modern, memory-safe languages have actually adopted this as the default behaviour when building arrays and structures. For example, in the Go programming language, when you initialize a structure all of its members are defaulted to their so-called “zero” values, which are basically equivalent to “what their value would be if everything were set to zero”. This can be thought of as a promise that all Go structures are allocated using calloc.

This behaviour means that while malloc returns uninitialized memory, calloc returns initialized memory. And because it does so, and has these strict promises, the operating system can optimise it. And indeed, most modern operating systems have optimised it.

Let’s go calloc

Of course, the simplest way to implement calloc is to write it like this:

void *calloc(size_t count, size_t size) {
    assert(!multiplication_would_overflow(count, size));

    size_t allocation_size = count * size;
    void *allocation = malloc(allocation_size);
    memset(allocation, 0, allocation_size);
    return allocation;
}

The cost of a function like this is clearly approximately linear in the size of the allocation: setting each byte to zero is clearly going to become increasingly expensive as you have more bytes. Now, in fact, most operating systems ship C standard libraries that have optimised paths for memset (usually by taking advantage of specialised CPU vector instructions to zero a lot of bytes in each instruction), but nevertheless the cost of doing this is linear.

But operating systems have another trick up their sleeve for larger allocations, and that’s to take advantage of virtual memory tricks.

Virtual Memory

Fully explaining virtual memory is going beyond the scope of this blog post, unfortunately, but I highly recommend you read up about it (it’s very interesting stuff!). However, the short version of “virtual memory” is that the operating system kernel will lie to processes about memory. Each process running on your machine has its own view of memory which belongs to it and it alone. That view of memory is then “mapped” onto physical memory indirectly.

This allows the operating system to perform all kinds of clever trickery. One very common form of clever trickery is to have some bits of “memory” that are actually files. This is used when swapping memory out to disk, and is also used when memory-mapping a file. In the case of memory mapping a file, a program will ask the operating system: “please allocate n bytes of memory and back those bits of memory with this file on disk, such that when I write to the memory those writes are written to the file on disk, and when I read from that memory those reads come from the file on disk”.
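Python’s stdlib mmap module exposes exactly this request; a small sketch against a throwaway temporary file:

```python
import mmap
import os
import tempfile

# Create a small file, then ask the OS to map its bytes into memory.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello, mapped world")

mm = mmap.mmap(fd, 0)      # map the whole file, readable and writable
greeting = bytes(mm[:5])   # reads are served from the file's pages
mm[:5] = b"HELLO"          # writes propagate back to the file
mm.flush()
mm.close()
os.close(fd)

with open(path, "rb") as f:
    on_disk = f.read()     # the mapped write is now visible on disk
os.unlink(path)
```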

The way this works, at a kernel level, is that when the process tries to read that memory the CPU will notice that the memory the process is reading does not actually exist, will pause that process, and will emit a “page fault”. The operating system kernel will be invoked and will act to bring the data into memory so that the application can read it. The original process will then be unpaused and find, from its perspective, that no time has passed and that magically the bytes of the file are present in that memory location.

This mechanism can be used to perform other neat tricks. One of them is to make very large memory allocations “free”; or, more properly, to make them expensive only in proportion to how much of that memory is used, rather than in proportion to how much of that memory is allocated.

The reason for doing this is that historically many programs that needed a decent chunk of memory during their lifetime would, at startup, allocate a massive buffer of bytes that it could then parcel out internally over the program’s lifetime. This was done because the programs were written for environments that do not use virtual memory, and so they needed to call dibs on a certain amount of memory to avoid getting starved out. But with virtual memory, such a policy is no longer needed: each program can allocate as much memory as it needs on-demand, they can no longer starve each other out.

So, to help avoid these applications from having very large startup costs, operating systems started to lie to them about large allocations. On most operating systems, if you attempt to allocate more than about 128 kilobytes in one call, the C standard library will ask the operating system directly for brand-new virtual memory pages that cover that many bytes. But, and this is key, this costs almost nothing to do. The operating system doesn’t actually allocate or commit any memory at this point: it just sets up some virtual memory mappings. This makes this operation extremely cheap at the time of the malloc call.
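This lazy behaviour is visible from a process’s resident set size; a sketch, assuming Linux (where getrusage reports ru_maxrss in KiB):

```python
import mmap
import resource

def rss_kib():
    # Peak resident set size; KiB on Linux (macOS reports bytes instead).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

size = 256 * 1024 * 1024              # ask for 256 MiB...
before = rss_kib()
buf = mmap.mmap(-1, size)             # ...of brand-new anonymous pages
after_reserve = rss_kib()

for offset in range(0, size, 4096):   # now actually touch every page
    buf[offset] = 1
after_touch = rss_kib()
buf.close()

reserve_cost = after_reserve - before  # tiny: only mappings were set up
touch_cost = after_touch - before      # ~256 MiB: pages faulted in on use
```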

Of course, because that memory hasn’t been “mapped in” to the process, the moment the application tries to actually use that memory a page fault will occur. At this point, the operating system will find an actual page of memory and put it in place, much like the page fault for a memory-mapped file (except the virtual memory will be backed by physical memory, instead of by a file).

The net result of this is that on most modern operating systems, if you call malloc(1024 * 1024 * 1024) to allocate one gigabyte of memory, that call will return almost immediately, because nothing has actually been done yet to truly give your process that memory. As a result, a program that allocates many gigabytes of memory that it never actually uses will execute quite quickly, so long as those allocations are quite large.

What is potentially more surprising is that the same optimisation can be made for calloc. This works because the operating system can map a brand new page to the so-called “zero page”: a page of memory that is read-only, and reads entirely as zeroes. This mapping is initially copy-on-write, which means that when your process eventually tries to write to its brand new memory map the kernel will intervene, copy all those zeroes into a new page, and then apply your write.
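The zero-page trick is observable directly: brand-new anonymous memory reads entirely as zeroes without a single byte ever having been written, and the first write is what triggers the copy. A minimal sketch:

```python
import mmap

# Fresh anonymous memory: until written, the kernel can back every page
# with the shared read-only zero page.
size = 64 * 1024 * 1024
buf = mmap.mmap(-1, size)

first, last = buf[0], buf[size - 1]  # two far-apart pages, both read as 0

# Writing triggers copy-on-write: the kernel swaps in a private page
# and then applies the write.
buf[0] = 0xFF
rewritten = buf[0]
buf.close()
```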

Because the OS can do this trick, calloc can, for larger allocations, simply do the same as malloc does, and ask for brand-new virtual memory pages. This continues to cost nothing until the memory is actually used. This neat optimisation means that calloc(1024 * 1024 * 1024, 1) costs exactly the same as a malloc of the same size does, despite calloc’s additional promise of zeroing memory. Neat!

Back To The Bug

So, as Nathaniel pointed out: if CFFI was using calloc, why would the memory be being zeroed out?

Part one, of course, is that it doesn’t always use calloc. But I had a suspicion that I’d bumped into a case where I could reproduce this slowdown directly with calloc, so I went back and coded up a quick repro program. I came up with this:

This is a very simple C program that allocates and frees 100MB of memory using calloc ten thousand times, and then exits. There are two likely possibilities for what might happen here:

  1. calloc may use the virtual memory trick described above. In this case, we’d expect this program to be very fast indeed: because the memory that we allocate never actually gets used, it never gets paged in and so the pages never get dirtied. The OS does its little trick of lying to us about allocating memory, and we never call the OS’s bluff, so everything works out beautifully.
  2. calloc may use malloc and zero memory manually using memset. We’d expect this to be very, very slow: in total we need to zero one terabyte of memory (ten thousand increments of 100 MB), and that takes quite a lot of effort.

Now, this is well above the standard OS threshold for using behaviour (1), so we’d expect that behaviour. And indeed, on Linux, that’s exactly what you see: if you compile this with gcc and then run it you’ll find that it executes very quickly indeed, and causes very few page faults, and exerts very little memory pressure. But if you take the same program and run it on macOS, you’ll find it takes an extremely long time: in my testing it took nearly eight minutes.

Even more weirdly, if you make ALLOCATION_SIZE bigger (say 1000 * 1024 * 1024) then suddenly the macOS program becomes almost instantaneous! What the hell?

What is happening here?

Go Source Diving

macOS contains a neat utility called sample (see man 1 sample) that can tell you quite a lot about a running process by sampling its process state. The sample output for the program above looked like this:

Sampling process 57844 for 10 seconds with 1 millisecond of run time between samples
Sampling completed, processing symbols...
Sample analysis of process 57844 written to file /tmp/a.out_2016-12-05_153352_8Lp9.sample.txt

Analysis of sampling a.out (pid 57844) every 1 millisecond
Process:         a.out [57844]
Path:            /Users/cory/tmp/a.out
Load Address:    0x10a279000
Identifier:      a.out
Version:         0
Code Type:       X86-64
Parent Process:  zsh [1021]

Date/Time:       2016-12-05 15:33:52.123 +0000
Launch Time:     2016-12-05 15:33:42.352 +0000
OS Version:      Mac OS X 10.12.2 (16C53a)
Report Version:  7
Analysis Tool:   /usr/bin/sample
----
Call graph:
    3668 Thread_7796221   DispatchQueue_1: com.apple.main-thread  (serial)
    3668 start  (in libdyld.dylib) + 1  [0x7fffca829255]
      3444 main  (in a.out) + 61  [0x10a279f5d]
      + 3444 calloc  (in libsystem_malloc.dylib) + 30  [0x7fffca9addd7]
      +   3444 malloc_zone_calloc  (in libsystem_malloc.dylib) + 87  [0x7fffca9ad496]
      +     3444 szone_malloc_should_clear  (in libsystem_malloc.dylib) + 365  [0x7fffca9ab4a7]
      +       3227 large_malloc  (in libsystem_malloc.dylib) + 989  [0x7fffca9afe47]
      +       ! 3227 _platform_bzero$VARIANT$Haswell  (in libsystem_platform.dylib) + 41  [0x7fffcaa3abc9]
      +       217 large_malloc  (in libsystem_malloc.dylib) + 961  [0x7fffca9afe2b]
      +         217 madvise  (in libsystem_kernel.dylib) + 10  [0x7fffca958f32]
      221 main  (in a.out) + 74  [0x10a279f6a]
      + 217 free_large  (in libsystem_malloc.dylib) + 538  [0x7fffca9b0481]
      + ! 217 madvise  (in libsystem_kernel.dylib) + 10  [0x7fffca958f32]
      + 4 free_large  (in libsystem_malloc.dylib) + 119  [0x7fffca9b02de]
      +   4 madvise  (in libsystem_kernel.dylib) + 10  [0x7fffca958f32]
      3 main  (in a.out) + 61  [0x10a279f5d]

Total number in stack (recursive counted multiple, when >=5):

Sort by top of stack, same collapsed (when >= 5):
        _platform_bzero$VARIANT$Haswell  (in libsystem_platform.dylib)        3227
        madvise  (in libsystem_kernel.dylib)        438

The key note here is that we can clearly see that we’re spending the bulk of our time in _platform_bzero$VARIANT$Haswell. This method is used to zero buffers. This means that macOS is zeroing out the buffers. Why?

Well, handily, Apple open-sources much of their core operating system code sometime after release. We can see that this program spends most of its time in libsystem_malloc, so I simply went to Apple’s Open Source webpage, and downloaded a tarball of libmalloc-116, which contains the relevant source code. And then I went spelunking.

It turns out that all of the magic happens in large_malloc. This branch is used for allocations larger than about 127kB, and ultimately does use the virtual memory trick above. So why do we have really slow execution?

Well, here is where it turns out that Apple got a bit too clever for their own good. large_malloc contains a whole bunch of code hidden behind a #define constant, CONFIG_LARGE_CACHE. This whole bunch of code basically amounts to a “free-list” of large memory pages that have been allocated to the program. If a macOS program allocates a contiguous buffer of memory between 127kB and LARGE_CACHE_SIZE_ENTRY_LIMIT (approximately 125MB), then libsystem_malloc will attempt to re-use those pages if another allocation is made that could use them. This saves it from needing to ask the Darwin kernel for some memory pages, which saves a context switch and a syscall: a non-trivial savings, in principle.

However, for calloc it is naturally the case that we need those bytes to be zeroed. For that reason, if macOS finds a reusable page while servicing a calloc call, it will zero the memory. All of it. Every time.

Now, this isn’t totally unreasonable: zeroed pages are legitimately a limited resource, especially on resource constrained hardware (looking at you, Apple Watch). That means that if it is possible to re-use a page, that’s potentially a really major savings.

However, the page cache totally destroys the optimisation of using calloc to provide zeroed memory pages. That wouldn’t be so bad if it was only done to “dirty” pages: that is, if the pages that it was zeroing out had been written to by the application, and so were likely to not be zeroed. But macOS does this unconditionally. That means, if you call calloc, free, and calloc, without ever touching the memory, the second call to calloc takes the pages allocated by the first one, which were never backed by actual memory, and forces the OS to page in all that memory in order to zero it, even though it was already zeroed. This is part of what we were trying to avoid with the virtual-memory based allocator for large allocations: this memory, which was not ever used, becomes used by the free-list.

The net effect of this is that on macOS calloc has a cost that is linear in allocation size all the way up to 125MB, despite the fact that most other operating systems get O(1) behaviour after about 127kB. Over 125 MB macOS stops caching the pages, and so all of a sudden everything gets speedy and great again.

This is a really unexpected bug to find from a Python program, and it raises a number of questions. For example, how many CPU cycles are wasted zeroing memory that was already zeroed? How many context switches are wasted by forcing applications to page in memory they never used and didn’t need, just so that the OS could unnecessarily zero it?

Ultimately, though, I think this shows the truth of the old adage: all abstractions are leaky. Just because you’re a Python programmer doesn’t mean you’re able to forget that, somewhere down in the depths, you are running on a machine that is built up of memory, and trickery. Someday, your program is going to be really unexpectedly slow, and the only way to find out is to dive all the way down into your operating system to work out what silly thing it is doing to your memory.

This bug has been filed as Radar 29508271. It is hands-down one of the weirdest bugs I’ve ever encountered.

Edit 1: A previous version of this post talked about operating system kernels zeroing pages in idle time. That’s not really the way it works in modern OSes: instead, a copy-on-write version of a zero-page is used. This is another neat optimisation that allows kernels to avoid spending lots of time writing zeroes into used pages: instead, they only write the zeroes when an application actually writes data into its pages. This has the effect of saving even more CPU cycles: if an application asks for memory that it literally never touches, then it never costs anything to fill it with zeroes. Neat!

Source: https://www.pybloggers.com/2016/12/debugging-your-operating-system-a-lesson-in-memory-allocation/
