This post is translated from: Why does GCC generate 15-20% faster code if I optimize for size instead of speed?

I first noticed in 2009 that GCC (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

I have managed to create (rather silly) code that shows this surprising behavior and is sufficiently small to be posted here.

    const int LOOP_BOUND = 200000000;

    __attribute__((noinline))
    static int add(const int& x, const int& y) {
        return x + y;
    }

    __attribute__((noinline))
    static int work(int xval, int yval) {
        int sum(0);
        for (int i=0; i<LOOP_BOUND; ++i) {
            int x(xval+sum);
            int y(yval+sum);
            int z = add(x, y);
            sum += z;
        }
        return sum;
    }

    int main(int , char* argv[]) {
        int result = work(*argv[1], *argv[2]);
        return result;
    }

If I compile it with -Os, it takes 0.38 s to execute this program, and 0.44 s if it is compiled with -O2 or -O3. These times are obtained consistently and with practically no noise (gcc 4.7.2, x86_64 GNU/Linux, Intel Core i5-3320M).

(Update: I have moved all assembly code to GitHub: it made the post bloated, and it apparently adds very little value to the question, as the fno-align-* flags have the same effect.)

Here is the generated assembly with -Os and -O2.

Unfortunately, my understanding of assembly is very limited, so I have no idea whether what I did next was correct: I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os except the .p2align lines; the result is here. This code still runs in 0.38s, and the only difference is the .p2align stuff.

If I guess correctly, these are paddings for stack alignment. According to Why does GCC pad functions with NOPs?, it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.

Is it the padding that is the culprit in this case? Why and how?

The noise it makes pretty much makes timing micro-optimizations impossible. 它产生的噪声使定时微优化变得不可能。

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?


UPDATE:

Following Pascal Cuoq's answer, I tinkered a little bit with the alignments. By passing -O2 -fno-align-functions -fno-align-loops to gcc, all .p2align lines are gone from the assembly, and the generated executable runs in 0.38s. According to the gcc documentation:

-Os enables all -O2 optimizations [but] -Os disables the following optimization flags:

  -falign-functions -falign-jumps -falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition -fprefetch-loop-arrays 

So, it pretty much seems like a (mis)alignment issue.

I am still skeptical about -march=native as suggested in Marat Dukhan's answer. I am not convinced that it isn't just interfering with this (mis)alignment issue; it has absolutely no effect on my machine. (Nevertheless, I upvoted his answer.)


UPDATE 2:

We can take -Os out of the picture. The following times are obtained by compiling with:

  • -O2 -fno-omit-frame-pointer: 0.37s

  • -O2 -fno-align-functions -fno-align-loops: 0.37s

  • -S -O2, then manually moving the assembly of add() after work(): 0.37s

  • -O2: 0.44s

It looks to me like the distance of add() from the call site matters a lot. I have tried perf, but the output of perf stat and perf report makes very little sense to me. However, I could only get one consistent result out of it:

-O2:

     602,312,864 stalled-cycles-frontend   #    0.00% frontend cycles idle
           3,318 cache-misses

     0.432703993 seconds time elapsed
    [...]
     81.23%  a.out  a.out              [.] work(int, int)
     18.50%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
    [...]
           ¦   __attribute__((noinline))
           ¦   static int add(const int& x, const int& y) {
           ¦       return x + y;
    100.00 ¦     lea    (%rdi,%rsi,1),%eax
           ¦   }
           ¦   ? retq
    [...]
           ¦            int z = add(x, y);
      1.93 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
           ¦            sum += z;
     79.79 ¦      add    %eax,%ebx

For fno-align-*:

     604,072,552 stalled-cycles-frontend   #    0.00% frontend cycles idle
           9,508 cache-misses

     0.375681928 seconds time elapsed
    [...]
     82.58%  a.out  a.out              [.] work(int, int)
     16.83%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
    [...]
           ¦   __attribute__((noinline))
           ¦   static int add(const int& x, const int& y) {
           ¦       return x + y;
     51.59 ¦     lea    (%rdi,%rsi,1),%eax
           ¦   }
    [...]
           ¦    __attribute__((noinline))
           ¦    static int work(int xval, int yval) {
           ¦        int sum(0);
           ¦        for (int i=0; i<LOOP_BOUND; ++i) {
           ¦            int x(xval+sum);
      8.20 ¦      lea    0x0(%r13,%rbx,1),%edi
           ¦            int y(yval+sum);
           ¦            int z = add(x, y);
     35.34 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
           ¦            sum += z;
     39.48 ¦      add    %eax,%ebx
           ¦    }

For -fno-omit-frame-pointer:

     404,625,639 stalled-cycles-frontend   #    0.00% frontend cycles idle
          10,514 cache-misses

     0.375445137 seconds time elapsed
    [...]
     75.35%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
     24.46%  a.out  a.out              [.] work(int, int)
    [...]
           ¦   __attribute__((noinline))
           ¦   static int add(const int& x, const int& y) {
     18.67 ¦     push   %rbp
           ¦       return x + y;
     18.49 ¦     lea    (%rdi,%rsi,1),%eax
           ¦   const int LOOP_BOUND = 200000000;
           ¦
           ¦   __attribute__((noinline))
           ¦   static int add(const int& x, const int& y) {
           ¦     mov    %rsp,%rbp
           ¦       return x + y;
           ¦   }
     12.71 ¦     pop    %rbp
           ¦   ? retq
    [...]
           ¦            int z = add(x, y);
           ¦    ? callq  add(int const&, int const&) [clone .isra.0]
           ¦            sum += z;
     29.83 ¦      add    %eax,%ebx

It looks like we are stalling on the call to add() in the slow case.

I have examined everything that perf -e can spit out on my machine, not just the stats given above.

For the same executable, the stalled-cycles-frontend shows linear correlation with the execution time; I did not notice anything else that would correlate so clearly. (Comparing stalled-cycles-frontend for different executables doesn't make sense to me.)

I included the cache misses as they came up in the first comment. I examined all the cache misses that can be measured on my machine by perf, not just the ones given above. The cache misses are very, very noisy and show little to no correlation with the execution times.


#1

Reference: https://stackoom.com/question/1JhGL/如果我针对大小而不是速度进行优化-为什么GCC会生成-的更快代码


#2

I think that you can obtain the same result as what you did:

I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os except the .p2align lines:

… by using -O2 -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1. I have been compiling everything with these options for 15 years; they were faster than plain -O2 every time I bothered to measure.

Also, in a completely different context (including a different compiler), I noticed that the situation is similar: the option that is supposed to "optimize code size rather than speed" optimizes for code size and speed.

If I guess correctly, these are paddings for stack alignment.

No, this has nothing to do with the stack. The NOPs that are generated by default, and that the options -falign-*=1 prevent, are for code alignment.

According to Why does GCC pad functions with NOPs? it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.

Is it the padding that is the culprit in this case? Why and how?

It is very likely that the padding is the culprit. The reason padding is felt to be necessary, and is useful in some cases, is that code is typically fetched in lines of 16 bytes (see Agner Fog's optimization resources for the details, which vary by model of processor). Aligning a function, loop, or label on a 16-byte boundary statistically increases the chances that one fewer line will be needed to contain the function or loop. Obviously, it backfires because these NOPs reduce code density and therefore cache efficiency. In the case of loops and labels, the NOPs may even need to be executed once (when execution arrives at the loop/label normally, as opposed to from a jump).


#3

By default, compilers optimize for an "average" processor. Since different processors favor different instruction sequences, compiler optimizations enabled by -O2 might benefit the average processor but decrease performance on your particular processor (and the same applies to -Os). If you try the same example on different processors, you will find that some of them benefit from -O2 while others are more favorable to -Os optimizations.

Here are the results for time ./test 0 0 on several processors (user time reported):

Processor (System-on-Chip)             Compiler   Time (-O2)  Time (-Os)  Fastest
AMD Opteron 8350                       gcc-4.8.1    0.704s      0.896s      -O2
AMD FX-6300                            gcc-4.8.1    0.392s      0.340s      -Os
AMD E2-1800                            gcc-4.7.2    0.740s      0.832s      -O2
Intel Xeon E5405                       gcc-4.8.1    0.603s      0.804s      -O2
Intel Xeon E5-2603                     gcc-4.4.7    1.121s      1.122s       -
Intel Core i3-3217U                    gcc-4.6.4    0.709s      0.709s       -
Intel Core i3-3217U                    gcc-4.7.3    0.708s      0.822s      -O2
Intel Core i3-3217U                    gcc-4.8.1    0.708s      0.944s      -O2
Intel Core i7-4770K                    gcc-4.8.1    0.296s      0.288s      -Os
Intel Atom 330                         gcc-4.8.1    2.003s      2.007s      -O2
ARM 1176JZF-S (Broadcom BCM2835)       gcc-4.6.3    3.470s      3.480s      -O2
ARM Cortex-A8 (TI OMAP DM3730)         gcc-4.6.3    2.727s      2.727s       -
ARM Cortex-A9 (TI OMAP 4460)           gcc-4.6.3    1.648s      1.648s       -
ARM Cortex-A9 (Samsung Exynos 4412)    gcc-4.6.3    1.250s      1.250s       -
ARM Cortex-A15 (Samsung Exynos 5250)   gcc-4.7.2    0.700s      0.700s       -
Qualcomm Snapdragon APQ8060A           gcc-4.8       1.53s       1.52s      -Os

In some cases you can alleviate the effect of disadvantageous optimizations by asking gcc to optimize for your particular processor (using the options -mtune=native or -march=native):

Processor            Compiler   Time (-O2 -mtune=native) Time (-Os -mtune=native)
AMD FX-6300          gcc-4.8.1         0.340s                   0.340s
AMD E2-1800          gcc-4.7.2         0.740s                   0.832s
Intel Xeon E5405     gcc-4.8.1         0.603s                   0.803s
Intel Core i7-4770K  gcc-4.8.1         0.296s                   0.288s

Update: on the Ivy Bridge-based Core i3, three versions of gcc (4.6.4, 4.7.3, and 4.8.1) produce binaries with significantly different performance, but the assembly code has only subtle variations. So far, I have no explanation of this fact.

Assembly from gcc-4.6.4 -Os (executes in 0.709 secs):

    00000000004004d2 <_ZL3addRKiS0_.isra.0>:
      4004d2:       8d 04 37                lea    eax,[rdi+rsi*1]
      4004d5:       c3                      ret

    00000000004004d6 <_ZL4workii>:
      4004d6:       41 55                   push   r13
      4004d8:       41 89 fd                mov    r13d,edi
      4004db:       41 54                   push   r12
      4004dd:       41 89 f4                mov    r12d,esi
      4004e0:       55                      push   rbp
      4004e1:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
      4004e6:       53                      push   rbx
      4004e7:       31 db                   xor    ebx,ebx
      4004e9:       41 8d 34 1c             lea    esi,[r12+rbx*1]
      4004ed:       41 8d 7c 1d 00          lea    edi,[r13+rbx*1+0x0]
      4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
      4004f7:       01 c3                   add    ebx,eax
      4004f9:       ff cd                   dec    ebp
      4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>
      4004fd:       89 d8                   mov    eax,ebx
      4004ff:       5b                      pop    rbx
      400500:       5d                      pop    rbp
      400501:       41 5c                   pop    r12
      400503:       41 5d                   pop    r13
      400505:       c3                      ret

Assembly from gcc-4.7.3 -Os (executes in 0.822 secs):

    00000000004004fa <_ZL3addRKiS0_.isra.0>:
      4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
      4004fd:       c3                      ret

    00000000004004fe <_ZL4workii>:
      4004fe:       41 55                   push   r13
      400500:       41 89 f5                mov    r13d,esi
      400503:       41 54                   push   r12
      400505:       41 89 fc                mov    r12d,edi
      400508:       55                      push   rbp
      400509:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
      40050e:       53                      push   rbx
      40050f:       31 db                   xor    ebx,ebx
      400511:       41 8d 74 1d 00          lea    esi,[r13+rbx*1+0x0]
      400516:       41 8d 3c 1c             lea    edi,[r12+rbx*1]
      40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>
      40051f:       01 c3                   add    ebx,eax
      400521:       ff cd                   dec    ebp
      400523:       75 ec                   jne    400511 <_ZL4workii+0x13>
      400525:       89 d8                   mov    eax,ebx
      400527:       5b                      pop    rbx
      400528:       5d                      pop    rbp
      400529:       41 5c                   pop    r12
      40052b:       41 5d                   pop    r13
      40052d:       c3                      ret

Assembly from gcc-4.8.1 -Os (executes in 0.994 secs):

    00000000004004fd <_ZL3addRKiS0_.isra.0>:
      4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
      400500:       c3                      ret

    0000000000400501 <_ZL4workii>:
      400501:       41 55                   push   r13
      400503:       41 89 f5                mov    r13d,esi
      400506:       41 54                   push   r12
      400508:       41 89 fc                mov    r12d,edi
      40050b:       55                      push   rbp
      40050c:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
      400511:       53                      push   rbx
      400512:       31 db                   xor    ebx,ebx
      400514:       41 8d 74 1d 00          lea    esi,[r13+rbx*1+0x0]
      400519:       41 8d 3c 1c             lea    edi,[r12+rbx*1]
      40051d:       e8 db ff ff ff          call   4004fd <_ZL3addRKiS0_.isra.0>
      400522:       01 c3                   add    ebx,eax
      400524:       ff cd                   dec    ebp
      400526:       75 ec                   jne    400514 <_ZL4workii+0x13>
      400528:       89 d8                   mov    eax,ebx
      40052a:       5b                      pop    rbx
      40052b:       5d                      pop    rbp
      40052c:       41 5c                   pop    r12
      40052e:       41 5d                   pop    r13
      400530:       c3                      ret

#4

I'm by no means an expert in this area, but I seem to remember that modern processors are quite sensitive when it comes to branch prediction. The algorithms used to predict the branches are (or at least were, back in the days when I wrote assembler code) based on several properties of the code, including the distance of a target and the direction.

The scenario which comes to mind is small loops. When the branch was going backwards and the distance was not too far, the branch prediction was optimizing for this case, as all small loops are done this way. The same rules might come into play when you swap the locations of add and work in the generated code, or when the position of both slightly changes.

That said, I have no idea how to verify that; I just wanted to let you know that this might be something you want to look into.


#5

My colleague helped me find a plausible answer to my question. He noticed the importance of the 256-byte boundary. He is not registered here and encouraged me to post the answer myself (and take all the fame).


Short answer:

Is it the padding that is the culprit in this case? Why and how?

It all boils down to alignment. Alignment can have a significant impact on performance; that is why we have the -falign-* flags in the first place.

I have submitted a (bogus?) bug report to the gcc developers. It turns out that the default behavior is "we align loops to 8 bytes by default, but try to align them to 16 bytes if we don't need to fill in over 10 bytes." Apparently, this default is not the best choice in this particular case and on my machine. Clang 3.4 (trunk) with -O3 does the appropriate alignment, and the generated code does not show this weird behavior.

Of course, if an inappropriate alignment is done, it makes things worse. An unnecessary/bad alignment just eats up bytes for no reason and potentially increases cache misses, etc.

The noise it makes pretty much makes timing micro-optimizations impossible.

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?

Simply by telling gcc to do the right alignment:

g++ -O2 -falign-functions=16 -falign-loops=16


Long answer:

The code will run slower if:

  • an XX byte boundary cuts add() in the middle (XX being machine dependent).

  • the call to add() has to jump over an XX byte boundary and the target is not aligned.

  • add() is not aligned.

  • the loop is not aligned.

The first two are beautifully visible in the code and results that Marat Dukhan kindly posted. In this case, gcc-4.8.1 -Os (executes in 0.994 secs):

    00000000004004fd <_ZL3addRKiS0_.isra.0>:
      4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
      400500:       c3

a 256-byte boundary cuts add() right in the middle, and neither add() nor the loop is aligned. Surprise, surprise, this is the slowest case!

In the case of gcc-4.7.3 -Os (executes in 0.822 secs), the 256-byte boundary only cuts into a cold section (but neither the loop nor add() is cut):

    00000000004004fa <_ZL3addRKiS0_.isra.0>:
      4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
      4004fd:       c3                      ret
    [...]
      40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>

Nothing is aligned, and the call to add() has to jump over the 256-byte boundary. This code is the second slowest.

In the case of gcc-4.6.4 -Os (executes in 0.709 secs), although nothing is aligned, the call to add() doesn't have to jump over the 256-byte boundary, and the target is exactly 32 bytes away:

      4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
      4004f7:       01 c3                   add    ebx,eax
      4004f9:       ff cd                   dec    ebp
      4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>

This is the fastest of all three. Why the 256-byte boundary is special on his machine, I will leave it up to him to figure out. I don't have such a processor.

Now, on my machine I don't get this 256-byte boundary effect. Only the function and the loop alignment kick in on my machine. If I pass g++ -O2 -falign-functions=16 -falign-loops=16 then everything is back to normal: I always get the fastest case, and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiple of 16; the code is not sensitive to that either.

I first noticed in 2009 that gcc (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

A likely explanation is that I had hotspots which were sensitive to alignment, just like the one in this example. By messing with the flags (passing -Os instead of -O2), those hotspots were aligned in a lucky way by accident, and the code became faster. It had nothing to do with optimizing for size: it was by sheer accident that the hotspots got aligned better. From now on, I will check the effects of alignment on my projects.

Oh, and one more thing. How can such hotspots arise, like the one shown in the example? How can the inlining of such a tiny function like add() fail?

Consider this:

    // add.cpp
    int add(const int& x, const int& y) {
        return x + y;
    }

and in a separate file:

    // main.cpp
    int add(const int& x, const int& y);

    const int LOOP_BOUND = 200000000;

    __attribute__((noinline))
    static int work(int xval, int yval) {
        int sum(0);
        for (int i=0; i<LOOP_BOUND; ++i) {
            int x(xval+sum);
            int y(yval+sum);
            int z = add(x, y);
            sum += z;
        }
        return sum;
    }

    int main(int , char* argv[]) {
        int result = work(*argv[1], *argv[2]);
        return result;
    }

and compiled as: g++ -O2 add.cpp main.cpp.

gcc won't inline add()!

That's all; it's that easy to unintentionally create hotspots like the one in the OP. Of course, it is partly my fault: gcc is an excellent compiler. If I compile the above as g++ -O2 -flto add.cpp main.cpp, that is, if I perform link-time optimization, the code runs in 0.19s!

(Inlining is artificially disabled in the OP; hence, the code in the OP was 2x slower.)


#6

If your program is bounded by the L1 code cache, then optimizing for size suddenly starts to pay off.

When I last checked, the compiler was not smart enough to figure this out in all cases.

In your case, -O3 probably generates enough code for two cache lines, but -Os fits in one cache line.
