Why is my program slow when looping over exactly 8192 elements?

Translated from: Why is my program slow when looping over exactly 8192 elements?

Here is the extract from the program in question. The matrix img[][] has size SIZE×SIZE and is initialized as:

img[j][i] = 2 * j + i

Then you create a matrix res[][], in which each element is the average of the 9 elements around it in the img matrix (the 3×3 neighborhood, including the element itself). The border is left at 0 for simplicity.

for(i=1;i<SIZE-1;i++)
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++)
            for(l=-1;l<2;l++)
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
    }

That's all there is to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.

#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; // result of mean filter
int i,j,k,l;
for(i=0;i<SIZE;i++)
    for(j=0;j<SIZE;j++)
        img[j][i] = (2*j+i)%8196;

Basically, this program is slow when SIZE is a multiple of 2048, e.g. the execution times:

SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secs

The compiler is GCC. From what I know, this is because of memory management, but I don't really know much about that subject, which is why I'm asking here.

It would also be nice to know how to fix this, but if someone could explain these execution times I'd already be happy enough.

I already know of malloc/free, but the problem is not the amount of memory used, it's merely the execution time, so I don't know how that would help.

Reply #1

Reference: https://stackoom.com/question/pSg6/为什么我的程序在完全循环-个元素时会变慢

Reply #2

The difference is caused by the same super-alignment issue described in the following related questions:

Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Matrix multiplication: Small difference in matrix size, large difference in timings

But that's only because there's one other problem with the code.
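
To make the super-alignment effect more concrete, here is a minimal sketch that counts how many distinct cache sets a column-wise walk through a float matrix touches. The cache geometry (64-byte lines, 64 sets, roughly what a 32 KiB 8-way L1 data cache would have) and the helper name distinct_sets are assumptions chosen for illustration; they do not come from the answer, and real hardware has several cache levels and maps physical rather than virtual addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed cache geometry (illustrative only): 64-byte lines, 64 sets. */
#define LINE_BYTES 64
#define NUM_SETS   64

/* Count the distinct cache sets touched when reading img[0][0], img[1][0],
 * ..., img[rows-1][0] of a float matrix with `size` columns, assuming the
 * matrix starts at byte offset 0. */
static int distinct_sets(size_t size, size_t rows)
{
    unsigned char seen[NUM_SETS];
    int count = 0;

    assert(size > 0);
    memset(seen, 0, sizeof seen);
    for (size_t j = 0; j < rows; j++) {
        size_t addr = j * size * sizeof(float);       /* offset of img[j][0] */
        size_t set  = (addr / LINE_BYTES) % NUM_SETS; /* set this line maps to */
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

Under these assumptions and with 4-byte floats, distinct_sets(8192, 1024) is 1, because every row starts exactly 32768 bytes after the previous one and lands in the same set, while 8191 and 8193 both spread the same walk over all 64 sets.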

Starting from the original loop:

for(i=1;i<SIZE-1;i++)
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++)
            for(l=-1;l<2;l++)
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
    }

First notice that the two inner loops are trivial. They can be unrolled as follows:

for(i=1;i<SIZE-1;i++) {
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

So that leaves the two outer loops that we're interested in.

Now we can see that the problem is the same as in this question: Why does the order of the loops affect performance when iterating over a 2D array?

You are iterating over the matrix column-wise instead of row-wise.

To solve this problem, you should interchange the two loops.

for(j=1;j<SIZE-1;j++) {
    for(i=1;i<SIZE-1;i++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

This eliminates all the non-sequential access completely, so you no longer get random slow-downs on large powers of two.

Core i7 920 @ 3.5 GHz

Original code:

8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds

Interchanged outer loops:

8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 seconds
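
For readers who want to repeat the comparison, the following is a minimal sketch of both loop orders on a flat, row-major array, plus a small timing helper. The function names (mean9, filter_column_wise, filter_row_wise, time_filter) and the use of clock() are assumptions made for this illustration; the answer does not show its benchmark harness.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Average of the 3x3 neighbourhood centred on row j, column i of an
 * n-by-n image stored row by row in a flat array. */
static float mean9(const float *img, int n, int j, int i)
{
    float sum = 0;
    for (int k = -1; k < 2; k++)
        for (int l = -1; l < 2; l++)
            sum += img[(size_t)(j + l) * n + (i + k)];
    return sum / 9;
}

/* Original order: column index i in the outer loop, row index j inside,
 * so consecutive iterations are n floats apart in memory. */
static void filter_column_wise(const float *img, float *res, int n)
{
    assert(n >= 3);
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            res[(size_t)j * n + i] = mean9(img, n, j, i);
}

/* Interchanged order: row index j outside, column index i inside,
 * so consecutive iterations touch adjacent floats. */
static void filter_row_wise(const float *img, float *res, int n)
{
    assert(n >= 3);
    for (int j = 1; j < n - 1; j++)
        for (int i = 1; i < n - 1; i++)
            res[(size_t)j * n + i] = mean9(img, n, j, i);
}

/* CPU seconds spent in a single call of `filter`. */
static double time_filter(void (*filter)(const float *, float *, int),
                          const float *img, float *res, int n)
{
    clock_t start = clock();
    filter(img, res, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To try it, allocate two n*n float buffers with malloc (at n = 8192 each one is 256 MiB), fill img, and call time_filter once per variant and per size. Both variants compute the same result; absolute times will depend on the machine, the compiler and the optimization flags.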

Reply #3

The following tests were done with the Visual C++ compiler, as used by the default Qt Creator install (I guess with no optimization flag). When using GCC, there is no big difference between Mystical's version and my "optimized" code. So the conclusion is that compiler optimizations take care of micro-optimization better than humans (me at least). I leave the rest of my answer for reference.

It's not efficient to process images this way. It's better to use one-dimensional arrays. Processing all pixels is then done in one loop. Random access to points can be done using:

pointer + (x + y*width)*(sizeOfOnePixel)
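
As a small illustration of that address formula, here is a sketch of a helper that returns the address of pixel (x, y) in a flat, row-major buffer. The name pixel_at and its parameter list are hypothetical and only serve this example.

```c
#include <assert.h>
#include <stddef.h>

/* Address of pixel (x, y) in an image stored row by row in one flat buffer.
 * `width` is the number of pixels per row and `pixel_size` the size of one
 * pixel in bytes. */
static unsigned char *pixel_at(unsigned char *base, size_t width,
                               size_t x, size_t y, size_t pixel_size)
{
    assert(x < width);
    return base + (x + y * width) * pixel_size;
}
```

When the buffer is already typed, for example float *img, the same access is simply img[x + y * width], because pointer arithmetic on a float pointer scales by the element size automatically.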

In this particular case, it's better to compute and cache the horizontal sums of three-pixel groups, because each of them is used three times.

I've done some tests and I think they're worth sharing. Each result is an average of five tests.

Original code by user1615209:

8193: 4392 ms
8192: 9570 ms

Mystical's version:

8193: 2393 ms
8192: 2190 ms

Two passes using a 1D array: the first pass computes the horizontal sums, the second the vertical sum and the average. Two-pass addressing with three pointers and only increments, like this:

imgPointer1 = &avg1[0][0];
imgPointer2 = &avg1[0][SIZE];
imgPointer3 = &avg1[0][SIZE+SIZE];

for(i=SIZE;i<totalSize-SIZE;i++){
    resPointer[i]=(*(imgPointer1++)+*(imgPointer2++)+*(imgPointer3++))/9;
}

8193: 938 ms
8192: 974 ms

Two passes using a 1D array and addressing like this:

for(i=SIZE;i<totalSize-SIZE;i++){
    resPointer[i]=(hsumPointer[i-SIZE]+hsumPointer[i]+hsumPointer[i+SIZE])/9;
}

8193: 932 ms
8192: 925 ms
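
The two-pass idea can be written as a self-contained function. This is only a sketch under stated assumptions: the name mean_filter_two_pass, the temporary hsum buffer allocated with calloc, and the explicit handling of the border columns are choices made for this example and are not taken from the answer's test code.

```c
#include <assert.h>
#include <stdlib.h>

/* Two-pass 3x3 mean filter on an n-by-n image stored as a flat row-major array.
 * Pass 1 stores, for each pixel, the sum of it and its left and right neighbours.
 * Pass 2 adds three vertically adjacent sums and divides by 9.
 * Border pixels of res are left untouched.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
static int mean_filter_two_pass(const float *img, float *res, int n)
{
    size_t total = (size_t)n * n;
    float *hsum;

    assert(n >= 3);
    hsum = calloc(total, sizeof *hsum);
    if (!hsum)
        return -1;

    /* Pass 1: horizontal sums for the interior columns of every row. */
    for (int j = 0; j < n; j++)
        for (int i = 1; i < n - 1; i++) {
            size_t p = (size_t)j * n + i;
            hsum[p] = img[p - 1] + img[p] + img[p + 1];
        }

    /* Pass 2: vertical sum of three horizontal sums, then the average. */
    for (int j = 1; j < n - 1; j++)
        for (int i = 1; i < n - 1; i++) {
            size_t p = (size_t)j * n + i;
            res[p] = (hsum[p - n] + hsum[p] + hsum[p + n]) / 9;
        }

    free(hsum);
    return 0;
}
```

Each output pixel now costs two additions in each pass instead of eight in a single pass, at the price of one extra n*n buffer. Both passes walk memory row by row, so the access pattern stays sequential.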

One pass, caching horizontal sums just one row ahead so they stay in cache:

// Horizontal sums for the first two lines
for(i=1;i<SIZE*2;i++){
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
}
// Rest of the computation
for(;i<totalSize;i++){
    // Compute horizontal sum for next line
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
    // Final result
    resPointer[i-SIZE]=(hsumPointer[i-SIZE-SIZE]+hsumPointer[i-SIZE]+hsumPointer[i])/9;
}

8193: 599 ms
8192: 652 ms

Conclusion:

There is no benefit in using several pointers and only increments (I thought it would have been faster).
Caching horizontal sums is better than computing them several times.
Two passes are not three times faster, only two times.
It's possible to get 3.6 times faster by using both a single pass and caching an intermediate result.

I'm sure it's possible to do much better.

NOTE: Please note that I wrote this answer to target general performance issues rather than the cache problem explained in Mystical's excellent answer. At the beginning it was just pseudo code. I was asked to do tests in the comments... Here is a completely refactored version with tests.
