难理解的bank conflict

之前看Nvidia-OpenCL-SDK里有一个例子讲到过bank conflict，但没怎么明白，它选择的是用奇数来避免。

#define BLOCK_DIM 16// This kernel is optimized to ensure all global reads and writes are coalesced,
// and to avoid bank conflicts in shared memory.  This kernel is up to 11x faster
// than the naive kernel below.  Note that the shared memory array is sized to
// (BLOCK_DIM+1)*BLOCK_DIM.  This pads each row of the 2D block in shared memory
// so that bank conflicts do not occur when threads address the array column-wise.
__kernel void transpose(__global float *odata, __global float *idata, int offset, int width, int height, __local float* block)
{// read the matrix tile into shared memoryunsigned int xIndex = get_global_id(0);unsigned int yIndex = get_global_id(1);if((xIndex + offset < width) && (yIndex < height)){unsigned int index_in = yIndex * width + xIndex + offset;block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = idata[index_in];}barrier(CLK_LOCAL_MEM_FENCE);// write the transposed matrix tile to global memoryxIndex = get_group_id(1) * BLOCK_DIM + get_local_id(0);yIndex = get_group_id(0) * BLOCK_DIM + get_local_id(1);if((xIndex < height) && (yIndex + offset < width)){unsigned int index_out = yIndex * height + xIndex;odata[index_out] = block[get_local_id(0)*(BLOCK_DIM+1)+get_local_id(1)];}
}// This naive transpose kernel suffers from completely non-coalesced writes.
// It can be up to 10x slower than the kernel above for large matrices.
__kernel void transpose_naive(__global float *odata, __global float* idata, int offset, int width, int height)
{unsigned int xIndex = get_global_id(0);unsigned int yIndex = get_global_id(1);if (xIndex + offset < width && yIndex < height){unsigned int index_in  = xIndex + offset + width * yIndex;unsigned int index_out = yIndex + height * xIndex;odata[index_out] = idata[index_in]; }
}

__kernel void simple_copy(__global float *odata, __global float* idata, int offset, int width, int height)
{unsigned int xIndex = get_global_id(0);unsigned int yIndex = get_global_id(1);if (xIndex + offset < width && yIndex < height){unsigned int index_in  = xIndex + offset + width * yIndex;odata[index_in] = idata[index_in]; }
}__kernel void shared_copy(__global float *odata, __global float *idata, int offset, int width, int height, __local float* block)
{// read the matrix tile into shared memoryunsigned int xIndex = get_global_id(0);unsigned int yIndex = get_global_id(1);unsigned int index_in = yIndex * width + xIndex + offset;if((xIndex + offset< width) && (yIndex < height)){//avoid bank conflictsblock[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = idata[index_in];}barrier(CLK_LOCAL_MEM_FENCE);if((xIndex < height) && (yIndex+ offset < width)){odata[index_in] = block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)];}
}__kernel void uncoalesced_copy(__global float *odata, __global float* idata, int offset, int width, int height)
{unsigned int xIndex = get_global_id(0);unsigned int yIndex = get_global_id(1);if (xIndex + offset < width && yIndex < height){unsigned int index_in  = yIndex + height * (xIndex+ offset);odata[index_in] = idata[index_in]; }
}

大神也叫我记住这一点就好了。但还是自己想弄明白，今天看了：但都是以CUDA的视角讲解的：

http://blog.csdn.net/lucky_greenegg/article/details/9992129 http://blog.csdn.net/qqlu_did/article/details/45883159

http://blog.csdn.net/lingerlanlan/article/details/32712749 http://blog.csdn.net/endlch/article/details/47043069

http://blog.csdn.net/smsmn/article/details/6336060 http://blog.csdn.net/o_oxo_o/article/details/4296281 http://www.cnblogs.com/leohan2013/p/3333950.html http://blog.sina.com.cn/s/blog_735f29100102vq84.html http://blog.csdn.net/u014800094/article/details/54290100 http://blog.163.com/volcanolin%40126/blog/static/17086553120110482221978/ http://www.cnblogs.com/biglucky/p/4235009.html

http://blog.csdn.net/u011934885/article/details/54706812

http://blog.csdn.net/endlch/article/details/47043069 这个人写得最好

这几人写得各有千秋，等我真的弄明白再好好讲清楚

看了一会儿，貌似茅塞顿开：我整理了一下，应该是这样理解：

再讲清晰点就是这样：

（3-way 图中写错了，速度是1/3，不是1/4）

对于1所示的存取方式：每次的half-warp中没有多个线程访问同一个bank的不同位置的情况，所以无bank conflict！对于存取方式2：bank0下有不同的线程访问它的不同位置，故发生bank conflict，同样bank4、bank8和bank12都同样发生了bank conflict，所以这是所谓的3 way bank conflict 即速度变为原来的1/3。

上面是根据查到的有关bank conflict相关的资料站在几年前CUDA视角理解的！

话说：以前怎么都不明白，包括自己查资料、看实例、问大神，但就是不明白；现在还是一样的看资料而且是同样的资料看了一点点竟然就明白了，这时候又非常想不通当初的自己为何一直不明白。发现：如别人所说：有的东西，懂了就是懂了，不懂的时候怎样都不会懂。

*************************************************************************************************************************************************

但大神说：

所以对于OpenCL而言，我上面的图要改动的：

1、现在的卡的bank至少都有32个banks，不变的是每个bank依旧是32bit带宽；

2、bank conflict发生在Local memory即LDS的存取时候！

3、执行方式时32个线程即一个warp或者half-wave，即当half-wave=32个线程中有多个线程访问同一个bank的不同位置时发生bank conflict！

4、另外图像常使用uchar4、char4、int、float等都是32bit

上面的图该改善为：

（2 way 不是8 way，图中写错了）

方式3是像CUDA中类似的广播，无conflict！

*********************************************************************************************************

对于开头的矩阵转置例子，这个例子默认是以前老式的16个banks以half-warp访问的老卡：

#define BLOCK_DIM 16
__kernel void transpose(__global float *odata, __global float *idata, int offset, int width, int height, __local float* block)
{// read the matrix tile into shared memoryunsigned int xIndex = get_global_id(0);unsigned int yIndex = get_global_id(1);if((xIndex + offset < width) && (yIndex < height)){unsigned int index_in = yIndex * width + xIndex + offset;block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = idata[index_in];}barrier(CLK_LOCAL_MEM_FENCE);// write the transposed matrix tile to global memoryxIndex = get_group_id(1) * BLOCK_DIM + get_local_id(0);yIndex = get_group_id(0) * BLOCK_DIM + get_local_id(1);if((xIndex < height) && (yIndex + offset < width)){unsigned int index_out = yIndex * height + xIndex;odata[index_out] = block[get_local_id(0)*(BLOCK_DIM+1)+get_local_id(1)];}
}

对于写入block变量那里，即使用16X16不会发生bank conflict；但读出来是会发生conflict，所以这也是为什么用16X17的原因！看图1：

对于图1 ：写入时：half-warp中所有线程不会发生conflict，比如itemID[0---15]负责写block[0---15]所以不会conflict！！！但读出来时，因为是矩阵转置，那么itemID[0]负责读block[0]；itemID[1]负责读block[16];......所以发生了16-way的bank conflict（即16个线程访问了同一个bank的不同位置）！所以改进方式是图2使用16X17的block，画X的地方表示没有实际使用的数。写入时，itemID[0--15]负责block[0---15]，itemID[16---32]负责block[17---32]......所以不会conflict！读出时itemID[0]访问block[0];itemID[1]访问block[17]；itemID[2]访问block[34].....从图2知，它们不在一个bank中，故不会conflict！！！所以这个例子对于老式的卡可以用奇数法规避conflict！！！那么后面那个kernel ：shared_copy 即可以自行理解了！很简单了！！！

*****************************************************************************************************************

对于新式卡，可以仿照上面用32X33来规避bank conflict！

********************************完结**********************************************

2020.7.13

因为重新学习《CUDA C编程权威指南》看到第5.2.2节（按列主序存/取时），又讲到CUDA中的bank conflict，因为是以warp（不再以half-warp访问），所以我看得有些疑惑：

书上说这是16 way bank conflict？可是我觉得这是32 way啊？！因为warp里的所有线程都访问了同一个bank的不同地址。

而且我查到这里也说是32 way啊，是书上错了吗？？？？？？？？

然后我看到网上这个[32][33]来避免bank conflict的图明白了

我是这样理解的，所以这样子可以避免bank conflict。

但是书上第197页说矩形共享内存Data[16][32]，按列主序读存会有8 way bank conflict（我觉得是16 way，书上总是我分析的一半，书上错了吗？？），然后解决办法是Data[16][34]，我纸上分析了一下，这样其实和上图差别不大，就是变成warp0访问0_B0, 32_B2,64_B4...480_B30，这样子就是warp内的线程（只使用了16个）访问了不同bank，所以没有conflict。

可是书上说如果使用Data[16][33]就会有2 way bank conflict？？？我分析了一下没有啊，warp0访问的变成了0_B0,32_B1,64_B2...480_B15而已，也没有conflict啊！！！？？？求大神解答

经过大神点拨，我已明白我的症结所在：我的症结就在以为warp0只访问第一列，有16个线程没有使用。现在从你们这知道了原来是两列，warp0里32个线程必定都存在使用！！！可以看到Data[16][33]按列访问时1_B1,32_B1都访问B1；33_B2,64_B2都访问B2....即图中紫色部分都是访问了同一个bank，而这些都在同一个warp内。所以出现conflict！！！！经过大神点拨，明白我之前卡在那个点，现在茅塞顿开。太开心了。

/*******************************************************************************************************************************/

2020.11.13

我以为我真的都理解了，然而 https://bbs.gpuworld.cn/index.php?topic=73410.0 这里大神又给我上了一课，并不是现在所有的卡都是warp访问。如果如大神所说4B是按照warp访问, 8B按照half warp来的, 而16B按照1/4 warp访问。那么这个网友这种情况的确不会出现conflict，因为a[0]放在bank0和bank1简称b0、b1;a[1]--b2、b3;依次类推，a[15]在b30、b31所以这16个线程中并没有出现不同线程访问同一个bank的情况！是我轻率了，看来还有很多要学习的地方。

再次和大神交流，依旧获益良多，包括回顾自己的笔记也感触连连。一下就回到几年前大家一起讨论的日子，虽然当时很菜但交流过程中慢慢明白很多原理，这是一种发自内心的快乐。现在依然很菜，没有大神帮助时，更多的是自己去查书、查网上的解答、查自己当初的笔迹，看当初的交流记录，有时干脆自己推论、验证，这也许就是学习中的成长。

难理解的bank conflict相关推荐

CUDA bank 及bank conflict
bank 是CUDA中一个重要概念,是内存的访问时一种划分方式,在CPU中,访问某个地址的内存时,为了减少读写内次次数,访问地址并不是随机的,而是一次性访问bank内的内存地址,类似于内存对齐一样,一 ...
share memory的bank conflict分析
背景在做高性能分析的时候,经常会出现一个什么bank conflict的名词,不仅是GPU的share memory会出现bank confict, 甚至连寄存器也会出现bank conflict, ...
bank conflict
存储体冲突(bank conflict):当被访问的存储体没有恢复时又出现对该存储体新访问的现象. 简介目前 CUDA 装置中,每个 multiprocessor 有 16KB 的 shared me ...
Opengl-基本概念-转换矩阵坐标系（最难理解的两章）
前言可能这是Opengl学习的时候最难理解的地方,很多人也因为这个地方放弃了.但是我觉得,我们可以不明白有些矩阵是如何推到出来的,但是我们要明白一些简单的矩阵为什么可以做到从一个坐标系到另一个坐标系 ...
Share memory中bank conflict问题
Share memory是片上资源,生命周期是整个block中,它的数据读写十分快,有1个cycle latency.在Share memory中,经常存在bank conflict问题,如果没有ba ...
python装饰器功能是冒泡排序怎么做_传说中Python最难理解的点|看这完篇就够了(装饰器)...
https://mp.weixin.qq.com/s/B6pEZLrayqzJfMtLqiAfpQ 1.什么是装饰器网上有人是这么评价装饰器的,我觉得写的很有趣,比喻的很形象每个人都有的内裤主要是 ...
Docker 概念很难理解？一文搞定 Docker 端口绑定
作者 | Dieter Jordens 译者 | 苏本如,责编 | 夕颜出品 | CSDN(ID:CSDNnews) 以下为译文: 作为初级开发人员的你,是不是参加过这样的面试,在面试中面试官希望你 ...
聊天机器人最难理解的 10 个词汇
简评:现在,越来越多的「聊天机器人」凭借着人工智能能与人类对话,甚至编写新闻.人们该如何判断对方是一个血肉之躯,还是一个可笑的算法?又该如何判断一个小说故事是由一台机器编写的,而不是由一位真正的人类作 ...
Javascript之旅——第十站：为什么都说闭包难理解呢？
原文:Javascript之旅--第十站:为什么都说闭包难理解呢? 研究过js的朋友大多会说,理解了js的原型和闭包就可以了,然后又说这些都是js的高级内容,然后就又扯到了各种神马的作用域...然后不 ...

难理解的bank conflict

难理解的bank conflict相关推荐

最新文章

热门文章