How to cache global data in the on-chip (level 1) cache on modern GPUs

1.Fermi arch

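On CC 2.x (CC = Compute Capability, NVIDIA's versioning of GPU features), the L1 data cache is still available and can cache both local and global data. For both ld (load) and st (store), the default cache operators cache at all levels: ld.ca and st.wb are the default instructions.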
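Since everything below depends on the compute capability, it helps to check what the installed device reports. The following is a minimal sketch, not part of the original notes: `report_l1_support` is a hypothetical helper name, and it relies on the `globalL1CacheSupported` and `localL1CacheSupported` fields of `cudaDeviceProp` from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints, for each visible device, its compute
// capability and whether the runtime reports L1 caching of global and
// local memory as supported.
// Returns the number of devices found, or -1 if a query fails.
int report_l1_support() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;
        std::printf("device %d (%s): CC %d.%d, global L1 caching: %s, "
                    "local L1 caching: %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    prop.globalL1CacheSupported ? "yes" : "no",
                    prop.localL1CacheSupported ? "yes" : "no");
    }
    return count;
}
```

Calling `report_l1_support()` from `main` and compiling with nvcc is enough to see which of the three generations discussed here a given GPU falls into.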

But since each SM has its own L1, this creates a cache coherency problem between SMs!

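NVIDIA's strategy is that global writes are not cached in L1, which gives a "local write-back, global write-evict" policy. If a global write hits in L1, the line that was hit is invalidated, and the other SMs are told to invalidate the same data as well (though I have not yet found out how this is actually implemented!)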

2.Kepler arch

CC 3.x introduces a Read-Only Data Cache; everything else is similar to the previous generation.

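1) The Read-Only Data Cache deserves a closer look. As noted above, writes are no longer cached at the L1 level at all. What about load instructions? They still need to be cached, but that would again run into the cache coherency problem. To deal with this, only read-only data is cached, which is why the Read-Only Data Cache was added!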

2) To get data into the Read-Only Data Cache, however, the CUDA code has to use the __ldg() function to tell the compiler.
__ldg() is only supported on GPUs of CC 3.5 and above.

Note: the PTX instruction corresponding to __ldg() is ld.global.nc

e.g. T __ldg(const T* address); returns the data of type T located at address address, where T is char, short, int, long long, unsigned char, unsigned short, unsigned int, unsigned long long, int2, int4, uint2, uint4, float, float2, float4, double, or double2. The operation is cached in the read-only data cache. (Source: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
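A minimal sketch of __ldg() in a kernel, assuming a device of CC 3.5 or above and a build that targets such an architecture. The names `scale_ldg` and `run_scale_ldg` are made up for illustration and do not come from the programming guide.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Scales each input element by k. The read of in[i] goes through __ldg(),
// i.e. the read-only data cache path (ld.global.nc). Requires CC 3.5+.
__global__ void scale_ldg(const float* in, float* out, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * __ldg(&in[i]);
}

// Hypothetical host driver: fills in[i] = i, runs the kernel with k = 2,
// and returns true if every out[i] equals 2 * in[i].
bool run_scale_ldg(int n) {
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_ldg<<<blocks, threads>>>(d_in, d_out, n, 2.0f);

    // cudaMemcpy waits for the kernel and surfaces execution errors.
    bool ok = cudaMemcpy(h_out.data(), d_out, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);
    for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == 2.0f * h_in[i]);
    return ok;
}
```

The input buffer is never written while the kernel runs, which is the condition under which reading it through the read-only path is safe.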

3.Maxwell

This is the generation where the architecture really changed. From here on there is no longer a separate L1 data cache; all that remains is:
a unified L1/texture cache of 24 KB used to cache reads from global memory

With L1 and texture unified, things get awkward, because the texture cache is accessed through the dedicated tex unit. So what can be done?
Starting with this generation,
Global memory accesses are only cached in L2

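Read-only data can still be cached in the L1/texture cache through __ldg(), and the pointers should also be qualified with the const and __restrict__ keywords, which helps the compiler see that the data is read-only.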
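A minimal sketch of the keyword-based variant, with made-up names (`add_restrict`, `run_add_restrict`). Marking the inputs `const` and `__restrict__` only gives the compiler the information it needs; whether it actually emits ld.global.nc is its decision and is not guaranteed, so it is worth inspecting the PTX (for example with `nvcc -ptx`) to confirm.

```cuda
#include <cuda_runtime.h>
#include <vector>

// a and b are promised to be read-only and non-aliasing for the whole
// kernel, so the compiler is allowed (not obliged) to load them through
// the read-only / texture path.
__global__ void add_restrict(const float* __restrict__ a,
                             const float* __restrict__ b,
                             float* __restrict__ c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host driver: a[i] = i, b[i] = 1, checks c[i] == i + 1.
bool run_add_restrict(int n) {
    std::vector<float> h_a(n), h_b(n, 1.0f), h_c(n, 0.0f);
    for (int i = 0; i < n; ++i) h_a[i] = static_cast<float>(i);

    size_t bytes = n * sizeof(float);
    float* d = nullptr;  // one allocation holding a, b and c back to back
    if (cudaMalloc(&d, 3 * bytes) != cudaSuccess) return false;
    float* d_a = d;
    float* d_b = d + n;
    float* d_c = d + 2 * n;
    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_restrict<<<blocks, threads>>>(d_a, d_b, d_c, n);

    bool ok = cudaMemcpy(h_c.data(), d_c, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);
    for (int i = 0; ok && i < n; ++i) ok = (h_c[i] == h_a[i] + 1.0f);
    return ok;
}
```

The three sub-buffers do not overlap, so the `__restrict__` promise holds; passing aliasing pointers to such a kernel would make its behavior undefined.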

Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified L1/texture cache for devices of compute capability 5.0. For devices of compute capability 5.2, it is, by default, not cached in the unified L1/texture cache, but caching may be enabled using the following mechanisms:
Perform the read using inline assembly with the appropriate modifier as described in the PTX reference manual;

Compile with the -Xptxas -dlcm=ca compilation flag, in which case all reads are cached, except reads that are performed using inline assembly with a modifier that disables caching;

Compile with the -Xptxas -fscm=ca compilation flag, in which case all reads are cached, including reads that are performed using inline assembly regardless of the modifier used.

When caching is enabled using some of the three mechanisms listed above, devices of compute capability 5.2 will cache global memory reads in the unified L1/texture cache for all kernel launches except for the kernel launches for which thread blocks consume too much of the multiprocessor’s resources. These exceptions are reported by the profiler.
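The first mechanism (inline assembly with a cache modifier) might look like the sketch below. It assumes a 64-bit build, hence the "l" constraint for the pointer, and uses the `.ca` operator mentioned in the Fermi section. The names `load_global_ca`, `add_one_ca` and `run_add_one_ca` are hypothetical. Whether the load is actually served from L1 still depends on the device, as the quoted text explains.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Loads one float from global memory with the "cache at all levels" (.ca)
// operator via inline PTX. Assumes 64-bit pointers ("l" constraint).
__device__ __forceinline__ float load_global_ca(const float* p) {
    float v;
    asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

__global__ void add_one_ca(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = load_global_ca(&in[i]) + 1.0f;
}

// Hypothetical host driver: in[i] = i, checks out[i] == i + 1.
// The flag-based alternative from the text needs no source change:
//   nvcc -Xptxas -dlcm=ca file.cu
bool run_add_one_ca(int n) {
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_one_ca<<<blocks, threads>>>(d_in, d_out, n);

    bool ok = cudaMemcpy(h_out.data(), d_out, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);
    for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == h_in[i] + 1.0f);
    return ok;
}
```

The inline-assembly route allows caching to be chosen per load, while the compile flags change the default for the whole translation unit.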