store buffer and invalidate queues

The following is from

http://en.wikipedia.org/wiki/MESI_protocol

Memory Barriers

MESI in its naive, straightforward implementation exhibits two particular low-performance behaviours. First, when writing to an invalid cache line, there is a long delay while the line is fetched from another CPU. Second, moving cache lines to the invalid state is time-consuming.

Consequently, CPUs implement store buffers and invalidate queues.

A store buffer is used when writing to an invalid cache line. Since the write will proceed anyway, the CPU issues a read-invalidate message (so the cache line in question is fetched, and every other CPU's cache line holding that memory address is invalidated) and then pushes the write into the store buffer, to be executed when the cache line finally arrives. (When a CPU reads a cache line, it also scans its own store buffer, in case a pending write to that line is waiting there.)

Consequently, a CPU can, from its own point of view, have written something that isn't yet in the cache, so other CPUs *cannot see it*: they cannot scan the store buffers of other CPUs.

With regard to invalidation, CPUs implement invalidate queues, whereby incoming invalidate requests are instantly acknowledged but not in fact acted upon. They simply enter an invalidation queue and are processed as soon as possible (but not necessarily instantly). As a result, a CPU can have in its cache a line which is invalid, but which it doesn't yet know is invalid, because the invalidation queue still contains the invalidation that hasn't been acted upon. (The invalidation queue is on the other "side" of the cache; the CPU can't scan it, as it can the store buffer.)
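
The behaviour of an invalidate queue can be sketched with a small toy model. This is only an illustration, not how real hardware is built: the class `ToyCPU`, its methods and the helper `demo_stale_read` are hypothetical names, and the other CPU's write is assumed to land directly in a shared `memory` dict.

```python
from collections import deque


class ToyCPU:
    """Toy model of one CPU's cache with an invalidate queue (not real hardware)."""

    def __init__(self, memory):
        self.memory = memory              # shared backing store: address -> value
        self.cache = {}                   # address -> cached value
        self.invalidate_queue = deque()   # acknowledged but not yet applied

    def read(self, addr):
        # The CPU cannot scan the invalidate queue, so a queued invalidation
        # does not stop it from using the stale cached line.
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def receive_invalidate(self, addr):
        # Acknowledge at once; the real work is deferred.
        self.invalidate_queue.append(addr)
        return "ack"

    def drain_invalidate_queue(self):
        # Apply every pending invalidation by dropping the cached line.
        while self.invalidate_queue:
            self.cache.pop(self.invalidate_queue.popleft(), None)


def demo_stale_read():
    memory = {"x": 0}
    cpu = ToyCPU(memory)
    first = cpu.read("x")          # caches x == 0
    cpu.receive_invalidate("x")    # another CPU announces a write to x
    memory["x"] = 1                # assumed: that write lands in shared memory
    stale = cpu.read("x")          # still served from the stale line
    cpu.drain_invalidate_queue()
    fresh = cpu.read("x")          # refetched after the invalidation is applied
    return first, stale, fresh
```

Calling `demo_stale_read()` reads the old value a second time while the invalidation is still queued, and only sees the new value once the queue has been drained.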

As a result, memory barriers are required. A store barrier will flush the store buffer (ensuring all writes have entered that CPU's cache). A read barrier will flush the invalidation queue (ensuring all writes by other CPUs become visible to the flushing CPU).

So MESI on its own does not, in practice, guarantee that every CPU sees writes in the order they were made. This is not a problem if you're single-threaded, but it definitely is if not.

The following is from

http://remonstrate.wordpress.com/tag/invalidate-queue/

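CPUs are far faster than main memory, so CPU designers add caches, giving each CPU/core a relatively fast local store. Having caches means having replication, which is essentially the same problem distributed systems face, except that on a single machine the hardware can help provide some consistency model.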
A cache resembles a hash table, but its hash function is simple: typically a set of bits taken from the address.
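
As a rough sketch of that "hash", the snippet below splits an address into tag, set index and byte offset. The geometry (64-byte lines, 64 sets) and the name `split_address` are assumptions chosen for illustration; real caches vary.

```python
LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_SETS = 64    # number of sets in the cache (assumed)

OFFSET_BITS = LINE_SIZE.bit_length() - 1   # 6 bits select the byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1     # 6 bits select the set


def split_address(addr):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

With this geometry, addresses that differ by a multiple of `LINE_SIZE * NUM_SETS` (4096 bytes) land in the same set and are told apart only by their tag.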
To give CPUs a consistent view of data, a cache coherence protocol keeps the data in the caches synchronized; the common protocol is MESI and its variants.
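MESI stands for four states: modified (the line has been changed, which requires that no other cache holds it), exclusive (not yet modified, but only this CPU has a copy, and that copy is up to date), shared (possibly held by several CPUs; it can be read but not written), and invalid (the data is not valid and must be treated as unusable). A CPU typically needs 2 bits per cache line to encode the four states. Transitions between states are driven by messages.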
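
The four states and their transitions can be sketched as a tiny state machine. This is a simplified illustration: the 2-bit values assigned to each state, the event names and the function `next_state` are assumptions made for the example, not the encoding of any real CPU.

```python
from enum import Enum


class State(Enum):
    # The numeric values are arbitrary; they only show that 2 bits suffice.
    MODIFIED = 0b00
    EXCLUSIVE = 0b01
    SHARED = 0b10
    INVALID = 0b11


def next_state(state, event, others_have_copy=False):
    """Return the next state of one cache line.

    event is one of "local_read", "local_write", "remote_read", "remote_write".
    """
    if event == "local_read":
        if state is State.INVALID:
            # A miss: the line arrives shared if another cache also holds it.
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state
    if event == "local_write":
        # Writing always ends in MODIFIED; from SHARED or INVALID the other
        # copies must be invalidated first, which is the slow step.
        return State.MODIFIED
    if event == "remote_read":
        # Another CPU reads: a valid line is downgraded to SHARED
        # (a MODIFIED line is written back first).
        return State.INVALID if state is State.INVALID else State.SHARED
    if event == "remote_write":
        # Another CPU wants to write: our copy becomes invalid.
        return State.INVALID
    raise ValueError(f"unknown event: {event}")
```

For example, `next_state(State.SHARED, "local_write")` yields `State.MODIFIED`, but only after the invalidate round trip that makes naive MESI writes slow.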
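A naive MESI implementation has little practical value, because a write often incurs a long wait: the writing CPU must first get the other CPUs to move the line to invalid, and can only perform the actual write after receiving their responses. Hardware designers therefore added the store buffer (Sutter has also said that a modern CPU without a store buffer is not worth buying, which shows how much this feature matters for overall performance).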
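With a store buffer, a CPU that needs to write simply hands the operation to the store buffer and keeps executing; the store buffer completes the necessary synchronization at some later point. On its own this obviously violates self consistency: if a CPU writes to memory held by another CPU, hands the write to the store buffer and carries on, a later instruction that depends on that location would read the old, not yet updated value. Real implementations therefore add snooping of the store buffer: when the CPU reads data, it consults both the store buffer and the cache.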
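
The effect of snooping the store buffer can be sketched with a toy model. Everything here is hypothetical and simplified: the class `ToyCPU`, the `forward_from_store_buffer` switch and the helper `increment_after_store` exist only for this illustration.

```python
class ToyCPU:
    """Toy CPU whose writes wait in a store buffer before reaching the cache."""

    def __init__(self, cache, forward_from_store_buffer=True):
        self.cache = cache              # address -> value
        self.store_buffer = []          # pending (address, value), oldest first
        self.forward = forward_from_store_buffer

    def write(self, addr, value):
        # The write is only queued; the CPU continues immediately.
        self.store_buffer.append((addr, value))

    def read(self, addr):
        if self.forward:
            # Snoop the store buffer, newest entry first.
            for pending_addr, value in reversed(self.store_buffer):
                if pending_addr == addr:
                    return value
        return self.cache[addr]

    def flush_store_buffer(self):
        # Apply the pending writes to the cache in program order.
        for addr, value in self.store_buffer:
            self.cache[addr] = value
        self.store_buffer.clear()


def increment_after_store(forward):
    cpu = ToyCPU({"a": 0}, forward_from_store_buffer=forward)
    cpu.write("a", 1)           # a = 1
    return cpu.read("a") + 1    # b = a + 1
```

Without forwarding, `increment_after_store(False)` computes `b` from the stale value of `a`; with forwarding, the CPU sees its own pending write and gets the result the program order implies.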
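Even with snooping, a store buffer still violates global memory ordering, and the remedy is the memory barrier. When a programmer writes two stores in sequence, the implicit assumption is that anyone who observes the result of the second store will also observe the result of the first. This is very intuitive, but a store buffer can break it. Suppose the observing thread runs on another core. The first store targets a line the writing core does not own, so it waits in the store buffer; the second store hits a line the core already owns and goes straight into its cache. When the observer reads the second location (which is not in its own cache), it requests the value from the writing core and gets the new value, yet the first store is still sitting in the store buffer, where other cores cannot see it. The observer therefore sees the result of the second store without the result of the first, which violates sequential consistency, the consistency model programmers usually prefer to reason under.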
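Hardware can hardly infer this kind of ordering dependency in software, so it generally cannot avoid the problem automatically. The dependency has to be expressed in software (with the hardware providing an instruction that supports the semantics): this is the memory barrier. From the hardware's point of view, the barrier is an instruction that makes the CPU flush its store buffer. One approach is to expose the instruction to programmers (wrapped in a function) and require them to insert it where the ordering matters; another is to mark the code in some way so that the compiler inserts the instruction automatically during translation.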
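
The reordering and its fix can be sketched with a toy model. The assumptions are stated loudly: a write to a line the core already owns becomes visible to others at once, a write to any other line waits in the store buffer, and the barrier simply drains that buffer. `WriterCPU`, `store_barrier` and `publish` are hypothetical names, not a real API.

```python
class WriterCPU:
    """Toy writer core with a store buffer (simplified, not real hardware)."""

    def __init__(self, memory, owned):
        self.memory = memory          # what other cores are able to observe
        self.owned = set(owned)       # lines this core already holds exclusively
        self.store_buffer = []        # pending (address, value), oldest first

    def write(self, addr, value):
        if addr in self.owned:
            self.memory[addr] = value                 # cache hit: visible at once
        else:
            self.store_buffer.append((addr, value))   # waits for the line

    def store_barrier(self):
        # Assumed barrier semantics: drain the store buffer in program order.
        for addr, value in self.store_buffer:
            self.memory[addr] = value
        self.store_buffer.clear()


def publish(use_barrier):
    memory = {"data": 0, "ready": 0}
    writer = WriterCPU(memory, owned={"ready"})
    writer.write("data", 42)      # first store: stuck in the store buffer
    if use_barrier:
        writer.store_barrier()    # force it out before the second store
    writer.write("ready", 1)      # second store: visible immediately
    # An observer on another core reads ready first, then data.
    return memory["ready"], memory["data"]
```

`publish(False)` lets the observer see `ready == 1` while `data` is still 0, which is exactly the surprise described above; `publish(True)` restores the order the programmer expected.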
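Store buffers tend to be small, so they cannot keep up with a run of consecutive writes; the same happens after a memory barrier is encountered. Before a write can proceed, the data in the other caches must be invalidated, and to speed this up hardware designers added the invalidate queue. The queue stores incoming invalidate messages and returns the response immediately, so that the sender can get on with its work, while the receiving CPU processes the queued invalidations later.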
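The existence of the invalidate queue means memory barriers are needed in more places (for reasons similar to the store buffer): a CPU may still read a cached line whose invalidation is only queued.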
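In practice memory barriers come in several kinds, such as read and write barriers. Software expresses them as smp_mb/rmb/wmb and the like (the names used in the Linux kernel); the underlying hardware instructions differ from platform to platform.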
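In actual implementations, constraints in some instruction sets keep memory barriers from being optimal, and many common platforms (x86, amd64, armv7) fall back on a crude bus lock or a similarly heavyweight full barrier. This is why Sutter's talk argues that hardware platforms often provide instructions that do "too much": the sequential consistency that software needs is achieved in the end, but at some unnecessary cost.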
A memory barrier is not always required, but it seems it cannot be avoided when there are real-time requirements; this remains to be verified.

A practical example: http://sstompkins.wordpress.com/2011/04/12/why-memory-barrier%EF%BC%9F/