Page Cache and Buffer Cache

In the Linux implementation, the file cache is split into two layers: the Page Cache and the Buffer Cache; each Page Cache page contains a number of Buffer Cache buffers.

The memory management subsystem and the VFS interact only with the Page Cache: memory management handles the allocation and reclamation of each Page Cache page, and sets up the mappings when a file is accessed via memory map;

concrete filesystems generally interact only with the Buffer Cache, exchanging data between the underlying storage device and the Buffer Cache.

The biggest difference between the page cache and the buffer cache is what they cache: the page cache caches file data, while the buffer cache caches device data. Their implementations differ little; both are managed with radix trees.

Buffers are in-memory buffers for block I/O, and they are relatively short-lived. Before Linux kernel 2.4, the page cache and the buffer cache were separate; starting with 2.4 they were unified. Buffers now only cache raw disk blocks, i.e., non-file data that is not in the page cache, so the Buffers metric matters much less. On most systems, Buffers is often only a few tens of MB.
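
To see the two counters on a live system you can read /proc/meminfo; a minimal C++ sketch (Buffers and Cached are the standard field names there):

#include <fstream>
#include <iostream>
#include <string>

// Print the Buffers (raw block cache) and Cached (page cache) counters
// from /proc/meminfo; on most systems Buffers is only a few tens of MB.
int main() {
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line)) {
        if (line.rfind("Buffers:", 0) == 0 || line.rfind("Cached:", 0) == 0)
            std::cout << line << '\n';
    }
}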

CPU cache

The L1 and L2 caches are private to each CPU core.
The L1 cache is further split into L1D (L1 Data Cache) and L1I (L1 Instruction Cache).
The L3 cache is shared by multiple CPU cores.

#Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
CPU Name: Intel Xeon Platinum 8268
Max MHz.: 3900
Nominal: 2900
Enabled: 96 cores, 4 chips
Orderable: 2,4 chips
Cache L1: 32 KB I + 32 KB D on chip per core
L2: 1 MB I+D on chip per core
L3: 35.75 MB I+D on chip per chip
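
The same sizes can also be queried programmatically; a minimal sketch using glibc's sysconf (Linux/glibc-specific constants; values are in bytes, 0 or -1 when not reported):

#include <unistd.h>
#include <cstdio>

// Query the per-level cache sizes that the kernel reports via glibc.
int main() {
    std::printf("L1d: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    std::printf("L1i: %ld bytes\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
    std::printf("L2:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    std::printf("L3:  %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
}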

What is Hyper-Threading?

Hyper-threading was Intel's first effort to bring parallel computation to end users' PCs. It was first used on desktop CPUs with the Pentium 4 in 2002.

Pentium 4 chips at that time featured only a single CPU core, so a CPU could execute just one task at a time and could not run multiple operations in parallel.

A single CPU with hyper-threading appears to the operating system as two logical CPUs. The CPU is still a single physical chip, and the hardware still has one set of execution resources per core, but the OS sees and schedules two CPUs for each core.

The CPU thus presents itself as having more cores than it physically has, and the operating system assumes two CPUs for every physical core.

Because the CPU executes far faster than the other components it waits on, it often sits idle. To make fuller use of CPU resources, Intel built the hyper-threading concept on top of the core: one core can present several logical cores, each of which is called a thread.

Unlike a core, a thread is not a physical concept but a software one: it essentially uses the core's idle time to run other code. A thread therefore only provides concurrency, not true parallelism.
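
What a program sees through the OS is the logical CPU count; a small C++ sketch:

#include <iostream>
#include <thread>

// hardware_concurrency() reports logical CPUs (cores x threads per core),
// e.g. 96 on the 2-socket, 24-cores-per-socket Xeon 8268 machine above.
int main() {
    std::cout << "logical CPUs: " << std::thread::hardware_concurrency() << '\n';
}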

vCPU

When buying cloud servers, you will often run into the concept of a vCPU. A vCPU is simply a virtual core, i.e., the thread described above.

Summary

  • A thread is a unit of execution in concurrent programming.
  • CPU cores are actual hardware components, whereas threads are the
    virtual components that manage the tasks.

Some concepts on CPU

  • Socket is a physical concept: a CPU slot on the motherboard.
  • Node is a logical concept that corresponds to a socket.
  • Core is a physical CPU, an independent hardware execution unit.
  • Thread is the hyper-threading concept: a logical CPU that shares the execution units of its core.

Topology

The machine here has two NUMA nodes, node0 and node1:

# ls /sys/devices/system/node/node*
/sys/devices/system/node/node0:
/sys/devices/system/node/node1:

Socket information can be read from /proc/cpuinfo, where the physical id field identifies the socket number.

#cat /proc/cpuinfo | grep "physical id" |sort -u
physical id     : 0
physical id     : 1

Node0 holds 48 CPUs, numbered 0-23 and 48-71:

# ll /sys/devices/system/node/node0/cpu*

Node1 holds 48 CPUs, numbered 24-47 and 72-95.

  • processor: the unique identifier of this logical processor.
  • core id: the ID of the physical core this logical CPU resides on.
  • cpu cores: the number of cores on one socket.
  • siblings: the number of logical processors in the same physical package.
  • physical id: the socket number.
# cat /proc/cpuinfo | grep "cpu cores" |sort -u
cpu cores       : 24
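
These fields can also be collected programmatically by parsing /proc/cpuinfo; a rough sketch that counts distinct physical id values and picks up the cpu cores and siblings entries (it assumes the usual "key : value" layout):

#include <fstream>
#include <iostream>
#include <set>
#include <string>

int main() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::set<std::string> sockets;      // distinct "physical id" values
    std::string line, cores, siblings;
    while (std::getline(cpuinfo, line)) {
        std::string value = line.substr(line.find(':') + 1);
        if (line.rfind("physical id", 0) == 0) sockets.insert(value);
        else if (line.rfind("cpu cores", 0) == 0) cores = value;
        else if (line.rfind("siblings", 0) == 0) siblings = value;
    }
    std::cout << "sockets:" << sockets.size()
              << "  cores/socket:" << cores
              << "  siblings/socket:" << siblings << '\n';
}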

Viewing the Cache

# ll /sys/devices/system/cpu/cpu0/cache/
total 0
drwxr-xr-x 2 root root 0 Jun 29 17:33 index0
drwxr-xr-x 2 root root 0 Jun 29 17:33 index1
drwxr-xr-x 2 root root 0 Jun 29 17:33 index2
drwxr-xr-x 2 root root 0 Jun 29 17:33 index3

index0 corresponds to the L1 Data Cache,
index1 to the L1 Instruction Cache,
index2 to the L2 Cache,
index3 to the L3 Cache.
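
The mapping can be confirmed by reading the level and type files under each index directory; a short sketch:

#include <fstream>
#include <iostream>
#include <string>

// Print level, type and size for index0..index3 of cpu0.
int main() {
    for (int i = 0; i < 4; i++) {
        std::string dir =
            "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(i) + "/";
        std::string level, type, size;
        std::ifstream(dir + "level") >> level;
        std::ifstream(dir + "type")  >> type;
        std::ifstream(dir + "size")  >> size;
        std::cout << "index" << i << ": L" << level << ' '
                  << type << ' ' << size << '\n';
    }
}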

_mm_prefetch

void _mm_prefetch (char const* p, int i)

Fetches the cache line that contains address p, using the given strategy: the line of data from memory that contains p is loaded into a location in the cache hierarchy specified by the locality hint i.

The actual implementation depends on the particular CPU. This instruction is considered a hint, so the CPU is also free to simply ignore the request.

  • Most modern CPUs already automatically prefetch data based on
    predicted access patterns.
  • Data is usually not fetched if this would cause a TLB miss or a page fault.
  • Too much prefetching can cause unnecessary cache evictions.
  • Prefetching may also fail if there are not enough memory-subsystem
    resources (e.g., request buffers).

Software Prefetch Distance

Prefetching is useful only if prefetch requests are issued early enough to fully hide the memory latency. For an array data structure, the memory address of the cache block to be prefetched is calculated by adding a constant value D to the current array index. This value D is called the prefetch distance: how far ahead of the current access a prefetch should be requested. For an array data structure it can be calculated as follows

D >= l/s

where l is the prefetch latency and s is the length of the shortest path through the loop body. The average memory latency can vary at runtime, and so can the average execution time of one loop iteration, so the prefetch distance should be chosen large enough to hide the latency. However, if it is too large, prefetched data can evict useful cache blocks, and the elements at the beginning of the array may not be prefetched, leading to less coverage and more cache misses. The prefetch distance therefore has a significant effect on the overall performance of the prefetcher.
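
A minimal sketch of distance-D software prefetching over an array (D = 16 is only an illustrative value; in practice it must be tuned per the formula above):

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

// Sum an array while prefetching the element D iterations ahead, so the
// cache line arrives before the loop reaches it.
float sum_with_prefetch(const float* a, std::size_t n) {
    constexpr std::size_t D = 16;   // illustrative prefetch distance
    float s = 0.0f;
    for (std::size_t i = 0; i < n; i++) {
        if (i + D < n)
            _mm_prefetch(reinterpret_cast<const char*>(a + i + D), _MM_HINT_T0);
        s += a[i];
    }
    return s;
}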

When to Prefetch

When to start prefetching: if a prefetch is issued too late, even after the point where the data is needed, the strategy helps nothing and only wastes hardware resources; issuing it too early is just as bad, since the line may be evicted again before it is used.

Locality hints

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the cache hierarchy specified by a locality hint:

  • T0 (temporal data)—prefetch data into all levels of the cache
    hierarchy.
  • T1 (temporal data with respect to first level cache misses)—prefetch data into level 2 cache and higher.
  • T2 (temporal data with respect to second level cache misses)—prefetch data into level 3 cache and higher, or an implementation-specific choice.
  • NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution.
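
With the x86 intrinsics, these hints correspond to the _MM_HINT_* constants passed as the second argument of _mm_prefetch:

#include <xmmintrin.h>

// The four locality hints as intrinsic constants.
void prefetch_with_each_hint(const char* p) {
    _mm_prefetch(p, _MM_HINT_T0);   // into all cache levels
    _mm_prefetch(p, _MM_HINT_T1);   // into L2 and higher
    _mm_prefetch(p, _MM_HINT_T2);   // into L3 and higher
    _mm_prefetch(p, _MM_HINT_NTA);  // non-temporal, minimize pollution
}
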
Cache line — the unit of data transfer between cache and memory

A cache line is the unit of data transfer between the cache and main memory. Typically the cache line is 64 bytes. The processor will read or write an entire cache line when any location in the 64 byte region is read or written. The processors also attempt to prefetch cache lines by analyzing the memory access pattern of a thread.

Viewing the Cache Line Size

# cat /sys/devices/system/cpu/cpu1/cache/index0/coherency_line_size
64
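
Because the 64-byte line is the unit of transfer between caches, data that different threads update independently is often aligned to the line size so that two writers never share one line; a minimal sketch, assuming 64-byte lines:

#include <cstdint>

// alignas(64) puts each counter on its own cache line, so two threads
// updating them do not bounce a shared line between their caches.
struct Counters {
    alignas(64) std::uint64_t a;
    alignas(64) std::uint64_t b;
};
static_assert(sizeof(Counters) == 128, "one 64-byte line per counter");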

The Principle of Locality

Data accessed by a program over a period of time usually exhibits locality. For a one-dimensional array, for example, after the element at address x is accessed, the probability of soon accessing the elements at x+1 and x+2 is relatively high; likewise, data accessed now is likely to be accessed again in the near future. Locality comes in two forms, temporal locality and spatial locality: temporal locality means that data accessed now will probably be accessed again soon; spatial locality means that addresses near the currently accessed address will probably be accessed soon. Processors insert caches between memory and the cores to exploit this locality and improve program performance, obtaining speed close to that of the cache at a cost far below building everything out of cache memory.

Temporal Locality

Code 1:

for (loop = 0; loop < 10; loop++) {
    for (i = 0; i < N; i++) {
        ... = ... x[i] ...
    }
}

Code 2:

for (i = 0; i < N; i++) {
    for (loop = 0; loop < 10; loop++) {
        ... = ... x[i] ...
    }
}

Code 2 performs better than code 1: each element of x is now reused ten times in succession, so it is far more likely to still be in the cache. The rearranged code shows better temporal locality in its use of x[i].

Spatial Locality

Code 1:

for i = 1..n
    for j = 1..n
        for k = 1..n
            c[i,j] += a[i,k] * b[k,j]

Code 2:

for i = 1..n
    for k = 1..n
        for j = 1..n
            c[i,j] += a[i,k] * b[k,j]

Code 2 performs better than code 1.
In code 2, b[k,j] is accessed along a row (j varies in the innermost loop), so there is good spatial locality and each cache line is fully used.
In code 1, b[k,j] is accessed down a column; since the matrix is stored row by row, only one element of each loaded cache line is used during the traversal.

Example of Non-Temporal Data Optimization

As a simple example, assume that we have a program that uses two arrays. The first array is 2 MB large, and the second array is 8 MB large. The program first iterates through the 2 MB array from beginning to end, then iterates through the 8 MB array from beginning to end, and then repeats this a number of times.

Now assume that this program is run on a processor with a 3 MB cache. When it starts from the beginning of the 2 MB array, it has just iterated through the 8 MB array, so the 2 MB array will have been evicted from the cache and every cache line in it will have to be fetched from memory. When it then starts from the beginning of the 8 MB array, it will have touched the 2 MB array and the rest of the 8 MB array since it last touched each cache line, so again every cache line will have been evicted and must be fetched from memory.

We get a cache line fetch for each cache line each time we go through each of the arrays. However, we know that the larger array is not going to fit in the cache anyway. If we could tell the processor not to try to cache it at all, the smaller array would actually fit in the cache. Instead of getting cache line fetches in both the small and large arrays, we could at least get cache hits in the small array.

In a situation like this we would like to be able to tell the processor not to cache the larger of the two arrays, and most modern processors implement instructions that let us do exactly that. These instructions are said to carry non-temporal hints, and they allow you to tell the processor which data is non-temporal and should not be cached.
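
A sketch of the idea on the example above, walking the large 8 MB array with the NTA hint so that its lines do not evict the 2 MB array (whether the hint is honored is up to the CPU):

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_NTA
#include <cstddef>

// Iterate the large array with non-temporal prefetches so that its data
// passes close to the processor without displacing the small array.
float sum_large_array(const float* big, std::size_t n) {
    constexpr std::size_t D = 16;   // illustrative prefetch distance
    float s = 0.0f;
    for (std::size_t i = 0; i < n; i++) {
        if (i + D < n)
            _mm_prefetch(reinterpret_cast<const char*>(big + i + D), _MM_HINT_NTA);
        s += big[i];
    }
    return s;
}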

General Optimization Tips

Row-wise Access

In C++, arrays are stored in memory row by row, so when accessing a multi-dimensional array, iterate by rows and avoid iterating by columns. Column-wise access jumps through memory, and with large data sets it causes cache lines to be replaced constantly, which is very cache-unfriendly.
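
A minimal illustration: summing a matrix row by row walks memory sequentially, while summing it column by column strides a whole row length between accesses:

#include <cstddef>

constexpr std::size_t N = 1024;

// Row-wise traversal: consecutive addresses, each cache line fully used.
double sum_by_rows(const double (&m)[N][N]) {
    double s = 0.0;
    for (std::size_t i = 0; i < N; i++)
        for (std::size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

// Column-wise traversal: jumps N * sizeof(double) bytes per access, so
// only one element of each loaded cache line is used.
double sum_by_cols(const double (&m)[N][N]) {
    double s = 0.0;
    for (std::size_t j = 0; j < N; j++)
        for (std::size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}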

Instruction Prefetch

Try to place the program's hot functions close together: this helps instruction prefetching and reduces instruction cache pressure. For calls across modules, instruction placement can be tuned by adjusting the compiler's link order.

In addition, frequent jump instructions hurt instruction prefetching. We can use software branch-prediction hints to help the compiler generate better-optimized instructions. On Linux this is typically written as follows:
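
The macros below are the conventional GCC/Clang definitions (as in the Linux kernel sources); the surrounding function is an illustrative use:

int do_work(void);   // illustrative hot-path function

// Conventional GCC/Clang branch-hint macros, as used in the Linux kernel.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int err) {
    if (unlikely(err))
        return -1;        // cold error path
    else
        return do_work(); // hot path, laid out to fall through
}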

likely and unlikely are C/C++ macros: likely tells the compiler that the branch that follows is very probable, and unlikely means the opposite. In the example above, because unlikely is used, the compiler assumes the if branch is rarely executed, so when generating instructions it places the else branch first. This reduces instruction jumps at run time and lets instructions execute as sequentially as possible.
