Page Cache and Buffer Cache

In the Linux implementation, the file cache is split into two layers: the Page Cache and the Buffer Cache; each Page Cache page contains a number of Buffer Cache buffers.

The memory management subsystem and the VFS interact only with the Page Cache: memory management handles the allocation and reclamation of each Page Cache page, and sets up the mappings when a file is accessed via memory map;

concrete filesystems generally interact only with the Buffer Cache, exchanging data between the underlying storage device and the Buffer Cache.

The biggest difference between the page cache and the buffer cache is what they cache: the page cache caches file data, while the buffer cache caches device data. Their implementations differ little; both are managed with radix trees.

Buffers are in-memory buffers for block I/O, and they are relatively short-lived. Before Linux kernel 2.4, the page cache and the buffer cache were separate; starting with 2.4 they were unified. Buffers now only cache raw disk blocks, i.e., non-file data that is not in the page cache, so the Buffers metric matters much less. On most systems, Buffers is often only a few tens of MB.
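
To see the two counters on a live system you can read /proc/meminfo; a minimal C++ sketch (Buffers and Cached are the standard field names there):

#include <fstream>
#include <iostream>
#include <string>

// Print the Buffers (raw block cache) and Cached (page cache) counters
// from /proc/meminfo; on most systems Buffers is only a few tens of MB.
int main() {
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line)) {
        if (line.rfind("Buffers:", 0) == 0 || line.rfind("Cached:", 0) == 0)
            std::cout << line << '\n';
    }
}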

CPU cache

The L1 and L2 caches are private to each CPU core.
The L1 cache is further split into L1D (L1 Data Cache) and L1I (L1 Instruction Cache).
The L3 cache is shared by multiple CPU cores.

#Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
CPU Name: Intel Xeon Platinum 8268
Max MHz.: 3900
Nominal: 2900
Enabled: 96 cores, 4 chips
Orderable: 2,4 chips
Cache L1: 32 KB I + 32 KB D on chip per core
L2: 1 MB I+D on chip per core
L3: 35.75 MB I+D on chip per chip
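
The same sizes can also be queried programmatically; a minimal sketch using glibc's sysconf (Linux/glibc-specific constants; values are in bytes, 0 or -1 when not reported):

#include <unistd.h>
#include <cstdio>

// Query the per-level cache sizes that the kernel reports via glibc.
int main() {
    std::printf("L1d: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    std::printf("L1i: %ld bytes\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
    std::printf("L2:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    std::printf("L3:  %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
}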

What is Hyper-Threading?

Hyper-threading was Intel's first effort to bring parallel computation to end users' PCs. It was first used on desktop CPUs with the Pentium 4 in 2002.

Pentium 4 chips at that time featured only a single CPU core, so a CPU could execute just one task at a time and could not run multiple operations in parallel.

A single CPU with hyper-threading appears to the operating system as two logical CPUs. The CPU is still a single physical chip, and the hardware still has one set of execution resources per core, but the OS sees and schedules two CPUs for each core.

The CPU thus presents itself as having more cores than it physically has, and the operating system assumes two CPUs for every physical core.

Because the CPU executes far faster than the other components it waits on, it often sits idle. To make fuller use of CPU resources, Intel built the hyper-threading concept on top of the core: one core can present several logical cores, each of which is called a thread.

Unlike a core, a thread is not a physical concept but a software one: it essentially uses the core's idle time to run other code. A thread therefore only provides concurrency, not true parallelism.
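
What a program sees through the OS is the logical CPU count; a small C++ sketch:

#include <iostream>
#include <thread>

// hardware_concurrency() reports logical CPUs (cores x threads per core),
// e.g. 96 on the 2-socket, 24-cores-per-socket Xeon 8268 machine above.
int main() {
    std::cout << "logical CPUs: " << std::thread::hardware_concurrency() << '\n';
}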

vCPU

When buying cloud servers, you will often run into the concept of a vCPU. A vCPU is simply a virtual core, i.e., the thread described above.

Summary

  • A thread is a unit of execution in concurrent programming.
  • CPU cores are actual hardware components, whereas threads are the
    virtual components that manage the tasks.

Some concepts on CPU

  • Socket is a physical concept: a CPU slot on the motherboard.
  • Node is a logical concept that corresponds to a socket.
  • Core is a physical CPU, an independent hardware execution unit.
  • Thread is the hyper-threading concept: a logical CPU that shares the execution units of its core.

Topology

The machine here has two NUMA nodes, node0 and node1:

# ls /sys/devices/system/node/node*
/sys/devices/system/node/node0:
/sys/devices/system/node/node1:

Socket information can be read from /proc/cpuinfo, where the physical id field identifies the socket number.

#cat /proc/cpuinfo | grep "physical id" |sort -u
physical id     : 0
physical id     : 1

Node0 holds 48 CPUs, numbered 0-23 and 48-71:

# ll /sys/devices/system/node/node0/cpu*

Node1 holds 48 CPUs, numbered 24-47 and 72-95.

  • processor: the unique identifier of this logical processor.
  • core id: the ID of the physical core this logical CPU resides on.
  • cpu cores: the number of cores on one socket.
  • siblings: the number of logical processors in the same physical package.
  • physical id: the socket number.
# cat /proc/cpuinfo | grep "cpu cores" |sort -u
cpu cores       : 24
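
These fields can also be collected programmatically by parsing /proc/cpuinfo; a rough sketch that counts distinct physical id values and picks up the cpu cores and siblings entries (it assumes the usual "key : value" layout):

#include <fstream>
#include <iostream>
#include <set>
#include <string>

int main() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::set<std::string> sockets;      // distinct "physical id" values
    std::string line, cores, siblings;
    while (std::getline(cpuinfo, line)) {
        std::string value = line.substr(line.find(':') + 1);
        if (line.rfind("physical id", 0) == 0) sockets.insert(value);
        else if (line.rfind("cpu cores", 0) == 0) cores = value;
        else if (line.rfind("siblings", 0) == 0) siblings = value;
    }
    std::cout << "sockets:" << sockets.size()
              << "  cores/socket:" << cores
              << "  siblings/socket:" << siblings << '\n';
}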

Viewing the Cache

# ll /sys/devices/system/cpu/cpu0/cache/
total 0
drwxr-xr-x 2 root root 0 Jun 29 17:33 index0
drwxr-xr-x 2 root root 0 Jun 29 17:33 index1
drwxr-xr-x 2 root root 0 Jun 29 17:33 index2
drwxr-xr-x 2 root root 0 Jun 29 17:33 index3

index0 corresponds to the L1 Data Cache,
index1 to the L1 Instruction Cache,
index2 to the L2 Cache,
index3 to the L3 Cache.
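
The mapping can be confirmed by reading the level and type files under each index directory; a short sketch:

#include <fstream>
#include <iostream>
#include <string>

// Print level, type and size for index0..index3 of cpu0.
int main() {
    for (int i = 0; i < 4; i++) {
        std::string dir =
            "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(i) + "/";
        std::string level, type, size;
        std::ifstream(dir + "level") >> level;
        std::ifstream(dir + "type")  >> type;
        std::ifstream(dir + "size")  >> size;
        std::cout << "index" << i << ": L" << level << ' '
                  << type << ' ' << size << '\n';
    }
}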

_mm_prefetch

void _mm_prefetch (char const* p, int i)

Fetches the cache line that contains address p, using the given strategy: the line of data from memory that contains p is loaded into a location in the cache hierarchy specified by the locality hint i.

The actual implementation depends on the particular CPU. This instruction is considered a hint, so the CPU is also free to simply ignore the request.

  • Most modern CPUs already automatically prefetch data based on
    predicted access patterns.
  • Data is usually not fetched if this would cause a TLB miss or a page fault.
  • Too much prefetching can cause unnecessary cache evictions.
  • Prefetching may also fail if there are not enough memory-subsystem
    resources (e.g., request buffers).

Software Prefetch Distance

Prefetching is useful only if prefetch requests are issued early enough to fully hide the memory latency. For an array data structure, the memory address of the cache block to be prefetched is calculated by adding a constant value D to the current array index. This value D is called the prefetch distance: how far ahead of the current access a prefetch should be requested. For an array data structure it can be calculated as follows

D >= l/s

where l is the prefetch latency and s is the length of the shortest path through the loop body. The average memory latency can vary at runtime, and so can the average execution time of one loop iteration, so the prefetch distance should be chosen large enough to hide the latency. However, if it is too large, prefetched data can evict useful cache blocks, and the elements at the beginning of the array may not be prefetched, leading to less coverage and more cache misses. The prefetch distance therefore has a significant effect on the overall performance of the prefetcher.
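
A minimal sketch of distance-D software prefetching over an array (D = 16 is only an illustrative value; in practice it must be tuned per the formula above):

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

// Sum an array while prefetching the element D iterations ahead, so the
// cache line arrives before the loop reaches it.
float sum_with_prefetch(const float* a, std::size_t n) {
    constexpr std::size_t D = 16;   // illustrative prefetch distance
    float s = 0.0f;
    for (std::size_t i = 0; i < n; i++) {
        if (i + D < n)
            _mm_prefetch(reinterpret_cast<const char*>(a + i + D), _MM_HINT_T0);
        s += a[i];
    }
    return s;
}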

When to Prefetch

When to start prefetching: if a prefetch is issued too late, even after the point where the data is needed, the strategy helps nothing and only wastes hardware resources; issuing it too early is just as bad, since the line may be evicted again before it is used.

Locality hints

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the cache hierarchy specified by a locality hint:

  • T0 (temporal data)—prefetch data into all levels of the cache
    hierarchy.
  • T1 (temporal data with respect to first level cache misses)—prefetch data into level 2 cache and higher.
  • T2 (temporal data with respect to second level cache misses)—prefetch data into level 3 cache and higher, or an implementation-specific choice.
  • NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution.
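
With the x86 intrinsics, these hints correspond to the _MM_HINT_* constants passed as the second argument of _mm_prefetch:

#include <xmmintrin.h>

// The four locality hints as intrinsic constants.
void prefetch_with_each_hint(const char* p) {
    _mm_prefetch(p, _MM_HINT_T0);   // into all cache levels
    _mm_prefetch(p, _MM_HINT_T1);   // into L2 and higher
    _mm_prefetch(p, _MM_HINT_T2);   // into L3 and higher
    _mm_prefetch(p, _MM_HINT_NTA);  // non-temporal, minimize pollution
}
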
Cache line — the unit of data transfer between cache and memory

A cache line is the unit of data transfer between the cache and main memory. Typically the cache line is 64 bytes. The processor will read or write an entire cache line when any location in the 64 byte region is read or written. The processors also attempt to prefetch cache lines by analyzing the memory access pattern of a thread.

Viewing the Cache Line Size

# cat /sys/devices/system/cpu/cpu1/cache/index0/coherency_line_size
64
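
Because the 64-byte line is the unit of transfer between caches, data that different threads update independently is often aligned to the line size so that two writers never share one line; a minimal sketch, assuming 64-byte lines:

#include <cstdint>

// alignas(64) puts each counter on its own cache line, so two threads
// updating them do not bounce a shared line between their caches.
struct Counters {
    alignas(64) std::uint64_t a;
    alignas(64) std::uint64_t b;
};
static_assert(sizeof(Counters) == 128, "one 64-byte line per counter");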

The Principle of Locality

Data accessed by a program over a period of time usually exhibits locality. For a one-dimensional array, for example, after the element at address x is accessed, the probability of soon accessing the elements at x+1 and x+2 is relatively high; likewise, data accessed now is likely to be accessed again in the near future. Locality comes in two forms, temporal locality and spatial locality: temporal locality means that data accessed now will probably be accessed again soon; spatial locality means that addresses near the currently accessed address will probably be accessed soon. Processors insert caches between memory and the cores to exploit this locality and improve program performance, obtaining speed close to that of the cache at a cost far below building everything out of cache memory.

Temporal Locality

Code 1:

for (loop = 0; loop < 10; loop++) {
    for (i = 0; i < N; i++) {
        ... = ... x[i] ...
    }
}

Code 2:

for (i = 0; i < N; i++) {
    for (loop = 0; loop < 10; loop++) {
        ... = ... x[i] ...
    }
}

Code 2 performs better than code 1: each element of x is now reused ten times in succession, so it is far more likely to still be in the cache. The rearranged code shows better temporal locality in its use of x[i].

Spatial Locality

Code 1:

for i = 1..n
    for j = 1..n
        for k = 1..n
            c[i,j] += a[i,k] * b[k,j]

Code 2:

for i = 1..n
    for k = 1..n
        for j = 1..n
            c[i,j] += a[i,k] * b[k,j]

Code 2 performs better than code 1.
In code 2, b[k,j] is accessed along a row (j varies in the innermost loop), so there is good spatial locality and each cache line is fully used.
In code 1, b[k,j] is accessed down a column; since the matrix is stored row by row, only one element of each loaded cache line is used during the traversal.

Example of Non-Temporal Data Optimization

As a simple example, assume that we have a program that uses two arrays. The first array is 2 MB large, and the second array is 8 MB large. The program first iterates through the 2 MB array from beginning to end, then iterates through the 8 MB array from beginning to end, and then repeats this a number of times.

Now assume that this program is run on a processor with a 3 MB cache. When it starts from the beginning of the 2 MB array, it has just iterated through the 8 MB array, so the 2 MB array will have been evicted from the cache and every cache line in it will have to be fetched from memory. When it then starts from the beginning of the 8 MB array, it will have touched the 2 MB array and the rest of the 8 MB array since it last touched each cache line, so again every cache line will have been evicted and must be fetched from memory.

We get a cache line fetch for each cache line each time we go through each of the arrays. However, we know that the larger array is not going to fit in the cache anyway. If we could tell the processor not to try to cache it at all, the smaller array would actually fit in the cache. Instead of getting cache line fetches in both the small and large arrays, we could at least get cache hits in the small array.

In a situation like this we would like to be able to tell the processor not to cache the larger of the two arrays, and most modern processors implement instructions that let us do exactly that. These instructions are said to carry non-temporal hints, and they allow you to tell the processor which data is non-temporal and should not be cached.
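
A sketch of the idea on the example above, walking the large 8 MB array with the NTA hint so that its lines do not evict the 2 MB array (whether the hint is honored is up to the CPU):

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_NTA
#include <cstddef>

// Iterate the large array with non-temporal prefetches so that its data
// passes close to the processor without displacing the small array.
float sum_large_array(const float* big, std::size_t n) {
    constexpr std::size_t D = 16;   // illustrative prefetch distance
    float s = 0.0f;
    for (std::size_t i = 0; i < n; i++) {
        if (i + D < n)
            _mm_prefetch(reinterpret_cast<const char*>(big + i + D), _MM_HINT_NTA);
        s += big[i];
    }
    return s;
}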

General Optimization Tips

Row-wise Access

In C++, arrays are stored in memory row by row, so when accessing a multi-dimensional array, iterate by rows and avoid iterating by columns. Column-wise access jumps through memory, and with large data sets it causes cache lines to be replaced constantly, which is very cache-unfriendly.
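
A minimal illustration: summing a matrix row by row walks memory sequentially, while summing it column by column strides a whole row length between accesses:

#include <cstddef>

constexpr std::size_t N = 1024;

// Row-wise traversal: consecutive addresses, each cache line fully used.
double sum_by_rows(const double (&m)[N][N]) {
    double s = 0.0;
    for (std::size_t i = 0; i < N; i++)
        for (std::size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

// Column-wise traversal: jumps N * sizeof(double) bytes per access, so
// only one element of each loaded cache line is used.
double sum_by_cols(const double (&m)[N][N]) {
    double s = 0.0;
    for (std::size_t j = 0; j < N; j++)
        for (std::size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}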

Instruction Prefetch

Try to place the program's hot functions close together: this helps instruction prefetching and reduces instruction cache pressure. For calls across modules, instruction placement can be tuned by adjusting the compiler's link order.

In addition, frequent jump instructions hurt instruction prefetching. We can use software branch-prediction hints to help the compiler generate better-optimized instructions. On Linux this is typically written as follows:
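
The macros below are the conventional GCC/Clang definitions (as in the Linux kernel sources); the surrounding function is an illustrative use:

int do_work(void);   // illustrative hot-path function

// Conventional GCC/Clang branch-hint macros, as used in the Linux kernel.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int err) {
    if (unlikely(err))
        return -1;        // cold error path
    else
        return do_work(); // hot path, laid out to fall through
}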

likely and unlikely are C/C++ macros: likely tells the compiler that the branch that follows is very probable, and unlikely means the opposite. In the example above, because unlikely is used, the compiler assumes the if branch is rarely executed, so when generating instructions it places the else branch first. This reduces instruction jumps at run time and lets instructions execute as sequentially as possible.
