(To be continued; updated regularly)

Contents

PMU

perf map

Cache types

TLB

TLB structure and hierarchy

Task switching

Monitoring with perf

Cache

Structure and hierarchy

Monitoring with perf


PMU

perf map

perf list prints a set of generic hardware events.

How do they map to Intel PMU events? See arch/x86/events/intel/core.c:

intel_pmu_init()
---
    case INTEL_FAM6_SKYLAKE_MOBILE:
    case INTEL_FAM6_SKYLAKE_DESKTOP:
    case INTEL_FAM6_SKYLAKE_X:
    case INTEL_FAM6_KABYLAKE_MOBILE:
    case INTEL_FAM6_KABYLAKE_DESKTOP:
        x86_pmu.late_ack = true;
        memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));
        ...
        name = "skylake";
        ...
    }
    snprintf(pmu_name_str, sizeof(pmu_name_str), "%s", name);
---
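  1. Find the Intel microarchitecture of the CPU. For the history of the microarchitectures, see List of Intel CPU microarchitectures: https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures. To check, run: cat /sys/devices/cpu/caps/pmu_name. For example, on an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz it prints skylake.
  2. Look up the corresponding entries in skl_hw_cache_event_ids, for example the iTLB ones.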

The exact meaning of each event is documented in Intel PMU Hardware Event: https://perfmon-events.intel.com/skylake.html

The event descriptions also use the word "retired". For its meaning, see the Stack Overflow question "performance - What does Intel mean by "retired"?":

In the context "retired" means: the instruction (microoperation, μop) leaves the "Retirement Unit". It means that in Out-of-order CPU pipeline the instruction is finally executed and its results are correct and visible in the architectural state as if they execute in-order. In performance context this is the number you should check to compute how many instructions were really executed (with useful output)

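Cache types

http://www.cs.uni.edu/~diesburg/courses/cs3430_sp14/sessions/s14/s14_caching_and_tlbs.pdf

Cache organization:

The three organizations (direct-mapped, set-associative, and fully associative) can be thought of intuitively as an array, a hash table, and a linked list, respectively.

Example: 32 KB, 8-way set associative, 64-byte line size

  • the total cache capacity is 32 KB
  • each set holds 8 * 64 bytes = 512 bytes
  • there are 2^15 / 2^9 = 2^6 = 64 sets in total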
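The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper names cache_geometry and split_address are made up for this example, and the code assumes that the line size and the number of sets are powers of two.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (num_sets, offset_bits, index_bits) for a set-associative cache."""
    set_bytes = ways * line_bytes
    if size_bytes % set_bytes:
        raise ValueError("cache size must be a multiple of ways * line size")
    num_sets = size_bytes // set_bytes
    # log2 via bit_length(); only valid for powers of two (assumption).
    offset_bits = line_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    return num_sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# 32 KB, 8-way set associative, 64-byte lines
num_sets, offset_bits, index_bits = cache_geometry(32 * 1024, 8, 64)
print(num_sets, offset_bits, index_bits)                 # 64 6 6
print(split_address(0x12345, offset_bits, index_bits))   # (18, 13, 5)
```

With 64 sets and 64-byte lines, the low 6 address bits select the byte within the line, the next 6 bits select the set, and the remaining bits form the tag that is compared against the 8 ways of that set.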

In addition, a cache hierarchy can be inclusive or exclusive. See this passage from Section 3.2 of "Memory part 2: CPU caches" [LWN.net]:

To be able to load new data in a cache it is almost always first necessary to make room in the cache. An eviction from L1d pushes the cache line down into L2 (which uses the same cache line size). This of course means room has to be made in L2. This in turn might push the content into L3 and ultimately into main memory. Each eviction is progressively more expensive. What is described here is the model for an exclusive cache as is preferred by modern AMD and VIA processors. Intel implements inclusive caches {This generalization is not completely correct. A few caches are exclusive and some inclusive caches have exclusive cache properties.} where each cache line in L1d is also present in L2. Therefore evicting from L1d is much faster. With enough L2 cache the disadvantage of wasting memory for content held in two places is minimal and it pays off when evicting. A possible advantage of an exclusive cache is that loading a new cache line only has to touch the L1d and not the L2, which could be faster.

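In short, the relationship between an upper and a lower cache level falls into two kinds:

  • inclusive: the upper and lower levels hold copies of the same content
  • exclusive: the lower level serves only as a victim cache for the upper level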
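To make the difference concrete, here is a toy two-level model in Python. It is a sketch under strong assumptions, not a description of any real CPU: both levels are fully associative with LRU replacement, capacities are counted in cache lines, and the class name TwoLevelCache is invented for this example.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Toy fully associative L1 + L2 with LRU replacement (capacities in lines)."""

    def __init__(self, l1_lines, l2_lines, inclusive):
        self.l1 = OrderedDict()
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.inclusive = inclusive

    def _insert_l2(self, line):
        self.l2[line] = True
        self.l2.move_to_end(line)
        if len(self.l2) > self.l2_lines:
            victim, _ = self.l2.popitem(last=False)
            if self.inclusive:
                # Back-invalidate so that L1 stays a subset of L2.
                self.l1.pop(victim, None)

    def access(self, line):
        """Touch a line; return 'L1', 'L2' or 'MEM' depending on where it hit."""
        if line in self.l1:
            self.l1.move_to_end(line)
            if self.inclusive:
                self.l2.move_to_end(line)
            return "L1"
        where = "L2" if line in self.l2 else "MEM"
        if self.inclusive:
            self._insert_l2(line)        # the line lives in both levels
        else:
            self.l2.pop(line, None)      # the line moves up, it is not copied
        self.l1[line] = True
        if len(self.l1) > self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            if not self.inclusive:
                self._insert_l2(victim)  # L2 acts as a victim cache
        return where


exclusive = TwoLevelCache(l1_lines=2, l2_lines=4, inclusive=False)
inclusive = TwoLevelCache(l1_lines=2, l2_lines=4, inclusive=True)
for line in "ABCDEF":
    exclusive.access(line)
    inclusive.access(line)
print(exclusive.access("A"))  # L2: the pair holds 2 + 4 distinct lines
print(inclusive.access("A"))  # MEM: only 4 distinct lines fit
```

In this model the exclusive pair behaves like one cache of 6 lines, while the inclusive pair is limited to the 4 lines of L2, which is the trade-off the quoted passage describes.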

TLB

TLB structure and hierarchy

No official Xeon document describing the TLB entry format was found, but the Nios II documentation, section 3.2.4 TLB Organization, can serve as a reference.

TLB Tag Format

Field Name   Description
VPN          VPN is the virtual page number field. This field is compared with the top 20 bits of the virtual address.
PID          PID is the process identifier field. This field is compared with the value of the current process identifier stored in the tlbmisc control register, effectively extending the virtual address. The field size is configurable in the Nios_II Processor parameter editor, and can be between 8 and 14 bits.
G            G is the global flag. When G = 1, the PID is ignored in the TLB lookup.

TLB Data Format

Field Name   Description
PFN          PFN is the physical frame number field. This field specifies the upper bits of the physical address. The size of this field depends on the range of physical addresses present in the system. The maximum size is 20 bits.
C            C is the cacheable flag. Determines the default data cacheability of a page. Can be overridden for data accesses using I/O load and store family of Nios II instructions.
R            R is the readable flag. Allows load instructions to read a page.
W            W is the writable flag. Allows store instructions to write a page.
X            X is the executable flag. Allows instruction fetches from a page.

Task switching

PID and G deserve special attention: they determine whether the TLB has to be invalidated on a context switch.
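The role of PID and G in the lookup can be sketched as follows. This is a simplified Python model based only on the Nios II tag fields described above; the TlbEntry class and the tlb_lookup function are hypothetical names, and a real TLB compares all entries in parallel in hardware rather than looping.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    vpn: int   # virtual page number (tag)
    pid: int   # process identifier (tag)
    g: bool    # global flag: the PID is ignored when True
    pfn: int   # physical frame number (data)


def tlb_lookup(tlb, vpn, current_pid):
    """Return the PFN of the matching entry, or None on a TLB miss."""
    for entry in tlb:
        if entry.vpn == vpn and (entry.g or entry.pid == current_pid):
            return entry.pfn
    return None


tlb = [
    TlbEntry(vpn=0x10, pid=1, g=False, pfn=0xA0),  # private page of process 1
    TlbEntry(vpn=0x10, pid=2, g=False, pfn=0xB0),  # same VPN, process 2
    TlbEntry(vpn=0xF0, pid=1, g=True, pfn=0xC0),   # global page, e.g. kernel
]

print(hex(tlb_lookup(tlb, 0x10, current_pid=1)))  # 0xa0
print(hex(tlb_lookup(tlb, 0x10, current_pid=2)))  # 0xb0
print(hex(tlb_lookup(tlb, 0xF0, current_pid=2)))  # 0xc0
```

Because the PID is part of the tag, entries of different processes can coexist for the same VPN, so a context switch only needs to change the current PID instead of flushing everything. Global entries match regardless of the PID.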

Reference:

Intel® 64 and IA-32 Architectures Software Developer’s Manual

Volume 3A: System Programming Guide, Part 1 September 2016

4.10.1 Process-Context Identifiers (PCIDs)

Process-context identifiers (PCIDs) are a facility by which a logical processor may cache information for multiple linear-address spaces. The processor may retain cached information when software switches to a different linear-address space with a different PCID (e.g., by loading CR3; see Section 4.10.4.1 for details).

A PCID is a 12-bit identifier. Non-zero PCIDs are enabled by setting the PCIDE flag (bit 17) of CR4. If CR4.PCIDE = 0, the current PCID is always 000H; otherwise, the current PCID is the value of bits 11:0 of CR3. Not all processors allow CR4.PCIDE to be set to 1.

When a logical processor creates entries in the TLBs (Section 4.10.2) and paging-structure caches (Section 4.10.3), it associates those entries with the current PCID. When using entries in the TLBs and paging-structure caches to translate a linear address, a logical processor uses only those entries associated with the current PCID.

4.10.4.1 Operations that Invalidate TLBs and Paging-Structure Caches

MOV to CR3. The behavior of the instruction depends on the value of CR4.PCIDE:

—  If CR4.PCIDE = 1 and bit 63 of the instruction’s source operand is 0, the instruction invalidates all TLB entries associated with the PCID specified in bits 11:0 of the instruction’s source operand except those for global pages. It also invalidates all entries in all paging-structure caches associated with that PCID. It is not required to invalidate entries in the TLBs and paging-structure caches that are associated with other PCIDs.

4.10.4 Invalidation of TLBs and Paging-Structure Caches

As noted in Section 4.10.2 and Section 4.10.3, the processor may create entries in the TLBs and the paging-structure caches when linear addresses are translated, and it may retain these entries even after the paging structures used to create them have been modified. To ensure that linear-address translation uses the modified paging structures, software should take action to invalidate any cached entries that may contain information that has since been modified.

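The global pages mentioned above are marked by a bit in the page-table entry (the G flag); see the page-table entry format figure in the manual.

(The figure is from the 2006 edition.)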

In the Linux kernel, the relevant code is:

arch/x86/mm/tlb.c, ASID switching during a context switch:
switch_mm_irqs_off()
---
    if (real_prev == next) {
        ...
    } else {
        u16 new_asid;
        bool need_flush;
        ...
        next_tlb_gen = atomic64_read(&next->context.tlb_gen);
        choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);

        if (need_flush) {
            this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
            this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
            load_new_mm_cr3(next->pgd, new_asid, true);
        } else {
            /* The new ASID is already up to date. */
            load_new_mm_cr3(next->pgd, new_asid, false);
            ...
        }
        ...
    }
---

Three values are key here:

  • asid: there are only 6 per CPU (TLB_NR_DYN_ASIDS is 6), and all tasks scheduled onto that CPU take turns using them

    choose_new_asid()
    ---
        /*
         * We don't currently own an ASID slot on this CPU.
         * Allocate a slot.
         */
        *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
        if (*new_asid >= TLB_NR_DYN_ASIDS) {
            *new_asid = 0;
            this_cpu_write(cpu_tlbstate.next_asid, 1);
        }
        *need_flush = true;
    ---
    
  • ctx_id: every mm (address space) gets a ctx_id from a global atomic counter. It is used to tell whether the current CPU already holds an ASID for that mm, which decides whether a new ASID has to be allocated. Allocating a new one means invalidating the TLB entries previously associated with that ASID.
    init_new_context()
    ---
        mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
        atomic64_set(&mm->context.tlb_gen, 0);
    ---
    choose_new_asid()
    ---
        for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
            if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
                next->context.ctx_id)
                continue;

            *new_asid = asid;
            *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
                           next_tlb_gen);
            return;
        }
    ---
    
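  • tlb_gen: per mm, i.e. per address space. It is incremented whenever the mm is flushed, so even the TLB entries of the same context may be stale and need a flush.
    flush_tlb_mm_range()
    ---
        /* This is also a barrier that synchronizes with switch_mm(). */
        info.new_tlb_gen = inc_mm_tlb_gen(mm);
        ---
            return atomic64_inc_return(&mm->context.tlb_gen);
        ---
    ---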
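How the three values interact can be modelled in a few lines of Python. This is a simplified sketch of the kernel excerpts above, not the kernel code itself: CpuTlbState and switch_mm are illustrative names, a ctx_id of 0 is assumed to mean "empty slot", next_asid is assumed to start at 1, and the actual CR3 load and TLB flush are left out.

```python
TLB_NR_DYN_ASIDS = 6


class CpuTlbState:
    """Per-CPU bookkeeping, loosely modelled on cpu_tlbstate."""

    def __init__(self):
        # Each slot records which mm (ctx_id) it caches and up to which tlb_gen.
        self.ctxs = [{"ctx_id": 0, "tlb_gen": 0} for _ in range(TLB_NR_DYN_ASIDS)]
        self.next_asid = 1  # assumption: slot 0 is considered taken at boot


def choose_new_asid(cpu, ctx_id, next_tlb_gen):
    """Return (asid, need_flush) for the mm identified by ctx_id."""
    for asid, slot in enumerate(cpu.ctxs):
        if slot["ctx_id"] != ctx_id:
            continue
        # This CPU already holds a slot for the mm: flush only if it is stale.
        return asid, slot["tlb_gen"] < next_tlb_gen
    # No slot owned on this CPU: take the next one round-robin and flush it.
    asid = cpu.next_asid
    cpu.next_asid += 1
    if asid >= TLB_NR_DYN_ASIDS:
        asid = 0
        cpu.next_asid = 1
    return asid, True


def switch_mm(cpu, ctx_id, next_tlb_gen):
    """Model of the else-branch of switch_mm_irqs_off()."""
    asid, need_flush = choose_new_asid(cpu, ctx_id, next_tlb_gen)
    if need_flush:
        cpu.ctxs[asid] = {"ctx_id": ctx_id, "tlb_gen": next_tlb_gen}
    return asid, need_flush


cpu = CpuTlbState()
print(switch_mm(cpu, ctx_id=10, next_tlb_gen=1))  # (1, True): new slot
print(switch_mm(cpu, ctx_id=11, next_tlb_gen=1))  # (2, True): new slot
print(switch_mm(cpu, ctx_id=10, next_tlb_gen=1))  # (1, False): slot still valid
print(switch_mm(cpu, ctx_id=10, next_tlb_gen=2))  # (1, True): tlb_gen advanced
```

The third call shows the benefit of ASIDs: switching back to an mm whose slot is still present and up to date needs no flush. The fourth call shows why tlb_gen exists: the slot is owned, but the mm was flushed in the meantime, so the cached entries cannot be trusted.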
    

Monitoring with perf

Consider the following data collected with perf:

perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads make O=../out -j20 > /dev/null

 Performance counter stats for 'make O=../out -j20':

    10,095,672,437      dTLB-load-misses          #    0.09% of all dTLB cache hits   (63.78%)
10,869,338,077,000      dTLB-loads                                                    (63.77%)
     3,475,643,439      dTLB-store-misses                                             (63.77%)
 5,408,658,177,811      dTLB-stores                                                   (63.77%)
     8,101,851,811      iTLB-load-misses          #   22.88% of all iTLB cache hits   (63.76%)
    35,402,725,030      iTLB-loads                                                    (63.77%)

     684.556635266 seconds time elapsed

The corresponding Intel PMU events are:

  • dTLB-loads: MEM_INST_RETIRED.ALL_LOADS, All retired load instructions. The count used here is therefore every load instruction that retired.
  • dTLB-load-misses: DTLB_LOAD_MISSES.WALK_COMPLETED, Counts completed page walks (all page sizes) caused by demand data loads. This implies it missed in the DTLB and further levels of TLB
  • dTLB-stores: MEM_INST_RETIRED.ALL_STORES, All retired store instructions
  • dTLB-store-misses: DTLB_STORE_MISSES.WALK_COMPLETED, Counts completed page walks (all page sizes) caused by demand data stores. This implies it missed in the DTLB and further levels of TLB
  • iTLB-loads: ITLB_MISSES.STLB_HIT, Instruction fetch requests that miss the ITLB and hit the STLB
  • iTLB-load-misses: ITLB_MISSES.WALK_COMPLETED, Counts completed page walks (all page sizes) caused by a code fetch. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB
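The percentages printed by perf are simply misses divided by loads. The short Python snippet below recomputes them from the counts above; the miss_ratio helper is a name chosen for this example.

```python
counts = {
    "dTLB-load-misses": 10_095_672_437,
    "dTLB-loads": 10_869_338_077_000,
    "dTLB-store-misses": 3_475_643_439,
    "dTLB-stores": 5_408_658_177_811,
    "iTLB-load-misses": 8_101_851_811,
    "iTLB-loads": 35_402_725_030,
}


def miss_ratio(counts, misses, accesses):
    """Return misses / accesses as a percentage."""
    return 100.0 * counts[misses] / counts[accesses]


for misses, accesses in [
    ("dTLB-load-misses", "dTLB-loads"),
    ("dTLB-store-misses", "dTLB-stores"),
    ("iTLB-load-misses", "iTLB-loads"),
]:
    print(f"{misses}: {miss_ratio(counts, misses, accesses):.2f}%")
# dTLB-load-misses: 0.09%
# dTLB-store-misses: 0.06%
# iTLB-load-misses: 22.88%
```

Given the event mapping listed above, the two kinds of ratio are not comparable. The dTLB denominators count all retired loads or stores, whereas the iTLB denominator (ITLB_MISSES.STLB_HIT) only counts fetches that already missed the ITLB, so 22.88% is not the share of all instruction fetches that needed a page walk.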

Cache

Structure and hierarchy

Again taking Intel Skylake as the example, see:

Skylake (server) - Microarchitectures - Intel - WikiChip

Its cache hierarchy is:

  • L1I Cache:
    • 32 KiB/core, 8-way set associative
    • 64 sets, 64 B line size
    • competitively shared by the threads/core
  • L1D Cache:
    • 32 KiB/core, 8-way set associative
    • 64 sets, 64 B line size
    • competitively shared by threads/core
    • 4 cycles for fastest load-to-use (simple pointer accesses)
      • 5 cycles for complex addresses
    • 128 B/cycle load bandwidth
    • 64 B/cycle store bandwidth
    • Write-back policy
  • L2 Cache:
    • 1 MiB/core, 16-way set associative
    • 64 B line size
    • Inclusive
    • 64 B/cycle bandwidth to L1$
    • Write-back policy
    • 14 cycles latency
  • L3 Cache:
    • 1.375 MiB/core, 11-way set associative, shared across all cores
    • 2,048 sets, 64 B line size
    • Non-inclusive victim cache
    • Write-back policy
    • 50-70 cycles latency

We see that L2 is inclusive while L3 is non-inclusive. What does that mean? See: Skylake Processors - HECC Knowledge Base

An inclusive L3 cache guarantees that every block that exists in the L2 cache also exists in the L3 cache. A non-inclusive L3 cache does not guarantee this.

A larger L2 cache increases the hit rate into the L2 cache, resulting in lower effective memory latency and lower demand on the mesh interconnect and L3 cache.

If the processor has a miss on all the levels of the cache, it fetches the line from memory and puts it directly into the L2 cache of the requesting core, rather than putting a copy into both the L2 and L3 caches, as is done on Broadwell. When the cache line is evicted from the L2 cache, it is placed into L3 if it is expected to be reused.

Due to the non-inclusive nature of the L3 cache, the absence of a cache line in L3 does not indicate that the line is absent in private caches of any of the cores. Therefore, a snoop filter is used to keep track of the location of cache lines in the L1 or L2 caches of cores when a cache line is not allocated in L3. On the previous-generation processors, the shared L3 itself takes care of this task.

As its feature list says, L3 is a non-inclusive victim cache.

A similar description can also be found at https://www.quora.com/Why-did-Intel-and-AMD-add-an-additional-layer-of-L3-cache-to-share-among-CPU-cores-and-why-couldnt-they-just-expand-and-share-the-L2-cache

Monitoring with perf

Which hardware events do the perf cache events map to?

 Performance counter stats for 'make O=../out -j20':

   586,724,859,432      L1-dcache-load-misses     #    5.40% of all L1-dcache hits    (95.80%)
10,872,724,488,141      L1-dcache-loads                                               (95.85%)
 5,408,474,346,992      L1-dcache-stores                                              (95.84%)
   930,699,520,202      L1-icache-load-misses                                         (95.79%)

     725.217684983 seconds time elapsed

  • L1-dcache-loads: MEM_INST_RETIRED.ALL_LOADS, All retired load instructions. The count used here is every load instruction that retired, the same event as dTLB-loads.
  • L1-dcache-load-misses: L1D.REPLACEMENT, Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace
  • L1-dcache-stores: MEM_INST_RETIRED.ALL_STORES, All retired store instructions
  • L1-icache-load-misses: ICACHE_64B.MISS, Instruction fetch tag lookups that miss in the instruction cache (L1I). Counts at 64-byte cache-line granularity.
