Why MLC

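Two important factors affect program performance:

① The time an application spends fetching data from the processor caches and from the memory subsystem, each of which adds its own latency;

② Bandwidth (b/w, short for bandwidth, not Bilibili World).

MLC (Intel Memory Latency Checker) measures exactly these two things.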

What it tests

Node access speed

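Under the NUMA (Non-Uniform Memory Access) architecture, memory devices and CPU cores are grouped into nodes, and each node has its own integrated memory controller (IMC). This removes the bottleneck of every processor reaching all memory over one shared bus, which eases bus-bandwidth contention and memory access conflicts.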

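(Note: core = a physical core, an independent hardware execution unit; thread = a logical CPU, i.e. a hardware thread.)

On the system tested here, socket = node, corresponding to a CPU socket on the motherboard. Within a node, cores reach memory over the IMC bus; different nodes communicate over QPI (QuickPath Interconnect).

Same-city express delivery is obviously faster than international mail, so QPI (remote) latency is noticeably higher than IMC bus (local) latency.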

Test example:

Command to query memory access latency

./mlc --latency_matrix

Result

                Numa node
Numa node            0       1
       0          82.2   129.6
       1         131.1    81.6

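This is the idle memory access latency matrix within and between nodes, in ns. The diagonal entries are local accesses; the off-diagonal entries are remote accesses.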
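To make the local/remote gap concrete, here is a minimal Python sketch that parses a matrix like the one above and computes the remote-to-local latency ratio per node. The function names `parse_matrix` and `remote_to_local_ratio` are made up for this illustration, and the parser assumes node IDs are numbered 0..n-1 in row order, as in the sample.

```python
import re

# Sample text copied from the --latency_matrix result above.
SAMPLE = """\
                Numa node
Numa node            0       1
       0          82.2   129.6
       1         131.1    81.6
"""

def parse_matrix(text):
    """Return {row_node: [values...]} from an MLC-style NUMA matrix."""
    matrix = {}
    for line in text.splitlines():
        # Data rows look like: "<node id> <float> <float> ..."
        m = re.fullmatch(r"\s*(\d+)((?:\s+\d+\.\d+)+)\s*", line)
        if m:
            matrix[int(m.group(1))] = [float(v) for v in m.group(2).split()]
    return matrix

def remote_to_local_ratio(matrix):
    """For each node, worst remote latency divided by local latency."""
    ratios = {}
    for node, row in matrix.items():
        local = row[node]  # assumes node IDs index the columns directly
        remote = [v for i, v in enumerate(row) if i != node]
        if remote:  # a single-node system has no remote entries
            ratios[node] = max(remote) / local
    return ratios

latency = parse_matrix(SAMPLE)
ratios = remote_to_local_ratio(latency)
for node, ratio in ratios.items():
    print(f"node {node}: remote/local = {ratio:.2f}")
```

On this sample, remote access costs roughly 1.6 times the local latency on both nodes.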

Bandwidth

Bandwidth is the amount of data transferred per unit time: the wider the road, the fewer the traffic jams.

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :  69143.9
3:1 Reads-Writes :  61908.4
2:1 Reads-Writes :  60040.5
1:1 Reads-Writes :  54517.6
Stream-triad like:  57473.4

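Each row shows the memory bandwidth at a given read:write (r:w) ratio.

In general, memory writes are slower than reads (talk is cheap, show me the code).

So as the read:write ratio drops, bandwidth tends to drop as well (the road narrows and traffic backs up).

Troubleshooting: if bandwidth drops sharply, the share of writes may have increased, or the writing program may be misbehaving and running too slowly.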
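As a rough illustration of how much bandwidth is lost as writes are mixed in, the sketch below parses the five result lines shown above and reports each one's drop relative to the all-reads figure. `parse_bandwidths` and `drop_vs_all_reads` are hypothetical helper names, not part of MLC.

```python
# Result lines copied from the peak injection bandwidth output above.
SAMPLE = """\
ALL Reads        :  69143.9
3:1 Reads-Writes :  61908.4
2:1 Reads-Writes :  60040.5
1:1 Reads-Writes :  54517.6
Stream-triad like:  57473.4
"""

def parse_bandwidths(text):
    """Map each traffic label to its bandwidth in MB/sec."""
    result = {}
    for line in text.splitlines():
        # Split on the LAST colon, since labels such as "3:1" contain one too.
        label, sep, value = line.rpartition(":")
        if sep:
            result[label.strip()] = float(value)
    return result

def drop_vs_all_reads(bandwidths):
    """Percentage drop of each traffic type relative to 'ALL Reads'."""
    base = bandwidths["ALL Reads"]
    return {k: (base - v) / base * 100.0 for k, v in bandwidths.items()}

bandwidths = parse_bandwidths(SAMPLE)
drops = drop_vs_all_reads(bandwidths)
for label, pct in drops.items():
    print(f"{label:<18} {pct:5.1f}% below ALL Reads")
```

For this sample the 1:1 mix comes out about 21% below the all-reads bandwidth.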

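Test example

Command to query memory access bandwidth (can also be used on its own to check whether memory access between NUMA nodes is normal)

./mlc --bandwidth_matrix

Result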

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1
       0        35216.6 32537.9
       1        31875.1 35048.5

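Troubleshooting: if the off-diagonal values (the remote accesses, node 0 to node 1 and node 1 to node 0) differ greatly, the bandwidth between the two nodes is asymmetric.

Fix: when such an imbalance appears, check how the DIMMs are populated, whether any memory is faulty, and NUMA balancing.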
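The off-diagonal check described above can be sketched in a few lines of Python. The matrix is the sample result from this post; the 10% threshold is an arbitrary value chosen for illustration, not a figure from MLC or Intel, and `remote_asymmetry` is a hypothetical helper name.

```python
# Bandwidth matrix in MB/sec from the --bandwidth_matrix result above;
# bandwidth[i][j] is the bandwidth seen by node i accessing memory on node j.
bandwidth = [
    [35216.6, 32537.9],
    [31875.1, 35048.5],
]

def remote_asymmetry(matrix):
    """Return (i, j, relative difference) for every node pair i < j."""
    pairs = []
    n = len(matrix)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = matrix[i][j], matrix[j][i]
            pairs.append((i, j, abs(a - b) / max(a, b)))
    return pairs

THRESHOLD = 0.10  # hypothetical tolerance: flag pairs differing by more than 10%

pairs = remote_asymmetry(bandwidth)
for i, j, rel in pairs:
    status = "check DIMMs / NUMA balancing" if rel > THRESHOLD else "ok"
    print(f"node {i} <-> node {j}: {rel:.1%} difference ({status})")
```

In the sample the two directions differ by about 2%, so nothing would be flagged under this threshold.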

Relationship between memory bandwidth and memory latency (read operations)

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
00000  523.74    69057.4
00002  589.55    68668.7
00008  686.99    68571.4
00015  549.87    68873.6
00050  575.48    68673.0
00100  524.74    68877.5
00200  197.61    64225.8
00300  131.60    47141.0
00400  110.39    36803.0
00500  117.32    30135.2
00700  100.90    22179.1
01000  100.93    15762.8
01300   91.74    12351.6
01700   98.61     9475.2
02500   86.66     6927.8
03500   88.13     5132.6
05000   87.68     3818.6
09000   85.36     2473.5
20000   84.83     1538.7

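This shows how memory latency responds as load increases, and whether the response time becomes unacceptable once a certain bandwidth is reached.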
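One way to read such a table is to ask: what is the highest bandwidth the system sustains while latency stays under some limit? The sketch below answers that for the loaded-latency rows above. The 150 ns limit is an arbitrary example value, and `parse_rows` and `best_row_within` are names invented for this illustration.

```python
# Rows copied from the loaded-latency table above:
# inject delay, latency (ns), bandwidth (MB/sec).
SAMPLE = """\
00000  523.74    69057.4
00002  589.55    68668.7
00008  686.99    68571.4
00015  549.87    68873.6
00050  575.48    68673.0
00100  524.74    68877.5
00200  197.61    64225.8
00300  131.60    47141.0
00400  110.39    36803.0
00500  117.32    30135.2
00700  100.90    22179.1
01000  100.93    15762.8
01300   91.74    12351.6
01700   98.61     9475.2
02500   86.66     6927.8
03500   88.13     5132.6
05000   87.68     3818.6
09000   85.36     2473.5
20000   84.83     1538.7
"""

def parse_rows(text):
    """Return a list of (inject_delay, latency_ns, bandwidth_mb_s) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            rows.append((int(parts[0]), float(parts[1]), float(parts[2])))
    return rows

def best_row_within(rows, latency_limit_ns):
    """Highest-bandwidth row whose latency does not exceed the limit."""
    ok = [r for r in rows if r[1] <= latency_limit_ns]
    return max(ok, key=lambda r: r[2]) if ok else None

LATENCY_LIMIT_NS = 150.0  # hypothetical acceptable latency

rows = parse_rows(SAMPLE)
best = best_row_within(rows, LATENCY_LIMIT_NS)
print(best)
```

With a 150 ns limit, the best row is the one at inject delay 300: about 47 GB/sec at 131.6 ns. Beyond that point bandwidth rises only modestly while latency climbs steeply.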

Measuring CPU cache-to-cache access latency

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency    38.6
Local Socket L2->L2 HITM latency    43.6
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                Reader Socket
Writer Socket        0        1
            0        -    133.4
            1    133.7        -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                Reader Socket
Writer Socket        0        1
            0        -    133.5
            1    133.7        -

Peak bandwidth

Command

mlc --peak_bandwidth

Result

Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :    50035.2
3:1 Reads-Writes :    48119.3
2:1 Reads-Writes :    47434.3
1:1 Reads-Writes :    48325.5
Stream-triad like:    44029.0

Idle memory latency

Command

mlc --idle_latency

Result

Using buffer size of 200.000MB
Each iteration took 260.5 core clocks (    113.3    ns)
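The result line reports the same measurement in two units, core clocks and nanoseconds, so dividing one by the other gives the clock frequency implied by the conversion. A small worked check, using only the two numbers printed above:

```python
# Values from the --idle_latency result line above.
core_clocks = 260.5   # latency in core clock cycles
latency_ns = 113.3    # the same latency in nanoseconds

# Cycles per nanosecond is numerically the frequency in GHz.
freq_ghz = core_clocks / latency_ns
print(f"implied core clock: {freq_ghz:.2f} GHz")
```

This works out to roughly 2.3 GHz for the machine that produced this output.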

Loaded memory latency

Command

mlc --loaded_latency

Result

Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject    Latency    Bandwidth
Delay    (ns)    MB/sec
==========================
00000    217.32      49703.4
00002    258.98      49482.4
00008    217.48      49908.1
00015    220.12      49973.7
00050    206.33      49185.7
00100    174.02      43811.8
00200    141.63      27651.1
00300    130.65      19614.6
00400    126.05      15217.0
00500    122.70      12506.0
00700    121.46       9253.0
01000    120.55       6690.6
01300    118.75       5314.9
01700    120.18       4148.7
02500    119.53       3055.7
03500    119.60       2349.4
05000    116.60       1816.9
09000    116.17       1257.8
20000    116.87        867.6

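Other operations (to be continued)

Measure access latency between specified nodes
Measure CPU cache access latency
Measure access bandwidth within a specified subset of cores/sockets
Measure bandwidth at different read:write ratios
Use a random access pattern in place of the default sequential pattern
Specify the stride used during the test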

