The metric we all use for CPU utilization is deeply misleading, and getting worse every year. What is CPU utilization? How busy your processors are? No, that's not what it measures. Yes, I'm talking about the "%CPU" metric used everywhere, by everyone. In every performance monitoring product. In top(1).

What you may think 90% CPU utilization means:

What it might really mean:

Stalled means the processor was not making forward progress with instructions, and usually happens because it is waiting on memory I/O. The ratio I drew above (between busy and stalled) is what I typically see in production. Chances are, you're mostly stalled, but don't know it.

What does this mean for you? Understanding how much your CPUs are stalled can direct performance tuning efforts between reducing code or reducing memory I/O. Anyone looking at CPU performance, especially on clouds that auto scale based on CPU, would benefit from knowing the stalled component of their %CPU.

What really is CPU Utilization?

The metric we call CPU utilization is really "non-idle time": the time the CPU was not running the idle thread. Your operating system kernel (whatever it is) usually tracks this during context switch. If a non-idle thread begins running, then stops 100 milliseconds later, the kernel considers that CPU utilized that entire time.
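To make the "non-idle time" definition concrete, here is a minimal sketch of how a tool might derive %CPU from two snapshots of cumulative CPU time counters. It assumes counters laid out like the "cpu" line of Linux /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal); the function name and the sample values are hypothetical.

```python
# Sketch: %CPU as "non-idle time" between two cumulative counter snapshots.
# Assumed field order (as in the Linux /proc/stat "cpu" line):
#   user, nice, system, idle, iowait, irq, softirq, steal

def cpu_utilization(before, after):
    """Return percent non-idle time between two tick snapshots."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    # Assumption: idle and iowait both count as time in the idle thread.
    idle = deltas[3] + deltas[4]
    return 100.0 * (total - idle) / total

# Hypothetical snapshots (made-up values): 900 busy ticks, 100 idle ticks.
before = [1000, 0, 500, 8000, 100, 0, 0, 0]
after = [1900, 0, 500, 8100, 100, 0, 0, 0]
print(f"{cpu_utilization(before, after):.1f}% CPU")
```

Note that nothing in this calculation knows whether the busy ticks were spent retiring instructions or stalled waiting on memory: both are counted as utilized.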

This metric is as old as time sharing systems. The Apollo Lunar Module guidance computer (a pioneering time sharing system) called its idle thread the "DUMMY JOB", and engineers tracked cycles running it vs real tasks as an important computer utilization metric. (I wrote about this before.)

So what's wrong with this?

Nowadays, CPUs have become much faster than main memory, and waiting on memory dominates what is still called "CPU utilization". When you see high %CPU in top(1), you might think of the processor as being the bottleneck – the CPU package under the heat sink and fan – when it's really those banks of DRAM.

This has been getting worse. For a long time processor manufacturers were scaling their clock speed faster than DRAM was scaling its access latency (the "CPU DRAM gap"). That levelled out around 2005 with 3 GHz processors, and since then processors have scaled using more cores and hyperthreads, plus multi-socket configurations, all putting more demand on the memory subsystem. Processor manufacturers have tried to reduce this memory bottleneck with larger and smarter CPU caches, and faster memory busses and interconnects. But we're still usually stalled.

How to tell what the CPUs are really doing

By using Performance Monitoring Counters (PMCs): hardware counters that can be read using Linux perf, and other tools. For example, measuring the entire system for 10 seconds:

# perf stat -a -- sleep 10

 Performance counter stats for 'system wide':

     641398.723351      task-clock (msec)         #   64.116 CPUs utilized            (100.00%)
           379,651      context-switches          #    0.592 K/sec                    (100.00%)
            51,546      cpu-migrations            #    0.080 K/sec                    (100.00%)
        13,423,039      page-faults               #    0.021 M/sec
 1,433,972,173,374      cycles                    #    2.236 GHz                      (75.02%)
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
 1,118,336,816,068      instructions              #    0.78  insns per cycle          (75.01%)
   249,644,142,804      branches                  #  389.218 M/sec                    (75.01%)
     7,791,449,769      branch-misses             #    3.12% of all branches          (75.01%)

      10.003794539 seconds time elapsed

The key metric here is instructions per cycle (insns per cycle: IPC), which shows on average how many instructions were completed for each CPU clock cycle. The higher, the better (a simplification). The above example of 0.78 sounds not bad (78% busy?) until you realize that this processor's top speed is an IPC of 4.0. This is also known as 4-wide, referring to the instruction fetch/decode path, which means the CPU can retire (complete) four instructions with every clock cycle. So an IPC of 0.78 on a 4-wide system means the CPUs are running at 19.5% of their top speed. The new Intel Skylake processors are 5-wide.
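The arithmetic behind those figures can be reproduced from the raw counters. This is a small sketch using the cycles and instructions values from the perf stat output above; the function names are my own, and the peak IPC of 4.0 is the 4-wide figure quoted for this processor.

```python
# Sketch: compute IPC and percent-of-peak from the perf stat counters above.

def ipc(instructions, cycles):
    """Instructions retired per CPU clock cycle."""
    return instructions / cycles

def percent_of_peak(ipc_value, peak_ipc):
    """How close the measured IPC is to the processor's maximum width."""
    return 100.0 * ipc_value / peak_ipc

cycles = 1_433_972_173_374        # "cycles" from the perf output
instructions = 1_118_336_816_068  # "instructions" from the perf output
PEAK_IPC = 4.0                    # 4-wide processor, as stated in the text

measured = ipc(instructions, cycles)
print(f"IPC: {measured:.2f}")
print(f"Percent of peak: {percent_of_peak(measured, PEAK_IPC):.1f}%")
```

Run against these counters, this prints an IPC of 0.78 and 19.5% of peak, matching the numbers in the text.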

There are hundreds more PMCs you can use to dig further: measuring stalled cycles directly by different types.

In the cloud

If you are in a virtual environment, you might not have access to PMCs, depending on whether the hypervisor supports them for guests. I recently posted about The PMCs of EC2: Measuring IPC, showing how PMCs are now available for dedicated host types on the AWS EC2 Xen-based cloud.

Interpretation and actionable items

If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.

If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.

For my above rules, I split on an IPC of 1.0. Where did I get that from? I made it up, based on my prior work with PMCs. Here's how you can get a value that's custom for your system and runtime: write two dummy workloads, one that is CPU bound, and one memory bound. Measure their IPC, then calculate their mid point.
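As a rough sketch of that calibration step: once you have measured the IPC of the two dummy workloads (for example with perf stat), the split point is just their mid point. The function names and the two sample IPC values below are hypothetical, not measurements.

```python
# Sketch: derive a system-specific IPC split point from two calibration runs.

def ipc_threshold(cpu_bound_ipc, memory_bound_ipc):
    """Mid point between the CPU-bound and memory-bound workload IPCs."""
    return (cpu_bound_ipc + memory_bound_ipc) / 2.0

def classify(ipc, threshold):
    """Apply the rule of thumb from the text using the custom threshold."""
    if ipc > threshold:
        return "likely instruction bound"
    return "likely memory stalled"

# Hypothetical measurements from the two dummy workloads (made-up values).
cpu_bound_ipc = 2.4
memory_bound_ipc = 0.2

threshold = ipc_threshold(cpu_bound_ipc, memory_bound_ipc)
print(f"threshold: {threshold:.2f}")
print(classify(0.78, threshold))
```

With these made-up inputs the threshold lands at 1.30, so the 0.78 IPC from the earlier perf example would be classified as likely memory stalled.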

What performance monitoring products should tell you

Every performance tool should show IPC along with %CPU. Or break down %CPU into instruction-retired cycles vs stalled cycles, e.g., %INS and %STL.
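One way such a breakdown could be computed, as a sketch only: scale %CPU by the fraction of cycles that were stalled. This assumes a total stalled-cycles count is available for the same interval (in the perf example above the stalled-cycles counters were "not supported", so you may need other PMCs), and the function name and sample values are hypothetical.

```python
# Sketch: split %CPU into %INS (cycles retiring instructions) and
# %STL (stalled cycles), assuming a stalled-cycles counter is available.

def split_cpu_percent(cpu_percent, cycles, stalled_cycles):
    """Return (%INS, %STL) so that the two parts sum to cpu_percent."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    # Clamp in case of counter multiplexing noise.
    stalled_fraction = min(max(stalled_cycles / cycles, 0.0), 1.0)
    pct_stl = cpu_percent * stalled_fraction
    pct_ins = cpu_percent - pct_stl
    return pct_ins, pct_stl

# Hypothetical interval: 90% CPU with three quarters of cycles stalled.
pct_ins, pct_stl = split_cpu_percent(90.0, cycles=1000, stalled_cycles=750)
print(f"%INS: {pct_ins:.1f}  %STL: {pct_stl:.1f}")
```

The point of the split is that a tool showing "90% CPU" could instead show 22.5% retiring instructions and 67.5% stalled for this made-up interval.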

As for top(1), there is tiptop(1) for Linux, which shows IPC by process:

tiptop -                  [root]
Tasks:  96 total,   3 displayed                               screen  0: default

  PID [ %CPU] %SYS    P   Mcycle   Minstr   IPC  %MISS  %BMIS  %BUS COMMAND
 3897   35.3  28.5    4   274.06   178.23  0.65   0.06   0.00   0.0 java
 1319+   5.5   2.6    6    87.32   125.55  1.44   0.34   0.26   0.0 nm-applet
  900    0.9   0.0    6    25.91    55.55  2.14   0.12   0.21   0.0 dbus-daemo

Other reasons CPU utilization is misleading

It's not just memory stall cycles that make CPU utilization misleading. Other factors include:

  • Temperature trips stalling the processor.
  • Turbo Boost varying the clock rate.
  • The kernel varying the clock rate with SpeedStep.
  • The problem with averages: 80% utilized over 1 minute, hiding bursts of 100%.
  • Spin locks: the CPU is utilized, and has high IPC, but the app is not making logical forward progress.
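The problem with averages from the list above can be illustrated with a few lines of arithmetic. This sketch uses made-up per-second %CPU samples: 48 seconds at 75% and 12 seconds pinned at 100%.

```python
# Sketch: a 1-minute average of 80% can hide seconds spent at 100%.
# The per-second samples are hypothetical.

samples = [75.0] * 48 + [100.0] * 12   # 60 one-second %CPU samples

average = sum(samples) / len(samples)
peak = max(samples)
saturated_seconds = sum(1 for s in samples if s >= 100.0)

print(f"1-minute average: {average:.1f}%")
print(f"peak second:      {peak:.1f}%")
print(f"seconds at 100%:  {saturated_seconds}")
```

A dashboard showing only the 80.0% average would give no hint that the CPUs were saturated for 12 of those 60 seconds.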

Update: is CPU utilization actually wrong?

There have been hundreds of comments on this post, here (below) and elsewhere (1, 2). Thanks to everyone for taking the time and the interest in this topic. To summarize my responses: I'm not talking about iowait at all (that's disk I/O), and there are actionable items if you know you are memory bound (see above).

But is CPU utilization actually wrong, or just deeply misleading? I think many people interpret high %CPU to mean that the processing unit is the bottleneck, which is wrong (as I said earlier). At that point you don't yet know, and it is often something external. Is the metric technically correct? If the CPU stall cycles can't be used by anything else, aren't they therefore "utilized waiting" (which sounds like an oxymoron)? In some cases, yes, you could say that %CPU as an OS-level metric is technically correct, but deeply misleading. With hyperthreads, however, those stalled cycles can now be used by another thread, so %CPU may count cycles as utilized that are in fact available. That's wrong. In this post I wanted to focus on the interpretation problem and suggested solutions, but yes, there are technical problems with this metric as well.

You might just say that utilization as a metric was already broken, as Adrian Cockcroft discussed previously.


CPU utilization has become a deeply misleading metric: it includes cycles waiting on main memory, which can dominate modern workloads. You can figure out what %CPU really means by using additional metrics, including instructions per cycle (IPC). An IPC < 1.0 likely means memory bound, and an IPC > 1.0 likely means instruction bound. I covered IPC in my previous post, including an introduction to the Performance Monitoring Counters (PMCs) needed to measure it.

Performance monitoring products that show %CPU – which is all of them – should also show PMC metrics to explain what that means, and not mislead the end user. For example, they can show %CPU with IPC, and/or instruction-retired cycles vs stalled cycles. Armed with these metrics, developers and operators can choose how to better tune their applications and systems.
