Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures

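Background 1: Amdahl's Law. Gene Amdahl made an insightful observation about how much improving the performance of one part of a system affects the system as a whole. This observation is known as Amdahl's Law.
Background 2: David Patterson, 2017 Turing Award laureate, professor of computer science at the University of California, Berkeley, and Distinguished Engineer at Google. A leading figure in computer architecture, Patterson led the Berkeley team that designed RISC-I, a reduced instruction set computer that laid the foundation for the RISC architecture. That architecture was later chosen by Sun Microsystems (then an industry giant, since acquired by Oracle) for its SPARC processors. Computer Architecture: A Quantitative Approach, which he co-authored with John Hennessy, former president of Stanford University and now chairman of Google's parent company Alphabet, pioneered an analytical and scientific framework for computer architecture and remains a classic textbook in the field. After retiring from Berkeley in 2016, Patterson joined the Google Brain team as a Distinguished Engineer and made major contributions to the development of two generations of TPUs.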

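In March 2018, David Patterson and John Hennessy jointly received the 2017 ACM Turing Award for pioneering a systematic, quantitative approach to the design and evaluation of computer architectures, with a lasting impact on the microprocessor industry.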

Amdahl's Law
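
As a minimal sketch of the observation described in Background 1, the usual statement of Amdahl's Law can be written as a small function. The names `amdahl_speedup`, `fraction_enhanced`, and `speedup_enhanced` are illustrative choices, not taken from the notes, and the input values are made up.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the run time is sped up by a given factor.

    fraction_enhanced: share of the original run time that benefits (0..1).
    speedup_enhanced:  how much faster that share becomes (> 0).
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # The untouched part still takes (1 - f) of the time; the enhanced part takes f / s.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Illustrative values: 60% of the run time is made 3x faster.
overall = amdahl_speedup(0.6, 3.0)
print(f"overall speedup: {overall:.3f}")
```

Even an arbitrarily large speedup of that 60% cannot push the overall speedup past 1 / (1 - 0.6) = 2.5, which is the point of the law.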

1. Abstract

We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.

2. The roofline model

We believe that for the recent past and foreseeable future, off-chip memory bandwidth will often be the constraining resource [23]. Hence, we want a model that relates processor performance to off-chip memory traffic.

**Operational intensity: operations per byte of DRAM traffic. Operational intensity suggests the DRAM bandwidth needed by a kernel on a particular computer.**
The Y-axis is attainable floating-point performance. The X-axis is operational intensity.
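
To make the definition of operational intensity concrete, here is a small sketch that computes it for a hypothetical kernel. The kernel (`y[i] = a * x[i] + y[i]` over doubles) and the traffic count are assumptions for illustration only; the count assumes every element travels to or from DRAM exactly once, with no cache reuse.

```python
def operational_intensity(flops: float, dram_bytes: float) -> float:
    """Floating-point operations per byte of DRAM traffic."""
    if dram_bytes <= 0:
        raise ValueError("dram_bytes must be positive")
    return flops / dram_bytes


# Hypothetical kernel: y[i] = a * x[i] + y[i] for n double-precision elements.
n = 1_000_000
flops = 2 * n           # one multiply and one add per element
dram_bytes = 3 * 8 * n  # read x[i], read y[i], write y[i]; 8 bytes each (assumed)
oi = operational_intensity(flops, dram_bytes)
print(f"operational intensity: {oi:.4f} flops/byte")
```

Under these assumptions the kernel performs 2 flops per 24 bytes, so its operational intensity is 1/12 flops per byte regardless of `n`.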

Attainable GFlops/sec = Min(Peak Floating Point Performance, Peak Memory Bandwidth x Operational Intensity).
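
The formula above translates directly into code. The following sketch uses a made-up machine (16 GFlops/sec peak, 8 GB/sec peak memory bandwidth); these numbers are assumptions for illustration, not measurements of any system discussed in the paper.

```python
def attainable_gflops(peak_gflops: float, peak_bw_gbs: float, oi: float) -> float:
    """Roofline bound: min(peak compute, peak memory bandwidth x operational intensity)."""
    return min(peak_gflops, peak_bw_gbs * oi)


# Hypothetical machine parameters (assumed for illustration).
PEAK_GFLOPS = 16.0
PEAK_BW_GBS = 8.0

for oi in (0.5, 2.0, 4.0):
    bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW_GBS, oi)
    print(f"OI = {oi:>4} flops/byte -> at most {bound} GFlops/sec")
```

For this machine, a kernel at 0.5 flops/byte is limited to 4 GFlops/sec by memory bandwidth, while kernels at 2.0 flops/byte or above can reach the 16 GFlops/sec compute peak.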

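Initially, attainable performance rises as operational intensity increases; past the ridge point it stays flat.
When performance falls short of the peak, a kernel is in one of two regimes: memory-bound (left of the ridge point) or compute-bound (right of it).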

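The figure on the right shows that, given a roofline, you can reuse it across different kernels, because the roofline of a given computer does not change.
Figure 1b compares the Roofline models of two systems. As expected, the ridge point shifts right, from 1.0 on the Opteron X2 to 4.4 on the Opteron X4. Hence, to see a performance gain on the X4, a kernel needs an operational intensity greater than 1.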

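The x-coordinate of the ridge point is the minimum operational intensity required to achieve maximum performance. If the ridge point is far to the right, only kernels with very high operational intensity can achieve the maximum performance of that computer. If it is far to the left, almost any kernel can potentially reach maximum performance.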
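
The ridge point follows from the roofline formula: it is where the two terms of the `Min` are equal, that is, peak floating-point performance divided by peak memory bandwidth. A minimal sketch with two made-up machines (the parameters are assumptions, not the Opteron figures):

```python
def ridge_point(peak_gflops: float, peak_bw_gbs: float) -> float:
    """Operational intensity (flops/byte) at which the two roofline terms meet."""
    return peak_gflops / peak_bw_gbs


def regime(oi: float, peak_gflops: float, peak_bw_gbs: float) -> str:
    """Classify a kernel by which side of the ridge point it falls on."""
    if oi < ridge_point(peak_gflops, peak_bw_gbs):
        return "memory-bound"
    return "compute-bound"


# Two hypothetical machines with the same bandwidth but different compute peaks.
machine_a = (16.0, 16.0)  # (peak GFlops/sec, peak GB/sec), assumed values
machine_b = (64.0, 16.0)

kernel_oi = 2.0
print(ridge_point(*machine_a), regime(kernel_oi, *machine_a))
print(ridge_point(*machine_b), regime(kernel_oi, *machine_b))
```

Raising the compute peak without raising bandwidth moves the ridge point to the right, so the same kernel that was compute-bound on the first machine becomes memory-bound on the second.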

4. Adding ceilings to the roofline model

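The Roofline model gives an upper bound on performance. Suppose your program runs far below its roofline: which optimizations should you perform, and in what order?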

To reduce computational bottlenecks:

Improve instruction-level parallelism (ILP) and apply SIMD.

Balance floating-point operation mix.

To reduce memory bottlenecks:

Restructure loops for unit stride accesses. Optimizing for unit stride memory accesses engages hardware prefetching, which significantly increases memory bandwidth.

Ensure memory affinity. Most microprocessors today include a memory controller on the same chip with the processors. If the system has two multicore chips, then some addresses go to the DRAM local to one multicore chip and the rest must go over a chip interconnect to access the DRAM that is local to another chip.

Use software prefetching.
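
The two groups of optimizations above can be read as ceilings below the roofline: until an optimization is applied, the kernel is capped by a lower compute or bandwidth ceiling. The sketch below uses entirely hypothetical ceiling values to show how to tell which side to optimize first; the function names are illustrative.

```python
def attainable_under_ceilings(oi: float, compute_ceiling_gflops: float,
                              bandwidth_ceiling_gbs: float) -> float:
    """Performance bound given the current compute and bandwidth ceilings."""
    return min(compute_ceiling_gflops, bandwidth_ceiling_gbs * oi)


def limiting_side(oi: float, compute_ceiling_gflops: float,
                  bandwidth_ceiling_gbs: float) -> str:
    """Return which ceiling currently caps the kernel: 'memory' or 'compute'."""
    if bandwidth_ceiling_gbs * oi < compute_ceiling_gflops:
        return "memory"
    return "compute"


# Hypothetical ceilings (assumed values, not measurements).
COMPUTE_BASELINE = 4.0   # GFlops/sec without ILP or SIMD
BW_BASELINE = 4.0        # GB/sec without unit stride, affinity, or prefetching
BW_OPTIMIZED = 16.0      # GB/sec after the memory optimizations

oi = 0.5  # flops/byte of a hypothetical kernel
print(limiting_side(oi, COMPUTE_BASELINE, BW_BASELINE),
      attainable_under_ceilings(oi, COMPUTE_BASELINE, BW_BASELINE))
print(limiting_side(oi, COMPUTE_BASELINE, BW_OPTIMIZED),
      attainable_under_ceilings(oi, COMPUTE_BASELINE, BW_OPTIMIZED))
```

With these numbers the kernel starts out capped by the bandwidth ceiling, so memory optimizations come first; once the bandwidth ceiling is raised, the compute ceiling becomes the limit and ILP or SIMD work is the next step.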

The first figure shows the compute-side improvements, the second the memory-side improvements, and the third plots both together.

The above shows how the four floating-point kernels are implemented; each has its own operational intensity.

Intel: Intel includes a snoop filter to prevent unnecessary coherency traffic on the bus. If the working set is small enough for the hardware to filter, the snoop filter nearly doubles the delivered memory bandwidth.
AMD
IBM

Fallacy: The model does not take into account all features of modern processors, such as caches or prefetching.
Fallacy: Doubling cache size will increase operational intensity
Fallacy: The model doesn’t account for the long memory latency
Fallacy: The model ignores integer units in floating-point programs, which can limit performance
Fallacy: The model is limited to easily optimized kernels that never hit in the cache
Fallacy: The model is limited to floating-point programs

Conclusions

This paper describes a simple and visual model to help see which systems would be a good match to important kernels, or conversely, to see how to change kernel code or hardware to run desired kernels well.
