Current state of Roofline code:

  • CS Roofline Toolkit is the implementation of "Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis"; uo-cdux/ert-mirror is a mirror of it on GitHub.
  • cyanguwa/nersc-roofline is the code for "Hierarchical Roofline Analysis: How to Collect Data using Performance Tools on Intel CPUs and NVIDIA GPUs"; it contains GPP and the ERT kernels written in C.
  • NERSC/roofline-on-nvidia-gpus is the code for "8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks"; its data collection method is improved, but it only covers GPP.
  • NERSC/timemory is the code for "Timemory: Modular Performance Analysis for HPC"; it is more systematic and standardized.

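The rest of this post walks through NERSC/roofline-on-nvidia-gpus.

NERSC/roofline-on-nvidia-gpus demonstrates how to apply Roofline analysis on NVIDIA GPUs, the V100 in particular. The repository is organized as follows.

  • /example-codes contains a few toy kernels (kernel_abc.cu) and a real HPC mini-app, GPP, extracted from the materials science code BerkeleyGW.

  • /ncu-section-files contains the default Speed of Light section file shipped with Nsight Compute in CUDA 11, plus several custom section files for hierarchical Roofline analysis covering double-precision, single-precision, half-precision, and Tensor Core operations. These section files let Nsight Compute (ncu) collect and visualize Roofline data automatically.

  • run.ncu demonstrates how to run Nsight Compute in CUDA 11, and run.gpp.ncu is a Slurm job script that runs five versions of the GPP example on Cori GPU.

  • /custom-scripts provides a set of job launch, post-processing, and visualization scripts for manual Roofline data collection and visualization. The aim is to make it easier for users to integrate Roofline analysis into their own workflows.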

Customized ncu-based Roofline Workflow

To integrate with users' other workflows, /custom-scripts provides a set of scripts for manual metric collection and Roofline visualization.

  • run.gpp.customized
  • postprocess.py and roofline.py

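The run.gpp.customized script uses GPP as an example to show the list of Nsight Compute metrics required for Roofline analysis. These metrics are collected with the Nsight Compute command-line utility ncu (or nv-nsight-cu-cli) and written to .csv files in /custom-scripts.

postprocess.py then post-processes the results with Pandas to compute the arithmetic intensity (AI) and the FLOP/s throughput of each profiled kernel.
Once that is done, postprocess.py calls the Matplotlib-based roofline.py to plot the Roofline chart and save it to a .png file.

The data collection methodology used in these scripts is detailed below. It is a new feature of Nsight Compute in CUDA 11.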

  • Time:

    • sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second
  • FLOPs:
    • DP: sm__sass_thread_inst_executed_op_dadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_dfma_pred_on.sum + sm__sass_thread_inst_executed_op_dmul_pred_on.sum
    • SP: sm__sass_thread_inst_executed_op_fadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_ffma_pred_on.sum + sm__sass_thread_inst_executed_op_fmul_pred_on.sum
    • HP: sm__sass_thread_inst_executed_op_hadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_hfma_pred_on.sum + sm__sass_thread_inst_executed_op_hmul_pred_on.sum
    • Tensor Core: 512 x sm__inst_executed_pipe_tensor.sum
  • Bytes:
    • DRAM: dram__bytes.sum
    • L2: lts__t_bytes.sum
    • L1: l1tex__t_bytes.sum
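
As a rough illustration of how the formulas above combine, the following sketch computes the run time, the double-precision FLOP count, and the arithmetic intensity at each memory level for a single kernel. The counter values are invented for the example and do not come from a real profile; only the metric names are taken from the list above.

```python
# Hypothetical counter values for one kernel invocation (illustrative only).
metrics = {
    "sm__cycles_elapsed.avg": 1.5e6,
    "sm__cycles_elapsed.avg.per_second": 1.5e9,
    "sm__sass_thread_inst_executed_op_dadd_pred_on.sum": 2.0e9,
    "sm__sass_thread_inst_executed_op_dfma_pred_on.sum": 3.0e9,
    "sm__sass_thread_inst_executed_op_dmul_pred_on.sum": 1.0e9,
    "dram__bytes.sum": 4.0e8,
    "lts__t_bytes.sum": 8.0e8,
    "l1tex__t_bytes.sum": 1.6e9,
}


def dp_flops(m):
    """DP FLOPs = add + 2 x fma + mul (an FMA counts as two operations)."""
    prefix = "sm__sass_thread_inst_executed_op_"
    return (m[prefix + "dadd_pred_on.sum"]
            + 2 * m[prefix + "dfma_pred_on.sum"]
            + m[prefix + "dmul_pred_on.sum"])


# Time = elapsed cycles / clock rate
time_s = metrics["sm__cycles_elapsed.avg"] / metrics["sm__cycles_elapsed.avg.per_second"]
flops = dp_flops(metrics)

# Arithmetic intensity = FLOPs / bytes moved at each level of the hierarchy
ai = {
    "DRAM": flops / metrics["dram__bytes.sum"],
    "L2": flops / metrics["lts__t_bytes.sum"],
    "L1": flops / metrics["l1tex__t_bytes.sum"],
}
flops_per_second = flops / time_s

print(f"time = {time_s:.3e} s, FLOPs = {flops:.3e}, FLOP/s = {flops_per_second:.3e}")
for level, value in ai.items():
    print(f"AI ({level}) = {value:.3f} FLOPs/Byte")
```

The single-precision, half-precision, and Tensor Core counts follow the same pattern with their respective metrics.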

run.gpp.customized

[Flowchart: run.gpp.customized → postprocess.py]

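Environment Modules is a tool commonly used to manage environment configuration on HPC clusters. It brings compilers, MPI libraries, math libraries, and application software (computation packages, analysis packages, and so on) under one framework as modules, so that users can switch environment variables dynamically.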

module load cuda/11.0.2
module load pgi/19.10

Set the metrics that the Nsight Compute CLI needs to collect.

# Time
metrics="sm__cycles_elapsed.avg,\
sm__cycles_elapsed.avg.per_second,"
# DP
metrics+="sm__sass_thread_inst_executed_op_dadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_dfma_pred_on.sum,\
sm__sass_thread_inst_executed_op_dmul_pred_on.sum,"
# SP
metrics+="sm__sass_thread_inst_executed_op_fadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_ffma_pred_on.sum,\
sm__sass_thread_inst_executed_op_fmul_pred_on.sum,"
# HP
metrics+="sm__sass_thread_inst_executed_op_hadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_hfma_pred_on.sum,\
sm__sass_thread_inst_executed_op_hmul_pred_on.sum,"
# Tensor Core
metrics+="sm__inst_executed_pipe_tensor.sum,"
# DRAM, L2 and L1
metrics+="dram__bytes.sum,\
lts__t_bytes.sum,\
l1tex__t_bytes.sum"

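Slurm is an open-source, fault-tolerant, highly scalable cluster management and job scheduling system for large and small Linux clusters.
srun submits a job for real-time execution or launches a job step.
Change to the GPP directory, then build and run.
The -k option filters kernels by matching a regular expression against the kernel name.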

cd ../example-codes/GPP/
input=gpp214unformatted.dat
dir=../../custom-scripts/
# Baseline
output=output.csv
profilestr="ncu -k sigma_gpp_gpu --metrics $metrics --csv"
echo Baseline version
git checkout gpp.f90
make clean
make
srun -n1 $profilestr ./gpp.x $input  > $dir/$output 2>&1
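
For workflows driven from Python rather than a shell script, the same command line could be assembled along the following lines. This is only a sketch: the `metric_groups` dictionary and the variable names are introduced here for illustration, and the command is printed rather than executed. The flags (`-k`, `--metrics`, `--csv`), the kernel name, and the input file are the ones used in the script above.

```python
import shlex

# Metric names copied from the shell script, grouped as in its comments.
metric_groups = {
    "Time": [
        "sm__cycles_elapsed.avg",
        "sm__cycles_elapsed.avg.per_second",
    ],
    "DP": [
        "sm__sass_thread_inst_executed_op_dadd_pred_on.sum",
        "sm__sass_thread_inst_executed_op_dfma_pred_on.sum",
        "sm__sass_thread_inst_executed_op_dmul_pred_on.sum",
    ],
    "SP": [
        "sm__sass_thread_inst_executed_op_fadd_pred_on.sum",
        "sm__sass_thread_inst_executed_op_ffma_pred_on.sum",
        "sm__sass_thread_inst_executed_op_fmul_pred_on.sum",
    ],
    "HP": [
        "sm__sass_thread_inst_executed_op_hadd_pred_on.sum",
        "sm__sass_thread_inst_executed_op_hfma_pred_on.sum",
        "sm__sass_thread_inst_executed_op_hmul_pred_on.sum",
    ],
    "Tensor Core": ["sm__inst_executed_pipe_tensor.sum"],
    "Bytes": ["dram__bytes.sum", "lts__t_bytes.sum", "l1tex__t_bytes.sum"],
}

# ncu expects a single comma-separated list after --metrics.
metrics = ",".join(name for group in metric_groups.values() for name in group)

cmd = [
    "srun", "-n1",
    "ncu", "-k", "sigma_gpp_gpu", "--metrics", metrics, "--csv",
    "./gpp.x", "gpp214unformatted.dat",
]

# Print the command instead of running it; pass `cmd` to subprocess.run
# (redirecting stdout to the .csv file) on a machine where ncu is installed.
print(shlex.join(cmd))
```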

Switch to each of the four optimized implementations in turn and run them.

# Four optimization steps
for n in `seq 1 4`
do
    output=output$n.csv
    profilestr="ncu -k sigma_gpp_gpu --metrics $metrics --csv"
    echo Patch version: $n
    git checkout gpp.f90
    patch gpp.f90 step$n.patch
    make clean
    make
    srun -n1 $profilestr ./gpp.x $input   > $dir/$output 2>&1
done

Call postprocess.py to generate the Roofline charts.

module load python/3.7-anaconda-2019.10
cd $dir
srun -n1 python postprocess.py

postprocess.py

[Flowchart: postprocess.py → roofline]

files is the list of csv files in the current directory whose names start with "output".

datadir='.'
files=[x for x in os.listdir(datadir) if x.endswith('.csv') and x.startswith('output')]
files.sort()
files=[os.path.join(datadir,file) for file in files]

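Using file as a variable name is not ideal (it shadows a built-in type in Python 2).
The loop first counts the lines up to and including the CSV header, which is the line containing 'Host Name'.
pandas.read_csv then skips the cnt-1 lines that precede that header.
pandas.DataFrame.groupby groups a DataFrame using a mapper or a series of columns and returns a pandas.core.groupby.DataFrameGroupBy object.
pandas.pivot_table creates a spreadsheet-style pivot table as a DataFrame.
The data is grouped by the two columns 'Kernel Name' and 'Metric Name' and summed.
pandas.DataFrame.shape returns a tuple representing the dimensions of the DataFrame.
The computed result is stored in dfs[tag].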

dfs={}
for file in files:
    tag, ext = os.path.splitext(os.path.basename(file))
    dfs[tag]=pd.DataFrame()
    with open(file,'r') as f:
        cnt=0
        while True:
            ln=f.readline()
            if not ln:
                break
            cnt+=1
            if 'Host Name' in ln:
                break
        df = pd.read_csv(file, skiprows=cnt-1)
        dft=df.groupby(['Kernel Name','Metric Name']).sum()
        dfmetric=pd.pivot_table(dft, index='Kernel Name', columns='Metric Name', values='Metric Value')
        dfmetric['Count']=df.groupby(['Kernel Name']).count()['ID'].div(dfmetric.shape[1])

$$\mathrm{time} = \frac{\mathrm{cycles}}{\mathrm{rate}}$$

        dfmetric['Time']=dfmetric['sm__cycles_elapsed.avg'] \
                        / (dfmetric['sm__cycles_elapsed.avg.per_second'] /dfmetric['Count'] )

$$\mathrm{add} + 2\times \mathrm{fma} + \mathrm{mul}$$

        dfmetric['CC FLOPs']= 2 * dfmetric['sm__sass_thread_inst_executed_op_dfma_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_dmul_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_dadd_pred_on.sum'] \
                            + 2 * dfmetric['sm__sass_thread_inst_executed_op_ffma_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_fmul_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_fadd_pred_on.sum'] \
                            + 2 * dfmetric['sm__sass_thread_inst_executed_op_hfma_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_hmul_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_hadd_pred_on.sum']

$$\mathrm{FLOP_{tc}} = \mathrm{Inst_{tc}}\times 512$$

        dfmetric['TC FLOPs']= 512 * dfmetric['sm__inst_executed_pipe_tensor.sum']
        dfmetric['all FLOPs']= dfmetric['CC FLOPs'] + dfmetric['TC FLOPs']

        dfmetric['AI HBM'] = dfmetric['all FLOPs'].div(dfmetric['dram__bytes.sum'])
        dfmetric['AI L2'] = dfmetric['all FLOPs'].div(dfmetric['lts__t_bytes.sum'])
        dfmetric['AI L1'] = dfmetric['all FLOPs'].div(dfmetric['l1tex__t_bytes.sum'])

        dfmetric['GFLOP/s'] = dfmetric['all FLOPs']/ dfmetric['Time'] /1024/1024/1024
        dfmetric['TC GFLOP/s'] = dfmetric['TC FLOPs']/ dfmetric['Time'] /1024/1024/1024
#         dfmetric.to_csv('pd_'+tag+'.csv')
        dfs[tag]=dfmetric
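For a single kernel, the derived quantities reduce to a few divisions. The sketch below uses a hypothetical helper, derived_metrics, with made-up inputs. It also shows that the script scales by 1024**3 (binary giga) rather than 10**9, which makes the reported GFLOP/s about 7% smaller than the decimal value.

```python
def derived_metrics(flops, seconds, dram_bytes, l2_bytes, l1_bytes):
    """Arithmetic intensity per memory level and throughput for one kernel."""
    return {
        "AI HBM": flops / dram_bytes,
        "AI L2": flops / l2_bytes,
        "AI L1": flops / l1_bytes,
        # Scaling used in postprocess.py: divide by 2**30.
        "GFLOP/s (2**30)": flops / seconds / 1024**3,
        # Decimal giga for comparison.
        "GFLOP/s (10**9)": flops / seconds / 1e9,
    }

# Made-up totals chosen so that the results are round numbers.
m = derived_metrics(flops=2**31, seconds=1.0,
                    dram_bytes=2**28, l2_bytes=2**29, l1_bytes=2**30)
```

Because less data moves through HBM than through L2 or L1, the HBM intensity is the largest of the three, which is the usual ordering on a hierarchical Roofline chart.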

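For the result of each file, the index and the required columns are converted to plain Python lists.
pandas.Index.tolist returns a list of the values.
pandas.Series.tolist returns a list of the values.
This way the roofline function no longer needs to call any Pandas functions.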

tags=dfs.keys()
flags=['all'] #'HBM','L2','L1' or 'all'
for tag in tags:
    for flag in flags:
        dfm=dfs[tag]
        LABELS = dfm.index.tolist()
        AIL1   = dfm['AI L1'].tolist()
        AIL2   = dfm['AI L2'].tolist()
        AIHBM  = dfm['AI HBM'].tolist()
        FLOPS  = dfm['GFLOP/s'].tolist()

        roofline(tag, FLOPS, AIHBM, AIL2, AIL1, LABELS, flag)

roofline

Check that the input arguments are non-empty, that their lengths match, and that flag has a valid value.

def roofline(filename, FLOPS, AIHBM, AIL2=None, AIL1=None, LABELS=None, flag='HBM'):

    if not FLOPS:
        print('FLOPS can not be empty!')
        return
    if max(FLOPS)==0:
        print('FLOPS are all 0s!')
        return
    if (not AIHBM) and (not AIL2) and (not AIL1):
        print('AIHBM, AIL2 and AIL1 can not all be empty!')
        return
    if (len(FLOPS) != len(AIHBM)) or (len(FLOPS) != len(AIL2)) or (len(FLOPS) != len(AIL1)):
        print('FLOPS needs to have the same length as AI!')
        return
    if (flag != 'HBM') and (flag != 'L2') and (flag != 'L1') and (flag != 'all'):
        print('flag needs to be one of HBM, L2, L1, and all!')
        return
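One detail in these checks: the length comparison calls len() on all three AI lists, so if AIL2 or AIL1 is left at its default None while FLOPS and AIHBM have equal length, len(None) raises a TypeError. A possible variant that only compares the lists actually passed in is sketched below; check_roofline_inputs is a hypothetical helper, not part of the repository, and it returns the message instead of printing it.

```python
def check_roofline_inputs(FLOPS, AIHBM, AIL2=None, AIL1=None, flag="HBM"):
    """Return an error message, or None if the inputs look usable."""
    if not FLOPS:
        return "FLOPS can not be empty!"
    if max(FLOPS) == 0:
        return "FLOPS are all 0s!"
    provided = [ai for ai in (AIHBM, AIL2, AIL1) if ai]
    if not provided:
        return "AIHBM, AIL2 and AIL1 can not all be empty!"
    # Only compare lengths of the lists that were actually passed in.
    if any(len(ai) != len(FLOPS) for ai in provided):
        return "FLOPS needs to have the same length as AI!"
    if flag not in ("HBM", "L2", "L1", "all"):
        return "flag needs to be one of HBM, L2, L1, and all!"
    return None
```

This only covers the validation step; the plotting code would still need the list that matches the chosen flag.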

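memRoofs and cmpRoofs are hard-coded values determined in advance: memory bandwidths in GB/s and compute peaks in TFLOP/s.
matplotlib.pyplot.figure creates a new figure or activates an existing one; figsize is the width and height in inches.
matplotlib.pyplot.clf clears the current figure.
matplotlib.figure.Figure.gca gets the current axes.
matplotlib.axes.Axes.set_xscale sets the x-axis scale.
matplotlib.axes.Axes.set_xlabel sets the x-axis label.
matplotlib.axes.Axes.set_xlim sets the x-axis view limits.
matplotlib.axes.Axes.get_xlim returns the x-axis view limits.
Both axes use a logarithmic scale, and the visible range of the x axis is $[10^{x_{min}}, 10^{x_{max}}]$.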

    LABELS = [x[:maxchar] for x in LABELS]

    memRoofs = [('L1', 54000.), ('L2', 2996.77),  ('HBM', 828.76)]
    cmpRoofs = [('Tensor', 96.9),('DP', 7.8)]

    fig = plt.figure(1,figsize=(10.67,6.6))
    plt.clf()
    ax = fig.gca()
    ax.set_xscale('log')
    ax.set_yscale('log')
    ax.set_xlabel('Arithmetic Intensity [FLOPs/Byte]')
    ax.set_ylabel('Performance [GFLOP/sec]')

    nx   = 10000
    xmin = -3
    xmax = 3
    ymin = 1
    ymax = 200000

    ax.set_xlim(10**xmin, 10**xmax)
    ax.set_ylim(ymin, ymax)

    ixx = int(nx*0.02)
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

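numpy.logspace returns nx numbers evenly spaced on a log scale over the closed interval [10**xmin, 10**xmax]; the endpoint is included by default.
Multiplying x by a bandwidth from memRoofs gives the bandwidth-bound performance on the y axis. The factor 1024 converts the compute peaks from TFLOP/s to GFLOP/s.
For each entry in cmpRoofs, the loop scans x for the point where the L1 bandwidth line (memRoofs[0]) first reaches that compute ceiling, that is, where the current point is at or above the ceiling and the previous point is below it. The previous point is appended to scomp_x_elbow and its index to scomp_ix_elbow.
For each entry in memRoofs, the loop scans in the same way for the point where that bandwidth line first reaches the Tensor ceiling (cmpRoofs[0]). The previous point is appended to smem_x_elbow and its index to smem_ix_elbow.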

    scomp_x_elbow  = []
    scomp_ix_elbow = []
    smem_x_elbow   = []
    smem_ix_elbow  = []

    x = np.logspace(xmin,xmax,nx)
    for roof in cmpRoofs:
        for ix in range(1,nx):
            if float(memRoofs[0][1] * x[ix]) >= roof[1]*1024 and (memRoofs[0][1] * x[ix-1]) < roof[1]*1024:
                scomp_x_elbow.append(x[ix-1])
                scomp_ix_elbow.append(ix-1)
                break

    for roof in memRoofs:
        for ix in range(1,nx):
            if (cmpRoofs[0][1]*1024 <= roof[1] * x[ix] and cmpRoofs[0][1]*1024 > roof[1] * x[ix-1]):
                smem_x_elbow.append(x[ix-1])
                smem_ix_elbow.append(ix-1)
                break
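Since bandwidth * x increases with x, the linear scan can also be expressed with numpy.searchsorted. The following is a minimal sketch; elbow_index is a hypothetical helper, not a function from the repository. It returns None when the bandwidth line never crosses the ceiling inside the grid, the situation in which the elbow lists would end up shorter than the roof lists.

```python
import numpy as np

def elbow_index(x, bandwidth, ceiling):
    """Index of the last grid point where bandwidth * x is below ceiling, or None."""
    y = bandwidth * x
    # First index whose value is at or above the ceiling (y is increasing).
    first_at_or_above = int(np.searchsorted(y, ceiling, side="left"))
    if first_at_or_above == 0 or first_at_or_above == len(x):
        return None  # no crossing inside the grid
    return first_at_or_above - 1

x = np.logspace(-3, 3, 10000)
hbm_bw = 828.76          # GB/s, HBM entry of memRoofs
dp_peak = 7.8 * 1024     # GFLOP/s, DP entry of cmpRoofs scaled as in the script
ix = elbow_index(x, hbm_bw, dp_peak)
```

The elbow found this way lies within one grid step of the analytic ridge point, peak / bandwidth, which is about 9.6 FLOPs/Byte for these two values.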

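Draw the Roofline polylines.
For each entry in cmpRoofs, draw the horizontal part to the right of the elbow.
For each entry in memRoofs, draw the sloped part to the left of the elbow.
Looping over len(cmpRoofs) and len(memRoofs) can raise an IndexError if no elbow was found for some roof; looping over len(scomp_ix_elbow) and len(smem_ix_elbow) would be more appropriate.
matplotlib.axes.Axes.plot draws points, lines, or other markers in XY coordinates.
The color is black, the linestyle is solid, and the linewidth is 2 points.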

    for i in range(len(cmpRoofs)):
        roof = cmpRoofs[i][1]*1024
        y = np.ones(len(x)) * roof
        ax.plot(x[scomp_ix_elbow[i]:],y[scomp_ix_elbow[i]:],c='k',ls='-',lw='2')

    for i in range(len(memRoofs)):
        roof = memRoofs[i][1]
        y = x * roof
        ax.plot(x[:smem_ix_elbow[i]+1],y[:smem_ix_elbow[i]+1],c='k',ls='-',lw='2')

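Plot the kernel performance data on the chart: circles for L1, squares for L2, and downward-pointing triangles for HBM.
The loop iterates over the length of AIHBM, which assumes that AIHBM is always present and that the other lists match its length.
flag decides which of the results are drawn.
Colors are taken in turn from the colors list.
LABELS supplies the labels for the legend.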

    for i in range(len(AIHBM)):
        if flag == 'L1':
            ax.plot(float(AIL1[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[0],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")
        elif flag == 'L2':
            ax.plot(float(AIL2[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[1],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")
        elif flag == 'HBM':
            ax.plot(float(AIHBM[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[2],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")
        elif flag == 'all':
            ax.plot(float(AIL1[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[0],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")
            ax.plot(float(AIL2[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[1],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")
            ax.plot(float(AIHBM[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[2],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")
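The four branches differ only in which AI list and which marker they use, so the selection can be separated from the drawing. Below is a sketch under stated assumptions: points_to_plot and LEVELS are hypothetical names, and the marker codes "o", "s" and "v" are assumed from the description of circles, squares and downward triangles.

```python
# Assumed marker per memory level: circle, square, downward triangle.
LEVELS = (("L1", "o"), ("L2", "s"), ("HBM", "v"))

def points_to_plot(flag, ai_by_level, flops):
    """Yield (ai, flops, marker, kernel_index) for the levels selected by flag."""
    for level, marker in LEVELS:
        if flag not in (level, "all"):
            continue
        for i, (ai, perf) in enumerate(zip(ai_by_level[level], flops)):
            yield ai, perf, marker, i

ai = {"L1": [1.0, 2.0], "L2": [3.0, 4.0], "HBM": [5.0, 6.0]}
l2_points = list(points_to_plot("L2", ai, [10.0, 20.0]))
all_points = list(points_to_plot("all", ai, [10.0, 20.0]))
```

Each tuple could then be passed to a single ax.plot call, with kernel_index selecting the color. Note that this iterates level by level rather than kernel by kernel, so the drawing order differs from the original loop.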

matplotlib.axes.Axes.plot returns a list of matplotlib.lines.Line2D objects, so [0] picks the single handle out of each call.

    marker_handles = []

    if flag == 'L1':
        marker_handles.append(ax.plot([],[],c='k',marker=styles[0],linestyle='None',ms=markersize,\
                markerfacecolor='none',markeredgewidth=markerwidth,label=memRoofs[0][0])[0])
    elif flag == 'L2':
        marker_handles.append(ax.plot([],[],c='k',marker=styles[1],linestyle='None',ms=markersize,\
                markerfacecolor='none',markeredgewidth=markerwidth,label=memRoofs[1][0])[0])
    elif flag == 'HBM':
        marker_handles.append(ax.plot([],[],c='k',marker=styles[2],linestyle='None',ms=markersize,\
                markerfacecolor='none',markeredgewidth=markerwidth,label=memRoofs[2][0])[0])
    elif flag == 'all':
        for i in range(len(memRoofs)):
            marker_handles.append(ax.plot([],[],c='k',marker=styles[i],linestyle='None',ms=markersize,\
                    markerfacecolor='none',markeredgewidth=markerwidth,label=memRoofs[i][0])[0])

matplotlib.axes.Axes.text adds the compute peak and memory bandwidth labels to the axes.

    for roof in cmpRoofs:
        ax.text(x[-ixx],roof[1]*1024,
                roof[0] + ': ' + '{0:.1f}'.format(roof[1]) + ' TFLOP/s',
                horizontalalignment='right',
                verticalalignment='bottom')

    for roof in memRoofs:
        ang = np.arctan(np.log10(xlim[1]/xlim[0]) / np.log10(ylim[1]/ylim[0])
                        * fig.get_size_inches()[1]/fig.get_size_inches()[0] )
        if x[ixx]*roof[1] >ymin:
            ax.text(x[ixx],x[ixx]*roof[1]*(1+0.25*np.sin(ang)**2),
                    roof[0] + ': ' + '{0:.1f}'.format(float(roof[1])) + ' GB/s',
                    horizontalalignment='left',
                    verticalalignment='bottom',
                    rotation=180/np.pi*ang)
        else:
            ymin_ix_elbow=list()
            ymin_x_elbow=list()
            for ix in range(1,nx):
                if (ymin <= roof[1] * x[ix] and ymin > roof[1] * x[ix-1]):
                    ymin_x_elbow.append(x[ix-1])
                    ymin_ix_elbow.append(ix-1)
                    break
            ax.text(x[ixx+ymin_ix_elbow[0]],x[ixx+ymin_ix_elbow[0]]*roof[1]*(1+0.25*np.sin(ang)**2),
                    roof[0] + ': ' + '{0:.1f}'.format(float(roof[1])) + ' GB/s',
                    horizontalalignment='left',
                    verticalalignment='bottom',
                    rotation=180/np.pi*ang)
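The rotation angle ang deserves a short derivation. On log-log axes a bandwidth line rises one decade in y per decade in x. On screen, one decade in x is width / decades_x wide and one decade in y is height / decades_y tall, so the tangent of the on-screen angle is (decades_x / decades_y) * (height / width). A sketch with the axis limits and figure size used above; label_angle_deg is a hypothetical helper:

```python
import math

def label_angle_deg(xlim, ylim, fig_width_in, fig_height_in):
    """Approximate on-screen angle of a slope-1 line on log-log axes."""
    decades_x = math.log10(xlim[1] / xlim[0])
    decades_y = math.log10(ylim[1] / ylim[0])
    # Rise per decade is height/decades_y, run per decade is width/decades_x.
    tan_angle = (decades_x / decades_y) * (fig_height_in / fig_width_in)
    return math.degrees(math.atan(tan_angle))

angle = label_angle_deg((1e-3, 1e3), (1, 200000), 10.67, 6.6)
```

For these values the angle comes out at roughly 35 degrees. The script uses the figure size rather than the size of the axes box, so the result is an approximation that depends on the plot margins.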

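matplotlib.pyplot.legend places a legend for the memory levels, built from marker_handles, at the lower right.
matplotlib.axes.Axes.add_artist adds an Artist to the axes; it is needed here because a second call to legend would otherwise replace the first legend.
matplotlib.patches.Patch is a 2D Artist with a face color and an edge color.
The loc=4 used for leg2 is hard to read; it is the numeric code for 'lower right'.
matplotlib.pyplot.savefig saves the current figure.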

    leg1 = plt.legend(handles = marker_handles,loc='lower right', ncol=len(flag[0]) if 'all' not in flag else 3,bbox_to_anchor = (1,0))
    ax.add_artist(leg1)

    patch_handles = list()
    for i in range(0,len(AIHBM)):
        if FLOPS[i] > 0:
            patch_handles.append(mpatches.Patch(color=colors[i%10],label = LABELS[i] if LABELS else "unknown"))

    leg2 = plt.legend(handles = patch_handles,loc=4,ncol=1,bbox_to_anchor = (1,0.1),scatterpoints = 1)

    ax.text(xlim[0]*1.1,ylim[1]/1.1, '-'.join([filename,flag]), horizontalalignment='left',verticalalignment='top')
#     plt.title('-'.join([filename,flag]))

    plt.savefig('_'.join([filename,flag])+'.png')
#     plt.savefig('_'.join([filename,flag])+'.eps')

#    plt.show()

The expected result looks like this:

