
  • Visual Profiler
  • nvprof

1.1. Focused Profiling

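  • The code consists of initialization, data copy, kernel execution, data copy back, result validation, and post-processing, and only the kernel is of interest; focused profiling fits this case.
  • The program runs in phases that do not depend on each other, each with its own algorithm kernel; each phase can be profiled separately.
  • The program runs many iterations and performance varies little between iterations; profiling a small subset of the iterations is enough.
    cudaProfilerStart()/cudaProfilerStop(): CUDA runtime API, declared in cuda_profiler_api.h
    cuProfilerStart()/cuProfilerStop(): CUDA driver API, declared in cudaProfiler.h
    nvprof --profile-from-start off: disables profiling from the start of the program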
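
The third case above (profile only a few iterations) can be sketched as follows. This is a minimal illustration, not code from the profiler guide: the kernel `scale`, the helper `run_iterations`, and the iteration window are hypothetical names chosen for the example; only cudaProfilerStart()/cudaProfilerStop() come from the notes above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>  // cudaProfilerStart / cudaProfilerStop

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Hypothetical stand-in for the algorithm kernel: scales a vector in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs `iterations` kernel launches and profiles only iterations
// first..last (inclusive). Returns how many launches fell in that window.
int run_iterations(int iterations, int first, int last) {
    const int n = 1 << 20;
    float *d_data = nullptr;
    CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    int profiled = 0;
    for (int it = 0; it < iterations; ++it) {
        if (it == first) CHECK(cudaProfilerStart());   // begin collecting here
        scale<<<(n + 255) / 256, 256>>>(d_data, n, 1.0f);
        if (it >= first && it <= last) ++profiled;
        if (it == last) {
            CHECK(cudaDeviceSynchronize());            // let profiled kernels finish
            CHECK(cudaProfilerStop());                 // stop collecting
        }
    }
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_data));
    return profiled;
}

int main() {
    int profiled = run_iterations(100, 10, 12);
    std::printf("profiled %d of 100 iterations\n", profiled);
    return 0;
}
```

Assuming the file is saved as focused.cu (a made-up name), it could be built with `nvcc -o focused focused.cu` and run as `nvprof --profile-from-start off ./focused`, so that only the launches between the start and stop calls are recorded.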

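event (nvprof --query-events): an event is a hardware counter that accumulates while a kernel runs.
metric (nvprof --query-metrics): a metric is a characteristic of a kernel's execution, computed from one or more event counters.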

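1.2. Marking Regions of CPU Activity
The Visual Profiler shows how every CPU thread invokes CUDA kernels. To also see what the CPU threads are doing outside of their GPU calls, the application has to be instrumented with the NVIDIA Tools Extension API (NVTX). nvprof supports this as well.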
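
A minimal sketch of such instrumentation, using the NVTX push/pop range calls. The kernel `add_one`, the helper `run_pipeline`, and the range labels are hypothetical; the header name assumes the classic NVTX library shipped alongside nvprof (newer toolkits provide it as nvtx3/nvToolsExt.h).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nvToolsExt.h>  // NVTX; link with -lnvToolsExt

// Hypothetical kernel: adds 1 to every element.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Marks three phases with NVTX ranges so they show up on the CPU
// thread's timeline row. Returns true if the result validates.
bool run_pipeline(int n) {
    std::vector<float> host(n);

    nvtxRangePushA("prepare input (CPU)");
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);
    nvtxRangePop();

    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;

    nvtxRangePushA("copy + kernel + copy back");
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    add_one<<<(n + 255) / 256, 256>>>(dev, n);
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    nvtxRangePop();
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    nvtxRangePushA("validate (CPU)");
    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (host[i] != static_cast<float>(i) + 1.0f) ok = false;
    nvtxRangePop();
    return ok;
}

int main() {
    std::printf("%s\n", run_pipeline(1 << 20) ? "validation passed"
                                              : "validation failed");
    return 0;
}
```

Push/pop ranges nest per thread, so each nvtxRangePushA() needs a matching nvtxRangePop() on the same thread.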
1.3. Naming CPU and CUDA Resources
You can use the NVIDIA Tools Extension API to assign custom names for your CPU and GPU resources. Your custom names will then be displayed in the Timeline View.
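
A short sketch of resource naming, under a few assumptions: the helper `name_resources` and all label strings are made up for illustration, and the thread ID is obtained with the Linux-specific gettid system call (other platforms need a different call).

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>
#include <nvToolsExt.h>        // nvtxNameOsThreadA; link with -lnvToolsExt
#include <nvToolsExtCudaRt.h>  // nvtxNameCudaDeviceA, nvtxNameCudaStreamA
#include <sys/syscall.h>       // SYS_gettid (Linux only, an assumption)
#include <unistd.h>            // syscall

// Names the calling thread, device 0, and one stream, then issues a
// little work on the stream so it appears in the timeline.
cudaError_t name_resources() {
    nvtxNameOsThreadA(static_cast<uint32_t>(syscall(SYS_gettid)),
                      "Main thread");

    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) return err;
    nvtxNameCudaDeviceA(0, "Primary GPU");

    cudaStream_t stream;
    err = cudaStreamCreate(&stream);
    if (err != cudaSuccess) return err;
    nvtxNameCudaStreamA(stream, "Upload stream");

    const size_t bytes = 1 << 20;
    void *dev = nullptr;
    err = cudaMalloc(&dev, bytes);
    if (err == cudaSuccess) {
        cudaMemsetAsync(dev, 0, bytes, stream);  // work on the named stream
        err = cudaStreamSynchronize(stream);
        cudaFree(dev);
    }
    cudaStreamDestroy(stream);
    return err;
}

int main() {
    cudaError_t err = name_resources();
    std::printf("%s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```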
1.4. Flush Profile Data
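By default, profile data is collected in a buffer and written to disk at low priority, so some of it may not reach the disk in time. To force a flush, call cuProfilerStop() (or cudaProfilerStop() with the runtime API) before all threads exit.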


Visual Profiler



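There are a great many options (CUDA, CPU, print, I/O, and so on), plus several profiling modes and profiling controls that can be specified. TODO: try each option in turn, or look them up when a specific problem calls for one.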

Remote Profiling

You can profile your remote application directly from Nsight or the Visual Profiler.
Or you can use nvprof to collect the profile data on the remote system and then use nvvp on the host system to view and analyze the data.

NVIDIA Tools Extension


  • Tracing of CPU events and time ranges.
  • Naming of OS and CUDA resources.


MPI Profiling With nvprof


mpirun -np 2 nvprof --annotate-mpi openmpi ./my_mpi_app


MPS Profiling

You can collect profiling data for a CUDA application using Multi-Process Service (MPS) with nvprof and then view the timeline by importing the data in the Visual Profiler.

Dependency Analysis


Metrics Reference


Warp State

  • Instruction issued - An instruction or a pair of independent instructions was issued from a warp.
  • Stalled - Warp can be stalled for one of the following reasons.
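    • Stalled for instruction fetch - The next instruction was not yet available. The stall comes from the instruction cache.
    • Stalled for execution dependency - A register the next instruction depends on is not ready yet because an earlier compute instruction (for example FP64) or a barrier has not finished. Try to increase instruction-level parallelism (ILP).
    • Stalled for memory dependency - The next instruction is waiting for previous memory accesses to complete. A register it depends on is not ready yet because an earlier load (LD) instruction has not finished.
    • Stalled for memory throttle - A large number of outstanding memory requests prevents forward progress. This is a bandwidth limit; both global and shared memory have finite bandwidth.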
    • Stalled for texture
    • Stalled for sync - The warp is waiting for all threads to synchronize after a barrier instruction.
    • Stalled for constant memory dependency - Caused by accesses to constant memory.
    • Stalled for pipe busy - The warp is stalled because the functional unit required to execute the next instruction is busy, for example when FP64 work keeps the unit occupied.
    • Stalled for not selected - Warp was ready but did not get a chance to issue as some other warp was selected for issue. Typical of a well-optimized program.
    • Stalled for other - Warp is blocked for an uncommon reason like compiler or hardware reasons. Note: barrier > 18, stalled pipeline.
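
The ILP advice under "Stalled for execution dependency" can be illustrated with a small sketch. Both kernels below are hypothetical examples, not taken from the profiler guide: the first sums with a single accumulator, so every add waits on the previous one, and the second splits the sum across four independent accumulators. Whether this actually reduces stalls on a given GPU has to be checked with the profiler.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One accumulator: each add depends on the result of the previous add.
__global__ void sum_chain(const float *in, float *out, int per_thread) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (int k = 0; k < per_thread; ++k)
        acc += in[k * stride + t];
    out[t] = acc;
}

// Four independent accumulators: the four adds in one loop body do not
// depend on each other. Assumes per_thread is a multiple of 4.
__global__ void sum_ilp4(const float *in, float *out, int per_thread) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (int k = 0; k < per_thread; k += 4) {
        a0 += in[(k + 0) * stride + t];
        a1 += in[(k + 1) * stride + t];
        a2 += in[(k + 2) * stride + t];
        a3 += in[(k + 3) * stride + t];
    }
    out[t] = (a0 + a1) + (a2 + a3);
}

// Runs both kernels on an all-ones input and checks that every thread's
// sum equals per_thread in both versions.
// Assumes threads is a multiple of 128 and per_thread a multiple of 4.
bool kernels_agree(int threads, int per_thread) {
    const int block = 128;
    const int grid = threads / block;
    const size_t n = static_cast<size_t>(threads) * per_thread;
    std::vector<float> in(n, 1.0f), a(threads), b(threads);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_a, threads * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_b, threads * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sum_chain<<<grid, block>>>(d_in, d_a, per_thread);
    sum_ilp4<<<grid, block>>>(d_in, d_b, per_thread);

    cudaMemcpy(a.data(), d_a, threads * sizeof(float), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaMemcpy(b.data(), d_b, threads * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    if (err != cudaSuccess) return false;

    for (int t = 0; t < threads; ++t)
        if (a[t] != static_cast<float>(per_thread) || a[t] != b[t]) return false;
    return true;
}

int main() {
    std::printf("%s\n", kernels_agree(1024, 64) ? "results match"
                                                : "results differ");
    return 0;
}
```

The input is all ones so that both summation orders give exactly the same result; with arbitrary floating-point data the two kernels may differ in the last bits because the additions are grouped differently.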


