
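  • 3.3 Exposing Parallelism
    • 3.3.1 Checking Active Warps with nvprof
    • 3.3.2 Checking Memory Operations with nvprof
      • Memory read efficiency
      • Global load efficiency
    • 3.3.3 Exposing More Parallelism
      • Metrics and performance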

3.3 Exposing Parallelism



#include "../common/common.h"
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * This example implements matrix element-wise addition on the host and GPU.
 * sumMatrixOnHost iterates over the rows and columns of each matrix, adding
 * elements from A and B together and storing the results in C. The current
 * offset in each matrix is stored using pointer arithmetic. sumMatrixOnGPU2D
 * implements the same logic, but using CUDA threads to process each matrix.
 */

void initialData(float *ip, const int size)
{
    int i;

    for (i = 0; i < size; i++)
    {
        ip[i] = (float)(rand() & 0xFF) / 10.0f;
    }
}

void sumMatrixOnHost(float *A, float *B, float *C, const int nx, const int ny)
{
    float *ia = A;
    float *ib = B;
    float *ic = C;

    for (int iy = 0; iy < ny; iy++)
    {
        for (int ix = 0; ix < nx; ix++)
        {
            ic[ix] = ia[ix] + ib[ix];
        }

        // advance all three pointers to the next row
        ia += nx;
        ib += nx;
        ic += nx;
    }
}

void checkResult(float *hostRef, float *gpuRef, const int N)
{
    double epsilon = 1.0E-8;

    for (int i = 0; i < N; i++)
    {
        // fabs is the floating-point absolute value; the integer abs()
        // could truncate the difference to zero
        if (fabs(hostRef[i] - gpuRef[i]) > epsilon)
        {
            printf("host %f gpu %f ", hostRef[i], gpuRef[i]);
            printf("Arrays do not match.\n\n");
            break;
        }
    }
}

// grid 2D block 2D
__global__ void sumMatrixOnGPU2D(float *A, float *B, float *C, int NX, int NY)
{
    unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int iy = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int idx = iy * NX + ix;

    // guard against threads that fall outside the matrix
    if (ix < NX && iy < NY)
    {
        C[idx] = A[idx] + B[idx];
    }
}

int main(int argc, char **argv)
{
    // set up device
    int dev = 0;
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));
    CHECK(cudaSetDevice(dev));

    // set up data size of matrix
    int nx = 1 << 13;
    int ny = 1 << 13;

    int nxy = nx * ny;
    size_t nBytes = nxy * sizeof(float);

    // malloc host memory
    float *h_A, *h_B, *hostRef, *gpuRef;
    h_A = (float *)malloc(nBytes);
    h_B = (float *)malloc(nBytes);
    hostRef = (float *)malloc(nBytes);
    gpuRef = (float *)malloc(nBytes);

    // initialize data at host side
    double iStart = seconds();
    initialData(h_A, nxy);
    initialData(h_B, nxy);
    double iElaps = seconds() - iStart;

    memset(hostRef, 0, nBytes);
    memset(gpuRef, 0, nBytes);

    // add matrix at host side for result checks
    iStart = seconds();
    sumMatrixOnHost(h_A, h_B, hostRef, nx, ny);
    iElaps = seconds() - iStart;

    // malloc device global memory
    float *d_MatA, *d_MatB, *d_MatC;
    CHECK(cudaMalloc((void **)&d_MatA, nBytes));
    CHECK(cudaMalloc((void **)&d_MatB, nBytes));
    CHECK(cudaMalloc((void **)&d_MatC, nBytes));

    // transfer data from host to device
    CHECK(cudaMemcpy(d_MatA, h_A, nBytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_MatB, h_B, nBytes, cudaMemcpyHostToDevice));

    // invoke kernel at host side; block shape can be overridden on the
    // command line as: ./main dimx dimy
    int dimx = 32;
    int dimy = 32;

    if (argc > 2)
    {
        dimx = atoi(argv[1]);
        dimy = atoi(argv[2]);
    }

    dim3 block(dimx, dimy);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

    // execute the kernel
    CHECK(cudaDeviceSynchronize());
    iStart = seconds();
    sumMatrixOnGPU2D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);
    CHECK(cudaDeviceSynchronize());
    iElaps = seconds() - iStart;
    // NOTE: the "ms" label is kept to match the logs below; if seconds()
    // from common.h returns seconds, as its name suggests, the printed
    // value is in seconds rather than milliseconds
    printf("sumMatrixOnGPU2D <<<(%d,%d), (%d,%d)>>> elapsed %f ms\n", grid.x,
           grid.y, block.x, block.y, iElaps);
    CHECK(cudaGetLastError());

    // copy kernel result back to host side
    CHECK(cudaMemcpy(gpuRef, d_MatC, nBytes, cudaMemcpyDeviceToHost));

    // check device results
    checkResult(hostRef, gpuRef, nxy);

    // free device global memory
    CHECK(cudaFree(d_MatA));
    CHECK(cudaFree(d_MatB));
    CHECK(cudaFree(d_MatC));

    // free host memory
    free(h_A);
    free(h_B);
    free(hostRef);
    free(gpuRef);

    // reset device
    CHECK(cudaDeviceReset());

    return EXIT_SUCCESS;
}
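
The grid dimensions that appear in the profiling runs follow from the ceiling division used in `main`. As a minimal host-side sketch in plain C (the helper names `grid_dim` and `print_launch_config` are hypothetical and not part of the original program), the same arithmetic can be checked without a GPU:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: number of blocks needed to cover n elements with
 * blocks of block_dim threads. This is the same ceiling division as
 * (nx + block.x - 1) / block.x in main. */
unsigned int grid_dim(unsigned int n, unsigned int block_dim)
{
    assert(block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

/* Print the launch configuration for an nx-by-ny matrix and a
 * bx-by-by thread block. */
void print_launch_config(unsigned int nx, unsigned int ny,
                         unsigned int bx, unsigned int by)
{
    printf("grid (%u,%u), block (%u,%u)\n",
           grid_dim(nx, bx), grid_dim(ny, by), bx, by);
}
```

For nx = ny = 1 << 13, calling `print_launch_config(8192, 8192, 32, 16)` gives a grid of (256,512), which matches the `./main 32 16` run.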

3.3.1 Checking Active Warps with nvprof


$ sudo nvprof --metrics achieved_occupancy ./main 32 32
==15352== NVPROF is profiling process 15352, command: ./main 32 32
sumMatrixOnGPU2D <<<(256,256), (32,32)>>> elapsed 0.017286 ms
==15352== Profiling application: ./main 32 32
==15352== Profiling result:
==15352== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                        achieved_occupancy                        Achieved Occupancy    0.800294    0.800294    0.800294

$ sudo nvprof --metrics achieved_occupancy ./main 32 16
==15366== NVPROF is profiling process 15366, command: ./main 32 16
sumMatrixOnGPU2D <<<(256,512), (32,16)>>> elapsed 0.016680 ms
==15366== Profiling application: ./main 32 16
==15366== Profiling result:
==15366== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                        achieved_occupancy                        Achieved Occupancy    0.823351    0.823351    0.823351

$ sudo nvprof --metrics achieved_occupancy ./main 16 32
==15380== NVPROF is profiling process 15380, command: ./main 16 32
sumMatrixOnGPU2D <<<(512,256), (16,32)>>> elapsed 0.016678 ms
==15380== Profiling application: ./main 16 32
==15380== Profiling result:
==15380== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                        achieved_occupancy                        Achieved Occupancy    0.839795    0.839795    0.839795

$ sudo nvprof --metrics achieved_occupancy ./main 16 16
==15394== NVPROF is profiling process 15394, command: ./main 16 16
sumMatrixOnGPU2D <<<(512,512), (16,16)>>> elapsed 0.017072 ms
==15394== Profiling application: ./main 16 16
==15394== Profiling result:
==15394== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                        achieved_occupancy                        Achieved Occupancy    0.862736    0.862736    0.862736

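  • Because the second case has more blocks than the first, it exposes more active warps to the device. This is probably why the second case shows higher achieved occupancy and better performance than the first.

  • The fourth case has the highest achieved occupancy, but it is not the fastest. Higher occupancy therefore does not always mean higher performance; some other factor must be limiting the GPU.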
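
Achieved occupancy is measured at run time, but the block shape also sets an upper bound on it. The following is a rough sketch of that bound, considering only the per-SM warp and block limits and ignoring registers and shared memory. The function names are hypothetical, and the limits are passed in by the caller; the values of 64 warps and 32 blocks per SM mentioned afterwards are an assumption for a compute capability 6.1 device such as the GTX 1050 and should be checked against the device documentation.

```c
#include <assert.h>

#define WARP_SIZE 32

/* Warps needed by one block of the given size (rounded up). */
unsigned int warps_per_block(unsigned int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Upper bound on occupancy from the warp and block limits only.
 * max_warps_per_sm and max_blocks_per_sm are device limits supplied by
 * the caller; register and shared memory limits are ignored here. */
double theoretical_occupancy(unsigned int threads_per_block,
                             unsigned int max_warps_per_sm,
                             unsigned int max_blocks_per_sm)
{
    unsigned int wpb = warps_per_block(threads_per_block);
    unsigned int blocks = max_warps_per_sm / wpb;  /* limited by warps */

    if (blocks > max_blocks_per_sm)
        blocks = max_blocks_per_sm;                /* limited by blocks */

    return (double)(blocks * wpb) / max_warps_per_sm;
}
```

Under the assumed limits, `theoretical_occupancy(1024, 64, 32)` and `theoretical_occupancy(256, 64, 32)` both return 1.0, so this simple bound does not separate the four block shapes measured above; a 16-thread block, by contrast, would be capped at 0.5 by the block limit.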

3.3.2 Checking Memory Operations with nvprof



Memory read efficiency

sudo nvprof --metrics gld_throughput ./main 32 32


liwete@liwete-OMEN-by-HP-Laptop:~/CLionProjects/sumMatrix$ sudo nvprof --metrics gld_throughput ./main 32 32
==15453== NVPROF is profiling process 15453, command: ./main 32 32
sumMatrixOnGPU2D <<<(256,256), (32,32)>>> elapsed 0.154867 ms
==15453== Profiling application: ./main 32 32
==15453== Profiling result:
==15453== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                            gld_throughput                    Global Load Throughput  3.6326GB/s  3.6326GB/s  3.6326GB/s
liwete@liwete-OMEN-by-HP-Laptop:~/CLionProjects/sumMatrix$ sudo nvprof --metrics gld_throughput ./main 32 16
==15468== NVPROF is profiling process 15468, command: ./main 32 16
sumMatrixOnGPU2D <<<(256,512), (32,16)>>> elapsed 0.129354 ms
==15468== Profiling application: ./main 32 16
==15468== Profiling result:
==15468== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                            gld_throughput                    Global Load Throughput  4.4766GB/s  4.4766GB/s  4.4766GB/s
liwete@liwete-OMEN-by-HP-Laptop:~/CLionProjects/sumMatrix$ sudo nvprof --metrics gld_throughput ./main 16 32
==15482== NVPROF is profiling process 15482, command: ./main 16 32
sumMatrixOnGPU2D <<<(512,256), (16,32)>>> elapsed 0.125502 ms
==15482== Profiling application: ./main 16 32
==15482== Profiling result:
==15482== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                            gld_throughput                    Global Load Throughput  4.4943GB/s  4.4943GB/s  4.4943GB/s
liwete@liwete-OMEN-by-HP-Laptop:~/CLionProjects/sumMatrix$ sudo nvprof --metrics gld_throughput ./main 16 16
==15496== NVPROF is profiling process 15496, command: ./main 16 16
sumMatrixOnGPU2D <<<(512,512), (16,16)>>> elapsed 0.126294 ms
==15496== Profiling application: ./main 16 16
==15496== Profiling result:
==15496== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                            gld_throughput                    Global Load Throughput  4.4697GB/s  4.4697GB/s  4.4697GB/s

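  • In these runs the last three configurations all reach roughly 4.47 to 4.49 GB/s, with the third case (16, 32) slightly ahead, while the first case (32, 32) has both the lowest throughput and the longest elapsed time. The second case has a slightly higher load throughput than the fourth (4.4766 GB/s vs. 4.4697 GB/s), yet its reported elapsed time is longer (0.129354 vs. 0.126294). So higher load throughput does not necessarily mean higher performance. Chapter 4 analyzes the cause of this behavior when it describes how memory transactions work on GPU devices.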


Global load efficiency

sudo nvprof --metrics gld_efficiency ./main 16 16


liwete@liwete-OMEN-by-HP-Laptop:~/CLionProjects/sumMatrix$ sudo nvprof --metrics gld_efficiency ./main 16 16
==15700== NVPROF is profiling process 15700, command: ./main 16 16
==15700== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "sumMatrixOnGPU2D(float*, float*, float*, int, int)" (done)
sumMatrixOnGPU2D <<<(512,512), (16,16)>>> elapsed 0.436797 ms
==15700== Profiling application: ./main 16 16
==15700== Profiling result:
==15700== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                            gld_efficiency             Global Memory Load Efficiency     100.00%     100.00%     100.00%
liwete@liwete-OMEN-by-HP-Laptop:~/CLionProjects/sumMatrix$ sudo nvprof --metrics gld_efficiency ./main 16 8
==15714== NVPROF is profiling process 15714, command: ./main 16 8
==15714== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "sumMatrixOnGPU2D(float*, float*, float*, int, int)" (2 of 2)...
Replaying kernel "sumMatrixOnGPU2D(float*, float*, float*, int, int)" (done)
sumMatrixOnGPU2D <<<(512,1024), (16,8)>>> elapsed 0.434808 ms
==15714== Profiling application: ./main 16 8
==15714== Profiling result:
==15714== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                            gld_efficiency             Global Memory Load Efficiency     100.00%     100.00%     100.00%
liwete@liwete-OMEN-by-HP-Laptop:~/CLionProjects/sumMatrix$ sudo nvprof --metrics gld_efficiency ./main 8 8
==15728== NVPROF is profiling process 15728, command: ./main 8 8
==15728== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "sumMatrixOnGPU2D(float*, float*, float*, int, int)" (2 of 2)...
Replaying kernel "sumMatrixOnGPU2D(float*, float*, float*, int, int)" (done)
sumMatrixOnGPU2D <<<(1024,1024), (8,8)>>> elapsed 0.440531 ms
==15728== Profiling application: ./main 8 8
==15728== Profiling result:
==15728== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                            gld_efficiency             Global Memory Load Efficiency     100.00%     100.00%     100.00%
liwete@liwete-OMEN-by-HP-Laptop:~/CLionProjects/sumMatrix$ sudo nvprof --metrics gld_efficiency ./main 4 4
==15743== NVPROF is profiling process 15743, command: ./main 4 4
==15743== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "sumMatrixOnGPU2D(float*, float*, float*, int, int)" (2 of 2)...
Replaying kernel "sumMatrixOnGPU2D(float*, float*, float*, int, int)" (done)
sumMatrixOnGPU2D <<<(2048,2048), (4,4)>>> elapsed 0.747474 ms
==15743== Profiling application: ./main 4 4
==15743== Profiling result:
==15743== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 1050 (0)"
    Kernel: sumMatrixOnGPU2D(float*, float*, float*, int, int)
          1                            gld_efficiency             Global Memory Load Efficiency      50.00%      50.00%      50.00%
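
One way to read the 50% result for the (4, 4) block is a simple transaction-counting model. The sketch below assumes that global loads are served in aligned 32-byte segments, that each row of a block starts on a segment boundary, and that block.x is a power of two. These are assumptions made for illustration, not something nvprof reports, and the function name is hypothetical. Under this model, a warp in a block of width block.x reads block.x consecutive floats from each matrix row it spans, and each such run is rounded up to whole segments:

```c
#include <assert.h>

#define SEGMENT_BYTES 32  /* assumed memory transaction granularity */
#define WARP_SIZE 32

/* Modeled load efficiency (requested bytes / transferred bytes) for one
 * warp of a 2D block whose rows are block_x floats wide. */
double modeled_gld_efficiency(unsigned int block_x)
{
    assert(block_x > 0);

    /* a warp never covers more than WARP_SIZE consecutive floats */
    unsigned int run = block_x < WARP_SIZE ? block_x : WARP_SIZE;
    unsigned int requested = run * (unsigned int)sizeof(float);

    /* round each contiguous run up to a whole number of segments */
    unsigned int segments = (requested + SEGMENT_BYTES - 1) / SEGMENT_BYTES;

    return (double)requested / (segments * SEGMENT_BYTES);
}
```

With 4-byte floats the model gives 1.0 for block.x = 16 and block.x = 8, and 0.5 for block.x = 4, which agrees with the gld_efficiency values measured above. It is only a simplified picture; real hardware may behave differently for other access patterns.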




3.3.3 Exposing More Parallelism

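The previous section suggests that the innermost dimension of a block (block.x) should be a multiple of the warp size, since this greatly improves load efficiency. You may still be curious about the following questions:

  • Will adjusting block.x further increase load throughput?
  • Are there other ways to expose more parallelism?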


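  • The size of the innermost dimension of a thread block plays a key role in performance.

  • In all other cases, the number of thread blocks is smaller than in the best case. Exposing more parallelism therefore remains an important factor in performance optimization.

  • The best execution configuration has neither the highest achieved occupancy nor the highest load throughput. These experiments suggest that no single metric directly determines the best performance; you need to find a good balance among several related metrics to reach the best overall performance.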
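
To explore the two questions above more systematically, one option is to sweep block shapes whose block.x is a multiple of the warp size and compare how many blocks each one launches. The host-side sketch below lists such candidates; the function names are hypothetical, only power-of-two shapes are considered, and the per-block thread limit is a parameter (1024 is a common value, but it is better to query `maxThreadsPerBlock` from the device properties).

```c
#include <assert.h>
#include <stdio.h>

#define WARP_SIZE 32

/* Total blocks needed to cover an nx-by-ny matrix with bx-by-by blocks. */
unsigned long total_blocks(unsigned int nx, unsigned int ny,
                           unsigned int bx, unsigned int by)
{
    assert(bx > 0 && by > 0);
    unsigned long gx = (nx + bx - 1) / bx;
    unsigned long gy = (ny + by - 1) / by;
    return gx * gy;
}

/* Print every power-of-two block shape with bx a multiple of WARP_SIZE
 * and bx * by <= max_threads. Returns the number of shapes printed. */
int list_block_shapes(unsigned int nx, unsigned int ny,
                      unsigned int max_threads)
{
    int count = 0;

    for (unsigned int bx = WARP_SIZE; bx <= max_threads; bx *= 2) {
        for (unsigned int by = 1; bx * by <= max_threads; by *= 2) {
            printf("block (%u,%u) -> %lu blocks\n",
                   bx, by, total_blocks(nx, ny, bx, by));
            count++;
        }
    }
    return count;
}
```

Each listed shape can then be passed to the program as `./main bx by` and profiled with the same nvprof metrics used earlier, so that block count, occupancy, and load throughput can be compared side by side.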


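Metrics and performance

  • In most cases, no single metric yields the best performance.

  • Which metric or event relates most directly to overall performance depends on the nature of the kernel code.

  • Seek a good balance among related metrics and events.

  • Examine the kernel from different angles to find a balance among related metrics.

  • Grid/block heuristics provide a good starting point for performance tuning.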

