The matrix space is 3x4, with a 2x3 submatrix occupying its top-left corner:

11 22 33 0
44 55 66 0
 0  0  0 0

i and j denote the row index and the column index, respectively.

With column-major storage, the leading dimension is 3 (the number of rows in the matrix space), and element (i, j) lives at offset i + j * ld. The memory layout is:

11 44 0 22 55 0 33 66 0 0 0 0

With row-major storage, the leading dimension is 4 (the number of columns in the matrix space), and element (i, j) lives at offset i * ld + j. The memory layout is:

11 22 33 0 44 55 66 0 0 0 0 0

Matrices in cuBLAS are represented in column-major storage; matrices in cuSPARSE are represented row-wise (e.g. the CSR format).

cuBLAS treats 2D arrays as column-major, while C stores 2D arrays row-major. Suppose that in C we have row-major matrices A (M x K) and B (N x K), and we want C = A * B^T, which is M x N.

Through CUDA's column-major eyes, the same memory holds A as a K x M matrix and B as a K x N matrix, so what cuBLAS should compute is C = B^T * A, an N x M column-major result (A and B here being the K x M and K x N CUDA-view matrices).

Read back in C as a row-major matrix, that result is exactly the M x N product A * B^T.

In cublasSgemm, the leading dimension of A is K whether transa is CUBLAS_OP_T or CUBLAS_OP_N; likewise, the leading dimension of B is K whether or not B is transposed.

On the CUDA side, set A' = B = K x N and B' = A = K x M (the first and second operands actually passed to cublasSgemm). Then:

transa = CUBLAS_OP_T
transb = CUBLAS_OP_N
m = rows of op(A') = N
n = cols of op(B') = M
k = cols of op(A') = rows of op(B') = K
lda = rows of A' = K
ldb = rows of B' = K
ldc = rows of C = N

Now suppose the C-language matrix above has a matrix space of M x N, but only an m x n submatrix is actually used.

From CUDA's point of view, the matrix space is N x M and the submatrix is n x m.

When calling cublasSgemm, m, n, and k are the dimensions of the submatrix, but lda = N.

In summary: the m, n, k arguments of GEMM are the submatrix dimensions as seen from the CUDA (column-major) viewpoint, and they change depending on whether the operands are transposed; lda, on the other hand, is the number of rows of the whole matrix space in the CUDA view, and it does not change with transposition.

Reference: http://stackoverflow.com/questions/14595750/transpose-matrix-multiplication-in-cublas-howto

The problem is simple: I have two matrices, A and B, that are M by N, where M >> N. I want to first take the transpose of A, and then multiply that by B (A^T * B) to put that into C, which is N by N. I have everything set up for A and B, but how do I call cublasSgemm properly without it returning the wrong answer?

I understand that cuBlas has a cublasOperation_t enum for transposing things beforehand, but somehow I'm not quite using it correctly. My matrices A and B are in row-major order, i.e. [ row1 ][ row2 ][ row3 ]..... in device memory. That means for A to be interpreted as A-transposed, BLAS needs to know my A is in column-major order. My current code looks like below:

float *A, *B, *C;
// initialize A, B, C as device arrays, fill them with values
// initialize m = num_row_A, n = num_row_B, and k = num_col_A;
// set lda = m, ldb = k, ldc = m;
// alpha = 1, beta = 0;
// set up cuBLAS handle ...
cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k, &alpha, A, lda, B, ldb, &beta, C, ldc);

My questions:

Am I setting up m, k, n correctly?

What about lda, ldb, ldc?

Thanks!

Since cuBLAS always assumes that matrices are stored in column-major order, you could either transpose your matrices into column-major first using cublasSgeam(), or

You could treat your matrix A, stored in row-major order, as a new matrix AT stored in column-major order. The matrix AT is actually the transpose of A. Do the same for B. Then you can compute the matrix C, stored in column-major order, as C = AT * BT^T:

float* AT = A;
float* BT = B;

The leading dimension is a parameter related to storage, and it does not change whether or not you use the transpose flag CUBLAS_OP_T.

lda = num_col_A = num_row_AT = N;
ldb = num_col_B = num_row_BT = N;
ldc = num_row_C = N;

m and n in the cuBLAS GEMM routine are the #rows and #cols of the result matrix C,

m = num_row_C = num_row_AT = num_col_A = N;
n = num_col_C = num_row_BT = num_col_B = N;

k is the common dimension of A^T and B,

k = num_col_AT = num_row_B = M;

Then you could invoke the GEMM routine by

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k, &alpha, AT, lda, BT, ldb, &beta, C, ldc);

If you want the matrix C to be stored in row-major, you could calculate the CT stored in column-major with the formula CT = BT * AT^T by

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, m, k, &alpha, BT, ldb, AT, lda, &beta, CT, ldc);

Please note you don't have to swap m and n since C is a square matrix in this case.
