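Rotating a grayscale image 90 degrees to the right requires visiting every element at least once, so the time complexity is O(mn). NEON acceleration can load multiple elements in parallel; this does not change the time complexity, but it does reduce the constant factor.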

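Problem description

We rotate a grayscale image. Assume the input image has size Height*Width; the rotated image then has size Width*Height. The effect of the transformation is shown in the figure below: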

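The rotated image g maps to the input image f as follows (zero-based indices, row first; w is the output row and h is the output column):

$g(w, h) = f(Height-1-h, w)$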

C implementation

#include <stdint.h>

int GrayImageRotation90(uint8_t *in, uint8_t *out, int height, int width)
{
    uint8_t *p_img = in;
    uint8_t *p_o = out;

    for (int h = 0; h < height; h++) {
        for (int w = 0; w < width; w++) {
            /* Input pixel (h, w) lands at output row w, column height-1-h. */
            *(p_o + w * height + height - 1 - h) = *p_img++;
        }
    }
    return 0;
}
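To make the index arithmetic concrete, here is a minimal sketch that applies the same mapping to a 2*3 image. The names rotate90_ref and rotate_sample are hypothetical helpers introduced only for this illustration; they are not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Same index mapping as the scalar routine above, restated so that
 * this sketch compiles on its own (hypothetical name). */
static void rotate90_ref(const uint8_t *in, uint8_t *out, int height, int width)
{
    assert(in != out); /* the rotation is not done in place */
    for (int h = 0; h < height; h++) {
        for (int w = 0; w < width; w++) {
            out[w * height + (height - 1 - h)] = in[h * width + w];
        }
    }
}

/* Hypothetical helper: rotates a fixed 2-row, 3-column sample image. */
static void rotate_sample(uint8_t out[6])
{
    static const uint8_t in[6] = {
        1, 2, 3,
        4, 5, 6,
    };
    rotate90_ref(in, out, 2, 3);
}
```

With the input rows {1, 2, 3} and {4, 5, 6}, the output rows are {4, 1}, {5, 2} and {6, 3}: the first input column, read from bottom to top, becomes the first output row.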

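NEON optimization

The idea behind the NEON acceleration: divide the large input image into 8*8 sub-images, rotate each 8*8 sub-image separately, and write it to the corresponding position in the output. The key part is rotating the 8*8 sub-image.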
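The block decomposition can be sketched in plain C before any NEON is involved. This is only an illustration under stated assumptions: rotate_tile_8x8 and rotate90_tiled are hypothetical names, the tile rotation is a scalar placeholder for the vectorized version, and dimensions that are not multiples of 8 are simply rejected.

```c
#include <assert.h>
#include <stdint.h>

enum { TILE = 8 };

/* Rotates one 8x8 tile right by 90 degrees. src_stride is the input
 * row length (width); dst_stride is the output row length (height). */
static void rotate_tile_8x8(const uint8_t *src, int src_stride,
                            uint8_t *dst, int dst_stride)
{
    for (int r = 0; r < TILE; r++) {
        for (int c = 0; c < TILE; c++) {
            dst[c * dst_stride + (TILE - 1 - r)] = src[r * src_stride + c];
        }
    }
}

/* Walks the image tile by tile. The tile whose top-left input pixel is
 * (y, x) is written to output rows x..x+7, starting at column
 * height-8-y. Returns -1 if a dimension is not a multiple of 8. */
static int rotate90_tiled(const uint8_t *in, uint8_t *out, int height, int width)
{
    assert(in != 0 && out != 0);
    if (height % TILE != 0 || width % TILE != 0) {
        return -1;
    }
    for (int y = 0; y < height; y += TILE) {
        for (int x = 0; x < width; x += TILE) {
            rotate_tile_8x8(in + y * width + x, width,
                            out + x * height + (height - TILE - y), height);
        }
    }
    return 0;
}
```

The destination offset x * height + (height - 8 - y) is the same address computation the NEON stores use; only the per-tile rotation differs.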

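The NEON instruction vtrn can perform a transpose, but applying it to 8*8 image data also requires a fair number of reinterpret (cast) instructions alongside it. The total instruction count is high and the resulting speed is low; vtrn-based rotation implementations found online reach a speedup of only about 2.5. This article instead uses the table lookup instruction vtbl to do the selection.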

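step 1. Use the vld instruction to load the data from memory into vector structures.

step 2. Use the vtbl instruction to gather data along the column direction. As shown in the figure below, the column elements A1~D1 and A5~D5 of the first four rows are selected into one register, and the other registers are filled in the same way. Note that elements with the same numeric subscript must end up in the same register, but a group of four rows contains only four such elements, so how should the other four be chosen? The four rows below the red dashed line are gathered into registers with the same selection. Pairing column 1 with column 5 is the best choice, because a second table lookup can then produce exactly the 8-element vectors the output needs.

step 3. Apply a table lookup to vd0 and vd4 to produce two complete output registers.

step 4. Store the outputs at the corresponding memory locations.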
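The two-stage lookup can be checked without ARM hardware by emulating the table lookup in scalar C. The sketch below is an illustration, not the NEON code itself: tbl_lookup is a hypothetical stand-in for vtbl4_u8 and vtbl2_u8 (an out-of-range index yields 0, as with the real instruction), rotate_block_tbl is a hypothetical name, and the block is assumed to be stored contiguously. The index values are the ones used in the article's NEON routine.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for vtbl4_u8 (table_len 32) and vtbl2_u8 (table_len 16):
 * dst[i] = table[idx[i]], or 0 when idx[i] is out of range. */
static void tbl_lookup(const uint8_t *table, int table_len,
                       const uint8_t idx[8], uint8_t dst[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i] = (idx[i] < table_len) ? table[idx[i]] : 0;
    }
}

/* Rotates one contiguous row-major 8x8 block right by 90 degrees
 * using only table lookups. */
static void rotate_block_tbl(const uint8_t block[64], uint8_t rotated[64])
{
    static const uint8_t index_0[8] = {28, 20, 12, 4, 24, 16, 8, 0};
    static const uint8_t index_4[8] = {12, 13, 14, 15, 4, 5, 6, 7};
    static const uint8_t index_5[8] = {8, 9, 10, 11, 0, 1, 2, 3};

    assert(block != rotated);
    for (int k = 0; k < 4; k++) {
        uint8_t idx[8];
        uint8_t pair[16];

        for (int i = 0; i < 8; i++) {
            idx[i] = (uint8_t)(index_0[i] + k); /* index_1..index_3 are index_0 + k */
        }
        /* Stage 1: gather columns k+4 and k, bottom row first,
         * from rows 0..3 and then from rows 4..7. */
        tbl_lookup(block, 32, idx, pair);
        tbl_lookup(block + 32, 32, idx, pair + 8);
        /* Stage 2: join the two halves of column k and of column k+4. */
        tbl_lookup(pair, 16, index_4, rotated + k * 8);       /* output line k */
        tbl_lookup(pair, 16, index_5, rotated + (k + 4) * 8); /* output line k+4 */
    }
}
```

Each tbl_lookup call corresponds to one vtbl4_u8 or vtbl2_u8 call in the NEON routine, so an 8*8 block takes eight lookups in the first stage and eight in the second.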

#include <arm_neon.h>
#include <stdint.h>

/* The loops step by 8 with no tail handling, so height and width
 * must both be multiples of 8. */
int GrayRotation90_NEON(uint8_t *in, uint8_t *out, int height, int width)
{
    uint8x8_t vone = {1, 1, 1, 1, 1, 1, 1, 1};
    uint8x8_t index_0 = {28, 20, 12, 4, 24, 16, 8, 0};
    uint8x8_t index_1 = vadd_u8(index_0, vone);
    uint8x8_t index_2 = vadd_u8(index_1, vone);
    uint8x8_t index_3 = vadd_u8(index_2, vone);
    uint8x8_t index_4 = {12, 13, 14, 15, 4, 5, 6, 7};
    uint8x8_t index_5 = {8, 9, 10, 11, 0, 1, 2, 3};

    uint8x8x4_t mat0;
    uint8x8x4_t mat1;
    uint8x8x4_t temp0;
    uint8x8x4_t temp1;
    uint8x8x2_t out0;
    uint8x8x2_t out1;
    uint8x8x2_t out2;
    uint8x8x2_t out3;

    int x = 0, y = 0;
    for (y = 0; y < height; y += 8) {
        for (x = 0; x < width; x += 8) {
            // step 1: load the eight rows of the 8x8 block
            mat0.val[0] = vld1_u8(in + y * width + x);
            mat0.val[1] = vld1_u8(in + (y + 1) * width + x);
            mat0.val[2] = vld1_u8(in + (y + 2) * width + x);
            mat0.val[3] = vld1_u8(in + (y + 3) * width + x);
            mat1.val[0] = vld1_u8(in + (y + 4) * width + x);
            mat1.val[1] = vld1_u8(in + (y + 5) * width + x);
            mat1.val[2] = vld1_u8(in + (y + 6) * width + x);
            mat1.val[3] = vld1_u8(in + (y + 7) * width + x);

            // step 2: gather column pairs (k+4, k) from each group of four rows
            temp0.val[0] = vtbl4_u8(mat0, index_0);
            temp0.val[1] = vtbl4_u8(mat0, index_1);
            temp0.val[2] = vtbl4_u8(mat0, index_2);
            temp0.val[3] = vtbl4_u8(mat0, index_3);
            temp1.val[0] = vtbl4_u8(mat1, index_0);
            temp1.val[1] = vtbl4_u8(mat1, index_1);
            temp1.val[2] = vtbl4_u8(mat1, index_2);
            temp1.val[3] = vtbl4_u8(mat1, index_3);

            out0.val[0] = temp0.val[0];
            out0.val[1] = temp1.val[0];
            out1.val[0] = temp0.val[1];
            out1.val[1] = temp1.val[1];
            out2.val[0] = temp0.val[2];
            out2.val[1] = temp1.val[2];
            out3.val[0] = temp0.val[3];
            out3.val[1] = temp1.val[3];

            // step 3: join the upper and lower halves into full output lines
            mat0.val[0] = vtbl2_u8(out0, index_4); // line 0
            mat0.val[1] = vtbl2_u8(out0, index_5); // line 4
            mat0.val[2] = vtbl2_u8(out1, index_4); // line 1
            mat0.val[3] = vtbl2_u8(out1, index_5); // line 5
            mat1.val[0] = vtbl2_u8(out2, index_4); // line 2
            mat1.val[1] = vtbl2_u8(out2, index_5); // line 6
            mat1.val[2] = vtbl2_u8(out3, index_4); // line 3
            mat1.val[3] = vtbl2_u8(out3, index_5); // line 7

            // step 4: mat0/mat1 hold the lines in order 0, 4, 1, 5, 2, 6, 3, 7;
            // store them as output lines 0..7
            vst1_u8(out + (x + 0) * height + height - 8 - y, mat0.val[0]); // line 0
            vst1_u8(out + (x + 1) * height + height - 8 - y, mat0.val[2]); // line 1
            vst1_u8(out + (x + 2) * height + height - 8 - y, mat1.val[0]); // line 2
            vst1_u8(out + (x + 3) * height + height - 8 - y, mat1.val[2]); // line 3
            vst1_u8(out + (x + 4) * height + height - 8 - y, mat0.val[1]); // line 4
            vst1_u8(out + (x + 5) * height + height - 8 - y, mat0.val[3]); // line 5
            vst1_u8(out + (x + 6) * height + height - 8 - y, mat1.val[1]); // line 6
            vst1_u8(out + (x + 7) * height + height - 8 - y, mat1.val[3]); // line 7
        }
    }
    return 0;
}

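Performance analysis

The input image resolution is 640*360*1. Running each function 1000 times on an ARMv7 processor gives the following timings for the C function and the NEON function:

Algorithm	C	NEON	Speedup
Time	2811	995	2.8

So the NEON version runs about 2.8 times as fast as the C version (2811 / 995 ≈ 2.8).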
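A comparison of this kind can be reproduced with a small timing harness. The sketch below is an assumption about how such a measurement might be set up, not the harness used for the numbers above: time_rotation_ms, speedup and rotate90_scalar are hypothetical names, and clock() measures CPU time at whatever resolution the platform provides.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Signature shared by the C and NEON rotation functions in this article. */
typedef int (*rotate_fn)(uint8_t *in, uint8_t *out, int height, int width);

/* Scalar rotation used as a stand-in so the sketch compiles on any CPU. */
static int rotate90_scalar(uint8_t *in, uint8_t *out, int height, int width)
{
    for (int h = 0; h < height; h++) {
        for (int w = 0; w < width; w++) {
            out[w * height + (height - 1 - h)] = in[h * width + w];
        }
    }
    return 0;
}

/* Returns the CPU time in milliseconds spent in `iterations` calls of fn,
 * or -1.0 if the buffers cannot be allocated. */
static double time_rotation_ms(rotate_fn fn, int height, int width, int iterations)
{
    assert(height > 0 && width > 0 && iterations > 0);
    size_t n = (size_t)height * (size_t)width;
    uint8_t *in = (uint8_t *)malloc(n);
    uint8_t *out = (uint8_t *)malloc(n);
    if (in == NULL || out == NULL) {
        free(in);
        free(out);
        return -1.0;
    }
    for (size_t i = 0; i < n; i++) {
        in[i] = (uint8_t)i; /* arbitrary test pattern */
    }
    clock_t start = clock();
    for (int i = 0; i < iterations; i++) {
        fn(in, out, height, width);
    }
    clock_t end = clock();
    free(in);
    free(out);
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC;
}

/* Speedup of the optimized version relative to the baseline. */
static double speedup(double baseline_time, double optimized_time)
{
    return baseline_time / optimized_time;
}
```

For the setup described above, one would call time_rotation_ms(GrayImageRotation90, 360, 640, 1000) and the same with GrayRotation90_NEON, then pass the two results to speedup. With the timings from the table, speedup(2811, 995) is roughly 2.83.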

Feel free to follow 亦梦云烟's WeChat official account: 亦梦智能计算
