Original article: http://community.arm.com/groups/processors/blog/2010/05/10/coding-for-neon--part-2-dealing-with-leftovers

In the first post on NEON, about loads and stores, we looked at transferring data between the NEON processing unit and memory. In this post, we deal with a frequently encountered problem: input data whose length is not a multiple of the length of the vectors you want to process. You need to handle the leftover elements at the start or end of the array. What is the best way to do this on NEON?

Leftovers

Using NEON typically involves operating on vectors of data from four to sixteen elements in length. Frequently, you will find that the length of your array is not a multiple of the vector length, and you have to process the leftover elements separately.

For example, you want to load, process and store eight elements per iteration using NEON, but your array is 21 elements long. The first two iterations go well, but for the third, there are only five elements remaining to be processed. What do you do?
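The arithmetic behind this example can be sketched in portable C. This is only an illustration, not NEON code: the names `VECTOR_LEN`, `full_vectors` and `leftovers` are hypothetical, and the sketch assumes the vector length is a power of two so that the remainder can be taken with a bit mask.

```c
#include <stddef.h>

/* Hypothetical name: elements per vector. Assumed to be a power of two. */
#define VECTOR_LEN ((size_t)8)

/* Number of complete vectors in an array of n elements. */
static size_t full_vectors(size_t n)
{
    return n / VECTOR_LEN;
}

/* Number of elements left over after the complete vectors,
 * computed as n & (vector length - 1). */
static size_t leftovers(size_t n)
{
    return n & (VECTOR_LEN - 1);
}
```

For the 21-element array, `full_vectors(21)` gives the two complete iterations and `leftovers(21)` gives the five remaining elements.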

Fixing Up

There are three ways to handle these leftovers. The methods vary in requirements, performance, and code size. They are listed below in order, with the fastest approach first.

Larger Arrays

If you can change the size of the arrays that you are processing, increase the length of the array to the next multiple of the vector size using padding elements. This allows you to read and write beyond the end of your data without corrupting adjacent storage.

In the example above, increasing the array size to 24 elements allows the third iteration to complete without potential data corruption.

Notes

  • Allocating larger arrays will consume more memory. The increase could be significant if many short arrays are involved.
  • The new padding elements created at the end of the array may need to be initialized to a value that does not affect the result of the calculation. For example, if you are summing an array, the new elements must be initialized to zero for the result to be unaffected. If you are finding the minimum of an array, set the new elements to the maximum value an element can take.
  • In some cases, it may not be possible to initialize the padding elements to a value that does not affect the result of a calculation - when finding the range of a set of numbers, for example.

Code Fragment

@ r0 = input array pointer
@ r1 = output array pointer
@ r2 = length of data in array

@ We can assume that the array length is greater than zero, is an integer
@ number of vectors, and is greater than or equal to the length of data
@ in the array.

    add     r2, r2, #7      @ add (vector length-1) to the data length
    lsr     r2, r2, #3      @ divide the length of the array by the length
                            @  of a vector, 8, to find the number of
                            @  vectors of data to be processed

loop:
    subs    r2, r2, #1      @ decrement the loop counter, and set flags
    vld1.8  {d0}, [r0]!     @ load eight elements from the array pointed to
                            @  by r0 into d0, and update r0 to point to the
                            @  next vector

    ...
    ...                     @ process the input in d0
    ...

    vst1.8  {d0}, [r1]!     @ write eight elements to the output array, and
                            @  update r1 to point to next vector
    bne     loop            @ if r2 is not equal to 0, loop
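The padding approach in this section can also be modeled in plain C. The sketch below is a scalar model rather than NEON code: each pass of the inner loop stands in for one eight-element vector operation. All function names are hypothetical, and the sketch assumes 8-bit unsigned elements and a buffer that has already been allocated with room for the padding.

```c
#include <stddef.h>
#include <stdint.h>

#define VECTOR_LEN ((size_t)8)   /* hypothetical name; power of two assumed */

/* Round n up to the next multiple of VECTOR_LEN. */
static size_t padded_length(size_t n)
{
    return (n + VECTOR_LEN - 1) & ~(VECTOR_LEN - 1);
}

/* Fill the padding elements buf[n .. padded_length(n)) with a value
 * that does not affect the result of the operation. */
static void fill_padding(uint8_t *buf, size_t n, uint8_t neutral)
{
    for (size_t i = n; i < padded_length(n); i++)
        buf[i] = neutral;
}

/* Sum: padding is zero. buf must hold padded_length(n) elements. */
static uint32_t sum_padded(uint8_t *buf, size_t n)
{
    uint32_t total = 0;
    fill_padding(buf, n, 0);
    for (size_t v = 0; v < padded_length(n); v += VECTOR_LEN)
        for (size_t lane = 0; lane < VECTOR_LEN; lane++)  /* one "vector" */
            total += buf[v + lane];
    return total;
}

/* Minimum: padding is the largest value an element can take. */
static uint8_t min_padded(uint8_t *buf, size_t n)
{
    uint8_t lowest = UINT8_MAX;
    fill_padding(buf, n, UINT8_MAX);
    for (size_t v = 0; v < padded_length(n); v += VECTOR_LEN)
        for (size_t lane = 0; lane < VECTOR_LEN; lane++)  /* one "vector" */
            if (buf[v + lane] < lowest)
                lowest = buf[v + lane];
    return lowest;
}
```

With 21 data elements in a 24-element buffer, both functions process three whole vectors, and the three padding elements leave the result unchanged.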

Overlapping

If the operation is suitable, leftover elements can be handled using overlapping. This involves processing some of the elements in the array twice.

In the example case, the first iteration would process elements zero to seven, the second processes elements five to 12, and the third 13 to 20. Notice that elements five to seven, the overlap between the first and second vectors, have been processed twice.

Notes

  • Overlapping can be used only when the result does not depend on how many times the operation is applied to an element; the operation must be idempotent. For example, it can be used if you are trying to find the maximum element in an array. It cannot be used if you are summing an array, because the overlapped elements would be counted twice.
  • The number of elements in the array must fill at least one complete vector.

Code Fragment

@ r0 = input array pointer
@ r1 = output array pointer
@ r2 = length of data in array

@ We can assume that the operation is idempotent, and the array is greater
@ than or equal to one vector long.

    ands    r3, r2, #7      @ calculate number of elements left over after
                            @  processing complete vectors using
                            @  data length & (vector length - 1)
    beq     loopsetup       @ if the result of the ands is zero, the length
                            @  of the data is an integer number of vectors,
                            @  so there is no overlap, and processing can begin
                            @  at the loop

                            @ handle the first vector separately
    vld1.8  {d0}, [r0], r3  @ load the first eight elements from the array,
                            @  and update the pointer by the number of elements
                            @  left over

    ...
    ...                     @ process the input in d0
    ...

    vst1.8  {d0}, [r1], r3  @ write eight elements to the output array, and
                            @  update the pointer

                            @ now, set up the vector processing loop
loopsetup:
    lsr     r2, r2, #3      @ divide the length of the array by the length
                            @  of a vector, 8, to find the number of
                            @  vectors of data to be processed

                            @ the loop can now be executed as normal. the
                            @  first few elements of the first vector will
                            @  overlap with some of those processed above
loop:
    subs    r2, r2, #1      @ decrement the loop counter, and set flags
    vld1.8  {d0}, [r0]!     @ load eight elements from the array, and update
                            @  the pointer

    ...
    ...                     @ process the input in d0
    ...

    vst1.8  {d0}, [r1]!     @ write eight elements to the output array, and
                            @  update the pointer
    bne     loop            @ if r2 is not equal to 0, loop
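To make the index arithmetic of overlapping concrete, here is a scalar C model applied to the maximum example from the notes. It is a sketch under stated assumptions, not NEON code: `max_vector` stands in for a lane-wise vector maximum, the function names are hypothetical, and elements are assumed to be 8-bit unsigned values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VECTOR_LEN ((size_t)8)   /* hypothetical name; power of two assumed */

/* Lane-wise maximum of the accumulator and one eight-element "vector". */
static void max_vector(uint8_t acc[], const uint8_t *src)
{
    for (size_t lane = 0; lane < VECTOR_LEN; lane++)
        if (src[lane] > acc[lane])
            acc[lane] = src[lane];
}

/* Maximum of data[0..n) using an overlapping first vector.
 * Requires n >= VECTOR_LEN, as the notes above state. */
static uint8_t max_overlapping(const uint8_t *data, size_t n)
{
    uint8_t acc[8] = {0};
    size_t leftover = n & (VECTOR_LEN - 1);
    size_t pos = 0;
    uint8_t best = 0;

    assert(n >= VECTOR_LEN);

    if (leftover != 0) {
        max_vector(acc, data);  /* first vector: elements 0..7 */
        pos = leftover;         /* advance only by the leftover count */
    }
    for (; pos < n; pos += VECTOR_LEN)
        max_vector(acc, data + pos);  /* remaining whole vectors */

    /* Reduce the eight lanes to a single value. */
    for (size_t lane = 0; lane < VECTOR_LEN; lane++)
        if (acc[lane] > best)
            best = acc[lane];
    return best;
}
```

For n = 21 the vectors start at elements 0, 5 and 13, so elements five to seven are visited twice. That is harmless for a maximum, but the same loop structure would give a wrong total if `max_vector` were replaced by an addition.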

Single Elements

NEON provides loads and stores that can operate on single elements in a vector. Using these, you can load a partial vector containing one element, operate on it, and write the element back to memory.

For the example problem, the first two iterations execute as normal, processing elements zero to seven, and eight to 15. The third iteration needs only to process five elements. They are handled in a separate loop, which loads, processes and stores single elements.

Notes

  • This approach is slower than the previous methods, as each element must be loaded, processed and stored individually.
  • Handling leftovers like this requires two loops - one for the vectors, and a second for the single elements. This can double the amount of code in the function.
  • NEON single element loads only change the value of the destination element, leaving the rest of the vector intact. If the calculation that you are vectorizing involves instructions that work across a vector, such as VPADD, the register must be initialized before loading the first single element into it.

Code Fragment

@ r0 = input array pointer
@ r1 = output array pointer
@ r2 = length of data in array

    lsrs    r3, r2, #3      @ calculate the number of complete vectors to be
                            @  processed and set flags
    beq     singlesetup     @ if there are zero complete vectors, branch to
                            @  the single element handling code

                            @ process vector loop
vectors:
    subs    r3, r3, #1      @ decrement the loop counter, and set flags
    vld1.8  {d0}, [r0]!     @ load eight elements from the array and update
                            @  the pointer

    ...
    ...                     @ process the input in d0
    ...

    vst1.8  {d0}, [r1]!     @ write eight elements to the output array, and
                            @  update the pointer
    bne     vectors         @ if r3 is not equal to zero, loop

singlesetup:
    ands    r3, r2, #7      @ calculate the number of single elements to process
    beq     exit            @ if the number of single elements is zero, branch
                            @  to exit

                            @ process single element loop
singles:
    subs    r3, r3, #1      @ decrement the loop counter, and set flags
    vld1.8  {d0[0]}, [r0]!  @ load single element into d0, and update the
                            @  pointer

    ...
    ...                     @ process the input in d0[0]
    ...

    vst1.8  {d0[0]}, [r1]!  @ write the single element to the output array,
                            @  and update the pointer
    bne     singles         @ if r3 is not equal to zero, loop

exit:
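The two-loop structure of the single element approach can be modeled in portable C as follows. This is a scalar sketch, not NEON code. The per-element operation `process_element` (here, inverting an 8-bit value) is a hypothetical placeholder for whatever processing is applied, and the other names are likewise assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define VECTOR_LEN ((size_t)8)   /* hypothetical name; power of two assumed */

/* Placeholder operation, chosen only for illustration. */
static uint8_t process_element(uint8_t x)
{
    return (uint8_t)(255u - x);
}

/* Process n elements: whole vectors first, then single elements. */
static void process_with_single_tail(const uint8_t *in, uint8_t *out, size_t n)
{
    size_t vectors = n / VECTOR_LEN;         /* complete vectors */
    size_t singles = n & (VECTOR_LEN - 1);   /* leftover elements */

    /* Vector loop: each pass models one eight-element load/process/store. */
    while (vectors-- > 0) {
        for (size_t lane = 0; lane < VECTOR_LEN; lane++)
            out[lane] = process_element(in[lane]);
        in += VECTOR_LEN;
        out += VECTOR_LEN;
    }

    /* Single element loop: one element per pass. */
    while (singles-- > 0)
        *out++ = process_element(*in++);
}
```

Unlike the padded version, this never reads or writes past element n - 1, so the arrays need no extra space.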

Further Considerations

Beginning or End

The overlapping and single element techniques can be applied at the start or end of processing an array. The code above can be easily adapted to fix up elements at either end, if it is more suitable for your application.
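As one possible adaptation, the overlap can be moved to the end of the array: process whole vectors from the start, then process one final vector that ends exactly at the last element. The C below is a scalar model of that idea, not NEON code. The names and the placeholder operation are hypothetical, and the input and output arrays are assumed not to alias, so that recomputing an element gives the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VECTOR_LEN ((size_t)8)   /* hypothetical name */

/* Models one eight-element load/process/store; the operation
 * (inverting each value) is a placeholder for illustration. */
static void process_vector(const uint8_t *in, uint8_t *out)
{
    for (size_t lane = 0; lane < VECTOR_LEN; lane++)
        out[lane] = (uint8_t)(255u - in[lane]);
}

/* Overlap at the end: requires n >= VECTOR_LEN. */
static void process_overlap_at_end(const uint8_t *in, uint8_t *out, size_t n)
{
    size_t pos = 0;

    assert(n >= VECTOR_LEN);

    /* Whole vectors from the start of the array. */
    for (; pos + VECTOR_LEN <= n; pos += VECTOR_LEN)
        process_vector(in + pos, out + pos);

    /* Final vector, shifted back so that it ends at element n - 1. */
    if (pos < n)
        process_vector(in + n - VECTOR_LEN, out + n - VECTOR_LEN);
}
```

For n = 21 the vectors start at elements 0, 8 and 13, so elements 13 to 15 are processed twice.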

Alignment

Load and store addresses should be aligned to cache lines, allowing more efficient memory accesses.

This requires at least 16-word alignment on Cortex-A8. If you cannot align the start of your input and output arrays, you must handle elements at the beginning of the array (for alignment) and at the end of the array (for the incomplete final vector).
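The number of elements to handle at the beginning can be derived from the address. The following C sketch is an illustration only: the function name is hypothetical, elements are assumed to be one byte each, and the alignment is assumed to be a power of two (64 bytes for the 16-word case above). It also relies on the pointer-to-integer conversion behaving as a flat byte address, which holds on common ARM toolchains but is implementation-defined in C.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of one-byte elements to process before p reaches the next
 * alignment boundary, capped at the array length n.
 * alignment must be a power of two. */
static size_t head_elements(const uint8_t *p, size_t n, size_t alignment)
{
    uintptr_t misalign = (uintptr_t)p & (alignment - 1);
    size_t head = misalign ? alignment - (size_t)misalign : 0;
    return head < n ? head : n;
}
```

The elements counted by `head_elements` would be fixed up with one of the techniques above, the aligned middle of the array processed as whole vectors, and any remainder fixed up at the end.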

When aligning memory accesses for speed, remember to use :64 or :128 or :256 address qualifiers with your load and store instructions, for optimum performance. You can compare the number of cycles required to issue a load or store using the data available in the Technical Reference Manual for your target core.

Here's the relevant page in the Cortex-A8 TRM.

Using ARM to Fix Up

In the single elements case, you could use ARM instructions to operate on each element. However, storing to the same area of memory with both ARM and NEON instructions can reduce performance, as the writes from the ARM pipeline are delayed until writes from the NEON pipeline have been completed.

Generally, you should avoid writing to the same area of memory (specifically, the same cache line) from both ARM and NEON code.

In the next post, we will look at a practical application of NEON: matrix multiplication.
