Overview

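This paper, Filter Representation in Vectorized Query Execution, examines two filter representations in a vectorized query execution engine (bitmaps and selection vectors) and two execution strategies: 1. Full, which runs the operation on every row regardless of how many rows are active; 2. Partial (called Selective in the paper's text), which runs the operation only on the active rows. Using theoretical analysis backed by experiments, it shows which combination is best under which conditions.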

Abbreviations used below

  • BM for BitMap
  • SV for Selection Vector
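
To make the two representations and the two strategies concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names (`map_full`, `map_bm_partial`, `map_sv_partial`) are made up for illustration, and plain Python lists stand in for column vectors.

```python
from typing import Callable, List, Optional

Op = Callable[[int], int]

def map_full(values: List[int], op: Op) -> List[int]:
    """Full: apply op to every row and ignore the filter entirely."""
    return [op(v) for v in values]

def map_bm_partial(values: List[int], bitmap: List[bool], op: Op) -> List[Optional[int]]:
    """Partial with a bitmap: test one flag per row, run op only on active rows."""
    out: List[Optional[int]] = [None] * len(values)
    for i, active in enumerate(bitmap):
        if active:
            out[i] = op(values[i])
    return out

def map_sv_partial(values: List[int], sel: List[int], op: Op) -> List[Optional[int]]:
    """Partial with a selection vector: visit only the listed row indices."""
    out: List[Optional[int]] = [None] * len(values)
    for i in sel:
        out[i] = op(values[i])
    return out

values = [10, 20, 30, 40]
bitmap = [True, False, True, False]  # BM: one flag per row
sel = [0, 2]                         # SV: indices of the active rows
double = lambda v: v * 2
```

Both Partial variants touch the same two rows here; they differ only in how they find them, which is the "iteration logic" cost discussed later.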

Difference between Update and Map operations


A Map operation materializes its result into a new vector, whereas an Update operation returns a new filter representation.
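
The distinction can be sketched in a few lines of Python. The function names below are hypothetical, and the Full variant of the update assumes a bitmap, since the paper's conclusion notes that Full is only available with bitmaps on Update primitives.

```python
from typing import List

def map_add(a: List[int], b: List[int]) -> List[int]:
    """Map: materialize a + b into a new vector; the filter is left unchanged."""
    return [x + y for x, y in zip(a, b)]

def update_less_than_bm_full(values: List[int], bound: int, bitmap: List[bool]) -> List[bool]:
    """Update, Full on a bitmap: evaluate the predicate on every row,
    then AND it with the incoming bitmap to produce the new bitmap."""
    return [active and v < bound for v, active in zip(values, bitmap)]

def update_less_than_sv_partial(values: List[int], bound: int, sel: List[int]) -> List[int]:
    """Update, Partial on a selection vector: keep only the selected
    indices that still satisfy the predicate."""
    return [i for i in sel if values[i] < bound]
```

With `values = [5, 50, 7, 70]` and a bound of 10, a bitmap `[True, True, False, True]` shrinks to `[True, False, False, False]`, and the equivalent selection vector `[0, 1, 3]` shrinks to `[0]`.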

Cost formula for primitive operations

All implemented strategies

Strategies with the Manual suffix hand-write the core operation using SIMD; those without the suffix rely on the compiler to optimize automatically.

Best strategy for each kind of operation

Operations that SIMD cannot accelerate

Operations on variable-length data

The best strategy is SVPartial.

We see that Full consistently performs worst when it processes more tuples. At full selectivity, Full and Selective process the same number of tuples, but the former slightly benefits from a simpler iteration logic. We also see that the performance of SVPartial is similar to that of BMPartial, despite the former’s simpler iteration logic, meaning that the core operation is the dominating cost.

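Choose a Partial strategy. BM and SV perform about the same: although SV's iteration logic is simpler in theory, most of the time is spent in the operator's core operation.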

Integer division / modulo

The best strategy is SVPartial; the conclusion matches the variable-length data case above.

Summary

We conclude that the number of tuples processed is the dominant factor for non-data-parallel primitives, making Full impractical. Because of the small impact of iteration logic time, SVPartial has a performance edge over BMPartial, but developers can choose either representation without much affecting performance.

In short, the number of tuples to process is the dominant factor, and the best strategy is SVPartial.
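
A rough way to see why the tuple count dominates for such primitives is to count how many core operations each strategy executes. The sketch below uses modulo as the core operation; the helper names and the returned operation counter are illustrative assumptions, not anything from the paper.

```python
from typing import List, Tuple

def mod_full(values: List[int], divisor: int) -> Tuple[List[int], int]:
    """Full: one modulo per row, whether or not the row is active."""
    out = [v % divisor for v in values]
    return out, len(values)          # core operations executed

def mod_sv_partial(values: List[int], divisor: int, sel: List[int]) -> Tuple[List[int], int]:
    """SVPartial: one modulo per selected row only."""
    out = [0] * len(values)
    for i in sel:
        out[i] = values[i] % divisor
    return out, len(sel)             # core operations executed
```

When the core operation cannot be vectorized, each of those operations costs about the same, so the strategy that executes fewer of them comes out ahead.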

Operations that SIMD can accelerate, but not efficiently

That is, operations with branching execution, such as logical OR and logical AND.

The best strategy is SVPartial.

We can see that Full is either always the worst strategy in Figure 5a, or only competitive at high selectivities (>= 0.85) in Figure 5b. Although Full is competitive at the highest selectivities, we recommend, for the sake of simplicity, the SVPartial strategy.

Operations that SIMD can accelerate without branching

core operations with straight-line and data-parallel (SLDP) code (i.e., non-branching code that leverages SIMD instructions).

Therefore, we expect Selective strategies to only be competitive when the selectivity is low, meaning that Full processes far more tuples. SVManual is the exception: it is a Selective strategy that uses SIMD instructions. Its gain in iteration logic time is, however, not as high as that of Full strategies because it uses a gather instruction [5] to collect the elements at the selected indices (see Figure 10). The higher the core operations time per tuple, the less important the gather overhead because iteration logic time becomes more insignificant. Thus, we expect SVManual to be competitive with Full at medium or below selectivities depending on the core operation.

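Selective strategies win only at low selectivity. Because SVManual uses SIMD instructions, the selectivity threshold below which it beats Full is somewhat higher than for the other Selective strategies. However, it relies on a gather instruction, so its gain in iteration logic is smaller than that of the Full strategies. As a result, the more expensive the core operation is per tuple, the less the gather overhead matters and the higher the selectivity at which SVManual remains competitive with Full.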

The difference between One Operation and Five Operation in Figures 6 and 7 is that Five Operation simulates a primitive with a larger core operation time.

In all experiments, SVPartial performs better than BMPartial. As explained in the previous section, this is only due to a simpler iteration logic. SVManual is always the best Selective strategy because it uses data-parallelism to reduce the time spent per tuple.

Among the Selective strategies, SVPartial beats BMPartial because its iteration logic is cheaper, and SVManual is the best of all.

Manual Vectorization sometimes performs better than Auto-Vectorization with Full and BMFull. For example, the results in Figure 7a show that BMFullManual is 1.8× faster than BMFull. Upon investigation of the generated code, we discovered that the compiler is overly conservative when it comes to using AVX512. AVX512 registers result in decreased CPU frequency [10], so the compiler does not always use them. It often uses AVX2 registers instead. We, on the other hand, always use AVX512. As we will see in the next section, the decreased CPU frequency can slow down the query.

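The experiments show that hand-written SIMD code (BMFullManual) outperforms BMFull, and the cause is the compiler. It is conservative about AVX512 because AVX512 registers lower the CPU frequency, so it often emits AVX2 code instead, whereas the manual version always uses AVX512.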

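Selective strategies win only at low selectivity (<= 0.2), and SVManual raises this threshold to roughly 0.5. This suggests that a system needs a Mixed strategy: Selective when selectivity is low, Full when it is high.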
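
A Mixed strategy for SLDP primitives could be driven by a threshold rule along these lines. This is only a sketch: `choose_strategy` is a hypothetical helper, and the 0.2 and 0.5 cutoffs are the approximate values noted above, which in practice depend on the core operation and the hardware.

```python
def choose_strategy(selectivity: float, manual_simd: bool) -> str:
    """Pick a compute strategy for an SLDP primitive from the selectivity.

    Assumed cutoffs: about 0.2 for plain Selective strategies and
    about 0.5 when a hand-vectorized SVManual primitive is available.
    """
    threshold = 0.5 if manual_simd else 0.2
    if selectivity <= threshold:
        return "SVManual" if manual_simd else "SVPartial"
    return "Full"
```

For example, a selectivity of 0.4 maps to Full when only compiler-vectorized code is available, but to SVManual when a hand-written SIMD primitive exists.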

Experiments

Although there are other types of primitives, they often have side-effects (e.g., insertion in a hash table for joins or an array for sorting) and are, therefore, not amenable to our optimizations for correctness reasons.

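Side-effect-free primitives use the strategies described above. Primitives with side effects do not; they process only the active rows.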

Q1 – High Selectivity SLDP Primitives

Q1 consists of a set of SLDP operations on vectors with high selectivity (>= 0.95) followed by a side-effect full aggregation that dominates the running time.

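BMFull is the fastest on the side-effect-free primitives, but the total execution time is roughly the same for all strategies. There are two reasons: 1. the side-effecting primitive (the aggregation) dominates the running time; 2. AVX512 SIMD instructions lower the CPU frequency. BMPartial and BMFull run identical code for the aggregation, yet it takes 739 ms under BMPartial and 838 ms under BMFull.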

Q6 – Mixed Selectivity SLDP Primitives

Q6 contains five SLDP filters, an arithmetic SLDP projection, and a final aggregation. Across all input vectors, the second filter leads to a selectivity smaller than 0.15, triggering a threshold-based switch in the filter’s representation.

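Every mixed strategy outperforms the non-mixed ones.
Although SVManual is faster than BMPartial, BMFull+SVManual is 0.3% slower than BMFull+BMPartial because of the cost of converting the filter representation from BM to SV.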
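
The conversion cost comes from having to scan the whole bitmap once to build the selection vector. A minimal sketch of a threshold-based switch follows; the helper name `maybe_switch_to_sv` is hypothetical, and the default threshold of 0.15 is taken from the Q6 description rather than from any stated tuning rule.

```python
from typing import List, Tuple, Union

def maybe_switch_to_sv(
    bitmap: List[bool], threshold: float = 0.15
) -> Tuple[str, Union[List[int], List[bool]]]:
    """Switch from BM to SV once the selectivity drops below the threshold.

    Building the SV touches every bitmap position once, which is the
    conversion overhead a mixed strategy has to pay.
    """
    selectivity = sum(bitmap) / len(bitmap) if bitmap else 0.0
    if selectivity < threshold:
        return "SV", [i for i, bit in enumerate(bitmap) if bit]
    return "BM", bitmap
```

A vector with 1 active row out of 10 is converted to the selection vector `[3]` (if row 3 is the active one), while a vector with 5 active rows out of 10 stays a bitmap.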

Finally, the SLDP primitives dominate the total running time for this query, so we do not observe the slowdown caused by AVX512 registers. Thus, for queries with a structure similar to Q6, the benefits of AVX512 outweigh its disadvantages.

Because Q6's running time is dominated by SLDP primitives, the benefits of AVX512 outweigh its drawbacks.

Q4 – Low Selectivity, Inefficient Parallelism

Q4 is a join of two tables (LINEITEM, ORDERS) followed by an aggregation and an order-by operator.

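The build side of the hash join has two SLDP filters: the first has a selectivity of 0.3 and the second brings it down to 0.1. The probe side has one filter with a selectivity of 0.6, followed by many complex probing primitives, such as accessing hash table entries or the keys within those entries for exact comparison. SIMD does not accelerate these primitives efficiently, so the probe side should run the SLDP filter with BMFull and the complex primitives with SVPartial rather than SVManual.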

The experiments show that BMFull+SVPartial is the best.
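
The shape of the BMFull+SVPartial probe pipeline can be sketched as follows. Everything here is a simplified assumption for illustration: the column names, the `bound` predicate, and the use of a Python dict in place of a real join hash table.

```python
from typing import Dict, List, Tuple

def probe_side(
    keys: List[int], quantities: List[int], hash_table: Dict[int, str], bound: int
) -> List[Tuple[int, str]]:
    # Step 1 (BMFull): evaluate the SLDP filter on every row into a bitmap.
    bitmap = [q < bound for q in quantities]
    # Step 2: convert the bitmap into a selection vector.
    sel = [i for i, active in enumerate(bitmap) if active]
    # Step 3 (SVPartial): run the expensive, non-SIMD-friendly probe
    # only on the selected rows.
    matches = []
    for i in sel:
        payload = hash_table.get(keys[i])
        if payload is not None:
            matches.append((i, payload))
    return matches
```

The cheap filter runs over all rows in a form that vectorizes well, while the hash table lookups, which dominate per-tuple cost, are limited to the rows that survive it.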

Conclusion

This work analyzed the impact of filter representation (i.e., Bitmap vs. Selection Vector) and compute strategy (i.e., Full vs. Selective) on the performance of the vectorized primitives in an in-memory analytical DBMS. We identified the factors that influence performance: number of tuples processed, iteration logic, and core operation time per tuple. We explained how each combination of representation and compute strategy balances between these three factors. Full has the cheapest iteration logic, processes all tuples, but spends less time on each tuple when SIMD vectorization is possible. Full is, however, only available with Bitmaps on Update primitives. Selective with SVs has a cheaper iteration logic than Selective with Bitmaps, and is more amenable to SIMD vectorization. We confirmed these observations with several micro-benchmarks. Finally, we showcased the benefits of our analysis on OLAP queries with multiple primitives and consistently achieved the best performance. Our performance gains over the best techniques that do not adapt filter representation and compute strategy can be up to 1.3×.

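Working on an in-memory analytical database, the paper analyzes how the filter representation (BM or SV) and the compute strategy (Selective or Full) affect the performance of vectorized primitives. It identifies the factors that influence performance: the number of tuples processed, the iteration logic, and the core operation time per tuple. It explains how each combination of representation and compute strategy balances these three factors. Full has the simplest and cheapest iteration logic and must process every tuple, but spends less time on each tuple when SIMD is available. For Update primitives, however, Full is only available with bitmaps. Selective with SVs has cheaper iteration logic than Selective with bitmaps and is more amenable to SIMD vectorization. The paper confirms these observations with several micro-benchmarks. Finally, it shows that Mixed strategies outperform the best technique that uses a single combination of representation and compute strategy.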

Thoughts on Apache IoTDB's MPP query execution framework

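IoTDB currently uses the Full + BM combination, although its bitmap is a boolean[]. This choice reflects the fact that most of our query workloads have high selectivity, and according to the paper's experiments, Full with BM is the best choice when selectivity is high.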

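IoTDB will adopt a Mixed strategy later. In some anomaly detection queries the selectivity is usually low, and a Partial + SV/BM combination should perform better there. Whether to use SV or BM still needs experimental validation. SV has the cheaper iteration logic, but first, we are constrained by Java: even with SVPartial, we cannot hand-write SIMD optimizations in the Manual style as the paper does. Second, other query components in IoTDB will keep the BM filter representation, so the overhead of converting SV to BM must be taken into account as well.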
