Filter Representation in Vectorized Query Execution论文笔记

概览

这篇文章向我们展示了在向量化查询执行引擎里的两种filter representation(bitmap以及select vectors)和两种执行策略：1. Full, 不管有多少active的row，都全部执行；2. Partial，只执行active的row，并用理论分析，辅之以实验论证，哪些情况下，哪种组合更优。

后文用到的缩写

BM for BitMap
SV for Select Vectors

Update Operation和Map Operation的区别

Map需要物化其结果到另一个新的vector中，update需要返回一个新的filter representation

原语操作代价公式

实现的所有策略

加了Manual后缀表示手动的以SIMD的形式写出核心操作进行优化，不以Manual结尾的则依赖编译器自行进行优化

不同Operation的最优策略

没有SIMD可以加速的Operation

变长数据的操作

最优策略选SVPartial

We see that Full consistently performs worst when it processes more tuples. At full selectivity, Full and Selective process the same number of tuples, but the former slightly benefits from a simpler iteration logic. We also see that the performance of SVPartial is similar to that of BMPartial, despite the former’s simpler iteration logic, meaning that the core operation is the dominating cost.

选择Partial策略，并且使用BM和SV的差别不大，因为即使理论上SV的迭代逻辑更简单，但是主要的耗时在于算子的真正操作耗时。

整数相除/取模

最优策略选SVPartial，结论与与”变长数据的操作“一致

总结

We conclude that the number of tuples processed is the dominant factor for non-data-parallel primitives, making Full impractical. Because of the small impact of iteration logic time, SVPartial has a performance edge over BMPartial, but developers can choose either representation without much affecting performance.

简而言之，需要处理的元组数量是主要因素，最优策略是SVPartial

有SIMD可以加速但不高效的Operation

即存在执行分支的Operation，如逻辑或和逻辑与操作

最优策略选SVPartial

We can see that Full is either always the worst strategy in Figure 5a, or only competitive at high selectivities (>= 0.85) in Figure 5b. Although Full is competitive at the highest selectivities, we recommend, for the sake of simplicity, the SVPartial strategy.

有SIMD可以加速且没有分支的Operation

core operations with straight-line and data-parallel (SLDP) code (i.e., non-branching code that leverages SIMD instruc- tions).

Therefore, we expect Selective strategies to only be competitive when the selectivity is low, meaning that Full processes far more tuples. SVManual is the exception: it is a Selective strategy that uses SIMD instructions. Its gain in iteration logic time is, however, not as high as that of Full strategies because it uses a gather instruction [5] to collect the elements at the selected indices (see Figure 10). The higher the core operations time per tuple, the less important the gather overhead because iteration logic time becomes more insignificant. Thus, we expect SVManual to be competitive with Full at medium or below selectivities depending on the core operation.

Selective策略只会在选择率较低的情况下更优，SVManual因为使用了SIMD指令，所以它比Full策略差的选择率阈值那个点要更高一些。但是因为它使用了gather指令，所以它在迭代效率上的性能提升并没有那么大，所以在算子核心操作的时间代价越大，SVManual比Full策略差的选择率阈值就要更高

图6和图7中的One Operation和Five Operation的差异就在于Five Operation它模拟了更大核心操作时间的算子

In all experiments, SVPartial performs better than BMPartial. As explained in the previous section, this is only due to a simpler iteration logic. SVManual is always the best Selective strategy because it uses data-parallelism to reduce the time spent per tuple.

在Selective策略里，SVPartial要比BMPartial更优，因为迭代效率更高；SVManual是Selective策略里最优的。

Manual Vectorization sometimes performs better than Auto-Vectorization with Full and BMFull. For example, the results in Figure 7a show that BMFullManual is 1.8× faster than BMFull. Upon investigation of the generated code, we discovered that the compiler is overly conservative when it comes to using AVX512. AVX512 registers result in decreased CPU frequency [10], so the compiler does not always use them. It often uses AVX2 registers instead. We, on the other hand, always use AVX512. As we will see in the next section, the decreased CPU frequency can slow down the query.

因为编译器的原因，从实验结果看，手动编写的SIMD代码(BMFullManual)要比BMFull更优，因为编译器采用一种保守策略，考虑到使用AVX512会导致CPU主频下降，经常不对其做SIMD优化。

Selective策略只在选择比较低(<=0.2)的情况下更优，SVManual 可以将这个阈值提高(<=0.5左右)，所以表明一个系统需要Mixed的策略，选择率低时采用Selective，选择率高时采用Full

实验

Although there are other types of primitives, they often have side-effects (e.g., insertion in a hash table for joins or an array for sorting) and are, therefore, not amenable to our optimizations for correctness reasons.

无副作用的算子会应用前文的策略，有副作用的并不会，他们只会选择有效的行

Q1 – High Selectivity SLDP Primitives

Q1 consists of a set of SLDP operations on vectors with high selectivity (>= 0.95) followed by a side-effect full aggregation that dominates the running time.

BMFull在没有副作用的算子运算时间上表现最优，但是总的执行时间，所有执行策略大致相同。原因有两个：1. 主要是有副作用的算子(aggregation)执行时间占主导；2. 使用AVX512 SIMD的指令会导致CPU主频下降，BMPartial 和BMFull 对于Aggregation算子的代码是一样的，但是BMPartial执行只需要739ms，而BMFull执行却需要838ms

Q6 – Mixed Selectivity SLDP Primitives

Q6 contains five SLDP filters, an arithmetic SLDP projection, and a final aggregation. Across all input vectors, the second filter leads to a selectivity smaller than 0.15, triggering a threshold-based switch in the filter’s representation.

所有混合策略都比非混合策略更优；
尽管SVManual比BMPartial快，但是 BMFull+SVManual 比BMFull+BMPartial慢0.3%，这就是由于Filter Representation由BM向SV转化的代价。

Finally, the SLDP primitives dominate the total running time for this query, so we do not observe the slowdown caused by AVX512 registers. Thus, for queries with a structure similar to Q6, the benefits of AVX512 outweigh its disadvantages.

因为Q6的运行时代价主要由SLDP原语构成，所以使用AVX512优大于劣

Q4 – Low Selectivity, Inefficient Parallelism

Q4 is a join of two tables (LINEITEM, ORDERS) followed an aggregation and an order-by operator.

Hash Join的build侧，包含两个SLDP的Filter，第一个filter选择率为0.3，第二个降到0.1。Join的Probe侧，有一个选择率为0.6的filter，并且包含很多复杂的探测原语，如accessing hash table entries or the keys within those entries for exact comparison，这些原语使用SIMD的加速是不高效的，所以对于probe侧，应该使用BMFull去执行SLDP的filter，并使用SVPartial执行复杂的原语而不是SVManual.

实验结果表明，BMFull+SVPartial 是最优的。

总结

This work analyzed the impact of filter representation (i.e., Bitmap vs. Selection Vector) and compute strategy (i.e., Full vs. Selective) on the performance of the vectorized primitives in an in-memory analytical DBMS. We identified the factors that influence performance: number of tuples processed, iteration logic, and core operation time per tuple. We explained how each combination of representation and compute strategy balances between these three factors. Full has the cheapest iteration logic, processes all tuples, but spends less time on each tuple when SIMD vectorization is possible. Full is, however, only available with Bitmaps on Update primitives. Selective with SVs has a cheaper iteration logic than Selective with Bitmaps, and is more amenable to SIMD vectorization. We confirmed these observations with several micro-benchmarks. Finally, we showcased the benefits of our analysis on OLAP queries with multiple primitives and consistently achieved the best performance. Our performance gains over the best techniques that do not adapt filter representation and compute strategy can be up to 1.3×.

这篇文章在一个分析型内存数据库上，分析了filter的表示(BM还是SV)以及计算策略（Selective还是Full）对于向量化执行的原语效率的影响。指出了影响效率的因素：需要处理的元组数量，迭代逻辑以及处理每个原子的核心操作所需要的时间。解释了表示和计算策略的每种组合如何在这三个因素之间取得平衡。Full 的迭代逻辑最简单，代价最小，需要处理所有元组，但当有SIMD 加持时，在每个元组上花费的时间更少。但是，对于Update的原语，Full 仅适用于 Bitmaps表示。 Selective with SVs 比 Selective with Bitmaps 的迭代代价更小，并且更适合 SIMD 矢量化。并且文章中也用几个微基准证实了这些观察结果。最后，文章展示了Mixed策略要比单种的表示和计算策略组合的最佳技术更优。

对Apache IoTDB的MPP查询执行框架的思考

IoTDB中目前采取的是Full + BM的组合策略，当然这里的BitMap，IoTDB使用的是一个boolean[]，其实也是基于我们大部分的查询场景选择率都比较高，按照论文中的实验结果，选择率较高时，使用FullBM其实是最优的。

当然之后IoTDB也会采用Mixed策略，在一些异常点检测的查询场景下，选择率一般都会比较低，在这种场景下，Partial + SV/BM的组合应该是会更优的，至于采用SV还是BM，这个还需要进一步实验验证，虽然SV的迭代代价更小，但是首先受制于Java语言，即使采用了PartialSV，我们也无法像论文中一样，进一步用Manual的方式进行SIMD优化；其次，因为IoTDB中其他查询组件还都会保留BM的Filter Representation，所以SV到BM转换的overhead也要考虑进去。