Recheck Cond filter IO\CPU放大原理与优化CASE - 含超级大表不包含(反选) SQL优化

背景

在阅读本文之前，先提几个问题：

1、用了索引是不是就不需要访问表了？

2、用了索引是不是就不需要进行二次判断了？

第一个问题，只有一种情况用了索引是不需要访问表的，Index Only Scan，并且对应堆表行号对应的HEAP块中没有不可见TUPLE（访问表的VM文件得到）。否则都需要回表，访问堆表的tuple head(infomask掩码)，并判断行版本获得可见性。注意回表并不需要访问COLUMN的VALUE，只需要访问堆表中TUPLE的HEAD，取其掩码来判断可见性。

第二个问题，实际上是索引精确性的问题，对于精准索引的INDEX SCAN，是不需要RECHECK的，例如B-TREE索引。但是这里指的是index scan和index only scan，并不是bitmap scan。bitmap scan实际上只是定位到了BLOCK，并没有定位到item，也就是说，需要recheck。因此什么时候recheck，取决于索引的精确性以及是否使用了bitmapcan。对于不lossy index，必然是要recheck的，对于bitmap scan也必然是需要recheck的。

recheck有哪些开销？

recheck需要取得HEAP TABLE的对应被过滤列的VALUE，进行recheck。

好的，接下来谈一下recheck引入的隐含问题：

recheck过多，就会导致CPU使用偏高，同时响应变慢，毋庸置疑。

recheck 引入的开销举例

例子1 - bitmap scan带来的recheck

bitmap scan将数据收敛到了被命中的BLOCK，并执行顺序的HEAP扫描。这样减少了数据块的随机性，但是引入了一个问题，RECHECK的问题。

postgres=# explain (analyze,verbose,timing,costs,buffers) select count(*) from a where id>10 and id<10000;  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------  Aggregate  (cost=10840.18..10840.19 rows=1 width=8) (actual time=4.238..4.238 rows=1 loops=1)  Output: count(*)  Buffers: shared hit=97 read=27  ->  Bitmap Heap Scan on public.a  (cost=132.77..10815.80 rows=9750 width=0) (actual time=1.082..3.056 rows=9989 loops=1)  Output: id, info, crt_time  -- 这条就是heap scan的时候的recheck  Recheck Cond: ((a.id > 10) AND (a.id < 10000))  Heap Blocks: exact=94  Buffers: shared hit=97 read=27  ->  Bitmap Index Scan on a_pkey  (cost=0.00..130.34 rows=9750 width=0) (actual time=1.060..1.060 rows=9989 loops=1)  Index Cond: ((a.id > 10) AND (a.id < 10000))  Buffers: shared hit=3 read=27  Planning time: 0.148 ms  Execution time: 4.276 ms
(13 rows)

同样的SQL，使用index scan就不需要recheck。

postgres=# explain (analyze,verbose,timing,costs,buffers) select count(*) from a where id>10 and id<10000;  QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------  Aggregate  (cost=344.41..344.42 rows=1 width=8) (actual time=3.493..3.493 rows=1 loops=1)  Output: count(*)  Buffers: shared hit=124  ->  Index Scan using a_pkey on public.a  (cost=0.43..320.04 rows=9750 width=0) (actual time=0.020..2.280 rows=9989 loops=1)  Output: id, info, crt_time  Index Cond: ((a.id > 10) AND (a.id < 10000))  Buffers: shared hit=124  Planning time: 0.156 ms  Execution time: 3.528 ms
(9 rows)

例子2 - 大字段recheck带来的性能问题

大字段recheck是很危险的，特别是命中的数据量所在的字段占用空间非常庞大时。

例如一个100万的表，但是其中一个JSON字段是1MB，那么如果这个JSON字段需要被recheck的话，即使涉及的数据只有1000条，也需要1GB的扫描量。

postgres=# create table t_big(id int, info tsvector);
CREATE TABLE  create or replace function gen_rand_tsvector(int,int) returns tsvector as $$    select array_to_tsvector(array_agg((random()*$1)::int::text)) from generate_series(1,$2);
$$ language sql strict;   insert into t_big select 1 , gen_rand_tsvector(15000, 20000) from generate_series(1,100000);

postgres=# create index idx_t_big on t_big using gin (info);
CREATE INDEX

下面的QUERY复合条件的数据块太多，需要扫描大字段，导致了超长时间的查询。

postgres=#  explain (analyze,verbose,timing,costs,buffers) select id from t_big where info @@ to_tsquery('! -1');  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------  Bitmap Heap Scan on public.t_big  (cost=234779.77..261535.52 rows=99500 width=4) (actual time=6445.766..6460.928 rows=100000 loops=1)  Output: id  Recheck Cond: (t_big.info @@ to_tsquery('! -1'::text))  Heap Blocks: exact=637  Buffers: shared hit=165720  ->  Bitmap Index Scan on idx_t_big  (cost=0.00..234754.90 rows=99500 width=0) (actual time=6445.690..6445.690 rows=100000 loops=1)  Index Cond: (t_big.info @@ to_tsquery('! -1'::text))  Buffers: shared hit=165083  Planning time: 0.166 ms  Execution time: 6468.692 ms
(10 rows)

后面谈优化方法。

例子3 - lossy index 引入的recheck

brin索引, bloom索引都属于lossy索引，索引本身决定了在输入条件后，只能将数据缩小到一个范围，然后再通过recheck来决定tuple是否符合条件。

create table t_brin (id int, info text);
insert into t_brin select generate_series(1,10000000), 'test';  create index idx_t_brin on t_brin using brin(id);  explain (analyze,verbose,timing,costs,buffers) select count(*) from t_brin where id between 1 and 100;

postgres=# explain (analyze,verbose,timing,costs,buffers) select count(*) from t_brin where id between 1 and 100;  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------  Aggregate  (cost=21314.01..21314.02 rows=1 width=8) (actual time=2.685..2.685 rows=1 loops=1)  Output: count(*)  Buffers: shared hit=137  ->  Bitmap Heap Scan on public.t_brin  (cost=4.83..21314.01 rows=1 width=0) (actual time=0.165..2.669 rows=100 loops=1)  Output: id, info  -- 这里出现了recheck  Recheck Cond: ((t_brin.id >= 1) AND (t_brin.id <= 100))  Rows Removed by Index Recheck: 23580  Heap Blocks: lossy=128  Buffers: shared hit=137  ->  Bitmap Index Scan on idx_t_brin  (cost=0.00..4.83 rows=23641 width=0) (actual time=0.148..0.148 rows=1280 loops=1)  Index Cond: ((t_brin.id >= 1) AND (t_brin.id <= 100))  Buffers: shared hit=9  Planning time: 0.211 ms  Execution time: 2.739 ms
(14 rows)

切片存储-TOAST

这里为什么要提到切片存储，因为我接下来的优化方法，如果没有切片存储，那么就不成立。

切片存储指对超过BLOCK 1/4大小的变长记录，存储到切片中，而在TUPLE中仅仅使用TOAST 指针来引用，从而减少TUPLE本身的大小。

那么扫描全表的行号时，实际上，如果记录数本身不多，只是列比较大时，实际上全表扫描取行号很快。

如何优化大字段 recheck带来的问题

前面举的例子，当字段很大时，recheck很耗费资源。

假设满足条件的数据太多，导致recheck很久。应该如何优化呢？

select * from tbl where id not in (....);  select * from tbl where ts @@ to_tsquery('! china');

既然这些条件的结果很多，那么说明相反的结果就很少，这些条件相反的条件来查询（走index scan，避免recheck），另外再使用全量数据(全表扫，不recheck，所以不需要扫描大字段)，except来排他。

except方法优化

1、找出所有符合条件的数据的ID或行号

2、找出（满足条件）相反的ID或行号

小心去重（加上PK，或行号，避免去重）

例子，我们使用tsvector存储了大量的文本向量信息，每个字段1MB，有几百万记录，然后有一个NOT IN的查询，这个查询筛选得到的结果过多。

由于tsvector的索引GIN扫描用到了bitmapscan，当记录数很多时，需要RECHECK的记录就越多，并且这部分是空间占用的大坨。

最终会导致搜索很慢。

优化方法：

既然按所需条件复合条件的记录过多，导致RECHECK部分的时间变慢，那么我们可以使用反条件，减少复合条件的记录减少RECHECK的耗时。

同时使用全表扫得到CTID或PK（因为全表扫不需要扫TOAST），然后再EXCEPT它。最后根据PK或CTID得到索要的记录。

explain (analyze,verbose,timing,costs,buffers)
select id from t_big where ctid = any (                 -- 通过行号来定位使用except得到的符合条件的记录
array
(  select ctid from t_big   -- 这个是全表的ctid(行号)  except   select ctid from t_big where info @@ to_tsquery('-1')    -- 这里就是反条件得到的ctid(行号)
)
);

用t_big那个例子，性能飙升

                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------  Tid Scan on public.t_big  (cost=12023.32..12035.42 rows=10 width=4) (actual time=105.696..129.313 rows=100000 loops=1)  Output: t_big.id  TID Cond: (t_big.ctid = ANY ($0))  Buffers: shared hit=100640  InitPlan 1 (returns $0)  ->  SetOp Except  (cost=11520.81..12023.31 rows=100000 width=10) (actual time=62.363..93.300 rows=100000 loops=1)  Output: "*SELECT* 1".ctid, (0)  Buffers: shared hit=640  ->  Sort  (cost=11520.81..11772.06 rows=100500 width=10) (actual time=62.359..70.542 rows=100000 loops=1)  Output: "*SELECT* 1".ctid, (0)  Sort Key: "*SELECT* 1".ctid  Sort Method: quicksort  Memory: 7760kB  Buffers: shared hit=640  ->  Append  (cost=0.00..3170.85 rows=100500 width=10) (actual time=0.012..46.268 rows=100000 loops=1)  Buffers: shared hit=640  ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..2637.00 rows=100000 width=10) (actual time=0.012..30.602 rows=100000 loops=1)  Output: "*SELECT* 1".ctid, 0  Buffers: shared hit=637  ->  Seq Scan on public.t_big t_big_1  (cost=0.00..1637.00 rows=100000 width=6) (actual time=0.011..13.605 rows=100000 loops=1)  Output: t_big_1.ctid  Buffers: shared hit=637  ->  Subquery Scan on "*SELECT* 2"  (cost=19.73..533.85 rows=500 width=10) (actual time=0.047..0.047 rows=0 loops=1)  Output: "*SELECT* 2".ctid, 1  Buffers: shared hit=3  ->  Bitmap Heap Scan on public.t_big t_big_2  (cost=19.73..528.85 rows=500 width=6) (actual time=0.045..0.045 rows=0 loops=1)  Output: t_big_2.ctid  Recheck Cond: (t_big_2.info @@ to_tsquery('-1'::text))  Buffers: shared hit=3  ->  Bitmap Index Scan on idx_t_big  (cost=0.00..19.60 rows=500 width=0) (actual time=0.042..0.042 rows=0 loops=1)  Index Cond: (t_big_2.info @@ to_tsquery('-1'::text))  Buffers: shared hit=3  Planning time: 0.239 ms  Execution time: 137.356 ms
(33 rows)

小结

本文讲述了recheck的原理，以及在什么情况下RECHECK可能引发较大性能问题。如何避免等方法。

参考

《全文检索不包含优化 - 阿里云RDS PostgreSQL最佳实践》

《HTAP数据库 PostgreSQL 场景与性能测试之 26 - (OLTP) NOT IN、NOT EXISTS 查询》

《PostgreSQL 空间切割(st_split)功能扩展 - 空间对象网格化 (多边形GiST优化)》

《PostgreSQL 空间st_contains，st_within空间包含搜索优化 - 降IO和降CPU(bound box) (多边形GiST优化)》