多维数据查询效率分析（1）

有时需求需要我们把系统做成灵活的。最常见的形式是，属性不能是固定的，要用户可以自定义。这样的需求往往会在数据库中建模成一个一对多的关系。

create table person {

...　　

}

create table person_attribute {

person_id ...

attribute_name ...

attribute_value ...

...

}

这样的建模在没有查询需要的时候，还是蛮不错的。但是一旦需要对扩充的属性值进行查询，速度往往惨不忍睹。曾经在新加坡做过一个电信的遗留系统的前端，其数据库的建模就是这样的。对于中间的属性表，一个简单的查询都需要join好几次，速度非常慢。好在那次只是做ETL，并不是直接把这样的数据库做后端，要不然肯定死的很惨。当时的做法是把所有的数据读入到内存中，针对属性的查询用内存集合遍历来实现。这样做的前提是集合的元素数量非常少（几百而已），总数据量也非常少。但是如果我们需要处理的数据量非常大，那么我们就必须在数据库中能够对多维数据进行高效查询。

为了搞清楚这个问题，我们需要做一系列实验。先来介绍一下我们实验的对象。

假定我们有一张contacts表，然后对每个联系人有一个contact_categories的表，简称cc

create table cc(  contact_id integer,  cad_id integer,  value integer);

cad_id代表字段的id，value是cateogry的值。数据量是500万。实验用的数据库分别是PostgreSQL（原始数据大小250M），和使用MYISAM引擎的MySQL（原始数据大小73M）。使用的磁盘是普通的笔记本硬盘，没有raid，普通的ext4分区，峰值传输率大概是70M/s。假设没有资源的争抢，而且数据库总是以最快的顺序读的方式从磁盘中加载数据，那么PostgreSQL得用4s，而MySQL也需要1s才能把所有的磁盘内容读到内存中。

所以如果我们使用基于磁盘的解决方案的话，无论如何也无法把查询压缩到1s以内。因为我们需要提供一个Reponsive的界面前端，所以数据必须能够在内存中被查询，可能不是所有的数据都能放入内存，但是最起码被查询到的数据得一直在内存中。最简单的办法把数据库移到内存中的方式不是改数据库的设置，而是直接把内存映射成文件夹：

sudo mount -t ramfs -o size=200000m ramfs /mnt/memory

然后把数据库的数据目录移动到/mnt/memory之中。但是数据库的缓存设置也是必须修改的，比如PostgreSQL的work_mem如果设置过小的话，在做对一个很大的表做count(distinct xxx)时就会导致中间结果被写入到临时表之中。所以我们还是把所有的缓存搞大一些吧。

work_mem = 1000MBshared_buffer = 1000MBtemp_buffer = 1000MBeffective_cache_size = 1000MBwal_buffers = 1000MBauto_vacuum = off

让我们来看看最基本的一个查询能有多快吧

taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from cc;'  count  --------- 5904385(1 row)

real    0m0.448suser    0m0.024ssys    0m0.000s

count(*)和count(contact_id)是一样的么，让我们来试验一下

taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc;'  count  --------- 5904385(1 row)

real    0m0.537suser    0m0.016ssys    0m0.008s

有意思！居然比count(*)还要慢。可能是因为我们没有给contact_id字段加索引的原因。好吧，加上索引。

taowen@dmright-perf:~$ time psql postgres -c 'create index contact_id_idx on cc(contact_id);'CREATE INDEX

real    0m4.848suser    0m0.016ssys    0m0.008staowen@dmright-perf:~$ time psql postgres -c 'analyze cc;'ANALYZE

real    0m0.197suser    0m0.024ssys    0m0.004s

再试试看

taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc;'  count  --------- 5904385(1 row)

real    0m0.534suser    0m0.020ssys    0m0.008s

taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from cc;'  count  --------- 5904385(1 row)

real    0m0.447suser    0m0.028ssys    0m0.008s

基本上没有变化……count(*)貌似就是比count(contact_id)要快。不管啦，既然小于500ms，也算是够快了。让我们给查询加上个条件吧。

taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from cc where cad_id = 101 and value = 5;' count  -------- 998839(1 row)

real    0m0.686suser    0m0.024ssys    0m0.000s

taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc where cad_id = 101 and value = 5;' count  -------- 998839(1 row)

real    0m0.660suser    0m0.024ssys    0m0.000s

我们可以看到，在count(*)和count(contact_id)之间没有特别大的区别。但是我们还没有给cad_id和value加索引，所以让我们加上看看如何

taowen@dmright-perf:~$ time psql postgres -c 'create index cad_id_value_idx on cc(cad_id, value);'CREATE INDEX

real    0m10.069suser    0m0.020ssys    0m0.008s

taowen@dmright-perf:~$ time psql postgres -c 'analyze cc;'ANALYZE

real    0m0.199suser    0m0.016ssys    0m0.012s

and try again.

taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc where cad_id = 101 and value = 5;' count  -------- 998839(1 row)

real    0m0.283suser    0m0.020ssys    0m0.012s

真是很快很快耶！让我们来回顾一下，我们现在对于contact_id和(cad_id, value)都建立了索引。
接下来我们把查询弄得更复杂一些，加上AND条件。有三种可能的方式：
1、INTERSECT
2、INNER JOIN
3、IN + SUB QUERY
我们每种做法都试试

taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7) as temp' count  -------- 164788(1 row)

real    0m1.159suser    0m0.028ssys    0m0.008s

taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7) as temp' count  -------- 164788(1 row)

real    0m1.148suser    0m0.032ssys    0m0.000s

显而易见，这么做很慢。那么INNER JOIN是不是更快一些呢？

taowen@dmright-perf:~$ time psql postgres -c 'select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id and a2.cad_id = 102 and a2.value = 7 where a1.cad_id = 101 and a1.value = 5;' count  -------- 164788(1 row)

real    0m1.162suser    0m0.036ssys    0m0.000s

这大概要慢上个200ms了。那么IN + SUB QUERY呢？

taowen@dmright-perf:~$ time psql postgres -c 'select count(a1.contact_id) from cc as a1 where a1.cad_id = 101 and a1.value = 5 and a1.contact_id in (select contact_id from cc as a2 where a2.cad_id = 102 and a2.value = 7)' count  -------- 164788(1 row)

real    0m2.645suser    0m0.024ssys    0m0.004s

靠，居然更慢了。总结就是，在有两个条件的情况下，INTERSECT似乎是最快的。但是即便如此，它也超过了一秒钟了。为什么会这样呢？时间都花哪里去了？

taowen@dmright-perf:~$ time psql postgres -c 'explain analyze select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7) as temp'                                                                              QUERY PLAN                                                                               ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate  (cost=107853.91..107853.92 rows=1 width=4) (actual time=1471.907..1471.907 rows=1 loops=1)   ->  Subquery Scan on temp  (cost=7467.63..107375.65 rows=191301 width=4) (actual time=1366.131..1459.419 rows=164788 loops=1)         ->  HashSetOp Intersect  (cost=7467.63..105462.64 rows=191301 width=4) (actual time=1366.129..1439.781 rows=164788 loops=1)               ->  Append  (cost=7467.63..103326.69 rows=854380 width=4) (actual time=69.436..797.478 rows=1829408 loops=1)                     ->  Subquery Scan on "*SELECT* 2"  (cost=7467.63..48180.53 rows=351876 width=4) (actual time=69.435..350.031 rows=830569 loops=1)                           ->  Bitmap Heap Scan on cc  (cost=7467.63..44661.77 rows=351876 width=4) (actual time=69.434..264.538 rows=830569 loops=1)                                 Recheck Cond: ((cad_id = 102) AND (value = 7))                                 ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7379.66 rows=351876 width=0) (actual time=64.162..64.162 rows=830569 loops=1)                                       Index Cond: ((cad_id = 102) AND (value = 7))                     ->  Subquery Scan on "*SELECT* 1"  (cost=10667.56..55146.16 rows=502504 width=4) (actual time=69.846..331.544 rows=998839 loops=1)                           ->  Bitmap Heap Scan on cc  (cost=10667.56..50121.12 rows=502504 width=4) (actual time=69.845..233.794 rows=998839 loops=1)                                 Recheck Cond: ((cad_id = 101) AND (value = 5))                                 ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..10541.94 rows=502504 width=0) (actual time=64.501..64.501 rows=998839 loops=1)                                       Index Cond: ((cad_id = 101) AND (value = 5)) Total runtime: 1477.484 ms(15 rows)

首先，INTERSECT不是并行执行的。两个子查询分别花费了300ms以上的时间，加起来有800ms是用在搜集contact_id上了。其次，集合之间的并集操作花费了600多ms。其余的时间都花在了数集合的成员个数上了。有一点值得注意的是，根据 http://postgresql.1045698.n5.nabble.com/ANTI-JOIN-needs-table-index-scan-not-possible-td3425340.html index scan和heap scan实际上都用上了索引。然而，SELECT还是在其之上额外耗费了100ms，我猜测它可能是回到原始的表结构中把行取出来，以获得contact_id的值。这在磁盘上的话速度会更慢，因为会是random seek操作。
看完了INTERSECT，让我们再来分析分析Join：

taowen@dmright-perf:~$ time psql postgres -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id and a2.cad_id = 102 and a2.value = 7 where a1.cad_id = 101 and a1.value = 5;'                                                                        QUERY PLAN                                                                         ----------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate  (cost=180366.31..180366.32 rows=1 width=4) (actual time=1280.964..1280.964 rows=1 loops=1)   ->  Merge Join  (cost=174783.00..180104.02 rows=104912 width=4) (actual time=1043.879..1270.197 rows=164788 loops=1)         Merge Cond: (a1.contact_id = a2.contact_id)         ->  Sort  (cost=97705.18..98961.44 rows=502504 width=4) (actual time=586.735..626.292 rows=998839 loops=1)Sort Key: a1.contact_idSort Method:  quicksort  Memory: 71397kB               ->  Bitmap Heap Scan on cc a1  (cost=10667.56..50121.12 rows=502504 width=4) (actual time=75.377..257.403 rows=998839 loops=1)                     Recheck Cond: ((cad_id = 101) AND (value = 5))                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..10541.94 rows=502504 width=0) (actual time=69.565..69.565 rows=998839 loops=1)                           Index Cond: ((cad_id = 101) AND (value = 5))         ->  Sort  (cost=77077.83..77957.52 rows=351876 width=4) (actual time=457.131..492.428 rows=830569 loops=1)Sort Key: a2.contact_idSort Method:  quicksort  Memory: 63509kB               ->  Bitmap Heap Scan on cc a2  (cost=7467.63..44661.77 rows=351876 width=4) (actual time=60.039..186.422 rows=830569 loops=1)                     Recheck Cond: ((cad_id = 102) AND (value = 7))                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7379.66 rows=351876 width=0) (actual time=54.929..54.929 rows=830569 loops=1)                           Index Cond: ((cad_id = 102) AND (value = 7)) Total runtime: 1286.846 ms(18 rows)

貌似大部分时间都花在了排序上。而且它也不是并行执行的。如果后台同时执行top命令的话，就会发现只有一个核是被实际占用着的。
要是再创建更多的索引呢？会不会有帮助？

taowen@dmright-perf:~$ time psql postgres -c 'create index cad_id_value_contact_id on cc(cad_id, value, contact_id);'CREATE INDEX

real    0m10.683suser    0m0.020ssys    0m0.004staowen@dmright-perf:~$ time psql postgres -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id and a2.cad_id = 102 and a2.value = 7 where a1.cad_id = 101 and a1.value = 5;'                                                                        QUERY PLAN                                                                         ----------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate  (cost=180366.31..180366.32 rows=1 width=4) (actual time=1338.478..1338.478 rows=1 loops=1)   ->  Merge Join  (cost=174783.00..180104.02 rows=104912 width=4) (actual time=1090.136..1327.312 rows=164788 loops=1)         Merge Cond: (a1.contact_id = a2.contact_id)         ->  Sort  (cost=97705.18..98961.44 rows=502504 width=4) (actual time=615.843..657.438 rows=998839 loops=1)Sort Key: a1.contact_idSort Method:  quicksort  Memory: 71397kB               ->  Bitmap Heap Scan on cc a1  (cost=10667.56..50121.12 rows=502504 width=4) (actual time=80.926..275.431 rows=998839 loops=1)                     Recheck Cond: ((cad_id = 101) AND (value = 5))                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..10541.94 rows=502504 width=0) (actual time=75.816..75.816 rows=998839 loops=1)                           Index Cond: ((cad_id = 101) AND (value = 5))         ->  Sort  (cost=77077.83..77957.52 rows=351876 width=4) (actual time=474.279..510.866 rows=830569 loops=1)Sort Key: a2.contact_idSort Method:  quicksort  Memory: 63509kB               ->  Bitmap Heap Scan on cc a2  (cost=7467.63..44661.77 rows=351876 width=4) (actual time=65.335..198.655 rows=830569 loops=1)                     Recheck Cond: ((cad_id = 102) AND (value = 7))                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7379.66 rows=351876 width=0) (actual time=60.314..60.314 rows=830569 loops=1)                           Index Cond: ((cad_id = 102) AND (value = 7)) Total runtime: 1346.587 ms(18 rows)

还是不行！Merge Join慢的话，升级到9.1然后强制使用hash join会不会好一点？

postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id where a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5';                                                                        QUERY PLAN                                                                        ---------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate  (cost=902356.85..902356.86 rows=1 width=4) (actual time=1693.137..1693.137 rows=1 loops=1)   ->  Hash Join  (cost=59599.78..902135.88 rows=88389 width=4) (actual time=461.788..1682.718 rows=164788 loops=1)         Hash Cond: (a1.contact_id = a2.contact_id)         ->  Bitmap Heap Scan on cc a1  (cost=10833.24..50406.32 rows=510472 width=4) (actual time=76.337..224.571 rows=998839 loops=1)               Recheck Cond: ((cad_id = 101) AND (value = 5))               ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..10705.62 rows=510472 width=0) (actual time=71.028..71.028 rows=998839 loops=1)                     Index Cond: ((cad_id = 101) AND (value = 5))         ->  Hash  (cost=44444.15..44444.15 rows=345792 width=4) (actual time=385.256..385.256 rows=830569 loops=1)               Buckets: 65536  Batches: 1  Memory Usage: 29200kB               ->  Bitmap Heap Scan on cc a2  (cost=7341.27..44444.15 rows=345792 width=4) (actual time=64.778..258.059 rows=830569 loops=1)                     Recheck Cond: ((cad_id = 102) AND (value = 7))                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7254.82 rows=345792 width=0) (actual time=59.675..59.675 rows=830569 loops=1)                           Index Cond: ((cad_id = 102) AND (value = 7)) Total runtime: 1698.207 ms(14 rows)

有两个原因造成这样的情况：
1、两个条件意味着扫描两次，而且不是并行扫描
2、join自身很耗费时间，无论是sort merge join还是hash join

两个条件都这熊样了，三个条件呢？看看吧，先上INTERSECT：

postgres@dmright-perf:/mnt/memory/pg$ time psql postgres -c 'select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7 intersect select contact_id from cc where cad_id = 6 and value = 1) as temp' count -------  6748(1 row)

real    0m1.350suser    0m0.020ssys    0m0.008s

然后是INNER JOIN

postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id inner join cc as a3 on a1.contact_id = a3.contact_id where a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5 and a3.cad_id = 6 and a3.value = 1'; count -------  6748(1 row)

real    0m0.756suser    0m0.028ssys    0m0.000s

然后是IN + SUBQUERY：

postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'select count(a1.contact_id) from cc as a1 where a1.cad_id = 101 and a1.value = 5 and a1.contact_id in (select contact_id from cc as a2 where a2.cad_id = 102 and a2.value = 7 and a2.contact_id in (select contact_id from cc as a3 where a3.cad_id = 6 and a3.value = 1))'; count -------  6748(1 row)

real    0m7.320suser    0m0.024ssys    0m0.004s

我们可以看到这回INNER JOIN是最快的了。为什么？

postgres@dmright-perf:/mnt/memory/pg$ time psql postgres -c 'explain analyze select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7 intersect select contact_id from cc where cad_id = 6 and value = 1) as temp'                                                                                       QUERY PLAN                                                                                        ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate  (cost=132357.21..132357.22 rows=1 width=4) (actual time=1659.666..1659.666 rows=1 loops=1)   ->  Subquery Scan on temp  (cost=306.85..132335.71 rows=8602 width=4) (actual time=1648.818..1659.243 rows=6748 loops=1)         ->  HashSetOp Intersect  (cost=306.85..132249.69 rows=8602 width=4) (actual time=1648.817..1658.543 rows=6748 loops=1)               ->  Append  (cost=306.85..131691.89 rows=223118 width=4) (actual time=33.797..1547.811 rows=372016 loops=1)                     ->  Subquery Scan on "*SELECT* 3"  (cost=306.85..26138.13 rows=14239 width=4) (actual time=33.797..153.230 rows=207228 loops=1)                           ->  Bitmap Heap Scan on cc  (cost=306.85..25995.74 rows=14239 width=4) (actual time=33.795..134.292 rows=207228 loops=1)                                 Recheck Cond: ((cad_id = 6) AND (value = 1))                                 ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..303.29 rows=14239 width=0) (actual time=28.558..28.558 rows=207228 loops=1)                                       Index Cond: ((cad_id = 6) AND (value = 1))                     ->  Result  (cost=7341.27..105553.76 rows=208879 width=4) (actual time=1282.426..1371.255 rows=164788 loops=1)                           ->  HashSetOp Intersect  (cost=7341.27..105553.76 rows=208879 width=4) (actual time=1282.418..1353.422 rows=164788 loops=1)                                 ->  Append  (cost=7341.27..103413.10 rows=856264 width=4) (actual time=57.550..734.282 rows=1829408 loops=1)                                       ->  Subquery Scan on "*SELECT* 2"  (cost=7341.27..47902.07 rows=345792 width=4) (actual time=57.550..292.595 rows=830569 loops=1)                                             ->  Bitmap Heap Scan on cc  (cost=7341.27..44444.15 rows=345792 width=4) (actual time=57.549..218.162 rows=830569 loops=1)                                                   Recheck Cond: ((cad_id = 102) AND (value = 7))                                                   ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7254.82 rows=345792 width=0) (actual time=52.606..52.606 rows=830569 loops=1)                                                         Index Cond: ((cad_id = 102) AND (value = 7))                                       ->  Subquery Scan on "*SELECT* 1"  (cost=10833.24..55511.04 rows=510472 width=4) (actual time=69.129..330.737 rows=998839 loops=1)                                             ->  Bitmap Heap Scan on cc  (cost=10833.24..50406.32 rows=510472 width=4) (actual time=69.128..242.416 rows=998839 loops=1)                                                   Recheck Cond: ((cad_id = 101) AND (value = 5))                                                   ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..10705.62 rows=510472 width=0) (actual time=64.161..64.161 rows=998839 loops=1)                                                         Index Cond: ((cad_id = 101) AND (value = 5)) Total runtime: 1665.691 ms(23 rows)

对于INTERSECT来说，过程和两个条件是差不多的，只是集合更大一些罢了。

postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id inner join cc as a3 on a1.contact_id = a3.contact_id where a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5 and a3.cad_id = 6 and a3.value = 1';                                                                           QUERY PLAN                                                                           ---------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate  (cost=118979.47..118979.48 rows=1 width=4) (actual time=771.393..771.393 rows=1 loops=1)   ->  Nested Loop  (cost=49073.39..118977.89 rows=630 width=4) (actual time=399.835..770.587 rows=6748 loops=1)         ->  Hash Join  (cost=49073.39..97159.97 rows=2466 width=8) (actual time=399.814..691.519 rows=34219 loops=1)               Hash Cond: (a3.contact_id = a2.contact_id)               ->  Bitmap Heap Scan on cc a3  (cost=306.85..25995.74 rows=14239 width=4) (actual time=25.655..83.526 rows=207228 loops=1)                     Recheck Cond: ((cad_id = 6) AND (value = 1))                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..303.29 rows=14239 width=0) (actual time=20.570..20.570 rows=207228 loops=1)                           Index Cond: ((cad_id = 6) AND (value = 1))               ->  Hash  (cost=44444.15..44444.15 rows=345792 width=4) (actual time=373.969..373.969 rows=830569 loops=1)                     Buckets: 65536  Batches: 1  Memory Usage: 29200kB                     ->  Bitmap Heap Scan on cc a2  (cost=7341.27..44444.15 rows=345792 width=4) (actual time=59.271..250.932 rows=830569 loops=1)                           Recheck Cond: ((cad_id = 102) AND (value = 7))                           ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7254.82 rows=345792 width=0) (actual time=54.030..54.030 rows=830569 loops=1)                                 Index Cond: ((cad_id = 102) AND (value = 7))         ->  Index Scan using cad_id_value_contact_id_idx on cc a1  (cost=0.00..8.83 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=34219)               Index Cond: ((cad_id = 101) AND (value = 5) AND (contact_id = a2.contact_id)) Total runtime: 774.588 ms(17 rows)

对于INNER JOIN，索引都被利用上了。它不需要取得contact_id然后再来做集合操作。结论是对rowid做hash操作比集合操作更快。
不显示用INNER JOIN，让Planner决定Join顺序也是一样的：

postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'explain analyze select count(a1.contact_id) from cc as a1, cc as a2, cc as a3 where a1.contact_id = a2.contact_id and a2.contact_id = a3.contact_id and a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5 and a3.cad_id = 6 and a3.value = 1';                                                                           QUERY PLAN                                                                           ---------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate  (cost=118979.47..118979.48 rows=1 width=4) (actual time=762.969..762.970 rows=1 loops=1)   ->  Nested Loop  (cost=49073.39..118977.89 rows=630 width=4) (actual time=398.554..762.206 rows=6748 loops=1)         ->  Hash Join  (cost=49073.39..97159.97 rows=2466 width=8) (actual time=398.531..684.425 rows=34219 loops=1)               Hash Cond: (a3.contact_id = a2.contact_id)               ->  Bitmap Heap Scan on cc a3  (cost=306.85..25995.74 rows=14239 width=4) (actual time=34.802..91.672 rows=207228 loops=1)                     Recheck Cond: ((cad_id = 6) AND (value = 1))                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..303.29 rows=14239 width=0) (actual time=29.524..29.524 rows=207228 loops=1)                           Index Cond: ((cad_id = 6) AND (value = 1))               ->  Hash  (cost=44444.15..44444.15 rows=345792 width=4) (actual time=363.537..363.537 rows=830569 loops=1)                     Buckets: 65536  Batches: 1  Memory Usage: 29200kB                     ->  Bitmap Heap Scan on cc a2  (cost=7341.27..44444.15 rows=345792 width=4) (actual time=57.799..245.467 rows=830569 loops=1)                           Recheck Cond: ((cad_id = 102) AND (value = 7))                           ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7254.82 rows=345792 width=0) (actual time=52.349..52.349 rows=830569 loops=1)                                 Index Cond: ((cad_id = 102) AND (value = 7))         ->  Index Scan using cad_id_value_contact_id_idx on cc a1  (cost=0.00..8.83 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=34219)               Index Cond: ((cad_id = 101) AND (value = 5) AND (contact_id = a2.contact_id)) Total runtime: 766.107 ms(17 rows)

结果是一样的。

基本上对PostgreSQL的实验就到这里了，差不多也就这样了，提高空间不大。Google之后发现，PostgreSQL使用的MVCC机制导致其甚至在做SELECT COUNT这样的操作的时候也会去更新hint bit。也许这就是其慢的重要原因。MySQL的MYISAM引擎以完全不负责事务和著称，应该会比PostgreSQL有更大的提升，下一篇中，我们将对MySQL重复同样的实验。

转载于:https://www.cnblogs.com/taowen/archive/2012/02/26/2367724.html

多维数据查询效率分析（1）相关推荐

以太坊数据查询与分析工具：以太坊浏览器Etherscan、ETHERQL
第八章文章目录第八章一.以太坊浏览器Etherscan 1.Ethersan的基本功能 2.其他功能 3.API 4.ENS域名查询二.ETHERQL 1.同步管理器 2.处理程序链 3.持久 ...
基于springboot项目中使用docker-compose+es+kibana+logstash+mysql 提高数据查询效率
基于springboot项目中使用docker-compose+es+kibana+logstash+mysql 提高数据查询效率 1.拉取logstash,kibana,es,mysql镜像 #命令 ...
阿里云图数据库GDB公测，高度连接数据查询效率提升10倍
3月21日,阿里云图数据库GDB正式公测,用于处理高度连接数据查询与存储,支持属性图模型,并且全面兼容TinkerPop Gremlin 查询语言. 图数据库覆盖的应用场景非常广泛,比较典型的有社交网 ...
Mysql 多表联合查询效率分析及优化
1. 多表连接类型 1. 笛卡尔积(交叉连接) 在MySQL中可以为CROSS JOIN或者省略CROSS即JOIN,或者使用',' 如: [sql] view plaincopy print? S ...
mysql 多表查询优化_Mysql 多表联合查询效率分析及优化
1. 多表连接类型 1. 笛卡尔积(交叉连接)在MySQL中可以为CROSS JOIN或者省略CROSS即JOIN,或者使用',' 如: SELECT * FROM table1 CROSS JOI ...
mysql多表性能_Mysql 多表联合查询效率分析及优化
1. 多表连接类型 1. 笛卡尔积(交叉连接)在MySQL中可以为CROSS JOIN或者省略CROSS即JOIN,或者使用',' 如: SELECT*FROMtable1CROSSJOINtabl ...
mysql 多表查询慢_详解Mysql多表联合查询效率分析及优化
1. 多表连接类型1. 笛卡尔积(交叉连接) 在MySQL中可以为CROSS JOIN或者省略CROSS即JOIN,或者使用',' 如: SELECT * FROM table1 CROSS JOI ...
oracle sql语句执行顺序、查询效率分析、步骤解析
语句执行顺序 select –5 from –1 where –2 group by –3 having –4 order by –6 查询语句中select from where group by ...
mysql 多表联查的效率_Mysql 多表联合查询效率分析
1 多表连接类型 1. 笛卡尔积(交叉连接) 在MySQL中可以为CROSS JOIN或者省略CROSS即JOIN,或者使用',' 如: SELECT * FROM table1 CROSS JOIN ...

多维数据查询效率分析（1）

多维数据查询效率分析（1）相关推荐

最新文章

热门文章