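Sometimes requirements call for a flexible system. The most common form is that attributes cannot be fixed and users must be able to define their own. Such requirements usually end up modeled in the database as a one-to-many relationship.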

create table person (
  ...
);

create table person_attribute (
  person_id ...,
  attribute_name ...,
  attribute_value ...,
  ...
);

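This kind of modeling works quite well as long as nothing needs to be queried. But once you have to query on the extended attribute values, performance is usually dreadful. I once built the front end for a legacy telecom system in Singapore whose database was modeled exactly this way. Against the attribute table in the middle, even a simple query needed several joins and was very slow. Fortunately that job was only ETL and did not use the database directly as a back end, or it would have ended badly. What we did then was read all the data into memory and implement attribute queries by iterating over in-memory collections. That only works when the collections are tiny (a few hundred elements) and the total data volume is small too. If the data volume is large, we need to be able to query multi-dimensional data efficiently inside the database.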
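
The in-memory approach described above can be sketched roughly as follows. This is only an illustration in Python, not the code of the original ETL job: the sample data, the attribute names and the helper find_people are all made up for the example.

```python
# Hypothetical sample data: person_id -> {attribute_name: attribute_value}.
# In the real job this would have been loaded from person_attribute.
people = {
    1: {"city": "Singapore", "plan": "prepaid"},
    2: {"city": "Singapore", "plan": "postpaid"},
    3: {"city": "Beijing", "plan": "prepaid"},
}


def find_people(people, **conditions):
    """Return the sorted ids of people whose attributes match every condition.

    This is a full scan over the collection, which is only acceptable
    because the collection is assumed to be tiny.
    """
    return sorted(
        person_id
        for person_id, attrs in people.items()
        if all(attrs.get(name) == value for name, value in conditions.items())
    )


matches = find_people(people, city="Singapore", plan="prepaid")
print(matches)
```

Every query is a linear scan, which is why this stops being an option once the data no longer fits comfortably in one process.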

To get to the bottom of this, we need to run a series of experiments. First, the subject of the experiments.

Suppose we have a contacts table, and for each contact a contact_categories table, cc for short:

create table cc(  contact_id integer,  cad_id integer,  value integer);

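cad_id is the id of the field, and value is the category value. The table holds roughly 5 million rows (5,904,385, as the counts below show). The databases used are PostgreSQL (raw data size 250M) and MySQL with the MyISAM engine (raw data size 73M). The disk is an ordinary laptop drive, no RAID, a plain ext4 partition, with a peak transfer rate of about 70M/s. Assuming no resource contention and that the database always loads data from disk with the fastest possible sequential reads, PostgreSQL needs about 4s and MySQL about 1s just to read everything from disk into memory.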
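
The read-time estimate is simple division. A small Python sketch of the arithmetic, using only the sizes and the peak transfer rate quoted above (the function name is made up for the example):

```python
def sequential_read_seconds(size_mb, throughput_mb_per_s):
    """Best-case time to read size_mb from disk at a constant sequential rate."""
    return size_mb / throughput_mb_per_s


PEAK_MB_PER_S = 70  # peak transfer rate of the laptop disk quoted above

pg_seconds = sequential_read_seconds(250, PEAK_MB_PER_S)    # PostgreSQL raw size
mysql_seconds = sequential_read_seconds(73, PEAK_MB_PER_S)  # MyISAM raw size

print(f"PostgreSQL: {pg_seconds:.1f}s, MySQL: {mysql_seconds:.1f}s")
```

These are lower bounds: any seek, contention or non-sequential access only makes the real numbers worse.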

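So with a disk-based solution there is no way to squeeze a query under 1s. Since we need to provide a responsive front end, the data must be queryable in memory. Not all of it may fit, but at least the data being queried has to stay resident. The simplest way to move the database into memory is not to change database settings but to mount memory as a directory: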

sudo mount -t ramfs -o size=200000m ramfs /mnt/memory

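Then move the database's data directory into /mnt/memory. The database's cache settings still have to be changed, though. For example, if PostgreSQL's work_mem is too small, a count(distinct xxx) over a large table spills its intermediate results to temporary files. So let's simply make all the caches bigger: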

work_mem = 1000MB
shared_buffers = 1000MB
temp_buffers = 1000MB
effective_cache_size = 1000MB
wal_buffers = 1000MB
autovacuum = off

Let's see how fast the most basic query can be:

taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from cc;'
  count
---------
 5904385
(1 row)

real    0m0.448s
user    0m0.024s
sys     0m0.000s

Are count(*) and count(contact_id) the same? Let's try it:

taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc;'
  count
---------
 5904385
(1 row)

real    0m0.537s
user    0m0.016s
sys     0m0.008s

Interesting! It is actually slower than count(*). Maybe that is because contact_id has no index. Fine, let's add one.

taowen@dmright-perf:~$ time psql postgres -c 'create index contact_id_idx on cc(contact_id);'
CREATE INDEX

real    0m4.848s
user    0m0.016s
sys     0m0.008s
taowen@dmright-perf:~$ time psql postgres -c 'analyze cc;'
ANALYZE

real    0m0.197s
user    0m0.024s
sys     0m0.004s

Try again:

taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc;'
  count
---------
 5904385
(1 row)

real    0m0.534s
user    0m0.020s
sys     0m0.008s

taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from cc;'
  count
---------
 5904385
(1 row)

real    0m0.447s
user    0m0.028s
sys     0m0.008s

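Basically no change... count(*) just seems to be faster than count(contact_id). A plausible explanation (not verified here) is that count(contact_id) counts only non-NULL values, so every row's contact_id has to be inspected. Never mind: anything under about 500ms is fast enough. Let's add a condition to the query.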

taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from cc where cad_id = 101 and value = 5;'
 count
--------
 998839
(1 row)

real    0m0.686s
user    0m0.024s
sys     0m0.000s

taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc where cad_id = 101 and value = 5;'
 count
--------
 998839
(1 row)

real    0m0.660s
user    0m0.024s
sys     0m0.000s

As we can see, there is no big difference between count(*) and count(contact_id). But we have not indexed cad_id and value yet, so let's add that index and see:

taowen@dmright-perf:~$ time psql postgres -c 'create index cad_id_value_idx on cc(cad_id, value);'
CREATE INDEX

real    0m10.069s
user    0m0.020s
sys     0m0.008s

taowen@dmright-perf:~$ time psql postgres -c 'analyze cc;'
ANALYZE

real    0m0.199s
user    0m0.016s
sys     0m0.012s

and try again.

taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from cc where cad_id = 101 and value = 5;'
 count
--------
 998839
(1 row)

real    0m0.283s
user    0m0.020s
sys     0m0.012s

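Really fast! To recap, we now have indexes on contact_id and on (cad_id, value).
Next, let's make the query more complex by combining two conditions with AND. There are three possible ways to write it:
1. INTERSECT
2. INNER JOIN
3. IN + SUB QUERY
We will try each of them.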
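
To make the three query shapes concrete before looking at the timings, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for PostgreSQL and eight made-up rows, so it only shows that the three forms return the same count, not how fast they are. The equivalence assumes each contact has at most one row per cad_id; with duplicates the JOIN and IN forms would count rows rather than distinct contacts.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.execute("create table cc(contact_id integer, cad_id integer, value integer)")

# Made-up sample rows: contacts 1 and 4 satisfy both conditions.
rows = [
    (1, 101, 5), (1, 102, 7),
    (2, 101, 5), (2, 102, 8),
    (3, 101, 6), (3, 102, 7),
    (4, 101, 5), (4, 102, 7),
]
conn.executemany("insert into cc values (?, ?, ?)", rows)

queries = {
    "intersect": """
        select count(*) from (
            select contact_id from cc where cad_id = 101 and value = 5
            intersect
            select contact_id from cc where cad_id = 102 and value = 7
        ) as matched""",
    "inner join": """
        select count(a1.contact_id) from cc as a1
        inner join cc as a2 on a1.contact_id = a2.contact_id
        where a1.cad_id = 101 and a1.value = 5
          and a2.cad_id = 102 and a2.value = 7""",
    "in + subquery": """
        select count(a1.contact_id) from cc as a1
        where a1.cad_id = 101 and a1.value = 5
          and a1.contact_id in (
              select contact_id from cc as a2
              where a2.cad_id = 102 and a2.value = 7)""",
}

# Run each form and collect the resulting count.
counts = {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}
print(counts)
```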

taowen@dmright-perf:~$ time psql postgres -c 'select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7) as temp'
 count
--------
 164788
(1 row)

real    0m1.159s
user    0m0.028s
sys     0m0.008s

taowen@dmright-perf:~$ time psql postgres -c 'select count(*) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7) as temp'
 count
--------
 164788
(1 row)

real    0m1.148s
user    0m0.032s
sys     0m0.000s

Clearly, this is slow. Is INNER JOIN any faster?

taowen@dmright-perf:~$ time psql postgres -c 'select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id and a2.cad_id = 102 and a2.value = 7 where a1.cad_id = 101 and a1.value = 5;'
 count
--------
 164788
(1 row)

real    0m1.162s
user    0m0.036s
sys     0m0.000s

At 1.162s this is practically the same as INTERSECT, certainly not faster. What about IN + SUB QUERY?

taowen@dmright-perf:~$ time psql postgres -c 'select count(a1.contact_id) from cc as a1 where a1.cad_id = 101 and a1.value = 5 and a1.contact_id in (select contact_id from cc as a2 where a2.cad_id = 102 and a2.value = 7)'
 count
--------
 164788
(1 row)

real    0m2.645s
user    0m0.024s
sys     0m0.004s

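Damn, that is even slower. In summary, with two conditions INTERSECT seems to be the fastest, though INNER JOIN is only a few milliseconds behind. Even so, it takes more than a second. Why? Where does the time go?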

taowen@dmright-perf:~$ time psql postgres -c 'explain analyze select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7) as temp'
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=107853.91..107853.92 rows=1 width=4) (actual time=1471.907..1471.907 rows=1 loops=1)
   ->  Subquery Scan on temp  (cost=7467.63..107375.65 rows=191301 width=4) (actual time=1366.131..1459.419 rows=164788 loops=1)
         ->  HashSetOp Intersect  (cost=7467.63..105462.64 rows=191301 width=4) (actual time=1366.129..1439.781 rows=164788 loops=1)
               ->  Append  (cost=7467.63..103326.69 rows=854380 width=4) (actual time=69.436..797.478 rows=1829408 loops=1)
                     ->  Subquery Scan on "*SELECT* 2"  (cost=7467.63..48180.53 rows=351876 width=4) (actual time=69.435..350.031 rows=830569 loops=1)
                           ->  Bitmap Heap Scan on cc  (cost=7467.63..44661.77 rows=351876 width=4) (actual time=69.434..264.538 rows=830569 loops=1)
                                 Recheck Cond: ((cad_id = 102) AND (value = 7))
                                 ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7379.66 rows=351876 width=0) (actual time=64.162..64.162 rows=830569 loops=1)
                                       Index Cond: ((cad_id = 102) AND (value = 7))
                     ->  Subquery Scan on "*SELECT* 1"  (cost=10667.56..55146.16 rows=502504 width=4) (actual time=69.846..331.544 rows=998839 loops=1)
                           ->  Bitmap Heap Scan on cc  (cost=10667.56..50121.12 rows=502504 width=4) (actual time=69.845..233.794 rows=998839 loops=1)
                                 Recheck Cond: ((cad_id = 101) AND (value = 5))
                                 ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..10541.94 rows=502504 width=0) (actual time=64.501..64.501 rows=998839 loops=1)
                                       Index Cond: ((cad_id = 101) AND (value = 5))
 Total runtime: 1477.484 ms
(15 rows)

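First, INTERSECT is not executed in parallel. Each of the two subqueries takes more than 300ms, and together about 800ms goes into collecting contact_id values. Second, the intersection of the two sets takes more than 600ms. The rest of the time is spent counting the members of the resulting set. One thing worth noting: according to http://postgresql.1045698.n5.nabble.com/ANTI-JOIN-needs-table-index-scan-not-possible-td3425340.html both the bitmap index scan and the bitmap heap scan actually use the index. Even so, the SELECT on top of them adds roughly another 100ms. My guess is that it goes back to the underlying table to fetch the rows in order to get the contact_id values (although the heap fetch itself is what the Bitmap Heap Scan node does, so part of this may simply be per-row overhead of the extra plan node). On disk this would be slower still, because the fetches would turn into random seeks.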
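
The breakdown can be read directly off the plan. A small Python sketch of the arithmetic, using the end values of "actual time" copied from the INTERSECT plan above (the phase labels are my own wording):

```python
# End values of "actual time" (ms), copied from the EXPLAIN ANALYZE output above.
append_done = 797.478      # both scans finished feeding contact_id values
intersect_done = 1439.781  # HashSetOp Intersect finished
aggregate_done = 1471.907  # count() finished

# Each phase is the difference between consecutive completion times.
phases = {
    "collect contact_id (both scans)": append_done,
    "hash intersect": intersect_done - append_done,
    "count": aggregate_done - intersect_done,
}

for name, ms in phases.items():
    print(f"{name}: {ms:.0f} ms")
```

This treats the nodes as running strictly one after another, which is a simplification, but it matches the rough 800ms / 600ms split described above.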
With INTERSECT covered, let's analyze the join:

taowen@dmright-perf:~$ time psql postgres -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id and a2.cad_id = 102 and a2.value = 7 where a1.cad_id = 101 and a1.value = 5;'
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=180366.31..180366.32 rows=1 width=4) (actual time=1280.964..1280.964 rows=1 loops=1)
   ->  Merge Join  (cost=174783.00..180104.02 rows=104912 width=4) (actual time=1043.879..1270.197 rows=164788 loops=1)
         Merge Cond: (a1.contact_id = a2.contact_id)
         ->  Sort  (cost=97705.18..98961.44 rows=502504 width=4) (actual time=586.735..626.292 rows=998839 loops=1)
               Sort Key: a1.contact_id
               Sort Method:  quicksort  Memory: 71397kB
               ->  Bitmap Heap Scan on cc a1  (cost=10667.56..50121.12 rows=502504 width=4) (actual time=75.377..257.403 rows=998839 loops=1)
                     Recheck Cond: ((cad_id = 101) AND (value = 5))
                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..10541.94 rows=502504 width=0) (actual time=69.565..69.565 rows=998839 loops=1)
                           Index Cond: ((cad_id = 101) AND (value = 5))
         ->  Sort  (cost=77077.83..77957.52 rows=351876 width=4) (actual time=457.131..492.428 rows=830569 loops=1)
               Sort Key: a2.contact_id
               Sort Method:  quicksort  Memory: 63509kB
               ->  Bitmap Heap Scan on cc a2  (cost=7467.63..44661.77 rows=351876 width=4) (actual time=60.039..186.422 rows=830569 loops=1)
                     Recheck Cond: ((cad_id = 102) AND (value = 7))
                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7379.66 rows=351876 width=0) (actual time=54.929..54.929 rows=830569 loops=1)
                           Index Cond: ((cad_id = 102) AND (value = 7))
 Total runtime: 1286.846 ms
(18 rows)

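Most of the time seems to go into sorting. It is not executed in parallel either: run top in the background at the same time and you will see that only one core is actually busy.
What if we create more indexes? Would that help?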

taowen@dmright-perf:~$ time psql postgres -c 'create index cad_id_value_contact_id on cc(cad_id, value, contact_id);'
CREATE INDEX

real    0m10.683s
user    0m0.020s
sys     0m0.004s
taowen@dmright-perf:~$ time psql postgres -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id and a2.cad_id = 102 and a2.value = 7 where a1.cad_id = 101 and a1.value = 5;'
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=180366.31..180366.32 rows=1 width=4) (actual time=1338.478..1338.478 rows=1 loops=1)
   ->  Merge Join  (cost=174783.00..180104.02 rows=104912 width=4) (actual time=1090.136..1327.312 rows=164788 loops=1)
         Merge Cond: (a1.contact_id = a2.contact_id)
         ->  Sort  (cost=97705.18..98961.44 rows=502504 width=4) (actual time=615.843..657.438 rows=998839 loops=1)
               Sort Key: a1.contact_id
               Sort Method:  quicksort  Memory: 71397kB
               ->  Bitmap Heap Scan on cc a1  (cost=10667.56..50121.12 rows=502504 width=4) (actual time=80.926..275.431 rows=998839 loops=1)
                     Recheck Cond: ((cad_id = 101) AND (value = 5))
                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..10541.94 rows=502504 width=0) (actual time=75.816..75.816 rows=998839 loops=1)
                           Index Cond: ((cad_id = 101) AND (value = 5))
         ->  Sort  (cost=77077.83..77957.52 rows=351876 width=4) (actual time=474.279..510.866 rows=830569 loops=1)
               Sort Key: a2.contact_id
               Sort Method:  quicksort  Memory: 63509kB
               ->  Bitmap Heap Scan on cc a2  (cost=7467.63..44661.77 rows=351876 width=4) (actual time=65.335..198.655 rows=830569 loops=1)
                     Recheck Cond: ((cad_id = 102) AND (value = 7))
                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7379.66 rows=351876 width=0) (actual time=60.314..60.314 rows=830569 loops=1)
                           Index Cond: ((cad_id = 102) AND (value = 7))
 Total runtime: 1346.587 ms
(18 rows)

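Still no good! The plan is unchanged and still scans cad_id_value_idx, so the new index is not used here. If merge join is slow, would upgrading to 9.1 and forcing a hash join do better?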

postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id where a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5';
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=902356.85..902356.86 rows=1 width=4) (actual time=1693.137..1693.137 rows=1 loops=1)
   ->  Hash Join  (cost=59599.78..902135.88 rows=88389 width=4) (actual time=461.788..1682.718 rows=164788 loops=1)
         Hash Cond: (a1.contact_id = a2.contact_id)
         ->  Bitmap Heap Scan on cc a1  (cost=10833.24..50406.32 rows=510472 width=4) (actual time=76.337..224.571 rows=998839 loops=1)
               Recheck Cond: ((cad_id = 101) AND (value = 5))
               ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..10705.62 rows=510472 width=0) (actual time=71.028..71.028 rows=998839 loops=1)
                     Index Cond: ((cad_id = 101) AND (value = 5))
         ->  Hash  (cost=44444.15..44444.15 rows=345792 width=4) (actual time=385.256..385.256 rows=830569 loops=1)
               Buckets: 65536  Batches: 1  Memory Usage: 29200kB
               ->  Bitmap Heap Scan on cc a2  (cost=7341.27..44444.15 rows=345792 width=4) (actual time=64.778..258.059 rows=830569 loops=1)
                     Recheck Cond: ((cad_id = 102) AND (value = 7))
                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7254.82 rows=345792 width=0) (actual time=59.675..59.675 rows=830569 loops=1)
                           Index Cond: ((cad_id = 102) AND (value = 7))
 Total runtime: 1698.207 ms
(14 rows)

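Hash join is no faster, at about 1.7s. Two things cause this:
1. Two conditions mean two scans, and the scans do not run in parallel.
2. The join itself is expensive, whether it is a sort merge join or a hash join.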
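
To see where the join cost comes from, here is a toy Python sketch of the two strategies applied to plain lists of contact_id values. It is a simplified illustration, not PostgreSQL's implementation, and it assumes each contact_id appears at most once per input; the function names and sample ids are made up.

```python
def sort_merge_join(left_ids, right_ids):
    """Sort both inputs, then walk them in lockstep (ids unique per side)."""
    left, right = sorted(left_ids), sorted(right_ids)  # the sort is the costly part
    i = j = 0
    matches = []
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1
        elif left[i] > right[j]:
            j += 1
        else:
            matches.append(left[i])
            i += 1
            j += 1
    return matches


def hash_join(probe_ids, build_ids):
    """Build a hash table on one input, then probe it with the other."""
    build = set(build_ids)  # building the table touches every build row once
    return [cid for cid in probe_ids if cid in build]


a1 = [4, 1, 2]  # contacts matching the first condition (made up)
a2 = [3, 4, 1]  # contacts matching the second condition (made up)

merged = sort_merge_join(a1, a2)
hashed = hash_join(a1, a2)
print(merged, hashed)
```

Either way every row from both scans is touched at least once before a single match can be counted, which fits the observation that the join itself dominates the runtime.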

If two conditions are already this bad, what about three? Let's see, starting with INTERSECT:

postgres@dmright-perf:/mnt/memory/pg$ time psql postgres -c 'select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7 intersect select contact_id from cc where cad_id = 6 and value = 1) as temp'
 count
-------
  6748
(1 row)

real    0m1.350s
user    0m0.020s
sys     0m0.008s

Then INNER JOIN:

postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id inner join cc as a3 on a1.contact_id = a3.contact_id where a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5 and a3.cad_id = 6 and a3.value = 1';
 count
-------
  6748
(1 row)

real    0m0.756s
user    0m0.028s
sys     0m0.000s

Then IN + SUBQUERY:

postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'select count(a1.contact_id) from cc as a1 where a1.cad_id = 101 and a1.value = 5 and a1.contact_id in (select contact_id from cc as a2 where a2.cad_id = 102 and a2.value = 7 and a2.contact_id in (select contact_id from cc as a3 where a3.cad_id = 6 and a3.value = 1))';
 count
-------
  6748
(1 row)

real    0m7.320s
user    0m0.024s
sys     0m0.004s

This time INNER JOIN is the fastest. Why?

postgres@dmright-perf:/mnt/memory/pg$ time psql postgres -c 'explain analyze select count(contact_id) from (select contact_id from cc where cad_id = 101 and value = 5 intersect select contact_id from cc where cad_id = 102 and value = 7 intersect select contact_id from cc where cad_id = 6 and value = 1) as temp'
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=132357.21..132357.22 rows=1 width=4) (actual time=1659.666..1659.666 rows=1 loops=1)
   ->  Subquery Scan on temp  (cost=306.85..132335.71 rows=8602 width=4) (actual time=1648.818..1659.243 rows=6748 loops=1)
         ->  HashSetOp Intersect  (cost=306.85..132249.69 rows=8602 width=4) (actual time=1648.817..1658.543 rows=6748 loops=1)
               ->  Append  (cost=306.85..131691.89 rows=223118 width=4) (actual time=33.797..1547.811 rows=372016 loops=1)
                     ->  Subquery Scan on "*SELECT* 3"  (cost=306.85..26138.13 rows=14239 width=4) (actual time=33.797..153.230 rows=207228 loops=1)
                           ->  Bitmap Heap Scan on cc  (cost=306.85..25995.74 rows=14239 width=4) (actual time=33.795..134.292 rows=207228 loops=1)
                                 Recheck Cond: ((cad_id = 6) AND (value = 1))
                                 ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..303.29 rows=14239 width=0) (actual time=28.558..28.558 rows=207228 loops=1)
                                       Index Cond: ((cad_id = 6) AND (value = 1))
                     ->  Result  (cost=7341.27..105553.76 rows=208879 width=4) (actual time=1282.426..1371.255 rows=164788 loops=1)
                           ->  HashSetOp Intersect  (cost=7341.27..105553.76 rows=208879 width=4) (actual time=1282.418..1353.422 rows=164788 loops=1)
                                 ->  Append  (cost=7341.27..103413.10 rows=856264 width=4) (actual time=57.550..734.282 rows=1829408 loops=1)
                                       ->  Subquery Scan on "*SELECT* 2"  (cost=7341.27..47902.07 rows=345792 width=4) (actual time=57.550..292.595 rows=830569 loops=1)
                                             ->  Bitmap Heap Scan on cc  (cost=7341.27..44444.15 rows=345792 width=4) (actual time=57.549..218.162 rows=830569 loops=1)
                                                   Recheck Cond: ((cad_id = 102) AND (value = 7))
                                                   ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7254.82 rows=345792 width=0) (actual time=52.606..52.606 rows=830569 loops=1)
                                                         Index Cond: ((cad_id = 102) AND (value = 7))
                                       ->  Subquery Scan on "*SELECT* 1"  (cost=10833.24..55511.04 rows=510472 width=4) (actual time=69.129..330.737 rows=998839 loops=1)
                                             ->  Bitmap Heap Scan on cc  (cost=10833.24..50406.32 rows=510472 width=4) (actual time=69.128..242.416 rows=998839 loops=1)
                                                   Recheck Cond: ((cad_id = 101) AND (value = 5))
                                                   ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..10705.62 rows=510472 width=0) (actual time=64.161..64.161 rows=998839 loops=1)
                                                         Index Cond: ((cad_id = 101) AND (value = 5))
 Total runtime: 1665.691 ms
(23 rows)

For INTERSECT, the process is much the same as with two conditions, only with more sets to process.

postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'explain analyze select count(a1.contact_id) from cc as a1 inner join cc as a2 on a1.contact_id = a2.contact_id inner join cc as a3 on a1.contact_id = a3.contact_id where a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5 and a3.cad_id = 6 and a3.value = 1';
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=118979.47..118979.48 rows=1 width=4) (actual time=771.393..771.393 rows=1 loops=1)
   ->  Nested Loop  (cost=49073.39..118977.89 rows=630 width=4) (actual time=399.835..770.587 rows=6748 loops=1)
         ->  Hash Join  (cost=49073.39..97159.97 rows=2466 width=8) (actual time=399.814..691.519 rows=34219 loops=1)
               Hash Cond: (a3.contact_id = a2.contact_id)
               ->  Bitmap Heap Scan on cc a3  (cost=306.85..25995.74 rows=14239 width=4) (actual time=25.655..83.526 rows=207228 loops=1)
                     Recheck Cond: ((cad_id = 6) AND (value = 1))
                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..303.29 rows=14239 width=0) (actual time=20.570..20.570 rows=207228 loops=1)
                           Index Cond: ((cad_id = 6) AND (value = 1))
               ->  Hash  (cost=44444.15..44444.15 rows=345792 width=4) (actual time=373.969..373.969 rows=830569 loops=1)
                     Buckets: 65536  Batches: 1  Memory Usage: 29200kB
                     ->  Bitmap Heap Scan on cc a2  (cost=7341.27..44444.15 rows=345792 width=4) (actual time=59.271..250.932 rows=830569 loops=1)
                           Recheck Cond: ((cad_id = 102) AND (value = 7))
                           ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7254.82 rows=345792 width=0) (actual time=54.030..54.030 rows=830569 loops=1)
                                 Index Cond: ((cad_id = 102) AND (value = 7))
         ->  Index Scan using cad_id_value_contact_id_idx on cc a1  (cost=0.00..8.83 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=34219)
               Index Cond: ((cad_id = 101) AND (value = 5) AND (contact_id = a2.contact_id))
 Total runtime: 774.588 ms
(17 rows)

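For INNER JOIN, all the indexes are put to use: a1 is probed through cad_id_value_contact_id_idx once per candidate row, so there is no need to collect all of its contact_id values first and then run a set operation on them. The conclusion seems to be that hashing on row ids is faster than set operations.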
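
The last step of that plan, a nested loop that probes an index on (cad_id, value, contact_id) once per candidate, can be imitated in Python with a sorted list and binary search. This is only a toy model of an ordered index, not how PostgreSQL's B-tree is implemented; the index contents, the candidate list and the helper name are made up.

```python
from bisect import bisect_left

# Toy ordered index on (cad_id, value, contact_id), kept as a sorted list of tuples.
index = sorted([(101, 5, 1), (101, 5, 2), (101, 5, 4), (101, 6, 3)])


def index_contains(index, key):
    """Binary-search the sorted index for an exact (cad_id, value, contact_id) key."""
    pos = bisect_left(index, key)
    return pos < len(index) and index[pos] == key


# Hypothetical contact_ids that survived the hash join of the other two conditions.
candidates = [1, 3, 4]

# Nested loop: one cheap index probe per candidate instead of a full scan of a1.
matches = [cid for cid in candidates if index_contains(index, (101, 5, cid))]
print(matches)
```

The probe cost grows with the number of candidates rather than with the size of the a1 result, which is why this plan pays off once the first two conditions have already narrowed things down to tens of thousands of rows.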
Not writing INNER JOIN explicitly and letting the planner decide the join order gives the same plan:

postgres@dmright-perf:/mnt/memory/pg$ time psql -c 'explain analyze select count(a1.contact_id) from cc as a1, cc as a2, cc as a3 where a1.contact_id = a2.contact_id and a2.contact_id = a3.contact_id and a2.cad_id = 102 and a2.value = 7 and a1.cad_id = 101 and a1.value = 5 and a3.cad_id = 6 and a3.value = 1';
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=118979.47..118979.48 rows=1 width=4) (actual time=762.969..762.970 rows=1 loops=1)
   ->  Nested Loop  (cost=49073.39..118977.89 rows=630 width=4) (actual time=398.554..762.206 rows=6748 loops=1)
         ->  Hash Join  (cost=49073.39..97159.97 rows=2466 width=8) (actual time=398.531..684.425 rows=34219 loops=1)
               Hash Cond: (a3.contact_id = a2.contact_id)
               ->  Bitmap Heap Scan on cc a3  (cost=306.85..25995.74 rows=14239 width=4) (actual time=34.802..91.672 rows=207228 loops=1)
                     Recheck Cond: ((cad_id = 6) AND (value = 1))
                     ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..303.29 rows=14239 width=0) (actual time=29.524..29.524 rows=207228 loops=1)
                           Index Cond: ((cad_id = 6) AND (value = 1))
               ->  Hash  (cost=44444.15..44444.15 rows=345792 width=4) (actual time=363.537..363.537 rows=830569 loops=1)
                     Buckets: 65536  Batches: 1  Memory Usage: 29200kB
                     ->  Bitmap Heap Scan on cc a2  (cost=7341.27..44444.15 rows=345792 width=4) (actual time=57.799..245.467 rows=830569 loops=1)
                           Recheck Cond: ((cad_id = 102) AND (value = 7))
                           ->  Bitmap Index Scan on cad_id_value_idx  (cost=0.00..7254.82 rows=345792 width=0) (actual time=52.349..52.349 rows=830569 loops=1)
                                 Index Cond: ((cad_id = 102) AND (value = 7))
         ->  Index Scan using cad_id_value_contact_id_idx on cc a1  (cost=0.00..8.83 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=34219)
               Index Cond: ((cad_id = 101) AND (value = 5) AND (contact_id = a2.contact_id))
 Total runtime: 766.107 ms
(17 rows)

The result is the same.

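That is basically it for the PostgreSQL experiments. This is about as good as it gets, and there is little room for improvement. Some googling shows that PostgreSQL's MVCC mechanism makes it update hint bits even during an operation like SELECT COUNT, which may be an important reason for its slowness. MySQL's MyISAM engine is known for having no transaction support at all, so it should leave more room for improvement than PostgreSQL. In the next post we will repeat the same experiments on MySQL.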

Reposted from: https://www.cnblogs.com/taowen/archive/2012/02/26/2367724.html
