Indexes in PostgreSQL — 9 (BRIN)

In the previous articles we discussed the PostgreSQL indexing engine, the interface of access methods, and the following methods: hash indexes, B-trees, GiST, SP-GiST, GIN, and RUM. The topic of this article is BRIN indexes.

BRIN

General concept

Unlike indexes with which we've already got acquainted, the idea of BRIN is to avoid looking through definitely unsuited rows rather than quickly find the matching ones. This is always an inaccurate index: it does not contain TIDs of table rows at all.

Simplistically, BRIN works well for columns whose values correlate with their physical location in the table. In other words, it suits a column if a query without an ORDER BY clause returns the column values virtually in increasing or decreasing order (and there are no indexes on that column).

This access method was created in the scope of Axle, the European project for extremely large analytical databases, with an eye on tables that are several terabytes or even dozens of terabytes large. An important feature of BRIN that enables us to create indexes on such tables is its small size and minimal maintenance overhead.

This works as follows. The table is split into ranges that are several pages (or several blocks, which is the same) large; hence the name: Block Range Index, BRIN. The index stores summary information on the data in each range. As a rule, this is the minimal and maximal values, but, as shown further, it can be different. Assume that a query contains a condition on a column: if the sought values do not fall within the interval, the whole range can be skipped; but if they do, all rows in all blocks of the range have to be looked through to choose the matching ones.

It will not be a mistake to treat BRIN not as an index, but as an accelerator of sequential scan. We can regard BRIN as an alternative to partitioning if we consider each range as a «virtual» partition.

Now let's discuss the structure of the index in more detail.

Structure

The first page (more exactly, the zero one) contains the metadata.

Pages with the summary information are located at a certain offset from the metadata. Each index row on those pages contains summary information on one range.

Between the meta page and summary data, pages with the reverse range map (abbreviated as «revmap») are located. Actually, this is an array of pointers (TIDs) to the corresponding index rows.

For some ranges, the pointer in «revmap» can lead to no index row (one is marked in gray in the figure). In such a case, the range is considered to have no summary information yet.

Scanning the index

How is the index used if it does not contain references to table rows? This access method certainly cannot return rows TID by TID, but it can build a bitmap. There can be two kinds of bitmap pages: accurate, to the row, and inaccurate, to the page. It's an inaccurate bitmap that is used.

The algorithm is simple. The map of ranges is scanned sequentially (that is, the ranges are gone through in the order of their location in the table). The pointers are used to determine the index rows with summary information on each range. If a range does not contain the sought value, it is skipped; if it can contain the value (or the summary information is unavailable), all pages of the range are added to the bitmap. The resulting bitmap is then used as usual.

Updating the index

It is more interesting to see how the index is updated when the table is changed.

When adding a new version of a row to a table page, we determine which range it is contained in and use the map of ranges to find the index row with the summary information. All these are simple arithmetic operations. Let, for instance, the size of a range be four and on page 13, a row version with the value of 42 occur. The number of the range (starting with zero) is 13 / 4 = 3, therefore, in «revmap» we take the pointer with the offset of 3 (its order number is four).

The minimal value for this range is 31, and the maximal one is 40. Since the new value of 42 is out of the interval, we update the maximal value (see the figure). But if the new value is still within the stored limits, the index does not need to be updated.
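
A minimal sketch of this arithmetic (the range size of four pages and the values are taken from the example above):

select 13 / 4            as range_number,  -- integer division: range 3, counting from zero
       greatest(40, 42)  as new_max,       -- the stored maximum grows to 42
       least(31, 42)     as new_min;       -- the stored minimum stays 31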

All this relates to the situation where the new row version occurs in a range for which the summary information is available. When the index is created, the summary information is computed for all available ranges, but as the table is further extended, new pages can occur that fall outside these limits. Two options are available here:

  1. Usually the index is not updated immediately. This is not a big deal: as already mentioned, when scanning the index, the whole range will be looked through. The actual update is done during «vacuum», or it can be done manually by calling the «brin_summarize_new_values» function (see the sketch after this list).
  2. If we create the index with the «autosummarize» parameter, the update will be done immediately. But when pages of the range are populated with new values, updates can happen too often; therefore, this parameter is turned off by default.
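
For illustration, here is a minimal sketch of both options; the index name «flights_bi_scheduled_time_idx» is an assumption borrowed from the example built later in this article, and «autosummarize» is a storage parameter available since PostgreSQL 10:

-- summarize the ranges that have appeared since the index was built
-- (otherwise this happens during vacuum)
select brin_summarize_new_values('flights_bi_scheduled_time_idx');

-- or create the index so that new ranges are summarized immediately
create index on flights_bi using brin(scheduled_time) with (autosummarize = on);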

When new ranges occur, the size of «revmap» can increase. Whenever the map, located between the meta page and summary data, needs to be extended by another page, existing row versions are moved to some other pages. So, the map of ranges is always located between the meta page and summary data.

When a row is deleted,… nothing happens. We can notice that sometimes the minimal or maximal value will be deleted, in which case the interval could be reduced. But to detect this, we would have to read all values in the range, and this is costly.

The correctness of the index is not affected, but search may require looking through more ranges than is actually needed. In general, summary information can be manually recalculated for such a zone (by calling «brin_desummarize_range» and «brin_summarize_new_values» functions), but how can we detect such a need? Anyway, no conventional procedure is available to this end.

Finally, updating a row is just a deletion of the outdated version and addition of a new one.

Example

Let's try to build our own mini data warehouse for the data from the tables of the demo database. Let's assume that for the purpose of BI reporting, a denormalized table is needed that reflects the flights departed from or landed at an airport, to the accuracy of a seat in the cabin. The data for each airport will be added to the table once a day, when it is midnight in the appropriate time zone. The data will be neither updated nor deleted.

The table will look as follows:

demo=# create table flights_bi(
  airport_code char(3),
  airport_coord point,         -- geo coordinates of airport
  airport_utc_offset interval, -- time zone
  flight_no char(6),           -- flight number
  flight_type text,            -- flight type: departure / arrival
  scheduled_time timestamptz,  -- scheduled departure/arrival time of flight
  actual_time timestamptz,     -- actual time of flight
  aircraft_code char(3),
  seat_no varchar(4),          -- seat number
  fare_conditions varchar(10), -- travel class
  passenger_id varchar(20),
  passenger_name text
);

We can simulate the procedure of loading the data using nested loops: an external one — by days (we will consider a large database, therefore 365 days), and an internal loop — by time zones (from UTC+02 to UTC+12). The query is pretty long and not of particular interest, so I'll hide it under the spoiler.

Simulation of loading the data to the storage

DO $$
<<local>>
DECLARE
  curdate date := (SELECT min(scheduled_departure) FROM flights);
  utc_offset interval;
BEGIN
  WHILE (curdate <= bookings.now()::date) LOOP
    utc_offset := interval '12 hours';
    WHILE (utc_offset >= interval '2 hours') LOOP
      INSERT INTO flights_bi
      WITH flight (airport_code, airport_coord, flight_id, flight_no,
                   scheduled_time, actual_time, aircraft_code, flight_type)
      AS (
        -- departures
        SELECT a.airport_code, a.coordinates, f.flight_id, f.flight_no,
               f.scheduled_departure, f.actual_departure, f.aircraft_code, 'departure'
        FROM   airports a, flights f, pg_timezone_names tzn
        WHERE  a.airport_code = f.departure_airport
        AND    f.actual_departure IS NOT NULL
        AND    tzn.name = a.timezone
        AND    tzn.utc_offset = local.utc_offset
        AND    timezone(a.timezone, f.actual_departure)::date = curdate
        UNION ALL
        -- arrivals
        SELECT a.airport_code, a.coordinates, f.flight_id, f.flight_no,
               f.scheduled_arrival, f.actual_arrival, f.aircraft_code, 'arrival'
        FROM   airports a, flights f, pg_timezone_names tzn
        WHERE  a.airport_code = f.arrival_airport
        AND    f.actual_arrival IS NOT NULL
        AND    tzn.name = a.timezone
        AND    tzn.utc_offset = local.utc_offset
        AND    timezone(a.timezone, f.actual_arrival)::date = curdate
      )
      SELECT f.airport_code, f.airport_coord, local.utc_offset, f.flight_no,
             f.flight_type, f.scheduled_time, f.actual_time, f.aircraft_code,
             s.seat_no, s.fare_conditions, t.passenger_id, t.passenger_name
      FROM   flight f
        JOIN seats s
          ON s.aircraft_code = f.aircraft_code
        LEFT JOIN boarding_passes bp
          ON bp.flight_id = f.flight_id
         AND bp.seat_no = s.seat_no
        LEFT JOIN ticket_flights tf
          ON tf.ticket_no = bp.ticket_no
         AND tf.flight_id = bp.flight_id
        LEFT JOIN tickets t
          ON t.ticket_no = tf.ticket_no;
      RAISE NOTICE '%, %', curdate, utc_offset;
      utc_offset := utc_offset - interval '1 hour';
    END LOOP;
    curdate := curdate + 1;
  END LOOP;
END;
$$;

demo=# select count(*) from flights_bi;
  count
----------
 30517076
(1 row)

demo=# select pg_size_pretty(pg_total_relation_size('flights_bi'));
 pg_size_pretty
----------------
 4127 MB
(1 row)

We get 30 million rows and 4 GB. Not so large a size, but good enough for a laptop: sequential scan took me about 10 seconds.

On what columns should we create the index?

Since BRIN indexes have a small size and moderate overhead, and updates happen infrequently if at all, a rare opportunity arises to build many indexes «just in case», for example, on all fields on which analyst users can create their ad-hoc queries. If an index does not come in useful, never mind; and even one that is not very efficient will surely work better than a sequential scan. Of course, there are fields on which it is absolutely useless to build an index; pure common sense will suggest them.

But it would be odd to limit ourselves to this piece of advice; therefore, let's try to state a more accurate criterion.

We've already mentioned that the data must somewhat correlate with its physical location. Here it makes sense to remember that PostgreSQL gathers table column statistics, which include the correlation value. The planner uses this value to select between a regular index scan and bitmap scan, and we can use it to estimate the applicability of BRIN index.

In the above example, the data is evidently ordered by days (by «scheduled_time», as well as by «actual_time»; there is not much difference). This is because when rows are added to the table (without deletions and updates), they are laid out in the file one after another. In the simulation of data loading we did not even use an ORDER BY clause; therefore, timestamps within a day can, in general, be mixed up in an arbitrary way, but the overall ordering must be in place. Let's check this:

demo=# analyze flights_bi;
demo=# select attname, correlation from pg_stats where tablename='flights_bi'
order by correlation desc nulls last;
      attname       | correlation
--------------------+-------------
 scheduled_time     |    0.999994
 actual_time        |    0.999994
 fare_conditions    |    0.796719
 flight_type        |    0.495937
 airport_utc_offset |    0.438443
 aircraft_code      |    0.172262
 airport_code       |   0.0543143
 flight_no          |   0.0121366
 seat_no            |  0.00568042
 passenger_name     |   0.0046387
 passenger_id       | -0.00281272
 airport_coord      |
(12 rows)

A value that is not too close to zero (ideally, near plus or minus one, as in this case) tells us that a BRIN index will be appropriate.

The travel class «fare_conditions» (the column contains three unique values) and the type of the flight «flight_type» (two unique values) unexpectedly appear in second and third place. This is an illusion: formally the correlation is high, while actually all possible values are sure to be encountered on any several successive pages, which means that BRIN won't do any good.

The time zone «airport_utc_offset» goes next: in the considered example, within a day cycle, airports are ordered by time zones «by construction».

It's these two fields, time and time zone, that we will further experiment with.

Possible weakening of the correlation

The correlation that is in place «by construction» can be easily weakened when the data is changed. And the matter here is not a change to a particular value, but the structure of multiversion concurrency control: the outdated row version is deleted on one page, but a new version may be inserted wherever free space is available. Due to this, whole rows get mixed up during updates.

We can partially control this effect by reducing the value of «fillfactor» storage parameter and this way leaving free space on a page for future updates. But do we want to increase the size of an already huge table? Besides, this does not resolve the issue of deletions: they also «set traps» for new rows by freeing the space somewhere inside existing pages. Due to this, rows that otherwise would get to the end of file, will be inserted at some arbitrary place.
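
For reference, a hedged sketch of how this storage parameter could be lowered (the value 90 is only an illustration; the default for heap tables is 100):

-- leave 10% of each heap page free for future row versions
alter table flights_bi set (fillfactor = 90);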

By the way, this is a curious fact. Since BRIN index does not contain references to table rows, its availability should not hinder HOT updates at all, but it does.

So, BRIN is mainly designed for tables of large and even huge sizes that are either not updated at all or updated very slightly. However, it perfectly copes with the addition of new rows (to the end of the table). This is not surprising since this access method was created with a view to data warehouses and analytical reporting.

What size of a range do we need to select?

If we deal with a terabyte table, our main concern when selecting the size of a range will probably be not to make BRIN index too large. However, in our situation, we can afford analyzing data more accurately.

To do this, we can select unique values of a column and see how many pages they occur on. Localization of the values increases the chances of success in applying a BRIN index. Moreover, the number of pages found will suggest the size of a range. But if the value is «spread» over all pages, BRIN is useless.

Of course, we should use this technique keeping a watchful eye on an internal structure of the data. For example, it makes no sense to consider each date (more exactly, a timestamp, also including time) as a unique value — we need to round it to days.

Technically, this analysis can be done by looking at the value of the hidden «ctid» column, which provides the pointer to a row version (TID): the number of the page and the number of the row inside the page. Unfortunately, there is no conventional technique to decompose TID into its two components, therefore, we have to cast types through the text representation:

demo=# select min(numblk), round(avg(numblk)) avg, max(numblk)
from (
  select count(distinct (ctid::text::point)[0]) numblk
  from flights_bi
  group by scheduled_time::date
) t;
 min  | avg  | max
------+------+------
 1192 | 1500 | 1796
(1 row)

demo=# select relpages from pg_class where relname = 'flights_bi';
 relpages
----------
   528172
(1 row)

We can see that each day is distributed across pages pretty evenly, and days are only slightly mixed up with each other (1500 × 365 = 547500, which is just a little larger than the number of pages in the table, 528172). This is actually clear «by construction» anyway.

Valuable information here is a specific number of pages. With a conventional range size of 128 pages, each day will populate 9–14 ranges. This seems realistic: with a query for a specific day, we can expect an error around 10%.
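
A quick sketch of that estimate, using the page counts obtained above and the default range size of 128 pages:

select 1192 / 128.0 as min_ranges_per_day,  -- ≈ 9.3
       1796 / 128.0 as max_ranges_per_day;  -- ≈ 14.0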

Let's try:

demo=# create index on flights_bi using brin(scheduled_time);

The size of the index is as small as 184 KB:

demo=# select pg_size_pretty(pg_total_relation_size('flights_bi_scheduled_time_idx'));
 pg_size_pretty
----------------
 184 kB
(1 row)

In this case, it hardly makes sense to increase the size of a range at the cost of losing the accuracy. But we can reduce the size if required, and the accuracy will, on the contrary, increase (along with the size of the index).
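
If higher accuracy were required, a sketch of such an index might look like this (the value 32 is only an illustration; the default is 128 pages per range):

-- a smaller range: more index rows, but fewer unnecessary heap pages to recheck
create index on flights_bi using brin(scheduled_time) with (pages_per_range = 32);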

Now let's look at time zones. Here we cannot use a brute-force approach either. All the counts should instead be divided by the number of day cycles (365), since the distribution repeats within each day. Besides, since there are only a few time zones, we can look at the entire distribution:

demo=# select airport_utc_offset, count(distinct (ctid::text::point)[0])/365 numblk
from flights_bi
group by airport_utc_offset
order by 2;
 airport_utc_offset | numblk
--------------------+--------
 12:00:00           |      6
 06:00:00           |      8
 02:00:00           |     10
 11:00:00           |     13
 08:00:00           |     28
 09:00:00           |     29
 10:00:00           |     40
 04:00:00           |     47
 07:00:00           |    110
 05:00:00           |    231
 03:00:00           |    932
(11 rows)

On average, the data for each time zone populates 133 pages a day, but the distribution is highly non-uniform: Petropavlovsk-Kamchatskiy and Anadyr fit as few as six pages, while Moscow and its neighborhood require hundreds of them. The default size of a range is no good here; let's, for example, set it to four pages.

demo=# create index on flights_bi using brin(airport_utc_offset) with (pages_per_range=4);
demo=# select pg_size_pretty(pg_total_relation_size('flights_bi_airport_utc_offset_idx'));
 pg_size_pretty
----------------
 6528 kB
(1 row)

Execution plan

Let's look at how our indexes work. Let's select some day, say, a week ago (in the demo database, «today» is determined by the «bookings.now» function):

demo=# \set d 'bookings.now()::date - interval \'7 days\''
demo=# explain (costs off,analyze)
  select *
  from flights_bi
  where scheduled_time >= :d and scheduled_time < :d + interval '1 day';
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=10.282..94.328 rows=83954 loops=1)
   Recheck Cond: ...
   Rows Removed by Index Recheck: 12045
   Heap Blocks: lossy=1664
   ->  Bitmap Index Scan on flights_bi_scheduled_time_idx
         (actual time=3.013..3.013 rows=16640 loops=1)
         Index Cond: ...
 Planning time: 0.375 ms
 Execution time: 97.805 ms

As we can see, the planner used the index created. How accurate is it? The ratio of the number of rows that meet the query conditions («rows» of the Bitmap Heap Scan node) to the total number of rows returned using the index (the same value plus Rows Removed by Index Recheck) tells us about this. In this case it is 83954 / (83954 + 12045), which is about 87%, roughly the 90% accuracy expected (this value will change from one day to another).
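
The ratio itself can be computed directly from the plan numbers, for example:

select round(100.0 * 83954 / (83954 + 12045), 1) as accuracy_pct;  -- ≈ 87.5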

Where does the 16640 number in «actual rows» of Bitmap Index Scan node originate from? The thing is that this node of the plan builds an inaccurate (page-by-page) bitmap and is completely unaware of how many rows the bitmap will touch, while something needs to be shown. Therefore, in despair one page is assumed to contain 10 rows. The bitmap contains 1664 pages in total (this value is shown in «Heap Blocks: lossy=1664»); so, we just get 16640. Altogether, this is a senseless number, which we should not pay attention to.

How about airports? For example, let's take the time zone of Vladivostok, which populates 28 pages a day:

demo=# explain (costs off,analyze)
  select *
  from flights_bi
  where airport_utc_offset = interval '8 hours';
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=75.151..192.210 rows=587353 loops=1)
   Recheck Cond: (airport_utc_offset = '08:00:00'::interval)
   Rows Removed by Index Recheck: 191318
   Heap Blocks: lossy=13380
   ->  Bitmap Index Scan on flights_bi_airport_utc_offset_idx
         (actual time=74.999..74.999 rows=133800 loops=1)
         Index Cond: (airport_utc_offset = '08:00:00'::interval)
 Planning time: 0.168 ms
 Execution time: 212.278 ms

The planner again uses the BRIN index created. The accuracy is worse (about 75% in this case), but this is expected since the correlation is lower.

Several BRIN indexes (just like any other ones) can certainly be joined at the bitmap level. For example, the following is the data on the selected time zone for a month (notice «BitmapAnd» node):

demo=# \set d 'bookings.now()::date - interval \'60 days\''
demo=# explain (costs off,analyze)
  select *
  from flights_bi
  where scheduled_time >= :d and scheduled_time < :d + interval '30 days'
  and airport_utc_offset = interval '8 hours';
                                    QUERY PLAN
---------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=62.046..113.849 rows=48154 loops=1)
   Recheck Cond: ...
   Rows Removed by Index Recheck: 18856
   Heap Blocks: lossy=1152
   ->  BitmapAnd (actual time=61.777..61.777 rows=0 loops=1)
         ->  Bitmap Index Scan on flights_bi_scheduled_time_idx
               (actual time=5.490..5.490 rows=435200 loops=1)
               Index Cond: ...
         ->  Bitmap Index Scan on flights_bi_airport_utc_offset_idx
               (actual time=55.068..55.068 rows=133800 loops=1)
               Index Cond: ...
 Planning time: 0.408 ms
 Execution time: 115.475 ms

Comparison with B-tree

What if we create regular B-tree index on the same field as BRIN?

demo=# create index flights_bi_scheduled_time_btree on flights_bi(scheduled_time);
demo=# select pg_size_pretty(pg_total_relation_size('flights_bi_scheduled_time_btree'));
 pg_size_pretty
----------------
 654 MB
(1 row)

It appears to be several thousand times larger than our BRIN! However, the query performs a little faster: the planner used statistics to figure out that the data is physically ordered, so there is no need to build a bitmap and, most importantly, the index condition does not need to be rechecked:

demo=# explain (costs off,analyze)
  select *
  from flights_bi
  where scheduled_time >= :d and scheduled_time < :d + interval '1 day';
                           QUERY PLAN
----------------------------------------------------------------
 Index Scan using flights_bi_scheduled_time_btree on flights_bi
   (actual time=0.099..79.416 rows=83954 loops=1)
   Index Cond: ...
 Planning time: 0.500 ms
 Execution time: 85.044 ms

That's what is so wonderful about BRIN: we sacrifice some efficiency but gain a great deal of space.

Operator classes

minmax

For data types whose values can be compared with one another, summary information consists of the minimal and maximal values. Names of the corresponding operator classes contain «minmax», for example, «date_minmax_ops». Actually, these are the data types that we have been considering so far, and most types are of this kind.
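
To see which «minmax» operator classes exist, we can query the system catalogs; a sketch:

select opcname
from pg_opclass opc
join pg_am am on am.oid = opc.opcmethod
where am.amname = 'brin' and opcname like '%minmax%'
order by opcname;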

inclusion

Comparison operators are defined not for all data types. For example, they are not defined for points («point» type), which represent the geographical coordinates of airports. By the way, it's for this reason that the statistics do not show the correlation for this column.

demo=# select attname, correlation
from pg_stats
where tablename='flights_bi' and attname = 'airport_coord';
    attname    | correlation
---------------+-------------
 airport_coord |
(1 row)

But many of such types enable us to introduce a concept of a «bounding area», for example, a bounding rectangle for geometric shapes. We discussed in detail how GiST index uses this feature. Similarly, BRIN also enables gathering summary information on columns having data types like these: the bounding area for all values inside a range is just the summary value.

Unlike for GiST, the summary value for BRIN must be of the same type as the values being indexed. Therefore, we cannot build the index for points, although it is clear that the coordinates could work in BRIN: the longitude is closely connected with the time zone. Fortunately, nothing hinders creation of the index on an expression after transforming points into degenerate rectangles. At the same time, we will set the size of a range to one page, just to show the limit case:

demo=# create index on flights_bi using brin (box(airport_coord)) with (pages_per_range=1);

The size of the index is as small as 30 MB even in such an extreme situation:

demo=# select pg_size_pretty(pg_total_relation_size('flights_bi_box_idx'));
 pg_size_pretty
----------------
 30 MB
(1 row)

Now we can compose queries that limit the airports by coordinates. For example:

demo=# select airport_code, airport_name
from airports
where box(coordinates) <@ box '120,40,140,50';
 airport_code |   airport_name
--------------+------------------
 KHV          | Khabarovsk-Novyi
 VVO          | Vladivostok
(2 rows)

The planner will, however, refuse to use our index.

demo=# analyze flights_bi;
demo=# explain select * from flights_bi
where box(airport_coord) <@ box '120,40,140,50';
                              QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on flights_bi  (cost=0.00..985928.14 rows=30517 width=111)
   Filter: (box(airport_coord) <@ '(140,50),(120,40)'::box)

Why? Let's disable sequential scan and see what happens:

demo=# set enable_seqscan = off;
demo=# explain select * from flights_bi
where box(airport_coord) <@ box '120,40,140,50';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi  (cost=14079.67..1000007.81 rows=30517 width=111)
   Recheck Cond: (box(airport_coord) <@ '(140,50),(120,40)'::box)
   ->  Bitmap Index Scan on flights_bi_box_idx
         (cost=0.00..14072.04 rows=30517076 width=0)
         Index Cond: (box(airport_coord) <@ '(140,50),(120,40)'::box)

It appears that the index can be used, but the planner supposes that the bitmap will have to be built on the whole table (look at «rows» of Bitmap Index Scan node), and it is no wonder that the planner chooses sequential scan in this case. The issue here is that for geometric types, PostgreSQL does not gather any statistics, and the planner has to go blindly:

demo=# select * from pg_stats where tablename = 'flights_bi_box_idx' \gx
-[ RECORD 1 ]----------+-------------------
schemaname             | bookings
tablename              | flights_bi_box_idx
attname                | box
inherited              | f
null_frac              | 0
avg_width              | 32
n_distinct             | 0
most_common_vals       |
most_common_freqs      |
histogram_bounds       |
correlation            |
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |

Alas. But there are no complaints about the index — it does work and works fine:

demo=# explain (costs off,analyze)
select * from flights_bi where box(airport_coord) <@ box '120,40,140,50';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=158.142..315.445 rows=781790 loops=1)
   Recheck Cond: (box(airport_coord) <@ '(140,50),(120,40)'::box)
   Rows Removed by Index Recheck: 70726
   Heap Blocks: lossy=14772
   ->  Bitmap Index Scan on flights_bi_box_idx
         (actual time=158.083..158.083 rows=147720 loops=1)
         Index Cond: (box(airport_coord) <@ '(140,50),(120,40)'::box)
 Planning time: 0.137 ms
 Execution time: 340.593 ms

The conclusion must be like this: PostGIS is needed if anything nontrivial is required of geometry. It, at least, can gather statistics.

Internals

The conventional «pageinspect» extension enables us to look inside a BRIN index.

First, the metainformation shows us the size of a range and how many pages are allocated for «revmap»:

demo=# select *
from brin_metapage_info(get_raw_page('flights_bi_scheduled_time_idx',0));
   magic    | version | pagesperrange | lastrevmappage
------------+---------+---------------+----------------
 0xA8109CFA |       1 |           128 |              3
(1 row)

Pages 1–3 here are allocated for «revmap», while the rest contain summary data. From «revmap» we can get references to the summary data for each range. For example, the information on the first range, covering the first 128 pages, is located here:

demo=# select *
from brin_revmap_data(get_raw_page('flights_bi_scheduled_time_idx',1))
limit 1;
  pages
---------
 (6,197)
(1 row)

And this is the summary data itself:

demo=# select allnulls, hasnulls, value
from brin_page_items(get_raw_page('flights_bi_scheduled_time_idx',6),'flights_bi_scheduled_time_idx'
)
where itemoffset = 197;
 allnulls | hasnulls |                        value
----------+----------+-----------------------------------------------------
 f        | f        | {2016-08-15 02:45:00+03 .. 2016-08-15 17:15:00+03}
(1 row)

Next range:

demo=# select *
from brin_revmap_data(get_raw_page('flights_bi_scheduled_time_idx',1))
offset 1 limit 1;
  pages
---------
 (6,198)
(1 row)
demo=# select allnulls, hasnulls, value
from brin_page_items(get_raw_page('flights_bi_scheduled_time_idx',6),'flights_bi_scheduled_time_idx'
)
where itemoffset = 198;
 allnulls | hasnulls |                        value
----------+----------+-----------------------------------------------------
 f        | f        | {2016-08-15 06:00:00+03 .. 2016-08-15 18:55:00+03}
(1 row)

And so on.

For «inclusion» classes, the «value» field will display something like

{(94.4005966186523,69.3110961914062),(77.6600036621,51.6693992614746) .. f .. f}

The first value is the bounding rectangle, and the «f» letters at the end denote the lack of empty elements (the first one) and the lack of unmergeable values (the second one). Actually, the only unmergeable values are IPv4 and IPv6 addresses (the «inet» data type).

Properties

As a reminder, the queries used here have already been provided earlier in this series.

The following are the properties of the access method:

 amname |     name      | pg_indexam_has_property
--------+---------------+-------------------------
 brin   | can_order     | f
 brin   | can_unique    | f
 brin   | can_multi_col | t
 brin   | can_exclude   | f

Indexes can be created on several columns. In this case, its own summary statistics are gathered for each column, but they are stored together for each range. Of course, this index makes sense if one and the same size of a range is suitable for all columns.
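
A hedged sketch of a multicolumn BRIN index (assuming a single range size suits both columns; the value 8 is arbitrary):

create index on flights_bi using brin(scheduled_time, airport_utc_offset)
  with (pages_per_range = 8);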

The following index-layer properties are available:

     name      | pg_index_has_property
---------------+-----------------------
 clusterable   | f
 index_scan    | f
 bitmap_scan   | t
 backward_scan | f

Evidently, only bitmap scan is supported.

However, the lack of clustering may seem confusing. Seemingly, since a BRIN index is sensitive to the physical order of rows, it would be logical to be able to cluster data according to the index. But this is not so. We can only create a «regular» index (B-tree or GiST, depending on the data type) and cluster according to it. By the way, would you really want to cluster a presumably huge table, considering the exclusive lock, the execution time, and the disk space consumed during the rebuild?
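
For completeness, this is how clustering would look with the «regular» B-tree index built earlier (keeping in mind the exclusive lock and the extra disk space required during the rebuild):

cluster flights_bi using flights_bi_scheduled_time_btree;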

The following are the column-layer properties:

        name        | pg_index_column_has_property
--------------------+------------------------------
 asc                | f
 desc               | f
 nulls_first        | f
 nulls_last         | f
 orderable          | f
 distance_orderable | f
 returnable         | f
 search_array       | f
 search_nulls       | t

The only available property is the ability to manipulate NULLs.

Read on.

Translated from: https://habr.com/en/company/postgrespro/blog/452900/
