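B+ Tree Index: Splitting

Prerequisites

PostgreSQL B+ Tree Index: Search

PostgreSQL B+ Tree Index: Insertion

Overview

We have finally reached the hardest part of the B+ tree index: node splitting. The split itself is not difficult. The difficulty lies in two problems:

Concurrency control issues caused by splitting.
How to guarantee that a split is atomic, that is, how to recover the index if the system crashes in the middle of a split.

This chapter covers the basic split procedure and its implementation. Split-related concurrency control and atomicity are covered in separate documents.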

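Splitting illustrated

Before looking at the code, let us walk through PostgreSQL's split algorithm with diagrams.

Figure 1

The current state of the B+ tree is shown in Figure 1. A few things to note about Figure 1:

The B+ tree currently has a single leaf node, block1, so block1 is also the root node.
Because block1 has no right sibling, it is a right_most node, and a right_most node has no high key (its high key would be positive infinity).
block1 is already full, so inserting one more element triggers a split.

Now we insert element 6 into block1, which causes block1 to split. A split consists of the following steps:

Determine the split point
Migrate

Allocate a new node and move the data from the split point onward out of the original node and into the new node.
Link

Add the new node to the sibling linked list.
Write the parent node

Insert the new node's min key and blockno into the parent node as an index tuple.
Update the root node

This step does not happen on every split. It is needed only when the root node itself is split.

Let us look at these steps one by one.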

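Determine the split point

The split point is usually in the middle of the node, so that space is balanced between the two nodes after the split. block1 holds 5 elements, so we choose 5/2 = 2 as the split point (array indexes start at 0), which means element 3 and everything after it move to the new node. See Figure 2:

Figure 2

Migrate

Now we create a new node and migrate the data from the split point onward into it, as shown in Figure 3.

Figure 3

After the migration, block2 becomes the right sibling of block1. block2 is now the rightmost node, so it has no high key. block1 is no longer the rightmost node, so it needs a high key. The high key of block1 must equal the min key of block2, so the high key of block1 is 3. See Figure 4:

Figure 4

Link

Add block2 to the doubly linked list, as shown in Figure 5:

Figure 5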
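The three steps so far (split point, migrate, link) can be sketched on a toy page that holds fixed-size integer keys. This is only an illustration under simplifying assumptions: `ToyLeaf`, `toy_leaf_split`, `TOY_CAP` and `TOY_NONE` are hypothetical names, not PostgreSQL structures, and the insertion of the new element itself is left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CAP   8      /* hypothetical page capacity, counted in keys */
#define TOY_NONE  (-1)   /* hypothetical "no sibling" block number */

/* Toy leaf page: fixed-size integer keys only (an assumption of this sketch). */
typedef struct ToyLeaf
{
    int keys[TOY_CAP];
    int nkeys;
    int has_high_key;    /* 0 for the rightmost page on its level */
    int high_key;
    int prev;            /* left sibling block number */
    int next;            /* right sibling block number */
} ToyLeaf;

/*
 * Split `left` (block left_no) into `left` plus a new page `right`
 * (block right_no). `old_next` is the former right sibling of `left`,
 * or NULL if `left` was the rightmost page.
 */
void
toy_leaf_split(ToyLeaf *left, int left_no,
               ToyLeaf *right, int right_no,
               ToyLeaf *old_next)
{
    int split;
    int i;

    assert(left->nkeys >= 2);

    /* Step 1: the split point is the middle of the array. */
    split = left->nkeys / 2;

    /* Step 2: migrate everything from the split point onward. */
    right->nkeys = 0;
    for (i = split; i < left->nkeys; i++)
        right->keys[right->nkeys++] = left->keys[i];
    left->nkeys = split;

    /*
     * The right page inherits the old high key (if there was one); the
     * left page's high key becomes the right page's min key.
     */
    right->has_high_key = left->has_high_key;
    right->high_key = left->high_key;
    left->has_high_key = 1;
    left->high_key = right->keys[0];

    /* Step 3: link the new page into the doubly linked sibling list. */
    right->prev = left_no;
    right->next = left->next;
    left->next = right_no;
    if (old_next != NULL)
        old_next->prev = right_no;
}
```

Applied to a page holding 1, 2, 3, 4, 5, the sketch leaves 1 and 2 on the left page with high key 3 and moves 3, 4, 5 to the right page, which matches the walk-through above.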

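Write the parent node

After the steps above, the last step is to write the new node's min key and block number into the parent node as an index tuple, as shown in Figure 6:

Figure 6

A few points to note about this step:

block1 was originally the root node and had no parent, so a parent node, block3, has to be allocated.
block3 has no right sibling, so block3 is also a right_most node and has no high key.
block3 is a non-leaf node (also called an internal node). It has no left sibling either, so it is also a left_most node. Such a node should not store a minimum value, because that minimum would be negative infinity, so the first index tuple of block3 is shown as null. The PostgreSQL source code has a comment to this effect:
```
/*
 * Create downlink item for left page (old root).  Since this will be the
 * first item in a non-leaf page, it implicitly has minus-infinity key
 * value, so we need not store any actual key in it.
 */
```
The second element of block3 is the minimum value of block2, which is also the high key of block1, namely 3.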
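To see why the first item of an internal page needs no stored key, here is a minimal routing sketch. `ToyDownlink` and `toy_choose_child` are hypothetical names. The sketch assumes unique keys and sends a key equal to a separator to the right child; PostgreSQL's real comparison rules for equal keys are more involved.

```c
#include <assert.h>

/* Toy downlink: <min key of child, child block number>. */
typedef struct ToyDownlink
{
    int key;      /* never read for item 0: implicitly minus infinity */
    int child;    /* child block number */
} ToyDownlink;

/*
 * Pick the child to descend into for `key`: the last item whose key is
 * <= the search key. Item 0 matches every key, so its key field is unused.
 */
int
toy_choose_child(const ToyDownlink *items, int nitems, int key)
{
    int chosen = 0;
    int i;

    assert(nitems >= 1);

    for (i = 1; i < nitems; i++)
    {
        if (items[i].key <= key)
            chosen = i;
        else
            break;
    }
    return items[chosen].child;
}
```

With the two items of block3, (minus infinity, block1) and (3, block2), keys below 3 are routed to block1 and keys from 3 upward to block2, regardless of what is stored in the first item's key field.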

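Update the root node

In Figure 6 the root node is clearly no longer block1 but block3, so the root node has to be updated.

Figure 7

Summary

To summarize, two aspects of the split procedure deserve a closer look:

About the high key

A node's high key equals the min key of its right sibling. Consequently, if the index tuple to be inserted equals the node's high key, it can go either into the current node or into the node's right sibling. We already saw this in "B+ Tree Index: Insertion".
About index tuples in non-leaf nodes

An index tuple in a non-leaf node holds the minimum value of the child node. This is an interesting choice. Some books and papers describe an alternative: use the child's high key as the index tuple in the non-leaf node. That approach has a problem, which we illustrate with Figure 7. Suppose we used the high key as the non-leaf index tuple. Then the index tuple for block2 would be <5,block2>. When block2 splits, it splits to the right, so the high key of block2 changes and the corresponding index tuple has to be modified. So with the high-key approach, after a split we not only have to insert <high key, new blockno> into the parent, we also have to update the key in <high key, original blockno>. If instead we use <min key, blockno> as the index tuple, a node's min key does not change when the node splits, so a split only has to insert one tuple for the new node: its min key (which equals the original node's new high key) together with the new blockno. Do not underestimate this optimization: one step fewer means one XLOG record fewer, one piece of data fewer to maintain, and one fewer thing that can go wrong.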

Implementation

Now let us look at the code that implements the B+ tree split. The split is driven from the _bt_insertonpg function:

if (PageGetFreeSpace(page) < itemsz)
{
    bool        is_root = P_ISROOT(lpageop);
    bool        is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(lpageop);
    bool        newitemonleft;
    Buffer      rbuf;

    /*
     * Choose the split point
     * step1: find the split point
     */
    firstright = _bt_findsplitloc(rel, page,
                                  newitemoff, itemsz,
                                  &newitemonleft);

    /*
     * split the buffer into left and right halves
     * step2: split
     */
    rbuf = _bt_split(rel, buf, cbuf, firstright,
                     newitemoff, itemsz, itup, newitemonleft);
    PredicateLockPageSplit(rel,
                           BufferGetBlockNumber(buf),
                           BufferGetBlockNumber(rbuf));

    /* step3: write the parent node */
    _bt_insert_parent(rel, buf, rbuf, stack, is_root, is_only);
}

The code above has three main steps:

step1: find the split point
step2: split

This consists of the migrate and link steps described earlier.
step3: write the parent node

We go through them one step at a time.

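Finding the split point: _bt_findsplitloc

Principles for choosing the split point

Finding the split point means deciding two things:

Where does the split start?
Should the index tuple being inserted go into the left node or the right node after the split?

There are three principles:

In the normal case, the free space of the left and right nodes after the split should be as balanced as possible.

In the earlier example (Figures 1 to 7) we assumed that index tuples have a fixed size, so a block is simply an array of index tuples and we can take the middle of the array as the split point. In reality index tuples can be variable-length, for example when they contain strings, so finding the split point is not that simple.
If the node being split is the right most node, leave as much free space as possible to the new node.

This mainly targets bulk inserts on auto-increment columns. Because the values keep increasing, every insert lands at the end of the B+ tree. Splitting evenly in this case would increase the number of splits and leave the tree larger than necessary, which hurts query performance (more index blocks have to be read).
However the split is done, the resulting node must have enough space for the new index tuple.

This is easy to understand: if there is still not enough space after the split, the split was pointless.

Basic idea of the algorithm

How do we make the two nodes as balanced as possible while guaranteeing enough space for the new index tuple? The algorithm is plain brute force: just try each candidate. How?

Figure 8

Suppose we insert element 5 into block1 of Figure 8. First assume the split point is 1, so every element of block1 moves to the new node block2. Element 5 must then also go into block2, but since all the old elements have moved there, block2 has no room for 5. This option is not feasible. Next assume the split point is 2. After the split block1 holds 1 and block2 holds 2, 5, 10, 11, 12, so block1 and block2 differ by 4 elements. Then assume the split point is 10. After the split block1 holds 1, 2, 5 and block2 holds 10, 11, 12, so block1 and block2 differ by 0 elements. We continue with 11, and so on. Once every element of block1 has been tried, we know the best split point.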
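The trial procedure can be sketched for fixed-size items, where "space" is just an item count. Everything here is a simplification for illustration: `ToySplit` and `toy_find_split` are hypothetical names, offsets are 0-based, and the rule that neither side may end up empty is an assumption of this toy, not something taken from PostgreSQL.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ToySplit
{
    int found;          /* 0 if no feasible split exists */
    int firstright;     /* 0-based index of the first old item moved right */
    int newitemonleft;  /* 1 if the new item goes to the left page */
    int delta;          /* |items on left - items on right| */
} ToySplit;

/*
 * A page holds `nold` fixed-size items and can hold at most `cap`.
 * The new item belongs before old item `newitemoff` (0..nold).
 * Try every candidate split point and keep the most balanced one.
 */
ToySplit
toy_find_split(int nold, int newitemoff, int cap)
{
    ToySplit best = {0, 0, 0, 0};
    int firstright;

    assert(newitemoff >= 0 && newitemoff <= nold);

    for (firstright = 0; firstright <= nold; firstright++)
    {
        int onleft;

        /* Try "new item on the left" first, then "on the right". */
        for (onleft = 1; onleft >= 0; onleft--)
        {
            int left, right, delta;

            /* The new item can only go to the side its position allows. */
            if (onleft && newitemoff > firstright)
                continue;
            if (!onleft && newitemoff < firstright)
                continue;

            left = firstright + onleft;
            right = (nold - firstright) + (1 - onleft);

            /* Infeasible: a side overflows, or (toy rule) ends up empty. */
            if (left == 0 || right == 0 || left > cap || right > cap)
                continue;

            delta = abs(left - right);
            if (!best.found || delta < best.delta)
            {
                best.found = 1;
                best.firstright = firstright;
                best.newitemonleft = onleft;
                best.delta = delta;
            }
        }
    }
    return best;
}
```

For the example above (old items 1, 2, 10, 11, 12 on a full page of capacity 5, new item 5 belonging at index 2), the sketch picks index 2 (element 10) as the split point with the new item on the left, giving 1, 2, 5 and 10, 11, 12.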

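That is the basic idea behind how PostgreSQL determines the split point. A few implementation details are worth noting:

PostgreSQL usually does not enumerate every element in the node, because that is relatively expensive. It stops as soon as it finds a good-enough split point, where good-enough means that the free space of the left and right pages differs by no more than 1/16 of the page size. The relevant code and comment:

/*
 * Finding the best possible split would require checking all the possible
 * split points, because of the high-key and left-key special cases.
 * That's probably more work than it's worth; instead, stop as soon as we
 * find a "good-enough" split, where good-enough is defined as an
 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
 * should let us stop near the middle on most pages, instead of plowing to
 * the end.
 */
goodenough = leftspace / 16;

As mentioned earlier, when a right-most node is split, the right node should end up with as much free space as possible. This is controlled by a setting called fillfactor. The related constants are:

#define BTREE_MIN_FILLFACTOR        10  /* a page must hold at least 10% data */
#define BTREE_DEFAULT_FILLFACTOR    90  /* default fillfactor for leaf nodes */
#define BTREE_NONLEAF_FILLFACTOR    70  /* default fillfactor for non-leaf nodes */

Now suppose we insert 10 into Figure 8 (call it new 10) and the candidate split point is also 10 (call it old 10). Then new 10 can go into either block1 or block2. So we first assume that new 10 goes into block1, then assume that it goes into block2, and compare which of the two is better.

Next, let us look at the code for this part.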

Implementation

Before walking through the code, let us look at a very important struct, FindSplitData, defined as follows:

typedef struct
{
    /* context data for _bt_checksplitloc */
    Size        newitemsz;          /* size of new item to be inserted */
    int         fillfactor;         /* needed when splitting rightmost page */
    bool        is_leaf;            /* T if splitting a leaf page */
    bool        is_rightmost;       /* T if splitting a rightmost page */
    OffsetNumber newitemoff;        /* where the new item is to be inserted */
    int         leftspace;          /* space available for items on left page */
    int         rightspace;         /* space available for items on right page */
    int         olddataitemstotal;  /* space taken by old items */

    bool        have_split;         /* found a valid split? */

    /* these fields valid only if have_split is true */
    bool        newitemonleft;      /* new item on left or right of best split */
    OffsetNumber firstright;        /* best split point */
    int         best_delta;         /* best size delta so far */
} FindSplitData;

Because this struct is so important, we go through its members one by one:

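newitemsz

Size of the index tuple to be inserted.
fillfactor

The fill factor mentioned earlier. Note that this value is used only when splitting a right-most node. In all other cases the fill factor is effectively 50%.
is_leaf

Whether the node being split is a leaf node.
is_rightmost

Whether the node being split is a right-most node. This member decides whether fillfactor is used, and is_leaf decides which fillfactor value applies.
newitemoff

The insert position of the index tuple, which was determined during the locate phase.
leftspace

Space available for items on the left node.
rightspace

Space available for items on the right node.
olddataitemstotal

Total size of the data in the node before the split (not including the index tuple being inserted). Note the difference between olddataitemstotal and leftspace/rightspace: olddataitemstotal is a size of data, whereas leftspace and rightspace are sizes of available space.

have_split

Whether a split point has been found. It should in fact be impossible not to find one. If none is found, an error is raised.

/*
 * I believe it is not possible to fail to find a feasible split, but just
 * in case ...
 */
if (!state.have_split)
    elog(ERROR, "could not find a feasible split point for index \"%s\"",
         RelationGetRelationName(rel));

newitemonleft

Whether the index tuple being inserted goes into the left node or the right node. If the new index tuple's insert position coincides with the split point, it can go into either node, so this member records which node was chosen.
firstright

The split point. The item at firstright and all items after it are moved to the new node.
best_delta

The absolute value of the space difference between the left and right nodes. For a split of a non-right-most node this equals abs(leftfree - rightfree), where leftfree and rightfree are the free space each node would have after the split. The goal of the split is to make this value as small as possible.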

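The _bt_findsplitloc function is declared as follows:

static OffsetNumber
_bt_findsplitloc(Relation rel,
                 Page page,
                 OffsetNumber newitemoff,
                 Size newitemsz,
                 bool *newitemonleft);

Parameters:

rel

Relation information (the index being modified).
page

The page to be split.
newitemoff

The insert position of the index tuple to be inserted.
newitemsz

The size of the index tuple to be inserted.
newitemonleft

Whether the index tuple to be inserted goes into the left node. This is an output parameter.

Return value:

The split point.

The implementation of _bt_findsplitloc is as follows: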

static OffsetNumber
_bt_findsplitloc(Relation rel,
                 Page page,
                 OffsetNumber newitemoff,
                 Size newitemsz,
                 bool *newitemonleft)
{
    BTPageOpaque opaque;
    OffsetNumber offnum;
    OffsetNumber maxoff;
    ItemId      itemid;
    FindSplitData state;
    int         leftspace,
                rightspace,
                goodenough,
                olddataitemstotal,
                olddataitemstoleft;
    bool        goodenoughfound;

    opaque = (BTPageOpaque) PageGetSpecialPointer(page);

    /* initialization of state omitted */

    /*
     * Scan through the data items and calculate space usage for a split at
     * each possible position.
     *
     * Note: every item of the page being split is tried as the split point.
     * For each candidate, the space balance between the left and right
     * nodes is computed in order to find the best split point.
     */
    olddataitemstoleft = 0;
    goodenoughfound = false;
    maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = P_FIRSTDATAKEY(opaque);
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        Size        itemsz;

        itemid = PageGetItemId(page, offnum);
        itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);

        /*
         * Will the new item go to left or right of split?
         */
        /* offnum > newitemoff: the new item goes into the left node */
        if (offnum > newitemoff)
            _bt_checksplitloc(&state, offnum, true,
                              olddataitemstoleft, itemsz);
        /* offnum < newitemoff: the new item goes into the right node */
        else if (offnum < newitemoff)
            _bt_checksplitloc(&state, offnum, false,
                              olddataitemstoleft, itemsz);
        else
        {
            /*
             * need to try it both ways!
             * offnum == newitemoff: it is not yet clear which node the new
             * item should go into, so both options are tried.
             */
            _bt_checksplitloc(&state, offnum, true,
                              olddataitemstoleft, itemsz);
            _bt_checksplitloc(&state, offnum, false,
                              olddataitemstoleft, itemsz);
        }

        /* Abort scan once we find a good-enough choice */
        if (state.have_split && state.best_delta <= goodenough)
        {
            goodenoughfound = true;
            break;
        }

        olddataitemstoleft += itemsz;
    }

    /*
     * If the new item goes as the last item, check for splitting so that all
     * the old items go to the left page and the new item goes to the right
     * page.
     *
     * Note: newitemoff > maxoff means the new item goes at the very end of
     * the node. In that case, if no good-enough split point has been found,
     * try keeping all the old data on the left node and putting only the
     * new item on the right node.
     */
    if (newitemoff > maxoff && !goodenoughfound)
        _bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);

    /*
     * I believe it is not possible to fail to find a feasible split, but just
     * in case ...
     */
    if (!state.have_split)
        elog(ERROR, "could not find a feasible split point for index \"%s\"",
             RelationGetRelationName(rel));

    *newitemonleft = state.newitemonleft;
    return state.firstright;
}

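The core of this function is _bt_checksplitloc, which is declared as follows:

static void
_bt_checksplitloc(FindSplitData *state,
                  OffsetNumber firstoldonright,
                  bool newitemonleft,
                  int olddataitemstoleft,
                  Size firstoldonrightsz);

Parameters:

state

The FindSplitData struct described above.
firstoldonright

The candidate split point.
newitemonleft

Whether the index tuple to be inserted should go into the left node or the right node for this candidate. This is an input parameter (it is passed by value). The chosen value is recorded in state->newitemonleft.
olddataitemstoleft

The total size of the old items that stay on the left node.
firstoldonrightsz

The size of the first old item on the right node, which is the size of the item at the split point.

Purpose:

The function evaluates a split at firstoldonright: it computes the free space of the left and right nodes after such a split and the difference between them, and records the result in state.

The implementation of _bt_checksplitloc is as follows: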

static void
_bt_checksplitloc(FindSplitData *state,
                  OffsetNumber firstoldonright,
                  bool newitemonleft,
                  int olddataitemstoleft,
                  Size firstoldonrightsz)
{
    int         leftfree,
                rightfree;
    Size        firstrightitemsz;
    bool        newitemisfirstonright;

    /* Is the new item going to be the first item on the right page? */
    newitemisfirstonright = (firstoldonright == state->newitemoff
                             && !newitemonleft);

    if (newitemisfirstonright)
        firstrightitemsz = state->newitemsz;
    else
        firstrightitemsz = firstoldonrightsz;

    /* Account for all the old tuples */
    leftfree = state->leftspace - olddataitemstoleft;
    rightfree = state->rightspace -
        (state->olddataitemstotal - olddataitemstoleft);

    /*
     * The first item on the right page becomes the high key of the left page;
     * therefore it counts against left space as well as right space.
     */
    leftfree -= firstrightitemsz;

    /* account for the new item */
    if (newitemonleft)
        leftfree -= (int) state->newitemsz;
    else
        rightfree -= (int) state->newitemsz;

    /*
     * If we are not on the leaf level, we will be able to discard the key
     * data from the first item that winds up on the right page.
     */
    if (!state->is_leaf)
        rightfree += (int) firstrightitemsz -
            (int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));

    /*
     * If feasible split point, remember best delta.
     * Key point 1
     */
    if (leftfree >= 0 && rightfree >= 0)
    {
        int         delta;

        if (state->is_rightmost)
        {
            /*
             * If splitting a rightmost page, try to put (100-fillfactor)% of
             * free space on left page. See comments for _bt_findsplitloc.
             * Key point 3
             */
            delta = (state->fillfactor * leftfree)
                - ((100 - state->fillfactor) * rightfree);
        }
        else
        {
            /* Otherwise, aim for equal free space on both sides */
            delta = leftfree - rightfree;
        }

        if (delta < 0)
            delta = -delta;

        /* Key point 2 */
        if (!state->have_split || delta < state->best_delta)
        {
            state->have_split = true;
            state->newitemonleft = newitemonleft;
            state->firstright = firstoldonright;
            state->best_delta = delta;
        }
    }
}

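Three points in the code above deserve attention:

Key point 1: the feasibility check, if (leftfree >= 0 && rightfree >= 0)

The function computes the free space of the left and right nodes after the split, leftfree and rightfree. This is the free space that remains after the new item has been accounted for. So if either leftfree or rightfree is < 0, this split would leave a node that cannot hold the new item, and it is not a feasible option. Only when both leftfree and rightfree are >= 0 is the option feasible, and only then is the difference between leftfree and rightfree computed in order to find the best split point.
Key point 2: the comparison delta < state->best_delta

state->best_delta records the space difference of the best split point found so far. If the space difference of the current candidate is smaller than best_delta, the current candidate becomes the best split point.
Key point 3: the state->is_rightmost branch

For a right_most node, the fill factor fillfactor must be taken into account when computing delta.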
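The delta formula is easier to follow with numbers plugged in. The helper below restates the two branches of the delta computation from _bt_checksplitloc as a standalone function; `toy_split_delta` is a hypothetical name used only for this illustration.

```c
#include <assert.h>

/*
 * Same arithmetic as the delta computation in _bt_checksplitloc,
 * restated as a standalone helper for illustration.
 */
int
toy_split_delta(int leftfree, int rightfree, int is_rightmost, int fillfactor)
{
    int delta;

    assert(fillfactor >= 0 && fillfactor <= 100);

    if (is_rightmost)
        delta = fillfactor * leftfree - (100 - fillfactor) * rightfree;
    else
        delta = leftfree - rightfree;

    return delta < 0 ? -delta : delta;
}
```

With fillfactor 90 on a rightmost page, delta reaches 0 when leftfree : rightfree = 10 : 90, for example leftfree = 800 and rightfree = 7200 (90 * 800 = 10 * 7200). An even split such as 4000 / 4000 scores 320000 on a rightmost page, so it loses to the lopsided one, while on any other page the same even split scores 0 and wins.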

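Splitting: _bt_split

Next we look at the implementation of the split phase. The points to note are:

The main flow of the split

To minimize the cost of recovery if the system fails during a split, the split is done as follows.
- Allocate a new page, rightpage, as the right node after the split.
- Allocate a temporary page, leftpage, to temporarily hold the data of the left node.
- Walk the original node, origpage, and, according to the split point, migrate its items into leftpage and rightpage.
Note that while these three steps run, the original B+ tree is not changed at all. If the system fails at this point, nothing needs to be recovered after restart. The only effect on the index is that the insert failed. These three steps are also the most time-consuming part of a node split.

After these steps, the contents of leftpage are copied into origpage, and then rightpage is added to the linked list. These two operations change the content and structure of the B+ tree, so WAL is needed to make them atomic.
Migrate versus copy

The words migrate and copy were both used above. Migrating means taking each item from the original node and re-running an insert into the new node. Copying means a direct memcpy of the page contents from one page to another.
About XLOG

How is the atomicity of a B+ tree split guaranteed? This is a very important question for the B+ tree index and is covered in a separate document.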
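The "build in a temporary page, then copy back" pattern can be sketched on a toy page as follows. `ToyPage` and `toy_split_via_temp` are hypothetical names, and WAL logging, locking and sibling links are deliberately left out. The point is only that the original page is not touched until all the item-by-item work has succeeded.

```c
#include <assert.h>
#include <string.h>

#define TOY_CAP 8   /* hypothetical page capacity, counted in items */

typedef struct ToyPage
{
    int nitems;
    int items[TOY_CAP];
} ToyPage;

/*
 * Split `orig` at `firstright` into `orig` (left half) and `right`.
 * Returns 1 on success. On failure returns 0 and leaves `orig` untouched.
 */
int
toy_split_via_temp(ToyPage *orig, ToyPage *right, int firstright)
{
    ToyPage left = {0};   /* temporary workspace, plays the role of leftpage */
    int i;

    /* Any failure here happens before orig has been modified. */
    if (firstright <= 0 || firstright >= orig->nitems)
        return 0;

    /* "Migrate": re-insert the items one by one into the two new halves. */
    right->nitems = 0;
    for (i = 0; i < orig->nitems; i++)
    {
        ToyPage *dst = (i < firstright) ? &left : right;

        dst->items[dst->nitems++] = orig->items[i];
    }

    /* "Copy": overwrite the original page with the finished left half. */
    memcpy(orig, &left, sizeof(ToyPage));
    return 1;
}
```

In the real function the final copy happens inside a critical section and is covered by a WAL record. The toy only shows the ordering of the work.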

/*
 *  _bt_split() -- split a page in the btree.
 *
 *      On entry, buf is the page to split, and is pinned and write-locked.
 *      firstright is the item index of the first item to be moved to the
 *      new right page.  newitemoff etc. tell us about the new item that
 *      must be inserted along with the data from the old page.
 *
 *      When splitting a non-leaf page, 'cbuf' is the left-sibling of the
 *      page we're inserting the downlink for.  This function will clear the
 *      INCOMPLETE_SPLIT flag on it, and release the buffer.
 *      (Author's note: I have not fully worked out this part yet.)
 *
 *      Returns the new right sibling of buf, pinned and write-locked.
 *      The pin and lock on buf are maintained.
 *      (The returned right sibling is the new page produced by the split.)
 */
static Buffer
_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
          OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
          bool newitemonleft)
{
    Buffer      rbuf;
    Page        origpage;
    Page        leftpage,
                rightpage;
    BlockNumber origpagenumber,
                rightpagenumber;
    BTPageOpaque ropaque,
                lopaque,
                oopaque;
    Buffer      sbuf = InvalidBuffer;
    Page        spage = NULL;
    BTPageOpaque sopaque = NULL;
    Size        itemsz;
    ItemId      itemid;
    IndexTuple  item;
    OffsetNumber leftoff,
                rightoff;
    OffsetNumber maxoff;
    OffsetNumber i;
    bool        isroot;
    bool        isleaf;

    /*
     * Acquire a new page to split into.
     * The new page holds the data split off; _bt_getbuf also pins and
     * locks it.
     */
    rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);

    /*
     * origpage is the original page to be split.  leftpage is a temporary
     * buffer that receives the left-sibling data, which will be copied back
     * into origpage on success.  rightpage is the new page that receives the
     * right-sibling data.  If we fail before reaching the critical section,
     * origpage hasn't been modified and leftpage is only workspace. In
     * principle we shouldn't need to worry about rightpage either, because it
     * hasn't been linked into the btree page structure; but to avoid leaving
     * possibly-confusing junk behind, we are careful to rewrite rightpage as
     * zeroes before throwing any error.
     *
     * In other words: before the critical section the split has not really
     * happened yet, so no special handling is needed on failure.
     */
    origpage = BufferGetPage(buf);
    leftpage = PageGetTempPage(origpage);
    rightpage = BufferGetPage(rbuf);

    origpagenumber = BufferGetBlockNumber(buf);
    rightpagenumber = BufferGetBlockNumber(rbuf);

    _bt_pageinit(leftpage, BufferGetPageSize(buf));
    /*
     * rightpage was already initialized by _bt_getbuf.
     * leftpage is a temporary page, so it has to be initialized here.
     */

    /*
     * Copy the original page's LSN into leftpage, which will become the
     * updated version of the page.  We need this because XLogInsert will
     * examine the LSN and possibly dump it in a page image.
     * (XLogInsert uses pd_lsn to decide whether to write the page image to
     * the xlog, which protects against partial writes.)
     */
    PageSetLSN(leftpage, PageGetLSN(origpage));

    /*
     * init btree private data
     * BTPageOpaque mainly records a node's left and right siblings, which
     * form the chain within each level of the B+ tree.
     */
    oopaque = (BTPageOpaque) PageGetSpecialPointer(origpage);
    lopaque = (BTPageOpaque) PageGetSpecialPointer(leftpage);
    ropaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);

    isroot = P_ISROOT(oopaque);
    isleaf = P_ISLEAF(oopaque);

    /*
     * if we're splitting this page, it won't be the root when we're done
     * (it may have been the root before the split)
     * also, clear the SPLIT_END and HAS_GARBAGE flags in both pages
     * (BTP_SPLIT_END marks the rightmost page of a split group,
     * BTP_HAS_GARBAGE marks a page that has LP_DEAD items)
     */
    lopaque->btpo_flags = oopaque->btpo_flags;
    lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
    ropaque->btpo_flags = lopaque->btpo_flags;
    /* set flag in left page indicating that the right page has no downlink */
    lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
    lopaque->btpo_prev = oopaque->btpo_prev;
    lopaque->btpo_next = rightpagenumber;
    /* note: leftpage is temporary, so origpagenumber must be used here */
    ropaque->btpo_prev = origpagenumber;
    ropaque->btpo_next = oopaque->btpo_next;
    lopaque->btpo.level = ropaque->btpo.level = oopaque->btpo.level;
    /* Since we already have write-lock on both pages, ok to read cycleid */
    lopaque->btpo_cycleid = _bt_vacuum_cycleid(rel);
    ropaque->btpo_cycleid = lopaque->btpo_cycleid;

    /*
     * Purpose: set the high key of the right node.
     *
     * If the page we're splitting is not the rightmost page at its level in
     * the tree, then the first entry on the page is the high key for the
     * page.  We need to copy that to the right half.  Otherwise (meaning the
     * rightmost page case), all the items on the right half will be user
     * data.
     *
     * If origpage is not the rightmost node on its level, the rightpage
     * produced by the split cannot be the rightmost node either. In that
     * case P_HIKEY holds the high key of rightpage and the first real key
     * of rightpage is stored at P_HIKEY + 1. rightpage inherits the high
     * key of origpage, so that key is copied over. If rightpage is the
     * rightmost node on its level, P_HIKEY holds a real key and there is
     * no high key to worry about.
     */
    rightoff = P_HIKEY;

    if (!P_RIGHTMOST(oopaque))
    {
        /* not the rightmost node */
        itemid = PageGetItemId(origpage, P_HIKEY);
        itemsz = ItemIdGetLength(itemid);
        item = (IndexTuple) PageGetItem(origpage, itemid);
        if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
                        false, false) == InvalidOffsetNumber)
        {
            memset(rightpage, 0, BufferGetPageSize(rbuf));
            elog(ERROR, "failed to add hikey to the right sibling"
                 " while splitting block %u of index \"%s\"",
                 origpagenumber, RelationGetRelationName(rel));
        }
        rightoff = OffsetNumberNext(rightoff);
    }

    /*
     * Purpose: set the high key of the left node.
     *
     * The "high key" for the new left page will be the first key that's going
     * to go into the new right page.  This might be either the existing data
     * item at position firstright, or the incoming tuple.
     *
     * After the split, the high key of origpage (the current leftpage) must
     * equal the min key of rightpage.
     */
    leftoff = P_HIKEY;
    if (!newitemonleft && newitemoff == firstright)
    {
        /* incoming tuple will become first on right page */
        itemsz = newitemsz;
        item = newitem;
    }
    else
    {
        /* existing item at firstright will become first on right page */
        itemid = PageGetItemId(origpage, firstright);
        itemsz = ItemIdGetLength(itemid);
        item = (IndexTuple) PageGetItem(origpage, itemid);
    }
    if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
                    false, false) == InvalidOffsetNumber)
    {
        memset(rightpage, 0, BufferGetPageSize(rbuf));
        elog(ERROR, "failed to add hikey to the left sibling"
             " while splitting block %u of index \"%s\"",
             origpagenumber, RelationGetRelationName(rel));
    }
    leftoff = OffsetNumberNext(leftoff);

    /*
     * Now transfer all the data items to the appropriate page.
     *
     * Note: we *must* insert at least the right page's items in item-number
     * order, for the benefit of _bt_restore_page().
     *
     * This is the actual split: walk all items of origpage and insert each
     * one into leftpage or rightpage as appropriate.
     */
    maxoff = PageGetMaxOffsetNumber(origpage);

    for (i = P_FIRSTDATAKEY(oopaque); i <= maxoff; i = OffsetNumberNext(i))
    {
        itemid = PageGetItemId(origpage, i);
        itemsz = ItemIdGetLength(itemid);
        item = (IndexTuple) PageGetItem(origpage, itemid);

        /* does new item belong before this one? */
        if (i == newitemoff)
        {
            if (newitemonleft)
            {
                if (!_bt_pgaddtup(leftpage, newitemsz, newitem, leftoff))
                {
                    memset(rightpage, 0, BufferGetPageSize(rbuf));
                    elog(ERROR, "failed to add new item to the left sibling"
                         " while splitting block %u of index \"%s\"",
                         origpagenumber, RelationGetRelationName(rel));
                }
                leftoff = OffsetNumberNext(leftoff);
            }
            else
            {
                if (!_bt_pgaddtup(rightpage, newitemsz, newitem, rightoff))
                {
                    memset(rightpage, 0, BufferGetPageSize(rbuf));
                    elog(ERROR, "failed to add new item to the right sibling"
                         " while splitting block %u of index \"%s\"",
                         origpagenumber, RelationGetRelationName(rel));
                }
                rightoff = OffsetNumberNext(rightoff);
            }
        }

        /*
         * Note the subtle point in the code above. newitem was not inserted
         * before the split, so it does not belong to any node yet. But
         * newitemoff tells us where in origpage newitem would have gone if
         * there had been enough space. So when i == newitemoff we know it
         * is time to "migrate" newitem.
         *
         * After newitem has been "migrated" there is no continue, and there
         * must not be one, because the item that was originally at position
         * i in origpage still has to be migrated.
         */

        /* decide which page to put it on */
        if (i < firstright)
        {
            if (!_bt_pgaddtup(leftpage, itemsz, item, leftoff))
            {
                memset(rightpage, 0, BufferGetPageSize(rbuf));
                elog(ERROR, "failed to add old item to the left sibling"
                     " while splitting block %u of index \"%s\"",
                     origpagenumber, RelationGetRelationName(rel));
            }
            leftoff = OffsetNumberNext(leftoff);
        }
        else
        {
            if (!_bt_pgaddtup(rightpage, itemsz, item, rightoff))
            {
                memset(rightpage, 0, BufferGetPageSize(rbuf));
                elog(ERROR, "failed to add old item to the right sibling"
                     " while splitting block %u of index \"%s\"",
                     origpagenumber, RelationGetRelationName(rel));
            }
            rightoff = OffsetNumberNext(rightoff);
        }
    }

    /* cope with possibility that newitem goes at the end */
    if (i <= newitemoff)
    {
        /*
         * Can't have newitemonleft here; that would imply we were told to put
         * *everything* on the left page, which cannot fit (if it could, we'd
         * not be splitting the page).
         *
         * This case occurs when newitem is larger than every item in
         * origpage. newitem then goes at the far right of rightpage.
         */
        Assert(!newitemonleft);
        if (!_bt_pgaddtup(rightpage, newitemsz, newitem, rightoff))
        {
            memset(rightpage, 0, BufferGetPageSize(rbuf));
            elog(ERROR, "failed to add new item to the right sibling"
                 " while splitting block %u of index \"%s\"",
                 origpagenumber, RelationGetRelationName(rel));
        }
        rightoff = OffsetNumberNext(rightoff);
    }

    /*
     * We have to grab the right sibling (if any) and fix the prev pointer
     * there. We are guaranteed that this is deadlock-free since no other
     * writer will be holding a lock on that page and trying to move left, and
     * all readers release locks on a page before trying to fetch its
     * neighbors.
     *
     * We now lock the right sibling of origpage while already holding the
     * lock on origpage, so a deadlock is conceivable here. To improve
     * concurrency, PostgreSQL does not lock the whole index during a split,
     * so other processes can query the index at the same time. Consider
     * this statement running during the split:
     *   select * from table where id < 10000 order by id desc;
     * Because the result is ordered by id descending, the scan positions
     * itself at the item with value 10000 and then walks backward. Scanning
     * the items of a page requires locking that page. When a page is
     * finished, the scan "switches" to the left sibling of the current
     * page, locks it, and continues. If the switch locked the left sibling
     * first and only then unlocked the current page, it could deadlock
     * with the split. So the switch must release the lock on the current
     * page before locking the neighbor, as the comment above states.
     *
     * Otherwise a deadlock would be possible even without a split, for
     * example between two processes running these two statements:
     *   select * from table where id < 10000 order by id desc;
     *   select * from table where id < 10000 order by id asc;
     */
    if (!P_RIGHTMOST(oopaque))
    {
        sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
        spage = BufferGetPage(sbuf);
        sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
        if (sopaque->btpo_prev != origpagenumber)
        {
            memset(rightpage, 0, BufferGetPageSize(rbuf));
            elog(ERROR, "right sibling's left-link doesn't match: "
                 "block %u links to %u instead of expected %u in index \"%s\"",
                 oopaque->btpo_next, sopaque->btpo_prev, origpagenumber,
                 RelationGetRelationName(rel));
        }

        /*
         * Check to see if we can set the SPLIT_END flag in the right-hand
         * split page; this can save some I/O for vacuum since it need not
         * proceed to the right sibling.  We can set the flag if the right
         * sibling has a different cycleid: that means it could not be part of
         * a group of pages that were all split off from the same ancestor
         * page.  If you're confused, imagine that page A splits to A B and
         * then again, yielding A C B, while vacuum is in progress.  Tuples
         * originally in A could now be in either B or C, hence vacuum must
         * examine both pages.  But if D, our right sibling, has a different
         * cycleid then it could not contain any tuples that were in A when
         * the vacuum started.
         *
         * (Author's note: this appears to be a vacuum optimization. I have
         * not fully understood it yet and am setting it aside for now.)
         */
        if (sopaque->btpo_cycleid != ropaque->btpo_cycleid)
            ropaque->btpo_flags |= BTP_SPLIT_END;
    }

    /*
     * Right sibling is locked, new siblings are prepared, but original page
     * is not updated yet.
     *
     * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
     * not starting the critical section till here because we haven't been
     * scribbling on the original page yet; see comments above.
     */
    START_CRIT_SECTION();

    /*
     * Purpose: copy leftpage back into origpage. leftpage is released once
     * the copy is done.
     *
     * By here, the original data page has been split into two new halves, and
     * these are correct.  The algorithm requires that the left page never
     * move during a split, so we copy the new left page back on top of the
     * original.  Note that this is not a waste of time, since we also require
     * (in the page management code) that the center of a page always be
     * clean, and the most efficient way to guarantee this is just to compact
     * the data by reinserting it into a new left page.  (XXX the latter
     * comment is probably obsolete; but in any case it's good to not scribble
     * on the original page until we enter the critical section.)
     *
     * We need to do this before writing the WAL record, so that XLogInsert
     * can WAL log an image of the page if necessary.
     */
    PageRestoreTempPage(leftpage, origpage);
    /*
     * leftpage, lopaque must not be used below here
     * (the temporary space behind leftpage is freed when the function
     * above returns)
     */

    MarkBufferDirty(buf);
    MarkBufferDirty(rbuf);

    /* add rightpage to the linked list */
    if (!P_RIGHTMOST(ropaque))
    {
        sopaque->btpo_prev = rightpagenumber;
        MarkBufferDirty(sbuf);
    }

    /*
     * Clear INCOMPLETE_SPLIT flag on child if inserting the new item finishes
     * a split.
     */
    if (!isleaf)
    {
        Page        cpage = BufferGetPage(cbuf);
        BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);

        cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
        MarkBufferDirty(cbuf);
    }

    /* XLOG stuff */
    if (RelationNeedsWAL(rel))
    {
        xl_btree_split xlrec;
        uint8       xlinfo;
        XLogRecPtr  recptr;

        xlrec.level = ropaque->btpo.level;  /* level of the split node in the tree */
        xlrec.firstright = firstright;      /* split position */
        xlrec.newitemoff = newitemoff;      /* insert position of newitem */

        XLogBeginInsert();
        /* register the split information: xlrec */
        XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);

        /* register buf, rbuf, sbuf, cbuf */
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
        XLogRegisterBuffer(1, rbuf, REGBUF_WILL_INIT);
        /*
         * Log the right sibling, because we've changed its prev-pointer.
         * (This applies only when the split node was not the rightmost.)
         */
        if (!P_RIGHTMOST(ropaque))
            XLogRegisterBuffer(2, sbuf, REGBUF_STANDARD);
        if (BufferIsValid(cbuf))
            XLogRegisterBuffer(3, cbuf, REGBUF_STANDARD);

        /*
         * Log the new item, if it was inserted on the left page. (If it was
         * put on the right page, we don't need to explicitly WAL log it
         * because it's included with all the other items on the right page.)
         * Show the new item as belonging to the left page buffer, so that it
         * is not stored if XLogInsert decides it needs a full-page image of
         * the left page.  We store the offset anyway, though, to support
         * archive compression of these records.
         */
        if (newitemonleft)
            XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));

        /* Log left page */
        if (!isleaf)
        {
            /*
             * We must also log the left page's high key, because the right
             * page's leftmost key is suppressed on non-leaf levels.  Show it
             * as belonging to the left page buffer, so that it is not stored
             * if XLogInsert decides it needs a full-page image of the left
             * page.
             */
            itemid = PageGetItemId(origpage, P_HIKEY);
            item = (IndexTuple) PageGetItem(origpage, itemid);
            XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
        }

        /*
         * Log the contents of the right page in the format understood by
         * _bt_restore_page(). We set lastrdata->buffer to InvalidBuffer,
         * because we're going to recreate the whole page anyway, so it should
         * never be stored by XLogInsert.
         *
         * Direct access to page is not good but faster - we should implement
         * some new func in page API.  Note we only store the tuples
         * themselves, knowing that they were inserted in item-number order
         * and so the item pointers can be reconstructed.  See comments for
         * _bt_restore_page().
         */
        XLogRegisterBufData(1,
                            (char *) rightpage + ((PageHeader) rightpage)->pd_upper,
                            ((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);

        if (isroot)
            xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L_ROOT : XLOG_BTREE_SPLIT_R_ROOT;
        else
            xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
        recptr = XLogInsert(RM_BTREE_ID, xlinfo);

        PageSetLSN(origpage, recptr);
        PageSetLSN(rightpage, recptr);
        if (!P_RIGHTMOST(ropaque))
        {
            PageSetLSN(spage, recptr);
        }
        if (!isleaf)
        {
            PageSetLSN(BufferGetPage(cbuf), recptr);
        }
    }

    END_CRIT_SECTION();

    /*
     * release the old right sibling
     * (drops the lock on the original right node)
     */
    if (!P_RIGHTMOST(ropaque))
        _bt_relbuf(rel, sbuf);

    /* release the child */
    if (!isleaf)
        _bt_relbuf(rel, cbuf);

    /* split's done */
    return rbuf;
}

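Writing the parent node: _bt_insert_parent

The last step of a B+ tree split is to combine the new node's min key and block number into an index tuple and insert it into the parent node. The operation itself is very simple. The hard part is working out which node is the parent of the current node. If the node being split is the root, it has no parent, so a new node has to be allocated as the parent. What if the node being split is not the root? Locating the insert position is itself a top-down search, so if we record the path through the B+ tree during that search, the path leads us straight to the parent of the current node. The struct that records the path is BTStack, defined as follows:

typedef struct BTStackData
{
    BlockNumber bts_blkno;
    OffsetNumber bts_offset;
    IndexTupleData bts_btentry;
    struct BTStackData *bts_parent;
} BTStackData;

typedef BTStackData *BTStack;

bts_blkno

The block number of the current node.
bts_offset

The offset of the current index tuple.
bts_btentry

The current index tuple.
bts_parent

Pointer to the parent node's entry.

Now suppose we insert element 2 into the B+ tree of Figure 7. After the insert position has been located, the BTStack looks as shown in Figure 9:

Figure 9

So if block1 splits, BTStack makes it easy to see that the index tuple for the new node, <min key, new blockno>, should be inserted into node 3 at offset 1.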
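The idea of recording the descent path can be sketched with a simplified stack. `ToyStack` and its helpers are hypothetical names. The toy stores a plain child block number where the real struct keeps a copy of the downlink tuple, and it uses 0-based offsets, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* One entry per internal page visited on the way down. */
typedef struct ToyStack
{
    int blkno;                /* page visited during the descent */
    int offset;               /* 0-based position of the downlink followed */
    int child;                /* block number that downlink pointed to */
    struct ToyStack *parent;  /* entry for the level above, or NULL */
} ToyStack;

/* Record one step of the descent; the new entry sits below `parent`. */
ToyStack *
toy_stack_push(ToyStack *parent, int blkno, int offset, int child)
{
    ToyStack *entry = malloc(sizeof(*entry));

    if (entry == NULL)
        abort();
    entry->blkno = blkno;
    entry->offset = offset;
    entry->child = child;
    entry->parent = parent;
    return entry;
}

/* Number of levels recorded, counting from `top` up to the root entry. */
int
toy_stack_depth(const ToyStack *top)
{
    int depth = 0;

    for (; top != NULL; top = top->parent)
        depth++;
    return depth;
}

/* Free the whole chain, starting from the lowest entry. */
void
toy_stack_free(ToyStack *top)
{
    while (top != NULL)
    {
        ToyStack *up = top->parent;

        free(top);
        top = up;
    }
}
```

When a leaf splits, the lowest entry names the page that should receive the new downlink. If that page splits in turn, the entry above it is used, and so on up the chain.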

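But is it really that simple? Is BTStack really reliable? The answer is no. There are two cases in which BTStack becomes unreliable.

Case 1

BTStack is built in _bt_search. Recall the basic insert flow from "B+ Tree Index: Insertion": _bt_search finds the leaf node into which the index tuple will be inserted. Then the following two lines turn the shared lock on the leaf node into an exclusive lock.

/* trade in our read lock for a write lock */
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBuffer(buf, BT_WRITE);

As we said there, in a multi-process environment, once a node has been unlocked another process can take the exclusive lock on it and perform an insert. That insert can split the node, and the split writes to the parent node. So after taking the exclusive lock on the leaf node, _bt_moveright is called to find the node into which the index tuple really has to be inserted. _bt_moveright only moves right along the leaf level. The parent entries do not follow, so after _bt_moveright the BTStack is no longer reliable. Figure 10 shows the situation:

Figure 10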
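The move-right step can be sketched as a loop over sibling links. `ToyNode` and `toy_move_right` are hypothetical names, and pages are plain array slots indexed by block number. The sketch only follows the right link while the key is strictly greater than the high key, so a key equal to the high key stays on the current page here.

```c
#include <assert.h>

typedef struct ToyNode
{
    int has_high_key;   /* 0 for the rightmost page on its level */
    int high_key;
    int next;           /* block number of the right sibling */
} ToyNode;

/*
 * Starting at block `blkno`, follow right links until reaching a page
 * whose key range can contain `key`. Returns that page's block number.
 */
int
toy_move_right(const ToyNode *pages, int blkno, int key)
{
    while (pages[blkno].has_high_key && key > pages[blkno].high_key)
        blkno = pages[blkno].next;
    return blkno;
}
```

Note that the function returns a different leaf without touching any record of the parent level, which is exactly why a stack built during the descent can end up pointing at the wrong downlink.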

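Case 2

In "B+ Tree Index: Insertion" we said that if the index tuple to be inserted equals the high key of the current node, it can go either into the current node or into the current node's right sibling. If inserting into the current node would cause a split, the tuple goes into the right sibling instead. The target node has changed, so BTStack is naturally no longer reliable.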

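_bt_getstackbuf

Since BTStack is not reliable, we need a mechanism that checks whether it is still correct and repairs it if it is not. That mechanism is _bt_getstackbuf. The idea is simple. The stack entry remembers a parent page (bts_blkno), a position on that page (bts_offset), and a copy of the downlink that was followed (bts_btentry), which identifies the child block (call it blockno child). _bt_getstackbuf reads the parent page and checks whether the item at bts_offset still points to blockno child. If it does not, the parent node has changed. In that case it scans the parent page starting at bts_offset, first forward (toward higher offsets) looking for the item that points to blockno child, and if that fails, backward from bts_offset. If the item is not on this page, it moves right to the next page and repeats, until it finds the parent page and the item that correspond to the child.

Note

Why scan forward from bts_offset first and then backward? Only splits and merges can make BTStack unreliable, and both are fairly rare, so usually the item at bts_offset still points to the expected child. Even when it does not, the right item should be close by. And because splits are more likely than merges, scanning forward first and then backward is a sensible order.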
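The search order can be sketched on a plain array of child block numbers. `toy_find_downlink` is a hypothetical helper, offsets are 0-based, and moving on to the right sibling page is left to the caller.

```c
#include <assert.h>

/*
 * children[i] is the child block that item i of the parent page points to.
 * Search for `child`, starting at the remembered position `start`:
 * first toward higher offsets, then toward lower ones.
 * Returns the item index, or -1 if the downlink is not on this page.
 */
int
toy_find_downlink(const int *children, int nitems, int start, int child)
{
    int i;

    /* Clamp a stale start position into the valid range. */
    if (start < 0)
        start = 0;
    if (start > nitems)
        start = nitems;

    /* Scan to the right first: splits push downlinks rightward. */
    for (i = start; i < nitems; i++)
    {
        if (children[i] == child)
            return i;
    }

    /* Then scan to the left. */
    for (i = start - 1; i >= 0; i--)
    {
        if (children[i] == child)
            return i;
    }

    return -1;   /* not here: the caller would try the right sibling page */
}
```

Both loops together visit every item on the page exactly once. The order only affects how quickly the common cases are found.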

The implementation of _bt_getstackbuf is as follows:

Buffer
_bt_getstackbuf(Relation rel, BTStack stack, int access)
{
    BlockNumber blkno;
    OffsetNumber start;

    blkno = stack->bts_blkno;
    start = stack->bts_offset;

    for (;;)
    {
        Buffer      buf;
        Page        page;
        BTPageOpaque opaque;

        buf = _bt_getbuf(rel, blkno, access);
        page = BufferGetPage(buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        if (access == BT_WRITE && P_INCOMPLETE_SPLIT(opaque))
        {
            _bt_finish_split(rel, buf, stack->bts_parent);
            continue;
        }

        if (!P_IGNORE(opaque))
        {
            OffsetNumber offnum,
                        minoff,
                        maxoff;
            ItemId      itemid;
            IndexTuple  item;

            minoff = P_FIRSTDATAKEY(opaque);
            maxoff = PageGetMaxOffsetNumber(page);

            /*
             * start = InvalidOffsetNumber means "search the whole page". We
             * need this test anyway due to possibility that page has a high
             * key now when it didn't before.
             */
            if (start < minoff)
                start = minoff;

            /*
             * Need this check too, to guard against possibility that page
             * split since we visited it originally.
             */
            if (start > maxoff)
                start = OffsetNumberNext(maxoff);

            /*
             * These loops will check every item on the page --- but in an
             * order that's attuned to the probability of where it actually
             * is.  Scan to the right first, then to the left.
             *
             * First scan forward.
             */
            for (offnum = start;
                 offnum <= maxoff;
                 offnum = OffsetNumberNext(offnum))
            {
                itemid = PageGetItemId(page, offnum);
                item = (IndexTuple) PageGetItem(page, itemid);
                if (BTEntrySame(item, &stack->bts_btentry))
                {
                    /* Return accurate pointer to where link is now */
                    stack->bts_blkno = blkno;
                    stack->bts_offset = offnum;
                    return buf;
                }
            }

            /* Then scan backward. */
            for (offnum = OffsetNumberPrev(start);
                 offnum >= minoff;
                 offnum = OffsetNumberPrev(offnum))
            {
                itemid = PageGetItemId(page, offnum);
                item = (IndexTuple) PageGetItem(page, itemid);
                if (BTEntrySame(item, &stack->bts_btentry))
                {
                    /* Return accurate pointer to where link is now */
                    stack->bts_blkno = blkno;
                    stack->bts_offset = offnum;
                    return buf;
                }
            }
        }

        /*
         * The item we're looking for moved right at least one page.
         */
        if (P_RIGHTMOST(opaque))
        {
            _bt_relbuf(rel, buf);
            return InvalidBuffer;
        }
        blkno = opaque->btpo_next;
        start = InvalidOffsetNumber;
        _bt_relbuf(rel, buf);
    }
}

The implementation of _bt_insert_parent is as follows:

static void
_bt_insert_parent(Relation rel,
                  Buffer buf,
                  Buffer rbuf,
                  BTStack stack,
                  bool is_root,
                  bool is_only)
{
    /*
     * Here we have to do something Lehman and Yao don't talk about: deal with
     * a root split and construction of a new root.  If our stack is empty
     * then we have just split a node on what had been the root level when we
     * descended the tree.  If it was still the root then we perform a
     * new-root construction.  If it *wasn't* the root anymore, search to find
     * the next higher level that someone constructed meanwhile, and find the
     * right place to insert as for the normal case.
     *
     * If we have to search for the parent level, we do so by re-descending
     * from the root.  This is not super-efficient, but it's rare enough not
     * to matter.
     *
     * Case: the split page is the root.
     */
    if (is_root)
    {
        Buffer      rootbuf;

        Assert(stack == NULL);
        Assert(is_only);
        /* create a new root node and update the metapage */
        rootbuf = _bt_newroot(rel, buf, rbuf);
        /* release the split buffers */
        _bt_relbuf(rel, rootbuf);
        _bt_relbuf(rel, rbuf);
        _bt_relbuf(rel, buf);
    }
    else
    {
        /* Case: the split page is not the root. */
        BlockNumber bknum = BufferGetBlockNumber(buf);
        BlockNumber rbknum = BufferGetBlockNumber(rbuf);
        Page        page = BufferGetPage(buf);
        IndexTuple  new_item;
        BTStackData fakestack;
        IndexTuple  ritem;
        Buffer      pbuf;

        if (stack == NULL)
        {
            BTPageOpaque lpageop;

            elog(DEBUG2, "concurrent ROOT page split");
            lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
            /* Find the leftmost page at the next level up */
            pbuf = _bt_get_endpoint(rel, lpageop->btpo.level + 1, false,
                                    NULL);
            /* Set up a phony stack entry pointing there */
            stack = &fakestack;
            stack->bts_blkno = BufferGetBlockNumber(pbuf);
            stack->bts_offset = InvalidOffsetNumber;
            /* bts_btentry will be initialized below */
            stack->bts_parent = NULL;
            _bt_relbuf(rel, pbuf);
        }

        /* get high key from left page == lowest key on new right page */
        ritem = (IndexTuple) PageGetItem(page,
                                         PageGetItemId(page, P_HIKEY));

        /* form an index tuple that points at the new right page */
        new_item = CopyIndexTuple(ritem);
        ItemPointerSet(&(new_item->t_tid), rbknum, P_HIKEY);

        /*
         * Find the parent buffer and get the parent page.
         *
         * Oops - if we were moved right then we need to change stack item! We
         * want to find parent pointing to where we are, right ?  - vadim
         * 05/27/97
         */
        ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
        pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);

        /*
         * Now we can unlock the right child. The left child will be unlocked
         * by _bt_insertonpg().
         */
        _bt_relbuf(rel, rbuf);

        /* Check for error only after writing children */
        if (pbuf == InvalidBuffer)
            elog(ERROR, "failed to re-find parent key in index \"%s\" for split pages %u/%u",
                 RelationGetRelationName(rel), bknum, rbknum);

        /* Recursively update the parent */
        _bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
                       new_item, stack->bts_offset + 1,
                       is_only);

        /* be tidy */
        pfree(new_item);
    }
}

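The point to note here is the if (stack == NULL) check inside the else branch: how can stack be NULL when the split page is not the root? Consider the following scenario. Process 1 wants to insert 2 into the tree shown in Figure 1. It first descends with _bt_search to locate the target page; because the leaf in Figure 1 is also the root, the resulting stack is NULL. Process 1 then tries to lock block1, but another process is inserting into block1 at that moment, so process 1 has to wait. By the time the other process unlocks block1 and process 1 acquires the lock, the B+ tree has changed into the shape shown in Figure 11: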

Figure 11

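At this point 2 still belongs in block1, and inserting it causes block1 to split. After the split, _bt_insert_parent finds that stack is empty even though block1 is no longer the root. What then? Following the idea behind _bt_getstackbuf described earlier, we should scan all the items on Level 2 (the level directly above block1) and find the one whose child block number is block1. So PostgreSQL builds a fakestack that points at the first element of the leftmost page on Level 2, and then calls _bt_getstackbuf to find the real parent.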
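
The fallback can be illustrated with a minimal in-memory sketch. The types and helpers below (ToyNode, toy_get_endpoint, toy_find_parent) are hypothetical and only mimic the roles of _bt_get_endpoint and _bt_getstackbuf; there are no buffers, locks or concurrent splits. The idea is: descend from the current root along the leftmost children until reaching the level just above the split page, then walk that level to the right looking for the downlink.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FANOUT 4

/* Toy tree node: level 0 is the leaf level. */
typedef struct ToyNode
{
    int             level;
    int             nchildren;
    struct ToyNode *child[TOY_FANOUT];
    struct ToyNode *right;      /* right sibling on the same level */
} ToyNode;

/* Leftmost node on `level`, found by descending from the root. */
static ToyNode *
toy_get_endpoint(ToyNode *root, int level)
{
    ToyNode *n = root;

    while (n != NULL && n->level > level)
        n = (n->nchildren > 0) ? n->child[0] : NULL;
    return (n != NULL && n->level == level) ? n : NULL;
}

/*
 * Find the parent of `target` without a descent stack: start from the
 * leftmost node one level up (the "fake stack") and move right.
 */
static ToyNode *
toy_find_parent(ToyNode *root, const ToyNode *target)
{
    ToyNode *p;
    int      i;

    assert(target != NULL);
    for (p = toy_get_endpoint(root, target->level + 1); p != NULL; p = p->right)
    {
        for (i = 0; i < p->nchildren; i++)
        {
            if (p->child[i] == target)
                return p;
        }
    }
    return NULL;                /* no such level, or downlink not found */
}
```

As the source comment in _bt_insert_parent notes, re-descending from the root is not especially efficient, but the situation is rare enough that it does not matter.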

Additional notes

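Note the statement stack->bts_offset = InvalidOffsetNumber; in the fake-stack setup. InvalidOffsetNumber is used here to mean "the first element on the page", because at this point we do not know whether the page is a right-most page. A right-most page has no high key, so its first key is the first item; a non-right-most page has a high key, so its first key is the second item. The test if (start < minoff) start = minoff; in _bt_getstackbuf above is what handles a bts_offset of InvalidOffsetNumber: the invalid offset is smaller than any real offset, so the search always starts at the first data key.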
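
A small sketch of that clamping logic follows. The constants are redefined locally and are intended to mirror PostgreSQL's values (InvalidOffsetNumber = 0, P_HIKEY = 1, P_FIRSTKEY = 2); the helper names toy_first_data_key and toy_clamp_start are hypothetical and exist only for this illustration.

```c
#include <assert.h>

typedef unsigned short OffsetNumber;

/* Local copies for illustration; assumed to match PostgreSQL's values. */
#define InvalidOffsetNumber ((OffsetNumber) 0)
#define P_HIKEY             ((OffsetNumber) 1)
#define P_FIRSTKEY          ((OffsetNumber) 2)

/* First data item: slot 1 on a right-most page, slot 2 otherwise. */
static OffsetNumber
toy_first_data_key(int is_rightmost)
{
    return is_rightmost ? P_HIKEY : P_FIRSTKEY;
}

/* Same two range checks that _bt_getstackbuf applies to its start hint. */
static OffsetNumber
toy_clamp_start(OffsetNumber start, int is_rightmost, OffsetNumber maxoff)
{
    OffsetNumber minoff = toy_first_data_key(is_rightmost);

    assert(maxoff >= minoff);   /* toy precondition: page has data items */

    if (start < minoff)         /* covers InvalidOffsetNumber */
        start = minoff;
    if (start > maxoff)         /* page shrank since we last saw it */
        start = (OffsetNumber) (maxoff + 1);
    return start;
}
```

With this, the caller never needs to know whether the page has a high key: passing InvalidOffsetNumber yields slot 1 on a right-most page and slot 2 on any other page.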

Another key aspect of _bt_insert_parent is concurrency control, which is covered in a separate, later article.
