PostgreSQL TableAM: HeapAM Parallel Table Scan

TableAM Parallel table scan

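The TableAM functions related to parallel table scan are listed below. The actual scan work still relies on the scan-related functions covered in "PostgreSQL TableAM: HeapAM Scans". The functions listed here only handle the initialization work for a parallelized scan.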

    /* ------------------------------------------------------------------------
     * Parallel table scan related functions.
     * ------------------------------------------------------------------------
     */

    /*
     * Estimate the size of shared memory needed for a parallel scan of this
     * relation. The snapshot does not need to be accounted for.
     */
    Size        (*parallelscan_estimate) (Relation rel);

    /*
     * Initialize ParallelTableScanDesc for a parallel scan of this relation.
     * `pscan` will be sized according to parallelscan_estimate() for the same
     * relation.
     */
    Size        (*parallelscan_initialize) (Relation rel,
                                            ParallelTableScanDesc pscan);

    /*
     * Reinitialize `pscan` for a new scan. `rel` will be the same relation as
     * when `pscan` was initialized by parallelscan_initialize.
     */
    void        (*parallelscan_reinitialize) (Relation rel,
                                              ParallelTableScanDesc pscan);

table_parallelscan_estimate

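table_parallelscan_estimate is defined in src/backend/access/table/tableam.c. When an MVCC snapshot is used, it estimates the space needed to serialize the snapshot, then calls the TableAM's parallelscan_estimate callback to estimate the space the access method needs to support a parallel scan. The result is the sum of the two.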

Size
table_parallelscan_estimate(Relation rel, Snapshot snapshot)
{
    Size        sz = 0;

    if (IsMVCCSnapshot(snapshot))
        sz = add_size(sz, EstimateSnapshotSpace(snapshot));
    else
        Assert(snapshot == SnapshotAny);

    sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));

    return sz;
}

ExecSeqScanEstimate in src/backend/executor/nodeSeqscan.c calls table_parallelscan_estimate to compute the space that the SeqScanState needs for the parallel scan.
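The estimate is just an overflow-checked sum of two sizes. The following standalone sketch mimics that shape outside PostgreSQL. All `toy_*` names are hypothetical, and overflow is reported through a return value here, whereas PostgreSQL's add_size raises an error instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for add_size(): fails instead of wrapping around. */
static bool
toy_add_size(size_t a, size_t b, size_t *out)
{
    if (b > SIZE_MAX - a)
        return false;           /* a + b would overflow size_t */
    *out = a + b;
    return true;
}

/*
 * Hypothetical stand-in for table_parallelscan_estimate(): the snapshot
 * space is only counted for an MVCC snapshot, then the access method's own
 * descriptor size (am_size) is added on top.
 */
static bool
toy_parallelscan_estimate(size_t am_size, bool mvcc_snapshot,
                          size_t snapshot_size, size_t *out)
{
    size_t      sz = 0;

    if (mvcc_snapshot && !toy_add_size(sz, snapshot_size, &sz))
        return false;
    return toy_add_size(sz, am_size, out);
}
```

With a 40-byte descriptor and a 100-byte serialized snapshot, the sketch yields 140 bytes for an MVCC snapshot and 40 bytes otherwise.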

table_parallelscan_initialize

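table_parallelscan_initialize is defined in src/backend/access/table/tableam.c. It first calls the TableAM's parallelscan_initialize callback, which initializes the 'subclass' of ParallelTableScanDesc and returns the size of that 'subclass' struct. That size is stored as phs_snapshot_off. If the snapshot is an MVCC snapshot, it is then serialized into the memory starting at (char *) pscan + pscan->phs_snapshot_off.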

void
table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
                              Snapshot snapshot)
{
    Size        snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);

    pscan->phs_snapshot_off = snapshot_off;

    if (IsMVCCSnapshot(snapshot))
    {
        SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
        pscan->phs_snapshot_any = false;
    }
    else
    {
        pscan->phs_snapshot_any = true;
    }
}

ExecSeqScanInitializeDSM in src/backend/executor/nodeSeqscan.c calls table_parallelscan_initialize to initialize the ParallelTableScanDesc or its 'subclass'.
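The resulting memory layout (AM-specific descriptor first, serialized snapshot right behind it) can be illustrated with a small standalone sketch. It is not PostgreSQL code: the `Toy*` types and `toy_*` functions are hypothetical, malloc stands in for the shared memory segment, and a plain byte string stands in for the serialized snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical base descriptor, modelled on ParallelTableScanDescData. */
typedef struct ToyScanDesc
{
    bool        snapshot_any;   /* true: no snapshot bytes were stored */
    size_t      snapshot_off;   /* offset of the snapshot bytes */
} ToyScanDesc;

/* Hypothetical 'subclass': the base struct must be the first member. */
typedef struct ToyBlockScanDesc
{
    ToyScanDesc base;
    uint32_t    nblocks;
} ToyBlockScanDesc;

/* AM callback: fill in the subclass fields and report the subclass size. */
static size_t
toy_am_initialize(ToyScanDesc *pscan, uint32_t nblocks)
{
    ToyBlockScanDesc *bpscan = (ToyBlockScanDesc *) pscan;

    bpscan->nblocks = nblocks;
    return sizeof(ToyBlockScanDesc);
}

/* Generic wrapper: place the snapshot bytes right after the subclass. */
static void
toy_initialize(ToyScanDesc *pscan, uint32_t nblocks,
               const void *snapshot, size_t snapshot_len)
{
    pscan->snapshot_off = toy_am_initialize(pscan, nblocks);
    if (snapshot != NULL)
    {
        memcpy((char *) pscan + pscan->snapshot_off, snapshot, snapshot_len);
        pscan->snapshot_any = false;
    }
    else
        pscan->snapshot_any = true;
}

/* Estimate, allocate (malloc replaces shared memory here), initialize. */
static ToyScanDesc *
toy_create(uint32_t nblocks, const void *snapshot, size_t snapshot_len)
{
    size_t      sz = sizeof(ToyBlockScanDesc) + (snapshot ? snapshot_len : 0);
    ToyScanDesc *pscan = malloc(sz);

    if (pscan != NULL)
        toy_initialize(pscan, nblocks, snapshot, snapshot_len);
    return pscan;
}

/* Reader side: locate the stored bytes again, or NULL if none were stored. */
static const char *
toy_snapshot_bytes(const ToyScanDesc *pscan)
{
    return pscan->snapshot_any ? NULL
        : (const char *) pscan + pscan->snapshot_off;
}
```

Because the generic code only knows the base struct, the offset returned by the AM callback is the one piece of information it needs to find the snapshot regardless of how large the 'subclass' is.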

table_parallelscan_reinitialize

table_parallelscan_reinitialize is defined in src/include/access/tableam.h and simply calls the TableAM's parallelscan_reinitialize callback.

/*
 * Restart a parallel scan.  Call this in the leader process.  Caller is
 * responsible for making sure that all workers have finished the scan
 * beforehand.
 */
static inline void
table_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
{
    rel->rd_tableam->parallelscan_reinitialize(rel, pscan);
}

ExecSeqScanReInitializeDSM in src/backend/executor/nodeSeqscan.c calls table_parallelscan_reinitialize.

Parallel table scan

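table_beginscan_parallel is called by the executor. If a snapshot was serialized into shared memory, it restores and registers that snapshot and adds the SO_TEMP_SNAPSHOT flag (endscan unregisters the snapshot once the scan is finished). Otherwise it uses the SnapshotAny passed by the caller (any tuple is visible). Finally it calls the scan_begin callback defined by the TableAM.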

TableScanDesc
table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan)
{
    Snapshot    snapshot;
    uint32      flags = SO_TYPE_SEQSCAN |
        SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;

    if (!parallel_scan->phs_snapshot_any)
    {
        /* Snapshot was serialized -- restore it */
        snapshot = RestoreSnapshot((char *) parallel_scan +
                                   parallel_scan->phs_snapshot_off);
        RegisterSnapshot(snapshot);
        flags |= SO_TEMP_SNAPSHOT;
    }
    else
    {
        /* SnapshotAny passed by caller (not serialized) */
        snapshot = SnapshotAny;
    }

    return relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
                                            parallel_scan, flags);
}

Both ExecSeqScanInitializeDSM and ExecSeqScanInitializeWorker in src/backend/executor/nodeSeqscan.c call the table_beginscan_parallel function provided by TableAM.

ParallelTableScanDesc

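ParallelTableScanDesc is the structure that runs through all of the APIs above. The ParallelBlockTableScanDesc provided for parallel scans in block oriented AMs can be viewed as a subclass of ParallelTableScanDesc. Each backend participating in a parallel scan has its own private TableScanDesc, and all of them point to the same ParallelTableScanDescData, whose members are shared by all of the backends. Most of the functions in src/backend/executor/nodeSeqscan.c rely on TableAM's parallel table scan and plain table scan APIs.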

/*
 * Shared state for parallel table scan.
 *
 * Each backend participating in a parallel table scan has its own
 * TableScanDesc in backend-private memory, and those objects all contain a
 * pointer to this structure.  The information here must be sufficient to
 * properly initialize each new TableScanDesc as workers join the scan, and it
 * must act as a information what to scan for those workers.
 */
typedef struct ParallelTableScanDescData
{
    Oid         phs_relid;          /* OID of relation to scan */
    bool        phs_syncscan;       /* report location to syncscan logic? */
    bool        phs_snapshot_any;   /* SnapshotAny, not phs_snapshot_data? */
    Size        phs_snapshot_off;   /* data for snapshot */
} ParallelTableScanDescData;
typedef struct ParallelTableScanDescData *ParallelTableScanDesc;

/*
 * Shared state for parallel table scans, for block oriented storage.
 */
typedef struct ParallelBlockTableScanDescData
{
    ParallelTableScanDescData base;

    BlockNumber phs_nblocks;        /* # blocks in relation at start of scan */
    slock_t     phs_mutex;          /* mutual exclusion for setting startblock */
    BlockNumber phs_startblock;     /* starting block number */
    pg_atomic_uint64 phs_nallocated;    /* number of blocks allocated to
                                         * workers so far. */
} ParallelBlockTableScanDescData;

block oriented AMs parallel scans Helper functions

HeapAM implements the three function pointers parallelscan_estimate, parallelscan_initialize and parallelscan_reinitialize by using table_block_parallelscan_estimate, table_block_parallelscan_initialize and table_block_parallelscan_reinitialize directly.

/* ----------------------------------------------------------------------------
 * Helper functions to implement parallel scans for block oriented AMs.
 * ----------------------------------------------------------------------------
 */
Size
table_block_parallelscan_estimate(Relation rel)
{
    return sizeof(ParallelBlockTableScanDescData);
}

Size
table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
{
    ParallelBlockTableScanDesc bpscan = (ParallelBlockTableScanDesc) pscan;

    bpscan->base.phs_relid = RelationGetRelid(rel); /* OID of the relation to scan */
    bpscan->phs_nblocks = RelationGetNumberOfBlocks(rel);   /* blocks in relation at start of scan */
    /* compare phs_syncscan initialization to similar logic in initscan */
    bpscan->base.phs_syncscan = synchronize_seqscans &&
        !RelationUsesLocalBuffers(rel) &&
        bpscan->phs_nblocks > NBuffers / 4;
    SpinLockInit(&bpscan->phs_mutex);
    bpscan->phs_startblock = InvalidBlockNumber;    /* starting block number */
    pg_atomic_init_u64(&bpscan->phs_nallocated, 0); /* number of blocks allocated to workers so far */

    return sizeof(ParallelBlockTableScanDescData);
}

void
table_block_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
{
    ParallelBlockTableScanDesc bpscan = (ParallelBlockTableScanDesc) pscan;

    pg_atomic_write_u64(&bpscan->phs_nallocated, 0);    /* number of blocks allocated to workers so far */
}
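The phs_syncscan condition in table_block_parallelscan_initialize is easy to read past, so here is a minimal standalone sketch of it. The function name and its parameters are hypothetical: they simply take the values that the real code reads from the synchronize_seqscans setting, RelationUsesLocalBuffers(rel), phs_nblocks and NBuffers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical restatement of the phs_syncscan test: synchronized scanning
 * is used only when the setting is on, the relation does not use local
 * buffers, and the relation is larger than a quarter of the shared buffers.
 */
static bool
toy_use_syncscan(bool synchronize_seqscans, bool uses_local_buffers,
                 uint32_t nblocks, int nbuffers)
{
    return synchronize_seqscans &&
        !uses_local_buffers &&
        nblocks > (uint32_t) (nbuffers / 4);
}
```

For example, with 16384 shared buffers the threshold is 4096 blocks: a 5000-block relation qualifies, while a relation of exactly 4096 blocks does not, because the comparison is strict.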

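table_block_parallelscan_startblock_init is called from heapgettup and heapgettup_pagemode in src/backend/access/heap/heapam.c, that is, as part of the ExecSeqScan flow. Its job is to determine phs_startblock, the block at which the parallel SeqScan starts, and to set it only once even though every worker may call it. If the synchronized scan machinery is not in use, the scan's startblock is set to 0. Otherwise, if sync_startpage is still InvalidBlockNumber, the function releases the spinlock, calls ss_get_location to obtain sync_startpage from the synchronized scan machinery, and retries. Once sync_startpage is no longer InvalidBlockNumber, the scan's startblock is set to sync_startpage.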

/*
 * find and set the scan's startblock
 *
 * Determine where the parallel seq scan should start.  This function may be
 * called many times, once by each parallel worker.  We must be careful only
 * to set the startblock once.
 */
void
table_block_parallelscan_startblock_init(Relation rel, ParallelBlockTableScanDesc pbscan)
{
    BlockNumber sync_startpage = InvalidBlockNumber;

retry:
    /* Grab the spinlock. */
    SpinLockAcquire(&pbscan->phs_mutex);

    /*
     * If the scan's startblock has not yet been initialized, we must do so
     * now.  If this is not a synchronized scan, we just start at block 0, but
     * if it is a synchronized scan, we must get the starting position from
     * the synchronized scan machinery.  We can't hold the spinlock while
     * doing that, though, so release the spinlock, get the information we
     * need, and retry.  If nobody else has initialized the scan in the
     * meantime, we'll fill in the value we fetched on the second time
     * through.
     */
    if (pbscan->phs_startblock == InvalidBlockNumber)
    {
        if (!pbscan->base.phs_syncscan)
            pbscan->phs_startblock = 0;
        else if (sync_startpage != InvalidBlockNumber)
            pbscan->phs_startblock = sync_startpage;
        else
        {
            SpinLockRelease(&pbscan->phs_mutex);
            sync_startpage = ss_get_location(rel, pbscan->phs_nblocks);
            goto retry;
        }
    }
    SpinLockRelease(&pbscan->phs_mutex);
}
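The "release the lock, fetch, retry" pattern used here can be reproduced outside PostgreSQL with C11 atomics. The sketch below is only an illustration under several assumptions: a C11 atomic_flag stands in for slock_t, a caller-supplied callback stands in for ss_get_location, the loop replaces the goto, and every `toy_*` name is hypothetical. The callback is assumed to always return a valid block number.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK UINT32_MAX

typedef struct ToyParallelScan
{
    bool        syncscan;       /* use the (simulated) syncscan lookup? */
    uint32_t    nblocks;
    atomic_flag lock;           /* stands in for the spinlock */
    uint32_t    startblock;     /* TOY_INVALID_BLOCK until first set */
} ToyParallelScan;

/* Stands in for ss_get_location(); must return a valid block number. */
typedef uint32_t (*toy_location_fn) (uint32_t nblocks);

static uint32_t toy_location_seven(uint32_t nblocks) { return 7 % nblocks; }
static uint32_t toy_location_two(uint32_t nblocks) { return 2 % nblocks; }

static void
toy_scan_init(ToyParallelScan *s, bool syncscan, uint32_t nblocks)
{
    s->syncscan = syncscan;
    s->nblocks = nblocks;
    atomic_flag_clear(&s->lock);
    s->startblock = TOY_INVALID_BLOCK;
}

static void
toy_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                       /* spin until the flag was clear */
}

static void
toy_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

static void
toy_startblock_init(ToyParallelScan *s, toy_location_fn get_location)
{
    uint32_t    hint = TOY_INVALID_BLOCK;

    for (;;)
    {
        toy_lock(&s->lock);
        if (s->startblock == TOY_INVALID_BLOCK)
        {
            if (!s->syncscan)
                s->startblock = 0;
            else if (hint != TOY_INVALID_BLOCK)
                s->startblock = hint;
            else
            {
                /* Never call out while holding the lock: fetch and retry. */
                toy_unlock(&s->lock);
                hint = get_location(s->nblocks);
                continue;
            }
        }
        toy_unlock(&s->lock);
        return;
    }
}
```

The first caller to get through with a valid hint wins. A later caller may still perform its own lookup, but on the second pass it finds the start block already set and its hint is discarded.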

table_block_parallelscan_nextpage is also called from heapgettup and heapgettup_pagemode in src/backend/access/heap/heapam.c, that is, as part of the ExecSeqScan flow.

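phs_nallocated tracks the number of pages already allocated to workers. Once phs_nallocated >= phs_nblocks (the source comment calls this rs_nblocks), all blocks have been allocated. The page actually returned is computed by adding the counter to the starting block number, modulo nblocks.
The function also reports the scan location. Normally it reports the current page number. When the end of the scan is reached, however, it reports the starting page rather than the ending page, so that the starting position of later scans does not slew backwards. The position at the end of the scan is reported only once: subsequent callers report nothing.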

/*
 * get the next page to scan
 *
 * Get the next page to scan.  Even if there are no pages left to scan,
 * another backend could have grabbed a page to scan and not yet finished
 * looking at it, so it doesn't follow that the scan is done when the first
 * backend gets an InvalidBlockNumber return.
 */
BlockNumber
table_block_parallelscan_nextpage(Relation rel, ParallelBlockTableScanDesc pbscan)
{
    BlockNumber page;
    uint64      nallocated;

    /*
     * phs_nallocated tracks how many pages have been allocated to workers
     * already.  When phs_nallocated >= rs_nblocks, all blocks have been
     * allocated.
     *
     * Because we use an atomic fetch-and-add to fetch the current value, the
     * phs_nallocated counter will exceed rs_nblocks, because workers will
     * still increment the value, when they try to allocate the next block but
     * all blocks have been allocated already. The counter must be 64 bits
     * wide because of that, to avoid wrapping around when rs_nblocks is close
     * to 2^32.
     *
     * The actual page to return is calculated by adding the counter to the
     * starting block number, modulo nblocks.
     */
    nallocated = pg_atomic_fetch_add_u64(&pbscan->phs_nallocated, 1);
    if (nallocated >= pbscan->phs_nblocks)
        page = InvalidBlockNumber;  /* all blocks have been allocated */
    else
        page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;

    /*
     * Report scan location.  Normally, we report the current page number.
     * When we reach the end of the scan, though, we report the starting page,
     * not the ending page, just so the starting positions for later scans
     * doesn't slew backwards.  We only report the position at the end of the
     * scan once, though: subsequent callers will report nothing.
     */
    if (pbscan->base.phs_syncscan)
    {
        if (page != InvalidBlockNumber)
            ss_report_location(rel, page);
        else if (nallocated == pbscan->phs_nblocks)
            ss_report_location(rel, pbscan->phs_startblock);
    }

    return page;
}
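To see the allocation arithmetic in isolation, here is a minimal standalone sketch built on C11 atomics. It leaves out the syncscan reporting, uses atomic_uint_fast64_t where PostgreSQL uses pg_atomic_uint64, and all `Toy*`/`toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK UINT32_MAX

typedef struct ToyBlockScan
{
    uint32_t    nblocks;        /* blocks in the relation at scan start */
    uint32_t    startblock;     /* block the scan starts from */
    atomic_uint_fast64_t nallocated;    /* blocks handed out so far */
} ToyBlockScan;

static void
toy_block_scan_init(ToyBlockScan *s, uint32_t nblocks, uint32_t startblock)
{
    s->nblocks = nblocks;
    s->startblock = startblock;
    atomic_init(&s->nallocated, 0);
}

/*
 * Hand out the next block.  Each caller atomically takes one ticket; the
 * ticket is mapped onto a block by offsetting from startblock and wrapping
 * around at nblocks.  Tickets at or beyond nblocks mean nothing is left.
 */
static uint32_t
toy_nextpage(ToyBlockScan *s)
{
    uint64_t    n = atomic_fetch_add(&s->nallocated, 1);

    if (n >= s->nblocks)
        return TOY_INVALID_BLOCK;
    return (uint32_t) ((n + s->startblock) % s->nblocks);
}

/* Restart the scan: only the ticket counter is reset. */
static void
toy_reinitialize(ToyBlockScan *s)
{
    atomic_store(&s->nallocated, 0);
}
```

With nblocks = 5 and startblock = 3, successive calls return 3, 4, 0, 1, 2 and then TOY_INVALID_BLOCK on every further call, while the counter keeps growing past 5. That growth is the reason the counter is 64 bits wide in the original code.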

synchronized scan machinery
ss_get_location
ss_report_location