This section covers SyncOneBuffer, the function a checkpoint uses to flush a single dirty page. It processes one buffer during syncing.

I. Data Structures

Macro definitions
Checkpoint request flag bits: the flags a checkpoint request can carry.


/*
 * OR-able request flag bits for checkpoints.  The "cause" bits are used only
 * for logging purposes.  Note: the flags must be defined so that it's
 * sensible to OR together request flags arising from different requestors.
 */
/* These directly affect the behavior of CreateCheckPoint and subsidiaries */
#define CHECKPOINT_IS_SHUTDOWN  0x0001  /* Checkpoint is for shutdown */
#define CHECKPOINT_END_OF_RECOVERY  0x0002  /* Like shutdown checkpoint, but
                                             * issued at end of WAL recovery */
#define CHECKPOINT_IMMEDIATE  0x0004  /* Do it without delays */
#define CHECKPOINT_FORCE    0x0008  /* Force even if no activity */
#define CHECKPOINT_FLUSH_ALL  0x0010  /* Flush all pages, including those
                                       * belonging to unlogged tables */
/* These are important to RequestCheckpoint */
#define CHECKPOINT_WAIT     0x0020  /* Wait for completion */
#define CHECKPOINT_REQUESTED  0x0040  /* Checkpoint request has been made */
/* These indicate the cause of a checkpoint request */
#define CHECKPOINT_CAUSE_XLOG 0x0080  /* XLOG consumption */
#define CHECKPOINT_CAUSE_TIME 0x0100  /* Elapsed time */
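
Because every flag occupies its own bit, requests from different requestors can simply be OR-ed together and tested with a bit mask. The sketch below illustrates that under stated assumptions: the flag values are copied from the definitions above, while describe_checkpoint_flags and is_shutdown_like are hypothetical helpers written for this example and are not PostgreSQL functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Flag values copied from the definitions above. */
#define CHECKPOINT_IS_SHUTDOWN      0x0001
#define CHECKPOINT_END_OF_RECOVERY  0x0002
#define CHECKPOINT_IMMEDIATE        0x0004
#define CHECKPOINT_FORCE            0x0008
#define CHECKPOINT_FLUSH_ALL        0x0010
#define CHECKPOINT_WAIT             0x0020
#define CHECKPOINT_REQUESTED        0x0040
#define CHECKPOINT_CAUSE_XLOG       0x0080
#define CHECKPOINT_CAUSE_TIME       0x0100

/*
 * Hypothetical helper (not part of PostgreSQL): write the names of all
 * flags set in "flags" into buf, separated by '|', in ascending bit order.
 */
const char *
describe_checkpoint_flags(int flags, char *buf, size_t buflen)
{
    static const struct { int bit; const char *name; } names[] = {
        {CHECKPOINT_IS_SHUTDOWN, "IS_SHUTDOWN"},
        {CHECKPOINT_END_OF_RECOVERY, "END_OF_RECOVERY"},
        {CHECKPOINT_IMMEDIATE, "IMMEDIATE"},
        {CHECKPOINT_FORCE, "FORCE"},
        {CHECKPOINT_FLUSH_ALL, "FLUSH_ALL"},
        {CHECKPOINT_WAIT, "WAIT"},
        {CHECKPOINT_REQUESTED, "REQUESTED"},
        {CHECKPOINT_CAUSE_XLOG, "CAUSE_XLOG"},
        {CHECKPOINT_CAUSE_TIME, "CAUSE_TIME"},
    };
    size_t i;

    assert(buflen > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
    {
        if (flags & names[i].bit)
        {
            if (buf[0] != '\0')
                strncat(buf, "|", buflen - strlen(buf) - 1);
            strncat(buf, names[i].name, buflen - strlen(buf) - 1);
        }
    }
    return buf;
}

/* Hypothetical helper: true if either "shutdown-like" bit is set. */
int
is_shutdown_like(int flags)
{
    return (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) != 0;
}
```

For example, OR-ing CHECKPOINT_IMMEDIATE, CHECKPOINT_FORCE and CHECKPOINT_WAIT yields 0x002C, and each bit can still be tested on its own afterwards.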

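II. Source Code Walkthrough

SyncOneBuffer processes one buffer during syncing. Its main logic is:
1. Get the buffer descriptor.
2. Lock the buffer header (LockBufHdr).
3. Check the buffer state against the input parameters and return early where nothing needs to be written.
4. Pin the dirty page, take a shared content lock, and call FlushBuffer to write it out.
5. Release the lock, unpin the buffer, and do the remaining cleanup.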


/*
 * SyncOneBuffer -- process a single buffer during syncing.
 *
 * If skip_recently_used is true, we don't write currently-pinned buffers, nor
 * buffers marked recently used, as these are not replacement candidates.
 *
 * Returns a bitmask containing the following flag bits:
 *  BUF_WRITTEN: we wrote the buffer.
 *  BUF_REUSABLE: buffer is available for replacement, ie, it has
 *      pin count 0 and usage count 0.
 *
 * (BUF_WRITTEN could be set in error if FlushBuffers finds the buffer clean
 * after locking it, but we don't care all that much.)
 *
 * Note: caller must have done ResourceOwnerEnlargeBuffers.
 */
static int
SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
{
    BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
    int         result = 0;
    uint32      buf_state;
    BufferTag   tag;

    ReservePrivateRefCountEntry();

    /*
     * Check whether buffer needs writing.
     *
     * We can make this check without taking the buffer content lock so long
     * as we mark pages dirty in access methods *before* logging changes with
     * XLogInsert(): if someone marks the buffer dirty just after our check we
     * don't worry because our checkpoint.redo points before log record for
     * upcoming changes and so we are not required to write such dirty buffer.
     */
    buf_state = LockBufHdr(bufHdr);

    if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
        BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
    {
        result |= BUF_REUSABLE;
    }
    else if (skip_recently_used)
    {
        /* Caller told us not to write recently-used buffers */
        UnlockBufHdr(bufHdr, buf_state);
        return result;
    }

    if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
    {
        /* It's clean, so nothing to do */
        /* (the buffer is either invalid or not dirty) */
        UnlockBufHdr(bufHdr, buf_state);
        return result;
    }

    /*
     * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
     * buffer is clean by the time we've locked it.)
     */
    PinBuffer_Locked(bufHdr);
    LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);

    /*
     * Call FlushBuffer.  If the caller has an smgr reference for the
     * buffer's relation, pass it as the second parameter; if not, pass NULL.
     */
    FlushBuffer(bufHdr, NULL);

    LWLockRelease(BufferDescriptorGetContentLock(bufHdr));

    tag = bufHdr->tag;

    UnpinBuffer(bufHdr, true);

    ScheduleBufferTagForWriteback(wb_context, &tag);

    return result | BUF_WRITTEN;
}

FlushBuffer
FlushBuffer physically writes out a shared buffer. The write itself is ultimately done by smgrwrite (storage manager write).


/*
 * FlushBuffer
 *      Physically write out a shared buffer.
 *
 * NOTE: this actually just passes the buffer contents to the kernel; the
 * real write to disk won't happen until the kernel feels like it.  This
 * is okay from our point of view since we can redo the changes from WAL.
 * However, we will need to force the changes to disk via fsync before
 * we can checkpoint WAL.
 *
 * The caller must hold a pin on the buffer and have share-locked the
 * buffer contents.  (Note: a share-lock does not prevent updates of
 * hint bits in the buffer, so the page could change while the write
 * is in progress, but we assume that that will not invalidate the data
 * written.)
 *
 * If the caller has an smgr reference for the buffer's relation, pass it
 * as the second parameter.  If not, pass NULL.
 */
static void
FlushBuffer(BufferDesc *buf, SMgrRelation reln)
{
    XLogRecPtr  recptr;
    ErrorContextCallback errcallback;
    instr_time  io_start,
                io_time;
    Block       bufBlock;
    char       *bufToWrite;
    uint32      buf_state;

    /*
     * Acquire the buffer's io_in_progress lock.  If StartBufferIO returns
     * false, then someone else flushed the buffer before we could, so we need
     * not do anything.
     */
    if (!StartBufferIO(buf, false))
        return;

    /* Setup error traceback support for ereport() */
    errcallback.callback = shared_buffer_write_error_callback;
    errcallback.arg = (void *) buf;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* Find smgr relation for buffer */
    if (reln == NULL)
        reln = smgropen(buf->tag.rnode, InvalidBackendId);

    TRACE_POSTGRESQL_BUFFER_FLUSH_START(buf->tag.forkNum,
                                        buf->tag.blockNum,
                                        reln->smgr_rnode.node.spcNode,
                                        reln->smgr_rnode.node.dbNode,
                                        reln->smgr_rnode.node.relNode);

    buf_state = LockBufHdr(buf);

    /*
     * Run PageGetLSN while holding header lock, since we don't have the
     * buffer locked exclusively in all cases.
     */
    recptr = BufferGetLSN(buf);

    /* To check if block content changes while flushing. - vadim 01/17/97 */
    buf_state &= ~BM_JUST_DIRTIED;
    UnlockBufHdr(buf, buf_state);

    /*
     * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
     * rule that log updates must hit disk before any of the data-file changes
     * they describe do.
     *
     * However, this rule does not apply to unlogged relations, which will be
     * lost after a crash anyway.  Most unlogged relation pages do not bear
     * LSNs since we never emit WAL records for them, and therefore flushing
     * up through the buffer LSN would be useless, but harmless.  However,
     * GiST indexes use LSNs internally to track page-splits, and therefore
     * unlogged GiST pages bear "fake" LSNs generated by
     * GetFakeLSNForUnloggedRel.  It is unlikely but possible that the fake
     * LSN counter could advance past the WAL insertion point; and if it did
     * happen, attempting to flush WAL through that location would fail, with
     * disastrous system-wide consequences.  To make sure that can't happen,
     * skip the flush if the buffer isn't permanent.
     */
    if (buf_state & BM_PERMANENT)
        XLogFlush(recptr);

    /*
     * Now it's safe to write buffer to disk. Note that no one else should
     * have been able to write it while we were busy with log flushing because
     * we have the io_in_progress lock.
     */
    bufBlock = BufHdrGetBlock(buf);

    /*
     * Update page checksum if desired.  Since we have only shared lock on the
     * buffer, other processes might be updating hint bits in it, so we must
     * copy the page to private storage if we do checksumming.
     */
    bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);

    if (track_io_timing)
        INSTR_TIME_SET_CURRENT(io_start);

    /*
     * bufToWrite is either the shared buffer or a copy, as appropriate.
     */
    smgrwrite(reln,
              buf->tag.forkNum,
              buf->tag.blockNum,
              bufToWrite,
              false);

    if (track_io_timing)
    {
        INSTR_TIME_SET_CURRENT(io_time);
        INSTR_TIME_SUBTRACT(io_time, io_start);
        pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
        INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
    }

    pgBufferUsage.shared_blks_written++;

    /*
     * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
     * end the io_in_progress state.
     */
    TerminateBufferIO(buf, true, 0);

    TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
                                       buf->tag.blockNum,
                                       reln->smgr_rnode.node.spcNode,
                                       reln->smgr_rnode.node.dbNode,
                                       reln->smgr_rnode.node.relNode);

    /* Pop the error context stack */
    error_context_stack = errcallback.previous;
}
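
The ordering FlushBuffer enforces (WAL first, data page second, with the WAL flush skipped for non-permanent buffers) can be reduced to a few lines. This is a toy model under stated assumptions, not the real implementation: ToyWal, toy_wal_flush and toy_flush_page are hypothetical names, and the assertion in toy_wal_flush merely stands in for the failure the source comment describes when a flush is requested past the WAL insertion point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

/* Hypothetical WAL state: how far WAL has been inserted and made durable. */
typedef struct ToyWal
{
    ToyLsn inserted_upto;
    ToyLsn flushed_upto;
} ToyWal;

/* Toy stand-in for XLogFlush: make WAL durable up to at least "upto". */
void
toy_wal_flush(ToyWal *wal, ToyLsn upto)
{
    /* Flushing beyond what was ever inserted is an error in this model. */
    assert(upto <= wal->inserted_upto);
    if (wal->flushed_upto < upto)
        wal->flushed_upto = upto;
}

/*
 * Toy stand-in for the WAL-related part of FlushBuffer.  Returns true when
 * the data page may be written without breaking the WAL-before-data rule.
 */
bool
toy_flush_page(ToyWal *wal, ToyLsn page_lsn, bool permanent)
{
    /* Only permanent (WAL-logged) buffers force a WAL flush. */
    if (permanent)
        toy_wal_flush(wal, page_lsn);

    /* For permanent pages the log must now cover the page's LSN. */
    return !permanent || wal->flushed_upto >= page_lsn;
}
```

In this model a non-permanent page carrying an LSN beyond inserted_upto (the "fake LSN" case from the comment) never reaches toy_wal_flush, which is the reason the real code tests BM_PERMANENT before calling XLogFlush.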

III. Trace Analysis

Test script


testdb=# update t_wal_ckpt set c2 = 'C4#'||substr(c2,4,40);
UPDATE 1
testdb=# checkpoint;

Tracing with gdb


(gdb) handle SIGINT print nostop pass
SIGINT is used by the debugger.
Are you sure you want to change it? (y or n) y
Signal        Stop  Print Pass to program Description
SIGINT        No  Yes Yes   Interrupt
(gdb) b SyncOneBuffer
Breakpoint 1 at 0x8a7167: file bufmgr.c, line 2357.
(gdb) c
Continuing.
Program received signal SIGINT, Interrupt.
Breakpoint 1, SyncOneBuffer (buf_id=0, skip_recently_used=false, wb_context=0x7fff27f5ae00) at bufmgr.c:2357
2357    BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
(gdb) n
2358    int     result = 0;
(gdb) p *bufHdr
$1 = {tag = {rnode = {spcNode = 1663, dbNode = 16384, relNode = 221290}, forkNum = MAIN_FORKNUM, blockNum = 0}, buf_id = 0, state = {value = 3548905472}, wait_backend_pid = 0, freeNext = -2, content_lock = {tranche = 53, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}
(gdb) n
2362    ReservePrivateRefCountEntry();
(gdb)
2373    buf_state = LockBufHdr(bufHdr);
(gdb)
2375    if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
(gdb)
2376      BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
(gdb)
2375    if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
(gdb)
2380    else if (skip_recently_used)
(gdb)
2387    if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
(gdb)
2398    PinBuffer_Locked(bufHdr);
(gdb) p buf_state
$2 = 3553099776
(gdb) n
2399    LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
(gdb)
2401    FlushBuffer(bufHdr, NULL);
(gdb) step
FlushBuffer (buf=0x7fedc4a68300, reln=0x0) at bufmgr.c:2687
2687    if (!StartBufferIO(buf, false))
(gdb) n
2691    errcallback.callback = shared_buffer_write_error_callback;
(gdb)
2692    errcallback.arg = (void *) buf;
(gdb)
2693    errcallback.previous = error_context_stack;
(gdb)
2694    error_context_stack = &errcallback;
(gdb)
2697    if (reln == NULL)
(gdb)
2698      reln = smgropen(buf->tag.rnode, InvalidBackendId);
(gdb)
2700    TRACE_POSTGRESQL_BUFFER_FLUSH_START(buf->tag.forkNum,
(gdb)
2706    buf_state = LockBufHdr(buf);
(gdb)
2712    recptr = BufferGetLSN(buf);
(gdb)
2715    buf_state &= ~BM_JUST_DIRTIED;
(gdb) p recptr
$3 = 16953421760
(gdb) n
2716    UnlockBufHdr(buf, buf_state);
(gdb)
2735    if (buf_state & BM_PERMANENT)
(gdb)
2736      XLogFlush(recptr);
(gdb)
2743    bufBlock = BufHdrGetBlock(buf);
(gdb)
2750    bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
(gdb) p bufBlock
$4 = (Block) 0x7fedc4e68300
(gdb) n
2752    if (track_io_timing)
(gdb)
2758    smgrwrite(reln,
(gdb)
2764    if (track_io_timing)
(gdb)
2772    pgBufferUsage.shared_blks_written++;
(gdb)
2778    TerminateBufferIO(buf, true, 0);
(gdb)
2780    TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
(gdb)
2787    error_context_stack = errcallback.previous;
(gdb)
2788  }
(gdb)
SyncOneBuffer (buf_id=0, skip_recently_used=false, wb_context=0x7fff27f5ae00) at bufmgr.c:2403
2403    LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
(gdb)
2405    tag = bufHdr->tag;
(gdb)
2407    UnpinBuffer(bufHdr, true);
(gdb)
2409    ScheduleBufferTagForWriteback(wb_context, &tag);
(gdb)
2411    return result | BUF_WRITTEN;
(gdb)
2412  }
(gdb)
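
The raw integers printed in the trace become readable once they are decoded. The sketch below does that under an explicit assumption: the bit layout of the buffer state word (18 bits of refcount, 4 bits of usage count, flag bits from bit 22 upward) is recalled from buf_internals.h and should be checked against the source tree that matches your server version. state_refcount, state_usagecount and format_lsn are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed layout of the buffer state word; verify against buf_internals.h. */
#define BUF_REFCOUNT_MASK     ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK   0x003C0000U
#define BUF_USAGECOUNT_SHIFT  18

#define BM_LOCKED             (1U << 22)
#define BM_DIRTY              (1U << 23)
#define BM_VALID              (1U << 24)
#define BM_TAG_VALID          (1U << 25)
#define BM_IO_IN_PROGRESS     (1U << 26)
#define BM_IO_ERROR           (1U << 27)
#define BM_JUST_DIRTIED       (1U << 28)
#define BM_PIN_COUNT_WAITER   (1U << 29)
#define BM_CHECKPOINT_NEEDED  (1U << 30)
#define BM_PERMANENT          (1U << 31)

/* Pin count stored in the low bits of the state word. */
uint32_t
state_refcount(uint32_t state)
{
    return state & BUF_REFCOUNT_MASK;
}

/* Usage count stored just above the refcount. */
uint32_t
state_usagecount(uint32_t state)
{
    return (state & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT;
}

/* Render a 64-bit WAL position as two hex halves, high/low. */
const char *
format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
    assert(buflen >= 18);       /* 8 + 1 + 8 hex characters plus NUL */
    snprintf(buf, buflen, "%X/%X",
             (unsigned int) (lsn >> 32),
             (unsigned int) (lsn & 0xFFFFFFFFu));
    return buf;
}
```

Under that layout, buf_state = 3553099776 (0xD3C80000) decodes to refcount 0 and usage count 2, with BM_LOCKED, BM_DIRTY, BM_VALID, BM_TAG_VALID, BM_JUST_DIRTIED, BM_CHECKPOINT_NEEDED and BM_PERMANENT set. That matches the path taken in the trace: the usage count is non-zero, so BUF_REUSABLE is not set, skip_recently_used is false, and the buffer is valid and dirty, so execution reaches PinBuffer_Locked. The value 3548905472 printed earlier from *bufHdr differs only by BM_LOCKED, because it was read before LockBufHdr ran. The recptr value 16953421760 corresponds to WAL position 3/F280AFC0.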


From the ITPUB blog: http://blog.itpub.net/6906/viewspace-2651712/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
