Work in progress; updated continuously.

Table of Contents

Preface

0 Storage Devices

0.1 Hard Disk Drives

0.2 SSD

0.3 NVDIMM

1 VFS

1.0 Basic Components

1.0.1 iov_iter

1.0.2 iomap

1.1 inode

1.1.1 inode Lifecycle

1.2 Dentry

1.2.1 What dentry Provides

1.2.2 dentry Operations

1.2.3 root dentry

1.2.4 d_splice_alias

1.2.5 lookup

1.3 mount

1.3.1 Grafting

1.3.2 loop mount

1.4 page cache

1.4.1 radix tree

1.4.2 page cache tags

1.4.3 mmap

1.4.4 direct IO

1.5 DAX

1.5.1 Pmem

1.5.2 fs/dax

2 Filesystems

2.1 Space Management

2.1.1 ext4 Allocator

2.1.2 xfs Allocator

2.1.3 ext4 Preallocation

2.1.4 xfs Preallocation

3 Block

3.1 bio

3.2 request

3.3 iostat

3.3.1 util%


Preface

The Linux storage stack consists of:

  • API: the POSIX system-call interface leading into the VFS layer
  • VFS: maintains the organization and contents of the page cache, dcache, inode cache and related structures, and interfaces downward with the filesystems
  • FS: the individual filesystems, such as ext2, ext4, xfs and btrfs
  • Logical-Block: device-mapper based lvm, md-raid, bcache, multi-path, and so on
  • Block: the block layer proper, which interfaces downward with the storage protocols, abstracts the physical characteristics of the storage device (maximum IO size, whether a cache is present, etc.), and provides the IO scheduling policies (the elevator algorithm can be considered the default)
  • Device drivers, e.g. for nvme and scsi class storage devices
  • Physical storage devices

0 Storage Devices

A key consideration in the design of every storage-stack component is how to make the best use of the storage device and extract its full performance. So before studying the Linux storage stack, we first need to understand the characteristics of the various storage devices.

0.1 Hard Disk Drives

We measure the following performance metrics of a disk:
Host: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
HDD: ST2000NM0055-1V4, 1.8T capacity

Sequential IO, varying IO size (--bs)

fio --bs=256K --size=512G  --rw=read --norandommap --direct=1 --ioengine=libaio --iodepth=1 --runtime=60 --name=test --filename=/dev/sdb

512    4K     8K     16K    32K     64K     128K    256K    512K    1M      4M
24us   30us   39us   79us   157us   314us   629us   1.26ms  2.52ms  5.04ms  20.24ms

Random IO, varying IO range (--size)

fio --bs=512 --size=1536G  --rw=randread --norandommap --direct=1 --ioengine=libaio --iodepth=1 --runtime=60 --name=test --filename=/dev/sdb --group_reporting

32M     64M     128M    256M    512M    1G      2G
106us   954us   3.02ms  4.08ms  4.63ms  5.06ms  5.29ms

4G      8G      16G     32G     64G     128G    256G    512G    1T      1.5T
5.51ms  5.69ms  5.85ms  6.09ms  6.34ms  6.78ms  7.36ms  8.1ms   9.44ms  10.52ms

From the data above we can roughly break the time the disk spends on an IO into three parts:

  • Software-stack overhead, host plus firmware; judging from the 512-byte sequential IO number, this is below 24us;
  • The head transferring data off the platter, at roughly 200M/s, close to the theoretical bandwidth ceiling of the disk under test;
  • Head movement: once the span exceeds 128M, seek time already reaches 3ms, and beyond 1G it reaches 5ms;

So this disk's bottleneck has two determining factors:

  • The 200M/s bandwidth ceiling, which is usually reachable only with purely sequential IO in benchmarks.
  • The head's seek range; in practice, this is what really determines disk performance.

Take this disk as an example again. With xfs and its default of 4 Allocation Groups, each AG is 512G (the last slightly less). Files in different directories are usually allocated from different AGs, so when files in several directories are doing IO at the same time, the seek span can easily exceed 256G, which costs roughly 7ms of head movement per IO. The number of IOs completed per second, the IOPS, then tops out around 143, and that is the disk's bottleneck; in this regime, the larger the IO size, the higher the disk's bandwidth.

0.2 SSD

(placeholder, to be written)

0.3 NVDIMM

The name NVDIMM captures the two defining properties of this class of device:

  • Non-Volatile: data survives power loss;
  • DIMM: the device's interface, the same as DDR, which implies a very high bandwidth ceiling and access via load/store instructions;

In other words, an NVDIMM is a non-volatile storage device attached through a DIMM interface.

Intel's Optane DC PMM, based on 3D XPoint storage technology, is one such device; the same storage technology behind an NVMe interface yields the Optane SSD. In principle one could also put a DIMM interface on a hard disk, but the cost and performance would be indefensible.

The first generation of Optane DCPMM was codenamed Apache Pass, the "AEP" we commonly refer to; the second generation is codenamed Barlow Pass. It is reported, however, that Intel is winding down its 3D XPoint based Optane business.


The NVDIMM performance data comes from two papers, "Basic Performance Measurements of the Intel Optane DC Persistent Memory Module" and "Initial Experience with 3D XPoint Main Memory"; interested readers can find them at https://download.csdn.net/download/home19900111/86736994 and https://download.csdn.net/download/home19900111/86737051

In the figures, LDRAM means local memory access and RDRAM means cross-NUMA memory access; the remaining labels are:

  • L stands for regular loads
  • LN stands for non-temporal loads
  • SF stands for stores followed by clflush
  • SN stands for non-temporal stores
  • SO stands for stores followed by clflushopt
  • SW stands for stores followed by clwb

The data in the figures show that although Optane still lags real DRAM, it is essentially within the same order of magnitude. This matches the conclusions of "Initial Experience with 3D XPoint Main Memory":

  • The memory access performance of 3D XPoint is modestly lower than that of DRAM. Compared to DRAM, 3D XPoint's read performance can be up to 2.5x slower, and its write performance can be up to 2.3x slower.

  • The write performance of 3D XPoint is quite similar to its read performance. Writes are at most 2.3x slower than reads. The asymmetric effect of NVM reads and writes are modest.

  • Persisting writes to 3D XPoint incur drastic cost for sequential and random writes. The slowdowns can be at least 10.7x and 2.9x for sequential writes and random writes, respectively.

  • The performance of 3D XPoint degrades as an application accesses more data. We suspect that techniques such as caching are exploited in the 3D Xpoint modules to improve performance. This effect is less significant when the size of data visited is 32GB or larger for the 256GB 3D XPoint module.

Compared with HDDs or NAND SSDs this is an "epic" performance improvement. In practice, however, the biggest problem we found with AEP is that because it is accessed directly through load/store instructions, the access cannot be made asynchronous, which noticeably drives up the CPU utilization of the applications involved.

Intel later introduced an FPGA-based hardware acceleration module that uses hundreds to thousands of internal concurrent units to access AEP.

Still, at this stage AEP cannot truly revolutionize computing systems; only when NVDIMM performance reaches DDR levels at a manageable cost will it genuinely upend the existing architecture.

For persistence-aware programming on nvdimm, consider the power-fail protected domain:

Asynchronous DRAM Refresh (ADR) is a feature supported on Intel products which flushes the write-protected data buffers and places the DRAM in self-refresh. This process is critical during a power loss event or system crash to ensure the data is in a safe and consistent state on persistent memory. By default, ADR does not flush the processor caches. A platform that supports ADR only includes persistent memory and the memory controller’s write pending queues within the persistence domain. This is the reason data in the CPU caches must be flushed by the application using the CLWB, CLFLUSHOPT, CLFLUSH, non-temporal stores, or WBINVD machine instructions.

Enhanced Asynchronous DRAM Refresh (eADR) requires that a non-maskable interrupt (NMI) routine be called to flush the CPU caches before the ADR event can begin. Applications running on an eADR platform do not need to perform flush operations because the hardware should flush the data automatically, but they are still required to perform an SFENCE operation to maintain write order correctness. Stores should be considered persistent only when they are globally visible, which the SFENCE guarantees.

The key instructions for implementing persistence:

  • CLFLUSH: this instruction, supported across many generations of CPU, flushes a single cache line. Historically it is serialized, so multiple CLFLUSH instructions execute one after the other, without any concurrency.
  • CLFLUSHOPT: newly introduced for persistent memory support, this is like CLFLUSH but without the serialization. To flush a range, software executes a CLFLUSHOPT for each 64-byte cache line in the range, followed by a single SFENCE to ensure the flushes are complete before continuing. CLFLUSHOPT is optimized (hence the name) to allow some concurrency when executing multiple CLFLUSHOPT instructions back-to-back.
  • CLWB: another newly introduced instruction, cache line write back. The effect is the same as CLFLUSHOPT except that the cache line may remain valid in the cache (no longer dirty, since it was flushed), making a later cache hit on this line more likely.
  • NT stores: non-temporal stores have been available in x86 CPUs for a long time. They are "write combining" and bypass the CPU cache, so using them does not require a flush.
  • WBINVD: this kernel-mode-only instruction flushes and invalidates every cache line on the CPU that executes it. After executing it on all CPUs, all stores to persistent memory are certainly in the persistence domain, but all cache lines are empty, hurting performance. The overhead of sending a message to each CPU to execute the instruction is also significant, so WBINVD is only expected to be used by the kernel for flushing very large ranges, many megabytes at least.

1 VFS

1.0 Basic Components

1.0.1 iov_iter

See:

What is iov_iter: https://jianchwa.github.io/fs/xfs_v2.html#iov_iter

1.0.2 iomap

Let's first look at the iomap structure:

#define IOMAP_HOLE  0   /* no blocks allocated, need allocation */
#define IOMAP_DELALLOC  1   /* delayed allocation blocks */
#define IOMAP_MAPPED    2   /* blocks allocated at @addr */
#define IOMAP_UNWRITTEN 3   /* blocks allocated at @addr in unwritten state */
#define IOMAP_INLINE    4   /* data inline in the inode */

struct iomap {
    u64        addr;   /* disk offset of mapping, bytes */
    loff_t     offset; /* file offset of mapping, bytes */
    u64        length; /* length of mapping, bytes */
    u16        type;   /* type of mapping */
    u16        flags;  /* flags for mapping */
    struct block_device *bdev;  /* block device for I/O */
    ...
    const struct iomap_page_ops *page_ops;
};

An iomap describes the mapping from (iomap.offset, iomap.length) in a file to (iomap.addr, iomap.length) on the block device; it plays the same role as ext2's block mapping or ext4's extents. Compare generic_perform_write(), where aops->write_begin() works one page at a time, with xfs_file_write_iter(), whose iomap_file_buffered_write() calls iomap_begin on a whole range; for large contiguous IO this makes the code path considerably more efficient.

The standard iomap flow:

dax_iomap_rw()
---
    while ((ret = iomap_iter(&iomi, ops)) > 0)
        // (1) iomap_end
        // (2) advance the iter by iter.processed
        // (3) iomap_begin: from iter.inode, iter.pos and iter.len,
        //     fill in iomap.addr, iomap.offset and iomap.length
        iomi.processed = dax_iomap_iter(&iomi, iter);
        ---
            read/write at iomap.addr + (iter.pos - iomap.offset)
        ---
---

One thing deserves special mention: the iomap range returned by iomap_begin may be larger than what was asked for.

1.1 inode

In Linux everything is a file, and the file is the inode. We can compare it to a C++ class containing:

  • data, including attributes (i_mode, i_uid, i_gid, i_a/c/mtime, etc.) and the page cache (i_mapping),
  • methods, including inode_operations (i_op), file_operations (i_fop) and address_space_operations,

plus some other auxiliary members.

Note: distinguish file from inode. A file represents a context for operating on a file, and even contains the current file offset; that is why consecutive read/write calls need no offset argument.

1.1.1 inode Lifecycle

An inode has two lifetime counts:

  • i_nlink: the number of directory entries pointing at it; this is the lifetime of the on-disk inode;
  • i_count: the lifetime of the inode in memory;

In this section we focus on i_count, i.e. the lifecycle of the inode cache.

  • I_NEW: the inode is being created and its metadata is still being filled in; see ext2_iget()

    Note: filling in the metadata requires IO and takes an unpredictable amount of time, so the kernel first inserts an I_NEW inode into the inode cache as a placeholder; every other task that wants this inode must wait until I_NEW is cleared.

ext2_iget()
---
    inode = iget_locked(sb, ino);
    if (!(inode->i_state & I_NEW))
        return inode;
    ei = EXT2_I(inode);
    ei->i_block_alloc_info = NULL;
    // This step does IO against the storage device
    raw_inode = ext2_get_inode(inode->i_sb, ino, &bh);
    // Fill the in-core inode from the on-disk inode
    unlock_new_inode(inode);
---

iget_locked()
---
    spin_lock(&inode_hash_lock);
    inode = find_inode_fast(sb, head, ino);
    spin_unlock(&inode_hash_lock);
    if (inode) {
        ...
        // Wait here for I_NEW to be cleared
        wait_on_inode(inode);
        return inode;
    }
    inode = alloc_inode(sb);
    if (inode) {
        struct inode *old;
        spin_lock(&inode_hash_lock);
        /* We released the lock, so.. */
        old = find_inode_fast(sb, head, ino);
        if (!old) {
            inode->i_ino = ino;
            spin_lock(&inode->i_lock);
            inode->i_state = I_NEW;
            hlist_add_head(&inode->i_hash, head);
            spin_unlock(&inode->i_lock);
            inode_sb_list_add(inode);
            spin_unlock(&inode_hash_lock);
            /* Return the locked inode with I_NEW set, the
             * caller is responsible for filling in the contents
             */
            return inode;
        }
        ...
    }
---
  • Referenced, i.e. i_count > 0. An inode is usually referenced by a dentry, and the standard filesystem operations all start from the dentry; the reference on the inode is dropped only when the dentry is released. See __d_add() and dentry_unlink_inode();
  • Awaiting reclaim: after the last reference on an inode is dropped, the inode is placed on the per-super lru list waiting to be reclaimed. A dirty inode is not added to the lru list; only after writeback completes does inode_sync_complete() put it there. See iput_final():

iput_final()
---
    if (op->drop_inode)
        drop = op->drop_inode(inode);
    else
        drop = generic_drop_inode(inode);
        // !inode->i_nlink || inode_unhashed(inode);
    // Once the refcount hits 0, put the inode on the lru to await reclaim
    if (!drop && (sb->s_flags & SB_ACTIVE)) {
        inode_add_lru(inode);
        spin_unlock(&inode->i_lock);
        return;
    }
    ...
    // A file already unlinked by the filesystem (i_nlink = 0) need not
    // keep its inode around; release it directly
    inode->i_state |= I_FREEING;
    if (!list_empty(&inode->i_lru))
        inode_lru_list_del(inode);
    spin_unlock(&inode->i_lock);
    evict(inode);
---
  • Being reclaimed: when system memory falls below the low watermark and shrink_slab runs, the inodes on the lru list are scanned and candidates are picked by the following rules (see inode_lru_isolate):

    • i_count == 0: while on the lru list the inode may have been referenced again by another task and had its refcount raised; such an inode is removed from the lru list. Otherwise, go to the next check;
    • Not dirty: a dirty inode cannot be reclaimed either and is removed from the lru list. Otherwise, go to the next check;
    • No I_REFERENCED flag: see inode_lru_list_add(); the flag is only set the second time that function is called. One way to acquire the flag is for an inode already on the lru list to be accessed again and then released again; an inode carrying the flag gets one more trip around the lru list;

super_cache_scan() -> prune_icache_sb() -> inode_lru_isolate() -> dispose_list()

  • I_FREEING: an inode in this state will ultimately be released by evict(); while I_FREEING is set, other attempts to access the inode block temporarily and are woken up to retry after evict() completes.

1.2 Dentry

Dentry means Directory Entry. A filesystem uses directories to organize files into a tree; directory entries are the tree's nodes. An entry may point to another directory (an interior node) or to a file (a leaf). A directory entry can be summed up by the formula:

<Directory + Name> -> Inode Number

  • On-disk: through the disk data structure backing the Directory (xfs's btree or ext4's index tree), we look up the record for Name and take the Inode Number out of it;
  • In-Memory: we use a hash computed from Name plus the parent dentry pointer to index the dentry structure in a global hash table, and reach the file through the struct inode pointer it holds;

In this section we analyze how dentry works in memory.

1.2.1 What dentry Provides

In the VFS layer, a dentry provides the following:

  • Fast lookup from a file name to its inode.

    The hash is produced by full_name_hash() (exported); hash_name() returns the hash and the length packed together:

    link_path_walk()
    ---
        hash_len = hash_name(nd->path.dentry, name);
    ---

    Note that the parent dentry pointer is mixed into the hash.

    __d_lookup()
    ---
        rcu_read_lock();
        hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
            if (dentry->d_name.hash != hash)
                continue;
            spin_lock(&dentry->d_lock);
            if (dentry->d_parent != parent)
                goto next;
            if (d_unhashed(dentry))
                goto next;
            if (!d_same_name(dentry, parent, name))
                goto next;
            dentry->d_lockref.count++;
            found = dentry;
            spin_unlock(&dentry->d_lock);
            break;
    next:
            spin_unlock(&dentry->d_lock);
        }
        rcu_read_unlock();
    ---

    The kernel uses a hash table with chained collision resolution; note how both the parent pointer and the name string are compared.

  • The directory hierarchy: a dentry can reach its parent directory and the child dentries inside it (only files already in the dcache, i.e. ones that have been opened; merely running ls does not populate it);
    dentry.d_parent / d_child / dentry.d_subdirs organize the dentry hierarchy.
    The linking is done in d_add(); d_walk() can traverse the child entries of a directory.

  • hardlink information: multiple directory entries may point to the same inode, and dentry supports this too.
    A hardlink is one of several directory entries pointing at one inode; each dentry is linked via dentry.u.d_alias into inode.i_dentry.
    __d_find_alias(struct inode *inode) returns one dentry of an inode.

Note: hardlinks to directories are not allowed; for the reasons see the link: filesystem - Why are hard links not allowed for directories? - Ask Ubuntu

The most important reason is the directory-loop problem, e.g.:

mkdir -p /tmp/a/b
cd /tmp/a/b
ln -d /tmp/a l

cd /tmp/a/b/l/b/l/b/l/b/l/b ....

which destroys the tree structure of the file organization.

A hardlink is a directory entry, a node of the tree, while a symlink is a file, a leaf. Both can redirect, but they are fundamentally different things.

1.2.2 dentry Operations

The states of a dentry and the transitions between them:

  • Allocated

    A dentry obtained from d_alloc() can be called 'Allocated'. At this point:
    - the parent-child relationship with the parent directory's dentry is already established,
    - __d_entry_type(dentry) == 0, i.e. d_is_miss() / d_is_negative() hold,
    - it has not been inserted into the dcache hash table, i.e. d_unhashed() holds.

  • Hashed

    Whether the dentry has been inserted into the dcache hash table. The main paths are:
    - __d_add(), usually the lookup path, e.g. ext2_lookup() -> d_splice_alias() -> __d_add()
    - __d_move(), the rename path

  • Positive/Negative

    This state is determined by __d_entry_type(dentry): if it is 0 the dentry is empty and d_is_negative() is TRUE. d_flags_for_inode() returns a dentry type according to the inode's type, and __d_set_inode_and_type() installs the inode and type into the dentry, after which d_is_positive() is TRUE. __d_set_inode_and_type() has two main callers:
    - __d_add(), the lookup path; see ext2_lookup(): if the file exists, its inode is obtained and __d_add() installs it into the dentry; otherwise the inode is NULL, and the dentry inserted into the hash table is negative;
    - __d_instantiate(): for a negative dentry from the previous case, this interface installs an inode into it, turning it positive; typical scenarios are link, create and mkdir.

The dcache deliberately keeps some negative dentries around: they let the kernel decide quickly that a file does not exist, without calling into the filesystem's lookup. See "The value of negative dentries" [LWN.net].

1.2.3 root dentry

What is the relationship between a filesystem's root inode and the dentry of the mount point? See:

follow_managed()
---
    struct vfsmount *mounted = lookup_mnt(path);
    if (mounted) {
        dput(path->dentry);
        if (need_mntput)
            mntput(path->mnt);
        path->mnt = mounted;
        path->dentry = dget(mounted->mnt_root);
        need_mntput = true;
        continue;
    }
---

struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
{
    struct hlist_head *head = m_hash(mnt, dentry);
    struct mount *p;
    hlist_for_each_entry_rcu(p, head, mnt_hash)
        if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
            return p;
    return NULL;
}

do_new_mount()
-> vfs_kern_mount()
-> do_add_mount()
   -> graft_tree()
      -> attach_recursive_mnt()
         -> commit_tree()
            -> __attach_mnt()

static void __attach_mnt(struct mount *mnt, struct mount *parent)
{
    hlist_add_head_rcu(&mnt->mnt_hash,
               m_hash(&parent->mnt, mnt->mnt_mountpoint));
    list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
}

From this we can conclude:

  • The root inode of the mounted filesystem is not installed into the mount point's dentry; instead, the mount point's dentry plus the vfsmount (which distinguishes namespaces) index the corresponding mount structure in m_hash, from which we get the dentry of the mounted filesystem's root inode (IS_ROOT);
  • Each mount structure is inserted at the head of its hash chain, so what we see is always the most recently mounted one;

1.2.4 d_splice_alias

Most filesystems' inode lookup methods use d_splice_alias(); see ext2_lookup, ext4_lookup, xfs_vn_lookup, etc. Its handling of directories is rather puzzling: why can a directory inode "already have a dentry"? Because vfs_rename may already have modified the dentry via d_move.

Looking through the commit history of d_splice_alias, we find this commit:

commit b5ae6b15bd73e35b129408755a0804287a87e041
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Sun Oct 12 22:16:02 2014 -0400

merge d_materialise_unique() into d_splice_alias()
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

So today's d_splice_alias also contains part of the old d_materialise_unique. Through this commit, we can look at the original d_splice_alias:

-struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
-{
-       struct dentry *new = NULL;
-
-       if (IS_ERR(inode))
-               return ERR_CAST(inode);
-
-       if (inode && S_ISDIR(inode->i_mode)) {
-               spin_lock(&inode->i_lock);
-               new = __d_find_any_alias(inode);
-               if (new) {
-                       if (!IS_ROOT(new)) {
-                               spin_unlock(&inode->i_lock);
-                               dput(new);
-                               iput(inode);
-                               return ERR_PTR(-EIO);
-                       }
-                       // If dentry lies below new's subtree, this would create a loop
-                       if (d_ancestor(new, dentry)) {
-                               spin_unlock(&inode->i_lock);
-                               dput(new);
-                               iput(inode);
-                               return ERR_PTR(-EIO);
-                       }
-                       write_seqlock(&rename_lock);
-                       __d_move(new, dentry, false);
-                       write_sequnlock(&rename_lock);
-                       spin_unlock(&inode->i_lock);
-                       security_d_instantiate(new, inode);
-                       iput(inode);
-               } else {
-                       /* already taking inode->i_lock, so d_add() by hand */
-                       __d_instantiate(dentry, inode);
-                       spin_unlock(&inode->i_lock);
-                       security_d_instantiate(dentry, inode);
-                       d_rehash(dentry);
-               }
-       } else {
-               d_instantiate(dentry, inode);
-               if (d_unhashed(dentry))
-                       d_rehash(dentry);
-       }
-       return new;
-}
-EXPORT_SYMBOL(d_splice_alias);

So the puzzling part of the current d_splice_alias, namely __d_unalias, comes mainly from d_materialise_unique, whose main user is the NFS client. NFS is a network shared filesystem, but only weakly coherent; see the link: nfs-cache-coherence.txt · GitHub

On dentry cache coherence, it says the following:

Directory entry caching
       The Linux NFS client caches the result of all NFS LOOKUP requests.  If the requested directory entry exists on the server, the result is referred to as a positive lookup result.  If the requested directory entry does not exist on the server (that is, the server returned ENOENT), the result is referred to as negative lookup result.

To  detect when directory entries have been added or removed on the server, the Linux NFS client watches a directory's mtime.  If the client detects a change in a directory's mtime, the client drops all cached LOOKUP results for that directory.  Since the directory's mtime is a cached attribute, it may take some time before a client notices it has changed.  See the descriptions of the acdirmin, acdirmax, and noac mount options for more information about how long a directory's mtime is cached.

Caching directory entries improves the performance of applications that do not share files with applications on other clients.  Using cached information about directories can interfere with applications that run concurrently on multiple clients and need to detect the creation or removal of files quickly, however.  The lookupcache mount option allows some tuning of directory entry caching behavior.

Before kernel release 2.6.28, the Linux NFS client tracked only positive lookup results.  This permitted applications to detect new directory entries created by other clients quickly while still providing some of the performance benefits of caching.  If an application depends on the previous lookup caching behavior of the Linux NFS client, you can use lookupcache=positive.

If the client ignores its cache and validates every application lookup request with the server, that client can immediately detect when a new directory entry has been either created or removed by another client.  You can specify this behavior using lookupcache=none.  The extra NFS requests needed if the client does not cache directory entries can exact a performance penalty.  Disabling lookup caching should result in less of a performance  penalty  than using noac, and has no effect on how the NFS client caches the attributes of files.

In light of the passage above, and looking at the __d_unalias part of d_splice_alias: the kernel does not simply drop all cached dentries under the directory; rather, when it finds that a dentry already exists for the same directory inode, it uses __d_unalias to swap it in.

The IS_ROOT branch also comes from NFS; see d_obtain_alias(). The exact reason remains to be studied...

1.2.5 lookup

References:

Dcache scalability and RCU-walk [LWN.net]

Pathname lookup in Linux [LWN.net]

RCU-walk: faster pathname lookup in Linux [LWN.net]

path-lookup.txt [LWN.net]

Scaling dcache with RCU | Linux Journal

When we try to open the file /home/will/Desktop/linux-stable, the kernel actually breaks it down into:

1. lookup "/"
2. lookup "home"         in dentry of "/"
3. lookup "will"         in dentry of "/home"
4. lookup "Desktop"      in dentry of "/home/will"
5. lookup "linux-stable" in dentry of "/home/will/Desktop"

A dcache lookup only guarantees the atomicity of each individual step, not of the whole lookup of /home/will/Desktop/linux-stable; in other words, what we end up with could be /home/kate/Desktop/linux-stable.

There are two kinds of lookup, ref-walk and rcu-walk. "ref" and "rcu" name the mechanism used to guarantee that a dentry structure is not freed (or modified) while we operate on it:

  • ref-walk uses dentry.d_lock to keep dentry.d_parent and dentry.d_name safe from rename, and dentry.d_lockref to keep the parent dentry and dentry.d_inode from being freed; this allows more complex operations, such as calling into the filesystem's lookup method or even waiting for IO;
  • rcu-walk uses an rcu read critical section to keep the parent dentry from being freed, the parent dentry's d_seq to detect modification (notably unlink), and the dentry's own d_seq to protect dentry.d_name and dentry.d_parent from rename; on any anomaly it falls back to ref-walk;

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. creat or negative lookup)
* parent with ->d_op->d_hash
* parent with d_inode->i_op->permission or ACLs
* Following links
* Dentry with ->d_op->d_revalidate
Why does seqcount_t scale better? There are no atomic operations in it at all:

typedef struct seqcount {
    unsigned sequence;
} seqcount_t;

static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
    unsigned ret;
repeat:
    ret = READ_ONCE(s->sequence);
    if (unlikely(ret & 1)) {
        cpu_relax();
        goto repeat;
    }
    return ret;
}

static inline void raw_write_seqcount_begin(seqcount_t *s)
{
    s->sequence++;
    smp_wmb();
}

static inline void raw_write_seqcount_end(seqcount_t *s)
{
    smp_wmb();
    s->sequence++;
}

Note: the kernel resolves hash-table collisions with linked lists because a list has an rcu variant, and rcu-based lookups scale far better.

When the dcache lookup finds no matching dentry, or finds a negative one, we enter __lookup_slow, which consists of two main steps:

  • d_alloc_parallel(): allocate a dentry and insert it into the in-lookup hash table; note this is not the dcache hash table. The dentry is still unhashed and cannot be found by normal lookup; all other concurrent lookups of the same name also funnel into this function;
  • inode->i_op->lookup(): the filesystem's lookup, which issues two IO operations, the directory-entry lookup and the fetch of the on-disk inode; whether or not the entry is found, the dentry is then inserted into d_hash;

A point worth stressing: the filesystem's lookup does IO and can be slow, so the other concurrent lookups, synchronized through the in-lookup hash's lock, sleep in d_wait_lookup().

Note: one question is whether it is redundant for every task entering this path during a lookup to run d_alloc(); could the negative, in-lookup dentry instead be inserted into d_hash, so that other tasks looking up the same dentry could simply wait on it?

The essence of a symlink is a path stored in a file; it is effectively a filesystem-level pointer to a file.

Before analyzing symlink handling, we first need to see how the kernel parses a file path.

A path such as "/home/will/Desktop/linux-stable" is parsed entirely by the kernel; the code is below (some error-handling branches have been removed for clarity):

link_path_walk()
---
    while (*name=='/')
        name++;
    if (!*name)
        return 0;
    for(;;) {
        // hash_name() treats '/' as the end of the name when computing len
        hash_len = hash_name(nd->path.dentry, name);
        // Handling of .. and . in the path, as in /home/will/Desktop/../Download
        type = LAST_NORM;
        if (name[0] == '.') switch (hashlen_len(hash_len)) {
            case 2:
                if (name[1] == '.') {
                    type = LAST_DOTDOT;
                    nd->flags |= LOOKUP_JUMPED;
                }
                break;
            case 1:
                type = LAST_DOT;
        }
        nd->last.hash_len = hash_len;
        nd->last.name = name;
        nd->last_type = type;
        // Advance to the next element of the path
        name += hashlen_len(hash_len);
        if (!*name)
            goto OK;
        // Any number of '/' between elements is fine
        do {
            name++;
        } while (unlikely(*name == '/'));
        if (unlikely(!*name)) {
        } else {
            // Handle the current path element
            /* not the last component */
            err = walk_component(nd, WALK_FOLLOW | WALK_MORE);
        }
        ...
    }
---

In short, walk_component() is called in turn for each element of the path.

Once symlinks enter the picture, one element of a path can expand into another whole path, and part of that symlink path may itself be another symlink; hence a stack is introduced. For example:

/home/will/Desktop/sourcecode/linux-stable
                   |
                   +--> /mnt/disk1/AAAA/sourcecode
                                   |
                                   +--> /mnt/disk2/BBBB

The path is resolved as:

stack[0]   home -> will -> Desktop -> sourcecode                +-> linux-stable
                          ____________/\__________              |
                         /                        \             |
stack[1]               mnt  ->  disk1   -> AAAA                 |
                          ____________/\__________              |
                         /                        \             |
                       mnt  ->  disk2   -> BBBB  ---------------+

The code involved with the symlink stack:

Growing the stack (push):

walk_component() -> step_into() -> pick_link() -> nd_alloc_stack()

link_path_walk()
---
    if (err) {
        // This calls the filesystem's get_link method
        const char *s = get_link(nd);
        ...
        if (unlikely(!s)) {
        } else {
            nd->stack[nd->depth - 1].name = name;
            name = s;
            continue;
        }
    }
---

Popping the stack:

link_path_walk()
---
    if (unlikely(!*name)) {
OK:
        /* pathname body, done */
        if (!nd->depth)
            return 0;
        // Stack is popped here
        name = nd->stack[nd->depth - 1].name;
        /* trailing symlink, done */
        if (!name)
            return 0;
        /* last component of nested symlink */
        err = walk_component(nd, WALK_FOLLOW);
    } else {
        /* not the last component */
        err = walk_component(nd, WALK_FOLLOW | WALK_MORE);
    }
---

Without WALK_MORE:

walk_component() -> step_into() -> put_link() -> nd->depth--

1.3 mount

Two key points need to be understood about mount:

  • A mount operation "grafts" one directory tree onto another; the root filesystem provides the root tree for all other filesystems;
  • mount namespace: processes in different namespaces see different sets of mounts.

There are also two similar-looking structures, mount and vfsmount; for where they come from, see:

vfs: start hiding vfsmount guts series
    
    Almost all fields of struct vfsmount are used only by core VFS (and
    a fairly small part of it, at that).  The plan: embed struct vfsmount
    into struct mount, making the latter visible only to core parts of VFS.
    Then move fields from vfsmount to mount, eventually leaving only
    mnt_root/mnt_sb/mnt_flags in struct vfsmount.  Filesystem code still
    gets pointers to struct vfsmount and remains unchanged; all such
    pointers go to struct vfsmount embedded into the instances of struct
    mount allocated by fs/namespace.c.  When fs/namespace.c et.al. get
    a pointer to vfsmount, they turn it into pointer to mount (using
    container_of) and work with that.
    
    This is the first part of series; struct mount is introduced,
    allocation switched to using it.
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

In short, vfsmount and mount can be treated as a single structure.

1.3.1 Grafting

The key to grafting one directory tree onto another is __lookup_mnt():

struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
{
    struct hlist_head *head = m_hash(mnt, dentry);
    struct mount *p;
    hlist_for_each_entry_rcu(p, head, mnt_hash)
        if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
            return p;
    return NULL;
}

Given a vfsmount and a dentry pointer, the function finds the matching mount structure:

  • dentry: the dentry of the mount point;
  • vfsmount: note that this is the vfsmount of the directory containing the mount point; it is set up in path_init():

    path_init()
    ---
        if (*s == '/') {
            ....
            // nd->root = current->fs->root
            set_root(nd);
            // nd->path = nd->root
            if (likely(!nd_jump_root(nd)))
                return s;
            ...
        } else if (nd->dfd == AT_FDCWD) {
            if (flags & LOOKUP_RCU) {
                ...
            } else {
                // nd->path = current->fs->pwd
                get_fs_pwd(current->fs, &nd->path);
                nd->inode = nd->path.dentry->d_inode;
            }
            return s;
        }
    ---

mount: this is the mount structure we are looking for; its mnt_parent points to the mount of the directory holding the mount point. See:

do_add_mount()
---
    parent = real_mount(path->mnt);
    err = graft_tree(newmnt, parent, mp);
          -> attach_recursive_mnt()
             -> mnt_set_mountpoint()
---

When crossing a mount point from one directory tree into another, both the dentry and the vfsmount are switched:

walk_component()
-> follow_managed()
---
    if (managed & DCACHE_MOUNTED) {
        struct vfsmount *mounted = lookup_mnt(path);
        if (mounted) {
            dput(path->dentry);
            if (need_mntput)
                mntput(path->mnt);
            path->mnt = mounted;
            path->dentry = dget(mounted->mnt_root);
            need_mntput = true;
            continue;
        }
    }
---
-> step_into()
   -> path_to_nameidata()
---
    nd->path.mnt = path->mnt;
    nd->path.dentry = path->dentry;
---

1.3.2 loop mount

Also called a bind mount; it works like this:

mkdir /tmp/foo
touch /tmp/foo/will
mkdir /tmp/bar

mount --bind /tmp/foo /tmp/bar

Afterwards we can reach /tmp/foo through /tmp/bar. But a symlink can achieve something similar:

mkdir /tmp/foo
touch /tmp/foo/will

ln -s /tmp/foo /tmp/bar

What is the difference? We can reason from how each works:

  • symlink: resolving a symlink re-parses the stored path against the current directory tree, then diverts to the other directory or file;
  • bind mount: vfsmount + dentry locate a new mount, jumping to the other directory. As LWN puts it:

Bind mounts can be thought of as a sort of symbolic link at the filesystem level. Using mount --bind, it is possible to create a second mount point for an existing filesystem, making that filesystem visible at a different spot in the namespace. Bind mounts are thus useful for creating specific views of the filesystem namespace; one can, for example, create a bind mount which makes a piece of a filesystem visible within an environment which is otherwise closed off with chroot().

Let's first see how chroot is implemented: it modifies task_struct.fs.root.

ksys_chroot()
---
    error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
    set_fs_root(current->fs, &path);
    ---
        mntget(path->mnt);
        dget(path->dentry);
        spin_lock(&fs->lock);
        write_seqcount_begin(&fs->seq);
        old_root = fs->root;
        fs->root = *path;
        write_seqcount_end(&fs->seq);
        spin_unlock(&fs->lock);
        if (old_root.dentry) {
            dput(old_root.dentry);
            mntput(old_root.mnt);
        }
    ---
---

This structure is inherited by child tasks on fork:

copy_process()
-> copy_fs()
   -> copy_fs_struct()
---
    spin_lock(&old->lock);
    fs->root = old->root;
    path_get(&fs->root);
    fs->pwd = old->pwd;
    path_get(&fs->pwd);
    spin_unlock(&old->lock);
---

Putting this together with path resolution: if a symlink holds an absolute path, the inode it resolves to changes after a chroot; with a bind mount, as long as the chroot path is still within the original filesystem, the vfsmount does not change, and with the dentry unchanged the result stays the same.

So how is a bind mount implemented?

do_loopback()
---
    // bind mount directory
    old = real_mount(old_path.mnt);
    // mount point
    parent = real_mount(path->mnt);
    mnt = clone_mnt(old, old_path.dentry, 0);
    ---
        mnt = alloc_vfsmnt(old->mnt_devname);
        ...
        // Take a reference on the sb of the filesystem holding the
        // bind mount directory
        atomic_inc(&sb->s_active);
        mnt->mnt.mnt_sb = sb;
        // The root of the bind mount is the bind mount directory itself
        mnt->mnt.mnt_root = dget(root);
        // Updated again when the mountpoint is set
        mnt->mnt_mountpoint = mnt->mnt.mnt_root;
        mnt->mnt_parent = mnt;
        lock_mount_hash();
        list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
        unlock_mount_hash();
    ---
    err = graft_tree(mnt, parent, mp);
          -> attach_recursive_mnt()
             -> mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
---

A read-only bind mount is based on the RDONLY flag in the mount structure:

mount --bind /home/will/Desktop/sourcecode /mnt/readonly
mount -o remount,ro /mnt/readonly

filename_create()
do_renameat2()
do_unlinkat()
do_last()
....
mnt_want_write()
-> __mnt_want_write()
---
    preempt_disable();
    mnt_inc_writers(mnt);
    ...
    smp_rmb();
    if (mnt_is_readonly(m)) {
        mnt_dec_writers(mnt);
        ret = -EROFS;
    }
    preempt_enable();
    return ret;
---

1.4 page cache

In this section we look at how the page cache is organized and used; for its reclaim and writeback, see:

Linux Memory Management (jianchwa's CSDN blog)

  • 4.4 active and inactive
  • 4.6 Several counts in vmscan
  • 5 Writeback

1.4.1 radix tree

The page cache is maintained in a radix tree. In this section we look at how the radix tree is implemented and discuss why it is used.

A radix tree is essentially an array whose capacity depends on the number of slots per node and the height, i.e. N^H, as shown below:

Unlike an ordinary array, a radix tree supports holes:

It therefore saves memory compared with a plain array while keeping the array's fast lookup; lookup is not O(1) but O(H), where H = log_N(S), S being the total capacity: the larger N, the smaller H.

Now let's see how the radix tree is implemented in the Linux kernel:

  • radix_tree_node:

    struct radix_tree_node {
        unsigned int count;
        ...
        void __rcu      *slots[RADIX_TREE_MAP_SIZE];
        unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
    };

    #define RADIX_TREE_MAP_SHIFT  (6)
    #define RADIX_TREE_MAP_SIZE   (1UL << RADIX_TREE_MAP_SHIFT)
    #define RADIX_TREE_MAX_TAGS   3
    #define RADIX_TREE_TAG_LONGS    \
        ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)

    - each node holds 64 slots,
    - count is the number of slots installed in this node,
    - tags gives every slot 3 tag bits to record attribute information; for the page cache the three tags are:

    /*
     * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
     * radix trees
     */
    #define PAGECACHE_TAG_DIRTY     0
    #define PAGECACHE_TAG_WRITEBACK 1
    #define PAGECACHE_TAG_TOWRITE   2

  • extend: when the index of an inserted element exceeds the current maximum capacity, the height of the radix tree must be extended, roughly like this:

         Root                        Root
        /    \          -\            |
    Node0    Node1      -/          Node2
                                   /     \
                               Node0     Node1

    static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
    {
        ...
        /* Figure out what the height should be.  */
        height = root->height + 1;
        while (index > radix_tree_maxindex(height))
            height++;
        ...
        do {
            if (!(node = radix_tree_node_alloc(root)))
                return -ENOMEM;
            ...
            /* Increase the height.  */
            newheight = root->height + 1;
            node->count = 1;
            node->parent = NULL;
            slot = root->rnode;
            if (newheight > 1) {
                slot = indirect_to_ptr(slot);
                slot->parent = node;
            }
            // Install the old root node into slot[0] of the new root node
            node->slots[0] = slot;
            node = ptr_to_indirect(node);
            rcu_assign_pointer(root->rnode, node);
            root->height = newheight;
        } while (height > root->height);
    out:
        return 0;
    }
    

4系内核中,radix tree引入了order的概念,参考下图:

这个版本的radix tree,使用shift替换了height的概念;order代表的是,按2^order对齐的范围,如上图所示的order = 3和order = 2;在order范围内,slot都指向其第一个成员(蓝色红框),同时,子成员都被释放;参考代码:

__radix_tree_create()
---
//遍历的停止条件,比较order
while (shift > order) {
    shift -= RADIX_TREE_MAP_SHIFT;
    if (child == NULL) {
        /* Have to add a child node.  */
        child = radix_tree_node_alloc(gfp, node, root, shift,
                                      offset, 0, 0);
        rcu_assign_pointer(*slot, node_to_entry(child));
        if (node)
            node->count++;
    } else if (!radix_tree_is_internal_node(child))
        break;

    /* Go a level down */
    node = entry_to_node(child);
    offset = radix_tree_descend(node, &child, index);
    slot = &node->slots[offset];
}
---

insert_entries()
---
if (node) {
    if (order > node->shift)
        n = 1 << (order - node->shift);
    else
        n = 1;
    offset = get_slot_offset(node, slot);
} else {
    n = 1;
    offset = 0;
}

if (n > 1) {
    offset = offset & ~(n - 1);
    slot = &node->slots[offset];
}
sibling = xa_mk_sibling(offset);
...
for (i = 0; i < n; i++) {
    struct radix_tree_node *old = rcu_dereference_raw(slot[i]);
    // 第一个成员保存item,其他保存sibling指针
    if (i)
        rcu_assign_pointer(slot[i], sibling);
    else
        rcu_assign_pointer(slot[i], item);
    //释放掉所有子成员
    if (xa_is_node(old))
        radix_tree_free_nodes(old);
    if (xa_is_value(old))
        node->nr_values--;
}
---

1.4.2 page cache tags

page cache在radix tree中存在三个tag,即

  • PAGECACHE_TAG_DIRTY,用来标记脏页,

    标记脏页:
    __set_page_dirty()
    ---
    xa_lock_irqsave(&mapping->i_pages, flags);
    if (page->mapping) { /* Race with truncate? */
        account_page_dirtied(page, mapping);
        radix_tree_tag_set(&mapping->i_pages,
                           page_index(page), PAGECACHE_TAG_DIRTY);
    }
    xa_unlock_irqrestore(&mapping->i_pages, flags);
    ---

    在writeback时,使用该标记将脏页写出:
    write_cache_pages()
    ---
    if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
        tag = PAGECACHE_TAG_TOWRITE;
    else
        tag = PAGECACHE_TAG_DIRTY;
    if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
        tag_pages_for_writeback(mapping, index, end);
    done_index = index;
    while (!done && (index <= end)) {
        int i;
        nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end, tag);
        if (nr_pages == 0)
            break;
        for (i = 0; i < nr_pages; i++) {
            struct page *page = pvec.pages[i];
            done_index = page->index;
            lock_page(page);
            if (!clear_page_dirty_for_io(page))
                goto continue_unlock;
            ret = (*writepage)(page, wbc, data);
            if (--wbc->nr_to_write <= 0 &&
                wbc->sync_mode == WB_SYNC_NONE) {
                done = 1;
                break;
            }
        }
        pagevec_release(&pvec);
        cond_resched();
    }
    ---

    dirty转为writeback:
    __test_set_page_writeback()
    ---
    if (mapping && mapping_use_writeback_tags(mapping)) {
        ...
        xa_lock_irqsave(&mapping->i_pages, flags);
        ret = TestSetPageWriteback(page);
        if (!ret) {
            radix_tree_tag_set(&mapping->i_pages, page_index(page),
                               PAGECACHE_TAG_WRITEBACK);
            ...
        }
        if (!PageDirty(page))
            radix_tree_tag_clear(&mapping->i_pages, page_index(page),
                                 PAGECACHE_TAG_DIRTY);
        if (!keep_write)
            radix_tree_tag_clear(&mapping->i_pages, page_index(page),
                                 PAGECACHE_TAG_TOWRITE);
        xa_unlock_irqrestore(&mapping->i_pages, flags);
    }
    ---
  • PAGECACHE_TAG_WRITEBACK,IO发出但是还没有完成的Page标记为writeback,参考代码:
    __test_set_page_writeback()
    test_clear_page_writeback()

    等待处于writeback状态(即写IO发出但未完成)的page:
    __filemap_fdatawait_range()
    ---
    pagevec_init(&pvec);
    while (index <= end) {
        unsigned i;
        nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
                                            PAGECACHE_TAG_WRITEBACK);
        if (!nr_pages)
            break;
        for (i = 0; i < nr_pages; i++) {
            struct page *page = pvec.pages[i];
            wait_on_page_writeback(page);
            ClearPageError(page);
        }
        pagevec_release(&pvec);
        cond_resched();
    }
    ---
  • PAGECACHE_TAG_TOWRITE,TOWRITE tag的存在是为了避免sync的时候livelock,具体参考函数及其注释:

    This function scans the page range from @start to @end (inclusive) and tags
    all pages that have DIRTY tag set with a special TOWRITE tag. The idea is
    that write_cache_pages (or whoever calls this function) will then use
    TOWRITE tag to identify pages eligible for writeback.  This mechanism is
    used to avoid livelocking of writeback by a process steadily creating new
    dirty pages in the file (thus it is important for this function to be quick
    so that it can tag pages faster than a dirtying process can create them).

    tag_pages_for_writeback()
    ---
    xa_lock_irq(&mapping->i_pages);
    radix_tree_for_each_tagged(slot, &mapping->i_pages, &iter, start,
                               PAGECACHE_TAG_DIRTY) {
        if (iter.index > end)
            break;
        radix_tree_iter_tag_set(&mapping->i_pages, &iter,
                                PAGECACHE_TAG_TOWRITE);
        tagged++;
        if ((tagged % WRITEBACK_TAG_BATCH) != 0)
            continue;
        ...
    }
    xa_unlock_irq(&mapping->i_pages);
    ---
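结合上面的定义,每个node为每种tag维护一个覆盖全部slot的bitmap,tag的set/clear/test都是简单的位操作;下面是一个简化的sketch(结构和函数名为自拟,并非内核代码):

```c
#include <assert.h>
#include <stdbool.h>

#define MAP_SIZE  64    /* 每个node的slot数 */
#define MAX_TAGS  3     /* DIRTY / WRITEBACK / TOWRITE */
#define TAG_LONGS ((MAP_SIZE + 63) / 64)

struct node_tags {
    unsigned long tags[MAX_TAGS][TAG_LONGS];
};

/* 为offset处的slot设置tag */
static void tag_set(struct node_tags *n, int tag, unsigned int offset)
{
    n->tags[tag][offset / 64] |= 1UL << (offset % 64);
}

/* 清除offset处slot的tag */
static void tag_clear(struct node_tags *n, int tag, unsigned int offset)
{
    n->tags[tag][offset / 64] &= ~(1UL << (offset % 64));
}

/* 测试offset处slot是否带有tag */
static bool tag_test(struct node_tags *n, int tag, unsigned int offset)
{
    return n->tags[tag][offset / 64] & (1UL << (offset % 64));
}
```

内核在此之上还会把tag逐层向上传播到父节点,这样pagevec_lookup_range_tag()这类tagged查找可以整棵子树地跳过没有该tag的区域。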

1.4.3 mmap

对page cache的访问有两种方式:

  • read/write系统调用
  • mmap到地址空间内,直接访问

通过mmap访问可以节省系统调用相关代码的开销,并减少一次内存拷贝,是最高效的访问方式。
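下面是一个通过mmap读取文件的最小用户态示例(仅为示意,map_file为自拟的辅助函数):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* 将整个文件只读mmap进地址空间,返回映射地址,文件长度写入*len */
static char *map_file(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }

    /* 只读私有映射;之后对addr的访问触发page fault,由page cache填充 */
    char *addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);    /* 映射建立后即可关闭fd */
    if (addr == MAP_FAILED)
        return NULL;

    *len = st.st_size;
    return addr;
}
```

与read不同,这里没有"从page cache拷贝到用户buffer"的过程,用户直接访问的就是映射进地址空间的page cache。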

mmap主要涉及两个点:

  • 创建vma,vma代表的是进程的一段地址空间,其中保存的是相关地址发生page fault之后的处理策略,

    • 通过file_operations.get_unmapped_area()回调获取虚拟地址,换句话说,文件系统可以通过该回调影响文件被映射到的地址;
    • 把涉及文件访问的相关策略设置到vma中;具体可以参考以下函数:
      • generic_file_mmap() 设置vma->ops为generic_file_vm_ops
        vma_set_anonymous() 设置vma->ops为NULL

        handle_pte_fault()
        ---
        if (!vmf->pte) {
            if (vma_is_anonymous(vmf->vma))  // check vma->ops
                return do_anonymous_page(vmf);
            else
                return do_fault(vmf);
        }
        ---
  • 发生page fault之后,由handle_pte_fault()处理,其中有几个关键的函数:

    __do_fault() -> filemap_fault()
    __do_fault()负责为读操作准备好page,其中使用了filemap_read_page()

    do_page_mkwrite()则负责为写操作准备好page,比如:
    filemap_page_mkwrite()
    ---
    file_update_time(vmf->vma->vm_file);
    set_page_dirty(page);
    ---
    再比如:
    ext4_page_mkwrite()
    ---
    /* Delalloc case is easy... */
    if (test_opt(inode->i_sb, DELALLOC) &&
        !ext4_nonda_switch(inode->i_sb)) {
        do {
            err = block_page_mkwrite(vma, vmf, ext4_da_get_block_prep);
        } while (err == -ENOSPC &&
                 ext4_should_retry_alloc(inode->i_sb, &retries));
        goto out_ret;
    }
    ---
    如果是非delay allocation模式,还会在这里分配空间

    finish_fault() -> do_set_pte() 设置页表

    do_fault_around() -> filemap_map_pages()
    这是一个优化,do_fault_around()尝试将发生fault的page周围的page都map起来

    fault_dirty_shared_page() -> balance_dirty_pages_ratelimited()
    这点跟正常的write系统调用一样

另外,我们考虑下,系统如何感知到mmap的page上的写操作?答案是,writeprotect,我们看下函数:

handle_pte_fault()
---
// vmf->pte is not NULL and present
if (vmf->flags & FAULT_FLAG_WRITE) {
    if (!pte_write(entry))
        return do_wp_page(vmf);
    entry = pte_mkdirty(entry);
}
---
do_wp_page()会调用到page_mkwrite,其中会调用set_page_dirty()和
file_update_time()等;而write_cache_pages()则会调用
clear_page_dirty_for_io() -> page_mkclean()清除相关标记位

1.4.4 direct IO

buffer read IO且cache miss时,IO分为两步:

  • 用cache page组成bio提交给block层并直达设备;存储设备通过DMA向cache page中写入数据,然后返回IO成功;
  • 从cache page中将数据拷贝到用户buffer

Direct IO是指,直接用用户Buffer组装成bio提交给block层并直达设备,由此,我们可以获得以下收益:

  • 避免引入page cache,有时用户想要在自己的程序里维护cache;
  • 使用异步IO,Linux内核的buffer IO都是同步的,其内部并没有维护异步上下文的机制;io-uring其实也是伪异步,它本质上是使用了一个内核线程上下文等待IO;但是使用direct IO,其可以使用bio中的异步上下文,实现真异步;

direct IO的实现分为两个关键部分:

  • 准备好用户态buffer;在buffer IO或者其他与内核交互的场景中,我们会用到copy_from/to_user系列接口,如果访问的用户态地址还没有分配物理内存,则会在page fault中分配(如果地址是非法的,则会返回-EFAULT,参考5. Kernel level exception handling — The Linux Kernel documentation);但是,显然,我们不能把这样一个地址直接交给设备,于是,direct IO需要使用get_user_pages_fast()保证这些page都已申请好,并增加引用计数;
  • 组装bio,其中最重要的是获取所需内容在磁盘上的地址,目前主要有两种框架:
    generic_file_read_iter()
    -> mapping->a_ops->direct_IO()

    ext2_direct_IO()
    -> blockdev_direct_IO(iocb, inode, iter, ext2_get_block);

    ext4_file_read_iter()
    -> ext4_dio_read_iter()
    -> iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0);
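另外,direct IO通常要求用户buffer地址、文件偏移和长度都按设备的逻辑块大小对齐(可通过ioctl的BLKSSZGET获取),否则会返回-EINVAL;下面是一个检查对齐的简单sketch(dio_aligned为自拟函数,仅为示意):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

/* 检查buffer地址、文件offset、长度是否都按逻辑块大小lbs对齐 */
static bool dio_aligned(const void *buf, off_t offset, size_t len,
                        unsigned int lbs)
{
    uintptr_t addr = (uintptr_t)buf;

    /* 三者只要有一个不对齐,按位或之后对lbs取模就不为0 */
    return (addr | (uintptr_t)offset | (uintptr_t)len) % lbs == 0;
}
```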

1.5 DAX

DAX,direct access extensions, 其官方定义为,

File system extensions to bypass the page cache and block layer to mmap persistent memory, from a PMEM block device, directly into a process address space

DAX的引入有几方面的原因:

  • 兼容现有的通用存储架构,目前绝大多数应用都是基于POSIX API和树形目录文件存储和组织数据的,所以NVDIMM也必须如此;
  • 通过0.3小节中的介绍,NVDIMM的存储的访问延迟已经达到DDR的相同数量级,这时,额外的内存拷贝甚至系统调用及内核代码开销都会显现出来,所以,我们需要的是direct IO + mmap;
  • NVDIMM设备的另一个优势是支持字节级访问,这对于存储数据结构的设计是革命性的(虽然目前的AEP实际表现并没有那么好);所以,我们还需要bypass只支持块级访问的Block层;

所以,DAX在整个Linux内核存储栈上开了一个旁路,bypass page cache和block layer,如下:

1.5.1 Pmem

Pmem设备是用来支持DAX模式的块设备;嗯?既然bypass了block层,为什么还需要块设备?原因在于,DAX要兼容原来的文件系统,比如Ext2、Ext4、XFS等,而这些文件系统的元数据依然是基于块设备,所以nvdimm设备依然需要块设备访问接口;参考如下代码:

pmem_submit_bio()
---
bio_for_each_segment(bvec, bio, iter) {
    if (op_is_write(bio_op(bio)))
        rc = pmem_do_write(pmem, bvec.bv_page, bvec.bv_offset,
                           iter.bi_sector, bvec.bv_len);
    else
        rc = pmem_do_read(pmem, bvec.bv_page, bvec.bv_offset,
                          iter.bi_sector, bvec.bv_len);
    if (rc) {
        bio->bi_status = rc;
        break;
    }
}
bio_endio(bio);
---

pmem_submit_bio()
-> pmem_do_write()
   //数据的写入地址,从sector转换而来
   -> phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
   -> write_pmem()
      -> memcpy_flushcache()
         -> __memcpy_flushcache()
            //写入size一定是512对齐的
            ---
            /* 4x8 movnti loop */
            while (size >= 32) {
                asm("movq    (%0), %%r8\n"
                    "movq   8(%0), %%r9\n"
                    "movq  16(%0), %%r10\n"
                    "movq  24(%0), %%r11\n"
                    "movnti  %%r8,   (%1)\n"
                    "movnti  %%r9,  8(%1)\n"
                    "movnti %%r10, 16(%1)\n"
                    "movnti %%r11, 24(%1)\n"
                    :: "r" (source), "r" (dest)
                    : "memory", "r8", "r9", "r10", "r11");
                dest += 32;
                source += 32;
                size -= 32;
            }
            ---

上面的代码中,最终的拷贝操作使用了ntstore(movnti)指令,该指令可以保证指令返回时,数据落入非易失性存储介质。但是,从实现上,我们可以看到,pmem并不支持sector级更新的原子性。

这对文件系统metadata更新正确性是否会造成影响?

  • Ext4和Xfs都通过日志来保证metadata更新的原子性,即all or nothing;
  • Ext4所使用的jbd2和Xfs log,都有checksum机制,在异常重启后recovery过程中,可以识别出不完整的日志,尤其是最后的commit record;

1.5.2 fs/dax

fs/dax.c内提供了文件系统实现dax的几个通用功能:

read/write dax访问 ,也就是调用read/write系列接口时,dax_iomap_rw()通过文件系统提供的iomap_begin/end回调,获取文件的映射,然后直接访问pmem,这个过程中bypass了page cache和block层;另外,对于写操作,参考上一小节,其使用的ntstore,可以保证数据落入掉电保护域。

dax_iomap_rw()
---
while ((ret = iomap_iter(&iomi, ops)) > 0)
    ---
    (1) iomap_end
    (2) advance iter by iter.processed
    (3) iomap_begin, based on iter.inode, iter.pos, iter.len,
        get iomap.addr, iomap.offset, iomap.length
    ---
    iomi.processed = dax_iomap_iter(&iomi, iter);
    ---
    read/write on iomap.addr + (iter.pos - iomap.offset)
    ---
---

dax_iomap_iter()
---
/*
 * iter->pos      iter->len
 * |              |
 * |--------------------------------|   mapped chunk in file
 * iomap.offset                         iomap.length
 */
length = min(iter->len, iter->iomap.offset + iter->iomap.length - iter->pos);
pos = iter->pos;
end = pos + length;
while (pos < end) {
    // Copy one page per round
    unsigned offset = pos & (PAGE_SIZE - 1);
    const size_t size = ALIGN(length + offset, PAGE_SIZE);
    const sector_t sector = (iomap->addr + (pos & PAGE_MASK) -
                             iomap->offset) >> 9;

    // Get the address of pmem for direct access
    ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
    map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
                                &kaddr, NULL);
    ...
    if (iov_iter_rw(iter) == WRITE)
        xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr, map_len, iter);
    else
        xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr, map_len, iter);
    pos += xfer;
    length -= xfer;
    ...
}
---

剩下的大都与page fault相关,其中比较特殊的是,本来用来存储page cache的address_space.i_pages中,存储了一种特殊的entry,其包含以下信息:

  • pfn,对应的pmem page的页帧号,这个信息在执行truncate_inode_pages_range()时会用到,参考代码:

    truncate_inode_pages_range()
    -> truncate_exceptional_pvec_entries()
    -> dax_delete_mapping_entry()
    -> __dax_invalidate_entry()
    -> dax_disassociate_entry()
       ---
       for_each_mapped_pfn(entry, pfn) {
           struct page *page = pfn_to_page(pfn);
           page->mapping = NULL;
           page->index = 0;
       }
       ---
    此处的pfn信息虽然也可以通过iomap获取,但那需要访问文件系统元数据,不如直接保存在
    address_space.i_pages里高效;page->mapping和page->index代表的是该page所在的文件及在文件中的偏移
  • entry状态,目前有以下几种:
    • LOCKED,用于将一定范围锁起来,以保证整个操作的原子性,尤其是insert pte时可能会申请内存并sleep;在正常的page fault中,这个操作是通过lock_page实现的;这里之所以没有使用lock_page,可能是因为pmem的page结构体可能保存在nvdimm上(参考ndctl create-namespace --map参数)

    • PMD or PTE,如果PMD bit设置了,就代表这个entry是pmd的;

    • ZERO,对应read fault on unmapped page,此时会给其分配一个zero page;zero page可能是pmd的,在有pte fault(没能分配到pmd大小的page导致pmd fault fallback to pte fault)发生的时候,是可以降级拆分的;

    • EMPTY,与LOCKED配合用来占坑的

之前提到,read/write dax路径访问pmem时,写操作直接通过ntstore,可以保证数据直接落入掉电保护域。而mmap的文件将pmem直接暴露给用户,并不能强制用户使用持久化编程技术;为了保证fsync的语义,dax虽然bypass了page cache,但是依然使用了writeback子系统,并借此实现数据的flush,如下:

  • Mark the entry and hand the inode to writeback subsystem

    dax_fault_iter()
    -> dax_insert_entry()
    ---
    if (dirty)
        __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

    xas_reset(xas);
    xas_lock_irq(xas);
    if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
        ...
    } else {
        xas_load(xas);    /* Walk the xa_state */
    }
    if (dirty)
        xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
    xas_unlock_irq(xas);
    ---
    
  • flush page out and make page write-protected again in writepages 
    ext2_dax_writepages/ext4_dax_writepages/xfs_dax_writepages()
    -> dax_writeback_mapping_range()
    ---
    xas_lock_irq(&xas);
    xas_for_each_marked(&xas, entry, end_index, PAGECACHE_TAG_TOWRITE) {
        ret = dax_writeback_one(&xas, dax_dev, mapping, entry);
        ...
    }
    xas_unlock_irq(&xas);
    ---

    dax_writeback_one()
    ---
    /* Lock the entry to serialize with page faults */
    dax_lock_entry(xas, entry);

    pfn = dax_to_pfn(entry);
    count = 1UL << dax_entry_order(entry);
    index = xas->xa_index & ~(count - 1);

    // Make the page become writeprotected
    dax_entry_mkclean(mapping, index, pfn);
    dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
    /*
     * After we have flushed the cache, we can clear the dirty tag. There
     * cannot be new dirty data in the pfn after the flush has completed as
     * the pfn mappings are writeprotected and fault waits for mapping
     * entry lock.
     */
    xas_reset(xas);
    xas_lock_irq(xas);
    xas_store(xas, entry);
    xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
    ---
    

    这里使用mark_inode_dirty()将inode交给writeback子系统,可以保证sync语义;参考代码:

    sync_filesystem()
    ---
    writeback_inodes_sb(sb, WB_REASON_SYNC);
    if (sb->s_op->sync_fs) {
        ret = sb->s_op->sync_fs(sb, 0);
        if (ret)
            return ret;
    }
    ...
    ---

2 文件系统

目前我们常见的文件系统,如本地文件系统ext2、ext4、xfs,网络文件系统nfs、ocfs2、cephfs等,都包含以下元素:

  • 文件,一段线性存储空间,可以用来存储数据;
  • 目录,将若干个文件或者目录按照树形结构组织起来;
  • 其他额外属性,例如修改日期、所有者、权限信息等;
  • 空间管理机制,以上这些信息都需要存储空间来管理,文件系统需要知道哪些还是空闲的;

我们可以通过任何方式构造文件系统,只要最终效果包含以上元素即可。

为了构建起以上文件系统的基本元素,还需要一些额外数据,即元数据,包括:

  • 文件属性,例如日期、所有者、权限等;
  • 文件Block Mapping,即文件与存储设备空间之间的对应关系,也可能是多台主机的多个存储设备;
  • 目录项,即某个目录下包含哪几个文件;
  • 空间管理相关元数据;
  • 其他辅助性元数据,例如日志;

文件系统的目的,是将一段线性的存储空间,组织成一种符合人脑习惯的形式。

非人脑习惯,可以参考对象存储,它只有一个object ID;我们可以很容易的记住,/home/will/Desktop/sourcecode/linux-stable,但是你能记住这样一段ID吗?

489f6d8b8553722f4c6f34b256aac5cacbee52eb

本节参考文档连接:

File systems ext2, ext3 and ext4
https://students.mimuw.edu.pl/ZSO/Wyklady/11_extXfs/extXfs_short.pdf
Ext4 block and inode allocator improvements
https://landley.net/kdocs/ols/2008/ols2008v1-pages-263-274.pdf
The new ext4 filesystem: current status and future plans
https://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf
Scalability in the XFS File System
https://users.soe.ucsc.edu/~sbrandt/290S/xfs.pdf
XFS Algorithms & Data Structures
http://ftp.ntu.edu.tw/linux/utils/fs/xfs/docs/xfs_filesystem_structure.pdf

下文中,会使用如下简称:

  • fsblk,filesystem block,文件系统管理空间的基本单位,可以是512/1K/2K/4K,当前默认基本都是4K

2.1 空间管理

空间管理,用于记录文件系统中块的使用情况,主要是Free Block的数量和位置;除此之外,还会配以一定的分配算法,用于提高性能。

2.1.1 ext4 Allocator

ext4在空间管理上,使用block group(以下简称bg)来组织磁盘的空间,然后使用bitmap记录fsblk的使用情况;每个bg使用一个fsblk来承载bitmap,于是fsblk大小就决定了每个bg所管理的空间的大小,例如:fsblk=4K时,一个bg所管理的空间大小就是4K * 8 * 4K = 128M。
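这个容量关系可以直接写成公式(bitmap占用一个fsblk,每个bit管理一个fsblk):

```c
#include <assert.h>

/* 一个block group管理的空间 = fsblk字节数 * 8(bit/字节) * fsblk字节数 */
static unsigned long long bg_capacity(unsigned long fsblk_size)
{
    return (unsigned long long)fsblk_size * 8 * fsblk_size;
}
```

代入可得:fsblk=1K时每个bg管理8M,fsblk=4K时管理128M。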

可参考连接,Ext4 Disk Layout - Blocks
https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Blocks

为什么要引入block group?bg继承自ext2,一种说法是:

Disk blocks are divided into groups including adjacent tracks, thanks to which reading a file located within a single group is associated with a short seek time

这个说法没毛病,但是,要避免磁盘seek,其实可以利用文件已有的相邻fsblk作为线索,查找附近的空闲fsblk,而且ext2确实也是这样做的,参考代码:

ext2_get_block()
-> ext2_find_goal()
   -> ext2_find_near()
-> ext2_alloc_branch()
   -> ext2_alloc_blocks()
      -> ext2_new_blocks()
         ---
         grp_target_blk = ((goal - le32_to_cpu(es->s_first_data_block)) %
                           EXT2_BLOCKS_PER_GROUP(sb));
         bitmap_bh = read_block_bitmap(sb, group_no);
         grp_alloc_blk = ext2_try_to_allocate_with_rsv(sb, group_no,
                                 bitmap_bh, grp_target_blk, my_rsv, &num);
         ---
         grp_target_blk will be used as the 'offset' parameter of
         find_next_zero_bit_le()

block group的引入,应该是因为bitmap;如果你用bitmap管理1G的空间,试想一下,find_next_zero_bit的时候,效率极差。
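find_next_zero_bit本质上是对bitmap的线性扫描,简化实现如下(内核版本会按long逐字跳过全1的字,但代价同样随bitmap大小线性增长):

```c
#include <assert.h>
#include <stddef.h>

/* 从offset开始,返回第一个为0的bit的下标;没找到则返回size */
static size_t find_next_zero_bit(const unsigned long *bitmap,
                                 size_t size, size_t offset)
{
    for (size_t i = offset; i < size; i++) {
        if (!(bitmap[i / 64] & (1UL << (i % 64))))
            return i;
    }
    return size;
}
```

按fsblk=4K计算,用单个bitmap管理1G空间,最多需要扫描256K个bit;而一个128M的bg只有32K个bit。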

ext4引入了flexible block group,参考连接:Ext4 Disk Layout - Flexible Block Groupshttps://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Flexible_Block_Groups

它会将多个bg并起来,进而可以让GB级的文件空间连续性更强;试想下,没有这个feature,一个文件最大的extent也就接近128M(除去bitmap和inode table)。


相对于用bitmap保存free fsblk的方式,ext4还采用buddy的方式维护每个block group中的fsblk,类似内存的buddy系统;参考连接:ext4 blocks buddy
https://jianchwa.github.io/fs/xfs_v2.html#ext4_blocks

bitmap buddy可以用来快速定位到所需长度的fsblk,而不需要进行bitmap scan

bitmap buddy并不是on-disk结构,而是在内存中;在load bg的bitmap时,会构建buddy bitmap,参考函数ext4_mb_init_cache()。
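buddy bitmap的加速原理可以用下面的sketch示意:为每个order维护一张bitmap,申请2^order个连续fsblk时直接查对应order的表,而不必扫描原始bitmap(build_buddy/buddy_alloc为自拟名字,并非ext4_mb_init_cache()的实现):

```c
#include <assert.h>
#include <stdbool.h>

#define NBLKS 64

/* free_blk[i]为true表示fsblk i空闲;
 * buddy[order][i]为true表示从 i * 2^order 开始的对齐chunk整体空闲 */
static void build_buddy(const bool *free_blk, bool buddy[][NBLKS])
{
    for (int i = 0; i < NBLKS; i++)
        buddy[0][i] = free_blk[i];
    /* 每一阶由下一阶的两个相邻buddy合并而来 */
    for (int order = 1; (1 << order) <= NBLKS; order++)
        for (int i = 0; i < NBLKS >> order; i++)
            buddy[order][i] = buddy[order - 1][2 * i] &&
                              buddy[order - 1][2 * i + 1];
}

/* 申请2^order个连续fsblk:只查对应order的bitmap */
static int buddy_alloc(bool buddy[][NBLKS], int order)
{
    for (int i = 0; i < NBLKS >> order; i++)
        if (buddy[order][i])
            return i << order;   /* 返回起始fsblk号 */
    return -1;                   /* 没有满足长度的对齐chunk */
}
```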


在执行申请操作之前,ext4首先需要确定一个goal,也就是申请的目标位置,以增强文件的locality,参考代码:

ext4_ext_map_blocks()
-> ext4_ext_find_goal()
---
if (path) {
    int depth = path->p_depth;
    ex = path[depth].p_ext;
    if (ex) {
        ext4_fsblk_t ext_pblk = ext4_ext_pblock(ex);
        ext4_lblk_t ext_block = le32_to_cpu(ex->ee_block);
        if (block > ext_block)
            return ext_pblk + (block - ext_block);
        else
            return ext_pblk - (ext_block - block);
    }
    /* it looks like index is empty;
     * try to find starting block from index itself */
    if (path[depth].p_bh)
        return path[depth].p_bh->b_blocknr;
}
/* OK. use inode's group */
return ext4_inode_to_goal_block(inode);
---

ext4_inode_to_goal_block()
---
block_group = ei->i_block_group;
if (flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) {
    /*
     * If there are at least EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME
     * block groups per flexgroup, reserve the first block
     * group for directories and special files.  Regular
     * files will start at the second block group.  This
     * tends to speed up directory access and improves
     * fsck times.
     */
    block_group &= ~(flex_size-1);
    if (S_ISREG(inode->i_mode))
        block_group++;
}
bg_start = ext4_group_first_block_no(inode->i_sb, block_group);
if (test_opt(inode->i_sb, DELALLOC))
    return bg_start;
...
---

确定了goal之后,会首先尝试goal所指向的位置,参考代码:

ext4_ext_map_blocks()
-> ext4_mb_new_blocks()
-> ext4_mb_regular_allocator()
-> ext4_mb_find_by_goal()
   -> mb_find_extent()
      ---
      buddy = mb_find_buddy(e4b, 0, &max);
      //这里会确定goal所指向的位置是否free
      if (mb_test_bit(block, buddy)) {
          ex->fe_len = 0;
          ex->fe_start = 0;
          ex->fe_group = 0;
          return 0;
      }
      //确定free长度的代码此处省略
      ---
   ---
   //需要确定长度是否满足
   if (max >= ac->ac_g_ex.fe_len) {
       ac->ac_found++;
       ac->ac_b_ex = ex;
       ext4_mb_use_best_found(ac, e4b);
   }
   ---

ext4_mb_find_by_goal()会去确定goal所指向的位置,free的fsblk是否满足需求,如果不满足,则进入下一种策略;

ext4_mb_regular_allocator()
---
// 如果申请的fsblk长度是power of 2,并且其order值大于mb_order2_reqs,
// 设置ac_2order;此时cr从0开始,在申请fsblk时,也可以直接使用buddy
i = fls(ac->ac_g_ex.fe_len);
ac->ac_2order = 0;
if (i >= sbi->s_mb_order2_reqs && i <= sb->s_blocksize_bits + 2) {
    if ((ac->ac_g_ex.fe_len & (~(1 << (i - 1)))) == 0)
        ac->ac_2order = array_index_nospec(i - 1,
                                           sb->s_blocksize_bits + 2);
}
cr = ac->ac_2order ? 0 : 1;
repeat:
//cr逐渐增长,申请策略会逐渐激进
for (; cr < 4 && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
    ac->ac_criteria = cr;
    group = ac->ac_g_ex.fe_group;
    //优先从目标group申请
    for (i = 0; i < ngroups; group++, i++) {
        if (group >= ngroups)
            group = 0;
        //部分代码省略,以优先表达申请策略
        ret = ext4_mb_good_group(ac, group, cr);
        if (ret <= 0)
            continue;
        if (cr == 0)
            //因为此时fe_len是power of 2,可以直接使用buddy判断和申请
            ext4_mb_simple_scan_group(ac, &e4b);
        else if (cr == 1 && sbi->s_stripe &&
                 !(ac->ac_g_ex.fe_len % sbi->s_stripe))
            ext4_mb_scan_aligned(ac, &e4b);
        else
            ext4_mb_complex_scan_group(ac, &e4b);
        if (ac->ac_status != AC_STATUS_CONTINUE)
            break;
    }
}
---

该算法的核心在于这个函数:

ext4_mb_good_group()
---
switch (cr) {
case 0:
    BUG_ON(ac->ac_2order == 0);
    // 此策略是不是可以也放到cr 1里?
    /* Avoid using the first bg of a flexgroup for data files */
    if ((ac->ac_flags & EXT4_MB_HINT_DATA) &&
        (flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) &&
        ((group % flex_size) == 0))
        return 0;
    if ((ac->ac_2order > ac->ac_sb->s_blocksize_bits+1) ||
        (free / fragments) >= ac->ac_g_ex.fe_len)
        return 1;
    if (grp->bb_largest_free_order < ac->ac_2order)
        return 0;
    return 1;
case 1:
    // free/fragments 可以认为是这个bg的free segments的平均大小
    if ((free / fragments) >= ac->ac_g_ex.fe_len)
        return 1;
    break;
case 2:
    if (free >= ac->ac_g_ex.fe_len)
        return 1;
    break;
case 3:
    return 1;
default:
    BUG();
}
---

fallback策略,依然会优先从目标bg开始申请;在遍历各个bg的时候,依然是优先考虑选择满足所需长度的bg,以保证IO size;

2.1.2 xfs Allocator

xfs也将磁盘划分成不同的区域,称为allocation group(以下简称ag);不过,与ext2的无可奈何相比,xfs的这种设计有明显的目的性,参考《Scalability in the XFS File System》:

  • To avoid requiring all structures in XFS to scale to the 64 bit size of the file system;举个例子,xfs的ag内部的空间管理结构,只需要使用32bit的指针就可以,在4K的fsblk下,这个32bit的指针可以覆盖2^44 = 16T的空间,参考代码:

    typedef uint32_t    xfs_agblock_t;  /* blockno in alloc. group */

    xfs_alloc_vextent()
    ----
    if (args->agbno == NULLAGBLOCK)
        args->fsbno = NULLFSBLOCK;
    else {
        args->fsbno = XFS_AGB_TO_FSB(mp, args->agno, args->agbno);
        ...
    }
    ---

    #define XFS_AGB_TO_FSB(mp,agno,agbno)    \
        (((xfs_fsblock_t)(agno) << (mp)->m_sb.sb_agblklog) | (agbno))
    
  • Partitioning the file system into AGs limits the size of the individual structures used for tracking free space and inodes. 这一点与ext2使用bg的原因类似,减小相关数据结构的size;
  • Improve parallelism in the management of free space and inode allocation;各个ag之间的空间和Inode管理结构可以完全并行操作;如果在底层存储设备层级上引入lvm,则可以保证各个ag之间代码和IO的完全隔离;

Note: 对于空间局部性的问题,xfs并不是靠ag;考虑到ag的size往往都是几十上百GB,参考<0.1 机械盘>中的数据,仅保证在一个ag内,对于降低磁盘seek没有任何意义。xfs使用的是基于临近fsblk的原则,在下面的章节里,我们会详解这一块。

xfs的空间管理使用的是一对保存fsblk extent (agbno, length)的B+ tree,它们使用不同的元素作为index,

  • agbno,即bnobt,基于临近fsblk原则寻找free fsblk,可以降低磁盘seek;
  • length,即cntbt,方便快速找到满足长度要求的extent,降低文件的碎片化;
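两棵树的分工可以用下面的sketch直观示意:同一组free extent,分别按"长度满足要求"和"距离goal最近"两种方式检索(这里用线性扫描代替B+ tree查找,函数名为自拟,仅为示意):

```c
#include <assert.h>

struct extent {
    unsigned int agbno;   /* extent起始块号 */
    unsigned int len;     /* extent长度 */
};

/* cntbt的用途:找长度 >= minlen 的extent(这里取最短满足者,best-fit) */
static int find_by_size(const struct extent *ext, int n, unsigned int minlen)
{
    int best = -1;

    for (int i = 0; i < n; i++)
        if (ext[i].len >= minlen &&
            (best < 0 || ext[i].len < ext[best].len))
            best = i;
    return best;
}

/* bnobt的用途:找距离目标agbno最近的extent,降低磁头seek */
static int find_nearest(const struct extent *ext, int n, unsigned int agbno)
{
    int best = -1;
    unsigned int best_dist = ~0U;

    for (int i = 0; i < n; i++) {
        unsigned int d = ext[i].agbno > agbno ?
                         ext[i].agbno - agbno : agbno - ext[i].agbno;
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;
}
```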

在执行分配操作之前,首先需要确定以下参数:

  • blkno,执行分配的目标位置,

    xfs_bmap_btalloc()
    ---
    // t_firstblock是当前transaction首次执行分配操作的记录,xfs_bmap_btalloc()
    // 可能会被xfs_bmapi_write()调用多次;是否为空代表着是否是第一次分配操作。
    nullfb = ap->tp->t_firstblock == NULLFSBLOCK;
    ...
    if (nullfb) {
        // 选择目标位置的时候参考inode所在fsblk的位置
        ap->blkno = XFS_INO_TO_FSB(mp, ap->ip->i_ino);
    }
    xfs_bmap_adjacent(ap);
    ---
    /*
     * If allocating at eof, and there's a previous real block,
     * try to use its last block as our starting point.
     */
    /*
     * If not at eof, then compare the two neighbor blocks.
     * Figure out whether either one gives us a good starting point,
     * and pick the better one.
     */
    ---
    xfs_bmap_adjacent()代码比较长,这里暂时忽略。从代码中可以看到,在文件为空的时候,选择inode所在的fsblk作为目标,这可以保证inode和数据尽量近一些;在那之后,会以临近的fsblk作为目标,以此来尽量避免磁盘seek。
  • ag,执行分配操作的allocation group;参考函数:

    xfs_bmap_btalloc()
    ---
    nullfb = ap->tp->t_firstblock == NULLFSBLOCK;
    ...
    xfs_bmap_adjacent(ap);
    ...
    // ag的选择依赖blkno,即目标fsblk;发生在xfs_bmap_adjacent()之后,
    // 表明ag的选择也会遵循临近原则
    if (nullfb) {
        error = xfs_bmap_btalloc_nullfb(ap, &args, &blen);
    }
    ---

    xfs_bmap_btalloc_nullfb()
    ---
    args->type = XFS_ALLOCTYPE_START_BNO;
    args->total = ap->total;
    //args->fsbno来自ap->blkno即目标fsblk
    startag = ag = XFS_FSB_TO_AGNO(mp, args->fsbno);
    if (startag == NULLAGNUMBER)
        startag = ag = 0;
    //blen是该ag包含的最长extent的长度,在这里选择一个包含所要求maxlen的ag;
    //从这里也可以看出,申请一块连续fsblk的优先级是要高于临近原则的
    while (*blen < args->maxlen) {
        error = xfs_bmap_longest_free_extent(args->tp, ag, blen, &notinit);
        if (++ag == mp->m_sb.sb_agcount)
            ag = 0;
        if (ag == startag)
            break;
    }
    ---

    每个ag最长的extent,是被记录在agf(即xfs的ag空间管理结构)里的;cntbt的最后一个record就是最长的:

    xfs_allocbt_update_lastrec()
    ---
    agf->agf_longest = len;
    pag = xfs_perag_get(cur->bc_mp, seqno);
    pag->pagf_longest = be32_to_cpu(len);
    xfs_perag_put(pag);
    xfs_alloc_log_agf(cur->bc_tp, cur->bc_private.a.agbp, XFS_AGF_LONGEST);
    ---

    xfs_alloc_read_agf()
    ---
    if (!pag->pagf_init) {
        ...
        pag->pagf_longest = be32_to_cpu(agf->agf_longest);
        ...
    }
    ---

    综上,在选择ag的时候,会优先选择有所需长度fsblk的ag,而且xfs为了保存每个ag的最长extent,还着实费了些功夫。
    

在选定了ag和blkno之后,借助bnobt和cntbt,选择合适的fsblk;

fsblk分配有以下五种策略:

  • XFS_ALLOCTYPE_START_BNO,near this block else anywhere,该策略分成两段,会首先尝试NEAR_BNO,如果失败则fallback到THIS_AG,参考代码:

    xfs_alloc_vextent()
    ---
    switch (type) {
    ...
    case XFS_ALLOCTYPE_START_BNO:
        /*
         * Try near allocation first, then anywhere-in-ag after
         * the first a.g. fails.
         */
        args->agbno = XFS_FSB_TO_AGBNO(mp, args->fsbno);
        args->type = XFS_ALLOCTYPE_NEAR_BNO;
        /* FALLTHROUGH */
    case XFS_ALLOCTYPE_FIRST_AG:
        for (;;) {
            args->pag = xfs_perag_get(mp, args->agno);
            error = xfs_alloc_fix_freelist(args, flags);
            /*
             * If we get a buffer back then the allocation will fly.
             */
            if (args->agbp) {
                if ((error = xfs_alloc_ag_vextent(args)))
                    goto error0;
                break;
            }
            /*
             * For the first allocation, we can try any AG to get
             * space.  However, if we already have allocated a
             * block, we don't want to try AGs whose number is below
             * sagno. Otherwise, we may end up with out-of-order
             * locking of AGF, which might cause deadlock.
             */
            if (++(args->agno) == mp->m_sb.sb_agcount) {
                if (args->tp->t_firstblock != NULLFSBLOCK)
                    args->agno = sagno;
                else
                    args->agno = 0;
            }
            if (args->agno == sagno &&
                type == XFS_ALLOCTYPE_START_BNO)
                args->type = XFS_ALLOCTYPE_THIS_AG;
            xfs_perag_put(args->pag);
        }
        break;
    default:
        ASSERT(0);
        /* NOTREACHED */
    }
    if (args->agbno == NULLAGBLOCK)
        args->fsbno = NULLFSBLOCK;
    else {
        args->fsbno = XFS_AGB_TO_FSB(mp, args->agno, args->agbno);
    }
    ---
    
  • XFS_ALLOCTYPE_FIRST_AG,最终fallback策略,通常意味着文件系统存储空间处于低位,参考代码:

    xfs_bmap_btalloc()
    ---
    if (args.fsbno == NULLFSBLOCK && nullfb) {
        args.fsbno = 0;
        args.type = XFS_ALLOCTYPE_FIRST_AG;
        args.total = ap->minlen;
        if ((error = xfs_alloc_vextent(&args)))
            return error;
        ap->tp->t_flags |= XFS_TRANS_LOWMODE;
    }
    ---
    

以下三种是基础策略,以上两种策略会最终转换成基础策略,参考函数xfs_alloc_ag_vextent():

  • XFS_ALLOCTYPE_THIS_AG,Allocate a variable extent anywhere in the allocation group agno;这里主要基于cntbt,选择一个长度满足要求的fsblk extent,完全不考虑fsblk临近的问题;XFS_ALLOCTYPE_FIRST_AG会转换成该策略,或者作为XFS_ALLOCTYPE_START_BNO的fallback策略;
  • XFS_ALLOCTYPE_NEAR_BNO,Allocate a variable extent near bno in the allocation group agno;这里会同时参考cntbt和bnobt;通常作为XFS_ALLOCTYPE_START_BNO的优先策略;
    
    xfs_alloc_ag_vextent_near()
    ---
    此处省略了此ag中没有满足要求长度extent的情况,源码的部分分支也为提高可读性而省略
    /*
     * Loop going left with the leftward cursor, right with the
     * rightward cursor, until either both directions give up or
     * we find an entry at least as big as minlen.
     */
    do {
        if (bno_cur_lt) {
            error = xfs_alloc_get_rec(bno_cur_lt, &ltbno, &ltlen, &i);
            ...
            if (ltlena >= args->minlen && ltbnoa >= args->min_agbno)
                break;
            error = xfs_btree_decrement(bno_cur_lt, 0, &i);
        }
        if (bno_cur_gt) {
            error = xfs_alloc_get_rec(bno_cur_gt, &gtbno, &gtlen, &i);
            if (gtlena >= args->minlen && gtbnoa <= args->max_agbno)
                break;
            error = xfs_btree_increment(bno_cur_gt, 0, &i);
        }
    } while (bno_cur_lt || bno_cur_gt);
    ---
    
  • XFS_ALLOCTYPE_THIS_BNO,Allocate a variable extent at exactly agno/bno,基于bnobt做精确查找;这种策略用在file append操作时,进而可以保证文件fsblk的连续性,参考代码:

    xfs_bmap_btalloc()
    ---
    if (!(ap->tp->t_flags & XFS_TRANS_LOWMODE) && ap->aeof) {
        // offset in file filling in
        if (!ap->offset) {
            ...
        } else {
            /*
             * First try an exact bno allocation.
             * If it fails then do a near or start bno
             * allocation with alignment turned on.
             */
            atype = args.type;
            tryagain = 1;
            args.type = XFS_ALLOCTYPE_THIS_BNO;
            args.alignment = 1;
            ...
        }
    }
    ---

Note:

  • 如果对比上一小节的ext4:在goal位置尝试申请失败之后,xfs的策略是从goal位置同时向左右搜索,而ext4只是向右;所以,xfs在磁盘seek上的表现应该会更好一些;
  • xfs采用B+ tree结构,对于维护大存储空间和大文件extent,非常高效;但如果是频繁的小文件场景,B+ tree最终会变得碎片化,B+ tree自身的IO也会变得非常沉重;相反,ext4 bitmap的方式,此时反倒IO很轻;

2.1.3 ext4 预申请

Delay Allocation理论上也是一种预申请机制,只不过它只是在内存中保留了额度,而没有真正的在分配器中申请对应的空间;在执行writeback时,可以一次性执行批量申请,获取更大的连续fsblk;但是,DA也不一定是万能的,比如,一些日志文件会持续append写操作,同时为保证数据的持久性,写完之后就会立刻fsync,但是后续可能还需要大批量的读上来;当然,还有很多其他场景,我们分别看下ext4和xfs的预申请机制。

参考以下文章:
Ext4 block and inode allocator improvements
https://landley.net/kdocs/ols/2008/ols2008v1-pages-263-274.pdf

The Ext4 multiple block allocator maintains two preallocated spaces from which block requests are satisfied: A per-inode preallocation space and a per-CPU locality group prealloction space. The per-inode preallocation space is used for larger request and helps in making sure larger files are less interleaved if blocks are allocated at the same time. The per-CPU locality group preallocation space is used for smaller file allocation and helps in making sure small files are placed closer on disk

首先看下ext4的per-file allocation,其包括以下几个关键点:

  • fsblk跟文件的offset是有对应关系的,否则就无法形成连续IO;所以,我们看到ext4_prealloc_space中还包括了文件offset,

    struct ext4_prealloc_space {
        ...
        ext4_fsblk_t   pa_pstart;  /* phys. block */
        ext4_lblk_t    pa_lstart;  /* log. block */
        ext4_grpblk_t  pa_len;     /* len of preallocated chunk */
        ...
    };

    在尝试从预申请的资源中获取空间时,参考函数:

    ext4_mb_new_blocks()
    -> ext4_mb_use_preallocated()
    ---
    if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
        return 0;
    rcu_read_lock();
    list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
        //检查文件的偏移是否符合
        if (ac->ac_o_ex.fe_logical < pa->pa_lstart ||
            ac->ac_o_ex.fe_logical >= (pa->pa_lstart +
                                       EXT4_C2B(sbi, pa->pa_len)))
            continue;
        /* found preallocated blocks, use them */
        spin_lock(&pa->pa_lock);
        if (pa->pa_deleted == 0 && pa->pa_free) {
            atomic_inc(&pa->pa_count);
            //根据pa的pa_pstart获取fsblk的地址和长度
            ext4_mb_use_inode_pa(ac, pa);
            spin_unlock(&pa->pa_lock);
            rcu_read_unlock();
            return 1;
        }
        spin_unlock(&pa->pa_lock);
    }
    rcu_read_unlock();
    ---
    

    注意,在使用prealloc space时,并不会考虑要求申请的长度;

  • preallocate的长度,可以参考如下使用dd命令进行顺序IO的log,

    dd if=/dev/zero of=/mnt/test/big.1G bs=4K count=$((256<<10)) oflag=direct

    goal: 16(was 17) blocks at 16
    goal: 32(was 33) blocks at 32
    goal: 64(was 65) blocks at 64
    goal: 128(was 129) blocks at 128
    goal: 256(was 257) blocks at 256
    goal: 512(was 513) blocks at 512
    goal: 1024(was 1025) blocks at 1024
    goal: 2048(was 2049) blocks at 2048
    goal: 2048(was 4097) blocks at 4096
    goal: 2048(was 6145) blocks at 6144
    goal: 2048(was 8193) blocks at 8192
    goal: 2048(was 10241) blocks at 10240
    goal: 2048(was 12289) blocks at 12288
    goal: 2048(was 14337) blocks at 14336
    goal: 2048(was 16385) blocks at 16384
    goal: 2048(was 18433) blocks at 18432
    goal: 2048(was 20481) blocks at 20480
    goal: 2048(was 22529) blocks at 22528
    goal: 2048(was 24577) blocks at 24576
    goal: 2048(was 26625) blocks at 26624
    goal: 2048(was 28673) blocks at 28672
    goal: 2048(was 30721) blocks at 30720
    goal: 2048(was 32769) blocks at 32768
    goal: 2048(was 34817) blocks at 34816
    goal: 2048(was 36865) blocks at 36864

    三列的含义分别是:preallocate size、original file size、preallocate offset
    

    概括起来就是,在8M以前,每次preallocate的size等于当前文件的大小(即每次翻倍),之后每次申请8M;需要特别注意的是preallocate的文件偏移,其会基于已有的prealloc space做调整,参考如下代码:

    ext4_mb_normalize_request()
    ---
    rcu_read_lock();
    list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
        pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb), pa->pa_len);
        if (pa->pa_lstart >= end || pa_end <= start) {
            spin_unlock(&pa->pa_lock);
            continue;
        }
        /* adjust start or end to be adjacent to this pa */
        if (pa_end <= ac->ac_o_ex.fe_logical) {
            start = pa_end;
        } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) {
            end = pa->pa_lstart;
        }
    }
    ---
    
  • preallocation的block并不会被持久化,只有被文件申请之后,才会被记入log,参考代码:

    ext4_mb_use_preallocated()
    -> ext4_mb_use_inode_pa()
    ---
    start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart);
    end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len),
              start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len));
    len = EXT4_NUM_B2C(sbi, end - start);
    ext4_get_group_no_and_offset(ac->ac_sb, start, &ac->ac_b_ex.fe_group,
                                 &ac->ac_b_ex.fe_start);
    ac->ac_b_ex.fe_len = len;
    ---

    ext4_mb_new_blocks()
    -> ext4_mb_mark_diskspace_used()
    ---
    ext4_lock_group(sb, ac->ac_b_ex.fe_group);
    ext4_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,
                  ac->ac_b_ex.fe_len);
    ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
    err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
    ---

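根据上面dd log归纳出的规律,可以写成一个简化模型(仅是对log的归纳,并非ext4_mb_normalize_request()的逐行实现;假设fsblk = 4K,参数和返回值均以fsblk个数计):

```c
#include <assert.h>

#define MAX_PREALLOC_BLKS 2048   /* 8M / 4K */

/* isize_blks为当前文件大小;返回本次preallocate的长度 */
static unsigned int prealloc_len(unsigned int isize_blks)
{
    /* 8M以前,预申请长度等于当前文件大小(即每次翻倍);之后固定为8M */
    if (isize_blks < MAX_PREALLOC_BLKS)
        return isize_blks;
    return MAX_PREALLOC_BLKS;
}
```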
另外,ext4还支持per-cpu的preallocation,参考

The per-CPU locality group preallocation space is used for smaller file allocation and helps in making sure small files are placed closer on disk. Which preallocation space to use depends on the total size derived out of current file size and allocation request size. The allocator provides a tunable /proc/fs/ext4/<partition>/stream_req that defaults to 16

per-cpu的preallocation工作原理与per-file大致相同,参考如下几点:

  • per-file or per-cpu,参考代码:

    ext4_mb_initialize_context()
    -> ext4_mb_group_or_file()
    ---
    if (sbi->s_mb_group_prealloc <= 0) {
        ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
        return;
    }
    /* don't use group allocation for large files */
    size = max(size, isize);
    if (size > sbi->s_mb_stream_request) {
        ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
        return;
    }
    /*
     * locality group prealloc space are per cpu. The reason for having
     * per cpu locality group is to reduce the contention between block
     * request from multiple CPUs.
     */
    ac->ac_lg = raw_cpu_ptr(sbi->s_locality_groups);
    ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
    mutex_lock(&ac->ac_lg->lg_mutex);
    ---
    
  • 每次预申请的size,这是一个可配置的值,/sys/fs/ext4/vdb/mb_group_prealloc,默认是512个fsblk

2.1.4 xfs 预申请

xfs的预申请机制正式名称为speculative preallocation,以下内容参考链接:

[FAQ] XFS speculative preallocation
https://linux-xfs.oss.sgi.narkive.com/jjjfnyI1/faq-xfs-speculative-preallocation

  • The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to combat fragmentation under parallel sequential write workloads.
  • This post-EOF block allocation is included in 'st_blocks' counts via stat() system calls and is accounted as globally allocated space by the filesystem.
  • Speculative preallocation is not enabled for files smaller than a minimum size (64k by default, but can vary depending on filesystem geometry and/or mount options). Preallocations are capped at a maximum of 8GB on 4k block filesystems. Preallocation is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).
  • In most cases, speculative preallocation is automatically reclaimed when a file is closed. The preallocation may persist after file close if an open, write, close pattern is repeated on a file. In this scenario, post-EOF preallocation is trimmed once the inode is reclaimed from cache or the filesystem unmounted.

    Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses files that have been recently modified to not interfere with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):

    /proc/sys/fs/xfs/speculative_prealloc_lifetime

  • XFS block allocation algorithm can be configured to use a fixed allocation size with the 'allocsize=' mount option

(待续)

3 Block

3.1 bio

3.2 request

3.3 iostat

本小节分别看下block层几个核心指标的含义和数据来源

3.3.1 util%

根据iostat man page的解释:

iostat(1) - Linux manual page

Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits

其中对于设备利用率提到了两种解释:

  • 设备有inflight IO的时间的占比,直观理解为,设备处理IO的能力越强,则IO返回的速度越快,于是IO在设备停留的时间就越短,有inflight IO的时间的占比自然越低;
  • 带宽利用率,这个则更加直接,当前带宽占设备带宽上限的百分比;

接下来,我们分别做几个简单试验,内核版本为Centos 4.18:

fio --bs=4k --iodepth=32 --rw=randread --direct=1 --ioengine=libaio --numjobs=1 --runtime=60 --name=test --filename=/dev/sdb

HDD ST4000NM0035-1V4

测试条件                                     BW           Util%
rw=randread bs=4k iodepth=32                 0.83 MB/s    20%
rw=read bs=4k iodepth=32                     202 MB/s     100%
rw=read bs=1M iodepth=1                      203 MB/s     41%

NVMe SSD HUSMR7632BHP301

测试条件                                     BW           Util%
rw=randread bs=4k iodepth=32                 1454 MB/s    100%
rw=randread bs=4k iodepth=32 numjobs=16      5448 MB/s    100%

从以上两组测试结果来看,“带宽利用率”这个解释显然是不成立的;下面,我们主要从IO处理能力利用率的角度,分析Util%结果的意义。

对于HDD机械盘来说,其处理IO的能力与IO Pattern强相关;参考随机小IO这个结果:

rw=randread bs=4k iodepth=32        0.83 MB/s          20%

利用率只有20%,是不是说还有80%的能力未被利用?然而,无论我们如何加压,只要是随机IO,带宽和利用率始终无法上去,也就是说此时设备对随机IO的处理能力,或者准确的说,是磁头的处理能力,已经达到上限,即100%;我们再参考下面顺序IO的结果:

rw=read bs=4k iodepth=32        202 MB/s          100%
rw=read bs=1M iodepth=1        203 MB/s           41%

bs = 1M时,利用率只有41%,是不是说磁盘还有近60%的能力未被利用?此时203MB/s其实已经达到带宽上限,如果我们再加压,带宽基本不会再上升,当然,利用率确实有可能上升到100%。那么,在这个测试中,利用率又有什么意义呢?

再看NVMe SSD的情况,结果更加明显,同样都是利用率100%,带宽却差了几倍;造成这种差异的原因是SSD内部的并发能力,详情可以参考论文《Parallel all the time: Plane Level Parallelism Exploration for High Performance SSDs》:
https://storageconference.us/2019/Research/ParallelAllTheTime.pdf

综上,从数据表现来看,util%这个指标对于表征存储设备的使用情况,参考意义不大。

那么util%到底是什么?接下来,我们对照代码看下。

参考iostat源码中util的计算方法:
https://github.com/sysstat/sysstat/blob/master/iostat.c

    write_ext_stat()
    -> compute_ext_disk_stats()
    ---
    xds->util = S_VALUE(sdp->tot_ticks, sdc->tot_ticks, itv);
    ---

    read_sysfs_dlist_part_stat()
    -> read_sysfs_file_stat()
    ---
    i = fscanf(fp, "%lu %lu %lu %lu %lu %lu %lu %u %u %u %u",
               &rd_ios, &rd_merges_or_rd_sec, &rd_sec_or_wr_ios,
               &rd_ticks_or_wr_sec, &wr_ios, &wr_merges, &wr_sec,
               &wr_ticks, &ios_pgr, &tot_ticks, &rq_ticks);
    ---
    filename = /sys/block/xxx/stat

tot_ticks的值来自/sys/block/xxx/stat,参考内核代码diskstat_show,tot_ticks就是disk_stats.io_ticks;该统计数据的来源为:

    void update_io_ticks(struct hd_struct *part, unsigned long now)
    {
    	unsigned long stamp;
    again:
    	stamp = READ_ONCE(part->stamp);
    	if (unlikely(stamp != now)) {
    		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
    			__part_stat_add(part, io_ticks, 1);
    		}
    	}
    	if (part->partno) {
    		part = &part_to_disk(part)->part0;
    		goto again;
    	}
    }

    blk_mq_bio_to_request()
    -> blk_account_io_start()
    -> update_io_ticks()

    __blk_mq_end_request()
    -> blk_account_io_done()
    -> update_io_ticks()

io_ticks的统计只依赖io start和io done两个事件发生时刻的jiffy值,当IO延迟高于1 jiffy(HZ = 1000时,1 jiffy = 1ms)时,统计到的io_ticks会比设备实际处理IO的时间偏少;对比测试结果:

条件                         BW             r_await      util%
rw=read bs=1M iodepth=1      203.00 MB/s    4.05 ms      41.10
rw=read bs=4M iodepth=1      206.69 MB/s    11.98 ms     32.30

bs = 4M时,带宽稍高,利用率反而降低了。

其实这个算法并不是一直如此,而是经过一次修改,参考如下commit:

commit 5b18b5a737600fd20ba2045f320d5926ebbf341a
Author: Mikulas Patocka <mpatocka@redhat.com>
Date:   Thu Dec 6 11:41:19 2018 -0500

    block: delete part_round_stats and switch to less precise counting

    We want to convert to per-cpu in_flight counters.

    The function part_round_stats needs the in_flight counter every jiffy, it
    would be too costly to sum all the percpu variables every jiffy, so it
    must be deleted. part_round_stats is used to calculate two counters -
    time_in_queue and io_ticks.

    time_in_queue can be calculated without part_round_stats, by adding the
    duration of the I/O when the I/O ends (the value is almost as exact as the
    previously calculated value, except that time for in-progress I/Os is not
    counted).

    io_ticks can be approximated by increasing the value when I/O is started
    or ended and the jiffies value has changed. If the I/Os take less than a
    jiffy, the value is as exact as the previously calculated value. If the
    I/Os take more than a jiffy, io_ticks can drift behind the previously
    calculated value.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

在这个commit之前,io_tick的算法为:

    static void part_round_stats_single(struct request_queue *q,
    				    struct hd_struct *part, unsigned long now,
    				    unsigned int inflight)
    {
    	if (inflight) {
    		__part_stat_add(part, time_in_queue,
    				inflight * (now - part->stamp));
    		__part_stat_add(part, io_ticks, (now - part->stamp));
    	}
    	part->stamp = now;
    }

在这种算法下,每次采样时只要存储设备上有inflight的IO,距上次采样的全部时间都会被计入io_ticks;甚至在diskstat_show()中也会调用一次该函数;所以,如果有IO hung住,util%会显示100%。

此时,util%的意义更接近于,在过去的时间内,设备被使用的时间的占比。

综上,util%不可以作为设备能力的利用率来参考。
