Linux Kernel Analysis and Applications -- Linux Process Isolation and Docker Containers

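Container technology has become very popular over the past two years, and almost every developer is learning it. Engineers are not necessarily more rational than impulse shoppers, though: sometimes a new technology is adopted simply to keep up with fashion.

From an implementation standpoint, containers are not a brand-new invention. Docker containers merely build on basic facilities that the Linux kernel already provides, such as namespaces and cgroups.

This chapter does not try to cover every virtualization technology and implementation. It concentrates on the virtualization techniques related to containers, namely:

1) The principles behind virtualization.

2) The Linux kernel features that container technology relies on, such as the namespace and cgroup implementations.

3) An analysis of parts of the Docker container implementation.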

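7.1 Virtualization Technologies

Before containers became popular, the usual way to raise the utilization of a single machine while isolating processes from each other was to run several operating systems on one physical machine through virtualization. This section gives a brief introduction to CPU virtualization.

Traditionally, multiple hardware stacks were emulated in software, hardware instructions were emulated on top of them, and several operating systems ran on the result. This approach has many problems with reliability, security, and performance, so Intel introduced Intel VT (Virtualization Technology) in its hardware products, as shown in Figure 7-1. Intel VT lets one CPU behave as if several CPUs were running in parallel, which makes it possible to run multiple operating systems on one computer at the same time.

Figure 7-1 How CPU virtualization works

Intel CPUs provide CPU virtualization, memory virtualization, I/O virtualization, graphics virtualization, network virtualization, and more. This chapter mainly covers software-side virtualization; readers who are interested in Intel VT can find more at http://www.intel.com/content/www/us/en/virtualization/virtualization-technology/intel-virtualizationtechnology.html.

The Linux kernel has a built-in KVM module, which is mainly responsible for creating virtual machines, allocating virtual memory, reading and writing VCPU registers, and running VCPUs. It can be built on either of two vendor solutions, Intel VT or AMD-V.

To use a KVM virtualization environment, the user also needs an emulator such as QEMU. It acts as the user-space component of the virtual machine and provides the I/O device models and the path to peripherals, as shown in Figure 7-2.

Figure 7-2 How the KVM solution works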

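7.2 Linux Process Isolation Techniques

Having briefly looked at Intel VT hardware virtualization and the Linux KVM solution, we now turn to the techniques used for container isolation.

7.2.1 chroot

chroot is a system call provided by the Linux kernel. It is used for security purposes, to restrict the root directory that a process (and therefore its user) can see.

The chroot system call is implemented as follows (see /linux-4.5.2/fs/open.c):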

SYSCALL_DEFINE1(chroot, const char __user *, filename)
{
    struct path path;
    int error;
    unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
    …
    // resolve filename to a path; relative names are looked up from the current directory
    error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
    …
    // check the inode permissions
    error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
    …
    // make the new path the root of the current process
    set_fs_root(current->fs, &path);
    error = 0;
    …
    return error;
}

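As the source shows, the logic of chroot is very simple. It first calls user_path_at to resolve filename to a path, checks the permissions on the inode, and then calls set_fs_root to make that path the root of the current process's file-system context.

set_fs_root is defined as: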

void set_fs_root(struct fs_struct *fs, const struct path *path)
{
    struct path old_root;

    path_get(path);
    spin_lock(&fs->lock);
    write_seqcount_begin(&fs->seq);
    old_root = fs->root;
    fs->root = *path;
    write_seqcount_end(&fs->seq);
    spin_unlock(&fs->lock);
    if (old_root.dentry)
        path_put(&old_root);
}

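So chroot only restricts the root directory that a process sees when it walks the file-system tree. It is a sleight of hand rather than real virtualization, and what it can achieve is very limited.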
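To make the "sleight of hand" concrete, the effect of fs->root on path lookup can be modeled in user space. The following is a minimal Go sketch, not kernel code: the type fsStruct and the functions setRoot and resolve are hypothetical names introduced here, and the model deliberately ignores symlinks, mounts, the working directory, and permissions.

```go
package main

import (
	"fmt"
	"path"
)

// fsStruct is a hypothetical user-space stand-in for the kernel's fs_struct:
// it only remembers which host directory the process treats as "/".
type fsStruct struct {
	root string
}

// setRoot mirrors the idea of set_fs_root: swap the root and nothing else.
func (fs *fsStruct) setRoot(newRoot string) {
	fs.root = path.Clean(newRoot)
}

// resolve maps a path as seen by the process to a path on the host.
// Cleaning the name as an absolute path first means that ".." can never
// climb above the process root.
func (fs *fsStruct) resolve(p string) string {
	return path.Join(fs.root, path.Clean("/"+p))
}

func main() {
	fs := &fsStruct{root: "/"}
	fmt.Println(fs.resolve("/etc/passwd"))

	fs.setRoot("/srv/jail") // hypothetical jail directory
	fmt.Println(fs.resolve("/etc/passwd"))
	fmt.Println(fs.resolve("/../../etc/passwd"))
}
```

Only the name lookup changes in this model: process IDs, network interfaces, and every other resource stay shared, which is why chroot alone is not enough to build a container.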

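7.2.2 namespace

Namespaces are essential to Linux. They provide the isolation that containers rely on, so that the processes created for each container run in their own namespaces. After isolation, each namespace looks like a separate Linux system, as shown in Figure 7-3.

Figure 7-3 Process isolation with namespaces

Linux namespaces provide six kinds of isolation, as listed in Table 7-1.

Table 7-1 Isolation provided by namespaces

Let us first look at the relationship between a process and its namespaces. A process can belong to several namespaces at once; task_struct contains a pointer, nsproxy, to a structure that references them: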

struct nsproxy {
    atomic_t count;
    struct uts_namespace *uts_ns;               // kernel name, version, underlying architecture type, etc.
    struct ipc_namespace *ipc_ns;               // everything related to inter-process communication
    struct mnt_namespace *mnt_ns;               // view of the mounted file systems
    struct pid_namespace *pid_ns_for_children;  // process ID information
    struct net *net_ns;                         // network-related namespace parameters
};

If no namespace is requested, a process ends up in the default namespaces, which are described by init_nsproxy:

struct nsproxy init_nsproxy = {
    .count = ATOMIC_INIT(1),
    .uts_ns = &init_uts_ns,
#if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC)
    .ipc_ns = &init_ipc_ns,
#endif
    .mnt_ns = NULL,
    .pid_ns_for_children = &init_pid_ns,
#ifdef CONFIG_NET
    .net_ns = &init_net,
#endif
};

New namespaces are usually created on the fork path. The call chain is:

do_fork -> copy_process -> copy_namespaces -> create_new_namespaces:

static struct nsproxy *create_new_namespaces(unsigned long flags,
        struct task_struct *tsk, struct user_namespace *user_ns,
        struct fs_struct *new_fs)
{
    struct nsproxy *new_nsp;
    int err;

    new_nsp = create_nsproxy();    // allocate a new nsproxy
    …
    new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
    …
    new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
    …
    new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
    …
    new_nsp->pid_ns_for_children =
        copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children);
    …
    new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
    …
    return new_nsp;
    …
}

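As the code shows, the main job of create_new_namespaces is to allocate a new nsproxy and fill it from the old task's uts, ipc, pid, net, and mount namespaces through the copy_* helpers. Each helper receives the clone flags, so it can either share the old namespace or create a new one for the new task.

Once the task has its own namespaces, take PID allocation as an example: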

if (pid != &init_struct_pid) {
    pid = alloc_pid(p->nsproxy->pid_ns_for_children);
    …
}

The PID is then allocated inside the task's own PID namespace.
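The share-or-create decision made on the create_new_namespaces path can be modeled in a few lines of Go. This is a user-space sketch only: the types namespace and nsproxy and the helper copyNS are hypothetical stand-ins for the kernel structures, and the flag values are the CLONE_NEW* constants from the Linux UAPI header linux/sched.h.

```go
package main

import "fmt"

// Values of the CLONE_NEW* flags from <linux/sched.h>.
const (
	cloneNewNS  = 0x00020000 // mount namespace
	cloneNewUTS = 0x04000000
	cloneNewIPC = 0x08000000
	cloneNewPID = 0x20000000
	cloneNewNet = 0x40000000
)

// namespace and nsproxy are hypothetical user-space models.
type namespace struct{ kind string }

type nsproxy struct {
	uts, ipc, mnt, pid, net *namespace
}

// copyNS models a copy_* helper: the old namespace is shared
// unless the corresponding flag asks for a new one.
func copyNS(flags, flag uint64, old *namespace) *namespace {
	if flags&flag == 0 {
		return old
	}
	return &namespace{kind: old.kind}
}

// createNewNamespaces builds the nsproxy of the new task from the old one.
func createNewNamespaces(flags uint64, old *nsproxy) *nsproxy {
	return &nsproxy{
		mnt: copyNS(flags, cloneNewNS, old.mnt),
		uts: copyNS(flags, cloneNewUTS, old.uts),
		ipc: copyNS(flags, cloneNewIPC, old.ipc),
		pid: copyNS(flags, cloneNewPID, old.pid),
		net: copyNS(flags, cloneNewNet, old.net),
	}
}

// initNsproxy plays the role of init_nsproxy in this model.
func initNsproxy() *nsproxy {
	return &nsproxy{
		uts: &namespace{kind: "uts"},
		ipc: &namespace{kind: "ipc"},
		mnt: &namespace{kind: "mnt"},
		pid: &namespace{kind: "pid"},
		net: &namespace{kind: "net"},
	}
}

func main() {
	parent := initNsproxy()
	child := createNewNamespaces(cloneNewUTS|cloneNewPID, parent)
	fmt.Println("uts shared:", child.uts == parent.uts)
	fmt.Println("pid shared:", child.pid == parent.pid)
	fmt.Println("net shared:", child.net == parent.net)
}
```

In this model a child created with only the UTS and PID flags gets a private hostname and PID space but still shares the network stack with its parent, which matches the idea that a process can sit in several namespaces of different kinds at the same time.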

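7.2.3 cgroup

Once a process has its own namespaces, cgroup can be used to limit the resources, such as CPU, memory, and I/O, available to the processes in them.

cgroups is short for control groups. It is a mechanism provided by the Linux kernel to limit, account for, and isolate the physical resources (CPU, memory, I/O, and so on) used by process groups. It was first proposed by engineers at Google and later merged into the Linux kernel.

cgroup currently supports the following subsystems:

blkio: sets input/output limits on block devices, such as physical devices (disks, solid-state drives, USB, and so on).

cpu: uses the scheduler to give the tasks in a cgroup access to the CPU.

cpuacct: automatically generates reports on the CPU used by the tasks in a cgroup.

cpuset: assigns individual CPUs (on multi-core systems) and memory nodes to the tasks in a cgroup.

devices: allows or denies access to devices by the tasks in a cgroup.

freezer: suspends or resumes the tasks in a cgroup.

memory: sets limits on the memory used by the tasks in a cgroup and automatically generates reports on the memory those tasks use.

net_cls: tags network packets with a class identifier (classid) so that the Linux traffic controller (tc) can identify packets that originate from a particular cgroup.

ns: the namespace subsystem.

The following example restricts the CPUs that a process may use: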

mkdir /sys/fs/cgroup/cpuset
mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset

After the directory has been created and mounted, the following files appear in it automatically:

total 0
-rw-r--r-- 1 root root 0 Aug 14 15:10 cgroup.clone_children
--w--w--w- 1 root root 0 Aug 14 15:10 cgroup.event_control
-rw-r--r-- 1 root root 0 Aug 14 15:10 cgroup.procs
-rw-r--r-- 1 root root 0 Aug 14 15:10 cpuset.cpu_exclusive
-rw-r--r-- 1 root root 0 Aug 14 15:10 cpuset.cpus
-rw-r--r-- 1 root root 0 Aug 14 15:10 cpuset.mem_exclusive
-rw-r--r-- 1 root root 0 Aug 14 15:10 cpuset.mem_hardwall
-rw-r--r-- 1 root root 0 Aug 14 15:10 cpuset.memory_migrate
-r--r--r-- 1 root root 0 Aug 14 15:10 cpuset.memory_pressure
-rw-r--r-- 1 root root 0 Aug 14 15:10 cpuset.memory_spread_page
-rw-r--r-- 1 root root 0 Aug 14 15:10 cpuset.memory_spread_slab
-rw-r--r-- 1 root root 0 Aug 14 15:10 cpuset.mems
-rw-r--r-- 1 root root 0 Aug 14 15:10 cpuset.sched_load_balance
-rw-r--r-- 1 root root 0 Aug 14 15:10 cpuset.sched_relax_domain_level
-rw-r--r-- 1 root root 0 Aug 14 15:10 notify_on_release
-rw-r--r-- 1 root root 0 Aug 14 15:10 tasks

Allow the tasks to use CPUs 0 to 1:

echo 0-1 > cpuset.cpus

Add a task to the cgroup:

echo [pid] >> tasks

The process with the given pid is now restricted to CPUs 0 and 1.
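The same two writes can be issued from a program, because the cgroup interface is nothing more than files. The sketch below is a minimal Go version under stated assumptions: limitCPUs and dryRun are hypothetical helper names, and dryRun exercises the helper against an ordinary temporary directory so that it can run without root. On a real system cgroupDir would be a directory inside the mounted cpuset hierarchy.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// limitCPUs performs the same two writes as the shell commands above:
// it sets the allowed CPUs and then adds pid to the cgroup.
func limitCPUs(cgroupDir, cpus string, pid int) error {
	cpusFile := filepath.Join(cgroupDir, "cpuset.cpus")
	if err := os.WriteFile(cpusFile, []byte(cpus), 0644); err != nil {
		return fmt.Errorf("setting cpuset.cpus: %w", err)
	}
	tasksFile := filepath.Join(cgroupDir, "tasks")
	if err := os.WriteFile(tasksFile, []byte(strconv.Itoa(pid)), 0644); err != nil {
		return fmt.Errorf("adding pid to tasks: %w", err)
	}
	return nil
}

// dryRun calls limitCPUs on a plain temporary directory (not a real cgroup)
// and returns what was written to the two files.
func dryRun() (cpus, tasks string, err error) {
	dir, err := os.MkdirTemp("", "cpuset-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	if err := limitCPUs(dir, "0-1", os.Getpid()); err != nil {
		return "", "", err
	}
	c, err := os.ReadFile(filepath.Join(dir, "cpuset.cpus"))
	if err != nil {
		return "", "", err
	}
	t, err := os.ReadFile(filepath.Join(dir, "tasks"))
	if err != nil {
		return "", "", err
	}
	return string(c), string(t), nil
}

func main() {
	cpus, tasks, err := dryRun()
	if err != nil {
		fmt.Fprintln(os.Stderr, "dry run failed:", err)
		os.Exit(1)
	}
	fmt.Printf("cpuset.cpus=%s tasks=%s\n", cpus, tasks)
}
```

Note that in a child cpuset cgroup the kernel also expects cpuset.mems to be populated before a task can be attached, so a real helper would probably write that file as well.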

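The core concepts of cgroup are:

Task. In cgroup, a task is simply a process in the system.

Control group (cgroup). Resource control in cgroups is done per control group. A process can join a control group and can be moved from one control group to another. The processes in a control group can use the resources that cgroups allocates to that group, and they are subject to the limits that cgroups sets on it.

Hierarchy. Control groups can be organized into a tree, in which a child control group inherits certain attributes of its parent.

Subsystem. A subsystem is a resource controller; the cpu subsystem, for example, is the controller that governs the allocation of CPU time. A subsystem must be attached to a hierarchy to take effect, after which every control group in that hierarchy is governed by it.

In a way, this structure resembles the permission system of a business application. Based on Figure 7-4, the relationships among the data structures in the cgroup implementation are briefly as follows:

cgroupfs_root: can be thought of as the directory given to the mount operation.

css_set: holds the cgroup information related to a process. Its cg_links field points to a list of struct cg_cgroup_link entries. subsys is an array of pointers to cgroup_subsys_state; one cgroup_subsys_state holds the information that ties the process to one particular subsystem. Through this pointer array a process can reach the corresponding cgroup control information.

css_set_table: stores all css_set instances. The hash function and key are css_set_hash(css_set->subsys).

cg_cgroup_link: because cgroup and css_set have a many-to-many relationship, cg_cgroup_link is used to link the two.

cgroup_subsys: represents one cgroup subsystem: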

struct cgroup_subsys {
    struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css);
    int  (*css_online)(struct cgroup_subsys_state *css);
    void (*css_offline)(struct cgroup_subsys_state *css);
    void (*css_released)(struct cgroup_subsys_state *css);
    void (*css_free)(struct cgroup_subsys_state *css);
    void (*css_reset)(struct cgroup_subsys_state *css);
    void (*css_e_css_changed)(struct cgroup_subsys_state *css);
    int  (*can_attach)(struct cgroup_taskset *tset);
    void (*cancel_attach)(struct cgroup_taskset *tset);
    void (*attach)(struct cgroup_taskset *tset);
    int  (*can_fork)(struct task_struct *task);
    void (*cancel_fork)(struct task_struct *task);
    void (*fork)(struct task_struct *task);
    void (*exit)(struct task_struct *task);
    void (*free)(struct task_struct *task);
    void (*bind)(struct cgroup_subsys_state *root_css);
    …
};

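The code above lists the hook functions of the cgroup_subsys interface.

cgroup_subsys_state: represents the actual control structure of each subsystem. The cpu, cpuset, blkio_group, and similar boxes in the figure are all implementations of it. In other words, each subsystem enforces its resource limits through its own implementation of cgroup_subsys_state.

Relationship between tasks and cgroups: css_set->tasks is the head of the list of all tasks that reference that css_set, and the tasks are chained together through task->cg_list. All cgroup_subsys instances of a cgroupfs_root are organized through cgroupfs_root->subsys_list, and every cgroupfs_root is linked through its root_list into the global list head roots.

Figure 7-4 Overall architecture of cgroup

Note

cgroup interacts with the upper layers through the standard VFS interface.

Whenever mount or mkdir is executed, the files under the directory are created by cgroup. The file system type is defined as follows: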

static struct file_system_type cgroup_fs_type = {
    .name = "cgroup",
    .mount = cgroup_mount,
    .kill_sb = cgroup_kill_sb,
};

The core interface files of the default hierarchy are defined as follows:

static struct cftype cgroup_dfl_base_files[] = {
    {
        .name = "cgroup.procs",
        .file_offset = offsetof(struct cgroup, procs_file),
        .seq_start = cgroup_pidlist_start,
        .seq_next = cgroup_pidlist_next,
        .seq_stop = cgroup_pidlist_stop,
        .seq_show = cgroup_pidlist_show,
        .private = CGROUP_FILE_PROCS,
        .write = cgroup_procs_write,
    },
    {
        .name = "cgroup.controllers",
        .flags = CFTYPE_ONLY_ON_ROOT,
        .seq_show = cgroup_root_controllers_show,
    },
    …

Each subsystem also maintains its own files[] array. For cpuset, for example:

static struct cftype files[] = {
    {
        .name = "cpus",
        .seq_show = cpuset_common_seq_show,
        .write = cpuset_write_resmask,
        .max_write_len = (100U + 6 * NR_CPUS),
        .private = FILE_CPULIST,
    },
    {
        .name = "mems",
        .seq_show = cpuset_common_seq_show,
        .write = cpuset_write_resmask,
        .max_write_len = (100U + 6 * MAX_NUMNODES),
        .private = FILE_MEMLIST,
    },
    …

Creating a new cgroup is also done through standard VFS operations. cgroup defines the following operations for its file system:

static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
    .remount_fs   = cgroup_remount,
    .show_options = cgroup_show_options,
    .mkdir        = cgroup_mkdir,
    .rmdir        = cgroup_rmdir,
    .rename       = cgroup_rename,
};

A cgroup is created through mkdir:

static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
        umode_t mode)
{
    struct cgroup *parent, *cgrp, *tcgrp;
    struct cgroup_root *root;
    struct cgroup_subsys *ss;
    struct kernfs_node *kn;
    int level, ssid, ret;
    …
    root = parent->root;
    level = parent->level + 1;
    // allocate memory for the cgroup
    cgrp = kzalloc(sizeof(*cgrp) +
               sizeof(cgrp->ancestor_ids[0]) * (level + 1), GFP_KERNEL);
    …
    // create the directory
    kn = kernfs_create_dir(parent->kn, name, mode, cgrp);
    …
    cgrp->kn = kn;
    …
    // create a cgroup_subsys_state for the cgroup
    for_each_subsys(ss, ssid) {
        if (parent->child_subsys_mask & (1 << ssid)) {
            ret = create_css(cgrp, ss,
                     parent->subtree_control & (1 << ssid));
            …
        }
    }
    …
    if (!cgroup_on_dfl(cgrp)) {
        cgrp->subtree_control = parent->subtree_control;
        cgroup_refresh_child_subsys_mask(cgrp);
    }
    …
}

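This code first creates and initializes a cgroup_subsys_state for every subsys that the cgroup belongs to. All cgroup_subsys_state instances of a cgroup can be reached through cgroup->subsys[], and conversely cgroup_subsys_state->cgroup tells which cgroup a cgroup_subsys_state belongs to. Later conversions between a cgroup and the per-subsystem control structure of a group all go through this structure.

No relationship between css_set and the cgroup is established here: at mkdir time no process has been attached to the cgroup yet, so it has nothing to do with any css_set.

Take writing to the cgroup.procs file as an example. From the analysis above, a write to cgroup.procs calls cgroup_procs_write, which is in turn a wrapper around __cgroup_procs_write: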

static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
        size_t nbytes, loff_t off, bool threadgroup)
{
    struct task_struct *tsk;
    struct cgroup *cgrp;
    pid_t pid;
    int ret;
    …
    ret = cgroup_attach_task(cgrp, tsk, threadgroup);
    …
    return ret ?: nbytes;
}

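The write ends in cgroup_attach_task, which attaches the subsystems of the cgroup to the task.

On a multi-core system, the cpuset subsystem assigns dedicated CPUs and memory nodes to the tasks in a cgroup.

Here is a brief look at what happens on a write to the cpuset.cpus file: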

static struct cftype files[] = {
    {
        .name = "cpus",
        .seq_show = cpuset_common_seq_show,
        .write = cpuset_write_resmask,
        .max_write_len = (100U + 6 * NR_CPUS),
        .private = FILE_CPULIST,
    },
    …

The files array names cpuset_write_resmask as the write handler:

static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
        char *buf, size_t nbytes, loff_t off)
{
    struct cpuset *cs = css_cs(of_css(of));
    struct cpuset *trialcs;
    int retval = -ENODEV;
    …
    switch (of_cft(of)->private) {
    case FILE_CPULIST:
        retval = update_cpumask(cs, trialcs, buf);
        break;
    case FILE_MEMLIST:
        retval = update_nodemask(cs, trialcs, buf);
        break;
    …
}

In the FILE_CPULIST case, update_cpumask updates the corresponding CPU mask:

static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
        const char *buf)
{
    int retval;
    …
    cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
    …
    update_cpumasks_hier(cs, trialcs->cpus_allowed);
    return 0;
}

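The end result is that cpus_allowed is updated for every process in the cgroup.

The next time a task is woken up, select_task_rq_fair picks one CPU from cpus_allowed, possibly the least loaded one, to decide on which CPU's run queue the task should be placed. At any given moment a process can be on the run queue of only one CPU.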
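The two halves of this path, turning the text written to cpuset.cpus into a set of allowed CPUs and then choosing one of them at wake-up, can be illustrated with a toy model. The Go sketch below is an assumption-laden simplification: parseCPUList and pickCPU are hypothetical names, the load figures are invented, and the real select_task_rq_fair weighs far more than a single load number.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList turns a list such as "0-1,4" into a set of CPU numbers.
func parseCPUList(s string) (map[int]bool, error) {
	allowed := make(map[int]bool)
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		bounds := strings.SplitN(part, "-", 2)
		first, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, fmt.Errorf("bad cpu list %q: %v", s, err)
		}
		last := first
		if len(bounds) == 2 {
			if last, err = strconv.Atoi(bounds[1]); err != nil {
				return nil, fmt.Errorf("bad cpu list %q: %v", s, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("bad cpu range %q", part)
		}
		for cpu := first; cpu <= last; cpu++ {
			allowed[cpu] = true
		}
	}
	return allowed, nil
}

// pickCPU returns the allowed CPU with the lowest load, or -1 if none is allowed.
// load[i] is an invented load figure for CPU i; ties go to the lower CPU number.
func pickCPU(allowed map[int]bool, load []int) int {
	best := -1
	for cpu, l := range load {
		if !allowed[cpu] {
			continue
		}
		if best == -1 || l < load[best] {
			best = cpu
		}
	}
	return best
}

func main() {
	allowed, err := parseCPUList("0-1")
	if err != nil {
		fmt.Println(err)
		return
	}
	// CPUs 2 and 3 are idle, but the task may only run on CPU 0 or 1.
	fmt.Println("chosen CPU:", pickCPU(allowed, []int{5, 3, 0, 0}))
}
```

The point of the example is that idle CPUs outside the mask are never candidates: the cpuset limit is applied before any load comparison takes place.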

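7.3 Parts of the Docker Container Implementation

Plenty of articles and books already introduce Docker, so I will not repeat them and will instead share my own understanding of it. When Docker first appeared I was not very interested, because underneath it relied on LXC to create containers. Wasn't that just old wine in a new bottle, rewrapped in Go with image management added on top?

Docker's success cannot be separated from its timing. In my view there are several reasons.

1. The push of the cloud computing wave

In an era when physical machines have surplus computing power, cloud computing is no longer something that only providers such as Amazon and Alibaba Cloud need to think about. Many small companies have to consider it as well. A startup, for example, has to work out how to run several applications on the same physical machine while keeping them isolated from one another.

In addition, traditional virtualization emulates the full stack from top to bottom, which is expensive and makes troubleshooting and operations difficult. If all you need is a PaaS or SaaS offering, a more lightweight solution such as containers is entirely adequate.

2. The spread of DevOps thinking

In the traditional model, developers finish development and commit the code, and everything after that is handed to testing and operations. Problems with the system environment and with later maintenance have little to do with the developers.

DevOps is a collective name for a set of processes, methods, and systems. When a company wants the handover between its formerly cumbersome development and operations teams to run smoothly, it can turn to DevOps.

Docker seems made for this purpose: finishing the code also means getting the Docker container to start, and version control and the separation of configuration across environments can both be handled through image management.

3. The push of the community

Over the last two years the influence of the open-source community and the enthusiasm of programmers to take part have been higher than ever, especially with the rise of GitHub, where the Docker source code is also maintained.

To bring Docker closer to cloud computing and turn it into a container cloud, the community has done a great deal of work, such as etcd for distributed configuration management, Swarm for cluster management, and Kubernetes, Google's container orchestration solution. Without these supporting projects, Docker would be no more than a process-level isolation technique. Today Docker is more than Docker itself: there is a whole ecosystem revolving around it.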

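7.3.1 The New Docker Architecture

Since version 1.11, Docker has delegated creating, running, and destroying containers to the containerd component.

After Docker was renamed Moby, its default containers are still maintained by the containerd component (see Figure 7-5).

Figure 7-5 containerd and related services

containerd provides a management command, ctr, for managing the containerd process, for example to start and stop containers. containerd is exposed to callers over gRPC, so clients can also talk to it directly through the gRPC protocol. Internally it is divided into three main subsystems (see Figure 7-6):

Distribution: talks to the Docker Registry and pulls images.

Bundle: the subsystem that manages images on the local disk.

Runtime: the subsystem that creates and manages containers.

Figure 7-6 Internal architecture of containerd

Note

gRPC is an RPC framework whose development is led by Google. It uses HTTP/2 and uses ProtoBuf for serialization. On the client side it offers Objective-C and Java interfaces, and on the server side Java, Golang, C++, and other interfaces, which gives mobile clients (iOS/Android) a way to communicate with servers. In the current environment, of course, the more popular approach to this problem is a RESTful API, which requires you to choose the encoding and the server architecture and to build the framework (JSON-RPC) yourself.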

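7.3.2 The containerd Implementation

This section analyzes the containerd implementation by following the creation of a container.

1. Starting containerd

Like most software, containerd starts from a main function. Its entry point is cmd/containerd/main.go: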

func main() {
    app := cli.NewApp()
    app.Name = "containerd"
    …
    app.Action = func(context *cli.Context) error {
        var (
            start   = time.Now()
            signals = make(chan os.Signal, 2048)
            serverC = make(chan *server.Server)
            ctx     = log.WithModule(gocontext.Background(), "containerd")
            config  = defaultConfig()
        )
        done := handleSignals(ctx, signals, serverC)
        // start the signal handler as early as possible so that no signal is lost during startup
        signal.Notify(signals, handledSignals...)
        if err := server.LoadConfig(context.GlobalString("config"), config); err != nil && !os.IsNotExist(err) {
            return err
        }
        // apply the flags to the config
        if err := applyFlags(context, config); err != nil {
            return err
        }
        address := config.GRPC.Address
        if address == "" {
            return errors.New("grpc address cannot be empty")
        }
        …
        server, err := server.New(ctx, config)
        …
        serverC <- server
        if config.Debug.Address != "" {
            l, err := sys.GetLocalListener(config.Debug.Address, config.Debug.Uid, config.Debug.Gid)
            if err != nil {
                return errors.Wrapf(err, "failed to get listener for debug endpoint")
            }
            serve(log.WithModule(ctx, "debug"), l, server.ServeDebug)
        }
        if config.Metrics.Address != "" {
            l, err := net.Listen("tcp", config.Metrics.Address)
            if err != nil {
                return errors.Wrapf(err, "failed to get listener for metrics endpoint")
            }
            serve(log.WithModule(ctx, "metrics"), l, server.ServeMetrics)
        }
        l, err := sys.GetLocalListener(address, config.GRPC.Uid, config.GRPC.Gid)
        if err != nil {
            return errors.Wrapf(err, "failed to get listener for main endpoint")
        }
        serve(log.WithModule(ctx, "grpc"), l, server.ServeGRPC)
        log.G(ctx).Infof("containerd successfully booted in %fs", time.Since(start).Seconds())
        <-done
        return nil
    }
    if err := app.Run(os.Args); err != nil {
        fmt.Fprintf(os.Stderr, "containerd: %s\n", err)
        os.Exit(1)
    }
}

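The most important part of this process is the setup of app.Action, and the key step inside it is server.New(ctx, config), where all the initialization of the containerd service takes place: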

func New(ctx context.Context, config *Config) (*Server, error) {
    …
    if err := os.MkdirAll(config.Root, 0711); err != nil {
        return nil, err
    }
    if err := os.MkdirAll(config.State, 0711); err != nil {
        return nil, err
    }
    if err := apply(ctx, config); err != nil {
        return nil, err
    }
    plugins, err := loadPlugins(config)
    if err != nil {
        return nil, err
    }
    rpc := grpc.NewServer(
        grpc.UnaryInterceptor(interceptor),
        grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
    )
    var (
        services []plugin.Service
        s        = &Server{
            rpc:    rpc,
            events: events.NewExchange(),
        }
        initialized = make(map[plugin.PluginType]map[string]interface{})
    )
    for _, p := range plugins {
        id := p.URI()
        log.G(ctx).WithField("type", p.Type).Infof("loading plugin %q...", id)
        initContext := plugin.NewContext(
            ctx,
            initialized,
            config.Root,
            config.State,
            id,
        )
        initContext.Events = s.events
        initContext.Address = config.GRPC.Address
        // load the plugin-specific configuration
        if p.Config != nil {
            pluginConfig, err := config.Decode(p.ID, p.Config)
            if err != nil {
                return nil, err
            }
            initContext.Config = pluginConfig
        }
        instance, err := p.Init(initContext)
        …
        if types, ok := initialized[p.Type]; ok {
            types[p.ID] = instance
        } else {
            initialized[p.Type] = map[string]interface{}{
                p.ID: instance,
            }
        }
        // check whether the plugin is a gRPC service that should be registered with the server
        if service, ok := instance.(plugin.Service); ok {
            services = append(services, service)
        }
    }
    // register the services once all plugins have been initialized
    for _, service := range services {
        if err := service.Register(rpc); err != nil {
            return nil, err
        }
    }
    return s, nil
}

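The most important thing that happens during initialization is that the gRPC server is created; each plugin is then loaded and initialized, and those that are services are registered by calling their Register function. A plugin registration is defined as: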

type Registration struct {
    Type     PluginType
    ID       string
    Config   interface{}
    Requires []PluginType
    Init     func(*InitContext) (interface{}, error)

    added bool
}

Plugins come in the following types:

const (
    RuntimePlugin     PluginType = "io.containerd.runtime.v1"
    GRPCPlugin        PluginType = "io.containerd.grpc.v1"
    SnapshotPlugin    PluginType = "io.containerd.snapshotter.v1"
    TaskMonitorPlugin PluginType = "io.containerd.monitor.v1"
    DiffPlugin        PluginType = "io.containerd.differ.v1"
    MetadataPlugin    PluginType = "io.containerd.metadata.v1"
    ContentPlugin     PluginType = "io.containerd.content.v1"
)
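The initialize-then-register loop in New can be reduced to a small, self-contained Go model. Everything here is a simplified assumption rather than containerd's API: registration, service, server, and containersService are hypothetical types, the plugin type names are shortened, and the Requires check is performed inline, whereas containerd resolves plugin order separately before this loop runs.

```go
package main

import "fmt"

type pluginType string

const (
	metadataPlugin pluginType = "metadata"
	grpcPlugin     pluginType = "grpc"
)

// instances maps plugin type and ID to the initialized plugin instance.
type instances map[pluginType]map[string]interface{}

// registration is a cut-down, hypothetical version of plugin.Registration.
type registration struct {
	Type     pluginType
	ID       string
	Requires []pluginType
	Init     func(initialized instances) (interface{}, error)
}

// server stands in for the gRPC server; it only records what was registered.
type server struct{ registered []string }

// service is implemented by plugins that expose an API on the server.
type service interface {
	Register(srv *server) error
}

type containersService struct{ db string }

func (c *containersService) Register(srv *server) error {
	srv.registered = append(srv.registered, "containers")
	return nil
}

// newServer initializes the plugins in order and then registers the services.
func newServer(plugins []*registration) (*server, error) {
	s := &server{}
	initialized := make(instances)
	var services []service
	for _, p := range plugins {
		for _, req := range p.Requires {
			if len(initialized[req]) == 0 {
				return nil, fmt.Errorf("plugin %s.%s requires %s", p.Type, p.ID, req)
			}
		}
		instance, err := p.Init(initialized)
		if err != nil {
			return nil, err
		}
		if initialized[p.Type] == nil {
			initialized[p.Type] = make(map[string]interface{})
		}
		initialized[p.Type][p.ID] = instance
		// only plugins that implement service are exposed on the server
		if svc, ok := instance.(service); ok {
			services = append(services, svc)
		}
	}
	for _, svc := range services {
		if err := svc.Register(s); err != nil {
			return nil, err
		}
	}
	return s, nil
}

// demoPlugins returns a metadata plugin and a containers service that needs it.
func demoPlugins() []*registration {
	return []*registration{
		{Type: metadataPlugin, ID: "bolt", Init: func(instances) (interface{}, error) {
			return "metadata-db", nil
		}},
		{Type: grpcPlugin, ID: "containers", Requires: []pluginType{metadataPlugin},
			Init: func(in instances) (interface{}, error) {
				return &containersService{db: in[metadataPlugin]["bolt"].(string)}, nil
			}},
	}
}

func main() {
	s, err := newServer(demoPlugins())
	fmt.Println(s.registered, err)
}
```

The type assertion against the service interface is what separates plugins that merely provide functionality, such as the metadata store, from plugins that are also exposed to clients.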

Since we are only interested in the containers service for now, let us first see how it is initialized:

func init() {
    plugin.Register(&plugin.Registration{
        Type: plugin.GRPCPlugin,
        ID:   "containers",
        Requires: []plugin.PluginType{
            plugin.MetadataPlugin,
        },
        Init: func(ic *plugin.InitContext) (interface{}, error) {
            m, err := ic.Get(plugin.MetadataPlugin)
            if err != nil {
                return nil, err
            }
            return NewService(m.(*bolt.DB), ic.Events), nil
        },
    })
}

The init function of the containers service shows that the container-related interfaces are exposed over gRPC. Its Register function confirms this:

func (s *Service) Register(server *grpc.Server) error {
    api.RegisterContainersServer(server, s)
    return nil
}

The containers service registers the following interface with gRPC:

type ContainersServer interface {
    Get(context.Context, *GetContainerRequest) (*GetContainerResponse, error)
    List(context.Context, *ListContainersRequest) (*ListContainersResponse, error)
    Create(context.Context, *CreateContainerRequest) (*CreateContainerResponse, error)
    Update(context.Context, *UpdateContainerRequest) (*UpdateContainerResponse, error)
    Delete(context.Context, *DeleteContainerRequest) (*google_protobuf2.Empty, error)
}

The gRPC handlers that correspond to these methods are:

var _Containers_serviceDesc = grpc.ServiceDesc{
    ServiceName: "containerd.services.containers.v1.Containers",
    HandlerType: (*ContainersServer)(nil),
    Methods: []grpc.MethodDesc{
        {
            MethodName: "Get",
            Handler:    _Containers_Get_Handler,
        },
        {
            MethodName: "List",
            Handler:    _Containers_List_Handler,
        },
        {
            MethodName: "Create",
            Handler:    _Containers_Create_Handler,
        },
        {
            MethodName: "Update",
            Handler:    _Containers_Update_Handler,
        },
        {
            MethodName: "Delete",
            Handler:    _Containers_Delete_Handler,
        },
    },
    Streams:  []grpc.StreamDesc{},
    Metadata: "github.com/containerd/containerd/api/services/containers/v1/containers.proto",
}

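2. Creating and running a container

Once the containerd service is running, containers can be created and run from the command line. containerd provides the ctr command for talking to the containerd service. The command that runs a container, ctr run, has the following form:

ctr run [command options] Image|RootFS ID [COMMAND] [ARG...]

For example:

ctr run docker.io/library/redis:latest containerd-redis

The command is implemented as follows: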

var runCommand = cli.Command{
    Name:      "run",
    Usage:     "run a container",
    ArgsUsage: "Image|RootFS ID [COMMAND] [ARG...]",
    …
    Action: func(context *cli.Context) error {
        var (
            err             error
            checkpointIndex digest.Digest

            ctx, cancel = appContext(context)
            id          = context.Args().Get(1)
            imageRef    = context.Args().First()
            tty         = context.Bool("tty")
        )
        defer cancel()
        if imageRef == "" {
            return errors.New("image ref must be provided")
        }
        if id == "" {
            return errors.New("container id must be provided")
        }
        if raw := context.String("checkpoint"); raw != "" {
            if checkpointIndex, err = digest.Parse(raw); err != nil {
                return err
            }
        }
        client, err := newClient(context)
        if err != nil {
            return err
        }
        container, err := newContainer(ctx, client, context)
        if err != nil {
            return err
        }
        if context.Bool("rm") {
            defer container.Delete(ctx, containerd.WithSnapshotCleanup)
        }
        task, err := newTask(ctx, container, checkpointIndex, tty)
        if err != nil {
            return err
        }
        defer task.Delete(ctx)
        statusC, err := task.Wait(ctx)
        if err != nil {
            return err
        }
        var con console.Console
        if tty {
            con = console.Current()
            defer con.Reset()
            if err := con.SetRaw(); err != nil {
                return err
            }
        }
        if err := task.Start(ctx); err != nil {
            return err
        }
        if tty {
            if err := handleConsoleResize(ctx, task, con); err != nil {
                logrus.WithError(err).Error("console resize")
            }
        } else {
            sigc := forwardAllSignals(ctx, task)
            defer stopCatch(sigc)
        }
        status := <-statusC
        code, _, err := status.Result()
        if err != nil {
            return err
        }
        if _, err := task.Delete(ctx); err != nil {
            return err
        }
        if code != 0 {
            return cli.NewExitError("", int(code))
        }
        return nil
    },
}

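The code shows that the run command consists of two main steps:

1) newContainer creates the new container.

2) newTask creates the task that runs inside the container.

We will skip container creation for now and look at newTask. Task creation eventually calls the NewTask method in /containerd/container.go: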

func (c *container) NewTask(ctx context.Context, ioCreate IOCreation, opts ...NewTaskOpts) (Task, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    i, err := ioCreate(c.c.ID)
    if err != nil {
        return nil, err
    }
    cfg := i.Config()
    request := &tasks.CreateTaskRequest{
        ContainerID: c.c.ID,
        Terminal:    cfg.Terminal,
        Stdin:       cfg.Stdin,
        Stdout:      cfg.Stdout,
        Stderr:      cfg.Stderr,
    }
    …
    t := &task{
        client: c.client,
        io:     i,
        id:     c.ID(),
    }
    if info.Checkpoint != nil {
        …
    } else {
        response, err := c.client.TaskService().Create(ctx, request)
        …
        t.pid = response.Pid
    }
    return t, nil
}

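While NewTask runs, its main work is to send a gRPC request to the containerd process through c.client.TaskService().Create, and containerd handles it from there.

Task creation on the server side is somewhat long-winded, so with the help of Figure 7-7 we go straight to the key part. The server creates a shim process through the runtime layer and then sends the create command to the shim over gRPC; the shim creates the task through runc.

Figure 7-7 Container creation and execution flow on the server side

runc is a separate project that integrates libcontainer. Next we analyze the runc create and start paths.

First, container creation: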

var createCommand = cli.Command{
    Name:  "create",
    Usage: "create a container",
    ArgsUsage: `<container-id>
…},
    Action: func(context *cli.Context) error {
        if err := checkArgs(context, 1, exactArgs); err != nil {
            return err
        }
        if err := revisePidFile(context); err != nil {
            return err
        }
        spec, err := setupSpec(context)
        if err != nil {
            return err
        }
        status, err := startContainer(context, spec, CT_ACT_CREATE, nil)
        if err != nil {
            return err
        }
        …
        os.Exit(status)
        return nil
    },
}

The core of the create command is the startContainer function, which eventually creates the container through createContainer:

func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
    config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
        CgroupName:       id,
        UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
        NoPivotRoot:      context.Bool("no-pivot"),
        NoNewKeyring:     context.Bool("no-new-keyring"),
        Spec:             spec,
        Rootless:         isRootless(),
    })
    if err != nil {
        return nil, err
    }
    factory, err := loadFactory(context)
    if err != nil {
        return nil, err
    }
    return factory.Create(id, config)
}

createContainer has two steps:

1) loadFactory loads the factory:

func loadFactory(context *cli.Context) (libcontainer.Factory, error) {
    root := context.GlobalString("root")
    abs, err := filepath.Abs(root)
    if err != nil {
        return nil, err
    }
    …
    cgroupManager := libcontainer.Cgroupfs
    …
    return libcontainer.New(abs, cgroupManager, intelRdtManager,
        libcontainer.CriuPath(context.GlobalString("criu")),
        libcontainer.NewuidmapPath(newuidmap),
        libcontainer.NewgidmapPath(newgidmap))
}

After choosing the cgroupManager, loadFactory initializes a LinuxFactory:

func New(root string, options ...func(*LinuxFactory) error) (Factory, error) {
    if root != "" {
        if err := os.MkdirAll(root, 0700); err != nil {
            return nil, newGenericError(err, SystemError)
        }
    }
    l := &LinuxFactory{
        Root:      root,
        InitArgs:  []string{"/proc/self/exe", "init"},
        Validator: validate.New(),
        CriuPath:  "criu",
    }
    Cgroupfs(l)
    for _, opt := range options {
        if opt == nil {
            continue
        }
        if err := opt(l); err != nil {
            return nil, err
        }
    }
    return l, nil
}
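New follows what is commonly called the functional options pattern in Go: defaults are set first, and each option is a function that mutates the factory and may fail. The sketch below isolates that pattern; factory, option, withCriuPath, withSystemdCgroups, and newFactory are hypothetical names chosen for this illustration and are not libcontainer's API.

```go
package main

import (
	"errors"
	"fmt"
)

// factory is a hypothetical stand-in for LinuxFactory with only three fields.
type factory struct {
	Root          string
	CriuPath      string
	CgroupManager string
}

// option configures a factory and may reject invalid settings.
type option func(*factory) error

// withCriuPath overrides the default path of the criu binary.
func withCriuPath(p string) option {
	return func(f *factory) error {
		if p == "" {
			return errors.New("criu path must not be empty")
		}
		f.CriuPath = p
		return nil
	}
}

// withSystemdCgroups switches the cgroup manager away from the default.
func withSystemdCgroups() option {
	return func(f *factory) error {
		f.CgroupManager = "systemd"
		return nil
	}
}

// newFactory sets the defaults first and then applies the options in order,
// skipping nil options and stopping at the first error.
func newFactory(root string, opts ...option) (*factory, error) {
	f := &factory{Root: root, CriuPath: "criu", CgroupManager: "cgroupfs"}
	for _, opt := range opts {
		if opt == nil {
			continue
		}
		if err := opt(f); err != nil {
			return nil, err
		}
	}
	return f, nil
}

func main() {
	f, err := newFactory("/run/demo", withSystemdCgroups(), withCriuPath("/usr/sbin/criu"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *f)
}
```

The benefit of this design is that callers such as loadFactory can add new settings without the constructor's signature changing, and a default such as the cgroupfs manager stays in force unless an option overrides it.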

2) factory.Create validates its arguments and then returns a libcontainer container object (a linuxContainer):

func (l *LinuxFactory) Create(id string, config *configs.Config) (Container, error) {
    if l.Root == "" {
        return nil, newGenericError(fmt.Errorf("invalid root"), ConfigInvalid)
    }
    if err := l.validateID(id); err != nil {
        return nil, err
    }
    if err := l.Validator.Validate(config); err != nil {
        return nil, newGenericError(err, ConfigInvalid)
    }
    containerRoot := filepath.Join(l.Root, id)
    if _, err := os.Stat(containerRoot); err == nil {
        return nil, newGenericError(fmt.Errorf("container with id exists: %v", id), IdInUse)
    } else if !os.IsNotExist(err) {
        return nil, newGenericError(err, SystemError)
    }
    if err := os.MkdirAll(containerRoot, 0711); err != nil {
        return nil, newGenericError(err, SystemError)
    }
    if err := os.Chown(containerRoot, unix.Geteuid(), unix.Getegid()); err != nil {
        return nil, newGenericError(err, SystemError)
    }
    if config.Rootless {
        RootlessCgroups(l)
    }
    c := &linuxContainer{
        id:            id,
        root:          containerRoot,
        config:        config,
        initArgs:      l.InitArgs,
        criuPath:      l.CriuPath,
        newuidmapPath: l.NewuidmapPath,
        newgidmapPath: l.NewgidmapPath,
        cgroupManager: l.NewCgroupsManager(config.Cgroups, nil),
    }
    c.intelRdtManager = nil
    if intelrdt.IsEnabled() && c.config.IntelRdt != nil {
        c.intelRdtManager = l.NewIntelRdtManager(config, id, "")
    }
    c.state = &stoppedState{c: c}
    return c, nil
}

Back in startContainer, once createContainer has created the container, a runner is initialized and run:

func startContainer(context *cli.Context, spec *specs.Spec, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
    …
    container, err := createContainer(context, id, spec)
    …
    r := &runner{
        enableSubreaper: !context.Bool("no-subreaper"),
        shouldDestroy:   true,
        container:       container,
        listenFDs:       listenFDs,
        notifySocket:    notifySocket,
        consoleSocket:   context.String("console-socket"),
        detach:          context.Bool("detach"),
        pidFile:         context.String("pid-file"),
        preserveFDs:     context.Int("preserve-fds"),
        action:          action,
        criuOpts:        criuOpts,
    }
    return r.run(spec.Process)
}

Next comes the run function that startContainer calls as its last step:

func (r *runner) run(config *specs.Process) (int, error) {
    …
    process, err := newProcess(*config)
    …
    if len(r.listenFDs) > 0 {
        process.Env = append(process.Env, fmt.Sprintf("LISTEN_FDS=%d", len(r.listenFDs)), "LISTEN_PID=1")
        process.ExtraFiles = append(process.ExtraFiles, r.listenFDs...)
    }
    baseFd := 3 + len(process.ExtraFiles)
    for i := baseFd; i < baseFd+r.preserveFDs; i++ {
        process.ExtraFiles = append(process.ExtraFiles, os.NewFile(uintptr(i), "PreserveFD:"+strconv.Itoa(i)))
    }
    rootuid, err := r.container.Config().HostRootUID()
    …
    rootgid, err := r.container.Config().HostRootGID()
    …
    var (
        detach = r.detach || (r.action == CT_ACT_CREATE)
    )
    handler := newSignalHandler(r.enableSubreaper, r.notifySocket)
    tty, err := setupIO(process, rootuid, rootgid, config.Terminal, detach, r.consoleSocket)
    …
    defer tty.Close()
    switch r.action {
    case CT_ACT_CREATE:
        err = r.container.Start(process)
    case CT_ACT_RESTORE:
        err = r.container.Restore(process, r.criuOpts)
    case CT_ACT_RUN:
        err = r.container.Run(process)
    default:
        panic("Unknown action")
    }
    ……
    if r.pidFile != "" {
        if err = createPidFile(r.pidFile, process); err != nil {
            r.terminate(process)
            r.destroy()
            return -1, err
        }
    }
    status, err := handler.forward(process, tty, detach)
    …
    if detach {
        return 0, nil
    }
    r.destroy()
    return status, err
}

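The run function mainly initializes a libcontainer.Process and sets up I/O, permissions, and so on. Because the action passed in is CT_ACT_CREATE, it then executes:

r.container.Start

Finally, we arrive at the most critical method in starting a container, start: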

func (c *linuxContainer) start(process *Process, isInit bool) error {
    parent, err := c.newParentProcess(process, isInit)
    if err != nil {
        return newSystemErrorWithCause(err, "creating new parent process")
    }
    if err := parent.start(); err != nil {
        if err := parent.terminate(); err != nil {
            logrus.Warn(err)
        }
        return newSystemErrorWithCause(err, "starting container process")
    }
    …
}

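Starting a container has two key steps:

1) newParentProcess performs the initialization on the parent side: it creates the pipe that the parent uses to communicate with the container's child process, builds the command template, and prepares the data for the init process, such as the namespace data: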

func (c *linuxContainer) newParentProcess(p *Process, doInit bool) (parentProcess, error) {
    parentPipe, childPipe, err := utils.NewSockPair("init")
    if err != nil {
        return nil, newSystemErrorWithCause(err, "creating new init pipe")
    }
    cmd, err := c.commandTemplate(p, childPipe)
    if err != nil {
        return nil, newSystemErrorWithCause(err, "creating new command template")
    }
    if !doInit {
        return c.newSetnsProcess(p, cmd, parentPipe, childPipe)
    }
    // fifoFd is only set up when this is not a `runc exec`, that is, for the init process;
    // this is for historical reasons, and `runc exec` does not need it
    if err := c.includeExecFifo(cmd); err != nil {
        return nil, newSystemErrorWithCause(err, "including execfifo in cmd.Exec setup")
    }
    return c.newInitProcess(p, cmd, parentPipe, childPipe)
}

newInitProcess is implemented as follows:

func (c *linuxContainer) newInitProcess(p *Process, cmd *exec.Cmd, parentPipe, childPipe *os.File) (*initProcess, error) {
    cmd.Env = append(cmd.Env, "_LIBCONTAINER_INITTYPE="+string(initStandard))
    nsMaps := make(map[configs.NamespaceType]string)
    for _, ns := range c.config.Namespaces {
        if ns.Path != "" {
            nsMaps[ns.Type] = ns.Path
        }
    }
    _, sharePidns := nsMaps[configs.NEWPID]
    data, err := c.bootstrapData(c.config.Namespaces.CloneFlags(), nsMaps)
    if err != nil {
        return nil, err
    }
    return &initProcess{
        cmd:             cmd,
        childPipe:       childPipe,
        parentPipe:      parentPipe,
        manager:         c.cgroupManager,
        intelRdtManager: c.intelRdtManager,
        config:          c.newInitConfig(p),
        container:       c,
        process:         p,
        bootstrapData:   data,
        sharePidns:      sharePidns,
    }, nil
}

2) parent.start starts the child process, applies the namespace and cgroup settings, and finally sends the config to the child:

func (p *initProcess) start() error {
    defer p.parentPipe.Close()
    err := p.cmd.Start()
    p.process.ops = p
    p.childPipe.Close()
    if err != nil {
        p.process.ops = nil
        return newSystemErrorWithCause(err, "starting init process command")
    }
    // Do this before syncing with the child so that no child can escape the cgroup.
    // There is no need to worry about doing this without being root, because the
    // rootless cgroup manager is used in that case.
    if err := p.manager.Apply(p.pid()); err != nil {
        return newSystemErrorWithCause(err, "applying cgroup configuration for process")
    }
    if p.intelRdtManager != nil {
        if err := p.intelRdtManager.Apply(p.pid()); err != nil {
            return newSystemErrorWithCause(err, "applying Intel RDT configuration for process")
        }
    }
    defer func() {
        if err != nil {
            // TODO: should not be the responsibility to call here
            p.manager.Destroy()
            if p.intelRdtManager != nil {
                p.intelRdtManager.Destroy()
            }
        }
    }()
    if _, err := io.Copy(p.parentPipe, p.bootstrapData); err != nil {
        return newSystemErrorWithCause(err, "copying bootstrap data to pipe")
    }
    if err := p.execSetns(); err != nil {
        return newSystemErrorWithCause(err, "running exec setns process for init")
    }
    fds, err := getPipeFds(p.pid())
    if err != nil {
        return newSystemErrorWithCausef(err, "getting pipe fds for pid %d", p.pid())
    }
    p.setExternalDescriptors(fds)
    if err := p.createNetworkInterfaces(); err != nil {
        return newSystemErrorWithCause(err, "creating network interfaces")
    }
    if err := p.sendConfig(); err != nil {
        return newSystemErrorWithCause(err, "sending config to init process")
    }
    var (
        sentRun    bool
        sentResume bool
    )
    ierr := parseSync(p.parentPipe, func(sync *syncT) error {
        switch sync.Type {
        case procReady:
            // set rlimits; this has to be done here because we lose the permission
            // to raise the limits once we enter a user namespace
            if err := setupRlimits(p.config.Rlimits, p.pid()); err != nil {
                return newSystemErrorWithCause(err, "setting rlimits for ready process")
            }
            // call the prestart hooks
            if !p.config.Config.Namespaces.Contains(configs.NEWNS) {
                // set up cgroups before running the hooks, so that a prestart hook can apply cgroup permissions
                if err := p.manager.Set(p.config.Config); err != nil {
                    return newSystemErrorWithCause(err, "setting cgroup config for ready process")
                }
                if p.intelRdtManager != nil {
                    if err := p.intelRdtManager.Set(p.config.Config); err != nil {
                        return newSystemErrorWithCause(err, "setting Intel RDT config for ready process")
                    }
                }
                if p.config.Config.Hooks != nil {
                    s := configs.HookState{
                        Version: p.container.config.Version,
                        ID:      p.container.id,
                        Pid:     p.pid(),
                        Bundle:  utils.SearchLabels(p.config.Config.Labels, "bundle"),
                    }
                    for i, hook := range p.config.Config.Hooks.Prestart {
                        if err := hook.Run(s); err != nil {
                            return newSystemErrorWithCausef(err, "running prestart hook %d", i)
                        }
                    }
                }
            }
            // sync with the child process
            if err := writeSync(p.parentPipe, procRun); err != nil {
                return newSystemErrorWithCause(err, "writing syncT 'run'")
            }
            sentRun = true
        case procHooks:
            // set up cgroups before running the hooks, so that a prestart hook can apply cgroup permissions
            if err := p.manager.Set(p.config.Config); err != nil {
                return newSystemErrorWithCause(err, "setting cgroup config for procHooks process")
            }
            if p.intelRdtManager != nil {
                if err := p.intelRdtManager.Set(p.config.Config); err != nil {
                    return newSystemErrorWithCause(err, "setting Intel RDT config for procHooks process")
                }
            }
            if p.config.Config.Hooks != nil {
                s := configs.HookState{
                    Version: p.container.config.Version,
                    ID:      p.container.id,
                    Pid:     p.pid(),
                    Bundle:  utils.SearchLabels(p.config.Config.Labels, "bundle"),
                }
                for i, hook := range p.config.Config.Hooks.Prestart {
                    if err := hook.Run(s); err != nil {
                        return newSystemErrorWithCausef(err, "running prestart hook %d", i)
                    }
                }
            }
            // sync with the child process
            if err := writeSync(p.parentPipe, procResume); err != nil {
                return newSystemErrorWithCause(err, "writing syncT 'resume'")
            }
            sentResume = true
        default:
            return newSystemError(fmt.Errorf("invalid JSON payload from child"))
        }
        return nil
    })
    if !sentRun {
        return newSystemErrorWithCause(ierr, "container init")
    }
    if p.config.Config.Namespaces.Contains(configs.NEWNS) && !sentResume {
        return newSystemError(fmt.Errorf("could not synchronise after executing prestart hooks with container process"))
    }
    if err := unix.Shutdown(int(p.parentPipe.Fd()), unix.SHUT_WR); err != nil {
        return newSystemErrorWithCause(err, "shutting down init pipe")
    }
    // this must happen after the shutdown, so that the child exits and we can wait for it
    if ierr != nil {
        p.wait()
        return ierr
    }
    return nil
}

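When the code above runs, p.cmd.Start() first starts the child process, p.manager.Apply(p.pid()) applies the cgroup settings to it, and p.sendConfig() sends the child the configuration it needs to apply. The parent then synchronizes with the child through parseSync/writeSync.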
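The send-config-then-synchronize pattern can be tried out in isolation. The following is only a minimal sketch, not runc's code: the "child" is a goroutine rather than a real process, net.Pipe stands in for the parent pipe, and the names syncMsg, childConfig, child, parent and runHandshake are hypothetical. Only the message names procReady and procRun are borrowed from the listing above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net"
)

// syncMsg is a hypothetical stand-in for the sync message exchanged over the pipe.
type syncMsg struct {
	Type string `json:"type"`
}

// childConfig is a hypothetical, heavily reduced configuration payload.
type childConfig struct {
	Args []string `json:"args"`
}

// child simulates the init side: read the config, report ready, wait for "procRun".
func child(conn net.Conn) {
	defer conn.Close()
	dec := json.NewDecoder(conn)
	enc := json.NewEncoder(conn)

	var cfg childConfig
	if err := dec.Decode(&cfg); err != nil {
		return
	}
	if err := enc.Encode(syncMsg{Type: "procReady"}); err != nil {
		return
	}
	var msg syncMsg
	if err := dec.Decode(&msg); err != nil || msg.Type != "procRun" {
		return
	}
	// A real init process would exec the user program at this point.
}

// parent sends the config, then answers sync messages until the child closes the pipe.
// It returns the sequence of steps it performed.
func parent(conn net.Conn, cfg childConfig) ([]string, error) {
	defer conn.Close()
	enc := json.NewEncoder(conn)
	dec := json.NewDecoder(conn)

	var steps []string
	if err := enc.Encode(cfg); err != nil {
		return steps, err
	}
	steps = append(steps, "sent config")

	for {
		var msg syncMsg
		err := dec.Decode(&msg)
		if err == io.EOF {
			break // the child closed its end of the pipe
		}
		if err != nil {
			return steps, err
		}
		steps = append(steps, "got "+msg.Type)
		switch msg.Type {
		case "procReady":
			if err := enc.Encode(syncMsg{Type: "procRun"}); err != nil {
				return steps, err
			}
			steps = append(steps, "sent procRun")
		default:
			return steps, fmt.Errorf("unexpected sync message %q", msg.Type)
		}
	}
	return steps, nil
}

// runHandshake wires both sides together over an in-memory pipe.
func runHandshake() ([]string, error) {
	p, c := net.Pipe()
	go child(c)
	return parent(p, childConfig{Args: []string{"/bin/sh"}})
}

func main() {
	steps, err := runHandshake()
	fmt.Println(steps, err)
}
```

The point of the sketch is the ordering: the parent never tells the child to run before the child has reported that it is ready, and the end of the exchange is detected through the closed pipe rather than through an explicit message.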

The child process is started by exec-ing the program with the init argument, which runs the init command:

var initCommand = cli.Command{
    Name:  "init",
    Usage: `initialize the namespaces and launch the process (do not call it outside of runc)`,
    Action: func(context *cli.Context) error {
        factory, _ := libcontainer.New("")
        if err := factory.StartInitialization(); err != nil {
            // The error has already been sent back to the parent process, so there is
            // no need to log it here; the parent process will handle it
            os.Exit(1)
        }
        panic("libcontainer: container init failed to exec")
    },
}

The init process initializes itself through factory.StartInitialization:

func (l *LinuxFactory) StartInitialization() (err error) {
    var (
        pipefd, fifofd int
        consoleSocket  *os.File
        envInitPipe    = os.Getenv("_LIBCONTAINER_INITPIPE")
        envFifoFd      = os.Getenv("_LIBCONTAINER_FIFOFD")
        envConsole     = os.Getenv("_LIBCONTAINER_CONSOLE")
    )
    // Get the INITPIPE.
    pipefd, err = strconv.Atoi(envInitPipe)
    if err != nil {
        return fmt.Errorf("unable to convert _LIBCONTAINER_INITPIPE=%s to int: %s", envInitPipe, err)
    }
    var (
        pipe = os.NewFile(uintptr(pipefd), "pipe")
        it   = initType(os.Getenv("_LIBCONTAINER_INITTYPE"))
    )
    defer pipe.Close()
    // Only init processes have a fifofd
    fifofd = -1
    if it == initStandard {
        if fifofd, err = strconv.Atoi(envFifoFd); err != nil {
            return fmt.Errorf("unable to convert _LIBCONTAINER_FIFOFD=%s to int: %s", envFifoFd, err)
        }
    }
    …
    i, err := newContainerInit(it, pipe, consoleSocket, fifofd)
    if err != nil {
        return err
    }
    // If initialization succeeds, syscall.Exec does not return
    return i.Init()
}
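The listing relies on a simple convention: the number of an open file descriptor is published in an environment variable, and the receiving side converts it back into an *os.File. The sketch below shows that conversion inside a single process, under stated assumptions: the variable name _SKETCH_INITPIPE and the helpers fileFromEnv and roundTrip are hypothetical, and a real parent would hand the descriptor to a separate child process rather than to itself.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strconv"
	"syscall"
)

// envPipeFd is a hypothetical variable name used only in this sketch.
const envPipeFd = "_SKETCH_INITPIPE"

// fileFromEnv follows the same steps as the listing: read the variable,
// convert it to an int, and wrap the descriptor in an *os.File.
func fileFromEnv(name string) (*os.File, error) {
	value := os.Getenv(name)
	fd, err := strconv.Atoi(value)
	if err != nil {
		return nil, fmt.Errorf("unable to convert %s=%s to int: %s", name, value, err)
	}
	return os.NewFile(uintptr(fd), "pipe"), nil
}

// roundTrip writes msg into a pipe, publishes the read end's descriptor number
// in the environment, and reads the message back through fileFromEnv.
func roundTrip(msg string) (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	// Duplicate the read end so that the published descriptor is owned
	// only by the file that fileFromEnv creates later.
	fd, err := syscall.Dup(int(r.Fd()))
	r.Close()
	if err != nil {
		w.Close()
		return "", err
	}
	if err := os.Setenv(envPipeFd, strconv.Itoa(fd)); err != nil {
		syscall.Close(fd)
		w.Close()
		return "", err
	}
	defer os.Unsetenv(envPipeFd)

	// The message is small enough to fit into the pipe buffer.
	_, werr := w.WriteString(msg)
	w.Close()

	pipe, err := fileFromEnv(envPipeFd)
	if err != nil {
		return "", err
	}
	defer pipe.Close()
	if werr != nil {
		return "", werr
	}
	data, err := io.ReadAll(pipe)
	return string(data), err
}

func main() {
	got, err := roundTrip("config from parent")
	fmt.Println(got, err)
}
```

An environment variable is a convenient carrier here because it survives the exec of the child, whereas any in-memory state of the parent does not.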

Once the child process has obtained the pipe, it calls newContainerInit:

func newContainerInit(t initType, pipe *os.File, consoleSocket *os.File, fifoFd int) (initer, error) {
    var config *initConfig
    if err := json.NewDecoder(pipe).Decode(&config); err != nil {
        return nil, err
    }
    if err := populateProcessEnvironment(config.Env); err != nil {
        return nil, err
    }
    switch t {
    case initSetns:
        return &linuxSetnsInit{
            pipe:          pipe,
            consoleSocket: consoleSocket,
            config:        config,
        }, nil
    case initStandard:
        return &linuxStandardInit{
            pipe:          pipe,
            consoleSocket: consoleSocket,
            parentPid:     unix.Getppid(),
            config:        config,
            fifoFd:        fifoFd,
        }, nil
    }
    return nil, fmt.Errorf("unknown init type %q", t)
}
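The structure of newContainerInit (decode a JSON configuration from a stream, then pick an implementation of a common interface according to a type tag) can be reproduced in a few lines. This is a sketch under assumptions: sketchConfig, sketchInitType, setnsInit, standardInit, newIniter and describe are hypothetical names, and the Init methods only return a description instead of doing any real work.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// sketchInitType is a hypothetical stand-in for the init type tag.
type sketchInitType string

const (
	sketchSetns    sketchInitType = "setns"
	sketchStandard sketchInitType = "standard"
)

// sketchConfig is a hypothetical, heavily reduced configuration.
type sketchConfig struct {
	Args []string `json:"args"`
}

// initer is the common interface that both init variants implement.
type initer interface {
	Init() string
}

type setnsInit struct{ config *sketchConfig }

type standardInit struct{ config *sketchConfig }

// Init only describes what a real implementation would do.
func (s *setnsInit) Init() string {
	return "join existing namespaces, then run " + strings.Join(s.config.Args, " ")
}

// Init only describes what a real implementation would do.
func (s *standardInit) Init() string {
	return "set up a new container, then run " + strings.Join(s.config.Args, " ")
}

// newIniter decodes the configuration from pipe and selects the init variant.
func newIniter(t sketchInitType, pipe io.Reader) (initer, error) {
	config := &sketchConfig{}
	if err := json.NewDecoder(pipe).Decode(config); err != nil {
		return nil, err
	}
	switch t {
	case sketchSetns:
		return &setnsInit{config: config}, nil
	case sketchStandard:
		return &standardInit{config: config}, nil
	}
	return nil, fmt.Errorf("unknown init type %q", t)
}

// describe feeds a JSON payload through newIniter and returns the description.
func describe(t string, payload string) (string, error) {
	i, err := newIniter(sketchInitType(t), strings.NewReader(payload))
	if err != nil {
		return "", err
	}
	return i.Init(), nil
}

func main() {
	for _, t := range []string{"standard", "setns", "bogus"} {
		s, err := describe(t, `{"args":["/bin/sh","-c","true"]}`)
		fmt.Println(t, "->", s, err)
	}
}
```

Because the caller only sees the initer interface, the code that follows the dispatch does not need to know which variant was selected.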

In our case the init type is initStandard, so the Init method in standard_init_linux.go is executed:

func (l *linuxStandardInit) Init() error {
    …
    if err := setupNetwork(l.config); err != nil {
        return err
    }
    if err := setupRoute(l.config.Config); err != nil {
        return err
    }
    label.Init()
    …
    if err := syscall.Exec(name, l.config.Args[0:], os.Environ()); err != nil {
        return newSystemErrorWithCause(err, "exec user process")
    }
    return nil
}

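Based on the config sent by the parent process, the code above configures the network, routes, and so on, and finally runs the user-specified program through the exec system call.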
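The final step, replacing the init process with the user program, is worth seeing on its own, because syscall.Exec behaves differently from starting a subprocess: on success it never returns, and the calling program simply becomes the new one. The sketch below assumes a Unix-like system; resolveProgram and execUser are hypothetical helper names, and the program to run is taken from the command line.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// resolveProgram looks the program up in PATH, because syscall.Exec
// expects the path of the executable rather than a bare command name.
func resolveProgram(name string) (string, error) {
	return exec.LookPath(name)
}

// execUser replaces the current process image with the given program.
// It only returns if the exec itself failed.
func execUser(args []string) error {
	path, err := resolveProgram(args[0])
	if err != nil {
		return err
	}
	return syscall.Exec(path, args, os.Environ())
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: sketch PROGRAM [ARGS...]")
		return
	}
	if err := execUser(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, "exec user process:", err)
		os.Exit(1)
	}
	// Not reached on success: the process image has been replaced.
}
```

Since the process ID is kept across the exec, everything that was configured for the process beforehand continues to apply to the user program.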

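Finally, let us summarize how the parent and child processes communicate during container creation (see Figure 7-8). The runc process calls startContainer, which creates the container through libcontainer. The namespaces, cgroups, and so on are set up when the child process is forked. The parent then sends the configuration to the child through sendConfig, the child applies it and, once it has executed the user-specified program, notifies the parent process.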

Figure 7-8　Parent-child process communication during runc container creation

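The namespace settings are prepared when newInitProcess runs: bootstrapData encodes them in netlink message format and hands them to the child's bootstrap code, which applies them.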
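To give a feel for what netlink-formatted data looks like, the sketch below encodes a single netlink attribute: a 4-byte header (16-bit length, 16-bit type) followed by the payload, padded to a 4-byte boundary. It is only an illustration of the wire layout. The attribute type number used in main is hypothetical, a little-endian host is assumed, and the actual attributes and type values that runc uses are not reproduced here.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	attrHeaderLen = 4 // two uint16 fields: length and type
	attrAlignTo   = 4 // attributes are padded to a 4-byte boundary
)

// align rounds n up to the next multiple of attrAlignTo.
func align(n int) int {
	return (n + attrAlignTo - 1) &^ (attrAlignTo - 1)
}

// encodeAttr encodes one attribute. The length field covers the header and
// the payload, but not the trailing padding.
// Assumption: little-endian host byte order.
func encodeAttr(attrType uint16, payload []byte) []byte {
	length := attrHeaderLen + len(payload)
	buf := make([]byte, align(length))
	binary.LittleEndian.PutUint16(buf[0:2], uint16(length))
	binary.LittleEndian.PutUint16(buf[2:4], attrType)
	copy(buf[attrHeaderLen:], payload)
	return buf
}

// encodeUint32Attr encodes an attribute whose payload is a single uint32.
func encodeUint32Attr(attrType uint16, value uint32) []byte {
	payload := make([]byte, 4)
	binary.LittleEndian.PutUint32(payload, value)
	return encodeAttr(attrType, payload)
}

func main() {
	const sketchCloneFlagsAttr = 1 // hypothetical attribute type for this sketch
	const cloneNewNS = 0x00020000  // CLONE_NEWNS
	const cloneNewPID = 0x20000000 // CLONE_NEWPID

	attr := encodeUint32Attr(sketchCloneFlagsAttr, cloneNewNS|cloneNewPID)
	fmt.Printf("% x\n", attr)
}
```

A receiver can walk such a buffer attribute by attribute: read the length, handle the payload according to the type, and skip ahead by the aligned length.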

7.4　Chapter Summary

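I remember an expert once saying that the core of a piece of software is only a dozen or two lines of code, and that most of the rest exists for engineering organization and encapsulation. Container technology is no different: leaving aside DevOps ideas and practices (such as the use of Docker images), the Docker container code is essentially a wrapper around Linux technologies such as namespaces and cgroups, and once you have mastered the core code, the remaining questions resolve themselves. As the saying goes, it is better to teach someone to fish than to give them a fish. This chapter did not try to explain every aspect of Docker's implementation (containers, isolation, cluster management, images, and so on). Instead, it used the container isolation techniques as the entry point for the analysis. I believe any software can be analyzed in a similar way: master the core techniques, and you can cope with whatever changes around them.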