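Today I started reading Linux Kernel Development (LKD).

Judging by the table of contents, the book covers a lot of ground and goes beyond LDD in several areas. That makes sense: LDD focuses on device drivers, while LKD focuses on the kernel itself.

The first two chapters, Introduction and Getting Started, cover the history of Linux, basic operating system concepts, setting up a kernel development environment, and downloading and building the kernel source. They are background reading, so I am not taking notes on them here.

These notes start with Chapter 3: Process Management.

This chapter explains processes and the closely related concept of threads, along with how the kernel manages processes through their life cycle. Since the kernel exists to serve applications, its process management matters a great deal to user-space programs.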

The Process

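Operating systems courses usually define a process as a program in execution. That is not quite accurate. Besides the program code, a process also includes the resources it needs to run: open files, pending signals, internal kernel data, process state, a memory address space, one or more threads, a data section holding global variables, and so on. A program by itself has none of these.

From the point of view of a user-space process these resources are transparent; the kernel manages them all.

Threads resemble processes but are not the same thing. The textbook distinction is that the thread is the basic unit of scheduling and the process is the basic unit of resource management. In other words, what actually executes code and gets scheduled by the kernel is a thread, not a process. Each thread has its own program counter, process stack, and set of processor registers. Interestingly, the Linux kernel does not distinguish between processes and threads: a thread is just a special kind of process, described by the same structure.

A process provides two virtualizations: a virtualized processor and virtual memory. While it runs, a process behaves as if it had the whole CPU and all of memory to itself, so an application does not need to think about other processes when it uses them. In reality, these physical resources are shared among many processes.

A process's life begins when it is created. On Linux that is done with the fork system call, which creates a child by duplicating the calling process. fork returns twice at the call site: once in the parent and once in the child. Typically the child then calls one of the exec functions to start running a new program.

A process's life ends with the exit system call, which terminates the process and frees the resources it holds. The parent can wait for a child to finish with the wait system call. A child that has exited stays around as a zombie until its parent waits for it.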
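The fork, exec, wait cycle described above can be sketched from user space. This is a minimal illustration, not code from the book; the function name run_and_wait and the choice of 127 as the "exec failed" code are my own assumptions.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Runs argv[0] (looked up in PATH) as a child process and returns its
 * exit code, 127 if exec failed in the child, or -1 on any other error.
 */
int run_and_wait(char *const argv[])
{
    int status;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();               /* returns 0 in the child, the child's PID in the parent */
    if (pid == -1)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);  /* only returns if exec failed */
        _exit(127);
    }

    /* Reap the child so that it does not linger as a zombie. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with an argv such as { "sh", "-c", "exit 3", NULL } should hand back the child's exit code to the parent.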

Process Descriptor and the Task Structure

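The kernel manages each process through a struct task_struct. This structure is the process descriptor (kernel code usually handles it through a pointer). It holds everything the kernel knows about a process: open files, the virtual address space, pending signals, the process state, and much more. As a result the structure is large; the book puts it at around 1.7KB on a 32-bit machine.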

Allocating the Process Descriptor

In kernel 2.6, the memory for struct task_struct comes from the slab allocator. Before 2.6, task_struct was stored at the end of each process's kernel stack, so it could be located from the stack pointer alone without dedicating a register to it, which suits register-starved architectures such as x86. Since 2.6, task_struct is allocated dynamically and no longer lives at the end of the stack. Instead, a smaller struct thread_info took its place there. Here is thread_info in kernel 4.15 (as its comment shows, this is the IA-64 definition):

/*
 * On IA-64, we want to keep the task structure and kernel stack together, so they can be
 * mapped by a single TLB entry and so they can be addressed by the "current" pointer
 * without having to do pointer masking.
 */
struct thread_info {
    struct task_struct *task;   /* XXX not really needed, except for dup_task_struct() */
    __u32 flags;                /* thread_info flags (see TIF_*) */
    __u32 cpu;                  /* current CPU */
    __u32 last_cpu;             /* Last CPU thread ran on */
    __u32 status;               /* Thread synchronous flags */
    mm_segment_t addr_limit;    /* user-level address space limit */
    int preempt_count;          /* 0=premptable, <0=BUG; will also serve as bh-counter */
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
    __u64 utime;
    __u64 stime;
    __u64 gtime;
    __u64 hardirq_time;
    __u64 softirq_time;
    __u64 idle_time;
    __u64 ac_stamp;
    __u64 ac_leave;
    __u64 ac_stime;
    __u64 ac_utime;
#endif
};

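(Figure: thread_info stored at the end of the kernel stack.)

If the stack grows down (toward lower addresses), thread_info sits at the lowest address of the stack; if the stack grows up, thread_info sits at the highest address. Note that the figure has a typo: the structure at the low address should be labeled struct thread_info, not struct thread_struct. The two are different structures.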
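Why this placement is convenient can be sketched in plain C. Assuming a kernel stack that is aligned to its own size and grows down (the layout the book describes for x86 in 2.6), masking off the low bits of any stack pointer yields the lowest address of the stack, which is where thread_info lives. The function name and the sizes used here are illustrative and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: given any address inside a size-aligned stack,
 * return the lowest address of that stack. thread_size must be a
 * power of two (for example 8192 for an 8KB stack).
 */
uintptr_t stack_base_from_sp(uintptr_t sp, size_t thread_size)
{
    assert(thread_size != 0 && (thread_size & (thread_size - 1)) == 0);
    return sp & ~((uintptr_t)thread_size - 1);
}
```

With an 8KB stack this clears the low 13 bits, so every address within the same stack maps to the same base.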

Storing the Process Descriptor

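The kernel identifies each process by a PID of type pid_t, which is typically an int. For compatibility with older systems the default limit, pid_max, is 32768 (so the largest PID is 32767); it can be raised through /proc/sys/kernel/pid_max.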
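The current limit can be read back from user space. A small sketch, assuming a Linux system with /proc mounted; read_pid_max is a name I made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the current pid_max, or -1 if the file cannot be read or parsed. */
long read_pid_max(void)
{
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

On many modern distributions the value is larger than the historical default of 32768.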

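Kernel code usually operates on a process through its task_struct, so being able to find the task_struct of a given process quickly matters. For the currently running process, the current macro does the job. The macro is architecture-specific, which is easy to understand given that it was traditionally implemented through the thread_info on the stack. (Some architectures keep a pointer to the task_struct in a register; architectures like x86 traditionally went through the task pointer in the thread_info on the stack.) Let's take a quick look at the implementation on x86.

First, the generic definition of current, in include/asm-generic/current.h: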

#include <linux/thread_info.h>
#define get_current() (current_thread_info()->task)
#define current get_current()

So current is current_thread_info()->task. Next, current_thread_info():

#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
 * For CONFIG_THREAD_INFO_IN_TASK kernels we need <asm/current.h> for the
 * definition of current, but for !CONFIG_THREAD_INFO_IN_TASK kernels,
 * including <asm/current.h> can cause a circular dependency on some platforms.
 */
#include <asm/current.h>
#define current_thread_info() ((struct thread_info *)current)
#endif

The 4.15 kernel defines CONFIG_THREAD_INFO_IN_TASK, so current_thread_info() is simply current cast to a thread_info pointer. Here is the current it refers to:

DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
    return this_cpu_read_stable(current_task);
}

#define current get_current()

The current macro expands to get_current(), which in turn uses this_cpu_read_stable:

/*
 * this_cpu_read() makes gcc load the percpu variable every time it is
 * accessed while this_cpu_read_stable() allows the value to be cached.
 * this_cpu_read_stable() is more efficient and can be used if its value
 * is guaranteed to be valid across cpus.  The current users include
 * get_current() and get_thread_info() both of which are actually
 * per-thread variables implemented as per-cpu variables and thus
 * stable for the duration of the respective task.
 */
#define this_cpu_read_stable(var)   percpu_stable_op("mov", var)

this_cpu_read_stable is also a macro; it expands directly to percpu_stable_op:

#define percpu_stable_op(op, var) \
({ \
    typeof(var) pfo_ret__; \
    switch (sizeof(var)) { \
    case 1: \
        asm(op "b "__percpu_arg(P1)",%0" \
            : "=q" (pfo_ret__) \
            : "p" (&(var))); \
        break; \
    case 2: \
        asm(op "w "__percpu_arg(P1)",%0" \
            : "=r" (pfo_ret__) \
            : "p" (&(var))); \
        break; \
    case 4: \
        asm(op "l "__percpu_arg(P1)",%0" \
            : "=r" (pfo_ret__) \
            : "p" (&(var))); \
        break; \
    case 8: \
        asm(op "q "__percpu_arg(P1)",%0" \
            : "=r" (pfo_ret__) \
            : "p" (&(var))); \
        break; \
    default: __bad_percpu_size(); \
    } \
    pfo_ret__; \
})

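percpu_stable_op is implemented in inline assembly. current_task is a pointer, so a 32-bit system takes the case 4 branch and a 64-bit system takes case 8. At first I could not see where thread_info is read, and the answer is that it is not: current_task is a per-CPU variable that holds the task_struct pointer of the task running on that CPU, so no stack lookup is involved.

All of the above is based on the 4.15 source, which differs from the 2.6 implementation the book describes. With CONFIG_THREAD_INFO_IN_TASK=y, as in 4.15, thread_info is stored inside task_struct as its first member, which is why casting current to a struct thread_info * works.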
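The cast in current_thread_info() relies on thread_info being the first member of task_struct when CONFIG_THREAD_INFO_IN_TASK is set. Here is a user-space model of that idea; the model_* structures and function are hypothetical stand-ins that I made up, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures, for illustration only. */
struct model_thread_info {
    unsigned long flags;
};

struct model_task {
    struct model_thread_info thread_info; /* must stay the first member */
    int pid;
};

/* Mirrors the idea behind ((struct thread_info *)current). */
struct model_thread_info *model_thread_info_of(struct model_task *task)
{
    /* The cast is only valid because thread_info sits at offset 0. */
    assert(offsetof(struct model_task, thread_info) == 0);
    return (struct model_thread_info *)task;
}
```

C guarantees that a pointer to a structure, suitably converted, points to its first member, so no offset arithmetic is needed.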

Process State

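The state field of task_struct records the current state of the process. The book lists five states:

TASK_RUNNING

The process is runnable: it is either running right now or sitting on a run queue waiting to run.

TASK_INTERRUPTIBLE

The process is sleeping, waiting for some event. It is not on a run queue, and it can also be woken early by a signal.

TASK_UNINTERRUPTIBLE

The process is sleeping, waiting for some event, and it is not on a run queue. Only the event itself wakes it; signals do not.

__TASK_TRACED

The process is being traced by another process, for example through ptrace (which is what gdb uses).

__TASK_STOPPED

Execution has stopped. This happens when the process receives SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU, or when it receives any signal while it is being debugged.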
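From user space these states show up as single letters in /proc/<pid>/stat: R for TASK_RUNNING, S for TASK_INTERRUPTIBLE, D for TASK_UNINTERRUPTIBLE, T for stopped, and t for traced. A minimal sketch, assuming a Linux system with /proc mounted; the function name is my own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns the state letter of the calling process as reported by
 * /proc/self/stat, or '?' if the file cannot be read or parsed.
 */
char read_own_state(void)
{
    char line[1024];
    char *close_paren;
    FILE *f = fopen("/proc/self/stat", "r");

    if (f == NULL)
        return '?';
    if (fgets(line, sizeof(line), f) == NULL) {
        fclose(f);
        return '?';
    }
    fclose(f);

    /* Format: pid (comm) state ...; comm may itself contain ')'. */
    close_paren = strrchr(line, ')');
    if (close_paren == NULL || close_paren[1] != ' ' || close_paren[2] == '\0')
        return '?';
    return close_paren[2];
}
```

A process that reads its own entry is running at that moment, so it should see R.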

Manipulating the Current Process State

The kernel frequently needs to change a process's state. For the current task it uses:

#define set_current_state(state_value) \
    smp_store_mb(current->state, (state_value))

Note that set_task_state no longer exists in kernel 4.15.

Process Context

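The main job of a process is to execute the instructions loaded from its program, inside its user-space address space. When the process makes a system call or triggers an exception, it traps into the kernel. The kernel is then said to be running in process context, executing on behalf of that user process, and current is valid. When the system call or exception handling completes, the kernel returns to user space and the process resumes there, unless a higher-priority process has become runnable in the meantime.

A process can enter kernel space only through these two interfaces: system calls and exceptions.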

The Process Family Tree

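Processes form a clear hierarchy: every process is a descendant of the init process. init starts running at the end of boot and uses the initscripts to create and start everything else.

Every process except init has exactly one parent and zero or more children. Processes with the same parent are called siblings. These relationships are stored in task_struct, in the parent pointer and the children list, and can be followed from there:

/* Access the parent. */
struct task_struct *my_parent = current->parent;

/* Iterate over all children. */
struct task_struct *task;
struct list_head *list;

list_for_each(list, &current->children) {
    /* task now points to one of current's children */
    task = list_entry(list, struct task_struct, sibling);
}

As the first task in the system, init_task has a statically allocated task_struct. (In the source below it is flagged PF_KTHREAD; it becomes the idle task, PID 0, and the user-space init with PID 1 descends from it.)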

/*
 * Set up the first task table, touch at your own risk!. Base=0,
 * limit=0x1fffff (=2MB)
 */
struct task_struct init_task
#ifdef CONFIG_ARCH_TASK_STRUCT_ON_STACK
    __init_task_data
#endif
= {
#ifdef CONFIG_THREAD_INFO_IN_TASK
    .thread_info = INIT_THREAD_INFO(init_task),
    .stack_refcount = ATOMIC_INIT(1),
#endif
    .state = 0,
    .stack = init_stack,
    .usage = ATOMIC_INIT(2),
    .flags = PF_KTHREAD,
    .prio = MAX_PRIO - 20,
    .static_prio = MAX_PRIO - 20,
    .normal_prio = MAX_PRIO - 20,
    .policy = SCHED_NORMAL,
    .cpus_allowed = CPU_MASK_ALL,
    .nr_cpus_allowed = NR_CPUS,
    .mm = NULL,
    .active_mm = &init_mm,
    .restart_block = {
        .fn = do_no_restart_syscall,
    },
    .se = {
        .group_node = LIST_HEAD_INIT(init_task.se.group_node),
    },
    .rt = {
        .run_list = LIST_HEAD_INIT(init_task.rt.run_list),
        .time_slice = RR_TIMESLICE,
    },
    .tasks = LIST_HEAD_INIT(init_task.tasks),
#ifdef CONFIG_SMP
    .pushable_tasks = PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
#endif
#ifdef CONFIG_CGROUP_SCHED
    .sched_task_group = &root_task_group,
#endif
    .ptraced = LIST_HEAD_INIT(init_task.ptraced),
    .ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
    .real_parent = &init_task,
    .parent = &init_task,
    .children = LIST_HEAD_INIT(init_task.children),
    .sibling = LIST_HEAD_INIT(init_task.sibling),
    .group_leader = &init_task,
    RCU_POINTER_INITIALIZER(real_cred, &init_cred),
    RCU_POINTER_INITIALIZER(cred, &init_cred),
    .comm = INIT_TASK_COMM,
    .thread = INIT_THREAD,
    .fs = &init_fs,
    .files = &init_files,
    .signal = &init_signals,
    .sighand = &init_sighand,
    .nsproxy = &init_nsproxy,
    .pending = {
        .list = LIST_HEAD_INIT(init_task.pending.list),
        .signal = {{0}}
    },
    .blocked = {{0}},
    .alloc_lock = __SPIN_LOCK_UNLOCKED(init_task.alloc_lock),
    .journal_info = NULL,
    INIT_CPU_TIMERS(init_task)
    .pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
    .timer_slack_ns = 50000, /* 50 usec default slack */
    .pids = {
        [PIDTYPE_PID]  = INIT_PID_LINK(PIDTYPE_PID),
        [PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),
        [PIDTYPE_SID]  = INIT_PID_LINK(PIDTYPE_SID),
    },
    .thread_group = LIST_HEAD_INIT(init_task.thread_group),
    .thread_node = LIST_HEAD_INIT(init_signals.thread_head),
#ifdef CONFIG_AUDITSYSCALL
    .loginuid = INVALID_UID,
    .sessionid = (unsigned int)-1,
#endif
#ifdef CONFIG_PERF_EVENTS
    .perf_event_mutex = __MUTEX_INITIALIZER(init_task.perf_event_mutex),
    .perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
#endif
#ifdef CONFIG_PREEMPT_RCU
    .rcu_read_lock_nesting = 0,
    .rcu_read_unlock_special.s = 0,
    .rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
    .rcu_blocked_node = NULL,
#endif
#ifdef CONFIG_TASKS_RCU
    .rcu_tasks_holdout = false,
    .rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
    .rcu_tasks_idle_cpu = -1,
#endif
#ifdef CONFIG_CPUSETS
    .mems_allowed_seq = SEQCNT_ZERO(init_task.mems_allowed_seq),
#endif
#ifdef CONFIG_RT_MUTEXES
    .pi_waiters = RB_ROOT_CACHED,
    .pi_top_task = NULL,
#endif
    INIT_PREV_CPUTIME(init_task)
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
    .vtime.seqcount = SEQCNT_ZERO(init_task.vtime_seqcount),
    .vtime.starttime = 0,
    .vtime.state = VTIME_SYS,
#endif
#ifdef CONFIG_NUMA_BALANCING
    .numa_preferred_nid = -1,
    .numa_group = NULL,
    .numa_faults = NULL,
#endif
#ifdef CONFIG_KASAN
    .kasan_depth = 1,
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
    .softirqs_enabled = 1,
#endif
#ifdef CONFIG_LOCKDEP
    .lockdep_recursion = 0,
#endif
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
    .ret_stack = NULL,
#endif
#if defined(CONFIG_TRACING) && defined(CONFIG_PREEMPT)
    .trace_recursion = 0,
#endif
#ifdef CONFIG_LIVEPATCH
    .patch_state = KLP_UNDEFINED,
#endif
#ifdef CONFIG_SECURITY
    .security = NULL,
#endif
};
EXPORT_SYMBOL(init_task);

/*
 * Initial thread structure. Alignment of this is handled by a special
 * linker map entry.
 */
#ifndef CONFIG_THREAD_INFO_IN_TASK
struct thread_info init_thread_info __init_thread_info = INIT_THREAD_INFO(init_task);
#endif

Comparing a task_struct pointer against &init_task tells you whether you have reached the very first task while walking up the tree.
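The kernel follows these links through task_struct pointers. From user space the same parent links are visible as the ppid field of /proc/<pid>/stat, so the walk toward the root can be sketched as follows. This assumes Linux with /proc mounted; parent_of and depth_of are names I made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the parent PID recorded in /proc/<pid>/stat, or -1 on error. */
pid_t parent_of(pid_t pid)
{
    char path[64];
    char line[1024];
    char state;
    int ppid;
    char *close_paren;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(line, sizeof(line), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* Format: pid (comm) state ppid ...; skip past the last ')'. */
    close_paren = strrchr(line, ')');
    if (close_paren == NULL ||
        sscanf(close_paren + 1, " %c %d", &state, &ppid) != 2)
        return -1;
    return (pid_t)ppid;
}

/* Counts the parent links followed from pid before the walk stops. */
int depth_of(pid_t pid)
{
    int depth = 0;

    /* Stop at PID 1, at PID 0 (parent outside our namespace), or on error. */
    while (pid > 1) {
        pid = parent_of(pid);
        depth++;
    }
    return depth;
}
```

For the calling process, parent_of(getpid()) should agree with getppid().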

Process Creation

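Linux creates processes with two functions: fork and exec. fork copies the current process into a child (which parts are copied or shared can be controlled), and exec then makes the child run a new program.

Copy-on-Write

fork does not actually copy the parent's memory into the child. It uses copy-on-write: as long as both sides only read, parent and child share the same pages, and a page is duplicated only when one of them writes to it. Even with copy-on-write, fork still has to duplicate the parent's page tables and allocate a new task_struct for the child.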
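The copying itself cannot be observed from user space, but its effect can: a write in the child never becomes visible in the parent. A minimal sketch of that behavior, with a variable and function name of my own choosing:

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_looking_value = 1;

/*
 * Forks a child that overwrites the variable, then returns the value
 * the parent still sees. Returns -1 if fork or waitpid fails.
 */
int parent_value_after_child_write(void)
{
    int status;
    pid_t pid;

    shared_looking_value = 1;
    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* This write gives the child its own private copy of the page. */
        shared_looking_value = 2;
        _exit(shared_looking_value == 2 ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
    return shared_looking_value;
}
```

The parent keeps seeing 1 even though the child observed 2 in its own copy.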

Forking

The user-space fork is implemented on top of the clone() system call, which takes a set of flags telling the kernel which resources parent and child should share. Besides fork(), the library calls vfork() and __clone() are implemented through clone() as well. Inside the kernel, clone() calls do_fork(), which is where new processes are actually created, so do_fork() is what we look at next.

Note that fork in kernel 4.15 differs somewhat from 2.6; for example, HAVE_COPY_THREAD_TLS has been added. I still follow the older code path here. Patch: http://lkml.iu.edu/hypermail/linux/kernel/1504.2/03324.html.

do_fork() is defined in kernel/fork.c and simply calls _do_fork, so let's go straight to _do_fork:

/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */
long _do_fork(unsigned long clone_flags,
              unsigned long stack_start,
              unsigned long stack_size,
              int __user *parent_tidptr,
              int __user *child_tidptr,
              unsigned long tls)
{
    struct task_struct *p;
    int trace = 0;
    long nr;

    ...

    p = copy_process(clone_flags, stack_start, stack_size,
                     child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
    add_latent_entropy();
    /*
     * Do this prior waking up the new thread - the thread pointer
     * might get invalid after that point, if the thread exits quickly.
     */
    if (!IS_ERR(p)) {
        struct completion vfork;
        struct pid *pid;

        trace_sched_process_fork(current, p);

        pid = get_task_pid(p, PIDTYPE_PID);
        nr = pid_vnr(pid);

        if (clone_flags & CLONE_PARENT_SETTID)
            put_user(nr, parent_tidptr);

        if (clone_flags & CLONE_VFORK) {
            p->vfork_done = &vfork;
            init_completion(&vfork);
            get_task_struct(p);
        }

        wake_up_new_task(p);

        /* forking complete and child started to run, tell ptracer */
        if (unlikely(trace))
            ptrace_event_pid(trace, pid);

        if (clone_flags & CLONE_VFORK) {
            if (!wait_for_vfork_done(p, &vfork))
                ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
        }

        put_pid(pid);
    } else {
        nr = PTR_ERR(p);
    }
    return nr;
}

_do_fork creates the new process with copy_process and then starts it with wake_up_new_task. Here is copy_process:

/*
 * This creates a new process as a copy of the old one,
 * but does not actually start it yet.
 *
 * It copies the registers, and all the appropriate
 * parts of the process environment (as per the clone
 * flags). The actual kick-off is left to the caller.
 */
static __latent_entropy struct task_struct *copy_process(
                    unsigned long clone_flags,
                    unsigned long stack_start,
                    unsigned long stack_size,
                    int __user *child_tidptr,
                    struct pid *pid,
                    int trace,
                    unsigned long tls,
                    int node)
{
    int retval;
    struct task_struct *p;

    /* ... checks on the clone flags ... */

    retval = -ENOMEM;
    p = dup_task_struct(current, node);
    if (!p)
        goto fork_out;

    /* ... */
}

copy_process does the following:

1. It calls dup_task_struct to create a kernel stack, a thread_info structure, and a task_struct for the new process. Their contents are copied from the parent, so at this point the child's descriptor is identical to the parent's.

2. It checks that the new child does not push the current user over the resource limit on the number of processes.

    retval = -EAGAIN;
    if (atomic_read(&p->real_cred->user->processes) >=
            task_rlimit(p, RLIMIT_NPROC)) {
        if (p->real_cred->user != INIT_USER &&
            !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
            goto bad_fork_free;
    }
    current->flags &= ~PF_NPROC_EXCEEDED;

    retval = copy_creds(p, clone_flags);
    if (retval < 0)
        goto bad_fork_free;

3. Various members of the new descriptor are cleared or reset. Most of them are statistics; the bulk of task_struct is left unchanged.

4. The child's state is set to TASK_NEW in sched_fork so that it cannot be scheduled yet.

5. Some flags are updated on the child, for example:

    p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
    p->flags |= PF_FORKNOEXEC;

6. A PID is allocated for the new process:

    if (pid != &init_struct_pid) {
        pid = alloc_pid(p->nsproxy->pid_ns_for_children);
        if (IS_ERR(pid)) {
            retval = PTR_ERR(pid);
            goto bad_fork_cleanup_thread;
        }
    }

7. Depending on the clone flags, it duplicates or shares everything else the child needs: open files, filesystem information, signal handlers, the process address space, and so on (error handling omitted here):

    retval = copy_semundo(clone_flags, p);
    retval = copy_files(clone_flags, p);
    retval = copy_fs(clone_flags, p);
    retval = copy_sighand(clone_flags, p);
    retval = copy_signal(clone_flags, p);
    retval = copy_mm(clone_flags, p);
    retval = copy_namespaces(clone_flags, p);
    retval = copy_io(clone_flags, p);
    retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);

8. Finally, copy_process returns a pointer to the new task.

Back in _do_fork: once the new process has been created successfully, it is woken up with wake_up_new_task and can start running:

/*
 * wake_up_new_task - wake up a newly created task for the first time.
 *
 * This function will do some initial scheduler statistics housekeeping
 * that must be done for every newly created context, then puts the task
 * on the runqueue and wakes it.
 */
void wake_up_new_task(struct task_struct *p)
{
    struct rq_flags rf;
    struct rq *rq;

    raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
    p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
    /*
     * Fork balancing, do it here and not earlier because:
     *  - cpus_allowed can change in the fork path
     *  - any previously selected CPU might disappear through hotplug
     *
     * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
     * as we're not fully set-up yet.
     */
    __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
    rq = __task_rq_lock(p, &rf);
    update_rq_clock(rq);
    post_init_entity_util_avg(&p->se);

    activate_task(rq, p, ENQUEUE_NOCLOCK);
    p->on_rq = TASK_ON_RQ_QUEUED;
    trace_sched_wakeup_new(p);
    check_preempt_curr(rq, p, WF_FORK);
#ifdef CONFIG_SMP
    if (p->sched_class->task_woken) {
        /*
         * Nothing relies on rq->lock after this, so its fine to
         * drop it.
         */
        rq_unpin_lock(rq, &rf);
        p->sched_class->task_woken(rq, p);
        rq_repin_lock(rq, &rf);
    }
#endif
    task_rq_unlock(rq, p, &rf);
}

activate_task puts the new process on a run queue, ready to be scheduled.

vfork differs from fork in that the parent's page tables are not copied, and the parent is blocked until the child calls exec or exits.

The Linux Implementation of Threads

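The Linux kernel has no separate concept of a thread. A thread is simply a process: it uses the same data structure and is scheduled the same way. The only difference is that a thread shares many of its resources with other processes. This approach is simple and elegant.

Creating Threads

Threads are created much like ordinary processes, through clone(); only the flags differ. For example, creating a thread might use flags like these:

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);

Parent and child then share the same address space, filesystem information, open files, and signal handlers. The clone flags listed in the book are: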

CLONE_FILES            Parent and child share open files.
CLONE_FS               Parent and child share filesystem information.
CLONE_IDLETASK         Set PID to zero (used only by the idle tasks).
CLONE_NEWNS            Create a new namespace for the child.
CLONE_PARENT           Child is to have same parent as its parent.
CLONE_PTRACE           Continue tracing child.
CLONE_SETTID           Write the TID back to user-space.
CLONE_SETTLS           Create a new TLS (thread-local storage) for the child.
CLONE_SIGHAND          Parent and child share signal handlers and blocked signals.
CLONE_SYSVSEM          Parent and child share System V SEM_UNDO semantics.
CLONE_THREAD           Parent and child are in the same thread group.
CLONE_VFORK            vfork() was used and the parent will sleep until the child wakes it.
CLONE_UNTRACED         Do not let the tracing process force CLONE_PTRACE on the child.
CLONE_STOP             Start process in the TASK_STOPPED state.
CLONE_CHILD_CLEARTID   Clear the TID in the child.
CLONE_CHILD_SETTID     Set the TID in the child.
CLONE_PARENT_SETTID    Set the TID in the parent.
CLONE_VM               Parent and child share address space.
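The effect of CLONE_VM can be seen from user space with the glibc clone() wrapper, which takes an entry function and a stack rather than the two-argument form shown earlier. This is a sketch under several assumptions: Linux with glibc, a stack that grows down, and names (shared_counter, child_fn, counter_after_clone_vm) that I made up. With CLONE_VM the child writes into the parent's memory, in contrast to a plain fork.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile int shared_counter;

/* Entry point of the child; runs on the stack passed to clone(). */
static int child_fn(void *arg)
{
    (void)arg;
    shared_counter = 42;
    return 0;
}

/*
 * Creates a child that shares our address space and returns the value
 * of shared_counter seen by the parent afterwards (-1 on error).
 */
int counter_after_clone_vm(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    pid_t pid;

    if (stack == NULL)
        return -1;
    shared_counter = 0;

    /* Assumes a downward-growing stack, so pass its highest address. */
    pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    /* SIGCHLD as the exit signal lets the ordinary waitpid reap the child. */
    if (waitpid(pid, NULL, 0) != pid)
        return -1; /* stack deliberately not freed: the child may still use it */
    free(stack);
    return shared_counter;
}
```

Real thread libraries add more flags (CLONE_THREAD, CLONE_SIGHAND, and so on); this example keeps only the address-space sharing.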

Kernel Threads

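A kernel thread is also just a process. What distinguishes it from a user process is that it has no address space of its own: the mm pointer in its task_struct is NULL, whereas for a user-space process mm points to the mm_struct describing that process's address space (its VMAs). Without an mm, a kernel thread runs entirely in kernel space and never switches to user space. Like any other process, though, it is schedulable and preemptible.

Linux has many kernel threads, such as the flush tasks and ksoftirqd; ps -ef shows the ones currently present on a system. A kernel thread can only be created by another kernel thread. The kernel handles this by forking every new kernel thread from the kthreadd kernel thread.

The interface for creating kernel threads is declared in <linux/kthread.h>. There are two entry points, kthread_create and kthread_run. A thread created with kthread_run starts running automatically, while one created with kthread_create does not run until wake_up_process is called on it. Here are the definitions: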

struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
                                           void *data,
                                           int node,
                                           const char namefmt[], ...);

/**
 * kthread_create - create a kthread on the current node
 * @threadfn: the function to run in the thread
 * @data: data pointer for @threadfn()
 * @namefmt: printf-style format string for the thread name
 * @arg...: arguments for @namefmt.
 *
 * This macro will create a kthread on the current node, leaving it in
 * the stopped state.  This is just a helper for kthread_create_on_node();
 * see the documentation there for more details.
 */
#define kthread_create(threadfn, data, namefmt, arg...) \
    kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)

kthread_create is a macro that calls kthread_create_on_node directly. threadfn is the thread's entry function, data is the argument passed to threadfn, node is the NUMA node (it can usually be ignored), and namefmt is a printf-style format for the thread's name. Now kthread_run:

/**
 * kthread_run - create and wake a thread.
 * @threadfn: the function to run until signal_pending(current).
 * @data: data ptr for @threadfn.
 * @namefmt: printf-style name for the thread.
 *
 * Description: Convenient wrapper for kthread_create() followed by
 * wake_up_process().  Returns the kthread or ERR_PTR(-ENOMEM).
 */
#define kthread_run(threadfn, data, namefmt, ...) \
({ \
    struct task_struct *__k \
        = kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
    if (!IS_ERR(__k)) \
        wake_up_process(__k); \
    __k; \
})

So kthread_run merely adds a wake_up_process call on top of kthread_create. wake_up_process does not run the kthread immediately either; it puts the task on a run queue to wait for the scheduler.

A kernel thread ends either by calling do_exit itself, or when other kernel code asks it to stop with kthread_stop.

Process Termination

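When a process terminates, the kernel has to release the resources the process holds and notify its parent. Usually a process ends itself voluntarily, by calling the exit system call (for user programs, the C toolchain arranges a call to exit after main returns) or by returning from some routine (a kernel thread returning from threadfn, for example). A process can also be terminated involuntarily, for instance when it receives a signal it cannot handle or ignore (such as SIGKILL) or triggers an exception it cannot handle (such as a segmentation fault). Either way, do_exit ends up doing the cleanup. do_exit is defined in kernel/exit.c with this prototype: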

void __noreturn do_exit(long code)
{
...
}

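do_exit mainly does the following:

1. It calls exit_signals, which sets the PF_EXITING flag in the task's flags.

2. The book says del_timer_sync is called here to remove the process's kernel timers; I did not find that code in kernel 4.15.

3. It calls acct_update_integrals to write out accounting information.

4. It calls exit_mm to release the mm_struct held by the process; if nobody else is sharing it, the kernel destroys it.

5. It calls exit_sem. If the process is queued waiting for an IPC semaphore, it is dequeued here.

6. It calls exit_files and exit_fs to drop the references to the file descriptors and filesystem data. If a reference count reaches zero, the object is freed.

7. It stores the exit code (tsk->exit_code = code;), from which the parent can later learn why the child exited.

8. It calls exit_notify, which notifies the parent and reparents the task's children to a suitable new parent (another thread in the thread group, or the init process), and then sets exit_state in the task_struct to EXIT_ZOMBIE.

9. Finally, do_task_dead calls the scheduler and never returns, because the current process no longer exists as a runnable task.

After do_exit completes, the only memory the process still occupies is its kernel stack, its thread_info structure, and its task_struct. They remain solely to pass information to the parent. Once the parent has retrieved that information, or has indicated that it is not interested, this remaining memory is freed as well.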
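The exit code recorded by do_exit is what the parent later decodes from the wait status. The sketch below shows the two cases mentioned above, a voluntary exit and death by signal. The function name and the -1000 error value are arbitrary choices of mine.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child that either exits with code 5 or kills itself with
 * SIGKILL. Returns the exit code (>= 0) for a normal exit, the negated
 * signal number for death by signal, or -1000 on error.
 */
int how_child_ended(int die_by_signal)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1000;
    if (pid == 0) {
        if (die_by_signal)
            raise(SIGKILL); /* cannot be caught or ignored */
        _exit(5);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1000;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    return -1000;
}
```

Both paths end in do_exit inside the kernel; only the encoded status that the parent sees differs.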

Removing the Process Descriptor

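As noted above, after do_exit a few structures are kept around so that the parent can collect information. How does the parent get it? Through the wait4 system call. After creating a child, the parent calls wait4 (usually via wait or waitpid) to wait for it. The call blocks until one of the children exits and then returns that child's PID and exit code. After wait4, the child's remaining structures are freed for good, which is done by release_task. release_task does the following:

1. It calls __exit_signal, which calls __unhash_process, which in turn calls detach_pid to remove the process from the pidhash and from the task list.

2. __exit_signal also releases any remaining resources and finalizes statistics and bookkeeping.

3. If the exiting task is a non-leader member of a thread group and the group leader is a zombie, release_task notifies the zombie leader's parent.

4. Finally, release_task schedules delayed_put_task_struct (through call_rcu), which calls put_task_struct and ultimately frees the memory held by the task_struct.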

The Dilemma of the Parentless Task

If a process exits before its children do, those children must be given a new parent. Otherwise, once they exit, nobody would ever wait for them and they would remain zombies forever. As mentioned for do_exit, exit_notify finds a suitable new parent for the children. The call chain is do_exit -> exit_notify -> forget_original_parent -> find_new_reaper:

/*
 * When we die, we re-parent all our children, and try to:
 * 1. give them to another thread in our thread group, if such a member exists
 * 2. give it to the first ancestor process which prctl'd itself as a
 *    child_subreaper for its children (like a service manager)
 * 3. give it to the init process (PID 1) in our pid namespace
 */
static struct task_struct *find_new_reaper(struct task_struct *father,
                                           struct task_struct *child_reaper)
{
    struct task_struct *thread, *reaper;

    thread = find_alive_thread(father);
    if (thread)
        return thread;

    if (father->signal->has_child_subreaper) {
        unsigned int ns_level = task_pid(father)->level;
        /*
         * Find the first ->is_child_subreaper ancestor in our pid_ns.
         * We can't check reaper != child_reaper to ensure we do not
         * cross the namespaces, the exiting parent could be injected
         * by setns() + fork().
         * We check pid->level, this is slightly more efficient than
         * task_active_pid_ns(reaper) != task_active_pid_ns(father).
         */
        for (reaper = father->real_parent;
             task_pid(reaper)->level == ns_level;
             reaper = reaper->real_parent) {
            if (reaper == &init_task)
                break;
            if (!reaper->signal->is_child_subreaper)
                continue;
            thread = find_alive_thread(reaper);
            if (thread)
                return thread;
        }
    }

    return child_reaper;
}

This code picks the new parent: another live thread in the exiting task's thread group if there is one, otherwise the nearest ancestor that has marked itself as a child subreaper, otherwise the init process of the PID namespace. Once the function returns, every child is reparented to the task it found:

    list_for_each_entry(p, &father->children, sibling) {
        for_each_thread(p, t) {
            t->real_parent = reaper;
            BUG_ON((!t->ptrace) != (t->parent == father));
            if (likely(!t->ptrace))
                t->parent = t->real_parent;
            if (t->pdeath_signal)
                group_send_sig_info(t->pdeath_signal,
                                    SEND_SIG_NOINFO, t);
        }
        /*
         * If this is a threaded reparent there is no need to
         * notify anyone anything has happened.
         */
        if (!same_thread_group(reaper, father))
            reparent_leader(father, p, dead);
    }

The t->ptrace check is there because of ptrace. While a task is being traced, for example by an attached gdb, its parent pointer temporarily points to the tracer and only real_parent points to its actual parent. For traced children the code above therefore updates real_parent and leaves parent alone.
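Rule 2 in the find_new_reaper comment can be tried from user space with prctl(PR_SET_CHILD_SUBREAPER). In the sketch below the calling process marks itself as a subreaper, then creates a child and a grandchild; when the child exits, the orphaned grandchild should be reparented to the caller rather than to init. It assumes a Linux kernel that supports PR_SET_CHILD_SUBREAPER, and the function name is my own.

```c
#include <assert.h>
#include <sched.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the PID that an orphaned grandchild reports as its new parent,
 * or -1 on error. Temporarily marks the caller as a child subreaper.
 */
pid_t adopted_parent_of_orphan(void)
{
    int fds[2];
    pid_t child;
    pid_t new_parent = -1;

    if (prctl(PR_SET_CHILD_SUBREAPER, 1UL, 0UL, 0UL, 0UL) == -1)
        return -1;
    if (pipe(fds) == -1)
        return -1;

    child = fork();
    if (child == -1)
        return -1;
    if (child == 0) {
        pid_t original_parent = getpid(); /* the grandchild's first parent */
        pid_t grandchild = fork();

        if (grandchild == 0) {
            pid_t seen;

            close(fds[0]);
            /* Wait until the kernel has reparented us. */
            while (getppid() == original_parent)
                sched_yield();
            seen = getppid();
            if (write(fds[1], &seen, sizeof(seen)) != (ssize_t)sizeof(seen))
                _exit(1);
            _exit(0);
        }
        _exit(0); /* the middle process dies and orphans the grandchild */
    }

    close(fds[1]);
    waitpid(child, NULL, 0); /* reap the middle process */
    if (read(fds[0], &new_parent, sizeof(new_parent)) != (ssize_t)sizeof(new_parent))
        new_parent = -1;
    close(fds[0]);

    while (wait(NULL) > 0)   /* as the adoptive parent, reap the grandchild */
        ;
    prctl(PR_SET_CHILD_SUBREAPER, 0UL, 0UL, 0UL, 0UL);
    return new_parent;
}
```

Without the prctl call, the grandchild would instead be handed to the init process of the PID namespace (rule 3).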

Process Management [LKD 03]