调度域（Scheduling Domain）

Scheduling Domains 引入的背景

Scheduling Domains 是现代硬件技术尤其是多 CPU 多核技术发展的产物。现在，一个复杂的高端系统由上到下可以这样构成：

它是一个 NUMA 架构的系统，系统中的每个 Node 访问系统中不同区域的内存有不同的速度。
同时它又是一个 SMP 系统。由多个物理 CPU(Physical Package) 构成。这些物理 CPU 共享系统中所有的内存。但都有自己独立的 Cache 。
每个物理 CPU 又由多个核 (Core) 构成，即 Multi-core 技术或者叫 Chip-level Multi processor(CMP) 。这些核都被集成在一块 die 里面。一般有自己独立的 L1 Cache，但可能共享 L2 Cache 。
每个核中又通过 SMT 之类的技术实现多个硬件线程，或者叫 Virtual CPU( 比如 Intel 的 Hyper-threading 技术 ) 。这些硬件线程，逻辑上看是就是一个 CPU 。它们之间几乎所有的东西都共享。包括 L1 Cache，甚至是逻辑运算单元 (ALU) 以及 Power 。

在上述系统中，最小的执行单元是逻辑 CPU，进程的调度执行也是相对于逻辑 CPU 的。因此，后文皆简称逻辑 CPU 为 CPU，是物理 CPU 时会特别说明。

在这样复杂的系统，调度器要解决的一个首要问题就是如何发挥这么多 CPU 的性能，使得负载均衡。不存某些 CPU 一直很忙，进程在排队等待运行，而某些 CPU 却是处于空闲状态。

但是在这些 CPU 之间进行 Load Balance 是有代价的，比如对处于两个不同物理 CPU 的进程之间进行负载平衡的话，将会使得 Cache 失效。造成效率的下降。而且过多的 Load Balance 会大量占用 CPU 资源。

还有一个要考虑的就是功耗 (Power) 的问题。一个物理 CPU 中的两个 Virtual CPU 各执行一个进程，显然比两个物理 CPU 中的 Virtual CPU 各执行一个进程节省功耗。因为硬件上可以实现一个物理 CPU 不用时关掉它以节省功耗。

为了解决上述的这些问题，内核开发人员 Nick Piggin 等人在 Linux 2.6 中引入基于 Scheduling Domains 的解决方案。

回页首

Scheduling Domains 原理

每个 Scheduling Domain 其实就是具有相同属性的一组 cpu 的集合。并且跟据 Hyper-threading, Multi-core, SMP, NUMA architectures 这样的系统结构划分成不同的级别。不同级之间通过指针链接在一起，从而形成一种的树状的关系。如下图所示：

图 1. Scheduling Domains 原理

负载平衡就是针对 Scheduling domain 的。从叶节点往上遍历。直到所有的 domain 中的负载都是平衡的。当然对不同的 domain 会有不同的策略识别是否负载不平衡，以及不同的调度策略。通过这样的方式，从而很好的发挥众多 cpu 的效率。

回页首

Scheduling Domains 结构

基于 Scheduling Domains 的调度器引入了一组新的数据结构。下面先讲一下两个主要的数据结构：

struct sched_domain: 代表一个 Scheduling Domain，也就是一个 CPU 集合，这个集合里所有的 CPU 都具有相同的属性和调度策略。 Load Balance 是针对每个 domain 里的 CPU 进行的。这里要注意 Scheduling Domains 是分级的。像上节所讲的复杂系统就分为 Allnuma_domain,Numa_domain, Phy_domain, Core_domain, Smt_domain(Cpu_domain) 五个等级。

表 1 列出了这个结构的主要字段。

struct sched_group: 每个 Scheduling domain 都有一个或多个 CPU group，每个 group 都被 domain 当做一个单独的单元来对待。 Load Balance 就是在这些 CPU group 之间的 CPU 进行的。表 2 列出了它的主要字段。

表 1. sched_domain 数据结构

类型	名称	描述
struct domain *	Parent	当前 domain 的父 domain
struct domain *	Child	当前 domain 的子 domain
cpumask_t	Span	当前 domain 中的所有 cpu 位图
unsigned long	min_interval	最小的 load balance 间隔
unsigned long	max_interval	最大的 load balance 间隔
unsigned int	busy_factor	Busy 时延迟进行 balance 的系数
unsigned int	busy_idx
unsigned int	idle_idx
unsigned int	newidle_idx
unsigned int	wake_idx
unsigned int	forkexec_idx
int	flags	当前 domain 的一些状态标记
enum sched_domain_level	level	当前 domain 的级别
unsigned long	last_balance	当前 domain 最近一次进行 balance 时的时间 (jiffies 为单位 )
unsigned int	balance_interval	进行 balance 的时间间隔（ms 为单位）
unsigned int	nr_balance_failed	balance 失败的次数

表 2. sched_group 数据结构

类型	名称	描述
struct sched_group *	next	下一个 group 的指针
cpumask_t	cpumask	当前 group 有哪些 CPU
unsigned int	__cpu_power	当前 group 的 CPU power
u32	reciprocal_cpu_power	CPU power 的倒数

下面以我们之前提到的 NUMA，SMP，Multcore，SMT 结合的复杂系统为例来具体解释一下 Scheduling domains 和 CPU groups 之间的关系。

这里为了简化讨论，假设每个物理 CPU 只有两个核，每个核只有两个逻辑 CPU 。

如下图所示：

图 2. 物理 CPU 示意图

当系统启动时，会分别把每个核的两个逻辑 CPU 放入一个 Scheduling Domain，

这个级别的 domain 叫做 cpu_domain 。其中每个 domain 包括两个 CPU groups，每个 CPU group 只有一个逻辑 CPU 。

如下图所示：

图 3. 逻辑 CPU

同时每个物理 CPU 的两个核被放入一个高一级的 Scheduling Domain 。这个 domain 命名为 core_domain 。其中每个 domain 包括两个 CPU groups，每个 CPU group 有两个逻辑 CPU 。

如下图所示：

图 4. CPU group

对于我们前述的复杂系统，再往上的话依次还有 phys_domain( 物理 CPU 放入的 domain) ；

numa_domain(NUMA 中的 16 个 nodes 放入的 domain) ； allnode_domain( 所有 NUMA 中的 node 放入的 domain) 。从而将所有 CPU 组织成一个基于 domain 的层次结构。

回页首

Scheduling Domains 实现

对每个 Scheduling Domain 中的 CPU groups 之间的 CPU 进行的。

每个 Scheduling Domain 都包含一些重要的信息用来决定在这级 domain 的 CPU groups 之间如何进行 Load Balance 。

典型的一些原则如下：
1. 在 cpu_domain 级：因为是共享 cache，cpu_power 也基本是共用的。所以可以在这个 domain 级的 cpu groups 之间可以不受限制的进行 load balance 。
2. 在 core_domain 级：可能会共享 L2 级 cache, 具体跟实现相关了。因此这一级的 balance 相对没那么频繁。要 core 之间负载的不平衡达到一定程度才进行 balance 。
3. 在 phys_domain 级：在这一个 domain 级，如果进行 balance 。则每个物理 CPU 上的 Cache 会失效一段时间，并且要考虑 cpu_power，因为物理 CPU 级的 power 一般是被数关系。比如两个物理 CPU 是 power*2，而不像 core, 或逻辑 CPU，只是 power*1.1 这样的关系。
4. 在 numa_domain 级：这一级的开销最大，一般很少进行 balance 。

基本实现

Linux 主要通过以下几个方面来实现基于 Scheduling domains 的 Load Balance 。

周期性的 load balance

每次时钟中断到来，如果发现当前 cpu 的运行队列需要进行下一次的 balance 的时间已

经到了，则触发 SCHED_SOFTIRQ 软中断。

触发软中断

if (time_after_eq(jiffies, rq->next_balance))

raise_softirq(SCHED_SOFTIRQ);

软中断的执行函数是 run_rebalance_domains(), 它主要是再调 rebalance_domains() 来实现。

清单 1. run_rebalance_domains()

static void run_rebalance_domains(struct softirq_action *h) { int this_cpu = smp_processor_id(); struct rq *this_rq = cpu_rq(this_cpu); enum cpu_idle_type idle = this_rq->idle_at_tick ? CPU_IDLE : CPU_NOT_IDLE; rebalance_domains(this_cpu, idle); #ifdef CONFIG_NO_HZ #endif }

rebalance_domains()，根据 domain 的级别，从下往上扫描每一级 Scheduling Domain 。如果发现这个 domain 的 balance 之间的间隔时间到了，则进一步进行 task 的移动。不同级别的 domain 是会有不同的间隔时间的。而且级别越高值越大，因为移动 task 的代价越大。

清单 2. rebalance_domains()

static void rebalance_domains(int cpu, enum cpu_idle_type idle) { int balance = 1; struct rq *rq = cpu_rq(cpu); unsigned long interval; struct sched_domain *sd; /* Earliest time when we have to do rebalance again */ unsigned long next_balance = jiffies + 60*HZ; int update_next_balance = 0; int need_serialize; cpumask_t tmp; for_each_domain(cpu, sd) { if (!(sd->flags & SD_LOAD_BALANCE)) continue; interval = sd->balance_interval; if (idle != CPU_IDLE) interval *= sd->busy_factor; /* scale ms to jiffies */ interval = msecs_to_jiffies(interval); if (unlikely(!interval)) interval = 1; if (interval > HZ*NR_CPUS/10) interval = HZ*NR_CPUS/10; need_serialize = sd->flags & SD_SERIALIZE; if (need_serialize) { if (!spin_trylock(&balancing)) goto out; } if (time_after_eq(jiffies, sd->last_balance + interval)) { if (load_balance(cpu, rq, sd, idle, &balance, &tmp)) { /* * We've pulled tasks over so either we're no * longer idle, or one of our SMT siblings is * not idle. */ idle = CPU_NOT_IDLE; } sd->last_balance = jiffies; } if (need_serialize) spin_unlock(&balancing); out: if (time_after(next_balance, sd->last_balance + interval)) { next_balance = sd->last_balance + interval; update_next_balance = 1; } /* * Stop the load balance at this level. There is another * CPU in our sched group which is doing load balancing more * actively. */ if (!balance) break; } /* * next_balance will be updated only when there is a need. * When the cpu is attached to null domain for ex, it will not be * updated. */ if (likely(update_next_balance)) rq->next_balance = next_balance; }

针对 CPU IDLE 的处理
如果一个逻辑 CPU 进入了 IDLE 状态，并且它所属的 domain 设置了 SD_BALANCE_NEWIDLE，则马上就会进行 balance，把忙的 CPU 上的进程 move 过来，从而最大的发挥多 CPU 的优势。
针对 fork(), exec() 的处理
当一个进行调用 exec() 执行时，本来就是要加载一个新进程，缓存本来就会失效。所以，move 到哪里都可以。因此找设置了 SD_BALANCE_EXEC 标记的 domain 。然后把进程移动到那个 domain 中最闲的 CPU group 的 CPU 上。

fork() 时也进行类似的处理。
其它因素比如针对 cpu_power 的处理

调度域（Scheduling Domain）相关推荐

WebLogic11g-创建域（Domain）及基本配置
2019独角兽企业重金招聘Python工程师标准>>> 最近看到经常有人提问weblogic相关问题,所以闲暇之际写几篇博文(基于weblogic11),仅供大家参考. 具体webl ...
题解 P2949 【[USACO09OPEN]工作调度Work Scheduling】
P2949 [USACO09OPEN]工作调度Work Scheduling 题目标签是单调队列+dp,萌新太弱不会明显的一道贪心题,考虑排序先做截止时间早的,但我们发现后面可能会出现价值更高却没有 ...
RH442-7 调度域和CPU集之一
调度域(sched domains)和CPU集(cpusets) 一.内核时间(kernel time) 1. 时钟类型 l RTC (Real time clock):真实时间(BIOS硬件时 ...
调度框架 Scheduling Framework 实践
调度框架 Scheduling Framework 架构设计工作流程图可以查看相关文档参见 sig-scheduling 前提调度框架定义了一组扩展点,用户可以实现扩展点定义的接口来定义自己的调 ...
实现不同域（Domain）之间的数据交换(转)
引言前些天在对公司原有的 web 应用进行改版时遇到一个问题,当时需要从原有的应用中提取出一部分,用一个更为通用的来进行替换,并且仍然保留原有的应用接口.原有的应用属于 news.mycompany ...
什么是windows的域（Domain）？
原文链接:https://www.howtogeek.com/194069/what-is-a-windows-domain-and-how-does-it-affect-my-pc/ Windows ...
域账户登录本地计算机用户名,如何实现域账户domain/user登陆时自动添加到本地计算机poweruser用户组...
关于类似本主题的文章还有,请参考: 本文内容来自Microsoft News Group. Q:客户端计算机是RIS安装windows xp, 要实现域内用户登陆时自动添加到本地poweruser. ...
加了域的计算机无法更改时间,加入域用户(DOMAIN USER)后如何修改系统时间
如何给DOMAIN USER 用户修改客户机本地时间的权限? 回答: 您好,您可以使用组策略来定义. 首先你打开组策略编辑器(gpedit.msc)选择计算机配置–Windows设置–本地策略–用户权 ...
Nomad技术手册：调度（Scheduling）
调度是Nomad的核心功能之一.它是将任务从作业分配给客户机的过程.此过程必须遵循作业中声明的约束,并优化资源利用率.本文档记录了Nomad中调度工作方式的详细信息,以帮助用户和开发人员构建一个心智模 ...
matlab车辆贪心作业调度,贪心算法-区间调度-Interval Scheduling
什么是贪心算法呢? 贪心算法可以认为是动态规划算法的一个特例,相比动态规划,使用贪心算法需要满足更多的条件(贪心选择性质),但是效率比动态规划要高. 比如说一个算法问题使用暴力解法需要指数级时间,如果 ...

调度域（Scheduling Domain）

调度域（Scheduling Domain）相关推荐

最新文章

热门文章