The period and burst concepts in cgroup CFS scheduling

Basics: Period

Let's start with an example:

If period is 500ms and quota is 250ms, the group will get
0.5 CPU worth of runtime every 500ms.
# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
# echo 500000 > cpu.cfs_period_us /* period = 500ms */

This raises a question: what would change if the settings above were replaced with the following?

If period is 2000ms and quota is 1000ms, the group will get
0.5 CPU worth of runtime every 2000ms.
# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
# echo 2000000 > cpu.cfs_period_us /* period = 2000ms */

Both settings give the group 0.5 CPU worth of resources. So what is the difference between a period of 500ms and a period of 2000ms?

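By design, within each period the group can get at most quota worth of CPU time. If it uses up its quota before its work is done, it is throttled and has to wait until the next period to continue.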
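To make the two settings concrete, here is a minimal sketch of a group that always wants CPU. It is my own simplified model, not kernel code: one CPU, the group runs for quota at the start of every period and is throttled for the rest. The function name cpu_bound_profile is hypothetical.

```python
def cpu_bound_profile(quota_ms, period_ms, window_ms):
    """Simplified model of a permanently CPU-bound group on one CPU.

    Returns (cpu_share, longest_stall_ms) over window_ms.
    Assumes window_ms is a multiple of period_ms and quota_ms <= period_ms.
    """
    periods = window_ms // period_ms
    ran_ms = periods * quota_ms        # the group runs for `quota` in every period
    stall_ms = period_ms - quota_ms    # then it is throttled until the period ends
    return ran_ms / window_ms, stall_ms


for quota, period in [(250, 500), (1000, 2000)]:
    share, stall = cpu_bound_profile(quota, period, window_ms=10_000)
    print(f"quota={quota}ms period={period}ms -> share={share}, longest stall={stall}ms")
```

In this model both settings yield the same 0.5 CPU share; what differs is the length of each throttled gap, 250ms versus 1000ms.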

This article [ref] says:

A larger period gives better throughput and worse RT
- Larger periods will improve throughput at the expense of latency, since the scheduler will be able to sustain a cpu-bound workload for longer.
A smaller period gives better RT and worse throughput

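I do not fully understand these statements.

The RT part is somewhat easier to make sense of:

"Better RT" here does not mean a better mean. It means a smaller variance, so users perceive less fluctuation.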

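A fun analogy: suppose you use cgroup to limit an HD movie player. With a very small period, the stutter you notice is frame by frame. With a very large period, the movie plays smoothly for a few seconds, freezes for a few seconds, then plays smoothly again, and so on. Either way, you spend the same total time finishing a "stuttering" movie.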

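But what does "better throughput" mean? Throughput is the amount of work processed per unit time, and "better" normally means more work gets done. Yet the analogy above shows that however the period changes, the average amount of work processed per unit time stays the same. My understanding is as follows:

"Better throughput" rests on a premise: requests do not arrive evenly (arrivals may follow a normal distribution, a Poisson distribution, and so on), so the system has busy and idle phases. With a larger period, the system can tolerate traffic bursts better.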

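The tension between the whole and the part

As noted above, a smaller period gives better RT, but that refers to overall RT. For the RT of a single task, the opposite holds:

Suppose a java program needs 300ms of execution to complete one request.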

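If quota = 250ms and period = 500ms, the RT of this java program is 250 + 250 + 50 = 550ms
- Explanation: the program needs two periods to finish. In the first period it runs for 250ms and is then suspended for 250ms. When the next period starts, it runs for the remaining 50ms.
If quota = 1000ms and period = 2000ms, the RT of this java program is 300ms
- Explanation: the program finishes within a single period.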
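The arithmetic above can be generalized into a small sketch. It assumes a deliberately simple model: the request starts exactly at a period boundary with a full quota, runs on one CPU, and nothing else in the group consumes quota. The function name response_time_ms is hypothetical.

```python
def response_time_ms(work_ms, quota_ms, period_ms):
    """Wall-clock time for one CPU-bound request that needs work_ms of CPU.

    Simplified model: the request starts at a period boundary with a full
    quota, runs on one CPU, and is the only consumer in the group.
    """
    if work_ms == 0:
        return 0
    full_periods, remainder = divmod(work_ms, quota_ms)
    if remainder == 0:
        # The last quota is used up exactly; there is no trailing throttle wait.
        return (full_periods - 1) * period_ms + quota_ms
    # Each exhausted quota costs a whole period, then the remainder runs.
    return full_periods * period_ms + remainder


print(response_time_ms(300, quota_ms=250, period_ms=500))
print(response_time_ms(300, quota_ms=1000, period_ms=2000))
```

Under these assumptions the two calls reproduce the 550ms and 300ms figures from the example.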

Suppose most tasks in a system finish within one quota, while a few need more than one quota. How can we minimize the impact of scheduling on the latter?

Improvement: burst

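Burst can be explained with a leave analogy (from the Alibaba Cloud documentation): you get 10 days of annual leave per year. A family emergency requires 30 days off, so you borrow the annual leave of next year and the year after, and then take no leave in those two years. How much leave can you borrow at most? That is controlled by cpu.cfs_burst_us.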

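In the web domain, traffic is mostly steady and even, but occasionally there are bursts of expensive requests. Under a quota, the RT of such requests often becomes much longer.

To tackle the RT jitter of bursty long-tail traffic, one has to understand how CFS Bandwidth Control is implemented. Its documentation introduces a concept called burst.

CPU time is divided into periods according to cpu.cfs_period_us. When a group sees a traffic burst, it is allowed to exceed its quota in the current period and then make up for the excess in later periods. In the long run, the overall limit still holds.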

This feature borrows time now against our future underrun, at the cost of increased interference against the other system users. All nicely bounded.

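So how many later periods does it take to make up for the excess? And what if the group keeps exceeding its quota in those periods too? In relatively new kernels, cpu.cfs_burst_us caps the burst.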
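As far as I can tell from the kernel documentation listed at the end, cpu.cfs_burst_us is a cap on accumulated runtime: quota left unused in one period can be carried forward, up to the burst value. The sketch below models that refill rule in a simplified way (one CPU, per-period granularity). It is an illustration of my reading, not kernel code, and the function name simulate_burst is hypothetical.

```python
def simulate_burst(demands_ms, quota_ms, burst_ms, period_ms):
    """Per-period CPU time granted to a group under a simplified refill rule.

    demands_ms[i] is the CPU time the group wants in period i (one CPU).
    Assumed rule at each period start:
        runtime = min(leftover + quota, quota + burst)
    """
    runtime = 0
    granted = []
    for demand in demands_ms:
        runtime = min(runtime + quota_ms, quota_ms + burst_ms)
        # A single CPU cannot deliver more CPU time than the period's wall time.
        used = min(demand, runtime, period_ms)
        runtime -= used
        granted.append(used)
    return granted


# Three quiet periods followed by sustained high demand.
demands = [100, 100, 100, 500, 500, 500]
print(simulate_burst(demands, quota_ms=250, burst_ms=0, period_ms=500))
print(simulate_burst(demands, quota_ms=250, burst_ms=200, period_ms=500))
```

In this model, with burst_ms=200 the first busy period is granted 450ms (quota plus the 200ms saved earlier), after which the group falls back to 250ms per period. Sustained overuse is therefore throttled just as it is without burst.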

*** Burst feature ***

This feature borrows time now against our future underrun, at the cost of increased interference against the other system users. All nicely bounded.

Traditional (UP-EDF) bandwidth control is something like:

(U = Sum u_i) <= 1

This guarantees both that every deadline is met and that the system is stable. After all, if U were > 1, then for every second of walltime, we’d have to run more than a second of program time, and obviously miss our deadline, but the next deadline will be further out still, there is never time to catch up, unbounded fail.

The burst feature observes that a workload doesn’t always execute the full quota; this enables one to describe u_i as a statistical distribution.

For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100) (the traditional Worst Case Execution Time, WCET). This effectively allows u to be smaller, increasing the efficiency (we can pack more tasks in the system), but at the cost of missing deadlines when all the odds line up. However, it does maintain stability, since every overrun must be paired with an underrun as long as our x is above the average.

That is, suppose we have 2 tasks, both specify a p(95) value, then we have a p(95)*p(95) = 90.25% chance both tasks are within their quota and everything is good. At the same time we have a p(5)*p(5) = 0.25% chance both tasks will exceed their quota at the same time (guaranteed deadline fail). Somewhere in between there’s a threshold where one exceeds and the other doesn’t underrun enough to compensate; this depends on the specific CDFs.

At the same time, we can say that the worst case deadline miss, will be Sum e_i; that is, there is a bounded tardiness (under the assumption that x+e is indeed WCET).

The interference when using burst is valued by the possibilities for missing the deadline and the average WCET. Test results showed that when there are many cgroups or CPU is under utilized, the interference is limited.
More details are shown in: https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
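The two-task probabilities in the quoted text are easy to check. The sketch below assumes the tasks overrun independently of each other, which is what the multiplication in the quote implies. The function name overrun_odds is hypothetical.

```python
def overrun_odds(n_tasks, p_within=0.95):
    """Chance that all, or none, of n independent tasks stay within quota."""
    all_within = p_within ** n_tasks          # every task stays under its p(95) value
    all_exceed = (1 - p_within) ** n_tasks    # every task overruns at the same time
    return all_within, all_exceed


all_within, all_exceed = overrun_odds(2)
print(f"all within quota: {all_within:.2%}, all exceed quota: {all_exceed:.2%}")
```

For two tasks this gives the 90.25% and 0.25% figures from the quote. Raising n_tasks shows how quickly the chance of a simultaneous overrun shrinks as more groups are packed together.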

Additional material

The burst feature is still evolving, as a comparison of these two documents shows:
https://www.kernel.org/doc/html/v5.13/scheduler/sched-bwc.html
https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html
Alibaba's regular kernel (3.10.0-327.ali2010.rc7.alios7.x86_64) does not have cpu.cfs_burst_us yet.
Its burst capability is also fairly limited: it can only borrow.
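A quick way to see whether a given kernel exposes the burst knob is to look for the control file in a cgroup directory. This is a minimal sketch: the helper find_burst_file is hypothetical, the directory passed in at the bottom is only a common cgroup v1 mount point and may differ on your system, and cpu.max.burst is what I understand to be the cgroup v2 counterpart of cpu.cfs_burst_us.

```python
import os

# Assumed file names: cpu.cfs_burst_us (cgroup v1) and cpu.max.burst (cgroup v2).
BURST_FILES = ("cpu.cfs_burst_us", "cpu.max.burst")


def find_burst_file(cgroup_dir):
    """Return the path of the burst control file in cgroup_dir, or None."""
    for name in BURST_FILES:
        path = os.path.join(cgroup_dir, name)
        if os.path.exists(path):
            return path
    return None


# /sys/fs/cgroup/cpu is an assumption; adjust it to where cgroups are mounted.
print(find_burst_file("/sys/fs/cgroup/cpu"))
```

If the function returns None for the cgroup directories on a machine, that kernel most likely predates the burst feature.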