k8s 驱逐eviction机制源码分析
原理部分
1. 驱逐概念介绍
kubelet会定期监控node的内存,磁盘,文件系统等资源,当达到指定的阈值后,就会先尝试回收node级别的资源,比如当磁盘资源不足时会删除不同的image,如果仍然在阈值之上就会开始驱逐pod来回收资源。
2. 驱逐信号
kubelet定义了如下的驱逐信号,当驱逐信号达到了驱逐阈值执行驱逐流程
3. 驱逐阈值
驱逐阈值用来指定当驱逐信号达到某个阈值后执行驱逐流程,格式如下:[eviction-signal][operator][quantity],其中eviction-signa为上面定义的驱逐信号,operator为操作符,比如小于等,quantity为指定阈值数据,可以为数字,也可以为百分比。
比如一个node有10G内存,如果期望当可用内存小于1G触发驱逐,可以定义阈值如下:memory.available<10%或者memory.available<1Gi
a. 软驱逐阈值
软驱逐阈值会指定一个grace period时间,只有达到阈值的时间超过了grace period才会执行驱逐流程。
有如下三个相关参数:
–eviction-soft: 指定驱逐阈值集合,比如memory.available<1.5Gi
–eviction-soft-grace-period:指定驱逐grace period时间集合,比如memory.available=1m30s
–eviction-max-pod-grace-period: 指定pod优雅退出时间,要和上面的–eviction-soft-grace-period区分开,–eviction-soft-grace-period指的是
达到阈值持续多久后才执行驱逐流程,而–eviction-max-pod-grace-period指的是驱逐pod后,pod的退出时间
b. 硬驱逐阈值
硬驱逐阈值只要达到了阈值就会执行驱逐流程,有如下参数
–eviction-hard: 指定驱逐阈值集合
如果不指定–eviction-hard,则使用如下默认值
//pkg/kubelet/apis/config/v1beta1/default_linux.go
// DefaultEvictionHard includes default options for hard eviction.
var DefaultEvictionHard = map[string]string{"memory.available": "100Mi","nodefs.available": "10%","nodefs.inodesFree": "5%","imagefs.available": "15%",
}
驱逐的这些参数都可以在KubeletConfiguration文件中指定,只需要指定驱逐信号和对应的数值即可,操作符默认为小于,具体可参考官网
4. node健康状况
当达到软驱逐(不用等到grace period)或者硬驱逐阈值后,kubelet就会向api-server报告node的健康状况,反应出node的压力。
驱逐信号和node健康状况的关系如下表
有些情况下,node健康状况可能会在软阈值上下波动,时而健康时而有压力,导致错误的驱逐决定。为了防止这种情况,可以使用参数–eviction-pressure-transition-period来控制node健康状况至少多久变化一次,默认值为5分钟
5. 驱逐pod选择
如果kubelet回收node级别的资源后仍然在阈值之上,则需要驱逐用户创建的pod。影响选择pod进行驱逐的因素如下:
a. 是否pod的资源使用超过了请求值
b. pod的优先级
c. pod的资源使用值和请求值的比例
根据上面三个因素,kubelet排序后进行驱逐pod的顺序如下:
a. 资源使用值超过了请求值的BestEffort和Burstable级别的pod。根据他们的pod优先级和超出请求值的多少进行驱逐
b. 资源使用值小于请求值的Guaranteed和Burstable级别的pod。根据他们的pod优先级进行驱逐
6. 最小回收资源
有些情况下,驱逐pod只能回收很少的一部分资源,可能导致kubelet频繁的执行达到阈值/进行驱逐的过程,为了防止这种情况,可以使用参数–eviction-minimum-reclaim为每种资源配置最小回收数值,当kubelet执行驱逐时,回收的资源会额外加上–eviction-minimum-reclaim指定的值。默认值为0
举例如下,当nodefs.available达到驱逐阈值后,kubelet开始驱逐回收资源直到可用资源达到1G,此时还会继续回收直到达到1.5G才会停止
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:memory.available: "500Mi"nodefs.available: "1Gi"imagefs.available: "100Gi"
evictionMinimumReclaim:memory.available: "0Mi"nodefs.available: "500Mi"imagefs.available: "2Gi"
7. KernelMemcgNotification
kubelet会启动协程周期检查是否达到驱逐阈值,如果有个很重要的pod内存使用增长很快的话,即使达到内存阈值了,kubelet也可能不能及时发现,最终oomkill掉此pod,如果kubelet能及时发现,就能驱逐其他低优先级的pod释放资源给这个高优先级的pod使用。
此时可以使用参数–kernel-memcg-notification使能memcg机制,kubelet会使用epoll监听,当达到阈值后,kubelet能很快得到通知,及时执行驱逐流程。
源码分析
1. 解析驱逐阈值配置
KubeletConfiguration文件中的驱逐阈值相关配置最终会保存到如下结构体中,Thresholds用来保存软驱逐阈值及其grace period,硬驱逐阈值。
// Config holds information about how eviction is configured.
type Config struct {// PressureTransitionPeriod is duration the kubelet has to wait before transitioning out of a pressure condition.PressureTransitionPeriod time.Duration// Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.MaxPodGracePeriodSeconds int64// Thresholds define the set of conditions monitored to trigger eviction.Thresholds []evictionapi.Threshold// KernelMemcgNotification if true will integrate with the kernel memcg notification to determine if memory thresholds are crossed.KernelMemcgNotification bool// PodCgroupRoot is the cgroup which contains all pods.PodCgroupRoot string
}// Threshold defines a metric for when eviction should occur.
type Threshold struct {// Signal defines the entity that was measured.Signal Signal// Operator represents a relationship of a signal to a value.Operator ThresholdOperator// Value is the threshold the resource is evaluated against.Value ThresholdValue// GracePeriod represents the amount of time that a threshold must be met before eviction is triggered.GracePeriod time.Duration// MinReclaim represents the minimum amount of resource to reclaim if the threshold is met.MinReclaim *ThresholdValue
}
调用ParseThresholdConfig解析用户配置
//pkg/kubelet/kubelet.go
func NewMainKubelet(...)thresholds, err := eviction.ParseThresholdConfig(enforceNodeAllocatable, kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)evictionConfig := eviction.Config{PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),Thresholds: thresholds,KernelMemcgNotification: kernelMemcgNotification,PodCgroupRoot: kubeDeps.ContainerManager.GetPodCgroupRoot(),}
2. 创建evictionManager
调用eviction.NewManager创建evictionManager,klet.resourceAnalyzer用来获取node和pod的统计信息,evictionConfig为用户配置的驱逐阈值,killPodNow用来kill pod
//pkg/kubelet/kubelet.go
func NewMainKubelet(...)// setup eviction managerevictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.podManager.GetMirrorPodByPod, klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock)klet.evictionManager = evictionManager//将evictionManager添加到admitHandlers,创建pod时会调用evictionManager.Admit如果node有资源压力则拒绝pod运行在此node上klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
3. 启动evictionManager
调用evictionManager的start函数启动驱逐管理,kl.GetActivePods用来获取运行在本node上的active pod,kl.podResourcesAreReclaimed用来确认pod的资源是否完全释放,evictionMonitoringPeriod为没有驱逐pod时的sleep时间
// Period for performing eviction monitoring.
// ensure this is kept in sync with internal cadvisor housekeeping.
evictionMonitoringPeriod = time.Second * 10func initializeRuntimeDependentModules(...)// eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefskl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)
如果指定了–kernel-memcg-notification,则启动实时驱逐,同时也会创建协程启动轮训驱逐
//pkg/kubelet/eviction/eviction_manager.go
// Start starts the control loop to observe and response to low compute resources.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {thresholdHandler := func(message string) {klog.InfoS(message)m.synchronize(diskInfoProvider, podFunc)}//实时驱逐。如果指定了--kernel-memcg-notification,则使能memcg机制,如果达到指定的阈值,就能很快调用synchronizeif m.config.KernelMemcgNotification {for _, threshold := range m.config.Thresholds {if threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {notifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)if err != nil {klog.InfoS("Eviction manager: failed to create memory threshold notifier", "err", err)} else {go notifier.Start()m.thresholdNotifiers = append(m.thresholdNotifiers, notifier)}}}}//轮训驱逐// start the eviction manager monitoringgo func() {for {//如果synchronize返回值不为空,说明有pod被驱逐了,则调用waitForPodsCleanup等待pod的资源被释放if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {klog.InfoS("Eviction manager: pods evicted, waiting for pod to be cleaned up", "pods", format.Pods(evictedPods))m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)} else {//否则没有pod被驱逐时,需要sleep 10stime.Sleep(monitoringInterval)}}}()
}
waitForPodsCleanup堵塞等待pod资源被回收,直到成功或者超时
func (m *managerImpl) waitForPodsCleanup(podCleanedUpFunc PodCleanedUpFunc, pods []*v1.Pod) {//最多等待30stimeout := m.clock.NewTimer(podCleanupTimeout)defer timeout.Stop()//每秒执行一次ticker := m.clock.NewTicker(podCleanupPollFreq)defer ticker.Stop()for {select {case <-timeout.C():klog.InfoS("Eviction manager: timed out waiting for pods to be cleaned up", "pods", format.Pods(pods))returncase <-ticker.C():for i, pod := range pods {//podCleanedUpFunc为pkg/kubelet/kubelet_pods.go:podResourcesAreReclaimed,用来判断pod的资源是否已经被回收,//如果仍然被回收则返回false,跳出循环等待下次if !podCleanedUpFunc(pod) {break}if i == len(pods)-1 {klog.InfoS("Eviction manager: pods successfully cleaned up", "pods", format.Pods(pods))return}}}}
}
synchronize为evictionManager的核心函数,实时驱逐和轮训驱逐都会调用它
// synchronize is the main control loop that enforces eviction thresholds.
// Returns the pod that was killed, or nil if no pod was killed.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {// if we have nothing to do, just returnthresholds := m.config.Thresholds//如果配置的驱逐阈值集合为空,则返回。因为有默认值的存在,肯定不会为空if len(thresholds) == 0 && !utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {return nil}klog.V(3).InfoS("Eviction manager: synchronize housekeeping")// build the ranking functions (if not yet known)// TODO: have a function in cadvisor that lets us know if global housekeeping has completedif m.dedicatedImageFs == nil {hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()if ok != nil {return nil}m.dedicatedImageFs = &hasImageFs//建立驱逐信号到排序函数的映射,比如对于SignalMemoryAvailable,使用rankMemoryPressure进行排序m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)//建立驱逐信号到node级别资源回收函数的映射m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)}//podFunc()为kl.GetActivePods用来获取运行在本node上的active podactivePods := podFunc()updateStats := true//调用summaryProviderImpl.Get获取node和pod统计信息summary, err := m.summaryProvider.Get(updateStats)if err != nil {klog.ErrorS(err, "Eviction manager: failed to get summary stats")return nil}//notifier相关的,暂时忽略if m.clock.Since(m.thresholdsLastUpdated) > notifierRefreshInterval {m.thresholdsLastUpdated = m.clock.Now()for _, notifier := range m.thresholdNotifiers {if err := notifier.UpdateThreshold(summary); err != nil {klog.InfoS("Eviction manager: failed to update notifier", "notifier", notifier.Description(), "err", err)}}}//将获取的node和pod统计信息summary转换到observations,此为驱逐信号到signalObservation的map,signalObservation保存了驱逐信号//的总容量,可用值和获取统计时的时间// make observations and get a function to derive pod usage stats relative to those observations.observations, statsFunc := makeSignalObservations(summary)debugLogObservations("observations", observations)//比较observations中的资源信息和配置的驱逐阈值thresholds,将达到阈值的thresholds返回,//比如配置的驱逐阈值为memory.available: "500Mi"和nodefs.available: "1Gi",而observations中内存可用为400Mi,//可用nodefs为2Gi,则返回memory.available相关信息// determine the set of thresholds met independent of grace periodthresholds = thresholdsMet(thresholds, observations, false)debugLogThresholdsWithObservation("thresholds - ignoring grace period", thresholds, observations)//thresholdsMet记录的是上次达到驱逐阈值的信号// determine the set of thresholds previously met that have not yet satisfied the associated min-reclaimif len(m.thresholdsMet) > 0 {//经过上次的驱逐流程后,可能已经成功回收资源,对上次达到驱逐阈值的信号再次进行判断是否降低到阈值之下,//如果没降低到阈值之下,需要将本次的thresholds和上次的m.thresholdsMet进行合并。//还有一种情况,对于软驱逐来说,第一次达到阈值后可能还没超过grace periods,则保存到m.thresholdsMet,下次走到这里时进行合并,//执行后续流程,如果仍然没超过grace periods则继续,直到超过grace periods指定的时间thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)}debugLogThresholdsWithObservation("thresholds - reclaim not satisfied", thresholds, observations)//记录驱逐信号第一次达到阈值的时间,目的是为了计算是否超过grace periods指定的时间//thresholds是本次发现达到阈值的驱逐信号,m.thresholdsFirstObservedAt保存的驱逐信号是第一次//达到阈值的时间// track when a threshold was first observednow := m.clock.Now()thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)//根据达到阈值的驱逐信号返回对应的node健康状况,比如达到内存可用阈值了,则返回NodeMemoryPressure// the set of node conditions that are triggered by currently observed thresholdsnodeConditions := nodeConditions(thresholds)if len(nodeConditions) > 0 {klog.V(3).InfoS("Eviction manager: node conditions - observed", "nodeCondition", nodeConditions)}//记录不同node健康状况上次出问题的时间,目的是为了计算是否超过了PressureTransitionPeriod指定的时间,防止node健康状况出现波动// track when a node condition was last observednodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)//只要记录的node健康状况出问题的时间在PressureTransitionPeriod指定的时间内,则返回true,即认为node健康状况还是有问题,//比如PressureTransitionPeriod为5分钟,在第一分钟时可用内存信号达到了内存阈值,设置node健康状况为NodeMemoryPressure,//第二分钟即使可用内存信号降到了内存阈值之下,也不会将NodeMemoryPressure删除,而是等到过了5分钟之后,如果可用内存信号仍然//在内存阈值之下,才会删除NodeMemoryPressure,即需要保持NodeMemoryPressure状态持续至少5分钟,不过这段时间内内存如何变化。// node conditions report true if it has been observed within the transition period windownodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)if len(nodeConditions) > 0 {klog.V(3).InfoS("Eviction manager: node conditions - transition period not met", "nodeCondition", nodeConditions)}//返回超过grace periods的驱逐信号,主要是针对软驱逐来说,硬驱逐每次都会返回// determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)debugLogThresholdsWithObservation("thresholds - grace periods satisfied", thresholds, observations)//保存重要的变量// update internal statem.Lock()m.nodeConditions = nodeConditionsm.thresholdsFirstObservedAt = thresholdsFirstObservedAtm.nodeConditionsLastObservedAt = nodeConditionsLastObservedAtm.thresholdsMet = thresholds// determine the set of thresholds whose stats have been updated since the last syncthresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)m.lastObservations = observationsm.Unlock()// evict pods if there is a resource usage violation from local volume temporary storage// If eviction happens in localStorageEviction function, skip the rest of eviction actionif utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {if evictedPods := m.localStorageEviction(activePods, statsFunc); len(evictedPods) > 0 {return evictedPods}}//如果为0,说明没有驱逐信号达到阈值,即没有资源压力,返回即可if len(thresholds) == 0 {klog.V(3).InfoS("Eviction manager: no resources are starved")return nil}//将驱逐信号进行排序,将内存驱逐信号排在前面// rank the thresholds by eviction prioritysort.Sort(byEvictionPriority(thresholds))//获取第一个驱逐信号,上面刚排序过,最新驱逐的是内存信号thresholdToReclaim, resourceToReclaim, foundAny := getReclaimableThreshold(thresholds)if !foundAny {return nil}klog.InfoS("Eviction manager: attempting to reclaim", "resourceName", resourceToReclaim)// record an event about the resources we are now attempting to reclaim via evictionm.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)//首先查看是否有node层级的资源可回收,回收后,再次m.summaryProvider.Get获取node和pod统计信息,和m.config.Thresholds进行比较,//如果没有达到阈值的驱逐信号,则返回true,说明node层级的资源回收后,已经没有资源压力。//内存驱逐信号不属于node层级资源// check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.if m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim) {klog.InfoS("Eviction manager: able to reduce resource pressure without evicting pods.", "resourceName", resourceToReclaim)return nil}klog.InfoS("Eviction manager: must evict pod(s) to reclaim", "resourceName", resourceToReclaim)//根据驱逐信号获取排序函数,比如内存驱逐信号对应的排序函数为rankMemoryPressure// rank the pods for evictionrank, ok := m.signalToRankFunc[thresholdToReclaim.Signal]if !ok {klog.ErrorS(nil, "Eviction manager: no ranking function for signal", "threshold", thresholdToReclaim.Signal)return nil}//没有active的pod,返回// the only candidates viable for eviction are those pods that had anything running.if len(activePods) == 0 {klog.ErrorS(nil, "Eviction manager: eviction thresholds have been met, but no pods are active to evict")return nil}//对activePods进行排序// rank the running pods for eviction for the specified resourcerank(activePods, statsFunc)klog.InfoS("Eviction manager: pods ranked for eviction", "pods", format.Pods(activePods))//record age of metrics for met thresholds that we are using for evictions.for _, t := range thresholds {timeObserved := observations[t.Signal].timeif !timeObserved.IsZero() {metrics.EvictionStatsAge.WithLabelValues(string(t.Signal)).Observe(metrics.SinceInSeconds(timeObserved.Time))}}//遍历activePods开始驱逐,每次最多驱逐一个pod。为什么还要循环呢?因为有些pod属于Critical pod,这些pod不能驱逐// we kill at most a single pod during each eviction intervalfor i := range activePods {pod := activePods[i]gracePeriodOverride := int64(0)//软驱逐设置gracePeriodOverride为MaxPodGracePeriodSeconds,//硬驱逐置gracePeriodOverride为0if !isHardEvictionThreshold(thresholdToReclaim) {gracePeriodOverride = m.config.MaxPodGracePeriodSeconds}message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc)//执行驱逐,如果evictPod返回nil说明pod为Critical pod,返回非nil说明对pod进行驱逐了,但不会管驱逐是否成功if m.evictPod(pod, gracePeriodOverride, message, annotations) {metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()return []*v1.Pod{pod}}}klog.InfoS("Eviction manager: unable to evict any pods from the node")return nil
}
evictPod对pod执行驱逐,如果pod为Critical pod则直接返回nil
func (m *managerImpl) evictPod(pod *v1.Pod, gracePeriodOverride int64, evictMsg string, annotations map[string]string) bool {// If the pod is marked as critical and static, and support for critical pod annotations is enabled,// do not evict such pods. Static pods are not re-admitted after evictions.// https://github.com/kubernetes/kubernetes/issues/40573 has more details.if kubelettypes.IsCriticalPod(pod) {klog.ErrorS(nil, "Eviction manager: cannot evict a critical pod", "pod", klog.KObj(pod))return false}// record that we are evicting the podm.recorder.AnnotatedEventf(pod, annotations, v1.EventTypeWarning, Reason, evictMsg)// this is a blocking call and should only return when the pod and its containers are killed.klog.V(3).InfoS("Evicting pod", "pod", klog.KObj(pod), "podUID", pod.UID, "message", evictMsg)//killPodFunc为pkg/kubelet/pod_workers.go:killPodNowerr := m.killPodFunc(pod, true, &gracePeriodOverride, func(status *v1.PodStatus) {status.Phase = v1.PodFailedstatus.Reason = Reasonstatus.Message = evictMsg})if err != nil {klog.ErrorS(err, "Eviction manager: pod failed to evict", "pod", klog.KObj(pod))} else {klog.InfoS("Eviction manager: pod is evicted successfully", "pod", klog.KObj(pod))}return true
}
Critical pod的判断方法
// IsCriticalPod returns true if pod's priority is greater than or equal to SystemCriticalPriority.
func IsCriticalPod(pod *v1.Pod) bool {//pod的注释kubernetes.io/config.source不为api,即不是从apiserver获取的pod都为静态podif IsStaticPod(pod) {return true}//pod的注释kubernetes.io/config.mirror不为空的pod为镜像podif IsMirrorPod(pod) {return true}//pod的优先级不为空,且优先级大于 2*1000000000if pod.Spec.Priority != nil && IsCriticalPodBasedOnPriority(*pod.Spec.Priority) {return true}return false
}
killPodNow调用pod_workers的UpdatePod kill pod,此函数为堵塞调用,或者返回成功,或者等timeout超时
//pkg/kubelet/pod_workers.go
// killPodNow returns a KillPodFunc that can be used to kill a pod.
// It is intended to be injected into other modules that need to kill a pod.
func killPodNow(podWorkers PodWorkers, recorder record.EventRecorder) eviction.KillPodFunc {return func(pod *v1.Pod, isEvicted bool, gracePeriodOverride *int64, statusFn func(*v1.PodStatus)) error {// determine the grace period to use when killing the podgracePeriod := int64(0)if gracePeriodOverride != nil {gracePeriod = *gracePeriodOverride} else if pod.Spec.TerminationGracePeriodSeconds != nil {gracePeriod = *pod.Spec.TerminationGracePeriodSeconds}// we timeout and return an error if we don't get a callback within a reasonable time.// the default timeout is relative to the grace period (we settle on 10s to wait for kubelet->runtime traffic to complete in sigkill)timeout := int64(gracePeriod + (gracePeriod / 2))minTimeout := int64(10)if timeout < minTimeout {timeout = minTimeout}timeoutDuration := time.Duration(timeout) * time.Second// open a channel we block against until we get a resultch := make(chan struct{}, 1)podWorkers.UpdatePod(UpdatePodOptions{Pod: pod,UpdateType: kubetypes.SyncPodKill, //更新类型为killKillPodOptions: &KillPodOptions{CompletedCh: ch,Evict: isEvicted,PodStatusFunc: statusFn,PodTerminationGracePeriodSecondsOverride: gracePeriodOverride,},})// wait for either a response, or a timeoutselect {//堵塞channel等待,kill成功后会close此channel,这里返回nilcase <-ch:return nilcase <-time.After(timeoutDuration): //超时了返回errrecorder.Eventf(pod, v1.EventTypeWarning, events.ExceededGracePeriod, "Container runtime did not kill the pod within specified grace period.")return fmt.Errorf("timeout waiting to kill pod")}}
}
UpdatePod的调用链如下
UpdatePod-> managePodLoop -> syncTerminatingPod -> killPod -> containerRuntime.KillPod
4. nodeConditions的作用
有如下两个作用
a. 如果node健康状况出问题就会上报到kube-apiserver,代码如下
func (kl *Kubelet) defaultNodeStatusFuncs() []func(*v1.Node) errorsetters = append(setters,//上传MemoryPressurenodestatus.MemoryPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderMemoryPressure, kl.recordNodeStatusEvent),//上传DiskPressurenodestatus.DiskPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderDiskPressure, kl.recordNodeStatusEvent),//上传PIDPressurenodestatus.PIDPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderPIDPressure, kl.recordNodeStatusEvent),)
如果m.nodeConditions包含NodeMemoryPressure说明有当前可用内存超过了内存驱逐信号的阈值
// IsUnderMemoryPressure returns true if the node is under memory pressure.
func (m *managerImpl) IsUnderMemoryPressure() bool {m.RLock()defer m.RUnlock()return hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure)
}
会有两个表现,一个是node污点增加了node.kubernetes.io/memory-pressure:NoSchedule,另一个是node的Conditions也能看到MemoryPressure,可使用kubectl describe node master查看
root@master:~# kubectl describe node master
...
Taints: node.kubernetes.io/memory-pressure:NoSchedule
Unschedulable: false
Lease:HolderIdentity: masterAcquireTime: <unset>RenewTime: Sat, 10 Dec 2022 10:23:38 +0000
Conditions:Type Status LastHeartbeatTime LastTransitionTime Reason Message---- ------ ----------------- ------------------ ------ -------MemoryPressure True Sat, 10 Dec 2022 10:22:06 +0000 Sat, 10 Dec 2022 10:22:06 +0000 KubeletHasInsufficientMemory kubelet has insufficient memory available
node有了node.kubernetes.io/memory-pressure:NoSchedule污点后,还可能会影响新pod的调度,在Filter扩展点TaintToleration插件会检查污点
// Filter invoked at the filter extension point.
func (pl *TaintToleration) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {if nodeInfo == nil || nodeInfo.Node() == nil {return framework.AsStatus(fmt.Errorf("invalid nodeInfo"))}filterPredicate := func(t *v1.Taint) bool {// PodToleratesNodeTaints is only interested in NoSchedule and NoExecute taints.return t.Effect == v1.TaintEffectNoSchedule || t.Effect == v1.TaintEffectNoExecute}taint, isUntolerated := v1helper.FindMatchingUntoleratedTaint(nodeInfo.Node().Spec.Taints, pod.Spec.Tolerations, filterPredicate)if !isUntolerated {return nil}errReason := fmt.Sprintf("node(s) had taint {%s: %s}, that the pod didn't tolerate",taint.Key, taint.Value)return framework.NewStatus(framework.UnschedulableAndUnresolvable, errReason)
}
如果新创建的pod没有声明容忍memory-pressure污点,会有如下报错,导致pod调度失败
I1210 12:43:04.147715 92808 scheduler.go:351] "Unable to schedule pod; no fit; waiting" pod="default/nginx-demo3-574cdd99c7-lwk9x" err="0/2 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/memory-pressure: }, 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling."
b. 即使新建pod能调度成功,在目标node上也需要经过canAdmitPod检查后才能允许此pod运行在此node上,调用链如下
HandlePodAdditions -> canAdmitPod -> podAdmitHandler.Admit
evictionManager的Admit会检查node健康状态m.nodeConditions,如果为空说明没有资源压力,可直接返回true,如果是CriticalPod也返回true,如果只有内存压力,会根据pod的qos级别进行区分,如果不是besteffort的pod也返回true,是besteffort的pod如果可以容忍内存压力也可以,但如果不是内存压力就返回false,即pod不能运行在此node上
// Admit rejects a pod if its not safe to admit for node stability.
func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {m.RLock()defer m.RUnlock()//node健康状态集合为空,说明没有资源压力,可直接返回trueif len(m.nodeConditions) == 0 {return lifecycle.PodAdmitResult{Admit: true}}//CriticalPod也返回true// Admit Critical pods even under resource pressure since they are required for system stability.// https://github.com/kubernetes/kubernetes/issues/40573 has more details.if kubelettypes.IsCriticalPod(attrs.Pod) {return lifecycle.PodAdmitResult{Admit: true}}//有且只有内存压力// Conditions other than memory pressure reject all podsnodeOnlyHasMemoryPressureCondition := hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure) && len(m.nodeConditions) == 1if nodeOnlyHasMemoryPressureCondition {//获取pod的qos级别notBestEffort := v1.PodQOSBestEffort != v1qos.GetPodQOS(attrs.Pod)//非BestEffort的pod返回trueif notBestEffort {return lifecycle.PodAdmitResult{Admit: true}}//如果能容忍内存压力也返回true// When node has memory pressure, check BestEffort Pod's toleration:// admit it if tolerates memory pressure taint, fail for other tolerations, e.g. DiskPressure.if v1helper.TolerationsTolerateTaint(attrs.Pod.Spec.Tolerations, &v1.Taint{Key: v1.TaintNodeMemoryPressure,Effect: v1.TaintEffectNoSchedule,}) {return lifecycle.PodAdmitResult{Admit: true}}}//其他情况均返回false// reject pods when under memory pressure (if pod is best effort), or if under disk pressure.klog.InfoS("Failed to admit pod to node", "pod", klog.KObj(attrs.Pod), "nodeCondition", m.nodeConditions)return lifecycle.PodAdmitResult{Admit: false,Reason: Reason,Message: fmt.Sprintf(nodeConditionMessageFmt, m.nodeConditions),}
}
参考
https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#eviction-thresholds
https://github.com/kubernetes/design-proposals-archive/blob/main/node/kubelet-eviction.md
k8s 驱逐eviction机制源码分析相关推荐
- Apache Storm 实时流处理系统通信机制源码分析
我们今天就来仔细研究一下Apache Storm 2.0.0-SNAPSHOT的通信机制.下面我将从大致思想以及源码分析,然后我们细致分析实时流处理系统中源码通信机制研究. 1. 简介 Worker间 ...
- Spark资源调度机制源码分析--基于spreadOutApps及非spreadOutApps两种资源调度算法
Spark资源调度机制源码分析--基于spreadOutApps及非spreadOutApps两种资源调度算法 1.spreadOutApp尽量平均分配到每个executor上: 2.非spreadO ...
- Nacos 服务端健康检查及客户端服务订阅机制源码分析(三)
Nacos 服务端健康检查 长连接 概念:长连接,指在一个连接上可以连续发送多个数据包,在连接保持期间,如果没有数据包发送,需要双方发送链路检测包 注册中心客户端 2.0 以后使用 gRPC 代替 h ...
- ART虚拟机 | Cleaner机制源码分析
目录 思考问题 1.Android为什么要将Finalize机制替换成Cleaner机制? 2.Cleaner机制回收Native堆内存的原理是什么? 3.Cleaner机制源码是如何实现的? 一.版 ...
- Android——RIL 机制源码分析
Android 电话系统框架介绍 在android系统中rild运行在AP上,AP上的应用通过rild发送AT指令给BP,BP接收到信息后又通过rild传送给AP.AP与BP之间有两种通信方式: 1. ...
- Linux Thermal机制源码分析之Governor
一.thermal_init() 在开始源码分析之前,需要先说明一下.Linux 内核代码庞大而复杂,如何 reading the Fxxking source code 相信是很多从事 Linux ...
- Handler机制源码分析
一.Handler使用上需要注意的几点 1.1 handler使用不当造成的内存泄漏 public class MainActivity extends AppCompatActivity {priv ...
- HashMap扩容机制源码分析
前几天写了一篇,ArrayList扩容源码分析.好像源码也没有我们想象的那么可怕?(当然了,只是简单的分析,后面等我知识充足了,将进一步的分析) 今天本来想打游戏的,但是网速太差了,真是的是让人火爆. ...
- Android -- 消息处理机制源码分析(Looper,Handler,Message)
android的消息处理有三个核心类:Looper,Handler和Message.其实还有一个Message Queue(消息队列),但是MQ被封装到Looper里面了,我们不会直接与MQ打交道,因 ...
最新文章
- poj2186 求有向图G中所有点都能到达的点的数量
- Python入门100题 | 第064题
- mysql查询时间类型c语言处理_资讯类app用户热度及资讯类型分析-Mysql进行数据预处理...
- webgl 基础渲染demo_WebGL + ThreeJS 实现实时水下焦散 Part 1
- 使用brew安装软件
- mysql 改表面_CSS表面(outline)是什么【html5教程】,CSS
- Decrease (Judge ver.)
- iOS学习——UITableViewCell两种重用方法的区别
- gcc编译c文件生成可执行文件
- 网站没有外链 如何计算权重
- matlab插值函数 外插,Matlab数据插值-内插、外插
- crc32 C语言程序
- 解构荣耀销量奇迹背后的化学反应:技术+品质+产品力
- 树莓派使用pigpio 控制舵机
- Windows XP系统中实用的命令及操作技巧
- 0006-Flink原理(Flink数据流 执行图)
- 基于BES+DSP 的音频系统方案设计
- 【软考二】程序设计语言(做题)
- shell脚本(linux)
- Android 腾讯手机管家 报毒 a.gray.PiggyGoldcoin.a