The eviction mechanism lets a node reclaim resources when resource pressure crosses thresholds that have been configured for that node.

Two important steps in the EvictionManager workflow are reclaimNodeLevelResources (reclaiming node-level resources) and killPod.

killPod always chooses which pod to kill according to a priority ordering; this article walks through how that ranking works.

pkg/kubelet/kubelet.go
func (kl *Kubelet) initializeRuntimeDependentModules() {
    if err := kl.cadvisor.Start(); err != nil {
        // Fail kubelet and rely on the babysitter to retry starting kubelet.
        // TODO(random-liu): Add backoff logic in the babysitter
        glog.Fatalf("Failed to start cAdvisor %v", err)
    }
    // eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs
    kl.evictionManager.Start(kl.cadvisor, kl.GetActivePods, kl.podResourcesAreReclaimed, kl.containerManager, evictionMonitoringPeriod)
}

This is kubelet code. The last line calls the eviction manager's Start function, which kicks off the eviction loop. The final argument, evictionMonitoringPeriod, is a constant of 10 seconds, so an eviction pass runs every 10 seconds.
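The polling pattern behind that loop can be sketched as follows. This is a minimal, self-contained illustration with stub types, not the kubelet's actual implementation: synchronize here is a stand-in that always returns nil, and the interval is shortened so the program finishes quickly.

```go
// Sketch of the eviction manager's polling pattern: run synchronize in a
// loop, and sleep for monitoringInterval (10s in kubelet) only when no pod
// was evicted in the pass.
package main

import (
	"fmt"
	"time"
)

// synchronize is a stand-in for managerImpl.synchronize; it returns the pods
// evicted in this pass, or nil when no threshold was crossed.
func synchronize() []string { return nil }

func monitor(interval time.Duration, rounds int) {
	for i := 0; i < rounds; i++ {
		if evicted := synchronize(); evicted != nil {
			fmt.Println("evicted:", evicted)
			// the real code waits for pod cleanup here instead of sleeping
			continue
		}
		time.Sleep(interval)
	}
}

func main() {
	// kubelet passes evictionMonitoringPeriod = 10 * time.Second here
	monitor(10*time.Millisecond, 3)
	fmt.Println("done")
}
```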

pkg/kubelet/eviction/eviction_manager.go
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, capacityProvider CapacityProvider, monitoringInterval time.Duration) {
    // start the eviction manager monitoring
    go func() {
        for {
            if evictedPods := m.synchronize(diskInfoProvider, podFunc, capacityProvider); evictedPods != nil {
                glog.Infof("eviction manager: pods %s evicted, waiting for pod to be cleaned up", format.Pods(evictedPods))
                m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
            } else {
                time.Sleep(monitoringInterval)
            }
        }
    }()
}

This Start is the function the kubelet calls above. It in turn calls synchronize, which performs a single eviction pass.

func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider) []*v1.Pod {
    // if we have nothing to do, just return
    thresholds := m.config.Thresholds
    if len(thresholds) == 0 {
        return nil
    }
    glog.V(3).Infof("eviction manager: synchronize housekeeping")
    // build the ranking functions (if not yet known)
    // TODO: have a function in cadvisor that lets us know if global housekeeping has completed
    if m.dedicatedImageFs == nil {
        hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
        if ok != nil {
            return nil
        }
        m.dedicatedImageFs = &hasImageFs
        m.resourceToRankFunc = buildResourceToRankFunc(hasImageFs)
        m.resourceToNodeReclaimFuncs = buildResourceToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
    }
    activePods := podFunc()
    // make observations and get a function to derive pod usage stats relative to those observations.
    observations, statsFunc, err := makeSignalObservations(m.summaryProvider, capacityProvider, activePods, *m.dedicatedImageFs)
    if err != nil {
        glog.Errorf("eviction manager: unexpected err: %v", err)
        return nil
    }
    debugLogObservations("observations", observations)
    // attempt to create a threshold notifier to improve eviction response time
    if m.config.KernelMemcgNotification && !m.notifiersInitialized {
        glog.Infof("eviction manager attempting to integrate with kernel memcg notification api")
        m.notifiersInitialized = true
        // start soft memory notification
        err = startMemoryThresholdNotifier(m.config.Thresholds, observations, false, func(desc string) {
            glog.Infof("soft memory eviction threshold crossed at %s", desc)
            // TODO wait grace period for soft memory limit
            m.synchronize(diskInfoProvider, podFunc, capacityProvider)
        })
        if err != nil {
            glog.Warningf("eviction manager: failed to create hard memory threshold notifier: %v", err)
        }
        // start hard memory notification
        err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
            glog.Infof("hard memory eviction threshold crossed at %s", desc)
            m.synchronize(diskInfoProvider, podFunc, capacityProvider)
        })
        if err != nil {
            glog.Warningf("eviction manager: failed to create soft memory threshold notifier: %v", err)
        }
    }
    // determine the set of thresholds met independent of grace period
    thresholds = thresholdsMet(thresholds, observations, false)
    debugLogThresholdsWithObservation("thresholds - ignoring grace period", thresholds, observations)
    // determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim
    if len(m.thresholdsMet) > 0 {
        thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
        thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
    }
    debugLogThresholdsWithObservation("thresholds - reclaim not satisfied", thresholds, observations)
    // determine the set of thresholds whose stats have been updated since the last sync
    thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
    debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)
    // track when a threshold was first observed
    now := m.clock.Now()
    thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
    // the set of node conditions that are triggered by currently observed thresholds
    nodeConditions := nodeConditions(thresholds)
    if len(nodeConditions) > 0 {
        glog.V(3).Infof("eviction manager: node conditions - observed: %v", nodeConditions)
    }
    // track when a node condition was last observed
    nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)
    // node conditions report true if it has been observed within the transition period window
    nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
    if len(nodeConditions) > 0 {
        glog.V(3).Infof("eviction manager: node conditions - transition period not met: %v", nodeConditions)
    }
    // determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
    thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
    debugLogThresholdsWithObservation("thresholds - grace periods satisified", thresholds, observations)
    // update internal state
    m.Lock()
    m.nodeConditions = nodeConditions
    m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
    m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
    m.thresholdsMet = thresholds
    m.lastObservations = observations
    m.Unlock()
    // evict pods if there is a resource usage violation from local volume temporary storage
    // If eviction happens in localVolumeEviction function, skip the rest of eviction action
    if utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
        if evictedPods := m.localStorageEviction(activePods); len(evictedPods) > 0 {
            return evictedPods
        }
    }
    // determine the set of resources under starvation
    starvedResources := getStarvedResources(thresholds)
    if len(starvedResources) == 0 {
        glog.V(3).Infof("eviction manager: no resources are starved")
        return nil
    }
    // rank the resources to reclaim by eviction priority
    sort.Sort(byEvictionPriority(starvedResources))
    resourceToReclaim := starvedResources[0]
    glog.Warningf("eviction manager: attempting to reclaim %v", resourceToReclaim)
    // determine if this is a soft or hard eviction associated with the resource
    softEviction := isSoftEvictionThresholds(thresholds, resourceToReclaim)
    // record an event about the resources we are now attempting to reclaim via eviction
    m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)
    // check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
    if m.reclaimNodeLevelResources(resourceToReclaim, observations) {
        glog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
        return nil
    }
    glog.Infof("eviction manager: must evict pod(s) to reclaim %v", resourceToReclaim)
    // rank the pods for eviction
    rank, ok := m.resourceToRankFunc[resourceToReclaim]
    if !ok {
        glog.Errorf("eviction manager: no ranking function for resource %s", resourceToReclaim)
        return nil
    }
    // the only candidates viable for eviction are those pods that had anything running.
    if len(activePods) == 0 {
        glog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
        return nil
    }
    // rank the running pods for eviction for the specified resource
    rank(activePods, statsFunc)
    glog.Infof("eviction manager: pods ranked for eviction: %s", format.Pods(activePods))
    // record age of metrics for met thresholds that we are using for evictions.
    for _, t := range thresholds {
        timeObserved := observations[t.Signal].time
        if !timeObserved.IsZero() {
            metrics.EvictionStatsAge.WithLabelValues(string(t.Signal)).Observe(metrics.SinceInMicroseconds(timeObserved.Time))
        }
    }
    // we kill at most a single pod during each eviction interval
    for i := range activePods {
        pod := activePods[i]
        // If the pod is marked as critical and static, and support for critical pod annotations is enabled,
        // do not evict such pods. Static pods are not re-admitted after evictions.
        // https://github.com/kubernetes/kubernetes/issues/40573 has more details.
        if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
            kubelettypes.IsCriticalPod(pod) && kubepod.IsStaticPod(pod) {
            continue
        }
        status := v1.PodStatus{
            Phase:   v1.PodFailed,
            Message: fmt.Sprintf(message, resourceToReclaim),
            Reason:  reason,
        }
        // record that we are evicting the pod
        m.recorder.Eventf(pod, v1.EventTypeWarning, reason, fmt.Sprintf(message, resourceToReclaim))
        gracePeriodOverride := int64(0)
        if softEviction {
            gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
        }
        // this is a blocking call and should only return when the pod and its containers are killed.
        err := m.killPodFunc(pod, status, &gracePeriodOverride)
        if err != nil {
            glog.Warningf("eviction manager: error while evicting pod %s: %v", format.Pod(pod), err)
        }
        return []*v1.Pod{pod}
    }
    glog.Infof("eviction manager: unable to evict any pods from the node")
    return nil
}

activePods := podFunc() fetches the node's active pods, and rank(activePods, statsFunc) then sorts them. The ranking logic that rank ultimately dispatches to lives in pkg/kubelet/eviction/helpers.go.
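Before reading the real source, it may help to see the multiSorter/orderedBy pattern in isolation. The sketch below uses a toy pod type and toy comparators (byQOS, byUsage are illustrative names, not kubelet's), but the Less logic mirrors helpers.go: walk the chain of cmp functions and let the first non-tie decide.

```go
// Minimal sketch of the multi-key sort pattern used in helpers.go: chain
// several cmp functions; the first comparator that is not a tie decides.
package main

import (
	"fmt"
	"sort"
)

type pod struct {
	name  string
	qos   int   // lower sorts first (0 = BestEffort)
	usage int64 // bytes of the starved resource
}

type cmpFunc func(p1, p2 pod) int

type multiSorter struct {
	pods []pod
	cmp  []cmpFunc
}

func orderedBy(cmp ...cmpFunc) *multiSorter { return &multiSorter{cmp: cmp} }

func (ms *multiSorter) Sort(pods []pod) {
	ms.pods = pods
	sort.Sort(ms)
}

func (ms *multiSorter) Len() int      { return len(ms.pods) }
func (ms *multiSorter) Swap(i, j int) { ms.pods[i], ms.pods[j] = ms.pods[j], ms.pods[i] }

// Less walks the cmp chain; the last cmp func is the final decider.
func (ms *multiSorter) Less(i, j int) bool {
	p1, p2 := ms.pods[i], ms.pods[j]
	var k int
	for k = 0; k < len(ms.cmp)-1; k++ {
		if r := ms.cmp[k](p1, p2); r != 0 {
			return r < 0
		}
	}
	return ms.cmp[k](p1, p2) < 0
}

func byQOS(p1, p2 pod) int   { return p1.qos - p2.qos }
func byUsage(p1, p2 pod) int { return int(p2.usage - p1.usage) } // heavier first

func main() {
	pods := []pod{{"a", 1, 100}, {"b", 0, 10}, {"c", 1, 500}}
	orderedBy(byQOS, byUsage).Sort(pods)
	fmt.Println(pods[0].name, pods[1].name, pods[2].name) // b c a
}
```

Pod "b" sorts first because its QoS tier wins outright; "c" beats "a" only on the usage tiebreaker.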

// Swap is part of sort.Interface.
func (ms *multiSorter) Swap(i, j int) {
    ms.pods[i], ms.pods[j] = ms.pods[j], ms.pods[i]
}

// Less is part of sort.Interface.
func (ms *multiSorter) Less(i, j int) bool {
    p1, p2 := ms.pods[i], ms.pods[j]
    var k int
    for k = 0; k < len(ms.cmp)-1; k++ {
        cmpResult := ms.cmp[k](p1, p2)
        // p1 is less than p2
        if cmpResult < 0 {
            return true
        }
        // p1 is greater than p2
        if cmpResult > 0 {
            return false
        }
        // we don't know yet
    }
    // the last cmp func is the final decider
    return ms.cmp[k](p1, p2) < 0
}

func qosComparator(p1, p2 *v1.Pod) int {
    qosP1 := v1qos.GetPodQOS(p1)
    qosP2 := v1qos.GetPodQOS(p2)
    // its a tie
    if qosP1 == qosP2 {
        return 0
    }
    // if p1 is best effort, we know p2 is burstable or guaranteed
    if qosP1 == v1.PodQOSBestEffort {
        return -1
    }
    // we know p1 and p2 are not besteffort, so if p1 is burstable, p2 must be guaranteed
    if qosP1 == v1.PodQOSBurstable {
        if qosP2 == v1.PodQOSGuaranteed {
            return -1
        }
        return 1
    }
    // ok, p1 must be guaranteed.
    return 1
}

// memory compares pods by largest consumer of memory relative to request.
func memory(stats statsFunc) cmpFunc {
    return func(p1, p2 *v1.Pod) int {
        p1Stats, found := stats(p1)
        // if we have no usage stats for p1, we want p2 first
        if !found {
            return -1
        }
        // if we have no usage stats for p2, but p1 has usage, we want p1 first.
        p2Stats, found := stats(p2)
        if !found {
            return 1
        }
        // if we cant get usage for p1 measured, we want p2 first
        p1Usage, err := podMemoryUsage(p1Stats)
        if err != nil {
            return -1
        }
        // if we cant get usage for p2 measured, we want p1 first
        p2Usage, err := podMemoryUsage(p2Stats)
        if err != nil {
            return 1
        }
        // adjust p1, p2 usage relative to the request (if any)
        p1Memory := p1Usage[v1.ResourceMemory]
        p1Spec, err := core.PodUsageFunc(p1)
        if err != nil {
            return -1
        }
        p1Request := p1Spec[api.ResourceRequestsMemory]
        p1Memory.Sub(p1Request)
        p2Memory := p2Usage[v1.ResourceMemory]
        p2Spec, err := core.PodUsageFunc(p2)
        if err != nil {
            return 1
        }
        p2Request := p2Spec[api.ResourceRequestsMemory]
        p2Memory.Sub(p2Request)
        // if p2 is using more than p1, we want p2 first
        return p2Memory.Cmp(p1Memory)
    }
}

// disk compares pods by largest consumer of disk relative to request for the specified disk resource.
func disk(stats statsFunc, fsStatsToMeasure []fsStatsType, diskResource v1.ResourceName) cmpFunc {
    return func(p1, p2 *v1.Pod) int {
        p1Stats, found := stats(p1)
        // if we have no usage stats for p1, we want p2 first
        if !found {
            return -1
        }
        // if we have no usage stats for p2, but p1 has usage, we want p1 first.
        p2Stats, found := stats(p2)
        if !found {
            return 1
        }
        // if we cant get usage for p1 measured, we want p2 first
        p1Usage, err := podDiskUsage(p1Stats, p1, fsStatsToMeasure)
        if err != nil {
            return -1
        }
        // if we cant get usage for p2 measured, we want p1 first
        p2Usage, err := podDiskUsage(p2Stats, p2, fsStatsToMeasure)
        if err != nil {
            return 1
        }
        // disk is best effort, so we don't measure relative to a request.
        // TODO: add disk as a guaranteed resource
        p1Disk := p1Usage[diskResource]
        p2Disk := p2Usage[diskResource]
        // if p2 is using more than p1, we want p2 first
        return p2Disk.Cmp(p1Disk)
    }
}

// rankMemoryPressure orders the input pods for eviction in response to memory pressure.
func rankMemoryPressure(pods []*v1.Pod, stats statsFunc) {
    orderedBy(qosComparator, memory(stats)).Sort(pods)
}

// rankDiskPressureFunc returns a rankFunc that measures the specified fs stats.
func rankDiskPressureFunc(fsStatsToMeasure []fsStatsType, diskResource v1.ResourceName) rankFunc {
    return func(pods []*v1.Pod, stats statsFunc) {
        orderedBy(qosComparator, disk(stats, fsStatsToMeasure, diskResource)).Sort(pods)
    }
}

rankMemoryPressure orders pods under memory pressure, while rankDiskPressureFunc orders them under disk pressure. qosComparator, memory, and disk are the concrete comparison strategies.

The sort first applies qosComparator: pods whose QoS class is PodQOSBestEffort rank ahead of Burstable and Guaranteed pods, so they are killed first. When two pods share a QoS class, the memory or disk comparator breaks the tie by usage, and the heavier consumer is killed first. Each eviction pass walks the ranked list and kills at most a single pod (critical static pods are skipped), and the next pass begins 10 seconds later.
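One subtlety worth highlighting is that the memory comparator ranks by usage relative to the pod's request (p1Memory.Sub(p1Request) above), not by absolute usage. The toy program below, with made-up pod names and numbers, shows the consequence: a pod well within a large request ranks behind a smaller pod that has burst far past its request.

```go
// Toy illustration of the memory comparator's "usage over request" rule:
// the pod exceeding its request by the most is evicted first, even when
// another pod's absolute usage is higher.
package main

import (
	"fmt"
	"sort"
)

type pod struct {
	name           string
	usage, request int64 // bytes
}

// overRequest mirrors the p1Memory.Sub(p1Request) adjustment in memory().
func overRequest(p pod) int64 { return p.usage - p.request }

// rankMemory sorts so the pod furthest over its request comes first.
func rankMemory(pods []pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		return overRequest(pods[i]) > overRequest(pods[j])
	})
}

func main() {
	pods := []pod{
		{"big-but-within-request", 4 << 30, 8 << 30}, // 4Gi used, 8Gi requested
		{"small-but-bursting", 1 << 30, 128 << 20},   // 1Gi used, 128Mi requested
	}
	rankMemory(pods)
	fmt.Println("evict first:", pods[0].name) // evict first: small-but-bursting
}
```

This is why setting accurate memory requests protects a pod under node memory pressure.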

Reposted from: https://www.jianshu.com/p/ac3603f8b687
