linux内存管理（十一）-页回收总览

随着linux系统不断分配内存，当系统内存压力越来越大时，就会对系统的每个压力大的zone进程内存回收，内存回收主要是针对匿名页和文件页进行的。对于匿名页，内存回收过程中会筛选出一些不经常使用的匿名页，将它们写入到swap分区中，然后作为空闲页框释放到伙伴系统。而对于文件页，内存回收过程中也会筛选出一些不经常使用的文件页，如果此文件页中保存的内容与磁盘中文件对应内容一致，说明此文件页是一个干净的文件页，就不需要进行回写，直接将此页作为空闲页框释放到伙伴系统中，相反，如果文件页保存的数据与磁盘中文件对应的数据不一致，则认定此文件页为脏页，需要先将此文件页回写到磁盘中对应数据所在位置上，然后再将此页作为空闲页框释放到伙伴系统中。这样当内存回收完成后，系统空闲的页框数量就会增加，能够缓解内存压力。

要说清楚内存的页面回收，就必须要先理清楚内存分配过程，当我们申请分配页的时候，页分配器首先尝试使用低水线分配页，如果成功，就是走快速路径；如果失败，就是走慢速路径，说明内存轻微不足，页分配器将会唤醒内存节点的页回收内核线程，异步回收页。然后尝试使用最低水线分配页。如果使用最低水线分配失败，走最慢的路径，说明内存严重不足，页分配器会直接回收。针对不同的物理页，采用不同的回收策略：交换支持的页和存储设备支持的文件页。

在发现内存紧张时，也就是慢路径上，系统就会通过一系列机制来回收内存，比如下面这三种方式：

回收缓存，比如使用 LRU（Least Recently Used）算法，回收最近使用最少的内存页面；
Linux内核使用LRU（Least Recently Used，最近最少使用）算法选择最近最少使用的物理页。回收物理页的时候，如果物理页被映射到进程的虚拟地址空间，那么需要从页表中删除虚拟页到物理页的映射。
回收不常访问的内存，把不常用的内存通过交换分区直接写到磁盘中；
回收不常访问的内存时，会用到交换分区（以下简称 Swap）。Swap 其实就是把一块磁盘空间当成内存来用。它可以把进程暂时不用的数据存储到磁盘中（这个过程称为换出），当进程访问这些内存时，再从磁盘读取这些数据到内存中（这个过程称为换入）。
杀死进程，内存紧张时系统还会通过 OOM（Out of Memory），直接杀掉占用大量内存的进程。
OOM（Out of Memory）其实是内核的一种保护机制。它监控进程的内存使用情况，并且使用 oom_score 为每个进程的内存使用情况进行评分，一个进程消耗的内存越大，oom_score 就越大；一个进程运行占用的 CPU 越多，oom_score 就越小。这样，进程的 oom_score 越大，代表消耗的内存越多，也就越容易被 OOM 杀死，从而可以更好保护系统。

回收缓存和回收不常访问的内存一般都是使用LRU算法选择最近最少使用的物理页。

一、LRU数据结构

在linux内存管理（一）-内存管理架构中讲过，现在继续复习一下，内存的数据结构。内存管理子系统使用节点(node)，区域(zone)、页(page)三级结构描述物理内存：分别使用内存节点结构体 struct pglist_data，区域结构体struct zone，页结构体struct page表示。在内存节点结构体 struct pglist_data中，有一个struct lruvec结构体，表示lru链表描述符：

typedef struct pglist_data {......spinlock_t        lru_lock;//lru链表锁/* Fields commonly accessed by the page reclaim scanner */struct lruvec        lruvec;//lru链表描述符，里面有5个lru链表
......
} pg_data_t;struct lruvec {struct list_head     lists[NR_LRU_LISTS];//5个lru双向链表头struct zone_reclaim_stat    reclaim_stat; //与回收相关的统计数据/* Evictions & activations on the inactive file list */atomic_long_t          inactive_age;/* Refaults at the time of last reclaim cycle */unsigned long          refaults;//记录最后一次回收周期发生的结果
#ifdef CONFIG_MEMCGstruct pglist_data *pgdat;//所属内存节点结构体 struct pglist_data
#endif
};enum lru_list {LRU_INACTIVE_ANON = LRU_BASE,LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,LRU_UNEVICTABLE,NR_LRU_LISTS
};

从上面的enum lru_list中可以看出5个lru链表分别是：

不活动匿名页LRU链表，用来链接不活动的匿名页，即最近访问频率低的匿名页；
活动匿名页LRU链表，用来链接活动的匿名页，即最近访问频率高的匿名页；
不活动文件页LRU链表，用来链接不活动的文件页，即最近访问频率低的文件页；
活动文件页LRU链表，用来链接活动的文件页，即最近访问频率高的文件页；
不可回收LRU链表，用来链接使用mlock锁定在内存中、不允许回收的物理页。

//用于页描述符，一组标志(如PG_locked、PG_error)，同时页框所在的管理区和node的编号也保存在当中
struct page {/* 在lru算法中主要用到的标志* PG_active: 表示此页当前是否活跃，当放到或者准备放到活动lru链表时，被置位* PG_referenced: 表示此页最近是否被访问，每次页面访问都会被置位* PG_lru: 表示此页是处于lru链表中的* PG_mlocked: 表示此页被mlock()锁在内存中，禁止换出和释放* PG_swapbacked: 表示此页依靠swap，可能是进程的匿名页(堆、栈、数据段)，匿名mmap共享内存映射，shmem共享内存映射*/unsigned long flags;......union {/* 页处于不同情况时，加入的链表不同* 1.是一个进程正在使用的页，加入到对应lru链表和lru缓存中* 2.如果为空闲页框，并且是空闲块的第一个页，加入到伙伴系统的空闲块链表中(只有空闲块的第一个页需要加入)* 3.如果是一个slab的第一个页，则将其加入到slab链表中(比如slab的满slab链表，slub的部分空slab链表)* 4.将页隔离时用于加入隔离链表*/struct list_head lru;   ......};......}

二、页回收源码分析

前面有一篇文章详细讲解了内存页面分配过程和内存水线这个重要概念：linux内存管理（六）-伙伴分配器。在调用函数进行一次内存分配时：alloc_page → alloc_pages_current → __alloc_pages_nodemask，__alloc_pages_nodemask()这个函数是内存分配的心脏，对内存分配流程做了一个整体的组织。在上面那篇文章中没有详细分析__alloc_pages_slowpath，只是讲解了快速路劲get_page_from_freelist。现在我们详细的看看__alloc_pages_slowpath函数，函数位于mm/page_alloc.c文件中：

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,struct alloc_context *ac)
{bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;struct page *page = NULL;unsigned int alloc_flags;unsigned long did_some_progress;enum compact_priority compact_priority;enum compact_result compact_result;int compaction_retries;int no_progress_loops;unsigned int cpuset_mems_cookie;int reserve_flags;/** We also sanity check to catch abuse of atomic reserves being used by* callers that are not in atomic context.*/if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))gfp_mask &= ~__GFP_ATOMIC;retry_cpuset:compaction_retries = 0;no_progress_loops = 0;compact_priority = DEF_COMPACT_PRIORITY;//后面可能会检查cpuset是否允许当前进程从哪些内存节点申请页cpuset_mems_cookie = read_mems_allowed_begin();/** The fast path uses conservative alloc_flags to succeed only until* kswapd needs to be woken up, and to avoid the cost of setting up* alloc_flags precisely. So we do that now.*///把分配标志位转化为内部的分配标志位，调整为最低水线标志alloc_flags = gfp_to_alloc_flags(gfp_mask);/** We need to recalculate the starting point for the zonelist iterator* because we might have used different nodemask in the fast path, or* there was a cpuset modification and we are retrying - otherwise we* could end up iterating over non-eligible zones endlessly.*///获取首选的内存区域，因为在快速路径中使用了不同的节点掩码，避免再次遍历不合格的区域。ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,ac->high_zoneidx, ac->nodemask);if (!ac->preferred_zoneref->zone)goto nopage;//异步回收页，唤醒kswapd内核线程进行页面回收if (gfp_mask & __GFP_KSWAPD_RECLAIM)wake_all_kswapds(order, gfp_mask, ac);/** The adjusted alloc_flags might result in immediate success, so try* that first*///调整alloc_flags后可能会立即申请成功，所以先尝试一下page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);if (page)goto got_pg;/** For costly allocations, try direct compaction first, as it's likely* that we have enough base pages and don't need to reclaim. For non-* movable high-order allocations, do that as well, as compaction will* try prevent permanent fragmentation by migrating from blocks of the* same migratetype.* Don't try this for allocations that are allowed to ignore* watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.*///申请阶数大于0，不可移动的位于高阶的，忽略水位线的if (can_direct_reclaim &&(costly_order ||(order > 0 && ac->migratetype != MIGRATE_MOVABLE))&& !gfp_pfmemalloc_allowed(gfp_mask)) {//尝试内存压缩，进行页面迁移，然后进行页面分配page = __alloc_pages_direct_compact(gfp_mask, order,alloc_flags, ac,INIT_COMPACT_PRIORITY,&compact_result);if (page)goto got_pg;/** Checks for costly allocations with __GFP_NORETRY, which* includes THP page fault allocations*/if (costly_order && (gfp_mask & __GFP_NORETRY)) {/** If compaction is deferred for high-order allocations,* it is because sync compaction recently failed. If* this is the case and the caller requested a THP* allocation, we do not want to heavily disrupt the* system, so we fail the allocation instead of entering* direct reclaim.*/if (compact_result == COMPACT_DEFERRED)goto nopage;/** Looks like reclaim/compaction is worth trying, but* sync compaction could be very expensive, so keep* using async compaction.*///同步压缩非常昂贵，所以继续使用异步压缩compact_priority = INIT_COMPACT_PRIORITY;}}retry:/* Ensure kswapd doesn't accidentally go to sleep as long as we loop *///如果页回收线程意外睡眠则再次唤醒，确保交换线程没有意外if (gfp_mask & __GFP_KSWAPD_RECLAIM)wake_all_kswapds(order, gfp_mask, ac);//如果调用者承若给我们紧急内存使用，我们就忽略水线，进行无水线分配reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);if (reserve_flags)alloc_flags = reserve_flags;/** Reset the nodemask and zonelist iterators if memory policies can be* ignored. These allocations are high priority and system rather than* user oriented.*///如果可以忽略内存策略，则重置nodemask和zonelistif (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {ac->nodemask = NULL;ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,ac->high_zoneidx, ac->nodemask);}/* Attempt with potentially adjusted zonelist and alloc_flags *///尝试使用可能调整的区域备用列表和分配标志page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);if (page)goto got_pg;/* Caller is not willing to reclaim, we can't balance anything *///如果不可以直接回收，则申请失败if (!can_direct_reclaim)goto nopage;/* Avoid recursion of direct reclaim */if (current->flags & PF_MEMALLOC)goto nopage;/* Try direct reclaim and then allocating *///直接页面回收，然后进行页面分配page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,&did_some_progress);if (page)goto got_pg;/* Try direct compaction and then allocating *///进行页面压缩，然后进行页面分配page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,compact_priority, &compact_result);if (page)goto got_pg;/* Do not loop if specifically requested *///如果调用者要求不要重试，则放弃if (gfp_mask & __GFP_NORETRY)goto nopage;/** Do not retry costly high order allocations unless they are* __GFP_RETRY_MAYFAIL*///不要重试代价高昂的高阶分配，除非它们是__GFP_RETRY_MAYFAILif (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))goto nopage;//重新尝试回收页if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,did_some_progress > 0, &no_progress_loops))goto retry;/** It doesn't make any sense to retry for the compaction if the order-0* reclaim is not able to make any progress because the current* implementation of the compaction depends on the sufficient amount* of free memory (see __compaction_suitable)*///如果申请阶数大于0，判断是否需要重新尝试压缩if (did_some_progress > 0 &&should_compact_retry(ac, order, alloc_flags,compact_result, &compact_priority,&compaction_retries))goto retry;/* Deal with possible cpuset update races before we start OOM killing *///如果cpuset允许修改内存节点申请就修改if (check_retry_cpuset(cpuset_mems_cookie, ac))goto retry_cpuset;/* Reclaim has failed us, start killing things *///使用oom选择一个进程杀死page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);if (page)goto got_pg;/* Avoid allocations with no watermarks from looping endlessly *///如果当前进程是oom选择的进程，并且忽略了水线，则放弃申请if (tsk_is_oom_victim(current) &&(alloc_flags == ALLOC_OOM ||(gfp_mask & __GFP_NOMEMALLOC)))goto nopage;/* Retry as long as the OOM killer is making progress *///如果OOM杀手正在取得进展，再试一次if (did_some_progress) {no_progress_loops = 0;goto retry;}nopage:/* Deal with possible cpuset update races before we fail */if (check_retry_cpuset(cpuset_mems_cookie, ac))goto retry_cpuset;/** Make sure that __GFP_NOFAIL request doesn't leak out and make sure* we always retry*/if (gfp_mask & __GFP_NOFAIL) {/** All existing users of the __GFP_NOFAIL are blockable, so warn* of any new users that actually require GFP_NOWAIT*/if (WARN_ON_ONCE(!can_direct_reclaim))goto fail;/** PF_MEMALLOC request from this context is rather bizarre* because we cannot reclaim anything and only can loop waiting* for somebody to do a work for us*/WARN_ON_ONCE(current->flags & PF_MEMALLOC);/** non failing costly orders are a hard requirement which we* are not prepared for much so let's warn about these users* so that we can identify them and convert them to something* else.*/WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);/** Help non-failing allocations by giving them access to memory* reserves but do not use ALLOC_NO_WATERMARKS because this* could deplete whole memory reserves which would just make* the situation worse*///允许它们访问内存备用列表page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);if (page)goto got_pg;cond_resched();goto retry;}
fail:warn_alloc(gfp_mask, ac->nodemask,"page allocation failure: order:%u", order);
got_pg:return page;
}

上面代码可以看出，在慢速路径上系统做了很多判断，我已经整理成下图：

wake_all_kswapdsb是唤醒一个异步回收内存的线程，他有着自己的初始化函数，后面会另外讲解。get_page_from_freelist就是快速路径的函数，在之前已经讲解过。接下来会讲 __alloc_pages_direct_reclaim，__alloc_pages_direct_compact和__alloc_pages_may_oom后面讲。