Linux Memory Management (8): Non-contiguous Page Allocation and Page Tables

I. Non-contiguous Pages

1. Interface functions for non-contiguous pages
a. Basic interface functions (these run in kernel space)

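// Allocate non-contiguous physical pages and map them into a contiguous virtual address range.
void *vmalloc(unsigned long size);

// Free the physical pages and the virtual address space allocated by vmalloc.
void vfree(const void *addr);

// Map already-allocated non-contiguous physical pages into a contiguous virtual address range.
void *vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot);

// Release the virtual address space set up by vmap.
void vunmap(const void *addr);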

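For comparison, user space has its own allocation functions:
malloc/calloc/realloc/free: physical contiguity is not guaranteed. The memory comes from the process heap, so the size is limited; sizes are given in bytes.
calloc zero-initializes the memory; realloc changes the size of an existing block.
mmap/munmap: use virtual memory to map a file into the process's address space, and remove that mapping.
brk/sbrk: move the end of the heap (the program break), which grows or shrinks the virtual memory backing the heap.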
b. Other kernel-space interface functions

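// Try kmalloc first; if that fails, fall back to vmalloc, which allocates non-contiguous physical pages.
void *kvmalloc(size_t size, gfp_t flags);

// If the block was allocated with vmalloc, free it with vfree; otherwise free it with kfree.
void kvfree(const void *addr);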
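The "try one allocator, fall back to another, and free with whichever one owns the pointer" pattern can be modelled in user space. The sketch below is only an analogy: the fixed arena stands in for kmalloc, malloc stands in for vmalloc, and all `toy_*` names and `is_arena_addr` are hypothetical. The address-range test plays the role that `is_vmalloc_addr()` plays in the kernel's kvfree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ARENA_SIZE 1024            /* assumed size of the "fast" pool */

/* Hypothetical stand-in for the kmalloc side: a small, aligned arena. */
static union { max_align_t align; unsigned char bytes[ARENA_SIZE]; } arena;
static size_t arena_used;

/* Bump allocation from the arena; fails when the request does not fit. */
static void *toy_kmalloc(size_t size)
{
	if (size == 0 || size > ARENA_SIZE)
		return NULL;
	size_t rounded = (size + 15) & ~(size_t)15;   /* keep 16-byte alignment */
	assert(arena_used <= ARENA_SIZE);
	if (rounded > ARENA_SIZE - arena_used)
		return NULL;
	void *p = arena.bytes + arena_used;
	arena_used += rounded;
	return p;
}

/* Decide by address range which allocator owns p. */
static int is_arena_addr(const void *p)
{
	uintptr_t a = (uintptr_t)p, base = (uintptr_t)arena.bytes;
	return a >= base && a < base + ARENA_SIZE;
}

/* Try the fast allocator first, then fall back (here: to malloc). */
void *toy_kvmalloc(size_t size)
{
	void *p = toy_kmalloc(size);
	return p ? p : malloc(size);
}

/* Free with whichever allocator produced the pointer. */
void toy_kvfree(void *p)
{
	if (p && !is_arena_addr(p))
		free(p);
	/* The bump arena in this sketch never reclaims individual blocks. */
}
```

A caller never needs to remember which path served the request: `toy_kvfree(toy_kvmalloc(n))` is correct for small and large `n` alike.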

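vmalloc/vfree: the virtual addresses are contiguous but the physical addresses are not. The size is limited because the memory comes from the vmalloc area; the unit is a page. vmalloc may sleep, so it must not be called from interrupt context or anywhere else that blocking is not allowed.
kmalloc/kcalloc/krealloc/kfree: both the virtual and the physical addresses are contiguous. Sizes range from 64 B to 4 MB in units of 2^order bytes (Normal zone); the exact limits depend on the kernel configuration. The maximum is smaller than what vmalloc or malloc can provide. Managed by the slab allocator.
kmem_cache_create: physically contiguous. Object sizes of 64 B to 4 MB, which must be aligned (Normal zone). Suited to frequent allocation and freeing of fixed-size data: allocation takes an address from the cache pool, and freeing does not necessarily return the memory to the system. Managed by the slab allocator.
__get_free_page/__get_free_pages: physically contiguous, at most 4 MB (1024 pages) per call, in units of pages (Normal zone); they cannot return high memory (HIGHMEM). Managed by the buddy system.
alloc_page/alloc_pages/free_pages: physically contiguous, at most 4 MB per call, in units of pages (usable for both Normal and vmalloc). The configured MAX_ORDER is 11, so the most pages a single call can return is 2^10 = 1024. Managed by the buddy system.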
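The 4 MB / 1024-page figure follows from the buddy allocator handing out blocks of 2^order pages. As a rough illustration, the sketch below computes the order a request would need, in the spirit of the kernel's `get_order()` helper. It assumes 4 KB pages and a MAX_ORDER of 11 (orders 0 to 10); `toy_get_order` and the `TOY_*` constants are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL   /* assumed 4 KB pages */
#define TOY_MAX_ORDER 11       /* assumed: valid orders are 0..10 */

/*
 * Smallest order such that (page size << order) >= size,
 * or -1 if the request exceeds the largest buddy block.
 */
int toy_get_order(size_t size)
{
	assert(size > 0);
	for (int order = 0; order < TOY_MAX_ORDER; order++)
		if ((TOY_PAGE_SIZE << order) >= size)
			return order;
	return -1;
}
```

With these assumptions a 4 MB request maps to order 10 (1024 pages), and anything larger cannot be served by a single buddy allocation.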

2. Data structures used by non-contiguous page allocation

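// A virtual memory block instance
struct vm_struct {
	struct vm_struct	*next;		// next vm_struct instance
	void			*addr;		// starting virtual address
	unsigned long		size;		// length
	unsigned long		flags;		// flags
	struct page		**pages;	// array of page pointers
	unsigned int		nr_pages;	// number of pages
	phys_addr_t		phys_addr;	// starting physical address
	const void		*caller;	// address of the code that requested the area (for debugging)
};

// A range of virtual addresses
struct vmap_area {
	unsigned long va_start;		// starting virtual address
	unsigned long va_end;		// ending virtual address
	unsigned long flags;		// flags; VM_VM_AREA means member vm points to a struct vm_struct instance
	// red-black tree node, links this vmap_area into the tree rooted at vmap_area_root
	struct rb_node rb_node;		/* address sorted rbtree */
	struct list_head list;		// list node
	struct llist_node purge_list;	// node on the lazy-purge list of areas waiting to be freed
	struct vm_struct *vm;		// the associated struct vm_struct instance
	struct rcu_head rcu_head;
};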

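The two structures are related as follows: each vmap_area describes a range of virtual addresses, and when VM_VM_AREA is set its vm member points to a vm_struct, whose pages member points to the array of page pointers for the physical pages behind that range.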

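3. Analysis of the allocation process
vmalloc runs in three steps:
a. Allocate a virtual memory area.
b. Allocate physical pages.
c. Map the virtual pages to the physical pages in the kernel page table.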
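The three steps can be acted out on a toy model. The sketch below is a user-space simulation, not kernel code: "physical memory" is an array of 16 frames, the "vmalloc area" is 32 virtual pages, and the page table is a plain array. All names and sizes are hypothetical. It shows that the virtual pages handed out are consecutive while the frames behind them need not be.

```c
#include <assert.h>
#include <stddef.h>

#define NR_FRAMES 16          /* toy "physical memory": 16 page frames */
#define NR_VPAGES 32          /* toy vmalloc area: 32 virtual pages */
#define UNMAPPED  (-1)

static int frame_used[NR_FRAMES];     /* 1 if the frame is allocated */
static int page_table[NR_VPAGES];     /* virtual page -> frame, or UNMAPPED */
static int table_ready;

static void toy_init(void)
{
	for (int v = 0; v < NR_VPAGES; v++)
		page_table[v] = UNMAPPED;
	table_ready = 1;
}

/* Step a: find n consecutive unmapped virtual pages; first index or -1. */
static int find_virtual_range(int n)
{
	int run = 0;
	for (int v = 0; v < NR_VPAGES; v++) {
		run = (page_table[v] == UNMAPPED) ? run + 1 : 0;
		if (run == n)
			return v - n + 1;
	}
	return -1;
}

/* Step b: grab any free frame, one page at a time; frame number or -1. */
static int alloc_frame(void)
{
	for (int f = 0; f < NR_FRAMES; f++)
		if (!frame_used[f]) {
			frame_used[f] = 1;
			return f;
		}
	return -1;
}

/* Returns the first virtual page of the new area, or -1 on failure. */
int toy_vmalloc(int n)
{
	if (!table_ready)
		toy_init();
	assert(n > 0);
	int start = find_virtual_range(n);
	if (start < 0)
		return -1;
	for (int i = 0; i < n; i++) {
		int f = alloc_frame();
		if (f < 0) {                       /* out of frames: undo and fail */
			while (--i >= 0) {
				frame_used[page_table[start + i]] = 0;
				page_table[start + i] = UNMAPPED;
			}
			return -1;
		}
		page_table[start + i] = f;         /* step c: map page -> frame */
	}
	return start;
}
```

If frames 1 and 3 are already in use, a three-page request gets virtual pages 0, 1, 2 backed by frames 0, 2, 4: contiguous on the virtual side, scattered on the physical side.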

The vmalloc virtual address space covers the range (VMALLOC_START, VMALLOC_END). Every processor architecture has to define these two macros; the ARM64 definitions are:

/*
 * VMALLOC range.
 *
 * VMALLOC_START: beginning of the kernel vmalloc space
 * VMALLOC_END: extends to the available space below vmmemmap, PCI I/O space
 *	and fixed mappings
 */
#define VMALLOC_START		(MODULES_END)
#define VMALLOC_END		(PAGE_OFFSET - PUD_SIZE - VMEMMAP_SIZE - SZ_64K)

#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START)

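MODULES_END is the end address of the kernel module area; PAGE_OFFSET is the start address of the linear mapping area; PUD_SIZE is the length of address space mapped by one page upper directory entry; VMEMMAP_SIZE is the length of the vmemmap area.
Start of the vmalloc virtual address space = end address of the kernel module area
End of the vmalloc virtual address space = start address of the linear mapping area - length mapped by one page upper directory entry - length of the vmemmap area - 64 KB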

Note: the kernel page table is the page table of kernel thread 0. vmap differs from vmalloc in that it does not need to allocate physical pages.

II. Page Tables

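A page table is a special data structure, kept in the page-table area of system space, that records the correspondence between logical pages and physical page frames. Every process has its own page table, and its PCB (process control block) holds a pointer to it.
The CPU has a page-table register that holds the start address and length of the current process's page table. The page number (the logical address divided by the page size) is first compared with the page-table length to confirm that it lies within the table. The page number is then multiplied by the size of a page-table entry to get the entry's offset from the page-table base; adding the base address locates the entry, which gives the corresponding frame. Once the CPU has the frame's start address, it adds the offset within the page to reach the final target address. While a process is not running on the CPU, the start address and length of its page table are kept in its PCB.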
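The lookup just described (bounds check, index into the table, then frame address plus offset) fits in a few lines. This is a minimal sketch of a single-level table, assuming 4 KB pages; `struct toy_page_table` and `toy_translate` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                      /* assumed 4 KB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

struct toy_page_table {
	const uint32_t *frames;   /* frames[page number] = frame number */
	size_t length;            /* number of entries (the "page table length") */
};

/*
 * Returns 0 and stores the physical address in *paddr,
 * or -1 if the page number lies outside the table.
 */
int toy_translate(const struct toy_page_table *pt, uint64_t vaddr,
		  uint64_t *paddr)
{
	assert(pt && paddr);
	uint64_t page   = vaddr >> TOY_PAGE_SHIFT;        /* page number */
	uint64_t offset = vaddr & (TOY_PAGE_SIZE - 1);    /* offset in the page */

	if (page >= pt->length)                           /* bounds check */
		return -1;
	*paddr = ((uint64_t)pt->frames[page] << TOY_PAGE_SHIFT) | offset;
	return 0;
}
```

For a table `{7, 3, 9}`, address 0x1234 is page 1 with offset 0x234, so it translates to frame 3, that is 0x3234; address 0x3000 is page 3, which is past the end of the table and is rejected.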

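The CPU does not access physical memory addresses directly; it reaches them through a virtual address space. The virtual address space is the range of logical addresses that the operating system gives each running process, for example 0 to 4G-1 on a 32-bit system. By setting up a mapping between the virtual address space and physical memory, the operating system lets the CPU access physical memory indirectly. The virtual address space is usually divided into units of 512 bytes to 8 KB called pages, numbered from 0; the size of this unit is the page size. Physical memory is divided into units of the same size, called frames or blocks, also numbered from 0. The OS maintains a table that records the mapping between each page and its frame: every page in a process's logical address space has an entry in that process's page table giving the physical block number, so a page-table lookup is enough to find where the page lives in memory. The page table thus maps logical addresses to physical addresses. Most systems use a 4 KB page size.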

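ARM64 processors call page tables translation tables, with at most 4 levels. ARM64 supports three page sizes: 4 KB, 16 KB and 64 KB. The page size and the width of the virtual address together determine the number of translation-table levels.
In general, the Linux kernel divides the page table into 4 levels: page global directory (PGD), page upper directory (PUD), page middle directory (PMD) and the page table proper (PT).
With three levels, they are the page global directory (PGD), the page middle directory (PMD) and the page table (PT).
With two levels, they are the page global directory (PGD) and the page table (PT).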

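In fact Linux also supports a five-level page table structure. Each process has its own page tables, and the pgd member of the process's mm_struct instance points to the page global directory. Entries in the first four levels hold the start address of the next-level table; entries in the last-level page table hold the page frame number (PFN).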

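With five levels, a page-table walk converts a virtual address into a physical address as follows:
1. From the start address of the page global directory and the PGD index, get the address of the PGD entry, then read the start address of the page fourth-level directory (P4D) from that entry.
2. From the start address of the P4D and the P4D index, get the address of the P4D entry, then read the start address of the page upper directory from that entry.
3. From the start address of the page upper directory and the PUD index, get the address of the PUD entry, then read the start address of the page middle directory from that entry.
4. From the start address of the page middle directory and the PMD index, get the address of the PMD entry, then read the start address of the page table from that entry.
5. From the start address of the page table and the page-table index, get the address of the page-table entry, then read the page frame number from that entry.
6. Combine the page frame number with the offset within the page to form the physical address.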
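The six steps above can be traced with a small user-space model. The sketch below assumes five levels of 512-entry tables and 4 KB pages. Non-leaf entries hold a pointer to the next table, and leaf entries hold the frame number plus a valid bit. Every name here (`toy_table`, `toy_map`, `toy_walk`) is hypothetical, and the entry layout is a simplification rather than any real hardware format; the intermediate tables are never freed, to keep the sketch short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_LEVELS     5                 /* PGD, P4D, PUD, PMD, PT */
#define TOY_BITS       9                 /* 512 entries per table */
#define TOY_ENTRIES    (1u << TOY_BITS)
#define TOY_PAGE_SHIFT 12                /* assumed 4 KB pages */
#define TOY_VALID      1ull              /* leaf "present" bit */

struct toy_table { uint64_t entry[TOY_ENTRIES]; };

/* Index into the table at the given level (0 = PGD ... 4 = PT). */
static unsigned idx(uint64_t va, int level)
{
	int shift = TOY_PAGE_SHIFT + TOY_BITS * (TOY_LEVELS - 1 - level);
	return (unsigned)((va >> shift) & (TOY_ENTRIES - 1));
}

/* Install va -> pfn, creating intermediate tables as needed. 0 or -1. */
int toy_map(struct toy_table *pgd, uint64_t va, uint64_t pfn)
{
	assert(pgd);
	struct toy_table *t = pgd;
	for (int level = 0; level < TOY_LEVELS - 1; level++) {
		uint64_t *e = &t->entry[idx(va, level)];
		if (!*e) {
			struct toy_table *next = calloc(1, sizeof(*next));
			if (!next)
				return -1;
			*e = (uint64_t)(uintptr_t)next;
		}
		t = (struct toy_table *)(uintptr_t)*e;
	}
	t->entry[idx(va, TOY_LEVELS - 1)] = (pfn << TOY_PAGE_SHIFT) | TOY_VALID;
	return 0;
}

/* Walk PGD -> P4D -> PUD -> PMD -> PT; 0 and *pa on success, else -1. */
int toy_walk(const struct toy_table *pgd, uint64_t va, uint64_t *pa)
{
	assert(pgd && pa);
	const struct toy_table *t = pgd;
	for (int level = 0; level < TOY_LEVELS - 1; level++) {   /* steps 1-4 */
		uint64_t e = t->entry[idx(va, level)];
		if (!e)
			return -1;                    /* no next-level table */
		t = (const struct toy_table *)(uintptr_t)e;
	}
	uint64_t pte = t->entry[idx(va, TOY_LEVELS - 1)];        /* step 5 */
	if (!(pte & TOY_VALID))
		return -1;
	uint64_t pfn = pte >> TOY_PAGE_SHIFT;
	*pa = (pfn << TOY_PAGE_SHIFT) |                          /* step 6 */
	      (va & ((1ull << TOY_PAGE_SHIFT) - 1));
	return 0;
}
```

Mapping a virtual address to frame 0x42 and then walking the same address yields 0x42 shifted left by 12, combined with the original 12-bit offset; walking an address that was never mapped fails at the first empty entry.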

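Most Linux systems use the four-level structure, so let us take a four-level page table as the example. 64-bit systems generally use 48-bit virtual addresses, and the relationship between page size and the number of translation-table levels is as follows: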

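With 4 KB pages, each translation table occupies one page and has 512 entries, so each level is indexed by 9 bits of the 48-bit virtual address (4 x 9 = 36 bits), and the remaining 12 bits are the offset within the 4 KB page.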
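The 9 + 9 + 9 + 9 + 12 split can be written out as shifts and masks. This is a minimal sketch assuming 4 KB pages and a 48-bit virtual address; `struct va_fields` and `split_va48` are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

struct va_fields {
	unsigned pgd, pud, pmd, pte;   /* 9-bit table indexes */
	unsigned offset;               /* 12-bit offset inside the 4 KB page */
};

struct va_fields split_va48(uint64_t va)
{
	/* A 48-bit address has its upper 16 bits all zero or all one. */
	assert((va >> 48) == 0 || (va >> 48) == 0xffff);

	struct va_fields f = {
		.pgd    = (unsigned)((va >> 39) & 0x1ff),   /* bits 47..39 */
		.pud    = (unsigned)((va >> 30) & 0x1ff),   /* bits 38..30 */
		.pmd    = (unsigned)((va >> 21) & 0x1ff),   /* bits 29..21 */
		.pte    = (unsigned)((va >> 12) & 0x1ff),   /* bits 20..12 */
		.offset = (unsigned)(va & 0xfff),           /* bits 11..0  */
	};
	return f;
}
```

Each index selects one of 512 entries in its table, and the offset selects one of 4096 bytes in the final page.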

III. Analysis of the vmalloc Path

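Let us look at the implementation of vmalloc(). The version shown here is in mm/nommu.c, which is built for kernels without an MMU (on kernels with an MMU, vmalloc() is implemented in mm/vmalloc.c). The code is as follows: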

/**
 *	vmalloc  -  allocate virtually contiguous memory
 *
 *	@size:		allocation size
 *
 *	Allocate enough pages to cover @size from the page level
 *	allocator and map them into contiguous kernel virtual space.
 *
 *	For tight control over page level allocator and protection flags
 *	use __vmalloc() instead.
 */
void *vmalloc(unsigned long size)
{
	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
}
EXPORT_SYMBOL(vmalloc);

As the code shows, vmalloc() simply calls __vmalloc(), which is implemented as follows:

void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot)
{
	/*
	 *  You can't specify __GFP_HIGHMEM with kmalloc() since kmalloc()
	 * returns only a logical address.
	 */
	return kmalloc(size, (gfp_mask | __GFP_COMP) & ~__GFP_HIGHMEM);
}
EXPORT_SYMBOL(__vmalloc);

__vmalloc() simply calls kmalloc() with __GFP_HIGHMEM cleared. kmalloc() is defined in include/linux/slab.h:

/**
 * kmalloc - allocate memory
 * @size: how many bytes of memory are required.
 * @flags: the type of memory to allocate.
 *
 * kmalloc is the normal method of allocating memory
 * for objects smaller than page size in the kernel.
 *
 * The @flags argument may be one of:
 *
 * %GFP_USER - Allocate memory on behalf of user.  May sleep.
 *
 * %GFP_KERNEL - Allocate normal kernel ram.  May sleep.
 *
 * %GFP_ATOMIC - Allocation will not sleep.  May use emergency pools.
 *   For example, use this inside interrupt handlers.
 *
 * %GFP_HIGHUSER - Allocate pages from high memory.
 *
 * %GFP_NOIO - Do not do any I/O at all while trying to get memory.
 *
 * %GFP_NOFS - Do not make any fs calls while trying to get memory.
 *
 * %GFP_NOWAIT - Allocation will not sleep.
 *
 * %__GFP_THISNODE - Allocate node-local memory only.
 *
 * %GFP_DMA - Allocation suitable for DMA.
 *   Should only be used for kmalloc() caches. Otherwise, use a
 *   slab created with SLAB_DMA.
 *
 * Also it is possible to set different flags by OR'ing
 * in one or more of the following additional @flags:
 *
 * %__GFP_HIGH - This allocation has high priority and may use emergency pools.
 *
 * %__GFP_NOFAIL - Indicate that this allocation is in no way allowed to fail
 *   (think twice before using).
 *
 * %__GFP_NORETRY - If memory is not immediately available,
 *   then give up at once.
 *
 * %__GFP_NOWARN - If allocation fails, don't issue any warnings.
 *
 * %__GFP_RETRY_MAYFAIL - Try really hard to succeed the allocation but fail
 *   eventually.
 *
 * There are other flags available as well, but these are not intended
 * for general use, and so are not documented here. For a full list of
 * potential flags, always refer to linux/gfp.h.
 */
static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
	if (__builtin_constant_p(size)) {
		// from here on, look for a cache whose object size is just large enough for size
		if (size > KMALLOC_MAX_CACHE_SIZE)
			return kmalloc_large(size, flags);
#ifndef CONFIG_SLOB
		if (!(flags & GFP_DMA)) {
			unsigned int index = kmalloc_index(size);

			if (!index)
				return ZERO_SIZE_PTR;

			return kmem_cache_alloc_trace(kmalloc_caches[index],
					flags, size);
		}
#endif
	}
	return __kmalloc(size, flags);
}

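When size is a compile-time constant, kmalloc() picks the kmalloc cache of the right size directly (or calls kmalloc_large() for oversized requests); otherwise it calls __kmalloc(), in mm/slab.c: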

void *__kmalloc(size_t size, gfp_t flags)
{
	return __do_kmalloc(size, flags, _RET_IP_);
}

__kmalloc() simply calls __do_kmalloc(), also in mm/slab.c:

/**
 * __do_kmalloc - allocate memory
 * @size: how many bytes of memory are required.
 * @flags: the type of memory to allocate (see kmalloc).
 * @caller: function caller for debug tracking of the caller
 */
static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
					  unsigned long caller)
{
	struct kmem_cache *cachep;
	void *ret;

	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
		return NULL;
	cachep = kmalloc_slab(size, flags);
	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
		return cachep;
	ret = slab_alloc(cachep, flags, caller);

	kasan_kmalloc(cachep, ret, size, flags);
	trace_kmalloc(caller, ret,
		      size, cachep->size, flags);

	return ret;
}

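__do_kmalloc() first calls kmalloc_slab() to find the kmem_cache that serves the requested size, then calls slab_alloc() to allocate an object from it. slab_alloc() should look familiar: it is the function we discussed earlier in the slab article.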

/*
 * Find the kmem_cache structure that serves a given size of
 * allocation
 */
struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
{
	unsigned int index;

	if (size <= 192) {
		if (!size)
			return ZERO_SIZE_PTR;

		index = size_index[size_index_elem(size)];
	} else {
		if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) {
			WARN_ON(1);
			return NULL;
		}
		index = fls(size - 1);
	}

#ifdef CONFIG_ZONE_DMA
	if (unlikely((flags & GFP_DMA)))
		return kmalloc_dma_caches[index];

#endif
	return kmalloc_caches[index];
}

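kmalloc_slab() looks up, in the slab allocator, the kmem_cache structure that serves the given allocation size. kmem_cache should also look familiar: it is the basic structure from the earlier slab discussion. These general-purpose caches are set up in advance, one per size class. Objects that were allocated from a cache and later freed stay in it, and because the kernel treats them as still hot, the next request in the same size class is served from them directly.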
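To make the `index = fls(size - 1)` branch concrete, the sketch below reimplements `fls()` in portable C and computes the power-of-two size class that a request larger than 192 bytes lands in. It is a user-space illustration only: `toy_fls` and `toy_kmalloc_class_size` are hypothetical names, the 4 MB upper bound is an assumption taken from the limits quoted earlier, and the table-driven path for sizes up to 192 bytes is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Position of the most significant set bit, counting from 1; 0 for x == 0. */
static int toy_fls(unsigned int x)
{
	int pos = 0;
	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/*
 * Object size of the cache chosen for a request above 192 bytes:
 * the request is rounded up to the next power of two.
 */
size_t toy_kmalloc_class_size(size_t size)
{
	assert(size > 192 && size <= (1u << 22));   /* assumed 4 MB upper bound */
	int index = toy_fls((unsigned int)size - 1);
	return (size_t)1 << index;
}
```

A 256-byte request stays in the 256-byte class, while a 257-byte request moves up to the 512-byte class, so almost half of that object goes unused.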

Next, let us look at slab_alloc(), in mm/slab.c:

static __always_inline void *
slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
{
	unsigned long save_flags;
	void *objp;

	flags &= gfp_allowed_mask;
	cachep = slab_pre_alloc_hook(cachep, flags);
	if (unlikely(!cachep))
		return NULL;

	cache_alloc_debugcheck_before(cachep, flags);
	local_irq_save(save_flags);
	objp = __do_cache_alloc(cachep, flags);
	local_irq_restore(save_flags);
	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
	prefetchw(objp);

	if (unlikely(flags & __GFP_ZERO) && objp)
		memset(objp, 0, cachep->object_size);

	slab_post_alloc_hook(cachep, flags, 1, &objp);
	return objp;
}

slab_alloc() does its real work by calling __do_cache_alloc(), also in mm/slab.c:

static __always_inline void *
__do_cache_alloc(struct kmem_cache *cache, gfp_t flags)
{
	void *objp;

	if (current->mempolicy || cpuset_do_slab_mem_spread()) {
		objp = alternate_node_alloc(cache, flags);
		if (objp)
			goto out;
	}
	objp = ____cache_alloc(cache, flags);

	/*
	 * We may just have run out of memory on the local node.
	 * ____cache_alloc_node() knows how to locate memory on other nodes
	 */
	if (!objp)
		objp = ____cache_alloc_node(cache, flags, numa_mem_id());

out:
	return objp;
}

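This should look familiar too: ____cache_alloc() is a function we covered earlier, so go back and review it if you have forgotten. ____cache_alloc() is the core function of slab allocation.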

// The core function of slab allocation
static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
	void *objp;
	struct array_cache *ac;

	check_irq_off();

	ac = cpu_cache_get(cachep);	// get this CPU's local array cache
	if (likely(ac->avail)) {	// the local cache holds free objects
		ac->touched = 1;	// mark it as recently used, then allocate from it
		objp = ac->entry[--ac->avail];

		STATS_INC_ALLOCHIT(cachep);
		goto out;
	}

	STATS_INC_ALLOCMISS(cachep);
	// reaching this point means the local cache is empty
	objp = cache_alloc_refill(cachep, flags);
	/*
	 * the 'ac' may be updated by cache_alloc_refill(),
	 * and kmemleak_erase() requires its correct value.
	 */
	ac = cpu_cache_get(cachep);

out:
	/*
	 * To avoid a false negative, if an object that is in one of the
	 * per-CPU caches is leaked, we need to make sure kmemleak doesn't
	 * treat the array pointers as a reference to the object.
	 */
	if (objp)
		kmemleak_erase(&ac->entry[ac->avail]);
	return objp;
}
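The fast path in ____cache_alloc() is a last-in, first-out stack of object pointers: the most recently freed object is handed out first. The sketch below models just that behaviour in user space. The struct mirrors only the `avail`, `touched` and `entry` fields used above; `AC_LIMIT`, `toy_cache_alloc`, `toy_cache_free` and the refill callback are hypothetical simplifications, and the real free path flushes a batch back to the slab lists instead of failing when the array is full.

```c
#include <assert.h>
#include <stddef.h>

#define AC_LIMIT 4                   /* assumed capacity of the array */

struct toy_array_cache {
	unsigned int avail;          /* number of cached object pointers */
	int touched;                 /* set when the cache served a request */
	void *entry[AC_LIMIT];       /* LIFO stack of free objects */
};

/* Hypothetical hook standing in for cache_alloc_refill(). */
typedef void *(*toy_refill_fn)(void);

/* Hit: pop the most recently freed object. Miss: ask the refill hook. */
void *toy_cache_alloc(struct toy_array_cache *ac, toy_refill_fn refill)
{
	assert(ac && ac->avail <= AC_LIMIT);
	if (ac->avail) {
		ac->touched = 1;
		return ac->entry[--ac->avail];
	}
	return refill ? refill() : NULL;
}

/* Push a freed object; returns -1 if the array is already full. */
int toy_cache_free(struct toy_array_cache *ac, void *obj)
{
	assert(ac && obj);
	if (ac->avail == AC_LIMIT)
		return -1;
	ac->entry[ac->avail++] = obj;
	return 0;
}
```

Freeing object A and then object B, and then allocating twice, returns B first and A second, which is the "hot object is reused first" behaviour.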

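____cache_alloc_node() allocates from the slab lists of the given node. If that node has no slab with a free object, it grows the cache: cache_grow_begin() creates a new slab and cache_grow_end() adds it to the node's slab list. If no object can be obtained that way either, it falls back to fallback_alloc().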

/*
 * A interface to enable slab creation on nodeid
 */
static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
				int nodeid)
{
	struct page *page;
	struct kmem_cache_node *n;
	void *obj = NULL;
	void *list = NULL;

	VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);
	n = get_node(cachep, nodeid);
	BUG_ON(!n);

	check_irq_off();
	spin_lock(&n->list_lock);
	page = get_first_slab(n, false);
	if (!page)
		goto must_grow;

	check_spinlock_acquired_node(cachep, nodeid);

	STATS_INC_NODEALLOCS(cachep);
	STATS_INC_ACTIVE(cachep);
	STATS_SET_HIGH(cachep);

	BUG_ON(page->active == cachep->num);

	obj = slab_get_obj(cachep, page);
	n->free_objects--;

	fixup_slab_list(cachep, n, page, &list);

	spin_unlock(&n->list_lock);
	fixup_objfreelist_debug(cachep, &list);
	return obj;

must_grow:
	spin_unlock(&n->list_lock);
	page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid);
	if (page) {
		/* This slab isn't counted yet so don't update free_objects */
		obj = slab_get_obj(cachep, page);
	}
	cache_grow_end(cachep, page);

	return obj ? obj : fallback_alloc(cachep, flags);
}