Why is early boot memory needed?

Early boot memory covers the memory-management stage from power-on until the kernel's memory-management model has been built. Strictly speaking it is only an intermediate stage of memory management during startup: once the sparse memory model's data has been initialized, the sparse model takes over memory management from the boot memory allocator. An obvious question is why the kernel does not simply wait until the sparse memory model is ready and allocate memory through it. Mike Rapoport gave the answer in "A quick history of early-boot memory allocators":

The problem is that the primary Linux page allocator is a complex beast and it, too, needs to allocate memory to initialize itself. Moreover, the page-allocator data structures should be allocated in a NUMA-aware way. So another solution is required to get to the point where the memory-management subsystem can become fully operational.

The sparse model's own management data requires a complex initialization, and that initialization itself has to allocate memory (mem_map, for example). A simpler memory-management subsystem is therefore needed before the sparse model is up, to hand out memory on its behalf; on NUMA systems in particular, it must be able to allocate memory on the requested node. Compared with the sparse model, early boot memory does not have to cover especially complex scenarios: the system is still initializing, so no workload-specific cases arise, which keeps early boot memory relatively simple, and problems such as memory fragmentation can be ignored.

Early boot memory history

The boot memory machinery evolved as systems grew more complex. In the 1.0 era memory initialization was quite simple: early boot memory was just a global memory_start variable recording the start of free memory, and an allocation simply advanced memory_start by the requested amount. By 2.0 Linux already supported five more architectures, but boot-time memory management still followed the 1.0 approach and kept using memory_start; chip architectures of the time were relatively simple, so only small modifications were needed. Up until 2.3.23pre3, early memory allocation relied on these global variables; anything that needed more elaborate memory management could wait until the memory model was initialized and use the page and slab allocators.

In the early days, Linux didn't have an early memory allocator; in the 1.0 kernel, memory initialization was not as robust and versatile as it is today. Every subsystem initialization call, or simply any function called from start_kernel(), had access to the starting address of the single block of free memory via the global memory_start variable. If a function needed to allocate memory it just increased memory_start by the desired amount. By the time v2.0 was released, Linux was already ported to five more architectures, but boot-time memory management remained as simple as in v1.0, with the only difference being that the extents of the physical memory were detected by the architecture-specific code. It should be noted, though, that hardware in those days was much simpler and memory configurations could be detected more easily.

Up until version 2.3.23pre3, all early memory allocations used global variables indicating the beginning and end of free memory and adjusted them accordingly. Luckily, the page and slab allocators were available early, so heavy memory users, such as buffers_init() and page_cache_init(), could use them. Still, as hardware evolved and became more sophisticated, the architecture-specific code dealing with memory had grown quite a bit of complex cruft.

The 2.3.23pre3 patch set introduced the first early boot memory mechanism, bootmem. Its core idea is a bitmap in which each bit represents the state of one physical page: a set bit means the corresponding page is busy or absent. It took until 2.3.48 for all architectures to be converted to it.

The 2.3.23pre3 patch set included the first bootmem allocator implementation, which used a bitmap to represent the status of each physical memory page. Cleared bits identified available pages, while set bits meant that the corresponding memory pages were busy or absent. All the generic functions that tweaked memory_start and the i386 initialization code were converted to use bootmem, but other architectures were left behind. They were converted by the time version 2.3.48 was ready. Meanwhile, Linux was ported to Itanium (ia64), which was the first architecture to start off using bootmem.

The bootmem mechanism has significant drawbacks, however: creating the bitmap requires knowing the physical memory configuration up front, knowing how large the bitmap must be, and finding a contiguous range of physical memory large enough to hold it.

The major drawback of bootmem is the bitmap initialization. To create this bitmap, it is necessary to know the physical memory configuration. What is the correct size of the bitmap? Which memory bank has enough contiguous physical memory to store the bitmap? And, of course, as memory sizes increase so does the bootmem bitmap.

To solve these problems the 64-bit PowerPC architecture introduced the LMB (logical memory block) allocator, the predecessor of memblock. LMB represents memory as two arrays of regions: one describes the physically contiguous memory present in the system, the other tracks allocated ranges. The mechanism was later adopted by SPARC and eventually evolved into what is now memblock.

Over time, memory detection has evolved from simply asking the BIOS for the size of the extended memory block to dealing with complex tables, pieces, banks, and clusters. In particular, the Power64 architecture came prepared, bringing with it the Logical Memory Block allocator (or LMB). With LMB, memory is represented as two arrays of regions. The first array describes the physically contiguous memory areas available in the system, while the second array tracks allocated regions. The LMB allocator made its way into 32-bit PowerPC when the 32-bit and 64-bit architectures were merged. Later on it was adopted by SPARC. Eventually LMB made its way to other architectures and became what is now known as memblock.

memblock is designed on the assumption that there will be relatively few allocation and free requests; it performs no fragmentation management, and its lifetime is limited to the system-initialization phase.

The design of memblock relies on the assumption that there will be relatively few allocation and deallocation requests before the primary page allocator is up and running. It does not need to be especially smart, since its lifetime is limited before it hands off all the memory to the buddy page allocator.

Technology is never built in a single step; it is the result of long, continuous iteration.

memblock data structures

memblock is the mechanism currently used for early boot memory management, and its core management structures are fairly simple.

All of memblock's data is stored in the global memblock variable:

extern struct memblock memblock;

struct memblock manages physical memory by memblock_type; each memblock_type contains multiple memblock_region entries, and each region records its starting physical address, its size and its flags.

struct memblock

struct memblock is the main data structure of the memblock allocator, also referred to as the memblock allocator metadata. Its main members are as follows (kernel version 5.8.10):

struct memblock {
	bool bottom_up;  /* is bottom up direction? */
	phys_addr_t current_limit;
	struct memblock_type memory;
	struct memblock_type reserved;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
	struct memblock_type physmem;
#endif
};
  • bool bottom_up: direction in which the allocator hands out memory: true means allocating from low addresses (just above the end of the kernel image) towards high addresses, false means allocating from high addresses downwards (see the sketch after this list).
  • phys_addr_t current_limit: upper limit for allocations, typically checked when memory is requested through memblock_alloc().
  • struct memblock_type memory: the physical memory that memblock can manage and allocate from (ranges that are already in use at boot, e.g. by the loaded kernel image, stay listed here but are tracked separately in reserved).
  • struct memblock_type reserved: memory that is already allocated or set aside, consisting of two parts: ranges occupied before memblock took over (kernel image, firmware reservations) and ranges allocated from memory through memblock_alloc().
  • struct memblock_type physmem: the set of all physical memory; only present when CONFIG_HAVE_MEMBLOCK_PHYS_MAP is enabled.
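
Both bottom_up and current_limit can be adjusted by architecture code through helpers in include/linux/memblock.h. A minimal sketch, assuming early setup code wants to steer the allocator (the function name early_limit_example() and the 4 GiB limit are invented for illustration):

/* Hypothetical early setup code adjusting the allocator's behaviour. */
#include <linux/memblock.h>
#include <linux/sizes.h>

void __init early_limit_example(void)
{
	/* Allocate bottom-up (from just above the kernel image) so that
	 * early allocations stay close to the kernel. */
	memblock_set_bottom_up(true);

	/* Cap allocations below 4 GiB until the full direct mapping exists. */
	memblock_set_current_limit(SZ_4G);
}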

struct memblock_type

struct memblock_type describes the memory managed per memblock type; as struct memblock shows, there are three types: memory, reserved and physmem. Each memblock_type holds an array of regions, and each region covers a contiguous range of physical memory. struct memblock_type looks as follows:

struct memblock_type {
	unsigned long cnt;
	unsigned long max;
	phys_addr_t total_size;
	struct memblock_region *regions;
	char *name;
};
  • unsigned long cnt: number of regions currently contained in this memblock_type
  • unsigned long max: maximum number of regions this memblock_type can hold; the default is 128 (INIT_MEMBLOCK_REGIONS)
  • phys_addr_t total_size: total size of all regions in this memblock_type
  • struct memblock_region *regions: pointer to the first element of the regions array
  • char *name: name of the memblock_type

struct memblock_region

struct memblock_region represents one region of physical memory:

struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
	enum memblock_flags flags;
#ifdef CONFIG_NEED_MULTIPLE_NODES
	int nid;
#endif
};
  • phys_addr_t base: starting physical address of the region
  • phys_addr_t size: size of the region
  • enum memblock_flags flags: region flags
  • int nid: node id

memblock API

The memblock module's commonly used APIs:

memblock_add()

memblock_add() adds a range of physical memory to memblock.memory; note that it only ever adds to memblock.memory:

int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)

memblock_reserve()

Adds a range of physical memory to memblock.reserved:

int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)

memblock_physmem_add()

Adds a range of physical memory to memblock.physmem:

int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)

memblock_free()

Frees a range of physical memory. Since memory obtained through memblock_alloc() is recorded in memblock.reserved, freeing actually removes the range from memblock.reserved rather than from memblock.memory; the freed memory is not handed to the buddy allocator either.

int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)

memblock_alloc_exact_nid_raw()

The allocation function memblock exposes for the early boot stage:

void * __init memblock_alloc_exact_nid_raw(
            phys_addr_t size, phys_addr_t align,
            phys_addr_t min_addr, phys_addr_t max_addr,
            int nid)

memblock_is_reserved()

Checks whether an address falls within memblock.reserved:

bool __init_memblock memblock_is_reserved(phys_addr_t addr)

memblock_is_memory()

Checks whether an address falls within memblock.memory:

bool __init_memblock memblock_is_memory(phys_addr_t addr)

memblock_free_all()

Releases all remaining free pages to the buddy allocator:

unsigned long __init memblock_free_all(void)
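
To show how these calls fit together, below is a minimal, hypothetical sketch of what an architecture's early setup might do with them. The function name early_mm_example() and the address range are invented; real architectures take the ranges from firmware data such as the e820 table or the device tree:

/* Hypothetical early-boot sequence, for illustration only. */
#include <linux/memblock.h>
#include <linux/sizes.h>
#include <asm/sections.h>

void __init early_mm_example(void)
{
	void *buf;

	/* 1. Register the RAM reported by firmware (range invented). */
	memblock_add(0x80000000UL, SZ_2G);

	/* 2. Reserve what is already occupied, e.g. the kernel image. */
	memblock_reserve(__pa_symbol(_text), _end - _text);

	/* 3. Until buddy is ready, early allocations are served by memblock;
	 *    memblock_alloc() returns zeroed memory or NULL on failure. */
	buf = memblock_alloc(SZ_1M, PAGE_SIZE);
	if (!buf)
		panic("early allocation failed\n");

	/* 4. Much later, mem_init() calls memblock_free_all() to hand all
	 *    remaining free pages over to the buddy allocator. */
}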

Common iteration macros

Besides the functions above, memblock also provides commonly used iteration macros.

Iterate over all regions of a memblock_type:

#define for_each_memblock(memblock_type, region)					\
	for (region = memblock.memblock_type.regions;					\
	     region < (memblock.memblock_type.regions + memblock.memblock_type.cnt);	\
	     region++)

In addition to iterating over all regions of a memblock_type, iterating with an explicit index is also supported:

#define for_each_memblock_type(i, memblock_type, rgn)			\
	for (i = 0, rgn = &memblock_type->regions[0];			\
	     i < memblock_type->cnt;					\
	     i++, rgn = &memblock_type->regions[i])
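
A short sketch of how the two macros differ in use (dump_regions_example() is a made-up name; note that for_each_memblock() comes from include/linux/memblock.h and pastes the field name of the global memblock variable, while for_each_memblock_type() is defined in mm/memblock.c, so the second loop would only compile there):

#include <linux/memblock.h>
#include <linux/printk.h>

static void __init dump_regions_example(void)
{
	struct memblock_region *reg;
	struct memblock_type *type = &memblock.reserved;
	int i;

	/* The token "memory" expands to memblock.memory.{regions,cnt}. */
	for_each_memblock(memory, reg)
		pr_info("memory   [%pa, size %pa]\n", &reg->base, &reg->size);

	/* Pointer + index form, as used inside mm/memblock.c itself. */
	for_each_memblock_type(i, type, reg)
		pr_info("reserved region %d: [%pa, size %pa]\n",
			i, &reg->base, &reg->size);
}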

memblock source code

memblock initialization

memblock is initialized statically with one default region per type, as follows:

static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS] __initdata_memblock;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
static struct memblock_region memblock_physmem_init_regions[INIT_PHYSMEM_REGIONS] __initdata_memblock;
#endif

struct memblock memblock __initdata_memblock = {
	.memory.regions		= memblock_memory_init_regions,
	.memory.cnt		= 1,	/* empty dummy entry */
	.memory.max		= INIT_MEMBLOCK_REGIONS,
	.memory.name		= "memory",

	.reserved.regions	= memblock_reserved_init_regions,
	.reserved.cnt		= 1,	/* empty dummy entry */
	.reserved.max		= INIT_MEMBLOCK_RESERVED_REGIONS,
	.reserved.name		= "reserved",

#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
	.physmem.regions	= memblock_physmem_init_regions,
	.physmem.cnt		= 1,	/* empty dummy entry */
	.physmem.max		= INIT_PHYSMEM_REGIONS,
	.physmem.name		= "physmem",
#endif

	.bottom_up		= false,
	.current_limit		= MEMBLOCK_ALLOC_ANYWHERE,
};

memblock_memory_init_regions, memblock_reserved_init_regions and memblock_physmem_init_regions are the static region arrays backing the memory, reserved and physmem memblock_types, respectively.

Adding memory to memblock

Taking x86 as the example: after the e820 table has been built, e820__memblock_setup() adds the e820 entries to memblock:


void __init e820__memblock_setup(void)
{
	int i;
	u64 end;

	/*
	 * The bootstrap memblock region count maximum is 128 entries
	 * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
	 * than that - so allow memblock resizing.
	 *
	 * This is safe, because this call happens pretty late during x86 setup,
	 * so we know about reserved memory regions already. (This is important
	 * so that memblock resizing does no stomp over reserved areas.)
	 */
	memblock_allow_resize();

	for (i = 0; i < e820_table->nr_entries; i++) {
		struct e820_entry *entry = &e820_table->entries[i];

		end = entry->addr + entry->size;
		if (end != (resource_size_t)end)
			continue;

		if (entry->type == E820_TYPE_SOFT_RESERVED)
			memblock_reserve(entry->addr, entry->size);

		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
			continue;

		memblock_add(entry->addr, entry->size);
	}

	/* Throw away partial pages: */
	memblock_trim_memory(PAGE_SIZE);

	memblock_dump_all();
}
  • Entries whose type is E820_TYPE_SOFT_RESERVED (memory reserved by software) are added to memblock.reserved; they are not added to memblock.memory, so memblock will never hand them out again.
  • Entries of type E820_TYPE_RAM or E820_TYPE_RESERVED_KERN are added to memblock.memory and are managed and allocated by the kernel: E820_TYPE_RAM is ordinary RAM, while E820_TYPE_RESERVED_KERN is memory set aside by the kernel itself.
  • Entries of any other type are not added to memblock at all.

memblock_add()

Adding to the memblock memory type:

int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
{
	phys_addr_t end = base + size - 1;

	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
		     &base, &end, (void *)_RET_IP_);

	return memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
}

As the code shows, it simply calls memblock_add_range() on memblock.memory.

memblock_reserve()

Adding to the memblock reserved type:

int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
{
	phys_addr_t end = base + size - 1;

	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
		     &base, &end, (void *)_RET_IP_);

	return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
}

It targets memblock.reserved but, again, ends up calling memblock_add_range().

memblock_add_range()

This function is the core of adding ranges to memblock:


/**
 * memblock_add_range - add new memblock region
 * @type: memblock type to add new region into
 * @base: base address of the new region
 * @size: size of the new region
 * @nid: nid of the new region
 * @flags: flags of the new region
 *
 * Add new memblock region [@base, @base + @size) into @type.  The new region
 * is allowed to overlap with existing ones - overlaps don't affect already
 * existing regions.  @type is guaranteed to be minimal (all neighbouring
 * compatible regions are merged) after the addition.
 *
 * Return:
 * 0 on success, -errno on failure.
 */
static int __init_memblock memblock_add_range(struct memblock_type *type,
				phys_addr_t base, phys_addr_t size,
				int nid, enum memblock_flags flags)
{
	bool insert = false;
	phys_addr_t obase = base;
	phys_addr_t end = base + memblock_cap_size(base, &size);
	int idx, nr_new;
	struct memblock_region *rgn;

	if (!size)
		return 0;

	/* special case for empty array */
	if (type->regions[0].size == 0) {
		WARN_ON(type->cnt != 1 || type->total_size);
		type->regions[0].base = base;
		type->regions[0].size = size;
		type->regions[0].flags = flags;
		memblock_set_region_node(&type->regions[0], nid);
		type->total_size = size;
		return 0;
	}
repeat:
	/*
	 * The following is executed twice.  Once with %false @insert and
	 * then with %true.  The first counts the number of regions needed
	 * to accommodate the new area.  The second actually inserts them.
	 */
	base = obase;
	nr_new = 0;

	for_each_memblock_type(idx, type, rgn) {
		phys_addr_t rbase = rgn->base;
		phys_addr_t rend = rbase + rgn->size;

		if (rbase >= end)
			break;
		if (rend <= base)
			continue;
		/*
		 * @rgn overlaps.  If it separates the lower part of new
		 * area, insert that portion.
		 */
		if (rbase > base) {
#ifdef CONFIG_NEED_MULTIPLE_NODES
			WARN_ON(nid != memblock_get_region_node(rgn));
#endif
			WARN_ON(flags != rgn->flags);
			nr_new++;
			if (insert)
				memblock_insert_region(type, idx++, base,
						       rbase - base, nid,
						       flags);
		}
		/* area below @rend is dealt with, forget about it */
		base = min(rend, end);
	}

	/* insert the remaining portion */
	if (base < end) {
		nr_new++;
		if (insert)
			memblock_insert_region(type, idx, base, end - base,
					       nid, flags);
	}

	if (!nr_new)
		return 0;

	/*
	 * If this was the first round, resize array and repeat for actual
	 * insertions; otherwise, merge and return.
	 */
	if (!insert) {
		while (type->cnt + nr_new > type->max)
			if (memblock_double_array(type, obase, size) < 0)
				return -ENOMEM;
		insert = true;
		goto repeat;
	} else {
		memblock_merge_regions(type);
		return 0;
	}
}

The function breaks down into several parts:

  • When the type is being filled for the first time (it still contains only the empty dummy region created at initialization), the new range is written straight into the first region and the function returns.
  • Otherwise execution enters the repeat section and walks every region of the type; the regions are kept ordered by address. If (rbase >= end) or (rend <= base), the new range does not overlap that region. If (rbase > base), the new range overlaps an existing region and its lower, non-overlapping part must become a separate region, so nr_new is incremented to record that one more region needs to be inserted (see the worked example after this list).
  • If nr_new is greater than 0, new regions have to be created and inserted:

	if (!insert) {
		while (type->cnt + nr_new > type->max)
			if (memblock_double_array(type, obase, size) < 0)
				return -ENOMEM;
		insert = true;
		goto repeat;
	}

  • insert is set to true and execution jumps back to repeat. Unlike the first pass, which only counted how many regions need to be inserted (growing the regions array if necessary), the second pass actually inserts them.
  • Once the insertion is done, memblock_merge_regions() is called to merge regions promptly: two neighbouring regions are merged when the end address of one equals the base address of the next.
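
A hypothetical worked example of the two passes (the addresses are invented for illustration):

/*
 * memblock.memory initially holds a single region [0x2000, 0x4000).
 *
 *     memblock_add(0x1000, 0x3000);     // requested range is [0x1000, 0x4000)
 *
 * Pass 1 (insert == false): the existing region starts above @base
 * (rbase 0x2000 > base 0x1000), so one new region [0x1000, 0x2000) is
 * counted (nr_new = 1); the overlap with [0x2000, 0x4000) is skipped by
 * advancing base to min(rend, end) = 0x4000, leaving nothing above it.
 * If the regions array has no room for one more entry, it is doubled here.
 *
 * Pass 2 (insert == true): [0x1000, 0x2000) is actually inserted, and
 * memblock_merge_regions() then merges the two adjacent regions into a
 * single region [0x1000, 0x4000).
 */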

memblock_merge_regions()

memblock_merge_regions() merges adjacent regions promptly so that the region count does not grow past the limit of 128:

static void __init_memblock memblock_merge_regions(struct memblock_type *type)
{
	int i = 0;

	/* cnt never goes below 1 */
	while (i < type->cnt - 1) {
		struct memblock_region *this = &type->regions[i];
		struct memblock_region *next = &type->regions[i + 1];

		if (this->base + this->size != next->base ||
		    memblock_get_region_node(this) !=
		    memblock_get_region_node(next) ||
		    this->flags != next->flags) {
			BUG_ON(this->base + this->size > next->base);
			i++;
			continue;
		}

		this->size += next->size;
		/* move forward from next + 1, index of which is i + 2 */
		memmove(next, next + 1, (type->cnt - (i + 2)) * sizeof(*next));
		type->cnt--;
	}
}

Two regions are merged when this->base + this->size equals next->base (and their node ids and flags also match).

memblock_alloc_try_nid_raw()

During early initialization, before buddy is ready, other subsystems call this memblock interface to allocate memory. For example, the sparse memory model allocates physical memory for memmap this way:

struct page __init *__populate_section_memmap(unsigned long pfn,
		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
	unsigned long size = section_map_size();
	struct page *map = sparse_buffer_alloc(size);
	phys_addr_t addr = __pa(MAX_DMA_ADDRESS);

	if (map)
		return map;

	map = memblock_alloc_try_nid_raw(size, size, addr,
					  MEMBLOCK_ALLOC_ACCESSIBLE, nid);
	if (!map)
		panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=%pa\n",
		      __func__, size, PAGE_SIZE, nid, &addr);

	return map;
}

memblock_alloc_try_nid_raw() itself looks like this:

void * __init memblock_alloc_try_nid_raw(
			phys_addr_t size, phys_addr_t align,
			phys_addr_t min_addr, phys_addr_t max_addr,
			int nid)
{
	void *ptr;

	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n",
		     __func__, (u64)size, (u64)align, nid, &min_addr,
		     &max_addr, (void *)_RET_IP_);

	ptr = memblock_alloc_internal(size, align,
					   min_addr, max_addr, nid, false);
	if (ptr && size > 0)
		page_init_poison(ptr, size);

	return ptr;
}

Parameters:

  • size: size of the physical memory to allocate
  • align: alignment requirement
  • min_addr: lower bound of the physical address range
  • max_addr: upper bound of the physical address range
  • nid: node id of the node to allocate from

It ends up calling memblock_alloc_internal():


/**
 * memblock_alloc_internal - allocate boot memory block
 * @size: size of memory block to be allocated in bytes
 * @align: alignment of the region and block's size
 * @min_addr: the lower bound of the memory region to allocate (phys address)
 * @max_addr: the upper bound of the memory region to allocate (phys address)
 * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
 * @exact_nid: control the allocation fall back to other nodes
 *
 * Allocates memory block using memblock_alloc_range_nid() and
 * converts the returned physical address to virtual.
 *
 * The @min_addr limit is dropped if it can not be satisfied and the allocation
 * will fall back to memory below @min_addr. Other constraints, such
 * as node and mirrored memory will be handled again in
 * memblock_alloc_range_nid().
 *
 * Return:
 * Virtual address of allocated memory block on success, NULL on failure.
 */
static void * __init memblock_alloc_internal(
				phys_addr_t size, phys_addr_t align,
				phys_addr_t min_addr, phys_addr_t max_addr,
				int nid, bool exact_nid)
{
	phys_addr_t alloc;

	/*
	 * Detect any accidental use of these APIs after slab is ready, as at
	 * this moment memblock may be deinitialized already and its
	 * internal data may be destroyed (after execution of memblock_free_all)
	 */
	if (WARN_ON_ONCE(slab_is_available()))
		return kzalloc_node(size, GFP_NOWAIT, nid);

	if (max_addr > memblock.current_limit)
		max_addr = memblock.current_limit;

	alloc = memblock_alloc_range_nid(size, align, min_addr, max_addr, nid,
					 exact_nid);

	/* retry allocation without lower limit */
	if (!alloc && min_addr)
		alloc = memblock_alloc_range_nid(size, align, 0, max_addr, nid,
						 exact_nid);

	if (!alloc)
		return NULL;

	return phys_to_virt(alloc);
}

Ultimately it calls memblock_alloc_range_nid(), which finds a suitable range on the requested node.

memblock_alloc_range_nid()

memblock_alloc_range_nid() is the core allocation function of memblock:


/**
 * memblock_alloc_range_nid - allocate boot memory block
 * @size: size of memory block to be allocated in bytes
 * @align: alignment of the region and block's size
 * @start: the lower bound of the memory region to allocate (phys address)
 * @end: the upper bound of the memory region to allocate (phys address)
 * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
 * @exact_nid: control the allocation fall back to other nodes
 *
 * The allocation is performed from memory region limited by
 * memblock.current_limit if @end == %MEMBLOCK_ALLOC_ACCESSIBLE.
 *
 * If the specified node can not hold the requested memory and @exact_nid
 * is false, the allocation falls back to any node in the system.
 *
 * For systems with memory mirroring, the allocation is attempted first
 * from the regions with mirroring enabled and then retried from any
 * memory region.
 *
 * In addition, function sets the min_count to 0 using kmemleak_alloc_phys for
 * allocated boot memory block, so that it is never reported as leaks.
 *
 * Return:
 * Physical address of allocated memory block on success, %0 on failure.
 */
phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
					phys_addr_t align, phys_addr_t start,
					phys_addr_t end, int nid,
					bool exact_nid)
{
	enum memblock_flags flags = choose_memblock_flags();
	phys_addr_t found;

	if (WARN_ONCE(nid == MAX_NUMNODES, "Usage of MAX_NUMNODES is deprecated. Use NUMA_NO_NODE instead\n"))
		nid = NUMA_NO_NODE;

	if (!align) {
		/* Can't use WARNs this early in boot on powerpc */
		dump_stack();
		align = SMP_CACHE_BYTES;
	}

again:
	found = memblock_find_in_range_node(size, align, start, end, nid,
					    flags);
	if (found && !memblock_reserve(found, size))
		goto done;

	if (nid != NUMA_NO_NODE && !exact_nid) {
		found = memblock_find_in_range_node(size, align, start,
						    end, NUMA_NO_NODE,
						    flags);
		if (found && !memblock_reserve(found, size))
			goto done;
	}

	if (flags & MEMBLOCK_MIRROR) {
		flags &= ~MEMBLOCK_MIRROR;
		pr_warn("Could not allocate %pap bytes of mirrored memory\n",
			&size);
		goto again;
	}

	return 0;

done:
	/* Skip kmemleak for kasan_init() due to high volume. */
	if (end != MEMBLOCK_ALLOC_KASAN)
		/*
		 * The min_count is set to 0 so that memblock allocated
		 * blocks are never reported as leaks. This is because many
		 * of these blocks are only referred via the physical
		 * address which is not looked up by kmemleak.
		 */
		kmemleak_alloc_phys(found, size, 0, 0);

	return found;
}

The allocation has two main parts:

  • memblock_find_in_range_node(): searches memblock.memory on the requested node for a suitable free range (a short usage sketch follows this list).
  • If a suitable range is found, memblock_reserve() adds it to memblock.reserved, marking it as allocated. Note that the regions in memblock.memory do not change: memory only records what memblock may manage and allocate; whatever has been handed out is recorded in reserved, while the memory regions themselves stay untouched.
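
As a usage sketch, this is roughly how a NUMA-aware early caller asks for node-local memory. The caller name and size are invented; memblock_alloc_try_nid() returns zeroed memory and, because exact_nid is false internally, falls back to other nodes when the requested node cannot satisfy the request:

#include <linux/memblock.h>
#include <linux/cache.h>
#include <linux/sizes.h>
#include <asm/dma.h>

static void * __init alloc_node_table_example(int nid)
{
	/* Prefer memory on @nid, anywhere currently accessible; memblock
	 * falls back to any node if @nid cannot satisfy the request. */
	return memblock_alloc_try_nid(SZ_64K, SMP_CACHE_BYTES,
				      __pa(MAX_DMA_ADDRESS),
				      MEMBLOCK_ALLOC_ACCESSIBLE, nid);
}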

memblock_free()

memblock_free() releases memory by removing the range from memblock.reserved:

/**
 * memblock_free - free boot memory block
 * @base: phys starting address of the  boot memory block
 * @size: size of the boot memory block in bytes
 *
 * Free boot memory block previously allocated by memblock_alloc_xx() API.
 * The freeing memory will not be released to the buddy allocator.
 */
int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
{
	phys_addr_t end = base + size - 1;

	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
		     &base, &end, (void *)_RET_IP_);

	kmemleak_free_part_phys(base, size);
	return memblock_remove_range(&memblock.reserved, base, size);
}

It ends up calling memblock_remove_range() to remove the range from memblock.reserved.

memblock_remove_range()


static int __init_memblock memblock_remove_range(struct memblock_type *type,
					  phys_addr_t base, phys_addr_t size)
{
	int start_rgn, end_rgn;
	int i, ret;

	ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
	if (ret)
		return ret;

	for (i = end_rgn - 1; i >= start_rgn; i--)
		memblock_remove_region(type, i);
	return 0;
}

early_calculate_totalpages()

Early in boot, in particular while the zones are being set up, the kernel needs the total number of physical pages across all nodes; at that point the count can only be obtained through early_calculate_totalpages():

static unsigned long __init early_calculate_totalpages(void)
{
	unsigned long totalpages = 0;
	unsigned long start_pfn, end_pfn;
	int i, nid;

	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
		unsigned long pages = end_pfn - start_pfn;

		totalpages += pages;
		if (pages)
			node_set_state(nid, N_MEMORY);
	}
	return totalpages;
}

It obtains the number of physical pages of every node from memblock:

  • On a NUMA system, every node has to be walked and its physical memory obtained from memblock; the for_each_mem_pfn_range macro is defined as:

#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)		\
	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))

  • It uses memblock's __next_mem_pfn_range() interface to obtain the start and end pfn of each range together with the node it belongs to.
  • From the start_pfn and end_pfn of each range the number of physical pages on that node is computed, and the node's entry in node_states is set to N_MEMORY.

memblock_flags

Besides the start address and size of each region, struct memblock_region carries a flags field describing the region's attributes:

struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
	enum memblock_flags flags;
#ifdef CONFIG_NEED_MULTIPLE_NODES
	int nid;
#endif
};

memblock_flags supports the following flags:

enum memblock_flags {
	MEMBLOCK_NONE		= 0x0,	/* No special request */
	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
};

Where:

  • MEMBLOCK_NONE: no special requirement, normal use
  • MEMBLOCK_HOTPLUG: the region supports memory hot-plug; when the zones are created later it is placed under ZONE_MOVABLE (on x86 this information comes from the ACPI System Resource Affinity Table (SRAT))
  • MEMBLOCK_MIRROR: the region is used by the memory-mirroring feature
  • MEMBLOCK_NOMAP: the region must not be added to the kernel direct mapping

memblock_flags thus states directly what a region's memory is meant for and guides how it is handled later in boot, for example when the zones are set up and the direct mapping is created (a short sketch of marking regions follows).
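
Flags are normally set on existing regions by architecture code through helpers such as memblock_mark_hotplug() and memblock_mark_nomap(). A minimal sketch, assuming the ranges come from firmware tables (the function name and addresses below are invented):

/* Hypothetical flag marking during early setup (addresses invented). */
#include <linux/memblock.h>
#include <linux/sizes.h>

static void __init mark_regions_example(void)
{
	/* Memory reported as hot-pluggable, e.g. from the ACPI SRAT. */
	memblock_mark_hotplug(0x100000000ULL, SZ_1G);

	/* Memory that must stay out of the kernel direct mapping. */
	memblock_mark_nomap(0x80000000UL, SZ_16M);
}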

A memblock example

To examine an actual memblock layout, I wrote a small test module that dumps the memblock regions through procfs. Its source code is as follows:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kobject.h>
#include <linux/string.h>
#include <linux/stat.h>
#include <linux/slab.h>
#include <linux/sysfs.h>
#include <linux/device.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/mm.h>
#include <linux/memblock.h>
MODULE_LICENSE("GPL v2");#define MY_MEM "my_memblock"static struct proc_dir_entry * my_mem_entry=NULL;static int my_mem_show(struct seq_file *m, void * v)
{struct memblock_region *reg;int index=0;seq_printf(m,"memory block limit:%lld\n",memblock.current_limit);seq_printf(m,"memory block name:%s\n",memblock.memory.name);seq_printf(m,"       regions:%ld\n",memblock.memory.cnt);seq_printf(m,"       total size:%lld\n",memblock.memory.total_size);seq_printf(m,"       region info:\n");seq_printf(m,"\t%6s %64s %64s %64s %12s\n","index","base","end","size","node");for_each_memblock(memory,reg) {seq_printf(m,"\t%6d %64lld %64lld %64lld %12d\n",index++, reg->base, (reg->base + reg->size),reg->size,reg->nid);//seq_printf(m,"\tbase:%121lld\n",reg->base);//seq_printf(m,"\tsize:%12lld\n",reg->size);//seq_printf(m,"\tnid :%12d\n",reg->nid);}seq_printf(m,"memory block memory name:%s\n",memblock.reserved.name);seq_printf(m,"       regions:%ld\n",memblock.reserved.cnt);seq_printf(m,"       total size:%lld\n",memblock.reserved.total_size);seq_printf(m,"       region info:\n");seq_printf(m,"\t%6s %64s %64s %64s %12s\n","index","base","end","size","node");index = 0;for_each_memblock(reserved,reg) {seq_printf(m,"\t%6d %64lld %64lld %64lld %12d\n",index++, reg->base, (reg->base + reg->size),reg->size,reg->nid);//seq_printf(m,"region index:%d\n",index++);//seq_printf(m,"\tbase:%lld\n",reg->base);//seq_printf(m,"\tsize:%lld\n",reg->size);//seq_printf(m,"\tnid :%d\n",reg->nid);}return 0;
}static int my_mem_open(struct inode *inode, struct file *file)
{return single_open(file, my_mem_show, NULL);
}static const struct proc_ops my_mem_ops = {.proc_open = my_mem_open,.proc_read = seq_read,.proc_lseek= seq_lseek,.proc_release = single_release,
};
static int __init my_mem_init(void)
{my_mem_entry = proc_create(MY_MEM,S_IRUGO, NULL, &my_mem_ops);if (NULL == my_mem_entry){return -ENOMEM;}return 0;
}static void __exit my_mem_exit(void)
{remove_proc_entry(MY_MEM, NULL);
}module_init(my_mem_init);
module_exit(my_mem_exit);
MODULE_AUTHOR("hzk");
MODULE_DESCRIPTION("A simple moudule for my proc");
MODULE_VERSION("V1.0");

Because the kernel does not export the memblock symbol, the kernel source has to be modified to export it before this module can be built against it. The change is roughly as follows:
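
The original screenshot of that kernel change is not reproduced here; it is presumably something along the lines of the following addition in mm/memblock.c (an experiment-only hack, shown as an assumption rather than the exact patch):

/* mm/memblock.c: right after the definition of the global memblock variable,
 * export it so the test module above can reference it (experiment only;
 * note the data may be discarded after boot unless CONFIG_ARCH_KEEP_MEMBLOCK
 * is set). */
EXPORT_SYMBOL_GPL(memblock);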

The module can then be built; after it is compiled and loaded, a /proc/my_memblock entry is created for inspecting the memblock layout.

Run results
