Kernel boot process. Part 6.



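As you may remember, the entry point of the Linux kernel is the start_kernel function in main.c, which starts executing at the LOAD_PHYSICAL_ADDR address. This address depends on the CONFIG_PHYSICAL_START kernel configuration option, which defaults to 0x1000000: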

config PHYSICAL_START
	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
	default "0x1000000"
	---help---
	  This gives the physical address where the kernel is loaded.
	  ...
	  ...
	  ...


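This value can be changed during kernel configuration, but the load address can also be chosen at random when the CONFIG_RANDOMIZE_BASE kernel configuration option is enabled. In that case, the physical address at which the Linux kernel image is decompressed and loaded is randomized. This part considers the case where this option is enabled and the load address of the kernel image is randomized for security reasons (KASLR, kernel address space layout randomization).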


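Before the kernel decompressor starts looking for a random address at which to decompress and load the kernel, the identity mapped page tables (virtual addresses equal to physical addresses) should be initialized. If the bootloader used the 16-bit or 32-bit boot protocol, we already have page tables. In any case, we need new pages if the kernel decompressor selects a memory range outside of them. That is why we need to build new identity mapped page tables.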



/*
 * The compressed kernel image (ZO), has been moved so that its position
 * is against the end of the buffer used to hold the uncompressed kernel
 * image (VO) and the execution environment (.bss, .brk), which makes sure
 * there is room to do the in-place decompression. (See header.S for the
 * calculations.)
 *
 *                             |-----compressed kernel image------|
 *                             V                                  V
 * 0                       extract_offset                      +INIT_SIZE
 * |-----------|---------------|-------------------------|--------|
 *             |               |                         |        |
 *           VO__text      startup_32 of ZO          VO__end    ZO__end
 *             ^                                         ^
 *             |-------uncompressed kernel image---------|
 *
 */
asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
					  unsigned char *input_data,
					  unsigned long input_len,
					  unsigned char *output,
					  unsigned long output_len);


void choose_random_location(unsigned long input,
			    unsigned long input_size,
			    unsigned long *output,
			    unsigned long output_size,
			    unsigned long *virt_addr)


  • input;
  • input_size;
  • output;
  • output_size;
  • virt_addr.

Let's try to understand what these parameters are. The first parameter, input, comes from the extract_kernel function in the arch/x86/boot/compressed/misc.c source file:

asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
					  unsigned char *input_data,
					  unsigned long input_len,
					  unsigned char *output,
					  unsigned long output_len)
{
	...
	...
	...
	choose_random_location((unsigned long)input_data, input_len,
			       (unsigned long *)&output,
			       max(output_len, kernel_total_size),
			       &virt_addr);
	...
	...
	...
}

This parameter is passed by the assembly code in arch/x86/boot/compressed/head_64.S:

leaq input_data(%rip), %rdx

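input_data is generated by the arch/x86/boot/compressed/mkpiggy.c program. If you have compiled the Linux kernel source code yourself, you can find the file this program generates at linux/arch/x86/boot/compressed/piggy.S. In my case, the file looks like this: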

.section ".rodata..compressed","a",@progbits
.globl z_input_len
z_input_len = 6988196
.globl z_output_len
z_output_len = 29207032
.globl input_data, input_data_end
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"


.section ".rodata..compressed","a",@progbits
.globl z_input_len
z_input_len = 13449128
.globl z_output_len
z_output_len = 83868872
.globl input_data, input_data_end
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
.section ".rodata","a",@progbits
.globl input_len
input_len:.long 13449128
.globl output_len
output_len:.long 83868872


[rongtao@localhost linux-5.10.13]$ ll ./arch/x86/boot/compressed/vmlinux.bin.gz
-rw-rw-r-- 1 rongtao rongtao 13449128 3月  11 12:26 ./arch/x86/boot/compressed/vmlinux.bin.gz
[rongtao@localhost linux-5.10.13]$ ll ./arch/x86/boot/compressed/vmlinux.bin
-rwxrwxr-x 1 rongtao rongtao 81791784 3月  11 12:26 ./arch/x86/boot/compressed/vmlinux.bin



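The third and fourth parameters of the choose_random_location function are the address at which to place the decompressed kernel image and the length of that image, respectively. The address comes from arch/x86/boot/compressed/head_64.S and is the address of startup_32 aligned to a 2 MB boundary. The size of the decompressed kernel comes from the same piggy.S, where it is z_output_len.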


unsigned long virt_addr = LOAD_PHYSICAL_ADDR;




if (cmdline_find_option_bool("nokaslr")) {
	warn("KASLR disabled: 'nokaslr' on cmdline.");
	return;
}


kaslr/nokaslr [X86]
	Enable/disable kernel and module base offset ASLR
	(Address Space Layout Randomization) if built into
	the kernel. When CONFIG_HIBERNATION is selected,
	kASLR is disabled by default. When kASLR is enabled,
	hibernation will be disabled.
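To make the nokaslr check more concrete, here is a minimal user-space sketch of a boolean command-line lookup. It is not the kernel's cmdline_find_option_bool(): the helper name has_bool_option and its whole-word matching rule are assumptions made for illustration only.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the kernel's cmdline_find_option_bool():
 * returns 1 if `option` occurs in `cmdline` as a complete
 * whitespace-delimited word, 0 otherwise.
 */
static int has_bool_option(const char *cmdline, const char *option)
{
	size_t len = strlen(option);
	const char *p = cmdline;

	assert(len > 0);

	while (*p) {
		/* Skip the whitespace between words. */
		while (*p && isspace((unsigned char)*p))
			p++;

		/* Measure the current word. */
		const char *word = p;
		while (*p && !isspace((unsigned char)*p))
			p++;

		if ((size_t)(p - word) == len && memcmp(word, option, len) == 0)
			return 1;
	}
	return 0;
}
```

With such a helper, a command line like "root=/dev/sda1 nokaslr quiet" would disable randomization, while a word that merely starts with "nokaslr" would not.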





SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
	...
	/* Pass boot_params to initialize_identity_maps() */
	movq	(%rsp), %rdi
	call	initialize_identity_maps
	popq	%rsi
	...
	call	extract_kernel		/* returns kernel location in %rax */
	/* Back from extract_kernel */
	popq	%rsi

	/*
	 * Jump to the decompressed kernel.
	 */
	jmp	*%rax

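The initialize_identity_maps function is defined in the arch/x86/boot/compressed/pagetable.c source file (in newer kernels such as 5.10 it lives in arch/x86/boot/compressed/ident_map_64.c). It starts by initializing mapping_info, an instance of the x86_mapping_info structure.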

mapping_info.alloc_pgt_page = alloc_pgt_page;
mapping_info.context = &pgt_data;
mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sev_me_mask;
mapping_info.kernpg_flag = _KERNPG_TABLE | sev_me_mask;

The x86_mapping_info structure is defined in the arch/x86/include/asm/init.h header file:

struct x86_mapping_info {		/* describes how to build a memory mapping */
	void *(*alloc_pgt_page)(void *); /* allocate buf for page table */
	void *context;			 /* context for alloc_pgt_page; tracks the allocated page tables */
	unsigned long page_flag;	 /* page flag for PMD or PUD entry */
	unsigned long offset;		 /* ident mapping offset: offset between the kernel's virtual and physical addresses */
	bool direct_gbpages;		 /* PUD level 1GB page support */
	unsigned long kernpg_flag;	 /* kernel pagetable flag override */
};




entry = pages->pgt_buf + pages->pgt_buf_offset;
pages->pgt_buf_offset += PAGE_SIZE;


struct alloc_pgt_data {
	unsigned char *pgt_buf;
	unsigned long pgt_buf_size;
	unsigned long pgt_buf_offset;
};
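The two allocation lines and the alloc_pgt_data structure above amount to a simple bump allocator over a fixed buffer. The following user-space sketch shows the idea; the demo_ names and the 4 KiB page size are assumptions for illustration, and the kernel's own alloc_pgt_page additionally prints debug output when the buffer is exhausted.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumed 4 KiB pages */

/* Same three fields as alloc_pgt_data, under a hypothetical name. */
struct demo_pgt_data {
	unsigned char *pgt_buf;		/* start of the page table buffer */
	unsigned long pgt_buf_size;	/* total size of the buffer in bytes */
	unsigned long pgt_buf_offset;	/* bytes already handed out */
};

/* Hand out the next free page of the buffer, or NULL when it is full. */
static void *demo_alloc_pgt_page(void *context)
{
	struct demo_pgt_data *pages = context;
	unsigned char *entry;

	assert(pages != NULL);

	/* Validate that there is space available for a new page. */
	if (pages->pgt_buf_offset >= pages->pgt_buf_size)
		return NULL;

	entry = pages->pgt_buf + pages->pgt_buf_offset;
	pages->pgt_buf_offset += DEMO_PAGE_SIZE;
	return entry;
}
```

Nothing is ever freed: pages are handed out in order until pgt_buf_offset reaches pgt_buf_size, which is enough for a short-lived decompressor.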

The final goal of the initialize_identity_maps function is to initialize pgt_buf_size and pgt_buf_offset. Since we are still in the initialization phase, initialize_identity_maps sets pgt_buf_offset to 0:

pgt_data.pgt_buf_offset = 0;

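pgt_data.pgt_buf_size is set to 77824 or 69632, depending on which boot protocol (64-bit or 32-bit) the bootloader used. The same holds for pgt_data.pgt_buf. If the bootloader booted the kernel at startup_32, pgt_data.pgt_buf points to the end of the page table that was already initialized in arch/x86/boot/compressed/head_64.S: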

pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;

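Here _pgtable points to the beginning of this page table. On the other hand, if the bootloader used the 64-bit boot protocol and loaded the kernel at startup_64, the early page tables should have been built by the bootloader itself, and _pgtable will simply be overwritten: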

pgt_data.pgt_buf = _pgtable;



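After the data related to the identity mapped page tables has been initialized, we can start choosing a random location for the decompressed kernel. As you may guess, we cannot pick just any address. Some ranges of memory are reserved: they are occupied by important things such as the initrd and the kernel command line. The mem_avoid_init function takes care of this: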

mem_avoid_init(input, input_size, *output);


struct mem_vector {
	unsigned long long start;
	unsigned long long size;
};

static struct mem_vector mem_avoid[MEM_AVOID_MAX];

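All unsafe memory regions are collected in the mem_avoid array. MEM_AVOID_MAX comes from the mem_avoid_index enumeration, which represents the different kinds of reserved memory regions: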

enum mem_avoid_index {
	MEM_AVOID_ZO_RANGE = 0,
	MEM_AVOID_INITRD,	/* initrd: the initial RAM disk, a temporary root file system mounted during boot to support the two-stage boot process */
	MEM_AVOID_CMDLINE,
	MEM_AVOID_BOOTPARAMS,
	MEM_AVOID_MEMMAP_BEGIN,
	MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
	MEM_AVOID_MAX,
};

Both are defined in the arch/x86/boot/compressed/kaslr.c source file.


/*
 * In theory, KASLR can put the kernel anywhere in the range of [16M, MAXMEM)
 * on 64-bit, and [16M, KERNEL_IMAGE_SIZE) on 32-bit.
 *
 * The mem_avoid array is used to store the ranges that need to be avoided
 * when KASLR searches for an appropriate random address. We must avoid any
 * regions that are unsafe to overlap with during decompression, and other
 * things like the initrd, cmdline and boot_params. This comment seeks to
 * explain mem_avoid as clearly as possible since incorrect mem_avoid
 * memory ranges lead to really hard to debug boot failures.
 *
 * The initrd, cmdline, and boot_params are trivial to identify for
 * avoiding. They are MEM_AVOID_INITRD, MEM_AVOID_CMDLINE, and
 * MEM_AVOID_BOOTPARAMS respectively below.
 *
 * What is not obvious how to avoid is the range of memory that is used
 * during decompression (MEM_AVOID_ZO_RANGE below). This range must cover
 * the compressed kernel (ZO) and its run space, which is used to extract
 * the uncompressed kernel (VO) and relocs.
 *
 * ZO's full run size sits against the end of the decompression buffer, so
 * we can calculate where text, data, bss, etc of ZO are positioned more
 * easily.
 *
 * For additional background, the decompression calculations can be found
 * in header.S, and the memory diagram is based on the one found in misc.c.
 *
 * The following conditions are already enforced by the image layouts and
 * associated code:
 *  - input + input_size >= output + output_size
 *  - kernel_total_size <= init_size
 *  - kernel_total_size <= output_size (see Note below)
 *  - output + init_size >= output + output_size
 *
 * (Note that kernel_total_size and output_size have no fundamental
 * relationship, but output_size is passed to choose_random_location
 * as a maximum of the two. The diagram is showing a case where
 * kernel_total_size is larger than output_size, but this case is
 * handled by bumping output_size.)
 *
 * The above conditions can be illustrated by a diagram:
 *
 * 0   output            input            input+input_size    output+init_size
 * |     |                 |                             |             |
 * |     |                 |                             |             |
 * |-----|--------|--------|--------------|-----------|--|-------------|
 *                |                       |           |
 *                |                       |           |
 * output+init_size-ZO_INIT_SIZE  output+output_size  output+kernel_total_size
 *
 * [output, output+init_size) is the entire memory range used for
 * extracting the compressed image.
 *
 * [output, output+kernel_total_size) is the range needed for the
 * uncompressed kernel (VO) and its run size (bss, brk, etc).
 *
 * [output, output+output_size) is VO plus relocs (i.e. the entire
 * uncompressed payload contained by ZO). This is the area of the buffer
 * written to during decompression.
 *
 * [output+init_size-ZO_INIT_SIZE, output+init_size) is the worst-case
 * range of the copied ZO and decompression code. (i.e. the range
 * covered backwards of size ZO_INIT_SIZE, starting from output+init_size.)
 *
 * [input, input+input_size) is the original copied compressed image (ZO)
 * (i.e. it does not include its run size). This range must be avoided
 * because it contains the data used for decompression.
 *
 * [input+input_size, output+init_size) is [_text, _end) for ZO. This
 * range includes ZO's heap and stack, and must be avoided since it
 * performs the decompression.
 *
 * Since the above two ranges need to be avoided and they are adjacent,
 * they can be merged, resulting in: [input, output+init_size) which
 * becomes the MEM_AVOID_ZO_RANGE below.
 */
/*
 * After the identity page table setup has been initialized, we can choose
 * a random memory location at which to extract the kernel image. As you
 * may have guessed, we cannot pick just any address: some reserved memory
 * regions are occupied by important things such as the initrd and the
 * kernel command line, and they must be avoided. mem_avoid_init() helps
 * us do that.
 *
 * All unsafe memory regions are collected in an array called mem_avoid.
 */
static void mem_avoid_init(unsigned long input, unsigned long input_size,
			   unsigned long output)
{
	unsigned long init_size = boot_params->hdr.init_size;
	u64 initrd_start, initrd_size;
	unsigned long cmd_line, cmd_line_size;

	/*
	 * Avoid the region that is unsafe to overlap during
	 * decompression.
	 */
	mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
	mem_avoid[MEM_AVOID_ZO_RANGE].size = (output + init_size) - input;

	/* Avoid initrd. */
	initrd_start  = (u64)boot_params->ext_ramdisk_image << 32;
	initrd_start |= boot_params->hdr.ramdisk_image;
	initrd_size  = (u64)boot_params->ext_ramdisk_size << 32;
	initrd_size |= boot_params->hdr.ramdisk_size;
	mem_avoid[MEM_AVOID_INITRD].start = initrd_start;
	mem_avoid[MEM_AVOID_INITRD].size = initrd_size;
	/* No need to set mapping for initrd, it will be handled in VO. */

	/* Avoid kernel command line. */
	cmd_line = get_cmd_line_ptr();
	/* Calculate size of cmd_line. */
	if (cmd_line) {
		cmd_line_size = strnlen((char *)cmd_line, COMMAND_LINE_SIZE-1) + 1;
		mem_avoid[MEM_AVOID_CMDLINE].start = cmd_line;
		mem_avoid[MEM_AVOID_CMDLINE].size = cmd_line_size;
	}

	/* Avoid boot parameters. */
	mem_avoid[MEM_AVOID_BOOTPARAMS].start = (unsigned long)boot_params;
	mem_avoid[MEM_AVOID_BOOTPARAMS].size = sizeof(*boot_params);

	/* We don't need to set a mapping for setup_data. */

	/* Mark the memmap regions we need to avoid */
	handle_mem_options();

	/* Enumerate the immovable memory regions */
	num_immovable_mem = count_immovable_mem_regions();
}


mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
mem_avoid[MEM_AVOID_ZO_RANGE].size = (output + init_size) - input;
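To see how such an entry can later be used, the following sketch computes the merged range [input, output+init_size) as in the two lines above and tests a candidate region against it. The demo_ names are hypothetical, and the overlap test is written here as a plain interval check that illustrates the idea; it is not copied from this text.

```c
#include <assert.h>
#include <stdbool.h>

/* Same two fields as mem_vector, under a hypothetical name. */
struct demo_mem_vector {
	unsigned long long start;
	unsigned long long size;
};

/* Build the range [input, output + init_size) that must be avoided. */
static struct demo_mem_vector demo_zo_range(unsigned long long input,
					    unsigned long long output,
					    unsigned long long init_size)
{
	struct demo_mem_vector v;

	/* The layout guarantees that input lies inside the buffer. */
	assert(output + init_size >= input);

	v.start = input;
	v.size = (output + init_size) - input;
	return v;
}

/* Two half-open ranges overlap unless one ends before the other starts. */
static bool demo_overlaps(const struct demo_mem_vector *one,
			  const struct demo_mem_vector *two)
{
	if (one->start + one->size <= two->start)
		return false;
	if (one->start >= two->start + two->size)
		return false;
	return true;
}
```

For example, with input at 0x3000000, output at 0x1000000 and an init_size of 0x2800000, the avoided range is 0x800000 bytes starting at 0x3000000, so a candidate at 0x3400000 is rejected while one at 0x4000000 is not.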


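The beginning of the mem_avoid_init function avoids the memory region used by the current kernel decompression. We fill one entry of the mem_avoid array with the start address and size of this region. Older kernels then call the add_identity_map function, which builds identity mapped pages for this region; the mem_avoid_init listing above no longer contains this call. The add_identity_map function is defined in the arch/x86/boot/compressed/kaslr.c source file: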

/*
 * Mapping information structure passed to kernel_ident_mapping_init().
 * Due to relocation, pointers must be assigned at run time not build time.
 */
static struct x86_mapping_info mapping_info;

/*
 * Adds the specified range to the identity mappings.
 */
static void add_identity_map(unsigned long start, unsigned long end)
{
	int ret;

	/* Align boundary to 2M. */
	start = round_down(start, PMD_SIZE);
	end = round_up(end, PMD_SIZE);
	if (start >= end)
		return;

	/* Build the mapping. */
	ret = kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt, start, end);
	if (ret)
		error("Error: kernel_ident_mapping_init() failed\n");
}

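As you can see, it aligns the memory region to a 2 MB boundary and checks the given start and end addresses: if the aligned range is empty, it returns immediately.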
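The alignment step can be tried out in isolation. The sketch below assumes a 2 MB PMD_SIZE (1 << 21) and uses hypothetical demo_ helpers in place of the kernel's round_down and round_up macros; it relies on the alignment being a power of two.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PMD_SIZE (1UL << 21)	/* assumed: 2 MB, as for x86_64 PMD entries */

/* Round x down to a multiple of align (align must be a power of two). */
static unsigned long demo_round_down(unsigned long x, unsigned long align)
{
	return x & ~(align - 1);
}

/* Round x up to a multiple of align (align must be a power of two). */
static unsigned long demo_round_up(unsigned long x, unsigned long align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Widen [*start, *end) to 2 MB boundaries.
 * Returns 1 if the resulting range is non-empty, 0 otherwise.
 */
static int demo_align_range(unsigned long *start, unsigned long *end)
{
	assert(start != NULL && end != NULL);

	*start = demo_round_down(*start, DEMO_PMD_SIZE);
	*end = demo_round_up(*end, DEMO_PMD_SIZE);
	return *start < *end;
}
```

A range from 0x1234567 to 0x1634567 is widened to 0x1200000 through 0x1800000, so the mapping always covers whole 2 MB pages.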

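Finally, it calls the kernel_ident_mapping_init function from the arch/x86/mm/ident_map.c source file, passing the mapping_info instance initialized above, the address of the top-level page table, and the start and end addresses of the memory region for which a new identity mapping should be built.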


if (!info->kernpg_flag)
	info->kernpg_flag = _KERNPG_TABLE;

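and starts building new 2 MB page table entries (2 MB because of the PSE bit in mapping_info.page_flag) for the given addresses (PGD -> P4D -> PUD -> PMD with five-level page tables, or PGD -> PUD -> PMD with four-level page tables).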

for (; addr < end; addr = next) {
	p4d_t *p4d;

	next = (addr & PGDIR_MASK) + PGDIR_SIZE;
	if (next > end)
		next = end;

	p4d = (p4d_t *)info->alloc_pgt_page(info->context);
	result = ident_p4d_init(info, p4d, addr, next);

	return result;
}

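First, we find the next entry of the page global directory for the given address; if it is beyond the end of the given memory region, we set it to end. After that, we allocate a new page with the x86_mapping_info callback we saw earlier and call the ident_p4d_init function. ident_p4d_init does the same thing, but for the lower-level page directories (p4d -> pud -> pmd).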
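The stepping logic of this loop can be sketched on its own. The example below assumes four-level paging, where one page global directory entry covers 2^39 bytes (512 GB); the demo_ names are hypothetical, and the function only counts how many directory entries a range touches instead of building any tables.

```c
#include <assert.h>

#define DEMO_PGDIR_SHIFT 39	/* assumed: four-level paging on x86_64 */
#define DEMO_PGDIR_SIZE  (1ULL << DEMO_PGDIR_SHIFT)
#define DEMO_PGDIR_MASK  (~(DEMO_PGDIR_SIZE - 1))

/* Count the page global directory entries needed to cover [addr, end). */
static unsigned int demo_count_pgd_chunks(unsigned long long addr,
					  unsigned long long end)
{
	unsigned long long next;
	unsigned int chunks = 0;

	assert(addr <= end);

	for (; addr < end; addr = next) {
		/* Start of the next entry, clamped to the end of the range. */
		next = (addr & DEMO_PGDIR_MASK) + DEMO_PGDIR_SIZE;
		if (next > end)
			next = end;
		chunks++;
	}
	return chunks;
}
```

A range that lies entirely inside one 512 GB window needs a single entry, while a range that straddles a window boundary needs two.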


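The new page table entries for the reserved addresses are now in our page tables. This is not the end of the mem_avoid_init function, but the remaining parts are similar: they build pages for the initrd, the kernel command line, and other data.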




min_addr = min(*output, 512UL << 20);

As you can see, it is at most 512 MB. The value of 512 MB was chosen only to avoid unknown things in the low memory region.


random_addr = find_random_phys_addr(min_addr, output_size);


static unsigned long find_random_phys_addr(unsigned long minimum,
					   unsigned long image_size)
{
	minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);

	if (process_efi_entries(minimum, image_size))
		return slots_fetch_random();

	process_e820_entries(minimum, image_size);
	return slots_fetch_random();
}


static unsigned long find_random_phys_addr(unsigned long minimum,
					   unsigned long image_size)
{
	u64 phys_addr;

	/* Bail out early if it's impossible to succeed. */
	if (minimum + image_size > mem_limit)
		return 0;

	/* Check if we had too many memmaps. */
	if (memmap_too_large) {
		debug_putstr("Aborted memory entries scan (more than 4 memmap= args)!\n");
		return 0;
	}

	/* Find all suitable ranges in fully accessible memory to load the kernel. */
	if (!process_efi_entries(minimum, image_size))
		process_e820_entries(minimum, image_size);

	/* Now we have a random physical address to decompress the kernel to. */
	phys_addr = slots_fetch_random();

	/* Perform a final check to make sure the address is in range. */
	if (phys_addr < minimum || phys_addr + image_size > mem_limit) {
		warn("Invalid physical address chosen!\n");
		return 0;
	}

	return (unsigned long)phys_addr;
}



struct slot_area {
	unsigned long addr;
	int num;
};

#define MAX_SLOT_AREA 100

static struct slot_area slot_areas[MAX_SLOT_AREA];


slot = kaslr_get_random_long("Physical") % slot_max;

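The kaslr_get_random_long function is defined in the arch/x86/lib/kaslr.c source file and returns a random number. Note that the random number is obtained in different ways depending on the kernel configuration and the capabilities of the system (time stamp counter based randomness, rdrand, and so on).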
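Once a slot index has been drawn, it still has to be turned into an address. The sketch below shows one plausible way to do that with the slot_area layout shown earlier: walk the areas, subtract each area's slot count until the index falls inside one, and scale by the alignment. The demo_ names are hypothetical, the index is passed in rather than drawn at random so that the result is reproducible, and the 2 MB alignment in the usage is an assumed value for CONFIG_PHYSICAL_ALIGN.

```c
#include <assert.h>
#include <stddef.h>

/* Same two fields as slot_area, under a hypothetical name. */
struct demo_slot_area {
	unsigned long addr;	/* address of the first slot in this area */
	int num;		/* number of aligned slots in this area */
};

/*
 * Map a global slot index to a physical address.
 * Returns 0 if the index is beyond the last slot.
 */
static unsigned long demo_slot_to_addr(const struct demo_slot_area *areas,
				       size_t count, unsigned long slot,
				       unsigned long align)
{
	size_t i;

	assert(areas != NULL || count == 0);

	for (i = 0; i < count; i++) {
		unsigned long num = (unsigned long)areas[i].num;

		/* Not in this area: skip its slots and keep looking. */
		if (slot >= num) {
			slot -= num;
			continue;
		}
		return areas[i].addr + slot * align;
	}
	return 0;
}
```

With two areas holding 3 and 2 slots, indexes 0 to 2 land in the first area and indexes 3 and 4 in the second, so every slot is equally likely when the index is uniform over the total slot count.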




random_addr = find_random_phys_addr(min_addr, output_size);
if (*output != random_addr) {
	add_identity_map(random_addr, output_size);
	*output = random_addr;
}


if (IS_ENABLED(CONFIG_X86_64))
	random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);
*virt_addr = random_addr;


static unsigned long find_random_virt_addr(unsigned long minimum,
					   unsigned long image_size)
{
	unsigned long slots, random_addr;

	/*
	 * There are how many CONFIG_PHYSICAL_ALIGN-sized slots
	 * that can hold image_size within the range of minimum to
	 * KERNEL_IMAGE_SIZE?
	 */
	slots = 1 + (KERNEL_IMAGE_SIZE - minimum - image_size) / CONFIG_PHYSICAL_ALIGN;

	random_addr = kaslr_get_random_long("Virtual") % slots;

	return random_addr * CONFIG_PHYSICAL_ALIGN + minimum;
}
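The slot arithmetic in find_random_virt_addr is easy to check by hand. The sketch below reproduces the two formulas with hypothetical demo_ helpers; the concrete numbers used afterwards (a 1 GB KERNEL_IMAGE_SIZE, a 16 MB minimum, a 32 MB image and a 2 MB CONFIG_PHYSICAL_ALIGN) are assumptions chosen for illustration and will differ between configurations.

```c
#include <assert.h>

/* Number of align-sized slots that can hold image_size in [minimum, image_limit). */
static unsigned long demo_virt_slots(unsigned long image_limit,
				     unsigned long minimum,
				     unsigned long image_size,
				     unsigned long align)
{
	assert(align != 0);
	assert(minimum + image_size <= image_limit);

	return 1 + (image_limit - minimum - image_size) / align;
}

/* Address of the slot with the given index. */
static unsigned long demo_virt_addr(unsigned long index,
				    unsigned long minimum,
				    unsigned long align)
{
	return index * align + minimum;
}
```

Under these assumed values there are 489 possible slots, and the image placed in the last one (index 488) ends exactly at the 1 GB limit.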







If you have any questions or suggestions, write a comment or find me on twitter.

If you find any problem with the descriptions in this text, please submit a PR to linux-insides-zh.


  • Address space layout randomization
  • Linux kernel boot protocol
  • long mode
  • initrd
  • Enumerated type
  • four-level page tables
  • five-level page tables
  • EFI
  • e820
  • time stamp counter
  • rdrand
  • x86_64
  • Previous part

