Paper Reading: ThinLTO: Scalable and Incremental LTO

Reading notes on the paper ThinLTO: Scalable and Incremental LTO

A little bit of history.

SYZYGY – A Framework for Scalable Cross-Module IPO

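ielf: a persistent intermediate representation (IR) based on the ELF format, storing IR in a custom format (together with summaries). The paper reports that ielf files are on average five times the size of the original ELF files, and up to 700 times in extreme cases. Its role in SYZYGY is similar to that of bitcode IR in LLVM ThinLTO.
- Note: as the figure above shows, the FE generates ielf files directly and also collects summary information, on which the later interprocedural analysis is based.
FE: compiles the source, performs simple optimizations, generates IR, collects summary information, and writes the ielf file.
u2comp: the actual IPO step. It performs interprocedural analysis, mainly on the summary information, and writes the analysis results back into the ielf files, which means it updates the summaries.
BE: consumes the analysis results produced by u2comp, then performs the final optimizations and code generation.

The core of the design is clearly ielf and the summaries, but the paper does not describe them in detail.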

GCC LTO

For the history of GCC LTO, see Jan Hubička's blog (Honza Hubička’s Blog), which traces how GCC evolved toward LTO through the following stages:

statement-at-a-time

basic common subexpression elimination or peephole optimization

mixture of statement-at-a-time and function-at-a-time

The frontend was statement-at-a-time while the RTL side was function-at-a-time. This was GCC 2.9.

function-at-a-time

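With the rise of C++, GCC gained a more powerful frontend in order to compile C++ better, and implemented function-at-a-time parsing. Because C++ works at a higher level of abstraction, programs contain many small functions (the so-called abstraction penalty), so inlining became critical and was iterated on continuously. This was GCC 3.0.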

unit-at-a-time

A new callgraph module was designed, which let unit-level interprocedural analysis mature. This was GCC 3.4.

–combine

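Because Open64 had been open-sourced with better interprocedural analysis and some LTO features, GCC had fallen well behind. Geoffrey Keating therefore extended GCC so that it could compile multiple source files at once and feed the result to the GCC backend as a single unit, but only for C.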

GIMPLE

GIMPLE was presented in the Proceedings of the 2003 GCC Developers Summit. The original GCC structure:

The GCC structure after GIMPLE was added:

LLVM

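In 2003 Chris Lattner pitched LLVM as a GCC backend that would replace GIMPLE; see Architecture for a Next-Generation GCC. The paper came after the GIMPLE one and was largely Lattner's pitch for LLVM. It argued that GCC had grown increasingly complex as it moved through statement-at-a-time, function-at-a-time, and unit-at-a-time, and that LLVM could be brought into GCC as its optimizer. The overall structure is shown in the figure below. This would have been a win for both sides: LLVM would not have to worry about frontends and the accompanying binutils, and GCC would not have to worry about LTO. In the end it never happened.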

From what I understand, LLVM has never been used outside of a research
environment and it can only generate code for a very limited set of
targets. These two are very serious limitations.

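As late as 2005 the two sides were still discussing whether to bring the LLVM project into GCC: https://marc.info/?l=gcc&m=113269846722569&w=2 . The thread contains many interesting arguments, for example that LLVM's performance was poor, that LLVM is written in C++ while GCC was written in C, and whether LLVM was really meant to replace GIMPLE.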

At the time Tree-SSA merge was being prepared, Chris Lattner presented at 2003 GCC Summit an proposal of itegration of LLVM into GCC as a new middle-end that would take place of the (not yet finished) Tree-SSA project. This never happened and as a result we now have two major open source compilers.
One of major trading point at that time was implementation of LTO. People made guesses if it will be harder to modify GCC architecture for LTO or if it would be harder to complete LLVM to features and richness of backends GCC had. It is definitely interesting to observe that from today point of view.

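Hubička cannot resist a subtle dig here. Given what LLVM has since grown into, I wonder what he thinks today. GCC now finally has mature LTO, and LLVM's backends are strong enough too.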

LTO

In 2005, Link-Time Optimization in GCC: Requirements and High-Level Design was published. I have not read this paper closely.

WHOPR

In 2007 a new proposal followed: WHOPR - Fast and Scalable Whole Program Optimizations in GCC. Its ideas are very similar to SYZYGY's. The structure of WHOPR is shown below:

The whole process is divided into the following steps:

Local generation (LGEN). This stage executes in parallel. Every file in the program is compiled into intermediate language (IL) together with the local call-graph and summary information.

Whole Program Analysis (WPA). This stage is performed sequentially. The global call-graph is assembled, and a global analysis process makes transformation decisions. The global call-graph may be partitioned to facilitate optimization during phase 3.

Local transformations (LTRANS). This stage executes in parallel. All the decisions made during phase 2 are implemented locally in each target file, and final object code is generated. Optimizations which cannot be decided efficiently during phase 2 may be performed on local call-graph partitions.

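At present the IL is GIMPLE, accompanied by summary information, mainly the call graph (plus the symbol table and type table), which is similar to SYZYGY. WPA operates on this call graph and the summary information. This, too, is meant to address compile time and memory consumption.

In addition, to support incremental analysis, a checksum can be computed over the summary information so that unchanged summaries can be reused.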
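The checksum idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not WHOPR's actual mechanism: the FNV-1a hash, the function names `summary_checksum` and `summary_is_reusable`, and the notion of one cached checksum per module are all made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over the serialized summary bytes.
 * The hash choice is an assumption; any stable checksum would do. */
uint64_t summary_checksum(const unsigned char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= bytes[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* Returns 1 when the summary has the same checksum as in the previous
 * build, so cached analysis results for it may be reused. */
int summary_is_reusable(const unsigned char *bytes, size_t len,
                        uint64_t cached_checksum)
{
    return summary_checksum(bytes, len) == cached_checksum;
}
```

A build driver would store the checksum next to the cached results and call `summary_is_reusable` before redoing the analysis for a module.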

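To support third-party libraries, some simple summary information can also be specified by hand, such as mod/ref and type information.

Optimizations fall roughly into two classes: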

Global decisions with local transformations

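These need only the analysis results computed from the global summary information; the optimization itself can then be applied while each module is code-generated on its own. This is probably the ideal arrangement. It covers the following optimizations: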

Devirtualization
Inline caching
PIC optimizations
Common block padding and splitting, array padding
Global variable analysis (assigned ones, global->static)
Class hierarchy analysis, static cast removal
Indirect call promotion
Dead variable elimination
Dead function elimination
Field reordering, structure splitting/peeling
Data reuse analysis
Inter-procedural prefetch

Global decisions with global transformations

These optimizations need the relevant information from every module to be held in memory (that is, all visible at once). They are largely inlining-related.

Cross-module inlining
Virtual function inlining
Cloning
Inter-procedural points-to
Inter-procedural constant propagation

After WHOPR was merged into GCC and went through several iterations, it became the default GCC LTO implementation as of GCC 4.6.

LIPO

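The paper behind LIPO is Lightweight Feedback-Directed Cross-Module Optimization. As noted earlier, the problem LTO or CMO has to face is how to see a wider scope and optimize more aggressively while keeping compile time and memory overhead under control. The paper lays out the logic of LTO very clearly: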

In the parallelizable front-end phase, the compiler performs a reasonable amount of optimizations for code cleanup and canonicalization. It may also compute summary information for consumption by IPO. This phase writes the compiler intermediate representation (IR) to disk as fake ELF object files, which allows seamless integration into existing build systems.

The IPO phase is typically executed at link time. IPO reads in the fake object files, or parts of them, e.g., sections containing summary data. It may perform type unification and build an interprocedural symbol table, perform analysis, either on summaries, IR, or combined IR, make transformation decisions or perform actual transformations, and readies output for consumption by the back-end phase.

The parallelizable backend phase accepts the compiler IR from IPO, either directly out of memory or via intermediate fake object files, and performs a stronger set of scalar, loop, and other optimizations, as well as code generation and object file production. These final object files are then fed back to the linker to produce the final binary.

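LIPO takes a different route: it performs cross-module analysis at run time, based on the profile information collected for FDO. The paper first lists interprocedural optimizations and then points out that the most important ones are inlining and indirect call promotion.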

inlining and cloning
indirect function call promotion
constant propagation
alias
mod/ref
points-to analysis
register allocation
register pressure estimation
global variable optimizations
code and data layout techniques
profile propagation techniques
C++ class hierarchy analysis and devirtualization
dead variable and function elimination, and many more.

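Arguably the authors "deliberately" emphasize the two optimizations that LIPO improves the most: inlining and indirect call promotion.

The authors then list several drawbacks of LTO:

The IR has to be serialized between disk and memory, which is expensive.
Even a small change to a single file may retrigger LTO.
Combining debugging with LTO can be troublesome.

LIPO comes down to the following key points: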

We seamlessly integrate IPO and FDO.

We move the IPA analysis phase into the binary and execute it at the end of the FDO training run.

We add aggregated IPA analysis results to the FDO profiles.

During the FDO optimization build, we use these results to read in additional source modules and form larger pseudo modules to extend scope and to enable more intra-module inter-procedural optimizations.

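The core idea is to use runtime information to build a dynamic call graph, combine modules with high affinity (called auxiliary modules) into a larger module (called a module grouping), and then perform cross-module optimization on it. Affinity is judged from the hot edges of the call graph; this is called source module affinity analysis.

The paper describes the algorithm for deciding whether modules should be grouped together; I will not repeat it here.

The LIPO process is illustrated with an example: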

// a.c
int foo(char *d, unsigned int i) {
    return d[i] + d[i + 1];
}

// b.c
char tokens[] = {1, 3, 5, 7, 11, 13};
extern int foo(char *d, unsigned int i);
int bar(void) {
    return foo(tokens, 3);
}

// main.c
#include <stdio.h>
#include <string.h>
extern int foo(char *d, unsigned int i);  /* defined in a.c */
extern int bar(void);                     /* defined in b.c */
int main(int argc, char **argv) {
    unsigned int i, s = 0;
    if (argc != 2) return 1;
    /* i + 1 < len avoids unsigned underflow when argv[1] is empty */
    for (i = 0; i + 1 < strlen(argv[1]); i++) s += foo(argv[1], i);
    printf("sum: %u\n", s);
    printf("tokens: %d\n", bar());
    return 0;
}

Suppose the call to foo inside the for loop in main.c is a hot call. The two modules are then grouped together, giving {main.c, a.c (aux)}, {a.c}, and {b.c}.
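The grouping above can be approximated with a few lines of C. This is a deliberately simplified sketch, not the algorithm from the LIPO paper: it assumes the dynamic call graph has already been aggregated into per-module edges, and `struct call_edge`, `group_modules`, `MAX_MODULES`, and the single `hot_threshold` cutoff are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MODULES 8   /* illustrative fixed capacity */

/* One edge of the dynamic call graph, aggregated per module pair. */
struct call_edge {
    int caller_module;
    int callee_module;
    unsigned long count;   /* profiled call count */
};

/* Fills aux so that aux[m][n] != 0 means module n becomes an auxiliary
 * module of primary module m. A callee module is attached whenever a
 * cross-module edge is at least as hot as hot_threshold. */
void group_modules(const struct call_edge *edges, size_t n_edges,
                   unsigned long hot_threshold,
                   int aux[MAX_MODULES][MAX_MODULES])
{
    for (size_t m = 0; m < MAX_MODULES; m++)
        for (size_t n = 0; n < MAX_MODULES; n++)
            aux[m][n] = 0;

    for (size_t e = 0; e < n_edges; e++) {
        const struct call_edge *ce = &edges[e];
        assert(ce->caller_module >= 0 && ce->caller_module < MAX_MODULES);
        assert(ce->callee_module >= 0 && ce->callee_module < MAX_MODULES);
        if (ce->caller_module == ce->callee_module)
            continue;                       /* intra-module call */
        if (ce->count >= hot_threshold)
            aux[ce->caller_module][ce->callee_module] = 1;
    }
}
```

With a.c, b.c, and main.c numbered 0, 1, and 2, a hot main-to-foo edge and cold edges elsewhere yield only `aux[2][0]` set, which matches the grouping {main.c, a.c (aux)}.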

A side note: because LIPO (a Google project) was so successful, WHOPR (also proposed by Google) was left without developers, and it was not merged into GCC until 2009.

LLVM full LTO

LLVM full LTO is a fairly straightforward LTO implementation: the frontend emits bitcode IR, which is materialized at link time and merged into one huge IR module before optimization runs.

ThinLTO: Scalable and Incremental LTO

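ThinLTO closely resembles WHOPR and SYZYGY: all three are summary-based Whole-Program Analysis (WPA), and they differ mainly in engineering and implementation details. The ThinLTO paper lists some drawbacks of earlier LTO/CMO implementations (though I do not find the evidence entirely conclusive).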

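Implementation	Drawbacks
LLVM full LTO & HP-UX compiler	Merges all compilation units into one, which is expensive; IPA/IPO cannot be parallelized
SYZYGY	Summary-based analysis; inlining cannot be done in parallel and the ielf files must be modified frequently along the way; IPA/IPO cannot be parallelized; cannot handle incremental builds
Open64	Summary-based analysis; the IR must be modified frequently along the way; IPA/IPO cannot be parallelized; cannot handle incremental builds
WHOPR	Summary-based analysis; frequent I/O during IPA? (I have not read the implementation, so this is uncertain); IPA/IPO cannot be parallelized; cannot handle incremental builds
LIPO	Requires instrumentation and profile collection; module grouping narrows the scope to reduce cost, but large module groups still cause memory issues; also requires frontend changes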
ThinLTO also has a serial phase, but it runs global analysis on a compact summary (Comment: I have not compared it against other LTO summary formats, but ThinLTO's summary is indeed compact), so its cost stays manageable. ThinLTO introduces a novel technique, the function importing transformation, to enable inlining. Unlike LIPO or WHOPR, ThinLTO imports only the functions it needs, which lowers the cost further.

ThinLTO Design

ThinLTO is roughly divided into the following steps:

Compile: Generate IR as with LTO mode, but extended with module summaries.

Thin link: Thin linker plugin layer to combine summaries and perform global analyses.

ThinLTO backend: Parallel backends with summary-based importing and optimizations.

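There is not much to say about the steps above. More important is the figure below, which shows clearly where the "thin" in ThinLTO lies: the summaries and the whole-program analysis built on them.

Every global variable and function has an entry.
Each entry contains metadata that drives the later global analysis.
A function entry includes the linkage type, the instruction count, optional PGO information, and some flags.
All references and call relationships are recorded as well. Calls may carry PGO hotness information.

The thin link mainly aggregates the summary information (also called the summary index) of the individual modules into a single combined summary index, on which the subsequent analysis is based.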
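To make the summary entries and the combining step more concrete, here is a minimal sketch. The layout is hypothetical and much simpler than LLVM's real summary format: `struct function_summary`, `struct combined_index`, `index_add_module`, `index_lookup`, and the fixed `MAX_ENTRIES` capacity are names and choices invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 64   /* illustrative fixed capacity */

enum hotness { HOTNESS_UNKNOWN, HOTNESS_COLD, HOTNESS_HOT };

/* One outgoing call recorded in a function's summary. */
struct call_summary {
    uint64_t callee_id;     /* global identifier of the callee */
    enum hotness hotness;   /* optional PGO information */
};

/* Per-function summary entry (hypothetical layout). */
struct function_summary {
    uint64_t id;            /* global identifier of this function */
    int module_id;          /* module that defines it */
    int is_local;           /* simplified linkage: 1 = internal/static */
    unsigned inst_count;    /* number of instructions */
    const struct call_summary *calls;
    size_t n_calls;
};

/* Combined index built by the thin link from all per-module summaries. */
struct combined_index {
    const struct function_summary *entries[MAX_ENTRIES];
    size_t n_entries;
};

/* Append one module's summaries to the combined index. */
void index_add_module(struct combined_index *idx,
                      const struct function_summary *fns, size_t n)
{
    assert(idx->n_entries + n <= MAX_ENTRIES);
    for (size_t i = 0; i < n; i++)
        idx->entries[idx->n_entries++] = &fns[i];
}

/* Linear lookup by identifier; returns NULL when the id is unknown. */
const struct function_summary *index_lookup(const struct combined_index *idx,
                                            uint64_t id)
{
    for (size_t i = 0; i < idx->n_entries; i++)
        if (idx->entries[i]->id == id)
            return idx->entries[i];
    return NULL;
}
```

The point of the sketch is that the index holds only small records and pointers, never function bodies, which is why the serial phase can stay cheap.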

Function Importing

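Function importing is one of the more important steps in the whole process. For a given module, only the functions it needs are imported, which keeps the cost down. The key question is therefore how to decide whether a function should be imported. The criterion in the paper is that the function is hot enough and small enough for inlining to benefit. The whole decision is controlled by a threshold; to extend it later, one only needs to adjust the threshold according to additional criteria.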
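A threshold-based import decision might look like the sketch below. It is an assumption-laden illustration rather than the paper's or LLVM's actual heuristic: `should_import`, the `hot_factor` multiplier, and the rule that cold edges are never followed are all hypothetical.

```c
#include <assert.h>

enum hotness { HOTNESS_UNKNOWN, HOTNESS_COLD, HOTNESS_HOT };

/* Decide whether a callee defined in another module is worth importing.
 *   inst_count : callee size taken from its summary
 *   edge       : hotness of the call edge, if PGO data is available
 *   threshold  : base size limit (hypothetical tuning knob)
 *   hot_factor : how much larger a hot callee may be (assumed policy)
 * Returns 1 to import, 0 otherwise. */
int should_import(unsigned inst_count, enum hotness edge,
                  unsigned threshold, unsigned hot_factor)
{
    assert(hot_factor >= 1);
    if (edge == HOTNESS_COLD)
        return 0;                        /* assumed: skip cold edges */
    unsigned limit = threshold;
    if (edge == HOTNESS_HOT)
        limit = threshold * hot_factor;  /* hot calls get a larger budget */
    return inst_count <= limit;
}
```

Because the decision reads only the summary (size and hotness), it can be made during the thin link without loading any function body.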

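When the backend performs the actual optimization, each module imports (that is, copies in) the functions it needs. This happens early in the backend so that the other IPO passes can benefit.

If an imported function references a local symbol (for example a variable declared static), that symbol must be promoted to global scope, and it must be renamed so that it does not clash with same-named locals in other modules. ThinLTO thus goes a step further than LIPO: LIPO groups whole modules, whereas ThinLTO imports only the functions that are needed. Note also that ThinLTO keeps an interface for PGO profile data to assist function importing.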
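The renaming step can be illustrated with a small helper. The function `promoted_name` and the idea of passing a per-module hash are hypothetical, and the `.llvm.<hash>` suffix is only loosely modeled on what LLVM appears to emit; treat the exact format as an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a globally unique name for a promoted local symbol by appending a
 * suffix derived from its defining module. Returns 0 on success and -1 if
 * the output buffer is too small. The suffix format is illustrative. */
int promoted_name(char *out, size_t out_size,
                  const char *local_name, unsigned long module_hash)
{
    assert(out != NULL && local_name != NULL);
    assert(strlen(local_name) > 0);
    int n = snprintf(out, out_size, "%s.llvm.%lu", local_name, module_hash);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Two modules that each define `static int counter` would then end up with distinct global names as long as their module hashes differ.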

ThinLTO Cross-Module Optimizations

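Full LTO has an important concept called internalization, which applies to global symbols: once "all" of the IR has been merged into one huge module, any symbol found to no longer need external visibility is internalized. Visibility is a related issue. Internalization has two benefits: it enables dead symbol elimination, and it allows more precise and more aggressive analyses such as alias analysis. With the global reference graph, ThinLTO can perform internalization as well.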
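A summary-level internalization check could be sketched as follows. This is a simplification under stated assumptions: `struct ref_edge`, `struct symbol_info`, and `can_internalize` are hypothetical, the `exported` flag stands in for whatever the linker reports, and the interaction with function importing is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* A reference edge in the global reference graph (hypothetical layout). */
struct ref_edge {
    int from_module;   /* module containing the referencing code */
    int to_symbol;     /* index of the referenced symbol */
};

struct symbol_info {
    int def_module;    /* module that defines the symbol */
    int exported;      /* 1 if the linker requires it to stay visible */
};

/* Returns 1 if symbol `sym` can be internalized: the linker does not
 * require it to stay visible and no other module references it. */
int can_internalize(int sym, const struct symbol_info *syms, size_t n_syms,
                    const struct ref_edge *refs, size_t n_refs)
{
    assert(sym >= 0 && (size_t)sym < n_syms);
    if (syms[sym].exported)
        return 0;
    for (size_t i = 0; i < n_refs; i++)
        if (refs[i].to_symbol == sym &&
            refs[i].from_module != syms[sym].def_module)
            return 0;   /* referenced from a different module */
    return 1;
}
```

A symbol that passes this check can be given internal linkage in its own module, after which ordinary per-module passes can delete it if it turns out to be unused.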

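Another point to note is weak symbol resolution, which introduces two concepts: prevailing and preempted. When the linker picks one copy of a weak symbol and discards the others, the chosen copy is the prevailing symbol and the other copies are preempted. (Comment: linkage will be covered in a later post.)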
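The prevailing/preempted split can be written down in a few lines. The sketch assumes the linker has already chosen which module's copy wins and simply records that decision; `resolve_weak` and `enum resolution` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum resolution { RES_PREVAILING, RES_PREEMPTED };

/* def_modules[i] is the module holding the i-th copy of a weak symbol.
 * Marks the copy in prevailing_module as prevailing and every other copy
 * as preempted. Returns the number of preempted copies, or -1 if the
 * chosen module holds no copy. */
int resolve_weak(const int *def_modules, size_t n_copies,
                 int prevailing_module, enum resolution *out)
{
    assert(def_modules != NULL && out != NULL);
    int found = 0, preempted = 0;
    for (size_t i = 0; i < n_copies; i++) {
        if (!found && def_modules[i] == prevailing_module) {
            out[i] = RES_PREVAILING;
            found = 1;
        } else {
            out[i] = RES_PREEMPTED;
            preempted++;
        }
    }
    return found ? preempted : -1;
}
```

Once the preempted copies are known, a backend could drop them or keep them only as inlining candidates, which is one of the cross-module decisions the summaries make possible.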

For indirect call promotion (ICP), PGO profile data is particularly important: it tells the optimizer that a given call target is hot enough to justify transforming the code. ThinLTO's summary keeps an interface for this as well.

Integration with Profile Guided Optimization (PGO)

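As mentioned earlier, ThinLTO can be combined with PGO. The profile data is stored with the call information in the summary and marks each call as hot or cold. It can then be propagated along the global call graph to assist function importing.

This post only covers what I read in the paper. The details of the actual ThinLTO implementation will be discussed in a future post.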