Table of Contents

  • Theory
    • TITLE
    • AUTHOR
    • ABSTRACT
    • INTRODUCTION
    • TREE BOOSTING IN A NUTSHELL
      • Regularized Learning Objective
      • Gradient Tree Boosting
      • Shrinkage and Column Subsampling
    • SPLIT FINDING ALGORITHMS
      • Basic Exact Greedy Algorithm
      • Approximate Algorithm
      • Weighted Quantile Sketch
      • Sparsity-aware Split Finding
    • SYSTEM DESIGN
      • Column Block for Parallel Learning
      • Cache-aware Access
      • Blocks for Out-of-core Computation
  • Advantages

Theory

  • Paper link
  • readpaper

TITLE

XGBoost: A Scalable Tree Boosting System

AUTHOR

  • Tianqi Chen 陈天奇
  • Carlos Guestrin

ABSTRACT

In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges.

More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system.

By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

Keywords: Large-scale Machine Learning

INTRODUCTION

There are two important factors that drive these successful applications: usage of effective (statistical) models that capture the complex data dependencies, and scalable learning systems that learn the model of interest from large datasets.

Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks.

LambdaMART, a variant of tree boosting for ranking, achieves state-of-the-art results for ranking problems. Besides being used as a stand-alone predictor, it is also incorporated into real-world production pipelines for ad click-through rate prediction.

In this paper, we describe XGBoost, a scalable machine learning system for tree boosting. The system is available as an open source package. The impact of the system has been widely recognized in a number of machine learning and data mining challenges. Take the challenges hosted by the machine learning competition site Kaggle for example. Among the 29 challenge winning solutions published at Kaggle's blog during 2015, 17 solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in ensembles. For comparison, the second most popular method, deep neural nets, was used in 11 solutions. The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10. Moreover, the winning teams reported that ensemble methods outperform a well-configured XGBoost by only a small amount. In short: XGBoost is extremely popular in Kaggle competitions.

These results demonstrate that our system gives state-of-the-art results on a wide range of problems. Examples of the problems in these winning solutions include: store sales prediction; high energy physics event classification; web text classification; customer behavior prediction; motion detection; ad click through rate prediction; malware classification; product categorization; hazard risk prediction; massive online course dropout rate prediction. While domain dependent data analysis and feature engineering play an important role in these solutions, the fact that XGBoost is the consensus choice of learner shows the impact and importance of our system and tree boosting.

The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important systems and algorithmic optimizations. In short, the innovations are:

  • a novel tree learning algorithm for handling sparse data;
  • a theoretically justified weighted quantile sketch procedure that handles instance weights in approximate tree learning;
  • parallel and distributed computing, which makes learning faster and enables quicker model exploration;
  • out-of-core computation, which enables data scientists to process hundreds of millions of examples on a desktop.

Finally, it is even more exciting to combine these techniques to make an end-to-end system that scales to even larger data with the least amount of cluster resources.

The major contributions of this paper are listed as follows:

  • We design and build a highly scalable end-to-end tree boosting system.
  • We propose a theoretically justified weighted quantile sketch for efficient proposal calculation.
  • We introduce a novel sparsity-aware algorithm for parallel tree learning.
  • We propose an effective cache-aware block structure for out-of-core tree learning.

TREE BOOSTING IN A NUTSHELL

Regularized Learning Objective

  • Model formula

Given a dataset with n examples and m features, $D = \{(x_i, y_i)\}$, a tree ensemble model predicts

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}$$

where $\mathcal{F} = \{ f(x) = \omega_{q(x)} \}\;(q: \mathbb{R}^m \to T,\ \omega \in \mathbb{R}^T)$ is the space of regression trees (CART),

$q$ denotes the structure of each tree and maps an example to its leaf index, and

each $f_k$ corresponds to an independent tree structure $q$ with leaf weights $\omega$.

  • Objective function

$$L(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \|\omega\|^2$$

where $\Omega(f)$ is a regularization term on tree complexity (the number of leaves $T$ and the leaf weights $\omega$) used to avoid overfitting.
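As a quick illustration (my own sketch, not code from the paper), the regularized objective with squared error as $l$ could be computed like this:

```python
import numpy as np

# Minimal sketch of L(phi) = sum_i l(y_hat_i, y_i) + sum_k Omega(f_k),
# assuming squared error for l; function names are illustrative only.
def omega(leaf_weights, gamma=1.0, lam=1.0):
    """Complexity of one tree: gamma * T + 0.5 * lambda * ||w||^2."""
    return gamma * len(leaf_weights) + 0.5 * lam * float(np.sum(np.square(leaf_weights)))

def regularized_objective(y, y_hat, trees_leaf_weights, gamma=1.0, lam=1.0):
    """trees_leaf_weights: one array of leaf weights per tree in the ensemble."""
    loss = float(np.sum((y - y_hat) ** 2))                 # sum_i l(y_hat_i, y_i)
    penalty = sum(omega(w, gamma, lam) for w in trees_leaf_weights)
    return loss + penalty
```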

Gradient Tree Boosting

The tree ensemble model in Eq. (2) includes functions as parameters and cannot be optimized using traditional optimization methods in Euclidean space.

Let $\hat{y}_i^{(t)}$ denote the prediction for the i-th instance at the t-th iteration. The task at step t is to find the $f_t$ that minimizes

$$L^{(t)} = \sum_{i=1}^{n} l\big(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$

i.e., each step greedily adds the $f_t$ that most improves the current model.

A second-order Taylor expansion of the loss around $\hat{y}_i^{(t-1)}$ gives

$$
\begin{aligned}
L^{(t)} &\approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}}\, f_t(x_i) + \frac{1}{2}\, \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}\, f_t(x_i)^2 \Big] + \Omega(f_t) \\
&= \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i\, f_t(x_i) + \frac{1}{2}\, h_i\, f_t(x_i)^2 \Big] + \Omega(f_t)
\end{aligned}
$$

where $g_i$ and $h_i$ are the first- and second-order gradients of the loss with respect to $\hat{y}_i^{(t-1)}$.

Since $l(y_i, \hat{y}_i^{(t-1)})$ is a constant at step t, it can be dropped, leaving the simplified objective

$$\hat{L}^{(t)} = \sum_{i=1}^{n} \Big[ g_i\, f_t(x_i) + \frac{1}{2}\, h_i\, f_t(x_i)^2 \Big] + \Omega(f_t)$$
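For intuition, here is a small sketch (my own, not from the paper) of the gradient statistics $g_i$ and $h_i$ for two common losses:

```python
import numpy as np

# g_i and h_i are the first- and second-order derivatives of l(y_i, y_hat)
# with respect to the previous prediction y_hat^(t-1).
def grad_hess_squared_error(y, y_hat):
    # l = 0.5 * (y - y_hat)^2  ->  g = y_hat - y,  h = 1
    return y_hat - y, np.ones_like(y_hat)

def grad_hess_logistic(y, y_hat):
    # binary logistic loss on raw scores: g = p - y,  h = p * (1 - p)
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)
```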

Define $I_j = \{\, i \mid q(x_i) = j \,\}$ as the set of instances assigned to leaf j. Expanding $\Omega$, the objective can be rewritten as

$$
\begin{aligned}
\hat{L}^{(t)} &= \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \Big] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2 \\
&= \sum_{j=1}^{T} \Big[ \Big(\sum_{i \in I_j} g_i\Big)\omega_j + \frac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big)\omega_j^2 \Big] + \gamma T
\end{aligned}
$$

For a fixed tree structure $q(x)$, setting the derivative with respect to each $\omega_j$ to zero gives the optimal leaf weight

$$\omega_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

Substituting it back gives the objective value of structure $q$, which serves as a quality score for the tree structure:

$$\hat{L}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^{T} \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

Normally it is impossible to enumerate all the possible tree structures q. A greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. The loss reduction of splitting a leaf with instance set $I$ into left and right children $I_L$ and $I_R$ is

$$
\begin{aligned}
L_{split} &= \hat{L}(I) - \big[\hat{L}(I_L) + \hat{L}(I_R)\big] \\
&= \frac{1}{2}\left[ \frac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma
\end{aligned}
$$
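A minimal sketch of the optimal leaf weight and the split gain above (function names are mine, not the XGBoost API):

```python
def leaf_weight(g_sum, h_sum, lam=1.0):
    """Optimal leaf weight: w* = -G / (H + lambda)."""
    return -g_sum / (h_sum + lam)

def leaf_score(g_sum, h_sum, lam=1.0):
    """One leaf's contribution to the structure score: G^2 / (H + lambda)."""
    return g_sum ** 2 / (h_sum + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction L_split of splitting a node into left/right children."""
    return 0.5 * (leaf_score(g_left, h_left, lam)
                  + leaf_score(g_right, h_right, lam)
                  - leaf_score(g_left + g_right, h_left + h_right, lam)) - gamma
```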

Shrinkage and Column Subsampling

Besides the regularized objective above, two more techniques are used to further prevent overfitting (a small sketch of both follows this list):

  1. Shrinkage: newly added weights are scaled by a factor η after each step of tree boosting. Shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.

  2. Column subsampling: according to user feedback, column subsampling prevents overfitting even more than the traditional row subsampling (which is also supported); it also speeds up the computation of the parallel algorithm described later.
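A minimal sketch of where the two techniques enter the boosting loop; `fit_one_tree` is a hypothetical helper standing in for the tree-construction step described in the next section:

```python
import numpy as np

def boost(X, y, n_rounds=100, eta=0.1, colsample=0.8, seed=0):
    """Sketch of the boosting loop with shrinkage (eta) and column subsampling."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    y_hat = np.zeros(n)
    trees = []
    for _ in range(n_rounds):
        g, h = y_hat - y, np.ones(n)                     # squared-error gradient statistics
        cols = rng.choice(m, size=max(1, int(colsample * m)), replace=False)
        tree = fit_one_tree(X[:, cols], g, h)            # hypothetical tree builder
        y_hat += eta * tree.predict(X[:, cols])          # shrinkage: scale the new tree by eta
        trees.append((cols, tree))
    return trees
```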

SPLIT FINDING ALGORITHMS

Basic Exact Greedy Algorithm

One of the key problems in tree learning is to find the best split as indicated by Eq. (7). In order to do so, a split finding algorithm enumerates over all the possible splits on all the features. We call this the exact greedy algorithm.

In other words, the core problem is to find the split that maximizes the $L_{split}$ gain derived above, and the exact greedy algorithm does so by enumerating every possible split on every feature.
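A minimal single-feature sketch of the exact greedy search (my own illustration, assuming no missing values):

```python
import numpy as np

def best_split_exact(x_col, g, h, lam=1.0, gamma=0.0):
    """Scan all thresholds of one feature; return (best_gain, best_threshold)."""
    order = np.argsort(x_col)
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_thresh = 0.0, None
    for pos in range(len(order) - 1):
        i = order[pos]
        G_L += g[i]; H_L += h[i]
        if x_col[order[pos]] == x_col[order[pos + 1]]:
            continue                                   # no valid threshold between equal values
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_thresh = 0.5 * (x_col[order[pos]] + x_col[order[pos + 1]])
    return best_gain, best_thresh
```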

Approximate Algorithm

The exact greedy algorithm is very powerful since it enumerates over all possible splitting points greedily. However, it is impossible to do so efficiently when the data does not fit entirely into memory.

To summarize, the approximate algorithm first proposes candidate splitting points according to percentiles of the feature distribution (a specific criterion will be given in Sec. 3.3). The algorithm then maps the continuous features into buckets split by these candidate points, aggregates the statistics, and finds the best solution among the proposals based on the aggregated statistics.

There are two variants of the algorithm, depending on when the proposal is given. The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

The global method requires fewer proposal steps than the local method. However, usually more candidate points are needed for the global proposal because candidates are not refined after each split. The local proposal refines the candidates after splits, and can potentially be more appropriate for deeper trees.

We find that the local proposal indeed requires fewer candidates. The global proposal can be as accurate as the local one given enough candidates.

The quantile strategy benefits from being distributable and recomputable, which we will detail in the next subsection. From Fig. 3, we also find that the quantile strategy can get the same accuracy as exact greedy given a reasonable approximation level.
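A minimal sketch of the approximate variant for one feature (candidates from plain percentiles here; the weighted version is described next):

```python
import numpy as np

def best_split_approx(x_col, g, h, n_candidates=32, lam=1.0, gamma=0.0):
    """Propose percentile candidates, bucket the data, then scan bucket sums."""
    qs = np.linspace(0, 100, n_candidates + 1)[1:-1]
    candidates = np.unique(np.percentile(x_col, qs))
    # bucket j holds values v with candidates[j-1] < v <= candidates[j]
    bucket = np.searchsorted(candidates, x_col, side="left")
    G_b = np.bincount(bucket, weights=g, minlength=len(candidates) + 1)
    H_b = np.bincount(bucket, weights=h, minlength=len(candidates) + 1)
    G, H = G_b.sum(), H_b.sum()
    G_L = H_L = 0.0
    best_gain, best_thresh = 0.0, None
    for j, thresh in enumerate(candidates):
        G_L += G_b[j]; H_L += H_b[j]                   # left child: x <= thresh
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thresh = gain, thresh
    return best_gain, best_thresh
```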

Weighted Quantile Sketch

One important step in the approximate algorithm is to propose candidate split points.

Let $D_k = \{(x_{1k}, h_1), \dots, (x_{nk}, h_n)\}$ denote the k-th feature values of the training instances together with their second-order gradient statistics. Define the rank function

$$r_k(z) = \frac{\sum_{(x, h) \in D_k,\; x < z} h}{\sum_{(x, h) \in D_k} h}$$

which gives the proportion of instances whose k-th feature value is smaller than z, weighted by h. The goal is to find candidate split points $\{s_{k1}, \dots, s_{kl}\}$ such that

$$\big| r_k(s_{k,j}) - r_k(s_{k,j+1}) \big| < \epsilon$$

where $\epsilon$ is an approximation factor, so there are roughly $1/\epsilon$ candidate points. Each data point is weighted by $h_i$, because the second-order objective can be rewritten as a weighted squared loss with weights $h_i$.
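The real algorithm is a mergeable, prunable streaming sketch with a provable error bound (next paragraph); purely as an illustration of the rank criterion, a non-streaming version could pick candidates like this:

```python
import numpy as np

def weighted_quantile_candidates(x_col, h, eps=0.1):
    """Pick split candidates roughly eps apart in h-weighted rank r_k(z)."""
    order = np.argsort(x_col)
    x_sorted, h_sorted = x_col[order], h[order]
    rank = np.cumsum(h_sorted) / h_sorted.sum()      # weighted rank after each point
    candidates, last = [], 0.0
    for value, r in zip(x_sorted, rank):
        if r - last >= eps:
            candidates.append(value)
            last = r
    return np.unique(np.array(candidates))
```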

For large datasets, it is non-trivial to find candidate splits that satisfy the criteria. When every instance has equal weight, an existing algorithm called quantile sketch [14, 24] solves the problem. However, there is no existing quantile sketch for weighted datasets. Therefore, most existing approximate algorithms either resorted to sorting on a random subset of data, which has a chance of failure, or to heuristics that do not have a theoretical guarantee.

To solve this problem, we introduce a novel distributed weighted quantile sketch algorithm that can handle weighted data with a provable theoretical guarantee. The general idea is to propose a data structure that supports merge and prune operations, with each operation proven to maintain a certain accuracy level.

Sparsity-aware Split Finding

There are multiple possible causes for sparsity: 1) presence of missing values in the data; 2) frequent zero entries in the statistics; and 3) artifacts of feature engineering such as one-hot encoding.

It is important to make the algorithm aware of the sparsity pattern in the data. In order to do so, we propose to add a default direction in each tree node, as shown in Fig. 4. When a value is missing, the instance is classified into the default direction (the left or the right child), and the optimal default direction is learned from the data.

The key improvement is to only visit the non-missing entries $I_k$. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values. The same algorithm can also be applied when the non-presence corresponds to a user-specified value, by limiting the enumeration only to consistent solutions.

XGBoost handles all sparsity patterns in a unified way. More importantly, our method exploits the sparsity to make computation complexity linear to the number of non-missing entries in the input.
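A minimal single-feature sketch of the sparsity-aware search (my own illustration): only non-missing entries are scanned, and the default direction is whichever side gives the larger gain when the missing entries' aggregate statistics are sent there.

```python
import numpy as np

def best_split_sparse(x_col, g, h, lam=1.0, gamma=0.0):
    """Return (gain, threshold, default_direction) for one possibly-sparse feature."""
    present = ~np.isnan(x_col)
    G, H = g.sum(), h.sum()                          # totals include missing entries
    idx = np.where(present)[0][np.argsort(x_col[present])]

    best = (0.0, None, None)
    # Case 1: missing values go to the RIGHT child (scan left to right).
    G_L = H_L = 0.0
    for i in idx:
        G_L += g[i]; H_L += h[i]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2/(H_L + lam) + G_R**2/(H_R + lam) - G**2/(H + lam)) - gamma
        if gain > best[0]:
            best = (gain, x_col[i], "right")
    # Case 2: missing values go to the LEFT child (scan right to left).
    G_R = H_R = 0.0
    for i in idx[::-1]:
        G_R += g[i]; H_R += h[i]
        G_L, H_L = G - G_R, H - H_R
        gain = 0.5 * (G_L**2/(H_L + lam) + G_R**2/(H_R + lam) - G**2/(H + lam)) - gamma
        if gain > best[0]:
            best = (gain, x_col[i], "left")
    return best
```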

SYSTEM DESIGN

Column Block for Parallel Learning

The most time consuming part of tree learning is to get the data into sorted order. In order to reduce the cost of sorting, we propose to store the data in in-memory units, which we call blocks. Data in each block is stored in the compressed column (CSC) format, with each column sorted by the corresponding feature value. This input data layout only needs to be computed once before training, and can be reused in later iterations.
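A minimal in-memory sketch of the sorted column layout (the real system stores CSC-compressed blocks, possibly spread across machines or disks):

```python
import numpy as np

def build_column_blocks(X):
    """Pre-sort each feature once: store (sorted values, matching row indices)."""
    blocks = []
    for k in range(X.shape[1]):
        col = X[:, k]
        rows = np.where(~np.isnan(col))[0]           # missing entries are simply omitted
        order = np.argsort(col[rows])
        blocks.append({"values": col[rows][order],   # feature values in sorted order
                       "rows": rows[order]})         # row index of each stored value
    return blocks
```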

The block structure also helps when using the approximate algorithms. Multiple blocks can be used in this case, with each block corresponding to a subset of rows in the dataset. Different blocks can be distributed across machines, or stored on disk in the out-of-core setting. Using the sorted structure, the quantile finding step becomes a linear scan over the sorted columns. This is especially valuable for the local proposal algorithm, where candidates are generated frequently at each branch. The binary search in histogram aggregation also becomes a linear-time merge-style algorithm.

Collecting statistics for each column can be parallelized, giving us a parallel algorithm for split finding. Importantly, the column block structure also supports column subsampling, as it is easy to select a subset of columns in a block.

Cache-aware Access

While the proposed block structure helps optimize the computation complexity of split finding, the new algorithm requires indirect fetches of gradient statistics by row index, since these values are accessed in order of feature value. This is a non-continuous memory access. A naive implementation of split enumeration introduces an immediate read/write dependency between the accumulation and the non-continuous memory fetch operation (see Fig. 8). This slows down split finding when the gradient statistics do not fit into the CPU cache and cache misses occur.

For the exact greedy algorithm, we can alleviate the problem by a cache-aware prefetching algorithm. Specifically, we allocate an internal buffer in each thread, fetch the gradient statistics into it, and then perform accumulation in a mini-batch manner. This prefetching changes the direct read/write dependency into a longer dependency and helps to reduce the runtime overhead when the number of rows in the block is large.

We find that the cache-aware implementation of the exact greedy algorithm runs twice as fast as the naive version when the dataset is large.
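A rough sketch of the idea (illustrative only, not the actual multithreaded C++ implementation): gather the gradient statistics for a small batch of rows into a contiguous buffer before accumulating, instead of accumulating through scattered row-index lookups one at a time.

```python
import numpy as np

def accumulate_cache_aware(rows_in_value_order, g, h, batch=256):
    """Accumulate G and H over rows visited in feature-value order, mini-batch style."""
    G = H = 0.0
    for start in range(0, len(rows_in_value_order), batch):
        idx = rows_in_value_order[start:start + batch]
        g_buf, h_buf = g[idx], h[idx]        # gather into small contiguous buffers
        G += g_buf.sum()                     # accumulate from the buffers
        H += h_buf.sum()
    return G, H
```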

For approximate algorithms, we solve the problem by choosing a correct block size. We define the block size to be the maximum number of examples contained in a block, as this reflects the cache storage cost of gradient statistics. Choosing an overly small block size results in a small workload for each thread and leads to inefficient parallelization; overly large blocks result in cache misses, as the gradient statistics do not fit into the CPU cache. A good choice of block size balances these two factors.

This result validates our discussion and shows that choosing $2^{16}$ examples per block balances the cache property and parallelization.

Blocks for Out-of-core Computation

One goal of our system is to fully utilize a machine's resources to achieve scalable learning. Besides processors and memory, it is important to utilize disk space to handle data that does not fit into main memory.

To enable out-of-core computation, we divide the data into multiple blocks and store each block on disk. During computation, it is important to use an independent thread to pre-fetch the block into a main memory buffer, so computation can happen concurrently with disk reading.

However, this does not entirely solve the problem, since disk reading takes most of the computation time. It is important to reduce the overhead and increase the throughput of disk IO.

We mainly use two techniques to improve the out-of-core computation.

  1. Block Compression. The block is compressed by columns, and decompressed on the fly by an independent thread when loading into main memory. This trades some decompression computation for the disk reading cost. A general-purpose compression algorithm is used for the feature values; for the row index, the starting index of the block is subtracted from the row index and each offset is stored as a 16-bit integer (see the sketch after this list).
  2. Block Sharding. The data is sharded onto multiple disks in an alternating manner. A pre-fetcher thread is assigned to each disk and fetches the data into an in-memory buffer; the training thread then alternately reads the data from each buffer. This helps increase the throughput of disk reading when multiple disks are available.
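As a small illustration of the row-index compression (my own sketch): each row index is stored as a 16-bit offset from the block's starting row, which also explains the $2^{16}$ examples-per-block limit mentioned earlier.

```python
import numpy as np

def compress_row_indices(row_indices, block_start):
    """Store row indices as 16-bit offsets relative to the block's first row."""
    offsets = np.asarray(row_indices, dtype=np.int64) - block_start
    assert 0 <= offsets.min() and offsets.max() < 2**16, "block holds at most 2**16 rows"
    return offsets.astype(np.uint16)                  # 2 bytes per index

def decompress_row_indices(offsets, block_start):
    return offsets.astype(np.int64) + block_start
```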

Advantages

  1. Shrinkage and column subsampling are supported, and the regularized objective is optimized with a second-order Taylor expansion.
  2. A weighted quantile sketch buckets features into candidate split points, speeding up split finding.
  3. Features are pre-sorted once before training, and the sorted layout is reused in every iteration.
  4. Missing values are handled by learning a default direction at each node.
  5. Column blocks (CSC-compressed, sorted columns) enable parallel split finding, cache-aware access, and out-of-core computation.
