

• Materialization points (e.g., for multiple consumers)
• Sparsity exploitation and ordering of sparse inputs
• Decisions on fusion patterns (e.g., template types), and
• Constraints (e.g., memory budget and blocksizes) and costs for local and/or distributed operations.
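For example, decisions on materialization points must weigh redundant computation against materialization, and the number of plans to compare grows exponentially. Baseline solutions are heuristics such as fuse-all or fuse-no-redundancy, but these heuristics struggle to find good plans for complex DAGs or for hybrid plans of local and distributed operations.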
• System Architecture: We describe the integration into SystemML in Section 2. This overview includes the compiler integration, our optimization framework, code generation plans, and their runtime integration.
• Candidate Exploration: In Section 3, we introduce a novel bottom-up algorithm for the efficient exploration of valid partial fusion plans. We also discuss our memoization data structure and basic pruning rules.
• Candidate Selection: In Section 4, we then present strategies for selecting the optimal candidate plans. We formulate the problem, describe heuristics, and introduce our novel cost-based plan selection, including its search space, cost model, and enumeration algorithm.
• Experiments: In Section 5, we then report on extensive experiments in SystemML. These cover micro benchmarks for code generation, end-to-end performance in local and distributed environments, as well as comparisons with Julia, TensorFlow, and fusion heuristics.


First, for candidate exploration (Section 3), we make a bottom-up pass over the HOP DAG to explore all valid partial fusion plans and store these plans in a memoization table, organized by HOPs. Second, during candidate selection (Section 4), we choose the optimal subset of fusion plans using a time-based cost model. Third, we construct code generation plans (CPlans, Section 2.2), which are a backend-independent basis for code generation, for all selected fusion plans. Fourth, we then recursively expand templates for these given CPlans to generate Java source code for each fused operator, compile the classes, and load them into the JVM. By default, we use the fast Janino compiler [89] but also support the standard javac compiler. Generated fused operators are maintained in a plan cache, which identifies equivalent CPlans via hashing, to avoid redundant code generation and compilation for existing operators. Finally, we replace covered parts of the HOP DAG by the fused operators. These separate compilation steps are very valuable for debugging without compromising on fusion potential.

2.2 Code Generation Plans

Code generation plans (CPlans) [27] are a backend-independent representation of fused operators and allow for recursive code generation. We generate code via a depth-first template expansion to ensure valid ordering. Such plans consist of CNodes, which are either template or basic operation nodes. Template nodes represent generic fused operator skeletons that have a specific data binding and contain a DAG of basic operations that encodes the data flow.

Example Expressions: We illustrate CPlans for two typical expressions with high performance impact of fusion. The first expression is part of an inner-loop update rule of ALS-CG (alternating least squares via conjugate gradient)

    public final class TMP10 extends SpoofCellwise {
      public TMP10() {super(CellType.NO_AGG, null, false);}
      protected double genexec(double a, SideInput[] b,
          double[] scalars,…, int rix, int cix) {
        double TMP5 = getValue(b[0], n, rix, cix);
        double TMP6 = a * 1.0E-6;
        double TMP7 = getValue(b[1], rix);
        double TMP8 = TMP6 * TMP7;
        double TMP9 = TMP5 + TMP8;
        return TMP9; }}

Finally, Figure 3(c) shows the row-wise CPlan of Expression (2). This fused operator requires only a single pass over X by exploiting temporal locality and it avoids six large intermediates. The memory for row intermediates is managed via a preallocated ring buffer per thread (here of size 5).

    public final class TMP25 extends SpoofRowwise {
      public TMP25() {super(RowType.COL_AGG_B1_T,true,5);}
      protected void genexecDense(double[] a, int ai,
          SideInput[] b, double[] c,…, int len) {
        double[] TMP11 = getVector(b[1].vals(rix),…);
        double[] TMP12 = vectMatMult(a,b[0].vals(rix),…);
        double[] TMP13 = vectMult(TMP11, TMP12, 0,0,…);
        double TMP14 = vectSum(TMP13, 0, TMP13.length);
        double[] TMP15 = vectMult(TMP11, TMP14, 0,…);
        double[] TMP16 = vectMinus(TMP13, TMP15, 0,0,…);
        vectOuterMultAdd(a, TMP16, c, ai,0,0,…); }
      protected void genexecSparse(double[] avals, int[]
          aix, int ai, SideInput[] b, …, int len){…}}
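Both generated classes above override only a genexec method and leave iteration over the data to a base class. As a rough illustration of that split, here is a minimal sketch of a cell-wise skeleton. `CellSkeletonSketch`, `ScaleSketch`, and `execute` are hypothetical names invented for this example and are not the SystemML `SpoofCellwise` API; the sketch handles only dense input and omits multi-threading, aggregation, and side inputs.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for a hand-coded cell-wise skeleton.
abstract class CellSkeletonSketch {
    private final boolean sparseSafe;

    CellSkeletonSketch(boolean sparseSafe) { this.sparseSafe = sparseSafe; }

    // The only method a generated operator would need to override.
    protected abstract double genexec(double a, int rix, int cix);

    // Hand-coded data access over a dense matrix; calls genexec per cell.
    final double[][] execute(double[][] in) {
        double[][] out = new double[in.length][];
        for (int r = 0; r < in.length; r++) {
            out[r] = new double[in[r].length];
            for (int c = 0; c < in[r].length; c++) {
                double a = in[r][c];
                // A sparse-safe operator maps 0 to 0, so zero cells are skipped.
                if (sparseSafe && a == 0) continue;
                out[r][c] = genexec(a, r, c);
            }
        }
        return out;
    }
}

// Plays the role of a generated operator: only the per-cell body is specific.
final class ScaleSketch extends CellSkeletonSketch {
    ScaleSketch() { super(true); }
    @Override protected double genexec(double a, int rix, int cix) { return a * 1.0E-6; }
}

final class CellSkeletonDemo {
    public static void main(String[] args) {
        double[][] y = new ScaleSketch().execute(new double[][] {{2.0, 0.0}, {0.0, 4.0}});
        System.out.println(Arrays.deepToString(y));
    }
}
```

Keeping the loop in the base class means a new fused operator adds only a few lines of straight-line code, which is the property the lean generated classes rely on.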

Runtime Integration: Templates refer to generic skeletons of fused operators, which are inspired by algorithmic skeletons [21]. Figure 4 exemplifies this runtime integration using the Cell template. Unlike existing work [9, 22, 48], we made the conscious design decision not to generate the data access into the fused operators. Instead, the hand-coded skeleton implements the data access (depending on its sparse-safeness over cells or non-zero values) of dense, sparse, or compressed [28] matrices and calls an abstract (virtual) genexec method for each value. Generated operators inherit this skeleton and only override the specific genexec, which yields very lean yet efficient operators. The skeleton also handles multi-threading, cache blocking, memory management, and pseudo-sparse-safe aggregations. Sharing common skeletons and vector primitives among fused operators can also reduce the instruction footprint and thus L1 instruction cache misses, which is a known bottleneck in OLTP [85] and scale-out workloads [30].

3. CANDIDATE EXPLORATION

The exploration of candidate fusion plans aims to identify all valid partial fusion plans to provide a common input for different plan selection policies and simplify optimization. However, the exponential search space prohibits the enumeration of all possible plans. Instead, we enumerate partial fusion plans per operator, which represent local fusion decisions. We describe (1) the representations of partial fusion plans in our central memoization table, and (2) an efficient algorithm for populating this memo table in a single pass over the HOP DAG, including pruning techniques.

3.1 Memoization Table

Our memoization (memo) table consists of a set of groups, where each group represents the output of an operator in the HOP DAG, i.e., a logical subexpression. Each group is identified by the operator ID, has access to its operator meta data, and contains a set of valid partial fusion plans for this operator.
A partial fusion plan is called a memo table entry, and can reference other groups to represent fusion decisions. This structure is similar to groups and group expressions in the Cascades Optimization Framework [16, 34, 83], but we use it merely as a compact representation of fusion plans, which only includes operators that are amenable to fusion.

Memo Table Entries: A memo table entry is a tuple (type, {i1, …, ik}, closed), consisting of a template type (as introduced in Table 1), a list of input references, and a close status. The list of inputs corresponds to HOP inputs (i.e., data dependencies) by position, and each input is either a group reference, which indicates fusion, or -1, which indicates a materialized intermediate. A reference from an entry to a group implies that the group contains at least one compatible fusion plan. Finally, the close status is one of open valid, open invalid (i.e., an invalid entry point), closed valid, and closed invalid.

Example: We use Expression (2) from Section 2.2 to illustrate the structure of our memo table. Figure 5 shows the HOP DAG and the related memo table after candidate exploration and pruning (described in Section 3.2). All eight operators are represented by groups in the memo table. Group 11 refers to the final matrix multiplication (binary aggregate ba(+*)), and consists of three memo table entries of type Row. These entries encode fusion alternatives: (1) fuse right R(-1,9), (2) fuse left R(10,-1), and (3) fuse both R(10,9). Instead of encoding all alternative subplans along inputs, we only reference the input groups. This memo table then allows for simple costing and fusion by traversing the HOP DAG top down, probing for fusion plans, traversing group references, and determining the input HOPs, from where this process repeats until we reach the leaf HOPs.

3.2 Open-Fuse-Merge-Close Exploration

Given a HOP DAG and an empty memo table, we aim to efficiently discover all valid partial fusion plans.
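Before turning to the algorithm, the memo table of Section 3.1 can be pictured as a small data structure. The following is a minimal sketch under stated assumptions: `MemoTableSketch`, `Entry`, and the enum names are hypothetical and not SystemML classes, only the template types named in this text are listed, and the close status given to the example entries is illustrative rather than taken from Figure 5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MemoTableSketch {
    enum TemplateType { CELL, ROW, OUTER }
    enum CloseStatus { OPEN_VALID, OPEN_INVALID, CLOSED_VALID, CLOSED_INVALID }

    // One partial fusion plan: (type, {i1, ..., ik}, closed).
    static final class Entry {
        final TemplateType type;
        final CloseStatus status;
        final int[] inputs; // by HOP input position: group id, or -1 if materialized

        Entry(TemplateType type, CloseStatus status, int... inputs) {
            this.type = type;
            this.status = status;
            this.inputs = inputs.clone();
        }

        boolean fusesInput(int pos) { return inputs[pos] != -1; }

        // Renders an entry in the compact notation used in the text, e.g. R(10,9).
        @Override public String toString() {
            String refs = Arrays.toString(inputs).replace(" ", "");
            return type.name().charAt(0) + "(" + refs.substring(1, refs.length() - 1) + ")";
        }
    }

    // Groups are keyed by operator (HOP) id.
    private final Map<Integer, List<Entry>> groups = new LinkedHashMap<>();

    void add(int hopId, Entry e) {
        groups.computeIfAbsent(hopId, k -> new ArrayList<>()).add(e);
    }

    List<Entry> get(int hopId) {
        return groups.getOrDefault(hopId, Collections.emptyList());
    }

    public static void main(String[] args) {
        MemoTableSketch memo = new MemoTableSketch();
        // Group 11 from the example: fuse right, fuse left, fuse both.
        memo.add(11, new Entry(TemplateType.ROW, CloseStatus.OPEN_VALID, -1, 9));
        memo.add(11, new Entry(TemplateType.ROW, CloseStatus.OPEN_VALID, 10, -1));
        memo.add(11, new Entry(TemplateType.ROW, CloseStatus.OPEN_VALID, 10, 9));
        System.out.println(memo.get(11));
    }
}
```

Because an entry stores only group ids, alternative subplans of an input are never copied into its consumers; a traversal follows the ids to the input groups instead.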
We introduce a bottom-up algorithm that is template-oblivious and populates the memo table in a single pass over the DAG.

OFMC Template Abstraction: As the basis of our candidate exploration algorithm, we define the open-fuse-merge-close (OFMC) template abstraction:
• open(Hop h): Indicates if a new fused operator of this template can be started at HOP h, covering its operation and reading materialized inputs. For example, the condition of an Outer template is an outer-product-like matrix multiplication with size constraints.
• fuse(Hop h, Hop in): Indicates if an open fused operator (of this template) at the input HOP in can be expanded to its consumer HOP h. For example, a Cell template can fuse valid unary, binary, or ternary operations, valid aggregations, and inner products.
• merge(Hop h, Hop in): Indicates if an open fused operator (of this template) at the consumer HOP h can be expanded to its input HOP in, i.e., if it can merge with fused operators at the input. An example is the merge of Cell templates into Row templates.
• close(Hop h): Indicates the close status of the template after the HOP h and its validity. For example, any aggregation closes a Cell template (as valid or invalid), whereas only column-wise or full aggregations close a Row template. Outer templates are also validated for the existence of sparsity-exploiting operators.
The major benefit of this OFMC abstraction is the separation of template-specific conditions from the HOP DAG traversal and the population of the memo table.

The OFMC Algorithm: Based on the memo table and OFMC abstraction, we introduce the OFMC exploration algorithm (Algorithm 1). This algorithm is called recursively, in a depth-first manner, to populate the memo table bottom-up. First, we check for already processed operators, indicated by existing groups or marked operators (lines 1-3), to avoid redundant exploration if nodes are reachable over multiple paths. Second, we recursively explore all input operators (lines 4-6) because these input data dependencies constitute potential fusion references. Third, we explore all templates for valid opening conditions at the current operator (lines 7-10). In case of a valid opening condition, we add this memo entry and enumerate merge plans with createPlans. This merging is important to cover scenarios such as X⊤(y ⊙ z), where the matrix-vector multiplication with X opens a Row template, which can also merge Cell templates over y ⊙ z. Fourth, we fuse and merge existing partial fusion plans from the operator inputs to the current operator (lines 11-15). This step entails iterating over all distinct template types of all inputs and probing pair-wise fusion conditions. In case of a valid condition, we again call createPlans, which constructs a memo table entry for the fused operator, and enumerates all local plan combinations for inputs that satisfy the pair-wise merge condition. This entire plan set is then added to the group of the current operator. Fifth, we check all group entries for closing conditions (lines 16-20). Entries that satisfy the closing condition of their templates are either removed (invalid) or marked as closed (valid), while all other entries remain open.

Pruning Techniques: Finally, we prune duplicates and valid closed entries without group references (line 22). For example, the group 7 ua(R+) in Figure 5 does not contain C(-1) because a rowSums closes the Cell template, which would cover only a single operator. In addition, there are advanced techniques that exploit characteristics of candidate selection policies. For instance, a policy that only considers materialization points with multiple consumers allows pruning dominated plans. A memo entry is dominated if all its references point to operators that are consumed once, and there is another entry (of the same type) whose reference list is a strict superset. For example, in Figure 5, R(10,9) dominates R(10,-1) but R(6,8) does not dominate R(-1,8) because group 6 has multiple consumers. However, we prune dominated plans only for selection heuristics.

Algorithm Analysis: Overall, our algorithm has linear time and space complexity in the number of operators. Memoization ensures that we visit each operator exactly once, and the OFMC conditions apply only locally to an operator and its inputs. These conditions still have access to the HOPs and thus the entire DAG, but this flexibility is only exploited in rare exceptions such as recognizing t(cumsum(t(X))) as a row operation.
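The exploration described in Section 3.2 can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions and not Algorithm 1 itself: `OfmcSketch`, `Hop`, `Template`, `Entry`, and `CellLike` are hypothetical names, close is reduced to a boolean (closed and valid), merging is restricted to entries of the same template type, and the Cell-like conditions are guesses over three operation names chosen only to exercise the traversal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OfmcSketch {
    // Hypothetical minimal HOP: id, operation name, input HOPs.
    static final class Hop {
        final int id; final String op; final List<Hop> inputs;
        Hop(int id, String op, Hop... inputs) {
            this.id = id; this.op = op; this.inputs = Arrays.asList(inputs);
        }
    }

    // Template-specific conditions, kept apart from the DAG traversal.
    interface Template {
        String name();
        boolean open(Hop h);
        boolean fuse(Hop h, Hop in);
        boolean merge(Hop h, Hop in);
        boolean close(Hop h); // simplification: true means closed and valid
    }

    static final class Entry {
        final String type; final int[] refs; boolean closed;
        Entry(String type, int[] refs) { this.type = type; this.refs = refs; }
        boolean hasRef() { return Arrays.stream(refs).anyMatch(r -> r != -1); }
        @Override public String toString() {
            String s = Arrays.toString(refs).replace(" ", "");
            return type + "(" + s.substring(1, s.length() - 1) + ")";
        }
    }

    final Map<Integer, List<Entry>> memo = new HashMap<>();
    private final List<Template> templates;
    OfmcSketch(List<Template> templates) { this.templates = templates; }

    void explore(Hop h) {
        if (memo.containsKey(h.id)) return;          // visit each operator once
        for (Hop in : h.inputs) explore(in);         // populate input groups first
        List<Entry> group = new ArrayList<>();
        for (Template t : templates) {
            if (t.open(h)) createPlans(h, t, -1, group);
            for (int i = 0; i < h.inputs.size(); i++)
                if (hasOpen(h.inputs.get(i), t) && t.fuse(h, h.inputs.get(i)))
                    createPlans(h, t, i, group);
            if (t.close(h))
                for (Entry e : group) if (e.type.equals(t.name())) e.closed = true;
        }
        group.removeIf(e -> e.closed && !e.hasRef()); // closed plan over a single operator
        memo.put(h.id, group);
    }

    private boolean hasOpen(Hop in, Template t) {
        for (Entry e : memo.get(in.id))
            if (!e.closed && e.type.equals(t.name())) return true;
        return false;
    }

    // Adds one entry per combination of optional merge references; 'fixed' is the
    // input position that must be fused (-1 when opening a new operator).
    private void createPlans(Hop h, Template t, int fixed, List<Entry> group) {
        int k = h.inputs.size();
        boolean[] optional = new boolean[k];
        for (int i = 0; i < k; i++)
            optional[i] = i != fixed && hasOpen(h.inputs.get(i), t) && t.merge(h, h.inputs.get(i));
        for (int mask = 0; mask < (1 << k); mask++) {
            int[] refs = new int[k];
            boolean valid = true;
            for (int i = 0; i < k; i++) {
                boolean bit = (mask & (1 << i)) != 0;
                if (bit && !optional[i]) valid = false;
                refs[i] = (i == fixed || bit) ? h.inputs.get(i).id : -1;
            }
            Entry e = new Entry(t.name(), refs);
            if (valid && group.stream().noneMatch(x -> x.toString().equals(e.toString())))
                group.add(e);                         // skip duplicates
        }
    }

    // Hypothetical Cell-like template defined over operation names.
    static final class CellLike implements Template {
        private static boolean cellOp(Hop h) { return h.op.equals("mult") || h.op.equals("plus"); }
        private static boolean agg(Hop h) { return h.op.equals("rowSums"); }
        public String name() { return "C"; }
        public boolean open(Hop h) { return cellOp(h) || agg(h); }
        public boolean fuse(Hop h, Hop in) { return cellOp(h) || agg(h); }
        public boolean merge(Hop h, Hop in) { return cellOp(h); }
        public boolean close(Hop h) { return agg(h); }
    }

    public static void main(String[] args) {
        Hop x = new Hop(1, "data"), y = new Hop(2, "data");
        Hop s = new Hop(5, "rowSums", new Hop(4, "plus", new Hop(3, "mult", x, y), x));
        OfmcSketch ofmc = new OfmcSketch(Arrays.<Template>asList(new CellLike()));
        ofmc.explore(s);
        for (int id = 3; id <= 5; id++) System.out.println(id + ": " + ofmc.memo.get(id));
    }
}
```

In this toy DAG the rowSums group keeps only the entry that references its input; the single-operator entry C(-1) is closed without references and is pruned, mirroring the rowSums example in the text. The sketch enumerates 2^k reference combinations per operator, where k is that operator's number of inputs, so the work stays local to each operator.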

4. CANDIDATE SELECTION

Given a memo table of partial fusion plans, candidate selection aims to choose the optimal subset of non-conflicting partial fusion plans. We describe the problem and cost model, as well as introduce our cost-based enumeration algorithm MPSkipEnum. The basic ideas are to (1) partition the set of partial fusion plans into independent groups, (2) restrict the search per group to interesting materialization points, (3) linearize the resulting exponential search space, and (4) enumerate and cost plans with skipping of search space areas that can be pruned based on cost or structure.

4.1 Problem Formulation and Heuristics

Overall, we aim to find the cost-optimal set of fusion plans with the optimization scope of a single HOP DAG at a time and hybrid runtime plans that might include single-node and distributed operations. We define this problem as follows:


