The forum "Programming Languages and Compilers for AI Chips" will be held during CNCC on October 24, 13:30–15:30, in the Sichuan Room on the 2nd floor of the Hotel Nikko New Century Beijing. The forum brings together well-known scholars from home and abroad and industry leaders to discuss the challenges and opportunities of designing domain-customized chips for artificial intelligence. You are welcome to attend!

As Moore's Law slows down, domain-specific architectures have become the mainstream direction of processor development. To meet deep learning's enormous demand for compute, hardware companies have released a variety of domain-specific AI chips, such as Cambricon's processors, Huawei's Ascend series, and Alibaba's Hanguang series. Developing automated compilation technology for AI chips is of great significance to the advancement of China's AI chip industry.

This forum will address the following questions:

1) How should domain-specific programming languages for AI chips be designed?

2) How should efficient compilers for AI chips be designed?

3) What are the main pain points of today's programming languages and compilers for AI chips?

4) How can the design of domestically developed core system software, such as programming languages and compilers, be strengthened?

Forum Chairs

Jidong Zhai (翟季冬)

Tenured Associate Professor and Ph.D. advisor in the Department of Computer Science and Technology at Tsinghua University; Secretary-General of the ACM China High-Performance Computing Expert Committee and a BAAI (Beijing Academy of Artificial Intelligence) Young Scientist. His research focuses on high-performance computing and compiler optimization, with results published at major international conferences and in journals including SC, PPoPP, ICS, MICRO, ASPLOS, ATC, CGO, NSDI, IEEE TPDS, and IEEE TC. His SC14 paper was a Best Paper Finalist, the first such nomination for a scholar from mainland China. He served as program chair of NPC 2018, as a program committee member of SC 2018/2019/2020 and PPoPP 2019/2020/2021, as an associate editor of IEEE TPDS, and as a young associate editor of FCS and JCST. As coach of Tsinghua's student supercomputing team, he has led the team to nine world championships; in 2015 and 2018 the team swept the SC, ISC, and ASC international supercomputing competitions, achieving a "grand slam." He has received the Ministry of Education Science and Technology Progress First Prize, the CCF Outstanding Doctoral Dissertation Award, and an NSFC Excellent Young Scientists Fund grant.

Wenguang Chen (陈文光)

Professor and Ph.D. advisor in the Department of Computer Science and Technology at Tsinghua University; CCF Distinguished Member and Distinguished Speaker, CCF Deputy Secretary-General, and honorary member of CCF YOCSEF. His research covers operating systems, programming languages, and parallel computing. He has repeatedly served on the program committees of major international conferences in high-performance and parallel computing, including OSDI, PPoPP, CGO, SC, ICS, PLDI, ASPLOS, and APSYS. He also chairs the ACM China Council and ChinaSys, the ACM China chapter on operating systems. He has received the State Science and Technology Progress Second Prize, the State Education Commission Science and Technology Progress Second Prize, and the Beijing Science and Technology Progress Second Prize, and is a recipient of the National Science Fund for Distinguished Young Scholars.

Speakers

Zhenjiang Hu (胡振江)

Chair Professor at Peking University, Deputy Dean of the School of Electronics Engineering and Computer Science, and Chair of the Department of Computer Science and Technology. He received his Ph.D. in information engineering from the University of Tokyo in 1996, and was previously a professor in the Graduate School of Information Science and Technology at the University of Tokyo, a professor and department head at the National Institute of Informatics in Japan, and a Changjiang Chair Professor at Peking University. Professor Hu has long worked on programming languages and on software science and engineering, with a series of pioneering contributions to programming language design, structured functional programming, automatic program synthesis and optimization, parallel programming, the design and implementation of bidirectional transformation languages, and software evolution and maintenance. He received Japan's best doctoral dissertation award and the Basic Research Achievement Award of the Japan Society for Software Science and Technology, and is a Fellow of the Japan Federation of Engineering Societies, a Member of Academia Europaea, an IEEE Fellow, and an ACM Distinguished Scientist.

Talk title: From Chip Customization to Language Customization: Systematic Customization of Programming Languages and Its Supporting Environment

Abstract: With Moore's Law gradually failing and the pressing demand for efficient specialized computation such as deep learning, we are moving toward an era that favors specialized, customized computing devices. Software therefore needs to be customizable for different specialized hardware. This talk presents the basic concepts and applications of systematic customization of programming languages, discusses the implementation of its supporting environment, and explores the challenges ahead.

Tianqi Chen (陈天奇)

Assistant Professor, Carnegie Mellon University

Tianqi Chen is an Assistant Professor in the Machine Learning Department and the Computer Science Department at Carnegie Mellon University. He received his Ph.D. from the Paul G. Allen School of Computer Science & Engineering at the University of Washington, working with Carlos Guestrin at the intersection of machine learning and systems. He has created three widely adopted learning systems: XGBoost, TVM, and MXNet (as co-creator). He is a recipient of the Google Ph.D. Fellowship in Machine Learning.

Talk title: TVM: An Automated Deep Learning Compiler

Abstract: Data, models, and computing are the three pillars that enable machine learning to solve real-world problems at scale. Making progress in these three domains requires not only disruptive algorithmic advances but also systems innovations that can continue to squeeze more efficiency out of modern hardware. Learning systems are at the center of every intelligent application nowadays. However, the ever-growing demand for applications and hardware specialization creates a huge engineering burden for these systems, most of which rely on heuristics or manual optimization. In this talk, I will present a new approach that uses machine learning to automate system optimizations. I will describe our approach in the context of deep learning deployment problems. I will first discuss how to design invariant representations that lead to transferable statistical cost models, and apply these representations to optimize the tensor programs used in deep learning applications. I will then describe the system improvements we made to enable diverse hardware backends. TVM, our end-to-end system, delivers performance across hardware backends that is competitive with state-of-the-art, hand-tuned deep learning frameworks.
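The cost-model-guided search described above can be sketched in miniature. This is a hypothetical toy, not TVM's actual API: the feature set, the 256-element cache limit, and the cost formula are all invented for illustration.

```python
# Toy sketch of statistical-cost-model-guided schedule search: enumerate
# candidate tilings of a loop nest, score each with a cost model, keep the best.
import itertools

def features(tile_i, tile_j):
    """Hypothetical schedule features: tile footprint and loop overhead."""
    footprint = tile_i * tile_j              # elements touched per tile (a proxy)
    overhead = 1.0 / tile_i + 1.0 / tile_j   # relative loop bookkeeping cost
    return footprint, overhead

def cost_model(feats):
    """Stand-in for a learned model: penalize tiles that overflow a
    hypothetical 256-element cache, plus the loop overhead."""
    footprint, overhead = feats
    cache_penalty = max(0, footprint - 256) * 0.01
    return overhead + cache_penalty

def search(sizes=(4, 8, 16, 32, 64)):
    """Rank every candidate tiling with the cost model and keep the cheapest."""
    candidates = itertools.product(sizes, sizes)
    return min(candidates, key=lambda t: cost_model(features(*t)))

print(search())  # → (16, 16): the largest balanced tile that fits the modeled cache
```

In a real auto-tuner the cost model is trained on measured runtimes and the search space covers loop ordering, vectorization, and parallelization, not just tile sizes.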

Zhihao Jia (贾志豪)

Assistant Professor, Carnegie Mellon University

Zhihao Jia is an incoming Assistant Professor of Computer Science at CMU (starting Fall 2021). He received his Ph.D. from Stanford, working with Alex Aiken and Matei Zaharia. His research interests lie at the intersection of computer systems and machine learning, with a focus on building efficient, scalable, and high-performance systems for ML computations.

Talk title: Automated Discovery of Machine Learning Optimizations

Abstract: As an increasingly important workload, machine learning (ML) applications require different performance optimization techniques from traditional runtimes and compilers. In particular, to accelerate ML applications it is generally necessary to perform ML computations on heterogeneous hardware and to parallelize computations across multiple data dimensions, neither of which is even expressible in traditional compilers and runtimes. In this talk, I will describe my work on automated discovery of performance optimizations to accelerate ML computations. TASO, the Tensor Algebra SuperOptimizer, optimizes the computation graphs of deep neural networks (DNNs) by automatically generating potential graph optimizations and formally verifying their correctness. TASO outperforms rule-based graph optimizers in existing ML systems (e.g., TensorFlow, TensorRT, and TVM) by up to 3x by automatically discovering novel graph optimizations, while requiring significantly less human effort. FlexFlow is a system for accelerating distributed DNN training. FlexFlow identifies parallelization dimensions not considered in existing ML systems (e.g., TensorFlow and PyTorch) and automatically discovers fast parallelization strategies for a specific parallel machine. Companies and national labs are using FlexFlow to train production ML models that do not scale well in current ML systems, achieving over 10x performance improvements. I will also outline future research directions for further automating ML systems, such as co-designing ML models, software systems, and hardware backends for end-to-end ML deployment.
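TASO's generate-and-verify loop can be illustrated with a toy graph rewriter. This is a heavily simplified sketch: the `rewrite_distribute` rule is an invented stand-in for TASO's generated substitutions, and TASO verifies rewrites formally, whereas this toy only tests them on random inputs.

```python
# Toy superoptimizer-style graph rewriting: propose a candidate rewrite of a
# computation graph, then validate it against the original graph.
import random

# Computation graphs as nested tuples ('add'|'mul', lhs, rhs), or a variable name.
def evaluate(node, env):
    if isinstance(node, str):
        return env[node]
    op, lhs, rhs = node
    a, b = evaluate(lhs, env), evaluate(rhs, env)
    return a + b if op == 'add' else a * b

def is_mul(n):
    return isinstance(n, tuple) and n[0] == 'mul'

def rewrite_distribute(node):
    """Candidate rewrite: x*z + y*z  ->  (x+y)*z (one fewer multiply)."""
    if (isinstance(node, tuple) and node[0] == 'add'
            and is_mul(node[1]) and is_mul(node[2])
            and node[1][2] == node[2][2]):
        return ('mul', ('add', node[1][1], node[2][1]), node[1][2])
    return node

def equivalent(g1, g2, names, trials=100):
    """Randomized validation; TASO proper verifies rewrites formally instead."""
    for _ in range(trials):
        env = {n: random.uniform(-10, 10) for n in names}
        if abs(evaluate(g1, env) - evaluate(g2, env)) > 1e-6:
            return False
    return True

g = ('add', ('mul', 'x', 'z'), ('mul', 'y', 'z'))
g_opt = rewrite_distribute(g)
print(g_opt)                                  # ('mul', ('add', 'x', 'y'), 'z')
print(equivalent(g, g_opt, ['x', 'y', 'z']))  # True
```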

Huimin Cui (崔慧敏)

Professor and Ph.D. advisor at the Institute of Computing Technology, Chinese Academy of Sciences. Her research focuses on compilation and programming for heterogeneous chips; in recent years she has worked on compiler and programming-environment optimization for emerging workloads such as big data and AI on heterogeneous architectures. As principal investigator she has led several national-level projects, including NSFC grants and National Key R&D Program projects, and has published more than twenty papers at international conferences and in journals including PLDI, MICRO, PPoPP, and TPDS.

Talk title: Programming Language and Compiler Design for High-Performance Intelligent Processors

Abstract: High-performance intelligent processors, exemplified by the Cambricon platform, provide a general-purpose deep learning platform that aims to deliver powerful compute for current and future intelligent applications. Given the diversity and unpredictability of future applications, a foundational high-level programming language is an indispensable part of building and promoting the ecosystem. To meet this need, we designed Bang, a general-purpose high-level programming language based on C and tailored to both applications and the platform, which enables flexible development of user-defined operators. On top of this, deep compiler optimization is applied to fully exploit the processing power of the chip.

Wei Lin (林伟)

Senior Director, Alibaba

Wei Lin is Senior Director of the Platform of Artificial Intelligence (PAI) and Chief Architect of the big-data computation platform at Alibaba. He has more than 15 years of experience in backend infrastructure and distributed systems, spanning storage and large-scale computation systems for batch, streaming, and machine learning workloads.

Talk title: AI Compiler at Alibaba

Abstract: With emerging AI workloads and the diversity of computing hardware, AI compilers play a vital role in bridging the gap between model expressiveness and high-performance system implementation. In this talk, we will share our experience applying an AI compiler in Alibaba's production environment, including: 1. Large-scale deployment of our AI compiler in PAI (Platform of Artificial Intelligence) production clusters, running stably for more than six months and saving tens of thousands of GPU hours. We will describe our aggressive fusion and co-design strategy, in which a cost-based approach finds the optimal fusion plan to boost hardware efficiency, and share the lessons learned in enabling the compiler by default in a large-scale production cluster. 2. Ansor, our automatic code generation framework, which was accepted at OSDI 2020 and is deployed in our production environment. Compared with existing search strategies, Ansor explores many more optimization combinations and can therefore find high-performance programs that lie outside the search space of existing state-of-the-art approaches. Our evaluation shows that Ansor improves the execution performance of deep neural networks on Intel CPUs, ARM CPUs, and NVIDIA GPUs by up to 3.8x, 2.6x, and 1.7x, respectively. 3. Our thoughts on the future direction of AI compilers from an industry perspective, such as the interplay between compilers, runtimes, resource scheduling, and distributed execution. We would also like to raise some open questions in the hope of spurring interaction between academia and industry.
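The cost-based fusion decision mentioned above can be sketched with a deliberately crude memory-traffic model. All names and cost formulas here are hypothetical, not PAI's actual model.

```python
# Toy cost-based fusion planning for a chain of elementwise ops: fuse when the
# modeled memory traffic of one fused kernel beats running the ops separately.

def unfused_cost(n_ops, tensor_bytes):
    """Each elementwise op reads one input and writes one output to memory."""
    return n_ops * 2 * tensor_bytes

def fused_cost(n_ops, tensor_bytes):
    """One fused kernel reads the input and writes the output once;
    intermediates are assumed to stay in registers."""
    return 2 * tensor_bytes

def plan(n_ops, tensor_bytes):
    """Fuse only when the modeled traffic strictly decreases."""
    if fused_cost(n_ops, tensor_bytes) < unfused_cost(n_ops, tensor_bytes):
        return 'fuse'
    return 'no-fuse'

print(plan(3, 4096))  # → fuse: one kernel instead of three saves 4 tensor transfers
print(plan(1, 4096))  # → no-fuse: nothing to gain for a single op
```

A production fusion planner would also model kernel launch overhead, register pressure, and the legality of each fusion, and would search over fusion groupings rather than a single chain.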

Shin-Ming Liu

Chief Architect, Xcalibyte

Shin-Ming Liu is Chief Architect at Xcalibyte. He began as a compiler developer in the early 1980s, has built compilation systems from scratch at various Silicon Valley companies, and has had wide influence on modern compilation systems, including the design of gcc and llvm. Beyond in-depth compiler development, he has served as Director of the Java and C/C++ toolchain lab for HP-UX servers and Director of the HP kernel development lab for the HP 3PAR storage system, and has developed extensive insight into the computing ecosystem for high-performance computing and software development productivity.

Talk title: Matrix Multiply, from 1x to 62,806x Speedup: Bridging the Gap between Productivity and Performance

Abstract: In his Stanford lecture, John Hennessy discussed a new era of computing and the challenge of improving GEMM by roughly 63,000 times. We will elaborate on his vision with a deep dive into the compilation and runtime techniques needed, and suggest a possible roadmap for bringing both productivity and performance into it. We argue for an open-source platform that lets multiple languages coexist in compilation and runtime while allowing individual chip and accelerator vendors to specialize for their target domains. We will analyze the technical challenges ahead and possible directions forward for a thriving industry in AI and data science.
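To make the ~63,000x challenge concrete, here is the classic starting point of that ladder in pure Python: the same GEMM written naively, and then with the loop order changed for locality. This is only a small sketch; the large speedups in Hennessy's example come from moving further down the ladder to vectorized, cache-blocked, and hardware-specialized implementations.

```python
# The same matrix multiply, naive (i-j-k) and locality-friendly (i-k-j).

def matmul_naive(A, B):
    """Textbook triple loop: the inner loop strides down columns of B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_reordered(A, B):
    """i-k-j order walks B and C row-wise: better locality, and the first
    rung of the optimization ladder before blocking and vectorization."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        Ci, Ai = C[i], A[i]
        for k in range(m):
            a, Bk = Ai[k], B[k]
            for j in range(p):
                Ci[j] += a * Bk[j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_naive(A, B))                            # [[19.0, 22.0], [43.0, 50.0]]
print(matmul_naive(A, B) == matmul_reordered(A, B))  # True
```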

Click "Read the original" to register for the forum.
