In recent years, dedicated AI chips have provided powerful computing capability for AI applications. Facing rapidly evolving AI applications, the architecture, system software, and security of AI chips have become hot research topics in computer architecture and systems software. To advance domestic progress in this field, the "AI Chip Architecture and Software" workshop, jointly hosted by the Beijing Academy of Artificial Intelligence (BAAI) and the Center for Energy-efficient Computing and Applications at Peking University, will be held on the morning of April 12, 2020. Five invited speakers will present their latest advances on AI chips.

Time:

Sunday, April 12, 2020

9:00 a.m. to 12:00 noon

Format:

Live-streamed online. Join the WeChat group to receive the live-stream link, or visit the official website and click "Enter Live Room".

Official website:

https://event.baai.ac.cn/con/architecture-and-software-design-for-ai-chip-2020/

(Copy the URL into a browser to view.)

Papers:

Five papers, available in the WeChat group

See the end of this article for how to join the group

Invited Guest

Yunji Chen

Professor, Institute of Computing Technology, Chinese Academy of Sciences

Chief Scientist, BAAI

Workshop Chair

Yun Liang

Research Professor, Peking University

BAAI Young Scientist

Speakers

Yun Liang

Research Professor at Peking University, BAAI Young Scientist

His main research areas are computer architecture, compiler optimization, and chip design automation. He has published more than 90 papers at top conferences such as MICRO, HPCA, PPoPP, and DAC, with over 2,400 Google Scholar citations; according to CSRankings, 24 of these papers appear at top-tier venues. His work has been selected or nominated for best paper awards at international conferences eight times, including best paper awards at ICCAD 2017 and FCCM 2011 and best paper nominations at DAC 2017, DAC 2012, and PPoPP 2019. He served as Program Chair of ASAP 2019 and on the technical program committees of six CCF Class A conferences. He currently directs the Peking University-SenseTime Joint Laboratory on Intelligent Computing and has been supported by the Distinguished Young Scholars program of the Beijing Natural Science Foundation.

Qinyi Luo

Research Assistant, University of Southern California

Qinyi Luo is a Ph.D. student at the University of Southern California. She received her bachelor's degree from the Department of Electronic Engineering at Tsinghua University. Her research focuses on distributed machine learning and data analytics. She has published several papers at top conferences such as ASPLOS, MICRO, and PLDI, and her honors include the National Scholarship, the WeTech Qualcomm Scholarship, and a nomination for the Microsoft Research Ph.D. Fellowship.

Mingyu Gao

Assistant Professor, Institute for Interdisciplinary Information Sciences, Tsinghua University

Mingyu Gao is an assistant professor and Ph.D. advisor at the Institute for Interdisciplinary Information Sciences, Tsinghua University. He received his Ph.D. and M.S. in electrical engineering from Stanford University and his B.S. in microelectronics from Tsinghua University. His research covers computer systems and architecture, big-data system optimization, and hardware system security. His main results include near-data-processing hardware/software systems for big-data applications, high-density low-power reconfigurable computing chips, and scheduling optimization for specialized neural-network accelerators. He has published many papers at top international conferences (ISCA, ASPLOS, HPCA, PACT, etc.) and holds several granted patents. His honors include the IEEE Micro Top Picks award for computer architecture in 2016, a HiPEAC Paper Award, and selection for the Forbes China "30 Under 30" list in 2019.

Xiaoming Chen

Associate Professor, Institute of Computing Technology, Chinese Academy of Sciences

Xiaoming Chen received his B.E. and Ph.D. degrees from Tsinghua University in 2009 and 2014, respectively, followed by two years of postdoctoral research at Carnegie Mellon University and one year as a visiting assistant professor at the University of Notre Dame. His research focuses on electronic design automation and computer architecture. He has published one monograph and more than 70 papers, and serves on the program committees of DAC, ICCAD, ASP-DAC, GLSVLSI, and other conferences. He received the inaugural Alibaba DAMO Academy Qingcheng Award (青橙奖) in 2018, was selected for the 2018 Young Elite Scientists Sponsorship Program of the China Association for Science and Technology, and won the 2015 EDAA Outstanding Dissertation Award (to date the only recipient whose Ph.D. was earned in mainland China).

Rui Hou

Professor and Ph.D. advisor, Institute of Information Engineering, Chinese Academy of Sciences

Deputy Director, State Key Laboratory of Information Security

His main research areas include processor chip design and security, AI chip security and data privacy, and data-center servers. He is an adjunct professor at Harbin Engineering University, vice chair of the Blockchain Committee of the China Institute of Communications, and a member of the Technical Committee on Computer Architecture of the China Computer Federation (CCF). He has led or participated in several major projects, including grants from the National Natural Science Foundation of China and the CAS Strategic Priority Research Program. He has long worked on the development of domestically designed, high-performance processor chips and has led or contributed to the design of several chips. He has published more than 40 papers in journals and at conferences, including top architecture and security venues such as ACM TOCS, HPCA, ASPLOS, and S&P, and has filed more than 50 patents in China and abroad. He has served on the technical or organizing committees of several top international conferences (technical program committees of MICRO 2020 and HPCA 2017/2018, CCF Class A, and PACT 2017, CCF Class B) and is an editorial board member of the Journal of Parallel and Distributed Computing. As chair, he has organized multiple editions of the International Technical Forum on Data Center Servers and the International Forum on Built-in Security, and he has been invited to give talks at international conferences, the research labs of well-known companies, and leading research institutes.

Agenda

09:00-09:05

Opening remarks - Yunji Chen, Professor at the Institute of Computing Technology (CAS) and Chief Scientist of BAAI

09:05-09:35

Yun Liang, Research Professor at Peking University, BAAI Young Scientist

Title:

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System

Abstract:

Tensor computation plays a paramount role in a broad range of domains, including machine learning, data analytics, and scientific computing. The wide adoption of tensor computation and its huge computation cost has led to high demand for flexible, portable, and high-performance library implementation on heterogeneous hardware accelerators such as GPUs and FPGAs. However, the current tensor library implementation mainly requires programmers to manually design low-level implementation and optimize from the algorithm, architecture, and compilation perspectives. Such a manual development process often takes months or even years, which falls far behind the rapid evolution of the application algorithms.

In this paper, we introduce FlexTensor, which is a schedule exploration and optimization framework for tensor computation on heterogeneous systems. FlexTensor can optimize tensor computation programs without human interference, allowing programmers to only work on high-level programming abstraction without considering the hardware platform details. FlexTensor systematically explores the optimization design spaces that are composed of many different schedules for different hardware. Then, FlexTensor combines different exploration techniques, including heuristic methods and machine learning methods, to find the optimized schedule configuration. Finally, based on the results of exploration, customized schedules are automatically generated for different hardware. In the experiments, we test 12 different kinds of tensor computations with hundreds of test cases in total, and FlexTensor achieves an average 1.83x performance speedup on NVIDIA V100 GPU compared to cuDNN; 1.72x performance speedup on Intel Xeon CPU compared to MKL-DNN for 2D convolution; 1.5x performance speedup on Xilinx VU9P FPGA compared to OpenCL baselines; and a 2.21x speedup on NVIDIA V100 GPU compared to the state-of-the-art.
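The core loop of such a framework is easy to convey in a few lines. Below is a minimal, self-contained sketch of schedule-space exploration in the spirit of the abstract: enumerate tiling candidates, score them with a stand-in cost function, and keep the best. All names (candidate_schedules, toy_cost, explore) and the toy cost model are illustrative assumptions, not FlexTensor's actual API.

```python
# Minimal sketch of schedule-space exploration; hypothetical names,
# not FlexTensor's real interface.
import itertools
import random

def candidate_schedules(loop_extents, tile_options=(1, 2, 4, 8, 16)):
    """Enumerate tiling factors for each loop: one point per schedule."""
    per_loop = [[t for t in tile_options if t <= ext] for ext in loop_extents]
    return list(itertools.product(*per_loop))

def toy_cost(schedule, loop_extents):
    """Stand-in for measuring a schedule on real hardware: penalizes
    tiles that leave a remainder and tiles far from a sweet spot of 8."""
    cost = 0.0
    for tile, ext in zip(schedule, loop_extents):
        cost += (ext % tile) + abs(tile - 8) * 0.1
    return cost

def explore(loop_extents, budget=200, seed=0):
    """Random search over the space: a crude stand-in for the heuristic
    plus machine-learning-guided exploration described in the abstract."""
    rng = random.Random(seed)
    space = candidate_schedules(loop_extents)
    best = min(rng.sample(space, min(budget, len(space))),
               key=lambda s: toy_cost(s, loop_extents))
    return best, toy_cost(best, loop_extents)

if __name__ == "__main__":
    extents = (256, 256, 64)  # e.g., a small 3-loop tensor computation
    schedule, cost = explore(extents)
    print("best tiling:", schedule, "proxy cost:", round(cost, 2))
```

In a real system the cost function would be a hardware measurement or a learned model, and the search space would include loop order, fusion, and memory placement in addition to tiling.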

09:40-10:10

Qinyi Luo, Research Assistant at the University of Southern California

Title:

Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training

Abstract:

Distributed deep learning training usually adopts All-Reduce as the synchronization mechanism for data-parallel algorithms due to its high performance in homogeneous environments. However, its performance is bounded by the slowest worker among all workers, so it is significantly slower in heterogeneous settings. AD-PSGD, a newly proposed synchronization method that provides numerically fast convergence and heterogeneity tolerance, suffers from deadlock issues and high synchronization overhead. Is it possible to get the best of both worlds: designing a distributed training method that has both the high performance of All-Reduce in homogeneous environments and the good heterogeneity tolerance of AD-PSGD?

In this paper, we propose Prague, a high-performance heterogeneity-aware asynchronous decentralized training approach. We achieve the above goal with intensive synchronization optimization by exploring the interplay between algorithm and system implementation, or statistical and hardware efficiency. To reduce synchronization cost, we propose a novel communication primitive, Partial All-Reduce, that enables fast synchronization among a group of workers. To reduce serialization cost, we propose static group scheduling in homogeneous environments and simple techniques, i.e., Group Buffer and Group Division, to largely eliminate conflicts with slightly reduced randomness. Our experiments show that in a homogeneous environment, Prague is 1.2x faster than the state-of-the-art implementation of All-Reduce, 5.3x faster than Parameter Server, and 3.7x faster than AD-PSGD. In a heterogeneous setting, Prague tolerates slowdowns well and achieves a 4.4x speedup over All-Reduce.
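To make the Partial All-Reduce idea concrete, here is a single-process simulation under stated assumptions: each step, every worker takes a local (fake) gradient step, then a randomly formed small group averages its replicas while the rest continue independently. The function names are hypothetical illustrations, not Prague's real interface.

```python
# Toy simulation of group-wise synchronization in the spirit of
# Partial All-Reduce; illustrative only.
import random
import numpy as np

def partial_all_reduce(models, group):
    """Average the parameters of just the workers in `group`."""
    avg = np.mean([models[w] for w in group], axis=0)
    for w in group:
        models[w] = avg.copy()

def training_step(models, lr=0.1):
    """Stand-in for local SGD: each worker nudges its own replica."""
    for w in range(len(models)):
        grad = np.random.randn(*models[w].shape)  # fake gradient
        models[w] -= lr * grad

if __name__ == "__main__":
    rng = random.Random(0)
    num_workers, group_size, dim = 8, 3, 4
    models = [np.zeros(dim) for _ in range(num_workers)]
    for step in range(100):
        training_step(models)
        group = rng.sample(range(num_workers), group_size)
        partial_all_reduce(models, group)
    spread = max(np.linalg.norm(m - models[0]) for m in models)
    print("replica divergence after 100 steps:", round(spread, 3))
```

The point of the primitive is that a slow worker only ever delays the small group it happens to join, not the entire cluster as in global All-Reduce.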

10:20-10:50

Mingyu Gao, Assistant Professor at the Institute for Interdisciplinary Information Sciences, Tsinghua University

Title:

Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

Abstract:

We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.
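For readers unfamiliar with the "seven nested loops" the abstract refers to, the sketch below writes them out for a dense convolutional layer as a plain reference computation; an accelerator dataflow corresponds to choosing an order, blocking, and parallelization for these loops. The dimension names follow common convention and are assumptions, not the paper's notation.

```python
# The seven nested loops of a dense convolutional layer, unblocked.
import numpy as np

def conv_layer(inp, weights):
    N, C, H, W = inp.shape            # batch, in-channels, height, width
    K, _, FY, FX = weights.shape      # out-channels, in-channels, filter h/w
    OY, OX = H - FY + 1, W - FX + 1
    out = np.zeros((N, K, OY, OX))
    for n in range(N):                            # 1: batch
        for k in range(K):                        # 2: output channel
            for c in range(C):                    # 3: input channel
                for oy in range(OY):              # 4: output row
                    for ox in range(OX):          # 5: output column
                        for fy in range(FY):      # 6: filter row
                            for fx in range(FX):  # 7: filter column
                                out[n, k, oy, ox] += (
                                    inp[n, c, oy + fy, ox + fx]
                                    * weights[k, c, fy, fx])
    return out

if __name__ == "__main__":
    x = np.random.randn(1, 3, 8, 8)
    w = np.random.randn(4, 3, 3, 3)
    print(conv_layer(x, w).shape)  # (1, 4, 6, 6)
```

A Halide-style schedule expresses transformations of exactly this nest (splitting, reordering, parallelizing loops) separately from the computation itself, which is what lets the paper enumerate accelerator dataflows within one formalism.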

10:55-11:25

Xiaoming Chen, Associate Professor at the Institute of Computing Technology, CAS

Title:

Communication Lower Bound in Convolution Accelerators

Abstract:

In current convolutional neural network (CNN) accelerators, communication (i.e., memory access) dominates the energy consumption. This work provides comprehensive analysis and methodologies to minimize the communication for CNN accelerators. For the off-chip communication, we derive the theoretical lower bound for any convolutional layer and propose a dataflow to reach the lower bound. This fundamental problem has never been solved by prior studies. The on-chip communication is minimized based on an elaborate workload and storage mapping scheme. In addition, we design a communication-optimal CNN accelerator architecture. Evaluations based on 65nm technology demonstrate that the proposed architecture nearly reaches the theoretical minimum communication in a three-level memory hierarchy and that it is computation dominant. The gap between the energy efficiency of our accelerator and the theoretical best value is only 37-87%.
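To give a rough feel for the off-chip side of this kind of analysis, the toy model below counts DRAM words moved by an output-stationary tiling of a convolutional layer and sweeps tile shapes under a fixed on-chip buffer budget. It is a simplified illustration of communication accounting, not the paper's lower-bound derivation; all names and the specific cost formula are assumptions.

```python
# Back-of-the-envelope off-chip traffic model for a tiled convolution.
from math import ceil

def off_chip_traffic(K, C, OY, OX, F, tk, ty, tx):
    """Words moved from DRAM for tile sizes (tk, ty, tx) over
    (out-channels, out-rows, out-cols), filter size F, C in-channels."""
    tiles = ceil(K / tk) * ceil(OY / ty) * ceil(OX / tx)
    input_per_tile = C * (ty + F - 1) * (tx + F - 1)   # input patch (with halo)
    weight_per_tile = tk * C * F * F                   # filters for this tile
    output_per_tile = tk * ty * tx                     # each output written once
    return tiles * (input_per_tile + weight_per_tile + output_per_tile)

if __name__ == "__main__":
    # Sweep tile shapes under a buffer budget and report the least-traffic
    # choice -- the optimization an optimal dataflow performs analytically.
    K, C, OY, OX, F, budget = 64, 64, 56, 56, 3, 32 * 1024
    best = None
    for tk in (8, 16, 32, 64):
        for ty in (7, 14, 28, 56):
            for tx in (7, 14, 28, 56):
                footprint = (C * (ty + F - 1) * (tx + F - 1)
                             + tk * C * F * F + tk * ty * tx)
                if footprint > budget:
                    continue  # tile does not fit in the on-chip buffer
                t = off_chip_traffic(K, C, OY, OX, F, tk, ty, tx)
                if best is None or t < best[0]:
                    best = (t, tk, ty, tx)
    print("least-traffic tile (words, tk, ty, tx):", best)
```

A theoretical lower bound plays the role of the unbeatable baseline against which such tiling choices, and the accelerator built around them, are judged.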

11:30-12:00

Rui Hou, Professor at the Institute of Information Engineering, CAS

Title:

DNNGuard: An Elastic Heterogeneous Architecture for DNN Accelerator against Adversarial Attacks

Abstract:

Recent studies show that Deep Neural Networks (DNN) are vulnerable to adversarial samples, which are generated by perturbing correctly classified inputs to cause the misclassification of DNN models. This can potentially lead to disastrous consequences, especially in security-sensitive applications such as unmanned vehicles, finance, and healthcare. Existing adversarial defense methods require a variety of computing units to effectively detect adversarial samples. However, deploying adversarial-sample defense methods in existing DNN accelerators leads to many key issues in terms of cost, computational efficiency, and information security. Moreover, existing DNN accelerators cannot provide effective support for the special computation required by the defense methods. To address these new challenges, we propose DNNGuard, an elastic heterogeneous DNN accelerator architecture that can efficiently orchestrate the simultaneous execution of the original (target) DNN network and the detection algorithm or network that detects adversarial-sample attacks. The architecture tightly couples the DNN accelerator with the CPU core in one chip for efficient data transfer and information protection. An elastic DNN accelerator is designed to run the target network and the detection network simultaneously. Besides the capability to execute two networks at the same time, DNNGuard also supports non-DNN computing and allows special neural-network layers to be effectively supported by the CPU core. To reduce off-chip traffic and improve resource utilization, we propose a dynamic resource scheduling mechanism. To build a general implementation framework, we propose an extended AI instruction set for neural-network synchronization, task scheduling, and efficient data interaction.
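Functionally, the execution pattern DNNGuard accelerates looks like the toy pipeline below: a detection network and a target network process the same input, and the detector's verdict gates the result. Everything here (the one-layer "networks", the threshold) is an illustrative assumption; the paper's contribution is the accelerator architecture that runs the two networks concurrently, not this software pipeline.

```python
# Toy sketch of guarded inference: detector output gates the target DNN.
import numpy as np

rng = np.random.default_rng(0)
W_target = rng.standard_normal((10, 64))   # toy "target DNN": one layer
W_detect = rng.standard_normal(64)         # toy "detection network"

def target_network(x):
    return int(np.argmax(W_target @ x))    # class prediction

def detection_network(x, threshold=2.0):
    score = float(W_detect @ x)            # adversarial-ness score
    return abs(score) > threshold          # True -> flag as adversarial

def guarded_inference(x):
    # On the DNNGuard architecture both networks would execute
    # concurrently on the elastic accelerator; here they run in turn.
    if detection_network(x):
        return None                        # reject suspected adversarial input
    return target_network(x)

if __name__ == "__main__":
    x = rng.standard_normal(64)
    print("prediction:", guarded_inference(x))
```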

12:00-12:05 Closing remarks

Yun Liang, Research Professor at Peking University, BAAI Young Scientist

Registration

Scan the QR code to add the assistant (小源) as a WeChat contact,

send the keyword "live0412",

and join the registration WeChat group to receive the slides and the live-stream link.

Or click "Read the original article" to visit the official workshop website.
