
  • Bufferless Multi-Ring NoC
  • AI-Processor's NoC
  • Experiment
  • Title: Application Defined On-chip Networks for Heterogeneous Chiplets: An Implementation Perspective
  • Year: 2021
  • Venue: HPCA
  • Affiliation: Huawei


A successful NoC design must make trade-offs among three aspects: application, architecture, and physical design. It is impossible to optimize all three at the same time.


However, the shared memory abstraction still has a significant effect on expressiveness.
In order to maintain the expressiveness of the software and hardware, our architecture development team chose to stick to the shared memory abstraction.



  1. We propose a highly scalable bufferless multi-ring NoC design for a heterogeneous-chiplet based system
  2. We introduce the application-architecture-physical implementation co-design process and design methodology of the bufferless multi-ring NoC system
  3. We show that the NoC design can achieve cache coherency among nearly one hundred cores (in one package) and low-latency off-chip memory access in the Server-CPU scenario


  • Network Latency
  • Network Bandwidth
  • Network Area Efficiency


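  • CPU programs often involve pointer-based data structures and irregular memory accesses, so they depend more on low-latency off-chip memory access
  • Neural networks have higher arithmetic intensity and need more data reuse and higher memory bandwidth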


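  • Application: more than 15 TBps of bandwidth, with sufficiently low latency across chiplets
  • Architecture: the architecture must be flexible enough; any two NoC transactions are independent and stateless
  • Physical Implementation: maximize the distance per cycle, which requires the circuit to be as simple as possible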

Bufferless Multi-Ring NoC


Compared to the communication-based flow control mechanism used by the buffered routing scheme, a bufferless NoC uses purely local and simple flow control without any need for communication between routers.


  1. Deflection routing increases the average latency
  2. The bufferless method reduces the available network bandwidth, as all in-network flits consume wire fabric resources
  3. Since bufferless NoC can deflect individual flits, flits of a packet can arrive out-of-order and at significantly different points in time at the destination agent
    However, AMBA5-CHI is already a non-blocking, out-of-order protocol that needs a buffer at every node anyway, which makes it a good fit for a bufferless NoC
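
The out-of-order arrival described in item 3 can be illustrated with a minimal sketch of a reassembly buffer at the destination agent. The `Flit` fields and the `ReassemblyBuffer` class are hypothetical names chosen for illustration; they are not taken from the paper or from the AMBA5-CHI specification.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    packet_id: int  # hypothetical: identifies the packet this flit belongs to
    seq: int        # hypothetical: position of the flit within its packet
    total: int      # hypothetical: number of flits in the packet
    payload: str


class ReassemblyBuffer:
    """Collects flits that arrive in any order and releases whole packets."""

    def __init__(self):
        self._pending = {}  # packet_id -> {seq: payload}

    def receive(self, flit):
        """Store one flit; return the ordered payloads once the packet is complete."""
        slots = self._pending.setdefault(flit.packet_id, {})
        slots[flit.seq] = flit.payload
        if len(slots) == flit.total:
            del self._pending[flit.packet_id]
            return [slots[i] for i in range(flit.total)]
        return None
```

Under this assumption the destination needs storage proportional to the number of packets in flight, which is the kind of per-node buffering the note above says the protocol requires anyway.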

AI-Processor’s NoC

Traditional mesh-like NoC confines the devices to the intersection of the mesh
The NoC in our AI training processor is a multi-ring based mesh
Communication between the AI cores and L2, and between L2 and HBM, are the most significant on-chip traffic flows

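In this architecture, an L2 never accesses another L2 and AI cores never access each other, so the AI cores can be placed along the vertical rings while the L2 and LLC are placed along the horizontal rings. Any routing path then switches rings at most once, and X-Y or Y-X routing is sufficient.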

We put all the AI cores on the vertical rings and the memory-related nodes on the horizontal rings.
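
The single ring change can be made concrete with a small sketch. It rests on assumptions that the notes do not confirm: rings are bidirectional, vertical ring `c` crosses horizontal ring `r` at stop `r` of the vertical ring and stop `c` of the horizontal ring, and the names `ring_hops` and `route_core_to_mem` are hypothetical.

```python
def ring_hops(src, dst, size):
    """Hops between two stops on a ring of `size` stops (assumed bidirectional)."""
    forward = (dst - src) % size
    return min(forward, size - forward)


def route_core_to_mem(core, mem, n_rows, n_cols):
    """Route from an AI core to a memory node with exactly one ring change.

    core = (col, pos): the core sits at stop `pos` of vertical ring `col`.
    mem  = (row, pos): the memory node sits at stop `pos` of horizontal ring `row`.
    """
    core_col, core_pos = core
    mem_row, mem_pos = mem
    # Leg 1: ride the core's vertical ring to the crossing with the target row.
    vertical = ring_hops(core_pos, mem_row, n_rows)
    # Leg 2: switch rings once, then ride the horizontal ring to the memory node.
    horizontal = ring_hops(core_col, mem_pos, n_cols)
    return [("vertical", core_col, vertical), ("horizontal", mem_row, horizontal)]
```

For example, `route_core_to_mem((2, 0), (3, 5), n_rows=4, n_cols=8)` yields one vertical segment followed by one horizontal segment, so the path changes rings once.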


For the Server-CPU, the benchmarks include LMBench, SPECint-2006/2017, and SPECpower-ssj-2008; the comparison targets are Intel-8280, Intel-8180, and AMD-7742.

The AMD-7742 has 64 cores and 128 threads and uses the Zen 2 architecture, while the Xeon Platinum 8280 has 28 cores.

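For the AI Processor, the authors built a cycle-accurate software simulator that takes instruction traces from the AI Processor as the NoC input; the benchmarks include the MLPerf Benchmark training cases ResNet-50, Mask R-CNN, and BERT.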

  • Title: Bufferless Network-on-Chips With Bridged Multiple Subnetworks for Deflection Reduction and Energy Savings
  • Year: 2019
  • Journal: TC
  • Affiliation: Xilinx

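This paper describes the implementation of a bufferless NoC in more detail. Without buffer-handling logic the router becomes much simpler, and each flit simply advances along the pipeline. However, if several flits want the same output port, only one can win; the others are either sent to a different port (deflection routing) or simply dropped.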

Whenever multiple flits require the same output port, only one flit (which usually has the highest priority) is granted, with the remaining contending flits either deflected to the unclaimed ports or simply dropped.
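
A minimal sketch of this arbitration step follows. It assumes an oldest-first priority and a fixed order for picking unclaimed ports; both are illustrative choices, not the paper's policy, and the function name `arbitrate` and the tuple layout are hypothetical.

```python
def arbitrate(flits, ports):
    """Assign output ports for one cycle of a bufferless router.

    flits: list of (flit_id, age, wanted_port); a larger age means higher priority
           (an assumed oldest-first policy).
    ports: list of output port names.
    Returns {flit_id: (port, deflected)}; port is None if the flit is dropped.
    """
    free = list(ports)
    result = {}
    losers = []
    # Pass 1: grant each contested port to the highest-priority requester.
    for flit_id, _age, wanted in sorted(flits, key=lambda f: -f[1]):
        if wanted in free:
            free.remove(wanted)
            result[flit_id] = (wanted, False)
        else:
            losers.append(flit_id)
    # Pass 2: deflect the losers to unclaimed ports, or drop them if none remain.
    for flit_id in losers:
        port = free.pop(0) if free else None
        result[flit_id] = (port, True)
    return result
```

With as many output ports as incoming flits, the second pass always finds an unclaimed port, so every losing flit is deflected rather than dropped.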

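This paper focuses mainly on deflection routing. Deflection, of course, lengthens a packet's delivery time (because the packet takes a detour) and also consumes more network bandwidth.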

High deflection also increases the average latency in delivering packets, since many flits then take unproductive extra hops before reaching their destinations
Moreover, those deflected flits consume network bandwidth, possibly interfering with more flits to further amplify the deflection rate and easily causing network saturation

