Introducing RDNA Architecture

The RDNA architecture white paper
https://www.amd.com/system/files/documents/rdna-whitepaper.pdf

The all new Radeon gaming architecture powering “Navi”

Table of Contents

Introduction
RDNA Architecture Overview and Philosophy
RDNA System Architecture
RDNA Shader Array and Graphics Functions
Dual Compute Unit Front End
SIMD Execution Units
Scalar Execution and Control Flow
Vector Execution
Vector ALUs
Dual Compute Unit Memories
Local Data Share and Atomics
Vector Caches
Export and GDS
Shared Graphics L1 Cache
L2 Cache and Memory
Radeon Multimedia and Display Engines
True Audio Next
Advanced Visual Effects
Radeon RX 5700 Family
Conclusion
Legal Disclaimer and Attributions

1. Introduction

The world of graphics has fundamentally evolved and shifted over the course of the last three decades towards greater programmability. The earliest graphics systems were implemented purely in software and ran on CPUs, but could not offer the performance needed for more than the most basic visual effects. The first specialized graphics architectures were almost purely fixed-function and could only accelerate a very limited range of specific 2D or 3D computations such as geometry or lighting transformations. The next wave of architectures introduced graphics shaders that gave programmers a taste of flexibility, but with stringent limitations. More recently, graphics processors evolved towards programmability - offering programmable graphics shaders and fully general-purpose computing.

Figure 1 - Eras of Graphics.

AMD’s TeraScale was designed for the era of programmable graphics and ushered in general-purpose compute with DirectX® 11’s DirectCompute API and a VLIW-based architecture. The Graphics Core Next (GCN) architecture moved to a more programmable interleaved vector compute model and introduced asynchronous computing, enabling traditional graphics and general-compute to work together efficiently. The GCN architecture is at the heart of over 400 million systems, spanning from notebook PCs, to extreme gaming desktops, cutting-edge game consoles, and cloud gaming services that can reach any consumer on a network.

Looking to the future, the challenge for the next era of graphics is to shift away from the conventional graphics pipeline and its limitations to a compute-first world where the only limit on visual effects is the imagination of developers. To meet the challenges of modern graphics, AMD’s RDNA architecture is a scalar architecture, designed from the ground up for efficient and flexible computing, that can scale across a variety of gaming platforms. The 7nm “Navi” family of GPUs is the first instantiation of the RDNA architecture and includes the Radeon RX 5700 series.

2. RDNA Architecture Overview and Philosophy

The new RDNA architecture is optimized for efficiency and programmability while offering backwards compatibility with the GCN architecture. It still uses the same seven basic instruction types: scalar compute, scalar memory, vector compute, vector memory, branches, export, and messages. However, the new architecture fundamentally reorganizes the data flow within the processor, boosting performance and improving efficiency.

In all AMD graphics architectures, a kernel is a single stream of instructions that operate on a large number of data parallel work-items. The work-items are organized into architecturally visible work-groups that can communicate through an explicit local data share (LDS). The shader compiler further subdivides work-groups into microarchitectural wavefronts that are scheduled and executed in parallel on a given hardware implementation.

For the GCN architecture, the shader compiler creates wavefronts that contain 64 work-items. When every work-item in a wavefront is executing the same instruction, this organization is highly efficient. Each GCN compute unit (CU) includes four SIMD units that consist of 16 ALUs; each SIMD executes a full wavefront instruction over four clock cycles. The main challenge then becomes maintaining enough active wavefronts to saturate the four SIMD units in a CU.
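
The relationship between wavefront width, SIMD width, and issue cadence described above can be sanity-checked with a few lines of arithmetic, using only the figures quoted in this section:

```python
# GCN: a 64-work-item wavefront executes on a 16-lane SIMD, so each
# vector instruction occupies the SIMD for four clock cycles.
GCN_WAVEFRONT = 64
GCN_SIMD_LANES = 16
cycles_per_instruction = GCN_WAVEFRONT // GCN_SIMD_LANES
assert cycles_per_instruction == 4

# A GCN compute unit holds four such SIMDs (64 ALUs total), so at
# least four independent wavefronts must be in flight at any moment
# just to keep every SIMD busy.
SIMDS_PER_CU = 4
min_wavefronts_to_saturate = SIMDS_PER_CU
assert SIMDS_PER_CU * GCN_SIMD_LANES == 64
```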

The RDNA architecture is natively designed for a new narrower wavefront with 32 work-items, intuitively called wave32, that is optimized for efficient compute. Wave32 offers several critical advantages for compute and complements the existing graphics-focused wave64 mode.

One of the defining features of modern compute workloads is complex control flow: loops, function calls, and other branches are essential for more sophisticated algorithms. However, when a branch forces portions of a wavefront to diverge and execute different instructions, the overall efficiency suffers since each instruction will execute a partial wavefront and disable the other portions. The new narrower wave32 mode improves efficiency for more complex compute workloads by reducing the cost of control flow and divergence.
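
The divergence penalty can be illustrated with a toy model: count the lane-slots a wavefront must issue when its work-items split across two branch paths. This is only a sketch — real hardware uses execution masks and is more involved — but it shows why a narrower wavefront wastes fewer lanes:

```python
# Toy model of divergence cost: a wavefront must issue every path that
# any of its lanes takes, with non-participating lanes masked off.
def issued_lane_slots(outcomes, wave_width):
    slots = 0
    for i in range(0, len(outcomes), wave_width):
        wave = outcomes[i:i + wave_width]
        slots += len(set(wave)) * len(wave)  # one pass per distinct path
    return slots

# 128 work-items: the first 32 take path A, the remaining 96 take path B.
outcomes = ["A"] * 32 + ["B"] * 96

# wave64: the first wavefront mixes A and B and must execute both paths.
assert issued_lane_slots(outcomes, 64) == 192   # 64 wasted lane-slots
# wave32: every wavefront happens to be uniform, so nothing is wasted.
assert issued_lane_slots(outcomes, 32) == 128
```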

Second, a narrower wavefront completes faster and uses fewer resources for accessing data. Each wavefront requires control logic, registers, and cache while active. As one example, the new wave32 mode uses half the number of registers. Since the wavefront completes quicker, the registers free up faster, enabling more active wavefronts. Ultimately, wave32 delivers throughput and hides latency much more efficiently.

Third, splitting a workload into smaller wave32 dataflows increases the total number of wavefronts. This subdivision of work items boosts parallelism and allows the GPU to use more cores to execute a given workload, improving both performance and efficiency.

Figure 2 - Adjacent Compute Unit Cooperation in RDNA architecture.

The new dual compute unit design is the essence of the RDNA architecture and replaces the GCN compute unit as the fundamental computational building block of the GPU. As Figure 2 illustrates, the dual compute unit still comprises four SIMDs that operate independently. However, this dual compute unit was specifically designed for wave32 mode; the RDNA SIMDs include 32 ALUs, twice as wide as the vector ALUs in the prior generation, boosting performance by executing a wavefront up to 2X faster. The new SIMDs were built for mixed-precision operation and efficiently compute with a variety of datatypes to enable scientific computing and machine learning. Figure 3 below illustrates how the new dual compute unit exploits instruction-level parallelism within a simple example shader to execute on a SIMD with half the latency in wave32 mode and a 44% reduction in latency for wave64 mode compared to the previous generation GCN SIMD.

INSTRUCTION ISSUE EXAMPLE

Figure 3 - Wave32 and wave64 execution on an example snippet of code.

The RDNA architecture also redefines the cache and memory hierarchy to increase bandwidth for graphics and compute, reduce power consumption, and enable greater scaling for the future. Earlier architectures employed a two-level cache hierarchy. Generally, the first level of caching was private to each GCN compute unit and focused on compute. The second level of caching was the globally shared L2 that resided alongside the memory controllers and would deliver data both to compute units and graphics functions such as the geometry engines and pixel pipelines.

To satisfy the more powerful dual compute units, the L0 caches for scalar and vector data have scaled up as well. The new architecture introduces a specialized intermediate level of cache hierarchy, a shared graphics L1 cache that serves a group of dual compute units and pixel pipelines. This arrangement reduces the pressure on the globally shared L2 cache, which is still closely associated with the memory controllers.

3. RDNA System Architecture

Graphics processors (GPUs) built on the RDNA architecture will span from power-efficient notebooks and smartphones to some of the world’s largest supercomputers. To accommodate so many different scenarios, the overall system architecture is designed for extreme scalability while boosting performance over the previous generations. Figure 4 below illustrates the 7nm Radeon RX 5700 XT, which is one of the first incarnations of the RDNA architecture.

Figure 4 - Block diagram of the Radeon RX 5700 XT, one of the first GPUs powered by the RDNA architecture.

The RX 5700 XT is organized into several main blocks that are all tied together using AMD’s Infinity Fabric. The command processor and PCI Express interface connect the GPU to the outside world and control the assorted functions. The two shader engines house all the programmable compute resources and some of the dedicated graphics hardware. Each of the two shader engines includes two shader arrays, which comprise the new dual compute units, a shared graphics L1 cache, a primitive unit, a rasterizer, and four render backends (RBs). In addition, the GPU includes dedicated logic for multimedia and display processing. Access to memory is routed via the partitioned L2 cache and memory controllers.

The RDNA architecture is the first family of GPUs to use PCIe® 4.0 to connect with the host processor. The host processor runs the driver, which sends API commands and communicates data to and from the GPU. The new PCIe® 4.0 interface operates at 16 GT/s, which is double the throughput of earlier 8 GT/s PCIe® 3.0-based GPUs. In an era of immersive 4K or 8K textures, greater link bandwidth saves power and boosts performance.
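
The bandwidth doubling follows directly from the transfer rates. A minimal sketch, assuming the standard 128b/130b line encoding of PCIe 3.0/4.0 and an x16 link (the encoding overhead is an assumption, not stated in the text):

```python
# Usable bandwidth of an x16 PCIe link, per direction.
def x16_bandwidth_gb_s(gigatransfers_per_s):
    # One transfer carries one bit per lane; 128b/130b encoding delivers
    # 128 payload bits for every 130 line bits (PCIe 3.0 and 4.0 alike).
    per_lane_bytes = gigatransfers_per_s * 1e9 * (128 / 130) / 8
    return 16 * per_lane_bytes / 1e9

pcie3 = x16_bandwidth_gb_s(8)    # ~15.75 GB/s
pcie4 = x16_bandwidth_gb_s(16)   # ~31.51 GB/s
assert round(pcie4 / pcie3, 6) == 2.0  # double the throughput
```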

The hypervisor agent enables the GPU to be virtualized and shared between different operating systems. Most cloud gaming services live in data centers where virtualization is crucial from a security and operational standpoint. While modern consoles focus on gaming, many offer a rich suite of communication and media capabilities and benefit from virtualizing the hardware to deliver performance for all tasks.

The command processor receives API commands and in turn operates different processing pipelines in the GPU. The graphics command processor manages the traditional graphics pipeline (e.g., DirectX®, Vulkan®, OpenGL®) shader tasks and fixed-function hardware. Compute tasks are implemented using the Asynchronous Compute Engines (ACE), which manage compute shaders. Each ACE maintains an independent stream of commands and can dispatch compute shader wavefronts to the compute units. Similarly, the graphics command processor has a stream for each shader type (e.g., vertex and pixel). All the work scheduled by the command processor is spread across the fixed-function units and shader array for maximum performance.

The RDNA architecture introduces a new scheduling and quality-of-service feature known as Asynchronous Compute Tunneling that enables compute and graphics workloads to co-exist harmoniously on GPUs. In normal operation, many different types of shaders will execute on the RDNA compute unit and make forward progress. However, at times one task can become far more latency sensitive than other work. In prior generations, the command processor could prioritize compute shaders and reduce the resources available for graphics shaders. As Figure 5 illustrates, the RDNA architecture can completely suspend execution of shaders, freeing up all compute units for a high-priority task. This scheduling capability is crucial to ensure seamless experiences with the most latency sensitive applications such as realistic audio and virtual reality.

Figure 5 - Asynchronous Compute Tunneling in RDNA architecture.

4. RDNA Shader Array and Graphics Functions

The traditional graphics pipeline starts by assembling vertices into triangles; applying vertex shaders; optionally applying hull shaders, tessellation, and domain shaders; rasterizing the triangles into pixels; shading the pixels; and then blending the output. In addition, compute shaders can be used in many different stages for advanced effects.

The command processor and scheduling logic will partition compute and graphics work to enable dispatching it to the arrays for maximum performance. For example, a very common approach for the graphics pipeline is partitioning screen space and then dispatching each partition independently. Developers can also create their own scheduling algorithms, which are especially useful for compute-based effects.

For scalability and performance, the RDNA architecture is built from multiple independent arrays that comprise fixed-function hardware and the programmable dual compute units. To scale performance from the low-end to the high-end, different GPUs can increase the number of shader arrays and also alter the balance of resources within each shader array. As Figure 4 illustrates, the Radeon RX 5700 XT includes four shader arrays, each one including a primitive unit, a rasterizer, four RBs, five dual compute units, and a graphics L1 cache.

The primitive units assemble triangles from vertices and are also responsible for fixed-function tessellation. Each primitive unit has been enhanced and supports culling up to two primitives per clock, twice as fast as the prior generation. One primitive per clock is output to the rasterizer. The work distribution algorithm in the command processor has also been tuned to distribute vertices and tessellated polygons more evenly between the different shader arrays, boosting throughput for geometry.

The rasterizer in each shader engine performs the mapping from the geometry-centric stages of the graphics pipeline to the pixel-centric stages. Each rasterizer can process one triangle, test for coverage, and emit up to sixteen pixels per clock. As with other fixed-function hardware, the screen is subdivided and each portion is distributed to one rasterizer.

The final fixed-function graphics stage is the RB, which performs depth, stencil, and alpha tests and blends pixels for anti-aliasing. Each of the RBs in the shader array can test, sample, and blend pixels at a rate of four output pixels per clock. One of the major improvements in the RDNA architecture is that the RBs primarily access data through the graphics L1 cache, which reduces the pressure on the L2 cache and saves power by moving less data.

Last, the shader array comprises several dual compute units and a graphics L1 cache. The dual compute units provide the compute for executing programmable shaders. Each dual compute unit comprises four SIMDs that can execute a wave32, for a total of 256 single-precision FLOPS per cycle and even more for applications that can use mixed precision. The SIMDs also contain powerful dedicated scalar units and higher bandwidth caching. The new graphics L1 cache serves most requests in each shader array, simplifying the design of the L2 cache and boosting the available bandwidth.
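
The 256-FLOPS figure quoted above follows directly from the SIMD organization; a quick check using only the numbers in this section:

```python
# Peak single-precision math per RDNA dual compute unit, per clock.
SIMDS_PER_DCU = 4      # four SIMDs per dual compute unit
LANES_PER_SIMD = 32    # each SIMD executes a full wave32 per clock
FLOPS_PER_FMA = 2      # one fused multiply-add counts as two FLOPs

flops_per_cycle = SIMDS_PER_DCU * LANES_PER_SIMD * FLOPS_PER_FMA
assert flops_per_cycle == 256  # matches the figure in the text

# Packed half-precision runs at twice the rate, per the mixed-precision
# discussion later in the document.
assert 2 * flops_per_cycle == 512
```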

5. Dual Compute Unit Front End

The more powerful dual compute unit starts with a dedicated front-end as illustrated in Figure 6. The L0 instruction cache is shared between all four SIMDs within a dual compute unit, whereas prior instruction caches were shared between four CUs - or sixteen GCN SIMDs. The instruction cache is 32KB and 4-way set-associative; it is organized into four banks of 128 cache lines that are 64 bytes long. Each of the four SIMDs can request instructions every cycle and the instruction cache can deliver 32B (typically 2-4 instructions) every clock to each of the SIMDs - roughly 4X greater bandwidth than GCN.
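
The stated cache geometry is self-consistent; a quick arithmetic check (the set count is an inference from the 4-way associativity, not stated explicitly in the text):

```python
# L0 instruction cache: 4 banks x 128 lines x 64 bytes = 32KB.
BANKS, LINES_PER_BANK, LINE_BYTES = 4, 128, 64
capacity = BANKS * LINES_PER_BANK * LINE_BYTES
assert capacity == 32 * 1024

# With 4-way set-associativity, that capacity implies 128 sets.
WAYS = 4
sets = capacity // (WAYS * LINE_BYTES)
assert sets == 128

# Fetch bandwidth: 32 bytes per clock to each of the four SIMDs.
bytes_per_clock = 4 * 32
assert bytes_per_clock == 128
```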

Figure 6 - RDNA compute unit front end and SIMD.

The fetched instructions are deposited into wavefront controllers. Each SIMD has a separate instruction pointer and a 20-entry wavefront controller, for a total of 80 wavefronts per dual compute unit. Wavefronts can come from different work-groups or kernels; the dual compute unit maintains up to 32 work-groups simultaneously. The new wavefront controllers can operate in wave32 or wave64 mode. While the RDNA architecture is optimized for wave32, the existing wave64 mode can be more effective for some applications. To handle wave64 instructions, the wave controller issues and executes two wave32 instructions, each operating on half of the work-items of the wave64 instruction. The default way to handle a wave64 instruction is simply to issue and execute the upper and lower halves of each instruction back-to-back - conceptually slicing every instruction horizontally.

The RDNA architecture also introduces a software-controlled sub-vector mode that cuts a sequence of instructions in half, and first processes the low half of all the instructions, followed by the upper half. Figure 7 illustrates these two approaches on a simple four instruction loop. The two modes will achieve the same utilization of execution units and complete a full iteration of the loop in eight cycles. However, the sub-vector mode can release half of its registers once the low half of inst3 has executed in cycle 4 and may use the cache more effectively.

Figure 7 - Sub-vector mode for wave64 execution executes the low half of all four instructions, followed by the high half.
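
The two issue orders in Figure 7 can be sketched as simple sequences (illustrative only; the instruction names are placeholders):

```python
# Two ways to issue a four-instruction wave64 loop as wave32 halves.
instrs = ["inst0", "inst1", "inst2", "inst3"]

# Default: each instruction's low and high halves issue back-to-back.
default_order = [f"{i}.{half}" for i in instrs for half in ("lo", "hi")]

# Sub-vector mode: all low halves first, then all high halves.
subvector_order = [f"{i}.lo" for i in instrs] + [f"{i}.hi" for i in instrs]

# Both take the same eight issue slots per loop iteration...
assert len(default_order) == len(subvector_order) == 8

# ...but in sub-vector mode the low half of inst3 retires at cycle 4,
# so registers holding low-half values can be released early.
assert subvector_order.index("inst3.lo") == 3
assert default_order.index("inst3.lo") == 6
```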

Another powerful tool introduced by the RDNA architecture is the clause, which gives software greater control over the execution of groups of instructions. A clause modifies the behavior of compute and memory instructions in the scalar and vector SIMD pipelines. In particular, instructions in a clause are uninterruptable except at software-specified breaks. Clauses enable more intelligent scheduling. For example, a clause of vector load instructions guarantees that all data is read from a fetched cache line before it is evicted - thus maximizing bandwidth and power efficiency. Similarly, forming a clause of compute instructions ensures that all results are available to be written back to a cache or consumed by subsequent instructions.

Once instructions have been fetched into the wavefront buffers, the next step is decoding and issuing instructions within a SIMD. In the prior generation, SIMDs decoded and issued instructions on a round-robin basis, limiting the throughput to once every four cycles. One of the core principles of the RDNA architecture is reducing latency to exploit instruction-level parallelism for each wavefront. Accordingly, each RDNA SIMD can decode and issue instructions every cycle – boosting throughput and reducing latency by 4X.

The dual compute units include a new scheduling mechanism to ensure consistent forward progress. Earlier generations tended to issue instructions from the oldest wavefront and sometimes different wavefronts within a work-group would make uneven progress. The compute unit scheduler has a new oldest work-group scheduling policy that can be activated by software. Using this scheduling algorithm, the wavefronts within a work-group will tend to make more consistent progress and improve locality within caches, saving power by moving data more efficiently.

6. SIMD Execution Units

The RDNA front-end can issue four instructions every cycle to every SIMD, which include a combination of the vector, scalar, and memory pipelines. The scalar pipelines are typically used for control flow and some address calculation, while the vector pipelines provide the computational throughput for the shaders and are fed by the memory pipelines.

7. Scalar Execution and Control Flow

The GCN architecture introduced scalar execution units, shared by the four GCN SIMDs to handle control flow and address generation for compute applications. The RDNA dual compute unit takes this to the next level by providing each SIMD with its own dedicated scalar pipelines, boosting performance for general-purpose applications.

Each SIMD contains a 10KB scalar register file, with 128 entries for each of the 20 wavefronts. A register is 32-bits wide and can hold packed 16-bit data (integer or floating-point) and adjacent register pairs hold 64-bit data. When a wavefront is initiated, the scalar register file can preload up to 32 user registers to pass constants, avoiding explicit load instructions and reducing the launch time for wavefronts.
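
The 10KB figure follows directly from the entry counts; a quick check:

```python
# Scalar register file per SIMD: 20 wavefronts x 128 registers x 32 bits.
WAVEFRONTS_PER_SIMD = 20
SGPRS_PER_WAVEFRONT = 128
BYTES_PER_SGPR = 4  # each scalar register is 32 bits wide

srf_bytes = WAVEFRONTS_PER_SIMD * SGPRS_PER_WAVEFRONT * BYTES_PER_SGPR
assert srf_bytes == 10 * 1024  # the 10KB figure in the text

# Up to 32 of each wavefront's 128 registers can be preloaded with user
# constants at launch, skipping explicit load instructions.
PRELOADED_USER_SGPRS = 32
assert PRELOADED_USER_SGPRS <= SGPRS_PER_WAVEFRONT
```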

The branch pipeline is mostly unchanged from the previous generation, although it has been tuned to work with the smaller wavefronts. It handles conditional branches and interrupts.

The scalar ALU accesses the scalar register file and performs basic 64-bit arithmetic operations. The scalar ALU is used for a variety of control purposes, such as calculating values that will be broadcast across an entire wavefront, managing predicates for the wavefront, and computing branch targets. In addition, the scalar ALU is used for address calculation when reading or writing data in the scalar data cache.

Like the instruction cache, the scalar cache is shared by all the SIMDs in a dual compute unit. The 16KB write-back scalar cache is 4-way associative and built from two banks of 128 cache lines that are 64 bytes long. Each bank can read a full cache line, and the cache can deliver 16B per clock to the scalar register file in each SIMD. For graphics shaders, the scalar cache is commonly used to store constants and work-item-independent variables.

8. Vector Execution

The superb performance and efficiency of modern graphics processors is derived from the parallel computing capabilities of vector execution units. As Figure 8 illustrates, one of the biggest improvements in the compute unit is doubling the size of the SIMDs and enabling back-to-back execution. When using the more efficient wave32 wavefronts, the new SIMDs boost IPC and cut latency by 4X.

Figure 8 - RDNA SIMD units reduce latency compared to the GCN SIMD units.

Previously, the interleaved vector approach in GCN required scheduling other instructions around four-cycle vector instructions. The larger RDNA dual compute unit also simplifies compiler design and enables more efficient code by scheduling and issuing independent instructions every cycle.

To accommodate the narrower wavefronts, the vector register file has been reorganized. Each vector general purpose register (vGPR) contains 32 lanes that are 32-bits wide, and a SIMD contains a total of 1,024 vGPRs - 4X the number of registers as in GCN. The registers typically hold single-precision (32-bit) floating-point (FP) data, but are also designed for efficiently handling mixed precision. For larger 64-bit (or double precision) FP data, adjacent registers are combined to hold a full wavefront of data. More importantly, the compute unit vector registers natively support packed data including two half-precision (16-bit) FP values, four 8-bit integers, or eight 4-bit integers.
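
The packed-register behavior can be emulated on a CPU. A minimal sketch using Python's half-precision struct code (illustrative only; real vGPR packing is a hardware feature):

```python
import struct

# One 32-bit vGPR lane can hold two packed fp16 values; emulate the
# pack/unpack with struct's half-precision format code "e".
def pack_2xfp16(lo, hi):
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]

def unpack_2xfp16(word):
    return struct.unpack("<2e", struct.pack("<I", word))

word = pack_2xfp16(1.5, -2.0)       # both values exact in fp16
assert unpack_2xfp16(word) == (1.5, -2.0)

# Register file per SIMD: 1,024 vGPRs x 32 lanes x 32 bits = 128KB.
assert 1024 * 32 * 4 == 128 * 1024
```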

9. Vector ALUs

The computational power of the dual compute unit resides in the SIMDs, which have been comprehensively enhanced for greater performance and additional capabilities. The SIMD vector execution pipeline contains several types of execution units. Each SIMD can only issue one wavefront per clock to the vector ALUs, but longer latency operations can overlap execution.
双计算单元的计算能力驻留在 SIMD 中,SIMD 已全面增强以提供更高的性能和附加功能。SIMD 向量执行管道包含几种类型的执行单元。每个 SIMD 每个时钟只能向矢量 ALU 发出一个 wavefront,但较长的延迟操作可能会与执行重叠。

comprehensively [ˌkɒmprɪ'hensɪvli]:adv. 完全地,彻底地

The main vector ALU is a 32-lane pipeline that is twice as wide as the prior generation and enables the SIMD to complete a wavefront every clock cycle, instead of taking four cycles in GCN. The individual lanes have been redesigned for fused multiply-accumulate (FMA) operations, which improves the accuracy of computation and reduces power consumption compared to the previous generation’s multiply-add units.
主向量 ALU 是一个 32 通道的管线,其宽度是上一代的两倍,并使 SIMD 能够在每个时钟周期完成一个 wavefront,而不是在 GCN 中花费四个周期。各个通道已针对融合乘加 (FMA) 操作进行了重新设计,与上一代的乘加单元相比,这提高了计算的准确性并降低了功耗。
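The one-wavefront-per-clock claim follows directly from the lane counts; a quick arithmetic sketch in Python:

```python
def cycles_per_wavefront(wave_size: int, simd_lanes: int) -> int:
    """Cycles to issue a full wavefront through a SIMD of the given width."""
    assert wave_size % simd_lanes == 0
    return wave_size // simd_lanes

# GCN: a wave64 on a 16-lane SIMD occupies it for 4 cycles.
# RDNA: a wave32 on a 32-lane SIMD completes every cycle.
```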

The lanes have also been rearchitected for efficiently executing a wide variety of data types as illustrated in Figure 9. The execution units perform single-precision FMA, add, multiply, or other general FP operations at full rate as well as 24-bit/32-bit integer multiply-accumulates. Packed half-precision FP or 16-bit integer operations run at twice the throughput. The vector ALUs feature a new mixed precision FMA that computes 16-bit multiplication and accumulates into a 32-bit result to avoid losing any precision for integer or FP data. While the vector ALUs primarily read data from the high-bandwidth vGPRs, the scalar register file now supplies up to two broadcast operands per clock to every lane.
通道也经过重新构造,以高效执行各种数据类型,如图 9 所示。执行单元以全速率执行单精度 FMA、加法、乘法或其他通用 FP 操作,以及 24 位/32 位整数乘法累加。打包的半精度 FP 或 16 位整数运算以两倍的吞吐量运行。矢量 ALU 具有新的混合精度 FMA,它计算 16 位乘法并累积成 32 位结果,以避免丢失整数或 FP 数据的任何精度。虽然矢量 ALU 主要从高带宽 vGPR 读取数据,但标量寄存器文件现在每个时钟向每个通道提供多达两个广播操作数。
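The mixed-precision FMA described above can be modeled in a few lines of Python: `to_fp16` rounds the operands to half precision, while the product and accumulate happen at 32-bit precision. This is a sketch of the semantics, not hardware code.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest representable half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def mixed_fma(a: float, b: float, acc: float) -> float:
    """16-bit multiply accumulated into a 32-bit result (model).

    The product of two fp16 values (11-bit significands) always fits
    exactly in fp32 (24-bit significand), so no precision is lost."""
    prod = to_fp16(a) * to_fp16(b)
    return struct.unpack('<f', struct.pack('<f', acc + prod))[0]
```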

Figure 9 - Mixed-precision FMA and dot-product operations.

Some variants of the dual compute unit expose additional mixed-precision dot-product modes in the ALUs, primarily for accelerating machine learning inference. A mixed-precision FMA dot2 will compute two half-precision multiplications and then add the results to a single-precision accumulator. For even greater throughput, some ALUs will support 8-bit integer dot4 operations and 4-bit dot8 operations, all of which use 32-bit accumulators to avoid any overflows.
双计算单元的一些变体在 ALU 中公开了额外的混合精度点积模式,主要用于加速机器学习推理。混合精度 FMA dot2 将计算两个半精度乘法,然后将结果添加到单精度累加器。为了获得更大的吞吐量,一些 ALU 将支持 8 位整数 dot4 运算和 4 位 dot8 运算,所有这些都使用 32 位累加器来避免任何溢出。
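A behavioral sketch of the dot4 mode (hypothetical Python, not an AMD API): four 8-bit products summed into a 32-bit accumulator, whose extra width is what prevents intermediate overflow.

```python
def dot4_i8(a, b, acc):
    """8-bit integer dot product of length 4 with a 32-bit accumulator."""
    assert len(a) == len(b) == 4
    assert all(-128 <= v <= 127 for v in a + b)
    result = acc + sum(x * y for x, y in zip(a, b))
    assert -2**31 <= result < 2**31, "fits the 32-bit accumulator"
    return result
```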

The SIMDs use separate execution units for double-precision data. Each implementation includes between two and sixteen double-precision pipelines that can perform FMAs and other FP operations, depending on the target market. As a result, the latency of double-precision wavefronts varies from as little as two cycles up to sixteen. The double-precision execution units operate separately from the main vector ALUs and can overlap execution.
SIMD 对双精度数据使用单独的执行单元。每个实现包括 2 到 16 个双精度流水线,可以执行 FMA 和其他 FP 操作,具体取决于目标市场。双精度 wavefront 的延迟从两个周期到十六个周期不等。双精度执行单元与主向量 ALU 分开运行,并且可以重叠执行。

The dual compute unit includes new transcendental execution units to accelerate more complicated math operations that are used in both graphics and general computing. Each SIMD contains an 8-lane transcendental unit that can overlap execution with the main vector ALUs and will complete a wavefront in four clock cycles.
双计算单元包括新的超越执行单元,以加速图形和通用计算中使用的更复杂的数学运算。每个 SIMD 包含一个 8 通道超越单元,可以与主向量 ALU 重叠执行,并将在四个时钟周期内完成一个 wavefront。
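The latency figures for both the double-precision pipelines and the transcendental unit are simply the wavefront size divided by the unit width; a quick check in Python with the values from the text:

```python
def wave_latency(wave_size: int, unit_lanes: int) -> int:
    """Cycles for a wavefront to drain through a narrow execution unit."""
    return wave_size // unit_lanes

# 8-lane transcendental unit: a wave32 completes in 4 cycles.
# Double precision: 16 pipes -> 2 cycles; only 2 pipes -> 16 cycles.
```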

transcendental [.trænsen'dent(ə)l]:n. 超越数,超越物,抽象的普遍概念 adj. (尤指宗教或精神方面) 超验的

The dual compute unit includes a crossbar within the vector execution pipeline and introduces several new communication instructions. In some previous architectures, cross-lane operations (i.e. between different work-items in a wavefront) were fairly expensive. The new data path in the SIMD is designed for cross-lane data reading, permutations, swizzles, and other operations at full rate.
双计算单元在向量执行流水线中包括一个交叉开关,并引入了几个新的通信指令。在以前的一些架构中,跨通道操作 (即 wavefront 中的不同工作项之间) 相当昂贵。SIMD 中的新数据路径专为全速率的跨通道数据读取、排列、混合和其他操作而设计。
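Functionally, a cross-lane permute just lets every lane read any other lane's value in one step; a minimal Python model (names illustrative):

```python
WAVE = 32  # lanes in a wave32

def lane_permute(src, idx):
    """Each lane i reads the value held by lane idx[i]."""
    assert len(src) == len(idx) == WAVE
    return [src[j] for j in idx]

def lane_broadcast(src, lane):
    """Replicate one lane's value across the whole wavefront."""
    return [src[lane]] * len(src)
```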

crossbar [ˈkrɒsˌbɑː(r)]:n. (足球球门的) 横梁,(自行车的) 横梁,交叉开关

10. Dual Compute Unit Memories

Collectively, the four SIMDs in a dual compute unit can sustain an impressive 256 FLOPs every cycle, in addition to scalar operations. The dual compute unit memory hierarchy has evolved to keep up with this level of performance and feed enough data to take advantage of the computational resources. To efficiently provide data, the RDNA architecture includes three levels of caching - L0 caches within a dual compute unit, a shared L1 cache within an array, and a globally shared L2 cache. In addition, the explicitly addressed local data share (LDS) is now part of the overall address space, simplifying programming.
总的来说,双计算单元中的四个 SIMD 可以在每个周期维持令人印象深刻的 256 次浮点运算,此外还有标量运算。双计算单元内存层次结构已经发展到跟上这种性能水平并提供足够的数据以利用计算资源。为了有效地提供数据,RDNA 架构包括三个级别的缓存 - 双计算单元内的 L0 缓存、阵列内的共享 L1 缓存和全局共享的 L2 缓存。此外,显式寻址的本地数据共享 (LDS) 现在是整个地址空间的一部分,从而简化了编程。
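The 256 FLOPs-per-cycle figure follows directly from the lane counts: four SIMDs of 32 lanes each, with an FMA counted as two floating-point operations.

```python
SIMDS_PER_DUAL_CU = 4
LANES_PER_SIMD = 32
FLOPS_PER_FMA = 2  # one multiply plus one add

peak_flops_per_cycle = SIMDS_PER_DUAL_CU * LANES_PER_SIMD * FLOPS_PER_FMA
assert peak_flops_per_cycle == 256
```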

Figure 10 - SIMD memory path for RDNA architecture.

The memory hierarchy starts within the SIMDs as illustrated in Figure 10. The cache and memory pipeline in each SIMD have been redesigned to sustain a full wave32 every clock cycle - twice the throughput compared to the prior generation. Every SIMD has a 32-wide request bus that can transmit an address for each work-item in a wavefront to the memory hierarchy; for store operations, the request bus will instead provide 32x4B of write data. The requested data is transmitted back via a 32-wide return bus and can be provided directly to the ALUs or the vGPRs. The request and return bus generally work with 128-byte cache lines and fan out to the explicitly addressed LDS, cached memory and the texturing units. For efficiency, pairs of SIMDs share a request and return bus, although an individual SIMD can actually receive two 128-byte cache lines per clock – one from the LDS and the other from the L0 cache.
内存层次结构从 SIMD 开始,如图 10 所示。每个 SIMD 中的缓存和内存管道都经过重新设计,以支持每个时钟周期的 wave32 - 吞吐量是上一代的两倍。每个 SIMD 都有一个 32 宽的请求总线,可以将 wavefront 中每个工作项的地址传输到内存层次结构;对于存储操作,请求总线将改为提供 32x4B 的写入数据。请求的数据通过 32 宽的返回总线传回,并可直接提供给 ALU 或 vGPR。请求和返回总线通常使用 128 字节的缓存线,并扇出到明确寻址的 LDS、缓存内存和纹理单元。为提高效率,成对的 SIMD 共享请求和返回总线,尽管单个 SIMD 实际上每个时钟可以接收两条 128 字节的高速缓存线 - 一条来自 LDS,另一条来自 L0 高速缓存。
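The bus widths above line up with the cache line size; a quick arithmetic check using the values from the paragraph:

```python
LANES = 32
BYTES_PER_LANE = 4      # one 32-bit value per work-item
CACHE_LINE_BYTES = 128

# A store drives 32 x 4 B of write data - exactly one cache line.
write_bytes_per_clock = LANES * BYTES_PER_LANE
assert write_bytes_per_clock == CACHE_LINE_BYTES

# Peak return: one line from the LDS plus one from the L0 cache.
peak_return_bytes = 2 * CACHE_LINE_BYTES  # 256 B per SIMD per clock
```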

11. Local Data Share and Atomics

The first type of memory available to the compute units is the local data share (LDS), which is a low-latency, high-bandwidth, explicitly addressed memory used for synchronization within a workgroup and for some graphics functions associated with texturing.
计算单元可用的第一种内存类型是本地数据共享 (LDS),它是一种低延迟和高带宽的显式寻址内存,用于工作组内的同步以及与纹理相关的某些图形功能。

Each compute unit has access to double the LDS capacity and bandwidth. The new LDS is built from two 64KB arrays with 32 banks per array. As with the previous generation, each bank includes 512 entries that are 32 bits wide and can sustain a read and a write every cycle. Each LDS bank includes an ALU that performs atomic operations, including FP min and max. Maintaining the bank size and increasing the number of banks ensures that existing shaders will run smoothly, while doubling throughput and performance.
每个计算单元都可以访问双倍的 LDS 容量和带宽。新的 LDS 由两个 64KB 阵列构成,每个阵列有 32 个存储体。与上一代一样,每个存储库包括 512 个 32 位宽的条目,并且可以维持每个周期的读取和写入。每个 LDS bank 包括一个 ALU,它执行包括 FP min 和 max 在内的原子操作。保持存储库大小并增加存储库数量可确保现有着色器平稳运行,同时将吞吐量和性能提高一倍。
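The stated organization multiplies out to the full LDS capacity:

```python
ARRAYS = 2
BANKS_PER_ARRAY = 32
ENTRIES_PER_BANK = 512
ENTRY_BYTES = 4  # 32-bit entries

lds_bytes = ARRAYS * BANKS_PER_ARRAY * ENTRIES_PER_BANK * ENTRY_BYTES
assert lds_bytes == 128 * 1024  # two 64 KB arrays, 128 KB in total
```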

The LDS includes a crossbar to move data between wavefront lanes and banks and can also broadcast or replicate data. Typically, all accesses from a wavefront will complete in a single cycle; however, the hardware automatically detects conflicts and will serialize multiple accesses to a single bank of the LDS.
LDS 包括一个用于在 wavefront 通道和库之间移动数据的交叉开关,还可以广播或复制数据。通常,来自 wavefront 的所有访问都将在一个周期内完成;但是,硬件会自动检测冲突,并将对 LDS 的单个存储体的多次访问进行串行化。
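Bank-conflict serialization can be modeled by counting the distinct entries that map to the same bank; lanes reading an identical address are satisfied together by a broadcast and do not conflict. A Python sketch under the usual word-interleaved bank mapping (the mapping is an assumption, not a documented detail):

```python
from collections import defaultdict

def lds_access_cycles(addresses, num_banks=32, entry_bytes=4):
    """Cycles to service one wavefront of LDS reads (model).

    Distinct entries in the same bank serialize; lanes reading the
    same entry are satisfied together by a broadcast."""
    entries_per_bank = defaultdict(set)
    for addr in addresses:
        entry = addr // entry_bytes
        entries_per_bank[entry % num_banks].add(entry)
    return max((len(e) for e in entries_per_bank.values()), default=0)
```

For example, a unit-stride access pattern touches each bank once and completes in a single cycle, while two addresses 128 bytes apart land in the same bank and serialize.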

The RDNA architecture has two modes of operation for the LDS - compute-unit mode and workgroup-processor mode - which are controlled by the compiler. The former is designed to match the behavior of the GCN architecture and statically divides the LDS capacity into equal portions between the two pairs of SIMDs. By matching the capacity of the GCN architecture, this mode ensures that existing shaders will run efficiently. However, workgroup-processor mode allows larger allocations of the LDS to boost performance for a single workgroup.
RDNA 架构有两种 LDS 操作模式,计算单元和工作组处理器模式,由编译器控制。前者旨在匹配 GCN 架构的行为,并将 LDS 容量静态划分为两对 SIMD 之间的相等部分。通过匹配 GCN 架构的容量,该模式确保现有着色器能够高效运行。但是,工作组处理器模式允许使用更大的 LDS 分配来提高单个工作组的性能。
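The two modes differ only in how much LDS a single workgroup may allocate; a sketch with sizes taken from the text (the exact workgroup-processor-mode cap and the function name are assumptions for illustration):

```python
LDS_TOTAL_BYTES = 128 * 1024  # two 64 KB arrays

def max_lds_allocation(mode: str) -> int:
    """LDS available to one workgroup under each mode (model)."""
    if mode == "compute-unit":
        # Statically split between the two SIMD pairs, matching GCN.
        return LDS_TOTAL_BYTES // 2
    if mode == "workgroup-processor":
        # A single workgroup may use a larger share (assumed: all of it).
        return LDS_TOTAL_BYTES
    raise ValueError(f"unknown mode: {mode}")
```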

References

https://www.amd.com/en/technologies/rdna
https://www.amd.com/system/files/documents/rdna-whitepaper.pdf
https://en.wikichip.org/wiki/amd/infinity_fabric
