More and more games now make heavy use of compute shaders in their rendering to handle auxiliary computation work. There are also plenty of examples of general-purpose computation being done with Vulkan compute shaders, some of which can even replace OpenCL.

Pixel shaders have to travel the whole graphics pipeline before they execute, and before UAVs they could only write to the frame buffer location corresponding to their pixel coordinates, which is not friendly to computation work within rendering. Compute shaders, much like OpenCL for general-purpose computation, bypass the graphics pipeline and can write to arbitrary locations.

The figure below shows the GPU pipeline that traditional graphics rendering goes through (using AMD Vega as an example):

This is how compute shaders work (note that they skip the graphics-processing stages):

There are many articles introducing compute shaders, but most only scratch the surface or show API usage examples.

Below are several excellent articles by Anteru that dig into how the underlying GPU hardware actually works, again using the AMD GCN architecture as the example.

Introduction to compute shaders

More compute shaders

Even more compute shaders


Table of Contents

Introduction to compute shaders

More compute shaders


Introduction to compute shaders

Anteru

2018-07-07 09:00

A couple of months ago I went to the Munich GameCamp -- a bar camp where anyone can propose a talk, and then a quick vote is cast on which talks get accepted. I had been asking around for some ideas, and one developer mentioned I might want to cover compute shaders. So I went in without much hope of attracting many people, but I ended up with an overcrowded room -- roughly one quarter of all the game camp participants -- rambling about compute shaders for roughly an hour. Afterwards, the main question I got was: "Where can I read about this?", and I couldn't quite point to a good introduction to compute shaders (there's Ryg's "a trip through the graphics pipeline", but that's already quite a bit past an introduction.)

The hardware

To understand how compute shaders happened, we have to take a look at the hardware evolution. Back in the old days, before shaders, we had geometry processing and texturing separated. For instance, a Voodoo² card had one rasterizer and two texturing units on the card, splitting the workload between them. This theme continued for a long time, even after shaders were introduced. Up until the GeForce 7 and the Radeon X1950, GPUs had separate vertex and pixel shader units. Those units usually had similar capabilities in terms of what they could compute (after all, additions and multiplications are the bulk of the work on a GPU), but memory access differed a lot. For instance, accessing textures was something vertex shaders couldn't do for a long time. At that time, the split made sense, as scenes consisted of few polygons covering many pixels, so having less vertex shading power typically didn't result in a bottleneck. By removing functionality from the vertex shaders, they could be better optimized and thus run faster.

However, a fixed distribution also makes it impossible to load-balance resources. As games evolved, sometimes more vertex power was required -- for instance, when rendering dense geometry like trees -- while other games were exploiting the new shader programming models to write more complex pixel shaders. This was also the time at which GPGPU programming started, that is, programmers were trying to solve numerical problems on GPUs. Eventually, it became clear that the fixed split was not good enough.

In late 2006 -- early 2007, the age of "unified shaders" began with the release of the GeForce 8800 GTX and the Radeon HD 2900 (technically speaking, the Xbox 360 was first, with a customized ATI chip.) Gone were the days of separate units; instead, the shader core could process any kind of workload.

So what happened back then? The ALUs -- the units executing the math instructions -- were already similar. What changed was that the distinction between the units was removed completely. Vertex shaders were now executing on the same units as pixel shaders, making it possible to balance the workload between vertex and pixel shader work. One thing you'll notice here is that vertex and pixel shaders need to communicate somehow -- a decent chunk of memory is needed to store attributes for interpolation, like vertex color. A vertex shader will compute those per vertex and move on to the next vertex, but a pixel shader might have to hang on to this information for a while until all pixels are processed. We'll come to this memory later; let's remember it for now.

This model -- some ALUs bundled together, with some extra memory to allow communicating between sets of them -- was exposed at the same time as a "compute shader". In reality there are a few more details to this, for instance, a compute unit usually does not process a single element, but multiple of those, and there are quite a few more caches involved to make all of this efficient. Below is the diagram for a single compute unit in the AMD GCN architecture -- note that what I previously called "compute unit" is now a "SIMD", more on this in a moment.

Note that other GPUs look very similar; I'm simply using GCN here as I'm most familiar with it. On NVIDIA hardware, a GCN compute unit maps to a "streaming multiprocessor", or SM. The SIMDs have different widths, and the caches will look a bit different, but the basic compute model is still exactly the same.

Back to the SIMD unit -- GPUs have always been optimized to process many things concurrently. Where you have one pixel, you have many, and all of them run the same shader, so this is how the hardware was designed. In the case of GCN, the designers built four 16-wide SIMDs into each compute unit. A SIMD (short for single instruction, multiple data) unit can execute one operation across 16 elements at once. Not in one clock cycle -- there's some latency -- but for GCN, that latency is four cycles, so by having four SIMDs, and pretending a SIMD is not 16 but 64 wide, the machine behaves as if it executed one 64-wide instruction per cycle. Confused? Don't worry, this is just an implementation detail; the point here is that every GPU executes some set of instructions together, be it 64 in the GCN case or 32 on current NVIDIA architectures.

What we ended up with is a GPU consisting of many compute units, and each compute unit containing:

  • A bunch of SIMD processors, which execute the instructions.
  • Some local memory inside each compute unit which can be used to communicate between shader stages.
  • Each SIMD unit executes one instruction across many items.

With those prerequisites, let's have a look at how we might expose them when not doing pixel and vertex shader work. Let's get going!

The kernel

The goal is to write a programming system which allows us to take advantage of the hardware at hand. Obviously, the first thing we notice is that cross-compute-unit communication is something we want to avoid. All of the compute units are clients of the L2 cache, but clearly the L1 caches can go out of sync, so if we distribute work, we should pretend we can't talk across compute units. This means that we somehow have to split our work into smaller units. We could just pretend there are no compute units and dispatch individual items, but then we'd miss out on the local memory, so it seems that we want one level below "here's all the work".

We could go directly to the SIMD level, but that's not optimal, as there is a capability within a compute unit we haven't discussed yet. As we have some kind of local memory, it's natural to assume we can use it to synchronize the SIMD units (we could just write something to local memory and wait for it to show up). This also means we can only synchronize within a single compute unit, so let's use that as the next grouping level -- we'll call the work that gets put onto a single compute unit a work group. We thus start with a work domain, which is then split into work groups. Each work group runs independently of the others. Given that the developer might not know how many groups can run concurrently, we enforce that work groups can't rely on:

  • Another work group from the same domain running at the same time
  • Work groups inside a domain executing in any particular order

That allows us to run one work group after the other on a single unit, or all of them at once. Now, inside a work group, we still need many independent items of work, so we can fill all of our SIMD units. Let's call the individual item of work a work item, and our programming model will launch work groups containing many work items -- at least enough to fill up one compute unit (we probably want more than just enough to fill it, but more on this later.)

Let's put all of this together: Our domain consists of work groups, each work group gets mapped onto one compute unit, and the SIMD units process the work items.

One thing we haven't mentioned so far is that there's apparently one more level which we didn't cover. I mentioned a SIMD unit executes multiple work items together, yet we haven't exposed this in our programming model; we only have "independent" work items there. Let's give the work items executing together a name -- we'll call them a subgroup (on AMD, they're typically referred to as a "wavefront" or "wave", while NVIDIA calls them a "warp".) Now, we can expose another set of instructions which perform operations across all items in a SIMD. For instance, we could compute the minimum of all values without having to go through local memory.
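
Modern shading languages expose these subgroup operations directly. As a hedged illustration -- the buffer names are made up, and this uses HLSL Shader Model 6 wave intrinsics rather than anything GCN-specific -- here is a tiny compute shader that computes the minimum across a wave without touching local memory:

StructuredBuffer<float>   inputValues  : register(t0);   // hypothetical input buffer
RWStructuredBuffer<float> waveMinimums : register(u0);   // hypothetical output buffer

[numthreads(64, 1, 1)]
void CSMain (uint3 dispatchId : SV_DispatchThreadID)
{
    float value = inputValues[dispatchId.x];

    // Wave-wide reduction: every active lane receives the minimum of all
    // active lanes, without any groupshared memory or barriers.
    float minimum = WaveActiveMin (value);

    // Let one lane per wave write the result out.
    if (WaveIsFirstLane ())
    {
        waveMinimums[dispatchId.x / WaveGetLaneCount ()] = minimum;
    }
}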

I mentioned before that there are only very few constraints on ordering. For instance, a GPU may want to execute a whole domain on a single compute unit. One of the reasons for doing this is memory latency. Those wide SIMD units are great when it comes to number crunching, but they also mean that memory access is, relatively speaking, super slow. In fact, it's horribly slow: a GCN CU can process 64 float multiply-add instructions per cycle, which is 64×3×4 bytes of input data and 64×4 bytes of output. Across a large chip like a Vega10, with 64 compute units, that's 48 KiB worth of data read in a single cycle -- at 1.5 GHz, that's roughly 67 TiB of data you'd have to read in per second. GPU memory systems have thus been optimized for throughput, at the expense of latency, and that impacts how we program them.

Before we head into the effects of this on the programming model, let's summarize what compute shaders look like:

  • Work is defined through a 3-level hierarchy:

    • The domain specifies all of the work
    • The domain is subdivided into work groups, which are executed independently, but allow for communication inside a group
    • Work items are the individual elements of work
  • Some APIs also expose an intermediate level -- the subgroup, which allows some optimizations below the work group, but above the work item level.
  • Work groups can synchronize internally and exchange data through local memory.

This programming model is universal across all GPU compute shader APIs. Differences occur in terms of what guarantees are provided; for instance, an API might limit the number of work groups in flight to allow you to synchronize between all of them, or some hardware might guarantee the execution order so you can pass information from one work group to the next.
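
To make the hierarchy concrete, here is a minimal HLSL compute shader sketch (the buffer name and the 1024-wide domain are assumptions for this example): the [numthreads] attribute fixes the work group size, the Dispatch call on the API side defines the domain, and the system-value inputs identify the group and the work item:

RWStructuredBuffer<float> output : register(u0);   // hypothetical output buffer

// Work group size: 8 x 8 x 1 = 64 work items per group.
[numthreads(8, 8, 1)]
void CSMain (uint3 groupId       : SV_GroupID,          // which work group within the domain
             uint3 groupThreadId : SV_GroupThreadID,    // which work item within the group
             uint3 dispatchId    : SV_DispatchThreadID) // global work item index
{
    // The domain is defined on the API side, e.g. Dispatch (width / 8, height / 8, 1);
    // we assume a 1024-wide domain here purely for illustration.
    uint index = dispatchId.y * 1024 + dispatchId.x;
    output[index] = (float) groupId.x;
}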

Latency hiding & how to write code

Now that we understand what the hardware and the programming model look like, how do we actually program this efficiently? Until now, it sounds like we're writing normal code, and all that happens is that we process a work item in it, potentially communicating with our neighbors on the same SIMD, with others through local memory, and finally reading and writing through global memory. The problem is that "normal" code is not good here, as we need to hide latency.

The way a GPU hides latency is by having more work in flight. A lot more work -- taking GCN again as the example, every compute unit can have up to 40 subgroups in flight. Every time a subgroup accesses memory, another one gets scheduled in. Given only four can execute at the same time, this means we can switch up to 10 times before we return to the subgroup which started the original request.

There's a problem here though -- subgroup switching needs to be instantaneous to make this possible, which means you can't write the program state to memory and read it back. Instead, all state of all programs in flight is kept permanently resident in registers. This requires a huge number of registers -- a single compute unit on GCN has 256 KiB of registers. With that, we can use up to 256 KiB / 40 waves / 64 lanes / 4 bytes ≈ 25 registers for a single item before we need to reduce occupancy. For our programming style, this means we should try to minimize the amount of state we need to keep around as much as possible, at least as long as we're accessing memory. If we don't access memory, a single wave can keep a SIMD 100% occupied. We should also make sure to use that local memory and L1 cache as much as we can, as they have more bandwidth and less latency than the external memory.
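
As a sketch of what "use that local memory" can look like in practice -- a minimal example with made-up buffer names and a trivial box filter as the workload -- each work item stages one value in groupshared memory, the group synchronizes once, and neighboring values are then read from local memory instead of global memory:

StructuredBuffer<float>   input  : register(t0);   // hypothetical input buffer
RWStructuredBuffer<float> output : register(u0);   // hypothetical output buffer

groupshared float tile[64];   // local memory, shared by the whole work group

[numthreads(64, 1, 1)]
void CSMain (uint3 dispatchId : SV_DispatchThreadID,
             uint3 threadId   : SV_GroupThreadID)
{
    // Each work item loads one element into local memory ...
    tile[threadId.x] = input[dispatchId.x];

    // ... and waits until the whole group has finished loading.
    GroupMemoryBarrierWithGroupSync ();

    // Neighbors are now read from local memory instead of global memory.
    uint left  = (threadId.x == 0)  ? 0  : threadId.x - 1;
    uint right = (threadId.x == 63) ? 63 : threadId.x + 1;
    output[dispatchId.x] = (tile[left] + tile[threadId.x] + tile[right]) / 3.0f;
}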

The other problem we're going to face is how branches are handled. SIMDs execute everything together, so they can't perform jumps. What happens instead is that the individual lanes get masked off, i.e. they no longer participate. It also implies that unless all work items take the same branch, a GPU will typically execute both sides of the branch.

Which means that for a SIMD of length N, we can get a worst-case usage of 1/N. On CPUs, N is usually 1..8, so it's not catastrophic, but on a GPU N can be as high as 64, at which point this really does matter. How can we make sure this doesn't happen? First, we can take advantage of the subgroup execution; if we have a branch to select between a cheap and an expensive version, and some work items take the expensive version, we might as well all take it. This reduces the cost of a branch from expensive + cheap to just expensive. The other part is to simply avoid excessive branching. As CPUs get wider, this is also becoming more important there, and techniques like sorting all data first and then processing it become more interesting than heavy branching on individual work items. For instance, if you're writing a particle simulation engine, it's much faster to sort the particles by type and then run a specialized program for each type, compared to running all possible simulation programs for each particle.
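
A hedged sketch of that particle idea in HLSL: instead of branching on the particle type inside one kernel, a separate (hypothetical) sort/compaction pass buckets the particle indices by type, and the host then dispatches one specialized kernel per bucket, so no per-lane branching on the type remains:

struct Particle
{
    float3 position;
    float3 velocity;
};

StructuredBuffer<uint>       fireParticleIndices : register(t0);   // filled by the (hypothetical) sort pass
RWStructuredBuffer<Particle> particles           : register(u0);

// One specialized kernel per particle type; every lane in a wave runs the
// same code path, so the per-type branch disappears entirely.
[numthreads(64, 1, 1)]
void SimulateFireParticles (uint3 dispatchId : SV_DispatchThreadID)
{
    uint particleIndex = fireParticleIndices[dispatchId.x];
    Particle p = particles[particleIndex];
    p.position += p.velocity;   // stand-in for the fire-specific simulation
    particles[particleIndex] = p;
}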

What have we learned? We need:

  • A large problem domain -- the more independent work items, the better.
  • Memory intensive code should minimize the state, to allow for high occupancy.
  • We should avoid branches which disable most of the subgroup.

Those guidelines will apply universally to all GPUs. After all, all of them were designed to solve massively parallel graphics problems, and those have little branching, not too much state in flight, and tons and tons of independent work items.

Summary

I hope this blog post gives you a good idea of how we ended up with those compute shaders. The really interesting bit is that the concepts of minimal communication, local memory access, and divergence costs are universally applicable. Modern multi-core CPUs are also introducing new levels of communication costs, NUMA has long been used to model costs for memory access, and so on. Understanding that not all memory is equal, and that your code is executed in some particular way on the underlying hardware, will enable you to squeeze out more performance everywhere!

More compute shaders

Anteru

2018-07-14 09:00

Last week I covered compute shaders, and I've been asked to go a bit deeper on the hardware side to cover subgroups and more. But before we get there, let's briefly recap what a compute unit looks like and what is happening in it.

In the last post, I explained that the hardware is optimized for many items executing the same program. This resulted in the usage of very wide SIMD units (in the case of AMD's GCN, 16-wide), hiding memory latency through task switching, and a branching model which relies on masking. I didn't get into too much detail about those though, which we'll do now, and after that, we'll find out what to do with our new-found knowledge!

SIMD execution

When it comes to executing your code, a GCN compute unit has two main building blocks you care about: Several SIMD units and a scalar unit. The SIMD units are each 16 wide, meaning they process 16 elements at a time. However, they don't have a latency of one -- they don't finish executing an instruction in a single clock cycle. Instead, it takes four cycles to process an instruction from start to end (some take longer, but let's pretend all take four cycles for now.) Four cycles is about the speed you'd expect for something like a fused-multiply-add, which needs to fetch three operands from the register file, do a multiply-and-add, and write the result back (highly optimized CPU designs also take four cycles, as can be seen in Agner Fog's instruction tables.)

Latency doesn't equal throughput, though. An instruction with a latency of four can still have a throughput of one, if it's properly pipelined. Let's look at an example:

Which means that if we have sufficient work to do for a single SIMD, we can get 16 FMAs executed per clock. If we could issue a different instruction every cycle, we'd have to deal with another problem though -- some of the results may not be ready yet. Imagine our architecture had a four-cycle latency on all instructions and single-cycle dispatch (meaning every cycle we can throw a new instruction into the pipeline.) Now, we want to execute this imaginary code:

v_add_f32 r0, r1, r2
v_mul_f32 r3, r0, r2

We have a dependency between the first and the second instruction -- the second instruction cannot start before the first has finished, as it needs to wait for r0 to be ready. This means we'd have to stall for three cycles before we issue another instruction. The GCN architects solved this by issuing one instruction to a SIMD every four cycles. Additionally, instead of executing one operation on 16 elements and then switching to the next instruction, GCN runs the same instruction four times on a total of 64 elements. The only real change this requires to the hardware is to make the registers wider than the SIMD unit. Now, you never have to wait, as by the time the v_mul_f32 starts on the first 16 elements, the v_add_f32 has just finished them:

You'll immediately notice those wait cycles, and clearly it's no good to have a unit spend most of its time waiting. To fill them up, the GCN designers use four SIMD units, so the real picture on the hardware is as follows:

This 64-wide construct is called a "wavefront" or "wave", and it's the smallest unit of execution. A wave can get scheduled onto a SIMD to execute, and each thread group consists of at least one wave.

Scalar code

Phew, that was quite a bit, and we're unfortunately still not done. So far, we pretended everything is executed on the SIMD, but remember when I wrote there are two blocks related to execution? High time we get to this other ... thing.

If you program anything remotely complex, you'll notice that there are really two kinds of variables: Uniform values, which are constant across all elements, and non-uniform values, which differ per lane.

A non-uniform variable would be for instance the laneId. We'll pretend there's a special register we can read -- g_laneId, and then we want to execute the following code:

if (laneId & 1) {
    result += 2;
} else {
    result *= 2;
}

In this example, we're not going to talk about conditional moves, so this needs to get compiled into a branch. How would this look on a GPU? As we learned, there is something called the execution mask, which controls what lanes are active (this is also known as divergent control flow.) With that knowledge, this code would probably compile to something like this:

v_and_u32 r0, g_laneId, 1   ; r0 = laneId & 1
v_cmp_eq_u32 exec, r0, 1    ; exec[lane] = r0[lane] == 1
v_add_f32 r1, r1, 2         ; r1 += 2
v_invert exec               ; exec[i] = !exec[i]
v_mul_f32 r1, r1, 2         ; r1 *= 2
v_reset exec                ; exec[i] = 1

The exact instructions don't matter; what matters is that all the values we're looking at that are not literals are per-lane values. That is, g_laneId, just like r1, has a different value for every single lane. This is a "non-uniform" value, and the default case, as each lane has its own slot in a vector register.

Now, if the control flow was looking like this, with cb coming from a constant buffer:

if (cb == 1) {
    result += 2;
} else {
    result *= 2;
}

Naively, this would be turned into the same kind of straightforward code:

v_cmp_eq_u32 exec, cb, 1    ; exec[i] = cb == 1
v_add_f32 r1, r1, 2         ; r1 += 2
v_invert exec               ; exec[i] = !exec[i]
v_mul_f32 r1, r1, 2         ; r1 *= 2
v_reset exec                ; exec[i] = 1

This suddenly has a problem the previous code didn't have -- cb is constant for all lanes, yet we pretend it's not. As cb is a uniform value, the comparison against 1 could be computed once instead of per lane. That's how you'd do it on a CPU, where vector instructions are a newer addition: you'd probably do a normal conditional jump (again, ignore conditional moves for now), and call a vector instruction in each branch. It turns out GCN has the same concept of "non-vector" execution, which is aptly named "scalar" as it operates on a single scalar instead of a vector. In GCN assembly, the code could compile to:

s_cmp_eq_u32 cb, 1          ; scc = (cb == 1)
s_cbranch_scc0 else         ; jump to else if scc == 0
v_add_f32 r1, r1, 2         ; r1 += 2
s_branch end                ; jump to end
else:
v_mul_f32 r1, r1, 2         ; r1 *= 2
end:

What does this buy us? The big advantage is that those scalar units and registers are super cheap compared to vector units. Whereas a vector register is 64×32 bits in size, a scalar register is just 32 bits, so we can throw many more scalar registers on the chip than vector registers (some hardware has special predicate registers for the same reason -- one bit per lane is much less storage than a full-blown vector register.) We can also add exotic bit-manipulation instructions to the scalar unit, as we don't have to instantiate it 64 times per CU. Finally, we use less power, as the scalar unit has less data to move and work on.

Putting it all together

Now, being hardware experts, let's finally see how we can put our knowledge to good use. We'll start with wavefront-wide instructions, which are a hot topic for GPU programmers. Wavefront-wide means we do something for all lanes, instead of doing it per lane -- what could that possibly be?

Scalar optimizations

The first thing we might want to try is to play around with that execution mask. Every GPU has this in some form, be it explicit or as a predication mask. With that, we can do some neat optimizations. Let's assume we have the following code:

if (distanceToCamera < 10) {
    return sampleAllTerrainLayers ();
} else {
    return samplePreblendedTerrain ();
}

Looks innocent enough, but both function calls sample memory and are thus rather expensive. As we learned, if we have divergent control flow, the GPU will execute both paths. Even worse, the compiler will likely compile this to the following pseudo-code:

VectorRegisters<0..32>  allLayerCache = loadAllTerrainLayers ();
VectorRegisters<32..40> simpleCache   = loadPreblendedTerrainLayers ();
if (distanceToCamera < 10) {
    blend (allLayerCache);
} else {
    blend (simpleCache);
}

Basically, it will try to front-load the memory access as much as possible, so that by the time we reach the else path, there's a decent chance the loads have finished. However, we -- as developers -- know that the all-layer variant is higher quality, so how about this approach: if any lane goes down the high-quality path, we send all lanes down the high-quality path. We'll have slightly higher quality overall, and on top of that, we get two optimizations in return:

  • The compiler can use fewer registers by not front-loading both
  • The compiler can use scalar branches

There's a bunch of functions for this, all of which operate on the execution mask (or predicate registers, from here on I'll pretend it's the execution mask.) The three functions you'll hear of are:

  • ballot () -- returns the exec mask
  • any () -- returns exec != 0
  • all() -- returns ~exec == 0

Changing our code to take advantage of this is trivial:

if (any (distanceToCamera < 10)) {
    return sampleAllTerrainLayers ();
} else {
    return samplePreblendedTerrain ();
}
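
In HLSL Shader Model 6 the same idea would likely be written with a wave vote intrinsic; this is a sketch that mirrors the pseudo-code above (the sampling functions are whatever the real shader does):

// If any lane in the wave is close to the camera, the whole wave takes the
// high-quality path; all lanes then agree on the branch.
if (WaveActiveAnyTrue (distanceToCamera < 10))
{
    return sampleAllTerrainLayers ();
}
else
{
    return samplePreblendedTerrain ();
}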

Another common optimization applies to atomics. If we want to increment a global atomic by one per lane, we could do it like this:

atomic<int> globalAtomic;
if (someDynamicCondition) {
    ++globalAtomic;
}

This will require up to 64 atomic increments on GCN (as GCN does not coalesce them across a full wavefront.) That's quite expensive; we can do much better by translating this into:

atomic<int> globalAtomic;
var ballotResult = ballot (someDynamicCondition);
if (laneId == 0) {
    globalAtomic += popcount (ballotResult);
}

Where popcount counts the number of set bits. This cuts down the number of atomics by a factor of up to 64. In reality, you probably still want to have a per-lane value if you're doing a compaction, and it turns out that case is so common that GCN has a separate opcode for it (v_mbcnt), which is used automatically by the compiler when doing atomics.
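
For reference, a rough HLSL Shader Model 6 equivalent of the ballot-and-popcount trick might look like this (globalCounter is a made-up UAV, someDynamicCondition as in the snippet above):

RWStructuredBuffer<uint> globalCounter : register(u0);   // hypothetical global counter

// Count the lanes for which the condition holds, then let a single lane
// issue one atomic add on behalf of the whole wave.
uint lanesSet = WaveActiveCountBits (someDynamicCondition);
if (WaveIsFirstLane ())
{
    InterlockedAdd (globalCounter[0], lanesSet);
}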

Finally, one more on the scalar unit. Let's assume we have a vertex shader which passes through some drawId, and the pixel shader gets it as a normal interpolant. In this case (barring cross-stage optimization and vertex-input-layout optimization), code like this will cause problems:

var materialProperties = materials [drawId];

As the compiler does not know that drawId is uniform, it will assume it can be non-uniform, and thus perform a vector load into vector registers. If we do know it's uniform -- dynamically uniform is the specific term here -- we can tell the compiler about this. GCN has a special instruction for this which has somewhat become the "standard" way to express it -- v_readfirstlane. Read-first-lane reads the first active lane and broadcasts its value to all other lanes. In an architecture with separate scalar registers, this means that the value can be loaded into a scalar register. The optimal code would thus be:

var materialProperties = materials [readFirstLane (drawId)];

Now, the materialProperties are stored in scalar registers. This reduces the vector register pressure, and will also turn branches that reference the properties into scalar branches.
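
In HLSL Shader Model 6 the corresponding intrinsic is WaveReadLaneFirst; a sketch of the pattern, with materials, drawId, and the MaterialProperties type taken to be the same hypothetical names as in the pseudo-code above:

// Broadcast the value of the first active lane to all lanes; the compiler
// now knows drawId is (dynamically) uniform and can keep the data in
// scalar registers.
uint uniformDrawId = WaveReadLaneFirst (drawId);
MaterialProperties materialProperties = materials[uniformDrawId];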

Vector fun

So much for the scalar unit; let's turn to the vector unit, because things get really funny here. Turns out, pixel shaders have a huge impact on compute, because they force the hardware to do something really funky -- make lanes talk to each other. Everything we learned so far about GPUs said you can't talk across lanes except through LDS, or by broadcasting something through scalar registers (or reading a single lane.) Turns out, pixel shaders have a very unique requirement -- they require derivatives. GPUs implement derivatives by using quads, i.e. 2×2 pixels, and exchanging data between them in a dynamically varying way. Mind blown?

This is commonly referred to as quad swizzle, and you'll be hard pressed to find a GPU which doesn't do this. However, most GPUs go much further and provide more than just simple swizzling across four lanes. GCN has taken this really far since the introduction of DPP -- data parallel primitives. DPP goes beyond swizzling and provides cross-lane operand sourcing. Instead of just permuting values within a wave, it actually allows one lane to use another lane's register as the input for an instruction, so you can express something like this:

v_add_f32 r1, r0, r0 row_shr:1

What does it do? It takes the current value of r0 on this lane and the one to the right on the same SIMD (row-shift-right being set to one), adds them together, and stores it in the current lane. This is some really serious functionality which introduces new wait states, and also has various limitations as to which lanes you can broadcast to. All of this requires intimate knowledge of the implementation, and as vendors differ in how they can exchange data between lanes, the high level languages expose general wave-wide reductions like min etc. which will either swizzle or use things like DPP to get to one value. With those, you can reduce values across a wavefront in very few steps, without access to memory -- it's faster and still easy to use; what's not to like here!
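
From the shader side these reductions show up as portable wave intrinsics, and the driver compiler decides whether to lower them to DPP, swizzles, or LDS. A small HLSL Shader Model 6 sketch, with value standing for some per-lane float:

// Portable wave-wide reductions; the compiler lowers these to DPP, swizzles,
// or LDS, depending on what the hardware supports.
float waveSum   = WaveActiveSum (value);     // sum across all active lanes
float waveMin   = WaveActiveMin (value);     // minimum across all active lanes
float prefixSum = WavePrefixSum (value);     // exclusive prefix sum for this lane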

Summary

I hope I could shed some light on how things really work this time around. There's really not much left that hasn't been covered so far; for GCN, I can only think of separate execution ports and how waits are handled, so maybe we'll cover that in a later post? Other than that, thanks for reading, and if you have questions, don't hesitate to get in touch with me!
