Forward Plus Rendering

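The method described here comes from "Forward+: Bringing Deferred Lighting to the Next Level". That paper introduces Forward+, a technique for rendering many lights by culling the lights and storing only those that contribute to a pixel. Forward+ is an extension of traditional forward rendering. A light-culling stage, implemented with GPU compute, is added to the pipeline to build light lists; these lists are passed to the final shading pass, which has access to all information about the lights.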

Implementation

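Forward+ extends the forward rendering pipeline by adding a single light-culling stage before final shading. The pipeline has three stages: depth pre-pass, light culling, and final shading. The other change concerns the light data, which must be stored in a linear buffer that shaders can access for both light culling and final shading. A depth pre-pass is optional in plain forward rendering, but it is essential for Forward+: the culling stage needs the depth buffer, and the pre-pass also reduces expensive pixel overdraw in the final shading step.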

1. Depth pre-pass

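Before light culling and shading we run a depth pre-pass. A single pass that writes the scene depth is enough. The light-culling stage then uses this depth texture to build a suitable sub-frustum for every tile.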

2. Light culling

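The light-culling stage computes the list of lights that overlap each pixel. A list could be built per pixel, but that is inefficient. Instead the screen is split into tiles and the affecting lights are computed once per tile. The tile size is a trade-off between the memory used by the light index buffer and the efficiency of the final shader. The paper uses the compute capability of modern GPUs to run light culling on the GPU, so the whole lighting pipeline executes entirely on the GPU.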

In the light-culling stage the screen is divided into tiles, typically 16 * 16 pixels each.
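
To make the memory side of this trade-off concrete, here is a small sketch that counts the tiles and sizes the index buffer. It assumes the 16 * 16 tile size from the text and the fixed 1024 index slots per tile used by the culling shader in this article; the helper names and the 1920 x 1080 resolution are illustrative only.

```python
TILE_SIZE = 16              # pixels per tile side, as in the text
MAX_LIGHTS_PER_TILE = 1024  # index slots reserved per tile (matches the shader's shared array)

def tile_grid(width, height, tile_size=TILE_SIZE):
    """Number of tiles (compute work groups) needed to cover the screen."""
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    return tiles_x, tiles_y

def index_buffer_bytes(width, height):
    """Size of the per-tile light index buffer, one 32-bit int per slot."""
    tiles_x, tiles_y = tile_grid(width, height)
    return tiles_x * tiles_y * MAX_LIGHTS_PER_TILE * 4

tiles = tile_grid(1920, 1080)            # (120, 68): the last tile row is only partly covered
size = index_buffer_bytes(1920, 1080)    # 33423360 bytes, roughly 32 MiB
```

Smaller tiles give tighter light lists but multiply the number of tiles, and with a fixed number of slots per tile the buffer grows accordingly.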

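For every tile we need to find the lights it contains. How do we slice the view frustum into tiles? First, we divide it uniformly along the horizontal (x) and vertical (y) directions, which gives each tile four side planes.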

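Along the depth axis we use the values from the depth pre-pass: for each tile we find the minimum depth (min) and maximum depth (max) over its pixels, and these define the tile's near and far planes.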

The frustum formed by these six planes determines which lights the tile contains (point lights are used as the example here).

We implement light culling with an OpenGL compute shader. The skeleton of the shader is:

    #version 430

    struct PointLight {
        vec4 color;
        vec4 position;
        vec4 paddingAndRadius;
    };

    struct VisibleIndex {
        int index;
    };

    // Light list
    layout(std430, binding = 0) readonly buffer LightBuffer {
        PointLight data[];
    } lightBuffer;

    // Per-tile list of visible light indices
    layout(std430, binding = 1) writeonly buffer VisibleLightIndicesBuffer {
        VisibleIndex data[];
    } visibleLightIndicesBuffer;

    // Uniforms
    uniform sampler2D depthMap;
    uniform mat4 view;
    uniform mat4 projection;
    uniform ivec2 screenSize;
    uniform int lightCount;

    // Data shared by all invocations of one work group (one tile)
    shared uint minDepthInt;
    shared uint maxDepthInt;
    shared uint visibleLightCount;
    shared vec4 frustumPlanes[6];
    shared int visibleLightIndices[1024];
    shared mat4 viewProjection;

    #define TILE_SIZE 16
    layout(local_size_x = TILE_SIZE, local_size_y = TILE_SIZE, local_size_z = 1) in;

    void main() {
        ivec2 location = ivec2(gl_GlobalInvocationID.xy);
        ivec2 itemID = ivec2(gl_LocalInvocationID.xy);
        ivec2 tileID = ivec2(gl_WorkGroupID.xy);
        ivec2 tileNumber = ivec2(gl_NumWorkGroups.xy);
        uint index = tileID.y * tileNumber.x + tileID.x;

        // Initialize the shared state once per work group
        if (gl_LocalInvocationIndex == 0) {
            minDepthInt = 0xFFFFFFFF;
            maxDepthInt = 0;
            visibleLightCount = 0;
            viewProjection = projection * view;
        }
        barrier();

        // Step 1: compute the minimum and maximum depth of the tile
        // TODO

        // Step 2: compute the six frustum planes of the tile
        // TODO

        // Step 3: cull the lights
        // TODO

        // Step 4: write out the result
        // TODO
    }
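
The `LightBuffer` block above is a flat array of `PointLight` structs, each made of three `vec4` members, so under `std430` every element occupies 48 bytes. As a rough host-side illustration, the sketch below packs lights into that layout. The helper names, the little-endian float format, the alpha value of 1.0 and the sample lights are assumptions for the example; the position is stored with w = 1 because the culling shader takes a 4D dot product with the plane.

```python
import struct

LIGHT_STRIDE = 48  # three vec4s of 4-byte floats

def pack_point_light(color, position, radius):
    """Pack one light as color, position (w = 1) and paddingAndRadius (radius in w)."""
    r, g, b = color
    x, y, z = position
    return struct.pack("<12f",
                       r, g, b, 1.0,            # color (alpha is an assumed filler value)
                       x, y, z, 1.0,            # position, w = 1 for the plane dot product
                       0.0, 0.0, 0.0, radius)   # paddingAndRadius, radius read from .w

def pack_light_buffer(lights):
    """Concatenate all lights into one linear byte buffer."""
    return b"".join(pack_point_light(*light) for light in lights)

lights = [((1.0, 0.5, 0.2), (0.0, 2.0, -5.0), 30.0),
          ((0.2, 0.4, 1.0), (3.0, 1.0, -8.0), 30.0)]
buf = pack_light_buffer(lights)   # 96 bytes, ready to upload to the storage buffer
```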

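The shader runs in four steps. Step 1: find the minimum and maximum linear depth of each tile. Step 2: compute the six planes of each tile's frustum. Step 3: cull the lights. Step 4: write the result to the output buffer.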

Step 1:

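We read the depth texture written by the depth pre-pass and use atomic operations to find the minimum and maximum depth within the tile. To obtain the depth in view space, the depth value has to be linearized. The atomic min/max functions work on integers, so the depth is reinterpreted as a uint; for positive floats the bit pattern preserves the ordering. The code is: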

    // Step 1: compute the minimum and maximum depth of the tile
    float maxDepth, minDepth;
    vec2 text = vec2(location) / vec2(screenSize);
    float depth = texture(depthMap, text).r;

    // Linearize the depth (the result is the positive view-space distance)
    depth = (0.5 * projection[3][2]) / (depth + 0.5 * projection[2][2] - 0.5);

    // Reinterpret the depth as a uint so the invocations can compare it atomically
    uint depthInt = floatBitsToUint(depth);
    atomicMin(minDepthInt, depthInt);
    atomicMax(maxDepthInt, depthInt);
    barrier();
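
The two tricks in this step can be checked on the CPU. The sketch below assumes a standard OpenGL perspective matrix (NDC z in [-1, 1], stored column-major like GLSL); the `perspective` helper and the near/far values are illustrative, not taken from the original code. It shows that the linearization expression returns the positive distance `-z_eye`, and that reinterpreting positive floats as unsigned integers keeps their order.

```python
import math
import struct

def perspective(fov_y, aspect, near, far):
    """OpenGL-style projection matrix, indexed m[column][row] like GLSL."""
    f = 1.0 / math.tan(fov_y / 2.0)
    m = [[0.0] * 4 for _ in range(4)]
    m[0][0] = f / aspect
    m[1][1] = f
    m[2][2] = (far + near) / (near - far)
    m[2][3] = -1.0
    m[3][2] = 2.0 * far * near / (near - far)
    return m

def window_depth(proj, z_eye):
    """Depth-buffer value in [0, 1] for a view-space z (negative in front of the camera)."""
    z_clip = proj[2][2] * z_eye + proj[3][2]
    w_clip = -z_eye
    return 0.5 * (z_clip / w_clip) + 0.5

def linearize(proj, depth):
    """Same expression as the shader; returns the positive distance -z_eye."""
    return (0.5 * proj[3][2]) / (depth + 0.5 * proj[2][2] - 0.5)

def float_bits_to_uint(x):
    """Python stand-in for GLSL floatBitsToUint."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

proj = perspective(math.radians(45.0), 16.0 / 9.0, 0.1, 300.0)
recovered = linearize(proj, window_depth(proj, -25.0))   # approximately 25.0
```

The ordering property only holds for non-negative floats, which is why it matters that the linearized depth is positive.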

Step 2:

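We need the six frustum planes in world space. An elegant way to get them is described in "Fast Extraction of Viewing Frustum Planes from the World-View-Projection Matrix", which extracts the frustum planes directly from the combined projection matrix.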

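Let the matrix $M = P \cdot V$, where $P$ is the projection matrix and $V$ is the view matrix, and let $v = (x, y, z, 1)$ be a point in world space. Then $Mv$ is the position of that point in clip space after the view and projection transforms:

$v' = Mv = (row_{1} \cdot v,\ row_{2} \cdot v,\ row_{3} \cdot v,\ row_{4} \cdot v) = (x', y', z', w')$

where $row_{i}$ denotes the i-th row of M. Because the perspective divide has not been applied yet, a point inside the view frustum satisfies:

$-w' < x' < w'$

$-w' < y' < w'$

$-w' < z' < w'$

These conditions give the six planes of the frustum. Take the left plane as an example: for x' to lie to the right of the left plane we need:

$-w' < x'$

which can be rewritten as:

$0 < x' + w' = (row_{4} + row_{1}) \cdot v$

In code, $(row_{4} + row_{1})$ can be written as: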

    // A row vector times the matrix yields row1 + row4 of viewProjection
    vec4 leftPlane = vec4(1.0, 0.0, 0.0, 1.0); // Left
    leftPlane *= viewProjection;

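This can be read as a plane equation:

$ax + by + cz + d = 0$

where the plane normal is $n = (a, b, c)$ and $d = -(n \cdot p)$ for any point p on the plane. Substituting the expression above at the boundary gives:

$(row_{4} + row_{1}) \cdot v = ax + by + cz + dw = 0$

where w = 1, so the plane coefficients (a, b, c, d) are simply the components of $row_{4} + row_{1}$. With this formula we have obtained a world-space frustum plane directly from the matrix M (P • V).

During light culling we need the distance from a light's position to each plane of the tile frustum, so we need the normalized plane equation, from which the point-to-plane distance follows directly. The plane is normalized as:

$\frac{a}{\|n\|}x + \frac{b}{\|n\|}y + \frac{c}{\|n\|}z + \frac{d}{\|n\|} = 0$

where $\|n\|$ is the length of n = (a, b, c). After normalization the distance computation becomes very simple: let the normalized plane be the vector l = (a, b, c, d) and let p = (x, y, z, 1) be a point in world space; then l • p is the signed distance from p to the plane l. The code that builds the six frustum planes and computes the distance from a point to them is: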

    // Plane coefficients; each plane combines two rows of M
    frustumPlanes[0] = vec4(1.0, 0.0, 0.0, 1.0); // Left
    frustumPlanes[1] = vec4(-1.0, 0.0, 0.0, 1.0); // Right
    frustumPlanes[2] = vec4(0.0, 1.0, 0.0, 1.0); // Bottom
    frustumPlanes[3] = vec4(0.0, -1.0, 0.0, 1.0); // Top
    frustumPlanes[4] = vec4(0.0, 0.0, 1.0, 1.0); // Near
    frustumPlanes[5] = vec4(0.0, 0.0, -1.0, 1.0); // Far

    // Form the corresponding combination of rows of M, then normalize
    for (uint i = 0; i < 6; i++) {
        frustumPlanes[i] *= viewProjection;
        frustumPlanes[i] /= length(frustumPlanes[i].xyz);
    }

    // Test whether a point (position, a vec4 with w = 1) is inside the frustum
    float distance = 0.0;
    for (uint j = 0; j < 6; j++) {
        // Signed distance from the point to the plane
        distance = dot(position, frustumPlanes[j]);
        // A non-positive distance to any plane means the point is not inside the frustum
        if (distance <= 0.0) {
            break;
        }
    }

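Here distance is a signed distance: a positive value means the point lies on the inner side of the plane, zero means it lies on the plane, and a negative value means it lies outside. The point is inside the frustum only if its distance to all six planes is positive.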
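
The plane extraction can be tried out on the CPU as well. The following sketch stores the matrix as a list of rows so that the row combinations are explicit, and it assumes an identity view matrix, so world space and view space coincide. The `perspective_rows` helper and the chosen field of view, near and far values are illustrative assumptions.

```python
import math

def perspective_rows(fov_y, aspect, near, far):
    """OpenGL-style projection matrix stored as a list of rows."""
    f = 1.0 / math.tan(fov_y / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def extract_planes(m):
    """Six normalized planes (a, b, c, d): left, right, bottom, top, near, far."""
    coeffs = [(1, 0, 0, 1), (-1, 0, 0, 1), (0, 1, 0, 1),
              (0, -1, 0, 1), (0, 0, 1, 1), (0, 0, -1, 1)]
    planes = []
    for c in coeffs:
        # Row vector times matrix: a weighted sum of the rows of m
        p = [sum(c[i] * m[i][j] for i in range(4)) for j in range(4)]
        n = math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2)
        planes.append([x / n for x in p])
    return planes

def signed_distances(planes, point):
    x, y, z = point
    return [a * x + b * y + c * z + d for a, b, c, d in planes]

def inside(planes, point):
    """A point is inside only if it is on the inner side of all six planes."""
    return all(dist > 0.0 for dist in signed_distances(planes, point))

planes = extract_planes(perspective_rows(math.radians(90.0), 1.0, 1.0, 100.0))
```

With a near plane at 1 and a far plane at 100, the normalized near plane reduces to `-z - 1 > 0`, so a point at z = -10 is 9 units behind it.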

To build the per-tile frustum needed for light culling, the code above has to be modified slightly:

    // Step 2: compute the six frustum planes of the tile
    if (gl_LocalInvocationIndex == 0) {
        // Convert the depth bounds back from uint to float
        minDepth = uintBitsToFloat(minDepthInt);
        maxDepth = uintBitsToFloat(maxDepthInt);

        // Offsets of this tile's edges, in NDC units
        vec2 negativeStep = (2.0 * vec2(tileID)) / vec2(tileNumber);
        vec2 positiveStep = (2.0 * vec2(tileID + ivec2(1, 1))) / vec2(tileNumber);

        // Set up the six planes of the tile frustum
        frustumPlanes[0] = vec4(1.0, 0.0, 0.0, 1.0 - negativeStep.x); // Left
        frustumPlanes[1] = vec4(-1.0, 0.0, 0.0, -1.0 + positiveStep.x); // Right
        frustumPlanes[2] = vec4(0.0, 1.0, 0.0, 1.0 - negativeStep.y); // Bottom
        frustumPlanes[3] = vec4(0.0, -1.0, 0.0, -1.0 + positiveStep.y); // Top
        frustumPlanes[4] = vec4(0.0, 0.0, -1.0, -minDepth); // Near
        frustumPlanes[5] = vec4(0.0, 0.0, 1.0, maxDepth); // Far

        // Transform the four side planes (everything except the depth planes)
        for (uint i = 0; i < 4; i++) {
            frustumPlanes[i] *= viewProjection;
            frustumPlanes[i] /= length(frustumPlanes[i].xyz);
        }

        // Transform the depth planes, which are defined in view space
        frustumPlanes[4] *= view;
        frustumPlanes[4] /= length(frustumPlanes[4].xyz);
        frustumPlanes[5] *= view;
        frustumPlanes[5] /= length(frustumPlanes[5].xyz);
    }
    barrier();

The depth has already been converted to linear space, so the two depth planes are handled separately. In view space (where z < 0):

$Z_{far} < Z_{e} < Z_{near}$

The linear depth computed from the depth texture is positive, however, so for the near plane of the tile we get:

$Z_{e} < -minDepth$

$0 < -minDepth - Z_{e}$

Hence frustumPlanes[4] = vec4(0.0, 0.0, -1.0, -minDepth); the far plane follows in the same way.

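Because the screen is split into many tiles of 16 * 16 pixels, each tile's frustum has to be offset so that it covers only that tile. That is why the left, right, bottom and top planes are modified: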

    frustumPlanes[0] = vec4(1.0, 0.0, 0.0, 1.0 - negativeStep.x); // Left
    frustumPlanes[1] = vec4(-1.0, 0.0, 0.0, -1.0 + positiveStep.x); // Right
    frustumPlanes[2] = vec4(0.0, 1.0, 0.0, 1.0 - negativeStep.y); // Bottom
    frustumPlanes[3] = vec4(0.0, -1.0, 0.0, -1.0 + positiveStep.y); // Top
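
The effect of these offsets can be verified with a small CPU sketch: a point that projects into a given tile should pass that tile's six plane tests and fail those of a neighbouring tile. The example assumes an identity view matrix, an 8 x 8 tile grid and an illustrative projection; the helper names are hypothetical. The planes are left unnormalized because only the sign of each test matters here.

```python
import math

def perspective_rows(fov_y, aspect, near, far):
    """OpenGL-style projection matrix stored as a list of rows."""
    f = 1.0 / math.tan(fov_y / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def tile_planes(m, tile_id, tile_count, min_depth, max_depth):
    """Unnormalized planes of one tile's sub-frustum (view matrix assumed identity)."""
    neg = [2.0 * tile_id[k] / tile_count[k] for k in range(2)]
    pos = [2.0 * (tile_id[k] + 1) / tile_count[k] for k in range(2)]
    side = [
        (1.0, 0.0, 0.0, 1.0 - neg[0]),    # left
        (-1.0, 0.0, 0.0, -1.0 + pos[0]),  # right
        (0.0, 1.0, 0.0, 1.0 - neg[1]),    # bottom
        (0.0, -1.0, 0.0, -1.0 + pos[1]),  # top
    ]
    planes = [[sum(c[i] * m[i][j] for i in range(4)) for j in range(4)] for c in side]
    planes.append([0.0, 0.0, -1.0, -min_depth])  # near: z_e < -min_depth
    planes.append([0.0, 0.0, 1.0, max_depth])    # far:  z_e > -max_depth
    return planes

def tile_of(m, point, tile_count):
    """Tile that a view-space point projects into."""
    x, y, z = point
    clip = [m[i][0] * x + m[i][1] * y + m[i][2] * z + m[i][3] for i in range(4)]
    ndc = (clip[0] / clip[3], clip[1] / clip[3])
    return tuple(int((ndc[k] * 0.5 + 0.5) * tile_count[k]) for k in range(2))

def inside(planes, point):
    x, y, z = point
    return all(a * x + b * y + c * z + d > 0.0 for a, b, c, d in planes)

proj = perspective_rows(math.radians(90.0), 1.0, 1.0, 100.0)
point = (3.0, -2.0, -10.0)
tile = tile_of(proj, point, (8, 8))   # (5, 3)
```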

Step 3:

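With the six planes of each tile available, the actual light culling can begin. To make full use of the GPU's parallelism, the work is parallelized over lights rather than over pixels. Since a tile is 16 * 16, a work group has 256 invocations and can test at most 256 lights at once; if there are more than 256 lights, they are processed in passes of 256 lights each. The code is: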

    // Step 3: cull the lights
    uint threadCount = TILE_SIZE * TILE_SIZE;
    uint passCount = (lightCount + threadCount - 1) / threadCount;
    for (uint i = 0; i < passCount; i++) {
        // Light index for this invocation: lights are processed in chunks of 256
        uint lightIndex = i * threadCount + gl_LocalInvocationIndex;
        // Only invocations with a valid light index continue; the rest leave the loop
        if (lightIndex >= lightCount) {
            break;
        }

        vec4 position = lightBuffer.data[lightIndex].position;
        float radius = lightBuffer.data[lightIndex].paddingAndRadius.w;

        // Test the light's bounding sphere against the tile frustum
        float distance = 0.0;
        for (uint j = 0; j < 6; j++) {
            distance = dot(position, frustumPlanes[j]) + radius;
            if (distance <= 0.0) {
                break;
            }
        }

        // distance stays positive only if the sphere passed all six plane tests
        if (distance > 0.0) {
            // Append the light index to the shared array visibleLightIndices
            uint offset = atomicAdd(visibleLightCount, 1);
            visibleLightIndices[offset] = int(lightIndex);
        }
    }
    barrier();

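In this way every tile tests all lights against its own frustum to decide whether each light affects the tile. Once all tiles have finished culling, it is known which tiles every light in the scene affects. Note that the maximum number of lights is limited to 1024 here. visibleLightIndices stores the lights that affect the current tile.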
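
The pass structure is easy to get wrong by one, so here is a sequential sketch of how the 256 invocations of one tile walk through the light list. It is only a CPU model of the shader loop: the function name, the axis-aligned box used as a stand-in frustum and the 600 test lights are made up for illustration. On the GPU the inner loop runs in parallel, so the order of the resulting indices is not deterministic there.

```python
TILE_SIZE = 16
THREADS = TILE_SIZE * TILE_SIZE   # 256 invocations per work group

def cull_tile(lights, planes):
    """lights: list of ((x, y, z), radius); planes: normalized (a, b, c, d) tuples."""
    pass_count = (len(lights) + THREADS - 1) // THREADS
    visible = []
    for i in range(pass_count):
        for thread in range(THREADS):   # runs in parallel on the GPU
            light_index = i * THREADS + thread
            if light_index >= len(lights):
                break
            (x, y, z), radius = lights[light_index]
            # The sphere is rejected as soon as it lies entirely behind one plane
            if all(a * x + b * y + c * z + d + radius > 0.0 for a, b, c, d in planes):
                visible.append(light_index)
    return pass_count, visible

# Stand-in "frustum": the box -1 < x < 1, -1 < y < 1, -10 < z < -1
box = [(1, 0, 0, 1), (-1, 0, 0, 1), (0, 1, 0, 1),
       (0, -1, 0, 1), (0, 0, -1, -1), (0, 0, 1, 10)]
lights = [((float(i), 0.0, -5.0), 0.5) for i in range(600)]
pass_count, visible = cull_tile(lights, box)   # 3 passes, lights 0 and 1 survive
```

The test is conservative: a sphere near a frustum corner can pass all six plane tests without actually touching the frustum, which costs some shading work but never drops a light.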

Step 4:

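The last step of light culling writes out the result. In each tile, the invocation with local index 0 stores the light indices in the linear array visibleLightIndicesBuffer.data. The global tile index multiplied by the maximum light count of 1024 is used as the offset, so the lists of all tiles live in this one linear array. When a tile is affected by fewer than 1024 lights, -1 is written as a terminator.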

    // Step 4: fill the global buffer
    if (gl_LocalInvocationIndex == 0) {
        // Position of this tile's list in the global buffer
        uint offset = index * 1024;
        for (uint i = 0; i < visibleLightCount; i++) {
            visibleLightIndicesBuffer.data[offset + i].index = visibleLightIndices[i];
        }

        if (visibleLightCount != 1024) {
            // The tile has fewer than 1024 lights: terminate the list with -1
            visibleLightIndicesBuffer.data[offset + visibleLightCount].index = -1;
        }
    }
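
The buffer layout is a fixed-stride, sentinel-terminated list per tile. The sketch below models it with a plain Python list to show both sides of the contract: how a tile's list is written and how the shading pass reads it back. The function names and the 4-tile buffer are illustrative assumptions; the stride of 1024 and the -1 terminator come from the shader.

```python
MAX_LIGHTS_PER_TILE = 1024
END = -1   # terminator written after the last visible light

def write_tile(buffer, tile_index, visible):
    """Store one tile's light indices; the caller guarantees len(visible) <= 1024."""
    offset = tile_index * MAX_LIGHTS_PER_TILE
    buffer[offset:offset + len(visible)] = visible
    if len(visible) != MAX_LIGHTS_PER_TILE:
        buffer[offset + len(visible)] = END

def read_tile(buffer, tile_index):
    """Read a tile's list the way the shading pass does: stop at -1 or after 1024 entries."""
    offset = tile_index * MAX_LIGHTS_PER_TILE
    result = []
    for i in range(MAX_LIGHTS_PER_TILE):
        value = buffer[offset + i]
        if value == END:
            break
        result.append(value)
    return result

buffer = [0] * (4 * MAX_LIGHTS_PER_TILE)   # room for 4 tiles
write_tile(buffer, 1, [])                  # a tile with no lights is just a terminator
write_tile(buffer, 2, [7, 3, 42])
write_tile(buffer, 3, list(range(1024)))   # a full tile has no room for a terminator
```

A completely full tile carries no terminator, which is why the reader also has to stop after 1024 entries.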

3. Shading

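Shading is almost identical to traditional forward rendering. The only difference is that instead of looping over all lights, the shader loops over the lights of the tile the pixel belongs to. The code is: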

    #version 430

    in VERTEX_OUT {
        vec3 fragmentPosition;
        vec2 textureCoordinates;
        mat3 TBN;
        vec3 tangentViewPosition;
        vec3 tangentFragmentPosition;
    } fragment_in;

    struct PointLight {
        vec4 color;
        vec4 position;
        vec4 paddingAndRadius;
    };

    struct VisibleIndex {
        int index;
    };

    // Shader storage buffer objects
    layout(std430, binding = 0) readonly buffer LightBuffer {
        PointLight data[];
    } lightBuffer;

    layout(std430, binding = 1) readonly buffer VisibleLightIndicesBuffer {
        VisibleIndex data[];
    } visibleLightIndicesBuffer;

    // Uniforms
    uniform sampler2D texture_diffuse1;
    uniform sampler2D texture_specular1;
    uniform sampler2D texture_normal1;
    uniform int numberOfTilesX;

    out vec4 fragColor;

    // Attenuation of a point light
    float attenuate(vec3 lightDirection, float radius) {
        float cutoff = 0.5;
        float attenuation = dot(lightDirection, lightDirection) / (100.0 * radius);
        attenuation = 1.0 / (attenuation * 15.0 + 1.0);
        attenuation = (attenuation - cutoff) / (1.0 - cutoff);
        return clamp(attenuation, 0.0, 1.0);
    }

    void main() {
        // Find the tile that contains the current pixel
        ivec2 location = ivec2(gl_FragCoord.xy);
        ivec2 tileID = location / ivec2(16, 16);
        uint index = tileID.y * numberOfTilesX + tileID.x;

        // Fetch the colors and the normal from the textures
        vec4 base_diffuse = texture(texture_diffuse1, fragment_in.textureCoordinates);
        vec4 base_specular = texture(texture_specular1, fragment_in.textureCoordinates);
        vec3 normal = texture(texture_normal1, fragment_in.textureCoordinates).rgb;
        normal = normalize(normal * 2.0 - 1.0);
        vec4 color = vec4(0.0, 0.0, 0.0, 1.0);
        vec3 viewDirection = normalize(fragment_in.tangentViewPosition - fragment_in.tangentFragmentPosition);

        // Loop over all lights of the tile; -1 marks the end of the list
        uint offset = index * 1024;
        for (uint i = 0; i < 1024 && visibleLightIndicesBuffer.data[offset + i].index != -1; i++) {
            uint lightIndex = visibleLightIndicesBuffer.data[offset + i].index;
            PointLight light = lightBuffer.data[lightIndex];
            vec4 lightColor = light.color;
            vec3 tangentLightPosition = fragment_in.TBN * light.position.xyz;
            float lightRadius = light.paddingAndRadius.w;

            // Attenuation
            vec3 lightDirection = tangentLightPosition - fragment_in.tangentFragmentPosition;
            float attenuation = attenuate(lightDirection, lightRadius);

            // Direction vectors
            lightDirection = normalize(lightDirection);
            vec3 halfway = normalize(lightDirection + viewDirection);

            // Blinn-Phong lighting (the specular term uses the halfway vector)
            float diffuse = max(dot(lightDirection, normal), 0.0);
            float specular = pow(max(dot(normal, halfway), 0.0), 32.0);

            // Shadows are not considered; hack to suppress specular on unlit surfaces
            if (diffuse == 0.0) {
                specular = 0.0;
            }

            vec3 irradiance = lightColor.rgb * ((base_diffuse.rgb * diffuse) + (base_specular.rgb * vec3(specular))) * attenuation;
            color.rgb += irradiance;
        }

        // Constant ambient term
        color.rgb += base_diffuse.rgb * 0.08;

        // Discard mostly transparent fragments
        if (base_diffuse.a <= 0.2) {
            discard;
        }

        fragColor = color;
    }
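
The `attenuate` function is worth a closer look, because its shape decides how far a light really reaches. The sketch below ports it to Python, taking the distance instead of the direction vector (the shader uses the squared length via a dot product). The `zero_distance` helper and the radius of 30 are illustrative assumptions, not values from the original code.

```python
import math

def attenuate(distance, radius, cutoff=0.5):
    """Python port of the shader's attenuate(), with the vector replaced by its length."""
    a = distance * distance / (100.0 * radius)
    a = 1.0 / (a * 15.0 + 1.0)
    a = (a - cutoff) / (1.0 - cutoff)
    return min(max(a, 0.0), 1.0)

def zero_distance(radius, cutoff=0.5):
    """Distance at which the curve reaches zero: 1 / (15 d^2 / (100 r) + 1) = cutoff."""
    return math.sqrt((1.0 / cutoff - 1.0) * 100.0 * radius / 15.0)

reach = zero_distance(30.0)        # sqrt(200), about 14.1
mid = attenuate(10.0, 30.0)        # about 1/3
```

Under these assumptions a light with radius 30 fades to zero at roughly 14.1 units, well inside the sphere used for culling, so the culling stage errs on the safe side for this value: every fragment that receives light is inside the sphere tested against the tile frustum.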