To satisfy growing demands on visual fidelity and runtime performance, we investigated a new culling and rendering system for future use in the Dawn Engine. This work was part of the research done by our internal R&D team, LABS, for the Deus Ex Universe, but it is not used in the upcoming game Deus Ex: Mankind Divided. A major goal of this investigation was to develop a system that is compatible with the existing asset pipeline and allows for fast iteration times during game production. Our culling system combines the low latency and low overhead of a hierarchical depth buffer based approach [Hill and Collin 11] with the pixel accuracy of conventional GPU hardware occlusion queries. It efficiently culls highly dynamic, complex environments while maintaining compatibility with standard mesh assets. Our rendering system takes a practical approach to the idea of deferred texturing [Reed 14] and efficiently supports highly diverse and complex materials while using conventional texture assets. Both the culling and the rendering system make use of new graphics capabilities available with DirectX 12, most notably enhanced indirect rendering and the new shader resource binding model.
Our culling system is partially based on the ideas presented by [Haar and Aaltonen 15], where the depth buffer from the previous frame is used to determine an initial visibility and potential false negatives are retested against the updated depth buffer of the current frame. In this way we avoid rendering dedicated occlusion geometry, which can be difficult to generate, e.g. for natural environments. However, instead of using a hierarchical depth buffer and subdividing meshes into small clusters, we use a concept in the spirit of [Kubisch and Tavenrath 14] that relies on the early depth-stencil testing capabilities of modern consumer graphics hardware. The oriented bounding boxes of the occludees are rendered against the reprojected depth buffer from the previous frame, and the associated pixel shader is forced to use early depth-stencil testing. Only visible fragments reach the pixel shader, which marks the corresponding instance as visible in a common GPU buffer at a location unique to each mesh instance. A subsequent compute shader generates, from the acquired visibility information, the data used for indirect rendering. As proposed by [Haar and Aaltonen 15], objects found occluded in the first pass are retested against the updated depth buffer of the current frame to catch false negatives. Figure 1 gives an overview of the involved steps and resources.
Figure 1. Overview of the culling system. Arrows on the left side of each culling step represent input data, arrows on the right side output data. The colors match those of the corresponding GPU buffers.
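As a minimal illustration of the occlusion test, the bounding-box depth test of a single occludee can be simulated on the CPU as follows. This is only a sketch with hypothetical names; the actual system rasterizes oriented bounding boxes on the GPU, where a forced early depth-stencil test discards occluded fragments before the pixel shader runs.

```cpp
#include <vector>

// CPU sketch: test a screen-space box of an occludee against the
// reprojected previous-frame depth buffer. Any covered pixel that passes
// the depth test corresponds to a visible fragment, so the instance is
// visible. (On the GPU this is a UAV write from a pixel shader with a
// forced early depth-stencil test, not a return value.)
bool testOccludee(const std::vector<float>& depth, int width,
                  int x0, int y0, int x1, int y1, float boxDepth)
{
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (boxDepth <= depth[y * width + x])
                return true; // at least one fragment survives the depth test
    return false;            // fully occluded
}
```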
Since scenes built with the current asset pipeline of the Dawn Engine already consist of relatively small modular blocks, this approach allowed us to avoid introducing a system for subdividing meshes into small clusters. By replacing hierarchical depth buffer based culling with the early depth-stencil based approach, we were able to achieve, within a natural jungle environment and without mesh clustering, on average 2.3x higher culling rates and 1.6x faster frame times.
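The subsequent compute step that turns the per-instance visibility flags into data for indirect rendering can be sketched on the CPU as a simple compaction. The types and names here are hypothetical; the real implementation fills GPU argument buffers consumed by DirectX 12 indirect rendering.

```cpp
#include <cstdint>
#include <vector>

// Per-instance indirect draw arguments (simplified stand-in for the
// actual GPU argument buffer layout).
struct DrawArgs {
    uint32_t indexCount;
    uint32_t instanceId; // which mesh instance this draw renders
};

// Sketch of the compute pass that follows the early depth-stencil test:
// the pixel shader has already written 1 into visibility[i] for every
// instance whose bounding box produced at least one visible fragment.
// This pass compacts the visible instances into a tight list of draw
// arguments for indirect rendering.
std::vector<DrawArgs> buildIndirectArgs(const std::vector<uint32_t>& visibility,
                                        const std::vector<uint32_t>& indexCounts)
{
    std::vector<DrawArgs> args;
    for (uint32_t i = 0; i < visibility.size(); ++i)
        if (visibility[i] != 0)
            args.push_back({indexCounts[i], i});
    return args;
}
```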
For modern games, it is important to utilize a rendering system that can handle increasingly complex mesh geometry and realistic surface materials. Forward rendering systems support high material diversity but either suffer from overdraw or require a depth pre-pass, which can be expensive for meshes with a high triangle count, GPU hardware tessellation, alpha-testing or vertex-shader skinning. Deferred rendering systems run efficiently without a depth pre-pass, but only support a limited range of materials and therefore often require an additional forward rendering pass for more diverse materials. Our practical approach to deferred texturing combines the strengths of both rendering systems by supporting a high diversity of materials while performing only a single geometry pass. We go one step further than traditional deferred rendering and completely decouple geometry from materials and lighting. In an initial geometry pass, all mesh instances that pass the GPU culling stage are rendered indirectly and their vertex attributes are written, in compressed form, into a set of geometry buffers. No material-specific operations or texture fetches are performed (except for alpha-testing and certain kinds of GPU hardware tessellation techniques). A subsequent full-screen pass transfers a material ID from the geometry buffers into a 16-bit depth buffer. Finally, in the shading pass, a screen-space rectangle is rendered for each material that encloses the boundaries of all visible meshes using that material. The depth of the rectangle vertices is set to a value that corresponds to the currently processed material ID, and early depth-stencil testing is used to reject pixels belonging to other materials. All standard materials that use the same shader and resource binding layout are rendered in a single pass via dynamically indexed textures. At this point, material-specific rendering and lighting (e.g. tiled [Billeter et al. 13] or clustered [Olsson et al. 12]) are done simultaneously.
Figure 2 gives an overview of the rendering process.
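The material ID rejection trick relies on mapping each material ID to a depth value that lands exactly on a representable step of the 16-bit depth buffer, so that the rectangle drawn at that depth matches only pixels of its own material. A hypothetical mapping (one of several possible conventions) could look like this:

```cpp
#include <cstdint>

// Sketch: map a material ID to a normalized depth value for a 16-bit
// UNORM depth buffer, and back. The full-screen pass writes each pixel's
// material depth; each material's screen-space rectangle is then drawn at
// exactly this depth so that early depth-stencil testing (depth function
// EQUAL) discards all pixels belonging to other materials.
float materialIdToDepth(uint16_t materialId)
{
    return static_cast<float>(materialId) / 65535.0f;
}

uint16_t depthToMaterialId(float depth)
{
    return static_cast<uint16_t>(depth * 65535.0f + 0.5f);
}
```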
Since, in general, memory bandwidth on current consumer graphics hardware is much more limited than computational power, it is important to keep the size of the geometry buffers as low as possible; thus vertex attributes need to be stored in a compressed form. We store texture coordinates in 2x 16 bits by keeping only their fractional part after interpolation. Since the derivatives of the original texture coordinates are stored alongside, no seams are visible later on. In theory, derivatives can be reconstructed in the shading pass from neighboring texture coordinates, but at geometry edges, where appropriate neighbor texture coordinates can't always be obtained, artifacts become visible. This is especially noticeable under camera motion with dense alpha-tested foliage. Therefore we decided to store texture coordinates along with their derivatives. For this we treat the derivatives in X- and Y-direction as 2D vectors. By decoupling the vector length from the orientation, the lengths can be stored in 2x 16 bits and the orientations in 2x 8 bits, which still gives enough precision for anisotropic texture filtering. We looked into storing the tangent space (tangent, bitangent, normal) as a quaternion in 32 bits according to [McAuley 15] but dismissed this approach due to visible faceting on smooth, shiny surfaces. Instead we store the tangent space in an axis-angle representation, where the normal acts as the axis and is stored using octahedron normal vector encoding [Meyer et al. 10], and the tangent is stored as an angle. In this way we could store the entire tangent space in 32 bits and achieve the same quality as storing the tangent space uncompressed in 3x 30 bits. It should be noted that this method requires about half the instruction count of converting a TBN matrix into a quaternion in a mathematically stable, precise manner and packing it into 32 bits.
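The octahedron encoding used for the normal axis can be sketched as follows: the unit sphere is projected onto an octahedron, whose lower half is folded over the diagonals so the result unfolds into a square. This is a minimal floating-point sketch in the spirit of [Meyer et al. 10] (quantization to the final bit budget and the tangent angle are omitted here):

```cpp
#include <cmath>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

static float signNonZero(float v) { return v >= 0.0f ? 1.0f : -1.0f; }

// Map a unit normal onto the [-1,1]^2 octahedron square.
Vec2 octEncode(Vec3 n)
{
    float invL1 = 1.0f / (std::fabs(n.x) + std::fabs(n.y) + std::fabs(n.z));
    Vec2 p = { n.x * invL1, n.y * invL1 };
    if (n.z < 0.0f) { // fold the lower hemisphere over the diagonals
        Vec2 folded = { (1.0f - std::fabs(p.y)) * signNonZero(p.x),
                        (1.0f - std::fabs(p.x)) * signNonZero(p.y) };
        p = folded;
    }
    return p;
}

// Reconstruct the unit normal from its octahedron coordinates.
Vec3 octDecode(Vec2 p)
{
    Vec3 n = { p.x, p.y, 1.0f - std::fabs(p.x) - std::fabs(p.y) };
    if (n.z < 0.0f) { // unfold the lower hemisphere
        float nx = (1.0f - std::fabs(n.y)) * signNonZero(n.x);
        float ny = (1.0f - std::fabs(n.x)) * signNonZero(n.y);
        n.x = nx; n.y = ny;
    }
    float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    return { n.x / len, n.y / len, n.z / len };
}
```

In the actual scheme, the two encoded values are quantized and packed together with the tangent angle into the 32-bit tangent-space attribute.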
Pros and Cons
Below we summarize the most important pros and cons of the presented culling and rendering system.
• Same pixel accuracy as with conventional GPU hardware occlusion queries, but without latency issues (popping).
• Support of highly dynamic, complex, alpha-tested occluders without needing to author and render dedicated occluder geometry.
• High culling efficiency even without mesh clustering for modularly composed scenes, thus fully compatible with standard asset pipelines.
• Low performance overhead of culling system.
• Number of draw calls massively reduced (performance benefit even with low-overhead graphics APIs such as DirectX 12 and Vulkan).
• Draw commands are no longer in deterministic order (nearly coplanar surfaces are more likely to cause Z-fighting and should be avoided).
• Depth sorting of draw calls no longer guaranteed, causing higher overdraw (less problematic with deferred+ due to the lightweight geometry pass and high culling efficiency).
• Due to lightweight geometry pass, depth pre-pass no longer required.
• GPU warp utilization for applying materials and lighting significantly better than with clustered forward shading [Olsson et al. 12], thus small triangles are far less problematic and GPU hardware tessellation performs much better.
• Unified rendering system that, in contrast to deferred rendering, can handle highly diverse range of materials efficiently.
• Geometry processing completely decoupled from material rendering and lighting, resulting in less shader permutations and faster iteration times in game production.
• By decoupling geometry processing from material rendering, switching of GPU resources significantly reduced.
• In contrast to systems with deferred vertex attribute fetching [Burns and Hunt 13], geometry information is fetched only once per frame in a cache-friendly, coherent manner.
• Compressed texture data doesn't need to be decompressed into GPU memory as with deferred rendering, thus texture memory bandwidth is significantly reduced.
• Modified geometry buffers contain useful information not available with deferred rendering:
o Texture coordinate derivatives (fix mip-mapping issues with deferred decals)
o Vertex normals (enhance screen-space ambient occlusion techniques)
o Vertex tangents (anisotropic lighting)
• Does not depend on vendor-specific graphics features and is compatible with the entire range of DirectX 12 capable graphics hardware (if the supported number of dynamically indexed textures is too low, applications can still fall back to rendering common materials separately).
• Vertex attributes much more limited in comparison to traditional rendering techniques.
• Transparent objects have to be handled separately.
• Antialiasing still difficult to handle.
To capture the results, we used scenes from the game Deus Ex: Mankind Divided, converted into a format that we could load and render in an experimental framework based on DirectX 12. Our test machine used an AMD Radeon R9 390 graphics card, and the screen resolution was set to 1920x1080. In the video below you can see a scene rendered with 1024 dynamic spherical area lights using clustered lighting. To simulate dynamic objects for occlusion culling, the source of each area light is rendered as an emissive sphere. To compare our rendering system with a reference clustered forward renderer while using GPU culling, all materials except the emissive sphere material use the same shading code and are rendered for deferred+ in separate passes with the help of the described depth-stencil rejection method. We also used 8x anisotropic texture filtering for all material textures to ensure that our texture derivative compression method doesn't produce artifacts. The frame time and GPU times displayed on the left side of the screen are measured in milliseconds, and the culling counters in the middle of the screen only consider meshes that were frustum-culled on the CPU.
At the beginning of the video we toggle GPU culling on and off, which reveals a performance gain of approximately 1.44 ms and a culling efficiency of approximately 80%. Then we display the boundaries of culled objects as red wireframe boxes. Finally, we compare the rendering times of deferred+ with clustered forward rendering, which shows that deferred+ runs approximately 4.31 ms faster while producing almost equivalent visual quality. With adaptive GPU hardware tessellation enabled, using a maximum tessellation factor of 5, deferred+ runs as much as 24.98 ms faster. Under realistic game conditions with more complex materials and lighting (different light types, shadow mapping), the performance benefit of deferred+ should be even more prominent.
We presented a system for culling and rendering complex, highly dynamic scenes, which makes use of new graphics capabilities available with DirectX 12. The culling system provides high culling efficiency even for non-clustered, traditional mesh assets while having a low overhead. The rendering system outperforms a quality-wise comparable clustered forward rendering system, even more so when GPU hardware tessellation techniques are employed. It is fully compatible with conventional texture assets, doesn’t depend on vendor-specific graphics features and can run on the entire range of DirectX 12 capable graphics hardware. By combining the proposed culling and rendering system it is possible to render an entire complex scene in just a few draw calls.
We would like to thank Francis Maheux for providing us with the assets for the prototype, and Samuel Delmont and Uriel Doyon for their valuable input on the implementation itself. Furthermore, we would like to thank Eidos-Montréal and Square Enix for allowing us to make the results of this research project public.
Together with Wolfgang Engel, Eidos-Montréal LABS is working on an in-depth article describing our work for the upcoming book GPU Zen, coming out in 2017.
Eidos-Montréal is always looking for great talent to help us shape the next generation of games. If you believe that you have what it takes, we definitely want to hear from you. Look at the current positions available at Eidos-Montréal and join our talented teams now!
[Billeter et al. 13] M. Billeter, O. Olsson and U. Assarsson. "Tiled Forward Shading". GPU Pro 4: Advanced Rendering Techniques. A. K. Peters, pp. 99–114. 2013.
[Burns and Hunt 13] C. A. Burns and W. A. Hunt. “The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading”. Journal of Computer Graphics Techniques, Vol. 2, No. 2. 2013.
[Haar and Aaltonen 15] U. Haar and S. Aaltonen. “GPU-Driven Rendering Pipelines”. In ACM SIGGRAPH 2015 Talks, ACM, Los Angeles, USA, SIGGRAPH ’15.
[Hill and Collin 11] S. Hill and D. Collin. “Practical, Dynamic Visibility for Games”. GPU Pro 2, A K Peters, 2011, pp. 329-347.
[Kubisch and Tavenrath 14] C. Kubisch and M. Tavenrath. “OpenGL 4.4 Scene Rendering Techniques”. GPU Technology Conference 2014 Presentation, NVIDIA, San Jose, USA, p. 50. 2014.
[McAuley 15] S. McAuley. “Rendering the World of Far Cry 4”. Game Developers Conference 2015 Talks, San Francisco. 2015.
[Meyer et al. 10] Q. Meyer, J. Süßmuth, G. Sußner, M. Stamminger and G. Greiner. “On Floating-Point Normal Vectors”. In Eurographics Symposium on Rendering, Vol. 29, No. 4. 2010.
[Olsson et al. 12] O. Olsson, M. Billeter and U. Assarsson. “Clustered Deferred and Forward Shading”. In HPG ’12: Proceedings of the Fourth ACM SIGGRAPH/Eurographics Conference on High-Performance Graphics, ACM, pp. 87–96. 2012.
[Reed 14] N. Reed. “Deferred Texturing”. Blog post, 2014, http://www.reedbeta.com/blog/2014/03/25/deferred-texturing