As I (ed.: Jesper Børlum, previous employee) was looking through the presentations from Siggraph Asia 2014, one presentation in particular caught my eye: Tristan Lorach's presentation on Nvidia's upcoming manual Command-List OpenGL extension. With all the focus this last year on reducing CPU-side driver overhead in the current graphics APIs, and the upcoming new rendering APIs (AMD's Mantle, Microsoft's DirectX 12, Apple's Metal), I decided to make an overview of the current recommendations for scene rendering using core OpenGL and take a poke at Nvidia's new extension. This first article looks at the core OpenGL recommendations; the next article will cover Nvidia's new extension. I am writing this article because I wanted to get a better grasp of the implementation details in the excellent GTC / Siggraph performance presentations found here:

For performance results and shader code please refer to the Nvidia presentations.

Disclaimer – This post is a simplification of a complex topic. If you feel I have left out important details, please add them to the comments at the end or write me.

Modern GPUs are absolute beasts. It never ceases to amaze me how much raw processing power they can handle – even standard gaming hardware. However, scene requirements are getting increasingly complex: more geometry, more distinct material types, and new, complex render effects. The GPU driver often ends up being a serious performance bottleneck when handling this complexity. This means that no matter how much GPU power you throw at the rendering, the overall performance is not going to increase.
A lot of work eats up CPU performance: scenegraph traversal, animation, render-list generation, sorting by state, and all the driver interactions.
Current driver performance culprits are:

  • Frequent GPU state changes (shader, parameters, textures, framebuffer etc.).
  • Draw commands.
  • Geometry stream changes.
  • Data transfers (uploads / read-backs).

All of these boil down to the driver eating up your precious CPU clock cycles.
Using the techniques below, most of this CPU driver overhead can be reduced to almost zero. In the following sections, I will look at several methods for reducing the overhead. Most achieve this simply by calling the driver less. Seems simple enough, but handling material changes, texture changes, buffer changes and state changes between draw calls can get tricky. Also, note that most of these methods require a newer version of OpenGL – some of the functions only just made it into the core specification (OpenGL 4.4 / 4.5).

A scene, in the context of this post, is a collection of objects, each consisting of sub-objects. A sub-object is a material and a draw command. Objects are logical collections of sub-objects each with their own world transform matrix. A material is a collection consisting of a shader program, parameters for the shader program and an OpenGL render state collection.
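In code, these definitions might be sketched as plain structs (the names and fields are my own illustration, not from the Nvidia presentations):

```cpp
#include <cstdint>
#include <vector>

// A draw command: an index range into the sub-object's geometry buffers.
struct DrawCommand {
    uint32_t firstIndex;  // offset into the index buffer
    uint32_t indexCount;  // number of indices to draw
};

// A material: shader program, its parameters and a render state collection.
struct Material {
    uint32_t shaderProgram;         // GL program handle
    std::vector<float> parameters;  // uniform values for the program
    uint32_t renderStateId;         // id of an OpenGL render state collection
};

// A sub-object pairs a material with a draw command.
struct SubObject {
    Material material;
    DrawCommand draw;
};

// An object is a logical collection of sub-objects with one world transform.
struct Object {
    float worldTransform[16];  // column-major 4x4 matrix
    std::vector<SubObject> subObjects;
};

// A scene is simply a collection of objects.
struct Scene {
    std::vector<Object> objects;
};
```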

I have provided two naïve approaches to scene rendering and uploading of shader parameters – The two areas we will be focusing on.

Naïve scene rendering
This will act as the performance baseline that each of the following improvements tries to beat.
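A minimal sketch of such a baseline loop, assuming helper functions like `applyRenderState` and `uploadUniforms` and a per-sub-object vertex array object (all names are placeholders for the application's own code):

```cpp
// Naive render loop: state setup, geometry bind and draw per sub-object.
for (const Object& obj : scene.objects) {
    for (const SubObject& sub : obj.subObjects) {
        glUseProgram(sub.material.shaderProgram);          // shader change per sub-object
        applyRenderState(sub.material.renderStateId);      // render state change (placeholder)
        uploadUniforms(sub.material, obj.worldTransform);  // per-draw parameter upload (placeholder)
        glBindVertexArray(sub.vertexArray);                // geometry stream change per sub-object
        glDrawElements(GL_TRIANGLES, sub.draw.indexCount, GL_UNSIGNED_INT,
                       (const void*)(sub.draw.firstIndex * sizeof(uint32_t)));
    }
}
```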

This method imposes a large number of driver interactions:

  • Geometry streams are changed per sub-object.
  • Shaders are changed per sub-object, if different from current.
  • Shader parameters are uploaded per draw.
  • A draw call per sub-object.

Naïve parameter update
Uploading parameters, also known as uniforms, to shaders can impose a significant number of driver calls – especially if they are uploaded “the old fashioned way”, where each parameter upload is a separate call to glUniform. This will act as the baseline that the buffer-based approaches later in the post improve on.
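The old-fashioned upload might look like this – one driver call per parameter, repeated for every draw (the uniform names are purely illustrative):

```cpp
// One glUniform* call per parameter; each is a separate driver interaction.
glUseProgram(material.shaderProgram);
glUniformMatrix4fv(glGetUniformLocation(material.shaderProgram, "u_worldMatrix"),
                   1, GL_FALSE, worldTransform);
glUniform4fv(glGetUniformLocation(material.shaderProgram, "u_diffuseColor"),
             1, diffuseColor);
glUniform1f(glGetUniformLocation(material.shaderProgram, "u_roughness"),
            roughness);
// ...and so on for every parameter, for every draw.
```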

This technique has several weaknesses. It issues many separate driver calls, which the driver cannot predict. To make it even worse, we need to re-upload all the parameters each time we change the shader program, because uniform values are stored in the shader program object – not in the general OpenGL state. In the past, I have solved this by maintaining a CPU-side parameter state cache per shader program. The proxy is then responsible for re-uploading a uniform if it becomes dirty. This is a workable solution if you cannot use buffer objects, which trivialize the sharing of parameter data across shader programs, as seen later in this post.

Improvement 1 – Single buffer per object
The obvious improvement to the naïve scene rendering is to move the buffers from the sub-objects into a collection of collapsed buffers in the containing object. This allows us to move the buffer bind call from the inner loop to the outer loop, which dramatically lowers the number of geometry driver calls in a scene where each object contains many sub-objects. Each sub-object now needs to know the correct stream offset into the collapsed buffers to draw correctly. When loading geometry, you will need to collapse all sub-object buffers and offset the vertex indices to reflect their new positions in the collapsed buffer.

Improvement 2 – Sort sub-objects by material
Sorting by complete materials (same shaders, render state and material parameters – for now) achieves two things: we can draw several sub-objects at a time, and we avoid costly shader changes.
The main difference in the render loop is that instead of looping over each sub-object, we now loop over material batches. A material batch contains the material information, along with information about which parts of the geometry are to be rendered using that material setup.
During geometry load, you will need to sort by materials so that each batch contains enough information to render all sub-objects it contains.
You can opt to rearrange the vertex buffer data so that the draw command ranges can be “grown” to draw several sub-objects in a single command.
When drawing, you can choose between two approaches:

  • Using a loop over each of the sub-object buffer ranges in the batch drawing each with glDrawElements.
  • Submitting all draw calls in one call using the slightly improved glMultiDrawElements.

The second multi draw approach will execute the loop for you inside the driver – hence only a slight improvement.
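A batch loop using the multi-draw variant might be sketched like this (the `MaterialBatch` layout and helper names are my own illustration):

```cpp
// A material batch: one material plus the geometry ranges it applies to.
struct MaterialBatch {
    Material material;
    std::vector<GLsizei> counts;       // index count for each sub-object range
    std::vector<const void*> offsets;  // byte offset into the index buffer per range
};

for (const MaterialBatch& batch : batches) {
    glUseProgram(batch.material.shaderProgram);      // shader changed once per batch
    applyRenderState(batch.material.renderStateId);  // render state once per batch (placeholder)
    uploadUniforms(batch.material);                  // parameters once per batch (placeholder)
    // One call submits all the sub-object ranges; the driver loops internally.
    glMultiDrawElements(GL_TRIANGLES, batch.counts.data(), GL_UNSIGNED_INT,
                        batch.offsets.data(), (GLsizei)batch.counts.size());
}
```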

Improvement 3 – Buffers for uniforms
Instead of uploading each uniform separately as shown in the naïve parameter update, OpenGL allows you to store uniforms in buffer objects – so-called Uniform Buffer Objects (UBOs). Instead of issuing a series of glUniform calls per object, you can upload a chunk of uniforms in a single buffer upload such as glBufferData or glBufferSubData. It is important to group uniforms according to their frequency of change when uploading data into buffers. A practical grouping of uniforms could look something like the following:

  • Scene globals – camera etc.
  • Active lights.
  • Material parameters.
  • Object specifics – transform etc.

Grouping parameters allows you to leave infrequently changed data on the GPU, while only the dynamic data is re-uploaded. A key UBO feature is that, unlike glUniform, they allow parameter sharing across shader programs. I am not going to write a full usage guide on UBOs – one can be found here.
There are different ways to use Uniform Buffer Objects. The recommended way depends on whether the data you are using is fairly static or dynamic. Below are examples of both. Note – you can mix the methods to best fit your use case.

Static buffer data:
If the data changes infrequently, upload the parameters for all the sub-objects in one go into a large UBO, then target the correct parameters using glBindBufferRange as shown below:
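A sketch of such a loop, assuming every sub-object's parameter block has been padded to GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT and uploaded into one large UBO up front (buffer and struct names are illustrative):

```cpp
// One large UBO holds the parameter block of every sub-object.
// Per draw we only repoint the binding at the right slice -- no re-upload.
for (const SubObject& sub : object.subObjects) {
    glBindBufferRange(GL_UNIFORM_BUFFER,
                      MATERIAL_BLOCK_BINDING,   // binding point declared in the shader
                      staticParameterUbo,
                      sub.parameterByteOffset,  // aligned offset of this sub-object's block
                      sizeof(MaterialParameters));
    glDrawElements(GL_TRIANGLES, sub.draw.indexCount, GL_UNSIGNED_INT,
                   (const void*)(sub.draw.firstIndex * sizeof(uint32_t)));
}
```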

Dynamic buffer data:
If the data changes frequently, upload the parameters into a small UBO per material batch. The example below takes advantage of the new direct state access (DSA) methods introduced in OpenGL 4.5, and shows how such a render loop could look.
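A sketch of the dynamic variant (buffer and struct names are my own; glNamedBufferSubData is one of the 4.5 DSA entry points):

```cpp
// Small UBO, re-uploaded once per material batch via DSA --
// no bind to a target is needed in order to upload.
for (const MaterialBatch& batch : batches) {
    glUseProgram(batch.material.shaderProgram);
    glNamedBufferSubData(dynamicParameterUbo, 0,
                         sizeof(MaterialParameters), &batch.parameters);
    glBindBufferBase(GL_UNIFORM_BUFFER, MATERIAL_BLOCK_BINDING, dynamicParameterUbo);
    glMultiDrawElements(GL_TRIANGLES, batch.counts.data(), GL_UNSIGNED_INT,
                        batch.offsets.data(), (GLsizei)batch.counts.size());
}
```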

Note – Upload of scattered data changes to static buffer using compute + SSBO
Nvidia mentioned a cute way to scatter data into a buffer. Normally, you need to upload using a series of smaller glBufferSubData calls if the changes are non-contiguous in memory. Alternatively, you could re-upload the entire buffer from scratch. Either could degrade performance significantly. They suggest placing all the changes in an SSBO and performing the scatter-write using a compute shader. A shader storage buffer object (SSBO) is just a user-defined OpenGL buffer object that can be read and written from shaders. I have yet to try this technique out, so I cannot comment on whether the performance makes it feasible. I really like the idea though.
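A sketch of how such a scatter-write compute shader could look (the layout and names are my guesses, not Nvidia's actual code): each invocation reads one index/value pair from the change list and writes it into the large parameter buffer.

```cpp
// GLSL compute shader embedded as a string; compile with glCreateShader(GL_COMPUTE_SHADER).
const char* scatterComputeSrc = R"(
    #version 430
    layout(local_size_x = 64) in;
    struct Change { uint targetIndex; float value; };
    layout(std430, binding = 0) readonly buffer Changes { Change changes[]; };
    layout(std430, binding = 1) buffer Target { float target[]; };
    uniform uint changeCount;
    void main() {
        uint i = gl_GlobalInvocationID.x;
        if (i >= changeCount) return;
        target[changes[i].targetIndex] = changes[i].value;  // scatter-write
    }
)";
// After uploading the change list into the binding 0 SSBO:
//   glUseProgram(scatterProgram);
//   glUniform1ui(changeCountLocation, changeCount);
//   glDispatchCompute((changeCount + 63) / 64, 1, 1);
//   glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
```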

Improvement 4 – Shader-based material / transform lookup
Improvement 3 introduced the notion of using UBOs to improve uniform upload performance. Unfortunately, there are still many glBindBufferRange operations. It is possible to remove those binds by binding the entire buffer once and having the shader index into the information. The index is communicated through a generic vertex attribute as shown below.

You use a generic vertex attribute like any other vertex attribute from inside the shader.
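A sketch of the idea: bind the whole parameter buffer once, then set only a per-sub-object index through a generic vertex attribute (the attribute slot and names are illustrative). Because no array is enabled on the attribute slot, the constant value set by glVertexAttribI1i is delivered to every vertex:

```cpp
// Bind the full parameter UBO once, outside the loop.
glBindBufferBase(GL_UNIFORM_BUFFER, MATERIAL_BLOCK_BINDING, allParametersUbo);
for (const SubObject& sub : object.subObjects) {
    // Constant integer attribute: the shader declares it like any other
    // attribute ("in int a_parameterIndex;") and uses it to index the
    // parameter array ("params[a_parameterIndex]").
    glVertexAttribI1i(PARAMETER_INDEX_ATTRIB, sub.parameterIndex);
    glDrawElements(GL_TRIANGLES, sub.draw.indexCount, GL_UNSIGNED_INT,
                   (const void*)(sub.draw.firstIndex * sizeof(uint32_t)));
}
```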

Improvement 5 – Bindless resources
Changing texture state has until recently been a major headache when it comes to batching efficiently. Sure, it is possible to store several textures inside an array texture and then index into the different layers, but there are several limitations and it is generally a pain to work with. OpenGL requires the application to bind textures to texture slots prior to dispatching draw calls. Textures are merely CPU-side handles, like all other OpenGL objects, but the new extension ARB_bindless_texture allows the application to retrieve a unique 64-bit GPU handle that the shader can use to look up texture data without binding first. Unlike the CPU-side handles, these GPU handles can be stored in uniform buffers. GPU handles can be set like any other uniform using glUniformHandleui64, but it is strongly recommended to use UBOs (or similar – see Improvement 3). It is the application's responsibility to make sure textures are resident before dispatching the draw call. More information can be found in the extension spec here.
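Handle retrieval and residency management might look like this (the buffer name and UBO field offset are my own illustration):

```cpp
// Get a unique 64-bit GPU handle for the texture (ARB_bindless_texture).
GLuint64 handle = glGetTextureHandleARB(texture);

// The application must make the texture resident before any draw samples it.
glMakeTextureHandleResidentARB(handle);

// The handle can be stored in a UBO like any other value; in GLSL the field
// is declared as a sampler type when the extension is enabled.
glNamedBufferSubData(materialUbo, diffuseTextureFieldOffset,
                     sizeof(GLuint64), &handle);

// Once no in-flight draw uses the texture any more:
glMakeTextureHandleNonResidentARB(handle);
```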
Nvidia has an extension that allows bindless buffers as well – More information can be found here. This is something we will have a look at when looking at the new Nvidia commandlist extension in the next article.

Improvement 6 – The indirect draw commands
A new addition to the numerous ways to draw in OpenGL is the indirect draw commands. Rather than submitting each draw call from the CPU, it is now possible to store all the draw information inside a buffer, which the GPU then loops through when drawing. The buffer contains an array of predefined structures, which in the case of glMultiDrawElementsIndirect looks like this:
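The structure is defined by the OpenGL spec as five tightly packed 32-bit values (the GLuint typedef here is only a stand-in so the snippet is self-contained without GL headers):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t GLuint; /* stand-in for the GL header type */

/* Per-draw record read by glMultiDrawElementsIndirect (20 bytes, tightly packed). */
typedef struct {
    GLuint count;         /* number of indices to draw                      */
    GLuint instanceCount; /* number of instances (1 when not instancing)    */
    GLuint firstIndex;    /* offset into the index buffer                   */
    GLuint baseVertex;    /* value added to each fetched index              */
    GLuint baseInstance;  /* first instance id; also usable as user payload */
} DrawElementsIndirectCommand;
```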

Using an indirect draw command works much like the glMultiDrawElements call described in Improvement 2. An added benefit is that you can create your GPU work list directly on the GPU – you can, for example, cull your scene from a compute shader rather than on the CPU.

There is a special bind target for indirect buffers called GL_DRAW_INDIRECT_BUFFER. The driver reads the draw data from the buffer bound to it. It is illegal to submit an indirect draw call using client memory.
Using indirect draws, you no longer need a separate draw command for each sub-object in a material batch as described in Improvement 2. To draw efficiently, you only have to create a buffer filled with the structs that describe the ranges of the objects you wish to draw using the active shader. This can be a huge draw-command improvement. I have yet to test whether you get improved performance by growing the draw ranges by physically rearranging the vertex buffers.
Which material parameters and matrix to use when drawing each of the sub-objects can be handled much like in Improvement 4, through a matrix / material array index. However, the method is a bit different, as we are no longer able to set a generic vertex attribute between each drawn sub-object. The indirect struct contains a lot of information, not all of which we need to use – the baseInstance member, for example. By using it, we can communicate both the material and the matrix index, so the shader program can get the data it needs. How you choose to split the bits all comes down to how much you need to draw.
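One possible split, packing a matrix index into the lower 20 bits of baseInstance and a material index into the upper 12 (the 20/12 split is my arbitrary choice; the shader reads the value back via gl_BaseInstanceARB from ARB_shader_draw_parameters and unpacks it with the same shifts):

```c
#include <stdint.h>

/* Pack a matrix index (lower 20 bits) and a material index (upper 12 bits)
   into one 32-bit value stored in the baseInstance member. */
static uint32_t packIndices(uint32_t matrixIndex, uint32_t materialIndex)
{
    return (matrixIndex & 0xFFFFFu) | (materialIndex << 20);
}

static uint32_t unpackMatrixIndex(uint32_t packed)   { return packed & 0xFFFFFu; }
static uint32_t unpackMaterialIndex(uint32_t packed) { return packed >> 20; }
```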

Unfortunately, it is not yet possible to change state (renderstate and shaders) using the indirect draw commands. This is something I am going to look at in the next article on the Nvidia CommandList extension.

This post turned out to be bigger than I had first anticipated, but efficient drawing is tricky. If you made it this far – Good for you! I hope to get time to write the follow up article as soon as real life allows me.
