Olivier Therrien, programmer and researcher at CDRIN
August 27, 2018

Ray tracing is making its entry into games for the first time and is now pioneering the future of real-time graphics. This is really promising for the future of gaming, yet there is still a lot to figure out to make the best use of those rays and achieve good levels of performance.

Right now, the best strategy to maximize hardware performance is to take a hybrid approach: use rasterization for coherent information, like rendering the direct visibility from the camera in the form of a G-Buffer, and resort to ray tracing for incoherent information, like indirect bounces of light.

When doing classical path tracing, the primary rays are coherent and follow the camera frustum in a linear fashion, but the bounce rays are incoherent because they can reflect in any direction:

Incoherent rays: Primary rays (red), 1st bounce (green), 2nd bounce (blue)

The problem is that, even though ray tracing is much better than rasterization at rendering incoherent information, performance for incoherent rays is much lower than for coherent ones. This is mostly because the required data is not present in the GPU cache for all the rays, and because rays that do not hit the same point do not execute the same shader, which stalls a lot of threads. This is unfortunate, because ray tracing is interesting precisely because of its ability to handle incoherent scene traversal well.

This problem is also present in the movie industry, in high-end path-traced renderers, where it has been tackled with complex ray sorting algorithms. These classify rays by their position and orientation in the scene, then batch similar rays together to maximize the chances that they access similar textures, geometry or shaders, and to generate better memory access patterns. More details here. This kind of technique is very hard to implement efficiently for games though, especially in dynamic scenes, where the rays would need to be sorted on the fly at every frame.
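To make the idea of ray sorting more concrete, here is a much simplified CPU sketch. It is not taken from any production renderer: the `Ray` struct, the `coherence_key` function, the 32-cell grid and the cubic scene bounds are all assumptions made for illustration. Rays are classified by direction octant and by a coarse grid cell of their origin, then sorted so that similar rays end up next to each other.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical ray layout, for illustration only.
struct Ray {
    float origin[3];
    float direction[3];
};

// Hypothetical sort key: 3 bits for the direction octant (sign of each axis),
// followed by 5 bits per axis for a coarse grid cell of the origin.
// The scene is assumed to fit in the cube [scene_min, scene_max]^3.
std::uint32_t coherence_key(const Ray& ray, float scene_min, float scene_max) {
    assert(scene_max > scene_min);
    std::uint32_t key = 0;
    for (int axis = 0; axis < 3; ++axis) {
        key = (key << 1) | (ray.direction[axis] < 0.0f ? 1u : 0u);
    }
    for (int axis = 0; axis < 3; ++axis) {
        float t = (ray.origin[axis] - scene_min) / (scene_max - scene_min);
        t = std::min(std::max(t, 0.0f), 1.0f);  // clamp to the scene bounds
        const std::uint32_t cell =
            std::min(static_cast<std::uint32_t>(t * 32.0f), std::uint32_t{31});
        key = (key << 5) | cell;
    }
    return key;
}

// Reorder rays so that rays with similar direction and origin are adjacent.
void sort_rays(std::vector<Ray>& rays, float scene_min, float scene_max) {
    std::sort(rays.begin(), rays.end(), [&](const Ray& a, const Ray& b) {
        return coherence_key(a, scene_min, scene_max) <
               coherence_key(b, scene_min, scene_max);
    });
}
```

Even in this toy form, the sort has to run again whenever the set of rays changes, which hints at why the approach is costly for dynamic scenes.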

One observation that can be made is that rays for mirror reflections perform a lot better than those for indirect diffuse light bounces. This is because those rays follow a direction similar to that of the neighboring pixels: they are influenced only by the surface normal and the view direction, which usually vary smoothly across multiple pixels. Those rays do not reside in a strictly linear subspace, as is the case for rasterization, but they still have a certain degree of coherence and can often reuse data across pixels.

Can this behavior be reproduced for diffuse lighting? The way diffuse lighting is done normally is by using a random number at every pixel that will be used to generate a random ray. So every pixel will be exploring the scene in a random direction that fits within the hemisphere oriented towards the surface normal. When undersampled, this results in a noisy pattern like this one (Modified version of this shadertoy scene made by Reinder):

In OptiX, this type of noise is generated using the following seed:

unsigned int seed = tea<16>(screen.x*launch_index.x + launch_index.y, frame_number);
float r = rnd(seed);

This seed produces a noise pattern that varies depending on the pixel position, and that also changes when a new sample is taken, but is not animated over time. This way of doing things is pretty standard.
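For readers without the OptiX SDK at hand, the two helpers used above can be sketched on the CPU as follows. This is modeled on the `tea` and `rnd` helpers that ship with the OptiX SDK samples, written from memory, so the constants should be treated as assumptions rather than as the reference implementation. `pixel_seed` and `first_sample` are hypothetical wrappers added for illustration.

```cpp
#include <cassert>

// Tiny Encryption Algorithm used as a hash: N rounds mixing two 32-bit inputs.
template <unsigned int N>
unsigned int tea(unsigned int val0, unsigned int val1) {
    unsigned int v0 = val0, v1 = val1, s0 = 0;
    for (unsigned int n = 0; n < N; ++n) {
        s0 += 0x9e3779b9u;
        v0 += ((v1 << 4) + 0xa341316cu) ^ (v1 + s0) ^ ((v1 >> 5) + 0xc8013ea4u);
        v1 += ((v0 << 4) + 0xad90777du) ^ (v0 + s0) ^ ((v0 >> 5) + 0x7e95761eu);
    }
    return v0;
}

// Linear congruential step: advances the seed in place and
// returns a float in [0, 1) built from the low 24 bits.
float rnd(unsigned int& seed) {
    seed = 1664525u * seed + 1013904223u;
    return static_cast<float>(seed & 0x00FFFFFFu) /
           static_cast<float>(0x01000000u);
}

// Per-pixel seed, using the same formula as the snippet above.
unsigned int pixel_seed(unsigned int screen_width, unsigned int x,
                        unsigned int y, unsigned int frame_number) {
    return tea<16>(screen_width * x + y, frame_number);
}

// First random number drawn by a pixel for a given frame.
float first_sample(unsigned int screen_width, unsigned int x,
                   unsigned int y, unsigned int frame_number) {
    assert(x < screen_width);  // x is a column index
    unsigned int seed = pixel_seed(screen_width, x, y, frame_number);
    return rnd(seed);
}
```

Because the pixel index is hashed into the seed, neighboring pixels start from unrelated seeds and therefore draw unrelated random numbers, which is what makes the bounce rays incoherent.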

But the rays can be generated differently, by using a seed that is coherent across pixels. The direction vector is still generated to fit within the hemispherical space of the surface, so not every ray will point in the same direction: the rays are still influenced by the underlying surface normal and BRDF.

Coherent rays: Primary rays (red), 1st bounce (green), 2nd bounce (blue)

This will result in patchy, structured noise like the following:

Instead of classic noise, we get structural noise. Note that neither image is more accurate than the other: the actual variance in the image is the same, the error is just distributed differently. Both noise patterns will eventually converge to the same result with a sufficient sample count. That being said, structural noise is much more noticeable to us humans, mainly because of this.

This requires only a small modification of the seed:

unsigned int cseed = tea<16>(screen.x, frame_number);
float cr = rnd(cseed);
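The sketch below illustrates why a shared random number still produces different rays at different pixels. It assumes a cosine-weighted hemisphere distribution, which the article does not specify, and every name in it (`Vec3`, `make_basis`, `sample_hemisphere`) is hypothetical. The same pair of random numbers is mapped into the hemisphere around each pixel's own normal, so the resulting direction follows the normal.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

Vec3 normalize(const Vec3& v) {
    const float inv_len = 1.0f / std::sqrt(dot(v, v));
    return {v.x * inv_len, v.y * inv_len, v.z * inv_len};
}

// Build a tangent and bitangent perpendicular to the unit normal n.
void make_basis(const Vec3& n, Vec3& t, Vec3& b) {
    // Pick a helper axis that is not parallel to n.
    const Vec3 helper = std::fabs(n.x) > 0.9f ? Vec3{0.0f, 1.0f, 0.0f}
                                              : Vec3{1.0f, 0.0f, 0.0f};
    t = normalize(cross(helper, n));
    b = cross(n, t);
}

// Cosine-weighted direction around the unit normal n,
// driven by two numbers u1 and u2 in [0, 1).
Vec3 sample_hemisphere(const Vec3& n, float u1, float u2) {
    assert(u1 >= 0.0f && u1 < 1.0f && u2 >= 0.0f && u2 < 1.0f);
    const float r = std::sqrt(u1);
    const float phi = 2.0f * 3.14159265f * u2;
    const float x = r * std::cos(phi);
    const float y = r * std::sin(phi);
    const float z = std::sqrt(std::fmax(0.0f, 1.0f - u1));
    Vec3 t, b;
    make_basis(n, t, b);
    return {x * t.x + y * b.x + z * n.x,
            x * t.y + y * b.y + z * n.y,
            x * t.z + y * b.z + z * n.z};
}
```

With identical `u1` and `u2`, two pixels with nearly identical normals get nearly identical directions, while a pixel on a differently oriented surface gets a clearly different one. That is the same kind of coherence mirror reflections have.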

In terms of performance, the coherent approach is much better. All the following benchmarks were run on a GTX 980, using OptiX 5 as the ray tracing API. I’m not sure that this version of the library is the best fit for this card; it might be fine-tuned for more recent hardware. Also, it’s hard to tell at the moment to what extent DXR performance will match that of OptiX 5, or what underlying optimizations will be supported by RTX hardware, but it seems reasonable to assume that the performance ratio between coherent and incoherent rays will remain comparable to current hardware in the near future.

Rendering the following scene at 1920×1080 with direct rays + one incoherent bounce takes 375 ms:

Rendering the same thing with direct rays + one coherent bounce takes between 215 and 315 ms (average of 263 ms):

So it’s 1.43x faster on average for the two ray types combined. Notice that there is much more fluctuation in this benchmark, because the rays hit different parts of the scene at each frame, some of which take less work to trace than others. When rays are incoherent, they are much better distributed, so the cost averages out better within a single frame.

Leaving out the direct ray hit (108 ms), the coherent vs incoherent result is 155 ms vs 267 ms, for a speedup of 1.72x on the bounce alone.
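The arithmetic behind these two ratios can be written out as a small sketch. The helper names are hypothetical; the timings are the ones quoted above.

```cpp
#include <cassert>

// Time spent on the bounce alone, once the direct ray hit is subtracted.
double bounce_only_ms(double total_ms, double direct_ms) {
    return total_ms - direct_ms;
}

// Ratio between a baseline timing and an improved timing.
double speedup(double baseline_ms, double improved_ms) {
    assert(improved_ms > 0.0);
    return baseline_ms / improved_ms;
}

// Both ray types combined: 375 ms incoherent vs 263 ms coherent on average.
double combined_speedup() {
    return speedup(375.0, 263.0);
}

// Bounce only: (375 - 108) ms vs (263 - 108) ms, i.e. 267 ms vs 155 ms.
double bounce_speedup() {
    return speedup(bounce_only_ms(375.0, 108.0), bounce_only_ms(263.0, 108.0));
}
```

The bounce-only ratio is larger because the 108 ms of direct rays is identical in both runs and dilutes the gain when it is included.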

This is interesting to note, but the structural noise is so noticeable that it’s hard to consider it a win. What if we could blend the two approaches together? Interpolating the seeds directly will generate a new random value, and we will lose all the coherence, so it’s not an option. One solution might simply be to lerp between the totally coherent random number, and totally incoherent one, and this gives an adjustable coherence ratio between 0 and 1. When blending the two, we get the best of both worlds, because it adds a bit of noise on the structural noise, and acts a bit like dithering would, removing the banding effect.

A coherence ratio of 97% would give the following result:

This is the code in OptiX:

float lerped_r = lerp(rnd(cseed), rnd(seed), (1.0f-coherence)); // BIASED

The problem is that the code above introduces some bias in the image. This will give correct results for coherence values of 0 or 1, but all values in-between will be wrong and this will alter the sampling result:

So this is far from perfect, but at least the bias is not very noticeable when using high coherence values like 97%. To make sure to respect the probability distribution, it’s possible to alternate between the two approaches temporally like the following:

float blended_r = rnd(seed) < coherence ? rnd(cseed) : rnd(seed);
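The difference between the two blends can be checked numerically on the CPU. The sketch below is an illustration, not the article's benchmark code: it uses `std::mt19937` as a stand-in for `rnd`, treats the coherent and incoherent numbers as independent uniform values, and the names `Blend`, `Stats` and `blend_stats` are hypothetical. It measures the mean and variance of the blended number. A uniform number on [0, 1) has a variance of 1/12, about 0.083.

```cpp
#include <cassert>
#include <random>

struct Stats {
    double mean;
    double variance;
};

enum class Blend { Lerp, Select };

// Estimate the mean and variance of the blended random number.
Stats blend_stats(Blend mode, double coherence, int samples,
                  unsigned int rng_seed) {
    assert(coherence >= 0.0 && coherence <= 1.0 && samples > 1);
    std::mt19937 gen(rng_seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double sum = 0.0, sum_sq = 0.0;
    for (int i = 0; i < samples; ++i) {
        const double coherent = uniform(gen);    // stands in for rnd(cseed)
        const double incoherent = uniform(gen);  // stands in for rnd(seed)
        double value;
        if (mode == Blend::Lerp) {
            // Same formula as lerp(a, b, t) = a + t * (b - a), t = 1 - coherence.
            value = coherent + (incoherent - coherent) * (1.0 - coherence);
        } else {
            // Pick one of the two numbers at random instead of mixing them.
            value = uniform(gen) < coherence ? coherent : incoherent;
        }
        sum += value;
        sum_sq += value * value;
    }
    const double mean = sum / samples;
    return {mean, sum_sq / samples - mean * mean};
}
```

Under these assumptions the lerp has variance (c² + (1 − c)²) / 12 for a coherence c, which is below 1/12 for any c strictly between 0 and 1, so the blended number is no longer uniform and samples bunch toward the middle of the range. Selecting one of the two numbers keeps the distribution uniform for every value of c.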

In terms of performance, we still retain some of the speed gains, with an average of 296 ms total and 188 ms for the indirect bounce only, for a speedup of 1.42x on the bounce. That is less than with 100% coherence, but still not bad:

This is also closely tied to the complexity of the scene and the size of the BVH, so the performance gap will likely be wider on bigger scenes.

In terms of quality, the perceptual result is that there appears to be less noise. This is a static image though; if the camera were to move, changes would be visible in the structural noise, making it look less stable temporally. This type of structural noise is a bit harder for TAA to eliminate. Still, it’s interesting to see the effect this small change to the random number has on performance.

Part 2 of this article is here.
