Today, after nearly three years of research and more than a dozen rewrites, I finally present a little side project of mine:

https://github.com/rawrunprotected/rasterizer

This is state-of-the-art software occlusion culling system; similar to Intel's Software Occlusion Culling, but better optimized. By default, the application loads and renders Crytek's Sponza mesh. Camera controls are simple WASD to move and cursor keys to control the camera. The current version requires AVX2 support, so you'll need a modern CPU to run it; although downgrading to AVX or SSE4 should be straightforward if needed at some point.

Here's a quick overview of some of the techniques the rasterizer makes use of:

## Clipless rasterization

Clipless or homogeneous rasterization, is not exactly a new idea. The basic principle is that the usual perspective division by W after vertex transformation is ommitted and the triangle edge equations are directly projected to 2D space, thus saving a division, and more importantly, clipping against the near plane and guard band. A nice side-effect is that no near plane is required anymore.

However, implementing this naively brings more problems at first than it solves: Perspective division brings vertex coordinates and thus edge equations into pixel space, where they can be rounded to fixed-point coordinates with well-defined error behavior. Ommitting this step forces the rasterizer to either operate in floating point with a ridiculously high number of mantissa bits, or use some other magic trick to ensure the rasterization result remains artifact-free, as a small change to the edge equations can result in large deviations from the ground truth. Besides, without having explicit 2D vertex positions, there is no explicit bounding box, making it difficult to limit the screen area that needs to be scanned.

As it turns out, neither magic nor high-precision floats are required after all: Instead, the practical approach to clipless rasterization is to simply clamp the W component away from zero to avoid overflow and perform the perspective division after all. Afterwards, rounding of the vertex coordinates and edge equation computation can be performed as usual; although the edge equations need to be sign-inverted once for each vertex with W < 0. Similarly, the backface culling test result needs to be flipped for each vertex behind the camera. Computing the bounding box is somewhat tricky, as one needs to determine the bounding box for all vertices with W > 0 and all with W < 0 separately, and extend the former towards infinity depending on its relation to the latter.
As this introduces some overhead compared to the standard triangle setup, I've templated the rasterization routine to only perform the extra steps for W < 0 if a batch of triangles is close to the camera plane. Search for `possiblyNearClipped`

in Rasterizer.cpp to see where special handling is required.

## Edge mask precomputation

Another interesting idea is precomputing the coverage masks for each 8x8 pixel block. This approach is easily explained: For each edge equation, the slope and offset at the center of a pixel block are quantized, and then looked up in a precomputed table to yield a 64bit coverage mask. By ANDing together the masks for each edge, one finds the mask of all pixels that are inside the triangle. Since quantization invariably modifies the edge equation's slope, this results in minor "bristle" artifacts for thin triangles. For a graphical application, these would be unacceptable, however, as these bristles are no more than a pixel wide and don't extend across block boundaries, there's little risk of incorrect over-occlusion, so they can be safely ignored for this rasterizer's intended use case.

The lookup table is sufficiently small to not have any significant impact on cache at only 32KB, with an upper bound of 2.5KB being actually sampled for each pair of triangles in the worst case. Curiously, the lookup should be a perfect use case for AVX2's gather instruction set, however performing the lookup in scalar code turned out to be invariably faster on a SkyLake CPU. FMA on the other hand, the other major feature coming with AVX2-capable CPUs, brings a solid 20% performance gain from speeding up vertex transformation.

## Vertex buffer compression

The major requirement for any form of vertex buffer compression is that multiple occurences of a vertex encode and decode to the exact same floating point value. This is both sufficient and necessary to ensure that the compression step doesn't introduce any holes in the mesh. As variable-length compression schemes rarely play nice with SIMD, I've chosen to encode vertices straightforwardly as 11-11-10 fixed point format, relative to the models bounding box. This cuts occluder size to a third, compared to storing uncompressed vertex data. Since the rasterizer bakes the bounding box transform into the model view projection matrix, and the integer masking operation during decoding can be pipelined with the matrix multiplication, decompression is practically zero-cost.

Note that using an indexed representation doesn't provide any speedup here: Contrary to the GPU case, the rasterizer is neither bound by vertex fetch bandwidth nor processing cost, and the overhead of gathering via the index buffer managing a cache in software is fairly large. In addition, having a flat linear input stream allows perfect usage of streaming load instructions to prevent cache pollution.

## Rasterizing quads instead of triangles

By rasterizing quadrilaterals instead of triangles, setup cost and overdraw are significantly reduced. Unfortunately, one can't rely on coplanar pairs of triangles remaining coplanar after vertex quantization, so the rasterizer expliticly deals with quads becoming non-convex after projection by generating the coverage mask for each half quad separately to prevent introducing holes. Triangles that can't be merged to quads are simply handled by duplicating one of the vertices, effectively degenerating one of the quad edges. Despite this overhead, rendering quads is overall 20-30% faster.

## Depth buffer compression

Typical GPU depth buffer compression schemes conserve bandwidth by attempting to store only each triangle's depth derivatives and a per-pixel index mask for a fixed block size, rather than the full interpolated depth values. This doesn't map well to a software implementation, so instead the rasterizer uses a customized 16bit minifloat format to reduce the depth buffer size.

The minifloat format has 4 exponent and 12 mantissa bits, with no sign bit; with decoded values ranging from 0 to 1. The trick for quickly encoding into this format is first rescaling an input 32bit float with 2.5237386e-29f (which has the bit pattern 0xFFFF << 12), and then right-shifting by 12 bits. This maps 0.0 to the bit pattern 0x0000, and 1.0 to the pattern 0xFFFF, with an exponential distribution in between. By making use of SSE's saturating packing instruction, one also gets well-defined clamping behavior for values outside of the [0, 1] range. The rescaling value is simply baked into the projection matrix, which means that 8 depth values can be encoded extremely quickly with only 3 integer operations. Combined with an inverted depth range with near at 1.0f and far at 0.0f, so that the denser floating point distribution around 0 counteracts the perspective error, this minifloat actually has a significantly better worst-case depth precision for typical near/far ratios than a 24bit fixed point depth buffer.

## And everything else

Besides these high-level tricks, there are tons of smaller bits and pieces to improve performance, such as computing z from its linear relation with 1/W; or rounding vertex coordinates to force zero-area culling even though the rasterizer doesn't run in fixed point. The bulk of the optimization effort however is completely invisible: Several hundred hours have been spent on cycle shaving, reordering to improve pipelining, or running assembly through IACA to minimize instruction latencies.

Finally, a significant contribution usually comes from proper preprocessing of the geometry. In a typical game application, you would use a simplified version of the scene geometry. This is out of scope for this sample, though, so the preprocessing is limited to splitting the scene into batches according to the surface area heuristic, and clustering by normal vectors to improve backface culling coherency. The end result is still the full, unsimplified render geometry, just with a little bit of reordering.

I still haven't decided on an open-source license yet; feel free to ping me if you have suggestions.