Software Rasterizer Update

I'm happy to announce that my software rasterizer side project has been picked up by a major commercial VR product a while ago. While I can't go into too much detail, adding in the rasterizer gave an average frame time saving of 1ms on the CPU and GPU each, which was critical to shipping the product on the targetted min spec configuration. There's some unique challenges in using this in practice which I'm going to address as part of this post, as well as going into some details regarding the next release of the rasterizer which brings a ~1.5x speedup compared to the initial release. Because real life has forced me to delay this post a bit, it covers both commits https://github.com/rawrunprotected/rasterizer/commit/d2fa65fd6b5df8b94d8d4b02914354c9aca2d46d and https://github.com/rawrunprotected/rasterizer/commit/ca469a402a65341a37c692ed06b4d355fe6d5494, even those are almost a year apart.

Over-occlusion in VR

Using software occlusion culling in 3D graphics typically involves some amount of false negatives; i.e. queries returning that an object is occluded even though it should be visible on screen. The reason for that is that software rasterization is generally performed on a lower resolution than the target render resolution. Unless the rasterizer is written "to spec" and matches the behavior of the hardware graphics API exactly, it may also introduce additional error through differences in vertex quantization or edge test precision.

This isn't much of a problem in practice for flat-screen applications as long as the incorrect over-occlusion can be kept below some threshold, say 5 pixels, which is barely noticeable to the user. In VR however, lens distortion implies that angular resolution is much lower, particularly in the center of the screen. This makes even small over-occlusion artifacts very noticable. Another problem is that it's not practical to perform occlusion for both eyes each frame, and it is easy to construct cases where culling only for one eye or a synthetic camera position centered between the eyes fails completely. Applying occlusion culling naively led to some user complaints in the initial beta release of the product as there were some geometric constellations which would result in noticable artifacts.

A simple approach to both of these issues is to apply a simple form of temporal filtering. For monoscopic rendering, applying a small amount of sub-pixel camera jitter helps break quantization errors. The jitter pattern repeats over a small period N (e.g. 4 frames), and an object is only considered culled when occlusion queries fail for all N of these frames. The standard MSAA sample distributions work well in practice as jitter patterns. This is fairly cheap to implement by means of a simple counter per object. While it has the disadvantage of introducing a small delay before objects can be culled, it solves the false negative problem for all practical purposes even when downscaling the target resolution by a factor of 5 in both X and Y.

The same can then be extended to stereoscopic rendering, by culling against left and right eye in alternating frames; doubling the effective length of the jitter period. As this is effectively fully external to the usage of the rasterizer and I'm not at liberty to publish the code anyway, I won't include the temporal filtering approach in the public release.

Another source of errors occurs when the camera moves close enough to near-clip the geometry. While this should ideally be prevented on a higher level as in can result in comfort issues in VR, it produces artifacts as the rasterizer does not perform near-clipping, which allows the user to peek through geometry and see objects flickering in and out of existence. As full emulation of near-clipping is infeasible for performance reasons, a simple hack solves this problems: Pixels which are in front of the near plane are clamped to depth values of 0xFFFF by the rasterizer, so simply adding 1 to each value forces these pixels to wrap around to the far plane, effectively discarding them. The near resulting clipping edge doesn't fully match the GPU result, but is good enough to avoid major discrepancies.

Now for the optimizations in the latest version:

Triangle setup optimizations

The first version of the rasterizer required AVX2, but performed vertex transform and triangle setup only at SSE widths (i.e. 4 in parallel). This has been extended to the full AVX vector width so that 8 triangles are set up in parallel, and the vertex packing format has been adapted accordingly. This results in a rather modest performance increase as it reduces the average utilization per vector, since it's much less likely to discard an entire batch of triangles via back face culling, but it's a net win overall.

An interesting trick was to remove most of the masking on decoding vertices from the packed 11/11/10 vertex format which shaves of a few instructions in the vertex transform portion. For transforming the X component, rather than right-shifting to remove the Y and Z component, I'm now applying the equivalent scale of 1.0f / (1u << 21) as part of the projection matrix. This result in some bleedover of the Y and Z components that were previously masked out, into the X component, so the meshes end up slightly skewed. This effect however can be cancelled by applying the inverse skew to the projection matrix. Making use of the _mm256_broadcast_ss allows keeping most of the projection matrix in memory, which reduces register pressure during the vertex transformation. The rest of the triangle setup code has received a bit of cycle shaving by moving around constant factors, so that the entire portion that translates the edge equations into block space is now removed.

Advanced transposition

A side effect of the wider triangle setup is that transposing the data from triangle-parallel to pixel-parallel layout becomes much harder, as AVX shuffle operations typically operate on each sub-lane of 4 elements separately. This implies having to extract and store each lane to memory separately. However, we can also restrict the shuffle operations to be incomplete, and store data to memory without the full sequence of operations, which effectively changes the order of triangles within each batch. Modifying the primitive index lets us iterate over the triangles in each batch in a changed order, but saving quite a few instructions for transposition.

Fully unrolled coverage mask generation

Similar to the triangle setup, the critical inner loop code was restricted to 4-wide SSE. The main obstacle to widening this was the fact that the _mm256_packus_epi32 operation did not operate on 8 lanes in parallel, resulting in some unintentional interleaving. This was fixed by pre-applying a permutation to each rasterization table entry, allowing the transfer from a single 64bit mask to 4 16x16bit depth masks in a few instructions.

 

Fast clears

This is a simple optimization employed by virtually every hardware rasterizer. Rather than resetting the full-resolution depth buffer to the far plane, we only mark blocks as cleared, which saves significant amounts of bandwidth. Before the previously mentioned unrolled coverage mask generation, this didn't yield much benefit, but now allows squeezing out a few extra percentage points. Since we can't afford an extra buffer to store tile metadata, fast cleared blocks are marked in the HiZ buffer with the value 1. This value is extraordinarily unlikely to be generated during normal operation (compared to using HiZ == 0 which happens all the time), and close enough to 0 so we can ignore falsely culled blocks.

Effectively this change allows us to skip the prior depth load for a large proportion of blocks at the cost of an additional branch for an overall net win. It requires some extra complexity for reading back full depth, but since that is intended for debug visualization only, we can ignore this impact.

 

Complete classification of rasterized quadrilaterals

Now for the meat of the actual original research work. As you may be aware, my rasterizer works with quads rather than triangles (which gives a net benefit as most meshes are authored as quads, which is trivial to recover from a trangle representation).

While convex quadrilaterals are mostly trivial to handle by simply adding a 4th edge equation, this is not the case as soon as non-convex or non-simple vertex quadrilaterals turn up. Until now, the rasterizer handled this with a 5th edge, which splits non-convex quads into two triangles, which are then composed separately. This occurs quite frequently in practice, even if the source data is fully comprised of convex quads, as vertex quantization and rounding errors during perspective projection can easily make any geometry a mess of concave or self-overlapping quads.

Unfortunately, this adds significant overhead for the 5th edge, as well has handling some cases incorrectly. To rectify this, I set out to systematically investigate the properties of possible quads. By exhaustively searching over all possible configurations, I found a total of 25 distinct vertex configurations (rather than the over-simplified 2 cases I handled before). The program I wrote for this breaks my personal complexity record - it is O(\epsilon^{-10}) in a search error of \epsilon, with a total nesting depth of 14 loops.

Describing the complete classification is out of scope for this post, but the important classes are the following ones:

  • Culled - occurs for example if both triangles of the quad are backfacing, or in front of the W plane, or a mix of these.
  • Convex - the obvious simple case.
  • Triangle 0 or triangle 1 culled - the quad is "folded over" so that a single sub triangle is backface culled.
  • Concave left or concave right - both sub triangles are forward facing, but with an angle > 180 degrees at either connecting vertex.
  • Concave center - some vertices are in front of the camera plane so that the remaining edges form a sort of inifinite W shape, which is extremely rare (< 1 in 1e8 triangles).

All other classes are specialization but can be represented as subsets of these 7. Notably, each of these cases requires only 3 or 4 edge equations to evaluate the covered area, rather than the 5 previously used.

As each configuration can be described only by the signs of the areas of each sub triangle (4 bits including complementary areas) as well as the signs of the W component of each vertex (another 4 bits), a simple lookup table with 16 resp. 256 entries (if any W < 0) allows correctly categorizing each primitive. This lookup is expensive due to the high latency of the gather operation, but allows handling the backface culling and near plane culling tests at the same time, both of which are simply represented by specific entries in the LUT.

Inside the inner-most loop we need to switch over the primitive configuration to determine how to combine the bit masks for individual edge equations. Luckily, the distribution is heavily skewed, with convex primitives being much more likely than the single triangles, which in turn are an order of magnitude more likely to occur than the concave cases. Specializing heavily for the convex case wasn't possible before due to the incorrect handling of some cases, but now yields a significant performance gains. For example, we can now skip evaluation of the edge equations for the rest of a scan line once we enter and leave the covered primitive area.

2D Hilbert curves in O(1)

...sort of.

The core of computing 2D coordinates from a Hilbert index in logarithmic time is a XOR prefix sum. As it so happens, the upper bits of carry-less multiplication with -1 are equivalent to a XOR prefix sum. On an x86 machine with CLMUL and PEXT support, this yields the following "constant time" algorithm for mapping a 32bit Hilbert index to 16bit X and Y:

void hilbertIndexToXY_constantTime(uint32_t i, uint32_t& x, uint32_t& y)
{
	uint32_t i0 = _pext_u32(i, 0x55555555);
	uint32_t i1 = _pext_u32(i, 0xAAAAAAAA);

	uint32_t A = i0 & i1;
	uint32_t B = i0 ^ i1 ^ 0xFFFF;

	uint32_t C = uint32_t(_mm_extract_epi16(_mm_clmulepi64_si128(_mm_set_epi64x(0, 0xFFFF), _mm_cvtsi32_si128(A), 0b00), 1));
	uint32_t D = uint32_t(_mm_extract_epi16(_mm_clmulepi64_si128(_mm_set_epi64x(0, 0xFFFF), _mm_cvtsi32_si128(B), 0b00), 1));

	uint32_t a = C ^ (i0 & D);

	x = (a ^ i1);
	y = (a ^ i0 ^ i1);
}

Of course this isn't really a constant time solution since the CLMUL intrinsic is limited to 64bit operands, and can't be extended to operate on n bits in O(1) time in any reasonable machine model. In practice, this solution also tends to be a hair slower than the logarithmic time solution since both transferring data into/out of SSE registers and the CLMUL intrinsic itself are rather high latency operations.

3D Hilbert curves in even fewer instructions

So far, the LUT based algorithm is the fastest method in general for mapping between Morton and Hilbert orders in 3D. This method performs linearly in the input size, consuming 3 bits of input and producing 3 bits of output per iteration. Let's break it down into 5 logical steps:

  • Extract 3 bits from input: uint32_t input = (in >> i) & 7;
  • Combine input bits with current transform: uint32_t lookupArg = transform | input;
  • Perform lookup: uint32_t lookupRes = lookupTable[lookupArg];
  • Append output bits: out = (out << 3) | (lookup & 7);
  • Extract new transform: transform = lookup & ~7;

Each of these steps is essential to the algorithm, and there's no obvious way to optimize each individual operation, such as masking and shifting bits. I'm also dismissing the obvious approach here of consuming more input bits per iteration, since that increases the size of the LUT exponentially.

To break the barrier of 5 (logical) instructions per iteration, we need to resort to a few tricks. As we need to shift 3 bits of input and 3 bits of output per iteration, let's try to combine these operations into one. On x86, the family of ror/rol instructions supports rotating shifts where bits shifted out the end of the register are inserted on the other end. This allows us to shift state around in a register without losing bits that are pushed out on one end.

As a rough sketch, let's consider then storing the entire state a single state register, with the input bits initially aligned to the MSB. By left-rotating by 3 we can shift 3 bits of input into the lowest bits of the state, where they are conveniently located for lookup. At the same time, this rotation pushes 3 output bits from the previous iteration to a higher position in the state register. At the end, all input bits are consumed, and we're left with the finished output in the high bits of our register.

Since the high bits of our state are used, we need to mask out the low bits (containing 3 bits of input and the current transform) to perform the lookup. However we can do this without any explicit bitmasking operation if the state register is a 64bit integer by casting to a 32bit integer. On x86 this amounts to move of the lower half of a register, which is essentially free as it's handled by the renaming engine in the front end. This also implies that our LUT needs to contain the 3 output bits at bit offsets 31 through 29, so that the next left rotation moves them into the upper 32 bits, keeping the lower 32 free for the LUT argument.

Finally, we need to clear the transform bits out of the state register and update them with the next transform. By pre-XORing each LUT entry with its argument, and XORing the state with the LUT result, we can clear the previous argument and replace it with the next transform at the same time.

Combined, these ideas yield the following algorithm:

static const uint32_t mortonToHilbertTable[] = {
    0x00000006, 0x20000005, 0x60000001, 0x40000007, 0xe0000001, 0xc000000c, 0x80000005, 0xa000000e,
    0x40000000, 0xa000000a, 0x6000000c, 0x8000000d, 0x20000004, 0xc000000e, 0x00000007, 0xe0000008,
    0x80000019, 0xe000001a, 0x6000001b, 0x00000010, 0xa0000012, 0xc0000013, 0x4000001c, 0x2000001d,
    0x4000001a, 0x60000019, 0x20000018, 0x00000011, 0xa000001b, 0x8000001d, 0xc0000019, 0xe000001e,
    0x00000020, 0x60000026, 0xe000002a, 0x80000024, 0x20000022, 0x40000023, 0xc000002c, 0xa000002d,
    0x80000022, 0x60000023, 0xa000002a, 0x40000020, 0xe0000025, 0x0000002a, 0xc000002e, 0x20000024,
    0x00000034, 0xe0000033, 0x20000032, 0xc0000038, 0x60000035, 0x80000034, 0x40000036, 0xa000003c,
    0x4000003d, 0x2000003c, 0xa000003b, 0xc000003a, 0x60000038, 0x00000036, 0x8000003a, 0xe000003c,
    0x8000004b, 0xa0000045, 0xe0000048, 0xc0000047, 0x6000004f, 0x4000004c, 0x00000047, 0x2000004e,
    0xc000004d, 0xa000004c, 0x2000004b, 0x4000004a, 0xe000004c, 0x8000004f, 0x00000046, 0x6000004d,
    0xc0000058, 0x20000052, 0xe0000056, 0x00000051, 0xa000005c, 0x40000056, 0x80000053, 0x60000052,
    0xc000005a, 0xe000005f, 0xa0000058, 0x80000053, 0x2000005b, 0x00000058, 0x40000059, 0x60000057,
};

static const uint32_t hilbertToMortonTable[] = {
    0x00000006, 0x20000005, 0x60000006, 0x40000000, 0xc0000007, 0xe000000c, 0xa000000f, 0x80000002,
    0xc0000001, 0x80000001, 0x00000002, 0x4000000d, 0x6000000a, 0x2000000e, 0xa000000d, 0xe0000008,
    0x60000013, 0xe000001b, 0xc0000018, 0x4000001a, 0x0000001d, 0x80000013, 0xa0000010, 0x2000001c,
    0x60000012, 0x4000001b, 0x00000018, 0x2000001b, 0xa000001c, 0x8000001a, 0xc0000019, 0xe000001e,
    0x00000020, 0x80000027, 0xa0000024, 0x20000024, 0x60000023, 0xe000002f, 0xc000002c, 0x4000002f,
    0xa000002f, 0xe0000022, 0x60000021, 0x20000021, 0x00000026, 0x4000002d, 0xc000002e, 0x80000026,
    0x00000034, 0x40000031, 0xc0000032, 0x80000032, 0xa0000035, 0xe000003e, 0x6000003d, 0x20000035,
    0xa0000033, 0x2000003c, 0x0000003f, 0x8000003f, 0xc0000038, 0x4000003c, 0x6000003f, 0xe000003c,
    0xc0000041, 0xe0000048, 0xa000004b, 0x80000048, 0x0000004f, 0x20000041, 0x60000042, 0x4000004d,
    0xc0000040, 0x40000048, 0x6000004b, 0xe0000049, 0xa000004e, 0x20000048, 0x0000004b, 0x8000004f,
    0x60000052, 0x20000052, 0xa0000051, 0xe0000056, 0xc0000051, 0x8000005d, 0x0000005e, 0x40000053,
    0xa000005d, 0x8000005e, 0xc000005d, 0xe0000053, 0x60000054, 0x4000005f, 0x0000005c, 0x20000059,
};

uint32_t transformCurve(uint32_t in, uint32_t bits, const uint32_t* lookupTable)
{
    uint64_t x = _rotr64(in, 3 * bits);

    for (uint32_t i = 0; i < bits; ++i)
    {
        x = _rotl64(x, 3);
        x ^= lookupTable[uint32_t(x)];
    }

    return x >> 29;
}

uint32_t mortonToHilbert3D(uint32_t mortonIndex, uint32_t bits)
{
    return transformCurve(mortonIndex, bits, mortonToHilbertTable);
}

uint32_t hilbertToMorton3D(uint32_t hilbertIndex, uint32_t bits)
{
    return transformCurve(hilbertIndex, bits, hilbertToMortonTable);
}

This method is faster than the previous implementation by about 30%. However, it only works on the x64 architecture, increases the size of the LUT to 4 bytes per entry, and is limited to 30bit inputs, so unfortunately it's not quite as useful in general.

3D Hilbert curves in O(log(n)) optimized

Finally I've managed to implement an optimized version of the 3D Hilbert Index -> XYZ transform in logarithmic time. The crucial element was finding an efficient bitwise representation of the alternating group A4. In the end I decided for a rather direct representation as a subgroup of S4, describing each permutation as a 4-tuple (a,b,c,d). Multiplication in this form (implementing 32 operations at the same time bitwise SIMD style) is then a simple bit selection:

struct S4x32
{
  uint32_t a1;
  uint32_t a0;

  uint32_t b1;
  uint32_t b0;

  uint32_t c1;
  uint32_t c0;

  uint32_t d1;
  uint32_t d0;
};


S4x32 mul(S4x32 x, S4x32 y)
{
  S4 res;

  // If x.a == 0, let res.a = y.a; else if x.a == 1, let res.a = y.b; else if x.a == 2, let res.a = y.c, etc.
  res.a0 = (y.a0 & ~x.a0 & ~x.a1) | (y.b0 & x.a0 & ~x.a1) | (y.c0 & ~x.a0 & x.a1) | (y.c0 & x.a0 & x.a1);
  res.a1 = (y.a1 & ~x.a0 & ~x.a1) | (y.b1 & x.a0 & ~x.a1) | (y.c1 & ~x.a0 & x.a1) | (y.c1 & x.a0 & x.a1);
  res.b0 = (y.a0 & ~x.b0 & ~x.b1) | (y.b0 & x.b0 & ~x.b1) | (y.c0 & ~x.b0 & x.b1) | (y.c0 & x.b0 & x.b1);
  res.b1 = (y.a1 & ~x.b0 & ~x.b1) | (y.b1 & x.b0 & ~x.b1) | (y.c1 & ~x.b0 & x.b1) | (y.c1 & x.b0 & x.b1);
  res.c0 = (y.a0 & ~x.c0 & ~x.c1) | (y.b0 & x.c0 & ~x.c1) | (y.c0 & ~x.c0 & x.c1) | (y.c0 & x.c0 & x.c1);
  res.c1 = (y.a1 & ~x.c0 & ~x.c1) | (y.b1 & x.c0 & ~x.c1) | (y.c1 & ~x.c0 & x.c1) | (y.c1 & x.c0 & x.c1);
  res.d0 = (y.a0 & ~x.d0 & ~x.d1) | (y.b0 & x.d0 & ~x.d1) | (y.c0 & ~x.d0 & x.d1) | (y.c0 & x.d0 & x.d1);
  res.d1 = (y.a1 & ~x.d0 & ~x.d1) | (y.b1 & x.d0 & ~x.d1) | (y.c1 & ~x.d0 & x.d1) | (y.c1 & x.d0 & x.d1);

  return res;
}

This is a lot less code than the autogenerated transform code of the original proof of concept implementation. Now there's some further symmetries that we can immediately exploit. Since there are no duplicate values in a permutation tuple, the last element is uniquely determined by the remaining ones, via d = a ^ b ^ c, and we can get rid of two components.

Since we're in the subgroup A4, the third element also happens to be uniquely determined as we only allow even permutations. After some logic optimizations, we arrive at the rather compact

struct A4x32
{
  uint32_t a1;
  uint32_t a0;

  uint32_t b1;
  uint32_t b0;
};

A4x32 mul(A4x32 x, A4x32 y)
{
  A4x32 res;

  uint32_t yc1 = y.a0 ^ y.b0 ^ y.b1;
  uint32_t ys = y.a1 ^ y.b1;
  uint32_t yc0 = ~((ys & y.a0) ^ (~y.s & y.b0));

  res.a0 = (y.a0 & ~(x.a0 ^ x.a1)) ^ (y.b0 & x.a0) ^ (yc0 & x.a1);
  res.a1 = (y.a1 & ~(x.a0 ^ x.a1)) ^ (y.b1 & x.a0) ^ (yc1 & x.a1);
  res.b0 = (y.a0 & ~(x.b0 ^ x.b1)) ^ (y.b0 & x.b0) ^ (yc0 & x.b1);
  res.b1 = (y.a1 & ~(x.b0 ^ x.b1)) ^ (y.b1 & x.b0) ^ (yc1 & x.b1);

  return res;
}

With this key piece, implementing the remaining function is straightforward. After prefix scanning, we only need to bring the transforms back into a representation that allows us to efficiently transform each octant in 3D. Luckily it's relatively easy to convert from the representation as a permutation tuple to a 3D transformation matrix. Putting it all together, this is the result:

void hilbertTo3D_logarithmic(uint64_t hilbertIndex, uint32_t bits, uint32_t& x, uint32_t& y, uint32_t& z)
{
  // Deinterleave index bits - use a different method if pext is not available
  uint32_t i0 = _pext_u64(hilbertIndex, 0x9249249249249249ull << 0);
  uint32_t i1 = _pext_u64(hilbertIndex, 0x9249249249249249ull << 1);
  uint32_t i2 = _pext_u64(hilbertIndex, 0x9249249249249249ull << 2);

  // Align to MSB
  i0 <<= 32 - bits;
  i1 <<= 32 - bits;
  i2 <<= 32 - bits;

  // Compute initial transform from index
  uint32_t j = ~(i1 ^ i2);
  uint32_t ta1 = (i2 & ~i1 & i0) | (j & ~i0);
  uint32_t ta0 = i0 | i1 | i2;
  uint32_t tb1 = (~i2 & i1 & ~i0) | (j & i0);
  uint32_t tb0 = i2 & i1 & i0;

  // Shift downwards, filling MSB with identity transform
  ta0 >>= 1;
  ta1 >>= 1;
  tb0 = (tb0 >> 1) | 0x80000000;
  tb1 >>= 1;

  // Prefix scan iteration
  auto fold = [&](uint32_t shift)
  {
    // Shift downwards
    uint32_t sa0 = ta0 >> shift;
    uint32_t sa1 = ta1 >> shift;
    uint32_t sb0 = (tb0 >> shift) | (~0 << (32 - shift));
    uint32_t sb1 = tb1 >> shift;

    // Recover third element of permutation which is uniquely determined in A4 by the first two elements
    uint32_t sc1 = sa0 ^ sb0 ^ sb1;
    uint32_t ss = sa1 ^ sb1;
    uint32_t sc0 = ~((ss & sa0) ^ (~ss & sb0));

    // Multiply permutations
    uint32_t ra0 = (sa0 & ~(ta0 ^ ta1)) ^ (sb0 & ta0) ^ (sc0 & ta1);
    uint32_t ra1 = (sa1 & ~(ta0 ^ ta1)) ^ (sb1 & ta0) ^ (sc1 & ta1);
    uint32_t rb0 = (sa0 & ~(tb0 ^ tb1)) ^ (sb0 & tb0) ^ (sc0 & tb1);
    uint32_t rb1 = (sa1 & ~(tb0 ^ tb1)) ^ (sb1 & tb0) ^ (sc1 & tb1);

    // Store current transform
    ta0 = ra0;
    ta1 = ra1;
    tb0 = rb0;
    tb1 = rb1;
  };

  // Apply prefix scan
  fold(1);
  fold(2);
  fold(4);
  fold(8);
  fold(16);

  // Recover untransformed octants
  uint32_t o0 = i0 ^ i1;
  uint32_t o1 = i1 ^ i2;
  uint32_t o2 = i2;

  // Build transformation matrix columns from permutation
  uint32_t m0 = ~(ta1 ^ tb1);
  uint32_t m2 = ~(ta0 ^ tb0);
  uint32_t m1 = ~(m0 | m2);

  uint32_t ma = tb0 ^ m0;
  uint32_t mc = ta1 ^ m2;
  uint32_t mb = ~ta1 & ~ta0 & tb1 | ta1 & ta0 & ~tb1 | ~ta0 & tb1 & tb0 | ta0 & ~tb1 & ~tb0;

  // Apply transform to original octants
  x = (m0 & o0) ^ (m1 & o1) ^ (m2 & o2) ^ ma;
  y = (m2 & o0) ^ (m0 & o1) ^ (m1 & o2) ^ mb;
  z = (m1 & o0) ^ (m2 & o1) ^ (m0 & o2) ^ mc;

  // Undo MSB alignment
  x >>= 32 - bits;
  y >>= 32 - bits;
  z >>= 32 - bits;
}

The resulting code quite nicely demonstrates all of the symmetries inherent to the problem, so I consider it unlikely to find any further significant reductions to the bit logic. The instruction count per iteration is unfortunately still massive compared to the rather trivial O(n) version, so for small bit counts this method is slower in practice despite the lower asymptotic complexity. The cutoff point seems to be around transforming 63 -> 3x21 bits, where it starts being just barely faster on my machine. So for computing positions for massive Hilbert indices (>64 bits) the logarithmic version would be preferable, although I consider that a very rare use case.

Proof of concept: 3D Hilbert curves in O(log(n))

Back in my earlier post about 2D Hilbert curves in logarithmic time, I hinted that it would be possible but likely not efficient to extend the same approach to higher dimensions. Just as a proof of concept, I've done just that for the Hilbert Index -> XYZ mapping in 3D. Most of the actual code is auto-generated, and I haven't spent much effort optimizing it in the same way as the 2D version, as it is rather obviously useless in practice. I'm only presenting the 32bit version here, but the code easily extends to arbitrary bit widths with an appropriate large integer library.

Applying the same would also be possible for the XYZ -> Hilbert Index mapping, however the code would be at least 10x-20x slower due to the exponential blowup of the technique, and probably rather tedious to generate.


// Shamelessly borrowed from http://and-what-happened.blogspot.de/2011/08/fast-2d-and-3d-hilbert-curves-and.html
void Morton_3D_Decode_10bit(const uint32_t morton,
	uint32_t& index1, uint32_t& index2, uint32_t& index3)
{ // unpack 3 10-bit indices from a 30-bit Morton code
	uint32_t value1 = morton;
	uint32_t value2 = (value1 >> 1);
	uint32_t value3 = (value1 >> 2);
	value1 &= 0x09249249;
	value2 &= 0x09249249;
	value3 &= 0x09249249;
	value1 |= (value1 >> 2);
	value2 |= (value2 >> 2);
	value3 |= (value3 >> 2);
	value1 &= 0x030c30c3;
	value2 &= 0x030c30c3;
	value3 &= 0x030c30c3;
	value1 |= (value1 >> 4);
	value2 |= (value2 >> 4);
	value3 |= (value3 >> 4);
	value1 &= 0x0300f00f;
	value2 &= 0x0300f00f;
	value3 &= 0x0300f00f;
	value1 |= (value1 >> 8);
	value2 |= (value2 >> 8);
	value3 |= (value3 >> 8);
	value1 &= 0x030000ff;
	value2 &= 0x030000ff;
	value3 &= 0x030000ff;
	value1 |= (value1 >> 16);
	value2 |= (value2 >> 16);
	value3 |= (value3 >> 16);
	value1 &= 0x000003ff;
	value2 &= 0x000003ff;
	value3 &= 0x000003ff;
	index1 = value1;
	index2 = value2;
	index3 = value3;
}

void transformFromIndex_parallel(uint32_t i0, uint32_t i1, uint32_t i2, uint32_t& t0, uint32_t& t1, uint32_t& t2, uint32_t& t3)
{
	t0 = i2 | (i0 & i1);
	t1 = (~i0 & ~i1 & ~i2) | (i0 & i1 & ~i2) | (~i0 & ~i1 & i2);
	t2 = ~(i0 & i1) ^ i2;
	t3 = (i0 & ~i1 & i2) | (~i0 & i1 & i2);
}

void multiplyTransform_parallel(
	uint32_t A0, uint32_t A1, uint32_t A2, uint32_t A3,
	uint32_t B0, uint32_t B1, uint32_t B2, uint32_t B3,
	uint32_t& C0, uint32_t& C1, uint32_t& C2, uint32_t& C3)
{
	C0 = (A0 & ~A1 & ~A2 & B0 & B1 & ~B2 & ~B3) | (A0 & ~A2 & ~A3 & ~B0 & B2) | (A0 & ~A1 & ~A2 & ~A3 & B0 & ~B1 & B3) | (A0 & ~A2 & ~A3 & ~B0 & B1 & B3) | (A0 & A1 & A3 & ~B0 & B1 & ~B2 & ~B3) | (A1 & ~A3 & ~B0 & B1 & ~B2 & ~B3) | (~A0 & A1 & ~A2 & ~A3 & B1 & B3) | (A1 & A2 & B0 & ~B1 & ~B2 & ~B3) | (A0 & ~A1 & A3 & B0 & ~B2 & ~B3) | (~A0 & ~A1 & A3 & B0 & B1 & ~B3) | (A1 & ~A2 & ~A3 & ~B0 & ~B1 & B2) | (~A0 & A1 & A2 & B0 & B2) | (~A0 & A1 & ~A2 & ~B0 & ~B1 & B3) | (A0 & A1 & ~A2 & ~B0 & ~B1 & ~B2) | (~A0 & A1 & ~A3 & B0 & ~B1 & B2) | (~A0 & A2 & ~B0 & ~B1 & B3) | (~A0 & A1 & A3 & ~B0 & B2) | (~A0 & ~A1 & ~A2 & ~A3 & B0) | (A0 & A1 & A3 & ~B1 & ~B3) | (A0 & ~B0 & ~B1 & ~B2 & ~B3) | (A0 & A1 & ~A3 & ~B0 & ~B3) | (~A0 & A2 & B0 & B1 & B3) | (~A1 & A3 & B0 & ~B1 & B3) | (~A0 & A3 & B0 & ~B1 & B3) | (~A0 & ~A1 & A2 & B3) | (A0 & A2 & ~B2 & ~B3) | (~A1 & A2 & B0 & B2) | (~A0 & A3 & B1 & B3) | (~A1 & A3 & B1 & B2);
	C1 = (~A0 & A3 & B0 & B1 & ~B2 & ~B3) | (A0 & ~A1 & ~A2 & ~A3 & B0 & ~B1 & B3) | (~A0 & ~A2 & ~A3 & B0 & B1 & B2) | (A0 & A1 & A3 & ~B0 & B1 & ~B2 & ~B3) | (A0 & ~A3 & B0 & ~B1 & ~B2 & ~B3) | (~A0 & ~A1 & A3 & B0 & ~B2 & ~B3) | (~A1 & A2 & ~B0 & B1 & ~B2 & ~B3) | (A0 & ~A1 & A3 & B0 & ~B2 & ~B3) | (A0 & ~A1 & A3 & B1 & ~B2 & ~B3) | (~A1 & ~A2 & ~A3 & ~B0 & B1 & B2) | (~A0 & A1 & A2 & B0 & B2) | (A0 & A1 & ~A2 & ~B0 & ~B1 & ~B2) | (~A0 & A1 & ~A3 & B0 & ~B1 & B2) | (~A0 & A2 & B1 & ~B2 & ~B3) | (~A0 & ~A1 & A2 & ~B1 & B2) | (A1 & ~A2 & ~A3 & ~B1 & B3) | (A0 & ~A1 & ~A2 & ~B0 & B2) | (~A0 & ~A3 & ~B0 & B1 & B3) | (~A0 & ~A1 & ~A3 & B1 & ~B2) | (A0 & A1 & A3 & B1 & B2) | (A0 & ~A3 & ~B0 & ~B1 & B3) | (A1 & ~B0 & ~B1 & ~B2 & ~B3) | (A0 & A1 & ~A3 & ~B1 & ~B2) | (A0 & A1 & ~A2 & ~A3 & ~B1) | (~A0 & A3 & B0 & ~B1 & B3) | (A1 & A3 & ~B0 & B1 & B3) | (~A0 & A3 & ~B1 & B2) | (A0 & A2 & B1 & B2) | (~A1 & B0 & B1 & B3);
	C2 = (A0 & ~A1 & A2 & ~B1 & ~B2 & ~B3) | (~A0 & A1 & ~A2 & B0 & B1 & ~B2 & ~B3) | (A0 & A1 & ~A2 & B0 & ~B1 & ~B2 & ~B3) | (A0 & ~A1 & ~A2 & B0 & B1 & ~B2 & ~B3) | (A0 & A1 & A2 & ~B0 & ~B2 & ~B3) | (~A0 & ~A1 & A3 & B0 & ~B1 & B2) | (~A1 & ~A2 & ~A3 & B0 & ~B1 & B2) | (~A0 & A1 & A2 & ~B0 & B1 & B2) | (~A0 & ~A1 & A3 & ~B0 & B1 & ~B3) | (A1 & A3 & B0 & ~B1 & ~B2 & ~B3) | (~A0 & A1 & A2 & B0 & ~B1 & B2) | (A0 & ~A1 & ~A2 & ~B0 & ~B1 & B3) | (~A0 & ~A1 & ~A3 & ~B0 & ~B1 & B2) | (A1 & ~A2 & ~A3 & B0 & B1 & B2) | (~A0 & ~A1 & ~A3 & B0 & B1 & B2) | (A0 & A1 & A3 & ~B0 & B1 & ~B2 & ~B3) | (A0 & A1 & ~A2 & ~B0 & B1 & B3) | (A0 & A1 & A3 & B0 & ~B1 & ~B2) | (A0 & ~A1 & A3 & B1 & ~B2 & ~B3) | (~A1 & ~A2 & ~A3 & ~B0 & B1 & B2) | (A1 & ~A2 & ~A3 & ~B0 & ~B1 & B2) | (~A0 & A1 & ~A2 & ~B0 & ~B1 & B3) | (A2 & ~B0 & ~B1 & ~B2 & ~B3) | (A0 & ~A1 & A2 & B1 & B3) | (A0 & A1 & A2 & B0 & B3) | (~A0 & A2 & B0 & B1 & B3) | (~A1 & A3 & B0 & ~B1 & B3) | (A1 & A3 & ~B0 & B1 & B3);
	C3 = (A0 & A1 & ~A2 & ~A3 & ~B0 & B1 & ~B2 & ~B3) | (A0 & ~A1 & ~A2 & ~A3 & B0 & B1 & B2) | (A0 & A1 & ~A2 & ~A3 & ~B0 & ~B1 & B3) | (A0 & A1 & A3 & B0 & B1 & ~B2 & ~B3) | (~A0 & ~A1 & A2 & B0 & ~B1 & ~B3) | (A0 & ~A1 & ~A2 & ~B0 & B1 & B3) | (~A0 & A1 & ~A2 & ~B0 & B1 & B2) | (A0 & ~A1 & ~A2 & ~A3 & B0 & ~B1 & B3) | (~A0 & A1 & ~A3 & B0 & ~B1 & B3) | (A0 & A1 & ~A3 & B0 & ~B1 & B2) | (~A0 & ~A1 & A3 & B0 & ~B2 & ~B3) | (~A1 & A2 & ~B0 & B1 & ~B2 & ~B3) | (~A0 & A1 & ~A2 & ~A3 & B1 & B3) | (A1 & A2 & B0 & ~B1 & ~B2 & ~B3) | (~A0 & ~A1 & A3 & B0 & B1 & ~B3) | (~A0 & ~A1 & ~A2 & ~A3 & B3) | (~A1 & A2 & ~B0 & ~B1 & B3) | (A3 & ~B0 & ~B1 & ~B2 & ~B3) | (A0 & ~A1 & A2 & ~B1 & B2) | (~A2 & ~A3 & B0 & B1 & B3) | (~A0 & A1 & A3 & ~B0 & ~B3) | (A1 & A2 & B0 & B1 & ~B3) | (A0 & A3 & ~B0 & B2);
}

void octantFromIndex_parallel(uint32_t i0, uint32_t i1, uint32_t i2, uint32_t& x, uint32_t& y, uint32_t& z)
{
	x = i0 ^ i1;
	y = i1 ^ i2;
	z = i2;
}

void applyTransform_parallel(uint32_t t0, uint32_t t1, uint32_t t2, uint32_t t3, uint32_t x, uint32_t y, uint32_t z, uint32_t& o0, uint32_t& o1, uint32_t& o2)
{
	o2 = (~t3 | t1 | t0 | ~z)&(t3 | t1 | ~t0 | ~y)&(~t3 | ~t2 | ~z | ~y | x)&(~t3 | t2 | t1 | ~t0 | ~x)&(t3 | ~t2 | t1 | t0 | x)&(t3 | ~t2 | ~t1 | t0 | y)&(~t3 | t2 | ~t1 | t0 | y)&(t3 | ~t2 | ~t1 | ~t0 | ~x)&(~t3 | t2 | ~t1 | ~t0 | ~z)&(t3 | t2 | t1 | t0 | z)&(t3 | t2 | ~t1 | t0 | x)&(t3 | t2 | ~t1 | ~t0 | z)&(~t3 | ~t2 | t1 | ~z | y | x)&(~t3 | ~t2 | t1 | z | y | ~x)&(~t3 | ~t2 | t0 | z | ~y | ~x)&(~t3 | ~t2 | ~t1 | ~t0 | y | x)&(~t3 | t0 | ~z | y | ~x)&(~t2 | t1 | t0 | ~y | x);
	o1 = (~t3 | ~t2 | ~y | ~x)&(~t3 | ~t1 | t0 | ~x)&(~t3 | t1 | t0 | ~y)&(~t3 | ~t2 | t1 | ~z | ~x)&(t3 | ~t2 | z | x)&(~t3 | ~t2 | t1 | z | ~y)&(~t2 | t1 | t0 | z | ~x)&(~t3 | ~t2 | t0 | z | ~y)&(~t2 | ~t1 | ~t0 | z | ~x)&(~t3 | ~t2 | ~t1 | y | x)&(~t3 | t2 | t1 | ~t0 | ~z)&(~t3 | t2 | ~t1 | ~t0 | y)&(t3 | t2 | t1 | t0 | y)&(t3 | t2 | t1 | ~t0 | ~x)&(t2 | ~t1 | t0 | ~z | ~x)&(t3 | ~t1 | t0 | ~z | x)&(t3 | t2 | ~t1 | ~t0 | ~y)&(~t3 | ~t2 | t1 | ~z | y | x)&(t3 | ~t2 | t1 | ~t0 | ~z | x)&(~t3 | ~t2 | ~t0 | z | y | x);
	o0 = (~t3 | ~t2 | ~y | ~x)&(~t2 | ~t1 | t0 | z)&(t2 | t1 | t0 | x)&(t2 | ~t1 | ~t0 | ~x)&(~t3 | t0 | ~z | x)&(~t3 | ~t2 | t1 | ~z | ~x)&(~t3 | ~t2 | t1 | z | ~y)&(~t3 | ~t2 | ~z | ~y | x)&(~t3 | ~t2 | ~t1 | z | ~x)&(t3 | ~t2 | t1 | t0 | y)&(~t3 | t2 | t1 | ~t0 | y)&(~t2 | t1 | ~t0 | ~z | ~x)&(~t3 | t2 | ~t1 | ~z | ~x)&(t3 | ~t2 | ~t1 | ~t0 | ~y)&(t3 | t2 | t1 | ~t0 | z)&(t3 | t2 | ~t1 | t0 | ~y)&(t3 | ~t2 | t1 | ~t0 | ~z | x)&(~t3 | ~t2 | ~t0 | z | y | x)&(~t2 | t0 | z | y | ~x);
}

void hilbertToMorton3D_logarithmic(uint32_t hilbertIndex, uint32_t bits, uint32_t& x, uint32_t& y, uint32_t& z)
{
	uint32_t i0, i1, i2;
	Morton_3D_Decode_10bit(hilbertIndex, i0, i1, i2);
	i0 <<= 32 - bits;
	i1 <<= 32 - bits;
	i2 <<= 32 - bits;

	uint32_t t0, t1, t2, t3;
	transformFromIndex_parallel(i0, i1, i2, t0, t1, t2, t3);

	{
		t0 >>= 1;
		t1 >>= 1;
		t2 >>= 1;
		t3 >>= 1;
	}

	for (uint32_t shift = 1; shift <= 16; shift *= 2)
	{
		uint32_t s0, s1, s2, s3;

		s0 = t0 >> shift;
		s1 = t1 >> shift;
		s2 = t2 >> shift;
		s3 = t3 >> shift;

		uint32_t T0, T1, T2, T3;
		multiplyTransform_parallel(t0, t1, t2, t3, s0, s1, s2, s3, T0, T1, T2, T3);

		t0 = T0;
		t1 = T1;
		t2 = T2;
		t3 = T3;
	}

	uint32_t o0, o1, o2;
	octantFromIndex_parallel(i0, i1, i2, o0, o1, o2);
	applyTransform_parallel(t0, t1, t2, t3, o0, o1, o2, x, y, z);

	x >>= 32 - bits;
	y >>= 32 - bits;
	z >>= 32 - bits;
}

LUT-based 3D Hilbert curves

As referenced in my earlier post about Hilbert curves, it's possible to map between d-dimensional Euclidean coordinates and the offset along the Hilbert curve in O(n) time by direct application of the transformation group at each recursion level of the curve. The following applies this to the 3D case:

static const uint8_t mortonToHilbertTable[] = {
	48, 33, 27, 34, 47, 78, 28, 77,
	66, 29, 51, 52, 65, 30, 72, 63,
	76, 95, 75, 24, 53, 54, 82, 81,
	18,  3, 17, 80, 61,  4, 62, 15,
	 0, 59, 71, 60, 49, 50, 86, 85,
	84, 83,  5, 90, 79, 56,  6, 89,
	32, 23,  1, 94, 11, 12,  2, 93,
	42, 41, 13, 14, 35, 88, 36, 31,
	92, 37, 87, 38, 91, 74,  8, 73,
	46, 45,  9, 10,  7, 20, 64, 19,
	70, 25, 39, 16, 69, 26, 44, 43,
	22, 55, 21, 68, 57, 40, 58, 67,
};

static const uint8_t hilbertToMortonTable[] = {
	48, 33, 35, 26, 30, 79, 77, 44,
	78, 68, 64, 50, 51, 25, 29, 63,
	27, 87, 86, 74, 72, 52, 53, 89,
	83, 18, 16,  1,  5, 60, 62, 15,
	 0, 52, 53, 57, 59, 87, 86, 66,
	61, 95, 91, 81, 80,  2,  6, 76,
	32,  2,  6, 12, 13, 95, 91, 17,
	93, 41, 40, 36, 38, 10, 11, 31,
	14, 79, 77, 92, 88, 33, 35, 82,
	70, 10, 11, 23, 21, 41, 40,  4,
	19, 25, 29, 47, 46, 68, 64, 34,
	45, 60, 62, 71, 67, 18, 16, 49,
};

uint32_t transformCurve(uint32_t in, uint32_t bits, const uint8_t* lookupTable)
{
	uint32_t transform = 0;
	uint32_t out = 0;

	for (int32_t i = 3 * (bits - 1); i >= 0; i -= 3) {
		transform = lookupTable[transform | ((in >> i) & 7)];
		out = (out << 3) | (transform & 7);
		transform &= ~7;
	}

	return out;
}

uint32_t mortonToHilbert3D(uint32_t mortonIndex, uint32_t bits)
{
	return transformCurve(mortonIndex, bits, mortonToHilbertTable);
}

uint32_t hilbertToMorton3D(uint32_t hilbertIndex, uint32_t bits)
{
	return transformCurve(hilbertIndex, bits, hilbertToMortonTable);
}

The 3D Hilbert curve transforms under the group A4, which has 12 elements. However as there are only 8 transformations that can appear on the left-hand side of each multiplications, only an 8x12 subset of the multiplication table is required. By folding the the index <=> octant mapping function into this table and additionally storing each index respectively octant into the lower 3 bits of each entry, we end up with a compact 96 byte LUT for each function.

These functions can be used as drop-in replacement for the algorithm provided by Fast 2D and 3D Hilbert curves and Morton codes, but are faster by a factor of roughly 2.5 on my machine.

Sponza in a Millisecond

Today, after nearly three years of research and more than a dozen rewrites, I finally present a little side project of mine:

https://github.com/rawrunprotected/rasterizer

This is state-of-the-art software occlusion culling system; similar to Intel's Software Occlusion Culling, but better optimized. By default, the application loads and renders Crytek's Sponza mesh. Camera controls are simple WASD to move and cursor keys to control the camera. The current version requires AVX2 support, so you'll need a modern CPU to run it; although downgrading to AVX or SSE4 should be straightforward if needed at some point.

Here's a quick overview of some of the techniques the rasterizer makes use of:

Clipless rasterization

Clipless or homogeneous rasterization, is not exactly a new idea. The basic principle is that the usual perspective division by W after vertex transformation is ommitted and the triangle edge equations are directly projected to 2D space, thus saving a division, and more importantly, clipping against the near plane and guard band. A nice side-effect is that no near plane is required anymore.

However, implementing this naively brings more problems at first than it solves: Perspective division brings vertex coordinates and thus edge equations into pixel space, where they can be rounded to fixed-point coordinates with well-defined error behavior. Ommitting this step forces the rasterizer to either operate in floating point with a ridiculously high number of mantissa bits, or use some other magic trick to ensure the rasterization result remains artifact-free, as a small change to the edge equations can result in large deviations from the ground truth. Besides, without having explicit 2D vertex positions, there is no explicit bounding box, making it difficult to limit the screen area that needs to be scanned.

As it turns out, neither magic nor high-precision floats are required after all: Instead, the practical approach to clipless rasterization is to simply clamp the W component away from zero to avoid overflow and perform the perspective division after all. Afterwards, rounding of the vertex coordinates and edge equation computation can be performed as usual; although the edge equations need to be sign-inverted once for each vertex with W < 0. Similarly, the backface culling test result needs to be flipped for each vertex behind the camera. Computing the bounding box is somewhat tricky, as one needs to determine the bounding box for all vertices with W > 0 and all with W < 0 separately, and extend the former towards infinity depending on its relation to the latter. As this introduces some overhead compared to the standard triangle setup, I've templated the rasterization routine to only perform the extra steps for W < 0 if a batch of triangles is close to the camera plane. Search for possiblyNearClipped in Rasterizer.cpp to see where special handling is required.

Edge mask precomputation

Another interesting idea is precomputing the coverage masks for each 8x8 pixel block. This approach is easily explained: For each edge equation, the slope and offset at the center of a pixel block are quantized, and then looked up in a precomputed table to yield a 64bit coverage mask. By ANDing together the masks for each edge, one finds the mask of all pixels that are inside the triangle. Since quantization invariably modifies the edge equation's slope, this results in minor "bristle" artifacts for thin triangles. For a graphical application, these would be unacceptable, however, as these bristles are no more than a pixel wide and don't extend across block boundaries, there's little risk of incorrect over-occlusion, so they can be safely ignored for this rasterizer's intended use case.

The lookup table is sufficiently small to not have any significant impact on cache at only 32KB, with an upper bound of 2.5KB being actually sampled for each pair of triangles in the worst case. Curiously, the lookup should be a perfect use case for AVX2's gather instruction set, however performing the lookup in scalar code turned out to be invariably faster on a SkyLake CPU. FMA on the other hand, the other major feature coming with AVX2-capable CPUs, brings a solid 20% performance gain from speeding up vertex transformation.

Vertex buffer compression

The major requirement for any form of vertex buffer compression is that multiple occurences of a vertex encode and decode to the exact same floating point value. This is both sufficient and necessary to ensure that the compression step doesn't introduce any holes in the mesh. As variable-length compression schemes rarely play nice with SIMD, I've chosen to encode vertices straightforwardly as 11-11-10 fixed point format, relative to the models bounding box. This cuts occluder size to a third, compared to storing uncompressed vertex data. Since the rasterizer bakes the bounding box transform into the model view projection matrix, and the integer masking operation during decoding can be pipelined with the matrix multiplication, decompression is practically zero-cost.

Note that using an indexed representation doesn't provide any speedup here: Contrary to the GPU case, the rasterizer is neither bound by vertex fetch bandwidth nor processing cost, and the overhead of gathering via the index buffer managing a cache in software is fairly large. In addition, having a flat linear input stream allows perfect usage of streaming load instructions to prevent cache pollution.

Rasterizing quads instead of triangles

By rasterizing quadrilaterals instead of triangles, setup cost and overdraw are significantly reduced. Unfortunately, one can't rely on coplanar pairs of triangles remaining coplanar after vertex quantization, so the rasterizer expliticly deals with quads becoming non-convex after projection by generating the coverage mask for each half quad separately to prevent introducing holes. Triangles that can't be merged to quads are simply handled by duplicating one of the vertices, effectively degenerating one of the quad edges. Despite this overhead, rendering quads is overall 20-30% faster.

Depth buffer compression

Typical GPU depth buffer compression schemes conserve bandwidth by attempting to store only each triangle's depth derivatives and a per-pixel index mask for a fixed block size, rather than the full interpolated depth values. This doesn't map well to a software implementation, so instead the rasterizer uses a customized 16bit minifloat format to reduce the depth buffer size.

The minifloat format has 4 exponent and 12 mantissa bits, with no sign bit; with decoded values ranging from 0 to 1. The trick for quickly encoding into this format is first rescaling an input 32bit float with 2.5237386e-29f (which has the bit pattern 0xFFFF << 12), and then right-shifting by 12 bits. This maps 0.0 to the bit pattern 0x0000, and 1.0 to the pattern 0xFFFF, with an exponential distribution in between. By making use of SSE's saturating packing instruction, one also gets well-defined clamping behavior for values outside of the [0, 1] range. The rescaling value is simply baked into the projection matrix, which means that 8 depth values can be encoded extremely quickly with only 3 integer operations. Combined with an inverted depth range with near at 1.0f and far at 0.0f, so that the denser floating point distribution around 0 counteracts the perspective error, this minifloat actually has a significantly better worst-case depth precision for typical near/far ratios than a 24bit fixed point depth buffer.

And everything else

Besides these high-level tricks, there are tons of smaller bits and pieces to improve performance, such as computing z from its linear relation with 1/W; or rounding vertex coordinates to force zero-area culling even though the rasterizer doesn't run in fixed point. The bulk of the optimization effort however is completely invisible: Several hundred hours have been spent on cycle shaving, reordering to improve pipelining, or running assembly through IACA to minimize instruction latencies.

Finally, a significant contribution usually comes from proper preprocessing of the geometry. In a typical game application, you would use a simplified version of the scene geometry. This is out of scope for this sample, though, so the preprocessing is limited to splitting the scene into batches according to the surface area heuristic, and clustering by normal vectors to improve backface culling coherency. The end result is still the full, unsimplified render geometry, just with a little bit of reordering.

I still haven't decided on an open-source license yet; feel free to ping me if you have suggestions.

Hilbert curves in O(log(n)) time

Jump to the code

Today I present a novel technique for mapping a point in the d-dimensional plane to its corresponding d*n-bit offset on the nth order Hilbert curve with a running time of O(log(n)) on a processor with n-bit words. While a large constant factor makes it infeasible in higher dimensions, it is significantly faster than the previous best method for d=2.

To explain the approach, let's consider the reverse computation first: mapping an offset on the Hilbert curve to its corresponding coordinates.

Given a d*n-bit number describing the index on the d-dimensional Hilbert curve of order n, split the index into n groups i_j of d bits each, starting from the most significant bit. Each of these tuples i_0 through i_{n-1} describes both an orthant to recurse into, determining one bit for the each of the coordinates axes (which we'll group together as x_j) of the point on the curve, as well as the transformation that is to be applied to the next recursion level of the curve.

We have

x_0 = q(i_0) = T_0 * q(i_0)
x_1 = t(i_0) * q(i_1) = T_1 * q(i_1)
x_2 = t(i_0) * t(i_1) * q(i_2) = T_2 * q(i_2)
...
x_{n-1} = t(i_0) * t(i_1) * ... * t(i_{n-2}) * q(i_{n-1})  = T_{n-1} * q(i_{n-1})

Where q is the function mapping d index bits to an orthant, t is the function mapping d index bits to an element of the transformation group of the Hilbert curve, and * is the operator of that group. If q, t and * are known for a particular dimension, this yields an O(n) algorithm for mapping offsets on the curve to the corresponding coordinates, and vice versa, by successively computing the product of transformations at each recursion level.

Designing these functions efficiently is rather involved, though, and as far as I'm aware this direct approach has only been applied to the two dimensional case, either via more or less explicit geometric operations (http://www.hackersdelight.org/hdcodetxt/hilbert/lamxy.c.txt), or lookup tables (http://www.hackersdelight.org/hdcodetxt/hilbert/hil_xy_from_s.c.txt, http://www.hackersdelight.org/hdcodetxt/hilbert/hil_s_from_xy.c.txt). For higher dimensions, methods based on Grey codes are usually employed.

In this written form, it is readily apparent that one can apply a prefix scan technique to parallelize the mapping from index to coordinates by computing the partial products of transformations in number of steps logarithmic in the order of the curve. In two dimensions, observe that the group of transformations applied to each quadrant is simply the Klein four-group, which is trivial to implement as a bitwise XOR. This allows us to parallelize across the individual bits of an n-bit register, a technique sometimes dubbed SWAR (SIMD within a register). An implementation can be found at http://www.hackersdelight.org/hdcodetxt/hilbert/glsxy.c.txt.

The same method generalizes to higher dimensions; however it quickly becomes infeasible to parallelize within a register as the transformation groups become too complex (in 3D, the transformation group is the alternating group A4, which does not seem to exhibit a structure that would make it particularly efficient to implement).

An open problem until now was to apply the same idea in the other direction: Can mapping a point to the corresponding Hilbert curve index be parallelized in the same way? The issue is that, given x_0, ..., x_{n-1}, we can't easily separate the partial product of transformations from the orthant q(i_j) without knowing the transformation of the previous recursion level, making it seem that is impossible to achieve sub-linear time.

As it turns out, there is a trick to do exactly that: We can't recover T_j without knowledge of T_{j-1}. However, we can compute a function object f_j(t) that evaluates T_j under the assumption that t = T_{j-1}. We now have a semi-group with the operation of functional composition - and can compute the partial sums f0, f1 \circ f0, f2 \circ f1 \circ f0, ... efficiently via a prefix scan. Next, evaluate each function at identity, which yields each of T_j, from which we can recover t(i_j) = T_j * T_{j-1}^{-1}, then q(i_j) = t_j^{-1} * x, and finally i_j as q is necessarily bijective.

Compared to mapping indices to coordinates, this unfortunately blows up the time complexity by a factor of the order of the transformation group, which appears to be super-exponential in d. Combined with the above-mentioned difficulty of efficiently implementing the transformation group, this makes it unlikely that this method has any practical merit except for d>2.

Without further ado, I present my implementation (for 32bit indices), which is three times faster than the O(n) lookup table based implemention linked above for 32bits. The code has been extensively transformed and optimized by hand to exploit some further symmetries of the transformation group for d=2, so it bears little resemblance any more to the underlying mathematical structure outlined above.

uint32_t interleave(uint32_t x)
{
	x = (x | (x << 8)) & 0x00FF00FF;
	x = (x | (x << 4)) & 0x0F0F0F0F;
	x = (x | (x << 2)) & 0x33333333;
	x = (x | (x << 1)) & 0x55555555;
	return x;
}

uint32_t hilbertXYToIndex_logarithmic(uint32_t x, uint32_t y)
{
  uint32_t A, B, C, D;

  // Initial prefix scan round, prime with x and y
  {
    uint32_t a = x ^ y;
    uint32_t b = 0xFFFF ^ a;
    uint32_t c = 0xFFFF ^ (x | y);
    uint32_t d = x & (y ^ 0xFFFF);

    A = a | (b >> 1);
    B = (a >> 1) ^ a;

    C = ((c >> 1) ^ (b & (d >> 1))) ^ c;
    D = ((a & (c >> 1)) ^ (d >> 1)) ^ d;
  }

  {
    uint32_t a = A;
    uint32_t b = B;
    uint32_t c = C;
    uint32_t d = D;

    A = ((a & (a >> 2)) ^ (b & (b >> 2)));
    B = ((a & (b >> 2)) ^ (b & ((a ^ b) >> 2)));

    C ^= ((a & (c >> 2)) ^ (b & (d >> 2)));
    D ^= ((b & (c >> 2)) ^ ((a ^ b) & (d >> 2)));
  }

  {
    uint32_t a = A;
    uint32_t b = B;
    uint32_t c = C;
    uint32_t d = D;

    A = ((a & (a >> 4)) ^ (b & (b >> 4)));
    B = ((a & (b >> 4)) ^ (b & ((a ^ b) >> 4)));

    C ^= ((a & (c >> 4)) ^ (b & (d >> 4)));
    D ^= ((b & (c >> 4)) ^ ((a ^ b) & (d >> 4)));
  }

  // Final round and projection
  {
    uint32_t a = A;
    uint32_t b = B;
    uint32_t c = C;
    uint32_t d = D;

    C ^= ((a & (c >> 8)) ^ (b & (d >> 8)));
    D ^= ((b & (c >> 8)) ^ ((a ^ b) & (d >> 8)));
  }

  // Undo transformation prefix scan
  uint32_t a = C ^ (C >> 1);
  uint32_t b = D ^ (D >> 1);

  // Recover index bits
  uint32_t i0 = x ^ y;
  uint32_t i1 = b | (0xFFFF ^ (i0 | a));

  return (interleave(i1) << 1) | interleave(i0);
}

The Near Plane Is A Lie

In rasterization, the near plane is a weird beast. The reason for its existence is far from intuitive - after all, ray tracing does fine without it. Why does rasterization specifically have to deal with this, and the associated the trade-off between depth precision and clipping artifacts for geometry near the camera?

One reason that comes to mind is that rasterization typically uses the projected depth value Z/W, because it's convenient (the difference between perspective and orthogonal projection boils down to a few matrix values) and efficient (Z/W is linear in screen space, making hardware implementation trivial). This value isn't equi-distributed along the view vector however, and quickly goes towards minus infinity as we get closer to the camera. Historically, the near plane make sense as an arbitrary cut-off point for fixed-point depth values.

Nowadays, we could easily work around this by storing Z/W as floating point values in the depth buffer, or switching to linear depth Z altogether. Modern GPUs typically use 32bit depth buffers anyway - using D24S8 as a depth stencil format often means that the hardware has a 32bit buffer for depth (out of which 8bits are unused padding) and separate 8bit stencil buffer, potentially even in a specialized on-chip memory region. Using floating point depth doesn't incur any extra cost, so why is the near plane still a thing?

Consider the basic steps of rasterizing a 3D triangle:

- Project 3D vertices to 2D pixel positions
- Determine pixels covered by 2D triangle
- Interpolate vertex attributes with perspective correction for each pixel

Except that this is not quite correct. When projecting a 3D triangle onto a 2D screen, the result is not a triangle. In particular, when part of a triangle is in front and another part is behind the camera plane, the resulting 2D shape has three edges, two of which extend to infinity rather than meeting in a common point. However because it is much easier, both on an intuitive as well as technical level, to work under the assumption that 3D triangle map to 2D triangles, the near plane was introduced in order to clip away those parts of the geometry that don't fullfil this condition.

A better mental model is perhaps that a half-space in 3D is projected to a half-space in 2D. A triangle is simply the intersection of 3 half-spaces as defined by its edges, and the shape projected by a triangle is the intersection of the projection of each half-space defining the triangle. As it happens, these half-spaces do not necessarily enclose a finite area.

Until ray tracing finally takes over for real-time rendering, which is usually only 10 years from now at any given point of time, there's a technique called clipless rasterization that does not require clipping geometry against the near plane. It works in terms of projecting 3D edges directly to 2D half-space equations, against which points in the plane can be tested. NVIDIA GPUs have implemented this technique for a number of years, while AMD still uses traditional clipping rasterization.

While implementing a clipless rasterizer is initially straightforward to implement once all the math is worked out, it's fairly difficult to control precision. With clipping, the computed 2D vertex position are typically rounded to some fixed-point format, after which all operations are carried out exactly. Any deviation between the rasterized area and the "ground truth" is directly determined by the chosen fixed point precision. Without clipping, we only have edge equations, which are very sensitive to numerical errors. Consider two edges meeting at an acute angle: If either edge is offset by a small amount, the intersection point can potentially move very far. Another issue is that two triangles sharing an edge might generate numerically different but geometrically identical edge equations, but the rasterizer needs to ensure to generate the exact same pixels for each. One solution is to use floating point formats with enough mantissa bits to ensure that all edge computations are lossless. This is costly in hardware, but on the other hand it removes the need for complex clipping circuitry in the front-end.

However, all current graphics APIs are specified in terms of traditional rasterization, including the requirement of near-clipping. Because of this, NVIDIA emulates the effect of clipping against the near plane by discarding pixels below some depth threshold, meaning all of the user-facing benefits of clipless rasterization are lost.

Until those specifications are fixed, I guess we'll have to live with the near plane a little-bit longer.

Dithering in Gamma vs. Linear Space

Here is some public domain image, originally created by Ricardo Cancho Niemietz.

Original 8888 color image
Original 8888 color image

Conversion to R4G4B4A4, without application of dithering, yields a rather unsightly result. Note the significant color banding on the gradients, which results in the Mach band optical illusion. There is some heavy color banding of the parrot's plumage as well.

Non-dithered 4444 color image
Non-dithered 4444 color image

Dithering improves the artifacts and replaces most of banding by visually less-objectionable high-frequency noise.

Dithered 4444 color image
Dithered 4444 color image

However, you may notice that there now appear slightly darkened stripes in the linear gradients on the right side where the undithered image jumps to the next color level. This is because the image has been dithered in gamma space where adding up color values results in a color with slightly lower intensity. Linearizing color values before computing and distributing the error, these artifacts disappear completely:

Dithered 4444 color image with sRGB correction
Dithered 4444 color image with sRGB correction

Visually, there's little advantage of using sRGB-correct dithering for high-frequency photographic images, as the differences are neglible. For low-frequency gradients the gain in quality is obvious; however images with perfect gradients are most likely lookup tables and the like that wouldn't be stored at low color depths in the first place.

Anyway, the result using an approximate (and cheaper) gamma exponent of 2.0 is visually completely indistinguishable from applying the proper sRGB conversion for the purposes of dithering, so there's that.