The core of computing 2D coordinates from a Hilbert index in logarithmic time is a XOR prefix sum. As it so happens, the upper bits of carry-less multiplication with -1 are equivalent to a XOR prefix sum. On an x86 machine with CLMUL and PEXT support, this yields the following "constant time" algorithm for mapping a 32bit Hilbert index to 16bit X and Y:

void hilbertIndexToXY_constantTime(uint32_t i, uint32_t& x, uint32_t& y) { uint32_t i0 = _pext_u32(i, 0x55555555); uint32_t i1 = _pext_u32(i, 0xAAAAAAAA); uint32_t A = i0 & i1; uint32_t B = i0 ^ i1 ^ 0xFFFF; uint32_t C = uint32_t(_mm_extract_epi16(_mm_clmulepi64_si128(_mm_set_epi64x(0, 0xFFFF), _mm_cvtsi32_si128(A), 0b00), 1)); uint32_t D = uint32_t(_mm_extract_epi16(_mm_clmulepi64_si128(_mm_set_epi64x(0, 0xFFFF), _mm_cvtsi32_si128(B), 0b00), 1)); uint32_t a = C ^ (i0 & D); x = (a ^ i1); y = (a ^ i0 ^ i1); }

Of course this isn't really a constant time solution since the CLMUL intrinsic is limited to 64bit operands, and can't be extended to operate on bits in time in any reasonable machine model. In practice, this solution also tends to be a hair slower than the logarithmic time solution since both transferring data into/out of SSE registers and the CLMUL intrinsic itself are rather high latency operations.

]]>- Extract 3 bits from input:
`uint32_t input = (in >> i) & 7;`

- Combine input bits with current transform:
`uint32_t lookupArg = transform | input;`

- Perform lookup:
`uint32_t lookupRes = lookupTable[lookupArg];`

- Append output bits:
`out = (out << 3) | (lookup & 7);`

- Extract new transform:
`transform = lookup & ~7;`

Each of these steps is essential to the algorithm, and there's no obvious way to optimize each individual operation, such as masking and shifting bits. I'm also dismissing the obvious approach here of consuming more input bits per iteration, since that increases the size of the LUT exponentially.

To break the barrier of 5 (logical) instructions per iteration, we need to resort to a few tricks. As we need to shift 3 bits of input and 3 bits of output per iteration, let's try to combine these operations into one. On x86, the family of ror/rol instructions supports rotating shifts where bits shifted out the end of the register are inserted on the other end. This allows us to shift state around in a register without losing bits that are pushed out on one end.

As a rough sketch, let's consider then storing the entire state a single state register, with the input bits initially aligned to the MSB. By left-rotating by 3 we can shift 3 bits of input into the lowest bits of the state, where they are conveniently located for lookup. At the same time, this rotation pushes 3 output bits from the previous iteration to a higher position in the state register. At the end, all input bits are consumed, and we're left with the finished output in the high bits of our register.

Since the high bits of our state are used, we need to mask out the low bits (containing 3 bits of input and the current transform) to perform the lookup. However we can do this without any explicit bitmasking operation if the state register is a 64bit integer by casting to a 32bit integer. On x86 this amounts to move of the lower half of a register, which is essentially free as it's handled by the renaming engine in the front end. This also implies that our LUT needs to contain the 3 output bits at bit offsets 31 through 29, so that the next left rotation moves them into the upper 32 bits, keeping the lower 32 free for the LUT argument.

Finally, we need to clear the transform bits out of the state register and update them with the next transform. By pre-XORing each LUT entry with its argument, and XORing the state with the LUT result, we can clear the previous argument and replace it with the next transform at the same time.

Combined, these ideas yield the following algorithm:

static const uint32_t mortonToHilbertTable[] = { 0x00000006, 0x20000005, 0x60000001, 0x40000007, 0xe0000001, 0xc000000c, 0x80000005, 0xa000000e, 0x40000000, 0xa000000a, 0x6000000c, 0x8000000d, 0x20000004, 0xc000000e, 0x00000007, 0xe0000008, 0x80000019, 0xe000001a, 0x6000001b, 0x00000010, 0xa0000012, 0xc0000013, 0x4000001c, 0x2000001d, 0x4000001a, 0x60000019, 0x20000018, 0x00000011, 0xa000001b, 0x8000001d, 0xc0000019, 0xe000001e, 0x00000020, 0x60000026, 0xe000002a, 0x80000024, 0x20000022, 0x40000023, 0xc000002c, 0xa000002d, 0x80000022, 0x60000023, 0xa000002a, 0x40000020, 0xe0000025, 0x0000002a, 0xc000002e, 0x20000024, 0x00000034, 0xe0000033, 0x20000032, 0xc0000038, 0x60000035, 0x80000034, 0x40000036, 0xa000003c, 0x4000003d, 0x2000003c, 0xa000003b, 0xc000003a, 0x60000038, 0x00000036, 0x8000003a, 0xe000003c, 0x8000004b, 0xa0000045, 0xe0000048, 0xc0000047, 0x6000004f, 0x4000004c, 0x00000047, 0x2000004e, 0xc000004d, 0xa000004c, 0x2000004b, 0x4000004a, 0xe000004c, 0x8000004f, 0x00000046, 0x6000004d, 0xc0000058, 0x20000052, 0xe0000056, 0x00000051, 0xa000005c, 0x40000056, 0x80000053, 0x60000052, 0xc000005a, 0xe000005f, 0xa0000058, 0x80000053, 0x2000005b, 0x00000058, 0x40000059, 0x60000057, }; static const uint32_t hilbertToMortonTable[] = { 0x00000006, 0x20000005, 0x60000006, 0x40000000, 0xc0000007, 0xe000000c, 0xa000000f, 0x80000002, 0xc0000001, 0x80000001, 0x00000002, 0x4000000d, 0x6000000a, 0x2000000e, 0xa000000d, 0xe0000008, 0x60000013, 0xe000001b, 0xc0000018, 0x4000001a, 0x0000001d, 0x80000013, 0xa0000010, 0x2000001c, 0x60000012, 0x4000001b, 0x00000018, 0x2000001b, 0xa000001c, 0x8000001a, 0xc0000019, 0xe000001e, 0x00000020, 0x80000027, 0xa0000024, 0x20000024, 0x60000023, 0xe000002f, 0xc000002c, 0x4000002f, 0xa000002f, 0xe0000022, 0x60000021, 0x20000021, 0x00000026, 0x4000002d, 0xc000002e, 0x80000026, 0x00000034, 0x40000031, 0xc0000032, 0x80000032, 0xa0000035, 0xe000003e, 0x6000003d, 0x20000035, 0xa0000033, 0x2000003c, 0x0000003f, 0x8000003f, 0xc0000038, 0x4000003c, 0x6000003f, 0xe000003c, 0xc0000041, 0xe0000048, 0xa000004b, 0x80000048, 0x0000004f, 0x20000041, 0x60000042, 0x4000004d, 0xc0000040, 0x40000048, 0x6000004b, 0xe0000049, 0xa000004e, 0x20000048, 0x0000004b, 0x8000004f, 0x60000052, 0x20000052, 0xa0000051, 0xe0000056, 0xc0000051, 0x8000005d, 0x0000005e, 0x40000053, 0xa000005d, 0x8000005e, 0xc000005d, 0xe0000053, 0x60000054, 0x4000005f, 0x0000005c, 0x20000059, }; uint32_t transformCurve(uint32_t in, uint32_t bits, const uint32_t* lookupTable) { uint64_t x = _rotr64(in, 3 * bits); for (uint32_t i = 0; i < bits; ++i) { x = _rotl64(x, 3); x ^= lookupTable[uint32_t(x)]; } return x >> 29; } uint32_t mortonToHilbert3D(uint32_t mortonIndex, uint32_t bits) { return transformCurve(mortonIndex, bits, mortonToHilbertTable); } uint32_t hilbertToMorton3D(uint32_t hilbertIndex, uint32_t bits) { return transformCurve(hilbertIndex, bits, hilbertToMortonTable); }

This method is faster than the previous implementation by about 30%. However, it only works on the x64 architecture, increases the size of the LUT to 4 bytes per entry, and is limited to 30bit inputs, so unfortunately it's not quite as useful in general.

]]>struct S4x32 { uint32_t a1; uint32_t a0; uint32_t b1; uint32_t b0; uint32_t c1; uint32_t c0; uint32_t d1; uint32_t d0; }; S4x32 mul(S4x32 x, S4x32 y) { S4 res; // If x.a == 0, let res.a = y.a; else if x.a == 1, let res.a = y.b; else if x.a == 2, let res.a = y.c, etc. res.a0 = (y.a0 & ~x.a0 & ~x.a1) | (y.b0 & x.a0 & ~x.a1) | (y.c0 & ~x.a0 & x.a1) | (y.c0 & x.a0 & x.a1); res.a1 = (y.a1 & ~x.a0 & ~x.a1) | (y.b1 & x.a0 & ~x.a1) | (y.c1 & ~x.a0 & x.a1) | (y.c1 & x.a0 & x.a1); res.b0 = (y.a0 & ~x.b0 & ~x.b1) | (y.b0 & x.b0 & ~x.b1) | (y.c0 & ~x.b0 & x.b1) | (y.c0 & x.b0 & x.b1); res.b1 = (y.a1 & ~x.b0 & ~x.b1) | (y.b1 & x.b0 & ~x.b1) | (y.c1 & ~x.b0 & x.b1) | (y.c1 & x.b0 & x.b1); res.c0 = (y.a0 & ~x.c0 & ~x.c1) | (y.b0 & x.c0 & ~x.c1) | (y.c0 & ~x.c0 & x.c1) | (y.c0 & x.c0 & x.c1); res.c1 = (y.a1 & ~x.c0 & ~x.c1) | (y.b1 & x.c0 & ~x.c1) | (y.c1 & ~x.c0 & x.c1) | (y.c1 & x.c0 & x.c1); res.d0 = (y.a0 & ~x.d0 & ~x.d1) | (y.b0 & x.d0 & ~x.d1) | (y.c0 & ~x.d0 & x.d1) | (y.c0 & x.d0 & x.d1); res.d1 = (y.a1 & ~x.d0 & ~x.d1) | (y.b1 & x.d0 & ~x.d1) | (y.c1 & ~x.d0 & x.d1) | (y.c1 & x.d0 & x.d1); return res; }

This is a lot less code than the autogenerated transform code of the original proof of concept implementation. Now there's some further symmetries that we can immediately exploit. Since there are no duplicate values in a permutation tuple, the last element is uniquely determined by the remaining ones, via d = a ^ b ^ c, and we can get rid of two components.

Since we're in the subgroup A4, the third element also happens to be uniquely determined as we only allow even permutations. After some logic optimizations, we arrive at the rather compact

struct A4x32 { uint32_t a1; uint32_t a0; uint32_t b1; uint32_t b0; }; A4x32 mul(A4x32 x, A4x32 y) { A4x32 res; uint32_t yc1 = y.a0 ^ y.b0 ^ y.b1; uint32_t ys = y.a1 ^ y.b1; uint32_t yc0 = ~((ys & y.a0) ^ (~y.s & y.b0)); res.a0 = (y.a0 & ~(x.a0 ^ x.a1)) ^ (y.b0 & x.a0) ^ (yc0 & x.a1); res.a1 = (y.a1 & ~(x.a0 ^ x.a1)) ^ (y.b1 & x.a0) ^ (yc1 & x.a1); res.b0 = (y.a0 & ~(x.b0 ^ x.b1)) ^ (y.b0 & x.b0) ^ (yc0 & x.b1); res.b1 = (y.a1 & ~(x.b0 ^ x.b1)) ^ (y.b1 & x.b0) ^ (yc1 & x.b1); return res; }

With this key piece, implementing the remaining function is straightforward. After prefix scanning, we only need to bring the transforms back into a representation that allows us to efficiently transform each octant in 3D. Luckily it's relatively easy to convert from the representation as a permutation tuple to a 3D transformation matrix. Putting it all together, this is the result:

void hilbertTo3D_logarithmic(uint64_t hilbertIndex, uint32_t bits, uint32_t& x, uint32_t& y, uint32_t& z) { // Deinterleave index bits - use a different method if pext is not available uint32_t i0 = _pext_u64(hilbertIndex, 0x9249249249249249ull << 0); uint32_t i1 = _pext_u64(hilbertIndex, 0x9249249249249249ull << 1); uint32_t i2 = _pext_u64(hilbertIndex, 0x9249249249249249ull << 2); // Align to MSB i0 <<= 32 - bits; i1 <<= 32 - bits; i2 <<= 32 - bits; // Compute initial transform from index uint32_t j = ~(i1 ^ i2); uint32_t ta1 = (i2 & ~i1 & i0) | (j & ~i0); uint32_t ta0 = i0 | i1 | i2; uint32_t tb1 = (~i2 & i1 & ~i0) | (j & i0); uint32_t tb0 = i2 & i1 & i0; // Shift downwards, filling MSB with identity transform ta0 >>= 1; ta1 >>= 1; tb0 = (tb0 >> 1) | 0x80000000; tb1 >>= 1; // Prefix scan iteration auto fold = [&](uint32_t shift) { // Shift downwards uint32_t sa0 = ta0 >> shift; uint32_t sa1 = ta1 >> shift; uint32_t sb0 = (tb0 >> shift) | (~0 << (32 - shift)); uint32_t sb1 = tb1 >> shift; // Recover third element of permutation which is uniquely determined in A4 by the first two elements uint32_t sc1 = sa0 ^ sb0 ^ sb1; uint32_t ss = sa1 ^ sb1; uint32_t sc0 = ~((ss & sa0) ^ (~ss & sb0)); // Multiply permutations uint32_t ra0 = (sa0 & ~(ta0 ^ ta1)) ^ (sb0 & ta0) ^ (sc0 & ta1); uint32_t ra1 = (sa1 & ~(ta0 ^ ta1)) ^ (sb1 & ta0) ^ (sc1 & ta1); uint32_t rb0 = (sa0 & ~(tb0 ^ tb1)) ^ (sb0 & tb0) ^ (sc0 & tb1); uint32_t rb1 = (sa1 & ~(tb0 ^ tb1)) ^ (sb1 & tb0) ^ (sc1 & tb1); // Store current transform ta0 = ra0; ta1 = ra1; tb0 = rb0; tb1 = rb1; }; // Apply prefix scan fold(1); fold(2); fold(4); fold(8); fold(16); // Recover untransformed octants uint32_t o0 = i0 ^ i1; uint32_t o1 = i1 ^ i2; uint32_t o2 = i2; // Build transformation matrix columns from permutation uint32_t m0 = ~(ta1 ^ tb1); uint32_t m2 = ~(ta0 ^ tb0); uint32_t m1 = ~(m0 | m2); uint32_t ma = tb0 ^ m0; uint32_t mc = ta1 ^ m2; uint32_t mb = ~ta1 & ~ta0 & tb1 | ta1 & ta0 & ~tb1 | ~ta0 & tb1 & tb0 | ta0 & ~tb1 & ~tb0; // Apply transform to original octants x = (m0 & o0) ^ (m1 & o1) ^ (m2 & o2) ^ ma; y = (m2 & o0) ^ (m0 & o1) ^ (m1 & o2) ^ mb; z = (m1 & o0) ^ (m2 & o1) ^ (m0 & o2) ^ mc; // Undo MSB alignment x >>= 32 - bits; y >>= 32 - bits; z >>= 32 - bits; }

The resulting code quite nicely demonstrates all of the symmetries inherent to the problem, so I consider it unlikely to find any further significant reductions to the bit logic. The instruction count per iteration is unfortunately still massive compared to the rather trivial O(n) version, so for small bit counts this method is slower in practice despite the lower asymptotic complexity. The cutoff point seems to be around transforming 63 -> 3x21 bits, where it starts being just barely faster on my machine. So for computing positions for massive Hilbert indices (>64 bits) the logarithmic version would be preferable, although I consider that a very rare use case.

]]>Applying the same would also be possible for the XYZ -> Hilbert Index mapping, however the code would be at least 10x-20x slower due to the exponential blowup of the technique, and probably rather tedious to generate.

// Shamelessly borrowed from http://and-what-happened.blogspot.de/2011/08/fast-2d-and-3d-hilbert-curves-and.html void Morton_3D_Decode_10bit(const uint32_t morton, uint32_t& index1, uint32_t& index2, uint32_t& index3) { // unpack 3 10-bit indices from a 30-bit Morton code uint32_t value1 = morton; uint32_t value2 = (value1 >> 1); uint32_t value3 = (value1 >> 2); value1 &= 0x09249249; value2 &= 0x09249249; value3 &= 0x09249249; value1 |= (value1 >> 2); value2 |= (value2 >> 2); value3 |= (value3 >> 2); value1 &= 0x030c30c3; value2 &= 0x030c30c3; value3 &= 0x030c30c3; value1 |= (value1 >> 4); value2 |= (value2 >> 4); value3 |= (value3 >> 4); value1 &= 0x0300f00f; value2 &= 0x0300f00f; value3 &= 0x0300f00f; value1 |= (value1 >> 8); value2 |= (value2 >> 8); value3 |= (value3 >> 8); value1 &= 0x030000ff; value2 &= 0x030000ff; value3 &= 0x030000ff; value1 |= (value1 >> 16); value2 |= (value2 >> 16); value3 |= (value3 >> 16); value1 &= 0x000003ff; value2 &= 0x000003ff; value3 &= 0x000003ff; index1 = value1; index2 = value2; index3 = value3; } void transformFromIndex_parallel(uint32_t i0, uint32_t i1, uint32_t i2, uint32_t& t0, uint32_t& t1, uint32_t& t2, uint32_t& t3) { t0 = i2 | (i0 & i1); t1 = (~i0 & ~i1 & ~i2) | (i0 & i1 & ~i2) | (~i0 & ~i1 & i2); t2 = ~(i0 & i1) ^ i2; t3 = (i0 & ~i1 & i2) | (~i0 & i1 & i2); } void multiplyTransform_parallel( uint32_t A0, uint32_t A1, uint32_t A2, uint32_t A3, uint32_t B0, uint32_t B1, uint32_t B2, uint32_t B3, uint32_t& C0, uint32_t& C1, uint32_t& C2, uint32_t& C3) { C0 = (A0 & ~A1 & ~A2 & B0 & B1 & ~B2 & ~B3) | (A0 & ~A2 & ~A3 & ~B0 & B2) | (A0 & ~A1 & ~A2 & ~A3 & B0 & ~B1 & B3) | (A0 & ~A2 & ~A3 & ~B0 & B1 & B3) | (A0 & A1 & A3 & ~B0 & B1 & ~B2 & ~B3) | (A1 & ~A3 & ~B0 & B1 & ~B2 & ~B3) | (~A0 & A1 & ~A2 & ~A3 & B1 & B3) | (A1 & A2 & B0 & ~B1 & ~B2 & ~B3) | (A0 & ~A1 & A3 & B0 & ~B2 & ~B3) | (~A0 & ~A1 & A3 & B0 & B1 & ~B3) | (A1 & ~A2 & ~A3 & ~B0 & ~B1 & B2) | (~A0 & A1 & A2 & B0 & B2) | (~A0 & A1 & ~A2 & ~B0 & ~B1 & B3) | (A0 & A1 & ~A2 & ~B0 & ~B1 & ~B2) | (~A0 & A1 & ~A3 & B0 & ~B1 & B2) | (~A0 & A2 & ~B0 & ~B1 & B3) | (~A0 & A1 & A3 & ~B0 & B2) | (~A0 & ~A1 & ~A2 & ~A3 & B0) | (A0 & A1 & A3 & ~B1 & ~B3) | (A0 & ~B0 & ~B1 & ~B2 & ~B3) | (A0 & A1 & ~A3 & ~B0 & ~B3) | (~A0 & A2 & B0 & B1 & B3) | (~A1 & A3 & B0 & ~B1 & B3) | (~A0 & A3 & B0 & ~B1 & B3) | (~A0 & ~A1 & A2 & B3) | (A0 & A2 & ~B2 & ~B3) | (~A1 & A2 & B0 & B2) | (~A0 & A3 & B1 & B3) | (~A1 & A3 & B1 & B2); C1 = (~A0 & A3 & B0 & B1 & ~B2 & ~B3) | (A0 & ~A1 & ~A2 & ~A3 & B0 & ~B1 & B3) | (~A0 & ~A2 & ~A3 & B0 & B1 & B2) | (A0 & A1 & A3 & ~B0 & B1 & ~B2 & ~B3) | (A0 & ~A3 & B0 & ~B1 & ~B2 & ~B3) | (~A0 & ~A1 & A3 & B0 & ~B2 & ~B3) | (~A1 & A2 & ~B0 & B1 & ~B2 & ~B3) | (A0 & ~A1 & A3 & B0 & ~B2 & ~B3) | (A0 & ~A1 & A3 & B1 & ~B2 & ~B3) | (~A1 & ~A2 & ~A3 & ~B0 & B1 & B2) | (~A0 & A1 & A2 & B0 & B2) | (A0 & A1 & ~A2 & ~B0 & ~B1 & ~B2) | (~A0 & A1 & ~A3 & B0 & ~B1 & B2) | (~A0 & A2 & B1 & ~B2 & ~B3) | (~A0 & ~A1 & A2 & ~B1 & B2) | (A1 & ~A2 & ~A3 & ~B1 & B3) | (A0 & ~A1 & ~A2 & ~B0 & B2) | (~A0 & ~A3 & ~B0 & B1 & B3) | (~A0 & ~A1 & ~A3 & B1 & ~B2) | (A0 & A1 & A3 & B1 & B2) | (A0 & ~A3 & ~B0 & ~B1 & B3) | (A1 & ~B0 & ~B1 & ~B2 & ~B3) | (A0 & A1 & ~A3 & ~B1 & ~B2) | (A0 & A1 & ~A2 & ~A3 & ~B1) | (~A0 & A3 & B0 & ~B1 & B3) | (A1 & A3 & ~B0 & B1 & B3) | (~A0 & A3 & ~B1 & B2) | (A0 & A2 & B1 & B2) | (~A1 & B0 & B1 & B3); C2 = (A0 & ~A1 & A2 & ~B1 & ~B2 & ~B3) | (~A0 & A1 & ~A2 & B0 & B1 & ~B2 & ~B3) | (A0 & A1 & ~A2 & B0 & ~B1 & ~B2 & ~B3) | (A0 & ~A1 & ~A2 & B0 & B1 & ~B2 & ~B3) | (A0 & A1 & A2 & ~B0 & ~B2 & ~B3) | (~A0 & ~A1 & A3 & B0 & ~B1 & B2) | (~A1 & ~A2 & ~A3 & B0 & ~B1 & B2) | (~A0 & A1 & A2 & ~B0 & B1 & B2) | (~A0 & ~A1 & A3 & ~B0 & B1 & ~B3) | (A1 & A3 & B0 & ~B1 & ~B2 & ~B3) | (~A0 & A1 & A2 & B0 & ~B1 & B2) | (A0 & ~A1 & ~A2 & ~B0 & ~B1 & B3) | (~A0 & ~A1 & ~A3 & ~B0 & ~B1 & B2) | (A1 & ~A2 & ~A3 & B0 & B1 & B2) | (~A0 & ~A1 & ~A3 & B0 & B1 & B2) | (A0 & A1 & A3 & ~B0 & B1 & ~B2 & ~B3) | (A0 & A1 & ~A2 & ~B0 & B1 & B3) | (A0 & A1 & A3 & B0 & ~B1 & ~B2) | (A0 & ~A1 & A3 & B1 & ~B2 & ~B3) | (~A1 & ~A2 & ~A3 & ~B0 & B1 & B2) | (A1 & ~A2 & ~A3 & ~B0 & ~B1 & B2) | (~A0 & A1 & ~A2 & ~B0 & ~B1 & B3) | (A2 & ~B0 & ~B1 & ~B2 & ~B3) | (A0 & ~A1 & A2 & B1 & B3) | (A0 & A1 & A2 & B0 & B3) | (~A0 & A2 & B0 & B1 & B3) | (~A1 & A3 & B0 & ~B1 & B3) | (A1 & A3 & ~B0 & B1 & B3); C3 = (A0 & A1 & ~A2 & ~A3 & ~B0 & B1 & ~B2 & ~B3) | (A0 & ~A1 & ~A2 & ~A3 & B0 & B1 & B2) | (A0 & A1 & ~A2 & ~A3 & ~B0 & ~B1 & B3) | (A0 & A1 & A3 & B0 & B1 & ~B2 & ~B3) | (~A0 & ~A1 & A2 & B0 & ~B1 & ~B3) | (A0 & ~A1 & ~A2 & ~B0 & B1 & B3) | (~A0 & A1 & ~A2 & ~B0 & B1 & B2) | (A0 & ~A1 & ~A2 & ~A3 & B0 & ~B1 & B3) | (~A0 & A1 & ~A3 & B0 & ~B1 & B3) | (A0 & A1 & ~A3 & B0 & ~B1 & B2) | (~A0 & ~A1 & A3 & B0 & ~B2 & ~B3) | (~A1 & A2 & ~B0 & B1 & ~B2 & ~B3) | (~A0 & A1 & ~A2 & ~A3 & B1 & B3) | (A1 & A2 & B0 & ~B1 & ~B2 & ~B3) | (~A0 & ~A1 & A3 & B0 & B1 & ~B3) | (~A0 & ~A1 & ~A2 & ~A3 & B3) | (~A1 & A2 & ~B0 & ~B1 & B3) | (A3 & ~B0 & ~B1 & ~B2 & ~B3) | (A0 & ~A1 & A2 & ~B1 & B2) | (~A2 & ~A3 & B0 & B1 & B3) | (~A0 & A1 & A3 & ~B0 & ~B3) | (A1 & A2 & B0 & B1 & ~B3) | (A0 & A3 & ~B0 & B2); } void octantFromIndex_parallel(uint32_t i0, uint32_t i1, uint32_t i2, uint32_t& x, uint32_t& y, uint32_t& z) { x = i0 ^ i1; y = i1 ^ i2; z = i2; } void applyTransform_parallel(uint32_t t0, uint32_t t1, uint32_t t2, uint32_t t3, uint32_t x, uint32_t y, uint32_t z, uint32_t& o0, uint32_t& o1, uint32_t& o2) { o2 = (~t3 | t1 | t0 | ~z)&(t3 | t1 | ~t0 | ~y)&(~t3 | ~t2 | ~z | ~y | x)&(~t3 | t2 | t1 | ~t0 | ~x)&(t3 | ~t2 | t1 | t0 | x)&(t3 | ~t2 | ~t1 | t0 | y)&(~t3 | t2 | ~t1 | t0 | y)&(t3 | ~t2 | ~t1 | ~t0 | ~x)&(~t3 | t2 | ~t1 | ~t0 | ~z)&(t3 | t2 | t1 | t0 | z)&(t3 | t2 | ~t1 | t0 | x)&(t3 | t2 | ~t1 | ~t0 | z)&(~t3 | ~t2 | t1 | ~z | y | x)&(~t3 | ~t2 | t1 | z | y | ~x)&(~t3 | ~t2 | t0 | z | ~y | ~x)&(~t3 | ~t2 | ~t1 | ~t0 | y | x)&(~t3 | t0 | ~z | y | ~x)&(~t2 | t1 | t0 | ~y | x); o1 = (~t3 | ~t2 | ~y | ~x)&(~t3 | ~t1 | t0 | ~x)&(~t3 | t1 | t0 | ~y)&(~t3 | ~t2 | t1 | ~z | ~x)&(t3 | ~t2 | z | x)&(~t3 | ~t2 | t1 | z | ~y)&(~t2 | t1 | t0 | z | ~x)&(~t3 | ~t2 | t0 | z | ~y)&(~t2 | ~t1 | ~t0 | z | ~x)&(~t3 | ~t2 | ~t1 | y | x)&(~t3 | t2 | t1 | ~t0 | ~z)&(~t3 | t2 | ~t1 | ~t0 | y)&(t3 | t2 | t1 | t0 | y)&(t3 | t2 | t1 | ~t0 | ~x)&(t2 | ~t1 | t0 | ~z | ~x)&(t3 | ~t1 | t0 | ~z | x)&(t3 | t2 | ~t1 | ~t0 | ~y)&(~t3 | ~t2 | t1 | ~z | y | x)&(t3 | ~t2 | t1 | ~t0 | ~z | x)&(~t3 | ~t2 | ~t0 | z | y | x); o0 = (~t3 | ~t2 | ~y | ~x)&(~t2 | ~t1 | t0 | z)&(t2 | t1 | t0 | x)&(t2 | ~t1 | ~t0 | ~x)&(~t3 | t0 | ~z | x)&(~t3 | ~t2 | t1 | ~z | ~x)&(~t3 | ~t2 | t1 | z | ~y)&(~t3 | ~t2 | ~z | ~y | x)&(~t3 | ~t2 | ~t1 | z | ~x)&(t3 | ~t2 | t1 | t0 | y)&(~t3 | t2 | t1 | ~t0 | y)&(~t2 | t1 | ~t0 | ~z | ~x)&(~t3 | t2 | ~t1 | ~z | ~x)&(t3 | ~t2 | ~t1 | ~t0 | ~y)&(t3 | t2 | t1 | ~t0 | z)&(t3 | t2 | ~t1 | t0 | ~y)&(t3 | ~t2 | t1 | ~t0 | ~z | x)&(~t3 | ~t2 | ~t0 | z | y | x)&(~t2 | t0 | z | y | ~x); } void hilbertToMorton3D_logarithmic(uint32_t hilbertIndex, uint32_t bits, uint32_t& x, uint32_t& y, uint32_t& z) { uint32_t i0, i1, i2; Morton_3D_Decode_10bit(hilbertIndex, i0, i1, i2); i0 <<= 32 - bits; i1 <<= 32 - bits; i2 <<= 32 - bits; uint32_t t0, t1, t2, t3; transformFromIndex_parallel(i0, i1, i2, t0, t1, t2, t3); { t0 >>= 1; t1 >>= 1; t2 >>= 1; t3 >>= 1; } for (uint32_t shift = 1; shift <= 16; shift *= 2) { uint32_t s0, s1, s2, s3; s0 = t0 >> shift; s1 = t1 >> shift; s2 = t2 >> shift; s3 = t3 >> shift; uint32_t T0, T1, T2, T3; multiplyTransform_parallel(t0, t1, t2, t3, s0, s1, s2, s3, T0, T1, T2, T3); t0 = T0; t1 = T1; t2 = T2; t3 = T3; } uint32_t o0, o1, o2; octantFromIndex_parallel(i0, i1, i2, o0, o1, o2); applyTransform_parallel(t0, t1, t2, t3, o0, o1, o2, x, y, z); x >>= 32 - bits; y >>= 32 - bits; z >>= 32 - bits; }]]>

static const uint8_t mortonToHilbertTable[] = { 48, 33, 27, 34, 47, 78, 28, 77, 66, 29, 51, 52, 65, 30, 72, 63, 76, 95, 75, 24, 53, 54, 82, 81, 18, 3, 17, 80, 61, 4, 62, 15, 0, 59, 71, 60, 49, 50, 86, 85, 84, 83, 5, 90, 79, 56, 6, 89, 32, 23, 1, 94, 11, 12, 2, 93, 42, 41, 13, 14, 35, 88, 36, 31, 92, 37, 87, 38, 91, 74, 8, 73, 46, 45, 9, 10, 7, 20, 64, 19, 70, 25, 39, 16, 69, 26, 44, 43, 22, 55, 21, 68, 57, 40, 58, 67, }; static const uint8_t hilbertToMortonTable[] = { 48, 33, 35, 26, 30, 79, 77, 44, 78, 68, 64, 50, 51, 25, 29, 63, 27, 87, 86, 74, 72, 52, 53, 89, 83, 18, 16, 1, 5, 60, 62, 15, 0, 52, 53, 57, 59, 87, 86, 66, 61, 95, 91, 81, 80, 2, 6, 76, 32, 2, 6, 12, 13, 95, 91, 17, 93, 41, 40, 36, 38, 10, 11, 31, 14, 79, 77, 92, 88, 33, 35, 82, 70, 10, 11, 23, 21, 41, 40, 4, 19, 25, 29, 47, 46, 68, 64, 34, 45, 60, 62, 71, 67, 18, 16, 49, }; uint32_t transformCurve(uint32_t in, uint32_t bits, const uint8_t* lookupTable) { uint32_t transform = 0; uint32_t out = 0; for (int32_t i = 3 * (bits - 1); i >= 0; i -= 3) { transform = lookupTable[transform | ((in >> i) & 7)]; out = (out << 3) | (transform & 7); transform &= ~7; } return out; } uint32_t mortonToHilbert3D(uint32_t mortonIndex, uint32_t bits) { return transformCurve(mortonIndex, bits, mortonToHilbertTable); } uint32_t hilbertToMorton3D(uint32_t hilbertIndex, uint32_t bits) { return transformCurve(hilbertIndex, bits, hilbertToMortonTable); }

The 3D Hilbert curve transforms under the group A4, which has 12 elements. However as there are only 8 transformations that can appear on the left-hand side of each multiplications, only an 8x12 subset of the multiplication table is required. By folding the the index <=> octant mapping function into this table and additionally storing each index respectively octant into the lower 3 bits of each entry, we end up with a compact 96 byte LUT for each function.

These functions can be used as drop-in replacement for the algorithm provided by Fast 2D and 3D Hilbert curves and Morton codes, but are faster by a factor of roughly 2.5 on my machine.

]]>https://github.com/rawrunprotected/rasterizer

This is state-of-the-art software occlusion culling system; similar to Intel's Software Occlusion Culling, but better optimized. By default, the application loads and renders Crytek's Sponza mesh. Camera controls are simple WASD to move and cursor keys to control the camera. The current version requires AVX2 support, so you'll need a modern CPU to run it; although downgrading to AVX or SSE4 should be straightforward if needed at some point.

Here's a quick overview of some of the techniques the rasterizer makes use of:

Clipless or homogeneous rasterization, is not exactly a new idea. The basic principle is that the usual perspective division by W after vertex transformation is ommitted and the triangle edge equations are directly projected to 2D space, thus saving a division, and more importantly, clipping against the near plane and guard band. A nice side-effect is that no near plane is required anymore.

However, implementing this naively brings more problems at first than it solves: Perspective division brings vertex coordinates and thus edge equations into pixel space, where they can be rounded to fixed-point coordinates with well-defined error behavior. Ommitting this step forces the rasterizer to either operate in floating point with a ridiculously high number of mantissa bits, or use some other magic trick to ensure the rasterization result remains artifact-free, as a small change to the edge equations can result in large deviations from the ground truth. Besides, without having explicit 2D vertex positions, there is no explicit bounding box, making it difficult to limit the screen area that needs to be scanned.

As it turns out, neither magic nor high-precision floats are required after all: Instead, the practical approach to clipless rasterization is to simply clamp the W component away from zero to avoid overflow and perform the perspective division after all. Afterwards, rounding of the vertex coordinates and edge equation computation can be performed as usual; although the edge equations need to be sign-inverted once for each vertex with W < 0. Similarly, the backface culling test result needs to be flipped for each vertex behind the camera. Computing the bounding box is somewhat tricky, as one needs to determine the bounding box for all vertices with W > 0 and all with W < 0 separately, and extend the former towards infinity depending on its relation to the latter.
As this introduces some overhead compared to the standard triangle setup, I've templated the rasterization routine to only perform the extra steps for W < 0 if a batch of triangles is close to the camera plane. Search for `possiblyNearClipped`

in Rasterizer.cpp to see where special handling is required.

Another interesting idea is precomputing the coverage masks for each 8x8 pixel block. This approach is easily explained: For each edge equation, the slope and offset at the center of a pixel block are quantized, and then looked up in a precomputed table to yield a 64bit coverage mask. By ANDing together the masks for each edge, one finds the mask of all pixels that are inside the triangle. Since quantization invariably modifies the edge equation's slope, this results in minor "bristle" artifacts for thin triangles. For a graphical application, these would be unacceptable, however, as these bristles are no more than a pixel wide and don't extend across block boundaries, there's little risk of incorrect over-occlusion, so they can be safely ignored for this rasterizer's intended use case.

The lookup table is sufficiently small to not have any significant impact on cache at only 32KB, with an upper bound of 2.5KB being actually sampled for each pair of triangles in the worst case. Curiously, the lookup should be a perfect use case for AVX2's gather instruction set, however performing the lookup in scalar code turned out to be invariably faster on a SkyLake CPU. FMA on the other hand, the other major feature coming with AVX2-capable CPUs, brings a solid 20% performance gain from speeding up vertex transformation.

The major requirement for any form of vertex buffer compression is that multiple occurences of a vertex encode and decode to the exact same floating point value. This is both sufficient and necessary to ensure that the compression step doesn't introduce any holes in the mesh. As variable-length compression schemes rarely play nice with SIMD, I've chosen to encode vertices straightforwardly as 11-11-10 fixed point format, relative to the models bounding box. This cuts occluder size to a third, compared to storing uncompressed vertex data. Since the rasterizer bakes the bounding box transform into the model view projection matrix, and the integer masking operation during decoding can be pipelined with the matrix multiplication, decompression is practically zero-cost.

Note that using an indexed representation doesn't provide any speedup here: Contrary to the GPU case, the rasterizer is neither bound by vertex fetch bandwidth nor processing cost, and the overhead of gathering via the index buffer managing a cache in software is fairly large. In addition, having a flat linear input stream allows perfect usage of streaming load instructions to prevent cache pollution.

By rasterizing quadrilaterals instead of triangles, setup cost and overdraw are significantly reduced. Unfortunately, one can't rely on coplanar pairs of triangles remaining coplanar after vertex quantization, so the rasterizer expliticly deals with quads becoming non-convex after projection by generating the coverage mask for each half quad separately to prevent introducing holes. Triangles that can't be merged to quads are simply handled by duplicating one of the vertices, effectively degenerating one of the quad edges. Despite this overhead, rendering quads is overall 20-30% faster.

Typical GPU depth buffer compression schemes conserve bandwidth by attempting to store only each triangle's depth derivatives and a per-pixel index mask for a fixed block size, rather than the full interpolated depth values. This doesn't map well to a software implementation, so instead the rasterizer uses a customized 16bit minifloat format to reduce the depth buffer size.

The minifloat format has 4 exponent and 12 mantissa bits, with no sign bit; with decoded values ranging from 0 to 1. The trick for quickly encoding into this format is first rescaling an input 32bit float with 2.5237386e-29f (which has the bit pattern 0xFFFF << 12), and then right-shifting by 12 bits. This maps 0.0 to the bit pattern 0x0000, and 1.0 to the pattern 0xFFFF, with an exponential distribution in between. By making use of SSE's saturating packing instruction, one also gets well-defined clamping behavior for values outside of the [0, 1] range. The rescaling value is simply baked into the projection matrix, which means that 8 depth values can be encoded extremely quickly with only 3 integer operations. Combined with an inverted depth range with near at 1.0f and far at 0.0f, so that the denser floating point distribution around 0 counteracts the perspective error, this minifloat actually has a significantly better worst-case depth precision for typical near/far ratios than a 24bit fixed point depth buffer.

Besides these high-level tricks, there are tons of smaller bits and pieces to improve performance, such as computing z from its linear relation with 1/W; or rounding vertex coordinates to force zero-area culling even though the rasterizer doesn't run in fixed point. The bulk of the optimization effort however is completely invisible: Several hundred hours have been spent on cycle shaving, reordering to improve pipelining, or running assembly through IACA to minimize instruction latencies.

Finally, a significant contribution usually comes from proper preprocessing of the geometry. In a typical game application, you would use a simplified version of the scene geometry. This is out of scope for this sample, though, so the preprocessing is limited to splitting the scene into batches according to the surface area heuristic, and clustering by normal vectors to improve backface culling coherency. The end result is still the full, unsimplified render geometry, just with a little bit of reordering.

I still haven't decided on an open-source license yet; feel free to ping me if you have suggestions.

]]>Today I present a novel technique for mapping a point in the -dimensional plane to its corresponding -bit offset on the th order Hilbert curve with a running time of on a processor with -bit words. While a large constant factor makes it infeasible in higher dimensions, it is significantly faster than the previous best method for .

To explain the approach, let's consider the reverse computation first: mapping an offset on the Hilbert curve to its corresponding coordinates.

Given a -bit number describing the index on the -dimensional Hilbert curve of order , split the index into groups of bits each, starting from the most significant bit. Each of these tuples through describes both an orthant to recurse into, determining one bit for the each of the coordinates axes (which we'll group together as ) of the point on the curve, as well as the transformation that is to be applied to the next recursion level of the curve.

We have

Where is the function mapping index bits to an orthant, is the function mapping index bits to an element of the transformation group of the Hilbert curve, and is the operator of that group. If , and are known for a particular dimension, this yields an algorithm for mapping offsets on the curve to the corresponding coordinates, and vice versa, by successively computing the product of transformations at each recursion level.

Designing these functions efficiently is rather involved, though, and as far as I'm aware this direct approach has only been applied to the two dimensional case, either via more or less explicit geometric operations (http://www.hackersdelight.org/hdcodetxt/hilbert/lamxy.c.txt), or lookup tables (http://www.hackersdelight.org/hdcodetxt/hilbert/hil_xy_from_s.c.txt, http://www.hackersdelight.org/hdcodetxt/hilbert/hil_s_from_xy.c.txt). For higher dimensions, methods based on Grey codes are usually employed.

In this written form, it is readily apparent that one can apply a prefix scan technique to parallelize the mapping from index to coordinates by computing the partial products of transformations in number of steps logarithmic in the order of the curve. In two dimensions, observe that the group of transformations applied to each quadrant is simply the Klein four-group, which is trivial to implement as a bitwise XOR. This allows us to parallelize across the individual bits of an -bit register, a technique sometimes dubbed SWAR (SIMD within a register). An implementation can be found at http://www.hackersdelight.org/hdcodetxt/hilbert/glsxy.c.txt.

The same method generalizes to higher dimensions; however it quickly becomes infeasible to parallelize within a register as the transformation groups become too complex (in 3D, the transformation group is the alternating group A4, which does not seem to exhibit a structure that would make it particularly efficient to implement).

An open problem until now was to apply the same idea in the other direction: Can mapping a point to the corresponding Hilbert curve index be parallelized in the same way? The issue is that, given , we can't easily separate the partial product of transformations from the orthant without knowing the transformation of the previous recursion level, making it seem that is impossible to achieve sub-linear time.

As it turns out, there is a trick to do exactly that: We can't recover without knowledge of . However, we can compute a function object that evaluates under the assumption that . We now have a semi-group with the operation of functional composition - and can compute the partial sums efficiently via a prefix scan. Next, evaluate each function at identity, which yields each of , from which we can recover , then , and finally as is necessarily bijective.

Compared to mapping indices to coordinates, this unfortunately blows up the time complexity by a factor of the order of the transformation group, which appears to be super-exponential in . Combined with the above-mentioned difficulty of efficiently implementing the transformation group, this makes it unlikely that this method has any practical merit except for .

Without further ado, I present my implementation (for 32bit indices), which is three times faster than the lookup table based implemention linked above for 32bits. The code has been extensively transformed and optimized by hand to exploit some further symmetries of the transformation group for , so it bears little resemblance any more to the underlying mathematical structure outlined above.

uint32_t interleave(uint32_t x) { x = (x | (x << 8)) & 0x00FF00FF; x = (x | (x << 4)) & 0x0F0F0F0F; x = (x | (x << 2)) & 0x33333333; x = (x | (x << 1)) & 0x55555555; return x; } uint32_t hilbertXYToIndex_logarithmic(uint32_t x, uint32_t y) { uint32_t A, B, C, D; // Initial prefix scan round, prime with x and y { uint32_t a = x ^ y; uint32_t b = 0xFFFF ^ a; uint32_t c = 0xFFFF ^ (x | y); uint32_t d = x & (y ^ 0xFFFF); A = a | (b >> 1); B = (a >> 1) ^ a; C = ((c >> 1) ^ (b & (d >> 1))) ^ c; D = ((a & (c >> 1)) ^ (d >> 1)) ^ d; } { uint32_t a = A; uint32_t b = B; uint32_t c = C; uint32_t d = D; A = ((a & (a >> 2)) ^ (b & (b >> 2))); B = ((a & (b >> 2)) ^ (b & ((a ^ b) >> 2))); C ^= ((a & (c >> 2)) ^ (b & (d >> 2))); D ^= ((b & (c >> 2)) ^ ((a ^ b) & (d >> 2))); } { uint32_t a = A; uint32_t b = B; uint32_t c = C; uint32_t d = D; A = ((a & (a >> 4)) ^ (b & (b >> 4))); B = ((a & (b >> 4)) ^ (b & ((a ^ b) >> 4))); C ^= ((a & (c >> 4)) ^ (b & (d >> 4))); D ^= ((b & (c >> 4)) ^ ((a ^ b) & (d >> 4))); } // Final round and projection { uint32_t a = A; uint32_t b = B; uint32_t c = C; uint32_t d = D; C ^= ((a & (c >> 8)) ^ (b & (d >> 8))); D ^= ((b & (c >> 8)) ^ ((a ^ b) & (d >> 8))); } // Undo transformation prefix scan uint32_t a = C ^ (C >> 1); uint32_t b = D ^ (D >> 1); // Recover index bits uint32_t i0 = x ^ y; uint32_t i1 = b | (0xFFFF ^ (i0 | a)); return (interleave(i1) << 1) | interleave(i0); }]]>

One reason that comes to mind is that rasterization typically uses the projected depth value Z/W, because it's convenient (the difference between perspective and orthogonal projection boils down to a few matrix values) and efficient (Z/W is linear in screen space, making hardware implementation trivial). This value isn't equi-distributed along the view vector however, and quickly goes towards minus infinity as we get closer to the camera. Historically, the near plane make sense as an arbitrary cut-off point for fixed-point depth values.

Nowadays, we could easily work around this by storing Z/W as floating point values in the depth buffer, or switching to linear depth Z altogether. Modern GPUs typically use 32bit depth buffers anyway - using D24S8 as a depth stencil format often means that the hardware has a 32bit buffer for depth (out of which 8bits are unused padding) and separate 8bit stencil buffer, potentially even in a specialized on-chip memory region. Using floating point depth doesn't incur any extra cost, so why is the near plane still a thing?

Consider the basic steps of rasterizing a 3D triangle:

- Project 3D vertices to 2D pixel positions

- Determine pixels covered by 2D triangle

- Interpolate vertex attributes with perspective correction for each pixel

Except that this is not quite correct. **When projecting a 3D triangle onto a 2D screen, the result is not a triangle.** In particular, when part of a triangle is in front and another part is behind the camera plane, the resulting 2D shape has three edges, two of which extend to infinity rather than meeting in a common point. However because it is much easier, both on an intuitive as well as technical level, to work under the assumption that 3D triangle map to 2D triangles, the near plane was introduced in order to clip away those parts of the geometry that don't fullfil this condition.

A better mental model is perhaps that a half-space in 3D is projected to a half-space in 2D. A triangle is simply the intersection of 3 half-spaces as defined by its edges, and the shape projected by a triangle is the intersection of the projection of each half-space defining the triangle. As it happens, these half-spaces do not necessarily enclose a finite area.

Until ray tracing finally takes over for real-time rendering, which is usually only 10 years from now at any given point of time, there's a technique called clipless rasterization that does not require clipping geometry against the near plane. It works in terms of projecting 3D edges directly to 2D half-space equations, against which points in the plane can be tested. NVIDIA GPUs have implemented this technique for a number of years, while AMD still uses traditional clipping rasterization.

While implementing a clipless rasterizer is initially straightforward to implement once all the math is worked out, it's fairly difficult to control precision. With clipping, the computed 2D vertex position are typically rounded to some fixed-point format, after which all operations are carried out exactly. Any deviation between the rasterized area and the "ground truth" is directly determined by the chosen fixed point precision. Without clipping, we only have edge equations, which are very sensitive to numerical errors. Consider two edges meeting at an acute angle: If either edge is offset by a small amount, the intersection point can potentially move very far. Another issue is that two triangles sharing an edge might generate numerically different but geometrically identical edge equations, but the rasterizer needs to ensure to generate the exact same pixels for each. One solution is to use floating point formats with enough mantissa bits to ensure that all edge computations are lossless. This is costly in hardware, but on the other hand it removes the need for complex clipping circuitry in the front-end.

However, all current graphics APIs are specified in terms of traditional rasterization, including the requirement of near-clipping. Because of this, NVIDIA emulates the effect of clipping against the near plane by discarding pixels below some depth threshold, meaning all of the user-facing benefits of clipless rasterization are lost.

Until those specifications are fixed, I guess we'll have to live with the near plane a little-bit longer.

]]>Conversion to R4G4B4A4, without application of dithering, yields a rather unsightly result. Note the significant color banding on the gradients, which results in the Mach band optical illusion. There is some heavy color banding of the parrot's plumage as well.

Dithering improves the artifacts and replaces most of banding by visually less-objectionable high-frequency noise.

However, you may notice that there now appear slightly darkened stripes in the linear gradients on the right side where the undithered image jumps to the next color level. This is because the image has been dithered in gamma space where adding up color values results in a color with slightly lower intensity. Linearizing color values before computing and distributing the error, these artifacts disappear completely:

Visually, there's little advantage of using sRGB-correct dithering for high-frequency photographic images, as the differences are neglible. For low-frequency gradients the gain in quality is obvious; however images with perfect gradients are most likely lookup tables and the like that wouldn't be stored at low color depths in the first place.

Anyway, the result using an approximate (and cheaper) gamma exponent of 2.0 is visually completely indistinguishable from applying the proper sRGB conversion for the purposes of dithering, so there's that.

]]>Inspired by this stackoverflow answer, which lists a method of expanding 5 and 6 bit color channels to their closest equivalent 8bit value with a cheap combination of integer multiplication, addition and shifting, the following table lists conversion formulas for mapping between UNORM ranges of 4, 5, 6, 8, 10 and 11 bits with correct rounding. This is enough to cover all common color formats (4444, 555, 565, 8888, 10-10-10, 11-11-10) and build fast and accurate conversion routines for them.

4bit Target | 5bit Target | 6bit Target | 8bit Target | 10bit Target | 11bit Target | |
---|---|---|---|---|---|---|

4bit Source | x | (x * 33 + 8) >> 4 | (x * 67 + 9) >> 4 | x * 17 | (x * 1091 + 9) >> 4 | (x * 546 + 1) >> 2 |

5bit Source | (x * 2 + 1) >> 2 | x | (x * 65 + 16) >> 5 | (x * 527 + 23) >> 6 | x * 33 | (x * 2113 + 16) >> 5 |

6bit Source | (x * 61 + 128) >> 8 | (x * 2 + 1) >> 2 | x | (x * 259 + 33) >> 6 | (x * 4157 + 128) >> 8 | (x * 130 + 1) >> 2 |

8bit Source | (x * 15 + 135) >> 8 | (x * 249 + 1024) >> 11 | (x * 253 + 512) >> 10 | x | (x * 1027 + 129) >> 8 | (x * 65761 + 4096) >> 13 |

10bit Source | (x * 961 + 32768) >> 16 | (x * 31 + 527) >> 10 | (x * 1009 + 8192) >> 14 | (x * 1021 + 2048) >> 12 | x | (x * 2049 + 512) >> 10 |

11bit Source | (x * 1921 + 131072) >> 18 | (x * 993 + 32256) >> 16 | (x * 2017 + 32768) >> 16 | (x * 2041 + 8192) >> 14 | (x * 2 + 1) >> 2 | x |

Each formula yields the exact same result as truncate(float(x) / float((1 << sourceBits) - 1) * float((1 << targetBits) - 1) + 0.5f). You can also find the full table for up to 16 bits here.

]]>