3D Hilbert curves in even fewer instructions

So far, the LUT based algorithm is the fastest method in general for mapping between Morton and Hilbert orders in 3D. This method performs linearly in the input size, consuming 3 bits of input and producing 3 bits of output per iteration. Let's break it down into 5 logical steps:

Extract 3 bits from input: uint32_t input = (in >> i) & 7;
Combine input bits with current transform: uint32_t lookupArg = transform | input;
Perform lookup: uint32_t lookupRes = lookupTable[lookupArg];
Append output bits: out = (out << 3) | (lookup & 7);
Extract new transform: transform = lookup & ~7;

Each of these steps is essential to the algorithm, and there's no obvious way to optimize each individual operation, such as masking and shifting bits. I'm also dismissing the obvious approach here of consuming more input bits per iteration, since that increases the size of the LUT exponentially.

To break the barrier of 5 (logical) instructions per iteration, we need to resort to a few tricks. As we need to shift 3 bits of input and 3 bits of output per iteration, let's try to combine these operations into one. On x86, the family of ror/rol instructions supports rotating shifts where bits shifted out the end of the register are inserted on the other end. This allows us to shift state around in a register without losing bits that are pushed out on one end.

As a rough sketch, let's consider then storing the entire state a single state register, with the input bits initially aligned to the MSB. By left-rotating by 3 we can shift 3 bits of input into the lowest bits of the state, where they are conveniently located for lookup. At the same time, this rotation pushes 3 output bits from the previous iteration to a higher position in the state register. At the end, all input bits are consumed, and we're left with the finished output in the high bits of our register.

Since the high bits of our state are used, we need to mask out the low bits (containing 3 bits of input and the current transform) to perform the lookup. However we can do this without any explicit bitmasking operation if the state register is a 64bit integer by casting to a 32bit integer. On x86 this amounts to move of the lower half of a register, which is essentially free as it's handled by the renaming engine in the front end. This also implies that our LUT needs to contain the 3 output bits at bit offsets 31 through 29, so that the next left rotation moves them into the upper 32 bits, keeping the lower 32 free for the LUT argument.

Finally, we need to clear the transform bits out of the state register and update them with the next transform. By pre-XORing each LUT entry with its argument, and XORing the state with the LUT result, we can clear the previous argument and replace it with the next transform at the same time.

Combined, these ideas yield the following algorithm:

static const uint32_t mortonToHilbertTable[] = {
    0x00000006, 0x20000005, 0x60000001, 0x40000007, 0xe0000001, 0xc000000c, 0x80000005, 0xa000000e,
    0x40000000, 0xa000000a, 0x6000000c, 0x8000000d, 0x20000004, 0xc000000e, 0x00000007, 0xe0000008,
    0x80000019, 0xe000001a, 0x6000001b, 0x00000010, 0xa0000012, 0xc0000013, 0x4000001c, 0x2000001d,
    0x4000001a, 0x60000019, 0x20000018, 0x00000011, 0xa000001b, 0x8000001d, 0xc0000019, 0xe000001e,
    0x00000020, 0x60000026, 0xe000002a, 0x80000024, 0x20000022, 0x40000023, 0xc000002c, 0xa000002d,
    0x80000022, 0x60000023, 0xa000002a, 0x40000020, 0xe0000025, 0x0000002a, 0xc000002e, 0x20000024,
    0x00000034, 0xe0000033, 0x20000032, 0xc0000038, 0x60000035, 0x80000034, 0x40000036, 0xa000003c,
    0x4000003d, 0x2000003c, 0xa000003b, 0xc000003a, 0x60000038, 0x00000036, 0x8000003a, 0xe000003c,
    0x8000004b, 0xa0000045, 0xe0000048, 0xc0000047, 0x6000004f, 0x4000004c, 0x00000047, 0x2000004e,
    0xc000004d, 0xa000004c, 0x2000004b, 0x4000004a, 0xe000004c, 0x8000004f, 0x00000046, 0x6000004d,
    0xc0000058, 0x20000052, 0xe0000056, 0x00000051, 0xa000005c, 0x40000056, 0x80000053, 0x60000052,
    0xc000005a, 0xe000005f, 0xa0000058, 0x80000053, 0x2000005b, 0x00000058, 0x40000059, 0x60000057,
};

static const uint32_t hilbertToMortonTable[] = {
    0x00000006, 0x20000005, 0x60000006, 0x40000000, 0xc0000007, 0xe000000c, 0xa000000f, 0x80000002,
    0xc0000001, 0x80000001, 0x00000002, 0x4000000d, 0x6000000a, 0x2000000e, 0xa000000d, 0xe0000008,
    0x60000013, 0xe000001b, 0xc0000018, 0x4000001a, 0x0000001d, 0x80000013, 0xa0000010, 0x2000001c,
    0x60000012, 0x4000001b, 0x00000018, 0x2000001b, 0xa000001c, 0x8000001a, 0xc0000019, 0xe000001e,
    0x00000020, 0x80000027, 0xa0000024, 0x20000024, 0x60000023, 0xe000002f, 0xc000002c, 0x4000002f,
    0xa000002f, 0xe0000022, 0x60000021, 0x20000021, 0x00000026, 0x4000002d, 0xc000002e, 0x80000026,
    0x00000034, 0x40000031, 0xc0000032, 0x80000032, 0xa0000035, 0xe000003e, 0x6000003d, 0x20000035,
    0xa0000033, 0x2000003c, 0x0000003f, 0x8000003f, 0xc0000038, 0x4000003c, 0x6000003f, 0xe000003c,
    0xc0000041, 0xe0000048, 0xa000004b, 0x80000048, 0x0000004f, 0x20000041, 0x60000042, 0x4000004d,
    0xc0000040, 0x40000048, 0x6000004b, 0xe0000049, 0xa000004e, 0x20000048, 0x0000004b, 0x8000004f,
    0x60000052, 0x20000052, 0xa0000051, 0xe0000056, 0xc0000051, 0x8000005d, 0x0000005e, 0x40000053,
    0xa000005d, 0x8000005e, 0xc000005d, 0xe0000053, 0x60000054, 0x4000005f, 0x0000005c, 0x20000059,
};

uint32_t transformCurve(uint32_t in, uint32_t bits, const uint32_t* lookupTable)
{
    uint64_t x = _rotr64(in, 3 * bits);

    for (uint32_t i = 0; i < bits; ++i)
    {
        x = _rotl64(x, 3);
        x ^= lookupTable[uint32_t(x)];
    }

    return x >> 29;
}

uint32_t mortonToHilbert3D(uint32_t mortonIndex, uint32_t bits)
{
    return transformCurve(mortonIndex, bits, mortonToHilbertTable);
}

uint32_t hilbertToMorton3D(uint32_t hilbertIndex, uint32_t bits)
{
    return transformCurve(hilbertIndex, bits, hilbertToMortonTable);
}

This method is faster than the previous implementation by about 30%. However, it only works on the x64 architecture, increases the size of the LUT to 4 bytes per entry, and is limited to 30bit inputs, so unfortunately it's not quite as useful in general.