Put in a couple hours to see how well it worked. The approximately correct form ...

Put in a couple hours to see how well it worked. The approximately correct form is slightly faster than just doing an FP division without proper rounding.

    uint8_t magic_divide(uint8_t a, uint8_t b) {
        float x = static_cast<float>(b);
        int i = 0x7eb504f3 - std::bit_cast<int>(x);
        float reciprocal = std::bit_cast<float>(i);
        return a*1.94285123f*reciprocal*(-x*reciprocal+1.43566f);
    }

No errors, ~20 cycles latency on skylake by uiCA. You can optimize it a bit further by approximating the newton raphson step though:

    uint8_t approx_magic_divide(uint8_t a, uint8_t b) {
        float x = static_cast<float>(b);
        int i = 0x7eb504f3 - std::bit_cast<int>(x);
        float reciprocal = std::bit_cast<float>(i);
        reciprocal = 1.385356f*reciprocal;
        return a*reciprocal;
    }

Slightly off when a is a multiple of b, but approximately correct. 15.5 cycles on skylake because of the dependency chain.