Put in a couple hours to see how well it worked. The approximately correct form is slightly faster than just doing an FP division without proper rounding.
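    #include <bit>      // std::bit_cast, needs C++20
    #include <cstdint>

    // b must be nonzero.
    uint8_t magic_divide(uint8_t a, uint8_t b) {
        float x = static_cast<float>(b);
        // Magic constant minus the float's bits: a rough, scaled estimate of 1/x.
        int i = 0x7eb504f3 - std::bit_cast<int>(x);
        float reciprocal = std::bit_cast<float>(i);
        // One refinement step with tuned constants; the return conversion truncates.
        return a*1.94285123f*reciprocal*(-x*reciprocal+1.43566f);
    }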
No errors, ~20 cycles of latency on Skylake according to uiCA. You can optimize it a bit further by approximating the Newton-Raphson step, though:
    uint8_t approx_magic_divide(uint8_t a, uint8_t b) {
        float x = static_cast<float>(b);
        int i = 0x7eb504f3 - std::bit_cast<int>(x);
        float reciprocal = std::bit_cast<float>(i);
        // A single constant scale stands in for the refinement step.
        reciprocal = 1.385356f*reciprocal;
        return a*reciprocal;
    }
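Slightly off when a is a multiple of b (the estimate can land just under the exact integer quotient, and truncation then drops it by one), but approximately correct. 15.5 cycles of latency on Skylake because of the dependency chain.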
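If you want to check claims like "no errors" or "slightly off" yourself, the input space is small enough to test exhaustively. A minimal sketch, assuming "error" means any difference from the truncating integer quotient a / b; count_mismatches is a hypothetical helper, not something from the code above:

```cpp
#include <cstdint>

// Counts the (a, b) pairs with b != 0 where f(a, b) differs from the exact
// truncating quotient a / b. F is any callable taking two uint8_t values.
template <typename F>
int count_mismatches(F f) {
    int mismatches = 0;
    for (int b = 1; b <= 255; ++b) {        // b == 0 is skipped: division is undefined
        for (int a = 0; a <= 255; ++a) {
            auto got = f(static_cast<std::uint8_t>(a), static_cast<std::uint8_t>(b));
            if (static_cast<int>(got) != a / b) ++mismatches;
        }
    }
    return mismatches;
}
```

Calling count_mismatches(magic_divide) should return 0 if the "no errors" result holds on your compiler and flags, and count_mismatches(approx_magic_divide) gives the number of pairs the faster version gets wrong.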