Since it's 8-bit by 8-bit, a precomputed lookup table (64 KB in size) would be another option.
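
A minimal sketch of what that could look like in C (the table layout and the zero-divisor handling here are just assumptions, not anything from the article):

    #include <stdint.h>

    /* 64 KB table indexed by (dividend << 8) | divisor.
       Entries for divisor == 0 are left as 0 here; a real kernel
       would pick whatever behavior it needs for that case. */
    static uint8_t div_lut[256 * 256];

    static void init_div_lut(void) {
        for (int n = 0; n < 256; n++)
            for (int d = 1; d < 256; d++)
                div_lut[(n << 8) | d] = (uint8_t)(n / d);
    }

    static inline uint8_t div_u8(uint8_t n, uint8_t d) {
        return div_lut[((unsigned)n << 8) | d];
    }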


A lookup table in memory can only be accessed one element at a time, so it gets bottlenecked in the execution units on the load and address-generation traffic. This algorithm uses 8-bit integers, so the vectorized version is processing between 16 and 64 elements per operation depending on the vector width. It's even worse if the 8-bit divide is integrated with other vector operations, since then the lookup table method also incurs insert/extract overhead to exchange the values between the vector and integer units.

A hybrid approach using small in-register lookup tables in the vector unit (pshufb/tbl) can be lucrative, but those tables are very limited in size.


I've never thought of a micro-lookup-table!

That's cool. What have you used those for?


You can test whether each byte in a register belongs to a class of bytes in 2 instructions (each with 1 cycle latency), provided the class members all have the high bit unset and distinct lower nibbles. For example, the characters that commonly occur in text and must be escaped in JSON are "\"\\\r\n\t" (5 different bytes). Their lower nibbles are:

\": 0010

\t: 1001

\n: 1010

\\: 1100

\r: 1101

Since there are no duplicate lower nibbles, we can test for membership in this set in 2 instructions. If we want to get the value into a general-purpose register and branch on whether it's 0 or not, that's 2 more instructions.
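
Roughly, with SSE intrinsics (a sketch, not anyone's production code; the unused table slots hold 0x80 so they can never match a byte with the high bit clear):

    #include <immintrin.h>  /* SSSE3 for _mm_shuffle_epi8 */

    /* Returns 0xFF in each lane whose byte is one of " \ \r \n \t.
       pshufb looks up table[b & 0x0F] when b's high bit is clear and
       returns 0 otherwise, so comparing the lookup against the input
       yields the membership mask. */
    static inline __m128i json_escape_mask(__m128i input) {
        const __m128i table = _mm_setr_epi8(
            (char)0x80, (char)0x80, '"',        (char)0x80,
            (char)0x80, (char)0x80, (char)0x80, (char)0x80,
            (char)0x80, '\t',       '\n',       (char)0x80,
            '\\',       '\r',       (char)0x80, (char)0x80);
        __m128i hit = _mm_shuffle_epi8(table, input);
        return _mm_cmpeq_epi8(hit, input);
    }

The 2 extra instructions to branch on it would then be _mm_movemask_epi8 plus a test of the resulting integer.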

For larger sets or sets determined at runtime, see here: http://0x80.pl/notesen/2018-10-18-simd-byte-lookup.html


Typically for the same things as conventional lookup tables, wherever you can get the table size down small enough. Indexed palette / quantization coding is a case where this can often work. It's pretty niche given the limitations, but if you can make it work it's often a major speedup, since you're able to do 16/32/64 lookups in parallel.
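
For instance, a sketch of a 16-entry palette decode (the palette values here are made up, and the indices are assumed to already be one per byte in the 0-15 range):

    #include <immintrin.h>

    /* Map 16 palette indices (0-15, one per byte) to palette values
       with a single in-register table lookup. */
    static inline __m128i decode_palette16(__m128i indices) {
        const __m128i palette = _mm_setr_epi8(
            0, 17, 34, 51, 68, 85, 102, 119,
            (char)136, (char)153, (char)170, (char)187,
            (char)204, (char)221, (char)238, (char)255);
        return _mm_shuffle_epi8(palette, indices);
    }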


Not OP, but we use these tables all the time in Hyperscan (for string and character class matching) and it's a long-standing technique to use it for things like SIMD population count (obsoleted now if you have the right AVX-512 ISA, ofc).
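
The population count version is the classic nibble-table trick; a rough SSSE3 sketch:

    #include <immintrin.h>

    /* Per-byte popcount: look up the bit count of each nibble in a
       16-entry table and add the low- and high-nibble counts. */
    static inline __m128i popcount_epi8(__m128i v) {
        const __m128i nibble_counts = _mm_setr_epi8(
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
        const __m128i low_mask = _mm_set1_epi8(0x0F);
        __m128i lo = _mm_and_si128(v, low_mask);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask);
        return _mm_add_epi8(_mm_shuffle_epi8(nibble_counts, lo),
                            _mm_shuffle_epi8(nibble_counts, hi));
    }

(The AVX-512 replacement is VPOPCNTB from the BITALG extension.)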


Lookup tables aren't necessarily faster these days for a lot of things when using low-level languages, but it would have been interesting to see the comparison here.


Pretty sure a memory access is faster than the methods presented in the article.


It also depends heavily on the context. You pay for each cache miss twice: once for the miss itself, and again the next time you access whatever was evicted to service it. This is why LUTs often shine in microbenchmarks but drag down performance in real-world scenarios when mixed with other cache-bound code.


Hitting L2 is more than 3-4 cycles


Access to main memory can be many many cycles; a short routine already in cache may be able to recompute a value more quickly than pulling it from main memory.


An uncached random memory access is around 100 cycles.


100 cycles would be very low. Many systems have more than 100 ns!


You are correct. I used the wrong unit:

https://jsmemtest.chipsandcheese.com/latencydata

We can say around 100ns, although likely somewhat more.


64 KB is enough to fill L1 on many systems


64 KB is a pretty significant budget for such a small operation. I've had a variant that uses 768 bytes with some extra logic, but will deprecate that kernel soon.

https://github.com/ashvardanian/StringZilla/blob/0d47be212c5...


If you scale up the multipliers, you should be able to eliminate the variable shift, which would reduce the lookup table to 512 bytes.
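
If I follow, that means one 16-bit scaled reciprocal per divisor with a fixed shift. A sketch of that general scheme (not the article's or StringZilla's code; d == 1 needs special-casing because its reciprocal doesn't fit in 16 bits, and d == 0 is left undefined):

    #include <stdint.h>

    /* 512-byte table: M[d] = ceil(65536 / d), used with a constant shift of 16. */
    static uint16_t recip16[256];

    static void init_recip16(void) {
        for (int d = 2; d < 256; d++)
            recip16[d] = (uint16_t)((65536 + d - 1) / d);
    }

    static inline uint8_t div_u8_fixed_shift(uint8_t n, uint8_t d) {
        if (d == 1) return n;  /* ceil(65536/1) = 65536 doesn't fit in 16 bits */
        return (uint8_t)(((uint32_t)n * recip16[d]) >> 16);
    }

With the shift fixed at 16, ceil(65536/d) keeps the quotient exact for all 8-bit n and d >= 2, since the rounding error stays below 255/65536, which is less than 1/255.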


This seems like something that could be implemented in hardware as a native instruction. Is that not the case anywhere?



