Since it's 8-bit by 8-bit, a precomputed lookup table (64 KB in size) would be another option.
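
A minimal sketch of what that could look like in C (the table layout and the zero-divisor handling here are just assumptions, not anything from the article):

    #include <stdint.h>

    /* 64 KB table indexed by (dividend << 8) | divisor.
       Entries for divisor == 0 are left as 0 here; a real kernel
       would pick whatever behavior it needs for that case. */
    static uint8_t div_lut[256 * 256];

    static void init_div_lut(void) {
        for (int n = 0; n < 256; n++)
            for (int d = 1; d < 256; d++)
                div_lut[(n << 8) | d] = (uint8_t)(n / d);
    }

    static inline uint8_t div_u8(uint8_t n, uint8_t d) {
        return div_lut[((unsigned)n << 8) | d];
    }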


A lookup table in memory can only be accessed one element at a time, so it gets bottlenecked in the execution units on the load and address-generation traffic. This algorithm uses 8-bit integers, so the vectorized version is processing between 16 and 64 elements per operation depending on the vector width. It's even worse if the 8-bit divide is integrated with other vector operations, since then the lookup table method also incurs insert/extract overhead to exchange the values between the vector and integer units.

A hybrid approach using small in-register lookup tables in the vector unit (pshufb/tbl) can be lucrative, but those tables are very limited in size.


I've never thought of a micro-lookup-table!

That's cool. What have you used those for?


You can test whether each byte in a register belongs to a class of bytes in 2 instructions (each with 1 cycle latency), provided the class members all have the high bit unset and distinct lower nibbles. For example, the characters that commonly occur in text and must be escaped in JSON are "\"\\\r\n\t" (5 different bytes). Their lower nibbles are:

\": 0010

\t: 1001

\n: 1010

\\: 1100

\r: 1101

Since there are no duplicate lower nibbles, we can test for membership in this set in 2 instructions. If we want to get the value into a general-purpose register and branch on whether it's 0 or not, that's 2 more instructions.
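
Roughly, with SSE intrinsics (a sketch, not anyone's production code; the unused table slots hold 0x80 so they can never match a byte with the high bit clear):

    #include <immintrin.h>  /* SSSE3 for _mm_shuffle_epi8 */

    /* Returns 0xFF in each lane whose byte is one of " \ \r \n \t.
       pshufb looks up table[b & 0x0F] when b's high bit is clear and
       returns 0 otherwise, so comparing the lookup against the input
       yields the membership mask. */
    static inline __m128i json_escape_mask(__m128i input) {
        const __m128i table = _mm_setr_epi8(
            (char)0x80, (char)0x80, '"',        (char)0x80,
            (char)0x80, (char)0x80, (char)0x80, (char)0x80,
            (char)0x80, '\t',       '\n',       (char)0x80,
            '\\',       '\r',       (char)0x80, (char)0x80);
        __m128i hit = _mm_shuffle_epi8(table, input);
        return _mm_cmpeq_epi8(hit, input);
    }

The 2 extra instructions to branch on it would then be _mm_movemask_epi8 plus a test of the resulting integer.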

For larger sets or sets determined at runtime, see here: http://0x80.pl/notesen/2018-10-18-simd-byte-lookup.html


Typically for the same things as conventional lookup tables, wherever you can get the table size down small enough. Indexed palette / quantization coding is a case where this can often work. It's pretty niche given the limitations, but if you can make it work it's often a major speedup, since you're able to do 16/32/64 lookups in parallel.
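
For instance, a sketch of a 16-entry palette decode (the palette values here are made up, and the indices are assumed to already be one per byte in the 0-15 range):

    #include <immintrin.h>

    /* Map 16 palette indices (0-15, one per byte) to palette values
       with a single in-register table lookup. */
    static inline __m128i decode_palette16(__m128i indices) {
        const __m128i palette = _mm_setr_epi8(
            0, 17, 34, 51, 68, 85, 102, 119,
            (char)136, (char)153, (char)170, (char)187,
            (char)204, (char)221, (char)238, (char)255);
        return _mm_shuffle_epi8(palette, indices);
    }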


Not OP, but we use these tables all the time in Hyperscan (for string and character class matching) and it's a long-standing technique to use it for things like SIMD population count (obsoleted now if you have the right AVX-512 ISA, ofc).
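
The population count version is the classic nibble-table trick; a rough SSSE3 sketch:

    #include <immintrin.h>

    /* Per-byte popcount: look up the bit count of each nibble in a
       16-entry table and add the low- and high-nibble counts. */
    static inline __m128i popcount_epi8(__m128i v) {
        const __m128i nibble_counts = _mm_setr_epi8(
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
        const __m128i low_mask = _mm_set1_epi8(0x0F);
        __m128i lo = _mm_and_si128(v, low_mask);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask);
        return _mm_add_epi8(_mm_shuffle_epi8(nibble_counts, lo),
                            _mm_shuffle_epi8(nibble_counts, hi));
    }

(The AVX-512 replacement is VPOPCNTB from the BITALG extension.)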


Lookup tables aren't necessarily faster these days for a lot of things when using low-level languages, but it would have been interesting to see the comparison here.


Pretty sure a memory access is faster than the methods presented in the article.


It also depends heavily on the context. You pay for each cache miss twice: once for the miss itself, and again the next time you access whatever was evicted to service it. This is why LUTs often shine in microbenchmarks but drag down performance in real-world scenarios when mixed with other cache-bound code.


Hitting L2 is more than 3-4 cycles


Access to main memory can be many many cycles; a short routine already in cache may be able to recompute a value more quickly than pulling it from main memory.


An uncached random memory access is around 100 cycles.


100 cycles would be very low. Many systems have more than 100 ns!


You are correct. I used the wrong unit:

https://jsmemtest.chipsandcheese.com/latencydata

We can say around 100ns, although likely somewhat more.


64 KB is enough to fill L1 on many systems


64 KB is a pretty significant budget for such a small operation. I've had a variant that uses 768 bytes with some extra logic, but will deprecate that kernel soon.

https://github.com/ashvardanian/StringZilla/blob/0d47be212c5...


If you scale up the multipliers, you should be able to eliminate the variable shift, which would reduce the lookup table to 512 bytes.
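
If I follow, that means one 16-bit scaled reciprocal per divisor with a fixed shift. A sketch of that general scheme (not the article's or StringZilla's code; d == 1 needs special-casing because its reciprocal doesn't fit in 16 bits, and d == 0 is left undefined):

    #include <stdint.h>

    /* 512-byte table: M[d] = ceil(65536 / d), used with a constant shift of 16. */
    static uint16_t recip16[256];

    static void init_recip16(void) {
        for (int d = 2; d < 256; d++)
            recip16[d] = (uint16_t)((65536 + d - 1) / d);
    }

    static inline uint8_t div_u8_fixed_shift(uint8_t n, uint8_t d) {
        if (d == 1) return n;  /* ceil(65536/1) = 65536 doesn't fit in 16 bits */
        return (uint8_t)(((uint32_t)n * recip16[d]) >> 16);
    }

With the shift fixed at 16, ceil(65536/d) keeps the quotient exact for all 8-bit n and d >= 2, since the rounding error stays below 255/65536, which is less than 1/255.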


This seems like something that could be implemented in hardware as a native instruction. Is that not the case anywhere?



