So using 128-bit instructions would imply you had 'words' which were over 16 cha...

drdrey · on Jan 20, 2017

Not all strings are made of English text

sfrailsdev · on Jan 20, 2017

You don't even have to go to non-English text: Unicode text in UTF-8 alone has things like mathematical spaces and spaces of different em widths. http://perldoc.perl.org/perlrecharclass.html has a reasonably good list of non-ASCII spaces, though I'm not sure where to dig for locale-specific lists.

That said, while it may not be a solution for every case, it's a solution for the common case and a starting point for other cases, and thus pretty nifty and potentially useful.

countryqt30 · on Jan 20, 2017

You are so right - everything depends on the dataset. After all that, just skipping 4 characters at a time (4 bytes, i.e. one word on a 32-bit arch) actually seems VERY reasonable.

Maybe do it log-style: 1) check if you can jump 16, 2) check if you can jump 8, 3) check if you can jump 4, then execute.
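For illustration, the tiered check suggested here might look roughly like the sketch below. It is plain portable C rather than SIMD, and the names `no_space` and `despace_tiered` are hypothetical, made up for this example. It assumes only the ASCII space (0x20) is being removed and that `dst` has room for `len` bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if none of the next n bytes is 0x20. */
static int no_space(const char *p, size_t n) {
    return memchr(p, ' ', n) == NULL;
}

/* Copies src to dst with ASCII spaces removed; returns bytes written.
   Tries to jump 16, then 8, then 4 bytes, else handles a single byte. */
size_t despace_tiered(char *dst, const char *src, size_t len) {
    size_t i = 0, out = 0;
    while (i < len) {
        size_t remaining = len - i;
        size_t step = 0;
        if (remaining >= 16 && no_space(src + i, 16)) step = 16;
        else if (remaining >= 8 && no_space(src + i, 8)) step = 8;
        else if (remaining >= 4 && no_space(src + i, 4)) step = 4;

        if (step != 0) {
            /* Whole chunk is space-free: copy it in one go. */
            memcpy(dst + out, src + i, step);
            out += step;
            i += step;
        } else {
            /* A space is nearby (or we are at the tail): go byte by byte. */
            if (src[i] != ' ') dst[out++] = src[i];
            i++;
        }
    }
    return out;
}
```

Note that every tier is a data-dependent branch, so how well this does depends heavily on how predictable the spacing in the input is.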

nkurz · on Jan 20, 2017

This is a good instinct, but doesn't work well in practice unless the prevalence of spaces is very low. And if it's very low, you only need the largest check followed by a single fallback. The problem is that mispredicted branches (for example, 'if statements' that require different code paths without a clear pattern) are relatively expensive.

The scalar operations of reading a character, checking whether it's a space (0x20), and writing it to an output can often be done in a single cycle (the processor is 'superscalar'). A mispredicted branch costs about 15 cycles. Thus for simple tasks like this, if the average distance between spaces is 16 or less, you are likely better off with the simpler straightforward approach.
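As a rough reading of those numbers: at about one character per cycle, a single 15-cycle misprediction costs about as much as processing 15 characters the plain way, which is where the "16 or less" threshold comes from. A minimal sketch of the straightforward scalar loop follows. The name `despace_scalar` is hypothetical, and writing it branch-free (always store, conditionally advance) is an assumption on my part, one common way to keep the space test out of the branch predictor entirely. `dst` needs room for `len` bytes.

```c
#include <stddef.h>

/* Copies src to dst with ASCII spaces (0x20) removed; returns bytes written.
   The store always happens; only the output index depends on the character,
   so the loop body has no data-dependent branch. */
size_t despace_scalar(char *dst, const char *src, size_t len) {
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        dst[out] = c;        /* unconditional write, possibly overwritten next */
        out += (c != ' ');   /* advance only if c was not a space */
    }
    return out;
}
```

Bytes in `dst` past the returned length are unspecified leftovers, so callers should only use the first `out` bytes.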