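Remember that on a little-endian machine (i.e. nearly everything now) the "last" (least significant) byte of the pointer is stored first in memory, so it overlaps the first byte of the string. The 8th byte of the string overlaps the pointer's most significant byte, so the only way you can use it as your sentinel is if you can be sure that the allocator won't ever give you a pointer with MSB=0x00. On x86-64, user-space pointers normally do have a zero top byte, so that is not a safe assumption.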
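To make the byte order concrete, here is a rough sketch that inspects where each byte of a pointer-sized word lands in memory. The helper names (`is_little_endian`, `memory_byte`, `msb_position`) are mine, and the address used in the usage note is a made-up value that merely looks like a typical x86-64 user-space pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// True when the least significant byte of an integer sits at the lowest address.
inline bool is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Byte found at in-memory position `index` (0 = lowest address) of a
// 64-bit, pointer-sized word.
inline unsigned char memory_byte(std::uint64_t word, std::size_t index) {
    assert(index < sizeof word);
    unsigned char bytes[sizeof word];
    std::memcpy(bytes, &word, sizeof word);
    return bytes[index];
}

// In-memory position that holds the most significant byte of the word:
// the 8th byte on little-endian, the 1st byte on big-endian.
inline std::size_t msb_position() {
    return is_little_endian() ? sizeof(std::uint64_t) - 1 : 0;
}
```

For a hypothetical pointer value such as 0x00007f1234567890, `memory_byte(p, msb_position())` is 0x00. On a little-endian machine that zero sits in the 8th byte of the buffer, which is exactly where a '\0' terminator for a 7-character inline string would go, so the two cases can't be told apart by that byte alone.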

You are right, though, that on big-endian machines you can smuggle a 7th byte into the pointer by sharing the "tag" and the '\0' terminator. You don't really have to worry about the 0-byte case, since a traditional implementation has a shared empty-string sentinel that the default constructor uses. So if you are mutating a short string and the result is 0 bytes, you can always just replace it with a pointer to the shared sentinel.
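A minimal sketch of the shared-sentinel idea, assuming a plain heap-backed string with no short-string storage at all. `TinyString` and its members are hypothetical names for illustration, not taken from any real library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

class TinyString {
public:
    // The default constructor points at the shared sentinel: no allocation.
    TinyString() : data_(empty_sentinel()), size_(0) {}
    TinyString(const TinyString&) = delete;
    TinyString& operator=(const TinyString&) = delete;
    ~TinyString() { release(); }

    // Replace the contents. A 0-byte result falls back to the sentinel.
    void assign(const char* s, std::size_t n) {
        char* fresh = empty_sentinel();
        if (n != 0) {
            fresh = new char[n + 1];
            std::memcpy(fresh, s, n);  // copy before releasing, in case s aliases data_
            fresh[n] = '\0';
        }
        release();
        data_ = fresh;
        size_ = n;
    }

    void clear() { assign("", 0); }

    // Neither accessor needs a branch: data_ is never null.
    const char* c_str() const { return data_; }
    std::size_t size() const { return size_; }

    bool uses_sentinel() const { return data_ == empty_sentinel(); }

private:
    // One static "\0" buffer shared by every empty string.
    static char* empty_sentinel() {
        static char empty[1] = {'\0'};
        return empty;
    }

    // Free the buffer unless it is the shared sentinel.
    void release() {
        if (data_ != empty_sentinel()) delete[] data_;
    }

    char* data_;
    std::size_t size_;
};
```

The only branches are in the mutating and destroying paths (`assign`, `release`); the read paths stay branch-free, which is the property the sentinel buys you.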

I agree with your intuition about the costs. I think your program would have to be pretty heavily dominated by tiny strings for all of this optimization to help much. My guess is that it would microbenchmark well. However, all of those extra branches would add pressure on the I-cache and the branch predictor's history, which would offset the gains in the real world.

folly's fbstring actually has 3 separate regimes (tiny strings stored in-situ, classic heap-allocated normal strings, threadsafe-COW large strings), so I guess they decided that the extra branches were worth it for them. I still prefer a simpler design where c_str()/size() don't require any branches, though.