Very cool. The horsle demo made me wonder: how hard would it be to add a virtual memory address (or a non-8086 RAND instruction) that returns a random byte? That would let it pick a random value and get a standard wordle working, in principle.
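Purely as a sketch of the idea (the address, names, and structure here are all made up; I haven't looked at how the demo's emulator is organized), the memory-read hook could be something like:

    #include <stdint.h>
    #include <stdlib.h>

    #define RAND_BYTE_ADDR 0xFFF0u  /* hypothetical magic address */

    /* Hypothetical emulator memory-read hook: one address is backed
       by a fresh random byte instead of RAM, so guest code can read
       entropy. */
    uint8_t mem_read(uint16_t addr, const uint8_t *ram) {
        if (addr == RAND_BYTE_ADDR)
            return (uint8_t)rand();
        return ram[addr];
    }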
I see CSS random() is only supported by Safari. I wonder if there's some side channel that would work in Chrome specifically? (I guess timing the user input would work.)
In my experience, C++ abstractions give the optimizer a harder job, so it generates worse code. In this case, clang emits different code for a C version[0] than for the C++ original[1].
Abstraction like this usually means the compiler has to emit generic code first, which is then harder to propagate constraints through and boil down to the same final assembly, since it's less similar to the "canonical" version of the code that wouldn't use a magic `==` (in this case) or std::vector methods or something else like that.
To back up the other commenter: it's not the same. https://godbolt.org/z/r6e443x1c shows that even if you write imperfect C++, clang is perfectly capable of optimizing it.
Very bizarre. Clang pretty readily sees that it can use SIMD instructions and really optimizes this, while GCC is far more reluctant to use them. I've even seen strange output where GCC emits SIMD instructions for the first loop and then falls back to regular x86 compares for the rest.
Edit: Actually, it looks like it flips for large enough array sizes. At 256 elements, GCC ends up emitting SIMD instructions while Clang sticks to scalar x86. So strange.
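For reference, the code in question is roughly this shape (a stand-in, not the actual godbolt snippet; the element type and count are placeholders):

    #include <stddef.h>

    /* Stand-in for the compared snippets: element-wise equality over
       two arrays. Whether this becomes SIMD compares (pcmpeqd etc.)
       or a scalar cmp/jne loop is up to each compiler's cost model,
       and as noted above the decision can flip with the trip count. */
    int arrays_equal(const int *a, const int *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (a[i] != b[i])
                return 0;
        return 1;
    }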
Writing a microbenchmark is an academic exercise. You end up benchmarking in isolation, which only tells you whether your function is faster in that exact scenario. Something that is faster in isolation in a microbenchmark can be slower in a real workload, because vectorising is likely to have way more of an impact than anything else. Similarly, if you parallelise it, you introduce a whole new category of things to compare.
This isn't a microbenchmark. In fact, I haven't even bothered to benchmark it (perhaps the non-SIMD version actually is faster?).
This is purely me looking at the emitted assembly and being surprised by when the compilers decide to deploy SIMD and when they don't. It may be that the SIMD instructions are in fact slower, even though theoretically they should end up faster.
Both compilers are simply using heuristics to determine when it's fruitful to deploy SIMD instructions.
I see, yeah, that makes sense. I wanted to highlight that "magic" will, on average, give the optimizer a harder time. Explicit offset loops like that are generally avoided in many C++ styles in favor of iterators.
It still emits a cmp/jmp where plain arithmetic would do, though, which is the difference highlighted in the article and in the examples in this thread. It's nice that it simplifies down to assembly, but the assembly is somewhat questionable (especially that `xor eax, eax` branch target on the other side).
I think you could argue there is already some effort to do type safety at the ISA register level, with e.g. shadow stack or control flow integrity. Isn't that very similar to this, except targeting program state rather than external memory?
I mean, if stacks grew upwards, that alone would nip 90% of buffer overflow attacks in the bud. Moving the return address out of the activation frame into a separate stack would help as well, but I understand that having an activation frame be a single piece of data (essentially a current continuation's closure) can be quite convenient.
The PL/I stack growing up rather than down reduced the potential impact of stack overflows in Multics (and PL/I already had better memory safety anyway, with bounded strings, etc.). TFA's author would probably have appreciated the segmented memory architecture as well.
There is no reason why the C/C++ stack can't grow up rather than down. On paged hardware, both the stack and heap could (and probably should) grow up. "C's stack should grow up", one might say.
The x86-64 call instruction decrements the stack pointer to push the return address, and the push instructions decrement it as well. The push instructions are easy to work around, because most compilers already allocate the entire stack frame at once and use offset accesses, but the call instruction would be kind of annoying.
ARM does not suffer from that problem, thanks to its link register and generic pre/post-modify addressing. RISC-V is probably also safe, but I have not checked specifically.
> [x86] call instruction would be kind of annoying
I wonder what the best way to do it (on current x86) would be. The stupid-simple way might be to adjust SP before the call instruction, and that seems like it would be relatively efficient (a simple addition, issued very early).
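Something like this, maybe (a sketch only, via GNU C file-scope asm in AT&T syntax; the "rsp points at the first free, higher-address slot" convention and the 32-byte frame are made up, and signal delivery and the red zone would need separate care):

    /* Faking an upward-growing stack on x86-64, where `call` still
       pushes the return address downward. */
    __asm__(
        ".text\n"
        "callee:\n"
        "  addq  $40, %rsp\n"  /* skip the ret slot, allocate 32 bytes of locals upward */
        "  # ... locals live at -32(%rsp) .. -1(%rsp) ...\n"
        "  subq  $40, %rsp\n"  /* point rsp back at the return-address slot */
        "  ret\n"              /* pops upward: rsp += 8 */
        "call_site:\n"         /* inside some caller using the same convention */
        "  addq  $8, %rsp\n"   /* reserve the return-address slot above the frame */
        "  call  callee\n"     /* call does rsp -= 8, landing on the reserved slot */
        "  subq  $8, %rsp\n"   /* ret left rsp one slot high; restore the invariant */
    );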
Some architectures had a CALL that was just "STR [SP], IP" and nothing else; it was up to the called procedure to adjust the stack pointer further, allocating its local variables and the return slot for further calls. The RET instruction would still normally take an immediate (just as x86/x64's RET does) and additionally adjust the stack pointer by that value (either before or after loading the return address from the tip of the stack).
Linux on PA-RISC also has an upward-growing stack (AFAIK, it's the only architecture Linux has ever had an upward-growing stack on; it's certainly the only currently-supported one).
Both this and parent comment about PA-RISC are very interesting.
As noted, stack growing up doesn't prevent all stack overflows, but it makes it less trivially easy to overwrite a return address. Bounded strings also made it less trivially easy to create string buffer overflows.
In ARMv4/v5 (non-Thumb mode), the stack is purely a convention that the hardware does not enforce: nobody forces you to use r13 as the stack pointer or to make the stack descending. You could prototype your approach trivially with small changes to gcc and the Linux kernel; since this is a standard architectural feature, qemu and the like will emulate it, and it would run fine on real hardware too. I'd read the paper you publish based on this.
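As a minimal sketch (GNU C file-scope asm; assumes an arm-*-gcc toolchain targeting ARMv5): the load/store-multiple "stack" suffixes already encode all four stack disciplines, so a full-ascending stack is just a different suffix (stmfd/ldmfd is the usual full-descending pair, stmfa/ldmfa the full-ascending one):

    __asm__(
        ".text\n"
        "my_func:\n"
        "  stmfa sp!, {r4, fp, lr}\n"  /* push: sp moves toward HIGHER addresses */
        "  mov   r0, #0\n"             /* ... function body ... */
        "  ldmfa sp!, {r4, fp, pc}\n"  /* pop back down; loading pc returns */
    );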
For modern systems, stack buffer overflow bugs haven't been great to exploit for a while. You need at least a stack cookie leak, and on Apple Silicon the return addresses are MACed, so overwriting them is a fool's errand (a 2^-16 chance of success).
Most exploitable memory corruption bugs are heap buffer overflows.
Being able to cat directories like that doesn't surprise me as much as the contents being readable. Is there not a bunch of binary garbage in between the filenames?
I came here to suggest the same! It's incredibly handy, and I use it all the time at work: there's a process that runs for a very long time, and I can't be sure ahead of time whether the output it generates will be useful, but if it is, I want to capture it. I usually just pipe it into `less`, examine the contents once it's done running, and if needed use `s` to save them to a file.
(I suppose I could `tee`, but then I would always dump to a file even if it ends up being useless output.)
2. Replace libc.so.6 with a fake library that has the right version symbol, using a version script
e.g. version.map
GLIBC_2.29 {
global:
*;
};
With an empty fake_libc.c:
`gcc -shared -fPIC -Wl,--version-script=version.map,-soname,libc.so.6 -o libc.so.6 fake_libc.c`
3. Hope that you can still point the symbols back to the real libc (either by writing a giant pile of dlsym C code, or some other way; I'm unclear on this part)
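To illustrate why that's a "giant pile", a hedged sketch of a single forwarding stub (the real-libc path is an assumption, and note the chicken-and-egg problem: dlopen itself normally lives in or next to libc, which may be exactly the unclear part):

    /* One stub per function the binary actually uses; variadic
       functions (printf & co.) can't be wrapped this way at all. */
    #include <dlfcn.h>
    #include <stddef.h>

    static void *real_libc(void) {
        static void *h;
        if (!h)  /* path is an assumption; adjust per system */
            h = dlopen("/lib/x86_64-linux-gnu/libc.so.6", RTLD_NOW);
        return h;
    }

    size_t strlen(const char *s) {
        static size_t (*real)(const char *);
        if (!real)
            real = (size_t (*)(const char *))dlsym(real_libc(), "strlen");
        return real(s);
    }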
Ideally glibc would stop checking the version if it's not actually marked as needed by any symbol; I'm not sure why it doesn't (technically it's the same thing in the normal case, so maybe performance?).
So you can do e.g. `patchelf --remove-needed-version libm.so.6 GLIBC_2.29 ./mybinary` instead of replacing glibc wholesale (steps 2 and 3), and assuming everything the executable uses from glibc is ABI compatible, this will just work (it has worked for a small binary for me, YMMV).
> When you get into lower power, anything lower than Steam Deck, I think you’ll find that there’s an Arm chip that maybe is competitive with x86 offerings in that segment.
At which point does this pay off the emulation overhead? Fex has a lot of work to do bridging two ISAs while working from the black box of compiler-generated assembly, right?
As far as I'm aware, emulators like Fex are within 30 to 70% of native performance, worse or better at the fringes, but overall emulation seems totally fine.
Plus emulator technology in general could be used for binary optimization rather than strict mappings, opening up space for more optimization.
> On 18 March 2024, the Secretary of State was provided with a Submission which made it clear that Category 1 duties were not primarily aimed at pornographic content or the protection of children (which were dealt with by other parts of the Act).
Notice this is under Sunak, not Starmer. The Times chooses when to support and when to oppose the Online Safety Act based on which party is in government, and provides evidence for its view by lying by omission.
The Online Safety Act is undeniably terrible legislation, but you won't find good-faith criticism of it from the Times.