Very cool. The horsle demo made me wonder: how hard would it be to add a virtual memory address (or a non-8086 RAND instruction) that returns a random byte? That would let it pick a random value and get a standard wordle working, in principle.
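Purely as a sketch of the idea (the address, names, and structure here are all made up; I haven't looked at how the demo's emulator is organized), the memory-read hook could be something like:

    #include <stdint.h>
    #include <stdlib.h>

    #define RAND_BYTE_ADDR 0xFFF0u  /* hypothetical magic address */

    /* Hypothetical emulator memory-read hook: one address is backed
       by a fresh random byte instead of RAM, so guest code can read
       entropy. */
    uint8_t mem_read(uint16_t addr, const uint8_t *ram) {
        if (addr == RAND_BYTE_ADDR)
            return (uint8_t)rand();
        return ram[addr];
    }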
I see CSS random() is only supported by Safari. I wonder if there's some side channel that would work in Chrome specifically? (I guess timing the user input would work.)
In my experience, C++ abstractions give the optimizer a harder job, so it generates worse code. In this case, clang emits different code for a C version[0] than for the C++ original[1].
Abstraction like this usually means the compiler has to emit generic code first, which is then harder to propagate constraints through and boil down to the same final assembly, since it's less similar to the "canonical" version of the code that wouldn't use a magic `==` (in this case) or std::vector methods or something else like that.
To back up the other commenter: it's not the same. https://godbolt.org/z/r6e443x1c shows that even if you write imperfect C++, clang is perfectly capable of optimizing it.
Very bizarre. Clang pretty readily sees that it can use SIMD instructions and really optimizes this, while GCC is far more reluctant to use them. I've even seen strange output where GCC emits SIMD instructions for the first loop and then falls back to regular x86 compares for the rest.
Edit: Actually, it looks like it flips for large enough array sizes. At 256 elements, GCC ends up emitting SIMD instructions while Clang sticks to scalar x86. So strange.
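For reference, the code in question is roughly this shape (a stand-in, not the actual godbolt snippet; the element type and count are placeholders):

    #include <stddef.h>

    /* Stand-in for the compared snippets: element-wise equality over
       two arrays. Whether this becomes SIMD compares (pcmpeqd etc.)
       or a scalar cmp/jne loop is up to each compiler's cost model,
       and as noted above the decision can flip with the trip count. */
    int arrays_equal(const int *a, const int *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (a[i] != b[i])
                return 0;
        return 1;
    }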
Writing a microbenchmark is an academic exercise. You end up benchmarking in isolation, which only tells you whether your function is faster in that exact scenario. Something that is faster in isolation in a microbenchmark can be slower in a real workload, because vectorising is likely to have way more of an impact than anything else. Similarly, if you parallelise it, you introduce a whole new category of things to compare.
This isn't a microbenchmark. In fact, I haven't even bothered to benchmark it (perhaps the non-SIMD version actually is faster?).
This is purely me looking at the emitted assembly and being surprised by when the compilers decide to deploy SIMD and when they don't. It may be that the SIMD instructions are in fact slower, even though theoretically they should end up faster.
Both compilers are simply using heuristics to determine when it's fruitful to deploy SIMD instructions.
I see, yeah, that makes sense. I wanted to highlight that "magic" will, on average, give the optimizer a harder time. Explicit offset loops like that are generally avoided in many C++ styles in favor of iterators.
It still emits a cmp/jmp where plain arithmetic would do, though, which is the difference highlighted in the article and in the examples in this thread. It's nice that it simplifies down to assembly, but the assembly is somewhat questionable (especially that `xor eax, eax` branch target on the other side).
I think you could argue there is already some effort to do type safety at the ISA register level, with e.g. shadow stack or control flow integrity. Isn't that very similar to this, except targeting program state rather than external memory?
I mean, if stacks grew upwards, that alone would nip 90% of buffer overflow attacks in the bud. Moving the return address out of the activation frame into a separate stack would help as well, but I understand that having an activation frame be a single piece of data (essentially a current continuation's closure) can be quite convenient.
The PL/I stack growing up rather than down reduced the potential impact of stack overflows in Multics (and PL/I already had better memory safety anyway, with bounded strings, etc.). TFA's author would probably have appreciated the segmented memory architecture as well.
There is no reason why the C/C++ stack can't grow up rather than down. On paged hardware, both the stack and heap could (and probably should) grow up. "C's stack should grow up", one might say.
The x86-64 call instruction decrements the stack pointer to push the return address, and the push instructions decrement it as well. The push instructions are easy to work around, because most compilers already allocate the entire stack frame at once and use offset accesses, but the call instruction would be kind of annoying.
ARM does not suffer from that problem, thanks to its link register and generic pre/post-modify addressing. RISC-V is probably also safe, but I have not checked specifically.
> [x86] call instruction would be kind of annoying
I wonder what the best way to do it (on current x86) would be. The stupid-simple way might be to adjust SP before the call instruction, and that seems like it would be relatively efficient (a simple addition, issued very early).
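Something like this, maybe (a sketch only, via GNU C file-scope asm in AT&T syntax; the "rsp points at the first free, higher-address slot" convention and the 32-byte frame are made up, and signal delivery and the red zone would need separate care):

    /* Faking an upward-growing stack on x86-64, where `call` still
       pushes the return address downward. */
    __asm__(
        ".text\n"
        "callee:\n"
        "  addq  $40, %rsp\n"  /* skip the ret slot, allocate 32 bytes of locals upward */
        "  # ... locals live at -32(%rsp) .. -1(%rsp) ...\n"
        "  subq  $40, %rsp\n"  /* point rsp back at the return-address slot */
        "  ret\n"              /* pops upward: rsp += 8 */
        "call_site:\n"         /* inside some caller using the same convention */
        "  addq  $8, %rsp\n"   /* reserve the return-address slot above the frame */
        "  call  callee\n"     /* call does rsp -= 8, landing on the reserved slot */
        "  subq  $8, %rsp\n"   /* ret left rsp one slot high; restore the invariant */
    );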
Some architectures had a CALL that was just "STR [SP], IP" and nothing else; it was up to the called procedure to adjust the stack pointer further, allocating its local variables and the return slot for further calls. The RET instruction would still normally take an immediate (just as x86/x64's RET does) and additionally adjust the stack pointer by that value (either before or after loading the return address from the tip of the stack).
Linux on PA-RISC also has an upward-growing stack (AFAIK, it's the only architecture Linux has ever had an upward-growing stack on; it's certainly the only currently-supported one).
Both this and parent comment about PA-RISC are very interesting.
As noted, stack growing up doesn't prevent all stack overflows, but it makes it less trivially easy to overwrite a return address. Bounded strings also made it less trivially easy to create string buffer overflows.
In ARMv4/v5 (non-Thumb mode), the stack is purely a convention that the hardware does not enforce: nobody forces you to use r13 as the stack pointer or to make the stack descending. You could prototype your approach trivially with small changes to gcc and the Linux kernel; since this is a standard architectural feature, qemu and the like will emulate it, and it would run fine on real hardware too. I'd read the paper you publish based on this.
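As a minimal sketch (GNU C file-scope asm; assumes an arm-*-gcc toolchain targeting ARMv5): the load/store-multiple "stack" suffixes already encode all four stack disciplines, so a full-ascending stack is just a different suffix (stmfd/ldmfd is the usual full-descending pair, stmfa/ldmfa the full-ascending one):

    __asm__(
        ".text\n"
        "my_func:\n"
        "  stmfa sp!, {r4, fp, lr}\n"  /* push: sp moves toward HIGHER addresses */
        "  mov   r0, #0\n"             /* ... function body ... */
        "  ldmfa sp!, {r4, fp, pc}\n"  /* pop back down; loading pc returns */
    );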
For modern systems, stack buffer overflow bugs haven't been great to exploit for a while. You need at least a stack cookie leak, and on Apple Silicon the return addresses are MACed, so overwriting them is a fool's errand (a 2^-16 chance of success).
Most exploitable memory corruption bugs are heap buffer overflows.
Being able to cat directories like that doesn't surprise me as much as the contents being readable. Is there not a bunch of binary garbage in between the filenames?
I came here to suggest the same! It's incredibly handy, and I use it all the time at work: there's a process that runs for a very long time, and I can't be sure ahead of time whether the output it generates will be useful, but if it is, I want to capture it. I usually just pipe it into `less`, examine the contents once it's done running, and if needed use `s` to save them to a file.
(I suppose I could `tee`, but then I would always dump to a file even if it ends up being useless output.)
2. Replace libc.so.6 with a fake library that has the right version symbol, using a version script
e.g. version.map
GLIBC_2.29 {
global:
*;
};
With an empty fake_libc.c:
`gcc -shared -fPIC -Wl,--version-script=version.map,-soname,libc.so.6 -o libc.so.6 fake_libc.c`
3. Hope that you can still point the symbols back to the real libc (either by writing a giant pile of dlsym C code, or some other way; I'm unclear on this part)
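To illustrate why that's a "giant pile", a hedged sketch of a single forwarding stub (the real-libc path is an assumption, and note the chicken-and-egg problem: dlopen itself normally lives in or next to libc, which may be exactly the unclear part):

    /* One stub per function the binary actually uses; variadic
       functions (printf & co.) can't be wrapped this way at all. */
    #include <dlfcn.h>
    #include <stddef.h>

    static void *real_libc(void) {
        static void *h;
        if (!h)  /* path is an assumption; adjust per system */
            h = dlopen("/lib/x86_64-linux-gnu/libc.so.6", RTLD_NOW);
        return h;
    }

    size_t strlen(const char *s) {
        static size_t (*real)(const char *);
        if (!real)
            real = (size_t (*)(const char *))dlsym(real_libc(), "strlen");
        return real(s);
    }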
Ideally glibc would stop checking the version if it's not actually marked as needed by any symbol; I'm not sure why it doesn't (technically it's the same thing in the normal case, so maybe performance?).
So you can do e.g. `patchelf --remove-needed-version libm.so.6 GLIBC_2.29 ./mybinary` instead of replacing glibc wholesale (steps 2 and 3), and assuming everything the executable uses from glibc is ABI compatible, this will just work (it has worked for a small binary for me, YMMV).
> When you get into lower power, anything lower than Steam Deck, I think you’ll find that there’s an Arm chip that maybe is competitive with x86 offerings in that segment.
At which point does this pay off the emulation overhead? Fex has a lot of work to do bridging two ISAs while working from the black box of compiler-generated assembly, right?
As far as I'm aware, emulators like Fex are within 30 to 70% of native performance, worse or better at the fringes, but overall emulation seems totally fine.
Plus emulator technology in general could be used for binary optimization rather than strict mappings, opening up space for more optimization.
> On 18 March 2024, the Secretary of State was provided with a Submission which made it clear that Category 1 duties were not primarily aimed at pornographic content or the protection of children (which were dealt with by other parts of the Act).
Notice this is under Sunak, not Starmer. The Times chooses when to support and when to oppose the Online Safety Act based on which party is in government, and provides evidence for its view by lying by omission.
The Online Safety Act is undeniably terrible legislation, but you won't find good-faith criticism of it from the Times.