
Similarly, though it's not performance-focused, I can wholeheartedly recommend Building Git[0], which walks you through building your own Git clone in Ruby (although the language is immaterial).

[0]: https://shop.jcoglan.com/building-git/


It's definitely possible to do correctly, but looking through the code for both crates, it doesn't look like they take the necessary precautions (issuing a fence or using RDTSCP). That's a little odd, because quanta at least explicitly checks for RDTSCP support but then doesn't seem to use it.
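
For reference, this is roughly the shape of a properly ordered read using the stable core::arch intrinsics (my own sketch, not code from either crate):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};

    // LFENCE first, so earlier instructions retire before the counter is read.
    #[cfg(target_arch = "x86_64")]
    fn rdtsc_fenced() -> u64 {
        unsafe {
            _mm_lfence();
            _rdtsc()
        }
    }

    // RDTSCP waits for prior instructions on its own; it also writes
    // IA32_TSC_AUX (typically the core ID) into `aux`, discarded here.
    #[cfg(target_arch = "x86_64")]
    fn rdtscp() -> u64 {
        let mut aux = 0u32;
        unsafe { __rdtscp(&mut aux) }
    }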

(I'm not a Rust expert and I'm on my phone though, so I might be missing something.)


Thanks for sharing this! I hadn't seen it before; I'll definitely add a mention in the post.

This approach is a smart way to get the same precision as the kernel without tying yourself to the vDSO implementation as in the post, so presumably someone at Google was worried about very similar issues. I also considered an implicit automatic refresh, but ultimately, for my purposes, controlling when to take the hit was worth the slight increase in API complexity that an explicit call introduces.
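
For context, the explicit-call design looks roughly like this (simplified sketch; names are made up):

    use std::time::Instant;

    // `now()` is a cheap cached read; the caller decides when to pay
    // for a fresh timestamp by calling `refresh()`.
    struct CoarseClock {
        cached: Instant,
    }

    impl CoarseClock {
        fn new() -> Self {
            Self { cached: Instant::now() }
        }

        // Take the hit explicitly: re-read the underlying clock.
        fn refresh(&mut self) {
            self.cached = Instant::now();
        }

        // Cheap read; returns whatever `refresh` last captured.
        fn now(&self) -> Instant {
            self.cached
        }
    }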


IIRC (it's been a while), this optimization was inspired by someone noticing that all of Google spent a lot of time in the vDSO, using Google-Wide Profiling: https://research.google/pubs/google-wide-profiling-a-continu... Google generally doesn't have hard real-time requirements (well, none that I've ever heard of) the way high-frequency trading or HPC systems do.

Cool stuff, thanks for sharing!


This is explicitly called out in the post as well as the Intel instruction manual. Every codebase I've ever seen that reads the TSC either issues an LFENCE or uses RDTSCP.

In my benchmarks RDTSCP has a slight advantage, despite its higher latency on paper, because it doesn't fully serialise the instruction stream (later instructions can start executing, unlike with LFENCE). Whether the ECX clobber outweighs that will depend on the situation.
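
If you want to check on your own machine, a rough, self-contained harness along these lines works (numbers are indicative only; they shift with frequency scaling and surrounding code):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};

    #[cfg(target_arch = "x86_64")]
    fn main() {
        const N: u64 = 1_000_000;
        unsafe {
            // Back-to-back LFENCE + RDTSC reads.
            let start = _rdtsc();
            for _ in 0..N {
                _mm_lfence();
                std::hint::black_box(_rdtsc());
            }
            let fenced = _rdtsc() - start;

            // Back-to-back RDTSCP reads (the aux/ECX output is discarded).
            let mut aux = 0u32;
            let start = _rdtsc();
            for _ in 0..N {
                std::hint::black_box(__rdtscp(&mut aux));
            }
            let rdtscp = _rdtsc() - start;

            println!("lfence+rdtsc: ~{} cycles/read", fenced / N);
            println!("rdtscp:       ~{} cycles/read", rdtscp / N);
        }
    }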


You might've misunderstood the requirements. The time scale was 1–10 microseconds per component; 100 ns was the per-span overhead we were aiming for.

In this case distributed tracing absolutely was the right choice. These were not simple computational tasks. The components were highly stateful and interconnected both on- and cross-host. Between this and the timescale, as well as the volume of events and the dollar-value impact of each potential failure (of which there were many), we needed real-time analysis capabilities, not a profiler.


I guess my skepticism about the application colored my reading of the rest of it. If it had only said you needed it to be faster, that would have been easier for a simpleton like me.

Yep, I should probably have mentioned this option for completeness. In practice, however, it only saves you a few cycles/nanos and adds more complexity and failure points downstream.

Check the specification at the top: the range for x is [-1, 1]. For the range you provided, the accuracy of the 0.5x alternative is reported as only 33%: https://herbie.uwplse.org/demo/570b973df0f1f4a78fe791858038a...


You're right, I misread the graph. That said, I have played around with Herbie before, trying it on a few of the gnarlier expressions in my code (analytical partial derivatives of the equations of motion of a launch vehicle in a rotating spherical frame), and didn't see much appreciable improvement over the expected range of values. Then again, I didn't check every single one.

What would be cool is if you could somehow have this kind of analysis done automatically for your whole program, where it finds the needle-in-a-haystack expression that can be improved, assuming you gave expected ranges for your variables.


Author here. I've got a few papers about this problem (including one in submission), but it is very, very hard to do, especially with acceptable overhead. The state of the art is maybe 100x overhead.


It depends on your use case, as always. Correctness is not always black and white (hence my favourite compilation flag, -funsafe-math-optimizations), and time complexity can be misleading: O(log N) with a large base is O(1) in practice (a B-tree with fanout 256 needs only four levels to index a billion keys). But a correct, theoretically optimal algorithm might still be leaving a lot of performance on the table. If you're a hyperscaler, a high-frequency trader, or perhaps a game programmer pushing the limits of the platform, small gains can accumulate meaningfully, as can the thousand cuts caused by small inefficiencies spread out over the codebase.


There's an interesting comment on this over on Lobsters: https://lobste.rs/s/e4y5ps/two_studies_compiler_optimisation...


(Author here)

>It all seems very brittle, though. And that something has gone very wrong with our ecosystem of tools, languages, and processes when it becomes advisable to massage source until specific passes in a specific version LLVM don't mess things up for other passes.

I would say the main takeaway is actually to not do that, precisely because it's brittle, difficult to understand, and can backfire by making things harder for the compiler. As I point out at the end, the vast majority of developers will be better served by sticking to common idioms and making their intent as clear as possible using language facilities, rather than trying to outsmart the compiler or, conversely, relying excessively on the optimiser as a magic black box.

That being said, I do find it helpful to understand the broad strokes of the optimisation pipeline (things like "callees are optimised before callers", "maintaining invariants enables more simplifications", or "early simplifications are better") to make the most of it. Like with any other tool, mastery means letting it do its job most of the time but knowing when and where to step in if necessary.
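
As a contrived illustration of the invariants point, in Rust asserting a slice's length once up front usually lets the optimiser fold away the individual bounds checks that follow:

    // The assert states the invariant once; the compiler can then
    // typically elide the four per-index bounds checks below.
    fn sum_first_four(xs: &[u32]) -> u32 {
        assert!(xs.len() >= 4);
        xs[0] + xs[1] + xs[2] + xs[3]
    }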

