
I use async for UI work, but don't have much of an opinion for servers.

I suspect that the best async is that supported by the server OS, and the more efficiently a language/compiler/linker integrates with that, the better. JIT/interpreted languages introduce new dimensions that I have not experienced.

I do have some prior art in optimizing libraries, though. In particular, image processing libraries in C++. My opinion is that optimization is sort of a "black art," and async is anything but a "silver bullet." In my experience, "common sense" is often trumped by facts on the ground, and profilers are more important than careful design.

I have found that it's actually possible to have worse performance with threads, if you write in a blocking fashion: you end up with the same timeline as sync, but with thread-management overhead on top.
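To illustrate what I mean, a contrived sketch (the shared mutex and the workload are made up, not from any real codebase): every worker holds the same lock while it does its work, so the threads run strictly one after another, and you pay thread creation and scheduling on top of the sequential timeline.

    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex m;
    long total = 0;

    void work_chunk(const std::vector<long>& data) {
        // Holding the lock for the whole chunk serializes the threads:
        // the timeline is the same as a plain loop, plus thread overhead.
        std::lock_guard<std::mutex> lock(m);
        for (long v : data) total += v;
    }

    int main() {
        std::vector<long> data(1'000'000, 1);
        std::vector<std::thread> pool;
        for (int i = 0; i < 8; ++i)
            pool.emplace_back([&] { work_chunk(data); });
        for (auto& t : pool) t.join();
    }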

There are also hardware issues that come into play, like L1/2/3 caches, resource contention, look-ahead/execution pipelines and VM paging. These can have massive impact on performance, and are often only exposed by running the app in-context with a profiler. Sometimes, threading can exacerbate these issues, and wipe out any efficiency gains.

In my experience, well-behaved threaded software needs to be written, profiled and tuned, in that order. An experienced engineer can usually take care of the "low-hanging fruit," in design, but I have found that profiling tends to consistently yield surprises.

T.A.N.S.T.A.A.F.L.




Probably the most interesting new concept that I've come across is Linux's io_uring, which uses ring buffers shared between user space and the kernel to submit I/O requests and receive their completions asynchronously.

While Windows has had asynchronous I/O for ages, it's still one kernel transition per operation, whereas Linux can batch these now.
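To make the batching concrete, here's a rough sketch using liburing (the file name, buffer sizes and queue depth are placeholders; link with -luring): eight reads are queued on the submission ring, submitted with a single syscall, and their completions reaped from the completion ring.

    #include <liburing.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        io_uring ring;
        io_uring_queue_init(64, &ring, 0);       // one shared SQ/CQ ring pair

        int fd = open("data.bin", O_RDONLY);     // placeholder file
        if (fd < 0) return 1;
        static char bufs[8][4096];

        // Queue eight reads on the submission ring...
        for (int i = 0; i < 8; ++i) {
            io_uring_sqe* sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], sizeof bufs[i], i * 4096);
        }
        // ...then submit all of them with one kernel transition.
        io_uring_submit(&ring);

        // Reap completions; each cqe carries the result of one read.
        for (int i = 0; i < 8; ++i) {
            io_uring_cqe* cqe;
            io_uring_wait_cqe(&ring, &cqe);
            printf("completion: %d bytes\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        close(fd);
        io_uring_queue_exit(&ring);
    }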

I suspect that all the CPU-level security issues will eventually be resolved, but at a permanently increased overhead for all user-mode to kernel transitions. Clever new API schemes like io_uring will likely have to be the way forward.

I can imagine a future where all kernel API calls go through a ring buffer, everything is asynchronous, and most hardware devices dump their data directly into user-mode ring buffers by default without direct kernel involvement.

It's going to be an interesting new landscape of performance optimisation and language design!


> profilers are more important than careful design.

> I have found that it's actually possible to have worse performance with threads, if you write in a blocking fashion

But isn't excessive blocking/synchronization something that should already be tackled in your design, instead of trying to rework it after the fact?

I would expect profiling to mostly lead to micro-optimisations, e.g. combining or splitting the time a lock is held, whereas at design time you can look at avoiding as much need for synchronization as possible, e.g. sharing data copy-on-write (which doesn't require locks as long as you hold a reference) instead of having to lock the data on every access.
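For the copy-on-write point, a minimal sketch of the pattern I mean (the Config type and its fields are invented for illustration): readers take a shared_ptr snapshot without locking, and a writer copies, modifies and atomically publishes a new version.

    #include <atomic>
    #include <memory>
    #include <string>

    struct Config {                        // made-up example payload
        std::string endpoint;
        int timeout_ms = 1000;
    };

    std::shared_ptr<const Config> g_config = std::make_shared<Config>();

    // Reader: grab a snapshot; no lock needed. The snapshot stays valid
    // for as long as we hold the reference, even if a writer swaps it out.
    int current_timeout() {
        std::shared_ptr<const Config> snap = std::atomic_load(&g_config);
        return snap->timeout_ms;
    }

    // Writer: copy the current value, modify the copy, publish it atomically.
    // Readers keep their old snapshot; concurrent writers would still need
    // their own coordination, but readers never block.
    void set_timeout(int ms) {
        auto next = std::make_shared<Config>(*std::atomic_load(&g_config));
        next->timeout_ms = ms;
        std::atomic_store(&g_config, std::shared_ptr<const Config>(std::move(next)));
    }

    int main() {
        set_timeout(250);
        return current_timeout() == 250 ? 0 : 1;
    }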

As another commenter says

> with asyncio we deploy a thread per worker (loop), and a worker per core. We also move cpu bound functions to a thread pool

You can't easily go from, e.g., thread-per-connection to a worker pool; that should have been caught during design.


> But isn't excessive blocking/synchronization something that should already be tackled in your design, instead of trying to rework it after the fact?

Yes and no. Again, I have not profiled or optimized servers or interpreted/JIT languages, so I bet there's a new ruleset.

Blocking can come from unexpected places. For example, if we use dependencies, then we don't have much control over the resources accessed by the dependency.

Sometimes, these dependencies are the OS or standard library. We would sometimes have to choose alternate system calls, as the ones we initially chose caused issues that were not exposed until we ran the profiler.

The killer for us was often cache-breaking. Things like the length of the data in a variable could determine whether or not it got bounced from a register or a low-level cache, and the impact could be astounding. This could lead to remedies like applying a visitor to break up a [supposedly] inconsequential temp buffer into cache-friendly bites.
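Roughly the shape of that remedy, heavily simplified (the tile size and the per-element transform are placeholders, not our actual kernels): instead of streaming the whole temp buffer through each pass, a visitor hands out tiles small enough to stay resident in a low-level cache.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Placeholder per-element transform; the real work was convolution-style.
    inline float transform(float v) { return v * 0.5f + 1.0f; }

    // Whole-buffer pass: the temp buffer is streamed through once and
    // evicted from cache long before another pass touches it again.
    void process_whole(std::vector<float>& buf) {
        for (float& v : buf) v = transform(v);
    }

    // Tiled "visitor": apply the work to one cache-sized chunk at a time,
    // so the chunk stays hot across any additional passes over it.
    template <class Visitor>
    void for_each_tile(std::vector<float>& buf, std::size_t tile, Visitor visit) {
        for (std::size_t start = 0; start < buf.size(); start += tile) {
            std::size_t end = std::min(buf.size(), start + tile);
            visit(buf.data() + start, end - start);
        }
    }

    void process_tiled(std::vector<float>& buf) {
        constexpr std::size_t kTile = 8 * 1024;   // ~32 KB of floats: a guess at L1 size
        for_each_tile(buf, kTile, [](float* p, std::size_t n) {
            for (std::size_t i = 0; i < n; ++i) p[i] = transform(p[i]);
            // ...additional passes over the same hot tile would go here...
        });
    }

    int main() {
        std::vector<float> buf(1 << 20, 1.0f);
        process_whole(buf);
        process_tiled(buf);
    }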

Also, we sometimes had to recombine work that we had split across threads, because the split itself took a cache hit.

Unit testing could be useless. For example, the test images that we often used were the classic "Photo Test Diorama" variety, with a bunch of stuff crammed onto a well-lit table, with a few targets.

Then, we would run an image from a pro shooter, with a Western prairie skyline, and the lengths of some of the convolution target blocks would be different. This could sometimes take a cache hit, with a buffer getting demoted. It taught us to use a large pool of test images, which was sometimes quite difficult; in some cases, we actually had to use synthesized images.

Since we were working on image processing software, we were already doing this in other work, but we learned to do it in the optimization work, too.

When my team was working on C++ optimization, we had a team from Intel come in and profile our apps.

It was pretty humbling.



