
I use async for UI work, but don't have much of an opinion for servers.

I suspect that the best async is that supported by the server OS, and the more efficiently a language/compiler/linker integrates with that, the better. JIT/interpreted languages introduce new dimensions that I have not experienced.

I do have some prior art in optimizing libraries, though. In particular, image processing libraries in C++. My opinion is that optimization is sort of a "black art," and async is anything but a "silver bullet." In my experience, "common sense" is often trumped by facts on the ground, and profilers are more important than careful design.

I have found that it's actually possible to have worse performance with threads, if you write in a blocking fashion: you end up with the same timeline as sync, but with thread-management overhead on top.
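To illustrate what I mean, a contrived sketch (the shared mutex and the workload are made up, not from any real codebase): every worker holds the same lock while it does its work, so the threads run strictly one after another, and you pay thread creation and scheduling on top of the sequential timeline.

    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex m;
    long total = 0;

    void work_chunk(const std::vector<long>& data) {
        // Holding the lock for the whole chunk serializes the threads:
        // the timeline is the same as a plain loop, plus thread overhead.
        std::lock_guard<std::mutex> lock(m);
        for (long v : data) total += v;
    }

    int main() {
        std::vector<long> data(1'000'000, 1);
        std::vector<std::thread> pool;
        for (int i = 0; i < 8; ++i)
            pool.emplace_back([&] { work_chunk(data); });
        for (auto& t : pool) t.join();
    }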

There are also hardware issues that come into play, like L1/2/3 caches, resource contention, look-ahead/execution pipelines and VM paging. These can have massive impact on performance, and are often only exposed by running the app in-context with a profiler. Sometimes, threading can exacerbate these issues, and wipe out any efficiency gains.

In my experience, well-behaved threaded software needs to be written, profiled and tuned, in that order. An experienced engineer can usually take care of the "low-hanging fruit," in design, but I have found that profiling tends to consistently yield surprises.

T.A.N.S.T.A.A.F.L.




Probably the most interesting new concept that I've come across is Linux's io_uring, which uses ring buffers shared between user space and the kernel to submit I/O requests and receive their completions asynchronously.

While Windows has had asynchronous I/O for ages, it's still one kernel transition per operation, whereas Linux can batch these now.
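To make the batching concrete, here's a rough sketch using liburing (the file name, buffer sizes and queue depth are placeholders; link with -luring): eight reads are queued on the submission ring, submitted with a single syscall, and their completions reaped from the completion ring.

    #include <liburing.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        io_uring ring;
        io_uring_queue_init(64, &ring, 0);       // one shared SQ/CQ ring pair

        int fd = open("data.bin", O_RDONLY);     // placeholder file
        if (fd < 0) return 1;
        static char bufs[8][4096];

        // Queue eight reads on the submission ring...
        for (int i = 0; i < 8; ++i) {
            io_uring_sqe* sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], sizeof bufs[i], i * 4096);
        }
        // ...then submit all of them with one kernel transition.
        io_uring_submit(&ring);

        // Reap completions; each cqe carries the result of one read.
        for (int i = 0; i < 8; ++i) {
            io_uring_cqe* cqe;
            io_uring_wait_cqe(&ring, &cqe);
            printf("completion: %d bytes\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        close(fd);
        io_uring_queue_exit(&ring);
    }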

I suspect that all the CPU-level security issues will eventually be resolved, but at a permanently increased overhead for all user-mode to kernel transitions. Clever new API schemes like io_uring will likely have to be the way forward.

I can imagine a future where all kernel API calls go through a ring buffer, everything is asynchronous, and most hardware devices dump their data directly into user-mode ring buffers by default without direct kernel involvement.

It's going to be an interesting new landscape of performance optimisation and language design!


> profilers are more important than careful design.

> I have found that it's actually possible to have worse performance with threads, if you write in a blocking fashion

But isn't excessive blocking/synchronization something that should already be tackled in your design, instead of trying to rework it after the fact?

I would expect profiling to mostly lead to micro-optimisations, e.g. combining or splitting the time a lock is held, whereas at design time you can look at avoiding as much need for synchronization as possible, e.g. sharing data copy-on-write (which doesn't require locks as long as you hold a reference) instead of having to lock the data on every access.
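For the copy-on-write point, a minimal sketch of the pattern I mean (the Config type and its fields are invented for illustration): readers take a shared_ptr snapshot without locking, and a writer copies, modifies and atomically publishes a new version.

    #include <atomic>
    #include <memory>
    #include <string>

    struct Config {                        // made-up example payload
        std::string endpoint;
        int timeout_ms = 1000;
    };

    std::shared_ptr<const Config> g_config = std::make_shared<Config>();

    // Reader: grab a snapshot; no lock needed. The snapshot stays valid
    // for as long as we hold the reference, even if a writer swaps it out.
    int current_timeout() {
        std::shared_ptr<const Config> snap = std::atomic_load(&g_config);
        return snap->timeout_ms;
    }

    // Writer: copy the current value, modify the copy, publish it atomically.
    // Readers keep their old snapshot; concurrent writers would still need
    // their own coordination, but readers never block.
    void set_timeout(int ms) {
        auto next = std::make_shared<Config>(*std::atomic_load(&g_config));
        next->timeout_ms = ms;
        std::atomic_store(&g_config, std::shared_ptr<const Config>(std::move(next)));
    }

    int main() {
        set_timeout(250);
        return current_timeout() == 250 ? 0 : 1;
    }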

As another commenter says

> with asyncio we deploy a thread per worker (loop), and a worker per core. We also move cpu bound functions to a thread pool

You can't easily go from, e.g., thread-per-connection to a worker pool; that should have been caught during design.


> But isn't excessive blocking/synchronization something that should already be tackled in your design, instead of trying to rework it after the fact?

Yes and no. Again, I have not profiled or optimized servers or interpreted/JIT languages, so I bet there's a new ruleset.

Blocking can come from unexpected places. For example, if we use dependencies, then we don't have much control over the resources accessed by the dependency.

Sometimes, these dependencies are the OS or standard library. We would sometimes have to choose alternate system calls, as the ones we initially chose caused issues that were not exposed until we ran the profiler.

The killer for us was often cache-breaking. Things like the length of the data in a variable could determine whether or not it got bounced from a register or a low-level cache, and the impact could be astounding. This could lead to remedies like applying a visitor to break up a [supposedly] inconsequential temp buffer into cache-friendly bites.
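Roughly the shape of that remedy, heavily simplified (the tile size and the per-element transform are placeholders, not our actual kernels): instead of streaming the whole temp buffer through each pass, a visitor hands out tiles small enough to stay resident in a low-level cache.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Placeholder per-element transform; the real work was convolution-style.
    inline float transform(float v) { return v * 0.5f + 1.0f; }

    // Whole-buffer pass: the temp buffer is streamed through once and
    // evicted from cache long before another pass touches it again.
    void process_whole(std::vector<float>& buf) {
        for (float& v : buf) v = transform(v);
    }

    // Tiled "visitor": apply the work to one cache-sized chunk at a time,
    // so the chunk stays hot across any additional passes over it.
    template <class Visitor>
    void for_each_tile(std::vector<float>& buf, std::size_t tile, Visitor visit) {
        for (std::size_t start = 0; start < buf.size(); start += tile) {
            std::size_t end = std::min(buf.size(), start + tile);
            visit(buf.data() + start, end - start);
        }
    }

    void process_tiled(std::vector<float>& buf) {
        constexpr std::size_t kTile = 8 * 1024;   // ~32 KB of floats: a guess at L1 size
        for_each_tile(buf, kTile, [](float* p, std::size_t n) {
            for (std::size_t i = 0; i < n; ++i) p[i] = transform(p[i]);
            // ...additional passes over the same hot tile would go here...
        });
    }

    int main() {
        std::vector<float> buf(1 << 20, 1.0f);
        process_whole(buf);
        process_tiled(buf);
    }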

Also, we sometimes had to recombine work that we had split across threads, because the split itself took a cache hit.

Unit testing could be useless. For example, the test images that we often used were the classic "Photo Test Diorama" variety, with a bunch of stuff crammed onto a well-lit table, with a few targets.

Then, we would run an image from a pro shooter, with a Western prairie skyline, and the lengths of some of the convolution target blocks would be different. This could sometimes take a cache hit, with a buffer getting demoted. It taught us to use a large pool of test images, which was sometimes quite difficult; in some cases, we actually had to use synthesized images.

Since we were working on image processing software, we were already doing this in other work, but we learned to do it in the optimization work, too.

When my team was working on C++ optimization, we had a team from Intel come in and profile our apps.

It was pretty humbling.



