
> unlike a Python program that does not use async and could only proceed linearly through its instructions

This isn't how it works. While Python is blocked in I/O calls, it releases the GIL so other threads can proceed. (If the GIL were never released then I'm sure they wouldn't have put threading in the Python standard library.)
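
A quick way to see this (a minimal sketch; the URL is just a placeholder): fetch a handful of pages with plain blocking urllib calls spread across threads, and the total wall time comes out close to the slowest single fetch rather than the sum of all of them:

    import threading, time
    import urllib.request

    URLS = ["https://example.com"] * 5   # placeholder URLs

    def fetch(url):
        # urlopen blocks in C-level socket I/O with the GIL released,
        # so the other threads keep running while this one waits.
        urllib.request.urlopen(url).read()

    start = time.perf_counter()
    threads = [threading.Thread(target=fetch, args=(u,)) for u in URLS]
    for t in threads: t.start()
    for t in threads: t.join()
    print(f"fetched {len(URLS)} URLs in {time.perf_counter() - start:.2f}s")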

> Python's multiprocessing library is needed to overcome the GIL

This is technically true, in that if you are running up against the GIL then the only way to overcome it is to use multiprocessing. But blocking IO isn't one of those situations, so you can just use threads.

The comparison here is not async vs just doing one thing. It's async vs threads. I believe that's what the performance comparison in the article is about, and if threads were as broken as you say then obviously they wouldn't have performed better than asyncio.

--------

As an aside, many C-based extensions also release the GIL when performing CPU-bound computations, e.g. numpy and scipy. So the GIL doesn't even prevent you from using multithreading in CPU-heavy applications, so long as the operations are relatively large (e.g. a few calls to multiply huge matrices together would parallelise well, but many calls to multiply tiny matrices together would heavily contend the GIL).
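
For example, a sketch (assuming numpy is installed; note that many BLAS builds also thread internally, so treat the timing as illustrative rather than a clean measurement):

    import time
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    a = np.random.rand(2000, 2000)

    def matmul(_):
        # The @ operator drops into a C/BLAS kernel that releases the
        # GIL, so several of these can run on different cores at once.
        return a @ a

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(matmul, range(4)))
    print(f"4 large matmuls took {time.perf_counter() - start:.2f}s")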




> > Python's multiprocessing library is needed to overcome the GIL

> No it's not, just use threads.

I just wanted to expand on this a little to describe some of the downsides to threads in Python.

Multi-threaded logic can be (and often is) slower than single-threaded logic because threading introduces the overhead of lock contention and context switching. David Beazley did a talk illustrating this in 2010:

https://www.youtube.com/watch?v=Obt-vMVdM8s

He also did a great talk about coroutines in 2015 where he explores threading and coroutines a bit more:

https://www.youtube.com/watch?v=MCs5OvhV9S4&t=525s

In workloads that are often blocked, like network calls or other I/O-bound workloads, threads can provide similar benefits to coroutines, but with more overhead. Coroutines seek to provide the same benefit without as much overhead (no lock contention, fewer context switches by the kernel).
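
To make the comparison concrete, here's a minimal sketch of the same "blocked" workload done both ways; both finish in roughly one sleep interval, and the difference is in per-task overhead rather than in the outcome:

    import asyncio, time
    from concurrent.futures import ThreadPoolExecutor

    N = 100

    # Threads: pre-emptively scheduled; each sleep blocks its own thread.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=N) as pool:
        list(pool.map(lambda _: time.sleep(0.5), range(N)))
    print(f"threads:    {time.perf_counter() - start:.2f}s")

    # Coroutines: one thread; tasks yield to the event loop at each await.
    async def main():
        await asyncio.gather(*(asyncio.sleep(0.5) for _ in range(N)))

    start = time.perf_counter()
    asyncio.run(main())
    print(f"coroutines: {time.perf_counter() - start:.2f}s")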

These probably aren't the right guidelines for everyone, but I generally use them when thinking about concurrency (and pseudo-concurrency) in Python:

- Coroutines where I can.

- Multi-processing where I need real parallelism.

- Never threads.


Ah ha! Now we have finally reached the beginning of the conversation :-)

The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO. The point of the article is to dispel that belief. There seems to be some debate about whether the article is really representative, and I'm very curious about that. But then the parent comment to mine took us on an unproductive detour based on the misconception that Python threads don't work at all. Now your comment has brought up that original belief again, but you haven't referenced the article at all.


I didn't reference the article because I provided more detailed references which explore the difference between threads and coroutines in Python in much greater depth.

The point of my comment is to say that neither threads nor coroutines will make Python _faster_ in and of themselves. Quite the opposite in fact: threading adds overhead, so unless the benefit is greater than the overhead (e.g. lock contention and context switching) your code will actually be net slower.
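
Beazley's CPU-bound demonstration of this is easy to reproduce; a minimal sketch (numbers vary by machine and Python version, but on GIL-ful CPython the threaded version is typically no faster, and often slower):

    import threading, time

    def count(n):
        while n > 0:
            n -= 1

    N = 10_000_000

    start = time.perf_counter()
    count(N)
    print(f"one thread:  {time.perf_counter() - start:.2f}s")

    # Split the same work across two threads: the GIL serialises them,
    # and the extra lock handoffs add overhead on top.
    start = time.perf_counter()
    threads = [threading.Thread(target=count, args=(N // 2,)) for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(f"two threads: {time.perf_counter() - start:.2f}s")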

I can't recommend the videos I shared enough, David Beazley is a great presenter. One of the few people who can do talks centered around live coding that keep me engaged throughout.

> The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO. The point of the article is to dispel that belief.

The disconnect here is that this article isn't claiming that asyncio is not faster than threads. In fact, the article only claims that asyncio is not a silver bullet guaranteed to increase the performance of any Python logic. The misconception it is trying to clear up, in its own words, is:

> Sadly async is not go-faster-stripes for the Python interpreter.

What I, and many others, are questioning is:

A) Is this actually as widespread a belief as the article claims it to be? None of the results are surprising to me (or apparently some others).

B) Is the article accurate in its analysis and conclusion?

As an example, take this paragraph:

> Why is this? In async Python, the multi-threading is co-operative, which simply means that threads are not interrupted by a central governor (such as the kernel) but instead have to voluntarily yield their execution time to others. In asyncio, the execution is yielded upon three language keywords: await, async for and async with.

This is a really confusing paragraph because it seems to mix terminology. A short list of problems in this quote alone:

- Async Python != multi-threading.

- Multi-threading is not co-operatively scheduled; threads are indeed interrupted by the kernel (context switches between Python threads do actually happen).

- Asyncio is co-operatively scheduled, and pieces of logic have to yield to allow other logic to proceed. This is a key difference between asyncio (coroutines) and multi-threading (threads); see the sketch after this list.

- Asynchronous Python can be implemented using coroutines, multi-threading, or multi-processing; it's a common noun, but the quote uses it as a proper noun, leaving us guessing what the author intended to refer to.
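
A minimal sketch of that co-operative property (plain asyncio, nothing taken from the article): a coroutine that blocks without awaiting stalls every other task on the event loop, while one that awaits yields control:

    import asyncio, time

    async def ticker():
        for _ in range(3):
            print("tick")
            await asyncio.sleep(1)   # yields control back to the event loop

    async def hog():
        time.sleep(2)                # blocks without yielding: ticker stalls
        print("hog done")

    async def main():
        await asyncio.gather(ticker(), hog())

    asyncio.run(main())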

Additionally, there are concepts and interactions missing from the article, such as the GIL's scheduling behavior. In the second video I shared, David Beazley actually shows how the GIL gives compute-intensive tasks higher priority, which is the opposite of typical scheduling priorities (e.g. kernel scheduling) and leads to adverse latency behavior.
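
A rough way to observe the latency effect he demonstrates (a sketch; exact numbers depend on the interpreter's switch interval, see sys.setswitchinterval):

    import threading, time

    def spin(seconds):
        # CPU-bound loop: only gives up the GIL at forced switch intervals
        end = time.time() + seconds
        while time.time() < end:
            pass

    threading.Thread(target=spin, args=(3,), daemon=True).start()

    # A nominally ~1ms sleep now has to win the GIL back on each wake-up,
    # so the observed latency balloons well past 1ms.
    for _ in range(5):
        t0 = time.perf_counter()
        time.sleep(0.001)
        print(f"{(time.perf_counter() - t0) * 1000:.1f} ms")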

So looking at the article as a whole, I don't think the underlying intent of the article is wrong, but the reasoning and analysis presented are at best misguided. Asyncio is not a performance silver bullet; it's not even real parallelism. Multi-processing and use of C extensions are the bigger bang for the buck when it comes to performance. But none of this is surprising; it's expected if you really think about the underlying interactions.

To rephrase what you think I thought:

> The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO.

Is actually more like:

> Asyncio is more efficient than multi-threading in Python. It is also comparatively more variable than multi-processing, particularly when dealing with workloads that saturate a single event loop. Neither multi-threading nor asyncio is actually parallel in Python; for that you have to use multi-processing to escape the GIL (or some C extension which you trust to safely execute outside of GIL control).

---

Regarding your aside example, it's true some C extensions can escape the GIL, but often with caveats and careful consideration of where/when you can do so successfully. Take for example this scipy cookbook regarding parallelization:

https://scipy-cookbook.readthedocs.io/items/ParallelProgramm...

It's not often the case that using a C extension will give you truly parallel multi-threading without significant and careful code refactoring.


For single processes you’re right, but this article (and a lot of the activity around asyncio in Python) is about backend webdev, where you’re already running multiple app servers. In this context, asyncio is almost always slower.


> But blocking IO isn't one of those situations, so you can just use threads.

Threads and async are not mutually exclusive. If your system resources aren't heavily loaded, it doesn't matter; just choose the library you find most appropriate. But threads require more system overhead, and eventually adding more threads will reduce performance. So if it's critical to thoroughly maximize system resources and your system cannot handle more threads, you need async (and threads).
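
Mixing the two is straightforward with the standard library; a sketch (asyncio.to_thread needs Python 3.9+, and legacy_blocking_call is a hypothetical stand-in for a library with no async support):

    import asyncio, time

    def legacy_blocking_call():
        # hypothetical stand-in for a driver/library with no async support
        time.sleep(1)
        return "done"

    async def main():
        # The blocking call runs on a worker thread; the event loop
        # stays free to service other coroutines in the meantime.
        result, _ = await asyncio.gather(
            asyncio.to_thread(legacy_blocking_call),
            asyncio.sleep(1),
        )
        print(result)

    asyncio.run(main())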


> But threads require more system overhead, and eventually adding more threads will reduce performance.

Absolutely false. OS threads are orders of magnitude lighter than any Python coroutine implementation.


> OS threads are orders of magnitude lighter than any Python coroutine implementation.

But Python threads, which carry extra weight from a cross-platform abstraction layer on top of the underlying OS threads, are not lighter than Python coroutines.

You aren't choosing between Python threads and unadorned OS threads when writing Python code.


You're absolutely right.

I'm pointing out that this is a Python problem, not a threads problem, a fact which people don't understand.


Everyone has been discussing the relative performance of different techniques within Python; that is no basis to suggest people don't understand that aspects of it are Python-specific, nor a reason to think that distinction is even particularly relevant to the discussion.


Okay, then let's do a bakeoff! You outfit a Python webserver that only uses threads, and I'll outfit an identical webserver that also implements async. The server handling the most requests/sec wins. I get to pick the workload.


FWIW, I have a real world Python3 application that does the following:

- receives an HTTP POST multipart/form-data that contains three file parts. The first part is JSON.

- parses the form.

- parses the JSON.

- depending upon the JSON accepts/rejects the POST.

- for accepted POSTs, writes the three parts as three separate files to S3.

It runs behind nginx + uwsgi, using the Falcon framework. For parsing the form I use streaming-form-data which is cython accelerated. (Falcon is also cython accelerated.)

I tested various deployment options. cpython, pypy, threads, gevent. Concurrency was more important than latency (within reason). I ended up with the best performance (measured as highest RPS while remaining within tolerable latency) using cpython+gevent.

It's been a while since I benchmarked and I'm typing this up from memory, so I don't have any numbers to add to this comment.


Each Linux thread has at least 8MB of virtual memory overhead. I just tested it, and was able to create one million coroutines in Python in a few seconds, with only a few hundred megabytes of overhead. If I created just one thousand threads, it would possibly take 8 gigs of memory.


Virtual memory is not memory. You're effectively just bumping an offset; there are no actual allocations involved.

> ...it would take possibly 8 gigs of memory.

No. Nothing is 'taken' when virtual memory is requested.


But have you tried creating one thousand OS threads and measuring the actual memory usage? If I recall correctly, I read an article explaining that threads on Linux don't actually claim their full 8MB each quite so literally. I need to recheck that later.


You're right, I've read the same. Using Python 3.8, creating 12,000 threads with `time.sleep` as the target clocks in at 200MB of resident memory.
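
For anyone who wants to reproduce that kind of measurement, a rough sketch (Linux-specific: the resource module reports peak RSS in kilobytes there):

    import resource, threading, time

    # Spawn sleeping threads; only the stack pages actually touched
    # become resident, not the full 8MB virtual reservation per thread.
    threads = [
        threading.Thread(target=time.sleep, args=(30,), daemon=True)
        for _ in range(12_000)
    ]
    for t in threads:
        t.start()

    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{len(threads)} threads, peak RSS ~{peak_kb / 1024:.0f} MB")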



