Good idea. It's an interesting historical question - when we picked 4.0 as the default ~25 years ago, how close was it to the calculated value? I was asking that myself. Unfortunately I don't have a machine with a traditional HDD in my homelab anymore, but I'll see if I can run the test somewhere.
I wouldn't be all that surprised if this was (partially) due to Postgres being less optimized back then, which might have hidden some of the random vs. sequential differences. But that's just a wild guess.
To the best of my knowledge, yes. Unfortunately the details of how it was calculated in ~2000 seem to be lost, but the person who did it described doing it roughly like this. It's possible we forgot some important details, of course, but the intent was to use the same formula. Which is why I carefully described and published the scripts, so that other engineers can point out thinkos and suggest changes.
Good point, I should have included that (the linked pgsql-hackers thread has some of this information, at least).
I've observed exactly this behavior on a wide range of hardware / environments, it's not very specific to particular SSD models (at least not for reads, which is what the blog post was measuring). That's why I showed results from three very different systems.
Some information for the two physical machines:
1) ryzen: Ryzen 9 9900X, RAID0 with 4x Samsung 990 PRO 1TB (in Asus Hyper M.2 Gen5 card)
2) xeon: E5-2699v4, WD Ultrastar DC SN640 960GB (U.3)
I don't know what exactly is backing the SSD storage on the Azure instance.
There probably is some additional inefficiency when reading pages randomly (compared to sequential reads), but most of the difference is at the storage level. That is, SSDs can handle a lot of random I/O, but it's nowhere close to sequential reads.
For example, I have a RAID0 with 4 SSDs (Samsung 990 PRO, so consumer, but quite good for reads). And this is what fio says:
With buffered I/O, random reads stay at ~19k IOPS, while sequential reads get to ~1M IOPS (thanks to read-ahead, either at the OS level or in the SSD).
So part of this is sequential reads benefiting from implicit "prefetching", which reduces the observed cost of a page. But for random I/O there's no such thing, and so it seems more expensive.
It's more complex (e.g. sequential reads allow issuing larger reads), of course.
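For reference, the fio runs were along these lines (an illustrative invocation, not the exact script - adjust --filename and --size for your system):

```shell
# random 8kB reads, one request at a time (roughly what a plain index scan does)
fio --name=randread --filename=/mnt/raid/testfile --size=32G \
    --rw=randread --bs=8k --ioengine=psync --iodepth=1 \
    --direct=0 --runtime=60 --time_based --group_reporting

# sequential 8kB reads, which benefit from OS/SSD read-ahead
fio --name=seqread --filename=/mnt/raid/testfile --size=32G \
    --rw=read --bs=8k --ioengine=psync --iodepth=1 \
    --direct=0 --runtime=60 --time_based --group_reporting
```

--direct=0 keeps it buffered I/O (going through the page cache), which is what the ~19k vs ~1M IOPS comparison above is about.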
> Because this is what "mimics" index scan (without prefetch) on cold data. More or less.
This is an interesting observation but does it really mimic the index scan? This would be essentially a worst case scenario. Submitting IO requests one by one would be a very inefficient way to handle scans, no?
True. Unfortunately it's what index scans in Postgres do right now - it's the last "major" scan type not supporting some sort of prefetch (posix_fadvise or AIO). We're working on it, hopefully it'll get into PG19.
AFAIK these two joins are exactly the same once you get past the parsing. It's just a different way to write an inner join. It's translated into the same AST and so there's no difference in planning/execution.
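To make it concrete (hypothetical tables), these are the two spellings in question:

```sql
-- implicit join syntax
SELECT *
  FROM orders o, customers c
 WHERE o.customer_id = c.id;

-- explicit join syntax; after parsing, the planner sees the same thing
SELECT *
  FROM orders o
  JOIN customers c ON o.customer_id = c.id;
```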
Perhaps in this very basic case they are exactly the same but is that still true if you add secondary WHERE conditions that apply to just one table, or if one "table" is actually a view with a complex query definition, or many other ways in which a very simple "example" can quickly get complicated?
In general, you split up the WHERE condition at every top-level AND. Then you do either pullup or pushdown or both (it depends a bit on how your planner looks on the inside). In the end, you end up with the two cases being exactly the same before you start planning.
For outer joins (left/right/full), it's different, and there you absolutely need the explicit join syntax (for correctness). And semijoins are not expressed the same way at all, partially for weird historical reasons.
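A quick illustration (hypothetical tables) of why the explicit syntax matters for outer joins - where you put a condition changes the result:

```sql
-- keeps all rows from orders; the condition only limits which
-- customers rows get matched (unmatched orders get NULLs)
SELECT *
  FROM orders o
  LEFT JOIN customers c ON o.customer_id = c.id AND c.active;

-- filters *after* the join, discarding the NULL-extended rows,
-- so this effectively behaves like an inner join
SELECT *
  FROM orders o
  LEFT JOIN customers c ON o.customer_id = c.id
 WHERE c.active;
```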
That is part of the key idea, yes. It's more elaborate, because it can split the aggregate - it can do part of it before the join, and finalize it after the join. Similarly to what we do for parallel queries.
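A hand-written sketch of that split, with hypothetical tables (the optimization does this internally, you don't write it yourself):

```sql
-- original query
SELECT c.region, sum(f.amount)
  FROM fact f
  JOIN codes c ON f.code_id = c.id
 GROUP BY c.region;

-- with the aggregate partially pushed below the join:
-- sum per code first, join the (much smaller) result, finalize after
SELECT c.region, sum(f.partial_amount)
  FROM (SELECT code_id, sum(amount) AS partial_amount
          FROM fact
         GROUP BY code_id) f
  JOIN codes c ON f.code_id = c.id
 GROUP BY c.region;
```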
As for indexes, it can help, but not in this particular example - the "code" tables are tiny, and the planner adds Memoize nodes anyway, so it acts like an ad hoc index.
Indexes are more of a complementary improvement, not an alternative to this optimization (i.e. neither makes the other unnecessary). FWIW in this case the indexes won't help very much - if you use more data in the code tables, it'll use a hash join, not nested loop / merge join.
That doesn't mean we couldn't do better with indexes, there probably are smart execution strategies for certain types of queries. But indexes also come with quite a bit of overhead (even in read-only workloads).
It's not about not knowing about an optimization. The challenge is knowing when to apply it, so that it doesn't cause regressions for cases that can't benefit from it. It may be less risky in specialized systems - BI systems, for example, typically don't need to worry about regressing OLTP workloads. Postgres absolutely needs to be careful about that.
I believe that's one of the reasons why it took ~8 years (the original patch was proposed in 2017).
IMHO the whole point of Qubes is that it does not do the compartmentalization at the level of individual applications, but groups of applications. Otherwise you'd need to very clearly specify how/when exactly the applications can exchange data, what data, etc. I'm not saying it's impossible, but "apps in the same qube VM can do whatever" is a much easier concept.
I believe there are reasons why e.g. io_uring could be inherently slower in some cases, and I tried to point some of those out.
With io_uring everything happens in the backend process, and so consumes some of the CPU time that might otherwise be spent executing the query. All the checksum verification, memcpy into shared buffers, etc. happen in the backend. And those things can be quite expensive. With worker this happens in the other processes, spreading the overhead.
Of course, on a truly I/O-bound workload (actually waiting on the I/O), this may not be a huge difference. For warmed-up cases it may be more significant.
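For reference, the choice discussed here is a couple of settings in postgresql.conf (as of Postgres 18; defaults may vary by platform):

```
# which AIO implementation the backends use
io_method = worker          # or 'io_uring' (Linux only), or 'sync'

# number of I/O worker processes, used when io_method = worker
io_workers = 3
```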
You can have an io_uring worker per thread and send jobs into a queue for them to pick up and execute, so it won't max out on CPU. Each worker would do cooperative async scheduling internally between the tasks, so nothing delays other things too much.
But this is more difficult than spawning thousands of OS threads and running sync workers on them.
Ah my mistake. I meant to write something like, you can have a thread per cpu core that is running an io_uring instance like you would have a bunch of workers in a pool in the "worker" implementation.
If there are no threads, then this would be: you can have a process per CPU core.
> Right now async IO is used for sequential scans and bitmap scans, not for index scans. My initial guess would be that it mostly helps for complex queries (that use multiple indexes, so bitmap scans) and unoptimized queries (sequential scans), not so much for straightforward and/or optimized queries that use a single index. But this is really just a guess, I'm way out of my depth at this point. I'm curious how much it'll help once it is implemented for index scans as well.
Those are good guesses, IMHO.
For sequential scans, some of the "async" work could be done by kernel read-ahead, but AIO makes it explicit and loads the data into shared buffers, not just page cache. For bitmap scans we already had prefetching by fadvise, which is somewhat similar to read-ahead (also into page cache), and there were some bugs that made it ineffective in some cases, and AIO fixes that.
For index scans the difference can be an order of magnitude (say, 5-10x). Doing random I/O block by block is simply awful, prefetching data is important. I was just doing some testing on TPC-H, and on scale 50 I see Q8 going from 130s to 20s, and Q19 from 50s to 8s. And smaller improvements for a couple more queries. Of course, it depends on what else the query is doing - if it's spending 1% on the index, you won't notice a difference.
I did a lot of tests comparing the io_method choices, and I've yet to see a realistic query where it makes a significant difference of more than a couple percent (in either direction). I'm sure it's possible to construct such queries, and it's interesting, but for real workloads it's mostly not noticeable.
At least that's how I see it right now, we'll see how that works on a much wider range of hardware and systems. The github repo linked from the pgsql-hackers post has a lot more results, some charts include results for the index prefetching patch - and there it makes more difference in some cases. But the patch is still fairly rough, it could be a bug in it too, and it changed a lot since August.