Good idea. It's an interesting historical question - when we picked 4.0 as the default ~25 years ago, how close was it to the calculated value? I was asking that myself. Unfortunately I don't have a machine with a traditional HDD in my homelab anymore, but I'll see if I can run the test somewhere.
I wouldn't be all that surprised if this was (partially) due to Postgres being less optimized back then, which might have hidden some of the random vs. sequential differences. But that's just a wild guess.
To the best of my knowledge, yes. Unfortunately the details of how it was calculated in ~2000 seem to be lost, but the person who did it described doing it roughly like this. It's possible we forgot some important details, of course, but the intent was to use the same formula. Which is why I carefully described and published the scripts, so that other engineers can point out thinkos and suggest changes.
Good point, I should have included that (the linked pgsql-hackers thread has some of this information, at least).
I've observed exactly this behavior on a wide range of hardware / environments, it's not very specific to particular SSD models (at least not for reads, which is what the blog post was measuring). That's why I showed results from three very different systems.
Some information for the two physical machines:
1) ryzen: Ryzen 9 9900X, RAID0 with 4x Samsung 990 PRO 1TB (in Asus Hyper M.2 Gen5 card)
2) xeon: E5-2699v4, WD Ultrastar DC SN640 960GB (U.3)
I don't know what exactly is backing the SSD storage on the Azure instance.
There probably is some additional inefficiency when reading pages randomly (compared to sequential reads), but most of the difference is at the storage level. That is, SSDs can handle a lot of random I/O, but it's nowhere close to sequential reads.
For example, I have a RAID0 with 4 SSDs (Samsung 990 PRO, so consumer, but quite good for reads). And this is what fio says:
With buffered I/O, random reads stay at ~19k IOPS, while sequential reads get to ~1M IOPS (thanks to read-ahead, either at the OS level or in the SSD).
So part of this is sequential reads benefiting from implicit "prefetching", which reduces the observed cost of a page. But for random I/O there's no such thing, and so it seems more expensive.
It's more complex (e.g. sequential reads allow issuing larger reads), of course.
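For reference, the fio runs were along these lines (an illustrative invocation, not the exact script - adjust --filename and --size for your system):

```shell
# random 8kB reads, one request at a time (roughly what a plain index scan does)
fio --name=randread --filename=/mnt/raid/testfile --size=32G \
    --rw=randread --bs=8k --ioengine=psync --iodepth=1 \
    --direct=0 --runtime=60 --time_based --group_reporting

# sequential 8kB reads, which benefit from OS/SSD read-ahead
fio --name=seqread --filename=/mnt/raid/testfile --size=32G \
    --rw=read --bs=8k --ioengine=psync --iodepth=1 \
    --direct=0 --runtime=60 --time_based --group_reporting
```

--direct=0 keeps it buffered I/O (going through the page cache), which is what the ~19k vs ~1M IOPS comparison above is about.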
> Because this is what "mimics" index scan (without prefetch) on cold data. More or less.
This is an interesting observation but does it really mimic the index scan? This would be essentially a worst case scenario. Submitting IO requests one by one would be a very inefficient way to handle scans, no?
True. Unfortunately it's what index scans in Postgres do right now - it's the last "major" scan type not supporting some sort of prefetch (posix_fadvise or AIO). We're working on it, hopefully it'll get into PG19.
AFAIK these two joins are exactly the same once you get past the parsing. It's just a different way to write an inner join. It's translated into the same AST and so there's no difference in planning/execution.
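To make it concrete (hypothetical tables), these are the two spellings in question:

```sql
-- implicit join syntax
SELECT *
  FROM orders o, customers c
 WHERE o.customer_id = c.id;

-- explicit join syntax; after parsing, the planner sees the same thing
SELECT *
  FROM orders o
  JOIN customers c ON o.customer_id = c.id;
```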
Perhaps in this very basic case they are exactly the same but is that still true if you add secondary WHERE conditions that apply to just one table, or if one "table" is actually a view with a complex query definition, or many other ways in which a very simple "example" can quickly get complicated?
In general, you split up the WHERE condition at every top-level AND. Then you do either pullup or pushdown or both (it depends a bit on how your planner looks on the inside). In the end, you end up with the two cases being exactly the same before you start planning.
For outer joins (left/right/full), it's different, and there you absolutely need the explicit join syntax (for correctness). And semijoins are not expressed the same way at all, partially for weird historical reasons.
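A quick illustration (hypothetical tables) of why the explicit syntax matters for outer joins - where you put a condition changes the result:

```sql
-- keeps all rows from orders; the condition only limits which
-- customers rows get matched (unmatched orders get NULLs)
SELECT *
  FROM orders o
  LEFT JOIN customers c ON o.customer_id = c.id AND c.active;

-- filters *after* the join, discarding the NULL-extended rows,
-- so this effectively behaves like an inner join
SELECT *
  FROM orders o
  LEFT JOIN customers c ON o.customer_id = c.id
 WHERE c.active;
```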
That is part of the key idea, yes. It's more elaborate, because it can split the aggregate - it can do part of it before the join, and finalize it after the join. Similarly to what we do for parallel queries.
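A hand-written sketch of that split, with hypothetical tables (the optimization does this internally, you don't write it yourself):

```sql
-- original query
SELECT c.region, sum(f.amount)
  FROM fact f
  JOIN codes c ON f.code_id = c.id
 GROUP BY c.region;

-- with the aggregate partially pushed below the join:
-- sum per code first, join the (much smaller) result, finalize after
SELECT c.region, sum(f.partial_amount)
  FROM (SELECT code_id, sum(amount) AS partial_amount
          FROM fact
         GROUP BY code_id) f
  JOIN codes c ON f.code_id = c.id
 GROUP BY c.region;
```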
As for indexes, it can help, but not in this particular example - the "code" tables are tiny, and the planner adds Memoize nodes anyway, so it acts like an ad hoc index.
Indexes are more of a complementary improvement, not an alternative to this optimization (i.e. neither makes the other unnecessary). FWIW in this case the indexes won't help very much - if you use more data in the code tables, it'll use a hash join, not nested loop / merge join.
That doesn't mean we couldn't do better with indexes, there probably are smart execution strategies for certain types of queries. But indexes also come with quite a bit of overhead (even in read-only workloads).
It's not about not knowing about an optimization. The challenge is knowing when to apply it, so that it doesn't cause regressions for cases that can't benefit from it. It may be less risky in specialized systems - BI systems, for example, typically don't need to worry about regressing OLTP workloads. Postgres absolutely needs to be careful about that.
I believe that's one of the reasons why it took ~8 years (the original patch was proposed in 2017).
IMHO the whole point of Qubes is that it does not do the compartmentalization at the level of individual applications, but groups of applications. Otherwise you'd need to very clearly specify how/when exactly the applications can exchange data, what data, etc. I'm not saying it's impossible, but "apps in the same qube VM can do whatever" is a much easier concept.
I believe there are reasons why e.g. io_uring could be inherently slower in some cases, and I tried to point some of those out.
With io_uring everything happens in the backend process, and so consumes some of the CPU time that might otherwise be spent executing the query. All the checksum verification, memcpy into shared buffers, etc. happen in the backend. And those things can be quite expensive. With worker this happens in the other processes, spreading the overhead.
Of course, on a truly I/O-bound workload (actually waiting on the I/O), this may not be a huge difference. For warmed-up cases it may be more significant.
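For reference, the choice discussed here is a couple of settings in postgresql.conf (as of Postgres 18; defaults may vary by platform):

```
# which AIO implementation the backends use
io_method = worker          # or 'io_uring' (Linux only), or 'sync'

# number of I/O worker processes, used when io_method = worker
io_workers = 3
```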
You can have an io_uring worker per thread and send jobs into a queue for them to pick up and execute, so it won't max out on CPU. Each worker would do cooperative async scheduling internally between the tasks, so nothing delays other things too much.
But this is more difficult than spawning thousands of OS threads and running sync workers on them.
Ah my mistake. I meant to write something like, you can have a thread per cpu core that is running an io_uring instance like you would have a bunch of workers in a pool in the "worker" implementation.
If there are no threads, then this would be: you can have a process per CPU core.
> Right now async IO is used for sequential scans and bitmap scans, not for index scans. My initial guess would be that it mostly helps for complex queries (that use multiple indexes, so bitmap scans) and unoptimized queries (sequential scans), not so much for straightforward and/or optimized queries that use a single index. But this is really just a guess, I'm way out of my depth at this point. I'm curious how much it'll help once it is implemented for index scans as well.
Those are good guesses, IMHO.
For sequential scans, some of the "async" work could be done by kernel read-ahead, but AIO makes it explicit and loads the data into shared buffers, not just page cache. For bitmap scans we already had prefetching by fadvise, which is somewhat similar to read-ahead (also into page cache), and there were some bugs that made it ineffective in some cases, and AIO fixes that.
For index scans the difference can be an order of magnitude (say, 5-10x). Doing random I/O block by block is simply awful, prefetching data is important. I was just doing some testing on TPC-H, and on scale 50 I see Q8 going from 130s to 20s, and Q19 from 50s to 8s. And smaller improvements for a couple more queries. Of course, it depends on what else the query is doing - if it's spending 1% on the index, you won't notice a difference.
I did a lot of tests comparing the io_method choices, and I've yet to see a realistic query where it makes a significant difference of more than a couple percent (in either direction). I'm sure it's possible to construct such queries, and it's interesting, but for real workloads it's mostly not noticeable.
At least that's how I see it right now, we'll see how that works on a much wider range of hardware and systems. The github repo linked from the pgsql-hackers post has a lot more results, some charts include results for the index prefetching patch - and there it makes more difference in some cases. But the patch is still fairly rough, it could be a bug in it too, and it changed a lot since August.