There's a library here that implements a lot of database features and can be use...

Dave_Rosenthal · 2025-05-08T16:50:46 1746723046

Thanks for the insight. A $XX million exa-data system is no doubt impressive :)

> Another big problem I found with modeling real apps was the five second transaction timeout. This is not, as you might expect, a configurable value. It's hard-coded into the servers and clients. This turns into a hugely awkward limitation and routinely wrecks your application logic and forces you to implement very tricky concurrency algorithms inside your app, just to do basic tasks. For example, computing most reports over a large dataset does not work with FoundationDB because you can't get a consistent snapshot for more than five seconds!

I'm pretty sure that the 5-second transaction timeout is configurable with a knob. You just need enough RAM to hold the key-range information for the transaction timeout period. Basically: throughput * transaction_time_limit <= RAM, since FDB enforces that isolation reconciliation runs in memory.

But, the other reason that 5 seconds is the default is that e.g. 1 hour read/write transactions don't really make sense in the optimistic concurrency world. This is the downside of optimistic concurrency. The upside is that your system never gets blocked by bad-behaved long-running transactions, which is a serious issue in real production systems.

Finally, I think that the current "Redwood" storage engine does allow long-lived read transactions, even though the original engine backing FDB didn't.

mike_hearn · 2025-05-08T21:00:20 1746738020

If it's configurable that's very new. It definitely has been hard coded and unconfigurable by design the entire history of the project that I've known it. They just tell you to work around it, not tweak it for your app.

Transactions holding locks for too long are indeed a problem though in Oracle transactions can have priorities and steal each other's locks.

Dave_Rosenthal · 2025-05-09T04:07:19 1746763639

It looks like it might be knobs (which can be changed via config file) called MAX_WRITE_TRANSACTION_LIFE_VERSIONS and MAX_READ_TRANSACTION_LIFE_VERSIONS now (defined in ServerKnobs.cpp)? It's in microseconds and probably needs to be synced client and server.

I don't know the details know, but it was definitely configurable when I wrote it :) I remember arguing for setting it to a default of 30/60 seconds we decided against as that would have impacted throughput at our default RAM budget. I thought might have been a good tradeoff to get people going, thinking they could tune it down (or up the RAM) if they needed to scale up perf later.

mike_hearn · 2025-05-09T08:13:58 1746778438

Hah, well I'll defer to your knowledge then :) I don't remember seeing any kind of negotiation protocol between client and server to find out what the timeout is, which would be needed for it to be properly configurable. But it could be easily added I guess. I admit I never understood why changing it wasn't a proper feature.

bob1029 · 2025-05-08T10:19:10 1746699550

> I started a job at Oracle Labs where I ended up using their database in a project, and that kind of gave me a new perspective on all this stuff.

I feel like a mandatory ~6 week boot camp using big boy SQL in a gigantic enterprise could go a long way to helping grow a lot of developers.

HN on average seems to have grave misconceptions regarding the true value of products like Oracle, MSSQL and DB2.

The whole reason you are paying money for these solutions is because you can't afford to screw with any of the other weird technical compromises. If you are using the RDBMS for a paid product and SQLite isn't a perfect fit, spending some money starts to make a lot of sense to me.

If the cost of a commercial SQL provider is too much for the margins to handle, I question if the business was ever viable to begin with.

mike_hearn · 2025-05-08T10:33:17 1746700397

Right. I should have done such a thing years ago. And, there are free/elastically scaled commercial RDBMS that are cheap. Oracle just give them away these days in the cloud. Even if you self-host, you can run small databases (up to iirc 12GB of data) for free, which is plenty for many internal apps.

The other thing that is super-unintuitive is the cost of cloud managed DBs. People like open source DBs because they see them as cheap and lacking lockin. I specced out what it costs to host a basic plain vanilla Postgres on AWS vs an Oracle DB on OCI at some point, and the latter was cheaper despite having far more features. Mostly because the Postgres didn't scale elastically and the Oracle DB did, plus AWS is just expensive. Well that was a shock. I'd always assumed a commercial RDBMS was a sort of expensive luxury, but once you decide you don't want to self-admin the DB the cost differences become irrelevant.

And then there's the lockin angle. Postgres doesn't do trademark enforcement so there's a lot of databases being advertised by cloud vendors as Postgres but are actually proprietary forks. The Jepsen test the other day was a wakeup call where an AWS "postgres" database didn't offer correct transactional isolation because of bugs in the forked code. If you're using cloud offerings you can end up depending on proprietary features or performance/scaling without realizing it.

So yeah. It's just all way more complex than I used to appreciate. FoundationDB is still a very nice piece of tech, and makes sense for the very specific iCloud business Apple use it for, but I'm not sure I'd try to use it if I was being paid to solve a problem fast.

jwr · 2025-05-08T13:07:35 1746709655

> grave misconceptions regarding the true value of products like Oracle, MSSQL and DB2

In the context of this discussion, I would offer that we are getting into "apples vs oranges" comparisons here.

If you are doing custom queries for building reports where each report needs to access humongous amounts of data, SQL databases are likely a good fit.

If you need a fast and yet correct distributed database (for fault tolerance) for an online app backend, where the data and query patterns are known and do not change much over time and all retrieval is done using indexes, SQL databases are not a great fit, "big boy" or not.

As for "questioning if the business was ever viable to begin with", as a solo founder SaaS builder, I would humbly point you to numerous HN discussions where people are OUTRAGED at any subscriptions, and expect software to be an inexpensive one-time purchase. But that's a separate discussion.

riku_iki · 2025-05-08T16:09:10 1746720550

> few hundred nodes at most, and Oracle RAC/ExaData clusters can scale up that far too

and how much license fee it will cost?

mike_hearn · 2025-05-08T20:57:18 1746737838

For over a hundred nodes? A lot, but these aren't little VMs with 4G of RAM and two vCPUs we are talking about there. Almost nobody has databases that big.

For a more realistic deployment in the cloud with 2TB of data, 4TB of backup storage and peak of 64 cores with elastic scaling it's about 6k/month. So the moment you're spending more than a third of a decently skilled SWE on working around database limitations it's worth it. That's for a fully managed HA cluster with 99.95% uptime, rolling upgrades etc.

riku_iki · 2025-05-08T21:15:04 1746738904

> 2TB of data, 4TB of backup storage and peak of 64 cores with elastic scaling it's about 6k/month

right, but foundationdb and similar is for cases where you actually need throughput of hundreds servers, for what you described OSS Postgres will work well.

mike_hearn · 2025-05-09T08:26:01 1746779161

It's confusing because of terminology differences. What FoundationDB calls a "server" is a single thread, often running on a commodity cloud VM. What Postgres calls a server is a multi-process instance on a single machine, often a commodity cloud VM. What ExaData calls a server is a dedicated bare metal machine with at least half a terabyte of RAM and a couple hundred CPU cores, with a gazillion processes running on it, linked to other cluster nodes by 100Gb networks. Often the machines are bigger.

Obviously then a workload that might require dozens or hundreds of "servers" in FoundationDB can fit on just one in an Oracle database. Hundreds of nodes in an ExaData cluster would mean tens of thousands of cores and hundreds of terabytes of RAM, not to even get into the disk sizes. The nodes in turn are linked by a dedicated high-bandwidth network, with user traffic being routed over the regular network.

As far as I know, nobody has ever scaled a FoundationDB cluster to tens of thousands of servers. So you can argue then that ExaData scales far better than FoundationDB does.

This makes sense for the design because in an RDBMS cluster the inputs and outputs are small. A little bit of SQL goes in, a small result set comes out (usually). But the data transfers and work required to compute that small result might be very large - in extremis, it may require reading the entire database from disk. Even with the best networks bandwidth inside a computer is much higher than bandwidth between computers, so you want the biggest machines possible before spilling horizontally. You just get much better bang-for-buck that way.

In FoundationDB on the other hand nodes barely communicate at all, because it's up to the client to do most of the work. If you write a workload that requires reading the entire database it will just fail immediately because you'll hit the five second timeout. If you do it without caring about transactional consistency (a huge sacrifice), you'll probably end up reading the entire database over a regular commodity network before being able to process it at all - much slower.

riku_iki · 2025-05-09T12:18:08 1746793088

> Postgres calls a server is a multi-process instance on a single machine, often a commodity cloud VM. What ExaData calls a server is a dedicated bare metal machine with at least half a terabyte of RAM and a couple hundred CPU cores, with a gazillion processes running on it, linked to other cluster nodes by 100Gb networks. Often the machines are bigger.

You can also have PG cluster of beefy machines linked by 100Gb network, I didn't get what's the difference in this case.

mike_hearn · 2025-05-09T13:58:44 1746799124

This is again an issue of terminology. A PG cluster is at best one write master and then some full read replicas. You don't necessarily need a fast interconnect for that, but your write traffic can never exceed the power of the one master, and storage isn't sharded but replicated (unless you host all the PG files on a SAN which would be slow).

A RAC cluster is multi-write master, and each node stores a subset of the data. You can then add as many read caches (not replicas) as you need, and they'll cache only the hot data blocks.

So they're not quite the same thing in that sense.

riku_iki · 2025-05-09T14:11:33 1746799893

PG ecosystem actually has multiple shared nothing cluster implementations, example is www.citusdata.com, unlike RAC, where masters need to sync to accept writes, so technically write load is not distributted across servers.

mike_hearn · 2025-05-10T07:52:53 1746863573

Citus is great but it's just sharding. Only some very specific use cases fit within those limits, and there are cost issues too (being ha requires replicas of each node etc). RAC masters don't need to sync writes to each other, that's not how it works. Every node can be writing independently, they only communicate when one node has exclusive ownership over a block another node needs to write to, and the transfer then occurs peer to peer over rdma. But if writes are scattered they work independently.

riku_iki · 2025-05-10T14:43:45 1746888225

> they only communicate when one node has exclusive ownership over a block another node needs to write to

and it needs to communicate that he has exclusive ownership, and also after each writes you need to invalidate cached data on other nodes, and read new data from some transactionally consistent store, which will do all heavylifting (syncing/reconciling writes), which is kinda FDB.

jwr · 2025-05-09T05:47:00 1746769620

> fully managed HA cluster

This is where Suspension of Disbelief stops for me — I've been taught by years of Jepsen analyses that "HA" really doesn't exist, especially in the SQL database world, and especially if there is a big company behind with lots of big complicated buzzwords.

What you usually get is single master replication, and you get to pick up the pieces if the master dies.

mike_hearn · 2025-05-09T08:15:52 1746778552

Jepsen doesn't work through databases in any kind of order as far as I can tell, and they haven't done an analysis of Oracle or many other popular databases. So I wouldn't take that as a representative sample.

RAC clusters are multi-write-master and HA. The drivers know how to fail over between nodes, can continue sessions across a failover and this can work even across major version upgrades. The tech has existed for a long time already.

If you're curious you can read about it more in the docs.

"Application continuity" i.e. continuing a connection if a node dies without the application noticing:

https://docs.oracle.com/en/database/oracle/oracle-database/2...

Fast failover and load balancing for cluster clients (here in the context of Java but it works for other languages too):

https://docs.oracle.com/en/database/oracle/oracle-database/2...

I mean, I understand the skepticism because HN never covers this stuff. It's "enterprise" and not "startup" for whatever reason, unless Google do it. But Oracle has been selling databases into nearly every big company on earth for decades. Do you really think nobody has built an HA and horizontally scalable SQL based database? They have and people have been using it to run 24/7 businesses at scale for a long time.

jen20 · 2025-05-19T04:25:00 1747628700

I wonder if the clause in the Oracle license that prohibits publishing benchmarks and other studies could be to blame for Jepsen having not investigated..?