Postgres can be scaled vertically like Stack Overflow did, with an edge cache for popular reads if you absolutely must (but you most likely don't).
No need for microservices or even synced read replicas (unless you are making a game). No load balancers. Just bump the RAM and CPU up to TB levels for heavy real-world apps (99% of you won't ever run into this issue).
Seriously, it's so easy to create scalable backend services with PostgREST, RPC, triggers, V8, and even queues, all in Postgres now. You don't even need the cloud. Even a modestly RAM'd VPS will do for most apps.
I got rid of Redis, Kubernetes, RabbitMQ, and a bunch of SaaS tools. I just do everything on Postgres and scale vertically.
One server. No serverless. No microservices or load balancers. It's sooo easy.
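To make the "even queues in Postgres" point concrete, here is a rough sketch of the usual SKIP LOCKED worker pattern. This is illustrative only: the jobs table, its columns, and the connection string are made up, and psycopg2 plus a local database are assumed.

```python
# Hypothetical jobs table: id BIGSERIAL, status TEXT, payload JSONB.
import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection string

def claim_next_job():
    """Atomically claim one pending job; concurrent workers skip each other's row locks."""
    with conn:                                # one transaction: claim and mark in a single step
        with conn.cursor() as cur:
            cur.execute("""
                UPDATE jobs
                   SET status = 'running'
                 WHERE id = (
                       SELECT id
                         FROM jobs
                        WHERE status = 'pending'
                        ORDER BY id
                          FOR UPDATE SKIP LOCKED
                        LIMIT 1)
             RETURNING id, payload
            """)
            return cur.fetchone()             # None when the queue is empty
```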
Stack Overflow absolutely had load balancers, nine web servers, and Redis caches. They also used 4 SQL servers, so not entirely vertical either. And they were only serving 500 requests a second on average (peak was probably higher).
I get what you're saying: they didn't do dynamic, "wild" horizontal scaling; they focused more on having an optimal architecture with beefy, "vertically scaled" servers.
Very much something we should focus on. These days, horizontal scaling, microservices, Kubernetes, and just generally "throwing compute" at the problem are the lazy answers to scaling issues.
That's a primary and backup server for Stack Overflow and a primary/backup for SE, but they each have the full dataset for their sites - not actual horizontal scaling. Also, that page is just a static marketing tool, not very representative of their current stack. See: https://meta.stackexchange.com/questions/374585/is-the-stack...
Having most of the servers be loaded at about 5% CPU usage feels extremely wasteful, but at the same time I guess it's better to have the spare capacity for something that you really want to keep online, given the nature of the site.
However, if they have a peak of 450 web requests per second and somewhere between 11,000 and 23,800 SQL queries per second, that'd mean between 25 and 53 SQL queries to serve a single request. There are probably a lot of background processes and whatnot (and also queries needed for web sockets) that cut the number down, and it's not that bad either way, but I do wonder why that is.
The apps with good performance that I've generally worked with attempted to minimize the number of DB requests needed to serve a user's request (e.g. session cached in Redis/Valkey and using DB views to return an optimized data structure that can be returned with minimal transformations).
Having at least 2 web servers and a read-only DB replica for redundancy/high availability is very easy and much safer. Yes, setting up a single server is faster, but if your DB server dies - and at some point it will happen - you'll save yourself not just a lot of downtime, but also a lot of stress and additional work.
Read replicas come with their own complexity as you have to account for the lag time on the replica for UX. This leads to a lot of unexpected quirks if it’s not planned for.
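For illustration only (not something from the thread): one common way to plan for that lag is to pin a user's reads to the primary for a short window after they write. A minimal sketch, assuming a single fixed upper bound on replica lag and simple in-process state:

```python
import time

PIN_SECONDS = 5.0        # assumed upper bound on replica lag
_last_write = {}         # user_id -> monotonic timestamp of that user's last write

def record_write(user_id):
    _last_write[user_id] = time.monotonic()

def connection_for_read(user_id, primary, replica):
    """Route reads to the primary while the user's last write may not have replicated yet."""
    wrote_recently = time.monotonic() - _last_write.get(user_id, float("-inf")) < PIN_SECONDS
    return primary if wrote_recently else replica
```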
My startup has a similar setup (Elixir + Postgres). We use Aurora so we get automated failover. It's more expensive, but it's just a cost of doing business.
It still is. But you have to look at it in perspective: do you have customers that NEED high availability and will pull out pitchforks if you are down for even a few minutes? I do. The peace of mind is what you're paying for in that case.
Plus it's still cheaper than paying a DevOps engineer a full-time salary to maintain these systems if you do it on your own.
That works for the performance aspect, but doesn't address any kind of High Availability (HA).
There are definitely ways to make HA work, especially if you run your own hardware, but the point is that you'll need (at least) a 2nd server to take over the load of the primary one that died.
Thank you for sharing this! I have been diving into it.
How do you manage transactions with PostgREST? Is there a way to do it inside it? Or does it need to be in a good old endpoint/microservice? I can’t find anything in their documentation about complex business logic beyond CRUD operations.
Yes, scaling vertically is much easier than scaling horizontally and dealing with replicas, caching, etc. But that certainly has limits and shouldn’t be taken as gospel, and is also way more expensive when you’re starting to deal with terabytes of RAM.
I also find it very difficult to trust your advice when you’re telling folks to stick Postgres on a VPS - for almost any real organization using a managed database will pay for itself many times over, especially at the start.
Looking at Hetzner benchmarks, I would say a VPS is quite enough to handle Postgres for an Alexa Top 1000 site. Once you approach the top 100, you will need more RAM than what is offered.
But my point is you won't ever hit this type of traffic. You don't even need Kafka to handle streams of logs from a fleet of generators out in the wild. Postgres just works.
In general, the problem with modern backend architectural thinking is that it treats the database as some unreliable bottleneck, but that is an old-fashioned belief.
The vast majority of HN users and startups are not going to be servicing anything close to 1 million transactions per second. Even a medium-sized VPS from DigitalOcean running Postgres can handle the load most of them will actually see just fine.
Postgres is very fast and efficient, and you don't need to build your architecture around problems you won't ever hit, prepaying a premium for the <0.1% peak that happens so infrequently (unless you are a bank and get fined for downtime).
I work at a startup that is less than 1 year old and we have indices that are in the hundreds of gigabytes. It is not as uncommon as you think. Scaling vertically is extremely expensive, especially if one doesn’t take your (misguided) suggestion to run Postgres on a VPS rather than using a managed solution like most do.
Then your service is offline until you fix it. For many services that's a completely acceptable thing to happen once in a blue moon.
Most would probably get two servers with a simple failover strategy. But on the other hand, servers rarely die. At the scale of a datacenter it happens often, but if you have, say, six of them, buy server-grade hardware, and replace them every 3-5 years, chances are you won't experience any hardware issues.
Not sure what you are building, but I hope that was for a real-time multiplayer game; otherwise it doesn't make sense to have bi-directional communication when you only need reads.
Making read replicas also accept writes is needed for such cases, but as soon as you have more than one place to write, you run into edge cases and complexity in debugging.
I think the reason is that pushes are sent out regularly in batches by some cron system, and rather than reading from the main database it reads from the replica before pushing them out. I didn't really explain the context properly in my comment.
I ran into some scaling challenges with Postgres a few years ago and had to dive into the docs.
While I was mostly living out of the "High Availability, Load Balancing, and Replication" chapter, I couldn't help but poke around and found the docs to be excellent in general. Highly recommend checking them out.
They are excellent! Another great example is the Django project, which I always point to for how to write and structure great technical documentation. Working with Django/Postgres is such a nice combo and the standards of documentation and community are a huge part of that.
Interestingly I have had almost the exact opposite experience being very frustrated with the Django docs.
To be fair, it could be because I'm frustrated with Django's design decisions having come from Rails.
Having learned Django a few years ago, I still carry a deep loathing of polymorphism (generic relations[0]) and model validations (full_clean[1]).
Generic relations are hard to get right; really, if you can avoid using them, you're going to avoid a lot of trickiness.
When you need them... it's nice to have them "just there", implemented correctly (at least as correctly as they can be in an entirely generic way).
Model validations are a whole thing... I think Django offering a built-in auto-generated admin leads to a whole slew of differing decisions that end up being really tricky to handle.
- Model validations aren't run automatically. Need to call full_clean manually.
- EXCEPT when you're in a form! Forms have their own clean, which IS run automatically because is_valid() is run.
- This also happens to run the model's full_clean.
- DRF has its own version of create which is separate and also does not run full_clean.
- Validation errors in DRF's Serializers are a separate class of errors from model validations, and thus model ValidationErrors are not handled automatically.
- Can't monkey-patch models.Model.save to run full_clean automatically because it breaks some models like User, AND now it would run twice for Forms+Model[0].
Because of some very old web-forum-style design decisions, model validations aren't unified, so you need to know whether you're calling .save()/.create() manually, are in a form, or are in DRF (rough sketch below). Changing this behavior has been requested, but it breaks backwards compatibility[0].
It's frustrating because in Rails this is a solved problem. Model validations ALWAYS run (and only once) because... I'm validating the model. Model validations == data validations, which means they should hold everywhere regardless of caller; in the exceptional cases I should have to be explicit about skipping them (as in Rails), whereas in Django I need to be explicit about running them - sometimes... it depends where I am.
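A minimal sketch of that fragmentation, assuming a configured Django project and a made-up Article model:

```python
from django.core.exceptions import ValidationError
from django.db import models

class Article(models.Model):
    title = models.CharField(max_length=100)

    def clean(self):
        # Model-level (data) validation.
        if not self.title.strip():
            raise ValidationError({"title": "must not be blank"})

# Plain ORM calls skip clean()/full_clean(): the blank title goes straight to the DB.
Article.objects.create(title="   ")

# To enforce the model validation here, you must remember to call it yourself:
article = Article(title="   ")
article.full_clean()   # raises ValidationError
article.save()

# A ModelForm, by contrast, runs the model's full_clean() as part of form.is_valid(),
# while DRF serializers do neither - they have their own validation layer.
```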
Thanks for your reply. I'm currently in a stage of falling out of love with Django and trying to get my thoughts together on why that is.
I think Django seems confused on the issue of clean/validation. On the one hand, it could say the "model" is just a database table and any validation should live in the business logic of your application. This would be a standard way of architecting a system where the persistence layer is in some peripheral part that isn't tied to the business logic. It's also how things like SQLAlchemy ORM are meant to be used. On the other hand, it could try to magically handle the translation of real business objects (with validation) to database tables.
It tries to do both, with bad results IMO. It sucks to use it on the periphery like SQLAlchemy, it's just not designed for that at all. So everyone builds "fat" models that try to be simultaneously business objects plus database tables. This just doesn't work for many reasons. It very quickly falls apart due to the object relational mismatch. I don't know how Rails works, but I can't imagine this ever working right. The only way is to do validation in the business layer of the application. Doing it in the views, like rest framework or form cleans is even worse.
Yeah definitely understand the frustration. I've been there and while I don't think we've found _the_ solution, we've settled into a flow that we're generally happy with.
For us, we separate validations in two - business and data validations - which are generally defined as:
- Business: The invoice in country X needs to ensure Y and Z taxes are applied at billing T+3 days, otherwise throw an error.
- Data Validation: The company's currency must match the country it operates in.
Business validations and logic always go inside services, whereas data validations live on the model. Data validations apply to 100% of all inserts; once there's an IF statement segmenting a group, it becomes a business validation.
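A rough sketch of that split in Django terms (the model, service, and rule names here are invented for illustration, not our actual code):

```python
from django.core.exceptions import ValidationError
from django.db import models

CURRENCY_BY_COUNTRY = {"US": "USD", "DE": "EUR", "JP": "JPY"}  # illustrative mapping

class Company(models.Model):
    country = models.CharField(max_length=2)
    currency = models.CharField(max_length=3)

    def clean(self):
        # Data validation: holds for every insert, no segmenting of records.
        if CURRENCY_BY_COUNTRY.get(self.country) != self.currency:
            raise ValidationError("currency must match the country the company operates in")

def issue_invoice(company, amount, billing_date, tax_rules):
    # Business validation/logic: country-specific rules live in the service layer.
    rules = tax_rules.get(company.country)
    if rules is None:
        raise ValueError(f"no tax rules configured for {company.country}")
    taxes = [rule(amount, billing_date) for rule in rules]
    return amount + sum(taxes)
```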
I could see an argument as to why the above is bad, because sometimes it's a qualitative decision. Once in a while the lines get blurry, a data validation becomes _slightly_ too complex, and an argument ensues as to whether it's data vs business logic.
Our team really adheres to services and not fat models, sorry DHH.
To me it's all so controversial that whatever you pick will work out just fine - just stick to it and don't get lazy about it.
Services are definitely better and a solid part of domain-driven design. The trouble is, with Django I think it's a bandaid on a fundamentally broken architecture. The models end up anaemic because they're trying to be two things at once. It's super common to see things like services directly mutating model attributes and setting up relationships manually by creating foreign keys, etc. All of that should be hidden far away from services.
The ultimate I think is Domain-Driven Design (or Clean Architecture). This gives you a true core domain model that isn't constrained by frameworks etc. It's as powerful as it can be in whatever language you use (which in the case of Python is very powerful indeed). Some people have tried to get it to work with Django but it fights against you. It's probably more up front work as you won't get things like Django admin, but unless you really, truly are doing CRUD, then admin shouldn't be considered a good thing (it's like doing updates directly on the database, undermining any semblance of business rules).
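For what it's worth, a bare-bones sketch of that shape (all names hypothetical): a plain domain object that owns the rules, and a thin repository as the only place that touches the ORM:

```python
from dataclasses import dataclass

@dataclass
class Invoice:
    """Pure domain object: no framework imports, rules enforced here."""
    country: str
    total_cents: int

    def apply_tax(self, rate: float) -> None:
        if rate < 0:
            raise ValueError("tax rate cannot be negative")
        self.total_cents = round(self.total_cents * (1 + rate))

class DjangoInvoiceRepository:
    """The only layer that knows about the ORM; swappable without touching the domain."""
    def add(self, invoice: Invoice) -> None:
        from billing.models import InvoiceRow        # hypothetical Django model
        InvoiceRow.objects.create(country=invoice.country,
                                  total_cents=invoice.total_cents)
```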
15 years ago I worked on a couple of really high-profile Rails sites. We had millions of users with Rails and a single MySQL instance (+ memcached and nginx). Back then Ruby was a bit slower than it is today, but I'm certain some of the things we did to handle that scale are things people still do today…
1. Try to make most things static-ish reads and cache generic stuff, e.g. most things became non-user-specific HTML that got cached as SSI via nginx or memcached (a rough sketch of this follows below).
2. Move dynamic content to services that load after the static-ish main content, e.g. comments, likes, etc. would be loaded via JSON after the page load.
3. Move write operations to microservices, i.e. creating new content and changes to the DB become mostly deferrable background operations.
I guess the strategy was to serve as much content as possible without dipping into the Ruby layer, except for writes or infrequent reads that would update the cache.
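A rough cache-aside sketch of point 1, in Python rather than the original Ruby/nginx setup (render_article is a stand-in for the expensive templating work; pymemcache and a local memcached are assumed):

```python
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))   # assumes a local memcached

def render_article(article_id):
    # Stand-in for the expensive, non-user-specific templating work.
    return f"<article>rendered body for {article_id}</article>"

def article_html(article_id):
    key = f"article:{article_id}:html"
    html = cache.get(key)                  # bytes on a hit, None on a miss
    if html is None:
        html = render_article(article_id).encode("utf-8")
        cache.set(key, html, expire=300)   # safe to share for a few minutes: not user-specific
    return html.decode("utf-8")
```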
What a small world. Earlier today I got tagged in a PR [1] where Andrew became the maintainer of a Ruby gem related to database migrations. Good to know he is involved in multiple projects in this space.
Hi there! That's funny! This interview and those gem updates were unrelated. However both are part of the sweet spot for me of education, advocacy, and technical solutions for PostgreSQL and Ruby on Rails apps.
I hope you’re able to check out the podcast episode and enjoy it. Thanks for weighing in within the gem comments, and for commenting here on this connection. :)
For real. Show me a company that has scaled RoR or Django to 1 million concurrent users without blowing $250,000/month on their AWS bill. I've worked at unicorn companies trying to do exactly that.
Their baseline was 800 instances of the Rails app...lol.
I'm not going to name-names (you've heard of them) ... but this is a company that had to invent an entirely new and novel deployment process in order to get new code onto the massive beast of Rails servers within a finite amount of time.
I've scaled a single Rails server to 50k concurrent, so if Rails is the theoretical bottleneck there, and we base it off scaling my meager efforts, that's only 20 servers for 1 million concurrent, or around $1000/mo at the price point I was paying (Heroku).
Rails these days isn't at the top of the speed charts, but it's not that slow either.
“Rails can’t scale” is so 10 years ago. It’s often other things like DB queries or network I/O that tend to be bottlenecks, or you have a huge Rails monolith that has a large memory footprint, or an application that isn’t well architected or optimized.
Sounds impressive until you realize that there are 86,400 seconds in a day, so even if the majority of those requests happen during business hours, that's still firmly under 200 qps per server. On modern hardware that's very small. Also, what instance size?
The language/runtime certainly has an impact. But indeed, in reality there is no way to compare these scaling claims. For all we know, people are talking about serving an HTTP-level cache without even hitting the runtime.
This is trivial with epoll(7) or io_uring(7). What you are describing ("5 EC2 instances") could likely be attributed to language and/or framework bloat, but it's hard to know for certain without details.
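To illustrate the epoll point (a bare sketch, not the setup being discussed): Python's selectors module uses epoll on Linux, and a single process can multiplex thousands of connections this way:

```python
import selectors
import socket

sel = selectors.DefaultSelector()          # epoll-backed on Linux
server = socket.socket()
server.bind(("0.0.0.0", 8080))
server.listen(1024)
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, data=None)

def serve_forever():
    while True:
        for key, _mask in sel.select():
            if key.data is None:                      # listening socket: new connection
                conn, _addr = key.fileobj.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ, data="client")
            else:                                     # readable client socket
                conn = key.fileobj
                payload = conn.recv(4096)
                if payload:
                    conn.sendall(payload)             # echo back
                else:                                 # client closed the connection
                    sel.unregister(conn)
                    conn.close()
```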
Raw PHP scripts, no ORM either. It has very good abstractions for some logic, and for other parts it is just spaghetti functions. Changing anything is difficult and critical, so we are not able to refactor much.
> Show me a company that has scaled RoR or Django to 1 million concurrent users without blowing $250,000/month on their AWS bill.
Can you deploy something to Vercel that supports a million concurrent users for less than $250K/month? What about using AWS Lambdas? Go microservices running in K8s?
I think your infra bills are going to skyrocket no matter your software stack if you're serving 1 million+ concurrent users.
"without blowing $250,000/month on their AWS bill". The point is that you don't need AWS for this! You can use Docker to configure much, much cheaper/faster physical servers from Hetzner or similar with super-simple automated failover, and you absolutely don't need an expensive dedicated OPS team for that for this kind of simple deployments, as I read so often here on HN.
You might get surprised as how far you can go with the KISS approach with modern hardware and open source tools.
You ain't replacing 250k/mo worth of EC2 with a single Hetzner server, so your "super-simple failover" option goes out the window. Bare metal is not that much faster if you're running Ruby on it; don't fall for the marketing.
I never said that you should only have one server on Hetzner. For the web servers and background workers, though, scaling horizontally with Docker images on physical servers is still trivial.
By the way, I was running my startup on 17 physical machines on Hetzner, so I'm not talking from marketing but from experience.
I had a similar experience: working in a large Ruby codebase made me realise how important type hints are. Sometimes I had to investigate what types were expected and required because the editor was unable to tell me. I hope RBS / Sorbet solves this.
Rails and Postgres (and AWS) was the pre-acquisition stack, and development continued with that stack during this time period (2020-2021).
https://en.wikipedia.org/wiki/Flip_(software)
Microsoft acquired companies with web and mobile platforms with varied backgrounds at a high rate. I got the sense that the tech stack—at least when it was based on open source—was evaluated for ongoing maintenance and evolution on a case by case basis. There was a cloud migration to Azure and encouragement to adopt Surface laptops and VS Code, but the leadership advocated for continuing development in the stack as feature development was ongoing, and the team was small.
Besides hosted commercial versions, I was happy to see Microsoft supporting community/open source PostgreSQL so much and they continue to do so.
PostgreSQL has been the most popular choice for greenfield .NET projects for a while too. There really isn't any vendor lock-in as most of the ecosystem is built with swappable components.
Perhaps because you need to scale quickly and already have a large Rails app that would take a long time to recreate in another language and framework.
Scaling a framework that isn't scalable by default - and that should have been a few services written in a performance-first language - at a billion+ USD company.
I am not sure why we are boiling the oceans for the sake of a language like Ruby and a framework like Rails. I love them to death, but Amazon's approach is much better (or it used to be): you can't build a service for 10,000+ users in anything other than C++ or Java (and probably Rust as well nowadays).
For millions of users the CPU cost difference probably justifies the rewrite cost.
You are connecting the dots backwards, but a project is usually trying to connect the dots forward.
So if you have a lot of money, then you can start implementing your own web framework from scratch in C. It will be the perfect framework for your own product, and you can put 50 dev/sec/ops/* people on the team to make sure both the framework and the product code get written.
But some (probably most) products are started with 1-2 people trying to find product-market fit, or whatever the name is for solving a real problem for paying users as fast as they can, and then defer scaling to when money is coming in.
This case is similar, because it's about a startup/product bought by Microsoft, not built in-house.
For fast delivery of stable, secure code for web apps, Rails is a perfect fit. I am not saying it's the only one, but there are not that many frameworks offering the stability and batteries included to let a small team deliver a web app that can scale to product-market fit while keeping the team small.
"For millions of users the CPU cost difference probably justifies the rewrite cost." This is only true if you have expensive computations done in Ruby or Python or similar, which is very rarely the case.
Not true: Ruby and Python are absurdly slow at even trivial tasks. Moving stuff around in memory, which is most of what a web app is, is expensive. Lots of branches are going to be really expensive too.
I've got more than 15 years of Rails production experience, including a lot of performance optimisation, and in my experience the Ruby code is very rarely the bottleneck. And in those cases, you can almost always find some solution.
You really do not know what you are talking about; it is not about the language, as has been repeated in this forum many, many times already. We serve a PHP application to thousands of users per second in under 100ms, consistently.
Sometimes it is the language. Or at least the ecosystem and libraries available.
My go-to example is graphql-ruby, which really chokes serializing complex object graphs (or did, it's been a while now since I've had to use it). It is pretty easy to consume 100s of ms purely on compute to serialize a complex graphql response.
I have mixed feelings about this. It's like saying that Python is too slow for data science while ignoring that Python can outsource that work to Pandas or NumPy.
For GraphQL on Rails you can avoid graphql-ruby and use Agoo[1] instead, so that the work is outsourced to C. So in practice it's not a problem.
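On the Python side of that comparison, a quick illustrative sketch of what "outsourcing to NumPy" buys (this is a generic micro-benchmark, not a graphql-ruby/Agoo measurement; exact numbers depend on the machine):

```python
import timeit

import numpy as np

xs = list(range(1_000_000))
arr = np.arange(1_000_000)

# Same sum-of-squares computed in a pure-Python loop vs. a vectorized NumPy call.
pure_python = timeit.timeit(lambda: sum(x * x for x in xs), number=10)
with_numpy = timeit.timeit(lambda: int((arr * arr).sum()), number=10)

print(f"pure Python: {pure_python:.3f}s  NumPy: {with_numpy:.3f}s")
```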