Scaling Rails and Postgres to millions of users at Microsoft (stepchange.work)



Postgres can be scaled vertically, like Stack Overflow did, with an edge cache for popular reads if you absolutely must (but you most likely don't).

No need for microservices or even synced read replicas (unless you are making a game). No load balancers. Just up the RAM and CPU to TB levels for heavy real-world apps (99% of you won't ever run into this issue).

Seriously, it's so easy to create scalable backend services with PostgREST, RPC, triggers, V8, and even queues now, all in Postgres. You don't even need the cloud. Even a mildly RAM'd VPS will do for most apps.

I got rid of Redis, Kubernetes, RabbitMQ, and a bunch of SaaS tools. I just do everything on Postgres and scale vertically.

One server. No serverless. No microservices or load balancers. It's sooo easy.
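
For the "queues in Postgres" part, here's a minimal sketch of the usual SELECT ... FOR UPDATE SKIP LOCKED worker pattern; the jobs table, column names, and connection settings are hypothetical, and it assumes psycopg2:

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection settings

    def claim_next_job():
        # Grab one pending job; SKIP LOCKED lets concurrent workers
        # pick different rows instead of blocking on each other.
        with conn, conn.cursor() as cur:  # "with conn" wraps this in a transaction
            cur.execute(
                """
                SELECT id, payload
                FROM jobs
                WHERE status = 'pending'
                ORDER BY created_at
                FOR UPDATE SKIP LOCKED
                LIMIT 1
                """
            )
            row = cur.fetchone()
            if row is None:
                return None  # queue is empty
            job_id, payload = row
            cur.execute("UPDATE jobs SET status = 'running' WHERE id = %s", (job_id,))
            return job_id, payload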


Stack Overflow absolutely had load balancers, and 9 web servers, and Redis caches. They also used 4 SQL servers, so not entirely vertical either. And they were only serving 500 requests a second on average (peak was probably higher).


Was it? I read it was a huge RAM server.


The details of their architecture are documented in a series of blog posts:

https://nickcraver.com/blog/2016/02/03/stack-overflow-a-tech...

I get what you're saying: they didn't do dynamic and "wild" horizontal scaling; they focused more on having an optimal architecture with beefy "vertically scaled" servers.

Very much something we should focus on. These days, horizontal scaling, microservices, Kubernetes, and just generally "throwing compute" at the problem are the lazy answer to scaling issues.



That's a primary and backup server for Stack Overflow and a primary/backup for SE. But they each hold the full dataset for their sites, so it's not actual horizontal scaling. Also, that page is just a static marketing tool, not very representative of their current stack. See: https://meta.stackexchange.com/questions/374585/is-the-stack...


Having most of the servers be loaded at about 5% CPU usage feels extremely wasteful, but at the same time I guess it's better to have the spare capacity for something that you really want to keep online, given the nature of the site.

However, if they have a peak of 450 web requests per second and somewhere between 11,000 and 23,800 SQL queries per second, that'd mean between 25 and 53 SQL queries to serve a single request. There are probably a lot of background processes and whatnot (and also queries needed for web sockets) that cut that number down, and it's not that bad either way, but I do wonder why that is.

The well-performing apps I've worked with generally attempted to minimize the number of DB requests needed to serve a user's request (e.g. sessions cached in Redis/Valkey and DB views returning an optimized data structure that needs minimal transformation).

Either way, that's quite a beefy setup!


Having at least 2 web servers and a read-only DB replica for redundancy/high availability is very easy and much safer. Yes, setting up a single server is faster, but if your DB server dies - and at some point it will - you'll not just avoid a lot of downtime, but also a lot of stress and additional work.


Read replicas come with their own complexity as you have to account for the lag time on the replica for UX. This leads to a lot of unexpected quirks if it’s not planned for.
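
A minimal sketch of checking that lag directly on the replica (assuming psycopg2 and a hypothetical standby host); pg_last_xact_replay_timestamp() is only meaningful on a standby and only moves while the primary has write activity:

    import psycopg2

    # Hypothetical DSN pointing at the read replica.
    conn = psycopg2.connect("host=replica.internal dbname=app user=app")
    with conn.cursor() as cur:
        # Approximate replication lag: time since the last replayed transaction.
        cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag")
        (lag,) = cur.fetchone()
        print(f"replica lag: {lag}")
    conn.close()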


That's true, but you can use your replica only for non-realtime reporting, or even just as a hot standby.

Edit: Be careful with the non-realtime reporting though if you want to run very slow queries - those will pause replication and can be a PITA.


A hot standby / failover still meets this definition. That’s how I interpreted what was being described.


My startup has a similar setup (Elixir + Postgres). We use Aurora so we get automated failover. It's more expensive, but it's just a cost of doing business.


Last time I looked at Aurora (just as it came out) it was hilariously expensive. Are the costs better now for a real use case?


> it was hilariously expensive

It still is. But you have to look at it in perspective: do you have customers that NEED high availability and will pull out pitchforks if you are down for even a few minutes? I do. The peace of mind is what you're paying for in that case.

Plus, it's still cheaper than paying a DevOps engineer a full-time salary to maintain these systems on your own.


That works for the performance aspect, but doesn't address any kind of High Availability (HA).

There are definitely ways to make HA work, especially if you run your own hardware, but the point is that you'll need (at least) a 2nd server to take over the load of the primary one that died.


Sure, failover is recommended if you have HA commitments.


Thank you for sharing this! I have been diving into it.

How do you manage transactions with PostgREST? Is there a way to do it within PostgREST itself, or does it need to be in a good old endpoint/microservice? I can't find anything in their documentation about complex business logic beyond CRUD operations.


Transactions are done using database functions https://docs.postgrest.org/en/v12/references/api/functions.h....
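
A minimal sketch of calling such a database function from Python through PostgREST's /rpc endpoint (each HTTP request runs in a single transaction, so the function either fully commits or rolls back); the endpoint, function name, arguments, and token here are hypothetical:

    import requests

    resp = requests.post(
        "https://api.example.com/rpc/create_order_with_items",  # hypothetical function
        json={"customer_id": 42, "items": [{"sku": "A1", "qty": 2}]},
        headers={"Authorization": "Bearer <token>"},
    )
    resp.raise_for_status()  # a non-2xx response means the whole transaction rolled back
    print(resp.json())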


Ah ok awesome thank you!


Yes, scaling vertically is much easier than scaling horizontally and dealing with replicas, caching, etc. But that certainly has limits and shouldn’t be taken as gospel, and is also way more expensive when you’re starting to deal with terabytes of RAM.

I also find it very difficult to trust your advice when you’re telling folks to stick Postgres on a VPS - for almost any real organization using a managed database will pay for itself many times over, especially at the start.


Looking at Hetzner benchmarks, I would say a VPS is quite enough to handle Postgres for the Alexa Top 1000. When you get below the top 100, you will need more RAM than what is offered.

But my point is you won't ever hit this type of traffic. You don't even need Kafka to handle streams of logs from a fleet of generators in the wild. Postgres just works.

In general, the problem with modern backend architectural thinking is that it treats the database as some unreliable bottleneck, but that is an old-fashioned belief.

The vast majority of HN users and startups are not going to be servicing more than 1 million transactions per second. Even a medium-sized VPS from DigitalOcean running Postgres can handle that load just fine.

Postgres is very fast and efficient, and you don't need to build your architecture around problems you won't ever hit, prepaying a premium for that <0.1% peak that happens so infrequently (unless you are a bank and get fined for it).


I work at a startup that is less than 1 year old and we have indices that are in the hundreds of gigabytes. It is not as uncommon as you think. Scaling vertically is extremely expensive, especially if one doesn’t take your (misguided) suggestion to run Postgres on a VPS rather than using a managed solution like most do.


Handling that amount of indices on a dedicated server shouldn't break the bank.


> One server

What happens if this server dies?


Then your service is offline until you fix it. For many services, that's a completely acceptable thing to happen once in a blue moon.

Most would probably get two servers with a simple failover strategy. But on the other hand, servers rarely die. At the scale of a datacenter it happens often, but if you have like six of them, buy server-grade stuff, and replace them every 3-5 years, chances are you won't experience any hardware issues.


If you can't risk this rarity, then get a failover server with equal specs.

Maybe add another for good measure... if the business insurance needs extreme HA, then absolutely have multiple failovers.

My point is that you aren't doing extreme orchestration or routing.

Throw in Cloudflare DDoS protection too.


Eventually you get data residency requirements to keep data in the right region, and for that you need horizontal partitioning of some kind.


Our backend at work does use a read replica purely for websockets. I always wondered if it was overkill; I'm not a backend developer, though.


Not sure what you are building, but I hope that was for a real-time multiplayer game; otherwise it doesn't make sense to have bi-directional communication when you only need reads.

Making read replicas also accept writes is needed for such cases, but as soon as you have more than one place to write, you run into edge cases and complexities in debugging.


I think the reason is that pushes are sent out regularly in batches by some cron system, and rather than reading from the main database, it reads from the replica before pushing them out. I didn't really explain the context properly in my comment.


> Just up the RAM and CPU to TB levels

Not sure what CPU at TB levels means, but I hope your wallet scales vertically too.


They are definitely not on the cloud.


Aurora on AWS definitely has extreme RAM

It's not cheap at roughly $200/hr, but if you have this type of traffic, then you are (hopefully) generating revenue at much greater amounts.


I ran into some scaling challenges with Postgres a few years ago and had to dive into the docs.

While I was mostly living out of the "High Availability, Load Balancing, and Replication" chapter, I couldn't help but poke around and found the docs to be excellent in general. Highly recommend checking them out.

https://www.postgresql.org/docs/16/index.html


They are excellent! Another great example is the Django project, which I always point to for how to write and structure great technical documentation. Working with Django/Postgres is such a nice combo and the standards of documentation and community are a huge part of that.


Interestingly, I have had almost the exact opposite experience, being very frustrated with the Django docs.

To be fair, it could be because I'm frustrated with Django's design decisions having come from Rails.

From learning Django a few years ago, I still carry a deep loathing for polymorphism (generic relations [0]) and model validations (full_clean [1]).

You know what - they're design decisions...

[0] https://docs.djangoproject.com/en/5.1/ref/contrib/contenttyp...

[1] https://docs.djangoproject.com/en/5.1/ref/models/instances/#...


Generic relations are hard to get right; really, if you can avoid using them, you're going to avoid a lot of trickiness.

When you need them... it's nice to have them "just there", implemented correctly (at least as correctly as they can be in an entirely generic way).

Model validations are a whole thing... I think Django offering a built-in auto-generated admin leads to a whole slew of differing decisions that end up being really tricky to handle.


Would love to hear more about what you don't like with model validations (full clean).


Sorry for the slow reply.

But yea, I can complain at length.

- Model validations aren't run automatically. Need to call full_clean manually.

- EXCEPT when you're in a form! Forms have their own clean, which IS run automatically because is_valid() is run.

- This also happens to run the model's full_clean.

- DRF has its own version of create which is separate and also does not run full_clean.

- Validation errors in DRF's serializers are a separate class from model validation errors, and thus model ValidationErrors are not handled automatically.

- Can't monkey-patch models.Model.save to run full_clean automatically, because it breaks some models like User AND it would then run twice for Forms + Model [0].

Because of some very old web-forum-style design decisions, model validations aren't unified; the fragmentation means you need to know whether you're calling .save()/.create() manually, are in a form, or are in DRF. It's been requested to change this behavior, but it breaks backwards compat [0].

It's frustrating because in Rails this is a solved problem. Model validations ALWAYS run (and only once) because... I'm validating the model. Model validations == data validations, which means they should hold everywhere regardless of caller; for the exceptional cases I should have to be explicit about skipping them (i.e. Rails), whereas in Django I need to be explicit about running them - sometimes... it depends where I am.

[0] https://stackoverflow.com/questions/4441539/why-doesnt-djang...
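
For what it's worth, a common workaround is an abstract base model that forces full_clean on save instead of monkey-patching; a minimal sketch (the ValidatedModel name is hypothetical, and it shares the caveats above: it won't cover bulk operations, and forms that already call full_clean will run validation twice):

    from django.db import models

    class ValidatedModel(models.Model):
        """Abstract base that runs model validation before every save."""

        class Meta:
            abstract = True

        def save(self, *args, **kwargs):
            # Runs field validation, clean(), and uniqueness checks
            # before hitting the database.
            self.full_clean()
            super().save(*args, **kwargs)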


Thanks for your reply. I'm currently in a stage of falling out of love with Django and trying to get my thoughts together on why that is.

I think Django seems confused on the issue of clean/validation. On the one hand, it could say the "model" is just a database table and any validation should live in the business logic of your application. This would be a standard way of architecting a system where the persistence layer is in some peripheral part that isn't tied to the business logic. It's also how things like SQLAlchemy ORM are meant to be used. On the other hand, it could try to magically handle the translation of real business objects (with validation) to database tables.

It tries to do both, with bad results IMO. It sucks to use it on the periphery like SQLAlchemy; it's just not designed for that at all. So everyone builds "fat" models that try to be simultaneously business objects plus database tables. This just doesn't work for many reasons. It very quickly falls apart due to the object-relational mismatch. I don't know how Rails works, but I can't imagine this ever working right. The only way is to do validation in the business layer of the application. Doing it in the views, like REST framework or form cleans, is even worse.


Yeah, I definitely understand the frustration. I've been there, and while I don't think we've found _the_ solution, we've settled into a flow that we're generally happy with.

For us, we separate validations in two: business and data validations, which are generally defined as:

- Business: The invoice in country X needs to ensure Y and Z taxes are applied at billing T+3 days, otherwise throw an error.

- Data Validation: The company's currency must match the country it operates in.

Business validations and logic always go inside services, whereas data validations live on the model. Data validations apply to 100% of all inserts. Once there's an IF statement segmenting a group, it becomes a business validation.

I could see an argument as to why the above is bad, because sometimes it's a qualitative decision. Once in a while the lines get blurry: a data validation becomes _slightly_ too complex and an argument ensues as to whether it's data vs. business logic.

Our team really adheres to services and not fat models - sorry, DHH.

To me, it's all so contested that whatever you pick will work out just fine - just stick to it and don't get lazy about it.
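
A minimal sketch of that split in Django terms (the model fields, the Invoice API, and the service function are hypothetical):

    from django.core.exceptions import ValidationError
    from django.db import models

    class Company(models.Model):
        country = models.CharField(max_length=2)
        currency = models.CharField(max_length=3)

        def clean(self):
            # Data validation: holds for 100% of inserts, no business branching.
            if self.country == "US" and self.currency != "USD":
                raise ValidationError("Company currency must match its country.")

    def finalize_invoice(invoice):
        # Business validation: conditional rules live in the service layer.
        if invoice.country == "X" and not invoice.has_taxes("Y", "Z"):
            raise ValidationError("Invoices in country X need Y and Z taxes applied.")
        invoice.mark_finalized()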


Services are definitely better and a solid part of domain-driven design. The trouble with Django, I think, is that it's a band-aid on a fundamentally broken architecture. The models end up anaemic because they're trying to be two things at once. It's super common to see things like services directly mutating model attributes and setting up relationships manually by creating foreign keys, etc. All of that should be hidden far away from services.

The ultimate, I think, is Domain-Driven Design (or Clean Architecture). This gives you a true core domain model that isn't constrained by frameworks etc. It's as powerful as it can be in whatever language you use (which in the case of Python is very powerful indeed). Some people have tried to get it to work with Django, but it fights against you. It's probably more up-front work as you won't get things like Django admin, but unless you really, truly are doing CRUD, then admin shouldn't be considered a good thing (it's like doing updates directly on the database, undermining any semblance of business rules).


Like many of the BSDs


Did Postgres use to be a BSD? Are they known for good documentation?


BSD was the Unix distribution; BSD and Postgres/Ingres development did overlap at UC Berkeley.


BSD? No, those are operating systems.

Good documentation? Yes


15 years ago I worked on a couple of really high-profile Rails sites. We had millions of users with Rails and a single MySQL instance (plus memcached and nginx). Back then Ruby was a bit slower than it is today, but I'm certain some of the things we did to cope at that scale are things people still do today…

1. Try to make most things static-ish reads and cache generic stuff, e.g. most things became non-user-specific HTML that got cached as SSI via nginx or memcached

2. Move dynamic content to services that load after the static-ish main content, e.g. comments, likes, etc. would be loaded via JSON after the page load

3. Move write operations to microservices, i.e. creating new content and DB changes become mostly deferrable background operations

I guess the strategy was to serve as much content as possible without dipping into the Ruby layer, except for writes or infrequent reads that would update the cache.
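
Technique 1 in a minimal Python sketch (the original was Rails + nginx SSI; here pymemcache, the cache key scheme, and the render function are stand-ins):

    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def cached_page(path, render):
        # Serve the non-user-specific HTML straight from memcached when possible;
        # only fall back to the app layer on a miss, then cache the result.
        key = f"page:{path}"
        html = cache.get(key)
        if html is None:
            html = render(path).encode("utf-8")
            cache.set(key, html, expire=60)
        return html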


Please check out this excellent book by a former Microsoft and Groupon engineer on scaling Rails and Postgres:

[1] High Performance PostgreSQL for Rails: Reliable, Scalable, Maintainable Database Applications by Andrew Atkinson:

https://pragprog.com/titles/aapsql/high-performance-postgres...


What a small world. Earlier today I got tagged in a PR [1] where Andrew became the maintainer of a Ruby gem related to database migrations. Good to know he is involved in multiple projects in this space.

[1] https://github.com/lfittl/activerecord-clean-db-structure/is...


Hi there! That's funny! This interview and those gem updates were unrelated. However both are part of the sweet spot for me of education, advocacy, and technical solutions for PostgreSQL and Ruby on Rails apps.

I hope you're able to check out the podcast episode and enjoy it. Thanks for weighing in on the gem comments, and for commenting here on this connection. :)


Postgres can scale to millions of users, but Rails definitely can't. Unless you're prepared to spend a ton of money.


For real. Show me a company that has scaled RoR or Django to 1 million concurrent users without blowing $250,000/month on their AWS bill. I've worked at unicorn companies trying to do exactly that.

Their baseline was 800 instances of the Rails app...lol.

I'm not going to name names (you've heard of them)... but this is a company that had to invent an entirely new and novel deployment process in order to get new code onto the massive beast of Rails servers within a finite amount of time.


I've scaled a single Rails server to 50k concurrent users, so if Rails is the theoretical bottleneck there, and we base it off scaling my meager efforts, that's only 20 servers for 1 million concurrent, or around $1,000/mo at the price point I was paying (Heroku).

Rails these days isn't at the top of the speed charts, but it's not that slow either.


Sounds like you made a nice, tight little Rails app. 50,000 concurrent? Oh man, I wish.


“Rails can’t scale” is so 10 years ago. It’s often other things like DB queries or network I/O that tend to be bottlenecks, or you have a huge Rails monolith that has a large memory footprint, or an application that isn’t well architected or optimized.


We use 5 EC2 instances to serve around 32 million requests per day on PHP, all under 100ms. It is not the language.


Sounds impressive until you realize that there are 86,400 seconds in a day, so even if the majority of those requests happen during business hours, that's still firmly under 200 QPS per server. On modern hardware that's very small. Also, what instance size?


c5.4xlarge


The language/runtime certainly has an impact. But indeed, in reality there is no way to compare these scaling claims. For all we know, people are talking about serving an HTTP-level cache without even hitting the runtime.


Each and every request reaches the DB and/or Redis. MyISAM is deprecated, but it is crazy fast if you mainly read.


This is trivial with epoll(7) or io_uring(7). What you are describing ("5 EC2 instances") could likely be attributed to language and/or framework bloat, but it's hard to know for certain without details.


Framework or custom app?


Raw PHP scripts, no ORM either. It has very good abstractions for some logic, while other parts are just spaghetti functions. Changing anything is difficult and risky, so we are not able to refactor much.


Were they running t2.micro instances or something?

We're running 270k+ RPM no sweat, and our spend for those containers is maybe 1/100th what you're quoting there.

The idea that Rails can't handle high load is just such bloody nonsense.

You can build an abomination with any framework, if you try.


> Show me a company that has scaled RoR or Django to 1 million concurrent users without blowing $250,000/month on their AWS bill.

Can you deploy something to Vercel that supports a million concurrent users for less than $250K/month? What about using AWS Lambdas? Go microservices running in K8s?

I think your infra bills are going to skyrocket no matter your software stack if you're serving 1 million+ concurrent users.


"without blowing $250,000/month on their AWS bill". The point is that you don't need AWS for this! You can use Docker to configure much, much cheaper/faster physical servers from Hetzner or similar with super-simple automated failover, and you absolutely don't need an expensive dedicated OPS team for that for this kind of simple deployments, as I read so often here on HN.

You might be surprised at how far you can go with the KISS approach on modern hardware and open-source tools.


You ain't replacing $250k/mo worth of EC2 with a single Hetzner server, so your "super-simple failover" option goes out the window. Bare metal is not that much faster if you're running Ruby on it; don't fall for the marketing.


I never said that you should only have one server on Hetzner. For the web servers and background workers, though, scaling horizontally with Docker images on physical servers is still trivial.

By the way, I was running my startup on 17 physical machines on Hetzner, so I'm not speaking from marketing but from experience.


My experience scaling up Rails (mostly in codebase size, NOT traffic) really made me love type-safe languages.

IDE smartness (autocomplete, refactoring), compile-time errors instead of runtime errors, clear APIs...

Kotlin is a pretty nice "Type-safe Ruby" to me nowadays.


I had a similar experience: working in a large Ruby codebase made me realise how important type hints are. Sometimes I had to investigate what types were expected and required because the editor was unable to tell me. I hope RBS / Sorbet solves this.


This desperately needs the Walmart treatment of Jet.com's teams post-acquisition :)


What's Rails and Postgres? Do they mean ASP.NET and MS SQL Server?


Rails and Postgres (and AWS) were the pre-acquisition stack, and development continued with that stack during this time period (2020-2021). https://en.wikipedia.org/wiki/Flip_(software)

Microsoft was acquiring companies with web and mobile platforms from varied backgrounds at a high rate. I got the sense that the tech stack, at least when it was based on open source, was evaluated for ongoing maintenance and evolution on a case-by-case basis. There was a cloud migration to Azure and encouragement to adopt Surface laptops and VS Code, but leadership advocated for continuing development in the existing stack, since feature development was ongoing and the team was small.

Besides hosted commercial versions, I was happy to see Microsoft supporting community/open-source PostgreSQL so much, and they continue to do so.

https://en.wikipedia.org/wiki/List_of_mergers_and_acquisitio...

https://techcommunity.microsoft.com/t5/azure-database-for-po...


PostgreSQL has been the most popular choice for greenfield .NET projects for a while too. There really isn't any vendor lock-in as most of the ecosystem is built with swappable components.


I don't understand why you wouldn't just use Elixir/Phoenix if you need to scale?


Perhaps because you need to scale quickly and already have a large Rails app that would take a long time to recreate in another language and framework.


It’s hard to compete with Rails productivity


I don't understand why you wouldn't use <compiled language that's faster than the BEAM> if you need to scale?

/s


I mean, you could, but you'd be missing out on the Rails-esque nature of Elixir/Phoenix.


Scaling a framework that isn't scalable by default, for something that should have been a few services written in a performance-first language, at a billion+ USD company.

I am not sure why we are boiling the oceans for the sake of a language like Ruby and a framework like Rails. I love those to death, but Amazon's approach is much better (or it used to be): you can't build a service for 10,000+ users in anything other than C++ or Java (probably Rust as well nowadays).

For millions of users the CPU cost difference probably justifies the rewrite cost.


You are connecting the dots backwards, but a project is usually trying to connect the dots forward.

So if you have a lot of money, you can start implementing your own web framework from scratch in C. It will be the perfect framework for your own product, and you can put 50 dev/sec/ops/* people on the team to make sure both the framework and the product code get written.

But some (probably most) products are started with 1-2 people trying to find product-market fit, or whatever the name is for solving a real problem for paying users as fast as they can, and then defer scaling until money is coming in.

This case is similar, because it's about a startup/product bought by Microsoft, not built in-house.

For fast delivery of stable, secure code for web apps, Rails is a perfect fit. I am not saying it's the only one, but there are not many frameworks offering the stability and batteries included to deliver, with a small team, a web app that can scale to product-market fit while keeping the team small.


"For millions of users the CPU cost difference probably justifies the rewrite cost." This is only true if you have expensive computations done in Ruby or Python or similar, which is very rarely the case.


Not true: Ruby and Python are absurdly slow at even trivial tasks. Moving stuff around in memory, which is most of what a web app does, is expensive. Lots of branches are going to be really expensive too.


I've got more than 15 years of Rails production experience, including a lot of performance optimisation, and in my experience the Ruby code is very rarely the bottleneck. And in those cases, you can almost always find some solution.


You really do not know what you are talking about; it is not about the language, as has been repeated in this forum many, many times already. We serve a PHP application to thousands of users per second in less than 100ms, constantly.


Sometimes it is the language. Or at least the ecosystem and libraries available.

My go-to example is graphql-ruby, which really chokes when serializing complex object graphs (or did; it's been a while since I've had to use it). It is pretty easy to spend hundreds of milliseconds purely on compute to serialize a complex GraphQL response.


I have mixed feelings about this. It's like saying that Python is too slow for data science while ignoring that Python can outsource that work to Pandas or NumPy.

For GraphQL on Rails you can avoid graphql-ruby and use Agoo [1] instead, so that work is outsourced to C. So in practice it's not a problem.

1. https://github.com/ohler55/agoo


> python can outsource that work to Pandas or NumPy.

Exactly. So C/C++/Fortran is better in this regard than Python.


I would make the case that that's not the language's fault. You need to assess how critical speed is in your requirements and adapt your solutions.


> You really do not know what you are talking about

> it is not about the language

Sure, how about these people?

https://thenewstack.io/which-programming-languages-use-the-l...


Yup. As if there is no wealth of organizational knowledge and a particular first-party language to address this exact problem.



