Vertically Scaling PostgreSQL

bshipp · on March 19, 2019

I administer a medium-sized postgres db, around (now) 110GB of data, and recently had the opportunity to revisit the old .conf file when I redeployed that database to a different server. What I quickly realized was that years of tweaks I had implemented as I learned postgres from various (now outdated) online sources were making it difficult to optimize my new instance.

I decided to start again from scratch. I discovered a handy little utility called pghero and set up it's internal monitoring:

https://github.com/ankane/pghero

I backed up my old .conf replaced it with a default and blew away all my old indices, and fired up the database. Of course, nothing worked well. But step by step, working with postgres' internal tools and pghero, I took a brand new look at every previous assumption and was much more strategic about where and when I created an index and which tuning parameter I tweaked.

Long story short, I was able to knock 30 GB off of the install disk size (based, in large part, on unnecessary duplicate indices), accelerate the most common queries, and everything worked a lot smoother. Plus my postgres installation was no longer fighting old postgres .conf tuning tweaks which were designed for much older versions of postgres.

Just adding my vote for revising your old tuning parameters after your database (and your programmer/administrator) has matured a bit.

joeyrobert · on March 19, 2019

This is a common operation for my team as well. In my old project, so many indexes that were added were just assumptions about the usage pattern of the data. Not only did they blow up the size of the table and INSERT time, they became the opposite of a red-herring where developers would stop and look elsewhere because the table "already has indexes".

Only adding indexes with good use of EXPLAIN/EXPLAIN ANALYZE on expected usage patterns yielded good results. Often we'd just blow the indexes away and fiddle until we added the correct one to enable an index scan / index only scan.

Other big boons for us have been using ENUM types where necessary (small known dataset for a column) -- now your column takes 4 bytes instead of N bytes for a string. I find them a bit easier to work in than foreign keys for this optimization because of their direct string mapping.

We've had really good mileage with Datadog's APM tools, RDS performance insights and pghero as you mentioned.

bshipp · on March 19, 2019

The ENUM type sounds like a great little tweak; I've got quite a few small tables that describe a fixed number of rows so I'll definitely investigate that.

I'm my own worst enemy for indexes, because I'm not only the database administrator but also the analyst. So when I'm trying to solve an ad-hoc problem and encounter a long-running query there's a 99% probability that I may have "accidentally" generated a new index to accelerate that query and then forgotten to remove it when I was done.

Lesson learned, and also a good argument for disaggregating the administrative and analytical user permissions.

aidos · on March 19, 2019

More recently I’ve got into the habit of creating a new schema when I’m analysing / patching data so I can pull out the data I need into new tables and index accordingly. Because I do it from vim and have a plugin to execute the queries, I just create a record as I go (and I set the path to hit my new schema first so any new tables go there).

bshipp · on March 19, 2019

Ooooh, I like that idea a lot, especially since I can assign it a tablespace on a different drive.

Jedd · on March 19, 2019

> Other big boons for us have been using ENUM types where necessary (small known dataset for a column) -- now your column takes 4 bytes instead of N bytes for a string. I find them a bit easier to work in than foreign keys for this optimization because of their direct string mapping.

Can you give an example?

It sounds like normalising the schema would have similar (perhaps more) benefits.

joeyrobert · on March 21, 2019

Normalizing has similar size benefits. For our use case (big aggregate reporting tables), incoming external data is not normalized, meaning we'd have to have another process iterate over it with foreign table in memory and map it to the foreign key. Enum's can be transparently ingested without this normalization requirement, while taking equivalent space. There's also now one level of indirection between the column and its value. Both are fine.

eropple · on March 20, 2019

Postgres keeps ENUM types in memory at all times. A join table is probably a slower option, all other things being equal. There are cases where it's worthwhile (dynamically expanding the values that would be in that enum, for example), but it's a tradeoff.

paulddraper · on March 19, 2019

They were useful. Unfortunately enums can't be changed (e.g. add a value) in a transaction without some extra work

fake-name · on March 19, 2019

PGHero is great, and as someone who's reported (minor) issues and made feature requests on the issue tracker, the author is quite open to improvements too.

11235813213455 · on March 19, 2019

just wondering, did you have to plan a maintenance downtime for doing all that or could it be done without interrupting the production DB?

beatgammit · on March 20, 2019

Not OP, but I imagine you could just duplicate the DB into a test environment to mess around with it.

bshipp · on March 20, 2019

That's exactly what it was. I had a bit of time after buying the server before I wanted to deploy it in production so I copied the last backup over and just started messing around. Because this database primarily ingests data from outside sources and doesn't drive user experiences, I was able to easily duplicate the data flow to replicate the production environment.

I recognize this would be a much different beast when it comes to a database that supports a hundred thousand users or something. However, Postgresql does have the ability to do hot standby and streaming replication so it's likely not as difficult as one might imagine.

https://www.postgresql.org/docs/current/warm-standby.html

tyingq · on March 19, 2019

"In Linux, the most popular host for Postgres, the size of the OS disk cache cannot be set by the user."

True, though you can tweak settings on when the cache writes dirty pages. Percona has a blog post specific to Postgres that covers it. There's also some other tuning there not mentioned in the HN linked story.

https://www.percona.com/blog/2018/08/29/tune-linux-kernel-pa...

michelpp · on March 19, 2019

There is an interesting tool called nancy that runs automated experiments with different postgres setups, allowing you to compare them against each other. It can run workloads derived from log files or pgbench (disclaimer, I wrote the pgbench support). The idea is, you run many experiments, tweaking different configurations and nancy automates the analysis of them, eventually allowing you to settle into a cost minimum/performance maximum you're comfortable with.

https://github.com/postgres-ai/nancy

natmaka · on March 19, 2019

Beware, effective_cache_size is NOT, as stated, "the amount of disk cache provided by the OS"

It is a combination of shared_buffers and kernel cache size See https://www.postgresql.org/message-id/20181105120104.GA13769...

This important configuration parameter is a too-often badly understood. IMHO it should be, at least on a server non dedicated to PG, periodically dynamically calculated by the engine.

ps · on March 19, 2019

Check out pgtune (https://pgtune.leopard.in.ua/#/) if simple config tuning is what you are looking for. It is probably more valuable than this blog post.

dewey · on March 19, 2019

> It is probably more valuable than this blog post.

Learning about what goes on under the hood before you blindly change settings some tool spits out doesn't seem like a bad idea I'd say.

ps · on March 19, 2019

The point is that the pgtune could and should force you to check the config options in documentation and it actually addresses your current setup. RTFM.

Values in the post are pretty arbitrary and useless as their comments in postgresql.conf used to be.

I am using PostgreSQL in production since 7.4 and the best source of performance tuning tips for me has been PostgreSQL 9.0 High Performance by Gregory Smith. These days PostgreSQL wiki is pretty good as well.

rch · on March 19, 2019

I'd second that book recommendation. I just went through it again on a long plane flight, and much (if not all) of the configuration advice applies to more recent releases as well.

bshipp · on March 19, 2019

For sure. And although I love pgtune and it's usually my first stop when I'm firing up a new database, it's only really modifying a few parameters when there's a whole host of server configuration tweaks available:

https://www.postgresql.org/docs/current/runtime-config.html

GordonS · on March 19, 2019

> It is probably more valuable than this blog post

They do different things - pgTune tells you how to configure your system, this blog post tells you why.

tejtm · on March 19, 2019

I was just back looking at pgtune after being distracted for a number of years and felt a twinge concerned on upkeep and fragmentation. I also noticed postgresql documentation talking about using https://github.com/jfcoz/postgresqltuner

Any compares and contrasts available?

I'm not running a production instance so tuning is gratuitous but nice.

manigandham · on March 19, 2019

Sounds like many of these should be automatic by now. One area where I'd like to see some progress in Postgres.

ahachete · on March 19, 2019

Feel free to check https://postgresqlco.nf A humble effort to document all PostgreSQL configuration parameters and provide some guidance.