commented: I've had data corruption when using a third-party vendor just the same as I've had when self-hosting. As far as I'm concerned this is roughly comparable to the time you spend debugging RDS connection limits, working around parameter groups you can't modify, or dealing with surprise maintenance windows. The main operational difference is that you're responsible for incident response. If your database goes down at 3 AM, you need to fix it. But here's the thing: RDS goes down too. And when it does, you're still the one getting paged; you just have fewer tools to fix the problem. That doesn't look like it will ever stop being true…

commented: "None of this is technically complex"? Proper automated failover is quite complex. Luckily you don't implement it yourself for Postgres. We run our own Postgres just fine, but I would love to not have to run MySQL. Cloud vendors typically handle database replication by replicating block devices, which gives them a generic solution that doesn't have to deal with the problems of each database.

commented: For what it's worth, we've been running MySQL clusters on VMs (where we can get local NVMe) for a long time using its replication, and this stuff is now mundane for us. Initial setup isn't fire-and-forget, and if, for example, you don't go in aware of what replication likes and doesn't like (it doesn't like huge transactions or large ALTERs without an online-schema-change tool!), you might learn the hard way (a replica-lag check is sketched below). But those seem like learning and start-up costs more than recurring ones. I haven't looked deeply into it, but it sounds a little like Percona is trying to build an open-source RDS-like thing on top of k8s with their Everest product, which might help with one class of setup work. Postgres is probably the route if you're just deciding what to use--the larger ecosystem seems to have gone that direction--but given my past experience I'd probably personally start a thing with MySQL again if I were starting over. I mostly agree with the OP: the work that is specific to running our own instances happens, but it is occasional and very manageable. (And the local-NVMe perf is sure nice.) Most of our database-related work is the stuff that you'd have to do no matter how you hosted: looking at what's inefficient and improving it, once every N years making sure your app remains happy across a major-version upgrade, things like that. Of course, saying something works well is always calling down the wrath of the ops gods, and I'm knocking wood just writing this. But empirically it has been working pretty well for us!

commented: Does something like https://pigsty.io/ help with that?

commented: There's also https://github.com/patroni/patroni which only does the HA part, without being a full-on Postgres distro.

commented: Perhaps this ZFS backup strategy from 2022 can help as well, as long as there is an equivalent to pg_start_backup/pg_stop_backup in MySQL? (The Postgres side of that pattern is sketched below.)

commented: So do it as well: use shared storage with dual controllers and fail over the whole DB VM. Use a replica as manual failover in case things go wrong with the whole primary cluster.

commented: It may be only a couple of hours a month or so of maintenance, but it's the task switching and the minutiae knowledge that kill you in these routines. You look at it so infrequently that actually changing anything becomes a behemoth task, in my experience, as you have to refresh yourself on the implementation details.
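The MySQL comment above mostly amounts to keeping an eye on replication health. A minimal sketch of such a replica-lag check, assuming pymysql; the host, user, and password are placeholders, and the statement and column names differ between MySQL 8.0.22+ (SHOW REPLICA STATUS) and older releases (SHOW SLAVE STATUS):

```python
# Minimal replica-health check, assuming pymysql and a monitoring account;
# the host, user, and password below are placeholders.
import pymysql

conn = pymysql.connect(
    host="replica.internal",   # hypothetical replica host
    user="monitor",            # hypothetical monitoring account
    password="change-me",
    cursorclass=pymysql.cursors.DictCursor,
)

try:
    with conn.cursor() as cur:
        try:
            cur.execute("SHOW REPLICA STATUS")   # MySQL 8.0.22+
        except pymysql.err.ProgrammingError:
            cur.execute("SHOW SLAVE STATUS")     # older servers
        row = cur.fetchone()
finally:
    conn.close()

if row is None:
    print("This server is not configured as a replica.")
else:
    # Column names depend on the server version, hence the fallbacks.
    io_running = row.get("Replica_IO_Running", row.get("Slave_IO_Running"))
    sql_running = row.get("Replica_SQL_Running", row.get("Slave_SQL_Running"))
    lag = row.get("Seconds_Behind_Source", row.get("Seconds_Behind_Master"))
    print(f"IO thread: {io_running}, SQL thread: {sql_running}, lag: {lag}s")
```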
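On the ZFS comment: on the Postgres side, the pattern it refers to is bracketing the filesystem snapshot with the low-level backup API. A rough sketch, not the linked article's exact procedure, assuming PostgreSQL 15+ (where the functions are pg_backup_start/pg_backup_stop; older releases use pg_start_backup/pg_stop_backup), psycopg2, and a hypothetical ZFS dataset name:

```python
# Sketch: take a ZFS snapshot of the data directory while Postgres is in
# backup mode. Assumes PostgreSQL 15+ and psycopg2; the DSN and dataset
# name are placeholders. The same session must stay open between start
# and stop.
import subprocess
import psycopg2

DATASET = "tank/pgdata"                     # hypothetical ZFS dataset
conn = psycopg2.connect("dbname=postgres")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

cur.execute("SELECT pg_backup_start(%s, true)", ("zfs-snapshot",))
try:
    subprocess.run(["zfs", "snapshot", f"{DATASET}@pg-backup"], check=True)
finally:
    # pg_backup_stop returns the backup_label contents; store them alongside
    # the snapshot so they can be placed in the data directory on restore.
    cur.execute("SELECT labelfile FROM pg_backup_stop(true)")
    label = cur.fetchone()[0]
    print(label)
```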
commented: I find what the author says about a «similar amount of maintenance to what interfacing with RDS requires» believable, and then the exact same issue applies equally on both sides.

commented: Yeah, it also feels like the author is significantly more knowledgeable about (and therefore confident in) his Postgres management skills. He knows the failure modes and how to recover from them. He knows how failover works and how to set up Patroni or whatever. He knows how to configure backrest and has practiced recovery. I don't know that much and don't have that confidence. And I'm sure I could learn it; it's probably not all that much, and I'm sure I could find some good content on learning all of it. On the other hand, my company can pay Google perhaps tens or hundreds of dollars per month (on top of the cost of the underlying instances) to manage it for me, which is a rounding error on our cloud bill, and instead I can either help increase revenue or decrease costs by some figure several orders of magnitude larger than what I would save running Postgres myself. We have people who write a single bad BigQuery query that wastes more money than I would save running Postgres myself. It just isn't worth my time right now, and that's probably true for many; if we get to an aggressive cost-optimization phase and that's the next biggest bang/buck, then we'll tackle it, but for now it would be the wrong move. Also worth noting that it's never just "running it myself": I also have to teach my team how to do that work. I also don't buy the argument that if RDS goes down you still have to deal with it. No I don't: Amazon deals with it. I maybe have to do a bit of communication with stakeholders that RDS went down, but it's much less work and much less stress than fixing it myself. More importantly, if I run the database myself and things break, it's my fault, but if RDS goes down, my stakeholders and their stakeholders are understanding (maybe it shouldn't be that way, but that's the world we live in).

commented: Great article. One thing seems to be (partially) missing: patching the VM/server OS (along with dist-upgrades when the current version falls out of support). As one alternative to self-managing, we're currently slicing up managed Postgres instances (from UpCloud), giving our test environments a separate database, user, and schema on a "shared" managed Postgres instance, which makes the "managed services tax" more reasonable. For prod environments it can make sense to pay for one instance per service, because of the "now it's somebody else's problem" feature and because it makes point-in-time recovery trivial: just bring a new service instance online for the recovery. But the general trend of steering users toward one Postgres instance per service does become a little silly when you have important but low-volume services.

commented: Good article! What I'd also be interested in: what are the usual reasons for outages (i.e. for 3 AM pager calls) with a self-hosted Postgres? Are there any common patterns? And is it possible for a non-expert to debug/fix such problems on short notice?

commented: If random people can send SQL to your PG, chances are it's one of those random SQL queries abusing the poor server. If you have a bunch of people sending terrible SQL, you're going to have a bad time. Sometimes it can be locking, if you allow really long-running queries against tables that are being inserted into. As for figuring out whether these are the issues or not, you can query for both of those things (a rough sketch follows below).
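A rough sketch of those two checks (long-running statements and lock waits), assuming psycopg2; the DSN is a placeholder and the five-minute threshold is arbitrary, while pg_stat_activity and pg_blocking_pids() are standard Postgres:

```python
# Sketch: spot long-running statements and queries blocked on locks.
# Assumes psycopg2; the DSN is a placeholder and the threshold is arbitrary.
import psycopg2

LONG_RUNNING = """
    SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
      AND now() - query_start > interval '5 minutes'
    ORDER BY runtime DESC
"""

BLOCKED = """
    SELECT pid, pg_blocking_pids(pid) AS blocked_by, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0
"""

with psycopg2.connect("dbname=postgres") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(LONG_RUNNING)
        for pid, runtime, state, query in cur.fetchall():
            print(f"long-running pid={pid} {runtime} [{state}] {query}")
        cur.execute(BLOCKED)
        for pid, blocked_by, query in cur.fetchall():
            print(f"pid={pid} blocked by {blocked_by}: {query}")
```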
See the PG wiki for queries that test for these. There is also the server administration section of the PG manual: https://www.postgresql.org/docs/current/admin.html Also, it should be noted that a hosted DB instance won't fix any of these issues for you either; they will always be your problem. If it's not one of those, it's probably not PG's issue; it's more likely hardware failure, the OS crashing, running out of disk space, stuff like that.

commented: I was a senior manager in a small company that ran Postgres at its core, and quite honestly the number of Postgres-related after-hours calls we had in the 20+ years we were running could be counted on one hand. One example that stands out due to its severity was an instance that just kept crashing, but it turned out to be a RAID controller flipping bits. While replication etc. is great (and we did it with DRBD for a long time before other techniques came along), PG is rock solid; I'd have no qualms running it solo and just relying on backups for a small project. All of that said, I currently run a smallish instance at Vultr for $WORK. Backups, upgrades, and failover just happen, and I'd need to pay for the CPU and disk anyway. So while I don't think it's hard to run your own PG, it is (was?) non-trivial to set up and maintain that sweet, sweet automation, and there are plenty of cheapish hosting options around that do it all for you.

commented: The benefit of hosted is getting access to closed-source distributed DBs like Aurora and AlloyDB. I never understood the appeal of RDS/CloudSQL.

commented: Fine, we'll self-host RDS Aurora.