-------------------------------------------------
Title: Disaster day
Date: 2022-03-09
Device: Laptop
Mood: Exhausted
-------------------------------------------------

The last few weeks at work I've spent planning a disaster recovery drill. Turns out I didn't need to wait that long. We had a huge outage at work yesterday.

One of our junior engineers was preparing to push some nginx configuration changes to a service we use just to terminate SSL and do some simple redirects (e.g. domain.com to www.domain.com). He'd written the configuration change, pushed it through the PR process, and was ready to deploy. Unfortunately, due to a lack of documentation and process, he ran the entire Ansible task list against all the servers in the inventory, not just the one he intended. Half-way through this he realised his error and ^C'd his way back to 'safety'. Unfortunately this left the system in an inconsistent state, and some of the nginx upgrades we'd applied to other parts of the estate got applied to our live infrastructure (to be honest, I'm still a little unclear on this part and why it happened). But anyway, the live infrastructure was left with no nginx running, and a set of configuration directives which left it unable to start.

Unfortunately, I was AFK when this began, and after my phone blew up with alerts I got back to my desk about 30 minutes after the incident started. I took control and started to calm everyone down and ask questions, but by this time some of the useful output was lost, the team on the problem had already had their 'fuck it, let's reboot' moment (sigh), and I had to rebuild a lot of useful context by questioning the team about EXACTLY the steps they took.

After about 90 minutes I think we had a handle on the problem, but it was still unclear what to do next. I ended up splitting the engineers on the call into two teams: one to work on the live infrastructure to repair the nginx configurations, and the other to start on the Plan Z directives, which involve rebuilding the configuration from scratch. Thankfully, the first team quickly identified some modifications we could make to the configuration files to get nginx running again (note that the problem was never our application containers -- they were happily running, we just couldn't route traffic to them). Once we had that figured out it was just a case of manually fixing them, testing, and notifying customers. The total outage was about 4 hours, which is the worst I've dealt with in more than a decade.

I think we have a lot of learning to do over the next few weeks while we recover. There's obvious engineering work now to remove the footguns in Ansible which led to this happening in the first place -- those are easy to fix. But the more complex problem is how to move our customers over to a model where they can have hot failover to a spare. That's new territory for us here (though not for me personally, even if I've never led that effort before).

I've spent most of today trying to reassure the engineer in question that this wasn't anything to take personally. I think all engineers have SOME story about the time they killed production systems; it's part of the journey for most. But he is young, and I don't want to let this incident weigh too heavily on him.

I'm glad this didn't happen next week when I'll be on holiday! I doubt I could have contributed much from my phone in the mountains.

--C
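
P.S. For my own notes: the config on that service is nothing exotic -- roughly a server block that terminates SSL and bounces the bare domain to www. A simplified sketch, with placeholder domains and certificate paths rather than our real config:

    server {
        listen 443 ssl;
        server_name domain.com;

        # SSL terminates here; the application containers behind it never see TLS.
        ssl_certificate     /etc/nginx/ssl/domain.com.crt;
        ssl_certificate_key /etc/nginx/ssl/domain.com.key;

        # Simple redirect: bare domain to www, preserving the request path.
        return 301 https://www.domain.com$request_uri;
    }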
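
P.P.S. On the Ansible footgun: the first fix I want to try is making the playbook refuse to run without an explicit target, so a bare run against the whole inventory simply isn't possible. Something like this (hypothetical names, not yet tested here):

    # Aborts immediately if no target is supplied, e.g.:
    #   ansible-playbook nginx.yml -e target=edge_ssl_staging --limit edge_ssl_staging
    - name: Deploy nginx configuration
      hosts: "{{ target }}"   # no default -- an undefined 'target' stops the play
      become: true
      tasks:
        - name: Template nginx config
          ansible.builtin.template:
            src: nginx.conf.j2
            dest: /etc/nginx/nginx.conf
          notify: Reload nginx
      handlers:
        - name: Reload nginx
          ansible.builtin.service:
            name: nginx
            state: reloaded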