COMMENT PAGE FOR:
HTML Summary of the Amazon DynamoDB Service Disruption in US-East-1 Region
asim wrote 4 hours 35 min ago:
On the one hand it's an incentive to shift toward smaller self-managed
setups where you don't need AWS, e.g. as an individual I just run a
single DigitalOcean VPS. But on the other hand, if you're a large
business the evaluation is basically: can I tolerate this kind
of incident once in a while, versus the massive operational cost of
doing it myself? It's really going to be a case-by-case study of who
stays, who moves, and who tries some multi-cloud failover. It's not one
of those situations where you can just blanket-say this is terrible,
stupid, should never happen, let's get off AWS. This is the slow
build-up of dependency on something people value. That's not going to
change quickly. It might never change. The too-big-to-fail mantra of
banks applies. What happens next is essentially very anticlimactic,
which is to say, nothing.
tonymet wrote 52 min ago:
multi-region AWS would have been adequate for this outage.
JohnMakin wrote 5 hours 54 min ago:
Was still seeing SQS latency affecting my systems a full day after they
gave the "all clear." There are red flags all over this summary to
me, particularly the case where they had no operational procedure for
recovery. That seems to me impossible in a hyperscaler - you never
considered this failure scenario, ever? Or did you lose engineers that
did know?
Anyway appreciate that this seems pretty honest and descriptive.
polyglotfacto2 wrote 9 hours 27 min ago:
Use TLA+ (which I thought they did)
baalimago wrote 9 hours 48 min ago:
Did they intentionally make it dense and complicated to discourage
anyone from actually reading it..?
776 words in a single paragraph
giamma wrote 10 hours 42 min ago:
Interesting analysis from The Register
HTML [1]: https://www.theregister.com/2025/10/20/aws_outage_amazon_brain...
mellosouls wrote 10 hours 29 min ago:
Discussed the other day:
Today is when the Amazon brain drain sent AWS down the spout (644
comments)
HTML [1]: https://news.ycombinator.com/item?id=45649178
martythemaniak wrote 14 hours 36 min ago:
It's not DNS
There's no way it's DNS
It was DNS
grogers wrote 14 hours 49 min ago:
> As this plan was deleted, all IP addresses for the regional endpoint
were immediately removed.
I feel like I am missing something here... They make it sound like the
DNS enactor basically diffs the current state of DNS with the desired
state, and then submits the adds/deletes needed to make the DNS go to
the desired state.
With the racing writers, wouldn't that have just made the DNS go back
to an older state? Why did it remove all the IPs entirely?
Aeolun wrote 13 hours 43 min ago:
1. Read state, oh, I need to delete all this.
2. Read state, oh, I need to write all this.
2. Writes
1. Deletes
Or some variant of that, anyway. It happens in any system that has
concurrent readers and writers and no locks.
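Roughly, in code (a minimal sketch with made-up names, not AWS's
actual system): the staleness check and the write are separate steps,
so a delayed actor can clobber newer state.
  state = {"applied_generation": 4}   # what is currently live at the endpoint

  def apply_plan(generation):
      # step 1: check that our plan is newer than what is applied
      is_newer = generation > state["applied_generation"]
      # ... arbitrary delay: retries, slow endpoints, GC pauses ...
      if is_newer:
          # step 2: act on a check that may already be stale
          state["applied_generation"] = generation

  # a losing interleaving:
  #   old enactor: checks plan 5 > 4 -> ok, then stalls on slow endpoints
  #   new enactor: applies plan 9, then cleans up plans far older than 9
  #   old enactor: resumes and writes plan 5 over plan 9, pointing the
  #                endpoint at a plan whose data was just deleted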
JCM9 wrote 15 hours 27 min ago:
Good to see a detailed summary. The frustration from a customer
perspective is that AWS continues to have these cross-region issues and
they continue to be very secretive about where these single points of
failure exist.
The region model is a lot less robust if core things in other regions
require US-East-1 to operate. This has been an issue in previous
outages and appears to have struck again this week.
It is what it is, but AWS consistently oversells the robustness of
regions as fully separate when events like Monday's reveal they're
really not.
Arainach wrote 15 hours 17 min ago:
>about where these single points of failure exist.
In general, when you find one you work to fix it, and one of the most
common ways to find more is when one of them fails. Having single
points of failure and letting them live isn't the standard practice
at this scale.
rr808 wrote 15 hours 53 min ago:
[1] has a better explanation than the wall of text from AWS
HTML [1]: https://newsletter.pragmaticengineer.com/p/what-caused-the-lar...
cowsandmilk wrote 15 hours 17 min ago:
Except just in the DNS section, I've already found one place where
he gets it wrong...
al_be_back wrote 16 hours 35 min ago:
Postmortem all you want - the internet is breaking, hard.
The internet was born out of the need for Distributed networks during
the cold war - to reduce central points of failure - a hedging
mechanism if you will.
Now it has consolidated into ever smaller mono nets. A simple mistake
in one deployment could bring banking, shopping and travel to a halt
globally. This can only get much worse when cyber warfare gets
involved.
Personally, I think the cloud metaphor has overstretched and has long
burst.
For R&D, early stage start-ups and occasional/seasonal computing, cloud
works perfectly (similar to how time-sharing systems used to work).
For well established/growth businesses and gov, you better become
self-reliant and tech independent: own physical servers + own cloud +
own essential services (db, messaging, payment).
There's no shortage of affordable tech, know-how or workforce.
protocolture wrote 14 hours 13 min ago:
>the internet is breaking, hard.
I don't see that this is the case; it's just that more people want
services over the internet from the same 3 places that break
irregularly. Internet infrastructure is, as far as I can tell,
getting better all the time.
The last big BGP bug had 1/10th the comments of the AWS one. And had
much less scary naming (ooooh routing instability) [1]
>The internet was born out of the need for Distributed networks during
the cold war - to reduce central points of failure - a hedging
mechanism if you will.
Instead of arguing about the need that birthed the internet, I will
simply say that the internet still works in the same largely
distributed fashion. Maybe you mean Web instead of Internet?
The issue here is that "Internet" isn't the same as "Things you might
access on the Internet". The Internet held up great during this
adventure. As far as I can tell it was returning 404's and 502's
without incident. The distributed networks were networking
distributedly. If you wanted to send and receive packets with any
internet-joined human in a way that didn't rely on some AWS hosted
application, that was still very possible.
>A simple mistake in one deployment could bring banking, shopping
and travel to a halt globally.
Yeah but for how long and for how many people? The last 20 years have
been a burn in test for a lot of big industries on crappy
infrastructure. It looks like near everyone has been dragged kicking
and screaming into the future.
I mean the entire shipping industry got done over the last decade.
[2]
>Personally, I think the cloud metaphor has overstretched and has
long burst.
It was never very useful.
>For well established/growth businesses and gov, you better become
self-reliant and tech independent
For these businesses, they just go out and get themselves some
region/vendor redundancy. Lots of applications fell over during this
outage, but lots of teams are also getting internal praise for
designing their systems robustly and avoiding its fallout.
>There's no shortage of affordable tech, know-how or workforce.
Yes, and these people often know how to design cloud infrastructure
to avoid these issues, or are smart enough to warn people that if
their region or its dependencies fail without redundancy, they are
taking a nose dive. Businesses will make business decisions and
review those decisions after getting publicly burnt.
HTML [1]: https://news.ycombinator.com/item?id=44105796
HTML [2]: https://www.zdnet.com/article/all-four-of-the-worlds-largest...
anyonecancode wrote 15 hours 20 min ago:
> The internet was born out of the need for Distributed networks
during the cold war - to reduce central points of failure - a hedging
mechanism if you will.
I don't think the idea was that in the event of catastrophe, up to
and including nuclear attack, the system would continue working
normally, but that it would keep working. And the internet -- as a
system -- certainly kept working during this AWS outage. In a
degraded state, yes, but it was working, and recovered.
I'm more concerned with the way the early public internet promised a
different kind of decentralization -- of economics, power, and ideas
-- and how _that_ has become heavily centralized. In which case, AWS,
and Amazon, indeed do make a good example. The internet, as a system,
is certainly working today, but arguably in a degraded state.
al_be_back wrote 5 min ago:
preventing a catastrophe was ARPA's mitigation strategy. the point
is where it's heading, not where it is. It's not about AWS per se,
or any one company, it's the way it is consolidating. AWS came
about by accident - cleverly utilizing spare server capacity from
amazon.com.
In it's conception, the internet (not www), was not envisaged as a
economical medium - it's success was a lovely side-effect.
danpalmer wrote 18 hours 22 min ago:
776 word paragraph and 28 word screen width, this is practically
unreadable.
scottatron wrote 17 hours 26 min ago:
yeah that is some pretty reader hostile formatting -_-
I asked Claude to reformat it for readability for me: [1] Obvs do
your own cross-checking with the original if 100% accuracy is
required.
HTML [1]: https://claude.ai/public/artifacts/958c4039-d2f1-45eb-9dfe-b...
827a wrote 18 hours 22 min ago:
I made it about ten lines into this before realizing that, against all
odds, I wasn't reading a postmortem, I was reading marketing material
designed to sell AWS.
> Many of the largest AWS services rely extensively on DNS to provide
seamless scale, fault isolation and recovery, low latency, and
locality...
Aeolun wrote 13 hours 39 min ago:
I didn't get 10 lines in before I realized that this wall of text
couldn't possibly contain the actual reason. Somewhere behind all
of that is an engineer saying "We done borked up and deleted the
dynamodb DNS records"
alexnewman wrote 20 hours 59 min ago:
Is it the internal dynamodb that other people use?
dilyevsky wrote 21 hours 12 min ago:
Sounds like they went with Availability over Correctness with this
design but the problem is that if your core foundational config is not
correct you get no availability either.
Velocifyer wrote 21 hours 15 min ago:
This is unreadable and terribly formatted.
citizenpaul wrote 13 hours 34 min ago:
Yeah, for real, that's what an "industry leading" company puts out for
their post mortem? They should be red in the face embarrassed.
Jeez, paragraphs? Punctuation?
I put more effort into my internet comments that won't be read by
millions of people.
citizenpaul wrote 13 hours 36 min ago:
Yeah, for real, that's what an "industry leading" company puts out for
their post mortem? They should be red in the face embarrassed.
Jeez, paragraphs? Punctuation?
Looks like Amazon is starting to show cracks in the foundation.
bithavoc wrote 21 hours 31 min ago:
does DynamoDB run on EC2? if I read it right, EC2 depends on DynamoDB.
dokument wrote 20 hours 48 min ago:
There are circular dependencies within AWS, but also systems to
account for this (especially for cold starting).
Also there really is no one AWS; each region is its own (now more
than ever before, though some systems weren't built to support this).
__turbobrew__ wrote 22 hours 14 min ago:
From a meta analysis level: bugs will always happen, formal
verification is hard, and sometimes it just takes a number of years to
have some bad luck (I have hit bugs which were over 10 years old but
due to low probability of them occurring they didn't happen for a
long time).
If we assume that the system will fail, I think the logical thing to
think about is how to limit the effects of that failure. In practice
this means cell based architecture, phased rollouts, and isolated
zones.
To my knowledge AWS does attempt to implement cell based architecture,
but there are some cross region dependencies specifically with
us-east-1 due to legacy. The real long term fix for this is designing
regions to be independent of each other.
This is a hard thing to do, but it is possible. I have personally been
involved in disaster testing where a region was purposely firewalled
off from the rest of the infrastructure. You find out very quick where
those cross region dependencies lie, and many of them are in unexpected
places.
Usually this work is not done due to lack of upper level VP support and
funding, and it is easier to stick your head in the sand and hope bad
things don't happen. The strongest supporters of this work are going
to be the shareholders who are in it for the long run. If the company
goes poof due to improper disaster testing, the shareholders are going
to be the main bag holders. Making the board aware of the risks and the
estimated probability of fundamentally company ending events can help
get this work funded.
tptacek wrote 22 hours 31 min ago:
I'm a tedious broken record about this (among many other things) but if
you haven't read this Richard Cook piece, I strongly recommend you stop
reading this postmortem and go read Cook's piece first. It won't take
you long. It's the single best piece of writing about this topic I have
ever read and I think the piece of technical writing that has done the
most to change my thinking: [1] You can literally check off the things
from Cook's piece that apply directly here. Also: when I wrote this
comment, most of the thread was about root-causing the DNS thing that
happened, which I don't think is the big story behind this outage.
(Cook rejects the whole idea of a "root cause", and I'm pretty sure
he's dead on right about why.)
HTML [1]: https://how.complexsystems.fail/
vader_n wrote 4 hours 43 min ago:
That was a waste of my time.
inkyoto wrote 5 hours 53 min ago:
And I strongly recommend that you stop recommending the reading of
something that has its practical usefulness limited by what the
treatise leaves unsaid:
- It identifies problems (complexity, latent failures, hindsight
bias, etc.) more than it offers solutions. Readers must seek outside
methods to act on these insights.
- It feels abstract, describing general truths applicable to many
domains, but requiring translation into domain-specific practices (be
it software, aviation, medicine, etc.).
- It leaves out discussion on managing complexity - e.g.
principles of simplification, modular design, or quantitative risk
assessment - which would help prevent some of the failures it warns
about.
- It assumes well-intentioned actors and does not grapple with
scenarios where business or political pressures undermine safety -
an increasingly pertinent issue in modern industries.
- It does not explicitly warn against misusing its principles
(e.g. becoming fatalistic or overconfident in defenses). The nuance
that «failures are inevitable but we still must diligently work to
minimize them» must come from the reader's interpretation.
«How Complex Systems Fail» is highly valuable for its conceptual
clarity and timeless truths about complex system behavior. Its
direction is one of realism - accepting that no complex system is
ever 100% safe - and of placing trust in human skill and systemic
defenses over simplistic fixes. The rational critique is that this
direction, whilst insightful, needs to be paired with concrete
strategies and a proactive mindset to be practically useful.
The treatise by itself won't tell you how to design the next
aircraft or run a data center more safely, but it will shape your
thinking so you avoid common pitfalls (such as chasing singular root
causes or blaming operators). To truly «preclude» failures or
mitigate them, one must extend Cook's ideas with detailed
engineering and organizational practices. In other words, Cook
teaches us why things fail in complex ways; it is up to us -
engineers, managers, regulators, and front-line practitioners - to
apply those lessons in how we build and operate the systems under our
care.
To be fair, at the time of writing (late 1990s), Cook's treatise
was breaking ground by succinctly articulating these concepts for a
broad audience. Its objective was likely to provoke thought and shift
paradigms, rather than serve as a handbook.
Today, we have the benefit of two more decades of research and
practice in resilience engineering, which builds on Cook's points.
Practitioners now emphasise building resilient systems, not just
trying to prevent failure outright. They use Cook's insights as
rationale for things such as chaos engineering, better incident
response, and continuous learning cultures.
ponco wrote 9 hours 4 min ago:
Respectfully, I don't think that piece adds anything of material
substance. It's a list of hollow platitudes (vapid writing listing
inactionable truisms).
anonymars wrote 3 hours 55 min ago:
A better resource is likely Michael Nygard's book, "Release It!".
It has practical advice about many issues in this outage. For
example, it appears the circuit breaker and bulkhead patterns were
underused here.
Excerpt:
HTML [1]: https://www.infoq.com/articles/release-it-five-am/
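For anyone who hasn't seen the pattern, a minimal circuit-breaker
sketch (my own illustration, not code from the book): after enough
consecutive failures it fails fast instead of piling retries onto a
struggling dependency, then lets a single probe through after a
cooldown.
  import time

  class CircuitBreaker:
      def __init__(self, max_failures=5, reset_after=30.0):
          self.max_failures = max_failures
          self.reset_after = reset_after
          self.failures = 0
          self.opened_at = None

      def call(self, fn, *args, **kwargs):
          if self.opened_at is not None:
              if time.time() - self.opened_at < self.reset_after:
                  raise RuntimeError("circuit open: failing fast")
              self.opened_at = None       # half-open: let one probe through
          try:
              result = fn(*args, **kwargs)
          except Exception:
              self.failures += 1
              if self.failures >= self.max_failures:
                  self.opened_at = time.time()
              raise
          self.failures = 0               # success closes the breaker
          return result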
nickelpro wrote 13 hours 6 min ago:
To quote Grandpa Simpson, "Everything everyone just said is either
obvious or wrong".
Pointing out that "complex systems" have "layers of defense" is
neither insightful nor useful, it's obvious. Saying that any and all
failures in a given complex system lack a root cause is wrong.
Cook uses a lot of words to say not much at all. There's no concrete
advice to be taken from How Complex Systems Fail, nothing to change.
There's no casualty procedure or post-mortem investigation which
would change a single letter of a single word in response to it. It's
hot air.
baq wrote 11 hours 57 min ago:
Thereâs a difference between âgrown organicallyâ and
âdesigned to operate in this wayâ, though. Experienced folks
will design system components with conscious awareness of how
operations actually look like from the start. Juniors wonât and
will be bolting on quasi solutions as their systems fall over time
and time again. Cookâs generalization is actually wildly
applicable, but it takes work to map it to specific situations.
user3939382 wrote 14 hours 31 min ago:
Nobody discussing the problem understands it.
ramraj07 wrote 15 hours 56 min ago:
As I was reading through that list, I kept feeling, "why do I feel
this is not universally true?"
Then I realized: the internet; the power-grid (at least in most
developed countries); there are things that don't actually fail
catastrophically, even though they are extremely complex, and not
always built by efficient organizations. What's the retort to this
argument?
grumbelbart2 wrote 7 hours 49 min ago:
Also, aviation is a great example of how we can manage failures in
complex systems and how we can track and fix more and rarer
failures over time.
figassis wrote 10 hours 25 min ago:
The grid fails catastrophically. It happened this year in Portugal,
Spain and nearby countries? Still, think of the grid as more like
DNS. It is immense, but the concept is simple and well understood.
You can quickly identify where the fault is (even if not the actual
root cause), and can also quickly address it (even if bringing it
back up in sync takes time and is not trivial). Current cloud infra
is different in that each implementation is unique, services are
unique, knowledge is not universal. There are no books about AWS's
infra fundamentals or how to manage AWS's cloud.
baq wrote 11 hours 45 min ago:
> the internet [1]
> power grid [2]
HTML [1]: https://www.kentik.com/blog/a-brief-history-of-the-interne...
HTML [2]: https://www.entsoe.eu/publications/blackout/28-april-2025-...
jb1991 wrote 15 hours 15 min ago:
The power grid is a huge risk in several major western nations.
singron wrote 15 hours 18 min ago:
They do fail catastrophically. E.g. [1] I think you could argue AWS
is more complex than the electrical grid, but even if it's not, the
grid has had several decades to iron out kinks and AWS hasn't. AWS
also adds a ton of completely new services each year in addition to
adding more capacity. E.g. I bet these DNS Enactors have become
more numerous and their plans became much larger than when they
were first developed, which has greatly increased the odds of
experiencing this issue.
HTML [1]: https://en.wikipedia.org/wiki/Northeast_blackout_of_2003
habinero wrote 15 hours 41 min ago:
The power grid absolutely can fail catastrophically and is a lot
more fragile than people think.
Texas nearly ran into this during their blackout a few years ago --
their grid got within a few minutes of complete failure that would
have required a black start which IIRC has never been done.
Grady has a good explanation and the writeup is interesting reading
too. [1]
HTML [1]: https://youtu.be/08mwXICY4JM?si=Lmg_9UoDjQszRnMw
HTML [2]: https://youtu.be/uOSnQM1Zu4w?si=-v6-Li7PhGHN64LB
nonfamous wrote 16 hours 50 min ago:
Great link, thanks for sharing. This point below stood out to me -
put another way, "fixing" a system in response to an incident to
make it safer might actually be making it less safe.
>>> Views of "cause" limit the effectiveness of defenses against
future events.
>>> Post-accident remedies for "human error" are usually
predicated on obstructing activities that can "cause" accidents.
These end-of-the-chain measures do little to reduce the likelihood of
further accidents. In fact that likelihood of an identical accident
is already extraordinarily low because the pattern of latent failures
changes constantly. Instead of increasing safety, post-accident
remedies usually increase the coupling and complexity of the system.
This increases the potential number of latent failures and also makes
the detection and blocking of accident trajectories more difficult.
albert_e wrote 14 hours 32 min ago:
But that sounds like an assertion without evidence and
underestimates the competence of everyone involved in designing and
maintaining these complex systems.
For example, take airline safety -- are we to believe based on the
quoted assertion that every airline accident and resulting remedy
that mitigated the causes have made air travel LESS safe? That
sounds objectively, demonstrably false.
Truly complex systems like ecosystems and climate might qualify for
this assertion where humans have interfered, often with best
intentions, but caused unexpected effects that may be beyond human
capacity to control.
nonfamous wrote 5 hours 43 min ago:
Airline safety is a special case, I think - the NTSB does
incredible work, and their recommendations are always designed to
improve total safety, not just reduce the likelihood of a
specific failure.
But I can think of lots of examples where the response to an
unfortunate, but very rare, incident can make us less safe
overall. The response to rare vaccine side effects comes
immediately to mind.
GuinansEyebrows wrote 18 hours 7 min ago:
thanks, i'm one of the lucky 10,000 today.
ericyd wrote 18 hours 47 min ago:
I'll admit I didn't read all of either document, but I'm not
convinced of the argument that one cannot attribute a failure to a
root cause simply because the system is complex and required multiple
points of failure to fail catastrophically.
One could make a similar argument in sports that no one person ever
scores a point because they are only put into scoring position by a
complex series of actions which preceded the actual point. I think
that's technically true but practically useless. It's good to have a
wide perspective of an issue but I see nothing wrong with identifying
the crux of a failure like this one.
Yokolos wrote 15 hours 32 min ago:
The best example for this is aviation. Insanely complex from the
machines to the processes to the situations to the people, all
interconnected and constantly interacting. But we still do "root
cause" analyses and based on those findings try to improve every
point in the system that failed or contributed to the failure,
because that's how we get a safer aviation industry. It's
definitely worked.
wbl wrote 18 hours 36 min ago:
It's extremely useful in sports. We evaluate batters on OPS vs RBI,
and no one ever evaluated them on runs they happened to score. We
talk all the time about a QB and his linemen working together and
the receivers. If all we talked about was the immediate cause we'd
miss all that.
ericyd wrote 1 hour 37 min ago:
I'm not saying we ignore all other causes in sports analysis, I'm
saying it doesn't make sense to pretend that there's no "one
person" who hit the home run or scored a touchdown. Of course
it's usually a team effort but we still attribute a score to one
person.
cb321 wrote 19 hours 35 min ago:
That minimalist post mortem for the public is of what sounds like a
Rube Goldberg machine and the reality is probably even more hairy. I
completely agree that if one wants to understand "root causes", it's
more important to understand why such machines are
built/trusted/evolved in the first place.
That piece by Cook is ok, but largely just a list of assertions (true
or not, most do feel intuitive, though). I suppose one should delve
into all those references at the end for details? Anyway, this is an
ancient topic, and I doubt we have all the answers on those root
whys. The MIT course on systems, 6.033, used to assign reading a
paper raised on HN only a few times in its history: [1] and [2] It's
from 1962, over 60 years ago, but that is also probably more
illuminating/thought provoking than the post mortem. Personally, I
suspect it's probably an instance of a [3] , but only past a certain
scale.
HTML [1]: https://news.ycombinator.com/item?id=10082625
HTML [2]: https://news.ycombinator.com/item?id=16392223
HTML [3]: https://en.wikipedia.org/wiki/Wicked_problem
tptacek wrote 19 hours 27 min ago:
I have a housing activism meetup I have to get to, but real quick
let me just say that these kinds of problems are not an abstraction
to me in my day job, that I read this piece before I worked where I
do and it bounced off me, but then I read it last year and was like
"are you me but just smarter?", like my pupils probably dilated
theatrically when I read it like I was a character in Requiem for a
Dream, and I think most of the points he's making are much subtler
and deeper than they seem at a casual read.
You might have to bring personal trauma to this piece to get the
full effect.
cb321 wrote 18 hours 42 min ago:
Oh, it's fine. At your leisure. I didn't mean to go against the
assertions themselves, but more just kind of speak to their
"unargued" quality and often sketchy presentation. Even that
Simon piece has a lot of this in there, where it's sort of "by
definition of 'complexity'/by unelaborated observation".
In engineered systems, there is just a disconnect between on our
own/small scale KISS and what happens in large organizations, and
then what happens over time. This is the real root cause/why,
but I'm not sure it's fixable. Maybe partly addressable, tho'.
One thing that might give you a moment of worry is both in that
Simon and far, far more broadly all over academia both long
before and ever since, biological systems like our bodies are an
archetypal example of "complex". Besides medical failures, life
mostly has this one main trick -- make many copies and if they
don't all fail before they, too, can copy then a stable-ish
pattern emerges.
Stable populations + "litter size/replication factor" largely
imply average failure rates. For most species it is horrific.
On the David Attenborough specials they'll play the sad music and
tell you X% of these offspring never make it to mating age. The
alternative is not the [1] apocalypse, but the
"whatever-that-species-is-biopocalypse". Sorry - it's late and
my joke circuits are maybe fritzing. So, both big 'L' and little
'l' life, too, "is on the edge", just structurally. [2] (with
sand piles and whatnot) used to be a kind of statistical physics
hope for a theory of everything of these kinds of phenomena, but
it just doesn't get deployed. Things will seem "shallowly
critical" but not so upon deeper inspection. So, maybe it's not
not a useful enough approximation.
Anyway, good luck with your housing meetup!
HTML [1]: https://en.wikipedia.org/wiki/Gray_goo
HTML [2]: https://en.wikipedia.org/wiki/Self-organized_criticality
markus_zhang wrote 19 hours 38 min ago:
As a contractor who is on an oncall schedule, I have never worked in
a company that treats oncall as a very serious business. I have only
worked in 2 companies that need oncall, so I'm biased. On paper,
they both say it is serious and all the SLA stuff was set up, but in
reality there is not enough support.
The problem is, oncall is a full-time business. It takes full
attention of the oncall engineer, whether there is an issue or not.
Both companies simply treat oncall as a by-product. We just had to do
it so let's stuff it into the sprint. The first company was
slightly more serious, as we were asked to put up a 2-3 point oncall
task in JIRA. The second one doesn't even do this.
Neither company really encourages engineers to read through complex
code written by others, even if we do oncall for those products.
Again, the first company did better, and we were supposed to create a
channel and pull people in, so it's OKish to not know anything
about the code. The second company simply leaves oncall to do
whatever they can. Neither company allocates enough time for
engineers to read the source code thoroughly. And neither has good
documentation for oncall.
I don't know the culture of AWS. I'd very much want to work in an
oncall environment that is serious and encourages learning.
dekhn wrote 18 hours 17 min ago:
When I was an SRE at Google our oncall was extremely serious (if
the service went down, Google was unable to show ads, record ad
impressions, or do any billing for ads). It was done on a
rotation, lasted 1 week (IIRC it was 9AM-9PM, we had another time
zone for the alternate 12 hours). The on-call was empowered to do
pretty much anything required to keep the service up and running,
including cancelling scheduled downtimes, pausing deployment
updates, stop abusive jobs, stop abusive developers, and invoke an
SVP if there was a fight with another important group).
We sent a test page periodically to make sure the pager actually
beeped. We got paid extra for being in the rotation. The
leadership knew this was a critical step. Unfortunately, much of
our tooling was terrible, which would cause false pages, or failed
critical operations, all too frequently.
I later worked on SWE teams that didn't take dev oncall very
seriously. At my current job, we have an oncall, but it's best
effort business hours only.
citizenpaul wrote 13 hours 39 min ago:
>empowered to do pretty much anything required to keep the
service up and running,
Is that really uncommon? I've been on call for many companies
and many types of institutions, and never once been told I
couldn't do something to bring a system up, that I can recall at
least. It's kinda the job?
On-call seriousness should be directly proportional to pay.
Google pays. If smallcorp wants to pay me COL I'll be looking at
that 2AM ticket at 9AM when I get to work.
lanyard-textile wrote 14 hours 51 min ago:
Handling my first non-prod alert bug as the oncall at Google was
pretty eye opening :)
It was a good lesson in what a manicured lower environment can do
for you.
markus_zhang wrote 18 hours 12 min ago:
That's pretty good. Our oncall is actually 24-hour for one
week. On paper it looks very serious, but even the best of us
don't really know everything, so issues tend to lag to the
morning. Neither do we get any compensation for it. Someone gets a
bad night and still needs to log on the next day. There is an informal
understanding to relax a bit if the night is too bad, though.
dmoy wrote 16 hours 17 min ago:
I did 24hr-for-a-week oncall for 10+ years, do not recommend.
12-12 rotation in SRE is a lot more reasonable for humans
sandeepkd wrote 13 hours 21 min ago:
Unfortunately 24hr-for-a-week seems to be the default everywhere
nowadays; it's just not practical for serious businesses.
It's just an indicator of how important UPTIME is for a
company.
markus_zhang wrote 15 hours 5 min ago:
I agree. It sucks. And our schedule is actually 2 weeks in
every five. One is secondary and the other is primary.
malfist wrote 19 hours 5 min ago:
Amazon generally treats on call as a full time job. Generally
engineers who are on call are expected to only be on call. No
feature work.
tidbits wrote 18 hours 12 min ago:
It's very team/org dependent and I would say that's generally not
the case. In 6 years I have only had 1 team out of 3 where that
was true. The other two teams I was expected to juggle feature
work with oncall work. Same for most teams I interacted with.
markus_zhang wrote 18 hours 56 min ago:
That's actually pretty good.
dosnem wrote 20 hours 15 min ago:
How does knowing this help you avoid these problems? It doesn't
seem to provide any guidance on what to do in the face of complex
systems
tptacek wrote 20 hours 5 min ago:
He's literally writing about Three Mile Island. He doesn't have
anything to tell you about what concurrency primitives to use for
your distributed DNS management system.
But: given finite resources, should you respond to this incident by
auditing your DNS management systems (or all your systems) for race
conditions? Or should you instead figure out how to make the
Droplet Manager survive (in some degraded state) a partition from
DynamoDB without entering congestive collapse? Is the right
response an identification of the "most faulty components" and a
project plan to improve them? Or is it closing the human
expertise/process gap that prevented them from throttling DWFM for
4.5 hours?
Cook isn't telling you how to solve problems; he's asking you to
change how you think about problems, so you don't rathole in
obvious local extrema instead of being guided by the bigger
picture.
doctorpangloss wrote 18 hours 10 min ago:
Both documents are, "ceremonies for engineering personalities."
Even you can't help it - "enumerating a list of questions" is a
very engineering thing to do.
Normal people don't talk or think like that. The way Cook is
asking us to "think about problems" is kind of the opposite of
what good leadership looks like. Thinking about thinking about
problems is like, 200% wrong. On the contrary, be way more
emotional and way simpler.
cyberax wrote 18 hours 14 min ago:
Another point is that DWFM is likely working in a privileged,
isolated network because it needs access deep into the core
control plane. After all, you don't want a rogue service to be
able to add a malicious agent to a customer's VPC.
And since this network is privileged, observability tools,
debugging support, and even maybe access to it are more
complicated. Even just the set of engineers who have access is
likely more limited, especially at 2AM.
Should AWS relax these controls to make recovery easier? But then
it will also result in a less secure system. It's again a
trade-off.
dekhn wrote 18 hours 14 min ago:
It's entirely unclear to me if a system the size and scope of AWS
could be re-thought using these principles and successfully
execute a complete restructuring of all their processes to reduce
their failure rate a bit. It's a system that grew over time with
many thousands of different developers, with a need to solve
critical scaling issues that would have stopped the business in
its tracks (far worse than this outage).
dosnem wrote 19 hours 8 min ago:
I don't really follow what you are suggesting. If the system is
complex and constantly evolving as the article states, you
aren't going to be able to close any expertise process gap.
Operating in a degraded state is probably already built in, this
was just a state of degradation they were not prepared for. You
can't figure out all degraded states to operate in because by
definition the system is complex.
yabones wrote 21 hours 9 min ago:
Another great lens to see this is "Normal Accidents" theory, where
the argument is made that the most dangerous systems are ones where
components are very tightly coupled, interactions are complex and
uncontrollable, and consequences of failure are serious.
HTML [1]: https://en.wikipedia.org/wiki/Normal_Accidents
stefan_bobev wrote 22 hours 32 min ago:
I appreciate the details this went through, especially laying out the
exact timelines of operations and how overlaying those timelines
produces unexpected effects. One of my all time favourite bits about
distributed systems comes from the (legendary) talk at GDC - I Shot You
First[1] - where the speaker describes drawing sequence diagrams with
tilted arrows to represent the flow of time and asking "Where is the
lag?". This method has saved me many times, all throughout my career
from making games, to livestream and VoD services to now fintech.
Always account for the flow of time when doing a distributed operation
- time's arrow always marches forward, your systems might not.
But the stale read didn't scare me nearly as much as this quote:
> Since this situation had no established operational recovery
procedure, engineers took care in attempting to resolve the issue with
DWFM without causing further issues
Everyone can make a distributed system mistake (these things are hard).
But I did not expect something as core as the service managing the
leases on the physical EC2 nodes to not have a recovery procedure. Maybe
I am reading too much into it, maybe what they meant was that they
didn't have a recovery procedure for "this exact" set of circumstances,
but it is a little worrying even if that were the case. EC2 is one of
the original services in AWS. At this point I expect it to be so battle
hardened that very few edge cases would not have been identified. It
seems that the EC2 failure was more impactful in a way, as it cascaded
to more and more services (like the NLB and Lambda) and took more time
to fully recover. I'd be interested to know what gets put in place
there to make it even more resilient.
HTML [1]: https://youtu.be/h47zZrqjgLc?t=1587
throwdbaaway wrote 11 hours 37 min ago:
> But I did not expect something as core as the service managing the
leases on the physical EC2 nodes to not have recovery procedure.
I guess they don't have a recovery procedure for the "congestive
collapse" edge case. I have seen something similar, so I wouldn't be
frowning at this.
A couple of red flags though:
1. Apparent lack of load-shedding support by this DWFM, such that a
server reboot had to be performed. Need to learn from [1] (a minimal
sketch of the idea follows below).
2. Having DynamoDB as a dependency of this DWFM service, instead of
something more primitive like Chubby. Need to learn more about
distributed systems primitives from [2].
HTML [1]: https://aws.amazon.com/builders-library/using-load-shedding-...
HTML [2]: https://www.youtube.com/watch?v=QVvFVwyElLY
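A bare-bones illustration of the load-shedding idea from [1]
(hypothetical code, not anything DWFM actually does): cap the work in
flight and reject the rest immediately, so recovery work never
snowballs into congestive collapse.
  import threading

  class LoadShedder:
      def __init__(self, max_in_flight=100):
          self.max_in_flight = max_in_flight
          self.in_flight = 0
          self.lock = threading.Lock()

      def try_acquire(self):
          with self.lock:
              if self.in_flight >= self.max_in_flight:
                  return False     # shed this request: fail fast, retry later
              self.in_flight += 1
              return True

      def release(self):
          with self.lock:
              self.in_flight -= 1

  # usage: if shedder.try_acquire(): handle the work, then shedder.release()
  #        else: reject immediately instead of letting a queue grow unbounded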
gtowey wrote 12 hours 43 min ago:
It's shocking to me too, but not very surprising. It's probably a
combination of factors that could cause a failure of planning and
I've seen it play out the same way at lots of companies.
I bet the original engineers planned for, and designed the system to
be resilient to this cold start situation. But over time those
engineers left, and new people took over -- those who didn't fully
understand and appreciate the complexity, and probably didn't care
that much about all the edge cases. Then, pushed by management to
pursue goals that are antithetical to reliability, such as cost
optimization and other things, the new failure case was introduced by
lots of sub-optimal changes. The result is as we see it -- a
catastrophic failure which caught everyone by surprise.
It's the kind of thing that happens over and over again when the
accountants are in charge.
tptacek wrote 22 hours 29 min ago:
It shouldn't scare you. It should spark recognition. This
meta-failure-mode exists in every complex technological system. You
should be, like, "ah, of course, that makes sense now". Latent
failures are fractally prevalent and have combinatoric potential to
cause catastrophic failures. Yes, this is a runbook they need to
have, but we should all understand there are an unbounded number of
other runbooks they'll need and won't have, too!
lazystar wrote 22 hours 12 min ago:
the thing that scares me is that AI will never be able to diagnose
an issue that it has never seen before. If there are no runbooks,
there is no pattern recognition. This is something I've been
shouting about for 2 years now; hopefully this issue makes AWS
leadership understand that current gen AI can never replace human
engineering.
janalsncm wrote 11 hours 29 min ago:
AI is a lot more than just LLMs. Running through the rat's nest of
interdependent systems that AWS has is exactly what symbolic AI
was good at.
Aeolun wrote 13 hours 56 min ago:
I think millions of systems have failed due to missing DNS
records though.
tptacek wrote 22 hours 9 min ago:
I'm much less confident in that assertion. I'm not bullish on AI
systems independently taking over operations from humans, but
catastrophic outages are combinations of less-catastrophic
outages which are themselves combinations of latent failures, and
when the latent failures are easy to characterize (as is the case
here!), LLMs actually do really interesting stuff working out the
combinatorics.
I wouldn't want to, like, make a company out of it (I assume the
foundational model companies will eat all these businesses) but
you could probably do some really interesting stuff with an agent
that consumes telemetry and failure model information and uses it
to surface hypos about what to look at or what interventions to
consider.
All of this is beside my original point, though: I'm saying, you
can't runbook your way to having a system as complex as AWS run
safely. Safety in a system like that is a much more complicated
process, unavoidably. Like: I don't think an LLM can solve the
"fractal runbook requirement" problem!
shrubble wrote 22 hours 48 min ago:
The BIND name server required each zone to have an increasing serial
number for the zone.
So if you made a change you had to increase the number, usually a
timestamp like 20250906114509 which would be older / lower numbered
than 20250906114702; making it easier to determine which zone file had
the newest data.
Seems like they sort of had the same setup but with less rigidity in
terms of refusing to load older files.
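That guard is essentially a one-line monotonicity check (sketch
below; real DNS serial comparison also handles RFC 1982 wraparound,
which this ignores):
  def should_load(current_serial, incoming_serial):
      # e.g. 20250906114702 replaces 20250906114509, never the reverse
      return incoming_serial > current_serial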
ecnahc515 wrote 23 hours 23 min ago:
Seems like the enactor should be checking the version/generation of the
current record before it applies the new value, to ensure it never
applies an old plan on top of an record updated by a new plan. It
wouldn't be as efficient, but that's just how it is. It's a basic
compare and swap operation, so it could be handled easily within
dynamodb itself where these records are stored.
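A hedged sketch of that compare-and-swap using DynamoDB conditional
writes (table and attribute names are invented for illustration):
  import boto3
  from botocore.exceptions import ClientError

  table = boto3.resource("dynamodb").Table("dns-plans")   # hypothetical table

  def apply_plan(endpoint_id, plan_generation, records):
      try:
          table.put_item(
              Item={
                  "endpoint": endpoint_id,
                  "generation": plan_generation,
                  "records": records,
              },
              # only write if no plan exists yet, or ours is strictly newer
              ConditionExpression="attribute_not_exists(#gen) OR #gen < :gen",
              ExpressionAttributeNames={"#gen": "generation"},
              ExpressionAttributeValues={":gen": plan_generation},
          )
          return True
      except ClientError as e:
          if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
              return False     # a newer plan already won the race; drop ours
          raise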
pelagicAustral wrote 1 day ago:
Had no idea Dynamo was so intertwined with the whole AWS stack.
freedomben wrote 23 hours 43 min ago:
Yeah, for better or worse, AWS is a huge dogfooder. It's nice to know
they trust their stuff enough to depend on it themselves, but it's
also scary to know that the blast radius of a failure in any
particular service can be enormous
WaitWaitWha wrote 1 day ago:
I gather, the root cause was a latent race condition in the DynamoDB
DNS management system that allowed an outdated DNS plan to overwrite
the current one, resulting in an empty DNS record for the regional
endpoint.
Correct?
tptacek wrote 22 hours 48 min ago:
I think you have to be careful with ideas like "the root cause". They
underwent a metastable congestive collapse. A large component of the
outage was them not having a runbook to safely recover an adequately
performing state for their droplet manager service.
The precipitating event was a race condition with the DynamoDB
planner/enactor system.
HTML [1]: https://how.complexsystems.fail/
1970-01-01 wrote 20 hours 52 min ago:
Why can't a race condition bug be seen as the single root cause?
Yes, there were other factors that accelerated collapse, but those
are inherent to DNS, which is outside the scope of a summary.
tptacek wrote 20 hours 43 min ago:
Because the DNS race condition is just one flaw in the system.
The more important latent flaw* is probably the metastable
failure mode for the droplet manager, which, when it loses
connectivity to Dynamo, gradually itself loses connectivity with
the Droplets, until a critical mass is hit where the Droplet
manager has to be throttled and manually recovered.
Importantly: the DNS problem was resolved (to degraded state) in
1hr15, and fully resolved in 2hr30. The Droplet Manager problem
took much longer!
This is the point of complex failure analysis, and why that
school of thought says "root causing" is counterproductive. There
will always be other precipitating events!
* which itself could very well be a second-order effect of some
even deeper and more latent issue that would be more useful to
address!
cyberax wrote 18 hours 26 min ago:
The droplet manager failure is a much more forgivable scenario.
It happened because the "must always be up" service went down
for an extended period of time, and the sheer amount of actions
needed for the recovery overwhelmed the system.
The initial DynamoDB DNS outage was much worse. A bog-standard
TOCTTOU for scheduled tasks that are assumed to be "instant".
And the lack of controls that allowed one task to just blow up
everything in one of the foundational services.
When I was at AWS some years ago, there were calls to limit the
blast radius by using cell architecture to create vertical
slices of the infrastructure for critical services. I guess
that got completely sidelined.
dgemm wrote 20 hours 40 min ago:
HTML [1]: https://en.wikipedia.org/wiki/Swiss_cheese_model
1970-01-01 wrote 20 hours 42 min ago:
Two different questions here.
1. How did it break?
2. Why did it collapse?
A1: Race condition
A2: What you said.
tptacek wrote 20 hours 25 min ago:
What is the purpose of identifying "root causes" in this
model? Is the root cause of a memory corruption vulnerability
holding a stale pointer to a freed value, or is it the lack
of memory safety? Where does AWS gain more advantage: in
identifying and mitigating metastable failure modes in EC2,
or in trying to identify every possible way DNS might take
down DynamoDB? (The latter is actually not an easy question,
but that's the point!)
1970-01-01 wrote 20 hours 16 min ago:
Two things can be important for an audience. For most, it's
the race condition lesson. Locks are there for a reason.
For AWS, it's the stability lesson. DNS can and did take
down the empire for several hours.
tptacek wrote 20 hours 10 min ago:
Did DNS take it down, or did a pattern of latent failures
take it down? DNS was restored fairly quickly!
Nobody is saying that locks aren't interesting or
important.
nickelpro wrote 12 hours 26 min ago:
The Droplet lease timeouts were an aggravating factor
for the severity of the incident, but are not
causative. Absent a trigger the droplet leases never
experience congestive failure.
The race condition was necessary and sufficient for
collapse. Absent corrective action it always leads to
AWS going down. In the presence of corrective actions
the severity of the failure would have been minor
without other aggravating factors, but the race
condition is always the cause of this failure.
dosnem wrote 18 hours 58 min ago:
This doesn't really matter. This type of error gets
the whole 5 whys treatment and every why needs to
get fixed. Both problems will certainly have an action
item.
tptacek wrote 2 hours 54 min ago:
It is not my claim that AWS is going to handle this
badly, only that this thread is.
qrush wrote 1 day ago:
Sounds like DynamoDB is going to continue to be a hard dependency for
EC2, etc. I at least appreciate the transparency and hearing about
their internal systems names.
UltraSane wrote 15 hours 46 min ago:
They should at least split off dedicated isolated instances of
DynamoDB to reduce blast radius. I would want at least 2 instances
for every internal AWS service that uses it.
skywhopper wrote 21 hours 6 min ago:
I mean, something has to be the baseline data storage layer. I'm
more comfortable with it being DynamoDB than something else that
isn't pushed as hard by as many different customers.
UltraSane wrote 15 hours 44 min ago:
The actual storage layer of DynamoDB is well engineered and has
some formal proofs.
offmycloud wrote 23 hours 58 min ago:
I think it's time for AWS to pull the curtain back a bit and release
a JSON document that shows a list of all internal service
dependencies for each AWS service.
mparnisari wrote 14 hours 36 min ago:
I worked for AWS for two years and if I recall correctly, one of
the issues was circular dependencies.
cyberax wrote 18 hours 22 min ago:
A lot of internal AWS services have names that are completely
opaque to outside users. Such a document will be pretty useless as
a result.
throitallaway wrote 23 hours 14 min ago:
Would it matter? Would you base decisions on whether or not to use
one of their products based on the dependency graph?
UltraSane wrote 15 hours 44 min ago:
It would let you know that if service A and B both depend on
service C you can't use A and B to gain reliability.
withinboredom wrote 19 hours 59 min ago:
Yes.
bdangubic wrote 18 hours 14 min ago:
if so, I hate to tell you this but you would not use AWS (or
any other cloud provider)!
withinboredom wrote 11 hours 57 min ago:
I don't use AWS or any other cloud provider. I've used bare
metal since 2012. See, in 2012 (IIRC), one fateful day, we
turned off our bare metal machines and went full AWS. That
afternoon, AWS had its first major outage. Prior to that day,
the owner could walk in and ask what we were doing about it.
That day, all we could do was twiddle our thumbs or turn on a
now-outdated database replica. Surely AWS won't be out for
hours, right? Right? With bare metal, you might be out for
hours, but you can quickly get back to a degraded state, no
matter what happens. With AWS, you're stuck with whatever
they happen to fix first.
cthalupa wrote 1 hour 48 min ago:
Meanwhile I've had bare metal be a complete outage for over
a day because a backhoe decided it wanted to eat the fiber
line into our building. All I could do was twiddle my
thumbs because we were stuck waiting on another company to
fix that.
Could we have had an offsite location to fail over to? From
a technical perspective, sure. Same as you could go
multi-region or multi-cloud or turn on some servers at
hetzner or whatever. There's nothing better or worse about
the cloud here - you always have the ability to design with
resilience for whatever happens short of the internet as a
whole breaking somehow.
LaserToy wrote 1 day ago:
TLDR:
A DNS automation bug removed all the IP addresses for the regional
endpoints. The tooling that was supposed to help with recovery depends
on the system it needed to recover. That's a classic "we deleted
prod" failure mode at AWS scale.
everfrustrated wrote 1 day ago:
>Services like DynamoDB maintain hundreds of thousands of DNS records
to operate a very large heterogeneous fleet of load balancers in each
Region
Does that mean a DNS query for dynamodb.us-east-1.amazonaws.com can
resolve to one of a hundred thousand IP address?
That's insane!
And also well beyond the limits of route53.
I'm wondering if they're constantly updating route53 with a smaller
subset of records and using a low ttl to somewhat work around this.
rescbr wrote 15 hours 48 min ago:
> And also well beyond the limits of route53.
One thing is the internal limit, another thing is the customer-facing
limit.
Some hard limits are softer than they appear.
donavanm wrote 18 hours 57 min ago:
Some details, but yeah that's basically how all AWS DNS works. I
think you're missing how labels, zones, and domains are related but
distinct. And that R53 operates in resource record SETS. And there
are affordances in the set relationships to build trees and logic for
selecting an appropriate set (eg healthcheck, latency).
> And also well beyond the limits of route53
Ipso facto, R53 can do this just fine. Where do you think all of your
public EC2, ELB, RDS, API Gateway, etc etc records are managed and
served?
thayne wrote 19 hours 10 min ago:
I haven't tested with dynamodb, but I once ran a loop of doing DNS
lookups for s3, and in a couple of seconds I got hundreds of distinct
IP addresses. And that was just for a single region, from a single
source IP.
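That is easy to reproduce with a short loop like the one below (the
endpoint name is just an example; counts vary with resolver caching
and record TTLs):
  import socket

  seen = set()
  for _ in range(200):
      for info in socket.getaddrinfo("s3.us-east-1.amazonaws.com", 443,
                                     proto=socket.IPPROTO_TCP):
          seen.add(info[4][0])       # sockaddr[0] is the resolved address
  print(len(seen), "distinct addresses")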
supriyo-biswas wrote 1 day ago:
DNS-based CDNs are also effectively this: collect metrics from a
datastore regarding system usage metrics, packet loss, latency etc
and compute a table of viewer networks and preferred PoPs.
Unfortunately hard documentation is difficult to provide, but that's
how a CDN worked at a place I used to work for; there's also
another CDN[1] which talks about the same thing in fancier terms.
HTML [1]: https://bunny.net/network/smartedge/
donavanm wrote 18 hours 55 min ago:
Akamai talked about it in the early 2000s. Facebook content folks
had a decent paper describing the latency collection and realtime
routing around 2011ish, something like "pinpoint" I want to
say. Though as you say, it was industry practice before then.
ericpauley wrote 1 day ago:
Interesting use of the phrase âRoute53 transactionâ for an
operation that has no hard transactional guarantees. Especially given
the lack of transactional updates is what caused the outage...
donavanm wrote 1 day ago:
I think you misunderstand the failure case. The
ChangeResourceRecordSet is transactional (or was when I worked on the
service) [1] .
The fault was two different clients with divergent goal states:
- one ("old") DNS Enactor experienced unusually high delays needing
to retry its update on several of the DNS endpoints
- the DNS Planner continued to run and produced many newer
generations of plans
[Ed: this is key: it's producing "plans" of desired state, which do
not include a complete transaction like a log or chain with previous
state + mutations]
- one of the other ("new") DNS Enactors then began applying one of
the newer plans
- then ("new") invoked the plan clean-up process, which identifies
plans that are significantly older than the one it just applied and
deletes them [Ed: the key race is implied here. The "old" Enactor
is reading _current state_, which was the output of "new", and
applying its desired "old" state on top. The discrepency is because
apparently Planer and Enactor aren't working with a chain/vector
clock/serialized change set numbers/etc]
- At the same time the first ("old") Enactor ... applied its much
older plan to the regional DDB endpoint, overwriting the newer plan.
[Ed: and here is where "old" Enactor creates the valid ChangeRRSets
call, replacing "new" with "old"]
- The check that was made at the start of the plan application
process, which ensures that the plan is newer than the previously
applied plan, was stale by this time [Ed: Whoops!]
- The second Enactorâs clean-up process then deleted this older
plan because it was many generations older than the plan it had just
applied.
Ironically Route 53 does have strong transactions of API changes
_and_ serializes them _and_ has closed loop observers to validate
change sets globally on every dataplane host. So do other AWS
services. And there are even some internal primitives for building
replication or change set chains like this. But it's also a PITA and
takes a bunch of work, and when it _does_ fail you end up with global
deadlock and customers who are really grumpy that they don't see their
DNS changes going into effect.
HTML [1]: https://docs.aws.amazon.com/Route53/latest/APIReference/API_...
RijilV wrote 23 hours 41 min ago:
Not for nothing, there's a support group for those of us who've
been hurt by WHU sev2s...
donavanm wrote 18 hours 49 min ago:
Man I always hated that phrasing; always tried to get people to
use more precise terms like "customer change propagation."
But yeah, who hasn't been punished by a query plan change or some
random connectivity problem in Southeast Asia!
gslin wrote 1 day ago:
I believe a report with timezone not using UTC is a crime.
tguedes wrote 22 hours 51 min ago:
I think it makes sense in this instance. Because this occurred in
us-east-1, the vast majority of affected customers are US based. For
most people, it's easier to do the timezone conversion from PT than
UTC.
thayne wrote 19 hours 4 min ago:
But us-east-1 is in Eastern Time, so if you aren't going to use
UTC, why not use that?
I'm guessing PT was chosen because the people writing this report
are in PT (where Amazon headquarters is).
trenchpilgrim wrote 20 hours 37 min ago:
us-east-1 is an exceptional Amazon region; it hosts many global
services as well as services which are not yet available in other
regions. Most AWS customers worldwide probably have an indirect
dependency on us-east-1.
cheeze wrote 1 day ago:
My guess is that PT was chosen to highlight the fact that this
happened in the middle of the night for most of the responding ops
folks.
(I don't know anything here, just spitballing why that choice would
be made)
throitallaway wrote 23 hours 12 min ago:
Their headquarters is in Seattle (Pacific Time). But yeah, I hate
time zones.
exogenousdata wrote 1 day ago:
An epoch fail?
jasode wrote 1 day ago:
So the DNS records' if-stale-then-needs-update logic was basically a
variation of the "2 Hard Things In Computer Science - cache
invalidation". Excerpt from the giant paragraph:
>[...] Right before this event started, one DNS Enactor experienced
unusually high delays needing to retry its update on several of the DNS
endpoints. As it was slowly working through the endpoints, several
other things were also happening. First, the DNS Planner continued to
run and produced many newer generations of plans. Second, one of the
other DNS Enactors then began applying one of the newer plans and
rapidly progressed through all of the endpoints. The timing of these
events triggered the latent race condition. When the second Enactor
(applying the newest plan) completed its endpoint updates, it then
invoked the plan clean-up process, which identifies plans that are
significantly older than the one it just applied and deletes them. At
the same time that this clean-up process was invoked, the first Enactor
(which had been unusually delayed) applied its much older plan to the
regional DDB endpoint, overwriting the newer plan. The check that was
made at the start of the plan application process, which ensures that
the plan is newer than the previously applied plan, was stale by this
time due to the unusually high delays in Enactor processing. [...]
It outlines some of the mechanics, but some might think it still isn't a
"Root Cause Analysis" because there's no satisfying explanation of
_why_ there were "unusually high delays in Enactor processing".
Hardware problem?!? Human error/misconfiguration causing unintended
delays in Enactor behavior?!? Either the previous sequence of events
leading up to that is considered unimportant, or Amazon is still
investigating what made the Enactor behave in an unpredictable way.
ignoramous wrote 23 hours 11 min ago:
> ...there's no satisfying explanation of _why_ there were "unusually
high delays in Enactor processing". Hardware problem?
Can't speak for the current incident but a similar "slow machine"
issue once bit our BigCloud service (not as big an incident,
thankfully) due to loooong JVM GC pauses on failing hardware.
Cicero22 wrote 1 day ago:
my takeaway was that the race condition was the root cause. Take
away that bug, and suddenly there's no incident, regardless of any
processing delays.
_alternator_ wrote 1 day ago:
Right. Sounds like it's a case of "rolling your own distributed
system algorithm" without the up-front investment in implementing
a truly robust distributed system.
Often network engineers are unaware of some of the tricky problems
that DS research has addressed/solved in the last 50 years, because
the algorithms are arcane and heuristics often work pretty well,
until they don't. But my guess is that AWS will invest in some
serious redesign of the system, hopefully with some rigorous
algorithms underpinning the updates.
Consider this a nudge for all you engineers that are designing
fault tolerant distributed systems at scale to investigate the
problem spaces and know which algorithms solve what problems.
withinboredom wrote 20 hours 8 min ago:
Further, please donât stop at RAFT. RAFT is popular because it
is easy to understand, not because it is the best way to do
distributed consensus. It is non-deterministic (thus requiring
odd numbers of electors), requires timeouts for liveness (thus
latency can kill you), and isnât all that good for
general-purpose consensus, IMHO.
foobarian wrote 23 hours 3 min ago:
> some serious redesign of the system, hopefully with some
rigorous algorithms underpinning the updates
Reading these words makes me break out in cold sweat :-) I really
hope they don't
dboreham wrote 23 hours 20 min ago:
Certainly seems like misuse of DNS. It wasn't designed to be a
rapidly updatable consistent distributed database.
tremon wrote 4 hours 15 min ago:
That's true, if you use the CAP definition for consistency.
Otherwise, I'd say that the DNS design satisfies each of those
terms:
- "Rapidly updatable" depends on the specific implementation,
but the design allows for 2 billion changesets in flight before
mirrors fall irreparably out of sync with the master database,
and the DNS specs include all components necessary for rapid
updates: push-based notifications and incremental transfers.
- DNS is designed to be eventually consistent, and each replica
is expected to always offer internally consistent data. It's
certainly possible for two mirrors to respond with different
responses to the same query, but eventual consistency does not
preclude that.
- Distributed: the DNS system certainly is a distributed
database; in fact it was specifically designed to allow for
replication across organization boundaries -- something that
very few other distributed systems offer. What DNS does not
offer is multi-master operation, but neither do e.g. Postgres
or MSSQL.
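[Ed: the "2 billion changesets" figure comes from the serial number arithmetic
DNS zones use (RFC 1982): SOA serials are 32-bit, and a replica only treats a
serial as newer if it is ahead by less than 2^31. A small sketch of just that
comparison, assuming nothing beyond the RFC:]

    # Serial number arithmetic for 32-bit DNS SOA serials (RFC 1982).
    SERIAL_BITS = 32
    HALF = 2 ** (SERIAL_BITS - 1)   # 2_147_483_648

    def serial_gt(s1, s2):
        """True if serial s1 is 'newer than' s2 under RFC 1982 comparison."""
        s1 %= 2 ** SERIAL_BITS
        s2 %= 2 ** SERIAL_BITS
        return (s1 > s2 and s1 - s2 < HALF) or (s1 < s2 and s2 - s1 > HALF)

    print(serial_gt(2, 1))             # True: one change ahead
    print(serial_gt(1 + HALF - 1, 1))  # True: ~2.1 billion changes ahead still orders correctly
    print(serial_gt(1 + HALF, 1))      # False: fall further behind and ordering is lost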
pyrolistical wrote 17 hours 11 min ago:
I think historically DNS was "best effort", but with
consensus algorithms like Raft, I can imagine a DNS that is
perfectly consistent.
dustbunny wrote 1 day ago:
Why is the "DNS Planner" and "DNS Enactor" separate? If it was one
thing, wouldn't this race condition have been much more clear to the
people working on it? Is this caused by the explosion of complexity
due to the overuse of the microservice architecture?
jiggawatts wrote 16 hours 14 min ago:
This was my thought also. The first sentences of the RCA screamed
"race condition" without even having to mention the phrase.
The two DNS components comprise a monolith: neither is useful
without the other and there is one arrow on the design coupling
them together.
If they were a single component then none of this would have
happened.
Also, version checks? Really?
Why not compare the current state against the desired state and
take the necessary actions to bring them in line?
Last but not least, deleting old config files so aggressively is a
"penny wise, pound foolish" design. I would keep these forever,
or at least a month! Certainly much, much longer than any possible
time taken through the sequence of provisioning steps.
UltraSane wrote 15 hours 52 min ago:
Yes it should be impossible for all DNS entries to get deleted
like that.
neom wrote 1 day ago:
Pick your battles, I'd guess. Given how huge AWS is, if you have
desired state vs. a reconciler, you probably have more resilient
operations generally and an easier job of finding and isolating
problems; the flip side of that is if you screw up your error
handling, you get this. That aside, it seems strange to me they
didn't account for the fact that a stale plan could get picked up
over a new one, so maybe I misunderstand the incident/architecture.
bananapub wrote 1 day ago:
> Why is the "DNS Planner" and "DNS Enactor" separate?
for a large system, it's in practice very nice to split up things
like that - you have one bit of software that just reads a bunch of
data and then emits a plan, and then another thing that just gets
given a plan and executes it.
this is easier to test (you're just dealing with producing one data
structure and consuming one data structure, the planner doesn't
even try to mutate anything), it's easier to restrict permissions
(one side only needs read access to the world!), it's easier to do
upgrades (neither side depends on the other existing or even being
in the same language), it's safer to operate (the planner is
disposable, it can crash or be killed at any time with no problem
except update latency), it's easier to comprehend (humans can
examine the planner output which contains the entire state of the
plan), it's easier to recover from weird states (you can in
extremis hack the plan) etc etc. these are all things you
appreciate more and more as your system gets bigger and more
complicated.
> If it was one thing, wouldn't this race condition have been much
more clear to the people working on it?
no
> Is this caused by the explosion of complexity due to the over use
of the microservice architecture?
no
it's extremely easy to second-guess the way other people decompose
their services since randoms online can't see any of the actual
complexity or any of the details and so can easily suggest it would
be better if it was different, without having to worry about any of
the downsides of the imagined alternative solution.
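[Ed: a minimal sketch of the split being described, with hypothetical names:
the planner is a pure function from (current, desired) state to a plan data
structure, and the enactor only knows how to execute a plan it is handed.
Illustrative only, not AWS's design.]

    from dataclasses import dataclass, field

    @dataclass
    class Plan:
        generation: int
        upserts: dict = field(default_factory=dict)   # name -> ip
        deletes: list = field(default_factory=list)   # names to remove

    def plan(generation, current, desired):
        """Planner: read-only diff of desired state against current state."""
        upserts = {n: ip for n, ip in desired.items() if current.get(n) != ip}
        deletes = [n for n in current if n not in desired]
        return Plan(generation, upserts, deletes)

    def enact(endpoint, p):
        """Enactor: applies a plan it was handed; knows nothing about how it was made."""
        for name in p.deletes:
            endpoint.pop(name, None)
        endpoint.update(p.upserts)
        return endpoint

    current = {"dynamodb.example": "203.0.113.10", "stale.example": "203.0.113.99"}
    desired = {"dynamodb.example": "203.0.113.50"}
    p = plan(generation=6, current=current, desired=desired)   # inspectable, testable, hackable
    print(enact(dict(current), p))   # {'dynamodb.example': '203.0.113.50'}

The plan object is exactly the artifact described above: it can be logged,
diffed, tested, or applied by hand in an emergency.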
tuckerman wrote 21 hours 13 min ago:
Agreed, this is a common division of labor and simplifies things.
It's not entirely clear in the postmortem but I speculate that
the conflation of duties (i.e. the enactor also being responsible
for janitor duty of stale plans) might have been a contributing
factor.
The Oxide and Friends folks covered an update system they built
that is similarly split and they cite a number of the same
benefits as you:
HTML [1]: https://oxide-and-friends.transistor.fm/episodes/systems...
jiggawatts wrote 16 hours 10 min ago:
I would divide these as functions inside a monolithic
executable. At most, emit the plan to a file on disk as a
"whatif" optional path.
Distributed systems with files as a communication medium are
much more complex than programmers think with far more failure
modes than they can imagine.
Like... this one, that took out a cloud for hours!
tuckerman wrote 53 min ago:
Doing it inside a single binary gets rid of some of the nice
observability features you get "for free" by breaking it up
and could complicate things quite a bit (more code paths,
flags for running it in a "don't make a plan, use the last plan"
mode, flags for a "use this human-generated plan" mode). Very
few things are a free lunch but I've used this pattern
numerous times and quite like it. I ran a system that used a
MIP model to do capacity planning and separating planning
from executing a plan was very useful for us.
I think the communications piece depends on what other
systems you have around you to build on; it's unlikely this
planner/executor is completely freestanding. Some companies
have large distributed filesystems with well known/tested
semantics, schedulers that launch jobs when files appear,
they might have ~free access to a database with strict
serializability where they can store a serialized version of
the plan, etc.
Anon1096 wrote 22 hours 56 min ago:
I mean, any time a service even 1/100 the size of AWS goes down,
you have people crawling out of the woodwork giving armchair
advice while having no domain-relevant experience. It's barely
even worth taking the time to respond. The people with opinions
of value are already giving them internally.
lazystar wrote 21 hours 59 min ago:
> The people with opinions of value are already giving them
internally.
interesting take, in light of all the brain drain that AWS has
experienced over the last few years. some outside opinions
might be useful - but perhaps the brain drain is so extreme
that those remaining don't realize it's occurring?
supportengineer wrote 1 day ago:
It probably was a single-threaded python script until somebody
found a way to get a Promo out of it.
placardloop wrote 18 hours 16 min ago:
This is Amazon we're talking about, it was probably Perl.
donavanm wrote 1 day ago:
This is public messaging to explain the problem at large. This isn't
really a post-incident analysis.
Before the active incident is "resolved" there's an evaluation of
probable/plausible recurrence. Usually we/they would have potential
mitigations and recovery runbooks prepared as well, to quickly react
to any recurrence. Any likely open risks are actively worked to
mitigate before the immediate issue is considered resolved. That
includes around-the-clock dev team work if it's the best known path to
mitigation.
Next, any plausible paths to "risk of recurrence" would be top
dev team priority (business hours) until those action items are
completed and in deployment. That might include other teams with
similar DIY DNS management, other teams who had less impactful queue
depth problems, or other similar "near miss" findings. Service
team tech & business owners (PE, Sr PE, GM, VP) would be tracking
progress daily until resolved.
Then, in the next few weeks at org- and AWS-level "ops meetings",
there are going to be in-depth discussions of the incident,
response, underlying problems, etc., the goal there being
organizational learning and broader dissemination of lessons learned,
action items, best practice, etc.
mcmoor wrote 1 day ago:
Also, I don't know if I missed it, but they don't establish anything
to prevent an outage if there's an unusually high delay again?
mattcrox wrote 1 day ago:
It's at the end: they disabled the DDB DNS automations around
this, to fix the issue before they re-enable them.
mcmoor wrote 11 hours 43 min ago:
If it's re-enabled (without change?), wouldn't an unusually high
delay break it again?
cthalupa wrote 1 hour 52 min ago:
Why would they enable it without fixing the issue?
The post-mortem is specific that they won't turn it back on
without resolving this, but I feel like the default assumption
for any halfway competent entity would be that they fix the
known issue that caused them to disable something in the first place.
shayonj wrote 1 day ago:
I was kinda surprised by the lack of CAS on a per-endpoint plan version, or
of patterns like rejecting stale writes via 2PC or a single-writer lease per
endpoint.
Definitely a painful one with good learnings and kudos to AWS for being
so transparent and detailed :hugops:
donavanm wrote 1 day ago:
See [1]. The actual DNS mutation API does, effectively, CAS. They
had multiple unsynchronized writers who raced without logical
constraints or ordering to the changes. Without thinking much, they
_might_ have been able to implement something like a vector, either
through updating the zone serial or another "sentinel record" that
was always used for ChangeRRSets affecting that label/zone; like a
TXT record containing a serialized change set number or a "checksum"
of the old + new state.
I'm guessing the "plans" aspect skipped that and they were just
applying intended state, without trying to serialize them. And
last-write-wins, until it doesn't.
HTML [1]: https://news.ycombinator.com/item?id=45681136
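[Ed: a minimal sketch of the kind of fence being discussed, with hypothetical
names. The point is that the freshness check and the apply happen in one
atomic step, so a delayed Enactor loses the race instead of winning it; in a
real system the in-process lock would be a conditional write against a
strongly consistent store, or a sentinel record carrying a change set number
as the parent suggests.]

    import threading

    class FencedEndpoint:
        def __init__(self):
            self._lock = threading.Lock()
            self.generation = 0
            self.records = []

        def apply_if_newer(self, generation, records):
            """Compare-and-set: apply only if this plan is newer than what is live."""
            with self._lock:
                if generation <= self.generation:
                    return False              # stale plan rejected, not last-write-wins
                self.generation = generation
                self.records = list(records)
                return True

    endpoint = FencedEndpoint()
    print(endpoint.apply_if_newer(5, ["203.0.113.50"]))   # True: newer plan lands
    print(endpoint.apply_if_newer(1, ["203.0.113.10"]))   # False: delayed "old" Enactor loses
    print(endpoint.generation, endpoint.records)          # 5 ['203.0.113.50']

Guarding the clean-up step the same way (never delete the generation that is
currently applied anywhere) would close the other half of the failure.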
cyberax wrote 18 hours 23 min ago:
Oh, I can see it from here. AWS internally has a problem with
things like task orchestration. I bet that the enactor can be
rewritten as a goroutine/thread in the planner, with proper locking
and ordering.
But that's too complicated and results in more code. So they likely
just used an SQS queue with consumers reading from it.
lazystar wrote 1 day ago:
> Since this situation had no established operational recovery
procedure, engineers took care in attempting to resolve the issue with
DWFM without causing further issues.
interesting.
galaxy01 wrote 1 day ago:
Would conditional read/write solve this? Looks like some kind of stale
read.
yla92 wrote 1 day ago:
So the root cause is basically a race-condition-101 stale read?
philipwhiuk wrote 1 day ago:
Race condition and bad data validation.
joeyhage wrote 1 day ago:
> as is the case with the recently launched IPv6 endpoint and the
public regional endpoint
It isn't explicitly stated in the RCA but it is likely these new
endpoints were the straw that broke the camel's back for the DynamoDB
load balancer DNS automation