       COMMENT PAGE FOR:
  HTML   Summary of the Amazon DynamoDB Service Disruption in US-East-1 Region
       
       
        asim wrote 4 hours 35 min ago:
         On the one hand it's an incentive to shift to smaller, self-managed
         setups where you don't need AWS, e.g. as an individual I just run a
         single DigitalOcean VPS. But on the other hand, if you're a large
         business, the evaluation is basically: can I tolerate this kind of
         incident once in a while, versus the massive operational cost of
         doing it myself? It's really going to be a case-by-case study of who
         stays, who moves and who tries some multi-cloud failover. It's not
         one of those situations where you can just blanket-say, oh, this is
         terrible, stupid, should never happen, let's get off AWS. This is
         the slow build-up of dependency on something people value. That's
         not going to change quickly. It might never change. The
         too-big-to-fail mantra of banks applies. What happens next is
         essentially very anticlimactic, which is to say: nothing.
       
          tonymet wrote 52 min ago:
           Multi-region AWS would have been adequate for this outage.
       
        JohnMakin wrote 5 hours 54 min ago:
        Was still seeing SQS latency affecting my systems a full day after they
        gave the “all clear.” There are red flags all over this summary to
        me, particularly the case where they had no operational procedure for
        recovery. That seems to me impossible in a hyperscaler - you never
         considered this failure scenario, ever? Or did you lose the
         engineers who did know?
        
         Anyway, I appreciate that this seems pretty honest and descriptive.
       
        polyglotfacto2 wrote 9 hours 27 min ago:
        Use TLA+ (which I thought they did)
       
        baalimago wrote 9 hours 48 min ago:
         Did they intentionally make it dense and complicated to discourage
         anyone from actually reading it...?
        
        776 words in a single paragraph
       
        giamma wrote 10 hours 42 min ago:
        Interesting analysis from The Register
        
  HTML  [1]: https://www.theregister.com/2025/10/20/aws_outage_amazon_brain...
       
          mellosouls wrote 10 hours 29 min ago:
          Discussed the other day:
          
          Today is when the Amazon brain drain sent AWS down the spout (644
          comments)
          
  HTML    [1]: https://news.ycombinator.com/item?id=45649178
       
        martythemaniak wrote 14 hours 36 min ago:
        It's not DNS
        There's no way it's DNS
        It was DNS
       
        grogers wrote 14 hours 49 min ago:
        > As this plan was deleted, all IP addresses for the regional endpoint
        were immediately removed.
        
        I feel like I am missing something here... They make it sound like the
        DNS enactor basically diffs the current state of DNS with the desired
        state, and then submits the adds/deletes needed to make the DNS go to
        the desired state.
        
        With the racing writers, wouldn't that have just made the DNS go back
        to an older state? Why did it remove all the IPs entirely?
       
          Aeolun wrote 13 hours 43 min ago:
           1. Reads state: oh, I need to delete all this.

           2. Reads state: oh, I need to write all this.

           2. Writes.

           1. Deletes.

           Or some variant of that anyway (the numbers are the two racing
           actors; the lines are the order in which things happen). It
           happens in any system that has concurrent readers/writers and no
           locks.
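
           A minimal sketch of that interleaving in Python (hypothetical
           names; nothing to do with AWS's actual Enactor code): the slower
           actor applies its stale diff last and wipes the record.

             import threading, time

             dns = {"dynamodb.us-east-1": ["10.0.0.1", "10.0.0.2"]}

             def enactor(desired, think_time):
                 # 1) read current state and diff it against the desired plan
                 current = dict(dns)
                 to_delete = [k for k in current if k not in desired]
                 to_write = {k: v for k, v in desired.items()
                             if current.get(k) != v}
                 # 2) pause; the other enactor runs in this window
                 time.sleep(think_time)
                 # 3) apply the now-stale diff; no lock, no version check
                 for k in to_delete:
                     dns.pop(k, None)
                 dns.update(to_write)

             # The old plan no longer contains the endpoint; the new plan
             # updates it.  The old-plan enactor finishes last, so its
             # delete wins and the record ends up empty.
             old = threading.Thread(target=enactor, args=({}, 0.2))
             new = threading.Thread(
                 target=enactor,
                 args=({"dynamodb.us-east-1": ["10.0.0.3"]}, 0.1))
             old.start(); new.start(); old.join(); new.join()
             print(dns)   # {} -- the regional endpoint has vanished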
       
        JCM9 wrote 15 hours 27 min ago:
        Good to see a detailed summary. The frustration from a customer
        perspective is that AWS continues to have these cross-region issues and
        they continue to be very secretive about where these single points of
        failure exist.
        
        The region model is a lot less robust if core things in other regions
        require US-East-1 to operate. This has been an issue in previous
        outages and appears to have struck again this week.
        
        It is what it is, but AWS consistently oversells the robustness of
        regions as fully separate when events like Monday reveal they’re
        really not.
       
          Arainach wrote 15 hours 17 min ago:
          >about where these single points of failure exist.
          
          In general, when you find one you work to fix it, and one of the most
          common ways to find more is when one of them fails.  Having single
          points of failure and letting them live isn't the standard practice
          at this scale.
       
        rr808 wrote 15 hours 53 min ago:
          [1] has a better explanation than the wall of text from AWS
        
  HTML  [1]: https://newsletter.pragmaticengineer.com/p/what-caused-the-lar...
       
          cowsandmilk wrote 15 hours 17 min ago:
          Except just in the DNS section, I’ve already found one place where
          he gets it wrong…
       
        al_be_back wrote 16 hours 35 min ago:
        Postmortem all you want - the internet is breaking, hard.
        
        The internet was born out of the need for Distributed networks during
        the cold war - to reduce central points of failure - a hedging
        mechanism if you will.
        
         Now it has consolidated into ever smaller mono-nets. A simple
         mistake in one deployment could bring banking, shopping and travel
         to a halt
        globally. This can only get much worse when cyber warfare gets
        involved.
        
        Personally, I think the cloud metaphor has overstretched and has long
        burst.
        
        For R&D, early stage start-ups and occasional/seasonal computing, cloud
        works perfectly (similar to how time-sharing systems used to work).
        
        For well established/growth businesses and gov, you better become
        self-reliant and tech independent: own physical servers + own cloud +
        own essential services (db, messaging, payment).
        
        There's no shortage of affordable tech, know-how or workforce.
       
          protocolture wrote 14 hours 13 min ago:
          >the internet is breaking, hard.
          
           I don't see that this is the case; it's just that more people
           want services over the internet from the same 3 places, which
           break irregularly.
          
          Internet infrastructure is as far as I can tell, getting better all
          the time.
          
           The last big BGP bug had 1/10th the comments of the AWS one, and
           had much less scary naming (ooooh, routing instability). [1]

           >The internet was born out of the need for Distributed networks
           during the cold war - to reduce central points of failure - a
           hedging mechanism if you will.
          
          Instead of arguing about the need that birthed the internet, I will
          simply say that the internet still works in the same largely
          distributed fashion. Maybe you mean Web instead of Internet?
          
           The issue here is that "Internet" isn't the same as "Things you
           might access on the Internet". The Internet held up great during
           this adventure. As far as I can tell it was returning 404s and
           502s without incident. The distributed networks were networking
           distributedly. If you wanted to send and receive packets with any
           internet-joined human in a way that didn't rely on some AWS-hosted
           application, that was still very possible.
          
           >A simple mistake in one deployment could bring banking, shopping
           and travel to a halt globally.
          
           Yeah, but for how long, and for how many people? The last 20
           years have been a burn-in test for a lot of big industries on
           crappy infrastructure. It looks like nearly everyone has been
           dragged kicking and screaming into the future.
          
           I mean the entire shipping industry got done over the last
           decade. [2]

           >Personally, I think the cloud metaphor has overstretched and has
           long burst.
          
          It was never very useful.
          
          >For well established/growth businesses and gov, you better become
          self-reliant and tech independent
          
          For these businesses, they just go out and get themselves some
          region/vendor redundancy. Lots of applications fell over during this
          outage, but lots of teams are also getting internal praise for
          designing their systems robustly and avoiding its fallout.
          
          >There's no shortage of affordable tech, know-how or workforce.
          
          Yes, and these people often know how to design cloud infrastructure
          to avoid these issues, or are smart enough to warn people that if
          their region or its dependencies fail without redundancy, they are
          taking a nose dive. Businesses will make business decisions and
          review those decisions after getting publicly burnt.
          
  HTML    [1]: https://news.ycombinator.com/item?id=44105796
  HTML    [2]: https://www.zdnet.com/article/all-four-of-the-worlds-largest...
       
          anyonecancode wrote 15 hours 20 min ago:
          > The internet was born out of the need for Distributed networks
          during the cold war - to reduce central points of failure - a hedging
          mechanism if you will.
          
           I don't think the idea was that, in the event of catastrophe, up
           to and including nuclear attack, the system would continue working
           normally; just that it would keep working at all. And the internet
           -- as a system -- certainly kept working during this AWS outage.
           In a degraded state, yes, but it was working, and recovered.
          
          I'm more concerned with the way the early public internet promised a
          different kind of decentralization -- of economics, power, and ideas
          -- and how _that_ has become heavily centralized. In which case, AWS,
          and Amazon, indeed do make a good example. The internet, as a system,
          is certainly working today, but arguably in a degraded state.
       
            al_be_back wrote 5 min ago:
             Preventing a catastrophe was ARPA's mitigation strategy. The
             point is where it's heading, not where it is. It's not about AWS
             per se, or any one company; it's the way it is consolidating.
             AWS came about by accident - cleverly utilizing spare server
             capacity from amazon.com.

             In its conception, the internet (not the www) was not envisaged
             as an economic medium - its success was a lovely side-effect.
       
        danpalmer wrote 18 hours 22 min ago:
         A 776-word paragraph and a 28-word screen width: this is
         practically unreadable.
       
          scottatron wrote 17 hours 26 min ago:
          yeah that is some pretty reader hostile formatting -_-
          
          I asked Claude to reformat it for readability for me: [1] Obvs do
          your own cross-checking with the original if 100% accuracy is
          required.
          
  HTML    [1]: https://claude.ai/public/artifacts/958c4039-d2f1-45eb-9dfe-b...
       
        827a wrote 18 hours 22 min ago:
        I made it about ten lines into this before realizing that, against all
        odds, I wasn't reading a postmortem, I was reading marketing material
        designed to sell AWS.
        
        > Many of the largest AWS services rely extensively on DNS to provide
        seamless scale, fault isolation and recovery, low latency, and
        locality...
       
          Aeolun wrote 13 hours 39 min ago:
          I didn’t get 10 lines in before I realized that this wall of text
          couldn’t possibly contain the actual reason. Somewhere behind all
          of that is an engineer saying “We done borked up and deleted the
          dynamodb DNS records”
       
        alexnewman wrote 20 hours 59 min ago:
        Is it the internal dynamodb that other people use?
       
        dilyevsky wrote 21 hours 12 min ago:
        Sounds like they went with Availability over Correctness with this
        design but the problem is that if your core foundational config is not
        correct you get no availability either.
       
        Velocifyer wrote 21 hours 15 min ago:
        This is unreadable and terribly formatted.
       
          citizenpaul wrote 13 hours 34 min ago:
           Yeah, for real, that's what an "industry leading" company puts
           out for their post mortem?  They should be red in the face with
           embarrassment.  Jeez, paragraphs?  Punctuation?

           I put more effort into my internet comments, which won't be read
           by millions of people.
       
          citizenpaul wrote 13 hours 36 min ago:
           Yeah, for real, that's what an "industry leading" company puts
           out for their post mortem?  They should be red in the face with
           embarrassment.  Jeez, paragraphs?  Punctuation?

           Looks like Amazon is starting to show cracks in the foundation.
       
        bithavoc wrote 21 hours 31 min ago:
         Does DynamoDB run on EC2? If I read it right, EC2 depends on
         DynamoDB.
       
          dokument wrote 20 hours 48 min ago:
          There are circular dependencies within AWS, but also systems to
          account for this (especially for cold starting).
          
           Also, there really is no one AWS; each region is its own (now
           more than ever before, though some systems weren't built to
           support this).
       
        __turbobrew__ wrote 22 hours 14 min ago:
        From a meta analysis level: bugs will always happen, formal
        verification is hard, and sometimes it just takes a number of years to
        have some bad luck (I have hit bugs which were over 10 years old but
        due to low probability of them occurring they didn’t happen for a
        long time).
        
        If we assume that the system will fail, I think the logical thing to
        think about is how to limit the effects of that failure. In practice
        this means cell based architecture, phased rollouts, and isolated
        zones.
        
        To my knowledge AWS does attempt to implement cell based architecture,
        but there are some cross region dependencies specifically with
        us-east-1 due to legacy. The real long term fix for this is designing
        regions to be independent of each other.
        
        This is a hard thing to do, but it is possible. I have personally been
        involved in disaster testing where a region was purposely firewalled
         off from the rest of the infrastructure. You find out very quickly
         where those cross-region dependencies lie, and many of them are in
         unexpected
        places.
        
        Usually this work is not done due to lack of upper level VP support and
        funding, and it is easier to stick your head in the sand and hope bad
        things don’t happen. The strongest supporters of this work are going
        to be the share holders who are in it for the long run. If the company
        goes poof due to improper disaster testing, the shareholders are going
        to be the main bag holders. Making the board aware of the risks and the
        estimated probability of fundamentally company ending events can help
        get this work funded.
       
        tptacek wrote 22 hours 31 min ago:
        I'm a tedious broken record about this (among many other things) but if
        you haven't read this Richard Cook piece, I strongly recommend you stop
        reading this postmortem and go read Cook's piece first. It won't take
         you long. It's the single best piece of writing about this topic I
         have ever read, and I think the piece of technical writing that has
         done the most to change my thinking: [1]

         You can literally check off the things from Cook's piece that apply
         directly here. Also: when I wrote this comment, most of the thread
         was about root-causing the DNS thing that happened, which I don't
         think is the big story behind this outage. (Cook rejects the whole
         idea of a "root cause", and I'm pretty sure he's dead-on right about
         why.)
        
  HTML  [1]: https://how.complexsystems.fail/
       
          vader_n wrote 4 hours 43 min ago:
          That was a waste of my time.
       
          inkyoto wrote 5 hours 53 min ago:
          And I strongly recommend that you stop recommending the reading of
          something that has its practical usefulness limited by what the
          treatise leaves unsaid:
          
            – It identifies problems (complexity, latent failures, hindsight
          bias, etc.) more than it offers solutions. Readers must seek outside
          methods to act on these insights.
          
            – It feels abstract, describing general truths applicable to many
          domains, but requiring translation into domain-specific practices (be
          it software, aviation, medicine, etc.).
          
            – It leaves out discussion on managing complexity – e.g.
          principles of simplification, modular design, or quantitative risk
          assessment – which would help prevent some of the failures it warns
          about.
          
            – It assumes well-intentioned actors and does not grapple with
          scenarios where business or political pressures undermine safety –
          an increasingly pertinent issue in modern industries.
          
            – It does not explicitly warn against misusing its principles
          (e.g. becoming fatalistic or overconfident in defenses). The nuance
          that «failures are inevitable but we still must diligently work to
          minimize them» must come from the reader’s interpretation.
          
          «How Complex Systems Fail» is highly valuable for its conceptual
          clarity and timeless truths about complex system behavior. Its
          direction is one of realism – accepting that no complex system is
          ever 100% safe – and of placing trust in human skill and systemic
          defenses over simplistic fixes. The rational critique is that this
          direction, whilst insightful, needs to be paired with concrete
          strategies and a proactive mindset to be practically useful.
          
          The treatise by itself won’t tell you how to design the next
          aircraft or run a data center more safely, but it will shape your
          thinking so you avoid common pitfalls (such as chasing singular root
          causes or blaming operators). To truly «preclude» failures or
          mitigate them, one must extend Cook’s ideas with detailed
          engineering and organizational practices. In other words, Cook
          teaches us why things fail in complex ways; it is up to us –
          engineers, managers, regulators, and front-line practitioners – to
          apply those lessons in how we build and operate the systems under our
          care.
          
           To be fair, at the time of writing (the late 1990s), Cook’s treatise
          was breaking ground by succinctly articulating these concepts for a
          broad audience. Its objective was likely to provoke thought and shift
          paradigms, rather than serve as a handbook.
          
          Today, we have the benefit of two more decades of research and
          practice in resilience engineering, which builds on Cook’s points.
          Practitioners now emphasise building resilient systems, not just
          trying to prevent failure outright. They use Cook’s insights as
          rationale for things such as chaos engineering, better incident
          response, and continuous learning cultures.
       
          ponco wrote 9 hours 4 min ago:
          Respectfully, I don't think that piece adds anything of material
          substance. It's a list of hollow platitudes (vapid writing listing
           unactionable truisms).
       
            anonymars wrote 3 hours 55 min ago:
            A better resource is likely Michael Nygard's book, "Release It!".
            It has practical advice about many issues in this outage.  For
            example, it appears the circuit breaker and bulkhead patterns were
            underused here.
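
             To make the circuit-breaker idea concrete, a minimal Python
             sketch (my own illustration, not code from the book or from
             AWS): after enough consecutive failures the breaker opens and
             callers fail fast instead of piling more load onto a struggling
             dependency.

               import time

               class CircuitBreaker:
                   def __init__(self, max_failures=5, reset_after=30.0):
                       self.max_failures = max_failures
                       self.reset_after = reset_after
                       self.failures = 0
                       self.opened_at = None    # None = breaker closed

                   def call(self, fn, *args, **kwargs):
                       if self.opened_at is not None:
                           waited = time.monotonic() - self.opened_at
                           if waited < self.reset_after:
                               raise RuntimeError("circuit open")
                           self.opened_at = None  # half-open: one trial
                       try:
                           result = fn(*args, **kwargs)
                       except Exception:
                           self.failures += 1
                           if self.failures >= self.max_failures:
                               self.opened_at = time.monotonic()
                           raise
                       self.failures = 0          # success closes it
                       return result

               # usage (hypothetical client and table names):
               #   breaker = CircuitBreaker()
               #   breaker.call(ddb.get_item, TableName="leases", Key=k)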
            
            Excerpt:
            
  HTML      [1]: https://www.infoq.com/articles/release-it-five-am/
       
          nickelpro wrote 13 hours 6 min ago:
          To quote Grandpa Simpson, "Everything everyone just said is either
          obvious or wrong".
          
          Pointing out that "complex systems" have "layers of defense" is
          neither insightful nor useful, it's obvious. Saying that any and all
          failures in a given complex system lack a root cause is wrong.
          
           Cook uses a lot of words to say not much at all. There's no
           concrete advice to be taken from How Complex Systems Fail, nothing
           to change.
          There's no casualty procedure or post-mortem investigation which
          would change a single letter of a single word in response to it. It's
          hot air.
       
            baq wrote 11 hours 57 min ago:
            There’s a difference between ‘grown organically’ and
            ‘designed to operate in this way’, though. Experienced folks
             will design system components with conscious awareness of what
             operations actually look like, from the start. Juniors won't,
             and will be bolting on quasi-solutions as their systems fall
             over, time and time again. Cook's generalization is actually wildly
            applicable, but it takes work to map it to specific situations.
       
          user3939382 wrote 14 hours 31 min ago:
          Nobody discussing the problem understands it.
       
          ramraj07 wrote 15 hours 56 min ago:
          As I was reading through that list, I kept feeling, "why do I feel
          this is not universally true?"
          
           Then I realized: the internet; the power grid (at least in most
           developed countries); there are things that don't actually fail
           catastrophically, even though they are extremely complex, and not
           always built by efficient organizations. What's the retort to this
           argument?
       
            grumbelbart2 wrote 7 hours 49 min ago:
             Also, aviation is a great example of how we can manage failures
             in complex systems, and of how we can track and fix more and
             rarer failures over time.
       
            figassis wrote 10 hours 25 min ago:
             The grid fails catastrophically. It happened this year in
             Portugal, Spain and nearby countries. Still, think of the grid
             as more like
            DNS. It is immense, but the concept is simple and well understood.
            You can quickly identify where the fault is (even if not the actual
            root cause), and can also quickly address it (even if bringing it
            back up in sync takes time and is not trivial). Current cloud infra
            is different in that each implementation is unique, services are
            unique, knowledge is not universal. There are no books about AWS's
            infra fundamentals or how to manage AWS's cloud.
       
            baq wrote 11 hours 45 min ago:
             > the internet [1]

             > power grid [2]
            
  HTML      [1]: https://www.kentik.com/blog/a-brief-history-of-the-interne...
  HTML      [2]: https://www.entsoe.eu/publications/blackout/28-april-2025-...
       
            jb1991 wrote 15 hours 15 min ago:
            The power grid is a huge risk in several major western nations.
       
            singron wrote 15 hours 18 min ago:
             They do fail catastrophically, e.g. the Northeast blackout of
             2003. [1]

             I think you could argue AWS is more complex than the electrical
             grid, but even if it's not, the grid has had several decades to
             iron out kinks and AWS hasn't. AWS also adds a ton of completely
             new services each year in addition to adding more capacity. E.g.
             I bet these DNS Enactors have become more numerous and their
             plans have become much larger than when they were first
             developed, which has greatly increased the odds of experiencing
             this issue.
            
  HTML      [1]: https://en.wikipedia.org/wiki/Northeast_blackout_of_2003
       
            habinero wrote 15 hours 41 min ago:
            The power grid absolutely can fail catastrophically and is a lot
            more fragile than people think.
            
            Texas nearly ran into this during their blackout a few years ago --
            their grid got within a few minutes of complete failure that would
            have required a black start which IIRC has never been done.
            
            Grady has a good explanation and the writeup is interesting reading
            too. [1]
            
  HTML      [1]: https://youtu.be/08mwXICY4JM?si=Lmg_9UoDjQszRnMw
  HTML      [2]: https://youtu.be/uOSnQM1Zu4w?si=-v6-Li7PhGHN64LB
       
          nonfamous wrote 16 hours 50 min ago:
          Great link, thanks for sharing. This point below stood out to me —
          put another way, “fixing” a system in response to an incident to
          make it safer might actually be making it less safe.
          
          >>> Views of ‘cause’ limit the effectiveness of defenses against
          future events.
          
          >>> Post-accident remedies for “human error” are usually
          predicated on obstructing activities that can “cause” accidents.
          These end-of-the-chain measures do little to reduce the likelihood of
          further accidents. In fact that likelihood of an identical accident
          is already extraordinarily low because the pattern of latent failures
          changes constantly. Instead of increasing safety, post-accident
          remedies usually increase the coupling and complexity of the system.
          This increases the potential number of latent failures and also makes
          the detection and blocking of accident trajectories more difficult.
       
            albert_e wrote 14 hours 32 min ago:
            But that sounds like an assertion without evidence and
            underestimates the competence of everyone involved in designing and
            maintaining these complex systems.
            
            For example, take airline safety -- are we to believe based on the
            quoted assertion that every airline accident and resulting remedy
            that mitigated the causes have made air travel LESS safe? That
            sounds objectively, demonstrably false.
            
             Truly complex systems like ecosystems and climate might qualify
             for this assertion, where humans have interfered, often with the
             best intentions, but caused unexpected effects that may be
             beyond human capacity to control.
       
              nonfamous wrote 5 hours 43 min ago:
               Airline safety is a special case I think — the NTSB does
              incredible work, and their recommendations are always designed to
              improve total safety, not just reduce the likelihood of a
              specific failure.
              
              But I can think of lots of examples where the response to an
              unfortunate, but very rare, incident can make us less safe
              overall. The response to rare vaccine side effects comes
              immediately to mind.
       
          GuinansEyebrows wrote 18 hours 7 min ago:
          thanks, i'm one of the lucky 10,000 today.
       
          ericyd wrote 18 hours 47 min ago:
           I'll admit I didn't read all of either document, but I'm not
          convinced of the argument that one cannot attribute a failure to a
          root cause simply because the system is complex and required multiple
          points of failure to fail catastrophically.
          
          One could make a similar argument in sports that no one person ever
          scores a point because they are only put into scoring position by a
          complex series of actions which preceded the actual point. I think
          that's technically true but practically useless. It's good to have a
          wide perspective of an issue but I see nothing wrong with identifying
          the crux of a failure like this one.
       
            Yokolos wrote 15 hours 32 min ago:
            The best example for this is aviation. Insanely complex from the
            machines to the processes to the situations to the people, all
            interconnected and constantly interacting. But we still do "root
            cause" analyses and based on those findings try to improve every
            point in the system that failed or contributed to the failure,
            because that's how we get a safer aviation industry. It's
            definitely worked.
       
            wbl wrote 18 hours 36 min ago:
             It's extremely useful in sports. We evaluate batters on OPS vs
             RBI, and no one ever evaluates them on runs they happened to
             score. We
            talk all the time about a QB and his linemen working together and
            the receivers. If all we talked about was the immediate cause we'd
            miss all that.
       
              ericyd wrote 1 hour 37 min ago:
              I'm not saying we ignore all other causes in sports analysis, I'm
              saying it doesn't make sense to pretend that there's no "one
              person" who hit the home run or scored a touchdown. Of course
              it's usually a team effort but we still attribute a score to one
              person.
       
          cb321 wrote 19 hours 35 min ago:
           That minimalist post mortem for the public describes what sounds
           like a Rube Goldberg machine, and the reality is probably even
           more hairy.  I completely agree that if one wants to understand
           "root causes", it's more important to understand why such machines
           are built/trusted/evolved in the first place.

           That piece by Cook is ok, but largely just a list of assertions
           (true or not, most do feel intuitive, though).  I suppose one
           should delve into all those references at the end for details?
           Anyway, this is an ancient topic, and I doubt we have all the
           answers on those root whys.  The MIT course on systems, 6.033,
           used to assign a paper that has been raised on HN only a few times
           in its history: [1] [2] It's from 1962, over 60 years ago, but it
           is also probably more illuminating/thought-provoking than the post
           mortem.  Personally, I suspect it's probably an instance of a
           wicked problem [3], but only past a certain scale.
          
  HTML    [1]: https://news.ycombinator.com/item?id=10082625
  HTML    [2]: https://news.ycombinator.com/item?id=16392223
  HTML    [3]: https://en.wikipedia.org/wiki/Wicked_problem
       
            tptacek wrote 19 hours 27 min ago:
            I have a housing activism meetup I have to get to, but real quick
            let me just say that these kinds of problems are not an abstraction
            to me in my day job, that I read this piece before I worked where I
            do and it bounced off me, but then I read it last year and was like
            "are you me but just smarter?", like my pupils probably dilated
            theatrically when I read it like I was a character in Requiem for a
            Dream, and I think most of the points he's making are much subtler
            and deeper than they seem at a casual read.
            
            You might have to bring personal trauma to this piece to get the
            full effect.
       
              cb321 wrote 18 hours 42 min ago:
              Oh, it's fine.    At your leisure.  I didn't mean to go against the
              assertions themselves, but more just kind of speak to their
              "unargued" quality and often sketchy presentation.  Even that
               Simon piece has a lot of this in there, where it's sort of "by
               definition of 'complexity'/by unelaborated observation".
              
              In engineered systems, there is just a disconnect between on our
              own/small scale KISS and what happens in large organizations, and
              then what happens over time.  This is the real root cause/why,
              but I'm not sure it's fixable.    Maybe partly addressable, tho'.
              
              One thing that might give you a moment of worry is both in that
              Simon and far, far more broadly all over academia both long
              before and ever since, biological systems like our bodies are an
              archetypal example of "complex".  Besides medical failures, life
              mostly has this one main trick -- make many copies and if they
              don't all fail before they, too, can copy then a stable-ish
              pattern emerges.
              
              Stable populations + "litter size/replication factor" largely
              imply average failure rates.  For most species it is horrific. 
              On the David Attenborough specials they'll play the sad music and
               tell you X% of these offspring never make it to mating age.
               The alternative is not the gray goo apocalypse [1], but the
               "whatever-that-species-is-biopocalypse".  Sorry - it's late
               and my joke circuits are maybe fritzing.  So, both big 'L' and
               little 'l' life, too, "is on the edge", just structurally.
               Self-organized criticality [2] (with sand piles and whatnot)
               used to be a kind of statistical physics hope for a theory of
               everything of these kinds of phenomena, but it just doesn't
               get deployed.  Things will seem "shallowly critical" but not
               so upon deeper inspection.  So, maybe it's just not a useful
               enough approximation.
              
              Anyway, good luck with your housing meetup!
              
  HTML        [1]: https://en.wikipedia.org/wiki/Gray_goo
  HTML        [2]: https://en.wikipedia.org/wiki/Self-organized_criticality
       
          markus_zhang wrote 19 hours 38 min ago:
           As a contractor who is on an oncall schedule, I have never worked
           in a company that treats oncall as a very serious business. I have
           only worked in 2 companies that need oncall, so I'm biased. On
           paper, they both say it is serious and all the SLA stuff was set
           up, but in reality there is not enough support.
          
          The problem is, oncall is a full-time business. It takes full
          attention of the oncall engineer, whether there is an issue or not.
          Both companies simply treat oncall as a by-product. We just had to do
          it so let’s stuff it into the sprint. The first company was
          slightly more serious as we were asked to put up a 2-3 point oncall
          task in JIRA. The second one doesn’t even do this.
          
          Neither company really encourages engineers to read through complex
          code written by others, even if we do oncall for those products.
          Again, the first company did better, and we were supposed to create a
          channel and pull people in, so it’s OKish to not know anything
          about the code. The second company simply leaves oncall to do
          whatever they can. Neither company allocates enough time for
          engineers to read the source code thoroughly. And neither has good
          documentation for oncall.
          
          I don’t know the culture of AWS. I’d very much want to work in an
          oncall environment that is serious and encourages learning.
       
            dekhn wrote 18 hours 17 min ago:
            When I was an SRE at Google our oncall was extremely serious (if
            the service went down, Google was unable to show ads, record ad
            impressions, or do any billing for ads).  It was done on a
            rotation, lasted 1 week (IIRC it was 9AM-9PM, we had another time
            zone for the alternate 12 hours).  The on-call was empowered to do
            pretty much anything required to keep the service up and running,
            including cancelling scheduled downtimes, pausing deployment
            updates, stop abusive jobs, stop abusive developers, and invoke an
            SVP if there was a fight with another important group).
            
            We sent a test page periodically to make sure the pager actually
            beeped.  We got paid extra for being in the rotation.  The
            leadership knew this was a critical step.  Unfortunately, much of
            our tooling was terrible, which would cause false pages, or failed
            critical operations, all too frequently.
            
            I later worked on SWE teams that didn't take dev oncall very
            seriously.  At my current job, we have an oncall, but it's best
            effort business hours only.
       
              citizenpaul wrote 13 hours 39 min ago:
              >empowered to do pretty much anything required to keep the
              service up and running,
              
               Is that really uncommon?  I've been on call for many
               companies and many types of institutions, and I can't recall
               ever once being told I couldn't do something to bring a system
               up.  It's kinda the job?

               On-call seriousness should be directly proportional to pay.
               Google pays.  If smallcorp wants to pay me COL, I'll be
               looking at that 2AM ticket at 9AM when I get to work.
       
              lanyard-textile wrote 14 hours 51 min ago:
              Handling my first non-prod alert bug as the oncall at Google was
              pretty eye opening :)
              
              It was a good lesson in what a manicured lower environment can do
              for you.
       
              markus_zhang wrote 18 hours 12 min ago:
              That’s pretty good. Our oncall is actually 24-hour for one
              week. On paper it looks very serious but even the best of us
              don’t really know everything so issues tend to lag to the
               morning. Neither do we get any compensation for it. Someone
               who had a bad night still needs to log on the next day. There
               is an informal understanding to relax a bit if the night was
               too bad, though.
       
                dmoy wrote 16 hours 17 min ago:
                I did 24hr-for-a-week oncall for 10+ years, do not recommend.
                
                12-12 rotation in SRE is a lot more reasonable for humans
       
                  sandeepkd wrote 13 hours 21 min ago:
                   Unfortunately 24hr-for-a-week seems to be the default
                   everywhere nowadays; it's just not practical for serious
                   businesses. It's just an indicator of how important uptime
                   is to a company.
       
                  markus_zhang wrote 15 hours 5 min ago:
                  I agree. It sucks. And our schedule is actually 2 weeks in
                  every five. One is secondary and the other is primary.
       
            malfist wrote 19 hours 5 min ago:
            Amazon generally treats on call as a full time job. Generally
            engineers who are on call are expected to only be on call. No
            feature work.
       
              tidbits wrote 18 hours 12 min ago:
              It's very team/org dependent and I would say that's generally not
              the case. In 6 years I have only had 1 team out of 3 where that
              was true. The other two teams I was expected to juggle feature
              work with oncall work. Same for most teams I interacted with.
       
              markus_zhang wrote 18 hours 56 min ago:
              That's actually pretty good.
       
          dosnem wrote 20 hours 15 min ago:
          How does knowing this help you avoid these problems? It doesn’t
          seem to provide any guidance on what to do in the face of complex
           systems.
       
            tptacek wrote 20 hours 5 min ago:
            He's literally writing about Three Mile Island. He doesn't have
            anything to tell you about what concurrency primitives to use for
            your distributed DNS management system.
            
            But: given finite resources, should you respond to this incident by
            auditing your DNS management systems (or all your systems) for race
            conditions? Or    should you instead figure out how to make the
            Droplet Manager survive (in some degraded state) a partition from
            DynamoDB without entering congestive collapse? Is the right
            response an identification of the "most faulty components" and a
            project plan to improve them? Or is it closing the human
            expertise/process gap that prevented them from throttling DWFM for
            4.5 hours?
            
            Cook isn't telling you how to solve problems; he's asking you to
            change how you think about problems, so you don't rathole in
            obvious local extrema instead of being guided by the bigger
            picture.
       
              doctorpangloss wrote 18 hours 10 min ago:
              Both documents are, "ceremonies for engineering personalities."
              
              Even you can't help it - "enumerating a list of questions" is a
              very engineering thing to do.
              
              Normal people don't talk or think like that. The way Cook is
              asking us to "think about problems" is kind of the opposite of
              what good leadership looks like. Thinking about thinking about
              problems is like, 200% wrong. On the contrary, be way more
              emotional and way simpler.
       
              cyberax wrote 18 hours 14 min ago:
              Another point is that DWFM is likely working in a privileged,
              isolated network because it needs access deep into the core
              control plane. After all, you don't want a rogue service to be
              able to add a malicious agent to a customer's VPC.
              
              And since this network is privileged, observability tools,
              debugging support, and even maybe access to it are more
              complicated. Even just the set of engineers who have access is
              likely more limited, especially at 2AM.
              
              Should AWS relax these controls to make recovery easier? But then
              it will also result in a less secure system. It's again a
              trade-off.
       
              dekhn wrote 18 hours 14 min ago:
               It's entirely unclear to me whether a system the size and
               scope of AWS could be re-thought using these principles, and
               whether AWS could successfully execute a complete
               restructuring of all their processes to reduce their failure
               rate a bit.  It's a system that grew over time with
              many thousands of different developers, with a need to solve
              critical scaling issues that would have stopped the business in
              its tracks (far worse than this outage).
       
              dosnem wrote 19 hours 8 min ago:
              I don’t really follow what you are suggesting. If the system is
              complex and constantly evolving as the article states, you
               aren’t going to be able to close any expertise/process gap.
               Operating in a degraded state is probably already built in;
               this was just a state of degradation they were not prepared
               for. You can’t figure out all degraded states to operate in,
               because by definition the system is complex.
       
          yabones wrote 21 hours 9 min ago:
          Another great lens to see this is "Normal Accidents" theory, where
          the argument is made that the most dangerous systems are ones where
          components are very tightly coupled, interactions are complex and
          uncontrollable, and consequences of failure are serious.
          
  HTML    [1]: https://en.wikipedia.org/wiki/Normal_Accidents
       
        stefan_bobev wrote 22 hours 32 min ago:
        I appreciate the details this went through, especially laying out the
        exact timelines of operations and how overlaying those timelines
        produces unexpected effects. One of my all time favourite bits about
        distributed systems comes from the (legendary) talk at GDC - I Shot You
        First[1] - where the speaker describes drawing sequence diagrams with
        tilted arrows to represent the flow of time and asking "Where is the
        lag?". This method has saved me many times, all throughout my career
        from making games, to livestream and VoD services to now fintech.
        Always account for the flow of time when doing a distributed operation
        - time's arrow always marches forward, your systems might not.
        
        But the stale read didn't scare me nearly as much as this quote:
        
        > Since this situation had no established operational recovery
        procedure, engineers took care in attempting to resolve the issue with
        DWFM without causing further issues
        
        Everyone can make a distributed system mistake (these things are hard).
        But I did not expect something as core as the service managing the
         leases on the physical EC2 nodes to not have a recovery procedure.
         Maybe
        I am reading too much into it, maybe what they meant was that they
        didn't have a recovery procedure for "this exact" set of circumstances,
        but it is a little worrying even if that were the case. EC2 is one of
        the original services in AWS. At this point I expect it to be so battle
        hardened that very few edge cases would not have been identified. It
        seems that the EC2 failure was more impactful in a way, as it cascaded
        to more and more services (like the NLB and Lambda) and took more time
        to fully recover. I'd be interested to know what gets put in place
        there to make it even more resilient.
        
  HTML  [1]: https://youtu.be/h47zZrqjgLc?t=1587
       
          throwdbaaway wrote 11 hours 37 min ago:
           > But I did not expect something as core as the service managing
           the leases on the physical EC2 nodes to not have a recovery
           procedure.
          
          I guess they don't have a recovery procedure for the "congestive
          collapse" edge case. I have seen something similar, so I wouldn't be
          frowning at this.
          
          A couple of red flags though:
          
           1. Apparent lack of load-shedding support by this DWFM, such that
           a server reboot had to be performed. Need to learn from [1] (a
           minimal sketch of the idea follows below).

           2. Having DynamoDB as a dependency of this DWFM service, instead
           of something more primitive like Chubby. Need to learn more about
           distributed systems primitives from [2]
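
           Re: point 1, a minimal load-shedding sketch in Python (my own
           illustration with hypothetical names, not AWS's DWFM code): cap
           the number of in-flight lease re-establishments and reject the
           excess so callers back off, instead of queueing unbounded work
           and collapsing.

             import threading

             class LoadShedder:
                 def __init__(self, max_in_flight=100):
                     self.max_in_flight = max_in_flight
                     self.in_flight = 0
                     self.lock = threading.Lock()

                 def try_acquire(self):
                     with self.lock:
                         if self.in_flight >= self.max_in_flight:
                             return False   # shed: caller retries with backoff
                         self.in_flight += 1
                         return True

                 def release(self):
                     with self.lock:
                         self.in_flight -= 1

             shedder = LoadShedder(max_in_flight=100)

             def handle_lease_renewal(droplet_id):
                 if not shedder.try_acquire():
                     return "rejected"      # fail fast under overload
                 try:
                     # ... do the actual, bounded amount of work here ...
                     return "renewed"
                 finally:
                     shedder.release()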
          
  HTML    [1]: https://aws.amazon.com/builders-library/using-load-shedding-...
  HTML    [2]: https://www.youtube.com/watch?v=QVvFVwyElLY
       
          gtowey wrote 12 hours 43 min ago:
           It's shocking to me too, but not very surprising. It's probably a
           combination of factors that caused this failure of planning, and
           I've seen it play out the same way at lots of companies.

           I bet the original engineers planned for, and designed the system
           to be resilient to, this cold start situation.  But over time
           those engineers left, and new people took over -- people who
           didn't fully understand and appreciate the complexity, and
           probably didn't care that much about all the edge cases. Then,
           pushed by management to pursue goals that are antithetical to
           reliability, such as cost optimization and other things, the new
           failure case was introduced by lots of suboptimal changes. The
           result is as we see it -- a catastrophic failure which caught
           everyone by surprise.
          
          It's the kind of thing that happens over and over again when the
          accountants are in charge.
       
          tptacek wrote 22 hours 29 min ago:
          It shouldn't scare you. It should spark recognition. This
          meta-failure-mode exists in every complex technological system. You
          should be, like, "ah, of course, that makes sense now". Latent
          failures are fractally prevalent and have combinatoric potential to
          cause catastrophic failures. Yes, this is a runbook they need to
          have, but we should all understand there are an unbounded number of
          other runbooks they'll need and won't have, too!
       
            lazystar wrote 22 hours 12 min ago:
             The thing that scares me is that AI will never be able to
             diagnose an issue that it has never seen before.  If there are
             no runbooks, there is no pattern recognition.  This is something
             I've been shouting about for 2 years now; hopefully this issue
             makes AWS leadership understand that current-gen AI can never
             replace human engineering.
       
              janalsncm wrote 11 hours 29 min ago:
               AI is a lot more than just LLMs. Running through the rat's
               nest of interdependent systems like AWS has is exactly what
               symbolic AI was good at.
       
              Aeolun wrote 13 hours 56 min ago:
              I think millions of systems have failed due to missing DNS
              records though.
       
              tptacek wrote 22 hours 9 min ago:
              I'm much less confident in that assertion. I'm not bullish on AI
              systems independently taking over operations from humans, but
              catastrophic outages are combinations of less-catastrophic
              outages which are themselves combinations of latent failures, and
              when the latent failures are easy to characterize (as is the case
              here!), LLMs actually do really interesting stuff working out the
              combinatorics.
              
              I wouldn't want to, like, make a company out of it (I assume the
              foundational model companies will eat all these businesses) but
              you could probably do some really interesting stuff with an agent
              that consumes telemetry and failure model information and uses it
              to surface hypos about what to look at or what interventions to
              consider.
              
               All of this is beside my original point, though: I'm saying, you
              can't runbook your way to having a system as complex as AWS run
              safely. Safety in a system like that is a much more complicated
              process, unavoidably. Like: I don't think an LLM can solve the
              "fractal runbook requirement" problem!
       
        shrubble wrote 22 hours 48 min ago:
         BIND required each zone to have an increasing serial number.

         So if you made a change you had to increase the number, usually a
         timestamp like 20250906114509, which would be older/lower-numbered
         than 20250906114702, making it easier to determine which zone file
         had the newest data.

         Seems like they sort of had the same setup, but with less rigidity
         in terms of refusing to load older files.
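
         A minimal sketch of that guard in Python (my own illustration, with
         hypothetical names; in a real distributed system the check-and-apply
         would itself need to be atomic): refuse to apply any plan whose
         serial is not strictly newer than the one already in effect.

           applied_serial = 0   # serial of the plan currently in effect

           def apply_plan(plan_serial, records):
               global applied_serial
               if plan_serial <= applied_serial:
                   # stale plan: a newer one has already been applied
                   raise ValueError("refusing stale plan %d <= %d"
                                    % (plan_serial, applied_serial))
               # ... push `records` to DNS here ...
               applied_serial = plan_serial

           apply_plan(20250906114702, {"endpoint": ["10.0.0.3"]})
           apply_plan(20250906114509, {"endpoint": []})  # raises: stale serial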
       
        ecnahc515 wrote 23 hours 23 min ago:
        Seems like the enactor should be checking the version/generation of the
        current record before it applies the new value, to ensure it never
         applies an old plan on top of a record updated by a newer plan. It
        wouldn't be as efficient, but that's just how it is. It's a basic
        compare and swap operation, so it could be handled easily within
        dynamodb itself where these records are stored.
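
         Something along these lines, using a DynamoDB conditional write as
         the compare-and-swap (a minimal boto3 sketch; the table and
         attribute names are hypothetical, not what AWS actually uses):

           import boto3
           from botocore.exceptions import ClientError

           ddb = boto3.client("dynamodb", region_name="us-east-1")

           def apply_plan(endpoint, plan_version, ips):
               cond = ("attribute_not_exists(plan_version)"
                       " OR plan_version < :v")
               try:
                   ddb.put_item(
                       TableName="dns-plans",        # hypothetical table
                       Item={
                           "endpoint": {"S": endpoint},
                           "plan_version": {"N": str(plan_version)},
                           "ips": {"SS": ips},
                       },
                       # reject the write if an equal-or-newer plan is there
                       ConditionExpression=cond,
                       ExpressionAttributeValues={
                           ":v": {"N": str(plan_version)},
                       },
                   )
                   return True
               except ClientError as e:
                   err = e.response["Error"]["Code"]
                   if err == "ConditionalCheckFailedException":
                       return False   # stale plan: leave the record alone
                   raise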
       
        pelagicAustral wrote 1 day ago:
        Had no idea Dynamo was so intertwined with the whole AWS stack.
       
          freedomben wrote 23 hours 43 min ago:
          Yeah, for better or worse, AWS is a huge dogfooder. It's nice to know
          they trust their stuff enough to depend on it themselves, but it's
          also scary to know that the blast radius of a failure in any
          particular service can be enormous
       
        WaitWaitWha wrote 1 day ago:
        I gather, the root cause was a latent race condition in the DynamoDB
        DNS management system that allowed an outdated DNS plan to overwrite
        the current one, resulting in an empty DNS record for the regional
        endpoint.
        
        Correct?
       
          tptacek wrote 22 hours 48 min ago:
          I think you have to be careful with ideas like "the root cause". They
          underwent a metastable congestive collapse. A large component of the
          outage was them not having a runbook to safely recover an adequately
          performing state for their droplet manager service.
          
          The precipitating event was a race condition with the DynamoDB
          planner/enactor system.
          
  HTML    [1]: https://how.complexsystems.fail/
       
            1970-01-01 wrote 20 hours 52 min ago:
            Why can't a race condition bug be seen as the single root cause?
            Yes, there were other factors that accelerated collapse, but those
            are inherent to DNS, which is outside the scope of a summary.
       
              tptacek wrote 20 hours 43 min ago:
              Because the DNS race condition is just one flaw in the system.
              The more important latent flaw† is probably the metastable
              failure mode for the droplet manager, which, when it loses
              connectivity to Dynamo, gradually itself loses connectivity with
              the Droplets, until a critical mass is hit where the Droplet
              manager has to be throttled and manually recovered.
              
              Importantly: the DNS problem was resolved (to degraded state) in
              1hr15, and fully resolved in 2hr30. The Droplet Manager problem
              took much longer!
              
              This is the point of complex failure analysis, and why that
              school of thought says "root causing" is counterproductive. There
              will always be other precipitating events!
              
              † which itself could very well be a second-order effect of some
              even deeper and more latent issue that would be more useful to
              address!
       
                cyberax wrote 18 hours 26 min ago:
                 The droplet manager failure is a much more forgivable
                 scenario.
                It happened because the "must always be up" service went down
                for an extended period of time, and the sheer amount of actions
                needed for the recovery overwhelmed the system.
                
                The initial DynamoDB DNS outage was much worse. A bog-standard
                TOCTTOU for scheduled tasks that are assumed to be "instant".
                And the lack of controls that allowed one task to just blow up
                everything in one of the foundational services.
                
                When I was at AWS some years ago, there were calls to limit the
                blast radius by using cell architecture to create vertical
                slices of the infrastructure for critical services. I guess
                that got completely sidelined.
       
                dgemm wrote 20 hours 40 min ago:
                
                
  HTML          [1]: https://en.wikipedia.org/wiki/Swiss_cheese_model
       
                1970-01-01 wrote 20 hours 42 min ago:
                Two different questions here.
                
                1. How did it break?
                
                2. Why did it collapse?
                
                A1: Race condition
                
                A2: What you said.
       
                  tptacek wrote 20 hours 25 min ago:
                  What is the purpose of identifying "root causes" in this
                  model? Is the root cause of a memory corruption vulnerability
                  holding a stale pointer to a freed value, or is it the lack
                  of memory safety? Where does AWS gain more advantage: in
                  identifying and mitigating metastable failure modes in EC2,
                  or in trying to identify every possible way DNS might take
                  down DynamoDB? (The latter is actually not an easy question,
                  but that's the point!)
       
                    1970-01-01 wrote 20 hours 16 min ago:
                    Two things can be important for an audience. For most, it's
                    the race condition lesson. Locks are there for a reason.
                    For AWS, it's the stability lesson. DNS can and did take
                    down the empire for several hours.
       
                      tptacek wrote 20 hours 10 min ago:
                      Did DNS take it down, or did a pattern of latent failures
                      take it down? DNS was restored fairly quickly!
                      
                      Nobody is saying that locks aren't interesting or
                      important.
       
                        nickelpro wrote 12 hours 26 min ago:
                        The Droplet lease timeouts were an aggravating factor
                        for the severity of the incident, but are not
                        causative. Absent a trigger the droplet leases never
                        experience congestive failure.
                        
                        The race condition was necessary and sufficient for
                        collapse. Absent corrective action it always leads to
                        AWS going down. In the presence of corrective actions
                        the severity of the failure would have been minor
                        without other aggravating factors, but the race
                        condition is always the cause of this failure.
       
                        dosnem wrote 18 hours 58 min ago:
                        This doesn’t really matter. This type of error gets
                        the whole 5 why’s treatment and every why needs to
                        get fixed. Both problems will certainly have an action
                        item
       
                          tptacek wrote 2 hours 54 min ago:
                          It is not my claim that AWS is going to handle this
                          badly, only that this thread is.
       
        qrush wrote 1 day ago:
        Sounds like DynamoDB is going to continue to be a hard dependency for
        EC2, etc. I at least appreciate the transparency and hearing about
        their internal systems names.
       
          UltraSane wrote 15 hours 46 min ago:
          They should at least split off dedicated isolated instances of
          DynamoDB to reduce blast radius. I would want at least 2 instances
          for every internal AWS service that uses it.
       
          skywhopper wrote 21 hours 6 min ago:
          I mean, something has to be the baseline data storage layer. I’m
          more comfortable with it being DynamoDB than something else that
          isn’t pushed as hard by as many different customers.
       
            UltraSane wrote 15 hours 44 min ago:
            The actual storage layer of DynamoDB is well engineered and has
            some formal proofs.
       
          offmycloud wrote 23 hours 58 min ago:
          I think it's time for AWS to pull the curtain back a bit and release
          a JSON document that shows a list of all internal service
          dependencies for each AWS service.
       
            mparnisari wrote 14 hours 36 min ago:
            I worked for AWS for two years and if I recall correctly, one of
            the issues was circular dependencies.
       
            cyberax wrote 18 hours 22 min ago:
            A lot of internal AWS services have names that are completely
            opaque to outside users. Such a document will be pretty useless as
            a result.
       
            throitallaway wrote 23 hours 14 min ago:
            Would it matter? Would you base decisions on whether or not to use
            one of their products based on the dependency graph?
       
              UltraSane wrote 15 hours 44 min ago:
              It would let you know that if services A and B both depend on
              service C, you can't use A and B together to gain reliability.
       
              withinboredom wrote 19 hours 59 min ago:
              Yes.
       
                bdangubic wrote 18 hours 14 min ago:
                if so, I hate to tell you this but you would not use AWS (or
                any other cloud provider)!
       
                  withinboredom wrote 11 hours 57 min ago:
                  I don’t use AWS or any other cloud provider. I’ve used bare
                  metal since 2012. See, in 2012 (IIRC), one fateful day, we
                  turned off our bare metal machines and went full AWS. That
                  afternoon, AWS had its first major outage. Prior to that day,
                  the owner could walk in and ask what we were doing about it.
                  That day, all we could do was twiddle our thumbs or turn on a
                  now outdated database replica. Surely AWS won’t be out for
                  hours, right? Right? With bare metal, you might be out for
                  hours, but you can quickly get back to a degraded state, no
                  matter what happens. With AWS, you’re stuck with whatever
                  they happen to fix first.
       
                    cthalupa wrote 1 hour 48 min ago:
                    Meanwhile I've had bare metal be a complete outage for over
                    a day because a backhoe decided it wanted to eat the fiber
                    line into our building. All I could do was twiddle my
                    thumbs because we were stuck waiting on another company to
                    fix that.
                    
                    Could we have had an offsite location to fail over to? From
                    a technical perspective, sure. Same as you could go
                    multi-region or multi-cloud or turn on some servers at
                    hetzner or whatever. There's nothing better or worse about
                    the cloud here - you always have the ability to design with
                    resilience for whatever happens short of the internet as a
                    whole breaking somehow.
       
        LaserToy wrote 1 day ago:
        TLDR: 
        A DNS automation bug removed all the IP addresses for the regional
        endpoints. The tooling that was supposed to help with recovery depends
        on the system it needed to recover. That’s a classic “we deleted
        prod” failure mode at AWS scale.
       
        everfrustrated wrote 1 day ago:
        >Services like DynamoDB maintain hundreds of thousands of DNS records
        to operate a very large heterogeneous fleet of load balancers in each
        Region
        
        Does that mean a DNS query for dynamodb.us-east-1.amazonaws.com can
        resolve to one of a hundred thousand IP addresses?
        
        That's insane!
        
        And also well beyond the limits of route53.
        
        I'm wondering if they're constantly updating route53 with a smaller
        subset of records and using a low ttl to somewhat work around this.
       
          rescbr wrote 15 hours 48 min ago:
          > And also well beyond the limits of route53.
          
          One thing is the internal limit, another thing is the customer-facing
          limit.
          
          Some hard limits are softer than they appear.
       
          donavanm wrote 18 hours 57 min ago:
          Some details, but yeah that's basically how all AWS DNS works. I
          think you're missing how labels, zones, and domains are related but
          distinct. And that R53 operates in resource record SETS. And there
          are affordances in the set relationships to build trees and logic for
          selecting an appropriate set (eg healthcheck, latency).
          
          > And also well beyond the limits of route53
          
          Ipso facto, R53 can do this just fine. Where do you think all of your
          public EC2, ELB, RDS, API Gateway, etc etc records are managed and
          served?
       
          thayne wrote 19 hours 10 min ago:
          I haven't tested with DynamoDB, but I once ran a loop of doing DNS
          lookups for S3, and in a couple of seconds I got hundreds of distinct
          IP addresses. And that was just for a single region, from a single
          source IP.
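          
          Something like this quick loop reproduces the observation (the
          endpoint name is the public regional S3 endpoint; results will vary
          with your resolver's caching):
          
            import socket
            import time
            
            seen = set()
            deadline = time.time() + 5        # sample for a few seconds
            while time.time() < deadline:
                infos = socket.getaddrinfo(
                    "s3.us-east-1.amazonaws.com", 443,
                    proto=socket.IPPROTO_TCP)
                for info in infos:
                    seen.add(info[4][0])      # sockaddr -> IP string
            print(len(seen), "distinct addresses seen")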
       
          supriyo-biswas wrote 1 day ago:
          DNS-based CDNs are also effectively this: collect system usage,
          packet loss, and latency metrics in a datastore, then compute a
          table of viewer networks and preferred PoPs.
          
          Unfortunately hard documentation is difficult to provide, but that’s
          how a CDN worked at a place I used to work for. There’s also
          another CDN[1] which talks about the same thing in fancier terms.
          
  HTML    [1]: https://bunny.net/network/smartedge/
       
            donavanm wrote 18 hours 55 min ago:
            Akamai talked about it in the early 2000s. Facebook content folks
            had a decent paper describing the latency collection and realtime
            routing around 2011ish, something like “pinpoint” I want to
            say. Though as you say, it was industry practice before then.
       
        ericpauley wrote 1 day ago:
        Interesting use of the phrase “Route53 transaction” for an
        operation that has no hard transactional guarantees. Especially given
        that the lack of transactional updates is what caused the outage…
       
          donavanm wrote 1 day ago:
          I think you misunderstand the failure case. The
          ChangeResourceRecordSet is transactional (or was when I worked on the
          service) [1] .
          
          The fault was two different clients with divergent goal states:
          
          - one ("old") DNS Enactor experienced unusually high delays needing
          to retry its update on several of the DNS endpoints
          
          - the DNS Planner continued to run and produced many newer
          generations of plans
             [Ed: this is key: it's producing "plans" of desired state; a plan
          does not include a complete transaction like a log or chain with
          previous state + mutations]
          
          - one of the other ("new") DNS Enactors then began applying one of
          the newer plans
          
          -  then ("new") invoked the plan clean-up process, which identifies
          plans that are significantly older than the one it just applied and
          deletes them   [Ed: the key race is implied here. The "old" Enactor
          is reading _current state_, which was the output of "new", and
          applying its desired "old" state on top. The discrepancy is because
          apparently Planner and Enactor aren't working with a chain/vector
          clock/serialized change set numbers/etc]
          
          - At the same time the first ("old") Enactor ... applied its much
          older plan to the regional DDB endpoint, overwriting the newer plan. 
           [Ed: and here is where "old" Enactor creates the valid ChangeRRSets
          call, replacing "new" with "old"]
          
          -  The check that was made at the start of the plan application
          process, which ensures that the plan is newer than the previously
          applied plan, was stale by this time  [Ed: Whoops!]
          
          - The second Enactor’s clean-up process then deleted this older
          plan because it was many generations older than the plan it had just
          applied.
          
          Ironically Route 53 does have strong transactions of API changes
          _and_ serializes them _and_ has closed loop observers to validate
          change sets globally on every dataplane host. So do other AWS
          services. And there are even some internal primitives for building
          replication or change set chains like this. But it's also a PITA and
          takes a bunch of work and when it _does_ fail you end up with global
          deadlock and customers who are really grumpy that they don't see their
          DNS changes going into effect.
          
  HTML    [1]: https://docs.aws.amazon.com/Route53/latest/APIReference/API_...
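          
          To make the interleaving concrete, here is a toy model of that
          sequence (made-up generation numbers and IPs, not AWS code): the
          freshness check happens at the start, the stale write lands much
          later, and the clean-up then deletes the plan that is actually live.
          
            plans = {1: ["10.0.0.1"], 5: ["10.0.0.5"]}   # gen -> IPs
            endpoint = {"records": ["10.0.0.0"], "applied_gen": 0}
            
            def check_is_newer(gen):
                return gen > endpoint["applied_gen"]     # check...
            
            def apply_plan(gen):                         # ...act, later
                endpoint["records"] = plans[gen]
                endpoint["applied_gen"] = gen
            
            def clean_up(just_applied):
                for g in [g for g in plans if g < just_applied]:
                    del plans[g]                         # drop old gens
                    if endpoint["applied_gen"] == g:
                        endpoint["records"] = []         # live plan gone
            
            # 1. the delayed enactor checks gen 1 while nothing newer has
            #    been applied yet -- the check passes, then goes stale
            stale_ok = check_is_newer(1)
            # 2. the fast enactor applies the newest generation
            if check_is_newer(5):
                apply_plan(5)
            # 3. the delayed write finally lands on its stale check,
            #    overwriting the newer plan
            if stale_ok:
                apply_plan(1)
            # 4. the fast enactor's clean-up deletes gen 1, which is now
            #    the live plan -> empty record set
            clean_up(5)
            print(endpoint)   # {'records': [], 'applied_gen': 1}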
       
            RijilV wrote 23 hours 41 min ago:
            Not for nothing, there’s a support group for those of us who’ve
            been hurt by WHU sev2s…
       
              donavanm wrote 18 hours 49 min ago:
              Man I always hated that phrasing; always tried to get people to
              use more precise terms like “customer change propagation.”
              But yeah, who hasn't been punished by a queryplan change or some
              random connectivity problem in Southeast Asia!
       
        gslin wrote 1 day ago:
        I believe a report with timestamps not in UTC is a crime.
       
          tguedes wrote 22 hours 51 min ago:
          I think it makes sense in this instance. Because this occurred in
          us-east-1, the vast majority of affected customers are US based. For
          most people, it's easier to do the timezone conversion from PT than
          UTC.
       
            thayne wrote 19 hours 4 min ago:
            But us-east-1 is in Eastern Time, so if you aren't going to use
            UTC, why not use that?
            
            I'm guessing PT was chosen because the people writing this report
            are in PT (where Amazon headquarters is).
       
            trenchpilgrim wrote 20 hours 37 min ago:
            us-east-1 is an exceptional Amazon region; it hosts many global
            services as well as services which are not yet available in other
            regions. Most AWS customers worldwide probably have an indirect
            dependency on us-east-1.
       
          cheeze wrote 1 day ago:
          My guess is that PT was chosen to highlight the fact that this
          happened in the middle of the night for most of the responding ops
          folks.
          
          (I don't know anything here, just spitballing why that choice would
          be made)
       
            throitallaway wrote 23 hours 12 min ago:
            Their headquarters is in Seattle (Pacific Time.) But yeah, I hate
            time zones.
       
          exogenousdata wrote 1 day ago:
          An epoch fail?
       
        jasode wrote 1 day ago:
        So the DNS records' if-stale-then-update handling was basically a
        variation of one of the "2 Hard Things In Computer Science": cache
        invalidation.    Excerpt from the giant paragraph:
        
        >[...] Right before this event started, one DNS Enactor experienced
        unusually high delays needing to retry its update on several of the DNS
        endpoints. As it was slowly working through the endpoints, several
        other things were also happening. First, the DNS Planner continued to
        run and produced many newer generations of plans. Second, one of the
        other DNS Enactors then began applying one of the newer plans and
        rapidly progressed through all of the endpoints. The timing of these
        events triggered the latent race condition. When the second Enactor
        (applying the newest plan) completed its endpoint updates, it then
        invoked the plan clean-up process, which identifies plans that are
        significantly older than the one it just applied and deletes them. At
        the same time that this clean-up process was invoked, the first Enactor
        (which had been unusually delayed) applied its much older plan to the
        regional DDB endpoint, overwriting the newer plan. The check that was
        made at the start of the plan application process, which ensures that
        the plan is newer than the previously applied plan, was stale by this
        time due to the unusually high delays in Enactor processing. [...] 
        
        It outlines some of the mechanics but some might think it still isn't a
        "Root Cause Analysis" because there's no satisfying explanation of
        _why_ there were "unusually high delays in Enactor processing". 
        Hardware problem?!?  A misconfiguration from human error causing
        unintended delays in Enactor behavior?!?  Either the previous
        sequence of events leading up to that is considered unimportant, or
        Amazon is still investigating what made the Enactor behave in an
        unpredictable way.
       
          ignoramous wrote 23 hours 11 min ago:
          > ...there's no satisfying explanation of _why_ there were "unusually
          high delays in Enactor processing". Hardware problem?
          
          Can't speak for the current incident but a similar "slow machine"
          issue once bit our BigCloud service (not as big an incident,
          thankfully) due to loooong JVM GC pauses on failing hardware.
       
          Cicero22 wrote 1 day ago:
          my takeaway was that the race condition was the root cause. Take
          away that bug, and suddenly there's no incident, regardless of any
          processing delays.
       
            _alternator_ wrote 1 day ago:
            Right. Sounds like it’s a case of “rolling your own distributed
            system algorithm” without the up-front investment in implementing
            a truly robust distributed system.
            
            Often network engineers are unaware of some of the tricky problems
            that DS research has addressed/solved in the last 50 years because
            the algorithms are arcane and heuristics often work pretty well,
            until they don’t. But my guess is that AWS will invest in some
            serious redesign of the system, hopefully with some rigorous
            algorithms underpinning the updates.
            
            Consider this a nudge for all you engineers that are designing
            fault tolerant distributed systems at scale to investigate the
            problem spaces and know which algorithms solve what problems.
       
              withinboredom wrote 20 hours 8 min ago:
              Further, please don’t stop at RAFT. RAFT is popular because it
              is easy to understand, not because it is the best way to do
              distributed consensus. It is non-deterministic (thus requiring
              odd numbers of electors), requires timeouts for liveness (thus
              latency can kill you), and isn’t all that good for
              general-purpose consensus, IMHO.
       
              foobarian wrote 23 hours 3 min ago:
              > some serious redesign of the system, hopefully with some
              rigorous algorithms underpinning the updates
              
              Reading these words makes me break out in cold sweat :-) I really
              hope they don't
       
              dboreham wrote 23 hours 20 min ago:
              Certainly seems like misuse of DNS. It wasn't designed to be a
              rapidly updatable consistent distributed database.
       
                tremon wrote 4 hours 15 min ago:
                That's true, if you use the CAP definition for consistency.
                Otherwise, I'd say that the DNS design satisfies each of those
                terms:
                
                - "Rapidly updatable" depends on the specific implementation,
                but the design allows for 2 billion changesets in flight before
                mirrors fall irreparably out of sync with the master database
                (see the sketch below), and the DNS specs include all
                components necessary for rapid updates: push-based
                notifications and incremental transfers.
                
                - DNS is designed to be eventually consistent, and each replica
                is expected to always offer internally consistent data. It's
                certainly possible for two mirrors to respond with different
                responses to the same query, but eventual consistency does not
                preclude that.
                
                - Distributed: the DNS system certainly is a distributed
                database; in fact it was specifically designed to allow for
                replication across organization boundaries -- something that
                very few other distributed systems offer. What DNS does not
                offer is multi-master operation, but neither do e.g. Postgres
                or MSSQL.
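                
                (Concretely, that 2-billion window is the 2^31 limit of RFC
                1982 serial arithmetic on the 32-bit SOA serial: a replica
                only treats the primary as newer while it is ahead by fewer
                than 2^31 increments. A rough sketch of the comparison:)
                
                  # RFC 1982 comparison for 32-bit SOA serials: s2 is
                  # "newer" than s1 iff it is 1..2**31-1 ahead, mod 2**32.
                  HALF = 2 ** 31
                  
                  def is_newer(s1, s2):
                      return 0 < ((s2 - s1) % 2 ** 32) < HALF
                  
                  print(is_newer(10, 11))             # True: one ahead
                  print(is_newer(10, 10 + HALF - 1))  # True: just under 2**31
                  print(is_newer(10, 10 + HALF))      # False: out of sync
                  print(is_newer(2 ** 32 - 1, 0))     # True: wraps fine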
       
                pyrolistical wrote 17 hours 11 min ago:
                I think historically DNS was “best effort” but with
                consensus algorithms like raft, I can imagine a DNS that is
                perfectly consistent
       
          dustbunny wrote 1 day ago:
          Why is the "DNS Planner" and "DNS Enactor" separate? If it was one
          thing, wouldn't this race condition have been much more clear to the
          people working on it? Is this caused by the explosion of complexity
          due to the over use of the microservice architecture?
       
            jiggawatts wrote 16 hours 14 min ago:
            This was my thought also. The first sentences of the RCA screamed
            “race condition” without even having to mention the phrase.
            
            The two DNS components comprise a monolith: neither is useful
            without the other and there is one arrow on the design coupling
            them together.
            
            If they were a single component then none of this would have
            happened.
            
            Also, version checks? Really?
            
            Why not compare the current state against the desired state and
            take the necessary actions to bring them in line?
            
            Last but not least, deleting old config files so aggressively is a
            “penny wise, pound foolish” design. I would keep these forever,
            or for at least a month! Certainly much, much longer than any
            possible time taken through the sequence of provisioning steps.
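            
            A tiny sketch of the compare-desired-vs-current idea from above
            (made-up record shapes, not AWS's actual data model); note the
            guard that refuses to empty an endpoint outright:
            
              def reconcile(current, desired):
                  # current/desired: dns name -> set of IP strings
                  if any(not ips for ips in desired.values()):
                      raise RuntimeError(
                          "refusing a plan with empty record sets")
                  actions = []
                  for name in current.keys() | desired.keys():
                      want = desired.get(name, set())
                      have = current.get(name, set())
                      for ip in sorted(want - have):
                          actions.append(("UPSERT", name, ip))
                      for ip in sorted(have - want):
                          actions.append(("DELETE", name, ip))
                  return actions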
       
              UltraSane wrote 15 hours 52 min ago:
              Yes it should be impossible for all DNS entries to get deleted
              like that.
       
            neom wrote 1 day ago:
            Pick your battles, I'd guess. Given how huge AWS is, if you have
            desired state vs. a reconciler, you probably have more resilient
            operations generally and an easier job of finding and isolating
            problems. The flip side of that is if you screw up your error
            handling, you get this. That aside, it seems strange to me they
            didn't account for the fact that a stale plan could get picked up
            over a new one, so maybe I misunderstand the incident/architecture.
       
            bananapub wrote 1 day ago:
            > Why is the "DNS Planner" and "DNS Enactor" separate?
            
            for a large system, it's in practice very nice to split up things
            like that - you have one bit of software that just reads a bunch of
            data and then emits a plan, and then another thing that just gets
            given a plan and executes it.
            
            this is easier to test (you're just dealing with producing one data
            structure and consuming one data structure, the planner doesn't
            even try to mutate anything), it's easier to restrict permissions
            (one side only needs read access to the world!), it's easier to do
            upgrades (neither side depends on the other existing or even being
            in the same language), it's safer to operate (the planner is
            disposable, it can crash or be killed at any time with no problem
            except update latency), it's easier to comprehend (humans can
            examine the planner output which contains the entire state of the
            plan), it's easier to recover from weird states (you can in
            extremis hack the plan) etc etc.  these are all things you
            appreciate more and more as your system gets bigger and more
            complicated.
            
            > If it was one thing, wouldn't this race condition have been much
            more clear to the people working on it?
            
            no
            
            > Is this caused by the explosion of complexity due to the over use
            of the microservice architecture?
            
            no
            
            it's extremely easy to second-guess the way other people decompose
            their services since randoms online can't see any of the actual
            complexity or any of the details and so can easily suggest it would
            be better if it was different, without having to worry about any of
            the downsides of the imagined alternative solution.
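            
            Concretely, the split is roughly this shape (hypothetical types,
            nothing AWS-specific): the planner is a read-only function from
            observed state to a serializable plan, and the enactor only
            executes a plan it is handed.
            
              from dataclasses import dataclass
              
              @dataclass(frozen=True)
              class Plan:
                  generation: int
                  records: dict          # dns name -> list of IPs
              
              def make_plan(generation, healthy_lbs):
                  # Read-only: turn the observed world into desired state.
                  return Plan(generation,
                              {name: sorted(ips)
                               for name, ips in healthy_lbs.items()})
              
              def enact(plan, dns_api):
                  # Write-only: push the plan it was handed; it never
                  # invents state, so it can be tested with a fake dns_api.
                  for name, ips in plan.records.items():
                      dns_api.upsert(name, ips)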
       
              tuckerman wrote 21 hours 13 min ago:
              Agreed, this is a common division of labor and simplifies things.
              It's not entirely clear in the postmortem but I speculate that
              the conflation of duties (i.e. the enactor also being responsible
              for janitor duty of stale plans) might have been a contributing
              factor.
              
              The Oxide and Friends folks covered an update system they built
              that is similarly split and they cite a number of the same
              benefits as you:
              
  HTML        [1]: https://oxide-and-friends.transistor.fm/episodes/systems...
       
                jiggawatts wrote 16 hours 10 min ago:
                I would divide these as functions inside a monolithic
                executable. At most, emit the plan to a file on disk as a
                “--whatif” optional path.
                
                Distributed systems with files as a communication medium are
                much more complex than programmers think with far more failure
                modes than they can imagine.
                
                Like… this one, that took out a cloud for hours!
       
                  tuckerman wrote 53 min ago:
                  Doing it inside a single binary gets rid of some of the nice
                  observability features you get "for free" by breaking it up
                  and could complicate things quite a bit (more code paths,
                  flags for running it in "don't make a plan, use the last plan
                  mode", flags for "use this human-generated plan mode"). Very
                  few things are a free lunch but I've used this pattern
                  numerous times and quite like it. I ran a system that used a
                  MIP model to do capacity planning and separating planning
                  from executing a plan was very useful for us.
                  
                  I think the communications piece depends on what other
                  systems you have around you to build on; it's unlikely this
                  planner/executor is completely freestanding. Some companies
                  have large distributed filesystems with well known/tested
                  semantics, schedulers that launch jobs when files appear,
                  they might have ~free access to a database with strict
                  serializability where they can store a serialized version of
                  the plan, etc.
       
              Anon1096 wrote 22 hours 56 min ago:
              I mean any time a service even 1/100 the size of AWS goes down,
              you have people crawling out of the woodwork giving armchair
              advice while having no domain-relevant experience. It's barely
              even worth taking the time to respond. The people with opinions
              of value are already giving them internally.
       
                lazystar wrote 21 hours 59 min ago:
                > The people with opinions of value are already giving them
                internally.
                
                interesting take, in light of all the brain drain that AWS has
                experienced over the last few years.  some outside opinions
                might be useful - but perhaps the brain drain is so extreme
                that those remaining don't realize it's occurring?
       
            supportengineer wrote 1 day ago:
            It probably was a single-threaded python script until somebody
            found a way to get a Promo out of it.
       
              placardloop wrote 18 hours 16 min ago:
              This is Amazon we’re talking about, it was probably Perl.
       
          donavanm wrote 1 day ago:
          This is public messaging to explain the problem at large. This isn't
          really a post-incident analysis.
          
          Before the active incident is “resolved” theres an evaluation of
          probable/plausible reoccurrence. Usually we/they would have potential
          mitigations and recovery runbooks prepared as well to quickly react
          to any recurrence. Any likely open risks are actively worked to
          mitigate before the immediate issue is considered resolved. That
          includes around-the-clock dev team work if it's the best known path to
          mitigation.
          
          Next, any plausible paths to “risk of recurrence” would be top
          dev team priority (business hours) until those action items are
          completed and in deployment. That might include other teams with
          similar DIY DNS management, other teams who had less impactful queue
          depth problems, or other similar “near miss” findings. Service
          team tech & business owners (PE, Sr PE, GM, VP) would be tracking
          progress daily until resolved.
          
          Then in the next few weeks at org & AWS level “ops meetings”
          there are going to be the in depth discussions of the incident,
          response, underlying problems, etc. the goal there being
          organizational learning and broader dissemination of lessons learned,
          action items, best practice etc.
       
          mcmoor wrote 1 day ago:
          Also, I don't know if I missed it, but they don't establish anything
          to prevent an outage if there's an unusually high delay again?
       
            mattcrox wrote 1 day ago:
            It’s at the end: they disabled the DDB DNS automations around
            this and will fix them before they re-enable them
       
              mcmoor wrote 11 hours 43 min ago:
              If it's re-enabled (without change?), wouldn't an unusually high
              delay break it again?
       
                cthalupa wrote 1 hour 52 min ago:
                Why would they enable it without fixing the issue?
                
                The post-mortem is specific that they won't turn it back on
                without resolving this but I feel like the default assumption
                for any halfway competent entity would be that they fix the
                known issue that caused them to disable something in the
                first place.
       
        shayonj wrote 1 day ago:
        I was kinda surprised by the lack of CAS on a per-endpoint plan
        version, or of patterns like rejecting stale writes via 2PC or a
        single-writer lease per endpoint.
        
        Definitely a painful one with good learnings and kudos to AWS for being
        so transparent and detailed :hugops:
       
          donavanm wrote 1 day ago:
          See [1] . The actual DNS mutation API does, effectively, CAS. They
          had multiple unsynchronized writers who raced without logical
          constraints or ordering to the changes. Without thinking much they
          _might_ have been able to implement something like a vector either
          through updating the zone serial or another "sentinel record" that
          was always used for ChangeRRSets affecting that label/zone; like a
          TXT record containing a serialized change set number or a "checksum"
          of the old + new state.
          
          I'm guessing the "plans" aspect skipped that and they were just
          applying intended state, without trying to serialize them. And
          last-write-wins, until it doesn't.
          
  HTML    [1]: https://news.ycombinator.com/item?id=45681136
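          
          For illustration, a sketch of that sentinel-record idea against the
          real ChangeResourceRecordSets API (the zone ID and record name are
          made up): because a DELETE must match the existing value exactly and
          the whole batch is atomic, a writer holding a stale generation gets
          an InvalidChangeBatch error instead of silently overwriting newer
          state.
          
            import boto3
            
            r53 = boto3.client("route53")
            ZONE_ID = "Z0000000EXAMPLE"              # hypothetical zone
            SENTINEL = "plan-gen.example.internal."  # hypothetical record
            
            def txt(gen):
                return {"Name": SENTINEL, "Type": "TXT", "TTL": 60,
                        "ResourceRecords": [{"Value": f'"gen={gen}"'}]}
            
            def apply_with_guard(expected_gen, new_gen, upserts):
                changes = [{"Action": "DELETE",      # must match exactly
                            "ResourceRecordSet": txt(expected_gen)},
                           {"Action": "CREATE",
                            "ResourceRecordSet": txt(new_gen)}] + upserts
                try:
                    r53.change_resource_record_sets(
                        HostedZoneId=ZONE_ID,
                        ChangeBatch={"Changes": changes})
                    return True
                except r53.exceptions.InvalidChangeBatch:
                    return False                     # stale writer rejected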
       
            cyberax wrote 18 hours 23 min ago:
            Oh, I can see it from here. AWS internally has a problem with
            things like task orchestration. I bet that the enactor can be
            rewritten as a goroutine/thread in the planner, with proper locking
            and ordering.
            
            But that's too complicated and results in more code. So they likely
            just used an SQS queue with consumers reading from it.
       
        lazystar wrote 1 day ago:
        > Since this situation had no established operational recovery
        procedure, engineers took care in attempting to resolve the issue with
        DWFM without causing further issues.
        
        interesting.
       
        galaxy01 wrote 1 day ago:
        Would conditional read/write solve this? Looks like some kind of stale
        read
       
        yla92 wrote 1 day ago:
        So the root cause is basically race condition 101: a stale read?
       
          philipwhiuk wrote 1 day ago:
          Race condition and bad data validation.
       
        joeyhage wrote 1 day ago:
        > as is the case with the recently launched IPv6 endpoint and the
        public regional endpoint
        
        It isn't explicitly stated in the RCA but it is likely these new
        endpoints were the straw that broke the camel's back for the DynamoDB
        load balancer DNS automation
       
       
   DIR <- back to front page