        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   How We Found 7 TiB of Memory Just Sitting Around
       
       
        Aeolun wrote 5 hours 5 min ago:
         I read this and I have to wonder: did anyone ever think it was
         reasonable that a cluster that apparently needed only 120 GB of
         memory was consuming 1.2 TB just for logging (or whatever Vector
         does)?
       
          fock wrote 1 hour 6 min ago:
           We run on-prem with heavy spikes (our batch workload can easily
           utilize the 20 TB of memory in the cluster), and we just don't
           care much; we add 10% every year to the hardware we request.
           Compared to employing people or paying other vendors (relational
           databases with many TB-sized tables...), this is just irrelevant.

           Sadly, devs are incentivized by that, and going towards the cloud
           might be a fun story. Given the environment, I hope they scrap
           that effort sooner rather than later, buy some Oxide systems for
           the people who need to iterate faster than the usual process of
           getting a VM, and redeploy the 10% of the company occupied with
           the cloud (mind you: no real workload runs there yet...) to
           actually improving local processes...
       
          devjab wrote 1 hour 38 min ago:
           We're a much smaller scale company, and the cost we lose on
           these things is insignificant compared to what's in this story.
           Yesterday I was improving the process for creating databases in
           our Azure and I stumbled upon a subscription which was running 7
           MSSQL servers for 12 databases. These weren't elastic, and each
           was paying for a license that we don't have to pay, because we
           qualify for the base cost through our contract with our Microsoft
           partner. This company has some of the tightest control over its
           cloud infrastructure of any organisation I've worked with.

           This is anecdotal, but if my experiences aren't unique, then
           there is a lot of unreasonableness in DevOps.
       
          bstack wrote 2 hours 51 min ago:
           Author here: You'd be surprised what you don't notice given
           enough nodes and slow enough resource growth over time! Even at
           its high-water mark, this daemonset was still a small portion of
           the total resource usage in these clusters.
       
            Aeolun wrote 20 min ago:
            I’m not sure if that makes it better or worse.
       
            fock wrote 1 hour 1 min ago:
            how large are the clusters then?
       
        hinkley wrote 10 hours 29 min ago:
         Keys require O(log n) space per key, or O(n log n) for the entire
         data set, simply to avoid key collisions. But human-friendly key
         spaces grow much, much faster, and I don't think many people have
         looked too hard at that.

         There were recent changes to the Node.js Prometheus client that
         eliminate tag names from the keys used for storing tag cardinality
         for metrics. The memory savings weren't reported, but the CPU
         savings for recording data points were over 1/3, and about twice
         that when applied to the aggregation logic.
        
        Lookups are rarely O(1), even in hash tables.
        
        I wonder if there’s a general solution for keeping names concise
        without triggering transposition or reading comprehension errors. And
        what the space complexity is of such an algorithm.
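
         As a rough illustration of the kind of change described above (a
         sketch in Go, not the actual Node.js prom-client code; the metric
         shape and names are made up), keying samples by label values alone
         keeps both the key length and the per-record hashing cost down,
         since the label names are already fixed per metric:

             // Sketch only: per-sample keys are built from label values; the
             // label names are stored once on the metric itself.
             package main

             import (
                 "fmt"
                 "strings"
             )

             type counter struct {
                 labelNames []string           // stored once, e.g. ["method", "status"]
                 samples    map[string]float64 // keyed by label values only
             }

             func (c *counter) inc(values ...string) {
                 // A name+value key repeats the names every time, e.g.
                 // method="GET",status="200"; a value-only key such as
                 // GET|200 is shorter to build, hash and store.
                 c.samples[strings.Join(values, "|")]++
             }

             func main() {
                 c := &counter{
                     labelNames: []string{"method", "status"},
                     samples:    map[string]float64{},
                 }
                 c.inc("GET", "200")
                 c.inc("GET", "500")
                 fmt.Println(c.samples) // map[GET|200:1 GET|500:1]
             }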
       
          vlovich123 wrote 3 hours 14 min ago:
           Why not just use 128-bit UUIDs? Those are effectively guaranteed
           to be globally unique and don't require so much space.
       
            hinkley wrote 37 min ago:
             128-bit UUIDs for what, exactly?
            
            > keeping names concise without triggering transposition or reading
            comprehension errors.
            
             Code that doesn't work for developers first will soon cease to
             work for anyone. Plus, how do you look up a UUID for a set of
             tags? What's your perfect-hash plan to make sure you don't
             misattribute stats to the wrong place?
            
            UUIDs are entirely opaque and difficult to tell apart consistently.
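
             To make the lookup problem concrete, here is a sketch (in Go,
             with a hypothetical helper, not any particular metrics library)
             of deriving a fixed 128-bit ID from a tag set by hashing. The
             hash function is the "perfect hash plan", and any collision
             silently misattributes stats:

                 // Sketch: derive a UUID-sized identifier from a readable tag set.
                 package main

                 import (
                     "crypto/sha256"
                     "fmt"
                     "sort"
                     "strings"
                 )

                 // tagsToID is a hypothetical helper; two tag sets hashing to the
                 // same ID would silently merge their stats.
                 func tagsToID(tags map[string]string) [16]byte {
                     pairs := make([]string, 0, len(tags))
                     for k, v := range tags {
                         pairs = append(pairs, k+"="+v)
                     }
                     sort.Strings(pairs) // canonical order: the ID must be deterministic
                     sum := sha256.Sum256([]byte(strings.Join(pairs, ",")))
                     var id [16]byte
                     copy(id[:], sum[:16]) // truncate to 128 bits
                     return id
                 }

                 func main() {
                     id := tagsToID(map[string]string{"service": "api", "region": "us-east-1"})
                     fmt.Printf("%x\n", id) // opaque: no way to eyeball which series this is
                 }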
       
        nitinreddy88 wrote 1 day ago:
         The other way to look at it is to ask why adding the namespace
         label causes such a large memory footprint in Kubernetes. Shouldn't
         fixing that (which could be a much bigger design change) benefit
         the whole Kube community?
       
          bstack wrote 15 hours 54 min ago:
           Author here: yeah, that's a good point. tbh I was mostly
           unfamiliar with Vector, so I took the shortest path to the goal,
           but that could be an interesting followup. It does seem like
           there are a lot of bytes per namespace!
       
            stackskipton wrote 4 hours 31 min ago:
             You mentioned in the blog article that it's doing a list-watch.
             A list-watch registers with the Kubernetes API to get a list of
             all objects AND a notification whenever any object you've
             registered for changes. A bunch of Vector pods saying "Hey,
             send me a notification when anything about namespaces changes"
             and poof goes your memory, keeping track of who needs to know
             what.
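
             A rough client-go sketch of that list-watch pattern (Vector
             itself is Rust; this is only meant to show what each pod
             running such a watch ends up holding, namely an in-memory cache
             of every watched object):

                 // Sketch: LIST all namespaces once, then WATCH for changes.
                 // The informer keeps every object in a local store, per pod.
                 package main

                 import (
                     "fmt"
                     "time"

                     corev1 "k8s.io/api/core/v1"
                     "k8s.io/client-go/informers"
                     "k8s.io/client-go/kubernetes"
                     "k8s.io/client-go/rest"
                     "k8s.io/client-go/tools/cache"
                 )

                 func main() {
                     cfg, err := rest.InClusterConfig()
                     if err != nil {
                         panic(err)
                     }
                     clientset := kubernetes.NewForConfigOrDie(cfg)

                     factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
                     nsInformer := factory.Core().V1().Namespaces().Informer()
                     nsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
                         UpdateFunc: func(oldObj, newObj interface{}) {
                             ns := newObj.(*corev1.Namespace)
                             fmt.Println("namespace changed:", ns.Name)
                         },
                     })

                     stop := make(chan struct{})
                     factory.Start(stop)            // kicks off the list-watch
                     factory.WaitForCacheSync(stop) // full copy now held in memory
                     select {}                      // keep watching forever
                 }

             Multiply that per-pod cache by every node running the daemonset
             and it adds up.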
            
            At this point, I wonder if instead of relying on daemonsets, you
            just gave every namespace a vector instance that was responsible
            for that namespace and pods within. ElasticSearch or whatever you
            pipe logging data to might not be happy with all those TCP
            connections.
            
            Just my SRE brain thoughts.
       
              fells wrote 3 hours 45 min ago:
              >you just gave every namespace a vector instance that was
              responsible for that namespace and pods within.
              
               Vector is a daemonset because it needs to tail the log files
               on each node. A single Vector instance per namespace might
               not reside on the nodes that each pod is on.
       
        shanemhansen wrote 1 day ago:
        The unreasonable effectiveness of profiling and digging deep strikes
        again.
       
          hinkley wrote 11 hours 1 min ago:
          The biggest tool in the performance toolbox is stubbornness. Without
          it all the mechanical sympathy in the world will go unexploited.
          
           There's about a factor-of-3 improvement that can be made to most
           code after the profiler has given up. That probably means there
           are better profilers that could be written, but in 20 years of
           having them I've only seen 2 that tried. Sadly, I think flame
           graphs made profiling more accessible to the unmotivated but
           didn't actually improve overall results.
       
            jesse__ wrote 8 hours 54 min ago:
            Broadly agree.
            
            I'm curious, what're the profilers you know of that tried to be
            better?  I have a little homebrew game engine with an integrated
            profiler that I'm always looking for ideas to make more effective.
       
              hinkley wrote 8 hours 21 min ago:
              Clinic.js tried and lost steam. I have a recollection of a
              profiler called JProfiler that represented space and time as a
              graph, but also a recollection they went under. And there is a
              company selling a product of that name that has been around since
              that time, but doesn’t quite look how I recalled and so I
              don’t know if I was mistaken about their demise or I’ve
              swapped product names in my brain. It was 20 years ago which is a
              long time for mush to happen.
              
               The common element between attempts is new visualizations.
               And like drawing a projection of an object in a mechanical
               engineering drawing, there is no one projection that contains
               the entire description of the problem. You need to present
               several and let the brain synthesize the data missing in each
               individual projection into an accurate model.
       
            Negitivefrags wrote 9 hours 48 min ago:
            I think the biggest tool is higher expectations. Most programmers
            really haven't come to grips with the idea that computers are fast.
            
             If you see a database query that takes 1 hour to run and only
             touches a few GB of data, you should be thinking "Well, NVMe
             bandwidth is multiple gigabytes per second, why can't it run in
             1 second or less?"

             The idea that anyone would accept a request to a website taking
             longer than 30 ms (the time it takes for a game to render its
             entire world, including both the CPU and GPU parts, at 60 fps)
             is insane, and nobody should really accept it, but we commonly
             do.
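
             As a back-of-envelope check (illustrative numbers, not from the
             article): scanning 4 GB at a sustained 3.5 GB/s of NVMe read
             bandwidth takes about 4 / 3.5 ≈ 1.1 seconds, so a one-hour
             runtime for a query touching a few GB means the database is
             doing (or waiting on) roughly three orders of magnitude more
             work than the raw I/O requires.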
       
              mjevans wrote 1 hour 45 min ago:
               30 ms for a website is a tough bar to clear considering the
               speed of light (or rather electrons in copper / light in
               fiber) [1]. Just as an example, the round-trip delay from
               where I rent to the local backbone is about 14 ms alone, and
               the average to Microsoft's webserver is 53 ms, just for a
               simple echo reply. (I picked it because I'd hoped it was in
               Redmond or some nearby datacenter, but it looks more likely
               to be in a cheaper labor area.)

               However, it's only the bloated ECMAScript (JavaScript) trash
               web of today that makes a website take longer than ~1 second
               to load on a modern PC. Plain old HTML, images on a
               reasonable diet, and some script elements only for
               interactive things can scream.
              
                 mtr -bzw microsoft.com
                 (columns: Loss%  Snt  Last  Avg  Best  Wrst  StDev)
                  6. AS7922  be-36131-cs03.seattle.wa.ibone.comcast.net (2001:558:3:942::1)  0.0%  10  12.9  13.9  11.5  18.7  2.6
                  7. AS7922  be-2311-pe11.seattle.wa.ibone.comcast.net (2001:558:3:3a::2)  0.0%  10  11.8  13.3  10.6  17.2  2.4
                  8. AS7922  2001:559:0:80::101e  0.0%  10  15.2  20.7  10.7  60.0  17.3
                  9. AS8075  ae25-0.icr02.mwh01.ntwk.msn.net (2a01:111:2000:2:8000::b9a)  0.0%  10  41.1  23.7  14.8  41.9  10.4
                 10. AS8075  be140.ibr03.mwh01.ntwk.msn.net (2603:1060:0:12::f18e)  0.0%  10  53.1  53.1  50.2  57.4  2.1
                 11. AS8075  2603:1060:0:10::f536  0.0%  10  82.1  55.7  50.5  82.1  9.7
                 12. AS8075  2603:1060:0:10::f3b1  0.0%  10  54.4  96.6  50.4  147.4  32.5
                 13. AS8075  2603:1060:0:10::f51a  0.0%  10  49.7  55.3  49.7  78.4  8.3
                 14. AS8075  2a01:111:201:f200::d9d  0.0%  10  52.7  53.2  50.2  58.1  2.7
                 15. AS8075  2a01:111:2000:6::4a51  0.0%  10  49.4  51.6  49.4  54.1  1.7
                 20. AS8075  2603:1030:b:3::152  0.0%  10  50.7  53.4  49.2  60.7  4.2
              
  HTML        [1]: https://en.wikipedia.org/wiki/Speed_of_light
       
              hinkley wrote 8 hours 25 min ago:
               Lowered expectations come in part from people giving up on
               theirs. Accepting versus pushing back.
       
                antonymoose wrote 8 hours 21 min ago:
                 I have high hopes and expectations; unfortunately, my
                 chain of command does not, and is often an immovable force.
       
                  hinkley wrote 7 hours 24 min ago:
                  This is a terrible time to tell someone to find a movable
                  object in another part of the org or elsewhere. :/
                  
                  I always liked Shaw’s “The reasonable man adapts himself
                  to the world: the unreasonable one persists in trying to
                  adapt the world to himself. Therefore all progress depends on
                  the unreasonable man.”
       
              azornathogron wrote 8 hours 59 min ago:
               Pedantic nit: at 60 fps the per-frame time is 16.66... ms,
               not 30 ms. Having said that, a lot of games run at 30 fps, or
               run different parts of their logic at different frequencies,
               or do other tricks that mean there isn't exactly one FPS rate
               that the thing is running at.
       
                Negitivefrags wrote 8 hours 39 min ago:
                The CPU part happens on one frame, the GPU part happens on the
                next frame. If you want to talk about the total time for a game
                to render a frame, it needs to count two frames.
       
                  wizzwizz4 wrote 7 hours 25 min ago:
                  Computers are fast. Why do you accept a frame of lag? The
                  average game for a PC from the 1980s ran with less lag than
                  that. Super Mario Bros had less than a frame between
                  controller input and character movement on the screen.
                  (Technically, it could be more than a frame, but only if
                  there were enough objects in play that the processor couldn't
                  handle all the physics updates in time and missed the v-blank
                  interval.)
       
                    Negitivefrags wrote 7 hours 1 min ago:
                     If Vsync is on (which was my assumption in my previous
                     comment) and your computer is fast enough, you might be
                     able to run the CPU and GPU work entirely in a single
                     frame, using Reflex to delay when simulation starts to
                     lower latency. But regardless, you still have a total
                     time budget of 1/30th of a second to do all your
                     combined CPU and GPU work to get to 60 fps.
       
              javier2 wrote 9 hours 9 min ago:
               It's also about cost. My game computer has 8 cores + 1
               expensive GPU + 32 GB RAM for me alone. We don't have that
               per customer.
       
                Aeolun wrote 5 hours 4 min ago:
                If your websites take less than 16ms to serve, you can serve 60
                customers per second with that. So you sorta do have it per
                customer?
       
                  vlovich123 wrote 3 hours 43 min ago:
                   That's per core, assuming the 16 ms is CPU-bound
                   activity (so 100 cores would serve 100 customers at a
                   time). If it's I/O, you can overlap a lot of customers,
                   since a single core can easily keep track of thousands
                   of in-flight requests.
       
                oivey wrote 8 hours 44 min ago:
                 This is again a problem of understanding that computers are fast.
                A toaster can run an old 3D game like Quake at hundreds of FPS.
                A website primarily displaying text should be way faster. The
                reasons websites often aren’t have nothing to do with the
                user’s computer.
       
                  paulryanrogers wrote 8 hours 2 min ago:
                  That's a dedicated toaster serving only one client. Websites
                  usually aren't backed by bare metal per visitor.
       
                    oivey wrote 7 hours 31 min ago:
                    Right. I’m replying to someone talking about their
                    personal computer.
       
                avidiax wrote 8 hours 57 min ago:
                It's also about revenue.
                
                Uber could run the complete global rider/driver flow from a
                single server.
                
                 It doesn't, in part because all of those individual trips
                 earn $1 or more each, so it's perfectly acceptable to the
                 business to be more inefficient and use hundreds of servers
                 for this task.

                 Similarly, a small website taking 150 ms to render the page
                 only matters if the lost productivity costs more than the
                 engineering time to fix it, and even then it only makes
                 sense if that engineering time isn't more productively used
                 to add features or reliability.
       
                  onethumb wrote 50 min ago:
                  Uber could not run the complete global rider/driver flow from
                  a single server.
       
            zahlman wrote 9 hours 58 min ago:
            > The biggest tool in the performance toolbox is stubbornness.
            Without it all the mechanical sympathy in the world will go
            unexploited.
            
            The sympathy is also needed. Problems aren't found when people
            don't care, or consider the current performance acceptable.
            
             > There's about a factor-of-3 improvement that can be made to
             most code after the profiler has given up. That probably means
             there are better profilers that could be written, but in 20
             years of having them I've only seen 2 that tried.
            
            It's hard for profilers to identify slowdowns that are due to the
            architecture. Making the function do less work to get its result
            feels different from determining that the function's result is
            unnecessary.
       
              hinkley wrote 8 hours 23 min ago:
              Architecture, cache eviction, memory bandwidth, thermal
              throttling.
              
              All of which have gotten perhaps an order of magnitude worse in
              the time since I started on this theory.
       
       
   DIR <- back to front page