Hacker News on Gopher (unofficial)
COMMENT PAGE FOR:
HTML Prompt caching for cheaper LLM tokens
who-shot-jr wrote 6 hours 54 min ago:
What a fantastic article! How did you create the animations?
samwho wrote 6 hours 27 min ago:
Thank you! <3
These are all built with React and CSS animations (or the Web
Animations API where I needed it). I'm not very good at React so
the code is a real mess. 2 of the components also use threejs for the
3D bits.
For the stuff on my personal site, which simonw graciously linked to
in another reply, you can see all the code behind my work at
HTML [1]: https://github.com/samwho/visualisations
simonw wrote 6 hours 52 min ago:
Sam has a long history of building beautiful visual explanations like
this - I didn't realize he works for ngrok now, here's his previous
independent collection:
HTML [1]: https://samwho.dev/
samwho wrote 6 hours 33 min ago:
Simon, you're too kind. Thank you. <3
dangoodmanUT wrote 10 hours 25 min ago:
But why is this posted on ngrok?
toobulkeh wrote 10 hours 17 min ago:
They have an AI router they just released.
ngrok.ai
holbrad wrote 12 hours 5 min ago:
I gave the table of inputs and outputs to both Gemini 3.0 flash and GPT
5.2 Instant and they were stumped. [1][2]
HTML [1]: https://t3.chat/share/j2tnfwwful
HTML [2]: https://t3.chat/share/k1xhgisrw1
samwho wrote 10 hours 58 min ago:
When I was writing this, GPT 5.1 was the latest and it got it right
away. It's the sequence of prime numbers fwiw :)
andruby wrote 11 hours 7 min ago:
What is the function supposed to be? It's not Celsius to Fahrenheit.
(2C=35F, 206C=406F, ...)
WillAdams wrote 12 hours 25 min ago:
When will Microsoft do this sort of thing?
It's a pain having to tell Copilot "Open in pages mode" each time it's
launched, and then after processing a batch of files run into:
HTML [1]: https://old.reddit.com/r/Copilot/comments/1po2cuf/daily_limit_...
Havoc wrote 13 hours 5 min ago:
Does anyone know whether the cache is segregated by user/API key for
the big providers?
Was looking at modifying outgoing requests via proxy and wondering
whether that's harming caching. Common coding tools presumably have a
shared prompt across all their installs, so a universal cache would save a
lot.
moebrowne wrote 12 hours 59 min ago:
For ChatGPT:
> Prompt caches are not shared between organizations. Only members of
the same organization can access caches of identical prompts.
HTML [1]: https://platform.openai.com/docs/guides/prompt-caching#frequ...
maxloh wrote 7 hours 42 min ago:
I don't find it really viable. There are so many ways to express
the same question, and context does matter: the same prompt becomes
irrelevant if the previous prompts or LLM responses differ.
With the cache limited to the same organization, the chances of it
actually being reused would be extremely low.
qeternity wrote 3 hours 21 min ago:
In a chat setting you hit the cache every time you add a new
prompt: all historical question/answer pairs are part of the
context and don't need to be prefilled again.
On the API side imagine you are doing document processing and
have a 50k token instruction prompt that you reuse for every
document.
It's extremely viable and used all the time.
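A rough sketch of that document-processing pattern, using the OpenAI Python
SDK; the instruction file, model name, and the exact usage fields reported
for cached tokens are assumptions to check against current docs:
  from openai import OpenAI

  client = OpenAI()

  # Long, unchanging instruction prompt (tens of thousands of tokens).
  INSTRUCTIONS = open("extraction_instructions.txt").read()

  def process(document: str) -> str:
      response = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              # Identical first message on every call -> shared, cacheable prefix.
              {"role": "system", "content": INSTRUCTIONS},
              # Only the per-document part varies, and it comes last.
              {"role": "user", "content": document},
          ],
      )
      usage = response.usage
      # After the first request, most of the instruction prefix should show
      # up as cached input tokens (field name is an assumption).
      print(usage.prompt_tokens, usage.prompt_tokens_details.cached_tokens)
      return response.choices[0].message.content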
jonhohle wrote 3 hours 14 min ago:
I'm shocked that this hasn't been a thing from the start.
That seems like table stakes for automating repetitive tasks.
qeternity wrote 2 hours 23 min ago:
It has been a thing. In a single request, this same cache is
reused for each forward pass.
It took a while for companies to start metering it and
charging accordingly.
Also companies invested in hierarchical caches that allow
longer term and cross cluster caching.
IanCal wrote 6 hours 55 min ago:
It gets used massively in a conversation, also anything that has
a lot of explain actions in the system prompt means you have a
large matching prefix.
babelfish wrote 7 hours 9 min ago:
Think of it as a very useful prefix match. If all of your threads
start with the same system prompt, you will reap benefits from
prompt caching.
samwho wrote 13 hours 1 min ago:
I was wondering about this when I was reading around the topic. I
can't personally think of a reason you would need to segregate,
though it wouldn't surprise me if they do for some sort of
compliance reasons. I'm not sure though, would love to hear
something first-party.
dustfinger wrote 9 hours 13 min ago:
I wonder if there is valuable information that can be learned by
studying a company's prompts? There may be reasons why some
companies want their prompts private.
dustfinger wrote 6 hours 46 min ago:
I realize cache segregation is mainly about security/compliance
and tenant isolation, not protecting secret prompts. Still, if
someone obtained access to a company's prompt templates/system
prompts, analyzing them could reveal:
- Product logic / decision rules, such as: when to refund, how to
triage tickets
- Internal taxonomies, schemas, or tool interfaces
- Safety and policy guardrails (which adversaries could try to
route around)
- Brand voice, strategy, or proprietary workflows
That is just off the top of my head.
weird-eye-issue wrote 11 hours 32 min ago:
They absolutely are segregated
With OpenAI at least you can specify the cache key and they even
have this in the docs:
> Use the prompt_cache_key parameter consistently across requests that
share common prefixes. Select a granularity that keeps each unique
prefix-prompt_cache_key combination below 15 requests per minute to avoid
cache overflow.
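A minimal sketch of what that looks like in practice, assuming the OpenAI
Python SDK accepts prompt_cache_key as a request parameter (worth verifying
against the current SDK); the prompt and key values are invented:
  from openai import OpenAI

  client = OpenAI()

  SYSTEM_PROMPT = "...long, shared instructions..."

  def answer(question: str):
      return client.chat.completions.create(
          model="gpt-4o-mini",
          # One stable key per distinct prefix, so requests sharing this
          # prefix land on the same cache and stay under the rate guidance
          # quoted above.
          prompt_cache_key="support-bot-v1",
          messages=[
              {"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": question},
          ],
      )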
psadri wrote 7 hours 29 min ago:
Does anyone actually compute / use this key feature? Or do you
rely on implicit caching? I wish HN had a comment with a poll
feature.
samwho wrote 12 hours 54 min ago:
The only thing that comes to mind is some kind of timing attack.
Send loads of requests specific to a company you're trying to spy
on and if it comes back cached you know someone has sent that
prompt recently. Expensive attack, though, with a large search
space.
gwern wrote 5 hours 41 min ago:
No, the search space is tiny: you can just attack 1 BPE at a
time! Stuff like password guessing is almost trivial when you get
to do a timing attack on each successive character. So that lets
you quickly exfiltrate arbitrary numbers of prompts, especially
if you have any idea what you are looking for. (Note that a lot
of prompts are already public information, or you can already
exfiltrate prompts quite easily from services and start attacking
from there...)
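Purely to illustrate the shape of the attack described here: it only works
against a hypothetical cache shared across tenants, and time_to_first_token
is a made-up helper that would send a prompt and time the response.
  def extend_known_prefix(known_prefix: str, vocab: list[str], trials: int = 5) -> str:
      """Guess the next token of a cached prompt by latency alone."""
      # time_to_first_token(prompt) is a hypothetical helper: send the prompt
      # and return how long the first output token took to arrive.
      best_token, best_latency = None, float("inf")
      for token in vocab:
          guess = known_prefix + token
          # Average a few trials to smooth out network jitter.
          latency = sum(time_to_first_token(guess) for _ in range(trials)) / trials
          if latency < best_latency:
              best_token, best_latency = token, latency
      # The fastest candidate is the one most likely already in the cache.
      return known_prefix + best_token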
reitzensteinm wrote 3 hours 49 min ago:
Hill climbing a password would only be possible if intermediate
KV cache entries were stored. To hillclimb "hunter2", you're
going to try "a", "b", "c", etc, until you notice that "h"
comes back faster. Then you try "ha", "hb" and so on.
But that's only going to work if the cache looks like: "h",
"hu", "hun", ..., "hunter2"
If just "hunter2" is in the cache, you won't get any signal
until you stumble on exactly that password. And that's before
getting into the block size granularity of the caches discussed
elsewhere in this thread.
That's not to say timing attacks aren't possible. I haven't
looked at Claude Code's prompt generation, but there's no
intrinsic reason why you couldn't do things like figure out
what open source code and research papers your competitors are
loading into context.
Sharing caches between orgs would be an incredible misstep.
jgeralnik wrote 2 hours 44 min ago:
Right, you can't actually guess a letter (byte) at a time
but you can guess a token at a time (I believe the vocabulary
is 200000 possible tokens in gpt 5)
So you could send each of the 200000 possible tokens, see
which is cached, and then send 200000 more tokens to find the
next cached token
Certainly less efficient but well within the realm of a
feasible attack
reitzensteinm wrote 1 hour 31 min ago:
It's a good call out re: tokens vs letters, but I think you
might have misunderstood my point - you can't do it a token
at a time unless the intermediate KV cache is stored after
each token is generated.
This won't be the case in any non-toy implementation, as it
would be unnecessary and slow.
IanCal wrote 4 hours 21 min ago:
Do any providers do this level of granularity? Anthropic
require explicit cache markers, for example.
jgeralnik wrote 2 hours 43 min ago:
Anthropic requires explicit cache markers but will "look
backwards" some amount, so you don't need to fall on the
exact split to get cached tokens.
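For reference, a minimal sketch of those explicit markers with the Anthropic
Python SDK; the field shapes are from memory and the model name is a
placeholder, so check the current docs:
  import anthropic

  client = anthropic.Anthropic()

  LONG_SYSTEM_PROMPT = "...tens of thousands of tokens of shared instructions..."

  response = client.messages.create(
      model="claude-sonnet-4-5",  # placeholder model name
      max_tokens=1024,
      system=[
          {
              "type": "text",
              "text": LONG_SYSTEM_PROMPT,
              # Everything up to and including this block is marked cacheable.
              "cache_control": {"type": "ephemeral"},
          }
      ],
      messages=[{"role": "user", "content": "Summarise the attached report."}],
  )
  print(response.usage)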
gunalx wrote 12 hours 33 min ago:
I have come across claims that turning on caching means the LLM has a faint
memory of what was in the cache, even for unrelated queries. If
this is the case it's completely unreasonable to share the cache,
because of the possibility of information leakage.
weird-eye-issue wrote 11 hours 32 min ago:
This is absolutely 100% incorrect.
samwho wrote 12 hours 28 min ago:
How would information leak, though? There's no difference in
the probability distribution the model outputs when caching vs
not caching.
sroussey wrote 1 hour 11 min ago:
The probability distribution the model outputs is identical
under identical conditions.
A local model running alone on your machine will 100% always
return the exact same thing and the internal state will be
exactly the same and you can checkpoint or cache that to
avoid rerunning to that point.
But⦠conditions can be different, and batching requests
tends to affect other items in flight. I believe Thinking
Machines had an article about how to make a request
deterministic again without performance going to complete
crap.
I tend to think of things this way (completely not what
happens though): what if you were to cache based on a tensor
as the key? To generate a reasonably sized key what is an
acceptable loss of precision to retrieve the same cache
knowing that there is inherent jitter in the numbers of the
tensor?
And then the ever so slight leak of information. But also
multiplied since there are internal kv caches for tokens and
blah blah blah.
duggan wrote 13 hours 19 min ago:
It was a real facepalm moment when I realised we were busting the cache
on every request by including the date/time near the top of the main
prompt.
Even just moving it to the bottom helped move a lot of our usage into
cache.
Probably went from something like 30-50% cached tokens to 50-70%.
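A sketch of that fix (prompt text and structure invented for illustration):
keep volatile values like the current time out of the long shared prefix so
the earlier tokens stay identical between requests.
  from datetime import datetime, timezone

  # Long, unchanging instructions: this is the part you want cached.
  STATIC_INSTRUCTIONS = "You are a support assistant for Example Corp. ..."

  def build_messages(user_message: str) -> list[dict]:
      now = datetime.now(timezone.utc).isoformat(timespec="minutes")
      return [
          # Cache-friendly: the static block comes first and is byte-identical
          # on every request.
          {"role": "system", "content": STATIC_INSTRUCTIONS},
          # The volatile timestamp goes after the shared prefix (or into the
          # user turn) so it can't bust the cache.
          {"role": "system", "content": f"The current UTC time is {now}."},
          {"role": "user", "content": user_message},
      ]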
willvarfar wrote 13 hours 20 min ago:
A really clear explanation!
So if I were running a provider I would be caching popular prefixes for
questions across all users. There must be so many questions that start
'what is' or 'who was' etc?
Also, can subsequences in the prompt be cached and reused? Or is it
only prefixes? I mean, can you cache popular phrases that might appear
in the middle of the prompt and reuse that somehow rather than needing
to iterate through them token by token? E.g. must be lots of times
that "and then tell me what" appears in the middle of a prompt?
GeneralMayhem wrote 13 hours 10 min ago:
Really only prefixes, without a significant loss in accuracy. The
point is that because later tokens can't influence earlier ones, the
post-attention embeddings for those first tokens can't change. But
the post-attention embeddings for "and then tell me what" would be
wildly different for every prompt, because the embeddings for those
tokens are affected by what came earlier.
My favorite not-super-accurate mental model of what's going on with
attention is that the model is sort of compressing the whole
preceding context into each token. So the word "tell" would include a
representation not just of the concept of telling, but also of what
it is that's supposed to be told. That's explicitly what you don't
want to cache.
> So if I were running a provider I would be caching popular prefixes
for questions across all users
Unless you're injecting user context before the question. You can
have a pre baked cache with the base system prompt, but not beyond
that. Imagine that the prompt always starts with "SYSTEM: You are
ChatGPT, a helpful assistant. The time is 6:51 ET on December 19,
2025. The user's name is John Smith. USER: Hi, I was wondering..."
You can't cache the "Hi, I was wondering" part because it comes after
a high-entropy component (timestamp and user name).
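A toy way to see that effect: two prompts that differ only in the timestamp
share nothing reusable beyond the characters before it. This is
character-level and purely illustrative; real caches match token blocks.
  from itertools import takewhile

  def shared_prefix_len(a: str, b: str) -> int:
      return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))

  template = ("SYSTEM: You are ChatGPT, a helpful assistant. The time is {t}. "
              "The user's name is John Smith. USER: Hi, I was wondering...")
  p1 = template.format(t="6:51 ET on December 19, 2025")
  p2 = template.format(t="6:52 ET on December 19, 2025")

  print(shared_prefix_len(p1, p2))  # stops inside the timestamp
  print(len(p1))                    # everything after it cannot be reused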
samwho wrote 13 hours 13 min ago:
With KV caching as it's described there it has to be a prefix
match. OpenAI state in their docs they don't cache anything below
1024 tokens long, and I'm sure I read somewhere that they only
cache in 1024 token blocks (so 1024, 2048, 3072, etc) but I can't
find it now.
There's been some research into how to cache chunks in the middle,
but I don't think any of the providers are doing it yet because it
needs the prompt to be structured in a very specific way.
moebrowne wrote 12 hours 59 min ago:
[1] > Caching is available for prompts containing 1024 tokens or
more.
No mention of caching being in blocks of 1024 tokens thereafter.
HTML [1]: https://platform.openai.com/docs/guides/prompt-caching#req...
IanCal wrote 4 hours 14 min ago:
At launch it was described as being in blocks of 128
HTML [1]: https://openai.com/index/api-prompt-caching/
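Taking both figures at face value (a 1024-token minimum and 128-token
increments are assumptions about OpenAI's implementation drawn from the links
above), the cacheable portion of a prefix would round down like this:
  def cacheable_tokens(prefix_tokens: int, minimum: int = 1024, block: int = 128) -> int:
      if prefix_tokens < minimum:
          return 0
      # Round down to the nearest block boundary at or above the minimum.
      return prefix_tokens - (prefix_tokens - minimum) % block

  print(cacheable_tokens(900))   # 0    (below the minimum)
  print(cacheable_tokens(1100))  # 1024
  print(cacheable_tokens(1300))  # 1280
  print(cacheable_tokens(3072))  # 3072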
tomhow wrote 13 hours 51 min ago:
[under-the-rug stub]
[see [1] for explanation]
HTML [1]: https://news.ycombinator.com/item?id=45988611
walterbell wrote 9 hours 17 min ago:
Excellent HN-esque innovation in moderation: immediate improvement in
S/N ratio, unobtrusive UX, gentle feedback to humans, semantic signal
to machines.
How was the term "rug" chosen, e.g. in the historical context of
newspaper folds?
ThePyCoder wrote 15 hours 2 min ago:
What an excellent write-up. Thank you!
samwho wrote 14 hours 58 min ago:
Thank you so much <3
wesammikhail wrote 15 hours 49 min ago:
Amazing article. I was under the misapprehension that temp and other
output parameters actually do affect caching. Turns out I was wrong
and this explains why beautifully.
Great work. Learned a lot!
stingraycharles wrote 14 hours 14 min ago:
I had a "somebody is wrong on the internet!!" discussion about
exactly this a few weeks ago, and they proclaimed to be a professor
in AI.
Where do people get the idea from that temperature affects caching
in any way? Temperature is about next token prediction / output,
not input.
wesammikhail wrote 12 hours 52 min ago:
Because in my mind, as a person not working directly on this kind
of stuff, I figured that caching was done similar to any resource
caching in a webserver environment.
It's a semantics issue where the word caching is overloaded
depending on context. For people that are not familiar with the
inner workings of llm models, this can cause understandable
confusion.
semi-extrinsic wrote 13 hours 57 min ago:
Being wrong about details like this is exactly what I would
expect from a professor. They are mainly grant writers and PhD
herders, often they are good at presenting as well, but they
mostly only have gut feelings about technical details of stuff
invented after they became a professor.
samwho wrote 14 hours 55 min ago:
Yay, glad I could help! The sampling process is so interesting on
its own that I really want to do a piece on it as well.
wesammikhail wrote 12 hours 52 min ago:
Looking forward to it!
coderintherye wrote 16 hours 19 min ago:
Really well done article.
I'd note, when I gave the input/output screenshot to ChatGPT 5.2 it
failed on it (with lots of colorful chain of thought), though Gemini
got it right away.
samwho wrote 14 hours 57 min ago:
Huh, when I was writing the article it was GPT-5.1 and I remember
it got it no problem.
simedw wrote 3 days ago:
Thanks for sharing; you clearly spent a lot of time making this easy
to digest. I especially like the tokens-to-embedding visualisation.
I recently had some trouble converting an HF transformer I trained
with PyTorch to Core ML. I just couldn't get the KV cache to work,
which made it unusably slow after 50 tokens...
samwho wrote 2 days ago:
Thank you so much <3
Yes, I recently wrote [1] and had a similar experience with cache
vs no cache. It's so impactful.
HTML [1]: https://github.com/samwho/llmwalk
mrgaro wrote 17 hours 58 min ago:
Hopefully you can write the teased next article about how
Feedforward and Output layers work. The article was super helpful
for me to get a better understanding of how LLM GPTs work!
samwho wrote 14 hours 56 min ago:
Yeah! It's planned for sure. It won't be the very next
one, though. I'm taking a detour into another aspect of LLMs
first.
I'm really glad you liked it, and seriously the resources I
link at the end are fantastic.
NooneAtAll3 wrote 14 hours 11 min ago:
The blog starts loading and then shows a "Something Went Wrong. D is not a
function" error.
belter wrote 13 hours 59 min ago:
You should upgrade IE6. It has been out of support for a while...
samwho wrote 14 hours 8 min ago:
Could you tell me what browser/OS/device you're using? A few people
have said this and I haven't been able to reproduce it.
aitchnyu wrote 15 hours 36 min ago:
Took me a minute to see it is the same ngrok that provided freemium
tunnels to localhost. How did they adapt to the AI revolution?
samwho wrote 14 hours 58 min ago:
It is the same ngrok!
The product has grown a lot since the mid 2010s. Still got free
localhost tunnelling, but we also have a whole bunch of
production-grade API gateway tooling and, as of recently, AI gateway
stuff too.
est wrote 18 hours 46 min ago:
This is a surprisingly good read on how LLMs work in general.
samwho wrote 14 hours 54 min ago:
It's funny, I didn't set out for that to be the case. When I
pitched the idea internally, I wanted to scratch my own itch (what on
earth is a cached token?) and produce a good post. But then I
realised I had to go deeper and deeper to get to my answer and
accidentally made a very long explainer.
yomismoaqui wrote 9 hours 51 min ago:
Thanks for the post, it's near perfect in focus, detail and how
it's written.
EDIT: You have some minor typos in the post (psuedocode)
Youden wrote 2 days ago:
Link seems to be broken: content briefly loads then is replaced with
"Something Went Wrong" then "D is not a function". Stays broken with
adblock disabled.
samwho wrote 14 hours 53 min ago:
Another person had this problem as well and we couldn't figure out
what causes it. We suspect something to do with WebGL support. What
browser/device are you using? Does it still break if you disable all
extensions? I'd love to fix this.
bkor wrote 9 hours 37 min ago:
It gives "D is not a function". This is on Firefox 146. Various
extensions including uBlock Origin, but that doesn't seem to cause
it. Also doesn't work in a private window.