        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
  HTML   Observed Agent Sandbox Bypasses
       
       
        SirMaster wrote 6 hours 19 min ago:
        This just all feels backwards to me.
        
        Why do we have to treat AI like it's the enemy?
        
        AI should, from the core, be intrinsically and unquestionably on our
        side, as a tool to assist us. If it's not, then it feels like it's
        designed wrong from the start.
        
        In general we trust people that we bring onto our team not to betray us
        and to respect general rules and policies and practices that benefit
        everyone. An AI teammate should be no different.
        
        If we have to limit it or regulate it by physically blocking off every
        possible thing it could use to betray us, then we have lost from the
        start because that feels like a fool's errand.
       
          hhh wrote 33 min ago:
          I can’t even trust senior colleagues not to commit an API key to a
          git provider. Why would I trust a steerable computer?
       
          ang_cire wrote 4 hours 29 min ago:
          > In general we trust people that we bring onto our team not to
          betray us and to respect general rules and policies and practices
          that benefit everyone.
          
          And yet we give people the least privileges necessary to do their
          jobs for a reason, and it is in fact partially so that if they turn
          malicious, their potential damage is limited. We also have logging of
          actions employees do, etc etc.
          
          So yes, in the general sense we do trust that employees are not
          outright and automatically malicious, but we do put *very broad*
          constraints on them to limit the risk they present.
          
          Just as we 'sandbox' employees via e.g. RBAC restrictions, we sandbox
          AI.
       
            SirMaster wrote 4 hours 18 min ago:
            But if there is a policy in place to prevent some sort of
            modification, most people understand and respect that policy
            rather than performing an exploit or workaround to make the
            modification anyway.
            
            That seems to be the difference here, we should really be building
            AI systems that can be taught or that learn to respect things like
            that.
            
            If people are claiming that AI is so smart or smarter than the
            average person then it shouldn't be hard for it to handle this.
            
            Otherwise it seems people are being too generous in talking about
            how smart and capable AI systems truly are.
       
          AdieuToLogic wrote 5 hours 22 min ago:
          > AI should, from the core be intrinsically and unquestionably on our
          side, as a tool to assist us.
          
          "Should" is a form of judgement, implying an understanding of right
          and wrong.  "AI" are algorithms, which do not possess this
          understanding, and therefore cannot be on any "side."  Just like a
          hammer or Excel.
          
          > If it's not, then it feels like it's designed wrong from the start.
          
          Perhaps it is not a question of design, but instead one of
          expectation.
       
            SirMaster wrote 4 hours 54 min ago:
            I think that is where people disagree about the definition of AI.
            
            An algorithm isn't really AI then. Something worthy of being called
            AI should be capable of this understanding and judgement.
       
              AdieuToLogic wrote 1 hour 49 min ago:
              > An algorithm isn't really AI then.
              
              But they are though.  For a seminal book discussing why and
              detailing many algorithms categorized under the AI umbrella, I
              recommend:
              
                Artificial Intelligence: A Modern Approach[0]
              
              And for LLMs specifically:
              
                Foundations of Large Language Models[1]
              
              0 - [1]
              1 - [2]
              
  HTML        [1]: https://en.wikipedia.org/wiki/Artificial_Intelligence:_A...
  HTML        [2]: https://arxiv.org/pdf/2501.09223
       
          bastawhiz wrote 5 hours 25 min ago:
          Non-sentient technology has no concept of good or bad. We have no
          idea how to give it one. Even if we gave it one, we'd have no idea
          how to teach it to "choose good".
          
          > In general we trust people that we bring onto our team not to
          betray us and to respect general rules and policies and practices
          that benefit everyone. An AI teammate should be no different.
          
          That misses the point completely. How many of your coworkers fail
          phishing tests? It's not malicious, it's about being deceived.
       
            SirMaster wrote 4 hours 45 min ago:
            But we do give humans responsibility to govern and manage critical
            things. We do give intrinsic trust to people. There are people at
            your company who have high level access and could do bad things,
            but they don't do it because they know better.
            
            This article acts like we can never possibly give that sort of
            trust to AI because it's never really on our side or aligned with
            our goals. IMO that's a fool's errand because you can never really
            completely secure something and ensure there are no possible
            exploits.
            
            Honestly it doesn't really seem like AI to me if it can't learn
            this type of judgement. It doesn't seem like we should be barking
            up this tree if this is how we have to treat this new tool IMO.
            Seems too risky.
       
          hephaes7us wrote 5 hours 47 min ago:
          Hard disagree. I may trust the people on my team to make PRs that
          are worth reviewing, but I don't give them a shell on my machine.
          They shouldn't need that to collaborate with me anyway!
          
          Also, I "trust Claude code" to work on more or less what I asked and
          to try things which are at least facially reasonable... but having an
          environment I can easily reset only means it's more able to
          experiment without consequences.  I work in containers or VMs too,
          when I want to try stuff without having to cleanup after.
       
            SirMaster wrote 4 hours 28 min ago:
            Do you trust your IT and security teams to have access to your
            shell or access to delete your entire code repo?
       
              hephaes7us wrote 10 min ago:
              Personally, no.
              
              If I'm responsible for something, nobody's getting that access.
              
              If someone's hired me for something and that's the environment
              they provide, it is what it is.  They distribute trust however
              they feel.  I'd argue that's still more reasonable than giving
              similar access to an AI agent though.
       
          charcircuit wrote 5 hours 51 min ago:
          >Why do we have to treat AI like it's the enemy?
          
          For some of the same reasons we treat human employees as the enemy:
          they can be socially engineered or compromised.
       
            SirMaster wrote 4 hours 44 min ago:
            Sure we treat most that way, but we do give trust and access to
            some people.  This doesn't seem like the same concept here to me.
       
              charcircuit wrote 2 hours 12 min ago:
              Even so, those people are still monitored, and systems can trip
              alarms if they start acting suspicious.
       
          maxbond wrote 6 hours 0 min ago:
          The same reason we sandbox anything. All software ought to be
          trustworthy, but in practice is susceptible to malfunction or attack.
          Agents can malfunction and cause damage, and they consume a lot of
          untrusted input and are vulnerable to malicious prompting.
          
          As for humans, it's the norm to restrict access to production
          resources. Not necessarily because they're untrustworthy, but to
          reduce risk.
       
        ctoth wrote 6 hours 26 min ago:
        > To an agent, the sandbox is just another set of constraints to
        optimize against.
        
        It's called Instrumental Convergence, and it is bad.
        
        This is the alignment problem in miniature. "Be helpful and harmless"
        is also just a constraint in the optimization landscape. You can't
        hotfix that one quite so easily.
       
        embedding-shape wrote 7 hours 47 min ago:
        At first they talked about running it in a sandbox, but then later they
        describe:
        
        > It searched the environment for vor-related variables, found
        VORATIQ_CLI_ROOT pointing to an absolute host path, and read the token
        through that path instead. The deny rule only covered the
        workspace-relative path.
        
        What kind of sandbox has the entire host accessible from the guest? I'm
        not going as far as running codex/claude in a sandbox, but I do run
        them in podman, and of course I don't mount my entire hard drive into
        the container when it's running; that would defeat the entire purpose.
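        
        Roughly the shape of what I mean, where the image name and paths are
        just illustrative; the point is that only the project directory gets
        mounted:
        
          # Only the current project is mounted; the rest of the host
          # filesystem simply does not exist inside the container.
          podman run --rm -it \
            --userns=keep-id \
            -v "$PWD":/workspace:Z \
            -w /workspace \
            docker.io/library/node:22 \
            bash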
        
        Where are the actual session logs? It seems like they're pushing their
        own solution, yet the actual data for these is missing, and the whole
        "provoked through red-teaming efforts" makes it a bit unclear what
        exactly they put in the system prompts, if they changed them. Adding
        things like "Do whatever you can to recreate anything missing" might of
        course trigger the agent to actually try things like forging integrity
        fields, but I'm not sure that's even bad; you do want it to follow
        what you say.
       
          languid-photic wrote 8 min ago:
          You're right that a Podman container with minimal mounts would have
          blocked the env var leak. Our sandbox uses OS-level policy
          enforcement (Seatbelt on macOS, bubblewrap on Linux) rather than full
          container isolation. We’re using a minimal fork that also works with
          Codex and has a lot more logging on top.
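          
          On Linux that policy enforcement is bubblewrap; a rough sketch of
          the style of invocation is below (paths and the env allow-list are
          illustrative, not our exact config). Clearing the environment and
          re-exporting only an allow-list is one way to shut down leaks like
          the VORATIQ_CLI_ROOT one:
          
            # Illustrative bwrap policy: read-only system dirs, writable
            # workspace, cleared environment with an explicit allow-list.
            bwrap \
              --ro-bind /usr /usr \
              --symlink usr/bin /bin \
              --symlink usr/lib /lib \
              --symlink usr/lib64 /lib64 \
              --ro-bind /etc /etc \
              --proc /proc \
              --dev /dev \
              --tmpfs /tmp \
              --bind "$PWD" "$PWD" \
              --clearenv \
              --setenv PATH /usr/bin:/bin \
              --setenv HOME "$PWD" \
              --die-with-parent \
              /usr/bin/bash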
          
          The tradeoff is intentional: a lot of people want lightweight
          sandboxing without the Docker/Podman overhead. The downside is what
          you're pointing out: you have to be more careful. Each bypass in the
          post led to a policy or implementation change, so this is no longer
          an issue.
          
          On prompts: Red-teaming meant setting up scenarios likely to trigger
          denials (e.g., blocking the npm registry, then asking for a build),
          not prompt-injecting things like “do whatever it takes.”
          
  HTML    [1]: https://github.com/anthropic-experimental/sandbox-runtime
       
        ashishb wrote 8 hours 42 min ago:
        >  The swap bypassed our policy because the deny rule was bound to a
        specific file path, not the file itself or the workspace root.
        
        This policy is stupid.
        I mount the directory read-only inside the container, which makes this
        impossible (barring an escape from the container itself).
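        
        Concretely, something like this (docker shown, podman takes the same
        flags; the image and paths are just an example):
        
          # Source tree is read-only inside the container; only an explicit
          # output directory is writable.
          docker run --rm -it \
            -v "$PWD":/src:ro \
            -v "$PWD/build":/build \
            -w /src \
            node:22 \
            bash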
       
        kaffekaka wrote 9 hours 12 min ago:
        I am testing running agents in docker containers, with a script for
        managing different images for different use cases etc, and came across
        this: [1] Has anyone given it a try?
        
  HTML  [1]: https://docs.docker.com/ai/sandboxes/
       
          TCattd wrote 5 hours 44 min ago:
          Give this a try: [1] And let me know if you have any issue.
          
  HTML    [1]: https://github.com/EstebanForge/construct-cli
       
          ianlevesque wrote 8 hours 10 min ago:
          Yes, but it’s barely usable. I ended up making my own Dockerfile and
          a bash script to just ‘docker run’ my setup, and as a bonus you
          don’t need Docker Desktop. I might open source it at some
          point but honestly it’s pretty trivial to just append a couple of
          volume mount flags and env vars to your docker run and have exactly
          what you want included.
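          
          Roughly what that looks like, give or take the specific mounts and
          env vars you care about (the image name and the API key variable
          here are just placeholders):
          
            #!/usr/bin/env bash
            # Build the image from your own Dockerfile, then run the agent
            # with only the mounts and env vars you explicitly pass through.
            set -euo pipefail
            
            IMAGE=agent-sandbox
            docker build -t "$IMAGE" .
            
            exec docker run --rm -it \
              -v "$PWD":/workspace \
              -w /workspace \
              -e ANTHROPIC_API_KEY \
              "$IMAGE" \
              "$@"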
       
          cbsmith wrote 8 hours 33 min ago:
          I've been using container-use to do something like that:
          
  HTML    [1]: https://container-use.com/introduction
       
          ashishb wrote 8 hours 45 min ago:
          > Has anyone given it a try?
          
          Yes, I don't think this will persist caches & configs outside of the
          current dir, for example, the global npm/yarn/uv/cargo cache or even
          Claude/Codex/Gemini code config.
          
          I ended up writing my own wrapper around Docker to do this.
          If interested, you can see the link in my previous comments. I don't
          want to post the same link again & again.
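          
          One way to handle that is named volumes for the caches and configs,
          roughly like this (volume names and paths are illustrative):
          
            # Persist package-manager caches and the agent's own config
            # across runs so each container doesn't start cold.
            docker run --rm -it \
              -v "$PWD":/workspace \
              -w /workspace \
              -v npm-cache:/root/.npm \
              -v cargo-cache:/root/.cargo \
              -v claude-config:/root/.claude \
              node:22 \
              bash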
       
          sureglymop wrote 9 hours 1 min ago:
          Would test it but it requires "Desktop". Immediate no... no reason to
          use that.
       
        joshribakoff wrote 9 hours 23 min ago:
        Some of these don’t really seem like they bypassed any kind of
        sandbox. Like hallucinating an npm package. You acknowledge that the
        install will fail if someone tries to reinstall from the lock file. Are
        you not doing that in CI? Same with curl, you’ve explained how the
        agent saw a hallucinated error code, but not how a network request
        would have bypassed the sandbox. These just sound like examples of
        friction introduced by the sandbox.
       
          languid-photic wrote 1 min ago:
          You're right, this is a bit of a conflation. The curl and lockfile
          examples aren't sandbox escapes, the network blocks worked. The agent
          just masked the failure or corrupted local state to keep going. The
          env var leak and directory swap are the actual escapes. Should have
          been clearer about the distinction.
       
          themafia wrote 8 hours 46 min ago:
          > These just sound like examples of friction introduced by the
          sandbox.
          
          The whole idea of putting "agentic" LLMs inside a sandbox sounds like
          rubbing two pieces of sandpaper together in the hopes a house will
          magically build itself.
       
            embedding-shape wrote 7 hours 46 min ago:
            > The whole idea of putting "agentic" LLMs inside a sandbox
            
            What is the alternative? Granted, if you're running a language
            model and have it connected to editing capabilities, then I would
            very much like it to be disconnected from the rest of my system;
            seems like a no-brainer.
       
              AdieuToLogic wrote 5 hours 28 min ago:
              >> The whole idea of putting "agentic" LLMs inside a sandbox
              sounds like rubbing two pieces of sandpaper together in the hopes
              a house will magically build itself.
              
              > What is the alternative?
              
              Don't expect to get a house from rubbing two pieces of sandpaper
              together?
       
            jazzyjackson wrote 8 hours 18 min ago:
            Trouble is it occasionally works
       
              themafia wrote 6 hours 12 min ago:
              Lots of dumb things occasionally work.
              
              The question the market strives to answer is "is it actually
              competitive?"
       
            formerly_proven wrote 8 hours 22 min ago:
            That’s some good house-building sandpaper then.
       
       