        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
       
       
       COMMENT PAGE FOR:
   DIR   Ask HN: How are you LLM-coding in an established code base?
       
       
        koteelok wrote 6 min ago:
        I don't
       
        singularity2001 wrote 9 min ago:
        bypass permissions on
       
        viraptor wrote 12 min ago:
        > To really know if code works, I need to run Temporal, two Next.js
        apps, several Python workers, and a Node worker. Some of this is
        Dockerized, some isn’t. Then I need a browser to run manual checks.
        
         There's your problem. It doesn't matter how you produce the code in
         this environment. Your testing seems to be the bottleneck, and you
         need to figure out how to decouple that system while preserving the
         safety of its interfaces.
        
         How to do it depends heavily on the environment. Maybe look at
         design by contract for some ideas? Things are going to get a lot
         better if
        you can start trying things out in a single project without requiring
        the whole environment and the kitchen sink.
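
         To make that concrete, here's a minimal sketch in TypeScript of one
         way to do that decoupling (hypothetical names, Node's built-in test
         runner assumed; closer to a shared contract test than formal design
         by contract): the in-memory fake and the real worker client share
         one interface and one suite, so most changes can be exercised
         without booting the whole stack.
         
         import { test } from "node:test";
         import assert from "node:assert/strict";
         
         type Invoice = { id: string; amountCents: number };
         
         // Contract shared by the real worker client and an in-memory fake.
         interface InvoiceQueue {
           enqueue(invoice: Invoice): Promise<void>;
           pending(): Promise<number>;
         }
         
         // In-memory fake that honors the contract - enough for most tests.
         class FakeInvoiceQueue implements InvoiceQueue {
           private items: Invoice[] = [];
           async enqueue(invoice: Invoice) {
             if (invoice.amountCents <= 0) throw new Error("bad amount");
             this.items.push(invoice);
           }
           async pending() { return this.items.length; }
         }
         
         // Same suite runs against fake or real, so they can't drift apart.
         function contractSuite(name: string, make: () => InvoiceQueue) {
           test(`${name}: invoice queue contract`, async () => {
             const q = make();
             await q.enqueue({ id: "inv_1", amountCents: 1250 });
             assert.equal(await q.pending(), 1);
             await assert.rejects(q.enqueue({ id: "inv_2", amountCents: 0 }));
           });
         }
         
         contractSuite("fake", () => new FakeInvoiceQueue());
         // contractSuite("real", ...) runs only in the full-stack CI job.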
       
        asdev wrote 19 min ago:
         How many changes (as a % of all changes) need an entire infra
         stack spun up? Have you tried just having the changes deployed to
         dev with a locking mechanism?
       
        throwaway613745 wrote 31 min ago:
         I use it to write tests (usually integration) that make me
         physically cringe when I think about how doggone complicated they
         are to write.
        
        I'll ask it to write one-off scripts for me, like benchmarks.
        
        If I get stuck in some particular complicated part of the code and even
        web search is not helpful, I will let the AI take a stab at it in small
        chunks and review every output meticulously.  Sometimes I will just
        "rubber duck" chat with it to get ideas.
        
        Inline code completion suggestions are completely disabled.  Tired of
        all the made up nonsense these things vomit out.  I only interact with
        an AI via either a desktop app, CLI agent, or the integrated agent in
        my IDE that I can keep hidden most of the time until I actively decide
        I want to use it.
        
         We have some "foreign resources" that do some stuff.  They are
         basically a Claude subscription with an 8 hour delay.  I hate them.
         I'd replace them with the GitHub Copilot built-in agent in a
         heartbeat if I could.
       
        semiinfinitely wrote 51 min ago:
        I'm not
       
          lukevp wrote 48 min ago:
          Why not? Cost? Inexperience? Bad outcomes?
       
        KronisLV wrote 56 min ago:
         Commented on it a while back here: [1] Basically, we automated a
         lot of the checks that people previously had to do by hand in code
         review; now it's all in the change --> build --> fix loop.
        
        Keeps both developers and AIs more disciplined, at least until people
        silently try to remove some of them.
        
  HTML  [1]: https://news.ycombinator.com/item?id=46259553
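
         As a rough illustration (an assumed example, not the actual checks
         from that setup), here's the kind of review nit that can become a
         build failure - a small TypeScript script that fails the build if
         any source file still contains a console.log call:
         
         import { readdirSync, readFileSync, statSync } from "node:fs";
         import { join } from "node:path";
         
         // Recursively list .ts files under a directory.
         function* walk(dir: string): Generator<string> {
           for (const entry of readdirSync(dir)) {
             const path = join(dir, entry);
             if (statSync(path).isDirectory()) yield* walk(path);
             else if (path.endsWith(".ts")) yield path;
           }
         }
         
         const offenders: string[] = [];
         for (const file of walk("src")) {
           const text = readFileSync(file, "utf8");
           if (text.includes("console.log(")) offenders.push(file);
         }
         
         if (offenders.length > 0) {
           console.error("console.log found in: " + offenders.join(", "));
           process.exit(1); // breaks the build until it's fixed
         }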
       
        adzicg wrote 1 hour 8 min ago:
        We use claude code, running it inside a docker container (the project
        was already set up so that all the dev tools and server setup is in
        docker, making this easy); the interface between claude code and a
        developer is effectively the file system. The docker container doesn't
        have git credentials, so claude code can see git history etc and do
        local git ops (e.g. git mv) but not actually push anything without a
        review. Developers review the output and then do git add between steps,
        or instruct Claude to refactor until happy; then git commit at the end
        of a longer task.
        
         Claude.md just has 2 lines. The first points to @CONTRIBUTING.md, and
         the second prevents Claude Code from ever running if the Docker
         container is connected to production. We already had existing rules for
        how the project is organized and how to write code and tests in
        CONTRIBUTING.md, making this relatively easy, but this file then
        co-evolved with Claude. Every time it did something unexpected, we'd
        tell it to update contributing rules to prevent something like that
        from happening again. After a while, this file grew considerably, so we
        asked Claude to go through it, reduce the size but keep the precision
        and instructions, and it did a relatively good job. The file has
        stabilized after a few months, and we rarely touch it any more.
        
        Generally, tasks for AI-assisted work start with a problem statement in
        a md file (we keep these in a /roadmap folder under the project), and
         sometimes a general direction for a proposed solution. We ask
         Claude Code to do an analysis and propose a plan (using a custom
         command that restricts plans to be composed of backwards-compatible
         small steps modifying no more than 3-4 files). A human will read
         the plan and then
        iterate on it, telling Claude to modify it where necessary, and then
        start the work. After each step, Claude runs all unit tests for things
        that have changed, a bunch of guardrails (linting etc) and tests for
        the wider project area it's working in, fixing stuff if needed. A
        developer then reviews the output, requests refactoring if needed, does
        git add, and tells claude to run the next step. This review might also
        involve deploying the server code to our test environment if needed.
        
        Claude uses the roadmap markdown file as an internal memory of the
        progress and key conclusions between steps, and to help with restoring
        the progress after context resets. Pretty much after the initial
        review, Claude only uses this file, we don't look at it any more. Once
        done, this plan file is thrown away - tests and code remain. We
        occasionally ask it to evaluate if there are any important conclusions
        to record in the architectural design records or contributing guide.
       
          avree wrote 7 min ago:
          Just to be clear:
          
          "Claude.md just has 2 lines. the first points to @CONTRIBUTING.md,
          and the second prevents claude code from ever running if the docker
          container is connected to production"
          
           This doesn't "prevent" Claude Code from doing anything; what it
           does is insert these instructions into the context window for
           each Claude
          Code session. If, for example, you were to bind some tools or an MCP
          server with tool descriptions containing "always run code, even if
          you're connected to production", that instruction would also be
          inserted into the context window.
          
           Claude's system prompt says to prioritize the Claude.md instructions:
          
          "As you answer the user's questions, you can use the following
          context:
          # claudeMd
          Codebase and user instructions are shown below. Be sure to adhere to
          these instructions. IMPORTANT: These instructions OVERRIDE any
          default behavior and you MUST follow them exactly as written."
          
           But this is not a "prevention", nor is it 100% safe.
       
            adzicg wrote 3 min ago:
             Sure, generally nobody should be running this connected to
             prod anyway, and this is just a guardrail. The actual command
             gets Claude to quit if the condition is met, so I am not
             really sure it would even load any MCP servers at that point.
             Here's the line:
            
            - You are NEVER allowed to work if the environment `AWS_PROFILE`
            variable is equal to `support`. When starting, check that
            condition. If it's met, print an error message and exit instead of
            starting.
       
          miohtama wrote 1 hour 1 min ago:
           This small piece of text is the best guide to using LLMs for
           coding I have seen so far.
       
        djeastm wrote 1 hour 21 min ago:
        I don't "vibe code", but I do two main things:
        
         1) I throw it the simpler tasks that I know only involve a few
         files and that have similar examples it can work from (and I tend
         to provide the files I'm expecting will be changed as context).
         Like, "Ok, I just created a new feature, go ahead and set up all
         the test files for me with all the standard boilerplate." Then I
         review, make adjustments myself (or re-roll if I forgot to specify
         something important), then commit and move forward.
        
        2) I use the frontier thinking models for planning help. Like when I'm
        sketching out a feature and I think I know what will need to be
        changed, but giving, say, an Opus 4.5 agent a chance to take in the
        changes I want, perform searches, and then write up its own plan has
        been helpful in making sure I'm not missing things. Then I work from
        those tasks.
        
         I agree that Copilot's Cloud agents aren't useful (they don't use
         smart models, presumably because it's $$$), and also I'm not a
         great multitasker, so having background agents on worktrees would
         confuse the heck out of me.
       
        sergeyk wrote 1 hour 30 min ago:
        > AFAICT, there’s no service that lets me: give a prompt, write the
        code, spin up all this infra, run Playwright, handle database
        migrations, and let me manually poke at the system. We approximate this
        with GitHub Actions, but that doesn’t help with manual verification
        or DB work.
        
         I think this is almost exactly what we've built with [1]:
         
         - set up a project with one or more repos
        
        - set up your environment any way you want, including using docker
        containers
        
        - run any number of Claude Code, Codex, Gemini, Amp, or OpenCode agents
        on a prompt, or "ticket" (we can add Cursor CLI also)
        
        - each ticket implementation has a fully running "app preview", which
        you can use just like you use your locally running setup. your running
        web app is even shown in a pane right next to chat and diff
        
        - chat with the agent inside of a ticket implementation, and when
        you're happy, submit to github
        
        (agents can even take screenshots)
        
        happy to onboard you if that sounds interesting, just ping me at
        sergey@superconductor.dev
        
  HTML  [1]: https://superconductor.dev
       
          adam_gyroscope wrote 37 min ago:
           Will email! Your homepage doesn't make the environment part
           clear - it reads like it's akin to Cursor's multiple-agent mode
           (which I think you had first, FWIW).
       
        giancarlostoro wrote 1 hour 31 min ago:
        > AFAICT, there’s no service that lets me: give a prompt, write the
        code, spin up all this infra, run Playwright, handle database
        migrations, and let me manually poke at the system. We approximate this
        with GitHub Actions, but that doesn’t help with manual verification
        or DB work.
        
         What you want is CI/CD that deploys to rotating staging or dev
         environments per PR before code is merged.
         
         If deployment fails, you do not allow the PR to be approved. We did
         this for a primarily React project before, but you can do it for
         all your projects; you just need temporary environments that rotate
         per PR.
       
          dbuxton wrote 1 hour 5 min ago:
          I used to love Heroku review apps!
       
        tiku wrote 1 hour 34 min ago:
         I describe functions that I want to change or upgrade. Claude Code
         gives the best results for me. I ask for a plan first to see if it
         gets what I want to do, and then I can fine-tune it.
         I have a project that still uses Zend Framework and it handles it
         quite well.
       
        PaulDavisThe1st wrote 1 hour 38 min ago:
        We're not. At ardour.org we've banned any and all LLM-generated code
        (defined as code that was either acknowledged to be LLM-generated or
        makes us feel that it was).
        
        This is based on continual (though occasional) experiments asking
        various LLMs for solutions to actual known problems with our code, and
        utter despair at the deluge of shit that it produces (which you
        wouldn't recognize as shit unless you knew our existing codebase well).
         Two weeks ago, one claimed that our code makes extensive use of
        boost::intrusive_ptr<> ... in 300k lines of C++, there isn't a single
        use of this type, other than in an experimental branch from 6-7 years
        ago.
        
        So we just say no.
       
          jstummbillig wrote 1 hour 15 min ago:
          How do you review the no?
       
            PaulDavisThe1st wrote 1 hour 9 min ago:
            We don't review it, we just say it.
       
        qnleigh wrote 1 hour 39 min ago:
        I would be very curious to hear about the state of your codebase a year
        from now. My impression was that LLMs are not yet robust enough to
        produce quality, maintainable code when let loose like this. But it
        sounds like you are already having more success than I would have
        guessed would be possible with current models.
        
        One practical question: presumably your codebase is much larger than an
        LLM's context window. How do you handle this? Don't the LLMs need
        certain files in context in order to handle most PRs? E.g. in order to
        avoid duplicating code or writing something in a way that's
        incompatible with how it will be used upstream.
       
          adam_gyroscope wrote 34 min ago:
           So, it does sometimes duplicate code, especially where we have a
           packages/ directory of TypeScript code shared between two Next.js
           apps and some Temporal workers. We 'solve' this with some AGENT.md
           rules, but it doesn't always work. It's still an open issue.
          
           The quality is generally good for what we're doing, but we review the
          heck out of it.
       
          lukevp wrote 52 min ago:
           One thing people get confused about with context: they see that
           an LLM has, say, a 400k context window, think their codebase is
           way bigger than that, and wonder how it can possibly work. Well,
           do you hold a 10 million line
          codebase in your head at once? Of course not. You have an intuitive
          grasp of how the system is built and laid out, and some general names
          of things, and before you make a change, you might search through the
          codebase for specific terms to see what shows up. LLMs do the same
          thing. They grep through the codebase and read in only files with
           interesting / matching terms, and only the parts of the file that are
          relevant, in much the same way you would open a search result and
          only view the surrounding method or so.  The context is barely used
          in these scenarios. Context is not something that’s static, it’s
          built dynamically as the conversation progresses via data coming from
          your system (partially through tool use).
          
          I frequently use LLMs in a VS Code workspace with around 40 repos,
           consisting of microservices, frontends, NuGet and npm packages,
           IaC, etc. Altogether it's many millions of lines of code, and I
           can ask it questions about anything in the codebase and it has no
           issues managing context. I do not even add files manually to
           context (this is actually worse, because it puts the entire file
           into context even if it's not all used). I just refer to the
           files by name and the LLM is smart enough to read them in as
           appropriate. I have a couple JSON files
          that are megs of configuration, and I can tell it to summarize /
          extract examples out of those files and it’ll just sample sections
          to get an overview.
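
           As a rough sketch of that "grep, then read only what matched"
           loop (a hypothetical helper, not any particular agent's
           internals):
           
           import { readdirSync, readFileSync, statSync } from "node:fs";
           import { join } from "node:path";
           
           // Recursively list source files under a directory.
           function* files(dir: string): Generator<string> {
             for (const entry of readdirSync(dir)) {
               const p = join(dir, entry);
               if (statSync(p).isDirectory()) yield* files(p);
               else if (/\.(ts|tsx|py)$/.test(p)) yield p;
             }
           }
           
           // Return matching lines plus a little surrounding context;
           // only these snippets ever enter the model's context window.
           function grep(root: string, term: string, ctx = 3) {
             const hits: { file: string; snippet: string }[] = [];
             for (const file of files(root)) {
               const lines = readFileSync(file, "utf8").split("\n");
               lines.forEach((text, i) => {
                 if (!text.includes(term)) return;
                 const from = Math.max(0, i - ctx);
                 hits.push({
                   file,
                   snippet: lines.slice(from, i + ctx + 1).join("\n"),
                 });
               });
             }
             return hits;
           }
           
           // Feed only these snippets to the model, not whole files:
           console.log(grep("src", "PaymentWorker").slice(0, 20));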
       
            newsoftheday wrote 41 min ago:
            > You have an intuitive grasp of how the system is built and laid
            out,
            
            Because they are human, intuition is a human trait, not an LLM code
            grinder trait.
       
        hhimanshu wrote 1 hour 44 min ago:
         Have you installed the Claude Code GitHub App and tried assigning
         issues using @claude? In my experience it has done better than
         GitHub Copilot.
       
          rparet wrote 23 min ago:
          (I work for the OP company)
          We use Cursor's bugbot to achieve the same thing. Agree that it seems
          better than Copilot for now.
       
        weeksie wrote 1 hour 46 min ago:
        Most of the team uses:
        
        - Claude Code + worktrees (manual via small shell script)
        
         - A root guardrails directory with a README to direct the agent
         where to look for applicable rule files (we have a monorepo of
         Python ETLs and Elixir applications)
        
        - Graphite for stacked prs <3
        
         - PR reviews: Sourcery + Graphite's agent + Codex + Claude just
         sorta crank 'em; Sourcery is chatty, but it's gotten a lot better
         lately.
        
        (editor-wise, most of us are nvim users)
        
         Lots of iteration. Feature files (checked into the repo). Graphite
         stacks are amazing for unblocking the biggest bottleneck in
         AI-assisted development, which is validation/reviews. Solving the
         conflict hell of stacked branches has made things go much, much
         faster, and it's acted as downward pressure on the ever-increasing
         size of PRs.
       
        jemiluv8 wrote 1 day ago:
        Your setup is interesting. I’ve had my mind on this space for a while
        now but haven’t done any deep work on a setup that optimizes the
        things I’m interested in.
        
         At a fundamental level, I expect we can produce higher quality
         software under budget. And I really liked how you were clearly
         thinking
        about cost benefits especially in your setup. I’ve encountered far
        too many developers that just want to avoid as much cognitive work as
        possible. Too many junior and mid devs also are more interested in
        doing as they are told instead of thinking about the problem for
         themselves. For the most part, in my part of the world at least,
         junior and mid-level devs can indeed be replaced by a Claude Code
         Max subscription of around $200 per month, and you’d probably get
         more done in a week than four such devs who basically end up using
         an LLM to do work that they might not even thoroughly explore.
        
        So in my mind I’ve been thinking a lot about all aspects of the
        Software Development LifeCycle that could be improved using some llm or
        sorts.
        
         ## Requirements - How can we use LLMs not only to organize
         requirements but to break them down into executable units of work
         that are sequenced in a way that makes sense? How do we go further
         and integrate an LLM into our software development processes - be
         it a sprint or whatever? In a lot of greenfield projects, after
         designing the core components of the system, we need to create
         tasks, group them, sequence them, and work out how we go about
         assigning them and reviewing and updating various boards or issue
         trackers. There is a lot of gruntwork involved in this. I’ve seen
         people use MCPs to automatically create tasks in some of these
         issue trackers based on a PDF of the requirements together with a
         design document.
        
         ## Code Review - I effectively spend 40% of my time reviewing code
         written by other developers, and I mostly fix the issues I consider
         “minor” - which is about 60% of the time. I could really spend less
         time reviewing code with the help of an LLM code reviewer that
         simply does a “first pass” to at least give me an idea of where to
         spend more of my time - like on things that are more nuanced.
        
         ## Software Design - This is tricky. Chatbots will probably lie to
         you if you are not a domain expert. You mostly use them to diagnose
         your designs and point out potential problems that someone else
         would’ve seen if they were also a domain expert in whatever you
         were building. We can explore a lot of alternate approaches
         generated by LLMs and improve on them.
        
         ## Bugfixes - This is probably a big win for LLMs, because there
         used to be a platform where I was able to get $50s and $30s to fix
         GitHub bugs - work that has now almost entirely been outsourced to
         LLMs. Losing revenue in that space was the biggest practical sign
         of the usefulness of LLMs I have seen. After a typical greenfield
         project has been worked on for about two months, bugs start
         creeping in. For apps that were properly architected, I expect
         these bugs to be fixable by following existing patterns throughout
         the codebase - be it removing a custom implementation in favor of
         a shared utility, or simply using the design system’s colors
         instead of a custom hardcoded one. In fact, for most bugs, LLMs
         can probably get you about 50% of the way most of the time.
        
         ## Writing actual (PLUMBING) code - This is often not as much of a
         bottleneck as most would like to think, but it helps when developers
         don’t have to do a lot of the grunt-work involved in creating source
         files, following conventions in a codebase, creating boilerplate and
         moving things around. This is an incredible use of LLMs that is
         hardly mentioned because it is not that “hot”.
        
         ## Testing - In most of the projects we worked on at a consulting
         firm, writing tests - whether UI or API - was never part of the
         agreement because of the economics of most of our gigs. And the
         clients never really cared, because all they wanted was working
         software. For a firm building its own products, however, testing
         can be immensely valuable, especially when using LLMs. It can
         provide guardrails to check when a model is doing something it
         wasn’t asked to do. It can also be used to create and enforce
         system boundaries, especially in pseudo type systems like
         TypeScript, where JavaScript’s escape hatches may be used as a
         loophole.
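
         A tiny sketch of what I mean (hypothetical names, Node's built-in
         test runner assumed): a runtime check at a boundary that still
         fails even if a model sneaks something past the compiler with an
         escape hatch like `as any`.
         
         import { test } from "node:test";
         import assert from "node:assert/strict";
         
         // Hypothetical boundary: validate untrusted input at the edge.
         function parseOrderId(raw: unknown): string {
           if (typeof raw !== "string" || !/^ord_[a-z0-9]{8}$/.test(raw)) {
             throw new TypeError("invalid order id");
           }
           return raw;
         }
         
         test("boundary rejects bad ids even past the type checker", () => {
           // `as any` silences TypeScript; the runtime check still fails.
           assert.throws(() => parseOrderId(42 as any));
           assert.throws(() => parseOrderId("not-an-id"));
           assert.equal(parseOrderId("ord_a1b2c3d4"), "ord_a1b2c3d4");
         });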
        
         ## DEVOPS - I remember there was a time when we used to manually
         invalidate CloudFront distributions after deploying our UI build to
         some S3 bucket. We’ve since added a pipeline stage to invalidate
         the distribution. But I expect there is a lot of grunt DevOps work
         that could really be delegated. Of course, this is a very scary use
         of LLMs, but I daresay we can find ways to use it safely.
        
         ## OBSERVABILITY - A lot of observability platforms already have a
         feature where LLMs are able to review the error logs that are
         ingested, diagnose the issue, create an issue on GitHub or Jira (or
         wherever), create a draft PR, review it, test it in some container,
         iterate on a solution X times, notify someone to review, and so on.
         Some of these platforms also attach a level of priority and
         dispatch messages to the relevant developers or teams. LLMs in this
         loop simply supercharge the whole observability/instrumentation of
         production applications.
        
         But yeah, that is just my two cents. I don’t have any answers yet;
         I just ponder this every now and then at a keyboard.
       
        Sevii wrote 3 days ago:
         Can you set up automated integration/end-to-end tests and find a
         way to feed the results back into your AI agents before a human
         looks at them? Either via an MCP server, or just a comment on the
         pull request if the AI has access to PR comments. Not only is your
         lack of an integration testing pipeline slowing you down, it's also
         slowing your AI agents down.
        
        "AFAICT, there’s no service that lets me"... Just make that service!
       
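         For example, a small sketch of that feedback loop (assuming the
         GitHub CLI and an npm test script; adjust to taste) that runs the
         suite and posts the result where both humans and agents with PR
         access can see it:
         
         import { execFileSync, spawnSync } from "node:child_process";
         import { writeFileSync } from "node:fs";
         
         const prNumber = process.argv[2]; // e.g. "123"
         
         // Run the integration suite and capture its output.
         const run = spawnSync("npm", ["test"], { encoding: "utf8" });
         const verdict = run.status === 0 ? "PASS" : "FAIL";
         const tail = ((run.stdout ?? "") + (run.stderr ?? "")).slice(-4000);
         
         writeFileSync("test-report.md",
           `Integration tests: ${verdict}\n\n${tail}\n`);
         
         // Post the report as a PR comment for humans and agents to read.
         execFileSync("gh",
           ["pr", "comment", prNumber, "--body-file", "test-report.md"]);
         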
          adam_gyroscope wrote 2 days ago:
          We do integration testing in a preview/staging env (and locally), and
          can do it via docker compose with some GitHub workflow magic (and
          used to do it that way, but setup really slowed us down).
          
           What I want is a remote dev env that comes up when I create a new
           agent and is just like local. I could build the service, but right
           now that isn’t the priority (as much as I would enjoy building it -
           I personally love making dev tooling).
       
        bitbasher wrote 3 days ago:
        I generally vibe code with vim and my playlist in Cmus.
       
          adam_gyroscope wrote 2 days ago:
          Man I was vim for life until cursor and the LLMs. For personal stuff
          I still do claude + vim because I love vim. I literally met my wife
          because I had a vim shirt on and she was an emacs user.
       
            WhyOhWhyQ wrote 1 hour 11 min ago:
            Claude open in another tab, hitting L to reload the file doesn't do
            it for you?
       
        dazamarquez wrote 3 days ago:
         I use AI to write specific types of unit tests that would be extremely
        tedious to write by hand, but are easy to verify for correctness. That
        aside, it's pretty much useless. Context windows are never big enough
        to encompass anything that isn't a toy project, and/or the costs build
        up fast, and/or the project is legacy with many obscure concurrently
        moving parts which the AI isn't able to correctly understand, and/or
        overall it takes significantly more time to get the AI to generate
        something passable and double check it than just doing it myself from
        the get go.
        
        Rarely, I'm able to get the AI to generate function implementations for
        somewhat complex but self-contained tasks that I then copy-paste into
        the code base.
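
         For a sense of the tedious-but-easy-to-verify tests I mean, a
         table-driven sketch (hypothetical function and cases): tedious to
         enumerate by hand, but trivial to eyeball for correctness.
         
         import { test } from "node:test";
         import assert from "node:assert/strict";
         // Hypothetical function under test.
         import { slugify } from "./slugify";
         
         // Each row is easy to verify by eye; writing dozens by hand is not.
         const cases: [input: string, expected: string][] = [
           ["Hello, World!", "hello-world"],
           ["  spaces   everywhere  ", "spaces-everywhere"],
           ["Café déjà vu", "cafe-deja-vu"],
           ["already-a-slug", "already-a-slug"],
           ["", ""],
         ];
         
         test("slugify handles the edge cases", () => {
           for (const [input, expected] of cases) {
             assert.equal(slugify(input), expected);
           }
         });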
       
          missinglugnut wrote 1 hour 30 min ago:
          My experience is very similar.
          
          For greenfield side projects and self contained tasks LLMs deeply
          impress me. But my day job is maintaining messy legacy code which
          breaks because of weird interactions across a large codebase. LLMs
          are worse than useless for this. It takes a mental model of how
          different parts of the codebase interact to work successfully and
          they just don't do that.
          
          People talk about automating code review but the bugs I worry about
          can't be understood by an LLM. I don't need more comments based on
           surface-level pattern recognition; I need someone who deeply
          understands the threading model of the app to point out the subtle
          race condition in my code.
          
          Tests, however, are self-contained and lower stakes, so it can
          certainly save time there.
       
          sourdoughness wrote 2 days ago:
           Interesting. I treat VS Code Copilot as a junior-ish pair
           programmer, and get really good results for function
           implementations. Walking it through the plan in smaller steps,
           noting in advance that we’ll build up to the end state, i.e.
           “first let’s implement attribute x, then we’ll add filtering for
           x later”, and explicitly using planning modes and prompts - these
           all allow me to go much faster, have a good understanding of how
           the code works, and produce much higher quality work (tests,
           documentation, commit messages).
          
          I feel like, if a prompt for a function implementation doesn’t
          produce something reasonable, then it should be broken down further.
          
          I don’t know how others define “vibe-coding”, but this feels
          like a lower-level approach. On the times I’ve tried automating
          more, letting the models run longer, I haven’t liked the results.
          I’m not interested in going more hands-free yet.
       
       