Show HN: I built a GPT-4 bot which builds software incrementally

The ability to synthesize a relatively short snippet of code has already been demonstrated. But I thought it would be interesting to test whether GPT-4 can replace a programmer completely.

To do that, the AI needs to plan its actions and work on the code incrementally, one piece at a time.

The challenge is the context size: the entire code base + plan does not fit into the context.

My approach: Add only relevant parts of the code base to the context.

Specifically, the generation engine implements two distinct phases: planning and coding.

In the planning phase, GPT-4 receives a tree of tasks and a summary of code base (list of files and their descriptions). It replies with updated tasks (i.e. it is able to create sub-tasks as needed), the task it wants to work on in the next step and a list of relevant code fragments for that task.
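
To make the planning input concrete, here is a minimal sketch of how a task tree and a file summary could be rendered into one prompt. This is not the actual script: the `Task` class, `render_tasks`, `build_planning_prompt`, and the `TASKS:`/`FILES:` section labels are all hypothetical names chosen for illustration, and the status values are borrowed from the "custom" prompt quoted in the comments.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One node of the task tree (hypothetical structure)."""
    id: str
    description: str
    status: str = "TODO"  # assumed values: DONE, PARTIAL, TODO
    subtasks: list["Task"] = field(default_factory=list)


def render_tasks(task: Task, indent: int = 0) -> str:
    """Render the task tree as indented, YAML-like text."""
    pad = " " * indent
    lines = [
        f"{pad}- id: {task.id}",
        f"{pad}  status: {task.status}",
        f"{pad}  description: {task.description}",
    ]
    if task.subtasks:
        lines.append(f"{pad}  subtasks:")
        for sub in task.subtasks:
            lines.append(render_tasks(sub, indent + 4))
    return "\n".join(lines)


def build_planning_prompt(root: Task, file_summaries: dict[str, str]) -> str:
    """Combine the task tree with a one-line description per file."""
    files = "\n".join(
        f"{path}: {desc}" for path, desc in sorted(file_summaries.items())
    )
    return f"TASKS:\n{render_tasks(root)}\n\nFILES:\n{files or '(none yet)'}"


if __name__ == "__main__":
    root = Task(
        "1",
        "Write a reddit-like backend in Kotlin, using Ktor",
        subtasks=[Task("1.1", "Create the Post model")],
    )
    print(build_planning_prompt(root, {"src/Post.kt": "Post data class"}))
```

The point of the sketch is that only file descriptions go into the planning prompt, not file contents, so its size grows with the number of files rather than with the amount of code.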

In the coding phase, it receives the task description (as a tree, in YAML) and relevant code fragments. It replies with new generated or updated files, code fragments, and status: Was the task done? Do we need to break it into subtasks?
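
As an illustration of what handling such a reply might involve, here is a rough sketch of a parser. It assumes the marker format from the "custom" prompt quoted in the comments (`/// BEGIN_FILE`, `/// BEGIN <marker>`, `CODE_GENERATION_STATUS`); it is not the actual script, and `parse_reply` and the regex names are hypothetical.

```python
import re
from typing import Optional

# A file block: "/// BEGIN_FILE <path>: <description>" ... "/// END_FILE <path>"
FILE_RE = re.compile(
    r"^/// BEGIN_FILE (?P<path>[^:\n]+):[^\n]*\n(?P<body>.*?)^/// END_FILE[^\n]*$",
    re.MULTILINE | re.DOTALL,
)
# A fragment inside a file: "/// BEGIN <marker>: <description>" ... "/// END <marker>"
FRAGMENT_RE = re.compile(
    r"^/// BEGIN (?P<marker>[^:\n]+):[^\n]*\n(?P<code>.*?)^/// END (?P=marker)[ \t]*$",
    re.MULTILINE | re.DOTALL,
)
STATUS_RE = re.compile(
    r"^CODE_GENERATION_STATUS:\s*(COMPLETE|PARTIAL|REDO)\b", re.MULTILINE
)


def parse_reply(reply: str) -> tuple[dict[str, dict[str, str]], Optional[str]]:
    """Return ({path: {marker: code}}, status); status is None if absent."""
    files: dict[str, dict[str, str]] = {}
    for file_match in FILE_RE.finditer(reply):
        fragments = {
            m.group("marker").strip(): m.group("code")
            for m in FRAGMENT_RE.finditer(file_match.group("body"))
        }
        files[file_match.group("path").strip()] = fragments
    status_match = STATUS_RE.search(reply)
    return files, status_match.group(1) if status_match else None


if __name__ == "__main__":
    sample = (
        "// Observations: the code base is empty\n"
        "/// BEGIN_FILE src/Post.kt: Post model\n"
        "/// BEGIN post_class: data class for posts\n"
        "data class Post(val id: Int)\n"
        "/// END post_class\n"
        "/// END_FILE src/Post.kt\n"
        "CODE_GENERATION_STATUS: PARTIAL\n"
    )
    print(parse_reply(sample))
```

A caller could then mark the task as done on `COMPLETE` and go back to the planning phase with the updated subtasks on `PARTIAL` or `REDO`; that routing is my reading of the status names, not something the sketch implements.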

In both phases the bot can also write out its "observations" before the output, as I believe this helps with both planning and code generation.

Results: Currently I have only tested extremely basic scenarios. It needs a lot of work to be usable in practice. But I'd say it seems to work more-or-less as expected.

Example 1: "Write a reddit-like backend in Kotlin, using Ktor. Start by planning and creating subtasks."

This was the entire task the bot received; it got no other data.

Results:

Link to output: https://gist.github.com/killerstorm/dd6e26dc80064b7fc731d583f8d740c1#file-ktor_reddit-txt-L9

In short, it formulated reasonable-sounding subtasks and started generating code, e.g. it made a Post model. The run was aborted at that step due to a GPT-4 API failure; the API is not reliable yet.

Example 2: "Write a reddit clone in TypeScript. Start by planning and creating subtasks."

Link to output: https://gist.github.com/killerstorm/e3c50bea3ca3463c8b2d947dcfd80b84

You can see more work here, but I expect that it's less interesting.

Challenges: I'd say it can work pretty well in file-at-once mode. Making _fragments_ of the file is more challenging because it's not a well-defined concept. FWIW GPT-4 largely ignored what I wrote about file fragments and made entire files at once, which was the right decision.

I will post a link to the script in a comment on this post.

  • > In the planning phase, GPT-4 receives a tree of tasks and a summary of code base (list of files and their descriptions). It replies with updated tasks (i.e. it is able to create sub-tasks as needed), the task it wants to work on in the next step and a list of relevant code fragments for that task.

    Are you passing the original task in each prompt? If not I think that it's going to lose context of what it's trying to build overall.

    How are you deciding which code snippets are relevant enough to send?

  • I tried three formats:

      1. All-YAML
      2. All-XML
      3. Custom parsing for code fragments + YAML
    
    I first tried it on GPT-3.5 (gpt-3.5-turbo, aka ChatGPT) and it was really struggling with formatting: sometimes it got it right, sometimes wrong.

    For GPT-4 I used a custom listing-style representation and it kind of just worked. I later retried with YAML and XML, and they seem to work quite well too.

    Here's the prompt for the code-generation part of the "custom" variant:

        You're a code construction AI which creates code iteratively.
        You're given a list of existing code fragments and a task to work on.
        Files are normally broken into multiple fragments to reduce context size.
    
        The response should be in the following format:
    
        // Observations on the task and the code base, if any
        // A plan to implement the task
    
        // Code fragments to be added to the code base. Use the following markers to delimit code fragments. 
        // (Normally a fragment would be a function, class, or a list of related lines)
        
        /// BEGIN_FILE <path>: <description>
        
        /// BEGIN <marker>: <description>
        <code>
        /// END <marker>
        
        /// END_FILE <path>
    
    
        CODE_GENERATION_STATUS: <status> # COMPLETE, PARTIAL, REDO
        description: <description> # updated description if the status is PARTIAL or REDO
        subtasks: # updated subtasks if the status is PARTIAL or REDO
            - id: <id>
              status: <status> # DONE, PARTIAL, TODO
              description: <description>
    
    Here's the script for the "custom" variant: https://gist.github.com/killerstorm/2296b282c818ffcfe4ceb729...

    Note that you really need GPT-4 to reproduce the results; it doesn't really work with GPT-3.5, although you can see some activity.