self_improving_coding_agent

A coding agent framework that works on its own codebase.

https://github.com/maximerobeyns/self_improving_coding_agent

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.5%) to scientific vocabulary
Last synced: 6 months ago

Repository

A coding agent framework that works on its own codebase.

Basic Info
  • Host: GitHub
  • Owner: MaximeRobeyns
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 713 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created 11 months ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

Self-Improving Coding Agent

A coding agent experiment that works on its own codebase.

Agent Loop

The system operates as an iterative improvement loop:

1. Evaluate the current agent version on some benchmark tasks to capture how well it does.
2. Store the results in an archive.
3. Run the agent on its own codebase to work on an improvement.
4. Go back to step 1 with the updated agent code.
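The four steps above can be sketched as a plain Python loop. This is purely illustrative: the function names and scoring logic below are placeholders, not the framework's real API (the actual entry point is `runner.py`).

```python
# Illustrative sketch of the improvement loop; evaluate_on_benchmarks and
# self_improve are hypothetical stand-ins, not functions from this repo.

def evaluate_on_benchmarks(agent_version: int) -> float:
    """Stand-in for step 1: pretend later versions score slightly higher."""
    return min(1.0, 0.5 + 0.1 * agent_version)

def self_improve(agent_version: int, archive: list) -> int:
    """Stand-in for step 3: the agent edits its own code, yielding a new version."""
    return agent_version + 1

def improvement_loop(iterations: int) -> list:
    archive, version = [], 0
    for i in range(iterations):
        score = evaluate_on_benchmarks(version)                        # step 1
        archive.append({"iteration": i, "version": version, "score": score})  # step 2
        version = self_improve(version, archive)                       # steps 3-4
    return archive

history = improvement_loop(3)
```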

See our workshop paper for more details.

Quickstart

IMPORTANT NOTE: always run the agent in the provided Docker container. Since the agent can execute shell commands, this offers some isolation from your host machine, avoiding inadvertent file system manipulation and similar risks.

First, make sure you've cloned the repo:

```bash
git clone https://github.com/MaximeRobeyns/self_improving_coding_agent
```

Then, export some environment variables, which will be made available inside the Docker container. The project supports inference from a number of providers to allow for experimentation across many models. You must export at least one of these in your local shell, which you can do either directly or with direnv, dotenv, etc. Omitting a provider's key simply makes that provider's models unavailable to the agent.

```bash
export ANTHROPIC_API_KEY=      # For Claude models
export OPENAI_API_KEY=         # For GPT-4o and reasoning models (o1, o3, etc.)
export GEMINI_API_KEY=         # For Gemini models
export VERTEX_PROJECT_ID=      # For models hosted on GCP's Vertex
export FIREWORKS_AI_API_KEY=   # For DeepSeek / Llama hosted on Fireworks
export DEEPSEEK_API_KEY=       # For DeepSeek direct inference (V3, R1)
export MODAL_TOKEN_ID=         # To allow the agent to visit webpages and read papers
export MODAL_TOKEN_SECRET=     # To allow the agent to visit webpages and read papers
```

For Gemini, you should replace the template file in sandbox/GOOGLE_APPLICATION_CREDENTIALS.json with your own credentials.
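If you want to sanity-check which provider keys are visible to a process before building the image, a small helper like the following can list them. This is not part of the repo, just a convenience sketch using the variable names listed above.

```python
# Hypothetical helper (not part of the repo): report which provider
# environment variables from the README are currently set.
import os

PROVIDER_KEYS = [
    "ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY",
    "VERTEX_PROJECT_ID", "FIREWORKS_AI_API_KEY", "DEEPSEEK_API_KEY",
]

def available_providers(env=None):
    """Return the subset of PROVIDER_KEYS that are set and non-empty."""
    env = os.environ if env is None else env
    return [k for k in PROVIDER_KEYS if env.get(k)]

print("available:", available_providers() or "none -- export at least one key")
```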

Once you have at least one LLM provider's API key exported, you can build the docker image. The build command is wrapped in a Makefile target for convenience:

```bash
make image
```

If you are using an Apple silicon machine, use this target instead:

```bash
make image-mac
```

Finally, install the requirements in your local Python environment:

```bash
# remember to activate a virtual environment or equivalent here
pip install -r base_agent/requirements.txt
pip install swebench
```

Testing the Agent

To test whether the setup was successful, you can run the agent interactively with a manually set initial prompt using this target:

```bash
make int
```

This will start the Docker container and attach your shell to it. You can then run:

```bash
python -m agent_code.agent --server true -p "<some initial request here>"
```

Then open your browser at http://localhost:8080 to follow the agent execution. This shows an interactive webpage which visualises the events in the event bus / the agent callgraph, allowing you to click on individual events to see them in more detail, read overseer messages, and collapse sub-agent traces.


The agent's working directory is mapped to results/interactive_output and any files created will be available here on your machine. Agent logs will be in results/interactive_output/agent_output.

You can see more options with:

```bash
make help
```

or the agent's arguments with:

```bash
python -m base_agent.agent --help
```

To further configure the agent, including the choice of LLMs, edit base_agent/src/config.py.

Self-Improvement Loop

To run the self-improvement loop, first inspect the list of benchmarks in the base_agent/src/benchmarks/__init__.py file, and make sure that you have uncommented those you want to include. Then run:

```bash
python runner.py
```

To see all the options:

```bash
python runner.py --help
```

A common invocation might be:

```bash
python runner.py --id 1 --workers 6
```

This will start the agent loop, placing the results in results/run_<id>.

Things to work on

Here are some potential things to try and do with the agent framework:

  • [ ] get the agent to curate / build more of its own benchmarks
  • [ ] reduce the variance of self-improvement runs (early features often influence subsequent features)
  • [ ] use a stronger LLM to build a scaffold for a weaker LLM
  • [ ] find or create more realistic 'software engineering' benchmark tasks

Agent Description

The agent in base_agent is a minimal agent that can just about perform the meta-improvement task. It lacks efficient file-editing tools, devtools such as tree-sitter or LSP integrations, and advanced reasoning structures that would help it when performing coding tasks. It has the necessary building blocks to bootstrap these features and specialise itself to the distribution of benchmark tasks included.

Please see base_agent/README.md for a more detailed discussion of the base agent framework.

```
├── base_agent
│   ├── agent_change_log.md
│   ├── agent.py
│   ├── conftest.py
│   ├── description.txt
│   ├── __main__.py
│   ├── pytest.ini
│   ├── README.md
│   ├── requirements.txt
│   ├── src
│   │   ├── agents
│   │   ├── benchmarks
│   │   ├── callgraph
│   │   ├── config.py
│   │   ├── events
│   │   ├── __init__.py
│   │   ├── llm
│   │   ├── oversight
│   │   ├── schemas
│   │   ├── tools
│   │   ├── types
│   │   ├── utils
│   │   └── web_server
│   └── tests
│       ├── agents
│       ├── benchmarks
│       ├── events
│       ├── __pycache__
│       ├── test_example.py
│       ├── tools
│       └── utils
├── benchmark_data
├── results
│   ├── run_<id>
│   └── interactive_output
├── runner.py
└── sandbox
```

Results Organization

```
results/run_{id}/
├── metadata.json           # Experiment metadata
└── agent_{i}/              # Agent iteration directory
    ├── agent_code/         # Agent implementation
    ├── benchmarks/         # Benchmark results
    │   └── {bench_name}/
    │       ├── results.jsonl   # Per-problem results
    │       ├── perf.jsonl      # Summary metrics
    │       └── traces/         # Detailed traces
    └── meta_improvement/   # Improvement logs
```
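Given this layout, a short script can aggregate per-problem results across iterations. This is a hedged sketch: the `"solved"` field name is an assumption, so check the actual results.jsonl schema before relying on it.

```python
# Sketch: map each agent iteration to its fraction of solved problems per
# benchmark, based on the results/run_{id} layout above. The "solved" key
# is assumed, not confirmed from the repo.
import json
from pathlib import Path

def summarise_run(run_dir: str) -> dict:
    summary: dict = {}
    for f in Path(run_dir).glob("agent_*/benchmarks/*/results.jsonl"):
        rows = [json.loads(line) for line in f.read_text().splitlines() if line.strip()]
        # Path relative to run_dir: agent_{i} / benchmarks / {bench_name} / results.jsonl
        agent, _, bench, _ = f.relative_to(run_dir).parts
        solved = sum(1 for r in rows if r.get("solved"))
        summary.setdefault(agent, {})[bench] = solved / len(rows) if rows else 0.0
    return summary
```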

Citation

```bibtex
@inproceedings{robeyns2025sica,
  title     = {{SICA}: A Self-Improving Coding Agent},
  author    = {Maxime Robeyns and Martin Szummer and Laurence Aitchison},
  booktitle = {ICLR 2025 Workshop on Scaling Self-Improving Foundation Models},
  year      = {2025},
  url       = {https://openreview.net/forum?id=rShJCyLsOr}
}
```

Owner

  • Name: Maxime Robeyns
  • Login: MaximeRobeyns
  • Kind: user
  • Location: London

PhD student in probabilistic machine learning

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Robeyns
    given-names: Maxime
    orcid: https://orcid.org/0000-0001-9802-9597
  - family-names: Szummer
    given-names: Martin
  - family-names: Aitchison
    given-names: Laurence
title: "Self-Improving Coding Agent"
version: 0.0.1
date-released: 2025-04-12
repository-code: "https://github.com/MaximeRobeyns/self_improving_coding_agent"

GitHub Events

Total
  • Issues event: 2
  • Watch event: 41
  • Issue comment event: 1
  • Push event: 1
  • Public event: 1
  • Fork event: 10
Last Year
  • Issues event: 2
  • Watch event: 41
  • Issue comment event: 1
  • Push event: 1
  • Public event: 1
  • Fork event: 10

Dependencies

sandbox/Dockerfile docker
  • fedora 42 build
base_agent/requirements.txt pypi
  • GitPython *
  • anthropic ==0.42.0
  • cryptography *
  • datasets *
  • diff-match-patch *
  • duckduckgo-search *
  • fastapi *
  • google-cloud-aiplatform *
  • google-genai *
  • googlesearch-python *
  • jinja2 *
  • json-repair *
  • jsonlines *
  • openai *
  • pydantic *
  • pydantic-settings *
  • pytest *
  • pytest-asyncio *
  • python-dotenv *
  • rich *
  • scipy *
  • swebench *
  • sympy *
  • tabulate *
  • tiktoken *
  • uvicorn *
sandbox/base_requirements.txt pypi
  • black *
  • flake8 *