self_improving_coding_agent

A coding agent framework that works on its own codebase.

https://github.com/maximerobeyns/self_improving_coding_agent

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.5%) to scientific vocabulary
Last synced: 6 months ago

Repository

A coding agent framework that works on its own codebase.

Basic Info
  • Host: GitHub
  • Owner: MaximeRobeyns
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 713 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created 11 months ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

Self-Improving Coding Agent

A coding agent experiment that works on its own codebase.

Agent Loop

The system operates as an iterative improvement loop:

1. Evaluate the current agent version on some benchmark tasks to capture how well it does.
2. Store the results in an archive.
3. Run the agent on its own codebase to work on an improvement.
4. Go back to step 1 with the updated agent code.
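The four steps above can be sketched as a plain Python loop. This is purely illustrative: the function names and scoring logic below are placeholders, not the framework's real API (the actual entry point is `runner.py`).

```python
# Illustrative sketch of the improvement loop; evaluate_on_benchmarks and
# self_improve are hypothetical stand-ins, not functions from this repo.

def evaluate_on_benchmarks(agent_version: int) -> float:
    """Stand-in for step 1: pretend later versions score slightly higher."""
    return min(1.0, 0.5 + 0.1 * agent_version)

def self_improve(agent_version: int, archive: list) -> int:
    """Stand-in for step 3: the agent edits its own code, yielding a new version."""
    return agent_version + 1

def improvement_loop(iterations: int) -> list:
    archive, version = [], 0
    for i in range(iterations):
        score = evaluate_on_benchmarks(version)                        # step 1
        archive.append({"iteration": i, "version": version, "score": score})  # step 2
        version = self_improve(version, archive)                       # steps 3-4
    return archive

history = improvement_loop(3)
```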

See our workshop paper for more details.

Quickstart

IMPORTANT NOTE: always run the agent in the provided Docker container. Since the agent can execute shell commands, this offers some isolation from your host machine, avoiding inadvertent file system manipulation and similar risks.

First, make sure you've cloned the repo:

```bash
git clone https://github.com/MaximeRobeyns/self_improving_coding_agent
```

Then, export some environment variables, which will be made available inside the Docker container. The project supports inference from a number of providers to allow for experimentation across many models. You must export at least one of these in your local shell, which you can do either directly or with direnv, dotenv, etc. Omitting a provider's key simply makes that provider's models unavailable to the agent.

```bash
export ANTHROPIC_API_KEY=      # For Claude models
export OPENAI_API_KEY=         # For GPT-4o and reasoning models (o1, o3, etc.)
export GEMINI_API_KEY=         # For Gemini models
export VERTEX_PROJECT_ID=      # For models hosted on GCP's Vertex
export FIREWORKS_AI_API_KEY=   # For DeepSeek / Llama hosted on Fireworks
export DEEPSEEK_API_KEY=       # For DeepSeek direct inference (V3, R1)
export MODAL_TOKEN_ID=         # To allow the agent to visit webpages and read papers
export MODAL_TOKEN_SECRET=     # To allow the agent to visit webpages and read papers
```

For Gemini, you should replace the template file in sandbox/GOOGLE_APPLICATION_CREDENTIALS.json with your own credentials.
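If you want to sanity-check which provider keys are visible to a process before building the image, a small helper like the following can list them. This is not part of the repo, just a convenience sketch using the variable names listed above.

```python
# Hypothetical helper (not part of the repo): report which provider
# environment variables from the README are currently set.
import os

PROVIDER_KEYS = [
    "ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY",
    "VERTEX_PROJECT_ID", "FIREWORKS_AI_API_KEY", "DEEPSEEK_API_KEY",
]

def available_providers(env=None):
    """Return the subset of PROVIDER_KEYS that are set and non-empty."""
    env = os.environ if env is None else env
    return [k for k in PROVIDER_KEYS if env.get(k)]

print("available:", available_providers() or "none -- export at least one key")
```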

Once you have at least one LLM provider's API key exported, you can build the docker image. The build command is wrapped in a Makefile target for convenience:

```bash
make image
```

If you are using an Apple silicon machine, use this target instead:

```bash
make image-mac
```

Finally, install the requirements in your local Python environment:

```bash
# remember to activate a virtual environment or equivalent here
pip install -r base_agent/requirements.txt
pip install swebench
```

Testing the Agent

To test whether the setup was successful, you can run the agent interactively with a manually set initial prompt using this target:

```bash
make int
```

This will start the Docker container and attach your shell to it. You can then run:

```bash
python -m agent_code.agent --server true -p "<some initial request here>"
```

Then open your browser at http://localhost:8080 to follow the agent execution. This shows an interactive webpage which visualises the events in the event bus / the agent callgraph, allowing you to click on individual events to see them in more detail, read overseer messages, and collapse sub-agent traces.


The agent's working directory is mapped to results/interactive_output and any files created will be available here on your machine. Agent logs will be in results/interactive_output/agent_output.

You can see more options with:

```bash
make help
```

or the agent's arguments with:

```bash
python -m base_agent.agent --help
```

To further configure the agent, including the choice of LLMs, edit base_agent/src/config.py.

Self-Improvement Loop

To run the self-improvement loop, first inspect the list of benchmarks in the base_agent/src/benchmarks/__init__.py file, and make sure that you have uncommented those you want to include. Then run:

```bash
python runner.py
```

To see all the options:

```bash
python runner.py --help
```

A common invocation might be:

```bash
python runner.py --id 1 --workers 6
```

This will start the agent loop, placing the results in results/run_<id>.

Things to work on

Here are some potential things to try and do with the agent framework:

  • [ ] get the agent to curate / build more of its own benchmarks
  • [ ] reduce the variance of self-improvement runs (early features often influence subsequent features)
  • [ ] use a stronger LLM to build a scaffold for a weaker LLM
  • [ ] find or create more realistic 'software engineering' benchmark tasks

Agent Description

The agent in base_agent is a minimal agent that can just about perform the meta-improvement task. It lacks efficient file-editing tools, devtools such as tree-sitter or LSP integrations, and advanced reasoning structures that would help it when performing coding tasks. It has the necessary building blocks to bootstrap these features and specialise itself to the distribution of benchmark tasks included.

Please see base_agent/README.md for a more detailed discussion of the base agent framework.

```
├── base_agent
│   ├── agent_change_log.md
│   ├── agent.py
│   ├── conftest.py
│   ├── description.txt
│   ├── __main__.py
│   ├── pytest.ini
│   ├── README.md
│   ├── requirements.txt
│   ├── src
│   │   ├── agents
│   │   ├── benchmarks
│   │   ├── callgraph
│   │   ├── config.py
│   │   ├── events
│   │   ├── __init__.py
│   │   ├── llm
│   │   ├── oversight
│   │   ├── schemas
│   │   ├── tools
│   │   ├── types
│   │   ├── utils
│   │   └── web_server
│   └── tests
│       ├── agents
│       ├── benchmarks
│       ├── events
│       ├── __pycache__
│       ├── test_example.py
│       ├── tools
│       └── utils
├── benchmark_data
├── results
│   ├── run_<id>
│   └── interactive_output
├── runner.py
└── sandbox
```

Results Organization

```
results/run_{id}/
├── metadata.json           # Experiment metadata
└── agent_{i}/              # Agent iteration directory
    ├── agent_code/         # Agent implementation
    ├── benchmarks/         # Benchmark results
    │   └── {bench_name}/
    │       ├── results.jsonl   # Per-problem results
    │       ├── perf.jsonl      # Summary metrics
    │       └── traces/         # Detailed traces
    └── meta_improvement/   # Improvement logs
```
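Given this layout, a short script can aggregate per-problem results across iterations. This is a hedged sketch: the `"solved"` field name is an assumption, so check the actual results.jsonl schema before relying on it.

```python
# Sketch: map each agent iteration to its fraction of solved problems per
# benchmark, based on the results/run_{id} layout above. The "solved" key
# is assumed, not confirmed from the repo.
import json
from pathlib import Path

def summarise_run(run_dir: str) -> dict:
    summary: dict = {}
    for f in Path(run_dir).glob("agent_*/benchmarks/*/results.jsonl"):
        rows = [json.loads(line) for line in f.read_text().splitlines() if line.strip()]
        # Path relative to run_dir: agent_{i} / benchmarks / {bench_name} / results.jsonl
        agent, _, bench, _ = f.relative_to(run_dir).parts
        solved = sum(1 for r in rows if r.get("solved"))
        summary.setdefault(agent, {})[bench] = solved / len(rows) if rows else 0.0
    return summary
```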

Citation

```bibtex
@inproceedings{robeyns2025sica,
  title     = {{SICA}: A Self-Improving Coding Agent},
  author    = {Maxime Robeyns and Martin Szummer and Laurence Aitchison},
  booktitle = {ICLR 2025 Workshop on Scaling Self-Improving Foundation Models},
  year      = {2025},
  url       = {https://openreview.net/forum?id=rShJCyLsOr}
}
```

Owner

  • Name: Maxime Robeyns
  • Login: MaximeRobeyns
  • Kind: user
  • Location: London

PhD student in probabilistic machine learning

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Robeyns
    given-names: Maxime
    orcid: https://orcid.org/0000-0001-9802-9597
  - family-names: Szummer
    given-names: Martin
  - family-names: Aitchison
    given-names: Laurence
title: "Self-Improving Coding Agent"
version: 0.0.1
date-released: 2025-04-12
repository-code: "https://github.com/MaximeRobeyns/self_improving_coding_agent"

GitHub Events

Total
  • Issues event: 2
  • Watch event: 41
  • Issue comment event: 1
  • Push event: 1
  • Public event: 1
  • Fork event: 10
Last Year
  • Issues event: 2
  • Watch event: 41
  • Issue comment event: 1
  • Push event: 1
  • Public event: 1
  • Fork event: 10

Dependencies

sandbox/Dockerfile docker
  • fedora 42 build
base_agent/requirements.txt pypi
  • GitPython *
  • anthropic ==0.42.0
  • cryptography *
  • datasets *
  • diff-match-patch *
  • duckduckgo-search *
  • fastapi *
  • google-cloud-aiplatform *
  • google-genai *
  • googlesearch-python *
  • jinja2 *
  • json-repair *
  • jsonlines *
  • openai *
  • pydantic *
  • pydantic-settings *
  • pytest *
  • pytest-asyncio *
  • python-dotenv *
  • rich *
  • scipy *
  • swebench *
  • sympy *
  • tabulate *
  • tiktoken *
  • uvicorn *
sandbox/base_requirements.txt pypi
  • black *
  • flake8 *