craftaxlm

A wrapper around the Craftax agent benchmark, for evaluating digital agents over extremely long time horizons

https://github.com/synth-laboratories/craftaxlm

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.6%) to scientific vocabulary


Repository


Basic Info

Host: GitHub
Owner: synth-laboratories
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 890 KB

Statistics

Stars: 4
Watchers: 1
Forks: 0
Open Issues: 2
Releases: 0

Created almost 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme, License, Citation

Craftax LM

A wrapper around the Craftax agent benchmark, for evaluating digital agents over extremely long time horizons.

Craftax-Classic

| LM | Algorithm | Score (% max) | Code |
|:----------|---------------:|:---------------:|:---------------:|
| claude-3-7-sonnet-latest (default) | ReAct | 18.0 | |
| claude-3-5-sonnet-20241022 | ReAct | 17.8 | |
| claude-3-5-sonnet-20240620 | ReAct | 15.7 | |
| o3-mini | ReAct | 12.6 | |
| gpt-4o | ReAct | 7.0 | |

Note: this is a limited evaluation in which trajectories are terminated after 30 API calls, or roughly 150 in-game steps. Ten trajectories are rolled out, yielding a log-weighted score as per the Crafter paper. Reproducible code is forthcoming.

Usage

First, install the package with `pip install craftaxlm`.

Next, import the agent-computer interface of your choice via `from craftaxlm import CraftaxACI, CraftaxClassicACI`.

This package is early in development, so for implementation examples, please refer to the baseline ReAct implementation.

Leaderboard

To keep experiments reasonable to run across a range of LMs, the leaderboard currently evaluates agents as follows:

1. Five rollouts are sampled from the agent, with a hard cap of 300 actions per rollout.
2. The agent is evaluated using a modified version of the original Crafter score: `sum(ln(1 + P(1_achievement_obtained)) for achievement in achievements) / (ln(2) * len(achievements))`, where `P(1_achievement_obtained)` is the probability of the achievement being obtained in a single rollout.

The key idea is that incremental progress towards difficult achievements ought to weigh more heavily in the score.
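The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the package's own implementation: the function name `modified_crafter_score`, the representation of each rollout as a set of unlocked achievement names, and the example achievement names are all assumptions made for this sketch. The probability of each achievement is estimated as the fraction of rollouts in which it was obtained.

```python
import math
from typing import Iterable, Sequence, Set


def modified_crafter_score(
    rollouts: Sequence[Set[str]], achievements: Iterable[str]
) -> float:
    """Hypothetical helper: normalized log-weighted achievement score in [0, 1].

    rollouts: one set of unlocked achievement names per rollout (assumed format).
    achievements: every achievement that counts towards the score.
    """
    achievements = list(achievements)
    if not rollouts or not achievements:
        raise ValueError("need at least one rollout and one achievement")

    total = 0.0
    for name in achievements:
        # Empirical probability that this achievement is obtained in a rollout.
        p = sum(name in unlocked for unlocked in rollouts) / len(rollouts)
        total += math.log1p(p)  # ln(1 + p)

    # Dividing by ln(2) per achievement makes a perfect agent score 1.0.
    return total / (math.log(2) * len(achievements))


# Illustrative data only: five rollouts over three made-up achievement names.
example_rollouts = [
    {"collect_wood", "place_table"},
    {"collect_wood"},
    {"collect_wood", "place_table", "collect_stone"},
    set(),
    {"collect_wood"},
]
example_score = modified_crafter_score(
    example_rollouts, ["collect_wood", "place_table", "collect_stone"]
)
print(f"score (% max): {100 * example_score:.1f}")
```

Because `ln(1 + p)` is concave, raising a rarely obtained achievement from 0% to 20% adds more to the score than raising a common one from 80% to 100%, which is the weighting the leaderboard description aims for.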

Craftax-Full

Dev Instructions

pyenv virtualenv craftax_env
poetry install

When in doubt

from jax import debug
...
debug.breakpoint()

📚 Citation

To learn more about Craftax, check out the paper website here. To cite the underlying Craftax environment, see:

@inproceedings{matthews2024craftax,
  author = {Michael Matthews and Michael Beukman and Benjamin Ellis and Mikayel Samvelyan and Matthew Jackson and Samuel Coward and Jakob Foerster},
  title = {Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning},
  booktitle = {International Conference on Machine Learning ({ICML})},
  year = {2024}
}

To cite the Crafter benchmark, see:

@article{hafner2021crafter,
  title = {Benchmarking the Spectrum of Agent Capabilities},
  author = {Danijar Hafner},
  year = {2021},
  journal = {arXiv preprint arXiv:2109.06780}
}

Contributing

Setup

uv venv craftaxlm-dev
source craftaxlm-dev/bin/activate
uv sync
uv run ruff format .

Help Wanted

General code quality suggestions or improvements, especially those that improve speed or reduce token usage.
PRs that fix issues or add affordances that help your LM agent perform well.
Leaderboard submissions that demonstrate improved performance using algorithms for learning from data.

Owner

Name: synth-laboratories
Login: synth-laboratories
Kind: organization

Repositories: 1
Profile: https://github.com/synth-laboratories

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Josh"
  given-names: "Purtell"
  orcid: "https://orcid.org/0000-0000-0000-0000"
title: "Craftax-LM: Benchmarking LM Agents with Craftax"
version: 0.0.1
doi: 10.5281/zenodo.1234
date-released: 2024-08-30
url: "https://github.com/JoshuaPurtell/craftaxlm"

GitHub Events

Total

Watch event: 2
Push event: 5

Last Year

Watch event: 2
Push event: 5

Dependencies

pyproject.toml pypi

craftax ^1.4.3
jax ^0.4.30
python ^3.11

setup.py pypi

Farama-Notifications ==0.0.4
GitPython ==3.1.43
PyYAML ==6.0.1
Pygments ==2.18.0
absl-py ==2.1.0
backports.tarfile ==1.2.0
black ==24.4.2
build ==1.2.1
certifi ==2024.7.4
cfgv ==3.4.0
charset-normalizer ==3.3.2
chex ==0.1.86
click ==8.1.7
cloudpickle ==3.0.0
contourpy ==1.2.1
craftax ==1.4.3
cycler ==0.12.1
decorator ==5.1.1
distlib ==0.3.8
distrax ==0.1.5
dm-tree ==0.1.8
docker-pycreds ==0.4.0
docutils ==0.21.2
etils ==1.9.2
filelock ==3.15.4
flax ==0.8.5
fonttools ==4.53.1
fsspec ==2024.6.1
gast ==0.6.0
gitdb ==4.0.11
gym ==0.26.2
gym-notices ==0.0.8
gymnasium ==0.29.1
gymnax ==0.0.8
identify ==2.6.0
idna ==3.7
imageio ==2.34.2
importlib_metadata ==8.2.0
importlib_resources ==6.4.0
jaraco.classes ==3.4.0
jaraco.context ==5.3.0
jaraco.functools ==4.0.2
jax ==0.4.30
jaxlib ==0.4.30
keyring ==25.3.0
kiwisolver ==1.4.5
markdown-it-py ==3.0.0
matplotlib ==3.9.1
mdurl ==0.1.2
ml-dtypes ==0.4.0
more-itertools ==10.4.0
msgpack ==1.0.8
mypy-extensions ==1.0.0
nest-asyncio ==1.6.0
nh3 ==0.2.18
nodeenv ==1.9.1
numpy ==2.0.1
opt-einsum ==3.3.0
optax ==0.2.3
orbax-checkpoint ==0.5.23
packaging ==24.1
pandas ==2.2.2
pathspec ==0.12.1
pillow ==10.4.0
pkginfo ==1.10.0
platformdirs ==4.2.2
pre-commit ==3.8.0
protobuf ==5.27.2
psutil ==6.0.0
pygame ==2.6.0
pyparsing ==3.1.2
pyproject_hooks ==1.1.0
python-dateutil ==2.9.0.post0
pytz ==2024.1
readme_renderer ==44.0
requests ==2.32.3
requests-toolbelt ==1.0.0
rfc3986 ==2.0.0
rich ==13.7.1
scipy ==1.14.0
seaborn ==0.13.2
sentry-sdk ==2.11.0
setproctitle ==1.3.3
six ==1.16.0
smmap ==5.0.1
tensorflow-probability ==0.24.0
tensorstore ==0.1.63
toolz ==0.12.1
twine ==5.1.1
typing_extensions ==4.12.2
tzdata ==2024.1
urllib3 ==2.2.2
virtualenv ==20.26.3
wandb ==0.17.5
zipp ==3.19.2
