visualwebarena
VisualWebArena is a benchmark for multimodal agents.
Science Score: 36.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity of 14.1% to scientific vocabulary)
Keywords
Repository
VisualWebArena is a benchmark for multimodal agents.
Basic Info
- Host: GitHub
- Owner: web-arena-x
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://jykoh.com/vwa
- Size: 186 MB
Statistics
- Stars: 334
- Watchers: 5
- Forks: 57
- Open Issues: 19
- Releases: 0
Topics
Metadata Files
README.md
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
VisualWebArena is a realistic and diverse benchmark for evaluating multimodal autonomous language agents. It comprises a set of diverse and complex web-based visual tasks that evaluate various capabilities of autonomous multimodal agents. It builds on the reproducible, execution-based evaluation introduced in WebArena.

TODOs
- [x] Add human trajectories.
- [x] Add GPT-4V + SoM trajectories from our paper.
- [x] Add scripts for end-to-end training and reset of environments.
- [x] Add demo to run multimodal agents on any arbitrary webpage.
News
- [08/05/2024]: Added an Amazon Machine Image with all VWA (and WA) websites pre-installed, so that you don't have to set them up yourself!
- [03/08/2024]: Added the agent trajectories of our GPT-4V + SoM agent on the full set of 910 VWA tasks.
- [02/14/2024]: Added a demo script for running the GPT-4V + SoM agent on any task on an arbitrary website.
- [01/25/2024]: GitHub repo released with tasks and scripts for setting up the VWA environments.
Install
```bash
# Python 3.10 or 3.11 (not 3.12, because 3.12 removed distutils, which is needed here)
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
playwright install
pip install -e .
```
You can also run the unit tests to ensure that VisualWebArena is installed correctly:
```bash
pytest -x
```
End-to-end Evaluation
Set up the standalone environments. Please check out this page for details.
Configure the URLs for each website. First, export `DATASET` to be `visualwebarena`:
```bash
export DATASET=visualwebarena
```
Then, set the URL for each website:
```bash
export CLASSIFIEDS="<your_classifieds_domain>:9980"
export CLASSIFIEDS_RESET_TOKEN="4b61655535e7ed388f0d40a93600254c"  # Default reset token for classifieds site, change if you edited its docker-compose.yml
export SHOPPING="<your_shopping_site_domain>:7770"
export REDDIT="<your_reddit_domain>:9999"
export WIKIPEDIA="<your_wikipedia_domain>:8888"
export HOMEPAGE="<your_homepage_domain>:4399"
```
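Before moving on, a quick reachability check can catch typos in these values. This is not part of the repo's scripts; it is a minimal sketch assuming the exported variables hold bare host:port values reachable over plain HTTP:
```bash
# Hypothetical sanity check: print the HTTP status code returned by each VWA site.
# Assumes the exported values are host:port without a scheme.
for site in "$CLASSIFIEDS" "$SHOPPING" "$REDDIT" "$WIKIPEDIA" "$HOMEPAGE"; do
  curl -s -o /dev/null -w "%{http_code}  ${site}\n" "http://${site}"
done
```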
In addition, if you want to run on the original WebArena tasks, make sure to also set up the CMS, GitLab, and map environments, and then set their respective environment variables:
```bash
export SHOPPING_ADMIN="<your_e_commerce_cms_domain>:7780/admin"
export GITLAB="<your_gitlab_domain>:8023"
export MAP="<your_map_domain>:3000"
```
Generate config files for each test example:
```bash
python scripts/generate_test_data.py
```
You will see `*.json` files generated in the config_files folder. Each file contains the configuration for one test example.

Obtain and save the auto-login cookies for all websites:
```bash
bash prepare.sh
```

Set up API keys.
If using OpenAI models, set a valid OpenAI API key (starting with sk-) as the environment variable:
```bash
export OPENAI_API_KEY=your_key
```
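If you want to confirm the key is picked up before launching a long run, one optional check (not part of the repo's scripts; it assumes the openai>=1.0 client pinned in requirements.txt) is to list the available models:
```bash
# Optional check: fails with an authentication error if OPENAI_API_KEY is invalid.
python -c "from openai import OpenAI; OpenAI().models.list(); print('OpenAI key OK')"
```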
If using Gemini, first install the gcloud CLI. Configure the API key by authenticating with Google Cloud:
```bash
gcloud auth login
gcloud config set project <your_project_name>
```
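As an optional follow-up check (not required by the repo), you can verify which account and project gcloud will use:
```bash
# Show the authenticated account(s) and the currently configured project
gcloud auth list
gcloud config get-value project
```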
Launch the evaluation. For example, to reproduce our GPT-3.5 captioning baseline:
```bash
python run.py \
  --instruction_path agent/prompts/jsons/p_cot_id_actree_3s.json \
  --test_start_idx 0 \
  --test_end_idx 1 \
  --result_dir <your_result_dir> \
  --test_config_base_dir=config_files/vwa/test_classifieds \
  --model gpt-3.5-turbo-1106 \
  --observation_type accessibility_tree_with_captioner
```
This script will run the first Classifieds example with the GPT-3.5 caption-augmented agent. The trajectory will be saved in `<your_result_dir>/0.html`. Note that the baselines that include a captioning model run on GPU by default (e.g., BLIP-2-T5XL as the captioning model will take up approximately 12 GB of GPU VRAM).
GPT-4V + SoM Agent

To run the GPT-4V + SoM agent we proposed in our paper, you can run evaluation with the following flags:
```bash
python run.py \
  --instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \
  --test_start_idx 0 \
  --test_end_idx 1 \
  --result_dir <your_result_dir> \
  --test_config_base_dir=config_files/vwa/test_classifieds \
  --model gpt-4-vision-preview \
  --action_set_tag som --observation_type image_som
```
To run Gemini models, you can change the provider, model, and the max_obs_length (as Gemini uses characters instead of tokens for inputs):
```bash
python run.py \
  --instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \
  --test_start_idx 0 \
  --test_end_idx 1 \
  --max_steps 1 \
  --result_dir <your_result_dir> \
  --test_config_base_dir=config_files/vwa/test_classifieds \
  --provider google --model gemini --mode completion --max_obs_length 15360 \
  --action_set_tag som --observation_type image_som
```
If you'd like to reproduce the results from our paper, we have also provided scripts in scripts/ to run the full evaluation pipeline on each of the VWA environments. For example, to reproduce the results from the Classifieds environment, you can run:
```bash
bash scripts/run_classifieds_som.sh
```
Agent Trajectories
To facilitate analysis and evals, we have also released the trajectories of the GPT-4V + SoM agent on the full set of 910 VWA tasks here. It consists of .html files that record the agent's observations and output at each step of the trajectory.
Demo

We have also prepared a demo for you to run the agents on your own task on an arbitrary webpage. An example is shown above where the agent is tasked to find the best Thai restaurant in Pittsburgh.
After following the setup instructions above and setting the OpenAI API key (the other environment variables for website URLs aren't really used, so you should be able to set them to dummy values; a placeholder sketch is shown after the demo command below), you can run the GPT-4V + SoM agent with the following command:
```bash
python run_demo.py \
  --instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \
  --start_url "https://www.amazon.com" \
  --image "https://media.npr.org/assets/img/2023/01/14/this-is-fine_wide-0077dc0607062e15b476fb7f3bd99c5f340af356-s1400-c100.jpg" \
  --intent "Help me navigate to a shirt that has this on it." \
  --result_dir demo_test_amazon \
  --model gpt-4-vision-preview \
  --action_set_tag som --observation_type image_som \
  --render
```
This tasks the agent to find a shirt that looks like the provided image (the "This is fine" dog) from Amazon. Have fun!
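If you have not set up the self-hosted sites, a minimal sketch of placeholder values for the unused URL variables looks like the following; the hosts here are arbitrary and are only meant to keep the environment variables defined:
```bash
# Hypothetical placeholders: the demo does not contact these sites,
# but the variables still need to be set.
export DATASET=visualwebarena
export CLASSIFIEDS="dummy:9980"
export CLASSIFIEDS_RESET_TOKEN="dummy"
export SHOPPING="dummy:7770"
export REDDIT="dummy:9999"
export WIKIPEDIA="dummy:8888"
export HOMEPAGE="dummy:4399"
```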
Human Evaluations
We collected human trajectories on 233 tasks (one from each template type) and the Playwright recording files are provided here. These are the same tasks reported in our paper (with a human success rate of ~89%). You can view the HTML pages, actions, etc., by running playwright show-trace <example_id>.zip. The example_id follows the same structure as the examples from the corresponding site in config_files/.
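For example, to open one of the recordings in Playwright's trace viewer (the filename here is illustrative; substitute an actual example_id from the release):
```bash
# Illustrative: opens the recorded trace for a single task in the Playwright trace viewer
playwright show-trace 0.zip
```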
Citation
If you find our environment or our models useful, please consider citing VisualWebArena as well as WebArena:
```
@article{koh2024visualwebarena,
  title={VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks},
  author={Koh, Jing Yu and Lo, Robert and Jang, Lawrence and Duvvur, Vikram and Lim, Ming Chong and Huang, Po-Yu and Neubig, Graham and Zhou, Shuyan and Salakhutdinov, Ruslan and Fried, Daniel},
  journal={arXiv preprint arXiv:2401.13649},
  year={2024}
}

@article{zhou2024webarena,
  title={WebArena: A Realistic Web Environment for Building Autonomous Agents},
  author={Zhou, Shuyan and Xu, Frank F and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Bisk, Yonatan and Fried, Daniel and Alon, Uri and others},
  journal={ICLR},
  year={2024}
}
```
Acknowledgements
Our code is heavily based on the WebArena codebase.
Owner
- Name: web-arena-x
- Login: web-arena-x
- Kind: organization
- Repositories: 1
- Profile: https://github.com/web-arena-x
GitHub Events
Total
- Issues event: 11
- Watch event: 125
- Issue comment event: 13
- Push event: 1
- Pull request event: 3
- Fork event: 19
- Create event: 1
Last Year
- Issues event: 11
- Watch event: 125
- Issue comment event: 13
- Push event: 1
- Pull request event: 3
- Fork event: 19
- Create event: 1
Dependencies
- Farama-Notifications ==0.0.4
- Jinja2 ==3.1.2
- MarkupSafe ==2.1.3
- Pillow ==10.0.1
- PyYAML ==6.0.1
- Pygments ==2.16.1
- accelerate ==0.22.0
- aiohttp ==3.8.5
- aiolimiter ==1.1.0
- aiosignal ==1.3.1
- annotated-types ==0.5.0
- anyio ==4.0.0
- appnope ==0.1.3
- asttokens ==2.4.0
- async-timeout ==4.0.3
- attrs ==23.1.0
- backcall ==0.2.0
- beartype ==0.12.0
- beautifulsoup4 ==4.12.2
- certifi ==2023.7.22
- cfgv ==3.4.0
- charset-normalizer ==3.2.0
- click ==8.1.7
- cloudpickle ==2.2.1
- comm ==0.1.4
- contourpy ==1.1.1
- cycler ==0.12.1
- datasets ==2.14.4
- debugpy ==1.8.0
- decorator ==5.1.1
- dill ==0.3.7
- distlib ==0.3.7
- evaluate ==0.4.0
- exceptiongroup ==1.1.3
- execnet ==2.0.2
- executing ==2.0.0
- fastjsonschema ==2.18.1
- filelock ==3.12.2
- fonttools ==4.43.1
- frozenlist ==1.4.0
- fsspec ==2023.6.0
- google-api-core ==2.15.0
- google-auth ==2.26.1
- google-cloud-aiplatform ==1.38.1
- google-cloud-bigquery ==3.14.1
- google-cloud-core ==2.4.1
- google-cloud-resource-manager ==1.11.0
- google-cloud-storage ==2.14.0
- google-crc32c ==1.5.0
- google-resumable-media ==2.7.0
- googleapis-common-protos ==1.62.0
- gradio_client ==0.5.2
- greenlet ==2.0.2
- grpc-google-iam-v1 ==0.13.0
- gymnasium ==0.29.1
- h11 ==0.14.0
- httpcore ==0.18.0
- httpx ==0.25.0
- huggingface-hub ==0.16.4
- identify ==2.5.30
- idna ==3.4
- iniconfig ==2.0.0
- ipykernel ==6.25.2
- ipython ==8.16.1
- jedi ==0.19.1
- joblib ==1.3.2
- jsonschema ==4.19.1
- jsonschema-specifications ==2023.7.1
- jupyter_client ==8.4.0
- jupyter_core ==5.4.0
- kiwisolver ==1.4.5
- matplotlib ==3.8.0
- matplotlib-inline ==0.1.6
- mpmath ==1.3.0
- multidict ==6.0.4
- multiprocess ==0.70.15
- mypy ==0.991
- mypy-extensions ==1.0.0
- nbclient ==0.6.8
- nbformat ==5.9.2
- nbmake ==1.4.6
- nest-asyncio ==1.5.8
- networkx ==3.1
- nltk ==3.8.1
- nodeenv ==1.8.0
- numpy ==1.25.2
- openai ==1.3.5
- opencv-python ==4.8.1.78
- packaging ==23.1
- pandas ==2.0.3
- parso ==0.8.3
- pexpect ==4.8.0
- pickleshare ==0.7.5
- platformdirs ==3.11.0
- playwright ==1.37.0
- pluggy ==1.3.0
- pre-commit ==3.0.1
- prompt-toolkit ==3.0.39
- protobuf ==4.24.3
- psutil ==5.9.5
- ptyprocess ==0.7.0
- pure-eval ==0.2.2
- py ==1.11.0
- pyarrow ==12.0.1
- pydantic ==2.4.2
- pydantic_core ==2.10.1
- pyee ==9.0.4
- pyparsing ==3.1.1
- pytest ==7.1.2
- pytest-asyncio ==0.21.1
- pytest-xdist ==3.3.1
- python-dateutil ==2.8.2
- pytz ==2023.3
- pyzmq ==25.1.1
- referencing ==0.30.2
- regex ==2023.8.8
- requests ==2.31.0
- responses ==0.18.0
- rpds-py ==0.10.6
- safetensors ==0.3.3
- scikit-image ==0.22.0
- sentencepiece ==0.1.99
- six ==1.16.0
- sniffio ==1.3.0
- soupsieve ==2.5
- stack-data ==0.6.3
- sympy ==1.12
- text-generation ==0.6.1
- tiktoken ==0.4.0
- tokenizers ==0.14.0
- tomli ==2.0.1
- torch ==2.0.1
- tornado ==6.3.3
- tqdm ==4.66.1
- traitlets ==5.11.2
- transformers ==4.34.0
- types-requests ==2.31.0.10
- types-tqdm ==4.66.0.1
- typing_extensions ==4.7.1
- tzdata ==2023.3
- urllib3 ==2.0.4
- virtualenv ==20.24.5
- wcwidth ==0.2.8
- websockets ==11.0.3
- xxhash ==3.3.0
- yarl ==1.9.2