https://github.com/bytedance/web-bench

Web-Bench is a benchmark designed to evaluate the performance of LLMs in actual Web development.

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.0%) to scientific vocabulary

Keywords

benchmark
Last synced: 5 months ago

Repository

Web-Bench is a benchmark designed to evaluate the performance of LLMs in actual Web development.

Basic Info
Statistics
  • Stars: 205
  • Watchers: 5
  • Forks: 21
  • Open Issues: 6
  • Releases: 1
Topics
benchmark
Created 10 months ago · Last pushed 6 months ago
Metadata Files
Readme License

README.md

Web-Bench

Chinese | Install | Paper | Datasets | LeaderBoard | Citation

📖 Overview

Web-Bench is a benchmark designed to evaluate the performance of LLMs in actual Web development. It contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. In designing Web-Bench, we aimed to cover the foundational elements of Web development: Web Standards and Web Frameworks. The projects were designed by engineers with 5-10 years of experience, and their scale and complexity make each one a significant challenge: on average, a single project takes a senior engineer 4–8 hours to complete. With our benchmark agent (Web-Agent), the SOTA model (Claude 3.7 Sonnet) achieves only 25.1% Pass@1.
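The overview does not spell out how Pass@1 is aggregated over sequentially dependent tasks. The sketch below is one plausible reading, not the benchmark's actual scoring code: it assumes that the first failed task blocks all later tasks in the same project, and that the benchmark score is the mean of per-project pass rates. The function names and the example data are hypothetical.

```python
from typing import Dict, List


def project_pass_rate(task_results: List[bool]) -> float:
    """Fraction of a project's tasks that pass, counting tasks in order.

    Assumption: each task builds on the previous one, so the first
    failure blocks every later task in the same project.
    """
    passed = 0
    for ok in task_results:
        if not ok:
            break  # later tasks depend on this one, so stop counting
        passed += 1
    return passed / len(task_results)


def benchmark_pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean of per-project pass rates over first-attempt results (assumed)."""
    rates = [project_pass_rate(tasks) for tasks in results.values()]
    return sum(rates) / len(rates)


# Hypothetical first-attempt outcomes for two 20-task projects.
example = {
    "project-a": [True] * 5 + [False] + [True] * 14,  # blocked after task 5
    "project-b": [True] * 10 + [False] * 10,
}
score = benchmark_pass_at_1(example)
print(f"Pass@1: {score:.1%}")  # prints "Pass@1: 37.5%"
```

Under this reading, an early failure is costly: project-a passes 19 of its 20 tasks in isolation but scores only 5/20, which is one way sequential dependencies can hold scores down.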

The distribution of the experimental data aligns well with the current code generation capabilities of mainstream LLMs.

Figure: pass@1

HumanEval and MBPP have approached saturation. APPS and EvalPlus are approaching saturation. The SOTA for Web-Bench is 25.1%, which is lower (better) than that of the SWE-bench Full and Verified sets.

Figure: SOTAs

🚀 Quick Start

Refer to the Docker setup guide for instructions on installing Docker on your machine.

  1. Create a new empty folder and add two files to it:

```
./config.json5
./docker-compose.yml
```

  2. For config.json5, copy the JSON5 below and edit it according to Config Parameters:

```json5
{
  models: [
    'openai/gpt-4o',
    // You can add more models here
    // "claude-sonnet-4-20250514"
  ],
  // Eval one project only
  // "projects": ["@web-bench/react"]
}
```

  3. For docker-compose.yml, copy the YAML below and set the environment variables:

```yaml
services:
  web-bench:
    image: maoyiweiebay777/web-bench:latest
    volumes:
      - ./config.json5:/app/apps/eval/src/config.json5
      - ./report:/app/apps/eval/report
    environment:
      # Add environment variables according to apps/src/model.json
      - OPENROUTER_API_KEY=your_api_key
      # Add more model's key
      # - ANTHROPIC_API_KEY=your_api_key
```

  4. Run docker-compose:

```bash
docker compose up
```

  5. The evaluation report will be generated under ./report/.

If you wish to evaluate from source code, refer to Install from source.
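The file-creation steps of the Quick Start can also be scripted. The following is a minimal sketch, not part of Web-Bench itself: the `scaffold` helper and the folder name `web-bench-eval` are hypothetical, and the file contents are the Quick Start templates with the optional commented lines left out. The API key stays a placeholder that you must replace.

```python
import tempfile
from pathlib import Path

# Contents taken from the Quick Start templates (optional lines omitted).
CONFIG_JSON5 = """\
{
  models: [
    'openai/gpt-4o',
  ],
}
"""

COMPOSE_YML = """\
services:
  web-bench:
    image: maoyiweiebay777/web-bench:latest
    volumes:
      - ./config.json5:/app/apps/eval/src/config.json5
      - ./report:/app/apps/eval/report
    environment:
      - OPENROUTER_API_KEY=your_api_key
"""


def scaffold(folder: Path) -> Path:
    """Create the folder (hypothetical helper) and write both files into it."""
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    (root / "config.json5").write_text(CONFIG_JSON5, encoding="utf-8")
    (root / "docker-compose.yml").write_text(COMPOSE_YML, encoding="utf-8")
    return root


# Demo in a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    root = scaffold(Path(tmp) / "web-bench-eval")
    created = sorted(p.name for p in root.iterdir())
    print(created)  # prints ['config.json5', 'docker-compose.yml']
```

For real use, call `scaffold(Path("web-bench-eval"))` in a location of your choice, set the API key, and run `docker compose up` from inside that folder.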

🛠️ Contribution

📚 Citation

```bibtex
@article{xu2025webbench,
  title={Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks},
  author={Xu, Kai and Mao, YiWei and Guan, XinYi and Feng, ZiLong},
  journal={arXiv preprint arXiv:2505.07473},
  year={2025}
}
```

📄 License

Apache 2.0

🌟 Contact us

  • Lark: Register for Feishu, then scan the QR code in the repository README to join our Web Bench user group.

Owner

  • Name: Bytedance Inc.
  • Login: bytedance
  • Kind: organization
  • Location: Singapore

GitHub Events

Total
  • Create event: 19
  • Issues event: 54
  • Release event: 1
  • Watch event: 140
  • Delete event: 18
  • Issue comment event: 48
  • Push event: 87
  • Gollum event: 101
  • Pull request review comment event: 6
  • Pull request review event: 14
  • Pull request event: 90
  • Fork event: 17
Last Year
  • Create event: 19
  • Issues event: 54
  • Release event: 1
  • Watch event: 140
  • Delete event: 18
  • Issue comment event: 48
  • Push event: 87
  • Gollum event: 101
  • Pull request review comment event: 6
  • Pull request review event: 14
  • Pull request event: 90
  • Fork event: 17

Issues and Pull Requests

Last synced: 5 months ago

All Time
  • Total issues: 30
  • Total pull requests: 49
  • Average time to close issues: 5 days
  • Average time to close pull requests: about 1 hour
  • Total issue authors: 9
  • Total pull request authors: 5
  • Average comments per issue: 0.73
  • Average comments per pull request: 0.02
  • Merged pull requests: 36
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 30
  • Pull requests: 49
  • Average time to close issues: 5 days
  • Average time to close pull requests: about 1 hour
  • Issue authors: 9
  • Pull request authors: 5
  • Average comments per issue: 0.73
  • Average comments per pull request: 0.02
  • Merged pull requests: 36
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • luics (14)
  • xiaxiazhu (7)
  • diasforgood (2)
  • shellvon (2)
  • James4Ever0 (1)
  • mingrenbuke (1)
  • sijunhe (1)
  • joyfulcat (1)
  • Sunliangtai (1)
Pull Request Authors
  • sanmaopep (18)
  • liuyueweiyu (17)
  • luics (11)
  • JxJuly (2)
  • xiaxiazhu (1)
Top Labels
Issue Labels
bug (3) question (1)
Pull Request Labels