https://github.com/salesforceairesearch/mcp-universe

MCP-Universe is a comprehensive framework designed for developing, testing, and benchmarking AI agents

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

MCP-Universe is a comprehensive framework designed for developing, testing, and benchmarking AI agents

Basic Info
  • Host: GitHub
  • Owner: SalesforceAIResearch
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage: https://mcp-universe.github.io/
  • Size: 2.53 MB
Statistics
  • Stars: 362
  • Watchers: 5
  • Forks: 31
  • Open Issues: 4
  • Releases: 1
Created about 1 year ago · Last pushed 9 months ago
Metadata Files
Readme Contributing License Code of conduct Codeowners Security

README.md

# MCP-Universe

[![Paper](https://img.shields.io/badge/Paper-arXiv:2508.14704-B31B1B?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2508.14704)
[![Website](https://img.shields.io/badge/Website-Live-4285F4?style=for-the-badge&logo=googlechrome&logoColor=white)](https://mcp-universe.github.io/)
[![Leaderboard](https://img.shields.io/badge/Leaderboard-Results-FF6B35?style=for-the-badge&logo=chartdotjs&logoColor=white)](https://mcp-universe.github.io/#results)
[![Discord](https://img.shields.io/badge/Discord-Join_Community-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/t9tU77GF)

What is MCP-Universe?

MCP-Universe is a comprehensive framework designed for developing, testing, and benchmarking AI agents. It offers a robust platform for building and evaluating both AI agents and LLMs across a wide range of task environments. The framework also supports seamless integration with external MCP servers and facilitates sophisticated agent orchestration workflows.

![MCP-Universe Introduction](assets/intro-mcp-universe.png)

Unlike existing benchmarks that rely on overly simplistic tasks, MCP-Universe addresses critical gaps by evaluating LLMs in real-world scenarios through interaction with actual MCP servers, capturing real application challenges such as:

  • 🎯 Long-horizon reasoning across multi-step tasks
  • 🔧 Large, unfamiliar tool spaces with diverse MCP servers
  • 🌍 Real-world data sources and live environments
  • Dynamic evaluation with time-sensitive ground truth

Performance Highlights

Even state-of-the-art models show significant limitations in real-world MCP interactions:

  • 🥇 GPT-5: 43.72% success rate
  • 🥈 Grok-4: 33.33% success rate
  • 🥉 Claude-4.0-Sonnet: 29.44% success rate

This highlights the challenging nature of real-world MCP server interactions and substantial room for improvement in current LLM agents.

Architecture Overview

The MCP-Universe architecture consists of the following key components:

  • Agents (mcpuniverse/agent/): Base implementations for different agent types
  • Workflows (mcpuniverse/workflows/): Orchestration and coordination layer
  • MCP Servers (mcpuniverse/mcp/): Protocol management and external service integration
  • LLM Integration (mcpuniverse/llm/): Multi-provider language model support
  • Benchmarking (mcpuniverse/benchmark/): Evaluation and testing framework
  • Dashboard (mcpuniverse/dashboard/): Visualization and monitoring interface

The diagram below illustrates the high-level view:

```
┌─────────────────────────────────────────────────────────────────┐
│                        Application Layer                        │
├─────────────────────────────────────────────────────────────────┤
│  Dashboard  │     Web API     │   Python Lib   │   Benchmarks   │
│  (Gradio)   │    (FastAPI)    │                │                │
└─────────────┬─────────────────┬────────────────┬────────────────┘
              │                 │                │
┌─────────────▼─────────────────▼────────────────▼────────────────┐
│                       Orchestration Layer                       │
├─────────────────────────────────────────────────────────────────┤
│           Workflows           │        Benchmark Runner         │
│     (Chain, Router, etc.)     │       (Evaluation Engine)       │
└─────────────┬─────────────────┬────────────────┬────────────────┘
              │                 │                │
┌─────────────▼─────────────────▼────────────────▼────────────────┐
│                           Agent Layer                           │
├─────────────────────────────────────────────────────────────────┤
│ BasicAgent  │   ReActAgent    │  FunctionCall  │     Other      │
│             │                 │     Agent      │     Agents     │
└─────────────┬─────────────────┬────────────────┬────────────────┘
              │                 │                │
┌─────────────▼─────────────────▼────────────────▼────────────────┐
│                        Foundation Layer                         │
├─────────────────────────────────────────────────────────────────┤
│   MCP Manager   │   LLM Manager   │ Memory Systems  │  Tracers  │
│   (Servers &    │  (Multi-Model   │  (RAM, Redis)   │ (Logging) │
│    Clients)     │    Support)     │                 │           │
└─────────────────┴─────────────────┴─────────────────┴───────────┘
```

More information can be found here.

Getting Started

We follow the feature branch workflow in this repo for its simplicity. To ensure code quality, PyLint is integrated into our CI to enforce Python coding standards.

Prerequisites

  • Python: Requires version 3.10 or higher.
  • Docker: Used for running Dockerized MCP servers.
  • PostgreSQL (optional): Used for database storage and persistence.
  • Redis (optional): Used for caching and memory management.

Installation

  1. Clone the repository

     ```bash
     git clone https://github.com/SalesforceAIResearch/MCP-Universe.git
     cd MCP-Universe
     ```

  2. Create and activate virtual environment

     ```bash
     python3 -m venv venv
     source venv/bin/activate
     ```

  3. Install dependencies

     ```bash
     pip install -r requirements.txt
     pip install -r dev-requirements.txt
     ```

  4. Platform-specific requirements

     Linux:

     ```bash
     sudo apt-get install libpq-dev
     ```

     macOS:

     ```bash
     brew install postgresql
     ```

  5. Configure pre-commit hooks

     ```bash
     pre-commit install
     ```

  6. Environment configuration

     ```bash
     cp .env.example .env
     # Edit .env with your API keys and configuration
     ```

Quick Test

To run benchmarks, you first need to set environment variables:

  1. Copy the .env.example file to a new file named .env.
  2. In the .env file, set the required API keys for various services used by the agents, such as OPENAI_API_KEY and GOOGLE_MAPS_API_KEY.
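Before launching a run, it can help to confirm that the keys are actually visible to the process. The snippet below is a minimal sketch, not part of MCP-Universe: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and it assumes the variables from `.env` have already been exported or loaded into the environment.

```python
import os

# Keys named in the step above; extend this tuple for the domains you evaluate.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_MAPS_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank in env."""
    return [key for key in required if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

The check takes the environment as a plain mapping, so it can be exercised with a dictionary instead of the real process environment.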

To execute a benchmark programmatically:

```python
import asyncio

from mcpuniverse.tracer.collectors import MemoryCollector  # You can also use SQLiteCollector
from mcpuniverse.benchmark.runner import BenchmarkRunner


async def test():
    trace_collector = MemoryCollector()
    # Choose a benchmark config file under the folder "mcpuniverse/benchmark/configs"
    benchmark = BenchmarkRunner("dummy/benchmark_1.yaml")
    # Run the specified benchmark
    results = await benchmark.run(trace_collector=trace_collector)
    # Get traces
    trace_id = results[0].task_trace_ids["dummy/tasks/weather.json"]
    trace_records = trace_collector.get(trace_id)


if __name__ == "__main__":
    # Drive the coroutine from a regular script
    asyncio.run(test())
```

Evaluating LLMs and Agents

This section provides comprehensive instructions for evaluating LLMs and AI agents using the MCP-Universe benchmark suite. The framework supports evaluation across multiple domains including web search, location navigation, browser automation, financial analysis, repository management, and 3D design.

Prerequisites

Before running benchmark evaluations, ensure you have completed the Getting Started section and have the following:

  • Python: Version 3.10 or higher
  • Docker: Installed and available in your environment
  • All required dependencies installed via pip install -r requirements.txt
  • Active virtual environment
  • Appropriate API access for the services you intend to evaluate

Environment Configuration

1. Initial Setup

Copy the environment template and configure your API credentials:

```bash
cp .env.example .env
```

2. API Keys and Configuration

Configure the following environment variables in your .env file. The required keys depend on which benchmark domains you plan to evaluate:

Core LLM Providers

| Environment Variable | Provider | Description | Required For |
|---------------------|----------|-------------|--------------|
| OPENAI_API_KEY | OpenAI | API key for GPT models (gpt-5, etc.) | All domains |
| ANTHROPIC_API_KEY | Anthropic | API key for Claude models | All domains |
| GEMINI_API_KEY | Google | API key for Gemini models | All domains |

Note: You only need to configure the API key for the LLM provider you intend to use in your evaluation.

Domain-Specific Services

| Environment Variable | Service | Description | Setup Instructions |
|---------------------|---------|-------------|-------------------|
| SERP_API_KEY | SerpAPI | Web search API for search benchmark evaluation | Get API key |
| GOOGLE_MAPS_API_KEY | Google Maps | Geolocation and mapping services | Setup Guide |
| GITHUB_PERSONAL_ACCESS_TOKEN | GitHub | Personal access token for repository operations | Token Setup |
| GITHUB_PERSONAL_ACCOUNT_NAME | GitHub | Your GitHub username | N/A |
| NOTION_API_KEY | Notion | Integration token for Notion workspace access | Integration Setup |
| NOTION_ROOT_PAGE | Notion | Root page ID for your Notion workspace | See configuration example below |

System Paths

| Environment Variable | Description | Example |
|---------------------|-------------|---------|
| BLENDER_APP_PATH | Full path to Blender executable (we used v4.4.0) | /Applications/Blender.app/Contents/MacOS/Blender |
| MCPUniverse_DIR | Absolute path to your MCP-Universe repository | /Users/username/MCP-Universe |

Configuration Examples

Notion Root Page ID: If your Notion page URL is

https://www.notion.so/your_workspace/MCP-Evaluation-1dd6d96e12345678901234567eaf9eff

set NOTION_ROOT_PAGE=MCP-Evaluation-1dd6d96e12345678901234567eaf9eff
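The rule in this example amounts to taking the last path segment of the page URL. Here is a minimal sketch of that extraction using only the standard library; `notion_root_page` is a hypothetical helper for illustration, not a function provided by MCP-Universe.

```python
from urllib.parse import urlparse


def notion_root_page(page_url):
    """Return the last path segment of a Notion page URL."""
    # urlparse separates the path from any query string or fragment.
    path = urlparse(page_url).path.rstrip("/")
    segment = path.rsplit("/", 1)[-1]
    if not segment:
        raise ValueError(f"No page segment found in URL: {page_url!r}")
    return segment


url = "https://www.notion.so/your_workspace/MCP-Evaluation-1dd6d96e12345678901234567eaf9eff"
print(f"NOTION_ROOT_PAGE={notion_root_page(url)}")
```

The printed line can be pasted into `.env` as is.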

Blender Installation:

  1. Download Blender v4.4.0 from blender.org
  2. Install our modified Blender MCP server following the installation guide
  3. Set BLENDER_APP_PATH to the path of the Blender executable

⚠️ Security Recommendations

🔒 IMPORTANT SECURITY NOTICE

Please read and follow these security guidelines carefully before running benchmarks:

  • 🚨 GitHub Integration: CRITICAL - We strongly recommend using a dedicated test GitHub account for benchmark evaluation. The AI agent will perform real operations on GitHub repositories, which could potentially modify or damage your personal repositories.

  • 🔐 API Key Management:

    • Store API keys securely and never commit them to version control
    • Use environment variables or secure key management systems
    • Regularly rotate your API keys for enhanced security
  • 🛡️ Access Permissions:

    • Grant minimal necessary permissions for each service integration
    • Review and limit API key scopes to only required operations
    • Monitor API usage and set appropriate rate limits
  • ⚡ Blender Operations: The 3D design benchmarks will execute Blender commands that may modify or create files on your system. Ensure you have adequate backups and run in an isolated environment if necessary.

Benchmark Configuration

Domain-Specific Configuration Files

Each benchmark domain has a dedicated YAML configuration file located in mcpuniverse/benchmark/configs/test/. To evaluate your LLM/agent, modify the appropriate configuration file:

| Domain | Configuration File | Description |
|--------|-------------------|-------------|
| Web Search | web_search.yaml | Search engine and information retrieval tasks |
| Location Navigation | location_navigation.yaml | Geographic and mapping-related queries |
| Browser Automation | browser_automation.yaml | Web interaction and automation scenarios |
| Financial Analysis | financial_analysis.yaml | Market data analysis and financial computations |
| Repository Management | repository_management.yaml | Git operations and code repository tasks |
| 3D Design | 3d_design.yaml | Blender-based 3D modeling and design tasks |

LLM Model Configuration

In each configuration file, update the LLM specification to match your target model:

```yaml
kind: llm
spec:
  name: llm-1
  type: openai  # or anthropic, google, etc.
  config:
    model_name: gpt-4o  # Replace with your target model
```

Execution

Running Individual Benchmarks

Execute specific domain benchmarks using the following commands:

```bash
# Set Python path and run individual benchmarks
export PYTHONPATH=.

# Location Navigation
python tests/benchmark/test_benchmark_location_navigation.py

# Browser Automation
python tests/benchmark/test_benchmark_browser_automation.py

# Financial Analysis
python tests/benchmark/test_benchmark_financial_analysis.py

# Repository Management
python tests/benchmark/test_benchmark_repository_management.py

# Web Search
python tests/benchmark/test_benchmark_web_search.py

# 3D Design
python tests/benchmark/test_benchmark_3d_design.py
```

Batch Execution

For comprehensive evaluation across all domains:

```bash
#!/bin/bash
export PYTHONPATH=.

domains=("location_navigation" "browser_automation" "financial_analysis" "repository_management" "web_search" "3d_design")

for domain in "${domains[@]}"; do
    echo "Running benchmark: $domain"
    python "tests/benchmark/test_benchmark_${domain}.py"
    echo "Completed: $domain"
done
```
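The same loop can be written in Python when you also want the exit code of each run. This is a rough sketch rather than a script shipped with the repository: `script_path` and `run_all` are hypothetical helpers, and the `test_benchmark_<domain>.py` naming is an assumption based on the configuration file names listed earlier.

```python
import os
import subprocess
import sys

DOMAINS = [
    "location_navigation",
    "browser_automation",
    "financial_analysis",
    "repository_management",
    "web_search",
    "3d_design",
]


def script_path(domain):
    """Build the test script path for a domain (assumed naming pattern)."""
    return f"tests/benchmark/test_benchmark_{domain}.py"


def run_all(domains=DOMAINS, dry_run=False):
    """Run each domain script in turn and return a {domain: exit_code} mapping."""
    env = dict(os.environ, PYTHONPATH=".")  # same effect as `export PYTHONPATH=.`
    results = {}
    for domain in domains:
        command = [sys.executable, script_path(domain)]
        if dry_run:
            # Only show what would be executed; no benchmark is started.
            print("Would run:", " ".join(command))
            results[domain] = None
            continue
        results[domain] = subprocess.run(command, env=env).returncode
    return results


if __name__ == "__main__":
    run_all(dry_run=True)
```

Call `run_all()` without `dry_run` from the repository root to execute the benchmarks; a non-zero value in the returned mapping marks a domain whose script failed.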

Save the running log

If you want to save the running log, you can pass the trace_collector to the benchmark run function:

```python
from mcpuniverse.tracer.collectors import FileCollector

# Run inside an async function, with `benchmark` created as in Quick Test
trace_collector = FileCollector(log_file="log/location_navigation.log")
benchmark_results = await benchmark.run(trace_collector=trace_collector)
```

Save the benchmark result to a report

If you want to save a report of the benchmark result, you can use BenchmarkReport to dump a report:

```python
from mcpuniverse.benchmark.report import BenchmarkReport

# `benchmark` and `trace_collector` are the objects used for the run
report = BenchmarkReport(benchmark, trace_collector=trace_collector)
report.dump()
```

Visualize the agent running information

To run the benchmark with intermediate results and see real-time progress, pass callbacks=get_vprint_callbacks() to the run function:

```python
from mcpuniverse.callbacks.handlers.vprint import get_vprint_callbacks

# Run inside an async function
benchmark_results = await benchmark.run(
    trace_collector=trace_collector,
    callbacks=get_vprint_callbacks()
)
```

This will print out the intermediate results as the benchmark runs.

For further details, refer to the in-code documentation or existing configuration samples in the repository.

Creating Custom Benchmarks

A benchmark is defined by three main configuration elements: the task definition, agent/workflow definition, and the benchmark configuration itself. Below is an example using a simple "weather forecasting" task.

Task definition

The task definition is provided in JSON format, for example:

```json
{
    "category": "general",
    "question": "What's the weather in San Francisco now?",
    "mcp_servers": [
        {
            "name": "weather"
        }
    ],
    "output_format": {
        "city": "<City>",
        "weather": "<Weather forecast results>"
    },
    "evaluators": [
        {
            "func": "json -> get(city)",
            "op": "=",
            "value": "San Francisco"
        }
    ]
}
```

Field descriptions:

  1. category: The task category, e.g., "general", "google-maps", etc. You can set any value for this property.
  2. question: The main question you want to ask in this task. This is treated as a user message.
  3. mcp_servers: A list of MCP servers that are supported in this framework.
  4. output_format: The desired output format of agent responses.
  5. evaluators: A list of tests to evaluate. For each test/evaluator, it has three attributes: "func" indicates how to extract values from the agent response, "op" is the comparison operator, and "value" is the ground-truth value. It will evaluate op(func(...), value, op_args...). "op" can be "=", "<", ">" or other customized operators.
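To make the `op(func(...), value)` rule concrete, the sketch below evaluates the single evaluator from the weather task against a sample response. It is an illustration only: `OPS` and `evaluate` are hypothetical names, the extraction is hard-coded to "decode JSON, then read one key", and the sample response is made up.

```python
import json
import operator

# Comparison operators named in the field description; custom ones could be added here.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def evaluate(response, key, op, value):
    """Decode a JSON response, read one key, and compare it with the ground truth."""
    extracted = json.loads(response).get(key)  # stands in for "json -> get(key)"
    return OPS[op](extracted, value)


response = '{"city": "San Francisco", "weather": "Foggy, 15 C"}'
print(evaluate(response, "city", "=", "San Francisco"))  # True
```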

In "evaluators", you need to write a rule ("func" attribute) showing how to extract values for testing. In the example above, "json -> get(city)" will first do JSON decoding and then extract the value of key "city". There are several predefined funcs in this repo:

  1. json: Perform JSON decoding.
  2. get: Get the value of a key.
  3. len: Get the length of a list.
  4. foreach: Do a FOR-EACH loop.

For example, let's define

```python
data = {"x": [{"y": [1]}, {"y": [1, 1]}, {"y": [1, 2, 3, 4]}]}
```

Then get(x) -> foreach -> get(y) -> len will do the following:

  1. Get the value of "x": [{"y": [1]}, {"y": [1, 1]}, {"y": [1, 2, 3, 4]}].
  2. Do a foreach loop and get the value of "y": [[1], [1, 1], [1, 2, 3, 4]].
  3. Get the length of each list: [1, 2, 4].
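The walkthrough above can be reproduced with a small interpreter for the `->` chain. This is a sketch under stated assumptions, not the repository's implementation: `apply_chain` is a hypothetical name, and `foreach` is assumed to apply all remaining steps to each element, which is what the three steps above imply.

```python
import json
import re


def apply_chain(chain, data):
    """Apply an 'a -> b -> c' function chain to data, step by step."""
    steps = [step.strip() for step in chain.split("->")]
    return _apply(steps, data)


def _apply(steps, data):
    if not steps:
        return data
    step, rest = steps[0], steps[1:]
    if step == "json":
        return _apply(rest, json.loads(data))
    if step == "foreach":
        # Apply the remaining steps to every element of the list.
        return [_apply(rest, item) for item in data]
    if step == "len":
        return _apply(rest, len(data))
    match = re.fullmatch(r"get\((.+)\)", step)
    if match:
        return _apply(rest, data[match.group(1)])
    raise ValueError(f"Unknown step: {step!r}")


data = {"x": [{"y": [1]}, {"y": [1, 1]}, {"y": [1, 2, 3, 4]}]}
print(apply_chain("get(x) -> foreach -> get(y) -> len", data))  # [1, 2, 4]
```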

If these predefined functions are not enough, you can implement custom ones. For more details, please check this doc.

Benchmark definition

Define agent(s) and benchmark in a YAML file. Here’s a simple weather forecast benchmark:

```yaml
kind: llm
spec:
  name: llm-1
  type: openai
  config:
    model_name: gpt-4o

---
kind: agent
spec:
  name: ReAct-agent
  type: react
  config:
    llm: llm-1
    instruction: You are an agent for weather forecasting.
    servers:
      - name: weather

---
kind: benchmark
spec:
  description: Test the agent for weather forecasting
  agent: ReAct-agent
  tasks:
    - dummy/tasks/weather.json
```

The benchmark definition mainly contains two parts: the agent definition and the benchmark configuration. The benchmark configuration is simple—you just need to specify the agent to use (by the defined agent name) and a list of tasks to evaluate. Each task entry is the task config file path. It can be a full file path or a partial file path. If it is a partial file path (like "dummy/tasks/weather.json"), it should be put in the folder mcpuniverse/benchmark/configs in this repo.
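The path rule can be sketched as follows. `resolve_task_path` is a hypothetical helper written for illustration, and it assumes that "full file path" means an absolute path; the framework's own lookup may differ in detail.

```python
from pathlib import Path

# Folder named in the paragraph above, relative to the repository root.
CONFIG_DIR = Path("mcpuniverse/benchmark/configs")


def resolve_task_path(task, repo_root="."):
    """Return absolute paths unchanged; place partial paths under CONFIG_DIR."""
    path = Path(task)
    if path.is_absolute():
        return path
    return Path(repo_root) / CONFIG_DIR / path


print(resolve_task_path("dummy/tasks/weather.json").as_posix())
# mcpuniverse/benchmark/configs/dummy/tasks/weather.json
```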

This framework offers a flexible way to define both simple agents (such as ReAct) and more complex, multi-step agent workflows.

  1. Specify LLMs: Begin by declaring the large language models (LLMs) you want the agents to use. Each LLM component must be assigned a unique name (e.g., "llm-1"). These names serve as identifiers that the framework uses to connect the different components together.
  2. Define an agent: Next, define an agent by providing its name and selecting an agent class. Agent classes are available in the mcpuniverse.agent package. Commonly used classes include "basic", "function-call", and "react". Within the agent specification (spec.config), you must also indicate which LLM instance the agent should use by setting the "llm" field.
  3. Create complex workflows: Beyond simple agents, the framework supports the definition of sophisticated, orchestrated workflows where multiple agents interact or collaborate to solve more complex tasks.

For example:

```yaml
kind: llm
spec:
  name: llm-1
  type: openai
  config:
    model_name: gpt-4o

---
kind: agent
spec:
  name: basic-agent
  type: basic
  config:
    llm: llm-1
    instruction: Return the latitude and the longitude of a place.

---
kind: agent
spec:
  name: function-call-agent
  type: function-call
  config:
    llm: llm-1
    instruction: You are an agent for weather forecast. Please return the weather today at the given latitude and longitude.
    servers:
      - name: weather

---
kind: workflow
spec:
  name: orchestrator-workflow
  type: orchestrator
  config:
    llm: llm-1
    agents:
      - basic-agent
      - function-call-agent

---
kind: benchmark
spec:
  description: Test the agent for weather forecasting
  agent: orchestrator-workflow
  tasks:
    - dummy/tasks/weather.json
```
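Because components are connected only by name, a mistyped name is an easy mistake to make. The sketch below checks that every `llm`, `agents`, and `agent` reference points at a defined component. It is a hypothetical pre-flight check, not part of the framework, and it uses plain dictionaries that mirror the configuration above so that no YAML parser is needed.

```python
# Plain-dict mirror of the configuration documents above.
components = [
    {"kind": "llm", "spec": {"name": "llm-1"}},
    {"kind": "agent", "spec": {"name": "basic-agent", "config": {"llm": "llm-1"}}},
    {"kind": "agent", "spec": {"name": "function-call-agent", "config": {"llm": "llm-1"}}},
    {"kind": "workflow", "spec": {
        "name": "orchestrator-workflow",
        "config": {"llm": "llm-1", "agents": ["basic-agent", "function-call-agent"]},
    }},
    {"kind": "benchmark", "spec": {"agent": "orchestrator-workflow"}},
]


def unresolved_references(components):
    """Return (referrer, missing name) pairs for references to undefined names."""
    llms = {c["spec"]["name"] for c in components if c["kind"] == "llm"}
    runners = {c["spec"]["name"] for c in components if c["kind"] in ("agent", "workflow")}
    problems = []
    for component in components:
        spec = component["spec"]
        label = spec.get("name", component["kind"])
        config = spec.get("config", {})
        if "llm" in config and config["llm"] not in llms:
            problems.append((label, config["llm"]))
        for agent in config.get("agents", []):
            if agent not in runners:
                problems.append((label, agent))
        if component["kind"] == "benchmark" and spec["agent"] not in runners:
            problems.append((label, spec["agent"]))
    return problems


print(unresolved_references(components))  # []
```

An empty list means every name resolves; each tuple otherwise names the component holding the bad reference and the name it could not find.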

Citation

If you use MCP-Universe in your research, please cite our paper:

```bibtex
@misc{mcpuniverse,
  title={MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers},
  author={Ziyang Luo and Zhiqi Shen and Wenzhuo Yang and Zirui Zhao and Prathyusha Jwalapuram and Amrita Saha and Doyen Sahoo and Silvio Savarese and Caiming Xiong and Junnan Li},
  year={2025},
  eprint={2508.14704},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.14704},
}
```

Owner

  • Name: Salesforce AI Research
  • Login: SalesforceAIResearch
  • Kind: organization
  • Email: ospo@salesforce.com

Open Source projects released by Salesforce AI Research

GitHub Events

Total
  • Create event: 1
  • Release event: 1
  • Issues event: 4
  • Watch event: 207
  • Issue comment event: 4
  • Push event: 6
  • Public event: 1
  • Pull request event: 1
  • Fork event: 17
Last Year
  • Create event: 1
  • Release event: 1
  • Issues event: 4
  • Watch event: 207
  • Issue comment event: 4
  • Push event: 6
  • Public event: 1
  • Pull request event: 1
  • Fork event: 17

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 3
  • Total pull requests: 2
  • Average time to close issues: about 9 hours
  • Average time to close pull requests: N/A
  • Total issue authors: 3
  • Total pull request authors: 1
  • Average comments per issue: 0.67
  • Average comments per pull request: 0.5
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 2
  • Average time to close issues: about 9 hours
  • Average time to close pull requests: N/A
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 0.67
  • Average comments per pull request: 0.5
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • NielsRogge (1)
  • njbrake (1)
  • mcptest2508251 (1)
Pull Request Authors
  • Abhinavexists (2)
Top Labels
Issue Labels
Pull Request Labels
cla:signed (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 104 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
  • Total maintainers: 1
pypi.org: mcpuniverse

A framework for developing and benchmarking AI agents using Model Context Protocol (MCP)

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 104 Last month
Rankings
Stargazers count: 4.4%
Forks count: 8.3%
Dependent packages count: 8.6%
Average: 17.4%
Dependent repos count: 48.4%
Maintainers (1)
Last synced: 9 months ago

Dependencies

.github/workflows/tests.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • redis latest docker
docker/gateway/Dockerfile docker
  • python 3.11-slim build
dev-requirements.txt pypi
  • pre-commit * development
  • pylint * development
  • pytest * development
  • pytest_asyncio * development
  • pytest_postgresql * development
pyproject.toml pypi
  • anthropic ==0.49.0
  • anyio ==4.9.0
  • bcrypt ==4.3.0
  • blender-mcp ==1.1.3
  • celery ==5.5.3
  • claude-code-sdk ==0.0.20
  • click ==8.1.8
  • fastapi ==0.115.12
  • google-api-python-client *
  • google-auth ==2.38.0
  • google-auth-oauthlib ==1.2.1
  • google-genai ==1.16.1
  • httpx ==0.28.1
  • jinja2 ==3.1.6
  • mathutils ==3.3.0
  • mcp ==1.9.4
  • mcp_server_calculator ==0.1.1
  • mcp_server_fetch *
  • mistralai ==1.6.0
  • openai ==1.68.2
  • playwright ==1.52.0
  • psycopg [binary]==3.2.9
  • pydantic ==2.10.6
  • pydantic [email]==2.10.6
  • pyseto ==1.8.4
  • python-dotenv ==1.0.1
  • pytz ==2024.2
  • pyyaml ==6.0.2
  • redis ==6.1.0
  • requests ==2.32.4
  • schema ==0.7.7
  • sqlalchemy [asyncio]==2.0.41
  • uvicorn [standard]==0.34.0
  • wikipedia-api ==0.8.1
  • xai-sdk ==1.0.0
  • yfinance ==0.2.61
requirements.txt pypi
  • anthropic ==0.49.0
  • anyio ==4.9.0
  • bcrypt ==4.3.0
  • blender-mcp ==1.1.3
  • celery ==5.5.3
  • claude-code-sdk ==0.0.20
  • click ==8.1.8
  • fastapi ==0.115.12
  • google-api-python-client *
  • google-auth ==2.38.0
  • google-auth-oauthlib ==1.2.1
  • google-genai ==1.16.1
  • httpx ==0.28.1
  • jinja2 ==3.1.6
  • mathutils ==3.3.0
  • mcp ==1.9.4
  • mcp_server_calculator ==0.1.1
  • mcp_server_fetch *
  • mistralai ==1.6.0
  • openai ==1.68.2
  • playwright ==1.52.0
  • psycopg ==3.2.9
  • pydantic ==2.10.6
  • pyseto ==1.8.4
  • python-dotenv ==1.0.1
  • pytz ==2024.2
  • pyyaml ==6.0.2
  • redis ==6.1.0
  • requests ==2.32.4
  • schema ==0.7.7
  • sqlalchemy ==2.0.41
  • wikipedia-api ==0.8.1
  • xai-sdk ==1.0.0
  • yfinance ==0.2.61