Create an evaluation suite

Evaluation suites help you test whether AI models use your tools correctly. This guide shows you how to create test cases that measure tool selection accuracy and parameter accuracy.

Install dependencies

Install Arcade with evaluation support:

Terminal
uv tool install 'arcade-mcp[evals]'

Create an evaluation file

Navigate to your server directory and create a file starting with eval_:

Terminal
cd my_server
touch eval_server.py

Evaluation files must start with eval_ and use the .py extension. The CLI automatically discovers these files.

Define your evaluation suite

Create an evaluation suite that loads tools from your server and defines test cases:

Python
from arcade_evals import (
    EvalSuite,
    tool_eval,
    ExpectedMCPToolCall,
    BinaryCritic,
)


@tool_eval()
async def weather_eval_suite() -> EvalSuite:
    """Evaluate weather tool usage."""
    suite = EvalSuite(
        name="Weather Tools",
        system_message="You are a helpful weather assistant.",
    )

    # Load tools from your MCP server
    await suite.add_mcp_stdio_server(
        command=["python", "server.py"],
    )

    # Add a test case
    suite.add_case(
        name="Get weather for city",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "Weather_GetCurrent",
                {"location": "Seattle", "units": "celsius"},
            )
        ],
        critics=[
            BinaryCritic(critic_field="location", weight=0.7),
            BinaryCritic(critic_field="units", weight=0.3),
        ],
    )

    return suite

Run the evaluation

Set your OpenAI API key and run the evaluation:

Terminal
export OPENAI_API_KEY=<your_api_key>
arcade evals .

The command discovers all eval_*.py files and executes them.

By default, evaluations use OpenAI’s gpt-4o model. To use Anthropic or other models, see Run evaluations.

Understand the results

Evaluation results show:

  • Passed: Score meets or exceeds the fail threshold (default: 0.8)
  • Failed: Score falls below the fail threshold
  • Warned: Score falls between the fail and warn thresholds (warn default: 0.9); see the sketch after this list
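
A minimal sketch of how these thresholds map a score to a reported status, assuming the default values; the classify_result helper below is purely illustrative and not part of arcade_evals:

Python
FAIL_THRESHOLD = 0.8   # default fail threshold
WARN_THRESHOLD = 0.9   # default warn threshold


def classify_result(score: float) -> str:
    """Illustrative only: map a case score to the status shown in the output."""
    if score < FAIL_THRESHOLD:
        return "FAILED"
    if score < WARN_THRESHOLD:
        return "WARNED"
    return "PASSED"


print(classify_result(1.00))  # PASSED
print(classify_result(0.85))  # WARNED
print(classify_result(0.50))  # FAILED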

Example output:

PLAINTEXT
Suite: Weather Tools
Model: gpt-4o

PASSED Get weather for city -- Score: 1.00

Summary -- Total: 1 -- Passed: 1 -- Failed: 0

Use --details to see critic feedback:

Terminal
arcade evals . --details

Detailed output includes per-critic scores:

PLAINTEXT
PASSED Get weather for city -- Score: 1.00
Details:
  location: Match: True, Score: 0.70/0.70
  units: Match: True, Score: 0.30/0.30

Loading tools

You can load tools from different sources depending on your setup.

All loading methods are async and must be awaited. Ensure your evaluation function is decorated with @tool_eval() and defined as async.
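
As a reminder, the skeleton of that pattern looks like the following (a trimmed-down version of the full example earlier in this guide; the suite name and system message are placeholders):

Python
from arcade_evals import EvalSuite, tool_eval


@tool_eval()  # required so the CLI can discover this suite
async def my_eval_suite() -> EvalSuite:  # async so the loaders can be awaited
    suite = EvalSuite(
        name="My Tools",
        system_message="You are a helpful assistant.",
    )
    await suite.add_mcp_stdio_server(command=["python", "server.py"])
    return suite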

From MCP HTTP server

Load tools from an HTTP or SSE server:

Python
await suite.add_mcp_server(
    url="http://localhost:8000",
    headers={"Authorization": "Bearer token"},
)

The loader automatically appends /mcp to the URL if not present.
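
For example, given that behavior, both of these calls should reach the same endpoint (a sketch inside an async suite function, not output from the library):

Python
# Assuming the default URL handling described above:
await suite.add_mcp_server(url="http://localhost:8000")      # /mcp is appended
await suite.add_mcp_server(url="http://localhost:8000/mcp")  # used as-is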

From MCP stdio server

Load tools from a stdio server:

Python
await suite.add_mcp_stdio_server(
    command=["python", "server.py"],
    env={"API_KEY": "secret"},
)

From Arcade Gateway

Load tools from an Arcade Gateway:

Python
await suite.add_arcade_gateway(
    gateway_slug="my-gateway",
    arcade_api_key="your-api-key",
    arcade_user_id="user-id",
)

Tool loading results are cached automatically to avoid redundant connections. If you update your server, use clear_tools_cache() to reload:

Python
from arcade_evals import clear_tools_cache

clear_tools_cache()

Manual tool definitions

Define tools manually using the MCP tool definition format:

Python
suite.add_tool_definitions([
    {
        "name": "Weather.GetCurrent",
        "description": "Get current weather for a location",
        "inputSchema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "default": "celsius",
                },
            },
            "required": ["location"],
        },
    }
])

Expected tool calls

Expected tool calls define what the model should predict. Use ExpectedMCPToolCall with normalized tool names:

Python
ExpectedMCPToolCall(
    "Weather_GetCurrent",
    {"location": "Seattle", "units": "celsius"},
)

Tool names are normalized for provider compatibility: dots (.) become underscores (_). For example, Weather.GetCurrent becomes Weather_GetCurrent. See Provider compatibility for details.
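
The rule itself is simple. A hypothetical helper illustrating it (normalize_tool_name is not part of arcade_evals, just a sketch of the dot-to-underscore behavior):

Python
def normalize_tool_name(name: str) -> str:
    # Dots are replaced with underscores for provider compatibility.
    return name.replace(".", "_")


assert normalize_tool_name("Weather.GetCurrent") == "Weather_GetCurrent"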

Critics

Critics evaluate specific parameters of tool calls. Choose the right critic for your validation needs.

Critic Type        Use When            Example Fields
BinaryCritic       Need exact match    user_id, city, status
SimilarityCritic   Semantic match OK   message, description
NumericCritic      Range acceptable    temperature, price
DatetimeCritic     Time window OK      deadline, start_time

BinaryCritic

Checks for exact matches after type casting:

Python
from arcade_evals import BinaryCritic

# Perfect for IDs, locations, and enum values
BinaryCritic(critic_field="location", weight=0.7)

SimilarityCritic

Evaluates textual similarity using cosine similarity:

Python
from arcade_evals import SimilarityCritic

SimilarityCritic(
    critic_field="message",
    weight=0.5,
    similarity_threshold=0.8,
)

NumericCritic

Assesses numeric values within tolerance:

Python
from arcade_evals import NumericCritic

NumericCritic(
    critic_field="temperature",
    tolerance=2.0,
    weight=0.3,
)

DatetimeCritic

Evaluates datetime values within a time window:

Python
from datetime import timedelta

from arcade_evals import DatetimeCritic

DatetimeCritic(
    critic_field="scheduled_time",
    tolerance=timedelta(minutes=5),
    weight=0.4,
)

Fuzzy weights

Use fuzzy weights when you want qualitative importance levels instead of precise numbers:

Python
from arcade_evals import BinaryCritic, SimilarityCritic
from arcade_evals.weights import FuzzyWeight

critics = [
    BinaryCritic(critic_field="user_id", weight=FuzzyWeight.CRITICAL),
    SimilarityCritic(critic_field="message", weight=FuzzyWeight.MEDIUM),
    BinaryCritic(critic_field="priority", weight=FuzzyWeight.LOW),
]

Fuzzy weights are automatically normalized:

Weight       Value   Normalized (example above)
MINIMAL      1       -
VERY_LOW     2       -
LOW          3       21.4%
MEDIUM       4       28.6%
HIGH         5       -
VERY_HIGH    6       -
CRITICAL     7       50.0%
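
The percentages in the table are plain proportions of the weight values used in the example (CRITICAL=7, MEDIUM=4, LOW=3). A quick arithmetic sketch, not library code:

Python
# Weight values from the example above: CRITICAL=7, MEDIUM=4, LOW=3
values = {"user_id": 7, "message": 4, "priority": 3}
total = sum(values.values())  # 14

normalized = {field: value / total for field, value in values.items()}
print(normalized)  # {'user_id': 0.5, 'message': 0.2857..., 'priority': 0.2142...}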

Multiple tool calls

Add test cases with multiple expected tool calls:

Python
suite.add_case(
    name="Check weather in multiple cities",
    user_message="What's the weather in Seattle and Portland?",
    expected_tool_calls=[
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Seattle"}),
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Portland"}),
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=1.0),
    ],
)

Conversation context

Add conversation history to test cases that require context:

Python
suite.add_case(
    name="Weather based on previous location",
    user_message="What about the weather there?",
    expected_tool_calls=[
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Tokyo"}),
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=1.0),
    ],
    additional_messages=[
        {"role": "user", "content": "I'm planning to visit Tokyo next week."},
        {"role": "assistant", "content": "That sounds exciting! What would you like to know about Tokyo?"},
    ],
)

Rubrics and thresholds

Customize evaluation thresholds using an EvalRubric:

Python
from arcade_evals import EvalRubric

rubric = EvalRubric(
    fail_threshold=0.85,
    warn_threshold=0.95,
)

suite = EvalSuite(
    name="Strict Weather Evaluation",
    system_message="You are a weather assistant.",
    rubric=rubric,
)

Default thresholds:

  • Fail threshold: 0.8
  • Warn threshold: 0.9
