Create an evaluation suite
Evaluation suites help you test whether AI models use your tools correctly. This guide shows you how to create test cases that measure tool selection and parameter accuracy.
Install dependencies
Install Arcade with evaluation support using uv:
uv tool install 'arcade-mcp[evals]'
Create an evaluation file
Navigate to your server directory and create a file starting with eval_:
cd my_server
touch eval_server.py
Evaluation files must start with eval_ and use the .py extension. The CLI automatically discovers these files.
Define your evaluation suite
Create an evaluation suite that loads tools from your server and defines test cases:
from arcade_evals import (
    EvalSuite,
    tool_eval,
    ExpectedMCPToolCall,
    BinaryCritic,
)

@tool_eval()
async def weather_eval_suite() -> EvalSuite:
    """Evaluate weather tool usage."""
    suite = EvalSuite(
        name="Weather Tools",
        system_message="You are a helpful weather assistant.",
    )

    # Load tools from your MCP server
    await suite.add_mcp_stdio_server(
        command=["python", "server.py"],
    )

    # Add a test case
    suite.add_case(
        name="Get weather for city",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "Weather_GetCurrent",
                {"location": "Seattle", "units": "celsius"}
            )
        ],
        critics=[
            BinaryCritic(critic_field="location", weight=0.7),
            BinaryCritic(critic_field="units", weight=0.3),
        ],
    )

    return suite
Run the evaluation
Set your OpenAI API key and run the evaluation:
export OPENAI_API_KEY=<your_api_key>
arcade evals .
The command discovers all eval_*.py files and executes them.
By default, evaluations use OpenAI’s gpt-4o model. To use Anthropic or different models, see Run evaluations.
Understand the results
Evaluation results show:
- Passed: Score meets or exceeds the fail threshold (default: 0.8)
- Failed: Score falls below the fail threshold
- Warned: Score is between the fail and warn thresholds (default warn threshold: 0.9)
Example output:
Suite: Weather Tools
Model: gpt-4o
PASSED Get weather for city -- Score: 1.00
Summary -- Total: 1 -- Passed: 1 -- Failed: 0
Use --details to see critic feedback:
arcade evals . --details
Detailed output includes per-critic scores:
PASSED Get weather for city -- Score: 1.00
Details:
location:
Match: True, Score: 0.70/0.70
units:
Match: True, Score: 0.30/0.30
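In this example, each critic awards its full weight when the parameter matches exactly, so the case score is 0.70 + 0.30 = 1.00, comfortably above the default warn threshold of 0.9.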
Loading tools
You can load tools from different sources depending on your setup.
All loading methods are async and must be awaited. Ensure your evaluation function is decorated with @tool_eval() and defined as async.
From MCP HTTP server
Load tools from an HTTP or SSE server:
await suite.add_mcp_server(
    url="http://localhost:8000",
    headers={"Authorization": "Bearer token"},
)
The loader automatically appends /mcp to the URL if not present.
From MCP stdio server
Load tools from a stdio server:
await suite.add_mcp_stdio_server(
    command=["python", "server.py"],
    env={"API_KEY": "secret"},
)
From Arcade Gateway
Load tools from an Arcade Gateway:
await suite.add_arcade_gateway(
    gateway_slug="my-gateway",
    arcade_api_key="your-api-key",
    arcade_user_id="user-id",
)
Tool loading results are cached automatically to avoid redundant connections. If you update your server, use clear_tools_cache() to reload:
from arcade_evals import clear_tools_cache
clear_tools_cache()
Manual tool definitions
Define tools manually using the MCP tool definition format:
suite.add_tool_definitions([
    {
        "name": "Weather.GetCurrent",
        "description": "Get current weather for a location",
        "inputSchema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "default": "celsius"
                },
            },
            "required": ["location"],
        },
    }
])
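You can then write a case against the manually defined tool. The sketch below continues from the suite above and reuses the add_case pattern shown earlier; the normalized name Weather_GetCurrent is explained in the next section, and the Paris prompt and weights are illustrative.
# Illustrative case against the manually defined tool above.
# Assumes `suite`, ExpectedMCPToolCall, and BinaryCritic are already in scope.
suite.add_case(
    name="Manual definition: get weather",
    user_message="How warm is it in Paris right now, in celsius?",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent",
            {"location": "Paris", "units": "celsius"},
        )
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=0.7),
        BinaryCritic(critic_field="units", weight=0.3),
    ],
)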
Expected tool calls
Expected calls define what the model should predict. Use ExpectedMCPToolCall with normalized tool names:
ExpectedMCPToolCall(
    "Weather_GetCurrent",
    {"location": "Seattle", "units": "celsius"}
)
Tool names are normalized for provider compatibility. Dots (.) become underscores (_). For example, Weather.GetCurrent becomes Weather_GetCurrent. See Provider compatibility for details.
Critics
Critics evaluate specific parameters of tool calls. Choose the right critic for your validation needs.
| Critic Type | Use When | Example Field |
|---|---|---|
| BinaryCritic | Need exact match | user_id, city, status |
| SimilarityCritic | Semantic match OK | message, description |
| NumericCritic | Range acceptable | temperature, price |
| DatetimeCritic | Time window OK | deadline, start_time |
BinaryCritic
Checks for exact matches after type casting:
from arcade_evals import BinaryCritic
# Perfect for IDs, locations, and enum values
BinaryCritic(critic_field="location", weight=0.7)
SimilarityCritic
Evaluates textual similarity using cosine similarity:
from arcade_evals import SimilarityCritic
SimilarityCritic(
    critic_field="message",
    weight=0.5,
    similarity_threshold=0.8
)
NumericCritic
Assesses numeric values within tolerance:
from arcade_evals import NumericCritic
NumericCritic(
    critic_field="temperature",
    tolerance=2.0,
    weight=0.3
)
DatetimeCritic
Evaluates datetime values within a time window:
from datetime import timedelta
from arcade_evals import DatetimeCritic
DatetimeCritic(
    critic_field="scheduled_time",
    tolerance=timedelta(minutes=5),
    weight=0.4
)
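The critic types can be mixed within a single case. The sketch below combines them for a hypothetical scheduling tool; the field names (event_id, notes, duration_minutes, start_time), weights, and tolerances are illustrative assumptions, not values defined by the library.
from datetime import timedelta

from arcade_evals import (
    BinaryCritic,
    DatetimeCritic,
    NumericCritic,
    SimilarityCritic,
)

# Hypothetical critic mix for a scheduling tool's parameters
critics = [
    # Exact match required for the identifier
    BinaryCritic(critic_field="event_id", weight=0.4),
    # Semantic similarity is enough for free-form text
    SimilarityCritic(critic_field="notes", weight=0.2, similarity_threshold=0.8),
    # Small numeric drift is acceptable
    NumericCritic(critic_field="duration_minutes", tolerance=5.0, weight=0.2),
    # A ten-minute window around the expected start time
    DatetimeCritic(critic_field="start_time", tolerance=timedelta(minutes=10), weight=0.2),
]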
Fuzzy weights
Use fuzzy weights when you want qualitative importance levels instead of precise numbers:
from arcade_evals import BinaryCritic, SimilarityCritic
from arcade_evals.weights import FuzzyWeight

critics = [
    BinaryCritic(
        critic_field="user_id",
        weight=FuzzyWeight.CRITICAL
    ),
    SimilarityCritic(
        critic_field="message",
        weight=FuzzyWeight.MEDIUM
    ),
    BinaryCritic(
        critic_field="priority",
        weight=FuzzyWeight.LOW
    ),
]
Fuzzy weights are automatically normalized, as the worked example after the table shows:
| Weight | Value | Normalized (example above) |
|---|---|---|
| MINIMAL | 1 | - |
| VERY_LOW | 2 | - |
| LOW | 3 | 21.4% |
| MEDIUM | 4 | 28.6% |
| HIGH | 5 | - |
| VERY_HIGH | 6 | - |
| CRITICAL | 7 | 50.0% |
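To sanity-check the percentages above, assume normalization divides each weight's integer value by the sum of the values used in the case; for the example, 7 + 4 + 3 = 14, which reproduces the table exactly:
# Integer values from the table: CRITICAL=7, MEDIUM=4, LOW=3
fuzzy_values = {"user_id": 7, "message": 4, "priority": 3}
total = sum(fuzzy_values.values())  # 14

for field, value in fuzzy_values.items():
    # Each critic's normalized share is its value divided by the total
    print(f"{field}: {value / total:.1%}")
# user_id: 50.0%
# message: 28.6%
# priority: 21.4%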
Multiple tool calls
Test cases with multiple expected calls:
suite.add_case(
    name="Check weather in multiple cities",
    user_message="What's the weather in Seattle and Portland?",
    expected_tool_calls=[
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Seattle"}),
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Portland"}),
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=1.0),
    ],
)
Conversation context
Add conversation history to test cases that require prior context:
suite.add_case(
    name="Weather based on previous location",
    user_message="What about the weather there?",
    expected_tool_calls=[
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Tokyo"}),
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=1.0),
    ],
    additional_messages=[
        {"role": "user", "content": "I'm planning to visit Tokyo next week."},
        {"role": "assistant", "content": "That sounds exciting! What would you like to know about Tokyo?"},
    ],
)
Rubrics and thresholds
Customize evaluation thresholds using an EvalRubric:
from arcade_evals import EvalRubric

rubric = EvalRubric(
    fail_threshold=0.85,
    warn_threshold=0.95,
)

suite = EvalSuite(
    name="Strict Weather Evaluation",
    system_message="You are a weather assistant.",
    rubric=rubric,
)
Default thresholds:
- Fail threshold: 0.8
- Warn threshold: 0.9
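Putting these thresholds together, a case's final score maps to an outcome roughly like this (an illustrative sketch of the semantics described above, not the library's internal logic):
# Illustrative only: classify a case score using the default thresholds.
def classify(score: float, fail_threshold: float = 0.8, warn_threshold: float = 0.9) -> str:
    if score < fail_threshold:
        return "FAILED"
    if score < warn_threshold:
        return "WARNED"  # still passes, but is flagged for review
    return "PASSED"

print(classify(1.00))  # PASSED
print(classify(0.85))  # WARNED
print(classify(0.60))  # FAILED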
Next steps
- Learn how to run evaluations with different providers
- Explore capture mode to record tool calls
- Compare sources with comparative evaluations
- Understand provider compatibility