Why evaluate tools?
Evaluations ensure AI models use your tools correctly in production. Unlike traditional testing, evaluations measure two key aspects:
- Tool selection: Does the model choose the right tools for the task?
- Parameter accuracy: Does the model provide correct arguments?
Arcade’s evaluation framework helps you validate tool-calling capabilities before deployment, ensuring reliability in real-world applications.
What can go wrong?
Without proper evaluation, AI models might:
- Misinterpret user intent, selecting the wrong tools
- Provide incorrect arguments, causing failures or unexpected behavior
- Skip necessary tool calls, missing steps in multi-step tasks
- Make incorrect assumptions about parameter defaults or formats
How evaluation works
Evaluations compare the model’s actual tool calls with the expected tool calls for each test case.
Scoring components
- Tool selection: Did the model choose the correct tool?
- Parameter evaluation: Are the arguments correct? (evaluated by critics)
- Weighted scoring: Each aspect has a weight that affects the final score (see the sketch after this list)
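To make the weighted scoring concrete, here is a minimal sketch, assuming the final score is a weight-normalized combination of the tool-selection check and each critic's score. The helper name, the 1.0/0.7/0.3 weights, and the formula itself are illustrative assumptions, not Arcade's internal implementation.
# Illustrative sketch only -- not Arcade's internal scoring code.
def weighted_score(results: dict[str, tuple[float, float]]) -> float:
    """results maps each scored aspect to a (score, weight) pair."""
    total_weight = sum(weight for _, weight in results.values())
    return sum(score * weight for score, weight in results.values()) / total_weight

weighted_score({
    "tool_selection": (1.0, 1.0),  # correct tool chosen
    "user_id": (1.0, 0.7),         # exact match (BinaryCritic)
    "message": (0.7, 0.3),         # similar but not identical (SimilarityCritic)
})  # ~0.96: passes, but a wrong tool choice would pull the score down sharply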
Evaluation results
Each test case receives:
- Score: A value between 0.0 and 1.0
- Status:
- Passed: Score meets or exceeds the warn threshold (default: 0.9)
- Failed: Score falls below the fail threshold (default: 0.8)
- Warned: Score is at or above the fail threshold but below the warn threshold
Example output:
PASSED Get weather for city -- Score: 1.00
WARNED Send message with typo -- Score: 0.85
FAILED Wrong tool selected -- Score: 0.50
Critics: Validating parameters
Critics evaluate the correctness of tool call arguments. Choose the right critic for your validation needs.
BinaryCritic
Checks for exact matches after type casting.
Use case: Exact values required (IDs, commands, enum values)
from arcade_evals import BinaryCritic
BinaryCritic(critic_field="user_id", weight=1.0)
SimilarityCritic
Measures textual similarity using cosine similarity.
Use case: Content should be similar but exact wording isn’t critical (messages, descriptions)
from arcade_evals import SimilarityCritic
SimilarityCritic(
    critic_field="message",
    weight=0.8,
    similarity_threshold=0.85
)
NumericCritic
Evaluates numeric values within a tolerance range.
Use case: Approximate values acceptable (temperatures, measurements)
from arcade_evals import NumericCritic
NumericCritic(
    critic_field="temperature",
    tolerance=2.0,
    weight=0.5
)
DatetimeCritic
Checks datetime values within a time window.
Use case: Times should be close to expected (scheduled events, deadlines)
from datetime import timedelta
from arcade_evals import DatetimeCritic
DatetimeCritic(
    critic_field="due_date",
    tolerance=timedelta(hours=1),
    weight=0.6
)
Setting thresholds with rubrics
An EvalRubric defines pass/fail criteria:
from arcade_evals import EvalRubric
rubric = EvalRubric(
    fail_threshold=0.85,  # Scores below this fail
    warn_threshold=0.95,  # Scores below this trigger a warning
)
Default thresholds:
- Fail threshold: 0.8
- Warn threshold: 0.9
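For illustration, the sketch below maps a score to a status under a rubric's thresholds, assuming the pass/warn/fail rules described above; it is not the framework's own implementation.
# Assumed mapping: below fail_threshold -> FAILED, below warn_threshold -> WARNED, else PASSED.
def status(score: float, fail_threshold: float = 0.8, warn_threshold: float = 0.9) -> str:
    if score < fail_threshold:
        return "FAILED"
    if score < warn_threshold:
        return "WARNED"
    return "PASSED"

status(1.00)  # PASSED
status(0.85)  # WARNED
status(0.50)  # FAILED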
Example scenarios
Strict evaluation (critical production systems):
rubric = EvalRubric(
    fail_threshold=0.95,
    warn_threshold=0.98,
)
Lenient evaluation (exploratory testing):
rubric = EvalRubric(
    fail_threshold=0.6,
    warn_threshold=0.8,
)
Building effective evaluation suites
A comprehensive evaluation suite includes:
1. Common cases
Test typical requests:
suite.add_case(
    name="Get weather for city",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Seattle"})
    ],
)
2. Edge cases
Test unusual or boundary conditions:
suite.add_case(
    name="Weather with ambiguous location",
    user_message="What's the weather in Portland?",  # Portland, OR or ME?
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent",
            {"location": "Portland", "state": "OR"}
        )
    ],
)
3. Multi-step cases
Test sequences requiring multiple tool calls:
suite.add_case(
    name="Compare weather in two cities",
    user_message="Compare the weather in Seattle and Portland",
    expected_tool_calls=[
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Seattle"}),
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Portland"}),
    ],
)
4. Context-dependent cases
Test with conversation history:
suite.add_case(
    name="Weather from previous context",
    user_message="What about the weather there?",
    expected_tool_calls=[
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Tokyo"})
    ],
    additional_messages=[
        {"role": "user", "content": "I'm traveling to Tokyo next week."},
        {"role": "assistant", "content": "Tokyo is a great destination!"},
    ],
)
Example evaluation suites
Weather tools
@tool_eval()
async def weather_eval_suite():
    suite = EvalSuite(
        name="Weather Tools",
        system_message="You are a weather assistant.",
    )
    await suite.add_mcp_stdio_server(["python", "weather_server.py"])
    suite.add_case(
        name="Current weather",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[
            ExpectedMCPToolCall("GetWeather", {"city": "Seattle", "type": "current"})
        ],
        critics=[
            BinaryCritic(critic_field="city", weight=0.7),
            BinaryCritic(critic_field="type", weight=0.3),
        ],
    )
    return suite
Communication tools
@tool_eval()
async def slack_eval_suite():
    suite = EvalSuite(
        name="Slack Messaging",
        system_message="You are a Slack assistant.",
    )
    await suite.add_arcade_gateway(gateway_slug="slack-gateway")
    suite.add_case(
        name="Send direct message",
        user_message="Send a DM to @alice saying 'Meeting at 3 PM'",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "send_dm",
                {"username": "alice", "message": "Meeting at 3 PM"}
            )
        ],
        critics=[
            BinaryCritic(critic_field="username", weight=0.4),
            SimilarityCritic(critic_field="message", weight=0.6),
        ],
    )
    return suite
Best practices
Start simple
Begin with straightforward cases and add complexity gradually:
- Single call with exact parameters
- Single call with flexible parameters
- Multiple calls
- Context-dependent calls
Weight critics appropriately
Assign weights based on importance:
critics=[
    BinaryCritic(critic_field="user_id", weight=0.7),      # Critical
    SimilarityCritic(critic_field="message", weight=0.3),  # Less critical
]
Or use fuzzy weights:
from arcade_evals.weights import FuzzyWeight
critics=[
    BinaryCritic(critic_field="user_id", weight=FuzzyWeight.CRITICAL),
    SimilarityCritic(critic_field="message", weight=FuzzyWeight.MEDIUM),
]
Test with multiple models
Compare performance across models:
arcade evals . \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --use-provider anthropic:claude-sonnet-4-5-20250929
Iterate based on results
Use evaluation results to:
- Identify common failure patterns
- Improve tool descriptions
- Refine parameter validation
- Add missing test cases
Next steps
- Create an evaluation suite to start testing your tools
- Run evaluations with multiple providers
- Explore capture mode to bootstrap test expectations
- Compare tool sources with comparative evaluations