Why evaluate tools?
Evaluations ensure AI models use your tools correctly in production. Unlike traditional testing, evaluations measure two key aspects:
- Tool selection: Does the model choose the right tools for the task?
- Parameter accuracy: Does the model provide correct arguments?
Arcade’s evaluation framework helps you validate tool-calling capabilities before deployment, ensuring reliability in real-world applications.
What can go wrong?
Without proper evaluation, AI models might:
- Misinterpret user intent, selecting the wrong tools
- Provide incorrect arguments, causing failures or unexpected behavior
- Skip necessary tool calls, missing steps in multi-step tasks
- Make incorrect assumptions about parameter defaults or formats
How evaluation works
Evaluations compare the model’s actual tool calls with the expected tool calls for each test case.
Scoring components
- Tool selection: Did the model choose the correct tool?
- Parameter evaluation: Are the arguments correct? (evaluated by critics)
- Weighted scoring: Each aspect has a weight that affects the final score (see the sketch after this list)
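To make the weighted scoring concrete, here is a minimal sketch, assuming the final score is a weight-normalized combination of the tool-selection check and each critic's score. The helper name, the 1.0/0.7/0.3 weights, and the formula itself are illustrative assumptions, not Arcade's internal implementation.
# Illustrative sketch only -- not Arcade's internal scoring code.
def weighted_score(results: dict[str, tuple[float, float]]) -> float:
    """results maps each scored aspect to a (score, weight) pair."""
    total_weight = sum(weight for _, weight in results.values())
    return sum(score * weight for score, weight in results.values()) / total_weight

weighted_score({
    "tool_selection": (1.0, 1.0),  # correct tool chosen
    "user_id": (1.0, 0.7),         # exact match (BinaryCritic)
    "message": (0.7, 0.3),         # similar but not identical (SimilarityCritic)
})  # ~0.96: passes, but a wrong tool choice would pull the score down sharply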
Evaluation results
Each test case receives:
- Score: A value between 0.0 and 1.0
- Status:
- Passed: Score meets or exceeds the warn threshold (default: 0.9)
- Failed: Score falls below the fail threshold (default: 0.8)
- Warned: Score is at or above the fail threshold but below the warn threshold
Example output:
PASSED Get weather for city -- Score: 1.00
WARNED Send message with typo -- Score: 0.85
FAILED Wrong tool selected -- Score: 0.50
Critics: Validating parameters
Critics evaluate the correctness of tool call arguments. Choose the right critic for your validation needs.
BinaryCritic
Checks for exact matches after type casting.
Use case: Exact values required (IDs, commands, enum values)
from arcade_evals import BinaryCritic
BinaryCritic(critic_field="user_id", weight=1.0)
SimilarityCritic
Measures textual similarity using cosine similarity.
Use case: Content should be similar but exact wording isn’t critical (messages, descriptions)
from arcade_evals import SimilarityCritic
SimilarityCritic(
    critic_field="message",
    weight=0.8,
    similarity_threshold=0.85
)
NumericCritic
Evaluates numeric values within a tolerance range.
Use case: Approximate values acceptable (temperatures, measurements)
from arcade_evals import NumericCritic
NumericCritic(
    critic_field="temperature",
    tolerance=2.0,
    weight=0.5
)
DatetimeCritic
Checks datetime values within a time window.
Use case: Times should be close to expected (scheduled events, deadlines)
from datetime import timedelta
from arcade_evals import DatetimeCritic
DatetimeCritic(
    critic_field="due_date",
    tolerance=timedelta(hours=1),
    weight=0.6
)
Setting thresholds with rubrics
An EvalRubric defines pass/fail criteria:
from arcade_evals import EvalRubric
rubric = EvalRubric(
    fail_threshold=0.85,  # Scores below this fail
    warn_threshold=0.95,  # Scores below this trigger a warning
)
Default thresholds:
- Fail threshold: 0.8
- Warn threshold: 0.9
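For illustration, the sketch below maps a score to a status under a rubric's thresholds, assuming the pass/warn/fail rules described above; it is not the framework's own implementation.
# Assumed mapping: below fail_threshold -> FAILED, below warn_threshold -> WARNED, else PASSED.
def status(score: float, fail_threshold: float = 0.8, warn_threshold: float = 0.9) -> str:
    if score < fail_threshold:
        return "FAILED"
    if score < warn_threshold:
        return "WARNED"
    return "PASSED"

status(1.00)  # PASSED
status(0.85)  # WARNED
status(0.50)  # FAILED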
Example scenarios
Strict evaluation (critical production systems):
rubric = EvalRubric(
    fail_threshold=0.95,
    warn_threshold=0.98,
)
Lenient evaluation (exploratory testing):
rubric = EvalRubric(
    fail_threshold=0.6,
    warn_threshold=0.8,
)
Building effective evaluation suites
A comprehensive evaluation suite includes:
1. Common cases
Test typical requests:
suite.add_case(
    name="Get weather for city",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Seattle"})
    ],
)
2. Edge cases
Test unusual or boundary conditions:
suite.add_case(
    name="Weather with ambiguous location",
    user_message="What's the weather in Portland?",  # Portland, OR or ME?
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent",
            {"location": "Portland", "state": "OR"}
        )
    ],
)
3. Multi-step cases
Test sequences requiring multiple tool calls:
suite.add_case(
    name="Compare weather in two cities",
    user_message="Compare the weather in Seattle and Portland",
    expected_tool_calls=[
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Seattle"}),
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Portland"}),
    ],
)
4. Context-dependent cases
Test with conversation history:
suite.add_case(
    name="Weather from previous context",
    user_message="What about the weather there?",
    expected_tool_calls=[
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Tokyo"})
    ],
    additional_messages=[
        {"role": "user", "content": "I'm traveling to Tokyo next week."},
        {"role": "assistant", "content": "Tokyo is a great destination!"},
    ],
)
Example evaluation suites
Weather tools
@tool_eval()
async def weather_eval_suite():
    suite = EvalSuite(
        name="Weather Tools",
        system_message="You are a weather assistant.",
    )
    await suite.add_mcp_stdio_server(["python", "weather_server.py"])
    suite.add_case(
        name="Current weather",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[
            ExpectedMCPToolCall("GetWeather", {"city": "Seattle", "type": "current"})
        ],
        critics=[
            BinaryCritic(critic_field="city", weight=0.7),
            BinaryCritic(critic_field="type", weight=0.3),
        ],
    )
    return suite
Communication tools
@tool_eval()
async def slack_eval_suite():
    suite = EvalSuite(
        name="Slack Messaging",
        system_message="You are a Slack assistant.",
    )
    await suite.add_arcade_gateway(gateway_slug="slack-gateway")
    suite.add_case(
        name="Send direct message",
        user_message="Send a DM to @alice saying 'Meeting at 3 PM'",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "send_dm",
                {"username": "alice", "message": "Meeting at 3 PM"}
            )
        ],
        critics=[
            BinaryCritic(critic_field="username", weight=0.4),
            SimilarityCritic(critic_field="message", weight=0.6),
        ],
    )
    return suite
Best practices
Start simple
Begin with straightforward cases and add complexity gradually:
- Single call with exact parameters
- Single call with flexible parameters
- Multiple calls
- Context-dependent calls
Weight critics appropriately
Assign weights based on importance:
critics=[
    BinaryCritic(critic_field="user_id", weight=0.7),      # Critical
    SimilarityCritic(critic_field="message", weight=0.3),  # Less critical
]
Or use fuzzy weights:
from arcade_evals.weights import FuzzyWeight
critics=[
    BinaryCritic(critic_field="user_id", weight=FuzzyWeight.CRITICAL),
    SimilarityCritic(critic_field="message", weight=FuzzyWeight.MEDIUM),
]
Test with multiple models
Compare performance across models:
arcade evals . \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --use-provider anthropic:claude-sonnet-4-5-20250929
Iterate based on results
Use evaluation results to:
- Identify common failure patterns
- Improve tool descriptions
- Refine parameter validation
- Add missing test cases
Next steps
- Create an evaluation suite to start testing your tools
- Run evaluations with multiple providers
- Explore capture mode to bootstrap test expectations
- Compare tool sources with comparative evaluations