Why evaluate tools?

Tool evaluations ensure AI models use your tools correctly in production. Unlike traditional testing, evaluations measure two key aspects:

  1. Tool selection: Does the model choose the right tools for the task?
  2. Parameter accuracy: Does the model provide correct arguments?

Arcade’s evaluation framework helps you validate tool-calling capabilities before deployment, ensuring reliability in real-world applications.

What can go wrong?

Without proper evaluation, AI models might:

  • Misinterpret user intent, selecting the wrong tool
  • Provide incorrect arguments, causing failures or unexpected behavior
  • Skip necessary tool calls, missing steps in multi-step tasks
  • Make incorrect assumptions about parameter defaults or formats

How evaluation works

Evaluations compare the model’s actual tool calls with the expected tool calls for each test case.

Scoring components

  1. Tool selection: Did the model choose the correct tool?
  2. Parameter evaluation: Are the arguments correct? (evaluated by critics)
  3. Weighted scoring: Each aspect has a weight that affects the final score (see the sketch below)
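
To make the weighting concrete, here is a minimal sketch of how tool selection and critic results could combine into a single weighted score. The values and weights are hypothetical, and the framework's actual formula may differ.

Python
# Illustrative only: combining critic results into a weighted score.
tool_selected_correctly = True     # did the model pick the expected tool?
critic_results = [                 # (critic score in [0, 1], critic weight)
    (1.0, 0.7),                    # e.g. exact user_id match
    (0.9, 0.3),                    # e.g. similar but not identical message
]

if not tool_selected_correctly:
    score = 0.0
else:
    total_weight = sum(weight for _, weight in critic_results)
    score = sum(s * w for s, w in critic_results) / total_weight

print(f"Score: {score:.2f}")  # Score: 0.97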

Evaluation results

Each test case receives:

  • Score: A value between 0.0 and 1.0
  • Status (see the sketch below):
    • Passed: Score meets or exceeds the warn threshold (default: 0.9)
    • Warned: Score is at or above the fail threshold but below the warn threshold
    • Failed: Score falls below the fail threshold (default: 0.8)
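
As a rough illustration of how a score maps to a displayed status given the two thresholds (a sketch, not the framework's actual implementation):

Python
def status(score: float, fail_threshold: float = 0.8, warn_threshold: float = 0.9) -> str:
    """Map a score to a status band using the default thresholds."""
    if score < fail_threshold:
        return "FAILED"
    if score < warn_threshold:
        return "WARNED"
    return "PASSED"

status(1.00), status(0.85), status(0.50)  # ('PASSED', 'WARNED', 'FAILED')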

Example output:

PLAINTEXT
PASSED Get weather for city -- Score: 1.00
WARNED Send message with typo -- Score: 0.85
FAILED Wrong tool selected -- Score: 0.50

Critics: Validating parameters

Critics evaluate the correctness of tool call arguments. Choose the right critic for your validation needs.

BinaryCritic

Checks for exact matches after type casting.

Use case: Exact values required (IDs, commands, enum values)

Python
from arcade_evals import BinaryCritic

BinaryCritic(critic_field="user_id", weight=1.0)
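
One way to read "exact match after type casting": values that compare equal once the actual argument is coerced to the expected type count as a match. The snippet below is only an illustration of that idea, not the critic's actual implementation.

Python
# Illustrative only: "42" (string) and 42 (int) would match after casting,
# while "43" would not.
expected, actual = 42, "42"
is_match = type(expected)(actual) == expected  # True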

SimilarityCritic

Measures textual similarity using cosine similarity.

Use case: Content should be similar but exact wording isn’t critical (messages, descriptions)

Python
from arcade_evals import SimilarityCritic

SimilarityCritic(
    critic_field="message",
    weight=0.8,
    similarity_threshold=0.85,
)
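
For intuition, cosine similarity treats the two strings as vectors (for example, bag-of-words token counts) and measures how closely they point in the same direction: 1.0 means identical, 0.0 means no overlap. A minimal sketch under that assumption (the framework's internal computation may differ):

Python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

cosine_similarity("Meeting at 3 PM", "Meeting at 3 PM today")  # ≈ 0.89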

NumericCritic

Evaluates numeric values within a tolerance range.

Use case: Approximate values acceptable (temperatures, measurements)

Python
from arcade_evals import NumericCritic

NumericCritic(
    critic_field="temperature",
    tolerance=2.0,
    weight=0.5,
)

DatetimeCritic

Checks datetime values within a time window.

Use case: Times should be close to expected (scheduled events, deadlines)

Python
from datetime import timedelta

from arcade_evals import DatetimeCritic

DatetimeCritic(
    critic_field="due_date",
    tolerance=timedelta(hours=1),
    weight=0.6,
)

Setting thresholds with rubrics

An EvalRubric defines pass/fail criteria:

Python
from arcade_evals import EvalRubric

rubric = EvalRubric(
    fail_threshold=0.85,  # Minimum score to pass
    warn_threshold=0.95,  # Scores below this (but at or above fail_threshold) get a warning
)

Default thresholds:

  • Fail threshold: 0.8
  • Warn threshold: 0.9

Example scenarios

Strict evaluation (critical production systems):

Python
rubric = EvalRubric(
    fail_threshold=0.95,
    warn_threshold=0.98,
)

Lenient evaluation (exploratory testing):

Python
rubric = EvalRubric(
    fail_threshold=0.6,
    warn_threshold=0.8,
)

Building effective evaluation suites

A comprehensive evaluation suite includes:

1. Common cases

Test typical requests:

Python
suite.add_case( name="Get weather for city", user_message="What's the weather in Seattle?", expected_tool_calls=[ ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Seattle"}) ], )

2. Edge cases

Test unusual or boundary conditions:

Python
suite.add_case( name="Weather with ambiguous location", user_message="What's the weather in Portland?", # Portland, OR or ME? expected_tool_calls=[ ExpectedMCPToolCall( "Weather_GetCurrent", {"location": "Portland", "state": "OR"} ) ], )

3. Multi-step cases

Test sequences requiring multiple tool calls:

Python
suite.add_case( name="Compare weather in two cities", user_message="Compare the weather in Seattle and Portland", expected_tool_calls=[ ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Seattle"}), ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Portland"}), ], )

4. Context-dependent cases

Test with conversation history:

Python
suite.add_case( name="Weather from previous context", user_message="What about the weather there?", expected_tool_calls=[ ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Tokyo"}) ], additional_messages=[ {"role": "user", "content": "I'm traveling to Tokyo next week."}, {"role": "assistant", "content": "Tokyo is a great destination!"}, ], )

Example evaluation suites

Weather tools

Python
@tool_eval()
async def weather_eval_suite():
    suite = EvalSuite(
        name="Weather Tools",
        system_message="You are a weather assistant.",
    )
    await suite.add_mcp_stdio_server(["python", "weather_server.py"])

    suite.add_case(
        name="Current weather",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[
            ExpectedMCPToolCall("GetWeather", {"city": "Seattle", "type": "current"})
        ],
        critics=[
            BinaryCritic(critic_field="city", weight=0.7),
            BinaryCritic(critic_field="type", weight=0.3),
        ],
    )

    return suite

Communication tools

Python
@tool_eval()
async def slack_eval_suite():
    suite = EvalSuite(
        name="Slack Messaging",
        system_message="You are a Slack assistant.",
    )
    await suite.add_arcade_gateway(gateway_slug="slack-gateway")

    suite.add_case(
        name="Send direct message",
        user_message="Send a DM to @alice saying 'Meeting at 3 PM'",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "send_dm", {"username": "alice", "message": "Meeting at 3 PM"}
            )
        ],
        critics=[
            BinaryCritic(critic_field="username", weight=0.4),
            SimilarityCritic(critic_field="message", weight=0.6),
        ],
    )

    return suite

Best practices

Start simple

Begin with straightforward cases and add complexity gradually:

  1. Single tool call with exact parameters
  2. Single tool call with flexible parameters
  3. Multiple tool calls
  4. Context-dependent tool calls

Weight critics appropriately

Assign weights based on importance:

Python
critics=[ BinaryCritic(critic_field="user_id", weight=0.7), # Critical SimilarityCritic(critic_field="message", weight=0.3), # Less critical ]

Or use fuzzy weights:

Python
from arcade_evals.weights import FuzzyWeight

critics=[
    BinaryCritic(critic_field="user_id", weight=FuzzyWeight.CRITICAL),
    SimilarityCritic(critic_field="message", weight=FuzzyWeight.MEDIUM),
]

Test with multiple models

Compare performance across models:

Terminal
arcade evals . \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --use-provider anthropic:claude-sonnet-4-5-20250929

Iterate based on results

Use evaluation results to:

  1. Identify common failure patterns
  2. Improve tool descriptions
  3. Refine parameter validation
  4. Add missing test cases
