
Comparative evaluations

Comparative evaluations let you run the same test cases against different implementations. Use tracks to compare tool sources side-by-side.

What are tracks?

Tracks are isolated registries within a single evaluation suite. Each track represents a different source of tools.

Common use cases:

  • Compare providers: Test Google Weather vs OpenWeather API
  • Version testing: Compare API v1 vs API v2
  • Implementation comparison: Test different servers for the same functionality
  • A/B testing: Evaluate alternative designs
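
Because tracks are isolated, the same tool name can be registered in more than one track without colliding; each track keeps its own registry. Here is a minimal sketch using the manual tool definition API shown later on this page (the suite name and schemas are illustrative):

Python
from arcade_evals import EvalSuite

suite = EvalSuite(
    name="Weather API Comparison",
    system_message="You are a weather assistant.",
)

# The same tool name can live in both tracks because each track is its own registry.
suite.add_tool_definitions(
    tools=[{
        "name": "GetWeather",
        "description": "v1 weather lookup (takes a city)",
        "inputSchema": {"type": "object", "properties": {"city": {"type": "string"}}},
    }],
    track="Weather v1",
)
suite.add_tool_definitions(
    tools=[{
        "name": "GetWeather",
        "description": "v2 weather lookup (takes a location)",
        "inputSchema": {"type": "object", "properties": {"location": {"type": "string"}}},
    }],
    track="Weather v2",
)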

When to use comparative evaluations

Use comparative evaluations when:

  • ✅ Testing multiple implementations of the same functionality
  • ✅ Comparing different API versions
  • ✅ Evaluating providers side-by-side
  • ✅ A/B testing designs

Use regular evaluations when:

  • ✅ Testing a single implementation
  • ✅ Validating behavior
  • ✅ Regression testing

Basic comparative evaluation

Register tools per track

Create a suite and register tools for each track:

Python
from arcade_evals import EvalSuite, tool_eval, ExpectedMCPToolCall, BinaryCritic

@tool_eval()
async def weather_comparison():
    suite = EvalSuite(
        name="Weather API Comparison",
        system_message="You are a weather assistant.",
    )

    # Track A: Weather API v1
    await suite.add_mcp_server(
        "http://weather-v1.example/mcp", track="Weather v1"
    )

    # Track B: Weather API v2
    await suite.add_mcp_server(
        "http://weather-v2.example/mcp", track="Weather v2"
    )

    return suite

Create comparative test case

Add a test case with track-specific expectations:

Python
suite.add_comparative_case(
    name="get_current_weather",
    user_message="What's the weather in Seattle?",
).for_track(
    "Weather v1",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "GetWeather", {"city": "Seattle", "type": "current"}
        )
    ],
    critics=[
        BinaryCritic(critic_field="city", weight=0.7),
        BinaryCritic(critic_field="type", weight=0.3),
    ],
).for_track(
    "Weather v2",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent", {"location": "Seattle"}
        )
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=1.0),
    ],
)

Run comparative evaluation

Terminal
arcade evals .

Results show per-track scores:

PLAINTEXT
Suite: Weather API Comparison
  Case: get_current_weather
    Track: Weather v1 -- Score: 1.00 -- PASSED
    Track: Weather v2 -- Score: 1.00 -- PASSED

Track registration

From MCP HTTP server

Python
await suite.add_mcp_server(
    url="http://localhost:8000",
    headers={"Authorization": "Bearer token"},
    track="Production API",
)

From MCP stdio server

Python
await suite.add_mcp_stdio_server(
    command=["python", "server_v2.py"],
    env={"API_KEY": "secret"},
    track="Version 2",
)

From Arcade Gateway

Python
await suite.add_arcade_gateway(
    gateway_slug="weather-gateway",
    track="Arcade Gateway",
)

Manual tool definitions

Python
suite.add_tool_definitions(
    tools=[
        {
            "name": "GetWeather",
            "description": "Get weather for a location",
            "inputSchema": {...},
        }
    ],
    track="Custom Tools",
)

Tools must be registered before creating comparative cases that reference their tracks.

Comparative case builder

The add_comparative_case() method returns a builder for defining track-specific expectations.

Basic structure

Python
suite.add_comparative_case(
    name="test_case",
    user_message="Do something",
).for_track(
    "Track A",
    expected_tool_calls=[...],
    critics=[...],
).for_track(
    "Track B",
    expected_tool_calls=[...],
    critics=[...],
)

Optional parameters

Add conversation context to comparative cases:

Python
suite.add_comparative_case(
    name="weather_with_context",
    user_message="What about the weather there?",
    system_message="You are helpful.",  # Optional override
    additional_messages=[
        {"role": "user", "content": "I'm going to Seattle"},
    ],
).for_track("Weather v1", ...).for_track("Weather v2", ...)

Different expectations per track

Tools in different tracks often have different names and parameters:

Python
suite.add_comparative_case(
    name="search_query",
    user_message="Search for Python tutorials",
).for_track(
    "Google Search",
    expected_tool_calls=[
        ExpectedMCPToolCall("Google_Search", {"query": "Python tutorials"})
    ],
    critics=[BinaryCritic(critic_field="query", weight=1.0)],
).for_track(
    "Bing Search",
    expected_tool_calls=[
        ExpectedMCPToolCall("Bing_WebSearch", {"q": "Python tutorials"})
    ],
    critics=[BinaryCritic(critic_field="q", weight=1.0)],
)

Complete example

Here’s a full comparative evaluation:

Python
from arcade_evals import (
    EvalSuite,
    tool_eval,
    ExpectedMCPToolCall,
    BinaryCritic,
    SimilarityCritic,
)

@tool_eval()
async def search_comparison():
    """Compare different search APIs."""
    suite = EvalSuite(
        name="Search API Comparison",
        system_message="You are a search assistant. Use the available tools to search for information.",
    )

    # Register search providers
    await suite.add_mcp_server(
        "http://google-search.example/mcp",
        track="Google",
    )
    await suite.add_mcp_server(
        "http://bing-search.example/mcp",
        track="Bing",
    )
    await suite.add_mcp_server(
        "http://duckduckgo.example/mcp",
        track="DuckDuckGo",
    )

    # Simple query
    suite.add_comparative_case(
        name="basic_search",
        user_message="Search for Python tutorials",
    ).for_track(
        "Google",
        expected_tool_calls=[
            ExpectedMCPToolCall("Search", {"query": "Python tutorials"})
        ],
        critics=[BinaryCritic(critic_field="query", weight=1.0)],
    ).for_track(
        "Bing",
        expected_tool_calls=[
            ExpectedMCPToolCall("WebSearch", {"q": "Python tutorials"})
        ],
        critics=[BinaryCritic(critic_field="q", weight=1.0)],
    ).for_track(
        "DuckDuckGo",
        expected_tool_calls=[
            ExpectedMCPToolCall("DDG_Search", {"search_term": "Python tutorials"})
        ],
        critics=[BinaryCritic(critic_field="search_term", weight=1.0)],
    )

    # Query with filters
    suite.add_comparative_case(
        name="search_with_filters",
        user_message="Search for Python tutorials from the last month",
    ).for_track(
        "Google",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "Search", {"query": "Python tutorials", "time_range": "month"}
            )
        ],
        critics=[
            SimilarityCritic(critic_field="query", weight=0.7),
            BinaryCritic(critic_field="time_range", weight=0.3),
        ],
    ).for_track(
        "Bing",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "WebSearch", {"q": "Python tutorials", "freshness": "Month"}
            )
        ],
        critics=[
            SimilarityCritic(critic_field="q", weight=0.7),
            BinaryCritic(critic_field="freshness", weight=0.3),
        ],
    ).for_track(
        "DuckDuckGo",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "DDG_Search", {"search_term": "Python tutorials", "time": "m"}
            )
        ],
        critics=[
            SimilarityCritic(critic_field="search_term", weight=0.7),
            BinaryCritic(critic_field="time", weight=0.3),
        ],
    )

    return suite

Run the comparison:

Terminal
arcade evals . --details

Output shows side-by-side results:

PLAINTEXT
Suite: Search API Comparison
  Case: basic_search
    Track: Google -- Score: 1.00 -- PASSED
    Track: Bing -- Score: 1.00 -- PASSED
    Track: DuckDuckGo -- Score: 1.00 -- PASSED
  Case: search_with_filters
    Track: Google -- Score: 1.00 -- PASSED
    Track: Bing -- Score: 0.85 -- WARNED
    Track: DuckDuckGo -- Score: 0.90 -- WARNED

Result structure

Comparative results are organized by track:

Python
{ "Google": { "model": "gpt-4o", "suite_name": "Search API Comparison", "track_name": "Google", "rubric": {...}, "cases": [ { "name": "basic_search", "track": "Google", "input": "Search for Python tutorials", "expected_tool_calls": [...], "predicted_tool_calls": [...], "evaluation": { "score": 1.0, "result": "passed", ... } } ] }, "Bing": {...}, "DuckDuckGo": {...} }

Mixing regular and comparative cases

A suite can have both regular and comparative cases:

Python
@tool_eval()
async def mixed_suite():
    suite = EvalSuite(
        name="Mixed Evaluation",
        system_message="You are helpful.",
    )

    # Register default tools
    await suite.add_mcp_stdio_server(["python", "server.py"])

    # Regular case (uses default tools)
    suite.add_case(
        name="regular_test",
        user_message="Do something",
        expected_tool_calls=[...],
    )

    # Register track-specific tools
    await suite.add_mcp_server("http://api-v2.example", track="v2")

    # Comparative case
    suite.add_comparative_case(
        name="compare_versions",
        user_message="Do something else",
    ).for_track(
        "default",  # Uses default tools
        expected_tool_calls=[...],
    ).for_track(
        "v2",  # Uses v2 tools
        expected_tool_calls=[...],
    )

    return suite

Use the track name "default" to reference tools registered without a track.

Capture mode with tracks

Capture tool calls from each track separately:

Terminal
arcade evals . --capture --file captures/comparison --format json

Output includes track names:

JSON
{ "captured_cases": [ { "case_name": "get_weather", "track_name": "Weather v1", "tool_calls": [ {"name": "GetWeather", "args": {...}} ] }, { "case_name": "get_weather", "track_name": "Weather v2", "tool_calls": [ {"name": "Weather_GetCurrent", "args": {...}} ] } ] }

Multi-model comparative evaluations

Combine comparative tracks with multiple models:

Terminal
arcade evals . \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --use-provider anthropic:claude-sonnet-4-5-20250929

Results show:

  • Per-track scores for each model
  • Cross-track comparisons for each model
  • Cross-model comparisons for each track

Example output:

PLAINTEXT
Suite: Weather API Comparison
  Model: gpt-4o
    Case: get_weather
      Track: Weather v1 -- Score: 1.00 -- PASSED
      Track: Weather v2 -- Score: 1.00 -- PASSED
  Model: gpt-4o-mini
    Case: get_weather
      Track: Weather v1 -- Score: 0.90 -- WARNED
      Track: Weather v2 -- Score: 0.95 -- PASSED
  Model: claude-sonnet-4-5-20250929
    Case: get_weather
      Track: Weather v1 -- Score: 1.00 -- PASSED
      Track: Weather v2 -- Score: 0.85 -- WARNED

Best practices

Use descriptive track names

Choose clear names that indicate what’s being compared:

Python
# ✅ Good
track="Weather API v1"
track="OpenWeather Production"
track="Google Weather (Staging)"

# ❌ Avoid
track="A"
track="Test1"
track="Track2"

Keep test cases consistent

Use the same user message across tracks:

Python
suite.add_comparative_case(
    name="get_weather",
    user_message="What's the weather in Seattle?",  # Same for all tracks
).for_track("v1", ...).for_track("v2", ...)

Adjust critics to match track differences

Different implementations may have different parameter names or types:

Python
.for_track(
    "Weather v1",
    expected_tool_calls=[
        ExpectedMCPToolCall("GetWeather", {"city": "Seattle"})
    ],
    critics=[
        BinaryCritic(critic_field="city", weight=1.0),  # v1 uses "city"
    ],
).for_track(
    "Weather v2",
    expected_tool_calls=[
        ExpectedMCPToolCall("GetWeather", {"location": "Seattle"})
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=1.0),  # v2 uses "location"
    ],
)

Start with capture mode

Use capture mode to discover track-specific tool signatures:

Terminal
arcade evals . --capture

Then create expectations based on captured calls.
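
One way to do this is to load the JSON capture file and copy each captured call's name and arguments into track-specific ExpectedMCPToolCall entries. Below is a minimal sketch, assuming a capture file with the captured_cases structure shown earlier; the file path is a placeholder for wherever your capture was written:

Python
import json

from arcade_evals import ExpectedMCPToolCall

# Load a capture produced by `arcade evals . --capture --format json`.
# The path is a placeholder.
with open("captures/comparison.json") as f:
    capture = json.load(f)

# Group captured tool calls by (case, track) so they can be copied into
# the matching for_track(...) expectations.
expected: dict[tuple[str, str], list[ExpectedMCPToolCall]] = {}
for captured_case in capture["captured_cases"]:
    key = (captured_case["case_name"], captured_case["track_name"])
    expected[key] = [
        ExpectedMCPToolCall(call["name"], call["args"])
        for call in captured_case["tool_calls"]
    ]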

Test edge cases per track

Different implementations may handle edge cases differently:

Python
suite.add_comparative_case(
    name="ambiguous_location",
    user_message="What's the weather in Portland?",  # OR or ME?
).for_track(
    "Weather v1",  # v1 defaults to most populous
    expected_tool_calls=[
        ExpectedMCPToolCall("GetWeather", {"city": "Portland", "state": "OR"})
    ],
).for_track(
    "Weather v2",  # v2 requires disambiguation
    expected_tool_calls=[
        ExpectedMCPToolCall("DisambiguateLocation", {"city": "Portland"}),
        ExpectedMCPToolCall("GetWeather", {"city": "Portland", "state": "OR"}),
    ],
)

Troubleshooting

Track not found

Symptom: ValueError: Track 'TrackName' not registered

Solution: Register the track before adding comparative cases:

Python
# ✅ Correct order
await suite.add_mcp_server(url, track="TrackName")
suite.add_comparative_case(...).for_track("TrackName", ...)

# ❌ Wrong order - will fail
suite.add_comparative_case(...).for_track("TrackName", ...)
await suite.add_mcp_server(url, track="TrackName")

Missing track expectations

Symptom: Case runs against some tracks but not others

Explanation: Comparative cases only run against tracks with .for_track() defined.

Solution: Add expectations for all registered tracks:

Python
suite.add_comparative_case(
    name="test",
    user_message="...",
).for_track("Track A", ...).for_track("Track B", ...)

Tool name mismatches

Symptom: "Tool not found" errors in specific tracks

Solution: Check the tool names available in each track:

Python
# List tools per track
print(suite.list_tool_names(track="Track A"))
print(suite.list_tool_names(track="Track B"))

Use the exact tool names from the output.

Inconsistent results across tracks

Symptom: Same message produces different scores across tracks

Explanation: This is expected. Different implementations may work differently.

Solution: Adjust expectations and critics per track to account for implementation differences.

Advanced patterns

Baseline comparison

Compare new implementations against a baseline:

Python
await suite.add_mcp_server(
    "http://production.example/mcp", track="Production (Baseline)"
)
await suite.add_mcp_server(
    "http://staging.example/mcp", track="Staging (New)"
)

Results show deviations from baseline.
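
Since both tracks are meant to behave the same way, a baseline comparison can reuse identical expectations for each track; a lower score on the staging track then points at a regression. Below is a minimal sketch that loops over the two tracks, reusing illustrative tool and parameter names from earlier examples:

Python
# Reuse the same expectations for the baseline and candidate tracks.
case = suite.add_comparative_case(
    name="baseline_current_weather",
    user_message="What's the weather in Seattle?",
)
for track in ("Production (Baseline)", "Staging (New)"):
    case = case.for_track(
        track,
        expected_tool_calls=[
            ExpectedMCPToolCall("GetWeather", {"city": "Seattle"})
        ],
        critics=[BinaryCritic(critic_field="city", weight=1.0)],
    )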

Progressive feature testing

Test feature support across versions:

Python
suite.add_comparative_case(
    name="advanced_filters",
    user_message="Search with advanced filters",
).for_track(
    "v1",
    expected_tool_calls=[],  # Not supported
).for_track(
    "v2",
    expected_tool_calls=[
        ExpectedMCPToolCall("SearchWithFilters", {...})
    ],
)

Tool catalog comparison

Compare Arcade tool catalogs:

Python
from arcade_core import ToolCatalog
from my_tools import weather_v1, weather_v2

catalog_v1 = ToolCatalog()
catalog_v1.add_tool(weather_v1, "Weather")

catalog_v2 = ToolCatalog()
catalog_v2.add_tool(weather_v2, "Weather")

suite.add_tool_catalog(catalog_v1, track="Python v1")
suite.add_tool_catalog(catalog_v2, track="Python v2")
