
Comparative evaluations

Comparative evaluations let you run the same test cases against different implementations. Use tracks to compare tool sources side-by-side.

What are tracks?

Tracks are isolated registries within a single evaluation suite. Each track represents a different source of tools.

Common use cases:

  • Compare providers: Test Google Weather vs OpenWeather API
  • Version testing: Compare API v1 vs API v2
  • Implementation comparison: Test different servers for the same functionality
  • A/B testing: Evaluate alternative designs
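
Because tracks are isolated, the same tool name can be registered in more than one track without colliding; each track keeps its own registry. Here is a minimal sketch using the manual tool definition API shown later on this page (the suite name and schemas are illustrative):

Python
from arcade_evals import EvalSuite

suite = EvalSuite(
    name="Weather API Comparison",
    system_message="You are a weather assistant.",
)

# The same tool name can live in both tracks because each track is its own registry.
suite.add_tool_definitions(
    tools=[{
        "name": "GetWeather",
        "description": "v1 weather lookup (takes a city)",
        "inputSchema": {"type": "object", "properties": {"city": {"type": "string"}}},
    }],
    track="Weather v1",
)
suite.add_tool_definitions(
    tools=[{
        "name": "GetWeather",
        "description": "v2 weather lookup (takes a location)",
        "inputSchema": {"type": "object", "properties": {"location": {"type": "string"}}},
    }],
    track="Weather v2",
)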

When to use comparative evaluations

Use comparative evaluations when:

  • ✅ Testing multiple implementations of the same functionality
  • ✅ Comparing different API versions
  • ✅ Evaluating providers side-by-side
  • ✅ A/B testing designs

Use regular evaluations when:

  • ✅ Testing a single implementation
  • ✅ Validating behavior
  • ✅ Regression testing

Basic comparative evaluation

Register tools per track

Create a suite and register tools for each track:

Python
from arcade_evals import EvalSuite, tool_eval, ExpectedMCPToolCall, BinaryCritic

@tool_eval()
async def weather_comparison():
    suite = EvalSuite(
        name="Weather API Comparison",
        system_message="You are a weather assistant.",
    )

    # Track A: Weather API v1
    await suite.add_mcp_server(
        "http://weather-v1.example/mcp", track="Weather v1"
    )

    # Track B: Weather API v2
    await suite.add_mcp_server(
        "http://weather-v2.example/mcp", track="Weather v2"
    )

    return suite

Create comparative test case

Add a test case with track-specific expectations:

Python
suite.add_comparative_case(
    name="get_current_weather",
    user_message="What's the weather in Seattle?",
).for_track(
    "Weather v1",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "GetWeather", {"city": "Seattle", "type": "current"}
        )
    ],
    critics=[
        BinaryCritic(critic_field="city", weight=0.7),
        BinaryCritic(critic_field="type", weight=0.3),
    ],
).for_track(
    "Weather v2",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent", {"location": "Seattle"}
        )
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=1.0),
    ],
)

Run comparative evaluation

Terminal
arcade evals .

Results show per-track scores:

PLAINTEXT
Suite: Weather API Comparison
  Case: get_current_weather
    Track: Weather v1 -- Score: 1.00 -- PASSED
    Track: Weather v2 -- Score: 1.00 -- PASSED

Track registration

From MCP HTTP server

Python
await suite.add_mcp_server(
    url="http://localhost:8000",
    headers={"Authorization": "Bearer token"},
    track="Production API",
)

From MCP stdio server

Python
await suite.add_mcp_stdio_server(
    command=["python", "server_v2.py"],
    env={"API_KEY": "secret"},
    track="Version 2",
)

From Arcade Gateway

Python
await suite.add_arcade_gateway(
    gateway_slug="weather-gateway",
    track="Arcade Gateway",
)

Manual tool definitions

Python
suite.add_tool_definitions(
    tools=[
        {
            "name": "GetWeather",
            "description": "Get weather for a location",
            "inputSchema": {...},
        }
    ],
    track="Custom Tools",
)

Tools must be registered before creating comparative cases that reference their tracks.

Comparative case builder

The add_comparative_case() method returns a builder for defining track-specific expectations.

Basic structure

Python
suite.add_comparative_case(
    name="test_case",
    user_message="Do something",
).for_track(
    "Track A",
    expected_tool_calls=[...],
    critics=[...],
).for_track(
    "Track B",
    expected_tool_calls=[...],
    critics=[...],
)

Optional parameters

Add conversation context to comparative cases:

Python
suite.add_comparative_case(
    name="weather_with_context",
    user_message="What about the weather there?",
    system_message="You are helpful.",  # Optional override
    additional_messages=[
        {"role": "user", "content": "I'm going to Seattle"},
    ],
).for_track("Weather v1", ...).for_track("Weather v2", ...)

Different expectations per track

Tools in different tracks often have different names and parameters:

Python
suite.add_comparative_case(
    name="search_query",
    user_message="Search for Python tutorials",
).for_track(
    "Google Search",
    expected_tool_calls=[
        ExpectedMCPToolCall("Google_Search", {"query": "Python tutorials"})
    ],
    critics=[BinaryCritic(critic_field="query", weight=1.0)],
).for_track(
    "Bing Search",
    expected_tool_calls=[
        ExpectedMCPToolCall("Bing_WebSearch", {"q": "Python tutorials"})
    ],
    critics=[BinaryCritic(critic_field="q", weight=1.0)],
)

Complete example

Here’s a full comparative evaluation:

Python
from arcade_evals import (
    EvalSuite,
    tool_eval,
    ExpectedMCPToolCall,
    BinaryCritic,
    SimilarityCritic,
)

@tool_eval()
async def search_comparison():
    """Compare different search APIs."""
    suite = EvalSuite(
        name="Search API Comparison",
        system_message="You are a search assistant. Use the available tools to search for information.",
    )

    # Register search providers
    await suite.add_mcp_server(
        "http://google-search.example/mcp",
        track="Google",
    )
    await suite.add_mcp_server(
        "http://bing-search.example/mcp",
        track="Bing",
    )
    await suite.add_mcp_server(
        "http://duckduckgo.example/mcp",
        track="DuckDuckGo",
    )

    # Simple query
    suite.add_comparative_case(
        name="basic_search",
        user_message="Search for Python tutorials",
    ).for_track(
        "Google",
        expected_tool_calls=[
            ExpectedMCPToolCall("Search", {"query": "Python tutorials"})
        ],
        critics=[BinaryCritic(critic_field="query", weight=1.0)],
    ).for_track(
        "Bing",
        expected_tool_calls=[
            ExpectedMCPToolCall("WebSearch", {"q": "Python tutorials"})
        ],
        critics=[BinaryCritic(critic_field="q", weight=1.0)],
    ).for_track(
        "DuckDuckGo",
        expected_tool_calls=[
            ExpectedMCPToolCall("DDG_Search", {"search_term": "Python tutorials"})
        ],
        critics=[BinaryCritic(critic_field="search_term", weight=1.0)],
    )

    # Query with filters
    suite.add_comparative_case(
        name="search_with_filters",
        user_message="Search for Python tutorials from the last month",
    ).for_track(
        "Google",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "Search", {"query": "Python tutorials", "time_range": "month"}
            )
        ],
        critics=[
            SimilarityCritic(critic_field="query", weight=0.7),
            BinaryCritic(critic_field="time_range", weight=0.3),
        ],
    ).for_track(
        "Bing",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "WebSearch", {"q": "Python tutorials", "freshness": "Month"}
            )
        ],
        critics=[
            SimilarityCritic(critic_field="q", weight=0.7),
            BinaryCritic(critic_field="freshness", weight=0.3),
        ],
    ).for_track(
        "DuckDuckGo",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "DDG_Search", {"search_term": "Python tutorials", "time": "m"}
            )
        ],
        critics=[
            SimilarityCritic(critic_field="search_term", weight=0.7),
            BinaryCritic(critic_field="time", weight=0.3),
        ],
    )

    return suite

Run the comparison:

Terminal
arcade evals . --details

Output shows side-by-side results:

PLAINTEXT
Suite: Search API Comparison
  Case: basic_search
    Track: Google -- Score: 1.00 -- PASSED
    Track: Bing -- Score: 1.00 -- PASSED
    Track: DuckDuckGo -- Score: 1.00 -- PASSED
  Case: search_with_filters
    Track: Google -- Score: 1.00 -- PASSED
    Track: Bing -- Score: 0.85 -- WARNED
    Track: DuckDuckGo -- Score: 0.90 -- WARNED

Result structure

Comparative results are organized by track:

Python
{ "Google": { "model": "gpt-4o", "suite_name": "Search API Comparison", "track_name": "Google", "rubric": {...}, "cases": [ { "name": "basic_search", "track": "Google", "input": "Search for Python tutorials", "expected_tool_calls": [...], "predicted_tool_calls": [...], "evaluation": { "score": 1.0, "result": "passed", ... } } ] }, "Bing": {...}, "DuckDuckGo": {...} }

Mixing regular and comparative cases

A suite can have both regular and comparative cases:

Python
@tool_eval()
async def mixed_suite():
    suite = EvalSuite(
        name="Mixed Evaluation",
        system_message="You are helpful.",
    )

    # Register default tools
    await suite.add_mcp_stdio_server(["python", "server.py"])

    # Regular case (uses default tools)
    suite.add_case(
        name="regular_test",
        user_message="Do something",
        expected_tool_calls=[...],
    )

    # Register track-specific tools
    await suite.add_mcp_server("http://api-v2.example", track="v2")

    # Comparative case
    suite.add_comparative_case(
        name="compare_versions",
        user_message="Do something else",
    ).for_track(
        "default",  # Uses default tools
        expected_tool_calls=[...],
    ).for_track(
        "v2",  # Uses v2 tools
        expected_tool_calls=[...],
    )

    return suite

Use the track name "default" to reference tools registered without a track.

Capture mode with tracks

Capture tool calls from each track separately:

Terminal
arcade evals . --capture --file captures/comparison --format json

Output includes track names:

JSON
{ "captured_cases": [ { "case_name": "get_weather", "track_name": "Weather v1", "tool_calls": [ {"name": "GetWeather", "args": {...}} ] }, { "case_name": "get_weather", "track_name": "Weather v2", "tool_calls": [ {"name": "Weather_GetCurrent", "args": {...}} ] } ] }

Multi-model comparative evaluations

Combine comparative tracks with multiple models:

Terminal
arcade evals . \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --use-provider anthropic:claude-sonnet-4-5-20250929

Results show:

  • Per-track scores for each model
  • Cross-track comparisons for each model
  • Cross-model comparisons for each track

Example output:

PLAINTEXT
Suite: Weather API Comparison
  Model: gpt-4o
    Case: get_weather
      Track: Weather v1 -- Score: 1.00 -- PASSED
      Track: Weather v2 -- Score: 1.00 -- PASSED
  Model: gpt-4o-mini
    Case: get_weather
      Track: Weather v1 -- Score: 0.90 -- WARNED
      Track: Weather v2 -- Score: 0.95 -- PASSED
  Model: claude-sonnet-4-5-20250929
    Case: get_weather
      Track: Weather v1 -- Score: 1.00 -- PASSED
      Track: Weather v2 -- Score: 0.85 -- WARNED

Best practices

Use descriptive track names

Choose clear names that indicate what’s being compared:

Python
# ✅ Good
track="Weather API v1"
track="OpenWeather Production"
track="Google Weather (Staging)"

# ❌ Avoid
track="A"
track="Test1"
track="Track2"

Keep test cases consistent

Use the same user message across tracks:

Python
suite.add_comparative_case(
    name="get_weather",
    user_message="What's the weather in Seattle?",  # Same for all tracks
).for_track("v1", ...).for_track("v2", ...)

Adjust critics to match track differences

Different implementations may have different parameter names or types:

Python
.for_track(
    "Weather v1",
    expected_tool_calls=[
        ExpectedMCPToolCall("GetWeather", {"city": "Seattle"})
    ],
    critics=[
        BinaryCritic(critic_field="city", weight=1.0),  # v1 uses "city"
    ],
).for_track(
    "Weather v2",
    expected_tool_calls=[
        ExpectedMCPToolCall("GetWeather", {"location": "Seattle"})
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=1.0),  # v2 uses "location"
    ],
)

Start with capture mode

Use capture mode to discover track-specific tool signatures:

Terminal
arcade evals . --capture

Then create expectations based on captured calls.
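
One way to do this is to load the JSON capture file and copy each captured call's name and arguments into track-specific ExpectedMCPToolCall entries. Below is a minimal sketch, assuming a capture file with the captured_cases structure shown earlier; the file path is a placeholder for wherever your capture was written:

Python
import json

from arcade_evals import ExpectedMCPToolCall

# Load a capture produced by `arcade evals . --capture --format json`.
# The path is a placeholder.
with open("captures/comparison.json") as f:
    capture = json.load(f)

# Group captured tool calls by (case, track) so they can be copied into
# the matching for_track(...) expectations.
expected: dict[tuple[str, str], list[ExpectedMCPToolCall]] = {}
for captured_case in capture["captured_cases"]:
    key = (captured_case["case_name"], captured_case["track_name"])
    expected[key] = [
        ExpectedMCPToolCall(call["name"], call["args"])
        for call in captured_case["tool_calls"]
    ]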

Test edge cases per track

Different implementations may handle edge cases differently:

Python
suite.add_comparative_case(
    name="ambiguous_location",
    user_message="What's the weather in Portland?",  # OR or ME?
).for_track(
    "Weather v1",  # v1 defaults to most populous
    expected_tool_calls=[
        ExpectedMCPToolCall("GetWeather", {"city": "Portland", "state": "OR"})
    ],
).for_track(
    "Weather v2",  # v2 requires disambiguation
    expected_tool_calls=[
        ExpectedMCPToolCall("DisambiguateLocation", {"city": "Portland"}),
        ExpectedMCPToolCall("GetWeather", {"city": "Portland", "state": "OR"}),
    ],
)

Troubleshooting

Track not found

Symptom: ValueError: Track 'TrackName' not registered

Solution: Register the track before adding comparative cases:

Python
# ✅ Correct order
await suite.add_mcp_server(url, track="TrackName")
suite.add_comparative_case(...).for_track("TrackName", ...)

# ❌ Wrong order - will fail
suite.add_comparative_case(...).for_track("TrackName", ...)
await suite.add_mcp_server(url, track="TrackName")

Missing track expectations

Symptom: Case runs against some tracks but not others

Explanation: Comparative cases only run against tracks with .for_track() defined.

Solution: Add expectations for all registered tracks:

Python
suite.add_comparative_case(
    name="test",
    user_message="...",
).for_track("Track A", ...).for_track("Track B", ...)

Tool name mismatches

Symptom: "Tool not found" errors in specific tracks

Solution: Check the tool names available in each track:

Python
# List tools per track
print(suite.list_tool_names(track="Track A"))
print(suite.list_tool_names(track="Track B"))

Use the exact tool names from the output.

Inconsistent results across tracks

Symptom: Same message produces different scores across tracks

Explanation: This is expected. Different implementations may work differently.

Solution: Adjust expectations and critics per track to account for implementation differences.

Advanced patterns

Baseline comparison

Compare new implementations against a baseline:

Python
await suite.add_mcp_server(
    "http://production.example/mcp", track="Production (Baseline)"
)
await suite.add_mcp_server(
    "http://staging.example/mcp", track="Staging (New)"
)

Results show deviations from baseline.
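
Since both tracks are meant to behave the same way, a baseline comparison can reuse identical expectations for each track; a lower score on the staging track then points at a regression. Below is a minimal sketch that loops over the two tracks, reusing illustrative tool and parameter names from earlier examples:

Python
# Reuse the same expectations for the baseline and candidate tracks.
case = suite.add_comparative_case(
    name="baseline_current_weather",
    user_message="What's the weather in Seattle?",
)
for track in ("Production (Baseline)", "Staging (New)"):
    case = case.for_track(
        track,
        expected_tool_calls=[
            ExpectedMCPToolCall("GetWeather", {"city": "Seattle"})
        ],
        critics=[BinaryCritic(critic_field="city", weight=1.0)],
    )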

Progressive feature testing

Test feature support across versions:

Python
suite.add_comparative_case(
    name="advanced_filters",
    user_message="Search with advanced filters",
).for_track(
    "v1",
    expected_tool_calls=[],  # Not supported
).for_track(
    "v2",
    expected_tool_calls=[
        ExpectedMCPToolCall("SearchWithFilters", {...})
    ],
)

Tool catalog comparison

Compare Arcade tool catalogs:

Python
from arcade_core import ToolCatalog
from my_tools import weather_v1, weather_v2

catalog_v1 = ToolCatalog()
catalog_v1.add_tool(weather_v1, "Weather")

catalog_v2 = ToolCatalog()
catalog_v2.add_tool(weather_v2, "Weather")

suite.add_tool_catalog(catalog_v1, track="Python v1")
suite.add_tool_catalog(catalog_v2, track="Python v2")
