
Capture mode

Capture mode records tool calls without evaluating them. Use it to bootstrap test expectations or debug model behavior.

When to use capture mode

Bootstrapping test expectations: When you don’t know what calls to expect, run capture mode to see what the model actually calls.

Debugging model behavior: When evaluations fail unexpectedly, capture mode shows exactly what the model is doing.

Exploring new tools: When adding new tools, capture mode helps you understand how models interpret them.

Documenting usage: Create examples of how models use your tools in different scenarios.

Typical workflow

PLAINTEXT
1. Create suite with empty expected_tool_calls
2. Run: arcade evals . --capture --format json
3. Review captured tool calls in output file
4. Copy tool calls into expected_tool_calls
5. Add critics for validation
6. Run: arcade evals . --details

Basic usage

Create an evaluation suite without expectations

Create a suite with test cases but empty expected_tool_calls:

Python
from arcade_evals import EvalSuite, tool_eval


@tool_eval()
async def capture_weather_suite():
    suite = EvalSuite(
        name="Weather Capture",
        system_message="You are a weather assistant.",
    )
    await suite.add_mcp_stdio_server(["python", "weather_server.py"])

    # Add cases without expected tool calls
    suite.add_case(
        name="Simple weather query",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],  # Empty for capture
    )
    suite.add_case(
        name="Multi-city comparison",
        user_message="Compare the weather in Seattle and Portland",
        expected_tool_calls=[],
    )
    return suite

Run in capture mode

Run evaluations with the --capture flag:

Terminal
arcade evals . --capture --file captures/weather --format json

This creates captures/weather.json with all captured tool calls.

Review captured output

Open the JSON file to see which tools the model called:

JSON
{ "suite_name": "Weather Capture", "model": "gpt-4o", "provider": "openai", "captured_cases": [ { "case_name": "Simple weather query", "user_message": "What's the weather in Seattle?", "tool_calls": [ { "name": "Weather_GetCurrent", "args": { "location": "Seattle", "units": "fahrenheit" } } ] } ] }

Convert to test expectations

Copy the captured tool calls into your evaluation suite:

Python
from arcade_evals import ExpectedMCPToolCall, BinaryCritic

suite.add_case(
    name="Simple weather query",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent",
            {"location": "Seattle", "units": "fahrenheit"}
        )
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=0.7),
        BinaryCritic(critic_field="units", weight=0.3),
    ],
)

CLI options

Basic capture

Record tool calls to JSON:

Terminal
arcade evals . --capture --file captures/baseline --format json

Include conversation context

Capture system messages and conversation history:

Terminal
arcade evals . --capture --add-context --file captures/detailed --format json

Output includes:

JSON
{ "case_name": "Weather with context", "user_message": "What about the weather there?", "system_message": "You are a weather assistant.", "additional_messages": [ {"role": "user", "content": "I'm traveling to Tokyo"}, {"role": "assistant", "content": "Tokyo is a great city!"} ], "tool_calls": [...] }

Multiple formats

Save captures in multiple formats:

Terminal
arcade evals . --capture --file captures/out --format json,md

Markdown format is more readable for quick review:

Markdown
## Weather Capture

### Model: gpt-4o

#### Case: Simple weather query

**Input:** What's the weather in Seattle?

**Tool Calls:**

- `Weather_GetCurrent`
  - location: Seattle
  - units: fahrenheit

Multiple providers

Capture from multiple providers to compare behavior:

Terminal
arcade evals . --capture \
  --use-provider openai:gpt-4o \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --file captures/comparison --format json

Programmatic capture

Use capture mode from Python code:

Python
import asyncio

from openai import AsyncOpenAI
from arcade_evals import EvalSuite


async def capture_tool_calls():
    suite = EvalSuite(name="Weather", system_message="You are helpful.")
    await suite.add_mcp_stdio_server(["python", "server.py"])

    suite.add_case(
        name="weather_query",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],
    )

    client = AsyncOpenAI(api_key="sk-...")
    result = await suite.capture(
        client=client,
        model="gpt-4o",
        provider="openai",
        include_context=True,
    )

    # Access captured data
    for case in result.captured_cases:
        print(f"Case: {case.case_name}")
        for tool_call in case.tool_calls:
            print(f"  Tool: {tool_call.name}")
            print(f"  Args: {tool_call.args}")

    # Save to file
    result.write_to_file("captures/output.json", include_context=True)
    return result


asyncio.run(capture_tool_calls())

Capture result structure

CaptureResult

Top-level capture result:

Python
@dataclass
class CaptureResult:
    suite_name: str
    model: str
    provider: str
    captured_cases: list[CapturedCase]

Methods:

  • to_dict(include_context=False) → dict
  • to_json(include_context=False, indent=2) → JSON string
  • write_to_file(file_path, include_context=False, indent=2) → None
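
For example, a minimal usage sketch (assuming result is the CaptureResult returned by suite.capture() in the programmatic example above):

Python
# Convert to a plain dict, including system messages and conversation history
data = result.to_dict(include_context=True)

# Serialize to a JSON string with 2-space indentation
json_str = result.to_json(include_context=True, indent=2)

# Write the capture directly to disk
result.write_to_file("captures/output.json", include_context=True)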

CapturedCase

Individual test case result:

Python
@dataclass
class CapturedCase:
    case_name: str
    user_message: str
    tool_calls: list[CapturedToolCall]
    system_message: str | None = None
    additional_messages: list[dict] | None = None
    track_name: str | None = None

CapturedToolCall

Individual tool call:

Python
@dataclass
class CapturedToolCall:
    name: str
    args: dict[str, Any]

Capture with comparative tracks

Capture from multiple tool sources to see how different implementations behave:

Python
@tool_eval()
async def capture_comparative():
    suite = EvalSuite(
        name="Weather Comparison",
        system_message="You are a weather assistant.",
    )

    # Register different tool sources
    await suite.add_mcp_server(
        "http://weather-api-1.example/mcp", track="Weather API v1"
    )
    await suite.add_mcp_server(
        "http://weather-api-2.example/mcp", track="Weather API v2"
    )

    # Capture will run against each track
    suite.add_case(
        name="get_weather",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],
    )
    return suite

Run capture:

Terminal
arcade evals . --capture --file captures/apis --format json

Output shows captures per track:

JSON
{ "captured_cases": [ { "case_name": "get_weather", "track_name": "Weather API v1", "tool_calls": [ {"name": "GetCurrentWeather", "args": {...}} ] }, { "case_name": "get_weather", "track_name": "Weather API v2", "tool_calls": [ {"name": "Weather_Current", "args": {...}} ] } ] }

Best practices

Start with broad queries

Begin with open-ended prompts to see natural model behavior:

Python
suite.add_case( name="explore_weather_tools", user_message="Show me everything you can do with weather", expected_tool_calls=[], )

Capture edge cases

Record model behavior on unusual inputs:

Python
suite.add_case( name="ambiguous_location", user_message="What's the weather in Portland?", # OR or ME? expected_tool_calls=[], )

Include context variations

Capture with different conversation contexts:

Python
suite.add_case( name="weather_from_context", user_message="How about the weather there?", additional_messages=[ {"role": "user", "content": "I'm going to Seattle"}, ], expected_tool_calls=[], )

Capture multiple providers

Compare how different models interpret your tools:

Terminal
arcade evals . --capture \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --file captures/models --format json,md

Converting captures to tests

Step 1: Identify patterns

Review captured calls to find patterns:

JSON
// Most queries use "fahrenheit"
{"location": "Seattle", "units": "fahrenheit"}
{"location": "Portland", "units": "fahrenheit"}

// Some use "celsius"
{"location": "Tokyo", "units": "celsius"}

Step 2: Create base expectations

Create expected calls based on patterns:

Python
# Default to fahrenheit for US cities
ExpectedMCPToolCall("GetWeather", {"location": "Seattle", "units": "fahrenheit"})

# Use celsius for international cities
ExpectedMCPToolCall("GetWeather", {"location": "Tokyo", "units": "celsius"})

Step 3: Add appropriate critics

Choose critics based on parameter importance:

Python
critics=[ BinaryCritic(critic_field="location", weight=0.8), # Critical BinaryCritic(critic_field="units", weight=0.2), # Less critical ]

Step 4: Run evaluations

Test with real evaluations:

Terminal
arcade evals . --details

Step 5: Iterate

Use failures to refine:

  • Adjust expected values
  • Change critic weights
  • Modify tool descriptions
  • Add more test cases

Troubleshooting

No tool calls captured

Symptom: Empty tool_calls arrays

Possible causes:

  1. Model didn’t call any tools
  2. Tools not properly registered
  3. System message doesn’t encourage tool use

Solution:

Python
suite = EvalSuite( name="Weather", system_message="You are a weather assistant. Use the available weather tools to answer questions.", )

Unexpected tool names

Symptom: Tool names have underscores instead of dots

Explanation: Tool names are normalized for provider compatibility. Weather.GetCurrent becomes Weather_GetCurrent.

Solution: Use normalized names in expectations:

Python
ExpectedMCPToolCall("Weather_GetCurrent", {...})

See Provider compatibility for details.

Missing parameters

Symptom: Some parameters are missing from captured calls

Explanation: Models may omit optional parameters.

Solution: Check if parameters have defaults in your schema. The evaluation framework applies defaults automatically.
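
As a sketch of what that looks like on the server side (assuming weather_server.py is built with the MCP Python SDK's FastMCP; the tool and parameter names here are illustrative), give optional parameters defaults so omitted values still compare cleanly:

Python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Weather")


@mcp.tool()
def get_current(location: str, units: str = "fahrenheit") -> str:
    """Get the current weather for a location."""
    # "units" declares a default in the tool schema, so models may omit it
    # in their calls without producing an incomplete argument set.
    return f"Weather in {location} ({units}): sunny"


if __name__ == "__main__":
    mcp.run()  # stdio transport, matching add_mcp_stdio_server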

Different results per provider

Symptom: OpenAI and Anthropic capture different calls

Explanation: Providers interpret tool descriptions differently.

Solution: This is expected. Use captures to understand provider-specific behavior, then create provider-agnostic tests.
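
One way to keep a case provider-agnostic (a sketch using the same APIs as the examples above; the expected values are illustrative) is to weight the parameters every provider agrees on heavily and the ones that vary lightly:

Python
from arcade_evals import ExpectedMCPToolCall, BinaryCritic

suite.add_case(
    name="weather_query_any_provider",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent",
            {"location": "Seattle", "units": "fahrenheit"}
        )
    ],
    critics=[
        # Every provider resolves the location the same way
        BinaryCritic(critic_field="location", weight=0.9),
        # Providers differ on secondary parameters, so weight them lightly
        BinaryCritic(critic_field="units", weight=0.1),
    ],
)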

Example workflow

Here’s a complete workflow from capture to evaluation:

Create capture suite

Python
@tool_eval()
async def initial_capture():
    suite = EvalSuite(name="Slack Tools", system_message="You are a Slack assistant.")
    await suite.add_arcade_gateway(gateway_slug="slack")

    suite.add_case(
        name="send_message",
        user_message="Send a message to #general saying 'Hello team'",
        expected_tool_calls=[],
    )
    suite.add_case(
        name="send_dm",
        user_message="Send a DM to alice saying 'Meeting at 3'",
        expected_tool_calls=[],
    )
    return suite

Capture with multiple models

Terminal
arcade evals . --capture \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --file captures/slack --format json,md

Review markdown output

Markdown
## Slack Tools

### Model: gpt-4o

#### Case: send_message

**Tool Calls:**

- `send_message_to_channel`
  - channel: general
  - message: Hello team

#### Case: send_dm

**Tool Calls:**

- `send_dm_to_user`
  - user: alice
  - message: Meeting at 3

Create evaluation suite

Python
@tool_eval()
async def slack_eval():
    suite = EvalSuite(name="Slack Tools", system_message="You are a Slack assistant.")
    await suite.add_arcade_gateway(gateway_slug="slack")

    suite.add_case(
        name="send_message",
        user_message="Send a message to #general saying 'Hello team'",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "send_message_to_channel",
                {"channel": "general", "message": "Hello team"}
            )
        ],
        critics=[
            BinaryCritic(critic_field="channel", weight=0.4),
            SimilarityCritic(critic_field="message", weight=0.6),
        ],
    )
    return suite

Run evaluations

Terminal
arcade evals . --details

Iterate based on results

Refine expectations and critics based on evaluation results.
