
Capture mode

Capture mode records tool calls without evaluating them. Use it to bootstrap test expectations or debug model behavior.

When to use capture mode

Bootstrapping test expectations: When you don’t know what calls to expect, run capture mode to see what the model actually calls.

Debugging model behavior: When evaluations fail unexpectedly, capture mode shows exactly what the model is doing.

Exploring new tools: When adding new tools, capture mode helps you understand how models interpret them.

Documenting usage: Create examples of how models use your tools in different scenarios.

Typical workflow

PLAINTEXT
1. Create suite with empty expected_tool_calls
2. Run: arcade evals . --capture --format json
3. Review captured tool calls in output file
4. Copy tool calls into expected_tool_calls
5. Add critics for validation
6. Run: arcade evals . --details

Basic usage

Create an evaluation suite without expectations

Create a suite with test cases but empty expected_tool_calls:

Python
from arcade_evals import EvalSuite, tool_eval


@tool_eval()
async def capture_weather_suite():
    suite = EvalSuite(
        name="Weather Capture",
        system_message="You are a weather assistant.",
    )
    await suite.add_mcp_stdio_server(["python", "weather_server.py"])

    # Add cases without expected tool calls
    suite.add_case(
        name="Simple weather query",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],  # Empty for capture
    )
    suite.add_case(
        name="Multi-city comparison",
        user_message="Compare the weather in Seattle and Portland",
        expected_tool_calls=[],
    )
    return suite

Run in capture mode

Run evaluations with the --capture flag:

Terminal
arcade evals . --capture --file captures/weather --format json

This creates captures/weather.json with all captured tool calls.

Review captured output

Open the JSON file to see which tools the model called:

JSON
{ "suite_name": "Weather Capture", "model": "gpt-4o", "provider": "openai", "captured_cases": [ { "case_name": "Simple weather query", "user_message": "What's the weather in Seattle?", "tool_calls": [ { "name": "Weather_GetCurrent", "args": { "location": "Seattle", "units": "fahrenheit" } } ] } ] }

Convert to test expectations

Copy the captured tool calls into your evaluation suite:

Python
from arcade_evals import ExpectedMCPToolCall, BinaryCritic

suite.add_case(
    name="Simple weather query",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent",
            {"location": "Seattle", "units": "fahrenheit"}
        )
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=0.7),
        BinaryCritic(critic_field="units", weight=0.3),
    ],
)

CLI options

Basic capture

Record tool calls to JSON:

Terminal
arcade evals . --capture --file captures/baseline --format json

Include conversation context

Capture system messages and conversation history:

Terminal
arcade evals . --capture --add-context --file captures/detailed --format json

Output includes:

JSON
{ "case_name": "Weather with context", "user_message": "What about the weather there?", "system_message": "You are a weather assistant.", "additional_messages": [ {"role": "user", "content": "I'm traveling to Tokyo"}, {"role": "assistant", "content": "Tokyo is a great city!"} ], "tool_calls": [...] }

Multiple formats

Save captures in multiple formats:

Terminal
arcade evals . --capture --file captures/out --format json,md

Markdown format is more readable for quick review:

Markdown
## Weather Capture

### Model: gpt-4o

#### Case: Simple weather query

**Input:** What's the weather in Seattle?

**Tool Calls:**

- `Weather_GetCurrent`
  - location: Seattle
  - units: fahrenheit

Multiple providers

Capture from multiple providers to compare behavior:

Terminal
arcade evals . --capture \
  --use-provider openai:gpt-4o \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --file captures/comparison --format json

Programmatic capture

Use capture mode from Python code:

Python
import asyncio

from openai import AsyncOpenAI
from arcade_evals import EvalSuite


async def capture_tool_calls():
    suite = EvalSuite(name="Weather", system_message="You are helpful.")
    await suite.add_mcp_stdio_server(["python", "server.py"])

    suite.add_case(
        name="weather_query",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],
    )

    client = AsyncOpenAI(api_key="sk-...")
    result = await suite.capture(
        client=client,
        model="gpt-4o",
        provider="openai",
        include_context=True,
    )

    # Access captured data
    for case in result.captured_cases:
        print(f"Case: {case.case_name}")
        for tool_call in case.tool_calls:
            print(f"  Tool: {tool_call.name}")
            print(f"  Args: {tool_call.args}")

    # Save to file
    result.write_to_file("captures/output.json", include_context=True)
    return result


asyncio.run(capture_tool_calls())

Capture result structure

CaptureResult

Top-level capture result:

Python
@dataclass
class CaptureResult:
    suite_name: str
    model: str
    provider: str
    captured_cases: list[CapturedCase]

Methods:

  • to_dict(include_context=False) → dict
  • to_json(include_context=False, indent=2) → JSON string
  • write_to_file(file_path, include_context=False, indent=2) → None
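
For example, a minimal usage sketch (assuming result is the CaptureResult returned by suite.capture() in the programmatic example above):

Python
# Convert to a plain dict, including system messages and conversation history
data = result.to_dict(include_context=True)

# Serialize to a JSON string with 2-space indentation
json_str = result.to_json(include_context=True, indent=2)

# Write the capture directly to disk
result.write_to_file("captures/output.json", include_context=True)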

CapturedCase

Individual test case result:

Python
@dataclass
class CapturedCase:
    case_name: str
    user_message: str
    tool_calls: list[CapturedToolCall]
    system_message: str | None = None
    additional_messages: list[dict] | None = None
    track_name: str | None = None

CapturedToolCall

Individual tool call:

Python
@dataclass
class CapturedToolCall:
    name: str
    args: dict[str, Any]

Capture with comparative tracks

Capture from multiple tool sources to see how different implementations behave:

Python
@tool_eval()
async def capture_comparative():
    suite = EvalSuite(
        name="Weather Comparison",
        system_message="You are a weather assistant.",
    )

    # Register different tool sources
    await suite.add_mcp_server(
        "http://weather-api-1.example/mcp", track="Weather API v1"
    )
    await suite.add_mcp_server(
        "http://weather-api-2.example/mcp", track="Weather API v2"
    )

    # Capture will run against each track
    suite.add_case(
        name="get_weather",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],
    )
    return suite

Run capture:

Terminal
arcade evals . --capture --file captures/apis --format json

Output shows captures per track:

JSON
{ "captured_cases": [ { "case_name": "get_weather", "track_name": "Weather API v1", "tool_calls": [ {"name": "GetCurrentWeather", "args": {...}} ] }, { "case_name": "get_weather", "track_name": "Weather API v2", "tool_calls": [ {"name": "Weather_Current", "args": {...}} ] } ] }

Best practices

Start with broad queries

Begin with open-ended prompts to see natural model behavior:

Python
suite.add_case( name="explore_weather_tools", user_message="Show me everything you can do with weather", expected_tool_calls=[], )

Capture edge cases

Record model behavior on unusual inputs:

Python
suite.add_case( name="ambiguous_location", user_message="What's the weather in Portland?", # OR or ME? expected_tool_calls=[], )

Include context variations

Capture with different conversation contexts:

Python
suite.add_case( name="weather_from_context", user_message="How about the weather there?", additional_messages=[ {"role": "user", "content": "I'm going to Seattle"}, ], expected_tool_calls=[], )

Capture multiple providers

Compare how different models interpret your tools:

Terminal
arcade evals . --capture \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --file captures/models --format json,md

Converting captures to tests

Step 1: Identify patterns

Review captured calls to find patterns:

JSON
// Most queries use "fahrenheit"
{"location": "Seattle", "units": "fahrenheit"}
{"location": "Portland", "units": "fahrenheit"}

// Some use "celsius"
{"location": "Tokyo", "units": "celsius"}

Step 2: Create base expectations

Create expected calls based on patterns:

Python
# Default to fahrenheit for US cities
ExpectedMCPToolCall("GetWeather", {"location": "Seattle", "units": "fahrenheit"})

# Use celsius for international cities
ExpectedMCPToolCall("GetWeather", {"location": "Tokyo", "units": "celsius"})

Step 3: Add appropriate critics

Choose critics based on parameter importance:

Python
critics=[ BinaryCritic(critic_field="location", weight=0.8), # Critical BinaryCritic(critic_field="units", weight=0.2), # Less critical ]

Step 4: Run evaluations

Test with real evaluations:

Terminal
arcade evals . --details

Step 5: Iterate

Use failures to refine:

  • Adjust expected values
  • Change critic weights
  • Modify tool descriptions
  • Add more test cases

Troubleshooting

No tool calls captured

Symptom: Empty tool_calls arrays

Possible causes:

  1. Model didn’t call any tools
  2. Tools not properly registered
  3. System message doesn’t encourage tool use

Solution:

Python
suite = EvalSuite( name="Weather", system_message="You are a weather assistant. Use the available weather tools to answer questions.", )

Unexpected tool names

Symptom: Tool names have underscores instead of dots

Explanation: Tool names are normalized for provider compatibility. Weather.GetCurrent becomes Weather_GetCurrent.

Solution: Use normalized names in expectations:

Python
ExpectedMCPToolCall("Weather_GetCurrent", {...})

See Provider compatibility for details.

Missing parameters

Symptom: Some parameters are missing from captured calls

Explanation: Models may omit optional parameters.

Solution: Check if parameters have defaults in your schema. The evaluation framework applies defaults automatically.
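
As a sketch of what that looks like on the server side (assuming weather_server.py is built with the MCP Python SDK's FastMCP; the tool and parameter names here are illustrative), give optional parameters defaults so omitted values still compare cleanly:

Python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Weather")


@mcp.tool()
def get_current(location: str, units: str = "fahrenheit") -> str:
    """Get the current weather for a location."""
    # "units" declares a default in the tool schema, so models may omit it
    # in their calls without producing an incomplete argument set.
    return f"Weather in {location} ({units}): sunny"


if __name__ == "__main__":
    mcp.run()  # stdio transport, matching add_mcp_stdio_server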

Different results per provider

Symptom: OpenAI and Anthropic capture different calls

Explanation: Providers interpret tool descriptions differently.

Solution: This is expected. Use captures to understand provider-specific behavior, then create provider-agnostic tests.
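
One way to keep a case provider-agnostic (a sketch using the same APIs as the examples above; the expected values are illustrative) is to weight the parameters every provider agrees on heavily and the ones that vary lightly:

Python
from arcade_evals import ExpectedMCPToolCall, BinaryCritic

suite.add_case(
    name="weather_query_any_provider",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent",
            {"location": "Seattle", "units": "fahrenheit"}
        )
    ],
    critics=[
        # Every provider resolves the location the same way
        BinaryCritic(critic_field="location", weight=0.9),
        # Providers differ on secondary parameters, so weight them lightly
        BinaryCritic(critic_field="units", weight=0.1),
    ],
)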

Example workflow

Here’s a complete workflow from capture to evaluation:

Create capture suite

Python
@tool_eval()
async def initial_capture():
    suite = EvalSuite(name="Slack Tools", system_message="You are a Slack assistant.")
    await suite.add_arcade_gateway(gateway_slug="slack")

    suite.add_case(
        name="send_message",
        user_message="Send a message to #general saying 'Hello team'",
        expected_tool_calls=[],
    )
    suite.add_case(
        name="send_dm",
        user_message="Send a DM to alice saying 'Meeting at 3'",
        expected_tool_calls=[],
    )
    return suite

Capture with multiple models

Terminal
arcade evals . --capture \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --file captures/slack --format json,md

Review markdown output

Markdown
## Slack Tools

### Model: gpt-4o

#### Case: send_message

**Tool Calls:**

- `send_message_to_channel`
  - channel: general
  - message: Hello team

#### Case: send_dm

**Tool Calls:**

- `send_dm_to_user`
  - user: alice
  - message: Meeting at 3

Create evaluation suite

Python
@tool_eval()
async def slack_eval():
    suite = EvalSuite(name="Slack Tools", system_message="You are a Slack assistant.")
    await suite.add_arcade_gateway(gateway_slug="slack")

    suite.add_case(
        name="send_message",
        user_message="Send a message to #general saying 'Hello team'",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "send_message_to_channel",
                {"channel": "general", "message": "Hello team"}
            )
        ],
        critics=[
            BinaryCritic(critic_field="channel", weight=0.4),
            SimilarityCritic(critic_field="message", weight=0.6),
        ],
    )
    return suite

Run evaluations

Terminal
arcade evals . --details

Iterate based on results

Refine expectations and critics based on evaluation results.
