Capture mode
Capture mode records tool calls without evaluating them. Use it to bootstrap test expectations or debug model behavior.
When to use capture mode
- Bootstrapping test expectations: When you don’t know what calls to expect, run capture mode to see what the model actually calls.
- Debugging model behavior: When evaluations fail unexpectedly, capture mode shows exactly what the model is doing.
- Exploring new tools: When adding new tools, capture mode helps you understand how models interpret them.
- Documenting usage: Create examples of how models use your tools in different scenarios.
Typical workflow
1. Create suite with empty expected_tool_calls
↓
2. Run: arcade evals . --capture --format json
↓
3. Review captured tool calls in output file
↓
4. Copy tool calls into expected_tool_calls
↓
5. Add critics for validation
↓
6. Run: arcade evals . --details
Basic usage
Create an evaluation suite without expectations
Create a suite with test cases but empty expected_tool_calls:
from arcade_evals import EvalSuite, tool_eval

@tool_eval()
async def capture_weather_suite():
    suite = EvalSuite(
        name="Weather Capture",
        system_message="You are a weather assistant.",
    )
    await suite.add_mcp_stdio_server(["python", "weather_server.py"])

    # Add cases without expected tool calls
    suite.add_case(
        name="Simple weather query",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],  # Empty for capture
    )
    suite.add_case(
        name="Multi-city comparison",
        user_message="Compare the weather in Seattle and Portland",
        expected_tool_calls=[],
    )

    return suite
Run in capture mode
Run evaluations with the --capture flag:
arcade evals . --capture --file captures/weather --format json
This creates captures/weather.json with all captured tool calls.
Review captured output
Open the JSON file to see what the model called:
{
  "suite_name": "Weather Capture",
  "model": "gpt-4o",
  "provider": "openai",
  "captured_cases": [
    {
      "case_name": "Simple weather query",
      "user_message": "What's the weather in Seattle?",
      "tool_calls": [
        {
          "name": "Weather_GetCurrent",
          "args": {
            "location": "Seattle",
            "units": "fahrenheit"
          }
        }
      ]
    }
  ]
}
Convert to test expectations
Copy the captured calls into your evaluation suite:
from arcade_evals import ExpectedMCPToolCall, BinaryCritic

suite.add_case(
    name="Simple weather query",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent",
            {"location": "Seattle", "units": "fahrenheit"}
        )
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=0.7),
        BinaryCritic(critic_field="units", weight=0.3),
    ],
)
CLI options
Basic capture
Record tool calls to JSON:
arcade evals . --capture --file captures/baseline --format json
Include conversation context
Capture system messages and conversation history:
arcade evals . --capture --add-context --file captures/detailed --format json
Output includes:
{
  "case_name": "Weather with context",
  "user_message": "What about the weather there?",
  "system_message": "You are a weather assistant.",
  "additional_messages": [
    {"role": "user", "content": "I'm traveling to Tokyo"},
    {"role": "assistant", "content": "Tokyo is a great city!"}
  ],
  "tool_calls": [...]
}
Multiple formats
Save captures in multiple formats:
arcade evals . --capture --file captures/out --format json,md
Markdown format is more readable for quick review:
## Weather Capture
### Model: gpt-4o
#### Case: Simple weather query
**Input:** What's the weather in Seattle?
**Tool Calls:**
- `Weather_GetCurrent`
  - location: Seattle
  - units: fahrenheit
Multiple providers
Capture from multiple providers to compare behavior:
arcade evals . --capture \
  --use-provider openai:gpt-4o \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --file captures/comparison --format json
Programmatic capture
Use capture mode from Python code:
import asyncio

from openai import AsyncOpenAI

from arcade_evals import EvalSuite

async def capture_tool_calls():
    suite = EvalSuite(name="Weather", system_message="You are helpful.")
    await suite.add_mcp_stdio_server(["python", "server.py"])

    suite.add_case(
        name="weather_query",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],
    )

    client = AsyncOpenAI(api_key="sk-...")

    result = await suite.capture(
        client=client,
        model="gpt-4o",
        provider="openai",
        include_context=True,
    )

    # Access captured data
    for case in result.captured_cases:
        print(f"Case: {case.case_name}")
        for tool_call in case.tool_calls:
            print(f" Tool: {tool_call.name}")
            print(f" Args: {tool_call.args}")

    # Save to file
    result.write_to_file("captures/output.json", include_context=True)

    return result

asyncio.run(capture_tool_calls())
Capture result structure
CaptureResult
Top-level capture result:
@dataclass
class CaptureResult:
    suite_name: str
    model: str
    provider: str
    captured_cases: list[CapturedCase]
Methods:
- to_dict(include_context=False) → dict
- to_json(include_context=False, indent=2) → JSON string
- write_to_file(file_path, include_context=False, indent=2) → None
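For example, after a programmatic capture the result can be inspected as a dict or JSON string, or written straight to disk. A minimal sketch; the file path is illustrative:
# `result` is the CaptureResult returned by suite.capture(...)
data = result.to_dict()  # plain dict with suite_name, model, provider, captured_cases
print(data["suite_name"], len(data["captured_cases"]))

print(result.to_json(indent=2))  # JSON string for quick inspection

result.write_to_file("captures/weather.json", include_context=True)  # persist to disk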
CapturedCase
Individual test case result:
@dataclass
class CapturedCase:
    case_name: str
    user_message: str
    tool_calls: list[CapturedToolCall]
    system_message: str | None = None
    additional_messages: list[dict] | None = None
    track_name: str | None = None
CapturedToolCall
Individual tool call:
@dataclass
class CapturedToolCall:
    name: str
    args: dict[str, Any]
Capture with comparative tracks
Capture from multiple sources to see how different implementations behave:
@tool_eval()
async def capture_comparative():
    suite = EvalSuite(
        name="Weather Comparison",
        system_message="You are a weather assistant.",
    )

    # Register different tool sources
    await suite.add_mcp_server(
        "http://weather-api-1.example/mcp",
        track="Weather API v1"
    )
    await suite.add_mcp_server(
        "http://weather-api-2.example/mcp",
        track="Weather API v2"
    )

    # Capture will run against each track
    suite.add_case(
        name="get_weather",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],
    )

    return suite
Run capture:
arcade evals . --capture --file captures/apis --format json
Output shows captures per track:
{
  "captured_cases": [
    {
      "case_name": "get_weather",
      "track_name": "Weather API v1",
      "tool_calls": [
        {"name": "GetCurrentWeather", "args": {...}}
      ]
    },
    {
      "case_name": "get_weather",
      "track_name": "Weather API v2",
      "tool_calls": [
        {"name": "Weather_Current", "args": {...}}
      ]
    }
  ]
}
Best practices
Start with broad queries
Begin with open-ended prompts to see natural model behavior:
suite.add_case(
    name="explore_weather_tools",
    user_message="Show me everything you can do with weather",
    expected_tool_calls=[],
)
Capture edge cases
Record model behavior on unusual inputs:
suite.add_case(
    name="ambiguous_location",
    user_message="What's the weather in Portland?",  # OR or ME?
    expected_tool_calls=[],
)
Include context variations
Capture with different conversation histories:
suite.add_case(
    name="weather_from_context",
    user_message="How about the weather there?",
    additional_messages=[
        {"role": "user", "content": "I'm going to Seattle"},
    ],
    expected_tool_calls=[],
)
Capture multiple providers
Compare how different models interpret your tools:
arcade evals . --capture \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --file captures/models --format json,md
Converting captures to tests
Step 1: Identify patterns
Review captured calls to find patterns:
// Most queries use "fahrenheit"
{"location": "Seattle", "units": "fahrenheit"}
{"location": "Portland", "units": "fahrenheit"}
// Some use "celsius"
{"location": "Tokyo", "units": "celsius"}Step 2: Create base expectations
Create expected calls based on patterns:
# Default to fahrenheit for US cities
ExpectedMCPToolCall("GetWeather", {"location": "Seattle", "units": "fahrenheit"})
# Use celsius for international cities
ExpectedMCPToolCall("GetWeather", {"location": "Tokyo", "units": "celsius"})Step 3: Add appropriate critics
Choose critics based on parameter importance:
critics=[
    BinaryCritic(critic_field="location", weight=0.8),  # Critical
    BinaryCritic(critic_field="units", weight=0.2),  # Less critical
]
Step 4: Run evaluations
Test with real evaluations:
arcade evals . --details
Step 5: Iterate
Use failures to refine your suite (see the sketch after this list):
- Adjust expected values
- Change critic weights
- Modify descriptions
- Add more test cases
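For instance, the captures in Step 1 surfaced a celsius variant; covering it with an extra case, weighted toward the field that must be right, is a typical refinement. A sketch reusing the names from the steps above; values and weights are illustrative:
# Add a case for the celsius pattern the captures revealed
suite.add_case(
    name="weather_in_celsius",
    user_message="What's the weather in Tokyo in celsius?",
    expected_tool_calls=[
        ExpectedMCPToolCall("GetWeather", {"location": "Tokyo", "units": "celsius"})
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=0.8),
        BinaryCritic(critic_field="units", weight=0.2),
    ],
)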
Troubleshooting
No tool calls captured
Symptom: Empty tool_calls arrays
Possible causes:
- Model didn’t call any tools
- Tools not properly registered
- System message doesn’t encourage tool use
Solution:
suite = EvalSuite(
    name="Weather",
    system_message="You are a weather assistant. Use the available weather tools to answer questions.",
)
Unexpected tool names
Symptom: Tool names have underscores instead of dots
Explanation: Tool names are normalized for provider compatibility. Weather.GetCurrent becomes Weather_GetCurrent.
Solution: Use normalized names in expectations:
ExpectedMCPToolCall("Weather_GetCurrent", {...})See Provider compatibility for details.
Missing parameters
Symptom: Some parameters are missing from captured calls
Explanation: Models may omit optional parameters.
Solution: Check if parameters have defaults in your schema. The evaluation framework applies defaults automatically.
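For example, if the tool schema declares a default for units, a model may send only location, and the expectation can still spell out the default because it is filled in before comparison. A sketch; the function and parameter names are hypothetical:
# Hypothetical tool in weather_server.py: "units" is optional with a default,
# so models may omit it from their calls.
def get_current_weather(location: str, units: str = "fahrenheit") -> dict:
    """Return the current weather for a location."""
    ...

# A captured call with only {"location": "Seattle"} still satisfies this expectation,
# because the default for "units" is applied automatically.
ExpectedMCPToolCall("get_current_weather", {"location": "Seattle", "units": "fahrenheit"})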
Different results per provider
Symptom: OpenAI and Anthropic capture different calls
Explanation: Providers interpret tool descriptions differently.
Solution: This is expected. Use captures to understand provider-specific behavior, then create provider-agnostic tests.
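One way to do that: keep exact-match critics on structured fields that every provider gets right, and use a similarity critic on free-text fields where phrasing varies. A sketch using the Slack case from the workflow below; weights are illustrative:
critics=[
    BinaryCritic(critic_field="channel", weight=0.4),      # identical across providers
    SimilarityCritic(critic_field="message", weight=0.6),  # tolerates provider-specific phrasing
]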
Example workflow
Here’s a complete workflow from capture to evaluation:
Create capture suite
@tool_eval()
async def initial_capture():
    suite = EvalSuite(name="Slack Tools", system_message="You are a Slack assistant.")
    await suite.add_arcade_gateway(gateway_slug="slack")

    suite.add_case(
        name="send_message",
        user_message="Send a message to #general saying 'Hello team'",
        expected_tool_calls=[],
    )
    suite.add_case(
        name="send_dm",
        user_message="Send a DM to alice saying 'Meeting at 3'",
        expected_tool_calls=[],
    )

    return suite
Capture with multiple models
arcade evals . --capture \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --file captures/slack --format json,md
Review markdown output
## Slack Tools
### Model: gpt-4o
#### Case: send_message
**Tool Calls:**
- `send_message_to_channel`
  - channel: general
  - message: Hello team
#### Case: send_dm
**Tool Calls:**
- `send_dm_to_user`
  - user: alice
  - message: Meeting at 3
Create evaluation suite
@tool_eval()
async def slack_eval():
    suite = EvalSuite(name="Slack Tools", system_message="You are a Slack assistant.")
    await suite.add_arcade_gateway(gateway_slug="slack")

    suite.add_case(
        name="send_message",
        user_message="Send a message to #general saying 'Hello team'",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                "send_message_to_channel",
                {"channel": "general", "message": "Hello team"}
            )
        ],
        critics=[
            BinaryCritic(critic_field="channel", weight=0.4),
            SimilarityCritic(critic_field="message", weight=0.6),
        ],
    )

    return suite
Run evaluations
arcade evals . --details
Iterate based on results
Refine expectations and critics based on evaluation results.
Next steps
- Learn about comparative evaluations to compare tool sources
- Understand provider compatibility for cross-provider testing
- Create evaluation suites with expectations