# Run evaluations
The arcade evals command discovers and executes evaluation suites with support for multiple providers, models, and output formats.
## Basic usage
Run all evaluations in the current directory:
```bash
arcade evals .
```

The command searches for files starting with eval_ and ending with .py.
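For reference, an evaluation file that the command would discover might look roughly like the sketch below. The import path, class names, and keyword arguments are assumptions rather than the exact arcade-mcp[evals] API, and the get_weather tool is hypothetical; treat this as an illustration of the file shape.

```python
# eval_weather.py -- hypothetical example; discovered because the filename starts
# with "eval_" and ends with ".py".
# NOTE: the import path, class names, and keyword arguments are assumptions, not
# the definitive arcade-mcp[evals] API; check the package documentation.
from arcade_evals import BinaryCritic, EvalSuite, ExpectedToolCall, tool_eval


@tool_eval()
def weather_eval_suite() -> EvalSuite:
    suite = EvalSuite(
        name="Weather tools",
        system_message="You are a helpful weather assistant.",
    )
    suite.add_case(
        name="Get weather for city",
        user_message="What's the weather in Seattle, in celsius?",
        expected_tool_calls=[
            ExpectedToolCall(name="get_weather", args={"location": "Seattle", "units": "celsius"}),
        ],
        critics=[
            BinaryCritic(critic_field="location", weight=0.7),
            BinaryCritic(critic_field="units", weight=0.3),
        ],
    )
    return suite
```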
Show detailed results with critic feedback:
```bash
arcade evals . --details
```

Filter to show only failures:
```bash
arcade evals . --failed-only
```

## Multi-provider support
### Single provider with default model
Use OpenAI with default model (gpt-4o):
```bash
export OPENAI_API_KEY=sk-...
arcade evals .
```

Use Anthropic with default model (claude-sonnet-4-5-20250929):
```bash
export ANTHROPIC_API_KEY=sk-ant-...
arcade evals . --use-provider anthropic
```

### Specific models
Specify one or more models for a provider:
```bash
arcade evals . --use-provider openai:gpt-4o,gpt-4o-mini
```

### Multiple providers
Compare performance across providers:
```bash
arcade evals . \
  --use-provider openai:gpt-4o \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --openai-key sk-... \
  --anthropic-key sk-ant-...
```

When you specify multiple models, results show side-by-side comparisons.
### API keys

API keys are resolved in the following order:
| Priority | OpenAI | Anthropic |
|---|---|---|
| 1. Explicit flag | --openai-key | --anthropic-key |
| 2. Environment | OPENAI_API_KEY | ANTHROPIC_API_KEY |
| 3. .env file | OPENAI_API_KEY=... | ANTHROPIC_API_KEY=... |
Create a .env file in your directory to avoid setting keys in every terminal session.
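For example, a minimal .env file with both keys (placeholder values shown):

```bash
# .env -- read from the directory where you run arcade evals
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```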
## Capture mode
Record calls without scoring to bootstrap test expectations:
```bash
arcade evals . --capture --file captures/baseline --format json
```

Include conversation in captured output:
```bash
arcade evals . --capture --add-context --file captures/detailed
```

Capture mode is useful for:
- Creating initial test expectations
- Debugging model behavior
- Understanding call patterns
See Capture mode for details.
## Output formats
### Save results to files
Save results in one or more formats:
```bash
arcade evals . --file results/out --format md,html
```

Save in all formats:
```bash
arcade evals . --file results/out --format all
```

### Available formats
| Format | Extension | Description |
|---|---|---|
| txt | .txt | Plain text, pytest-style output |
| md | .md | Markdown with tables and collapsible sections |
| html | .html | Interactive HTML report |
| json | .json | Structured JSON for programmatic use |
Multiple formats generate separate files:
```text
results/out.txt
results/out.md
results/out.html
results/out.json
```
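The json report is intended for programmatic use. A minimal sketch for loading it (the report schema isn't documented on this page, so this only inspects the top-level structure):

```python
import json
from pathlib import Path

# Load the report written by: arcade evals . --file results/out --format json
report = json.loads(Path("results/out.json").read_text())

# The schema isn't specified here, so just inspect the top-level structure.
if isinstance(report, dict):
    print("Top-level keys:", sorted(report))
else:
    print("Top-level entries:", len(report))
```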
## Command options
### Quick reference
| Flag | Purpose | Example |
|---|---|---|
--use-provider | Select provider/model | --use-provider openai:gpt-4o |
--capture | Record without scoring | --capture --file out |
--details | Show critic feedback | --details |
--failed-only | Filter failures | --failed-only |
--format | Output format(s) | --format md,html,json |
--max-concurrent | Parallel limit | --max-concurrent 10 |
### --use-provider
Specify which provider(s) and model(s) to use:
```text
--use-provider <provider>[:<model1>,<model2>,...]
```

Supported providers:
- openai (default: gpt-4o)
- anthropic (default: claude-sonnet-4-5-20250929)
Anthropic model names include date stamps. Check Anthropic’s model documentation for the latest model versions.
Examples:
```bash
# Default model for provider
arcade evals . --use-provider anthropic

# Specific model
arcade evals . --use-provider openai:gpt-4o-mini

# Multiple models from same provider
arcade evals . --use-provider openai:gpt-4o,gpt-4o-mini

# Multiple providers
arcade evals . \
  --use-provider openai:gpt-4o \
  --use-provider anthropic:claude-sonnet-4-5-20250929
```

### --openai-key, --anthropic-key
Provide API keys explicitly:
```bash
arcade evals . --use-provider openai --openai-key sk-...
```

### --capture
Enable capture mode to record calls without scoring:
```bash
arcade evals . --capture
```

### --add-context
Include system messages and conversation history in output:
```bash
arcade evals . --add-context --file out --format md
```

### --file
Specify output file base name:
```bash
arcade evals . --file results/evaluation
```

### --format
Choose output format(s):
```bash
arcade evals . --format md,html,json
```

Use all for all formats:
```bash
arcade evals . --format all
```

### --details, -d
Show detailed results including critic feedback:
```bash
arcade evals . --details
```

### --failed-only
Show only failed test cases:
```bash
arcade evals . --failed-only
```

### --max-concurrent, -c
Set maximum concurrent evaluations:
```bash
arcade evals . --max-concurrent 10
```

Default is 5 concurrent evaluations.
### --arcade-url
Override Arcade gateway URL for testing:
```bash
arcade evals . --arcade-url https://staging.arcade.dev
```

## Understanding results
### Summary format
Results show overall performance:
```text
Summary -- Total: 5 -- Passed: 4 -- Failed: 1
```

### Case results
Each case displays status and score:
```text
PASSED Get weather for city -- Score: 1.00
FAILED Weather with invalid city -- Score: 0.65
```

### Detailed feedback
Use --details to see critic-level analysis:
```text
Details:
  location:
    Match: False, Score: 0.00/0.70
    Expected: Seattle
    Actual: Seatle
  units:
    Match: True, Score: 0.30/0.30
```

### Multi-model results
When using multiple models, results show comparison tables:
```text
Case: Get weather for city
  Model: gpt-4o -- Score: 1.00 -- PASSED
  Model: gpt-4o-mini -- Score: 0.95 -- WARNED
```

## Advanced usage
### Test against staging gateway
Point to a staging Arcade gateway:
```bash
export ARCADE_API_KEY=...
export ARCADE_USER_ID=...
arcade evals . \
  --arcade-url https://staging.arcade.dev \
  --use-provider openai
```

### High concurrency for fast execution
Increase concurrent evaluations:
```bash
arcade evals . --max-concurrent 20
```

High concurrency may hit API rate limits. Start with the default (5) and increase gradually.
### Save comprehensive results
Generate all formats with full details:
```bash
arcade evals . \
  --details \
  --add-context \
  --file results/full-report \
  --format all
```

## Troubleshooting
### Missing dependencies
If you see ImportError: MCP SDK is required, install the full package:
```bash
pip install 'arcade-mcp[evals]'
```

For Anthropic support:
```bash
pip install anthropic
```

### Tool name mismatches
Tool names are normalized (dots become underscores). If you see unexpected tool names, check Provider compatibility.
### API rate limits

Reduce the --max-concurrent value:
```bash
arcade evals . --max-concurrent 2
```

### No evaluation files found
Ensure your evaluation files:
- Start with eval_
- End with .py
- Contain functions decorated with @tool_eval()
## Next steps
- Explore capture mode for recording calls
- Learn about comparative evaluations for comparing sources
- Understand provider compatibility and schema differences