Real insurance tasks, scored by combo: model / harness
Agentic CLI (model drives its own tools) Basic Tool Calling (file reader) Simple (prompt only)
Preliminary, June 2026. Objective-keyed scoring only (code-checked answers; no LLM judges). Whiskers are 95% bootstrap confidence intervals over per-task means (2,000 resamples): repeat runs of the same task are correlated, so the interval resamples the tasks, not the raw answers — the honest, wider read. API combos: 3 runs per task; Agentic CLI: single run. Sub-benchmark views slice the same runs by insurance function — fewer tasks per slice means wider intervals; that is the point of showing them. Infrastructure errors are excluded, never counted as wrong answers. Task bank is pre-freeze ("trial-grade"): the expert-verified v0.1 release is in progress and these numbers will be re-run against it.