Real insurance tasks, scored by combo: model / harness
Harnesses: Agentic CLI (model drives its own tools) | Basic Tool Calling (file reader) | Simple (prompt only)
Preliminary, June 2026. Objective-keyed scoring only (code-checked answers; no LLM judges).
Whiskers are 95% bootstrap confidence intervals over per-task means (2,000 resamples).
Repeat runs of the same task are correlated, so the bootstrap resamples tasks, not raw
answers: the honest, wider read. API combos: 3 runs per task; Agentic CLI: 1 run per
task. Sub-benchmark views slice the same runs by insurance function; fewer tasks per
slice means wider intervals, and that is the point of showing them. Infrastructure errors are
excluded, never counted as wrong answers. Task bank is pre-freeze ("trial-grade"): the
expert-verified v0.1 release is in progress and these numbers will be re-run against it.
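
The interval construction described above can be sketched as follows. This is a minimal illustration, not the scoring code behind these numbers: it assumes a percentile bootstrap, scores of 1.0 or 0.0 per run, and `None` as the marker for an infrastructure error. The function name `task_bootstrap_ci` and the task ids in the usage example are hypothetical.

```python
import random
from statistics import mean


def task_bootstrap_ci(runs_by_task, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval over per-task means (illustrative sketch).

    runs_by_task maps a task id to its list of run scores. A score of None
    marks an infrastructure error (an assumed convention): it is dropped,
    not counted as a wrong answer.
    """
    # Collapse repeat runs into one mean per task, skipping infrastructure errors.
    task_means = []
    for scores in runs_by_task.values():
        valid = [s for s in scores if s is not None]
        if valid:  # a task with no valid runs contributes nothing
            task_means.append(mean(valid))
    if not task_means:
        raise ValueError("no scored tasks")

    rng = random.Random(seed)
    n = len(task_means)
    # Resample whole tasks with replacement, never individual answers,
    # because runs of the same task are correlated.
    boot = sorted(mean(rng.choices(task_means, k=n)) for _ in range(n_resamples))

    lo = boot[int((alpha / 2) * n_resamples)]
    hi = boot[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(task_means), lo, hi


# Hypothetical data: three tasks, one run lost to an infrastructure error.
runs = {
    "claims-01": [1.0, 1.0, 0.0],
    "underwriting-02": [0.0, None, 0.0],
    "policy-03": [1.0],
}
point, lo, hi = task_bootstrap_ci(runs)
print(f"mean={point:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

With only three tasks the interval is very wide, which is the same effect the sub-benchmark slices show: the width is driven by the number of tasks, not the number of runs.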