June 17, 2026 · 3 min read

InsureBench V1 Is Ready

InsureBench V1 is live — a public benchmark scoring model-plus-harness combinations on insurance work. The headline finding: the best model depends on the job, and a low-cost model beat the frontier on actuarial tasks at roughly 1/400th the cost.

ai · benchmarks · harness · underwriting · insurebench

Don Seibert

InsureThing


InsureBench V1 is live: a public benchmark for evaluating model-plus-harness combinations on insurance work — claims reasoning, actuarial analysis, underwriting review, coverage questions, document reading, forms, and workflow judgment.

The clearest result is that the best model depends on the job. Mimo v2.5 was the strongest system we tested on actuarial tasks, ahead of the frontier models in that slice at roughly 1/400th of the cost. Gemini 3.5 Flash Low did well on claims reasoning despite weaker actuarial scores. Nemotron led the tested claims set. None of these are permanent rankings, but they all point one way: no single model wins everything.

That is the point of the benchmark. We are not out to prove the strongest frontier models are strong — they are. GPT 5.5 and Claude Opus are broadly ahead, in the ways you would expect. The more useful question for anyone deploying AI is where that extra capability is worth its cost, where it is not, and how much the harness around the model changes the answer.

V1 adds deeper runs across more tasks and a wider selection of industry models. The task-specific pattern held up as the data grew.

The harness matters too. A model with file reading, output repair, batching, domain context, or an escalation path is not the same system as the same model called through a plain API. InsureBench measures the combination because that is what gets deployed.
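To make the distinction concrete, here is a minimal sketch of a harness wrapped around a plain model call. It is an illustration only, not the InsureBench harness: the function name `run_with_harness`, the `ModelFn` type, the JSON output format, and the repair prompt wording are all assumptions made for the example. It shows three of the pieces named above: domain context, output repair, and an escalation path.

```python
import json
from typing import Callable, Optional

# Hypothetical type: any function that takes a prompt and returns raw model text.
ModelFn = Callable[[str], str]


def run_with_harness(
    prompt: str,
    primary: ModelFn,
    fallback: Optional[ModelFn] = None,
    domain_context: str = "",
    max_repairs: int = 1,
) -> dict:
    """Call `primary`, repair invalid JSON output, and escalate if repair fails."""
    # Domain context: prepend whatever background the workflow needs.
    full_prompt = f"{domain_context}\n\n{prompt}" if domain_context else prompt

    raw = primary(full_prompt)
    for attempt in range(max_repairs + 1):
        try:
            return {"answer": json.loads(raw), "system": "primary", "repairs": attempt}
        except json.JSONDecodeError as err:
            if attempt == max_repairs:
                break
            # Output repair: show the model its own reply and ask for valid JSON.
            raw = primary(
                f"{full_prompt}\n\nYour previous reply was not valid JSON "
                f"({err.msg}). Reply with valid JSON only.\n\nPrevious reply:\n{raw}"
            )

    # Escalation path: hand the task to a stronger (and usually costlier) system.
    if fallback is not None:
        # json.loads may still raise here if the fallback also returns bad output.
        return {
            "answer": json.loads(fallback(full_prompt)),
            "system": "fallback",
            "repairs": max_repairs,
        }
    raise ValueError("no valid output and no escalation path configured")
```

The same `primary` function called directly and called through `run_with_harness` can produce different pass rates and different costs, which is why the two are scored as different systems.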

The same is true for cost. The unit that matters is not the token but the completed business task, including the tool calls, retries, hidden reasoning tokens, and any escalation needed to keep the workflow reliable.
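One way to account for this is sketched below. It is not the InsureBench cost formula. The `Attempt` and `Pricing` types, the per-million-token pricing fields, and the choice to bill hidden reasoning tokens at the output rate are assumptions for illustration. The idea it shows is that every attempt is paid for, including retries, escalations, and tasks that ultimately fail, while only completed tasks count in the denominator.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Attempt:
    input_tokens: int
    output_tokens: int        # visible output tokens
    reasoning_tokens: int     # hidden reasoning tokens (assumed billed as output)
    tool_calls: int
    escalated: bool = False   # True if this attempt ran on the fallback system


@dataclass
class Pricing:
    input_per_mtok: float     # price per million input tokens
    output_per_mtok: float    # price per million output tokens
    per_tool_call: float = 0.0


def attempt_cost(a: Attempt, p: Pricing) -> float:
    """Cost of one model attempt, including hidden reasoning and tool calls."""
    billed_output = a.output_tokens + a.reasoning_tokens
    token_cost = (
        a.input_tokens * p.input_per_mtok + billed_output * p.output_per_mtok
    ) / 1_000_000
    return token_cost + a.tool_calls * p.per_tool_call


def cost_per_completed_task(
    tasks: List[Tuple[List[Attempt], bool]],
    primary: Pricing,
    fallback: Pricing,
) -> float:
    """Total spend across all attempts divided by the number of completed tasks.

    Each task is (attempts, completed). Failed tasks still add to the spend.
    """
    total = 0.0
    completed = 0
    for attempts, ok in tasks:
        for a in attempts:
            total += attempt_cost(a, fallback if a.escalated else primary)
        if ok:
            completed += 1
    if completed == 0:
        raise ValueError("no completed tasks, cost per completed task is undefined")
    return total / completed
```

Under this accounting, a model with a low token price can still be expensive per completed task if it retries often or escalates frequently.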

The benchmark will continue to grow. We will add more questions, more models, more harnesses, and more task families. We are especially interested in repeated insurance workflows where cost, latency, control, and reliability matter: underwriting intake, claims triage, coverage checks, loss-run analysis, forms processing, agency management, and portfolio monitoring.

InsureBench is a lab project and a public benchmark. InsureThing will continue to publish results and methodology notes. We are open to working with others to add tasks, test harnesses, evaluate models, and improve the benchmark. If a provider wants a model tested and can provide tokens or access, we are glad to talk.

We also help carriers, MGAs, brokers, and insurance technology teams apply these findings to their own operations. The public benchmark is a broad indicator. The production question is more specific: for this workflow, under these constraints, what is the cheapest and fastest AI system that is reliable enough, and how should it escalate when it is not?
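The production question can be framed as a small selection problem. The sketch below is one hedged way to write it down, not a description of how InsureThing runs engagements: the `SystemResult` fields, the p95 latency metric, and the threshold parameters are all hypothetical. It filters measured systems by a reliability floor and a latency ceiling, picks the cheapest survivor, and names a more reliable system as the escalation target.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SystemResult:
    name: str              # model-plus-harness label
    pass_rate: float       # fraction of workflow tasks completed correctly
    cost_per_task: float   # spend per completed task
    p95_latency_s: float   # 95th percentile latency in seconds


def pick_primary(
    results: Sequence[SystemResult],
    min_pass_rate: float,
    max_latency_s: float,
) -> Optional[SystemResult]:
    """Cheapest system that is reliable enough and fast enough, or None."""
    eligible = [
        r for r in results
        if r.pass_rate >= min_pass_rate and r.p95_latency_s <= max_latency_s
    ]
    if not eligible:
        return None
    # Cheapest first; break cost ties in favor of the faster system.
    return min(eligible, key=lambda r: (r.cost_per_task, r.p95_latency_s))


def pick_escalation(
    results: Sequence[SystemResult],
    primary: SystemResult,
) -> Optional[SystemResult]:
    """Most reliable system that beats the primary's pass rate, or None."""
    stronger = [r for r in results if r.pass_rate > primary.pass_rate]
    return max(stronger, key=lambda r: r.pass_rate) if stronger else None
```

The escalation target is deliberately exempt from the latency ceiling here, on the assumption that escalated cases are rare enough to tolerate a slower answer. A real workflow may need a different rule.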

The lab page has the current results, and it is where we will publish updates.

Live · Benchmark

Explore InsureBench V1.

Current leaderboard, task filters, methodology notes, and sample questions.

insurebench.insure-thing.com
