InsureBench Preview: Cost Per Task, Not Cost Per Token
Early InsureBench results show why insurance AI evaluation needs to measure model plus harness performance, cost per task, and task-specific reliability. The charts are preliminary, but the deployment question is already clear.
InsureBench is our benchmark for evaluating AI systems on insurance work. The first public view is now live.
Most model benchmarks ask which model is better. That is useful, but it is not the main question an insurance organization faces when it tries to deploy AI. In production, the deployed unit is not just a model. It is a model plus the harness around it: the prompt, file reader, retrieval layer, structured output handling, retries, domain context, and escalation path.
That is what InsureBench measures.
The early benchmark includes tasks that look more like insurance work than chat: reading claim files, interpreting coverage facts, working with loss triangles, classifying workers' compensation risks, reviewing underwriting information, and deciding when there is not enough evidence to make a clean call.
The first lesson is simple: cost per token is not cost per task.
The chart above should be read as a preliminary snapshot, not a final ranking. It plots model plus harness combinations by accuracy and cost per task. The useful comparison is not "which token price is lowest?" It is "which setup can complete this kind of work reliably at the lowest practical cost?"
That distinction matters. A model may be inexpensive on a price sheet but expensive in practice if it needs more output tokens, more tool calls, more retries, or more hidden reasoning tokens to reach an answer. Another model may be more expensive per token but finish the job in fewer steps.
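To make the arithmetic concrete, here is a minimal sketch of how cost per completed task could be computed from per-attempt usage. It is not the InsureBench scoring code. The `AttemptProfile` fields, the flat per-tool-call cost, the assumption that reasoning tokens are billed at the output rate, and the simple retry model (expected attempts equal one over the success rate) are all assumptions made for illustration, and every number is invented.

```python
from dataclasses import dataclass


@dataclass
class AttemptProfile:
    """Hypothetical average usage for one attempt by a model plus harness setup."""
    input_price_per_mtok: float   # USD per million input tokens
    output_price_per_mtok: float  # USD per million output tokens
    input_tokens: int
    visible_output_tokens: int
    reasoning_tokens: int         # assumed to be billed at the output rate
    tool_calls: int
    cost_per_tool_call: float     # USD, assumed flat per call
    success_rate: float           # share of attempts that yield an acceptable answer


def cost_per_attempt(p: AttemptProfile) -> float:
    """Token cost plus tool cost for a single attempt, in USD."""
    billed_output = p.visible_output_tokens + p.reasoning_tokens
    token_cost = (
        p.input_tokens * p.input_price_per_mtok
        + billed_output * p.output_price_per_mtok
    ) / 1_000_000
    return token_cost + p.tool_calls * p.cost_per_tool_call


def cost_per_completed_task(p: AttemptProfile) -> float:
    """Expected cost of reaching one acceptable answer.

    Assumes independent retries, so expected attempts = 1 / success_rate.
    """
    if not 0.0 < p.success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt(p) / p.success_rate


# Made-up numbers for illustration only; these are not InsureBench results.
low_token_price = AttemptProfile(
    input_price_per_mtok=0.5, output_price_per_mtok=2.0,
    input_tokens=20_000, visible_output_tokens=1_500, reasoning_tokens=30_000,
    tool_calls=6, cost_per_tool_call=0.002, success_rate=0.5,
)
high_token_price = AttemptProfile(
    input_price_per_mtok=3.0, output_price_per_mtok=15.0,
    input_tokens=20_000, visible_output_tokens=800, reasoning_tokens=1_000,
    tool_calls=2, cost_per_tool_call=0.002, success_rate=0.95,
)

for name, profile in [("low token price", low_token_price),
                      ("high token price", high_token_price)]:
    print(f"{name}: ${cost_per_completed_task(profile):.4f} per completed task")
```

With these invented inputs, the setup with the lower token price ends up costing more per completed task, because it spends more reasoning tokens, makes more tool calls, and needs more retries.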
The second lesson is that reasoning cost can be hard to see if you look only at visible output.
Some models spend heavily on hidden reasoning tokens. That may be worthwhile on difficult work. It may also be unnecessary on repeated operational tasks where a cheaper model with a good harness can clear the required bar.
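One way to express "clear the required bar" is as a selection rule: keep only the setups that meet a required accuracy, then take the cheapest per task. The sketch below assumes that accuracy and cost per task have already been measured for each setup. The `Setup` type, the setup names, and all values are hypothetical placeholders, not benchmark results.

```python
from typing import NamedTuple, Optional, Sequence


class Setup(NamedTuple):
    """A hypothetical model plus harness combination with measured results."""
    name: str
    accuracy: float       # share of tasks completed acceptably, 0.0 to 1.0
    cost_per_task: float  # USD per completed task


def cheapest_clearing_bar(
    setups: Sequence[Setup], required_accuracy: float
) -> Optional[Setup]:
    """Return the lowest-cost setup that meets the accuracy bar, or None."""
    eligible = [s for s in setups if s.accuracy >= required_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda s: s.cost_per_task)


# Placeholder values for illustration only.
candidates = [
    Setup("setup_a", accuracy=0.97, cost_per_task=0.40),
    Setup("setup_b", accuracy=0.93, cost_per_task=0.06),
    Setup("setup_c", accuracy=0.85, cost_per_task=0.02),
]

routine_choice = cheapest_clearing_bar(candidates, required_accuracy=0.90)
strict_choice = cheapest_clearing_bar(candidates, required_accuracy=0.95)
```

The choice depends on the bar. In this example a 0.90 requirement selects `setup_b`, while a 0.95 requirement leaves only `setup_a`, the most expensive option.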
This is not an argument against frontier models. For high-value expert work, or for tasks where a small improvement is worth a large cost, the strongest available model may be the right choice. If a senior actuary is doing a difficult one-off analysis in an agentic coding environment, the economics are different from an automated scan across thousands of files.
The point is narrower and more practical: for repeated insurance workflows, the relevant unit is the completed task. The right answer may be a frontier model. It may be a lower-cost model with a better harness. It may be a routing workflow that handles routine cases cheaply and escalates uncertain cases.
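A routing workflow of this kind can be sketched in a few lines. The version below assumes the cheap path produces some confidence score per case and that every escalated case pays for both the cheap attempt and the escalation path. The function names, the threshold, and the costs are illustrative assumptions, not a description of any particular InsureBench harness.

```python
def route(confidence: float, threshold: float = 0.8) -> str:
    """Accept the cheap answer when confidence is high enough, else escalate."""
    return "accept_cheap_answer" if confidence >= threshold else "escalate"


def routed_cost_per_task(
    cheap_cost: float, escalation_cost: float, escalation_rate: float
) -> float:
    """Expected cost per task when every case is first tried on the cheap path.

    escalation_rate is the share of cases sent on to the expensive path
    (a stronger model or a human reviewer); those cases pay both costs.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    return cheap_cost + escalation_rate * escalation_cost


# Made-up costs in USD, for illustration only.
always_expensive = 0.20
routed = routed_cost_per_task(
    cheap_cost=0.01, escalation_cost=0.20, escalation_rate=0.15
)
```

Under these assumed numbers, routing costs about $0.04 per task against $0.20 for sending everything to the expensive path. The comparison only holds if the confidence signal is reliable enough that the accepted cheap answers still meet the required accuracy.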
InsureBench is built to make those distinctions visible.
The usual caveats apply. The question bank is expanding. Some task-family samples are still small. Confidence intervals matter. The results will move as we add models, repeat runs, and improve harnesses.
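To show why small samples and confidence intervals matter, here is a short sketch of the Wilson score interval for an accuracy estimate. This is a standard method for binomial proportions. Whether InsureBench uses this particular interval is not stated here, and the sample counts below are invented for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95% coverage)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    )
    return center - half_width, center + half_width


# The same observed accuracy (90%) at two invented sample sizes.
small_sample = wilson_interval(18, 20)
large_sample = wilson_interval(180, 200)
```

With 18 correct out of 20, the interval runs from roughly 0.70 to 0.97, so two setups that look several points apart on a small task family may not be distinguishable. At 180 out of 200 the interval is much narrower around the same 90% point estimate.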
But even this preview shows the shape of the work ahead: insurance AI evaluation needs to measure the model, the harness, the task, and the cost of getting to a reliable answer.
If you have insurance tasks that belong in InsureBench, or a model, harness, or workflow you want measured against real insurance work, get in touch: don@insure-thing.com.