MATH-500: well-defined tasks
Cost rises 265x. Performance rises 2%.
(r = 0.33*)
HLE: deep reasoning tasks
Cost rises 220x. Performance rises 5x.
(r = 0.99)
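
The r values above read like correlation coefficients between cost and performance. The panels do not say which coefficient is used, whether cost is log-scaled, or what the asterisk on 0.33 marks, so the following is only a minimal sketch assuming a Pearson correlation against log10 cost. The function name `pearson_r` and all numbers are hypothetical placeholders, not data from either benchmark.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values (NOT MATH-500 or HLE results).
# Cost is log-scaled here; that choice is an assumption.
log_cost = [math.log10(c) for c in (1, 10, 100, 1000)]

# Scores that climb steadily with log cost: r close to 1.
steady_scores = [10.0, 20.0, 30.0, 40.0]
r_steady = pearson_r(log_cost, steady_scores)

# Scores that are nearly flat and noisy: a weaker correlation.
flat_scores = [90.0, 92.0, 91.0, 92.0]
r_flat = pearson_r(log_cost, flat_scores)

print(f"steady: r = {r_steady:.2f}")
print(f"flat:   r = {r_flat:.2f}")
```

A high r only says the scores track cost closely along a line; it says nothing about how large the gain is, which is why each panel reports the size of the performance change separately.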