MATH-500: well-defined tasks
Cost rises 265x. Performance rises 2%. (r = 0.33*)
HLE: deep reasoning tasks
Cost rises 220x. Performance rises 5x. (r = 0.99)
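
One way to put the two panels on a common scale is to ask how much performance is multiplied per doubling of cost. The sketch below uses only the multipliers quoted above. It assumes the 2% figure is a relative gain (a factor of 1.02) rather than percentage points, and the helper name `gain_per_doubling` is hypothetical, introduced only for illustration.

```python
import math


def gain_per_doubling(cost_multiple: float, perf_multiple: float) -> float:
    """Return the average performance multiplier per doubling of cost.

    Assumes performance scales geometrically with cost across the range,
    which is a simplification made for this comparison only.
    """
    doublings = math.log2(cost_multiple)  # number of cost doublings in the range
    return perf_multiple ** (1.0 / doublings)


# Multipliers taken from the two panels; 1.02 assumes "2%" is a relative gain.
math500 = gain_per_doubling(265, 1.02)
hle = gain_per_doubling(220, 5.0)

print(f"MATH-500: x{math500:.4f} performance per cost doubling")
print(f"HLE:      x{hle:.4f} performance per cost doubling")
```

Under these assumptions, each doubling of cost barely moves MATH-500 and lifts HLE by roughly a fifth, which is the same contrast the two panels draw.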