Deployment Takeaway
MATH500: 76.3% accuracy with selective active verification, with fewer harmful flips than always-on verification (1.0 vs. 2.2).
GSM8K: 94.47% accuracy with selective active verification, nearly matching the long-base setting (94.54%) at modest extra cost (1.03 calls per problem on average).
CommonsenseQA: 78.38% accuracy with Self-Consistency@5, while active verification is harmful on this workload (72.32% vs. 76.49% for the base policy).
The central result is not that one intervention always wins. Tune the initial reasoning budget first; then use selective recovery when explicit verification, bounded retries, auditability, or regression-risk control matters.
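The idea of selective recovery with bounded retries can be sketched roughly as below. This is only an illustration under stated assumptions: the trigger rule in `needs_recovery` (base output truncated at its token limit, or no parseable answer), the `Attempt` record, and the `base` / `recover` callables are all hypothetical names introduced here, not the trigger or interfaces used to produce the reported numbers.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    answer: Optional[str]  # parsed final answer, or None if nothing could be parsed
    tokens_used: int       # tokens realized by this call
    hit_limit: bool        # True if generation stopped at the token budget


def needs_recovery(attempt: Attempt) -> bool:
    # ASSUMPTION: a hypothetical trigger, not the one used in the reported runs.
    return attempt.hit_limit or attempt.answer is None


def solve_selectively(
    problem: str,
    base: Callable[[str], Attempt],
    recover: Callable[[str, Attempt], Attempt],
    max_retries: int = 1,
) -> Tuple[Attempt, int, int]:
    """Run the base call, then at most `max_retries` recovery calls if triggered."""
    attempt = base(problem)
    calls, tokens, retries = 1, attempt.tokens_used, 0
    while retries < max_retries and needs_recovery(attempt):
        attempt = recover(problem, attempt)  # e.g. continue or actively verify
        retries += 1
        calls += 1
        tokens += attempt.tokens_used
    # Returning calls and tokens keeps every decision auditable per problem.
    return attempt, calls, tokens
```

With at most one recovery call, the average number of calls is one plus the trigger rate, so an average of 1.482 calls would correspond to recovery being triggered on about 48.2% of problems.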
Cost-Accuracy Frontier
Figure (three panels: MATH500, GSM8K, CommonsenseQA): bars show average realized model tokens relative to the largest value within each dataset panel; the number to the right of each bar is accuracy.
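As a rough illustration of how the bar lengths and a cost-accuracy frontier could be derived, the sketch below uses the MATH500 accuracy and token values from the main tables. The function names are hypothetical, and reading "frontier" as the set of non-dominated (Pareto-optimal) policies is an assumption made here for illustration.

```python
# (accuracy %, average total tokens) per policy, copied from the MATH500 table.
MATH500 = {
    "Base, 4,096 limit": (59.0, 4313),
    "Always continue": (72.0, 8007),
    "Selective continue": (73.6, 7064),
    "Always active verify": (75.5, 8125),
    "Selective active verify": (76.3, 7104),
    "Long base, 8,192 limit": (76.0, 5124),
}


def relative_tokens(panel):
    """Bar length: tokens divided by the largest token value in the panel."""
    largest = max(tokens for _, tokens in panel.values())
    return {policy: tokens / largest for policy, (_, tokens) in panel.items()}


def pareto_frontier(panel):
    """Policies for which no other policy is at least as accurate and as cheap,
    and strictly better on one of the two (ASSUMED reading of 'frontier')."""
    frontier = []
    for name, (acc, tok) in panel.items():
        dominated = any(
            a >= acc and t <= tok and (a > acc or t < tok)
            for other, (a, t) in panel.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    for policy, share in relative_tokens(MATH500).items():
        print(f"{policy:28s} {share:.2f}")
    print(pareto_frontier(MATH500))
```

On these numbers, the non-dominated MATH500 policies are the base setting, selective active verification, and the long-base setting; the other three cost more tokens than the long-base setting while reaching lower accuracy.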
Main Tables
MATH500
| Policy | Accuracy (%) | Calls (avg.) | Total tokens (avg.) | Harmful flips |
|---|---|---|---|---|
| Base, 4,096 limit | 59.0 | 1.000 | 4313 | 0.0 |
| Always continue | 72.0 | 2.000 | 8007 | 3.6 |
| Selective continue | 73.6 | 1.482 | 7064 | 1.7 |
| Always active verify | 75.5 | 2.000 | 8125 | 2.2 |
| Selective active verify | 76.3 | 1.482 | 7104 | 1.0 |
| Long base, 8,192 limit | 76.0 | 1.000 | 5124 | 0.0 |
GSM8K
| Policy | Accuracy (%) | Calls (avg.) | Total tokens (avg.) | Harmful flips |
|---|---|---|---|---|
| Base, 4,096 limit | 93.40 | 1.00 | 1180 | 0.00 |
| Always active verify | 93.40 | 2.00 | 2932 | 1.25 |
| Selective active verify | 94.47 | 1.03 | 1335 | 0.00 |
| Long base, 8,192 limit | 94.54 | 1.00 | 1157 | 0.00 |
CommonsenseQA
| Policy | Accuracy (%) | Calls (avg.) | Total tokens (avg.) | Harmful flips |
|---|---|---|---|---|
| Base, 4,096 limit | 76.49 | 1 | 2234 | 0.00 |
| Always active verify | 72.32 | 2 | 4794 | 5.94 |
| Self-Consistency@5 | 78.38 | 5 | 11343 | 1.56 |
| SC sampled-rollout oracle | 85.18 | - | - | - |
Helpful Fixes vs Harmful Flips
Positive bars are wrong-to-right changes; negative bars are right-to-wrong changes. Verification is useful only when fixes dominate flips.
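The relationship between fixes, flips, and net accuracy change can be made concrete with a small calculation. The sketch assumes that the harmful-flip values in the main tables are percentage points of all examples, measured against the base policy of the same dataset; the source does not state the unit, so the implied fix rates are an inference rather than reported results. The function name is hypothetical.

```python
def implied_helpful_fixes(base_acc, policy_acc, harmful_flips):
    """Net accuracy change = helpful fixes - harmful flips, so
    helpful fixes = (policy_acc - base_acc) + harmful_flips.

    ASSUMPTION: all three inputs are percentage points over the same examples.
    """
    return round((policy_acc - base_acc) + harmful_flips, 2)


if __name__ == "__main__":
    # MATH500, always continue: 59.0 -> 72.0 with 3.6 harmful flips.
    print(implied_helpful_fixes(59.0, 72.0, 3.6))
    # CommonsenseQA, always active verify: 76.49 -> 72.32 with 5.94 harmful flips.
    print(implied_helpful_fixes(76.49, 72.32, 5.94))
```

Under that assumption, always-on verification on CommonsenseQA repairs far fewer answers than it breaks, which is why its net accuracy falls below the base policy.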
Logged Example Patterns
MATH helpful verification
Problem: Smallest positive perfect cube expressible as the sum of three consecutive integers.
Base: 3 (wrong). After extra reasoning: 27 (correct). Candidate-specific checks can repair a local mathematical miss.
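A candidate-specific check of this kind is cheap to run. The sketch below is only an illustration, not the verifier used in the logged runs: it relies on the fact that three consecutive integers with middle value m sum to 3m, so the target is the smallest positive cube divisible by 3. The function name is hypothetical.

```python
from itertools import count


def smallest_cube_sum_of_three_consecutive():
    """Return the smallest positive cube equal to (m-1) + m + (m+1) = 3m,
    together with the three consecutive integers that sum to it."""
    for k in count(1):
        cube = k ** 3
        if cube % 3 == 0:
            middle = cube // 3
            return cube, (middle - 1, middle, middle + 1)


if __name__ == "__main__":
    cube, terms = smallest_cube_sum_of_three_consecutive()
    print(cube, terms, sum(terms) == cube)
```

The same check rejects the base answer directly: 3 equals 0 + 1 + 2 but is not a perfect cube.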
MATH harmful continuation
Problem: A graph of a cross-country run; the question asks which student has the greatest average speed.
Base: Evelyn (correct). After extra reasoning: 3.6 (wrong). Extra reasoning can answer a different latent question.
GSM helpful verification
Problem: Interrupted 200GB download with restart overhead.
Base: 20 (wrong). After extra reasoning: 160 (correct). Verification helps when the base attempt misses a compositional constraint.
CommonsenseQA harmful verification
Problem: Multiple-choice question asking where a heifer's master lives.
Base: A (correct). After extra reasoning: E (wrong). Verification prompts can overthink short commonsense answers.