SEVRA Reasoning Allocation Replay

A static, anonymous replay dashboard for inspecting budget-aware inference-time reasoning results. It visualizes precomputed accuracy, realized model tokens, additional calls, helpful fixes, and harmful flips across MATH500, GSM8K, and CommonsenseQA.

Data-only artifact · No live model serving · Budget-matched reasoning analysis · Industry-track serving metrics

Deployment Takeaway

76.3%

MATH500 selective active verification accuracy, with fewer harmful flips than always-on verification.

94.47%

GSM8K selective active verification accuracy, nearly matching the long-base setting at modest extra cost.

78.38%

CommonsenseQA Self-Consistency@5 accuracy, while active verification is harmful on this workload.

The central result is not that one intervention always wins. Tune the initial reasoning budget first; then use selective recovery when explicit verification, bounded retries, auditability, or regression-risk control matters.

Cost-Accuracy Frontier

Bars show average realized model tokens, scaled relative to the largest value within each dataset panel. The number to the right of each bar is accuracy (%).
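The per-panel scaling described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual rendering code: the function name `relative_bar_widths` and the text-bar output are assumptions, and the token values are the MATH500 totals as read from the Main Tables section.

```python
# Average realized tokens per policy, as read from the MATH500 main table.
math500_tokens = {
    "Base 4,096": 4313,
    "Always continue": 8007,
    "Selective continue": 7064,
    "Always verify": 8125,
    "Selective verify": 7104,
    "Long base 8,192": 5124,
}


def relative_bar_widths(tokens_by_policy):
    """Scale each value by the panel maximum so the largest bar has width 1.0."""
    panel_max = max(tokens_by_policy.values())
    return {policy: tokens / panel_max for policy, tokens in tokens_by_policy.items()}


widths = relative_bar_widths(math500_tokens)
for policy, width in widths.items():
    # Hypothetical text rendering: 40 characters represent the widest bar.
    print(f"{policy:<20} {'#' * round(width * 40):<40} {width:.2f}")
```

Because each panel is normalized to its own maximum, bar lengths are comparable within a dataset but not across datasets.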

MATH500

Base 4,096: 59.0
Always continue: 72.0
Selective continue: 73.6
Always verify: 75.5
Selective verify: 76.3
Long base 8,192: 76.0

GSM8K

Base 4,096: 93.40
Always verify: 93.40
Selective verify: 94.47
Long base 8,192: 94.54

CommonsenseQA

Base 4,096: 76.49
Always verify: 72.32
Self-Consistency@5: 78.38

Main Tables

MATH500

Policy | Accuracy | Calls | Total tokens | Harmful flips
Base, 4,096 limit | 59.0 | 1.000 | 4313 | 0.0
Always continue | 72.0 | 2.000 | 8007 | 3.6
Selective continue | 73.6 | 1.482 | 7064 | 1.7
Always active verify | 75.5 | 2.000 | 8125 | 2.2
Selective active verify | 76.3 | 1.482 | 7104 | 1.0
Long base, 8,192 limit | 76.0 | 1.000 | 5124 | 0.0
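One way to read the MATH500 table is as accuracy gained per 1,000 extra tokens over the 4,096-limit base. The sketch below is an illustrative metric, not one the dashboard reports: the function name `gain_per_1k_extra_tokens` is hypothetical, and the numbers are the accuracy and total-token values as read from the table above.

```python
# (accuracy %, average total tokens) per policy, as read from the MATH500 table.
math500 = {
    "Base, 4,096 limit": (59.0, 4313),
    "Always continue": (72.0, 8007),
    "Selective continue": (73.6, 7064),
    "Always active verify": (75.5, 8125),
    "Selective active verify": (76.3, 7104),
    "Long base, 8,192 limit": (76.0, 5124),
}


def gain_per_1k_extra_tokens(results, baseline):
    """Accuracy points gained per 1,000 tokens spent beyond the baseline policy."""
    base_acc, base_tokens = results[baseline]
    efficiency = {}
    for policy, (acc, tokens) in results.items():
        extra_tokens = tokens - base_tokens
        if policy == baseline or extra_tokens <= 0:
            continue  # the ratio is undefined without extra spend
        efficiency[policy] = (acc - base_acc) / (extra_tokens / 1000)
    return efficiency


efficiency = gain_per_1k_extra_tokens(math500, "Base, 4,096 limit")
for policy, value in sorted(efficiency.items(), key=lambda kv: -kv[1]):
    print(f"{policy:<26} {value:.2f} accuracy points per 1k extra tokens")
```

Under this metric the long-base setting ranks first on MATH500, which is consistent with the advice to tune the initial reasoning budget before adding recovery steps.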

GSM8K

Policy | Accuracy | Calls | Total tokens | Harmful flips
Base, 4,096 limit | 93.40 | 1.00 | 1180 | 0.00
Always active verify | 93.40 | 2.00 | 2932 | 1.25
Selective active verify | 94.47 | 1.03 | 1335 | 0.00
Long base, 8,192 limit | 94.54 | 1.00 | 1157 | 0.00
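The Calls column can be turned into an implied trigger rate for the selective policies. The sketch below assumes every example gets exactly one base call plus at most one extra call, which matches the 2.00 average for always-on verification but is not stated explicitly by the dashboard. The function name `implied_trigger_rate` is hypothetical.

```python
def implied_trigger_rate(avg_calls):
    """Fraction of examples that received the extra call.

    Assumes one base call per example plus at most one additional call,
    so the average number of calls equals 1 + trigger rate.
    """
    if not 1.0 <= avg_calls <= 2.0:
        raise ValueError("expected an average between 1 and 2 calls per example")
    return avg_calls - 1.0


# Average calls as read from the MATH500 and GSM8K tables.
selective_calls = {
    "MATH500 selective active verify": 1.482,
    "GSM8K selective active verify": 1.03,
}
for name, calls in selective_calls.items():
    print(f"{name}: extra call on {implied_trigger_rate(calls):.1%} of examples")
```

Read this way, the selective policy intervenes on roughly half of MATH500 examples but only a few percent of GSM8K examples, which fits the small token overhead reported for GSM8K.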

CommonsenseQA

Policy | Accuracy | Calls | Total tokens | Harmful flips
Base, 4,096 limit | 76.49 | 1 | 2234 | 0.00
Always active verify | 72.32 | 2 | 4794 | 5.94
Self-Consistency@5 | 78.38 | 5 | 11343 | 1.56
SC sampled-rollout oracle | 85.18 | - | - | -
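The table reports harmful flips but not helpful fixes. If harmful flips are measured in percentage points of the full evaluation set and relative to the base answer (an assumption, not something the dashboard states), the fixes follow from the identity: policy accuracy = base accuracy + helpful fixes - harmful flips. The sketch below applies that identity to the CommonsenseQA rows; `implied_helpful_fixes` is a hypothetical helper.

```python
def implied_helpful_fixes(base_acc, policy_acc, harmful_flips):
    """Solve policy_acc = base_acc + fixes - flips for fixes (percentage points)."""
    return round(policy_acc - base_acc + harmful_flips, 2)


# Accuracy and harmful flips as read from the CommonsenseQA table.
base_accuracy = 76.49
policies = {
    "Always active verify": (72.32, 5.94),
    "Self-Consistency@5": (78.38, 1.56),
}
for name, (accuracy, flips) in policies.items():
    fixes = implied_helpful_fixes(base_accuracy, accuracy, flips)
    print(f"{name}: implied fixes {fixes:.2f}, flips {flips:.2f}, net {fixes - flips:+.2f}")
```

Under these assumptions, always-on verification on CommonsenseQA flips more answers than it fixes, which is what drives its accuracy below the base.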

Helpful Fixes vs Harmful Flips

Positive bars are wrong-to-right changes; negative bars are right-to-wrong changes. Verification is useful only when fixes dominate flips.
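The fix and flip definitions above can be sketched as a count over paired correctness labels. This is a minimal illustration with invented toy data, not the dashboard's logged results. The names `count_transitions` and `verification_helps` are hypothetical.

```python
from collections import Counter


def count_transitions(base_correct, after_correct):
    """Count wrong-to-right fixes and right-to-wrong flips across paired examples."""
    if len(base_correct) != len(after_correct):
        raise ValueError("both lists must cover the same examples")
    counts = Counter()
    for before, after in zip(base_correct, after_correct):
        if not before and after:
            counts["helpful_fix"] += 1   # wrong -> right
        elif before and not after:
            counts["harmful_flip"] += 1  # right -> wrong
        else:
            counts["unchanged"] += 1
    return counts


def verification_helps(counts):
    """Verification is useful only when fixes strictly outnumber flips."""
    return counts["helpful_fix"] > counts["harmful_flip"]


# Toy correctness labels (invented for illustration only).
base = [True, False, False, True, True]
after = [True, True, False, False, True]
counts = count_transitions(base, after)
print(dict(counts), "helps:", verification_helps(counts))
```

In the toy data one fix is cancelled by one flip, so the intervention adds calls and tokens without improving accuracy.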

Chart rows (dataset/action, each with a helpful-fixes bar and a harmful-flips bar): MATH always verify; MATH selective verify; GSM always verify; GSM selective verify; CSQA always verify; CSQA SC@5.

Logged Example Patterns

MATH helpful verification

Problem: Smallest positive perfect cube expressible as the sum of three consecutive integers.

Base: 3 (wrong). After extra reasoning: 27 (correct). Candidate-specific checks can repair a local mathematical miss.

MATH harmful continuation

Problem: Cross-country graph question asking which student has the greatest average speed.

Base: Evelyn (correct). After extra reasoning: 3.6 (wrong). Extra reasoning can answer a different latent question.

GSM helpful verification

Problem: Interrupted 200GB download with restart overhead.

Base: 20 (wrong). After extra reasoning: 160 (correct). Verification helps when the base attempt misses a compositional constraint.

CommonsenseQA harmful verification

Problem: Multiple-choice question asking where a heifer's master lives.

Base: A (correct). After extra reasoning: E (wrong). Verification prompts can overthink short commonsense answers.