SEVRA Reasoning Allocation Replay

Deployment Takeaway

76.3%

MATH500 selective active verification accuracy, with fewer harmful flips than always-on verification.

94.47%

GSM8K selective active verification accuracy, nearly matching the long-base setting at modest extra cost.

78.38%

CommonsenseQA Self-Consistency@5 accuracy, while active verification is harmful on this workload.

The central result is not that one intervention always wins. Tune the initial reasoning budget first; then use selective recovery when explicit verification, bounded retries, auditability, or regression-risk control matters.
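One way to read this takeaway as control flow is sketched below. It is a minimal illustration, not the method behind the reported numbers: the function name `answer_with_selective_recovery`, the callables `solve`, `needs_recovery`, and `recover`, and the `max_retries` parameter are all hypothetical, and the trigger used by the selective policies is not specified here, so it is passed in as a predicate.

```python
from typing import Callable, Tuple


def answer_with_selective_recovery(
    problem: str,
    solve: Callable[[str], str],
    needs_recovery: Callable[[str, str], bool],
    recover: Callable[[str, str], str],
    max_retries: int = 1,
) -> Tuple[str, int]:
    """Return (answer, number of model calls) for one problem.

    All callables are hypothetical placeholders for model calls and for
    whatever trigger a deployment chooses.
    """
    answer = solve(problem)  # first pass with the tuned reasoning budget
    calls = 1
    for _ in range(max_retries):  # bounded retries
        if not needs_recovery(problem, answer):
            break  # selective: skip the extra call when no trigger fires
        answer = recover(problem, answer)
        calls += 1
    return answer, calls


# Stub functions standing in for real model calls (assumptions, for illustration).
def stub_solve(problem: str) -> str:
    return "" if "hard" in problem else "42"


def stub_needs_recovery(problem: str, answer: str) -> bool:
    return answer == ""  # e.g. no final answer was produced


def stub_recover(problem: str, answer: str) -> str:
    return "27"


easy = answer_with_selective_recovery(
    "easy question", stub_solve, stub_needs_recovery, stub_recover
)
hard = answer_with_selective_recovery(
    "hard question", stub_solve, stub_needs_recovery, stub_recover
)
print(easy, hard)
```

Returning the call count alongside the answer keeps each decision auditable, and `max_retries` caps the extra cost per problem.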

Cost-Accuracy Frontier

Bars show average realized model tokens relative to the largest value inside each dataset panel. The number at right is accuracy.
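The normalization described above can be sketched as follows. The token values are copied from the token column of the MATH500 table in this report; that the bars are drawn from that column is an assumption, and the helper name `relative_bar_lengths` is hypothetical.

```python
# Average tokens per policy, copied from the MATH500 table in this report.
math500_tokens = {
    "Base 4,096": 4313,
    "Always continue": 8007,
    "Selective continue": 7064,
    "Always verify": 8125,
    "Selective verify": 7104,
    "Long base 8,192": 5124,
}


def relative_bar_lengths(tokens_by_policy):
    """Scale each value by the largest value in the same dataset panel."""
    largest = max(tokens_by_policy.values())
    return {name: value / largest for name, value in tokens_by_policy.items()}


bars = relative_bar_lengths(math500_tokens)
for name, fraction in bars.items():
    print(f"{name:<20} {fraction:.3f}")
```

Because each panel is scaled by its own maximum, bar lengths are comparable within a dataset but not across datasets.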

MATH500

Base 4,096: 59.0
Always continue: 72.0
Selective continue: 73.6
Always verify: 75.5
Selective verify: 76.3
Long base 8,192: 76.0

GSM8K

Base 4,096: 93.40
Always verify: 93.40
Selective verify: 94.47
Long base 8,192: 94.54

CommonsenseQA

Base 4,096: 76.49
Always verify: 72.32
Self-Consistency@5: 78.38

Main Tables

MATH500

Policy	Accuracy (%)	Avg. calls	Avg. total tokens	Harmful flips
Base, 4,096 limit	59.0	1.000	4313	0.0
Always continue	72.0	2.000	8007	3.6
Selective continue	73.6	1.482	7064	1.7
Always active verify	75.5	2.000	8125	2.2
Selective active verify	76.3	1.482	7104	1.0
Long base, 8,192 limit	76.0	1.000	5124	0.0
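To see which MATH500 policies sit on the cost-accuracy frontier, a small dominance check over the rows above can help. This is a sketch under a simple assumption: a policy is dropped when another policy is at least as accurate and uses no more tokens, with a strict improvement on one of the two. The function name `pareto_frontier` is hypothetical.

```python
# (policy, accuracy, total tokens) copied from the MATH500 table above.
rows = [
    ("Base, 4,096 limit", 59.0, 4313),
    ("Always continue", 72.0, 8007),
    ("Selective continue", 73.6, 7064),
    ("Always active verify", 75.5, 8125),
    ("Selective active verify", 76.3, 7104),
    ("Long base, 8,192 limit", 76.0, 5124),
]


def pareto_frontier(rows):
    """Keep policies that no other policy beats on both accuracy and tokens."""
    frontier = []
    for name, acc, tok in rows:
        dominated = any(
            o_acc >= acc and o_tok <= tok and (o_acc > acc or o_tok < tok)
            for _, o_acc, o_tok in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier


frontier = pareto_frontier(rows)
print(frontier)
```

On these rows the check keeps the base, long-base, and selective active verify policies. It looks only at accuracy and tokens, so it says nothing about harmful flips or call counts, which also matter for deployment.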

GSM8K

Policy	Accuracy (%)	Avg. calls	Avg. total tokens	Harmful flips
Base, 4,096 limit	93.40	1.00	1180	0.00
Always active verify	93.40	2.00	2932	1.25
Selective active verify	94.47	1.03	1335	0.00
Long base, 8,192 limit	94.54	1.00	1157	0.00

CommonsenseQA

Policy	Accuracy (%)	Avg. calls	Avg. total tokens	Harmful flips
Base, 4,096 limit	76.49	1	2234	0.00
Always active verify	72.32	2	4794	5.94
Self-Consistency@5	78.38	5	11343	1.56
SC sampled-rollout oracle	85.18	-	-	-

Helpful Fixes vs Harmful Flips

Positive bars are wrong-to-right changes; negative bars are right-to-wrong changes. Verification is useful only when fixes dominate flips.
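The two quantities can be counted from paired per-problem outcomes, as in the sketch below. The outcome lists are hypothetical toy data, not results from this report, and the function name `count_transitions` is an assumption.

```python
def count_transitions(base_correct, after_correct):
    """Count wrong-to-right fixes and right-to-wrong flips over paired outcomes."""
    if len(base_correct) != len(after_correct):
        raise ValueError("outcome lists must have the same length")
    pairs = list(zip(base_correct, after_correct))
    fixes = sum(1 for before, after in pairs if not before and after)
    flips = sum(1 for before, after in pairs if before and not after)
    return fixes, flips


# Hypothetical toy outcomes for eight problems (True means correct).
base = [True, False, False, True, True, False, True, True]
after = [True, True, True, False, True, False, True, True]

fixes, flips = count_transitions(base, after)
net = fixes - flips  # net change in the number of correct answers
print(fixes, flips, net)
```

The net change in correct answers equals fixes minus flips, which is why an intervention with many fixes can still lower accuracy if it flips more answers than it repairs.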

Chart columns: dataset/action, helpful fixes, harmful flips.

Chart rows: MATH always verify, MATH selective verify, GSM always verify, GSM selective verify, CSQA always verify, CSQA SC@5.

Logged Example Patterns

MATH helpful verification

Problem: Smallest positive perfect cube expressible as the sum of three consecutive integers.

Base: 3 (wrong). After extra reasoning: 27 (correct). Candidate-specific checks can repair a local mathematical miss.
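As an illustration of what a candidate-specific check could look like for this problem, the sketch below tests both conditions directly. It is a hypothetical example, not the verification prompt used in the runs; the function names are assumptions.

```python
def is_perfect_cube(n: int) -> bool:
    """True if n is the cube of a non-negative integer."""
    if n < 0:
        return False
    root = round(n ** (1 / 3))
    # Check neighbours of the rounded root to absorb floating-point error.
    return any((root + d) ** 3 == n for d in (-1, 0, 1))


def is_sum_of_three_consecutive(n: int) -> bool:
    # (m - 1) + m + (m + 1) = 3m, so n qualifies exactly when 3 divides n.
    return n % 3 == 0


def passes_check(candidate: int) -> bool:
    return (
        candidate > 0
        and is_perfect_cube(candidate)
        and is_sum_of_three_consecutive(candidate)
    )


print(passes_check(3))   # base answer
print(passes_check(27))  # answer after extra reasoning

# Brute-force search for the smallest value that passes both conditions.
smallest = next(n for n in range(1, 1000) if passes_check(n))
print(smallest)
```

The base answer 3 satisfies the sum condition but is not a perfect cube, which is the kind of local miss a targeted check can catch; 27 = 8 + 9 + 10 satisfies both.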

MATH harmful continuation

Problem: Cross-country graph asks which student has greatest average speed.

Base: Evelyn (correct). After extra reasoning: 3.6 (wrong). Extra reasoning can answer a different latent question.

GSM helpful verification

Problem: Interrupted 200GB download with restart overhead.

Base: 20 (wrong). After extra reasoning: 160 (correct). Verification helps when the base attempt misses a compositional constraint.

CommonsenseQA harmful verification

Problem: Multiple-choice question asking where a heifer's master lives.

Base: A (correct). After extra reasoning: E (wrong). Verification prompts can overthink short commonsense answers.