2025-12-23 Paper Digest

Deep Dive: Reasoning with Sampling

My notes and aha moments from Reasoning with Sampling's inference-time distribution sharpening via Metropolis-Hastings

Why this paper?

During my time as an intern at SAP, I've been exposed to various post-training techniques - such as programmatic verifiers (RLVR), RL with learned reward models (RLHF), SFT, etc. - of which RLVR has been the most interesting to learn about and work with. I've always been fascinated by how a proposed solution to the shortcomings of PPO (DeepSeek's GRPO, introduced in DeepSeekMath, Feb 2024) has since become the staple method for training toward task generalization, often as the immediate step after preference optimization. I'd had to set aside this curiosity while grappling with the bottlenecks of synthetic data generation and SFT to squeeze performance out of models, but the holidays offer some respite to inspect a paper in this domain that has captured my interest - Reasoning With Sampling: Your Base Model is Smarter Than You Think.

This paper is a clean, almost surgical attempt to isolate one simple question:

Doubt

Does RLVR with GRPO make a model learn new reasoning, or is it simply a training pipeline that makes the model sample its best reasoning more often?

By utilizing just three commonly used open-source models and a Bayesian-inference algorithm from computational statistics, the authors were able to elicit reasoning at inference time with no need for additional post-training. The reasoning (pun intended) behind this, as proposed by the paper, serves to evaluate two hypotheses regarding RLVR:

What this paper is really testing

(1) If RL post-training “creates reasoning,” then purely sampling from a base model won't reach similar performance.

(2) If RL post-training mostly “reweights the distribution,” then a principled inference-time reweighting method should recover a surprising amount.

Principle

The paper's main idea is as follows. Assuming the base model defines a probability distribution over full answers:

  • Don’t just sample once and pray.
  • Instead, sample in a way that biases the model toward trajectories it finds more plausible.
  • Do that biasing globally over the whole answer, rather than greedily token-by-token.
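To see why "globally over the whole answer" differs from token-by-token greediness, here is a toy example of my own construction (not from the paper): a two-step model where greedy decoding commits to the wrong first token, while power-sharpening the distribution over whole sequences concentrates on the globally most likely one.

```python
# Toy illustration (my own construction, not the paper's): sharpening the
# distribution over WHOLE sequences differs from greedy token-by-token choice.
seq_probs = {
    # P(first token) * P(second token | first token)
    "ac": 0.6 * 0.5,   # 0.30
    "ad": 0.6 * 0.5,   # 0.30
    "bc": 0.4 * 0.9,   # 0.36  <- globally most likely sequence
    "bd": 0.4 * 0.1,   # 0.04
}

def power_sharpen(probs, alpha):
    """Renormalize p(x)^alpha over full sequences (the 'power distribution')."""
    powered = {x: p ** alpha for x, p in probs.items()}
    z = sum(powered.values())
    return {x: p / z for x, p in powered.items()}

# Greedy decoding picks 'a' first (0.6 > 0.4), so it can never emit "bc".
# Power sharpening over full sequences concentrates mass on "bc" as alpha grows.
sharp = power_sharpen(seq_probs, alpha=4.0)
best = max(sharp, key=sharp.get)
print(best, round(sharp[best], 3))
```

The point of the sketch: the token-level argmax path ("a...") and the sequence-level mode ("bc") disagree, and only global reweighting finds the latter.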

The authors utilize a common computational statistics algorithm, widely used for Bayesian inference, as their non-greedy sampling method - Markov Chain Monte Carlo (MCMC) sampling. As mentioned in the paper:

CORE CONCEPT

“Markov Chain Monte Carlo (MCMC) algorithm known as Metropolis-Hastings (MH) (Metropolis et al., 1953), ... targets exactly what we want: approximate sampling from an unnormalized probability distribution”

My Interpretation

Think of a model response as a trajectory. A trajectory can either be “somewhat right” and wobbly, or crisp and consistent. A normal decoding pass produces one trajectory - one model output.

Reasoning with Sampling (RwS) says: “Treat that trajectory as a draft. Then repeatedly propose edits to the back half. Keep edits that move you into a region of response-space that the model likes much more.”

It's a “search,” performed by sampling from a target distribution rather than explicit best-first enumeration. Essentially:

Aha moment

Rather than just the next token, the selection unit is the entire continuation. It's a globally coherent method, corrected by the MH accept/reject loop so that the sampling process doesn't collapse into a purely greedy, local "one token after the other" scheme.

This can be better visualized in the following graph from the paper:

Figure

MCMC Example

Demonstration of MCMC: a random index t in sample x is selected, then all tokens after t are replaced with a candidate continuation, yielding x'. Based on the acceptance probability A(x, x'), the new sample is either accepted or rejected. This is repeated N times.

The two hyperparameters of importance are:

  1. α (power/sharpness): Higher α means the sampler prefers high-confidence regions more aggressively.
  2. # of MH steps (inference-time compute): More steps = more chances to “rewrite your way” into better trajectories.

α controls how picky you are. Steps control how long you keep editing the draft.
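A quick numeric feel for the "pickiness" knob, using the simplified acceptance rule from the pseudo-view later in this post (acceptProb = min(1, exp(α · Δlogp))); the Δlogp value here is an arbitrary toy number:

```python
import math

def accept_prob(delta_logp, alpha):
    """Simplified MH acceptance: min(1, exp(alpha * (logp_new - logp_old)))."""
    return min(1.0, math.exp(alpha * delta_logp))

# A proposed rewrite that is slightly WORSE than the current draft (delta = -0.5):
for alpha in (1.0, 4.0, 10.0):
    print(alpha, round(accept_prob(-0.5, alpha), 4))
# Larger alpha -> worse proposals get rejected far more often,
# i.e. the sampler hugs high-likelihood trajectories more tightly.
```

The same Δlogp of -0.5 is accepted about 61% of the time at α = 1 but under 1% of the time at α = 10, which is exactly the "higher α prefers high-confidence regions more aggressively" behavior.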

Algorithm


Essentially: Generate a draft answer, repeatedly “rewrite the back half” at random points, and keep the rewrite only if the whole answer becomes much more plausible to the model (according to a powered-up likelihood).

Here's the entire algorithm for a quick breakdown:

  1. Start with your normal model. You have a base model that can generate an answer token by token, and it can also tell you how likely any full answer is (its log-probability).
  2. Decide how aggressive you want to be. Pick a power α (like 1, 2, 4, 10, etc.), where a larger α prefers super-likely answers over uncertain answers.
  3. Begin with a rough draft answer. Use normal decoding (e.g., temperature sampling) to generate a full answer from the model once. Think: “Let me write a first draft.”
  4. Score that draft. Ask the model: “How likely is this whole answer?” That gives you a single number: its log-likelihood.
  5. Now, randomly choose a place to edit. Pick a random position in the draft (some token index in the middle).
  6. Rewrite everything after that point. Keep the prefix up to that position fixed, and re-sample the rest of the answer from the model. This gives you a new full answer that shares the beginning but has a different continuation.
  7. Score the new draft. Compute the log-likelihood of this new answer too, and compare “old vs. new” using the power α.
    • If the new answer is much more likely under the model, accept it.
    • If it’s a bit worse, you might still accept it with some probability.
    • If it’s way worse, you probably reject it and keep the old draft.
  8. Repeat this “random edit + maybe keep it” loop many times. Each iteration is:
    • Pick a random edit point
    • Resample the tail
    • Maybe accept, maybe reject
    • Over time, the answer gets nudged toward versions that the model thinks are much more coherent/likely.
  9. Grow the answer block-by-block (optional implementation detail). In practice, don’t always edit the whole thing from scratch. Build the answer in blocks (e.g., 1-5 tokens at a time):
    • Generate the next block
    • Run a few of these “edit and accept/reject” steps on the partial answer, then move on to the next block.
  10. Return the final refined answer. After enough edit steps, you stop and output the current draft. Because of all the “only keep it if it's more likely (with power α)” decisions, this final answer behaves like it was sampled from a sharpened version of the model’s distribution, i.e. it’s a “high-confidence” reasoning path that the base model already liked, just surfaced more reliably.
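The steps above can be sketched end-to-end on a toy stand-in for the model. This is my own illustrative construction, not the paper's implementation: the "model" is a hypothetical two-character bigram chain with a '$' stop token, standing in for an LLM's full-trajectory likelihood. Only the shape of the loop matters.

```python
import math
import random

random.seed(0)

# Hypothetical bigram "language model": P(next | current), with '$' ending
# the sequence and '^' marking start-of-sequence. A stand-in for an LLM.
MODEL = {
    "^": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.7, "b": 0.1, "$": 0.2},
    "b": {"a": 0.1, "b": 0.7, "$": 0.2},
}

def sample_suffix(prefix, max_len=12):
    """Resample a continuation token-by-token from the toy model."""
    seq = list(prefix)
    while len(seq) < max_len:
        prev = seq[-1] if seq else "^"
        dist = MODEL[prev]
        c = random.choices(list(dist), weights=list(dist.values()))[0]
        if c == "$":
            break
        seq.append(c)
    return "".join(seq)

def logp(seq):
    """Log-likelihood of a full sequence under the toy model."""
    total, prev = 0.0, "^"
    for c in seq:
        total += math.log(MODEL[prev][c])
        prev = c
    return total

def mh_power_sample(alpha=4.0, n_steps=50):
    x = sample_suffix("")                     # step 3: rough draft
    lp = logp(x)                              # step 4: score it
    for _ in range(n_steps):                  # step 8: repeat many times
        t = random.randrange(len(x) + 1)      # step 5: random edit point
        x_new = sample_suffix(x[:t])          # step 6: resample the tail
        lp_new = logp(x_new)                  # step 7: score the new draft
        # Accept with prob min(1, exp(alpha * (lp_new - lp))):
        if math.log(random.random()) < alpha * (lp_new - lp):
            x, lp = x_new, lp_new
    return x, lp                              # step 10: refined answer

answer, score = mh_power_sample()
print(answer, score)
```

With α = 4, the accepted drafts drift toward the model's "peaky" regions (long runs of a repeated character here), which is the toy analogue of surfacing a high-confidence reasoning path.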

So what's happening under the hood?

Metropolis–Hastings needs two things:

  • a proposal distribution (how you generate candidate edits)
  • a target distribution (what you wish you could sample from)

Here:

  • Proposal is “keep the prefix, resample the suffix.”
  • Target is “answers the base model assigns much higher probability to” — optionally sharpened by α.

Here’s the simplest “engineering pseudo-view” I keep in mind:

RwS (engineering pseudo-view)

// x = current full answer
// propose: pick t, keep prefix x[:t], resample suffix -> x'
xPrime = proposeByResamplingSuffix(x, t)
// score: model log-likelihood of whole trajectory
L = logp_model(x)
Lp = logp_model(xPrime)
// accept: prefer higher-likelihood answers, sharpened by alpha
// (stochastic accept keeps exploration alive)
acceptProb = min(1, exp(alpha * (Lp - L)))
if (rand() < acceptProb) x = xPrime
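One caveat on the pseudo-view: textbook Metropolis-Hastings also includes a proposal-correction term, q(x|x') / q(x'|x). Below is a sketch of the general form, plus the simplification you get under the assumption that the suffix proposal is drawn from the base model itself with a shared prefix (so prefix terms cancel and the exponent effectively becomes α − 1). Treat this as the textbook derivation, not the paper's exact expression.

```python
import math

def mh_accept(logp_target_new, logp_target_old,
              logq_new_given_old, logq_old_given_new):
    """General MH acceptance: min(1, [p(x') q(x|x')] / [p(x) q(x'|x)])."""
    log_ratio = (logp_target_new - logp_target_old) \
              + (logq_old_given_new - logq_new_given_old)
    return min(1.0, math.exp(log_ratio))

# Target: p(x)^alpha, so log-target = alpha * logp(x). Proposal: resample the
# suffix from the base model, so logq is the base model's suffix likelihood.
# With a shared prefix, the full-sequence ratio equals the suffix ratio, and
# everything reduces to exp((alpha - 1) * (lp_suf_new - lp_suf_old)).
alpha = 4.0
lp_suf_old, lp_suf_new = -10.0, -11.0   # toy suffix log-likelihoods
a = mh_accept(alpha * lp_suf_new, alpha * lp_suf_old,
              lp_suf_new, lp_suf_old)
print(round(a, 4))   # exp(-4 + 1) = exp(-3) -> 0.0498
```

Note the consequence: at α = 1 the correction cancels the target ratio entirely and every proposal is accepted, which is just ordinary sampling from the base model.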

Eval & Results

The authors compared this novel inference-time sampling method against 3 other configurations:

  • Base: base model sampling
  • Low-temperature: base model with low temperature sampling
  • GRPO (MATH): a model GRPO-trained on the MATH train split. All schemes are evaluated across three open-source models: Qwen2.5-Math-7B, Qwen2.5-7B, and Phi-3.5-mini-instruct.

The hyperparameters for GRPO training (per model family) and for the RwS inference scheme are as follows:

Qwen:

  • GRPO training dataset: MATH (train split; 7,500 problems)
  • #GPUs: 8
  • Optimizer: AdamW (betas = 0.9, 0.95; eps = 1e-8)
  • Learning rate: 5e-7 (constant)
  • Rollout batch size: 64 prompts/update
  • Group size: 16 rollouts/prompt (for GRPO advantages)
  • PPO mini-batch: 128 rollouts/update
  • Sampling temperature (training): τ = 1
  • KL loss: disabled (0)
  • Entropy loss: disabled (0)
  • Seed: 0

Phi:

  • Starting checkpoint (policy_init): Phi-3.5-mini-instruct (≈3.8B params; 128K context)
  • GRPO training dataset: MATH (train split; 7,500 problems)
  • SFT learning rate: 1e-6
  • SFT data volume: ~8B tokens (ChatML formatted)
  • GRPO policy learning rate: 1e-6
  • GRPO KL coefficient: 0.04
  • Num samples per prompt: 64
  • Reward type: verifiable math-answer correctness

RwS:

  • Max generation length: Tmax = 3072
  • Block size: B = 192 (Tmax / 16)
  • Power exponent: α = 4.0
  • Proposal temperature (reasoning eval): τ = 1/α = 0.25
  • Proposal temperature (AlpacaEval): τ = 0.5
  • #MCMC resampling steps: N_MCMC = 10

The results of the eval pipeline are displayed below:

Results Table

What's surprising isn't that power sampling beats vanilla decoding, but that it gets close to RL on MATH500 and can overtake RL on out-of-domain tasks such as HumanEval (especially stark for Phi-3.5-mini, where GRPO collapses on HumanEval). This evidence cuts against the first hypothesis: if a pure sampling method can match RL's results, and at times beat them on out-of-domain tasks, then RL post-training isn't obviously creating "real" new reasoning.

But why and how does this occur? The ablations section of the paper answers this question and reinforces the second hypothesis.

Figure

Graph Results

Image 1: Avg Log-Likelihood & Avg Confidence across inference schemes. Image 2: pass@1 results. Image 3: pass@k + diversity. Image 4: Correlation between #MCMC-Steps and MATH500 accuracy on Qwen models.

The paper plots histograms of sequence log-likelihoods under the base model and a base-model “confidence” proxy for MATH500 responses. GRPO’s samples concentrate heavily at the top-likelihood/high-confidence peak, while power sampling shifts upward (as intended) but maintains more spread.

This supports the paper’s framing: RL posttraining can look like distribution sharpening, and power sampling is an explicit “sharpening-at-inference” alternative.

The authors also plot pass@k (a problem counts as solved if any of k samples is correct). GRPO's curve tends to taper off as k grows, consistent with claims that RLVR causes diversity loss. Power sampling, on the other hand, keeps improving for k > 1 and eventually approaches the base model's high-k upper bound rather than collapsing. Concretely, the paper highlights that power sampling achieves GRPO-level single-shot performance without the multi-sample diversity tradeoff, and that this pattern repeats across MATH500, GPQA, and HumanEval in Appendix A.2 of the paper.

Takeaways

All in all, the paper:

  • Reinforces the idea that RLVR performs distribution sharpening, proposed as the true effect of current GRPO-based RL in Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening. This paper demonstrates how to perform that sharpening at inference time while still elevating the log-likelihood of less probable but correct solutions.
  • Presents the Power Distribution Sampling method via MCMC sampling of token subsequences
  • Demonstrates Performance on models (Qwen2.5-Math-7B, Qwen2.5-7B, Phi-3.5-mini-instruct) and reasoning tasks (MATH500, HumanEval, GPQA, AlpacaEval 2.0), occasionally outperforming GRPO on certain tasks (Phi-3.5-mini-instruct had the best results via power sampling vs GRPO).

So what am I getting out of this read?

I've been noticing recently that quite a few new RLVR methods highlight sampling and high-entropy tokens (Qwen's 80/20 Rule) to overcome GRPO-RLVR's issues with pass@k performance. Some even propose training models to reason without any external rewards, instead using internal confidence as the reward signal rather than external programmatic verification - Intuitor (RLIF: Learning to Reason without External Rewards). I'm more interested in these latter methods that tackle hard-to-verify RL tasks, whether the difficulty stems from data sparsity/complexity or compliance issues.

I've recently been exploring the use of entropy- and/or MCMC-based sampling methods (such as RwS) to generate synthetic reasoning traces from a larger model (or the same model), then SFT the base model on those traces so it reasons in accordance with a chat template. This would sidestep RwS's high inference cost - the authors report 8.84× the number of tokens relative to standard inference under their experimental settings (N_MCMC=10, T=679, B=192) - while still addressing the shortcomings of standard GRPO RLVR.

Through this paper read, I also managed to build an understanding of the state of RL - how RLVR first began with algorithmic verifiers, progressed to theorem-proving models, and finally arrived at the modern paradigm that dominates post-training pipelines (I'll be dropping another post on theorem-proving verifiers like DeepSeek-Prover-V2, and how they originated from research by OpenAI in 2019).

At a high level, RwS felt like a clean reminder that post-training isn’t synonymous with RL. A lot of the gains we attribute to GRPO-style pipelines can sometimes be recovered by changing how we sample rather than changing the weights, at least when the base model already contains the relevant heuristics but fails to surface them reliably at test time.

My main takeaway is that the next wave of RL-for-reasoning likely splits into two complementary directions:

  1. better test-time inference (entropy-aware decoding, MCMC-style sampling, confidence shaping, etc.) and
  2. reward-light training that scales beyond brittle verifiers, and at times, circumvents them entirely.

For now, I’ll be experimenting with a hybrid solution - using these sampling methods to generate higher-quality traces, then distilling them back into a cheaper SFTable policy - because it feels like a pragmatic bridge between today’s compute costs and tomorrow’s “hard-to-verify” reward regimes.