Temperature, Sampling, and Why the Same Prompt Gives Different Answers

Explains how temperature, top-p, and top-k control model output variability, and why reproducibility is an engineering choice with hard limits.

By Leigh Garrity, May 8, 2026

After a language model processes your prompt, it computes a probability for every possible next token (the sub-word units the model operates on). Tens of thousands of candidates, each with a score. But it hasn't picked one yet. Between "here are the probabilities" and "here's the word" sits a step called sampling, and sampling introduces controlled randomness. Temperature, top-p, and top-k are the parameters that govern how much randomness. They're why the same prompt, sent twice to the same model, can produce two different answers. Nothing broke. The system was built to roll weighted dice, and the weighting is what you control.

This matters for a specific reason: when a buyer asks "will this give us the same answer every time," you need to know that the honest answer is nuanced, and you need to know exactly where the nuance lives.

How temperature works

The model's final layer produces raw scores called logits, one per token in its vocabulary. Before those scores become probabilities, they pass through a function called softmax, which converts them into a distribution that sums to 1.0.

Temperature is a number that divides every logit before softmax runs. That division is the entire mechanism.

Low temperature (say 0.3) means dividing by a number below 1, which amplifies the gaps between scores. The highest-scoring token pulls further ahead. High temperature (say 1.5) means dividing by a number above 1, which compresses the gaps. Lower-probability tokens get a better shot.

Take three candidate tokens with logits of 1.0, 2.0, and 3.0:

Temperature	Token A (logit 1.0)	Token B (logit 2.0)	Token C (logit 3.0)
0.5 (low)	1.6%	11.7%	86.7%
1.0 (default)	9.0%	24.5%	66.5%
2.0 (high)	18.6%	30.7%	50.6%

Same logits. Same model. At temperature 0.5, Token C wins about 87 times out of 100. At temperature 2.0, it wins about half the time, and Token A, which barely registered before, now shows up in nearly one in five draws.

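Temperature 0 is a special case, because dividing by zero is undefined. As temperature approaches 0, the distribution collapses onto the top token, so implementations skip the division and take the argmax: the model always picks the single highest-probability token. No sampling, no randomness. This is called greedy decoding.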

These numbers are computed from the softmax formula, not pulled from a model run. They're illustrative. The mechanism is what matters, not the specific percentages.
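For anyone who wants to check the table, here is a minimal sketch of the calculation in plain Python. The function name softmax_with_temperature is made up for this example and is not any provider's API. The logits are the three illustrative values from the table.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by the temperature, then apply softmax."""
    if temperature <= 0:
        # Temperature 0 is handled separately as greedy decoding (argmax).
        raise ValueError("temperature must be greater than 0")
    scaled = [x / temperature for x in logits]
    # Subtracting the max before exponentiating avoids overflow
    # and does not change the resulting probabilities.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 3.0]  # Tokens A, B, C from the table
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [f"{p:.1%}" for p in probs])
```

Each printed row should match the corresponding row of the table to one decimal place.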

• Temperature: Divides logits before softmax, scaling the probability distribution. Low sharpens toward the top token. High flattens toward uniform. Zero eliminates sampling entirely.

How top-p and top-k trim the candidate pool

Temperature adjusts how spread out the probabilities are. Top-p and top-k adjust which tokens are even eligible to be picked.

Top-k is straightforward. Set it to 40, and the model only considers the 40 highest-probability tokens. Everything else gets zeroed out before sampling. Hard cutoff, fixed number.
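A rough sketch of that cutoff, assuming the probabilities have already been computed. The function name top_k_filter and the four-token distribution are invented for illustration. Real inference stacks do this on tensors, and some apply the cutoff to logits instead of probabilities.

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens, zero out the rest, renormalize."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices of the k largest probabilities.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    # Renormalize so the surviving tokens sum to 1.0 again.
    return [p / total for p in kept]

probs = [0.50, 0.30, 0.15, 0.05]  # made-up distribution over four tokens
print(top_k_filter(probs, k=2))
```

With k=2, only the first two tokens survive, and their probabilities are rescaled to split the full 100% between them.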

Top-p (also called nucleus sampling) is dynamic. You set a cumulative probability threshold. At top-p = 0.95, the model sorts tokens by probability, walks down the list, and stops when the running total hits 95%. Only those tokens are eligible. If the model is very confident about what comes next, that pool might be 3 tokens. If it's genuinely uncertain, it might be 200.

Top-p is generally more useful than top-k because it adapts to the model's confidence at each position. When the model knows what comes next, the pool tightens automatically. When it doesn't, the pool opens.
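The adaptive behavior is easier to see in code. This is a minimal sketch, not a production implementation. The function name top_p_filter and both example distributions are assumptions made up for this illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, running = set(), 0.0
    for i in ranked:
        keep.add(i)
        running += probs[i]
        if running >= p:
            break  # threshold reached, stop walking down the list
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

confident = [0.90, 0.06, 0.03, 0.01]  # model is fairly sure of the next token
uncertain = [0.30, 0.25, 0.25, 0.20]  # model is unsure

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    pool = [q for q in top_p_filter(dist, 0.95) if q > 0]
    print(name, "pool size:", len(pool))
```

Same threshold, different pool sizes: the confident distribution keeps two tokens, the uncertain one keeps all four.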

In practice, these parameters combine. A request might specify temperature 0.6, top-p 0.95, and top-k 40. Temperature reshapes the distribution, top-p and top-k trim the tails, and then the model samples from what remains.
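Put together, one sampling step might look like the sketch below. The function name sample_next_token is hypothetical, and the order used here (temperature, then top-k, then top-p) is an assumption for the example. Real inference stacks differ in the order they apply filters and in whether they renormalize between steps.

```python
import math
import random

def sample_next_token(logits, temperature=0.6, top_k=40, top_p=0.95, rng=None):
    """Return the index of the chosen token for one decoding step."""
    rng = rng or random.Random()
    if temperature == 0:
        # Greedy decoding: no sampling, just the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # 1. Temperature: divide the logits, then softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # 3. Top-p: walk down the list until the running total reaches top_p.
    kept, running = [], 0.0
    for i in ranked:
        kept.append(i)
        running += probs[i]
        if running >= top_p:
            break
    # 4. Sample from what remains. random.choices treats the weights as
    #    relative, so no explicit renormalization is needed.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

logits = [1.0, 2.0, 3.0]
rng = random.Random(0)  # seeded only so this local demo is repeatable
print([sample_next_token(logits, rng=rng) for _ in range(10)])
print(sample_next_token(logits, temperature=0))
```

The first line prints a mix of token indices that varies with the seed. The second always prints 2, the index of the highest logit.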

• Top-p: Restricts sampling to the smallest token set whose cumulative probability meets a threshold, adapting dynamically to model confidence.
• Top-k: Restricts sampling to a fixed number of highest-probability tokens, regardless of confidence.

Okta Concept Mapping: Policy evaluation vs. probabilistic sampling

In IDAM, a policy engine evaluates an access request and returns a deterministic result. Same user, same resource, same context, same answer. Every time. LLM inference looks structurally similar on the surface: request in, response out. But every call at temperature > 0 is a fresh draw from a probability distribution. The model has no memory of its last answer and no mechanism to stay consistent with it. Your intuition that "same input = same output" is correct for policy engines and actively misleading for model inference. Consistency in LLM output is something you impose from outside, through parameter selection and infrastructure controls. The system won't do it for you.

What providers actually ship

DeepSeek's R1-0528 model card specifies temperature 0.6 and top-p 0.95 for both benchmarking and production deployment. The stated rationale, from NVIDIA's hosted documentation quoting DeepSeek directly: the 0.5–0.7 temperature range "prevent[s] endless repetitions or incoherent output."

Sit with that for a second. This is a reasoning model. The 0.6 setting exists because at lower temperatures, the model can get stuck in loops, repeating itself or degrading into incoherence. The sampling randomness serves a structural purpose for this class of model. It prevents degenerate output.

Google's Gemini 3 documentation makes the same point more bluntly: it "strongly recommend[s]" keeping temperature at the default of 1.0 and warns that lowering it "may lead to unexpected behavior, such as looping or degraded performance, particularly in complex mathematical or reasoning tasks." Anthropic's Claude Opus 4.7 goes further still: it doesn't accept a temperature parameter at all, managing sampling adaptively under the hood.

The intuition that lower temperature always means more reliable output does not hold for reasoning-optimized models. If a buyer is evaluating a reasoning model for analytical workflows and assumes they can crank temperature to 0 for consistency, they may get worse results. Functionally degraded. The kind of degradation where the model loops on the same phrase or produces incoherent chains of reasoning.

Provider-recommended settings shift as models evolve. These numbers reflect documentation current as of this writing. Check the model card before citing them in a conversation.

• Provider defaults are load-bearing. Reasoning models specify nonzero temperature because sampling randomness prevents degenerate behavior like looping. Lower is not always better.

Temperature 0 and the limits of determinism

For compliance-sensitive workflows, audit logging, QA pipelines, anything where you need to reproduce an output months later, temperature 0 is the starting point. Greedy decoding eliminates the sampling step entirely.

But "eliminates sampling" and "guarantees identical output" are different claims. Every major provider says so explicitly:

OpenAI offers a seed parameter but calls it "Beta" and describes it as "best effort." Repeated requests with the same seed "should return the same result." Note the "should," not "will." They also return a system_fingerprint in the response representing the backend configuration. If the fingerprint changes between calls, the output may differ even with the same seed, because OpenAI changed something on their infrastructure. And the seed parameter is not supported on their latest reasoning models (o3 and o4-mini). So the reproducibility escape hatch is missing from the model class most likely to be used in complex analytical workflows, which is where reproducibility matters most.

Google says seed makes output "mostly deterministic" but adds it's "not a guaranteed absolute deterministic behavior."

Anthropic doesn't offer a seed parameter at all and states plainly: "Even with temperature set to 0, the results will not be fully deterministic."

Three providers. Three different approaches. Same conclusion.
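Where a provider does expose a backend fingerprint, as described for OpenAI above, a team can at least detect when the ground shifted between two calls. The sketch below works on already-logged responses stored as plain dictionaries. The helper name backend_changed and the fingerprint values are hypothetical. Only the system_fingerprint field name comes from the description above.

```python
def backend_changed(logged_a, logged_b):
    """Compare the system_fingerprint of two logged responses.

    Returns True if they differ, False if they match, and None if either
    response has no fingerprint, in which case nothing can be concluded.
    """
    fp_a = logged_a.get("system_fingerprint")
    fp_b = logged_b.get("system_fingerprint")
    if fp_a is None or fp_b is None:
        return None
    return fp_a != fp_b

# Hypothetical logged responses; the fingerprint values are invented.
january = {"system_fingerprint": "fp_example_1", "text": "Approved."}
march = {"system_fingerprint": "fp_example_2", "text": "Approved, pending review."}
print(backend_changed(january, march))
```

A changed fingerprint does not prove the output changed because of it. It only says that identical output was no longer promised, even on a best-effort basis.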

Why can't they just guarantee it? The short answer is hardware. GPU operations involve parallel floating-point calculations where the order of operations can vary between runs. Floating-point addition is not associative, so different accumulation orders produce slightly different rounding, which produces slightly different logits, which can tip a close probability race to a different token. Once one token differs, everything generated after it can differ too. The non-determinism lives in the silicon itself, below the layer where software parameters can reach.
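The rounding effect is easy to reproduce on an ordinary CPU. The sketch below is a deliberately contrived toy, not a GPU kernel: it adds the same three numbers in two different orders, then shows how that last-digit difference can flip an argmax when two candidates are nearly tied. The variable names and the rival score of 0.6 are chosen for the illustration.

```python
# The same three numbers, summed in two different orders.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a, b, a == b)  # the two sums differ in the last digit

def argmax(scores):
    """Index of the highest score; ties go to the earliest index."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Two "runs" of the same step: token 0 has a fixed score of 0.6, while
# token 1's score is the sum above, accumulated in a different order each run.
pick_run1 = argmax([0.6, a])
pick_run2 = argmax([0.6, b])
print(pick_run1, pick_run2)  # the chosen token differs between the two runs
```

Real logits are rarely this close, which is why temperature 0 is usually stable. The point is that "usually" is as strong a claim as the arithmetic allows.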

• Temperature 0 eliminates sampling but does not guarantee identical outputs across calls. Hardware-level floating-point variance can produce different results.
• Seed parameters are "best effort" where they exist (OpenAI, Google), absent at Anthropic, and excluded from OpenAI's reasoning models entirely.

When this comes up in a conversation

A public sector buyer evaluating an LLM for any workflow touching compliance, audit, or reproducibility will eventually ask: "Will this give us the same answer every time?"

The honest answer: almost, if you engineer for it, with caveats that depend on the provider. Temperature 0 gets you close. A seed parameter gets you closer where it's available. But the engineering doesn't stop at sampling parameters. Reproducibility in practice means pinning a specific model version so you don't drift between provider snapshots, locking every request parameter so nothing varies silently, and logging outputs against their inputs so you have an audit trail when two calls do diverge. Those are the three levers. No provider currently offers a determinism SLA, and the mechanisms that prevent full determinism are structural. They're not bugs on a roadmap.
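The three levers translate into a small amount of bookkeeping. Here is one way it might be sketched, assuming outputs are logged as plain records. The function names, the record layout, and the model identifier are all hypothetical, and no provider API is being called.

```python
import hashlib
import json

def audit_record(model_version, params, prompt, output):
    """Bundle a pinned model version, locked parameters, and the output."""
    request = {"model": model_version, "params": params, "prompt": prompt}
    # Canonical JSON (sorted keys) so identical requests hash identically.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return {
        "request": request,
        "request_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "output": output,
        "output_hash": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }

def diverged(rec_a, rec_b):
    """True when two calls had identical requests but different outputs."""
    same_request = rec_a["request_hash"] == rec_b["request_hash"]
    return same_request and rec_a["output_hash"] != rec_b["output_hash"]

params = {"temperature": 0, "top_p": 1.0}       # locked request parameters
model = "example-model-2026-01-01"              # hypothetical pinned snapshot
first = audit_record(model, params, "Summarize clause 4.", "Clause 4 limits liability.")
second = audit_record(model, params, "Summarize clause 4.", "Clause 4 caps liability.")
print(diverged(first, second))
```

When diverged returns True, the audit trail shows that the request was identical and the variation came from the provider side, which is the conversation you want to be able to have.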

The buyer who understands this is better positioned than the buyer who was told "yes, it's deterministic." Because the second buyer will discover the variability in production, and when they do, they'll lose trust in the system and in whoever told them otherwise.

Nobody expects a lecture on softmax in a meeting. What holds up is being the person in the room who knows that consistency is a design choice with known limitations, a property you engineer into the system deliberately. That's a different kind of credibility than knowing the spec. It's the kind that holds up when the buyer's technical team starts asking follow-ups.

• Consistency is an engineering choice. It requires deliberate parameter selection, model version pinning, and acceptance of provider-stated limitations. No production LLM is deterministic by default.

Things to follow up on...

• Transformer Explainer interactive tool: Georgia Tech's Transformer Explainer lets you manipulate a temperature slider and watch probability distributions shift in real time, which makes the softmax mechanism viscerally concrete in a way tables can't.
• OpenAI's system_fingerprint field: OpenAI's advanced usage guide explains how to monitor backend configuration changes that silently break reproducibility even when you've pinned seed and temperature to zero.
• Gemini 3's temperature warning: Google's getting started documentation for Gemini 3 explicitly warns that lowering temperature on reasoning models can cause looping and degraded performance, a caveat worth bookmarking as more reasoning models enter procurement conversations.
• Why hallucinations persist structurally: Kalai and Nachum's arXiv paper on hallucination origins from OpenAI Research explains why the same statistical pressures that make sampling work also make confident wrong answers an expected output of the mechanism, not an anomaly.