Why the Same Prompt Gives Different Answers — and What You Can Do About It

By Leigh Garrity, May 8, 2026

When you send the same prompt to a language model twice and get two different responses, that's not a glitch. It's the intended behavior, and it's controlled by a parameter called temperature. Understanding temperature — and the two parameters that work alongside it — is the difference between knowing that AI outputs vary and knowing why, which is the difference that matters when a government buyer asks you to explain your system's audit trail.

What Temperature Actually Is

A language model doesn't look up answers. It generates them, one token at a time, by predicting what comes next. At each step, the model produces a probability distribution across its entire vocabulary — every word, subword, and punctuation mark it knows. Some tokens get high probability. Most get very low probability. The model then samples from that distribution to pick the next token, and the process repeats.

Temperature is the parameter that controls the shape of that distribution before the model samples from it.

Set temperature low — close to zero — and the distribution gets sharper. Probability mass concentrates at the top. The model almost always picks the highest-probability token, which means the output is predictable and consistent. Set temperature high — the upper range varies by model, but think of a scale from 0 to 2 — and the distribution flattens. Probability spreads across more tokens. The model starts picking options that wouldn't have made the cut at lower settings, which is where you get creative variation, unexpected phrasing, and occasionally outputs that wander off the map.

Temperature is sometimes described as a "creativity dial." The framing is intuitive, and it's also a little misleading: temperature doesn't add creativity to the model. It controls how much of the model's probability distribution you're willing to sample from. The model's knowledge doesn't change. The range of what it might say does.
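
For readers who want the arithmetic behind that description, the usual textbook formulation is a temperature-scaled softmax. The symbols here are notation introduced for illustration, not taken from any particular vendor's documentation: z_i is the model's raw score for token i, T is the temperature, and p_i is the resulting probability.

```latex
p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
```

At T = 1 the scores are used as-is. As T approaches zero, the largest score dominates and nearly all probability lands on one token. As T grows above 1, the differences between scores shrink and the distribution flattens.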

• Temperature: A parameter that controls the shape of the probability distribution a language model samples from when generating each token. Low temperature produces consistent, predictable outputs; high temperature produces more varied, exploratory ones.

How the Mechanism Works — and Where Top-p and Seed Come In

At a single generation step, the model has assigned probabilities to, say, 50,000 tokens. At temperature 1.0 (the neutral setting most models default to), those probabilities are used roughly as-is. At temperature 0.2, the math compresses the distribution so that the top candidates get even more of the probability mass and the long tail gets almost none. At temperature 1.8, the distribution spreads, and tokens that had a 0.1% chance now have a 1% chance — which sounds small until you're generating hundreds of tokens and the small probabilities compound.
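
A minimal sketch of that arithmetic, assuming a five-token vocabulary with made-up scores. The helper names `softmax_with_temperature` and `chance_of_any_tail_token` are hypothetical, and the compounding estimate treats every generation step as independent, which real generation is not.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def chance_of_any_tail_token(per_step_prob, num_tokens):
    """Chance that at least one step picks a tail token (independence assumed)."""
    return 1.0 - (1.0 - per_step_prob) ** num_tokens


# Hypothetical scores for a five-token vocabulary (illustrative values only).
logits = [4.0, 3.0, 2.0, 0.0, -2.0]

for t in (0.2, 1.0, 1.8):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])

for p in (0.001, 0.01):
    print(p, round(chance_of_any_tail_token(p, 300), 3))
```

With these made-up scores, the least likely token's share grows roughly tenfold between temperature 1.0 and 1.8. Over 300 tokens, a 0.1% per-step chance works out to about a 26% chance of at least one occurrence, and a 1% per-step chance to about 95%.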

Top-p — sometimes called nucleus sampling — is a complementary constraint. Instead of reshaping the distribution, it restricts which part of the distribution is eligible for sampling at all. A top-p value of 0.9 means: consider only the smallest set of tokens whose cumulative probability adds up to 90%, and ignore everything else. The model samples from that nucleus.
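
The nucleus rule can be sketched in a few lines. This assumes a toy five-word vocabulary with invented probabilities, and `top_p_filter` is a hypothetical helper name, not a function from any model library.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus = []
    cumulative = 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1 before sampling.
    total = sum(p for _, p in nucleus)
    return {token: p / total for token, p in nucleus}


# Invented next-token probabilities (illustrative values only).
probs = {"the": 0.50, "a": 0.30, "this": 0.12, "that": 0.05, "zebra": 0.03}

nucleus = top_p_filter(probs, top_p=0.9)
print(sorted(nucleus))
```

Here the first three tokens already cover 92% of the probability, so "that" and "zebra" are never eligible, no matter how the random draw falls.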

Temperature and top-p interact. Temperature reshapes the distribution; top-p then constrains the sampling window within that reshaped distribution. Many production deployments set both. A common pairing for tasks requiring precision: temperature near zero, top-p near 1.0 (at very low temperature top-p has little effect, because the distribution is already so peaked that the nucleus is often a single token). A common pairing for generative tasks: temperature around 0.7–1.0, top-p around 0.9. The specific numbers vary by model and use case; treat these as starting points, not recipes.
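
To make the interaction concrete, the sketch below counts how many tokens survive a top-p cutoff of 0.9 at three temperatures. The eight scores are invented and `nucleus_size` is a hypothetical helper. Real vocabularies are far larger, so only the direction of the effect should be read from it.

```python
import math


def nucleus_size(logits, temperature, top_p):
    """Count the tokens left in the top-p nucleus after temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    cumulative = 0.0
    for count, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)  # rounding kept the sum just under top_p: keep everything


# Invented scores for an eight-token vocabulary (illustrative values only).
LOGITS = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0]

sizes = {t: nucleus_size(LOGITS, t, top_p=0.9) for t in (0.2, 1.0, 1.8)}
print(sizes)
```

At temperature 0.2 the nucleus collapses to a single token, so the top-p setting has nothing left to trim. At higher temperatures the nucleus widens and top-p starts doing real work.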

Neither temperature nor top-p, by themselves, makes the model's output reproducible. That's what seed is for.

A seed is an integer that initializes the random number generator the model uses when it samples from the distribution. Same prompt, same temperature, same top-p, same seed: same output every time, provided the model version and serving environment are also unchanged. The model's probabilistic machinery is still running. You're not removing the randomness; you're fixing the starting point of the random draw. Change the seed by one and you will typically get a different output, even with everything else held constant.

Seed doesn't make the model deterministic in the mathematical sense. It makes the output reproducible given fixed inputs. That's a narrower claim, and it's the right one to make in front of a technically fluent buyer.
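
The same idea can be shown with Python's standard `random` module standing in for a model's sampler. This is a toy under stated assumptions: the four "tokens" and their weights are invented, `sample_sequence` is a hypothetical helper, and a hosted model adds sources of variation that this sketch does not have.

```python
import random

# Invented vocabulary and probabilities (illustrative values only).
TOKENS = ["approve", "deny", "review", "defer"]
WEIGHTS = [0.5, 0.3, 0.15, 0.05]


def sample_sequence(seed, length=20):
    """Draw a token sequence from a private generator initialized with seed."""
    rng = random.Random(seed)  # does not touch the global random state
    return rng.choices(TOKENS, weights=WEIGHTS, k=length)


first = sample_sequence(seed=42)
second = sample_sequence(seed=42)
other = sample_sequence(seed=43)

print(first == second)  # same seed, same draws
print(first == other)   # a different seed starts the draws somewhere else
```

The draws are still weighted-random in both runs. The seed only pins down where the sequence of random numbers begins.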

• Top-p: A constraint that limits sampling to the smallest set of tokens whose cumulative probability meets a threshold, applied after temperature reshapes the distribution. Controls the breadth of the sampling window.

• Seed: An integer that initializes the random number generator used during sampling. Fixes the output for a given combination of prompt, temperature, and top-p. Enables reproducibility without removing probabilistic generation.

IDAM Concept Mapping

Closest analog: Cryptographic entropy in session token generation

In IDAM, when a system generates a session token, it draws from an entropy source — a pool of randomness — to produce a value that's unpredictable to an attacker. A seed in LLM sampling works similarly: it initializes the random draw that determines which token gets selected. The analogy breaks at the value judgment, and that break is worth naming precisely. In IDAM, a predictable token is a vulnerability. Fixing the entropy source is exactly what an attacker wants to do. In LLM production deployments, fixing the seed is a compliance feature — it's what lets you reproduce an output for an auditor. Your IDAM instinct says predictable = insecure. In this context, predictable = auditable. Same mechanism, inverted value, different domain.
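
The inversion can be seen side by side in Python's standard library, where `secrets` exists for the security case and a seeded `random.Random` for the reproducible one. The function names `new_session_token` and `seeded_draws` are hypothetical, and neither function represents how any particular IDAM product or model API is implemented.

```python
import random
import secrets


def new_session_token():
    """Security use: draw from the OS entropy source so the value is unpredictable."""
    return secrets.token_hex(16)  # 16 random bytes, rendered as 32 hex characters


def seeded_draws(seed, count=5):
    """Audit use: a fixed seed makes the 'random' draws repeatable on demand."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]


# Two session tokens should never match; two seeded runs always do.
print(new_session_token() == new_session_token())
print(seeded_draws(2026) == seeded_draws(2026))
```

A seeded generator like this would be a finding in a session-token code review. In a sampling pipeline it is what lets you rerun the draw.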

The Production Tradeoff — and Why It Matters in Government Accounts

An agency has deployed an automated system to assist with benefits eligibility determinations. The system takes structured inputs — income data, household size, program criteria — and generates a recommendation. A caseworker reviews it; in some configurations, the system flags cases for human review automatically. Either way, the output is part of the administrative record.

Now an applicant appeals. The agency's legal team needs to reconstruct what the system said, why it said it, and whether it would say the same thing today given the same inputs. This is a standard audit requirement under most federal administrative law frameworks, and it's not hypothetical — agencies deploying AI-assisted decision tools are increasingly required to demonstrate that their systems produce consistent, explainable outputs.

If temperature is set at 1.0 and no seed is fixed, the system cannot reliably reproduce its own outputs. Run the same inputs tomorrow and you may get a different recommendation. That's a compliance gap. The agency either has to log every output verbatim (which creates its own data management obligations) or configure the system to be reproducible from the start.

The practical configuration is usually: temperature near zero, seed fixed per session or per request, top-p set conservatively. Outputs become consistent, reproducible, and auditable. What you give up is the variation that makes language models useful for generative tasks — drafting, brainstorming, summarization where some flexibility in phrasing is acceptable. For a benefits-determination workflow, that's not much of a sacrifice. You never wanted the model to be creative about eligibility criteria.
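
As a rough illustration, the two postures might be written down as configuration profiles with a simple check for the gap described above. Everything here is hypothetical: `SamplingConfig`, `reproducibility_gaps`, and the specific numbers are assumptions for the sketch, not settings from any vendor's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical container for the three sampling parameters."""
    temperature: float
    top_p: float
    seed: Optional[int] = None


def reproducibility_gaps(config: SamplingConfig) -> List[str]:
    """List the reasons a configuration may not reproduce its own outputs."""
    gaps = []
    if config.temperature > 0 and config.seed is None:
        gaps.append("temperature above zero with no fixed seed")
    if not 0 < config.top_p <= 1:
        gaps.append("top_p outside the range (0, 1]")
    return gaps


# Illustrative profiles only; real values depend on the model and the workflow.
AUDITABLE = SamplingConfig(temperature=0.0, top_p=0.9, seed=1234)
GENERATIVE = SamplingConfig(temperature=0.8, top_p=0.9)

print(reproducibility_gaps(AUDITABLE))
print(reproducibility_gaps(GENERATIVE))
```

A check like this only catches configuration gaps. It says nothing about whether the underlying model version has changed since the output was produced.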

The tradeoff is real in other contexts. A system that drafts policy summaries or generates training content benefits from some temperature — the outputs are more natural, less repetitive, better suited to human reading. "Better suited to human reading" and "auditable by a federal IG" are different requirements, and the same model configuration can't fully optimize for both. Knowing which requirement governs the deployment is the first question worth asking before any conversation about AI configuration in a government account.

Worth naming before you leave this topic: some model APIs don't expose seed as a parameter, or expose it with caveats about reproducibility across model versions. If a vendor tells you their system is reproducible, ask whether that reproducibility is guaranteed across model updates. The answer is usually no — when the underlying model changes, the same seed produces different outputs. That's not a flaw in the seed mechanism; it's a consequence of the model's weights changing. But it means "we fixed the seed" is not the same as "we have a stable audit trail across the system's lifetime." The audit trail requires logging outputs, not just fixing seeds.
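
One way to picture "logging outputs, not just fixing seeds" is a record that stores the output verbatim next to everything needed to interpret it. This is a sketch under assumptions: `build_audit_record`, the field names, the model version string, and the sample inputs are all invented for illustration. Whether to store the prompt itself or only a hash of it is a policy decision this sketch does not settle.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(model_version, config, prompt, output):
    """Bundle the output with the settings and model version that produced it."""
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,  # the seed alone does not pin this down
        "config": dict(config),
        "prompt_sha256": prompt_hash,
        "output": output,  # stored verbatim, not regenerated later
    }


record = build_audit_record(
    model_version="example-model-2026-01",  # hypothetical identifier
    config={"temperature": 0.0, "top_p": 0.9, "seed": 1234},
    prompt="household_size=3; monthly_income=2100; program=EXAMPLE",
    output="Recommend: flag for human review.",
)

print(json.dumps(record, indent=2, sort_keys=True))
```

If the model is later updated, the logged output is still the record of what the system said at the time, even if rerunning the same seed no longer reproduces it.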

• Production tradeoff: Low temperature and a fixed seed produce consistent, reproducible outputs suitable for auditable workflows; higher temperature produces varied outputs better suited to generative tasks. The deployment context determines which matters more — and in most public sector decision-support applications, auditability wins.

What to Hold Onto

Temperature, top-p, seed. Temperature controls how peaked or flat the probability distribution is when the model picks each token. Top-p constrains which tokens are even in the running. Seed fixes the random draw so you can reproduce the result.

When a Chief AI Officer (CAIO) asks why the system gave different answers to the same question, the answer is: because temperature was set above zero and no seed was fixed, which is the default configuration for most models. Whether that's appropriate for the deployment depends entirely on what the system is being used for. If the answer involves compliance, audit trails, or administrative records, the configuration probably needs to change.

That's the conversation, and you now have the vocabulary for it.