Every language model call is a sampling event. The model doesn't retrieve a stored answer — it generates one, token by token, by drawing from a probability distribution over its entire vocabulary. The parameters that govern that draw — temperature, top-p, and seed — determine whether two identical prompts produce identical outputs, and by how much they diverge when they don't.
What These Parameters Actually Do
Temperature, top-p, and seed are sampling controls. They don't change what the model knows. They change how the model chooses what to say next from everything it knows.
Temperature is the most consequential of the three. It's a scalar applied to the model's raw next-token scores before they are turned into probabilities and sampled. A high temperature flattens the resulting distribution: tokens that were unlikely become more competitive, and the output gets more varied, sometimes more creative, sometimes more wrong. A low temperature sharpens the distribution: the highest-probability tokens dominate, and the output converges toward what the model considers most likely. At temperature 0, you get greedy decoding: the model always picks the single most probable next token, with no random draw. Because dividing by zero is undefined, implementations typically special-case 0 as greedy selection or substitute a very small value; the intended behavior is the same. In hosted deployments, small run-to-run differences can still appear at temperature 0 because of floating-point and batching effects, so treat it as near-deterministic unless you have verified otherwise.
Top-p, also called nucleus sampling, is a filter applied before sampling happens. Instead of drawing from the full vocabulary, the model samples only from the smallest set of tokens whose cumulative probability reaches the threshold p. Set top-p to 0.9 and the model draws only from the most likely tokens that together account for at least 90% of the probability mass; the long tail of improbable tokens gets cut. This is a different kind of constraint than temperature: temperature reshapes the distribution, top-p truncates it.
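Seed fixes the random number generator used in the sampling step. With the same prompt, the same temperature, the same top-p, and the same seed, you should get the same output, provided the model version and serving environment are also unchanged. Some hosted APIs treat seeded reproducibility as best effort, and not every provider exposes a seed parameter. Without a fixed seed, you get a fresh random draw on every call, even if everything else is identical.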
These three parameters operate as a unit. Temperature and top-p define the shape of the sampling space; seed determines where in that space you land on any given run.
• Sampling controls: Temperature, top-p, and seed govern how a model selects its next token from a probability distribution. They don't change the model's knowledge — they change how that knowledge gets expressed on each call.
The Mechanism
The model processes your prompt and produces a vector of raw scores, called logits, one for every token in its vocabulary. A large model's vocabulary might contain anywhere from tens of thousands to a few hundred thousand tokens (the number varies by model and tokenizer). Those raw scores get converted into probabilities via a softmax function. At this point you have a distribution: token A has a 30% probability, token B has a 15% probability, and so on down to tokens with probabilities in the thousandths.
Temperature is applied before the softmax. Each raw score is divided by the temperature value. Divide by a number greater than 1 — high temperature — and the scores compress toward each other, flattening the distribution. Divide by a number less than 1 — low temperature — and the scores spread apart, sharpening it. Divide by a value approaching 0 and the highest score dominates so completely that the distribution is effectively a spike. This is why temperature 0 produces deterministic output: there's no meaningful distribution left to sample from.
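The division step described above can be sketched in a few lines of Python. This is a minimal illustration, not any provider's implementation: the four logits are made-up values, and the function name softmax_with_temperature is a hypothetical helper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each raw score by the temperature, then apply softmax."""
    if temperature <= 0:
        # Division by zero is undefined; real systems special-case 0 as greedy decoding.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for four candidate tokens.
logits = [2.0, 1.0, 0.5, -1.0]

for t in (2.0, 1.0, 0.5, 0.05):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Reading the printed rows from top to bottom, the first token's share of the probability grows as the temperature falls, until at 0.05 it holds essentially all of it.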
Top-p runs after temperature adjustment. The model sorts tokens by probability, descending, and accumulates them until the cumulative probability reaches the threshold. Everything below the cutoff is excluded, and the surviving probabilities are rescaled to sum to 1. If the top two tokens account for 80% of probability mass, the top three account for 92%, and your top-p is 0.9, you're sampling from three candidates. If the distribution is flat and you need 200 tokens to reach 90%, you're sampling from 200. The candidate pool size is dynamic, which is what lets nucleus sampling adapt better than a fixed top-k cutoff across different prompt types.
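A rough sketch of the truncation step, under the assumption that the probabilities have already been temperature-adjusted. The function name top_p_filter and both example distributions are hypothetical; production samplers work on tensors, not Python lists.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize the survivors to sum to 1."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept = []
    cumulative = 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}

# A peaked distribution: a few tokens carry most of the mass.
peaked = [0.55, 0.25, 0.12, 0.05, 0.03]
# A flat distribution: sixteen equally likely tokens.
flat = [0.0625] * 16

print(len(top_p_filter(peaked, 0.9)))  # 3 candidates survive
print(len(top_p_filter(flat, 0.9)))    # 15 candidates survive
```

The same threshold yields a pool of three in one case and fifteen in the other, which is the dynamic pool size the paragraph above describes.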
Then the model samples. One token is drawn from the remaining candidates, weighted by their adjusted probabilities. That token gets appended to the context, and the whole process repeats for the next token.
Seed enters at the sampling step. The random number generator that governs the draw can be seeded. Fix the seed and you fix the sequence of random draws: same inputs, same outputs, on every run against the same model version and serving stack. Change the seed and you get a different draw from the same distribution.
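A minimal sketch of what fixing the seed does, using Python's standard random module as a stand-in for a model's sampler. The probabilities are hypothetical, and the distribution is held constant across steps for simplicity, whereas a real model recomputes it after every token.

```python
import random

def sample_tokens(probs, steps, seed=None):
    """Draw `steps` token indices from a fixed distribution.
    seed=None lets Python pick fresh entropy on each call."""
    rng = random.Random(seed)
    indices = list(range(len(probs)))
    return [rng.choices(indices, weights=probs, k=1)[0] for _ in range(steps)]

# Hypothetical post-temperature, post-top-p probabilities for four tokens.
probs = [0.6, 0.25, 0.1, 0.05]

run_a = sample_tokens(probs, 20, seed=42)
run_b = sample_tokens(probs, 20, seed=42)
print(run_a == run_b)  # True: same seed, same sequence of draws

# With no seed, repeated calls are free to differ.
print(sample_tokens(probs, 20))
print(sample_tokens(probs, 20))
```

The seeded runs still draw from a distribution with real spread; the seed only pins which draw comes out.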
Worth being precise about what seed actually does: it makes variability reproducible, not absent. A model configured with temperature 0.8 and a fixed seed will give you the same varied output every time you run it. Useful for testing. Not the same as a low-temperature configuration that produces conservative, predictable outputs.
Seed gives you the same answer every time. It doesn't give you the right answer. Those are different problems, and conflating them is where a lot of "we fixed the auditability issue" conversations go sideways.
• Inference-time sampling: The model generates a probability distribution over its vocabulary, temperature reshapes it, top-p truncates it, and then one token is sampled. Seed fixes the random draw so the same inputs produce the same output — reproducibility without eliminating variability from the design.
IDAM Concept Mapping
Your intuition is right, and then it isn't.
In access control, a policy that returns different results for the same inputs isn't creative — it's broken. An authorization engine that allows a request on Tuesday and denies it on Wednesday, with no policy change in between, is a P1 incident, not a feature. Non-determinism in a security-relevant system is a failure condition. That intuition is exactly the right starting point for understanding why buyers get uncomfortable when they learn that the same prompt can produce different outputs.
Language models are not authorization engines, and output variability is not a defect category. In LLM production deployments, controlled variability is a design parameter with a legitimate range. A compliance document classifier that extracts specific data fields from federal forms should probably be configured like an authz policy — low temperature, fixed seed, reproducible outputs that can be regression-tested. A citizen-facing assistant that explains benefit eligibility in plain language might legitimately have more latitude, because rigid determinism there produces robotic, brittle responses. These are different use cases, and they belong at different points on the variability spectrum.
What matters is whether the variability was configured deliberately, for the right reasons, by someone with authority to make that call — and whether it's documented. That last part is where your IDAM instincts become useful again.
The Conversation You're Actually In
Three scenarios where this vocabulary earns its keep.
The buyer who got two different answers. A program manager runs the same query through an agency's new document-processing tool twice and gets different summaries. They file a ticket. The help desk doesn't know what to say. This is a temperature configuration question, not a model quality question — but nobody in the room has the vocabulary to say so. If you can name the mechanism, you can redirect the conversation from "the AI is broken" to "what variability tolerance did we configure for this workflow, and was that the right call?" That's a more productive conversation, and it surfaces a governance question that probably should have been answered before deployment.
The CISO asking about auditability. How do you audit outputs from a system that produces different outputs every time? You configure it so it doesn't, for the workflows where that matters. Low temperature plus a fixed seed means your document classifier should produce the same output for the same input on every run, as long as the model version stays pinned. You can log it, compare it, regression-test it, and demonstrate consistency to an auditor. This is a solvable problem, but only if someone made the configuration decision deliberately. If the deployment used default parameters, which in many provider APIs lean toward creative latitude rather than auditability, you have a governance gap, not a technology gap. The CISO's concern is legitimate; the framing just needs to shift from "is this auditable" to "did we configure it to be."
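One way to make that configuration decision reviewable is to log it alongside each output. The sketch below is an assumption about what such a record might contain, not a standard: every field name, the SamplingConfig class, and the audit_record helper are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class SamplingConfig:
    # All field names are hypothetical; adapt them to your own governance records.
    workflow: str
    temperature: float
    top_p: float
    seed: Optional[int]
    approved_by: str
    rationale: str

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def audit_record(config: SamplingConfig, prompt: str, output: str) -> dict:
    """Bundle the sampling configuration with hashes of the prompt and output."""
    return {
        "config": asdict(config),
        "prompt_sha256": sha256_hex(prompt),
        "output_sha256": sha256_hex(output),
    }

config = SamplingConfig(
    workflow="form-field-extraction",
    temperature=0.0,
    top_p=1.0,
    seed=1234,
    approved_by="(name of accountable owner)",
    rationale="Outputs must be regression-testable",
)

first = audit_record(config, "Extract the applicant name.", "Jane Doe")
second = audit_record(config, "Extract the applicant name.", "Jane Doe")
print(json.dumps(first, indent=2))
print(first == second)  # True: identical inputs and outputs give identical records
```

Hashing rather than storing the text is a design choice for the sketch; a real deployment would decide what to retain based on its own records policy.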
The procurement conversation about pipeline validation. An agency wants to validate an AI-assisted contract review pipeline before production. The question is whether they can test it. If the model is configured with high temperature and no fixed seed, every test run produces different outputs, and you can't build a regression suite. You can observe the system; you can't test it in any meaningful sense. Configure for reproducibility, and you can establish a baseline, run the same inputs through after any model update, and detect drift. This is a requirements conversation that happens before procurement, not after — and it requires someone to know that these parameters exist and what they control.
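The baseline-and-compare idea can be sketched as follows. The toy_review function is a hypothetical stand-in for a real model call, built only so the example runs on its own; the labels and document names are invented.

```python
import random

LABELS = ["compliant", "needs_review", "non_compliant"]

def toy_review(document, seed=None):
    """Hypothetical stand-in for a model call that returns a review label.
    With a seed, the label depends only on (seed, document); without one,
    each call draws fresh randomness."""
    rng = random.Random() if seed is None else random.Random(f"{seed}:{document}")
    return rng.choice(LABELS)

def find_drift(documents, baseline, run):
    """Return the documents whose current output differs from the baseline."""
    return [doc for doc in documents if run(doc) != baseline[doc]]

documents = ["contract-a", "contract-b", "contract-c"]

def seeded(doc):
    return toy_review(doc, seed=7)

# Record a baseline once, then re-run the same inputs and compare.
baseline = {doc: seeded(doc) for doc in documents}
print(find_drift(documents, baseline, seeded))      # [] because seeded runs repeat

# Unseeded runs may or may not match the baseline on any given pass,
# so a reported difference cannot be attributed to a model change.
print(find_drift(documents, baseline, toy_review))
```

With a reproducible configuration, any non-empty drift list after a model update points at the update rather than at sampling noise.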
All three of these conversations share the same shape: the buyer's concern is legitimate, the mechanism is knowable, and the configuration decision is a governance question. Who set these parameters? What use case were they optimizing for? Is that documented and reviewable? Those questions map cleanly to the access governance vocabulary your buyers already speak. You're not translating from AI into English — you're translating from AI into a dialect they already know.
• Production tradeoff: Low temperature and fixed seed enable auditable, regression-testable outputs. High temperature without a fixed seed produces varied outputs that may be appropriate for some use cases and ungovernable for others. The configuration decision is a governance question, not a default to accept.

