The Model Is Sampling, Not Retrieving

By Carey Whitten, May 5, 2026

The Mechanism

When a language model generates text, it works one token at a time. (Quick terminology flag: in this context, "token" means a subword unit — roughly a syllable or short word — not a bearer token or session identifier. Same word, completely different domain. The collision will come up in buyer conversations; worth naming it early.)

At each step, the model produces a probability distribution across its entire vocabulary — tens of thousands of possible next tokens, each assigned a likelihood. The model then samples from that distribution rather than selecting the highest-probability token outright. That sampling step is why the same prompt produces different outputs on successive calls. The distribution is deterministic given the same input; the draw from it is not.
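
A minimal sketch of the difference between picking the top token and sampling, using only the Python standard library. The four-token vocabulary and its probabilities are made up for illustration; a real model scores tens of thousands of tokens and recomputes the distribution at every step.

```python
import random

# Toy next-token distribution. Tokens and probabilities are hypothetical.
next_token_probs = {"the": 0.50, "a": 0.30, "this": 0.15, "zebra": 0.05}


def greedy_pick(probs):
    """Always return the single most likely token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random()  # unseeded, so the draws differ from run to run
greedy_draws = [greedy_pick(next_token_probs) for _ in range(5)]
sampled_draws = [sample_pick(next_token_probs, rng) for _ in range(5)]

print(greedy_draws)   # the same top token, five times
print(sampled_draws)  # a weighted random mix that changes between runs
```

The distribution is identical for every call in this sketch. Only the draw varies, which is the behavior the paragraph above describes.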

Three parameters control how that sampling works.

Temperature reshapes the distribution before sampling. Mathematically, the raw scores (logits) the model produces are divided by the temperature before they're converted to probabilities. A temperature of 1.0 leaves the distribution as-is. Lower values sharpen it: the highest-probability tokens pull further ahead, and the model behaves more predictably. Higher values flatten it: probability mass spreads across more candidates, and the model ranges more widely. At temperature 0, which implementations typically handle as a special case because dividing by zero is undefined, the model becomes effectively deterministic, always picking the top token. At temperature 2.0, the distribution is flat enough that low-probability tokens become real contenders, and output quality often degrades. Provider defaults typically sit around 1.0, though this varies by model and use case and should be verified against current documentation.
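
To make the effect concrete, here is a small sketch of temperature applied to a softmax: each logit is divided by the temperature, then the results are normalized into probabilities. The function name and the four logit values are assumptions chosen for illustration, not any provider's implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide each logit by the temperature, then normalize to probabilities."""
    if temperature <= 0:
        # Temperature 0 cannot be divided through; treat it as greedy selection instead.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical raw scores for four tokens

for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Running it shows the top token's share of probability shrinking as the temperature rises, while the ranking of the tokens stays the same.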

Top-p (also called nucleus sampling) limits where the model can sample from. Set top-p to 0.9, and the model considers only the smallest set of tokens whose cumulative probability reaches 90%; every token outside that set is excluded from the draw. This prevents the model from sampling from the long tail of improbable tokens even when temperature is elevated. Temperature and top-p are typically used together: temperature shapes the distribution, top-p constrains the sampling pool.
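
The filtering step can be sketched as follows: rank tokens by probability, keep adding them until the running total reaches top-p, then renormalize what is left. The function name and the toy distribution are illustrative assumptions.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete; everything after this is dropped
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1 again.
    return {token: p / total for token, p in kept.items()}


# Hypothetical next-token distribution.
probs = {"the": 0.50, "a": 0.30, "this": 0.15, "zebra": 0.05}

nucleus = top_p_filter(probs, 0.9)
print(nucleus)  # "zebra" is excluded; the other three are rescaled
```

With top-p at 0.9, the first two tokens cover only 80% of the mass, so the third is needed to cross the threshold and the fourth is cut.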

Seed fixes the random number generator state. Given identical inputs, identical model weights, and the same seed value, the sampling process produces identical output. Seed doesn't change what the distribution looks like — it makes the draw reproducible. This is the parameter that matters most for enterprise QA and audit workflows, and it's the one most often missing from early AI deployments.
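
A short sketch of what fixing the seed does, using Python's standard random number generator as a stand-in for a model's sampler. The `generate` function and the toy distribution are hypothetical, and the distribution is held fixed here for simplicity, whereas a real model recomputes it after every token. Whether and how a hosted API exposes a seed varies by provider and should be checked against its documentation.

```python
import random


def generate(probs, length, seed=None):
    """Draw `length` tokens from a fixed distribution using a seeded generator."""
    rng = random.Random(seed)  # seed=None means a fresh, unpredictable state
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=length)


# Hypothetical next-token distribution.
probs = {"the": 0.50, "a": 0.30, "this": 0.15, "zebra": 0.05}

run_a = generate(probs, 8, seed=42)
run_b = generate(probs, 8, seed=42)
unseeded = generate(probs, 8)

print(run_a == run_b)  # True: same inputs and same seed give the same draws
print(unseeded)        # no seed, so this varies between runs
```

Note that the seed changes nothing about the probabilities themselves. It only makes the sequence of draws repeatable.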

Okta Concept Mapping

The closest IDAM analogy is session token entropy. When generating a session identifier, you want maximum entropy — predictable tokens are a vulnerability. Temperature is a similar dial: it controls how much randomness enters the output. Seed is analogous to deterministic key derivation — same seed, same output, every time.

The analogy breaks at the threat model. In IDAM, you always maximize entropy for security. Predictable session tokens are an attack surface. In LLMs, you tune entropy to the use case, and predictable outputs are sometimes exactly what you want. A CAIO asking about "model randomness" is asking about a quality and reproducibility property, not a security property. The vocabulary overlaps; the concern does not.

What This Looks Like in Practice

A federal agency is piloting an AI system to draft initial responses to public records requests. The program office wants to validate outputs before deployment. They run the same test prompts repeatedly and get different drafts each time — not dramatically different, but different enough that their QA process, built around comparing outputs to expected strings, keeps failing.

The culprit is temperature. The system was deployed with a provider default around 1.0, which is appropriate for creative tasks and actively counterproductive for a workflow that requires consistency. Dropping temperature to 0.2 and fixing a seed value produces outputs that are stable across runs. The QA process works. The audit trail holds.

That same low-temperature, seeded configuration will produce noticeably less varied outputs for tasks where variation is the point — brainstorming, drafting options, exploring alternatives. Every configuration choice trades one property for another. Consistency against range. Reproducibility against exploration.

The Production Tradeoff

For enterprise and public sector deployments, the question is which setting matches the workflow's requirements.

Workflows that need auditable, reproducible outputs (compliance summaries, structured data extraction, templated drafting) should run low temperature with a fixed seed. The output won't be identical across model versions or infrastructure changes, but it will be stable within a controlled environment. That's enough for most QA processes.

Workflows that benefit from exploration (ideation, options generation, open-ended analysis) should run higher temperature without a fixed seed. Reproducibility isn't the goal; coverage of the possibility space is.

The common failure mode: teams that deploy a single configuration across all use cases because they didn't know the parameters existed. Temperature and top-p require intentional configuration — the same discipline you'd apply to token lifetime or scope in an OAuth setup. Know what the parameter does, set it deliberately, document why.
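
One way to make that discipline visible is to keep sampling settings in a per-workflow table with the reason recorded next to each value. The sketch below is an assumption about how a team might organize this, not a provider API. The workflow names, the `SamplingConfig` class, the seed value, and the ideation settings are illustrative; the 0.2 temperature and 0.9 top-p echo the examples used earlier in this piece.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingConfig:
    temperature: float
    top_p: float
    seed: Optional[int]  # None means the draw is intentionally not reproducible
    rationale: str       # the documented "why" for this setting


# Hypothetical presets; the values are illustrative, not recommendations.
SAMPLING_PRESETS = {
    "compliance_summary": SamplingConfig(
        temperature=0.2,
        top_p=0.9,
        seed=1234,
        rationale="QA compares outputs across runs, so stability matters most.",
    ),
    "ideation": SamplingConfig(
        temperature=1.0,
        top_p=0.95,
        seed=None,
        rationale="Variation is the point, so reproducibility is not required.",
    ),
}


def config_for(workflow):
    """Return the documented preset, failing loudly if none was chosen."""
    if workflow not in SAMPLING_PRESETS:
        raise KeyError(f"No sampling preset documented for workflow {workflow!r}")
    return SAMPLING_PRESETS[workflow]


print(config_for("compliance_summary"))
```

Raising an error for an unknown workflow, rather than falling back to a default, forces someone to make the choice deliberately before a new use case ships.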

When a buyer asks why their AI system gives different answers to the same question, the mechanism is this: the model is sampling from a probability distribution, and the sampling parameters haven't been tuned to the workflow's reproducibility requirements. That's a solvable problem, and the solution has three knobs.