A foundation model API call is a stateless HTTP request. You send a prompt, you get a completion back, the connection closes. No session, no infrastructure to manage, no memory of the previous call unless you explicitly include it in the next one. The model doesn't know you. It just responds.
That's the whole mechanism. Everything else — the pricing, the compliance architecture, the governance gap — follows from that single fact.
What a Model API Call Actually Is
When an application calls a foundation model, it sends an HTTP POST to an endpoint. The request body contains the prompt — the input text the model will respond to — along with configuration parameters like how long the response can be and how much randomness to allow in the output. The model processes the input, generates a response, and returns it. The connection closes. Nothing persists on the model's side.
Most enterprise software doesn't work this way. A database connection pools and persists. A SAML session has a defined lifetime and can be terminated. An API gateway can revoke a token mid-flight. A model API call has none of that. It's a function call over HTTP: input in, output out, done.
The economic unit is the token. One terminology note before we go further, because this word will cause problems with your existing mental model: in AI, "token" does not mean a credential. It's not a bearer token, not a JWT, not a SAML assertion. A token here is a chunk of text — roughly three-quarters of a word on average, though the exact mapping depends on the model's tokenizer. The sentence you just read contains approximately 35 tokens. Model providers price on token consumption, metered separately for input and output.
Bullet recap:
- Model API call: A stateless HTTP request containing a prompt; the model returns a completion and the connection closes. No session, no persistent state, no memory between calls unless the caller explicitly includes prior context.
- Token (AI definition): A chunk of text, roughly 3/4 of a word, used as the unit of measurement and pricing. Not a credential. Input tokens and output tokens are metered and billed separately.
How It Works: Tokens, Pricing, and the Stateless Reality
An application assembles a payload: a system prompt (instructions that frame how the model should behave), any prior conversation history the application wants the model to consider, and the user's current input. All of that goes in as input tokens. The model reads the full payload, generates a response, and returns it as output tokens. The application receives the response and closes the connection.
Pricing flows directly from this structure. Input tokens and output tokens carry different rates, and output is typically more expensive — the model is doing more work to generate than to read. A mid-tier hosted model might run around $3 per million input tokens and $15 per million output tokens. A single chat turn — say, 500 input tokens and 300 output tokens — costs roughly $0.006. Negligible per call. Meaningful at scale.
Document processing is where scale becomes visible. An agency ingesting a 50-page policy document for analysis might send 40,000 input tokens in a single call and receive 2,000 tokens of structured summary in return. At those rates, that's about $0.15 per document. Run that across 10,000 documents in a procurement pipeline and you're at $1,500 — before you've written a single line of application code or provisioned a single server. The infrastructure cost is zero. The consumption cost is real and variable.
"Stateless" has a specific operational meaning here. The model holds no state between calls. If you want the model to remember that it's helping a contracting officer review a specific solicitation, you have to send that context with every call. Applications that need continuity — a multi-turn conversation, a workflow that spans multiple steps — maintain state themselves and inject it into each request. The model is, in this sense, more like a very capable function than a service with memory.
No connection to manage, no session to maintain, no timeout to handle. The call either completes or it doesn't.
Bullet recap:
- Request structure: System prompt + conversation history + user input, assembled by the calling application and sent as a single payload.
- Token pricing: Input and output metered separately; output typically costs more. Pricing is per-call and variable by consumption, not by seat or subscription.
- Stateless in practice: No context persists between calls. The application is responsible for maintaining and injecting any state the model needs to do its job.
- Cost shape: Negligible per call, meaningful at pipeline scale. Infrastructure cost is zero; consumption cost scales with usage.
IDAM Concept Mapping
Most resembles: A service account call — credential presented, request made, response returned.
Where the analogy holds: The API key functions like a service account credential. You can rotate it, scope it, and audit calls at the gateway level. Access control lives at the credential layer, same as it does for a service account calling an API.
Where it breaks — and this is the part that matters:
There is no session. Unlike an OAuth flow where you can revoke a token mid-flight or terminate a session at the IdP, there is nothing to revoke once the request is in-flight. The call completes or it doesn't. Mid-call intervention doesn't exist.
There is no identity on the model side. The model doesn't authenticate to anything. It has no principal, no credential, no identity in any IDAM sense. It receives a prompt and generates a completion. You cannot query it for its identity, audit its access, or apply policy to it the way you would a workload identity.
The model itself logs nothing. Whatever audit trail exists lives in the API gateway, the hyperscaler's infrastructure, or the application layer — built by the caller, not the callee. The model has no record of who called it or what it produced.
The governance gap lives exactly here. In IDAM, you govern access by controlling the credential and the session. Here, you can control the credential. But once the call is made, there's nothing to revoke, nothing to inspect on the model side, and the "action" — generating a completion — produces no artifact in the model itself. Audit trails have to be constructed upstream. That's a different architecture than anything in your existing governance stack.
Why Your Buyers Are on Bedrock, Not api.openai.com
Most of the accounts you're walking into are not calling OpenAI or Anthropic directly. They're consuming the same underlying models — GPT-4o, Claude 3, Gemini — through AWS Bedrock, Azure AI Foundry, or Google Vertex AI. If you're surprised by this, you'll sound like you haven't talked to a federal buyer recently.
The reason is compliance inheritance. AWS GovCloud holds FedRAMP High authorization. Azure Government holds FedRAMP High authorization. Google Cloud's government regions hold FedRAMP High authorization. When an agency calls a foundation model through Bedrock inside an existing GovCloud boundary, the model call inherits the cloud provider's existing compliance posture — the data residency controls, the VPC routing that keeps traffic inside the agency's cloud perimeter, the audit logging that feeds into existing SIEM infrastructure, the incident response procedures the agency already has contracts for.
Calling api.openai.com from an agency network is a different conversation entirely. New vendor relationship, new data processing agreement, new ATO surface, new data residency questions, new audit logging gaps. The procurement and authorization overhead is substantial. Most agencies, given the choice between "call the model through the cloud provider we already have an ATO for" and "start a new ATO process," are choosing the former. Procurement logic, not preference.
A second factor doesn't get enough attention: the hyperscalers have made model access a feature of the platform, not a separate product. An agency already running workloads in Azure Government can enable Azure AI Foundry with existing identity controls, existing network policies, and existing billing infrastructure. The marginal friction is low. The marginal compliance lift is near zero. That's a different procurement conversation than standing up a new API relationship with a foundation model provider.
So when a buyer tells you they're "using AI," your first question should be where the model call is landing, not which model they're using. The model is often less important than the boundary it's running inside. Bedrock in GovCloud and api.openai.com from an agency laptop are not the same risk surface, even if they're calling the same underlying model.
Bullet recap:
- Why hyperscalers won regulated enterprise: Compliance inheritance. Calling a model through Bedrock, Azure AI, or Vertex inside an existing authorized cloud boundary carries the cloud provider's FedRAMP posture, data residency controls, and audit logging — without a new ATO.
- What "direct API" means in practice: Calling OpenAI or Anthropic directly is a new vendor relationship, new data processing agreement, new ATO surface. Most regulated buyers have already made this calculation and landed on hyperscaler-hosted.
- The discovery question: Ask where the model call lands, not just which model they're using. The boundary matters more than the model name.
- The governance implication: Audit logging in hyperscaler-hosted deployments lives in the cloud provider's infrastructure — not in the model. The model itself still logs nothing. The hyperscaler gives you more surface to instrument, but the fundamental gap from the IDAM callout above doesn't close. It just becomes more manageable.
The mechanism is simple enough to explain in a parking lot. The compliance logic is straightforward once you see it through a procurement lens. The governance gap is real and doesn't have a clean answer yet — which is exactly why it's worth being precise about where it lives before someone asks you in a meeting.

