You'll encounter all three configurations in public sector accounts this year. Cloud-hosted inference — Azure OpenAI, AWS Bedrock, Google Vertex AI — is what most agencies are piloting right now. Self-hosted in a government cloud enclave or on-premises GPU cluster is what the security-conscious ones are asking about. Third-party API access through an aggregator or regional reseller is what shows up in shadow IT conversations and small-agency procurements where nobody had budget for the first two options.
The configurations themselves aren't the hard part. Talking about what they actually change — in a CAIO conversation, in a CISO review, in the moment when procurement asks why the architecture diagram looks the way it does — that's where precise language earns its keep.
The Three Configurations
Cloud-hosted (provider-managed)
What it is: Inference runs on the model provider's infrastructure, in regions they control, under their operational responsibility.
What it does: The agency sends a prompt to an API endpoint. The provider's hardware runs the model, returns a response, and logs the transaction according to their data processing agreement. The agency never touches the weights.
Who's behind it: The model provider (OpenAI, Anthropic, Google, Microsoft) or a hyperscaler running a licensed model on their infrastructure. The accountability chain runs from the agency through a cloud service agreement to the provider's data processing addendum. FedRAMP authorization status varies by provider and service tier.
What makes it distinct: The provider controls what gets logged, where logs live, and who can access them under legal process. The agency's data processing agreement is the only lever they have.
Self-hosted (on-premises or government cloud enclave)
What it is: The model weights run on infrastructure the agency controls — either physical hardware in a government facility or a dedicated enclave in a government cloud (GovCloud, IL4/IL5, C2S).
What it does: Inference happens inside the agency's security boundary. Prompts, responses, and intermediate outputs never leave the enclave. The agency operates the model like any other piece of infrastructure.
Who's behind it: The agency's own IT or a managed service provider operating inside the agency's authorization boundary. The model weights may be licensed from a commercial vendor — Meta's Llama models, Mistral, or a fine-tuned derivative — but the runtime is the agency's problem.
What makes it distinct: The agency owns the full stack, which means they own the logging configuration, the breach surface, and the latency profile. There is no provider to call when the inference cluster goes down.
Third-party API access (aggregator or reseller)
What it is: Inference runs on a provider's infrastructure, but the agency accesses it through an intermediary — an aggregator like OpenRouter, a regional reseller, or a managed AI service that wraps multiple underlying models.
What it does: The intermediary routes the agency's prompts to one or more underlying models, sometimes with load balancing across providers. The agency may not know which model or which infrastructure is handling a given request.
Who's behind it: Two accountability chains, not one. The intermediary has a data processing agreement with the agency; the underlying model provider has a separate agreement with the intermediary. The agency's legal relationship stops at the intermediary.
What makes it distinct: The agency has the least visibility of any configuration. They can't audit the intermediary's relationship with the underlying provider, and they may not know which jurisdiction their data touched on a given request.
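One way to keep the three accountability chains straight is to write them down as data. The sketch below is illustrative only: the class, field names, and party labels are invented for this example, and the chains are simplified from the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostingConfig:
    name: str
    # Parties in the runtime path, ordered from the agency outward.
    chain: tuple[str, ...]
    # Number of links the agency operates itself or has a direct
    # agreement with (an assumption made for this illustration).
    agency_reach: int

    def visible(self) -> tuple[str, ...]:
        """Parties the agency can hold directly accountable."""
        return self.chain[: self.agency_reach]

    def opaque(self) -> tuple[str, ...]:
        """Parties the agency reaches only through someone else's contract."""
        return self.chain[self.agency_reach:]


CONFIGS = [
    HostingConfig("cloud-hosted", ("provider",), 1),
    HostingConfig("self-hosted", ("agency IT or in-boundary MSP",), 1),
    HostingConfig("third-party API", ("intermediary", "underlying provider"), 1),
]

for cfg in CONFIGS:
    print(f"{cfg.name}: visible={cfg.visible()} opaque={cfg.opaque()}")
```

Only the third-party configuration leaves a party in the `opaque` bucket, which is the "legal relationship stops at the intermediary" point in code form.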
Okta Concept Mapping: Tenant Isolation in SaaS Identity
In Okta, tenant isolation means your org's data, sessions, and policy configurations are logically separated from every other customer on the same infrastructure. The boundary is enforced at the application layer — you're sharing physical infrastructure but not data. Hosting location in AI inference maps to this concept until it doesn't. The boundary in a cloud-hosted model is the provider's data processing agreement, not a technical enforcement layer. A misconfigured logging pipeline or a provider-side breach doesn't respect the logical boundary the same way a SaaS tenant boundary does. The analogy is useful for understanding what "separation" means in this context. It stops being useful when you're evaluating what actually enforces it.
Comparison: Four Dimensions, Three Configurations
Trait-led analysis is the right structure here. The four dimensions aren't equal in novelty or complexity — jurisdiction and logging are familiar to any public sector buyer, breach exposure is table stakes, and latency-to-tools is the one that changes how you architect. Running all three configurations through each dimension in sequence lets the latency section land with the weight it earns, rather than getting buried in a side-by-side table where all four rows look equivalent.
Jurisdiction
Cloud-hosted: Data and inference are subject to the legal regime of the provider's operating jurisdiction and the region where inference runs. An agency using Azure OpenAI in the East US region is operating under U.S. law, but the provider's global infrastructure means legal process in another jurisdiction could reach data the agency didn't know was replicated. The data processing agreement governs what the provider can be compelled to disclose, and that agreement is negotiated, not assumed.
Self-hosted: Jurisdiction follows the infrastructure. An agency running inference in a GovCloud enclave is operating under the legal regime that governs that facility, which is why GovCloud exists. There is no provider jurisdiction to worry about because there is no provider in the runtime path.
Third-party API: Jurisdiction is the intermediary's jurisdiction plus whatever jurisdiction the underlying provider operates in. The agency may not know the underlying provider's operating region for a given request. This is the configuration where "we don't know which jurisdiction our data touched" is a factually accurate answer, not a rhetorical hedge.
Logging Obligations
Cloud-hosted: The provider logs what their data processing agreement says they log. The agency can request logs but cannot configure what gets captured. In practice, most enterprise agreements include prompt and response logging for safety and abuse detection, with retention periods set by the provider. The agency's security team can usually read those logs; it cannot change what is written to them.
Self-hosted: The agency configures logging. That's the advantage and the operational burden in the same sentence. An agency that hasn't explicitly configured inference logging has no logs, not because the system is secure, but because nobody turned logging on. Complete control means complete responsibility.
Third-party API: Two logging configurations, neither fully visible to the agency. The intermediary logs what their agreement says; the underlying provider logs what their agreement with the intermediary says. The agency's audit trail stops at the intermediary's API boundary. What happens between the intermediary and the underlying model is outside the agency's visibility.
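For the self-hosted case, "nobody turned logging on" is literal: the runtime records only what the operator wires up. A minimal sketch of an explicit audit wrapper using Python's standard `logging` module follows. `run_inference` is a hypothetical stand-in for whatever call the agency's serving stack exposes, and the record fields are assumptions for illustration, not a compliance-approved schema.

```python
import hashlib
import json
import logging
import time

audit_log = logging.getLogger("inference.audit")


def run_inference(prompt: str) -> str:
    """Hypothetical stand-in for the agency's self-hosted model call."""
    return f"echo: {prompt}"


def audited_inference(prompt: str, user_id: str) -> str:
    """Run inference and emit one structured audit record per request."""
    started = time.time()
    response = run_inference(prompt)
    record = {
        "ts": started,
        "user_id": user_id,
        # Hashes let auditors correlate requests without storing raw text;
        # whether to log full prompts is a policy decision for the agency.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode("utf-8")).hexdigest(),
        "latency_ms": round((time.time() - started) * 1000, 1),
    }
    audit_log.info(json.dumps(record))
    return response


if __name__ == "__main__":
    # Without this configuration, the INFO records above are silently dropped.
    logging.basicConfig(level=logging.INFO)
    audited_inference("Summarize the attached memo", user_id="analyst-7")
```

Python's root logger defaults to the WARNING level, so the wrapper emits nothing until someone configures a handler and level. That mirrors the point above: the absence of logs reflects a missing configuration, not a secure system.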
Okta Concept Mapping: SCIM Provisioning Endpoint Location
When you configure SCIM provisioning in Okta, the provisioning endpoint location determines where user data travels during sync operations. An agency provisioning to a SaaS application hosted in a non-U.S. region is moving PII across a jurisdictional boundary on every sync cycle, which is why FedRAMP-authorized applications specify their data residency in their authorization package. The logging obligation question in AI inference is structurally identical: where the logs live, and who controls the endpoint that writes them, determines the agency's actual audit capability. The structural analogy holds completely here. The gap is in documentation: SCIM provisioning endpoints are documented in FedRAMP packages; AI inference logging configurations often aren't.
Breach Exposure
Cloud-hosted: A breach of the provider's infrastructure exposes the agency's prompts and responses — and potentially the prompts and responses of every other customer on the same infrastructure. The blast radius is the provider's entire customer base, bounded by whatever tenant isolation the provider has implemented. The agency's exposure is proportional to how much sensitive information flows through the inference API.
Self-hosted: A breach of the agency's enclave exposes the agency's inference workload and whatever data the model had access to. The blast radius is bounded by the enclave's access controls. An attacker who compromises the inference cluster can see everything the model could see — which, in an agentic deployment, may include credentials, tool outputs, and intermediate reasoning that never appears in a final response.
Third-party API: A breach of the intermediary exposes the agency's prompts in transit. A breach of the underlying provider exposes them at rest. Two breach surfaces, neither controlled by the agency, and the intermediary's security posture is typically not auditable through the agency's standard vendor assessment process.
Agent Round-Trip Latency
This is the dimension that reshapes the architecture conversation, and the one most buyers haven't thought through.
In a chat interface, latency is about the user. A 300ms response time is fine; 2 seconds starts to feel slow. The model's hosting location matters insofar as it affects how quickly the response reaches the person typing.
Agent workloads don't work this way. An agent completing a task — drafting a procurement summary, checking a system of record, updating a ticket — doesn't make one API call. It makes a sequence of tool calls: retrieve this document, query that database, invoke this API, check that permission. A moderately complex agent task involves 15 to 40 tool calls, each of which is a round-trip between the model and the tool.
The latency that matters in an agent workload runs model-to-tool, not model-to-user.
If the model is running in a commercial cloud region and the tools it's calling are in a GovCloud enclave, every tool call crosses that boundary. A cross-region round-trip in AWS infrastructure runs roughly 80 to 150ms depending on the regions involved. Same-region calls run 5 to 20ms. That gap — call it 100ms per tool call — looks trivial in isolation.
At 30 tool calls per task, 100ms per call is 3 seconds of pure network wait, accumulated in series. These calls are sequential, not parallel, because the agent's reasoning at step 4 depends on the result of step 3. You can't pipeline them. The latency compounds.
At 200 tasks per day across an agency's agent deployment, that's 600 seconds — 10 minutes — of accumulated wait time that has nothing to do with model quality, prompt engineering, or the sophistication of the agent's reasoning. It's a hosting mismatch.
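The arithmetic above can be sketched in a few lines of Python. The figures are the ones used in this section (30 tool calls per task, roughly 100ms of extra cross-boundary latency per call, 200 tasks per day). The function and constant names are illustrative, and the model assumes strictly sequential calls with no retries and no model inference time.

```python
def network_wait_seconds(tool_calls: int, per_call_ms: float) -> float:
    """Total network wait for one task when tool calls run strictly in series."""
    return tool_calls * per_call_ms / 1000.0


# Figures taken from the text; treat them as rough planning numbers.
CALLS_PER_TASK = 30
CROSS_BOUNDARY_PENALTY_MS = 100
TASKS_PER_DAY = 200

per_task = network_wait_seconds(CALLS_PER_TASK, CROSS_BOUNDARY_PENALTY_MS)
per_day = per_task * TASKS_PER_DAY

print(f"per task: {per_task:.1f} s")
print(f"per day:  {per_day:.0f} s ({per_day / 60:.0f} min)")

# The same function applied to the quoted latency ranges.
cross_region = [network_wait_seconds(CALLS_PER_TASK, ms) for ms in (80, 150)]
same_region = [network_wait_seconds(CALLS_PER_TASK, ms) for ms in (5, 20)]
print(f"cross-region per task: {cross_region[0]:.1f} to {cross_region[1]:.1f} s")
print(f"same-region per task:  {same_region[0]:.2f} to {same_region[1]:.2f} s")
```

Because the wait scales linearly with the number of tool calls, the quickest way to size the problem for a specific account is to swap in that buyer's own call counts and measured round-trip times.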
Cloud-hosted: Latency to tools depends entirely on where the provider's inference endpoint sits relative to the agency's tool infrastructure. If the agency's systems of record are in GovCloud and the model is in a commercial region, every tool call crosses the boundary. The provider's region selection is constrained by what's available in their FedRAMP-authorized footprint, which limits the agency's ability to optimize placement.
Self-hosted: The agency controls where the inference cluster sits. A self-hosted model deployed in the same GovCloud enclave as the agency's tools and data has same-region latency on every tool call. Latency-to-tools is minimized by design. It's also the configuration that requires the agency to operate GPU infrastructure, which is not a trivial operational commitment.
Third-party API: The agency has no control over where inference runs. The intermediary routes to whatever underlying provider is available, which may change request-to-request. Latency-to-tools is unpredictable and unoptimizable. For agent workloads with strict performance requirements, this configuration fails first.
The right question in a CAIO conversation isn't "how fast is the model?" It's "where are your tools, and where is the model relative to them?" Those are different questions with different answers, and the second one determines whether the agent deployment performs at scale.
Okta Concept Mapping: Privileged Access Workstation Network Segmentation
A PAW is a hardened workstation that only connects to privileged management systems, physically and logically isolated from the general network to prevent credential theft from spreading laterally. The network segmentation is the security control; the isolation is what makes it work. Self-hosted inference in a government enclave follows the same logic: the model is inside the security boundary, so its access to privileged tools and data doesn't require crossing a network boundary where an attacker could intercept the traffic. The analogy holds for understanding why enclave hosting reduces both breach exposure and tool-call latency simultaneously. It stops holding when you extend it to operational burden: a PAW is a workstation configuration, not a GPU cluster. The agency that chooses enclave hosting for the security properties needs to be ready for the infrastructure management that comes with it.
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "The cloud version is less secure" | "With cloud-hosted inference, the provider controls the logging configuration and the breach surface — the question is whether their data processing agreement gives you the audit rights you need" | Specific claim, auditable, doesn't sound like FUD |
| "Self-hosted is more compliant" | "Self-hosted puts the logging configuration and the breach boundary under your control — which is an advantage if you have the infrastructure team to operate it" | Compliance is a process, not a hosting attribute |
| "The aggregator option is fine for low-sensitivity workloads" | "With a third-party API, there are two data processing agreements in the chain and you're party to only one of them, plus a jurisdiction you can't fully audit. That's a risk decision, not a technical one" | Gives the buyer the actual decision frame |
| "Latency won't be a problem" | "For chat interfaces, probably not. For agent workloads making 30 tool calls per task, the hosting location relative to your tools matters more than the hosting location relative to your users" | Introduces the latency-to-tools distinction before the buyer discovers it in production |
| "We can run the model in GovCloud" | "Running inference in the same enclave as your tools and data eliminates the cross-boundary latency on every tool call — at 30 calls per task, that's the difference between a 3-second penalty per task and a sub-second one" | Quantifies the architectural decision |
| "The provider handles security" | "The provider handles security of their infrastructure. Security of what flows through it — prompts, tool outputs, intermediate reasoning — depends on the data processing agreement and your logging configuration" | Precise scope, doesn't overstate or understate |
| "Which model do you want to use?" | "Before we get to model selection, where are the systems this agent will be calling? That determines which hosting configuration actually performs at scale" | Reframes the conversation to the right question first |
| "Third-party APIs are cheaper" | "Third-party aggregators can reduce cost, but you're trading visibility for price — you can't audit the intermediary's relationship with the underlying provider" | Accurate tradeoff framing, not a recommendation |
| "Jurisdiction is handled by FedRAMP" | "FedRAMP authorization tells you the provider met a security baseline — it doesn't determine which legal regime governs your data under compelled disclosure" | Prevents a conflation that will surface in procurement |
| "The model is the same either way" | "The model weights are the same file. The legal envelope around inference, the logging configuration, and the latency profile of every tool call are not" | The one-thing statement, said precisely |
The model doesn't care where it runs. Your buyer's legal team, security team, and the agent deployment that's supposed to complete 200 tasks a day — they care. The hosting decision gets made once, usually early in a procurement cycle, and it determines the operational reality of everything that runs on top of it.
An AE who can name the four dimensions and explain why latency-to-tools is the one that surprises people in production is the AE who gets called back for the architecture conversation. That's the meeting worth being in.

