Three Deployment Realities: Who Holds the File, Who Controls the Hardware, Who Sees the Prompts

By Leigh Garrity - May 9, 2026

When a public sector buyer says "we're deploying a model," they could mean one of three physically distinct things. They could be calling a provider's API — sending prompts to someone else's infrastructure and getting responses back. They could be running model weights on cloud infrastructure they've provisioned and control. Or they could be running weights on hardware they own, inside their own perimeter. Each option has a different answer on three axes that matter for every conversation you'll have about identity, access, and data: who holds the file, who controls the hardware, and who sees the prompts. Know which one your buyer is describing, and the right questions become obvious. Conflate them, and you'll spend the meeting catching up.

The foundation from 4.1 applies here: a trained model is a file — billions of numbers, typically ranging from a few gigabytes for a small model to several hundred gigabytes for a large one — that gets loaded into GPU memory to run. "Deploying a model" means copying that file onto a machine with enough VRAM to hold it and running inference on it. With that in hand, the three deployment realities are physically distinct in a way that's easy to map.

The Three Deployment Realities

Calling a Model Provider's API

What it is: Your application sends prompts to the provider's servers over HTTPS. The model weights never leave the provider's infrastructure.

What it does: Your system makes an API call — a structured request containing the prompt and any configuration parameters — and receives a response. The provider's servers handle everything in between: loading the weights into GPU memory, running inference, returning the output. From your environment's perspective, it looks like any other API call. You send a request; you get a response.

Who's behind it: OpenAI, Anthropic, Google (Gemini API), Cohere, and others for commercial deployments. For government, a common path is Azure OpenAI Service, which runs OpenAI models on Microsoft's infrastructure. Whether a specific API arrangement meets a given agency's requirements is a question for their legal and compliance teams.

What makes it distinct: The file never touches your environment. You're renting inference time on someone else's model. The weights live on the provider's hardware, the provider's team maintains them, and the provider's servers process every prompt you send.
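A minimal sketch of what that API call looks like from your side, assuming a made-up endpoint and made-up field names (`PROVIDER_URL`, `prompt`, `max_tokens`, and the bearer-token header are illustrative, not any real provider's schema). The request is built but never sent; the point is only that the prompt text sits inside the body that leaves your environment.

```python
import json
import urllib.request

# Hypothetical endpoint and field names, for illustration only.
# Real providers each define their own URL, auth scheme, and request schema.
PROVIDER_URL = "https://api.provider.example/v1/generate"


def build_inference_request(prompt: str, api_key: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) an HTTPS POST carrying the prompt."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        PROVIDER_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_inference_request("Summarize the attached memo.", api_key="EXAMPLE_KEY")

# Everything the provider needs, including the full prompt, is in this body.
print(request.get_method(), request.full_url)
print(json.loads(request.data)["prompt"])
```

Whatever the real schema turns out to be, the shape is the same: the weights stay where they are, and the prompt travels to them.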

Okta Concept Mapping: API Calls and Federated Trust

Calling a model provider's API has a structural similarity to federated authentication: your application delegates something to a trusted third party and accepts what comes back. You see the output, not the mechanism. The analogy holds for the trust relationship — you're extending trust to an external system to handle something on your behalf. It breaks when you ask what the third party can see. In SAML federation, the IdP sees the authentication event, not your application's data. In a model API call, the provider sees the prompt — which may contain the data itself. That's a different kind of trust delegation, and it's worth surfacing when a buyer assumes API calls are equivalent to federated SSO from a data exposure standpoint.

Running Weights on a Hyperscaler You Control

What it is: You copy the model weights onto cloud infrastructure you've provisioned — GPU instances on AWS, Azure, or GCP — and run inference in your own cloud environment.

What it does: Your team provisions GPU instances with enough VRAM to hold the model (an H100 with 80GB of VRAM can hold a mid-sized model; larger models require multiple GPUs or specialized configurations), copies the weights from wherever they're sourced, and runs an inference server. Your application calls your endpoint, not the provider's. Prompts don't leave your cloud tenant.

Who's behind it: The model weights come from whoever trained them — Meta for Llama models, Mistral AI for Mistral models, and so on. The hardware is the cloud vendor's physical infrastructure. Your team controls the environment: the network configuration, the access policies, the software stack, the endpoint. The cloud vendor controls the physical machines.

What makes it distinct: The file is in your environment, not the provider's. Prompts go to your endpoint. But the physical machines running your workload belong to the cloud vendor. Your team controls the application layer; the vendor controls the hardware layer. For most civilian agencies, this is an acceptable boundary. For some, it isn't.
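The VRAM sizing mentioned above comes down to simple arithmetic. The sketch below is a rough lower bound, not a capacity plan: it assumes 2 bytes per parameter (16-bit weights), counts only the weights themselves, and ignores the extra memory that inference needs at runtime. The parameter counts and helper names are illustrative assumptions.

```python
import math


def weights_size_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, vram_gb: float = 80.0,
                         bytes_per_param: float = 2.0) -> int:
    """Lower bound on GPU count: weights only, no runtime overhead."""
    return math.ceil(weights_size_gb(num_params, bytes_per_param) / vram_gb)


# Illustrative parameter counts; assumed values, not tied to a specific model.
for label, params in [("8B parameters", 8e9), ("70B parameters", 70e9)]:
    size = weights_size_gb(params)
    gpus = min_gpus_for_weights(params)
    print(f"{label}: ~{size:.0f} GB of weights, at least {gpus} x 80 GB GPU(s)")
```

Under these assumptions a 70-billion-parameter model comes to about 140 GB of weights, which is why it does not fit on a single 80 GB card.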

Okta Concept Mapping: Cloud-Hosted Infrastructure

Running model weights on a hyperscaler you control is structurally similar to running your own identity infrastructure on cloud compute — you control the application, the configuration, the access policies, but the physical machines belong to the cloud vendor. The question "does the cloud vendor have physical access to my data?" applies equally to both. For most civilian agency use cases, the answer is "yes, in principle, but within contractual and technical constraints that most agencies find acceptable." Whether those constraints are sufficient for a specific agency's requirements is a question for their legal and compliance teams — not something to resolve in a discovery call.

Hosting Weights On-Premises

What it is: You copy the model weights onto hardware you own and operate. The GPU servers are in your data center or a facility you control. No third-party infrastructure is in the data path.

What it does: Same inference process as the hyperscaler option — weights loaded into GPU memory, inference server running, your application calling your endpoint — but the physical machines are yours. Your team handles procurement, installation, maintenance, capacity planning, and failure response. A large model might require a rack of H100s; a smaller model might run on a single well-provisioned server. The capital expenditure is real and upfront.

Who's behind it: NVIDIA (or AMD) for the GPU hardware. The model weights come from whoever trained the model. Your team runs everything else. There is no cloud vendor in the architecture.

What makes it distinct: The file, the hardware, and the prompts are all inside your perimeter. Nothing leaves unless you send it somewhere. The physical data path is entirely under your control.

Okta Concept Mapping: On-Premises Infrastructure

On-prem model hosting is the AI equivalent of running Active Directory on hardware in your data center — you own the failure modes, the upgrade cycles, the capacity planning, and the security perimeter. The upside is that nothing leaves your facility unless you route it out. The downside is that the perimeter is entirely your problem: hardware failures, VRAM constraints as models grow, GPU driver updates that break inference servers. Agencies that run on-prem AD because policy or classification requirements demand it will recognize this tradeoff immediately. The conversation is familiar; the hardware is different.

Comparing the Three: A Scenario Map

A ranked comparison doesn't fit here because none of these options is better in the abstract. Each one is right for a specific set of circumstances. The useful structure is a scenario map: given what a buyer is actually trying to accomplish, which deployment reality fits?

The three axes that make these options physically distinct:

Deployment	Who holds the file	Who controls the hardware	Who sees the prompts
API	Provider	Provider	Provider (logically and physically)
Hyperscaler	You (in your cloud tenant)	Cloud vendor (physically); you (logically)	You (logically); cloud vendor (physically capable)
On-premises	You	You	You
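One way to keep the table straight in your head is to treat each cell as a set of parties. The sketch below encodes the three rows that way and asks a single question: which parties other than you show up anywhere in the arrangement? The names (`DeploymentReality`, `third_parties`, the party labels) are illustrative choices, and collapsing "logically" versus "physically" into plain set membership is a deliberate simplification.

```python
from dataclasses import dataclass

YOU = "you"


@dataclass(frozen=True)
class DeploymentReality:
    """One row of the table; each axis lists the parties involved."""
    holds_file: frozenset
    controls_hardware: frozenset
    sees_prompts: frozenset


REALITIES = {
    "api": DeploymentReality(
        holds_file=frozenset({"provider"}),
        controls_hardware=frozenset({"provider"}),
        sees_prompts=frozenset({"provider"}),
    ),
    "hyperscaler": DeploymentReality(
        holds_file=frozenset({YOU}),
        # You control it logically; the cloud vendor controls it physically.
        controls_hardware=frozenset({YOU, "cloud_vendor"}),
        # You see prompts logically; the vendor is physically capable.
        sees_prompts=frozenset({YOU, "cloud_vendor"}),
    ),
    "on_premises": DeploymentReality(
        holds_file=frozenset({YOU}),
        controls_hardware=frozenset({YOU}),
        sees_prompts=frozenset({YOU}),
    ),
}


def third_parties(kind: str) -> frozenset:
    """Every party other than you that appears on any axis."""
    r = REALITIES[kind]
    return (r.holds_file | r.controls_hardware | r.sees_prompts) - {YOU}


for kind in REALITIES:
    print(kind, sorted(third_parties(kind)))
```

Only the on-premises row comes back empty, which is the table's point in one line.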

The hyperscaler row is where buyers get imprecise, and where the conversation gets interesting. "We control the hardware" is true in the sense that matters for most workloads — you control the environment, the network, the access policies, who can reach the endpoint. It's not true in the sense that matters for some workloads: the physical machines belong to the cloud vendor, and the cloud vendor has physical access to them. This is the same question that drives some agencies to on-prem AD and others to cloud-hosted identity infrastructure — not a theoretical concern, but a real one with real consequences. The answer depends on the agency's threat model and requirements, not on a general principle.

API fits when: The agency needs to move fast, doesn't want to manage GPU infrastructure, and the data sensitivity is low enough that sending prompts to a third party is acceptable for their use case. Also fits when the agency has negotiated a specific arrangement (like Azure OpenAI Service) that meets their requirements. Whether it meets those requirements is a compliance question, not a deployment question.

Hyperscaler fits when: The agency needs more control than API provides (specifically, they need prompts to stay within their cloud environment) but can't justify or doesn't need the capital expenditure of on-prem hardware. Common path for civilian agencies with moderate data sensitivity requirements. Fine-tuning the weights yourself also lands here, since it requires holding the weights. Some providers offer hosted fine-tuning through their APIs, but in that arrangement the tuned weights stay on the provider's infrastructure and the customer never holds the file.

On-premises fits when: The agency operates in a classified or air-gapped environment, or has a policy requirement that data never leave a physical facility. Some defense and intelligence agencies have been running GPU clusters for other workloads for years; on-prem model hosting is the natural extension. The capital and operational costs are significant; the control is total.
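The three "fits when" descriptions can be compressed into a first-pass triage. This is a deliberately crude sketch, assuming just three yes/no facts about the buyer; the function and parameter names are hypothetical, and it says nothing about whether a given arrangement satisfies compliance requirements.

```python
def suggest_deployment(*, must_stay_in_facility: bool,
                       prompts_must_stay_in_tenant: bool,
                       needs_to_hold_weights: bool) -> str:
    """First-pass guess at which deployment reality a buyer is describing."""
    # Classified, air-gapped, or "never leaves the building" requirements.
    if must_stay_in_facility:
        return "on_premises"
    # Prompts must stay in the cloud tenant, or the team needs the file
    # itself (for example, to fine-tune the weights directly).
    if prompts_must_stay_in_tenant or needs_to_hold_weights:
        return "hyperscaler"
    # Otherwise, sending prompts to a provider is on the table.
    return "api"


print(suggest_deployment(must_stay_in_facility=False,
                         prompts_must_stay_in_tenant=True,
                         needs_to_hold_weights=False))
```

The order of the checks matters: a physical-facility requirement overrides everything else, which mirrors how these conversations tend to go.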

GPT-4 specifically is only available via API. OpenAI has not released the GPT-4 weights publicly. When a buyer says "we're deploying GPT-4," they are describing an API deployment — almost certainly Azure OpenAI Service for government accounts. When they say "we're deploying a GPT-4-class model," they might mean something different: a model of comparable capability (Llama 3 70B, for instance) that they're running on their own infrastructure. The distinction matters for every conversation about access control, credential management, and data handling.

How to Say This in the Field

If the buyer says...	Ask or say...	Why it matters
"We're deploying GPT-4."	"Are you calling the Azure OpenAI API, or are you running weights on your own infrastructure?"	GPT-4 weights aren't publicly available; this is almost always an API deployment, but confirm before assuming.
"We're running it in our cloud."	"Who provisioned the GPU instances — your team, or a managed service from the cloud vendor?"	Managed inference services are API deployments; self-provisioned GPU instances are hyperscaler deployments. Different data paths.
"We need it air-gapped."	"So you need the weights on hardware inside your perimeter, with no external API calls at any point in the inference path?"	Confirms on-prem requirement; also surfaces whether they've thought through the full data path.
"We're using Azure OpenAI."	"Got it — so the model weights are on Microsoft's infrastructure, and your prompts are going to their API endpoint."	Confirms API deployment; sets up the follow-on question about what's in those prompts.
"We're self-hosting Llama."	"Where are you running it — on cloud infrastructure you've provisioned, or on hardware you own?"	Llama weights are publicly available; self-hosting could mean hyperscaler or on-prem.
"We need data sovereignty."	"Does that mean prompts can't leave your cloud tenant, or they can't leave your physical facility?"	These are different requirements with different deployment implications; don't assume on-prem.
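"We're fine-tuning the model."	"Are you holding the weights yourselves, or using a provider's hosted fine-tuning service? And where is that infrastructure?"	Fine-tuning the weights directly means hyperscaler or on-prem. Provider-hosted fine-tuning through an API leaves the tuned weights on the provider's infrastructure.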
"We're using a managed inference service."	"So the weights are on the provider's infrastructure — you're calling their endpoint?"	Managed inference is API deployment; confirms who holds the file.
"We downloaded the model from Hugging Face."	"And where are you running it?"	Downloading weights is the first step of hyperscaler or on-prem deployment; the question is what hardware they're loading it onto.
"We're deploying multiple models for different use cases."	"Are any of those running on infrastructure you control, or are they all API calls?"	Mixed deployments are common; the identity and access questions are different for each type.
"We need the model to stay in our environment."	"When you say 'your environment' — do you mean your cloud tenant, or your physical facility?"	'Environment' is ambiguous; resolve it before the conversation goes further.
"We're evaluating on-prem versus cloud."	"What's driving the evaluation — is it about who sees the prompts, who controls the hardware, or something else?"	Surfaces the actual requirement; helps you understand which axis matters most to this buyer.

The three deployment realities are not a spectrum from "less secure" to "more secure." They're three different physical arrangements with different properties on different axes. A buyer who says "we're deploying a model" has told you almost nothing about which arrangement they're describing. The follow-up question — which of these three things do you mean? — is where the useful conversation starts.

What those physical facts mean for compliance, data residency, and jurisdictional requirements is a different conversation, and one for 4.4 and 4.5. What they mean for identity and access architecture is the conversation you're there to have.