When a buyer's IT team says they're "running the model on-prem," "using Bedrock," or "self-hosting on AWS," they're describing three different answers to the same question: who owns the machine the weights are loaded on? All three scenarios are the same fundamental act — a file containing billions of numbers gets copied onto a machine with enough GPU memory to hold it, and inference runs from there. The differences are about control surface, provisioning burden, and what the operator can and cannot touch. Get that framing in your head before the meeting and you'll follow the conversation. Get the vocabulary right and you'll be asking the questions that matter, not nodding at the ones you don't understand.
Local Hardware Deployment
What it is: The operator loads model weights onto machines they physically own and operate.
What it does: The IT team sources servers with sufficient GPU memory, downloads or transfers the weights file, installs an inference runtime, and serves requests from that hardware. Nothing leaves the building unless the operator deliberately sends it somewhere. The model runs as long as the hardware runs, which sounds obvious until you're in a conversation about uptime and realize there's no vendor SLA to point at.
Who's behind it: The operator's own infrastructure team. The weights file typically comes from a model provider — a download from Hugging Face, a direct vendor transfer, or an internal training run — but once the file is on-prem, the provider is out of the operational picture. The operator is now the provider.
What makes it distinct: Full control, full burden. The operator controls the hardware, the weights, the runtime, the network access, and who can query the model. They also own every failure mode. The provisioning requirement that surprises buyers: GPU memory is the binding constraint, not storage or CPU. A capable 70-billion-parameter model at standard precision requires roughly 140GB of VRAM — that's two A100 80GB cards running in parallel at minimum, plus the interconnect to make them function as a unit. Smaller quantized models fit on less, but "we'll just run it on our existing servers" is almost never true for anything worth running.
Concept Mapping: On-Prem Identity Infrastructure
Local hardware deployment maps reasonably well to running your own RADIUS server or on-prem Active Directory: you provision the hardware, you control the configuration, you're responsible for availability. The analogy holds for the provisioning conversation — the IT team owns the full stack. The place it stops working: with AD or RADIUS, you can audit the authentication logic because you wrote it. With a model, the "logic" is encoded in billions of floating-point numbers the operator didn't write and can't read line by line. The control is real. The transparency is a different matter.
Managed Cloud API
What it is: The operator calls a vendor-managed endpoint that runs inference on hardware the vendor owns and operates.
What it does: The buyer's IT team provisions API credentials and network access. The model weights live on the vendor's infrastructure; the vendor handles loading, scaling, and serving. The operator sends a prompt, receives a response. AWS Bedrock, Azure AI Foundry, and Google Vertex AI are the major examples — each wraps multiple model providers' weights behind a unified API surface, so a buyer using Bedrock might be accessing Anthropic's Claude, Meta's Llama, or Mistral's models through the same endpoint pattern.
Who's behind it: The cloud vendor manages the infrastructure; the model provider supplies the weights. In most cases the buyer has a commercial relationship with the cloud vendor, not directly with the model provider. The model provider's involvement ends when they hand the weights to the platform.
What makes it distinct: The operator controls almost nothing about the model itself. Not the weights, not the runtime, not the hardware, not the version (unless the API exposes version pinning). What they control is the API key, the contents of the request, and what they do with the response. Provisioning burden on the model side is minimal: stand up credentials, configure network egress, integrate the API into whatever application is calling it. The tradeoff is opacity: the operator has no visibility into what happens between sending the prompt and receiving the response, and no ability to change the model's behavior at the infrastructure level.
Concept Mapping: Third-Party SaaS Application
A managed cloud API maps well to how your buyers think about SaaS: you provision access, not infrastructure. The IT team configures credentials and network policy; the vendor runs everything else. That framing holds for the provisioning conversation — it's genuinely that simple on the operator side. It starts to break when the buyer asks about data handling. With a SaaS app, data handling and retention policies are usually documented in the service agreement and auditable. With a managed model API, what happens to the prompts and responses the operator sends is governed by the vendor's policies, which vary significantly across providers and aren't always surfaced prominently. That's a data flow question the IT team will ask, and you should know it's coming.
Self-Hosted Weights on Cloud Infrastructure
What it is: The operator controls the model weights and inference runtime, but runs everything on cloud hardware they rent rather than own.
What it does: The IT team provisions GPU instances from a cloud provider — AWS p4d or p5 instances, Azure NC-series, GCP A3 instances, or a specialized GPU cloud like CoreWeave or Lambda Labs — downloads the weights onto that infrastructure, installs an inference runtime (vLLM and Hugging Face's Text Generation Inference are the common choices for production workloads), and serves requests. The operator manages the software stack. The cloud provider manages the physical hardware, the hypervisor, and the data center.
Who's behind it: Split responsibility. The operator manages the model layer: weights, runtime, serving configuration, access control. The cloud provider manages the hardware layer. The weights come from wherever the operator sourced them; once transferred to the cloud instance, the model provider is out of the operational picture, same as in the local hardware scenario.
What makes it distinct: This is the middle case, and it's worth being precise about what "middle" means. The operator controls the weights and runtime — they can update the model, swap in a different version, change the serving configuration, restrict who can query it. What they don't control is the underlying hardware or the hypervisor layer sitting between their workload and the physical machine. For most commercial buyers this is an acceptable tradeoff: lower provisioning burden than local hardware (no physical procurement, no data center footprint), more control than a managed API. For public sector buyers evaluating isolation guarantees, the hypervisor question surfaces quickly and doesn't have a clean answer. The isolation guarantee is the cloud provider's, not the operator's.
Concept Mapping: Hybrid Cloud Application Hosting
Self-hosted weights on cloud infrastructure maps to hybrid cloud application hosting — you control the application layer, someone else controls the physical infrastructure. Your buyers already reason about this for other workloads. The provisioning conversation tracks: the IT team manages the software stack, the cloud provider manages the hardware. The gap appears when buyers ask about isolation. In a typical hybrid cloud setup, the infrastructure provider can't observe your application's runtime behavior without explicit access. With a model workload running on rented GPU infrastructure, the isolation boundary is the hypervisor — and whether that's sufficient depends on the buyer's threat model and compliance posture, which is worth surfacing early.
How They Compare
Trait-led analysis: each row examines one dimension of the deployment decision across all three scenarios. No scenario is "better" in the abstract — the right answer depends on what the buyer is optimizing for, and the table makes that explicit.
| Dimension | Local Hardware | Managed Cloud API | Self-Hosted Cloud |
|---|---|---|---|
| Who owns the machine | The operator | The cloud vendor | The cloud vendor |
| Who controls the weights | The operator | The model provider / platform | The operator |
| Who controls the runtime | The operator | The platform | The operator |
| IT provisioning burden | High — hardware procurement, GPU infrastructure, inference runtime, ongoing maintenance | Low — API credentials, network egress policy, application integration | Medium — GPU instance provisioning, weights transfer, runtime installation and maintenance; no physical hardware |
| Operator visibility into inference | Full — the operator runs the stack | None — the operator sees inputs and outputs | Full — the operator runs the stack |
| What the operator cannot change | Nothing, in principle; everything requires the operator's own labor | The model, the runtime, the hardware, the version (unless version pinning is available) | The underlying hardware and hypervisor |
| Where the model version is controlled | Operator decides when to update | Platform decides (with varying notice) | Operator decides when to update |
Two patterns in that table are worth naming, because they generate confusion in buyer conversations.
The managed cloud API and self-hosted cloud scenarios share one thing: the cloud vendor owns the machines. The operator controls the software stack in one case; the vendor controls it in the other. Buyers sometimes conflate "self-hosted" with "we control everything," which isn't accurate when the hardware is rented. The hardware isolation is the provider's responsibility.
Local hardware and self-hosted cloud give the operator the same software control: weights, runtime, serving configuration. Physical ownership and the provisioning path are where they diverge. For buyers who need that control surface but can't move fast enough to procure physical servers, self-hosted cloud is often the practical path. Whether it satisfies the compliance requirement that motivated the local hardware preference is a separate question.
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "Running the AI on-prem" | "Running the model weights on hardware you own" | Shows you understand what's physically happening, not just where the box sits |
| "We're using Bedrock" | "We're accessing [model name] through Bedrock's managed endpoint" | Bedrock is the delivery mechanism; conflating it with the model confuses the technical conversation |
| "The model is in the cloud" | "The weights are loaded on cloud infrastructure" | Gives the IT team a specific, accurate description they can engage with |
| "Self-hosted means you control everything" | "Self-hosted means you control the weights and runtime — not the underlying hardware" | Critical distinction for public sector buyers evaluating isolation guarantees |
| "You need a lot of compute" | "You need enough VRAM to hold the weights file in memory" | Gives the IT team a specific constraint to evaluate rather than a vague requirement |
| "The vendor hosts the model" | "The vendor manages the inference infrastructure — you're calling their endpoint" | Clarifies what the vendor actually does and doesn't do in a managed API scenario |
| "It's like a SaaS app" | "It's like a SaaS app in terms of what you provision — API key, network access — but the data handling works differently" | The analogy holds for provisioning; it breaks for data flow, and buyers will ask about data flow |
| "Can we move it to the cloud later?" | "Can you move the weights to cloud infrastructure later, or are there constraints on where they can run?" | Surfaces the actual technical question — weights portability — rather than a vague migration question |
| "The model runs on GPUs" | "The model is loaded into GPU memory — VRAM is the binding constraint, not CPU or storage" | Adds the specific detail that matters when the IT team is evaluating hardware requirements |
| "We're self-hosting" | "We're running the weights ourselves on [cloud provider] GPU instances" | Distinguishes self-hosted-on-cloud from on-prem, which matters for the IT team's provisioning plan and the buyer's compliance posture |

