Where the Model Lives: On-Device and Cloud-Hosted AI, Compared

By Leigh Garrity, May 8, 2026

When a buyer says "we want AI but our data can't leave the building," they've just handed you a vocabulary problem wearing the costume of a technical requirement. They might mean they need an air-gapped deployment. They might mean they want on-device inference. They might mean they're worried about a cloud API logging their prompts. These are different problems with different architectures and different identity implications — and if you don't sort them out early, you'll spend the back half of the conversation solving the wrong one.

Two deployment patterns dominate this space. On-device and edge AI runs the model locally, on hardware the organization controls. Cloud-hosted model consumption sends data to a provider's infrastructure and receives a response. Both are real, both are in production at federal accounts right now, and buyers use the vocabulary for each interchangeably in ways that will cost you credibility if you don't have clean language ready.

On-Device and Edge AI

What it is: The model runs on hardware inside the organization's control — the device itself, a local server, or nearby infrastructure — without routing inference requests to an external provider.

What it does: When a user or application triggers an inference request, the computation happens locally. The input data never leaves the perimeter for model processing. The response comes back from hardware the organization owns or operates. This is true whether the model is running on a laptop chip, a rack-mounted GPU at the edge of a classified network, or a specialized inference appliance in a forward operating location.

Where it comes from: The hardware side of this space is moving fast. Apple's Neural Engine in M-series chips is the most visible consumer implementation. Qualcomm's Snapdragon X Elite and Intel's Meteor Lake both include dedicated AI accelerators designed for local inference workloads. On the model side, the families built for local deployment — Meta's Llama 3, Microsoft's Phi-3 and Phi-4, Mistral's 7B and its successors — are designed to run well on constrained hardware. Hugging Face is the primary distribution layer for most of these models in enterprise and research contexts.

What makes it distinct: The model is a resident. You own the compute, you own the inference, and you own the latency. There's no API call, no provider SLA, no per-token bill, and no data leaving the building. What you give up is frontier-scale capability and the ability to let someone else handle model updates, scaling, and infrastructure. That's a real tradeoff, worth naming explicitly.

The mechanism that made this viable: Two years ago, running a capable language model locally meant either accepting a very small model with limited capability or buying hardware that cost more than most federal IT budgets could absorb for a single workload. Quantization changed that calculus. The technique compresses model weights (the numerical values that encode what a model has learned) from high-precision floating-point representations to lower-precision formats. A 70B-parameter model whose weights need roughly 140GB of memory at 16-bit precision needs about 70GB at 8-bit quantization and about 35GB at 4-bit; a 7B or 8B model at 4-bit fits in under 5GB. The compression is lossy. Some precision is sacrificed. But for most enterprise tasks (summarization, classification, document extraction, structured output generation) the quality loss is small relative to the hardware savings. The models got small enough to fit on hardware that already exists in the environment, and they stayed capable enough to be useful. The 3B–20B parameter range is now the practical sweet spot for local deployment: capable enough for real work, small enough to run on a modern laptop or a modest edge server.
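The memory arithmetic behind quantization is simple enough to sketch. The snippet below is a rough estimate, not a sizing tool: it counts weight storage only and assumes every weight is stored at the same bit width. Real deployments also need memory for the KV cache, activations, and quantization metadata, so treat the results as a lower bound. The function name is illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: uniform bit width for all weights; no KV cache, activations,
    or quantization scale/zero-point overhead is included.
    """
    bytes_per_weight = bits_per_weight / 8
    # (params_billions * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billions * bytes_per_weight


if __name__ == "__main__":
    for params, bits in [(70, 16), (70, 8), (70, 4), (7, 4)]:
        print(f"{params}B parameters at {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

Halving the bit width halves the weight footprint, which is why a 4-bit model in the single-digit-billions range fits on hardware that is already in the building.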

Callout: The Zero Trust Analog. Zero trust says location doesn't establish trust: a device inside the network perimeter isn't automatically trustworthy. The same logic applies to on-device AI: a model running locally isn't automatically safe. The perimeter question shifts from "where is the model" to "what can the model reach." Where the analogy holds: access should be scoped, not assumed. Where it breaks: zero trust has mature tooling for human and service principals. It doesn't yet have a clean answer for model principals that read files, call APIs, and generate outputs autonomously.

Cloud-Hosted Model Consumption

What it is: The model lives on a provider's infrastructure. The application sends a request — a prompt, a document, a structured input — and receives a response. The organization never touches the model weights or the hardware running them.

What it does: From the application's perspective, this is an API call. The provider handles compute, scaling, model versioning, and infrastructure. The organization pays per token consumed, or under a contract that bundles usage. The model itself may be updated, fine-tuned, or replaced by the provider without the organization's direct involvement. What the organization controls is the prompt, the context it sends, and what it does with the response.

Where it comes from: OpenAI's API and Azure OpenAI Service are the dominant enterprise entry points. Anthropic's Claude API, Google's Gemini API, and Amazon Bedrock (which aggregates multiple model providers behind a single interface) are the other major options in federal-adjacent commercial contexts. For FedRAMP-authorized deployments, the options narrow considerably. Azure OpenAI Service has the most mature federal authorization posture as of this writing, though the landscape is moving.

What makes it distinct: You're renting inference. The model is never yours to configure, constrain, or audit directly. What you get in return is access to frontier-scale capability — models with hundreds of billions of parameters, trained on compute budgets that no single agency could replicate — without the hardware bill. The tradeoff is that your data crosses a boundary you don't control, under terms your legal team may or may not have reviewed carefully.

Callout: The Federation Analog. Sending data to a cloud model API is structurally similar to a federated identity assertion crossing a trust boundary: you're extending a claim into a domain you don't control. Where the analogy holds: the data governance questions are the same (what leaves, under what terms, who logs it). Where it breaks: in federation, the trust boundary is explicit, negotiated, and documented in a metadata agreement. In most cloud AI API agreements, the data handling terms are buried in a ToS that procurement reviewed for price, not for inference logging policy.

Comparison Strategy: Scenario Mapping

I'm using scenario mapping here rather than trait-by-trait comparison. The reason: for public sector accounts, the deployment decision is almost always driven by a specific operational constraint, not by a general capability preference. Trait-by-trait analysis produces a chart. Scenario mapping produces a conversation.

Classified networks and air-gapped environments. Edge is the only option. There is no connectivity to a cloud provider, and there shouldn't be. For buyers who say "air-gapped," the architecture question is which on-device or edge deployment fits the classification level and the available hardware — cloud doesn't enter the picture.

Disconnected operations. Tactical edge, field deployments, ships underway, forward operating locations with intermittent connectivity. Edge is the correct default. Cloud-hosted might work when connectivity is available, but the operational requirement is that the capability functions without it. The model needs to be local, and the deployment needs to tolerate the hardware constraints of the environment.

Latency-sensitive workflows. Real-time decision support, embedded systems, any workflow where a 200–800ms round-trip to a cloud API is too slow. A model running on local hardware removes the network round-trip, so response time is bounded by local compute rather than by the connection. This matters more than it might seem for agentic workflows where the model is making multiple sequential calls: latency compounds.
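To make the compounding concrete, here is a back-of-the-envelope sketch. The numbers are hypothetical (ten sequential calls, a 500ms round-trip, 300ms of compute per call), and it assumes local and cloud compute times are equal, which will not hold on slower local hardware. The function name is illustrative.

```python
def total_latency_ms(sequential_calls: int, round_trip_ms: float, compute_ms: float) -> float:
    """Total wall-clock time when each call must finish before the next starts."""
    return sequential_calls * (round_trip_ms + compute_ms)


if __name__ == "__main__":
    calls = 10          # hypothetical agentic workflow: ten sequential model calls
    compute_ms = 300    # hypothetical per-call compute time, assumed equal in both cases

    cloud = total_latency_ms(calls, round_trip_ms=500, compute_ms=compute_ms)
    local = total_latency_ms(calls, round_trip_ms=0, compute_ms=compute_ms)
    print(f"cloud: {cloud / 1000:.1f}s  local: {local / 1000:.1f}s")
```

Under these assumptions the network adds five seconds to a workflow that would otherwise take three. The per-call overhead looks tolerable in isolation; it is the multiplication by call count that hurts.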

Privacy-sensitive data at rest or in use. CUI, PII, PHI, law enforcement data, anything where the data governance policy prohibits external transmission. Edge avoids the data-leaves-the-building problem entirely. Cloud-hosted requires a data processing agreement, a FedRAMP authorization, and confidence that the provider's logging and retention policies align with the agency's data governance requirements. That's achievable, but it's not free.

Per-token cost at scale. This one surprises buyers who haven't run the numbers. Cloud-hosted inference is cheap per call and expensive at volume. An agency running millions of inference requests per month against a frontier model API will spend more on tokens than on the hardware that could run a capable local model. The crossover point depends on the model, the use case, and the hardware — but it exists, and it's lower than most buyers expect. Edge starts looking like a cost play, not just a security play, once the volume is real.
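The crossover can be estimated with a few lines of arithmetic. Every figure below is a placeholder chosen for illustration (the token price, hardware cost, amortization period, and operating cost are not quotes from any provider or vendor), and the model ignores staffing, power, and utilization limits. Substitute real numbers before drawing conclusions.

```python
import math


def monthly_cloud_cost(requests: int, tokens_per_request: int, usd_per_million_tokens: float) -> float:
    """Cloud spend for one month at a flat per-token price."""
    return requests * tokens_per_request * usd_per_million_tokens / 1_000_000


def monthly_local_cost(hardware_usd: float, amortization_months: int, operating_usd: float) -> float:
    """Local spend for one month: amortized hardware plus fixed operating cost."""
    return hardware_usd / amortization_months + operating_usd


def break_even_requests(tokens_per_request: int, usd_per_million_tokens: float,
                        hardware_usd: float, amortization_months: int,
                        operating_usd: float) -> int:
    """Monthly request volume above which the local option is cheaper."""
    cost_per_request = tokens_per_request * usd_per_million_tokens / 1_000_000
    local = monthly_local_cost(hardware_usd, amortization_months, operating_usd)
    return math.ceil(local / cost_per_request)


if __name__ == "__main__":
    # All values are hypothetical placeholders.
    volume = break_even_requests(
        tokens_per_request=2_000,
        usd_per_million_tokens=10.0,
        hardware_usd=60_000,
        amortization_months=36,
        operating_usd=1_000,
    )
    print(f"Break-even at roughly {volume:,} requests per month")
```

With these placeholder inputs the break-even lands in the low hundreds of thousands of requests per month, which is the shape of the argument: the fixed local cost is divided by a very small per-request cloud cost.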

Frontier capability requirements. Complex multi-step reasoning, large context windows (100K+ tokens), tasks that require the breadth of a model trained on a massive corpus. Cloud is the right default here. The 7B–20B models that run locally are capable, but they're not GPT-4-class on hard reasoning tasks. If the use case requires frontier capability, the data governance question becomes the constraint to solve rather than the reason to avoid cloud entirely.

Rapid iteration and model updates. If the use case requires staying current with model improvements — and many do, because the capability curve is still steep — cloud-hosted handles this without operational overhead. The provider updates the model; the application keeps calling the same API. On-device deployments require a model update process, which in a federal environment means testing, validation, and change management.

Spiky or unpredictable workloads. Cloud wins on elasticity. Local hardware is sized for a workload; cloud infrastructure scales to meet demand. For workloads that are bursty or hard to predict, the cloud's elasticity is a genuine operational advantage.
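The scenarios above can be read as an ordered set of checks, with hard connectivity constraints first. The sketch below is one possible encoding, offered as a conversation aid rather than a decision engine; the field names and return strings are invented for illustration, and it leaves out cost and latency, which need real numbers.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    # Hypothetical flags; each maps to one scenario described in the text.
    air_gapped: bool = False
    must_run_disconnected: bool = False
    external_transmission_prohibited: bool = False
    needs_frontier_capability: bool = False
    bursty_workload: bool = False


def recommend(s: Scenario) -> str:
    # No connectivity means cloud is not an option at all.
    if s.air_gapped or s.must_run_disconnected:
        return "edge"
    if s.external_transmission_prohibited:
        # Frontier needs turn data governance into the constraint to solve.
        if s.needs_frontier_capability:
            return "cloud, once data governance is resolved"
        return "edge"
    if s.needs_frontier_capability or s.bursty_workload:
        return "cloud"
    return "either; compare cost and latency"


if __name__ == "__main__":
    print(recommend(Scenario(air_gapped=True)))
    print(recommend(Scenario(needs_frontier_capability=True)))
```

The ordering is the design choice worth noticing: constraints that remove an option entirely are checked before preferences that merely favor one.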

The identity surface distinction — and why it's the closer, not the opener.

A model's location doesn't tell you whether the data flow is safe. It tells you where the compute happens. Those are different questions.

On-device, the model is inside the perimeter. But the model is also a new kind of principal — it reads inputs, processes them, and generates outputs. In an agentic deployment, it might be reading files, calling APIs, and writing to systems, all without a human in the loop for each action. The data the model touches might be pulled from systems with their own access controls. The outputs the model generates might go somewhere outside the perimeter. "The model is local" doesn't answer the question "what does the model have access to, and where do its outputs go." That's an identity and access question, and it's unsolved in most current deployments.
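One way to picture "what does the model have access to" is to treat the model like any other principal with an explicit grant list and an audit trail. The sketch below is a minimal deny-by-default check. The class, function, and resource names are hypothetical, and a real deployment would use the organization's existing policy engine rather than an in-memory set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrincipal:
    name: str
    grants: frozenset  # explicit (action, resource) pairs; anything else is denied


def authorize(principal: ModelPrincipal, action: str, resource: str, audit_log: list) -> bool:
    """Deny by default, and record every decision, allowed or not."""
    allowed = (action, resource) in principal.grants
    audit_log.append({
        "principal": principal.name,
        "action": action,
        "resource": resource,
        "allowed": allowed,
    })
    return allowed


if __name__ == "__main__":
    audit_log = []
    summarizer = ModelPrincipal(
        name="local-summarizer",  # hypothetical model identity
        grants=frozenset({("read", "case-notes"), ("write", "summaries")}),
    )
    print(authorize(summarizer, "read", "case-notes", audit_log))      # granted explicitly
    print(authorize(summarizer, "read", "personnel-files", audit_log)) # never granted
    print(len(audit_log), "decisions recorded")
```

Nothing in the check depends on where the model runs. The grant list and the audit record are the controls; location is not.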

Cloud-hosted is simpler to reason about from a data governance perspective, even if it feels more exposed. Data leaves. The question is what data, under what terms, with what logging, under what authorization. That's a familiar problem — the same question you ask about any SaaS integration. The tools for answering it (DPAs, FedRAMP authorizations, data classification policies) exist. They're not always applied correctly to AI API calls, but the framework is there.

Callout: The Apple Intelligence Preview. Apple Intelligence is the first consumer product to normalize on-device model execution with cloud fallback, and to make the routing decision semi-transparent to the user. When the on-device model can't handle a request, it routes to Apple's Private Cloud Compute. Enterprise governance is going to have to catch up to this pattern. When a model routes a request to the cloud because the local model can't handle it, does that routing decision get logged? Does the data governance policy cover the fallback path? Most current enterprise AI deployments don't have a clean answer.
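The two governance questions in this callout (is the routing decision logged, and does policy cover the fallback path) can be sketched as a small router. This is an illustration of the pattern only and does not describe how Apple's routing works. The function name, the data labels, and the capability check are all hypothetical.

```python
import logging

logger = logging.getLogger("inference_router")


def route_request(prompt: str, data_label: str, local_can_handle, cloud_allowed_labels: set) -> str:
    """Pick a destination and log the decision, including the fallback path.

    local_can_handle: hypothetical callable that returns True if the local
    model is adequate for this prompt.
    cloud_allowed_labels: data labels that policy permits to leave the device.
    """
    if local_can_handle(prompt):
        decision = "local"
    elif data_label in cloud_allowed_labels:
        decision = "cloud_fallback"
    else:
        # Policy does not cover sending this data out, so fail closed.
        decision = "refused"
    # Log metadata about the decision, not the prompt contents.
    logger.info("route=%s label=%s prompt_chars=%d", decision, data_label, len(prompt))
    return decision


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    short_only = lambda p: len(p) < 50  # stand-in capability check
    long_prompt = "x" * 200
    print(route_request("summarize this note", "public", short_only, {"public"}))
    print(route_request(long_prompt, "public", short_only, {"public"}))
    print(route_request(long_prompt, "cui", short_only, {"public"}))
```

The third case is the one most deployments have not decided: when the local model falls short and policy forbids the fallback, the request has to fail visibly rather than route quietly.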

How to Say This in the Field

Don't Say	Do Say	Why It Matters
"On-device and edge are basically the same thing."	"On-device means the model runs on the endpoint itself — a laptop, a phone. Edge means the model runs on nearby infrastructure, like a server at the facility. Both keep data local, but the hardware and the management model are different."	Buyers use these interchangeably. Distinguishing them early signals that you know the architecture, not just the marketing.
"Air-gapped means edge AI."	"Air-gapped describes the network — no external connectivity. Edge AI describes where the model runs. Air-gapped environments require edge AI, but edge AI doesn't require air-gapping."	Conflating these will get you corrected by a technical buyer. Worse, it'll lead you to recommend the wrong architecture.
"Quantization is how they compress the model."	"Quantization is the technique that made capable models small enough to run on standard hardware. It's why a model that would have needed a data center two years ago can now run on a laptop."	Buyers who've heard the term need a frame, not a definition. Give them the implication, not the mechanism.
"If you use cloud AI, your data leaves the building."	"With cloud-hosted inference, the data you send in the request goes to the provider's infrastructure. What matters is what data you're sending, under what agreement, and what the provider's logging policy is."	"Data leaves the building" triggers a reflex. The useful conversation is about what data, under what terms — that's where you can actually help.
"On-device AI keeps your data safe."	"On-device inference means the model runs locally, so the input data doesn't go to a provider. That's a meaningful control. The separate question is what data the model has access to and where its outputs go — those need their own controls."	"Safe" is doing too much work. The model's location is one control, not a complete answer. Buyers who think local = safe will have a bad time when the model starts touching sensitive data autonomously.
"Apple Intelligence is just a consumer thing."	"Apple Intelligence is the first widely deployed example of on-device model execution with cloud fallback — and it's already in your buyers' hands. The governance pattern it's normalizing is going to show up in enterprise requirements."	Federal buyers who use iPhones have already experienced this pattern. Dismissing it misses the preview it's giving you of where enterprise AI governance is heading.
"Small models can't do what frontier models can."	"Small models — in the 7B to 20B parameter range — handle most enterprise tasks well: summarization, classification, document extraction, structured output. For complex multi-step reasoning or very large context windows, frontier models have the edge."	"Small" sounds like a limitation. The useful frame is task fit. Most enterprise workflows don't need frontier capability, and saying so builds credibility.
"Per-token cost is just an API pricing thing."	"Per-token cost is a variable that changes the build-vs-buy calculus for AI infrastructure. At low volume, cloud APIs are cheaper. At high volume, the math often favors running a local model on hardware you amortize over time."	Buyers who haven't modeled this are surprised when they run the numbers. Being the person who surfaces the crossover point is useful.
"The model is inside the perimeter, so it's covered by our zero trust policy."	"The model's location is one input to the access control question. The more important question is what the model can reach and what it can do — those need to be scoped the same way you'd scope any privileged service account."	Zero trust practitioners will push back on location-based trust claims immediately. Getting ahead of this shows you understand the architecture.
"We'll figure out the governance later."	"The governance question for AI is the same question you've already solved for privileged access: who authorized this, what can it touch, and what's the audit trail. The difference is the principal is a model, not a person."	Framing AI governance as a familiar problem (privileged access management) makes it tractable. Leaving it as "we'll figure it out" is how accounts end up with ungoverned model access at scale.