Opener
Every AI model has to run somewhere. Cloud API inference sends the prompt to a provider's GPUs over the network. On-device (or edge) inference runs the model on hardware the buyer controls. You'll hit this choice the moment a public sector buyer says "that data can't leave our network" or "this has to work disconnected." The language that earns you credibility here is constraint language: data classification, connectivity requirements, cost at volume, latency tolerance. Name the constraint driving the decision, and you're in the conversation. Guess at which pattern is "better" without naming one, and you're out of it.
Cloud API Inference
What it is: Sending prompts to a model hosted on a provider's infrastructure and receiving results over the network.
What it does: Your buyer's application makes an API call. The provider's GPUs do the work. Tokens come back. The buyer pays per token and never touches the model weights, the hardware, or the inference stack. For public sector accounts, the critical distinction is that models are available both through commercial API endpoints and through FedRAMP-authorized hosting like AWS GovCloud, Azure Government, and Google Vertex AI Gov. That hosting distinction matters for your accounts more than the model choice does.
Who's behind it / where it comes from: The three dominant providers are OpenAI (GPT-5 family), Anthropic (Claude), and Google (Gemini). Their models are also available through the FedRAMP-authorized intermediaries above. Claude, for instance, is accessible on Amazon Bedrock, Google Vertex AI, and Microsoft Foundry. The model is the same; the compliance posture of the hosting environment is not.
What makes it distinct: Frontier-class capability with zero infrastructure to manage. The tradeoff is dependency: on the network, on the provider's availability, on their data handling terms, and on their pricing. Every prompt leaves your buyer's environment.
On-Device / Edge Inference
What it is: Running a model directly on hardware the buyer controls, from a laptop to a ruggedized edge server to an air-gapped appliance.
What it does: Model weights live on local hardware. Prompts never leave the device. No network call, no API, no token metering. A quantized 8-billion-parameter model fits in about 4 GB of memory and runs on a current MacBook. A ruggedized edge server can run larger models at a forward operating base with zero connectivity. At the Air Force's Mobility Guardian 2025 exercise, Google deployed an air-gapped appliance running transcription and summarization at the Secret level, with no network connection at all.
Who's behind it / where it comes from: The ecosystem is layered. llama.cpp is the foundational open-source inference engine, handling quantized model execution across hardware platforms. Apple's MLX framework is optimized for Apple Silicon's unified memory architecture. Model weights come from Meta (Llama), Mistral, Qwen, and other providers who release open-weight models in formats like GGUF that these engines consume.
What makes it distinct: The data never leaves. No network dependency, no data residency question, no provider trust decision. The tradeoff is a capability ceiling: local hardware constrains you to smaller models (practically 3B–20B parameters), and you own the infrastructure, the updates, and the maintenance.
Quantization Made This Possible
Quantization is the development that turned on-device inference from a demo curiosity into a real deployment pattern. The mechanism is worth understanding.
Large models store parameters as high-precision numbers (16-bit or 32-bit floating point). Quantization compresses those numbers to lower precision: 8-bit, 4-bit, sometimes lower. A 70-billion-parameter model needs about 140 GB for its weights at 16-bit precision (70 billion parameters at 2 bytes each) and fits in roughly 40–50 GB at 4-bit once format overhead is included. An 8B model fits in about 4 GB. That arithmetic is what changed what hardware could do.
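A back-of-envelope version of this arithmetic, as a minimal sketch. It counts weight storage only and ignores the KV cache, activations, and the scale metadata that quantized formats store alongside the weights, so real files run somewhat larger. The function name and the decimal-gigabyte convention are assumptions for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Counts weights only: no KV cache, activations, or quantization metadata.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for params in (8, 70):
        for bits in (16, 8, 4):
            size = weight_memory_gb(params, bits)
            print(f"{params}B parameters at {bits}-bit: ~{size:.0f} GB")
```

Halving the bits per weight halves the footprint, which is the whole story behind a model moving from a GPU cluster to a laptop.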
Quality cost is real but bounded. A January 2026 empirical study across major quantization formats found that 4-bit quantization (specifically Q4_K_M in the GGUF format) is the practical floor for reliable output. Below that, factual recall degrades noticeably and instruction-following gets erratic. Above it, you're paying memory for diminishing quality gains. The practitioner shorthand: a well-quantized 7B model can outperform a poorly quantized 13B model in both efficiency and output quality.
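To see why fewer bits cost quality, here is a toy sketch of symmetric round-to-nearest quantization in pure Python. It is not the Q4_K_M scheme, which uses block-wise scales and mixed precision; it only illustrates that fewer bits mean coarser values and a larger reconstruction error. All names and the synthetic weights are illustrative assumptions.

```python
import random


def quantize(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    if bits < 2:
        raise ValueError("this toy scheme needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_values]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    q_values, scale = quantize(weights, bits)
    restored = dequantize(q_values, scale)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a weight tensor; real weights are not Gaussian noise.
    weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]
    for bits in (8, 4, 2):
        print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.6f}")
```

The error grows as the bit width shrinks. Production formats spend extra effort (per-block scales, keeping sensitive layers at higher precision) to hold that error down at 4-bit.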
On Apple Silicon specifically, the M5 chip pushes time-to-first-token under 10 seconds for a dense 14B model, and under 3 seconds for a 30B mixture-of-experts architecture. A MacBook Pro with 24 GB of memory can hold an 8B model at 16-bit precision or a 30B model at 4-bit quantization, keeping the inference workload under 18 GB.
Five years ago, useful models required a data center. Quantization changed the arithmetic. Your buyer can now say "we'll run it on a laptop" and not be dreaming.
The on-device vs. cloud inference decision maps to where you draw a federation trust boundary. In SAML/OIDC, you decide which assertions cross organizational boundaries; in inference, the buyer decides which prompts and data cross their security boundary. The analogy holds for both patterns, but it breaks here: federation has protocol-level handshakes (signed assertions, token validation) that create verifiable trust, while cloud API inference has only contractual controls. In a buyer conversation, that means "trust" in cloud inference is contractual, not cryptographic.
Comparison by Buyer Constraint
Buyers don't ask for "cloud inference" or "edge inference." They name a constraint, and each constraint points toward an answer. The comparison below follows that conversation, one constraint at a time.
Disconnected or Air-Gapped Environments
Cloud API inference requires a network connection. Full stop. If your buyer operates in DDIL environments (denied, degraded, intermittent, limited bandwidth, which is standard DoD procurement language), cloud APIs are architecturally excluded. On-device or edge inference is the only option. Connectivity is a physics constraint. At Mobility Guardian 2025, the Air Force demonstrated exactly this: AI-powered transcription, OCR, translation, and summarization running on a ruggedized appliance with zero connectivity.
Data Classification
The DoD Impact Level framework sets escalating restrictions on where data can be processed, based on sensitivity. Each level narrows the set of permissible inference environments:
| Impact Level | Data Type | Standard Cloud API? | What's Required |
|---|---|---|---|
| IL2 | Unclassified public | Yes | Standard commercial cloud |
| IL4 | CUI, PII | No | FedRAMP High authorized cloud or on-prem |
| IL5 | CUI-High, Mission Critical | No | Physically separated infrastructure, US-person access |
| IL6 | Classified (Secret) | No | Air-gapped only |
Standard commercial APIs from OpenAI, Anthropic, and Google are not FedRAMP authorized at the API service level. Their models may be available through authorized intermediaries (Bedrock, Azure Gov), but the compliance posture depends on the hosting environment. The model is secondary. For IL6, the answer is always local.
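The Impact Level table in this section can be read as a lookup: given a level, is a standard commercial API even on the menu? The sketch below encodes only the rows of that table. The class, dictionary, and function names are hypothetical, and a real determination comes from the buyer's authorizing official, not from a script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactLevelRule:
    data_type: str
    standard_cloud_api: bool  # the "Standard Cloud API?" column
    requirement: str          # the "What's Required" column


# Rows copied from the table in this section; not an official determination.
IMPACT_LEVEL_RULES = {
    "IL2": ImpactLevelRule("Unclassified public", True, "Standard commercial cloud"),
    "IL4": ImpactLevelRule("CUI, PII", False, "FedRAMP High authorized cloud or on-prem"),
    "IL5": ImpactLevelRule(
        "CUI-High, Mission Critical",
        False,
        "Physically separated infrastructure, US-person access",
    ),
    "IL6": ImpactLevelRule("Classified (Secret)", False, "Air-gapped only"),
}


def describe_options(impact_level: str) -> str:
    """Return a one-line summary of the inference options for an Impact Level."""
    key = impact_level.strip().upper()
    if key not in IMPACT_LEVEL_RULES:
        raise KeyError(f"Unknown Impact Level: {impact_level!r}")
    rule = IMPACT_LEVEL_RULES[key]
    verdict = "allowed" if rule.standard_cloud_api else "not allowed"
    return f"{key} ({rule.data_type}): standard cloud API {verdict}; requires {rule.requirement}"


if __name__ == "__main__":
    for level in IMPACT_LEVEL_RULES:
        print(describe_options(level))
```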
Privacy
Privacy concerns can drive the on-device decision independently of data classification. VA health data, Census PII, and similar workloads create privacy requirements that don't always map cleanly to Impact Levels. A civilian agency processing constituent records may sit at IL2 for classification purposes but still face strict constraints on where that data can be sent and whether it can be logged.
OMB M-25-22 is where this bites hardest: agencies cannot allow non-public data and outputs to train commercial AI without explicit consent. Prompt logging and model improvement clauses in API terms of service are now procurement-relevant in ways that go well beyond legal boilerplate. On-device inference eliminates the question entirely. If the prompt never leaves the hardware boundary, there's nothing to log on someone else's infrastructure.
Data Residency and Sovereignty
Even below IL4, buyers care about where prompts are processed. Anthropic is the only major provider publishing an explicit data residency pricing mechanism: specifying US-only inference via their inference_geo parameter costs a 1.1x multiplier on all token pricing. By default, cloud APIs route globally. For a government buyer, "globally" is often a non-starter even for unclassified work.
On-device inference sidesteps the residency question entirely. The data never leaves.
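The residency premium is simple arithmetic, sketched here so the multiplier can be turned into a dollar figure in a buyer conversation. The 1.1x value is the one cited in this section; the function name and the monthly spend in the example are hypothetical.

```python
US_ONLY_MULTIPLIER = 1.1  # multiplier cited in this section for US-only inference


def residency_premium(monthly_token_spend: float,
                      multiplier: float = US_ONLY_MULTIPLIER) -> float:
    """Extra monthly spend when a flat multiplier applies to all token pricing."""
    return monthly_token_spend * (multiplier - 1.0)


if __name__ == "__main__":
    spend = 10_000.0  # hypothetical monthly token spend in dollars
    print(f"US-only routing adds about ${residency_premium(spend):,.0f} per month")
```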
Latency
Cloud API inference has a latency floor set by network round-trip plus server-side processing. On-device inference eliminates the network component. On current Apple Silicon with MLX, a 14B dense model produces its first token in under 10 seconds; a 30B mixture-of-experts model, under 3 seconds.
Worth noting: mobile devices throttle significantly under sustained load. An iPhone 16 Pro loses nearly half its throughput within two inference iterations due to thermal management. Edge servers don't have this problem. "On-device" and "edge server" are different hardware classes with very different sustained performance profiles. If your buyer says "on-device," ask which device.
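One way to reason about the latency comparison in this section is as a sum of three terms: network round trip, time to first token, and decode time. The sketch below is a simplified model with hypothetical placeholder numbers; it treats thermal throttling as a flat fractional loss in decode rate, which is an assumption rather than a measured curve.

```python
def response_latency_s(network_rtt_s: float, time_to_first_token_s: float,
                       output_tokens: int, tokens_per_second: float) -> float:
    """Total seconds from request to last token under a simple additive model."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return network_rtt_s + time_to_first_token_s + output_tokens / tokens_per_second


def throttled_rate(tokens_per_second: float, throughput_loss: float) -> float:
    """Decode rate after losing a fraction (0.0 to 1.0) of throughput to throttling."""
    return tokens_per_second * (1.0 - throughput_loss)


if __name__ == "__main__":
    # Every number below is a hypothetical placeholder, not a benchmark.
    cloud = response_latency_s(0.15, 0.5, 300, 80.0)
    laptop = response_latency_s(0.0, 2.0, 300, 30.0)
    phone = response_latency_s(0.0, 2.0, 300, throttled_rate(20.0, 0.5))
    print(f"cloud: {cloud:.1f}s  laptop: {laptop:.1f}s  throttled phone: {phone:.1f}s")
```

Removing the network term does not guarantee a faster response: for long outputs the decode rate of the local hardware dominates, which is why the hardware class matters.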
Per-Token Cost at Scale
Cloud API pricing scales linearly with volume. At sustained high volume, the math changes. A Dell-commissioned analysis found on-premises inference 2.9x to 4.1x more cost-effective than API pricing for a 70B model at sustained volume. (Dell-commissioned, so directionally useful, not neutral.) Independent research suggests self-hosted breakeven requires 50%+ GPU utilization for 7B models, 10%+ for 13B models.
Cloud API pricing has dropped dramatically. GPT-4-equivalent performance now costs roughly $0.40 per million tokens versus $20 in late 2022. But output tokens cost 3–5x more than input tokens across all major providers (because output generation is sequential while input processing parallelizes), which means applications generating long outputs face steeper economics than those processing long inputs with short responses.
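The crossover and the input/output asymmetry can both be sketched with a few lines of arithmetic. This is a deliberately crude model: it treats self-hosting as a flat monthly cost (amortized hardware, power, staff) regardless of volume, and all prices and token counts in the example are hypothetical.

```python
import math


def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """API cost in dollars for one request, given prices per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1e6


def breakeven_requests_per_month(fixed_monthly_cost: float,
                                 per_request_cost: float) -> int:
    """Requests per month at which API spend matches a flat self-hosting cost."""
    if per_request_cost <= 0:
        raise ValueError("per_request_cost must be positive")
    return math.ceil(fixed_monthly_cost / per_request_cost)


if __name__ == "__main__":
    # Hypothetical prices: $1 per million input tokens, $4 per million output tokens.
    long_input = cost_per_request(2000, 200, 1.0, 4.0)
    long_output = cost_per_request(200, 2000, 1.0, 4.0)
    print(f"long input, short output: ${long_input:.4f} per request")
    print(f"short input, long output: ${long_output:.4f} per request")
    # Hypothetical $3,000 per month all-in cost for a self-hosted box.
    print("breakeven requests per month:", breakeven_requests_per_month(3000.0, long_output))
```

The same total token count costs more when it shifts from input to output, so the breakeven volume for a generation-heavy workload arrives sooner than for a summarization-style one.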
Model Capability
Cloud API inference wins cleanly here. Frontier models with hundreds of billions of parameters, trained on massive datasets, updated continuously by the provider. If your buyer needs the best available reasoning, the longest context windows, or the most current knowledge, they need a cloud API or a sovereign cloud variant. Local hardware tops out practically around 20B parameters with acceptable quality. That's good enough for summarization, translation, classification, and routing. It is not frontier reasoning.
If your buyer runs SCIM provisioning to systems that go intermittently offline (common in tactical or shipboard environments), they already understand the core challenge of edge inference: pushing state to a disconnected endpoint and keeping it current. The analog holds — both require a sync mechanism, both degrade gracefully when disconnected, both create versioning headaches. It breaks here: a SCIM update is kilobytes; a model update is gigabytes. Getting a new quantized model onto a ruggedized appliance is closer to firmware distribution than identity provisioning.
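The kilobytes-versus-gigabytes gap is easy to put in time terms. A minimal sketch, assuming an ideal link with no protocol overhead or retransmission; the payload sizes and link speed are hypothetical examples, not figures from any deployment.

```python
def transfer_seconds(payload_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time: payload size in bits divided by link rate."""
    if link_bits_per_second <= 0:
        raise ValueError("link_bits_per_second must be positive")
    return payload_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    link = 1_000_000       # hypothetical 1 Mbps constrained link
    scim_update = 4_000    # hypothetical 4 KB identity update
    model_update = 4e9     # hypothetical 4 GB quantized model file
    print(f"SCIM update: {transfer_seconds(scim_update, link):.3f} s")
    print(f"Model update: {transfer_seconds(model_update, link) / 3600:.1f} h")
```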
Apple Intelligence as a Preview of Hybrid Inference
Apple Intelligence matters here because it's the most visible implementation of the pattern enterprise deployments will eventually need: on-device inference for routine tasks, cloud inference for harder ones, with documented privacy guarantees at every layer. Your buyer won't deploy it. But the architecture is instructive.
Apple runs a ~3B-parameter model on-device for everyday tasks like text refinement, notification summarization, and in-app actions. When a request exceeds on-device capability, it routes to Private Cloud Compute (PCC), which runs on Apple Silicon servers with five published guarantees:
- Stateless computation
- Cryptographic enforcement (not just policy)
- No privileged access (even Apple employees can't see user data)
- Non-targetability
- Verifiable transparency through independent audit
The mechanism: the user's device encrypts the request directly to the public keys of PCC nodes it has cryptographically verified. Load balancers and privacy gateways sit outside the trust boundary and cannot decrypt the request. The guarantee rests on cryptographic attestation of the server's identity and configuration before any data is sent. A contractual promise that the server is configured correctly would be a different, weaker thing entirely.
Apple Intelligence has no FedRAMP authorization, no enterprise SLA, and no MDM controls over routing logic. But the architectural pattern, local inference as default with attested cloud escalation, is exactly what enterprise and government hybrid deployments will need to implement under their own compliance frameworks. What's missing for enterprise adoption is specific: there's no compliance-aware routing policy (where data classification, alongside task difficulty, determines which prompts escalate), no MDM integration to enforce routing rules across a managed fleet, and no attestation infrastructure that maps to existing enterprise trust frameworks like zero trust device posture checks. When a buyer mentions "trusted inference environments," this is the conceptual direction. The tooling to get there in an enterprise context hasn't been built yet.
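To make "compliance-aware routing" concrete, here is a hypothetical sketch of what such a policy could look like. Nothing like this ships in Apple Intelligence, and every name, threshold, and rule below is an assumption for illustration. The idea it shows is ordering: connectivity and classification gate the cloud path first, and task difficulty only matters once the cloud path is permitted.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Decision:
    route: Route
    reason: str


def route_prompt(impact_level: int, needs_frontier: bool, connected: bool,
                 max_cloud_impact_level: int = 2) -> Decision:
    """Pick an inference target for one prompt.

    max_cloud_impact_level is a hypothetical policy setting: the highest
    Impact Level this buyer's approved cloud endpoint may receive.
    """
    if not connected:
        return Decision(Route.LOCAL, "no connectivity")
    if impact_level > max_cloud_impact_level:
        return Decision(Route.LOCAL, "classification exceeds cloud ceiling")
    if needs_frontier:
        return Decision(Route.CLOUD, "task exceeds local capability")
    return Decision(Route.LOCAL, "local model is sufficient")


if __name__ == "__main__":
    print(route_prompt(impact_level=2, needs_frontier=True, connected=True))
    print(route_prompt(impact_level=5, needs_frontier=True, connected=True))
    print(route_prompt(impact_level=2, needs_frontier=True, connected=False))
```

In this sketch a hard prompt at a restricted classification stays local and accepts the capability ceiling. A buyer with an authorized higher-level cloud environment would raise the ceiling rather than change the ordering.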
Apple's PCC attestation model rhymes with zero trust's "never trust, always verify." In zero trust, the device proves its posture before accessing resources; in PCC, the server proves its configuration before receiving data. Both shift trust from network position to cryptographic proof. It breaks here: zero trust device attestation is continuous and can be re-evaluated mid-session, while PCC attestation happens at request time. PCC-style server attestation for inference workloads doesn't exist in the enterprise market yet — that's precisely the gap your buyer will eventually need filled.
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "You should run models locally for better security" | "Where does this workload sit on the Impact Level ladder? That determines whether a standard API is even an option." | Shows you understand the compliance framework driving their architecture. |
| "Cloud AI is cheaper" | "Cloud APIs are cheaper at low volume. Past sustained high utilization, the economics flip — the breakeven depends on model size and call volume." | Demonstrates you know the crossover exists without pretending to know their exact numbers. |
| "On-device AI isn't as good" | "Local models top out around 20 billion parameters with current hardware. That handles summarization and classification well. Frontier reasoning still needs a cloud API or sovereign cloud variant." | Names the capability ceiling without dismissing the pattern. |
| "We can handle air-gapped environments" | "Are you operating in DDIL conditions? That's a hard constraint for cloud inference — we should talk about what needs to run at the edge." | DDIL is real DoD/IC procurement language. Using it signals you've been in these conversations before. |
| "Data never leaves your network with local inference" | "With on-device inference, prompts never leave the hardware boundary. With cloud APIs, even US-only routing has pricing implications — Anthropic charges a 1.1x multiplier for geo-restricted inference." | Naming the actual pricing mechanism shows you've done the work. |
| "Apple is doing this on phones now" | "Apple Intelligence runs a 3B model on-device for routine tasks and routes harder requests to attested cloud servers. That hybrid pattern — local default with cloud escalation — is the direction enterprise deployments are heading." | Frames Apple as a pattern preview worth studying. |
| "Quantization makes models smaller" | "Quantization compresses model weights from 16-bit to 4-bit precision. That's how an 8B model fits in 4 gigs of memory on a laptop. Below 4-bit, quality drops noticeably for factual tasks." | Shows you understand the mechanism and where it breaks. |
| "The model runs on the edge" | "Are we talking about a phone, a laptop, or a ruggedized edge server? Phones throttle under sustained load. Edge servers don't. The hardware class changes the performance conversation." | Prevents the buyer from conflating very different deployment targets. |
| "FedRAMP covers this" | "The model itself isn't FedRAMP authorized — the hosting environment is. Claude on Bedrock in GovCloud is a different compliance posture than Claude via Anthropic's commercial API." | Separates model choice from hosting choice, which is where procurement conversations actually stall. |
| "OMB says you can't use cloud AI" | "M-25-22 restricts agencies from letting non-public data train commercial models without consent. That means the prompt logging clauses in your API contract are procurement-relevant, not just legal boilerplate." | Shows you understand the memo's operational bite. |
The constraint your buyer names tells you which inference pattern fits. Your job is to recognize the constraint, confirm it, and ask the next question that shows you understand the tradeoff space. The buyer will remember that you didn't bluff.
Things to follow up on...
- Ollama now runs on MLX: As of March 2026, Ollama on Apple Silicon is built on top of Apple's MLX framework, which means the most popular consumer-facing local inference tool now uses the same optimized backend as Apple's own research stack.
- Mobile thermal throttling is real: A March 2026 arXiv study found that an iPhone 16 Pro loses nearly half its inference throughput within two iterations due to thermal management, which matters if your buyer conflates "phone" with "edge server" when discussing on-device deployment.
- Qualcomm NPU tooling gap on Windows: Despite strong benchmark claims, the Snapdragon X Elite NPU requires ONNX models rather than GGUF, meaning popular tools like Ollama and LM Studio run CPU-only on Windows ARM devices — a significant gap between marketing and practitioner experience.
- Air-gapped AI in the field: At Mobility Guardian 2025, the Air Force deployed Google's ruggedized GDC appliance to run AI-powered transcription, OCR, and summarization at the Secret level with zero network connectivity, making it the most publicly documented DDIL inference deployment to date.

