A model's weights are its learned parameters — the billions of numerical values that encode what the model knows and how it reasons. Training a large language model produces these values as its primary output. When a lab releases those values publicly, the model is "open weight." That's the whole definition.
What it is not: open source. Open source implies something about the training pipeline, the data, the tooling, the ability to reproduce the artifact from scratch. Open weights guarantees none of that. Meta releases Llama's parameters under a license that restricts commercial use above certain user thresholds. Mistral releases some models under the permissive Apache 2.0 license and others under more restrictive terms. Neither releases the training data. Neither gives you the ability to audit what the model learned from, or why it behaves the way it does on edge cases. The weights are public. The provenance is not.
The capability gap is also real and worth stating plainly. The best open-weight models (Llama 3.1 405B, Mistral Large, Qwen 2.5 72B) are genuinely capable. For summarization, classification, and retrieval-augmented generation over internal documents, they're often sufficient. For complex multi-step reasoning, nuanced instruction-following at scale, and the performance that enterprise buyers benchmark against GPT-4o or Claude Sonnet, the gap exists and practitioners report it. It doesn't close by wishing it away.
Quick reference:
- Open weights: The trained model parameters are publicly released. This says nothing about the training data, how permissive the license is, or whether the model matches frontier closed models.
- Frontier gap: The best open-weight models are competitive for many tasks. They are not equivalent to GPT-4-class or Claude Sonnet-class models on complex reasoning. This is a real operational constraint, not vendor positioning.
What Running It Actually Takes
Assume your buyer wants to run Llama 3.1 70B: a mid-tier open-weight model, not the largest available, genuinely useful for many tasks. The 70B refers to 70 billion parameters. At standard 16-bit precision each parameter occupies 2 bytes, so the weights alone take roughly 140GB of GPU memory, before any working memory for requests. A single NVIDIA H100 has 80GB. You need at least two, configured correctly, before you've handled a single inference request.
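The arithmetic behind that estimate is simple enough to sketch. The following is a rough back-of-the-envelope calculation, not a sizing tool: the function names are hypothetical, and it counts only the memory for the weights, ignoring the extra memory a real deployment needs for request state and framework overhead.

```python
import math


def weights_memory_gb(num_params: float, bits_per_param: int = 16) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


def min_gpus(required_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Smallest GPU count whose combined memory covers required_gb.

    The 80 GB default is the H100 figure from the text. This is a lower
    bound: it assumes weights split evenly and leaves no headroom.
    """
    return math.ceil(required_gb / gpu_memory_gb)


needed = weights_memory_gb(70e9)  # 70 billion parameters at 16 bits each
print(f"{needed:.0f} GB across at least {min_gpus(needed)} GPUs")
# prints: 140 GB across at least 2 GPUs
```

Two 80GB cards give 160GB, so only about 20GB remains for everything other than the weights, which is why "at least two" is a floor and not a recommendation.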
This is the first place "we'll self-host" plans encounter reality. GPU supply for H100s and A100s has been constrained since 2023 and hasn't fully normalized. Lead times for on-premises hardware procurement in federal agencies run 6–18 months through standard acquisition channels. Cloud GPU instances are available faster, but if the point of self-hosting is data residency and keeping inference off commercial cloud infrastructure, renting GPUs from AWS or Azure defeats the purpose.
Assuming you solve the hardware problem, you need an inference stack. Running a model isn't just loading the weights and sending text at it. You need a serving framework, such as vLLM, Hugging Face's Text Generation Inference (TGI), or Ollama for lighter workloads, that is configured for your hardware, integrated with your network, and capable of handling concurrent requests without falling over. You'll likely quantize the model, which means reducing numerical precision so larger models fit on available hardware at some cost in quality, and that adds another layer of configuration and testing. None of this is rocket science, but it requires someone who knows what they're doing, and that person is not your average federal IT generalist.
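To make the quantization trade-off concrete, here is a toy illustration of the idea: store each value as a small integer plus one shared scale factor, then measure what is lost on the way back. This is a minimal sketch of simple symmetric 8-bit quantization with made-up weight values and hypothetical function names. Production serving stacks use more elaborate schemes, so treat it as an explanation of the principle, not of any particular framework's method.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0  # avoid dividing by zero
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.033, 0.9, -0.27]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding step is where quality is lost: each restored value can be
# off by up to half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("integers stored:", q)
print("largest round-trip error:", max_error)
```

Each value now fits in one byte instead of two, which halves the memory footprint. The round-trip error is the "some quality" being traded away, and testing whether that loss matters for a given workload is part of the configuration burden described above.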
Then there's the ops burden. Model serving is a live system. It needs monitoring, patching when vulnerabilities surface in the inference stack, capacity planning as usage grows, and incident response when it fails at 2am before a deadline. The teams running this in production consistently report that ongoing ops cost exceeds initial setup cost by a significant margin. One infrastructure lead at a civilian agency described it as "we thought we were buying a capability and we were actually hiring a product team."
DeepSeek deserves a specific mention for federal accounts: it's an open-weight model from a Chinese lab, which creates procurement complications that go beyond the technical. Several agencies have already issued informal guidance against its use on government systems, and formal policy is moving in that direction. If a buyer mentions DeepSeek, the conversation needs to include the provenance question before it gets to the capability question.
Quick reference:
- GPU requirement: A 70B parameter model needs ~140GB GPU memory at standard precision — at minimum two H100s before handling a single inference request.
- Inference stack: Serving frameworks (vLLM, TGI) require configuration, integration, and ongoing maintenance. Quantization trades some model quality for hardware feasibility.
- Ops burden: Production self-hosting requires dedicated operational capacity. The ongoing cost typically exceeds setup cost.
- Capability gap: Real for complex reasoning tasks. Sufficient for many document-centric enterprise workflows.
IDAM Concept Mapping
Self-hosting a model most closely resembles running your own identity infrastructure on-premises. You own the stack, you own the behavior, you own the failure. Your IDAM intuition gets you most of the way there.
With on-prem identity — AD FS, a locally deployed SAML IdP, an on-premises PAM solution — you still operate within a shared-responsibility model for the underlying protocols. SAML is SAML. OAuth is OAuth. The spec is the spec, and a conformant implementation behaves predictably. You can audit against it. You can test conformance. If something goes wrong, you have a specification to reason from.
With a self-hosted model, there is no spec for model behavior. The model does what the weights do. There is no RFC that defines what a 70B parameter model should output for a given input. No provider SLA covers the model's reasoning. No shared-responsibility model exists at the inference layer — the lab that released the weights is not responsible for what the model says when you run it on your hardware. There is no audit log from the model layer unless you build one. The behavior is emergent, not specified, and you own all of it. That's a categorically different trust posture than anything in your IDAM stack, including the most painful on-prem deployment you've ever supported.
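Building that audit log is the operator's job, and the shape of it will be familiar from identity work: every request gets a record of who, what, when, and which version. The sketch below is one hypothetical way to wrap an inference call. The `audited` helper, the record fields, and the stand-in `generate` function are all assumptions for illustration, and none of them comes from a real serving framework. It stores hashes of the prompt and output instead of the text itself, which is a design choice that your retention policy may not allow.

```python
import hashlib
import json
import time
import uuid
from typing import Callable


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audited(generate: Callable[[str], str], sink: list, model_id: str):
    """Wrap an inference function so every call appends one audit record."""

    def wrapper(prompt: str, user: str = "unknown") -> str:
        started = time.time()
        output = generate(prompt)
        record = {
            "request_id": str(uuid.uuid4()),
            "timestamp": started,
            "latency_s": round(time.time() - started, 4),
            "user": user,
            "model_id": model_id,  # which weights produced this output
            "prompt_sha256": sha256_hex(prompt),
            "output_sha256": sha256_hex(output),
        }
        sink.append(json.dumps(record))  # a real sink would be durable storage
        return output

    return wrapper


def generate(prompt: str) -> str:
    """Stand-in for a call to the self-hosted model (hypothetical)."""
    return prompt.upper()


audit_log: list = []
model = audited(generate, audit_log, model_id="example-70b-build-1")
model("summarize this memo", user="analyst-1")
print(audit_log[0])
```

Recording a model identifier with each request matters because, with no behavioral spec to audit against, knowing exactly which weights and configuration produced an output is the closest thing to a conformance record you will have.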
The Conversation You're Actually Going to Have
The scenario is predictable: a CIO or CAIO at a civilian agency, under pressure from data residency requirements or a recent breach at a peer agency, says "we need to keep all AI inference in-house." They've read about open-weight models. They may have seen a demo of something running locally. They mean it.
Their concern is legitimate. Data residency requirements for controlled unclassified information are real, and the instinct to keep sensitive data off commercial inference endpoints is reasonable. Self-hosting is one answer to that problem. It's also an answer that trades one set of risks for a harder set of operational problems.
What they may not have priced in: the hardware acquisition timeline against their actual deployment schedule. The ops staffing requirement. The capability gap on the specific use cases they're planning to run. The fact that the requirement behind "keeping data in-house" can often be met through managed cloud services with appropriate data processing agreements, such as Amazon Bedrock, Azure AI, or Google Vertex AI, without taking on full model ops. Most federal AI deployments that start as self-hosting conversations end up in one of those managed environments within 12–18 months, not because the self-hosting failed technically, but because the operational cost was higher than the organization had budgeted for.
Your job in that conversation is to make sure they're deciding with accurate information about what they're actually buying. The data residency concern is solvable multiple ways. The question is which solution fits their actual capacity.
And when they do self-host — because some will, and some should — they need to understand that they've just inherited a trust chain with no floor. Every layer of the model's behavior is their responsibility. There is no vendor to call when the model produces output that violates policy. There is no SLA that covers model reasoning. There is no audit log unless their team builds one. Governance, logging, and behavioral controls need to be designed before the first inference request, not retrofitted after something goes wrong.
Quick reference:
- The real conversation: Data residency concerns are legitimate. Self-hosting is one solution. Managed cloud services with appropriate data processing agreements (DPAs) are another, and often the one organizations actually end up with.
- The trust chain question: Self-hosted models inherit no shared-responsibility model at the inference layer. Governance, logging, and behavioral controls are the operator's problem entirely.
- The timeline reality: Hardware procurement in federal agencies runs 6–18 months through standard channels. If the deployment timeline is shorter, the self-hosting plan needs to account for that gap explicitly.

