When a buyer says their team is evaluating DeepSeek or Llama for an internal deployment, they're describing a specific infrastructure decision with predictable consequences. This piece covers what that decision involves mechanically, what it costs operationally, and the consistent pattern where self-hosting ambitions quietly become managed cloud deployments within a year. It does not cover model training or fine-tuning.
What a Weights Release Actually Contains
A model's weights are its trained parameters. Billions of numerical values encoding everything the model learned during training. When Meta or DeepSeek "releases the weights," they publish a file (often hundreds of gigabytes) containing those parameters, a tokenizer, and enough inference code to run the model.
That's the full delivery. What you do not get: the training data, the reinforcement learning recipe, the reward model, or the infrastructure details that produced those weights. You get the finished artifact without the factory.
The vocabulary around this is genuinely bad. The Open Source Initiative published a definition (OSAID v1.0, October 2024) that draws a hard line: genuinely open-source AI requires the code, the data transparency, and the weights, all under terms guaranteeing use, study, modification, and sharing. By that standard, no mainstream model qualifies. Not Llama. Not DeepSeek. Not Mistral.
The accurate term is open-weight. Your buyer will probably say "open source" and mean "we can download it." In a public sector conversation, where procurement and legal will eventually scrutinize license terms, knowing the difference saves you from confidently saying something wrong in front of the wrong person.
The Licenses Are Not Interchangeable
The three major open-weight families carry meaningfully different license structures. The differences matter in practice.
Meta's Llama ships with a community license that includes a competitor training ban, a 700M MAU threshold requiring a separate agreement with Meta, and an Acceptable Use Policy that Meta can amend unilaterally. These restrictions propagate to derivative models. That propagation matters: teams using DeepSeek's distilled Llama variants (DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Llama-70B) still inherit Llama's restrictions, not DeepSeek's more permissive terms. It's generous for most enterprises, but "open" is doing a lot of heavy lifting in the marketing.
DeepSeek uses MIT for code and a modified OpenRAIL license for weights. Commercial use is explicitly permitted, including distillation. The terms are substantially more permissive than Llama's. The catch is different: DeepSeek's default content policies reflect Chinese regulatory requirements, which can produce unexpected censorship on political or sensitive topics. The license FAQ addresses permitted uses, but organizations self-hosting for public sector work should plan to apply their own alignment and content filtering layers. For a buyer with a CISO in the room, this is the first question that will come up.
Mistral's current flagship models (Mistral Large 3, released December 2025) ship under Apache 2.0, which is about as permissive as software licensing gets. No revenue thresholds, no competitor restrictions. Note that some earlier Mistral models and commercial-tier API products carried more restrictive terms, so check the license on the specific model, not the brand.
The practical point: "open-weight" covers a spectrum of licensing postures, and the license governs what your buyer's team can actually do with the model once they've downloaded it.
What Self-Hosting Requires, Mechanically
A buyer says they want to run Llama 70B internally. What follows is predictable.
GPU memory is the hard constraint. A 70-billion-parameter model in standard precision (FP16) requires roughly 140GB just for the weights. No single GPU holds that. You need at least two A100 80GB cards, or one H100 with aggressive quantization. Quantizing to 4-bit integers drops the memory requirement to about 35GB, which fits on a single high-end GPU, but introduces quality tradeoffs on complex reasoning tasks.
For the truly large models, the numbers get uncomfortable. Models at DeepSeek's scale (671 billion parameters) need eight H200 GPUs just for inference, according to Google's own documentation. Just answering questions. No training involved.
The serving stack is real infrastructure. vLLM is the dominant production inference framework. It handles continuous batching, memory management, and exposes an OpenAI-compatible API. It also ships with no authentication by default. The endpoint is open unless you build access controls around it. Hold that thought.
Cold starts kill autoscaling. Loading a 70B model from storage to GPU memory takes 2 to 5 minutes. By the time a new instance spins up during a traffic spike, the queue has already overflowed. Teams either maintain expensive warm standby instances or accept that their self-hosted model simply cannot scale elastically the way an API can.
The hidden cost is people. One practitioner estimate puts the staffing requirement at 0.3 to 0.5 of a dedicated infrastructure engineer per self-hosted model. GPU failures, driver updates, model upgrades, load balancing, observability. Multiple practitioner analyses converge on the same conclusion: at a senior engineer's fully loaded cost, that's $3,000 to $6,000 per month before you've paid for a single GPU hour. Teams that underestimate this line item consistently blow their total cost of ownership projections.
Where Your IdP Intuition Applies
If your buyer has ever run their own identity provider on-premises, they already understand the shape of this decision. Self-hosting a model is the same category of choice: you gain control over the infrastructure, you gain data residency guarantees, and you inherit every operational responsibility that a managed service would otherwise absorb. Patching, availability, monitoring, access control. All yours now.
The drift pattern is identical too. Organizations that ran on-prem IdPs in 2012 largely moved to cloud identity providers by 2018. The ops burden ground them down, and the control premium stopped justifying the cost. The same pattern is playing out with model hosting, compressed into a shorter timeline.
When you migrate between identity providers, the protocol layer (SAML, OIDC) means downstream applications don't notice if you did it right. There is no equivalent protocol layer for model serving. Swapping from Llama 70B to DeepSeek V3 changes output behavior in ways that are difficult to predict and impossible to abstract away with a standard. The portability story that makes cloud IdP migration manageable does not exist here yet.
A "patch" to a model can alter its reasoning on tasks you depend on, with no changelog that maps to your use case.
Where Most Self-Hosting Plans Actually Land
The trajectory is remarkably consistent. A team gets approval for a proof of concept. They spin up GPU instances, get vLLM running, and demo something impressive within two weeks. Then production happens. The first GPU driver conflict costs a day. The model update that subtly changes output quality costs a week of evaluation. The on-call rotation that nobody budgeted for costs goodwill. Quietly, someone opens the AWS console and starts looking at Bedrock pricing.
AWS Bedrock's Project Mantle now serves roughly two dozen open-weight models as fully managed, serverless inference. Your operational surface is an API endpoint and a per-token bill. GPU provisioning, scaling, failover: all handled by AWS. The same models, the same weights, but someone else runs the infrastructure. Google Vertex AI and Azure AI Foundry offer similar managed paths.
What you gain: existing cloud IAM policies apply, VPC isolation works, compliance documentation is already in place, and your team doesn't need to learn GPU orchestration. What you lose: full weight custody (no air-gapped deployment), the ability to modify weights in place, and complete control over when model versions change.
Self-hosting saves real money only above a threshold. One practitioner analysis puts the crossover at roughly 150–200 active users, or 2–5 million tokens per day. Below that, fixed costs of GPU infrastructure and engineering time exceed per-token managed pricing. Most enterprise teams exploring self-hosting are below that threshold. They just haven't done the math yet.
The Capability Gap Is Smaller Than Your Buyer Thinks
Your buyer may assume open-weight models are significantly weaker than frontier closed models. As of mid-2026, that assumption is outdated. On coding benchmarks, the gap has effectively closed. On the hardest reasoning tasks, closed models still lead by a few points on the Artificial Analysis leaderboard. For most production workloads like document analysis, summarization, code generation, and chat, open-weight models perform in the same capability class at a fraction of the per-token cost.
Self-hosting an open-weight model is a lateral move from an API. Same capability class, different control model, very different ops profile.
What This Means for Your Tuesday Call
When a buyer brings up self-hosting open-weight models, they're usually expressing one of three needs: data residency, cost control at scale, or independence from a single vendor's API. All three are legitimate. Leave the infrastructure choice to them. Your job is understanding the decision well enough to know where identity and access management enters the picture.
And it enters the moment they stand up an inference endpoint. vLLM ships with no auth. The managed path (Bedrock, Vertex, Foundry) maps cleanly to your existing cloud identity story. True self-hosting introduces an infrastructure surface where identity controls must be built, not inherited. That's a conversation your SE can lead, and it's a natural one to offer once the buyer starts describing their deployment model.
Worth having. It's the one where you add value without bluffing.
Things to follow up on...
- OSI's validation list: Only a handful of models (Pythia, OLMo, Amber, T5) have actually passed the Open Source AI Definition's validation phase, which puts the "open source" claims from every major lab in sharper relief.
- Distilled model license inheritance: DeepSeek's smaller distilled models built on Llama base weights carry the upstream Llama license restrictions, a nuance documented in the DeepSeek-R1 repository that procurement teams routinely miss.
- Mixture-of-Experts economics: The architecture behind the gap closing so fast is MoE, where DeepSeek V3's 671B parameters activate only 37B per token, and this training efficiency story explains why a $5.9M training run can produce frontier-competitive models.
- Bedrock's PrivateLink expansion: AWS now supports PrivateLink for the bedrock-mantle endpoint, which means open-weight model inference can stay entirely within a customer's VPC without traversing the public internet.

