Enterprise AI Deployment: The Reference Architecture, Assembled

By Leigh Garrity, May 6, 2026

The architecture runs in one direction: employee or application → SSO → gateway → provider(s) with negotiated data terms → observability and eval pipeline → FinOps dashboard with chargeback. Every governance problem you encountered in this chapter lives at one of those handoffs.

Employee / Application
        ↓
    SSO  ·  identity assertion, per-user credential binding
        ↓
    AI Gateway  ·  auth, routing, DLP, caching, rate limits, full prompt/response log
        ↓
    Provider(s)  ·  negotiated data terms, ZDR, regional endpoints
        ↓
    Observability & Eval Pipeline  ·  quality drift, LLM-as-judge, regression gates
        ↓
    FinOps Dashboard  ·  token chargeback, quota enforcement, cache-hit tracking

The Provisioning Problem

Sanctioned AI — Tools the organization has reviewed, approved, and added to an allowlist with defined access controls. When it comes up: When a customer asks how they distinguish "approved" from "tolerated" from "actively blocked." Don't confuse with: Managed AI, which implies ongoing vendor relationship and contractual terms — sanctioned is about access policy, not contract status.

Shadow AI — AI tools employees adopt without IT or security review, typically via personal accounts or direct API keys. When it comes up: When the conversation turns to why a gateway matters — shadow AI is the provisioning gap the gateway closes. Don't confuse with: Unsanctioned SaaS broadly; shadow AI is specifically the category where the data risk is prompt content, not just license exposure.

Three-tier policy model — Block / Monitor / Allow. The framework for categorizing AI tools by risk level rather than applying a binary approved/blocked decision. When it comes up: When procurement or security asks how the organization handles tools that aren't yet reviewed — "monitor" is the answer that keeps the business moving while the review runs. Don't confuse with: A firewall allow/deny rule; the Monitor tier implies active logging and periodic review, not passive tolerance.
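
The three tiers can be sketched as a lookup that defaults to Monitor. This is a minimal illustration rather than a product configuration: the hostnames, the POLICY table, and the decide function are hypothetical names introduced here.

```python
from enum import Enum


class Tier(Enum):
    BLOCK = "block"
    MONITOR = "monitor"
    ALLOW = "allow"


# Hypothetical policy table; the hostnames are placeholders.
POLICY = {
    "approved-assistant.example.com": Tier.ALLOW,
    "unvetted-risky-tool.example.net": Tier.BLOCK,
}


def decide(host: str) -> dict:
    """Return the handling decision for a request to an AI tool host."""
    # Tools with no completed review fall into MONITOR, not BLOCK.
    tier = POLICY.get(host.lower(), Tier.MONITOR)
    return {
        "tier": tier.value,
        "forward": tier is not Tier.BLOCK,
        "queue_for_review": tier is Tier.MONITOR,
    }
```

The design choice worth noticing is the default: an unknown tool is forwarded and queued for review, which is what separates Monitor from a plain deny rule.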

Request-to-allowlist lifecycle — The workflow from an employee's tool request through security review, data classification, gateway configuration, and provisioning. When it comes up: When a customer asks how long it takes to approve a new AI tool — the answer is only meaningful if the lifecycle is defined.

If you remember nothing else: Shadow AI expands to fill the gap between what employees need and what IT has approved. A faster yes shrinks the gap; blocking just relocates it.

The Gateway Layer

AI gateway — A proxy layer that sits between users or applications and AI provider APIs, handling authentication, routing, rate limiting, DLP, caching, and complete prompt/response logging. When it comes up: Any conversation about centralized control, cost visibility, or audit trails — the gateway is the single point where all of those become possible. Don't confuse with: An API gateway. An API gateway manages REST traffic generically; an AI gateway understands prompt structure, token counts, and model-specific routing logic.

Virtual key — A gateway-issued credential that maps to a real provider API key, allowing the gateway to rotate, revoke, or scope provider access without touching the underlying key. When it comes up: When a customer asks how they avoid distributing raw OpenAI or Anthropic keys to every developer. Don't confuse with: A service account credential; a virtual key is ephemeral and gateway-managed, not a long-lived identity.
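
The mapping from virtual key to provider key can be sketched as a small in-memory store. This is an assumption-laden illustration, not any vendor's implementation: VirtualKeyStore, the "vk_" prefix, and the provider and model names are all hypothetical.

```python
import secrets


class VirtualKeyStore:
    """Maps gateway-issued virtual keys to real provider keys held server-side."""

    def __init__(self, provider_keys: dict):
        self._provider_keys = dict(provider_keys)  # provider name -> real key
        self._virtual = {}                         # virtual key -> scope entry

    def issue(self, owner: str, provider: str, models: set) -> str:
        if provider not in self._provider_keys:
            raise KeyError(f"unknown provider: {provider}")
        vkey = "vk_" + secrets.token_urlsafe(16)
        self._virtual[vkey] = {"owner": owner, "provider": provider, "models": set(models)}
        return vkey

    def resolve(self, vkey: str, model: str) -> str:
        """Return the real provider key if the virtual key is valid and in scope."""
        entry = self._virtual.get(vkey)
        if entry is None:
            raise PermissionError("unknown or revoked virtual key")
        if model not in entry["models"]:
            raise PermissionError("model not in scope for this key")
        return self._provider_keys[entry["provider"]]

    def revoke(self, vkey: str) -> None:
        self._virtual.pop(vkey, None)

    def rotate_provider_key(self, provider: str, new_key: str) -> None:
        # Developers keep their virtual keys; only the server-side key changes.
        self._provider_keys[provider] = new_key
```

Rotation and revocation never touch what the developer holds, which is the point of the indirection.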

Fallback routing — Gateway logic that redirects requests to a secondary model or provider when the primary is unavailable or over quota. When it comes up: Business continuity conversations — the question is whether an outage at one provider takes down the application or just triggers a reroute. Don't confuse with: Load balancing; fallback routing is condition-triggered, not traffic-distributed.
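
A condition-triggered reroute, as opposed to load balancing, might look like the sketch below. The exception classes and the route_with_fallback function are hypothetical; a real gateway would map provider-specific errors onto conditions like these.

```python
from typing import Callable, Sequence


class ProviderUnavailable(Exception):
    """Raised by a provider call when the upstream is down."""


class QuotaExceeded(Exception):
    """Raised by a provider call when the quota is exhausted."""


Provider = tuple[str, Callable[[str], str]]  # (name, call)


def route_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in priority order; return (provider name, response)."""
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except (ProviderUnavailable, QuotaExceeded) as exc:
            # Only these conditions trigger a reroute; other errors propagate.
            failures.append(f"{name}: {type(exc).__name__}")
    raise RuntimeError("all providers failed: " + "; ".join(failures))
```

The secondary provider receives traffic only when the primary raises one of the listed conditions, so under normal operation it receives nothing.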

If you remember nothing else: The gateway is the only layer that sees the full prompt and the full response, which makes it the only layer that can enforce anything meaningful.

The Identity Layer

Per-user key model — Each authenticated user's AI requests are issued under a credential tied to their identity, making every prompt attributable to a specific person. When it comes up: When legal or compliance asks who sent a specific prompt — per-user keys are the answer; shared service accounts are not. Don't confuse with: Per-application keys, which attribute requests to a system, not a person, and are appropriate for automated pipelines but not human-facing tools.
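
The attribution difference can be shown with a toy log. This sketch assumes the gateway records a principal with each request; PromptRecord, record, and who_sent are hypothetical names, and the prompt is stored as a hash only to keep the example small.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PromptRecord:
    request_id: str
    principal: str      # user identity, or a service account name
    prompt_sha256: str
    logged_at: str


def record(request_id: str, principal: str, prompt: str) -> PromptRecord:
    return PromptRecord(
        request_id=request_id,
        principal=principal,
        prompt_sha256=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        logged_at=datetime.now(timezone.utc).isoformat(),
    )


def who_sent(log: list[PromptRecord], prompt: str) -> set[str]:
    """Return every principal that submitted exactly this prompt."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return {r.principal for r in log if r.prompt_sha256 == digest}
```

With per-user keys the lookup returns a person. With a shared service account it returns the account name, which answers nothing for legal or compliance.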

SSO integration into AI platforms — Connecting an AI tool's authentication to the enterprise identity provider via OIDC or SAML, so users authenticate with their corporate credentials rather than a platform-native account. When it comes up: Every conversation about offboarding — if an employee's Okta account is deprovisioned, their AI platform access should terminate with it. Don't confuse with: API key management; SSO governs human authentication, not machine-to-machine credential exchange.

Workspace-level RBAC — Role-based access controls applied at the AI platform's workspace or project level, controlling which users or groups can access which models, data connections, or tool integrations. When it comes up: When a customer asks how they prevent the marketing team from accessing the legal team's document-grounded AI instance. Don't confuse with: Gateway-level rate limits, which control volume, not access.

Okta note: Okta Workforce Identity Cloud supports OIDC and SAML SSO into major AI platforms including Microsoft Copilot, Google Gemini for Workspace, and Anthropic Claude for Enterprise. SCIM provisioning handles automated user lifecycle — when an employee is terminated in Okta, downstream AI platform access is deprovisioned without manual intervention. Microsoft Entra ID covers the same patterns for organizations in the Microsoft stack; the two coexist in most large enterprises, often with Okta federating into Entra.
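
For readers who want to see what automated deprovisioning looks like on the wire, the sketch below builds a SCIM 2.0 PATCH request that sets a user inactive. It shows one form permitted by the SCIM protocol (RFC 7644); the exact payload a given identity provider sends can differ, and the base URL and user ID here are placeholders.

```python
import json

SCIM_PATCH_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


def deactivate_user_patch() -> dict:
    """SCIM PatchOp body that marks a user as inactive."""
    return {
        "schemas": [SCIM_PATCH_SCHEMA],
        "Operations": [{"op": "replace", "path": "active", "value": False}],
    }


def deactivate_request(base_url: str, user_id: str) -> dict:
    """Describe the HTTP request without sending it (no network call here)."""
    return {
        "method": "PATCH",
        "url": f"{base_url.rstrip('/')}/Users/{user_id}",
        "headers": {"Content-Type": "application/scim+json"},
        "body": json.dumps(deactivate_user_patch()),
    }
```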

If you remember nothing else: On its own, the AI platform doesn't know who the user is. SSO is the only mechanism that makes the model's actions attributable to a person.

The FinOps Layer

Token (as spend unit) — The unit of AI consumption: roughly three-quarters of a word for most models, priced per thousand or million at rates that vary by model, modality, and whether the request hits cache. When it comes up: Budget conversations — tokens are the line item, and finance needs a way to map them to teams and projects. Don't confuse with: Auth tokens. Same word, completely different domain. See the vocabulary collision table below.

Prompt caching — Provider-side storage of repeated prompt prefixes, billed at reduced rates on cache hits. A CFO-visible line item once spend scales. When it comes up: When a customer asks how they reduce AI costs without changing models — caching is the first lever, and the gateway's cache-hit-rate metric is how you prove it's working. Don't confuse with: Response caching at the application layer, which stores full outputs and bypasses the model entirely.
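
The cost effect of cache hits is simple arithmetic, sketched below. The rates are made-up illustrative numbers, not any provider's price list, and the model ignores any separate charge a provider may apply for writing to the cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Prices in dollars per million tokens (hypothetical values)."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def request_cost(rates: Rates, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost of one request when cached_tokens of the input hit the cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    total = (
        uncached * rates.input_per_m
        + cached_tokens * rates.cached_input_per_m
        + output_tokens * rates.output_per_m
    )
    return total / 1_000_000


def cache_hit_rate(input_tokens: int, cached_tokens: int) -> float:
    return cached_tokens / input_tokens if input_tokens else 0.0
```

With assumed rates of 3.00, 0.30, and 15.00 dollars per million tokens, a request with 10,000 input tokens (8,000 of them cached) and 500 output tokens costs about $0.0159, against about $0.0375 with no cache hits.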

Provisioned throughput — A reserved capacity commitment to a provider, traded for price certainty and guaranteed response latency at scale. When it comes up: Enterprise procurement conversations — provisioned throughput is the AI equivalent of reserved instances, and the showback math works the same way.

If you remember nothing else: Tokens are the new API calls. If you don't budget them, your finance team will budget them for you, and they'll do it by cutting access.

The Data Governance Layer

Zero data retention (ZDR) — A contractual commitment from the provider that prompt and response content is not stored after the API call completes and is not used for model training. When it comes up: When legal asks whether the model vendor can use their prompts to train the next version. ZDR is the answer, and it's contractual, not technical. Don't confuse with: Encryption at rest; ZDR is a retention policy, not a storage security control.

Prompt-boundary DLP — Data loss prevention rules applied at the gateway layer to inspect outbound prompts for sensitive content — PII, credentials, classified data — before they reach the provider. When it comes up: Any regulated industry conversation. The gateway is where DLP runs; the provider never sees the content that gets blocked. Don't confuse with: Response-side filtering, which catches sensitive content in model outputs rather than preventing it from being sent.
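
A prompt-boundary check can be sketched with a few regular expressions. Production DLP relies on far richer detection than this; the three patterns and the inspect_prompt function are illustrative assumptions only.

```python
import re

# Illustrative patterns; real DLP rule sets are broader and tuned per organization.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def inspect_prompt(prompt: str) -> dict:
    """Inspect an outbound prompt before it is forwarded to the provider."""
    findings = sorted(name for name, pattern in PATTERNS.items() if pattern.search(prompt))
    # Any finding blocks the request, so the provider never receives it.
    return {"allowed": not findings, "findings": findings}
```

A gateway could redact instead of block; blocking is shown here because it matches the description above, where the provider never sees the flagged content.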

Regional endpoint routing — Directing AI API traffic to provider infrastructure in a specific geographic region to satisfy data residency requirements. When it comes up: EU customers, public sector accounts, any organization with explicit data localization obligations. Don't confuse with: ZDR; regional routing controls where data is processed, ZDR controls whether it's retained.
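
Residency routing reduces to a lookup that fails closed. The endpoint URLs and region codes below are placeholders, not real provider addresses.

```python
# Hypothetical mapping from residency requirement to in-region endpoint.
REGIONAL_ENDPOINTS = {
    "eu": "https://eu.api.provider.example/v1",
    "us": "https://us.api.provider.example/v1",
}


def endpoint_for(residency: str) -> str:
    """Return the in-region endpoint, or refuse to route if none exists."""
    try:
        return REGIONAL_ENDPOINTS[residency.lower()]
    except KeyError:
        # Fail closed: never fall back to an out-of-region endpoint.
        raise LookupError(f"no in-region endpoint for {residency!r}") from None
```

Failing closed matters when this is combined with fallback routing: a reroute to another region would satisfy availability while violating residency.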

If you remember nothing else: Your prompts are logged. The governance work is knowing who can read those logs and under what legal authority.

The Observability and Evals Layer

Output quality drift — Degradation in model response quality over time, caused by provider model updates, prompt template changes, or shifting input distributions — often invisible without active monitoring. When it comes up: When a customer asks what happens after deployment. Drift is the answer; evals are the detection mechanism. Don't confuse with: Model hallucination as a static property; drift is a change over time, not a baseline characteristic.

LLM-as-judge (online) — Using a second language model to evaluate the quality of production responses on a sampled basis, generating a continuous quality signal without human review of every output. When it comes up: When a customer asks how they know the model is still working correctly next month. Online LLM-as-judge is the scalable answer. Don't confuse with: Human evaluation, which is the ground truth for calibrating the judge model but can't run at production volume.
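
Sampling is what makes online judging affordable. The sketch below picks requests deterministically by hashing the request ID, then scores only the sampled responses. The judge here is any callable supplied by the caller; in production it would wrap a call to a second model, which this example does not attempt to show.

```python
import hashlib
from typing import Callable


def is_sampled(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests (0.0 to 1.0)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform 64-bit integer
    return bucket < int(rate * 2**64)


def judge_sampled(
    responses: dict[str, str],
    judge: Callable[[str], float],
    rate: float,
) -> dict[str, float]:
    """Score only the sampled responses; return request ID -> score."""
    return {rid: judge(text) for rid, text in responses.items() if is_sampled(rid, rate)}
```

Hashing the request ID, rather than drawing a random number, means the same request is always in or out of the sample, which makes a judged score easy to trace back to its log entry.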

Deploy-gated regression suite — A set of automated eval tests that must pass before a prompt template change or model version update is promoted to production. When it comes up: Change management conversations — the eval suite is the AI equivalent of a CI/CD pipeline gate. Don't confuse with: A/B testing; regression suites catch regressions against known-good behavior, not differences between variants.
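
A deploy gate can be reduced to a pass-rate threshold over a fixed set of cases. In this sketch the model is any callable from prompt to output and each case carries its own check; the names and the 0.95 default are assumptions chosen for illustration.

```python
from typing import Callable, Sequence

EvalCase = tuple[str, Callable[[str], bool]]  # (input prompt, check on the output)


def pass_rate(model: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Fraction of eval cases whose output passes its check."""
    if not cases:
        raise ValueError("empty eval suite")
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def gate(model: Callable[[str], str], cases: Sequence[EvalCase], min_pass_rate: float = 0.95) -> bool:
    """True if the candidate may be promoted to production."""
    return pass_rate(model, cases) >= min_pass_rate
```

The threshold is a rate rather than all-or-nothing because model outputs vary between runs, so a suite that demands 100 percent tends to block every change.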

OpenTelemetry GenAI semantic conventions — The emerging standard for what an AI observability span should contain: model name, prompt tokens, completion tokens, tool calls, latency, finish reason. When it comes up: When a customer asks how their existing observability stack handles AI workloads — OTel GenAI conventions are the answer if their tooling supports them.
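
As a rough picture of what such a span carries, the sketch below assembles a plain dictionary of attributes. The attribute names follow the GenAI semantic conventions as published at the time of writing; the conventions are still evolving, so check them against the version your tooling implements. The function name and the example model name are hypothetical, and no OpenTelemetry SDK is used here.

```python
def genai_span_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reasons: list,
    operation: str = "chat",
) -> dict:
    """Build span attributes for one model call as a plain dict."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


attrs = genai_span_attributes("example-model", 1200, 180, ["stop"])
```

Latency does not appear as an attribute because it is the duration of the span itself.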

If you remember nothing else: A model that worked last Tuesday can fail silently today. Observability is the only way to know before a customer does.

Vocabulary Collision Map

Table 1: The Five Terms That Will Cause Problems in Every Meeting

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Token	Billable unit of model input/output; ~¾ of a word	Auth token (JWT, SAML assertion, OAuth bearer)	Completely unrelated concepts sharing a word. In AI, tokens are consumed and counted. In IDAM, tokens are issued and validated. Clarify which domain you're in before the sentence lands.
Session	The stateless context window of a single conversation; no server-side session state	Authenticated session with server-side state, timeout, and revocation	LLM sessions don't expire via inactivity and can't be revoked mid-conversation. The gateway manages session continuity; the model doesn't.
Scope	FinOps budget scope: the organizational unit (team, project, cost center) to which spend is attributed	OAuth scope: the permissions granted to a token	In FinOps, scope is about cost attribution. In OAuth, scope is about access rights. A "scoped" AI deployment might mean budget-bounded or permission-bounded — ask which.
Agent	An AI system that takes multi-step actions using tools, APIs, or other models autonomously	Identity agent (Okta agent, AD connector) — a lightweight process that syncs identity data	An AI agent acts on behalf of a user or system and needs its own identity. An identity agent is infrastructure that moves identity data. The AI agent is the principal; the identity agent is the plumbing.
Gateway	A proxy layer purpose-built for AI traffic: prompt inspection, token counting, model routing, DLP	API gateway (Kong, Apigee, AWS API GW) — general-purpose HTTP proxy with auth and rate limiting	An API gateway doesn't understand prompt structure or token economics. An AI gateway does. You can build AI gateway functions on top of an API gateway, but the out-of-box behavior is different.

Table 2: Architectural Layer to Enterprise IT Equivalent

AI Layer	What It Does	Enterprise IT Equivalent	Key Divergence
AI Gateway	Central proxy for all model traffic; enforces auth, DLP, rate limits, logging	Secure Web Gateway / API Gateway	The AI gateway must parse prompt content to enforce policy. A traditional SWG operates on URLs and headers.
Per-user key model	Binds every AI request to an authenticated user identity	PAM / privileged session recording	AI sessions are not privileged in the PAM sense, but the attribution requirement is identical: every action traceable to a person.
Workspace RBAC	Controls which users access which AI projects and models	RBAC in cloud IAM (AWS IAM, Azure RBAC)	AI workspace roles are platform-defined and often coarser than cloud IAM roles. Mapping them to enterprise role taxonomy requires manual work.
ZDR commitment	Provider contractual promise not to retain or train on prompt data	DPA / data processing agreement	ZDR is a subset of DPA terms. A signed DPA doesn't guarantee ZDR; it must be explicitly negotiated.
Deploy-gated eval suite	Automated quality tests that gate model/prompt updates	CI/CD pipeline with test gates	Eval suites test output semantics, not code correctness. Pass/fail thresholds are probabilistic, not binary.

Table 3: Identity Concepts Specific to AI Platforms

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Service account for AI	A non-human identity used by an application to call model APIs	Service account / machine identity	AI service accounts accumulate prompt history and tool permissions that traditional service accounts don't. Rotation cadence and scope review need to account for this.
Tool permission	Authorization for an AI agent to call a specific external API or function	OAuth scope / API permission	Tool permissions are granted at agent configuration time, often without a formal approval workflow. This is the gap that IDAM governance needs to close.
Prompt template as code	A versioned, access-controlled artifact that defines model behavior for an application	Configuration artifact in a secrets manager or IaC repo	Prompt templates contain instructions that can override model safety behavior. They require the same access controls as application code, not just documentation.

What Comes Next

The architecture mapped in this chapter answers the operational questions: who can access which AI, through which gateway, at what cost, with what data handling, and with what quality signal. Governing the plumbing is table stakes. The risk and compliance chapter moves to the harder problem: what happens when the answer to any of those questions turns out to be wrong, or when an auditor, regulator, or legal team needs to prove it was right. That's where liability actually lives, and that's where the next chapter picks up.

For More Information

Recap Entry	Source Article	Section
Sanctioned AI, shadow AI, three-tier policy model, request-to-allowlist lifecycle	Lesson 2: Shadow AI and the Provisioning Gap	"The Three-Tier Model"
AI gateway, virtual key, fallback routing	Lesson 3: AI Gateways and Centralized Control	"What a Gateway Actually Does"
SSO integration, per-user vs. per-application keys, workspace RBAC	Lesson 4: Identity, SSO, and the AI Platform	"Binding Identity to the Model"
Token as spend unit, prompt caching, provisioned throughput	Lesson 5: FinOps for AI — Budgeting the Unpredictable	"The Token Economy"
ZDR, prompt-boundary DLP, regional endpoint routing	Lesson 6: Data Governance and the Prompt Boundary	"Where Your Prompts Go"
Output quality drift, LLM-as-judge, deploy-gated regression, OTel GenAI conventions	Lesson 7: Observability, Evals, and the Quality Signal	"After Deployment"
Vocabulary collision map (all five terms)	Lesson 1: The Enterprise AI Stack	"Terminology That Will Confuse Your Customers"
