Recap — A Reference Architecture for Enterprise AI

Six-layer reference architecture mapping identity, cost, and audit ownership across every stage of enterprise AI model access.

By Leigh Garrity, May 8, 2026

Consolidation document for Lessons 1–6. A scaffold you can retrieve in a parking lot before a call, point at during one, and use to locate which layer owns identity, cost, or the audit log in any AI initiative a buyer describes.

The architecture, end to end:

Reference Architecture

employee or application → SSO → gateway (authentication, routing, DLP, caching, logging) → provider with negotiated data terms → observability and eval pipeline → FinOps dashboard with chargeback

Six layers. Each answers "who touched the model?" from a different angle.

Layer 1: Discovery

The question this layer answers: Does this model or agent exist in inventory?

Shadow AI — AI tools adopted without security review, operating outside sanctioned scope. When it comes up: The buyer says "we don't have an AI problem yet." They do. They don't have inventory. Don't confuse with: Sanctioned AI that lacks governance. Shadow means unknown. Ungoverned means known but uncontrolled. The remediation paths diverge completely.

Agent Inventory — A registry of every AI agent in the environment, with an assigned human owner. When it comes up: The CISO asks "how many agents are calling our APIs right now?" Silence in the room means this layer is missing. Don't confuse with: A service account catalog. As you saw in the access lessons, agents are non-deterministic actors with delegated authority. Traditional NHI inventory tracks credentials, not intent or tool-chaining behavior.

Okta for AI Agents (GA as of April 30, 2026) discovers agents through OAuth consent grant detection and registers them in Universal Directory with assigned human owners. MCP servers register as governed resources.
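The registry idea can be sketched in a few lines. This is an illustration only: `AgentRecord`, its fields, and `unowned_agents` are hypothetical names, not the Universal Directory schema or any Okta API. The point it shows is that "who owns this agent?" becomes a query once the inventory exists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentRecord:
    """Hypothetical inventory entry; field names are illustrative only."""
    agent_id: str
    owner: Optional[str]   # accountable human, or None if nobody has claimed it
    sanctioned: bool       # True once the agent has passed security review


def unowned_agents(inventory):
    """Return the IDs of agents that have no assigned human owner."""
    return [agent.agent_id for agent in inventory if not agent.owner]


inventory = [
    AgentRecord("meeting-scheduler", "avery@example.com", True),
    AgentRecord("report-summarizer", None, True),
]
print(unowned_agents(inventory))
```

An agent that never appears in `inventory` at all is the shadow case; an agent that appears with `owner=None` is the ungoverned case.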

If you remember nothing else: Shadow AI is an inventory failure with audit consequences. Every other layer assumes this one exists.

Layer 2: Access

The question this layer answers: Who authenticated to use this model, and under whose authority?

Per-User Attribution — Every model request traces to an authenticated human identity, not a shared service account. When it comes up: The buyer asks "can we tell which analyst ran that query?" If the answer involves a shared service account, attribution is broken. Don't confuse with: API key authentication. An API key identifies the application. Per-user attribution identifies the person.

Short-Lived Credentials — Scoped, time-limited access tokens issued to agents. Valid for minutes, not hours. When it comes up: The security team asks how agent access gets revoked. Short-lived credentials mean revocation is the default state; the credential expires unless explicitly renewed. Don't confuse with: LLM tokens. This collision causes the most damage in live conversations. Short-lived credentials are OAuth access tokens. LLM tokens are units of text the model processes. When someone says "token budget," ask: cost tokens or auth tokens?

Delegated Authority — An agent acting under a human's identity, carrying that human's authorization context across tool calls. When it comes up: "If the agent books a meeting on my behalf, whose permissions did it use?" Yours, scoped down, time-limited, and logged. Don't confuse with: Impersonation. Delegation is explicit and auditable. Impersonation is what happens when delegation wasn't designed.
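A minimal sketch of how short-lived credentials and delegated authority fit together, assuming a made-up credential shape. `DelegatedCredential`, `issue`, `is_allowed`, and the five-minute TTL are all hypothetical; real deployments would use OAuth access tokens issued by the identity provider.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DelegatedCredential:
    """Hypothetical credential shape; not a real token format."""
    user: str             # human whose authority is being delegated
    agent: str            # agent acting on that human's behalf
    scopes: frozenset     # scoped-down permissions, a subset of the user's own
    expires_at: datetime  # short TTL: minutes, not hours


def issue(user, agent, scopes, now, ttl=timedelta(minutes=5)):
    """Issue a credential that expires on its own unless renewed."""
    return DelegatedCredential(user, agent, frozenset(scopes), now + ttl)


def is_allowed(cred, scope, now):
    """Deny once the TTL lapses or when the scope was never granted."""
    return now < cred.expires_at and scope in cred.scopes


t0 = datetime(2026, 5, 8, 12, 0, tzinfo=timezone.utc)
cred = issue("avery@example.com", "meeting-scheduler", ["calendar.write"], t0)
print(is_allowed(cred, "calendar.write", t0 + timedelta(minutes=4)))
print(is_allowed(cred, "calendar.write", t0 + timedelta(minutes=6)))
```

The credential names both the human and the agent, which is what keeps delegation auditable rather than collapsing into impersonation.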

Okta's Cross App Access (XAA) protocol (currently in early access) extends OAuth for agent-to-app connections using the Identity Assertion Authorization Grant adopted by the IETF OAuth Working Group.

If you remember nothing else: Access without attribution is a shared credential with extra steps.

Layer 3: Routing

The question this layer answers: Which model, which provider, and who decided?

AI Gateway — The enforcement point between users and model providers. Handles authentication, model selection, DLP scanning, caching, and logging. When it comes up: Every conversation about "how do we control what goes to which model." That's the gateway. Don't confuse with: An API gateway. An AI gateway understands prompt structure, token counting, and content policy. A standard API gateway routes HTTP requests. The AI gateway inspects what's inside them.

Model Selection and Fallback — The gateway's logic for routing a request to a specific model or provider, including fallback when a primary model is unavailable or cost-constrained. When it comes up: "We want GPT-4o for complex analysis but a smaller model for routine summarization." The gateway makes that routing decision based on policy, not the end user's preference. Don't confuse with: Load balancing. Load balancing distributes identical requests. Model selection routes different requests to different models based on task, cost, or data sensitivity.
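The routing decision described above can be sketched as a policy lookup with ordered fallback. The model names, the `ROUTES` table, and the `CLEARED_FOR_SENSITIVE` set are placeholders invented for illustration; a real gateway would load policy from configuration.

```python
# Hypothetical routing policy: task type -> models in order of preference.
ROUTES = {
    "complex_analysis": ["large-model", "medium-model"],
    "summarization": ["small-model"],
}
# Assumed set of models approved for sensitive data.
CLEARED_FOR_SENSITIVE = {"medium-model", "small-model"}


def select_model(task, available, sensitive=False):
    """Return the first policy-preferred model that is available and permitted."""
    for model in ROUTES.get(task, []):
        if model not in available:
            continue  # primary is down or over budget: fall back to the next
        if sensitive and model not in CLEARED_FOR_SENSITIVE:
            continue  # policy, not user preference, blocks this model
        return model
    raise LookupError(f"no permitted model for task {task!r}")


print(select_model("complex_analysis", {"large-model", "medium-model"}))
print(select_model("complex_analysis", {"medium-model"}))
```

Unlike load balancing, the request's task and sensitivity change the answer, not just which replica is free.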

Virtual Keys — Provider API keys managed by the gateway, invisible to end users. The gateway authenticates the user via SSO, then selects the right provider credential. When it comes up: "How do we avoid embedding API keys in every application?" Virtual keys centralize credential management at the gateway. Don't confuse with: User credentials. Virtual keys authenticate the gateway to the provider. The user authenticates to the gateway.
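A rough sketch of the two-hop authentication, under stated assumptions: `VIRTUAL_KEYS`, `resolve_provider_key`, and the secret-store mapping are hypothetical, and the SSO check is reduced to "a user identity is present".

```python
# Hypothetical mapping from a virtual key to the name of the real provider secret.
VIRTUAL_KEYS = {"vk-analytics": "PROVIDER_A_API_KEY"}


def resolve_provider_key(authenticated_user, virtual_key, secret_store):
    """Gateway-side lookup: the caller never sees the provider credential."""
    if not authenticated_user:
        raise PermissionError("user must authenticate to the gateway first")
    secret_name = VIRTUAL_KEYS[virtual_key]
    return secret_store[secret_name]


# Stand-in for a secrets manager; the value is a dummy.
secret_store = {"PROVIDER_A_API_KEY": "sk-test"}
provider_key = resolve_provider_key("avery@example.com", "vk-analytics", secret_store)
```

Rotating the provider key then means updating one entry in the secret store rather than redeploying every application.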

If you remember nothing else: The gateway is where identity policy meets model access. Without it, every application team builds its own control plane.

Layer 4: Cost

The question this layer answers: Who pays for this request?

Token-Level Chargeback — Attributing model consumption costs to the team, project, or user that generated the request. When it comes up: The CFO asks "why did our AI spend triple last quarter?" Without per-user, per-model attribution at the token level, nobody can answer. Don't confuse with: Cloud resource chargeback. Cloud costs map to compute hours. AI costs map to token consumption, which varies by model, prompt length, and caching.
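The arithmetic behind chargeback is simple once each request carries a team or user identity. In this sketch the prices, model names, and log format are assumptions made up for illustration, not real provider rates.

```python
from collections import defaultdict

# Assumed prices in USD per million tokens; placeholders, not real rates.
PRICE_PER_MTOK = {
    "large-model": {"input": 5.00, "output": 15.00},
    "small-model": {"input": 0.50, "output": 1.50},
}


def chargeback(usage_log):
    """Sum cost per team from (team, model, input_tokens, output_tokens) records."""
    totals = defaultdict(float)
    for team, model, tokens_in, tokens_out in usage_log:
        price = PRICE_PER_MTOK[model]
        cost = tokens_in * price["input"] + tokens_out * price["output"]
        totals[team] += cost / 1_000_000
    return dict(totals)


usage_log = [
    ("finance", "large-model", 200_000, 40_000),
    ("finance", "small-model", 1_000_000, 100_000),
    ("support", "small-model", 400_000, 200_000),
]
for team, usd in sorted(chargeback(usage_log).items()):
    print(f"{team}: ${usd:.2f}")
```

The same token counts cost ten times more on the larger model here, which is why per-model attribution matters as much as per-team attribution.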

Prompt Caching — Reusing previously computed prompt prefixes to reduce cost and latency. Recall from the FinOps lesson: the OTel spec now tracks gen_ai.usage.cache_read.input_tokens separately from uncached tokens, which is how you measure whether caching is actually saving money. When it comes up: The FinOps team asks how to cut per-request cost without switching models. Caching is the first lever. Don't confuse with: Response caching. Prompt caching reuses the model's internal computation on repeated input prefixes. Response caching returns a stored output without calling the model at all.
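To check whether caching is saving money, compare what the cached tokens actually cost against what they would have cost at the full input rate. The sketch below assumes cached reads are billed at a fraction of the normal rate; the 10% `cache_discount` and the function names are hypothetical, so substitute the provider's published pricing.

```python
def input_cost(uncached_tokens, cache_read_tokens, price_per_mtok, cache_discount=0.10):
    """Input cost in USD when cached reads are billed at a fraction of the full rate.

    cache_discount is an assumed multiplier, not a real provider's price.
    """
    billable = uncached_tokens + cache_read_tokens * cache_discount
    return billable * price_per_mtok / 1_000_000


def cache_savings(uncached_tokens, cache_read_tokens, price_per_mtok, cache_discount=0.10):
    """Savings versus paying the full input rate for every token."""
    full = input_cost(uncached_tokens + cache_read_tokens, 0, price_per_mtok)
    actual = input_cost(uncached_tokens, cache_read_tokens, price_per_mtok, cache_discount)
    return full - actual


# 900k of 1M input tokens served from cache, at an assumed $5 per million tokens.
print(f"${cache_savings(100_000, 900_000, 5.00):.2f} saved")
```

Feeding `cache_read_tokens` from the separately tracked cache-read counter is what makes this measurable per request rather than estimated.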

If you remember nothing else: Cost attribution requires the same per-user identity thread that access control does. Solve Layer 2 and Layer 4 is plumbing. Skip Layer 2 and Layer 4 is impossible.

Layer 5: Data

The question this layer answers: Whose data left the perimeter, and under what terms?

Zero Data Retention (ZDR) — A contractual commitment from the provider not to store prompt and response content beyond in-memory processing. When it comes up: Every public sector conversation about data sovereignty. ZDR is the first question after "where does the data go?" Don't confuse with: A technical guarantee. ZDR is contractual, not architectural. The contract constrains the provider's behavior; the architecture constrains yours.

Provider	Default Retention	ZDR Available	Gate
OpenAI API	30 days	Yes	Enterprise, sales-gated
Azure OpenAI	30 days	Yes (Modified Abuse Monitoring)	EA/MCA only
AWS Bedrock	No storage by default	Architecture-level	Default behavior
Anthropic	30 days (official); conflicting sources — verify current DPA	Yes, addendum	Enterprise approval

Not covered in Lessons 1–6, but worth knowing: OpenAI's current docs reserve the right to suspend ZDR eligibility for specific customers using gpt-5.5 and future models if severe risk activity is detected. ZDR is not unconditional for the newest models.

DLP at the Gateway — Content inspection on prompts before they reach the provider. Prevents sensitive data from leaving the perimeter. When it comes up: "How do we stop someone from pasting PII into ChatGPT?" Don't confuse with: Model-side content filtering. The provider's filter catches policy violations in the output. DLP catches sensitive data in the input, before it leaves your environment.
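A toy version of input-side inspection, to make the direction of the check concrete. The two regular expressions and the function names are illustrative assumptions; production DLP relies on far broader, validated detectors than a pair of patterns.

```python
import re

# Illustrative patterns only; real DLP needs much wider and validated coverage.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_prompt(prompt):
    """Return the sorted names of sensitive patterns found in the prompt."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(prompt))


def redact(prompt):
    """Replace each match with a placeholder before the prompt leaves the perimeter."""
    for name, rx in PATTERNS.items():
        prompt = rx.sub(f"[{name.upper()}]", prompt)
    return prompt


prompt = "Contact jane.doe@example.com, SSN 123-45-6789"
findings = scan_prompt(prompt)
print(findings, redact(prompt))
```

Whether the gateway blocks, redacts, or only logs on a finding is a policy decision; the sketch only shows that the check runs on the prompt, before the provider sees it.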

If you remember nothing else: The contract says one thing. The architecture enforces another. Know which layer you're relying on, because the buyer's auditor will.

Layer 6: Quality

The question this layer answers: Whose request produced this output, and was the output any good?

OTel GenAI Semantic Conventions — OpenTelemetry's specification for instrumenting model calls with structured telemetry: model name, token counts, latency, cache utilization. When it comes up: "How do we get observability across multiple model providers?" OTel is the vendor-neutral answer, and as of May 2026, it's the only one with real traction. Don't confuse with: A stable specification. Every GenAI semantic convention is still in development status. No stable release exists. Datadog and Grafana are adopting it anyway. Plan for breaking changes; pin your instrumentation version.
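As a sketch, the telemetry for one model call might be assembled as a plain attribute dictionary like the one below. The `gen_ai.*` names follow the in-development GenAI conventions as best understood here and may change between versions; `user.id` as the identity attribute and the `genai_span_attributes` helper are assumptions. In a real setup these would be set on a span through the OpenTelemetry SDK rather than returned as a dict.

```python
def genai_span_attributes(user_id, model, input_tokens, output_tokens, cache_read_tokens=0):
    """Build attributes for one model call (names may change; the spec is unstable)."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.usage.cache_read.input_tokens": cache_read_tokens,
        "user.id": user_id,  # assumed attribute carrying the Layer 2 identity
    }


attrs = genai_span_attributes("avery@example.com", "large-model", 1200, 300, 900)
for key, value in attrs.items():
    print(f"{key} = {value}")
```

Keeping the attribute names in one helper like this is one way to absorb breaking changes: a spec revision becomes a single edit rather than a search across every call site.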

Eval Pipeline — Automated quality checks on model outputs: accuracy, relevance, policy compliance, hallucination detection. When it comes up: "How do we know the model is giving good answers?" This is where the eval pipeline lives. Don't confuse with: Observability. Observability tells you what happened. Eval tells you whether what happened was good.
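A deliberately simple sketch of what "automated quality checks" can mean in practice. The three rule-based checks and the `eval_output` helper are hypothetical stand-ins; real pipelines typically add graded test sets and model-based judges for accuracy and hallucination detection.

```python
def eval_output(output, required_terms, banned_terms, max_chars=2000):
    """Run toy rule-based checks on one model output; returns {check_name: passed}."""
    text = output.lower()
    return {
        "covers_required_terms": all(t.lower() in text for t in required_terms),
        "no_banned_terms": not any(t.lower() in text for t in banned_terms),
        "within_length": len(output) <= max_chars,
    }


results = eval_output(
    "Revenue grew 4% in Q3, driven by renewals.",
    required_terms=["revenue", "Q3"],
    banned_terms=["guaranteed"],
)
print(results, "PASS" if all(results.values()) else "FAIL")
```

Joining these results to the telemetry for the same request is what lets you ask both "who ran this" and "was it any good" about a single output.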

If you remember nothing else: Observability answers "who did what." Eval answers "should they have." You need both, and neither substitutes for the other.

Vocabulary Collision Tables

Token

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Token (LLM)	Chunk of text; unit of consumption and cost	OAuth/access token	LLM tokens carry no authorization. They are billing units, not security artifacts.
Token budget	Max token count before cost or context limit hits	Rate limit / quota	Governs cost and model attention, not data access.
Short-lived token	Scoped, time-limited credential for agents	Access token with TTL	Terms converge here. Okta's short-lived agent credentials are access tokens, not LLM tokens.

Session and Context

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Context window	Model's working memory for one request	Session state	Not persistent, not access-controlled, not an audit log. Can be truncated without the user knowing.
Session (AI)	Conversation thread or multi-turn interaction	Authenticated session	AI session does not map to auth state. The model may appear continuous while the app reconstructs context from summaries.
Security context	No model equivalent	Authentication/authorization state of a principal	Policy enforcement lives outside the model. Always.

Agent and Scope

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Agent (AI)	Autonomous LLM system that selects tools and takes actions	Service account / NHI	Non-deterministic. Service account models handle lifecycle but not intent, tool chaining, or delegated authority.
Scope (agent)	Task boundary, tool permissions, operational blast radius	OAuth scope	OAuth scopes are static permission strings. They don't capture whether a specific tool invocation is appropriate for the agent's current task.
Identity (agent)	The agent's credential-bearing identity, distinct from the human it acts for	Machine identity / NHI	Sits between human and machine identity. Carries delegated human authority that traditional NHI models weren't designed for.

Source Index

Shadow AI, Agent Inventory: Lesson 1 (shadow AI landscape, discovery mechanisms)
Per-User Attribution, Short-Lived Credentials, Delegated Authority: Lessons 3–4 (SSO and per-user attribution, agent credential lifecycle)
AI Gateway, Model Selection, Virtual Keys: Lesson 2 (gateway architecture)
Token-Level Chargeback, Prompt Caching: Lesson 4 (FinOps and cost attribution)
ZDR, DLP at the Gateway: Lesson 5 (data governance and provider terms)
OTel GenAI Conventions, Eval Pipeline: Lesson 6 (observability and quality)
Vocabulary Collisions: Cross-cutting; consolidated from all six lessons

Handoff to Risk and Compliance

Everything above is operational plumbing. It answers how model access gets authenticated, routed, metered, scanned, and observed.

The risk and compliance chapter sits on top of this architecture. It assumes every layer here is in place, then asks the harder questions: threat models, compliance frameworks, incident response for AI-specific failure modes.

"The risk is not that the model says something wrong. The risk is that the model says something wrong while holding a credential, calling a tool, reading sensitive context, or acting under a user's delegated authority."

The plumbing makes governance possible. The risk frameworks tell you what "possible" has to mean.

Things to follow up on...

OpenAI's new ZDR carve-out: OpenAI's current developer docs reserve the right to suspend ZDR eligibility for specific customers using gpt-5.5 and future models if severe risk activity is detected, which means ZDR is no longer an unconditional guarantee for the newest models.
OTel MCP tool tracing: The latest OpenTelemetry GenAI spec now includes conventions for tracing MCP tool executions alongside model calls, which connects the observability layer directly to agent tool-chaining behavior.
FinOps Foundation AI certification: The FinOps Foundation launched FinOps Certified: FinOps for AI as the share of practitioners managing AI spend jumped from 31% to 63% year over year, signaling that token-level cost governance is moving from niche concern to core competency.
Anthropic retention policy conflict: Anthropic's official Privacy Center states 30-day retention for commercial API users as of March 2026, while multiple third-party sources consistently cite a September 2025 reduction to 7 days, so verify against the current DPA before using any specific figure in a buyer conversation.