What's Actually in That "Vector Database" Box

By Carey Whitten, May 5, 2026

Every enterprise AI architecture diagram has one. It sits between the data sources and the AI model, labeled "vector database" or "embedding store," and in most meetings it gets a nod and a pass. Time to open it.

An Embedding Is a Coordinate

An embedding is a fixed-length list of numbers — a coordinate — that encodes the meaning of a piece of text, not its literal words. A model trained on large text corpora learns to map text into a high-dimensional space where semantic similarity corresponds to geometric proximity. "Contractor offboarding" and "vendor access termination" land near each other. "Contractor offboarding" and "quarterly earnings report" land far apart. The words share nothing; the meanings share a neighborhood.

The coordinate is fixed-length regardless of input length. A three-word query and a three-page policy document both produce a vector of the same dimensionality. OpenAI's text-embedding-3-large, for example, produces 3,072 numbers per input by default. That number is illustrative of the scale, not a figure to memorize; model specs change. What matters is the structure: every piece of text becomes a point in the same space, which makes comparison possible.
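To make the fixed-length property concrete, here is a minimal sketch. The function `toy_embed` is a hypothetical stand-in built on word hashing, not a real embedding model: it reproduces only the "same length out, whatever goes in" behavior and captures nothing about meaning. The 8 dimensions are an assumption chosen for readability.

```python
import hashlib

DIMENSIONS = 8  # assumption for readability; real models use hundreds or thousands


def toy_embed(text: str, dimensions: int = DIMENSIONS) -> list[float]:
    """Hypothetical stand-in for an embedding model.

    Each word is hashed into one of `dimensions` slots and counted.
    This shows the fixed-length property only; it does NOT encode meaning.
    """
    vector = [0.0] * dimensions
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[digest[0] % dimensions] += 1.0
    return vector


short_query = toy_embed("contractor offboarding steps")
long_document = toy_embed(
    "Third-party access must be revoked within one business day. " * 200
)

# Both inputs produce a vector with the same number of entries.
print(len(short_query), len(long_document))
```

A real model replaces the hashing step with a learned neural network, which is what lets nearby coordinates mean something.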

How Similarity Search Actually Works

Text goes in. The embedding model (a neural network running either locally or behind an API) processes the text and outputs the coordinate vector. That vector gets stored. When a user submits a query, the same model converts the query into its own vector. The system then finds the stored vectors that are geometrically closest to the query vector, typically measured by cosine similarity (the cosine of the angle between two vectors, which ignores their length) or Euclidean distance (the straight-line distance between them).
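The two measures are short enough to write out. This is a plain-Python sketch for illustration; the two-dimensional vectors are made up, and production systems use optimized numeric libraries rather than loops like these.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b: 0.0 means identical points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up 2-d vectors chosen so the arithmetic is easy to follow.
query = [1.0, 0.0]
same_direction = [3.0, 0.0]   # points the same way, but three times as long
perpendicular = [0.0, 2.0]    # points at a right angle to the query

print(cosine_similarity(query, same_direction))   # 1.0
print(cosine_similarity(query, perpendicular))    # 0.0
print(euclidean_distance(query, same_direction))  # 2.0
```

Note the difference: cosine similarity treats `same_direction` as a perfect match because it ignores length, while Euclidean distance still sees the two points as 2.0 apart.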

The search is geometric, not lexical. Which stored coordinates sit nearest to this query coordinate? A document that never uses the word "offboarding" can still be the closest match to a query about offboarding, because the model learned that the underlying concepts occupy the same region of the space.
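A toy version of that lookup is sketched below. The document titles and the three-dimensional vectors are hand-picked assumptions standing in for what an embedding model would produce; the point is only that ranking happens on coordinates, so a title without the word "offboarding" can still come out on top.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical stored documents with hand-picked vectors (not model output).
stored = {
    "Vendor Access Termination Procedure": [0.9, 0.1, 0.0],
    "Badge Return Checklist": [0.6, 0.7, 0.1],
    "Quarterly Earnings Report": [0.0, 0.2, 0.9],
}

# Hypothetical vector for the query text "contractor offboarding".
query_vector = [0.8, 0.2, 0.1]


def nearest(query, store, top_k=2):
    """Score every stored vector against the query and return the top_k titles."""
    scored = [(cosine_similarity(query, vec), title) for title, vec in store.items()]
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_k]]


results = nearest(query_vector, stored)
print(results)
```

No string comparison happens anywhere in `nearest`; the words in the titles play no part in the ranking.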

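The scalability question is legitimate. Comparing each query vector against every stored vector costs time proportional to the size of the collection, which becomes painful as the collection grows into the millions. Vector databases solve this with approximate nearest neighbor algorithms (HNSW and IVF are the two you'll hear most often) that build index structures allowing the search to skip most of the space and still return results that are correct often enough to be useful. With a well-tuned index, query times can stay in the millisecond range even at very large scale, depending on hardware, vector dimensionality, and how much accuracy you are willing to trade away. That's why vector search is usually treated as its own infrastructure layer, whether as a dedicated database or as an indexing extension such as pgvector on Postgres, rather than as an ordinary table scanned row by row.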
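The "skip most of the space" idea can be sketched in the style of IVF. This is a simplified illustration under stated assumptions: the centroids are fixed by hand (a real IVF index learns them, typically with k-means), the vectors are made up, and `build_index`, `search`, and `n_probe` are names chosen for this example rather than any library's API.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def build_index(vectors, centroids):
    """Assign every stored vector to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vector_id, vector in vectors.items():
        buckets[nearest_centroid(vector, centroids)].append(vector_id)
    return buckets


def search(query, vectors, centroids, buckets, n_probe=1):
    """Scan only the n_probe buckets whose centroids are closest to the query."""
    ranked = sorted(range(len(centroids)), key=lambda i: squared_distance(query, centroids[i]))
    candidates = [vid for i in ranked[:n_probe] for vid in buckets[i]]
    best = min(candidates, key=lambda vid: squared_distance(query, vectors[vid]), default=None)
    return best, len(candidates)


# Made-up 2-d vectors and hand-placed centroids (assumptions for illustration).
vectors = {
    "doc_a": [0.1, 0.2], "doc_b": [0.2, 0.1],
    "doc_c": [0.9, 0.8], "doc_d": [0.8, 0.9],
    "doc_e": [0.1, 0.9],
}
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
buckets = build_index(vectors, centroids)

best, compared = search([0.85, 0.75], vectors, centroids, buckets)
print(best, f"(compared {compared} of {len(vectors)} vectors)")
```

The trade-off is visible in `n_probe`: scanning fewer buckets is faster, but a query that falls near a bucket boundary can miss its true nearest neighbor, which is why the results are called approximate.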

Okta Concept Mapping

Closest IDAM analogy: an attribute store.

A directory stores structured attributes about principals — department, role, clearance level — and answers queries against those attributes. A vector database stores coordinates about content and answers proximity queries against those coordinates. Both are purpose-built stores optimized for a specific query pattern.

Where the analogy breaks: Every attribute in your directory was defined by a human, carries a label, and can be governed by policy. An embedding has no labeled fields. Dimension 847 of a 3,072-dimensional vector encodes something the model learned implicitly, and no standard tooling reliably tells you what. You can write an access policy for a clearance-level attribute. Writing a governance policy for what's encoded in an embedding is an open problem. The infrastructure looks familiar; the auditability assumptions do not transfer.
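The contrast is easy to see side by side. In this sketch the directory entry, the policy function `may_view_restricted`, the clearance threshold, and the embedding values are all hypothetical, invented only to show the shape of each record.

```python
# A directory entry: every field has a human-defined name.
directory_entry = {
    "department": "Finance",
    "role": "contractor",
    "clearance_level": 2,
}


def may_view_restricted(principal: dict) -> bool:
    """Hypothetical policy: it can name the attribute it governs."""
    return principal["clearance_level"] >= 3


# An embedding: positions only, no field names (made-up values, truncated to 4).
embedding = [0.12, -0.53, 0.08, 0.91]

print(may_view_restricted(directory_entry))  # False
print(sorted(directory_entry))               # the labels a policy can reference
print(list(range(len(embedding))))           # all an embedding offers: indexes
```

A rule like "deny if `clearance_level` is below 3" has something to point at. A rule about index 2 of the embedding has nothing comparable, because nobody assigned that position a meaning.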

When You'll Need This

The conversation comes up when a CAIO (Chief AI Officer) asks how the system finds relevant documents even when users don't know the exact policy language: a query about "who approves contractor system access" should surface the document titled "Third-Party Privileged Access Management Procedures" even though the query shares almost none of the title's wording. Retrieving that match is the vector database's job.

It also comes up in procurement, where agencies are standing up AI-assisted search across large document repositories — policy libraries, contract archives, security guidance. The RFI will ask about "semantic search capability." The architecture diagram will have the box. Underneath both is the same question: how does the system know what's relevant when the vocabulary doesn't match?

The answer is that relevance was converted into geometry, stored as coordinates, and queried by proximity. The vector database is the map. The embedding is how everything got placed on it.

What the map doesn't yet have is a legend anyone can read — and that's the part worth flagging when the conversation turns to governance.