What an Embedding Is
An embedding is a fixed-length array of floating-point numbers — a vector — produced by a neural network when it processes a piece of text. The network is trained so that semantically similar content produces vectors that are close together in the resulting mathematical space, and dissimilar content produces vectors that are far apart. The position of the vector encodes meaning. "Budget appropriation" and "fiscal allocation" land near each other. "Budget appropriation" and "network topology" do not.
The space these vectors inhabit has many dimensions. A modern text embedding model typically produces vectors of 768 to 3,072 dimensions, depending on the model architecture. "High-dimensional space" sounds like it belongs in a physics seminar, but the practical implication is straightforward: the model has enough degrees of freedom to represent fine-grained distinctions in meaning, not just broad categories. Two documents about contractor performance management will be close. A document about contractor performance management and a document about satellite telemetry will not be, even if they share words like "system" and "report."
How It Works
At index time, every document in the corpus is passed through an embedding model, which converts it to a vector. Those vectors are stored in a vector database alongside a reference to the source document.
When a user submits a query, the query goes through the same embedding model, producing a query vector.
The vector database then scores stored vectors by their similarity to the query vector; conceptually, this is a comparison against every vector in the index. The most common metric is cosine similarity, a measure of the angle between two vectors, which captures directional similarity independent of magnitude. Two vectors pointing in the same direction score near 1.0; unrelated (orthogonal) vectors score near 0; two vectors pointing in opposite directions score near -1.0.
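To make the metric concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions, and production systems would normally use an optimized numeric library rather than a loop like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration (not model output).
# Same direction, different magnitude: score is approximately 1.0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Opposite direction: score is approximately -1.0.
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
# Perpendicular vectors: score is 0.
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Note that doubling every component of a vector leaves the score unchanged, which is what "independent of magnitude" means in practice.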
The database returns the documents whose vectors are nearest to the query vector: the documents most semantically similar to the query, regardless of whether they share any of its words.
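The index-then-query flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular product works: InMemoryVectorIndex is a hypothetical stand-in for a vector database, and the vectors in TOY_VECTORS are hand-written placeholders for what an embedding model would produce.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class InMemoryVectorIndex:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self._entries = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        # Index time: store the vector next to a reference to its document.
        self._entries.append((doc_id, vector))

    def search(self, query_vector, k=2):
        # Query time: score every stored vector, return the k nearest.
        scored = [
            (cosine_similarity(query_vector, vector), doc_id)
            for doc_id, vector in self._entries
        ]
        scored.sort(reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:k]]


# Assumption: hand-written toy vectors, not real embedding-model output.
TOY_VECTORS = {
    "doc-budget-appropriation": [0.9, 0.1, 0.0],
    "doc-fiscal-allocation": [0.8, 0.2, 0.1],
    "doc-network-topology": [0.0, 0.1, 0.9],
}

index = InMemoryVectorIndex()
for doc_id, vector in TOY_VECTORS.items():
    index.add(doc_id, vector)

# Assumption: a toy query vector standing in for an embedded user query.
query_vector = [0.85, 0.15, 0.05]
results = index.search(query_vector, k=2)
```

With these toy values, the two finance-related documents come back as the nearest neighbors and the network topology document does not, without any keyword comparison taking place.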
At scale, exact nearest-neighbor search across tens of millions of vectors is too slow for interactive queries, because every query has to be compared against every stored vector. In practice, approximate nearest neighbor (ANN) algorithms are used: they trade a small, tunable loss in recall for query latency measured in milliseconds rather than seconds. The approximation is a deliberate engineering choice.
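One way to see the trade-off is a toy index built on random-hyperplane hashing, one of several ANN techniques. The sketch below is an illustration only: RandomHyperplaneIndex is a hypothetical class, the corpus is random data, and production vector databases commonly use other structures (graph-based indexes such as HNSW, for example). The idea is that vectors falling on the same side of a set of random hyperplanes share a bucket, and a query scans only its own bucket instead of the whole corpus.

```python
import math
import random


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class RandomHyperplaneIndex:
    """Toy ANN index (hypothetical): buckets vectors by hyperplane side."""

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)
        self._planes = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)
        ]
        self._buckets = {}

    def _signature(self, vector):
        # One boolean per hyperplane: which side does the vector fall on?
        return tuple(
            sum(p * v for p, v in zip(plane, vector)) >= 0.0
            for plane in self._planes
        )

    def add(self, doc_id, vector):
        self._buckets.setdefault(self._signature(vector), []).append((doc_id, vector))

    def candidates(self, query_vector):
        # Only the query's own bucket is scanned; neighbors that landed in
        # other buckets are missed. That is the recall loss.
        return self._buckets.get(self._signature(query_vector), [])

    def search(self, query_vector, k=3):
        scored = sorted(
            (
                (cosine_similarity(query_vector, vector), doc_id)
                for doc_id, vector in self.candidates(query_vector)
            ),
            reverse=True,
        )
        return [(doc_id, score) for score, doc_id in scored[:k]]


# Assumption: a random 16-dimensional corpus standing in for real embeddings.
rng = random.Random(42)
corpus = {f"doc-{i}": [rng.gauss(0.0, 1.0) for _ in range(16)] for i in range(1000)}

index = RandomHyperplaneIndex(dim=16, num_planes=8, seed=0)
for doc_id, vector in corpus.items():
    index.add(doc_id, vector)

query = corpus["doc-7"]
approx_hits = index.search(query, k=3)
candidates_scanned = len(index.candidates(query))
```

Here candidates_scanned is a small fraction of the 1,000 stored vectors, which is where the latency saving comes from. Raising num_planes makes buckets smaller and queries faster but misses more true neighbors; lowering it does the opposite. Real ANN indexes expose comparable knobs.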
Okta Concept Mapping
The closest IDAM analogue is LDAP directory search: a query mechanism that retrieves entities from a corpus based on matching criteria. Both are retrieval systems; both take a query and return a result set. The analogy holds there.
Two things break it. LDAP search is deterministic — the same query against the same directory returns the same result, every time, and that result is auditable. Vector search is approximate by design, and results can shift when the embedding model is updated or the index is rebuilt, even if the underlying documents haven't changed. LDAP attributes are also discrete and typed: a user either has department=Finance or doesn't. There is no "almost Finance." Embeddings are continuous coordinates; there is no natural threshold at which "close enough" becomes "matches." That continuity gives semantic search its power, and it's precisely what puts the retrieval layer outside the reach of the access control primitives your buyers already have. When a CAIO asks how you govern what the retrieval layer returns, the honest answer is that the tools for that problem are not the same tools that govern directory queries — and most organizations haven't built them yet.
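The difference between a discrete match and a continuous score can be shown with a small sketch. Everything here is invented for illustration: the directory entries are plain dictionaries rather than a real LDAP client, and the similarity scores are made-up numbers rather than output from a vector database.

```python
# Assumption: hypothetical directory entries, modeled as plain dictionaries.
users = [
    {"uid": "avega", "department": "Finance"},
    {"uid": "bchen", "department": "Engineering"},
]

# Directory-style match: an attribute either equals the value or it does not.
finance_users = [u["uid"] for u in users if u["department"] == "Finance"]

# Assumption: made-up similarity scores for four documents against one query.
scores = {"doc-a": 0.91, "doc-b": 0.78, "doc-c": 0.74, "doc-d": 0.41}


def matches_at(threshold):
    """Documents that count as a 'match' for a given similarity cutoff."""
    return sorted(doc_id for doc_id, score in scores.items() if score >= threshold)


strict = matches_at(0.80)  # only doc-a clears the bar
loose = matches_at(0.75)   # doc-b now counts as a match too
looser = matches_at(0.70)  # and so does doc-c
```

The directory filter has one answer. The similarity cutoff has as many answers as there are thresholds, and nothing in the data says which one is correct; someone has to choose it.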
Why This Matters in Practice
A contracting officer at a federal agency searching a corpus of acquisition regulations and past performance records illustrates the gap quickly. The legacy system runs keyword search. A query for "cost overrun remediation" returns results only if those exact words appear. The contracting officer who doesn't know the precise regulatory terminology gets nothing useful back.
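A minimal sketch of that keyword behavior, assuming a hypothetical three-document corpus and a simple all-terms-must-appear matcher (real keyword engines add stemming, ranking, and more, but the core limitation is the same):

```python
# Assumption: hypothetical document titles, invented for illustration.
documents = {
    "doc-1": "Schedule slippage mitigation for delayed deliverables",
    "doc-2": "EVM corrective action thresholds",
    "doc-3": "Contractor recovery plans",
}


def keyword_search(query, docs):
    """Return ids of documents that contain every query term (case-insensitive)."""
    terms = query.lower().split()
    return [
        doc_id
        for doc_id, text in docs.items()
        if all(term in text.lower().split() for term in terms)
    ]


# None of the documents contain the words "cost", "overrun", or "remediation".
hits = keyword_search("cost overrun remediation", documents)

# The same matcher works only when the user already knows the wording.
known_wording_hits = keyword_search("recovery plans", documents)
```

The first query returns an empty list even though all three documents are relevant to it; the second succeeds because the query happens to reuse the document's own words.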
With vector search, the same query surfaces documents about "schedule slippage mitigation," "EVM corrective action thresholds," and "contractor recovery plans" — because those concepts are geometrically close in the embedding space, even though they share no keywords with the query. The system retrieves on meaning, not on word choice.
This is the retrieval primitive that makes language-based queries work against unstructured document corpora. Retrieval-augmented generation, document Q&A, semantic policy search — all of them require this layer. Without it, you're doing keyword search with a language model bolted to the front, which is a different architecture and performs like one.

