Every AI architecture diagram your buyer puts on a whiteboard has a box labeled "vector database." You've seen it. You've nodded at it. This piece is about what's inside that box, how it works, and why the phrase "we'll just give it our docs" describes about four months of engineering decisions that your buyer may or may not have thought through.
Embeddings are how machines represent meaning as math. An embedding is a list of floating-point numbers, called a vector, that encodes the semantic content of a piece of text. Two pieces of text that mean similar things produce vectors that are close together in a high-dimensional space. Two pieces of text that mean different things produce vectors that are far apart. That's the entire concept. Everything else is plumbing.
What embeddings are, mechanically
Take a sentence. Pass it through an embedding model. Out comes a vector: a fixed-length array of decimal numbers. OpenAI's text-embedding-3-small produces 1,536 numbers. Their larger model produces 3,072. Google's Gemini embedding model also defaults to 3,072 dimensions. These numbers shift as providers release new models, so don't memorize them. What matters is the shape: every chunk of text becomes a fixed-length coordinate in a high-dimensional space.
And critically: vectors from different embedding models live in entirely different spaces. A 1,536-dimension vector from OpenAI and a 3,072-dimension vector from Google are different maps of different territories. You can't compare them, and mixing them produces meaningless results. Every engineering decision downstream inherits this constraint.
A text-embedding-3-small vector looks something like [0.0023, -0.0142, 0.0391, ...], continuing for 1,536 values. Each individual number in that array is uninterpretable on its own. The pattern across all positions, taken together, places that text at a specific coordinate in a space with 1,536 dimensions. You can't visualize 1,536 dimensions. Nobody can. But the math works the same way distance works in three dimensions: points that are close together are similar.
Proximity reflects meaning because of how the model was trained. Embedding models learn through contrastive training: the model sees millions of pairs of semantically related inputs and learns to assign them nearby coordinates, while pushing unrelated inputs to distant coordinates. Think of it as the model being shown "authentication failure" and "login error" over and over, in different contexts, until it reliably places them near each other in vector space. After enough of these examples, the geometry of the space starts to mirror the geometry of meaning. "Authentication failure" and "login error" land near each other. "Authentication failure" and "budget forecast" don't. The same proximity math powers recommendation systems: if two items land near each other in embedding space, a system can surface one when a user engages with the other.
Closeness is measured by cosine similarity: the cosine of the angle between two vectors. The score ranges from -1 to 1. A score of 1 means the vectors point in the same direction (very similar). A score of 0 means they are orthogonal (unrelated). A score of -1 means they point in opposite directions. In practice, scores between text embeddings almost always fall between 0 and 1. OpenAI's embeddings are normalized to length 1, which simplifies the math to a dot product. The vector database handles this computation across millions of stored vectors in milliseconds. That's what the box on the diagram does.
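For readers who want to see the arithmetic, here is a minimal Python sketch of cosine similarity. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the function names are this example's own, not any vendor's API.

```python
import math

def dot(a, b):
    """Sum of pairwise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Scale a vector to length 1 without changing its direction."""
    n = norm(a)
    return [x / n for x in a]

# Toy vectors, invented for this example.
login_error = [0.9, 0.1, 0.2]
auth_failure = [0.8, 0.2, 0.1]
budget_forecast = [0.1, 0.9, 0.3]

sim_close = cosine_similarity(login_error, auth_failure)
sim_far = cosine_similarity(login_error, budget_forecast)

# For length-1 vectors, the dot product alone equals the cosine similarity.
dot_of_unit = dot(normalize(login_error), normalize(auth_failure))
```

With these toy values, the two login-related vectors score about 0.99 and the budget vector scores about 0.27. Once vectors are scaled to length 1, the plain dot product gives the same number as the full formula, which is the simplification that normalized embeddings buy you.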
(A note on sources: the dimension counts and similarity mechanics cited here come from OpenAI's and Google's own API documentation and model guides. These are primary vendor-maintained technical references, not blog commentary. They're the most reliable sources for model specifications, though the numbers are subject to change with new model releases.)
Recap:
- Embedding: A fixed-length array of floating-point numbers encoding the semantic meaning of text. Proximity in vector space reflects similarity in meaning.
- Cosine similarity: The standard metric for measuring how close two embeddings are. Near 1 means similar; near 0 means unrelated.
- Embedding model: The model that converts text into vectors. Different models produce different dimension counts and incompatible vector spaces.
How vector search finds what keyword search misses
Traditional search is string matching. You type "PTO policy for remote workers," and the system looks for documents containing those exact words. If the relevant document says "telework guidance and leave accrual for personnel onboarded in FY23," keyword search returns nothing. The words don't match. Anyone who's watched a federal employee try to find an answer in a 400-page policy manual using Ctrl+F knows this problem intimately.
Vector search operates on meaning. Your query gets converted to a vector using the same embedding model that processed the stored documents. The database then finds the stored vectors closest to your query vector. The telework guidance document lands near your PTO query because the meanings are close, even though the two share no meaningful words.
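The contrast can be sketched in a few lines of Python. Everything here is illustrative: the stopword list is an assumption, the second document is invented, and the three-number vectors are hand-written stand-ins for what an embedding model would produce.

```python
import math

# Small stopword list (an assumption for this example) so filler words
# like "for" don't count as keyword matches.
STOPWORDS = {"a", "and", "for", "in", "of", "the"}

def keywords(text):
    """Lowercase the text and return its non-stopword terms as a set."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = "PTO policy for remote workers"
documents = {
    "telework": "telework guidance and leave accrual for personnel onboarded in FY23",
    "parking": "parking permit renewal for headquarters garage",
}

# Keyword search: which documents share at least one term with the query?
keyword_hits = [
    name for name, text in documents.items() if keywords(query) & keywords(text)
]

# Vector search: toy vectors standing in for real embedding model output.
query_vector = [0.9, 0.2, 0.1]
document_vectors = {"telework": [0.8, 0.3, 0.2], "parking": [0.1, 0.2, 0.9]}
best_match = max(
    document_vectors,
    key=lambda name: cosine_similarity(query_vector, document_vectors[name]),
)
```

Here `keyword_hits` comes back empty, while `best_match` is the telework document. The toy vectors were chosen to make that point; with a real model, whether the right document lands nearby is something you have to test.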
Vector search has a failure mode worth knowing about. Embedding models learn meaning from their training data. They can miss domain-specific terminology, acronyms, or proper nouns that didn't appear frequently enough during training. A search for "FedRAMP P-ATO status" might not retrieve the right document if the embedding model doesn't understand FedRAMP as a concept distinct from general authorization language. In that case, the model probably never saw enough FedRAMP-specific text to learn the distinction.
Production systems increasingly use hybrid search for exactly this reason, combining keyword matching and vector similarity. Exact-match catches the acronyms. Semantic similarity catches the paraphrases. Both methods have blind spots the other fills, and anyone who tells you otherwise hasn't deployed this in an environment where people search for things like "FISMA POA&M remediation timeline."
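One common way to combine the two result lists is reciprocal rank fusion. The sketch below is an assumption about how a hybrid system might merge rankings, not a description of any particular product; the document IDs are invented, and k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both methods rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for the query "FISMA POA&M remediation timeline".
keyword_ranking = ["poam_guide", "fisma_overview", "leave_policy"]
vector_ranking = ["remediation_schedule", "poam_guide", "fisma_overview"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

In this example `poam_guide` ends up first because both methods ranked it highly, even though neither list alone agrees on the top result.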
Recap:
- Keyword search: Finds documents containing the exact words in your query. Misses synonyms, paraphrases, and the creative rewording that bureaucracies produce at industrial scale.
- Vector search: Finds documents whose embeddings are close to your query's embedding. Catches meaning even when words don't match.
- Hybrid search: Combines both. Production systems use this because domain-specific terminology and semantic meaning are both real requirements.
Your IDAM intuition is useful here. A vector database is functionally a directory: it's only as valuable as what's enrolled in it. Before any query can retrieve a document, that document has to be onboarded — split into chunks, embedded, and indexed with metadata. Deprovisioning matters too: outdated documents need removal. Switching embedding models means re-processing everything, because vectors from different models live in incompatible spaces. Your instinct about lifecycle management applies directly.
Here's where that instinct starts to mislead you. Directory enrollment is deterministic. A user is provisioned or they aren't, and a failure produces an error you can see. A document can be indexed in a vector database but poorly represented, and the system won't tell you. If the embedding model doesn't capture domain-specific meaning, or if the chunking strategy splits a critical paragraph across two fragments, the document is "enrolled" but effectively invisible to the queries that need it. There is no failed-enrollment error for a semantically weak embedding. The failure is silent.
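Because the failure is silent, the practical way to surface it is to test retrieval with questions whose right answer you already know. The sketch below is one simple way to do that, under stated assumptions: `fake_search` is a stand-in for a real retrieval call, and the chunk IDs and canned results are invented.

```python
def hit_rate_at_k(test_cases, search, k=5):
    """Fraction of test queries whose expected chunk appears in the top k results."""
    hits = 0
    for query, expected_chunk_id in test_cases:
        if expected_chunk_id in search(query, k):
            hits += 1
    return hits / len(test_cases)

# Stand-in for a real retrieval call: canned results keyed by query.
CANNED_RESULTS = {
    "FedRAMP P-ATO status": ["general-authz-07", "access-policy-02"],
    "telework leave accrual": ["telework-01", "leave-02"],
}

def fake_search(query, k):
    """Return up to k canned chunk IDs for a query (hypothetical helper)."""
    return CANNED_RESULTS.get(query, [])[:k]

test_cases = [
    # The right chunk is indexed, but retrieval never surfaces it.
    ("FedRAMP P-ATO status", "fedramp-pato-11"),
    ("telework leave accrual", "telework-01"),
]

score = hit_rate_at_k(test_cases, fake_search, k=2)
```

A score of 0.5 here means half the known-answer questions failed to retrieve their document. Nothing in the system raised an error; only the test revealed it.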
What "give it our docs" actually requires
When a buyer says "we'll just give it our docs," they're describing a pipeline. Each step has engineering decisions that determine whether the system actually works.
Chunk the documents. Large documents get split into smaller passages, typically a few hundred tokens each (a token is roughly ¾ of a word in English). Chunking strategy matters more than chunk size. Fixed-size splits can cut a sentence in half, break a table mid-row, or separate a policy statement from its exceptions three paragraphs later. These are engineering decisions that directly affect retrieval quality.
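To make the difference concrete, here is a minimal sketch of two chunking strategies. It counts words as a rough proxy for tokens (an assumption), and the policy text and function names are invented for the example.

```python
def fixed_size_chunks(text, max_words):
    """Split every max_words words, ignoring sentence and paragraph boundaries."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def paragraph_chunks(text, max_words):
    """Pack whole paragraphs into chunks, starting a new chunk when the next
    paragraph would push the current one past max_words. A single paragraph
    longer than max_words is kept whole as its own oversized chunk."""
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

policy = (
    "Employees may telework up to three days per week.\n\n"
    "Exception: personnel in their first ninety days must work on site."
)

fixed = fixed_size_chunks(policy, 12)
by_paragraph = paragraph_chunks(policy, 12)
```

The fixed-size split ends its first chunk with "Exception: personnel in", cutting the exception mid-sentence. The paragraph-aware split keeps each sentence intact, but at this size limit it still puts the rule and its exception in separate chunks. Keeping them together takes a larger limit or overlapping chunks, which is exactly the kind of decision this step involves.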
Embed each chunk. Each chunk passes through the embedding model and becomes a vector. A 200-page document might produce several hundred chunks, each represented by its own array of floating-point numbers. The embedding model you choose determines the vector space everything lives in.
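A back-of-envelope calculation shows where "several hundred chunks" comes from. The words-per-page figure, chunk size, and 4-byte float storage are assumptions chosen for illustration; the ¾-word-per-token ratio and the 1,536 dimensions come from the text above.

```python
import math

PAGES = 200
WORDS_PER_PAGE = 500       # assumption: a dense, single-spaced page
WORDS_PER_TOKEN = 0.75     # rule of thumb for English
TOKENS_PER_CHUNK = 300     # assumption: "a few hundred tokens"
DIMENSIONS = 1536          # text-embedding-3-small
BYTES_PER_NUMBER = 4       # assumption: 32-bit floats

total_tokens = PAGES * WORDS_PER_PAGE / WORDS_PER_TOKEN
chunk_count = math.ceil(total_tokens / TOKENS_PER_CHUNK)
vector_bytes = chunk_count * DIMENSIONS * BYTES_PER_NUMBER
```

Under these assumptions, one 200-page document becomes 445 chunks and roughly 2.7 MB of raw vector data, before counting the stored text, metadata, or index overhead.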
Store in a vector database. The vectors get indexed for fast similarity search. The database stores the vector, the original text it came from, and metadata: source document, date, classification, section heading. This is the box on the architecture diagram.
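What gets stored can be pictured as one record per chunk. The sketch below is a toy in-memory stand-in, not any real database's API; the class names, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ChunkRecord:
    chunk_id: str
    vector: List[float]          # the embedding
    text: str                    # the original passage
    metadata: Dict[str, str] = field(default_factory=dict)

class InMemoryVectorStore:
    """Toy store showing the lifecycle operations: enroll and deprovision."""

    def __init__(self):
        self._records: Dict[str, ChunkRecord] = {}

    def upsert(self, record: ChunkRecord) -> None:
        """Insert a record, replacing any existing record with the same ID."""
        self._records[record.chunk_id] = record

    def delete_by_source(self, source: str) -> int:
        """Remove every chunk from one source document; return how many were removed."""
        stale = [cid for cid, r in self._records.items() if r.metadata.get("source") == source]
        for cid in stale:
            del self._records[cid]
        return len(stale)

    def __len__(self) -> int:
        return len(self._records)

store = InMemoryVectorStore()
store.upsert(ChunkRecord(
    "telework-01", [0.8, 0.3, 0.2], "Employees may telework up to three days per week.",
    {"source": "telework_policy_fy23.pdf", "date": "2023-01-15",
     "classification": "unclassified", "section": "Eligibility"},
))
store.upsert(ChunkRecord(
    "parking-03", [0.1, 0.2, 0.9], "Parking permits renew annually.",
    {"source": "facilities_guide.pdf", "section": "Parking"},
))

removed = store.delete_by_source("telework_policy_fy23.pdf")
```

The metadata is what makes deprovisioning possible: without a source field on every chunk, there is no clean way to remove last year's policy when the new one arrives.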
Those three steps happen once per document, during indexing. The next three happen every time someone asks a question.
Embed the query. The user's question passes through the same embedding model used during indexing. This produces a query vector in the same space as the document vectors. Using a different model breaks the math. Vectors from different models aren't comparable, and similarity scores become meaningless. The requirement is usually written as "use the same embedding model," which is a generous way of describing a constraint where, if you violate it, nothing works and nothing tells you it's broken.
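One way to keep this failure from staying silent is to record which model produced each vector and refuse to compare across models. This is a sketch of that idea, not a feature of any specific database; the class names are invented, and "other-vendor-model" is a placeholder name.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Embedding:
    model: str            # name of the embedding model that produced the vector
    values: List[float]

class EmbeddingMismatchError(ValueError):
    """Raised when two vectors cannot be meaningfully compared."""

def assert_comparable(query: Embedding, stored: Embedding) -> None:
    """Fail loudly instead of returning a meaningless similarity score."""
    if query.model != stored.model:
        raise EmbeddingMismatchError(
            f"query embedded with {query.model!r}, index built with {stored.model!r}"
        )
    if len(query.values) != len(stored.values):
        raise EmbeddingMismatchError("dimension counts differ")

stored = Embedding("text-embedding-3-small", [0.1, 0.2, 0.3])
query_ok = Embedding("text-embedding-3-small", [0.3, 0.2, 0.1])
# Same length, different model: a dimension check alone would not catch this.
query_bad = Embedding("other-vendor-model", [0.3, 0.2, 0.1])

assert_comparable(query_ok, stored)  # passes without raising

try:
    assert_comparable(query_bad, stored)
    mismatch_caught = False
except EmbeddingMismatchError:
    mismatch_caught = True
```

The model-name check matters because two models can share a dimension count and still produce incompatible spaces. Matching lengths would let the math run and return confident nonsense.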
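Similarity search. Conceptually, the vector database scores the query vector against every stored vector by cosine similarity and returns the top results, typically 3 to 10 chunks. At scale, most databases use an approximate nearest-neighbor index to get nearly the same results without scanning every vector. These are the chunks the system judges most relevant to the question.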
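The simplest version of this step is a brute-force scan: score everything, keep the best few. The sketch below shows that baseline with invented chunk IDs and toy vectors; production databases typically replace the full scan with an approximate index to stay fast at scale.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vector, stored_vectors, k=3):
    """Score every stored vector against the query and return the k best
    as (score, chunk_id) pairs, highest score first."""
    scored = (
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in stored_vectors.items()
    )
    return heapq.nlargest(k, scored)

# Toy vectors, invented for this example.
stored_vectors = {
    "telework-01": [0.8, 0.3, 0.2],
    "leave-02": [0.7, 0.4, 0.1],
    "parking-03": [0.1, 0.2, 0.9],
    "travel-04": [0.3, 0.8, 0.3],
}
query_vector = [0.9, 0.2, 0.1]

results = top_k(query_vector, stored_vectors, k=2)
top_ids = [chunk_id for _, chunk_id in results]
```

With these values, `top_ids` is the telework chunk followed by the leave chunk. Note that the search always returns k results, whether or not any of them is actually relevant; a low-scoring chunk still gets handed to the model.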
Inject into the prompt. The retrieved chunks get inserted into the model's context window alongside the user's question. The LLM generates an answer using those chunks as source material. The model receives relevant excerpts at query time, selected by vector similarity. Your documents stay in the vector database; the model sees only the chunks that retrieval hands it for each query.
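Prompt injection in this sense is just string assembly. The sketch below shows one plausible layout; the instruction wording, the chunk dictionary keys, and the sample text are assumptions for illustration, and real systems vary in how they format retrieved context.

```python
def build_prompt(question, chunks):
    """Assemble a prompt from the user's question and the retrieved chunks.

    Each chunk is a dict with 'source' and 'text' keys (hypothetical shape).
    """
    excerpts = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the excerpts below. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"Excerpts:\n{excerpts}\n\n"
        f"Question: {question}\nAnswer:"
    )

retrieved = [
    {"source": "telework_policy_fy23.pdf",
     "text": "Employees may telework up to three days per week."},
    {"source": "telework_policy_fy23.pdf",
     "text": "Exception: personnel in their first ninety days must work on site."},
]

prompt = build_prompt("How many days a week can I work remotely?", retrieved)
```

Everything the model knows about your documents for this question is in that one string. If retrieval handed over the wrong chunks, the model has no way to know what it is missing.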
The whole mechanism is called retrieval-augmented generation, or RAG (a working definition: the pattern of retrieving external content and inserting it into a model's prompt to ground its response in specific source material). RAG is why "vector database" appears on every AI architecture diagram your buyer draws.
For your next conversation, the takeaway is concrete: "give it our docs" describes a pipeline with consequential engineering decisions at every stage. The embedding is a coordinate in meaning-space, and that coordinate is only as useful as what's in the database and how well retrieval works. Bad chunking, the wrong embedding model, stale documents, or missing metadata: any of these can make the coordinate point to the wrong neighborhood, and the model then generates a confident answer grounded in the wrong context.
Recap:
- Chunking: Splitting documents into smaller passages for embedding. How you split directly affects what retrieval can find.
- Indexing pipeline: Chunk → embed → store. Runs once per document. Changing the embedding model means re-running everything.
- Retrieval pipeline: Embed the query → similarity search → inject results into the prompt. Runs at every query.
- RAG: Retrieval-augmented generation. The pattern of retrieving relevant content from a vector database and inserting it into the model's prompt.
- "Give it our docs": Describes all six steps. The quality of each step determines whether the system returns the right answer or a confident wrong one.
When this comes up
Your buyer's architecture whiteboard will have a vector database on it. It might be Pinecone, pgvector, MongoDB Atlas Vector Search, or something built into their cloud provider's AI stack.
The buyer's technical lead points to the vector database box and says, "This is where we store our knowledge base. We're feeding it all our policy documentation." That sentence sounds finished. It isn't. You now know enough to ask: how are you chunking those documents? What embedding model are you using, and have you tested whether it handles your domain terminology? What happens when those policy documents get updated next fiscal year? Do the old embeddings get replaced?
Those questions tell the buyer you understand their pipeline has moving parts, and that the vector database is a system with lifecycle requirements, not a magic box you drop documents into. The buyer who has thought through chunking strategy and embedding model selection will respect that you know to ask. The buyer who hasn't will realize they need to, and that's a more valuable conversation than anything on your slide deck.
When a CIO says "we're building a RAG pipeline," you know the six steps. When a program manager mentions they're evaluating vector databases, you know the evaluation covers the embedding model, the chunking strategy, the retrieval quality, and the lifecycle management of everything stored inside. The database is one component. What goes into it, how it gets there, and how well retrieval works determines whether the AI system is useful or just articulate.
Things to follow up on...
- Shrinking vectors, keeping quality: OpenAI's embedding models let developers reduce dimension counts via an API parameter — their large model shortened to 256 dimensions still outperforms the previous-generation model at full 1,536, which has real implications for storage cost and search speed.
- Hybrid search in practice: MongoDB's documentation on combining vector search with full-text search walks through the mechanics of running keyword and semantic queries in parallel, which is how production RAG systems handle domain-specific acronyms that embedding models miss.
- The RAG pipeline end-to-end: LangChain's RAG tutorial is the clearest step-by-step walkthrough of the indexing and retrieval pipeline described in this article, built on the same primary concepts with working code examples.
- Azure's RAG architecture guidance: Microsoft's Azure AI Search documentation includes a detailed RAG overview that covers chunking strategies, hybrid queries, and the specific problem of sending too much context to the model — useful framing for public sector buyers building on Azure.

