What's in the Field
These five names show up in RFIs, in discovery calls, and occasionally in the "we're already using..." disclosure that reframes a conversation you thought you had mapped. Cohere, Databricks, Together AI, Fireworks AI, and Replicate are not frontier labs and are not hyperscalers — they occupy a middle layer that exists because Bedrock and Vertex don't solve every problem, and because some buyers have specific requirements that the big platforms either can't meet or don't prioritize. The language that buys you credibility here is simple: "What's driving the provider selection for this workload?" That question signals you understand that AI infrastructure is not one-size, and it opens the conversation that actually matters.
The Providers
Cohere
What it is: An enterprise AI company specializing in large language models for retrieval-augmented generation and private deployment.
What it does: Cohere builds and serves its own model family — Command R and Command R+ are the current production models, optimized for RAG pipelines where a model needs to reason over retrieved documents rather than rely on training-time knowledge. The Coral platform is their enterprise interface layer. The operationally significant capability is private deployment: Cohere models can run inside a customer's VPC, on-premises, or in a sovereign cloud environment, with no data egress to Cohere's infrastructure. Pricing runs roughly $0.50–$3.00 per million tokens depending on model tier and deployment mode. Flag for accuracy review each cycle.
Who's behind it: Founded in 2019 in Toronto. Aidan Gomez, one of the co-authors of the "Attention Is All You Need" paper that introduced the transformer architecture, is CEO. The company has raised approximately $445 million as of early 2025 and has maintained a deliberate focus on enterprise and regulated-industry buyers rather than consumer or developer markets.
What makes it distinct: Private deployment with no data egress. Private deployment of a proprietary model is Cohere's core product and their go-to-market strategy, aimed squarely at buyers for whom sending data to a third-party inference endpoint is a non-starter.
Databricks
What it is: A data and AI platform company that has extended its Lakehouse architecture to include model training, serving, and governance through its Mosaic AI layer.
What it does: Databricks acquired MosaicML in 2023 and integrated it as Mosaic AI, which handles model training, fine-tuning, and serving within the Databricks platform. They released DBRX, an open-weight model, in March 2024. More practically, Mosaic AI serves third-party models — including Llama, Mistral, and others — through Model Serving endpoints that sit inside a customer's existing Databricks environment. Unity Catalog, Databricks' data governance layer, extends to AI assets: models, embeddings, and inference logs are governed by the same access control and audit infrastructure as the underlying data. Lakehouse AI is their positioning term for this integration.
Who's behind it: Founded in 2013 by the team behind Apache Spark at UC Berkeley. Ali Ghodsi is CEO. The company was valued at approximately $43 billion in its 2023 funding round and filed for IPO in 2024. Databricks is a dominant platform in enterprise data engineering; the AI layer is an extension of an existing relationship, not a new one.
What makes it distinct: The buyer already has Databricks. This is the only provider in this group where the AI capability is bundled into a contract that likely predates the AI conversation. Using Mosaic AI often happens as a configuration choice, not a procurement one. The platform is already there, the data is already there, and Unity Catalog governance already applies.
Together AI
What it is: A cloud inference provider that hosts open-weight models via an OpenAI-compatible API at pricing significantly below hyperscaler rates.
What it does: Together Inference serves a broad catalog of open-weight models — Llama 3, Mistral, Mixtral, Qwen, and others — through an API that accepts the same request format as OpenAI's. Developers and teams already using OpenAI's SDK can point it at Together's endpoint with a credential swap. Together Fine-tuning provides managed supervised fine-tuning on customer data. Pricing for Llama 3 70B runs approximately $0.54–$0.90 per million tokens, compared to roughly $2.50–$3.50 for comparable capability on hyperscaler platforms. Flag for accuracy review each cycle. Together also maintains academic and research pricing tiers.
Who's behind it: Founded in 2022 in San Francisco. Vipul Ved Prakash is CEO. The company has raised approximately $220 million and has positioned itself primarily toward research institutions, AI-native startups, and enterprise teams that want open-weight model access without standing up their own GPU infrastructure.
What makes it distinct: Model catalog breadth combined with OpenAI API compatibility. Together hosts more distinct model variants than any other provider in this group, and the API compatibility means the switching cost from OpenAI or from another Together-compatible provider is close to zero. That's a meaningful procurement argument for buyers who want optionality.
Fireworks AI
What it is: A production inference platform focused on low-latency serving of open-weight models with structured output guarantees.
What it does: Fireworks serves open-weight models with a stated emphasis on inference speed: sub-100ms p50 latency for many models under standard load, with SLA options for production deployments. Flag for accuracy review each cycle. FireFunction is their fine-tuned function-calling model, optimized for tool use in agentic pipelines. Compound AI Systems is their term for multi-model orchestration, where different models handle different steps in a pipeline and Fireworks manages the routing and output formatting. Structured output, meaning guaranteed JSON schema compliance on model responses, is a documented production capability that matters for applications that can't tolerate malformed outputs.
Who's behind it: Founded in 2022 by former Google Brain researchers. Lin Qiao is CEO. The company has raised approximately $77 million and has focused on production engineering quality rather than model breadth — a narrower catalog than Together, but with more investment in the serving infrastructure around each model.
What makes it distinct: Latency and structured output guarantees for production use. Together and Fireworks serve overlapping model catalogs at similar price points, but Fireworks' differentiation is in the serving layer: lower latency, SLA options, and structured output that production applications can depend on. The buyer who needs Llama 3 cheap goes to Together; the buyer who needs Llama 3 fast and reliable in a production pipeline has a reason to look at Fireworks.
Replicate
What it is: A developer platform for on-demand deployment of machine learning models, including a long tail of community models not available through other providers.
What it does: Replicate hosts thousands of models — text, image, video, audio, and multimodal — and serves them via API on a pay-per-second GPU billing model. Cog is their open-source containerization format that allows any model to be packaged and deployed on Replicate's infrastructure. Replicate Deployments provides dedicated GPU instances for teams that need consistent availability rather than cold-start on-demand. The model catalog includes both well-known open-weight models and a long tail of specialized community models that exist nowhere else as managed API endpoints.
Who's behind it: Founded in 2019 in San Francisco. Ben Firshman is CEO. The company has raised approximately $40 million and has maintained a developer-first positioning. The enterprise SLA story is thinner here than at Cohere or Fireworks; Replicate hasn't tried to close that gap.
What makes it distinct: On-demand access to models that nobody else hosts. If a buyer needs a specific image generation model, a specialized audio model, or a fine-tuned variant that a research team published but didn't operationalize, Replicate is often the only place to get it as a managed API endpoint. The use case is running the specific model a team needs that doesn't exist anywhere else as a managed endpoint.
Comparison Strategy
A flat trait grid across five providers produces a table that's technically complete and practically useless. Five columns of checkmarks don't tell you when to route a buyer conversation toward one versus another. The more useful structure is scenario-based: map each provider to the buyer circumstance that makes it the right answer, then cross-cut on the dimensions that distinguish providers within the same scenario.
Three scenarios cover the field:
Scenario A: Compliance or sovereignty requirements block hyperscaler deployment. The buyer's legal or security team has determined that sending inference requests to AWS Bedrock or Google Vertex creates unacceptable data exposure — either because of data residency requirements, contractual restrictions on third-party subprocessors, or a classified or sensitive workload that can't leave a controlled environment. Cohere is the primary answer here, because private deployment of a proprietary model is their core product. Databricks is a secondary answer if the buyer already has Databricks and the workload can run on an open-weight model served through Mosaic AI within their existing environment.
Scenario B: Open-weight model inference at scale, price-sensitive. The buyer has selected an open-weight model (Llama 3, Mistral, or similar) and wants managed inference without running their own GPU cluster. They're comparing against Bedrock's hosted Llama endpoints or Vertex's model garden. Together AI and Fireworks AI are both answers here, and the distinction matters: Together is the right answer when the buyer prioritizes model selection breadth and price, or when they want a clean migration path from OpenAI's API. Fireworks is the right answer when the buyer is building a production application that requires low latency, structured output guarantees, or SLA-backed availability. At the decision point, these two providers diverge, even though their catalogs overlap substantially.
Scenario C: Specific model access — the model the buyer needs isn't available elsewhere. The buyer has identified a model — a specialized fine-tune, a multimodal variant, a community model — that isn't in Bedrock's catalog, Vertex's model garden, or Together's or Fireworks' standard offerings. Replicate is the answer. This scenario is less common in large enterprise accounts and more common in agency innovation offices, research teams, and product teams building on specialized capabilities.
Cross-cutting dimensions:
| Dimension | Cohere | Databricks | Together AI | Fireworks AI | Replicate |
|---|---|---|---|---|---|
| Enterprise SLA | Yes | Yes (via platform) | Partial | Yes | Limited |
| Private/VPC deployment | Yes (core product) | Yes (within platform) | No | No | No |
| Existing relationship leverage | Low | High | Low | Low | Low |
| Production latency focus | Moderate | Moderate | Low | High | Low |
| Model catalog breadth | Narrow (proprietary) | Moderate | Broad | Moderate | Very broad |
| Primary buyer | Regulated enterprise | Existing Databricks customer | Dev/research teams | Production engineering | Developer / research |
One note on the Databricks row: "existing relationship leverage" is high because Mosaic AI typically surfaces inside a renewal or expansion conversation, not a net-new procurement. The sales motion is different.
Field Language Guide
| Don't say | Do say | Why it matters |
|---|---|---|
| "AI vendor" | "inference provider" | Buyers with multiple AI relationships need precision about which layer you're discussing |
| "Cohere is like OpenAI but smaller" | "Cohere offers private deployment options that OpenAI doesn't — models run in your environment, not theirs" | The differentiation is architectural, not scale |
| "Databricks has AI now" | "Mosaic AI is Databricks' model serving layer, built on their MosaicML acquisition — it's inside the platform the buyer already has" | Frames it as an extension of an existing relationship, not a new vendor |
| "Together and Fireworks are basically the same" | "Together optimizes for model breadth and price; Fireworks optimizes for inference latency and structured output guarantees" | The distinction matters at the production deployment decision point |
| "Cheaper than OpenAI" | "Competitive inference pricing for open-weight models — roughly 60–80% below hyperscaler rates for comparable capability" | Avoids implying quality equivalence; grounds the claim in a specific comparison |
| "Data stays in your environment" | "Private deployment with no data egress to provider infrastructure" | "Your environment" is ambiguous; "no data egress to provider infrastructure" is a claim a security team can evaluate |
| "Replicate is for developers" | "Replicate supports on-demand deployment of models not available through other providers — useful for specialized or community models" | Positions it accurately without dismissing it as non-enterprise |
| "Fine-tuning" (used loosely) | "Supervised fine-tuning on your domain data, hosted by the provider" | Distinguishes from prompt engineering, RAG, and other adaptation methods the buyer may conflate |
| "Compound AI" | "Multi-model orchestration — different models handling different steps in a pipeline, with the provider managing routing and output formatting" | Fireworks' term; define it before using it or the buyer will nod and not follow |
| "Lakehouse AI" | "Databricks' term for AI workloads that run inside their data platform, governed by the same Unity Catalog access controls as the underlying data" | Databricks uses this term in their materials; buyers will encounter it |
| "Model catalog" | Use it — define it once as "the set of models a provider hosts and serves via API" | Precise and buyer-facing; worth establishing early in a conversation |
| "Sovereign AI" | "Data residency requirements that restrict where inference can occur" | "Sovereign AI" is a marketing term with variable meaning; the underlying requirement is specific and auditable |
Okta Concept Mapping: Federated Identity Provider Selection
Choosing a specialty AI inference provider follows logic that rhymes with federated IdP selection: you're deciding which external party to trust for a specific workload, based on what they can attest to — data residency, compliance posture, SLA, model provenance. Just as you'd choose an IdP based on what authentication assertions it can make for a given user population, you choose an inference provider based on what it can assert about where your data goes and how the model behaves. The analogy holds well enough to orient a conversation. Where it breaks: in federation, you can route different user populations to different IdPs within the same trust hierarchy, and switching an IdP mid-deployment is painful but architecturally clean. In AI provider selection, workload commitment creates data gravity — fine-tuning artifacts, cached embeddings, evaluation baselines, and audit logs accumulate at the provider over time, making mid-contract migration expensive in ways that have no clean IdP equivalent. When a CAIO asks about "AI provider strategy," they're asking a question closer to data residency architecture than to SSO configuration. The identity framing gets you into the conversation; the data gravity reality is what shapes the answer.
Pricing figures in this piece are illustrative benchmarks current as of publication. Verify against provider documentation before using in a buyer conversation — inference pricing moves faster than most procurement cycles.

