What These Are and When You'll Hear About Them
Five AI providers sit between the hyperscaler defaults and the frontier labs: Cohere, Databricks, Together AI, Fireworks AI, and Replicate. You'll hear their names when a buyer tells you why they passed on Bedrock. That explanation is the signal. The sentence that buys you time: "That makes sense. Are you optimizing for cost at scale, latency, a compliance requirement, or tighter integration with your existing data platform?" Those are the four reasons a CIO picks a specialty provider over a hyperscaler default, and each of these five maps to them differently.
Cohere
What it is: An enterprise AI model company focused on language understanding, retrieval-augmented generation (RAG — feeding a model your own documents so it answers from your data, not its training set), and private deployment.
What it does: Builds its own family of language models (Command, Embed, Rerank) optimized for enterprise search and document processing, deployable inside the customer's own infrastructure rather than only through a shared cloud API.
Who's behind it: Founded in Toronto by former Google Brain researchers, including Aidan Gomez, a co-author of the original Transformer paper. In April 2026, Cohere announced a merger with Aleph Alpha at a combined valuation of roughly $20 billion, backed by €500 million from Schwarz Group (the German retail conglomerate behind Lidl). Both the Canadian and German governments endorsed the deal publicly. The merger was pending regulatory approval at time of publication.
What makes it distinct: Cohere is the sovereignty play. Their models deploy into your VPC, on your own hardware, or in an air-gapped environment where data never crosses your perimeter. The Aleph Alpha deal adds STACKIT (Schwarz Group's sovereign cloud), OVHcloud, and a physical appliance form factor. Certifications include SOC 2 Type II, ISO 27001, and ISO 42001. No FedRAMP authorization exists or is documented as in-progress. HIPAA-supportive controls are listed but with a FAQ caveat rather than a formal attestation.
Databricks
What it is: A unified data and AI platform that bundles model serving with its Lakehouse data architecture.
What it does: Through its Mosaic AI Gateway, Databricks hosts open-weight models (Llama 3.3 70B, Llama 3.1 405B, Mistral variants) natively and routes requests to external providers (OpenAI, Anthropic) through the same governance layer. Every model call runs through Unity Catalog for access control, prompt logging, and usage attribution.
Who's behind it: Founded by the creators of Apache Spark at UC Berkeley. Databricks has been the dominant data engineering platform in large enterprises for years. Their AI play extends an existing relationship, and that matters.
What makes it distinct: Databricks is the "we already have Databricks" play. Pricing runs on Databricks Units (DBUs), not per-token rates, starting around $0.07 per DBU for foundation model serving. Direct cost comparison with per-token providers is genuinely difficult, which is partly the point. The value proposition: the model sits next to the data, governed by the same catalog, accessible to the same notebooks and pipelines the data team already uses. Standing up a separate inference provider creates a governance gap they don't want.
Together AI
What it is: A cloud inference platform for open-weight models (models whose weights are publicly available for download and modification) with a broad model catalog and dedicated fine-tuning infrastructure.
What it does: Hosts open-weight models from the Llama, Mistral, Qwen, and DeepSeek families via an OpenAI-compatible API endpoint. Charges per-token. Offers the most comprehensive fine-tuning support among cloud inference providers, including LoRA (a technique for adapting a model to your data without retraining the whole thing) and full fine-tuning on most major model families.
Who's behind it: Founded by researchers from Stanford, backed by significant venture funding. Together has positioned itself as the open-weight model ecosystem's default cloud.
What makes it distinct: Together is the model variety and fine-tuning play. Their catalog is the broadest among pure inference providers. If a buyer is evaluating multiple open-weight models or plans to fine-tune a base model on proprietary data, Together is likely on their shortlist. Current pricing for Llama 3.3 70B sits at $0.88 per million tokens — roughly 3x cheaper than Bedrock's published rate for comparable models.
Fireworks AI
What it is: An inference optimization platform built for production speed on open-weight models.
What it does: Runs the same open-weight models as Together (Llama, Mistral, Qwen) but differentiates on inference speed through custom GPU kernels, tensor parallelism, and PagedAttention optimizations. Provides structured output guarantees (JSON mode, which means the model returns machine-parseable output instead of free text) that matter for developers building applications against model APIs.
Who's behind it: Founded by former Meta AI infrastructure engineers who worked on PyTorch. The team's background is making models run fast on GPUs, and the product reflects it.
What makes it distinct: Fireworks is the latency play. They claim sub-100ms p50 latency for models up to 180B parameters. Independent practitioner testing is directionally consistent, with Together AI clocking roughly 150ms on comparable models. Pricing is nearly identical to Together ($0.90/MTok for 70B-class models). The differentiator is speed. Fireworks also documents HIPAA and SOC 2 compliance, which Together doesn't prominently feature.
Replicate
What it is: A model hosting marketplace that makes it easy to run community-contributed AI models via API.
What it does: Hosts over 50,000 community models and roughly 100 curated official models. Strongest in image and video generation (Stable Diffusion, Flux, SDXL) rather than LLM inference.
Who's behind it: Founded in 2019, acquired by Cloudflare in late 2025, with the deal closing in early 2026. The Replicate brand continues operating independently, with integration into Cloudflare's Workers AI ecosystem planned.
What makes it distinct: Replicate is the prototyping and media-model play, with an asterisk. The platform bills per-second of GPU time rather than per-token, making cost prediction difficult for LLM workloads. Cold start latency runs 2–30 seconds on community models. No published SLA. No documented SOC 2, HIPAA, or FedRAMP certifications. The Cloudflare acquisition signals a trajectory toward enterprise readiness, but as of May 2026, the enterprise offering hasn't materialized publicly.
Cohere's VPC and on-prem deployment options map directly to how you'd explain Access Gateway or on-prem agent architecture to a buyer who can't send authentication traffic to a cloud tenant. "The data never leaves your environment" is the same promise. The analogy breaks when you remember that identity federation assumes trust across organizational boundaries, while sovereignty models treat those boundaries as the entire point. A buyer evaluating Cohere for sovereignty reasons is probably also rethinking where their identity provider sits.
Four Drivers, Five Providers
Organized by decision driver, not by provider. Every provider appears where relevant. That asymmetry across drivers is the useful part.
Driver 1: Price at Scale
Together AI and Fireworks are the price plays. Both offer Llama 3.3 70B at $0.88–$0.90 per million tokens, roughly 3x cheaper than Bedrock's published rate for comparable models. At 100 million tokens per day, that's the difference between $90 and $270 daily. Replicate's per-second GPU billing makes cost comparison impractical for LLM workloads. Databricks pricing runs on DBUs, not tokens, so direct comparison requires knowing the buyer's compute tier and throughput commitment. Cohere's value is deployment flexibility, and their pricing reflects it.
Driver 2: Latency Optimization
Fireworks leads here. Their custom inference kernels deliver sub-100ms latency on models up to 180B parameters. Together AI benchmarks around 150ms on 70B-class models. These are vendor-claimed figures; independent practitioner testing is directionally consistent but not audit-grade. Replicate's cold starts (2–30 seconds) disqualify it from latency-sensitive workloads. Databricks provisioned throughput mode offers predictable latency but isn't optimized for raw speed. Cohere's latency depends entirely on the deployment target — their cloud API is competitive; on-prem latency is a function of the buyer's hardware.
Driver 3: Compliance and Sovereignty
Cohere has this dimension to itself. SOC 2 Type II, ISO 27001, ISO 42001, air-gapped deployment, and the Aleph Alpha merger's European sovereignty positioning make it the only provider here with a coherent sovereignty story. If a buyer raises Mistral in a sovereignty conversation, the distinction is worth knowing: Mistral is an EU-headquartered model provider, but Cohere's play is deploying models into the buyer's own infrastructure, a fundamentally different proposition from operating out of a favorable jurisdiction. Fireworks documents HIPAA and SOC 2, which matters for healthcare-adjacent workloads. Databricks inherits whatever compliance posture the buyer's existing deployment carries. Together AI and Replicate have no prominently documented compliance certifications. Databricks holds FedRAMP Moderate authorization for its Government Cloud offering. None of the other four providers in this group document FedRAMP authorization. That distinction matters in public sector conversations. Know which provider has it and which ones don't.
Driver 4: Existing Data Platform Relationship
Databricks, by default. If the buyer already runs their data lakehouse on Databricks, adding model serving through Mosaic AI Gateway means the model sits next to the data, governed by Unity Catalog, accessible to existing pipelines. No other provider in this group has an equivalent data platform relationship. Cohere's partnerships with AWS, Azure, GCP, and OCI offer deployment flexibility, but that's a separate question from data integration.
Databricks' Mosaic AI Gateway works like a policy enforcement point for model access. Every request is routed, logged, and checked against Unity Catalog permissions. Your OAuth intuition helps here: centralized policy, token-level attribution, audit logging. It starts to mislead you when you realize the "resource" being protected is a model's output, and the governance question becomes "who accessed this, with what prompt, and what came back."
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "Those are all basically the same — they just host open-source models." | "Together, Fireworks, and Replicate all host open-weight models, but they solve different problems. Which one are you evaluating?" | The buyer chose a provider for a reason; asking which one tells you their priority. |
| "Cohere is like a smaller OpenAI." | "Cohere builds their own models, but the reason enterprises pick them is the deployment model — the models run inside your environment." | Cohere's differentiator is where the model runs. |
| "Databricks does AI now too." | "If you're already on Databricks, their model serving keeps inference inside the same governance layer as your data. That's usually why teams pick it over a standalone provider." | Acknowledges the buyer's existing investment and names the actual value. |
| "Fireworks and Together are the same thing." | "Fireworks and Together are priced within pennies on the same models. Fireworks optimizes for latency; Together has the broader catalog and better fine-tuning support." | Shows you understand the actual differentiator between the two. |
| "Replicate is just for prototyping." | "Replicate is strong for image and video models. For production LLM inference, most teams end up on Together or Fireworks. Cloudflare acquired Replicate last year, so the enterprise story may change." | Accurate without being dismissive; acknowledges the acquisition trajectory. |
| "Open-source models are free, so why pay a provider?" | "The models are open-weight — free to download, but the license terms vary by model. The inference infrastructure is what you're paying for: GPUs, optimization, uptime." | "Open-weight" is the precise term; actual license terms differ across model families, and buyers notice when you get this right. |
| "You should use Bedrock instead." | "What's driving you toward [provider] over Bedrock? Is it cost, latency, a compliance requirement, or your existing data stack?" | Surfaces the buyer's actual decision criteria without challenging their choice. |
| "Cohere has FedRAMP." | "Cohere has SOC 2 Type II and ISO 27001, but no FedRAMP authorization as of now. Their sovereignty play is deployment flexibility — VPC, on-prem, air-gapped." | Prevents a compliance overclaim that could damage credibility with a public sector buyer. |
| "Databricks is expensive for AI." | "Databricks pricing is DBU-based, not per-token, so it's hard to compare directly. The value case is governance and data proximity." | Reframes the cost conversation around what Databricks actually sells. |
| "These providers don't matter for identity." | "When a buyer picks a provider that runs models inside their own environment, the identity question is who authenticates to that model endpoint and how access gets governed. That's where we come in." | Connects the provider choice to an identity conversation without forcing it. |
Together and Fireworks charge per-token. Replicate charges per-second of GPU time. If you've explained to a buyer why per-user licensing differs from per-transaction licensing, you already understand why this distinction matters to a finance team. The buyer evaluating Replicate for production workloads will hit the same forecasting wall that per-transaction identity pricing creates, and the same conversation about whether predictability or flexibility matters more at their scale.
Pricing figures reflect published rates as of late April 2026. Inference pricing changes without notice; treat all figures as directional benchmarks, not quotes. Verify against provider pricing pages before any client-facing use. Latency claims from Fireworks and Together are vendor-reported; independent practitioner testing is directionally consistent but not independently audited. The sub-100ms Fireworks claim is sourced partly from featherless.ai, itself an inference provider with commercial interests in the comparison space; their characterization is consistent with Fireworks' own documentation and practitioner reports but is not a disinterested source. The Cohere–Aleph Alpha merger was pending regulatory approval and had not formally closed as of publication. Replicate enterprise characterizations draw partly from WaveSpeedAI, a direct competitor whose assessment is consistent with broader practitioner sentiment but carries commercial interest.
Things to follow up on...
- Cohere's Aleph Alpha close: The merger was pending regulatory approval as of late April 2026, and whether Aleph Alpha's existing German government contracts survive under Canadian ownership will shape Cohere's sovereign credibility with European public sector buyers.
- Replicate's Cloudflare integration: Cloudflare's acquisition of Replicate closed in early 2026 but the Workers AI integration roadmap hasn't been published, so the platform's enterprise story could look materially different within 12 months.
- Databricks DBU pricing shifts: The DBU-based pricing model includes hidden costs for Unity Catalog governance checks and inference table logging that aren't obvious from headline rates, and these are worth verifying before any buyer-facing cost comparison.
- Deepinfra as a price outlier: At $0.23 per million input tokens for Llama 3.3 70B, Deepinfra undercuts Together and Fireworks by roughly 4x, a provider worth watching if price-at-scale conversations keep surfacing in your accounts.

