Provider Latency Leaderboard — April 2026 Update
Price is the headline number for most LLM comparisons. Latency is the headline number for every team that actually ships a user-facing AI product. Slow tokens kill conversion the same way slow pages killed e-commerce in 2014 — quietly, pervasively, and mostly invisibly to the team that built the app.
This leaderboard aggregates the time-to-first-token (TTFT) distributions we saw on our routing fabric in March and early April 2026. It is the first cut of what we intend to update monthly.
The leaderboard
TTFT percentiles for each provider's median-used model on our fabric. For providers where we route to multiple models, we report the one that served the most traffic. All numbers are milliseconds to first streamed token, measured from the moment our gateway dispatched the request.
Provider      Model (median-used)   p50     p95     p99     Error rate
────────────────────────────────────────────────────────────────────────
Groq          Llama 4 70B           85ms    210ms   540ms   0.14%
Cerebras      Llama 3.3 70B         110ms   280ms   710ms   0.21%
Google        Gemini 3 Flash        180ms   420ms   1.1s    0.08%
DeepSeek      DeepSeek-Chat         240ms   610ms   1.8s    0.31%
Together AI   Llama 4 Maverick      260ms   640ms   1.9s    0.24%
Fireworks     Qwen3-72B             270ms   680ms   2.1s    0.19%
xAI           Grok-3 Fast           310ms   720ms   2.0s    0.26%
OpenAI        GPT-5 mini            340ms   860ms   2.4s    0.11%
Anthropic     Sonnet 4.7            410ms   980ms   2.8s    0.09%
Mistral       Mistral Large 3       430ms   1.1s    3.2s    0.18%
Cohere        Command R+            480ms   1.2s    3.4s    0.22%
Groq and Cerebras sit in their own tier. If sub-100ms TTFT is load-bearing to your UX, they are the only two providers that reliably get you there on their median models. Everybody else lives in the 200ms–500ms band at p50, and the gap mostly closes for post-first-token throughput, which is a different benchmark that we'll publish separately.
The p99 column is the one your on-call engineer feels. Averages lie and p50s are pleasant fictions; the 99th percentile defines perceived reliability for your power users, the top slice of requesters who often account for half your revenue.
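To make the averages-versus-tails point concrete, here is a minimal nearest-rank percentile sketch. The latency samples are made-up illustrative values, not KairosRoute data:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: value at 1-based rank ceil(p/100 * n)."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical TTFT samples in milliseconds (illustrative only).
ttft_ms = [85, 90, 95, 100, 120, 140, 180, 250, 400, 900]

mean_ms = sum(ttft_ms) / len(ttft_ms)   # 236 ms — flattered by the tail
p50_ms = percentile(ttft_ms, 50)        # 120 ms — the "pleasant fiction"
p99_ms = percentile(ttft_ms, 99)        # 900 ms — what power users feel
```

In this toy sample the mean (236ms) sits nowhere near either the median (120ms) or the p99 (900ms), which is exactly why averages make poor SLOs.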
Regional variation
Latency is wildly sensitive to routing region. For the three providers where we have enough cross-region traffic to make a claim, here's the p50 spread.
Provider     US-East   US-West   EU-West   EU-Central   APAC (via proxy)
─────────────────────────────────────────────────────────────────────────
OpenAI       340ms     360ms     410ms     420ms        690ms
Anthropic    410ms     430ms     520ms     540ms        780ms
Google       180ms     190ms     210ms     220ms        340ms
Two takeaways. First, Google's edge POPs give it a genuine structural advantage for globally distributed traffic — we suspect that advantage widens further for APAC customers. Second, for the frontier labs, routing from the wrong region can cost you 40–100% on p50, which is the difference between "snappy" and "laggy" in the user's perception.
This is part of why kr-auto is region-aware by default. Sending every request through us-east-1 when your user is in Frankfurt is a tax most teams pay silently.
Outage minutes, Q1 2026
We define an outage as a sustained window (longer than 120 seconds) where a provider's error rate exceeded 5% for any of the models we route to them. These are the minutes of degraded service we observed and counted.
Provider       Outage minutes (Q1 2026)   # distinct incidents
──────────────────────────────────────────────────────────────
Anthropic      142                        6
OpenAI         118                        4
xAI            87                         5
Cohere         76                         3
DeepSeek       61                         7
Mistral        54                         4
Together AI    41                         3
Google         33                         2
Fireworks      28                         2
Groq           19                         2
Cerebras       14                         1
A few notes before anyone screenshots this. These are incidents we observed and counted from where our fabric dispatches traffic — a provider may have had outages we didn't see because none of our traffic hit an affected region, and vice versa. Still, directionally: the frontier labs have the most incidents, which tracks with their complexity. Providers running narrower infrastructure (Groq, Cerebras, Fireworks) tend to have fewer but sometimes longer incidents.
None of the 11 providers on our fabric finished the quarter with zero outage minutes. That is the argument for multi-provider failover in one sentence: if you have one provider, you have no provider.
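The outage definition above (error rate over 5% sustained for more than 120 seconds) can be sketched as a simple scan over error-rate samples. The sample data and fixed 60-second sampling interval here are illustrative assumptions, not our actual telemetry pipeline:

```python
def outage_minutes(samples, threshold=0.05, min_duration_s=120, interval_s=60):
    """Count minutes inside sustained windows where error rate exceeds threshold.

    samples: list of (timestamp_s, error_rate), one per interval_s, sorted by time.
    Only runs strictly longer than min_duration_s count, matching the definition.
    """
    total_s = 0
    run_s = 0
    for _, err in samples:
        if err > threshold:
            run_s += interval_s
        else:
            if run_s > min_duration_s:
                total_s += run_s
            run_s = 0
    if run_s > min_duration_s:      # close out a run at end of window
        total_s += run_s
    return total_s / 60

# Illustrative samples: a 180 s degraded run (counts), then a 120 s run (does not).
degraded = [(0, 0.01), (60, 0.08), (120, 0.09), (180, 0.07),
            (240, 0.01), (300, 0.06), (360, 0.06), (420, 0.00)]
```

Note the threshold behavior matches the caveat in the methodology section: a run just over 120 seconds counts, a run at or under it does not.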
Latency-adjusted cost (LAC)
Here's a concept we've been using internally and want to put in the water. Cost per token is a vendor metric. Latency-adjusted cost is closer to an operator metric — what am I paying, in dollars, per unit of fast inference?
Our working definition:
LAC = (cost per 1M tokens) × (p95 TTFT in seconds ÷ reference p95)
      where reference p95 = 0.5 seconds

A model priced at $2.10/1M with a p95 of 0.98s → LAC = $2.10 × (0.98 ÷ 0.5) = $4.12 / effective 1M
A model priced at $4.00/1M with a p95 of 0.28s → LAC = $4.00 × (0.28 ÷ 0.5) = $2.24 / effective 1M
LAC penalizes slow models by how slow they are relative to a "reasonable" baseline. A cheaper-but-slower model can have a worse LAC than a pricier-but-faster one if latency is load-bearing to your UX. Conversely, if you're running batch jobs at 3am, ignore LAC entirely and pick on raw cost.
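The LAC definition is a one-liner in code. The prices and p95 values below are the article's own worked examples; the 0.5-second reference is the stated baseline:

```python
REFERENCE_P95_S = 0.5  # the "reasonable" p95 baseline from the definition

def latency_adjusted_cost(cost_per_1m, p95_ttft_s, reference_p95_s=REFERENCE_P95_S):
    """Dollars per effective 1M tokens, scaled by how p95 TTFT compares to baseline."""
    return cost_per_1m * (p95_ttft_s / reference_p95_s)

# The two worked examples from the definition above:
slow_model = round(latency_adjusted_cost(2.10, 0.98), 2)   # 4.12
fast_model = round(latency_adjusted_cost(4.00, 0.28), 2)   # 2.24
```

The fast model costs nearly twice as much per raw token but comes out cheaper per unit of fast inference, which is the whole point of the metric.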
Latency-adjusted cost leaderboard — April 2026 (balanced tier)

Model              $/1M    p95 TTFT   LAC per effective 1M
──────────────────────────────────────────────────────────
Gemini 3 Flash     $0.38   420ms      $0.32
Groq Llama 4 70B   $0.59   210ms      $0.25
Haiku 4            $1.40   760ms      $2.13
Sonnet 4.7         $3.10   980ms      $6.08
GPT-5 mini         $1.90   860ms      $3.27
DeepSeek-Chat      $0.22   610ms      $0.27
Mistral Large 3    $2.20   1100ms     $4.84
Three models — Groq Llama, Gemini 3 Flash, and DeepSeek-Chat — cluster tightly at the low end of the LAC table. These are the latency-adjusted bargains of the quarter. Every one of them gets disproportionate volume on kr-auto for conversational and fast-classification tasks.
Streaming throughput (tokens per second, post-first-token)
TTFT is first-byte latency. It doesn't tell you how long the full response takes to render. Here are the median streaming throughputs we observe, measured as tokens/second after the first token arrives.
Provider      Model              Median tok/s   p05 tok/s
─────────────────────────────────────────────────────────
Groq          Llama 4 70B        680            420
Cerebras      Llama 3.3 70B      520            340
Together AI   Llama 4 Maverick   190            110
Fireworks     Qwen3-72B          170            95
Google        Gemini 3 Flash     140            88
DeepSeek      DeepSeek-Chat      95             58
xAI           Grok-3 Fast        92             54
OpenAI        GPT-5 mini         78             47
Anthropic     Sonnet 4.7         62             38
Mistral       Mistral Large 3    55             31
Cohere        Command R+         48             27
If your use case emits long responses, throughput matters more than TTFT. A 2000-token answer at 60 tok/s takes 33 seconds of streaming time after the first token. The same answer at 600 tok/s takes 3.3 seconds. For long-form generation, dedicated inference hardware (Groq, Cerebras) is still an order-of-magnitude story.
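The arithmetic above generalizes to a simple perceived-completion-time estimate. This is a back-of-envelope sketch, not a benchmark harness; the example values are the article's own:

```python
def stream_seconds(tokens, tok_per_s):
    """Approximate streaming time after the first token arrives."""
    return tokens / tok_per_s

def total_seconds(ttft_s, tokens, tok_per_s):
    """Rough end-to-end render time: first-token wait plus streaming time."""
    return ttft_s + stream_seconds(tokens, tok_per_s)

# The 2000-token example from above:
slow = round(stream_seconds(2000, 60), 1)    # 33.3 s of streaming
fast = round(stream_seconds(2000, 600), 1)   # 3.3 s of streaming
```

For short chat turns the TTFT term dominates `total_seconds`; for long-form generation the throughput term does, which is why the two benchmarks rank providers differently.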
Three patterns we're watching
1. The fast-tier floor is dropping. A year ago, 300ms p50 was table stakes for a "fast" model. Today the floor is closer to 100ms. By this time next year we expect sub-60ms p50 to be commonplace for small models.
2. Frontier models aren't getting faster. Opus 5 has roughly the same p50 as Opus 4 did a year ago. The frontier appears to be trading speed for capability at a roughly constant rate. If you need both, you're routing.
3. Provider tail risk is converging. The ratio of p99 to p50 used to vary by 3x across providers. Today most providers cluster between 4.5x and 6.5x. Tail reliability is becoming a commodity; base-case latency is where differentiation lives.
What this means if you're building
- Measure p95 and p99, not averages. Averages hide the outages and the bad days. The 99th percentile is what your users feel when things go sideways.
- Pick your region deliberately. A 200ms regional gap is a real UX hit. If you have international users, route from regions close to them.
- Don't conflate TTFT and throughput. Chatbot UX cares about TTFT. Long-form generation cares about throughput. Agents that do both care about both.
- Always have a failover. Zero providers had zero outages. Multi-provider routing is not premium engineering — it's basic resilience.
- Use LAC for operator decisions. Raw $/1M tokens understates the cost of slow models when latency matters.
Methodology & caveats
These numbers are derived from KairosRoute's routing fabric. We measure TTFT from the moment a request is dispatched from our gateway to the moment we receive the first streamed token back from the provider. That includes network transit and any request-queuing on the provider side, but excludes any client-to-gateway transit (which varies by your location, not ours).
- Regional coverage. Our fabric routes primarily from US-East and EU-West. Teams in APAC or LATAM will see different numbers. We've noted regional variation where we have reliable data.
- Model selection bias. We report the median-used model per provider. A different model choice shifts the number. For example, Anthropic's Haiku 4 is substantially faster than Sonnet 4.7 — picking Haiku for the leaderboard would reorder the rankings.
- Sample size varies. Groq and Anthropic each have tens of millions of routed requests in this window. Cohere and Cerebras are closer to two million. Smaller samples mean wider confidence intervals — take the tail percentiles with an extra grain of salt.
- Outage minutes are lower bounds. We count what we observed. Providers may have had incidents that missed our fabric entirely. Also, a 121-second blip counts as an outage here; a 119-second blip does not. Any threshold is arbitrary.
- Latency moves fast. Providers ship inference optimizations weekly. Numbers in this leaderboard are a snapshot. We'll update monthly.
If you want the raw TTFT distributions per model, the full methodology doc, or want to replicate this on your own traffic, reach out. And if you'd rather have latency-aware routing baked in without writing the logic yourself, try the playground — kr-auto makes these decisions automatically using telemetry that's refreshed every few minutes.
For more on the operator view, see Agent Observability is the New APM and the quarterly KairosRoute LLM Cost Index.
Ready to route smarter?
KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.
Related Reading
- Quarterly benchmark of median $/1M tokens across 10 providers and 45+ models, broken down by tier and task type. Plus our first read on the token deflation rate.
- An annual industry report on what AI teams are actually running in production — model mix, observability adoption, cost-per-outcome improvements, and our best predictions for 2027. Based on KairosRoute routing telemetry and onboarding interviews.
- Application performance monitoring gave every engineering team a dashboard for what their services are doing. Agent observability is the same shift, happening now, for AI-native products. Here is the thesis.