Provider Latency Leaderboard — April 2026 Update
Price is the headline number for most LLM comparisons. Latency is the headline number for every team that actually ships a user-facing AI product. Slow tokens kill conversion the same way slow pages killed e-commerce in 2014 — quietly, pervasively, and mostly invisibly to the team that built the app.
This leaderboard aggregates the time-to-first-token (TTFT) distributions we saw on our routing fabric in March and early April 2026. It is the first cut of what we intend to update monthly.
The leaderboard
TTFT percentiles for each provider's median-used model on our fabric. For providers where we route to multiple models, we report the one that served the most traffic. All numbers are milliseconds to first streamed token, measured from the moment our gateway dispatched the request.
Provider      Model (median-used)   p50     p95     p99     Error rate
────────────────────────────────────────────────────────────────────────
Groq          Llama 4 70B           85ms    210ms   540ms   0.14%
Cerebras      Llama 3.3 70B         110ms   280ms   710ms   0.21%
Google        Gemini 3 Flash        180ms   420ms   1.1s    0.08%
DeepSeek      DeepSeek-Chat         240ms   610ms   1.8s    0.31%
Together AI   Llama 4 Maverick      260ms   640ms   1.9s    0.24%
Fireworks     Qwen3-72B             270ms   680ms   2.1s    0.19%
xAI           Grok-3 Fast           310ms   720ms   2.0s    0.26%
OpenAI        GPT-5 mini            340ms   860ms   2.4s    0.11%
Anthropic     Sonnet 4.7            410ms   980ms   2.8s    0.09%
Mistral       Mistral Large 3       430ms   1.1s    3.2s    0.18%
Cohere        Command R+            480ms   1.2s    3.4s    0.22%
Groq and Cerebras sit in their own tier. If sub-100ms TTFT is load-bearing to your UX, they are the only two providers that reliably get you there on their median models. Everybody else lives in the 200ms–500ms band at p50, and the gap mostly closes for post-first-token throughput, which is a different benchmark that we'll publish separately.
The p99 column is the one your on-call engineer feels. Averages lie and p50s are pleasant fictions; the 99th percentile defines perceived reliability for your power users, the top slice of requesters who often account for half your revenue.
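To make the averages-versus-tails point concrete, here is a minimal nearest-rank percentile sketch. The latency samples are made-up illustrative values, not KairosRoute data:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: value at 1-based rank ceil(p/100 * n)."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical TTFT samples in milliseconds (illustrative only).
ttft_ms = [85, 90, 95, 100, 120, 140, 180, 250, 400, 900]

mean_ms = sum(ttft_ms) / len(ttft_ms)   # 236 ms — flattered by the tail
p50_ms = percentile(ttft_ms, 50)        # 120 ms — the "pleasant fiction"
p99_ms = percentile(ttft_ms, 99)        # 900 ms — what power users feel
```

In this toy sample the mean (236ms) sits nowhere near either the median (120ms) or the p99 (900ms), which is exactly why averages make poor SLOs.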
Regional variation
Latency is wildly sensitive to routing region. For the three providers where we have enough cross-region traffic to make a claim, here's the p50 spread.
Provider     US-East   US-West   EU-West   EU-Central   APAC (via proxy)
─────────────────────────────────────────────────────────────────────────
OpenAI       340ms     360ms     410ms     420ms        690ms
Anthropic    410ms     430ms     520ms     540ms        780ms
Google       180ms     190ms     210ms     220ms        340ms
Two takeaways. First, Google's edge POPs give it a genuine structural advantage for globally distributed traffic — we suspect that advantage widens further for APAC customers. Second, for the frontier labs, routing from the wrong region can cost you 40–100% on p50, which is the difference between "snappy" and "laggy" in the user's perception.
This is part of why kr-auto is region-aware by default. Sending every request through us-east-1 when your user is in Frankfurt is a tax most teams pay silently.
Outage minutes, Q1 2026
We define an outage as a sustained window (longer than 120 seconds) where a provider's error rate exceeded 5% for any of the models we route to them. These are the minutes of degraded service we observed and counted.
Provider       Outage minutes (Q1 2026)   # distinct incidents
──────────────────────────────────────────────────────────────
Anthropic      142                        6
OpenAI         118                        4
xAI            87                         5
Cohere         76                         3
DeepSeek       61                         7
Mistral        54                         4
Together AI    41                         3
Google         33                         2
Fireworks      28                         2
Groq           19                         2
Cerebras       14                         1
A few notes before anyone screenshots this. These are incidents we observed and counted from where our fabric dispatches traffic — a provider may have had outages we didn't see because none of our traffic hit an affected region, and vice versa. Still, directionally: the frontier labs have the most incidents, which tracks with their complexity. Providers running narrower infrastructure (Groq, Cerebras, Fireworks) tend to have fewer but sometimes longer incidents.
None of the 11 providers on our fabric finished the quarter with zero outage minutes. That is the argument for multi-provider failover in one sentence: if you have one provider, you have no provider.
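The outage definition above (error rate over 5% sustained for more than 120 seconds) can be sketched as a simple scan over error-rate samples. The sample data and fixed 60-second sampling interval here are illustrative assumptions, not our actual telemetry pipeline:

```python
def outage_minutes(samples, threshold=0.05, min_duration_s=120, interval_s=60):
    """Count minutes inside sustained windows where error rate exceeds threshold.

    samples: list of (timestamp_s, error_rate), one per interval_s, sorted by time.
    Only runs strictly longer than min_duration_s count, matching the definition.
    """
    total_s = 0
    run_s = 0
    for _, err in samples:
        if err > threshold:
            run_s += interval_s
        else:
            if run_s > min_duration_s:
                total_s += run_s
            run_s = 0
    if run_s > min_duration_s:      # close out a run at end of window
        total_s += run_s
    return total_s / 60

# Illustrative samples: a 180 s degraded run (counts), then a 120 s run (does not).
degraded = [(0, 0.01), (60, 0.08), (120, 0.09), (180, 0.07),
            (240, 0.01), (300, 0.06), (360, 0.06), (420, 0.00)]
```

Note the threshold behavior matches the caveat in the methodology section: a run just over 120 seconds counts, a run at or under it does not.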
Latency-adjusted cost (LAC)
Here's a concept we've been using internally and want to put in the water. Cost per token is a vendor metric. Latency-adjusted cost is closer to an operator metric — what am I paying, in dollars, per unit of fast inference?
Our working definition:
LAC = (cost per 1M tokens) × (p95 TTFT in seconds ÷ reference p95)
      where reference p95 = 0.5 seconds

A model priced at $2.10/1M with a p95 of 0.98s → LAC = $2.10 × (0.98 ÷ 0.5) = $4.12 / effective 1M
A model priced at $4.00/1M with a p95 of 0.28s → LAC = $4.00 × (0.28 ÷ 0.5) = $2.24 / effective 1M
LAC penalizes slow models by how slow they are relative to a "reasonable" baseline. A cheaper-but-slower model can have a worse LAC than a pricier-but-faster one if latency is load-bearing to your UX. Conversely, if you're running batch jobs at 3am, ignore LAC entirely and pick on raw cost.
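The LAC definition is a one-liner in code. The prices and p95 values below are the article's own worked examples; the 0.5-second reference is the stated baseline:

```python
REFERENCE_P95_S = 0.5  # the "reasonable" p95 baseline from the definition

def latency_adjusted_cost(cost_per_1m, p95_ttft_s, reference_p95_s=REFERENCE_P95_S):
    """Dollars per effective 1M tokens, scaled by how p95 TTFT compares to baseline."""
    return cost_per_1m * (p95_ttft_s / reference_p95_s)

# The two worked examples from the definition above:
slow_model = round(latency_adjusted_cost(2.10, 0.98), 2)   # 4.12
fast_model = round(latency_adjusted_cost(4.00, 0.28), 2)   # 2.24
```

The fast model costs nearly twice as much per raw token but comes out cheaper per unit of fast inference, which is the whole point of the metric.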
Latency-adjusted cost leaderboard — April 2026 (balanced tier)

Model              $/1M    p95 TTFT   LAC per effective 1M
──────────────────────────────────────────────────────────
Gemini 3 Flash     $0.38   420ms      $0.32
Groq Llama 4 70B   $0.59   210ms      $0.25
Haiku 4            $1.40   760ms      $2.13
Sonnet 4.7         $3.10   980ms      $6.08
GPT-5 mini         $1.90   860ms      $3.27
DeepSeek-Chat      $0.22   610ms      $0.27
Mistral Large 3    $2.20   1100ms     $4.84
Three models — Groq Llama, Gemini 3 Flash, and DeepSeek-Chat — cluster tightly at the low end of the LAC table. These are the latency-adjusted bargains of the quarter. Every one of them gets disproportionate volume on kr-auto for conversational and fast-classification tasks.
Streaming throughput (tokens per second, post-first-token)
TTFT is first-byte latency. It doesn't tell you how long the full response takes to render. Here are the median streaming throughputs we observe, measured as tokens/second after the first token arrives.
Provider      Model              Median tok/s   p05 tok/s
─────────────────────────────────────────────────────────
Groq          Llama 4 70B        680            420
Cerebras      Llama 3.3 70B      520            340
Together AI   Llama 4 Maverick   190            110
Fireworks     Qwen3-72B          170            95
Google        Gemini 3 Flash     140            88
DeepSeek      DeepSeek-Chat      95             58
xAI           Grok-3 Fast        92             54
OpenAI        GPT-5 mini         78             47
Anthropic     Sonnet 4.7         62             38
Mistral       Mistral Large 3    55             31
Cohere        Command R+         48             27
If your use case emits long responses, throughput matters more than TTFT. A 2000-token answer at 60 tok/s takes 33 seconds of streaming time after the first token. The same answer at 600 tok/s takes 3.3 seconds. For long-form generation, dedicated inference hardware (Groq, Cerebras) is still an order-of-magnitude story.
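The arithmetic above generalizes to a simple perceived-completion-time estimate. This is a back-of-envelope sketch, not a benchmark harness; the example values are the article's own:

```python
def stream_seconds(tokens, tok_per_s):
    """Approximate streaming time after the first token arrives."""
    return tokens / tok_per_s

def total_seconds(ttft_s, tokens, tok_per_s):
    """Rough end-to-end render time: first-token wait plus streaming time."""
    return ttft_s + stream_seconds(tokens, tok_per_s)

# The 2000-token example from above:
slow = round(stream_seconds(2000, 60), 1)    # 33.3 s of streaming
fast = round(stream_seconds(2000, 600), 1)   # 3.3 s of streaming
```

For short chat turns the TTFT term dominates `total_seconds`; for long-form generation the throughput term does, which is why the two benchmarks rank providers differently.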
Three patterns we're watching
1. The fast-tier floor is dropping. A year ago, 300ms p50 was table stakes for a "fast" model. Today the floor is closer to 100ms. By this time next year we expect sub-60ms p50 to be commonplace for small models.
2. Frontier models aren't getting faster. Opus 5 has roughly the same p50 as Opus 4 did a year ago. The frontier appears to be trading speed for capability at a roughly constant rate. If you need both, you're routing.
3. Provider tail risk is converging. The ratio of p99 to p50 used to vary by 3x across providers. Today most providers cluster between 4.5x and 6.5x. Tail reliability is becoming a commodity; base-case latency is where differentiation lives.
What this means if you're building
- Measure p95 and p99, not averages. Averages hide the outages and the bad days. The 99th percentile is what your users feel when things go sideways.
- Pick your region deliberately. A 200ms regional gap is a real UX hit. If you have international users, route from regions close to them.
- Don't conflate TTFT and throughput. Chatbot UX cares about TTFT. Long-form generation cares about throughput. Agents that do both care about both.
- Always have a failover. Zero providers had zero outages. Multi-provider routing is not premium engineering — it's basic resilience.
- Use LAC for operator decisions. Raw $/1M tokens understates the cost of slow models when latency matters.
Methodology & caveats
These numbers are derived from KairosRoute's routing fabric. We measure TTFT from the moment a request is dispatched from our gateway to the moment we receive the first streamed token back from the provider. That includes network transit and any request-queuing on the provider side, but excludes any client-to-gateway transit (which varies by your location, not ours).
- Regional coverage. Our fabric routes primarily from US-East and EU-West. Teams in APAC or LATAM will see different numbers. We've noted regional variation where we have reliable data.
- Model selection bias. We report the median-used model per provider. A different model choice shifts the number. For example, Anthropic's Haiku 4 is substantially faster than Sonnet 4.7 — picking Haiku for the leaderboard would reorder the rankings.
- Sample size varies. Groq and Anthropic each have tens of millions of routed requests in this window. Cohere and Cerebras are closer to two million. Smaller samples mean wider confidence intervals — take the tail percentiles with an extra grain of salt.
- Outage minutes are lower bounds. We count what we observed. Providers may have had incidents that missed our fabric entirely. Also, a 121-second blip counts as an outage here; a 119-second blip does not. Any threshold is arbitrary.
- Latency moves fast. Providers ship inference optimizations weekly. Numbers in this leaderboard are a snapshot. We'll update monthly.
If you want the raw TTFT distributions per model, the full methodology doc, or want to replicate this on your own traffic, reach out. And if you'd rather have latency-aware routing baked in without writing the logic yourself, try the playground — kr-auto makes these decisions automatically using telemetry that's refreshed every few minutes.
For more on the operator view, see Agent Observability is the New APM and the quarterly KairosRoute LLM Cost Index.
Ready to route smarter?
KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.
Related Reading
- Quarterly benchmark of median $/1M tokens across 10 providers and 45+ models, broken down by tier and task type. Plus our first read on the token deflation rate.
- An annual industry report on what AI teams are actually running in production — model mix, observability adoption, cost-per-outcome improvements, and our best predictions for 2027. Based on KairosRoute routing telemetry and onboarding interviews.
- Application performance monitoring gave every engineering team a dashboard for what their services are doing. Agent observability is the same shift, happening now, for AI-native products. Here is the thesis.