KairosRoute
Engineering · 12 min read

A/B Testing LLMs in Production Without Shipping a Regression

Offline evals are great for ruling out disasters. They're bad for deciding between two "both pretty good" models on your workload. The only way to really know is to send production traffic to both and measure. This is A/B testing for LLMs, and it's under-used because teams assume it's a big project. It doesn't have to be.

This post is the practical playbook — the one we use internally and recommend to customers — for running a model A/B test that actually produces a decision, not just a debate.

What question are you actually answering?

The most common mistake is fuzzy framing. "Is GPT-5.4 better than Sonnet?" is not a testable question. These are:

  • Does Haiku resolve support tickets at ≥95% of Sonnet's resolution rate?
  • Is the per-ticket cost difference between Haiku and Sonnet big enough to offset a 2pp drop in resolution rate?
  • Does DeepSeek V3.2 produce JSON outputs that pass schema validation at the same rate as GPT-5.4?

Each has a concrete metric, a concrete threshold, and a concrete business decision attached. Without those, you'll run the test, see numbers, and not know what to do.

Pick one primary metric, then guardrails

Every test should have exactly one primary success metric and 2–4 guardrail metrics.

  • Primary metric. The one number you'd bet the decision on. Usually a downstream KPI — resolution rate, task completion, JSON validity, user thumbs-up rate. Not "cost" unless cost is the literal goal of the test.
  • Guardrails. Metrics that must not get materially worse. Usually: cost per unit, latency p95, refusal rate, tool-call success rate. A test that improves the primary at a cost of 2x latency is a test that shipped a regression.

If your primary metric is hard to define, your A/B test is premature. Stop and define it — otherwise you'll finish the test with good vibes and no decision.

Traffic splitting: per-user, per-session, or per-request?

Three options, each with tradeoffs.

Per-request (50/50 bucket on request ID)

Simplest. Every incoming request gets hashed, half go to Model A, half to Model B. Fast to reach statistical power because every request is an independent sample.

Problem: a multi-turn conversation will ping-pong between models. The user sees different tones in consecutive messages. For single-turn workloads (classification, extraction, summarization) this is fine. For conversations, it's a UX regression on top of the thing you're testing.

Per-session (sticky within a session)

Hash on session ID. Every request in the session routes to the same model. Clean UX. Slower to hit statistical power because sessions are more correlated than requests — your effective sample size is closer to sessions than requests.

Per-user (sticky across sessions)

Hash on user ID. Slowest to hit power. Best for cases where users themselves are the measurement unit — satisfaction, retention, usage frequency — and the effect takes time to show up.

Our default recommendation: start with per-session. It balances UX with statistical efficiency for most workloads. Switch to per-user if you're measuring effects that unfold over weeks.
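The sticky assignment described above is only a few lines of code. A minimal sketch, using the standard library; the function name, experiment label, and 50/50 split are illustrative, and the ID you hash on is whatever unit you picked (request, session, or user):

```python
import hashlib

def assign_arm(unit_id: str, experiment: str = "haiku-vs-sonnet",
               split: float = 0.5) -> str:
    """Deterministic bucketing: the same unit_id always lands in the same arm.
    unit_id is a request, session, or user ID depending on the split you chose."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if bucket < split else "control"
```

Salting the hash with the experiment name matters: without it, the same users would land in the treatment arm of every experiment you ever run.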

Sample size: the annoying part

Every "we tested X vs Y and X was 5% better" result that doesn't report a p-value or a confidence interval is noise until proven otherwise. Do the sample size math before the test starts.

For a binary metric (resolved yes/no), a rough per-arm sample size formula (the constant 16 corresponds to α = 0.05 and ~80% power) is:

```text
n = 16 * p * (1 - p) / MDE^2

p    = baseline rate (e.g., 0.85 resolution rate)
MDE  = minimum detectable effect (e.g., 0.02 for 2pp)

Example: p = 0.85, MDE = 0.02
n = 16 * 0.85 * 0.15 / 0.0004 = ~5,100 per arm
  = ~10,200 total observations before the test has teeth
```

If your traffic is 5K requests/day, this test takes ~2 days. If it's 500 requests/day, it takes 20. Plan accordingly. A test run with 1/10th the required sample will produce a shrug.
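The formula is worth scripting so nobody fumbles the division under deadline pressure. A sketch (the function name is ours):

```python
def per_arm_sample_size(p: float, mde: float) -> int:
    """Rough per-arm n for a binary metric at alpha = 0.05 and ~80% power.
    The constant 16 is approximately 2 * (1.96 + 0.84)**2."""
    return round(16 * p * (1 - p) / mde ** 2)

print(per_arm_sample_size(0.85, 0.02))  # 5100, matching the example above
```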

Reducing sample size with better metrics

Two tricks:

  • Use continuous metrics when possible. Instead of binary resolution (yes/no), use a continuous quality score (0–1). For the same power, a well-behaved continuous metric can need severalfold fewer samples (~4x is a common rule of thumb, though it depends on the score's variance).
  • CUPED (Controlled experiments Using Pre-Experiment Data). If you have a pre-period metric per user that correlates with the outcome, regression-adjust to reduce variance. Can cut required sample size by 30–50% for free.
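The CUPED adjustment itself is small. A sketch, assuming you have, per unit, an in-experiment outcome `y` and a correlated pre-period metric `x` (the function name is ours):

```python
from statistics import fmean, variance

def cuped_adjust(y: list[float], x: list[float]) -> list[float]:
    """Regression-adjust outcomes y using pre-experiment covariate x.
    theta = cov(y, x) / var(x); the adjusted outcomes keep the same mean
    but have lower variance whenever x correlates with y."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - my) * (b - mx) for a, b in zip(y, x)) / (len(y) - 1)
    theta = cov / variance(x)
    return [a - theta * (b - mx) for a, b in zip(y, x)]
```

Because the mean is unchanged, arm comparisons are unaffected; only the variance, and hence the required sample size, shrinks.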

If your team doesn't have an experimentation platform already, this stuff is overkill. Use the raw formula, expand the test window, move on.

The guardrail stop-rule

Define in advance: if a guardrail metric crosses a threshold, the test stops, regardless of the primary metric's state. Examples:

  • p95 latency > 2x control → stop.
  • Refusal rate > 2x control → stop.
  • JSON validity rate < 90% on the candidate → stop.
  • Cost > 3x control (for a test that isn't optimizing for cost) → stop.

Without a stop rule, you'll run the test for two weeks, see the primary metric improve, and ship a regression because nobody was watching latency. The stop rule is how you protect real users from your experiment.
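Mechanically, the stop rule is a plain comparison you run on every metrics refresh. A sketch using the thresholds from the list above (the metric names and dict shape are illustrative):

```python
def check_guardrails(treatment: dict, control: dict) -> list[str]:
    """Return the list of violated guardrails; an empty list means keep running."""
    violations = []
    if treatment["p95_latency_ms"] > 2 * control["p95_latency_ms"]:
        violations.append("p95 latency > 2x control")
    if treatment["refusal_rate"] > 2 * control["refusal_rate"]:
        violations.append("refusal rate > 2x control")
    if treatment["json_validity"] < 0.90:
        violations.append("JSON validity < 90%")
    if treatment["cost_per_req"] > 3 * control["cost_per_req"]:
        violations.append("cost > 3x control")
    return violations
```

Wire the non-empty case to an alert and an automatic traffic rollback, not to a dashboard someone might check.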

Reading the results without kidding yourself

When the test concludes, evaluate in this order:

  1. Did guardrails hold? If any stopped, the test is a no-go regardless of the primary.
  2. Is the primary metric difference statistically significant? p < 0.05 with a pre-specified test. If not, the answer is "no difference detected." That's still a result: provided the test was adequately powered, you can pick the cheaper option.
  3. Is the effect size meaningful? A statistically significant 0.3% improvement is negligible in business terms. Set a minimum detectable effect before the test and honor it.
  4. Does the result hold across segments? If the primary improved overall but degraded on your paid tier, you don't have a win — you have a segmentation decision to make.
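For a binary primary metric, step 2 is a standard two-proportion z-test, which fits in a screen of standard-library Python. A sketch (the function name is ours) returning the difference, a two-sided p-value, and a 95% confidence interval:

```python
from math import erf, sqrt

def two_proportion_test(x1: int, n1: int, x2: int, n2: int):
    """Compare success counts x1/n1 (candidate) vs x2/n2 (control)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    # two-sided p-value via the normal CDF: Phi(t) = 0.5 * (1 + erf(t / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled SE for the CI
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, p_value, ci
```

For a continuous primary metric, Welch's t-test plays the same role.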

What KairosRoute does out of the box

Model A/B testing is built into the product. In the dashboard, you pick two models, set a traffic split (per-request, per-session, or per-user), and let it run. You get:

  • Sticky assignment by the unit you pick.
  • Primary and guardrail metric dashboards computed continuously.
  • Auto-stop if a guardrail is violated.
  • Sample-size progress bar so you know when the test is ready to read.
  • p-values and confidence intervals computed by standard two-proportion / Welch's t-test.

No client-side code. The bucketing happens at our edge; your request just goes where we say. If you're not using us, the same playbook applies — you just implement it yourself.

A worked example

Support team runs Claude Sonnet. Considers Haiku to cut costs. Test:

  • Primary: ticket resolution rate (binary, current baseline ~0.87).
  • MDE: 0.02 (2pp). Per-arm sample: ~4,500 (from the formula above, with p = 0.87). Per-session split.
  • Guardrails: p95 latency must stay within 1.3x, refusal rate must not double, cost must not rise.
  • Duration: ~5 days at current traffic.

After 5 days: resolution rate on Haiku is 0.859 vs. 0.871 on Sonnet. 95% CI for the difference: [-0.024, 0.000]. Not statistically significant. Per-ticket cost on Haiku is ~4.2x lower. Guardrails all held.

Decision: switch to Haiku for the buckets classified as "easy" tickets. Keep Sonnet for the "complex" bucket. Realized cost reduction: ~65%. Observed resolution impact: within noise.

That's the entire playbook. One question, one primary metric, guardrails, sticky bucketing, sample sizing, stop rules, honest reading. Do this regularly and your model bill goes down without your users noticing anything but the occasional "huh, this feels snappier."

Ready to route smarter?

KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.

Related Reading

Silent Quality Regression: The LLM Bug You Never Notice

Your model bill went down 20%. Nobody complained. Three weeks later, your agent's resolution rate has quietly dropped 12%. This is silent quality regression — and it is the single most dangerous failure mode in LLM ops.

The Unit Economics of AI Agents: A Cost Model That Actually Works

AI agents scale 10–100x model calls per user action. If you don't have a per-ticket, per-task, or per-conversation cost model, you are running a business on vibes. Here's how to build one — and what it reveals.

What kr-auto Does (and Why It Beats Hand-Rolled Routing)

kr-auto picks the right model for every request, gets smarter from your own traffic, and gives you a receipt for the decision. Here is what that actually buys you — and why teams who try to roll their own spend six months getting it wrong.