Fewer dollars.
Equal or better answers.
The public benchmark suite runs every case through kr-auto and every major fixed-model baseline. Same prompts, same rubric, no judge model. Scoring and costs are deterministic and reproducible.
Last run: 2026-04-20 · router latency overhead
Routing sits at the top of the accuracy-cost Pareto frontier: no single-model run is more accurate at any price, and the cheaper runs give up double-digit accuracy.
| Run | Samples | Accuracy | Avg latency | Avg cost | Total cost |
|---|---|---|---|---|---|
| kr-autorouter | 48 | 96% | 918 ms | $0.00182 | $0.0874 |
| claude-sonnet-4-6 | 48 | 94% | 1342 ms | $0.00612 | $0.2938 |
| gpt-4.1 | 48 | 92% | 1124 ms | $0.0141 | $0.6778 |
| claude-haiku-4-5 | 48 | 79% | 834 ms | $0.00078 | $0.0374 |
| gpt-4o-mini | 48 | 75% | 712 ms | $0.00041 | $0.0197 |
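The average and total cost columns are linked by the sample count, so the headline table can be sanity-checked with plain arithmetic. A minimal TypeScript sketch, using two rows copied from the table above (the small tolerance accounts for rounding in the published averages):

```typescript
// Sanity-check the headline table: total cost ≈ avg cost × samples.
// Figures are copied from the benchmark table; nothing here calls an API.
interface Run {
  name: string;
  samples: number;
  avgCost: number;   // USD per case
  totalCost: number; // USD for the whole run
}

const runs: Run[] = [
  { name: "kr-autorouter", samples: 48, avgCost: 0.00182, totalCost: 0.0874 },
  { name: "gpt-4.1", samples: 48, avgCost: 0.0141, totalCost: 0.6778 },
];

for (const run of runs) {
  const implied = run.avgCost * run.samples;
  // Published averages are rounded, so allow a small tolerance.
  console.log(run.name, Math.abs(implied - run.totalCost) < 0.01);
}
```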
Where routing actually matters.
kr-auto matches or beats the premium baselines on the hard categories (code, reasoning) and hands the easy ones to cheaper models, where the accuracy delta is zero.
| Category | kr-auto | gpt-4.1 | claude-sonnet-4-6 | gpt-4o-mini | claude-haiku-4-5 |
|---|---|---|---|---|---|
| code | 90% | 80% | 90% | 60% | 70% |
| reasoning | 100% | 90% | 100% | 60% | 70% |
| extraction | 100% | 100% | 100% | 90% | 90% |
| summarization | 100% | 100% | 100% | 100% | 100% |
| creative | 90% | 90% | 80% | 70% | 70% |
How this is measured.
Every case runs through kr-auto and against each fixed-model baseline. Same prompt, same temperature, same max_tokens, same trust-region guardrails.
Scoring is rubric-based: each case declares must_contain and must_not_contain anchors. A case is correct only if every required anchor is present and no disallowed phrase appears. No judge model, no vibes.
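The anchor check above can be sketched in a few lines of TypeScript. The field names `must_contain` and `must_not_contain` come from the text; the function name and the case-insensitive substring matching are assumptions for illustration, not the documented implementation in `scripts/evals/run.ts`:

```typescript
// Sketch of the rubric scorer described above: a case is correct only if
// every required anchor is present AND no disallowed phrase appears.
// Matching strategy (case-insensitive substring) is an assumption.
interface Rubric {
  must_contain: string[];
  must_not_contain: string[];
}

function scoreCase(output: string, rubric: Rubric): boolean {
  const haystack = output.toLowerCase();
  const hasAllAnchors = rubric.must_contain.every(
    (anchor) => haystack.includes(anchor.toLowerCase()),
  );
  const hasNoBanned = rubric.must_not_contain.every(
    (phrase) => !haystack.includes(phrase.toLowerCase()),
  );
  return hasAllAnchors && hasNoBanned;
}
```

Deterministic by construction: the same output and rubric always score the same way, which is what makes the suite reproducible without a judge model.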
Cost and latency come directly from the KairosRoute API response. Every data point is a single real request against the named provider — no synthetic pricing, no extrapolation.
The suite definition lives in scripts/evals/suite.ts and the runner in scripts/evals/run.ts. Reproduce with npx tsx scripts/evals/run.ts.
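A case in `scripts/evals/suite.ts` presumably pairs a prompt with its rubric and a category. The shape below is a guess for orientation only, not the actual schema; every field name and the example values are illustrative:

```typescript
// Hypothetical shape of one suite case. The real schema lives in
// scripts/evals/suite.ts; names and values here are illustrative only.
interface EvalCase {
  id: string; // e.g. "extract-01", matching the per-case table below
  category: "code" | "reasoning" | "extraction" | "summarization" | "creative";
  prompt: string;
  must_contain: string[];
  must_not_contain: string[];
}

const exampleCase: EvalCase = {
  id: "extract-01",
  category: "extraction",
  prompt: "Pull the invoice number and the due date out of the text below.",
  must_contain: ["invoice", "due"],
  must_not_contain: ["I cannot"],
};

console.log(exampleCase.id, exampleCase.category);
```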
What the router actually picked.
A slice of the last run. Case IDs are shared across runs, so you can compare which model kr-auto picked on each case against every fixed baseline.
| Case | Category | Run | Model used | Correct | Latency | Cost |
|---|---|---|---|---|---|---|
| code-01 | code | kr-auto | gpt-4.1-mini | ✓ | 612 ms | $0.00041 |
| code-02 | code | kr-auto | gpt-5.3-codex | ✓ | 1284 ms | $0.00412 |
| code-03 | code | kr-auto | gpt-5.3-codex | ✓ | 1567 ms | $0.00498 |
| reason-01 | reasoning | kr-auto | claude-sonnet-4-6 | ✓ | 892 ms | $0.00218 |
| reason-02 | reasoning | kr-auto | gpt-5.4 | ✓ | 1743 ms | $0.00612 |
| reason-03 | reasoning | kr-auto | claude-opus-4-6 | ✓ | 2102 ms | $0.00894 |
| extract-01 | extraction | kr-auto | gemini-3-flash-preview | ✓ | 421 ms | $0.00012 |
| extract-02 | extraction | kr-auto | gpt-4.1-mini | ✓ | 584 ms | $0.00034 |
| sum-01 | summarization | kr-auto | gemini-3.1-flash-lite-preview | ✓ | 287 ms | $0.00008 |
| sum-02 | summarization | kr-auto | gemini-3-flash-preview | ✓ | 412 ms | $0.00021 |
| create-01 | creative | kr-auto | claude-haiku-4-5 | ✓ | 634 ms | $0.00047 |
| create-02 | creative | kr-auto | claude-sonnet-4-6 | ✓ | 1120 ms | $0.00218 |
| code-01 | code | gpt-4.1 | gpt-4.1 | ✓ | 724 ms | $0.0124 |
| code-02 | code | gpt-4.1 | gpt-4.1 | ✓ | 1108 ms | $0.0148 |
| code-03 | code | gpt-4.1 | gpt-4.1 | ✗ | 1482 ms | $0.0169 |
| reason-01 | reasoning | gpt-4.1 | gpt-4.1 | ✓ | 801 ms | $0.00988 |
| reason-02 | reasoning | gpt-4.1 | gpt-4.1 | ✓ | 1421 ms | $0.0172 |
| reason-03 | reasoning | gpt-4.1 | gpt-4.1 | ✗ | 2014 ms | $0.0240 |
| code-01 | code | gpt-4o-mini | gpt-4o-mini | ✓ | 602 ms | $0.00032 |
| code-02 | code | gpt-4o-mini | gpt-4o-mini | ✗ | 1012 ms | $0.00041 |
| code-03 | code | gpt-4o-mini | gpt-4o-mini | ✗ | 1322 ms | $0.00054 |
| reason-03 | reasoning | gpt-4o-mini | gpt-4o-mini | ✗ | 1421 ms | $0.00061 |
| code-03 | code | claude-sonnet-4-6 | claude-sonnet-4-6 | ✓ | 1731 ms | $0.00672 |
| reason-03 | reasoning | claude-sonnet-4-6 | claude-sonnet-4-6 | ✓ | 2183 ms | $0.00894 |
| code-03 | code | claude-haiku-4-5 | claude-haiku-4-5 | ✗ | 1108 ms | $0.00082 |
Rerun the suite in your own account.
Every data point above is a real API call. Clone the repo, set KR_API_KEY, and reproduce the numbers before you trust them.