KairosRoute
Public eval suite v1.0.0

Fewer dollars.
Equal or better answers.

The public benchmark suite runs every case through kr-auto and every major fixed-model baseline. Same prompts, same rubric, no judge model. Scoring and costs are deterministic and reproducible.

Last run: 2026-04-20

87% cheaper than shipping every call to gpt-4.1
+4.1pp accuracy vs. the gpt-4.1 baseline
48 graded cases across 5 task categories
Leaderboard

Routing sits on the Pareto frontier: every fixed-model run is either less accurate, more expensive, or both.

| Run | Samples | Accuracy | Avg latency | Avg cost | Total cost |
| --- | --- | --- | --- | --- | --- |
| kr-auto (router) | 48 | 96% | 918 ms | $0.00182 | $0.0874 |
| claude-sonnet-4-6 | 48 | 94% | 1342 ms | $0.00612 | $0.2938 |
| gpt-4.1 | 48 | 92% | 1124 ms | $0.0141 | $0.6778 |
| claude-haiku-4-5 | 48 | 79% | 834 ms | $0.00078 | $0.0374 |
| gpt-4o-mini | 48 | 75% | 712 ms | $0.00041 | $0.0197 |
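
The headline stats are straight arithmetic over this table. As a quick check, with the values copied from the Total cost column above:

```ts
// Headline math derived from the leaderboard totals.
const krAutoTotal = 0.0874; // kr-auto total cost (USD, 48 cases)
const gpt41Total = 0.6778;  // gpt-4.1 total cost (USD, 48 cases)

const savings = 1 - krAutoTotal / gpt41Total; // ≈ 0.871
console.log(`${(savings * 100).toFixed(0)}% cheaper`); // "87% cheaper"
```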
By category

Where routing actually matters.

kr-auto matches or beats the premium baselines on the hard categories (code, reasoning) and hands extraction and summarization to cheap models, where the accuracy gap all but disappears.

| Category | kr-auto | gpt-4.1 | claude-sonnet-4-6 | gpt-4o-mini | claude-haiku-4-5 |
| --- | --- | --- | --- | --- | --- |
| code | 90% | 80% | 90% | 60% | 70% |
| reasoning | 100% | 90% | 100% | 60% | 70% |
| extraction | 100% | 100% | 100% | 90% | 90% |
| summarization | 100% | 100% | 100% | 100% | 100% |
| creative | 90% | 90% | 80% | 70% | 70% |

How this is measured.

Every case runs once through kr-auto and once against each fixed-model baseline. Same prompt, same temperature, same max_tokens, same trust-region guardrails.
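
As a sketch of what that run matrix looks like (the parameter values below are placeholders, and the trust-region guardrail config is omitted because its shape isn't documented here):

```ts
// Illustrative run matrix: one shared request config, five runs.
// Only "identical parameters across runs" comes from the text above;
// the specific temperature and max_tokens values are assumptions.
const shared = { temperature: 0, max_tokens: 1024 };

const runs = [
  "kr-auto",           // the router picks the underlying model per case
  "gpt-4.1",
  "claude-sonnet-4-6",
  "claude-haiku-4-5",
  "gpt-4o-mini",
].map((model) => ({ model, ...shared }));
```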

Scoring is rubric-based: each case declares must_contain and must_not_contain anchors. A case is correct only if every required anchor is present and no disallowed phrase appears. No judge model, no vibes.
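
A minimal sketch of that rule, assuming anchors are matched as plain substrings (the real grader may normalize case or whitespace differently):

```ts
// Deterministic rubric check: correct iff every required anchor is present
// and no disallowed phrase appears. Substring matching is an assumption;
// the anchor field names come from the suite description.
interface Rubric {
  must_contain: string[];
  must_not_contain: string[];
}

function grade(output: string, rubric: Rubric): boolean {
  const allPresent = rubric.must_contain.every((a) => output.includes(a));
  const noneForbidden = rubric.must_not_contain.every((a) => !output.includes(a));
  return allPresent && noneForbidden;
}
```

Because the check is pure string matching, the same output always grades the same way, which is what makes a judge model unnecessary.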

Cost and latency come directly from the KairosRoute API response. Every data point is a single real request against the named provider — no synthetic pricing, no extrapolation.
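
In practice that means the runner just reads the measurements off each response. The field names in this sketch are guesses, not the published KairosRoute schema:

```ts
// Hypothetical response fields; the real API may name these differently.
type RouteResult = {
  model_used: string; // the model kr-auto picked (fixed runs echo their own id)
  latency_ms: number; // measured per request, not estimated
  cost_usd: number;   // billed cost of this single call
};
```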

The suite definition lives in `scripts/evals/suite.ts` and the runner in `scripts/evals/run.ts`. Reproduce with `npx tsx scripts/evals/run.ts`.
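
For orientation, a case entry plausibly looks like the sketch below. The anchor fields mirror the rubric described above, but the authoritative shape is whatever `scripts/evals/suite.ts` declares.

```ts
// Hypothetical case entry; see scripts/evals/suite.ts for the real shape.
type Category = "code" | "reasoning" | "extraction" | "summarization" | "creative";

interface EvalCase {
  id: string; // stable across runs, e.g. "code-01"
  category: Category;
  prompt: string;
  must_contain: string[];
  must_not_contain: string[];
}

// Made-up example case (not one of the 48 real cases).
const demo: EvalCase = {
  id: "extract-99",
  category: "extraction",
  prompt: "Return the invoice number and total due from the text below.",
  must_contain: ["INV-", "total"],
  must_not_contain: ["I cannot"],
};
```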

Per-case sample

What the router actually picked.

A slice of the last run. Case IDs are stable from run to run; the Model used column shows both the fixed model each baseline ran and the model kr-auto picked per case.

| Case | Category | Run | Model used | Correct | Latency | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| code-01 | code | kr-auto | gpt-4.1-mini | – | 612 ms | $0.00041 |
| code-02 | code | kr-auto | gpt-5.3-codex | – | 1284 ms | $0.00412 |
| code-03 | code | kr-auto | gpt-5.3-codex | – | 1567 ms | $0.00498 |
| reason-01 | reasoning | kr-auto | claude-sonnet-4-6 | – | 892 ms | $0.00218 |
| reason-02 | reasoning | kr-auto | gpt-5.4 | – | 1743 ms | $0.00612 |
| reason-03 | reasoning | kr-auto | claude-opus-4-6 | – | 2102 ms | $0.00894 |
| extract-01 | extraction | kr-auto | gemini-3-flash-preview | – | 421 ms | $0.00012 |
| extract-02 | extraction | kr-auto | gpt-4.1-mini | – | 584 ms | $0.00034 |
| sum-01 | summarization | kr-auto | gemini-3.1-flash-lite-preview | – | 287 ms | $0.00008 |
| sum-02 | summarization | kr-auto | gemini-3-flash-preview | – | 412 ms | $0.00021 |
| create-01 | creative | kr-auto | claude-haiku-4-5 | – | 634 ms | $0.00047 |
| create-02 | creative | kr-auto | claude-sonnet-4-6 | – | 1120 ms | $0.00218 |
| code-01 | code | gpt-4.1 | gpt-4.1 | – | 724 ms | $0.0124 |
| code-02 | code | gpt-4.1 | gpt-4.1 | – | 1108 ms | $0.0148 |
| code-03 | code | gpt-4.1 | gpt-4.1 | – | 1482 ms | $0.0169 |
| reason-01 | reasoning | gpt-4.1 | gpt-4.1 | – | 801 ms | $0.00988 |
| reason-02 | reasoning | gpt-4.1 | gpt-4.1 | – | 1421 ms | $0.0172 |
| reason-03 | reasoning | gpt-4.1 | gpt-4.1 | – | 2014 ms | $0.0240 |
| code-01 | code | gpt-4o-mini | gpt-4o-mini | – | 602 ms | $0.00032 |
| code-02 | code | gpt-4o-mini | gpt-4o-mini | – | 1012 ms | $0.00041 |
| code-03 | code | gpt-4o-mini | gpt-4o-mini | – | 1322 ms | $0.00054 |
| reason-03 | reasoning | gpt-4o-mini | gpt-4o-mini | – | 1421 ms | $0.00061 |
| code-03 | code | claude-sonnet-4-6 | claude-sonnet-4-6 | – | 1731 ms | $0.00672 |
| reason-03 | reasoning | claude-sonnet-4-6 | claude-sonnet-4-6 | – | 2183 ms | $0.00894 |
| code-03 | code | claude-haiku-4-5 | claude-haiku-4-5 | – | 1108 ms | $0.00082 |

Rerun the suite in your own account.

Every data point above is a real API call. Clone the repo, set `KR_API_KEY`, and reproduce the numbers before you trust them.