
Per-Agent Model Routing in CrewAI

One of the quiet mistakes in multi-agent systems is using the same model for every agent. A Researcher agent that scrapes and summarizes 50 web pages does not need GPT-5. A Writer agent producing a polished final draft probably does. A Reviewer agent that just grades the draft against a rubric can live on Haiku and finish in under a second.

CrewAI makes per-agent model assignment trivial — every Agent takes an llm= parameter. What CrewAI does not give you out of the box is cost-aware routing within a single agent's workload. That is where KairosRoute slots in. You assign kr-auto to the agents where task mix varies, and pin specific models to agents where you need deterministic behavior. This post shows the full pattern.

The core idea: role-shaped model policy

Think of each CrewAI agent as a role. Each role has a different task profile, and therefore a different optimal model policy.

  • Researcher: high volume, variable difficulty, and occasional model slips are tolerable because every claim carries a citation. Best fit — kr-auto. The router will send simple summaries to Haiku/Flash and research synthesis to Sonnet.
  • Writer: low volume, high quality bar, final output is user-visible. Best fit — pinned frontier model like claude-sonnet-4.5 or gpt-5.
  • Reviewer/Critic: structured scoring against a rubric, medium volume, latency-sensitive. Best fit — pinned fast model like claude-haiku-4.5 or gpt-5-mini.
  • Planner/Manager: small number of calls, high leverage (bad plans waste every downstream call). Best fit — pinned frontier model.

This mix is where multi-agent spend gets out of control, and also where routing pays for itself fastest.
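The role-to-policy mapping above is easy to capture as plain data. A minimal sketch (model names taken from the examples later in this post; the fallback behavior is our choice, not a KairosRoute feature):

```python
# Role -> model policy, mirroring the list above.
# "kr-auto" defers the choice to the router; everything else is pinned.
ROLE_POLICY = {
    "researcher": "kr-auto",        # high volume, variable difficulty
    "writer": "claude-sonnet-4.5",  # user-visible final output
    "reviewer": "claude-haiku-4.5", # fast rubric scoring
    "planner": "gpt-5",             # few calls, high leverage
}

def model_for(role: str) -> str:
    # Unknown roles fall back to the router rather than a pinned model.
    return ROLE_POLICY.get(role, "kr-auto")
```

Keeping the policy in one dict means a model swap is a one-line diff instead of a hunt through agent definitions.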

Step 1: Install and configure

bash
pip install crewai crewai-tools langchain-openai

CrewAI accepts LangChain chat models like ChatOpenAI for llm= (newer releases also ship their own LLM class), so you configure KairosRoute the same way you would for any LangChain app — change the base URL and the key.

python
import os
from langchain_openai import ChatOpenAI

def kr(model: str, temperature: float = 0.7) -> ChatOpenAI:
    """Helper: build a KairosRoute-backed LLM for a given model string."""
    return ChatOpenAI(
        model=model,
        base_url="https://api.kairosroute.com/v1",
        api_key=os.environ["KAIROSROUTE_API_KEY"],
        temperature=temperature,
    )

That tiny helper is all you need. Everywhere CrewAI asks for an llm=, you pass kr("kr-auto") or kr("claude-sonnet-4.5") or whatever the role calls for.

Step 2: Build the crew with per-agent models

Here is a research-and-writing crew with four agents, each on its own model policy:

python
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, WebsiteSearchTool

search = SerperDevTool()
scrape = WebsiteSearchTool()

# Planner: low volume, high leverage — pin to a frontier model.
planner = Agent(
    role="Research Planner",
    goal="Break a topic into 5-7 focused research questions.",
    backstory="You design research agendas that get to signal fast.",
    llm=kr("gpt-5", temperature=0.2),
    verbose=True,
)

# Researcher: high volume, variable difficulty — let kr-auto pick.
researcher = Agent(
    role="Researcher",
    goal="Gather accurate, cited facts answering each question.",
    backstory="You are a librarian who cites every claim.",
    llm=kr("kr-auto"),
    tools=[search, scrape],
    verbose=True,
)

# Writer: user-visible final output — pin to a strong writing model.
writer = Agent(
    role="Writer",
    goal="Turn research notes into a polished 800-word brief.",
    backstory="You write for technical executives who skim.",
    llm=kr("claude-sonnet-4.5", temperature=0.6),
    verbose=True,
)

# Reviewer: fast critique against a rubric — small, cheap model.
reviewer = Agent(
    role="Editor",
    goal="Score the draft 1-10 on clarity, accuracy, concision. Flag issues.",
    backstory="You are a ruthless but constructive editor.",
    llm=kr("claude-haiku-4.5", temperature=0.0),
    verbose=True,
)

Four agents, four different model policies, one unified KairosRoute key. Every request shows up in a single dashboard with per-agent cost attribution if you tag calls (more on that in a moment).

Step 3: Define the tasks and run the crew

python
plan_task = Task(
    description="Plan research for: {topic}",
    expected_output="5-7 numbered research questions.",
    agent=planner,
)

research_task = Task(
    description="Answer each planned question with 2-3 cited facts.",
    expected_output="A markdown document of question/answer pairs with URL citations.",
    agent=researcher,
    context=[plan_task],
)

write_task = Task(
    description="Write an 800-word executive brief based on the research.",
    expected_output="Markdown brief with H2 sections, citations inline.",
    agent=writer,
    context=[research_task],
)

review_task = Task(
    description="Score the brief and list 3 specific revisions.",
    expected_output="JSON: {score: int, revisions: [str, str, str]}",
    agent=reviewer,
    context=[write_task],
)

crew = Crew(
    agents=[planner, researcher, writer, reviewer],
    tasks=[plan_task, research_task, write_task, review_task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff(inputs={"topic": "AI code review tools in 2026"})
print(result)

On a topic like that, the Researcher will fire 15–25 tool-augmented LLM calls. That is where kr-auto earns its keep — most of those calls are summarization of scraped HTML, which Haiku or Gemini Flash handles beautifully at a small fraction of GPT-5's price. The Writer makes one expensive call. The Reviewer makes one cheap call. Total cost drops substantially versus pinning the whole crew to a frontier model.

Before and after: a real crew cost comparison

Illustrative numbers from a content-ops crew that runs ~200 research-and-write jobs per month:

  • Before (entire crew on GPT-5): ~200 heavy, tool-augmented jobs per month, all at GPT-5 prices — roughly $2,800/month.
  • After (per-agent routing as shown above):
    • Planner (GPT-5, small prompts): ~$120
    • Researcher (kr-auto, bulk of tokens): ~$310
    • Writer (Sonnet 4.5): ~$180
    • Reviewer (Haiku 4.5): ~$35
    • Total: ~$645/month
  • Savings: ~77%. Editorial scores on the Reviewer stayed statistically unchanged.

The Researcher is where the real savings sit — it is doing the most calls, on the widest task-difficulty spread, and that is exactly the workload kr-auto is designed for. If you only make one change to your existing CrewAI setup today, make it this: swap the Researcher to kr-auto.
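The totals above are easy to sanity-check. A few lines using the illustrative figures from the comparison:

```python
# Illustrative monthly figures from the before/after comparison above.
before = 2800.0  # whole crew pinned to GPT-5
after = {"planner": 120, "researcher": 310, "writer": 180, "reviewer": 35}

total_after = sum(after.values())
savings = 1 - total_after / before

print(total_after)        # 645
print(f"{savings:.0%}")   # 77%
```

Note that the Researcher line is still the largest slice even after routing — which is exactly why it is the first agent worth switching.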

Per-agent cost attribution

KairosRoute supports a metadata field on every request. Tag each agent so you can slice spend by role in the dashboard:

python
def kr(model: str, role: str, temperature: float = 0.7) -> ChatOpenAI:
    return ChatOpenAI(
        model=model,
        base_url="https://api.kairosroute.com/v1",
        api_key=os.environ["KAIROSROUTE_API_KEY"],
        temperature=temperature,
        default_headers={"x-kr-metadata": f'{{"agent_role":"{role}"}}'},
    )

researcher = Agent(
    role="Researcher",
    # ...
    llm=kr("kr-auto", role="researcher"),
)

The dashboard groups spend, token volume, and p95 latency by any metadata key you set. This is the fastest way to answer "which agent is eating my budget?" — usually the answer surprises people.
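Once you tag more than one field, hand-writing JSON inside an f-string gets fragile (quotes and unicode in role names will break it). A small helper built on json.dumps is safer — the x-kr-metadata header name comes from the example above; the helper name is ours:

```python
import json

def kr_metadata_headers(**tags) -> dict:
    # json.dumps handles the quoting and escaping that a
    # hand-built f-string would not.
    return {"x-kr-metadata": json.dumps(tags)}

headers = kr_metadata_headers(agent_role="researcher")
```

Pass the result as default_headers= to the kr(...) helper and every call from that agent arrives pre-tagged.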

Gotchas we have seen in the field

Manager LLMs in hierarchical crews

If you use Process.hierarchical, CrewAI spins up a manager agent that decides which worker to delegate to. That manager makes a lot of small, high-leverage decisions. Pin it to a frontier model — a cheap manager will make dumb routing decisions that waste downstream compute. This is the one place where cheaping out will actually cost you money.

python
crew = Crew(
    agents=[researcher, writer, reviewer],
    tasks=[research_task, write_task, review_task],
    process=Process.hierarchical,
    manager_llm=kr("gpt-5", temperature=0.0),  # <- pin the manager
)

Tool-use reliability

When an agent has tools, the LLM has to emit well-formed function-call JSON. kr-auto biases toward tool-reliable models when it sees tools in the request, so this usually "just works." But if you have custom tools with complex nested schemas, pin the tool-using agent to claude-sonnet-4.5 or gpt-5 for safety.

Temperature and determinism

Different providers interpret temperature slightly differently. If you need reproducible outputs (for evals, for testing), pin the model explicitly rather than using kr-auto — the router is allowed to pick different models on different runs, which is usually what you want in production but exactly what you do not want in an eval harness.

Streaming inside CrewAI

CrewAI buffers full agent responses by default, so streaming is mostly cosmetic inside a crew. If you want to stream the final output to a user, wrap the final task's output, not the crew itself. Or emit the last agent's response directly via LangChain's .stream().

Context window mismatch

If the Researcher fills a giant context buffer with scraped pages and hands it off to a Reviewer pinned to a small-context model, you will hit context-length errors or silent truncation. Check the context window on each pinned model — KairosRoute surfaces context_window in the registry. As a rule of thumb, Sonnet and Haiku sit around 200K tokens, GPT-5 around 400K, and Gemini 2.5 Pro stretches to 1M. Size your Reviewer model to match what the Writer produces.
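A rough pre-flight check catches the mismatch before the API does. The window sizes and the 4-characters-per-token heuristic below are ballpark illustrations, not values read from the registry:

```python
# Ballpark context windows (tokens), for illustration only.
CONTEXT_WINDOWS = {
    "claude-sonnet-4.5": 200_000,
    "claude-haiku-4.5": 200_000,
    "gemini-2.5-pro": 1_000_000,
}

def fits_in_context(model: str, text: str, chars_per_token: int = 4) -> bool:
    # Crude token estimate: ~4 characters per token for English prose.
    estimated_tokens = len(text) // chars_per_token
    return estimated_tokens <= CONTEXT_WINDOWS[model]
```

Run it on the Researcher's accumulated output before kicking off the downstream task, and route to a bigger-window model when it fails.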

Advanced: conditional routing per task

You can take this further by giving individual tasks different model treatment, not just whole agents. CrewAI assigns the LLM at the agent level, so the clean way to do this is a second, pinned agent that exists just for the tasks you know are hard:

python
# A pinned "senior" researcher for tasks you know are hard.
deep_researcher = Agent(
    role="Senior Researcher",
    goal="Synthesize large source sets into rigorous literature reviews.",
    backstory="You do the deep-dive work the fast lane cannot.",
    llm=kr("claude-sonnet-4.5"),
    tools=[search, scrape],
)

hard_research_task = Task(
    description="Synthesize 20+ sources into a literature review.",
    expected_output="3-page lit review with numbered citations.",
    agent=deep_researcher,
)

That gives you the best of both worlds: the everyday Researcher defaults to kr-auto for its bulk work, while the one task you know is hard runs on a deterministic frontier model.
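A small heuristic can decide up front which research jobs deserve the pinned model. The source-count threshold here is an arbitrary illustration — tune it against your own eval results:

```python
def model_for_research(num_sources: int, threshold: int = 15) -> str:
    # Big synthesis jobs go to a pinned frontier model;
    # everything below the threshold is left to the router.
    return "claude-sonnet-4.5" if num_sources >= threshold else "kr-auto"
```

Feed the result straight into the kr(...) helper when constructing the agent for that batch of tasks.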

Rollout checklist

  1. Wrap your existing LLM construction in a kr(...) helper so every agent's model string is a one-liner.
  2. Assign kr-auto to your highest-volume agent first — usually the Researcher or Summarizer. Measure.
  3. Pin frontier models to user-visible agents (Writer, Final Responder). Pin fast models to rubric/classifier agents (Reviewer, Router, Grader).
  4. Tag every agent with x-kr-metadata so the dashboard can show per-role spend.
  5. Run your eval suite. Compare multi-agent quality scores vs. the all-frontier baseline. Quality usually holds; cost drops 60–80%.

Related reading

The sister post for single-chain workflows is Add Cost-Aware Routing to Your LangChain App. For the mechanics of how kr-auto picks a model, see How kr-auto Works. New to KairosRoute entirely? The fastest on-ramp is the OpenAI Migration Guide.

Try it with your crew

Drop the kr(...) helper into your existing CrewAI project, swap one agent at a time, and watch the dashboard. The playground lets you test individual agent prompts against kr-auto before you wire it into the crew. Full migration recipes are at docs/migration.
