Cut your API costs while keeping the intelligence ceiling high – here’s the full implementation with code, cost math, and the gotchas Anthropic’s docs don’t emphasize.
On April 9, 2026, Anthropic released a new feature for the Claude API called the advisor tool. It introduces a two-model architecture where a cheaper, faster model (Sonnet or Haiku) handles execution while a more powerful model (Opus) serves as a strategic advisor – reviewing context, providing guidance, and course-correcting when needed. The whole thing runs inside a single API request with no extra infrastructure required.
The purpose is straightforward: most tokens in an agentic workflow are routine execution. You’re paying top-tier model rates for work that a mid-tier model handles fine. The advisor strategy lets you reserve that expensive intelligence for the 10-20% of decisions that actually benefit from it – architectural choices, ambiguous edge cases, knowing when to stop – while the executor handles everything else at a fraction of the cost.
This blueprint walks through the full implementation: how to set it up, how to control costs, the real token math, and the failure modes Anthropic’s announcement doesn’t emphasize.
Summary Card
What You’ll Build
- A Claude API integration where Sonnet handles execution and Opus steps in as a strategic advisor on hard decisions
- A cost-controlled setup with max_uses caps and optional caching for long conversations
- A working pattern you can drop into any existing agentic workflow, coding assistant, or automation pipeline
Why This Matters (And Why Speed Matters Right Now)
Anthropic shipped the advisor tool on April 9, 2026. It’s a beta feature that lets you declare Opus as a “server-side tool” inside a standard Messages API call. The executor model – Sonnet or Haiku – runs the task end-to-end. When it hits a decision it can’t confidently make alone, it calls the advisor. Opus reviews the full context and sends back a plan, correction, or stop signal. The executor picks up where it left off.
No extra API round trips. No orchestration code. One request, two models, done.
Here’s why this is a big deal for anyone running AI in production: you’ve probably been choosing between “cheap but sometimes wrong” (Sonnet/Haiku) and “smart but expensive” (Opus). The advisor strategy gives you a third option – cheap execution with expensive thinking only when it matters.
The benchmarks back this up:
The Real Cost Math
Most coverage of the advisor strategy stops at Anthropic’s benchmark percentages. Here’s the actual token-level math so you can model this against your own workloads.
Assumptions for this scenario: A typical agentic coding task – 10,000 input tokens, 2,000 executor output tokens. The advisor gets called once and generates a typical 600 output tokens (with ~1,200 input tokens to read context). These numbers come from Anthropic’s docs: advisor responses are typically 400-700 text tokens, 1,400-1,800 total including thinking.
The key takeaway: Haiku + Opus advisor costs about double Haiku solo ($0.041 vs $0.020) but more than doubles benchmark scores – and it’s still 59% cheaper than Opus end-to-end. Sonnet + Opus advisor adds modest overhead ($0.081 vs $0.060) for a meaningful quality boost, while staying well under Opus pricing.
At scale, this math compounds. Running 1,000 agentic tasks per day at Opus solo = $100/day. The same tasks at Haiku + advisor = $41/day. That’s $1,770/month saved while keeping Opus-level intelligence where it counts.
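If you want to rerun these numbers with your own token counts, here is a minimal sketch of the arithmetic. The Opus rates ($5 input, $25 output per million tokens) are the ones used elsewhere in this blueprint. The Sonnet and Haiku rates are assumptions inferred from the per-task totals above, so check them against current pricing. The function names are made up for illustration, and the sketch bills only the 600 advisor output tokens from the scenario, with no caching.

```python
# Prices in USD per million tokens: (input, output).
# Opus rates match the ones used elsewhere in this article. Sonnet and Haiku
# rates are ASSUMPTIONS inferred from the article's per-task totals.
PRICES = {
    "opus": (5.00, 25.00),
    "sonnet": (3.00, 15.00),
    "haiku": (1.00, 5.00),
}


def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one inference pass at the given model's rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def task_cost(executor: str, advisor_calls: int = 0) -> float:
    """Cost of one task in the article's scenario.

    Executor: 10,000 input + 2,000 output tokens.
    Each advisor call: 1,200 input + 600 output tokens at Opus rates.
    """
    executor_cost = cost(executor, 10_000, 2_000)
    advisor_cost = advisor_calls * cost("opus", 1_200, 600)
    return executor_cost + advisor_cost


if __name__ == "__main__":
    scenarios = [
        ("Opus solo", "opus", 0),
        ("Sonnet solo", "sonnet", 0),
        ("Sonnet + advisor", "sonnet", 1),
        ("Haiku solo", "haiku", 0),
        ("Haiku + advisor", "haiku", 1),
    ]
    for label, executor, calls in scenarios:
        per_task = task_cost(executor, calls)
        per_day = per_task * 1_000  # 1,000 tasks per day
        print(f"{label:18s} ${per_task:.3f}/task  ${per_day:.0f}/day")
```

Swap in your own token counts and advisor call frequency. The savings shrink quickly if the advisor is consulted several times per task.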
When to Use It (And When Not To)
Prerequisites
- An Anthropic API key with access to Claude Sonnet 4.6 and Opus 4.6
- The Anthropic Python SDK (pip install anthropic) or TypeScript SDK (npm install @anthropic-ai/sdk), or just cURL
- An existing workflow or use case where you’re currently running Sonnet or Haiku and wishing it was smarter on edge cases
Step 1: Add the Advisor Tool to Your API Call
The implementation is minimal. You’re adding one item to your tools array and one beta header. That’s it.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    betas=["advisor-tool-2026-03-01"],
    tools=[
        {
            "type": "advisor_20260301",
            "name": "advisor",
            "model": "claude-opus-4-6",
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Refactor this auth module to use JWT tokens.",
        }
    ],
)
print(response)
Under the hood: When Sonnet decides it needs strategic guidance, it emits a server_tool_use block with name: "advisor". Anthropic runs a separate inference pass using Opus on the full conversation context. The result comes back as an advisor_tool_result block, and Sonnet continues generating. All inside a single API request.
Step 2: Control Costs with max_uses
Left uncapped, Sonnet will call the advisor whenever it thinks it needs help. That’s usually fine – Opus advisor responses are typically 400-700 text tokens (1,400-1,800 total including thinking). But if you’re running long agentic loops, costs can creep.
Set max_uses to cap advisor calls per request:
tools=[
    {
        "type": "advisor_20260301",
        "name": "advisor",
        "model": "claude-opus-4-6",
        "max_uses": 3,  # Opus gets called at most 3 times per request
    }
]
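There’s no built-in conversation-level cap. If you need one, count advisor calls client-side. When you hit your limit, remove the advisor tool from tools and strip advisor_tool_result blocks from your message history, since leaving those blocks in without the tool declared returns a 400 error (see Step 5).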
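A conversation-level cap can be sketched as a small counter that reads the usage.iterations array (covered in Step 6) after each response. AdvisorBudget is a hypothetical helper, not part of the SDK, and it assumes you pass it the usage as a plain dict.

```python
from typing import List, Optional

ADVISOR_TOOL = {
    "type": "advisor_20260301",
    "name": "advisor",
    "model": "claude-opus-4-6",
    "max_uses": 3,
}


class AdvisorBudget:
    """Hypothetical helper: counts advisor calls across a whole conversation."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def record(self, usage: dict) -> None:
        """Add the advisor calls reported in one response's usage dict."""
        iterations = usage.get("iterations") or []
        self.used += sum(
            1 for it in iterations if it.get("type") == "advisor_message"
        )

    @property
    def remaining(self) -> int:
        return max(self.max_calls - self.used, 0)

    def tools(self, other_tools: Optional[List[dict]] = None) -> List[dict]:
        """Tools for the next request; the advisor is dropped once the budget is spent."""
        tools = list(other_tools or [])
        if self.remaining > 0:
            # Never let the per-request cap exceed what is left overall.
            per_request = min(ADVISOR_TOOL["max_uses"], self.remaining)
            tools.append({**ADVISOR_TOOL, "max_uses": per_request})
        return tools
```

Call budget.record(...) after every response and budget.tools() before every request. Once tools() stops returning the advisor, remember that any advisor_tool_result blocks still in the history have to be stripped before the next call.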
Step 3: Enable Caching for Long Conversations
If your agent conversations typically involve 3 or more advisor calls, caching pays for itself. Set it on the tool definition:
tools=[
    {
        "type": "advisor_20260301",
        "name": "advisor",
        "model": "claude-opus-4-6",
        "max_uses": 5,
        "caching": {
            "type": "ephemeral",
            "ttl": "5m",  # or "1h" for longer sessions
        },
    }
]
The cache break-even is roughly three advisor calls per conversation. For short, single-turn tasks, leave caching off – you’ll just pay for cache writes you never read.
Gotcha: If you’re using clear_thinking with a keep value other than "all", it shifts the advisor’s quoted transcript each turn. This causes advisor-side cache misses and defeats the purpose. Either use "all" or skip caching.
Step 4: Guide When the Advisor Gets Called
You can influence advisor behavior through your system prompt. This is where the real tuning happens.
For coding and agentic tasks, Anthropic recommends this system prompt pattern:
Call advisor BEFORE substantive work (before writing, committing to an interpretation, or building on an assumption). Do orientation first (finding files, fetching source), then call advisor.
Also call advisor:
- When you believe the task is complete (after making the deliverable durable)
- When stuck (errors recurring, approach not converging)
- When considering a change of approach
Give the advice serious weight. If you follow a step and it fails empirically, or have primary-source evidence contradicting a specific claim, adapt.
To trim advisor output length and cut costs 35-45%:
The advisor should respond in under 100 words and use enumerated steps, not explanations.
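One way to wire both prompt fragments into a request is to keep them as constants and append them to whatever system prompt you already use. This is a sketch: build_system_prompt and the constant names are made up for illustration, and placing the brevity instruction in the executor's system prompt is an assumption based on the pattern above.

```python
ADVISOR_TIMING_PROMPT = """\
Call advisor BEFORE substantive work (before writing, committing to an interpretation, or building on an assumption). Do orientation first (finding files, fetching source), then call advisor.
Also call advisor:
- When you believe the task is complete (after making the deliverable durable)
- When stuck (errors recurring, approach not converging)
- When considering a change of approach
Give the advice serious weight. If you follow a step and it fails empirically, or have primary-source evidence contradicting a specific claim, adapt."""

ADVISOR_BREVITY_PROMPT = (
    "The advisor should respond in under 100 words and use enumerated steps, "
    "not explanations."
)


def build_system_prompt(base: str, concise_advisor: bool = True) -> str:
    """Append the advisor guidance to an existing system prompt."""
    parts = [base.strip(), ADVISOR_TIMING_PROMPT]
    if concise_advisor:
        parts.append(ADVISOR_BREVITY_PROMPT)
    # Skip empty parts so a blank base prompt does not leave a leading gap.
    return "\n\n".join(part for part in parts if part)


if __name__ == "__main__":
    print(build_system_prompt("You are a careful coding agent."))
```

Pass the result as the system argument of client.beta.messages.create alongside the tools array from Step 1. Keeping the brevity line behind a flag makes it easy to A/B the cost saving against output quality.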
What Anthropic left out – a logging pattern to track advisor hit rate:
# Track advisor usage across your requests
def log_advisor_usage(response):
    # response.usage is an SDK model object, not a dict, so convert it first
    usage = response.usage.to_dict()
    iterations = usage.get("iterations") or []
    advisor_calls = [i for i in iterations if i["type"] == "advisor_message"]
    print(f"Advisor calls: {len(advisor_calls)}")
    for call in advisor_calls:
        print(f"  Input: {call['input_tokens']} tokens")
        print(f"  Output: {call['output_tokens']} tokens")
        # Opus rates: $5 per million input tokens, $25 per million output tokens
        opus_cost = (
            call["input_tokens"] * 5 / 1_000_000
            + call["output_tokens"] * 25 / 1_000_000
        )
        print(f"  Advisor cost: ${opus_cost:.4f}")
Effort pairing tip: For coding tasks, run Sonnet at medium effort with the Opus advisor. This gives intelligence comparable to Sonnet at default effort, but cheaper.
Step 5: Handle Multi-Turn Conversations Correctly
This is where people will trip up. When you’re building a multi-turn conversation, you must pass the full assistant content back – including the advisor_tool_result blocks.
# First turn
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    betas=["advisor-tool-2026-03-01"],
    tools=[{
        "type": "advisor_20260301",
        "name": "advisor",
        "model": "claude-opus-4-6",
    }],
    messages=[
        {"role": "user", "content": "Build a REST API for user management."}
    ],
)

# Second turn - pass back FULL content including advisor blocks
messages = [
    {"role": "user", "content": "Build a REST API for user management."},
    {"role": "assistant", "content": response.content},
    {"role": "user", "content": "Now add rate limiting to the endpoints."},
]

response_2 = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    betas=["advisor-tool-2026-03-01"],
    tools=[{
        "type": "advisor_20260301",
        "name": "advisor",
        "model": "claude-opus-4-6",
    }],
    messages=messages,
)
Critical rule: If you include the advisor tool in tools, your message history can contain advisor_tool_result blocks. If you remove the advisor tool from tools on a follow-up turn but your history still has those blocks, the API returns a 400 invalid_request_error. Either keep the tool declared or strip the blocks from history.
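If you go the strip route, a helper along these lines keeps the cleanup in one place. It is a sketch under assumptions: strip_advisor_blocks is a made-up name, and it handles content blocks given either as dicts or as SDK objects with a type attribute. Whether the paired server_tool_use block for the advisor also has to go is something I have not verified, so that part sits behind an opt-in flag.

```python
def _field(block, name):
    """Read a field from a content block given as a dict or an SDK object."""
    if isinstance(block, dict):
        return block.get(name)
    return getattr(block, name, None)


def is_advisor_block(block, include_calls=False):
    """True for advisor results, and optionally for the advisor call blocks too."""
    if _field(block, "type") == "advisor_tool_result":
        return True
    return (
        include_calls
        and _field(block, "type") == "server_tool_use"
        and _field(block, "name") == "advisor"
    )


def strip_advisor_blocks(messages, include_calls=False):
    """Return a copy of the history with advisor blocks removed from assistant turns."""
    cleaned = []
    for message in messages:
        content = message["content"]
        if message["role"] == "assistant" and isinstance(content, list):
            content = [
                block for block in content
                if not is_advisor_block(block, include_calls)
            ]
            if not content:
                # Unlikely edge case: the turn held nothing but advisor blocks.
                continue
        cleaned.append({**message, "content": content})
    return cleaned
```

The function returns a new list and leaves the original history untouched, so you can keep the full transcript for logging while sending the stripped version to the API.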
Step 6: Read the Billing Breakdown
The response includes a usage.iterations array that tells you exactly what each model consumed:
{
"usage": {
"input_tokens": 412,
"output_tokens": 531,
"iterations": [
{
"type": "message", // executor (Sonnet rates)
"input_tokens": 412,
"output_tokens": 89
},
{
"type": "advisor_message", // advisor (Opus rates)
"model": "claude-opus-4-6",
"input_tokens": 823,
"output_tokens": 1612
},
{
"type": "message", // executor resumes (Sonnet rates)
"input_tokens": 1348,
"output_tokens": 442
}
]
}
}
The top-level usage shows executor tokens only. The advisor_message entries are billed at Opus rates. This is how you audit costs per conversation and decide whether your max_uses cap is set right.
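To turn that breakdown into dollars, here is a rough sketch that bills the top-level usage at executor rates and the advisor_message iterations at Opus rates, following the description above. The Sonnet rates are an assumption, split_costs is a hypothetical helper, and caching discounts are not modeled.

```python
SONNET_RATES = (3.00, 15.00)  # ASSUMED executor rates, USD per million tokens
OPUS_RATES = (5.00, 25.00)    # advisor rates used elsewhere in this article


def dollars(input_tokens, output_tokens, rates):
    """Convert token counts to USD at (input, output) per-million rates."""
    in_rate, out_rate = rates
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def split_costs(usage, executor_rates=SONNET_RATES, advisor_rates=OPUS_RATES):
    """Split one response's cost into executor and advisor portions.

    `usage` is a plain dict shaped like the JSON example in this step.
    """
    executor = dollars(usage["input_tokens"], usage["output_tokens"], executor_rates)
    advisor = sum(
        dollars(it["input_tokens"], it["output_tokens"], advisor_rates)
        for it in usage.get("iterations") or []
        if it["type"] == "advisor_message"
    )
    total = executor + advisor
    return {
        "executor": executor,
        "advisor": advisor,
        "total": total,
        "advisor_share": advisor / total if total else 0.0,
    }


if __name__ == "__main__":
    # The token counts from the JSON example above.
    sample = {
        "input_tokens": 412,
        "output_tokens": 531,
        "iterations": [
            {"type": "message", "input_tokens": 412, "output_tokens": 89},
            {"type": "advisor_message", "model": "claude-opus-4-6",
             "input_tokens": 823, "output_tokens": 1612},
            {"type": "message", "input_tokens": 1348, "output_tokens": 442},
        ],
    }
    for key, value in split_costs(sample).items():
        print(f"{key}: {value:.4f}")
```

On the sample above, the single advisor call accounts for most of the spend, which is exactly the kind of thing worth knowing before you raise max_uses.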
Advisor Strategy vs. Other Agent Architectures
The advisor tool isn’t the only way to get smarter outputs from cheaper models. Here’s how it compares to other patterns you might already be using:
The advisor strategy’s advantage is simplicity. You don’t need to build routing logic, classify difficulty, or spin up separate execution environments. The executor self-selects when to escalate. The trade-off is that you’re trusting the executor’s judgment on when advice is needed.
Common Failure Points
“The stream just… pauses.”
That’s expected. The advisor sub-inference doesn’t stream. Your executor’s stream goes quiet while Opus thinks. You’ll get SSE ping keepalives every ~30 seconds. When Opus finishes, the result arrives fully formed. Show a “thinking deeper…” indicator in user-facing apps.
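For the indicator itself, a small state tracker over the stream events is enough. This sketch assumes the advisor follows the same event shape as other server tools: a content_block_start event carrying a server_tool_use block named advisor, followed later by a content_block_start carrying the advisor_tool_result block. I have not confirmed the exact event sequence, so treat the field names as assumptions and log a real stream before relying on it. AdvisorIndicator is a made-up name.

```python
class AdvisorIndicator:
    """Tracks whether the stream is currently waiting on the advisor."""

    def __init__(self):
        self.waiting = False

    def update(self, event: dict) -> bool:
        """Feed one stream event (as a dict); returns True while the advisor runs."""
        if event.get("type") == "content_block_start":
            block = event.get("content_block") or {}
            if block.get("type") == "server_tool_use" and block.get("name") == "advisor":
                # The executor has asked for advice: the quiet period starts here.
                self.waiting = True
            elif block.get("type") == "advisor_tool_result":
                # The advice has arrived: the executor is about to resume.
                self.waiting = False
        # Pings and all other events leave the state unchanged.
        return self.waiting


if __name__ == "__main__":
    indicator = AdvisorIndicator()
    events = [
        {"type": "content_block_start",
         "content_block": {"type": "server_tool_use", "name": "advisor"}},
        {"type": "ping"},
        {"type": "content_block_start",
         "content_block": {"type": "advisor_tool_result"}},
    ]
    for event in events:
        label = "thinking deeper..." if indicator.update(event) else "streaming"
        print(event["type"], "->", label)
```

In a real app you would call update() on each event from the stream and show or hide the spinner whenever the return value changes.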
400 error on follow-up turns
You removed the advisor tool from your tools array but your message history still contains advisor_tool_result blocks. Either keep the tool declared or strip those blocks before sending.
Advisor called too often on simple tasks
The advisor is a weak fit for single-turn Q&A where there’s nothing to plan. If you’re sending simple questions through this pattern, you’re paying Opus rates for guidance Sonnet doesn’t need.
Executor ignores advisor guidance
This can happen if your system prompt is vague. Add explicit instructions: “Give the advisor’s guidance serious weight. Only deviate if you have empirical evidence contradicting a specific claim.” Anthropic’s own recommended prompt includes this language.
Context bloat on long conversations
Each advisor call adds tokens to the conversation history. On long agent loops, context grows fast. Monitor total token counts and consider conversation-level caps. Stripping old advisor_tool_result blocks from early turns can help if you also remove the tool definition.
Production Checklist
Before shipping this to production, run through these:
Run a three-way eval
Test the same task set against executor solo, executor + advisor, and Opus solo. Measure output quality and cost. This is what Anthropic recommends – don’t skip it.
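The harness for this can stay tiny. In the sketch below, a runner is any function that takes a task and returns a quality score and a cost in USD. How you score quality, and how each runner calls the API, is left to you. All names here are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# A runner takes one task and returns (quality score, cost in USD).
# Nothing in this sketch calls the API; plug your own runners in.
Runner = Callable[[str], Tuple[float, float]]


def three_way_eval(tasks: List[str], runners: Dict[str, Runner]) -> Dict[str, dict]:
    """Run every task through every configuration and summarize the results."""
    report = {}
    for name, run in runners.items():
        results = [run(task) for task in tasks]
        report[name] = {
            "mean_quality": mean(quality for quality, _ in results),
            "total_cost": sum(cost for _, cost in results),
        }
    return report


def print_report(report: Dict[str, dict]) -> None:
    """Print one line per configuration for a quick side-by-side comparison."""
    for name, row in report.items():
        print(
            f"{name:20s} quality={row['mean_quality']:.2f} "
            f"cost=${row['total_cost']:.3f}"
        )
```

Register three runners (executor solo, executor + advisor, Opus solo), feed them the same task list, and compare the rows. Using identical tasks across all three is what makes the comparison meaningful.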
Instrument advisor hit rate
Log usage.iterations from every response. Track how often the advisor is called, average token counts, and cost per advisor consultation. Use the logging pattern from Step 4.
Set max_uses conservatively first
Start at 2 for Haiku, 3 for Sonnet. Monitor for a week, then adjust based on whether the advisor is hitting its cap or going unused.
Handle streaming UX
If user-facing, add a “thinking deeper…” indicator during advisor pauses. The stream goes quiet except for ping keepalives while Opus runs.
Plan for beta changes
This is beta (advisor-tool-2026-03-01). The tool type, parameters, and behavior may change. Abstract your advisor tool definition into a config so you can update it in one place.
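A minimal sketch of that config abstraction, assuming a frozen dataclass is acceptable in your codebase. AdvisorConfig is a hypothetical name. The field values are the ones used throughout this blueprint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdvisorConfig:
    """Single place to update when the beta header or tool type changes."""

    beta: str = "advisor-tool-2026-03-01"
    tool_type: str = "advisor_20260301"
    model: str = "claude-opus-4-6"
    max_uses: Optional[int] = 3
    cache_ttl: Optional[str] = None  # "5m" or "1h"; None leaves caching off

    def tool(self) -> dict:
        """Build the advisor entry for the tools array."""
        tool = {"type": self.tool_type, "name": "advisor", "model": self.model}
        if self.max_uses is not None:
            tool["max_uses"] = self.max_uses
        if self.cache_ttl is not None:
            tool["caching"] = {"type": "ephemeral", "ttl": self.cache_ttl}
        return tool

    def request_kwargs(self) -> dict:
        """Keyword arguments to splat into client.beta.messages.create(...)."""
        return {"betas": [self.beta], "tools": [self.tool()]}


ADVISOR = AdvisorConfig()

# Usage (not executed here):
# client.beta.messages.create(
#     model="claude-sonnet-4-6",
#     max_tokens=4096,
#     messages=messages,
#     **ADVISOR.request_kwargs(),
# )
```

If you also pass other tools, merge them with ADVISOR.tool() yourself instead of using request_kwargs(), which only knows about the advisor.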
Why This Works (And Why Most AI Cost Optimization Doesn’t)
Most “cost optimization” for AI APIs means switching to a cheaper model and accepting worse results. The advisor strategy works differently because it keeps the expensive model’s intelligence available but only uses it when the executor decides it’s needed.
The key architectural insight: on most tasks, 80-90% of the tokens are straightforward execution (writing code, formatting output, following instructions). The remaining 10-20% are decision points – architectural choices, edge case handling, knowing when to stop. The advisor concentrates Opus’s intelligence exactly where it matters.
This is the same pattern that works in human organizations. Senior engineers don’t write every line of code. They review architecture decisions, catch design flaws, and course-correct when things go sideways. The junior does the volume; the senior provides the judgment.
What to Build Next
Once you have the basic advisor pattern working, here’s where to take it:
- Pair it with your existing tool-use workflows. The advisor composes with client-side and server-side tools in the same tools array. Your agent can call APIs, read files, and consult the advisor in the same request.
- Build a cost dashboard. Parse usage.iterations from every response and track advisor vs. executor spend over time. You’ll quickly see which workflows benefit most from the advisor.
- Test the effort pairing. Run Sonnet at medium effort with the Opus advisor. Anthropic says this gives comparable intelligence to Sonnet at default effort, at lower cost. Worth benchmarking against your specific use cases.
My Notes After Testing
First, a disclaimer: the product has just been released, so I have barely scratched the surface. I will update these notes in the coming weeks.
The implementation is dead simple – that’s the good part. Adding the advisor tool to an existing API call takes about 5 minutes. The interesting part is tuning when and how the advisor gets called.
What surprised me: the advisor doesn’t just answer questions. It reviews the entire conversation context and provides strategic direction. On coding tasks, Sonnet would sometimes start down a path, then the advisor would pull it back with a better architectural approach. It’s less like “asking for help” and more like having a code reviewer embedded in the generation process.
The stream pause is the main UX consideration. If you’re building anything user-facing, you need to handle that gap. The keepalive pings help, but there’s a noticeable delay while Opus thinks.
For production use: start with max_uses: 1 and see if that single advisor call meaningfully improves your output. In many cases, one well-timed consultation at the start of a complex task is worth more than multiple mid-stream check-ins.
Tested April 10, 2026. This feature is in beta (advisor-tool-2026-03-01). Expect the API surface to evolve.
Blueprint in the AI Automation Blueprints series at chatgptguide.ai.

