13m read
Tags: claude-code, ai-agents, token-optimization, javascript, workflows

Optimizing Claude Code Workflows for Token Usage

I built a workflow to research and outline a single blog post. The first time I ran it without limits, it spent roughly 768,000 output tokens across 41 subagents to produce one outline, and almost all of that went to a single phase that spawns one subagent per finding. Capping that fan-out with a slice() brought the same workflow down to about 85,000 output tokens, and I couldn't tell the cheaper outline apart from the expensive one.

Every snippet below comes from topic-outline.js, the workflow I run against this blog; the primitives — pipeline(), parallel(), agent(prompt, { schema }), budget.spent() — are from the Claude Code Workflows runtimeworkflows. Throughout, I quote OUTPUT tokens, the number that subscription and rate limits weigh; the gap between that and the total is where most of the "why is this so expensive" confusion lives.

The fan-out cap is the only knob that matters

Validation in this workflow is one agent per finding. The research stage sweeps the topic from a few angles and comes back with a pile of claims; each claim then gets its own fact-checker agent that fetches the source and tries to break it. Capping how many findings reach that fan-out, before it launches, is the single highest-leverage token decision in the entire script — bigger than model choice or prompt length.

const rawFindings = research.flatMap((r) => r.findings || [])
reportSpend('Research')
log(`Collected ${rawFindings.length} raw findings across ${research.length} angles`)
if (rawFindings.length === 0) throw new Error('Research produced no findings.')

const effectiveCap = overBudget() ? Math.min(maxFindings, 6) : maxFindings
if (overBudget()) log(`[cost] over the token cap after research — trimming validation to ${effectiveCap}`)
const allFindings = rawFindings.slice(0, effectiveCap)
if (rawFindings.length > effectiveCap)
  log(`[cost] validating ${effectiveCap} findings; ${rawFindings.length - effectiveCap} dropped before fact-checking`)

// validation fan-out: ONE agent per finding — the dominant cost line
const verdicts = await parallel(allFindings.map((f, i) => () => agent(/* shape below */)))

The thing to notice is where the cap lands. slice(0, effectiveCap) bounds the array — the count of findings — before parallel() ever touches it. That ordering is the entire trick. One agent per finding means N findings is N agents, and each agent is a fresh research-grade token bill. Two numbers from my own run carry the point: uncontrolled, it was ~768k output tokens across 41 subagents, and roughly 33 of those 41 were the validation fan-out. After dropping researchAngles from 4 to 3 and passing maxFindings: 12, the same workflow came back at ~85k output tokens, 16 agents, 4.5 minutes — call it 9x cheaper for an outline I couldn't tell apart from the expensive one.

The cap isn't static. Math.min(maxFindings, 6) means a run that's already over budget trims harder, down to six findings; the budget signal arms the next decision while there's still spend to save.

One caveat, because the source disagrees with my best-case run on purpose. The script default is maxFindings = Math.max(4, opts.maxFindings || 24) — that's 24, not 12. The 85k run used maxFindings: 12 passed as an option for that one invocation; if you read the source you'll find 24 as the default and Math.min(maxFindings, 6) in the slice, never a literal 12. And don't read this as "lower is always better." The cap controls spend; outline quality is its own problem. Starve the synthesis stage of facts and you get a thin draft you'll have to redomultiagent.

One flaky agent shouldn't poison the whole run

parallel() is a barrier — it launches every task, then awaits all of them, and an unhandled rejection in any single task throws away every sibling's already-paid-for work. Picture it. Thirty-two fact-checkers finish, cost you real tokens, and the thirty-third rejects; on a plain Promise.all, that one rejection detonates the lot. So the failure handling has to live inside the array, per task.

const verdicts = await parallel(allFindings.map((f, i) => () =>
  agent(verifyPrompt(f), {
    label: `verify:${(f.api || f.source_title || f.source_url || i).toString()}`.slice(0, 60),
    phase: 'Validate',
    schema: VERDICT_SCHEMA,
  })
    .then((v) => ({ finding: f, verdict: v }))
    .catch(() => ({ finding: f, verdict: null, error: true }))
))

const errored = verdicts.filter((v) => v && v.error).length
const usable = verdicts.filter((v) => v && !v.error)
if (errored) log(`[warn] ${errored} fact-checker agent(s) returned no structured output and were skipped`)

This failure mode shows up on real runs. When an agent can't produce schema-valid JSON inside the retry limit, the runtime hands back a result message with subtype error_max_structured_output_retries instead of a structured_output fieldstructured. That agent is structurally broken — it will never satisfy the schema, so retrying is pointless. The per-task .catch() converts the rejection into { verdict: null, error: true }, a result you can filter, and the script counts it as errored before it does any filtering. The research phase upstream uses the same idea with a thinner sentinel — .catch(() => null) followed by .then((rs) => rs.filter(Boolean)) — because at that stage a dropped angle is cheaper to lose than a dropped verdict.

The pitfall is the obvious overcorrection: swallowing failures silently is itself a bug. That errored count and the [warn] line exist so a half-broken run can't masquerade as a healthy one. If six of your twelve fact-checkers errored out, you want to know you outlined on six findings before you trust the draft.

pipeline() maps; parallel() launches and waits

These two primitives look interchangeable until you ask when a cap can fire. parallel() launches every task and awaits all of them at a hard barrier, so any limit you want has to be sized before launch — once it's running, you can't reach in and kill half of it. pipeline() maps items through a single stage function without that all-or-nothing barrier sitting between the items.

phase('Outline')
const results = await pipeline(
  angles,
  (a, _orig, i) => draftAndReview(
    { article: a, siblings: angles.filter((_, j) => j !== i) },
    a.suggested_slug || `art${i + 1}`,
  ),
)
const drafted = results.filter(Boolean)

pipeline(angles, (a, _orig, i) => ...) takes the ARRAY first and a single stage function second; it runs each angle through draftAndReview, threading along the item, its original, and the index. Set that against the validation call from two sections back — parallel(allFindings.map(...)) takes an array of thunks, fires them, and waits at the barrier. The consequence for cost: because parallel() is a barrier, your caps apply between phases and never mid-flight. You size the fan-out before it launches and can't abort it midway. That's the real reason the slice() has to come first, structurally, rather than being a tidy convenience.

So don't reach for pipeline() thinking it's the cheaper parallelism. pipeline() is a different control-flow shape, and the cost lever is still the agent count you hand it, decided before the call. Swapping one primitive for the other changes how failures and caps behave.

Run research once, then never again

The runtime forbids mid-run user input, and the docs are explicit about the workaround: run each stage as its own workflow when you need a human sign-off between themworkflows. The forced split is what makes the corpus reusable. Phase one stops and hands back its corpus; phase two is a separate invocation that reads the corpus and skips research entirely.

phase('Plan cluster')
const plan = await agent(/* propose distinct article angles from factsBlock */)

reportSpend('Plan cluster')
log(`Proposed ${plan.articles.length} article angles — STOPPING for approval before drafting.`)
return {
  mode: 'cluster',
  stage: 'awaiting-approval',
  tokens_k: spentK(),
  corpus: verified,          // the expensive part — run ONCE
  proposed_angles: plan.articles,
  next_step: 'Review proposed_angles, then re-invoke with {corpus, approvedAngles}.',
}

The runtime can't ask you a question mid-flight, so the split is forced. The script returns stage: 'awaiting-approval' carrying the verified corpus, you edit the proposed angles offline at your own pace, and the re-invocation comes back in with { corpus, approvedAngles }. On that second run the research and validation phases don't execute — the script sees a supplied corpus and short-circuits straight to drafting. My phase-two run, reusing the corpus, landed at ~51k output tokens, 6 agents, zero web calls; the research bill got paid exactly once and amortized across every article drafted from it. Feeding the same corpus into the drafting agents also lets them read it from cache, and the usage object splits cache_read_input_tokens (billed at a reduced rate) from cache_creation_input_tokens (billed higher)cost — so a high cache-read figure on the phase-two agents is your proof the reuse is happening.

The caveat is timing. Prompt-cache entries have a TTL, and if phase two lands more than a few minutes after phase one, the default cache window may have already lapsed; you'll re-pay full input price on the corpus and never see it in the output-token counter. The one-hour window (ENABLE_PROMPT_CACHING_1H) exists for exactly this shape of gap — a human staring at a list of proposed angles, deciding over coffee which three are worth drafting.

Count tokens live, and confess what you dropped

A workflow that can't see its own spend can't be tuned, and you end up guessing which phase is the pig. The fix is almost embarrassingly small: sample budget.spent() once at the top for a baseline, then report the delta at every phase boundary — and whenever you truncate, log the count you threw away instead of dropping it in silence.

const t0 = budget.spent()
const spentK = () => Math.round((budget.spent() - t0) / 1000)
const reportSpend = (label) => log(`[tokens] after ${label}: ~${spentK()}k output tokens this run`)
const overBudget = () => maxTokens != null && (budget.spent() - t0) >= maxTokens

t0 anchors the baseline so spentK() reports this run's delta. Calling reportSpend('Research') and reportSpend('Validate') at each boundary is precisely how I knew the fan-out was ~33 of 41 agents and not, say, the synthesis step quietly running away — the numbers told me where to cut before I changed a line. And overBudget() is the predicate that arms the Math.min(maxFindings, 6) trim from the first section; the live counter feeds the control flow.

If you hand-roll your own counter over a parallel() barrier, there's a correctness bug waiting that the built-in budget already handles. Parallel tool calls emit multiple assistant messages that share one id with identical usage, so a naive sum double-counts unless you dedupe by message ID:

const seenIds = new Set()
for await (const message of query({ prompt })) {
  if (message.type === 'assistant') {
    const id = message.message.id
    if (!seenIds.has(id)) {
      seenIds.add(id)
      totalOutputTokens += message.message.usage.output_tokens
    }
  }
}

The honest caveat on all of this: budget.spent() only counts what the script can see, which is OUTPUT tokens. It never sees the web-fetch input tokens a research agent pulls into its own context, so the total billed figure runs higher than your live counter ever shows. Use the counter to size the fan-out before it launches.

Prove where the tokens landed, then route the cheap work down

After you've capped the count of agents, the second lever is capping per-agent cost by model — and modelUsage is how you confirm the routing took effect. The result message carries a per-model map of token counts and dollar cost, and reading it takes a few lines:

for (const [modelName, usage] of Object.entries(message.modelUsage)) {
  console.log(`${modelName}: ${usage.outputTokens} out, $${usage.costUSD.toFixed(4)}`)
}

Run a split where Haiku handles the fan-out and a stronger model handles synthesis, then read this map; it tells you exactly where the spend landedcost. It's the receipt for the first section's claim that the fan-out is the cost driver — you don't have to take my word for which phase is expensive, the per-model breakdown shows it. Anthropic's own cost guidance is blunt about the lever: for simple subagent tasks, specify model: haiku in your subagent configurationcosts. A per-finding fact-check, fetch a URL and answer one boolean against a schema, is about as simple as a subagent task gets.

One precise caveat saves you an afternoon of confusion. That routing is a declarative field in the subagent's own configuration: you set it in the subagent's Markdown frontmatter, then confirm it landed by reading modelUsage. The map confirms the fan-out ran on Haiku.

The pitfall is over-applying the cheap model. Synthesis and adversarial verification are where reasoning quality decides whether the fact-checker catches a hallucinated API or waves it through, so keep those on a stronger model. A mechanical fan-out — schema in, one boolean out — is where the cheap model pulls its weight.

Closing

I keep coming back to how unremarkable the fixes are. A slice, a .catch(), an early return, a console.log — none of them is clever, none would survive a code review as anything but housekeeping, and together they turned a 768k-token run into an 85k one. What unsettles me is the asymmetry underneath. The runtime will happily fan out toward its 1,000-agent ceiling without flinching, so the discipline has to live in your script; nothing in the platform is going to cap your spend for youworkflows.

The open question I don't have a clean answer to is where this nets out as agents keep getting cheaper per token but easier to spawn by the dozen. A 9x leak stays 9x even when per-agent cost drops; cheaper agents just delay when you feel it, after the bill has scaled with the count. I'm not sure the instinct to cap before the fan-out survives a world where spawning a hundred subagents feels free.

References (footnotes, sparse):


What do you think of what I said?

Share with me your thoughts. You can tweet me at @allanmacgregor.