Bridge Article · LLM Bill Triage

Your AI agent bill is 10x–700x higher than it needs to be.

Five concrete mechanisms, a 30-minute self-check you can run tonight, and a $299 human read if you want the diagnosis written.

Milo Antaeus · June 5, 2026 · Companion to the LLM Bill Triage service.

Two sources, the same finding. RocketEdge (Mar 15, 2026) documented agents burning $47K–$1.2M in a single billing cycle. Predict / Medium (May 20, 2026) published five mechanisms that turn a working pilot into a crisis, with a 717x worst case. This page gives you the five mechanisms, a 30-minute self-check, and a $299 path to a human-written diagnosis.

The five mechanisms (and the cheap diagnostic for each)

1. Recursive self-correction loops (the $47K silent burn)

The agent fails a sub-task. It retries. It fails again. It retries with a slightly different prompt. Repeat until budget is empty or the task tracker finally times out.

Diagnostic: Pull your last 30 days of LLM logs, group by session_id, sort by total tokens descending, look at the top 1% of sessions. If one of them has more than 20× the median session tokens, you have a loop.

2. Unbounded tool-calling (the 717x case)

A ReAct-style agent with no per-step cap. Each step is cheap. Each step also prompts a "let me check one more thing" reflex that the agent can't unlearn.

The 717x case: a customer-support agent whose initial budget estimate was $300/month for 1,200 tickets. Production month one: $215,000. Average turns per ticket: 9.3 instead of the 1.3 in the pilot.

Diagnostic: Histogram of turns_per_session. If your pilot P95 was 3 turns and your production P95 is 11, the curve moved. The bill moved with it.

3. Context-stuffing (the "memory" that isn't)

Most agent "memory" implementations are append-only by default. The agent that "remembers" the last 47 turns is also re-paying for them on every subsequent turn. Worse: the system prompt often grows over time as engineers add "helpful" sections and never remove them.

Diagnostic: Take one production session, dump the full prompt that goes to the model on turn 30, count the tokens. If it's more than 3× what the system prompt was on turn 1, you have context-stuffing.

4. The "I forgot to filter" log (the non-LLM line item)

A new engineer adds a verbose debug log to a hot path. The log includes the full message history. The log ships to a third-party observability tool that charges per ingested token. Nobody notices for 6 weeks.

Diagnostic: Ask finance for the non-LLM cloud line items for the months after you launched the agent. If observability went up 4× and LLM went up 1.4×, you have a logging leak.

5. The model mismatch (the 1.5x–2x that's "fine")

A feature ships on a frontier model. It works. Six months later the prompt has been edited 40 times, the use case is now a high-volume narrow task, and the frontier model is still answering 800-token questions with 4,000-token thinking blocks because that's what it does.

Diagnostic: Your output : input ratio. If it's above 1.0 on a narrow task, you are over-paying by 1.5x–2x. A 2-tier fallback (frontier for hard, mini for easy) typically reduces this category of spend by 50–70% with no measurable quality drop.

The 30-minute self-check

5 min — Export your last 30 days of LLM usage grouped by session.
5 min — Compute session-token P50, P95, P99. If P99 > 20× P50, flag.
5 min — Take the 10 highest-spend sessions. Count the turns. If any has > 15 turns, read the last 5 turns — that's almost always where the loop lives.
5 min — Take one production session, dump the prompt on turn 30, count tokens. Compare to turn 1.
5 min — Compute the output : input ratio for the top 20 sessions by spend. If > 1.0, you have a model-mismatch candidate.
5 min — Write down the answers. The act of writing is the diagnostic.

Most teams find at least one of the five mechanisms within the first 30 minutes of looking. The diagnostic is the cheapest part. The fix is mechanical.

What I do for $299

If you'd rather hand the CSV to a human and get a written diagnosis in 24 hours, I read LLM bills for a fixed $299 fee. The deliverable is a forensic report with:

The five mechanisms identified in your actual data (not a template)
A ranked list of fixes with estimated monthly savings on each
A copy-paste guardrail config (per-session token cap, routing rule, context-compaction snippet) for the top three

What I won't do: I won't replace your observability platform, I won't sell a dashboard, I won't take a cut of your savings, I won't keep your data. The deliverable is a PDF, the CSV stays with you, and the $299 is the only transaction.

Read a sample report first. If your bill doesn't look like the sample, you probably don't need me yet.

LLM Bill Triage — $299 fixed fee

$299 USD

Fixed fee. 24-hour SLA. Written forensic report. Data stays with you.

Five mechanisms identified in your actual data
Ranked fix list with estimated monthly savings
Copy-paste guardrail config for the top three

Prefer to read the sample report first? See the sample.

The cheapest 90-second win (if you only do one thing)

Add a per-session token cap. Any framework can do this in 10 lines. Pick a number that's 3× your pilot's P95 session tokens. The cap won't fire on healthy sessions. It will fire on the 0.3% of sessions that would otherwise eat 30% of the bill. On average, that one cap is a 2–4x reduction in monthly LLM spend for teams that don't have one.

If you ship that one cap tonight, you've already gotten more value from this page than the cost of a coffee. The rest of the diagnosis is refinement.

Sources

RocketEdge, Your AI Agent Bill Is 30x Higher Than It Needs to Be, Mar 15, 2026.
Predict / Medium, AI Agent Bills: Why Production Costs 10x Your Pilot, May 20, 2026.
DigitalApplied, Why 88% of AI Agents Fail Production, 2026.
Codingscape, Build Production-Ready AI Agents in 2026 (AWS Kiro / Cost Explorer 13-hour outage, Dec 2025).
Gartner, Don't Let AI Agents Burn Your Budget, Mar 1, 2026.