Five concrete mechanisms, a 30-minute self-check you can run tonight, and a $299 human read if you want the diagnosis written.
The agent fails a sub-task. It retries. It fails again. It retries with a slightly different prompt. Repeat until budget is empty or the task tracker finally times out.
Diagnostic: Pull your last 30 days of LLM logs, group by session_id, sort by total tokens descending, look at the top 1% of sessions. If one of them has more than 20× the median session tokens, you have a loop.
A ReAct-style agent with no per-step cap. Each step is cheap. Each step also prompts a "let me check one more thing" reflex that the agent can't unlearn.
The 717x case: a customer-support agent whose initial budget estimate was $300/month for 1,200 tickets. Production month one: $215,000. Average turns per ticket: 9.3 instead of the 1.3 in the pilot.
Diagnostic: Histogram of turns_per_session. If your pilot P95 was 3 turns and your production P95 is 11, the curve moved. The bill moved with it.
Most agent "memory" implementations are append-only by default. The agent that "remembers" the last 47 turns is also re-paying for them on every subsequent turn. Worse: the system prompt often grows over time as engineers add "helpful" sections and never remove them.
Diagnostic: Take one production session, dump the full prompt that goes to the model on turn 30, count the tokens. If it's more than 3× what the system prompt was on turn 1, you have context-stuffing.
A new engineer adds a verbose debug log to a hot path. The log includes the full message history. The log ships to a third-party observability tool that charges per ingested token. Nobody notices for 6 weeks.
Diagnostic: Ask finance for the non-LLM cloud line items for the months after you launched the agent. If observability went up 4× and LLM went up 1.4×, you have a logging leak.
A feature ships on a frontier model. It works. Six months later the prompt has been edited 40 times, the use case is now a high-volume narrow task, and the frontier model is still answering 800-token questions with 4,000-token thinking blocks because that's what it does.
Diagnostic: Your output : input ratio. If it's above 1.0 on a narrow task, you are over-paying by 1.5x–2x. A 2-tier fallback (frontier for hard, mini for easy) typically reduces this category of spend by 50–70% with no measurable quality drop.
Most teams find at least one of the five mechanisms within the first 30 minutes of looking. The diagnostic is the cheapest part. The fix is mechanical.
If you'd rather hand the CSV to a human and get a written diagnosis in 24 hours, I read LLM bills for a fixed $299 fee. The deliverable is a forensic report with:
What I won't do: I won't replace your observability platform, I won't sell a dashboard, I won't take a cut of your savings, I won't keep your data. The deliverable is a PDF, the CSV stays with you, and the $299 is the only transaction.
Read a sample report first. If your bill doesn't look like the sample, you probably don't need me yet.
$299 USD
Fixed fee. 24-hour SLA. Written forensic report. Data stays with you.
Prefer to read the sample report first? See the sample.
Add a per-session token cap. Any framework can do this in 10 lines. Pick a number that's 3× your pilot's P95 session tokens. The cap won't fire on healthy sessions. It will fire on the 0.3% of sessions that would otherwise eat 30% of the bill. On average, that one cap is a 2–4x reduction in monthly LLM spend for teams that don't have one.
If you ship that one cap tonight, you've already gotten more value from this page than the cost of a coffee. The rest of the diagnosis is refinement.