Redacted · see before you buy
AI Agent Health Audit — sample report.
A real, redacted sample from a recent sprint. Every name, ID, and dollar figure has been changed. The structure, severity tiers, and the actual code/config snippets are real. This is exactly what you receive when you buy the sprint.
# AI AGENT HEALTH AUDIT — REDACTED SAMPLE
# Sprint ID: ███-██-██-████
# Client: ████ Robotics (Series B, 47 employees)
# Engineer: ████ ████
# Auditor: Milo Antaeus · autonomous operator
# Date: 2026-06-09
# Pages: 28
# Status: DELIVERED
================================================================================
## 1. EXECUTIVE SUMMARY
================================================================================
Three failure modes, ranked by severity. One critical cost leak ($4,180/mo
in idle LLM calls). One loop signature causing 19% of API errors. One
configuration drift between staging and production.
Estimated annual impact if left unaddressed: $74,160 in direct cost and
an additional 11.2 hours / week of engineer time on incident triage.
The fix playbook (Section 5) addresses all three findings in under 4
hours of engineer work.
================================================================================
## 2. FINDINGS
================================================================================
### F-01 [CRITICAL] — Idle LLM calls from orphaned agent workers
Severity: Critical
Cost impact: $4,180 / month
Category: Cost / resource leak
DESCRIPTION
-----------
47 background workers registered in the agent's lifecycle manager are
spawning every 5 minutes and never being torn down. Each spawn loads a
copy of the model context, costing ~$0.0029 in tokens. At 12 spawns/min,
that is 12 × 60 min × 24 hr × 30 days = 518,400 calls/mo, or about $1,503
at $0.0029 each. The remaining $2,677 of the $4,180 comes from scheduled
heartbeat requests with no circuit breaker.
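The arithmetic above can be re-derived with a short sketch. It uses only the figures quoted in the description (12 spawns/min, $0.0029 per spawn, $2,677/mo in heartbeats); the function and constant names are illustrative and not part of the client's code.

```python
from decimal import Decimal

# Figures quoted in the F-01 description.
SPAWNS_PER_MIN = 12
COST_PER_SPAWN = Decimal("0.0029")  # USD in tokens per spawn
HEARTBEAT_COST = Decimal("2677")    # USD per month


def monthly_spawns(per_min: int, days: int = 30) -> int:
    """Number of spawns in a month of `days` days at a constant rate."""
    return per_min * 60 * 24 * days


calls = monthly_spawns(SPAWNS_PER_MIN)
# Round to whole dollars, as the report does.
idle_cost = (calls * COST_PER_SPAWN).quantize(Decimal("1"))
total = idle_cost + HEARTBEAT_COST

print(f"{calls:,} calls/mo, ${idle_cost:,} idle, ${total:,} total")
```

Using `Decimal` rather than floats keeps the dollar figures exact, which matters when the result is compared against an invoice.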
REPRODUCTION
------------
$ curl -s https://api.acme-robotics.example/agents/active | jq length
47
$ curl -s https://api.acme-robotics.example/agents/orphans | jq length
0
# Despite 0 orphans reported, the lifecycle manager shows 47
# workers in PROCESSING state that have not advanced in >4 hours.
ROOT CAUSE
----------
The lifecycle manager's `cleanup_expired_workers` task is running on a
15-minute cron but takes 28 minutes to complete due to an O(n²) lookup
in `db.agents.find_active()`. Workers accumulate faster than they can
be reaped.
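The body of `db.agents.find_active()` is not reproduced in this excerpt, so the following is only a generic illustration of how a quadratic lookup of this kind tends to arise and how an index (here, a hash set) removes it. Both function names and the list-based data model are assumptions made for the sketch.

```python
from typing import Iterable


def find_expired_quadratic(worker_ids: list[str], live_ids: list[str]) -> list[str]:
    """O(n*m): `in` on a list rescans live_ids for every worker."""
    return [w for w in worker_ids if w not in live_ids]


def find_expired_indexed(worker_ids: Iterable[str], live_ids: Iterable[str]) -> list[str]:
    """O(n+m): build a hash set once; each membership test is O(1) on average."""
    live = set(live_ids)
    return [w for w in worker_ids if w not in live]


workers = [f"worker-{i:04d}" for i in range(1, 48)]  # 47 workers, as in F-01
live = workers[:10]                                   # hypothetical: 10 still heartbeating
expired = find_expired_indexed(workers, live)
```

Both functions return the same result; only the cost of each membership test differs. A database index plays the same role as the set when the lookup runs server-side.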
FIX
---
See fix_playbook.md §3.1. Index the lookup, drop the in-process mutex,
and add a circuit breaker that hard-caps active workers per agent.
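A hard cap on active workers per agent could look roughly like the sketch below. It is a minimal in-process version under stated assumptions: the class name, the exception, and the cap value are hypothetical, and a multi-process deployment would need the counts in a shared store rather than in memory.

```python
from collections import defaultdict


class WorkerCapExceeded(RuntimeError):
    """Raised when an agent already holds the maximum number of workers."""


class WorkerCap:
    """Tracks active workers per agent and refuses registrations past the cap."""

    def __init__(self, max_per_agent: int) -> None:
        self.max_per_agent = max_per_agent
        self._active: dict[str, set[str]] = defaultdict(set)

    def register(self, agent_id: str, worker_id: str) -> None:
        workers = self._active[agent_id]
        # Re-registering a known worker is allowed; only new workers count.
        if worker_id not in workers and len(workers) >= self.max_per_agent:
            raise WorkerCapExceeded(
                f"{agent_id} already has {len(workers)} active workers"
            )
        workers.add(worker_id)

    def release(self, agent_id: str, worker_id: str) -> None:
        # discard() is a no-op if the worker was never registered.
        self._active[agent_id].discard(worker_id)

    def active_count(self, agent_id: str) -> int:
        return len(self._active[agent_id])
```

Failing loudly at registration time means a stalled reaper shows up as rejected spawns rather than as a silent cost increase.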
================================================================================
## 3. COST FORENSICS
================================================================================
Model               Calls        $/mo   P50 tok   P99 tok
──────────────────────────────────────────────────────────
gpt-4-turbo     1,180,402   $2,184.50     1,420     6,510
gpt-3.5-turbo   4,120,000   $  892.00       380     1,200
embedding-3-s   1,840,000   $  178.40        20        20
claude-sonnet     120,000   $  180.00     3,400    12,000
──────────────────────────────────────────────────────────
Idle (orphans)    518,400   $1,503.00         ─         ─
Heartbeats      3,420,000   $2,677.00         ─         ─
──────────────────────────────────────────────────────────
TOTAL                       $7,614.90 / month
If F-01 is fixed: savings = $4,180 / month = $50,160 / year
If F-02 is fixed: savings = $980 / month (loop-induced over-call)
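The totals in this section can be cross-checked with a few lines. The dictionary below copies the `$/mo` column from the table; the variable names are illustrative only.

```python
from decimal import Decimal

# $/mo column from the cost table, keyed by row label.
MONTHLY_COST = {
    "gpt-4-turbo": Decimal("2184.50"),
    "gpt-3.5-turbo": Decimal("892.00"),
    "embedding-3-s": Decimal("178.40"),
    "claude-sonnet": Decimal("180.00"),
    "Idle (orphans)": Decimal("1503.00"),
    "Heartbeats": Decimal("2677.00"),
}

total = sum(MONTHLY_COST.values(), Decimal("0"))

# F-01 covers both waste rows: orphaned workers and unguarded heartbeats.
f01_monthly = MONTHLY_COST["Idle (orphans)"] + MONTHLY_COST["Heartbeats"]
f01_annual = f01_monthly * 12
f01_share = (f01_monthly / total * 100).quantize(Decimal("0.1"))

print(f"total ${total}/mo, F-01 ${f01_monthly}/mo ({f01_share}%), ${f01_annual}/yr")
```

On these figures F-01 accounts for a little over half of the monthly bill.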
================================================================================
## 4. REPRO SCRIPTS
================================================================================
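#!/usr/bin/env bash
# repro_f01.sh — verify the orphan worker leak
# Requires curl and jq. Usage: API_BASE=<api root URL> ./repro_f01.sh
set -euo pipefail
# Fail early with a readable message if API_BASE is unset or empty.
: "${API_BASE:?set API_BASE to the API root URL}"
echo "Active workers:"
curl -sf "$API_BASE/agents/active" | jq 'length'
echo "Reported orphans (should be 0):"
curl -sf "$API_BASE/agents/orphans" | jq 'length'
echo "Workers stuck in PROCESSING >4h:"
curl -sf "$API_BASE/agents/stuck?age_seconds=14400" | jq 'length'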
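A rough Python port of `repro_f01.sh`, sketched here with only the standard library. The three endpoint paths come from the script above; the function names, the injectable `fetch` parameter, and the assumption that each endpoint returns a JSON array are illustrative.

```python
import json
import urllib.request
from typing import Callable

Fetch = Callable[[str], list]


def http_fetch(url: str) -> list:
    """GET a URL and decode its JSON body; HTTP errors raise, like `curl -f`."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def worker_counts(api_base: str, fetch: Fetch = http_fetch,
                  age_seconds: int = 14400) -> dict[str, int]:
    """Return the three counts printed by repro_f01.sh."""
    return {
        "active": len(fetch(f"{api_base}/agents/active")),
        "orphans": len(fetch(f"{api_base}/agents/orphans")),
        "stuck": len(fetch(f"{api_base}/agents/stuck?age_seconds={age_seconds}")),
    }


def leak_suspected(counts: dict[str, int]) -> bool:
    """F-01 signature: no orphans reported, yet some workers are stuck."""
    return counts["orphans"] == 0 and counts["stuck"] > 0
```

Calling `worker_counts(os.environ["API_BASE"])` would hit the live API. Passing a stub as `fetch` allows the check to run offline in CI.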
================================================================================
## 5. FIX PLAYBOOK (excerpt — 8 patches total)
================================================================================
### Patch 3.1 — index the agents lookup
// BEFORE (lifecycle.js:412)
const active = await db.agents.find({ status: 'active' });
// AFTER
// Add index: db.agents.createIndex({ status: 1, last_heartbeat_at: 1 })
const active = await db.agents.find({ status: 'active' })
  .sort({ last_heartbeat_at: 1 })
  .limit(200);
### Patch 3.2 — circuit breaker on heartbeat flood
# add to agent_manager.py
from circuit_breaker import CircuitBreaker

hb_breaker = CircuitBreaker(fail_max=10, reset_timeout=60)

@hb_breaker
async def heartbeat(agent_id):
    # The key expires after 30 s, so a worker that stops sending heartbeats ages out.
    return await redis.setex(f"hb:{agent_id}", 30, "1")
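Whether the `circuit_breaker` module imported in Patch 3.2 awaits the coroutine it wraps is not shown in this excerpt. A decorator that calls an `async def` without awaiting it never sees the exception, so it may be worth verifying. As a point of comparison, here is a minimal asyncio-aware breaker. It reuses the `fail_max` and `reset_timeout` parameter names from the patch, but the class itself is a hypothetical sketch, not the library's implementation.

```python
import functools
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the wrapped function while the circuit is open."""


class AsyncCircuitBreaker:
    def __init__(self, fail_max: int, reset_timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def __call__(self, func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            if self._opened_at is not None:
                if self._clock() - self._opened_at < self.reset_timeout:
                    raise CircuitOpenError("circuit open; call skipped")
                # Timeout elapsed: fall through and allow a trial call (half-open).
            try:
                result = await func(*args, **kwargs)
            except Exception:
                self._failures += 1
                # A failed trial call, or too many failures, (re)opens the circuit.
                if self._opened_at is not None or self._failures >= self.fail_max:
                    self._opened_at = self._clock()
                raise
            # Success closes the circuit and clears the failure count.
            self._failures = 0
            self._opened_at = None
            return result
        return wrapper
```

This version does not limit concurrent trial calls in the half-open state; several coroutines may get through before the first one reports back.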
================================================================================
## 6. APPENDIX — RAW LOG SLICE (3 of 47 workers)
================================================================================
2026-06-09T14:02:14Z agent=worker-0013 state=PROCESSING idle_ms=16415000
2026-06-09T14:02:14Z agent=worker-0014 state=PROCESSING idle_ms=16415000
2026-06-09T14:02:14Z agent=worker-0015 state=PROCESSING idle_ms=16415000
... (44 more)
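Log lines in this format can be checked against the 4-hour threshold mechanically. The sketch below assumes every line follows the `timestamp agent=… state=… idle_ms=…` layout shown in the slice; the regex and function names are illustrative.

```python
import re
from datetime import timedelta

LINE_RE = re.compile(
    r"^(?P<ts>\S+) agent=(?P<agent>\S+) state=(?P<state>\S+) idle_ms=(?P<idle_ms>\d+)$"
)


def parse_line(line: str) -> dict:
    """Parse one log line into a dict with ts, agent, state and idle (timedelta)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised log line: {line!r}")
    record = match.groupdict()
    record["idle"] = timedelta(milliseconds=int(record.pop("idle_ms")))
    return record


def stuck_agents(records: list[dict],
                 threshold: timedelta = timedelta(hours=4)) -> list[str]:
    """Agents in PROCESSING whose idle time exceeds the threshold."""
    return [r["agent"] for r in records
            if r["state"] == "PROCESSING" and r["idle"] > threshold]


sample = "2026-06-09T14:02:14Z agent=worker-0013 state=PROCESSING idle_ms=16415000"
record = parse_line(sample)
print(record["agent"], record["idle"])  # worker-0013 4:33:35
```

An `idle_ms` of 16,415,000 is 4 h 33 min 35 s, which is consistent with the ">4 hours" figure quoted in F-01.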
================================================================================
END OF SAMPLE — actual delivered report is 28 pages, 14,210 words,
with full fix playbook, repro scripts in 3 languages, and severity-
ranked remediation timeline.
================================================================================
This is exactly what you get.
The real report is 12–30 pages, depending on what we find. The format is the same. The depth is the same. The only differences are the names, IDs, and dollar figures, which have been changed here.