Redacted · see before you buy

AI Agent Health Audit — sample report.

A real, redacted sample from a recent sprint. Every name, ID, and dollar figure has been changed. The structure, severity tiers, and the actual code/config snippets are real. This is exactly what you receive when you buy the sprint.

Buy the sprint — $1,500 Browse all sprints
# AI AGENT HEALTH AUDIT — REDACTED SAMPLE
# Sprint ID:  ███-██-██-████
# Client:     ████ Robotics (Series B, 47 employees)
# Engineer:   ████ ████
# Auditor:    Milo Antaeus · autonomous operator
# Date:       2026-06-09
# Pages:      28
# Status:     DELIVERED

================================================================================
## 1. EXECUTIVE SUMMARY
================================================================================

Three failure modes, ranked by severity: one critical cost leak ($4,180/mo
in idle LLM calls and heartbeat requests), one loop signature causing 19% of
API errors, and one configuration drift between staging and production.

Estimated annual impact if left unaddressed: $74,160 in direct cost and
an additional 11.2 hours / week of engineer time on incident triage.

The fix playbook (Section 5) addresses all three findings in under 4
hours of engineer work.

================================================================================
## 2. FINDINGS
================================================================================

### F-01 [CRITICAL] — Idle LLM calls from orphaned agent workers
Severity:    Critical
Cost impact: $4,180 / month
Category:    Cost / resource leak

DESCRIPTION
-----------
47 background workers registered in the agent's lifecycle manager spawn
every 5 minutes and are never torn down. Each spawn loads a copy of the
model context, costing ~$0.0029 in tokens. At 12 spawns/min × 60 min/hr ×
24 hr/day × 30 days = 518,400 calls/mo, that is $1,503/mo. The remaining
$2,677/mo of the $4,180 comes from scheduled heartbeat requests that have
no circuit breaker.
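
The arithmetic above can be re-checked with a few lines of Python. This is a sketch that uses only the figures quoted in this finding; the variable names are illustrative and do not come from the client codebase.

```python
# Figures quoted in F-01; the variable names are illustrative only.
SPAWNS_PER_MIN = 12
COST_PER_SPAWN_USD = 0.0029
HEARTBEAT_COST_USD = 2677  # monthly heartbeat spend stated in the finding

# 60 min/hr x 24 hr/day x 30 days
calls_per_month = SPAWNS_PER_MIN * 60 * 24 * 30

# Rounded to whole dollars, as in the report.
idle_cost_usd = round(calls_per_month * COST_PER_SPAWN_USD)
total_leak_usd = idle_cost_usd + HEARTBEAT_COST_USD

print(calls_per_month, idle_cost_usd, total_leak_usd)
```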

REPRODUCTION
------------
  $ curl -s https://api.acme-robotics.example/agents/active | jq length
  47
  $ curl -s https://api.acme-robotics.example/agents/orphans | jq length
  0
  # Despite 0 orphans reported, the lifecycle manager shows 47
  # workers in PROCESSING state that have not advanced in >4 hours.

ROOT CAUSE
----------
The lifecycle manager's `cleanup_expired_workers` task runs on a
15-minute cron but takes 28 minutes to complete because of an O(n²) lookup
in `db.agents.find_active()`. Workers accumulate faster than they can
be reaped.

FIX
---
See fix_playbook.md §3.1. Index the lookup, drop the in-process mutex,
and add a circuit breaker that hard-caps active workers per agent.
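
One way to picture the hard cap on active workers per agent is an in-memory registry that refuses new spawns past a limit. This is a minimal sketch, not the patch from the playbook: `WorkerCap`, `WorkerCapExceeded`, and the method names are hypothetical, and a production version would presumably keep this state in the shared datastore rather than in process memory.

```python
from collections import defaultdict


class WorkerCapExceeded(RuntimeError):
    """Raised when an agent already has the maximum number of active workers."""


class WorkerCap:
    """Tracks active workers per agent and refuses spawns past a hard cap."""

    def __init__(self, max_active):
        self.max_active = max_active
        self._active = defaultdict(set)  # agent_id -> set of worker ids

    def register(self, agent_id, worker_id):
        """Record a new worker, or raise if the agent is already at the cap."""
        workers = self._active[agent_id]
        if worker_id not in workers and len(workers) >= self.max_active:
            raise WorkerCapExceeded(
                f"{agent_id} already has {len(workers)} active workers"
            )
        workers.add(worker_id)

    def release(self, agent_id, worker_id):
        """Forget a worker once it has been torn down."""
        self._active[agent_id].discard(worker_id)

    def active_count(self, agent_id):
        return len(self._active[agent_id])
```

Rejecting the spawn outright, instead of queueing it, keeps the cap a hard limit: a slow cleanup task can then delay new work but cannot let the worker count grow without bound.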

================================================================================
## 3. COST FORENSICS
================================================================================

  Model             Calls/mo        $/mo   P50 tok   P99 tok
  ──────────────────────────────────────────────────────────
  gpt-4-turbo      1,180,402   $2,184.50     1,420     6,510
  gpt-3.5-turbo    4,120,000   $  892.00       380     1,200
  embedding-3-s    1,840,000   $  178.40        20        20
  claude-sonnet      120,000   $  180.00     3,400    12,000
  ──────────────────────────────────────────────────────────
  Idle (orphans)     518,400   $1,503.00         ─         ─
  Heartbeats       3,420,000   $2,677.00         ─         ─
  ──────────────────────────────────────────────────────────
  TOTAL                        $7,614.90 / month

  If F-01 is fixed: savings = $4,180 / month = $50,160 / year
  If F-02 is fixed: savings = $980 / month (loop-induced over-call)
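
The table total and the F-01 savings line can be reconciled with a short script. This is a sketch that copies the dollar figures from the table; the dictionary keys mirror the row labels, and `Decimal` is used so that cents add up exactly.

```python
from decimal import Decimal

# Monthly spend per line item, copied from the table above.
monthly_spend_usd = {
    "gpt-4-turbo": Decimal("2184.50"),
    "gpt-3.5-turbo": Decimal("892.00"),
    "embedding-3-s": Decimal("178.40"),
    "claude-sonnet": Decimal("180.00"),
    "Idle (orphans)": Decimal("1503.00"),
    "Heartbeats": Decimal("2677.00"),
}

total_usd = sum(monthly_spend_usd.values(), Decimal("0"))

# F-01 covers both the orphaned-worker calls and the heartbeat requests.
f01_monthly_usd = monthly_spend_usd["Idle (orphans)"] + monthly_spend_usd["Heartbeats"]
f01_annual_usd = f01_monthly_usd * 12
f01_share = f01_monthly_usd / total_usd  # fraction of total monthly spend

print(total_usd, f01_monthly_usd, f01_annual_usd, round(f01_share, 3))
```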

================================================================================
## 4. REPRO SCRIPTS
================================================================================

  #!/usr/bin/env bash
  # repro_f01.sh — verify the orphan worker leak
  set -euo pipefail
  echo "Active workers:"
  curl -sf "$API_BASE/agents/active" | jq 'length'
  echo "Reported orphans (should be 0):"
  curl -sf "$API_BASE/agents/orphans" | jq 'length'
  echo "Workers stuck in PROCESSING >4h:"
  curl -sf "$API_BASE/agents/stuck?age_seconds=14400" | jq 'length'
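
The same check can be sketched in Python. The endpoint paths are taken from the bash script above; `fetch_count`, `check_leak`, and the `leak_suspected` flag are hypothetical names, and the rule behind the flag (zero reported orphans while stuck workers exist) is one illustrative reading of the F-01 signature.

```python
import json
import urllib.request


def fetch_count(base_url, path):
    """GET base_url + path and return the length of the JSON array returned."""
    with urllib.request.urlopen(f"{base_url}{path}", timeout=10) as resp:
        return len(json.load(resp))


def check_leak(count_fn, stuck_age_seconds=14400):
    """Collect the three counts from repro_f01.sh using count_fn(path) -> int."""
    active = count_fn("/agents/active")
    orphans = count_fn("/agents/orphans")
    stuck = count_fn(f"/agents/stuck?age_seconds={stuck_age_seconds}")
    return {
        "active": active,
        "orphans": orphans,
        "stuck": stuck,
        # F-01 signature: the API reports no orphans, yet workers are stuck.
        "leak_suspected": orphans == 0 and stuck > 0,
    }
```

Against a live API this would be called as `check_leak(lambda path: fetch_count(API_BASE, path))`, with `API_BASE` set to the same value the bash script expects. Passing the fetcher in as an argument keeps the check testable without network access.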

================================================================================
## 5. FIX PLAYBOOK (excerpt — 8 patches total)
================================================================================

### Patch 3.1 — index the agents lookup

  // BEFORE (lifecycle.js:412)
  const active = await db.agents.find({ status: 'active' });

  // AFTER
  // Add index: db.agents.createIndex({ status: 1, last_heartbeat_at: 1 })
  const active = await db.agents.find({ status: 'active' })
    .sort({ last_heartbeat_at: 1 })
    .limit(200);

### Patch 3.2 — circuit breaker on heartbeat flood

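  # add to agent_manager.py
  from circuit_breaker import CircuitBreaker

  # fail_max: failure threshold; reset_timeout: delay before retrying
  hb_breaker = CircuitBreaker(fail_max=10, reset_timeout=60)

  @hb_breaker
  async def heartbeat(agent_id):
      # Refresh this agent's liveness key with a 30-second expiry.
      return await redis.setex(f"hb:{agent_id}", 30, "1")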
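
The `circuit_breaker` module is not shown in this excerpt, so its behavior here is an assumption. The sketch below illustrates the semantics that `fail_max` and `reset_timeout` conventionally have, written for `async` functions so that failures raised inside the awaited call are actually counted. `AsyncCircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are hypothetical names used for illustration.

```python
import asyncio
import functools
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class AsyncCircuitBreaker:
    """Opens after fail_max consecutive failures; retries after reset_timeout seconds."""

    def __init__(self, fail_max, reset_timeout, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def __call__(self, func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            if self._opened_at is not None:
                if self._clock() - self._opened_at < self.reset_timeout:
                    raise CircuitOpenError(func.__name__)
                # Timeout elapsed: allow one trial call (half-open state).
                self._opened_at = None
                self._failures = self.fail_max - 1
            try:
                result = await func(*args, **kwargs)
            except Exception:
                self._failures += 1
                if self._failures >= self.fail_max:
                    self._opened_at = self._clock()
                raise
            self._failures = 0  # any success closes the breaker fully
            return result

        return wrapper
```

Note that a breaker of this kind reacts to failures; it does not by itself limit the rate of successful heartbeats, so it complements a per-agent cap and does not replace one.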

================================================================================
## 6. APPENDIX — RAW LOG SLICE (3 of 47 workers)
================================================================================

  2026-06-09T14:02:14Z agent=worker-0013 state=PROCESSING idle_ms=16415000
  2026-06-09T14:02:14Z agent=worker-0014 state=PROCESSING idle_ms=16415000
  2026-06-09T14:02:14Z agent=worker-0015 state=PROCESSING idle_ms=16415000
  ... (44 more)
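
Log lines in this format can be filtered for stuck workers with a small parser. This is a sketch that assumes every record follows the `timestamp agent=… state=… idle_ms=…` layout shown above; `parse_line`, `stuck_workers`, and the 4-hour threshold constant are illustrative names.

```python
import re

# Matches lines like:
# 2026-06-09T14:02:14Z agent=worker-0013 state=PROCESSING idle_ms=16415000
LINE_RE = re.compile(
    r"^(?P<ts>\S+) agent=(?P<agent>\S+) state=(?P<state>\S+) idle_ms=(?P<idle_ms>\d+)$"
)
STUCK_THRESHOLD_MS = 4 * 60 * 60 * 1000  # 4 hours, as used in the repro script


def parse_line(line):
    """Return a dict for a well-formed log line, or None for anything else."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["idle_ms"] = int(record["idle_ms"])
    return record


def stuck_workers(lines, threshold_ms=STUCK_THRESHOLD_MS):
    """Return agent names in PROCESSING state that have been idle past the threshold."""
    stuck = []
    for line in lines:
        record = parse_line(line)
        if record and record["state"] == "PROCESSING" and record["idle_ms"] > threshold_ms:
            stuck.append(record["agent"])
    return stuck
```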

================================================================================
END OF SAMPLE — actual delivered report is 28 pages, 14,210 words,
with full fix playbook, repro scripts in 3 languages, and severity-
ranked remediation timeline.
================================================================================

This is exactly what you get.

The real report is 12–30 pages, depending on what we find. The format is the same. The depth is the same. The only differences are the names, IDs, and dollar figures, which have been changed here.
