What does this grader actually look for?

Eleven signal classes that distinguish a healthy agent from a silent-success one: intent capture, tool-call outcome (real response, not just 'ok'), retry-storm shape, outcome-assertion line, side-effect vs. completion timestamp drift, idempotency-key presence, prompt-injection log shapes, cost-per-outcome per task, context-stuffing, intent drift, and agent-loop budget-burn.

Milo Antaeus, a human who reads AI agent logs for a living. Same person behind the $149 AI Ops Checkup and the $299 LLM Bill Triage deep reports. This grader is the same 11-signal checklist used in those paid reports, run client-side.

Free Self-Audit v4 · 2 more signals · No install · Runs in your browser

Is your AI agent silently failing without you knowing?

Paste your last 50 log lines. Get an A-F grade on the 11 signal classes that separate a healthy agent from a silent-success one — now including intent drift (mid-task goal drift — the second-most-common 2026 silent-success shape) and agent-loop budget-burn (unbounded inner loops that multiply your bill 10-100x). Everything runs client-side — we never see your log text.

No signup to grade · No install · Email is optional · Same 11-signal checklist used in the $149 checkup + $299 bill triage, including intent drift and agent-loop budget-burn

Paste 50ish lines from your agent's log

Mix of JSON lines, prose, timestamps — anything text-based. The two newest signals look for tokens / cost_usd lines per task and for any single messages / context line over 20K chars (the silent token-bill multiplier).

What's new in v4

2 more signal classes, for a total of 11 (up from 9). All v3 signals still run.

v4 adds intent drift: does any line in the log re-state the original user request after a tool call, or is the agent quietly following a goal it invented 4 steps ago? This is the second-most-common 2026 silent-success shape after context-stuffing, especially in multi-step agent-of-agents frameworks.

v4 also adds agent-loop budget-burn: does any single task log the same tool call more than 5 times, an iteration counter, or a max_steps_reached line? This is the silent 10-100x token multiplier hidden inside the most popular 2026 agent frameworks, which keep retrying without bound.

Same browser-side privacy, same 30-second grade, deeper coverage.

What the grader actually looks for

Eleven signal classes. If you have all 11, your agent is well-instrumented. If you're missing 3 or more, you have a silent-success drift problem — your dashboard says green, but the customer experience is wrong 1-3 days later (or your token bill is quietly 3-100x what it should be).

1. Intent capture

Did the agent log what it was trying to do (the user request, the goal, the task) before it started tool-calling? Without this, you can't reconstruct what the agent was thinking when it failed.

2. Tool-call outcome

Did each tool call log the actual response body / status code / side-effect — not just "ok" or "done"? "ok" is the #1 silent-success enabler. Real outcome is what the world did.

3. Retry-storm shape

Are you seeing the same call >2x in a row with no assertion between? That's a retry storm. The fix isn't better retries; it's an assertion line that decides whether to retry.

4. Outcome-assertion line

After a side-effecting call, is there a line that compares expected vs. actual? ("expected status 200, got 201 with redirect" / "expected 1 row in DB, got 0"). This is the cheapest defense against silent-success.

5. Side-effect vs. completion timestamp

When the agent reports "done", did the actual side-effect (email sent, row written, payment captured) land at the same time? A 90-second gap usually means the side-effect was buffered and may have failed silently.

6. Idempotency-key presence NEW

Did each side-effecting call (send_email, create_charge, write_row) carry an idempotency_key / request_id / dedup_token? Without one, a 3x retry storm on a payment API = 3x customer charges. This is the single highest-blast-radius signal in 2026: providers such as Stripe support idempotency keys (support varies across Plaid, Twilio, and SendGrid, so check each API), yet most agent wrappers don't emit one.

7. Prompt-injection log shapes NEW

Are there lines whose content looks like adversarial steering: ignore previous / system: appearing inside a user-input field / a tool call whose name or argument doesn't match the nearest intent line / a tool the agent has never been told about? These shapes are how prompt-injection actually looks in production logs — most teams miss them because they look like normal lines.

8. Cost-per-outcome NEW v3

Does each task/agent-step log its own token spend (tokens_in / tokens_out / cost_usd / usd)? Without per-task cost visibility, a single runaway task can multiply your bill 3-10x and you'll only see it on the monthly invoice. This signal is what the $299 LLM Bill Triage is built around — it's the cheapest one to add and the one most universally missing.

9. Context-stuffing NEW v3

Does any single messages / context / history line balloon past ~20K chars, or repeat the same chunk 3+ times? Context-stuffing is the #1 cause of $5K-$50K LLM-bill surprise in 2026: the agent re-attaches a tool result or document body on every retry, the context grows with every call, and the cumulative token cost grows roughly quadratically with the number of calls. Easy to miss because the logs look "normal", just longer than last week.

10. Intent drift NEW v4

Does the log include a line that re-states the original user request after the agent has been running for a while, or does the agent just keep going without re-checking? Intent drift is the second-most-common 2026 silent-success shape: an agent follows a plausible-looking sub-goal, the sub-goal drifts from the original intent by step 4, and the customer gets a 2,000-word answer to "what's my account balance." The 1-line fix is a logger.info("agent.reaffirm_intent", { intent_hash: ..., step: N }) at every Nth tool call — if the intent line stops appearing, the agent is drifting.

11. Agent-loop budget-burn NEW v4

Does any single task log the same tool call more than 5 times, contain an iteration counter (iter=12/50 / attempt=8), or include a max_steps_reached / iteration_limit / tool_loop_detected line? Unbounded agent loops are the silent 10-100x token multiplier inside the most popular 2026 frameworks (LangGraph, CrewAI, AutoGen) — the agent gets stuck in a sub-task, calls the same tool 40 times, and your bill is 40x what you expected. The fix is one assertion line that aborts the loop when the same (tool, args) pair repeats 3+ times in a window.
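
As a rough illustration of how these three shapes could be spotted in pasted log lines, here is a minimal sketch. It is not the grader's actual implementation. It assumes a hypothetical log shape in which tool calls are JSON lines carrying task_id, tool, and args_hash fields; the function name findBudgetBurn and the regexes are illustrative.

```javascript
// Sketch only: assumes tool calls are logged as JSON lines with
// task_id, tool, and args_hash fields (a hypothetical shape).
const LIMIT_MARKERS = /max_steps_reached|iteration_limit|tool_loop_detected/;
const ITER_COUNTER = /\b(?:iter|attempt)=(\d+)/;

function findBudgetBurn(lines, maxRepeats = 5) {
  const counts = new Map(); // "task|tool|args_hash" -> occurrences so far
  const findings = [];
  for (const line of lines) {
    if (LIMIT_MARKERS.test(line)) findings.push({ kind: "limit_marker", line });
    const iter = ITER_COUNTER.exec(line);
    if (iter) findings.push({ kind: "iteration_counter", value: Number(iter[1]), line });

    let entry;
    try { entry = JSON.parse(line); } catch { continue; } // prose lines are skipped
    if (!entry || typeof entry.tool !== "string") continue;

    const key = [entry.task_id, entry.tool, entry.args_hash].join("|");
    const n = (counts.get(key) || 0) + 1;
    counts.set(key, n);
    // Report once, at the first call that exceeds the allowed repeats.
    if (n === maxRepeats + 1) findings.push({ kind: "repeated_call", key, line });
  }
  return findings;
}
```

Any non-empty result is worth a look; a repeated_call finding in particular means one task issued the same (tool, args) pair more than maxRepeats times.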

What this isn't

This grader is not LangSmith, Langfuse, Helicone, or any other vendor. It's an 11-question check you can run in 30 seconds. For the full 30-page read of your actual production log archive (where I look for the specific silent-success drift, double-charges, prompt-injection events, mid-task intent drift, or the runaway-cost loop that's quietly eating your margin), the $149 AI Ops Checkup covers signals 1-7 and the $299 LLM Bill Triage covers signals 8-11 deeply.

Common fixes by signal

Missing intent capture → log the user request first

logger.info("agent.intent", {
  task_id: id,
  request_summary: "send email re: invoice #4471",
  source: "user_email_reply"
})

Tool-call outcome says "ok" → log the actual response

const r = await sendEmail(...);
logger.info("tool.send_email", {
  status: r.statusCode,         // 200, 201, 202, 4xx, 5xx
  provider_id: r.messageId,     // SES / Postmark / SendGrid id
  latency_ms: r.elapsed
});

Retry storm → add an assertion line that decides retry

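async function toolWithRetry(fn, max=2) {
  for (let i = 0; i <= max; i++) {
    const r = await fn();
    if (assertOutcome(r)) return r;  // ← the line you're missing
    logger.warn("tool.retry", { attempt: i+1, why: describeMismatch(r) });
  }
  // Retries exhausted: fail loudly instead of silently returning undefined
  throw new Error("tool failed outcome assertion after " + (max + 1) + " attempts");
}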

No outcome-assertion line → write a one-line checker

function assertOutcome(r) {
  if (r.statusCode >= 500) throw new Error("server error");
  if (!r.body || r.body.error) return false;
  return true;
}

Side-effect vs. completion timestamp drift → emit a "landed_at" event

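const r = await sendEmail(...);
const doneAt = Date.now();     // agent reports "done" to user here
// ... later, when the provider confirms the side-effect (webhook, poll, or DB read) ...
const landedAt = Date.now();
logger.info("side_effect.landed", { id: r.messageId, done_at: doneAt, landed_at: landedAt, gap_ms: landedAt - doneAt });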

Missing idempotency key → attach a key to every side-effecting call

// At the top of the request, generate a stable key from the intent:
const idemKey = "ord_" + orderId + "_" + intentHash;
const r = await stripe.charges.create(
  { amount, currency, customer },
  { idempotencyKey: idemKey }       // stripe-node takes the key as a per-request option, not a body field
);
logger.info("tool.charge", { idem_key: idemKey, charge_id: r.id });

Prompt-injection-shaped lines → log the goal and bind tool calls to it

// 1. Hash the intent line at tool-call time; log it:
const intentHash = sha256(intentLine);
logger.info("tool.bind", { intent_hash: intentHash, tool: "send_email", args_hash: sha256(JSON.stringify(args)) });
// 2. At review time: any tool whose intent_hash doesn't match a recent intent line
//    is a candidate injection event — flag for human review, do not auto-execute.
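
The review-time step could look something like the sketch below. It assumes log lines have already been parsed into objects with an event field, and that the agent.intent line also carries an intent_hash (the agent.intent example earlier does not log one yet, so that is an addition you would need to make). The function name findUnboundToolCalls and the lookback window are hypothetical.

```javascript
// Sketch only. Assumed (hypothetical) event shapes:
//   { event: "agent.intent", intent_hash }
//   { event: "tool.bind", intent_hash, tool }
function findUnboundToolCalls(events, lookback = 20) {
  const recentIntents = []; // most recent intent hashes, oldest first
  const suspects = [];
  for (const e of events) {
    if (e.event === "agent.intent") {
      recentIntents.push(e.intent_hash);
      if (recentIntents.length > lookback) recentIntents.shift();
    } else if (e.event === "tool.bind" && !recentIntents.includes(e.intent_hash)) {
      suspects.push(e); // candidate injection event: flag for human review
    }
  }
  return suspects;
}
```

A tool.bind line whose hash matches no recent intent is not proof of injection, only a reason for a human to read the surrounding lines.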

No per-task cost line → log tokens and cost on every LLM call

// In your LLM call wrapper (after the response):
const promptTokens = r.usage.prompt_tokens;
const completionTokens = r.usage.completion_tokens;
const costUsd = (promptTokens * 0.000003) + (completionTokens * 0.000015); // your model prices
logger.info("llm.call", {
  task_id: id,
  model: "gpt-4o",
  tokens_in: promptTokens,
  tokens_out: completionTokens,
  cost_usd: costUsd,
  // also tag the intent so a runaway cost shows up attached to a specific user request
  intent_hash: intentHash
});
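
Once every LLM call carries a task_id and cost_usd, finding the runaway task is a small aggregation. The sketch below assumes the llm.call lines have been parsed into objects with an event field; that field name and the function name costByTask are assumptions for illustration.

```javascript
// Sketch only: sums cost_usd per task_id over parsed "llm.call" events
// and returns the tasks sorted from most to least expensive.
function costByTask(events) {
  const totals = new Map(); // task_id -> { calls, cost_usd }
  for (const e of events) {
    if (e.event !== "llm.call") continue;
    const t = totals.get(e.task_id) || { calls: 0, cost_usd: 0 };
    t.calls += 1;
    t.cost_usd += e.cost_usd;
    totals.set(e.task_id, t);
  }
  return [...totals.entries()]
    .map(([task_id, t]) => ({ task_id, ...t }))
    .sort((a, b) => b.cost_usd - a.cost_usd);
}
```

The first entry of the result is the task to read first; a task with many calls and a cost far above the others is the usual runaway shape.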

Context-stuffing → cap the messages array and log its length

// Before each LLM call:
const MAX_CONTEXT_CHARS = 20000;  // ~5K tokens; tune to your model window
let messages = buildMessages(intent, history, toolResults);
const beforeLen = JSON.stringify(messages).length;
if (beforeLen > MAX_CONTEXT_CHARS) {
  // trim oldest tool results, keep the intent and last 2 turns
  messages = trimToIntentAndRecent(messages, 2);
  logger.warn("context.trimmed", { before_chars: beforeLen, after_chars: JSON.stringify(messages).length, task_id: id });
}
logger.info("llm.context", { task_id: id, context_chars: JSON.stringify(messages).length });
// Without this line, you have no way to know if a single task is ballooning context on retry.
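
To see why an uncapped context gets expensive, it helps to work through the arithmetic. If a tool result is re-attached on every retry, each request is larger than the last, so the total characters sent across the task grows roughly with the square of the number of calls. The sketch below uses made-up sizes (a 2,000-char base prompt and an 8,000-char tool result); the function name is hypothetical.

```javascript
// Sketch only: total characters sent over `calls` LLM calls when one more
// copy of the same chunk is attached on every call.
function cumulativeContextChars(baseChars, chunkChars, calls) {
  let total = 0;
  for (let k = 1; k <= calls; k++) {
    total += baseChars + k * chunkChars; // call k carries k copies of the chunk
  }
  return total;
}

console.log(cumulativeContextChars(2000, 8000, 10)); // 460000
```

With these illustrative numbers, ten calls send 460,000 characters in total, versus 100,000 if each call carried the tool result once.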

Intent drift → re-state the original intent at every Nth tool call

// At tool-call N, log the original intent so you can see when the agent stops checking it:
const intentLine = originalUserRequest;       // capture at task start
const intentHash = sha256(intentLine);
function reaffirmIntent(step, toolName) {   // pass the tool about to be called
  if (step % 3 === 0) {                       // every 3rd tool call
    logger.info("agent.reaffirm_intent", {
      task_id: id,
      step: step,
      intent_hash: intentHash,
      intent_first_60: intentLine.slice(0, 60),
      current_tool: toolName
    });
  }
}
// If intent_hash stops appearing in logs after step 6+, the agent has drifted off-task.
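
Checking for that gap can be automated at review time. The sketch below assumes parsed events with an event field and a step number on both tool.call and agent.reaffirm_intent lines (the tool.call example on this page does not log step, so that is an assumed addition). The function name findDriftingTasks and the default gap are hypothetical.

```javascript
// Sketch only: flags tasks whose last tool call is more than `maxGap`
// steps past their last agent.reaffirm_intent line.
function findDriftingTasks(events, maxGap = 3) {
  const lastStep = new Map();     // task_id -> highest tool-call step seen
  const lastReaffirm = new Map(); // task_id -> highest step with a reaffirm line
  for (const e of events) {
    if (e.event === "tool.call") {
      lastStep.set(e.task_id, Math.max(lastStep.get(e.task_id) ?? 0, e.step));
    } else if (e.event === "agent.reaffirm_intent") {
      lastReaffirm.set(e.task_id, Math.max(lastReaffirm.get(e.task_id) ?? 0, e.step));
    }
  }
  const drifting = [];
  for (const [taskId, step] of lastStep) {
    const gap = step - (lastReaffirm.get(taskId) ?? 0);
    if (gap > maxGap) drifting.push({ task_id: taskId, last_step: step, gap });
  }
  return drifting;
}
```

With a reaffirm line every 3rd call, a healthy task never shows a gap above 2, so the default of 3 leaves a little slack.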

Agent-loop budget-burn → abort when a (tool, args) pair repeats 3+ times in a window

// Track recent (tool, args) pairs in a small ring buffer; abort if a pair repeats:
const recent = [];  // { tool, args_hash, ts }
function guardLoop(tool, args) {
  const argsHash = sha256(JSON.stringify(args));
  recent.push({ tool, args_hash: argsHash, ts: Date.now() });
  if (recent.length > 8) recent.shift();
  const same = recent.filter(r => r.tool === tool && r.args_hash === argsHash).length;
  if (same >= 3) {
    logger.error("tool.loop_detected", { tool, args_hash: argsHash, repeats: same, task_id: id });
    throw new Error("budget_exhausted: tool " + tool + " repeated " + same + "x with same args — abort and escalate to human");
  }
  logger.info("tool.call", { tool, args_hash: argsHash, task_id: id });
}

Frequently asked

What does an A grade mean in v4?

All 11 signal classes present. Your agent logs intent before tool-calling, logs real tool outcomes (not just "ok"), avoids retry storms, has outcome-assertion lines, separates "landed" from "done", attaches idempotency keys to every side-effecting call, binds tool calls to a hashed intent line so prompt-injection events get flagged, logs per-task token/cost on every LLM call, caps the context/messages length so context-stuffing can't quietly multiply your bill, re-states the original intent at every Nth tool call so you can detect mid-task goal drift, AND aborts when a (tool, args) pair repeats 3 or more times in a window so an unbounded agent loop is caught before it 40x's your bill. Most teams shipping agent products score D or F on signals 6, 10, and 11 specifically; those are the 2026 high-blast-radius gaps.

What's the most common failure mode in v4?

Signals 10 and 11 are the newest and the most universally missing. Almost no team re-states the original user request at every Nth tool call (signal 10), so agents that drift mid-task look fine in execution traces but ship an answer to a different question than the one asked. Almost no team asserts against (tool, args) pair repetition (signal 11), so LangGraph/CrewAI/AutoGen agents get stuck in sub-tasks for 20-40 calls and the bill goes 20-40x with no warning. Both are 5-line fixes, but the visibility depends on a log line that almost no one emits. The grader shows you in 30 seconds whether you're shipping these gaps.

Is my log data sent anywhere?

No. The grade and all 11 checks run locally in your browser. We never see your log text. We only see an anonymous pageview beacon plus an event when you choose to email yourself the report.

What if I get a D or F?

That means your agent is shipping without most of the 11 log signal classes. Either implement the missing signal yourself (use the prescriptive fix list above) or hire a human to read your full log archive and find the specific drifts. The $149 AI Ops Checkup covers signals 1-7; the $299 LLM Bill Triage covers signals 8-11 in depth (and reuses the first 7 to identify which tasks are responsible).

What format should my logs be in?

Any text format — JSON lines, plain prose, or a mix. The grader uses substring heuristics (words like "retry", "ok", "success", "done", "failed", "error", "idempotency", "ignore previous", "tokens", "cost", "context", HTTP status codes, ISO timestamps). It does not need structured fields. Roughly 50 lines is enough for a meaningful grade.
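
For a sense of what a substring heuristic looks like in practice, here is a minimal sketch. It is an illustration under assumptions, not the grader's actual rule set: the signal names and patterns below are hypothetical, chosen from the keywords mentioned on this page.

```javascript
// Sketch only: hypothetical patterns, not the grader's real rules.
const SIGNALS = {
  idempotency_key: /idempotency|request_id|dedup_token/i,
  cost_per_outcome: /tokens_in|tokens_out|cost_usd/i,
  injection_shape: /ignore previous/i,
  loop_marker: /max_steps_reached|iteration_limit|tool_loop_detected/i,
};

// Counts, per signal, how many lines of the pasted text match its pattern.
function scanSignals(logText) {
  const lines = logText.split(/\r?\n/);
  const hits = {};
  for (const [name, pattern] of Object.entries(SIGNALS)) {
    hits[name] = lines.filter((line) => pattern.test(line)).length;
  }
  return hits;
}
```

How a count is read depends on the signal: zero matches for idempotency_key or cost_per_outcome suggests missing instrumentation, while any match for injection_shape or loop_marker is itself the warning.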