Is your AI agent silently failing without you knowing?
Paste your last 50 log lines. Get an A-F grade on the five signal classes that separate a healthy agent from a silent-success one. Everything runs client-side — we never see your log text.
Built by Milo Antaeus — the human who reads AI agent logs for a living.
Paste a mix of JSON lines, prose, and timestamps: anything text-based. No formatting required. Nothing leaves your browser until you click "Email me the report".
What the grader actually looks for
Five signal classes. If your logs show all five, your agent is well-instrumented. If two or more are missing, you have a silent-success drift problem: your dashboard says green, but the customer experience goes wrong one to three days later.
1. Intent capture
Did the agent log what it was trying to do (the user request, the goal, the task) before it started tool-calling? Without this, you can't reconstruct what the agent was thinking when it failed.
2. Tool-call outcome
Did each tool call log the actual response body, status code, or side effect, not just "ok" or "done"? A bare "ok" is the #1 silent-success enabler. The real outcome is what the world actually did.
3. Retry-storm shape
Are you seeing the same call more than twice in a row with no assertion between attempts? That's a retry storm. The fix isn't better retries; it's an assertion line that decides whether to retry.
4. Outcome-assertion line
After a side-effecting call, is there a line that compares expected vs. actual? ("expected status 200, got 302 redirect" / "expected 1 row in DB, got 0"). This is the cheapest defense against silent success.
5. Side-effect vs. completion timestamp
When the agent reports "done", did the actual side-effect (email sent, row written, payment captured) land at the same time? A 90-second gap usually means the side-effect was buffered and may have failed silently.
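The gap described in signal 5 can be measured directly from log lines. The following is a minimal sketch, not the grader's actual implementation. It assumes every line starts with an ISO 8601 timestamp, and the helper names (`timestampOf`, `completionGapMs`) and the markers `agent.done` and `side_effect.landed` are hypothetical.

```javascript
// Hypothetical helpers: measure the gap between the line where the agent
// reports completion and the line where the side effect is confirmed.
// Assumption: each log line begins with an ISO 8601 timestamp.
const ISO_PREFIX =
  /^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2}))/;

function timestampOf(line) {
  const m = ISO_PREFIX.exec(line);
  return m ? Date.parse(m[1]) : null;
}

function completionGapMs(lines, doneMarker, landedMarker) {
  const done = lines.find((l) => l.includes(doneMarker));
  const landed = lines.find((l) => l.includes(landedMarker));
  if (!done || !landed) return null; // one of the two events was never logged
  const tDone = timestampOf(done);
  const tLanded = timestampOf(landed);
  if (tDone === null || tLanded === null) return null; // no usable timestamp
  return tLanded - tDone; // positive: the side effect landed after "done"
}

const sample = [
  "2024-05-01T12:00:00Z agent.done task=4471",
  "2024-05-01T12:01:30Z side_effect.landed id=abc",
];
const gap = completionGapMs(sample, "agent.done", "side_effect.landed");
console.log(gap); // 90000 (a 90-second gap)
```

A `null` result is itself a finding: it means either the completion or the landing was never logged with a timestamp.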
What this isn't
This grader is not a replacement for LangSmith, Langfuse, Helicone, or any other observability vendor. It's a five-signal check you can run in 30 seconds. For the full 30-page read of your actual production log archive (where I look for the specific silent-success drift that affects your customers), the $149 AI Ops Checkup is the deeper version of this same checklist.
Common fixes by signal
Missing intent capture → log the user request first
logger.info("agent.intent", {
  task_id: id,
  request_summary: "send email re: invoice #4471",
  source: "user_email_reply"
});
Tool-call outcome says "ok" → log the actual response
const r = await sendEmail(...);
logger.info("tool.send_email", {
  status: r.statusCode,      // 200, 201, 202, 4xx, 5xx
  provider_id: r.messageId,  // SES / Postmark / SendGrid id
  latency_ms: r.elapsed
});
Retry storm → add an assertion line that decides retry
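async function toolWithRetry(fn, max = 2) {
  for (let i = 0; i <= max; i++) {
    const r = await fn();
    if (assertOutcome(r)) return r; // ← the line you're missing
    logger.warn("tool.retry", { attempt: i + 1, why: describeMismatch(r) });
  }
  // Attempts exhausted: fail loudly instead of silently returning undefined.
  throw new Error(`outcome assertion failed after ${max + 1} attempts`);
}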
No outcome-assertion line → write a one-line checker
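function assertOutcome(r) {
  // 5xx: throw, so the failure surfaces immediately.
  if (r.statusCode >= 500) throw new Error("server error");
  // Missing body or an error payload: the outcome did not match.
  if (!r.body || r.body.error) return false;
  return true;
}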
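The checker above returns a boolean but does not write the expected-vs-actual line that signal 4 asks for. A minimal sketch of a helper that does both is shown below. The name `assertExpected` and its `log` parameter are hypothetical, and `console` stands in for whatever logger you already use.

```javascript
// Hypothetical helper: compare expected vs. actual and emit one log line
// either way, so the comparison itself shows up in the log.
function assertExpected(label, expected, actual, log = console) {
  const ok = Object.is(expected, actual);
  const line =
    `${label}: expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`;
  if (ok) {
    log.info(`assert.pass ${line}`);
  } else {
    log.warn(`assert.fail ${line}`);
  }
  return ok;
}

// Usage: after a DB write that should have produced exactly one row.
const rowsOk = assertExpected("db.rows_written", 1, 0);
// logs: assert.fail db.rows_written: expected 1, got 0
```

`Object.is` only handles primitives; comparing objects or arrays would need a deep-equality check instead.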
Side-effect vs. completion timestamp drift → emit a "landed_at" event
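const r = await sendEmail(...);
const reportedAt = Date.now(); // agent reports "done" to user here
// ... later, once the side effect is confirmed to have landed ...
const landedAt = Date.now();
logger.info("side_effect.landed", { id: r.messageId, reported_at: reportedAt, landed_at: landedAt, gap_ms: landedAt - reportedAt });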
Frequently asked
Is my log data sent anywhere?
No. The grade and all five checks run locally in your browser. We never see your log text. We only see an anonymous pageview beacon plus an event when you choose to email yourself the report.
What if I get a D or F?
That means your agent is shipping without one or more of the five log signal classes. Either implement the missing signals yourself (see "Common fixes by signal" above) or hire a human to read your full log archive and find the specific drifts. The $149 AI Ops Checkup is the second path.
What format should my logs be in?
Any text format — JSON lines, plain prose, or a mix. The grader uses substring heuristics (words like "retry", "ok", "success", "done", "failed", "error", HTTP status codes, ISO timestamps). It does not need structured fields. Roughly 50 lines is enough for a meaningful grade.
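To make "substring heuristics" concrete, here is a rough sketch of what such a check could look like. It is an illustration only, not the grader's real rules: the patterns in `SIGNALS` and the names `detectSignals`, `stripTimestamp`, and `hasRetryStorm` are all assumptions.

```javascript
// Illustrative patterns only; the real grader's rules are not published here.
const SIGNALS = {
  intent: /\b(intent|task|goal|request)\b/i,
  toolOutcome: /\bstatus(?:Code)?\b\W{0,3}[1-5]\d{2}\b/i, // e.g. status=202
  assertion: /\bexpected\b.*\bgot\b/i,                    // "expected X, got Y"
  timestamp: /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/,       // ISO 8601 prefix
};

// Report, per signal, whether any line matches its pattern.
function detectSignals(lines) {
  const found = {};
  for (const [name, pattern] of Object.entries(SIGNALS)) {
    found[name] = lines.some((line) => pattern.test(line));
  }
  return found;
}

// Drop a leading ISO timestamp so repeated calls compare as equal.
function stripTimestamp(line) {
  return line.replace(/^\d{4}-\d{2}-\d{2}T\S+\s+/, "");
}

// Retry-storm shape: the same line body more than `limit` times in a row.
// Any different line in between (such as an assertion) resets the run.
function hasRetryStorm(lines, limit = 2) {
  let run = 1;
  for (let i = 1; i < lines.length; i++) {
    const same = stripTimestamp(lines[i]) === stripTimestamp(lines[i - 1]);
    run = same ? run + 1 : 1;
    if (run > limit) return true;
  }
  return false;
}

const sampleLog = [
  "2024-05-01T12:00:00Z agent.intent task=send invoice email",
  "2024-05-01T12:00:01Z tool.send_email status=202",
  "2024-05-01T12:00:02Z tool.send_email status=202",
  "2024-05-01T12:00:03Z tool.send_email status=202",
];
console.log(detectSignals(sampleLog));
console.log(hasRetryStorm(sampleLog)); // true: three identical calls in a row
```

Heuristics like these produce false positives and negatives by design, which is why roughly 50 lines gives a more meaningful read than five.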