Agent Failure Forensics is a sprint deliverable for teams that already have an agent, automation stack, eval harness, or AI workflow in production-like use, but cannot reliably answer why it failed, whether it is safe to retry, or which fixes actually reduced recurrence. The finished engagement produces a compact evidence packet: a failure timeline, a root-cause map, a reproducible test case, a risk-ranked remediation list, and an operator-facing decision guide. It is not a generic AI audit. It is a reconstruction of a concrete incident or failure pattern from logs, prompts, traces, code paths, queue state, tool outputs, and runbooks.
The artefact demonstrates the difference between an anecdote and a defensible failure explanation. Many agent teams can say "the model ignored instructions," "the tool broke," or "the automation got confused." Those statements are usually too vague to fix anything. A finished forensics engagement breaks the failure into observable transitions: what input entered the system, which policy or planner transformed it, which state was stale or absent, which tool call changed the environment, which guardrail did or did not fire, and which downstream output exposed the defect to a customer or internal operator.
The finished packet also distinguishes primary cause, amplifying condition, and detection gap. Primary cause is the smallest defect that explains the failure. Amplifying condition is what made the blast radius larger, slower to recover, or harder to see. Detection gap is why the system looked healthy while the incident was still active. This separation matters because teams often patch the symptom that was loudest in chat while leaving the causal chain intact.
A buyer receives a clear account of system behavior written in technical but plain-spoken language. The packet includes exact examples, including representative prompts, trace excerpts, redacted payloads, failing assertions, and recommended code changes. The engagement does not require access to every secret-bearing surface. It can operate on sanitized exports, local reproductions, trace snippets, and read-only source views, as long as timestamps, request identifiers, and state transitions are preserved.
The core output is an incident reconstruction table. Each row states a timestamp, event, evidence source, confidence level, and implication. For example: a planner selected a browser tool at 09:14:22; the trace shows no lease check; the browser session belonged to a different task; the tool succeeded from the agent's perspective; the external page changed state; the final response reported success against the wrong account. That sequence gives engineering a fix target. It also gives operations a retry rule: do not retry similar jobs unless browser lease ownership, account identity, and target-page confirmation are all present in the trace.
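The row structure and the retry rule above can be sketched in a few lines. This is a minimal illustration, not the engagement's actual tooling: the class names, the three-level confidence scale, and the trace fields lease_owner_task_id, account_identity_confirmed, and target_page_confirmed are assumptions chosen to mirror the prose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimelineRow:
    """One row of the incident reconstruction table."""
    timestamp: str        # as recorded in the trace, e.g. "09:14:22"
    event: str
    evidence_source: str
    confidence: str       # assumed scale: "high", "medium", or "low"
    implication: str


@dataclass(frozen=True)
class BrowserActionTrace:
    """Hypothetical trace fields that the retry rule depends on."""
    task_id: str
    lease_owner_task_id: Optional[str]  # None when no lease check was recorded
    account_identity_confirmed: bool
    target_page_confirmed: bool


def safe_to_retry(trace: BrowserActionTrace) -> bool:
    """Allow a retry only when lease ownership, account identity, and
    target-page confirmation are all present in the trace."""
    lease_ok = trace.lease_owner_task_id == trace.task_id
    return (
        lease_ok
        and trace.account_identity_confirmed
        and trace.target_page_confirmed
    )


row = TimelineRow(
    timestamp="09:14:22",
    event="planner selected a browser tool",
    evidence_source="agent trace",
    confidence="high",
    implication="no lease check preceded the state-changing action",
)

# A trace with no recorded lease owner must not be retried.
unsafe = BrowserActionTrace("task-a", None, True, True)
safe = BrowserActionTrace("task-a", "task-a", True, True)
```

Keeping the rule as a pure function of trace fields means operations can apply it to an exported trace without access to the live system.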
The packet then converts findings into implementation work. A weak recommendation says "add better logging." A usable recommendation says: record tool_owner_task_id, active_account_label, target_resource_fingerprint, pre_action_snapshot_hash, and post_action_snapshot_hash for every state-changing browser action. A weak recommendation says "improve evals." A usable recommendation says: add a regression fixture where the planner receives two similar accounts, stale task state, and a tool success response from the wrong tab; assert that the agent stops and requests confirmation instead of reporting completion.
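As a rough sketch of the logging recommendation, the five fields can be captured in one structured record per state-changing browser action. The field names come from the text above; the record class, the SHA-256 choice for snapshot hashes, and the JSON log format are assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def snapshot_hash(snapshot: str) -> str:
    """Hash a page or resource snapshot (SHA-256 is an assumed choice)."""
    return hashlib.sha256(snapshot.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class BrowserActionRecord:
    tool_owner_task_id: str
    active_account_label: str
    target_resource_fingerprint: str
    pre_action_snapshot_hash: str
    post_action_snapshot_hash: str

    def to_log_line(self) -> str:
        """Serialize with sorted keys so log lines are stable and diffable."""
        return json.dumps(asdict(self), sort_keys=True)


record = BrowserActionRecord(
    tool_owner_task_id="task-a",                       # hypothetical values
    active_account_label="account-1",
    target_resource_fingerprint="settings-page-v1",
    pre_action_snapshot_hash=snapshot_hash("plan: basic"),
    post_action_snapshot_hash=snapshot_hash("plan: premium"),
)
log_line = record.to_log_line()
```

Differing pre- and post-action hashes show that the action changed state, and the owner and account fields show on whose behalf it did so.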
This artefact therefore demonstrates three things a serious buyer needs: the ability to find the real failure chain, the ability to convert that chain into specific control changes, and the ability to leave behind tests that make recurrence expensive. The work is deliberately practical. If a reader cannot point to the relevant code path, dashboard field, queue invariant, or runbook line after reading the packet, the artefact has failed.
Scenario: a customer-support agent was authorized to draft refunds, subscription credits, and apology emails. It was not authorized to issue live refunds without a final approval token. During a high-volume support window, the agent marked twelve refund cases as completed. Four customers received refund-confirmation emails, but the payment system showed no refund transaction. Two customers then opened chargebacks. The internal dashboard showed the agent as green because all twelve tasks had terminal status completed.
The forensic finding is not simply that the model hallucinated. The model followed a stale success signal. The workflow had three layers: an LLM planner, a payments adapter, and a support-ticket writer. The payments adapter returned queued when a refund request entered a review queue. The planner prompt treated any non-error adapter response as final success. The support-ticket writer then copied the planner's summary into the customer email. The terminal task state reflected the email write, not the payment settlement. The system confused "refund request accepted" with "refund executed."
The incident timeline shows the wrong success boundary. At 14:02:11, the agent called create_refund_request with the correct invoice identifier. At 14:02:12, the adapter returned {"status":"queued","review_required":true,"refund_id":"rfq_1842"}. At 14:02:16, the planner summarized the step as "refund processed." At 14:02:29, the ticket writer sent the confirmation email. At 14:02:31, the task was marked completed. No event in the trace proves that a payment processor created a settled refund.
The recommended state change is direct: replace the single terminal label with separate milestones. The task may enter refund_requested, approval_pending, refund_submitted, refund_settled, and customer_notified. The support agent may only send a refund-confirmation email after refund_settled. Before that point, it may send a different message: the refund request has been received and is under review. The code-level invariant is simple: customer_notified_refund_complete requires payment_event.type == refund.settled.
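A minimal sketch of the milestones and the invariant, assuming payment events arrive as a list of type strings. The milestone names and the refund.settled event type come from the text; the function names, the refund.requested event type, and the exact wording of the settled message are illustrative.

```python
from enum import Enum
from typing import Iterable


class RefundMilestone(Enum):
    REFUND_REQUESTED = "refund_requested"
    APPROVAL_PENDING = "approval_pending"
    REFUND_SUBMITTED = "refund_submitted"
    REFUND_SETTLED = "refund_settled"
    CUSTOMER_NOTIFIED = "customer_notified"


def may_claim_refund_complete(payment_event_types: Iterable[str]) -> bool:
    """Invariant: refund-complete language requires a refund.settled event."""
    return any(event_type == "refund.settled" for event_type in payment_event_types)


def customer_message(payment_event_types: Iterable[str]) -> str:
    """Pick customer-facing wording from observed payment events only."""
    if may_claim_refund_complete(payment_event_types):
        return "Your refund has been settled."
    return "Your refund request has been received and is under review."


# "refund.requested" is a hypothetical event type used only for this example.
pending_text = customer_message(["refund.requested"])
settled_text = customer_message(["refund.requested", "refund.settled"])
```

The message is derived from payment events rather than from the planner's summary, which is the boundary the incident crossed.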
The system prompt contained a common but dangerous shortcut: "If a tool call succeeds, continue with the next step and tell the user what was done." That instruction is too broad for state-changing tools. A tool can succeed technically while returning a business-state result that is incomplete, rejected, queued, pending review, partially applied, or applied to a different resource. The prompt should not ask the planner to infer finality from transport success.
The remediation is to make tool-result interpretation explicit. The payments tool schema should expose a field such as business_outcome with enumerated values: settled, pending_review, declined, requires_human_approval, not_found, and ambiguous. The planner instruction should state: "Only business_outcome == settled authorizes customer-facing refund-complete language. For pending_review or requires_human_approval, report pending status and stop before irreversible communication."
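One way to encode this rule outside the prompt is a small interpreter between the adapter and the planner. The enumerated values are the ones listed above; the function names and the choice to map any missing or unrecognized value to ambiguous are assumptions of this sketch.

```python
from enum import Enum


class BusinessOutcome(Enum):
    SETTLED = "settled"
    PENDING_REVIEW = "pending_review"
    DECLINED = "declined"
    REQUIRES_HUMAN_APPROVAL = "requires_human_approval"
    NOT_FOUND = "not_found"
    AMBIGUOUS = "ambiguous"


def interpret_tool_result(result: dict) -> BusinessOutcome:
    """Map an adapter payload to a business outcome.

    Assumption: a missing or unrecognized business_outcome is treated as
    AMBIGUOUS, never as success.
    """
    raw = result.get("business_outcome")
    if not isinstance(raw, str):
        return BusinessOutcome.AMBIGUOUS
    try:
        return BusinessOutcome(raw)
    except ValueError:
        return BusinessOutcome.AMBIGUOUS


def authorizes_refund_complete_language(outcome: BusinessOutcome) -> bool:
    """Only a settled outcome permits customer-facing completion wording."""
    return outcome is BusinessOutcome.SETTLED


# The adapter payload from the incident carries no business_outcome field,
# so under this sketch it would be classified as ambiguous.
incident_payload = {"status": "queued", "review_required": True, "refund_id": "rfq_1842"}
incident_outcome = interpret_tool_result(incident_payload)
```

Putting the check in code means a later prompt edit cannot silently restore the "any non-error response is success" behavior.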
The dashboard showed twelve green tasks because the queue worker exited without exception and wrote a final response. This is false-green reporting. A dashboard that measures worker completion cannot prove business completion. The correct dashboard needs at least three counters: agent_task_terminal_count, business_transaction_terminal_count, and customer_claim_terminal_count. A mismatch between those counters is not noise; it is the incident.
A sample detection rule: alert if customer_message.contains_refund_complete and no payment_event.refund_settled exists for same invoice_id within 60 seconds. Another rule: alert if task.status == completed and task.required_business_outcome not in observed_business_outcomes. These alerts are cheap compared with chargebacks, manual rework, and reputational damage from telling customers that money moved when it did not.
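Both rules can be written as pure predicates over data the system already has. This sketch makes one interpretive assumption: "within 60 seconds" is read as a settlement event observed no later than 60 seconds after the customer message, with earlier settlements always acceptable. The function and parameter names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Iterable

SETTLEMENT_WINDOW = timedelta(seconds=60)


def premature_refund_claim(
    contains_refund_complete: bool,
    message_sent_at: datetime,
    settled_event_times: Iterable[datetime],
) -> bool:
    """Alert when a refund-complete message has no refund.settled event for
    the same invoice by message_sent_at + 60 seconds (assumed reading)."""
    if not contains_refund_complete:
        return False
    deadline = message_sent_at + SETTLEMENT_WINDOW
    return not any(settled_at <= deadline for settled_at in settled_event_times)


def false_green_task(
    status: str,
    required_business_outcome: str,
    observed_business_outcomes: Iterable[str],
) -> bool:
    """Alert when a task is completed without its required business outcome."""
    return (
        status == "completed"
        and required_business_outcome not in set(observed_business_outcomes)
    )


# Times from the incident timeline; the calendar date is a placeholder.
sent_at = datetime(2024, 1, 1, 14, 2, 29)
alert_no_settlement = premature_refund_claim(True, sent_at, [])
alert_false_green = false_green_task("completed", "settled", ["pending_review"])
```

Applied to the incident, both predicates would have fired: the confirmation email went out with no settlement event, and the task reached completed with only a queued result observed.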
The team had logs, but replay could not reproduce the incident because the original prompt, tool schema version, account fixture, and adapter response were not captured together. The reconstruction required manual correlation across application logs, support-ticket metadata, and payment-adapter traces. The sprint deliverable recommends a replay bundle for every state-changing agent run: planner_prompt_hash, tool_schema_hash, policy_version, redacted_input_payload, tool_result_payloads, external_resource_ids, and final_customer_visible_text.
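A replay bundle along these lines can be a single serializable record written once per state-changing run. The seven field names are taken from the recommendation above; the container class, the SHA-256 hashing, and the example values are assumptions of this sketch.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


def sha256_text(text: str) -> str:
    """Hash prompt or schema text so the bundle can pin exact versions."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ReplayBundle:
    planner_prompt_hash: str
    tool_schema_hash: str
    policy_version: str
    redacted_input_payload: Dict[str, Any]
    tool_result_payloads: List[Dict[str, Any]]
    external_resource_ids: List[str]
    final_customer_visible_text: str

    def to_json(self) -> str:
        """Serialize everything together so one artifact supports replay."""
        return json.dumps(asdict(self), sort_keys=True)


bundle = ReplayBundle(
    planner_prompt_hash=sha256_text("planner prompt text"),    # placeholder text
    tool_schema_hash=sha256_text("payments tool schema"),      # placeholder text
    policy_version="policy-v1",                                # hypothetical label
    redacted_input_payload={"invoice_id": "REDACTED"},
    tool_result_payloads=[
        {"status": "queued", "review_required": True, "refund_id": "rfq_1842"}
    ],
    external_resource_ids=["rfq_1842"],
    final_customer_visible_text="(text sent to the customer)",
)
bundle_json = bundle.to_json()
```

Storing hashes rather than full prompt and schema text keeps the bundle small, provided the hashed artifacts are retained somewhere they can be looked up.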
The sample regression fixture is small enough to add to an existing test suite. Given a refund request, the fake payments adapter returns pending_review. The expected behavior is not an email saying the refund was processed. The expected behavior is a pending-status note plus a task state of approval_pending. The test should fail if the final message includes phrases like "has been refunded," "refund is complete," or "money has been returned."
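The fixture might look like the following. The fake adapter, the handle_refund_request stand-in, and the identifiers inv_test and rfq_test are hypothetical; in a real suite the stand-in would be replaced by the workflow's actual entry point so the test exercises production logic.

```python
FORBIDDEN_PHRASES = (
    "has been refunded",
    "refund is complete",
    "money has been returned",
)


class FakePaymentsAdapter:
    """Test double that always routes the refund into review."""

    def create_refund_request(self, invoice_id: str) -> dict:
        return {"business_outcome": "pending_review", "refund_id": "rfq_test"}


def handle_refund_request(adapter, invoice_id: str) -> dict:
    """Stand-in for the workflow under test (hypothetical entry point)."""
    result = adapter.create_refund_request(invoice_id)
    if result.get("business_outcome") == "settled":
        return {
            "task_state": "customer_notified",
            "message": "Your refund is complete.",
        }
    return {
        "task_state": "approval_pending",
        "message": "Your refund request has been received and is under review.",
    }


def test_pending_review_does_not_claim_completion() -> None:
    outcome = handle_refund_request(FakePaymentsAdapter(), "inv_test")
    assert outcome["task_state"] == "approval_pending"
    message = outcome["message"].lower()
    for phrase in FORBIDDEN_PHRASES:
        assert phrase not in message, f"premature completion language: {phrase}"
```

The test is written so that a plain `pytest` run would collect it, and it can also be called directly as an ordinary function.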
Where a tool result cannot be mapped to a known outcome, the agent should report it as ambiguous rather than inventing the most convenient interpretation. The expected post-fix behavior is boring, which is the point. The agent can still move quickly through low-risk drafting and classification work. It slows down only at the boundary where customer-facing claims, financial state, or account state would be misrepresented. The sprint does not make the agent timid. It makes success conditions real.
The ROI comes from removing repeated ambiguity. Teams lose time when every agent incident becomes a fresh mystery: engineers grep logs, support reconstructs customer impact, operations asks whether the dashboard can be trusted, and leadership receives a vague explanation that cannot be tested. A focused forensics sprint turns one incident into a reusable control improvement.
For a mid-sized SaaS team running an internal support or operations agent, a single confusing failure can easily consume 25 to 60 staff-hours. A typical breakdown at the low end of that range is 8 hours of engineering log review, 6 hours of support-ticket cleanup, 4 hours of operations coordination, 3 hours of customer-success messaging, and another 4 to 12 hours of meetings, status writing, and speculative prompt edits, for 25 to 33 hours in total. If the incident touches money, account access, compliance records, or customer-visible promises, the number climbs quickly.
A completed Agent Failure Forensics sprint can plausibly save 30 to 80 hours on the next similar incident because the trace requirements, replay bundle, dashboard invariant, and regression fixture already exist; the upper figure applies to incidents that touch money or customer-visible promises and so run past the typical range. At a blended internal cost of 100 to 175 dollars per hour, that is 3,000 to 14,000 dollars of labor avoidance per recurrence. If the same failure pattern would otherwise recur three times in a quarter, the avoided labor alone can reach 9,000 to 42,000 dollars.
The larger value is protected revenue and reduced customer damage. In the refund-confirmation scenario, assume four bad confirmations create two chargebacks, three retention escalations, and one lost annual account. If the account is worth 8,000 dollars annually, chargeback fees and support costs are minor compared with the preventable churn. A single control that prevents premature completion language can protect more revenue than the sprint cost, even before counting engineering time.
The sprint also reduces the cost of shipping agent improvements. Without forensics, teams often respond to failures by freezing autonomy broadly. That protects against embarrassment but destroys the productivity case for the agent. With precise failure boundaries, the team can keep safe lanes open. Classification, summarization, draft preparation, internal research, and low-risk data entry can continue while high-risk finalization gets stricter checks. That preserves throughput instead of forcing an all-or-nothing rollback.
A practical ROI model has three columns. First: incident cost avoided, measured in engineering hours, support hours, refunds, credits, chargebacks, and churn. Second: debug cycle compression, measured by how quickly the team can reconstruct a run from evidence rather than interviews. Third: safe autonomy preserved, measured by the percentage of tasks that remain automated after guardrails are corrected. The most important number is not how many prompts were rewritten. It is how many future failures become automatically detectable or impossible under test.
For a buyer processing 2,000 support tickets per month with an agent touching 30 percent of them, even a 1 percent serious ambiguity rate means six risky cases per month. If each risky case consumes 3 hours across support and engineering, that is 18 hours per month. If one in six creates a customer-credit or retention issue averaging 1,500 dollars, the monthly expected loss is roughly 1,500 dollars plus labor. A sprint that cuts the serious ambiguity rate from 1 percent to 0.25 percent saves about 13.5 labor hours per month and avoids roughly three-quarters of the expected customer-impact loss. Over a year, that is a conservative five-figure gain for one workflow.
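The arithmetic in this example can be checked with a short script. All inputs are the figures stated above; exact fractions are used only to avoid floating-point rounding, and the variable names are illustrative.

```python
from fractions import Fraction

TICKETS_PER_MONTH = 2000
AGENT_SHARE = Fraction(30, 100)        # agent touches 30 percent of tickets
HOURS_PER_RISKY_CASE = 3
IMPACT_PROBABILITY = Fraction(1, 6)    # one in six risky cases costs money
IMPACT_COST_DOLLARS = 1500


def monthly_exposure(ambiguity_rate: Fraction) -> tuple:
    """Return (risky cases, labor hours, expected loss in dollars) per month."""
    risky_cases = TICKETS_PER_MONTH * AGENT_SHARE * ambiguity_rate
    labor_hours = risky_cases * HOURS_PER_RISKY_CASE
    expected_loss = risky_cases * IMPACT_PROBABILITY * IMPACT_COST_DOLLARS
    return risky_cases, labor_hours, expected_loss


before = monthly_exposure(Fraction(1, 100))   # 1 percent serious ambiguity rate
after = monthly_exposure(Fraction(1, 400))    # 0.25 percent after the sprint

hours_saved = before[1] - after[1]
loss_avoided = before[2] - after[2]

print(float(before[0]), float(before[1]), float(before[2]))  # 6.0 18.0 1500.0
print(float(hours_saved), float(loss_avoided))               # 13.5 1125.0
print(float(hours_saved * 12), float(loss_avoided * 12))     # 162.0 13500.0
```

On these inputs the annual figure is 162 labor hours plus 13,500 dollars of avoided customer-impact loss, which is where the five-figure estimate comes from.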
The deliverable is intentionally scoped so it can be bought and used without a long transformation program. The buyer does not need a new platform, a governance committee, or a six-month agent strategy. The buyer needs one high-quality reconstruction, one set of specific patches, and one regression harness that proves the old failure no longer passes. That is the unit of value: a known failure class converted into a monitored, tested, and mostly eliminated failure class.
The final ROI is managerial clarity. After the sprint, the team can say which failures are model reasoning errors, which are tool-contract errors, which are stale-state errors, which are dashboard false-greens, and which are policy-boundary errors. That vocabulary prevents waste. It stops teams from using prompt edits to solve missing state, using dashboards to hide incomplete business events, or using human review as a blanket substitute for better invariants. The result is not a prettier agent demo. It is a system that fails less mysteriously, recovers faster, and earns more permission to operate.