Failure Intelligence
When an agent run fails, OpenJCK automatically analyzes the trace to identify the root cause. No configuration required — it fires on every failed run.
How It Works
Section titled “How It Works”The intelligence engine analyzes four aspects of every failure:
1. Root Cause Step
Section titled “1. Root Cause Step”Identifies the earliest step whose output fed into the failure chain. This is the step that, if fixed, would prevent the failure.
Example:
- Step 1: LLM returns “Use file data/report_2026.csv”
- Step 2: Tool tries to read that file
- Step 3: FileNotFoundError
Root cause: Step 1 (the LLM generated a path to a non-existent file)
2. Root Cause Reason
Section titled “2. Root Cause Reason”Explains why the root cause step led to the failure:
Output from step 1 (call_llm) contained "data/report_2026.csv"which was referenced by the failing step 3 (read_file).Error: FileNotFoundError: data/report_2026.csv not found3. Recovery Point
Section titled “3. Recovery Point”The last step with valid output before the failure chain began. This is where you could resume the run if implementing retry logic.
Example:
- Step 1: ✓ Valid output
- Step 2: ✓ Valid output
- Step 3: ✗ Failed
Recovery point: Step 2
4. Dependency Chain
Section titled “4. Dependency Chain”Lists all values passed between steps that appear in the failure context:
Step 1 → Step 3: "data/report_2026.csv"Step 1 → Step 3: "2026"This shows exactly which data flowed from the root cause to the failure.
Viewing Intelligence
Section titled “Viewing Intelligence”In the Dashboard
Section titled “In the Dashboard”Failed runs show an inline preview:
🔍 Root cause: Step 3 — File not found[View Step 3 →]Click “View Step N” to open the trace detail with the root cause step highlighted.
In Trace Detail
Section titled “In Trace Detail”Open any failed trace to see the full intelligence panel:
- Root Cause: Which step and why
- Recovery Point: Where to resume
- Dependency Chain: Full list of connected values
The root cause step is highlighted with an amber border and “Root Cause” badge in the step detail panel.
Pattern Detection
Section titled “Pattern Detection”The dashboard tracks recurring failures across multiple runs of the same agent:
- Most common failure: “Step X fails in N/M runs (P%)”
- Most expensive step: Which step costs the most on average
These patterns appear in the agent drill-down view.
How It Detects Dependencies
Section titled “How It Detects Dependencies”The engine extracts values from each step’s output that could be referenced later:
- File paths (e.g.,
data/report.csv) - URLs (e.g.,
https://api.example.com) - UUIDs
- Constant names (e.g.,
MAX_RETRIES)
It then checks if these values appear in the failing step’s input or error message.
Limitations
Section titled “Limitations”- Intelligence is computed after the run completes (not in real-time)
- Dependency detection works best with explicit file paths and IDs
- Complex data transformations between steps may not be captured
- Pattern detection requires at least 2 failed runs of the same agent
Disabling Intelligence
Section titled “Disabling Intelligence”Intelligence computation is automatic and lightweight. To skip it for a specific run:
from openjck import trace
@trace(name="my_agent", compute_intelligence=False)def run(): passBest Practices
Section titled “Best Practices”- Name your steps clearly — “fetch_user_data” is better than “step_3”
- Return explicit values — return file paths as strings, not objects
- Check the intelligence panel first when debugging failures
- Look for patterns — recurring failures indicate systemic issues
Example: Debugging a Failed Run
Section titled “Example: Debugging a Failed Run”from openjck import trace, trace_tool
@trace(name="research_agent")def run(): query = generate_query("AI safety") results = search_web(query) # fails here return summarize(results)
@trace_tooldef generate_query(topic): return f"latest {topic} research 2026"
@trace_tooldef search_web(query): # This fails because the query has no results return api.search(query)Intelligence output:
Root cause: Step 1 (generate_query)Reason: Output "latest AI safety research 2026" contained "2026"which was used in the failing search queryRecovery point: Step 1Fix: Update generate_query to use current year, not hardcoded 2026.
Technical Details
Section titled “Technical Details”The intelligence engine is pure Python with zero dependencies. It:
- Parses trace JSON files
- Uses regex to extract referenceable values
- Performs string matching to find dependencies
- Computes patterns across historical runs
Computation takes <100ms per trace and runs in a background thread.