Hook Chains (Self-Healing Agent Mesh MVP)
Hook Chains provide an event-driven recovery layer for important workflow failures. When a matching hook event occurs, OpenClaude evaluates declarative rules and can dispatch remediation actions such as:
- `spawn_fallback_agent`
- `notify_team`
- `warm_remote_capacity`
Disabled-By-Default Rollout
Rollout recommendation: keep Hook Chains disabled until you validate rules in your environment.
- Set top-level config to `"enabled": false` initially.
- Enable per environment when ready.
- Dispatch is gated by `feature('HOOK_CHAINS')`.
- Env gate defaults to off unless `CLAUDE_CODE_ENABLE_HOOK_CHAINS=1` is set.
This keeps existing workflows unchanged while you tune guard windows and action behavior.
Feature Overview
Hook Chains are loaded from a config file at a deterministic path and evaluated whenever a hook event is dispatched.
MVP runtime trigger wiring:
- `PostToolUseFailure` hooks dispatch Hook Chains with outcome `failed`.
- `TaskCompleted` hooks dispatch Hook Chains with outcome:
  - `success` when completion hooks did not block.
  - `failed` when completion hooks returned blocking errors or prevented continuation.
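The `TaskCompleted` outcome mapping above can be sketched as follows. This is a minimal illustration, not the runtime's code: `CompletionHookResult` and `taskCompletedOutcome` are hypothetical names, and the real hook result shape may differ.

```typescript
// Hypothetical summary of what the TaskCompleted hooks reported.
type CompletionHookResult = {
  blockingErrors: string[];
  preventedContinuation: boolean;
};

type DispatchOutcome = "success" | "failed";

// Maps the completion-hook result to the outcome handed to Hook Chains:
// any blocking error or a prevented continuation counts as "failed".
function taskCompletedOutcome(result: CompletionHookResult): DispatchOutcome {
  const blocked =
    result.blockingErrors.length > 0 || result.preventedContinuation;
  return blocked ? "failed" : "success";
}
```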
Default config path:
.openclaude/hook-chains.json
Override path:
CLAUDE_CODE_HOOK_CHAINS_CONFIG_PATH=/abs/or/relative/path/to/hook-chains.json
Global gate:
- `feature('HOOK_CHAINS')` must be enabled in the build
- `CLAUDE_CODE_ENABLE_HOOK_CHAINS=0|1` (defaults to disabled when unset)
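The path and gate settings above could be combined roughly as in the sketch below. The function names are hypothetical, `buildFlagEnabled` stands in for the result of `feature('HOOK_CHAINS')`, and treating an empty override as unset is an assumption rather than documented behavior.

```typescript
// Sketch only: the real loader in src/utils/hookChains.ts may differ.
type Env = Record<string, string | undefined>;

const DEFAULT_CONFIG_PATH = ".openclaude/hook-chains.json";

// Picks the override path when set, otherwise the documented default.
// Assumption: an empty or whitespace-only override counts as unset.
function resolveConfigPath(env: Env): string {
  const override = (env["CLAUDE_CODE_HOOK_CHAINS_CONFIG_PATH"] ?? "").trim();
  return override.length > 0 ? override : DEFAULT_CONFIG_PATH;
}

// Dispatch runs only when the build flag, the env gate, and the
// config file's top-level "enabled" switch are all on.
function hookChainsActive(
  buildFlagEnabled: boolean,
  env: Env,
  configEnabled: boolean,
): boolean {
  const envEnabled = env["CLAUDE_CODE_ENABLE_HOOK_CHAINS"] === "1";
  return buildFlagEnabled && envEnabled && configEnabled;
}
```

Passing the environment in as a plain object keeps the sketch testable without touching the real process environment.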
Safety Guarantees
The runtime is intentionally conservative:
- Depth guard: chain dispatch is blocked when `chainDepth >= maxChainDepth`.
- Rule cooldown: each rule can only re-fire after cooldown expires.
- Dedup window: identical event/action combinations are suppressed for a window.
- Abort-safe behavior: if the current signal is aborted, actions skip safely.
- Policy-aware remote warm: `warm_remote_capacity` skips when remote sessions are policy denied.
- Bridge inactive no-op: `warm_remote_capacity` safely skips when no active bridge handle exists.
- Missing team context safety: `notify_team` skips with structured reason if no team context/team file is available.
- Fallback launcher safety: `spawn_fallback_agent` fails with a structured reason when launch permissions/context are unavailable.
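The first three guards in the list above (depth, cooldown, dedup) can be illustrated with a small in-memory sketch. The class and method names are hypothetical, the clock is injected so the behavior is easy to test, and the real runtime's bookkeeping is likely more involved.

```typescript
type GuardDecision = { allowed: true } | { allowed: false; reason: string };

// Illustrative guard state; not the runtime's actual implementation.
class ChainGuards {
  private readonly lastRuleFire = new Map<string, number>();
  private readonly lastActionFire = new Map<string, number>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Depth guard: blocked when chainDepth >= maxChainDepth.
  checkDepth(chainDepth: number, maxChainDepth: number): GuardDecision {
    return chainDepth >= maxChainDepth
      ? { allowed: false, reason: "max chain depth reached" }
      : { allowed: true };
  }

  // Rule cooldown: a rule may re-fire only after cooldownMs has elapsed.
  checkCooldown(ruleId: string, cooldownMs: number): GuardDecision {
    return this.checkWindow(this.lastRuleFire, ruleId, cooldownMs, "rule cooldown active");
  }

  // Dedup window: the same event/action key is suppressed inside the window.
  checkDedup(key: string, windowMs: number): GuardDecision {
    return this.checkWindow(this.lastActionFire, key, windowMs, "dedup window active");
  }

  // Records the firing time only when the call is allowed.
  private checkWindow(
    seen: Map<string, number>,
    key: string,
    windowMs: number,
    reason: string,
  ): GuardDecision {
    const current = this.now();
    const last = seen.get(key);
    if (last !== undefined && current - last < windowMs) {
      return { allowed: false, reason };
    }
    seen.set(key, current);
    return { allowed: true };
  }
}
```

How the dedup key is built from the event and action is left to the caller here; the runtime's actual key format is not described in this document.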
Configuration Schema Reference
Top-level object:
{
"version": 1,
"enabled": true,
"maxChainDepth": 2,
"defaultCooldownMs": 30000,
"defaultDedupWindowMs": 30000,
"rules": []
}
Top-Level Fields
| Field | Type | Required | Notes |
|---|---|---|---|
| `version` | `1` | No | Defaults to `1`. |
| `enabled` | `boolean` | No | Global feature switch for this config file. |
| `maxChainDepth` | `integer` | No | Global depth guard (default `2`, max `10`). |
| `defaultCooldownMs` | `integer` | No | Default rule cooldown in ms (default `30000`). |
| `defaultDedupWindowMs` | `integer` | No | Default action dedup window in ms (default `30000`). |
| `rules` | `HookChainRule[]` | No | Defaults to `[]`. May be omitted or empty; when no rules are present, dispatch is a no-op and returns `enabled: false`. |
Note: An empty ruleset is valid and can be used to keep Hook Chains configured but effectively disabled until rules are added.
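The defaults in the table can be expressed as a small normalization step. This is a sketch under stated assumptions: `normalizeConfig` is a hypothetical name, the table does not say what `enabled` defaults to (the sketch conservatively assumes `false`), and capping an out-of-range `maxChainDepth` at 10 instead of rejecting the file is also an assumption.

```typescript
type HookChainsConfig = {
  version: 1;
  enabled: boolean;
  maxChainDepth: number;
  defaultCooldownMs: number;
  defaultDedupWindowMs: number;
  rules: unknown[]; // HookChainRule[] in the real schema
};

const MAX_CHAIN_DEPTH_LIMIT = 10;

// Fills in the documented defaults for omitted top-level fields.
function normalizeConfig(raw: Partial<HookChainsConfig>): HookChainsConfig {
  return {
    version: 1,
    // Assumption: a missing "enabled" is treated as disabled.
    enabled: raw.enabled ?? false,
    // Assumption: values above the documented max are capped, not rejected.
    maxChainDepth: Math.min(raw.maxChainDepth ?? 2, MAX_CHAIN_DEPTH_LIMIT),
    defaultCooldownMs: raw.defaultCooldownMs ?? 30000,
    defaultDedupWindowMs: raw.defaultDedupWindowMs ?? 30000,
    rules: raw.rules ?? [],
  };
}
```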
Rule Object (HookChainRule)
{
"id": "task-failure-recovery",
"enabled": true,
"trigger": {
"event": "TaskCompleted",
"outcome": "failed"
},
"condition": {
"toolNames": ["Edit"],
"taskStatuses": ["failed"],
"errorIncludes": ["timeout", "permission denied"],
"eventFieldEquals": {
"meta.source": "scheduler"
}
},
"cooldownMs": 60000,
"dedupWindowMs": 30000,
"maxDepth": 2,
"actions": []
}
| Field | Type | Required | Notes |
|---|---|---|---|
| `id` | `string` | Yes | Stable identifier used in telemetry/guards. |
| `enabled` | `boolean` | No | Per-rule switch. |
| `trigger.event` | `HookEvent` | Yes | Event name to match. |
| `trigger.outcome` | `"success" \| "failed" \| "timeout" \| …` | | |
| `trigger.outcomes` | `Outcome[]` | No | Multi-outcome matcher. Use either `outcome` or `outcomes`. |
| `condition` | `object` | No | Optional extra matching constraints. |
| `cooldownMs` | `integer` | No | Overrides global cooldown for this rule. |
| `dedupWindowMs` | `integer` | No | Overrides global dedup for this rule. |
| `maxDepth` | `integer` | No | Per-rule depth cap. |
| `actions` | `HookChainAction[]` | Yes | One or more actions to execute in order. |
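As a rough illustration of the trigger fields and per-rule overrides, the sketch below matches an event against `trigger.outcome` or `trigger.outcomes` and resolves a rule-level window against the global default. The helper names are hypothetical, outcomes are typed as plain strings, and matching any outcome when neither field is set is an assumption.

```typescript
type Trigger = {
  event: string;
  outcome?: string;
  outcomes?: string[];
};

// True when the dispatched event and outcome satisfy the rule's trigger.
function triggerMatches(trigger: Trigger, event: string, outcome: string): boolean {
  if (trigger.event !== event) return false;
  if (trigger.outcome !== undefined) return trigger.outcome === outcome;
  if (trigger.outcomes !== undefined) return trigger.outcomes.includes(outcome);
  // Assumption: a trigger without an outcome constraint matches any outcome.
  return true;
}

// Per-rule cooldownMs / dedupWindowMs override the global defaults.
function effectiveWindowMs(ruleValue: number | undefined, globalDefault: number): number {
  return ruleValue ?? globalDefault;
}
```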
Condition Fields
| Field | Type | Notes |
|---|---|---|
| `toolNames` | `string[]` | Matches `tool_name` / `toolName` in event payload. |
| `taskStatuses` | `string[]` | Matches `task_status` / `taskStatus` / `status`. |
| `errorIncludes` | `string[]` | Case-insensitive substring match against `error` / `reason` / `message`. |
| `eventFieldEquals` | `Record<string, string\|number\|boolean>` | Dot-path equality against payload (example: `"meta.source": "scheduler"`). |
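The condition fields can be read as a set of filters over the event payload. The sketch below is one plausible interpretation, not the evaluator itself: it assumes all present fields must match (logical AND), that any listed value satisfies a field, that empty arrays impose no constraint, and that the first string found among the candidate payload keys is the one compared.

```typescript
type Primitive = string | number | boolean;
type Payload = Record<string, unknown>;
type Condition = {
  toolNames?: string[];
  taskStatuses?: string[];
  errorIncludes?: string[];
  eventFieldEquals?: Record<string, Primitive>;
};

// Resolves a dot-path such as "meta.source" against the payload.
function getPath(payload: Payload, path: string): unknown {
  let current: unknown = payload;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Returns the first string value found under any of the given keys.
function firstString(payload: Payload, keys: string[]): string | undefined {
  for (const key of keys) {
    const value = payload[key];
    if (typeof value === "string") return value;
  }
  return undefined;
}

function conditionMatches(condition: Condition, payload: Payload): boolean {
  const tools = condition.toolNames ?? [];
  if (tools.length > 0) {
    const tool = firstString(payload, ["tool_name", "toolName"]);
    if (tool === undefined || !tools.includes(tool)) return false;
  }
  const statuses = condition.taskStatuses ?? [];
  if (statuses.length > 0) {
    const status = firstString(payload, ["task_status", "taskStatus", "status"]);
    if (status === undefined || !statuses.includes(status)) return false;
  }
  const needles = condition.errorIncludes ?? [];
  if (needles.length > 0) {
    const text = (firstString(payload, ["error", "reason", "message"]) ?? "").toLowerCase();
    if (!needles.some((needle) => text.includes(needle.toLowerCase()))) return false;
  }
  for (const [path, expected] of Object.entries(condition.eventFieldEquals ?? {})) {
    if (getPath(payload, path) !== expected) return false;
  }
  return true;
}
```

Under these assumptions, a payload with `tool_name: "Edit"`, `status: "failed"`, `error: "Request Timeout"`, and `meta.source: "scheduler"` satisfies the example `condition` shown in the rule object above.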
Actions
spawn_fallback_agent
{
"type": "spawn_fallback_agent",
"id": "fallback-1",
"enabled": true,
"dedupWindowMs": 30000,
"description": "Fallback recovery for failed task",
"promptTemplate": "Recover task ${TASK_SUBJECT}. Event=${EVENT_NAME}, outcome=${OUTCOME}, error=${ERROR}. Payload=${PAYLOAD_JSON}",
"agentType": "general-purpose",
"model": "sonnet"
}
notify_team
{
"type": "notify_team",
"id": "notify-ops",
"enabled": true,
"dedupWindowMs": 30000,
"teamName": "mesh-team",
"recipients": ["*"],
"summary": "Hook chain ${RULE_ID} fired",
"messageTemplate": "Event=${EVENT_NAME} outcome=${OUTCOME}\nTask=${TASK_ID}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
}
warm_remote_capacity
{
"type": "warm_remote_capacity",
"id": "warm-bridge",
"enabled": true,
"dedupWindowMs": 60000,
"createDefaultEnvironmentIfMissing": false
}
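Because a rule's actions execute in order and each action carries its own `enabled` switch, dispatch can be pictured as a simple loop. The sketch below is illustrative only: the handler map, the result shape, and the synchronous signature are assumptions, and only the fields needed for sequencing are modeled.

```typescript
type ActionType = "spawn_fallback_agent" | "notify_team" | "warm_remote_capacity";

// Only the fields needed for sequencing; real actions carry more.
type Action = { type: ActionType; id: string; enabled?: boolean };

type ActionResult = {
  id: string;
  status: "executed" | "skipped" | "failed";
  reason?: string;
};

type Handler = (action: Action) => ActionResult;

// Runs actions in declaration order, skipping disabled ones.
function runActions(
  actions: Action[],
  handlers: Record<ActionType, Handler>,
): ActionResult[] {
  const results: ActionResult[] = [];
  for (const action of actions) {
    if (action.enabled === false) {
      results.push({ id: action.id, status: "skipped", reason: "action disabled" });
      continue;
    }
    results.push(handlers[action.type](action));
  }
  return results;
}
```

Returning one structured result per action mirrors the structured skip and failure reasons described under Safety Guarantees.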
Complete Example Configs
1) Retry via Fallback Agent
{
"version": 1,
"enabled": true,
"maxChainDepth": 2,
"defaultCooldownMs": 30000,
"defaultDedupWindowMs": 30000,
"rules": [
{
"id": "retry-task-via-fallback",
"trigger": {
"event": "TaskCompleted",
"outcome": "failed"
},
"cooldownMs": 60000,
"actions": [
{
"type": "spawn_fallback_agent",
"id": "spawn-retry-agent",
"description": "Retry failed task with fallback agent",
"promptTemplate": "A task failed. Recover it safely.\nTask=${TASK_SUBJECT}\nDescription=${TASK_DESCRIPTION}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}",
"agentType": "general-purpose",
"model": "sonnet"
}
]
}
]
}
2) Notify Only
{
"version": 1,
"enabled": true,
"maxChainDepth": 2,
"defaultCooldownMs": 30000,
"defaultDedupWindowMs": 30000,
"rules": [
{
"id": "notify-on-tool-failure",
"trigger": {
"event": "PostToolUseFailure",
"outcome": "failed"
},
"condition": {
"toolNames": ["Edit", "Write", "Bash"]
},
"actions": [
{
"type": "notify_team",
"id": "notify-team-failure",
"recipients": ["*"],
"summary": "Tool failure detected",
"messageTemplate": "Tool failure detected.\nEvent=${EVENT_NAME} outcome=${OUTCOME}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
}
]
}
]
}
3) Combined Fallback + Notify + Bridge Warm
{
"version": 1,
"enabled": true,
"maxChainDepth": 2,
"defaultCooldownMs": 45000,
"defaultDedupWindowMs": 30000,
"rules": [
{
"id": "full-recovery-chain",
"trigger": {
"event": "TaskCompleted",
"outcomes": ["failed", "timeout"]
},
"condition": {
"errorIncludes": ["timeout", "capacity", "connection"]
},
"cooldownMs": 90000,
"actions": [
{
"type": "spawn_fallback_agent",
"id": "fallback-agent",
"description": "Recover failed task execution",
"promptTemplate": "Recover failed task and produce a concise fix summary.\nTask=${TASK_SUBJECT}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
},
{
"type": "notify_team",
"id": "notify-team",
"recipients": ["*"],
"summary": "Recovery chain triggered",
"messageTemplate": "Recovery chain ${RULE_ID} fired.\nOutcome=${OUTCOME}\nTask=${TASK_SUBJECT}\nError=${ERROR}"
},
{
"type": "warm_remote_capacity",
"id": "warm-capacity",
"createDefaultEnvironmentIfMissing": false
}
]
}
]
}
Template Variables
The following placeholders are supported by promptTemplate, summary, and messageTemplate:
- `${EVENT_NAME}`
- `${OUTCOME}`
- `${RULE_ID}`
- `${TASK_SUBJECT}`
- `${TASK_DESCRIPTION}`
- `${TASK_ID}`
- `${ERROR}`
- `${PAYLOAD_JSON}`
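Placeholder substitution of this kind can be sketched with a single regular-expression replace. `renderTemplate` is a hypothetical helper, and leaving unknown placeholders untouched is an assumption; the runtime may instead substitute an empty string.

```typescript
type TemplateVars = Record<string, string | undefined>;

// Replaces ${NAME} placeholders with values from vars.
// Assumption: placeholders without a value are left as-is.
function renderTemplate(template: string, vars: TemplateVars): string {
  return template.replace(/\$\{([A-Z_]+)\}/g, (match: string, name: string) => {
    const value = vars[name];
    return value !== undefined ? value : match;
  });
}

// Example call using a summary string from the notify_team action.
const rendered = renderTemplate("Hook chain ${RULE_ID} fired", {
  RULE_ID: "full-recovery-chain",
});
```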
Troubleshooting
Rule never triggers
- Verify `trigger.event` and `trigger.outcome`/`trigger.outcomes` exactly match dispatched event data.
- Check `condition` filters (especially `toolNames` and `eventFieldEquals` dot-path keys).
- Confirm the config file is valid JSON and schema-valid.
Actions show as skipped
Common skip reasons:
- `action disabled`
- `rule cooldown active ...`
- `dedup window active ...`
- `max chain depth reached ...`
- `No team context is available ...`
- `Team file not found ...`
- `Remote sessions are blocked by policy`
- `Bridge is not active; warm_remote_capacity is a safe no-op`
- `No fallback agent launcher is registered in runtime context`
Config changes not reflected
- Loader uses memoization by file mtime/size.
- Ensure your editor writes the file fully and updates mtime.
- If needed, force reload from the caller side with `forceReloadConfig: true`.
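To see why an unchanged mtime or size hides an edit, it helps to picture the memoization. The sketch below is a simplified model with injected `statFile` and `readAndParse` callbacks (both hypothetical); the real loader reads from disk and may apply further cache limits.

```typescript
type FileStat = { mtimeMs: number; size: number };

// Caches a parsed config until the file's mtime or size changes.
class MemoizedLoader<T> {
  private cache: { stat: FileStat; value: T } | undefined;
  private readonly statFile: () => FileStat;
  private readonly readAndParse: () => T;

  constructor(statFile: () => FileStat, readAndParse: () => T) {
    this.statFile = statFile;
    this.readAndParse = readAndParse;
  }

  load(options: { forceReloadConfig?: boolean } = {}): T {
    const stat = this.statFile();
    const cached = this.cache;
    if (
      !options.forceReloadConfig &&
      cached !== undefined &&
      cached.stat.mtimeMs === stat.mtimeMs &&
      cached.stat.size === stat.size
    ) {
      return cached.value; // file looks unchanged, reuse parsed config
    }
    const value = this.readAndParse();
    this.cache = { stat, value };
    return value;
  }
}
```

In this model, an edit that preserves both mtime and size is invisible to the cache, which is the case `forceReloadConfig: true` is meant to cover.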
Existing workflows changed unexpectedly
- Set `"enabled": false` at top-level.
- Or globally disable with `CLAUDE_CODE_ENABLE_HOOK_CHAINS=0`.
- Re-enable gradually after validating one rule at a time.