orcs-code/docs/hook-chains.md
Urvish L. 44a2c30d5f feat: implement Hook Chains runtime integration for self-healing agent mesh MVP (#711)
* feat: implement Hook Chains runtime integration for self-healing agent mesh MVP

- Add Hook Chains config loader, evaluator, and dispatcher in src/utils/hookChains.ts
- Wire PostToolUseFailure hook dispatch in executePostToolUseFailureHooks()
- Wire TaskCompleted hook dispatch in executeTaskCompletedHooks()
- Integrate fallback-agent launcher with permission preservation (canUseTool threading)
- Add safety hardening for config-read errors (try-catch protection)
- Update docs with MVP runtime trigger explanation
- Add 10 unit tests and 4 integration tests covering config, rules, guards, and actions

This completes the self-healing agent mesh MVP by enabling declarative rule-based
responses to tool failures and task completions, with fallback agent spawning,
team notification, and capacity warming actions.

* Update docs/hook-chains.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/utils/hookChains.ts

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: address PR #711 review blockers for Hook Chains

- Gate hook-chain dispatch behind feature('HOOK_CHAINS') and default env gate to off
- Remove committed local artifact (agent.log) and ignore it in .gitignore
- Revert hook dispatcher signature threading changes for canUseTool
- Use ToolUseContext metadata hookChainsCanUseTool for fallback launch permissions
- Make spawn_fallback_agent fail explicitly when launcher context is unavailable
- Add config cache max age and guard map size limits to bound runtime memory
- Update docs and tests for default-off gating and explicit fallback failure

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-04-22 19:40:23 +08:00

Hook Chains (Self-Healing Agent Mesh MVP)

Hook Chains provide an event-driven recovery layer for important workflow failures. When a matching hook event occurs, OpenClaude evaluates declarative rules and can dispatch remediation actions such as:

  • spawn_fallback_agent
  • notify_team
  • warm_remote_capacity

Disabled-By-Default Rollout

Rollout recommendation: keep Hook Chains disabled until you validate rules in your environment.

  • Set top-level config to "enabled": false initially.
  • Enable per environment when ready.
  • Dispatch is gated by feature('HOOK_CHAINS').
  • Env gate defaults to off unless CLAUDE_CODE_ENABLE_HOOK_CHAINS=1 is set.

This keeps existing workflows unchanged while you tune guard windows and action behavior.

Feature Overview

Hook Chains are loaded from a single config file at a fixed, predictable path and evaluated whenever a wired hook event is dispatched.

MVP runtime trigger wiring:

  • PostToolUseFailure hooks dispatch Hook Chains with outcome failed.
  • TaskCompleted hooks dispatch Hook Chains with outcome:
    • success when completion hooks did not block.
    • failed when completion hooks returned blocking errors or prevented continuation.
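
The outcome mapping above can be sketched as a small pure function. This is an illustration only: `CompletionHookResult` and its field names are hypothetical stand-ins, since the real hook result type is not shown in this document.

```typescript
// Hypothetical shape of a completion-hook result (not the runtime's real type).
interface CompletionHookResult {
  blockingErrors: string[];
  preventedContinuation: boolean;
}

type ChainOutcome = "success" | "failed";

// TaskCompleted: "failed" when hooks blocked or prevented continuation,
// otherwise "success".
function taskCompletedOutcome(result: CompletionHookResult): ChainOutcome {
  const blocked =
    result.blockingErrors.length > 0 || result.preventedContinuation;
  return blocked ? "failed" : "success";
}

// PostToolUseFailure always dispatches Hook Chains with "failed".
function postToolUseFailureOutcome(): ChainOutcome {
  return "failed";
}
```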

Default config path:

  • .openclaude/hook-chains.json

Override path:

  • CLAUDE_CODE_HOOK_CHAINS_CONFIG_PATH=/abs/or/relative/path/to/hook-chains.json

Global gate:

  • feature('HOOK_CHAINS') must be enabled in the build
  • CLAUDE_CODE_ENABLE_HOOK_CHAINS=0|1 (defaults to disabled when unset)
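
As a rough sketch of how the path override and the gates combine, the helpers below take the environment as a plain object. The function names are hypothetical, and the sketch assumes that only the literal value `1` turns the env gate on and that all three switches (build feature, env gate, config `enabled`) must be on for dispatch to run.

```typescript
type Env = Record<string, string | undefined>;

const DEFAULT_CONFIG_PATH = ".openclaude/hook-chains.json";

// Use the override path when it is set and non-empty, else the default path.
function resolveConfigPath(env: Env): string {
  const override = env["CLAUDE_CODE_HOOK_CHAINS_CONFIG_PATH"];
  return override !== undefined && override.trim() !== ""
    ? override
    : DEFAULT_CONFIG_PATH;
}

// Assumption: every gate must be open; an unset env var counts as disabled.
function isDispatchEnabled(
  buildFeatureOn: boolean, // stands in for feature('HOOK_CHAINS')
  env: Env,
  configEnabled: boolean, // top-level "enabled" from the config file
): boolean {
  if (!buildFeatureOn) return false;
  if (env["CLAUDE_CODE_ENABLE_HOOK_CHAINS"] !== "1") return false;
  return configEnabled;
}
```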

Safety Guarantees

The runtime is intentionally conservative:

  • Depth guard: chain dispatch is blocked when chainDepth >= maxChainDepth.
  • Rule cooldown: each rule can only re-fire after cooldown expires.
  • Dedup window: identical event/action combinations are suppressed for a window.
  • Abort-safe behavior: if the current signal is aborted, actions skip safely.
  • Policy-aware remote warm: warm_remote_capacity skips when remote sessions are policy denied.
  • Bridge inactive no-op: warm_remote_capacity safely skips when no active bridge handle exists.
  • Missing team context safety: notify_team skips with structured reason if no team context/team file is available.
  • Fallback launcher safety: spawn_fallback_agent fails with a structured reason when launch permissions/context are unavailable.
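
The first three guards (depth, cooldown, dedup) can be pictured as a single check against in-memory timestamps. The sketch below is a simplification under stated assumptions: `GuardState`, `GuardInput`, and `checkGuards` are hypothetical names, the dedup key format is left to the caller, and the real runtime may evaluate cooldown per rule and dedup per action in separate steps.

```typescript
interface GuardState {
  ruleLastFired: Map<string, number>; // rule id -> last fire time (ms)
  actionLastFired: Map<string, number>; // dedup key -> last fire time (ms)
}

interface GuardInput {
  ruleId: string;
  dedupKey: string; // e.g. some event/action combination
  chainDepth: number;
  maxChainDepth: number;
  cooldownMs: number;
  dedupWindowMs: number;
  now: number; // current time in ms
}

type GuardDecision = { allowed: true } | { allowed: false; reason: string };

function checkGuards(state: GuardState, input: GuardInput): GuardDecision {
  // Depth guard: blocked when chainDepth >= maxChainDepth.
  if (input.chainDepth >= input.maxChainDepth) {
    return { allowed: false, reason: "max chain depth reached" };
  }
  // Rule cooldown: the rule may only re-fire after the cooldown expires.
  const ruleAt = state.ruleLastFired.get(input.ruleId);
  if (ruleAt !== undefined && input.now - ruleAt < input.cooldownMs) {
    return { allowed: false, reason: "rule cooldown active" };
  }
  // Dedup window: identical event/action combinations are suppressed.
  const actionAt = state.actionLastFired.get(input.dedupKey);
  if (actionAt !== undefined && input.now - actionAt < input.dedupWindowMs) {
    return { allowed: false, reason: "dedup window active" };
  }
  // Record the fire time only when the dispatch is allowed.
  state.ruleLastFired.set(input.ruleId, input.now);
  state.actionLastFired.set(input.dedupKey, input.now);
  return { allowed: true };
}
```

Passing `now` in explicitly keeps the check deterministic, which makes guard windows easy to exercise in tests.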

Configuration Schema Reference

Top-level object:

{
  "version": 1,
  "enabled": true,
  "maxChainDepth": 2,
  "defaultCooldownMs": 30000,
  "defaultDedupWindowMs": 30000,
  "rules": []
}

Top-Level Fields

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| version | 1 | No | Defaults to 1. |
| enabled | boolean | No | Global feature switch for this config file. |
| maxChainDepth | integer | No | Global depth guard (default 2, max 10). |
| defaultCooldownMs | integer | No | Default rule cooldown in ms (default 30000). |
| defaultDedupWindowMs | integer | No | Default action dedup window in ms (default 30000). |
| rules | HookChainRule[] | No | Defaults to []. May be omitted or empty; when no rules are present, dispatch is a no-op and returns enabled: false. |

Note: An empty ruleset is valid and can be used to keep Hook Chains configured but effectively disabled until rules are added.
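
One way to picture the documented defaults is a normalizer that fills in missing top-level fields. This is a sketch, not the loader in src/utils/hookChains.ts: `normalizeConfig` and `hasWork` are hypothetical names, the default for a missing `enabled` is not stated here (the sketch assumes `false`, in line with the disabled-by-default rollout), and whether the real loader clamps or rejects a `maxChainDepth` above 10 is also an assumption.

```typescript
interface HookChainsConfig {
  version: 1;
  enabled: boolean;
  maxChainDepth: number;
  defaultCooldownMs: number;
  defaultDedupWindowMs: number;
  rules: unknown[]; // HookChainRule[] in the real schema
}

const MAX_CHAIN_DEPTH_LIMIT = 10;

function normalizeConfig(raw: Partial<HookChainsConfig>): HookChainsConfig {
  return {
    version: 1,
    // Assumption: a missing "enabled" is treated as off.
    enabled: raw.enabled ?? false,
    // Assumption: values above the documented max are clamped to 10.
    maxChainDepth: Math.min(raw.maxChainDepth ?? 2, MAX_CHAIN_DEPTH_LIMIT),
    defaultCooldownMs: raw.defaultCooldownMs ?? 30000,
    defaultDedupWindowMs: raw.defaultDedupWindowMs ?? 30000,
    rules: raw.rules ?? [],
  };
}

// An enabled config with no rules still results in a no-op dispatch.
function hasWork(config: HookChainsConfig): boolean {
  return config.enabled && config.rules.length > 0;
}
```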

Rule Object (HookChainRule)

{
  "id": "task-failure-recovery",
  "enabled": true,
  "trigger": {
    "event": "TaskCompleted",
    "outcome": "failed"
  },
  "condition": {
    "toolNames": ["Edit"],
    "taskStatuses": ["failed"],
    "errorIncludes": ["timeout", "permission denied"],
    "eventFieldEquals": {
      "meta.source": "scheduler"
    }
  },
  "cooldownMs": 60000,
  "dedupWindowMs": 30000,
  "maxDepth": 2,
  "actions": []
}

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| id | string | Yes | Stable identifier used in telemetry/guards. |
| enabled | boolean | No | Per-rule switch. |
| trigger.event | HookEvent | Yes | Event name to match. |
| trigger.outcome | "success" \| "failed" \| "timeout" | No | Single-outcome matcher. Use either outcome or outcomes. |
| trigger.outcomes | Outcome[] | No | Multi-outcome matcher. Use either outcome or outcomes. |
| condition | object | No | Optional extra matching constraints. |
| cooldownMs | integer | No | Overrides global cooldown for this rule. |
| dedupWindowMs | integer | No | Overrides global dedup for this rule. |
| maxDepth | integer | No | Per-rule depth cap. |
| actions | HookChainAction[] | Yes | One or more actions to execute in order. |
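
Trigger matching can be sketched as follows. The `Trigger` type and `triggerMatches` are hypothetical names, and the behavior when neither `outcome` nor `outcomes` is set (match any outcome) is an assumption, since this document does not specify it.

```typescript
interface Trigger {
  event: string;
  outcome?: string;
  outcomes?: string[];
}

function triggerMatches(
  trigger: Trigger,
  eventName: string,
  outcome: string,
): boolean {
  if (trigger.event !== eventName) return false;
  // Single-outcome matcher.
  if (trigger.outcome !== undefined) return trigger.outcome === outcome;
  // Multi-outcome matcher.
  if (trigger.outcomes !== undefined) return trigger.outcomes.includes(outcome);
  // Assumption: with no outcome filter, any outcome matches.
  return true;
}
```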

Condition Fields

| Field | Type | Notes |
| --- | --- | --- |
| toolNames | string[] | Matches tool_name / toolName in event payload. |
| taskStatuses | string[] | Matches task_status / taskStatus / status. |
| errorIncludes | string[] | Case-insensitive substring match against error / reason / message. |
| eventFieldEquals | Record<string, string \| number \| boolean> | Dot-path equality against payload (example: "meta.source": "scheduler"). |
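
The condition fields above could be evaluated roughly like this. The sketch makes several assumptions that the document does not confirm: all present fields must match (AND), any entry within an array is enough (OR), and only the first string found among the candidate payload keys is inspected. `conditionMatches` and its helpers are hypothetical names.

```typescript
type Payload = Record<string, unknown>;

interface Condition {
  toolNames?: string[];
  taskStatuses?: string[];
  errorIncludes?: string[];
  eventFieldEquals?: Record<string, string | number | boolean>;
}

// Return the first string value found under any of the given keys.
function firstString(payload: Payload, keys: string[]): string | undefined {
  for (const key of keys) {
    const value = payload[key];
    if (typeof value === "string") return value;
  }
  return undefined;
}

// Resolve a dot path such as "meta.source" against a nested payload.
function getByDotPath(payload: Payload, path: string): unknown {
  let current: unknown = payload;
  for (const part of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[part];
  }
  return current;
}

function conditionMatches(condition: Condition, payload: Payload): boolean {
  if (condition.toolNames) {
    const tool = firstString(payload, ["tool_name", "toolName"]);
    if (tool === undefined || !condition.toolNames.includes(tool)) return false;
  }
  if (condition.taskStatuses) {
    const status = firstString(payload, ["task_status", "taskStatus", "status"]);
    if (status === undefined || !condition.taskStatuses.includes(status)) return false;
  }
  if (condition.errorIncludes) {
    // Case-insensitive substring match.
    const text = (firstString(payload, ["error", "reason", "message"]) ?? "").toLowerCase();
    if (!condition.errorIncludes.some((needle) => text.includes(needle.toLowerCase()))) return false;
  }
  if (condition.eventFieldEquals) {
    for (const [path, expected] of Object.entries(condition.eventFieldEquals)) {
      if (getByDotPath(payload, path) !== expected) return false;
    }
  }
  return true;
}
```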

Actions

spawn_fallback_agent

{
  "type": "spawn_fallback_agent",
  "id": "fallback-1",
  "enabled": true,
  "dedupWindowMs": 30000,
  "description": "Fallback recovery for failed task",
  "promptTemplate": "Recover task ${TASK_SUBJECT}. Event=${EVENT_NAME}, outcome=${OUTCOME}, error=${ERROR}. Payload=${PAYLOAD_JSON}",
  "agentType": "general-purpose",
  "model": "sonnet"
}

notify_team

{
  "type": "notify_team",
  "id": "notify-ops",
  "enabled": true,
  "dedupWindowMs": 30000,
  "teamName": "mesh-team",
  "recipients": ["*"],
  "summary": "Hook chain ${RULE_ID} fired",
  "messageTemplate": "Event=${EVENT_NAME} outcome=${OUTCOME}\nTask=${TASK_ID}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
}

warm_remote_capacity

{
  "type": "warm_remote_capacity",
  "id": "warm-bridge",
  "enabled": true,
  "dedupWindowMs": 60000,
  "createDefaultEnvironmentIfMissing": false
}

Complete Example Configs

1) Retry via Fallback Agent

{
  "version": 1,
  "enabled": true,
  "maxChainDepth": 2,
  "defaultCooldownMs": 30000,
  "defaultDedupWindowMs": 30000,
  "rules": [
    {
      "id": "retry-task-via-fallback",
      "trigger": {
        "event": "TaskCompleted",
        "outcome": "failed"
      },
      "cooldownMs": 60000,
      "actions": [
        {
          "type": "spawn_fallback_agent",
          "id": "spawn-retry-agent",
          "description": "Retry failed task with fallback agent",
          "promptTemplate": "A task failed. Recover it safely.\nTask=${TASK_SUBJECT}\nDescription=${TASK_DESCRIPTION}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}",
          "agentType": "general-purpose",
          "model": "sonnet"
        }
      ]
    }
  ]
}

2) Notify Only

{
  "version": 1,
  "enabled": true,
  "maxChainDepth": 2,
  "defaultCooldownMs": 30000,
  "defaultDedupWindowMs": 30000,
  "rules": [
    {
      "id": "notify-on-tool-failure",
      "trigger": {
        "event": "PostToolUseFailure",
        "outcome": "failed"
      },
      "condition": {
        "toolNames": ["Edit", "Write", "Bash"]
      },
      "actions": [
        {
          "type": "notify_team",
          "id": "notify-team-failure",
          "recipients": ["*"],
          "summary": "Tool failure detected",
          "messageTemplate": "Tool failure detected.\nEvent=${EVENT_NAME} outcome=${OUTCOME}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
        }
      ]
    }
  ]
}

3) Combined Fallback + Notify + Bridge Warm

{
  "version": 1,
  "enabled": true,
  "maxChainDepth": 2,
  "defaultCooldownMs": 45000,
  "defaultDedupWindowMs": 30000,
  "rules": [
    {
      "id": "full-recovery-chain",
      "trigger": {
        "event": "TaskCompleted",
        "outcomes": ["failed", "timeout"]
      },
      "condition": {
        "errorIncludes": ["timeout", "capacity", "connection"]
      },
      "cooldownMs": 90000,
      "actions": [
        {
          "type": "spawn_fallback_agent",
          "id": "fallback-agent",
          "description": "Recover failed task execution",
          "promptTemplate": "Recover failed task and produce a concise fix summary.\nTask=${TASK_SUBJECT}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
        },
        {
          "type": "notify_team",
          "id": "notify-team",
          "recipients": ["*"],
          "summary": "Recovery chain triggered",
          "messageTemplate": "Recovery chain ${RULE_ID} fired.\nOutcome=${OUTCOME}\nTask=${TASK_SUBJECT}\nError=${ERROR}"
        },
        {
          "type": "warm_remote_capacity",
          "id": "warm-capacity",
          "createDefaultEnvironmentIfMissing": false
        }
      ]
    }
  ]
}

Template Variables

The following placeholders are supported by promptTemplate, summary, and messageTemplate:

  • ${EVENT_NAME}
  • ${OUTCOME}
  • ${RULE_ID}
  • ${TASK_SUBJECT}
  • ${TASK_DESCRIPTION}
  • ${TASK_ID}
  • ${ERROR}
  • ${PAYLOAD_JSON}
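
Placeholder substitution of this kind can be sketched with a single regular-expression replace. `renderTemplate` is a hypothetical name, and leaving unknown placeholders untouched is an assumption; the real runtime may render them differently.

```typescript
type TemplateVars = Record<string, string>;

// Replace ${NAME} placeholders with values from vars.
// Assumption: placeholders without a value are left as-is.
function renderTemplate(template: string, vars: TemplateVars): string {
  return template.replace(
    /\$\{([A-Z_]+)\}/g,
    (match: string, name: string): string => {
      const value = vars[name];
      return value !== undefined ? value : match;
    },
  );
}
```

Using a replacer function rather than a replacement string means that `$` sequences inside values such as `${PAYLOAD_JSON}` are inserted literally.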

Troubleshooting

Rule never triggers

  • Verify trigger.event and trigger.outcome/trigger.outcomes exactly match dispatched event data.
  • Check condition filters (especially toolNames and eventFieldEquals dot-path keys).
  • Confirm the config file is valid JSON and schema-valid.

Actions show as skipped

Common skip reasons:

  • action disabled
  • rule cooldown active ...
  • dedup window active ...
  • max chain depth reached ...
  • No team context is available ...
  • Team file not found ...
  • Remote sessions are blocked by policy
  • Bridge is not active; warm_remote_capacity is a safe no-op
  • No fallback agent launcher is registered in runtime context
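
To show how such skip reasons might surface, here is a simplified sketch of running a rule's actions in order and collecting one structured result per action. Everything in it is illustrative: the `ActionResult` shape, the synchronous handlers, the `aborted` reason text, and the treatment of a missing `enabled` as on are assumptions rather than the runtime's actual behavior.

```typescript
interface Action {
  id: string;
  type: string;
  enabled?: boolean;
}

interface ActionResult {
  actionId: string;
  status: "executed" | "skipped" | "failed";
  reason?: string;
}

type Handler = (action: Action) => ActionResult;

function runActions(
  actions: Action[],
  handlers: Record<string, Handler | undefined>,
  signal: { aborted: boolean }, // minimal stand-in for an abort signal
): ActionResult[] {
  const results: ActionResult[] = [];
  for (const action of actions) {
    if (signal.aborted) {
      // Abort-safe behavior: skip instead of throwing.
      results.push({ actionId: action.id, status: "skipped", reason: "aborted" });
      continue;
    }
    if (action.enabled === false) {
      results.push({ actionId: action.id, status: "skipped", reason: "action disabled" });
      continue;
    }
    const handler = handlers[action.type];
    if (handler === undefined) {
      results.push({ actionId: action.id, status: "failed", reason: `no handler for ${action.type}` });
      continue;
    }
    results.push(handler(action));
  }
  return results;
}
```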

Config changes not reflected

  • The loader memoizes the parsed config by file mtime and size, so a change that alters neither is not picked up.
  • Ensure your editor writes the file fully and updates mtime.
  • If needed, force reload from the caller side with forceReloadConfig: true.
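
The memoization described above can be sketched as a cache keyed by path that is invalidated when mtime or size changes. This is an illustration, not the actual loader: file access is injected through a hypothetical `LoaderDeps` interface so the sketch stays self-contained, and only the `forceReloadConfig` flag name comes from this document.

```typescript
interface FileStat {
  mtimeMs: number;
  size: number;
}

// Hypothetical injection point for file access (e.g. backed by the fs module).
interface LoaderDeps {
  stat(path: string): FileStat;
  read(path: string): string;
}

interface CacheEntry extends FileStat {
  config: unknown;
}

const configCache = new Map<string, CacheEntry>();

function loadConfig(
  path: string,
  deps: LoaderDeps,
  forceReloadConfig = false,
): unknown {
  const stat = deps.stat(path);
  const cached = configCache.get(path);
  // Reuse the parsed config while mtime and size are unchanged.
  if (
    !forceReloadConfig &&
    cached !== undefined &&
    cached.mtimeMs === stat.mtimeMs &&
    cached.size === stat.size
  ) {
    return cached.config;
  }
  const config: unknown = JSON.parse(deps.read(path));
  configCache.set(path, { mtimeMs: stat.mtimeMs, size: stat.size, config });
  return config;
}
```

This also shows why an editor that rewrites the file without changing mtime or size would leave a stale config in place until a forced reload.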

Existing workflows changed unexpectedly

  • Set "enabled": false at top-level.
  • Or globally disable with CLAUDE_CODE_ENABLE_HOOK_CHAINS=0.
  • Re-enable gradually after validating one rule at a time.