Urvish L. 44a2c30d5f feat: implement Hook Chains runtime integration for self-healing agent mesh MVP (#711)

* feat: implement Hook Chains runtime integration for self-healing agent mesh MVP

- Add Hook Chains config loader, evaluator, and dispatcher in src/utils/hookChains.ts
- Wire PostToolUseFailure hook dispatch in executePostToolUseFailureHooks()
- Wire TaskCompleted hook dispatch in executeTaskCompletedHooks()
- Integrate fallback-agent launcher with permission preservation (canUseTool threading)
- Add safety hardening for config-read errors (try-catch protection)
- Update docs with MVP runtime trigger explanation
- Add 10 unit tests and 4 integration tests covering config, rules, guards, and actions

This completes the self-healing agent mesh MVP by enabling declarative rule-based
responses to tool failures and task completions, with fallback agent spawning,
team notification, and capacity warming actions.

* Update docs/hook-chains.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/utils/hookChains.ts

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: address PR #711 review blockers for Hook Chains

- Gate hook-chain dispatch behind feature('HOOK_CHAINS') and default env gate to off
- Remove committed local artifact (agent.log) and ignore it in .gitignore
- Revert hook dispatcher signature threading changes for canUseTool
- Use ToolUseContext metadata hookChainsCanUseTool for fallback launch permissions
- Make spawn_fallback_agent fail explicitly when launcher context is unavailable
- Add config cache max age and guard map size limits to bound runtime memory
- Update docs and tests for default-off gating and explicit fallback failure

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

2026-04-22 19:40:23 +08:00

9.9 KiB

Hook Chains (Self-Healing Agent Mesh MVP)

Hook Chains provide an event-driven recovery layer for important workflow failures. When a matching hook event occurs, OpenClaude evaluates declarative rules and can dispatch remediation actions such as:

spawn_fallback_agent
notify_team
warm_remote_capacity

Disabled-By-Default Rollout

Rollout recommendation: keep Hook Chains disabled until you validate rules in your environment.

Set the top-level "enabled" field to false initially.

Enable per environment when ready.

Dispatch is gated by feature('HOOK_CHAINS').

The environment gate stays off unless CLAUDE_CODE_ENABLE_HOOK_CHAINS=1 is set.

This keeps existing workflows unchanged while you tune guard windows and action behavior.

Feature Overview

Hook Chains are loaded from a JSON config file at a fixed, predictable path and evaluated on each dispatched hook event.

MVP runtime trigger wiring:

PostToolUseFailure hooks dispatch Hook Chains with outcome failed.
TaskCompleted hooks dispatch Hook Chains with outcome:
- success when completion hooks did not block.
- failed when completion hooks returned blocking errors or prevented continuation.
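
The TaskCompleted mapping above can be sketched as a small pure function. This is a minimal illustration only: `CompletionHookResult` and `taskCompletedOutcome` are hypothetical names, not the actual types or functions in src/utils/hookChains.ts.

```typescript
// Hypothetical shape of what the completion hooks report back.
interface CompletionHookResult {
  blockingErrors: string[];
  preventedContinuation: boolean;
}

type TaskCompletedOutcome = "success" | "failed";

// Map the completion-hook result onto the outcome handed to Hook Chains.
function taskCompletedOutcome(result: CompletionHookResult): TaskCompletedOutcome {
  const blocked =
    result.blockingErrors.length > 0 || result.preventedContinuation;
  return blocked ? "failed" : "success";
}
```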

Default config path:

.openclaude/hook-chains.json

Override path:

CLAUDE_CODE_HOOK_CHAINS_CONFIG_PATH=/abs/or/relative/path/to/hook-chains.json
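
A minimal sketch of how the default and override paths might be combined. The function name `resolveHookChainsConfigPath` is hypothetical, and treating a blank override as unset is an assumption. How the real loader resolves relative paths is not stated here, so the sketch returns the path unchanged.

```typescript
const DEFAULT_HOOK_CHAINS_CONFIG_PATH = ".openclaude/hook-chains.json";

// `env` is passed in explicitly so the function stays pure and easy to test.
function resolveHookChainsConfigPath(
  env: Record<string, string | undefined>,
): string {
  const override = env["CLAUDE_CODE_HOOK_CHAINS_CONFIG_PATH"]?.trim();
  // Assumption: an empty or whitespace-only override falls back to the default.
  return override ? override : DEFAULT_HOOK_CHAINS_CONFIG_PATH;
}
```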

Global gate:

feature('HOOK_CHAINS') must be enabled in the build
CLAUDE_CODE_ENABLE_HOOK_CHAINS=0|1 (defaults to disabled when unset)
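
The layered gating can be pictured as three checks that must all pass. The sketch below is illustrative: `GateInputs` and `hookChainsActive` are hypothetical names, and the order of the checks is an assumption rather than the documented runtime order.

```typescript
interface GateInputs {
  buildFeatureEnabled: boolean; // stands in for the result of feature('HOOK_CHAINS')
  env: Record<string, string | undefined>;
  configEnabled: boolean; // top-level "enabled" from the config file
}

// Hook Chains dispatch only when the build flag, the env gate, and the config agree.
function hookChainsActive(inputs: GateInputs): boolean {
  if (!inputs.buildFeatureEnabled) return false;
  // The env gate is off when unset; only the value "1" turns it on.
  if (inputs.env["CLAUDE_CODE_ENABLE_HOOK_CHAINS"] !== "1") return false;
  return inputs.configEnabled;
}
```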

Safety Guarantees

The runtime is intentionally conservative:

Depth guard: chain dispatch is blocked when chainDepth >= maxChainDepth.
Rule cooldown: each rule can re-fire only after its cooldown expires.
Dedup window: identical event/action combinations are suppressed for the duration of the dedup window.
Abort-safe behavior: if the current signal is already aborted, actions are skipped rather than executed.
Policy-aware remote warm: warm_remote_capacity skips when remote sessions are policy denied.
Bridge inactive no-op: warm_remote_capacity safely skips when no active bridge handle exists.
Missing team context safety: notify_team skips with structured reason if no team context/team file is available.
Fallback launcher safety: spawn_fallback_agent fails with a structured reason when launch permissions/context are unavailable.
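
The first three guards (depth, cooldown, dedup) can be sketched with two timestamp maps and an injectable clock. This is a simplified illustration under stated assumptions: `HookChainGuards`, `GuardSettings`, and the dedup key format are hypothetical, and the real runtime also bounds the size of its guard maps, which this sketch omits.

```typescript
interface GuardSettings {
  maxChainDepth: number;
  cooldownMs: number;
  dedupWindowMs: number;
}

type GuardDecision = { allowed: true } | { allowed: false; reason: string };

class HookChainGuards {
  private readonly lastRuleFire = new Map<string, number>();
  private readonly lastActionFire = new Map<string, number>();
  private readonly now: () => number;

  // The clock is injectable so the guards can be tested without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Depth guard first, then the per-rule cooldown.
  checkRule(ruleId: string, chainDepth: number, settings: GuardSettings): GuardDecision {
    if (chainDepth >= settings.maxChainDepth) {
      return { allowed: false, reason: "max chain depth reached" };
    }
    const current = this.now();
    const last = this.lastRuleFire.get(ruleId);
    if (last !== undefined && current - last < settings.cooldownMs) {
      return { allowed: false, reason: "rule cooldown active" };
    }
    this.lastRuleFire.set(ruleId, current);
    return { allowed: true };
  }

  // Suppress an identical event/action combination inside the dedup window.
  checkAction(dedupKey: string, settings: GuardSettings): GuardDecision {
    const current = this.now();
    const last = this.lastActionFire.get(dedupKey);
    if (last !== undefined && current - last < settings.dedupWindowMs) {
      return { allowed: false, reason: "dedup window active" };
    }
    this.lastActionFire.set(dedupKey, current);
    return { allowed: true };
  }
}
```

Injecting the clock keeps the guard logic deterministic in tests; in production code the default `Date.now()` clock would be used.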

Configuration Schema Reference

Top-level object:

{
  "version": 1,
  "enabled": true,
  "maxChainDepth": 2,
  "defaultCooldownMs": 30000,
  "defaultDedupWindowMs": 30000,
  "rules": []
}

Top-Level Fields

Field	Type	Required	Notes
`version`	`1`	No	Defaults to `1`.
`enabled`	`boolean`	No	Global feature switch for this config file.
`maxChainDepth`	`integer`	No	Global depth guard (default `2`, max `10`).
`defaultCooldownMs`	`integer`	No	Default rule cooldown in ms (default `30000`).
`defaultDedupWindowMs`	`integer`	No	Default action dedup window in ms (default `30000`).
`rules`	`HookChainRule[]`	No	Defaults to `[]`. May be omitted or empty; when no rules are present, dispatch is a no-op and returns `enabled: false`.

Note: An empty ruleset is valid and can be used to keep Hook Chains configured but effectively disabled until rules are added.
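
The defaults in the table above can be applied with a small normalizer. This is a sketch, not the actual loader: `normalizeConfig` and `intOr` are hypothetical helpers, clamping `maxChainDepth` to 10 (rather than rejecting larger values) is an assumption, and so is treating an omitted `enabled` as false, since the table does not state its default.

```typescript
interface HookChainsConfig {
  version: 1;
  enabled: boolean;
  maxChainDepth: number;
  defaultCooldownMs: number;
  defaultDedupWindowMs: number;
  rules: unknown[];
}

const MAX_CHAIN_DEPTH_LIMIT = 10;

// Accept only non-negative integers; anything else falls back to the default.
function intOr(value: unknown, fallback: number): number {
  return typeof value === "number" && Number.isInteger(value) && value >= 0
    ? value
    : fallback;
}

function normalizeConfig(raw: Record<string, unknown>): HookChainsConfig {
  const rules = raw["rules"];
  return {
    version: 1,
    // Assumption: disabled unless explicitly set to true.
    enabled: raw["enabled"] === true,
    // Assumption: values above the documented maximum are clamped.
    maxChainDepth: Math.min(intOr(raw["maxChainDepth"], 2), MAX_CHAIN_DEPTH_LIMIT),
    defaultCooldownMs: intOr(raw["defaultCooldownMs"], 30000),
    defaultDedupWindowMs: intOr(raw["defaultDedupWindowMs"], 30000),
    rules: Array.isArray(rules) ? rules : [],
  };
}
```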

Rule Object (`HookChainRule`)

{
  "id": "task-failure-recovery",
  "enabled": true,
  "trigger": {
    "event": "TaskCompleted",
    "outcome": "failed"
  },
  "condition": {
    "toolNames": ["Edit"],
    "taskStatuses": ["failed"],
    "errorIncludes": ["timeout", "permission denied"],
    "eventFieldEquals": {
      "meta.source": "scheduler"
    }
  },
  "cooldownMs": 60000,
  "dedupWindowMs": 30000,
  "maxDepth": 2,
  "actions": []
}

Field	Type	Required	Notes
`id`	`string`	Yes	Stable identifier used in telemetry/guards.
`enabled`	`boolean`	No	Per-rule switch.
`trigger.event`	`HookEvent`	Yes	Event name to match.
`trigger.outcome`	`"success" \| "failed" \| "timeout"`	No	Single-outcome matcher. Use either `outcome` or `outcomes`.
`trigger.outcomes`	`Outcome[]`	No	Multi-outcome matcher. Use either `outcome` or `outcomes`.
`condition`	`object`	No	Optional extra matching constraints.
`cooldownMs`	`integer`	No	Overrides global cooldown for this rule.
`dedupWindowMs`	`integer`	No	Overrides global dedup for this rule.
`maxDepth`	`integer`	No	Per-rule depth cap.
`actions`	`HookChainAction[]`	Yes	One or more actions to execute in order.
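
A sketch of how a rule's trigger and per-rule overrides might be evaluated. The names `triggerMatches` and `effectiveTimings` are hypothetical. Two behaviors are assumptions: a trigger that specifies neither `outcome` nor `outcomes` matches any outcome, and `outcomes` takes precedence if both are present.

```typescript
interface RuleTrigger {
  event: string;
  outcome?: string;
  outcomes?: string[];
}

interface RuleTimings {
  cooldownMs?: number;
  dedupWindowMs?: number;
}

interface GlobalDefaults {
  defaultCooldownMs: number;
  defaultDedupWindowMs: number;
}

// The event name must match exactly; the outcome must be in the accepted set.
function triggerMatches(trigger: RuleTrigger, eventName: string, outcome: string): boolean {
  if (trigger.event !== eventName) return false;
  const accepted =
    trigger.outcomes ?? (trigger.outcome !== undefined ? [trigger.outcome] : undefined);
  // Assumption: a trigger that lists no outcome accepts every outcome.
  return accepted === undefined || accepted.includes(outcome);
}

// Per-rule values override the top-level defaults when present.
function effectiveTimings(rule: RuleTimings, defaults: GlobalDefaults) {
  return {
    cooldownMs: rule.cooldownMs ?? defaults.defaultCooldownMs,
    dedupWindowMs: rule.dedupWindowMs ?? defaults.defaultDedupWindowMs,
  };
}
```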

Condition Fields

Field	Type	Notes
`toolNames`	`string[]`	Matches `tool_name` / `toolName` in event payload.
`taskStatuses`	`string[]`	Matches `task_status` / `taskStatus` / `status`.
`errorIncludes`	`string[]`	Case-insensitive substring match against `error` / `reason` / `message`.
`eventFieldEquals`	`Record<string, string\|number\|boolean>`	Dot-path equality against payload (example: `"meta.source": "scheduler"`).
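
The condition semantics in the table can be sketched as follows. This is an illustration under assumptions, not the actual evaluator: `conditionMatches`, `getPath`, and `firstString` are hypothetical helpers, all condition fields are assumed to be combined with AND, each list is assumed to match when any entry matches, and the first string-valued payload key in the listed order is assumed to win.

```typescript
type Payload = Record<string, unknown>;

interface RuleCondition {
  toolNames?: string[];
  taskStatuses?: string[];
  errorIncludes?: string[];
  eventFieldEquals?: Record<string, string | number | boolean>;
}

// Resolve a dot-path such as "meta.source" against a nested payload.
function getPath(payload: Payload, path: string): unknown {
  let current: unknown = payload;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Return the first string value found under any of the candidate keys.
function firstString(payload: Payload, keys: string[]): string | undefined {
  for (const key of keys) {
    const value = payload[key];
    if (typeof value === "string") return value;
  }
  return undefined;
}

function conditionMatches(condition: RuleCondition, payload: Payload): boolean {
  if (condition.toolNames) {
    const tool = firstString(payload, ["tool_name", "toolName"]);
    if (tool === undefined || !condition.toolNames.includes(tool)) return false;
  }
  if (condition.taskStatuses) {
    const status = firstString(payload, ["task_status", "taskStatus", "status"]);
    if (status === undefined || !condition.taskStatuses.includes(status)) return false;
  }
  if (condition.errorIncludes) {
    // Case-insensitive substring match against the error text.
    const text = (firstString(payload, ["error", "reason", "message"]) ?? "").toLowerCase();
    if (!condition.errorIncludes.some((needle) => text.includes(needle.toLowerCase()))) {
      return false;
    }
  }
  if (condition.eventFieldEquals) {
    for (const [path, expected] of Object.entries(condition.eventFieldEquals)) {
      if (getPath(payload, path) !== expected) return false;
    }
  }
  return true;
}
```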

Actions

`spawn_fallback_agent`

{
  "type": "spawn_fallback_agent",
  "id": "fallback-1",
  "enabled": true,
  "dedupWindowMs": 30000,
  "description": "Fallback recovery for failed task",
  "promptTemplate": "Recover task ${TASK_SUBJECT}. Event=${EVENT_NAME}, outcome=${OUTCOME}, error=${ERROR}. Payload=${PAYLOAD_JSON}",
  "agentType": "general-purpose",
  "model": "sonnet"
}

`notify_team`

{
  "type": "notify_team",
  "id": "notify-ops",
  "enabled": true,
  "dedupWindowMs": 30000,
  "teamName": "mesh-team",
  "recipients": ["*"],
  "summary": "Hook chain ${RULE_ID} fired",
  "messageTemplate": "Event=${EVENT_NAME} outcome=${OUTCOME}\nTask=${TASK_ID}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
}

`warm_remote_capacity`

{
  "type": "warm_remote_capacity",
  "id": "warm-bridge",
  "enabled": true,
  "dedupWindowMs": 60000,
  "createDefaultEnvironmentIfMissing": false
}
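
The three action shapes above map naturally onto a discriminated union keyed by `type`. The sketch below is illustrative: the interface names and `planActions` are hypothetical, every field other than `type` is marked optional because the required set is not stated here, and treating an omitted `enabled` as true is an assumption based on the example configs, which omit it on actions that are expected to run.

```typescript
interface BaseAction {
  id?: string;
  enabled?: boolean;
  dedupWindowMs?: number;
}

interface SpawnFallbackAgentAction extends BaseAction {
  type: "spawn_fallback_agent";
  description?: string;
  promptTemplate?: string;
  agentType?: string;
  model?: string;
}

interface NotifyTeamAction extends BaseAction {
  type: "notify_team";
  teamName?: string;
  recipients?: string[];
  summary?: string;
  messageTemplate?: string;
}

interface WarmRemoteCapacityAction extends BaseAction {
  type: "warm_remote_capacity";
  createDefaultEnvironmentIfMissing?: boolean;
}

type HookChainAction =
  | SpawnFallbackAgentAction
  | NotifyTeamAction
  | WarmRemoteCapacityAction;

interface PlannedAction {
  id: string;
  type: HookChainAction["type"];
  status: "run" | "skipped";
  reason?: string;
}

// Walk the actions in declaration order and mark disabled ones as skipped.
function planActions(actions: HookChainAction[]): PlannedAction[] {
  return actions.map((action, index): PlannedAction => {
    // Hypothetical fallback id for actions that do not declare one.
    const id = action.id ?? `${action.type}-${index}`;
    if (action.enabled === false) {
      return { id, type: action.type, status: "skipped", reason: "action disabled" };
    }
    return { id, type: action.type, status: "run" };
  });
}
```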

Complete Example Configs

1) Retry via Fallback Agent

{
  "version": 1,
  "enabled": true,
  "maxChainDepth": 2,
  "defaultCooldownMs": 30000,
  "defaultDedupWindowMs": 30000,
  "rules": [
    {
      "id": "retry-task-via-fallback",
      "trigger": {
        "event": "TaskCompleted",
        "outcome": "failed"
      },
      "cooldownMs": 60000,
      "actions": [
        {
          "type": "spawn_fallback_agent",
          "id": "spawn-retry-agent",
          "description": "Retry failed task with fallback agent",
          "promptTemplate": "A task failed. Recover it safely.\nTask=${TASK_SUBJECT}\nDescription=${TASK_DESCRIPTION}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}",
          "agentType": "general-purpose",
          "model": "sonnet"
        }
      ]
    }
  ]
}

2) Notify Only

{
  "version": 1,
  "enabled": true,
  "maxChainDepth": 2,
  "defaultCooldownMs": 30000,
  "defaultDedupWindowMs": 30000,
  "rules": [
    {
      "id": "notify-on-tool-failure",
      "trigger": {
        "event": "PostToolUseFailure",
        "outcome": "failed"
      },
      "condition": {
        "toolNames": ["Edit", "Write", "Bash"]
      },
      "actions": [
        {
          "type": "notify_team",
          "id": "notify-team-failure",
          "recipients": ["*"],
          "summary": "Tool failure detected",
          "messageTemplate": "Tool failure detected.\nEvent=${EVENT_NAME} outcome=${OUTCOME}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
        }
      ]
    }
  ]
}

3) Combined Fallback + Notify + Bridge Warm

{
  "version": 1,
  "enabled": true,
  "maxChainDepth": 2,
  "defaultCooldownMs": 45000,
  "defaultDedupWindowMs": 30000,
  "rules": [
    {
      "id": "full-recovery-chain",
      "trigger": {
        "event": "TaskCompleted",
        "outcomes": ["failed", "timeout"]
      },
      "condition": {
        "errorIncludes": ["timeout", "capacity", "connection"]
      },
      "cooldownMs": 90000,
      "actions": [
        {
          "type": "spawn_fallback_agent",
          "id": "fallback-agent",
          "description": "Recover failed task execution",
          "promptTemplate": "Recover failed task and produce a concise fix summary.\nTask=${TASK_SUBJECT}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
        },
        {
          "type": "notify_team",
          "id": "notify-team",
          "recipients": ["*"],
          "summary": "Recovery chain triggered",
          "messageTemplate": "Recovery chain ${RULE_ID} fired.\nOutcome=${OUTCOME}\nTask=${TASK_SUBJECT}\nError=${ERROR}"
        },
        {
          "type": "warm_remote_capacity",
          "id": "warm-capacity",
          "createDefaultEnvironmentIfMissing": false
        }
      ]
    }
  ]
}

Template Variables

The following placeholders are supported by promptTemplate, summary, and messageTemplate:

${EVENT_NAME}
${OUTCOME}
${RULE_ID}
${TASK_SUBJECT}
${TASK_DESCRIPTION}
${TASK_ID}
${ERROR}
${PAYLOAD_JSON}
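
Placeholder substitution of this kind can be sketched with a single regular-expression replace. `renderTemplate` is a hypothetical helper, and leaving unknown placeholders untouched is an assumption; the real renderer may substitute an empty string or behave differently.

```typescript
// Replace ${NAME} placeholders with values from `vars`.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\$\{([A-Z_]+)\}/g, (match: string, name: string) => {
    const value = vars[name];
    // Assumption: placeholders without a value are left as written.
    return value !== undefined ? value : match;
  });
}

const rendered = renderTemplate("Hook chain ${RULE_ID} fired: ${OUTCOME}", {
  RULE_ID: "retry-task-via-fallback",
  OUTCOME: "failed",
});
console.log(rendered); // Hook chain retry-task-via-fallback fired: failed
```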

Troubleshooting

Rule never triggers

Verify trigger.event and trigger.outcome/trigger.outcomes exactly match dispatched event data.
Check condition filters (especially toolNames and eventFieldEquals dot-path keys).
Confirm the config file is valid JSON and schema-valid.

Actions show as skipped

Common skip reasons:

action disabled
rule cooldown active ...
dedup window active ...
max chain depth reached ...
No team context is available ...
Team file not found ...
Remote sessions are blocked by policy
Bridge is not active; warm_remote_capacity is a safe no-op
No fallback agent launcher is registered in runtime context

Config changes not reflected

The loader memoizes the parsed config, keyed by the file's mtime and size.
Ensure your editor writes the file fully and updates mtime.
If needed, force reload from the caller side with forceReloadConfig: true.
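
To see why an unchanged mtime and size hides an edit, here is a minimal sketch of mtime/size memoization. `MemoizedConfigLoader` and `ConfigSource` are hypothetical names; the file access is injected so the example stays self-contained, and the cache max age used by the real loader is omitted.

```typescript
interface FileStat {
  mtimeMs: number;
  size: number;
}

// Injected file access; a real source would stat and parse the JSON file.
interface ConfigSource<T> {
  stat(): FileStat;
  read(): T;
}

class MemoizedConfigLoader<T> {
  private readonly source: ConfigSource<T>;
  private cached: { stat: FileStat; value: T } | undefined;

  constructor(source: ConfigSource<T>) {
    this.source = source;
  }

  // Re-read only when mtime or size changed, or when a reload is forced.
  load(options: { forceReloadConfig?: boolean } = {}): T {
    const stat = this.source.stat();
    const hit = this.cached;
    const unchanged =
      hit !== undefined &&
      hit.stat.mtimeMs === stat.mtimeMs &&
      hit.stat.size === stat.size;
    if (hit !== undefined && unchanged && !options.forceReloadConfig) {
      return hit.value;
    }
    const value = this.source.read();
    this.cached = { stat: { mtimeMs: stat.mtimeMs, size: stat.size }, value };
    return value;
  }
}
```

An edit that keeps both the byte count and the mtime identical is served from the cache, which is why forcing a reload is the fallback.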

Existing workflows changed unexpectedly

Set "enabled": false at the top level of the config file.
Or globally disable with CLAUDE_CODE_ENABLE_HOOK_CHAINS=0.
Re-enable gradually after validating one rule at a time.