feat: implement Hook Chains runtime integration for self-healing agent mesh MVP (#711)

* feat: implement Hook Chains runtime integration for self-healing agent mesh MVP

- Add Hook Chains config loader, evaluator, and dispatcher in src/utils/hookChains.ts
- Wire PostToolUseFailure hook dispatch in executePostToolUseFailureHooks()
- Wire TaskCompleted hook dispatch in executeTaskCompletedHooks()
- Integrate fallback-agent launcher with permission preservation (canUseTool threading)
- Add safety hardening for config-read errors (try-catch protection)
- Update docs with MVP runtime trigger explanation
- Add 10 unit tests and 4 integration tests covering config, rules, guards, and actions

This completes the self-healing agent mesh MVP by enabling declarative rule-based
responses to tool failures and task completions, with fallback agent spawning,
team notification, and capacity warming actions.

* Update docs/hook-chains.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/utils/hookChains.ts

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: address PR #711 review blockers for Hook Chains

- Gate hook-chain dispatch behind feature('HOOK_CHAINS') and default env gate to off
- Remove committed local artifact (agent.log) and ignore it in .gitignore
- Revert hook dispatcher signature threading changes for canUseTool
- Use ToolUseContext metadata hookChainsCanUseTool for fallback launch permissions
- Make spawn_fallback_agent fail explicitly when launcher context is unavailable
- Add config cache max age and guard map size limits to bound runtime memory
- Update docs and tests for default-off gating and explicit fallback failure

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This commit is contained in:
Urvish L.
2026-04-22 13:40:23 +02:00
committed by GitHub
parent 5b9cd21e37
commit 44a2c30d5f
9 changed files with 2905 additions and 22 deletions

333
docs/hook-chains.md Normal file
View File

@@ -0,0 +1,333 @@
# Hook Chains (Self-Healing Agent Mesh MVP)
Hook Chains provide an event-driven recovery layer for important workflow failures.
When a matching hook event occurs, OpenClaude evaluates declarative rules and can dispatch remediation actions such as:
- `spawn_fallback_agent`
- `notify_team`
- `warm_remote_capacity`
## Disabled-By-Default Rollout
> **Rollout recommendation:** keep Hook Chains disabled until you validate rules in your environment.
>
> - Set top-level config to `"enabled": false` initially.
> - Enable per environment when ready.
> - Dispatch is gated by `feature('HOOK_CHAINS')`.
> - Env gate defaults to off unless `CLAUDE_CODE_ENABLE_HOOK_CHAINS=1` is set.
This keeps existing workflows unchanged while you tune guard windows and action behavior.
## Feature Overview
Hook Chains are loaded from a deterministic config file and evaluated on dispatched hook events.
MVP runtime trigger wiring:
- `PostToolUseFailure` hooks dispatch Hook Chains with outcome `failed`.
- `TaskCompleted` hooks dispatch Hook Chains with outcome:
- `success` when completion hooks did not block.
- `failed` when completion hooks returned blocking errors or prevented continuation.
Default config path:
- `.openclaude/hook-chains.json`
Override path:
- `CLAUDE_CODE_HOOK_CHAINS_CONFIG_PATH=/abs/or/relative/path/to/hook-chains.json`
Global gate:
- `feature('HOOK_CHAINS')` must be enabled in the build
- `CLAUDE_CODE_ENABLE_HOOK_CHAINS=0|1` (defaults to disabled when unset)
## Safety Guarantees
The runtime is intentionally conservative:
- **Depth guard:** chain dispatch is blocked when `chainDepth >= maxChainDepth`.
- **Rule cooldown:** each rule can only re-fire after cooldown expires.
- **Dedup window:** identical event/action combinations are suppressed for a window.
- **Abort-safe behavior:** if the current signal is aborted, actions skip safely.
- **Policy-aware remote warm:** `warm_remote_capacity` skips when remote sessions are policy denied.
- **Bridge inactive no-op:** `warm_remote_capacity` safely skips when no active bridge handle exists.
- **Missing team context safety:** `notify_team` skips with structured reason if no team context/team file is available.
- **Fallback launcher safety:** `spawn_fallback_agent` fails with a structured reason when launch permissions/context are unavailable.
## Configuration Schema Reference
Top-level object:
```json
{
"version": 1,
"enabled": true,
"maxChainDepth": 2,
"defaultCooldownMs": 30000,
"defaultDedupWindowMs": 30000,
"rules": []
}
```
### Top-Level Fields
| Field | Type | Required | Notes |
|---|---|---:|---|
| `version` | `1` | No | Defaults to `1`. |
| `enabled` | `boolean` | No | Global feature switch for this config file. |
| `maxChainDepth` | `integer` | No | Global depth guard (default `2`, max `10`). |
| `defaultCooldownMs` | `integer` | No | Default rule cooldown in ms (default `30000`). |
| `defaultDedupWindowMs` | `integer` | No | Default action dedup window in ms (default `30000`). |
| `rules` | `HookChainRule[]` | No | Defaults to `[]`. May be omitted or empty; when no rules are present, dispatch is a no-op and returns `enabled: false`. |
> **Note:** An empty ruleset is valid and can be used to keep Hook Chains configured but effectively disabled until rules are added.
### Rule Object (`HookChainRule`)
```json
{
"id": "task-failure-recovery",
"enabled": true,
"trigger": {
"event": "TaskCompleted",
"outcome": "failed"
},
"condition": {
"toolNames": ["Edit"],
"taskStatuses": ["failed"],
"errorIncludes": ["timeout", "permission denied"],
"eventFieldEquals": {
"meta.source": "scheduler"
}
},
"cooldownMs": 60000,
"dedupWindowMs": 30000,
"maxDepth": 2,
"actions": []
}
```
| Field | Type | Required | Notes |
|---|---|---:|---|
| `id` | `string` | Yes | Stable identifier used in telemetry/guards. |
| `enabled` | `boolean` | No | Per-rule switch. |
| `trigger.event` | `HookEvent` | Yes | Event name to match. |
| `trigger.outcome` | `"success"|"failed"|"timeout"|"unknown"` | No | Single outcome matcher. |
| `trigger.outcomes` | `Outcome[]` | No | Multi-outcome matcher. Use either `outcome` or `outcomes`. |
| `condition` | `object` | No | Optional extra matching constraints. |
| `cooldownMs` | `integer` | No | Overrides global cooldown for this rule. |
| `dedupWindowMs` | `integer` | No | Overrides global dedup for this rule. |
| `maxDepth` | `integer` | No | Per-rule depth cap. |
| `actions` | `HookChainAction[]` | Yes | One or more actions to execute in order. |
### Condition Fields
| Field | Type | Notes |
|---|---|---|
| `toolNames` | `string[]` | Matches `tool_name` / `toolName` in event payload. |
| `taskStatuses` | `string[]` | Matches `task_status` / `taskStatus` / `status`. |
| `errorIncludes` | `string[]` | Case-insensitive substring match against `error` / `reason` / `message`. |
| `eventFieldEquals` | `Record<string, string\|number\|boolean>` | Dot-path equality against payload (example: `"meta.source": "scheduler"`). |
### Actions
#### `spawn_fallback_agent`
```json
{
"type": "spawn_fallback_agent",
"id": "fallback-1",
"enabled": true,
"dedupWindowMs": 30000,
"description": "Fallback recovery for failed task",
"promptTemplate": "Recover task ${TASK_SUBJECT}. Event=${EVENT_NAME}, outcome=${OUTCOME}, error=${ERROR}. Payload=${PAYLOAD_JSON}",
"agentType": "general-purpose",
"model": "sonnet"
}
```
#### `notify_team`
```json
{
"type": "notify_team",
"id": "notify-ops",
"enabled": true,
"dedupWindowMs": 30000,
"teamName": "mesh-team",
"recipients": ["*"],
"summary": "Hook chain ${RULE_ID} fired",
"messageTemplate": "Event=${EVENT_NAME} outcome=${OUTCOME}\nTask=${TASK_ID}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
}
```
#### `warm_remote_capacity`
```json
{
"type": "warm_remote_capacity",
"id": "warm-bridge",
"enabled": true,
"dedupWindowMs": 60000,
"createDefaultEnvironmentIfMissing": false
}
```
## Complete Example Configs
### 1) Retry via Fallback Agent
```json
{
"version": 1,
"enabled": true,
"maxChainDepth": 2,
"defaultCooldownMs": 30000,
"defaultDedupWindowMs": 30000,
"rules": [
{
"id": "retry-task-via-fallback",
"trigger": {
"event": "TaskCompleted",
"outcome": "failed"
},
"cooldownMs": 60000,
"actions": [
{
"type": "spawn_fallback_agent",
"id": "spawn-retry-agent",
"description": "Retry failed task with fallback agent",
"promptTemplate": "A task failed. Recover it safely.\nTask=${TASK_SUBJECT}\nDescription=${TASK_DESCRIPTION}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}",
"agentType": "general-purpose",
"model": "sonnet"
}
]
}
]
}
```
### 2) Notify Only
```json
{
"version": 1,
"enabled": true,
"maxChainDepth": 2,
"defaultCooldownMs": 30000,
"defaultDedupWindowMs": 30000,
"rules": [
{
"id": "notify-on-tool-failure",
"trigger": {
"event": "PostToolUseFailure",
"outcome": "failed"
},
"condition": {
"toolNames": ["Edit", "Write", "Bash"]
},
"actions": [
{
"type": "notify_team",
"id": "notify-team-failure",
"recipients": ["*"],
"summary": "Tool failure detected",
"messageTemplate": "Tool failure detected.\nEvent=${EVENT_NAME} outcome=${OUTCOME}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
}
]
}
]
}
```
### 3) Combined Fallback + Notify + Bridge Warm
```json
{
"version": 1,
"enabled": true,
"maxChainDepth": 2,
"defaultCooldownMs": 45000,
"defaultDedupWindowMs": 30000,
"rules": [
{
"id": "full-recovery-chain",
"trigger": {
"event": "TaskCompleted",
"outcomes": ["failed", "timeout"]
},
"condition": {
"errorIncludes": ["timeout", "capacity", "connection"]
},
"cooldownMs": 90000,
"actions": [
{
"type": "spawn_fallback_agent",
"id": "fallback-agent",
"description": "Recover failed task execution",
"promptTemplate": "Recover failed task and produce a concise fix summary.\nTask=${TASK_SUBJECT}\nError=${ERROR}\nPayload=${PAYLOAD_JSON}"
},
{
"type": "notify_team",
"id": "notify-team",
"recipients": ["*"],
"summary": "Recovery chain triggered",
"messageTemplate": "Recovery chain ${RULE_ID} fired.\nOutcome=${OUTCOME}\nTask=${TASK_SUBJECT}\nError=${ERROR}"
},
{
"type": "warm_remote_capacity",
"id": "warm-capacity",
"createDefaultEnvironmentIfMissing": false
}
]
}
]
}
```
## Template Variables
The following placeholders are supported by `promptTemplate`, `summary`, and `messageTemplate`:
- `${EVENT_NAME}`
- `${OUTCOME}`
- `${RULE_ID}`
- `${TASK_SUBJECT}`
- `${TASK_DESCRIPTION}`
- `${TASK_ID}`
- `${ERROR}`
- `${PAYLOAD_JSON}`
## Troubleshooting
### Rule never triggers
- Verify `trigger.event` and `trigger.outcome`/`trigger.outcomes` exactly match dispatched event data.
- Check `condition` filters (especially `toolNames` and `eventFieldEquals` dot-path keys).
- Confirm the config file is valid JSON and schema-valid.
### Actions show as skipped
Common skip reasons:
- `action disabled`
- `rule cooldown active ...`
- `dedup window active ...`
- `max chain depth reached ...`
- `No team context is available ...`
- `Team file not found ...`
- `Remote sessions are blocked by policy`
- `Bridge is not active; warm_remote_capacity is a safe no-op`
- `No fallback agent launcher is registered in runtime context`
### Config changes not reflected
- Loader uses memoization by file mtime/size.
- Ensure your editor writes the file fully and updates mtime.
- If needed, force reload from the caller side with `forceReloadConfig: true`.
### Existing workflows changed unexpectedly
- Set `"enabled": false` at top-level.
- Or globally disable with `CLAUDE_CODE_ENABLE_HOOK_CHAINS=0`.
- Re-enable gradually after validating one rule at a time.