) with explicit timeouts, or direct internal health probes.
- **Background agent runs must be non-blocking.** `system/agent-dispatch` cannot use `execSync`/other blocking subprocess APIs for long codex or claude runs on the host worker; blocking the Bun event loop causes Talon/worker-supervisor health checks to fail, the worker to restart, `/internal/agent-await` to drop, and Inngest runs to go stale.
- **Pi is now the preferred Restate PRD story executor.** `system/agent-dispatch` must honor the requested `cwd` when it calls `infer()`, should enable pi tools when file work is requested (`readFiles` or path-heavy prompts), and should use the dedicated roster agent `story-executor` for Restate PRD stories so they run under the tight execution prompt instead of the generic background-agent system prompt. The host bridge must also write a `running` inbox snapshot before long agent execution starts and dedupe `/internal/agent-dispatch` by `requestId`; otherwise multi-minute Restate retries spawn duplicate story agents and operators get a useless forever-`pending` state.
- **Execution mode: host vs sandbox (ADR-0217 Story 4/next batch).** `system/agent-dispatch` accepts `executionMode: "host" | "sandbox"` (default: `"host"`). Host mode uses the existing shared-checkout path. Sandbox mode now has a concrete backend split: `sandboxBackend: "local" | "k8s"` (default local). The **local** backend is the proved live path on the host worker: it materializes a clean temp checkout at `baseSha`, runs the requested agent inside that isolated repo, exports patch/touched-file artifacts, and then tears the sandbox down without dirtying the operator checkout. **Gate A** (non-coding vertical slice) is proven via `packages/agent-execution/__tests__/gate-a-smoke.test.ts`. **Gate B** (minimal coding sandbox) is proven via `packages/agent-execution/__tests__/gate-b-smoke.test.ts`.
The **k8s** backend is now code-landed and opt-in: `@joelclaw/agent-execution` owns Job spec generation plus Job launch/status/log helpers, `job-runner.ts` prints `SandboxExecutionResult` log markers and POSTs terminal results to `/internal/agent-result`, and `InboxResult` now preserves `sandboxBackend` plus optional Job metadata. Current honest limit: `pi` remains the local-backend story executor for now; the k8s runner is for runner-installed CLIs until host-routed pi-in-pod execution is designed. Deterministic sandbox requests should carry `workflowId`, `storyId`, `baseSha`, `repoUrl`, and `branch`; `trigger-prd.ts` now has explicit tool/backend knobs (`PRD_EXEC_TOOL`, `PRD_EXECUTION_MODE`, `PRD_SANDBOX_BACKEND`).
- **Terminal state guarantees (ADR-0217 Story 5).** `system/agent-dispatch` ensures every execution lands in a terminal state (`completed|failed|cancelled`). Duplicate requests with the same `requestId` are deduped at function entry: if a terminal result already exists, it returns that result without spawning new work. Cancellation via `system/agent.cancelled` kills the active subprocess (tracked in `activeProcesses` map by requestId) and writes a `cancelled` inbox snapshot via the `onFailure` handler.
- **Log surfacing (ADR-0217 Story 5).** All terminal results include `stdout`/`stderr` output (truncated to 10KB each) in the `logs` field. This is captured from subprocess execution and attached to the inbox result for post-mortem debugging. The logs are also emitted via OTEL events for searchability.
- **Do not capture tool-enabled pi attempts by waiting on pipe EOF.** In `src/lib/inference.ts`, background pi runs with tools can spawn descendants that inherit stdout/stderr, leaving `new Response(proc.stdout).text()` or similar pipe readers hanging after the real `pi` child exits.
Redirect stdout/stderr to temp files (or another exit-driven sink), wait for `proc.exited`, then read the captured output so `system/agent-dispatch` can always write a terminal inbox snapshot.
- **Apply the same exit-driven capture rule inside `system/agent-dispatch` command execution.** Codex/Claude/bash subprocesses and local sandbox infra commands can also leave descendants holding stdout/stderr open after the parent exits. If `agent-dispatch` waits on pipe EOF there, terminal inbox writeback stalls and sandbox runs lie in `running` even though the real work already finished or failed.
- **Use `tool: "canary"` for deterministic live verification of the dispatch substrate itself.** This is the non-LLM proof lane for `system/agent-dispatch`: fixed scenarios like `sleep-timeout` and `orphan-stderr` exercise the same subprocess capture + terminal inbox/registry path without depending on model behavior. Canonical timeout proof script: `bun scripts/verify-agent-dispatch-timeout.ts`. Canonical operator surface: `joelclaw status --agent-dispatch-canary`, and the default status envelope now exposes the latest persisted canary summary. Scheduled health integration is gated off by default and only activates when the live worker sets `HEALTH_AGENT_DISPATCH_CANARY_SCHEDULE=signals`.
- **`infer({ timeout })` is an overall budget, not a per-fallback reset.** Story 6 proved that reusing a fresh 10-minute timeout on every fallback attempt creates a hidden 30-minute failure chain (`SIGTERM` → `exit 143`) before the real story budget is exhausted.
`src/lib/inference.ts` must spend the remaining deadline across attempts and preserve up to a one-hour explicit request budget for Restate PRD story runs.
- **Timeout errors must say timeout, not `exit 143: empty output`.** When `pi` is killed by the inference timer, surface `pi timed out after <ms>` in the thrown error and OTEL metadata so operators know it was our budget kill, not a mysterious subprocess crash.
- **Do not import `packages/cli/src/*` from system-bus via relative paths.** Keep runbook resolution local in `packages/system-bus` (or extract to a dedicated leaf package) and avoid creating `@joelclaw/system-bus` ↔ `@joelclaw/sdk` dependency cycles that break Turbo/Vercel.

## Deploy: system-bus-worker (k8s)

```bash
~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh
kubectl -n joelclaw rollout status deployment/system-bus-worker --timeout=180s
joelclaw refresh
```

Builds ARM64 image, pushes to GHCR, updates k8s deployment, verifies rollout.

## Adding a Webhook Provider

See the `webhooks` skill for full details. Quick summary:

1. Create `src/webhooks/providers/<service>.ts` implementing `WebhookProvider`
2. Register in `src/webhooks/server.ts`
3. Add secret to `WEBHOOK_SECRETS` array in `serve.ts`
4. Store secret in agent-secrets: `secrets add <service>_webhook_secret`

## Debugging

```bash
# Check worker health
curl http://localhost:3111/ | jq

# View registered functions
joelclaw functions

# Recent runs
joelclaw runs --count 20

# Inspect a specific run
joelclaw run <RUN_ID>

# Worker logs (k8s)
kubectl logs -n joelclaw deploy/system-bus-worker -f

# Inngest server logs
kubectl logs -n joelclaw inngest-0 | grep ERROR

# Force re-registration
curl -X PUT http://127.0.0.1:3111/api/inngest
```

### Runtime forensics: stale `RUNNING` runs

When Inngest APIs disagree (`runs` list shows `RUNNING`, `run` detail shows terminal or non-cancellable state), treat it as runtime metadata drift, usually after SDK reachability failures.

Operational truths:

- Runtime DB is SQLite inside k8s Inngest pod: `inngest-0:/data/main.db`.
- `trace_runs.status` alone is not sufficient to infer terminality.
- Terminal source-of-truth is the presence of terminal history entries:
  - `FunctionCompleted`
  - `FunctionFailed`
  - `FunctionCancelled`

Safe reconciliation sequence:

1. Preview with `joelclaw inngest sweep-stale-runs`.
2. Apply with `joelclaw inngest sweep-stale-runs --apply` (auto backup + transactional writes).
3. If manual fallback is required:
   - Backup DB: `kubectl -n joelclaw exec inngest-0 -- sqlite3 /data/main.db '.backup /data/main.db.pre-sweep-<ts>.sqlite'`
   - Find stale candidates via `trace_runs` + `function_finishes` + `history` joins.
   - Insert missing terminal history (`FunctionCancelled`) for stale candidates.
   - Ensure `function_finishes` rows exist.
   - Update `trace_runs.status` to cancelled (`500`) only after history/finishes.
4. Verify with `joelclaw run <id>` and a fresh `joelclaw runs --status RUNNING`.

## Key Files

| File | Purpose |
|------|---------|
| `src/serve.ts` | HTTP server, Inngest registration, health endpoint, and host-only internal agent bridge endpoints (`/internal/agent-dispatch`, `/internal/agent-result/:id`, `/internal/agent-await/:id`) |
| `src/inngest/client.ts` | Event type definitions, Inngest client |
| `src/inngest/middleware/gateway.ts` | Gateway context injection |
| `src/inngest/functions/index.host.ts` | Host-role function list |
| `src/inngest/functions/index.cluster.ts` | Cluster-role function list |
| `src/lib/inference.ts` | LLM inference via pi (use this, not raw APIs) |
| `src/observability/emit.ts` | OTEL event emission |
| `src/webhooks/server.ts` | Webhook route registration |
| `~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh` | K8s deploy script |
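
### Sketch: one overall timeout budget across fallback attempts

The `infer({ timeout })` rule in the dispatch notes (spend the remaining deadline across attempts instead of resetting it per fallback) can be illustrated roughly as below. This is a minimal sketch, not the code in `src/lib/inference.ts`: `DeadlineBudget`, `AttemptFn`, `inferWithBudget`, and `timeoutMessage` are hypothetical names, and the exact wording of the timeout message (including the `ms` suffix) is an assumption.

```typescript
// Hypothetical names throughout; none of these are real exports of
// src/lib/inference.ts. The clock is injectable so the logic is testable.
type Clock = () => number;
type AttemptFn<T> = (timeoutMs: number) => Promise<T>;

/** Tracks a single deadline that every fallback attempt draws from. */
class DeadlineBudget {
  readonly totalMs: number;
  private readonly deadlineAt: number;
  private readonly now: Clock;

  constructor(totalMs: number, now: Clock = Date.now) {
    this.totalMs = totalMs;
    this.now = now;
    this.deadlineAt = now() + totalMs;
  }

  /** Milliseconds left before the overall deadline, never negative. */
  remainingMs(): number {
    return Math.max(0, this.deadlineAt - this.now());
  }
}

/** Assumed message shape: name the timeout instead of a bare exit code. */
function timeoutMessage(tool: string, budgetMs: number): string {
  return `${tool} timed out after ${budgetMs}ms`;
}

/** Runs attempts in order; each one receives only the time still left. */
async function inferWithBudget<T>(
  attempts: AttemptFn<T>[],
  budget: DeadlineBudget,
): Promise<T> {
  let lastError: unknown = new Error("no attempts supplied");
  for (const attempt of attempts) {
    const remaining = budget.remainingMs();
    if (remaining === 0) {
      // Budget exhausted: report a timeout rather than starting a new attempt.
      throw new Error(timeoutMessage("pi", budget.totalMs));
    }
    try {
      // Pass the remainder, not a fresh full timeout.
      return await attempt(remaining);
    } catch (error) {
      lastError = error; // fall through to the next fallback
    }
  }
  throw lastError;
}
```

With a 10-minute budget and three fallbacks, a first attempt that burns 7 minutes leaves the second attempt at most 3 minutes, so the chain cannot quietly stretch to 30 minutes.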