diff --git a/conductor/tracks/nagent_review_20260608/nagent_review_v3_20260619.md b/conductor/tracks/nagent_review_20260608/nagent_review_v3_20260619.md
index 60629e1f..0444fb7c 100644
--- a/conductor/tracks/nagent_review_20260608/nagent_review_v3_20260619.md
+++ b/conductor/tracks/nagent_review_20260608/nagent_review_v3_20260619.md
@@ -123,7 +123,57 @@ The `{ssdl}` markers note the two transformations: checkpoint write is an `[I]`

 ## §3 Hooks

-(filled in by Phase 4 — covers `a4fb141` + both case-study harness scripts)
+**Source:** nagent `a4fb141` (`bin/nagent:1442-1484` + `:1607-1625` + `:1922-1927` + `:2806-2825` + `:3167-3185`, `config.example.json:6-8`, `tests/test_nagent.py:870-960`); plus both case-study harness scripts (`https://raw.githubusercontent.com/macton/pep-copt/main/prove-optimized-harness.sh`, `https://raw.githubusercontent.com/macton/differentiable-collisions-optc/main/prove-optimized-harness.sh`).
+**One-liner:** Per-turn ground-truth injection. A hook runs at the top of every turn (before the model speaks) or after every structured edit; its measured output — exit code, stdout, stderr, or "(no output)" — enters the conversation as a labeled block, so the model responds against measured state instead of its recollection. The case-study repos ARE the hooks: `prove-optimized-harness.sh` is the command wired into `--hook-per-run`.
+**Pattern(s) vs v2.3:** NEW. v2.3 had the conversation-without-ground-truth loop (the model's word was the only word). v3 introduces the per-turn measurement primitive that breaks the loop's dependence on the model's self-reporting. EXTENDS v2.3 Pattern 5 ("the loop") with a measurement injection surface. The case-study methodology cluster (§9) elaborates this into a reusable 5-element pattern.
+**Manual Slop implications:** Manual Slop has analogous hooks already — Tier 4 QA error interception (per `docs/guide_ai_client.md`) and the `ApiHookClient` test harness (per `docs/guide_api_hooks.md`). The generalization is per-turn, not per-error: a Manual Slop hook could be wired into the `run_agent_loop` equivalent (`dispatch_inference`) to inject a status block (build status, test status, dependency-check status) at the top of every turn. The "failure is data, not control flow" principle from `conductor/code_styleguides/error_handling.md` already encodes the "exit code + stderr surfaced" invariant.
+**Decision candidate:** NEW Candidate 19 (MEDIUM). "Per-turn ground-truth hook for Manual Slop": add a per-turn hook primitive that runs a configured command (CLI > config > disabled) at the top of every `send_result()` and injects a `` block; honor the CLI > config > disabled precedence and the failing/quiet-hook-surfaces-output invariant. See `decisions.md` Candidate 19.
+**Cross-refs:** §9 Case-study methodology (the 5-element pattern; hooks are the substrate), §10 PEP case study (the pep-copt harness), §11 Collisions case study (the collisions harness). These three together surface the full abstraction.
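
As a rough illustration of the per-turn hook primitive proposed in Candidate 19, the sketch below shows the three pieces in Python: resolve the command with CLI > config > disabled precedence, run it, and append a labeled block to the conversation. It is not nagent's `run_hook` or `resolve_hooks` code. The function names `resolve_hook`, `run_hook_block`, and `start_turn`, the block layout, and the choice to treat an empty CLI value as absent are all assumptions made for illustration.

```python
import subprocess
from typing import List, Optional


def resolve_hook(cli_value: Optional[str], config_value: Optional[str]) -> Optional[str]:
    """Pick the hook command: CLI beats config; missing or empty means disabled.

    Assumption: an empty CLI value is treated the same as an absent one.
    """
    for candidate in (cli_value, config_value):
        if candidate is not None and candidate.strip():
            return candidate
    return None


def run_hook_block(command: str, label: str) -> str:
    """Run the command through the shell and format the result as a labeled block.

    The block layout here is hypothetical; it only demonstrates that the exit
    code and output are surfaced, and that silence becomes "(no output)".
    """
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    output = (proc.stdout + proc.stderr).strip()
    body = output if output else "(no output)"
    return f"[{label}] command: {command}\nexit code: {proc.returncode}\n{body}"


def start_turn(conversation: List[str], hook_command: Optional[str]) -> None:
    """Append the hook block before the model is called; do nothing if disabled."""
    if hook_command is not None:
        conversation.append(run_hook_block(hook_command, "hook-per-run"))


if __name__ == "__main__":
    conversation: List[str] = []
    # No CLI flag given, so the (hypothetical) config value is used.
    start_turn(conversation, resolve_hook(None, "echo build ok"))
    print(conversation[0])
```

A failing or silent hook still produces a block (a nonzero exit code, or "(no output)"), so the failure stays visible to the model as data instead of being raised as an exception.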
+**Source-read citations:**
+- `bin/nagent:1442-1463` — `run_hook(command, label, path=None)` (a4fb141)
+- `bin/nagent:1466-1484` — `resolve_hooks(cli_per_run, cli_per_file_edit, config_path)` with CLI > config > disabled precedence (a4fb141)
+- `bin/nagent:1607-1611` — `hook_per_file_edit` fires after `` (a4fb141)
+- `bin/nagent:1618-1625` — `hook_per_file_edit` fires after `` in `--file-edit` mode only (scratch writes are not file edits) (a4fb141)
+- `bin/nagent:1922-1927` — `hook_per_run` fires at top of every turn, before `call_llm` (a4fb141)
+- `bin/nagent:2806-2825` — `--hook-per-run` and `--hook-per-file-edit` CLI flags (a4fb141)
+- `bin/nagent:3167-3185` — wiring into `run_agent_loop` (a4fb141)
+- `config.example.json:6-8` — `hook_per_run` and `hook_per_file_edit` config keys (a4fb141)
+- `tests/test_nagent.py:870-883` — `test_run_hook_block_reports_output_and_exit_code` (a4fb141)
+- `tests/test_nagent.py:885-915` — `test_hook_per_run_runs_before_every_turn` (a4fb141)
+- `tests/test_nagent.py:917-942` — `test_hook_per_file_edit_runs_after_file_patch` (a4fb141)
+- `tests/test_nagent.py:944-960` — `test_resolve_hooks_cli_overrides_config` (a4fb141)
+- `prove-optimized-harness.sh` (pep-copt) — 9-step proof + 5 enforcing gates (identity baseline, median-of-5 speedup, decompression-time gate, generalization, determinism)
+- `prove-optimized-harness.sh` (differentiable-collisions-optc) — 10-step proof + 4 enforcing gates (comparator with distance tolerance, contact-point certifier, precompute isolation, determinism)
+**Honest gaps in this cluster:**
+- The "subprocess reach" claim in `bin/nagent:2822-2824` — "A CLI flag applies to this invocation only; set it in the config file to apply it to delegated file-edit subprocesses too" — needs verification. The implementation at `bin/nagent:3167-3185` wires the hooks into `run_agent_loop`'s `main()` call only; whether delegated file-edit subprocesses read the config separately is not visible in this diff. The v3.1 source-read pass should verify the subprocess reach.
+- The "default off" guarantee is not tested. Both hooks default to off (CLI flag absent, config key absent or empty string). A regression test asserting "no CLI flag, no config key → both hooks are None" would harden the contract.
+- The `--hook-per-run` cost discipline ("point it at a fast status command") is documented in `--help` but not enforced. The case-study harnesses use median-of-5 timing in their proofs, which is fast, but a user wiring up a 10-second status command would pay 10 seconds per turn. A future track could add a `--hook-per-run-max-seconds` config knob.
+
+**Pattern deep-dive.** The hooks abstraction is a three-piece composition: **resolve**, **invoke**, **inject**. `resolve_hooks` enforces the CLI > config > disabled precedence (the CLI is the experiment's override; the config is the project's default; empty means off). `run_hook` invokes the command, captures exit code + stdout + stderr, and surfaces "(no output)" when silent. The injection sites are the conversation: per-run at the top of every turn before `call_llm`; per-file-edit after `` or `` in `--file-edit` mode (not scratch writes — the comment at `bin/nagent:1618-1620` notes the distinction explicitly: "A `` only edits a real file in per-file-edit mode ... in main mode it writes scratch, which is not a file edit worth a verify hook").
+
+The case-study harness scripts are the proof that hooks work as intended. Both scripts implement the same skeleton: log + summary + enforcing gate. The log records every step with verbose mode for streaming; the summary collects every verdict at the end (`set +e` so a failing gate still prints); the enforcing gate collects the verdicts and decides pass/fail. Both harness scripts freeze the committed input via `sha256sum` before the run and re-check after — if the harness itself changes the input (a bug), it aborts. Both exclude precompute time from the measured speedup (the build stage cannot precompute the answer; the optimization log explains why). The PEP harness uses pixel-identity + lossless round-trip + size-correctness (the optimized `.pep` must not be larger than the reference `.pep` — speed may not be bought with a bigger file). The collisions harness uses a distance tolerance contract (1mm + 0.1% + conditional) because collision-flag identity is too strict (a face/edge contact has many equally-valid witness points) and an independent contact-point certifier (`validate_contacts`) shares no solver code.
+
+The data shape of the hook output, using survey grammar:
+
+```
+hook-result :=
+
+run { command } :: hook-result {ssdl} [B]            // boundary: LLM-failures
+                                                     // surface, never hidden
+inject { hook-result, conversation } :: ()           // append to conversation file
+
+resolve { cli, config } :: (per_run, per_file_edit)
+                                                     // precedence: CLI > config > disabled
+                                                     // empty string in config means disabled
+```
+
+The `{ssdl}` `[B]` (boundary) marker notes the abstraction: the hook is the boundary where the model's context meets the measured world; the failure of a measurement is data the model can act on, not a control-flow exception. The injection is append-only — the conversation grows by a labeled block, and the next turn sees it as part of the working state.
+
+The case-study methodology cluster (§9) abstracts the harness pattern itself: the hooks + the proof + the optimization log + the committed-input sha256 freeze + the model-as-test-subject framing form a reusable unit that any project adopting nagent can replicate.

 ## §4 Project-local roots