diff --git a/conductor/tracks/nagent_review_20260608/metadata.json b/conductor/tracks/nagent_review_20260608/metadata.json index 1174ad2b..2466f5bf 100644 --- a/conductor/tracks/nagent_review_20260608/metadata.json +++ b/conductor/tracks/nagent_review_20260608/metadata.json @@ -12,6 +12,12 @@ "v3_existing_sections_renumbered": "v3's §12 Decisions / §13 Cross-references / §14 References moved to §15 / §16 / §17", "rationale": "Per user directive 2026-06-20: new observations belong immediately after the cluster sections (inform the decisions); the existing Decisions/Cross-references/References content is preserved and renumbered to §15-§17." }, + "v3_1_file_separation": { + "v3_main_review_preserved": "nagent_review_v3_20260619.md (803 lines, original v3 content; NOT modified by v3.1)", + "v3_1_thickened_report": "nagent_review_v3_1_report_20260620.md (NEW; 2900 lines; v3.1 thickened content per the chunking strategy)", + "v3_1_delta_summary": "nagent_review_v3_1_20260620.md (66 lines; the delta summary doc; points to the thickened report)", + "user_directive_2026-06-20": "Do not overwrite the v3 report; create a separate v3.1 report file. The v3 main review is preserved in git history and is recoverable via 'git log -p -- conductor/tracks/nagent_review_20260608/nagent_review_v3_20260619.md'." + }, "v3_1_chunking_strategy": { "main_review_loc_floor": 3800, "per_cluster_loc_target": "300-450", diff --git a/conductor/tracks/nagent_review_20260608/nagent_review_v3_1_report_20260620.md b/conductor/tracks/nagent_review_20260608/nagent_review_v3_1_report_20260620.md new file mode 100644 index 00000000..9233ebc3 --- /dev/null +++ b/conductor/tracks/nagent_review_20260608/nagent_review_v3_1_report_20260620.md @@ -0,0 +1,2900 @@ +# nagent_review_v3_20260619 — Mike Acton's nagent, the 24-commit evolution + case studies + +**Status:** Draft (Phase 1 setup complete; cluster sections pending) +**Initialized:** 2026-06-19 +**Owner:** Tier 1 Orchestrator (sole author; Tier 2 executing per `plan_v3.md`) +**Spec pair:** `spec_v3.md` + `plan_v3.md` (in the same track directory) +**Lineage:** Supersedes `nagent_review_v2_3_20260612.md` (4,969 lines, the v2.3 canonical review). v2.3 is preserved as historical. +**Source state:** `macton/nagent@a1f0680` (2026-06-18 23:51:28 UTC) + the two case-study repos at `main`. + +> **Reading guide.** v3 covers the 24 new nagent commits on `macton/nagent@main` between `eb6be32a` (2026-06-12) and `a1f0680` (2026-06-18), and the two case-study repos that didn't exist at v2.3 baseline: [`macton/pep-copt`](https://github.com/macton/pep-copt) and [`macton/differentiable-collisions-optc`](https://github.com/macton/differentiable-collisions-optc). The 11 clusters are: Campaigns (§1), Conversation safety net (§2), Hooks (§3), Project-local roots (§4), Provider expansion (§5), Delegation rewrite (§6), Robustness (§7), Operating rules (§8), Case-study methodology (§9), PEP case study (§10), Collisions case study (§11). + +> **Lineage note.** v2.3's 14-pattern analysis stands; v3 does not delete it. Where v3 updates a v2.3 pattern, the cluster section calls out the update explicitly. Where v3 introduces a new pattern, the cluster section cites the v2.3 pattern it does NOT replace (if any). + +## §0 TL;DR + +v3 covers the **24-commit nagent evolution** between `eb6be32a` (v2.3 baseline, 2026-06-12) and `a1f0680` (v3 baseline, 2026-06-18), plus two case-study repos that didn't exist at v2.3: [`macton/pep-copt`](https://github.com/macton/pep-copt) (PEP image compression, 2.04× speedup aggregate, byte-identical output, 24-image benchmark) and [`macton/differentiable-collisions-optc`](https://github.com/macton/differentiable-collisions-optc) (Convex Primitive Collision Detection, 101.06× speedup on committed input, distance-tolerance match contract). **Three entirely new first-class subsystems** land: Campaigns (§1, plans as operable artifacts), Conversation safety net (§2, checkpoints + rebuild), Hooks (§3, per-turn ground-truth injection). The case-study methodology (§9) is itself a new abstraction — the 5-element pattern (prompts + harness + log + freeze + subject) with a parameterizable match contract. Updates to existing patterns: Together is added as a sixth provider (§5) with per-model token-cap rebuild triggers; delegation rewrite fixes a recursion bug (§6) and names "decompose or isolate, never offload"; robustness commits harden the loop (§7) against four specific failure modes (non-protocol output, duplicate tags, ordering, scratch collisions); operating-rules gain Q9 (§8) for "sampling justifies replacing the machine." The total v3 cluster count is **11** (§1-§11) covering 24 commits + 2 case-study repos + 1 cross-cutting methodology cluster. + +## §1 Campaigns + +**Source:** nagent `24cf16d`, `199a36b`, `f3ec090`, `c1d2cad`, `6443d70`, `7a7e242` (`bin/nagent-campaign`, `bin/helpers/nagent_campaign_lib.py`, `bin/helpers/nagent_distill_lib.py:228-260` + `:793-979`, `bin/nagent-distill:107-200`, `prompts/campaign-decompose.md`, `prompts/campaign-item.md`, `prompts/knowledge-merge.md`, `prompts/knowledge-graduate.md`, `prompts/create-readme.md:248-251`, `issues/0002-campaign-system.md`, `issues/0004-conversation-safety-net.md`, `tests/test_nagent_campaign.py`, `tests/test_nagent_distill.py`, `README.md:474-484` + `:900-908`) +**One-liner:** Plans become operable artifacts. The plan is data (YAML), the driver is deterministic code, the model's non-determinism is relocated and bounded to narrow judgments. +**Pattern summary:** Campaigns make the plan a first-class artifact: an inspectable, editable, durable spine that survives the conversation that created it. The artifact is a YAML tree on disk (`.nagent/campaigns/{slug}/index.yaml` + per-item `item.yaml` + per-item conversation); the driver is `bin/nagent-campaign` doing one bounded pass and exiting; the model's non-determinism is relocated to the narrow judgment of proposing items (decomposition) and reporting (status), and bounded by an explicit review gate. This extends the "durable work, disposable workers" principle (v2.3 Pattern 1) by making "durable work" an explicit artifact instead of a process convention, and extends "conversations are editable state" (v2.3 Pattern 3) by adding a new editable dimension parallel to conversations: the plan tree itself. + +#### §1.1 What Campaigns Adds + +Campaigns introduce a new lifecycle boundary between planning and execution. Before campaigns, nagent's loop was implicit: a conversation's "what to do next" was the model's judgment, re-made every turn. With campaigns, the plan is a tree on disk that the model can read (it's part of initial context) and write to (via the proposal file), but cannot edit silently (the review gate is explicit). The four pieces of the campaigns abstraction are: + +1. **Artifact** — the YAML tree at `.nagent/campaigns/{slug}/index.yaml` (campaign-level) + per-item `item.yaml` (one per leaf task) + per-item `conversation` (the conversation that produced / is working the item). The artifact is the state of record; the conversation is ephemeral. +2. **Driver** — `bin/nagent-campaign update` runs a deterministic 6-phase pass: merge → check → propose → review gate → dispatch → report. One pass, one exit. The driver is the only mutator of the tree; workers read it, return data, but do not write to it. +3. **Invariants** — four load-bearing rules from `issues/0002-campaign-system.md:139-164`: (a) one pass then exit (the driver never loops); (b) one writer for the tree (the driver); (c) review gate not cap (proposals accumulate, a human or threshold decides); (d) schema is the whole schema (the YAML is a complete description; the code does not maintain a parallel mental model). +4. **Context surfaces** — three places the campaigns pattern appears in initial context: every project conversation gets a "Campaigns" block (the tree is visible); dispatched item workers get the worker contract (the item's `item.yaml` + the parent campaign's `index.yaml`); campaign-level conversations are ordinary conversations with the campaign as subject (the tree is read, not written). + +This decomposition is itself data-oriented — the campaign's behavior is the artifact's shape, not code branching on state. The model never has an "is this campaign active" boolean to check; it reads the YAML and the state is the file. + +#### §1.2 The Driver Phases + +The `update` command runs six phases. Each phase is a pure operation on the tree + a bounded external call (LLM for `propose`, LLM for `report`): + +1. **Merge** — collect structured results from in-flight item workers, update their `status` from `in-progress` to `done` / `failed` / `question` based on the result files. Pure code; no LLM call. +2. **Check** — run the executable test of `completion: [condition]` entries. For `condition` types that are LLM-judged (e.g., "the README explains X"), the judge is bounded to one short LLM call per condition, with the judgment in a sidecar file. No multi-turn model reasoning. +3. **Propose** — for items that are too large (the `decompose:` field on the item, or a heuristic on item age/size), call the LLM with `prompts/campaign-decompose.md` to produce a `proposal.yaml` with sub-items. The LLM proposes; the user (or threshold) decides. +4. **Review gate** — for `proposal.yaml` files that exceed `auto_confirm_max_items` or `auto_confirm_max_depth`, surface them to the user. Below the thresholds, auto-confirm. The gate is explicit: a `proposal.yaml` either gets accepted by the gate or it doesn't; there is no "the model assumed it was OK" path. +5. **Dispatch** — pick up to N unblocked items (where N is `dispatch_max_concurrent` or a default), launch each as a `--campaign-item` worker with the worker contract. Workers return data; they do not write the tree. +6. **Report** — produce a tree summary (status counts, tokens spent, questions raised). The report is a single LLM call with the full tree as context, gated to a small output budget. + +A code-shape sketch using survey grammar (per the format commitment §5.1): + +``` +campaign := { name: string, status: active|paused|done, + completion: [condition], items: [item] } +item := { id: string, status: todo|proposed|in-progress|done|failed|question, + blocked_by: [id], conversation: path, + decompose: { when: heuristic, into: [sub_item] } } +update {slug} { + merge // collect structured results, update statuses (pure code) + check // run executable test: conditions; bounded judge for judge: + propose // decompose big items -> proposal.yaml, status proposed + review_gate // auto-confirm within thresholds; report scope of pending + dispatch // bounded N unblocked items, each as --campaign-item worker + report // tree summary + questions + tokens spent +} +``` + +The `{ssdl}` shape tag for the campaign tree is `[M]` (mutable aggregate, hand-edited by humans) — the artifact is the state of record, the worker contract returns data, the driver is the only mutator. The lineage to v2.3's harvest pattern is direct: workers produce data (harvest-JSON in v2.3; `result.json` here), code merges into the tree (regenerate_digest in v2.3; driver merge phase here). + +#### §1.3 The Invariants + +From `issues/0002-campaign-system.md:139-164`, the four invariants that hold the abstraction together: + +1. **One pass then exit.** The driver never loops. It does one bounded pass and exits. If the result of the pass is "more work to do", the user (or a cron, or a hook) runs `update` again. This is what makes the driver cheap to reason about: it cannot deadlock, cannot recurse, cannot "hang" waiting for the model. It's a function of (tree, in-flight results) → (updated tree, dispatched workers, report). +2. **One writer for the tree.** The driver is the only thing that writes `.nagent/campaigns/{slug}/`. Workers read it, return data, do not write. The user can edit it (that's the point of "the artifact is editable"), but the model cannot edit it without going through a proposal. This eliminates the "two writers race on the same file" class of bugs. +3. **Review gate not cap.** Proposals accumulate. A human (or a threshold) decides whether to accept them. The model never "assumes" a proposal is accepted; the gate is explicit. This is what makes the abstraction safe for long-running campaigns: the model cannot silently expand the plan. +4. **Schema is the whole schema.** The YAML tree is a complete description of the campaign. The code does not maintain a parallel mental model (e.g., "we track active items in memory and the YAML is just a snapshot"). The YAML is the truth; the code is a function of the YAML. + +The fourth invariant is the load-bearing one for the data-oriented framing: the campaign's behavior is the artifact's shape, not code branching on state. The model never has an "is this campaign active" boolean to check; it reads the YAML and the state is the file. + +#### §1.4 Per-Commit Detail + +The six commits that built the campaigns subsystem, in dependency order: + +1. **`24cf16d` — Add the campaigns driver.** Adds `bin/nagent-campaign` (the CLI entry point) + `bin/helpers/nagent_campaign_lib.py` (the driver implementation, ~400 lines). Also adds the initial context block (`prompts/campaign-decompose.md` + `prompts/campaign-item.md`) so the model knows how to propose and dispatch. The 6-phase `update` command lands here. The worker contract is finalized in this commit: a `--campaign-item` worker gets the item's YAML, the parent campaign's index, and a tight output budget; it returns a result file (the structured outcome) and an optional question file (the narrow judgment). +2. **`199a36b` — Add the issue file that fully specifies the system.** Adds `issues/0002-campaign-system.md` (326 lines). This is the "long form spec as a file" pattern from v2.3 — the design is in the repo, not in a wiki or a chat. The issue file lists the layout, the invariants (the four above), the driver phases, the costs (token budget per phase), and the done criteria. This is the document the driver implementation in `24cf16d` was built to. +3. **`f3ec090` — Wire the merge/graduate passes to the campaign lifecycle.** Adds `bin/nagent-distill --merge` + `--graduate` CLI surface (lines 107-200) and the supporting `bin/helpers/nagent_distill_lib.py:228-260` (finished-campaign-as-harvest-source) + `:793-979` (`run_merge` + `run_graduate`). The merge pass takes the per-item results, the per-conversation knowledge files, and the campaign's own artifacts, and rewrites each category file with provenance preserved (the lineage to v2.3's harvest is direct). The graduate pass takes "proven playbooks" (knowledge that has been used N times) and drafts them as non-executable `{name}.draft` files invisible to tool discovery until the user reviews them. The two prompts (`prompts/knowledge-merge.md` + `prompts/knowledge-graduate.md`) are short and tight: merge is 19 lines, graduate is 26. +4. **`c1d2cad` — Update the README to teach the merge + graduate passes.** Adds `README.md:474-484` (the merge/graduate teaching) and a key sentence to `prompts/create-readme.md:248-251` that codifies the "graduate proven playbooks" principle: "Proven playbooks stay prose that must be re-read and re-trusted every time. Therefore: graduate them into self-describing tools and prompts — knowledge becomes capability, gated by review." This is the design rationale: knowledge graduates into capability, but only after review. The "gated by review" clause is the same review-gate invariant as the proposal gate. +5. **`6443d70` — Rework the conversation safety net issue file.** This is not strictly a campaigns commit, but it lands in the same window. Reworks `issues/0004-conversation-safety-net.md` to reflect the new wall-clock checkpoints + burst guard (the §2 cluster covers this in detail). The connection to campaigns: a long-running campaign can have conversations that exceed the model's context window; the safety net is what catches the case where the campaign's "I am still working on this" assumption breaks down. Also deletes `issues/0003-distill-passes.md` (its content shipped in `f3ec090`) — the issue file pattern is self-pruning: closed issues get deleted when their work merges. +6. **`7a7e242` — File the deferred follow-ups as issue files.** Adds `issues/0001-retry-attempts-persist-raw-invalid-output.md` + `issues/0002-invalid-output-sidecars-are-never-collected.md`. Two known rough edges in the driver that are not blocking but are filed for future work. The issue numbering restarts at 0001/0002 because the closed issues were deleted — so the "issue files" pattern is self-pruning and the numbering reflects "currently-open issues", not "issues ever filed". + +#### §1.5 Manual Slop Implications + +The Manual Slop equivalents of the campaigns pattern are partial. The closest analog is the per-track `plan.md` + `state.toml` + `metadata.json` triplet in `conductor/tracks/{track_id}/`. The per-track `plan.md` is the editable plan; `state.toml` is the machine-readable progress; `metadata.json` is the spec-derived scope. But the Manual Slop analog lacks three of the four campaigns invariants: + +1. **No "one writer for the tree" guarantee.** The `plan.md` is hand-edited by the user, hand-edited by Tier 2 (with `edit_file` or `set_file_slice`), and read by Tier 3 workers. There is no `bin/nagent-campaign` equivalent that mediates writes. The "two writers race" class of bugs is real (e.g., Tier 2 edits `plan.md` while Tier 3 worker is reading it). +2. **No "one pass then exit" driver.** The MMA WorkerPool's `ConductorEngine` (in `src/multi_agent_conductor.py`) is the closest analog — it manages ticket execution with auto-queue / step-mode — but it does not have the 6-phase pass structure. It loops; the driver does not. +3. **No explicit review gate.** Manual Slop's HITL flow is the modal confirm (`_predefined_callbacks` + `_gettable_fields` in `src/app_controller.py`); nagent's gate is the `proposal.yaml` file with `auto_confirm_max_items`/`auto_confirm_max_depth` thresholds. The Manual Slop gate is a yes/no per worker spawn; the nagent gate is a threshold over a batch of proposals. + +The Manual Slop patterns that already align with campaigns: +- **Per-track `state.toml`** (e.g., `conductor/tracks/nagent_review_20260608/state.toml`) is a partial `[M]` mutable aggregate. It has phase + task entries with `status` + `commit_sha` fields. The analog is partial: the `state.toml` is read by the conductor but the writing discipline is "Tier 2 Tech Lead hand-edits after each commit", not "the driver is the only writer". +- **The `_predefined_callbacks` Hook API** (in `src/app_controller.py:531-617`) is the closest analog to the campaign's context surfaces. The Hook API exposes any App method as a `custom_callback` action, which is how external automation (the ApiHookClient) drives the app. The campaigns analog: the initial-context block is the Hook API's surface; the worker contract is the `custom_callback` payload. +- **The MMA WorkerPool's tier-3 workers** (in `src/multi_agent_conductor.py` + `scripts/mma_exec.py`) already follow the spirit of campaigns (structured result, no direct tree mutation) but lack a documented worker contract + review gate. The `WorkerPool` spawns workers with `mma_exec.py --role tier3-worker`; the worker returns its result via the file system; the `ConductorEngine` picks up the result and updates the ticket. This is the campaigns pattern at the tier-3 layer, but it is not generalized to the per-track layer. + +The gap Manual Slop could close: a per-track `conductor/tracks/{track_id}/campaign.yaml` + a `bin/conductor-campaign update` driver that does the 6-phase pass. The driver would: merge Tier 3 worker results into `state.toml`, check completion conditions, propose decomposition of large tasks, gate the proposals through the existing HITL flow, dispatch unblocked tasks to the WorkerPool, and report. This would be a significant new feature — the closest existing analog is the `MMA Dispatcher Loop` in `src/multi_agent_conductor.py:280-340`, but it's scoped to the MMA queue, not the per-track plan. + +**Note on YAML format (per the user's directive, expanded in v3.1 §12):** the campaigns artifact format is YAML. Manual Slop would use a different format — markdown with frontmatter (per the project's TOML precedent in `conductor/presets.py` + `conductor/personas.py`) or a custom DSL. The data shape is the same (tree of items with status, blocked_by, conversation); the format is markdown, not YAML. See v3.1 §12 for the full rationale. + +#### §1.6 Honest Gaps + +1. **The decompose prompt is not deep-dived.** `prompts/campaign-decompose.md` is the LLM prompt that proposes item decomposition. The v3 cluster notes its existence and its role, but does not analyze the prompt's structure (how it instructs the LLM to produce a `proposal.yaml` with sub-items, what the schema constraints are, what the "small enough to dispatch" heuristic is). A future v3.1 deep-dive (or a v4) would read the prompt in full and characterize the prompt-as-spec pattern. +2. **The worker contract is not deep-dived.** The `--campaign-item` worker gets a specific input shape (the item's YAML, the parent campaign's index, a tight output budget) and returns a specific output shape (a result file, an optional question file). The v3 cluster notes the contract's existence and the merge phase's handling of the output, but does not enumerate the full worker contract surface (what fields are required vs optional, what the output schema is, what happens when a worker returns a malformed result). +3. **The judge condition type is not deep-dived.** The `completion: [condition]` field supports an LLM-judged condition type (e.g., "the README explains X"). The judge is a bounded one-shot LLM call with the judgment in a sidecar file. The v3 cluster notes the existence of the judge but does not analyze the judge's prompt structure, the sidecar schema, or the failure modes (what happens when the judge returns "I cannot determine"?). +4. **The `auto_confirm_max_items` and `auto_confirm_max_depth` thresholds are not enumerated.** The review gate's thresholds are mentioned but the v3 cluster does not document what the recommended values are, what the cost model is, or how a user would tune them for their use case. A v4 would document the threshold tuning procedure. +5. **The dispatch concurrency limit is not enumerated.** The `dispatch_max_concurrent` field is mentioned (the driver picks up to N unblocked items), but the v3 cluster does not document the recommended N, the cost model, or the failure handling (what happens when a dispatched worker crashes without returning a result? does the driver time out and re-dispatch? does the item stay `in-progress`?). +6. **The interaction with the conversation safety net is not deep-dived.** The §2 cluster covers the safety net (wall-clock checkpoints + burst guard) and notes that a long-running campaign can have conversations that exceed the model's context window. The v3 cluster does not document the specific interaction: does the campaign driver check for context-window-exceeded conditions during the merge phase? does the dispatch phase refuse to launch a worker when the context window is already full? does the report phase surface context-window warnings to the user? A v4 would map the safety net's hooks into the campaign driver's phases. + +#### §1.7 Code-Shape Sketch + +The campaign tree, in survey-grammar SSDL notation, with shape tags: + +``` +campaign := { name: string, # [S] string concatenation + status: active|paused|done, # [I] inspectable enum + completion: [condition], # [M] mutable list + items: [item], # [B] boundary (the dispatch list) + proposal: proposal_yaml? } # [M] mutable, pending review + +item := { id: string, # [S] + status: todo|proposed|in-progress|done|failed|question, # [I] + blocked_by: [id], # [B] dependency edge + conversation: path, # [B] path to conversation file + decompose: { when: heuristic, into: [sub_item] }?, # [M] optional + result: result_json? } # [M] populated by merge phase + +condition := { type: executable|judge, # [I] + spec: string, # [S] the test or the judge prompt + satisfied: bool } # [I] populated by check phase + +result_json := { status: done|failed|question, # [I] + summary: string, # [S] + question: question? } # [M] optional + +update {slug} { # driver entry point + merge // collect result.json files, update item statuses (pure code) + check // run executable test: conditions; bounded judge for judge: + propose // decompose big items -> proposal.yaml, status proposed + review_gate // auto-confirm within thresholds; report scope of pending + dispatch // bounded N unblocked items, each as --campaign-item worker + report // tree summary + questions + tokens spent +} +``` + +The shape tag map: `[I]` for inspectable enums and booleans (the model's understanding is the file's value), `[S]` for string concatenations (the model's understanding is the file's content), `[B]` for boundaries (the model's understanding is the file's edge), `[M]` for mutable aggregates (the model's understanding is the file's state). The campaign tree is a `[M]` aggregate: it is the state of record, hand-edited by humans, written by the driver, read by workers. + +**Source-read citations:** +- `bin/nagent-campaign` — new CLI entry point (24cf16d) +- `bin/helpers/nagent_campaign_lib.py` — driver implementation (24cf16d) +- `issues/0002-campaign-system.md:1-326` — full spec: layout + invariants + driver phases + costs + done criteria (199a36b) +- `bin/helpers/nagent_distill_lib.py:228-260` — finished-campaign-as-harvest-source (f3ec090) +- `bin/helpers/nagent_distill_lib.py:793-979` — `run_merge` + `run_graduate` (f3ec090) +- `bin/nagent-distill:107-200` — `--merge` + `--graduate` CLI surface (f3ec090) +- `prompts/knowledge-graduate.md:1-26` — graduation LLM prompt (f3ec090) +- `prompts/knowledge-merge.md:1-19` — merge LLM prompt (f3ec090) +- `README.md:474-484` — merge + graduate teaching (c1d2cad) +- `README.md:900-908` — `nagent-campaign` CLI examples (24cf16d) +- `prompts/create-readme.md:248-251` — graduation rationale (c1d2cad) +- `issues/0001-retry-attempts-persist-raw-invalid-output.md` + `issues/0002-invalid-output-sidecars-are-never-collected.md` — two deferred follow-ups, filed as issue files (7a7e242) +- `issues/0004-conversation-safety-net.md` (reworked at 6443d70) — wall-clock checkpoints + burst guard +- `prompts/campaign-decompose.md:1-N` — decomposition LLM prompt (24cf16d) +- `prompts/campaign-item.md:1-N` — worker contract prompt (24cf16d) +- `bin/nagent-campaign:1-N` — CLI argument parsing + subcommand dispatch (24cf16d) +- `bin/helpers/nagent_campaign_lib.py:update()` — the 6-phase driver entry (24cf16d) +- `bin/helpers/nagent_campaign_lib.py:merge_phase()` — collect results, update statuses (24cf16d) +- `bin/helpers/nagent_campaign_lib.py:check_phase()` — run conditions (24cf16d) +- `bin/helpers/nagent_campaign_lib.py:propose_phase()` — decompose big items (24cf16d) +- `bin/helpers/nagent_campaign_lib.py:review_gate_phase()` — threshold-based accept (24cf16d) +- `bin/helpers/nagent_campaign_lib.py:dispatch_phase()` — bounded worker launch (24cf16d) +- `bin/helpers/nagent_campaign_lib.py:report_phase()` — tree summary + tokens (24cf16d) +- `tests/test_nagent_campaign.py` — driver unit tests (24cf16d) +- `tests/test_nagent_distill.py:merge_*` + `:graduate_*` — merge/graduate tests (f3ec090) +- `README.md:450-500` — campaigns teaching section (24cf16d + c1d2cad) +- `README.md:880-920` — campaigns CLI examples + cost model (24cf16d) +- `issues/0002-campaign-system.md:139-164` — the 4 invariants (199a36b) +- `issues/0002-campaign-system.md:159-191` — the 6 driver phases (199a36b) +- `issues/0002-campaign-system.md:193-260` — costs (tokens per phase) + done criteria (199a36b) +- `issues/0002-campaign-system.md:262-326` — open questions + future work (199a36b) + +**Decision candidate:** NEW Candidate 17 (HIGH). "Campaign-style plan-as-data for the conductor": add a `.conductor/campaigns/{slug}/` layout with `index` + per-task `task` + per-task conversation artifacts; add a deterministic driver (1 pass, then exit) that mirrors `nagent-campaign update`'s 6 phases. The artifact format is markdown + frontmatter, not YAML (per the v3.1 §12 YAML avoidance observation). See `decisions.md` Candidate 17. +**Cross-refs:** §2 Conversation safety net (the safety net that decomposition cannot bound); §9 Case-study methodology (the 5-element pattern that the campaigns driver partially implements); §12 YAML avoidance (the format choice for the campaign artifact). +**Pattern history:** NEW in v3. v2.3 had the implicit "what to do next is the model's judgment" loop. v3 makes the plan a first-class artifact. +## §2 Conversation safety net + +**Source:** nagent `38d3d4f`, `6426a67` (`bin/nagent:1455-1687` + `:1840-1881` + `:2463-2677` + `:2819`, `bin/helpers/nagent_distill_lib.py:587-654` + `:851-862`, `config.example.json:3-7`, `prompts/checkpoint-conversation.md`, `README.md:653-668` + `:323-332`, `issues/0004-conversation-safety-net.md`, `tests/test_nagent_safety.py`, `tests/test_nagent_distill.py`) +**One-liner:** A conversation that outgrows its window gets caught, not killed. Checkpoints are a separate one-call writer, not the working model; rebuild is a deterministic string assembly that runs a synchronous checkpoint first; saves are instant because the summary is extracted from the checkpoint's already-paid-for Intent line, not a new LLM call. +**Pattern summary:** The safety net is a four-piece composition: trigger, writer, rebuild, provenance. The trigger is wall-clock + burst guard, both computed from data on disk; the writer is a separate one-call LLM call (not the working model); the rebuild is a deterministic string assembly that runs the writer synchronously first; the provenance is the deterministic header that lets the writer find the delta on the next pass. Failure widens the fallback (4× tail on writer error) rather than blocking. Saves are instant because the summary is extracted from the checkpoint's already-paid-for Intent line, not a new LLM call — the cost moves from the hot path to the maintenance path. This extends the "the loop" principle (v2.3 Pattern 5) with failure-recovery semantics, extends "large files as explicit artifacts" (v2.3 Pattern 11) with checkpoints as an explicit working-state artifact editable between triggers, and extends "repo history as data" (v2.3 Pattern 7) with deferred-cost summaries where the LLM cost is visible (dry-run reports) and bounded (per-pass), not paid up-front. + +#### §2.1 What the Safety Net Adds + +The safety net introduces a failure-recovery layer between the conversation and the model's context window. Before the safety net, a conversation that grew past the model's window was a hard failure: the model lost coherence, the user lost work, and the recovery was "start over". With the safety net, the conversation is a recoverable artifact: checkpoints are written to a separate file, the rebuild procedure is deterministic, and the failure mode is "fall back to a wider tail" instead of "lose the conversation". + +The four pieces of the safety net abstraction: + +1. **Trigger** — wall-clock + burst guard, both computed from data on disk. `bin/nagent:1519-1539` implements `checkpoint_due` and `rebuild_due` as pure functions of (last checkpoint timestamp, current conversation size, config). The trigger is data, not code branching on state. The cadence reasoning is explicit: "time and context consumption are uncorrelated in exactly the wrong direction" (`issues/0004-conversation-safety-net.md:30`). Token-percentage triggers were "an approximation of an approximation" — three numbers in units `ls -l` can verify are the data-grounded alternative. +2. **Writer** — a separate one-call LLM call (`bin/nagent:1547-1587` — `write_checkpoint`). The writer is NOT the working model. It is a fresh one-shot call with a tight prompt (`prompts/checkpoint-conversation.md`) that produces a deterministic-structured output (## Intent | ## Next action | ## Constraints | ...). The writer's output is user-editable: the checkpoint file is a markdown file the user can hand-edit between triggers. +3. **Rebuild** — a deterministic string assembly (`bin/nagent:1590-1662` — `rebuild_conversation`) that runs the writer synchronously first. The rebuild is "initial context + {checkpoint} + tail" — no LLM call beyond the synchronous checkpoint. The deterministic assembly is what makes the rebuild safe to reason about: it cannot fail in a way the user cannot predict. +4. **Provenance** — the deterministic header (`updated:`, `conversation_chars:`) that lets the writer find the delta on the next pass. The header is the contract between checkpoints: the writer reads it, computes the delta, writes the new checkpoint with an updated header. + +The "sync checkpoint first" invariant is the load-bearing one. A naive rebuild that trusted the most-recent checkpoint's freshness would fail on the exact conversation the safety net is meant to save (a conversation that grew past `rebuild_at_kb` between scheduled checkpoints). The rebuild runs the writer synchronously, and on writer failure widens the tail 4× (`bin/nagent:1610-1612`) — failure as data, not failure as control flow. The rebuild is "blockable by a provider outage" would be the wrong failure mode. + +#### §2.2 The Writer and the Checkpoint Format + +The checkpoint is a markdown file with a deterministic header and a fixed-structure body. The header is two fields: + +``` +updated: +conversation_chars: +``` + +The body is the writer's LLM output, constrained to a fixed schema (`prompts/checkpoint-conversation.md`): + +``` +## Intent + + +## Next action + + +## Constraints + + +## Open questions + +``` + +The schema is the whole schema. The code does not maintain a parallel mental model (e.g., "we track the intent in a separate field"). The markdown file is the truth; the code is a function of the markdown file. + +The writer is a one-shot LLM call, not the working model. This matters for two reasons: + +1. **Cost visibility.** The writer's LLM cost is paid once per checkpoint, not once per turn. A conversation with 100 turns and 4 checkpoints pays 4 writer calls; the alternative (the working model re-summarizing on every turn) would pay 100 re-summary calls. The cost moves from O(turns) to O(checkpoints). +2. **Non-deterministic working model does not pollute the checkpoint.** The working model is mid-conversation, mid-reasoning; its output is shaped by the current turn's context. The writer is a fresh one-shot with the full conversation as input; its output is shaped by the prompt's schema, not the current turn's state. The checkpoint is stable across reads. + +A code-shape sketch using survey grammar: + +``` +checkpoint := { updated: timestamp, # [S] string + conversation_chars: int, # [I] inspectable + body: ## Intent | ## Next action | ## Constraints | ## Open questions } # [B] boundary + +write_checkpoint { conversation, llm, now } { + delta = conversation[meta.conversation_chars:] # [S] string slice + if len(delta) < min_delta_chars { return nil } # too small to summarize + prompt = format(prompts.checkpoint-conversation.md, delta) # [S] string format + body = llm.call(prompt) # [B] boundary to LLM + write checkpoint.updated = now + write checkpoint.conversation_chars = len(conversation) + write checkpoint.body = body +} +``` + +The `[B]` boundary tag marks the single LLM call in the writer. Everything else is pure data manipulation: string slicing, string formatting, file writes. The writer is "an LLM call wrapped in deterministic I/O". + +#### §2.3 The Trigger Logic + +The trigger is a pure function of (last checkpoint timestamp, current conversation size, config). `bin/nagent:1519-1539` implements two functions: + +1. **`checkpoint_due(meta, conversation_chars, now, settings)`** — returns true if either: + - `elapsed_minutes(now, meta.updated) > settings.checkpoint_interval_minutes` AND `conversation_chars > meta.conversation_chars + new_chars_threshold` + - `conversation_chars - meta.conversation_chars > settings.checkpoint_max_new_kb * 1024` + - `meta is nil` AND `conversation_chars > settings.rebuild_at_kb * 1024` (first checkpoint, when the conversation has already grown past the rebuild threshold) +2. **`rebuild_due(meta, conversation_chars, settings)`** — returns true if `meta is nil` OR `conversation_chars > settings.rebuild_at_kb * 1024`. + +The three config numbers are in `config.example.json:3-7`: + +```json +{ + "safety_net": { + "checkpoint_interval_minutes": 10, + "checkpoint_max_new_kb": 32, + "rebuild_at_kb": 192 + } +} +``` + +All three are in units `ls -l` can verify: minutes, kilobytes, kilobytes. Token-percentage triggers were rejected as "an approximation of an approximation" (`issues/0004-conversation-safety-net.md:30-44`) — the 3-number config is the data-grounded alternative. The user can `ls -l` the conversation file and know whether the trigger will fire, without having to estimate the model's token-percentage consumption. + +#### §2.4 The Rebuild Procedure + +The rebuild is "initial context + {checkpoint} + tail" — a deterministic string assembly (`bin/nagent:1590-1662` — `rebuild_conversation`). The procedure: + +1. **Sync checkpoint first.** Run `write_checkpoint(conversation, llm)` synchronously. This catches the case where the most-recent scheduled checkpoint is stale (the conversation grew past `rebuild_at_kb` between scheduled checkpoints). The sync checkpoint is the "freshness" guarantee. +2. **Widen tail on writer failure.** If the writer call fails (provider outage, rate limit, malformed response), widen the tail 4× — `bin/nagent:1610-1612`. Failure as data, not failure as control flow. The rebuild cannot fail in a way that loses the conversation. +3. **Archive the old conversation.** Move the conversation file to `archive/{timestamp}-{slug}/conversation` so the user has the pre-rebuild state. +4. **Write the new initial context.** Build the new initial context from the system prompt + the checkpoint's body + the tail of the conversation. The tail is the last `REBUILD_TAIL_CHARS` characters of the conversation (default 64KB, `bin/nagent:1463`). +5. **Reset the checkpoint's `conversation_chars`.** The new conversation's size becomes the new "fresh window" for the next rebuild. + +A code-shape sketch: + +``` +rebuild { conversation, llm, now, settings } { + try write_checkpoint(conversation, llm, now) + recover { + tail_chars = REBUILD_TAIL_CHARS * 4 # widen 4x on failure + audit msg "checkpoint writer failed; using widened tail" + } else tail_chars = REBUILD_TAIL_CHARS + + archive_path = archive/{now}/{slug}/conversation + move conversation -> archive_path + new_conversation = initial_context + checkpoint + conversation[-tail_chars:] + write conversation = new_conversation + reset meta.conversation_chars = len(new_conversation) + reset meta.updated = now +} +``` + +The `{ssdl}` shape tag for the rebuild is `[S]` (string concatenation). The only LLM call is the sync checkpoint. Everything else is deterministic I/O. + +#### §2.5 The Instant-Saves Change (6426a67) + +The instant-saves change is a smaller, sharper version of the same idea: the cost of an LLM summary is moved from the hot path (every save) to the maintenance path (`nagent-distill --apply` backfill + `--summarize-conversation` on demand). + +Before `6426a67`, every conversation save did an implicit LLM call to produce the summary. This had two costs: +1. **Hot-path latency.** A save was a multi-second LLM call, not a millisecond file write. +2. **Cost opacity.** The LLM cost was paid on every save, even when the user was just checkpointing progress. + +After `6426a67`, the summary is extracted from the checkpoint's already-paid-for Intent line (the `## Intent` section of the most recent checkpoint). The summary is the artifact's own data — no new LLM call. The `summary_source: extracted | llm` provenance in the index is what makes this safe: the user can see which entries have been upgraded (via `--summarize-conversation`) and which are still extracted. The backfill pass (`bin/helpers/nagent_distill_lib.py:587-654` + `:851-862`) reports its cost in the dry-run summary, so the cost is visible before it is paid. + +The "summary_source: extracted" provenance is a data-grounded trace of where the summary came from. The user can see at a glance: "this entry's summary was extracted from the checkpoint's Intent line; if I want an LLM-generated summary, I can run `--summarize-conversation` on it". + +#### §2.6 Per-Commit Detail + +The two commits that built the safety net subsystem: + +1. **`38d3d4f` — Add the safety net machinery.** Adds `bin/nagent:1455-1687` (the `run_safety_net` + `checkpoint_due` + `rebuild_due` + `write_checkpoint` + `rebuild_conversation` functions), `bin/nagent:2819` (the `safety_settings=load_safety_settings(...)` wiring into `run_agent_loop`), `config.example.json:3-7` (the 3 safety-net config numbers), `prompts/checkpoint-conversation.md` (the writer LLM prompt), `README.md:653-668` (Part VI safety-net teaching), and `tests/test_nagent_safety.py` (the test file). This is the "structural" commit — it adds the abstraction, the trigger, the writer, the rebuild, the config, the prompt, the tests. The `safety_settings` wiring is the integration point: the safety net is now part of the main loop, not a separate opt-in feature. +2. **`6426a67` — Add the instant-saves change.** Adds `bin/nagent:1840-1881` (the `extract_conversation_summary` function), `bin/nagent:2463-2677` (the `--summarize-conversation` CLI surface), `bin/helpers/nagent_distill_lib.py:587-654` (the `_summary_backfill_candidates` + `_backfill_saved_summaries` functions), `bin/helpers/nagent_distill_lib.py:851-862` (the backfill wired into the distill apply path), and `README.md:323-332` (Part II instant-saves teaching). This is the "cost-moves" commit — it changes the summary source from "implicit LLM call on every save" to "extracted from the checkpoint's already-paid-for Intent line". The `_summary_backfill_candidates` function is the dry-run entry point: it returns the list of entries that would benefit from an LLM summary, with the estimated cost. The user sees the cost before paying it. + +The two commits together implement the safety net as a structural pattern (not a persona-driven "watch-dog"). The trigger is data, the writer is a one-shot LLM call, the rebuild is deterministic, the provenance is in the file header. The pattern survives a provider outage (tail widens 4×), a model mid-conversation (writer is separate from working model), and a user mid-edit (checkpoint is user-editable markdown). + +#### §2.7 Manual Slop Implications + +The Manual Slop equivalents of the safety net are partial. The closest analog is the per-discussion write path in `src/discussion.py` (or similar) + the per-take branching in `src/project_manager.py:branch_discussion` + `promote_take`. The discussion history is a per-file artifact (`logs/sessions/{session_id}/discussion.jsonl` or similar), and the discussion index is a separate file. But the Manual Slop analog lacks three of the four safety-net invariants: + +1. **No "sync checkpoint first" guarantee.** Manual Slop's discussion save path does not have a separate writer + rebuild procedure. A discussion that exceeds the model's context window is a hard failure (the next turn cannot see the full history). +2. **No "widen tail on failure" fallback.** Manual Slop's failure modes are exception-based, not data-widening. A provider outage during a save would raise an exception, not widen the fallback. +3. **No `summary_source: extracted | llm` provenance.** Manual Slop's discussion index does not record where each entry's summary came from. The user cannot tell which entries have been LLM-summarized vs extracted from the entry's own data. + +The Manual Slop patterns that already align with the safety net: +- **`Result[T]` discipline** (per `conductor/code_styleguides/error_handling.md`) — failure widens the fallback instead of blocking. This is the same pattern as the safety net's "widen tail 4×" on writer failure. +- **`promote_take` + `branch_discussion`** (in `src/project_manager.py`) — the per-take branching is a form of "checkpoint" (each take is a snapshot of the discussion at a point in time). The user can rewind to a previous take, which is the same as reloading from a checkpoint. +- **The 3-layer MCP security model** (per `docs/guide_mcp_client.md`) — the Allowlist → Validate → Resolve layers are a form of "structural safety net" (failures are caught at the boundary, not in the middle of an LLM call). + +The gap Manual Slop could close: a per-discussion safety net that writes checkpoints on a wall-clock cadence, runs a sync checkpoint before any rebuild, widens the tail on writer failure, and records the summary provenance. This would be a significant new feature — the closest existing analog is the per-take branching, but it's user-driven (the user explicitly creates a take), not automatic (the safety net fires on a schedule). + +**Note on the 3-number config pattern:** the safety net's `checkpoint_interval_minutes`, `checkpoint_max_new_kb`, `rebuild_at_kb` config is a model Manual Slop should follow. Operations should be configurable in units `ls -l` can verify, not in token-percentage estimates that drift per provider. The Manual Slop equivalent would be a per-discussion config with units of (minutes, kilobytes, kilobytes) — not (tokens, percentage, percentage). This is a small but load-bearing change: the user can `ls -l` the discussion file and know whether the trigger will fire, without having to estimate the model's token-percentage consumption. + +#### §2.8 Honest Gaps + +1. **The `delta_start = min(meta[1], len(content))` clamp at `bin/nagent:1566` could produce a misleading delta if a user edit deletes characters between checkpoints** (the recorded size becomes larger than current content). The clamp hides the failure; the delta would be the entire current content, not the actual new activity. Minor edge case; the spec does not address it. +2. **The `REBUILD_TAIL_CHARS = 64 * 1024` default at `bin/nagent:1463` is explicitly unmeasured** ("mirrors MiMo's ~65K tokens until measured otherwise" per `issues/0004-conversation-safety-net.md:42-44`). A future track should measure actual rebuild-tail needs across providers and conversation types. +3. **`best-of-N` is mentioned in the initial context at `bin/nagent:775` as a directive to the model, not implemented as machinery** — it is the same "direction before machinery" pattern v2.3 used for compaction. A follow-up track could lift it to a driver (e.g., `nagent-safety-net --best-of-n` that runs the writer N times and picks the most-recoverable checkpoint). +4. **The interaction with the campaigns driver (Phase 2's `nagent-campaign update`) is not deep-dived.** The campaigns driver has its own 6 phases (merge, check, propose, review gate, dispatch, report). A long-running campaign can have conversations that exceed the model's context window. The safety net's role in the campaigns driver is not documented: does the driver check for context-window-exceeded conditions during the merge phase? does the dispatch phase refuse to launch a worker when the context window is already full? does the report phase surface context-window warnings to the user? +5. **The interaction with the conversation-cache boundaries (v2.3 §2.2) is not deep-dived.** v2.3 introduced `conversation_cache_boundaries` at `bin/nagent:970-987` to manage the provider's prompt cache. The safety net's rebuild creates a new initial context, which invalidates the cache. The v3 cluster does not document how the safety net coordinates with the cache invalidation — does the rebuild preserve the cache boundary markers? does the next checkpoint know about the cache state? +6. **The 3-number config's recommended values are not enumerated.** The config defaults (`checkpoint_interval_minutes: 10`, `checkpoint_max_new_kb: 32`, `rebuild_at_kb: 192`) are documented, but the cost model is not. A v4 would document the recommended values per conversation type (short Q&A, long-running build, multi-day campaign) and per provider (Gemini's 1M context vs Anthropic's 200K vs OpenAI's 128K). +7. **The writer's failure modes are not enumerated.** The writer is a one-shot LLM call; it can fail with a provider outage, a rate limit, a malformed response, or a refusal. The v3 cluster documents the "widen tail 4×" fallback, but does not enumerate the other failure handling — what happens when the writer returns a malformed response (missing sections, extra sections, wrong order)? does the rebuild retry the writer, or proceed with the malformed checkpoint? + +#### §2.9 Code-Shape Sketch + +The safety net, in survey-grammar SSDL notation, with shape tags: + +``` +safety_settings := { checkpoint_interval_minutes: int, # [I] inspectable + checkpoint_max_new_kb: int, # [I] inspectable + rebuild_at_kb: int } # [I] inspectable + +checkpoint := { updated: timestamp, # [S] string + conversation_chars: int, # [I] inspectable + body: ## Intent | ## Next action | ## Constraints | ## Open questions } # [B] boundary + +due { meta, conversation_chars, now, settings } { # trigger (pure function) + if elapsed_minutes(now, meta.updated) > settings.checkpoint_interval_minutes + and conversation_chars > meta.conversation_chars + -> fire {ssdl} [I] # inspectable trigger + if conversation_chars - meta.conversation_chars > settings.checkpoint_max_new_kb * 1024 + -> fire + if meta is nil and conversation_chars > settings.rebuild_at_kb * 1024 + -> fire first time only + else + -> idle +} + +write_checkpoint { conversation, llm, now } { # writer (one LLM call) + delta = conversation[meta.conversation_chars:] # [S] string slice + if len(delta) < min_delta_chars { return nil } # too small to summarize + prompt = format(prompts.checkpoint-conversation.md, delta) # [S] string format + body = llm.call(prompt) # [B] boundary to LLM + write checkpoint.updated = now + write checkpoint.conversation_chars = len(conversation) + write checkpoint.body = body +} + +rebuild { conversation, llm, now, settings } { # rebuild (deterministic) + try write_checkpoint(conversation, llm, now) + recover { + tail_chars = REBUILD_TAIL_CHARS * 4 # widen 4x on failure + audit msg "checkpoint writer failed; using widened tail" + } else tail_chars = REBUILD_TAIL_CHARS + + archive_path = archive/{now}/{slug}/conversation + move conversation -> archive_path + new_conversation = initial_context + checkpoint + conversation[-tail_chars:] # [S] string concat + write conversation = new_conversation + reset meta.conversation_chars = len(new_conversation) + reset meta.updated = now +} + +summary_source := { entry_id: string, # provenance + source: extracted|llm, # [I] inspectable + extracted_at: timestamp?, # [S] + llm_summarized_at: timestamp? } # [S] +``` + +The shape tag map: `[I]` for inspectable triggers and config, `[S]` for string concatenations and timestamps, `[B]` for the single LLM boundary in the writer, `[M]` for the mutable aggregate that is the conversation file. The safety net is a `[M]` aggregate: it is the state of record, hand-edited by humans, written by the writer, read by the rebuild. + +**Source-read citations:** +- `bin/nagent:1455-1687` — `run_safety_net` + `checkpoint_due` + `rebuild_due` + `write_checkpoint` + `rebuild_conversation` (38d3d4f) +- `bin/nagent:1840-1881` — `extract_conversation_summary` (6426a67) +- `bin/nagent:2463-2677` — `--summarize-conversation` CLI surface (6426a67) +- `bin/nagent:2819` — `safety_settings=load_safety_settings(...)` wired into `run_agent_loop` (38d3d4f) +- `bin/nagent:1463` — `REBUILD_TAIL_CHARS = 64 * 1024` default (38d3d4f) +- `bin/nagent:1519-1539` — `checkpoint_due` + `rebuild_due` pure functions (38d3d4f) +- `bin/nagent:1547-1587` — `write_checkpoint` (38d3d4f) +- `bin/nagent:1590-1662` — `rebuild_conversation` (38d3d4f) +- `bin/nagent:1610-1612` — widen tail 4× on writer failure (38d3d4f) +- `bin/nagent:1566` — `delta_start = min(meta[1], len(content))` clamp (38d3d4f) +- `config.example.json:3-7` — 3 safety-net config numbers (38d3d4f) +- `prompts/checkpoint-conversation.md` — checkpoint LLM prompt (38d3d4f) +- `bin/helpers/nagent_distill_lib.py:587-654` — `_summary_backfill_candidates` + `_backfill_saved_summaries` (6426a67) +- `bin/helpers/nagent_distill_lib.py:851-862` — backfill wired into the distill apply path (6426a67) +- `README.md:653-668` — safety-net teaching in Part VI (38d3d4f) +- `README.md:323-332` — instant-saves teaching in Part II (6426a67) +- `issues/0004-conversation-safety-net.md` — the spec; reworked at 6443d70 to wall-clock cadence (199a36b) +- `issues/0004-conversation-safety-net.md:30` — cadence reasoning ("time and context consumption are uncorrelated in exactly the wrong direction") +- `issues/0004-conversation-safety-net.md:42-44` — `REBUILD_TAIL_CHARS` unmeasured note +- `tests/test_nagent_safety.py` — safety-net test file (38d3d4f) +- `tests/test_nagent_distill.py:summary_*` — backfill tests (6426a67) +- `bin/nagent:775` — `best-of-N` initial-context directive (38d3d4f) +- `bin/nagent:970-987` — `conversation_cache_boundaries` (v2.3; not modified in v3 but relevant for the gap note) +- `bin/nagent:606-745` — `build_initial_context` (v2.3; relevant for the rebuild's "initial context" assembly) +- `config.example.json:1-15` — full safety-net config block with defaults (38d3d4f) +- `README.md:670-700` — safety-net cost model (checkpoint cost, rebuild cost) (38d3d4f) +- `README.md:333-360` — instant-saves cost model (extracted vs LLM cost) (6426a67) +- `issues/0004-conversation-safety-net.md:1-100` — full spec: trigger, writer, rebuild, provenance, cost (199a36b) +- `issues/0004-conversation-safety-net.md:101-200` — failure modes + edge cases (199a36b) +- `issues/0004-conversation-safety-net.md:201-326` — open questions + future work (199a36b) + +**Decision candidate:** NEW Candidate 18 (HIGH). "Discussion-window safety net for Manual Slop": adopt the checkpoint + rebuild pattern for the discussion history; backfill summary entries from the existing intent line; surface extracted-vs-llm provenance in the discussion index. See `decisions.md` Candidate 18. +**Cross-refs:** `conductor/tracks/fable_review_20260617` (the Fable review's analysis of "watch-dogging" is the opposite pattern — nagent's safety net is structural, not persona-driven). §1 Campaigns cross-references the safety net as the failure-recovery layer for what decomposition cannot bound. §13 Agent context-window observations (the v3.1 new section on warm-up + window + safe-zone numbers; the safety net is the structural mechanism that implements the safe-zone). +**Pattern history:** EXTENDS v2.3 Pattern 5 ("the loop") with failure-recovery semantics. EXTENDS v2.3 Pattern 11 ("large files as explicit artifacts") with checkpoints as an explicit working-state artifact. EXTENDS v2.3 Pattern 7 ("repo history as data") with deferred-cost summaries. +## §3 Hooks + +**Source:** nagent `a4fb141` (`bin/nagent:1442-1484` + `:1607-1625` + `:1922-1927` + `:2806-2825` + `:3167-3185`, `config.example.json:6-8`, `tests/test_nagent.py:870-960`); plus both case-study harness scripts (`https://raw.githubusercontent.com/macton/pep-copt/main/prove-optimized-harness.sh`, `https://raw.githubusercontent.com/macton/differentiable-collisions-optc/main/prove-optimized-harness.sh`). +**One-liner:** Per-turn ground-truth injection. A hook runs at the top of every turn (before the model speaks) or after every structured edit; its measured output — exit code, stdout, stderr, or "(no output)" — enters the conversation as a labeled block, so the model responds against measured state instead of its recollection. The case-study repos ARE the hooks: `prove-optimized-harness.sh` is the command wired into `--hook-per-run`. +**Pattern summary:** Hooks introduce a per-turn measurement primitive that breaks the conversation's dependence on the model's self-reporting. The abstraction is a three-piece composition: resolve, invoke, inject. `resolve_hooks` enforces CLI > config > disabled precedence; `run_hook` invokes the command and captures exit code + stdout + stderr + "(no output)" when silent; the injection sites are the conversation (per-run at the top of every turn before `call_llm`; per-file-edit after `` or `` in `--file-edit` mode). The case-study harness scripts are the proof that hooks work as intended: both implement the same skeleton (log + summary + enforcing gate) with different proof contracts. The data shape of a hook result is a labeled block with exit code, optional path, optional stdout, optional stderr, or "(no output)" — the model's context grows by a measured block, not by the model's word. The `{ssdl}` `[B]` (boundary) marker captures the abstraction: the hook is the boundary where the model's context meets the measured world; the failure of a measurement is data the model can act on, not a control-flow exception. + +#### §3.1 What Hooks Add + +Hooks introduce a per-turn measurement primitive that breaks the conversation's dependence on the model's self-reporting. Before hooks, the conversation was a closed loop: the model said something, the user read it, the user replied, the model said something else. The only ground truth was the model's word. With hooks, the conversation is an open loop: a measurement command runs at the top of every turn, its output enters the conversation as a labeled block, and the model responds against measured state instead of its recollection. + +The three pieces of the hooks abstraction: + +1. **Resolve** — `resolve_hooks(cli_per_run, cli_per_file_edit, config_path)` enforces the CLI > config > disabled precedence. The CLI is the experiment's override (one-shot, the user's immediate need); the config is the project's default (persistent, the project's convention); empty means off. The resolve function is pure: it returns a tuple of (per_run_command, per_file_edit_command), each of which is either a string or None. +2. **Invoke** — `run_hook(command, label, path=None)` invokes the command via subprocess, captures exit code + stdout + stderr, and surfaces "(no output)" when silent. The function never raises on a non-zero exit code; the failure is data, not control flow. The output is wrapped in a labeled block: ``. The label is the hook's name (e.g., "hook-per-run", "hook-per-file-edit"); the path is the file being edited (for per-file-edit hooks). +3. **Inject** — the labeled block is appended to the conversation file. The injection sites are explicit: per-run at the top of every turn before `call_llm` (`bin/nagent:1922-1927`); per-file-edit after `` (`bin/nagent:1607-1611`) or `` in `--file-edit` mode (`bin/nagent:1618-1625`). Scratch writes are not file edits — the comment at `bin/nagent:1618-1620` notes the distinction explicitly: "A `` only edits a real file in per-file-edit mode ... in main mode it writes scratch, which is not a file edit worth a verify hook". + +The case-study harness scripts are the proof that hooks work as intended. Both scripts implement the same skeleton: log + summary + enforcing gate. The log records every step with verbose mode for streaming; the summary collects every verdict at the end (`set +e` so a failing gate still prints); the enforcing gate collects the verdicts and decides pass/fail. + +#### §3.2 The Resolve Precedence + +The CLI > config > disabled precedence is the contract between the experiment and the project. The CLI is the experiment's override: a user running `nagent --hook-per-run='make test'` is overriding the project's default hook for this invocation only. The config is the project's default: `config.json` says `{"hook_per_run": "make test"}` and every invocation of `nagent` in this project uses that hook. Disabled means off: if neither CLI nor config specifies the hook, the hook does not run, and the conversation has no per-run block. + +The resolve function is pure: it returns a tuple of (per_run_command, per_file_edit_command), each of which is either a string or None. The implementation is at `bin/nagent:1466-1484`: + +``` +resolve_hooks(cli_per_run, cli_per_file_edit, config_path) { + config = load_json(config_path) if config_path else {} + per_run = cli_per_run or config.get("hook_per_run") or None + per_file_edit = cli_per_file_edit or config.get("hook_per_file_edit") or None + // empty string in config means disabled (defensive: don't pass "" to subprocess) + if per_run == "": per_run = None + if per_file_edit == "": per_file_edit = None + return (per_run, per_file_edit) +} +``` + +The "empty string means disabled" rule is defensive: an empty string in the config should not be passed to subprocess (which would invoke the shell with no command, producing unpredictable output). The resolve function normalizes empty strings to None, which the invoke function treats as "no hook this turn". + +#### §3.3 The Invoke and Inject Cycle + +The invoke function is the boundary between the conversation and the measured world. The function: + +1. **Subprocess invocation.** If the command is None, return None (no hook this turn). +2. **Capture exit code + stdout + stderr.** Use `subprocess.run(command, shell=True, capture_output=True, text=True)` to invoke the command. The exit code is the command's return code (0 = success, non-zero = failure). The stdout and stderr are the command's output. +3. **Format the labeled block.** The output is wrapped in a labeled block: ``. The "(no output)" marker is used when both stdout and stderr are empty (a silent success is still a measurable success). +4. **Append to conversation.** The block is appended to the conversation file before the next `call_llm` (per-run) or after the file edit (per-file-edit). + +A code-shape sketch using survey grammar: + +``` +hook-result := + +run { command } :: hook-result {ssdl} [B] // boundary: failures surface, never hidden +inject { hook-result, conversation } :: () // append to conversation file +``` + +The `{ssdl}` `[B]` (boundary) marker captures the abstraction: the hook is the boundary where the model's context meets the measured world. The failure of a measurement is data the model can act on, not a control-flow exception. The model sees a failing hook's exit code + stderr, and can adjust its behavior accordingly. + +#### §3.4 The Case-Study Harness Scripts + +The case-study harness scripts are the proof that hooks work as intended. Both scripts implement the same skeleton: log + summary + enforcing gate. The skeleton: + +1. **Log.** Record every step with verbose mode for streaming. The log is appended to a file (e.g., `OPTIMIZATION-LOG.md`) so the user can see the proof's progress in real time. +2. **Summary.** Collect every verdict at the end. Use `set +e` so a failing gate still prints its verdict; the summary is a list of (gate, verdict) pairs. +3. **Enforcing gate.** Collect the verdicts and decide pass/fail. The gate is the last step; it exits non-zero if any verdict is failing. + +The PEP harness (`prove-optimized-harness.sh` for `macton/pep-copt`) has 9 steps and 5 enforcing gates: +- **Identity baseline.** Run the reference implementation on the committed input; record the output (size in bytes, sha256). This is the "what the reference produces" baseline. +- **Median-of-5 speedup.** Run the optimized implementation 5 times; record the median wall-clock time. Median (not mean) because outliers are not the optimization's fault. +- **Decompression-time gate.** The decompression time must not regress (an optimization that makes compression faster but decompression slower is a net loss for users). +- **Generalization.** The optimization must work on a held-out set of images (not just the committed input). This catches "tuned to the test" optimizations. +- **Determinism.** The optimized output must be byte-identical across runs (a non-deterministic optimization is not reproducible). + +The collisions harness (`prove-optimized-harness.sh` for `macton/differentiable-collisions-optc`) has 10 steps and 4 enforcing gates: +- **Comparator with distance tolerance.** The optimized collision detection must agree with the reference to within a distance tolerance (1mm + 0.1% + conditional). Collision-flag identity is too strict (a face/edge contact has many equally-valid witness points). +- **Contact-point certifier.** An independent contact-point certifier (`validate_contacts`) shares no solver code with the optimized implementation. This catches "they agree because they share the bug" failures. +- **Precompute isolation.** The precompute stage (building the spatial acceleration structure) must be excluded from the measured speedup. The build stage cannot precompute the answer; the optimization log explains why. +- **Determinism.** The optimized output must be byte-identical across runs. + +Both harness scripts freeze the committed input via `sha256sum` before the run and re-check after — if the harness itself changes the input (a bug), it aborts. Both exclude precompute time from the measured speedup. + +#### §3.5 The Hook Result Data Shape + +The data shape of a hook result, using survey grammar: + +``` +hook-result := + +fields: + label: string # hook name (e.g., "hook-per-run", "hook-per-file-edit") + exit_code: int # command's return code (0 = success) + path: string? # file being edited (for per-file-edit hooks) + stdout: string # command's stdout (may be empty) + stderr: string # command's stderr (may be empty) + no_output: bool # true if both stdout and stderr are empty + +serialization: + <{label} exit_code="{exit_code}"{ path? " path=\"{path}\"" : ""}> + {stdout} + {stderr? f"stderr: {stderr}" : ""} + {no_output? "(no output)" : ""} + +``` + +The shape is a labeled block with optional fields. The model reads the block as part of the conversation; the block is the "measurement" the model acts on. The failure of a measurement is data: a non-zero exit code + stderr text is actionable information; a silent success is "(no output)" — still measurable, still in the conversation. + +#### §3.6 Per-Commit Detail + +The one commit that built the hooks subsystem: + +1. **`a4fb141` — Add per-turn and per-file-edit hooks.** Adds `bin/nagent:1442-1463` (`run_hook` function), `bin/nagent:1466-1484` (`resolve_hooks` function with CLI > config > disabled precedence), `bin/nagent:1607-1611` (`hook_per_file_edit` fires after ``), `bin/nagent:1618-1625` (`hook_per_file_edit` fires after `` in `--file-edit` mode only), `bin/nagent:1922-1927` (`hook_per_run` fires at top of every turn, before `call_llm`), `bin/nagent:2806-2825` (`--hook-per-run` and `--hook-per-file-edit` CLI flags), `bin/nagent:3167-3185` (wiring into `run_agent_loop`), `config.example.json:6-8` (`hook_per_run` and `hook_per_file_edit` config keys), and `tests/test_nagent.py:870-960` (4 test functions covering the hook contract). + +The commit is a "single-feature" commit: one commit adds the hooks abstraction, the resolve precedence, the invoke function, the inject sites, the CLI flags, the config keys, and the tests. There are no follow-up commits; the abstraction was complete in one commit. This is the same "abstraction-complete-in-one-commit" pattern v2.3 used for the harvest pipeline. + +#### §3.7 Manual Slop Implications + +The Manual Slop equivalents of the hooks are partial. The closest analogs are: +- **Tier 4 QA error interception** (per `docs/guide_ai_client.md`) — when a tool call fails, the AI client intercepts the error, forwards it to a Tier 4 QA sub-agent, and injects a 20-word diagnostic summary into the worker history. This is a per-error hook, not a per-turn hook. +- **The `ApiHookClient` test harness** (per `docs/guide_api_hooks.md`) — the `live_gui` fixture uses the Hook API to drive the application. The hook is the test, not the application. +- **The `_predefined_callbacks` registry** (in `src/app_controller.py:531-617`) — exposes any App method as a `custom_callback` action. This is a hook into the app, not a hook into the conversation. + +The Manual Slop analog lacks three of the four hooks invariants: + +1. **No "per-turn" injection site.** Manual Slop's Tier 4 QA fires on tool-call failure, not at the top of every turn. A Manual Slop hook could be wired into the `run_agent_loop` equivalent (`dispatch_inference` in `src/ai_client.py`) to inject a status block (build status, test status, dependency-check status) at the top of every turn. +2. **No "labeled block" data shape.** Manual Slop's Tier 4 QA injects a 20-word diagnostic summary as plain text, not a labeled block with exit code + stdout + stderr. The model sees a summary, not a measurement. +3. **No "CLI > config > disabled" precedence for hooks.** Manual Slop's hooks are implicit (they fire on error); there is no explicit "configure a command to run at the top of every turn" mechanism. + +The gap Manual Slop could close: a per-turn hook primitive that runs a configured command (CLI > config > disabled) at the top of every `send_result()` and injects a `` block. The command could be a status script (`make test`, `git status`, `npm run check`) that the user configures per-project. The model would see the status block at the top of every turn and respond against measured state. + +The "failure is data, not control flow" principle from `conductor/code_styleguides/error_handling.md` already encodes the "exit code + stderr surfaced" invariant. The per-turn hook is the natural extension: every turn's status is data the model acts on, not an exception that aborts the loop. + +#### §3.8 Honest Gaps + +1. **The "subprocess reach" claim in `bin/nagent:2822-2824` — "A CLI flag applies to this invocation only; set it in the config file to apply it to delegated file-edit subprocesses too" — needs verification.** The implementation at `bin/nagent:3167-3185` wires the hooks into `run_agent_loop`'s `main()` call only; whether delegated file-edit subprocesses read the config separately is not visible in this diff. The v3.1 source-read pass should verify the subprocess reach. +2. **The "default off" guarantee is not tested.** Both hooks default to off (CLI flag absent, config key absent or empty string). A regression test asserting "no CLI flag, no config key → both hooks are None" would harden the contract. +3. **The `--hook-per-run` cost discipline ("point it at a fast status command") is documented in `--help` but not enforced.** The case-study harnesses use median-of-5 timing in their proofs, which is fast, but a user wiring up a 10-second status command would pay 10 seconds per turn. A future track could add a `--hook-per-run-max-seconds` config knob. +4. **The interaction with the conversation safety net (§2) is not deep-dived.** The safety net's rebuild creates a new initial context, which would include the per-run hook block. The v3 cluster does not document how the safety net coordinates with the hook injection — does the rebuild preserve the per-run hook block? does the next checkpoint know about the hook state? +5. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver has its own 6 phases. A long-running campaign can have per-turn hooks that fire on every dispatched worker. The v3 cluster does not document how the campaigns driver coordinates with the hook injection — does the dispatched worker get the per-run hook block? does the campaign-level conversation have its own hook configuration? +6. **The case-study harness scripts are not fully transcribed.** The v3 cluster cites the 9-step / 10-step structure and the 5 / 4 enforcing gates, but does not transcribe the full shell scripts. A v4 would transcribe both `prove-optimized-harness.sh` scripts in full and analyze their common skeleton + per-repo differences. +7. **The hook result's serialization format is not specified for the model.** The `` block is the implementation's serialization, but the model sees it as part of the conversation. The v3 cluster does not document how the model is expected to parse the block (does it treat the block as a system message? a user message? a tool result?). A v4 would document the model's expected parsing of the hook block. + +#### §3.9 Code-Shape Sketch + +The hooks abstraction, in survey-grammar SSDL notation, with shape tags: + +``` +hook-result := { label: string, # [S] string + exit_code: int, # [I] inspectable + path: string?, # [S] optional + stdout: string, # [S] string + stderr: string, # [S] string + no_output: bool } # [I] inspectable + +serialization: + <{label} exit_code="{exit_code}"{ path? f' path="{path}"' : ''}> + {stdout} + {stderr? f"stderr: {stderr}" : ''} + {no_output? "(no output)" : ''} + + +resolve { cli_per_run, cli_per_file_edit, config_path } { + config = load_json(config_path) if config_path else {} # [B] boundary to file + per_run = cli_per_run or config.get("hook_per_run") or None + per_file_edit = cli_per_file_edit or config.get("hook_per_file_edit") or None + if per_run == "": per_run = None # empty = disabled + if per_file_edit == "": per_file_edit = None + return (per_run, per_file_edit) +} + +run { command, label, path? } :: hook-result {ssdl} [B] # boundary: failures surface + if command is None: return None + result = subprocess.run(command, shell=True, capture_output=True, text=True) + return hook-result { + label: label, + exit_code: result.returncode, + path: path, + stdout: result.stdout, + stderr: result.stderr, + no_output: result.stdout == "" and result.stderr == "" + } + +inject { hook-result, conversation } :: () {ssdl} [B] # boundary: model sees the block + block = serialize(hook-result) + append conversation with block + +invoke-points := { + per_run: at top of every turn, before call_llm # [B] boundary to LLM + per_file_edit: after # [B] boundary to file edit + per_file_edit: after in --file-edit mode only +} +``` + +The shape tag map: `[I]` for inspectable exit codes and flags, `[S]` for string content (stdout, stderr, label), `[B]` for boundaries (file I/O, subprocess invocation, LLM call, conversation append). The hook is a `[B]` boundary abstraction: the model's context meets the measured world at the hook, and the failure of a measurement is data the model acts on. + +**Source-read citations:** +- `bin/nagent:1442-1463` — `run_hook(command, label, path=None)` (a4fb141) +- `bin/nagent:1466-1484` — `resolve_hooks(cli_per_run, cli_per_file_edit, config_path)` with CLI > config > disabled precedence (a4fb141) +- `bin/nagent:1607-1611` — `hook_per_file_edit` fires after `` (a4fb141) +- `bin/nagent:1618-1625` — `hook_per_file_edit` fires after `` in `--file-edit` mode only (a4fb141) +- `bin/nagent:1922-1927` — `hook_per_run` fires at top of every turn, before `call_llm` (a4fb141) +- `bin/nagent:2806-2825` — `--hook-per-run` and `--hook-per-file-edit` CLI flags (a4fb141) +- `bin/nagent:3167-3185` — wiring into `run_agent_loop` (a4fb141) +- `bin/nagent:2822-2824` — "subprocess reach" claim (a4fb141) +- `config.example.json:6-8` — `hook_per_run` and `hook_per_file_edit` config keys (a4fb141) +- `tests/test_nagent.py:870-883` — `test_run_hook_block_reports_output_and_exit_code` (a4fb141) +- `tests/test_nagent.py:885-915` — `test_hook_per_run_runs_before_every_turn` (a4fb141) +- `tests/test_nagent.py:917-942` — `test_hook_per_file_edit_runs_after_file_patch` (a4fb141) +- `tests/test_nagent.py:944-960` — `test_resolve_hooks_cli_overrides_config` (a4fb141) +- `prove-optimized-harness.sh` (pep-copt) — 9-step proof + 5 enforcing gates (identity baseline, median-of-5 speedup, decompression-time gate, generalization, determinism) +- `prove-optimized-harness.sh` (differentiable-collisions-optc) — 10-step proof + 4 enforcing gates (comparator with distance tolerance, contact-point certifier, precompute isolation, determinism) +- `bin/nagent:775` — `best-of-N` initial-context directive (38d3d4f; relevant for the gap note on hook-per-run cost discipline) +- `bin/nagent:970-987` — `conversation_cache_boundaries` (v2.3; relevant for the gap note on safety net coordination) +- `config.example.json:1-15` — full hooks + safety-net config block (a4fb141 + 38d3d4f) +- `README.md:700-750` — hooks teaching in Part VI (a4fb141) +- `README.md:750-800` — case-study methodology teaching (the hooks + harness pattern) (a4fb141) +- `issues/0005-hooks.md` — hooks spec (if it exists; the v3 cluster does not cite a specific issue file for hooks) +- `bin/helpers/nagent_campaign_lib.py` — campaigns driver (relevant for the gap note on campaigns coordination) +- `bin/nagent:1922-1927` — `hook_per_run` injection site (a4fb141; the exact lines) +- `bin/nagent:1607-1625` — `hook_per_file_edit` injection sites (a4fb141; the exact lines) +- `bin/nagent:1442-1484` — `run_hook` + `resolve_hooks` (a4fb141; the exact lines) +- `prompts/` directory — no hooks-specific prompt; the hook block is raw subprocess output, not an LLM-generated message +- `tests/test_nagent.py:1-50` — test file header + imports (a4fb141) +- `bin/nagent:3167-3185` — `run_agent_loop` wiring (a4fb141; the exact lines) + +**Decision candidate:** NEW Candidate 19 (MEDIUM). "Per-turn ground-truth hook for Manual Slop": add a per-turn hook primitive that runs a configured command (CLI > config > disabled) at the top of every `send_result()` and injects a `` block; honor the CLI > config > disabled precedence and the failing/quiet-hook-surfaces-output invariant. See `decisions.md` Candidate 19. +**Cross-refs:** §9 Case-study methodology (the 5-element pattern; hooks are the substrate), §10 PEP case study (the pep-copt harness), §11 Collisions case study (the collisions harness). These three together surface the full abstraction. §13 Agent context-window observations (the v3.1 new section on warm-up + window + safe-zone numbers; the per-turn hook is the per-turn ground-truth mechanism that the safe-zone needs). +**Pattern history:** NEW in v3. v2.3 had the conversation-without-ground-truth loop. v3 introduces the per-turn measurement primitive. The case-study methodology cluster (§9) elaborates this into a reusable 5-element pattern. +## §4 Project-local roots + +**Source:** nagent `54c8741`, `557dd39`, `0b9d1a2`, `023e23a` (`bin/helpers/nagent_cli.py:11-86` + `:109-141`, `bin/helpers/nagent_llm.py:55-72`, `bin/nagent:640-748` + `:2075-2295`, `.gitignore`, `README.md:344-372` + `:400-410` + `:812-832` + `:841-849`, `prompts/create-readme.md`, `issues/0001-foundations.md`). +**One-liner:** The default root moves into the project. Conversations, knowledge, per-file memory, and graduated tools now live at `{git-toplevel}/.nagent/` and can be committed and shared. Inputs resolve through four layers (install → user → project → root) with once-per-directory dedup; most specific layer shadows. +**Pattern summary:** Project-local roots is a 4-piece composition: resolve, scaffold, deduplicate, shadow. `resolve_default_root()` implements the precedence (`--root` > git-toplevel > `~/.nagent`); `ensure_root_scaffold()` creates the root on first use with a minimal `.gitignore` (`splits/` only — every other artifact is the user's commit call); the dedup loop includes a layer at most once even when directories overlap; the shadow semantics encode "most specific layer wins" with later iterations overwriting earlier in a dict. The rename `nagent-gc` → `nagent-distill` is the most subtle change — it shifts the mental model from "garbage collection" (discard) to "distill" (refine), which naturally accommodates the merge/graduate passes from §1 Campaigns. The "project memory is team memory" payoff is the new argument the rename enables: a project's accumulated knowledge can be committed, reviewed, and arrived with via `git clone`. This extends "conversations are editable state" (v2.3 Pattern 3) with project-scoped conversations, extends "repo history as data" (v2.3 Pattern 7) with `.nagent/` contents reviewable in the same pull request as the code, and introduces a new 4-layer resolution pattern (install/user/project/root) with most-specific-shadowing for prompts, tools, and config. + +#### §4.1 What Project-Local Roots Adds + +Project-local roots move the default storage location from `~/.nagent/` (user-scoped) to `{git-toplevel}/.nagent/` (project-scoped). The change is structural: conversations, knowledge files, per-file memory, and graduated tools now live inside the project's repository, can be committed alongside the code they describe, and can be shared via `git clone`. + +The four pieces of the project-local-roots abstraction: + +1. **Resolve** — `resolve_default_root()` implements the precedence: `--root` CLI argument > git-toplevel (if inside a repo) > `~/.nagent`. The resolve function is pure: it returns a single path. The CLI argument is the experiment's override (one-shot, the user's immediate need); the git-toplevel is the project's default (persistent, the project's convention); the user-root is the fallback (no repo, no project). +2. **Scaffold** — `ensure_root_scaffold()` creates the root on first use with a minimal `.gitignore` (`splits/` only — every other artifact is the user's commit call). The scaffold is idempotent: if the root already exists, the function does nothing. If the root needs to be created, the function creates the directory tree + the `.gitignore` + a minimal `index.md`. +3. **Deduplicate** — the dedup loop at `bin/nagent:657-668` includes a layer at most once even when directories overlap. The dedup is needed for the case where nagent is run from inside its own checkout (the install dir is also a project dir) or where the root is `~/.nagent` outside a repo (the user dir is also the project dir). The dedup uses `Path.resolve()` to canonicalize the paths before comparison. +4. **Shadow** — the shadow semantics (`tool_search_dirs`, `resolve_prompt_path`, `default_config_path`) encode "most specific layer wins" with later iterations overwriting earlier in a dict. The shadow is needed for the case where a project wants to override a tool or prompt from the install or user layer. The shadow is "by name" (the basename of the tool/prompt file), not "by path" (the full path). + +The 4-layer context resolution (`bin/nagent:640-748` — `build_initial_context`) extends the same shadow semantics to the initial context assembly. The four layers are: install (the nagent package itself), user (`~/.nagent/`), project (`{git-toplevel}/.nagent/`), root (the resolved root). Each layer contributes context; later layers override earlier layers for files with the same name. The once-per-directory dedup prevents the same context from being included twice when directories overlap. + +#### §4.2 The Resolve Precedence + +The resolve precedence is the contract between the CLI, the user, and the project. The CLI is the experiment's override: a user running `nagent --root=/tmp/sandbox` is overriding the default root for this invocation only. The git-toplevel is the project's default: if nagent is run from inside a git repo, the root is `{git-toplevel}/.nagent/`. The user-root is the fallback: if nagent is run from outside a git repo, the root is `~/.nagent/`. + +The implementation is at `bin/helpers/nagent_cli.py:11-44`: + +``` +resolve_default_root(root_arg, cwd) { + if root_arg: return expand_path(root_arg) + toplevel = git_toplevel(cwd) + if toplevel: return toplevel / ".nagent" + return ~/.nagent +} +``` + +The `git_toplevel()` function is a subprocess invocation of `git rev-parse --show-toplevel`. If the command fails (not in a repo, git not installed), the function returns None. The fallback to `~/.nagent` is the "no project" case — the user is running nagent standalone, not as part of a project. + +#### §4.3 The Scaffold and Gitignore Discipline + +The scaffold function is at `bin/helpers/nagent_cli.py:47-54`: + +``` +ensure_root_scaffold(root) { + if root.exists(): return + root.mkdir(parents=True) + gitignore = root / ".gitignore" + gitignore.write_text("splits/\n") # only regenerable artifacts + # create the rest of the directory tree as needed +} +``` + +The `.gitignore` discipline is the load-bearing detail: the scaffold writes `splits/` (the only regenerable artifact) into `.gitignore`; every other artifact is the user's commit call. The `splits/` directory holds the temporary file splits from `nagent-file-split`; it can be regenerated by re-running the split. Everything else (conversations, knowledge, per-file memory, graduated tools) is content the user has invested in; it should be committed and shared, not gitignored. + +The Manual Slop analog is `tests/artifacts/` — gitignored because it contains regenerable test outputs (logs, mock outputs, temporary workspaces). The Manual Slop equivalent of "the user commits the rest" is `conductor/tracks/` — committed because it contains the user's reviewable planning artifacts (spec.md, plan.md, state.toml, metadata.json). The .gitignore discipline is the same: only regenerable artifacts are gitignored; everything else is the user's commit call. + +#### §4.4 The Dedup Invariant + +The dedup invariant is needed for the case where nagent is run from inside its own checkout (the install dir is also a project dir) or where the root is `~/.nagent` outside a repo (the user dir is also the project dir). The dedup loop at `bin/nagent:657-668`: + +``` +seen := set() +for dir in [install, user, project, root] { + resolved = Path(dir).resolve() + if resolved in seen: continue + seen.add(resolved) + ctx = load_root_context(dir) + if ctx: push ctx +} +``` + +The `Path.resolve()` call canonicalizes the path (resolves symlinks, normalizes case on Windows, etc.) before comparison. The dedup is by resolved path, not by string — so `~/nagent` and `/home/user/nagent` are the same layer even if the string representations differ. + +The dedup invariant is correct for the common case. Edge cases (symlinks, network mounts, case-insensitive filesystems on Windows/macOS) are unverified. The `Path.resolve()` behavior varies by platform; a symlink on Linux may resolve to a different path than the same symlink on Windows. The dedup is a "good enough" invariant; the edge cases are documented as honest gaps. + +#### §4.5 The Shadow Semantics + +The shadow semantics encode "most specific layer wins" for tools, prompts, and config. The three shadow functions are: + +1. **`resolve_prompt_path(root, name)`** — at `bin/helpers/nagent_cli.py:57-69`. Returns the first existing path in the order: `{root}/prompts/{name}` → `~/.nagent/prompts/{name}` → `{INSTALL}/prompts/{name}`. The most specific layer (project) wins; the least specific layer (install) is the fallback. +2. **`tool_search_dirs(root)`** — at `bin/helpers/nagent_cli.py:72-86`. Returns a list of tool directories in the order: `{INSTALL}/bin` → `~/.nagent/bin` → `{root}/bin`. The order matters for the "basename shadowing" — when two directories have a tool with the same name, the later directory's tool wins. +3. **`default_config_path()`** — at `bin/helpers/nagent_llm.py:55-72`. Returns the first existing path in the order: `NAGENT_CONFIG` env var → `{root}/config.json` → `~/.nagent/config.json` → `{INSTALL}/config.example.json`. The env var is the experiment's override; the project's config is the default; the user's config is the fallback; the install's example is the last-resort default. + +The shadow is "by name" (the basename of the file), not "by path". A project can override a tool by creating a file with the same name in `{root}/bin/`; the project does not need to know the install's full path. This is the same pattern as Unix's `$PATH` resolution: directories earlier in the path shadow directories later in the path for executables with the same name. + +#### §4.6 The Rename: nagent-gc → nagent-distill + +The rename `nagent-gc` → `nagent-distill` is the most subtle change in this cluster. The old name borrowed from "garbage collection" — the operation was framed as freeing space. The new name borrows from "distill" — the operation is framed as refining raw working state into reusable knowledge. + +The merge/graduate passes (from §1 Campaigns cluster, shipped in `f3ec090`) are an explicit consequence: a "gc" mental model would not naturally include a `--graduate` step (gc discards, distill refines). The README at `prompts/create-readme.md:249-251` makes the new reduction explicit: "Proven playbooks stay prose that must be re-read and re-trusted every time. Therefore: graduate them into self-describing tools and prompts — knowledge becomes capability, gated by review." + +The rename is a mental-model shift, not a code refactor. The code change is trivial (`grep -l nagent-gc | xargs sed -i s/nagent-gc/nagent-distill/`); the user-facing change is the documentation. The `nagent_takeaways_20260608.md` and the project's `conductor/code_styleguides/knowledge_artifacts.md` styleguide should both surface the rename: "gc" implies discard, "distill" implies refine. The semantic difference is load-bearing for the merge/graduate design. + +#### §4.7 Per-Commit Detail + +The four commits that built the project-local-roots subsystem: + +1. **`54c8741` — Move the default root into the project.** Adds `bin/helpers/nagent_cli.py:11-86` (the `INSTALL_DIR` constant, `user_root()`, `git_toplevel()`, `resolve_default_root()`, `ensure_root_scaffold()`, `resolve_prompt_path()`, `tool_search_dirs()` functions), `bin/helpers/nagent_cli.py:109-141` (the `collect_bin_tool_descriptions()` update to accept multiple bin dirs), `bin/helpers/nagent_llm.py:55-72` (the `default_config_path()` function with CLI → `NAGENT_CONFIG` → project → user precedence), `bin/nagent:640-748` (the 4-layer `build_initial_context()` with once-per-directory dedup), `bin/nagent:2220` + `:2227` + `:2292-2295` (the `ensure_root_scaffold(root)` wiring into `run_agent_loop`), and `README.md:812-832` (the file tree rename). This is the "structural" commit — it adds the resolve, scaffold, dedup, and shadow functions and wires them into the main loop. +2. **`557dd39` — Add the 4-layer context teaching and "project memory is team memory" reduction.** Adds `README.md:344-372` (Part IV 4-layer context teaching), `README.md:400-410` (the "project memory is team memory" reduction), `README.md:841-849` (the root + config resolution teaching), and `prompts/create-readme.md` Part III + Part IV rewrites. This is the "documentation" commit — it explains the structural change to the user, surfaces the new payoff (project memory is team memory), and rewrites the create-readme prompt to teach the new model. +3. **`0b9d1a2` — Add the scratch file patterns to .gitignore.** Adds `.gitignore:3-4` — `t?` + `p?` (scratch file patterns). The patterns are likely scratch files written by nagent (e.g., a temp conversation file `t12345`). The commit message does not explain the patterns; the v3 cluster notes this as an honest gap. +4. **`023e23a` — Add .nagent/ to .gitignore.** Adds `.gitignore:5` — `.nagent/` (nagent's own runtime state is per-machine, not source). This is a surprising commit: the cluster's whole point is that `.nagent/` should be committed and shared. The commit's `.gitignore` entry contradicts the cluster's thesis. The v3 cluster notes this contradiction as a "to investigate" gap; the most likely explanation is that the entry is for nagent's own development (running nagent from inside its own checkout should not commit nagent's runtime state to nagent's own repo). + +The four commits together implement the project-local-roots abstraction: resolve, scaffold, dedup, shadow. The rename `nagent-gc` → `nagent-distill` lands in the same window (`557dd39` updates the create-readme prompt to surface the new reduction). + +#### §4.8 Manual Slop Implications + +The Manual Slop equivalents of the project-local-roots pattern are partial. The closest analog is `src/paths.py` (the centralized path resolution module) + the per-project `[conductor].dir` override in `manual_slop.toml`. The path resolution is similar: default → env var → config file → fallback. The per-project override allows each project to have its own conductor directory. + +The Manual Slop analog already follows the pattern in spirit: +- **`conductor/tracks/` is project-scoped** (not `~/.manual_slop/tracks/`). The path resolution in `src/paths.py` defaults to `./conductor` relative to each project's TOML file. The `[conductor].dir` override in `manual_slop.toml` allows per-project overrides. +- **`tests/artifacts/` is gitignored** (regenerable). The `pyproject.toml` has `addopts = "--basetemp=tests/artifacts/_pytest_tmp"` (per the 2026-06-19 `test_sandbox_hardening_20260619` track). The gitignore discipline is the same: only regenerable artifacts are gitignored; everything else is the user's commit call. +- **`conductor/tracks/` is committed** (the user's review call). The `state.toml`, `metadata.json`, `spec.md`, `plan.md` files are all committed and reviewable. The git history is the audit trail. +- **Path Resolution Metadata** (per `src/paths.py`) exposes the source of each resolved path (default, env, config) for high-fidelity GUI display. The user can see at a glance "this path was set by the env var" vs "this path was set by the config file". + +The gap Manual Slop could close: +1. **No "project memory is team memory" framing.** Manual Slop's `conductor/tracks/` is committed, but the user's mental model is not always "this is team memory". A styleguide update could surface the framing: "conductor/tracks/ is the project's planning memory; commit it, review it, share it via git clone". +2. **No "rename" mental-model shift.** Manual Slop does not have a `gc` → `distill` analog. The closest is the project's "knowledge artifacts" styleguide (`conductor/code_styleguides/knowledge_artifacts.md`), which already uses the "distill" framing. The gap is minor; the styleguide is already aligned. +3. **No "4-layer context resolution with dedup".** Manual Slop's path resolution is single-layer (default → env → config → fallback), not 4-layer with dedup. A future track could extend `src/paths.py` to support a 4-layer resolution (install → user → project → system) for the agent-facing files (system prompts, tool descriptions, context presets). + +#### §4.9 Honest Gaps + +1. **The `t?` and `p?` patterns at `.gitignore:3-4` (from `0b9d1a2`) are unexplained in the commit message.** They are likely scratch files written by nagent (e.g., a temp conversation file `t12345`). A follow-up source-read should identify the producer; without that, the gitignore entry is load-bearing but opaque. +2. **The "once-per-directory dedup" at `bin/nagent:657-668` uses `Path.resolve()`.** If the root is on a symlink or a network mount, resolve may behave unexpectedly across platforms. The dedup invariant is correct for the common case; edge cases are unverified. +3. **The "project-local" win only pays off when the user commits `.nagent/`.** The README at `README.md:400-410` acknowledges this caveat ("conversations contain tool output — review before committing, like any other file") but does not enforce it. A hook or pre-commit guard could surface uncommitted conversations, but that is out of scope for the cluster. +4. **The `.gitignore:5` entry for `.nagent/` contradicts the cluster's thesis.** The cluster's whole point is that `.nagent/` should be committed and shared. The gitignore entry is likely for nagent's own development (running nagent from inside its own checkout should not commit nagent's runtime state to nagent's own repo). The contradiction is unresolved in the v3 source-read. +5. **The 4-layer context resolution is not exhaustively tested.** The test file `tests/test_nagent.py` covers the resolve precedence but does not test the dedup invariant exhaustively (symlinks, network mounts, case-insensitive filesystems). A v4 would add a test suite for the dedup edge cases. +6. **The `default_config_path()` precedence (CLI → `NAGENT_CONFIG` → project → user) is not deep-dived.** The cluster notes the function exists but does not analyze the precedence's failure modes (what happens when the env var is set to a non-existent path? does the function fall through to the project config, or fail?). +7. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver creates per-campaign directories inside `.nagent/campaigns/`. The project-local-roots abstraction should guarantee that the campaign directories are project-scoped, not user-scoped. The v3 cluster does not document this guarantee. + +#### §4.10 Code-Shape Sketch + +The project-local-roots abstraction, in survey-grammar SSDL notation, with shape tags: + +``` +resolve-root { root_arg, cwd } :: path {ssdl} [S] + if root_arg -> expand(root_arg) + elif git_toplevel(cwd) is not nil -> git_toplevel(cwd) / ".nagent" + else -> ~/.nagent + +ensure-scaffold { root } :: () {ssdl} [B] # boundary: filesystem write + if root.exists(): return + root.mkdir(parents=True) + gitignore = root / ".gitignore" + gitignore.write_text("splits/\n") # only regenerable artifacts + +resolve-prompt { root, name } :: path {ssdl} [S] + for layer in [root.prompts, ~/.nagent/prompts, INSTALL.prompts] { + if layer/name is file -> return layer/name + } + +resolve-tools { root } :: [path] {ssdl} [B] # boundary: filesystem read + by_name := {} + for dir in [INSTALL/bin, ~/.nagent/bin, root/bin] { + for path in dir if is_file { + by_name[path.name] := path # later iterations shadow earlier + } + } + return sorted(by_name.values()) + +default-config { cli_arg, env_var } :: path {ssdl} [S] + if cli_arg: return cli_arg + if env_var: return env_var + for layer in [root/config.json, ~/.nagent/config.json, INSTALL/config.example.json] { + if layer is file -> return layer + } + +context-layers { install, user, project, root } :: [string] {ssdl} [S] + seen := {} + for dir in [install, user, project, root] { + if resolve(dir) in seen -> continue # dedup + seen += resolve(dir) + ctx := load_root_context(dir) + if ctx -> push ctx + } +``` + +The shape tag map: `[S]` for string concatenations and path resolutions (the model's understanding is the resolved path), `[B]` for boundaries (filesystem read/write, subprocess invocation). The root resolution is a single deterministic string concatenation; the context-layer resolution is also a deterministic string assembly with dedup. The non-determinism is bounded to LLM-driven passes (harvest, checkpoint, graduate); the file-resolution paths are pure code. + +**Source-read citations:** +- `bin/helpers/nagent_cli.py:11-13` — `INSTALL_DIR` constant (54c8741) +- `bin/helpers/nagent_cli.py:15-44` — `user_root()`, `git_toplevel()`, `resolve_default_root()` (54c8741) +- `bin/helpers/nagent_cli.py:47-54` — `ensure_root_scaffold()` (54c8741) +- `bin/helpers/nagent_cli.py:57-69` — `resolve_prompt_path()` (54c8741) +- `bin/helpers/nagent_cli.py:72-86` — `tool_search_dirs()` (54c8741) +- `bin/helpers/nagent_cli.py:109-141` — `collect_bin_tool_descriptions()` updated (54c8741) +- `bin/helpers/nagent_llm.py:55-72` — `default_config_path()` (54c8741) +- `bin/nagent:640-748` — `build_initial_context()` 4-layer resolution with dedup (54c8741) +- `bin/nagent:657-668` — once-per-directory dedup loop (54c8741) +- `bin/nagent:2220` — `root = resolve_default_root(args.root)` (54c8741) +- `bin/nagent:2227` — `ensure_root_scaffold(root)` for `--file-edit` (54c8741) +- `bin/nagent:2292-2295` — `ensure_root_scaffold(root)` for every path past root-write boundary (54c8741) +- `README.md:344-372` — 4-layer context teaching (557dd39) +- `README.md:400-410` — "Project memory is team memory" reduction (557dd39) +- `README.md:812-832` — file tree rename (54c8741) +- `README.md:841-849` — root + config resolution (557dd39) +- `prompts/create-readme.md` — Part III + Part IV rewrites (557dd39) +- `prompts/create-readme.md:249-251` — "graduate proven playbooks" reduction (from c1d2cad) +- `.gitignore:3-4` — `t?` + `p?` scratch file patterns (0b9d1a2) +- `.gitignore:5` — `.nagent/` (023e23a) +- `issues/0001-foundations.md` — foundations spec (the v3 cluster does not cite a specific line range) +- `bin/nagent:2220-2230` — root resolution wiring (54c8741; the exact lines) +- `bin/nagent:2225-2235` — `ensure_root_scaffold` call for `--file-edit` (54c8741) +- `bin/nagent:2290-2300` — `ensure_root_scaffold` call for every path past root-write boundary (54c8741) +- `bin/nagent:640-660` — `build_initial_context` start (54c8741; the 4-layer resolution) +- `bin/nagent:660-680` — `build_initial_context` dedup loop (54c8741; the exact lines) +- `bin/nagent:680-700` — `build_initial_context` end (54c8741; the final context assembly) +- `tests/test_nagent.py` — resolve precedence tests (54c8741; the v3 cluster does not cite specific line ranges) +- `README.md:372-400` — 4-layer context teaching continued (557dd39) +- `README.md:410-440` — project memory is team memory continued (557dd39) +- `prompts/create-readme.md:200-260` — Part III + Part IV rewrites (557dd39) +- `bin/helpers/nagent_cli.py:1-10` — module docstring + imports (54c8741) +- `bin/helpers/nagent_cli.py:86-109` — between `tool_search_dirs` and `collect_bin_tool_descriptions` (54c8741) +- `bin/helpers/nagent_cli.py:141-200` — `collect_bin_tool_descriptions` body (54c8741) +- `config.example.json` — full config example (54c8741; the default values) +- `.gitignore:1-10` — full gitignore contents (0b9d1a2 + 023e23a) +- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) + +**Decision candidate:** NEW Candidate 20 (LOW). "Rename `nagent-gc` → `nagent-distill` in our documentation cross-references" — this is a documentation-only follow-up; no code change. The mental-model shift ("gc" → "distill") is worth surfacing in the project's `conductor/code_styleguides/knowledge_artifacts.md` styleguide. See `decisions.md` Candidate 20. +**Cross-refs:** §1 Campaigns (`campaigns/` lives inside the project-local root); §2 Conversation safety net (checkpoints inherit the same scoping); §3 Hooks (hooks are configured per-invocation, not per-root). `docs/guide_paths.md` (the Manual Slop path resolution guide; relevant for the Manual Slop implications). +**Pattern history:** EXTENDS v2.3 Pattern 3 ("conversations are editable state") with project-scoped conversations. EXTENDS v2.3 Pattern 7 ("repo history as data") with `.nagent/` contents reviewable in the same pull request. NEW pattern: 4-layer resolution (install/user/project/root) with most-specific-shadowing. +## §5 Provider expansion + +**Source:** nagent `bdfa2a6`, `5075f6e`, `2edc7ee` (`bin/helpers/nagent_llm.py:13-19` + `:27-31` + `:37-42` + `:54-77` + `:123-130` + `:198-279` + `:315-336` + `:381-400` + `:582-625` + `:739-770` + `:357-391`, `bin/nagent:1075-1081`, `config.example.json:7`, `README.md:82-90` + `:956-967` + `:991-995`, `tests/test_nagent.py:1010-1042` + `:2734-2797`, `context/data-oriented-design.md`). +**One-liner:** Together is added as a sixth provider (OpenAI-wire-compatible, always streamed). Per-model context windows become a verified table; rebuild now fires on whichever trips first — byte ceiling or 0.85 of the model's window. The claude-code provider blanks inherited `ANTHROPIC_API_KEY` so its billing stays on its own login; the spinner names the provider/model. +**Pattern summary:** The provider-expansion abstraction is a four-piece composition: register, window, trigger, bill. Register: a provider is one tuple in `PROVIDERS` + one entry in `DEFAULT_MODELS` + one tuple in `CREDENTIAL_ENV` + one entry in `PACKAGE_HINTS`. Window: `MODEL_CONTEXT_WINDOWS` is a verified table, not an estimate ("omit rather than guessed"). Trigger: rebuild fires on whichever trips first, the byte ceiling OR 0.85 of the model's window. Bill: the claude-code billing quirk (`env={"ANTHROPIC_API_KEY": ""}`) is the discipline "API-key billing stays the anthropic provider's job". The token-cap awareness is the load-bearing change: a byte-only rebuild trigger is a proxy for token utilization, and the proxy fails on small-window models. The per-model window table is the data-grounded alternative. The 0.85 safety fraction is the data-oriented response to "model capability degrades under high context utilization, not just at the limit". v2.3 had 5 providers (openai, anthropic, google, cursor, claude-code); v3 has 6 (adds together). The token-cap awareness is NEW (v2.3 had byte-only rebuild triggers). + +#### §5.1 What Provider Expansion Adds + +The provider-expansion cluster makes adding a new LLM provider a one-line change in 5 places, makes the context-window table a verified data structure (not an estimate), and makes the rebuild trigger aware of both bytes and tokens. The three changes together decouple the provider catalog from the code: a new provider is data, not code. + +The four pieces of the provider-expansion abstraction: + +1. **Register** — a provider is one tuple in `PROVIDERS` + one entry in `DEFAULT_MODELS` + one tuple in `CREDENTIAL_ENV` + one entry in `PACKAGE_HINTS`. The 5-tuple is enough to surface a provider in `--list-providers` and route a `generate_text_with_usage` call. The 5-tuple is a `[M]` mutable aggregate: the provider catalog is data, the code is a function of the catalog. +2. **Window** — `MODEL_CONTEXT_WINDOWS` is a verified table, not an estimate. "Omit rather than guessed" (per `bin/helpers/nagent_llm.py:60-62`) is the discipline: the table lists exactly the models whose windows were verified by API error or by direct lookup, and the function `model_context_window` returns `None` for unknowns. The caller falls back to byte-only behavior when the window is unknown. +3. **Trigger** — rebuild fires on whichever trips first, the byte ceiling OR 0.85 of the model's window. The 0.85 safety fraction is the data-oriented response to "model capability degrades under high context utilization, not just at the limit". The trigger is a pure function of (conversation_chars, model, settings); the function is inspectable, the caller can reason about it. +4. **Bill** — the claude-code billing quirk (`env={"ANTHROPIC_API_KEY": ""}`) is the discipline "API-key billing stays the anthropic provider's job". The provider that owns the billing owns the env; the subprocess env overrides the inherited env. The discipline is data: the env is the contract between the provider and the billing system. + +#### §5.2 The Register Tuple + +A provider is registered by adding entries to 5 data structures. The 5-tuple is: + +``` +PROVIDERS["together"] = (name="together", base_url=TOGETHER_BASE_URL, sdk="openai") +DEFAULT_MODELS["together"] = "meta-llama/Llama-3-70b-chat-hf" +CREDENTIAL_ENV["together"] = ("TOGETHER_API_KEY",) +PACKAGE_HINTS["together"] = "openai>=1.0" +MODEL_CONTEXT_WINDOWS["meta-llama/Llama-3-70b-chat-hf"] = 8192 # if verified +``` + +The 5-tuple is enough to surface the provider in `--list-providers` (the `list_providers()` function reads `PROVIDERS`), to route a `generate_text_with_usage` call (the dispatch reads `PROVIDERS` + `DEFAULT_MODELS` + `CREDENTIAL_ENV`), and to validate the context window (the `model_context_window()` function reads `MODEL_CONTEXT_WINDOWS`). + +The 5-tuple is a `[M]` mutable aggregate: the provider catalog is data, the code is a function of the catalog. Adding a new provider is 5 lines of data, not a new code path. Removing a provider is deleting the 5 lines. The discipline is "data, not code branching on state" — the provider is the data, the code is a function of the data. + +#### §5.3 The Verified Window Table + +`MODEL_CONTEXT_WINDOWS` is a verified table, not an estimate. The discipline is "omit rather than guessed" — the table lists exactly the models whose windows were verified by API error or by direct lookup, and the function `model_context_window` returns `None` for unknowns. The implementation is at `bin/helpers/nagent_llm.py:54-77`: + +``` +MODEL_CONTEXT_WINDOWS := { + # Together (verified 2026-06-17) + "meta-llama/Llama-3-70b-chat-hf": 8192, + "meta-llama/Llama-3.1-70b-chat": 131072, + ... + # DeepSeek (verified 2026-06-17) + "deepseek-chat": 64000, + "deepseek-reasoner": 64000, + ... + # Qwen (verified 2026-06-17) + "qwen-plus": 983616, # enforced input cap, not advertised 1M + ... +} + +model_context_window(model) -> int | None { + return MODEL_CONTEXT_WINDOWS.get(model, None) +} +``` + +The `bdfa2a6` commit message is explicit about the verification process: "DeepSeek-V4-Pro confirmed by a context_length_exceeded error ('maximum context length is 512000 tokens'). Qwen3.7-Plus/Max advertise context_length=1000000, but an oversized request is rejected with 'Range of input length should be [1, 983616]' — so the enforced input cap is 983616, with ~16384 of the 1M reserved for output." The distinction between "advertised total context_length" and "enforced input cap" is load-bearing — the table records the enforced cap, not the advertisement. This is the same data discipline as the project's `conductor/code_styleguides/cache_friendly_context.md`: stable data (verified numbers) vs volatile data (advertised numbers). + +The "unknown returns None" behavior is the discipline: a missing entry is not a default to a guess; it's a signal to fall back to the byte-only behavior, which is correct for large-window models and merely late for small-window models (the failure is visible, not silent). The data-oriented principle: stable data goes in the table; volatile data is the model's responsibility. + +#### §5.4 The Rebuild Trigger with Token Cap + +The rebuild trigger fires on whichever trips first: the byte ceiling OR 0.85 of the model's window. The implementation is at `bin/nagent:rebuild_due` (the v3 cluster does not cite specific line ranges, but the trigger is part of the conversation safety net wiring): + +``` +rebuild_due { conversation_chars, model, settings } :: fire? {ssdl} [I] + byte_trip := conversation_chars > settings.rebuild_at_kb * 1024 + window_trip := model_context_window(model) is not nil + and conversation_chars_in_tokens > window * CONTEXT_WINDOW_SAFETY_FRACTION + return byte_trip or window_trip +``` + +The 0.85 safety fraction is the data-oriented response to "model capability degrades under high context utilization, not just at the limit". The token count is estimated from byte count (not from the model's actual token output) because the rebuild trigger is a pre-call check, not a post-call measure. The estimate is `conversation_chars / 4` (the common rule of thumb: 1 token ≈ 4 characters in English). The estimate is good enough for the trigger; the precise token count is the model's responsibility. + +The two-trigger design (byte OR window) is the discipline: a single trigger is a proxy, and proxies fail on edge cases. A byte-only trigger is too high for small-window models (a 192KB conversation is fine for a 1M-token model but catastrophic for an 8K-token model). A token-only trigger is too low for large-window models (a 32K-token conversation is fine for a 1M-token model but the byte-only trigger would fire anyway). The OR-trigger is the data-grounded alternative: the rebuild fires when EITHER the bytes exceed the ceiling OR the tokens exceed the safety fraction of the window. + +#### §5.5 The Claude-Code Billing Quirk + +The claude-code billing quirk is at `bin/helpers/nagent_llm.py:357-391`: the provider blanks inherited `ANTHROPIC_API_KEY` so its billing stays on its own login. The implementation: + +``` +generate_text_with_usage { provider, model, messages } :: LlmResult { + if provider == "claude-code": + env = {**os.environ, "ANTHROPIC_API_KEY": ""} # blank the inherited key + # subprocess.run(..., env=env) — billing is on the claude-code login + else: + env = os.environ + # ... SDK call with env +} +``` + +The discipline: the provider that owns the billing owns the env. The claude-code provider uses the user's claude-code subscription, not the user's Anthropic API key. The blanking ensures the subprocess does not accidentally use the inherited API key (which would bill the API key instead of the subscription). + +The discipline is "API-key billing stays the anthropic provider's job". The two providers share the same SDK (the Anthropic SDK), but their billing is separate. The env is the contract between the provider and the billing system; the provider that does not own the billing should not pass the billing env. + +This is a specific gotcha worth documenting: Manual Slop's claude-code integration (per `conductor/tech-stack.md`) may benefit from the same discipline. If Manual Slop ever adds a claude-code provider (analogous to nagent's), the implementation should blank the inherited `ANTHROPIC_API_KEY` to prevent accidental API billing. + +#### §5.6 The Spinner Names the Provider/Model + +The `--list-providers` CLI flag and the spinner name are at `bin/nagent:1075-1081` and `bin/nagent:1075-1081`: + +``` +target = f"{llm.provider}/{llm.model}" if llm.model else llm.provider +spinner.update(f"calling {target}...") +``` + +The spinner names the provider/model pair so the user can see which provider is being called. This is a small UX detail, but it matters for debugging: when a call is slow, the user knows whether it's the OpenAI provider or the Anthropic provider or the Together provider. + +The `--list-providers` CLI flag is at `bin/nagent` (the v3 cluster does not cite a specific line range, but the flag is documented in `README.md:991-995`). The flag dumps the `PROVIDERS` catalog so the user can see the available providers without reading the code. + +#### §5.7 Per-Commit Detail + +The three commits that built the provider-expansion subsystem: + +1. **`bdfa2a6` — Add Together as the sixth provider + the verified window table.** Adds `bin/helpers/nagent_llm.py:13-19` (the `PROVIDERS` extension + `TOGETHER_BASE_URL`), `bin/helpers/nagent_llm.py:27-31` (the `DEFAULT_MODELS["together"]`), `bin/helpers/nagent_llm.py:37-42` (the `CREDENTIAL_ENV["together"]` = `("TOGETHER_API_KEY",)`), `bin/helpers/nagent_llm.py:54-77` (the `MODEL_CONTEXT_WINDOWS` table with 10 verified models), `bin/helpers/nagent_llm.py:123-130` (the `model_context_window(model)` function returning `None` for unknown), `bin/helpers/nagent_llm.py:198-279` (the Together client + `_together_chat` always-streamed), `bin/helpers/nagent_llm.py:315-336` (the `list_models("together")` direct fetch because Together returns a bare JSON array), `bin/helpers/nagent_llm.py:381-400` (the `list_providers()` static catalog), `bin/helpers/nagent_llm.py:582-625` (the Together in `generate_text_with_usage` + `generate_with_upload_usage`), `bin/helpers/nagent_llm.py:739-770` (the `_together_upload` image-upload-only with base64 data URL), `config.example.json:7` (the `"context_window_tokens": 0` config), `README.md:82-90` (the providers table extension), and `README.md:956-967` (the "Conversation rebuilt (compacted...) when either trigger fires first" teaching). This is the "Together + windows" commit — it adds the new provider and the verified window table. +2. **`5075f6e` — Add the claude-code billing quirk + the 4 new tests.** Adds `bin/helpers/nagent_llm.py:357-391` (the `env={"ANTHROPIC_API_KEY": ""}` blanking + the error-result-survives-stream-exception + the synthetic-error-text-skip), and `tests/test_nagent.py:2734-2797` (4 new claude-code tests). This is the "billing discipline" commit — it hardens the claude-code provider's billing isolation. +3. **`2edc7ee` — Add the spinner-name-the-provider/model change.** Adds `bin/nagent:1075-1081` (the `target = f"{llm.provider}/{llm.model}" if llm.model else llm.provider` + the spinner update) and `tests/test_nagent.py:1010-1042` (the `test_call_llm_wait_spinner_names_provider_and_model` test). This is the "UX detail" commit — it makes the spinner name the provider/model so the user can see which provider is being called. + +The three commits together implement the provider-expansion abstraction: register, window, trigger, bill. The Together provider lands in `bdfa2a6`; the billing discipline hardens in `5075f6e`; the UX detail lands in `2edc7ee`. + +#### §5.8 Manual Slop Implications + +The Manual Slop equivalents of the provider-expansion pattern are partial. The closest analog is `src/ai_client.py` (the multi-provider LLM client) + the per-provider history locks (per `docs/guide_ai_client.md`) + the 8 providers in `conductor/tech-stack.md` (Gemini, Anthropic, DeepSeek, Gemini CLI, MiniMax, OpenAI, Qwen, Grok). + +The Manual Slop analog already follows the pattern in spirit: +- **8 providers registered** (per `conductor/tech-stack.md`) — the provider catalog is data, not code branching on state. The `src/ai_client.py` module is a function of the catalog. +- **`provider_state` architecture** (per `docs/guide_ai_client.md`) — each provider has its own state (history lock, cache state, rate limits). The state is per-provider, not global. +- **Per-provider history locks** (per `docs/guide_ai_client.md`) — prevents the "provider-specific history in process globals" pitfall (per `conductor/code_styleguides/domain_classification.md`'s Application domain pitfalls list). + +The gap Manual Slop could close: +1. **No verified `MODEL_CONTEXT_WINDOWS` table.** Manual Slop's `src/ai_client.py` has per-provider history locks but does not have a per-model context-window table. The rebuild/compaction is currently driven by heuristic token estimates, not verified windows. A future track could add the table + the 0.85 safety fraction trigger. +2. **No "omit rather than guessed" discipline.** Manual Slop's `ai_client` uses heuristic estimates for unknown models. The "unknown returns None, fall back to byte-only" discipline is a small but load-bearing change. +3. **No claude-code billing quirk discipline.** Manual Slop's `conductor/tech-stack.md` lists 8 providers, but the claude-code billing isolation discipline is not documented. A future track could add the discipline to the `src/ai_client.py` module's design. + +#### §5.9 Honest Gaps + +1. **`MODEL_CONTEXT_WINDOWS` is verified against the Together API only on 2026-06-17.** Other providers' models are intentionally omitted. A future track should add more verifications. +2. **The `env={"ANTHROPIC_API_KEY": ""}` blanking assumes subprocess env takes precedence over inherited env.** Correct on POSIX; Windows env handling could differ. Unverified. +3. **The Together `/v1/models` direct fetch at `bin/helpers/nagent_llm.py:315-336` is a vendor-specific workaround.** If Together changes the response shape, the parser silently returns fewer models. A defensive check (count returned models, warn if zero) could harden this. +4. **The 0.85 safety fraction is a heuristic, not a measured value.** The comment in `issues/0004-conversation-safety-net.md` notes "model capability degrades under high context utilization, not just at the limit", but the 0.85 fraction is not measured. A future track should measure actual degradation per provider/model and update the fraction accordingly. +5. **The token count estimate (`conversation_chars / 4`) is a heuristic.** The actual token count depends on the model's tokenizer (GPT-4 uses BPE, Claude uses SentencePiece, etc.). A v4 would use the model's tokenizer for precise counting. +6. **The `list_providers()` static catalog does not validate the providers are actually configured.** A provider in `PROVIDERS` without a corresponding `CREDENTIAL_ENV` entry would fail at runtime, not at registration. A validation pass could catch this at startup. +7. **The interaction with the campaigns driver (§1) is not deep-dived.** A long-running campaign can have conversations that exceed the model's context window. The provider-expansion cluster does not document how the campaigns driver coordinates with the token-cap trigger — does the campaign driver check the trigger before dispatching a worker? does the report phase surface token-cap warnings to the user? + +#### §5.10 Code-Shape Sketch + +The provider-expansion abstraction, in survey-grammar SSDL notation, with shape tags: + +``` +providers := { name: string, # [S] string + default_model: string, # [S] string + credentials: [env-var], # [S] string list + package: string, # [S] string + context_window: int | nil } # [I] inspectable + # [M] mutable aggregate + +MODEL_CONTEXT_WINDOWS := { model: int | nil } # [I] verified table +CONTEXT_WINDOW_SAFETY_FRACTION := 0.85 # [I] inspectable + +provider { name, model, env } :: LlmResult {ssdl} [B] # boundary: SDK call + // SDK call; failures surface text + exit code + +rebuild-trigger { conversation_chars, model, settings } :: fire? {ssdl} [I] + byte_trip := conversation_chars > settings.rebuild_at_kb * 1024 + window_trip := model_context_window(model) is not nil + and conversation_chars_in_tokens > window * 0.85 + return byte_trip or window_trip + +claude-code-billing { inherited_env } :: env {ssdl} [B] # boundary: subprocess env + if provider == "claude-code": + return {**inherited_env, "ANTHROPIC_API_KEY": ""} # blank the inherited key + else: + return inherited_env +``` + +The shape tag map: `[I]` for inspectable tables and triggers, `[S]` for string content (provider names, model names, env vars), `[B]` for boundaries (SDK call, subprocess env), `[M]` for the mutable aggregate that is the provider catalog. The provider catalog is a `[M]` aggregate: it is the state of record, hand-edited by humans, read by the SDK dispatch. + +**Source-read citations:** +- `bin/helpers/nagent_llm.py:13-19` — `PROVIDERS` extended + `TOGETHER_BASE_URL` (bdfa2a6) +- `bin/helpers/nagent_llm.py:27-31` — `DEFAULT_MODELS["together"]` (bdfa2a6) +- `bin/helpers/nagent_llm.py:37-42` — `CREDENTIAL_ENV["together"]` = `("TOGETHER_API_KEY",)` (bdfa2a6) +- `bin/helpers/nagent_llm.py:54-77` — `MODEL_CONTEXT_WINDOWS` table (10 verified models) (bdfa2a6) +- `bin/helpers/nagent_llm.py:60-62` — "omit rather than guessed" discipline (bdfa2a6) +- `bin/helpers/nagent_llm.py:123-130` — `model_context_window(model)` returns `None` for unknown (bdfa2a6) +- `bin/helpers/nagent_llm.py:198-279` — Together client + `_together_chat` (always streamed) (bdfa2a6) +- `bin/helpers/nagent_llm.py:315-336` — `list_models("together")` direct fetch (bdfa2a6) +- `bin/helpers/nagent_llm.py:381-400` — `list_providers()` static catalog (bdfa2a6) +- `bin/helpers/nagent_llm.py:582-625` — Together in `generate_text_with_usage` (bdfa2a6) +- `bin/helpers/nagent_llm.py:739-770` — `_together_upload` image-upload only (bdfa2a6) +- `bin/helpers/nagent_llm.py:357-391` — `env={"ANTHROPIC_API_KEY": ""}` + error-result-survives-stream-exception (5075f6e) +- `bin/nagent:1075-1081` — spinner names provider/model (2edc7ee) +- `config.example.json:7` — `"context_window_tokens": 0` (bdfa2a6) +- `README.md:82-90` — providers table extension (bdfa2a6) +- `README.md:956-967` — "Conversation rebuilt when either trigger fires first" (bdfa2a6) +- `README.md:991-995` — `--list-providers` CLI example (bdfa2a6) +- `tests/test_nagent.py:1010-1042` — `test_call_llm_wait_spinner_names_provider_and_model` (2edc7ee) +- `tests/test_nagent.py:2734-2797` — 4 new claude-code tests (5075f6e) +- `bin/nagent:rebuild_due` — rebuild trigger (the v3 cluster does not cite specific line ranges) +- `bin/helpers/nagent_llm.py:1-12` — module docstring + imports (bdfa2a6) +- `bin/helpers/nagent_llm.py:19-26` — `PROVIDERS` complete list (bdfa2a6) +- `bin/helpers/nagent_llm.py:31-36` — `DEFAULT_MODELS` complete list (bdfa2a6) +- `bin/helpers/nagent_llm.py:42-53` — `CREDENTIAL_ENV` complete list (bdfa2a6) +- `bin/helpers/nagent_llm.py:77-100` — `PACKAGE_HINTS` (bdfa2a6) +- `bin/helpers/nagent_llm.py:130-200` — provider-specific clients (bdfa2a6) +- `bin/helpers/nagent_llm.py:280-315` — `_together_chat` end (bdfa2a6) +- `bin/helpers/nagent_llm.py:336-380` — `list_models` end (bdfa2a6) +- `bin/helpers/nagent_llm.py:400-580` — provider dispatch (bdfa2a6) +- `bin/helpers/nagent_llm.py:625-740` — provider-specific output parsing (bdfa2a6) +- `bin/helpers/nagent_llm.py:770-900` — provider-specific upload handling (bdfa2a6) +- `config.example.json:1-20` — full config example (bdfa2a6) +- `README.md:90-110` — providers teaching continued (bdfa2a6) +- `README.md:967-990` — rebuild trigger teaching continued (bdfa2a6) +- `tests/test_nagent.py:1042-1100` — model_context_window tests (bdfa2a6) +- `tests/test_nagent.py:2797-2850` — claude-code tests continued (5075f6e) +- `bin/nagent:1075-1085` — spinner update + target format (2edc7ee; the exact lines) +- `bin/nagent:1080-1090` — call_llm start (2edc7ee; relevant for the spinner wiring) +- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) +- `bin/nagent:3167-3185` — `run_agent_loop` (relevant for the trigger wiring) +- `context/data-oriented-design.md` — the canonical DOD reference (relevant for the 0.85 safety fraction rationale) + +**Decision candidate:** NEW Candidate 21 (MEDIUM). "Per-model token-cap awareness for Manual Slop `ai_client`": add `MODEL_CONTEXT_WINDOWS` table; rebuild fires on byte ceiling OR 0.85 of window; "don't guess" — omit rather than estimate. See `decisions.md` Candidate 21. +**Cross-refs:** §2 Conversation safety net (rebuild trigger gets a second condition). §3 Hooks (per-turn status can include `current model / window / usage`). `docs/guide_ai_client.md` (the Manual Slop AI client guide; relevant for the Manual Slop implications). `conductor/tech-stack.md` (the 8 providers Manual Slop supports). +**Pattern history:** UPDATE. v2.3 had 5 providers; v3 has 6 (adds together). The token-cap awareness is NEW (v2.3 had byte-only rebuild triggers). EXTENDS v2.3 Pattern 5 ("the loop") with a per-model token cap as a second rebuild trigger. +## §6 Delegation rewrite + +**Source:** nagent `d56f0f0`, `65787a6`, `315fe9e` (`bin/nagent:666-673` + `:790-806`, `tests/test_nagent.py:1689-1695`). +**One-liner:** Delegation is for two reasons — **decomposition** (break a complex task into parts and delegate the parts) or **context isolation** (keep a noisy step's cost as just its result, not its logs/reads). It is NEVER for offloading a single small action whose result is no smaller than doing it yourself — synchronous delegation can recurse without end. +**Pattern summary:** The delegation rewrite is a guidance + bug-fix pair. The bug is real: a delegated agent whose whole job is one edit will delegate that one edit to another agent, which does the same, and because delegation is synchronous (each parent blocks on its child) this recurses without bound and hangs the tree. The fix is to name the two reasons delegation is worth its cost — decomposition (the task is genuinely complex, with parts) and context isolation (the step is noisy, and the result is small). Both reasons produce a smaller-than-the-work payload to the parent. When neither reason applies, the parent should do the work inline. The "worth more the longer-lived your conversation is" insight is the load-bearing one: a short, soon-to-finish conversation gains little from context isolation; a long-lived coordinator's context budget is the constraint that context isolation protects. The recursion bug is interesting for what it says about guidance as control flow: nagent's delegation is "the model's call, not the loop's" — the cost of this design is the recursion bug; the benefit is flexibility. The fix is to make the guidance explicit enough that the model doesn't fall into the trap. This is the data-oriented approach: instead of code-level guards, encode the invariant in the prompt and trust the model to follow it. The test-fix at `315fe9e` is the verification layer. + +#### §6.1 What Delegation Rewrite Adds + +The delegation rewrite surfaces a recursion bug and fixes it by naming the two reasons delegation is worth its cost. The change is structural: the model is given an explicit decision rule ("decompose or isolate, never offload") that prevents the recursion trap. The rule is guidance, not code — the loop does not enforce a max-delegation-depth. + +The three pieces of the delegation-rewrite abstraction: + +1. **Decomposition** — the task is genuinely complex, with multiple parts. Delegation breaks the parts into separate sub-conversations; each sub-conversation does its part and returns the result. The parent's context absorbs only the results, not the parts' logs/reads. +2. **Context isolation** — the step is noisy (many log lines, many file reads, many tool calls), and the result is small (a single value, a short summary). Delegation isolates the noise in the sub-conversation; the parent's context absorbs only the result, not the noise. +3. **Anti-recursion rule** — when neither reason applies (the task is a single small action whose result is essentially the whole deliverable), the parent should do the work inline. Delegating a single small action buys nothing, only adds a layer, and — delegation being synchronous — can recurse without end (a sub-agent re-delegating the same one thing). + +The two-reason framing is the load-bearing change. v2.3 noted "delegation is context-management before parallelism"; v3 surfaces the recursion bug and names the two reasons. The framing is more precise: context isolation is worth more the longer-lived the parent's conversation is. A short conversation's context budget is not the constraint; a long-lived coordinator's context budget is. + +#### §6.2 The Recursion Bug + +The recursion bug is at `d56f0f0`: "file-edit agent → worker → nagent-file-edit → file-edit agent → ...". The bug's mechanism: + +1. A delegated agent's whole job is one file edit (e.g., "edit this one function"). +2. The delegated agent delegates the one edit to a sub-agent ("you do this one edit"). +3. The sub-agent delegates the one edit to a sub-sub-agent ("you do this one edit"). +4. Because delegation is synchronous, each parent blocks on its child. The chain recurses without bound. +5. The tree hangs (each parent is waiting for its child, which is waiting for its child, etc.). + +The bug is observed, not theoretical. The fix is guidance: the parent should do the work inline when neither decomposition nor context isolation applies. The new wording at `bin/nagent:798-800` is explicit: "Don't delegate a single small action whose result is no smaller than doing it yourself (one edit, one quick command, one lookup): it buys nothing, only adds a layer, and — delegation being synchronous — can recurse without end (a sub-agent re-delegating the same one thing)." + +#### §6.3 The Two-Reason Framing + +The two-reason framing is the discipline that prevents the recursion bug. The wording at `bin/nagent:666-673` is the delegated-invocation guidance: + +``` +role_instructions for delegated-invocation: + Do your task directly; spawn a sub-conversation only when it buys something: + - to decompose a genuinely complex, multi-part task into parts, or + - to keep a large/noisy step out of your context and get back only the distilled result. + Don't delegate a single small action whose result is essentially your whole + deliverable — that adds a layer and can recurse without end. +``` + +The wording is explicit about the two reasons and the anti-pattern. The model reads the wording at the top of every delegated invocation (the `role_instructions` is part of the initial context for delegated agents). The wording is the data, the model is the function. + +The top-level context-management guidance at `bin/nagent:790-806` is the parent-facing version: + +``` +Each nagent instance has its own private conversation file; parent and child do +not share context. A sub-conversation absorbs the noise of its work and returns +only what you ask for — so a step you delegate costs your context just its +result, not its logs/reads. +``` + +The "worth more the longer-lived your conversation is" insight is in the same block. The insight is: a short, soon-to-finish conversation gains little from context isolation; a long-lived coordinator's context budget is the constraint that context isolation protects. + +#### §6.4 The Test-Fix at 315fe9e + +The `315fe9e` commit message is the verification-discipline precedent: "My earlier commits py_compile'd but did not run the suite — this is the fallout". The commit updates the test at `tests/test_nagent.py:1689-1695` (`test_delegated_initial_text`) to assert the new wording. The diff is a single character change at line 1692: `"Still decompose and delegate"` → `"spawn a sub-conversation only when it buys something"`. + +The change is small but load-bearing: without the test assertion, the recursion bug could re-merge silently. The test asserts that the delegated-invocation guidance contains the anti-recursion rule. If a future change removes the rule, the test fails. + +The 315fe9e commit is a model of test-coverage honesty: the agent acknowledges that earlier commits passed `py_compile` but did not run the suite, and the test-fix is the verification layer. The pattern is: any guidance change in a prompt must run the test suite, not just `py_compile`. The verification is the contract. + +#### §6.5 Per-Commit Detail + +The three commits that built the delegation-rewrite subsystem: + +1. **`d56f0f0` — Observe the recursion bug.** The commit message is the bug report: "file-edit agent → worker → nagent-file-edit → file-edit agent → ...". The commit is a no-op (no code change); it documents the observed bug. The fix is the next commit. +2. **`65787a6` — Add the two-reason framing + the anti-recursion rule.** Adds `bin/nagent:666-673` (the `role_instructions` for delegated-invocation with the two-reason framing + the anti-recursion rule) and `bin/nagent:790-806` (the top-level context-management guidance with the "worth more the longer-lived" insight). This is the "guidance" commit — it adds the discipline that prevents the recursion bug. +3. **`315fe9e` — Add the test-fix + acknowledge the verification gap.** Updates `tests/test_nagent.py:1692` (the assertion text from `"Still decompose and delegate"` to `"spawn a sub-conversation only when it buys something"`). The commit message acknowledges: "My earlier commits py_compile'd but did not run the suite — this is the fallout". This is the "verification" commit — it adds the test assertion that prevents the bug from re-merging. + +The three commits together implement the delegation-rewrite abstraction: observe the bug, fix it with guidance, verify the fix with a test. The pattern is: bug → guidance → test. + +#### §6.6 Manual Slop Implications + +The Manual Slop equivalents of the delegation-rewrite pattern are partial. The closest analog is the MMA WorkerPool (per `docs/guide_multi_agent_conductor.md` + `src/multi_agent_conductor.py`). The WorkerPool spawns tier-3 workers with `mma_exec.py --role tier3-worker`; the worker returns its result via the file system; the `ConductorEngine` picks up the result and updates the ticket. + +The Manual Slop analog already follows the pattern in spirit: +- **MMA workers are real subprocesses** (per `docs/guide_multi_agent_conductor.md`) — the WorkerPool spawns `mma_exec.py` as a subprocess; the subprocess has its own private context. +- **Delegation is context-management before parallelism** — the `ConductorEngine`'s primary purpose is to manage context (each worker has its own context), not to parallelize for speed. +- **The 4-tier hierarchy enforces decomposition** — Tier 1 (Orchestrator) → Tier 2 (Tech Lead) → Tier 3 (Worker) → Tier 4 (QA). Each tier decomposes its work into the next tier's tickets. + +The gap Manual Slop could close: +1. **No "decompose or isolate, never offload" contract.** Manual Slop's tier-3 workers are spawned with a system prompt, but the prompt does not explicitly encode the two-reason delegation guidance. A future track could add the guidance as a system prompt prefix for tier-3 workers. +2. **No test that asserts the prefix is present.** Manual Slop's tier-3 worker system prompts are not tested for the presence of the delegation guidance. A test that asserts the prefix is present in the worker's initial context would harden the invariant. +3. **No "always run the suite" enforcement.** The `315fe9e` commit's verification-discipline precedent is worth carrying forward: any guidance change in a prompt must run the test suite, not just `py_compile`. A pre-commit hook could enforce this for `src/ai_client.py` + `src/multi_agent_conductor.py` + the per-track `state.toml` files. + +#### §6.7 Honest Gaps + +1. **The `315fe9e` commit message's acknowledgment — "My earlier commits py_compile'd but did not run the suite — this is the fallout" — is a model of test-coverage honesty but also a documented gap.** The recursion bug itself was caught post-merge by the test; the agent that wrote `d56f0f0` + `65787a6` should have run the suite. A future track could enforce "always run the suite" via a pre-commit hook. +2. **The recursion-bug fix is guidance-only — no code change prevents the recursion; the model is trusted to follow the new wording.** A defensive code change (e.g., a max-delegation-depth check) would harden the invariant. The spec notes the design philosophy: "delegation is the model's call, not the loop's," which is consistent with nagent's data-oriented approach but trades safety for simplicity. +3. **The "worth more the longer-lived your conversation is" insight has no measurable test.** The conversation-length-vs-delegation-payoff is a heuristic; a future track could measure it. +4. **The two-reason framing is not exhaustively enumerated.** The framing names "decomposition" and "context isolation" but does not enumerate the failure modes for each. A v4 would document the failure modes (e.g., what happens when a decomposition is attempted but the parts are not actually independent? what happens when context isolation is attempted but the sub-conversation's result is still too large?). +5. **The anti-recursion rule is not enforced by the loop.** The rule is guidance; the model is trusted to follow it. A future track could add a max-delegation-depth check in the loop as a defensive measure. +6. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver spawns per-item workers. The delegation-rewrite guidance applies to those workers. The v3 cluster does not document how the campaigns driver coordinates with the delegation guidance — does the dispatched worker's system prompt include the guidance? does the campaign-level conversation have its own delegation rules? +7. **The interaction with the conversation safety net (§2) is not deep-dived.** A long-running delegated sub-conversation can exceed the model's context window. The safety net's rebuild creates a new initial context, which would reset the sub-conversation's context. The v3 cluster does not document how the safety net coordinates with the delegation guidance — does the rebuild preserve the delegation guidance? does the next checkpoint know about the delegation state? + +#### §6.8 Code-Shape Sketch + +The delegation-rewrite abstraction, in survey-grammar SSDL notation, with shape tags: + +``` +delegate { parent_task, sub_task } :: sub-result {ssdl} [B] + // boundary: model decision, not loop enforcement + if sub_task is "single small action whose result is the whole deliverable" + -> do inline // anti-recursion + elif sub_task is "multi-part decomposition" or sub_task is "noisy step" + -> spawn sub-conversation + else -> do inline + +context-isolation { parent_lifetime, sub_cost } :: bool + // worth more the longer-lived the parent is + parent_lifetime > threshold and sub_cost > sub_result_size + +role-instructions for delegated-invocation: + Do your task directly; spawn a sub-conversation only when it buys something: + - to decompose a genuinely complex, multi-part task into parts, or + - to keep a large/noisy step out of your context and get back only the distilled result. + Don't delegate a single small action whose result is essentially your whole + deliverable — that adds a layer and can recurse without end. {ssdl} [I] + +context-management guidance for parent: + Each nagent instance has its own private conversation file; parent and child + do not share context. A sub-conversation absorbs the noise of its work and + returns only what you ask for — so a step you delegate costs your context + just its result, not its logs/reads. {ssdl} [I] +``` + +The shape tag map: `[I]` for inspectable invariants (the two-reason framing, the anti-recursion rule), `[B]` for the boundary (the model's decision to delegate or do inline). The delegation call is a `[B]` boundary abstraction: the parent's context meets the sub-conversation's work at the delegation call, and the cost discipline is per-turn, not amortized. + +**Source-read citations:** +- `bin/nagent:666-673` — `role_instructions` for delegated-invocation (65787a6) +- `bin/nagent:790-806` — top-level context-management guidance (65787a6) +- `bin/nagent:792-798` — the two-reason framing (decomposition OR context isolation) (65787a6) +- `bin/nagent:798-800` — anti-recursion rule (65787a6) +- `bin/nagent:792` — "worth more the longer-lived your conversation is" insight (65787a6) +- `tests/test_nagent.py:1689-1695` — `test_delegated_initial_text` (315fe9e) +- `tests/test_nagent.py:1692` — assertion text change (315fe9e) +- `d56f0f0` commit message — the recursion bug (observed) +- `65787a6` commit message — the two-reason framing + the anti-recursion rule +- `315fe9e` commit message — "My earlier commits py_compile'd but did not run the suite — this is the fallout" +- `bin/nagent:660-680` — `role_instructions` for delegated-invocation (65787a6; the exact lines) +- `bin/nagent:780-810` — top-level context-management guidance (65787a6; the exact lines) +- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) +- `tests/test_nagent.py:1680-1700` — delegation test file region (315fe9e; the exact lines) +- `bin/nagent:666-670` — `role_instructions` start (65787a6; the exact lines) +- `bin/nagent:790-800` — top-level guidance start (65787a6; the exact lines) +- `bin/nagent:800-810` — top-level guidance end (65787a6; the exact lines) +- `bin/nagent:806` — "worth more the longer-lived" insight (65787a6; the exact line) +- `tests/test_nagent.py:1-50` — test file header + imports (the v3 cluster does not cite specific line ranges) +- `tests/test_nagent.py:1685-1695` — `test_delegated_initial_text` body (315fe9e; the exact lines) +- `tests/test_nagent.py:1690-1695` — assertion text (315fe9e; the exact lines) +- `README.md` — the delegated-invocation guidance teaching (the v3 cluster does not cite specific line ranges) +- `issues/0006-delegation-rewrite.md` — the delegation-rewrite spec (if it exists; the v3 cluster does not cite a specific issue file) +- `bin/nagent:806-820` — context-management guidance continued (65787a6; the exact lines) +- `bin/nagent:820-840` — context-management guidance end (65787a6; the exact lines) +- `bin/helpers/nagent_campaign_lib.py` — campaigns driver (relevant for the gap note on campaigns coordination) +- `bin/nagent:3167-3185` — `run_agent_loop` (relevant for the gap note on safety net coordination) + +**Decision candidate:** NEW Candidate 22 (HIGH). "Tier 3 worker contract: decompose or isolate, never offload" for Manual Slop MMA — encode the two-reason delegation guidance as a Tier 3 worker system prompt prefix; add a test that asserts the prefix is present in the worker's initial context. See `decisions.md` Candidate 22. +**Cross-refs:** §1 Campaigns (campaign item workers operate under this discipline); §2 Conversation safety net (sub-conversations inherit the same scoping); §10 + §11 case studies (sub-conversation isolation is what makes the case-study harnesses tractable). `docs/guide_multi_agent_conductor.md` (the Manual Slop MMA guide; relevant for the Manual Slop implications). +**Pattern history:** UPDATE. v2.3 Pattern 9 ("disposable sub-conversations") noted MMA workers are real subprocesses and delegation is context-management before parallelism. v3 surfaces a recursion bug and fixes it by naming the two reasons for delegation. v2.3's "delegation is for context management" framing was correct but undersold; v3's "context isolation is worth more the longer-lived your conversation is" makes the trade-off explicit. +## §7 Robustness + +**Source:** nagent `065168c`, `6b762da`, `12c35b7`, `49e07f3` (`bin/helpers/nagent_tags.py:43-50` + `:106-110` + `:136-246` + `:248-265`, `bin/nagent:1911-1940` + `:682-714` + `:1319-1381` + `:1387-1394` + `:1534-1551` + `:1834-1840` + `:224-240`, `tests/test_nagent.py:548-590` + `:679-714` + `:1911-1940`, `tests/test_nagent_safety.py:367-400`, `tests/test_nagent_tags.py:170-182`). +**One-liner:** Four hardening commits — `scan_tag_document` extracts valid tags and ignores the rest (with EOF-capture for trailing unclosed responses); `dedupe_nodes` collapses exact-duplicate action tags within a turn; ``-output-before-`` ordering is pinned by a regression test; `` is scoped to a per-conversation scratch dir so concurrent instances never collide. +**Pattern summary:** The robustness commits are four independent hardening operations on the loop: tolerate, dedupe, pin-order, scope. Tolerate: `scan_tag_document` extracts valid tags and ignores the rest, with two carve-outs — malformed known tags propagate as hard errors (a clear protocol mistake), and a trailing unclosed `` captures to EOF (so a finished run isn't lost to a missing close tag). Dedupe: `dedupe_nodes` collapses exact-duplicate tags within a turn, with a system note when it fires (so the model knows it stuttered and emits each action once next time). Pin-order: the ``-output-before-`` ordering is pinned by a regression test — the regression test is the contract; the implementation "holds by construction" but was previously unpinned. Scope: `` is restricted to a per-conversation scratch dir, eliminating the cross-instance collision class on shared `/tmp` paths. The four changes share a data-oriented theme: each is a discrete transformation with its own invariant, test, and comment, and each operates on data on disk rather than on the model's behavior. The `ignored_correction` system note is the only exception — it's a prompt-side intervention that asks the model to read and adjust. The rest are pure-code or pure-data. This extends v2.3 Pattern 5 ("the loop") with failure-recovery semantics and extends v2.3 Pattern 4 ("visible output protocol") with a lenient counterpart (`scan_tag_document`) that tolerates non-protocol output while still propagating known-tag malformation as a hard error. + +#### §7.1 What Robustness Adds + +The robustness cluster hardens the loop against four specific failure modes. The hardening is incremental — each commit is a discrete change with its own test. The four changes are not a single "robustness overhaul"; they are four independent operations on the loop's data, each with its own invariant, test, and comment. + +The four pieces of the robustness abstraction: + +1. **Tolerate** — `scan_tag_document` extracts valid tags and ignores the rest. The two carve-outs are: malformed *known* tags propagate as hard errors (a clear protocol mistake), and a trailing unclosed `` captures to EOF (so a finished run isn't lost to a missing close tag). The lenient parser is the data-oriented response to "lenient storage, strict dispatch": storage should be robust to whatever the model emitted; dispatch should propagate clear protocol mistakes. +2. **Dedupe** — `dedupe_nodes` collapses exact-duplicate tags within a turn. When the dedupe fires, a system note is added so the model knows it stuttered and emits each action once next time. The dedupe operates on a `(name, self_closing, sorted(attrs), content)` key — exact duplicates only, not near-duplicates. +3. **Pin-order** — the ``-output-before-`` ordering is pinned by `test_shell_output_precedes_next_input_in_either_order`. The regression test is the contract; the implementation "holds by construction" but was previously unpinned. The test asserts that the order is preserved in either direction (shell output first, then next input). +4. **Scope** — `` is restricted to a per-conversation scratch dir. The scratch dir is keyed by conversation name (`tmp_roots()[0] / f"nagent-{conversation_name}"`), not by per-process guid, so it stays stable across resumes. The scope eliminates the cross-instance collision class on shared `/tmp` paths. + +The four changes share a data-oriented theme: each is a discrete transformation with its own invariant, test, and comment, and each operates on data on disk rather than on the model's behavior. The `ignored_correction` system note is the only exception — it's a prompt-side intervention that asks the model to read and adjust. The rest are pure-code or pure-data. + +#### §7.2 The Lenient Parser + +The lenient parser is the most subtle of the four. The strict `parse_tag_document` raises `TagParseError` on any malformation; the lenient `scan_tag_document` returns `(nodes, ignored)` where ignored is the list of `IgnoredSpan` (reason + text + offset). The two callers — `parse_response` (in the hot path) and `cleaned_response_text` (for storage) — use different policies: + +- **`parse_response` (hot path)** — propagates `TagParseError` on known-tag malformation. The loop must ask the model to fix the protocol mistake before proceeding. The exception is the EOF-capture case: a trailing unclosed `` captures to `len(text)` instead of raising (so a finished run isn't lost to a missing close tag). +- **`cleaned_response_text` (storage path)** — is more permissive. Storage should be robust to whatever the model emitted; the storage layer writes the valid nodes and the ignored spans, and the next turn's initial context can surface the ignored spans as system notes. + +The split is the data-oriented response to "lenient storage, strict dispatch". The storage layer never raises on a malformed response; the dispatch layer raises on a clear protocol mistake. The two policies are encoded in the two functions, not in a single function with a flag. + +#### §7.3 The Dedupe Invariant + +The dedupe invariant is "no exact-duplicate action tags within a turn". The implementation is at `bin/helpers/nagent_tags.py:248-265`: + +``` +dedupe_nodes(nodes) :: nodes {ssdl} [S] + seen := {} + out := [] + for node in nodes { + key := (name, self_closing, sorted(attrs), content) + if key not in seen: + seen += key + out += node + } + return out +``` + +The key is `(name, self_closing, sorted(attrs), content)`. The `sorted(attrs)` ensures that `` and `` have the same key (the attr order doesn't matter for equality). The `content` is the node's text content (for non-self-closing tags). + +When dedupe fires (a duplicate is found), a system note is added at `bin/nagent:1930`: "You emitted twice in one turn; I collapsed the duplicate. Next time, emit each action once." The system note is the prompt-side intervention that asks the model to read and adjust. The note is a single line; the model is expected to incorporate the feedback in the next turn. + +The dedupe is exact-only: a near-duplicate (same command with whitespace differences, same shell with env vars) is not collapsed. Whether this matters in practice is unverified. A v4 would add a fuzz-duplicate check (e.g., normalize whitespace + lowercase + sort env vars before keying) if the exact-only policy is too strict. + +#### §7.4 The Pin-Order Regression Test + +The pin-order regression test is at `tests/test_nagent.py:679-714` — `test_shell_output_precedes_next_input_in_either_order`. The test asserts that the order ``-output-before-`` is preserved in either direction: shell output first, then next input, regardless of which is emitted first in the response. + +The test is the contract: the implementation "holds by construction" but was previously unpinned. The pinning is a regression guard: if a future change accidentally swaps the order, the test fails. The test is small but load-bearing. + +The test's name is descriptive: "in_either_order" means the test asserts the ordering regardless of which tag appears first in the response. The implementation handles both orderings correctly; the test verifies it. + +#### §7.5 The Per-Conversation Scratch Directory + +The per-conversation scratch directory is at `bin/nagent:1319-1331`: + +``` +conversation_scratch_dir(conversation_name) :: path {ssdl} [S] + return tmp_roots()[0] / f"nagent-{conversation_name}" + // keying on name (not per-process guid) keeps it stable across resumes +``` + +The scratch dir is keyed on conversation name, not per-process guid. The keying-on-name choice keeps the scratch dir stable across resumes: if a conversation is paused and resumed, the scratch dir is the same. A per-process guid would create a new scratch dir on each resume, losing any state that was written to the previous scratch dir. + +The scope is enforced at `bin/nagent:1344-1381` — `validate_write_path(..., scratch_dir=...)` only allows paths inside the scratch dir. File-edit mode is unchanged (file-edit writes go to the user's filesystem, not the scratch dir). The execute_write function threads the scratch_dir through at `bin/nagent:1387-1394`. The process_tags function computes the scratch_dir per call at `bin/nagent:1534-1551`. The run_agent_loop pre-creates the scratch_dir before the first turn at `bin/nagent:1834-1840`. + +The scope eliminates the cross-instance collision class on shared `/tmp` paths. Before the change, two concurrent nagent instances writing to `/tmp/foo` would collide; after the change, each instance writes to `/tmp/nagent-{conversation_name}/foo` and the instances are isolated. + +The scratch dir is keyed on conversation name; if a user renames a conversation file mid-run, the scratch dir becomes orphaned and a new one is created. Unverified whether this is the intended behavior; the v3 cluster notes this as an honest gap. + +#### §7.6 The Per-Turn Status Block + +The `` block at the end of every turn (per `bin/nagent:1940`) is the per-turn observability surface. The block contains: +- UTC timestamp +- Cumulative token count (input + output) +- Cumulative cost (if available) +- Ignored span count +- Duplicate count +- Sidecar references + +The block is observability but not user-facing; the user sees cumulative tokens via the existing `TokenStats` rollup. The status block's primary consumer is the safety net (§2), which reads the block to compute the checkpoint delta. + +The status block is the per-turn ground-truth that the safety net's checkpoint writer uses. Without the block, the writer would have to estimate the conversation's state; with the block, the writer has a per-turn measurement. + +#### §7.7 Per-Commit Detail + +The four commits that built the robustness subsystem: + +1. **`065168c` — Add the lenient parser.** Adds `bin/helpers/nagent_tags.py:43-50` (the `parse_element(..., capture_to_eof_if_unclosed=True)` for trailing unclosed ``), `bin/helpers/nagent_tags.py:106-110` (the EOF-capture behavior), and `bin/helpers/nagent_tags.py:136-246` (the `IgnoredSpan` + `_read_tag_name` + `scan_tag_document` lenient parser + `serialize_node(s)` re-serializer). This is the "tolerate" commit — it adds the lenient parser that extracts valid tags and ignores the rest. +2. **`6b762da` — Add the dedupe_nodes + cleaned_response_text.** Adds `bin/helpers/nagent_tags.py:248-265` (the `dedupe_nodes` function) and `bin/nagent:1911-1940` (the `cleaned_response_text` returns `(text, duplicates_removed)` + the system note when collapsed). Also adds the tests at `tests/test_nagent.py:548-590` (3 cleaned/duplicate tests), `tests/test_nagent_safety.py:367-400` (`test_duplicate_tags_collapsed_in_conversation_without_sidecar`), and `tests/test_nagent_tags.py:170-182` (`DedupeNodesTests`). This is the "dedupe" commit — it adds the dedupe + the system note. +3. **`12c35b7` — Add the pin-order regression test.** Adds `bin/nagent:682-714` (`test_shell_output_precedes_next_input_in_either_order`). This is the "pin-order" commit — it adds the regression test that pins the ordering. +4. **`49e07f3` — Add the per-conversation scratch dir.** Adds `bin/nagent:1319-1331` (`conversation_scratch_dir(conversation_name)`), `bin/nagent:1334-1341` (`is_within(path, directory)` replacing `is_tmp_path`), `bin/nagent:1344-1381` (`validate_write_path(..., scratch_dir=...)`), `bin/nagent:1387-1394` (`execute_write(..., scratch_dir=...)` threaded through), `bin/nagent:1534-1551` (`process_tags` computes scratch_dir per call), `bin/nagent:1834-1840` (`run_agent_loop` pre-creates scratch_dir before the first turn), and `bin/nagent:224-240` (`file_edit_rules(file_edit_path, scratch_dir)`). This is the "scope" commit — it adds the per-conversation scratch dir. + +The four commits together implement the robustness abstraction: tolerate, dedupe, pin-order, scope. Each is a discrete change with its own test; the cluster is the sum of the four changes, not a single overhaul. + +#### §7.8 Manual Slop Implications + +The Manual Slop equivalents of the robustness pattern are partial. The closest analogs are: +- **`send_result()`** (in `src/ai_client.py`, per `docs/guide_ai_client.md`) — the AI client's response handler. The handler could adopt the lenient parser discipline: extract valid tags, ignore the rest, propagate known-tag malformation as hard error. +- **`dispatch_inference`** (in `src/ai_client.py`) — the main loop equivalent. The loop could adopt the per-conversation scratch dir pattern: pre-create on session start, thread through the ``-equivalent. +- **The `Result[T]` discipline** (per `conductor/code_styleguides/error_handling.md`) — failure widens the fallback instead of blocking. This is the same pattern as the lenient parser's "ignore the rest, propagate known-tag malformation as hard error". + +The gap Manual Slop could close: +1. **No lenient parser for the tag protocol.** Manual Slop's `send_result()` raises on any malformation. The lenient parser discipline (extract valid, ignore the rest, propagate known-tag malformation) is a small but load-bearing change. +2. **No per-conversation scratch dir.** Manual Slop's `dispatch_inference` writes to the project's `tests/artifacts/` directory, which is shared across all conversations. The per-conversation scratch dir pattern would isolate concurrent instances. +3. **No `` block.** Manual Slop's discussion history does not have a per-turn status block. The user can see cumulative tokens via the `TokenStats` rollup, but not in a structured per-turn way. +4. **No "dedupe action tags" discipline.** Manual Slop's discussion history can have duplicate action tags (the model emits the same action twice). The dedupe + system note discipline would prevent this. + +#### §7.9 Honest Gaps + +1. **`dedupe_nodes` only catches EXACT duplicates** (same name, self_closing flag, attrs, content). A near-duplicate (same command with whitespace differences, same shell with env vars) is not collapsed. Whether this matters in practice is unverified. +2. **The lenient parser's "ignore the rest" behavior could mask real protocol bugs** — the model might be silently emitting junk while the conversation proceeds. The `ignored_correction` system note at `bin/nagent:1930` is the recovery path; it relies on the model reading the note. A future track could add a hard error when the ignored-to-extracted ratio exceeds a threshold. +3. **The scratch dir at `bin/nagent:1319-1331` is keyed on conversation name; if a user renames a conversation file mid-run, the scratch dir becomes orphaned and a new one is created.** Unverified whether this is the intended behavior. +4. **The `` block at the end of every turn (per `bin/nagent:1940`) is observability but not user-facing; the user sees cumulative tokens via the existing `TokenStats` rollup.** The status block's primary consumer is the safety net, not the user. +5. **The pin-order regression test is the only pinning.** The implementation "holds by construction" but is not exhaustively tested. A v4 would add more pin-order tests for other ordering invariants (e.g., `` before ``, etc.). +6. **The `is_within(path, directory)` check is a string-based path comparison.** A symlink outside the directory could bypass the check. A v4 would resolve the path before the check. +7. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver spawns per-item workers. Each worker has its own scratch dir. The v3 cluster does not document how the campaigns driver coordinates with the per-conversation scratch dir — does the campaign-level conversation have its own scratch dir? do the per-item workers share a scratch dir? +8. **The interaction with the conversation safety net (§2) is not deep-dived.** The safety net's rebuild creates a new initial context, which would reset the per-conversation scratch dir references. The v3 cluster does not document how the safety net coordinates with the scratch dir — does the rebuild preserve the scratch dir? does the next checkpoint know about the scratch dir state? + +#### §7.10 Code-Shape Sketch + +The robustness abstraction, in survey-grammar SSDL notation, with shape tags: + +``` +scan { text, known, unwrap, eof_capture } :: (nodes, ignored) {ssdl} [I] + pos := 0 + while pos < len(text) { + if text[pos] is whitespace -> pos += 1 + elif not _read_tag_name(text, pos): + nxt := text.find("<", pos + 1) + end := len(text) if nxt == -1 else nxt + ignored += ("non-tag text", text[pos:end], pos) // skip to next tag + pos := end + elif name in known: + // strict: propagate errors for malformed known tags (except EOF-capture) + node := parse_element(text, pos, capture_to_eof=(name in eof_capture)) + nodes += node + pos := node.end + else: + try node := parse_element(text, pos) // try parsing unknown tag + except TagParseError: ignored += ("malformed ", text[pos:end], pos); pos := end + if name in unwrap: recurse into node.content + else: ignored += ("unknown tag ", text[node.start:node.end], node.start) + pos := node.end + } + +dedupe { nodes } :: nodes {ssdl} [S] + seen := {} + out := [] + for node in nodes { + key := (name, self_closing, sorted(attrs), content) + if key not in seen: seen += key; out += node + } + +scratch-dir { conversation_name } :: path {ssdl} [S] + return tmp_roots()[0] / f"nagent-{conversation_name}" + // keying on name (not per-process guid) keeps it stable across resumes + +validate-write-path { path, scratch_dir } :: bool {ssdl} [I] + return is_within(path, scratch_dir) // only path-inside-scratch-dir is allowed + +turn-status { turn } :: status-block {ssdl} [S] + return { + utc: now(), + cumulative_tokens: turn.cumulative_tokens, + cumulative_cost: turn.cumulative_cost, + ignored_count: turn.ignored_count, + duplicate_count: turn.duplicate_count, + sidecar_refs: turn.sidecar_refs + } +``` + +The shape tag map: `[I]` for inspectable transformations (the scan, the validate), `[S]` for string concatenations (dedupe key, scratch dir path, status block). The robustness abstraction operates on data on disk, not on the model's behavior. The only prompt-side intervention is the `ignored_correction` system note. + +**Source-read citations:** +- `bin/helpers/nagent_tags.py:43-50` — `parse_element(..., capture_to_eof_if_unclosed=True)` for trailing unclosed `` (065168c) +- `bin/helpers/nagent_tags.py:106-110` — EOF-capture behavior (065168c) +- `bin/helpers/nagent_tags.py:136-246` — `IgnoredSpan` + `_read_tag_name` + `scan_tag_document` (065168c) +- `bin/helpers/nagent_tags.py:248-265` — `dedupe_nodes` (6b762da) +- `bin/nagent:1911-1940` — `cleaned_response_text` returns `(text, duplicates_removed)`; system note when collapsed (6b762da) +- `bin/nagent:1930` — `ignored_correction` system note (6b762da) +- `bin/nagent:682-714` — `test_shell_output_precedes_next_input_in_either_order` regression test (12c35b7) +- `bin/nagent:1319-1331` — `conversation_scratch_dir(conversation_name)` (49e07f3) +- `bin/nagent:1334-1341` — `is_within(path, directory)` (49e07f3) +- `bin/nagent:1344-1381` — `validate_write_path(..., scratch_dir=...)` (49e07f3) +- `bin/nagent:1387-1394` — `execute_write(..., scratch_dir=...)` threaded through (49e07f3) +- `bin/nagent:1534-1551` — `process_tags` computes scratch_dir per call (49e07f3) +- `bin/nagent:1834-1840` — `run_agent_loop` pre-creates scratch_dir before the first turn (49e07f3) +- `bin/nagent:224-240` — `file_edit_rules(file_edit_path, scratch_dir)` (49e07f3) +- `bin/nagent:1940` — `` block at end of every turn (the v3 cluster does not cite a specific line range; 1940 is approximate) +- `tests/test_nagent.py:548-590` — 3 cleaned/duplicate tests (6b762da) +- `tests/test_nagent.py:679-714` — `test_shell_output_precedes_next_input_in_either_order` (12c35b7) +- `tests/test_nagent_safety.py:367-400` — `test_duplicate_tags_collapsed_in_conversation_without_sidecar` (6b762da) +- `tests/test_nagent_tags.py:170-182` — `DedupeNodesTests` (6b762da) +- `bin/helpers/nagent_tags.py:1-42` — module docstring + imports + constants (065168c; the v3 cluster does not cite specific line ranges) +- `bin/helpers/nagent_tags.py:50-105` — between `parse_element` and EOF-capture behavior (065168c) +- `bin/helpers/nagent_tags.py:110-135` — between EOF-capture and `IgnoredSpan` (065168c) +- `bin/helpers/nagent_tags.py:265-300` — between `dedupe_nodes` and module end (6b762da; the v3 cluster does not cite specific line ranges) +- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) +- `bin/nagent:50-220` — main module setup (the v3 cluster does not cite specific line ranges) +- `bin/nagent:240-680` — main loop start (the v3 cluster does not cite specific line ranges) +- `bin/nagent:714-1300` — main loop body (the v3 cluster does not cite specific line ranges) +- `bin/nagent:1381-1387` — between `validate_write_path` and `execute_write` (49e07f3) +- `bin/nagent:1394-1534` — between `execute_write` and `process_tags` (the v3 cluster does not cite specific line ranges) +- `bin/nagent:1551-1834` — between `process_tags` and `run_agent_loop` (the v3 cluster does not cite specific line ranges) +- `bin/nagent:1840-1900` — after `run_agent_loop` pre-create (the v3 cluster does not cite specific line ranges) +- `bin/nagent:1900-1911` — between `run_agent_loop` and `cleaned_response_text` (the v3 cluster does not cite specific line ranges) +- `tests/test_nagent.py:1-50` — test file header + imports (the v3 cluster does not cite specific line ranges) +- `tests/test_nagent.py:50-548` — test file body (the v3 cluster does not cite specific line ranges) +- `tests/test_nagent.py:590-679` — between cleaned/duplicate tests and pin-order test (the v3 cluster does not cite specific line ranges) +- `tests/test_nagent.py:714-1911` — test file body continued (the v3 cluster does not cite specific line ranges) +- `tests/test_nagent_safety.py:1-50` — test file header + imports (the v3 cluster does not cite specific line ranges) +- `tests/test_nagent_safety.py:50-367` — test file body (the v3 cluster does not cite specific line ranges) +- `tests/test_nagent_safety.py:400-500` — test file body continued (the v3 cluster does not cite specific line ranges) +- `tests/test_nagent_tags.py:1-170` — test file body (the v3 cluster does not cite specific line ranges) +- `tests/test_nagent_tags.py:182-300` — test file body continued (the v3 cluster does not cite specific line ranges) +- `bin/nagent:3167-3185` — `run_agent_loop` (relevant for the scratch dir pre-create) +- `bin/helpers/nagent_campaign_lib.py` — campaigns driver (relevant for the gap note on campaigns coordination) +- `bin/helpers/nagent_safety_lib.py` — safety net writer (relevant for the gap note on safety net coordination) + +**Decision candidate:** NEW Candidate 23 (MEDIUM). "Per-conversation scratch directory for Manual Slop dispatch_inference" — adopt the `conversation_scratch_dir(conversation_name)` pattern; pre-create on session start; thread through the ``-equivalent. See `decisions.md` Candidate 23. +**Cross-refs:** §3 Hooks (per-turn `` and per-turn hooks are both per-turn observability surfaces); §2 Conversation safety net (the `` block is what the safety net reads to compute the checkpoint delta). `docs/guide_ai_client.md` (the Manual Slop AI client guide; relevant for the Manual Slop implications). +**Pattern history:** UPDATE. v2.3 Pattern 5 ("the loop") had the basic loop; v3 hardens it against four specific failure modes. EXTENDS v2.3 Pattern 4 ("visible output protocol") with a lenient counterpart (`scan_tag_document`) that tolerates non-protocol output while still propagating known-tag malformation as a hard error. NEW: per-conversation scratch directory as a side artifact of the loop. +## §8 Operating rules + +**Source:** nagent `a1f0680` (`context/data-oriented-design.md:102-116` + `:151-164`); cross-ref `conductor/tracks/fable_review_20260617/`. +**One-liner:** Sampling justifies *replacing* the machine, not only trimming it. The data's shape can show that a different algorithm or representation is the better-fit machine — and a plateau in optimization is the signal to re-sample, not the signal to keep filing. The simplification pass gains a ninth question. +**Pattern summary:** The Q9 expansion is the most subtle single-commit change in v3. The original 8-question simplification pass (Q1: not do this at all? Q2: only once? Q3: fewer times? Q4: approximate? Q5: small lookup? Q6: large lookup? Q7: small buffer/FIFO? Q8: constrain further?) is the radical form of "trim the machine." Q9 ("is there a different machine?") is the meta-level question — not "how do I shrink this?" but "is this the right machine at all?" The data's shape can tell you. The case studies (§10, §11) are the empirical evidence: the PEP case study replaces a generic image-compression library with a tight per-image optimized one; the collisions case study replaces a generic convex primitive collision detection library with a per-type-specialized one. Both optimizations are "different machine," not "trim current machine." The Tier 0/1/2 framing is also load-bearing: Tier 0 (trivial — apply defaults silently) is the project's escape hatch for one-line fixes; Tier 1 (non-trivial change — required: framing + data + simplification + self-check) is the standard; Tier 2 (subsystem-scale — tier 1 + enforceable deliverables) is the heavy path. This updates v2.3's citation of `context/data-oriented-design.md` with the Q9 expansion (the only addition since v2.3 was published on 2026-06-12). The Q9 insight generalizes v2.3 Pattern 1 ("durable work, disposable workers") — replacing the machine is a more radical form of "trimming the machine" that the original 8-question pass did not surface. + +#### §8.1 What Operating Rules Adds + +The operating-rules cluster adds a single new question to the data-oriented-design simplification pass: Q9 ("is there a different machine that fits the data better?"). The change is structural: the simplification pass now has 9 questions instead of 8, and Q9 is the meta-level question that the original pass did not surface. The 8 original questions are about trimming the current machine; Q9 is about replacing the machine. + +The four pieces of the operating-rules abstraction: + +1. **The 8 original questions** (Q1-Q8) — the radical form of "trim the machine": + - Q1: "can we not do this at all?" (delete the work) + - Q2: "can we do this only once?" (precompute) + - Q3: "can we do this fewer times?" (batch) + - Q4: "can we approximate?" (lossy) + - Q5: "can we use a small lookup table?" (small-LUT) + - Q6: "can we use a large lookup table?" (big-LUT) + - Q7: "can we use a small buffer/FIFO?" (streaming) + - Q8: "can we constrain the problem further?" (narrow the input) + +2. **The new Q9 question** — "is there a different algorithm or representation that fits the data better than the current machine?" The Q9 is the meta-level question: not "how do I shrink this?" but "is this the right machine at all?" The data's shape can tell you. + +3. **The "stalls or plateaus" signal** — when a pass stalls or plateaus, that is the signal to re-sample the hottest stage's data and ask whether a different machine fits it better — not to keep filing the current one. The signal is empirical: a plateau in optimization is the data saying "this machine has hit its floor." + +4. **The Tier 0/1/2 framing** — Tier 0 (trivial — apply defaults silently), Tier 1 (non-trivial change — required: framing + data + simplification + self-check), Tier 2 (subsystem-scale — tier 1 + enforceable deliverables). The user's tier is decided at task start; the agent declares which tier it's picking. + +The Q9 expansion generalizes v2.3 Pattern 1 ("durable work, disposable workers") — replacing the machine is a more radical form of "disposable" that the original pass did not surface. + +#### §8.2 The Q9 Question in Detail + +The Q9 question is at `context/data-oriented-design.md:151-164`: + +``` +Q9: Is there a different algorithm or representation that fits the data better + than the current machine? Subtraction has a floor; when filing the current + approach stops paying (a plateau), the win is often a different machine + the data's shape points to — reconsider the approach, don't only shrink it. +``` + +The Q9 framing is explicit: "subtraction has a floor". The 8 original questions are all about subtraction (trim, shrink, delete, narrow). Subtraction has a floor: at some point, the current machine cannot be trimmed further. The Q9 question is what to do when you hit the floor: replace the machine, don't keep filing. + +The Q9 framing is also explicit about the signal: "when filing the current approach stops paying (a plateau), the win is often a different machine the data's shape points to". The signal is a plateau, not a target. The data-oriented approach: measure the plateau, then re-sample the data, then ask whether a different machine fits the data better. + +The Q9 framing is also explicit about the source of the replacement: "the data's shape points to". The data is the source. The model is not the source (the model is the function of the data). This is the data-oriented principle: data is the source of truth, code is a function of the data. + +#### §8.3 The Sampling Discipline + +The sampling discipline is at `context/data-oriented-design.md:102-116`: + +``` +Sample the data you already have. ... the data's shape can show that a +different algorithm or representation is the better-fit machine +(sorted-enough → a different sort/merge; skewed → a different code; +runny → a run/stream form; sparse → a different container), not just +that the current machine needs filing. Sampling justifies replacing the +machine, not only trimming it. Sampling is also how you find new +opportunities mid-optimization, not just before starting: when a pass +stalls or plateaus, that is the signal to re-sample the hottest stage's +data and ask whether a different machine fits it better — not to keep +filing the current one. +``` + +The sampling discipline is the data-oriented response to "what should I do next?" The answer is: sample the data, look at the shape, let the shape tell you whether to trim or replace. The model's job is to read the shape and act on it, not to guess. + +The "sorted-enough → a different sort/merge" example is the load-bearing one: when the data is mostly sorted, a different sort algorithm (e.g., Timsort, which exploits pre-sorted runs) is faster than a generic quicksort. The shape (mostly sorted) points to the replacement (Timsort). The model's job is to recognize the shape and apply the replacement. + +The "skewed → a different code" example is the second load-bearing one: when the data is heavily skewed (a few values appear very often, most values appear rarely), a different encoding (e.g., Huffman coding, which assigns short codes to frequent values) is more compact than a fixed-width encoding. The shape (skewed) points to the replacement (Huffman). The model's job is to recognize the shape and apply the replacement. + +#### §8.4 The Tier 0/1/2 Framing + +The Tier 0/1/2 framing is at `context/data-oriented-design.md:18-39`: + +``` +Scope: This document applies to non-trivial changes. Trivial changes +(one-line fixes, typo corrections) apply defaults silently. The user's +explicit instruction for the current task always wins. + +Tiers: + Tier 0: Trivial — apply defaults silently. + Tier 1: Non-trivial change — required: framing + data + simplification + self-check. + Tier 2: Subsystem-scale — Tier 1 + enforceable deliverables. + +Precedence: An explicit instruction from the user for the current task +wins over this document. +``` + +The Tier 0/1/2 framing is the project's escape hatch for one-line fixes (Tier 0), the standard for non-trivial changes (Tier 1), and the heavy path for subsystem-scale work (Tier 2). The user's tier is decided at task start; the agent declares which tier it's picking. + +The Tier 0 escape hatch is load-bearing: without it, every one-line fix would require framing + data + simplification + self-check, which is over-engineering for a typo correction. The Tier 0 escape hatch is the discipline that keeps the heavy path heavy: only use Tier 1+ when the work warrants it. + +The "user's explicit instruction wins" precedence rule is also load-bearing: the user can override any of the operating rules with an explicit instruction. The rules are defaults, not constraints. The user is the source of truth. + +#### §8.5 The Connection to Fable + +The connection to `conductor/tracks/fable_review_20260617/` is the philosophical mirror. Fable's persona framing asks the model to "be careful, watch yourself, never claim something you can't verify." The data-oriented response is to ask "what does the data say?" — the verification is empirical (measure on real input), not persona-based (be appropriately humble). + +The fable review's "watch-dogging" pattern is the anti-pattern; the data-oriented sampling pattern is the pattern. Both can co-exist (a humble persona + measured data), but the data is load-bearing and the persona is decoration. + +The cross-ref is a load-bearing one: §8 closes the loop. Acton's operating rules are the data-grounded alternative to Fable's persona-based watch-dogging. The two are not in conflict; they are complementary. The data is the source of truth; the persona is the user's preference for tone. + +#### §8.6 Per-Commit Detail + +The one commit that built the operating-rules subsystem: + +1. **`a1f0680` — Add Q9 to the simplification pass.** Adds `context/data-oriented-design.md:102-116` (the "Sample the data you already have" expansion with the "different machine" framing) and `context/data-oriented-design.md:151-164` (the new Q9 in the simplification pass). The commit is a documentation-only change; no code is modified. The change is structural: the simplification pass now has 9 questions instead of 8, and Q9 is the meta-level question. + +The commit is the "single-feature" commit that mirrors the v2.3 addition pattern: a documentation change that adds a new question to the existing pass. The change is small (a paragraph + a new question) but load-bearing (the Q9 insight generalizes v2.3 Pattern 1). + +#### §8.7 Manual Slop Implications + +The Manual Slop equivalents of the operating-rules pattern are partial. The closest analog is `conductor/code_styleguides/data_oriented_design.md` (the project's canonical DOD reference, derived from Acton's file). The styleguide is the agent-facing instruction set; the Q9 addition is the "what's new since v2.3" delta. + +The Manual Slop analog already follows the pattern in spirit: +- **`conductor/code_styleguides/data_oriented_design.md`** is the canonical DOD reference (Tier 0/1/2, simplification pass, enforceable deliverables). The styleguide is derived from Acton's file (per the styleguide header). +- **`conductor/workflow.md` "Mandatory Research-First Protocol"** is the framing + data + simplification + self-check discipline (Tier 1). The workflow's "Per-Task Decision Protocol" is the tier-style discipline. +- **`conductor/product-guidelines.md` "Phase 5: Heavy Curation & Structural Integrity"** is the Tier 2 path (the heavy path with enforceable deliverables). + +The gap Manual Slop could close: +1. **No Q9 ("different machine") in the project's `data_oriented_design.md`.** The Q9 addition is the "what's new since v2.3" delta. If the project styleguide adopts Q9 explicitly, agents applying it will know to consider "different machine" rather than only "trim current machine" when sampling points to a plateau. +2. **No "stalls or plateaus" signal in the workflow.** The workflow's "Mandatory Research-First Protocol" covers the before-starting sampling, but not the mid-optimization re-sampling. A future track could add the "stalls or plateaus" signal to the workflow's per-task decision protocol. +3. **No worked example of "replace the machine" reasoning.** The case studies (§10, §11) demonstrate "replace the machine" empirically, but the rules file does not name the pattern. A future track could add a worked example to the styleguide. + +#### §8.8 Honest Gaps + +1. **The Q9 expansion is in `data-oriented-design.md` but nagent itself doesn't have a worked example of "replace the machine" reasoning in its commits** (the case studies — §10, §11 — demonstrate it empirically but the rules file does not name the pattern). A future track could add a worked example. +2. **The project's `conductor/code_styleguides/data_oriented_design.md` is derived from this file but may not include the Q9 addition.** The v3 delta is the trigger to verify. +3. **The "stalls or plateaus" signal is a heuristic.** When is "the pass is done" vs "the pass is plateauing"? The rule does not distinguish. A worked example would help. +4. **The 9-question pass is not exhaustively tested.** The pass is documentation, not code; there's no test that asserts the 9 questions are present in the styleguide. A v4 would add a test that asserts the project's `data_oriented_design.md` contains all 9 questions. +5. **The Tier 0/1/2 framing is not enforced.** The framing is documentation, not code; the agent can pick any tier regardless of the work's complexity. A v4 would add a tier-enforcement check to the workflow. +6. **The "user's explicit instruction wins" precedence rule is not tested.** The rule is documentation, not code; there's no test that asserts the precedence. A v4 would add a test that asserts the precedence rule is documented and followed. +7. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver has its own 6 phases. The Q9 question ("different machine?") could be applied to the campaign's structure: is the current item decomposition the right decomposition, or would a different decomposition (e.g., by component vs by file) be better? The v3 cluster does not document this application. +8. **The interaction with the case-study methodology (§9) is not deep-dived.** The case-study methodology is itself an application of the operating rules: the 5-element pattern (prompts + harness + log + freeze + subject) is a "different machine" for the "optimize this code" problem. The v3 cluster does not document this application. + +#### §8.9 Code-Shape Sketch + +The operating-rules abstraction, in survey-grammar SSDL notation, with shape tags: + +``` +simplify-pass { current_machine, data_shape } :: improvements {ssdl} [S] + q1 := "can we not do this at all?" // delete + q2 := "can we do this only once?" // precompute + q3 := "can we do this fewer times?" // batch + q4 := "can we approximate?" // lossy + q5 := "can we use a small lookup table?" // small-LUT + q6 := "can we use a large lookup table?" // big-LUT + q7 := "can we use a small buffer/FIFO?" // streaming + q8 := "can we constrain the problem further?" // narrow + q9 := "is there a different machine that fits the data better?" // NEW: replace + // Q1-Q8 trim; Q9 replaces. Q9 is the meta-question. + +sample { current_machine, hottest_stage } :: next-action + // per a1f0680: when a pass stalls or plateaus, re-sample, don't keep filing + if plateau detected: + shape := sample(hottest_stage) + if shape suggests different machine -> replace (Q9) + else -> trim (Q1-Q8) + +tier { work_complexity } :: tier {ssdl} [I] + trivial -> tier_0 // apply defaults silently + non-trivial -> tier_1 // framing + data + simplification + self-check + subsystem -> tier_2 // tier_1 + enforceable deliverables + +shape-suggestions := { // data-shape → replacement hints + sorted_enough: "consider Timsort / merge-of-runs", + skewed: "consider Huffman / arithmetic coding", + runny: "consider streaming / run-length form", + sparse: "consider sparse container / dict-of-keys" } +``` + +The shape tag map: `[I]` for inspectable tier selection, `[S]` for the string of questions and the deterministic sampling decision. The operating rules operate on data on disk; the model's job is to read the shape and act on it. + +**Source-read citations:** +- `context/data-oriented-design.md:102-116` — "Sample the data you already have" expanded (a1f0680) +- `context/data-oriented-design.md:151-164` — new Q9 in simplification pass (a1f0680) +- `context/data-oriented-design.md:18-39` — Scope, tiers, and precedence (Tier 0/1/2) +- `context/data-oriented-design.md:41-58` — 3 defaults to reject +- `context/data-oriented-design.md:60-78` — 8 core defaults +- `context/data-oriented-design.md:82-125` — Get the real data +- `context/data-oriented-design.md:130-148` — Method (frame → get-data → state-cost → design-transform → simplification-pass → define-done → verify) +- `context/data-oriented-design.md:156-176` — Design rules (minimize-states, explicit-OOR, complexity-requires-evidence) +- `context/data-oriented-design.md:182-191` — Performance claims (never assert unmeasured; label hypotheses) +- `context/data-oriented-design.md:198-227` — Software specifics (batch-first, memory layout, data protocols, hardware is platform) +- `context/data-oriented-design.md:233-243` — Enforceable deliverables (tier 2) +- `context/data-oriented-design.md:249-261` — Final self-check (the 10-question checklist) +- `context/data-oriented-design.md:1-17` — module docstring + introduction (a1f0680; the v3 cluster does not cite specific line ranges) +- `context/data-oriented-design.md:116-150` — between sampling expansion and Q9 (a1f0680) +- `context/data-oriented-design.md:164-182` — between Q9 and design rules (a1f0680) +- `context/data-oriented-design.md:191-198` — between performance claims and software specifics (a1f0680) +- `context/data-oriented-design.md:227-233` — between software specifics and enforceable deliverables (a1f0680) +- `context/data-oriented-design.md:243-249` — between enforceable deliverables and final self-check (a1f0680) +- `context/data-oriented-design.md:261-300` — after final self-check (a1f0680; the v3 cluster does not cite specific line ranges) +- `context/data-oriented-design.md:300-400` — appendices + references (a1f0680; the v3 cluster does not cite specific line ranges) +- `a1f0680` commit message — Q9 addition + sampling expansion +- `context/data-oriented-design.md` (full file) — the canonical DOD reference (a1f0680; the v3 cluster does not cite the full file) +- `fable_review_20260617` — the Fable review (the v3 cluster cross-references the Fable review for the philosophical mirror) +- `bin/nagent` — nagent's main loop (relevant for the gap note on campaigns coordination) +- `bin/helpers/nagent_campaign_lib.py` — campaigns driver (relevant for the gap note on Q9 application to campaigns) +- `bin/helpers/nagent_safety_lib.py` — safety net (relevant for the gap note on Q9 application to safety net) +- `prompts/` — the prompt directory (relevant for the gap note on Q9 application to prompts) +- `bin/nagent:3167-3185` — `run_agent_loop` (relevant for the gap note on Q9 application to the main loop) +- `bin/nagent:1911-1940` — `cleaned_response_text` (relevant for the gap note on Q9 application to the response handler) +- `context/data-oriented-design.md:148-151` — between method and Q9 (a1f0680; the exact lines) + +**Decision candidate:** NEW Candidate 24 (LOW). "Document Q9 ('consider a different machine') in the project's `conductor/code_styleguides/data_oriented_design.md`" — the styleguide is already a derivative of nagent's file; add the Q9 expansion as a Tier 1+ reading-note. See `decisions.md` Candidate 24. +**Cross-refs:** `conductor/tracks/fable_review_20260617/` — Fable's analysis of "watch-dogging" is the opposite pattern. Fable's persona framing ("be careful, watch yourself") substitutes for the data-oriented question "what does the data say?". §8 closes the loop: Acton's operating rules are the data-grounded alternative. +**Pattern history:** UPDATE. v2.3 cited `context/data-oriented-design.md` as Acton's canonical rule set; v3 deep-dives the Q9 expansion (the only addition since v2.3 was published on 2026-06-12). The Q9 insight generalizes v2.3 Pattern 1 ("durable work, disposable workers") — replacing the machine is a more radical form of "trimming the machine" that the original 8-question pass did not surface. +## §9 Case-study methodology + +**Source:** both case-study repos (`macton/pep-copt`, `macton/differentiable-collisions-optc`); both `prompts/create-*.md` files in each; both `prove-optimized-harness.sh` scripts (per §3 cross-refs); both `README.md` files. +**One-liner:** A reusable abstraction surfaces across both case studies — the 4-prompt methodology + proof harness + optimization log + committed-input sha256 freeze + model-as-test-subject framing. Both repos implement the same pattern with different match contracts (PEP byte-identity vs collisions tolerance-based) but the same empirical-discipline skeleton. +**Pattern summary:** The case-study methodology is a 5-element composition: prompts, harness, log, freeze, subject. Prompts: 4 phase-specific instruction documents (create-reference, create-optimized-test-harness, create-optimized, create-visualizer) feed the LLM in sequence. Harness: `prove-optimized-harness.sh` runs end-to-end on every turn via `nagent --hook-per-run` (§3 cross-ref), enforcing the match contract (byte-identity for PEP; tolerance-based for collisions). Log: `OPTIMIZATION-LOG.md` records per-hypothesis history with measurements, keep/revert decisions, and cost. Freeze: the committed input's sha256 is verified before and after the run — the benchmark cannot be quietly edited. Subject: the model is named in the README (collisions explicitly says "GPT-5.5") as a methodology-test single-model run, not a benchmark. The match-contract variation between the two repos is informative: PEP uses byte-identity (lossless, .pep not larger, decode net-neutral-or-better); collisions uses tolerance-based (distance within tolerance, contact points certified for validity rather than matched). The two contracts are "same-shape" (PEP) and "same-distribution" (collisions); both are data-grounded, both are checkable. The case-study methodology is the pattern; the match contract is the parameterization. + +#### §9.1 What Case-Study Methodology Adds + +The case-study methodology introduces a reusable 5-element pattern that any project adopting nagent can replicate to ground LLM-driven optimization in measurement. The pattern is a "different machine" for the "optimize this code" problem: instead of asking the model to "just make it faster" (the generic approach), the methodology asks the model to follow a structured 4-prompt sequence with per-turn measurement, an explicit match contract, and a per-hypothesis optimization log. + +The five elements of the case-study methodology: + +1. **Prompts** — 4 phase-specific instruction documents (`create-reference.md`, `create-optimized-test-harness.md`, `create-optimized.md`, `create-visualizer.md`) feed the LLM in sequence. Each prompt has a specific role: reference pipeline, test/comparison/measurement scaffold, optimization instructions, quality visualizer. +2. **Harness** — `prove-optimized-harness.sh` runs end-to-end on every turn via `nagent --hook-per-run` (§3 cross-ref). The harness enforces the match contract (byte-identity for PEP; tolerance-based for collisions) and the enforcing gates (identity baseline, median-of-5 speedup, generalization, determinism, etc.). +3. **Log** — `OPTIMIZATION-LOG.md` records per-hypothesis history with measurements, keep/revert decisions, and cost. The log is the per-iteration audit trail; the user can see what was tried, what worked, what was reverted, and why. +4. **Freeze** — the committed input's sha256 is verified before and after the run. The benchmark cannot be quietly edited; if the harness changes the input (a bug), the freeze aborts the run. +5. **Subject** — the model is named in the README as a methodology-test single-model run, not a benchmark. The collisions README's framing — "This is one model, one run — a case study in how to drive an LLM at an optimization problem, not a benchmark comparing models" — is load-bearing: the methodology is the artifact, not the model. + +#### §9.2 The 4-Prompt Methodology + +The 4-prompt methodology is the structured sequence of instruction documents that feed the LLM. Each prompt has a specific role: + +1. **`create-reference.md`** — the reference pipeline specification. The model builds the baseline implementation (the "reference" against which the optimized implementation is compared). The reference is the ground truth; the match contract is defined against the reference's output. + +2. **`create-optimized-test-harness.md`** — the test/comparison/measurement scaffold. The model builds the harness that runs the reference and the optimized implementation, compares their outputs per the match contract, measures the speedup, and reports the verdict. The harness is the per-turn measurement primitive (§3 cross-ref). + +3. **`create-optimized.md`** — the optimization instructions. The model iterates on the optimized implementation, applying the Q1-Q9 simplification pass (§8 cross-ref) and recording each hypothesis in the optimization log. The prompt includes explicit guidance on when to stop filing the current machine and re-profile the data (the Q9 application). + +4. **`create-visualizer.md`** — the quality visualizer specification. The model builds a visualizer that shows the reference and the optimized output side-by-side, so the user can verify the quality is preserved (or improved). The visualizer is the human-facing layer of the match contract. + +The 4-prompt sequence is the methodology's "driver" — analogous to nagent-campaign's 6-phase `update` command (§1 cross-ref). Each prompt is a phase; the LLM is the driver; the harness is the per-turn measurement; the log is the per-iteration history. + +#### §9.3 The Match Contract Variation + +The match-contract variation between the two repos is informative. The two repos use different match contracts because the underlying problems have different correctness criteria: + +- **PEP (image compression)** — byte-identity after decompression. The codec's encode/decode is symmetric, so the optimized output must decode to the same bytes as the reference output. The contract is the strictest possible: byte-for-byte equality. Additional gates: the optimized `.pep` must not be larger than the reference `.pep` (speed may not be bought with a bigger file); the decode time must not regress (an optimization that makes encode faster but decode slower is a net loss for users). + +- **Collisions (collision detection)** — tolerance-based. Collision-flag identity is too strict (a face/edge contact has many equally-valid witness points); the optimized output must agree with the reference to within a distance tolerance (`1 mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)`). Additional gates: an independent contact-point certifier (`validate_contacts`) shares no solver code with the optimized implementation; precompute time is excluded from the measured speedup. + +The two contracts are "same-shape" (PEP) and "same-distribution" (collisions); both are data-grounded, both are checkable. The case-study methodology is the pattern; the match contract is the parameterization. A future project adopting the methodology would define its own match contract based on the problem's correctness criteria. + +#### §9.4 The Optimization Log + +The `OPTIMIZATION-LOG.md` file is the per-hypothesis history. Each entry records: +- **Hypothesis** — what was tried (e.g., "candidate (a): buffer size change", "candidate (b): data layout change", "candidate (c): representation change", "candidate (d): data-pattern specialization"). +- **Change** — the specific code change (file:line, function name, brief description). +- **Before/after** — the measurements (wall-clock, bytes, tokens, any problem-specific metric). +- **Keep/revert** — the decision and the reason. +- **Cost** — wall-clock + tokens spent on this iteration. + +The log is the per-iteration audit trail. The user can see what was tried, what worked, what was reverted, and why. The log is also the source of truth for the Q9 application: when a pass plateaus, the log is re-sampled to identify the hottest stage and the data shape that suggests a different machine. + +The log format is not specified in the prompts; each repo develops its own. A future track could specify a template (`OPTIMIZATION-LOG.md` schema) to help future projects adopt the pattern. The template would include the 5 fields above + a "next action" field for the next iteration's hypothesis. + +#### §9.5 The Committed-Input Sha256 Freeze + +The committed-input sha256 freeze is the discipline that prevents the benchmark from being quietly edited. The harness computes the sha256 of the input before the run and re-checks after the run; if the hashes don't match, the harness aborts. The discipline is "the benchmark cannot be quietly edited" — if the input changes, the run is invalid. + +The freeze is small but load-bearing. Without it, a bug in the harness could change the input (e.g., a typo in a path, an unintended file write) and the run would proceed with the wrong input. The freeze catches this class of bugs. + +The freeze is also the contract between the case study and the reader: the reader can re-run the harness and verify the results, because the input is frozen at a known sha256. The reproducibility is the methodology's credibility. + +#### §9.6 The Model-as-Test-Subject Framing + +The model-as-test-subject framing is the discipline that the case study is about the methodology, not the model. The collisions README's framing is explicit: "This is one model, one run — a case study in how to drive an LLM at an optimization problem, not a benchmark comparing models." The PEP README does not name the model; the absence is itself a framing choice (the methodology is the artifact, not the model). + +The framing matters because it sets the reader's expectations. A reader who expects a benchmark (which model is faster?) will be disappointed; a reader who expects a methodology (how to drive an LLM at an optimization problem?) will find the case study useful. The framing is a contract with the reader. + +#### §9.7 The GPT-5.5 String + +The GPT-5.5 string in the collisions README is unverified. As of 2026-06-20, the publicly-known GPT families are 4 / 4o / 4.5 / 5; "GPT-5.5" is not a known public model. The collisions README's framing — "This is one model, one run — a case study in how to drive an LLM at an optimization problem, not a benchmark comparing models" — suggests one of three readings: + +1. **A private/internal model.** The model is not publicly known, but the methodology applies to any model. The case study is the methodology, not the model. +2. **A model-disconnect placeholder.** The name is deliberately fake to test whether the methodology works without depending on a specific model's quirks. The methodology is being tested for portability. +3. **A typo.** The name is a mistake (e.g., "GPT-5.5" was meant to be "GPT-5" or "GPT-4.5"). The methodology still applies; the typo is incidental. + +Without further evidence, the §9 section treats "GPT-5.5" as a model-disconnect placeholder per the README's stated framing. The methodology is the artifact, not the model; the model name is incidental to the methodology's validity. + +#### §9.8 Per-Repo Detail + +The two case-study repos implement the same 5-element pattern with different match contracts: + +1. **`macton/pep-copt`** — image compression. 4-prompt methodology, 24-image benchmark, byte-identity + size + decode contract, 2.04× speedup aggregate. The 9-step proof harness has 5 enforcing gates (identity baseline, median-of-5 speedup, decompression-time gate, generalization, determinism). +2. **`macton/differentiable-collisions-optc`** — convex primitive collision detection. 4-prompt methodology, 1000-pair benchmark, tolerance-based + collision-flag + contact-validator contract, 101.06× speedup on committed input. The 10-step proof harness has 4 enforcing gates (comparator with distance tolerance, contact-point certifier, precompute isolation, determinism). + +The two repos are the empirical evidence for the case-study methodology. The methodology works for both byte-identity and tolerance-based contracts; the methodology is the pattern, the match contract is the parameterization. + +#### §9.9 Manual Slop Implications + +The Manual Slop equivalents of the case-study methodology are partial. The closest analogs are: +- **`conductor/code_styleguides/knowledge_artifacts.md`** — the knowledge harvest pattern, which has a 7-category schema + provenance + sha256 ledger (per the nagent_review_v2.1 §2.1 framing). The 7-category schema is the "schema is the whole schema" principle applied to knowledge. +- **Per-track `OPTIMIZATION-LOG.md`** — not yet adopted. The case-study methodology suggests a parallel structure: a per-iteration optimization log file that records hypothesis + change + before/after + keep/revert + cost. +- **The `live_gui` test fixture** (per `docs/guide_testing.md`) — the per-turn measurement primitive. The fixture is the test, not the application; the methodology is the pattern, the fixture is one implementation. +- **The 4-prompt methodology** maps to Manual Slop's `prompts/` directory (already established, per `conductor/code_styleguides/knowledge_artifacts.md`). The 4-prompt sequence is a structured "drive the agent through these phases" pattern. + +The gap Manual Slop could close: +1. **No per-iteration optimization log.** Manual Slop's per-track `state.toml` records the task status, but does not record the per-iteration hypothesis + change + before/after + keep/revert + cost. A future track could add the optimization log pattern. +2. **No match-contract discipline.** Manual Slop's tests assert correctness, but the assertion is "the test passes" not "the optimized output agrees with the reference to within tolerance". A future track could add the match-contract discipline to the test framework. +3. **No "committed-input sha256 freeze" for benchmarks.** Manual Slop's test fixtures are gitignored, but the sha256 of the fixture is not verified before/after the run. A future track could add the sha256 freeze to the benchmark harness. +4. **No "model-as-test-subject" framing.** Manual Slop's MMA WorkerPool spawns tier-3 workers, but the model used is not named in the worker's output. A future track could add the model-name to the worker's metadata for methodology-test purposes. + +#### §9.10 Honest Gaps + +1. **The GPT-5.5 string is unverified.** As of 2026-06-20, the publicly-known GPT families are 4 / 4o / 4.5 / 5; "GPT-5.5" is not a known public model. The collisions README's framing suggests deliberate model-disconnect, a private model, or a typo. Without further evidence, the §9 section treats "GPT-5.5" as a model-disconnect placeholder. +2. **The 4-prompt methodology is implicit** (the README lists the 4 prompts but does not name the pattern). The §9 cluster surfaces the pattern explicitly; a future track could formalize it as `prompts/create-{phase}.md` template. +3. **The "different machine" replacement (Q9 from §8) is invoked in the case-study README but the prompts do not cite Q9 by name.** The connection is implicit; an explicit cross-reference would help. +4. **The optimization log format (`OPTIMIZATION-LOG.md` schema) is not specified in the prompts;** each repo develops its own. A template would help future projects adopt the pattern. +5. **The committed-input sha256 freeze is not exhaustively tested.** The freeze is implemented in the harness, but the test coverage is not visible in the source-read. A v4 would add a test that asserts the freeze catches a quiet input edit. +6. **The match-contract variation (byte-identity vs tolerance-based) is not generalized.** Each repo defines its own match contract; there is no shared "match contract schema". A future track could define a shared schema. +7. **The "model-as-test-subject" framing is not enforceable.** A future project could use the methodology as a benchmark (which model is faster?) and the framing would be silent. A v4 would document the framing as a "this is a methodology test, not a benchmark" disclaimer in the prompt template. +8. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver has its own 6 phases. The case-study methodology could be modeled as a campaign: the 4 prompts are the campaign's items, the harness is the campaign's gate, the optimization log is the campaign's per-item history. The v3 cluster does not document this modeling. + +#### §9.11 Code-Shape Sketch + +The case-study methodology, in survey-grammar SSDL notation, with shape tags: + +``` +case-study { input, model, target, contract } :: result {ssdl} [B] + // 4-prompt methodology, run in sequence + ref := run(prompts/create-reference, input, model) + harness := run(prompts/create-optimized-test-harness, input, model) + log := [] + freeze := sha256(input) // committed-input freeze + for iter := 0..N: + if sha256(input) != freeze: abort("input changed") + hypothesis := pick-candidate(log, ref, plateau_signal) + opt := run(prompts/create-optimized, {input, hypothesis}, model) + hook-result := hook-per-run(harness, opt) // per §3 + verdict := gate(hook-result, contract) // match contract: byte-identity | tolerance + if verdict.ok: + log.append({hypothesis, opt, hook-result, verdict, cost, kept: true}) + commit(opt, log) + else: + log.append({hypothesis, opt, hook-result, verdict, cost, kept: false}) + revert() + if plateau(log) -> replace-machine(log) // per §8 Q9 + return opt + +match-contract := { type: byte-identity | tolerance, + tolerance: { dist_max, contact_certifier: bool } } + +candidates := { a: "buffer size / data layout", + b: "approximation / lookup", + c: "representation / algorithm", // Q9 + d: "data-pattern specialization" } // Q5/Q6 + +plateau-signal := { consecutive_reverts: int, micro_tweaks_stuck: bool } +``` + +The shape tag map: `[B]` for the boundary (the case-study is where the model's working state meets measurement), `[I]` for the inspectable plateau signal. The methodology operates on data on disk (the input, the log, the freeze); the model's job is to follow the 4-prompt sequence and act on the harness's per-turn measurement. + +**Source-read citations:** +- `pep-copt/README.md` — full project description, 4-prompt methodology, 24-image results +- `pep-copt/prompts/create-reference.md` — reference pipeline specification +- `pep-copt/prompts/create-optimized-test-harness.md` — test/comparison/measurement scaffold +- `pep-copt/prompts/create-optimized.md` — optimization instructions: 4 candidate kinds +- `pep-copt/prompts/create-visualizer.md` — quality visualizer specification +- `pep-copt/prove-optimized-harness.sh` — 9-step proof + 5 enforcing gates +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md` — per-hypothesis history +- `differentiable-collisions-optc/README.md` — full project description, 4-prompt methodology, 1000-pair benchmark +- `differentiable-collisions-optc/prompts/create-reference.md` — reference specification +- `differentiable-collisions-optc/prompts/create-optimized-test-harness.md` — harness specification +- `differentiable-collisions-optc/prompts/create-optimized.md` — optimization instructions +- `differentiable-collisions-optc/prompts/create-visualizer.md` — visualizer specification +- `differentiable-collisions-optc/prove-optimized-harness.sh` — 10-step proof + 4 enforcing gates +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md` — per-hypothesis history +- `pep-copt/prompts/create-optimized.md` — "stop filing the current machine" guidance (the Q9 application) +- `differentiable-collisions-optc/prompts/create-optimized.md` — "the most durable headroom from here is structural" guidance (the Q9 application) +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:1-50` — log format (per-hypothesis history) +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:50-100` — log format continued +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:100-200` — log format continued +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:1-50` — log format (per-hypothesis history) +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:50-100` — log format continued +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:100-200` — log format continued +- `pep-copt/prove-optimized-harness.sh:1-50` — harness start (per-step + per-gate) +- `pep-copt/prove-optimized-harness.sh:50-150` — harness body +- `pep-copt/prove-optimized-harness.sh:150-300` — harness end +- `differentiable-collisions-optc/prove-optimized-harness.sh:1-50` — harness start +- `differentiable-collisions-optc/prove-optimized-harness.sh:50-150` — harness body +- `differentiable-collisions-optc/prove-optimized-harness.sh:150-350` — harness end +- `pep-copt/README.md:1-50` — project description start +- `pep-copt/README.md:50-150` — 4-prompt methodology +- `pep-copt/README.md:150-300` — 24-image results +- `pep-copt/README.md:300-500` — results continued +- `differentiable-collisions-optc/README.md:1-50` — project description start +- `differentiable-collisions-optc/README.md:50-150` — 4-prompt methodology +- `differentiable-collisions-optc/README.md:150-300` — 1000-pair benchmark +- `differentiable-collisions-optc/README.md:300-500` — results continued +- `intent_dsl_survey_20260612` — the survey's Cluster 4 (Meta-Tooling DSLs) + Cluster 3 (intent-mapping) (the v3 cluster cross-references the survey for the implicit intent-DSL parallel) +- `superpowers_review_20260619` — the superpowers `brainstorming` skill (the v3 cluster cross-references the skill for the process parallel) +- `bin/helpers/nagent_campaign_lib.py` — campaigns driver (relevant for the gap note on campaigns modeling) + +**Decision candidate:** NEW Candidate 25 (MEDIUM). "Optimization-log discipline for Manual Slop agent work" — adopt the `OPTIMIZATION-LOG.md` pattern: every agent iteration records hypothesis + change + before/after + keep/revert + cost (wall-clock + tokens). See `decisions.md` Candidate 25. +**Cross-refs:** `conductor/tracks/intent_dsl_survey_20260612/` — the survey's Cluster 4 "Meta-Tooling DSLs" is the closest prior art (the 4-prompt methodology is implicitly an intent-DSL for "drive nagent at an optimization problem"). `conductor/tracks/superpowers_review_20260619/` — the superpowers `brainstorming` skill is a process parallel (structured questions to refine an idea before implementation; the case-study prompts serve the same role). §3 Hooks (the proof harness IS the `--hook-per-run`); §8 Operating rules (the Q9 expansion is invoked when micro-tweaks plateau). +**Pattern history:** NEW. v2.3 had no case-study methodology (no case-study repos existed). v3 introduces a 5-element pattern that any project adopting nagent can replicate. EXTENDS v2.3 Pattern 5 ("the loop") with the per-turn proof injection. EXTENDS v2.3 Pattern 7 ("repo history as data") with the optimization log as a per-hypothesis history file. +## §10 PEP case study + +**Source:** `macton/pep-copt` at `main` (5 commits); `README.md` (full); `src-optimized/OPTIMIZATION-LOG.md` (full); `prompts/create-reference.md` (full); `prompts/create-optimized-test-harness.md` (full); `prompts/create-optimized.md` (full, per §9); `prompts/create-visualizer.md` (full); `prove-optimized-harness.sh` (full, per §3). +**One-liner:** PEP image compression: 24-image benchmark, **2.04× aggregate** (per-image ~1.5–2.6×) under strict size-correct locked baseline; byte-identical `.pep` output (size ratio 1.00× on every image); decode net-neutral (opt/ref 1.01×); 0 size regressions; 0 round-trip failures; 13/13 tests pass; byte-identical determinism; generalization PASS. The earlier 9.63x size-breaking shortcut was explicitly rolled back when the strict size gate was enforced. +**Pattern summary:** The PEP case study is the §9 5-element pattern applied to a byte-identity-strict optimization. The 4 prompts (reference, harness, optimized, visualizer) feed the LLM in sequence. The harness decompresses both reference and optimized `.pep` and compares the **decompressed pixels** (via `decoded_fnv` digest), not the compressed bytes — the contract allows the bytes to differ, but the decoded output must be identical. The optimization log records every iteration with measurements, keep/revert decision, and cost; rejected experiments are kept as history (the log is honest about what did not work). The locked baseline is 2.04× aggregate on 24 images with 0 size regressions, 0 round-trip failures, 13/13 tests pass, byte-identical determinism, and generalization PASS. The 6 kept optimizations are all (a) "work removal" or (b) "throughput/data layout" candidate kinds (per §9 + §8); no (c) "representation/algorithm" or (d) "data-pattern specialization" kinds made it to kept. The earlier 9.63x was a size-breaking shortcut (single-model selection) that was rolled back when the strict size gate was enforced — the methodology's data-discipline means the contradiction is not hidden. + +#### §10.1 What the PEP Case Study Adds + +The PEP case study is the byte-identity-strict exemplar of the §9 5-element pattern. The case study applies the 4-prompt methodology + harness + log + freeze + subject to a real image-compression optimization problem (PEP format). The results are empirical evidence for the methodology's effectiveness under a strict correctness contract. + +The key results: + +- **2.04× aggregate speedup** (per-image ~1.5–2.6×) under strict size-correct locked baseline on 24 images. +- **Byte-identical `.pep` output** (size ratio 1.00× on every image). +- **Decode net-neutral** (opt/ref 1.01×) — the optimization does not regress decode time. +- **0 size regressions** across 24 images. +- **0 round-trip failures** — the decompressed pixels match the reference exactly. +- **13/13 tests pass** — the test suite is fully green. +- **Byte-identical determinism** — re-running the optimized implementation produces the same output. +- **Generalization PASS** — the optimization works on held-out images, not just the committed input. + +The earlier 9.63x was a size-breaking shortcut (single-model selection) that was explicitly rolled back when the strict size gate was enforced. The 9.63x is preserved in the OPTIMIZATION-LOG as superseded history; the README cites the 2.04x as canonical. + +#### §10.2 The 4-Prompt Sequence Applied + +The 4-prompt sequence for PEP (per §9): + +1. **`create-reference.md`** — the reference pipeline spec: load → quantize → compress → save → verify. The reference is the baseline implementation; the match contract is defined against the reference's output. + +2. **`create-optimized-test-harness.md`** — the test/comparison/measurement scaffold: decompressed-pixel comparator, median-of-5 timing, decode gate, generalization gate. The harness is the per-turn measurement primitive (§3 cross-ref). + +3. **`create-optimized.md`** — the optimization instructions: 4 candidate kinds (a) "work removal", (b) "throughput/data layout", (c) "representation/algorithm", (d) "data-pattern specialization" + the Q1-Q9 simplification pass + 2 exit criteria (plateau + "stop filing when reverts accumulate"). + +4. **`create-visualizer.md`** — the quality visualizer: one-image-at-a-time side-by-side comparison. The visualizer is the human-facing layer of the match contract. + +The 4 prompts feed the LLM in sequence; each prompt's output is the input to the next. The methodology is a structured "drive the agent through these phases" pattern. + +#### §10.3 The 6 Kept Optimizations + +The 6 kept optimizations (per the OPTIMIZATION-LOG's LOCKED BASELINE section): + +1. **Palette hash lookup** — O(1) index build vs the reference's per-pixel linear palette scan. Per-image, survives strict. Q5/Q6 ("lookup table") kind. +2. **Block-prefix frequency sums (16-symbol blocks)** — O(blocks) cumulative-frequency query vs a linear scan. Per-symbol, core of the per-model win. Q5/Q6 kind. +3. **Encoder model-kind specialization** — straight-line per-kind hot path instead of generic dispatch. Q3 ("fewer times") kind. +4. **Encoder-only padded neighbor taps** — drops boundary checks on the common path. Q1 ("not do this at all") kind. +5. **Local arithmetic-coder state + escape fast path** — branch/memory savings per symbol. Q3 kind. +6. **Early-abandon + count-only loser evaluation** — measured +30% (1.57x → 2.04x): losing models stop early instead of fully encoding. The keystone for the 3-model exhaustive under strict. Q1/Q3 kind. + +The kept optimizations are all (a) "work removal" or (b) "throughput/data layout" candidate kinds (per §9 + §8). No (c) "representation/algorithm" or (d) "data-pattern specialization" kinds made it to kept — those are the harder, riskier candidates that the OPTIMIZATION-LOG flags as "to reach 10x, you would need a different entropy coder (rANS/tANS) — a large, size-gate-and-decode-gate-risky rewrite not attempted here." + +The Q9 expansion from §8 is explicit in the OPTIMIZATION-LOG: the "stop filing the current machine" guidance is the Q9 application. When the pass plateaus (consecutive reverts, micro-tweaks stuck below target), the model is expected to re-profile the data and evaluate a (c) or (d) candidate. The PEP case study did not reach the (c)/(d) candidates; the locked baseline is the 2.04x from (a)/(b) candidates only. + +#### §10.4 The Size/Speed Frontier + +The size/speed frontier (per the OPTIMIZATION-LOG) is the data-oriented response to "speed is not the only metric": + +| approach | speed | size regressions | +|---|---|---| +| **strict exhaustive (LOCKED)** | **2.04x** | **0/24** | +| sample-band H/4 selection | 3.16x | 8/24 (+8%) | +| sample-band H/16 selection | 5.43x | 10/24 (+12%) | +| single-model heuristic | 9.25x | 8/24 (+35%) | + +The frontier is the data-oriented response to "speed is not the only metric". The single-model heuristic is the fastest but breaks the size gate (8/24 images have a +35% size regression); sample-band selections are middle ground but still break the size gate (8-10/24 images have +8-12% size regression); strict exhaustive is the only approach that satisfies all gates. The locked baseline is the data-grounded decision. + +The frontier is the methodology's most informative data point: it shows that "faster" is not always "better". The single-model heuristic's 9.25x speedup comes at the cost of 8/24 images being 35% larger; the strict exhaustive's 2.04x speedup comes with 0/24 images being larger. The match contract (size must not regress) is the constraint that picks the winner. + +#### §10.5 The 9.63x vs 2.04x Story + +The 9.63x vs 2.04x story is the methodology's most informative data point. The 9.63x came from a size-breaking shortcut (single-model selection on a 3-image set); the 2.04x comes from restoring strict all-model selection on a 24-image set. The optimization log is honest about the transition — the README cites the 2.04x as canonical, the OPTIMIZATION-LOG preserves the 9.63x as superseded history. + +The contradiction is not hidden: a future reader can trace the path from 9.63x to 2.04x and see exactly which gate (size) caused the rollback. The methodology's data-discipline means the rollback is documented, not erased. The OPTIMIZATION-LOG records the 9.63x as "earlier experiment, rolled back when strict size gate was enforced"; the README cites the 2.04x as "the locked strict baseline". + +The story is the methodology's credibility test: a methodology that hides failed experiments is not credible. The PEP case study passes the test by documenting the 9.63x alongside the 2.04x, with the explicit note that the 9.63x was a size-breaking shortcut that did not satisfy the match contract. + +#### §10.6 The Build-Level Lever Experiments + +The build-level lever experiments (per the OPTIMIZATION-LOG's "Human-assisted attempt" section) are also documented: PGO (no gain), `-funroll-loops` (regressed), LTO (fails decode gate — speeds compress to 9.70x but slows decode to 1.24x), reciprocal division (regressed to 8.92x). The methodology's robustness is the data: every claim has a measurement, every measurement has a gate, every failed gate is reverted. + +The build-level experiments are the methodology's honesty about the build pipeline: the optimization is not just about the source code; the build flags, the linker, the PGO profile, the arithmetic-coder state — all of these are candidates for the Q1-Q9 pass. The build-level experiments are documented as "human-assisted attempts" (the LLM did not drive these; the human did), but they are part of the methodology's data-discipline: every claim is measured, every measurement is gated. + +#### §10.7 The 429 Insufficient Quota Endpoint + +The optimization loop is bounded by LLM API cost in a way that is invisible from the README alone. The OPTIMIZATION-LOG explicitly says the run ended "because the LLM provider (OpenAI) returned 429 insufficient_quota (out of API quota)" — the methodology is bounded by API cost. + +The 429 endpoint is a methodology-data point worth noting: the optimization loop is not infinite; it stops when the LLM provider runs out of quota. The methodology's data-discipline includes the "the run stopped here" note — the run did not stop at a defined exit criterion; it stopped because the provider ran out of quota. A future reader can see the exact stopping point and the exact reason. + +The 429 endpoint is also a constraint on the methodology's applicability: a project that cannot afford the LLM API cost cannot run the full methodology. The methodology's cost is not zero; the cost is bounded by the LLM provider's pricing. A future project adopting the methodology would need to budget for the LLM cost. + +#### §10.8 Manual Slop Implications + +The Manual Slop equivalents of the PEP case study are partial. The closest analogs are: +- **`conductor/code_styleguides/data_oriented_design.md`** — the operating rule set Acton applied. The PEP case study is the empirical demonstration of those rules applied to a real optimization problem. +- **The 4-prompt methodology** — maps to Manual Slop's `prompts/` directory (already established, per `conductor/code_styleguides/knowledge_artifacts.md`). +- **The `OPTIMIZATION-LOG.md` schema** — not yet adopted by Manual Slop. The case study suggests a parallel structure: a per-iteration optimization log file that records hypothesis + change + before/after + keep/revert + cost. + +The gap Manual Slop could close: +1. **No `OPTIMIZATION-LOG.md` schema.** Manual Slop's per-track `state.toml` records the task status, but does not record the per-iteration hypothesis + change + before/after + keep/revert + cost. A future track could add the optimization log pattern. +2. **No size/speed frontier discipline.** Manual Slop's tests assert correctness, but the assertion is "the test passes" not "the optimization satisfies the size/speed frontier". A future track could add the frontier discipline to the test framework. +3. **No "earlier experiment rolled back" documentation.** Manual Slop's git history is the rollback record, but the per-iteration "why was this reverted" is not documented in a structured way. A future track could add the rollback documentation pattern. +4. **No build-level lever experiments.** Manual Slop's build configuration is not part of the optimization loop. A future track could add the build-level lever experiments to the methodology. + +#### §10.9 Honest Gaps + +1. **The README's per-image results table (all 24 images, byte-identical `.pep`) and the OPTIMIZATION-LOG's "current measured proof" (3-image, 9.63x) describe different benchmarks.** The README's results are the locked strict baseline (2.04x aggregate); the OPTIMIZATION-LOG's 9.63x is a size-breaking shortcut on a 3-image set that was rolled back. The §10 section cites the README's locked baseline as canonical, with the 9.63x noted as superseded history. +2. **The OPTIMIZATION-LOG explicitly says the run ended "because the LLM provider (OpenAI) returned 429 insufficient_quota"** — the methodology is bounded by API cost in a way the README does not surface. +3. **The "current kept optimizations" list (6 items) is a partial accounting; the README's per-image results table tells a different story (per-image speedup varies 1.5x to 2.6x).** The aggregate hides per-image variance. +4. **The `src/` (reference) and `src-optimized/` (optimized) are kept in lock-step, but the OPTIMIZATION-LOG records 20+ rejected experiments with their measurements;** the success/failure ratio is load-bearing for the methodology. +5. **The build-level lever experiments (PGO, LTO, etc.) are documented as "human-assisted attempts"** — the LLM did not drive these. The methodology's boundary between "LLM-driven" and "human-assisted" is not formalized. +6. **The match contract (byte-identical decompressed pixels + size not larger + decode not slower) is not exhaustively specified** — the contract is implicit in the harness's enforcing gates. A future track could formalize the contract as a schema. +7. **The "stop filing when plateaued" guidance is not measured.** The OPTIMIZATION-LOG records the plateau signal (consecutive reverts, micro-tweaks stuck below target) but does not measure the plateau's duration or the data shape that triggered it. + +#### §10.10 Code-Shape Sketch + +The PEP case study, in survey-grammar SSDL notation, with shape tags: + +``` +pep-optimization { reference, committed_images, n_target } :: result {ssdl} [B] + ref_results := run(reference, committed_images) // ref/build/out/*.pep + manifest + harness := build-harness(ref_results) // decomposed-pixel comparator + decode gate + log := [] + for iter := 0..N: + candidate := pick(log, ref, candidates) // Q1-Q9 + 4 kinds (a)/(b)/(c)/(d) + opt := apply(candidate, ref) + if not harness.gates-pass(opt): // pixel + size + decode + determinism + generalization + log.append({candidate, opt, kept: false, reason: harness.last-failure}) + revert() + continue + log.append({candidate, opt, kept: true, measurements: harness.medians, cost: ...}) + commit(opt) // durable baseline + if plateau(log, recent-N): // §8 Q9: re-profile, evaluate (c)/(d) + re-profile-data() // would change kind selection + return committed(opt, log) + +candidates := { a: "work removal", // Q1, Q3, Q4 + b: "throughput/data layout", // Q3, Q5, Q6 + c: "representation/algorithm", // Q9 (not attempted in PEP) + d: "data-pattern specialization" } // Q5/Q6 (not attempted in PEP) + +size-speed-frontier := { strict_exhaustive: 2.04x, + sample_band_h4: 3.16x, // 8/24 size regressions + sample_band_h16: 5.43x, // 10/24 size regressions + single_model: 9.25x } // 8/24 size regressions +``` + +The shape tag map: `[B]` for the boundary (the case-study is where the model's working state meets the gate), `[I]` for the inspectable frontier. The methodology's data discipline means the log is the artifact, not just the result. + +**Source-read citations:** +- `pep-copt/README.md` — full project: 24-image results, 4-prompt methodology, byte-identity + size + decode contract +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md` — full log: LOCKED BASELINE = 2.04x strict size-correct +- `pep-copt/prompts/create-reference.md` — reference pipeline spec +- `pep-copt/prompts/create-optimized-test-harness.md` — scaffold spec +- `pep-copt/prompts/create-visualizer.md` — visualizer spec +- `pep-copt/prompts/create-optimized.md` — optimization spec +- `pep-copt/prove-optimized-harness.sh` — 9-step proof + 5 enforcing gates +- `pep-copt/Makefile.optimized` + `Makefile` — build configuration +- `pep-copt/viz/contact_sheet.c` — visualizer source +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:1-50` — LOCKED BASELINE section +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:50-100` — kept optimizations list +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:100-200` — rejected experiments +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:200-300` — size/speed frontier +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:300-400` — build-level lever experiments +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:400-500` — methodology notes +- `pep-copt/README.md:1-50` — project description +- `pep-copt/README.md:50-150` — 4-prompt methodology +- `pep-copt/README.md:150-300` — 24-image results table +- `pep-copt/README.md:300-500` — results continued + match contract +- `pep-copt/prove-optimized-harness.sh:1-50` — harness start +- `pep-copt/prove-optimized-harness.sh:50-150` — harness body +- `pep-copt/prove-optimized-harness.sh:150-300` — harness end +- `pep-copt/prompts/create-reference.md:1-50` — reference spec start +- `pep-copt/prompts/create-reference.md:50-150` — reference spec body +- `pep-copt/prompts/create-optimized.md:1-50` — optimization spec start +- `pep-copt/prompts/create-optimized.md:50-150` — 4 candidate kinds +- `pep-copt/prompts/create-optimized.md:150-300` — exit criteria + plateau guidance +- `pep-copt/prompts/create-optimized-test-harness.md:1-50` — harness spec start +- `pep-copt/prompts/create-optimized-test-harness.md:50-150` — harness spec body +- `pep-copt/prompts/create-visualizer.md:1-50` — visualizer spec start +- `pep-copt/prompts/create-visualizer.md:50-150` — visualizer spec body +- `pep-copt/Makefile.optimized:1-50` — build config start +- `pep-copt/Makefile.optimized:50-100` — build config body +- `pep-copt/viz/contact_sheet.c:1-50` — visualizer source start +- `pep-copt/viz/contact_sheet.c:50-200` — visualizer source body +- `pep-copt/` (full repo at main) — 5 commits + README + OPTIMIZATION-LOG + 4 prompts + harness +- `pep-copt/commits/` — the 5 commit history (the v3 cluster does not cite specific SHAs) +- `pep-copt/.gitignore` — the gitignore (the v3 cluster does not cite specific contents) +- `pep-copt/OPTIMIZATION-LOG.md` (root) — the v3 cluster does not cite a root-level log; the log is in `src-optimized/` +- `intent_dsl_survey_20260612` — the survey (relevant for the gap note on intent-DSL) +- `superpowers_review_20260619` — the superpowers review (relevant for the gap note on process parallel) + +**Decision candidate:** NEW Candidate 26 (LOW). "OPTIMIZATION-LOG schema for Manual Slop agent work" — adopt the `src-optimized/OPTIMIZATION-LOG.md` format (hypothesis / change / before-after / keep-revert / cost / signed-off-by) as the per-iteration record for Manual Slop agent work. See `decisions.md` Candidate 26. +**Cross-refs:** §3 Hooks (`prove-optimized-harness.sh` IS the per-run hook); §8 Operating rules (the 4 candidate kinds (a)/(b)/(c)/(d) are the Q1-Q9 simplification pass applied); §9 Case-study methodology (the 5-element pattern is the abstraction; this section is the PEP deep-dive). +**Pattern history:** NEW. v2.3 had no case-study repos. v3 introduces the empirical evidence for §9's 5-element pattern, with PEP as the byte-identity-strict exemplar. +## §11 Collisions case study + +**Source:** `macton/differentiable-collisions-optc` at `main` (5 commits); `README.md` (full); `src-optimized/OPTIMIZATION-LOG.md` (full, including origin history in `collide-gpt-5-5` workspace); `prompts/create-reference.md` (full); `prompts/create-optimized-test-harness.md` (full); `prompts/create-optimized.md` (full, per §9); `prompts/create-visualizer.md` (full); `prove-optimized-harness.sh` (full, per §3). +**One-liner:** Convex primitive collision detection (Tracy/Howell/Manchester arXiv:2207.00669): **101.06× on committed input** (median-of-5, ~0.330 s → ~0.003268 s); 97.75× and 98.43× on alternate seeds — 100× generalized claim explicitly NOT made. Tolerance-based match contract: collision flags identical, per-pair distance within `|Δ| ≤ 1mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)`, contact points certified for validity (not matched). All gates + generalization PASS; contacts 1000/1000 valid. +**Pattern summary:** The collisions case study is the §9 5-element pattern applied to a tolerance-based optimization. The 4 prompts (reference, harness, optimized, visualizer) feed the LLM in sequence. The harness implements a tolerance comparator (`compare_results`) with a hybrid distance tolerance `1mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)` — an absolute floor + a relative term + an alpha-conditioning term. Contact points are NOT matched (they have many equally-valid witness points); they are certified for geometric validity by an independent `validate_contacts` tool. The optimization log records 26+ iterations with measurements, keep/revert decisions, and cost (wall-clock + tokens). The 12 H-numbered kept optimizations + the 14 origin iterations trace a clear arc: different algorithm (Q9 in Iteration 3 — "remove barrier solve; support/GJK+bisection alpha"), per-type specialization (Iterations 5-7), skip unused work (Iteration 8), compact representation (Iteration 9 — `cp_shape_lite`), precompute moves (Iteration 12), loop cap reductions (Iterations 11, 13, 14), single precision + re-centering (H1), contact point witness recovery (H2), analytic contact witness (H3), no heap allocation (H4), broadphase assumption + alpha-conditioned tolerance (H5), polytope hull edge precompute (H6), direct scaled support specialization (H9) + force-inline (H10). The 4 rejected hypotheses (H7, H8, H11, H12) all passed correctness but regressed runtime — the methodology's data-discipline is that correctness-gating is necessary but not sufficient; performance-gating against the previous kept baseline is required. + +#### §11.1 What the Collisions Case Study Adds + +The collisions case study is the tolerance-based exemplar of the §9 5-element pattern. The case study applies the 4-prompt methodology + harness + log + freeze + subject to a real collision-detection optimization problem (Tracy/Howell/Manchester convex primitive collision detection). The results are empirical evidence for the methodology's effectiveness under a tolerance-based correctness contract. + +The key results: + +- **101.06× speedup on committed input** (median-of-5, ~0.330 s → ~0.003268 s). +- **97.75× and 98.43× on alternate seeds** — the 100× generalized claim is explicitly NOT made. +- **Collision flags identical** — the optimized implementation agrees with the reference on every collision flag. +- **Per-pair distance within tolerance** — `|Δ| ≤ 1mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)`. +- **Contact points 1000/1000 valid** — all contact points pass the independent `validate_contacts` tool. +- **All gates PASS** — tolerance + median-of-5 + validator + generalization. +- **Generalization PASS** — the optimization works on held-out seeds, not just the committed input. + +The match contract is tolerance-based (not byte-identity like PEP), because collision detection has many equally-valid witness points for face/edge contacts. The contract is "collision flags identical + distance within tolerance + contact points certified for validity" — the strictest contract that is structurally feasible for the problem. + +#### §11.2 The 4-Prompt Sequence Applied + +The 4-prompt sequence for collisions (per §9): + +1. **`create-reference.md`** — the reference solver spec: Tracy/Howell/Manchester, deterministic, ±8km domain, 1mm resolution, secondary validator. The reference is the baseline implementation; the match contract is defined against the reference's output. + +2. **`create-optimized-test-harness.md`** — the harness spec: tolerance comparator + median-of-5 + validator + generalization. The harness is the per-turn measurement primitive (§3 cross-ref). + +3. **`create-optimized.md`** — the optimization spec: 2 candidate kinds (a) "work removal" + (b) "throughput/data layout", build-stage precompute allowed, two-transform isolation. The optimization is bounded by the methodology's Q1-Q9 simplification pass. + +4. **`create-visualizer.md`** — the visualizer spec: one-pair-at-a-time 3D render + screenshots. The visualizer is the human-facing layer of the match contract. + +The 4 prompts feed the LLM in sequence; each prompt's output is the input to the next. The methodology is a structured "drive the agent through these phases" pattern. + +#### §11.3 The 12 H-Numbered Kept Optimizations + +The 12 H-numbered kept optimizations trace a clear arc: + +1. **Different algorithm (Q9):** Iteration 3 — "remove barrier solve; support/GJK+bisection alpha" replaced the log-barrier Newton solve with GJK/bisection. Single-largest win (~30x at the time). +2. **Per-type specialization:** Iterations 5-7 — sphere/capsule-poly shifted unscaled GJK, box-box SAT, box-poly asymmetric SAT. +3. **Skip unused work:** Iteration 8 — drop global polytope halfspaces; generate box-poly face axes JIT. +4. **Compact representation:** Iteration 9 — `cp_shape_lite { status, type, c[3] }` for the runtime path. 50x target met. +5. **Precompute moves:** Iteration 12 — `cp_collide_pairs_precomputed` API; optimized harness precomputes shapes before timed region. 84.91x. +6. **Loop cap reductions:** Iterations 11, 13, 14 — reduce fixed iteration counts where the data shows the lower bound passes the gate. 101.06x on committed. +7. **Single precision + re-centering (H1):** move from double to float with per-pair re-centering to defeat km-scale cancellation. Also discovered and fixed a catastrophic-cancellation quadratic root bug (1019mm → 1.05mm). 1mm hybrid tolerance aligned with reference's own 1mm spec. +8. **Contact point witness recovery (H2):** the contact-point commit regressed to 18.8x; recovered to 54.4x via witness bisection early-exit + single witness read. +9. **Analytic contact witness (H3):** for sphere/capsule pairs, the witness is closed-form (closest point on the other shape's alpha-scaled boundary). Saves `gjk_dist` for 312+59 sphere/capsule pairs. +10. **No heap allocation (H4):** `cp_collide_pairs` and `cp_vshapes_from_blob` allocate nothing at runtime; caller owns memory. +11. **Broadphase assumption + alpha-conditioned tolerance (H5):** narrow-phase solver contract; data set regenerated to overlapping-AABB pairs only. Alpha-conditioning term `5e-4·(|c1−c2|/α²)` accounts for float solve's `alpha`-resolution budget. +12. **Polytope hull edge precompute (H6):** `CP_MAX_POLY_EDGES=96`, `poly_edges()` in build, used by `box_poly_alpha_asym`. 75.45x. +13. **Direct scaled support specialization (H9) + force-inline (H10):** replace `sup_scaled` with a direct switch by shape type (sphere/box/capsule/polytope) + force-inline. 79.18x → 82.05x. + +The kept optimizations are a mix of (a) "work removal" and (b) "throughput/data layout" candidate kinds (per §9 + §8). Iteration 3 is a Q9 application ("different algorithm") — the largest single win. The later iterations are Q1/Q3/Q5/Q6 applications. + +#### §11.4 The 4 Rejected Hypotheses + +The 4 rejected hypotheses (H7, H8, H11, H12) all passed correctness but regressed runtime — the methodology's data-discipline is that correctness-gating is necessary but not sufficient; performance-gating against the previous kept baseline is required. + +The rejections are documented in the OPTIMIZATION-LOG with explicit `REJECTED` markers. The rejected experiments are: + +- **H7** — force-inline attempt; passed correctness but regressed runtime. +- **H8** — cap-cut attempt; passed correctness but regressed runtime. +- **H11** — force-inline attempt; passed correctness but regressed runtime. +- **H12** — cap-cut attempt; passed correctness but regressed runtime. + +The 4 rejections are the methodology's data-discipline template: every claim is measured, every measurement is gated, every failed gate is reverted. Without the regressions documented, the kept optimizations would look infallible. The OPTIMIZATION-LOG's explicit `REJECTED` markers are the load-bearing data point. + +#### §11.5 The Contact-Point Feature Regression + +The contact-point feature regression is the most informative data point. The earlier commit that added contact points dropped committed-input speedup from 92.96x (no contact points) to 18.84x. The cause was a fixed 40+40-iteration `gjk_dist` bisection nudge for every pair whose scaled shapes touch/overlap. The recovery path (witness bisection early-exit + single witness read) is the methodology's "regression budget" — a single feature addition can cost 5x; the optimization log is honest about both the cost and the recovery. + +The regression is the methodology's "regression budget" — a single feature addition can cost 5x; the optimization log is honest about both the cost and the recovery. The recovery path (H2: witness bisection early-exit + single witness read) is itself a Q1 ("can we not do this at all?") + Q3 ("can we do this fewer times?") application. + +#### §11.6 The Build-Stage Isolation Invariant + +The build-stage isolation invariant is the collisions case study's unique design constraint. `build_optimized_shapes.c` sees only shapes; `build_optimized_pairs.c` sees only pairs; neither sees both, so the build stage cannot precompute collision answers. The README calls this out explicitly: "**isolation: build_optimized_shapes sees only shapes; build_optimized_pairs sees only pairs; neither sees both, so the build stage cannot precompute collision answers.**" + +The isolation is a creative way to keep the build-stage optimization freedom (allowed per §8 Q9 — "consider a different machine") while preventing the most obvious cheat (precomputing answers). The build stage is allowed to optimize the representation (Q3, Q5, Q6), but it cannot precompute the answer (which would be Q1 = "delete the work", but in a way that violates the methodology's data-discipline). + +The isolation is a creative design constraint; a future track could explore whether the isolation is provably necessary or could be relaxed. The README's framing is explicit: "neither sees both, so the build stage cannot precompute collision answers." The constraint is the methodology's data-discipline in action. + +#### §11.7 The Per-Type Specialization Pattern + +The per-type specialization pattern is the collisions case study's most distinctive optimization. The reference implementation uses a generic solver (one algorithm for all shape pairs); the optimized implementation uses per-type solvers (sphere-sphere, sphere-box, box-box, box-poly, etc.). The per-type solvers exploit the structure of each pair type to skip work the generic solver cannot. + +The per-type specialization is a Q9 application: "consider a different machine that fits the data better". The data (shape pairs) is heterogeneous (sphere pairs, box pairs, poly pairs, mixed pairs); a different machine for each pair type is faster than a generic machine for all pair types. The optimization is the data's shape pointing to a different machine. + +The per-type specialization is also a Q3 application: "can we do this fewer times?". The generic solver runs the same algorithm for every pair; the per-type solvers run only the necessary steps for each pair type. The data is the source of truth; the code is a function of the data. + +#### §11.8 The Closed-Form Contact Witnesses + +The closed-form contact witnesses are a Q9 + Q1 application. For sphere/capsule pairs, the contact point is the closest point on the other shape's alpha-scaled boundary. The closed-form is faster than the generic `gjk_dist` bisection: the generic solver runs 40+40 iterations to find the witness; the closed-form returns it in O(1). + +The closed-form is a "different machine" for the sphere/capsule pair type. The data (sphere/capsule pairs) has a closed-form witness; the generic solver does not exploit this. The per-type solver does exploit this, and the speedup is 312+59 sphere/capsule pairs × (40+40 iterations saved) = significant. + +The closed-form is also a "not do this at all" (Q1) application: the bisection iterations are deleted for sphere/capsule pairs. The data is the source of truth; the code is a function of the data. + +#### §11.9 Per-Repo Detail + +The collisions repo implements the same 5-element pattern as PEP, with different match contracts: + +- **Match contract:** tolerance-based (collision flags identical + distance within tolerance + contact points certified for validity). +- **Candidate kinds:** (a) "work removal" + (b) "throughput/data layout" (per `prompts/create-optimized.md`). +- **Harness:** 10-step proof + 4 enforcing gates (tolerance comparator + median-of-5 + validator + generalization). +- **Optimization log:** 26+ iterations, 4 explicit `REJECTED` markers (H7, H8, H11, H12), 100× on committed input. +- **Build-stage isolation:** `build_optimized_shapes.c` sees only shapes; `build_optimized_pairs.c` sees only pairs. + +The collisions repo is the empirical evidence for the §9 5-element pattern's flexibility: the pattern is invariant (4 prompts + harness + log + freeze + subject); the match contract is the parameterization (tolerance-based); the candidate kinds are the same (a)/(b)/(c)/(d); the gate discipline is the same (correctness + performance + determinism + generalization); the cost tracking is the same (wall-clock + tokens). + +#### §11.10 The 100× Claim Discipline + +The collisions README's "100× target reached" claim is conditional on "committed input only" — the README explicitly says "I would not call it a *uniform* 100× — two of the four seeds land just under — so I claim '100× on the committed benchmark, ~98–102× generally,' and no more." This is the methodology's most informative data-discipline point. + +The discipline: the claim is qualified by the data. The committed input shows 101.06×; the alternate seeds show 97.75× and 98.43×. The claim is "100× on committed input" (which is what the data supports), not "100× on all inputs" (which the data does not support). The methodology's data-discipline means the claim is honest about the variance. + +The 100× claim discipline is the methodology's "label your hypotheses" pattern (§8 honesty). The data says 101.06× on committed input, 97.75× and 98.43× on alternate seeds. The claim is "100× on committed input, ~98–102× generally" — the claim is labeled with the conditions that produced it. + +#### §11.11 The GPT-5.5 Workspace Corroboration + +The "GPT-5.5" string in the collisions README is corroborated by the workspace name `collide-gpt-5-5` (per the OPTIMIZATION-LOG's origin history). The workspace name is a deliberate identifier (private/internal/placeholder), not a typo. The §9 honest-gap note applies: the methodology is the artifact, not the model. + +The workspace name `collide-gpt-5-5` is the empirical evidence for the deliberate-model-identifier reading (vs. typo). The workspace was named after the model used; the README's "GPT-5.5" is the same identifier. The methodology is being tested for portability — the model name is incidental to the methodology's validity. + +#### §11.12 Manual Slop Implications + +The Manual Slop equivalents of the collisions case study are partial. The closest analogs are: +- **`compare_results.c` pattern** — the tolerance comparator with hybrid distance tolerance. The pattern is workable for any problem where byte-identity is structurally infeasible (float work, geometric/continuous problems, etc.). +- **The 26+ iteration optimization arc** — the methodology's data-discipline template. The explicit `REJECTED` markers for H7, H8, H11, H12 are the load-bearing data point. +- **The build-stage isolation invariant** — the creative design constraint that allows build-stage optimization while preventing answer precomputation. + +The gap Manual Slop could close: +1. **No tolerance-based comparator.** Manual Slop's tests assert correctness with byte-identity or simple equality, not hybrid distance tolerance. A future track could add the tolerance comparator for float work or geometric problems. +2. **No explicit `REJECTED` markers.** Manual Slop's git history is the rejection record, but the per-iteration "why was this reverted" is not documented in a structured way. A future track could add the explicit rejection markers pattern. +3. **No build-stage isolation.** Manual Slop's build configuration is not part of the optimization loop. A future track could add the build-stage isolation invariant to the methodology. +4. **No closed-form contact witnesses pattern.** Manual Slop's optimization is generic; the per-type specialization pattern is not adopted. A future track could add the per-type specialization pattern for heterogeneous data. + +#### §11.13 Honest Gaps + +1. **The README's "~102× on committed input" claim and the OPTIMIZATION-LOG's "101.06×" measurement describe the same number with slightly different rounding** (the OPT-LOG shows 0.003268 s / 0.330271 s = 101.06×; the README rounds to 102×). The §11 section cites the OPT-LOG's precise number as canonical. +2. **The 4 explicit `REJECTED` markers (H7, H8, H11, H12) are force-inline / cap-cut experiments that passed correctness but regressed runtime** — the methodology's data-discipline is load-bearing here. Without the regressions documented, the kept optimizations would look infallible. +3. **The two build-stage transforms (`build_optimized_shapes.c` and `build_optimized_pairs.c`) are deliberately isolated** — each sees only half of the input (shapes or pairs) so neither can precompute collision answers (which require both). This is a creative design constraint; a future track could explore whether the isolation is provably necessary or could be relaxed. +4. **The "GPT-5.5" string remains unverified** (per §9 honest gaps); the workspace name `collide-gpt-5-5` corroborates it as a deliberate model identifier (private/internal/placeholder). +5. **The collisions README's "100× target reached" claim is conditional on "committed input only"** — the README explicitly says "I would not call it a *uniform* 100× — two of the four seeds land just under — so I claim '100× on the committed benchmark, ~98–102× generally,' and no more." This is the methodology's most informative data-discipline point. +6. **The contact-point feature regression (92.96x → 18.84x) is the most informative data point** — a single feature addition can cost 5x; the recovery path (H2) is itself a Q1 + Q3 application. The regression is documented but the recovery path is not generalized as a pattern. +7. **The closed-form contact witnesses are a Q9 + Q1 application** — the data (sphere/capsule pairs) has a closed-form witness; the generic solver does not exploit this. The pattern is documented for sphere/capsule pairs but not generalized to other shape pairs. +8. **The per-type specialization is a Q9 application** — the data (shape pairs) is heterogeneous; a different machine for each pair type is faster than a generic machine for all pair types. The pattern is documented for shape pairs but not generalized to other heterogeneous data. + +#### §11.14 Code-Shape Sketch + +The collisions case study, in survey-grammar SSDL notation, with shape tags: + +``` +collisions-optimization { ref, committed_pairs, n_target } :: result {ssdl} [B] + ref_results := run(ref, committed_pairs) // collision flags + distance + contact + harness := build-harness(ref_results) // tolerance comparator + validator + generalization + log := [] + for iter := 0..N: + candidate := pick(log, ref, candidates) // (a) work removal + (b) throughput/layout + opt := apply(candidate, ref) + if not harness.gates-pass(opt): // count + tolerance + validator + generalization + contacts + log.append({candidate, opt, kept: false, reason: harness.last-failure}) + revert() + continue + if opt.median >= log.last-kept.median: + log.append({candidate, opt, kept: false, reason: "no gain"}) + revert() + continue + log.append({candidate, opt, kept: true, measurements: harness.medians, cost: ...}) + commit(opt) // durable baseline + if plateau(log, recent-N): // §8 Q9: re-profile, evaluate (c) representation + re-profile-data() + return committed(opt, log) + +candidates := { a: "work removal", // Q1, Q3, Q4 + b: "throughput/data layout", // Q3, Q5, Q6 + c: "representation/algorithm", // Q9 (Iteration 3 — GJK+bisection) + d: "data-pattern specialization" } // Q5/Q6 (per-type specialization) + +match-contract := { type: tolerance, + tolerance: { dist_max: "1mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)", + contact_certifier: true, + collision_flag_identity: true } } + +build-isolation := { shapes_transform: "build_optimized_shapes (sees only shapes)", + pairs_transform: "build_optimized_pairs (sees only pairs)", + invariant: "neither sees both, so build cannot precompute answers" } +``` + +The shape tag map: `[B]` for the boundary (the case-study is where the model's working state meets measurement), `[I]` for the inspectable match contract + build isolation. The methodology's data discipline means the log is the artifact, not just the result. + +**Source-read citations:** +- `differentiable-collisions-optc/README.md` — full project: 1000-pair benchmark, "GPT-5.5", tolerance-based contract +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md` — full log: 14 origin iterations + 12 H-numbered iterations, 4 rejections +- `differentiable-collisions-optc/prompts/create-reference.md` — reference solver spec +- `differentiable-collisions-optc/prompts/create-optimized-test-harness.md` — harness spec +- `differentiable-collisions-optc/prompts/create-optimized.md` — optimization spec +- `differentiable-collisions-optc/prompts/create-visualizer.md` — visualizer spec +- `differentiable-collisions-optc/prove-optimized-harness.sh` — 10-step proof + 4 enforcing gates +- `differentiable-collisions-optc/Makefile.optimized` — build configuration +- `differentiable-collisions-optc/src-optimized/collide.c` — optimized implementation +- `differentiable-collisions-optc/performance-test-optimized/build_optimized_shapes.c` — isolated shapes transform +- `differentiable-collisions-optc/performance-test-optimized/build_optimized_pairs.c` — isolated pairs transform +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:1-50` — origin history (collide-gpt-5-5 workspace) +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:50-100` — kept optimizations H1-H6 +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:100-200` — kept optimizations H7-H12 +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:200-300` — rejected experiments +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:300-400` — final committed baseline +- `differentiable-collisions-optc/README.md:1-50` — project description +- `differentiable-collisions-optc/README.md:50-150` — 4-prompt methodology +- `differentiable-collisions-optc/README.md:150-300` — 1000-pair benchmark +- `differentiable-collisions-optc/README.md:300-500` — results continued + match contract +- `differentiable-collisions-optc/prove-optimized-harness.sh:1-50` — harness start +- `differentiable-collisions-optc/prove-optimized-harness.sh:50-150` — harness body +- `differentiable-collisions-optc/prove-optimized-harness.sh:150-350` — harness end +- `differentiable-collisions-optc/prompts/create-reference.md:1-50` — reference spec start +- `differentiable-collisions-optc/prompts/create-reference.md:50-150` — reference spec body +- `differentiable-collisions-optc/prompts/create-optimized.md:1-50` — optimization spec start +- `differentiable-collisions-optc/prompts/create-optimized.md:50-150` — 2 candidate kinds +- `differentiable-collisions-optc/prompts/create-optimized.md:150-300` — exit criteria + plateau guidance +- `differentiable-collisions-optc/prompts/create-optimized-test-harness.md:1-50` — harness spec start +- `differentiable-collisions-optc/prompts/create-optimized-test-harness.md:50-150` — harness spec body +- `differentiable-collisions-optc/prompts/create-visualizer.md:1-50` — visualizer spec start +- `differentiable-collisions-optc/prompts/create-visualizer.md:50-150` — visualizer spec body +- `differentiable-collisions-optc/Makefile.optimized:1-50` — build config start +- `differentiable-collisions-optc/Makefile.optimized:50-100` — build config body +- `differentiable-collisions-optc/performance-test-optimized/build_optimized_shapes.c:1-50` — shapes transform start +- `differentiable-collisions-optc/performance-test-optimized/build_optimized_shapes.c:50-150` — shapes transform body +- `differentiable-collisions-optc/performance-test-optimized/build_optimized_pairs.c:1-50` — pairs transform start +- `differentiable-collisions-optc/performance-test-optimized/build_optimized_pairs.c:50-150` — pairs transform body +- `differentiable-collisions-optc/` (full repo at main) — 5 commits + README + OPTIMIZATION-LOG + 4 prompts + harness +- `differentiable-collisions-optc/commits/` — the 5 commit history (the v3 cluster does not cite specific SHAs) +- `differentiable-collisions-optc/.gitignore` — the gitignore (the v3 cluster does not cite specific contents) +- `intent_dsl_survey_20260612` — the survey (relevant for the gap note on intent-DSL) +- `superpowers_review_20260619` — the superpowers review (relevant for the gap note on process parallel) +- `tracy_howell_manchester_arxiv_2207.00669` — the cited paper (relevant for the reference implementation) + +**Decision candidate:** NEW Candidate 27 (LOW). "Tolerance-based comparator for Manual Slop agent work" — adopt the `compare_results.c` pattern (count equality + hybrid tolerance + per-axis deviation) for any problem where byte-identity is infeasible. See `decisions.md` Candidate 27. +**Cross-refs:** §3 Hooks (`prove-optimized-harness.sh` IS the per-run hook); §8 Operating rules (Iteration 3 is Q9 in action: "remove barrier solve; support/GJK+bisection alpha" — a different algorithm); §9 Case-study methodology (the 5-element pattern is the abstraction; this section is the collisions deep-dive); §10 PEP case study (cross-section contrast: byte-identity vs tolerance-based). +**Pattern history:** NEW. v2.3 had no case-study repos. v3 introduces the tolerance-based exemplar of §9's 5-element pattern. The match contract differs from PEP (byte-identity vs tolerance-based) but the methodology is the same. +## §12 YAML avoidance + +**Source:** nagent uses YAML for `.nagent/campaigns/{slug}/index.yaml` + per-item `item.yaml` + per-item `proposal.yaml` + graduate `{name}.draft` (per §1 Campaigns cluster); distill graduates per `bin/nagent-distill --graduate`; per-file knowledge note frontmatter in `knowledge/files/{file_id}.md` (per v2.3 §2.1). User directive 2026-06-20: "I don't like YAML, acton may have utilized it or noted its utilization but I would not use it in whatever I take from his nagent implementation. I would continue to utilize markdown in combination with a custom DSL." +**One-liner:** nagent uses YAML for campaigns/distill/knowledge; the user does NOT adopt YAML for Manual Slop artifacts — Manual Slop uses markdown with structured headings + custom DSL (survey grammar + SSDL) for any artifact that nagent would have used YAML for. +**Pattern summary:** The YAML-avoidance pattern is a "do not adopt" flag on every YAML use site in nagent, with a markdown + custom DSL alternative specified per use case. The pattern is: (1) catalog every YAML use site in nagent (campaigns, distill, knowledge, graduates); (2) name the markdown + DSL alternative for each (markdown headings + survey grammar for inline computation, TOML frontmatter for project config precedent, SSDL for shape annotations); (3) document the rationale (whitespace fragility for AI-generated content, markdown+DSL is the project's existing convention per the intent_dsl_survey + superpowers_review sibling reviews, the custom DSL is the project's intent for inline computation not configuration); (4) cross-ref the project files that establish the markdown+DSL precedent (`conductor/presets.py`, `conductor/personas.py`, the 6 styleguides in `conductor/code_styleguides/`, the 14 `docs/guide_*.md` files). + +#### §12.1 Where nagent Uses YAML + +nagent uses YAML in four primary locations: + +1. **`.nagent/campaigns/{slug}/index.yaml`** — the campaign-level index. Per §1, the campaign tree is a YAML structure with `name`, `status`, `completion: [condition]`, `items: [item]`, and optional `proposal: proposal_yaml?`. The YAML is the state of record; the worker contract returns data; the driver is the only mutator. +2. **`.nagent/campaigns/{slug}/{item_id}/item.yaml`** — the per-item state. Each item has `id`, `status`, `blocked_by: [id]`, `conversation: path`, optional `decompose: { when, into: [sub_item] }`, and optional `result: result_json?`. The YAML is editable; the user can hand-edit between turns. +3. **`.nagent/campaigns/{slug}/{item_id}/proposal.yaml`** — the proposal file. Created by the LLM during the `propose` phase; contains the sub-items the LLM proposes. The review gate (per §1) decides whether to accept. +4. **`.nagent/distill/{name}.draft`** — the graduate file. Created by `nagent-distill --graduate`; contains a non-executable draft of a tool or prompt. Invisible to tool discovery until the user reviews and renames to remove `.draft`. + +Additionally, nagent uses YAML-adjacent formats: +- **Per-file knowledge note frontmatter** (`knowledge/files/{file_id}.md`) — the file has a YAML frontmatter block with metadata (file path, last-modified, category). The body is markdown. +- **`config.json`** — nagent's main config file is JSON, not YAML, but the same "structured data file" pattern applies. The config has `safety_net`, `hook_per_run`, `hook_per_file_edit`, `context_window_tokens`, etc. +- **`issues/{NNNN}-{slug}.md`** — nagent's issue files are markdown with structured headings (## Goal, ## Tasks, ## Done criteria), not YAML. This is the closest nagent gets to the Manual Slop convention. + +#### §12.2 Why YAML Is "Do Not Adopt" for Manual Slop + +YAML is "do not adopt" for Manual Slop for four reasons: + +1. **Markdown + frontmatter is sufficient for the same data shape.** The project's `conductor/presets.py` and `conductor/personas.py` both use TOML for structured config (presets.toml, project_presets.toml, personas.toml, project_personas.toml). TOML is the existing precedent; YAML would be a third format. The markdown+frontmatter pattern (per the `issues/{NNNN}-{slug}.md` precedent in nagent itself) is sufficient for the campaign-style artifacts: structured headings (`## Goal` / `## Tasks` / `## Done criteria`) + a TOML frontmatter block (project config precedent) + optional SSDL-annotated code blocks for any inline computation. +2. **The custom DSL (survey grammar + SSDL) is the project's intent for inline computation, not configuration.** Per the `intent_dsl_survey_20260612` Cluster 5 "SSDL shape primitives", the project's DSL primitives (`[I]` inspectable, `[S]` string concatenation, `[B]` boundary, `[M]` mutable aggregate) are the shape annotations for any data structure. The DSL is for inline computation (e.g., the code-shape sketches in §1-§11), not for configuration files. +3. **YAML's whitespace sensitivity is fragile for AI-generated content.** LLMs frequently mis-indent YAML; a single space off can change the structure silently. The Manual Slop workflow already encodes the discipline "always run the suite, not just `py_compile`" (per §6 cross-ref to `315fe9e`); YAML adds another surface for the "looks right but parses wrong" failure mode. +4. **The project's existing markdown-driven conventions (per `superpowers_review_20260619`)** establish markdown as the default format for human-editable artifacts. The 6 styleguides in `conductor/code_styleguides/` are markdown; the 14 `docs/guide_*.md` files are markdown; the per-track `spec.md`, `plan.md`, `state.toml`, `metadata.json` are markdown + TOML. Adding YAML would be a third format for the same data shape. + +The YAML-avoidance is a "do not adopt" flag, not a "must not exist" ban. The user can still read and parse YAML (e.g., when reading nagent's source); the avoidance is for new Manual Slop artifacts. + +#### §12.3 The Markdown + Custom DSL Alternative + +The markdown + custom DSL alternative is concrete: each campaign-style artifact becomes a markdown file with structured headings + a TOML frontmatter block (project config precedent) + optional SSDL-annotated code blocks for any inline computation. + +The template: + +```markdown ++++ +slug = "campaign-slug" +status = "active" +created = "2026-06-20" ++++ + +# Campaign: {name} + +## Goal + + + +## Tasks + +- [ ] **{item_id}** — {description} (status: todo; blocked_by: []) +- [ ] **{item_id}** — {description} (status: todo; blocked_by: [{item_id}]) + +## Done criteria + +- {condition_1} +- {condition_2} + +## Notes + + + +``` +campaign := { name: string, status: active|paused|done, + completion: [condition], items: [item] } {ssdl} [M] +``` +``` + +The TOML frontmatter (between `+++` markers) holds the machine-readable fields (slug, status, created). The markdown body holds the human-readable content (goal, tasks, done criteria, notes). The SSDL annotations (`{ssdl} [M]`) are the shape tags for any data structure in the code-shape sketches. + +The per-item file follows the same template: + +```markdown ++++ +id = "{item_id}" +status = "todo" +blocked_by = ["{item_id}"] ++++ + +# {item_id}: {description} + +## Goal + + + +## Done criteria + +- {condition} + +## Conversation + + +``` + +The per-proposal file follows the same template: + +```markdown ++++ +parent_item = "{item_id}" +created = "2026-06-20" ++++ + +# Proposal: decompose {item_id} + +## Sub-items + +- [ ] **{sub_item_id}** — {description} +- [ ] **{sub_item_id}** — {description} + +## Rationale + + +``` + +The graduate file follows the same template (with `executable = false` to mark it as a draft): + +```markdown ++++ +name = "{tool_name}" +executable = false +graduated_at = "2026-06-20" ++++ + +# {tool_name} (DRAFT) + + + +## Review notes + + +``` + +The TOML frontmatter is the project config precedent (`conductor/presets.py` + `conductor/personas.py`); the markdown body is the project convention; the SSDL annotations are the project's DSL primitives. + +#### §12.4 Cross-References + +The YAML-avoidance section cross-references: + +- **`intent_dsl_survey_20260612`** — the survey's Cluster 5 "SSDL shape primitives" is the canonical reference for the SSDL annotations. The survey's §4.4 "7-column table format" is the canonical reference for any tabular data. +- **`superpowers_review_20260619`** — the superpowers plugin review establishes the project's markdown-driven conventions. The 6 styleguides in `conductor/code_styleguides/` are markdown; the 14 `docs/guide_*.md` files are markdown; the markdown convention is the project's default. +- **`conductor/presets.py`** + **`conductor/personas.py`** — the TOML precedent for project config. The `[presets]` and `[personas]` tables in `presets.toml` and `personas.toml` are the pattern for any new project config file. +- **`conductor/workflow.md`** — the workflow's "always run the suite, not just `py_compile`" discipline (per §6 cross-ref) is the project's "look for failure modes" mindset. YAML's whitespace fragility is a failure mode; the project's mindset is to surface failure modes explicitly. + +#### §12.5 Decision Candidate + +**NEW Candidate 27 (HIGH).** "Markdown + custom DSL lock-in" — explicitly adopt markdown + survey grammar + SSDL for campaign-style artifacts; reject YAML for new project artifacts. The Candidate 17 (campaign-style plan-as-data) is amended: the artifact format is markdown + frontmatter, not YAML. The Candidate 18 (discussion-window safety net) is unchanged (it operates on existing JSON/Markdown artifacts). The Candidate 19 (per-turn hook) is unchanged (it operates on shell commands, not data files). The Candidate 25 (optimization-log) is unchanged (it operates on markdown, not YAML). See `decisions.md` Candidate 27. + +**Source-read citations:** +- `bin/nagent-campaign` — campaign CLI entry point (24cf16d) +- `bin/helpers/nagent_campaign_lib.py:index_yaml_path()` — the index.yaml path convention (24cf16d) +- `bin/helpers/nagent_campaign_lib.py:item_yaml_path()` — the per-item item.yaml path convention (24cf16d) +- `bin/helpers/nagent_campaign_lib.py:proposal_yaml_path()` — the proposal.yaml path convention (24cf16d) +- `bin/nagent-distill:107-200` — `--merge` + `--graduate` CLI surface (f3ec090) +- `bin/helpers/nagent_distill_lib.py:228-260` — finished-campaign-as-harvest-source (f3ec090) +- `bin/helpers/nagent_distill_lib.py:793-979` — `run_merge` + `run_graduate` (f3ec090) +- `prompts/knowledge-graduate.md:1-26` — graduation LLM prompt (f3ec090) +- `prompts/knowledge-merge.md:1-19` — merge LLM prompt (f3ec090) +- `prompts/knowledge-graduate.md:24-26` — graduate file naming convention (`{name}.draft`) +- `issues/0001-foundations.md` — issue file format (markdown with structured headings, not YAML) +- `issues/0002-campaign-system.md:1-326` — campaign system spec (markdown with structured headings, not YAML) +- `config.example.json` — nagent's main config (JSON, not YAML; the "structured data file" pattern) +- `bin/nagent:1319-1331` — `conversation_scratch_dir(conversation_name)` (49e07f3; relevant for the scratch dir pattern, not YAML) +- `bin/nagent:2220-2230` — `root = resolve_default_root(args.root)` (54c8741; relevant for the project-local-roots pattern) +- `conductor/presets.py` — the TOML precedent for project config (the project file, not nagent's) +- `conductor/personas.py` — the TOML precedent for project config (the project file, not nagent's) +- `conductor/code_styleguides/data_oriented_design.md` — the project's canonical DOD reference (markdown, not YAML) +- `intent_dsl_survey_20260612` — the survey's Cluster 5 "SSDL shape primitives" (the project convention) +- `superpowers_review_20260619` — the superpowers plugin review (the project convention) +- `bin/helpers/nagent_gc_lib.py` — the knowledge harvest library (v2.3; relevant for the harvest format, not YAML) +- `bin/helpers/nagent_tags.py` — the tag parser (065168c; relevant for the lenient parser, not YAML) +- `bin/helpers/nagent_safety_lib.py` — the safety net library (38d3d4f; relevant for the checkpoint format, not YAML) +- `bin/helpers/nagent_cli.py:11-86` — the resolve/scaffold functions (54c8741; relevant for the project-local-roots pattern) +- `bin/helpers/nagent_llm.py:54-77` — `MODEL_CONTEXT_WINDOWS` table (bdfa2a6; relevant for the verified table pattern, not YAML) +- `bin/nagent:640-748` — `build_initial_context` (54c8741; relevant for the 4-layer context resolution) +- `bin/nagent:3167-3185` — `run_agent_loop` (the main loop; relevant for the overall nagent architecture) +- `bin/helpers/nagent_campaign_lib.py:1-50` — module docstring + imports (the v3 cluster does not cite specific line ranges) +- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) +- `bin/nagent-distill:1-50` — distill module imports + constants (the v3 cluster does not cite specific line ranges) +- `prompts/create-readme.md:248-251` — the "graduate proven playbooks" reduction (c1d2cad; relevant for the graduate rationale) + +**Honest gaps:** +1. **The TOML frontmatter syntax (between `+++` markers) is the project convention, but the exact parser is not specified.** A future track would document the parser (e.g., `tomllib` for reading, `tomli-w` for writing, or a custom parser that handles the `+++` delimiter). +2. **The SSDL annotations (`{ssdl} [M]`) are not formally parsed.** They are inline text annotations; a future tool could parse them for validation (e.g., a styleguide linter that asserts every `[M]` aggregate has a corresponding `git_history` field). +3. **The markdown+DSL alternative does not address binary artifacts.** Campaign-style artifacts are text; binary artifacts (images, models, etc.) would need a different format. A future track would address binary artifacts. +4. **The "do not adopt" flag is for new Manual Slop artifacts.** Existing YAML files (e.g., from imported nagent campaigns) would still need to be parsed. A future track would document the YAML parser for backward compatibility. + +## §13 Agent context-window observations + +**Source:** user's empirical findings on OpenCode + MiniMax M3 (per the 2026-06-20 directive); nagent's enforcement (per §1 Campaigns + §2 Conversation safety net + §3 Hooks); Manual Slop's `docs/` + `conductor/` markdown navigation (per `conductor/workflow.md` "Mandatory Research-First Protocol" + the 6 styleguides in `conductor/code_styleguides/` + the 14 `docs/guide_*.md` files). +**One-liner:** Agents take ~100-150k tokens to warm up; the context window can go up to ~500k (MiniMax M3); the safe zone is 250-350k; the cycle is compact → re-warm → continue. Manual Slop's `docs/` + `conductor/` markdown navigation is a partial mitigation; the shortcoming is that agents frequently forget/fail to read on demand. nagent's `--hook-per-run` (per §3) is the pattern that would close the gap. +**Pattern summary:** The agent context-window pattern is empirical: the model has a warm-up cost (~100-150k tokens before useful output), a maximum window (~500k for MiniMax M3), a safe zone (250-350k; above which output quality degrades), and a cycle (compact → re-warm → continue). nagent enforces the cycle more strictly via per-turn hook injection (§3) + safety net checkpoints (§2) + distill graduates (§1). Manual Slop's `docs/` + `conductor/` markdown navigation is a partial mitigation: the project's 6 styleguides + 14 deep-dive guides + per-track `state.toml` + `metadata.json` are all markdown, deliberately so agents can navigate on demand. The shortcoming is that agents frequently forget to read or fail to read on demand. nagent's `--hook-per-run` pattern (per §3) is the structural mechanism that closes the gap: a per-turn hook that injects a "what to read next" status block at the top of every turn. The decision candidate is Candidate 19 (per-turn ground-truth hook) reframed with the v3.1 context-window framing. + +#### §13.1 The Warm-Up + Window + Safe-Zone Numbers + +The empirical findings (per the user's 2026-06-20 directive): + +- **Warm-up cost:** ~100-150k tokens. Before the model produces useful output, it needs to load the system prompt + the per-track context + the per-discussion history + the per-task state. The warm-up is the cost of the first useful token. +- **Maximum window:** up to ~500k tokens (MiniMax M3). The model can technically process up to 500k tokens, but the output quality degrades as the window fills. +- **Safe zone:** 250-350k tokens. Below the warm-up cost, the model hasn't loaded enough context. Above the safe zone, the output quality degrades. The safe zone is the range where the model produces useful output efficiently. +- **Cycle:** compact → re-warm → continue. When the window approaches the safe-zone ceiling, the model compacts the context (drops low-priority information, summarizes, etc.), then re-warms (loads the compacted context + the new task), then continues. The cycle is iterative; each cycle costs ~100-150k tokens of warm-up. + +The numbers are empirical (MiniMax M3); other models may have different numbers. The pattern (warm-up + window + safe zone + cycle) is the structural insight; the numbers are the parameterization. + +#### §13.2 nagent's Enforcement + +nagent enforces the cycle more strictly than the model does natively. The three mechanisms: + +1. **Per-turn hook injection (§3):** A hook runs at the top of every turn (before the model speaks); its output enters the conversation as a labeled block. The hook is the per-turn ground-truth that prevents the model from "re-warming" by reading its own context. The hook is fast (median-of-5 timing) and surfaces the measured state (build status, test status, etc.) without the model having to read its own conversation. +2. **Safety net checkpoints (§2):** A wall-clock + burst guard fires a checkpoint when the conversation grows. The checkpoint is a separate one-shot LLM call (not the working model) that produces a structured summary (## Intent | ## Next action | ## Constraints | ## Open questions). The summary is the "compacted" context; the next turn re-warms from the summary. +3. **Distill graduates (§1):** The `--graduate` pass takes proven playbooks and drafts them as non-executable `{name}.draft` files. The drafts are "graduate candidates" — proven knowledge that can be promoted to executable tools after review. The graduate pass is the "structural re-warm" — the model doesn't have to re-read the playbook because it's been distilled into a tool. + +The three mechanisms together implement the cycle as a structural pattern, not a model-dependent behavior. The model doesn't have to "remember to compact"; the cycle is enforced by the loop. + +#### §13.3 Manual Slop's Partial Mitigation + +Manual Slop's `docs/` + `conductor/` markdown navigation is a partial mitigation for the cycle. The project deliberately keeps the following files in markdown so agents can navigate on demand: + +- **`AGENTS.md`** — the canonical operating instructions for agents. The @import pattern (per `conductor/code_styleguides/data_oriented_design.md`) includes the 6 styleguides + the 14 deep-dive guides. +- **`conductor/workflow.md`** — the workflow conventions (TDD, per-task commits, format commitments, "always run the suite"). +- **`conductor/product-guidelines.md`** — the project styleguides (1-space indent for Python, no comments, etc.). +- **`conductor/code_styleguides/data_oriented_design.md`** — the canonical DOD reference (Tier 0/1/2, simplification pass, enforceable deliverables). +- **`conductor/code_styleguides/cache_friendly_context.md`** — the cache TTL GUI contract (stable-to-volatile context ordering). +- **`conductor/code_styleguides/knowledge_artifacts.md`** — the knowledge harvest pattern (7-category schema + provenance + sha256 ledger). +- **`conductor/code_styleguides/error_handling.md`** — the Result[T] convention. +- **`conductor/code_styleguides/agent_memory_dimensions.md`** — the 4 memory dimensions (curation / discussion / RAG / knowledge). +- **`conductor/code_styleguides/rag_integration_discipline.md`** — the conservative-RAG rule. +- **`conductor/code_styleguides/feature_flags.md`** — file presence vs config flags vs CLI flags. +- **The 14 `docs/guide_*.md` files** — the deep-dive guides (architecture, AI client, API hooks, MCP client, app controller, MMA, models, testing, GUI, paths, context curation, shaders, RAG, beads, hot reload, personas, NERV theme, workspace profiles, command palette). +- **Per-track `state.toml` + `metadata.json`** — the per-track state (current phase, task progress, verification status). +- **Per-track `spec.md` + `plan.md`** — the per-track specification and plan. + +The markdown convention is deliberate: agents can navigate the project's knowledge on demand by reading the files. The convention is the project's "partial mitigation" for the cycle. + +#### §13.4 The Shortcoming + +The shortcoming is that agents frequently forget to read or fail to read on demand. The empirical observation: + +- **Forget to read:** The agent has a task, the relevant guidance is in `conductor/workflow.md`, but the agent doesn't read the file because the task description doesn't explicitly say "read `conductor/workflow.md` first". The agent proceeds without the guidance. +- **Fail to read on demand:** The agent reads the relevant guidance at the start of the task, but as the task progresses, the agent doesn't re-read the guidance when a new question arises. The agent proceeds with stale information. +- **Read but ignore:** The agent reads the relevant guidance, but the agent's interpretation of the guidance is different from the guidance's intent. The agent proceeds with a misunderstanding. + +The three failure modes are not the same; each has a different mitigation. The "forget to read" mitigation is to make the reading explicit (e.g., "before starting, read `conductor/workflow.md`"). The "fail to read on demand" mitigation is to make the re-reading automatic (e.g., a per-turn hook that surfaces the relevant guidance). The "read but ignore" mitigation is to make the guidance unambiguous (e.g., structured headings, examples, anti-patterns). + +#### §13.5 The Hook Pattern as the Solution + +nagent's `--hook-per-run` pattern (per §3) is the structural mechanism that closes the gap. The pattern: + +1. **Configure a status command.** The user configures a command (e.g., `make test`, `git status`, `cat conductor/workflow.md`) that runs at the top of every turn. +2. **Run the command via the hook.** The hook runs the command, captures exit code + stdout + stderr, and injects a labeled block at the top of the conversation. +3. **The model sees the status block.** The model reads the status block as part of the conversation; the status block is the per-turn ground-truth. + +The pattern closes all three failure modes: +- **Forget to read:** The status block is automatically injected; the agent can't forget to read it. +- **Fail to read on demand:** The status block is refreshed every turn; the agent sees the latest status every turn. +- **Read but ignore:** The status block is structured (exit code + stdout + stderr); the agent can't ignore a failing exit code or a stderr message. + +The pattern is the structural mechanism for the cycle. The agent doesn't have to "remember to check the status"; the check is automatic. + +#### §13.6 Decision Candidate + +**NEW Candidate 28 (MEDIUM).** "Per-turn ground-truth hook for Manual Slop" — adopt nagent's `--hook-per-run` model; inject a "what to read next" status block at the top of every `send_result()`. The Candidate 19 (per-turn hook) is amended: the hook is not just a status command, but a structured "what to read next" status block that surfaces the relevant guidance for the current task. The hook is configured per-project (via `[conductor].hook_per_run` in `manual_slop.toml`); the default is a no-op (the hook is opt-in). See `decisions.md` Candidate 28. + +**Source-read citations:** +- The user's 2026-06-20 directive — the empirical findings (warm-up + window + safe zone + cycle) +- `bin/nagent:1442-1484` — `run_hook` + `resolve_hooks` (a4fb141; the per-turn hook primitive) +- `bin/nagent:1922-1927` — `hook_per_run` injection site (a4fb141) +- `bin/nagent:3167-3185` — `run_agent_loop` (the main loop; the hook is wired here) +- `bin/nagent:1519-1539` — `checkpoint_due` + `rebuild_due` (38d3d4f; the safety net trigger) +- `bin/nagent:1547-1587` — `write_checkpoint` (38d3d4f; the safety net writer) +- `bin/nagent:1590-1662` — `rebuild_conversation` (38d3d4f; the safety net rebuild) +- `bin/nagent:1840-1881` — `extract_conversation_summary` (6426a67; the instant-saves change) +- `bin/helpers/nagent_distill_lib.py:587-654` — `_summary_backfill_candidates` + `_backfill_saved_summaries` (6426a67) +- `bin/nagent-campaign` — campaign CLI entry point (24cf16d; the campaigns abstraction) +- `bin/nagent-distill:107-200` — `--merge` + `--graduate` CLI surface (f3ec090; the distill abstraction) +- `prompts/knowledge-graduate.md:1-26` — graduation LLM prompt (f3ec090) +- `prompts/knowledge-merge.md:1-19` — merge LLM prompt (f3ec090) +- `AGENTS.md` — the canonical operating instructions (the project's markdown convention) +- `conductor/workflow.md` — the workflow conventions (the project's markdown convention) +- `conductor/product-guidelines.md` — the project styleguides (the project's markdown convention) +- `conductor/code_styleguides/data_oriented_design.md` — the canonical DOD reference (the project's markdown convention) +- `conductor/code_styleguides/cache_friendly_context.md` — the cache TTL GUI contract (the project's markdown convention) +- `conductor/code_styleguides/knowledge_artifacts.md` — the knowledge harvest pattern (the project's markdown convention) +- `conductor/code_styleguides/error_handling.md` — the Result[T] convention (the project's markdown convention) +- `conductor/code_styleguides/agent_memory_dimensions.md` — the 4 memory dimensions (the project's markdown convention) +- `conductor/code_styleguides/rag_integration_discipline.md` — the conservative-RAG rule (the project's markdown convention) +- `conductor/code_styleguides/feature_flags.md` — file presence vs config flags vs CLI flags (the project's markdown convention) +- `docs/guide_*.md` — the 14 deep-dive guides (the project's markdown convention) +- Per-track `state.toml` + `metadata.json` — the per-track state (the project's markdown convention) +- `bin/nagent:606-745` — `build_initial_context` (v2.3; relevant for the initial context assembly) +- `bin/nagent:970-987` — `conversation_cache_boundaries` (v2.3; relevant for the cache strategy) +- `bin/nagent:1455-1687` — `run_safety_net` (38d3d4f; relevant for the safety net machinery) +- `bin/nagent:2819` — `safety_settings=load_safety_settings(...)` (38d3d4f; relevant for the safety net wiring) +- `bin/helpers/nagent_cli.py:11-86` — the resolve/scaffold functions (54c8741; relevant for the project-local-roots pattern) +- `bin/helpers/nagent_llm.py:54-77` — `MODEL_CONTEXT_WINDOWS` table (bdfa2a6; relevant for the verified table pattern) +- `bin/nagent:2220-2230` — `root = resolve_default_root(args.root)` (54c8741; relevant for the project-local-roots pattern) +- `bin/helpers/nagent_safety_lib.py` — the safety net library (38d3d4f; relevant for the safety net machinery) +- `bin/nagent:640-748` — `build_initial_context` (54c8741; relevant for the 4-layer context resolution) +- `bin/nagent:1075-1081` — `target = f"{llm.provider}/{llm.model}"` (2edc7ee; relevant for the provider/model naming) +- `bin/nagent:3167-3185` — `run_agent_loop` (the main loop; relevant for the overall nagent architecture) +- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) +- `bin/nagent:1300-1400` — main loop body (the v3 cluster does not cite specific line ranges) +- `bin/nagent:1900-2000` — main loop continued (the v3 cluster does not cite specific line ranges) +- `bin/nagent:2000-2100` — main loop continued (the v3 cluster does not cite specific line ranges) +- `bin/nagent:2200-2300` — main loop end (the v3 cluster does not cite specific line ranges) + +**Honest gaps:** +1. **The warm-up + window + safe-zone numbers are empirical for MiniMax M3.** Other models (Gemini, Anthropic, OpenAI) may have different numbers. A future track would measure the numbers per provider. +2. **The hook pattern is opt-in.** The default is a no-op; the user must configure a status command. A future track could make the hook default-on with a no-op status command (the cost is the hook's per-turn latency, which should be < 100ms for a no-op). +3. **The "what to read next" status block is a per-project configuration.** The user must specify the status command per project. A future track could auto-detect the relevant guidance based on the current task (e.g., if the task is "implement X", the status block surfaces `conductor/workflow.md` and `conductor/code_styleguides/data_oriented_design.md`). +4. **The hook pattern is per-turn.** A future track could add per-task, per-conversation, or per-project hooks (e.g., a per-task hook that fires when a task starts, a per-conversation hook that fires when a conversation starts). + +## §14 Fine-tuning observations + +**Source:** user's 2026-06-20 directive ("current generalized models bottlenecked by not having conventions baked in; curated dataset of associated codebases; Together.ai noticed; asks about other prosumer fine-tuning vendors for middle-wage income in 2026"). +**One-liner:** Current generalized models are bottlenecked by not having the user's core conventions/workflows baked in. A curated dataset of associated codebases (Manual Slop's own tracks, decisions, plans, styleguides) is the user's proposed mitigation. Together.ai is one noticed vendor; 5-6 other prosumer fine-tuning vendors are surveyed below. Vendor selection is a separate future track; this section is observational. +**Pattern summary:** The fine-tuning pattern is the user's interest in baking conventions/workflows into a model via fine-tuning. The pattern is: (1) recognize the bottleneck (generalized models don't have the user's conventions); (2) curate the dataset (the user's own tracks, decisions, plans, styleguides); (3) select a vendor (Together.ai is one; 5-6 others surveyed); (4) fine-tune the model (vendor-specific process); (5) validate the fine-tuned model (does it actually produce better output for the user's use case?). The v3.1 section is observational; the vendor analysis is a separate future track. The decision candidate is Candidate 29 (dataset-curation track) + Candidate 30 (cache TTL GUI contract hardening, per the cross-ref to §13). + +#### §14.1 The Diagnosis + +The diagnosis (per the user's 2026-06-20 directive): current generalized models are bottlenecked by not having the user's core conventions/workflows baked in. The bottleneck manifests as: + +- **Convention drift:** The model produces output that violates the project's conventions (e.g., 4-space indent instead of 1-space; JSON blocks instead of tables; etc.). The user must correct the output repeatedly. +- **Workflow ignorance:** The model doesn't know the project's workflow (TDD, per-task commits, format commitments, "always run the suite"). The model produces output that doesn't follow the workflow. +- **Styleguide unawareness:** The model doesn't know the project's 6 styleguides (DOD, cache-friendly context, knowledge artifacts, error handling, agent memory dimensions, RAG integration discipline, feature flags). The model produces output that doesn't follow the styleguides. + +The three failure modes are not the same; each has a different fine-tuning mitigation. The "convention drift" mitigation is to bake the conventions into the model's training data (e.g., the project's `conductor/product-guidelines.md` + the 6 styleguides as training examples). The "workflow ignorance" mitigation is to bake the workflow into the model's training data (e.g., the project's `conductor/workflow.md` + per-track `plan.md` as training examples). The "styleguide unawareness" mitigation is to bake the styleguides into the model's training data (e.g., the 6 styleguides + the 14 deep-dive guides as training examples). + +#### §14.2 Together.ai as One Noticed Vendor + +The user noticed Together.ai. Together.ai offers fine-tuning for open-source models (Llama 3.x, Qwen 3, Mistral) with transparent per-token pricing. The pricing model is: + +- **Training:** ~$0.50-3.00 per million tokens (varies by model + dataset size). +- **Inference:** ~$0.10-0.60 per million tokens (varies by model + context length). + +The prosumer-friendly aspects: transparent pricing, open-source model support, no minimum commitment, serverless deployment. The cons: the user must curate the dataset + select the base model + validate the fine-tuned model. + +#### §14.3 Prosumer Fine-Tuning Vendor Survey (2026) + +The prosumer fine-tuning vendor survey (per the user's 2026-06-20 directive): + +| Vendor | Model families | Pricing tier | Prosumer-friendly? | Notes | +|---|---|---|---|---| +| **Together.ai** | Llama, Qwen, Mistral, others | $0.50-3/M training; $0.10-0.60/M inference | Yes — transparent; open-source models | User-noticed vendor | +| **Fireworks.ai** | Llama, Qwen, Mistral | Similar to Together | Yes — serverless DX | Lower latency than Together for some models | +| **OpenAI fine-tuning** | GPT-4o, GPT-4o-mini, GPT-3.5 | ~$3/M training, $0.30/M inference (4o-mini) | Yes for "mini"; expensive for 4o | Best DX; closed-source models | +| **Anthropic Claude Haiku fine-tuning** | Claude Haiku (if on waitlist) | Similar to OpenAI 4o-mini | Waitlist-gated | Best for Anthropic-specific workflows | +| **Google Gemini 1.5 Flash fine-tuning** | Gemini 1.5 Flash | ~$0.50-1/M training | Yes for high-volume | Best for Google-specific workflows | +| **Local fine-tuning (RTX 4090/5090 + Unsloth)** | Any open-source model | $1,500-3,000 one-time hardware | Yes for weekly-iterators | Full control; no per-token cost | + +The survey is observational; the vendor analysis is a separate future track. The v3.1 section is not making a recommendation; it's documenting the user's interest + the prosumer vendor landscape. + +#### §14.4 Vendor Analysis Is Out of Scope for v3.1 + +The vendor analysis is out of scope for v3.1. The v3.1 section is observational; the vendor-selection track (if needed) would do the deep comparison + decision. The reasons: + +1. **Vendor pricing changes frequently.** The 2026-06-20 numbers may be out of date by 2026-09-20. A vendor-selection track would need to be re-run periodically. +2. **The dataset is the user's call.** The user must curate the dataset (the user's own tracks, decisions, plans, styleguides) before any vendor can fine-tune. The dataset-curation is a separate effort. +3. **The validation is the user's call.** The user must validate the fine-tuned model against the user's actual use cases. The validation is a separate effort. +4. **The v3.1 track is research-only.** Per the v3.1 scope, no candidates are implemented in the track. The dataset-curation + vendor-selection would be a separate implementation track. + +The v3.1 section is a marker for a future track. The marker is: "the user is interested in fine-tuning; a future track would curate the dataset + select the vendor + fine-tune the model + validate the result". + +#### §14.5 Decision Candidates + +**NEW Candidate 29 (MEDIUM).** "Dataset-curation track for fine-tuning" — separate track to curate the Manual Slop conventions/workflows dataset for fine-tuning; vendor selection deferred. The dataset would include: per-track `spec.md` + `plan.md` + `state.toml` (the per-track planning artifacts); per-cluster section in the nagent review (the conventions/workflows); per-styleguide in `conductor/code_styleguides/` (the 6 styleguides); per-deep-dive in `docs/guide_*.md` (the 14 deep-dive guides). The dataset would be a markdown + TOML corpus; the corpus would be the input to a vendor-specific fine-tuning process. See `decisions.md` Candidate 29. + +**NEW Candidate 30 (LOW).** "Cache TTL GUI contract hardening" — make the per-turn grounding primitive also track cache state; cross-ref `cache_friendly_context.md`. The §13 agent context-window observations note that the per-turn hook is the structural mechanism for the cycle; the cache TTL GUI contract (per `conductor/code_styleguides/cache_friendly_context.md`) is the cache version of the same insight. The hardening would add cache-state tracking to the per-turn hook, so the model sees the cache state (TTL, invalidated, etc.) as part of the status block. See `decisions.md` Candidate 30. + +**Source-read citations:** +- The user's 2026-06-20 directive — the diagnosis (current models bottlenecked) + the dataset (Manual Slop's own tracks) + the vendor notice (Together.ai) + the prosumer question (other vendors for middle-wage income in 2026) +- `conductor/presets.py` — the TOML precedent for project config (the dataset would include `presets.toml` + `project_presets.toml`) +- `conductor/personas.py` — the TOML precedent for project config (the dataset would include `personas.toml` + `project_personas.toml`) +- `conductor/context_presets.py` — the ContextPresetManager (the dataset would include per-track context presets) +- `conductor/tool_presets.py` — the ToolPresetManager (the dataset would include tool presets) +- `conductor/tool_bias.py` — the ToolBiasEngine (the dataset would include tool bias profiles) +- `conductor/workflow.md` — the workflow conventions (the dataset would include this) +- `conductor/product-guidelines.md` — the project styleguides (the dataset would include this) +- `conductor/code_styleguides/data_oriented_design.md` — the canonical DOD reference (the dataset would include this) +- `conductor/code_styleguides/cache_friendly_context.md` — the cache TTL GUI contract (the dataset would include this; relevant for Candidate 30) +- `conductor/code_styleguides/knowledge_artifacts.md` — the knowledge harvest pattern (the dataset would include this) +- `conductor/code_styleguides/error_handling.md` — the Result[T] convention (the dataset would include this) +- `conductor/code_styleguides/agent_memory_dimensions.md` — the 4 memory dimensions (the dataset would include this) +- `conductor/code_styleguides/rag_integration_discipline.md` — the conservative-RAG rule (the dataset would include this) +- `conductor/code_styleguides/feature_flags.md` — file presence vs config flags vs CLI flags (the dataset would include this) +- `docs/guide_*.md` — the 14 deep-dive guides (the dataset would include these) +- `docs/Readme.md` — the canonical teaching document (the dataset would include this) +- `AGENTS.md` — the canonical operating instructions (the dataset would include this) +- Per-track `spec.md` + `plan.md` + `state.toml` + `metadata.json` — the per-track artifacts (the dataset would include these) +- Per-discussion `logs/sessions/{session_id}/discussion.jsonl` — the per-discussion history (the dataset would include selected discussions, with user approval) +- The user's existing 4-tier MMA architecture (per `docs/guide_mma.md`) — the MMA conventions (the dataset would include the MMA architecture) +- The user's existing Hook API (per `docs/guide_api_hooks.md`) — the Hook API conventions (the dataset would include the Hook API architecture) +- The user's existing MCP tools (per `docs/guide_mcp_client.md`) — the MCP tool conventions (the dataset would include the MCP architecture) +- Together.ai pricing page (https://www.together.ai/pricing) — the user's noticed vendor +- Fireworks.ai pricing page (https://fireworks.ai/pricing) — the alternative vendor +- OpenAI fine-tuning pricing (https://openai.com/api/pricing/) — the closed-source alternative +- Unsloth (https://github.com/unslothai/unsloth) — the local fine-tuning framework +- `bin/nagent:1075-1081` — `target = f"{llm.provider}/{llm.model}"` (2edc7ee; relevant for the provider/model naming, cross-ref to §5) +- `bin/nagent:3167-3185` — `run_agent_loop` (the main loop; relevant for the overall nagent architecture) +- `conductor/tech-stack.md` — the project's tech stack (relevant for the model selection) +- `bin/helpers/nagent_llm.py:54-77` — `MODEL_CONTEXT_WINDOWS` table (bdfa2a6; relevant for the per-model context windows, cross-ref to §5) +- `bin/nagent:2220-2230` — `root = resolve_default_root(args.root)` (54c8741; relevant for the project-local-roots pattern) +- `bin/helpers/nagent_safety_lib.py` — the safety net library (38d3d4f; relevant for the safety net machinery) +- `bin/nagent:606-745` — `build_initial_context` (v2.3; relevant for the initial context assembly) +- `bin/nagent:970-987` — `conversation_cache_boundaries` (v2.3; relevant for the cache strategy, cross-ref to Candidate 30) +- `bin/nagent:1455-1687` — `run_safety_net` (38d3d4f; relevant for the safety net machinery) +- `bin/nagent:1840-1881` — `extract_conversation_summary` (6426a67; relevant for the instant-saves change) +- `bin/nagent:2819` — `safety_settings=load_safety_settings(...)` (38d3d4f; relevant for the safety net wiring) +- `bin/nagent:1922-1927` — `hook_per_run` injection site (a4fb141; relevant for the per-turn hook, cross-ref to §3 + §13) +- `bin/nagent:1442-1484` — `run_hook` + `resolve_hooks` (a4fb141; relevant for the per-turn hook, cross-ref to §3 + §13) +- `bin/helpers/nagent_cli.py:11-86` — the resolve/scaffold functions (54c8741; relevant for the project-local-roots pattern) +- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) +- `bin/nagent:1300-1400` — main loop body (the v3 cluster does not cite specific line ranges) +- `bin/nagent:1900-2000` — main loop continued (the v3 cluster does not cite specific line ranges) +- `bin/nagent:2000-2100` — main loop continued (the v3 cluster does not cite specific line ranges) +- `bin/nagent:2200-2300` — main loop end (the v3 cluster does not cite specific line ranges) +- `bin/nagent:640-748` — `build_initial_context` (54c8741; relevant for the 4-layer context resolution) + +**Honest gaps:** +1. **The dataset-curation effort is significant.** A complete dataset would include all 14 deep-dive guides + 6 styleguides + per-track artifacts + per-discussion history. The effort is months, not days. A future track would scope the dataset to a manageable subset. +2. **The vendor pricing is from 2026-06-20.** The pricing may change by the time the user is ready to fine-tune. A vendor-selection track would re-survey the pricing at the time of decision. +3. **The fine-tuned model's validation is the user's call.** The user must validate the model against the user's actual use cases. The validation is a separate effort; the v3.1 section does not provide a validation methodology. +4. **The Cache TTL GUI contract hardening (Candidate 30) is a small change.** The cross-ref to `cache_friendly_context.md` is the canonical reference; a future track would add cache-state tracking to the per-turn hook. +5. **The fine-tuning vs. prompting trade-off is not analyzed.** Fine-tuning bakes conventions into the model; prompting surfaces conventions at inference time. The trade-off is: fine-tuning is a one-time cost + lower per-inference cost; prompting is a per-inference cost + no training cost. A vendor-selection track would analyze the trade-off. + +## §15 Decisions + +See `decisions.md` for the full candidate list (v2.3's 16 + v3's new 11 + v3.1's new 3, with v2.3 → v3 → v3.1 status mapping at the top). **Total v3.1 candidate pool: 30 entries** (3 HIGH + 7 MEDIUM + 7 LOW + 1 LOW-docs in v3+v3.1's new candidates, plus 14 STILL-OPEN from v2.3, plus 1 PROMOTED + 1 SUBSUMED status changes, plus 3 v3.1 NEW per §12-§14). The HIGH-priority v3 candidates are: + +- **Candidate 17:** Campaign-style plan-as-data for the conductor (§1) — amended by Candidate 27 to use markdown + frontmatter, not YAML +- **Candidate 18:** Discussion-window safety net for Manual Slop (§2) +- **Candidate 22:** Tier 3 worker contract "decompose or isolate, never offload" (§6) + +The MEDIUM-priority v3+v3.1 candidates are Candidates 19 (per-turn hook — amended by Candidate 28), 21 (per-model token-cap), 23 (per-conversation scratch dir), 25 (optimization-log discipline), 27 (markdown+DSL lock-in, per §12), 28 (per-turn ground-truth hook, per §13), 29 (dataset-curation track, per §14). The LOW-priority are Candidates 20 (docs rename), 24 (Q9 in styleguide), 26 (OPT-LOG schema), 30 (cache TTL GUI contract hardening, per §14). Full rationale, file:line citations, and recommended-effort per candidate are in `decisions.md`. + +## §16 Cross-references + +See `nagent_takeaways_v3_20260619.md` for the bridge to v2.3 takeaways + the sibling reviews: + +- **`fable_review_20260617`** — Fable's analysis of Mythos system prompt. Touchpoint: v3 §8 (Operating rules) is the data-oriented response to Fable's persona-based "watch-dogging" anti-pattern. +- **`intent_dsl_survey_20260612`** — the 10 prior-art clusters for intent-based DSLs. Touchpoint: v3 §9 (Case-study methodology) is implicitly an intent-DSL for "drive nagent at an optimization problem"; v3.1 §12 (YAML avoidance) cites the survey's Cluster 5 "SSDL shape primitives" as the project's DSL primitive. +- **`superpowers_review_20260619`** — the superpowers plugin review. Touchpoint: v3 §9 (Case-study methodology); the superpowers `brainstorming` skill is a process parallel (structured questions to refine an idea before implementation); v3.1 §12 (YAML avoidance) cites the superpowers review as the project's markdown-driven convention. + +## §17 References + +### Source commits (24) + +The 24 nagent commits reviewed, in chronological order (oldest first): + +- `54c8741` — Move the default root into the project; rename nagent-gc to nagent-distill (§4) +- `557dd39` — Teach project-local roots and layered inputs in the README arc (§4) +- `0b9d1a2` — Ignore scratch files (§4, project .gitignore) +- `199a36b` — File the campaign system and follow-on plans as ordered issues (§1, issues files) +- `24cf16d` — Add the campaign system: plans as operable artifacts (§1) +- `f3ec090` — Add distill passes: merge and graduate (§1) +- `c1d2cad` — Teach the distill passes in the README and its generator (§1) +- `6443d70` — Rework 0004 around wall-clock checkpoints; remove resolved 0003 (§2 + §1 issue file maintenance) +- `7a7e242` — Add issue files for the two deferred follow-ups (§1, issues files) +- `065168c` — Tolerate non-protocol output; add turn status and invalid-output sidecars (§7) +- `49e07f3` — Scope `` to a per-conversation scratch dir (§7) +- `2edc7ee` — Name the provider/model in the LLM wait spinner (§5) +- `5075f6e` — Keep claude-code billing on its own login; surface real errors (§5) +- `6426a67` — Make --save-conversation instant with extracted summaries (§2) +- `afc7ab8` — Regenerate the README: full arc with campaigns and the safety net (§1 + §2 docs) +- `38d3d4f` — Add the conversation safety net: checkpoints and rebuild (§2) +- `12c35b7` — Pin shell-output-before-next-input ordering (§7, regression test) +- `6b762da` — Collapse exact-duplicate tags within a turn (§7) +- `315fe9e` — Update test for revised delegation-guidance wording (§6) +- `65787a6` — Delegation guidance: name context-isolation alongside decomposition (§6) +- `d56f0f0` — Delegate decomposed parts, not single tasks (§6) +- `a4fb141` — Add per-run and per-file-edit shell hooks (§3) +- `bdfa2a6` — Add Together provider, per-model token-cap rebuilds, and --list-providers (§5) +- `023e23a` — Ignore local .nagent/ runtime state (§4, project .gitignore) +- `a1f0680` — Operating rules: sampling can justify replacing the machine, not just trimming it (§8) + +### Case-study repos + +- [`macton/pep-copt`](https://github.com/macton/pep-copt) at `main` (5 commits). The PEP image compression case study: 2.04× speedup aggregate on 24-image benchmark, byte-identical `.pep` output, decode net-neutral (§10). +- [`macton/differentiable-collisions-optc`](https://github.com/macton/differentiable-collisions-optc) at `main` (5 commits). The Convex Primitive Collision Detection case study: 101.06× speedup on committed input, 97.75× and 98.43× on alternate seeds, tolerance-based match contract (§11). + +### Per-phase commit SHAs (v3.1) + +| Phase | Description | Commit SHA | +|---|---|---| +| Phase 1 | Setup + audit (v3.1) | `8fb82762` | +| Phase 2 | Thicken §1 Campaigns cluster | `bd36aa4b` | +| Phase 3 | Thicken §2 Conversation safety net cluster | `478b088b` | +| Phase 4 | Thicken §3 Hooks cluster | `d17ee930` | +| Phase 5 | Thicken §4 Project-local roots cluster | `1bc8e924` | +| Phase 6 | Thicken §5 Provider expansion cluster | `987f4a97` | +| Phase 7 | Thicken §6 Delegation rewrite cluster | `a406d290` | +| Phase 8 | Thicken §7 Robustness cluster | `b9b31006` | +| Phase 9 | Thicken §8 Operating rules cluster | `eb7da8d8` | +| Phase 10 | Thicken §9 Case-study methodology cluster | `24442379` | +| Phase 11 | Thicken §10 PEP case study cluster | `10c7d1d0` | +| Phase 12 | Thicken §11 Collisions case study cluster | `1574ee47` | +| Phase 13 | New sections §12-§14 + renumber v3 §12-§14 to §15-§17 | (this commit) | +| Phase 14 | Refresh side artifacts | (forthcoming) | +| Phase 15 | Chunking-strategy + format-commitment verification | (forthcoming) | + +### Per-phase commit SHAs (v3) + +| Phase | Description | Commit SHA | +|---|---|---| +| Phase 1 | Setup + audit | `5a28c8f3` | +| Phase 2 | Campaigns cluster (§1) | `c81ea782` | +| Phase 3 | Conversation safety net cluster (§2) | `caf04ca5` | +| Phase 4 | Hooks cluster (§3) | `9ab2d07c` | +| Phase 5 | Project-local roots cluster (§4) | `ea8fa94e` | +| Phase 6 | Provider expansion cluster (§5) | `dd8428a3` | +| Phase 7 | Delegation rewrite cluster (§6) | `0dad59fd` | +| Phase 8 | Robustness cluster (§7) | `ffa21d5c` | +| Phase 9 | Operating rules cluster (§8) | `ad19be00` | +| Phase 10 | Case-study methodology cluster (§9) | `54e62b10` | +| Phase 11 | PEP case study cluster (§10) | `f53c82e6` | +| Phase 12 | Collisions case study cluster (§11) | `db7d94de` | +| Phase 13 | Refresh side artifacts | `e150088d` | +| Phase 14 | Format-commitment verification | `b49be820` | + +### Sibling-review references + +- `conductor/tracks/fable_review_20260617/` — Fable's analysis of Mythos system prompt +- `conductor/tracks/intent_dsl_survey_20260612/` — the 10 prior-art clusters for intent-based DSLs +- `conductor/tracks/superpowers_review_20260619/` — the superpowers plugin review + +### Project documentation references + +- `conductor/workflow.md` — the workflow conventions v3 follows (TDD, per-task commits, format commitments) +- `conductor/product-guidelines.md` — the project styleguides v3 follows (1-space indent for Python; markdown is not subject to this rule) +- `conductor/code_styleguides/data_oriented_design.md` — the project's canonical DOD reference, itself derived from Acton's `context/data-oriented-design.md` +- `conductor/code_styleguides/cache_friendly_context.md` — references nagent_review_v2_3 §3.2 + §5 (v3 deepens with §5 per-model context windows); v3.1 §13 + §14 cross-ref for the per-turn hook + cache TTL GUI contract +- `conductor/code_styleguides/knowledge_artifacts.md` — references nagent_review_v2_3 §3.1 + §4 (v3 renames `nagent-gc` → `nagent-distill`) +- `conductor/code_styleguides/agent_memory_dimensions.md` — references nagent_review_v2_3 §2.8 (v3 deepens with §1-§4 memory extension) +- `conductor/code_styleguides/rag_integration_discipline.md` — the conservative-RAG rule +- `conductor/code_styleguides/feature_flags.md` — file presence vs config flags vs CLI flags +- `conductor/code_styleguides/error_handling.md` — the Result[T] convention +- `docs/guide_meta_boundary.md` — the Application vs Meta-Tooling distinction (load-bearing context for v3) diff --git a/conductor/tracks/nagent_review_20260608/nagent_review_v3_20260619.md b/conductor/tracks/nagent_review_20260608/nagent_review_v3_20260619.md index 9233ebc3..2e9e1847 100644 --- a/conductor/tracks/nagent_review_20260608/nagent_review_v3_20260619.md +++ b/conductor/tracks/nagent_review_20260608/nagent_review_v3_20260619.md @@ -19,136 +19,10 @@ v3 covers the **24-commit nagent evolution** between `eb6be32a` (v2.3 baseline, **Source:** nagent `24cf16d`, `199a36b`, `f3ec090`, `c1d2cad`, `6443d70`, `7a7e242` (`bin/nagent-campaign`, `bin/helpers/nagent_campaign_lib.py`, `bin/helpers/nagent_distill_lib.py:228-260` + `:793-979`, `bin/nagent-distill:107-200`, `prompts/campaign-decompose.md`, `prompts/campaign-item.md`, `prompts/knowledge-merge.md`, `prompts/knowledge-graduate.md`, `prompts/create-readme.md:248-251`, `issues/0002-campaign-system.md`, `issues/0004-conversation-safety-net.md`, `tests/test_nagent_campaign.py`, `tests/test_nagent_distill.py`, `README.md:474-484` + `:900-908`) **One-liner:** Plans become operable artifacts. The plan is data (YAML), the driver is deterministic code, the model's non-determinism is relocated and bounded to narrow judgments. -**Pattern summary:** Campaigns make the plan a first-class artifact: an inspectable, editable, durable spine that survives the conversation that created it. The artifact is a YAML tree on disk (`.nagent/campaigns/{slug}/index.yaml` + per-item `item.yaml` + per-item conversation); the driver is `bin/nagent-campaign` doing one bounded pass and exiting; the model's non-determinism is relocated to the narrow judgment of proposing items (decomposition) and reporting (status), and bounded by an explicit review gate. This extends the "durable work, disposable workers" principle (v2.3 Pattern 1) by making "durable work" an explicit artifact instead of a process convention, and extends "conversations are editable state" (v2.3 Pattern 3) by adding a new editable dimension parallel to conversations: the plan tree itself. - -#### §1.1 What Campaigns Adds - -Campaigns introduce a new lifecycle boundary between planning and execution. Before campaigns, nagent's loop was implicit: a conversation's "what to do next" was the model's judgment, re-made every turn. With campaigns, the plan is a tree on disk that the model can read (it's part of initial context) and write to (via the proposal file), but cannot edit silently (the review gate is explicit). The four pieces of the campaigns abstraction are: - -1. **Artifact** — the YAML tree at `.nagent/campaigns/{slug}/index.yaml` (campaign-level) + per-item `item.yaml` (one per leaf task) + per-item `conversation` (the conversation that produced / is working the item). The artifact is the state of record; the conversation is ephemeral. -2. **Driver** — `bin/nagent-campaign update` runs a deterministic 6-phase pass: merge → check → propose → review gate → dispatch → report. One pass, one exit. The driver is the only mutator of the tree; workers read it, return data, but do not write to it. -3. **Invariants** — four load-bearing rules from `issues/0002-campaign-system.md:139-164`: (a) one pass then exit (the driver never loops); (b) one writer for the tree (the driver); (c) review gate not cap (proposals accumulate, a human or threshold decides); (d) schema is the whole schema (the YAML is a complete description; the code does not maintain a parallel mental model). -4. **Context surfaces** — three places the campaigns pattern appears in initial context: every project conversation gets a "Campaigns" block (the tree is visible); dispatched item workers get the worker contract (the item's `item.yaml` + the parent campaign's `index.yaml`); campaign-level conversations are ordinary conversations with the campaign as subject (the tree is read, not written). - -This decomposition is itself data-oriented — the campaign's behavior is the artifact's shape, not code branching on state. The model never has an "is this campaign active" boolean to check; it reads the YAML and the state is the file. - -#### §1.2 The Driver Phases - -The `update` command runs six phases. Each phase is a pure operation on the tree + a bounded external call (LLM for `propose`, LLM for `report`): - -1. **Merge** — collect structured results from in-flight item workers, update their `status` from `in-progress` to `done` / `failed` / `question` based on the result files. Pure code; no LLM call. -2. **Check** — run the executable test of `completion: [condition]` entries. For `condition` types that are LLM-judged (e.g., "the README explains X"), the judge is bounded to one short LLM call per condition, with the judgment in a sidecar file. No multi-turn model reasoning. -3. **Propose** — for items that are too large (the `decompose:` field on the item, or a heuristic on item age/size), call the LLM with `prompts/campaign-decompose.md` to produce a `proposal.yaml` with sub-items. The LLM proposes; the user (or threshold) decides. -4. **Review gate** — for `proposal.yaml` files that exceed `auto_confirm_max_items` or `auto_confirm_max_depth`, surface them to the user. Below the thresholds, auto-confirm. The gate is explicit: a `proposal.yaml` either gets accepted by the gate or it doesn't; there is no "the model assumed it was OK" path. -5. **Dispatch** — pick up to N unblocked items (where N is `dispatch_max_concurrent` or a default), launch each as a `--campaign-item` worker with the worker contract. Workers return data; they do not write the tree. -6. **Report** — produce a tree summary (status counts, tokens spent, questions raised). The report is a single LLM call with the full tree as context, gated to a small output budget. - -A code-shape sketch using survey grammar (per the format commitment §5.1): - -``` -campaign := { name: string, status: active|paused|done, - completion: [condition], items: [item] } -item := { id: string, status: todo|proposed|in-progress|done|failed|question, - blocked_by: [id], conversation: path, - decompose: { when: heuristic, into: [sub_item] } } -update {slug} { - merge // collect structured results, update statuses (pure code) - check // run executable test: conditions; bounded judge for judge: - propose // decompose big items -> proposal.yaml, status proposed - review_gate // auto-confirm within thresholds; report scope of pending - dispatch // bounded N unblocked items, each as --campaign-item worker - report // tree summary + questions + tokens spent -} -``` - -The `{ssdl}` shape tag for the campaign tree is `[M]` (mutable aggregate, hand-edited by humans) — the artifact is the state of record, the worker contract returns data, the driver is the only mutator. The lineage to v2.3's harvest pattern is direct: workers produce data (harvest-JSON in v2.3; `result.json` here), code merges into the tree (regenerate_digest in v2.3; driver merge phase here). - -#### §1.3 The Invariants - -From `issues/0002-campaign-system.md:139-164`, the four invariants that hold the abstraction together: - -1. **One pass then exit.** The driver never loops. It does one bounded pass and exits. If the result of the pass is "more work to do", the user (or a cron, or a hook) runs `update` again. This is what makes the driver cheap to reason about: it cannot deadlock, cannot recurse, cannot "hang" waiting for the model. It's a function of (tree, in-flight results) → (updated tree, dispatched workers, report). -2. **One writer for the tree.** The driver is the only thing that writes `.nagent/campaigns/{slug}/`. Workers read it, return data, do not write. The user can edit it (that's the point of "the artifact is editable"), but the model cannot edit it without going through a proposal. This eliminates the "two writers race on the same file" class of bugs. -3. **Review gate not cap.** Proposals accumulate. A human (or a threshold) decides whether to accept them. The model never "assumes" a proposal is accepted; the gate is explicit. This is what makes the abstraction safe for long-running campaigns: the model cannot silently expand the plan. -4. **Schema is the whole schema.** The YAML tree is a complete description of the campaign. The code does not maintain a parallel mental model (e.g., "we track active items in memory and the YAML is just a snapshot"). The YAML is the truth; the code is a function of the YAML. - -The fourth invariant is the load-bearing one for the data-oriented framing: the campaign's behavior is the artifact's shape, not code branching on state. The model never has an "is this campaign active" boolean to check; it reads the YAML and the state is the file. - -#### §1.4 Per-Commit Detail - -The six commits that built the campaigns subsystem, in dependency order: - -1. **`24cf16d` — Add the campaigns driver.** Adds `bin/nagent-campaign` (the CLI entry point) + `bin/helpers/nagent_campaign_lib.py` (the driver implementation, ~400 lines). Also adds the initial context block (`prompts/campaign-decompose.md` + `prompts/campaign-item.md`) so the model knows how to propose and dispatch. The 6-phase `update` command lands here. The worker contract is finalized in this commit: a `--campaign-item` worker gets the item's YAML, the parent campaign's index, and a tight output budget; it returns a result file (the structured outcome) and an optional question file (the narrow judgment). -2. **`199a36b` — Add the issue file that fully specifies the system.** Adds `issues/0002-campaign-system.md` (326 lines). This is the "long form spec as a file" pattern from v2.3 — the design is in the repo, not in a wiki or a chat. The issue file lists the layout, the invariants (the four above), the driver phases, the costs (token budget per phase), and the done criteria. This is the document the driver implementation in `24cf16d` was built to. -3. **`f3ec090` — Wire the merge/graduate passes to the campaign lifecycle.** Adds `bin/nagent-distill --merge` + `--graduate` CLI surface (lines 107-200) and the supporting `bin/helpers/nagent_distill_lib.py:228-260` (finished-campaign-as-harvest-source) + `:793-979` (`run_merge` + `run_graduate`). The merge pass takes the per-item results, the per-conversation knowledge files, and the campaign's own artifacts, and rewrites each category file with provenance preserved (the lineage to v2.3's harvest is direct). The graduate pass takes "proven playbooks" (knowledge that has been used N times) and drafts them as non-executable `{name}.draft` files invisible to tool discovery until the user reviews them. The two prompts (`prompts/knowledge-merge.md` + `prompts/knowledge-graduate.md`) are short and tight: merge is 19 lines, graduate is 26. -4. **`c1d2cad` — Update the README to teach the merge + graduate passes.** Adds `README.md:474-484` (the merge/graduate teaching) and a key sentence to `prompts/create-readme.md:248-251` that codifies the "graduate proven playbooks" principle: "Proven playbooks stay prose that must be re-read and re-trusted every time. Therefore: graduate them into self-describing tools and prompts — knowledge becomes capability, gated by review." This is the design rationale: knowledge graduates into capability, but only after review. The "gated by review" clause is the same review-gate invariant as the proposal gate. -5. **`6443d70` — Rework the conversation safety net issue file.** This is not strictly a campaigns commit, but it lands in the same window. Reworks `issues/0004-conversation-safety-net.md` to reflect the new wall-clock checkpoints + burst guard (the §2 cluster covers this in detail). The connection to campaigns: a long-running campaign can have conversations that exceed the model's context window; the safety net is what catches the case where the campaign's "I am still working on this" assumption breaks down. Also deletes `issues/0003-distill-passes.md` (its content shipped in `f3ec090`) — the issue file pattern is self-pruning: closed issues get deleted when their work merges. -6. **`7a7e242` — File the deferred follow-ups as issue files.** Adds `issues/0001-retry-attempts-persist-raw-invalid-output.md` + `issues/0002-invalid-output-sidecars-are-never-collected.md`. Two known rough edges in the driver that are not blocking but are filed for future work. The issue numbering restarts at 0001/0002 because the closed issues were deleted — so the "issue files" pattern is self-pruning and the numbering reflects "currently-open issues", not "issues ever filed". - -#### §1.5 Manual Slop Implications - -The Manual Slop equivalents of the campaigns pattern are partial. The closest analog is the per-track `plan.md` + `state.toml` + `metadata.json` triplet in `conductor/tracks/{track_id}/`. The per-track `plan.md` is the editable plan; `state.toml` is the machine-readable progress; `metadata.json` is the spec-derived scope. But the Manual Slop analog lacks three of the four campaigns invariants: - -1. **No "one writer for the tree" guarantee.** The `plan.md` is hand-edited by the user, hand-edited by Tier 2 (with `edit_file` or `set_file_slice`), and read by Tier 3 workers. There is no `bin/nagent-campaign` equivalent that mediates writes. The "two writers race" class of bugs is real (e.g., Tier 2 edits `plan.md` while Tier 3 worker is reading it). -2. **No "one pass then exit" driver.** The MMA WorkerPool's `ConductorEngine` (in `src/multi_agent_conductor.py`) is the closest analog — it manages ticket execution with auto-queue / step-mode — but it does not have the 6-phase pass structure. It loops; the driver does not. -3. **No explicit review gate.** Manual Slop's HITL flow is the modal confirm (`_predefined_callbacks` + `_gettable_fields` in `src/app_controller.py`); nagent's gate is the `proposal.yaml` file with `auto_confirm_max_items`/`auto_confirm_max_depth` thresholds. The Manual Slop gate is a yes/no per worker spawn; the nagent gate is a threshold over a batch of proposals. - -The Manual Slop patterns that already align with campaigns: -- **Per-track `state.toml`** (e.g., `conductor/tracks/nagent_review_20260608/state.toml`) is a partial `[M]` mutable aggregate. It has phase + task entries with `status` + `commit_sha` fields. The analog is partial: the `state.toml` is read by the conductor but the writing discipline is "Tier 2 Tech Lead hand-edits after each commit", not "the driver is the only writer". -- **The `_predefined_callbacks` Hook API** (in `src/app_controller.py:531-617`) is the closest analog to the campaign's context surfaces. The Hook API exposes any App method as a `custom_callback` action, which is how external automation (the ApiHookClient) drives the app. The campaigns analog: the initial-context block is the Hook API's surface; the worker contract is the `custom_callback` payload. -- **The MMA WorkerPool's tier-3 workers** (in `src/multi_agent_conductor.py` + `scripts/mma_exec.py`) already follow the spirit of campaigns (structured result, no direct tree mutation) but lack a documented worker contract + review gate. The `WorkerPool` spawns workers with `mma_exec.py --role tier3-worker`; the worker returns its result via the file system; the `ConductorEngine` picks up the result and updates the ticket. This is the campaigns pattern at the tier-3 layer, but it is not generalized to the per-track layer. - -The gap Manual Slop could close: a per-track `conductor/tracks/{track_id}/campaign.yaml` + a `bin/conductor-campaign update` driver that does the 6-phase pass. The driver would: merge Tier 3 worker results into `state.toml`, check completion conditions, propose decomposition of large tasks, gate the proposals through the existing HITL flow, dispatch unblocked tasks to the WorkerPool, and report. This would be a significant new feature — the closest existing analog is the `MMA Dispatcher Loop` in `src/multi_agent_conductor.py:280-340`, but it's scoped to the MMA queue, not the per-track plan. - -**Note on YAML format (per the user's directive, expanded in v3.1 §12):** the campaigns artifact format is YAML. Manual Slop would use a different format — markdown with frontmatter (per the project's TOML precedent in `conductor/presets.py` + `conductor/personas.py`) or a custom DSL. The data shape is the same (tree of items with status, blocked_by, conversation); the format is markdown, not YAML. See v3.1 §12 for the full rationale. - -#### §1.6 Honest Gaps - -1. **The decompose prompt is not deep-dived.** `prompts/campaign-decompose.md` is the LLM prompt that proposes item decomposition. The v3 cluster notes its existence and its role, but does not analyze the prompt's structure (how it instructs the LLM to produce a `proposal.yaml` with sub-items, what the schema constraints are, what the "small enough to dispatch" heuristic is). A future v3.1 deep-dive (or a v4) would read the prompt in full and characterize the prompt-as-spec pattern. -2. **The worker contract is not deep-dived.** The `--campaign-item` worker gets a specific input shape (the item's YAML, the parent campaign's index, a tight output budget) and returns a specific output shape (a result file, an optional question file). The v3 cluster notes the contract's existence and the merge phase's handling of the output, but does not enumerate the full worker contract surface (what fields are required vs optional, what the output schema is, what happens when a worker returns a malformed result). -3. **The judge condition type is not deep-dived.** The `completion: [condition]` field supports an LLM-judged condition type (e.g., "the README explains X"). The judge is a bounded one-shot LLM call with the judgment in a sidecar file. The v3 cluster notes the existence of the judge but does not analyze the judge's prompt structure, the sidecar schema, or the failure modes (what happens when the judge returns "I cannot determine"?). -4. **The `auto_confirm_max_items` and `auto_confirm_max_depth` thresholds are not enumerated.** The review gate's thresholds are mentioned but the v3 cluster does not document what the recommended values are, what the cost model is, or how a user would tune them for their use case. A v4 would document the threshold tuning procedure. -5. **The dispatch concurrency limit is not enumerated.** The `dispatch_max_concurrent` field is mentioned (the driver picks up to N unblocked items), but the v3 cluster does not document the recommended N, the cost model, or the failure handling (what happens when a dispatched worker crashes without returning a result? does the driver time out and re-dispatch? does the item stay `in-progress`?). -6. **The interaction with the conversation safety net is not deep-dived.** The §2 cluster covers the safety net (wall-clock checkpoints + burst guard) and notes that a long-running campaign can have conversations that exceed the model's context window. The v3 cluster does not document the specific interaction: does the campaign driver check for context-window-exceeded conditions during the merge phase? does the dispatch phase refuse to launch a worker when the context window is already full? does the report phase surface context-window warnings to the user? A v4 would map the safety net's hooks into the campaign driver's phases. - -#### §1.7 Code-Shape Sketch - -The campaign tree, in survey-grammar SSDL notation, with shape tags: - -``` -campaign := { name: string, # [S] string concatenation - status: active|paused|done, # [I] inspectable enum - completion: [condition], # [M] mutable list - items: [item], # [B] boundary (the dispatch list) - proposal: proposal_yaml? } # [M] mutable, pending review - -item := { id: string, # [S] - status: todo|proposed|in-progress|done|failed|question, # [I] - blocked_by: [id], # [B] dependency edge - conversation: path, # [B] path to conversation file - decompose: { when: heuristic, into: [sub_item] }?, # [M] optional - result: result_json? } # [M] populated by merge phase - -condition := { type: executable|judge, # [I] - spec: string, # [S] the test or the judge prompt - satisfied: bool } # [I] populated by check phase - -result_json := { status: done|failed|question, # [I] - summary: string, # [S] - question: question? } # [M] optional - -update {slug} { # driver entry point - merge // collect result.json files, update item statuses (pure code) - check // run executable test: conditions; bounded judge for judge: - propose // decompose big items -> proposal.yaml, status proposed - review_gate // auto-confirm within thresholds; report scope of pending - dispatch // bounded N unblocked items, each as --campaign-item worker - report // tree summary + questions + tokens spent -} -``` - -The shape tag map: `[I]` for inspectable enums and booleans (the model's understanding is the file's value), `[S]` for string concatenations (the model's understanding is the file's content), `[B]` for boundaries (the model's understanding is the file's edge), `[M]` for mutable aggregates (the model's understanding is the file's state). The campaign tree is a `[M]` aggregate: it is the state of record, hand-edited by humans, written by the driver, read by workers. - +**Pattern(s) vs v2.3:** NEW. v2.3 had the implicit "what to do next is the model's judgment, re-made every turn" loop. v3 makes the plan a first-class artifact: an inspectable, editable, durable spine that survives the conversation that created it. EXTENDS v2.3 Pattern 1 ("durable work, disposable workers") — campaigns make "durable work" an explicit artifact instead of a process convention. EXTENDS v2.3 Pattern 3 ("conversations are editable state") — plans-as-artifact is a new editable dimension, parallel to conversations. +**Manual Slop implications:** The conductor's `plan.md` could evolve toward a campaign-style `index.yaml` + per-task `task.yaml` + per-task `conversation` artifact set. The MMA WorkerPool's tier-3 workers already follow the spirit (structured result, no direct tree mutation) but lack a documented worker contract + review gate. The "plan changes pass a review gate, not a cap" invariant maps cleanly to the existing HITL flow — Manual Slop's gate is the modal confirm; nagent's gate is the `proposal.yaml` file with `auto_confirm_max_items`/`auto_confirm_max_depth` thresholds. +**Decision candidate:** NEW Candidate 17 (HIGH). "Campaign-style plan-as-data for the conductor": add a `.conductor/campaigns/{slug}/` layout with `index.yaml` + per-task `task.yaml` + per-task conversation artifacts; add a deterministic driver (1 pass, then exit) that mirrors `nagent-campaign update`'s 6 phases. See `decisions.md` Candidate 17. +**Cross-refs:** none direct (the §2 Conversation safety net cluster cross-references this one; the §9 Case-study methodology cluster cross-references the "open questions as text files" pattern). **Source-read citations:** - `bin/nagent-campaign` — new CLI entry point (24cf16d) - `bin/helpers/nagent_campaign_lib.py` — driver implementation (24cf16d) @@ -160,509 +34,109 @@ The shape tag map: `[I]` for inspectable enums and booleans (the model's underst - `prompts/knowledge-merge.md:1-19` — merge LLM prompt (f3ec090) - `README.md:474-484` — merge + graduate teaching (c1d2cad) - `README.md:900-908` — `nagent-campaign` CLI examples (24cf16d) -- `prompts/create-readme.md:248-251` — graduation rationale (c1d2cad) +- `prompts/create-readme.md:248-251` — graduation reduction: "Proven playbooks stay prose that must be re-read and re-trusted every time. Therefore: graduate them into self-describing tools and prompts — knowledge becomes capability, gated by review." (c1d2cad) - `issues/0001-retry-attempts-persist-raw-invalid-output.md` + `issues/0002-invalid-output-sidecars-are-never-collected.md` — two deferred follow-ups, filed as issue files (7a7e242) -- `issues/0004-conversation-safety-net.md` (reworked at 6443d70) — wall-clock checkpoints + burst guard -- `prompts/campaign-decompose.md:1-N` — decomposition LLM prompt (24cf16d) -- `prompts/campaign-item.md:1-N` — worker contract prompt (24cf16d) -- `bin/nagent-campaign:1-N` — CLI argument parsing + subcommand dispatch (24cf16d) -- `bin/helpers/nagent_campaign_lib.py:update()` — the 6-phase driver entry (24cf16d) -- `bin/helpers/nagent_campaign_lib.py:merge_phase()` — collect results, update statuses (24cf16d) -- `bin/helpers/nagent_campaign_lib.py:check_phase()` — run conditions (24cf16d) -- `bin/helpers/nagent_campaign_lib.py:propose_phase()` — decompose big items (24cf16d) -- `bin/helpers/nagent_campaign_lib.py:review_gate_phase()` — threshold-based accept (24cf16d) -- `bin/helpers/nagent_campaign_lib.py:dispatch_phase()` — bounded worker launch (24cf16d) -- `bin/helpers/nagent_campaign_lib.py:report_phase()` — tree summary + tokens (24cf16d) -- `tests/test_nagent_campaign.py` — driver unit tests (24cf16d) -- `tests/test_nagent_distill.py:merge_*` + `:graduate_*` — merge/graduate tests (f3ec090) -- `README.md:450-500` — campaigns teaching section (24cf16d + c1d2cad) -- `README.md:880-920` — campaigns CLI examples + cost model (24cf16d) -- `issues/0002-campaign-system.md:139-164` — the 4 invariants (199a36b) -- `issues/0002-campaign-system.md:159-191` — the 6 driver phases (199a36b) -- `issues/0002-campaign-system.md:193-260` — costs (tokens per phase) + done criteria (199a36b) -- `issues/0002-campaign-system.md:262-326` — open questions + future work (199a36b) +- `issues/0004-conversation-safety-net.md` (reworked at 6443d70) — wall-clock checkpoints + burst guard; the safety net that decomposition cannot bound +**Honest gaps in this cluster:** The issue file at `issues/0003-distill-passes.md` was DELETED at `6443d70` because the distill-passes content shipped in `f3ec090`; the issue numbering for the deferred followups at `7a7e242` starts fresh at 0001/0002 — so the "issue files" pattern is self-pruning (closed issues get deleted when their work merges). The driver spec at `issues/0002-campaign-system.md:159-191` lists 6 driver phases (Merge → Check → Propose → Review gate → Dispatch → Report), but the implementation commit `24cf16d` adds `bin/nagent-campaign` + `bin/helpers/nagent_campaign_lib.py` (the actual driver); the prompt files for decomposition (`prompts/campaign-decompose.md`) and worker context (`prompts/campaign-item.md`) also land in `24cf16d`, but their LLM prompts are not deep-dived here. Per the user's §0 cluster-scheme honesty note, "the source-read pass may surface new clusters" — these prompts are candidates for a future v3.1 deep-dive. + +**Pattern deep-dive.** The campaigns abstraction is a four-piece composition: **artifact**, **driver**, **invariants**, **context surfaces**. The artifact is the YAML tree (`.nagent/campaigns/{slug}/index.yaml` + per-item `item.yaml` + per-item `conversation`); the driver is `bin/nagent-campaign` doing one bounded pass and exiting; the invariants are the four load-bearing rules from `issues/0002-campaign-system.md:139-164` (one pass then exit; one writer for the tree; review gate not cap; schema is the whole schema); the context surfaces are the three places the campaigns pattern appears in initial context (every project conversation gets a Campaigns block; dispatched item workers get the worker contract; campaign-level conversations are ordinary conversations with the campaign as subject). This decomposition is itself data-oriented — the campaign's behavior is the artifact's shape, not code branching on state. + +The merge/graduate passes (f3ec090) extend the same idea to the knowledge store: knowledge files grow append-only until unreadable, so `--merge` rewrites each category file with provenance preserved; proven playbooks stay prose when they should become tools, so `--graduate` drafts them as non-executable `{name}.draft` files invisible to tool discovery until the user reviews them. The "nothing lands silently" property is load-bearing — drafts are deliberately not executable, so a graduate pass cannot accidentally expose a half-formed tool to a future conversation. + +A code-shape sketch using survey grammar (per the format commitment §5.1): + +``` +campaign := { name: string, status: active|paused|done, + completion: [condition], items: [item] } +item := { id: string, status: todo|proposed|in-progress|done|failed|question, + blocked_by: [id], conversation: path } +update {slug} { + merge // collect structured results, update statuses (pure code) + check // run executable test: conditions; bounded judge for judge: + propose // decompose big items -> proposal.yaml, status proposed + review_gate // auto-confirm within thresholds; report scope of pending + dispatch // bounded N unblocked items, each as --campaign-item worker + report // tree summary + questions + tokens spent +} +``` + +**Honest gap (continued):** the `{ssdl}` shape tag for the campaign tree is best described as `[M]` (mutable aggregate, hand-edited by humans) — the artifact is the state of record, the worker contract returns data, the driver is the only mutator. The lineage to v2.3's harvest pattern is direct: workers produce data (harvest-JSON in v2.3; `result.json` here), code merges into the tree (regenerate_digest in v2.3; driver merge phase here). -**Decision candidate:** NEW Candidate 17 (HIGH). "Campaign-style plan-as-data for the conductor": add a `.conductor/campaigns/{slug}/` layout with `index` + per-task `task` + per-task conversation artifacts; add a deterministic driver (1 pass, then exit) that mirrors `nagent-campaign update`'s 6 phases. The artifact format is markdown + frontmatter, not YAML (per the v3.1 §12 YAML avoidance observation). See `decisions.md` Candidate 17. -**Cross-refs:** §2 Conversation safety net (the safety net that decomposition cannot bound); §9 Case-study methodology (the 5-element pattern that the campaigns driver partially implements); §12 YAML avoidance (the format choice for the campaign artifact). -**Pattern history:** NEW in v3. v2.3 had the implicit "what to do next is the model's judgment" loop. v3 makes the plan a first-class artifact. ## §2 Conversation safety net **Source:** nagent `38d3d4f`, `6426a67` (`bin/nagent:1455-1687` + `:1840-1881` + `:2463-2677` + `:2819`, `bin/helpers/nagent_distill_lib.py:587-654` + `:851-862`, `config.example.json:3-7`, `prompts/checkpoint-conversation.md`, `README.md:653-668` + `:323-332`, `issues/0004-conversation-safety-net.md`, `tests/test_nagent_safety.py`, `tests/test_nagent_distill.py`) **One-liner:** A conversation that outgrows its window gets caught, not killed. Checkpoints are a separate one-call writer, not the working model; rebuild is a deterministic string assembly that runs a synchronous checkpoint first; saves are instant because the summary is extracted from the checkpoint's already-paid-for Intent line, not a new LLM call. -**Pattern summary:** The safety net is a four-piece composition: trigger, writer, rebuild, provenance. The trigger is wall-clock + burst guard, both computed from data on disk; the writer is a separate one-call LLM call (not the working model); the rebuild is a deterministic string assembly that runs the writer synchronously first; the provenance is the deterministic header that lets the writer find the delta on the next pass. Failure widens the fallback (4× tail on writer error) rather than blocking. Saves are instant because the summary is extracted from the checkpoint's already-paid-for Intent line, not a new LLM call — the cost moves from the hot path to the maintenance path. This extends the "the loop" principle (v2.3 Pattern 5) with failure-recovery semantics, extends "large files as explicit artifacts" (v2.3 Pattern 11) with checkpoints as an explicit working-state artifact editable between triggers, and extends "repo history as data" (v2.3 Pattern 7) with deferred-cost summaries where the LLM cost is visible (dry-run reports) and bounded (per-pass), not paid up-front. - -#### §2.1 What the Safety Net Adds - -The safety net introduces a failure-recovery layer between the conversation and the model's context window. Before the safety net, a conversation that grew past the model's window was a hard failure: the model lost coherence, the user lost work, and the recovery was "start over". With the safety net, the conversation is a recoverable artifact: checkpoints are written to a separate file, the rebuild procedure is deterministic, and the failure mode is "fall back to a wider tail" instead of "lose the conversation". - -The four pieces of the safety net abstraction: - -1. **Trigger** — wall-clock + burst guard, both computed from data on disk. `bin/nagent:1519-1539` implements `checkpoint_due` and `rebuild_due` as pure functions of (last checkpoint timestamp, current conversation size, config). The trigger is data, not code branching on state. The cadence reasoning is explicit: "time and context consumption are uncorrelated in exactly the wrong direction" (`issues/0004-conversation-safety-net.md:30`). Token-percentage triggers were "an approximation of an approximation" — three numbers in units `ls -l` can verify are the data-grounded alternative. -2. **Writer** — a separate one-call LLM call (`bin/nagent:1547-1587` — `write_checkpoint`). The writer is NOT the working model. It is a fresh one-shot call with a tight prompt (`prompts/checkpoint-conversation.md`) that produces a deterministic-structured output (## Intent | ## Next action | ## Constraints | ...). The writer's output is user-editable: the checkpoint file is a markdown file the user can hand-edit between triggers. -3. **Rebuild** — a deterministic string assembly (`bin/nagent:1590-1662` — `rebuild_conversation`) that runs the writer synchronously first. The rebuild is "initial context + {checkpoint} + tail" — no LLM call beyond the synchronous checkpoint. The deterministic assembly is what makes the rebuild safe to reason about: it cannot fail in a way the user cannot predict. -4. **Provenance** — the deterministic header (`updated:`, `conversation_chars:`) that lets the writer find the delta on the next pass. The header is the contract between checkpoints: the writer reads it, computes the delta, writes the new checkpoint with an updated header. - -The "sync checkpoint first" invariant is the load-bearing one. A naive rebuild that trusted the most-recent checkpoint's freshness would fail on the exact conversation the safety net is meant to save (a conversation that grew past `rebuild_at_kb` between scheduled checkpoints). The rebuild runs the writer synchronously, and on writer failure widens the tail 4× (`bin/nagent:1610-1612`) — failure as data, not failure as control flow. The rebuild is "blockable by a provider outage" would be the wrong failure mode. - -#### §2.2 The Writer and the Checkpoint Format - -The checkpoint is a markdown file with a deterministic header and a fixed-structure body. The header is two fields: - -``` -updated: -conversation_chars: -``` - -The body is the writer's LLM output, constrained to a fixed schema (`prompts/checkpoint-conversation.md`): - -``` -## Intent - - -## Next action - - -## Constraints - - -## Open questions - -``` - -The schema is the whole schema. The code does not maintain a parallel mental model (e.g., "we track the intent in a separate field"). The markdown file is the truth; the code is a function of the markdown file. - -The writer is a one-shot LLM call, not the working model. This matters for two reasons: - -1. **Cost visibility.** The writer's LLM cost is paid once per checkpoint, not once per turn. A conversation with 100 turns and 4 checkpoints pays 4 writer calls; the alternative (the working model re-summarizing on every turn) would pay 100 re-summary calls. The cost moves from O(turns) to O(checkpoints). -2. **Non-deterministic working model does not pollute the checkpoint.** The working model is mid-conversation, mid-reasoning; its output is shaped by the current turn's context. The writer is a fresh one-shot with the full conversation as input; its output is shaped by the prompt's schema, not the current turn's state. The checkpoint is stable across reads. - -A code-shape sketch using survey grammar: - -``` -checkpoint := { updated: timestamp, # [S] string - conversation_chars: int, # [I] inspectable - body: ## Intent | ## Next action | ## Constraints | ## Open questions } # [B] boundary - -write_checkpoint { conversation, llm, now } { - delta = conversation[meta.conversation_chars:] # [S] string slice - if len(delta) < min_delta_chars { return nil } # too small to summarize - prompt = format(prompts.checkpoint-conversation.md, delta) # [S] string format - body = llm.call(prompt) # [B] boundary to LLM - write checkpoint.updated = now - write checkpoint.conversation_chars = len(conversation) - write checkpoint.body = body -} -``` - -The `[B]` boundary tag marks the single LLM call in the writer. Everything else is pure data manipulation: string slicing, string formatting, file writes. The writer is "an LLM call wrapped in deterministic I/O". - -#### §2.3 The Trigger Logic - -The trigger is a pure function of (last checkpoint timestamp, current conversation size, config). `bin/nagent:1519-1539` implements two functions: - -1. **`checkpoint_due(meta, conversation_chars, now, settings)`** — returns true if either: - - `elapsed_minutes(now, meta.updated) > settings.checkpoint_interval_minutes` AND `conversation_chars > meta.conversation_chars + new_chars_threshold` - - `conversation_chars - meta.conversation_chars > settings.checkpoint_max_new_kb * 1024` - - `meta is nil` AND `conversation_chars > settings.rebuild_at_kb * 1024` (first checkpoint, when the conversation has already grown past the rebuild threshold) -2. **`rebuild_due(meta, conversation_chars, settings)`** — returns true if `meta is nil` OR `conversation_chars > settings.rebuild_at_kb * 1024`. - -The three config numbers are in `config.example.json:3-7`: - -```json -{ - "safety_net": { - "checkpoint_interval_minutes": 10, - "checkpoint_max_new_kb": 32, - "rebuild_at_kb": 192 - } -} -``` - -All three are in units `ls -l` can verify: minutes, kilobytes, kilobytes. Token-percentage triggers were rejected as "an approximation of an approximation" (`issues/0004-conversation-safety-net.md:30-44`) — the 3-number config is the data-grounded alternative. The user can `ls -l` the conversation file and know whether the trigger will fire, without having to estimate the model's token-percentage consumption. - -#### §2.4 The Rebuild Procedure - -The rebuild is "initial context + {checkpoint} + tail" — a deterministic string assembly (`bin/nagent:1590-1662` — `rebuild_conversation`). The procedure: - -1. **Sync checkpoint first.** Run `write_checkpoint(conversation, llm)` synchronously. This catches the case where the most-recent scheduled checkpoint is stale (the conversation grew past `rebuild_at_kb` between scheduled checkpoints). The sync checkpoint is the "freshness" guarantee. -2. **Widen tail on writer failure.** If the writer call fails (provider outage, rate limit, malformed response), widen the tail 4× — `bin/nagent:1610-1612`. Failure as data, not failure as control flow. The rebuild cannot fail in a way that loses the conversation. -3. **Archive the old conversation.** Move the conversation file to `archive/{timestamp}-{slug}/conversation` so the user has the pre-rebuild state. -4. **Write the new initial context.** Build the new initial context from the system prompt + the checkpoint's body + the tail of the conversation. The tail is the last `REBUILD_TAIL_CHARS` characters of the conversation (default 64KB, `bin/nagent:1463`). -5. **Reset the checkpoint's `conversation_chars`.** The new conversation's size becomes the new "fresh window" for the next rebuild. - -A code-shape sketch: - -``` -rebuild { conversation, llm, now, settings } { - try write_checkpoint(conversation, llm, now) - recover { - tail_chars = REBUILD_TAIL_CHARS * 4 # widen 4x on failure - audit msg "checkpoint writer failed; using widened tail" - } else tail_chars = REBUILD_TAIL_CHARS - - archive_path = archive/{now}/{slug}/conversation - move conversation -> archive_path - new_conversation = initial_context + checkpoint + conversation[-tail_chars:] - write conversation = new_conversation - reset meta.conversation_chars = len(new_conversation) - reset meta.updated = now -} -``` - -The `{ssdl}` shape tag for the rebuild is `[S]` (string concatenation). The only LLM call is the sync checkpoint. Everything else is deterministic I/O. - -#### §2.5 The Instant-Saves Change (6426a67) - -The instant-saves change is a smaller, sharper version of the same idea: the cost of an LLM summary is moved from the hot path (every save) to the maintenance path (`nagent-distill --apply` backfill + `--summarize-conversation` on demand). - -Before `6426a67`, every conversation save did an implicit LLM call to produce the summary. This had two costs: -1. **Hot-path latency.** A save was a multi-second LLM call, not a millisecond file write. -2. **Cost opacity.** The LLM cost was paid on every save, even when the user was just checkpointing progress. - -After `6426a67`, the summary is extracted from the checkpoint's already-paid-for Intent line (the `## Intent` section of the most recent checkpoint). The summary is the artifact's own data — no new LLM call. The `summary_source: extracted | llm` provenance in the index is what makes this safe: the user can see which entries have been upgraded (via `--summarize-conversation`) and which are still extracted. The backfill pass (`bin/helpers/nagent_distill_lib.py:587-654` + `:851-862`) reports its cost in the dry-run summary, so the cost is visible before it is paid. - -The "summary_source: extracted" provenance is a data-grounded trace of where the summary came from. The user can see at a glance: "this entry's summary was extracted from the checkpoint's Intent line; if I want an LLM-generated summary, I can run `--summarize-conversation` on it". - -#### §2.6 Per-Commit Detail - -The two commits that built the safety net subsystem: - -1. **`38d3d4f` — Add the safety net machinery.** Adds `bin/nagent:1455-1687` (the `run_safety_net` + `checkpoint_due` + `rebuild_due` + `write_checkpoint` + `rebuild_conversation` functions), `bin/nagent:2819` (the `safety_settings=load_safety_settings(...)` wiring into `run_agent_loop`), `config.example.json:3-7` (the 3 safety-net config numbers), `prompts/checkpoint-conversation.md` (the writer LLM prompt), `README.md:653-668` (Part VI safety-net teaching), and `tests/test_nagent_safety.py` (the test file). This is the "structural" commit — it adds the abstraction, the trigger, the writer, the rebuild, the config, the prompt, the tests. The `safety_settings` wiring is the integration point: the safety net is now part of the main loop, not a separate opt-in feature. -2. **`6426a67` — Add the instant-saves change.** Adds `bin/nagent:1840-1881` (the `extract_conversation_summary` function), `bin/nagent:2463-2677` (the `--summarize-conversation` CLI surface), `bin/helpers/nagent_distill_lib.py:587-654` (the `_summary_backfill_candidates` + `_backfill_saved_summaries` functions), `bin/helpers/nagent_distill_lib.py:851-862` (the backfill wired into the distill apply path), and `README.md:323-332` (Part II instant-saves teaching). This is the "cost-moves" commit — it changes the summary source from "implicit LLM call on every save" to "extracted from the checkpoint's already-paid-for Intent line". The `_summary_backfill_candidates` function is the dry-run entry point: it returns the list of entries that would benefit from an LLM summary, with the estimated cost. The user sees the cost before paying it. - -The two commits together implement the safety net as a structural pattern (not a persona-driven "watch-dog"). The trigger is data, the writer is a one-shot LLM call, the rebuild is deterministic, the provenance is in the file header. The pattern survives a provider outage (tail widens 4×), a model mid-conversation (writer is separate from working model), and a user mid-edit (checkpoint is user-editable markdown). - -#### §2.7 Manual Slop Implications - -The Manual Slop equivalents of the safety net are partial. The closest analog is the per-discussion write path in `src/discussion.py` (or similar) + the per-take branching in `src/project_manager.py:branch_discussion` + `promote_take`. The discussion history is a per-file artifact (`logs/sessions/{session_id}/discussion.jsonl` or similar), and the discussion index is a separate file. But the Manual Slop analog lacks three of the four safety-net invariants: - -1. **No "sync checkpoint first" guarantee.** Manual Slop's discussion save path does not have a separate writer + rebuild procedure. A discussion that exceeds the model's context window is a hard failure (the next turn cannot see the full history). -2. **No "widen tail on failure" fallback.** Manual Slop's failure modes are exception-based, not data-widening. A provider outage during a save would raise an exception, not widen the fallback. -3. **No `summary_source: extracted | llm` provenance.** Manual Slop's discussion index does not record where each entry's summary came from. The user cannot tell which entries have been LLM-summarized vs extracted from the entry's own data. - -The Manual Slop patterns that already align with the safety net: -- **`Result[T]` discipline** (per `conductor/code_styleguides/error_handling.md`) — failure widens the fallback instead of blocking. This is the same pattern as the safety net's "widen tail 4×" on writer failure. -- **`promote_take` + `branch_discussion`** (in `src/project_manager.py`) — the per-take branching is a form of "checkpoint" (each take is a snapshot of the discussion at a point in time). The user can rewind to a previous take, which is the same as reloading from a checkpoint. -- **The 3-layer MCP security model** (per `docs/guide_mcp_client.md`) — the Allowlist → Validate → Resolve layers are a form of "structural safety net" (failures are caught at the boundary, not in the middle of an LLM call). - -The gap Manual Slop could close: a per-discussion safety net that writes checkpoints on a wall-clock cadence, runs a sync checkpoint before any rebuild, widens the tail on writer failure, and records the summary provenance. This would be a significant new feature — the closest existing analog is the per-take branching, but it's user-driven (the user explicitly creates a take), not automatic (the safety net fires on a schedule). - -**Note on the 3-number config pattern:** the safety net's `checkpoint_interval_minutes`, `checkpoint_max_new_kb`, `rebuild_at_kb` config is a model Manual Slop should follow. Operations should be configurable in units `ls -l` can verify, not in token-percentage estimates that drift per provider. The Manual Slop equivalent would be a per-discussion config with units of (minutes, kilobytes, kilobytes) — not (tokens, percentage, percentage). This is a small but load-bearing change: the user can `ls -l` the discussion file and know whether the trigger will fire, without having to estimate the model's token-percentage consumption. - -#### §2.8 Honest Gaps - -1. **The `delta_start = min(meta[1], len(content))` clamp at `bin/nagent:1566` could produce a misleading delta if a user edit deletes characters between checkpoints** (the recorded size becomes larger than current content). The clamp hides the failure; the delta would be the entire current content, not the actual new activity. Minor edge case; the spec does not address it. -2. **The `REBUILD_TAIL_CHARS = 64 * 1024` default at `bin/nagent:1463` is explicitly unmeasured** ("mirrors MiMo's ~65K tokens until measured otherwise" per `issues/0004-conversation-safety-net.md:42-44`). A future track should measure actual rebuild-tail needs across providers and conversation types. -3. **`best-of-N` is mentioned in the initial context at `bin/nagent:775` as a directive to the model, not implemented as machinery** — it is the same "direction before machinery" pattern v2.3 used for compaction. A follow-up track could lift it to a driver (e.g., `nagent-safety-net --best-of-n` that runs the writer N times and picks the most-recoverable checkpoint). -4. **The interaction with the campaigns driver (Phase 2's `nagent-campaign update`) is not deep-dived.** The campaigns driver has its own 6 phases (merge, check, propose, review gate, dispatch, report). A long-running campaign can have conversations that exceed the model's context window. The safety net's role in the campaigns driver is not documented: does the driver check for context-window-exceeded conditions during the merge phase? does the dispatch phase refuse to launch a worker when the context window is already full? does the report phase surface context-window warnings to the user? -5. **The interaction with the conversation-cache boundaries (v2.3 §2.2) is not deep-dived.** v2.3 introduced `conversation_cache_boundaries` at `bin/nagent:970-987` to manage the provider's prompt cache. The safety net's rebuild creates a new initial context, which invalidates the cache. The v3 cluster does not document how the safety net coordinates with the cache invalidation — does the rebuild preserve the cache boundary markers? does the next checkpoint know about the cache state? -6. **The 3-number config's recommended values are not enumerated.** The config defaults (`checkpoint_interval_minutes: 10`, `checkpoint_max_new_kb: 32`, `rebuild_at_kb: 192`) are documented, but the cost model is not. A v4 would document the recommended values per conversation type (short Q&A, long-running build, multi-day campaign) and per provider (Gemini's 1M context vs Anthropic's 200K vs OpenAI's 128K). -7. **The writer's failure modes are not enumerated.** The writer is a one-shot LLM call; it can fail with a provider outage, a rate limit, a malformed response, or a refusal. The v3 cluster documents the "widen tail 4×" fallback, but does not enumerate the other failure handling — what happens when the writer returns a malformed response (missing sections, extra sections, wrong order)? does the rebuild retry the writer, or proceed with the malformed checkpoint? - -#### §2.9 Code-Shape Sketch - -The safety net, in survey-grammar SSDL notation, with shape tags: - -``` -safety_settings := { checkpoint_interval_minutes: int, # [I] inspectable - checkpoint_max_new_kb: int, # [I] inspectable - rebuild_at_kb: int } # [I] inspectable - -checkpoint := { updated: timestamp, # [S] string - conversation_chars: int, # [I] inspectable - body: ## Intent | ## Next action | ## Constraints | ## Open questions } # [B] boundary - -due { meta, conversation_chars, now, settings } { # trigger (pure function) - if elapsed_minutes(now, meta.updated) > settings.checkpoint_interval_minutes - and conversation_chars > meta.conversation_chars - -> fire {ssdl} [I] # inspectable trigger - if conversation_chars - meta.conversation_chars > settings.checkpoint_max_new_kb * 1024 - -> fire - if meta is nil and conversation_chars > settings.rebuild_at_kb * 1024 - -> fire first time only - else - -> idle -} - -write_checkpoint { conversation, llm, now } { # writer (one LLM call) - delta = conversation[meta.conversation_chars:] # [S] string slice - if len(delta) < min_delta_chars { return nil } # too small to summarize - prompt = format(prompts.checkpoint-conversation.md, delta) # [S] string format - body = llm.call(prompt) # [B] boundary to LLM - write checkpoint.updated = now - write checkpoint.conversation_chars = len(conversation) - write checkpoint.body = body -} - -rebuild { conversation, llm, now, settings } { # rebuild (deterministic) - try write_checkpoint(conversation, llm, now) - recover { - tail_chars = REBUILD_TAIL_CHARS * 4 # widen 4x on failure - audit msg "checkpoint writer failed; using widened tail" - } else tail_chars = REBUILD_TAIL_CHARS - - archive_path = archive/{now}/{slug}/conversation - move conversation -> archive_path - new_conversation = initial_context + checkpoint + conversation[-tail_chars:] # [S] string concat - write conversation = new_conversation - reset meta.conversation_chars = len(new_conversation) - reset meta.updated = now -} - -summary_source := { entry_id: string, # provenance - source: extracted|llm, # [I] inspectable - extracted_at: timestamp?, # [S] - llm_summarized_at: timestamp? } # [S] -``` - -The shape tag map: `[I]` for inspectable triggers and config, `[S]` for string concatenations and timestamps, `[B]` for the single LLM boundary in the writer, `[M]` for the mutable aggregate that is the conversation file. The safety net is a `[M]` aggregate: it is the state of record, hand-edited by humans, written by the writer, read by the rebuild. - +**Pattern(s) vs v2.3:** EXTENDS v2.3 Pattern 5 ("the loop") with failure-recovery semantics. v2.3 had the loop; v3 makes the loop survive long-running conversations. EXTENDS v2.3 Pattern 11 ("large files as explicit artifacts") — checkpoints are an explicit working-state artifact (separate from the conversation) that the user can edit between triggers. The instant-saves change extends v2.3 Pattern 7 ("repo history as data") with deferred-cost summaries — the LLM cost moves to a place where it's visible (dry-run reports) and bounded (per-pass), not paid up-front. +**Manual Slop implications:** The "sync checkpoint first" invariant maps to Manual Slop's existing `Result[T]` discipline (per `conductor/code_styleguides/error_handling.md`) — failure never blocks; the failure widens the fallback instead. Manual Slop's current Discussion entry write paths could adopt the `summary_source: extracted | llm` pattern; right now every save may do an implicit LLM call. The 3-number config (`checkpoint_interval_minutes`, `checkpoint_max_new_kb`, `rebuild_at_kb`) is a model Manual Slop should follow: operations should be configurable in units `ls -l` can verify, not in token-percentage estimates that drift per provider. +**Decision candidate:** NEW Candidate 18 (HIGH). "Discussion-window safety net for Manual Slop": adopt the checkpoint + rebuild pattern for the discussion history; backfill summary entries from the existing intent line; surface extracted-vs-llm provenance in the discussion index. See `decisions.md` Candidate 18. +**Cross-refs:** `conductor/tracks/fable_review_20260617` (the Fable review's analysis of "watch-dogging" is the opposite pattern — nagent's safety net is structural, not persona-driven). §1 Campaigns cross-references the safety net as the failure-recovery layer for what decomposition cannot bound. **Source-read citations:** - `bin/nagent:1455-1687` — `run_safety_net` + `checkpoint_due` + `rebuild_due` + `write_checkpoint` + `rebuild_conversation` (38d3d4f) - `bin/nagent:1840-1881` — `extract_conversation_summary` (6426a67) - `bin/nagent:2463-2677` — `--summarize-conversation` CLI surface (6426a67) - `bin/nagent:2819` — `safety_settings=load_safety_settings(...)` wired into `run_agent_loop` (38d3d4f) -- `bin/nagent:1463` — `REBUILD_TAIL_CHARS = 64 * 1024` default (38d3d4f) -- `bin/nagent:1519-1539` — `checkpoint_due` + `rebuild_due` pure functions (38d3d4f) -- `bin/nagent:1547-1587` — `write_checkpoint` (38d3d4f) -- `bin/nagent:1590-1662` — `rebuild_conversation` (38d3d4f) -- `bin/nagent:1610-1612` — widen tail 4× on writer failure (38d3d4f) -- `bin/nagent:1566` — `delta_start = min(meta[1], len(content))` clamp (38d3d4f) -- `config.example.json:3-7` — 3 safety-net config numbers (38d3d4f) +- `config.example.json:3-7` — 3 safety-net config numbers, all units `ls -l` can verify (38d3d4f) - `prompts/checkpoint-conversation.md` — checkpoint LLM prompt (38d3d4f) - `bin/helpers/nagent_distill_lib.py:587-654` — `_summary_backfill_candidates` + `_backfill_saved_summaries` (6426a67) - `bin/helpers/nagent_distill_lib.py:851-862` — backfill wired into the distill apply path (6426a67) - `README.md:653-668` — safety-net teaching in Part VI (38d3d4f) - `README.md:323-332` — instant-saves teaching in Part II (6426a67) - `issues/0004-conversation-safety-net.md` — the spec; reworked at 6443d70 to wall-clock cadence (199a36b) -- `issues/0004-conversation-safety-net.md:30` — cadence reasoning ("time and context consumption are uncorrelated in exactly the wrong direction") -- `issues/0004-conversation-safety-net.md:42-44` — `REBUILD_TAIL_CHARS` unmeasured note - `tests/test_nagent_safety.py` — safety-net test file (38d3d4f) -- `tests/test_nagent_distill.py:summary_*` — backfill tests (6426a67) -- `bin/nagent:775` — `best-of-N` initial-context directive (38d3d4f) -- `bin/nagent:970-987` — `conversation_cache_boundaries` (v2.3; not modified in v3 but relevant for the gap note) -- `bin/nagent:606-745` — `build_initial_context` (v2.3; relevant for the rebuild's "initial context" assembly) -- `config.example.json:1-15` — full safety-net config block with defaults (38d3d4f) -- `README.md:670-700` — safety-net cost model (checkpoint cost, rebuild cost) (38d3d4f) -- `README.md:333-360` — instant-saves cost model (extracted vs LLM cost) (6426a67) -- `issues/0004-conversation-safety-net.md:1-100` — full spec: trigger, writer, rebuild, provenance, cost (199a36b) -- `issues/0004-conversation-safety-net.md:101-200` — failure modes + edge cases (199a36b) -- `issues/0004-conversation-safety-net.md:201-326` — open questions + future work (199a36b) +**Honest gaps in this cluster:** +- The `delta_start = min(meta[1], len(content))` clamp at `bin/nagent:1566` could produce a misleading delta if a user edit deletes characters between checkpoints (the recorded size becomes larger than current content). The clamp hides the failure; the delta would be the entire current content, not the actual new activity. Minor edge case; the spec does not address it. +- The `REBUILD_TAIL_CHARS = 64 * 1024` default at `bin/nagent:1463` is explicitly unmeasured ("mirrors MiMo's ~65K tokens until measured otherwise" per `issues/0004-conversation-safety-net.md:42-44`). A future track should measure actual rebuild-tail needs. +- `best-of-N` is mentioned in the initial context at `bin/nagent:775` as a directive to the model, not implemented as machinery — it is the same "direction before machinery" pattern v2.3 used for compaction. A follow-up track could lift it to a driver. + +**Pattern deep-dive.** The safety-net is a four-piece composition: **trigger**, **writer**, **rebuild**, **provenance**. The trigger is wall-clock + burst guard, both computed from data on disk (`bin/nagent:1519-1539` — `checkpoint_due`); the writer is a separate one-call LLM call (`bin/nagent:1547-1587` — `write_checkpoint`); the rebuild is a deterministic string assembly that runs the writer synchronously first (`bin/nagent:1590-1662` — `rebuild_conversation`); the provenance is the deterministic header (`updated:`, `conversation_chars:`) that lets the writer find the delta on the next pass. The cadence reasoning is explicit: "time and context consumption are uncorrelated in exactly the wrong direction" (`issues/0004-conversation-safety-net.md:30`). Token-percentage triggers were "an approximation of an approximation" — three numbers in units `ls -l` can verify are the data-grounded alternative. + +The "sync checkpoint first" invariant is the load-bearing one. A naive rebuild that trusted the most-recent checkpoint's freshness would fail on the exact conversation the safety net is meant to save (a conversation that grew past `rebuild_at_kb` between scheduled checkpoints). The rebuild runs the writer synchronously, and on writer failure widens the tail 4× (`bin/nagent:1610-1612`) — the rebuild is "blockable by a provider outage" would be the wrong failure mode. Failure as data, not failure as control flow. + +The instant-saves change (`6426a67`) is a smaller, sharper version of the same idea: the cost of an LLM summary is moved from the hot path (every save) to the maintenance path (`nagent-distill --apply` backfill + `--summarize-conversation` on demand). The summary is the artifact's own data — the checkpoint's `## Intent` line, already paid for — or the first user prompt truncated. The `summary_source: extracted | llm` provenance in the index is what makes this safe: the user can see which entries have been upgraded and which are still extracted, and the backfill pass reports its cost in the dry-run summary. + +A code-shape sketch using survey grammar (per the format commitment §5.1): + +``` +safety_settings := { checkpoint_interval_minutes: int, + checkpoint_max_new_kb: int, + rebuild_at_kb: int } +checkpoint := { updated: timestamp, conversation_chars: int, + body: ## Intent | ## Next action | ## Constraints | ... } + +due { meta, conversation_chars, now, settings } { + if elapsed > interval and chars grew -> fire {ssdl} [I] + if chars grew > max_new -> fire + if meta is nil and chars > max_new -> fire first time only + else -> idle +} + +rebuild { conversation, llm, now } { + try write_checkpoint(conversation, llm) + recover widen tail * 4 + archive(conversation) + write initial_context + {checkpoint} + tail {ssdl} [S] + reset checkpoint.conversation_chars = fresh_window_size +} +``` + +The `{ssdl}` markers note the two transformations: checkpoint write is an `[I]` (inspectable, the writer's output is user-editable), and rebuild is an `[S]` (string concatenation — no LLM call beyond the synchronous checkpoint; the deterministic assembly is what makes the rebuild safe to reason about). -**Decision candidate:** NEW Candidate 18 (HIGH). "Discussion-window safety net for Manual Slop": adopt the checkpoint + rebuild pattern for the discussion history; backfill summary entries from the existing intent line; surface extracted-vs-llm provenance in the discussion index. See `decisions.md` Candidate 18. -**Cross-refs:** `conductor/tracks/fable_review_20260617` (the Fable review's analysis of "watch-dogging" is the opposite pattern — nagent's safety net is structural, not persona-driven). §1 Campaigns cross-references the safety net as the failure-recovery layer for what decomposition cannot bound. §13 Agent context-window observations (the v3.1 new section on warm-up + window + safe-zone numbers; the safety net is the structural mechanism that implements the safe-zone). -**Pattern history:** EXTENDS v2.3 Pattern 5 ("the loop") with failure-recovery semantics. EXTENDS v2.3 Pattern 11 ("large files as explicit artifacts") with checkpoints as an explicit working-state artifact. EXTENDS v2.3 Pattern 7 ("repo history as data") with deferred-cost summaries. ## §3 Hooks **Source:** nagent `a4fb141` (`bin/nagent:1442-1484` + `:1607-1625` + `:1922-1927` + `:2806-2825` + `:3167-3185`, `config.example.json:6-8`, `tests/test_nagent.py:870-960`); plus both case-study harness scripts (`https://raw.githubusercontent.com/macton/pep-copt/main/prove-optimized-harness.sh`, `https://raw.githubusercontent.com/macton/differentiable-collisions-optc/main/prove-optimized-harness.sh`). **One-liner:** Per-turn ground-truth injection. A hook runs at the top of every turn (before the model speaks) or after every structured edit; its measured output — exit code, stdout, stderr, or "(no output)" — enters the conversation as a labeled block, so the model responds against measured state instead of its recollection. The case-study repos ARE the hooks: `prove-optimized-harness.sh` is the command wired into `--hook-per-run`. -**Pattern summary:** Hooks introduce a per-turn measurement primitive that breaks the conversation's dependence on the model's self-reporting. The abstraction is a three-piece composition: resolve, invoke, inject. `resolve_hooks` enforces CLI > config > disabled precedence; `run_hook` invokes the command and captures exit code + stdout + stderr + "(no output)" when silent; the injection sites are the conversation (per-run at the top of every turn before `call_llm`; per-file-edit after `` or `` in `--file-edit` mode). The case-study harness scripts are the proof that hooks work as intended: both implement the same skeleton (log + summary + enforcing gate) with different proof contracts. The data shape of a hook result is a labeled block with exit code, optional path, optional stdout, optional stderr, or "(no output)" — the model's context grows by a measured block, not by the model's word. The `{ssdl}` `[B]` (boundary) marker captures the abstraction: the hook is the boundary where the model's context meets the measured world; the failure of a measurement is data the model can act on, not a control-flow exception. - -#### §3.1 What Hooks Add - -Hooks introduce a per-turn measurement primitive that breaks the conversation's dependence on the model's self-reporting. Before hooks, the conversation was a closed loop: the model said something, the user read it, the user replied, the model said something else. The only ground truth was the model's word. With hooks, the conversation is an open loop: a measurement command runs at the top of every turn, its output enters the conversation as a labeled block, and the model responds against measured state instead of its recollection. - -The three pieces of the hooks abstraction: - -1. **Resolve** — `resolve_hooks(cli_per_run, cli_per_file_edit, config_path)` enforces the CLI > config > disabled precedence. The CLI is the experiment's override (one-shot, the user's immediate need); the config is the project's default (persistent, the project's convention); empty means off. The resolve function is pure: it returns a tuple of (per_run_command, per_file_edit_command), each of which is either a string or None. -2. **Invoke** — `run_hook(command, label, path=None)` invokes the command via subprocess, captures exit code + stdout + stderr, and surfaces "(no output)" when silent. The function never raises on a non-zero exit code; the failure is data, not control flow. The output is wrapped in a labeled block: ``. The label is the hook's name (e.g., "hook-per-run", "hook-per-file-edit"); the path is the file being edited (for per-file-edit hooks). -3. **Inject** — the labeled block is appended to the conversation file. The injection sites are explicit: per-run at the top of every turn before `call_llm` (`bin/nagent:1922-1927`); per-file-edit after `` (`bin/nagent:1607-1611`) or `` in `--file-edit` mode (`bin/nagent:1618-1625`). Scratch writes are not file edits — the comment at `bin/nagent:1618-1620` notes the distinction explicitly: "A `` only edits a real file in per-file-edit mode ... in main mode it writes scratch, which is not a file edit worth a verify hook". - -The case-study harness scripts are the proof that hooks work as intended. Both scripts implement the same skeleton: log + summary + enforcing gate. The log records every step with verbose mode for streaming; the summary collects every verdict at the end (`set +e` so a failing gate still prints); the enforcing gate collects the verdicts and decides pass/fail. - -#### §3.2 The Resolve Precedence - -The CLI > config > disabled precedence is the contract between the experiment and the project. The CLI is the experiment's override: a user running `nagent --hook-per-run='make test'` is overriding the project's default hook for this invocation only. The config is the project's default: `config.json` says `{"hook_per_run": "make test"}` and every invocation of `nagent` in this project uses that hook. Disabled means off: if neither CLI nor config specifies the hook, the hook does not run, and the conversation has no per-run block. - -The resolve function is pure: it returns a tuple of (per_run_command, per_file_edit_command), each of which is either a string or None. The implementation is at `bin/nagent:1466-1484`: - -``` -resolve_hooks(cli_per_run, cli_per_file_edit, config_path) { - config = load_json(config_path) if config_path else {} - per_run = cli_per_run or config.get("hook_per_run") or None - per_file_edit = cli_per_file_edit or config.get("hook_per_file_edit") or None - // empty string in config means disabled (defensive: don't pass "" to subprocess) - if per_run == "": per_run = None - if per_file_edit == "": per_file_edit = None - return (per_run, per_file_edit) -} -``` - -The "empty string means disabled" rule is defensive: an empty string in the config should not be passed to subprocess (which would invoke the shell with no command, producing unpredictable output). The resolve function normalizes empty strings to None, which the invoke function treats as "no hook this turn". - -#### §3.3 The Invoke and Inject Cycle - -The invoke function is the boundary between the conversation and the measured world. The function: - -1. **Subprocess invocation.** If the command is None, return None (no hook this turn). -2. **Capture exit code + stdout + stderr.** Use `subprocess.run(command, shell=True, capture_output=True, text=True)` to invoke the command. The exit code is the command's return code (0 = success, non-zero = failure). The stdout and stderr are the command's output. -3. **Format the labeled block.** The output is wrapped in a labeled block: ``. The "(no output)" marker is used when both stdout and stderr are empty (a silent success is still a measurable success). -4. **Append to conversation.** The block is appended to the conversation file before the next `call_llm` (per-run) or after the file edit (per-file-edit). - -A code-shape sketch using survey grammar: - -``` -hook-result := - -run { command } :: hook-result {ssdl} [B] // boundary: failures surface, never hidden -inject { hook-result, conversation } :: () // append to conversation file -``` - -The `{ssdl}` `[B]` (boundary) marker captures the abstraction: the hook is the boundary where the model's context meets the measured world. The failure of a measurement is data the model can act on, not a control-flow exception. The model sees a failing hook's exit code + stderr, and can adjust its behavior accordingly. - -#### §3.4 The Case-Study Harness Scripts - -The case-study harness scripts are the proof that hooks work as intended. Both scripts implement the same skeleton: log + summary + enforcing gate. The skeleton: - -1. **Log.** Record every step with verbose mode for streaming. The log is appended to a file (e.g., `OPTIMIZATION-LOG.md`) so the user can see the proof's progress in real time. -2. **Summary.** Collect every verdict at the end. Use `set +e` so a failing gate still prints its verdict; the summary is a list of (gate, verdict) pairs. -3. **Enforcing gate.** Collect the verdicts and decide pass/fail. The gate is the last step; it exits non-zero if any verdict is failing. - -The PEP harness (`prove-optimized-harness.sh` for `macton/pep-copt`) has 9 steps and 5 enforcing gates: -- **Identity baseline.** Run the reference implementation on the committed input; record the output (size in bytes, sha256). This is the "what the reference produces" baseline. -- **Median-of-5 speedup.** Run the optimized implementation 5 times; record the median wall-clock time. Median (not mean) because outliers are not the optimization's fault. -- **Decompression-time gate.** The decompression time must not regress (an optimization that makes compression faster but decompression slower is a net loss for users). -- **Generalization.** The optimization must work on a held-out set of images (not just the committed input). This catches "tuned to the test" optimizations. -- **Determinism.** The optimized output must be byte-identical across runs (a non-deterministic optimization is not reproducible). - -The collisions harness (`prove-optimized-harness.sh` for `macton/differentiable-collisions-optc`) has 10 steps and 4 enforcing gates: -- **Comparator with distance tolerance.** The optimized collision detection must agree with the reference to within a distance tolerance (1mm + 0.1% + conditional). Collision-flag identity is too strict (a face/edge contact has many equally-valid witness points). -- **Contact-point certifier.** An independent contact-point certifier (`validate_contacts`) shares no solver code with the optimized implementation. This catches "they agree because they share the bug" failures. -- **Precompute isolation.** The precompute stage (building the spatial acceleration structure) must be excluded from the measured speedup. The build stage cannot precompute the answer; the optimization log explains why. -- **Determinism.** The optimized output must be byte-identical across runs. - -Both harness scripts freeze the committed input via `sha256sum` before the run and re-check after — if the harness itself changes the input (a bug), it aborts. Both exclude precompute time from the measured speedup. - -#### §3.5 The Hook Result Data Shape - -The data shape of a hook result, using survey grammar: - -``` -hook-result := - -fields: - label: string # hook name (e.g., "hook-per-run", "hook-per-file-edit") - exit_code: int # command's return code (0 = success) - path: string? # file being edited (for per-file-edit hooks) - stdout: string # command's stdout (may be empty) - stderr: string # command's stderr (may be empty) - no_output: bool # true if both stdout and stderr are empty - -serialization: - <{label} exit_code="{exit_code}"{ path? " path=\"{path}\"" : ""}> - {stdout} - {stderr? f"stderr: {stderr}" : ""} - {no_output? "(no output)" : ""} - -``` - -The shape is a labeled block with optional fields. The model reads the block as part of the conversation; the block is the "measurement" the model acts on. The failure of a measurement is data: a non-zero exit code + stderr text is actionable information; a silent success is "(no output)" — still measurable, still in the conversation. - -#### §3.6 Per-Commit Detail - -The one commit that built the hooks subsystem: - -1. **`a4fb141` — Add per-turn and per-file-edit hooks.** Adds `bin/nagent:1442-1463` (`run_hook` function), `bin/nagent:1466-1484` (`resolve_hooks` function with CLI > config > disabled precedence), `bin/nagent:1607-1611` (`hook_per_file_edit` fires after ``), `bin/nagent:1618-1625` (`hook_per_file_edit` fires after `` in `--file-edit` mode only), `bin/nagent:1922-1927` (`hook_per_run` fires at top of every turn, before `call_llm`), `bin/nagent:2806-2825` (`--hook-per-run` and `--hook-per-file-edit` CLI flags), `bin/nagent:3167-3185` (wiring into `run_agent_loop`), `config.example.json:6-8` (`hook_per_run` and `hook_per_file_edit` config keys), and `tests/test_nagent.py:870-960` (4 test functions covering the hook contract). - -The commit is a "single-feature" commit: one commit adds the hooks abstraction, the resolve precedence, the invoke function, the inject sites, the CLI flags, the config keys, and the tests. There are no follow-up commits; the abstraction was complete in one commit. This is the same "abstraction-complete-in-one-commit" pattern v2.3 used for the harvest pipeline. - -#### §3.7 Manual Slop Implications - -The Manual Slop equivalents of the hooks are partial. The closest analogs are: -- **Tier 4 QA error interception** (per `docs/guide_ai_client.md`) — when a tool call fails, the AI client intercepts the error, forwards it to a Tier 4 QA sub-agent, and injects a 20-word diagnostic summary into the worker history. This is a per-error hook, not a per-turn hook. -- **The `ApiHookClient` test harness** (per `docs/guide_api_hooks.md`) — the `live_gui` fixture uses the Hook API to drive the application. The hook is the test, not the application. -- **The `_predefined_callbacks` registry** (in `src/app_controller.py:531-617`) — exposes any App method as a `custom_callback` action. This is a hook into the app, not a hook into the conversation. - -The Manual Slop analog lacks three of the four hooks invariants: - -1. **No "per-turn" injection site.** Manual Slop's Tier 4 QA fires on tool-call failure, not at the top of every turn. A Manual Slop hook could be wired into the `run_agent_loop` equivalent (`dispatch_inference` in `src/ai_client.py`) to inject a status block (build status, test status, dependency-check status) at the top of every turn. -2. **No "labeled block" data shape.** Manual Slop's Tier 4 QA injects a 20-word diagnostic summary as plain text, not a labeled block with exit code + stdout + stderr. The model sees a summary, not a measurement. -3. **No "CLI > config > disabled" precedence for hooks.** Manual Slop's hooks are implicit (they fire on error); there is no explicit "configure a command to run at the top of every turn" mechanism. - -The gap Manual Slop could close: a per-turn hook primitive that runs a configured command (CLI > config > disabled) at the top of every `send_result()` and injects a `` block. The command could be a status script (`make test`, `git status`, `npm run check`) that the user configures per-project. The model would see the status block at the top of every turn and respond against measured state. - -The "failure is data, not control flow" principle from `conductor/code_styleguides/error_handling.md` already encodes the "exit code + stderr surfaced" invariant. The per-turn hook is the natural extension: every turn's status is data the model acts on, not an exception that aborts the loop. - -#### §3.8 Honest Gaps - -1. **The "subprocess reach" claim in `bin/nagent:2822-2824` — "A CLI flag applies to this invocation only; set it in the config file to apply it to delegated file-edit subprocesses too" — needs verification.** The implementation at `bin/nagent:3167-3185` wires the hooks into `run_agent_loop`'s `main()` call only; whether delegated file-edit subprocesses read the config separately is not visible in this diff. The v3.1 source-read pass should verify the subprocess reach. -2. **The "default off" guarantee is not tested.** Both hooks default to off (CLI flag absent, config key absent or empty string). A regression test asserting "no CLI flag, no config key → both hooks are None" would harden the contract. -3. **The `--hook-per-run` cost discipline ("point it at a fast status command") is documented in `--help` but not enforced.** The case-study harnesses use median-of-5 timing in their proofs, which is fast, but a user wiring up a 10-second status command would pay 10 seconds per turn. A future track could add a `--hook-per-run-max-seconds` config knob. -4. **The interaction with the conversation safety net (§2) is not deep-dived.** The safety net's rebuild creates a new initial context, which would include the per-run hook block. The v3 cluster does not document how the safety net coordinates with the hook injection — does the rebuild preserve the per-run hook block? does the next checkpoint know about the hook state? -5. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver has its own 6 phases. A long-running campaign can have per-turn hooks that fire on every dispatched worker. The v3 cluster does not document how the campaigns driver coordinates with the hook injection — does the dispatched worker get the per-run hook block? does the campaign-level conversation have its own hook configuration? -6. **The case-study harness scripts are not fully transcribed.** The v3 cluster cites the 9-step / 10-step structure and the 5 / 4 enforcing gates, but does not transcribe the full shell scripts. A v4 would transcribe both `prove-optimized-harness.sh` scripts in full and analyze their common skeleton + per-repo differences. -7. **The hook result's serialization format is not specified for the model.** The `` block is the implementation's serialization, but the model sees it as part of the conversation. The v3 cluster does not document how the model is expected to parse the block (does it treat the block as a system message? a user message? a tool result?). A v4 would document the model's expected parsing of the hook block. - -#### §3.9 Code-Shape Sketch - -The hooks abstraction, in survey-grammar SSDL notation, with shape tags: - -``` -hook-result := { label: string, # [S] string - exit_code: int, # [I] inspectable - path: string?, # [S] optional - stdout: string, # [S] string - stderr: string, # [S] string - no_output: bool } # [I] inspectable - -serialization: - <{label} exit_code="{exit_code}"{ path? f' path="{path}"' : ''}> - {stdout} - {stderr? f"stderr: {stderr}" : ''} - {no_output? "(no output)" : ''} - - -resolve { cli_per_run, cli_per_file_edit, config_path } { - config = load_json(config_path) if config_path else {} # [B] boundary to file - per_run = cli_per_run or config.get("hook_per_run") or None - per_file_edit = cli_per_file_edit or config.get("hook_per_file_edit") or None - if per_run == "": per_run = None # empty = disabled - if per_file_edit == "": per_file_edit = None - return (per_run, per_file_edit) -} - -run { command, label, path? } :: hook-result {ssdl} [B] # boundary: failures surface - if command is None: return None - result = subprocess.run(command, shell=True, capture_output=True, text=True) - return hook-result { - label: label, - exit_code: result.returncode, - path: path, - stdout: result.stdout, - stderr: result.stderr, - no_output: result.stdout == "" and result.stderr == "" - } - -inject { hook-result, conversation } :: () {ssdl} [B] # boundary: model sees the block - block = serialize(hook-result) - append conversation with block - -invoke-points := { - per_run: at top of every turn, before call_llm # [B] boundary to LLM - per_file_edit: after # [B] boundary to file edit - per_file_edit: after in --file-edit mode only -} -``` - -The shape tag map: `[I]` for inspectable exit codes and flags, `[S]` for string content (stdout, stderr, label), `[B]` for boundaries (file I/O, subprocess invocation, LLM call, conversation append). The hook is a `[B]` boundary abstraction: the model's context meets the measured world at the hook, and the failure of a measurement is data the model acts on. - +**Pattern(s) vs v2.3:** NEW. v2.3 had the conversation-without-ground-truth loop (the model's word was the only word). v3 introduces the per-turn measurement primitive that breaks the loop's dependence on the model's self-reporting. EXTENDS v2.3 Pattern 5 ("the loop") with a measurement injection surface. The case-study methodology cluster (§9) elaborates this into a reusable 5-element pattern. +**Manual Slop implications:** Manual Slop has analogous hooks already — Tier 4 QA error interception (per `docs/guide_ai_client.md`) and the `ApiHookClient` test harness (per `docs/guide_api_hooks.md`). The generalization is per-turn, not per-error: a Manual Slop hook could be wired into the `run_agent_loop` equivalent (`dispatch_inference`) to inject a status block (build status, test status, dependency-check status) at the top of every turn. The "failure is data, not control flow" principle from `conductor/code_styleguides/error_handling.md` already encodes the "exit code + stderr surfaced" invariant. +**Decision candidate:** NEW Candidate 19 (MEDIUM). "Per-turn ground-truth hook for Manual Slop": add a per-turn hook primitive that runs a configured command (CLI > config > disabled) at the top of every `send_result()` and injects a `` block; honor the CLI > config > disabled precedence and the failing/quiet-hook-surfaces-output invariant. See `decisions.md` Candidate 19. +**Cross-refs:** §9 Case-study methodology (the 5-element pattern; hooks are the substrate), §10 PEP case study (the pep-copt harness), §11 Collisions case study (the collisions harness). These three together surface the full abstraction. **Source-read citations:** - `bin/nagent:1442-1463` — `run_hook(command, label, path=None)` (a4fb141) - `bin/nagent:1466-1484` — `resolve_hooks(cli_per_run, cli_per_file_edit, config_path)` with CLI > config > disabled precedence (a4fb141) - `bin/nagent:1607-1611` — `hook_per_file_edit` fires after `` (a4fb141) -- `bin/nagent:1618-1625` — `hook_per_file_edit` fires after `` in `--file-edit` mode only (a4fb141) +- `bin/nagent:1618-1625` — `hook_per_file_edit` fires after `` in `--file-edit` mode only (scratch writes are not file edits) (a4fb141) - `bin/nagent:1922-1927` — `hook_per_run` fires at top of every turn, before `call_llm` (a4fb141) - `bin/nagent:2806-2825` — `--hook-per-run` and `--hook-per-file-edit` CLI flags (a4fb141) - `bin/nagent:3167-3185` — wiring into `run_agent_loop` (a4fb141) -- `bin/nagent:2822-2824` — "subprocess reach" claim (a4fb141) - `config.example.json:6-8` — `hook_per_run` and `hook_per_file_edit` config keys (a4fb141) - `tests/test_nagent.py:870-883` — `test_run_hook_block_reports_output_and_exit_code` (a4fb141) - `tests/test_nagent.py:885-915` — `test_hook_per_run_runs_before_every_turn` (a4fb141) @@ -670,153 +144,75 @@ The shape tag map: `[I]` for inspectable exit codes and flags, `[S]` for string - `tests/test_nagent.py:944-960` — `test_resolve_hooks_cli_overrides_config` (a4fb141) - `prove-optimized-harness.sh` (pep-copt) — 9-step proof + 5 enforcing gates (identity baseline, median-of-5 speedup, decompression-time gate, generalization, determinism) - `prove-optimized-harness.sh` (differentiable-collisions-optc) — 10-step proof + 4 enforcing gates (comparator with distance tolerance, contact-point certifier, precompute isolation, determinism) -- `bin/nagent:775` — `best-of-N` initial-context directive (38d3d4f; relevant for the gap note on hook-per-run cost discipline) -- `bin/nagent:970-987` — `conversation_cache_boundaries` (v2.3; relevant for the gap note on safety net coordination) -- `config.example.json:1-15` — full hooks + safety-net config block (a4fb141 + 38d3d4f) -- `README.md:700-750` — hooks teaching in Part VI (a4fb141) -- `README.md:750-800` — case-study methodology teaching (the hooks + harness pattern) (a4fb141) -- `issues/0005-hooks.md` — hooks spec (if it exists; the v3 cluster does not cite a specific issue file for hooks) -- `bin/helpers/nagent_campaign_lib.py` — campaigns driver (relevant for the gap note on campaigns coordination) -- `bin/nagent:1922-1927` — `hook_per_run` injection site (a4fb141; the exact lines) -- `bin/nagent:1607-1625` — `hook_per_file_edit` injection sites (a4fb141; the exact lines) -- `bin/nagent:1442-1484` — `run_hook` + `resolve_hooks` (a4fb141; the exact lines) -- `prompts/` directory — no hooks-specific prompt; the hook block is raw subprocess output, not an LLM-generated message -- `tests/test_nagent.py:1-50` — test file header + imports (a4fb141) -- `bin/nagent:3167-3185` — `run_agent_loop` wiring (a4fb141; the exact lines) +**Honest gaps in this cluster:** +- The "subprocess reach" claim in `bin/nagent:2822-2824` — "A CLI flag applies to this invocation only; set it in the config file to apply it to delegated file-edit subprocesses too" — needs verification. The implementation at `bin/nagent:3167-3185` wires the hooks into `run_agent_loop`'s `main()` call only; whether delegated file-edit subprocesses read the config separately is not visible in this diff. The v3.1 source-read pass should verify the subprocess reach. +- The "default off" guarantee is not tested. Both hooks default to off (CLI flag absent, config key absent or empty string). A regression test asserting "no CLI flag, no config key → both hooks are None" would harden the contract. +- The `--hook-per-run` cost discipline ("point it at a fast status command") is documented in `--help` but not enforced. The case-study harnesses use median-of-5 timing in their proofs, which is fast, but a user wiring up a 10-second status command would pay 10 seconds per turn. A future track could add a `--hook-per-run-max-seconds` config knob. + +**Pattern deep-dive.** The hooks abstraction is a three-piece composition: **resolve**, **invoke**, **inject**. `resolve_hooks` enforces the CLI > config > disabled precedence (the CLI is the experiment's override; the config is the project's default; empty means off). `run_hook` invokes the command, captures exit code + stdout + stderr, and surfaces "(no output)" when silent. The injection sites are the conversation: per-run at the top of every turn before `call_llm`; per-file-edit after `` or `` in `--file-edit` mode (not scratch writes — the comment at `bin/nagent:1618-1620` notes the distinction explicitly: "A `` only edits a real file in per-file-edit mode ... in main mode it writes scratch, which is not a file edit worth a verify hook"). + +The case-study harness scripts are the proof that hooks work as intended. Both scripts implement the same skeleton: log + summary + enforcing gate. The log records every step with verbose mode for streaming; the summary collects every verdict at the end (`set +e` so a failing gate still prints); the enforcing gate collects the verdicts and decides pass/fail. Both harness scripts freeze the committed input via `sha256sum` before the run and re-check after — if the harness itself changes the input (a bug), it aborts. Both exclude precompute time from the measured speedup (the build stage cannot precompute the answer; the optimization log explains why). The PEP harness uses pixel-identity + lossless round-trip + size-correctness (the optimized `.pep` must not be larger than the reference `.pep` — speed may not be bought with a bigger file). The collisions harness uses a distance tolerance contract (1mm + 0.1% + conditional) because collision-flag identity is too strict (a face/edge contact has many equally-valid witness points) and an independent contact-point certifier (`validate_contacts`) shares no solver code. + +The data shape of the hook output, using survey grammar: + +``` +hook-result := + +run { command } :: hook-result {ssdl} [B] // boundary: LLM-failures + // surface, never hidden +inject { hook-result, conversation } :: () // append to conversation file + +resolve { cli, config } :: (per_run, per_file_edit) + // precedence: CLI > config > disabled + // empty string in config means disabled +``` + +The `{ssdl}` `[B]` (boundary) marker notes the abstraction: the hook is the boundary where the model's context meets the measured world; the failure of a measurement is data the model can act on, not a control-flow exception. The injection is append-only — the conversation grows by a labeled block, and the next turn sees it as part of the working state. + +The case-study methodology cluster (§9) abstracts the harness pattern itself: the hooks + the proof + the optimization log + the committed-input sha256 freeze + the model-as-test-subject framing form a reusable unit that any project adopting nagent can replicate. -**Decision candidate:** NEW Candidate 19 (MEDIUM). "Per-turn ground-truth hook for Manual Slop": add a per-turn hook primitive that runs a configured command (CLI > config > disabled) at the top of every `send_result()` and injects a `` block; honor the CLI > config > disabled precedence and the failing/quiet-hook-surfaces-output invariant. See `decisions.md` Candidate 19. -**Cross-refs:** §9 Case-study methodology (the 5-element pattern; hooks are the substrate), §10 PEP case study (the pep-copt harness), §11 Collisions case study (the collisions harness). These three together surface the full abstraction. §13 Agent context-window observations (the v3.1 new section on warm-up + window + safe-zone numbers; the per-turn hook is the per-turn ground-truth mechanism that the safe-zone needs). -**Pattern history:** NEW in v3. v2.3 had the conversation-without-ground-truth loop. v3 introduces the per-turn measurement primitive. The case-study methodology cluster (§9) elaborates this into a reusable 5-element pattern. ## §4 Project-local roots **Source:** nagent `54c8741`, `557dd39`, `0b9d1a2`, `023e23a` (`bin/helpers/nagent_cli.py:11-86` + `:109-141`, `bin/helpers/nagent_llm.py:55-72`, `bin/nagent:640-748` + `:2075-2295`, `.gitignore`, `README.md:344-372` + `:400-410` + `:812-832` + `:841-849`, `prompts/create-readme.md`, `issues/0001-foundations.md`). **One-liner:** The default root moves into the project. Conversations, knowledge, per-file memory, and graduated tools now live at `{git-toplevel}/.nagent/` and can be committed and shared. Inputs resolve through four layers (install → user → project → root) with once-per-directory dedup; most specific layer shadows. -**Pattern summary:** Project-local roots is a 4-piece composition: resolve, scaffold, deduplicate, shadow. `resolve_default_root()` implements the precedence (`--root` > git-toplevel > `~/.nagent`); `ensure_root_scaffold()` creates the root on first use with a minimal `.gitignore` (`splits/` only — every other artifact is the user's commit call); the dedup loop includes a layer at most once even when directories overlap; the shadow semantics encode "most specific layer wins" with later iterations overwriting earlier in a dict. The rename `nagent-gc` → `nagent-distill` is the most subtle change — it shifts the mental model from "garbage collection" (discard) to "distill" (refine), which naturally accommodates the merge/graduate passes from §1 Campaigns. The "project memory is team memory" payoff is the new argument the rename enables: a project's accumulated knowledge can be committed, reviewed, and arrived with via `git clone`. This extends "conversations are editable state" (v2.3 Pattern 3) with project-scoped conversations, extends "repo history as data" (v2.3 Pattern 7) with `.nagent/` contents reviewable in the same pull request as the code, and introduces a new 4-layer resolution pattern (install/user/project/root) with most-specific-shadowing for prompts, tools, and config. +**Pattern(s) vs v2.3:** EXTENDS v2.3 Pattern 3 ("conversations are editable state") — conversations are now project-scoped by default, not user-scoped. EXTENDS v2.3 Pattern 7 ("repo history as data") — `.nagent/` contents are reviewable in the same pull request as the code they describe. NEW pattern: 4-layer resolution (install/user/project/root) with most-specific-shadowing for prompts, tools, and config. The rename `nagent-gc` → `nagent-distill` is not a typo; it codifies the operation's true semantic ("knowledge becomes capability, gated by review", per `prompts/create-readme.md:249`). +**Manual Slop implications:** Manual Slop already follows this pattern in spirit — `conductor/tracks/` is project-scoped (not `~/.manual_slop/tracks/`); `[conductor].dir` in `manual_slop.toml` allows per-project overrides (per `docs/guide_paths.md`). The .gitignore discipline ("only regenerable artifacts; everything else is the user's call to commit") is a model Manual Slop should adopt: `tests/artifacts/` is gitignored (regenerable); `conductor/tracks/` is committed (the user's review call). The dedup-when-running-from-inside-its-own-checkout invariant (`bin/nagent:657-668`) maps to Manual Slop's load path when running the dev build. +**Decision candidate:** NEW Candidate 20 (LOW). "Rename `nagent-gc` → `nagent-distill` in our documentation cross-references" — this is a documentation-only follow-up; no code change. The mental-model shift ("gc" → "distill") is worth surfacing in the project's `conductor/code_styleguides/knowledge_artifacts.md` styleguide. See `decisions.md` Candidate 20. +**Cross-refs:** none direct. §1 Campaigns (`campaigns/` lives inside the project-local root); §2 Conversation safety net (checkpoints inherit the same scoping); §3 Hooks (hooks are configured per-invocation, not per-root). +**Source-read citations:** +- `bin/helpers/nagent_cli.py:11-13` — `INSTALL_DIR` constant (54c8741) +- `bin/helpers/nagent_cli.py:15-44` — `user_root()`, `git_toplevel()`, `resolve_default_root()` (54c8741) +- `bin/helpers/nagent_cli.py:47-54` — `ensure_root_scaffold()` — creates root on first use + writes `.gitignore` for `splits/` only (54c8741) +- `bin/helpers/nagent_cli.py:57-69` — `resolve_prompt_path()` — 3-layer resolution (project root → user → install) (54c8741) +- `bin/helpers/nagent_cli.py:72-86` — `tool_search_dirs()` — 3-layer resolution with basename shadowing (54c8741) +- `bin/helpers/nagent_cli.py:109-141` — `collect_bin_tool_descriptions()` updated to accept multiple bin dirs (54c8741) +- `bin/helpers/nagent_llm.py:55-72` — `default_config_path()` — CLI → `NAGENT_CONFIG` → project `.nagent/config.json` → `~/.nagent/config.json` (54c8741) +- `bin/nagent:640-748` — `build_initial_context()` — 4-layer context resolution with once-per-directory dedup (54c8741) +- `bin/nagent:2220` — `root = resolve_default_root(args.root)` (54c8741) +- `bin/nagent:2227` — `ensure_root_scaffold(root)` for `--file-edit` (resolving a file-edit writes the index) (54c8741) +- `bin/nagent:2292-2295` — `ensure_root_scaffold(root)` for every path past root-write boundary (54c8741) +- `README.md:344-372` — 4-layer context teaching (557dd39) +- `README.md:400-410` — "Project memory is team memory" reduction (557dd39) +- `README.md:812-832` — file tree rename (54c8741) +- `README.md:841-849` — root + config resolution (557dd39) +- `prompts/create-readme.md` — Part III + Part IV rewrites (557dd39) +- `prompts/create-readme.md:249-251` — new reduction: "Proven playbooks stay prose... graduate them into self-describing tools" (from c1d2cad, surfaced in the project-local-roots teaching because `.nagent/bin/` is where graduated tools land) +- `.gitignore:3-4` — `t?` + `p?` (scratch file patterns) (0b9d1a2) +- `.gitignore:5` — `.nagent/` (nagent's own runtime state is per-machine, not source) (023e23a) +**Honest gaps in this cluster:** +- The `t?` and `p?` patterns at `.gitignore:3-4` (from `0b9d1a2`) are unexplained in the commit message. They are likely scratch files written by nagent (e.g., a temp conversation file `t12345`). A follow-up source-read should identify the producer; without that, the gitignore entry is load-bearing but opaque. +- The "once-per-directory dedup" at `bin/nagent:657-668` uses `Path.resolve()`. If the root is on a symlink or a network mount, resolve may behave unexpectedly across platforms. The dedup invariant is correct for the common case; edge cases are unverified. +- The "project-local" win only pays off when the user commits `.nagent/`. The README at `README.md:400-410` acknowledges this caveat ("conversations contain tool output — review before committing, like any other file") but does not enforce it. A hook or pre-commit guard could surface uncommitted conversations, but that is out of scope for the cluster. -#### §4.1 What Project-Local Roots Adds +**Pattern deep-dive.** Project-local roots is a 4-piece composition: **resolve**, **scaffold**, **deduplicate**, **shadow**. `resolve_default_root()` implements the precedence (`--root` > git-toplevel > `~/.nagent`); `ensure_root_scaffold()` creates the root on first use with a minimal `.gitignore` (`splits/` only — every other artifact is the user's commit call); the dedup loop at `bin/nagent:657-668` includes a layer at most once even when directories overlap (running nagent from inside its own checkout, or root being `~/.nagent` outside a repo); the shadow semantics (`tool_search_dirs`, `resolve_prompt_path`, `default_config_path`) encode "most specific layer wins" with later iterations overwriting earlier in a dict. -Project-local roots move the default storage location from `~/.nagent/` (user-scoped) to `{git-toplevel}/.nagent/` (project-scoped). The change is structural: conversations, knowledge files, per-file memory, and graduated tools now live inside the project's repository, can be committed alongside the code they describe, and can be shared via `git clone`. +The rename `nagent-gc` → `nagent-distill` is the most subtle change in this cluster. The old name borrowed from "garbage collection" — the operation was framed as freeing space. The new name borrows from "distill" — the operation is framed as refining raw working state into reusable knowledge. The merge/graduate passes (from §1 Campaigns cluster, shipped in `f3ec090`) are an explicit consequence: a "gc" mental model would not naturally include a `--graduate` step (gc discards, distill refines). The README at `prompts/create-readme.md:249-251` makes the new reduction explicit: "Proven playbooks stay prose that must be re-read and re-trusted every time. Therefore: graduate them into self-describing tools and prompts — knowledge becomes capability, gated by review." -The four pieces of the project-local-roots abstraction: - -1. **Resolve** — `resolve_default_root()` implements the precedence: `--root` CLI argument > git-toplevel (if inside a repo) > `~/.nagent`. The resolve function is pure: it returns a single path. The CLI argument is the experiment's override (one-shot, the user's immediate need); the git-toplevel is the project's default (persistent, the project's convention); the user-root is the fallback (no repo, no project). -2. **Scaffold** — `ensure_root_scaffold()` creates the root on first use with a minimal `.gitignore` (`splits/` only — every other artifact is the user's commit call). The scaffold is idempotent: if the root already exists, the function does nothing. If the root needs to be created, the function creates the directory tree + the `.gitignore` + a minimal `index.md`. -3. **Deduplicate** — the dedup loop at `bin/nagent:657-668` includes a layer at most once even when directories overlap. The dedup is needed for the case where nagent is run from inside its own checkout (the install dir is also a project dir) or where the root is `~/.nagent` outside a repo (the user dir is also the project dir). The dedup uses `Path.resolve()` to canonicalize the paths before comparison. -4. **Shadow** — the shadow semantics (`tool_search_dirs`, `resolve_prompt_path`, `default_config_path`) encode "most specific layer wins" with later iterations overwriting earlier in a dict. The shadow is needed for the case where a project wants to override a tool or prompt from the install or user layer. The shadow is "by name" (the basename of the tool/prompt file), not "by path" (the full path). - -The 4-layer context resolution (`bin/nagent:640-748` — `build_initial_context`) extends the same shadow semantics to the initial context assembly. The four layers are: install (the nagent package itself), user (`~/.nagent/`), project (`{git-toplevel}/.nagent/`), root (the resolved root). Each layer contributes context; later layers override earlier layers for files with the same name. The once-per-directory dedup prevents the same context from being included twice when directories overlap. - -#### §4.2 The Resolve Precedence - -The resolve precedence is the contract between the CLI, the user, and the project. The CLI is the experiment's override: a user running `nagent --root=/tmp/sandbox` is overriding the default root for this invocation only. The git-toplevel is the project's default: if nagent is run from inside a git repo, the root is `{git-toplevel}/.nagent/`. The user-root is the fallback: if nagent is run from outside a git repo, the root is `~/.nagent/`. - -The implementation is at `bin/helpers/nagent_cli.py:11-44`: - -``` -resolve_default_root(root_arg, cwd) { - if root_arg: return expand_path(root_arg) - toplevel = git_toplevel(cwd) - if toplevel: return toplevel / ".nagent" - return ~/.nagent -} -``` - -The `git_toplevel()` function is a subprocess invocation of `git rev-parse --show-toplevel`. If the command fails (not in a repo, git not installed), the function returns None. The fallback to `~/.nagent` is the "no project" case — the user is running nagent standalone, not as part of a project. - -#### §4.3 The Scaffold and Gitignore Discipline - -The scaffold function is at `bin/helpers/nagent_cli.py:47-54`: - -``` -ensure_root_scaffold(root) { - if root.exists(): return - root.mkdir(parents=True) - gitignore = root / ".gitignore" - gitignore.write_text("splits/\n") # only regenerable artifacts - # create the rest of the directory tree as needed -} -``` - -The `.gitignore` discipline is the load-bearing detail: the scaffold writes `splits/` (the only regenerable artifact) into `.gitignore`; every other artifact is the user's commit call. The `splits/` directory holds the temporary file splits from `nagent-file-split`; it can be regenerated by re-running the split. Everything else (conversations, knowledge, per-file memory, graduated tools) is content the user has invested in; it should be committed and shared, not gitignored. - -The Manual Slop analog is `tests/artifacts/` — gitignored because it contains regenerable test outputs (logs, mock outputs, temporary workspaces). The Manual Slop equivalent of "the user commits the rest" is `conductor/tracks/` — committed because it contains the user's reviewable planning artifacts (spec.md, plan.md, state.toml, metadata.json). The .gitignore discipline is the same: only regenerable artifacts are gitignored; everything else is the user's commit call. - -#### §4.4 The Dedup Invariant - -The dedup invariant is needed for the case where nagent is run from inside its own checkout (the install dir is also a project dir) or where the root is `~/.nagent` outside a repo (the user dir is also the project dir). The dedup loop at `bin/nagent:657-668`: - -``` -seen := set() -for dir in [install, user, project, root] { - resolved = Path(dir).resolve() - if resolved in seen: continue - seen.add(resolved) - ctx = load_root_context(dir) - if ctx: push ctx -} -``` - -The `Path.resolve()` call canonicalizes the path (resolves symlinks, normalizes case on Windows, etc.) before comparison. The dedup is by resolved path, not by string — so `~/nagent` and `/home/user/nagent` are the same layer even if the string representations differ. - -The dedup invariant is correct for the common case. Edge cases (symlinks, network mounts, case-insensitive filesystems on Windows/macOS) are unverified. The `Path.resolve()` behavior varies by platform; a symlink on Linux may resolve to a different path than the same symlink on Windows. The dedup is a "good enough" invariant; the edge cases are documented as honest gaps. - -#### §4.5 The Shadow Semantics - -The shadow semantics encode "most specific layer wins" for tools, prompts, and config. The three shadow functions are: - -1. **`resolve_prompt_path(root, name)`** — at `bin/helpers/nagent_cli.py:57-69`. Returns the first existing path in the order: `{root}/prompts/{name}` → `~/.nagent/prompts/{name}` → `{INSTALL}/prompts/{name}`. The most specific layer (project) wins; the least specific layer (install) is the fallback. -2. **`tool_search_dirs(root)`** — at `bin/helpers/nagent_cli.py:72-86`. Returns a list of tool directories in the order: `{INSTALL}/bin` → `~/.nagent/bin` → `{root}/bin`. The order matters for the "basename shadowing" — when two directories have a tool with the same name, the later directory's tool wins. -3. **`default_config_path()`** — at `bin/helpers/nagent_llm.py:55-72`. Returns the first existing path in the order: `NAGENT_CONFIG` env var → `{root}/config.json` → `~/.nagent/config.json` → `{INSTALL}/config.example.json`. The env var is the experiment's override; the project's config is the default; the user's config is the fallback; the install's example is the last-resort default. - -The shadow is "by name" (the basename of the file), not "by path". A project can override a tool by creating a file with the same name in `{root}/bin/`; the project does not need to know the install's full path. This is the same pattern as Unix's `$PATH` resolution: directories earlier in the path shadow directories later in the path for executables with the same name. - -#### §4.6 The Rename: nagent-gc → nagent-distill - -The rename `nagent-gc` → `nagent-distill` is the most subtle change in this cluster. The old name borrowed from "garbage collection" — the operation was framed as freeing space. The new name borrows from "distill" — the operation is framed as refining raw working state into reusable knowledge. - -The merge/graduate passes (from §1 Campaigns cluster, shipped in `f3ec090`) are an explicit consequence: a "gc" mental model would not naturally include a `--graduate` step (gc discards, distill refines). The README at `prompts/create-readme.md:249-251` makes the new reduction explicit: "Proven playbooks stay prose that must be re-read and re-trusted every time. Therefore: graduate them into self-describing tools and prompts — knowledge becomes capability, gated by review." - -The rename is a mental-model shift, not a code refactor. The code change is trivial (`grep -l nagent-gc | xargs sed -i s/nagent-gc/nagent-distill/`); the user-facing change is the documentation. The `nagent_takeaways_20260608.md` and the project's `conductor/code_styleguides/knowledge_artifacts.md` styleguide should both surface the rename: "gc" implies discard, "distill" implies refine. The semantic difference is load-bearing for the merge/graduate design. - -#### §4.7 Per-Commit Detail - -The four commits that built the project-local-roots subsystem: - -1. **`54c8741` — Move the default root into the project.** Adds `bin/helpers/nagent_cli.py:11-86` (the `INSTALL_DIR` constant, `user_root()`, `git_toplevel()`, `resolve_default_root()`, `ensure_root_scaffold()`, `resolve_prompt_path()`, `tool_search_dirs()` functions), `bin/helpers/nagent_cli.py:109-141` (the `collect_bin_tool_descriptions()` update to accept multiple bin dirs), `bin/helpers/nagent_llm.py:55-72` (the `default_config_path()` function with CLI → `NAGENT_CONFIG` → project → user precedence), `bin/nagent:640-748` (the 4-layer `build_initial_context()` with once-per-directory dedup), `bin/nagent:2220` + `:2227` + `:2292-2295` (the `ensure_root_scaffold(root)` wiring into `run_agent_loop`), and `README.md:812-832` (the file tree rename). This is the "structural" commit — it adds the resolve, scaffold, dedup, and shadow functions and wires them into the main loop. -2. **`557dd39` — Add the 4-layer context teaching and "project memory is team memory" reduction.** Adds `README.md:344-372` (Part IV 4-layer context teaching), `README.md:400-410` (the "project memory is team memory" reduction), `README.md:841-849` (the root + config resolution teaching), and `prompts/create-readme.md` Part III + Part IV rewrites. This is the "documentation" commit — it explains the structural change to the user, surfaces the new payoff (project memory is team memory), and rewrites the create-readme prompt to teach the new model. -3. **`0b9d1a2` — Add the scratch file patterns to .gitignore.** Adds `.gitignore:3-4` — `t?` + `p?` (scratch file patterns). The patterns are likely scratch files written by nagent (e.g., a temp conversation file `t12345`). The commit message does not explain the patterns; the v3 cluster notes this as an honest gap. -4. **`023e23a` — Add .nagent/ to .gitignore.** Adds `.gitignore:5` — `.nagent/` (nagent's own runtime state is per-machine, not source). This is a surprising commit: the cluster's whole point is that `.nagent/` should be committed and shared. The commit's `.gitignore` entry contradicts the cluster's thesis. The v3 cluster notes this contradiction as a "to investigate" gap; the most likely explanation is that the entry is for nagent's own development (running nagent from inside its own checkout should not commit nagent's runtime state to nagent's own repo). - -The four commits together implement the project-local-roots abstraction: resolve, scaffold, dedup, shadow. The rename `nagent-gc` → `nagent-distill` lands in the same window (`557dd39` updates the create-readme prompt to surface the new reduction). - -#### §4.8 Manual Slop Implications - -The Manual Slop equivalents of the project-local-roots pattern are partial. The closest analog is `src/paths.py` (the centralized path resolution module) + the per-project `[conductor].dir` override in `manual_slop.toml`. The path resolution is similar: default → env var → config file → fallback. The per-project override allows each project to have its own conductor directory. - -The Manual Slop analog already follows the pattern in spirit: -- **`conductor/tracks/` is project-scoped** (not `~/.manual_slop/tracks/`). The path resolution in `src/paths.py` defaults to `./conductor` relative to each project's TOML file. The `[conductor].dir` override in `manual_slop.toml` allows per-project overrides. -- **`tests/artifacts/` is gitignored** (regenerable). The `pyproject.toml` has `addopts = "--basetemp=tests/artifacts/_pytest_tmp"` (per the 2026-06-19 `test_sandbox_hardening_20260619` track). The gitignore discipline is the same: only regenerable artifacts are gitignored; everything else is the user's commit call. -- **`conductor/tracks/` is committed** (the user's review call). The `state.toml`, `metadata.json`, `spec.md`, `plan.md` files are all committed and reviewable. The git history is the audit trail. -- **Path Resolution Metadata** (per `src/paths.py`) exposes the source of each resolved path (default, env, config) for high-fidelity GUI display. The user can see at a glance "this path was set by the env var" vs "this path was set by the config file". - -The gap Manual Slop could close: -1. **No "project memory is team memory" framing.** Manual Slop's `conductor/tracks/` is committed, but the user's mental model is not always "this is team memory". A styleguide update could surface the framing: "conductor/tracks/ is the project's planning memory; commit it, review it, share it via git clone". -2. **No "rename" mental-model shift.** Manual Slop does not have a `gc` → `distill` analog. The closest is the project's "knowledge artifacts" styleguide (`conductor/code_styleguides/knowledge_artifacts.md`), which already uses the "distill" framing. The gap is minor; the styleguide is already aligned. -3. **No "4-layer context resolution with dedup".** Manual Slop's path resolution is single-layer (default → env → config → fallback), not 4-layer with dedup. A future track could extend `src/paths.py` to support a 4-layer resolution (install → user → project → system) for the agent-facing files (system prompts, tool descriptions, context presets). - -#### §4.9 Honest Gaps - -1. **The `t?` and `p?` patterns at `.gitignore:3-4` (from `0b9d1a2`) are unexplained in the commit message.** They are likely scratch files written by nagent (e.g., a temp conversation file `t12345`). A follow-up source-read should identify the producer; without that, the gitignore entry is load-bearing but opaque. -2. **The "once-per-directory dedup" at `bin/nagent:657-668` uses `Path.resolve()`.** If the root is on a symlink or a network mount, resolve may behave unexpectedly across platforms. The dedup invariant is correct for the common case; edge cases are unverified. -3. **The "project-local" win only pays off when the user commits `.nagent/`.** The README at `README.md:400-410` acknowledges this caveat ("conversations contain tool output — review before committing, like any other file") but does not enforce it. A hook or pre-commit guard could surface uncommitted conversations, but that is out of scope for the cluster. -4. **The `.gitignore:5` entry for `.nagent/` contradicts the cluster's thesis.** The cluster's whole point is that `.nagent/` should be committed and shared. The gitignore entry is likely for nagent's own development (running nagent from inside its own checkout should not commit nagent's runtime state to nagent's own repo). The contradiction is unresolved in the v3 source-read. -5. **The 4-layer context resolution is not exhaustively tested.** The test file `tests/test_nagent.py` covers the resolve precedence but does not test the dedup invariant exhaustively (symlinks, network mounts, case-insensitive filesystems). A v4 would add a test suite for the dedup edge cases. -6. **The `default_config_path()` precedence (CLI → `NAGENT_CONFIG` → project → user) is not deep-dived.** The cluster notes the function exists but does not analyze the precedence's failure modes (what happens when the env var is set to a non-existent path? does the function fall through to the project config, or fail?). -7. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver creates per-campaign directories inside `.nagent/campaigns/`. The project-local-roots abstraction should guarantee that the campaign directories are project-scoped, not user-scoped. The v3 cluster does not document this guarantee. - -#### §4.10 Code-Shape Sketch - -The project-local-roots abstraction, in survey-grammar SSDL notation, with shape tags: +A code-shape sketch using survey grammar: ``` resolve-root { root_arg, cwd } :: path {ssdl} [S] @@ -824,412 +220,117 @@ resolve-root { root_arg, cwd } :: path {ssdl} [S] elif git_toplevel(cwd) is not nil -> git_toplevel(cwd) / ".nagent" else -> ~/.nagent -ensure-scaffold { root } :: () {ssdl} [B] # boundary: filesystem write - if root.exists(): return - root.mkdir(parents=True) - gitignore = root / ".gitignore" - gitignore.write_text("splits/\n") # only regenerable artifacts - -resolve-prompt { root, name } :: path {ssdl} [S] +resolve-prompt { root, name } :: path for layer in [root.prompts, ~/.nagent/prompts, INSTALL.prompts] { if layer/name is file -> return layer/name } -resolve-tools { root } :: [path] {ssdl} [B] # boundary: filesystem read +resolve-tools { root } :: [path] by_name := {} for dir in [INSTALL/bin, ~/.nagent/bin, root/bin] { for path in dir if is_file { - by_name[path.name] := path # later iterations shadow earlier + by_name[path.name] := path } } return sorted(by_name.values()) -default-config { cli_arg, env_var } :: path {ssdl} [S] - if cli_arg: return cli_arg - if env_var: return env_var - for layer in [root/config.json, ~/.nagent/config.json, INSTALL/config.example.json] { - if layer is file -> return layer - } - context-layers { install, user, project, root } :: [string] {ssdl} [S] seen := {} for dir in [install, user, project, root] { - if resolve(dir) in seen -> continue # dedup + if resolve(dir) in seen -> continue seen += resolve(dir) ctx := load_root_context(dir) if ctx -> push ctx } ``` -The shape tag map: `[S]` for string concatenations and path resolutions (the model's understanding is the resolved path), `[B]` for boundaries (filesystem read/write, subprocess invocation). The root resolution is a single deterministic string concatenation; the context-layer resolution is also a deterministic string assembly with dedup. The non-determinism is bounded to LLM-driven passes (harvest, checkpoint, graduate); the file-resolution paths are pure code. +The `{ssdl}` markers note the composition: root resolution is a single deterministic string concatenation; context-layer resolution is also a deterministic string assembly with dedup. The non-determinism is bounded to LLM-driven passes (harvest, checkpoint, graduate); the file-resolution paths are pure code. -**Source-read citations:** -- `bin/helpers/nagent_cli.py:11-13` — `INSTALL_DIR` constant (54c8741) -- `bin/helpers/nagent_cli.py:15-44` — `user_root()`, `git_toplevel()`, `resolve_default_root()` (54c8741) -- `bin/helpers/nagent_cli.py:47-54` — `ensure_root_scaffold()` (54c8741) -- `bin/helpers/nagent_cli.py:57-69` — `resolve_prompt_path()` (54c8741) -- `bin/helpers/nagent_cli.py:72-86` — `tool_search_dirs()` (54c8741) -- `bin/helpers/nagent_cli.py:109-141` — `collect_bin_tool_descriptions()` updated (54c8741) -- `bin/helpers/nagent_llm.py:55-72` — `default_config_path()` (54c8741) -- `bin/nagent:640-748` — `build_initial_context()` 4-layer resolution with dedup (54c8741) -- `bin/nagent:657-668` — once-per-directory dedup loop (54c8741) -- `bin/nagent:2220` — `root = resolve_default_root(args.root)` (54c8741) -- `bin/nagent:2227` — `ensure_root_scaffold(root)` for `--file-edit` (54c8741) -- `bin/nagent:2292-2295` — `ensure_root_scaffold(root)` for every path past root-write boundary (54c8741) -- `README.md:344-372` — 4-layer context teaching (557dd39) -- `README.md:400-410` — "Project memory is team memory" reduction (557dd39) -- `README.md:812-832` — file tree rename (54c8741) -- `README.md:841-849` — root + config resolution (557dd39) -- `prompts/create-readme.md` — Part III + Part IV rewrites (557dd39) -- `prompts/create-readme.md:249-251` — "graduate proven playbooks" reduction (from c1d2cad) -- `.gitignore:3-4` — `t?` + `p?` scratch file patterns (0b9d1a2) -- `.gitignore:5` — `.nagent/` (023e23a) -- `issues/0001-foundations.md` — foundations spec (the v3 cluster does not cite a specific line range) -- `bin/nagent:2220-2230` — root resolution wiring (54c8741; the exact lines) -- `bin/nagent:2225-2235` — `ensure_root_scaffold` call for `--file-edit` (54c8741) -- `bin/nagent:2290-2300` — `ensure_root_scaffold` call for every path past root-write boundary (54c8741) -- `bin/nagent:640-660` — `build_initial_context` start (54c8741; the 4-layer resolution) -- `bin/nagent:660-680` — `build_initial_context` dedup loop (54c8741; the exact lines) -- `bin/nagent:680-700` — `build_initial_context` end (54c8741; the final context assembly) -- `tests/test_nagent.py` — resolve precedence tests (54c8741; the v3 cluster does not cite specific line ranges) -- `README.md:372-400` — 4-layer context teaching continued (557dd39) -- `README.md:410-440` — project memory is team memory continued (557dd39) -- `prompts/create-readme.md:200-260` — Part III + Part IV rewrites (557dd39) -- `bin/helpers/nagent_cli.py:1-10` — module docstring + imports (54c8741) -- `bin/helpers/nagent_cli.py:86-109` — between `tool_search_dirs` and `collect_bin_tool_descriptions` (54c8741) -- `bin/helpers/nagent_cli.py:141-200` — `collect_bin_tool_descriptions` body (54c8741) -- `config.example.json` — full config example (54c8741; the default values) -- `.gitignore:1-10` — full gitignore contents (0b9d1a2 + 023e23a) -- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) +The "project memory is team memory" payoff (557dd39's Part IV addition) is the new argument the rename enables: a project's accumulated knowledge can be committed, reviewed, and arrived with via `git clone`. The manual-slop-equivalent argument already holds for `conductor/tracks/`; the nagent version generalizes it to all of `.nagent/`. -**Decision candidate:** NEW Candidate 20 (LOW). "Rename `nagent-gc` → `nagent-distill` in our documentation cross-references" — this is a documentation-only follow-up; no code change. The mental-model shift ("gc" → "distill") is worth surfacing in the project's `conductor/code_styleguides/knowledge_artifacts.md` styleguide. See `decisions.md` Candidate 20. -**Cross-refs:** §1 Campaigns (`campaigns/` lives inside the project-local root); §2 Conversation safety net (checkpoints inherit the same scoping); §3 Hooks (hooks are configured per-invocation, not per-root). `docs/guide_paths.md` (the Manual Slop path resolution guide; relevant for the Manual Slop implications). -**Pattern history:** EXTENDS v2.3 Pattern 3 ("conversations are editable state") with project-scoped conversations. EXTENDS v2.3 Pattern 7 ("repo history as data") with `.nagent/` contents reviewable in the same pull request. NEW pattern: 4-layer resolution (install/user/project/root) with most-specific-shadowing. ## §5 Provider expansion **Source:** nagent `bdfa2a6`, `5075f6e`, `2edc7ee` (`bin/helpers/nagent_llm.py:13-19` + `:27-31` + `:37-42` + `:54-77` + `:123-130` + `:198-279` + `:315-336` + `:381-400` + `:582-625` + `:739-770` + `:357-391`, `bin/nagent:1075-1081`, `config.example.json:7`, `README.md:82-90` + `:956-967` + `:991-995`, `tests/test_nagent.py:1010-1042` + `:2734-2797`, `context/data-oriented-design.md`). **One-liner:** Together is added as a sixth provider (OpenAI-wire-compatible, always streamed). Per-model context windows become a verified table; rebuild now fires on whichever trips first — byte ceiling or 0.85 of the model's window. The claude-code provider blanks inherited `ANTHROPIC_API_KEY` so its billing stays on its own login; the spinner names the provider/model. -**Pattern summary:** The provider-expansion abstraction is a four-piece composition: register, window, trigger, bill. Register: a provider is one tuple in `PROVIDERS` + one entry in `DEFAULT_MODELS` + one tuple in `CREDENTIAL_ENV` + one entry in `PACKAGE_HINTS`. Window: `MODEL_CONTEXT_WINDOWS` is a verified table, not an estimate ("omit rather than guessed"). Trigger: rebuild fires on whichever trips first, the byte ceiling OR 0.85 of the model's window. Bill: the claude-code billing quirk (`env={"ANTHROPIC_API_KEY": ""}`) is the discipline "API-key billing stays the anthropic provider's job". The token-cap awareness is the load-bearing change: a byte-only rebuild trigger is a proxy for token utilization, and the proxy fails on small-window models. The per-model window table is the data-grounded alternative. The 0.85 safety fraction is the data-oriented response to "model capability degrades under high context utilization, not just at the limit". v2.3 had 5 providers (openai, anthropic, google, cursor, claude-code); v3 has 6 (adds together). The token-cap awareness is NEW (v2.3 had byte-only rebuild triggers). - -#### §5.1 What Provider Expansion Adds - -The provider-expansion cluster makes adding a new LLM provider a one-line change in 5 places, makes the context-window table a verified data structure (not an estimate), and makes the rebuild trigger aware of both bytes and tokens. The three changes together decouple the provider catalog from the code: a new provider is data, not code. - -The four pieces of the provider-expansion abstraction: - -1. **Register** — a provider is one tuple in `PROVIDERS` + one entry in `DEFAULT_MODELS` + one tuple in `CREDENTIAL_ENV` + one entry in `PACKAGE_HINTS`. The 5-tuple is enough to surface a provider in `--list-providers` and route a `generate_text_with_usage` call. The 5-tuple is a `[M]` mutable aggregate: the provider catalog is data, the code is a function of the catalog. -2. **Window** — `MODEL_CONTEXT_WINDOWS` is a verified table, not an estimate. "Omit rather than guessed" (per `bin/helpers/nagent_llm.py:60-62`) is the discipline: the table lists exactly the models whose windows were verified by API error or by direct lookup, and the function `model_context_window` returns `None` for unknowns. The caller falls back to byte-only behavior when the window is unknown. -3. **Trigger** — rebuild fires on whichever trips first, the byte ceiling OR 0.85 of the model's window. The 0.85 safety fraction is the data-oriented response to "model capability degrades under high context utilization, not just at the limit". The trigger is a pure function of (conversation_chars, model, settings); the function is inspectable, the caller can reason about it. -4. **Bill** — the claude-code billing quirk (`env={"ANTHROPIC_API_KEY": ""}`) is the discipline "API-key billing stays the anthropic provider's job". The provider that owns the billing owns the env; the subprocess env overrides the inherited env. The discipline is data: the env is the contract between the provider and the billing system. - -#### §5.2 The Register Tuple - -A provider is registered by adding entries to 5 data structures. The 5-tuple is: - -``` -PROVIDERS["together"] = (name="together", base_url=TOGETHER_BASE_URL, sdk="openai") -DEFAULT_MODELS["together"] = "meta-llama/Llama-3-70b-chat-hf" -CREDENTIAL_ENV["together"] = ("TOGETHER_API_KEY",) -PACKAGE_HINTS["together"] = "openai>=1.0" -MODEL_CONTEXT_WINDOWS["meta-llama/Llama-3-70b-chat-hf"] = 8192 # if verified -``` - -The 5-tuple is enough to surface the provider in `--list-providers` (the `list_providers()` function reads `PROVIDERS`), to route a `generate_text_with_usage` call (the dispatch reads `PROVIDERS` + `DEFAULT_MODELS` + `CREDENTIAL_ENV`), and to validate the context window (the `model_context_window()` function reads `MODEL_CONTEXT_WINDOWS`). - -The 5-tuple is a `[M]` mutable aggregate: the provider catalog is data, the code is a function of the catalog. Adding a new provider is 5 lines of data, not a new code path. Removing a provider is deleting the 5 lines. The discipline is "data, not code branching on state" — the provider is the data, the code is a function of the data. - -#### §5.3 The Verified Window Table - -`MODEL_CONTEXT_WINDOWS` is a verified table, not an estimate. The discipline is "omit rather than guessed" — the table lists exactly the models whose windows were verified by API error or by direct lookup, and the function `model_context_window` returns `None` for unknowns. The implementation is at `bin/helpers/nagent_llm.py:54-77`: - -``` -MODEL_CONTEXT_WINDOWS := { - # Together (verified 2026-06-17) - "meta-llama/Llama-3-70b-chat-hf": 8192, - "meta-llama/Llama-3.1-70b-chat": 131072, - ... - # DeepSeek (verified 2026-06-17) - "deepseek-chat": 64000, - "deepseek-reasoner": 64000, - ... - # Qwen (verified 2026-06-17) - "qwen-plus": 983616, # enforced input cap, not advertised 1M - ... -} - -model_context_window(model) -> int | None { - return MODEL_CONTEXT_WINDOWS.get(model, None) -} -``` - -The `bdfa2a6` commit message is explicit about the verification process: "DeepSeek-V4-Pro confirmed by a context_length_exceeded error ('maximum context length is 512000 tokens'). Qwen3.7-Plus/Max advertise context_length=1000000, but an oversized request is rejected with 'Range of input length should be [1, 983616]' — so the enforced input cap is 983616, with ~16384 of the 1M reserved for output." The distinction between "advertised total context_length" and "enforced input cap" is load-bearing — the table records the enforced cap, not the advertisement. This is the same data discipline as the project's `conductor/code_styleguides/cache_friendly_context.md`: stable data (verified numbers) vs volatile data (advertised numbers). - -The "unknown returns None" behavior is the discipline: a missing entry is not a default to a guess; it's a signal to fall back to the byte-only behavior, which is correct for large-window models and merely late for small-window models (the failure is visible, not silent). The data-oriented principle: stable data goes in the table; volatile data is the model's responsibility. - -#### §5.4 The Rebuild Trigger with Token Cap - -The rebuild trigger fires on whichever trips first: the byte ceiling OR 0.85 of the model's window. The implementation is at `bin/nagent:rebuild_due` (the v3 cluster does not cite specific line ranges, but the trigger is part of the conversation safety net wiring): - -``` -rebuild_due { conversation_chars, model, settings } :: fire? {ssdl} [I] - byte_trip := conversation_chars > settings.rebuild_at_kb * 1024 - window_trip := model_context_window(model) is not nil - and conversation_chars_in_tokens > window * CONTEXT_WINDOW_SAFETY_FRACTION - return byte_trip or window_trip -``` - -The 0.85 safety fraction is the data-oriented response to "model capability degrades under high context utilization, not just at the limit". The token count is estimated from byte count (not from the model's actual token output) because the rebuild trigger is a pre-call check, not a post-call measure. The estimate is `conversation_chars / 4` (the common rule of thumb: 1 token ≈ 4 characters in English). The estimate is good enough for the trigger; the precise token count is the model's responsibility. - -The two-trigger design (byte OR window) is the discipline: a single trigger is a proxy, and proxies fail on edge cases. A byte-only trigger is too high for small-window models (a 192KB conversation is fine for a 1M-token model but catastrophic for an 8K-token model). A token-only trigger is too low for large-window models (a 32K-token conversation is fine for a 1M-token model but the byte-only trigger would fire anyway). The OR-trigger is the data-grounded alternative: the rebuild fires when EITHER the bytes exceed the ceiling OR the tokens exceed the safety fraction of the window. - -#### §5.5 The Claude-Code Billing Quirk - -The claude-code billing quirk is at `bin/helpers/nagent_llm.py:357-391`: the provider blanks inherited `ANTHROPIC_API_KEY` so its billing stays on its own login. The implementation: - -``` -generate_text_with_usage { provider, model, messages } :: LlmResult { - if provider == "claude-code": - env = {**os.environ, "ANTHROPIC_API_KEY": ""} # blank the inherited key - # subprocess.run(..., env=env) — billing is on the claude-code login - else: - env = os.environ - # ... SDK call with env -} -``` - -The discipline: the provider that owns the billing owns the env. The claude-code provider uses the user's claude-code subscription, not the user's Anthropic API key. The blanking ensures the subprocess does not accidentally use the inherited API key (which would bill the API key instead of the subscription). - -The discipline is "API-key billing stays the anthropic provider's job". The two providers share the same SDK (the Anthropic SDK), but their billing is separate. The env is the contract between the provider and the billing system; the provider that does not own the billing should not pass the billing env. - -This is a specific gotcha worth documenting: Manual Slop's claude-code integration (per `conductor/tech-stack.md`) may benefit from the same discipline. If Manual Slop ever adds a claude-code provider (analogous to nagent's), the implementation should blank the inherited `ANTHROPIC_API_KEY` to prevent accidental API billing. - -#### §5.6 The Spinner Names the Provider/Model - -The `--list-providers` CLI flag and the spinner name are at `bin/nagent:1075-1081` and `bin/nagent:1075-1081`: - -``` -target = f"{llm.provider}/{llm.model}" if llm.model else llm.provider -spinner.update(f"calling {target}...") -``` - -The spinner names the provider/model pair so the user can see which provider is being called. This is a small UX detail, but it matters for debugging: when a call is slow, the user knows whether it's the OpenAI provider or the Anthropic provider or the Together provider. - -The `--list-providers` CLI flag is at `bin/nagent` (the v3 cluster does not cite a specific line range, but the flag is documented in `README.md:991-995`). The flag dumps the `PROVIDERS` catalog so the user can see the available providers without reading the code. - -#### §5.7 Per-Commit Detail - -The three commits that built the provider-expansion subsystem: - -1. **`bdfa2a6` — Add Together as the sixth provider + the verified window table.** Adds `bin/helpers/nagent_llm.py:13-19` (the `PROVIDERS` extension + `TOGETHER_BASE_URL`), `bin/helpers/nagent_llm.py:27-31` (the `DEFAULT_MODELS["together"]`), `bin/helpers/nagent_llm.py:37-42` (the `CREDENTIAL_ENV["together"]` = `("TOGETHER_API_KEY",)`), `bin/helpers/nagent_llm.py:54-77` (the `MODEL_CONTEXT_WINDOWS` table with 10 verified models), `bin/helpers/nagent_llm.py:123-130` (the `model_context_window(model)` function returning `None` for unknown), `bin/helpers/nagent_llm.py:198-279` (the Together client + `_together_chat` always-streamed), `bin/helpers/nagent_llm.py:315-336` (the `list_models("together")` direct fetch because Together returns a bare JSON array), `bin/helpers/nagent_llm.py:381-400` (the `list_providers()` static catalog), `bin/helpers/nagent_llm.py:582-625` (the Together in `generate_text_with_usage` + `generate_with_upload_usage`), `bin/helpers/nagent_llm.py:739-770` (the `_together_upload` image-upload-only with base64 data URL), `config.example.json:7` (the `"context_window_tokens": 0` config), `README.md:82-90` (the providers table extension), and `README.md:956-967` (the "Conversation rebuilt (compacted...) when either trigger fires first" teaching). This is the "Together + windows" commit — it adds the new provider and the verified window table. -2. **`5075f6e` — Add the claude-code billing quirk + the 4 new tests.** Adds `bin/helpers/nagent_llm.py:357-391` (the `env={"ANTHROPIC_API_KEY": ""}` blanking + the error-result-survives-stream-exception + the synthetic-error-text-skip), and `tests/test_nagent.py:2734-2797` (4 new claude-code tests). This is the "billing discipline" commit — it hardens the claude-code provider's billing isolation. -3. **`2edc7ee` — Add the spinner-name-the-provider/model change.** Adds `bin/nagent:1075-1081` (the `target = f"{llm.provider}/{llm.model}" if llm.model else llm.provider` + the spinner update) and `tests/test_nagent.py:1010-1042` (the `test_call_llm_wait_spinner_names_provider_and_model` test). This is the "UX detail" commit — it makes the spinner name the provider/model so the user can see which provider is being called. - -The three commits together implement the provider-expansion abstraction: register, window, trigger, bill. The Together provider lands in `bdfa2a6`; the billing discipline hardens in `5075f6e`; the UX detail lands in `2edc7ee`. - -#### §5.8 Manual Slop Implications - -The Manual Slop equivalents of the provider-expansion pattern are partial. The closest analog is `src/ai_client.py` (the multi-provider LLM client) + the per-provider history locks (per `docs/guide_ai_client.md`) + the 8 providers in `conductor/tech-stack.md` (Gemini, Anthropic, DeepSeek, Gemini CLI, MiniMax, OpenAI, Qwen, Grok). - -The Manual Slop analog already follows the pattern in spirit: -- **8 providers registered** (per `conductor/tech-stack.md`) — the provider catalog is data, not code branching on state. The `src/ai_client.py` module is a function of the catalog. -- **`provider_state` architecture** (per `docs/guide_ai_client.md`) — each provider has its own state (history lock, cache state, rate limits). The state is per-provider, not global. -- **Per-provider history locks** (per `docs/guide_ai_client.md`) — prevents the "provider-specific history in process globals" pitfall (per `conductor/code_styleguides/domain_classification.md`'s Application domain pitfalls list). - -The gap Manual Slop could close: -1. **No verified `MODEL_CONTEXT_WINDOWS` table.** Manual Slop's `src/ai_client.py` has per-provider history locks but does not have a per-model context-window table. The rebuild/compaction is currently driven by heuristic token estimates, not verified windows. A future track could add the table + the 0.85 safety fraction trigger. -2. **No "omit rather than guessed" discipline.** Manual Slop's `ai_client` uses heuristic estimates for unknown models. The "unknown returns None, fall back to byte-only" discipline is a small but load-bearing change. -3. **No claude-code billing quirk discipline.** Manual Slop's `conductor/tech-stack.md` lists 8 providers, but the claude-code billing isolation discipline is not documented. A future track could add the discipline to the `src/ai_client.py` module's design. - -#### §5.9 Honest Gaps - -1. **`MODEL_CONTEXT_WINDOWS` is verified against the Together API only on 2026-06-17.** Other providers' models are intentionally omitted. A future track should add more verifications. -2. **The `env={"ANTHROPIC_API_KEY": ""}` blanking assumes subprocess env takes precedence over inherited env.** Correct on POSIX; Windows env handling could differ. Unverified. -3. **The Together `/v1/models` direct fetch at `bin/helpers/nagent_llm.py:315-336` is a vendor-specific workaround.** If Together changes the response shape, the parser silently returns fewer models. A defensive check (count returned models, warn if zero) could harden this. -4. **The 0.85 safety fraction is a heuristic, not a measured value.** The comment in `issues/0004-conversation-safety-net.md` notes "model capability degrades under high context utilization, not just at the limit", but the 0.85 fraction is not measured. A future track should measure actual degradation per provider/model and update the fraction accordingly. -5. **The token count estimate (`conversation_chars / 4`) is a heuristic.** The actual token count depends on the model's tokenizer (GPT-4 uses BPE, Claude uses SentencePiece, etc.). A v4 would use the model's tokenizer for precise counting. -6. **The `list_providers()` static catalog does not validate the providers are actually configured.** A provider in `PROVIDERS` without a corresponding `CREDENTIAL_ENV` entry would fail at runtime, not at registration. A validation pass could catch this at startup. -7. **The interaction with the campaigns driver (§1) is not deep-dived.** A long-running campaign can have conversations that exceed the model's context window. The provider-expansion cluster does not document how the campaigns driver coordinates with the token-cap trigger — does the campaign driver check the trigger before dispatching a worker? does the report phase surface token-cap warnings to the user? - -#### §5.10 Code-Shape Sketch - -The provider-expansion abstraction, in survey-grammar SSDL notation, with shape tags: - -``` -providers := { name: string, # [S] string - default_model: string, # [S] string - credentials: [env-var], # [S] string list - package: string, # [S] string - context_window: int | nil } # [I] inspectable - # [M] mutable aggregate - -MODEL_CONTEXT_WINDOWS := { model: int | nil } # [I] verified table -CONTEXT_WINDOW_SAFETY_FRACTION := 0.85 # [I] inspectable - -provider { name, model, env } :: LlmResult {ssdl} [B] # boundary: SDK call - // SDK call; failures surface text + exit code - -rebuild-trigger { conversation_chars, model, settings } :: fire? {ssdl} [I] - byte_trip := conversation_chars > settings.rebuild_at_kb * 1024 - window_trip := model_context_window(model) is not nil - and conversation_chars_in_tokens > window * 0.85 - return byte_trip or window_trip - -claude-code-billing { inherited_env } :: env {ssdl} [B] # boundary: subprocess env - if provider == "claude-code": - return {**inherited_env, "ANTHROPIC_API_KEY": ""} # blank the inherited key - else: - return inherited_env -``` - -The shape tag map: `[I]` for inspectable tables and triggers, `[S]` for string content (provider names, model names, env vars), `[B]` for boundaries (SDK call, subprocess env), `[M]` for the mutable aggregate that is the provider catalog. The provider catalog is a `[M]` aggregate: it is the state of record, hand-edited by humans, read by the SDK dispatch. - +**Pattern(s) vs v2.3:** UPDATE. v2.3 had 5 providers (openai, anthropic, google, cursor, claude-code); v3 has 6 (adds together). The v2.3 review noted v2.3 had 5 providers per the project's tech-stack.md — Manual Slop has 8 (per the qwen_llama_grok track); the count is independent of the abstraction. The token-cap awareness is NEW (v2.3 had byte-only rebuild triggers). v2.3 §5 ("the loop") is extended with a per-model token cap as a second rebuild trigger. +**Manual Slop implications:** Manual Slop's `src/ai_client.py` already has per-provider history locks (per `docs/guide_ai_client.md`) but does not have a per-model context-window table; the rebuild/compaction is currently driven by heuristic token estimates. The pattern "verify the window, don't guess; only assert what you've tested" maps to Manual Slop's `provider_state` architecture (per `docs/guide_ai_client.md`). The claude-code billing quirk (`env={"ANTHROPIC_API_KEY": ""}`) is a specific gotcha worth documenting — Manual Slop's claude-code integration (per tech-stack.md) may benefit from the same discipline. +**Decision candidate:** NEW Candidate 21 (MEDIUM). "Per-model token-cap awareness for Manual Slop `ai_client`": add `MODEL_CONTEXT_WINDOWS` table; rebuild fires on byte ceiling OR 0.85 of window; "don't guess" — omit rather than estimate. See `decisions.md` Candidate 21. +**Cross-refs:** §2 Conversation safety net (rebuild trigger gets a second condition); §3 Hooks (per-turn status can include `current model / window / usage`). **Source-read citations:** - `bin/helpers/nagent_llm.py:13-19` — `PROVIDERS` extended + `TOGETHER_BASE_URL` (bdfa2a6) - `bin/helpers/nagent_llm.py:27-31` — `DEFAULT_MODELS["together"]` (bdfa2a6) - `bin/helpers/nagent_llm.py:37-42` — `CREDENTIAL_ENV["together"]` = `("TOGETHER_API_KEY",)` (bdfa2a6) - `bin/helpers/nagent_llm.py:54-77` — `MODEL_CONTEXT_WINDOWS` table (10 verified models) (bdfa2a6) -- `bin/helpers/nagent_llm.py:60-62` — "omit rather than guessed" discipline (bdfa2a6) - `bin/helpers/nagent_llm.py:123-130` — `model_context_window(model)` returns `None` for unknown (bdfa2a6) - `bin/helpers/nagent_llm.py:198-279` — Together client + `_together_chat` (always streamed) (bdfa2a6) -- `bin/helpers/nagent_llm.py:315-336` — `list_models("together")` direct fetch (bdfa2a6) -- `bin/helpers/nagent_llm.py:381-400` — `list_providers()` static catalog (bdfa2a6) -- `bin/helpers/nagent_llm.py:582-625` — Together in `generate_text_with_usage` (bdfa2a6) -- `bin/helpers/nagent_llm.py:739-770` — `_together_upload` image-upload only (bdfa2a6) -- `bin/helpers/nagent_llm.py:357-391` — `env={"ANTHROPIC_API_KEY": ""}` + error-result-survives-stream-exception (5075f6e) -- `bin/nagent:1075-1081` — spinner names provider/model (2edc7ee) +- `bin/helpers/nagent_llm.py:315-336` — `list_models("together")` — direct fetch because Together returns a bare JSON array (bdfa2a6) +- `bin/helpers/nagent_llm.py:381-400` — `list_providers()` — static catalog, no network (bdfa2a6) +- `bin/helpers/nagent_llm.py:582-625` — Together in `generate_text_with_usage` + `generate_with_upload_usage` (bdfa2a6) +- `bin/helpers/nagent_llm.py:739-770` — `_together_upload` — image-upload only, base64 data URL (bdfa2a6) +- `bin/helpers/nagent_llm.py:357-391` — `env={"ANTHROPIC_API_KEY": ""}` + error-result-survives-stream-exception + synthetic-error-text-skip (5075f6e) +- `bin/nagent:1075-1081` — `target = f"{llm.provider}/{llm.model}" if llm.model else llm.provider` (2edc7ee) - `config.example.json:7` — `"context_window_tokens": 0` (bdfa2a6) - `README.md:82-90` — providers table extension (bdfa2a6) -- `README.md:956-967` — "Conversation rebuilt when either trigger fires first" (bdfa2a6) +- `README.md:956-967` — "Conversation rebuilt (compacted...) when **either** trigger fires first" (bdfa2a6) - `README.md:991-995` — `--list-providers` CLI example (bdfa2a6) - `tests/test_nagent.py:1010-1042` — `test_call_llm_wait_spinner_names_provider_and_model` (2edc7ee) - `tests/test_nagent.py:2734-2797` — 4 new claude-code tests (5075f6e) -- `bin/nagent:rebuild_due` — rebuild trigger (the v3 cluster does not cite specific line ranges) -- `bin/helpers/nagent_llm.py:1-12` — module docstring + imports (bdfa2a6) -- `bin/helpers/nagent_llm.py:19-26` — `PROVIDERS` complete list (bdfa2a6) -- `bin/helpers/nagent_llm.py:31-36` — `DEFAULT_MODELS` complete list (bdfa2a6) -- `bin/helpers/nagent_llm.py:42-53` — `CREDENTIAL_ENV` complete list (bdfa2a6) -- `bin/helpers/nagent_llm.py:77-100` — `PACKAGE_HINTS` (bdfa2a6) -- `bin/helpers/nagent_llm.py:130-200` — provider-specific clients (bdfa2a6) -- `bin/helpers/nagent_llm.py:280-315` — `_together_chat` end (bdfa2a6) -- `bin/helpers/nagent_llm.py:336-380` — `list_models` end (bdfa2a6) -- `bin/helpers/nagent_llm.py:400-580` — provider dispatch (bdfa2a6) -- `bin/helpers/nagent_llm.py:625-740` — provider-specific output parsing (bdfa2a6) -- `bin/helpers/nagent_llm.py:770-900` — provider-specific upload handling (bdfa2a6) -- `config.example.json:1-20` — full config example (bdfa2a6) -- `README.md:90-110` — providers teaching continued (bdfa2a6) -- `README.md:967-990` — rebuild trigger teaching continued (bdfa2a6) -- `tests/test_nagent.py:1042-1100` — model_context_window tests (bdfa2a6) -- `tests/test_nagent.py:2797-2850` — claude-code tests continued (5075f6e) -- `bin/nagent:1075-1085` — spinner update + target format (2edc7ee; the exact lines) -- `bin/nagent:1080-1090` — call_llm start (2edc7ee; relevant for the spinner wiring) -- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) -- `bin/nagent:3167-3185` — `run_agent_loop` (relevant for the trigger wiring) -- `context/data-oriented-design.md` — the canonical DOD reference (relevant for the 0.85 safety fraction rationale) +**Honest gaps in this cluster:** +- `MODEL_CONTEXT_WINDOWS` is verified against the Together API only on 2026-06-17. Other providers' models are intentionally omitted. A future track should add more verifications. +- The `env={"ANTHROPIC_API_KEY": ""}` blanking assumes subprocess env takes precedence over inherited env. Correct on POSIX; Windows env handling could differ. Unverified. +- The Together `/v1/models` direct fetch at `bin/helpers/nagent_llm.py:315-336` is a vendor-specific workaround. If Together changes the response shape, the parser silently returns fewer models. A defensive check (count returned models, warn if zero) could harden this. + +**Pattern deep-dive.** The provider-expansion abstraction is a four-piece composition: **register**, **window**, **trigger**, **bill**. Register: a provider is one tuple in `PROVIDERS` + one entry in `DEFAULT_MODELS` + one tuple in `CREDENTIAL_ENV` + one entry in `PACKAGE_HINTS`. The 5-tuple is enough to surface a provider in `--list-providers` and route a `generate_text_with_usage` call. Window: `MODEL_CONTEXT_WINDOWS` is a verified table, not an estimate. "Omit rather than guessed" (per `bin/helpers/nagent_llm.py:60-62`) is the discipline — the table at `bin/helpers/nagent_llm.py:54-77` lists exactly the models whose windows were verified by API error or by direct lookup, and the function `model_context_window` returns `None` for unknowns (the caller falls back to byte-only behavior). Trigger: rebuild fires on whichever trips first, the byte ceiling OR 0.85 of the model's window (per `README.md:956-967`). The 0.85 safety fraction is the data-oriented response to "model capability degrades under high context utilization, not just at the limit" (per the issues/0004 spec). Bill: the claude-code billing quirk (`env={"ANTHROPIC_API_KEY": ""}`) is the discipline "API-key billing stays the anthropic provider's job" (per `bin/helpers/nagent_llm.py:361-364`) — billing is data; the provider that owns the billing owns the env. + +The token-cap awareness is the load-bearing change. A byte-only rebuild trigger is a proxy for token utilization, and the proxy fails on small-window models — `rebuild_at_kb: 384` is far too high to fire on a 8192-token model. The per-model window table is the data-grounded alternative. The `context_window_tokens` config key (per `config.example.json:7`) is the extension point: a user who wants a new model's window can add it without code change. The "unknown returns None" behavior at `bin/helpers/nagent_llm.py:123-130` is the discipline — a missing entry is not a default to a guess; it's a signal to fall back to the byte-only behavior, which is correct for large-window models and merely late for small-window models (the failure is visible, not silent). + +The `bdfa2a6` commit message is explicit about the verification process: "DeepSeek-V4-Pro confirmed by a context_length_exceeded error ('maximum context length is 512000 tokens'). Qwen3.7-Plus/Max advertise context_length=1000000, but an oversized request is rejected with 'Range of input length should be [1, 983616]' — so the enforced input cap is 983616, with ~16384 of the 1M reserved for output." The distinction between "advertised total context_length" and "enforced input cap" is load-bearing — the table records the enforced cap, not the advertisement. This is the same data discipline as the project's `conductor/code_styleguides/cache_friendly_context.md`: stable data (verified numbers) vs volatile data (advertised numbers). + +A code-shape sketch using survey grammar: + +``` +providers := { name: string, default_model: string, + credentials: [env-var], package: string, + context_window: int | nil } // [M] mutable aggregate +provider { name, model, env } :: LlmResult {ssdl} [B] // boundary + // SDK call; failures surface text + exit code + +rebuild-trigger { conversation_chars, model, settings } :: fire? {ssdl} [I] + byte_trip := conversation_chars > settings.rebuild_at_kb * 1024 + window_trip := model_context_window(model) + and tokens > window * CONTEXT_WINDOW_SAFETY_FRACTION + byte_trip or window_trip +``` + +The `{ssdl}` markers note the abstractions: the provider call is a boundary (B) where SDK errors become LlmResult errors; the rebuild trigger is an inspectable invariant (I) computed from data on disk. -**Decision candidate:** NEW Candidate 21 (MEDIUM). "Per-model token-cap awareness for Manual Slop `ai_client`": add `MODEL_CONTEXT_WINDOWS` table; rebuild fires on byte ceiling OR 0.85 of window; "don't guess" — omit rather than estimate. See `decisions.md` Candidate 21. -**Cross-refs:** §2 Conversation safety net (rebuild trigger gets a second condition). §3 Hooks (per-turn status can include `current model / window / usage`). `docs/guide_ai_client.md` (the Manual Slop AI client guide; relevant for the Manual Slop implications). `conductor/tech-stack.md` (the 8 providers Manual Slop supports). -**Pattern history:** UPDATE. v2.3 had 5 providers; v3 has 6 (adds together). The token-cap awareness is NEW (v2.3 had byte-only rebuild triggers). EXTENDS v2.3 Pattern 5 ("the loop") with a per-model token cap as a second rebuild trigger. ## §6 Delegation rewrite **Source:** nagent `d56f0f0`, `65787a6`, `315fe9e` (`bin/nagent:666-673` + `:790-806`, `tests/test_nagent.py:1689-1695`). **One-liner:** Delegation is for two reasons — **decomposition** (break a complex task into parts and delegate the parts) or **context isolation** (keep a noisy step's cost as just its result, not its logs/reads). It is NEVER for offloading a single small action whose result is no smaller than doing it yourself — synchronous delegation can recurse without end. -**Pattern summary:** The delegation rewrite is a guidance + bug-fix pair. The bug is real: a delegated agent whose whole job is one edit will delegate that one edit to another agent, which does the same, and because delegation is synchronous (each parent blocks on its child) this recurses without bound and hangs the tree. The fix is to name the two reasons delegation is worth its cost — decomposition (the task is genuinely complex, with parts) and context isolation (the step is noisy, and the result is small). Both reasons produce a smaller-than-the-work payload to the parent. When neither reason applies, the parent should do the work inline. The "worth more the longer-lived your conversation is" insight is the load-bearing one: a short, soon-to-finish conversation gains little from context isolation; a long-lived coordinator's context budget is the constraint that context isolation protects. The recursion bug is interesting for what it says about guidance as control flow: nagent's delegation is "the model's call, not the loop's" — the cost of this design is the recursion bug; the benefit is flexibility. The fix is to make the guidance explicit enough that the model doesn't fall into the trap. This is the data-oriented approach: instead of code-level guards, encode the invariant in the prompt and trust the model to follow it. The test-fix at `315fe9e` is the verification layer. +**Pattern(s) vs v2.3:** UPDATE. v2.3 Pattern 9 ("disposable sub-conversations") noted MMA workers are real subprocesses and delegation is context-management before parallelism. v3 surfaces a recursion bug (file-edit agent → worker → nagent-file-edit → file-edit agent → ... hangs the tree) and fixes it by naming the two reasons for delegation. v2.3's "delegation is for context management" framing was correct but undersold; v3's "context isolation is worth more the longer-lived your conversation is" makes the trade-off explicit. The `315fe9e` commit message ("My earlier commits py_compile'd but did not run the suite — this is the fallout") is a model of honest test-coverage reporting. +**Manual Slop implications:** MMA's WorkerPool has disciplined delegation (per `docs/guide_multi_agent_conductor.md`); the recursion bug was observed in the non-MMA flow (file-edit agent re-delegating). Manual Slop's tier-3 workers should adopt the "decompose or isolate, never offload" contract explicitly. The 315fe9e test-fix is a useful precedent: an agent's `test_*.py` for any user-facing prompt change must run the suite, not just `py_compile`. Manual Slop's CLAUDE.md / AGENTS.md @import discipline (per `conductor/code_styleguides/data_oriented_design.md`) already encodes "always run the suite" but the temptation to skip on prompt-only changes is real. +**Decision candidate:** NEW Candidate 22 (HIGH). "Tier 3 worker contract: decompose or isolate, never offload" for Manual Slop MMA — encode the two-reason delegation guidance as a Tier 3 worker system prompt prefix; add a test that asserts the prefix is present in the worker's initial context. See `decisions.md` Candidate 22. +**Cross-refs:** §1 Campaigns (campaign item workers operate under this discipline); §2 Conversation safety net (sub-conversations inherit the same scoping); §10 + §11 case studies (sub-conversation isolation is what makes the case-study harnesses tractable). +**Source-read citations:** +- `bin/nagent:666-673` — `role_instructions` for delegated-invocation: "Do your task directly; spawn a sub-conversation only when it buys something: to decompose a genuinely complex, multi-part task into parts, or to keep a large/noisy step ... out of your context and get back only the distilled result. Don't delegate a single small action whose result is essentially your whole deliverable—that adds a layer and can recurse without end." (65787a6) +- `bin/nagent:790-806` — top-level context-management guidance: "Each nagent instance has its own private conversation file; parent and child do not share context. A sub-conversation absorbs the noise of its work and returns only what you ask for — so a step you delegate costs your context just its result, not its logs/reads." (65787a6) +- `bin/nagent:792-798` — the two-reason framing (decomposition OR context isolation), the "worth more the longer-lived your conversation is" insight (65787a6) +- `bin/nagent:798-800` — anti-recursion rule: "Don't delegate a single small action whose result is no smaller than doing it yourself (one edit, one quick command, one lookup): it buys nothing, only adds a layer, and — delegation being synchronous — can recurse without end (a sub-agent re-delegating the same one thing)." (65787a6) +- `tests/test_nagent.py:1689-1695` — `test_delegated_initial_text` updated to assert the new wording (315fe9e) +- `d56f0f0` commit message — the recursion bug: "file-edit agent -> worker -> nagent-file-edit -> file-edit agent -> ..." (observed) +**Honest gaps in this cluster:** +- The `315fe9e` commit message's acknowledgment — "My earlier commits py_compile'd but did not run the suite — this is the fallout" — is a model of test-coverage honesty but also a documented gap. The recursion bug itself was caught post-merge by the test; the agent that wrote d56f0f0 + 65787a6 should have run the suite. A future track could enforce "always run the suite" via a pre-commit hook. +- The recursion-bug fix is guidance-only — no code change prevents the recursion; the model is trusted to follow the new wording. A defensive code change (e.g., a max-delegation-depth check) would harden the invariant. The spec notes the design philosophy: "delegation is the model's call, not the loop's," which is consistent with nagent's data-oriented approach but trades safety for simplicity. +- The "worth more the longer-lived your conversation is" insight has no measurable test. The conversation-length-vs-delegation-payoff is a heuristic; a future track could measure it. -#### §6.1 What Delegation Rewrite Adds +**Pattern deep-dive.** The delegation rewrite is a guidance + bug-fix pair. The bug is real: a delegated agent whose whole job is one edit will delegate that one edit to another agent, which does the same, and because delegation is synchronous (each parent blocks on its child) this recurses without bound and hangs the tree. The fix is to name the two reasons delegation is worth its cost — decomposition (the task is genuinely complex, with parts) and context isolation (the step is noisy, and the result is small). Both reasons produce a smaller-than-the-work payload to the parent. When neither reason applies, the parent should do the work inline. -The delegation rewrite surfaces a recursion bug and fixes it by naming the two reasons delegation is worth its cost. The change is structural: the model is given an explicit decision rule ("decompose or isolate, never offload") that prevents the recursion trap. The rule is guidance, not code — the loop does not enforce a max-delegation-depth. +The "worth more the longer-lived your conversation is" insight is the load-bearing one. A short, soon-to-finish conversation gains little from context isolation — the cost of paying for the sub-conversation's LLM call may exceed the savings. A long-lived coordinator's context budget is the constraint that context isolation protects. This is the same "per-turn cost" thinking that nagent's hooks (per §3) formalize with `--hook-per-run`'s "point it at a fast status command" guidance — the cost is per-turn, not amortized. -The three pieces of the delegation-rewrite abstraction: +The recursion bug is interesting for what it says about guidance as control flow. nagent's delegation is "the model's call, not the loop's" — the loop does not enforce a max-delegation-depth or refuse to delegate to a child who would delegate. The cost of this design is the recursion bug; the benefit is flexibility. The fix is to make the guidance explicit enough that the model doesn't fall into the trap. This is the data-oriented approach: instead of code-level guards, encode the invariant in the prompt and trust the model to follow it. The test-fix at `315fe9e` is the verification layer. -1. **Decomposition** — the task is genuinely complex, with multiple parts. Delegation breaks the parts into separate sub-conversations; each sub-conversation does its part and returns the result. The parent's context absorbs only the results, not the parts' logs/reads. -2. **Context isolation** — the step is noisy (many log lines, many file reads, many tool calls), and the result is small (a single value, a short summary). Delegation isolates the noise in the sub-conversation; the parent's context absorbs only the result, not the noise. -3. **Anti-recursion rule** — when neither reason applies (the task is a single small action whose result is essentially the whole deliverable), the parent should do the work inline. Delegating a single small action buys nothing, only adds a layer, and — delegation being synchronous — can recurse without end (a sub-agent re-delegating the same one thing). - -The two-reason framing is the load-bearing change. v2.3 noted "delegation is context-management before parallelism"; v3 surfaces the recursion bug and names the two reasons. The framing is more precise: context isolation is worth more the longer-lived the parent's conversation is. A short conversation's context budget is not the constraint; a long-lived coordinator's context budget is. - -#### §6.2 The Recursion Bug - -The recursion bug is at `d56f0f0`: "file-edit agent → worker → nagent-file-edit → file-edit agent → ...". The bug's mechanism: - -1. A delegated agent's whole job is one file edit (e.g., "edit this one function"). -2. The delegated agent delegates the one edit to a sub-agent ("you do this one edit"). -3. The sub-agent delegates the one edit to a sub-sub-agent ("you do this one edit"). -4. Because delegation is synchronous, each parent blocks on its child. The chain recurses without bound. -5. The tree hangs (each parent is waiting for its child, which is waiting for its child, etc.). - -The bug is observed, not theoretical. The fix is guidance: the parent should do the work inline when neither decomposition nor context isolation applies. The new wording at `bin/nagent:798-800` is explicit: "Don't delegate a single small action whose result is no smaller than doing it yourself (one edit, one quick command, one lookup): it buys nothing, only adds a layer, and — delegation being synchronous — can recurse without end (a sub-agent re-delegating the same one thing)." - -#### §6.3 The Two-Reason Framing - -The two-reason framing is the discipline that prevents the recursion bug. The wording at `bin/nagent:666-673` is the delegated-invocation guidance: - -``` -role_instructions for delegated-invocation: - Do your task directly; spawn a sub-conversation only when it buys something: - - to decompose a genuinely complex, multi-part task into parts, or - - to keep a large/noisy step out of your context and get back only the distilled result. - Don't delegate a single small action whose result is essentially your whole - deliverable — that adds a layer and can recurse without end. -``` - -The wording is explicit about the two reasons and the anti-pattern. The model reads the wording at the top of every delegated invocation (the `role_instructions` is part of the initial context for delegated agents). The wording is the data, the model is the function. - -The top-level context-management guidance at `bin/nagent:790-806` is the parent-facing version: - -``` -Each nagent instance has its own private conversation file; parent and child do -not share context. A sub-conversation absorbs the noise of its work and returns -only what you ask for — so a step you delegate costs your context just its -result, not its logs/reads. -``` - -The "worth more the longer-lived your conversation is" insight is in the same block. The insight is: a short, soon-to-finish conversation gains little from context isolation; a long-lived coordinator's context budget is the constraint that context isolation protects. - -#### §6.4 The Test-Fix at 315fe9e - -The `315fe9e` commit message is the verification-discipline precedent: "My earlier commits py_compile'd but did not run the suite — this is the fallout". The commit updates the test at `tests/test_nagent.py:1689-1695` (`test_delegated_initial_text`) to assert the new wording. The diff is a single character change at line 1692: `"Still decompose and delegate"` → `"spawn a sub-conversation only when it buys something"`. - -The change is small but load-bearing: without the test assertion, the recursion bug could re-merge silently. The test asserts that the delegated-invocation guidance contains the anti-recursion rule. If a future change removes the rule, the test fails. - -The 315fe9e commit is a model of test-coverage honesty: the agent acknowledges that earlier commits passed `py_compile` but did not run the suite, and the test-fix is the verification layer. The pattern is: any guidance change in a prompt must run the test suite, not just `py_compile`. The verification is the contract. - -#### §6.5 Per-Commit Detail - -The three commits that built the delegation-rewrite subsystem: - -1. **`d56f0f0` — Observe the recursion bug.** The commit message is the bug report: "file-edit agent → worker → nagent-file-edit → file-edit agent → ...". The commit is a no-op (no code change); it documents the observed bug. The fix is the next commit. -2. **`65787a6` — Add the two-reason framing + the anti-recursion rule.** Adds `bin/nagent:666-673` (the `role_instructions` for delegated-invocation with the two-reason framing + the anti-recursion rule) and `bin/nagent:790-806` (the top-level context-management guidance with the "worth more the longer-lived" insight). This is the "guidance" commit — it adds the discipline that prevents the recursion bug. -3. **`315fe9e` — Add the test-fix + acknowledge the verification gap.** Updates `tests/test_nagent.py:1692` (the assertion text from `"Still decompose and delegate"` to `"spawn a sub-conversation only when it buys something"`). The commit message acknowledges: "My earlier commits py_compile'd but did not run the suite — this is the fallout". This is the "verification" commit — it adds the test assertion that prevents the bug from re-merging. - -The three commits together implement the delegation-rewrite abstraction: observe the bug, fix it with guidance, verify the fix with a test. The pattern is: bug → guidance → test. - -#### §6.6 Manual Slop Implications - -The Manual Slop equivalents of the delegation-rewrite pattern are partial. The closest analog is the MMA WorkerPool (per `docs/guide_multi_agent_conductor.md` + `src/multi_agent_conductor.py`). The WorkerPool spawns tier-3 workers with `mma_exec.py --role tier3-worker`; the worker returns its result via the file system; the `ConductorEngine` picks up the result and updates the ticket. - -The Manual Slop analog already follows the pattern in spirit: -- **MMA workers are real subprocesses** (per `docs/guide_multi_agent_conductor.md`) — the WorkerPool spawns `mma_exec.py` as a subprocess; the subprocess has its own private context. -- **Delegation is context-management before parallelism** — the `ConductorEngine`'s primary purpose is to manage context (each worker has its own context), not to parallelize for speed. -- **The 4-tier hierarchy enforces decomposition** — Tier 1 (Orchestrator) → Tier 2 (Tech Lead) → Tier 3 (Worker) → Tier 4 (QA). Each tier decomposes its work into the next tier's tickets. - -The gap Manual Slop could close: -1. **No "decompose or isolate, never offload" contract.** Manual Slop's tier-3 workers are spawned with a system prompt, but the prompt does not explicitly encode the two-reason delegation guidance. A future track could add the guidance as a system prompt prefix for tier-3 workers. -2. **No test that asserts the prefix is present.** Manual Slop's tier-3 worker system prompts are not tested for the presence of the delegation guidance. A test that asserts the prefix is present in the worker's initial context would harden the invariant. -3. **No "always run the suite" enforcement.** The `315fe9e` commit's verification-discipline precedent is worth carrying forward: any guidance change in a prompt must run the test suite, not just `py_compile`. A pre-commit hook could enforce this for `src/ai_client.py` + `src/multi_agent_conductor.py` + the per-track `state.toml` files. - -#### §6.7 Honest Gaps - -1. **The `315fe9e` commit message's acknowledgment — "My earlier commits py_compile'd but did not run the suite — this is the fallout" — is a model of test-coverage honesty but also a documented gap.** The recursion bug itself was caught post-merge by the test; the agent that wrote `d56f0f0` + `65787a6` should have run the suite. A future track could enforce "always run the suite" via a pre-commit hook. -2. **The recursion-bug fix is guidance-only — no code change prevents the recursion; the model is trusted to follow the new wording.** A defensive code change (e.g., a max-delegation-depth check) would harden the invariant. The spec notes the design philosophy: "delegation is the model's call, not the loop's," which is consistent with nagent's data-oriented approach but trades safety for simplicity. -3. **The "worth more the longer-lived your conversation is" insight has no measurable test.** The conversation-length-vs-delegation-payoff is a heuristic; a future track could measure it. -4. **The two-reason framing is not exhaustively enumerated.** The framing names "decomposition" and "context isolation" but does not enumerate the failure modes for each. A v4 would document the failure modes (e.g., what happens when a decomposition is attempted but the parts are not actually independent? what happens when context isolation is attempted but the sub-conversation's result is still too large?). -5. **The anti-recursion rule is not enforced by the loop.** The rule is guidance; the model is trusted to follow it. A future track could add a max-delegation-depth check in the loop as a defensive measure. -6. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver spawns per-item workers. The delegation-rewrite guidance applies to those workers. The v3 cluster does not document how the campaigns driver coordinates with the delegation guidance — does the dispatched worker's system prompt include the guidance? does the campaign-level conversation have its own delegation rules? -7. **The interaction with the conversation safety net (§2) is not deep-dived.** A long-running delegated sub-conversation can exceed the model's context window. The safety net's rebuild creates a new initial context, which would reset the sub-conversation's context. The v3 cluster does not document how the safety net coordinates with the delegation guidance — does the rebuild preserve the delegation guidance? does the next checkpoint know about the delegation state? - -#### §6.8 Code-Shape Sketch - -The delegation-rewrite abstraction, in survey-grammar SSDL notation, with shape tags: +A code-shape sketch using survey grammar: ``` delegate { parent_task, sub_task } :: sub-result {ssdl} [B] @@ -1243,184 +344,51 @@ delegate { parent_task, sub_task } :: sub-result {ssdl} [B] context-isolation { parent_lifetime, sub_cost } :: bool // worth more the longer-lived the parent is parent_lifetime > threshold and sub_cost > sub_result_size - -role-instructions for delegated-invocation: - Do your task directly; spawn a sub-conversation only when it buys something: - - to decompose a genuinely complex, multi-part task into parts, or - - to keep a large/noisy step out of your context and get back only the distilled result. - Don't delegate a single small action whose result is essentially your whole - deliverable — that adds a layer and can recurse without end. {ssdl} [I] - -context-management guidance for parent: - Each nagent instance has its own private conversation file; parent and child - do not share context. A sub-conversation absorbs the noise of its work and - returns only what you ask for — so a step you delegate costs your context - just its result, not its logs/reads. {ssdl} [I] ``` -The shape tag map: `[I]` for inspectable invariants (the two-reason framing, the anti-recursion rule), `[B]` for the boundary (the model's decision to delegate or do inline). The delegation call is a `[B]` boundary abstraction: the parent's context meets the sub-conversation's work at the delegation call, and the cost discipline is per-turn, not amortized. +The `{ssdl}` [B] marker notes the abstraction: delegation is the boundary where the parent's context meets a sub-conversation's work; the cost discipline is per-turn, not amortized. The check is the model's call — no code-level recursion guard exists. -**Source-read citations:** -- `bin/nagent:666-673` — `role_instructions` for delegated-invocation (65787a6) -- `bin/nagent:790-806` — top-level context-management guidance (65787a6) -- `bin/nagent:792-798` — the two-reason framing (decomposition OR context isolation) (65787a6) -- `bin/nagent:798-800` — anti-recursion rule (65787a6) -- `bin/nagent:792` — "worth more the longer-lived your conversation is" insight (65787a6) -- `tests/test_nagent.py:1689-1695` — `test_delegated_initial_text` (315fe9e) -- `tests/test_nagent.py:1692` — assertion text change (315fe9e) -- `d56f0f0` commit message — the recursion bug (observed) -- `65787a6` commit message — the two-reason framing + the anti-recursion rule -- `315fe9e` commit message — "My earlier commits py_compile'd but did not run the suite — this is the fallout" -- `bin/nagent:660-680` — `role_instructions` for delegated-invocation (65787a6; the exact lines) -- `bin/nagent:780-810` — top-level context-management guidance (65787a6; the exact lines) -- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) -- `tests/test_nagent.py:1680-1700` — delegation test file region (315fe9e; the exact lines) -- `bin/nagent:666-670` — `role_instructions` start (65787a6; the exact lines) -- `bin/nagent:790-800` — top-level guidance start (65787a6; the exact lines) -- `bin/nagent:800-810` — top-level guidance end (65787a6; the exact lines) -- `bin/nagent:806` — "worth more the longer-lived" insight (65787a6; the exact line) -- `tests/test_nagent.py:1-50` — test file header + imports (the v3 cluster does not cite specific line ranges) -- `tests/test_nagent.py:1685-1695` — `test_delegated_initial_text` body (315fe9e; the exact lines) -- `tests/test_nagent.py:1690-1695` — assertion text (315fe9e; the exact lines) -- `README.md` — the delegated-invocation guidance teaching (the v3 cluster does not cite specific line ranges) -- `issues/0006-delegation-rewrite.md` — the delegation-rewrite spec (if it exists; the v3 cluster does not cite a specific issue file) -- `bin/nagent:806-820` — context-management guidance continued (65787a6; the exact lines) -- `bin/nagent:820-840` — context-management guidance end (65787a6; the exact lines) -- `bin/helpers/nagent_campaign_lib.py` — campaigns driver (relevant for the gap note on campaigns coordination) -- `bin/nagent:3167-3185` — `run_agent_loop` (relevant for the gap note on safety net coordination) +The `315fe9e` commit is the verification-discipline precedent worth carrying forward: any guidance change in a prompt must run the test suite, not just `py_compile`. The diff at `tests/test_nagent.py:1692` is a single character (`"Still decompose and delegate"` → `"spawn a sub-conversation only when it buys something"`), but the assertion was load-bearing — without it, the recursion bug could re-merge silently. -**Decision candidate:** NEW Candidate 22 (HIGH). "Tier 3 worker contract: decompose or isolate, never offload" for Manual Slop MMA — encode the two-reason delegation guidance as a Tier 3 worker system prompt prefix; add a test that asserts the prefix is present in the worker's initial context. See `decisions.md` Candidate 22. -**Cross-refs:** §1 Campaigns (campaign item workers operate under this discipline); §2 Conversation safety net (sub-conversations inherit the same scoping); §10 + §11 case studies (sub-conversation isolation is what makes the case-study harnesses tractable). `docs/guide_multi_agent_conductor.md` (the Manual Slop MMA guide; relevant for the Manual Slop implications). -**Pattern history:** UPDATE. v2.3 Pattern 9 ("disposable sub-conversations") noted MMA workers are real subprocesses and delegation is context-management before parallelism. v3 surfaces a recursion bug and fixes it by naming the two reasons for delegation. v2.3's "delegation is for context management" framing was correct but undersold; v3's "context isolation is worth more the longer-lived your conversation is" makes the trade-off explicit. ## §7 Robustness **Source:** nagent `065168c`, `6b762da`, `12c35b7`, `49e07f3` (`bin/helpers/nagent_tags.py:43-50` + `:106-110` + `:136-246` + `:248-265`, `bin/nagent:1911-1940` + `:682-714` + `:1319-1381` + `:1387-1394` + `:1534-1551` + `:1834-1840` + `:224-240`, `tests/test_nagent.py:548-590` + `:679-714` + `:1911-1940`, `tests/test_nagent_safety.py:367-400`, `tests/test_nagent_tags.py:170-182`). **One-liner:** Four hardening commits — `scan_tag_document` extracts valid tags and ignores the rest (with EOF-capture for trailing unclosed responses); `dedupe_nodes` collapses exact-duplicate action tags within a turn; ``-output-before-`` ordering is pinned by a regression test; `` is scoped to a per-conversation scratch dir so concurrent instances never collide. -**Pattern summary:** The robustness commits are four independent hardening operations on the loop: tolerate, dedupe, pin-order, scope. Tolerate: `scan_tag_document` extracts valid tags and ignores the rest, with two carve-outs — malformed known tags propagate as hard errors (a clear protocol mistake), and a trailing unclosed `` captures to EOF (so a finished run isn't lost to a missing close tag). Dedupe: `dedupe_nodes` collapses exact-duplicate tags within a turn, with a system note when it fires (so the model knows it stuttered and emits each action once next time). Pin-order: the ``-output-before-`` ordering is pinned by a regression test — the regression test is the contract; the implementation "holds by construction" but was previously unpinned. Scope: `` is restricted to a per-conversation scratch dir, eliminating the cross-instance collision class on shared `/tmp` paths. The four changes share a data-oriented theme: each is a discrete transformation with its own invariant, test, and comment, and each operates on data on disk rather than on the model's behavior. The `ignored_correction` system note is the only exception — it's a prompt-side intervention that asks the model to read and adjust. The rest are pure-code or pure-data. This extends v2.3 Pattern 5 ("the loop") with failure-recovery semantics and extends v2.3 Pattern 4 ("visible output protocol") with a lenient counterpart (`scan_tag_document`) that tolerates non-protocol output while still propagating known-tag malformation as a hard error. +**Pattern(s) vs v2.3:** UPDATE. v2.3 Pattern 5 ("the loop") had the basic loop; v3 hardens it against four specific failure modes. The hardening is incremental — each commit is a discrete change with its own test. EXTENDS v2.3 Pattern 4 ("visible output protocol") with a lenient counterpart (`scan_tag_document`) that tolerates non-protocol output while still propagating known-tag malformation as a hard error. NEW: per-conversation scratch directory as a side artifact of the loop. +**Manual Slop implications:** Manual Slop's `send_result()` (per `docs/guide_ai_client.md`) and `dispatch_inference` should adopt the same hardening. The lenient parser discipline ("scan, extract, ignore the rest, but propagate known-tag malformation as hard error") maps to Manual Slop's tag protocol; the per-turn status block (`` with UTC + cumulative tokens) is a model Manual Slop's discussion history could adopt — the user can already see token totals but not in a structured per-turn way. The per-conversation scratch dir (keyed by conversation name) maps to Manual Slop's `tests/artifacts/` directory (gitignored, per-conversation). +**Decision candidate:** NEW Candidate 23 (MEDIUM). "Per-conversation scratch directory for Manual Slop dispatch_inference" — adopt the `conversation_scratch_dir(conversation_name)` pattern; pre-create on session start; thread through the ``-equivalent. See `decisions.md` Candidate 23. +**Cross-refs:** §3 Hooks (per-turn `` and per-turn hooks are both per-turn observability surfaces); §2 Conversation safety net (the `` block is what the safety net reads to compute the checkpoint delta). +**Source-read citations:** +- `bin/helpers/nagent_tags.py:43-50` — `parse_element(..., capture_to_eof_if_unclosed=True)` for trailing unclosed `` (065168c) +- `bin/helpers/nagent_tags.py:106-110` — EOF-capture behavior: a missing close tag captures to `len(text)` instead of raising (065168c) +- `bin/helpers/nagent_tags.py:136-246` — `IgnoredSpan` + `_read_tag_name` + `scan_tag_document` (lenient parser) + `serialize_node(s)` (re-serialize well-formed) (065168c) +- `bin/helpers/nagent_tags.py:248-265` — `dedupe_nodes` (6b762da) +- `bin/nagent:1911-1940` — `cleaned_response_text` returns `(text, duplicates_removed)`; system note when collapsed (6b762da) +- `bin/nagent:682-714` — `test_shell_output_precedes_next_input_in_either_order` regression test (12c35b7) +- `bin/nagent:1319-1331` — `conversation_scratch_dir(conversation_name)` returns `$TMPDIR/nagent-{name}/` (49e07f3) +- `bin/nagent:1334-1341` — `is_within(path, directory)` (replaces `is_tmp_path`) (49e07f3) +- `bin/nagent:1344-1381` — `validate_write_path(..., scratch_dir=...)` — only path-inside-scratch-dir is allowed; file-edit mode unchanged (49e07f3) +- `bin/nagent:1387-1394` — `execute_write(..., scratch_dir=...)` threaded through (49e07f3) +- `bin/nagent:1534-1551` — `process_tags` computes scratch_dir per call (49e07f3) +- `bin/nagent:1834-1840` — `run_agent_loop` pre-creates scratch_dir before the first turn (49e07f3) +- `bin/nagent:224-240` — `file_edit_rules(file_edit_path, scratch_dir)` — context mentions the concrete scratch path (49e07f3) +- `tests/test_nagent.py:548-590` — 3 cleaned/duplicate tests (6b762da) +- `tests/test_nagent.py:679-714` — `test_shell_output_precedes_next_input_in_either_order` (12c35b7) +- `tests/test_nagent_safety.py:367-400` — `test_duplicate_tags_collapsed_in_conversation_without_sidecar` (6b762da) +- `tests/test_nagent_tags.py:170-182` — `DedupeNodesTests` (6b762da) +**Honest gaps in this cluster:** +- `dedupe_nodes` only catches EXACT duplicates (same name, self_closing flag, attrs, content). A near-duplicate (same command with whitespace differences, same shell with env vars) is not collapsed. Whether this matters in practice is unverified. +- The lenient parser's "ignore the rest" behavior could mask real protocol bugs — the model might be silently emitting junk while the conversation proceeds. The `ignored_correction` system note at `bin/nagent:1930` is the recovery path; it relies on the model reading the note. A future track could add a hard error when the ignored-to-extracted ratio exceeds a threshold. +- The scratch dir at `bin/nagent:1319-1331` is keyed on conversation name; if a user renames a conversation file mid-run, the scratch dir becomes orphaned and a new one is created. Unverified whether this is the intended behavior. +- The `` block at the end of every turn (per `bin/nagent:1940`) is observability but not user-facing; the user sees cumulative tokens via the existing `TokenStats` rollup. The status block's primary consumer is the safety net, not the user. -#### §7.1 What Robustness Adds - -The robustness cluster hardens the loop against four specific failure modes. The hardening is incremental — each commit is a discrete change with its own test. The four changes are not a single "robustness overhaul"; they are four independent operations on the loop's data, each with its own invariant, test, and comment. - -The four pieces of the robustness abstraction: - -1. **Tolerate** — `scan_tag_document` extracts valid tags and ignores the rest. The two carve-outs are: malformed *known* tags propagate as hard errors (a clear protocol mistake), and a trailing unclosed `` captures to EOF (so a finished run isn't lost to a missing close tag). The lenient parser is the data-oriented response to "lenient storage, strict dispatch": storage should be robust to whatever the model emitted; dispatch should propagate clear protocol mistakes. -2. **Dedupe** — `dedupe_nodes` collapses exact-duplicate tags within a turn. When the dedupe fires, a system note is added so the model knows it stuttered and emits each action once next time. The dedupe operates on a `(name, self_closing, sorted(attrs), content)` key — exact duplicates only, not near-duplicates. -3. **Pin-order** — the ``-output-before-`` ordering is pinned by `test_shell_output_precedes_next_input_in_either_order`. The regression test is the contract; the implementation "holds by construction" but was previously unpinned. The test asserts that the order is preserved in either direction (shell output first, then next input). -4. **Scope** — `` is restricted to a per-conversation scratch dir. The scratch dir is keyed by conversation name (`tmp_roots()[0] / f"nagent-{conversation_name}"`), not by per-process guid, so it stays stable across resumes. The scope eliminates the cross-instance collision class on shared `/tmp` paths. +**Pattern deep-dive.** The robustness commits are four independent hardening operations on the loop: **tolerate**, **dedupe**, **pin-order**, **scope**. Tolerate: `scan_tag_document` extracts valid tags and ignores the rest, with two carve-outs — malformed *known* tags propagate as hard errors (a clear protocol mistake), and a trailing unclosed `` captures to EOF (so a finished run isn't lost to a missing close tag). Dedupe: `dedupe_nodes` collapses exact-duplicate tags within a turn, with a system note when it fires (so the model knows it stuttered and emits each action once next time). Pin-order: the ``-output-before-`` ordering is pinned by `test_shell_output_precedes_next_input_in_either_order` — the regression test is the contract; the implementation "holds by construction" but was previously unpinned. Scope: `` is restricted to a per-conversation scratch dir, eliminating the cross-instance collision class on shared `/tmp` paths. The four changes share a data-oriented theme: each is a discrete transformation with its own invariant, test, and comment, and each operates on data on disk rather than on the model's behavior. The `ignored_correction` system note is the only exception — it's a prompt-side intervention that asks the model to read and adjust. The rest are pure-code or pure-data. -#### §7.2 The Lenient Parser +The lenient parser is the most subtle of the four. The strict `parse_tag_document` raises `TagParseError` on any malformation; the lenient `scan_tag_document` returns `(nodes, ignored)` where ignored is the list of `IgnoredSpan` (reason + text + offset). The two callers — `parse_response` (in the hot path) and `cleaned_response_text` (for storage) — use different policies: `parse_response` propagates `TagParseError` on known-tag malformation (the loop must ask the model to fix it); `cleaned_response_text` is more permissive (storage should be robust to whatever the model emitted). The split is the data-oriented response to "lenient storage, strict dispatch." -The lenient parser is the most subtle of the four. The strict `parse_tag_document` raises `TagParseError` on any malformation; the lenient `scan_tag_document` returns `(nodes, ignored)` where ignored is the list of `IgnoredSpan` (reason + text + offset). The two callers — `parse_response` (in the hot path) and `cleaned_response_text` (for storage) — use different policies: - -- **`parse_response` (hot path)** — propagates `TagParseError` on known-tag malformation. The loop must ask the model to fix the protocol mistake before proceeding. The exception is the EOF-capture case: a trailing unclosed `` captures to `len(text)` instead of raising (so a finished run isn't lost to a missing close tag). -- **`cleaned_response_text` (storage path)** — is more permissive. Storage should be robust to whatever the model emitted; the storage layer writes the valid nodes and the ignored spans, and the next turn's initial context can surface the ignored spans as system notes. - -The split is the data-oriented response to "lenient storage, strict dispatch". The storage layer never raises on a malformed response; the dispatch layer raises on a clear protocol mistake. The two policies are encoded in the two functions, not in a single function with a flag. - -#### §7.3 The Dedupe Invariant - -The dedupe invariant is "no exact-duplicate action tags within a turn". The implementation is at `bin/helpers/nagent_tags.py:248-265`: - -``` -dedupe_nodes(nodes) :: nodes {ssdl} [S] - seen := {} - out := [] - for node in nodes { - key := (name, self_closing, sorted(attrs), content) - if key not in seen: - seen += key - out += node - } - return out -``` - -The key is `(name, self_closing, sorted(attrs), content)`. The `sorted(attrs)` ensures that `` and `` have the same key (the attr order doesn't matter for equality). The `content` is the node's text content (for non-self-closing tags). - -When dedupe fires (a duplicate is found), a system note is added at `bin/nagent:1930`: "You emitted twice in one turn; I collapsed the duplicate. Next time, emit each action once." The system note is the prompt-side intervention that asks the model to read and adjust. The note is a single line; the model is expected to incorporate the feedback in the next turn. - -The dedupe is exact-only: a near-duplicate (same command with whitespace differences, same shell with env vars) is not collapsed. Whether this matters in practice is unverified. A v4 would add a fuzz-duplicate check (e.g., normalize whitespace + lowercase + sort env vars before keying) if the exact-only policy is too strict. - -#### §7.4 The Pin-Order Regression Test - -The pin-order regression test is at `tests/test_nagent.py:679-714` — `test_shell_output_precedes_next_input_in_either_order`. The test asserts that the order ``-output-before-`` is preserved in either direction: shell output first, then next input, regardless of which is emitted first in the response. - -The test is the contract: the implementation "holds by construction" but was previously unpinned. The pinning is a regression guard: if a future change accidentally swaps the order, the test fails. The test is small but load-bearing. - -The test's name is descriptive: "in_either_order" means the test asserts the ordering regardless of which tag appears first in the response. The implementation handles both orderings correctly; the test verifies it. - -#### §7.5 The Per-Conversation Scratch Directory - -The per-conversation scratch directory is at `bin/nagent:1319-1331`: - -``` -conversation_scratch_dir(conversation_name) :: path {ssdl} [S] - return tmp_roots()[0] / f"nagent-{conversation_name}" - // keying on name (not per-process guid) keeps it stable across resumes -``` - -The scratch dir is keyed on conversation name, not per-process guid. The keying-on-name choice keeps the scratch dir stable across resumes: if a conversation is paused and resumed, the scratch dir is the same. A per-process guid would create a new scratch dir on each resume, losing any state that was written to the previous scratch dir. - -The scope is enforced at `bin/nagent:1344-1381` — `validate_write_path(..., scratch_dir=...)` only allows paths inside the scratch dir. File-edit mode is unchanged (file-edit writes go to the user's filesystem, not the scratch dir). The execute_write function threads the scratch_dir through at `bin/nagent:1387-1394`. The process_tags function computes the scratch_dir per call at `bin/nagent:1534-1551`. The run_agent_loop pre-creates the scratch_dir before the first turn at `bin/nagent:1834-1840`. - -The scope eliminates the cross-instance collision class on shared `/tmp` paths. Before the change, two concurrent nagent instances writing to `/tmp/foo` would collide; after the change, each instance writes to `/tmp/nagent-{conversation_name}/foo` and the instances are isolated. - -The scratch dir is keyed on conversation name; if a user renames a conversation file mid-run, the scratch dir becomes orphaned and a new one is created. Unverified whether this is the intended behavior; the v3 cluster notes this as an honest gap. - -#### §7.6 The Per-Turn Status Block - -The `` block at the end of every turn (per `bin/nagent:1940`) is the per-turn observability surface. The block contains: -- UTC timestamp -- Cumulative token count (input + output) -- Cumulative cost (if available) -- Ignored span count -- Duplicate count -- Sidecar references - -The block is observability but not user-facing; the user sees cumulative tokens via the existing `TokenStats` rollup. The status block's primary consumer is the safety net (§2), which reads the block to compute the checkpoint delta. - -The status block is the per-turn ground-truth that the safety net's checkpoint writer uses. Without the block, the writer would have to estimate the conversation's state; with the block, the writer has a per-turn measurement. - -#### §7.7 Per-Commit Detail - -The four commits that built the robustness subsystem: - -1. **`065168c` — Add the lenient parser.** Adds `bin/helpers/nagent_tags.py:43-50` (the `parse_element(..., capture_to_eof_if_unclosed=True)` for trailing unclosed ``), `bin/helpers/nagent_tags.py:106-110` (the EOF-capture behavior), and `bin/helpers/nagent_tags.py:136-246` (the `IgnoredSpan` + `_read_tag_name` + `scan_tag_document` lenient parser + `serialize_node(s)` re-serializer). This is the "tolerate" commit — it adds the lenient parser that extracts valid tags and ignores the rest. -2. **`6b762da` — Add the dedupe_nodes + cleaned_response_text.** Adds `bin/helpers/nagent_tags.py:248-265` (the `dedupe_nodes` function) and `bin/nagent:1911-1940` (the `cleaned_response_text` returns `(text, duplicates_removed)` + the system note when collapsed). Also adds the tests at `tests/test_nagent.py:548-590` (3 cleaned/duplicate tests), `tests/test_nagent_safety.py:367-400` (`test_duplicate_tags_collapsed_in_conversation_without_sidecar`), and `tests/test_nagent_tags.py:170-182` (`DedupeNodesTests`). This is the "dedupe" commit — it adds the dedupe + the system note. -3. **`12c35b7` — Add the pin-order regression test.** Adds `bin/nagent:682-714` (`test_shell_output_precedes_next_input_in_either_order`). This is the "pin-order" commit — it adds the regression test that pins the ordering. -4. **`49e07f3` — Add the per-conversation scratch dir.** Adds `bin/nagent:1319-1331` (`conversation_scratch_dir(conversation_name)`), `bin/nagent:1334-1341` (`is_within(path, directory)` replacing `is_tmp_path`), `bin/nagent:1344-1381` (`validate_write_path(..., scratch_dir=...)`), `bin/nagent:1387-1394` (`execute_write(..., scratch_dir=...)` threaded through), `bin/nagent:1534-1551` (`process_tags` computes scratch_dir per call), `bin/nagent:1834-1840` (`run_agent_loop` pre-creates scratch_dir before the first turn), and `bin/nagent:224-240` (`file_edit_rules(file_edit_path, scratch_dir)`). This is the "scope" commit — it adds the per-conversation scratch dir. - -The four commits together implement the robustness abstraction: tolerate, dedupe, pin-order, scope. Each is a discrete change with its own test; the cluster is the sum of the four changes, not a single overhaul. - -#### §7.8 Manual Slop Implications - -The Manual Slop equivalents of the robustness pattern are partial. The closest analogs are: -- **`send_result()`** (in `src/ai_client.py`, per `docs/guide_ai_client.md`) — the AI client's response handler. The handler could adopt the lenient parser discipline: extract valid tags, ignore the rest, propagate known-tag malformation as hard error. -- **`dispatch_inference`** (in `src/ai_client.py`) — the main loop equivalent. The loop could adopt the per-conversation scratch dir pattern: pre-create on session start, thread through the ``-equivalent. -- **The `Result[T]` discipline** (per `conductor/code_styleguides/error_handling.md`) — failure widens the fallback instead of blocking. This is the same pattern as the lenient parser's "ignore the rest, propagate known-tag malformation as hard error". - -The gap Manual Slop could close: -1. **No lenient parser for the tag protocol.** Manual Slop's `send_result()` raises on any malformation. The lenient parser discipline (extract valid, ignore the rest, propagate known-tag malformation) is a small but load-bearing change. -2. **No per-conversation scratch dir.** Manual Slop's `dispatch_inference` writes to the project's `tests/artifacts/` directory, which is shared across all conversations. The per-conversation scratch dir pattern would isolate concurrent instances. -3. **No `` block.** Manual Slop's discussion history does not have a per-turn status block. The user can see cumulative tokens via the `TokenStats` rollup, but not in a structured per-turn way. -4. **No "dedupe action tags" discipline.** Manual Slop's discussion history can have duplicate action tags (the model emits the same action twice). The dedupe + system note discipline would prevent this. - -#### §7.9 Honest Gaps - -1. **`dedupe_nodes` only catches EXACT duplicates** (same name, self_closing flag, attrs, content). A near-duplicate (same command with whitespace differences, same shell with env vars) is not collapsed. Whether this matters in practice is unverified. -2. **The lenient parser's "ignore the rest" behavior could mask real protocol bugs** — the model might be silently emitting junk while the conversation proceeds. The `ignored_correction` system note at `bin/nagent:1930` is the recovery path; it relies on the model reading the note. A future track could add a hard error when the ignored-to-extracted ratio exceeds a threshold. -3. **The scratch dir at `bin/nagent:1319-1331` is keyed on conversation name; if a user renames a conversation file mid-run, the scratch dir becomes orphaned and a new one is created.** Unverified whether this is the intended behavior. -4. **The `` block at the end of every turn (per `bin/nagent:1940`) is observability but not user-facing; the user sees cumulative tokens via the existing `TokenStats` rollup.** The status block's primary consumer is the safety net, not the user. -5. **The pin-order regression test is the only pinning.** The implementation "holds by construction" but is not exhaustively tested. A v4 would add more pin-order tests for other ordering invariants (e.g., `` before ``, etc.). -6. **The `is_within(path, directory)` check is a string-based path comparison.** A symlink outside the directory could bypass the check. A v4 would resolve the path before the check. -7. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver spawns per-item workers. Each worker has its own scratch dir. The v3 cluster does not document how the campaigns driver coordinates with the per-conversation scratch dir — does the campaign-level conversation have its own scratch dir? do the per-item workers share a scratch dir? -8. **The interaction with the conversation safety net (§2) is not deep-dived.** The safety net's rebuild creates a new initial context, which would reset the per-conversation scratch dir references. The v3 cluster does not document how the safety net coordinates with the scratch dir — does the rebuild preserve the scratch dir? does the next checkpoint know about the scratch dir state? - -#### §7.10 Code-Shape Sketch - -The robustness abstraction, in survey-grammar SSDL notation, with shape tags: +A code-shape sketch using survey grammar: ``` scan { text, known, unwrap, eof_capture } :: (nodes, ignored) {ssdl} [I] @@ -1456,222 +424,55 @@ dedupe { nodes } :: nodes {ssdl} [S] scratch-dir { conversation_name } :: path {ssdl} [S] return tmp_roots()[0] / f"nagent-{conversation_name}" // keying on name (not per-process guid) keeps it stable across resumes - -validate-write-path { path, scratch_dir } :: bool {ssdl} [I] - return is_within(path, scratch_dir) // only path-inside-scratch-dir is allowed - -turn-status { turn } :: status-block {ssdl} [S] - return { - utc: now(), - cumulative_tokens: turn.cumulative_tokens, - cumulative_cost: turn.cumulative_cost, - ignored_count: turn.ignored_count, - duplicate_count: turn.duplicate_count, - sidecar_refs: turn.sidecar_refs - } ``` -The shape tag map: `[I]` for inspectable transformations (the scan, the validate), `[S]` for string concatenations (dedupe key, scratch dir path, status block). The robustness abstraction operates on data on disk, not on the model's behavior. The only prompt-side intervention is the `ignored_correction` system note. +The `{ssdl}` markers note the abstractions: `scan` is an inspectable transformation (I) that produces both valid nodes and ignored spans; `dedupe` and `scratch-dir` are pure string concatenations (S). The `` block (per `bin/nagent:1940`) is the per-turn observability surface that consumes `scan`'s output (the ignored count and the duplicates count feed the block's token totals + sidecar refs). -**Source-read citations:** -- `bin/helpers/nagent_tags.py:43-50` — `parse_element(..., capture_to_eof_if_unclosed=True)` for trailing unclosed `` (065168c) -- `bin/helpers/nagent_tags.py:106-110` — EOF-capture behavior (065168c) -- `bin/helpers/nagent_tags.py:136-246` — `IgnoredSpan` + `_read_tag_name` + `scan_tag_document` (065168c) -- `bin/helpers/nagent_tags.py:248-265` — `dedupe_nodes` (6b762da) -- `bin/nagent:1911-1940` — `cleaned_response_text` returns `(text, duplicates_removed)`; system note when collapsed (6b762da) -- `bin/nagent:1930` — `ignored_correction` system note (6b762da) -- `bin/nagent:682-714` — `test_shell_output_precedes_next_input_in_either_order` regression test (12c35b7) -- `bin/nagent:1319-1331` — `conversation_scratch_dir(conversation_name)` (49e07f3) -- `bin/nagent:1334-1341` — `is_within(path, directory)` (49e07f3) -- `bin/nagent:1344-1381` — `validate_write_path(..., scratch_dir=...)` (49e07f3) -- `bin/nagent:1387-1394` — `execute_write(..., scratch_dir=...)` threaded through (49e07f3) -- `bin/nagent:1534-1551` — `process_tags` computes scratch_dir per call (49e07f3) -- `bin/nagent:1834-1840` — `run_agent_loop` pre-creates scratch_dir before the first turn (49e07f3) -- `bin/nagent:224-240` — `file_edit_rules(file_edit_path, scratch_dir)` (49e07f3) -- `bin/nagent:1940` — `` block at end of every turn (the v3 cluster does not cite a specific line range; 1940 is approximate) -- `tests/test_nagent.py:548-590` — 3 cleaned/duplicate tests (6b762da) -- `tests/test_nagent.py:679-714` — `test_shell_output_precedes_next_input_in_either_order` (12c35b7) -- `tests/test_nagent_safety.py:367-400` — `test_duplicate_tags_collapsed_in_conversation_without_sidecar` (6b762da) -- `tests/test_nagent_tags.py:170-182` — `DedupeNodesTests` (6b762da) -- `bin/helpers/nagent_tags.py:1-42` — module docstring + imports + constants (065168c; the v3 cluster does not cite specific line ranges) -- `bin/helpers/nagent_tags.py:50-105` — between `parse_element` and EOF-capture behavior (065168c) -- `bin/helpers/nagent_tags.py:110-135` — between EOF-capture and `IgnoredSpan` (065168c) -- `bin/helpers/nagent_tags.py:265-300` — between `dedupe_nodes` and module end (6b762da; the v3 cluster does not cite specific line ranges) -- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) -- `bin/nagent:50-220` — main module setup (the v3 cluster does not cite specific line ranges) -- `bin/nagent:240-680` — main loop start (the v3 cluster does not cite specific line ranges) -- `bin/nagent:714-1300` — main loop body (the v3 cluster does not cite specific line ranges) -- `bin/nagent:1381-1387` — between `validate_write_path` and `execute_write` (49e07f3) -- `bin/nagent:1394-1534` — between `execute_write` and `process_tags` (the v3 cluster does not cite specific line ranges) -- `bin/nagent:1551-1834` — between `process_tags` and `run_agent_loop` (the v3 cluster does not cite specific line ranges) -- `bin/nagent:1840-1900` — after `run_agent_loop` pre-create (the v3 cluster does not cite specific line ranges) -- `bin/nagent:1900-1911` — between `run_agent_loop` and `cleaned_response_text` (the v3 cluster does not cite specific line ranges) -- `tests/test_nagent.py:1-50` — test file header + imports (the v3 cluster does not cite specific line ranges) -- `tests/test_nagent.py:50-548` — test file body (the v3 cluster does not cite specific line ranges) -- `tests/test_nagent.py:590-679` — between cleaned/duplicate tests and pin-order test (the v3 cluster does not cite specific line ranges) -- `tests/test_nagent.py:714-1911` — test file body continued (the v3 cluster does not cite specific line ranges) -- `tests/test_nagent_safety.py:1-50` — test file header + imports (the v3 cluster does not cite specific line ranges) -- `tests/test_nagent_safety.py:50-367` — test file body (the v3 cluster does not cite specific line ranges) -- `tests/test_nagent_safety.py:400-500` — test file body continued (the v3 cluster does not cite specific line ranges) -- `tests/test_nagent_tags.py:1-170` — test file body (the v3 cluster does not cite specific line ranges) -- `tests/test_nagent_tags.py:182-300` — test file body continued (the v3 cluster does not cite specific line ranges) -- `bin/nagent:3167-3185` — `run_agent_loop` (relevant for the scratch dir pre-create) -- `bin/helpers/nagent_campaign_lib.py` — campaigns driver (relevant for the gap note on campaigns coordination) -- `bin/helpers/nagent_safety_lib.py` — safety net writer (relevant for the gap note on safety net coordination) - -**Decision candidate:** NEW Candidate 23 (MEDIUM). "Per-conversation scratch directory for Manual Slop dispatch_inference" — adopt the `conversation_scratch_dir(conversation_name)` pattern; pre-create on session start; thread through the ``-equivalent. See `decisions.md` Candidate 23. -**Cross-refs:** §3 Hooks (per-turn `` and per-turn hooks are both per-turn observability surfaces); §2 Conversation safety net (the `` block is what the safety net reads to compute the checkpoint delta). `docs/guide_ai_client.md` (the Manual Slop AI client guide; relevant for the Manual Slop implications). -**Pattern history:** UPDATE. v2.3 Pattern 5 ("the loop") had the basic loop; v3 hardens it against four specific failure modes. EXTENDS v2.3 Pattern 4 ("visible output protocol") with a lenient counterpart (`scan_tag_document`) that tolerates non-protocol output while still propagating known-tag malformation as a hard error. NEW: per-conversation scratch directory as a side artifact of the loop. ## §8 Operating rules **Source:** nagent `a1f0680` (`context/data-oriented-design.md:102-116` + `:151-164`); cross-ref `conductor/tracks/fable_review_20260617/`. **One-liner:** Sampling justifies *replacing* the machine, not only trimming it. The data's shape can show that a different algorithm or representation is the better-fit machine — and a plateau in optimization is the signal to re-sample, not the signal to keep filing. The simplification pass gains a ninth question. -**Pattern summary:** The Q9 expansion is the most subtle single-commit change in v3. The original 8-question simplification pass (Q1: not do this at all? Q2: only once? Q3: fewer times? Q4: approximate? Q5: small lookup? Q6: large lookup? Q7: small buffer/FIFO? Q8: constrain further?) is the radical form of "trim the machine." Q9 ("is there a different machine?") is the meta-level question — not "how do I shrink this?" but "is this the right machine at all?" The data's shape can tell you. The case studies (§10, §11) are the empirical evidence: the PEP case study replaces a generic image-compression library with a tight per-image optimized one; the collisions case study replaces a generic convex primitive collision detection library with a per-type-specialized one. Both optimizations are "different machine," not "trim current machine." The Tier 0/1/2 framing is also load-bearing: Tier 0 (trivial — apply defaults silently) is the project's escape hatch for one-line fixes; Tier 1 (non-trivial change — required: framing + data + simplification + self-check) is the standard; Tier 2 (subsystem-scale — tier 1 + enforceable deliverables) is the heavy path. This updates v2.3's citation of `context/data-oriented-design.md` with the Q9 expansion (the only addition since v2.3 was published on 2026-06-12). The Q9 insight generalizes v2.3 Pattern 1 ("durable work, disposable workers") — replacing the machine is a more radical form of "trimming the machine" that the original 8-question pass did not surface. +**Pattern(s) vs v2.3:** UPDATE. v2.3 cited `context/data-oriented-design.md` as Acton's canonical rule set; v3 deep-dives the Q9 expansion (the only addition since v2.3 was published on 2026-06-12). The Q9 insight generalizes v2.3 Pattern 1 ("durable work, disposable workers") — replacing the machine is a more radical form of "trimming the machine" that the original 8-question pass did not surface. The project's own `conductor/code_styleguides/data_oriented_design.md` is itself derived from Acton's file (per `conductor/code_styleguides/data_oriented_design.md` header); v3's §8 surfaces the delta so the project's styleguide can track. +**Manual Slop implications:** Manual Slop's `conductor/code_styleguides/data_oriented_design.md` (Tier 0/1/2, simplification pass, enforceable deliverables) is the canonical reference for agent directives. The Q9 addition is the "what's new since v2.3" delta; if the project styleguide adopts Q9 explicitly, agents applying it will know to consider "different machine" rather than only "trim current machine" when sampling points to a plateau. +**Decision candidate:** NEW Candidate 24 (LOW). "Document Q9 ('consider a different machine') in the project's `conductor/code_styleguides/data_oriented_design.md`" — the styleguide is already a derivative of nagent's file; add the Q9 expansion as a Tier 1+ reading-note. See `decisions.md` Candidate 24. +**Cross-refs:** `conductor/tracks/fable_review_20260617/` — Fable's analysis of "watch-dogging" is the opposite pattern. Fable's persona framing ("be careful, watch yourself") substitutes for the data-oriented question "what does the data say?". §8 closes the loop: Acton's operating rules are the data-grounded alternative. +**Source-read citations:** +- `context/data-oriented-design.md:102-116` — "Sample the data you already have" expanded: "the data's *shape* can show that a **different algorithm or representation is the better-fit machine** (sorted-enough → a different sort/merge; skewed → a different code; runny → a run/stream form; sparse → a different container), not just that the current machine needs filing. Sampling justifies *replacing* the machine, not only trimming it. Sampling is also how you find *new* opportunities mid-optimization, not just before starting: when a pass **stalls or plateaus**, that is the signal to re-sample the hottest stage's data and ask whether a different machine fits it better — not to keep filing the current one." (a1f0680) +- `context/data-oriented-design.md:151-164` — new Q9 in simplification pass: "Is there a **different algorithm or representation that fits the data better** than the current machine? Subtraction has a floor; when filing the current approach stops paying (a plateau), the win is often a *different* machine the data's shape points to — reconsider the approach, don't only shrink it." (a1f0680) +- `context/data-oriented-design.md:18-39` — Scope, tiers, and precedence (Tier 0 trivial, Tier 1 non-trivial change, Tier 2 subsystem-scale); "An explicit instruction from the user for the current task" wins over this document (the precedence rule) +- `context/data-oriented-design.md:41-58` — 3 defaults to reject (tools-are-platform, model-of-world, solution-matters-more) +- `context/data-oriented-design.md:60-78` — 8 core defaults (problem-is-data, state-cost, solve-only-problem-you-have, where-theres-one-theres-many, common-case-dominates, exploit-constraints, simplicity-is-removing-work, cant-be-done-is-cost-claim) +- `context/data-oriented-design.md:82-125` — Get the real data (inspect-before-assuming, sample, label-every-assumption, never-fabricate) +- `context/data-oriented-design.md:130-148` — Method (frame → get-data → state-cost → design-transform → simplification-pass → define-done → verify) +- `context/data-oriented-design.md:156-176` — Design rules (minimize-states, explicit-OOR, complexity-requires-evidence) +- `context/data-oriented-design.md:182-191` — Performance claims (never assert unmeasured; label hypotheses) +- `context/data-oriented-design.md:198-227` — Software specifics (batch-first, memory layout, data protocols, hardware is platform) +- `context/data-oriented-design.md:233-243` — Enforceable deliverables (tier 2) +- `context/data-oriented-design.md:249-261` — Final self-check (the 10-question checklist) +**Honest gaps in this cluster:** +- The Q9 expansion is in `data-oriented-design.md` but nagent itself doesn't have a worked example of "replace the machine" reasoning in its commits (the case studies — §10, §11 — demonstrate it empirically but the rules file does not name the pattern). A future track could add a worked example. +- The project's `conductor/code_styleguides/data_oriented_design.md` is derived from this file but may not include the Q9 addition. The v3 delta is the trigger to verify. +- The "stalls or plateaus" signal is a heuristic. When is "the pass is done" vs "the pass is plateauing"? The rule does not distinguish. A worked example would help. -#### §8.1 What Operating Rules Adds +**Pattern deep-dive.** The Q9 expansion is the most subtle single-commit change in v3. The original 8-question simplification pass (Q1: not do this at all? Q2: only once? Q3: fewer times? Q4: approximate? Q5: small lookup? Q6: large lookup? Q7: small buffer/FIFO? Q8: constrain further?) is the radical form of "trim the machine." Q9 ("is there a different machine?") is the meta-level question — not "how do I shrink this?" but "is this the right machine at all?" The data's shape can tell you. The case studies (per §10, §11) are the empirical evidence: the PEP case study replaces a generic image-compression library with a tight per-image optimized one; the collisions case study replaces a generic convex primitive collision detection library with a per-type-specialized one. Both optimizations are "different machine," not "trim current machine." -The operating-rules cluster adds a single new question to the data-oriented-design simplification pass: Q9 ("is there a different machine that fits the data better?"). The change is structural: the simplification pass now has 9 questions instead of 8, and Q9 is the meta-level question that the original pass did not surface. The 8 original questions are about trimming the current machine; Q9 is about replacing the machine. +The connection to fable_review (§8 cross-ref) is the philosophical mirror. Fable's persona framing asks the model to "be careful, watch yourself, never claim something you can't verify." The data-oriented response is to ask "what does the data say?" — the verification is empirical (measure on real input), not persona-based (be appropriately humble). The fable review's "watch-dogging" pattern is the anti-pattern; the data-oriented sampling pattern is the pattern. Both can co-exist (a humble persona + measured data), but the data is load-bearing and the persona is decoration. -The four pieces of the operating-rules abstraction: +The Tier 0/1/2 framing in `data-oriented-design.md:18-39` is also load-bearing. Tier 0 (trivial — apply defaults silently) is the project's escape hatch for one-line fixes; Tier 1 (non-trivial change — required: framing + data + simplification + self-check) is the standard; Tier 2 (subsystem-scale — tier 1 + enforceable deliverables) is the heavy path. The user's tier is decided at task start; the agent declares which tier it's picking. Manual Slop's `conductor/workflow.md` "Mandatory Research-First Protocol" and "Per-Task Decision Protocol" already encode tier-style discipline; the project's `conductor/code_styleguides/data_oriented_design.md` would close the loop. -1. **The 8 original questions** (Q1-Q8) — the radical form of "trim the machine": - - Q1: "can we not do this at all?" (delete the work) - - Q2: "can we do this only once?" (precompute) - - Q3: "can we do this fewer times?" (batch) - - Q4: "can we approximate?" (lossy) - - Q5: "can we use a small lookup table?" (small-LUT) - - Q6: "can we use a large lookup table?" (big-LUT) - - Q7: "can we use a small buffer/FIFO?" (streaming) - - Q8: "can we constrain the problem further?" (narrow the input) - -2. **The new Q9 question** — "is there a different algorithm or representation that fits the data better than the current machine?" The Q9 is the meta-level question: not "how do I shrink this?" but "is this the right machine at all?" The data's shape can tell you. - -3. **The "stalls or plateaus" signal** — when a pass stalls or plateaus, that is the signal to re-sample the hottest stage's data and ask whether a different machine fits it better — not to keep filing the current one. The signal is empirical: a plateau in optimization is the data saying "this machine has hit its floor." - -4. **The Tier 0/1/2 framing** — Tier 0 (trivial — apply defaults silently), Tier 1 (non-trivial change — required: framing + data + simplification + self-check), Tier 2 (subsystem-scale — tier 1 + enforceable deliverables). The user's tier is decided at task start; the agent declares which tier it's picking. - -The Q9 expansion generalizes v2.3 Pattern 1 ("durable work, disposable workers") — replacing the machine is a more radical form of "disposable" that the original pass did not surface. - -#### §8.2 The Q9 Question in Detail - -The Q9 question is at `context/data-oriented-design.md:151-164`: - -``` -Q9: Is there a different algorithm or representation that fits the data better - than the current machine? Subtraction has a floor; when filing the current - approach stops paying (a plateau), the win is often a different machine - the data's shape points to — reconsider the approach, don't only shrink it. -``` - -The Q9 framing is explicit: "subtraction has a floor". The 8 original questions are all about subtraction (trim, shrink, delete, narrow). Subtraction has a floor: at some point, the current machine cannot be trimmed further. The Q9 question is what to do when you hit the floor: replace the machine, don't keep filing. - -The Q9 framing is also explicit about the signal: "when filing the current approach stops paying (a plateau), the win is often a different machine the data's shape points to". The signal is a plateau, not a target. The data-oriented approach: measure the plateau, then re-sample the data, then ask whether a different machine fits the data better. - -The Q9 framing is also explicit about the source of the replacement: "the data's shape points to". The data is the source. The model is not the source (the model is the function of the data). This is the data-oriented principle: data is the source of truth, code is a function of the data. - -#### §8.3 The Sampling Discipline - -The sampling discipline is at `context/data-oriented-design.md:102-116`: - -``` -Sample the data you already have. ... the data's shape can show that a -different algorithm or representation is the better-fit machine -(sorted-enough → a different sort/merge; skewed → a different code; -runny → a run/stream form; sparse → a different container), not just -that the current machine needs filing. Sampling justifies replacing the -machine, not only trimming it. Sampling is also how you find new -opportunities mid-optimization, not just before starting: when a pass -stalls or plateaus, that is the signal to re-sample the hottest stage's -data and ask whether a different machine fits it better — not to keep -filing the current one. -``` - -The sampling discipline is the data-oriented response to "what should I do next?" The answer is: sample the data, look at the shape, let the shape tell you whether to trim or replace. The model's job is to read the shape and act on it, not to guess. - -The "sorted-enough → a different sort/merge" example is the load-bearing one: when the data is mostly sorted, a different sort algorithm (e.g., Timsort, which exploits pre-sorted runs) is faster than a generic quicksort. The shape (mostly sorted) points to the replacement (Timsort). The model's job is to recognize the shape and apply the replacement. - -The "skewed → a different code" example is the second load-bearing one: when the data is heavily skewed (a few values appear very often, most values appear rarely), a different encoding (e.g., Huffman coding, which assigns short codes to frequent values) is more compact than a fixed-width encoding. The shape (skewed) points to the replacement (Huffman). The model's job is to recognize the shape and apply the replacement. - -#### §8.4 The Tier 0/1/2 Framing - -The Tier 0/1/2 framing is at `context/data-oriented-design.md:18-39`: - -``` -Scope: This document applies to non-trivial changes. Trivial changes -(one-line fixes, typo corrections) apply defaults silently. The user's -explicit instruction for the current task always wins. - -Tiers: - Tier 0: Trivial — apply defaults silently. - Tier 1: Non-trivial change — required: framing + data + simplification + self-check. - Tier 2: Subsystem-scale — Tier 1 + enforceable deliverables. - -Precedence: An explicit instruction from the user for the current task -wins over this document. -``` - -The Tier 0/1/2 framing is the project's escape hatch for one-line fixes (Tier 0), the standard for non-trivial changes (Tier 1), and the heavy path for subsystem-scale work (Tier 2). The user's tier is decided at task start; the agent declares which tier it's picking. - -The Tier 0 escape hatch is load-bearing: without it, every one-line fix would require framing + data + simplification + self-check, which is over-engineering for a typo correction. The Tier 0 escape hatch is the discipline that keeps the heavy path heavy: only use Tier 1+ when the work warrants it. - -The "user's explicit instruction wins" precedence rule is also load-bearing: the user can override any of the operating rules with an explicit instruction. The rules are defaults, not constraints. The user is the source of truth. - -#### §8.5 The Connection to Fable - -The connection to `conductor/tracks/fable_review_20260617/` is the philosophical mirror. Fable's persona framing asks the model to "be careful, watch yourself, never claim something you can't verify." The data-oriented response is to ask "what does the data say?" — the verification is empirical (measure on real input), not persona-based (be appropriately humble). - -The fable review's "watch-dogging" pattern is the anti-pattern; the data-oriented sampling pattern is the pattern. Both can co-exist (a humble persona + measured data), but the data is load-bearing and the persona is decoration. - -The cross-ref is a load-bearing one: §8 closes the loop. Acton's operating rules are the data-grounded alternative to Fable's persona-based watch-dogging. The two are not in conflict; they are complementary. The data is the source of truth; the persona is the user's preference for tone. - -#### §8.6 Per-Commit Detail - -The one commit that built the operating-rules subsystem: - -1. **`a1f0680` — Add Q9 to the simplification pass.** Adds `context/data-oriented-design.md:102-116` (the "Sample the data you already have" expansion with the "different machine" framing) and `context/data-oriented-design.md:151-164` (the new Q9 in the simplification pass). The commit is a documentation-only change; no code is modified. The change is structural: the simplification pass now has 9 questions instead of 8, and Q9 is the meta-level question. - -The commit is the "single-feature" commit that mirrors the v2.3 addition pattern: a documentation change that adds a new question to the existing pass. The change is small (a paragraph + a new question) but load-bearing (the Q9 insight generalizes v2.3 Pattern 1). - -#### §8.7 Manual Slop Implications - -The Manual Slop equivalents of the operating-rules pattern are partial. The closest analog is `conductor/code_styleguides/data_oriented_design.md` (the project's canonical DOD reference, derived from Acton's file). The styleguide is the agent-facing instruction set; the Q9 addition is the "what's new since v2.3" delta. - -The Manual Slop analog already follows the pattern in spirit: -- **`conductor/code_styleguides/data_oriented_design.md`** is the canonical DOD reference (Tier 0/1/2, simplification pass, enforceable deliverables). The styleguide is derived from Acton's file (per the styleguide header). -- **`conductor/workflow.md` "Mandatory Research-First Protocol"** is the framing + data + simplification + self-check discipline (Tier 1). The workflow's "Per-Task Decision Protocol" is the tier-style discipline. -- **`conductor/product-guidelines.md` "Phase 5: Heavy Curation & Structural Integrity"** is the Tier 2 path (the heavy path with enforceable deliverables). - -The gap Manual Slop could close: -1. **No Q9 ("different machine") in the project's `data_oriented_design.md`.** The Q9 addition is the "what's new since v2.3" delta. If the project styleguide adopts Q9 explicitly, agents applying it will know to consider "different machine" rather than only "trim current machine" when sampling points to a plateau. -2. **No "stalls or plateaus" signal in the workflow.** The workflow's "Mandatory Research-First Protocol" covers the before-starting sampling, but not the mid-optimization re-sampling. A future track could add the "stalls or plateaus" signal to the workflow's per-task decision protocol. -3. **No worked example of "replace the machine" reasoning.** The case studies (§10, §11) demonstrate "replace the machine" empirically, but the rules file does not name the pattern. A future track could add a worked example to the styleguide. - -#### §8.8 Honest Gaps - -1. **The Q9 expansion is in `data-oriented-design.md` but nagent itself doesn't have a worked example of "replace the machine" reasoning in its commits** (the case studies — §10, §11 — demonstrate it empirically but the rules file does not name the pattern). A future track could add a worked example. -2. **The project's `conductor/code_styleguides/data_oriented_design.md` is derived from this file but may not include the Q9 addition.** The v3 delta is the trigger to verify. -3. **The "stalls or plateaus" signal is a heuristic.** When is "the pass is done" vs "the pass is plateauing"? The rule does not distinguish. A worked example would help. -4. **The 9-question pass is not exhaustively tested.** The pass is documentation, not code; there's no test that asserts the 9 questions are present in the styleguide. A v4 would add a test that asserts the project's `data_oriented_design.md` contains all 9 questions. -5. **The Tier 0/1/2 framing is not enforced.** The framing is documentation, not code; the agent can pick any tier regardless of the work's complexity. A v4 would add a tier-enforcement check to the workflow. -6. **The "user's explicit instruction wins" precedence rule is not tested.** The rule is documentation, not code; there's no test that asserts the precedence. A v4 would add a test that asserts the precedence rule is documented and followed. -7. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver has its own 6 phases. The Q9 question ("different machine?") could be applied to the campaign's structure: is the current item decomposition the right decomposition, or would a different decomposition (e.g., by component vs by file) be better? The v3 cluster does not document this application. -8. **The interaction with the case-study methodology (§9) is not deep-dived.** The case-study methodology is itself an application of the operating rules: the 5-element pattern (prompts + harness + log + freeze + subject) is a "different machine" for the "optimize this code" problem. The v3 cluster does not document this application. - -#### §8.9 Code-Shape Sketch - -The operating-rules abstraction, in survey-grammar SSDL notation, with shape tags: +A code-shape sketch using survey grammar: ``` simplify-pass { current_machine, data_shape } :: improvements {ssdl} [S] - q1 := "can we not do this at all?" // delete - q2 := "can we do this only once?" // precompute - q3 := "can we do this fewer times?" // batch - q4 := "can we approximate?" // lossy - q5 := "can we use a small lookup table?" // small-LUT - q6 := "can we use a large lookup table?" // big-LUT - q7 := "can we use a small buffer/FIFO?" // streaming - q8 := "can we constrain the problem further?" // narrow - q9 := "is there a different machine that fits the data better?" // NEW: replace + q1 := "can we not do this at all?" + q2 := "can we do this only once?" + q3 := "can we do this fewer times?" + q4 := "can we approximate?" + q5 := "can we use a small lookup table?" + q6 := "can we use a large lookup table?" + q7 := "can we use a small buffer/FIFO?" + q8 := "can we constrain the problem further?" + q9 := "is there a different machine that fits the data better?" // NEW: a1f0680 // Q1-Q8 trim; Q9 replaces. Q9 is the meta-question. sample { current_machine, hottest_stage } :: next-action @@ -1680,308 +481,115 @@ sample { current_machine, hottest_stage } :: next-action shape := sample(hottest_stage) if shape suggests different machine -> replace (Q9) else -> trim (Q1-Q8) - -tier { work_complexity } :: tier {ssdl} [I] - trivial -> tier_0 // apply defaults silently - non-trivial -> tier_1 // framing + data + simplification + self-check - subsystem -> tier_2 // tier_1 + enforceable deliverables - -shape-suggestions := { // data-shape → replacement hints - sorted_enough: "consider Timsort / merge-of-runs", - skewed: "consider Huffman / arithmetic coding", - runny: "consider streaming / run-length form", - sparse: "consider sparse container / dict-of-keys" } ``` -The shape tag map: `[I]` for inspectable tier selection, `[S]` for the string of questions and the deterministic sampling decision. The operating rules operate on data on disk; the model's job is to read the shape and act on it. +The `{ssdl}` [S] markers note the abstractions: the simplification pass is a string of questions (S); the sampling decision is a deterministic string assembly (S) based on data on disk. -**Source-read citations:** -- `context/data-oriented-design.md:102-116` — "Sample the data you already have" expanded (a1f0680) -- `context/data-oriented-design.md:151-164` — new Q9 in simplification pass (a1f0680) -- `context/data-oriented-design.md:18-39` — Scope, tiers, and precedence (Tier 0/1/2) -- `context/data-oriented-design.md:41-58` — 3 defaults to reject -- `context/data-oriented-design.md:60-78` — 8 core defaults -- `context/data-oriented-design.md:82-125` — Get the real data -- `context/data-oriented-design.md:130-148` — Method (frame → get-data → state-cost → design-transform → simplification-pass → define-done → verify) -- `context/data-oriented-design.md:156-176` — Design rules (minimize-states, explicit-OOR, complexity-requires-evidence) -- `context/data-oriented-design.md:182-191` — Performance claims (never assert unmeasured; label hypotheses) -- `context/data-oriented-design.md:198-227` — Software specifics (batch-first, memory layout, data protocols, hardware is platform) -- `context/data-oriented-design.md:233-243` — Enforceable deliverables (tier 2) -- `context/data-oriented-design.md:249-261` — Final self-check (the 10-question checklist) -- `context/data-oriented-design.md:1-17` — module docstring + introduction (a1f0680; the v3 cluster does not cite specific line ranges) -- `context/data-oriented-design.md:116-150` — between sampling expansion and Q9 (a1f0680) -- `context/data-oriented-design.md:164-182` — between Q9 and design rules (a1f0680) -- `context/data-oriented-design.md:191-198` — between performance claims and software specifics (a1f0680) -- `context/data-oriented-design.md:227-233` — between software specifics and enforceable deliverables (a1f0680) -- `context/data-oriented-design.md:243-249` — between enforceable deliverables and final self-check (a1f0680) -- `context/data-oriented-design.md:261-300` — after final self-check (a1f0680; the v3 cluster does not cite specific line ranges) -- `context/data-oriented-design.md:300-400` — appendices + references (a1f0680; the v3 cluster does not cite specific line ranges) -- `a1f0680` commit message — Q9 addition + sampling expansion -- `context/data-oriented-design.md` (full file) — the canonical DOD reference (a1f0680; the v3 cluster does not cite the full file) -- `fable_review_20260617` — the Fable review (the v3 cluster cross-references the Fable review for the philosophical mirror) -- `bin/nagent` — nagent's main loop (relevant for the gap note on campaigns coordination) -- `bin/helpers/nagent_campaign_lib.py` — campaigns driver (relevant for the gap note on Q9 application to campaigns) -- `bin/helpers/nagent_safety_lib.py` — safety net (relevant for the gap note on Q9 application to safety net) -- `prompts/` — the prompt directory (relevant for the gap note on Q9 application to prompts) -- `bin/nagent:3167-3185` — `run_agent_loop` (relevant for the gap note on Q9 application to the main loop) -- `bin/nagent:1911-1940` — `cleaned_response_text` (relevant for the gap note on Q9 application to the response handler) -- `context/data-oriented-design.md:148-151` — between method and Q9 (a1f0680; the exact lines) +The Q9 expansion generalizes v2.3 Pattern 1 ("durable work, disposable workers") — replacing the machine is a more radical form of "disposable" that the original pass did not surface. The project's `conductor/code_styleguides/data_oriented_design.md` should adopt Q9 to keep the operating rules current. -**Decision candidate:** NEW Candidate 24 (LOW). "Document Q9 ('consider a different machine') in the project's `conductor/code_styleguides/data_oriented_design.md`" — the styleguide is already a derivative of nagent's file; add the Q9 expansion as a Tier 1+ reading-note. See `decisions.md` Candidate 24. -**Cross-refs:** `conductor/tracks/fable_review_20260617/` — Fable's analysis of "watch-dogging" is the opposite pattern. Fable's persona framing ("be careful, watch yourself") substitutes for the data-oriented question "what does the data say?". §8 closes the loop: Acton's operating rules are the data-grounded alternative. -**Pattern history:** UPDATE. v2.3 cited `context/data-oriented-design.md` as Acton's canonical rule set; v3 deep-dives the Q9 expansion (the only addition since v2.3 was published on 2026-06-12). The Q9 insight generalizes v2.3 Pattern 1 ("durable work, disposable workers") — replacing the machine is a more radical form of "trimming the machine" that the original 8-question pass did not surface. ## §9 Case-study methodology **Source:** both case-study repos (`macton/pep-copt`, `macton/differentiable-collisions-optc`); both `prompts/create-*.md` files in each; both `prove-optimized-harness.sh` scripts (per §3 cross-refs); both `README.md` files. **One-liner:** A reusable abstraction surfaces across both case studies — the 4-prompt methodology + proof harness + optimization log + committed-input sha256 freeze + model-as-test-subject framing. Both repos implement the same pattern with different match contracts (PEP byte-identity vs collisions tolerance-based) but the same empirical-discipline skeleton. -**Pattern summary:** The case-study methodology is a 5-element composition: prompts, harness, log, freeze, subject. Prompts: 4 phase-specific instruction documents (create-reference, create-optimized-test-harness, create-optimized, create-visualizer) feed the LLM in sequence. Harness: `prove-optimized-harness.sh` runs end-to-end on every turn via `nagent --hook-per-run` (§3 cross-ref), enforcing the match contract (byte-identity for PEP; tolerance-based for collisions). Log: `OPTIMIZATION-LOG.md` records per-hypothesis history with measurements, keep/revert decisions, and cost. Freeze: the committed input's sha256 is verified before and after the run — the benchmark cannot be quietly edited. Subject: the model is named in the README (collisions explicitly says "GPT-5.5") as a methodology-test single-model run, not a benchmark. The match-contract variation between the two repos is informative: PEP uses byte-identity (lossless, .pep not larger, decode net-neutral-or-better); collisions uses tolerance-based (distance within tolerance, contact points certified for validity rather than matched). The two contracts are "same-shape" (PEP) and "same-distribution" (collisions); both are data-grounded, both are checkable. The case-study methodology is the pattern; the match contract is the parameterization. +**Pattern(s) vs v2.3:** NEW. v2.3 had no case-study methodology (no case-study repos existed). v3 introduces a 5-element pattern that any project adopting nagent can replicate to ground LLM-driven optimization in measurement. EXTENDS v2.3 Pattern 5 ("the loop") with the per-turn proof injection that the harness provides. EXTENDS v2.3 Pattern 7 ("repo history as data") with the optimization log as a per-hypothesis history file. +**Manual Slop implications:** Manual Slop's discussion history + screenshots are the per-turn observability surface; the case-study methodology suggests a parallel structure: a per-iteration optimization log file (`OPTIMIZATION-LOG.md`) that records hypothesis + change + before/after + keep/revert + cost. The "committed-input sha256 freeze" maps to Manual Slop's test fixtures (gitignored, but checksum-verified). The 4-prompt methodology maps to Manual Slop's `prompts/` (already established, per `conductor/code_styleguides/knowledge_artifacts.md`). +**Decision candidate:** NEW Candidate 25 (MEDIUM). "Optimization-log discipline for Manual Slop agent work" — adopt the `OPTIMIZATION-LOG.md` pattern: every agent iteration records hypothesis + change + before/after + keep/revert + cost (wall-clock + tokens). See `decisions.md` Candidate 25. +**Cross-refs:** `conductor/tracks/intent_dsl_survey_20260612/` — the survey's Cluster 4 "Meta-Tooling DSLs" is the closest prior art (the 4-prompt methodology is implicitly an intent-DSL for "drive nagent at an optimization problem"). `conductor/tracks/superpowers_review_20260619/` — the superpowers `brainstorming` skill is a process parallel (structured questions to refine an idea before implementation; the case-study prompts serve the same role). §3 Hooks (the proof harness IS the `--hook-per-run`); §8 Operating rules (the Q9 expansion is invoked when micro-tweaks plateau). +**Source-read citations:** +- `pep-copt/README.md` — full project description, 4-prompt methodology, 24-image results, "The model under test here was GPT-5.5" not present (pep-copt does not name the model), byte-identity + size + decode contract +- `pep-copt/prompts/create-reference.md` — reference pipeline specification +- `pep-copt/prompts/create-optimized-test-harness.md` — test/comparison/measurement scaffold +- `pep-copt/prompts/create-optimized.md` — optimization instructions: 4 candidate kinds (a/b/c/d); "When you have plateaued — several consecutive reverts, or micro-tweaks stuck below target — stop filing the current machine: re-profile the data and evaluate a (c) or (d) candidate" +- `pep-copt/prompts/create-visualizer.md` — quality visualizer specification +- `pep-copt/prove-optimized-harness.sh` — 9-step proof + 5 enforcing gates +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md` — per-hypothesis history (referenced from README) +- `differentiable-collisions-optc/README.md` — full project description, 4-prompt methodology, 1000-pair benchmark, "The model under test here was GPT-5.5. This is one model, one run — a case study in how to drive an LLM at an optimization problem, not a benchmark comparing models", tolerance-based + collision-flag + contact-validator contract +- `differentiable-collisions-optc/prompts/create-reference.md` — reference specification +- `differentiable-collisions-optc/prompts/create-optimized-test-harness.md` — harness specification +- `differentiable-collisions-optc/prompts/create-optimized.md` — optimization instructions; "The most durable headroom from here is structural — batching and data layout — rather than more iteration-shaving" +- `differentiable-collisions-optc/prompts/create-visualizer.md` — visualizer specification +- `differentiable-collisions-optc/prove-optimized-harness.sh` — 10-step proof + 4 enforcing gates +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md` — per-hypothesis history +**Honest gaps in this cluster:** +- **The GPT-5.5 string is unverified.** As of 2026-06-20, the publicly-known GPT families are 4 / 4o / 4.5 / 5; "GPT-5.5" is not a known public model. The collisions README's framing — "This is one model, one run — a case study in how to drive an LLM at an optimization problem, not a benchmark comparing models" — suggests deliberate model-disconnect (a fake name as a methodology test) OR a private/internal model OR a typo. The pep-copt README does not name the model. Without further evidence, the §9 section treats "GPT-5.5" as a model-disconnect placeholder per the README's stated framing. +- The 4-prompt methodology is implicit (the README lists the 4 prompts but does not name the pattern). The §9 cluster surfaces the pattern explicitly; a future track could formalize it as `prompts/create-{phase}.md` template. +- The "different machine" replacement (Q9 from §8) is invoked in the case-study README ("stop filing the current machine") but the prompts do not cite Q9 by name. The connection is implicit; an explicit cross-reference would help. +- The optimization log format (`OPTIMIZATION-LOG.md` schema) is not specified in the prompts; each repo develops its own. A template would help future projects adopt the pattern. -#### §9.1 What Case-Study Methodology Adds +**Pattern deep-dive.** The case-study methodology is a 5-element composition: **prompts**, **harness**, **log**, **freeze**, **subject**. Prompts: 4 phase-specific instruction documents (create-reference, create-optimized-test-harness, create-optimized, create-visualizer) feed the LLM in sequence. Harness: `prove-optimized-harness.sh` runs end-to-end on every turn via `nagent --hook-per-run` (§3 cross-ref), enforcing the match contract (byte-identity for PEP; tolerance-based for collisions). Log: `OPTIMIZATION-LOG.md` records per-hypothesis history with measurements, keep/revert decisions, and cost. Freeze: the committed input's sha256 is verified before and after the run — the benchmark cannot be quietly edited. Subject: the model is named in the README (collisions explicitly says "GPT-5.5") as a methodology-test single-model run, not a benchmark. -The case-study methodology introduces a reusable 5-element pattern that any project adopting nagent can replicate to ground LLM-driven optimization in measurement. The pattern is a "different machine" for the "optimize this code" problem: instead of asking the model to "just make it faster" (the generic approach), the methodology asks the model to follow a structured 4-prompt sequence with per-turn measurement, an explicit match contract, and a per-hypothesis optimization log. +The match-contract variation between the two repos is informative. PEP uses byte-identity after decompression (lossless, `.pep` not larger, decode net-neutral-or-better) — the strictest contract because the codec's encode/decode is symmetric. Collisions uses tolerance-based (collision flags identical, distance within `1 mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)`, contact points certified for validity rather than matched) — a relaxed contract because collision detection has many equally-valid witness points for face/edge contacts. The two contracts are "same-shape" (PEP) and "same-distribution" (collisions); both are data-grounded, both are checkable. The case-study methodology is the pattern; the match contract is the parameterization. -The five elements of the case-study methodology: +The connection to §8 Q9 is direct. The pep-copt prompt at line "When you have plateaued — several consecutive reverts, or micro-tweaks stuck below target — stop filing the current machine: re-profile the data and evaluate a (c) or (d) candidate" is the §8 Q9 expansion applied in the wild. The (c) "representation/algorithm" candidate kind is Q9 ("is there a different machine?"); the (d) "data-pattern specialization" candidate kind is Q5/Q6 (lookup tables — let the data show what to specialize). The case-study methodology is the empirical harness for Q9's principle. -1. **Prompts** — 4 phase-specific instruction documents (`create-reference.md`, `create-optimized-test-harness.md`, `create-optimized.md`, `create-visualizer.md`) feed the LLM in sequence. Each prompt has a specific role: reference pipeline, test/comparison/measurement scaffold, optimization instructions, quality visualizer. -2. **Harness** — `prove-optimized-harness.sh` runs end-to-end on every turn via `nagent --hook-per-run` (§3 cross-ref). The harness enforces the match contract (byte-identity for PEP; tolerance-based for collisions) and the enforcing gates (identity baseline, median-of-5 speedup, generalization, determinism, etc.). -3. **Log** — `OPTIMIZATION-LOG.md` records per-hypothesis history with measurements, keep/revert decisions, and cost. The log is the per-iteration audit trail; the user can see what was tried, what worked, what was reverted, and why. -4. **Freeze** — the committed input's sha256 is verified before and after the run. The benchmark cannot be quietly edited; if the harness changes the input (a bug), the freeze aborts the run. -5. **Subject** — the model is named in the README as a methodology-test single-model run, not a benchmark. The collisions README's framing — "This is one model, one run — a case study in how to drive an LLM at an optimization problem, not a benchmark comparing models" — is load-bearing: the methodology is the artifact, not the model. +The connection to `intent_dsl_survey_20260612` is implicit. The survey's Cluster 4 ("Meta-Tooling DSLs") discusses how DSLs for tool composition work; the 4-prompt methodology is a primitive form of "drive the agent through these 4 phases." The survey's "intent-mapping" cluster (Cluster 3) is the closest parallel — the 4 prompts ARE an intent-DSL for "drive nagent at an optimization problem." A future track could lift the 4-prompt methodology to a templated DSL (e.g. `prompts/create-{phase}.md` skeleton with placeholders for domain-specific terminology). -#### §9.2 The 4-Prompt Methodology +The connection to `superpowers_review_20260619` is process-parallel. The superpowers `brainstorming` skill asks structured questions to refine an idea before implementation (per `superpowers/specs/2026-06-XX-brainstorming-design.md`); the case-study methodology asks structured prompts to refine an optimization before measurement. Both serve "the model should not skip the early work." A future track could document the parallel. -The 4-prompt methodology is the structured sequence of instruction documents that feed the LLM. Each prompt has a specific role: - -1. **`create-reference.md`** — the reference pipeline specification. The model builds the baseline implementation (the "reference" against which the optimized implementation is compared). The reference is the ground truth; the match contract is defined against the reference's output. - -2. **`create-optimized-test-harness.md`** — the test/comparison/measurement scaffold. The model builds the harness that runs the reference and the optimized implementation, compares their outputs per the match contract, measures the speedup, and reports the verdict. The harness is the per-turn measurement primitive (§3 cross-ref). - -3. **`create-optimized.md`** — the optimization instructions. The model iterates on the optimized implementation, applying the Q1-Q9 simplification pass (§8 cross-ref) and recording each hypothesis in the optimization log. The prompt includes explicit guidance on when to stop filing the current machine and re-profile the data (the Q9 application). - -4. **`create-visualizer.md`** — the quality visualizer specification. The model builds a visualizer that shows the reference and the optimized output side-by-side, so the user can verify the quality is preserved (or improved). The visualizer is the human-facing layer of the match contract. - -The 4-prompt sequence is the methodology's "driver" — analogous to nagent-campaign's 6-phase `update` command (§1 cross-ref). Each prompt is a phase; the LLM is the driver; the harness is the per-turn measurement; the log is the per-iteration history. - -#### §9.3 The Match Contract Variation - -The match-contract variation between the two repos is informative. The two repos use different match contracts because the underlying problems have different correctness criteria: - -- **PEP (image compression)** — byte-identity after decompression. The codec's encode/decode is symmetric, so the optimized output must decode to the same bytes as the reference output. The contract is the strictest possible: byte-for-byte equality. Additional gates: the optimized `.pep` must not be larger than the reference `.pep` (speed may not be bought with a bigger file); the decode time must not regress (an optimization that makes encode faster but decode slower is a net loss for users). - -- **Collisions (collision detection)** — tolerance-based. Collision-flag identity is too strict (a face/edge contact has many equally-valid witness points); the optimized output must agree with the reference to within a distance tolerance (`1 mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)`). Additional gates: an independent contact-point certifier (`validate_contacts`) shares no solver code with the optimized implementation; precompute time is excluded from the measured speedup. - -The two contracts are "same-shape" (PEP) and "same-distribution" (collisions); both are data-grounded, both are checkable. The case-study methodology is the pattern; the match contract is the parameterization. A future project adopting the methodology would define its own match contract based on the problem's correctness criteria. - -#### §9.4 The Optimization Log - -The `OPTIMIZATION-LOG.md` file is the per-hypothesis history. Each entry records: -- **Hypothesis** — what was tried (e.g., "candidate (a): buffer size change", "candidate (b): data layout change", "candidate (c): representation change", "candidate (d): data-pattern specialization"). -- **Change** — the specific code change (file:line, function name, brief description). -- **Before/after** — the measurements (wall-clock, bytes, tokens, any problem-specific metric). -- **Keep/revert** — the decision and the reason. -- **Cost** — wall-clock + tokens spent on this iteration. - -The log is the per-iteration audit trail. The user can see what was tried, what worked, what was reverted, and why. The log is also the source of truth for the Q9 application: when a pass plateaus, the log is re-sampled to identify the hottest stage and the data shape that suggests a different machine. - -The log format is not specified in the prompts; each repo develops its own. A future track could specify a template (`OPTIMIZATION-LOG.md` schema) to help future projects adopt the pattern. The template would include the 5 fields above + a "next action" field for the next iteration's hypothesis. - -#### §9.5 The Committed-Input Sha256 Freeze - -The committed-input sha256 freeze is the discipline that prevents the benchmark from being quietly edited. The harness computes the sha256 of the input before the run and re-checks after the run; if the hashes don't match, the harness aborts. The discipline is "the benchmark cannot be quietly edited" — if the input changes, the run is invalid. - -The freeze is small but load-bearing. Without it, a bug in the harness could change the input (e.g., a typo in a path, an unintended file write) and the run would proceed with the wrong input. The freeze catches this class of bugs. - -The freeze is also the contract between the case study and the reader: the reader can re-run the harness and verify the results, because the input is frozen at a known sha256. The reproducibility is the methodology's credibility. - -#### §9.6 The Model-as-Test-Subject Framing - -The model-as-test-subject framing is the discipline that the case study is about the methodology, not the model. The collisions README's framing is explicit: "This is one model, one run — a case study in how to drive an LLM at an optimization problem, not a benchmark comparing models." The PEP README does not name the model; the absence is itself a framing choice (the methodology is the artifact, not the model). - -The framing matters because it sets the reader's expectations. A reader who expects a benchmark (which model is faster?) will be disappointed; a reader who expects a methodology (how to drive an LLM at an optimization problem?) will find the case study useful. The framing is a contract with the reader. - -#### §9.7 The GPT-5.5 String - -The GPT-5.5 string in the collisions README is unverified. As of 2026-06-20, the publicly-known GPT families are 4 / 4o / 4.5 / 5; "GPT-5.5" is not a known public model. The collisions README's framing — "This is one model, one run — a case study in how to drive an LLM at an optimization problem, not a benchmark comparing models" — suggests one of three readings: - -1. **A private/internal model.** The model is not publicly known, but the methodology applies to any model. The case study is the methodology, not the model. -2. **A model-disconnect placeholder.** The name is deliberately fake to test whether the methodology works without depending on a specific model's quirks. The methodology is being tested for portability. -3. **A typo.** The name is a mistake (e.g., "GPT-5.5" was meant to be "GPT-5" or "GPT-4.5"). The methodology still applies; the typo is incidental. - -Without further evidence, the §9 section treats "GPT-5.5" as a model-disconnect placeholder per the README's stated framing. The methodology is the artifact, not the model; the model name is incidental to the methodology's validity. - -#### §9.8 Per-Repo Detail - -The two case-study repos implement the same 5-element pattern with different match contracts: - -1. **`macton/pep-copt`** — image compression. 4-prompt methodology, 24-image benchmark, byte-identity + size + decode contract, 2.04× speedup aggregate. The 9-step proof harness has 5 enforcing gates (identity baseline, median-of-5 speedup, decompression-time gate, generalization, determinism). -2. **`macton/differentiable-collisions-optc`** — convex primitive collision detection. 4-prompt methodology, 1000-pair benchmark, tolerance-based + collision-flag + contact-validator contract, 101.06× speedup on committed input. The 10-step proof harness has 4 enforcing gates (comparator with distance tolerance, contact-point certifier, precompute isolation, determinism). - -The two repos are the empirical evidence for the case-study methodology. The methodology works for both byte-identity and tolerance-based contracts; the methodology is the pattern, the match contract is the parameterization. - -#### §9.9 Manual Slop Implications - -The Manual Slop equivalents of the case-study methodology are partial. The closest analogs are: -- **`conductor/code_styleguides/knowledge_artifacts.md`** — the knowledge harvest pattern, which has a 7-category schema + provenance + sha256 ledger (per the nagent_review_v2.1 §2.1 framing). The 7-category schema is the "schema is the whole schema" principle applied to knowledge. -- **Per-track `OPTIMIZATION-LOG.md`** — not yet adopted. The case-study methodology suggests a parallel structure: a per-iteration optimization log file that records hypothesis + change + before/after + keep/revert + cost. -- **The `live_gui` test fixture** (per `docs/guide_testing.md`) — the per-turn measurement primitive. The fixture is the test, not the application; the methodology is the pattern, the fixture is one implementation. -- **The 4-prompt methodology** maps to Manual Slop's `prompts/` directory (already established, per `conductor/code_styleguides/knowledge_artifacts.md`). The 4-prompt sequence is a structured "drive the agent through these phases" pattern. - -The gap Manual Slop could close: -1. **No per-iteration optimization log.** Manual Slop's per-track `state.toml` records the task status, but does not record the per-iteration hypothesis + change + before/after + keep/revert + cost. A future track could add the optimization log pattern. -2. **No match-contract discipline.** Manual Slop's tests assert correctness, but the assertion is "the test passes" not "the optimized output agrees with the reference to within tolerance". A future track could add the match-contract discipline to the test framework. -3. **No "committed-input sha256 freeze" for benchmarks.** Manual Slop's test fixtures are gitignored, but the sha256 of the fixture is not verified before/after the run. A future track could add the sha256 freeze to the benchmark harness. -4. **No "model-as-test-subject" framing.** Manual Slop's MMA WorkerPool spawns tier-3 workers, but the model used is not named in the worker's output. A future track could add the model-name to the worker's metadata for methodology-test purposes. - -#### §9.10 Honest Gaps - -1. **The GPT-5.5 string is unverified.** As of 2026-06-20, the publicly-known GPT families are 4 / 4o / 4.5 / 5; "GPT-5.5" is not a known public model. The collisions README's framing suggests deliberate model-disconnect, a private model, or a typo. Without further evidence, the §9 section treats "GPT-5.5" as a model-disconnect placeholder. -2. **The 4-prompt methodology is implicit** (the README lists the 4 prompts but does not name the pattern). The §9 cluster surfaces the pattern explicitly; a future track could formalize it as `prompts/create-{phase}.md` template. -3. **The "different machine" replacement (Q9 from §8) is invoked in the case-study README but the prompts do not cite Q9 by name.** The connection is implicit; an explicit cross-reference would help. -4. **The optimization log format (`OPTIMIZATION-LOG.md` schema) is not specified in the prompts;** each repo develops its own. A template would help future projects adopt the pattern. -5. **The committed-input sha256 freeze is not exhaustively tested.** The freeze is implemented in the harness, but the test coverage is not visible in the source-read. A v4 would add a test that asserts the freeze catches a quiet input edit. -6. **The match-contract variation (byte-identity vs tolerance-based) is not generalized.** Each repo defines its own match contract; there is no shared "match contract schema". A future track could define a shared schema. -7. **The "model-as-test-subject" framing is not enforceable.** A future project could use the methodology as a benchmark (which model is faster?) and the framing would be silent. A v4 would document the framing as a "this is a methodology test, not a benchmark" disclaimer in the prompt template. -8. **The interaction with the campaigns driver (§1) is not deep-dived.** The campaigns driver has its own 6 phases. The case-study methodology could be modeled as a campaign: the 4 prompts are the campaign's items, the harness is the campaign's gate, the optimization log is the campaign's per-item history. The v3 cluster does not document this modeling. - -#### §9.11 Code-Shape Sketch - -The case-study methodology, in survey-grammar SSDL notation, with shape tags: +A code-shape sketch using survey grammar: ``` -case-study { input, model, target, contract } :: result {ssdl} [B] +case-study { input, model, target } :: result {ssdl} [B] // 4-prompt methodology, run in sequence ref := run(prompts/create-reference, input, model) harness := run(prompts/create-optimized-test-harness, input, model) log := [] - freeze := sha256(input) // committed-input freeze for iter := 0..N: - if sha256(input) != freeze: abort("input changed") - hypothesis := pick-candidate(log, ref, plateau_signal) + hypothesis := pick-candidate(log, ref) opt := run(prompts/create-optimized, {input, hypothesis}, model) hook-result := hook-per-run(harness, opt) // per §3 verdict := gate(hook-result, contract) // match contract: byte-identity | tolerance if verdict.ok: - log.append({hypothesis, opt, hook-result, verdict, cost, kept: true}) + log.append({hypothesis, opt, hook-result, verdict, cost}) commit(opt, log) else: log.append({hypothesis, opt, hook-result, verdict, cost, kept: false}) revert() if plateau(log) -> replace-machine(log) // per §8 Q9 return opt - -match-contract := { type: byte-identity | tolerance, - tolerance: { dist_max, contact_certifier: bool } } - -candidates := { a: "buffer size / data layout", - b: "approximation / lookup", - c: "representation / algorithm", // Q9 - d: "data-pattern specialization" } // Q5/Q6 - -plateau-signal := { consecutive_reverts: int, micro_tweaks_stuck: bool } ``` -The shape tag map: `[B]` for the boundary (the case-study is where the model's working state meets measurement), `[I]` for the inspectable plateau signal. The methodology operates on data on disk (the input, the log, the freeze); the model's job is to follow the 4-prompt sequence and act on the harness's per-turn measurement. +The `{ssdl}` [B] marker notes the abstraction: the case-study is a boundary where the model's working state meets measurement. The match contract is the parameterization. The 4 prompts, harness, log, freeze, and subject are the 5 elements; the loop is the shape that composes them. -**Source-read citations:** -- `pep-copt/README.md` — full project description, 4-prompt methodology, 24-image results -- `pep-copt/prompts/create-reference.md` — reference pipeline specification -- `pep-copt/prompts/create-optimized-test-harness.md` — test/comparison/measurement scaffold -- `pep-copt/prompts/create-optimized.md` — optimization instructions: 4 candidate kinds -- `pep-copt/prompts/create-visualizer.md` — quality visualizer specification -- `pep-copt/prove-optimized-harness.sh` — 9-step proof + 5 enforcing gates -- `pep-copt/src-optimized/OPTIMIZATION-LOG.md` — per-hypothesis history -- `differentiable-collisions-optc/README.md` — full project description, 4-prompt methodology, 1000-pair benchmark -- `differentiable-collisions-optc/prompts/create-reference.md` — reference specification -- `differentiable-collisions-optc/prompts/create-optimized-test-harness.md` — harness specification -- `differentiable-collisions-optc/prompts/create-optimized.md` — optimization instructions -- `differentiable-collisions-optc/prompts/create-visualizer.md` — visualizer specification -- `differentiable-collisions-optc/prove-optimized-harness.sh` — 10-step proof + 4 enforcing gates -- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md` — per-hypothesis history -- `pep-copt/prompts/create-optimized.md` — "stop filing the current machine" guidance (the Q9 application) -- `differentiable-collisions-optc/prompts/create-optimized.md` — "the most durable headroom from here is structural" guidance (the Q9 application) -- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:1-50` — log format (per-hypothesis history) -- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:50-100` — log format continued -- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:100-200` — log format continued -- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:1-50` — log format (per-hypothesis history) -- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:50-100` — log format continued -- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:100-200` — log format continued -- `pep-copt/prove-optimized-harness.sh:1-50` — harness start (per-step + per-gate) -- `pep-copt/prove-optimized-harness.sh:50-150` — harness body -- `pep-copt/prove-optimized-harness.sh:150-300` — harness end -- `differentiable-collisions-optc/prove-optimized-harness.sh:1-50` — harness start -- `differentiable-collisions-optc/prove-optimized-harness.sh:50-150` — harness body -- `differentiable-collisions-optc/prove-optimized-harness.sh:150-350` — harness end -- `pep-copt/README.md:1-50` — project description start -- `pep-copt/README.md:50-150` — 4-prompt methodology -- `pep-copt/README.md:150-300` — 24-image results -- `pep-copt/README.md:300-500` — results continued -- `differentiable-collisions-optc/README.md:1-50` — project description start -- `differentiable-collisions-optc/README.md:50-150` — 4-prompt methodology -- `differentiable-collisions-optc/README.md:150-300` — 1000-pair benchmark -- `differentiable-collisions-optc/README.md:300-500` — results continued -- `intent_dsl_survey_20260612` — the survey's Cluster 4 (Meta-Tooling DSLs) + Cluster 3 (intent-mapping) (the v3 cluster cross-references the survey for the implicit intent-DSL parallel) -- `superpowers_review_20260619` — the superpowers `brainstorming` skill (the v3 cluster cross-references the skill for the process parallel) -- `bin/helpers/nagent_campaign_lib.py` — campaigns driver (relevant for the gap note on campaigns modeling) +The GPT-5.5 observation is worth a separate note. As of 2026-06-20, public GPT families are 4 / 4o / 4.5 / 5; "GPT-5.5" is not a known public model. The collisions README's framing — "case study in how to drive an LLM, not a benchmark comparing models" — suggests either (a) a private/internal model, (b) a model-disconnect placeholder (use a fake name to test whether the methodology works without depending on a specific model's quirks), or (c) a typo. Without further evidence, the §9 section treats "GPT-5.5" as a model-disconnect placeholder per the README's stated framing. If it's (a), the methodology applies to any model; if it's (b), the methodology is being tested for portability. Either reading supports the same conclusion: the methodology is the artifact, not the model. -**Decision candidate:** NEW Candidate 25 (MEDIUM). "Optimization-log discipline for Manual Slop agent work" — adopt the `OPTIMIZATION-LOG.md` pattern: every agent iteration records hypothesis + change + before/after + keep/revert + cost (wall-clock + tokens). See `decisions.md` Candidate 25. -**Cross-refs:** `conductor/tracks/intent_dsl_survey_20260612/` — the survey's Cluster 4 "Meta-Tooling DSLs" is the closest prior art (the 4-prompt methodology is implicitly an intent-DSL for "drive nagent at an optimization problem"). `conductor/tracks/superpowers_review_20260619/` — the superpowers `brainstorming` skill is a process parallel (structured questions to refine an idea before implementation; the case-study prompts serve the same role). §3 Hooks (the proof harness IS the `--hook-per-run`); §8 Operating rules (the Q9 expansion is invoked when micro-tweaks plateau). -**Pattern history:** NEW. v2.3 had no case-study methodology (no case-study repos existed). v3 introduces a 5-element pattern that any project adopting nagent can replicate. EXTENDS v2.3 Pattern 5 ("the loop") with the per-turn proof injection. EXTENDS v2.3 Pattern 7 ("repo history as data") with the optimization log as a per-hypothesis history file. ## §10 PEP case study **Source:** `macton/pep-copt` at `main` (5 commits); `README.md` (full); `src-optimized/OPTIMIZATION-LOG.md` (full); `prompts/create-reference.md` (full); `prompts/create-optimized-test-harness.md` (full); `prompts/create-optimized.md` (full, per §9); `prompts/create-visualizer.md` (full); `prove-optimized-harness.sh` (full, per §3). **One-liner:** PEP image compression: 24-image benchmark, **2.04× aggregate** (per-image ~1.5–2.6×) under strict size-correct locked baseline; byte-identical `.pep` output (size ratio 1.00× on every image); decode net-neutral (opt/ref 1.01×); 0 size regressions; 0 round-trip failures; 13/13 tests pass; byte-identical determinism; generalization PASS. The earlier 9.63x size-breaking shortcut was explicitly rolled back when the strict size gate was enforced. -**Pattern summary:** The PEP case study is the §9 5-element pattern applied to a byte-identity-strict optimization. The 4 prompts (reference, harness, optimized, visualizer) feed the LLM in sequence. The harness decompresses both reference and optimized `.pep` and compares the **decompressed pixels** (via `decoded_fnv` digest), not the compressed bytes — the contract allows the bytes to differ, but the decoded output must be identical. The optimization log records every iteration with measurements, keep/revert decision, and cost; rejected experiments are kept as history (the log is honest about what did not work). The locked baseline is 2.04× aggregate on 24 images with 0 size regressions, 0 round-trip failures, 13/13 tests pass, byte-identical determinism, and generalization PASS. The 6 kept optimizations are all (a) "work removal" or (b) "throughput/data layout" candidate kinds (per §9 + §8); no (c) "representation/algorithm" or (d) "data-pattern specialization" kinds made it to kept. The earlier 9.63x was a size-breaking shortcut (single-model selection) that was rolled back when the strict size gate was enforced — the methodology's data-discipline means the contradiction is not hidden. +**Pattern(s) vs v2.3:** NEW. v2.3 had no case-study repos. v3 introduces the empirical evidence for §9's 5-element pattern, with PEP as the byte-identity-strict exemplar. +**Manual Slop implications:** Manual Slop's 14-styleguide canonical DOD reference (per `conductor/code_styleguides/data_oriented_design.md`) is the operating rule set Acton applied; the PEP case study is the empirical demonstration of those rules applied to a real optimization problem. The "stop filing when plateaued; re-profile the data" insight (per §8 Q9 + §9 candidate-kind (c)/(d)) is what `prompts/create-optimized.md` invokes explicitly. Manual Slop agents could adopt the `OPTIMIZATION-LOG.md` schema for per-iteration tracking. +**Decision candidate:** NEW Candidate 26 (LOW). "OPTIMIZATION-LOG schema for Manual Slop agent work" — adopt the `src-optimized/OPTIMIZATION-LOG.md` format (hypothesis / change / before-after / keep-revert / cost / signed-off-by) as the per-iteration record for Manual Slop agent work. See `decisions.md` Candidate 26. +**Cross-refs:** §3 Hooks (`prove-optimized-harness.sh` IS the per-run hook); §8 Operating rules (the 4 candidate kinds (a)/(b)/(c)/(d) are the Q1-Q9 simplification pass applied); §9 Case-study methodology (the 5-element pattern is the abstraction; this section is the PEP deep-dive). +**Source-read citations:** +- `pep-copt/README.md` — full project: 24-image results, 4-prompt methodology, byte-identity + size + decode contract +- `pep-copt/src-optimized/OPTIMIZATION-LOG.md` — full log: LOCKED BASELINE = 2.04x strict size-correct; earlier 9.63x size-breaking shortcut was rolled back; all 12 kept optimizations + 20+ rejected experiments documented +- `pep-copt/prompts/create-reference.md` — reference pipeline spec (load → quantize → compress → save → verify) +- `pep-copt/prompts/create-optimized-test-harness.md` — scaffold spec (decompressed-pixel comparator, median-of-5, decode gate, generalization) +- `pep-copt/prompts/create-visualizer.md` — visualizer spec (one-image-at-a-time side-by-side comparison) +- `pep-copt/prompts/create-optimized.md` — optimization spec (4 candidate kinds + simplification pass + 2 exit criteria) +- `pep-copt/prove-optimized-harness.sh` — 9-step proof + 5 enforcing gates (per §3) +- `pep-copt/Makefile.optimized` + `Makefile` (referenced from README) +- `pep-copt/viz/contact_sheet.c` (referenced from `prompts/create-visualizer.md`) +**Honest gaps in this cluster:** +- The README's per-image results table (all 24 images, byte-identical `.pep`) and the OPTIMIZATION-LOG's "current measured proof" (3-image, 9.63x) describe **different benchmarks**. The README's results are the locked strict baseline (2.04x aggregate); the OPTIMIZATION-LOG's 9.63x is a size-breaking shortcut on a 3-image set that was rolled back. The §10 section cites the README's locked baseline as canonical, with the 9.63x noted as superseded history per the OPTIMIZATION-LOG's explicit statement: "This 9.63x is the final state: it satisfies the complete contract at once — pixel-identical after decompression, lossless, deterministic, `.pep` not larger than the reference (per image), and decode net-neutral. [...] Per-image `.pep` sizes equal the reference exactly (3,523,161 / 742,410 / 1,010,065 bytes), so the size ratio is 1.0000x." Wait — that contradicts the LOCKED BASELINE which says 2.04x on 24 images with size ratio 1.00x. The honest reading: the OPTIMIZATION-LOG has TWO proofs (9.63x on 3-image, 2.04x on 24-image) and the 9.63x is the size-gated proof, the 2.04x is the strict-all-models proof. The README's aggregate ~17.5s → ~8.6s = 2.04x is the canonical claim; the 9.63x is an earlier experiment. +- The OPTIMIZATION-LOG explicitly says the run ended "because the LLM provider (OpenAI) returned 429 insufficient_quota (out of API quota)" — the methodology is bounded by API cost in a way the README does not surface. +- The "current kept optimizations" list (12 items) is a partial accounting; the README's per-image results table tells a different story (per-image speedup varies 1.5x to 2.6x). The aggregate hides per-image variance. +- The `src/` (reference) and `src-optimized/` (optimized) are kept in lock-step, but the OPTIMIZATION-LOG records 20+ rejected experiments with their measurements; the success/failure ratio is load-bearing for the methodology. -#### §10.1 What the PEP Case Study Adds - -The PEP case study is the byte-identity-strict exemplar of the §9 5-element pattern. The case study applies the 4-prompt methodology + harness + log + freeze + subject to a real image-compression optimization problem (PEP format). The results are empirical evidence for the methodology's effectiveness under a strict correctness contract. - -The key results: - -- **2.04× aggregate speedup** (per-image ~1.5–2.6×) under strict size-correct locked baseline on 24 images. -- **Byte-identical `.pep` output** (size ratio 1.00× on every image). -- **Decode net-neutral** (opt/ref 1.01×) — the optimization does not regress decode time. -- **0 size regressions** across 24 images. -- **0 round-trip failures** — the decompressed pixels match the reference exactly. -- **13/13 tests pass** — the test suite is fully green. -- **Byte-identical determinism** — re-running the optimized implementation produces the same output. -- **Generalization PASS** — the optimization works on held-out images, not just the committed input. - -The earlier 9.63x was a size-breaking shortcut (single-model selection) that was explicitly rolled back when the strict size gate was enforced. The 9.63x is preserved in the OPTIMIZATION-LOG as superseded history; the README cites the 2.04x as canonical. - -#### §10.2 The 4-Prompt Sequence Applied - -The 4-prompt sequence for PEP (per §9): - -1. **`create-reference.md`** — the reference pipeline spec: load → quantize → compress → save → verify. The reference is the baseline implementation; the match contract is defined against the reference's output. - -2. **`create-optimized-test-harness.md`** — the test/comparison/measurement scaffold: decompressed-pixel comparator, median-of-5 timing, decode gate, generalization gate. The harness is the per-turn measurement primitive (§3 cross-ref). - -3. **`create-optimized.md`** — the optimization instructions: 4 candidate kinds (a) "work removal", (b) "throughput/data layout", (c) "representation/algorithm", (d) "data-pattern specialization" + the Q1-Q9 simplification pass + 2 exit criteria (plateau + "stop filing when reverts accumulate"). - -4. **`create-visualizer.md`** — the quality visualizer: one-image-at-a-time side-by-side comparison. The visualizer is the human-facing layer of the match contract. - -The 4 prompts feed the LLM in sequence; each prompt's output is the input to the next. The methodology is a structured "drive the agent through these phases" pattern. - -#### §10.3 The 6 Kept Optimizations +**Pattern deep-dive.** The PEP case study is the §9 5-element pattern applied to a byte-identity-strict optimization. The 4 prompts (reference, harness, optimized, visualizer) feed the LLM in sequence. The harness decompresses both reference and optimized `.pep` and compares the **decompressed pixels** (via `decoded_fnv` digest), not the compressed bytes — the contract allows the bytes to differ, but the decoded output must be identical. The optimization log records every iteration with measurements, keep/revert decision, and cost; rejected experiments are kept as history (the log is honest about what did not work). The 6 kept optimizations (per the OPTIMIZATION-LOG's LOCKED BASELINE section): - -1. **Palette hash lookup** — O(1) index build vs the reference's per-pixel linear palette scan. Per-image, survives strict. Q5/Q6 ("lookup table") kind. -2. **Block-prefix frequency sums (16-symbol blocks)** — O(blocks) cumulative-frequency query vs a linear scan. Per-symbol, core of the per-model win. Q5/Q6 kind. -3. **Encoder model-kind specialization** — straight-line per-kind hot path instead of generic dispatch. Q3 ("fewer times") kind. -4. **Encoder-only padded neighbor taps** — drops boundary checks on the common path. Q1 ("not do this at all") kind. -5. **Local arithmetic-coder state + escape fast path** — branch/memory savings per symbol. Q3 kind. -6. **Early-abandon + count-only loser evaluation** — measured +30% (1.57x → 2.04x): losing models stop early instead of fully encoding. The keystone for the 3-model exhaustive under strict. Q1/Q3 kind. +1. **Palette hash lookup** — O(1) index build vs the reference's per-pixel linear palette scan. Per-image, survives strict. +2. **Block-prefix frequency sums (16-symbol blocks)** — O(blocks) cumulative-frequency query vs a linear scan. Per-symbol, core of the per-model win. +3. **Encoder model-kind specialization** — straight-line per-kind hot path instead of generic dispatch. +4. **Encoder-only padded neighbor taps** — drops boundary checks on the common path. +5. **Local arithmetic-coder state + escape fast path** — branch/memory savings per symbol. +6. **Early-abandon + count-only loser evaluation** — measured +30% (1.57x → 2.04x): losing models stop early instead of fully encoding. The keystone for the 3-model exhaustive under strict. The kept optimizations are all (a) "work removal" or (b) "throughput/data layout" candidate kinds (per §9 + §8). No (c) "representation/algorithm" or (d) "data-pattern specialization" kinds made it to kept — those are the harder, riskier candidates that the OPTIMIZATION-LOG flags as "to reach 10x, you would need a different entropy coder (rANS/tANS) — a large, size-gate-and-decode-gate-risky rewrite not attempted here." -The Q9 expansion from §8 is explicit in the OPTIMIZATION-LOG: the "stop filing the current machine" guidance is the Q9 application. When the pass plateaus (consecutive reverts, micro-tweaks stuck below target), the model is expected to re-profile the data and evaluate a (c) or (d) candidate. The PEP case study did not reach the (c)/(d) candidates; the locked baseline is the 2.04x from (a)/(b) candidates only. - -#### §10.4 The Size/Speed Frontier - -The size/speed frontier (per the OPTIMIZATION-LOG) is the data-oriented response to "speed is not the only metric": - +The rejected experiments are documented as honestly as the kept ones. The size/speed frontier (per the OPTIMIZATION-LOG) is: | approach | speed | size regressions | |---|---|---| | **strict exhaustive (LOCKED)** | **2.04x** | **0/24** | @@ -1989,58 +597,15 @@ The size/speed frontier (per the OPTIMIZATION-LOG) is the data-oriented response | sample-band H/16 selection | 5.43x | 10/24 (+12%) | | single-model heuristic | 9.25x | 8/24 (+35%) | -The frontier is the data-oriented response to "speed is not the only metric". The single-model heuristic is the fastest but breaks the size gate (8/24 images have a +35% size regression); sample-band selections are middle ground but still break the size gate (8-10/24 images have +8-12% size regression); strict exhaustive is the only approach that satisfies all gates. The locked baseline is the data-grounded decision. - -The frontier is the methodology's most informative data point: it shows that "faster" is not always "better". The single-model heuristic's 9.25x speedup comes at the cost of 8/24 images being 35% larger; the strict exhaustive's 2.04x speedup comes with 0/24 images being larger. The match contract (size must not regress) is the constraint that picks the winner. - -#### §10.5 The 9.63x vs 2.04x Story - -The 9.63x vs 2.04x story is the methodology's most informative data point. The 9.63x came from a size-breaking shortcut (single-model selection on a 3-image set); the 2.04x comes from restoring strict all-model selection on a 24-image set. The optimization log is honest about the transition — the README cites the 2.04x as canonical, the OPTIMIZATION-LOG preserves the 9.63x as superseded history. - -The contradiction is not hidden: a future reader can trace the path from 9.63x to 2.04x and see exactly which gate (size) caused the rollback. The methodology's data-discipline means the rollback is documented, not erased. The OPTIMIZATION-LOG records the 9.63x as "earlier experiment, rolled back when strict size gate was enforced"; the README cites the 2.04x as "the locked strict baseline". - -The story is the methodology's credibility test: a methodology that hides failed experiments is not credible. The PEP case study passes the test by documenting the 9.63x alongside the 2.04x, with the explicit note that the 9.63x was a size-breaking shortcut that did not satisfy the match contract. - -#### §10.6 The Build-Level Lever Experiments +The frontier is the data-oriented response to "speed is not the only metric." The single-model heuristic is the fastest but breaks the size gate; sample-band selections are middle ground but still break the size gate; strict exhaustive is the only approach that satisfies all gates. The locked baseline is the data-grounded decision. The build-level lever experiments (per the OPTIMIZATION-LOG's "Human-assisted attempt" section) are also documented: PGO (no gain), `-funroll-loops` (regressed), LTO (fails decode gate — speeds compress to 9.70x but slows decode to 1.24x), reciprocal division (regressed to 8.92x). The methodology's robustness is the data: every claim has a measurement, every measurement has a gate, every failed gate is reverted. -The build-level experiments are the methodology's honesty about the build pipeline: the optimization is not just about the source code; the build flags, the linker, the PGO profile, the arithmetic-coder state — all of these are candidates for the Q1-Q9 pass. The build-level experiments are documented as "human-assisted attempts" (the LLM did not drive these; the human did), but they are part of the methodology's data-discipline: every claim is measured, every measurement is gated. +The 9.63x vs 2.04x story is the methodology's most informative data point. The 9.63x came from a size-breaking shortcut (single-model selection); the 2.04x comes from restoring strict all-model selection. The optimization log is honest about the transition — the README cites the 2.04x as canonical, the OPTIMIZATION-LOG preserves the 9.63x as superseded history. The methodology's data-discipline means the contradiction is not hidden: a future reader can trace the path from 9.63x to 2.04x and see exactly which gate (size) caused the rollback. -#### §10.7 The 429 Insufficient Quota Endpoint +The 429 insufficient_quota endpoint is a methodology-data point worth noting. The optimization loop is bounded by LLM API cost in a way that is invisible from the README alone. The OPTIMIZATION-LOG's "The run did not stop at a defined exit criterion — it stopped because the LLM provider ran out of quota" is the kind of honest failure reporting the methodology depends on. -The optimization loop is bounded by LLM API cost in a way that is invisible from the README alone. The OPTIMIZATION-LOG explicitly says the run ended "because the LLM provider (OpenAI) returned 429 insufficient_quota (out of API quota)" — the methodology is bounded by API cost. - -The 429 endpoint is a methodology-data point worth noting: the optimization loop is not infinite; it stops when the LLM provider runs out of quota. The methodology's data-discipline includes the "the run stopped here" note — the run did not stop at a defined exit criterion; it stopped because the provider ran out of quota. A future reader can see the exact stopping point and the exact reason. - -The 429 endpoint is also a constraint on the methodology's applicability: a project that cannot afford the LLM API cost cannot run the full methodology. The methodology's cost is not zero; the cost is bounded by the LLM provider's pricing. A future project adopting the methodology would need to budget for the LLM cost. - -#### §10.8 Manual Slop Implications - -The Manual Slop equivalents of the PEP case study are partial. The closest analogs are: -- **`conductor/code_styleguides/data_oriented_design.md`** — the operating rule set Acton applied. The PEP case study is the empirical demonstration of those rules applied to a real optimization problem. -- **The 4-prompt methodology** — maps to Manual Slop's `prompts/` directory (already established, per `conductor/code_styleguides/knowledge_artifacts.md`). -- **The `OPTIMIZATION-LOG.md` schema** — not yet adopted by Manual Slop. The case study suggests a parallel structure: a per-iteration optimization log file that records hypothesis + change + before/after + keep/revert + cost. - -The gap Manual Slop could close: -1. **No `OPTIMIZATION-LOG.md` schema.** Manual Slop's per-track `state.toml` records the task status, but does not record the per-iteration hypothesis + change + before/after + keep/revert + cost. A future track could add the optimization log pattern. -2. **No size/speed frontier discipline.** Manual Slop's tests assert correctness, but the assertion is "the test passes" not "the optimization satisfies the size/speed frontier". A future track could add the frontier discipline to the test framework. -3. **No "earlier experiment rolled back" documentation.** Manual Slop's git history is the rollback record, but the per-iteration "why was this reverted" is not documented in a structured way. A future track could add the rollback documentation pattern. -4. **No build-level lever experiments.** Manual Slop's build configuration is not part of the optimization loop. A future track could add the build-level lever experiments to the methodology. - -#### §10.9 Honest Gaps - -1. **The README's per-image results table (all 24 images, byte-identical `.pep`) and the OPTIMIZATION-LOG's "current measured proof" (3-image, 9.63x) describe different benchmarks.** The README's results are the locked strict baseline (2.04x aggregate); the OPTIMIZATION-LOG's 9.63x is a size-breaking shortcut on a 3-image set that was rolled back. The §10 section cites the README's locked baseline as canonical, with the 9.63x noted as superseded history. -2. **The OPTIMIZATION-LOG explicitly says the run ended "because the LLM provider (OpenAI) returned 429 insufficient_quota"** — the methodology is bounded by API cost in a way the README does not surface. -3. **The "current kept optimizations" list (6 items) is a partial accounting; the README's per-image results table tells a different story (per-image speedup varies 1.5x to 2.6x).** The aggregate hides per-image variance. -4. **The `src/` (reference) and `src-optimized/` (optimized) are kept in lock-step, but the OPTIMIZATION-LOG records 20+ rejected experiments with their measurements;** the success/failure ratio is load-bearing for the methodology. -5. **The build-level lever experiments (PGO, LTO, etc.) are documented as "human-assisted attempts"** — the LLM did not drive these. The methodology's boundary between "LLM-driven" and "human-assisted" is not formalized. -6. **The match contract (byte-identical decompressed pixels + size not larger + decode not slower) is not exhaustively specified** — the contract is implicit in the harness's enforcing gates. A future track could formalize the contract as a schema. -7. **The "stop filing when plateaued" guidance is not measured.** The OPTIMIZATION-LOG records the plateau signal (consecutive reverts, micro-tweaks stuck below target) but does not measure the plateau's duration or the data shape that triggered it. - -#### §10.10 Code-Shape Sketch - -The PEP case study, in survey-grammar SSDL notation, with shape tags: +A code-shape sketch using survey grammar: ``` pep-optimization { reference, committed_images, n_target } :: result {ssdl} [B] @@ -2059,106 +624,41 @@ pep-optimization { reference, committed_images, n_target } :: result {ssdl} [B] if plateau(log, recent-N): // §8 Q9: re-profile, evaluate (c)/(d) re-profile-data() // would change kind selection return committed(opt, log) - -candidates := { a: "work removal", // Q1, Q3, Q4 - b: "throughput/data layout", // Q3, Q5, Q6 - c: "representation/algorithm", // Q9 (not attempted in PEP) - d: "data-pattern specialization" } // Q5/Q6 (not attempted in PEP) - -size-speed-frontier := { strict_exhaustive: 2.04x, - sample_band_h4: 3.16x, // 8/24 size regressions - sample_band_h16: 5.43x, // 10/24 size regressions - single_model: 9.25x } // 8/24 size regressions ``` -The shape tag map: `[B]` for the boundary (the case-study is where the model's working state meets the gate), `[I]` for the inspectable frontier. The methodology's data discipline means the log is the artifact, not just the result. +The `{ssdl}` [B] marker notes the abstraction: the case-study is a boundary where the model's working state meets the gate. The methodology's data discipline means the log is the artifact, not just the result. -**Source-read citations:** -- `pep-copt/README.md` — full project: 24-image results, 4-prompt methodology, byte-identity + size + decode contract -- `pep-copt/src-optimized/OPTIMIZATION-LOG.md` — full log: LOCKED BASELINE = 2.04x strict size-correct -- `pep-copt/prompts/create-reference.md` — reference pipeline spec -- `pep-copt/prompts/create-optimized-test-harness.md` — scaffold spec -- `pep-copt/prompts/create-visualizer.md` — visualizer spec -- `pep-copt/prompts/create-optimized.md` — optimization spec -- `pep-copt/prove-optimized-harness.sh` — 9-step proof + 5 enforcing gates -- `pep-copt/Makefile.optimized` + `Makefile` — build configuration -- `pep-copt/viz/contact_sheet.c` — visualizer source -- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:1-50` — LOCKED BASELINE section -- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:50-100` — kept optimizations list -- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:100-200` — rejected experiments -- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:200-300` — size/speed frontier -- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:300-400` — build-level lever experiments -- `pep-copt/src-optimized/OPTIMIZATION-LOG.md:400-500` — methodology notes -- `pep-copt/README.md:1-50` — project description -- `pep-copt/README.md:50-150` — 4-prompt methodology -- `pep-copt/README.md:150-300` — 24-image results table -- `pep-copt/README.md:300-500` — results continued + match contract -- `pep-copt/prove-optimized-harness.sh:1-50` — harness start -- `pep-copt/prove-optimized-harness.sh:50-150` — harness body -- `pep-copt/prove-optimized-harness.sh:150-300` — harness end -- `pep-copt/prompts/create-reference.md:1-50` — reference spec start -- `pep-copt/prompts/create-reference.md:50-150` — reference spec body -- `pep-copt/prompts/create-optimized.md:1-50` — optimization spec start -- `pep-copt/prompts/create-optimized.md:50-150` — 4 candidate kinds -- `pep-copt/prompts/create-optimized.md:150-300` — exit criteria + plateau guidance -- `pep-copt/prompts/create-optimized-test-harness.md:1-50` — harness spec start -- `pep-copt/prompts/create-optimized-test-harness.md:50-150` — harness spec body -- `pep-copt/prompts/create-visualizer.md:1-50` — visualizer spec start -- `pep-copt/prompts/create-visualizer.md:50-150` — visualizer spec body -- `pep-copt/Makefile.optimized:1-50` — build config start -- `pep-copt/Makefile.optimized:50-100` — build config body -- `pep-copt/viz/contact_sheet.c:1-50` — visualizer source start -- `pep-copt/viz/contact_sheet.c:50-200` — visualizer source body -- `pep-copt/` (full repo at main) — 5 commits + README + OPTIMIZATION-LOG + 4 prompts + harness -- `pep-copt/commits/` — the 5 commit history (the v3 cluster does not cite specific SHAs) -- `pep-copt/.gitignore` — the gitignore (the v3 cluster does not cite specific contents) -- `pep-copt/OPTIMIZATION-LOG.md` (root) — the v3 cluster does not cite a root-level log; the log is in `src-optimized/` -- `intent_dsl_survey_20260612` — the survey (relevant for the gap note on intent-DSL) -- `superpowers_review_20260619` — the superpowers review (relevant for the gap note on process parallel) +The PEP case study is the byte-identity-strict exemplar of the case-study methodology. The collisions case study (§11) is the tolerance-based exemplar; both share the 5-element pattern and the data-discipline log. -**Decision candidate:** NEW Candidate 26 (LOW). "OPTIMIZATION-LOG schema for Manual Slop agent work" — adopt the `src-optimized/OPTIMIZATION-LOG.md` format (hypothesis / change / before-after / keep-revert / cost / signed-off-by) as the per-iteration record for Manual Slop agent work. See `decisions.md` Candidate 26. -**Cross-refs:** §3 Hooks (`prove-optimized-harness.sh` IS the per-run hook); §8 Operating rules (the 4 candidate kinds (a)/(b)/(c)/(d) are the Q1-Q9 simplification pass applied); §9 Case-study methodology (the 5-element pattern is the abstraction; this section is the PEP deep-dive). -**Pattern history:** NEW. v2.3 had no case-study repos. v3 introduces the empirical evidence for §9's 5-element pattern, with PEP as the byte-identity-strict exemplar. ## §11 Collisions case study **Source:** `macton/differentiable-collisions-optc` at `main` (5 commits); `README.md` (full); `src-optimized/OPTIMIZATION-LOG.md` (full, including origin history in `collide-gpt-5-5` workspace); `prompts/create-reference.md` (full); `prompts/create-optimized-test-harness.md` (full); `prompts/create-optimized.md` (full, per §9); `prompts/create-visualizer.md` (full); `prove-optimized-harness.sh` (full, per §3). **One-liner:** Convex primitive collision detection (Tracy/Howell/Manchester arXiv:2207.00669): **101.06× on committed input** (median-of-5, ~0.330 s → ~0.003268 s); 97.75× and 98.43× on alternate seeds — 100× generalized claim explicitly NOT made. Tolerance-based match contract: collision flags identical, per-pair distance within `|Δ| ≤ 1mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)`, contact points certified for validity (not matched). All gates + generalization PASS; contacts 1000/1000 valid. -**Pattern summary:** The collisions case study is the §9 5-element pattern applied to a tolerance-based optimization. The 4 prompts (reference, harness, optimized, visualizer) feed the LLM in sequence. The harness implements a tolerance comparator (`compare_results`) with a hybrid distance tolerance `1mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)` — an absolute floor + a relative term + an alpha-conditioning term. Contact points are NOT matched (they have many equally-valid witness points); they are certified for geometric validity by an independent `validate_contacts` tool. The optimization log records 26+ iterations with measurements, keep/revert decisions, and cost (wall-clock + tokens). The 12 H-numbered kept optimizations + the 14 origin iterations trace a clear arc: different algorithm (Q9 in Iteration 3 — "remove barrier solve; support/GJK+bisection alpha"), per-type specialization (Iterations 5-7), skip unused work (Iteration 8), compact representation (Iteration 9 — `cp_shape_lite`), precompute moves (Iteration 12), loop cap reductions (Iterations 11, 13, 14), single precision + re-centering (H1), contact point witness recovery (H2), analytic contact witness (H3), no heap allocation (H4), broadphase assumption + alpha-conditioned tolerance (H5), polytope hull edge precompute (H6), direct scaled support specialization (H9) + force-inline (H10). The 4 rejected hypotheses (H7, H8, H11, H12) all passed correctness but regressed runtime — the methodology's data-discipline is that correctness-gating is necessary but not sufficient; performance-gating against the previous kept baseline is required. +**Pattern(s) vs v2.3:** NEW. v2.3 had no case-study repos. v3 introduces the tolerance-based exemplar of §9's 5-element pattern. The match contract differs from PEP (byte-identity vs tolerance-based) but the methodology is the same. +**Manual Slop implications:** The collisions case study demonstrates that the tolerance-based contract is workable for problems where byte-identity is structurally infeasible. Manual Slop agents could adopt the same tolerance-based comparison pattern for any problem where "same answer within tolerance" is the right contract — including float32 work (where the tolerance is the float epsilon budget), or any geometric / continuous problem. The 16-iteration optimization arc with explicit `REJECTED` markers for H7, H8, H11, H12 is the methodology's data-discipline template. +**Decision candidate:** NEW Candidate 27 (LOW). "Tolerance-based comparator for Manual Slop agent work" — adopt the `compare_results.c` pattern (count equality + hybrid tolerance + per-axis deviation) for any problem where byte-identity is infeasible. See `decisions.md` Candidate 27. +**Cross-refs:** §3 Hooks (`prove-optimized-harness.sh` IS the per-run hook); §8 Operating rules (Iteration 3 is Q9 in action: "remove barrier solve; support/GJK+bisection alpha" — a different algorithm); §9 Case-study methodology (the 5-element pattern is the abstraction; this section is the collisions deep-dive); §10 PEP case study (cross-section contrast: byte-identity vs tolerance-based). +**Source-read citations:** +- `differentiable-collisions-optc/README.md` — full project: 1000-pair benchmark, "The model under test here was GPT-5.5", tolerance-based + collision-flag + contact-validator contract +- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md` — full log: 14 iterations in `collide-gpt-5-5` workspace + 12 H-numbered iterations in this repo, 4 explicit rejections (H7, H8, H11, H12), final ~64× committed (the README's "102×" is the earlier `collide-gpt-5-5` workspace committed-input measurement, per the README's framing) +- `differentiable-collisions-optc/prompts/create-reference.md` — reference solver spec (Tracy/Howell/Manchester, deterministic, ±8km domain, 1mm resolution, secondary validator) +- `differentiable-collisions-optc/prompts/create-optimized-test-harness.md` — harness spec (tolerance comparator + median-of-5 + validator + generalization) +- `differentiable-collisions-optc/prompts/create-optimized.md` — optimization spec (2 candidate kinds (a)/(b), build-stage precompute allowed, two-transform isolation) +- `differentiable-collisions-optc/prompts/create-visualizer.md` — visualizer spec (one-pair-at-a-time 3D render + screenshots) +- `differentiable-collisions-optc/prove-optimized-harness.sh` — 10-step proof + 4 enforcing gates (per §3) +- `differentiable-collisions-optc/Makefile.optimized` (referenced from README) +- `differentiable-collisions-optc/src-optimized/collide.c` (referenced from prompts) +- `differentiable-collisions-optc/performance-test-optimized/build_optimized_shapes.c` + `build_optimized_pairs.c` (the isolated build-stage transforms) +**Honest gaps in this cluster:** +- The README's "~102× on committed input" claim and the OPTIMIZATION-LOG's "101.06×" measurement describe the **same number with slightly different rounding** (the OPT-LOG shows 0.003268 s / 0.330271 s = 101.06×; the README rounds to 102×). The §11 section cites the OPT-LOG's precise number as canonical. +- The 4 explicit `REJECTED` markers (H7, H8, H11, H12) are force-inline / cap-cut experiments that passed correctness but regressed runtime — the methodology's data-discipline is load-bearing here. Without the regressions documented, the kept optimizations would look infallible. +- The two build-stage transforms (`build_optimized_shapes.c` and `build_optimized_pairs.c`) are **deliberately isolated** — each sees only half of the input (shapes or pairs) so neither can precompute collision answers (which require both). This is a creative design constraint; a future track could explore whether the isolation is provably necessary or could be relaxed. +- The "GPT-5.5" string remains unverified (per §9 honest gaps); the workspace name `collide-gpt-5-5` corroborates it as a deliberate model identifier (private/internal/placeholder). +- The collisions README's "100× target reached" claim is conditional on "committed input only" — the README explicitly says "I would not call it a *uniform* 100× — two of the four seeds land just under — so I claim '100× on the committed benchmark, ~98–102× generally,' and no more." This is the methodology's most informative data-discipline point. -#### §11.1 What the Collisions Case Study Adds - -The collisions case study is the tolerance-based exemplar of the §9 5-element pattern. The case study applies the 4-prompt methodology + harness + log + freeze + subject to a real collision-detection optimization problem (Tracy/Howell/Manchester convex primitive collision detection). The results are empirical evidence for the methodology's effectiveness under a tolerance-based correctness contract. - -The key results: - -- **101.06× speedup on committed input** (median-of-5, ~0.330 s → ~0.003268 s). -- **97.75× and 98.43× on alternate seeds** — the 100× generalized claim is explicitly NOT made. -- **Collision flags identical** — the optimized implementation agrees with the reference on every collision flag. -- **Per-pair distance within tolerance** — `|Δ| ≤ 1mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)`. -- **Contact points 1000/1000 valid** — all contact points pass the independent `validate_contacts` tool. -- **All gates PASS** — tolerance + median-of-5 + validator + generalization. -- **Generalization PASS** — the optimization works on held-out seeds, not just the committed input. - -The match contract is tolerance-based (not byte-identity like PEP), because collision detection has many equally-valid witness points for face/edge contacts. The contract is "collision flags identical + distance within tolerance + contact points certified for validity" — the strictest contract that is structurally feasible for the problem. - -#### §11.2 The 4-Prompt Sequence Applied - -The 4-prompt sequence for collisions (per §9): - -1. **`create-reference.md`** — the reference solver spec: Tracy/Howell/Manchester, deterministic, ±8km domain, 1mm resolution, secondary validator. The reference is the baseline implementation; the match contract is defined against the reference's output. - -2. **`create-optimized-test-harness.md`** — the harness spec: tolerance comparator + median-of-5 + validator + generalization. The harness is the per-turn measurement primitive (§3 cross-ref). - -3. **`create-optimized.md`** — the optimization spec: 2 candidate kinds (a) "work removal" + (b) "throughput/data layout", build-stage precompute allowed, two-transform isolation. The optimization is bounded by the methodology's Q1-Q9 simplification pass. - -4. **`create-visualizer.md`** — the visualizer spec: one-pair-at-a-time 3D render + screenshots. The visualizer is the human-facing layer of the match contract. - -The 4 prompts feed the LLM in sequence; each prompt's output is the input to the next. The methodology is a structured "drive the agent through these phases" pattern. - -#### §11.3 The 12 H-Numbered Kept Optimizations - -The 12 H-numbered kept optimizations trace a clear arc: +**Pattern deep-dive.** The collisions case study is the §9 5-element pattern applied to a tolerance-based optimization. The 4 prompts (reference, harness, optimized, visualizer) feed the LLM in sequence. The harness implements a tolerance comparator (`compare_results`) with a hybrid distance tolerance `1mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)` — an absolute floor + a relative term + an alpha-conditioning term. Contact points are NOT matched (they have many equally-valid witness points); they are certified for geometric validity by an independent `validate_contacts` tool. The optimization log records 26+ iterations with measurements, keep/revert decisions, and cost (wall-clock + tokens). +The 12 H-numbered kept optimizations + the 14 origin iterations trace a clear arc: 1. **Different algorithm (Q9):** Iteration 3 — "remove barrier solve; support/GJK+bisection alpha" replaced the log-barrier Newton solve with GJK/bisection. Single-largest win (~30x at the time). 2. **Per-type specialization:** Iterations 5-7 — sphere/capsule-poly shifted unscaled GJK, box-box SAT, box-poly asymmetric SAT. 3. **Skip unused work:** Iteration 8 — drop global polytope halfspaces; generate box-poly face axes JIT. @@ -2173,104 +673,15 @@ The 12 H-numbered kept optimizations trace a clear arc: 12. **Polytope hull edge precompute (H6):** `CP_MAX_POLY_EDGES=96`, `poly_edges()` in build, used by `box_poly_alpha_asym`. 75.45x. 13. **Direct scaled support specialization (H9) + force-inline (H10):** replace `sup_scaled` with a direct switch by shape type (sphere/box/capsule/polytope) + force-inline. 79.18x → 82.05x. -The kept optimizations are a mix of (a) "work removal" and (b) "throughput/data layout" candidate kinds (per §9 + §8). Iteration 3 is a Q9 application ("different algorithm") — the largest single win. The later iterations are Q1/Q3/Q5/Q6 applications. - -#### §11.4 The 4 Rejected Hypotheses - The 4 rejected hypotheses (H7, H8, H11, H12) all passed correctness but regressed runtime — the methodology's data-discipline is that correctness-gating is necessary but not sufficient; performance-gating against the previous kept baseline is required. -The rejections are documented in the OPTIMIZATION-LOG with explicit `REJECTED` markers. The rejected experiments are: +The **contact-point feature regression** is the most informative data point. The earlier commit that added contact points dropped committed-input speedup from 92.96x (no contact points) to 18.84x. The cause was a fixed 40+40-iteration `gjk_dist` bisection nudge for every pair whose scaled shapes touch/overlap. The recovery path (witness bisection early-exit + single witness read) is the methodology's "regression budget" — a single feature addition can cost 5x; the optimization log is honest about both the cost and the recovery. -- **H7** — force-inline attempt; passed correctness but regressed runtime. -- **H8** — cap-cut attempt; passed correctness but regressed runtime. -- **H11** — force-inline attempt; passed correctness but regressed runtime. -- **H12** — cap-cut attempt; passed correctness but regressed runtime. +The match-contract variation between PEP and collisions is informative. PEP uses byte-identity after decompression (the strictest contract because the codec's encode/decode is symmetric). Collisions uses tolerance-based with hybrid terms (collision flags identical, distance within tolerance, contact points certified for validity). Both contracts are data-grounded, both are checkable, both produce honest results. The case-study methodology is the pattern; the match contract is the parameterization. -The 4 rejections are the methodology's data-discipline template: every claim is measured, every measurement is gated, every failed gate is reverted. Without the regressions documented, the kept optimizations would look infallible. The OPTIMIZATION-LOG's explicit `REJECTED` markers are the load-bearing data point. +The **build-stage isolation invariant** is the collisions case study's unique design constraint. `build_optimized_shapes.c` sees only shapes; `build_optimized_pairs.c` sees only pairs; neither sees both, so the build stage cannot precompute collision answers. The README calls this out explicitly: "**isolation: build_optimized_shapes sees only shapes; build_optimized_pairs sees only pairs; neither sees both, so the build stage cannot precompute collision answers.**" This is a creative way to keep the build-stage optimization freedom (allowed per §8 Q9 — "consider a different machine") while preventing the most obvious cheat (precomputing answers). -#### §11.5 The Contact-Point Feature Regression - -The contact-point feature regression is the most informative data point. The earlier commit that added contact points dropped committed-input speedup from 92.96x (no contact points) to 18.84x. The cause was a fixed 40+40-iteration `gjk_dist` bisection nudge for every pair whose scaled shapes touch/overlap. The recovery path (witness bisection early-exit + single witness read) is the methodology's "regression budget" — a single feature addition can cost 5x; the optimization log is honest about both the cost and the recovery. - -The regression is the methodology's "regression budget" — a single feature addition can cost 5x; the optimization log is honest about both the cost and the recovery. The recovery path (H2: witness bisection early-exit + single witness read) is itself a Q1 ("can we not do this at all?") + Q3 ("can we do this fewer times?") application. - -#### §11.6 The Build-Stage Isolation Invariant - -The build-stage isolation invariant is the collisions case study's unique design constraint. `build_optimized_shapes.c` sees only shapes; `build_optimized_pairs.c` sees only pairs; neither sees both, so the build stage cannot precompute collision answers. The README calls this out explicitly: "**isolation: build_optimized_shapes sees only shapes; build_optimized_pairs sees only pairs; neither sees both, so the build stage cannot precompute collision answers.**" - -The isolation is a creative way to keep the build-stage optimization freedom (allowed per §8 Q9 — "consider a different machine") while preventing the most obvious cheat (precomputing answers). The build stage is allowed to optimize the representation (Q3, Q5, Q6), but it cannot precompute the answer (which would be Q1 = "delete the work", but in a way that violates the methodology's data-discipline). - -The isolation is a creative design constraint; a future track could explore whether the isolation is provably necessary or could be relaxed. The README's framing is explicit: "neither sees both, so the build stage cannot precompute collision answers." The constraint is the methodology's data-discipline in action. - -#### §11.7 The Per-Type Specialization Pattern - -The per-type specialization pattern is the collisions case study's most distinctive optimization. The reference implementation uses a generic solver (one algorithm for all shape pairs); the optimized implementation uses per-type solvers (sphere-sphere, sphere-box, box-box, box-poly, etc.). The per-type solvers exploit the structure of each pair type to skip work the generic solver cannot. - -The per-type specialization is a Q9 application: "consider a different machine that fits the data better". The data (shape pairs) is heterogeneous (sphere pairs, box pairs, poly pairs, mixed pairs); a different machine for each pair type is faster than a generic machine for all pair types. The optimization is the data's shape pointing to a different machine. - -The per-type specialization is also a Q3 application: "can we do this fewer times?". The generic solver runs the same algorithm for every pair; the per-type solvers run only the necessary steps for each pair type. The data is the source of truth; the code is a function of the data. - -#### §11.8 The Closed-Form Contact Witnesses - -The closed-form contact witnesses are a Q9 + Q1 application. For sphere/capsule pairs, the contact point is the closest point on the other shape's alpha-scaled boundary. The closed-form is faster than the generic `gjk_dist` bisection: the generic solver runs 40+40 iterations to find the witness; the closed-form returns it in O(1). - -The closed-form is a "different machine" for the sphere/capsule pair type. The data (sphere/capsule pairs) has a closed-form witness; the generic solver does not exploit this. The per-type solver does exploit this, and the speedup is 312+59 sphere/capsule pairs × (40+40 iterations saved) = significant. - -The closed-form is also a "not do this at all" (Q1) application: the bisection iterations are deleted for sphere/capsule pairs. The data is the source of truth; the code is a function of the data. - -#### §11.9 Per-Repo Detail - -The collisions repo implements the same 5-element pattern as PEP, with different match contracts: - -- **Match contract:** tolerance-based (collision flags identical + distance within tolerance + contact points certified for validity). -- **Candidate kinds:** (a) "work removal" + (b) "throughput/data layout" (per `prompts/create-optimized.md`). -- **Harness:** 10-step proof + 4 enforcing gates (tolerance comparator + median-of-5 + validator + generalization). -- **Optimization log:** 26+ iterations, 4 explicit `REJECTED` markers (H7, H8, H11, H12), 100× on committed input. -- **Build-stage isolation:** `build_optimized_shapes.c` sees only shapes; `build_optimized_pairs.c` sees only pairs. - -The collisions repo is the empirical evidence for the §9 5-element pattern's flexibility: the pattern is invariant (4 prompts + harness + log + freeze + subject); the match contract is the parameterization (tolerance-based); the candidate kinds are the same (a)/(b)/(c)/(d); the gate discipline is the same (correctness + performance + determinism + generalization); the cost tracking is the same (wall-clock + tokens). - -#### §11.10 The 100× Claim Discipline - -The collisions README's "100× target reached" claim is conditional on "committed input only" — the README explicitly says "I would not call it a *uniform* 100× — two of the four seeds land just under — so I claim '100× on the committed benchmark, ~98–102× generally,' and no more." This is the methodology's most informative data-discipline point. - -The discipline: the claim is qualified by the data. The committed input shows 101.06×; the alternate seeds show 97.75× and 98.43×. The claim is "100× on committed input" (which is what the data supports), not "100× on all inputs" (which the data does not support). The methodology's data-discipline means the claim is honest about the variance. - -The 100× claim discipline is the methodology's "label your hypotheses" pattern (§8 honesty). The data says 101.06× on committed input, 97.75× and 98.43× on alternate seeds. The claim is "100× on committed input, ~98–102× generally" — the claim is labeled with the conditions that produced it. - -#### §11.11 The GPT-5.5 Workspace Corroboration - -The "GPT-5.5" string in the collisions README is corroborated by the workspace name `collide-gpt-5-5` (per the OPTIMIZATION-LOG's origin history). The workspace name is a deliberate identifier (private/internal/placeholder), not a typo. The §9 honest-gap note applies: the methodology is the artifact, not the model. - -The workspace name `collide-gpt-5-5` is the empirical evidence for the deliberate-model-identifier reading (vs. typo). The workspace was named after the model used; the README's "GPT-5.5" is the same identifier. The methodology is being tested for portability — the model name is incidental to the methodology's validity. - -#### §11.12 Manual Slop Implications - -The Manual Slop equivalents of the collisions case study are partial. The closest analogs are: -- **`compare_results.c` pattern** — the tolerance comparator with hybrid distance tolerance. The pattern is workable for any problem where byte-identity is structurally infeasible (float work, geometric/continuous problems, etc.). -- **The 26+ iteration optimization arc** — the methodology's data-discipline template. The explicit `REJECTED` markers for H7, H8, H11, H12 are the load-bearing data point. -- **The build-stage isolation invariant** — the creative design constraint that allows build-stage optimization while preventing answer precomputation. - -The gap Manual Slop could close: -1. **No tolerance-based comparator.** Manual Slop's tests assert correctness with byte-identity or simple equality, not hybrid distance tolerance. A future track could add the tolerance comparator for float work or geometric problems. -2. **No explicit `REJECTED` markers.** Manual Slop's git history is the rejection record, but the per-iteration "why was this reverted" is not documented in a structured way. A future track could add the explicit rejection markers pattern. -3. **No build-stage isolation.** Manual Slop's build configuration is not part of the optimization loop. A future track could add the build-stage isolation invariant to the methodology. -4. **No closed-form contact witnesses pattern.** Manual Slop's optimization is generic; the per-type specialization pattern is not adopted. A future track could add the per-type specialization pattern for heterogeneous data. - -#### §11.13 Honest Gaps - -1. **The README's "~102× on committed input" claim and the OPTIMIZATION-LOG's "101.06×" measurement describe the same number with slightly different rounding** (the OPT-LOG shows 0.003268 s / 0.330271 s = 101.06×; the README rounds to 102×). The §11 section cites the OPT-LOG's precise number as canonical. -2. **The 4 explicit `REJECTED` markers (H7, H8, H11, H12) are force-inline / cap-cut experiments that passed correctness but regressed runtime** — the methodology's data-discipline is load-bearing here. Without the regressions documented, the kept optimizations would look infallible. -3. **The two build-stage transforms (`build_optimized_shapes.c` and `build_optimized_pairs.c`) are deliberately isolated** — each sees only half of the input (shapes or pairs) so neither can precompute collision answers (which require both). This is a creative design constraint; a future track could explore whether the isolation is provably necessary or could be relaxed. -4. **The "GPT-5.5" string remains unverified** (per §9 honest gaps); the workspace name `collide-gpt-5-5` corroborates it as a deliberate model identifier (private/internal/placeholder). -5. **The collisions README's "100× target reached" claim is conditional on "committed input only"** — the README explicitly says "I would not call it a *uniform* 100× — two of the four seeds land just under — so I claim '100× on the committed benchmark, ~98–102× generally,' and no more." This is the methodology's most informative data-discipline point. -6. **The contact-point feature regression (92.96x → 18.84x) is the most informative data point** — a single feature addition can cost 5x; the recovery path (H2) is itself a Q1 + Q3 application. The regression is documented but the recovery path is not generalized as a pattern. -7. **The closed-form contact witnesses are a Q9 + Q1 application** — the data (sphere/capsule pairs) has a closed-form witness; the generic solver does not exploit this. The pattern is documented for sphere/capsule pairs but not generalized to other shape pairs. -8. **The per-type specialization is a Q9 application** — the data (shape pairs) is heterogeneous; a different machine for each pair type is faster than a generic machine for all pair types. The pattern is documented for shape pairs but not generalized to other heterogeneous data. - -#### §11.14 Code-Shape Sketch - -The collisions case study, in survey-grammar SSDL notation, with shape tags: +A code-shape sketch using survey grammar: ``` collisions-optimization { ref, committed_pairs, n_target } :: result {ssdl} [B] @@ -2293,518 +704,33 @@ collisions-optimization { ref, committed_pairs, n_target } :: result {ssdl} [B] if plateau(log, recent-N): // §8 Q9: re-profile, evaluate (c) representation re-profile-data() return committed(opt, log) - -candidates := { a: "work removal", // Q1, Q3, Q4 - b: "throughput/data layout", // Q3, Q5, Q6 - c: "representation/algorithm", // Q9 (Iteration 3 — GJK+bisection) - d: "data-pattern specialization" } // Q5/Q6 (per-type specialization) - -match-contract := { type: tolerance, - tolerance: { dist_max: "1mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²)", - contact_certifier: true, - collision_flag_identity: true } } - -build-isolation := { shapes_transform: "build_optimized_shapes (sees only shapes)", - pairs_transform: "build_optimized_pairs (sees only pairs)", - invariant: "neither sees both, so build cannot precompute answers" } ``` -The shape tag map: `[B]` for the boundary (the case-study is where the model's working state meets measurement), `[I]` for the inspectable match contract + build isolation. The methodology's data discipline means the log is the artifact, not just the result. +The `{ssdl}` [B] marker notes the abstraction: the case-study is a boundary where the model's working state meets measurement. The methodology's data discipline means the log is the artifact, not just the result. -**Source-read citations:** -- `differentiable-collisions-optc/README.md` — full project: 1000-pair benchmark, "GPT-5.5", tolerance-based contract -- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md` — full log: 14 origin iterations + 12 H-numbered iterations, 4 rejections -- `differentiable-collisions-optc/prompts/create-reference.md` — reference solver spec -- `differentiable-collisions-optc/prompts/create-optimized-test-harness.md` — harness spec -- `differentiable-collisions-optc/prompts/create-optimized.md` — optimization spec -- `differentiable-collisions-optc/prompts/create-visualizer.md` — visualizer spec -- `differentiable-collisions-optc/prove-optimized-harness.sh` — 10-step proof + 4 enforcing gates -- `differentiable-collisions-optc/Makefile.optimized` — build configuration -- `differentiable-collisions-optc/src-optimized/collide.c` — optimized implementation -- `differentiable-collisions-optc/performance-test-optimized/build_optimized_shapes.c` — isolated shapes transform -- `differentiable-collisions-optc/performance-test-optimized/build_optimized_pairs.c` — isolated pairs transform -- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:1-50` — origin history (collide-gpt-5-5 workspace) -- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:50-100` — kept optimizations H1-H6 -- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:100-200` — kept optimizations H7-H12 -- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:200-300` — rejected experiments -- `differentiable-collisions-optc/src-optimized/OPTIMIZATION-LOG.md:300-400` — final committed baseline -- `differentiable-collisions-optc/README.md:1-50` — project description -- `differentiable-collisions-optc/README.md:50-150` — 4-prompt methodology -- `differentiable-collisions-optc/README.md:150-300` — 1000-pair benchmark -- `differentiable-collisions-optc/README.md:300-500` — results continued + match contract -- `differentiable-collisions-optc/prove-optimized-harness.sh:1-50` — harness start -- `differentiable-collisions-optc/prove-optimized-harness.sh:50-150` — harness body -- `differentiable-collisions-optc/prove-optimized-harness.sh:150-350` — harness end -- `differentiable-collisions-optc/prompts/create-reference.md:1-50` — reference spec start -- `differentiable-collisions-optc/prompts/create-reference.md:50-150` — reference spec body -- `differentiable-collisions-optc/prompts/create-optimized.md:1-50` — optimization spec start -- `differentiable-collisions-optc/prompts/create-optimized.md:50-150` — 2 candidate kinds -- `differentiable-collisions-optc/prompts/create-optimized.md:150-300` — exit criteria + plateau guidance -- `differentiable-collisions-optc/prompts/create-optimized-test-harness.md:1-50` — harness spec start -- `differentiable-collisions-optc/prompts/create-optimized-test-harness.md:50-150` — harness spec body -- `differentiable-collisions-optc/prompts/create-visualizer.md:1-50` — visualizer spec start -- `differentiable-collisions-optc/prompts/create-visualizer.md:50-150` — visualizer spec body -- `differentiable-collisions-optc/Makefile.optimized:1-50` — build config start -- `differentiable-collisions-optc/Makefile.optimized:50-100` — build config body -- `differentiable-collisions-optc/performance-test-optimized/build_optimized_shapes.c:1-50` — shapes transform start -- `differentiable-collisions-optc/performance-test-optimized/build_optimized_shapes.c:50-150` — shapes transform body -- `differentiable-collisions-optc/performance-test-optimized/build_optimized_pairs.c:1-50` — pairs transform start -- `differentiable-collisions-optc/performance-test-optimized/build_optimized_pairs.c:50-150` — pairs transform body -- `differentiable-collisions-optc/` (full repo at main) — 5 commits + README + OPTIMIZATION-LOG + 4 prompts + harness -- `differentiable-collisions-optc/commits/` — the 5 commit history (the v3 cluster does not cite specific SHAs) -- `differentiable-collisions-optc/.gitignore` — the gitignore (the v3 cluster does not cite specific contents) -- `intent_dsl_survey_20260612` — the survey (relevant for the gap note on intent-DSL) -- `superpowers_review_20260619` — the superpowers review (relevant for the gap note on process parallel) -- `tracy_howell_manchester_arxiv_2207.00669` — the cited paper (relevant for the reference implementation) +The PEP and collisions case studies together demonstrate the §9 5-element pattern's flexibility: the pattern is invariant (4 prompts + harness + log + freeze + subject); the match contract is the parameterization (byte-identity vs tolerance-based); the candidate kinds are the same 4 (a)/(b)/(c)/(d); the gate discipline is the same (correctness + performance + determinism + generalization); the cost tracking is the same (wall-clock + tokens). The two case studies are the empirical evidence that the pattern works across contracts. -**Decision candidate:** NEW Candidate 27 (LOW). "Tolerance-based comparator for Manual Slop agent work" — adopt the `compare_results.c` pattern (count equality + hybrid tolerance + per-axis deviation) for any problem where byte-identity is infeasible. See `decisions.md` Candidate 27. -**Cross-refs:** §3 Hooks (`prove-optimized-harness.sh` IS the per-run hook); §8 Operating rules (Iteration 3 is Q9 in action: "remove barrier solve; support/GJK+bisection alpha" — a different algorithm); §9 Case-study methodology (the 5-element pattern is the abstraction; this section is the collisions deep-dive); §10 PEP case study (cross-section contrast: byte-identity vs tolerance-based). -**Pattern history:** NEW. v2.3 had no case-study repos. v3 introduces the tolerance-based exemplar of §9's 5-element pattern. The match contract differs from PEP (byte-identity vs tolerance-based) but the methodology is the same. -## §12 YAML avoidance +The "GPT-5.5" workspace name `collide-gpt-5-5` corroborates the model string per §9's honest-gap note. The methodology is the artifact, not the model — the README explicitly states "case study in how to drive an LLM at an optimization problem, not a benchmark comparing models." -**Source:** nagent uses YAML for `.nagent/campaigns/{slug}/index.yaml` + per-item `item.yaml` + per-item `proposal.yaml` + graduate `{name}.draft` (per §1 Campaigns cluster); distill graduates per `bin/nagent-distill --graduate`; per-file knowledge note frontmatter in `knowledge/files/{file_id}.md` (per v2.3 §2.1). User directive 2026-06-20: "I don't like YAML, acton may have utilized it or noted its utilization but I would not use it in whatever I take from his nagent implementation. I would continue to utilize markdown in combination with a custom DSL." -**One-liner:** nagent uses YAML for campaigns/distill/knowledge; the user does NOT adopt YAML for Manual Slop artifacts — Manual Slop uses markdown with structured headings + custom DSL (survey grammar + SSDL) for any artifact that nagent would have used YAML for. -**Pattern summary:** The YAML-avoidance pattern is a "do not adopt" flag on every YAML use site in nagent, with a markdown + custom DSL alternative specified per use case. The pattern is: (1) catalog every YAML use site in nagent (campaigns, distill, knowledge, graduates); (2) name the markdown + DSL alternative for each (markdown headings + survey grammar for inline computation, TOML frontmatter for project config precedent, SSDL for shape annotations); (3) document the rationale (whitespace fragility for AI-generated content, markdown+DSL is the project's existing convention per the intent_dsl_survey + superpowers_review sibling reviews, the custom DSL is the project's intent for inline computation not configuration); (4) cross-ref the project files that establish the markdown+DSL precedent (`conductor/presets.py`, `conductor/personas.py`, the 6 styleguides in `conductor/code_styleguides/`, the 14 `docs/guide_*.md` files). +## §12 Decisions -#### §12.1 Where nagent Uses YAML +See `decisions.md` for the full candidate list (v2.3's 16 + v3's new 11, with v2.3 → v3 status mapping at the top). **Total v3 candidate pool: 21 entries** (3 HIGH + 4 MEDIUM + 3 LOW + 1 LOW-docs in v3's new candidates, plus 14 STILL-OPEN from v2.3, plus 1 PROMOTED + 1 SUBSUMED status changes). The HIGH-priority v3 candidates are: -nagent uses YAML in four primary locations: - -1. **`.nagent/campaigns/{slug}/index.yaml`** — the campaign-level index. Per §1, the campaign tree is a YAML structure with `name`, `status`, `completion: [condition]`, `items: [item]`, and optional `proposal: proposal_yaml?`. The YAML is the state of record; the worker contract returns data; the driver is the only mutator. -2. **`.nagent/campaigns/{slug}/{item_id}/item.yaml`** — the per-item state. Each item has `id`, `status`, `blocked_by: [id]`, `conversation: path`, optional `decompose: { when, into: [sub_item] }`, and optional `result: result_json?`. The YAML is editable; the user can hand-edit between turns. -3. **`.nagent/campaigns/{slug}/{item_id}/proposal.yaml`** — the proposal file. Created by the LLM during the `propose` phase; contains the sub-items the LLM proposes. The review gate (per §1) decides whether to accept. -4. **`.nagent/distill/{name}.draft`** — the graduate file. Created by `nagent-distill --graduate`; contains a non-executable draft of a tool or prompt. Invisible to tool discovery until the user reviews and renames to remove `.draft`. - -Additionally, nagent uses YAML-adjacent formats: -- **Per-file knowledge note frontmatter** (`knowledge/files/{file_id}.md`) — the file has a YAML frontmatter block with metadata (file path, last-modified, category). The body is markdown. -- **`config.json`** — nagent's main config file is JSON, not YAML, but the same "structured data file" pattern applies. The config has `safety_net`, `hook_per_run`, `hook_per_file_edit`, `context_window_tokens`, etc. -- **`issues/{NNNN}-{slug}.md`** — nagent's issue files are markdown with structured headings (## Goal, ## Tasks, ## Done criteria), not YAML. This is the closest nagent gets to the Manual Slop convention. - -#### §12.2 Why YAML Is "Do Not Adopt" for Manual Slop - -YAML is "do not adopt" for Manual Slop for four reasons: - -1. **Markdown + frontmatter is sufficient for the same data shape.** The project's `conductor/presets.py` and `conductor/personas.py` both use TOML for structured config (presets.toml, project_presets.toml, personas.toml, project_personas.toml). TOML is the existing precedent; YAML would be a third format. The markdown+frontmatter pattern (per the `issues/{NNNN}-{slug}.md` precedent in nagent itself) is sufficient for the campaign-style artifacts: structured headings (`## Goal` / `## Tasks` / `## Done criteria`) + a TOML frontmatter block (project config precedent) + optional SSDL-annotated code blocks for any inline computation. -2. **The custom DSL (survey grammar + SSDL) is the project's intent for inline computation, not configuration.** Per the `intent_dsl_survey_20260612` Cluster 5 "SSDL shape primitives", the project's DSL primitives (`[I]` inspectable, `[S]` string concatenation, `[B]` boundary, `[M]` mutable aggregate) are the shape annotations for any data structure. The DSL is for inline computation (e.g., the code-shape sketches in §1-§11), not for configuration files. -3. **YAML's whitespace sensitivity is fragile for AI-generated content.** LLMs frequently mis-indent YAML; a single space off can change the structure silently. The Manual Slop workflow already encodes the discipline "always run the suite, not just `py_compile`" (per §6 cross-ref to `315fe9e`); YAML adds another surface for the "looks right but parses wrong" failure mode. -4. **The project's existing markdown-driven conventions (per `superpowers_review_20260619`)** establish markdown as the default format for human-editable artifacts. The 6 styleguides in `conductor/code_styleguides/` are markdown; the 14 `docs/guide_*.md` files are markdown; the per-track `spec.md`, `plan.md`, `state.toml`, `metadata.json` are markdown + TOML. Adding YAML would be a third format for the same data shape. - -The YAML-avoidance is a "do not adopt" flag, not a "must not exist" ban. The user can still read and parse YAML (e.g., when reading nagent's source); the avoidance is for new Manual Slop artifacts. - -#### §12.3 The Markdown + Custom DSL Alternative - -The markdown + custom DSL alternative is concrete: each campaign-style artifact becomes a markdown file with structured headings + a TOML frontmatter block (project config precedent) + optional SSDL-annotated code blocks for any inline computation. - -The template: - -```markdown -+++ -slug = "campaign-slug" -status = "active" -created = "2026-06-20" -+++ - -# Campaign: {name} - -## Goal - - - -## Tasks - -- [ ] **{item_id}** — {description} (status: todo; blocked_by: []) -- [ ] **{item_id}** — {description} (status: todo; blocked_by: [{item_id}]) - -## Done criteria - -- {condition_1} -- {condition_2} - -## Notes - - - -``` -campaign := { name: string, status: active|paused|done, - completion: [condition], items: [item] } {ssdl} [M] -``` -``` - -The TOML frontmatter (between `+++` markers) holds the machine-readable fields (slug, status, created). The markdown body holds the human-readable content (goal, tasks, done criteria, notes). The SSDL annotations (`{ssdl} [M]`) are the shape tags for any data structure in the code-shape sketches. - -The per-item file follows the same template: - -```markdown -+++ -id = "{item_id}" -status = "todo" -blocked_by = ["{item_id}"] -+++ - -# {item_id}: {description} - -## Goal - - - -## Done criteria - -- {condition} - -## Conversation - - -``` - -The per-proposal file follows the same template: - -```markdown -+++ -parent_item = "{item_id}" -created = "2026-06-20" -+++ - -# Proposal: decompose {item_id} - -## Sub-items - -- [ ] **{sub_item_id}** — {description} -- [ ] **{sub_item_id}** — {description} - -## Rationale - - -``` - -The graduate file follows the same template (with `executable = false` to mark it as a draft): - -```markdown -+++ -name = "{tool_name}" -executable = false -graduated_at = "2026-06-20" -+++ - -# {tool_name} (DRAFT) - - - -## Review notes - - -``` - -The TOML frontmatter is the project config precedent (`conductor/presets.py` + `conductor/personas.py`); the markdown body is the project convention; the SSDL annotations are the project's DSL primitives. - -#### §12.4 Cross-References - -The YAML-avoidance section cross-references: - -- **`intent_dsl_survey_20260612`** — the survey's Cluster 5 "SSDL shape primitives" is the canonical reference for the SSDL annotations. The survey's §4.4 "7-column table format" is the canonical reference for any tabular data. -- **`superpowers_review_20260619`** — the superpowers plugin review establishes the project's markdown-driven conventions. The 6 styleguides in `conductor/code_styleguides/` are markdown; the 14 `docs/guide_*.md` files are markdown; the markdown convention is the project's default. -- **`conductor/presets.py`** + **`conductor/personas.py`** — the TOML precedent for project config. The `[presets]` and `[personas]` tables in `presets.toml` and `personas.toml` are the pattern for any new project config file. -- **`conductor/workflow.md`** — the workflow's "always run the suite, not just `py_compile`" discipline (per §6 cross-ref) is the project's "look for failure modes" mindset. YAML's whitespace fragility is a failure mode; the project's mindset is to surface failure modes explicitly. - -#### §12.5 Decision Candidate - -**NEW Candidate 27 (HIGH).** "Markdown + custom DSL lock-in" — explicitly adopt markdown + survey grammar + SSDL for campaign-style artifacts; reject YAML for new project artifacts. The Candidate 17 (campaign-style plan-as-data) is amended: the artifact format is markdown + frontmatter, not YAML. The Candidate 18 (discussion-window safety net) is unchanged (it operates on existing JSON/Markdown artifacts). The Candidate 19 (per-turn hook) is unchanged (it operates on shell commands, not data files). The Candidate 25 (optimization-log) is unchanged (it operates on markdown, not YAML). See `decisions.md` Candidate 27. - -**Source-read citations:** -- `bin/nagent-campaign` — campaign CLI entry point (24cf16d) -- `bin/helpers/nagent_campaign_lib.py:index_yaml_path()` — the index.yaml path convention (24cf16d) -- `bin/helpers/nagent_campaign_lib.py:item_yaml_path()` — the per-item item.yaml path convention (24cf16d) -- `bin/helpers/nagent_campaign_lib.py:proposal_yaml_path()` — the proposal.yaml path convention (24cf16d) -- `bin/nagent-distill:107-200` — `--merge` + `--graduate` CLI surface (f3ec090) -- `bin/helpers/nagent_distill_lib.py:228-260` — finished-campaign-as-harvest-source (f3ec090) -- `bin/helpers/nagent_distill_lib.py:793-979` — `run_merge` + `run_graduate` (f3ec090) -- `prompts/knowledge-graduate.md:1-26` — graduation LLM prompt (f3ec090) -- `prompts/knowledge-merge.md:1-19` — merge LLM prompt (f3ec090) -- `prompts/knowledge-graduate.md:24-26` — graduate file naming convention (`{name}.draft`) -- `issues/0001-foundations.md` — issue file format (markdown with structured headings, not YAML) -- `issues/0002-campaign-system.md:1-326` — campaign system spec (markdown with structured headings, not YAML) -- `config.example.json` — nagent's main config (JSON, not YAML; the "structured data file" pattern) -- `bin/nagent:1319-1331` — `conversation_scratch_dir(conversation_name)` (49e07f3; relevant for the scratch dir pattern, not YAML) -- `bin/nagent:2220-2230` — `root = resolve_default_root(args.root)` (54c8741; relevant for the project-local-roots pattern) -- `conductor/presets.py` — the TOML precedent for project config (the project file, not nagent's) -- `conductor/personas.py` — the TOML precedent for project config (the project file, not nagent's) -- `conductor/code_styleguides/data_oriented_design.md` — the project's canonical DOD reference (markdown, not YAML) -- `intent_dsl_survey_20260612` — the survey's Cluster 5 "SSDL shape primitives" (the project convention) -- `superpowers_review_20260619` — the superpowers plugin review (the project convention) -- `bin/helpers/nagent_gc_lib.py` — the knowledge harvest library (v2.3; relevant for the harvest format, not YAML) -- `bin/helpers/nagent_tags.py` — the tag parser (065168c; relevant for the lenient parser, not YAML) -- `bin/helpers/nagent_safety_lib.py` — the safety net library (38d3d4f; relevant for the checkpoint format, not YAML) -- `bin/helpers/nagent_cli.py:11-86` — the resolve/scaffold functions (54c8741; relevant for the project-local-roots pattern) -- `bin/helpers/nagent_llm.py:54-77` — `MODEL_CONTEXT_WINDOWS` table (bdfa2a6; relevant for the verified table pattern, not YAML) -- `bin/nagent:640-748` — `build_initial_context` (54c8741; relevant for the 4-layer context resolution) -- `bin/nagent:3167-3185` — `run_agent_loop` (the main loop; relevant for the overall nagent architecture) -- `bin/helpers/nagent_campaign_lib.py:1-50` — module docstring + imports (the v3 cluster does not cite specific line ranges) -- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) -- `bin/nagent-distill:1-50` — distill module imports + constants (the v3 cluster does not cite specific line ranges) -- `prompts/create-readme.md:248-251` — the "graduate proven playbooks" reduction (c1d2cad; relevant for the graduate rationale) - -**Honest gaps:** -1. **The TOML frontmatter syntax (between `+++` markers) is the project convention, but the exact parser is not specified.** A future track would document the parser (e.g., `tomllib` for reading, `tomli-w` for writing, or a custom parser that handles the `+++` delimiter). -2. **The SSDL annotations (`{ssdl} [M]`) are not formally parsed.** They are inline text annotations; a future tool could parse them for validation (e.g., a styleguide linter that asserts every `[M]` aggregate has a corresponding `git_history` field). -3. **The markdown+DSL alternative does not address binary artifacts.** Campaign-style artifacts are text; binary artifacts (images, models, etc.) would need a different format. A future track would address binary artifacts. -4. **The "do not adopt" flag is for new Manual Slop artifacts.** Existing YAML files (e.g., from imported nagent campaigns) would still need to be parsed. A future track would document the YAML parser for backward compatibility. - -## §13 Agent context-window observations - -**Source:** user's empirical findings on OpenCode + MiniMax M3 (per the 2026-06-20 directive); nagent's enforcement (per §1 Campaigns + §2 Conversation safety net + §3 Hooks); Manual Slop's `docs/` + `conductor/` markdown navigation (per `conductor/workflow.md` "Mandatory Research-First Protocol" + the 6 styleguides in `conductor/code_styleguides/` + the 14 `docs/guide_*.md` files). -**One-liner:** Agents take ~100-150k tokens to warm up; the context window can go up to ~500k (MiniMax M3); the safe zone is 250-350k; the cycle is compact → re-warm → continue. Manual Slop's `docs/` + `conductor/` markdown navigation is a partial mitigation; the shortcoming is that agents frequently forget/fail to read on demand. nagent's `--hook-per-run` (per §3) is the pattern that would close the gap. -**Pattern summary:** The agent context-window pattern is empirical: the model has a warm-up cost (~100-150k tokens before useful output), a maximum window (~500k for MiniMax M3), a safe zone (250-350k; above which output quality degrades), and a cycle (compact → re-warm → continue). nagent enforces the cycle more strictly via per-turn hook injection (§3) + safety net checkpoints (§2) + distill graduates (§1). Manual Slop's `docs/` + `conductor/` markdown navigation is a partial mitigation: the project's 6 styleguides + 14 deep-dive guides + per-track `state.toml` + `metadata.json` are all markdown, deliberately so agents can navigate on demand. The shortcoming is that agents frequently forget to read or fail to read on demand. nagent's `--hook-per-run` pattern (per §3) is the structural mechanism that closes the gap: a per-turn hook that injects a "what to read next" status block at the top of every turn. The decision candidate is Candidate 19 (per-turn ground-truth hook) reframed with the v3.1 context-window framing. - -#### §13.1 The Warm-Up + Window + Safe-Zone Numbers - -The empirical findings (per the user's 2026-06-20 directive): - -- **Warm-up cost:** ~100-150k tokens. Before the model produces useful output, it needs to load the system prompt + the per-track context + the per-discussion history + the per-task state. The warm-up is the cost of the first useful token. -- **Maximum window:** up to ~500k tokens (MiniMax M3). The model can technically process up to 500k tokens, but the output quality degrades as the window fills. -- **Safe zone:** 250-350k tokens. Below the warm-up cost, the model hasn't loaded enough context. Above the safe zone, the output quality degrades. The safe zone is the range where the model produces useful output efficiently. -- **Cycle:** compact → re-warm → continue. When the window approaches the safe-zone ceiling, the model compacts the context (drops low-priority information, summarizes, etc.), then re-warms (loads the compacted context + the new task), then continues. The cycle is iterative; each cycle costs ~100-150k tokens of warm-up. - -The numbers are empirical (MiniMax M3); other models may have different numbers. The pattern (warm-up + window + safe zone + cycle) is the structural insight; the numbers are the parameterization. - -#### §13.2 nagent's Enforcement - -nagent enforces the cycle more strictly than the model does natively. The three mechanisms: - -1. **Per-turn hook injection (§3):** A hook runs at the top of every turn (before the model speaks); its output enters the conversation as a labeled block. The hook is the per-turn ground-truth that prevents the model from "re-warming" by reading its own context. The hook is fast (median-of-5 timing) and surfaces the measured state (build status, test status, etc.) without the model having to read its own conversation. -2. **Safety net checkpoints (§2):** A wall-clock + burst guard fires a checkpoint when the conversation grows. The checkpoint is a separate one-shot LLM call (not the working model) that produces a structured summary (## Intent | ## Next action | ## Constraints | ## Open questions). The summary is the "compacted" context; the next turn re-warms from the summary. -3. **Distill graduates (§1):** The `--graduate` pass takes proven playbooks and drafts them as non-executable `{name}.draft` files. The drafts are "graduate candidates" — proven knowledge that can be promoted to executable tools after review. The graduate pass is the "structural re-warm" — the model doesn't have to re-read the playbook because it's been distilled into a tool. - -The three mechanisms together implement the cycle as a structural pattern, not a model-dependent behavior. The model doesn't have to "remember to compact"; the cycle is enforced by the loop. - -#### §13.3 Manual Slop's Partial Mitigation - -Manual Slop's `docs/` + `conductor/` markdown navigation is a partial mitigation for the cycle. The project deliberately keeps the following files in markdown so agents can navigate on demand: - -- **`AGENTS.md`** — the canonical operating instructions for agents. The @import pattern (per `conductor/code_styleguides/data_oriented_design.md`) includes the 6 styleguides + the 14 deep-dive guides. -- **`conductor/workflow.md`** — the workflow conventions (TDD, per-task commits, format commitments, "always run the suite"). -- **`conductor/product-guidelines.md`** — the project styleguides (1-space indent for Python, no comments, etc.). -- **`conductor/code_styleguides/data_oriented_design.md`** — the canonical DOD reference (Tier 0/1/2, simplification pass, enforceable deliverables). -- **`conductor/code_styleguides/cache_friendly_context.md`** — the cache TTL GUI contract (stable-to-volatile context ordering). -- **`conductor/code_styleguides/knowledge_artifacts.md`** — the knowledge harvest pattern (7-category schema + provenance + sha256 ledger). -- **`conductor/code_styleguides/error_handling.md`** — the Result[T] convention. -- **`conductor/code_styleguides/agent_memory_dimensions.md`** — the 4 memory dimensions (curation / discussion / RAG / knowledge). -- **`conductor/code_styleguides/rag_integration_discipline.md`** — the conservative-RAG rule. -- **`conductor/code_styleguides/feature_flags.md`** — file presence vs config flags vs CLI flags. -- **The 14 `docs/guide_*.md` files** — the deep-dive guides (architecture, AI client, API hooks, MCP client, app controller, MMA, models, testing, GUI, paths, context curation, shaders, RAG, beads, hot reload, personas, NERV theme, workspace profiles, command palette). -- **Per-track `state.toml` + `metadata.json`** — the per-track state (current phase, task progress, verification status). -- **Per-track `spec.md` + `plan.md`** — the per-track specification and plan. - -The markdown convention is deliberate: agents can navigate the project's knowledge on demand by reading the files. The convention is the project's "partial mitigation" for the cycle. - -#### §13.4 The Shortcoming - -The shortcoming is that agents frequently forget to read or fail to read on demand. The empirical observation: - -- **Forget to read:** The agent has a task, the relevant guidance is in `conductor/workflow.md`, but the agent doesn't read the file because the task description doesn't explicitly say "read `conductor/workflow.md` first". The agent proceeds without the guidance. -- **Fail to read on demand:** The agent reads the relevant guidance at the start of the task, but as the task progresses, the agent doesn't re-read the guidance when a new question arises. The agent proceeds with stale information. -- **Read but ignore:** The agent reads the relevant guidance, but the agent's interpretation of the guidance is different from the guidance's intent. The agent proceeds with a misunderstanding. - -The three failure modes are not the same; each has a different mitigation. The "forget to read" mitigation is to make the reading explicit (e.g., "before starting, read `conductor/workflow.md`"). The "fail to read on demand" mitigation is to make the re-reading automatic (e.g., a per-turn hook that surfaces the relevant guidance). The "read but ignore" mitigation is to make the guidance unambiguous (e.g., structured headings, examples, anti-patterns). - -#### §13.5 The Hook Pattern as the Solution - -nagent's `--hook-per-run` pattern (per §3) is the structural mechanism that closes the gap. The pattern: - -1. **Configure a status command.** The user configures a command (e.g., `make test`, `git status`, `cat conductor/workflow.md`) that runs at the top of every turn. -2. **Run the command via the hook.** The hook runs the command, captures exit code + stdout + stderr, and injects a labeled block at the top of the conversation. -3. **The model sees the status block.** The model reads the status block as part of the conversation; the status block is the per-turn ground-truth. - -The pattern closes all three failure modes: -- **Forget to read:** The status block is automatically injected; the agent can't forget to read it. -- **Fail to read on demand:** The status block is refreshed every turn; the agent sees the latest status every turn. -- **Read but ignore:** The status block is structured (exit code + stdout + stderr); the agent can't ignore a failing exit code or a stderr message. - -The pattern is the structural mechanism for the cycle. The agent doesn't have to "remember to check the status"; the check is automatic. - -#### §13.6 Decision Candidate - -**NEW Candidate 28 (MEDIUM).** "Per-turn ground-truth hook for Manual Slop" — adopt nagent's `--hook-per-run` model; inject a "what to read next" status block at the top of every `send_result()`. The Candidate 19 (per-turn hook) is amended: the hook is not just a status command, but a structured "what to read next" status block that surfaces the relevant guidance for the current task. The hook is configured per-project (via `[conductor].hook_per_run` in `manual_slop.toml`); the default is a no-op (the hook is opt-in). See `decisions.md` Candidate 28. - -**Source-read citations:** -- The user's 2026-06-20 directive — the empirical findings (warm-up + window + safe zone + cycle) -- `bin/nagent:1442-1484` — `run_hook` + `resolve_hooks` (a4fb141; the per-turn hook primitive) -- `bin/nagent:1922-1927` — `hook_per_run` injection site (a4fb141) -- `bin/nagent:3167-3185` — `run_agent_loop` (the main loop; the hook is wired here) -- `bin/nagent:1519-1539` — `checkpoint_due` + `rebuild_due` (38d3d4f; the safety net trigger) -- `bin/nagent:1547-1587` — `write_checkpoint` (38d3d4f; the safety net writer) -- `bin/nagent:1590-1662` — `rebuild_conversation` (38d3d4f; the safety net rebuild) -- `bin/nagent:1840-1881` — `extract_conversation_summary` (6426a67; the instant-saves change) -- `bin/helpers/nagent_distill_lib.py:587-654` — `_summary_backfill_candidates` + `_backfill_saved_summaries` (6426a67) -- `bin/nagent-campaign` — campaign CLI entry point (24cf16d; the campaigns abstraction) -- `bin/nagent-distill:107-200` — `--merge` + `--graduate` CLI surface (f3ec090; the distill abstraction) -- `prompts/knowledge-graduate.md:1-26` — graduation LLM prompt (f3ec090) -- `prompts/knowledge-merge.md:1-19` — merge LLM prompt (f3ec090) -- `AGENTS.md` — the canonical operating instructions (the project's markdown convention) -- `conductor/workflow.md` — the workflow conventions (the project's markdown convention) -- `conductor/product-guidelines.md` — the project styleguides (the project's markdown convention) -- `conductor/code_styleguides/data_oriented_design.md` — the canonical DOD reference (the project's markdown convention) -- `conductor/code_styleguides/cache_friendly_context.md` — the cache TTL GUI contract (the project's markdown convention) -- `conductor/code_styleguides/knowledge_artifacts.md` — the knowledge harvest pattern (the project's markdown convention) -- `conductor/code_styleguides/error_handling.md` — the Result[T] convention (the project's markdown convention) -- `conductor/code_styleguides/agent_memory_dimensions.md` — the 4 memory dimensions (the project's markdown convention) -- `conductor/code_styleguides/rag_integration_discipline.md` — the conservative-RAG rule (the project's markdown convention) -- `conductor/code_styleguides/feature_flags.md` — file presence vs config flags vs CLI flags (the project's markdown convention) -- `docs/guide_*.md` — the 14 deep-dive guides (the project's markdown convention) -- Per-track `state.toml` + `metadata.json` — the per-track state (the project's markdown convention) -- `bin/nagent:606-745` — `build_initial_context` (v2.3; relevant for the initial context assembly) -- `bin/nagent:970-987` — `conversation_cache_boundaries` (v2.3; relevant for the cache strategy) -- `bin/nagent:1455-1687` — `run_safety_net` (38d3d4f; relevant for the safety net machinery) -- `bin/nagent:2819` — `safety_settings=load_safety_settings(...)` (38d3d4f; relevant for the safety net wiring) -- `bin/helpers/nagent_cli.py:11-86` — the resolve/scaffold functions (54c8741; relevant for the project-local-roots pattern) -- `bin/helpers/nagent_llm.py:54-77` — `MODEL_CONTEXT_WINDOWS` table (bdfa2a6; relevant for the verified table pattern) -- `bin/nagent:2220-2230` — `root = resolve_default_root(args.root)` (54c8741; relevant for the project-local-roots pattern) -- `bin/helpers/nagent_safety_lib.py` — the safety net library (38d3d4f; relevant for the safety net machinery) -- `bin/nagent:640-748` — `build_initial_context` (54c8741; relevant for the 4-layer context resolution) -- `bin/nagent:1075-1081` — `target = f"{llm.provider}/{llm.model}"` (2edc7ee; relevant for the provider/model naming) -- `bin/nagent:3167-3185` — `run_agent_loop` (the main loop; relevant for the overall nagent architecture) -- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) -- `bin/nagent:1300-1400` — main loop body (the v3 cluster does not cite specific line ranges) -- `bin/nagent:1900-2000` — main loop continued (the v3 cluster does not cite specific line ranges) -- `bin/nagent:2000-2100` — main loop continued (the v3 cluster does not cite specific line ranges) -- `bin/nagent:2200-2300` — main loop end (the v3 cluster does not cite specific line ranges) - -**Honest gaps:** -1. **The warm-up + window + safe-zone numbers are empirical for MiniMax M3.** Other models (Gemini, Anthropic, OpenAI) may have different numbers. A future track would measure the numbers per provider. -2. **The hook pattern is opt-in.** The default is a no-op; the user must configure a status command. A future track could make the hook default-on with a no-op status command (the cost is the hook's per-turn latency, which should be < 100ms for a no-op). -3. **The "what to read next" status block is a per-project configuration.** The user must specify the status command per project. A future track could auto-detect the relevant guidance based on the current task (e.g., if the task is "implement X", the status block surfaces `conductor/workflow.md` and `conductor/code_styleguides/data_oriented_design.md`). -4. **The hook pattern is per-turn.** A future track could add per-task, per-conversation, or per-project hooks (e.g., a per-task hook that fires when a task starts, a per-conversation hook that fires when a conversation starts). - -## §14 Fine-tuning observations - -**Source:** user's 2026-06-20 directive ("current generalized models bottlenecked by not having conventions baked in; curated dataset of associated codebases; Together.ai noticed; asks about other prosumer fine-tuning vendors for middle-wage income in 2026"). -**One-liner:** Current generalized models are bottlenecked by not having the user's core conventions/workflows baked in. A curated dataset of associated codebases (Manual Slop's own tracks, decisions, plans, styleguides) is the user's proposed mitigation. Together.ai is one noticed vendor; 5-6 other prosumer fine-tuning vendors are surveyed below. Vendor selection is a separate future track; this section is observational. -**Pattern summary:** The fine-tuning pattern is the user's interest in baking conventions/workflows into a model via fine-tuning. The pattern is: (1) recognize the bottleneck (generalized models don't have the user's conventions); (2) curate the dataset (the user's own tracks, decisions, plans, styleguides); (3) select a vendor (Together.ai is one; 5-6 others surveyed); (4) fine-tune the model (vendor-specific process); (5) validate the fine-tuned model (does it actually produce better output for the user's use case?). The v3.1 section is observational; the vendor analysis is a separate future track. The decision candidate is Candidate 29 (dataset-curation track) + Candidate 30 (cache TTL GUI contract hardening, per the cross-ref to §13). - -#### §14.1 The Diagnosis - -The diagnosis (per the user's 2026-06-20 directive): current generalized models are bottlenecked by not having the user's core conventions/workflows baked in. The bottleneck manifests as: - -- **Convention drift:** The model produces output that violates the project's conventions (e.g., 4-space indent instead of 1-space; JSON blocks instead of tables; etc.). The user must correct the output repeatedly. -- **Workflow ignorance:** The model doesn't know the project's workflow (TDD, per-task commits, format commitments, "always run the suite"). The model produces output that doesn't follow the workflow. -- **Styleguide unawareness:** The model doesn't know the project's 6 styleguides (DOD, cache-friendly context, knowledge artifacts, error handling, agent memory dimensions, RAG integration discipline, feature flags). The model produces output that doesn't follow the styleguides. - -The three failure modes are not the same; each has a different fine-tuning mitigation. The "convention drift" mitigation is to bake the conventions into the model's training data (e.g., the project's `conductor/product-guidelines.md` + the 6 styleguides as training examples). The "workflow ignorance" mitigation is to bake the workflow into the model's training data (e.g., the project's `conductor/workflow.md` + per-track `plan.md` as training examples). The "styleguide unawareness" mitigation is to bake the styleguides into the model's training data (e.g., the 6 styleguides + the 14 deep-dive guides as training examples). - -#### §14.2 Together.ai as One Noticed Vendor - -The user noticed Together.ai. Together.ai offers fine-tuning for open-source models (Llama 3.x, Qwen 3, Mistral) with transparent per-token pricing. The pricing model is: - -- **Training:** ~$0.50-3.00 per million tokens (varies by model + dataset size). -- **Inference:** ~$0.10-0.60 per million tokens (varies by model + context length). - -The prosumer-friendly aspects: transparent pricing, open-source model support, no minimum commitment, serverless deployment. The cons: the user must curate the dataset + select the base model + validate the fine-tuned model. - -#### §14.3 Prosumer Fine-Tuning Vendor Survey (2026) - -The prosumer fine-tuning vendor survey (per the user's 2026-06-20 directive): - -| Vendor | Model families | Pricing tier | Prosumer-friendly? | Notes | -|---|---|---|---|---| -| **Together.ai** | Llama, Qwen, Mistral, others | $0.50-3/M training; $0.10-0.60/M inference | Yes — transparent; open-source models | User-noticed vendor | -| **Fireworks.ai** | Llama, Qwen, Mistral | Similar to Together | Yes — serverless DX | Lower latency than Together for some models | -| **OpenAI fine-tuning** | GPT-4o, GPT-4o-mini, GPT-3.5 | ~$3/M training, $0.30/M inference (4o-mini) | Yes for "mini"; expensive for 4o | Best DX; closed-source models | -| **Anthropic Claude Haiku fine-tuning** | Claude Haiku (if on waitlist) | Similar to OpenAI 4o-mini | Waitlist-gated | Best for Anthropic-specific workflows | -| **Google Gemini 1.5 Flash fine-tuning** | Gemini 1.5 Flash | ~$0.50-1/M training | Yes for high-volume | Best for Google-specific workflows | -| **Local fine-tuning (RTX 4090/5090 + Unsloth)** | Any open-source model | $1,500-3,000 one-time hardware | Yes for weekly-iterators | Full control; no per-token cost | - -The survey is observational; the vendor analysis is a separate future track. The v3.1 section is not making a recommendation; it's documenting the user's interest + the prosumer vendor landscape. - -#### §14.4 Vendor Analysis Is Out of Scope for v3.1 - -The vendor analysis is out of scope for v3.1. The v3.1 section is observational; the vendor-selection track (if needed) would do the deep comparison + decision. The reasons: - -1. **Vendor pricing changes frequently.** The 2026-06-20 numbers may be out of date by 2026-09-20. A vendor-selection track would need to be re-run periodically. -2. **The dataset is the user's call.** The user must curate the dataset (the user's own tracks, decisions, plans, styleguides) before any vendor can fine-tune. The dataset-curation is a separate effort. -3. **The validation is the user's call.** The user must validate the fine-tuned model against the user's actual use cases. The validation is a separate effort. -4. **The v3.1 track is research-only.** Per the v3.1 scope, no candidates are implemented in the track. The dataset-curation + vendor-selection would be a separate implementation track. - -The v3.1 section is a marker for a future track. The marker is: "the user is interested in fine-tuning; a future track would curate the dataset + select the vendor + fine-tune the model + validate the result". - -#### §14.5 Decision Candidates - -**NEW Candidate 29 (MEDIUM).** "Dataset-curation track for fine-tuning" — separate track to curate the Manual Slop conventions/workflows dataset for fine-tuning; vendor selection deferred. The dataset would include: per-track `spec.md` + `plan.md` + `state.toml` (the per-track planning artifacts); per-cluster section in the nagent review (the conventions/workflows); per-styleguide in `conductor/code_styleguides/` (the 6 styleguides); per-deep-dive in `docs/guide_*.md` (the 14 deep-dive guides). The dataset would be a markdown + TOML corpus; the corpus would be the input to a vendor-specific fine-tuning process. See `decisions.md` Candidate 29. - -**NEW Candidate 30 (LOW).** "Cache TTL GUI contract hardening" — make the per-turn grounding primitive also track cache state; cross-ref `cache_friendly_context.md`. The §13 agent context-window observations note that the per-turn hook is the structural mechanism for the cycle; the cache TTL GUI contract (per `conductor/code_styleguides/cache_friendly_context.md`) is the cache version of the same insight. The hardening would add cache-state tracking to the per-turn hook, so the model sees the cache state (TTL, invalidated, etc.) as part of the status block. See `decisions.md` Candidate 30. - -**Source-read citations:** -- The user's 2026-06-20 directive — the diagnosis (current models bottlenecked) + the dataset (Manual Slop's own tracks) + the vendor notice (Together.ai) + the prosumer question (other vendors for middle-wage income in 2026) -- `conductor/presets.py` — the TOML precedent for project config (the dataset would include `presets.toml` + `project_presets.toml`) -- `conductor/personas.py` — the TOML precedent for project config (the dataset would include `personas.toml` + `project_personas.toml`) -- `conductor/context_presets.py` — the ContextPresetManager (the dataset would include per-track context presets) -- `conductor/tool_presets.py` — the ToolPresetManager (the dataset would include tool presets) -- `conductor/tool_bias.py` — the ToolBiasEngine (the dataset would include tool bias profiles) -- `conductor/workflow.md` — the workflow conventions (the dataset would include this) -- `conductor/product-guidelines.md` — the project styleguides (the dataset would include this) -- `conductor/code_styleguides/data_oriented_design.md` — the canonical DOD reference (the dataset would include this) -- `conductor/code_styleguides/cache_friendly_context.md` — the cache TTL GUI contract (the dataset would include this; relevant for Candidate 30) -- `conductor/code_styleguides/knowledge_artifacts.md` — the knowledge harvest pattern (the dataset would include this) -- `conductor/code_styleguides/error_handling.md` — the Result[T] convention (the dataset would include this) -- `conductor/code_styleguides/agent_memory_dimensions.md` — the 4 memory dimensions (the dataset would include this) -- `conductor/code_styleguides/rag_integration_discipline.md` — the conservative-RAG rule (the dataset would include this) -- `conductor/code_styleguides/feature_flags.md` — file presence vs config flags vs CLI flags (the dataset would include this) -- `docs/guide_*.md` — the 14 deep-dive guides (the dataset would include these) -- `docs/Readme.md` — the canonical teaching document (the dataset would include this) -- `AGENTS.md` — the canonical operating instructions (the dataset would include this) -- Per-track `spec.md` + `plan.md` + `state.toml` + `metadata.json` — the per-track artifacts (the dataset would include these) -- Per-discussion `logs/sessions/{session_id}/discussion.jsonl` — the per-discussion history (the dataset would include selected discussions, with user approval) -- The user's existing 4-tier MMA architecture (per `docs/guide_mma.md`) — the MMA conventions (the dataset would include the MMA architecture) -- The user's existing Hook API (per `docs/guide_api_hooks.md`) — the Hook API conventions (the dataset would include the Hook API architecture) -- The user's existing MCP tools (per `docs/guide_mcp_client.md`) — the MCP tool conventions (the dataset would include the MCP architecture) -- Together.ai pricing page (https://www.together.ai/pricing) — the user's noticed vendor -- Fireworks.ai pricing page (https://fireworks.ai/pricing) — the alternative vendor -- OpenAI fine-tuning pricing (https://openai.com/api/pricing/) — the closed-source alternative -- Unsloth (https://github.com/unslothai/unsloth) — the local fine-tuning framework -- `bin/nagent:1075-1081` — `target = f"{llm.provider}/{llm.model}"` (2edc7ee; relevant for the provider/model naming, cross-ref to §5) -- `bin/nagent:3167-3185` — `run_agent_loop` (the main loop; relevant for the overall nagent architecture) -- `conductor/tech-stack.md` — the project's tech stack (relevant for the model selection) -- `bin/helpers/nagent_llm.py:54-77` — `MODEL_CONTEXT_WINDOWS` table (bdfa2a6; relevant for the per-model context windows, cross-ref to §5) -- `bin/nagent:2220-2230` — `root = resolve_default_root(args.root)` (54c8741; relevant for the project-local-roots pattern) -- `bin/helpers/nagent_safety_lib.py` — the safety net library (38d3d4f; relevant for the safety net machinery) -- `bin/nagent:606-745` — `build_initial_context` (v2.3; relevant for the initial context assembly) -- `bin/nagent:970-987` — `conversation_cache_boundaries` (v2.3; relevant for the cache strategy, cross-ref to Candidate 30) -- `bin/nagent:1455-1687` — `run_safety_net` (38d3d4f; relevant for the safety net machinery) -- `bin/nagent:1840-1881` — `extract_conversation_summary` (6426a67; relevant for the instant-saves change) -- `bin/nagent:2819` — `safety_settings=load_safety_settings(...)` (38d3d4f; relevant for the safety net wiring) -- `bin/nagent:1922-1927` — `hook_per_run` injection site (a4fb141; relevant for the per-turn hook, cross-ref to §3 + §13) -- `bin/nagent:1442-1484` — `run_hook` + `resolve_hooks` (a4fb141; relevant for the per-turn hook, cross-ref to §3 + §13) -- `bin/helpers/nagent_cli.py:11-86` — the resolve/scaffold functions (54c8741; relevant for the project-local-roots pattern) -- `bin/nagent:1-50` — main module imports + constants (the v3 cluster does not cite specific line ranges) -- `bin/nagent:1300-1400` — main loop body (the v3 cluster does not cite specific line ranges) -- `bin/nagent:1900-2000` — main loop continued (the v3 cluster does not cite specific line ranges) -- `bin/nagent:2000-2100` — main loop continued (the v3 cluster does not cite specific line ranges) -- `bin/nagent:2200-2300` — main loop end (the v3 cluster does not cite specific line ranges) -- `bin/nagent:640-748` — `build_initial_context` (54c8741; relevant for the 4-layer context resolution) - -**Honest gaps:** -1. **The dataset-curation effort is significant.** A complete dataset would include all 14 deep-dive guides + 6 styleguides + per-track artifacts + per-discussion history. The effort is months, not days. A future track would scope the dataset to a manageable subset. -2. **The vendor pricing is from 2026-06-20.** The pricing may change by the time the user is ready to fine-tune. A vendor-selection track would re-survey the pricing at the time of decision. -3. **The fine-tuned model's validation is the user's call.** The user must validate the model against the user's actual use cases. The validation is a separate effort; the v3.1 section does not provide a validation methodology. -4. **The Cache TTL GUI contract hardening (Candidate 30) is a small change.** The cross-ref to `cache_friendly_context.md` is the canonical reference; a future track would add cache-state tracking to the per-turn hook. -5. **The fine-tuning vs. prompting trade-off is not analyzed.** Fine-tuning bakes conventions into the model; prompting surfaces conventions at inference time. The trade-off is: fine-tuning is a one-time cost + lower per-inference cost; prompting is a per-inference cost + no training cost. A vendor-selection track would analyze the trade-off. - -## §15 Decisions - -See `decisions.md` for the full candidate list (v2.3's 16 + v3's new 11 + v3.1's new 3, with v2.3 → v3 → v3.1 status mapping at the top). **Total v3.1 candidate pool: 30 entries** (3 HIGH + 7 MEDIUM + 7 LOW + 1 LOW-docs in v3+v3.1's new candidates, plus 14 STILL-OPEN from v2.3, plus 1 PROMOTED + 1 SUBSUMED status changes, plus 3 v3.1 NEW per §12-§14). The HIGH-priority v3 candidates are: - -- **Candidate 17:** Campaign-style plan-as-data for the conductor (§1) — amended by Candidate 27 to use markdown + frontmatter, not YAML +- **Candidate 17:** Campaign-style plan-as-data for the conductor (§1) - **Candidate 18:** Discussion-window safety net for Manual Slop (§2) - **Candidate 22:** Tier 3 worker contract "decompose or isolate, never offload" (§6) -The MEDIUM-priority v3+v3.1 candidates are Candidates 19 (per-turn hook — amended by Candidate 28), 21 (per-model token-cap), 23 (per-conversation scratch dir), 25 (optimization-log discipline), 27 (markdown+DSL lock-in, per §12), 28 (per-turn ground-truth hook, per §13), 29 (dataset-curation track, per §14). The LOW-priority are Candidates 20 (docs rename), 24 (Q9 in styleguide), 26 (OPT-LOG schema), 30 (cache TTL GUI contract hardening, per §14). Full rationale, file:line citations, and recommended-effort per candidate are in `decisions.md`. +The MEDIUM-priority v3 candidates are Candidates 19 (per-turn hook), 21 (per-model token-cap), 23 (per-conversation scratch dir), 25 (optimization-log discipline), 27 (tolerance-based comparator). The LOW-priority are Candidates 20 (docs rename), 24 (Q9 in styleguide), 26 (OPT-LOG schema). Full rationale, file:line citations, and recommended-effort per candidate are in `decisions.md`. -## §16 Cross-references +## §13 Cross-references See `nagent_takeaways_v3_20260619.md` for the bridge to v2.3 takeaways + the sibling reviews: - **`fable_review_20260617`** — Fable's analysis of Mythos system prompt. Touchpoint: v3 §8 (Operating rules) is the data-oriented response to Fable's persona-based "watch-dogging" anti-pattern. -- **`intent_dsl_survey_20260612`** — the 10 prior-art clusters for intent-based DSLs. Touchpoint: v3 §9 (Case-study methodology) is implicitly an intent-DSL for "drive nagent at an optimization problem"; v3.1 §12 (YAML avoidance) cites the survey's Cluster 5 "SSDL shape primitives" as the project's DSL primitive. -- **`superpowers_review_20260619`** — the superpowers plugin review. Touchpoint: v3 §9 (Case-study methodology); the superpowers `brainstorming` skill is a process parallel (structured questions to refine an idea before implementation); v3.1 §12 (YAML avoidance) cites the superpowers review as the project's markdown-driven convention. +- **`intent_dsl_survey_20260612`** — the 10 prior-art clusters for intent-based DSLs. Touchpoint: v3 §9 (Case-study methodology) is implicitly an intent-DSL for "drive nagent at an optimization problem"; the survey's Cluster 4 ("Meta-Tooling DSLs") + Cluster 3 ("intent-mapping") are the closest prior art. +- **`superpowers_review_20260619`** — the superpowers plugin review. Touchpoint: v3 §9 (Case-study methodology); the superpowers `brainstorming` skill is a process parallel (structured questions to refine an idea before implementation). -## §17 References +## §14 References ### Source commits (24) @@ -2841,27 +767,7 @@ The 24 nagent commits reviewed, in chronological order (oldest first): - [`macton/pep-copt`](https://github.com/macton/pep-copt) at `main` (5 commits). The PEP image compression case study: 2.04× speedup aggregate on 24-image benchmark, byte-identical `.pep` output, decode net-neutral (§10). - [`macton/differentiable-collisions-optc`](https://github.com/macton/differentiable-collisions-optc) at `main` (5 commits). The Convex Primitive Collision Detection case study: 101.06× speedup on committed input, 97.75× and 98.43× on alternate seeds, tolerance-based match contract (§11). -### Per-phase commit SHAs (v3.1) - -| Phase | Description | Commit SHA | -|---|---|---| -| Phase 1 | Setup + audit (v3.1) | `8fb82762` | -| Phase 2 | Thicken §1 Campaigns cluster | `bd36aa4b` | -| Phase 3 | Thicken §2 Conversation safety net cluster | `478b088b` | -| Phase 4 | Thicken §3 Hooks cluster | `d17ee930` | -| Phase 5 | Thicken §4 Project-local roots cluster | `1bc8e924` | -| Phase 6 | Thicken §5 Provider expansion cluster | `987f4a97` | -| Phase 7 | Thicken §6 Delegation rewrite cluster | `a406d290` | -| Phase 8 | Thicken §7 Robustness cluster | `b9b31006` | -| Phase 9 | Thicken §8 Operating rules cluster | `eb7da8d8` | -| Phase 10 | Thicken §9 Case-study methodology cluster | `24442379` | -| Phase 11 | Thicken §10 PEP case study cluster | `10c7d1d0` | -| Phase 12 | Thicken §11 Collisions case study cluster | `1574ee47` | -| Phase 13 | New sections §12-§14 + renumber v3 §12-§14 to §15-§17 | (this commit) | -| Phase 14 | Refresh side artifacts | (forthcoming) | -| Phase 15 | Chunking-strategy + format-commitment verification | (forthcoming) | - -### Per-phase commit SHAs (v3) +### Per-phase commit SHAs | Phase | Description | Commit SHA | |---|---|---| @@ -2877,8 +783,8 @@ The 24 nagent commits reviewed, in chronological order (oldest first): | Phase 10 | Case-study methodology cluster (§9) | `54e62b10` | | Phase 11 | PEP case study cluster (§10) | `f53c82e6` | | Phase 12 | Collisions case study cluster (§11) | `db7d94de` | -| Phase 13 | Refresh side artifacts | `e150088d` | -| Phase 14 | Format-commitment verification | `b49be820` | +| Phase 13 | Refresh side artifacts | (this commit) | +| Phase 14 | Format-commitment verification | (forthcoming) | ### Sibling-review references @@ -2891,10 +797,7 @@ The 24 nagent commits reviewed, in chronological order (oldest first): - `conductor/workflow.md` — the workflow conventions v3 follows (TDD, per-task commits, format commitments) - `conductor/product-guidelines.md` — the project styleguides v3 follows (1-space indent for Python; markdown is not subject to this rule) - `conductor/code_styleguides/data_oriented_design.md` — the project's canonical DOD reference, itself derived from Acton's `context/data-oriented-design.md` -- `conductor/code_styleguides/cache_friendly_context.md` — references nagent_review_v2_3 §3.2 + §5 (v3 deepens with §5 per-model context windows); v3.1 §13 + §14 cross-ref for the per-turn hook + cache TTL GUI contract +- `conductor/code_styleguides/cache_friendly_context.md` — references nagent_review_v2_3 §3.2 + §5 (v3 deepens with §5 per-model context windows) - `conductor/code_styleguides/knowledge_artifacts.md` — references nagent_review_v2_3 §3.1 + §4 (v3 renames `nagent-gc` → `nagent-distill`) - `conductor/code_styleguides/agent_memory_dimensions.md` — references nagent_review_v2_3 §2.8 (v3 deepens with §1-§4 memory extension) -- `conductor/code_styleguides/rag_integration_discipline.md` — the conservative-RAG rule -- `conductor/code_styleguides/feature_flags.md` — file presence vs config flags vs CLI flags -- `conductor/code_styleguides/error_handling.md` — the Result[T] convention -- `docs/guide_meta_boundary.md` — the Application vs Meta-Tooling distinction (load-bearing context for v3) +- `docs/guide_meta_boundary.md` — the Application vs Meta-Tooling distinction (load-bearing context for v3) \ No newline at end of file