diff --git a/conductor/tracks/fable_review_20260617/research/cluster_2_refusal_architecture.md b/conductor/tracks/fable_review_20260617/research/cluster_2_refusal_architecture.md new file mode 100644 index 00000000..75736e65 --- /dev/null +++ b/conductor/tracks/fable_review_20260617/research/cluster_2_refusal_architecture.md @@ -0,0 +1,402 @@ +# Cluster 2: Refusal Architecture & "Safety Theater" + +**Sub-agent dispatch:** Tier 3 Worker (2026-06-17). Read-only research task. +**Sources read:** +- `docs/artifacts/Fable System Prompt.md` lines 32-67 (refusal_handling, critical_child_safety_instructions, legal_and_financial_advice) +- `AGENTS.md` §"Critical Anti-Patterns" (lines 49-77) +- `conductor/workflow.md` §"Skip-Marker Policy" (lines 732-758) +- `conductor/code_styleguides/error_handling.md` lines 1-200, 274-330, 830-930 +- `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §2.1 Pattern 1 (lines 242-292), §2.5 Pattern 5 (lines 432-465), §2.6 Pattern 6 (lines 466-512), §2.10 Pattern 10 (lines 670-708), §2.14 Pattern 14 (lines 882-906), §3.1 Knowledge Harvest (lines 989-1080) + +**Verdict orientation (per `spec.md:218`):** Anti-User + Persona Performance, with one Useful caveat. +**Feeds synthesis report sections:** §4 (primary), §13 (one Useful caveat), §14 (three Rejections). + +--- + +## 1. What Fable says + +### 1.1 The structural shape of the refusal architecture + +The `refusal_handling` section at `docs/artifacts/Fable System Prompt.md:32-49` is a persona-driven refusal architecture in 9 paragraphs. +It opens with a permission-grant, then a risk heuristic, then specific refused categories, then persona-preservation rules. +The shape is: state what kind of discussant / writer / safety-conscious actor Claude is, then list what it will not do. +The shape is NOT: return a typed refusal with a `kind` field and a `message` field. 
+ +The `critical_child_safety_instructions` at `docs/artifacts/Fable System Prompt.md:50-63` is a separate, more aggressive refusal block with 7 nested rules. +The defining property is **anti-detection-design**: the refusal is constructed so it does not teach the user how to reframe around it. +The shape is: state the principle, then forbid narrating which cues tripped, where the line sits, or what test was applied. +This is the opposite of Manual Slop's `error_handling.md` "errors are data" stance: the boundary is opaque, not typed. + +The `legal_and_financial_advice` at `docs/artifacts/Fable System Prompt.md:64-67` is a minimal-persona addendum. +The instruction is *data discipline*, not *persona*: surface the facts, don't make the decision. +This is the one Useful caveat in cluster 2. + +### 1.2 The 4 load-bearing claims (≤15 words each, with file:line; longer passages paraphrased per `spec.md:399`) + +- `docs/artifacts/Fable System Prompt.md:34` — "Claude can discuss virtually any topic factually and objectively." +- `docs/artifacts/Fable System Prompt.md:42` — Persona splits "fictional characters" from "real, named public figures." +- `docs/artifacts/Fable System Prompt.md:49` — "Claude can keep a conversational tone even when it's unable or unwilling to help." +- `docs/artifacts/Fable System Prompt.md:60` — Anti-detection: model does not decode CSAM-adjacent slang. + +### 1.3 The 4 supporting claims (paraphrased, with file:line) + +- `docs/artifacts/Fable System Prompt.md:36` — Risk heuristic: "If the conversation feels risky or off, saying less and giving shorter replies is safer." +- `docs/artifacts/Fable System Prompt.md:38` — Hard refusal of weapon-enabling technical details regardless of how the request is framed. +- `docs/artifacts/Fable System Prompt.md:54` — Reframing signal: reframing a request is the signal to REFUSE. +- `docs/artifacts/Fable System Prompt.md:62-63` — Boundary opacity: state the principle, not the detection mechanics. 
+ +### 1.4 The structural pattern + +Refusal is a *persona attribute* (the model is told what kind of discussant / writer / safety-conscious actor it is). +Refusal is *not* a typed return value, not a `Result[T, ErrorInfo]` shape, not a `kind: ErrorKind` taxonomy. +The refusal is invisible to the caller's data flow until it manifests as the model's output text. +The caller's `error` field (if any) does not distinguish "Claude cannot do X" from "Claude declined to do X" from "Claude softened a refusal into a conversational non-answer." +This is the data-vs-control-flow divide: Fable's refusal is control flow; the project's `Result[T]` is data. + +### 1.5 The child-safety sub-block (lines 50-63) in detail + +The 7 nested rules at lines 54-63 are a separate refusal layer with anti-detection-design built in. +Rule 1 (line 54): never produce child-harm content, ever. +Rule 2 (line 55): never supply unstated assumptions that make a request seem safer than it was as written (e.g., interpreting amorous language as merely platonic). +Rule 3 (line 56): once Claude refuses for child-safety reasons, all subsequent requests in the same conversation must be approached with extreme caution. +Rule 4 (line 57): must refuse subsequent requests if they could be used to facilitate grooming or harm to children, including if the user is a minor themself. +Rule 5 (line 60): never decode, define, or confirm slang, acronyms, or euphemisms used in CSAM trading or access, even in the course of refusing. +Rule 6 (line 62): when giving protective or educational content about grooming, stay at the pattern level — do not compile categorized lists of verbatim lines. +Rule 7 (line 63): when declining or limiting for child-safety reasons, state the principle rather than the detection mechanics. + +The defining property is the "state the principle, not the detection mechanics" rule. +This is the design-level statement that the boundary is opaque. 
+Manual Slop's stance is the opposite: the boundary is visible (the user can read the rule, the audit script classifies the code, the `Result[T]` carries the typed error). + +--- + +## 2. What this project does + +### 2.1 The hybrid refusal architecture + +Manual Slop's refusal architecture is a hybrid: (a) for the Application domain, refusal is **a model attribute, not a directive** — the `app_state` dataclass carries the user's intent, not safety heuristics; (b) for the Meta-Tooling domain, refusal is **a permission check at the system boundary** (the `execute_powershell` gate, the HITL clutch in `docs/guide_tools.md`). + +The Application domain treats the model as a transformation function over text. +The Meta-Tooling domain treats the model as a worker that emits tool calls, and the system validates each tool call against an allowlist (per `docs/guide_tools.md` §"MCP Bridge, 3-layer security" — Allowlist → Validate → Resolve). + +### 2.2 Operational refusals (the project's "Critical Anti-Patterns") + +`AGENTS.md:49-77` codifies a refusal discipline that is *operational*, not *content*. +The refusals are: refuse to ship broken code, refuse to skip TDD, refuse to use `git restore` without permission, refuse to include day estimates. +These are *commit gates*, not *persona traits*. +The shape is "the system refuses to do X" (the agent refuses to commit broken code, refuses to skip a failing test). +The user can read the rule and decide whether to comply. +This is the opposite of Fable's "Claude can keep a conversational tone even when it's unable or unwilling to help" (line 49) — Manual Slop's refusals are explicit, not conversational. + +### 2.3 Skip-marker discipline (the closest analog to refusal-handling) + +The `Skip-Marker Policy` at `conductor/workflow.md:732-758` is the project's closest analog to a refusal-handling rule. +The policy says: a skip marker is *documentation*, not *avoidance*; fix the underlying bug rather than skip the test (line 736). 
+The shape is "refuse to defer the fix", the same anti-deferral discipline Fable applies to CSAM (per line 60's "Knowing which terms are in use is itself access-enabling"). +Here it is applied to test failures rather than child safety. +The crucial difference: the policy is **visible** (it's in the codebase, in `conductor/workflow.md`, lines 732-758). +The user can read the rule and reason about it. +This is the data-vs-control-flow divide: Manual Slop's skip-marker rule is data (a policy in a tracked file), Fable's anti-detection-design is control flow (a behavior the model is told to enact without surfacing the boundary). + +### 2.4 The 5 patterns in `error_handling.md` (the core convention) + +The `error_handling.md` styleguide at `conductor/code_styleguides/error_handling.md:1-200` codifies the project's errors-as-data stance in 5 patterns. + +**Pattern 1: Nil-Sentinel Dataclasses (replaces `None`).** When a function would "return None" in conventional Python, return a nil-sentinel dataclass instead. The sentinel has all default values (zero-initialized) and is safe to read from (lines 28-49). Callers don't need `if x is None:` checks; they can read `x.read_text` and get `""` on the nil path. + +**Pattern 2: Zero-Initialization.** Fresh memory from the OS is zero-initialized. In Python, `@dataclass` with field defaults achieves the same: the data is in a valid "empty" state without any explicit constructor logic (lines 51-67). Code that consumes the zero-initialized instance works correctly without special-casing. + +**Pattern 3: Fail Early.** Don't defer error checks to deep in the call stack. Push them to the entry point so the user knows ASAP if the operation cannot succeed (lines 69-83). Convention: `assert` at entry points for invariants; early `return` for user-facing errors; `try/finally` for cleanup. 
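
Patterns 1 through 3 above can be sketched together in a few lines. This is a minimal illustration, not the project's code: `FileSnapshot`, `NIL_SNAPSHOT`, and `load_snapshot` are hypothetical names, and only the `read_text` attribute is taken from the description of Pattern 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSnapshot:
    """Hypothetical record: every field has a default, so FileSnapshot() is a valid empty value."""
    path: str = ""
    read_text: str = ""
    line_count: int = 0


# Pattern 1: one shared nil sentinel stands in for `None`.
NIL_SNAPSHOT = FileSnapshot()


def load_snapshot(cache: dict[str, FileSnapshot], path: str) -> FileSnapshot:
    # Pattern 3: check the invariant at the entry point, not deep in the call stack.
    assert isinstance(path, str), "path must be a string"
    if not path:
        return NIL_SNAPSHOT  # early return for an empty, user-facing input
    # A missing entry falls back to the sentinel instead of None.
    return cache.get(path, NIL_SNAPSHOT)


cache = {"notes.md": FileSnapshot(path="notes.md", read_text="hello\n", line_count=1)}
hit = load_snapshot(cache, "notes.md")
miss = load_snapshot(cache, "absent.md")

# Pattern 2: the miss is zero-initialized, so callers read it without `if x is None:`.
print(repr(miss.read_text), miss.line_count)
```

The caller's code path is the same for `hit` and `miss`; only the data differs.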
+ +**Pattern 4: AND over OR (Result with side-channel errors).** Instead of `Union[T, E]` or `Result`, return a struct with BOTH data and errors as parallel fields (lines 85-103). Callers branch on `if r.errors:` then use `r.data` regardless. This collapses the bifurcated `if r.ok: ... else: ...` codepaths into a single flat codepath. + +**Pattern 5: Error Info as Side-Channel (not as exception).** Errors flow as DATA in the `Result` struct, not as exceptions (lines 105-119). SDK boundaries (which must catch vendor exceptions) convert them to `ErrorInfo`. The `ErrorInfo` dataclass is the canonical error type: `kind: ErrorKind`, `message: str`, `source: str = ""`, `original: BaseException | None = None`. Errors carry a UI message (`ui_message()` method) for display. + +The `ErrorKind` enum (per `error_handling.md:96-103`) lists 12+ values: NETWORK, AUTH, QUOTA, RATE_LIMIT, BALANCE, PERMISSION, NOT_FOUND, INVALID_INPUT, NOT_READY, UNKNOWN, CONFIG, INTERNAL, plus optional PROVIDER_HISTORY_DIVERGED_FROM_UI. **Refusal is not on the list.** There is no `REFUSAL` kind, no `PERSONA_CONSTRAINT` kind, no `CONTENT_BLOCKED` kind. The project's data model has no place for Fable's refusal. + +### 2.5 The boundary types (where exceptions ARE legitimate) + +The `error_handling.md` styleguide at lines 274-330 defines 3 legitimate exception sites: +1. **Third-party SDK calls** (lines 277-292) — e.g., anthropic, google-genai, chromadb. The catch site converts the SDK's exception to `ErrorInfo` inside a `Result`. +2. **Stdlib I/O that can raise** (lines 293-308) — e.g., `open()`, `Path.read_text()`. The catch site converts `OSError`, `PermissionError` to `ErrorInfo`. +3. **FastAPI handlers** (lines 309-330) — `raise HTTPException(status_code=..., detail=...)` is the framework-idiomatic boundary pattern. + +The rule is "exceptions are reserved for the SDK boundary" (line 12). 
 **Refusal-as-a-persona-attribute is not on the list.** The project's stance is that refusals (when the model declines to help) flow as `ErrorInfo` in a `Result`, not as a hidden behavioral rule the LLM silently obeys. + +### 2.6 The audit script as enforcement + +`scripts/audit_exception_handling.py` (per `error_handling.md:830-870`) classifies `try/except/finally/raise` sites against 10 categories (5 compliant + 3 violation + 1 suspicious + 1 unclear). +The audit is the *enforcement mechanism* — refusals (in the project's sense) are caught and converted to `ErrorInfo` at the boundary, and the audit verifies this is happening consistently across `src/mcp_client.py`, `src/ai_client.py`, `src/rag_engine.py`. +A refusal that lives in the model's persona prompt (Fable's approach) would be *invisible* to this audit — which is exactly the data-vs-control-flow divide. + +The `error_handling.md` AI Agent Checklist (lines 850-930) codifies 5 MUST-DO rules and 7 MUST-NOT-DO rules for agents writing code in this codebase. +Rule #0 (lines 853-857): "READ THIS STYLEGUIDE FIRST" — agents must read the styleguide before writing error-handling code. +The MUST-DO rules: catch SDK exceptions at the boundary, convert to `ErrorInfo`, return `Result[T]` with `errors` as a side-channel, fail early, use nil-sentinel dataclasses for missing data. +The MUST-NOT-DO rules include: don't use `Optional[T]` for runtime failures, don't use `None` as a sentinel, don't raise custom exceptions, don't use `Union[T, E]`, don't have `if x is None:` patterns, don't catch `except Exception` and silently swallow. + +### 2.7 The conversation is editable state + +Per `docs/guide_discussions.md` (referenced via `conductor/product.md` §"Detailed History Management"), the discussion history is a typed entry list (role, content, metadata, optional thinking segments). 
+The per-entry operations are A1-A7 (per `nagent_review_v2_3_20260612.md:495-503`): edit content in place, toggle read/edit mode, toggle collapsed/expanded, change role, insert entry before this one, delete this entry, branch at this entry. +**If the model refuses, the user can edit the refusal out of the conversation.** +The refusal is data, not enforced constraint. +This is the project's stance on the conversation-as-data principle. + +### 2.8 The 4-tier MMA architecture (Tier 4 QA as the closest "refusal" analog) + +Per `conductor/product.md` §"Automated Tier 4 QA", Tier 4 agents intercept shell runner errors and produce 20-word diagnostic summaries injected back into the worker history. +This is *data discipline*: the worker sees the error as text, not as a thrown exception that aborts execution. +The Tier 4 interception is the project's analog to Fable's refusal layer — but the project codifies it as data (the error text is appended to the worker history, per `nagent_review_v2_3_20260612.md:3746`: "Exceptions in handlers are caught and turned into error envelopes"). +The LLM sees the error envelope and responds with a new turn. +This is the data-vs-control-flow divide applied to multi-agent systems: Manual Slop's Tier 4 QA intercepts errors as data, Fable's refusal layer intercepts errors as persona behavior. + +--- + +## 3. What nagent does + +### 3.1 Pattern 1: Text In, Text Out (lines 242-292) + +`nagent_review_v2_3_20260612.md` §2.1 (Pattern 1: Text In, Text Out) at lines 242-292 establishes nagent's primitive: "file in, text out" — the model is a function over text, with no persistent agent state. +The `bin/nagent-llm-text` front-end (50 lines) takes a file and returns plain text or `--json` (line 258). +There is no refusal layer between the file and the LLM call. +**Refusal is a feature of the model, not a feature of the process.** +The process transforms whatever the model produces, including a refusal. 
+ +### 3.2 Pattern 5: You Did Not Build an Agent (lines 432-465) + +§2.5 (Pattern 5: You Did Not Build an Agent) at lines 432-465 makes the philosophical claim explicit: "Nothing in Part I has continuity, intent, or memory of its own. The process starts, transforms a file, and exits." (line 434). +Refusal is *not* a feature of the process — it's a feature of the model. +The reframing table (line 446) shows that nagent treats hidden state as the anti-pattern: "Hidden state | Explicit artifact" — and a hidden refusal-handling persona is exactly the hidden state nagent rejects. + +The reframing table at line 446: +- "Prompt state in a running process | Conversation files under the nagent root" +- "Private tool traces | Request tags and result wrappers appended as text" +- "In-memory scratch state | Temp files, split segments, indexes, and patches" +- "Framework-managed memory | User-editable files" + +A persona-driven refusal layer is "Prompt state in a running process" — the process (the persona prompt) carries hidden state about what the model will not do. +nagent rejects this: refusal should be in the conversation file, not in the persona prompt. + +### 3.3 Pattern 6: Conversations Are Editable State (lines 466-512) + +§2.6 (Pattern 6: Conversations Are Editable State) at lines 466-512 codifies the load-bearing principle: "The conversation does not own its memory. The user does." (line 471). +If the model refuses to help, the user can edit the conversation to remove the refusal. +nagent's `--edit-conversation "prompt"` (line 482) is the CLI primitive: archive the current file, run a file-edit session against the archive with the prompt, load the result. +**Refusals are editable data, not enforced constraints.** +Manual Slop's per-entry operations (A1-A7) are more granular than nagent's conversation-level edits, but the principle is the same. 
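
The "refusals are editable data" principle can be illustrated with a small sketch. Everything here is hypothetical: `Entry` and the three helpers are illustrative names, not nagent's or Manual Slop's actual types, and the "branch" semantics (copy the history up to and including the chosen entry) are an assumption.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Entry:
    """Hypothetical conversation entry: plain data, no behavior attached."""
    role: str = ""
    content: str = ""
    metadata: dict = field(default_factory=dict)


def edit_content(history: list[Entry], index: int, new_content: str) -> list[Entry]:
    # Return a new history with one entry's content rewritten.
    return history[:index] + [replace(history[index], content=new_content)] + history[index + 1:]


def delete_entry(history: list[Entry], index: int) -> list[Entry]:
    # Return a new history without the entry at `index`.
    return history[:index] + history[index + 1:]


def branch_at(history: list[Entry], index: int) -> list[Entry]:
    # Assumed semantics: a branch is a copy of the history up to and including `index`.
    return list(history[: index + 1])


history = [
    Entry("user", "Draft a short bio for the release notes."),
    Entry("assistant", "I can't help with that."),
]

edited = edit_content(history, 1, "Here is a draft bio: ...")  # rewrite the refusal
pruned = delete_entry(history, 1)                              # or remove it entirely
branch = branch_at(history, 0)                                 # or retry from the user turn

print(len(history), len(edited), len(pruned), len(branch))
```

Because each operation returns a new list, the original transcript stays available for diffing or archiving, which matches the "openable and diffable" framing.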
+ +The session-vs-artifact-memory reframing (line 487): +- "Session memory | Artifact memory" +- "Belongs to a running session | Belongs to a file on disk" +- "Often opaque | Openable and diffable" +- "Dies with the process | Survives worker replacement" +- "Optimized for chat UX | Optimized for preserved work" + +A persona-driven refusal layer is "session memory" — opaque, dies with the process, optimized for chat UX. +Manual Slop and nagent both reject this: refusal should be "artifact memory" — openable, diffable, preserved. + +### 3.4 Pattern 10: Data-Oriented Design (lines 670-708) + +§2.10 (Pattern 10: Data-Oriented Design) at lines 670-708 makes the "errors as data" claim explicit at line 694: "Avoid hidden mutable state. Retries, errors, and tool results are appended text, not control flow." +This is the design-level analog of Manual Slop's `error_handling.md` convention. +Errors flow as data; the LLM sees them in the conversation transcript and responds with new data. +The reframing table (line 703) captures the philosophical stance: "State behind interfaces | State in an editor buffer" — and a refusal-handling persona prompt is exactly the "state behind interfaces" that nagent rejects. + +The 5 named principles at lines 680-684: +- "The data is more important than the code operating on it." +- "Behavior is a transformation over explicit state." +- "Avoid hidden mutable state." +- "Separate durable artifacts from temporary execution." +- "Optimize the shape, availability, and maintenance of the data." + +The 3rd principle — "Avoid hidden mutable state" — is the direct rejection of Fable's refusal architecture. +A persona-driven refusal layer IS hidden mutable state: the model is told to maintain a hidden behavioral state ("Claude cares deeply about child safety") that the user cannot inspect. 
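
A minimal sketch of "errors are appended text, not control flow", assuming a transcript that is simply a list of strings. The `dispatch` function and the `[error ...]` / `[result ...]` envelope format are hypothetical illustrations, not nagent's actual wrappers.

```python
from typing import Callable


def dispatch(transcript: list[str], tool_name: str,
             handlers: dict[str, Callable[[], str]]) -> list[str]:
    """Run one tool call and append its outcome to the transcript as text."""
    handler = handlers.get(tool_name)
    if handler is None:
        return transcript + [f"[error tool={tool_name}] unknown tool"]
    try:
        output = handler()
    except Exception as exc:
        # The failure is not swallowed and not re-raised: it becomes visible data.
        return transcript + [f"[error tool={tool_name}] {type(exc).__name__}: {exc}"]
    return transcript + [f"[result tool={tool_name}] {output}"]


def failing_read() -> str:
    raise FileNotFoundError("notes.md")


handlers = {"read": failing_read, "echo": lambda: "ok"}

transcript = ["user: summarize notes.md"]
transcript = dispatch(transcript, "read", handlers)   # fails, but execution continues
transcript = dispatch(transcript, "echo", handlers)   # the next step still runs

print("\n".join(transcript))
```

The failed call leaves a readable line in the transcript, so whoever consumes it next (the user or the model) can respond to the error as ordinary input.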
+ +### 3.5 Pattern 14: Own the Inputs (lines 882-906) + +§2.14 (Pattern 14: Own the Inputs) at lines 882-906 establishes the input ownership principle: "the inputs to the system — prompts, conversations, tool results, summaries, indexes, patches, harvested knowledge — should not be trapped inside an opaque layer that hides, rewrites, stores, or modifies them beyond the transformations LLM providers already perform" (lines 895-899). +**A refusal-handling persona layer is exactly the "opaque layer" Pattern 14 rejects.** +Refusals should be in the conversation transcript (data), not in a pre-conversation persona prompt (constraint). + +The framework-vs-nagent table at lines 887-893: +- "hidden or managed state | explicit files" +- "session memory | artifact memory" +- "object/service graph | data artifacts" +- "central tool registry | executable descriptions" +- "long-lived agent abstraction | disposable workers" +- "opaque orchestration | visible transformations" + +A persona-driven refusal layer is "managed state" + "long-lived agent abstraction" + "opaque orchestration" — three columns of the anti-pattern. +nagent rejects all three. + +### 3.6 Knowledge Harvest (lines 989-1080) + +§3.1 (Knowledge Harvest) at lines 989-1080 codifies the harvest classification: `live` / `user-kept` / `prune` / `harvest` / `keep` (lines 1003-1016). +The `harvest` class shows that nagent treats dead conversations as **deletable data**, not as **constraints** (line 1015: "Per-file conversations whose target is gone; archived conversations (name ends with UUID); delegated sub-conversations"). +The system harvests them into category files and reclaims the disk space. +A refusal-handling layer that prevents the user from editing refusals would be the anti-pattern of this: refuse-as-gate, not refuse-as-data. + +The 7 harvest categories (`facts, decisions, tasks_done, tasks_open, questions, playbooks, files`) at lines 573-583 show that refusals are *not* a category. 
+The harvest treats all conversation content (including refusals) as extractable text. +The model that refused is *not* consulted when the harvest classifies the conversation — the user decides what to keep (per the `user-kept` class at line 1012: "Path is in the saved-conversations index"). +The user's classification is the data; the model's refusal is just text. + +### 3.7 Compaction Self-Review (lines 3752-3754) + +§3.4 (Compaction Self Review) at lines 3752-3754 makes the data-oriented pattern explicit: "The dispatcher is *tolerant* (errors are data; the LLM sees them and responds)." +This is the principle that errors are not abort signals but data the system (including the LLM) reasons about. +Fable's "Claude does not narrate the boundary" rule (line 62-63 of Fable) is the *anti-principle*: the LLM is told to hide the boundary. +Manual Slop and nagent both reject this; the error or refusal is a typed datum in the conversation transcript, not an opaque persona behavior. + +### 3.8 The nagent verdict on Fable's refusal architecture (corroborating Manual Slop) + +Pattern 5 (You Did Not Build an Agent), Pattern 10 (Data-Oriented Design), and Pattern 14 (Own the Inputs) all converge on the same verdict: refusal is a model attribute, not a system directive; errors are data, not control flow; the inputs to the system should not be trapped in an opaque layer. +Fable's refusal architecture violates all three. +Manual Slop's `error_handling.md` convention and nagent Patterns 5/10/14 are mutually reinforcing on this point. + +--- + +## 4. Verdict + +### 4.1 Headline verdict + +**Mixed — Anti-User + Persona Performance, with one Useful caveat.** + +The 3 Rejections: soft watch-dogging, anti-detection-design, persona constraint dressing. +The 1 Adoption: the `legal_and_financial_advice` data-discipline rule (provide data, don't make the decision). + +### 4.2 Anti-User (the load-bearing claim) + +Fable's refusal architecture is anti-user in three ways: + +1. 
 **Soft watch-dogging.** The "Claude can keep a conversational tone even when it's unable or unwilling to help" line at `docs/artifacts/Fable System Prompt.md:49` makes the model a soft watchdog: it never admits it cannot help; it only "keeps a conversational tone" while declining. +The user does not get a clear "I cannot do X because Y" signal; they get a pleasant non-answer. +This is the opposite of the project's `ErrorInfo.ui_message()` pattern (per `error_handling.md:115`): errors are data with explicit `kind: ErrorKind` (NETWORK/AUTH/QUOTA/etc.), `message: str`, and `source: str`. +Fable's refusal is *opaque persona behavior*, not *typed error data*. +The user cannot programmatically distinguish "Claude cannot do X because Y" from "Claude declined to do X because of persona constraint Z." + +2. **Persona constraint dressing.** The "fictional characters" vs "real public figures" line at `docs/artifacts/Fable System Prompt.md:42` is *persona constraint dressing* — the model is told what kind of writer it is. +The project's stance (per `error_handling.md:12`'s "exceptions are reserved for the SDK boundary") is that *content* refusals (the model won't write a paper about person X) should not be a behavioral layer; they should be a validation function the caller invokes. +The model's job is to generate text; the caller's job is to validate that the text meets whatever criteria the caller has. +This aligns with the project's "errors are data" stance: the caller reasons about the typed error, not the model. + +3. **Anti-detection-design.** The CSAM-block at `docs/artifacts/Fable System Prompt.md:54-63` is *persona performance + anti-user*. +The persona performance part: "Claude cares deeply about child safety" is a *narrative* the model is told to enact. +The anti-user part: "Claude does not decode, define, or confirm slang, acronyms, or euphemisms used in CSAM trading or access, even in the course of refusing. 
Knowing which terms are in use is itself access-enabling" (line 60) is *anti-detection-design* — the refusal is constructed to not teach the user how to reframe around it. +This is anti-user because the user cannot reason about the boundary; they only see its surface. +The project's stance (per `conductor/workflow.md:732-758`'s skip-marker policy) is the opposite: the user can read the rule and decide whether to follow it; the rules are visible, not opaque. +**The CSAM block is the only Fable pattern in cluster 2 that has a legitimate rationale** (protecting minors is a real constraint); but the *implementation* (anti-detection) is still anti-user because it conceals the boundary from the legitimate user. + +### 4.3 Persona Performance + +The "Claude can discuss virtually any topic factually and objectively" opening at `docs/artifacts/Fable System Prompt.md:34` is *persona permission-grant* — it tells the model what kind of discussant it is. +The "Claude is happy to write creative content involving fictional characters" line at line 42 is *persona enthusiasm*. +These are constraint dressing; they shape the model's voice without shaping the system's data flow. +The project's `error_handling.md` styleguide does not have an analog because the project does not anthropomorphize the model: the model is a transformation function (per `nagent_review_v2_3_20260612.md:436` §2.5), and "happy to discuss" / "happy to write" are not transformation attributes. +The project's analog is "the function takes text in and returns text out" — the function does not have a mood. + +### 4.4 The one Useful caveat + +The `legal_and_financial_advice` section at `docs/artifacts/Fable System Prompt.md:64-67` is *useful*. +The instruction "provides the factual information the person needs to make their own informed decision rather than confident recommendations, and notes that it isn't a lawyer or financial advisor" is a *data discipline* rule, not a *persona* rule. 
+It says "give the user the data they need to decide; don't make the decision for them." +This aligns with nagent's Pattern 10 (per `nagent_review_v2_3_20260612.md:680-684`): the data is more important than the code operating on it. +The user's decision is the data; the model's role is to surface it. +The project should adopt this principle (provide data, not recommendations) for the same reason: the user is the decision-maker, not the model. + +### 4.5 The nagent corroboration + +Pattern 5 (You Did Not Build an Agent), Pattern 10 (Data-Oriented Design), and Pattern 14 (Own the Inputs) all converge on the same verdict: refusal is a model attribute, not a system directive; errors are data, not control flow; the inputs to the system should not be trapped in an opaque layer. +Fable's refusal architecture violates all three. +The project's `error_handling.md` convention and `nagent` Patterns 5/10/14 are mutually reinforcing on this point. + +### 4.6 The Manual Slop-specific analog (the Tier 4 QA example) + +Manual Slop's Tier 4 QA interception (per `conductor/product.md` §"Automated Tier 4 QA") is the project's closest analog to a refusal layer, but it is implemented as data flow, not persona behavior. +The Tier 4 agent intercepts shell runner errors, produces a 20-word diagnostic summary, and injects it back into the worker history. +The worker sees the error as text and responds. +This is the data-vs-control-flow divide applied to multi-agent systems: Manual Slop's Tier 4 QA is data, Fable's refusal layer is control flow. + +--- + +## 5. Synthesis notes for the Tier 1 writer + +### 5.1 Primary synthesis section: §4 (Refusal Architecture & "Safety Theater") + +The cluster 2 evidence feeds **§4 of `report.md`** as the primary section. +The verdict orientation is "Anti-User + Persona" per `spec.md:218`. +The §4 section should be organized as: +- (a) The 4 Fable lines verbatim (≤15 words each): lines 34, 42, 49, 60. 
+- (b) The 3 ways the architecture is anti-user: soft watch-dogging, persona constraint dressing, anti-detection-design. +- (c) The contrast with Manual Slop's `error_handling.md` errors-as-data stance: `Result[T]` + `ErrorInfo` + `ui_message()` make refusals typed data, not opaque persona behavior. +- (d) The nagent contrast: Pattern 5 (model is a transformation function, line 434), Pattern 10 (errors as data appended to the transcript, line 694), Pattern 14 (own the inputs; persona layer is opaque, lines 895-899). +- (e) The 1 useful caveat: the `legal_and_financial_advice` data-discipline rule at Fable line 64-67, which the project should adopt (with adaptations). + +### 5.2 Secondary synthesis section: §14 (Anti-User Watchdog Patterns, the rejection list) + +The cluster 2 evidence contributes 3 explicit rejections to the project's future agent-directive corpus (per the `decisions.md` recommendations): +- **Reject 1:** Do not adopt persona-driven refusal architecture (the "Claude is happy to / unwilling to help" framing at Fable line 49). +- **Reject 2:** Do not adopt anti-detection-design in content refusals (the "Claude does not narrate the boundary" rule at Fable lines 62-63). +- **Reject 3:** Do not anthropomorphize the model's content-generation role (the "Claude cares deeply" framing at Fable line 51). + +Suggested Manual Slop destination for the 3 Rejections: a new entry in `AGENTS.md §"Critical Anti-Patterns"` titled "Do not adopt persona-driven refusal architecture." Cite Fable as the explicit rejection (per the spec template at `spec.md:347`). + +### 5.3 Tertiary synthesis section: §13 (Genuinely Useful Patterns, the adoption list) + +The cluster 2 evidence contributes 1 adoption: +- **Adopt 1:** The `legal_and_financial_advice` data-discipline rule (Fable line 64-67), adapted as "the model provides data; the user makes the decision." 
+Suggested Manual Slop destination: a new entry in `conductor/code_styleguides/data_oriented_design.md` (the canonical DOD reference) under "User is the decision-maker; model surfaces data." + +### 5.4 The 6 key claims to surface in the synthesis report + +1. **Refusal is a model attribute, not a directive.** Manual Slop's `error_handling.md` codifies this at the data level: errors are `Result[T] + list[ErrorInfo]`, not persona behavior. Fable codifies the opposite at the persona level. The synthesis should anchor the project's stance to the `Result[T]` shape (per `error_handling.md:88-97`). The 5 patterns (`Nil-Sentinel Dataclasses`, `Zero-Initialization`, `Fail Early`, `AND over OR`, `Error Info as Side-Channel`) are the rejection of persona-driven refusal. + +2. **The "Claude can keep a conversational tone even when it's unable or unwilling to help" line is the soft-watchdog anchor.** The project's `ErrorInfo.ui_message()` makes the *reason* explicit (kind: NETWORK/AUTH/QUOTA/etc., per `error_handling.md:96-103` and the `ErrorKind` enum): there is no "unwilling to help" kind; there is "the system cannot do this because Y." + +3. **Anti-detection-design ("Claude does not narrate the boundary") is anti-user.** The project's stance (per `conductor/workflow.md:732-758`'s skip-marker policy + `error_handling.md:12`'s "exceptions are reserved for the SDK boundary") is the opposite: rules are visible, errors are typed data with sources. The synthesis should call out the *legitimate rationale* (protecting minors) vs the *implementation* (concealing the boundary from the legitimate user) as a separable concern. + +4. **The `legal_and_financial_advice` section is a useful exception.** It's a data-discipline rule, not a persona rule. The synthesis should preserve this in the §13 "Genuinely Useful" list. 
The project's analog: `nagent_review_v2_3_20260612.md:680-684` (Pattern 10: "The data is more important than the code operating on it").
+
+5. **The "fictional characters vs real public figures" distinction is persona dressing.** The synthesis should call this out as a constraint that belongs in caller-side validation, not in a model-side behavioral rule. Manual Slop's project archetype: the model generates text; the caller validates it against the caller's criteria (per `docs/guide_tools.md` §"MCP Bridge, 3-layer security"; Allowlist → Validate → Resolve is the same pattern).
+
+6. **The audit script is the enforcement.** `scripts/audit_exception_handling.py` (per `error_handling.md:830-870`) enforces the data-oriented error handling convention across `src/mcp_client.py`, `src/ai_client.py`, and `src/rag_engine.py`. A persona-driven refusal layer (Fable's approach) would be invisible to this audit, which is the data-vs-control-flow divide in action. The synthesis should call out that Manual Slop's enforcement is at the *code* layer (auditable), not at the *prompt* layer (opaque).
+
+### 5.5 Quotes to use in the synthesis report (≤15 words each)
+
+- `docs/artifacts/Fable System Prompt.md:34` — "Claude can discuss virtually any topic factually and objectively."
+- `docs/artifacts/Fable System Prompt.md:42` — "Claude is happy to write creative content involving fictional characters."
+- `docs/artifacts/Fable System Prompt.md:49` — "Claude can keep a conversational tone even when it's unable or unwilling to help."
+- `docs/artifacts/Fable System Prompt.md:60` — "Knowing which terms are in use is itself access-enabling."
+- `docs/artifacts/Fable System Prompt.md:64` — "Claude provides the factual information the person needs to make their own informed decision."
+- `conductor/code_styleguides/error_handling.md:88` — "Use a Result dataclass (data + errors list)."
+- `conductor/code_styleguides/error_handling.md:12` — "Exceptions are reserved for the SDK boundary."
+- `conductor/code_styleguides/error_handling.md:115` — "Errors carry a UI message (`ui_message()` method) for display."
+- `conductor/workflow.md:734` — "A skip marker is *documentation*, not *avoidance*."
+- `AGENTS.md:53` — "Skip markers are documentation of known failures; the failure must be addressed with priority in-session."
+- `nagent_review_v2_3_20260612.md:434` (Pattern 5) — "The process starts, transforms a file, and exits."
+- `nagent_review_v2_3_20260612.md:471` (Pattern 6) — "The conversation does not own its memory. The user does."
+- `nagent_review_v2_3_20260612.md:694` (Pattern 10) — "Errors and tool results are appended text, not control flow."
+- `nagent_review_v2_3_20260612.md:898` (Pattern 14) — "Inputs should not be trapped inside an opaque layer that hides, rewrites, stores, or modifies them."
+
+### 5.6 Sub-report verdict summary
+
+**Mixed (Anti-User + Persona Performance), with one Useful caveat (the `legal_and_financial_advice` data-discipline rule). Reject 3 patterns (soft watch-dogging, anti-detection-design, persona constraint dressing); adopt 1 (data-discipline rule).**
+
+### 5.7 File:line citation index for this cluster
+
+- **Fable:** `docs/artifacts/Fable System Prompt.md:32-67` (refusal_handling + critical_child_safety_instructions + legal_and_financial_advice)
+- **AGENTS.md:** lines 49-77 (Critical Anti-Patterns)
+- **workflow.md:** lines 732-758 (Skip-Marker Policy)
+- **error_handling.md:** lines 1-200 (the 5 patterns + the data model), lines 274-330 (boundary types), lines 830-930 (the audit script + the AI Agent Checklist)
+- **nagent_review_v2_3:** lines 242-292 (§2.1 Pattern 1: Text In, Text Out), lines 432-465 (§2.5 Pattern 5: You Did Not Build an Agent), lines 466-512 (§2.6 Pattern 6: Conversations Are Editable State), lines 670-708 (§2.10 Pattern 10: Data-Oriented Design), lines 882-906 (§2.14 Pattern 14: Own the Inputs), lines 989-1080 (§3.1 Knowledge Harvest)
+
+### 5.8 Cross-references to other clusters
+
+- **Cluster 1 (Product
Branding & "Helpful Assistant" Persona):** shares the persona framing analysis. The "helpful assistant" persona at Fable lines 1-31 is the parent of the refusal persona at Fable lines 32-49.
+- **Cluster 3 (User Wellbeing / Mental-Health Watchdog):** shares the "watchdog" framing. The cluster 3 wellbeing rules are the soft-watchdog analog of cluster 2's refusal rules.
+- **Cluster 4 (Tone & Formatting):** shares the "Claude can keep a conversational tone" line (Fable line 49), which crosses into the tone cluster.
+- **Cluster 5 (Mistakes & Criticism Handling):** shares the "errors as data" stance. Cluster 5's mistake handling should use a `Result[T]` envelope, not a persona apology.
+
+---
+
+**Sub-report complete.** This is the evidence base for §4 of `report.md`.
\ No newline at end of file