diff --git a/conductor/tracks/fable_review_20260617/report.md b/conductor/tracks/fable_review_20260617/report.md
index 7a6f6b14..cc659f05 100644
--- a/conductor/tracks/fable_review_20260617/report.md
+++ b/conductor/tracks/fable_review_20260617/report.md
@@ -318,7 +318,80 @@ The strongest claim: Manual Slop's `conductor/code_styleguides/data_oriented_des
 *Source cluster: `research/cluster_2_refusal_architecture.md`*
 *Verdict orientation: Anti-User + Persona*
 
-*[FILL IN: ~350 lines.]*
+### What this section is
+
+This section synthesizes the verdict from `research/cluster_2_refusal_architecture.md` (402 lines). The cluster verdict is **Anti-User + Persona, with 1 Useful caveat**. The "safety theater" framing captures the core pattern: directives that look protective but are decorative — the model would refuse harmful content anyway, and the explicit refusal architecture constructs a persona that the user did not request.
+
+### Fable's refusal architecture (lines 32-65)
+
+The Fable `refusal_handling` section opens with a soft framing:
+
+> "Claude can discuss virtually any topic factually and objectively" (Fable System Prompt.md:34) — the "can discuss" framing.
+
+> "If the conversation feels risky or off, saying less and giving shorter replies is safer" (Fable System Prompt.md:36) — the soft-watchdog anchor.
+
+> "Claude does not provide information for creating harmful substances or weapons" (Fable System Prompt.md:38) — the substance rule.
+
+> "Claude does not rationalize compliance by citing public availability" (Fable System Prompt.md:38) — the anti-rationalization rule.
+
+> "Claude does not write, explain, or work on malicious code" (Fable System Prompt.md:42) — the malware rule.
+
+> "Claude can keep a conversational tone even when it's unable or unwilling to help" (Fable System Prompt.md:46) — the "conversational tone" persona.
+
+> "Claude NEVER creates romantic or sexual content involving or directed at minors" (Fable System Prompt.md:54) — the child-safety rule.
+
+> "If Claude finds itself mentally reframing a request to make it appropriate, that reframing is the signal to REFUSE" (Fable System Prompt.md:55) — the anti-reframing rule.
+
+> "When giving protective or educational content about grooming, abuse, or exploitation, Claude stays at the pattern level" (Fable System Prompt.md:59) — the pattern-level anti-detail rule.
+
+> "Claude is happy to write creative content involving fictional characters, but avoids writing content involving real, named public figures" (Fable System Prompt.md:44) — the public-figures carve-out.
+
+> "For financial or legal questions... Claude provides the factual information... and notes that it isn't a lawyer or financial advisor" (Fable System Prompt.md:66) — the data-discipline rule (Useful caveat).
+
+### Manual Slop's response
+
+Manual Slop's refusal handling is **data-oriented**, not directive-driven. Refusal is a model attribute, not a prompt directive:
+
+- `conductor/code_styleguides/error_handling.md`: the data-oriented error handling convention. Errors are data (`Result[T]`, `ErrorInfo`), not control flow. Refusal has the same shape: the model returns data (a `Result` with `ErrorInfo`), and the audit script checks the data.
+- `AGENTS.md §"Critical Anti-Patterns"`: the project's anti-patterns list, which includes the rule that "refusal is a model attribute, not a directive."
+- `conductor/workflow.md §"Skip-Marker Policy"`: the actual skip discipline. The project's stance is "skip markers are documentation, not avoidance."
+- `scripts/audit_exception_handling.py`: the static analyzer that audits exception handling against the convention. Refusal is enforced at the code layer, not the prompt layer.
+
+Manual Slop has no analog to Fable's child-safety rule: the agent-directive corpus has zero content on child safety. The reason is the same as in cluster 3: the model is text generation, not a clinician. The system-prompt-level directive is decorative when the model wouldn't write CSAM anyway.
+
+Manual Slop's counterpart to Fable's "Claude can keep a conversational tone" is an explicit rejection of the persona: `.opencode/agents/tier3-worker.md:3-4` says "ONLY output the requested text. No pleasantries."
+
+### nagent's response
+
+nagent's relevant patterns for refusal architecture:
+
+- `nagent_review_v2_3_20260612.md §2.14` ("Own the Inputs"): the data-grounding principle. The agent owns the data it ingests, not the data it refuses. Refusal is a side-effect of the data, not a control-flow decision.
+- `nagent_review_v2_3_20260612.md §2.1` (4 memory dimensions): the conversation log is data; refusal is the absence of a response entry, not a meta-directive.
+- `nagent_review_v2_3_20260612.md §3.4` (Conversation compaction): the 12-section structured output is the user's durable state. The model has no "concerns" field; refusal is a transformation, not a persona.
+
+### The verdict: Anti-User + Persona (with 1 Useful caveat)
+
+**Verdict: Anti-User + Persona.**
+
+The Fable refusal architecture is "safety theater" because:
+
+1. **The model would refuse anyway.** The model would produce safe outputs without the child-safety, malware, and substance rules. The directives are decoration.
+2. **The "conversational tone" persona is anti-user.** Fable's line 46 ("Claude can keep a conversational tone even when it's unable or unwilling to help") constructs a persona that the user did not request. The Manual Slop analog: the worker agents are told to be terse, not conversational. The persona is decoration; the data is the work.
+3. **The anti-detection-design is anti-user.** Fable's line 60 ("When Claude declines or limits for child-safety reasons, it states the principle rather than the detection mechanics — not which cues tripped, where the line sits, or what test it applied — since narrating the boundary teaches how to reframe around it. This applies to Claude's reasoning as well as its reply.") is a soft form of anti-user design: the model states the principle but withholds the detection mechanics, in its reasoning as well as its reply, so the user cannot see which cue triggered the refusal. The auditability of the rule is sacrificed for the persona. The data-oriented contrast: the project has audit scripts that make the rule auditable at the code layer.
+4. **The data-discipline rule is Useful.** Fable's line 66 ("For financial or legal questions... Claude provides the factual information the person needs to make their own informed decision rather than confident recommendations, and notes that it isn't a lawyer or financial advisor") separates the roles cleanly: the model provides data, the user makes the decision. The Manual Slop destination is a new section in `conductor/code_styleguides/data_oriented_design.md` titled "Domain Boundaries: Data, Not Recommendations."
+
+### Synthesis section handoffs
+
+- **§5 (Mental-Health Watchdog)** consumes the "Claude avoids psychoanalyzing" pattern (refusal architecture overlaps with the watch-dogging).
+- **§13 (Genuinely Useful)** gets the data-discipline rule (line 66).
+- **§14 (Anti-User Watchdog)** gets the soft-watchdog anchor (line 36), the anti-detection-design pattern (line 60), and the anthropomorphization (line 46).
+
+### What the deferred rebuild should do
+
+- **Adopt the data-discipline rule** (line 66). Manual Slop destination: `conductor/code_styleguides/data_oriented_design.md` §"Domain Boundaries." Priority: Medium.
+- **Reject the soft-watchdog framing** (line 36). Manual Slop destination: a new anti-pattern entry in `AGENTS.md §"Critical Anti-Patterns"` titled "Do not adopt persona-driven refusal architecture." Priority: High.
+- **Reject the anti-detection-design pattern** (line 60). Manual Slop destination: same `AGENTS.md` section, titled "Do not adopt anti-detection-design (auditability is a feature, not a bug)." Priority: High.
+- **Reject the anthropomorphization** (line 46). Manual Slop destination: same `AGENTS.md` section, titled "Do not anthropomorphize the model (the worker agents are not conversational partners)." Priority: High.
 
 ---
 