Per user request 'use your remaining context to update agent workflow
docs and then regular docs based on what was discussed in this report',
this commit creates/updates 15 files derived from the v2.3 nagent
review (the 12 new nagent additions + the 4 memory dimensions
reframing + the cache strategy + the RAG discipline + the knowledge
harvest pattern).
Agent workflow docs (4 files):
- AGENTS.md (UPDATE): add @import line to canonical DOD + 'Code
Styleguides' section pointing to the 6 new styleguides + new
'Human-Facing Documentation' section pointing to ./docs/AGENTS.md
- conductor/workflow.md (UPDATE): new section 'Additions (2026-06-12)
- the 12 patterns from the latest nagent corpus' with TDD
protocols for knowledge harvest, cache ordering, compaction, RAG
discipline
- conductor/product-guidelines.md (UPDATE): new sections 'Memory
Dimensions (added 2026-06-12)' + 'See Also - Updated' with the
6-styleguide catalog
- docs/AGENTS.md (NEW): the agent-facing mirror of docs/Readme.md
(per the nagent CLAUDE.md pattern). 10 sections + the per-tier
reading path + the 4 memory dimensions + the caching strategy +
the knowledge harvest + the RAG discipline + the feature flags
Regular docs (11 files):
- 6 new styleguides (the convention catalog):
* data_oriented_design.md: the canonical DOD reference (Tier
0/1/2; 3 defaults to reject; 8 core defaults; 7-question
simplification pass; 10-question self-check; 4 memory
dimensions in Manual Slop context)
* agent_memory_dimensions.md: the 4 memory dims (curation /
discussion / RAG / knowledge) + when to use each + the
boundaries
* rag_integration_discipline.md: the conservative-RAG rule
(opt-in, complement, provenance, no mutation, feature-gated,
graceful failure)
* cache_friendly_context.md: stable-to-volatile context
ordering + the cache TTL GUI contract + the byte-comparison
test
* knowledge_artifacts.md: the knowledge harvest pattern
(category files, provenance, sha256 ledger, digest
regeneration, 'delete to turn off')
* feature_flags.md: file presence vs config flags vs CLI flags
- 3 new project docs (the cross-cutting guides):
* guide_agent_memory_dimensions.md: the cross-cutting guide on
the 4 dims + the decision tree
* guide_caching_strategy.md: caching across providers +
stable-to-volatile ordering + cache TTL GUI + the byte-
comparison test + the 5th provider (claude-code)
* guide_knowledge_curation.md: the knowledge memory guide (4th
dim) + the 5 category files + per-file notes + the digest +
the ledger + the harvest workflow
- 2 existing doc updates:
* guide_mma.md: new sections 'Delegation as context management'
+ 'The 4 memory dimensions (the MMA scope)'
* guide_ai_client.md: new section 'Cache strategy and the 12-
layer model' + the 5th provider (claude-code)
All files use the same style as the v2.3 review (the user's preferred
format): 7-column tables, no JSON, SSDL shape tags, forth/array
notation, file:line citations, ASCII sketches where useful. The
human Readme files (Readme.md, docs/Readme.md) are NOT modified
(per repeated user instruction).
The 5th provider (claude-code) is documented in guide_ai_client.md;
data_oriented_design.md references the nagent pattern as the source of
the canonical rules.
The cross-references are bidirectional: the 6 styleguides reference
the 3 project docs; the 3 project docs reference the 6 styleguides;
the 2 doc updates reference both; AGENTS.md + ./docs/AGENTS.md
provide the entry points.
Cache-Friendly Context (stable-to-volatile ordering + cache TTL)
Status: Styleguide; codifies the cache strategy for aggregate.py:run and the GUI exposure of cache TTL.
Date: 2026-06-12
Cross-refs: conductor/code_styleguides/data_oriented_design.md §3.2; conductor/code_styleguides/agent_memory_dimensions.md; docs/guide_caching_strategy.md; conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.2, §5.
What this is. The LLM providers that Manual Slop uses (Anthropic, Gemini, OpenAI) all support some form of prompt caching. The cost benefit comes from the stable prefix being byte-identical across turns and across discussions. This styleguide defines the stable prefix, the volatile suffix, the byte-comparison contract, and the cache TTL GUI exposure.
0. The one-glance principle
    [STABLE PREFIX (cached across turns)]    [VOLATILE SUFFIX (per-turn)]
    [Role instructions]                      [Discussion metadata]
    [Function-calling schema]                [Active preset (FileItems)]
    [Discovered tool descriptions]           [Per-file details]
    [System prompt preset]                   [Tool-call results from prior turns]
    [Persona profile]                        [The user message]
    [Project context]
    [Knowledge digest]
    [file-knowledge for files in scope]
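The cache boundary is at layer 7/8 of the 12-layer model in §1 (the last stable / first volatile layer); the file-knowledge entry in the sketch belongs to the knowledge dimension (see §6) and sits on the stable side. The Anthropic-specific path wraps the prefix in cache_control: {"type": "ephemeral"} blocks at the boundary; the Gemini path uses cachedContent resources; the OpenAI path uses implicit prefix caching.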
1. The 12-layer model (the stable-to-volatile ordering)
| # | Layer | Stable across turns? | Source | SSDL |
|---|---|---|---|---|
| 1 | Role instructions (model + provider) | yes | _get_combined_system_prompt | [I] |
| 2 | Function-calling schema | yes | per provider | [I] |
| 3 | Discovered tool descriptions | yes | mcp_client.get_tool_schemas() | [I] |
| 4 | System prompt preset | yes | app_state.ai_settings.system_prompt | [I] |
| 5 | Persona profile | yes | app_state.active_persona | [I] |
| 6 | Project context (per manual_slop.toml) | yes | NEW (Candidate 14) | [I] |
| 7 | Knowledge digest (per knowledge/digest.md) | yes (within a gc cycle) | NEW (Candidate 8) | [I] |
| 8 | Discussion metadata (name, role count) | no (per turn) | disc_entries[:1] or disc_meta | ─── (data) |
| 9 | Active preset (FileItem set) | no (per turn) | self.context_files | ─── (data) |
| 10 | Per-file details (history, slices, notes) | no (per file) | per FileItem | ─── (data) |
| 11 | Tool-call results from prior turns | no (per turn) | per _reread_file_items | ─── (data) |
| 12 | The user message | no (per turn) | the input | ─── (data) |
The cache boundary is at layer 7/8. Layers 1-7 are byte-identical across turns of the same discussion (and across discussions of the same mode). Layers 8-12 change per turn.
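The ordering rule can be illustrated with a minimal sketch. The `Layer` tuple, `build_context`, and `make_turn` are hypothetical names chosen for this example and are not part of aggregate.py; the sketch only shows that, when stable layers are emitted first, the boundary offset and the prefix stay the same as the user message changes.

```python
from typing import NamedTuple


class Layer(NamedTuple):
    name: str
    text: str
    stable: bool  # True for layers 1-7, False for layers 8-12


def build_context(layers: list[Layer]) -> tuple[str, int]:
    """Concatenate stable layers first, then volatile ones.

    Returns (full_context, boundary), where boundary is the character
    offset at which the volatile suffix starts.
    """
    prefix = "".join(layer.text for layer in layers if layer.stable)
    suffix = "".join(layer.text for layer in layers if not layer.stable)
    return prefix + suffix, len(prefix)


def make_turn(user_message: str) -> tuple[str, int]:
    # Layers are deliberately listed out of order; build_context partitions them.
    return build_context([
        Layer("role", "You are a coding agent.\n", True),
        Layer("discussion_meta", "discussion: refactor auth\n", False),
        Layer("knowledge_digest", "digest: auth uses JWT\n", True),
        Layer("user_message", user_message + "\n", False),
    ])


turn1, n1 = make_turn("first prompt")
turn2, n2 = make_turn("second prompt")
print(n1 == n2, turn1[:n1] == turn2[:n2])
```

Partitioning inside the builder is one possible design; an alternative is to reject out-of-order input so that a misplaced layer is noticed at the call site.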
2. The byte-comparison test (the design contract)
The design rule "stable prefix is byte-identical" must be testable. The test:
    # In tests/test_aggregate_caching.py (NEW)
    def test_aggregate_stable_to_volatile_ordering():
        """The first N characters of the context should be identical across turns
        of the same conversation, when no stable-layer inputs change."""
        ctrl = mock_app_controller()
        ctrl.ai_settings.system_prompt = "Test system prompt"
        ctrl.active_persona = mock_persona()
        # Turn 1
        turn1 = aggregate.build_initial_context(ctrl, user_message="first prompt")
        # Turn 2 (same stable inputs, different user message)
        turn2 = aggregate.build_initial_context(ctrl, user_message="second prompt")
        # The first N characters should be identical (N = where the volatile layers start)
        N = aggregate.stable_prefix_length(ctrl)
        # Guard against a zero-length prefix, which would make the comparison vacuous
        assert N > 0, "Stable prefix is empty; nothing would be compared"
        assert turn1[:N] == turn2[:N], f"Stable prefix mismatch: {turn1[:N]!r} != {turn2[:N]!r}"
The test is the contract. If a layer whose content changes per turn is added above the boundary, or a volatile input leaks into a stable layer, this test fails; the agent must either move the layer to the correct side of the boundary or update the test (with written justification).
The implementation. aggregate.stable_prefix_length(ctrl) returns the character offset where layer 8 starts. The simplest implementation: class-level offsets in aggregate.py, one per stable layer, filled in when the context is built (the zeros below are placeholders) and extended whenever the layer stack changes:
    class AggregateStack:
        ROLE_INSTRUCTIONS_END = 0  # placeholder; computed at runtime
        SCHEMA_END = 0
        TOOLS_END = 0
        SYSTEM_PROMPT_END = 0
        PERSONA_END = 0
        PROJECT_CONTEXT_END = 0
        KNOWLEDGE_DIGEST_END = 0
        INSTANCE_START = 0  # the cache boundary
The test failure modes:
| Failure | Why it fails | Fix |
|---|---|---|
| A new stable layer was added in the wrong position | The first N characters differ because the new layer is below the boundary | Move the new layer above the boundary (between layers 7 and 8) |
| A stable layer was moved to the volatile position | The first N characters differ because the stable layer is now in the volatile part | Move the layer back to the stable position |
| A volatile input leaked into a stable layer (e.g., a timestamp in the system prompt) | The first N characters differ because the volatile input is in the prefix | Strip the volatile input from the stable layer; pass it as a separate volatile argument |
| The system prompt has a now() call | The first N characters differ across calls | Pass now() as a separate argument; don't include in the system prompt |
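When the byte-comparison test fails, it helps to know which layer diverged. The following is a minimal debugging sketch; `first_divergence` and `layer_at` are hypothetical helpers (not part of aggregate.py), and the layer texts are made-up sample data that reproduce the timestamp-leak row of the table.

```python
from typing import Optional


def first_divergence(a: str, b: str) -> Optional[int]:
    """Return the first offset where a and b differ, or None if they are equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def layer_at(offset: int, layers: list[tuple[str, str]]) -> str:
    """Map a character offset to the name of the layer that contains it."""
    end = 0
    for name, text in layers:
        end += len(text)
        if offset < end:
            return name
    raise ValueError(f"offset {offset} is past the end of the stack")


# A timestamp leaked into the system prompt, so two turns differ in the prefix.
turn_a = [("role", "You are an agent.\n"), ("system_prompt", "Now: 10:00\n"), ("persona", "terse\n")]
turn_b = [("role", "You are an agent.\n"), ("system_prompt", "Now: 10:05\n"), ("persona", "terse\n")]

prefix_a = "".join(text for _, text in turn_a)
prefix_b = "".join(text for _, text in turn_b)

offset = first_divergence(prefix_a, prefix_b)
culprit = layer_at(offset, turn_a) if offset is not None else None
print(offset, culprit)
```

Reporting the layer name in the assertion message is usually more useful than printing two long prefixes.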
3. The provider-specific cache_control (the implementation)
3.1 Anthropic (5-minute ephemeral, 4 breakpoints max)
    # In src/ai_client.py:_send_anthropic
    def _send_anthropic(messages, *, cache_prefix_chars=None):
        # cache_prefix_chars: list of char offsets (the cache boundaries), or None
        if cache_prefix_chars is not None:
            # Split the message into content blocks; mark each prefix block with cache_control
            content_blocks = cache_prefix_blocks(messages, cache_prefix_chars)
        else:
            content_blocks = messages
        response = anthropic_client.messages.create(
            model=model,
            max_tokens=8192,
            messages=[{"role": "user", "content": content_blocks}],
        )
        return _result_with_usage(response.content, response.usage, messages)
The cache_prefix_blocks helper (mirrors nagent's bin/helpers/nagent_llm.py:cache_prefix_blocks):
    from typing import Union

    def cache_prefix_blocks(message: str, cache_boundaries: list[int]) -> Union[str, list[dict]]:
        """Split the message into content blocks at the given char offsets.
        Mark each prefix block with cache_control. Returns the plain string
        when no valid boundary exists. At most 3 prefix blocks (provider limit
        is 4 breakpoints per request)."""
        if not cache_boundaries:
            return message
        # Deduplicate, drop out-of-range offsets, keep the first 3 in ascending order
        points = sorted({b for b in cache_boundaries if 0 < b < len(message)})[:3]
        if not points:
            return message
        blocks = []
        start = 0
        for point in points:
            blocks.append({
                "type": "text",
                "text": message[start:point],
                "cache_control": {"type": "ephemeral"},
            })
            start = point
        # The remainder is the volatile suffix; it carries no cache_control marker
        blocks.append({"type": "text", "text": message[start:]})
        return blocks
The Anthropic usage accounting (per nagent_llm.py:_result_with_usage):
    def _result_with_usage(text, usage, input_text=None):
        input_tokens = _usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
        # Anthropic reports cached prompt tokens separately; fold them back
        # so input_tokens stays "tokens sent" across providers.
        input_tokens += _usage_value(usage, "cache_read_input_tokens")
        input_tokens += _usage_value(usage, "cache_creation_input_tokens")
        output_tokens = _usage_value(usage, "output_tokens", "completion_tokens", ...)
        # ... etc
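The accounting above relies on a lookup helper that the excerpt does not show. A minimal sketch of how such a helper could behave follows; `usage_value` and `total_input_tokens` are illustrative stand-ins written for this example (the real `_usage_value` in nagent may differ), and the two usage objects are made-up sample data.

```python
from types import SimpleNamespace


def usage_value(usage, *names: str) -> int:
    """Return the first counter found under any of the given names, else 0.

    Accepts both attribute-style objects and plain dicts, since providers
    report usage in different shapes.
    """
    for name in names:
        if isinstance(usage, dict):
            value = usage.get(name)
        else:
            value = getattr(usage, name, None)
        if value is not None:
            return int(value)
    return 0


def total_input_tokens(usage) -> int:
    """Fold cache-read and cache-creation tokens back into the input count."""
    return (
        usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
        + usage_value(usage, "cache_read_input_tokens")
        + usage_value(usage, "cache_creation_input_tokens")
    )


# Anthropic-style usage: cached tokens are reported separately from input_tokens.
anthropic_like = SimpleNamespace(
    input_tokens=40, cache_read_input_tokens=300,
    cache_creation_input_tokens=0, output_tokens=12,
)
# OpenAI-style usage as a dict: no separate cache counters at the top level.
openai_like = {"prompt_tokens": 340, "completion_tokens": 12}

print(total_input_tokens(anthropic_like), total_input_tokens(openai_like))
```

Defaulting a missing counter to 0 keeps the sum valid for providers that never report cache fields.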
The 4-breakpoint limit. Anthropic allows at most 4 cache_control markers per request. nagent caps at 3 prefix blocks (one breakpoint per prefix). Manual Slop does the same: 3 prefix blocks, 1 volatile suffix.
3.2 Gemini (1-hour explicit cache, configurable TTL)
    # In src/ai_client.py:_send_gemini
    def _send_gemini(messages, *, cache_ttl_seconds=3600):
        if cache_ttl_seconds > 0:
            # Create a cachedContent resource for the stable prefix.
            # The google-genai SDK takes contents and ttl inside the config object.
            cached_content = genai_client.caches.create(
                model=model,
                config=genai.types.CreateCachedContentConfig(
                    contents=stable_prefix_messages,  # layers 1-7
                    ttl=f"{cache_ttl_seconds}s",
                ),
            )
            # Reference the cached content in the request
            response = genai_client.models.generate_content(
                model=model,
                contents=volatile_messages,  # layers 8-12
                config=genai.types.GenerateContentConfig(cached_content=cached_content.name),
            )
        else:
            response = genai_client.models.generate_content(model=model, contents=messages)
        return _result_with_usage(response.text, response.usage_metadata, messages)
The default TTL is 1 hour. Configurable per the GUI (per §5 below).
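A small sketch of how the GUI-configured TTL could be turned into the `ttl` string, assuming the 1-hour default is also the ceiling unless the "Allow >1h Gemini caches" setting is on. `gemini_ttl` is a hypothetical helper, and the clamping policy is an assumption drawn from the GUI sketch in §5, not a confirmed behavior.

```python
from typing import Optional

ONE_HOUR_SECONDS = 3600


def gemini_ttl(seconds: int, allow_over_1h: bool = False) -> Optional[str]:
    """Return a ttl string such as "3600s", or None when caching is off.

    Values above one hour are clamped unless allow_over_1h is set
    (assumed policy; longer caches may incur storage charges).
    """
    if seconds <= 0:
        return None
    if not allow_over_1h:
        seconds = min(seconds, ONE_HOUR_SECONDS)
    return f"{seconds}s"


print(gemini_ttl(3600), gemini_ttl(7200), gemini_ttl(7200, allow_over_1h=True), gemini_ttl(0))
```

Returning None for a non-positive TTL mirrors the `cache_ttl_seconds > 0` check in `_send_gemini`, so the caller can fall through to the uncached path.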
3.3 OpenAI (5-10 min implicit, provider-managed)
OpenAI's caching is implicit: the provider automatically caches the prefix and reuses it across requests with the same prefix. No application-side control.
    # In src/ai_client.py:_send_openai
    def _send_openai(messages, *, model="gpt-5.5"):
        # No application-side cache_control; the provider handles it
        response = openai_client.responses.create(model=model, input=messages)
        return _result_with_usage(response.output_text, response.usage, messages)
The TTL is provider-managed (5-10 min). The GUI just shows "Cached by OpenAI; TTL: provider-managed."
3.4 The provider table (the summary)
| Provider | Cache type | Default TTL | Configurable? | GUI exposure? |
|---|---|---|---|---|
| Anthropic | ephemeral | 5 min | yes (via prompt cache breakpoints) | yes (per-discussion state) |
| Google (Gemini) | explicit | 1 h | yes (via ttl field) | yes (TTL override) |
| OpenAI | implicit (auto) | 5-10 min (provider-managed) | no | no (just shows "cached") |
4. The codepath (the end-to-end flow)
[Q:ai_client.send() is called]
│
▼
[I:aggregate.build_initial_context(ctrl, user_message) -> str]
│
├──► [I:layer 1-7: build stable prefix (the cache-friendly part)]
│
├──► [I:layer 8-12: build volatile suffix (the per-turn part)]
│
├──► [I:concatenate stable + volatile = full context]
│
├──► [I:stable_prefix_length(ctrl) -> N] (the cache boundary)
│
▼
[Q:cache boundary N > 0?]
│
├── no ──► [I:pass full context to provider; no caching]
│
▼
[Q:provider is Anthropic?]
│
├── yes ──► [I:cache_prefix_blocks(full_context, [N]) -> content_blocks]
│ [I:anthropic.messages.create(content=content_blocks)]
│
[Q:provider is Gemini?]
│
├── yes ──► [I:create cachedContent resource for stable prefix]
│ [I:genai.models.generate_content(cached_content=..., contents=volatile)]
│
[Q:provider is OpenAI?]
│
├── yes ──► [I:openai.responses.create(input=full_context)] (provider handles caching)
│
[I:return LlmResult(text, input_tokens, output_tokens)]
│
▼
[Q:return to caller; tests/test_aggregate_caching.py::test_aggregate_stable_to_volatile_ordering guards this ordering]
│
[T:end]
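The branching in the flow above can be sketched as a pure function that decides how a context would be sent, without calling any provider SDK. `plan_request` and its return shape are hypothetical names for this illustration; the real dispatch lives in src/ai_client.py and may be organized differently.

```python
def plan_request(provider: str, context: str, boundary: int) -> dict:
    """Describe how the context would be sent, following the flow chart.

    boundary is the character offset returned by stable_prefix_length;
    an offset outside (0, len(context)) means there is nothing to cache.
    """
    if not 0 < boundary < len(context):
        return {"provider": provider, "mode": "uncached", "parts": [context]}

    prefix, suffix = context[:boundary], context[boundary:]
    if provider == "anthropic":
        # Prefix block gets cache_control; suffix block stays unmarked.
        return {"provider": provider, "mode": "cache_control", "parts": [prefix, suffix]}
    if provider == "gemini":
        # Prefix goes into a cachedContent resource; suffix is the request body.
        return {"provider": provider, "mode": "cached_content", "parts": [prefix, suffix]}
    if provider == "openai":
        # Implicit caching: send the full context and let the provider match the prefix.
        return {"provider": provider, "mode": "implicit", "parts": [context]}
    raise ValueError(f"unknown provider: {provider}")


print(plan_request("anthropic", "STABLE|volatile", 7))
```

Keeping the decision separate from the network call makes the branch logic testable without credentials.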
5. The GUI exposure (per-provider cache state)
The "Caching" Operations Hub sub-panel (per the v2.3 §5.3 sketch):
+------------------------------------------------------+
| Caching |
+------------------------------------------------------+
| Provider summaries |
| [Anthropic] in:340 cache:80 hit:23% ttl:4:32 |
| [Gemini] in:120 cache:0 hit:0% ttl:0:00 |
| [OpenAI] in:560 cache:200 hit:35% ttl:n/a |
+------------------------------------------------------+
| Active discussions |
| Discussion "refactor auth" |
| cached: yes (Anthropic) |
| expires: 2026-06-12T15:32 (in 4:32) |
| [Invalidate cache] [Disable caching for this] |
| Discussion "fix the parser" |
| cached: no |
| [Enable caching for this] |
+------------------------------------------------------+
| Global settings |
| [X] Enable Anthropic ephemeral caching |
| [X] Enable Gemini explicit caching |
| [ ] Allow >1h Gemini caches (charges may apply) |
| Anthropic default TTL: [5 min v] |
| Gemini default TTL: [60 min v] |
+------------------------------------------------------+
The data sources:
| Widget | Data source | Frequency |
|---|---|---|
| in:N cache:N hit:N% | ai_client.get_token_stats() (already exported) | per turn (or per session) |
| ttl:4:32 | ai_client._send_<provider> usage metadata + the cache expiry timestamp | per turn |
| cached: yes/no | per-discussion flag (NEW; tracks which discussions have active caches) | per discussion |
| [Invalidate cache] | calls ai_client._invalidate_cache(discussion_id) (NEW) | on click |
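The two derived values in the provider summary rows can be sketched as follows. `format_ttl` and `hit_percent` are hypothetical helpers; the hit ratio is assumed to be cached tokens over input tokens, rounded down, which reproduces the figures in the panel sketch (80 of 340 shown as 23%, 200 of 560 shown as 35%).

```python
from datetime import datetime
from typing import Optional


def format_ttl(expires_at: Optional[datetime], now: datetime) -> str:
    """Render the remaining cache lifetime as M:SS, or n/a when unknown."""
    if expires_at is None:
        return "n/a"  # e.g. OpenAI, where the TTL is provider-managed
    remaining = max(0, int((expires_at - now).total_seconds()))
    minutes, seconds = divmod(remaining, 60)
    return f"{minutes}:{seconds:02d}"


def hit_percent(cached_tokens: int, input_tokens: int) -> int:
    """Cached share of input tokens as a whole percentage, rounded down."""
    if input_tokens <= 0:
        return 0
    return 100 * cached_tokens // input_tokens


now = datetime(2026, 6, 12, 15, 27, 28)
expires = datetime(2026, 6, 12, 15, 32, 0)
print(f"in:340 cache:80 hit:{hit_percent(80, 340)}% ttl:{format_ttl(expires, now)}")
```

Passing `now` explicitly keeps the formatter deterministic, in line with the rule that volatile inputs such as timestamps are passed as arguments.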
The new AI client state:
    # In src/ai_client.py (NEW)
    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class DiscussionCacheState:
        discussion_id: str
        provider: str
        cached_at: datetime
        expires_at: Optional[datetime]  # None for OpenAI implicit
        hit_count: int = 0
        tokens_cached: int = 0
        last_invalidated_at: Optional[datetime] = None
        caching_enabled: bool = True  # user can disable per-discussion

    # In AppController.__init__ (NEW)
    self.discussion_caches: dict[str, DiscussionCacheState] = {}  # keyed by discussion_id
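A minimal sketch of how the state could answer the "cached: yes/no" widget and back the [Invalidate cache] button. The dataclass is repeated so the example runs on its own; `is_cache_active` and `invalidate` are hypothetical helpers, and the rule that an invalidation at or after `cached_at` cancels the cache is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DiscussionCacheState:
    discussion_id: str
    provider: str
    cached_at: datetime
    expires_at: Optional[datetime]  # None for OpenAI implicit
    hit_count: int = 0
    tokens_cached: int = 0
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True


def is_cache_active(state: DiscussionCacheState, now: datetime) -> bool:
    """True when the discussion still has a usable cache at time now."""
    if not state.caching_enabled:
        return False
    invalidated = state.last_invalidated_at
    if invalidated is not None and invalidated >= state.cached_at:
        return False  # invalidated since the cache was last written
    return state.expires_at is None or now < state.expires_at


def invalidate(state: DiscussionCacheState, now: datetime) -> None:
    """Mark the cache as invalid; the next turn rebuilds it."""
    state.last_invalidated_at = now


t0 = datetime(2026, 6, 12, 15, 27)
state = DiscussionCacheState("refactor auth", "anthropic", t0, t0 + timedelta(minutes=5))
print(is_cache_active(state, t0 + timedelta(minutes=1)))
invalidate(state, t0 + timedelta(minutes=2))
print(is_cache_active(state, t0 + timedelta(minutes=3)))
```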
The Hook API additions:
GET /api/cache # list all discussion cache states
GET /api/cache/<discussion_id> # get one
POST /api/cache/<discussion_id>/invalidate
POST /api/cache/<discussion_id>/disable
POST /api/cache/<discussion_id>/enable
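The routes above can be sketched as a framework-free handler. `handle_cache_request`, the status codes, and the dict-based state are all assumptions made for illustration; the actual Hook API server and its response format are not shown in this styleguide.

```python
def handle_cache_request(method: str, path: str, caches: dict[str, dict]) -> tuple[int, object]:
    """Route a Hook API cache request; returns (status_code, body)."""
    parts = [p for p in path.split("/") if p]
    if parts[:2] != ["api", "cache"]:
        return 404, {"error": "not found"}
    rest = parts[2:]

    if method == "GET" and not rest:
        return 200, list(caches.values())  # GET /api/cache
    if not rest or rest[0] not in caches:
        return 404, {"error": "unknown discussion"}

    state = caches[rest[0]]
    if method == "GET" and len(rest) == 1:
        return 200, state  # GET /api/cache/<discussion_id>
    if method == "POST" and len(rest) == 2:
        action = rest[1]
        if action == "invalidate":
            state["cached"] = False
        elif action == "disable":
            state["caching_enabled"] = False
        elif action == "enable":
            state["caching_enabled"] = True
        else:
            return 404, {"error": "unknown action"}
        return 200, state
    return 405, {"error": "method not allowed"}


caches = {"d1": {"discussion_id": "d1", "cached": True, "caching_enabled": True}}
print(handle_cache_request("POST", "/api/cache/d1/invalidate", caches))
```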
6. The interaction with the 4 memory dimensions (where the cache hits)
| Dim | Where injected | Stable? | Cache impact |
|---|---|---|---|
| Curation | layer 9 (active preset) | no (per turn) | NOT cached; the user might switch presets |
| Discussion | layer 8 (metadata) + layer 11 (prior turns) | no (per turn) | NOT cached (layer 8 is the first layer past the boundary) |
| RAG | the {rag-context} block, appended within layers 8-12 | no (per query) | NOT cached; RAG is volatile per query |
| Knowledge | layer 7 (digest) + per-file (file-knowledge) | yes (within a gc cycle) | CACHED; the digest is part of the stable prefix |
The cache only hits on the stable prefix (layers 1-7). The volatile suffix (layers 8-12) is not cached; the user expects the conversation to change per turn.
The interaction with knowledge harvest: when nagent-gc (or the Manual Slop equivalent) regenerates the digest, the cache is invalidated for the next turn. The user can also force invalidation manually with the [Invalidate cache] button.
The interaction with file edit: when the user edits a file in the Structural File Editor, the file-knowledge for that file is updated. The cache is invalidated for the next turn that references the file. The per-file knowledge change is a cache invalidator.
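Both invalidators (digest regeneration and a file-knowledge edit) can be detected the same way: fingerprint the stable layers and compare against the fingerprint recorded when the cache was written. The sketch below is one possible approach, not the documented mechanism; `stable_fingerprint` and `needs_invalidation` are hypothetical names.

```python
import hashlib
from typing import Optional


def stable_fingerprint(stable_layers: list[str]) -> str:
    """Hash the stable layers in order; any byte change alters the digest."""
    h = hashlib.sha256()
    for text in stable_layers:
        data = text.encode("utf-8")
        # Length-prefix each layer so text moving across a layer edge is detected.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def needs_invalidation(previous: Optional[str], stable_layers: list[str]) -> bool:
    """True when a cache exists and the stable layers no longer match it."""
    return previous is not None and previous != stable_fingerprint(stable_layers)


layers = ["role instructions", "digest v1", "file-knowledge: auth.py"]
recorded = stable_fingerprint(layers)

unchanged = needs_invalidation(recorded, layers)
after_gc = needs_invalidation(recorded, ["role instructions", "digest v2", "file-knowledge: auth.py"])
print(unchanged, after_gc)
```

Storing the fingerprint next to the per-discussion cache state would let the next turn decide on invalidation without rebuilding the full context twice.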
7. The cross-references
- conductor/code_styleguides/data_oriented_design.md §3.2, §3.3, §3.4: the data-oriented foundation
- conductor/code_styleguides/agent_memory_dimensions.md: the 4 dims (where the cache hits)
- conductor/code_styleguides/knowledge_artifacts.md: the knowledge digest (the layer 7 cached content)
- docs/guide_caching_strategy.md: the user-facing deep-dive
- src/aggregate.py:run: the consumer of this styleguide
- src/ai_client.py:_send_<provider>: the producer
- conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.2, §5: the nagent pattern that informed this styleguide