# Cache-Friendly Context (stable-to-volatile ordering + cache TTL)

**Status:** Styleguide; codifies the cache strategy for `aggregate.py:run` and the GUI exposure of cache TTL.

**Date:** 2026-06-12

**Cross-refs:** `conductor/code_styleguides/data_oriented_design.md` §3.2; `conductor/code_styleguides/agent_memory_dimensions.md`; `docs/guide_caching_strategy.md`; `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5.

> **What this is.** The LLM providers that Manual Slop uses (Anthropic, Gemini, OpenAI) all support some form of prompt caching. The cost benefit comes from the *stable prefix* being byte-identical across turns and across discussions. This styleguide defines the stable prefix, the volatile suffix, the byte-comparison contract, and the cache TTL GUI exposure.

---

## 0. The one-glance principle

```
[STABLE PREFIX (cached across turns)]    [VOLATILE SUFFIX (per-turn)]
[Role instructions]                      [Discussion metadata]
[Function-calling schema]                [Active preset (FileItems)]
[Discovered tool descriptions]           [Per-file details]
[System prompt preset]                   [Tool-call results from prior turns]
[Persona profile]                        [The user message]
[Project context]
[Knowledge digest]
[file-knowledge for files in scope]
```

The cache boundary is at layer 7/8 in the numbering of §1 (the last stable / first volatile layer); the diagram's file-knowledge entry is counted with the layer-7 knowledge digest (see §6). The Anthropic-specific path wraps the prefix in `cache_control: {"type": "ephemeral"}` blocks at the boundary; the Gemini path uses `cachedContent` resources; the OpenAI path uses implicit prefix caching.

---

## 1. The 12-layer model (the stable-to-volatile ordering)

| # | Layer | Stable across turns? | Source | SSDL |
|---|---|---|---|---|
| 1 | Role instructions (model + provider) | yes | `_get_combined_system_prompt` | `[I]` |
| 2 | Function-calling schema | yes | per provider | `[I]` |
| 3 | Discovered tool descriptions | yes | `mcp_client.get_tool_schemas()` | `[I]` |
| 4 | System prompt preset | yes | `app_state.ai_settings.system_prompt` | `[I]` |
| 5 | Persona profile | yes | `app_state.active_persona` | `[I]` |
| 6 | Project context (per `manual_slop.toml`) | yes | NEW (Candidate 14) | `[I]` |
| 7 | Knowledge digest (per `knowledge/digest.md`) | yes (within a gc cycle) | NEW (Candidate 8) | `[I]` |
| 8 | Discussion metadata (name, role count) | no (per turn) | `disc_entries[:1]` or `disc_meta` | `───` (data) |
| 9 | Active preset (FileItem set) | no (per turn) | `self.context_files` | `───` (data) |
| 10 | Per-file details (history, slices, notes) | no (per file) | per `FileItem` | `───` (data) |
| 11 | Tool-call results from prior turns | no (per turn) | per `_reread_file_items` | `───` (data) |
| 12 | The user message | no (per turn) | the input | `───` (data) |

**The cache boundary is at layer 7/8.** Layers 1-7 are byte-identical across turns of the same discussion (and across discussions of the same mode). Layers 8-12 change per turn.

---

## 2. The byte-comparison test (the design contract)

The design rule "stable prefix is byte-identical" must be testable.
The test:

```python
# In tests/test_aggregate_caching.py (NEW)
def test_aggregate_stable_to_volatile_ordering():
    """The first N characters of the context should be identical across
    turns of the same conversation, when no stable-layer inputs change."""
    ctrl = mock_app_controller()
    ctrl.ai_settings.system_prompt = "Test system prompt"
    ctrl.active_persona = mock_persona()

    # Turn 1
    turn1 = aggregate.build_initial_context(ctrl, user_message="first prompt")

    # Turn 2 (same stable inputs, different user message)
    turn2 = aggregate.build_initial_context(ctrl, user_message="second prompt")

    # The first N characters should be identical (N = where the volatile layers start)
    N = aggregate.stable_prefix_length(ctrl)
    assert turn1[:N] == turn2[:N], f"Stable prefix mismatch: {turn1[:N]!r} != {turn2[:N]!r}"
```

**The test is the contract.** If a new layer is added in the middle of the stack, this test fails; the agent must either move the layer to the stable position or update the test (with written justification).

**The implementation.** `aggregate.stable_prefix_length(ctrl)` returns the character offset where layer 8 starts.
The simplest implementation: a set of class-level offsets in `aggregate.py`, updated when the layer stack changes:

```python
class AggregateStack:
    ROLE_INSTRUCTIONS_END = 0  # placeholder; computed at runtime
    SCHEMA_END = 0
    TOOLS_END = 0
    SYSTEM_PROMPT_END = 0
    PERSONA_END = 0
    PROJECT_CONTEXT_END = 0
    KNOWLEDGE_DIGEST_END = 0
    INSTANCE_START = 0  # the cache boundary
```

**The test failure modes:**

| Failure | Why it fails | Fix |
|---|---|---|
| A new stable layer was added in the wrong position | The first N characters differ because the new layer is below the boundary | Move the new layer above the boundary (between layers 7 and 8) |
| A stable layer was moved to the volatile position | The first N characters differ because the stable layer is now in the volatile part | Move the layer back to the stable position |
| A volatile input leaked into a stable layer (e.g., a timestamp in the system prompt) | The first N characters differ because the volatile input is in the prefix | Strip the volatile input from the stable layer; pass it as a separate volatile argument |
| The system prompt has a `now()` call | The first N characters differ across calls | Pass `now()` as a separate argument; don't include it in the system prompt |

---

## 3. The provider-specific cache_control (the implementation)

### 3.1 Anthropic (5-minute ephemeral, 4 breakpoints max)

```python
# In src/ai_client.py:_send_anthropic
def _send_anthropic(messages, *, cache_prefix_chars=None):
    if cache_prefix_chars is not None:
        # Wrap the message in content blocks; mark each prefix with cache_control
        content_blocks = cache_prefix_blocks(messages, cache_prefix_chars)
    else:
        content_blocks = messages
    response = anthropic_client.messages.create(
        model=model,
        max_tokens=8192,
        messages=[{"role": "user", "content": content_blocks}],
    )
    return _result_with_usage(response.content, response.usage, messages)
```

**The cache_prefix_blocks helper** (mirrors nagent's `bin/helpers/nagent_llm.py:cache_prefix_blocks`):

```python
def cache_prefix_blocks(message: str, cache_boundaries: list[int]) -> list[dict] | str:
    """Split the message into content blocks at the given char offsets.
    Mark each prefix block with cache_control. Returns the plain string
    when no valid boundary exists. At most 3 prefix blocks (provider limit
    is 4 breakpoints per request)."""
    if not cache_boundaries:
        return message
    # Deduplicate, drop out-of-range offsets, keep the first 3 in ascending order
    points = sorted({b for b in cache_boundaries if 0 < b < len(message)})[:3]
    if not points:
        return message
    blocks = []
    start = 0
    for point in points:
        blocks.append({
            "type": "text",
            "text": message[start:point],
            "cache_control": {"type": "ephemeral"},
        })
        start = point
    # The remainder is the volatile suffix; it carries no cache_control marker
    blocks.append({"type": "text", "text": message[start:]})
    return blocks
```

**The Anthropic usage accounting** (per `nagent_llm.py:_result_with_usage`):

```python
def _result_with_usage(text, usage, input_text=None):
    input_tokens = _usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
    # Anthropic reports cached prompt tokens separately; fold them back
    # so input_tokens stays "tokens sent" across providers.
    input_tokens += _usage_value(usage, "cache_read_input_tokens")
    input_tokens += _usage_value(usage, "cache_creation_input_tokens")
    output_tokens = _usage_value(usage, "output_tokens", "completion_tokens", ...)
    # ... etc
```

**The 4-breakpoint limit.** Anthropic allows at most 4 `cache_control` markers per request. nagent caps at 3 prefix blocks (one breakpoint per prefix). Manual Slop does the same: 3 prefix blocks, 1 volatile suffix.

### 3.2 Gemini (1-hour explicit cache, configurable TTL)

```python
# In src/ai_client.py:_send_gemini
def _send_gemini(messages, *, cache_ttl_seconds=3600):
    if cache_ttl_seconds > 0:
        # Create a cachedContent resource for the stable prefix.
        # The google-genai SDK takes contents and ttl inside a config object.
        cached_content = genai_client.caches.create(
            model=model,
            config=genai.types.CreateCachedContentConfig(
                contents=stable_prefix_messages,  # layers 1-7
                ttl=f"{cache_ttl_seconds}s",
            ),
        )
        # Reference the cached content in the request
        response = genai_client.models.generate_content(
            model=model,
            contents=volatile_messages,  # layers 8-12
            config=genai.types.GenerateContentConfig(cached_content=cached_content.name),
        )
    else:
        response = genai_client.models.generate_content(model=model, contents=messages)
    return _result_with_usage(response.text, response.usage_metadata, messages)
```

**The default TTL is 1 hour.** Configurable per the GUI (per §5 below).

### 3.3 OpenAI (5-10 min implicit, provider-managed)

OpenAI's caching is *implicit*: the provider automatically caches the prefix and reuses it across requests with the same prefix. There is no application-side control.

```python
# In src/ai_client.py:_send_openai
def _send_openai(messages, *, model="gpt-5.5"):
    # No application-side cache_control; the provider handles it
    response = openai_client.responses.create(model=model, input=messages)
    return _result_with_usage(response.output_text, response.usage, messages)
```

**The TTL is provider-managed** (5-10 min). The GUI just shows "Cached by OpenAI; TTL: provider-managed."

### 3.4 The provider table (the summary)

| Provider | Cache type | Default TTL | Configurable? | GUI exposure? |
|---|---|---|---|---|
| Anthropic | ephemeral | 5 min | yes (via prompt cache breakpoints) | yes (per-discussion state) |
| Google (Gemini) | explicit | 1 h | yes (via `ttl` field) | yes (TTL override) |
| OpenAI | implicit (auto) | 5-10 min (provider-managed) | no | no (just shows "cached") |

---

## 4. The codepath (the end-to-end flow)

```
[Q:ai_client.send() is called]
    │
    ▼
[I:aggregate.build_initial_context(ctrl, user_message) -> str]
    │
    ├──► [I:layer 1-7: build stable prefix (the cache-friendly part)]
    │
    ├──► [I:layer 8-12: build volatile suffix (the per-turn part)]
    │
    ├──► [I:concatenate stable + volatile = full context]
    │
    ├──► [I:stable_prefix_length(ctrl) -> N]  (the cache boundary)
    │
    ▼
[Q:cache boundary N > 0?]
    │
    ├── no ──► [I:pass full context to provider; no caching]
    │
    ▼
[Q:provider is Anthropic?]
    │
    ├── yes ──► [I:cache_prefix_blocks(full_context, [N]) -> content_blocks]
    │           [I:anthropic.messages.create(content=content_blocks)]
    │
[Q:provider is Gemini?]
    │
    ├── yes ──► [I:create cachedContent resource for stable prefix]
    │           [I:genai.models.generate_content(cached_content=..., contents=volatile)]
    │
[Q:provider is OpenAI?]
    │
    ├── yes ──► [I:openai.responses.create(input=full_context)]  (provider handles caching)
    │
[I:return LlmResult(text, input_tokens, output_tokens)]
    │
    ▼
[Q:return to caller; aggregate.test_aggregate_stable_to_volatile_ordering is run]
    │
[T:end]
```

---

## 5. The GUI exposure (per-provider cache state)

The "Caching" Operations Hub sub-panel (per the v2.3 §5.3 sketch):

```
+------------------------------------------------------+
| Caching                                              |
+------------------------------------------------------+
| Provider summaries                                   |
|   [Anthropic] in:340 cache:80  hit:23% ttl:4:32      |
|   [Gemini]    in:120 cache:0   hit:0%  ttl:0:00      |
|   [OpenAI]    in:560 cache:200 hit:35% ttl:n/a       |
+------------------------------------------------------+
| Active discussions                                   |
|   Discussion "refactor auth"                         |
|     cached: yes (Anthropic)                          |
|     expires: 2026-06-12T15:32 (in 4:32)              |
|     [Invalidate cache] [Disable caching for this]    |
|   Discussion "fix the parser"                        |
|     cached: no                                       |
|     [Enable caching for this]                        |
+------------------------------------------------------+
| Global settings                                      |
|   [X] Enable Anthropic ephemeral caching             |
|   [X] Enable Gemini explicit caching                 |
|   [ ] Allow >1h Gemini caches (charges may apply)    |
|   Anthropic default TTL: [5 min v]                   |
|   Gemini default TTL:    [60 min v]                  |
+------------------------------------------------------+
```

**The data sources:**

| Widget | Data source | Frequency |
|---|---|---|
| `in:N cache:N hit:N%` | `ai_client.get_token_stats()` (already exported) | per turn (or per session) |
| `ttl:4:32` | `ai_client._send_` usage metadata + the cache expiry timestamp | per turn |
| `cached: yes/no` | per-discussion flag (NEW; tracks which discussions have active caches) | per discussion |
| `[Invalidate cache]` | calls `ai_client._invalidate_cache(discussion_id)` (NEW) | on click |

**The new AI client state:**

```python
# In src/ai_client.py (NEW)
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DiscussionCacheState:
    discussion_id: str
    provider: str
    cached_at: datetime
    expires_at: Optional[datetime]  # None for OpenAI implicit
    hit_count: int = 0
    tokens_cached: int = 0
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True  # user can disable per-discussion

# In AppController (NEW)
self.discussion_caches: dict[str, DiscussionCacheState] = {}  # keyed by discussion_id
```

**The Hook API additions:**

```
GET  /api/cache              # list all discussion cache states
GET  /api/cache/             # get one
POST /api/cache//invalidate
POST /api/cache//disable
POST /api/cache//enable
```

---

## 6. The interaction with the 4 memory dimensions (where the cache hits)

| Dim | Where injected | Stable? | Cache impact |
|---|---|---|---|
| Curation | layer 9 (active preset) | no (per turn) | NOT cached; the user might switch presets |
| Discussion | layer 8 (metadata) + layer 11 (prior turns) | no (per turn) | NOT cached (except: layer 8 metadata is the boundary) |
| RAG | the `{rag-context}` block, appended to layers 8-12 | no (per query) | NOT cached; RAG is volatile per query |
| Knowledge | layer 7 (digest) + per-file (file-knowledge) | yes (within a gc cycle) | CACHED; the digest is the stable prefix |

**The cache only hits on the stable prefix (layers 1-7).** The volatile suffix (layers 8-12) is *not* cached; the user expects the conversation to change per turn.

**The interaction with knowledge harvest:** when `nagent-gc` (or the Manual Slop equivalent) regenerates the digest, the cache is invalidated for the next turn. The user can also force invalidation manually (the `[Invalidate cache]` button).

**The interaction with file edit:** when the user edits a file in the Structural File Editor, the file-knowledge for that file is updated, and the cache is invalidated for the next turn that references the file. In other words, any per-file knowledge change is a cache invalidator.

---

## 7. The cross-references

- `conductor/code_styleguides/data_oriented_design.md` §3.2, §3.3, §3.4: the data-oriented foundation
- `conductor/code_styleguides/agent_memory_dimensions.md`: the 4 dims (where the cache hits)
- `conductor/code_styleguides/knowledge_artifacts.md`: the knowledge digest (the layer 7 cached content)
- `docs/guide_caching_strategy.md`: the user-facing deep-dive
- `src/aggregate.py:run`: the consumer of this styleguide
- `src/ai_client.py:_send_`: the producer
- `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5: the nagent pattern that informed this styleguide
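
As a closing illustration of the §2 contract, here is a minimal, self-contained sketch of the byte-comparison idea that runs without any Manual Slop modules. Everything in it is hypothetical: `build_context`, `prefix_matches`, the `SEPARATOR` delimiter, and the toy layer strings are stand-ins, not the real `aggregate.py` API. It shows only the mechanism: a shared boundary offset, a byte-for-byte prefix comparison, and how a timestamp leaking into a stable layer breaks the match.

```python
"""Toy model of the stable/volatile split; all names here are hypothetical."""

SEPARATOR = "\n\n"  # assumed layer delimiter; the real aggregate may differ


def build_context(stable_layers, volatile_layers):
    """Return (full_context, boundary), where boundary is the character
    offset at which the first volatile layer starts."""
    prefix = SEPARATOR.join(stable_layers) + SEPARATOR
    return prefix + SEPARATOR.join(volatile_layers), len(prefix)


def prefix_matches(turn_a, turn_b):
    """True when both turns share the same boundary and identical prefix bytes."""
    (ctx_a, n_a), (ctx_b, n_b) = turn_a, turn_b
    return n_a == n_b and ctx_a[:n_a].encode("utf-8") == ctx_b[:n_b].encode("utf-8")


stable = ["role instructions", "tool schema", "system prompt", "knowledge digest"]

# Two turns with the same stable inputs and different user messages.
turn1 = build_context(stable, ["discussion: demo", "user: first prompt"])
turn2 = build_context(stable, ["discussion: demo", "user: second prompt"])

# A volatile value (a fake timestamp) leaking into a stable layer.
leaky1 = build_context(stable + ["now=12:00:01"], ["user: first prompt"])
leaky2 = build_context(stable + ["now=12:00:02"], ["user: first prompt"])

print(prefix_matches(turn1, turn2))    # True: the prefix is cacheable
print(prefix_matches(leaky1, leaky2))  # False: the leak changes the prefix
```

The leaky case corresponds to the "volatile input leaked into a stable layer" row of the failure-mode table: the boundary offset is unchanged, but the prefix bytes differ, so a provider-side cache keyed on the prefix would miss on every turn.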