# Caching Strategy Guide

**Status:** User-facing deep-dive on the cache strategy: stable-to-volatile context ordering, the 4 cache-TTL profiles (Anthropic, Gemini, OpenAI, claude-code), and the GUI exposure.

**Date:** 2026-06-12

**Cross-refs:** `conductor/code_styleguides/cache_friendly_context.md`; `docs/guide_ai_client.md`; `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5.

> **What this is.** The LLM providers Manual Slop uses (Anthropic, Gemini, OpenAI) all support prompt caching. The cost benefit comes from the *stable prefix* being byte-identical across turns. This guide is the user-facing deep-dive on the 12-layer model, the byte-comparison test, the provider-specific TTLs, and the GUI exposure.

---

## 0. The 30-second version

```
[STABLE PREFIX (cached across turns)]      [VOLATILE SUFFIX (per-turn)]
[Role instructions]                        [Discussion metadata]
[Function-calling schema]                  [Active preset (FileItems)]
[Discovered tool descriptions]             [Per-file details]
[System prompt preset]                     [Tool-call results from prior turns]
[Persona profile]                          [The user message]
[Project context]
[Knowledge digest]
[file-knowledge for files in scope]
```

**The cache boundary is at layer 7/8.** Layers 1-7 are byte-identical across turns; layers 8-12 change per turn. The Anthropic-specific path wraps the prefix in `cache_control: {"type": "ephemeral"}` blocks; the Gemini path uses `cachedContent` resources; the OpenAI path uses implicit prefix caching.

**The provider-specific defaults:**

| Provider | Default TTL | Configurable? | GUI exposure? |
|---|---|---|---|
| Anthropic ephemeral | 5 min | yes (per-discussion) | yes |
| Gemini explicit | 1 h | yes (per-discussion override) | yes (TTL override) |
| OpenAI implicit | 5-10 min (provider-managed) | no | shows "cached" only |
| claude-code (Claude Agent SDK) | varies (provider-managed) | no | shows "cached" only |

---

## 1. The 12-layer model (the stable-to-volatile ordering)

| # | Layer | Stable across turns? | Source | SSDL |
|---|---|---|---|---|
| 1 | Role instructions (model + provider) | yes | `_get_combined_system_prompt` | `[I]` |
| 2 | Function-calling schema | yes | per provider | `[I]` |
| 3 | Discovered tool descriptions | yes | `mcp_client.get_tool_schemas()` | `[I]` |
| 4 | System prompt preset | yes | `app_state.ai_settings.system_prompt` | `[I]` |
| 5 | Persona profile | yes | `app_state.active_persona` | `[I]` |
| 6 | Project context (per `manual_slop.toml`) | yes | NEW | `[I]` |
| 7 | Knowledge digest (per `knowledge/digest.md`) | yes (within a gc cycle) | NEW | `[I]` |
| 8 | Discussion metadata (name, role count) | no (per turn) | `disc_entries[:1]` or `disc_meta` | `───` |
| 9 | Active preset (FileItem set) | no (per turn) | `self.context_files` | `───` |
| 10 | Per-file details (history, slices, notes) | no (per file) | per `FileItem` | `───` |
| 11 | Tool-call results from prior turns | no (per turn) | per `_reread_file_items` | `───` |
| 12 | The user message | no (per turn) | the input | `───` |

**The cache boundary is at layer 7/8.** Layers 1-7 are byte-identical across turns of the same discussion (and across discussions of the same mode). Layers 8-12 change per turn.

---

## 2. The byte-comparison test (the design contract)

The design rule "stable prefix is byte-identical" must be testable.
The test:

```python
# In tests/test_aggregate_caching.py (NEW)
def test_aggregate_stable_to_volatile_ordering():
    """The first N characters of the context should be identical across
    turns of the same conversation, when no stable-layer inputs change."""
    ctrl = mock_app_controller()
    ctrl.ai_settings.system_prompt = "Test system prompt"
    ctrl.active_persona = mock_persona()

    # Turn 1
    turn1 = aggregate.build_initial_context(ctrl, user_message="first prompt")

    # Turn 2 (same stable inputs, different user message)
    turn2 = aggregate.build_initial_context(ctrl, user_message="second prompt")

    # The first N characters should be identical (N = where the volatile layers start)
    N = aggregate.stable_prefix_length(ctrl)
    assert turn1[:N] == turn2[:N], f"Stable prefix mismatch: {turn1[:N]!r} != {turn2[:N]!r}"
```

**The test is the contract.** If a new layer is added in the middle of the stack, this test fails; the agent must either move the layer to the stable position or update the test (with written justification).

---

## 3. The provider-specific cache strategies

### 3.1 Anthropic (5-minute ephemeral, 4 breakpoints max)

```python
# In src/ai_client.py:_send_anthropic
def _send_anthropic(messages, *, cache_prefix_chars=None):
    if cache_prefix_chars is not None:
        # Wrap the message in content blocks; mark each prefix with cache_control
        content_blocks = cache_prefix_blocks(messages, cache_prefix_chars)
    else:
        content_blocks = messages
    response = anthropic_client.messages.create(
        model=model,
        max_tokens=8192,
        messages=[{"role": "user", "content": content_blocks}],
    )
    return _result_with_usage(response.content, response.usage, messages)
```

**The cache_prefix_blocks helper:**

```python
def cache_prefix_blocks(message: str, cache_boundaries: list[int]) -> str | list[dict]:
    """Split the message into content blocks at the given char offsets.
    Mark each prefix block with cache_control.
    Returns the plain string when no valid boundary exists.
    At most 3 prefix blocks (provider limit is 4 breakpoints per request)."""
    if not cache_boundaries:
        return message
    # Deduplicate, drop out-of-range offsets, and keep the first 3 boundaries
    points = sorted({b for b in cache_boundaries if 0 < b < len(message)})[:3]
    if not points:
        return message
    blocks = []
    start = 0
    for point in points:
        blocks.append({
            "type": "text",
            "text": message[start:point],
            "cache_control": {"type": "ephemeral"},
        })
        start = point
    # The volatile suffix carries no cache_control marker
    blocks.append({"type": "text", "text": message[start:]})
    return blocks
```

**The Anthropic usage accounting:**

```python
def _result_with_usage(text, usage, input_text=None):
    input_tokens = _usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
    # Anthropic reports cached prompt tokens separately; fold them back
    # so input_tokens stays "tokens sent" across providers.
    input_tokens += _usage_value(usage, "cache_read_input_tokens")
    input_tokens += _usage_value(usage, "cache_creation_input_tokens")
    # ...
```

**The 4-breakpoint limit.** Anthropic allows at most 4 `cache_control` markers per request. Manual Slop uses at most 3 prefix blocks (one breakpoint each) plus 1 unmarked volatile suffix block, so a request stays under the limit.

### 3.2 Gemini (1-hour explicit cache, configurable TTL)

```python
# In src/ai_client.py:_send_gemini
def _send_gemini(messages, *, cache_ttl_seconds=3600):
    if cache_ttl_seconds > 0:
        cached_content = genai_client.caches.create(
            model=model,
            contents=stable_prefix_messages,
            ttl=f"{cache_ttl_seconds}s",
        )
        response = genai_client.models.generate_content(
            model=model,
            contents=volatile_messages,
            config=genai.types.GenerateContentConfig(cached_content=cached_content.name),
        )
    else:
        response = genai_client.models.generate_content(model=model, contents=messages)
    return _result_with_usage(response.text, response.usage_metadata, messages)
```

**The default TTL is 1 hour.** It is configurable in the GUI (see §4).

### 3.3 OpenAI (5-10 min implicit, provider-managed)

OpenAI's caching is *implicit*: the provider automatically caches the prefix and reuses it across requests with the same prefix.
No application-side control.

```python
# In src/ai_client.py:_send_openai
def _send_openai(messages, *, model="gpt-5.5"):
    # No application-side cache_control; the provider handles it
    response = openai_client.responses.create(model=model, input=messages)
    return _result_with_usage(response.output_text, response.usage, messages)
```

**The TTL is provider-managed** (5-10 min). The GUI just shows "Cached by OpenAI; TTL: provider-managed."

### 3.4 claude-code (5th provider, subscription auth)

`claude-code` uses the Claude Agent SDK with local Claude Code authentication (no API key). The caching behavior is provider-managed.

```python
# In src/ai_client.py:_send_claude_code (the 5th provider)
def _send_claude_code(message, model, *, allowed_tools=None, max_turns=1):
    options = ClaudeAgentOptions(
        model=None if not model or model == "default" else model,
        max_turns=max_turns,
        tools=list(allowed_tools) if allowed_tools else [],
        allowed_tools=list(allowed_tools) if allowed_tools else [],
        cwd=os.getcwd(),
    )
    # ... claude_agent_sdk.query(prompt=message, options=options)
    return _result_with_usage(text, usage, message)
```

---

## 4. The GUI exposure

The "Caching" Operations Hub sub-panel:

```
+------------------------------------------------------+
| Caching                                              |
+------------------------------------------------------+
| Provider summaries                                   |
|   [Anthropic] in:340 cache:80 hit:23% ttl:4:32       |
|   [Gemini] in:120 cache:0 hit:0% ttl:0:00            |
|   [OpenAI] in:560 cache:200 hit:35% ttl:n/a          |
+------------------------------------------------------+
| Active discussions                                   |
|   Discussion "refactor auth"                         |
|     cached: yes (Anthropic)                          |
|     expires: 2026-06-12T15:32 (in 4:32)              |
|     [Invalidate cache] [Disable caching for this]    |
|   Discussion "fix the parser"                        |
|     cached: no                                       |
|     [Enable caching for this]                        |
+------------------------------------------------------+
| Global settings                                      |
|   [X] Enable Anthropic ephemeral caching             |
|   [X] Enable Gemini explicit caching                 |
|   [ ] Allow >1h Gemini caches (charges may apply)    |
|   Anthropic default TTL: [5 min v]                   |
|   Gemini default TTL: [60 min v]                     |
+------------------------------------------------------+
```

**The data sources:**

| Widget | Data source | Frequency |
|---|---|---|
| `in:N cache:N hit:N%` | `ai_client.get_token_stats()` | per turn (or per session) |
| `ttl:4:32` | `ai_client._send_` usage metadata + the cache expiry timestamp | per turn |
| `cached: yes/no` | per-discussion flag (NEW) | per discussion |
| `[Invalidate cache]` | calls `ai_client._invalidate_cache(discussion_id)` (NEW) | on click |

**The new AI client state:**

```python
# In src/ai_client.py (NEW)
@dataclass
class DiscussionCacheState:
    discussion_id: str
    provider: str
    cached_at: datetime
    expires_at: Optional[datetime]
    hit_count: int = 0
    tokens_cached: int = 0
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True

# In AppController (NEW)
self.discussion_caches: dict[str, DiscussionCacheState] = {}
```

**The Hook API additions:**

```
GET  /api/cache               # list all discussion cache states
GET  /api/cache/              # get one
POST /api/cache//invalidate
POST /api/cache//disable
POST /api/cache//enable
```
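The per-discussion state above is what the panel's `ttl:` and `cached:` widgets would read. The following is a minimal sketch of that bookkeeping, not the actual Manual Slop implementation: `CacheStateSketch`, `is_live`, `format_ttl`, and `invalidate` are hypothetical names, and the sketch assumes that a missing `expires_at` means a provider-managed cache (OpenAI, claude-code) whose TTL the application cannot see. Invalidation only stamps `last_invalidated_at`, leaving cache re-creation to the next turn, as the trigger table describes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CacheStateSketch:
    """Hypothetical stand-in reusing a subset of DiscussionCacheState's fields."""
    discussion_id: str
    provider: str
    cached_at: datetime
    expires_at: Optional[datetime]
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True


def is_live(state: CacheStateSketch, now: datetime) -> bool:
    """Live = enabled, has a known expiry, not invalidated since creation, not expired."""
    if not state.caching_enabled or state.expires_at is None:
        return False
    if state.last_invalidated_at is not None and state.last_invalidated_at >= state.cached_at:
        return False
    return now < state.expires_at


def format_ttl(state: CacheStateSketch, now: datetime) -> str:
    """Render the remaining TTL as m:ss; 'n/a' when the provider manages expiry."""
    if state.expires_at is None:
        return "n/a"
    if not is_live(state, now):
        return "0:00"
    remaining = int((state.expires_at - now).total_seconds())
    return f"{remaining // 60}:{remaining % 60:02d}"


def invalidate(state: CacheStateSketch, now: datetime) -> None:
    """Stamp the invalidation time; re-creating the cache is left to the next turn."""
    state.last_invalidated_at = now


if __name__ == "__main__":
    created = datetime(2026, 6, 12, 15, 27, 0)
    state = CacheStateSketch(
        discussion_id="refactor auth",
        provider="anthropic",
        cached_at=created,
        expires_at=created + timedelta(minutes=5),
    )
    print(format_ttl(state, created + timedelta(seconds=28)))
    invalidate(state, created + timedelta(seconds=30))
    print(format_ttl(state, created + timedelta(seconds=31)))
```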
---

## 5. The injection (where the cache hits)

| Layer | Where injected | Stable? | Cache impact |
|---|---|---|---|
| 1. Role instructions | `_get_combined_system_prompt` | yes | **CACHED** |
| 2. Function-calling schema | per provider | yes | **CACHED** |
| 3. Discovered tool descriptions | `mcp_client.get_tool_schemas()` | yes | **CACHED** |
| 4. System prompt preset | `app_state.ai_settings.system_prompt` | yes | **CACHED** |
| 5. Persona profile | `app_state.active_persona` | yes | **CACHED** |
| 6. Project context | `manual_slop.toml [agent.context_files]` | yes | **CACHED** |
| 7. Knowledge digest | `~/.manual_slop/knowledge/digest.md` | yes (within a gc cycle) | **CACHED** |
| 8. Discussion metadata | `disc_entries[:1]` | no | NOT cached |
| 9. Active preset | `self.context_files` | no | NOT cached |
| 10. Per-file details | per `FileItem` | no | NOT cached |
| 11. Prior tool results | per `_reread_file_items` | no | NOT cached |
| 12. User message | the input | no | NOT cached |

**The cache only hits on the stable prefix (layers 1-7).** The volatile suffix (layers 8-12) is *not* cached; the user expects the conversation to change per turn.

---

## 6. The cache invalidation triggers

| Trigger | Effect |
|---|---|
| `python -m src.knowledge_harvest --apply` | The digest is regenerated; the cache is invalidated for the next turn |
| `FileItem.notes` edited | The per-file knowledge changes; the cache is invalidated for the next turn that references the file |
| `persona` changed | The persona profile is in the stable prefix; the cache is invalidated |
| `[Invalidate cache]` button | The per-discussion cache state is marked `last_invalidated_at`; the next turn re-creates it |
| `expiration` reached | The provider's cache expires automatically; the next turn re-creates it |

---

## 7. The measurement (the empirical basis)

**The "before" measurement** (do this first, before any refactor):

```bash
# Log the cache hit rate over a sample of representative discussions
$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 23% (avg)
cache write rate: 45% (avg)
in:N avg: 1,200
cache:N avg: 280
```

**The "after" measurement** (after the stable-to-volatile refactor):

```bash
$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 67% (avg)      # <-- should be measurably higher
cache write rate: 18% (avg)    # <-- should be lower
in:N avg: 1,200                # <-- unchanged (the user still types the same)
cache:N avg: 280               # <-- unchanged
```

**The win comes from re-aligning the boundaries**, not from changing the providers. The test is whether the cache hit rate is measurably higher after the refactor.

---

## 8. The cross-references

- `conductor/code_styleguides/cache_friendly_context.md`: the canonical styleguide
- `docs/guide_ai_client.md`: the underlying LLM client (the producer)
- `docs/guide_agent_memory_dimensions.md` §5: where the 4 dims get injected
- `docs/guide_knowledge_curation.md` §3: the digest (layer 7)
- `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5: the nagent pattern