manual_slop/docs/guide_caching_strategy.md

# Caching Strategy Guide

**Status:** User-facing deep-dive on the cache strategy: stable-to-volatile context ordering, the 4 cache-TTL profiles (Anthropic, Gemini, OpenAI, claude-code), and the GUI exposure.
**Date:** 2026-06-12
**Cross-refs:** `conductor/code_styleguides/cache_friendly_context.md`; `docs/guide_ai_client.md`; `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5.

> **What this is.** The LLM providers Manual Slop uses (Anthropic, Gemini, OpenAI) all support prompt caching. The cost benefit comes from the *stable prefix* being byte-identical across turns. This guide is the user-facing deep-dive on the 12-layer model, the byte-comparison test, the provider-specific TTLs, and the GUI exposure.

---

## 0. The 30-second version

```
[STABLE PREFIX (cached across turns)]  [VOLATILE SUFFIX (per-turn)]
[Role instructions]                     [Discussion metadata]
[Function-calling schema]               [Active preset (FileItems)]
[Discovered tool descriptions]          [Per-file details]
[System prompt preset]                  [Tool-call results from prior turns]
[Persona profile]                       [The user message]
[Project context]
[Knowledge digest]
[file-knowledge for files in scope]
```

**The cache boundary is at layer 7/8.** Layers 1-7 are byte-identical across turns; layers 8-12 change per turn. The Anthropic-specific path wraps the prefix in `cache_control: {"type": "ephemeral"}` blocks; the Gemini path uses `cachedContent` resources; the OpenAI path uses implicit prefix caching.

**The provider-specific defaults:**

| Provider | Default TTL | Configurable? | GUI exposure? |
|---|---|---|---|
| Anthropic ephemeral | 5 min | yes (per-discussion) | yes |
| Gemini explicit | 1 h | yes (per-discussion override) | yes (TTL override) |
| OpenAI implicit | 5-10 min (provider-managed) | no | shows "cached" only |
| claude-code (Claude Agent SDK) | varies (provider-managed) | no | shows "cached" only |

---

## 1. The 12-layer model (the stable-to-volatile ordering)

| # | Layer | Stable across turns? | Source | SSDL |
|---|---|---|---|---|
| 1 | Role instructions (model + provider) | yes | `_get_combined_system_prompt` | `[I]` |
| 2 | Function-calling schema | yes | per provider | `[I]` |
| 3 | Discovered tool descriptions | yes | `mcp_client.get_tool_schemas()` | `[I]` |
| 4 | System prompt preset | yes | `app_state.ai_settings.system_prompt` | `[I]` |
| 5 | Persona profile | yes | `app_state.active_persona` | `[I]` |
| 6 | Project context (per `manual_slop.toml`) | yes | NEW | `[I]` |
| 7 | Knowledge digest (per `knowledge/digest.md`) | yes (within a gc cycle) | NEW | `[I]` |
| 8 | Discussion metadata (name, role count) | no (per turn) | `disc_entries[:1]` or `disc_meta` | `───` |
| 9 | Active preset (FileItem set) | no (per turn) | `self.context_files` | `───` |
| 10 | Per-file details (history, slices, notes) | no (per file) | per `FileItem` | `───` |
| 11 | Tool-call results from prior turns | no (per turn) | per `_reread_file_items` | `───` |
| 12 | The user message | no (per turn) | the input | `───` |

**The cache boundary is at layer 7/8.** Layers 1-7 are byte-identical across turns of the same discussion (and across discussions of the same mode). Layers 8-12 change per turn.
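The ordering can be made mechanical by keeping the layers in one ordered list and deriving the boundary from it. The sketch below is illustrative only: `Layer`, `assemble`, and `SEPARATOR` are hypothetical names, not Manual Slop's actual `aggregate` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    number: int   # 1-12, matching the table above
    name: str
    stable: bool  # True for layers 1-7
    text: str


SEPARATOR = "\n\n"  # hypothetical; any fixed separator works


def assemble(layers: list[Layer]) -> tuple[str, int]:
    """Return (full_context, stable_prefix_length).

    Raises ValueError if a stable layer is ordered after a volatile one,
    because that would break the byte-identical prefix.
    """
    ordered = sorted(layers, key=lambda layer: layer.number)
    seen_volatile = False
    for layer in ordered:
        if layer.stable and seen_volatile:
            raise ValueError(f"stable layer {layer.name!r} follows a volatile layer")
        seen_volatile = seen_volatile or not layer.stable
    # The prefix is built from stable layers only, so it cannot depend on
    # anything that changes per turn.
    stable_text = "".join(layer.text + SEPARATOR for layer in ordered if layer.stable)
    volatile_text = SEPARATOR.join(layer.text for layer in ordered if not layer.stable)
    return stable_text + volatile_text, len(stable_text)
```

Two calls that differ only in their volatile layers return the same prefix length and the same first `N` characters, which is the property §2 tests.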

---

## 2. The byte-comparison test (the design contract)

The design rule "stable prefix is byte-identical" must be testable. The test:

```python
# In tests/test_aggregate_caching.py (NEW)
def test_aggregate_stable_to_volatile_ordering():
    """The first N characters of the context should be identical across turns
    of the same conversation, when no stable-layer inputs change."""
    ctrl = mock_app_controller()
    ctrl.ai_settings.system_prompt = "Test system prompt"
    ctrl.active_persona = mock_persona()

    # Turn 1
    turn1 = aggregate.build_initial_context(ctrl, user_message="first prompt")

    # Turn 2 (same stable inputs, different user message)
    turn2 = aggregate.build_initial_context(ctrl, user_message="second prompt")

    # The first N characters should be identical (N = where the volatile layers start)
    N = aggregate.stable_prefix_length(ctrl)
    assert turn1[:N] == turn2[:N], f"Stable prefix mismatch: {turn1[:N]!r} != {turn2[:N]!r}"
```

**The test is the contract.** If a new layer that changes per turn lands inside the stable prefix, this test fails; the agent must either move the layer to its correct side of the boundary or update the test (with written justification).
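When the assertion fails, the repr of two long prefixes is hard to read. A small helper that reports only the first point of divergence can make the failure actionable. This is a sketch; `first_divergence` and `describe_mismatch` are hypothetical names, not part of the existing test suite.

```python
from typing import Optional


def first_divergence(a: str, b: str) -> int:
    """Index of the first differing character, or the length of the
    shorter string when one is a prefix of the other."""
    limit = min(len(a), len(b))
    for index in range(limit):
        if a[index] != b[index]:
            return index
    return limit


def describe_mismatch(turn1: str, turn2: str, stable_len: int,
                      window: int = 20) -> Optional[str]:
    """Return a short report when the stable prefixes differ, else None."""
    left, right = turn1[:stable_len], turn2[:stable_len]
    if left == right:
        return None
    offset = first_divergence(left, right)
    lo, hi = max(0, offset - window), offset + window
    return (f"stable prefix diverges at char {offset}: "
            f"{left[lo:hi]!r} != {right[lo:hi]!r}")
```

In a test this could be used as `report = describe_mismatch(turn1, turn2, N); assert report is None, report`.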

---

## 3. The provider-specific cache strategies

### 3.1 Anthropic (5-minute ephemeral, 4 breakpoints max)

```python
# In src/ai_client.py:_send_anthropic
def _send_anthropic(messages, *, cache_prefix_chars=None):
    if cache_prefix_chars is not None:
        # Wrap the message in content blocks; mark each prefix with cache_control
        content_blocks = cache_prefix_blocks(messages, cache_prefix_chars)
    else:
        content_blocks = messages

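    # `model` and `anthropic_client` are assumed to be defined elsewhere in ai_client.
    response = anthropic_client.messages.create(
        model=model,
        max_tokens=8192,
        messages=[{"role": "user", "content": content_blocks}],
    )
    # response.content is a list of content blocks; join the text blocks.
    text = "".join(block.text for block in response.content if block.type == "text")
    return _result_with_usage(text, response.usage, messages)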
```

**The cache_prefix_blocks helper:**

```python
def cache_prefix_blocks(message: str, cache_boundaries: list[int]) -> str | list[dict]:
    """Split the message into content blocks at the given char offsets.
    Mark each prefix block with cache_control. Returns the plain string
    when no valid boundary exists. At most 3 prefix blocks (provider limit
    is 4 breakpoints per request)."""
    if not cache_boundaries:
        return message
    points = sorted({b for b in cache_boundaries if 0 < b < len(message)})[:3]
    if not points:
        return message
    blocks = []
    start = 0
    for point in points:
        blocks.append({
            "type": "text",
            "text": message[start:point],
            "cache_control": {"type": "ephemeral"},
        })
        start = point
    blocks.append({"type": "text", "text": message[start:]})
    return blocks
```

**The Anthropic usage accounting:**

```python
def _result_with_usage(text, usage, input_text=None):
    input_tokens = _usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
    # Anthropic reports cached prompt tokens separately; fold them back
    # so input_tokens stays "tokens sent" across providers.
    input_tokens += _usage_value(usage, "cache_read_input_tokens")
    input_tokens += _usage_value(usage, "cache_creation_input_tokens")
    # ...
```

**The 4-breakpoint limit.** Anthropic allows at most 4 `cache_control` markers per request. Manual Slop marks at most 3 prefix blocks (one breakpoint each); the volatile suffix block carries no marker, so one breakpoint stays unused.
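Because `cache_prefix_blocks` keeps the three smallest offsets (`sorted(...)[:3]`), passing more than three would drop the offset at the end of the stable prefix. One way to avoid that is to choose the offsets before calling the helper. The sketch below assumes the caller knows the character length of each stable layer; `pick_boundaries` is a hypothetical helper, not an existing function.

```python
from itertools import accumulate

MAX_PREFIX_BLOCKS = 3  # same cap as the [:3] slice in cache_prefix_blocks


def pick_boundaries(stable_layer_lengths: list[int],
                    max_blocks: int = MAX_PREFIX_BLOCKS) -> list[int]:
    """Return at most `max_blocks` ascending char offsets.

    Offsets are the cumulative ends of the stable layers. When there are
    too many, the latest ones are kept so the end of the stable prefix is
    always among them.
    """
    if max_blocks <= 0:
        return []
    # Zero-length layers produce duplicate offsets; the set removes them.
    ends = sorted({end for end in accumulate(stable_layer_lengths) if end > 0})
    return ends[-max_blocks:]
```

For seven stable layers of lengths `[100, 50, 0, 200, 30, 10, 60]` this yields `[380, 390, 450]`: the ends of layers 5, 6, and 7.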

### 3.2 Gemini (1-hour explicit cache, configurable TTL)

```python
# In src/ai_client.py:_send_gemini
def _send_gemini(messages, *, cache_ttl_seconds=3600):
    if cache_ttl_seconds > 0:
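        # The google-genai SDK takes contents and ttl inside a config object.
        cached_content = genai_client.caches.create(
            model=model,
            config=genai.types.CreateCachedContentConfig(
                contents=stable_prefix_messages,
                ttl=f"{cache_ttl_seconds}s",
            ),
        )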
        response = genai_client.models.generate_content(
            model=model,
            contents=volatile_messages,
            config=genai.types.GenerateContentConfig(cached_content=cached_content.name),
        )
    else:
        response = genai_client.models.generate_content(model=model, contents=messages)
    return _result_with_usage(response.text, response.usage_metadata, messages)
```

**The default TTL is 1 hour.** It is configurable in the GUI (see §4).
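An explicit cache only pays off if later turns reuse the same cache resource instead of creating a new one. A minimal sketch of that reuse follows, keyed by a hash of the stable prefix. Everything here is an assumption for illustration: `PrefixCacheRegistry`, `CacheHandle`, and the `create` callback (which would wrap the provider call) are hypothetical, and the local expiry is only an estimate of the provider-side TTL.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CacheHandle:
    name: str          # provider-side cache resource name
    expires_at: float  # epoch seconds, local estimate


class PrefixCacheRegistry:
    """Maps a hash of the stable prefix to a live cache handle."""

    def __init__(self, create: Callable[[str, int], str],
                 clock: Callable[[], float] = time.time):
        self._create = create  # (stable_prefix, ttl_seconds) -> cache name
        self._clock = clock
        self._handles: dict[str, CacheHandle] = {}

    def get_or_create(self, stable_prefix: str, ttl_seconds: int = 3600) -> str:
        key = hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()
        now = self._clock()
        handle = self._handles.get(key)
        # Create a new cache when none exists for this prefix or the old one expired.
        if handle is None or handle.expires_at <= now:
            name = self._create(stable_prefix, ttl_seconds)
            handle = CacheHandle(name=name, expires_at=now + ttl_seconds)
            self._handles[key] = handle
        return handle.name
```

Any change to the stable prefix produces a different key, so a changed persona or digest naturally results in a fresh cache rather than a stale hit.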

### 3.3 OpenAI (5-10 min implicit, provider-managed)

OpenAI's caching is *implicit*: the provider automatically caches the prefix and reuses it across requests with the same prefix. No application-side control.

```python
# In src/ai_client.py:_send_openai
def _send_openai(messages, *, model="gpt-5.5"):
    response = openai_client.responses.create(model=model, input=messages)
    # No application-side cache_control; the provider handles it
    return _result_with_usage(response.output_text, response.usage, messages)
```

**The TTL is provider-managed** (5-10 min). The GUI just shows "Cached by OpenAI; TTL: provider-managed."
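Although the TTL is hidden, the usage payload still reports how many input tokens were served from cache, which is what the "cached" indicator can be built on. The sketch below reads that figure from a usage payload given as a plain dict (for example after converting the SDK object to a dict). `openai_cached_tokens` is a hypothetical helper; it is not the existing `_usage_value`.

```python
def openai_cached_tokens(usage: dict) -> tuple[int, int]:
    """Return (input_tokens, cached_tokens) from a usage dict.

    Handles the Responses shape (input_tokens / input_tokens_details) and
    the Chat Completions shape (prompt_tokens / prompt_tokens_details).
    Missing or null fields count as zero.
    """
    input_tokens = usage.get("input_tokens", usage.get("prompt_tokens", 0)) or 0
    details = (usage.get("input_tokens_details")
               or usage.get("prompt_tokens_details")
               or {})
    cached = details.get("cached_tokens", 0) or 0
    return int(input_tokens), int(cached)
```

Unlike the Anthropic case in §3.1, the cached count here is a subset of the input count, so nothing needs to be folded back into `input_tokens`.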

### 3.4 claude-code (5th provider, subscription auth)

`claude-code` uses the Claude Agent SDK with local Claude Code authentication (no API key). The caching behavior is provider-managed.

```python
# In src/ai_client.py:_send_claude_code (the 5th provider)
def _send_claude_code(message, model, *, allowed_tools=None, max_turns=1):
    options = ClaudeAgentOptions(
        model=None if not model or model == "default" else model,
        max_turns=max_turns,
        tools=list(allowed_tools) if allowed_tools else [],
        allowed_tools=list(allowed_tools) if allowed_tools else [],
        cwd=os.getcwd(),
    )
    # ... claude_agent_sdk.query(prompt=message, options=options)
    return _result_with_usage(text, usage, message)
```

---

## 4. The GUI exposure

The "Caching" Operations Hub sub-panel:

```
+------------------------------------------------------+
| Caching                                              |
+------------------------------------------------------+
| Provider summaries                                   |
| [Anthropic]   in:340 cache:80  hit:23%  ttl:4:32     |
| [Gemini]      in:120 cache:0   hit:0%   ttl:0:00     |
| [OpenAI]      in:560 cache:200 hit:35%  ttl:n/a      |
+------------------------------------------------------+
| Active discussions                                   |
| Discussion "refactor auth"                           |
|   cached: yes (Anthropic)                            |
|   expires: 2026-06-12T15:32 (in 4:32)                |
|   [Invalidate cache] [Disable caching for this]      |
| Discussion "fix the parser"                          |
|   cached: no                                         |
|   [Enable caching for this]                          |
+------------------------------------------------------+
| Global settings                                      |
|   [X] Enable Anthropic ephemeral caching             |
|   [X] Enable Gemini explicit caching                 |
|   [ ] Allow >1h Gemini caches (charges may apply)    |
|   Anthropic default TTL: [5 min v]                   |
|   Gemini default TTL:    [60 min v]                  |
+------------------------------------------------------+
```

**The data sources:**

| Widget | Data source | Frequency |
|---|---|---|
| `in:N cache:N hit:N%` | `ai_client.get_token_stats()` | per turn (or per session) |
| `ttl:4:32` | `ai_client._send_<provider>` usage metadata + the cache expiry timestamp | per turn |
| `cached: yes/no` | per-discussion flag (NEW) | per discussion |
| `[Invalidate cache]` | calls `ai_client._invalidate_cache(discussion_id)` (NEW) | on click |

**The new AI client state:**

```python
# In src/ai_client.py (NEW)
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DiscussionCacheState:
    discussion_id: str
    provider: str
    cached_at: datetime
    expires_at: Optional[datetime]
    hit_count: int = 0
    tokens_cached: int = 0
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True

# In AppController (NEW)
self.discussion_caches: dict[str, DiscussionCacheState] = {}
```
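The `ttl:4:32` widget can be derived from `expires_at` alone. A minimal sketch, assuming `expires_at` and the current time are both naive or both timezone-aware; `format_ttl` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Optional


def format_ttl(expires_at: Optional[datetime], now: datetime) -> str:
    """Render the remaining cache lifetime as M:SS.

    Returns "n/a" when no expiry is known (provider-managed caches) and
    "0:00" once the expiry has passed.
    """
    if expires_at is None:
        return "n/a"
    remaining = int((expires_at - now).total_seconds())
    if remaining <= 0:
        return "0:00"
    minutes, seconds = divmod(remaining, 60)
    return f"{minutes}:{seconds:02d}"
```

With an expiry of 15:32:00 and a current time of 15:27:28 this renders `4:32`, matching the mockup above.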

**The Hook API additions:**

```
GET  /api/cache                        # list all discussion cache states
GET  /api/cache/<discussion_id>        # get one
POST /api/cache/<discussion_id>/invalidate
POST /api/cache/<discussion_id>/disable
POST /api/cache/<discussion_id>/enable
```
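The five routes reduce to three patterns. The sketch below shows one way to resolve a method and path to an action and discussion id. It assumes discussion ids contain no `/`, and `resolve` is a hypothetical function, not the existing Hook API dispatcher.

```python
import re
from typing import Optional

# (method, pattern, fixed action or None when the action comes from the path)
_ROUTES = [
    ("GET", re.compile(r"/api/cache"), "list"),
    ("GET", re.compile(r"/api/cache/(?P<discussion_id>[^/]+)"), "get"),
    ("POST", re.compile(
        r"/api/cache/(?P<discussion_id>[^/]+)/(?P<action>invalidate|disable|enable)"
    ), None),
]


def resolve(method: str, path: str) -> Optional[tuple[str, Optional[str]]]:
    """Return (action, discussion_id) for a known route, else None."""
    for route_method, pattern, action in _ROUTES:
        if method != route_method:
            continue
        match = pattern.fullmatch(path)
        if match:
            groups = match.groupdict()
            return action or groups["action"], groups.get("discussion_id")
    return None
```

For example, `resolve("POST", "/api/cache/abc/invalidate")` gives `("invalidate", "abc")`, while a `GET` on the same path gives `None`.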

---

## 5. The injection (where the cache hits)

| Layer | Where injected | Stable? | Cache impact |
|---|---|---|---|
| 1. Role instructions | `_get_combined_system_prompt` | yes | **CACHED** |
| 2. Function-calling schema | per provider | yes | **CACHED** |
| 3. Discovered tool descriptions | `mcp_client.get_tool_schemas()` | yes | **CACHED** |
| 4. System prompt preset | `app_state.ai_settings.system_prompt` | yes | **CACHED** |
| 5. Persona profile | `app_state.active_persona` | yes | **CACHED** |
| 6. Project context | `manual_slop.toml [agent.context_files]` | yes | **CACHED** |
| 7. Knowledge digest | `~/.manual_slop/knowledge/digest.md` | yes (within a gc cycle) | **CACHED** |
| 8. Discussion metadata | `disc_entries[:1]` | no | NOT cached |
| 9. Active preset | `self.context_files` | no | NOT cached |
| 10. Per-file details | per `FileItem` | no | NOT cached |
| 11. Prior tool results | per `_reread_file_items` | no | NOT cached |
| 12. User message | the input | no | NOT cached |

**The cache only hits on the stable prefix (layers 1-7).** The volatile suffix (layers 8-12) is *not* cached; the user expects the conversation to change per turn.

---

## 6. The cache invalidation triggers

| Trigger | Effect |
|---|---|
| `python -m src.knowledge_harvest --apply` | The digest is regenerated; the cache is invalidated for the next turn |
| `FileItem.notes` edited | The per-file knowledge changes; the cache is invalidated for the next turn that references the file |
| `persona` changed | The persona profile is in the stable prefix; the cache is invalidated |
| `[Invalidate cache]` button | The per-discussion cache state is marked `last_invalidated_at`; the next turn re-creates it |
| `expiration` reached | The provider's cache expires automatically; the next turn re-creates it |
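Most of these triggers amount to "a stable layer's bytes changed". Hashing each stable layer separately makes it possible to report which layer caused the invalidation. A sketch under that assumption follows; `layer_fingerprints` and `changed_layers` are hypothetical helpers, and the layer names are illustrative.

```python
import hashlib


def layer_fingerprints(stable_layers: dict[str, str]) -> dict[str, str]:
    """Hash each stable layer's text separately."""
    return {
        name: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for name, text in stable_layers.items()
    }


def changed_layers(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Sorted names of layers that were added, removed, or modified."""
    names = set(previous) | set(current)
    return sorted(name for name in names if previous.get(name) != current.get(name))
```

Comparing the fingerprints from the previous turn with the current ones returns an empty list when the prefix is reusable, and for instance `["digest"]` after the digest has been regenerated.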

---

## 7. The measurement (the empirical basis)

**The "before" measurement** (do this first, before any refactor):

```bash
# Log the cache hit rate over a sample of representative discussions
$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 23% (avg)
cache write rate: 45% (avg)
in:N avg: 1,200
cache:N avg: 280
```

**The "after" measurement** (after the stable-to-volatile refactor):

```bash
$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 67% (avg)     # <-- should be measurably higher
cache write rate: 18% (avg)   # <-- should be lower
in:N avg: 1,200               # <-- unchanged (the user still types the same)
cache:N avg: ~800             # <-- should rise with the hit rate (67% of 1,200 is about 800)
```

**The win comes from re-aligning the boundaries**, not from changing the providers. The test is whether the cache hit rate is measurably higher after the refactor.
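The averages above can be computed from per-turn usage records. The sketch below assumes the hit rate is cached-read tokens over input tokens, which matches the panel figures in §4 (80 / 340 is about 23%), and that the write rate is the analogous ratio for cache writes. `TurnUsage` and `summarize` are hypothetical; the real `scripts.measure_cache_hit_rate` may define these differently.

```python
from dataclasses import dataclass


@dataclass
class TurnUsage:
    input_tokens: int        # total tokens sent, cached tokens included
    cache_read_tokens: int   # tokens served from cache
    cache_write_tokens: int  # tokens written to cache on this turn


def summarize(turns: list[TurnUsage]) -> dict[str, float]:
    """Average per-turn hit and write rates as fractions in [0, 1]."""
    rated = [turn for turn in turns if turn.input_tokens > 0]
    if not rated:
        return {"hit_rate": 0.0, "write_rate": 0.0}
    hit = sum(t.cache_read_tokens / t.input_tokens for t in rated) / len(rated)
    write = sum(t.cache_write_tokens / t.input_tokens for t in rated) / len(rated)
    return {"hit_rate": hit, "write_rate": write}
```

Running the same summary over the same sample of discussions before and after the refactor keeps the comparison like for like.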

---

## 8. The cross-references

- `conductor/code_styleguides/cache_friendly_context.md` — the canonical styleguide
- `docs/guide_ai_client.md` — the underlying LLM client (the producer)
- `docs/guide_agent_memory_dimensions.md` §5 — where the 4 dims get injected
- `docs/guide_knowledge_curation.md` §3 — the digest (layer 7)
- `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5 — the nagent pattern