# Cache-Friendly Context (stable-to-volatile ordering + cache TTL)

**Status:** Styleguide; codifies the cache strategy for `aggregate.py:run` and the GUI exposure of cache TTL.
**Date:** 2026-06-12
**Cross-refs:** `conductor/code_styleguides/data_oriented_design.md` §3.2; `conductor/code_styleguides/agent_memory_dimensions.md`; `docs/guide_caching_strategy.md`; `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5.

> **What this is.** The LLM providers that Manual Slop uses (Anthropic, Gemini, OpenAI) all support some form of prompt caching. The cost benefit comes from the *stable prefix* being byte-identical across turns and across discussions. This styleguide defines the stable prefix, the volatile suffix, the byte-comparison contract, and the cache TTL GUI exposure.

---

## 0. The one-glance principle

```
[STABLE PREFIX (cached across turns)]  [VOLATILE SUFFIX (per-turn)]
[Role instructions]                     [Discussion metadata]
[Function-calling schema]               [Active preset (FileItems)]
[Discovered tool descriptions]          [Per-file details]
[System prompt preset]                  [Tool-call results from prior turns]
[Persona profile]                       [The user message]
[Project context]
[Knowledge digest]
[file-knowledge for files in scope]
```

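The cache boundary is at layer 7/8 of the 12-layer model in §1 (the last stable / first volatile layer). The file-knowledge row above has no number of its own in §1; §6 groups it with the layer-7 knowledge digest. The Anthropic-specific path marks the prefix with `cache_control: {"type": "ephemeral"}` at the boundary; the Gemini path uses `cachedContent` resources; the OpenAI path uses implicit prefix caching.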

---

## 1. The 12-layer model (the stable-to-volatile ordering)

| # | Layer | Stable across turns? | Source | SSDL |
|---|---|---|---|---|
| 1 | Role instructions (model + provider) | yes | `_get_combined_system_prompt` | `[I]` |
| 2 | Function-calling schema | yes | per provider | `[I]` |
| 3 | Discovered tool descriptions | yes | `mcp_client.get_tool_schemas()` | `[I]` |
| 4 | System prompt preset | yes | `app_state.ai_settings.system_prompt` | `[I]` |
| 5 | Persona profile | yes | `app_state.active_persona` | `[I]` |
| 6 | Project context (per `manual_slop.toml`) | yes | NEW (Candidate 14) | `[I]` |
| 7 | Knowledge digest (per `knowledge/digest.md`) | yes (within a gc cycle) | NEW (Candidate 8) | `[I]` |
| 8 | Discussion metadata (name, role count) | no (per turn) | `disc_entries[:1]` or `disc_meta` | `───` (data) |
| 9 | Active preset (FileItem set) | no (per turn) | `self.context_files` | `───` (data) |
| 10 | Per-file details (history, slices, notes) | no (per file) | per `FileItem` | `───` (data) |
| 11 | Tool-call results from prior turns | no (per turn) | per `_reread_file_items` | `───` (data) |
| 12 | The user message | no (per turn) | the input | `───` (data) |

**The cache boundary is at layer 7/8.** Layers 1-7 are byte-identical across turns of the same discussion (and across discussions of the same mode). Layers 8-12 change per turn.
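
One way to keep the ordering honest is to declare the layers as data and check that no stable layer follows a volatile one. The sketch below is illustrative only: `Layer`, `LAYERS`, and `cache_boundary_index` are hypothetical names, not part of `aggregate.py`, and the layer names are shortened from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    number: int
    name: str
    stable: bool


# Mirrors the 12-layer table; names are shortened for the example.
LAYERS = [
    Layer(1, "role_instructions", True),
    Layer(2, "function_schema", True),
    Layer(3, "tool_descriptions", True),
    Layer(4, "system_prompt_preset", True),
    Layer(5, "persona_profile", True),
    Layer(6, "project_context", True),
    Layer(7, "knowledge_digest", True),
    Layer(8, "discussion_metadata", False),
    Layer(9, "active_preset", False),
    Layer(10, "per_file_details", False),
    Layer(11, "tool_call_results", False),
    Layer(12, "user_message", False),
]


def cache_boundary_index(layers: list[Layer]) -> int:
    """Return how many leading layers are stable.

    Raises ValueError if a stable layer appears after a volatile one,
    because that layer could never be part of the cached prefix.
    """
    boundary = 0
    seen_volatile = False
    for layer in layers:
        if layer.stable:
            if seen_volatile:
                raise ValueError(f"stable layer {layer.name!r} is below the boundary")
            boundary += 1
        else:
            seen_volatile = True
    return boundary
```

With the table as written, `cache_boundary_index(LAYERS)` returns 7, matching the 7/8 boundary.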

---

## 2. The byte-comparison test (the design contract)

The design rule "stable prefix is byte-identical" must be testable. The test:

```python
# In tests/test_aggregate_caching.py (NEW)
def test_aggregate_stable_to_volatile_ordering():
    """The first N characters of the context should be identical across turns
    of the same conversation, when no stable-layer inputs change."""
    ctrl = mock_app_controller()
    ctrl.ai_settings.system_prompt = "Test system prompt"
    ctrl.active_persona = mock_persona()

    # Turn 1
    turn1 = aggregate.build_initial_context(ctrl, user_message="first prompt")

    # Turn 2 (same stable inputs, different user message)
    turn2 = aggregate.build_initial_context(ctrl, user_message="second prompt")

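    # The first N characters should be identical (N = where the volatile layers start)
    N = aggregate.stable_prefix_length(ctrl)
    # Guard against a vacuous pass: with N == 0 both slices are empty strings
    assert N > 0, "Stable prefix is empty; the comparison below would pass vacuously"
    assert turn1[:N] == turn2[:N], f"Stable prefix mismatch: {turn1[:N]!r} != {turn2[:N]!r}"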
```

**The test is the contract.** If a new layer is added in the middle of the stack, this test fails; the agent must either move the layer to the stable position or update the test (with written justification).

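**The implementation.** `aggregate.stable_prefix_length(ctrl)` returns the character offset where layer 8 starts. The simplest implementation keeps one class-level offset per stable layer in `aggregate.py`; the zeros below are placeholders that are filled in at runtime, and the list of fields is revisited whenever the layer stack changes: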

```python
class AggregateStack:
    ROLE_INSTRUCTIONS_END = 0          # placeholder; computed at runtime
    SCHEMA_END = 0
    TOOLS_END = 0
    SYSTEM_PROMPT_END = 0
    PERSONA_END = 0
    PROJECT_CONTEXT_END = 0
    KNOWLEDGE_DIGEST_END = 0
    INSTANCE_START = 0                 # the cache boundary
```
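
Because the offsets depend on the rendered text, they are easier to compute than to hard-code. A minimal sketch, assuming each stable layer is already rendered to a string and that layers are joined with a fixed separator; `stable_layer_offsets`, `build_prefix`, and `SEPARATOR` are assumptions made for illustration.

```python
from itertools import accumulate

SEPARATOR = "\n\n"  # assumed joiner between layers


def stable_layer_offsets(stable_layers: list[str]) -> list[int]:
    """Return the end offset of each stable layer in the joined prefix.

    The last value is the cache boundary: the offset where layer 8 starts.
    Each layer is followed by SEPARATOR so the boundary falls on a clean edge.
    """
    lengths = [len(text) + len(SEPARATOR) for text in stable_layers]
    return list(accumulate(lengths))


def build_prefix(stable_layers: list[str]) -> str:
    """Join the stable layers in order, each followed by SEPARATOR."""
    return "".join(text + SEPARATOR for text in stable_layers)
```

The final offset always equals `len(build_prefix(...))`, so a value like `INSTANCE_START` could be derived from it rather than maintained by hand.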

**The test failure modes:**

| Failure | Why it fails | Fix |
|---|---|---|
| A new stable layer was added in the wrong position | The first N characters differ because the new layer is below the boundary | Move the new layer above the boundary (between layers 7 and 8) |
| A stable layer was moved to the volatile position | The first N characters differ because the stable layer is now in the volatile part | Move the layer back to the stable position |
| A volatile input leaked into a stable layer (e.g., a timestamp in the system prompt) | The first N characters differ because the volatile input is in the prefix | Strip the volatile input from the stable layer; pass it as a separate volatile argument |
| The system prompt has a `now()` call | The first N characters differ across calls | Pass `now()` as a separate argument; don't include in the system prompt |
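
When the test fails, the useful question is where the two prefixes first diverge. The helper below is a hypothetical debugging aid (neither function exists in the codebase); it reports the first differing offset with a little surrounding context, which usually points straight at the leaked volatile value.

```python
from typing import Optional


def first_divergence(a: str, b: str) -> Optional[int]:
    """Return the index of the first differing character, or None if one
    string is a prefix of the other (including when they are equal)."""
    for index, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return index
    return None


def explain_prefix_mismatch(turn1: str, turn2: str, boundary: int, width: int = 20) -> str:
    """Describe where two stable prefixes diverge, with some context."""
    index = first_divergence(turn1[:boundary], turn2[:boundary])
    if index is None:
        return "stable prefixes match"
    lo = max(0, index - width)
    return (
        f"diverges at offset {index}: "
        f"{turn1[lo:index + width]!r} vs {turn2[lo:index + width]!r}"
    )
```

For two contexts that differ only in an embedded clock value, the report lands on the digit that changed.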

---

## 3. The provider-specific cache_control (the implementation)

### 3.1 Anthropic (5-minute ephemeral, 4 breakpoints max)

```python
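# In src/ai_client.py:_send_anthropic
def _send_anthropic(messages, *, cache_prefix_chars=None):
    # `messages` is the full context string; `cache_prefix_chars` is a list of
    # char offsets (e.g. [N]) at which to place cache breakpoints.
    if cache_prefix_chars is not None:
        # Split into content blocks; mark each prefix block with cache_control
        content_blocks = cache_prefix_blocks(messages, cache_prefix_chars)
    else:
        content_blocks = messages

    response = anthropic_client.messages.create(
        model=model,
        max_tokens=8192,
        messages=[{"role": "user", "content": content_blocks}],
    )
    # response.content is a list of content blocks; join the text blocks
    text = "".join(block.text for block in response.content if block.type == "text")
    return _result_with_usage(text, response.usage, messages)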
```

**The cache_prefix_blocks helper** (mirrors nagent's `bin/helpers/nagent_llm.py:cache_prefix_blocks`):

```python
def cache_prefix_blocks(message: str, cache_boundaries: list[int]) -> str | list[dict]:
    """Split the message into content blocks at the given char offsets.
    Mark each prefix block with cache_control. Returns the plain string
    when no valid boundary exists. At most 3 prefix blocks (provider limit
    is 4 breakpoints per request)."""
    if not cache_boundaries:
        return message
    points = sorted({b for b in cache_boundaries if 0 < b < len(message)})[:3]
    if not points:
        return message
    blocks = []
    start = 0
    for point in points:
        blocks.append({
            "type": "text",
            "text": message[start:point],
            "cache_control": {"type": "ephemeral"},
        })
        start = point
    blocks.append({"type": "text", "text": message[start:]})
    return blocks
```

**The Anthropic usage accounting** (per `nagent_llm.py:_result_with_usage`):

```python
def _result_with_usage(text, usage, input_text=None):
    input_tokens = _usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
    # Anthropic reports cached prompt tokens separately; fold them back
    # so input_tokens stays "tokens sent" across providers.
    input_tokens += _usage_value(usage, "cache_read_input_tokens")
    input_tokens += _usage_value(usage, "cache_creation_input_tokens")
    output_tokens = _usage_value(usage, "output_tokens", "completion_tokens", ...)
    # ... etc
```

**The 4-breakpoint limit.** Anthropic allows at most 4 `cache_control` markers per request. nagent caps at 3 prefix blocks (one breakpoint per prefix). Manual Slop does the same: 3 prefix blocks, 1 volatile suffix.
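
A cheap guard before sending is to count the markers and confirm that the blocks still reassemble into the original message. The sketch below assumes content blocks shaped like the ones `cache_prefix_blocks` produces; `count_breakpoints`, `check_blocks`, and `MAX_BREAKPOINTS` are hypothetical names.

```python
MAX_BREAKPOINTS = 4  # per-request limit stated above


def count_breakpoints(blocks: list[dict]) -> int:
    """Count the content blocks that carry a cache_control marker."""
    return sum(1 for block in blocks if "cache_control" in block)


def check_blocks(blocks: list[dict], original: str) -> None:
    """Raise ValueError if the blocks exceed the breakpoint limit or no
    longer concatenate back to the original message."""
    used = count_breakpoints(blocks)
    if used > MAX_BREAKPOINTS:
        raise ValueError(f"{used} cache_control markers; limit is {MAX_BREAKPOINTS}")
    if "".join(block["text"] for block in blocks) != original:
        raise ValueError("blocks do not reassemble into the original message")
```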

### 3.2 Gemini (1-hour explicit cache, configurable TTL)

```python
# In src/ai_client.py:_send_gemini
def _send_gemini(messages, *, cache_ttl_seconds=3600):
    if cache_ttl_seconds > 0:
        # Create a cachedContent resource for the stable prefix;
        # contents and ttl go in the config object, not as direct kwargs
        cached_content = genai_client.caches.create(
            model=model,
            config=genai.types.CreateCachedContentConfig(
                contents=stable_prefix_messages,    # layers 1-7
                ttl=f"{cache_ttl_seconds}s",
            ),
        )
        # Reference the cached content in the request
        response = genai_client.models.generate_content(
            model=model,
            contents=volatile_messages,         # layers 8-12
            config=genai.types.GenerateContentConfig(cached_content=cached_content.name),
        )
    else:
        response = genai_client.models.generate_content(model=model, contents=messages)
    return _result_with_usage(response.text, response.usage_metadata, messages)

**The default TTL is 1 hour.** Configurable per the GUI (per §5 below).
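
The snippet above creates a new `cachedContent` resource on every call. In practice the handle would likely be reused until it expires or the stable prefix changes. A minimal sketch of such a registry, keyed by a hash of the prefix; `CacheHandle`, `CacheRegistry`, and the injected `create` callable are hypothetical and stand in for the real SDK call.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CacheHandle:
    name: str          # provider-side resource name
    expires_at: float  # epoch seconds


class CacheRegistry:
    """Reuse one cache handle per distinct stable prefix until it expires."""

    def __init__(self, create: Callable[[str, int], str]) -> None:
        self._create = create  # (prefix, ttl_seconds) -> resource name
        self._handles: Dict[str, CacheHandle] = {}

    def get(self, prefix: str, ttl_seconds: int, now: float) -> CacheHandle:
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        handle = self._handles.get(key)
        if handle is None or handle.expires_at <= now:
            # Missing or expired: create a fresh provider-side resource
            name = self._create(prefix, ttl_seconds)
            handle = CacheHandle(name=name, expires_at=now + ttl_seconds)
            self._handles[key] = handle
        return handle
```

Passing `now` explicitly keeps the registry deterministic under test; production code would pass `time.time()`.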

### 3.3 OpenAI (5-10 min implicit, provider-managed)

OpenAI's caching is *implicit*: the provider automatically caches the prefix and reuses it across requests with the same prefix. No application-side control.

```python
# In src/ai_client.py:_send_openai
def _send_openai(messages, *, model="gpt-5.5"):
    # No application-side cache_control; the provider handles it
    response = openai_client.responses.create(model=model, input=messages)
    return _result_with_usage(response.output_text, response.usage, messages)
```

**The TTL is provider-managed** (5-10 min). The GUI just shows "Cached by OpenAI; TTL: provider-managed."

### 3.4 The provider table (the summary)

| Provider | Cache type | Default TTL | Configurable? | GUI exposure? |
|---|---|---|---|---|
| Anthropic | ephemeral | 5 min | yes (via prompt cache breakpoints) | yes (per-discussion state) |
| Google (Gemini) | explicit | 1 h | yes (via `ttl` field) | yes (TTL override) |
| OpenAI | implicit (auto) | 5-10 min (provider-managed) | no | no (just shows "cached") |
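
The same table can be carried as data so that the GUI and the send path read from one place. This is a sketch under that assumption; `ProviderCachePolicy`, `POLICIES`, and `ttl_label` are hypothetical names, and the values simply restate the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderCachePolicy:
    cache_type: str
    default_ttl_seconds: Optional[int]  # None when the provider manages the TTL
    configurable: bool
    gui_controls: bool


POLICIES = {
    "anthropic": ProviderCachePolicy("ephemeral", 300, True, True),
    "gemini": ProviderCachePolicy("explicit", 3600, True, True),
    "openai": ProviderCachePolicy("implicit", None, False, False),
}


def ttl_label(provider: str) -> str:
    """Render the default TTL the way the settings panel might show it."""
    policy = POLICIES[provider]
    if policy.default_ttl_seconds is None:
        return "provider-managed"
    return f"{policy.default_ttl_seconds // 60} min"
```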

---

## 4. The codepath (the end-to-end flow)

```
[Q:ai_client.send() is called]
   │
   ▼
[I:aggregate.build_initial_context(ctrl, user_message) -> str]
   │
   ├──► [I:layer 1-7: build stable prefix (the cache-friendly part)]
   │
   ├──► [I:layer 8-12: build volatile suffix (the per-turn part)]
   │
   ├──► [I:concatenate stable + volatile = full context]
   │
   ├──► [I:stable_prefix_length(ctrl) -> N]    (the cache boundary)
   │
   ▼
[Q:cache boundary N > 0?]
   │
   ├── no ──► [I:pass full context to provider; no caching]
   │
   ▼
[Q:provider is Anthropic?]
   │
   ├── yes ──► [I:cache_prefix_blocks(full_context, [N]) -> content_blocks]
   │            [I:anthropic.messages.create(content=content_blocks)]
   │
[Q:provider is Gemini?]
   │
   ├── yes ──► [I:create cachedContent resource for stable prefix]
   │            [I:genai.models.generate_content(cached_content=..., contents=volatile)]
   │
[Q:provider is OpenAI?]
   │
   ├── yes ──► [I:openai.responses.create(input=full_context)]    (provider handles caching)
   │
[I:return LlmResult(text, input_tokens, output_tokens)]
   │
   ▼
[I:return to caller]    (the contract is checked by test_aggregate_stable_to_volatile_ordering in tests/test_aggregate_caching.py)
   │
[T:end]
```
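
The branching in the flow can be exercised without touching any SDK by having a planner return a description of the request. A minimal sketch, assuming the stable and volatile parts are available as separate strings; `plan_request` and the `mode` labels are hypothetical.

```python
def plan_request(provider: str, stable: str, volatile: str) -> dict:
    """Describe how the context would be handed to each provider.

    Returns a plain dict rather than calling an SDK, so the branching
    can be tested in isolation.
    """
    full_context = stable + volatile
    boundary = len(stable)
    if boundary == 0:
        # No stable prefix: nothing to cache
        return {"provider": provider, "mode": "uncached", "input": full_context}
    if provider == "anthropic":
        blocks = [{
            "type": "text",
            "text": full_context[:boundary],
            "cache_control": {"type": "ephemeral"},
        }]
        if volatile:
            blocks.append({"type": "text", "text": full_context[boundary:]})
        return {"provider": provider, "mode": "breakpoints", "content": blocks}
    if provider == "gemini":
        return {"provider": provider, "mode": "cached_content",
                "cached": stable, "contents": volatile}
    if provider == "openai":
        return {"provider": provider, "mode": "implicit", "input": full_context}
    raise ValueError(f"unknown provider: {provider!r}")
```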

---

## 5. The GUI exposure (per-provider cache state)

The "Caching" Operations Hub sub-panel (per the v2.3 §5.3 sketch):

```
+------------------------------------------------------+
| Caching                                              |
+------------------------------------------------------+
| Provider summaries                                   |
| [Anthropic]   in:340 cache:80  hit:23%  ttl:4:32     |
| [Gemini]      in:120 cache:0   hit:0%   ttl:0:00     |
| [OpenAI]      in:560 cache:200 hit:35%  ttl:n/a      |
+------------------------------------------------------+
| Active discussions                                   |
| Discussion "refactor auth"                           |
|   cached: yes (Anthropic)                            |
|   expires: 2026-06-12T15:32 (in 4:32)                |
|   [Invalidate cache] [Disable caching for this]      |
| Discussion "fix the parser"                          |
|   cached: no                                         |
|   [Enable caching for this]                          |
+------------------------------------------------------+
| Global settings                                      |
|   [X] Enable Anthropic ephemeral caching             |
|   [X] Enable Gemini explicit caching                 |
|   [ ] Allow >1h Gemini caches (charges may apply)    |
|   Anthropic default TTL: [5 min v]                   |
|   Gemini default TTL:    [60 min v]                  |
+------------------------------------------------------+
```

**The data sources:**

| Widget | Data source | Frequency |
|---|---|---|
| `in:N cache:N hit:N%` | `ai_client.get_token_stats()` (already exported) | per turn (or per session) |
| `ttl:4:32` | `ai_client._send_<provider>` usage metadata + the cache expiry timestamp | per turn |
| `cached: yes/no` | per-discussion flag (NEW; tracks which discussions have active caches) | per discussion |
| `[Invalidate cache]` | calls `ai_client._invalidate_cache(discussion_id)` (NEW) | on click |
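
The `hit:N%` figures in the mockup are consistent with cached tokens divided by input tokens, rounded down (80/340 gives 23%, 200/560 gives 35%). The formatter below is a sketch under that assumption; `format_summary` is a hypothetical name, and the real widget may compute the rate differently.

```python
from typing import Optional


def format_summary(input_tokens: int, cached_tokens: int, ttl_seconds: Optional[int]) -> str:
    """Render one provider summary row: token counts, hit rate, TTL countdown."""
    # Floor division matches the percentages shown in the mockup
    hit = cached_tokens * 100 // input_tokens if input_tokens else 0
    if ttl_seconds is None:
        ttl = "n/a"  # provider-managed TTL
    else:
        minutes, seconds = divmod(max(ttl_seconds, 0), 60)
        ttl = f"{minutes}:{seconds:02d}"
    return f"in:{input_tokens} cache:{cached_tokens} hit:{hit}% ttl:{ttl}"
```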

**The new AI client state:**

```python
# In src/ai_client.py (NEW)
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DiscussionCacheState:
    discussion_id: str
    provider: str
    cached_at: datetime
    expires_at: Optional[datetime]  # None for OpenAI implicit
    hit_count: int = 0
    tokens_cached: int = 0
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True   # user can disable per-discussion

# In AppController (NEW)
self.discussion_caches: dict[str, DiscussionCacheState] = {}  # keyed by discussion_id
```
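
Two small helpers cover most of what the panel needs from this state: whether a cache is still live, and how to invalidate it. The sketch uses a trimmed-down stand-in (`CacheState`) so that it runs on its own; `is_live` and `invalidate` are hypothetical names, and treating `expires_at=None` as live is an assumption for the provider-managed case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CacheState:
    """Trimmed-down stand-in for DiscussionCacheState."""
    expires_at: Optional[datetime]
    caching_enabled: bool = True
    last_invalidated_at: Optional[datetime] = None


def is_live(state: CacheState, now: datetime) -> bool:
    """A cache is live when enabled and not past its expiry.
    expires_at=None (provider-managed TTL) is treated as live."""
    if not state.caching_enabled:
        return False
    return state.expires_at is None or now < state.expires_at


def invalidate(state: CacheState, now: datetime) -> None:
    """Mark the cache as expired as of `now` and record when it happened."""
    state.expires_at = now
    state.last_invalidated_at = now
```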

**The Hook API additions:**

```
GET  /api/cache                        # list all discussion cache states
GET  /api/cache/<discussion_id>        # get one
POST /api/cache/<discussion_id>/invalidate
POST /api/cache/<discussion_id>/disable
POST /api/cache/<discussion_id>/enable
```
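
The five routes reduce to a small dispatch on method and path. The function below is a framework-free sketch of that mapping; `route` and its return shape are assumptions, and the real Hook API may be wired differently.

```python
from typing import Optional, Tuple

ACTIONS = {"invalidate", "disable", "enable"}


def route(method: str, path: str) -> Tuple[str, Optional[str]]:
    """Map a request to (action, discussion_id); raise LookupError if unknown."""
    parts = [p for p in path.split("/") if p]
    if parts[:2] != ["api", "cache"]:
        raise LookupError(path)
    rest = parts[2:]
    if method == "GET" and not rest:
        return ("list", None)
    if method == "GET" and len(rest) == 1:
        return ("get", rest[0])
    if method == "POST" and len(rest) == 2 and rest[1] in ACTIONS:
        return (rest[1], rest[0])
    raise LookupError(f"{method} {path}")
```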

---

## 6. The interaction with the 4 memory dimensions (where the cache hits)

| Dim | Where injected | Stable? | Cache impact |
|---|---|---|---|
| Curation | layer 9 (active preset) | no (per turn) | NOT cached; the user might switch presets |
| Discussion | layer 8 (metadata) + layer 11 (prior turns) | no (per turn) | NOT cached; layer 8 is the first layer after the boundary |
| RAG | the `{rag-context}` block, appended within layers 8-12 | no (per query) | NOT cached; RAG is volatile per query |
| Knowledge | layer 7 (digest) + per-file (file-knowledge) | yes (within a gc cycle) | CACHED; the digest is the stable prefix |

**The cache only hits on the stable prefix (layers 1-7).** The volatile suffix (layers 8-12) is *not* cached; the user expects the conversation to change per turn.

**The interaction with knowledge harvest:** when `nagent-gc` (or the Manual Slop equivalent) regenerates the digest, layer 7 changes, so the cached prefix no longer matches and the cache is invalidated for the next turn. The user can also force invalidation manually with the `[Invalidate cache]` button.

**The interaction with file edit:** when the user edits a file in the Structural File Editor, the file-knowledge for that file is updated, which invalidates the cache for the next turn that references the file.
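
Both invalidation triggers can be detected the same way: fingerprint the knowledge inputs of the stable prefix and compare against the fingerprint recorded when the cache was created. A minimal sketch, assuming the digest and per-file knowledge are available as strings; `stable_fingerprint` and `needs_invalidation` are hypothetical names.

```python
import hashlib


def stable_fingerprint(digest_text: str, file_knowledge: dict[str, str]) -> str:
    """Hash the knowledge inputs of the stable prefix.

    Files are hashed in sorted path order so dict ordering does not matter.
    """
    h = hashlib.sha256()
    h.update(digest_text.encode("utf-8"))
    for path in sorted(file_knowledge):
        # NUL separators keep path and content from running together
        h.update(b"\x00" + path.encode("utf-8") + b"\x00")
        h.update(file_knowledge[path].encode("utf-8"))
    return h.hexdigest()


def needs_invalidation(previous: str, current: str) -> bool:
    """True when the stable knowledge inputs changed since the cache was built."""
    return previous != current
```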

---

## 7. The cross-references

- `conductor/code_styleguides/data_oriented_design.md` §3.2, §3.3, §3.4 — the data-oriented foundation
- `conductor/code_styleguides/agent_memory_dimensions.md` — the 4 dims (where the cache hits)
- `conductor/code_styleguides/knowledge_artifacts.md` — the knowledge digest (the layer 7 cached content)
- `docs/guide_caching_strategy.md` — the user-facing deep-dive
- `src/aggregate.py:run` — the consumer of this styleguide
- `src/ai_client.py:_send_<provider>` — the producer
- `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5 — the nagent pattern that informed this styleguide