Private
Public Access
0
0
Files
manual_slop/docs/guide_caching_strategy.md
T
ed 35c6cca134 docs: agent workflow docs + regular docs (v2.3 surfacing)
Per user request 'use your remaining context to update agent workflow
docs and then regular docs based on what was discussed in this report',
this commit creates/updates 15 files derived from the v2.3 nagent
review (the 12 new nagent additions + the 4 memory dimensions
reframing + the cache strategy + the RAG discipline + the knowledge
harvest pattern).

Agent workflow docs (4 files):
- AGENTS.md (UPDATE): add @import line to canonical DOD + 'Code
  Styleguides' section pointing to the 6 new styleguides + new
  'Human-Facing Documentation' section pointing to ./docs/AGENTS.md
- conductor/workflow.md (UPDATE): new section 'Additions (2026-06-12)
  - the 12 patterns from the latest nagent corpus' with TDD
  protocols for knowledge harvest, cache ordering, compaction, RAG
  discipline
- conductor/product-guidelines.md (UPDATE): new sections 'Memory
  Dimensions (added 2026-06-12)' + 'See Also - Updated' with the
  6-styleguide catalog
- docs/AGENTS.md (NEW): the agent-facing mirror of docs/Readme.md
  (per the nagent CLAUDE.md pattern). 10 sections + the per-tier
  reading path + the 4 memory dimensions + the caching strategy +
  the knowledge harvest + the RAG discipline + the feature flags

Regular docs (11 files):
- 6 new styleguides (the convention catalog):
  * data_oriented_design.md: the canonical DOD reference (Tier
    0/1/2; 3 defaults to reject; 8 core defaults; 7-question
    simplification pass; 10-question self-check; 4 memory
    dimensions in Manual Slop context)
  * agent_memory_dimensions.md: the 4 memory dims (curation /
    discussion / RAG / knowledge) + when to use each + the
    boundaries
  * rag_integration_discipline.md: the conservative-RAG rule
    (opt-in, complement, provenance, no mutation, feature-gated,
    graceful failure)
  * cache_friendly_context.md: stable-to-volatile context
    ordering + the cache TTL GUI contract + the byte-comparison
    test
  * knowledge_artifacts.md: the knowledge harvest pattern
    (category files, provenance, sha256 ledger, digest
    regeneration, 'delete to turn off')
  * feature_flags.md: file presence vs config flags vs CLI flags
- 3 new project docs (the cross-cutting guides):
  * guide_agent_memory_dimensions.md: the cross-cutting guide on
    the 4 dims + the decision tree
  * guide_caching_strategy.md: caching across providers +
    stable-to-volatile ordering + cache TTL GUI + the byte-
    comparison test + the 5th provider (claude-code)
  * guide_knowledge_curation.md: the knowledge memory guide (4th
    dim) + the 5 category files + per-file notes + the digest +
    the ledger + the harvest workflow
- 2 existing doc updates:
  * guide_mma.md: new sections 'Delegation as context management'
    + 'The 4 memory dimensions (the MMA scope)'
  * guide_ai_client.md: new section 'Cache strategy and the 12-
    layer model' + the 5th provider (claude-code)

All files use the same style as the v2.3 review (the user's preferred
format): 7-column tables, no JSON, SSDL shape tags, forth/array
notation, file:line citations, ASCII sketches where useful. The
human Readme files (Readme.md, docs/Readme.md) are NOT modified
(per repeated user instruction).

The 5th provider (claude-code) is documented in guide_ai_client.md
+ the data_oriented_design.md references the nagent pattern as the
source of the canonical rules.

The cross-references are bidirectional: the 6 styleguides reference
the 3 project docs; the 3 project docs reference the 6 styleguides;
the 2 doc updates reference both; AGENTS.md + ./docs/AGENTS.md
provide the entry points.
2026-06-12 13:50:40 -04:00

343 lines
14 KiB
Markdown

# Caching Strategy Guide
**Status:** User-facing deep-dive on the cache strategy: stable-to-volatile context ordering, the 4 cache-TTL profiles (Anthropic, Gemini, OpenAI, claude-code), and the GUI exposure.
**Date:** 2026-06-12
**Cross-refs:** `conductor/code_styleguides/cache_friendly_context.md`; `docs/guide_ai_client.md`; `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5.
> **What this is.** The LLM providers Manual Slop uses (Anthropic, Gemini, OpenAI) all support prompt caching. The cost benefit comes from the *stable prefix* being byte-identical across turns. This guide is the user-facing deep-dive on the 12-layer model, the byte-comparison test, the provider-specific TTLs, and the GUI exposure.
---
## 0. The 30-second version
```
[STABLE PREFIX (cached across turns)] [VOLATILE SUFFIX (per-turn)]
[Role instructions] [Discussion metadata]
[Function-calling schema] [Active preset (FileItems)]
[Discovered tool descriptions] [Per-file details]
[System prompt preset] [Tool-call results from prior turns]
[Persona profile] [The user message]
[Project context]
[Knowledge digest]
[file-knowledge for files in scope]
```
**The cache boundary is at layer 8/9.** Layers 1-7 are byte-identical across turns; layers 8-12 change per turn. The Anthropic-specific path wraps the prefix in `cache_control: {"type": "ephemeral"}` blocks; the Gemini path uses `cachedContent` resources; the OpenAI path uses implicit prefix caching.
**The provider-specific defaults:**
| Provider | Default TTL | Configurable? | GUI exposure? |
|---|---|---|---|
| Anthropic ephemeral | 5 min | yes (per-discussion) | yes |
| Gemini explicit | 1 h | yes (per-discussion override) | yes (TTL override) |
| OpenAI implicit | 5-10 min (provider-managed) | no | shows "cached" only |
| claude-code (Claude Agent SDK) | varies (provider-managed) | no | shows "cached" only |
---
## 1. The 12-layer model (the stable-to-volatile ordering)
| # | Layer | Stable across turns? | Source | SSDL |
|---|---|---|---|---|
| 1 | Role instructions (model + provider) | yes | `_get_combined_system_prompt` | `[I]` |
| 2 | Function-calling schema | yes | per provider | `[I]` |
| 3 | Discovered tool descriptions | yes | `mcp_client.get_tool_schemas()` | `[I]` |
| 4 | System prompt preset | yes | `app_state.ai_settings.system_prompt` | `[I]` |
| 5 | Persona profile | yes | `app_state.active_persona` | `[I]` |
| 6 | Project context (per `manual_slop.toml`) | yes | NEW | `[I]` |
| 7 | Knowledge digest (per `knowledge/digest.md`) | yes (within a gc cycle) | NEW | `[I]` |
| 8 | Discussion metadata (name, role count) | no (per turn) | `disc_entries[:1]` or `disc_meta` | `───` |
| 9 | Active preset (FileItem set) | no (per turn) | `self.context_files` | `───` |
| 10 | Per-file details (history, slices, notes) | no (per file) | per `FileItem` | `───` |
| 11 | Tool-call results from prior turns | no (per turn) | per `_reread_file_items` | `───` |
| 12 | The user message | no (per turn) | the input | `───` |
**The cache boundary is at layer 7/8.** Layers 1-7 are byte-identical across turns of the same discussion (and across discussions of the same mode). Layers 8-12 change per turn.
---
## 2. The byte-comparison test (the design contract)
The design rule "stable prefix is byte-identical" must be testable. The test:
```python
# In tests/test_aggregate_caching.py (NEW)
def test_aggregate_stable_to_volatile_ordering():
"""The first N characters of the context should be identical across turns
of the same conversation, when no stable-layer inputs change."""
ctrl = mock_app_controller()
ctrl.ai_settings.system_prompt = "Test system prompt"
ctrl.active_persona = mock_persona()
# Turn 1
turn1 = aggregate.build_initial_context(ctrl, user_message="first prompt")
# Turn 2 (same stable inputs, different user message)
turn2 = aggregate.build_initial_context(ctrl, user_message="second prompt")
# The first N characters should be identical (N = where the volatile layers start)
N = aggregate.stable_prefix_length(ctrl)
assert turn1[:N] == turn2[:N], f"Stable prefix mismatch: {turn1[:N]!r} != {turn2[:N]!r}"
```
**The test is the contract.** If a new layer is added in the middle of the stack, this test fails; the agent must either move the layer to the stable position or update the test (with written justification).
---
## 3. The provider-specific cache strategies
### 3.1 Anthropic (5-minute ephemeral, 4 breakpoints max)
```python
# In src/ai_client.py:_send_anthropic
def _send_anthropic(messages, *, cache_prefix_chars=None):
if cache_prefix_chars is not None:
# Wrap the message in content blocks; mark each prefix with cache_control
content_blocks = cache_prefix_blocks(messages, cache_prefix_chars)
else:
content_blocks = messages
response = anthropic_client.messages.create(
model=model,
max_tokens=8192,
messages=[{"role": "user", "content": content_blocks}],
)
return _result_with_usage(response.content, response.usage, messages)
```
**The cache_prefix_blocks helper:**
```python
def cache_prefix_blocks(message: str, cache_boundaries: list[int]) -> list[dict]:
"""Split the message into content blocks at the given char offsets.
Mark each prefix block with cache_control. Returns the plain string
when no valid boundary exists. At most 3 prefix blocks (provider limit
is 4 breakpoints per request)."""
if not cache_boundaries:
return message
points = sorted({b for b in cache_boundaries if 0 < b < len(message)})[:3]
if not points:
return message
blocks = []
start = 0
for point in points:
blocks.append({
"type": "text",
"text": message[start:point],
"cache_control": {"type": "ephemeral"},
})
start = point
blocks.append({"type": "text", "text": message[start:]})
return blocks
```
**The Anthropic usage accounting:**
```python
def _result_with_usage(text, usage, input_text=None):
input_tokens = _usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
# Anthropic reports cached prompt tokens separately; fold them back
# so input_tokens stays "tokens sent" across providers.
input_tokens += _usage_value(usage, "cache_read_input_tokens")
input_tokens += _usage_value(usage, "cache_creation_input_tokens")
# ...
```
**The 4-breakpoint limit.** Anthropic allows at most 4 `cache_control` markers per request. Manual Slop uses 3 prefix blocks (one breakpoint per prefix) + 1 volatile suffix.
### 3.2 Gemini (1-hour explicit cache, configurable TTL)
```python
# In src/ai_client.py:_send_gemini
def _send_gemini(messages, *, cache_ttl_seconds=3600):
if cache_ttl_seconds > 0:
cached_content = genai_client.caches.create(
model=model,
contents=stable_prefix_messages,
ttl=f"{cache_ttl_seconds}s",
)
response = genai_client.models.generate_content(
model=model,
contents=volatile_messages,
config=genai.types.GenerateContentConfig(cached_content=cached_content.name),
)
else:
response = genai_client.models.generate_content(model=model, contents=messages)
return _result_with_usage(response.text, response.usage_metadata, messages)
```
**The default TTL is 1 hour.** Configurable per the GUI (per §4 below).
### 3.3 OpenAI (5-10 min implicit, provider-managed)
OpenAI's caching is *implicit*: the provider automatically caches the prefix and reuses it across requests with the same prefix. No application-side control.
```python
# In src/ai_client.py:_send_openai
def _send_openai(messages, *, model="gpt-5.5"):
response = openai_client.responses.create(model=model, input=messages)
return _result_with_usage(response.output_text, response.usage, messages)
# No application-side cache_control; the provider handles it
```
**The TTL is provider-managed** (5-10 min). The GUI just shows "Cached by OpenAI; TTL: provider-managed."
### 3.4 claude-code (5th provider, subscription auth)
`claude-code` uses the Claude Agent SDK with local Claude Code authentication (no API key). The caching behavior is provider-managed.
```python
# In src/ai_client.py:_send_claude_code (the 5th provider)
def _send_claude_code(message, model, *, allowed_tools=None, max_turns=1):
options = ClaudeAgentOptions(
model=None if not model or model == "default" else model,
max_turns=max_turns,
tools=list(allowed_tools) if allowed_tools else [],
allowed_tools=list(allowed_tools) if allowed_tools else [],
cwd=os.getcwd(),
)
# ... claude_agent_sdk.query(prompt=message, options=options)
return _result_with_usage(text, usage, message)
```
---
## 4. The GUI exposure
The "Caching" Operations Hub sub-panel:
```
+------------------------------------------------------+
| Caching |
+------------------------------------------------------+
| Provider summaries |
| [Anthropic] in:340 cache:80 hit:23% ttl:4:32 |
| [Gemini] in:120 cache:0 hit:0% ttl:0:00 |
| [OpenAI] in:560 cache:200 hit:35% ttl:n/a |
+------------------------------------------------------+
| Active discussions |
| Discussion "refactor auth" |
| cached: yes (Anthropic) |
| expires: 2026-06-12T15:32 (in 4:32) |
| [Invalidate cache] [Disable caching for this] |
| Discussion "fix the parser" |
| cached: no |
| [Enable caching for this] |
+------------------------------------------------------+
| Global settings |
| [X] Enable Anthropic ephemeral caching |
| [X] Enable Gemini explicit caching |
| [ ] Allow >1h Gemini caches (charges may apply) |
| Anthropic default TTL: [5 min v] |
| Gemini default TTL: [60 min v] |
+------------------------------------------------------+
```
**The data sources:**
| Widget | Data source | Frequency |
|---|---|---|
| `in:N cache:N hit:N%` | `ai_client.get_token_stats()` | per turn (or per session) |
| `ttl:4:32` | `ai_client._send_<provider>` usage metadata + the cache expiry timestamp | per turn |
| `cached: yes/no` | per-discussion flag (NEW) | per discussion |
| `[Invalidate cache]` | calls `ai_client._invalidate_cache(discussion_id)` (NEW) | on click |
**The new AI client state:**
```python
# In src/ai_client.py (NEW)
@dataclass
class DiscussionCacheState:
discussion_id: str
provider: str
cached_at: datetime
expires_at: Optional[datetime]
hit_count: int = 0
tokens_cached: int = 0
last_invalidated_at: Optional[datetime] = None
caching_enabled: bool = True
# In AppController (NEW)
self.discussion_caches: dict[str, DiscussionCacheState] = {}
```
**The Hook API additions:**
```
GET /api/cache # list all discussion cache states
GET /api/cache/<discussion_id> # get one
POST /api/cache/<discussion_id>/invalidate
POST /api/cache/<discussion_id>/disable
POST /api/cache/<discussion_id>/enable
```
---
## 5. The injection (where the cache hits)
| Layer | Where injected | Stable? | Cache impact |
|---|---|---|---|
| 1. Role instructions | `_get_combined_system_prompt` | yes | **CACHED** |
| 2. Function-calling schema | per provider | yes | **CACHED** |
| 3. Discovered tool descriptions | `mcp_client.get_tool_schemas()` | yes | **CACHED** |
| 4. System prompt preset | `app_state.ai_settings.system_prompt` | yes | **CACHED** |
| 5. Persona profile | `app_state.active_persona` | yes | **CACHED** |
| 6. Project context | `manual_slop.toml [agent.context_files]` | yes | **CACHED** |
| 7. Knowledge digest | `~/.manual_slop/knowledge/digest.md` | yes (within a gc cycle) | **CACHED** |
| 8. Discussion metadata | `disc_entries[:1]` | no | NOT cached |
| 9. Active preset | `self.context_files` | no | NOT cached |
| 10. Per-file details | per `FileItem` | no | NOT cached |
| 11. Prior tool results | per `_reread_file_items` | no | NOT cached |
| 12. User message | the input | no | NOT cached |
**The cache only hits on the stable prefix (layers 1-7).** The volatile suffix (layers 8-12) is *not* cached; the user expects the conversation to change per turn.
---
## 6. The cache invalidation triggers
| Trigger | Effect |
|---|---|
| `python -m src.knowledge_harvest --apply` | The digest is regenerated; the cache is invalidated for the next turn |
| `FileItem.notes` edited | The per-file knowledge changes; the cache is invalidated for the next turn that references the file |
| `persona` changed | The persona profile is in the stable prefix; the cache is invalidated |
| `[Invalidate cache]` button | The per-discussion cache state is marked `last_invalidated_at`; the next turn re-creates it |
| `expiration` reached | The provider's cache expires automatically; the next turn re-creates it |
---
## 7. The measurement (the empirical basis)
**The "before" measurement** (do this first, before any refactor):
```bash
# Log the cache hit rate over a sample of representative discussions
$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 23% (avg)
cache write rate: 45% (avg)
in:N avg: 1,200
cache:N avg: 280
```
**The "after" measurement** (after the stable-to-volatile refactor):
```bash
$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 67% (avg) # <-- should be measurably higher
cache write rate: 18% (avg) # <-- should be lower
in:N avg: 1,200 # <-- unchanged (the user still types the same)
cache:N avg: 280 # <-- unchanged
```
**The win comes from re-aligning the boundaries**, not from changing the providers. The test is whether the cache hit rate is measurably higher after the refactor.
---
## 8. The cross-references
- `conductor/code_styleguides/cache_friendly_context.md` — the canonical styleguide
- `docs/guide_ai_client.md` — the underlying LLM client (the producer)
- `docs/guide_agent_memory_dimensions.md` §5 — where the 4 dims get injected
- `docs/guide_knowledge_curation.md` §3 — the digest (layer 7)
- `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5 — the nagent pattern