35c6cca134
Per user request 'use your remaining context to update agent workflow
docs and then regular docs based on what was discussed in this report',
this commit creates/updates 15 files derived from the v2.3 nagent
review (the 12 new nagent additions + the 4 memory dimensions
reframing + the cache strategy + the RAG discipline + the knowledge
harvest pattern).
Agent workflow docs (4 files):
- AGENTS.md (UPDATE): add @import line to canonical DOD + 'Code
Styleguides' section pointing to the 6 new styleguides + new
'Human-Facing Documentation' section pointing to ./docs/AGENTS.md
- conductor/workflow.md (UPDATE): new section 'Additions (2026-06-12)
- the 12 patterns from the latest nagent corpus' with TDD
protocols for knowledge harvest, cache ordering, compaction, RAG
discipline
- conductor/product-guidelines.md (UPDATE): new sections 'Memory
Dimensions (added 2026-06-12)' + 'See Also - Updated' with the
6-styleguide catalog
- docs/AGENTS.md (NEW): the agent-facing mirror of docs/Readme.md
(per the nagent CLAUDE.md pattern). 10 sections + the per-tier
reading path + the 4 memory dimensions + the caching strategy +
the knowledge harvest + the RAG discipline + the feature flags
Regular docs (11 files):
- 6 new styleguides (the convention catalog):
* data_oriented_design.md: the canonical DOD reference (Tier
0/1/2; 3 defaults to reject; 8 core defaults; 7-question
simplification pass; 10-question self-check; 4 memory
dimensions in Manual Slop context)
* agent_memory_dimensions.md: the 4 memory dims (curation /
discussion / RAG / knowledge) + when to use each + the
boundaries
* rag_integration_discipline.md: the conservative-RAG rule
(opt-in, complement, provenance, no mutation, feature-gated,
graceful failure)
* cache_friendly_context.md: stable-to-volatile context
ordering + the cache TTL GUI contract + the byte-comparison
test
* knowledge_artifacts.md: the knowledge harvest pattern
(category files, provenance, sha256 ledger, digest
regeneration, 'delete to turn off')
* feature_flags.md: file presence vs config flags vs CLI flags
- 3 new project docs (the cross-cutting guides):
* guide_agent_memory_dimensions.md: the cross-cutting guide on
the 4 dims + the decision tree
* guide_caching_strategy.md: caching across providers +
stable-to-volatile ordering + cache TTL GUI + the byte-
comparison test + the 5th provider (claude-code)
* guide_knowledge_curation.md: the knowledge memory guide (4th
dim) + the 5 category files + per-file notes + the digest +
the ledger + the harvest workflow
- 2 existing doc updates:
* guide_mma.md: new sections 'Delegation as context management'
+ 'The 4 memory dimensions (the MMA scope)'
* guide_ai_client.md: new section 'Cache strategy and the 12-
layer model' + the 5th provider (claude-code)
All files use the same style as the v2.3 review (the user's preferred
format): 7-column tables, no JSON, SSDL shape tags, forth/array
notation, file:line citations, ASCII sketches where useful. The
human Readme files (Readme.md, docs/Readme.md) are NOT modified
(per repeated user instruction).
The 5th provider (claude-code) is documented in guide_ai_client.md,
and data_oriented_design.md references the nagent pattern as the
source of the canonical rules.
The cross-references are bidirectional: the 6 styleguides reference
the 3 project docs; the 3 project docs reference the 6 styleguides;
the 2 doc updates reference both; AGENTS.md + ./docs/AGENTS.md
provide the entry points.
343 lines
14 KiB
Markdown
# Caching Strategy Guide

**Status:** User-facing deep-dive on the cache strategy: stable-to-volatile context ordering, the 4 cache-TTL profiles (Anthropic, Gemini, OpenAI, claude-code), and the GUI exposure.
**Date:** 2026-06-12
**Cross-refs:** `conductor/code_styleguides/cache_friendly_context.md`; `docs/guide_ai_client.md`; `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5.

> **What this is.** The LLM providers Manual Slop uses (Anthropic, Gemini, OpenAI) all support prompt caching. The cost benefit comes from the *stable prefix* being byte-identical across turns. This guide is the user-facing deep-dive on the 12-layer model, the byte-comparison test, the provider-specific TTLs, and the GUI exposure.

---

## 0. The 30-second version

```
[STABLE PREFIX (cached across turns)]   [VOLATILE SUFFIX (per-turn)]
[Role instructions]                     [Discussion metadata]
[Function-calling schema]               [Active preset (FileItems)]
[Discovered tool descriptions]          [Per-file details]
[System prompt preset]                  [Tool-call results from prior turns]
[Persona profile]                       [The user message]
[Project context]
[Knowledge digest]
[file-knowledge for files in scope]
```

**The cache boundary is at layer 7/8.** Layers 1-7 are byte-identical across turns; layers 8-12 change per turn. The Anthropic-specific path wraps the prefix in blocks marked `cache_control: {"type": "ephemeral"}`; the Gemini path uses `cachedContent` resources; the OpenAI path uses implicit prefix caching.

**The provider-specific defaults:**

| Provider | Default TTL | Configurable? | GUI exposure? |
|---|---|---|---|
| Anthropic ephemeral | 5 min | yes (per-discussion) | yes |
| Gemini explicit | 1 h | yes (per-discussion override) | yes (TTL override) |
| OpenAI implicit | 5-10 min (provider-managed) | no | shows "cached" only |
| claude-code (Claude Agent SDK) | varies (provider-managed) | no | shows "cached" only |

---

## 1. The 12-layer model (the stable-to-volatile ordering)

| # | Layer | Stable across turns? | Source | SSDL |
|---|---|---|---|---|
| 1 | Role instructions (model + provider) | yes | `_get_combined_system_prompt` | `[I]` |
| 2 | Function-calling schema | yes | per provider | `[I]` |
| 3 | Discovered tool descriptions | yes | `mcp_client.get_tool_schemas()` | `[I]` |
| 4 | System prompt preset | yes | `app_state.ai_settings.system_prompt` | `[I]` |
| 5 | Persona profile | yes | `app_state.active_persona` | `[I]` |
| 6 | Project context (per `manual_slop.toml`) | yes | NEW | `[I]` |
| 7 | Knowledge digest (per `knowledge/digest.md`) | yes (within a gc cycle) | NEW | `[I]` |
| 8 | Discussion metadata (name, role count) | no (per turn) | `disc_entries[:1]` or `disc_meta` | `───` |
| 9 | Active preset (FileItem set) | no (per turn) | `self.context_files` | `───` |
| 10 | Per-file details (history, slices, notes) | no (per file) | per `FileItem` | `───` |
| 11 | Tool-call results from prior turns | no (per turn) | per `_reread_file_items` | `───` |
| 12 | The user message | no (per turn) | the input | `───` |

**The cache boundary is at layer 7/8.** Layers 1-7 are byte-identical across turns of the same discussion (and across discussions of the same mode). Layers 8-12 change per turn.

---

## 2. The byte-comparison test (the design contract)

The design rule "stable prefix is byte-identical" must be testable. The test:

```python
# In tests/test_aggregate_caching.py (NEW)
# `aggregate`, `mock_app_controller`, and `mock_persona` are assumed to come
# from the project and its test fixtures.
def test_aggregate_stable_to_volatile_ordering():
    """The first N characters of the context should be identical across turns
    of the same conversation, when no stable-layer inputs change."""
    ctrl = mock_app_controller()
    ctrl.ai_settings.system_prompt = "Test system prompt"
    ctrl.active_persona = mock_persona()

    # Turn 1
    turn1 = aggregate.build_initial_context(ctrl, user_message="first prompt")

    # Turn 2 (same stable inputs, different user message)
    turn2 = aggregate.build_initial_context(ctrl, user_message="second prompt")

    # The first N characters should be identical (N = where the volatile layers start)
    N = aggregate.stable_prefix_length(ctrl)
    # Guard against a vacuous pass: an empty prefix always compares equal.
    assert N > 0, "Stable prefix is empty"
    assert turn1[:N] == turn2[:N], f"Stable prefix mismatch: {turn1[:N]!r} != {turn2[:N]!r}"
```
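The test above assumes an `aggregate` module that exposes `build_initial_context` and `stable_prefix_length`. As a minimal, self-contained sketch of the same idea (not the project's implementation), the layers can be modeled as tagged strings and ordered stable-first. `Layer`, `order_layers`, `build_context`, and `make_turn` are hypothetical names, and `stable_prefix_length` here takes a layer list rather than a controller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    text: str
    stable: bool  # True for layers 1-7, False for layers 8-12


def order_layers(layers: list[Layer]) -> list[Layer]:
    """Put stable layers first. sorted() is a stable sort, so the relative
    order inside each group is preserved."""
    return sorted(layers, key=lambda layer: not layer.stable)


def build_context(layers: list[Layer]) -> str:
    """Concatenate the layers in stable-to-volatile order."""
    return "".join(layer.text for layer in order_layers(layers))


def stable_prefix_length(layers: list[Layer]) -> int:
    """Number of characters contributed by the stable layers."""
    return sum(len(layer.text) for layer in layers if layer.stable)


def make_turn(user_message: str) -> list[Layer]:
    # Deliberately declared out of order to show that ordering fixes it.
    return [
        Layer("role", "You are a coding assistant.\n", True),
        Layer("discussion_meta", "discussion: refactor auth\n", False),
        Layer("persona", "persona: reviewer\n", True),
        Layer("user_message", user_message + "\n", False),
    ]


turn1 = make_turn("first prompt")
turn2 = make_turn("second prompt")
n = stable_prefix_length(turn1)

# The byte-comparison contract: identical stable prefix, different suffix.
assert build_context(turn1)[:n] == build_context(turn2)[:n]
assert build_context(turn1) != build_context(turn2)
```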

**The test is the contract.** If a new layer that varies per turn lands inside the stable prefix, this test fails; the agent must either move the layer to its correct position in the stable-to-volatile ordering or update the test (with written justification).

---

## 3. The provider-specific cache strategies

### 3.1 Anthropic (5-minute ephemeral, 4 breakpoints max)

```python
# In src/ai_client.py:_send_anthropic
def _send_anthropic(messages, *, cache_prefix_chars=None):
    if cache_prefix_chars is not None:
        # Wrap the message in content blocks; mark each prefix with cache_control
        content_blocks = cache_prefix_blocks(messages, cache_prefix_chars)
    else:
        content_blocks = messages

    response = anthropic_client.messages.create(
        model=model,
        max_tokens=8192,
        messages=[{"role": "user", "content": content_blocks}],
    )
    return _result_with_usage(response.content, response.usage, messages)
```

**The cache_prefix_blocks helper:**

```python
def cache_prefix_blocks(message: str, cache_boundaries: list[int]) -> str | list[dict]:
    """Split the message into content blocks at the given char offsets.
    Mark each prefix block with cache_control. Returns the plain string
    when no valid boundary exists. At most 3 prefix blocks (provider limit
    is 4 breakpoints per request)."""
    if not cache_boundaries:
        return message
    # Keep the three smallest in-range offsets, deduplicated and sorted.
    points = sorted({b for b in cache_boundaries if 0 < b < len(message)})[:3]
    if not points:
        return message
    blocks = []
    start = 0
    for point in points:
        blocks.append({
            "type": "text",
            "text": message[start:point],
            "cache_control": {"type": "ephemeral"},
        })
        start = point
    # The remainder is the volatile suffix and carries no cache_control marker.
    blocks.append({"type": "text", "text": message[start:]})
    return blocks
```

**The Anthropic usage accounting:**

```python
def _result_with_usage(text, usage, input_text=None):
    input_tokens = _usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
    # Anthropic reports cached prompt tokens separately; fold them back
    # so input_tokens stays "tokens sent" across providers.
    input_tokens += _usage_value(usage, "cache_read_input_tokens")
    input_tokens += _usage_value(usage, "cache_creation_input_tokens")
    # ...
```

**The 4-breakpoint limit.** Anthropic allows at most 4 `cache_control` markers per request. Manual Slop uses at most 3 prefix blocks (one breakpoint each) plus 1 unmarked volatile suffix, which keeps every request under the limit.
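Because `cache_prefix_blocks` keeps only the three smallest offsets it is given, the caller has to decide which layer ends become breakpoints. One possible policy, sketched below under the assumption that each stable layer is available as a string, keeps the deepest offsets so the last breakpoint lands on the stable/volatile boundary. `select_cache_boundaries` is a hypothetical helper, not part of the project.

```python
from itertools import accumulate


def select_cache_boundaries(stable_layers: list[str], max_points: int = 3) -> list[int]:
    """Return up to `max_points` char offsets, each at the end of a stable layer.

    The deepest offsets win, so the final breakpoint coincides with the end
    of the whole stable prefix (an assumed policy, chosen for illustration).
    """
    if max_points <= 0:
        return []
    # Running end offset of every stable layer, in order.
    offsets = accumulate(len(layer) for layer in stable_layers)
    # Empty layers repeat an offset; the set removes those duplicates.
    unique = sorted({offset for offset in offsets if offset > 0})
    return unique[-max_points:]


layers = ["role\n", "schema\n", "tools\n", "", "persona\n"]
boundaries = select_cache_boundaries(layers)
print(boundaries)  # [12, 18, 26]
```

The returned offsets could then be passed as `cache_boundaries` to the `cache_prefix_blocks` helper shown earlier in this section.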

### 3.2 Gemini (1-hour explicit cache, configurable TTL)

```python
# In src/ai_client.py:_send_gemini
def _send_gemini(messages, *, cache_ttl_seconds=3600):
    if cache_ttl_seconds > 0:
        # stable_prefix_messages / volatile_messages are `messages` split at the
        # layer 7/8 boundary (the split is not shown in this excerpt).
        cached_content = genai_client.caches.create(
            model=model,
            # The google-genai SDK takes contents and ttl through the config object.
            config=genai.types.CreateCachedContentConfig(
                contents=stable_prefix_messages,
                ttl=f"{cache_ttl_seconds}s",
            ),
        )
        response = genai_client.models.generate_content(
            model=model,
            contents=volatile_messages,
            config=genai.types.GenerateContentConfig(cached_content=cached_content.name),
        )
    else:
        response = genai_client.models.generate_content(model=model, contents=messages)
    return _result_with_usage(response.text, response.usage_metadata, messages)
```

**The default TTL is 1 hour.** It is configurable from the GUI (see §4).

### 3.3 OpenAI (5-10 min implicit, provider-managed)

OpenAI's caching is *implicit*: the provider automatically caches the prefix and reuses it across requests that share it. There is no application-side control.

```python
# In src/ai_client.py:_send_openai
def _send_openai(messages, *, model="gpt-5.5"):
    # No application-side cache_control; the provider handles it
    response = openai_client.responses.create(model=model, input=messages)
    return _result_with_usage(response.output_text, response.usage, messages)
```

**The TTL is provider-managed** (5-10 min). The GUI just shows "Cached by OpenAI; TTL: provider-managed."

### 3.4 claude-code (5th provider, subscription auth)

`claude-code` uses the Claude Agent SDK with local Claude Code authentication (no API key). The caching behavior is provider-managed.

```python
# In src/ai_client.py:_send_claude_code (the 5th provider)
def _send_claude_code(message, model, *, allowed_tools=None, max_turns=1):
    options = ClaudeAgentOptions(
        model=None if not model or model == "default" else model,
        max_turns=max_turns,
        tools=list(allowed_tools) if allowed_tools else [],
        allowed_tools=list(allowed_tools) if allowed_tools else [],
        cwd=os.getcwd(),
    )
    # ... claude_agent_sdk.query(prompt=message, options=options)
    return _result_with_usage(text, usage, message)
```

---

## 4. The GUI exposure

The "Caching" Operations Hub sub-panel:

```
+------------------------------------------------------+
| Caching                                              |
+------------------------------------------------------+
| Provider summaries                                   |
|   [Anthropic]  in:340  cache:80   hit:23%  ttl:4:32  |
|   [Gemini]     in:120  cache:0    hit:0%   ttl:0:00  |
|   [OpenAI]     in:560  cache:200  hit:35%  ttl:n/a   |
+------------------------------------------------------+
| Active discussions                                   |
|   Discussion "refactor auth"                         |
|     cached:  yes (Anthropic)                         |
|     expires: 2026-06-12T15:32 (in 4:32)              |
|     [Invalidate cache] [Disable caching for this]    |
|   Discussion "fix the parser"                        |
|     cached:  no                                      |
|     [Enable caching for this]                        |
+------------------------------------------------------+
| Global settings                                      |
|   [X] Enable Anthropic ephemeral caching             |
|   [X] Enable Gemini explicit caching                 |
|   [ ] Allow >1h Gemini caches (charges may apply)    |
|   Anthropic default TTL: [5 min v]                   |
|   Gemini default TTL:    [60 min v]                  |
+------------------------------------------------------+
```

**The data sources:**

| Widget | Data source | Frequency |
|---|---|---|
| `in:N cache:N hit:N%` | `ai_client.get_token_stats()` | per turn (or per session) |
| `ttl:4:32` | `ai_client._send_<provider>` usage metadata + the cache expiry timestamp | per turn |
| `cached: yes/no` | per-discussion flag (NEW) | per discussion |
| `[Invalidate cache]` | calls `ai_client._invalidate_cache(discussion_id)` (NEW) | on click |

**The new AI client state:**

```python
# In src/ai_client.py (NEW)
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DiscussionCacheState:
    discussion_id: str
    provider: str
    cached_at: datetime
    expires_at: Optional[datetime]
    hit_count: int = 0
    tokens_cached: int = 0
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True

# In AppController.__init__ (NEW)
self.discussion_caches: dict[str, DiscussionCacheState] = {}
```

**The Hook API additions:**

```
GET  /api/cache                              # list all discussion cache states
GET  /api/cache/<discussion_id>              # get one
POST /api/cache/<discussion_id>/invalidate
POST /api/cache/<discussion_id>/disable
POST /api/cache/<discussion_id>/enable
```
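The three POST routes each change one field of the per-discussion state. A minimal sketch of that state machine follows, using a trimmed stand-in for `DiscussionCacheState` so the block runs on its own. `CacheState`, `is_live`, and `apply_action` are hypothetical, and treating `expires_at=None` as "provider-managed, assume live" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CacheState:
    """Trimmed stand-in for DiscussionCacheState (fields used here only)."""
    discussion_id: str
    cached_at: datetime
    expires_at: Optional[datetime]
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True


def is_live(state: CacheState, now: datetime) -> bool:
    """True when the next turn can expect to reuse the cached prefix."""
    if not state.caching_enabled:
        return False
    if state.last_invalidated_at is not None and state.last_invalidated_at >= state.cached_at:
        return False  # invalidated since the cache was last created
    if state.expires_at is not None and now >= state.expires_at:
        return False  # TTL elapsed
    return True  # expires_at=None models a provider-managed TTL


def apply_action(state: CacheState, action: str, now: datetime) -> CacheState:
    """Mirror the POST routes: invalidate, disable, enable."""
    if action == "invalidate":
        state.last_invalidated_at = now
    elif action == "disable":
        state.caching_enabled = False
    elif action == "enable":
        state.caching_enabled = True
    else:
        raise ValueError(f"unknown cache action: {action!r}")
    return state


created = datetime(2026, 6, 12, 15, 27)
state = CacheState("refactor auth", cached_at=created,
                   expires_at=created + timedelta(minutes=5))
```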

---

## 5. The injection (where the cache hits)

| Layer | Where injected | Stable? | Cache impact |
|---|---|---|---|
| 1. Role instructions | `_get_combined_system_prompt` | yes | **CACHED** |
| 2. Function-calling schema | per provider | yes | **CACHED** |
| 3. Discovered tool descriptions | `mcp_client.get_tool_schemas()` | yes | **CACHED** |
| 4. System prompt preset | `app_state.ai_settings.system_prompt` | yes | **CACHED** |
| 5. Persona profile | `app_state.active_persona` | yes | **CACHED** |
| 6. Project context | `manual_slop.toml [agent.context_files]` | yes | **CACHED** |
| 7. Knowledge digest | `~/.manual_slop/knowledge/digest.md` | yes (within a gc cycle) | **CACHED** |
| 8. Discussion metadata | `disc_entries[:1]` | no | NOT cached |
| 9. Active preset | `self.context_files` | no | NOT cached |
| 10. Per-file details | per `FileItem` | no | NOT cached |
| 11. Prior tool results | per `_reread_file_items` | no | NOT cached |
| 12. User message | the input | no | NOT cached |

**The cache only hits on the stable prefix (layers 1-7).** The volatile suffix (layers 8-12) is *not* cached; the user expects the conversation to change per turn.

---

## 6. The cache invalidation triggers

| Trigger | Effect |
|---|---|
| `python -m src.knowledge_harvest --apply` | The digest is regenerated; the cache is invalidated for the next turn |
| `FileItem.notes` edited | The per-file knowledge changes; the cache is invalidated for the next turn that references the file |
| `persona` changed | The persona profile is in the stable prefix; the cache is invalidated |
| `[Invalidate cache]` button | The per-discussion cache state is marked `last_invalidated_at`; the next turn re-creates it |
| `expiration` reached | The provider's cache expires automatically; the next turn re-creates it |

---

## 7. The measurement (the empirical basis)

**The "before" measurement** (do this first, before any refactor):

```bash
# Log the cache hit rate over a sample of representative discussions
$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 23% (avg)
cache write rate: 45% (avg)
in:N avg: 1,200
cache:N avg: 280
```

**The "after" measurement** (after the stable-to-volatile refactor; the figures illustrate the expected direction):

```bash
$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 67% (avg)    # <-- should be measurably higher
cache write rate: 18% (avg)  # <-- should be lower
in:N avg: 1,200              # <-- unchanged (the user still types the same)
cache:N avg: 804             # <-- should rise with the hit rate (67% of 1,200)
```
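To translate the two measurements into cost, the hit and write rates can be weighted by the provider's cache pricing. The sketch below assumes Anthropic's published multipliers for the 5-minute cache (cache writes at 1.25x the base input price, cache reads at 0.1x) and treats the reported rates as fractions of input tokens, which is an interpretation rather than something the measurement script is known to guarantee. `relative_input_cost` is a hypothetical helper.

```python
def relative_input_cost(hit_rate: float, write_rate: float,
                        read_mult: float = 0.1, write_mult: float = 1.25) -> float:
    """Input cost relative to sending every token uncached (1.0 = no caching).

    hit_rate   fraction of input tokens read from the cache
    write_rate fraction of input tokens written to the cache
    The remainder is billed at the base input price.
    """
    if hit_rate < 0 or write_rate < 0 or hit_rate + write_rate > 1:
        raise ValueError("rates must be non-negative and sum to at most 1")
    uncached = 1.0 - hit_rate - write_rate
    return hit_rate * read_mult + write_rate * write_mult + uncached


before = relative_input_cost(0.23, 0.45)  # about 0.906 of the uncached cost
after = relative_input_cost(0.67, 0.18)   # about 0.442 of the uncached cost
print(f"before: {before:.3f}  after: {after:.3f}")
```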

**The win comes from re-aligning the boundaries**, not from changing the providers. The test is whether the cache hit rate is measurably higher after the refactor.

---

## 8. The cross-references

- `conductor/code_styleguides/cache_friendly_context.md`: the canonical styleguide
- `docs/guide_ai_client.md`: the underlying LLM client (the producer)
- `docs/guide_agent_memory_dimensions.md` §5: where the 4 dims get injected
- `docs/guide_knowledge_curation.md` §3: the digest (layer 7)
- `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.2, §5: the nagent pattern