manual_slop/docs/guide_caching_strategy.md
ed 35c6cca134 docs: agent workflow docs + regular docs (v2.3 surfacing)
Per user request 'use your remaining context to update agent workflow
docs and then regular docs based on what was discussed in this report',
this commit creates/updates 15 files derived from the v2.3 nagent
review (the 12 new nagent additions + the 4 memory dimensions
reframing + the cache strategy + the RAG discipline + the knowledge
harvest pattern).

Agent workflow docs (4 files):
- AGENTS.md (UPDATE): add @import line to canonical DOD + 'Code
  Styleguides' section pointing to the 6 new styleguides + new
  'Human-Facing Documentation' section pointing to ./docs/AGENTS.md
- conductor/workflow.md (UPDATE): new section 'Additions (2026-06-12)
  - the 12 patterns from the latest nagent corpus' with TDD
  protocols for knowledge harvest, cache ordering, compaction, RAG
  discipline
- conductor/product-guidelines.md (UPDATE): new sections 'Memory
  Dimensions (added 2026-06-12)' + 'See Also - Updated' with the
  6-styleguide catalog
- docs/AGENTS.md (NEW): the agent-facing mirror of docs/Readme.md
  (per the nagent CLAUDE.md pattern). 10 sections + the per-tier
  reading path + the 4 memory dimensions + the caching strategy +
  the knowledge harvest + the RAG discipline + the feature flags

Regular docs (11 files):
- 6 new styleguides (the convention catalog):
  * data_oriented_design.md: the canonical DOD reference (Tier
    0/1/2; 3 defaults to reject; 8 core defaults; 7-question
    simplification pass; 10-question self-check; 4 memory
    dimensions in Manual Slop context)
  * agent_memory_dimensions.md: the 4 memory dims (curation /
    discussion / RAG / knowledge) + when to use each + the
    boundaries
  * rag_integration_discipline.md: the conservative-RAG rule
    (opt-in, complement, provenance, no mutation, feature-gated,
    graceful failure)
  * cache_friendly_context.md: stable-to-volatile context
    ordering + the cache TTL GUI contract + the byte-comparison
    test
  * knowledge_artifacts.md: the knowledge harvest pattern
    (category files, provenance, sha256 ledger, digest
    regeneration, 'delete to turn off')
  * feature_flags.md: file presence vs config flags vs CLI flags
- 3 new project docs (the cross-cutting guides):
  * guide_agent_memory_dimensions.md: the cross-cutting guide on
    the 4 dims + the decision tree
  * guide_caching_strategy.md: caching across providers +
    stable-to-volatile ordering + cache TTL GUI + the byte-
    comparison test + the 5th provider (claude-code)
  * guide_knowledge_curation.md: the knowledge memory guide (4th
    dim) + the 5 category files + per-file notes + the digest +
    the ledger + the harvest workflow
- 2 existing doc updates:
  * guide_mma.md: new sections 'Delegation as context management'
    + 'The 4 memory dimensions (the MMA scope)'
  * guide_ai_client.md: new section 'Cache strategy and the 12-
    layer model' + the 5th provider (claude-code)

All files use the same style as the v2.3 review (the user's preferred
format): 7-column tables, no JSON, SSDL shape tags, forth/array
notation, file:line citations, ASCII sketches where useful. The
human Readme files (Readme.md, docs/Readme.md) are NOT modified
(per repeated user instruction).

The 5th provider (claude-code) is documented in guide_ai_client.md
+ the data_oriented_design.md references the nagent pattern as the
source of the canonical rules.

The cross-references are bidirectional: the 6 styleguides reference
the 3 project docs; the 3 project docs reference the 6 styleguides;
the 2 doc updates reference both; AGENTS.md + ./docs/AGENTS.md
provide the entry points.
2026-06-12 13:50:40 -04:00


Caching Strategy Guide

Status: User-facing deep-dive on the cache strategy: stable-to-volatile context ordering, the 4 cache-TTL profiles (Anthropic, Gemini, OpenAI, claude-code), and the GUI exposure.
Date: 2026-06-12
Cross-refs: conductor/code_styleguides/cache_friendly_context.md; docs/guide_ai_client.md; conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.2, §5.

What this is. The LLM providers Manual Slop uses (Anthropic, Gemini, OpenAI) all support prompt caching. The cost benefit comes from the stable prefix being byte-identical across turns. This guide is the user-facing deep-dive on the 12-layer model, the byte-comparison test, the provider-specific TTLs, and the GUI exposure.


0. The 30-second version

[STABLE PREFIX (cached across turns)]  [VOLATILE SUFFIX (per-turn)]
[Role instructions]                     [Discussion metadata]
[Function-calling schema]               [Active preset (FileItems)]
[Discovered tool descriptions]          [Per-file details]
[System prompt preset]                  [Tool-call results from prior turns]
[Persona profile]                       [The user message]
[Project context]
[Knowledge digest]
[file-knowledge for files in scope]

The cache boundary falls between layers 7 and 8 of the 12-layer model (§1). Layers 1-7 are byte-identical across turns; layers 8-12 change per turn. The Anthropic path wraps the prefix in content blocks marked cache_control: {"type": "ephemeral"}; the Gemini path uses cachedContent resources; the OpenAI path uses implicit prefix caching.

The provider-specific defaults:

| Provider | Default TTL | Configurable? | GUI exposure? |
|---|---|---|---|
| Anthropic ephemeral | 5 min | yes (per-discussion) | yes |
| Gemini explicit | 1 h | yes (per-discussion override) | yes (TTL override) |
| OpenAI implicit | 5-10 min (provider-managed) | no | shows "cached" only |
| claude-code (Claude Agent SDK) | varies (provider-managed) | no | shows "cached" only |

1. The 12-layer model (the stable-to-volatile ordering)

| # | Layer | Stable across turns? | Source | SSDL |
|---|---|---|---|---|
| 1 | Role instructions (model + provider) | yes | _get_combined_system_prompt | [I] |
| 2 | Function-calling schema | yes | per provider | [I] |
| 3 | Discovered tool descriptions | yes | mcp_client.get_tool_schemas() | [I] |
| 4 | System prompt preset | yes | app_state.ai_settings.system_prompt | [I] |
| 5 | Persona profile | yes | app_state.active_persona | [I] |
| 6 | Project context (per manual_slop.toml) | yes | NEW | [I] |
| 7 | Knowledge digest (per knowledge/digest.md) | yes (within a gc cycle) | NEW | [I] |
| 8 | Discussion metadata (name, role count) | no (per turn) | disc_entries[:1] or disc_meta | ─── |
| 9 | Active preset (FileItem set) | no (per turn) | self.context_files | ─── |
| 10 | Per-file details (history, slices, notes) | no (per file) | per FileItem | ─── |
| 11 | Tool-call results from prior turns | no (per turn) | per _reread_file_items | ─── |
| 12 | The user message | no (per turn) | the input | ─── |

The cache boundary is at layer 7/8. Layers 1-7 are byte-identical across turns of the same discussion (and across discussions of the same mode). Layers 8-12 change per turn.
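The ordering rule can be illustrated with a minimal sketch. The `Layer` type, `assemble_context`, and `make_turn` are hypothetical names for illustration only, not the project's `aggregate` API; the sketch assumes each layer is already rendered to text and knows whether it is stable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    text: str
    stable: bool  # True for layers 1-7, False for layers 8-12


def assemble_context(layers: list[Layer]) -> tuple[str, int]:
    """Emit all stable layers first, then all volatile layers.

    Returns (context, stable_prefix_length). Relative order inside each
    group is preserved, so a misplaced volatile layer cannot leak into
    the cached prefix.
    """
    stable_text = "".join(layer.text + "\n" for layer in layers if layer.stable)
    volatile_text = "".join(layer.text + "\n" for layer in layers if not layer.stable)
    return stable_text + volatile_text, len(stable_text)


def make_turn(user_message: str) -> list[Layer]:
    # Deliberately interleaved input: the assembler restores the ordering.
    return [
        Layer("role_instructions", "You are a coding assistant.", True),
        Layer("discussion_metadata", "discussion: refactor auth", False),
        Layer("persona_profile", "persona: reviewer", True),
        Layer("user_message", user_message, False),
    ]


context_1, boundary_1 = assemble_context(make_turn("first prompt"))
context_2, boundary_2 = assemble_context(make_turn("second prompt"))
```

Both turns share the same boundary offset and the same text before it; only the suffix differs.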


2. The byte-comparison test (the design contract)

The design rule "stable prefix is byte-identical" must be testable. The test:

# In tests/test_aggregate_caching.py (NEW)
def test_aggregate_stable_to_volatile_ordering():
    """The first N characters of the context should be identical across turns
    of the same conversation, when no stable-layer inputs change."""
    ctrl = mock_app_controller()
    ctrl.ai_settings.system_prompt = "Test system prompt"
    ctrl.active_persona = mock_persona()

    # Turn 1
    turn1 = aggregate.build_initial_context(ctrl, user_message="first prompt")

    # Turn 2 (same stable inputs, different user message)
    turn2 = aggregate.build_initial_context(ctrl, user_message="second prompt")

    # The first N characters should be identical (N = where the volatile layers start)
    N = aggregate.stable_prefix_length(ctrl)
    assert turn1[:N] == turn2[:N], f"Stable prefix mismatch: {turn1[:N]!r} != {turn2[:N]!r}"

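The test is the contract. If a change lets per-turn content into the stable prefix (for example, a volatile layer inserted in the middle of the stack, or a timestamp rendered into a stable layer), this test fails; the agent must either move the layer to its correct side of the boundary or update the test (with written justification).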
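When the contract test fails, the raw assertion message can be very long. A small diagnostic helper, sketched here under the assumption that both turns are plain strings, reports where the two contexts first diverge. `first_divergence` and `explain_prefix_mismatch` are hypothetical names, not part of the project.

```python
from typing import Optional


def first_divergence(turn_a: str, turn_b: str) -> int:
    """Return the index of the first differing character.

    If one string is a prefix of the other, return the shorter length.
    """
    limit = min(len(turn_a), len(turn_b))
    for index in range(limit):
        if turn_a[index] != turn_b[index]:
            return index
    return limit


def explain_prefix_mismatch(
    turn_a: str, turn_b: str, stable_length: int, window: int = 20
) -> Optional[str]:
    """Return None when the stable prefix is intact, else a short report."""
    index = first_divergence(turn_a, turn_b)
    if index >= stable_length:
        return None
    start = max(0, index - window)
    end = index + window
    return (
        f"stable prefix diverges at char {index} of {stable_length}: "
        f"{turn_a[start:end]!r} vs {turn_b[start:end]!r}"
    )
```

A typical culprit is a per-turn value (a timestamp, a counter) rendered inside a layer that is supposed to be stable; the report shows the surrounding characters so the offending layer is easy to spot.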


3. The provider-specific cache strategies

3.1 Anthropic (5-minute ephemeral, 4 breakpoints max)

# In src/ai_client.py:_send_anthropic
def _send_anthropic(message, model, *, cache_prefix_chars=None):
    if cache_prefix_chars is not None:
        # Split the message into content blocks; mark each prefix block with cache_control
        content = cache_prefix_blocks(message, cache_prefix_chars)
    else:
        content = message

    response = anthropic_client.messages.create(
        model=model,
        max_tokens=8192,
        messages=[{"role": "user", "content": content}],
    )
    # response.content is a list of content blocks; join the text blocks
    text = "".join(block.text for block in response.content if block.type == "text")
    return _result_with_usage(text, response.usage, message)

The cache_prefix_blocks helper:

def cache_prefix_blocks(message: str, cache_boundaries: list[int]) -> str | list[dict]:
    """Split the message into content blocks at the given char offsets.
    Mark each prefix block with cache_control. Returns the plain string
    when no valid boundary exists. At most 3 prefix blocks (provider limit
    is 4 breakpoints per request)."""
    if not cache_boundaries:
        return message
    points = sorted({b for b in cache_boundaries if 0 < b < len(message)})[:3]
    if not points:
        return message
    blocks = []
    start = 0
    for point in points:
        blocks.append({
            "type": "text",
            "text": message[start:point],
            "cache_control": {"type": "ephemeral"},
        })
        start = point
    blocks.append({"type": "text", "text": message[start:]})
    return blocks

The Anthropic usage accounting:

def _result_with_usage(text, usage, input_text=None):
    input_tokens = _usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
    # Anthropic reports cached prompt tokens separately; fold them back
    # so input_tokens stays "tokens sent" across providers.
    input_tokens += _usage_value(usage, "cache_read_input_tokens")
    input_tokens += _usage_value(usage, "cache_creation_input_tokens")
    # ...

The 4-breakpoint limit. Anthropic allows at most 4 cache_control markers per request. Manual Slop marks at most 3 prefix blocks (one breakpoint each) and leaves the volatile suffix block unmarked, so a request never exceeds the limit.
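One detail worth noting: cache_prefix_blocks keeps the first 3 boundaries after sorting, so a caller that passes one offset per stable layer would cache only the earliest layers. A minimal sketch of choosing the offsets so that the full stable prefix is always covered follows. `choose_cache_boundaries` and `MAX_PREFIX_BLOCKS` are hypothetical names, and picking the last three layer ends is an assumption, not the project's confirmed policy.

```python
from itertools import accumulate

MAX_PREFIX_BLOCKS = 3  # the guide's choice; the provider limit is 4 breakpoints


def choose_cache_boundaries(stable_layer_texts: list[str]) -> list[int]:
    """Return up to MAX_PREFIX_BLOCKS char offsets, each at the end of a stable layer.

    The last offset is always the end of the final stable layer, so the
    whole stable prefix sits behind a breakpoint. Empty layers add no
    offset of their own.
    """
    offsets = sorted(set(accumulate(len(text) for text in stable_layer_texts)) - {0})
    return offsets[-MAX_PREFIX_BLOCKS:]
```

Keeping the latest layer ends means that a change to a late stable layer (for example the knowledge digest) still leaves the earlier blocks reusable.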

3.2 Gemini (1-hour explicit cache, configurable TTL)

# In src/ai_client.py:_send_gemini
# stable_prefix_messages / volatile_messages are `messages` split at the
# cache boundary (layers 1-7 vs 8-12); the split itself is not shown here.
def _send_gemini(messages, *, cache_ttl_seconds=3600):
    if cache_ttl_seconds > 0:
        # google-genai takes the cached contents and TTL inside a config object
        cached_content = genai_client.caches.create(
            model=model,
            config=genai.types.CreateCachedContentConfig(
                contents=stable_prefix_messages,
                ttl=f"{cache_ttl_seconds}s",
            ),
        )
        response = genai_client.models.generate_content(
            model=model,
            contents=volatile_messages,
            config=genai.types.GenerateContentConfig(cached_content=cached_content.name),
        )
    else:
        response = genai_client.models.generate_content(model=model, contents=messages)
    return _result_with_usage(response.text, response.usage_metadata, messages)

The default TTL is 1 hour. Configurable per the GUI (per §4 below).
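How the configured TTL might be turned into the string the Gemini path sends can be sketched as follows. The function name and the clamping behavior are assumptions: the guide only states that the default is 1 hour and that the GUI has an "Allow >1h Gemini caches" switch, so treating that switch as a hard cap is one possible reading.

```python
from typing import Optional

ONE_HOUR_SECONDS = 3600


def gemini_ttl_string(
    requested_seconds: int, *, allow_over_one_hour: bool = False
) -> Optional[str]:
    """Return a TTL such as '3600s', or None when caching is disabled.

    A non-positive TTL means "do not create a cache". Requests above one
    hour are clamped unless the (assumed) opt-in flag is set.
    """
    if requested_seconds <= 0:
        return None
    if requested_seconds > ONE_HOUR_SECONDS and not allow_over_one_hour:
        requested_seconds = ONE_HOUR_SECONDS
    return f"{requested_seconds}s"
```

For example, `gemini_ttl_string(7200)` yields `"3600s"` while `gemini_ttl_string(7200, allow_over_one_hour=True)` yields `"7200s"`.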

3.3 OpenAI (5-10 min implicit, provider-managed)

OpenAI's caching is implicit: the provider automatically caches the prefix and reuses it across requests with the same prefix. No application-side control.

# In src/ai_client.py:_send_openai
def _send_openai(messages, *, model="gpt-5.5"):
    # No application-side cache_control; the provider handles it
    response = openai_client.responses.create(model=model, input=messages)
    return _result_with_usage(response.output_text, response.usage, messages)

The TTL is provider-managed (5-10 min). The GUI just shows "Cached by OpenAI; TTL: provider-managed."

3.4 claude-code (5th provider, subscription auth)

claude-code uses the Claude Agent SDK with local Claude Code authentication (no API key). The caching behavior is provider-managed.

# In src/ai_client.py:_send_claude_code (the 5th provider)
def _send_claude_code(message, model, *, allowed_tools=None, max_turns=1):
    options = ClaudeAgentOptions(
        model=None if not model or model == "default" else model,
        max_turns=max_turns,
        tools=list(allowed_tools) if allowed_tools else [],
        allowed_tools=list(allowed_tools) if allowed_tools else [],
        cwd=os.getcwd(),
    )
    # ... claude_agent_sdk.query(prompt=message, options=options)
    return _result_with_usage(text, usage, message)

4. The GUI exposure

The "Caching" Operations Hub sub-panel:

+------------------------------------------------------+
| Caching                                              |
+------------------------------------------------------+
| Provider summaries                                   |
| [Anthropic]   in:340 cache:80  hit:23%  ttl:4:32     |
| [Gemini]      in:120 cache:0   hit:0%   ttl:0:00     |
| [OpenAI]      in:560 cache:200 hit:35%  ttl:n/a      |
+------------------------------------------------------+
| Active discussions                                   |
| Discussion "refactor auth"                           |
|   cached: yes (Anthropic)                            |
|   expires: 2026-06-12T15:32 (in 4:32)                |
|   [Invalidate cache] [Disable caching for this]      |
| Discussion "fix the parser"                          |
|   cached: no                                         |
|   [Enable caching for this]                          |
+------------------------------------------------------+
| Global settings                                      |
|   [X] Enable Anthropic ephemeral caching             |
|   [X] Enable Gemini explicit caching                 |
|   [ ] Allow >1h Gemini caches (charges may apply)    |
|   Anthropic default TTL: [5 min v]                   |
|   Gemini default TTL:    [60 min v]                  |
+------------------------------------------------------+

The data sources:

| Widget | Data source | Frequency |
|---|---|---|
| in:N cache:N hit:N% | ai_client.get_token_stats() | per turn (or per session) |
| ttl:4:32 | ai_client._send_<provider> usage metadata + the cache expiry timestamp | per turn |
| cached: yes/no | per-discussion flag (NEW) | per discussion |
| [Invalidate cache] | calls ai_client._invalidate_cache(discussion_id) (NEW) | on click |
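The in / cache / hit figures need the same meaning across providers, even though each provider reports usage differently. The sketch below normalizes a plain usage dict per provider. `token_stats` is a hypothetical helper (not the project's `get_token_stats`); the field names are those the provider SDKs report to the best of my knowledge, and the floor rounding is an assumption chosen to match the panel's sample values.

```python
def token_stats(provider: str, usage: dict) -> dict:
    """Normalize provider usage into {'in', 'cache', 'hit_percent'}.

    'in' is the total number of prompt tokens sent, 'cache' the number
    served from cache. Missing or None fields count as 0.
    """

    def field(name: str) -> int:
        return usage.get(name) or 0

    if provider == "anthropic":
        # input_tokens excludes cached tokens, so fold them back in.
        cached = field("cache_read_input_tokens")
        sent = field("input_tokens") + cached + field("cache_creation_input_tokens")
    elif provider == "gemini":
        cached = field("cached_content_token_count")
        sent = field("prompt_token_count")
    elif provider == "openai":
        cached = (usage.get("input_tokens_details") or {}).get("cached_tokens") or 0
        sent = field("input_tokens")
    else:
        raise ValueError(f"unknown provider: {provider}")

    hit_percent = 100 * cached // sent if sent else 0
    return {"in": sent, "cache": cached, "hit_percent": hit_percent}
```

With the panel's sample values, 80 cached out of 340 sent gives 23% and 200 out of 560 gives 35%.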

The new AI client state:

# In src/ai_client.py (NEW)
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DiscussionCacheState:
    discussion_id: str
    provider: str
    cached_at: datetime
    expires_at: Optional[datetime]
    hit_count: int = 0
    tokens_cached: int = 0
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True

# In AppController (NEW)
self.discussion_caches: dict[str, DiscussionCacheState] = {}
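The ttl column in the panel can be derived from expires_at. The following is a minimal sketch, assuming expires_at is None for provider-managed caches (which the panel shows as n/a); `format_ttl_remaining` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Optional


def format_ttl_remaining(expires_at: Optional[datetime], now: datetime) -> str:
    """Render the time left as 'M:SS', '0:00' once expired, or 'n/a' if unknown."""
    if expires_at is None:
        return "n/a"
    remaining = int((expires_at - now).total_seconds())
    if remaining <= 0:
        return "0:00"
    minutes, seconds = divmod(remaining, 60)
    return f"{minutes}:{seconds:02d}"
```

Passing `now` explicitly, rather than calling `datetime.now()` inside, keeps the helper deterministic and easy to test.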

The Hook API additions:

GET  /api/cache                        # list all discussion cache states
GET  /api/cache/<discussion_id>        # get one
POST /api/cache/<discussion_id>/invalidate
POST /api/cache/<discussion_id>/disable
POST /api/cache/<discussion_id>/enable
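The five routes map onto a small amount of state mutation. The sketch below shows that mapping against an in-memory dict, independent of any web framework. `handle_cache_request` and the dict-shaped state are illustrative assumptions, not the project's Hook API implementation.

```python
from datetime import datetime, timezone


def handle_cache_request(method: str, path: str, caches: dict) -> tuple:
    """Dispatch one request; returns (status_code, body)."""
    parts = [part for part in path.split("/") if part]
    if parts[:2] != ["api", "cache"] or len(parts) > 4:
        return 404, {"error": "not found"}
    if len(parts) == 2:  # /api/cache
        if method != "GET":
            return 405, {"error": "method not allowed"}
        return 200, list(caches.values())

    state = caches.get(parts[2])
    if state is None:
        return 404, {"error": "unknown discussion"}
    if len(parts) == 3:  # /api/cache/<discussion_id>
        if method != "GET":
            return 405, {"error": "method not allowed"}
        return 200, state

    if method != "POST":  # /api/cache/<discussion_id>/<action>
        return 405, {"error": "method not allowed"}
    action = parts[3]
    if action == "invalidate":
        # Only marks the state; the next turn re-creates the provider cache.
        state["last_invalidated_at"] = datetime.now(timezone.utc).isoformat()
    elif action == "disable":
        state["caching_enabled"] = False
    elif action == "enable":
        state["caching_enabled"] = True
    else:
        return 404, {"error": "unknown action"}
    return 200, state
```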

5. The injection (where the cache hits)

| Layer | Where injected | Stable? | Cache impact |
|---|---|---|---|
| 1. Role instructions | _get_combined_system_prompt | yes | CACHED |
| 2. Function-calling schema | per provider | yes | CACHED |
| 3. Discovered tool descriptions | mcp_client.get_tool_schemas() | yes | CACHED |
| 4. System prompt preset | app_state.ai_settings.system_prompt | yes | CACHED |
| 5. Persona profile | app_state.active_persona | yes | CACHED |
| 6. Project context | manual_slop.toml [agent.context_files] | yes | CACHED |
| 7. Knowledge digest | ~/.manual_slop/knowledge/digest.md | yes (within a gc cycle) | CACHED |
| 8. Discussion metadata | disc_entries[:1] | no | NOT cached |
| 9. Active preset | self.context_files | no | NOT cached |
| 10. Per-file details | per FileItem | no | NOT cached |
| 11. Prior tool results | per _reread_file_items | no | NOT cached |
| 12. User message | the input | no | NOT cached |

The cache only hits on the stable prefix (layers 1-7). The volatile suffix (layers 8-12) is not cached; the user expects the conversation to change per turn.


6. The cache invalidation triggers

| Trigger | Effect |
|---|---|
| python -m src.knowledge_harvest --apply | The digest is regenerated; the cache is invalidated for the next turn |
| FileItem.notes edited | The per-file knowledge changes; the cache is invalidated for the next turn that references the file |
| persona changed | The persona profile is in the stable prefix; the cache is invalidated |
| [Invalidate cache] button | The per-discussion cache state is marked last_invalidated_at; the next turn re-creates it |
| expiration reached | The provider's cache expires automatically; the next turn re-creates it |
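Most of these triggers amount to "the bytes of the stable prefix changed". One way to detect that locally, sketched here as an assumption rather than the project's mechanism, is to fingerprint the stable layers with SHA-256 and compare against the fingerprint recorded when the cache was created. The function names are hypothetical.

```python
import hashlib


def stable_prefix_fingerprint(stable_layers: list) -> str:
    """Hash the concatenated stable layers exactly as they would be sent."""
    return hashlib.sha256("".join(stable_layers).encode("utf-8")).hexdigest()


def needs_invalidation(previous_fingerprint: str, stable_layers: list) -> bool:
    """True when the stable prefix no longer matches what was cached."""
    return previous_fingerprint != stable_prefix_fingerprint(stable_layers)
```

A persona change or a regenerated digest alters the fingerprint; a new user message does not, because volatile layers are never part of the input.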

7. The measurement (the empirical basis)

The "before" measurement (do this first, before any refactor):

# Log the cache hit rate over a sample of representative discussions
$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 23% (avg)
cache write rate: 45% (avg)
in:N avg: 1,200
cache:N avg: 280

The "after" measurement (after the stable-to-volatile refactor):

$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 67% (avg)     # <-- should be measurably higher
cache write rate: 18% (avg)   # <-- should be lower
in:N avg: 1,200               # <-- unchanged (the user still types the same)
cache:N avg: 804              # <-- should rise with the hit rate (67% of 1,200)

The win comes from re-aligning the boundaries, not from changing the providers. The test is whether the cache hit rate is measurably higher after the refactor.


8. The cross-references

  • conductor/code_styleguides/cache_friendly_context.md — the canonical styleguide
  • docs/guide_ai_client.md — the underlying LLM client (the producer)
  • docs/guide_agent_memory_dimensions.md §5 — where the 4 dims get injected
  • docs/guide_knowledge_curation.md §3 — the digest (layer 7)
  • conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.2, §5 — the nagent pattern