manual_slop/docs/guide_caching_strategy.md
ed 434b6d0d54 docs: reduce redundant content across files; map references to canonical sources
Per user 'a bunch of docs just committed had redundant content across
files. Can we do a reduction of that and instead map references to
other files?'

This commit reduces content duplication across 9 files. The
canonical sources are kept as detailed references; the other
files now point to them.

Reductions (table replaced with 'see canonical' reference):

1. data_oriented_design.md §9: the 4-dim memory table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

2. guide_agent_memory_dimensions.md §0: the 4-dim memory table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

3. guide_caching_strategy.md §1: the 12-layer model
   (canonical: conductor/code_styleguides/cache_friendly_context.md §1)

4. guide_ai_client.md 'Cache strategy' section: the 12-layer model recap
   (canonical: conductor/code_styleguides/cache_friendly_context.md §1)

5. guide_knowledge_curation.md §1: the 5 category file details
   (canonical: conductor/code_styleguides/knowledge_artifacts.md §1)

6. product-guidelines.md 'Memory Dimensions' section: the 4-dim table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

7. guide_mma.md '4 memory dimensions' section: the MMA scope table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

8. docs/AGENTS.md §0 + §5-§8: 4-dim table + caching/knowledge/RAG/
   feature flag tables (canonical: the per-topic styleguides in
   conductor/code_styleguides/)

9. AGENTS.md 'Code Styleguides' section: the 6-styleguide list
   (canonical: docs/AGENTS.md §2)

The principle: each piece of content has ONE source of truth; other
places point to it. The data-oriented way. Files retain their
narrative flow and the 'what this is' intros, but the detailed
tables are now in their canonical home.

Net effect: -2100 bytes across 9 files (without losing any
information - the canonical sources are unchanged). The
'cross-references' sections are kept; the duplicated content
is removed.
2026-06-12 14:10:30 -04:00


Caching Strategy Guide

Status: User-facing deep-dive on the cache strategy: stable-to-volatile context ordering, the 4 cache-TTL profiles (Anthropic, Gemini, OpenAI, claude-code), and the GUI exposure.
Date: 2026-06-12
Cross-refs: conductor/code_styleguides/cache_friendly_context.md; docs/guide_ai_client.md; conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.2, §5.

What this is. The LLM providers Manual Slop uses (Anthropic, Gemini, OpenAI) all support prompt caching. The cost benefit comes from the stable prefix being byte-identical across turns. This guide is the user-facing deep-dive on the 12-layer model, the byte-comparison test, the provider-specific TTLs, and the GUI exposure.


0. The 30-second version

[STABLE PREFIX (cached across turns)]  [VOLATILE SUFFIX (per-turn)]
[Role instructions]                     [Discussion metadata]
[Function-calling schema]               [Active preset (FileItems)]
[Discovered tool descriptions]          [Per-file details]
[System prompt preset]                  [Tool-call results from prior turns]
[Persona profile]                       [The user message]
[Project context]
[Knowledge digest]
[file-knowledge for files in scope]

The cache boundary is at layer 7/8. Layers 1-7 are byte-identical across turns; layers 8-12 change per turn. The Anthropic-specific path wraps the prefix in cache_control: {"type": "ephemeral"} blocks; the Gemini path uses cachedContent resources; the OpenAI path uses implicit prefix caching.

The provider-specific defaults:

| Provider | Default TTL | Configurable? | GUI exposure? |
|---|---|---|---|
| Anthropic ephemeral | 5 min | yes (per-discussion) | yes |
| Gemini explicit | 1 h | yes (per-discussion override) | yes (TTL override) |
| OpenAI implicit | 5-10 min (provider-managed) | no | shows "cached" only |
| claude-code (Claude Agent SDK) | varies (provider-managed) | no | shows "cached" only |

1. The 12-layer model (the stable-to-volatile ordering)

The canonical reference is conductor/code_styleguides/cache_friendly_context.md §1 (the full 12-layer table with the stable/volatile classification + the byte-comparison test contract + the per-layer ─── data markings). This section is a pointer.

The one-line summary: layers 1-7 (role instructions, function-calling schema, tool descriptions, system prompt, persona, project context, knowledge digest) are byte-identical across turns and cacheable. Layers 8-12 (discussion metadata, active preset, per-file details, prior tool results, user message) are per-turn and NOT cached. The cache boundary is at layer 7/8.
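
The ordering rule can be sketched in a few lines. This is a minimal illustration, not the project's aggregate module: the Layer dataclass, build_context, and layers_for are hypothetical names, and only three of the twelve layers are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    text: str
    stable: bool  # True for layers 1-7, False for layers 8-12


def build_context(layers: list[Layer]) -> tuple[str, int]:
    """Concatenate stable layers first, then volatile ones.

    Returns (context, boundary) where context[:boundary] is the stable prefix.
    """
    prefix = "".join(layer.text for layer in layers if layer.stable)
    suffix = "".join(layer.text for layer in layers if not layer.stable)
    return prefix + suffix, len(prefix)


def layers_for(user_message: str) -> list[Layer]:
    # Hypothetical layer contents; the declaration order is deliberately mixed.
    return [
        Layer("role_instructions", "You are a coding assistant.\n", stable=True),
        Layer("user_message", user_message + "\n", stable=False),
        Layer("persona", "Persona: terse reviewer.\n", stable=True),
    ]


turn1, boundary1 = build_context(layers_for("first prompt"))
turn2, boundary2 = build_context(layers_for("second prompt"))
print(turn1[:boundary1] == turn2[:boundary2])  # True: only the suffix differs
```

Because the volatile text is always appended after the stable text, changing the user message never moves or alters the prefix.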


2. The byte-comparison test (the design contract)

The design rule "stable prefix is byte-identical" must be testable. The test:

# In tests/test_aggregate_caching.py (NEW)
def test_aggregate_stable_to_volatile_ordering():
    """The first N characters of the context should be identical across turns
    of the same conversation, when no stable-layer inputs change."""
    ctrl = mock_app_controller()
    ctrl.ai_settings.system_prompt = "Test system prompt"
    ctrl.active_persona = mock_persona()

    # Turn 1
    turn1 = aggregate.build_initial_context(ctrl, user_message="first prompt")

    # Turn 2 (same stable inputs, different user message)
    turn2 = aggregate.build_initial_context(ctrl, user_message="second prompt")

    # The first N characters should be identical (N = where the volatile layers start)
    N = aggregate.stable_prefix_length(ctrl)
    assert turn1[:N] == turn2[:N], f"Stable prefix mismatch: {turn1[:N]!r} != {turn2[:N]!r}"

The test is the contract. If a new layer is added in the middle of the stack, this test fails; the agent must either move the layer to the stable position or update the test (with written justification).
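
To make the failure mode concrete, here is a small sketch of what the test catches. The strings are made-up stand-ins for real layers, and common_prefix_len is a hypothetical helper; the point is only that a per-turn value placed among the stable layers shortens the shared prefix.

```python
import os


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest shared leading substring of a and b."""
    # os.path.commonprefix compares character by character on any strings.
    return len(os.path.commonprefix([a, b]))


stable = "ROLE\nSCHEMA\nPERSONA\n"

# Correct ordering: every per-turn value sits after the stable layers.
good_turn1 = stable + "meta: turn 1\n" + "first prompt"
good_turn2 = stable + "meta: turn 2\n" + "second prompt"

# Broken ordering: a timestamp was inserted between two stable layers.
bad_turn1 = "ROLE\n" + "time: 14:10:30\n" + "SCHEMA\nPERSONA\n" + "first prompt"
bad_turn2 = "ROLE\n" + "time: 14:10:31\n" + "SCHEMA\nPERSONA\n" + "second prompt"

good_shared = common_prefix_len(good_turn1, good_turn2)
bad_shared = common_prefix_len(bad_turn1, bad_turn2)
print(good_shared >= len(stable), bad_shared >= len(stable))
```

In the broken case the shared prefix ends inside the timestamp, so everything after it, including the otherwise stable schema and persona, would miss the cache.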


3. The provider-specific cache strategies

3.1 Anthropic (5-minute ephemeral, 4 breakpoints max)

# In src/ai_client.py:_send_anthropic
def _send_anthropic(messages, *, cache_prefix_chars=None):
    if cache_prefix_chars is not None:
        # Wrap the message in content blocks; mark each prefix with cache_control
        content_blocks = cache_prefix_blocks(messages, cache_prefix_chars)
    else:
        content_blocks = messages

    response = anthropic_client.messages.create(
        model=model,
        max_tokens=8192,
        messages=[{"role": "user", "content": content_blocks}],
    )
    return _result_with_usage(response.content, response.usage, messages)

The cache_prefix_blocks helper:

def cache_prefix_blocks(message: str, cache_boundaries: list[int]) -> str | list[dict]:
    """Split the message into content blocks at the given char offsets.
    Mark each prefix block with cache_control. Returns the plain string
    when no valid boundary exists. At most 3 prefix blocks (provider limit
    is 4 breakpoints per request)."""
    if not cache_boundaries:
        return message
    points = sorted({b for b in cache_boundaries if 0 < b < len(message)})[:3]
    if not points:
        return message
    blocks = []
    start = 0
    for point in points:
        blocks.append({
            "type": "text",
            "text": message[start:point],
            "cache_control": {"type": "ephemeral"},
        })
        start = point
    blocks.append({"type": "text", "text": message[start:]})
    return blocks
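
Because the helper keeps only the three smallest offsets, the caller decides which boundaries matter. One possible policy, sketched below under the assumption that the stable layers are available as separate strings, is to pass the last few layer ends so that the end of the whole stable prefix is always among them. pick_boundaries is a hypothetical name, not part of the codebase.

```python
from itertools import accumulate

MAX_PREFIX_BLOCKS = 3  # mirrors the [:3] cap in cache_prefix_blocks


def pick_boundaries(stable_layer_texts: list[str], limit: int = MAX_PREFIX_BLOCKS) -> list[int]:
    """Return up to `limit` char offsets, ending at the end of the stable prefix."""
    if limit <= 0:
        return []
    # Running total of layer lengths = the offset where each layer ends.
    ends = sorted({end for end in accumulate(len(text) for text in stable_layer_texts) if end > 0})
    return ends[-limit:]


# Five stable layers of lengths 5, 7, 8, 4 and 6 end at offsets 5, 12, 20, 24, 30.
boundaries = pick_boundaries(["a" * 5, "b" * 7, "c" * 8, "d" * 4, "e" * 6])
print(boundaries)  # [20, 24, 30]
```

The final offset equals the stable prefix length, so it passes the helper's `0 < b < len(message)` filter only when a non-empty volatile suffix follows the prefix.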

The Anthropic usage accounting:

def _result_with_usage(text, usage, input_text=None):
    input_tokens = _usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
    # Anthropic reports cached prompt tokens separately; fold them back
    # so input_tokens stays "tokens sent" across providers.
    input_tokens += _usage_value(usage, "cache_read_input_tokens")
    input_tokens += _usage_value(usage, "cache_creation_input_tokens")
    # ...
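
The _usage_value helper is not shown in this guide. A plausible shape, assuming it returns the first field that is present and falls back to 0, is sketched here with hypothetical names (usage_value, total_input_tokens) and a made-up usage object.

```python
from types import SimpleNamespace


def usage_value(usage, *names: str) -> int:
    """Return the first non-None field among `names` as an int, else 0."""
    for name in names:
        if isinstance(usage, dict):
            value = usage.get(name)
        else:
            value = getattr(usage, name, None)
        if value is not None:
            return int(value)
    return 0


def total_input_tokens(usage) -> int:
    """Tokens sent: uncached input plus cache reads plus cache writes."""
    return (
        usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
        + usage_value(usage, "cache_read_input_tokens")
        + usage_value(usage, "cache_creation_input_tokens")
    )


# Made-up numbers shaped like an Anthropic usage object on a cache hit.
anthropic_like = SimpleNamespace(
    input_tokens=120,
    cache_read_input_tokens=800,
    cache_creation_input_tokens=0,
)
print(total_input_tokens(anthropic_like))  # 920
```

Providers that do not report the cache fields contribute 0 for them, so the same function works on their usage objects unchanged.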

The 4-breakpoint limit. Anthropic allows at most 4 cache_control markers per request. Manual Slop marks at most 3 prefix blocks (one breakpoint each) and leaves the volatile suffix block unmarked, so it stays within the limit with one breakpoint to spare.

3.2 Gemini (1-hour explicit cache, configurable TTL)

# In src/ai_client.py:_send_gemini
def _send_gemini(messages, *, cache_ttl_seconds=3600):
    if cache_ttl_seconds > 0:
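        # stable_prefix_messages / volatile_messages: `messages` split at the cache boundary.
        # In the google-genai SDK, contents and ttl are passed via the config object.
        cached_content = genai_client.caches.create(
            model=model,
            config=genai.types.CreateCachedContentConfig(
                contents=stable_prefix_messages,
                ttl=f"{cache_ttl_seconds}s",
            ),
        )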
        response = genai_client.models.generate_content(
            model=model,
            contents=volatile_messages,
            config=genai.types.GenerateContentConfig(cached_content=cached_content.name),
        )
    else:
        response = genai_client.models.generate_content(model=model, contents=messages)
    return _result_with_usage(response.text, response.usage_metadata, messages)

The default TTL is 1 hour; it is configurable in the GUI (see §4).
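
An explicit cache only pays off if later turns reuse it instead of creating a new one per request. The sketch below shows one way to do that, assuming the stable prefix is available as a string: key the cache by a hash of the prefix and remember when it expires. PrefixCacheRegistry and fake_create are hypothetical; in real use the injected create_cache callable would wrap the provider call and return the cache resource name.

```python
import hashlib
import time
from typing import Callable


class PrefixCacheRegistry:
    """Remember one provider-side cache per distinct stable prefix."""

    def __init__(self, create_cache: Callable[[str, int], str], clock: Callable[[], float] = time.monotonic):
        self._create = create_cache
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}  # key -> (cache name, expiry)

    def get_or_create(self, stable_prefix: str, ttl_seconds: int) -> str:
        key = hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # still valid: reuse it
        name = self._create(stable_prefix, ttl_seconds)
        self._entries[key] = (name, now + ttl_seconds)
        return name


registry_calls: list[str] = []


def fake_create(prefix: str, ttl_seconds: int) -> str:
    # Stand-in for the provider call; returns a placeholder cache name.
    registry_calls.append(prefix)
    return f"cache-{len(registry_calls)}"


registry = PrefixCacheRegistry(fake_create)
first = registry.get_or_create("ROLE\nPERSONA\n", 3600)
second = registry.get_or_create("ROLE\nPERSONA\n", 3600)
print(first == second, len(registry_calls))  # True 1
```

Any change to the prefix produces a different hash, so an edited persona or regenerated digest naturally results in a fresh cache rather than a stale hit.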

3.3 OpenAI (5-10 min implicit, provider-managed)

OpenAI's caching is implicit: the provider automatically caches the prefix and reuses it across requests with the same prefix. No application-side control.

# In src/ai_client.py:_send_openai
def _send_openai(messages, *, model="gpt-5.5"):
    # No application-side cache_control; the provider handles it
    response = openai_client.responses.create(model=model, input=messages)
    return _result_with_usage(response.output_text, response.usage, messages)

The TTL is provider-managed (5-10 min). The GUI just shows "Cached by OpenAI; TTL: provider-managed."
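
Although the cache cannot be controlled, the response usage does report how many prompt tokens were served from it, which is enough to drive a "cached" indicator. A minimal sketch, assuming the usage object exposes cached_tokens under input_tokens_details (Responses API) or prompt_tokens_details (Chat Completions); openai_cached_tokens is a hypothetical helper and the numbers are made up.

```python
from types import SimpleNamespace


def openai_cached_tokens(usage) -> int:
    """Return the cached prompt token count from an OpenAI-style usage object."""
    for details_name in ("input_tokens_details", "prompt_tokens_details"):
        details = getattr(usage, details_name, None)
        cached = getattr(details, "cached_tokens", None)
        if cached is not None:
            return int(cached)
    return 0


# Made-up usage shaped like a Responses API result.
responses_usage = SimpleNamespace(
    input_tokens=560,
    input_tokens_details=SimpleNamespace(cached_tokens=200),
)
print(openai_cached_tokens(responses_usage))  # 200
```

For this provider the cached tokens are already included in the reported input token count, so nothing needs to be folded back in.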

3.4 claude-code (5th provider, subscription auth)

claude-code uses the Claude Agent SDK with local Claude Code authentication (no API key). The caching behavior is provider-managed.

# In src/ai_client.py:_send_claude_code (the 5th provider)
def _send_claude_code(message, model, *, allowed_tools=None, max_turns=1):
    options = ClaudeAgentOptions(
        model=None if not model or model == "default" else model,
        max_turns=max_turns,
        tools=list(allowed_tools) if allowed_tools else [],
        allowed_tools=list(allowed_tools) if allowed_tools else [],
        cwd=os.getcwd(),
    )
    # ... claude_agent_sdk.query(prompt=message, options=options)
    return _result_with_usage(text, usage, message)

4. The GUI exposure

The "Caching" Operations Hub sub-panel:

+------------------------------------------------------+
| Caching                                              |
+------------------------------------------------------+
| Provider summaries                                   |
| [Anthropic]   in:340 cache:80  hit:23%  ttl:4:32     |
| [Gemini]      in:120 cache:0   hit:0%   ttl:0:00     |
| [OpenAI]      in:560 cache:200 hit:35%  ttl:n/a      |
+------------------------------------------------------+
| Active discussions                                   |
| Discussion "refactor auth"                           |
|   cached: yes (Anthropic)                            |
|   expires: 2026-06-12T15:32 (in 4:32)                |
|   [Invalidate cache] [Disable caching for this]      |
| Discussion "fix the parser"                          |
|   cached: no                                         |
|   [Enable caching for this]                          |
+------------------------------------------------------+
| Global settings                                      |
|   [X] Enable Anthropic ephemeral caching             |
|   [X] Enable Gemini explicit caching                 |
|   [ ] Allow >1h Gemini caches (charges may apply)    |
|   Anthropic default TTL: [5 min v]                   |
|   Gemini default TTL:    [60 min v]                  |
+------------------------------------------------------+

The data sources:

| Widget | Data source | Frequency |
|---|---|---|
| in:N cache:N hit:N% | ai_client.get_token_stats() | per turn (or per session) |
| ttl:4:32 | ai_client._send_<provider> usage metadata + the cache expiry timestamp | per turn |
| cached: yes/no | per-discussion flag (NEW) | per discussion |
| [Invalidate cache] | calls ai_client._invalidate_cache(discussion_id) (NEW) | on click |

The new AI client state:

# In src/ai_client.py (NEW)
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DiscussionCacheState:
    discussion_id: str
    provider: str
    cached_at: datetime
    expires_at: Optional[datetime]
    hit_count: int = 0
    tokens_cached: int = 0
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True

# In AppController (NEW)
self.discussion_caches: dict[str, DiscussionCacheState] = {}
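
The ttl: and "expires ... (in 4:32)" widgets can be derived from expires_at alone. A small sketch, assuming timezone-aware datetimes; format_ttl is a hypothetical helper and the timestamps are illustrative.

```python
from datetime import datetime, timezone
from typing import Optional


def format_ttl(expires_at: Optional[datetime], now: datetime) -> str:
    """Render the remaining cache lifetime as M:SS, or n/a when unknown."""
    if expires_at is None:
        return "n/a"  # provider-managed TTL
    remaining = int((expires_at - now).total_seconds())
    if remaining <= 0:
        return "0:00"  # already expired
    minutes, seconds = divmod(remaining, 60)
    return f"{minutes}:{seconds:02d}"


now = datetime(2026, 6, 12, 15, 27, 28, tzinfo=timezone.utc)
expires = datetime(2026, 6, 12, 15, 32, 0, tzinfo=timezone.utc)
print(format_ttl(expires, now))  # 4:32
```

Passing `now` explicitly keeps the function deterministic, which makes the display logic easy to test without patching the clock.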

The Hook API additions:

GET  /api/cache                        # list all discussion cache states
GET  /api/cache/<discussion_id>        # get one
POST /api/cache/<discussion_id>/invalidate
POST /api/cache/<discussion_id>/disable
POST /api/cache/<discussion_id>/enable
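
The five endpoints follow one path pattern, so dispatch can be sketched as a single pure function. This is an illustration of the route shapes only: resolve and its return tuples are hypothetical, and the real Hook API server and response formats are not described in this guide.

```python
import re
from typing import Optional

# /api/cache, /api/cache/<id>, or /api/cache/<id>/<action>
ROUTE = re.compile(
    r"^/api/cache(?:/(?P<discussion_id>[^/]+)(?:/(?P<action>invalidate|disable|enable))?)?$"
)


def resolve(method: str, path: str) -> Optional[tuple[str, Optional[str]]]:
    """Map (method, path) to (operation, discussion_id), or None if unsupported."""
    match = ROUTE.match(path)
    if match is None:
        return None
    discussion_id, action = match["discussion_id"], match["action"]
    if method == "GET" and action is None:
        return ("list", None) if discussion_id is None else ("get", discussion_id)
    if method == "POST" and action is not None:
        return (action, discussion_id)
    return None


print(resolve("POST", "/api/cache/refactor-auth/invalidate"))
```

Reads are GET-only and state changes are POST-only, so a GET on an action path or a POST on a bare resource resolves to None.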

5. The injection (where the cache hits)

| Layer | Where injected | Stable? | Cache impact |
|---|---|---|---|
| 1. Role instructions | _get_combined_system_prompt | yes | CACHED |
| 2. Function-calling schema | per provider | yes | CACHED |
| 3. Discovered tool descriptions | mcp_client.get_tool_schemas() | yes | CACHED |
| 4. System prompt preset | app_state.ai_settings.system_prompt | yes | CACHED |
| 5. Persona profile | app_state.active_persona | yes | CACHED |
| 6. Project context | manual_slop.toml [agent.context_files] | yes | CACHED |
| 7. Knowledge digest | ~/.manual_slop/knowledge/digest.md | yes (within a gc cycle) | CACHED |
| 8. Discussion metadata | disc_entries[:1] | no | NOT cached |
| 9. Active preset | self.context_files | no | NOT cached |
| 10. Per-file details | per FileItem | no | NOT cached |
| 11. Prior tool results | per _reread_file_items | no | NOT cached |
| 12. User message | the input | no | NOT cached |

The cache only hits on the stable prefix (layers 1-7). The volatile suffix (layers 8-12) is not cached; the user expects the conversation to change per turn.


6. The cache invalidation triggers

| Trigger | Effect |
|---|---|
| python -m src.knowledge_harvest --apply | The digest is regenerated; the cache is invalidated for the next turn |
| FileItem.notes edited | The per-file knowledge changes; the cache is invalidated for the next turn that references the file |
| persona changed | The persona profile is in the stable prefix; the cache is invalidated |
| [Invalidate cache] button | The per-discussion cache state is marked last_invalidated_at; the next turn re-creates it |
| expiration reached | The provider's cache expires automatically; the next turn re-creates it |
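
Most of these triggers amount to "a stable input changed". One way to detect that, sketched here with hypothetical helpers (fingerprint, changed_layers) and made-up layer contents, is to hash each stable input per turn and compare against the previous turn.

```python
import hashlib


def fingerprint(stable_inputs: dict[str, str]) -> dict[str, str]:
    """Hash each stable input so turns can be compared cheaply."""
    return {
        name: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for name, text in stable_inputs.items()
    }


def changed_layers(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Names of inputs that were added, removed, or modified between turns."""
    names = previous.keys() | current.keys()
    return sorted(name for name in names if previous.get(name) != current.get(name))


before = fingerprint({"persona": "terse reviewer", "digest": "digest v1"})
after = fingerprint({"persona": "terse reviewer", "digest": "digest v2"})
print(changed_layers(before, after))  # ['digest']
```

A non-empty result means the next request will write a new cache entry rather than read one, which is also useful to surface when diagnosing a low hit rate.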

7. The measurement (the empirical basis)

The "before" measurement (do this first, before any refactor):

# Log the cache hit rate over a sample of representative discussions
$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 23% (avg)
cache write rate: 45% (avg)
in:N avg: 1,200
cache:N avg: 280

The "after" measurement (after the stable-to-volatile refactor):

$ python -m scripts.measure_cache_hit_rate --discussions 50 --provider anthropic
cache hit rate: 67% (avg)     # <-- should be measurably higher
cache write rate: 18% (avg)   # <-- should be lower
in:N avg: 1,200               # <-- unchanged (the user still types the same)
cache:N avg: 280              # <-- unchanged

The win comes from re-aligning the boundaries, not from changing the providers. The test is whether the cache hit rate is measurably higher after the refactor.
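
The measurement script itself is not shown here. As a rough sketch of the arithmetic, assuming the rates are token-weighted (cache reads or writes divided by all input tokens sent), the aggregation could look like the following; TurnUsage and cache_rates are hypothetical names, and the script may define its rates differently.

```python
from dataclasses import dataclass


@dataclass
class TurnUsage:
    uncached_input: int  # input tokens billed at the normal rate
    cache_read: int      # tokens served from the cache
    cache_write: int     # tokens written to the cache this turn


def cache_rates(turns: list[TurnUsage]) -> tuple[float, float]:
    """Return (hit_rate, write_rate) as fractions of all input tokens sent."""
    total = sum(t.uncached_input + t.cache_read + t.cache_write for t in turns)
    if total == 0:
        return 0.0, 0.0
    hit_rate = sum(t.cache_read for t in turns) / total
    write_rate = sum(t.cache_write for t in turns) / total
    return hit_rate, write_rate


# Made-up two-turn discussion: turn 1 writes the prefix, turn 2 reads it back.
turns = [TurnUsage(100, 0, 900), TurnUsage(100, 900, 0)]
hit_rate, write_rate = cache_rates(turns)
print(f"hit {hit_rate:.0%}, write {write_rate:.0%}")  # hit 45%, write 45%
```

Under this definition a longer discussion with an unchanged prefix pushes the hit rate up and the write rate down, which is the direction the refactor is expected to move both numbers.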


8. The cross-references

  • conductor/code_styleguides/cache_friendly_context.md — the canonical styleguide
  • docs/guide_ai_client.md — the underlying LLM client (the producer)
  • docs/guide_agent_memory_dimensions.md §5 — where the 4 dims get injected
  • docs/guide_knowledge_curation.md §3 — the digest (layer 7)
  • conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.2, §5 — the nagent pattern