manual_slop/conductor/code_styleguides/cache_friendly_context.md
ed 35c6cca134 docs: agent workflow docs + regular docs (v2.3 surfacing)
2026-06-12 13:50:40 -04:00

Cache-Friendly Context (stable-to-volatile ordering + cache TTL)

Status: Styleguide; codifies the cache strategy for aggregate.py:run and the GUI exposure of cache TTL.
Date: 2026-06-12
Cross-refs: conductor/code_styleguides/data_oriented_design.md §3.2; conductor/code_styleguides/agent_memory_dimensions.md; docs/guide_caching_strategy.md; conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.2, §5.

What this is. The LLM providers that Manual Slop uses (Anthropic, Gemini, OpenAI) all support some form of prompt caching. The cost benefit comes from the stable prefix being byte-identical across turns and across discussions. This styleguide defines the stable prefix, the volatile suffix, the byte-comparison contract, and the cache TTL GUI exposure.


0. The one-glance principle

[STABLE PREFIX (cached across turns)]  [VOLATILE SUFFIX (per-turn)]
[Role instructions]                     [Discussion metadata]
[Function-calling schema]               [Active preset (FileItems)]
[Discovered tool descriptions]          [Per-file details]
[System prompt preset]                  [Tool-call results from prior turns]
[Persona profile]                       [The user message]
[Project context]
[Knowledge digest]
[file-knowledge for files in scope]

The cache boundary falls between layers 7 and 8 of the 12-layer model in §1 (the last stable layer and the first volatile one); the file-knowledge entry in the sketch travels with layer 7. The Anthropic-specific path marks the prefix blocks with cache_control: {"type": "ephemeral"} up to the boundary; the Gemini path uses cachedContent resources; the OpenAI path uses implicit prefix caching.


1. The 12-layer model (the stable-to-volatile ordering)

| # | Layer | Stable across turns? | Source | SSDL |
|---|---|---|---|---|
| 1 | Role instructions (model + provider) | yes | _get_combined_system_prompt | [I] |
| 2 | Function-calling schema | yes | per provider | [I] |
| 3 | Discovered tool descriptions | yes | mcp_client.get_tool_schemas() | [I] |
| 4 | System prompt preset | yes | app_state.ai_settings.system_prompt | [I] |
| 5 | Persona profile | yes | app_state.active_persona | [I] |
| 6 | Project context (per manual_slop.toml) | yes | NEW (Candidate 14) | [I] |
| 7 | Knowledge digest (per knowledge/digest.md) | yes (within a gc cycle) | NEW (Candidate 8) | [I] |
| 8 | Discussion metadata (name, role count) | no (per turn) | disc_entries[:1] or disc_meta | ─── (data) |
| 9 | Active preset (FileItem set) | no (per turn) | self.context_files | ─── (data) |
| 10 | Per-file details (history, slices, notes) | no (per file) | per FileItem | ─── (data) |
| 11 | Tool-call results from prior turns | no (per turn) | per _reread_file_items | ─── (data) |
| 12 | The user message | no (per turn) | the input | ─── (data) |

The cache boundary is at layer 7/8. Layers 1-7 are byte-identical across turns of the same discussion (and across discussions of the same mode). Layers 8-12 change per turn.
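
The ordering rule can be sketched as a small builder. This is a minimal illustration under stated assumptions, not the aggregate.py implementation: Layer, build_context, sample_layers, and the sample layer texts are hypothetical names introduced here. The builder concatenates the layers in stack order, refuses a stable layer that sits below a volatile one, and returns the character offset of the cache boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    number: int   # position in the 12-layer model
    name: str
    stable: bool  # True for layers 1-7, False for layers 8-12
    text: str


def build_context(layers: list[Layer]) -> tuple[str, int]:
    """Concatenate layers in stack order; return (context, stable_prefix_length).

    Raises ValueError if a stable layer appears after a volatile one,
    because that would place cacheable text behind per-turn text.
    """
    seen_volatile = False
    prefix_len = 0
    parts = []
    for layer in sorted(layers, key=lambda item: item.number):
        if layer.stable and seen_volatile:
            raise ValueError(f"stable layer {layer.name!r} is below the cache boundary")
        if layer.stable:
            prefix_len += len(layer.text)
        else:
            seen_volatile = True
        parts.append(layer.text)
    return "".join(parts), prefix_len


def sample_layers(user_message: str) -> list[Layer]:
    # Hypothetical layer contents; only the user message varies per turn.
    return [
        Layer(1, "role instructions", True, "ROLE\n"),
        Layer(4, "system prompt preset", True, "SYSTEM\n"),
        Layer(7, "knowledge digest", True, "DIGEST\n"),
        Layer(8, "discussion metadata", False, "meta\n"),
        Layer(12, "user message", False, user_message),
    ]


turn1, n1 = build_context(sample_layers("first prompt"))
turn2, n2 = build_context(sample_layers("second prompt"))
print(n1 == n2, turn1[:n1] == turn2[:n2])
```

Keeping the stable flag on the layer record, rather than inferring it from position, lets the builder reject a misplaced layer at construction time instead of waiting for a cache miss.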


2. The byte-comparison test (the design contract)

The design rule "stable prefix is byte-identical" must be testable. The test:

# In tests/test_aggregate_caching.py (NEW)
def test_aggregate_stable_to_volatile_ordering():
    """The first N characters of the context should be identical across turns
    of the same conversation, when no stable-layer inputs change."""
    ctrl = mock_app_controller()
    ctrl.ai_settings.system_prompt = "Test system prompt"
    ctrl.active_persona = mock_persona()

    # Turn 1
    turn1 = aggregate.build_initial_context(ctrl, user_message="first prompt")

    # Turn 2 (same stable inputs, different user message)
    turn2 = aggregate.build_initial_context(ctrl, user_message="second prompt")

    # The first N characters should be identical (N = where the volatile layers start)
    N = aggregate.stable_prefix_length(ctrl)
    assert turn1[:N] == turn2[:N], f"Stable prefix mismatch: {turn1[:N]!r} != {turn2[:N]!r}"

The test is the contract. If a new layer is added in the middle of the stack, this test fails; the agent must either move the layer to the stable position or update the test (with written justification).

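The implementation. aggregate.stable_prefix_length(ctrl) returns the character offset where layer 8 starts. Because the length of each stable layer depends on its content (the system prompt preset, the persona, the digest), the offset cannot be hard-coded. The simplest implementation keeps one end-offset field per stable layer in aggregate.py, fills the fields at runtime while the context is built, and gains a field whenever the layer stack changes: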

class AggregateStack:
    ROLE_INSTRUCTIONS_END = 0          # placeholder; computed at runtime
    SCHEMA_END = 0
    TOOLS_END = 0
    SYSTEM_PROMPT_END = 0
    PERSONA_END = 0
    PROJECT_CONTEXT_END = 0
    KNOWLEDGE_DIGEST_END = 0
    INSTANCE_START = 0                 # the cache boundary

The test failure modes:

| Failure | Why it fails | Fix |
|---|---|---|
| A new stable layer was added in the wrong position | The first N characters differ because the new layer is below the boundary | Move the new layer above the boundary (between layers 7 and 8) |
| A stable layer was moved to the volatile position | The first N characters differ because the stable layer is now in the volatile part | Move the layer back to the stable position |
| A volatile input leaked into a stable layer (e.g., a timestamp in the system prompt) | The first N characters differ because the volatile input is in the prefix | Strip the volatile input from the stable layer; pass it as a separate volatile argument |
| The system prompt has a now() call | The first N characters differ across calls | Pass now() as a separate argument; don't include in the system prompt |
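
When the byte-comparison test fails, it helps to know which layer diverged. The sketch below is one possible diagnostic, not part of the test contract: first_mismatch and layer_at are hypothetical helpers, and the layer texts are made up to reproduce the "timestamp in the system prompt" failure from the table.

```python
import bisect
from typing import Optional


def first_mismatch(a: str, b: str) -> Optional[int]:
    """Return the first offset where a and b differ, or None if they are equal."""
    for index, (char_a, char_b) in enumerate(zip(a, b)):
        if char_a != char_b:
            return index
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def layer_at(offset: int, layer_ends: list[tuple[str, int]]) -> str:
    """Map a character offset to the layer that contains it.

    layer_ends lists (layer_name, end_offset) in stack order; end offsets
    are exclusive and ascending.
    """
    ends = [end for _, end in layer_ends]
    index = bisect.bisect_right(ends, offset)
    if index >= len(layer_ends):
        return "<past the last layer>"
    return layer_ends[index][0]


# Hypothetical stable prefixes from two turns; the system prompt embeds a clock value.
prefix_turn1 = "ROLE\n" + "SYSTEM at 13:50\n" + "DIGEST\n"
prefix_turn2 = "ROLE\n" + "SYSTEM at 13:51\n" + "DIGEST\n"
layer_ends = [
    ("role instructions", 5),
    ("system prompt preset", 21),
    ("knowledge digest", 28),
]

offset = first_mismatch(prefix_turn1, prefix_turn2)
culprit = layer_at(offset, layer_ends)
print(offset, culprit)
```

Reporting the layer name in the assertion message is usually more useful than dumping both prefixes, which can run to thousands of characters.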

3. The provider-specific cache_control (the implementation)

3.1 Anthropic (5-minute ephemeral, 4 breakpoints max)

# In src/ai_client.py:_send_anthropic
def _send_anthropic(message, *, model, cache_boundaries=None):
    if cache_boundaries:
        # Split the message into content blocks; mark each prefix block with cache_control
        content = cache_prefix_blocks(message, cache_boundaries)
    else:
        content = message

    response = anthropic_client.messages.create(
        model=model,
        max_tokens=8192,
        messages=[{"role": "user", "content": content}],
    )
    # response.content is a list of content blocks; join the text blocks
    text = "".join(block.text for block in response.content if block.type == "text")
    return _result_with_usage(text, response.usage, message)

The cache_prefix_blocks helper (mirrors nagent's bin/helpers/nagent_llm.py:cache_prefix_blocks):

def cache_prefix_blocks(message: str, cache_boundaries: list[int]) -> str | list[dict]:
    """Split the message into content blocks at the given char offsets.
    Mark each prefix block with cache_control. Returns the plain string
    when no valid boundary exists. At most 3 prefix blocks (provider limit
    is 4 breakpoints per request)."""
    if not cache_boundaries:
        return message
    points = sorted({b for b in cache_boundaries if 0 < b < len(message)})[:3]
    if not points:
        return message
    blocks = []
    start = 0
    for point in points:
        blocks.append({
            "type": "text",
            "text": message[start:point],
            "cache_control": {"type": "ephemeral"},
        })
        start = point
    blocks.append({"type": "text", "text": message[start:]})
    return blocks

The Anthropic usage accounting (per nagent_llm.py:_result_with_usage):

def _result_with_usage(text, usage, input_text=None):
    input_tokens = _usage_value(usage, "input_tokens", "prompt_tokens", "prompt_token_count")
    # Anthropic reports cached prompt tokens separately; fold them back
    # so input_tokens stays "tokens sent" across providers.
    input_tokens += _usage_value(usage, "cache_read_input_tokens")
    input_tokens += _usage_value(usage, "cache_creation_input_tokens")
    output_tokens = _usage_value(usage, "output_tokens", "completion_tokens", ...)
    # ... etc

The 4-breakpoint limit. Anthropic allows at most 4 cache_control markers per request. nagent caps at 3 prefix blocks (one breakpoint per prefix). Manual Slop does the same: 3 prefix blocks, 1 volatile suffix.
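
One detail of the cap is worth spelling out. cache_prefix_blocks above keeps the three smallest offsets when more than three are supplied, so the largest offset (typically the stable/volatile boundary itself) is the one dropped. The sketch below shows an alternative selection that keeps the last offsets instead, plus a check that a block list stays within the limit. Both pick_boundaries and count_breakpoints are hypothetical helpers, and whether "keep the last three" suits Manual Slop is an assumption to verify against real cache hit rates.

```python
def pick_boundaries(boundaries: list[int], message_len: int, max_prefix_blocks: int = 3) -> list[int]:
    """Return at most max_prefix_blocks valid offsets, preferring the largest ones."""
    if max_prefix_blocks <= 0:
        return []
    # Drop duplicates and offsets that would produce an empty block.
    valid = sorted({b for b in boundaries if 0 < b < message_len})
    return valid[-max_prefix_blocks:]


def count_breakpoints(content) -> int:
    """Count cache_control markers in a content value (plain string or block list)."""
    if isinstance(content, str):
        return 0
    return sum(1 for block in content if "cache_control" in block)


chosen = pick_boundaries([10, 20, 30, 40], message_len=100)
print(chosen)
```

A caller would pass the chosen offsets to cache_prefix_blocks and could assert count_breakpoints(blocks) <= 4 before sending the request.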

3.2 Gemini (1-hour explicit cache, configurable TTL)

# In src/ai_client.py:_send_gemini
# model, stable_prefix_messages (layers 1-7), and volatile_messages (layers 8-12)
# come from the surrounding scope; they are elided in this sketch.
def _send_gemini(messages, *, cache_ttl_seconds=3600):
    if cache_ttl_seconds > 0:
        # Create a cachedContent resource for the stable prefix.
        # A real implementation reuses the resource across turns until it expires.
        cached_content = genai_client.caches.create(
            model=model,
            config=genai.types.CreateCachedContentConfig(
                contents=stable_prefix_messages,    # layers 1-7
                ttl=f"{cache_ttl_seconds}s",
            ),
        )
        # Reference the cached content in the request
        response = genai_client.models.generate_content(
            model=model,
            contents=volatile_messages,         # layers 8-12
            config=genai.types.GenerateContentConfig(cached_content=cached_content.name),
        )
    else:
        response = genai_client.models.generate_content(model=model, contents=messages)
    return _result_with_usage(response.text, response.usage_metadata, messages)

The default TTL is 1 hour. Configurable per the GUI (per §5 below).
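
An explicit cache only pays off if the cachedContent resource is reused until it expires. A minimal sketch of that bookkeeping follows, assuming an in-memory registry keyed by a hash of the stable prefix. PrefixCacheRegistry and the create_cache callable are hypothetical; in Manual Slop the callable would wrap the provider call shown above, and here it is replaced by a fake so the sketch runs without network access.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


class PrefixCacheRegistry:
    """Remember one provider-side cache handle per stable prefix until its TTL passes."""

    def __init__(self, create_cache: Callable[[str, int], str], ttl_seconds: int = 3600):
        self._create_cache = create_cache  # (stable_prefix, ttl_seconds) -> resource name
        self._ttl = timedelta(seconds=ttl_seconds)
        self._entries: dict[str, tuple[str, datetime]] = {}

    def get(self, stable_prefix: str, now: Optional[datetime] = None) -> str:
        now = now or datetime.now(timezone.utc)
        key = hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # still valid: reuse the existing resource
        name = self._create_cache(stable_prefix, int(self._ttl.total_seconds()))
        self._entries[key] = (name, now + self._ttl)
        return name


created = []


def fake_create_cache(stable_prefix: str, ttl_seconds: int) -> str:
    # Stand-in for the provider call; the returned names are made up.
    created.append(ttl_seconds)
    return f"fake-cache-{len(created)}"


registry = PrefixCacheRegistry(fake_create_cache, ttl_seconds=3600)
start = datetime(2026, 6, 12, 14, 0, tzinfo=timezone.utc)
first = registry.get("STABLE PREFIX", now=start)
second = registry.get("STABLE PREFIX", now=start + timedelta(minutes=30))
third = registry.get("STABLE PREFIX", now=start + timedelta(hours=2))
print(first, second, third)
```

Keying on the prefix hash means a changed digest or system prompt produces a new resource automatically, while an unchanged prefix is reused across turns and discussions.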

3.3 OpenAI (5-10 min implicit, provider-managed)

OpenAI's caching is implicit: the provider automatically caches the prefix and reuses it across requests with the same prefix. No application-side control.

# In src/ai_client.py:_send_openai
def _send_openai(messages, *, model="gpt-5.5"):
    # No application-side cache_control; the provider handles caching
    response = openai_client.responses.create(model=model, input=messages)
    return _result_with_usage(response.output_text, response.usage, messages)

The TTL is provider-managed (5-10 min). The GUI just shows "Cached by OpenAI; TTL: provider-managed."
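
Even without application-side control, the cached share of a request can still be read back from the usage object for the hit-rate display. The helper below is a hedged sketch: openai_cached_tokens is a hypothetical name, and the field names (input_tokens_details or prompt_tokens_details, each carrying cached_tokens) reflect my understanding of the OpenAI SDK's usage objects and should be checked against the installed version. SimpleNamespace objects stand in for real responses.

```python
from types import SimpleNamespace


def openai_cached_tokens(usage) -> int:
    """Return the cached-token count from a usage object, or 0 when it is absent."""
    details = getattr(usage, "input_tokens_details", None) or getattr(
        usage, "prompt_tokens_details", None
    )
    return int(getattr(details, "cached_tokens", 0) or 0)


# Stand-ins for usage objects; the numbers are illustrative only.
responses_style = SimpleNamespace(
    input_tokens=560, input_tokens_details=SimpleNamespace(cached_tokens=200)
)
chat_style = SimpleNamespace(
    prompt_tokens=10, prompt_tokens_details=SimpleNamespace(cached_tokens=4)
)
no_details = SimpleNamespace(input_tokens=5)

print(openai_cached_tokens(responses_style), openai_cached_tokens(no_details))
```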

3.4 The provider table (the summary)

| Provider | Cache type | Default TTL | Configurable? | GUI exposure? |
|---|---|---|---|---|
| Anthropic | ephemeral | 5 min | yes (via prompt cache breakpoints) | yes (per-discussion state) |
| Google (Gemini) | explicit | 1 h | yes (via ttl field) | yes (TTL override) |
| OpenAI | implicit (auto) | 5-10 min (provider-managed) | no | no (just shows "cached") |

4. The codepath (the end-to-end flow)

[Q:ai_client.send() is called]
   │
   ▼
[I:aggregate.build_initial_context(ctrl, user_message) -> str]
   │
   ├──► [I:layer 1-7: build stable prefix (the cache-friendly part)]
   │
   ├──► [I:layer 8-12: build volatile suffix (the per-turn part)]
   │
   ├──► [I:concatenate stable + volatile = full context]
   │
   ├──► [I:stable_prefix_length(ctrl) -> N]    (the cache boundary)
   │
   ▼
[Q:cache boundary N > 0?]
   │
   ├── no ──► [I:pass full context to provider; no caching]
   │
   ▼
[Q:provider is Anthropic?]
   │
   ├── yes ──► [I:cache_prefix_blocks(full_context, [N]) -> content_blocks]
   │            [I:anthropic.messages.create(content=content_blocks)]
   │
   ▼
[Q:provider is Gemini?]
   │
   ├── yes ──► [I:create cachedContent resource for stable prefix]
   │            [I:genai.models.generate_content(cached_content=..., contents=volatile)]
   │
   ▼
[Q:provider is OpenAI?]
   │
   ├── yes ──► [I:openai.responses.create(input=full_context)]    (provider handles caching)
   │
   ▼
[I:return LlmResult(text, input_tokens, output_tokens)]
   │
   ▼
[I:return to caller]    (the ordering is guarded by tests/test_aggregate_caching.py:test_aggregate_stable_to_volatile_ordering)
   │
   ▼
[T:end]
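
The branching in the flow can be sketched as a pure function that describes the provider call without performing it. This is an illustration only: plan_request and the shape of the returned dict are hypothetical, and the provider names are plain strings chosen for the example.

```python
def plan_request(provider: str, full_context: str, boundary: int) -> dict:
    """Describe how full_context would be sent to provider, given the cache boundary."""
    if not 0 < boundary < len(full_context):
        # No usable cache boundary: send the full context uncached.
        return {"provider": provider, "mode": "uncached", "input": full_context}

    stable, volatile = full_context[:boundary], full_context[boundary:]
    if provider == "anthropic":
        return {
            "provider": provider,
            "mode": "ephemeral",
            "content": [
                {"type": "text", "text": stable, "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": volatile},
            ],
        }
    if provider == "gemini":
        # The stable prefix goes into a cachedContent resource; only the suffix is sent.
        return {"provider": provider, "mode": "explicit", "cached_prefix": stable, "contents": volatile}
    if provider == "openai":
        # Implicit caching: send everything and let the provider match the prefix.
        return {"provider": provider, "mode": "implicit", "input": full_context}
    raise ValueError(f"unknown provider: {provider!r}")


plan = plan_request("anthropic", "STABLEvolatile", 6)
print(plan["mode"], plan["content"][0]["text"])
```

Separating the plan from the network call keeps the branch logic testable without provider credentials.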

5. The GUI exposure (per-provider cache state)

The "Caching" Operations Hub sub-panel (per the v2.3 §5.3 sketch):

+------------------------------------------------------+
| Caching                                              |
+------------------------------------------------------+
| Provider summaries                                   |
| [Anthropic]   in:340 cache:80  hit:23%  ttl:4:32     |
| [Gemini]      in:120 cache:0   hit:0%   ttl:0:00     |
| [OpenAI]      in:560 cache:200 hit:35%  ttl:n/a      |
+------------------------------------------------------+
| Active discussions                                   |
| Discussion "refactor auth"                           |
|   cached: yes (Anthropic)                            |
|   expires: 2026-06-12T15:32 (in 4:32)                |
|   [Invalidate cache] [Disable caching for this]      |
| Discussion "fix the parser"                          |
|   cached: no                                         |
|   [Enable caching for this]                          |
+------------------------------------------------------+
| Global settings                                      |
|   [X] Enable Anthropic ephemeral caching             |
|   [X] Enable Gemini explicit caching                 |
|   [ ] Allow >1h Gemini caches (charges may apply)    |
|   Anthropic default TTL: [5 min v]                   |
|   Gemini default TTL:    [60 min v]                  |
+------------------------------------------------------+

The data sources:

| Widget | Data source | Frequency |
|---|---|---|
| in:N cache:N hit:N% | ai_client.get_token_stats() (already exported) | per turn (or per session) |
| ttl:4:32 | ai_client._send_<provider> usage metadata + the cache expiry timestamp | per turn |
| cached: yes/no | per-discussion flag (NEW; tracks which discussions have active caches) | per discussion |
| [Invalidate cache] | calls ai_client._invalidate_cache(discussion_id) (NEW) | on click |
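
The two derived widget values can be computed as below. This is a sketch with hypothetical helper names (hit_percent, format_ttl). The choice to round the hit rate down is an assumption inferred from the sketch above, where 80 cached of 340 input tokens is shown as 23%.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def hit_percent(input_tokens: int, cached_tokens: int) -> int:
    """Cached share of the input tokens as a whole percentage, rounded down."""
    if input_tokens <= 0:
        return 0
    return 100 * cached_tokens // input_tokens


def format_ttl(expires_at: Optional[datetime], now: datetime) -> str:
    """Render the remaining cache lifetime as m:ss; 'n/a' when the TTL is provider-managed."""
    if expires_at is None:
        return "n/a"
    remaining = max(0, int((expires_at - now).total_seconds()))
    minutes, seconds = divmod(remaining, 60)
    return f"{minutes}:{seconds:02d}"


now = datetime(2026, 6, 12, 15, 27, 28, tzinfo=timezone.utc)
print(hit_percent(340, 80), format_ttl(now + timedelta(seconds=272), now))
```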

The new AI client state:

# In src/ai_client.py (NEW)
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DiscussionCacheState:
    discussion_id: str
    provider: str
    cached_at: datetime
    expires_at: Optional[datetime]  # None for OpenAI implicit
    hit_count: int = 0
    tokens_cached: int = 0
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True   # user can disable per-discussion

# In AppController (NEW)
self.discussion_caches: dict[str, DiscussionCacheState] = {}  # keyed by discussion_id

The Hook API additions:

GET  /api/cache                        # list all discussion cache states
GET  /api/cache/<discussion_id>        # get one
POST /api/cache/<discussion_id>/invalidate
POST /api/cache/<discussion_id>/disable
POST /api/cache/<discussion_id>/enable
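
The five routes map onto a small amount of state handling. The sketch below shows one way to dispatch them as a pure function, independent of any web framework. handle_cache_request is a hypothetical name, CacheState is a reduced stand-in for DiscussionCacheState, and the status codes and response bodies are assumptions, not a description of the existing Hook API.

```python
import re
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CacheState:
    # Reduced stand-in for DiscussionCacheState (hypothetical subset of fields).
    discussion_id: str
    provider: str
    caching_enabled: bool = True
    last_invalidated_at: Optional[str] = None


_ROUTE = re.compile(
    r"^/api/cache(?:/(?P<id>[^/]+)(?:/(?P<action>invalidate|disable|enable))?)?$"
)


def handle_cache_request(method: str, path: str, caches: dict, now: Optional[datetime] = None):
    """Return (status, body) for one Hook API cache request."""
    match = _ROUTE.match(path)
    if match is None:
        return 404, {"error": "not found"}
    discussion_id, action = match.group("id"), match.group("action")
    if discussion_id is None:
        if method != "GET":
            return 405, {"error": "method not allowed"}
        return 200, {"caches": [asdict(state) for state in caches.values()]}
    state = caches.get(discussion_id)
    if state is None:
        return 404, {"error": "unknown discussion"}
    if action is None:
        if method != "GET":
            return 405, {"error": "method not allowed"}
        return 200, asdict(state)
    if method != "POST":
        return 405, {"error": "method not allowed"}
    if action == "invalidate":
        state.last_invalidated_at = (now or datetime.now(timezone.utc)).isoformat()
    else:
        state.caching_enabled = action == "enable"
    return 200, asdict(state)


caches = {"d1": CacheState("d1", "anthropic")}
status, body = handle_cache_request("POST", "/api/cache/d1/disable", caches)
print(status, body["caching_enabled"])
```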

6. The interaction with the 4 memory dimensions (where the cache hits)

| Dim | Where injected | Stable? | Cache impact |
|---|---|---|---|
| Curation | layer 9 (active preset) | no (per turn) | NOT cached; the user might switch presets |
| Discussion | layer 8 (metadata) + layer 11 (prior turns) | no (per turn) | NOT cached; layer 8 is the first layer after the cache boundary |
| RAG | the {rag-context} block, appended within layers 8-12 | no (per query) | NOT cached; RAG is volatile per query |
| Knowledge | layer 7 (digest) + per-file (file-knowledge) | yes (within a gc cycle) | CACHED; the digest is part of the stable prefix |

The cache only hits on the stable prefix (layers 1-7). The volatile suffix (layers 8-12) is not cached; the user expects the conversation to change per turn.

The interaction with knowledge harvest: when nagent-gc (or the Manual Slop equivalent) regenerates the digest, layer 7 changes, so the cache is invalidated for the next turn. The user can also force invalidation manually (the [Invalidate cache] button).

The interaction with file edits: when the user edits a file in the Structural File Editor, the file-knowledge for that file is updated. Because file-knowledge sits in the stable prefix, the change invalidates the cache for the next turn that has the file in scope.
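
Both invalidation triggers can be detected with a fingerprint of the knowledge portion of the prefix. The sketch below is an assumption about how this could be done, not the existing implementation: prefix_fingerprint and needs_invalidation are hypothetical helpers. File-knowledge entries are hashed in sorted path order so that dictionary insertion order cannot change the bytes.

```python
import hashlib
from typing import Optional


def prefix_fingerprint(digest: str, file_knowledge: dict[str, str]) -> str:
    """Hash the knowledge digest and the per-file knowledge in a deterministic order."""
    hasher = hashlib.sha256()
    hasher.update(digest.encode("utf-8"))
    for path in sorted(file_knowledge):
        # NUL separators keep "a" + "bc" distinct from "ab" + "c".
        hasher.update(b"\x00" + path.encode("utf-8") + b"\x00")
        hasher.update(file_knowledge[path].encode("utf-8"))
    return hasher.hexdigest()


def needs_invalidation(previous: Optional[str], current: str) -> bool:
    """True when a cache exists (previous is set) and the prefix content has changed."""
    return previous is not None and previous != current


# Hypothetical paths and notes, for illustration only.
before = prefix_fingerprint("digest v1", {"src/a.py": "notes A", "src/b.py": "notes B"})
reordered = prefix_fingerprint("digest v1", {"src/b.py": "notes B", "src/a.py": "notes A"})
edited = prefix_fingerprint("digest v1", {"src/a.py": "notes A (edited)", "src/b.py": "notes B"})
regenerated = prefix_fingerprint("digest v2", {"src/a.py": "notes A", "src/b.py": "notes B"})
print(needs_invalidation(before, reordered), needs_invalidation(before, edited))
```

Storing the fingerprint alongside each DiscussionCacheState would let the client decide per turn whether the cached prefix is still valid, without comparing the full prefix text.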


7. The cross-references

  • conductor/code_styleguides/data_oriented_design.md §3.2, §3.3, §3.4 — the data-oriented foundation
  • conductor/code_styleguides/agent_memory_dimensions.md — the 4 dims (where the cache hits)
  • conductor/code_styleguides/knowledge_artifacts.md — the knowledge digest (the layer 7 cached content)
  • docs/guide_caching_strategy.md — the user-facing deep-dive
  • src/aggregate.py:run — the consumer of this styleguide
  • src/ai_client.py:_send_<provider> — the producer
  • conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.2, §5 — the nagent pattern that informed this styleguide