ed 434b6d0d54 docs: reduce redundant content across files; map references to canonical sources

Per user 'a bunch of docs just committed had redundant content across
files. Can we do a reduction of that and instead map references to
other files?'

This commit reduces content duplication across 9 files. The
canonical sources are kept as detailed references; the other
files now point to them.

Reductions (table replaced with 'see canonical' reference):

1. data_oriented_design.md §9: the 4-dim memory table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

2. guide_agent_memory_dimensions.md §0: the 4-dim memory table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

3. guide_caching_strategy.md §1: the 12-layer model
   (canonical: conductor/code_styleguides/cache_friendly_context.md §1)

4. guide_ai_client.md 'Cache strategy' section: the 12-layer model recap
   (canonical: conductor/code_styleguides/cache_friendly_context.md §1)

5. guide_knowledge_curation.md §1: the 5 category file details
   (canonical: conductor/code_styleguides/knowledge_artifacts.md §1)

6. product-guidelines.md 'Memory Dimensions' section: the 4-dim table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

7. guide_mma.md '4 memory dimensions' section: the MMA scope table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

8. docs/AGENTS.md §0 + §5-§8: 4-dim table + caching/knowledge/RAG/
   feature flag tables (canonical: the per-topic styleguides in
   conductor/code_styleguides/)

9. AGENTS.md 'Code Styleguides' section: the 6-styleguide list
   (canonical: docs/AGENTS.md §2)

The principle: each piece of content has ONE source of truth; other
places point to it. The data-oriented way. Files retain their
narrative flow and the 'what this is' intros, but the detailed
tables are now in their canonical home.

Net effect: -2100 bytes across 9 files (without losing any
information - the canonical sources are unchanged). The
'cross-references' sections are kept; the duplicated content
is removed.

2026-06-12 14:10:30 -04:00


`src/ai_client.py` — Multi-Provider LLM Abstraction


Overview

src/ai_client.py (~116KB) is the unified LLM client for 8 providers. It abstracts the differences between providers (Gemini, Anthropic, DeepSeek, MiniMax, Gemini CLI, Qwen, Grok, Llama) behind a single send() function.

The module is a stateful singleton — all provider state is held in module-level globals. There is no class wrapping; the module itself is the abstraction layer.

The 8 providers split into 3 API shapes:

Native SDK: Gemini (google-genai), Anthropic (anthropic), Qwen (DashScope)
OpenAI-compatible: MiniMax, Grok, Llama (Ollama/OpenRouter/custom), DeepSeek
Subprocess: Gemini CLI

The OpenAI-compatible vendors all call the shared helper in src/openai_compatible.py (added 2026-06-06 by the qwen_llama_grok_integration_20260606 track; see "Shared OpenAI-Compatible Helper" section below). The MiniMax provider's _send_minimax was refactored to use this helper (Phase 4 of the same track, 231 → 75 lines, 68% reduction).

Module-Level Imports

Important: The 5 provider SDKs are NOT imported at module level. import google.genai, import anthropic, import openai, and import fastapi are heavy (~430-955ms each on cold load) and are now obtained via src.module_loader._require_warmed("google.genai") and similar calls, after the WarmupManager has loaded them in the background. The module-level globals you see in the State section (_gemini_client, _anthropic_client, etc.) are typed as Optional because they're populated by _require_warmed() on first use, not at import time.

This change was part of the 2026-06-06 startup_speedup_20260606 track. Before: import src.ai_client took ~1800ms. After: ~161ms. The remaining cost is the bare module skeleton.
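The deferred-import idea can be illustrated with a minimal sketch. This is not the project's `src.module_loader`; the names `warm_in_background` and `require_warmed` are hypothetical, and a stdlib module stands in for a heavy SDK. It only shows the general shape: a background thread imports modules, and callers fetch them on first use.

```python
import importlib
import threading
from types import ModuleType

# Hypothetical stand-in for a warmup registry; names are illustrative only.
_warmed: dict[str, ModuleType] = {}
_warm_lock = threading.Lock()


def warm_in_background(names: list[str]) -> threading.Thread:
    """Import the given modules on a daemon thread so startup is not blocked."""
    def _worker() -> None:
        for name in names:
            module = importlib.import_module(name)
            with _warm_lock:
                _warmed[name] = module

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread


def require_warmed(name: str) -> ModuleType:
    """Return the warmed module, importing on demand if warmup has not reached it."""
    with _warm_lock:
        module = _warmed.get(name)
    if module is None:
        module = importlib.import_module(name)  # cold path: pay the import cost now
        with _warm_lock:
            _warmed[name] = module
    return module


# Usage, with the stdlib json module standing in for a heavy provider SDK.
warm_in_background(["json"]).join()
json_mod = require_warmed("json")
```

The design choice this illustrates is that the importing module stays cheap to load, while the cost of each heavy dependency is paid either in the background or at the first call that needs it.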

Architecture

┌─────────────────────────────────────────────────┐
│ ai_client.send(md_content, user_message, ...)    │
│                                                 │
│ 1. _send_lock.acquire() — serialize all calls   │
│ 2. Read _provider / _model                       │
│ 3. Route to provider-specific _send_<provider>() │
│ 4. Return str response                           │
└─────────────────┬───────────────────────────────┘
                  │ dispatches based on _provider
                  ▼
   ┌────────┬─────────┬────────┬──────────┐
   ▼        ▼         ▼        ▼          ▼
_gemini  _anthropic _deepseek _minimax  _gemini_cli
                                              (subprocess)

State

All state is module-level globals. The most important:

| Variable | Type / default | Purpose |
|---|---|---|
| `_provider: str` | `"gemini" \| "anthropic" \| "deepseek" \| "minimax" \| "gemini_cli"` | Active provider |
| `_model: str` | `str` | Active model name |
| `_temperature: float` | `0.0` | Sampling temperature |
| `_top_p: float` | `1.0` | Nucleus sampling |
| `_max_tokens: int` | `8192` | Output token cap |
| `_history_trunc_limit: int` | `8000` | Char limit for truncating old tool outputs |
| `_send_lock` | `threading.Lock` | Serializes all send() calls |
| `_current_palette: str` | theme | Last-applied theme palette |

Per-Provider State

_gemini_client: Optional[genai.Client] = None
_gemini_chat: Any = None
_gemini_cache: Any = None
_gemini_cache_md_hash: Optional[str] = None
_gemini_cache_created_at: Optional[float] = None
_gemini_cached_file_paths: list[str] = []

_anthropic_client: Optional[anthropic.Anthropic] = None
_anthropic_history: list[dict] = []
_anthropic_history_lock: threading.Lock = threading.Lock()

_deepseek_client: Any = None
_deepseek_history: list[dict] = []
_deepseek_history_lock: threading.Lock = threading.Lock()

_minimax_client: Any = None
_minimax_history: list[dict] = []
_minimax_history_lock: threading.Lock = threading.Lock()

_gemini_cli_adapter: Optional[GeminiCliAdapter] = None

The Public API

`send(...)` — The Main Entry Point

def send(
    md_content: str,
    user_message: str,
    base_dir: str = ".",
    file_items: list[dict] | None = None,
    discussion_history: str = "",
    stream: bool = False,
    pre_tool_callback: Optional[Callable] = None,
    qa_callback: Optional[Callable] = None,
    enable_tools: bool = True,
    stream_callback: Optional[Callable] = None,
    patch_callback: Optional[Callable] = None,
    rag_engine: Optional[Any] = None,
) -> str:

Returns the model's response as a string. All provider calls go through here.

Parameters:

md_content — the system prompt + context (markdown)
user_message — the user's message
base_dir — for MCP tool filesystem operations
file_items — files in the context (deprecated path; usually empty)
discussion_history — legacy parameter
stream / stream_callback — for streaming responses
pre_tool_callback — called before each tool execution (HITL gate)
qa_callback — called when an error occurs (Tier 4 integration)
enable_tools — whether to enable PowerShell + MCP tools
patch_callback — Tier 4 patch generation hook
rag_engine — optional RAG engine for context augmentation

Provider Switching

from src import ai_client
ai_client.set_provider("gemini", "gemini-3-flash-preview")
ai_client.set_provider("anthropic", "claude-3-5-sonnet-latest")
ai_client.set_provider("deepseek", "deepseek-chat")
ai_client.set_provider("minimax", "MiniMax-M2.5")
ai_client.set_provider("gemini_cli", "gemini-2.0-flash")

Parameter Setters

ai_client.set_model_params(temp=0.7, max_tok=4096, top_p=0.9, trunc_limit=4000)

Session Management

ai_client.reset_session()  # Clears all provider state, history, cache

Event Hooks

from src import ai_client

# Confirmation hook (called before destructive tool execution)
ai_client.confirm_and_run_callback = my_gui_callback

# Comms log hook (called on every API call)
ai_client.comms_log_callback = my_logging_callback

# Tool log hook (called on every tool completion)
ai_client.tool_log_callback = my_tool_logging_callback

# Event emitter (for any subscriber)
ai_client.events.on("my_event", my_handler)

Comms Log

ai_client._append_comms(direction, kind, payload)  # Add entry
ai_client.get_comms_log()  # Read all
ai_client.clear_comms_log()  # Clear
ai_client.get_token_stats(md_content)  # Estimate token usage

Provider Error Taxonomy — Legacy (Pre-Refactor)

As of 2026-06-11: This section describes the pre-refactor exception-based pattern. The ProviderError class is removed in the data_oriented_error_handling_20260606 track. See the new Data-Oriented Error Handling (Fleury Pattern) section below for the current convention.

class ProviderError(Exception):
    kind: str  # "quota" | "rate_limit" | "auth" | "balance" | "network" | "unknown"
    provider: str
    original: Exception

    def ui_message(self) -> str:
        """Returns a user-friendly error message."""

ProviderError was raised by provider-specific _send_* functions on failure. The caller (typically app_controller.py) caught it and surfaced the error to the user via app.ai_status. Post-refactor, the same flow uses ErrorInfo dataclasses inside Result[str] returns — see the new section below.

The Tool-Call Loop

All providers follow the same high-level pattern in _send_*:

def _send_<provider>(md_content, user_message, ...):
    for round_idx in range(MAX_TOOL_ROUNDS + 2):  # MAX_TOOL_ROUNDS is 10, so at most 12 iterations
        response = provider_api_call(md_content, user_message, history, tools)
        comms_log(direction="IN", kind="response", payload=response)

        if not has_function_calls(response):
            return extract_text(response)

        for call in response.function_calls:
            if pre_tool_callback and pre_tool_callback(...) is rejected:
                return rejection_message
            tool_result = dispatch(call.name, call.args, base_dir)
            append_tool_result_to_history(call, tool_result)

        # Context refresh: re-read all tracked files (mtime check)
        _reread_file_items(file_items)

        # Truncate tool outputs at _history_trunc_limit
        truncate_tool_outputs(history)

        # Cumulative byte check
        if cumulative_tool_bytes > 500_000:
            inject_warning()

    return final_response

The constants:

MAX_TOOL_ROUNDS: int = 10 — max tool-call iterations per send()
_MAX_TOOL_OUTPUT_BYTES: int = 500_000 — cumulative tool output budget
_ANTHROPIC_CHUNK_SIZE: int = 120_000 — chars per Anthropic system text block
_ANTHROPIC_MAX_PROMPT_TOKENS: int = 180_000 — Anthropic prompt limit (200K minus headroom)
_GEMINI_MAX_INPUT_TOKENS: int = 900_000 — Gemini 1M window minus headroom
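The loop pattern above can be exercised without a real provider. The following is a minimal sketch under stated assumptions: `FakeResponse`, `run_tool_loop`, and the scripted provider are hypothetical stand-ins, and the context refresh and byte-budget steps are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_TOOL_ROUNDS = 10  # value taken from the constants above


@dataclass
class FakeResponse:
    """Hypothetical provider response: text plus zero or more tool calls."""
    text: str = ""
    function_calls: list[tuple[str, dict]] = field(default_factory=list)


def run_tool_loop(
    call_provider: Callable[[list[dict]], FakeResponse],
    dispatch: Callable[[str, dict], str],
    history: list[dict],
    trunc_limit: int = 8000,
) -> str:
    """Call the provider until it answers without requesting tools."""
    for _ in range(MAX_TOOL_ROUNDS + 2):
        response = call_provider(history)
        if not response.function_calls:
            return response.text
        for name, args in response.function_calls:
            output = dispatch(name, args)
            # Truncate each tool output so old results cannot flood the history.
            history.append({"role": "tool", "name": name, "content": output[:trunc_limit]})
    return "(tool round limit reached)"


# Scripted provider: one tool request, then a plain-text answer.
script = iter([
    FakeResponse(function_calls=[("read_file", {"path": "a.txt"})]),
    FakeResponse(text="done"),
])
history: list[dict] = []
answer = run_tool_loop(lambda h: next(script), lambda n, a: f"{n}:{a['path']}", history)
```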

Provider-Specific Behaviors

Gemini (SDK)

Server-side cache: genai.CachedContent with TTL management
Cache rebuild at 90% TTL: proactive renewal
Cache hash: tracks content hash for invalidation
Cached file paths: tracks which files are in the active cache

Anthropic

Ephemeral prompt caching: 4 cache_control: ephemeral breakpoints
Breakpoints: system prompt, context chunks, tool def, conversation prefix
History trimming at 180K tokens: 2-phase (strip stale file refreshes, then drop turn pairs)
History repair: _repair_anthropic_history handles tool_result chain breaks

DeepSeek

Raw HTTP: uses requests.post directly (no SDK)
Streaming: supports streaming responses
History repair: _repair_deepseek_history for tool result chains

MiniMax

OpenAI-compatible endpoint: uses the openai SDK
History trimming: similar to Anthropic (drop turn pairs at threshold)
History repair: _repair_minimax_history

Gemini CLI

Subprocess adapter: GeminiCliAdapter in src/gemini_cli_adapter.py
Persistent session: CLI maintains its own session ID
JSONL output protocol: parses streaming JSONL from the CLI subprocess
Full feature parity: tool calls, streaming, usage metadata

History Trimming Strategies

Gemini (40% threshold)

if total_in > _GEMINI_MAX_INPUT_TOKENS * 0.4:
    while len(hist) > 4 and total_in > _GEMINI_MAX_INPUT_TOKENS * 0.3:
        hist.pop(0)  # Assistant
        hist.pop(0)  # User
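The excerpt above does not show how `total_in` is refreshed inside the loop. A self-contained sketch of the same 40%/30% hysteresis, assuming a crude character-based estimator (`estimate_tokens` is hypothetical, not the project's tokenizer), might look like this:

```python
GEMINI_MAX_INPUT_TOKENS = 900_000  # value of _GEMINI_MAX_INPUT_TOKENS above


def estimate_tokens(history: list[str]) -> int:
    """Crude stand-in estimator (assumption): roughly 4 characters per token."""
    return sum(len(turn) for turn in history) // 4


def trim_history(history: list[str], max_tokens: int = GEMINI_MAX_INPUT_TOKENS) -> list[str]:
    """Drop the oldest two turns at a time once usage passes 40%, down to 30%."""
    trimmed = list(history)
    total = estimate_tokens(trimmed)
    if total > max_tokens * 0.4:
        while len(trimmed) > 4 and total > max_tokens * 0.3:
            del trimmed[:2]                   # remove the oldest turn pair
            total = estimate_tokens(trimmed)  # re-estimate so the loop can stop early
    return trimmed
```

Starting the trim at 40% but stopping at 30% leaves headroom, so the next few turns do not immediately trigger another trim.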

Anthropic (180K limit)

_trim_anthropic_history(system_blocks, history) — two-phase:

Strip stale [SYSTEM: FILES UPDATED] blocks
Drop oldest turn pairs (preserving tool_result chains)

MiniMax

Same pattern as Anthropic (similar 180K limit).

DeepSeek

No built-in trimming (relies on the caller to keep history short).

Caching Strategies

Gemini Server-Side Cache

_gemini_cache_md_hash: Optional[str] = None  # Hash of cached content
_gemini_cache_created_at: Optional[float] = None  # Monotonic time

The cache decision is a 3-way branch on each _send_gemini call:

Hash changed: delete old, rebuild with new content
Cache age > 90% of TTL (3240s of 3600s): proactive renewal
No cache exists: create new if token count >= 2048, otherwise inline
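The branch described above can be written as a pure decision function. This is a sketch, not the body of `_send_gemini`; `CacheAction` and `decide_cache_action` are hypothetical names, and the `REUSE` outcome (cache present, hash unchanged, not near expiry) is an assumed fall-through case that the list implies but does not state.

```python
from enum import Enum
from typing import Optional

CACHE_TTL_SECONDS = 3600
RENEW_FRACTION = 0.9      # renew at 90% of TTL, i.e. after 3240s
MIN_CACHE_TOKENS = 2048


class CacheAction(Enum):
    REUSE = "reuse"       # assumed fall-through: existing cache is still valid
    REBUILD = "rebuild"   # hash changed: delete old cache, create a new one
    RENEW = "renew"       # cache is close to expiry
    CREATE = "create"     # no cache yet and the content is large enough
    INLINE = "inline"     # no cache yet and the content is too small to cache


def decide_cache_action(
    cached_hash: Optional[str],
    cached_at: Optional[float],
    new_hash: str,
    now: float,
    token_count: int,
) -> CacheAction:
    """Pick the cache action for one send, given the current cache bookkeeping."""
    if cached_hash is None or cached_at is None:
        return CacheAction.CREATE if token_count >= MIN_CACHE_TOKENS else CacheAction.INLINE
    if cached_hash != new_hash:
        return CacheAction.REBUILD
    if now - cached_at > CACHE_TTL_SECONDS * RENEW_FRACTION:
        return CacheAction.RENEW
    return CacheAction.REUSE
```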

Anthropic Cache (4-Breakpoint System)

[System prompt]─breakpoint 1
[Context chunks]─breakpoint 2
[Tool definitions]─breakpoint 3
[Last user message]─breakpoint 4

Before placing breakpoint 4, all existing cache_control is stripped to prevent exceeding the 4-breakpoint limit.
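A minimal sketch of the strip-then-place step follows. It assumes the stripping applies to the conversation messages and that each message's content is a list of block dicts; the source does not spell out either detail, and `place_final_breakpoint` is a hypothetical name.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}


def place_final_breakpoint(messages: list[dict]) -> list[dict]:
    """Strip cache_control from all message blocks, then mark the last user block."""
    result = copy.deepcopy(messages)  # leave the caller's history untouched
    for message in result:
        for block in message["content"]:
            block.pop("cache_control", None)
    for message in reversed(result):
        if message["role"] == "user" and message["content"]:
            message["content"][-1]["cache_control"] = dict(EPHEMERAL)
            break
    return result
```

Stripping first means the conversation never carries more than one marker, which keeps the request within the 4-breakpoint limit once the other three are counted.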

Context Refresh Mechanism

After the last tool call in each round, _reread_file_items(file_items) checks mtimes:

For each file item: compare Path.stat().st_mtime against stored mtime
If unchanged: pass through as-is
If changed: re-read content, store old_content for diffing, update mtime
Changed files are diffed via _build_file_diff_text:
- Files ≤ 200 lines: emit full content
- Files > 200 lines with old_content: emit difflib.unified_diff
Diff is appended to the last tool's output as [SYSTEM: FILES UPDATED]\n\n{diff}
Stale [FILES UPDATED] blocks are stripped from older history turns by _strip_stale_file_refreshes

This is the "agent always sees current code" mechanism.

Subagent Summarization

For Tier 4: when an error occurs, qa_callback may be invoked to get a Tier 4 AI summary of the traceback. The summary is injected back into the worker's history as a hint.

def run_tier4_analysis(stderr: str) -> str:
    """Stateless Tier 4 QA analysis of an error message."""
    # Uses a dedicated system prompt for error triage
    # Returns analysis text (root cause, suggested fix)
    # Does NOT modify any code — analysis only

For Tier 4 patch generation:

def run_tier4_patch_generation(error: str, file_context: str) -> str:
    """Generate a unified diff patch from an error and file context."""
    # Returns the patch as a string
    # The caller (typically the patch modal) presents it for human review

Public API Quick Reference

| Function | Purpose |
|---|---|
| `send(...)` | The main entry point; calls the active provider |
| `set_provider(provider, model)` | Switch active provider and model |
| `get_provider() -> str` | Get the active provider name |
| `set_model_params(temp, max_tok, trunc_limit, top_p)` | Update generation params |
| `set_custom_system_prompt(prompt)` | Set the per-session system prompt override |
| `set_base_system_prompt(prompt)` | Set the foundational base prompt (advanced) |
| `set_use_default_base_prompt(use: bool)` | Toggle whether the base prompt is included |
| `set_project_context_marker(marker)` | Set the project-specific context tag |
| `reset_session()` | Clear all provider state |
| `get_comms_log()` | Read the in-memory comms log |
| `clear_comms_log()` | Clear the in-memory comms log |
| `get_token_stats(md_content)` | Estimate token usage for the given content |
| `cleanup()` | Tear down (delete Gemini caches, etc.) |
| `get_current_palette() -> str` | Get the current theme palette name |
| `list_models(provider) -> list[str]` | List available models for a provider |
| `run_tier4_analysis(stderr) -> str` | Tier 4 error analysis |
| `run_tier4_patch_generation(error, file_context) -> str` | Tier 4 patch generation |
| `run_subagent_summarization(file_path, content, is_code, outline) -> str` | AI summary of a file |
| `run_discussion_compression(text) -> str` | AI compression of a long discussion |

Thread Safety

_send_lock: threading.Lock — serializes all provider calls. No two send() calls run concurrently.
Per-provider history locks (_anthropic_history_lock, etc.) — guard the history list mutations.
The EventEmitter (in src/events.py) is thread-safe for subscribe/emit.

Testing

Unit Tests (no real API calls)

def test_set_provider():
    from src import ai_client
    ai_client.set_provider("anthropic", "claude-3-5-sonnet-latest")
    assert ai_client.get_provider() == "anthropic"
    ai_client.reset_session()  # Cleanup

Mocked Tests

from unittest.mock import patch

from src import ai_client

def test_send_routes_to_provider():
    # Patch the provider-specific sender so no real API call is made.
    with patch.object(ai_client, "_send_anthropic", return_value="mocked") as m:
        ai_client.set_provider("anthropic", "claude-3-5-sonnet-latest")
        result = ai_client.send("system", "user")
        assert result == "mocked"
        m.assert_called_once()
    ai_client.reset_session()  # Cleanup

Integration (real API)

Gated by env var (e.g., RUN_REAL_AI_TESTS=1). Hits the real API. Not in default CI.

Data-Oriented Error Handling (Fleury Pattern)

The provider layer follows the "errors are just cases" framework (Ryan Fleury, The Easiest Way To Handle Errors). The canonical reference is conductor/code_styleguides/error_handling.md.

Result-Based Returns

All _send_<vendor>_result() functions (8 vendors: Gemini, Anthropic, DeepSeek, MiniMax, Gemini CLI, Qwen, Llama, Grok — plus the _send_llama_native Ollama adapter) return Result[str, ErrorInfo]. SDK exceptions are caught at the boundary (src/openai_compatible.py, src/qwen_adapter.py) and converted to ErrorInfo dataclasses. The _classify_<vendor>_error() functions return ErrorInfo (not raise ProviderError, which has been removed).

The 12 canonical ErrorKind values: NETWORK, AUTH, QUOTA, RATE_LIMIT, BALANCE, PERMISSION, NOT_FOUND, INVALID_INPUT, NOT_READY, UNKNOWN, CONFIG, INTERNAL. Each has exactly one meaning — do not overload UNKNOWN when a new failure mode surfaces (Lottes's anti-pattern). ErrorInfo.source is one of "ai_client.<vendor>" (e.g., "ai_client.gemini", "ai_client.anthropic") for diagnostic routing.
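To make the shape of these types concrete, here is a minimal sketch. It is not the project's `src/result_types.py`: the field names beyond `ok`, `data`, `errors`, `kind`, `source`, and `ui_message()` are assumptions, and `to_result` is a hypothetical boundary helper.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
E = TypeVar("E")


class ErrorKind(Enum):
    NETWORK = auto(); AUTH = auto(); QUOTA = auto(); RATE_LIMIT = auto()
    BALANCE = auto(); PERMISSION = auto(); NOT_FOUND = auto(); INVALID_INPUT = auto()
    NOT_READY = auto(); UNKNOWN = auto(); CONFIG = auto(); INTERNAL = auto()


@dataclass(frozen=True)
class ErrorInfo:
    kind: ErrorKind
    source: str    # e.g. "ai_client.gemini"
    message: str   # assumed field; the real dataclass may differ

    def ui_message(self) -> str:
        return f"[{self.source}] {self.kind.name}: {self.message}"


@dataclass
class Result(Generic[T, E]):
    data: T                                   # zero value ("" for str) on failure
    errors: list[E] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.errors


def to_result(call: Callable[[], str], vendor: str) -> "Result[str, ErrorInfo]":
    """Catch exceptions at the boundary and return them as data."""
    source = f"ai_client.{vendor}"
    try:
        return Result(data=call())
    except TimeoutError as exc:
        return Result(data="", errors=[ErrorInfo(ErrorKind.NETWORK, source, str(exc))])
    except Exception as exc:  # genuinely unclassified failures only
        return Result(data="", errors=[ErrorInfo(ErrorKind.UNKNOWN, source, str(exc))])
```

Because `data` is always a valid string, callers can use it unconditionally and treat the error list as an additional case to handle, not as a separate control-flow path.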

Public API

ai_client.send_result(...) — the new public API. Returns Result[str, ErrorInfo]. Mirrors the send() signature (13+ parameters including 8 callbacks). Internally calls _send_<vendor>_result() for the active provider.
ai_client.send(...) — deprecated. Emits DeprecationWarning at runtime (via typing_extensions.deprecated; cached per call site to avoid log spam). Returns str (the response text) for backward compat. Errors are logged to the comms log via the deprecated path's comms entry but not returned. Will be removed in the public_api_migration_20260606 follow-up track.

Example

import logging

from src import ai_client
from src.result_types import ErrorKind

log = logging.getLogger(__name__)

r = ai_client.send_result("system prompt", "user message")
if not r.ok:
    for err in r.errors:
        log.error(err.ui_message())
        # err.kind is one of ErrorKind.*; err.source is "ai_client.<vendor>"
# use r.data regardless (it's the zero-initialized "" on failure)
print(r.data)

Migration Notes for Existing Callers

The app_controller._api_generate path and the MMA worker dispatch (multi_agent_conductor.py:591) call ai_client.send(). They will continue to work during the deprecation window; migration to send_result() is the work of the public_api_migration_20260606 follow-up track.
Tests that mock ai_client._send_<vendor> should be updated to mock _send_<vendor>_result() (or send_result() at the public API level).
tests/conftest.py adds a filterwarnings entry to silence the DeprecationWarning from send() during the transition; new tests for the new API should assert the warning is not emitted by send_result().

Shared OpenAI-Compatible Helper (`src/openai_compatible.py`)

Added 2026-06-06 by the qwen_llama_grok_integration_20260606 track. Operates on a normalized request/response data structure so 4 OpenAI-compatible vendors (MiniMax, Grok, Llama, DeepSeek) can share the same request building, response parsing, streaming aggregation, tool call detection, and error classification logic.

Data Structures

@dataclass(frozen=True)
class NormalizedResponse:
    text: str
    tool_calls: list[dict[str, Any]]
    usage_input_tokens: int
    usage_output_tokens: int
    usage_cache_read_tokens: int
    usage_cache_creation_tokens: int
    raw_response: Any

@dataclass
class OpenAICompatibleRequest:
    messages: list[dict[str, Any]]
    model: str
    temperature: float = 0.0
    top_p: float = 1.0
    max_tokens: int = 8192
    tools: Optional[list[dict[str, Any]]] = None
    tool_choice: str = "auto"
    stream: bool = False
    stream_callback: Optional[Callable[[str], None]] = None

The Function

def send_openai_compatible(
    client: Any,        # openai.OpenAI client with vendor-specific base_url + auth
    request: OpenAICompatibleRequest,
    *, capabilities: "VendorCapabilities",  # from src/vendor_capabilities.py
) -> NormalizedResponse:

The function:

Translates request.messages into the OpenAI SDK's messages parameter (passthrough — already in OpenAI shape).
Translates request.tools if non-None (passthrough for now; future: strip unsupported fields based on capabilities).
Calls client.chat.completions.create(...) with the right parameters.
If streaming: aggregates chunks; calls stream_callback(text_chunk) for each text delta; collects final usage from the last chunk.
If non-streaming: parses the response in one shot.
Returns a NormalizedResponse with text, tool calls (in OpenAI shape), usage stats.
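On exception: classifies the OpenAI exception. This step originally re-raised it as ProviderError; since that class has been removed (see Data-Oriented Error Handling above), the exception is converted to an ErrorInfo at this boundary instead.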
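The streaming step in the list above (aggregate chunks, forward each text delta, keep the final usage) can be sketched in isolation. `FakeChunk` is a hypothetical, flattened stand-in; real SDK chunks nest the delta and usage differently, so this shows only the aggregation logic.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class FakeChunk:
    """Hypothetical simplified chunk; real SDK chunks are nested differently."""
    text_delta: Optional[str] = None
    usage: Optional[dict] = None


def aggregate_stream(
    chunks: Iterable[FakeChunk],
    stream_callback: Optional[Callable[[str], None]] = None,
) -> tuple[str, dict]:
    """Join text deltas into one string and keep the last usage payload seen."""
    parts: list[str] = []
    usage: dict = {}
    for chunk in chunks:
        if chunk.text_delta:
            parts.append(chunk.text_delta)
            if stream_callback is not None:
                stream_callback(chunk.text_delta)  # forward each delta as it arrives
        if chunk.usage is not None:
            usage = chunk.usage  # the last chunk carrying usage wins
    return "".join(parts), usage
```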

Usage Pattern (per vendor)

# _send_grok, _send_llama (single-shot placeholders), _send_minimax (with restored tool loop)
def _send_grok(md_content, user_message, base_dir, file_items=None, discussion_history="", stream=False, ...):
    client = _ensure_grok_client()  # openai.OpenAI(api_key=..., base_url="https://api.x.ai/v1")
    with _grok_history_lock:
        # ... build messages, append user, system + context ...
        request = OpenAICompatibleRequest(
            messages=messages, model=_model, stream=stream,
            stream_callback=stream_callback,
        )
        caps = get_capabilities("grok", _model)
        response = send_openai_compatible(client, request, capabilities=caps)
        # ... append to history, return response.text ...

Qwen Adapter (`src/qwen_adapter.py`)

Qwen uses Alibaba's DashScope native SDK (not OpenAI-compatible) because DashScope's OpenAI-compatible mode drops important features (Qwen-Audio, Qwen-Long custom chunking, Qwen-VL-Max enhanced vision). The adapter normalizes DashScope tool format to OpenAI shape via build_dashscope_tools() and classifies DashScope exceptions via classify_dashscope_error().

Llama Multi-Backend

_send_llama supports 3 backends via the state globals _llama_base_url and _llama_api_key:

Ollama (local): http://localhost:11434/v1; no auth
OpenRouter (cloud aggregator): https://openrouter.ai/api/v1
Custom URL (escape hatch): any OpenAI-compatible endpoint

`run_with_tool_loop` — Shared Tool-Call Loop Helper

Added 2026-06-11 by the qwen_llama_grok_followup_20260611 track. Wraps send_openai_compatible with the tool-call loop, so 4+ OpenAI-compatible vendors share the same dispatch + history logic instead of each having their own inline loop.

Signature (in src/ai_client.py:806):

def run_with_tool_loop(
    client: Any,
    request: OpenAICompatibleRequest | Callable[[int], OpenAICompatibleRequest],
    *,
    capabilities: "VendorCapabilities",
    pre_tool_callback: Optional[Callable] = None,
    qa_callback: Optional[Callable] = None,
    stream_callback: Optional[Callable[[str], None]] = None,
    patch_callback: Optional[Callable] = None,
    base_dir: str,
    vendor_name: str,
    history_lock: Optional[threading.Lock] = None,
    history: Optional[list] = None,
    trim_func: Optional[Callable] = None,
    send_func: Optional[Callable[[int], "NormalizedResponse"]] = None,
    on_pre_dispatch: Optional[Callable] = None,
) -> str:

Two extensions were added beyond the original signature:

request accepts a Callable[[int], OpenAICompatibleRequest] (per-round history rebuild). Use this when the vendor mutates history between rounds (e.g., MiniMax's per-round append).
send_func + on_pre_dispatch allows vendored call paths (e.g., Gemini CLI's GeminiCliAdapter) to share the loop + dispatch without going through send_openai_compatible.

Vendors applied (as of 2026-06-11):

_send_minimax (was inline, now uses helper)
_send_grok (was single-shot, now has loop)
_send_llama (was single-shot, now has loop)
_send_gemini_cli (uses send_func + on_pre_dispatch)

Vendors still deferred (multi-day refactor; see conductor/tracks/qwen_llama_grok_followup_20260611/state.toml t5_6/7/8):

_send_anthropic (uses anthropic SDK)
_send_gemini (uses google-genai streaming)
_send_deepseek (uses requests.post)

Audit enforcement: scripts/audit_no_inline_tool_loops.py fails if any non-deferred _send_<vendor>() has an inline for ... in range(MAX_TOOL_ROUNDS) loop.

Native Ollama Adapter (Phase 4)

Added 2026-06-11. When _llama_base_url is localhost / 127.0.0.1 (Ollama default), _send_llama routes to _send_llama_native (which wraps ollama_chat). The native adapter POSTs to /api/chat (NOT /v1/chat/completions) and supports Ollama's vendor-specific fields:

think: low | medium | high — reasoning depth hint
images: list of base64-encoded images (for vision-capable models)
thinking: returned field; captured in history for subsequent rounds

The dispatcher check is in _send_llama at the function head:

if "localhost" in _llama_base_url or "127.0.0.1" in _llama_base_url:
    return _send_llama_native(...)

For OpenRouter, custom URLs, and other cloud Llama endpoints, the existing OpenAI-compat path is unchanged.
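One property of a substring check is that it also matches hosts that merely contain the word, such as `https://localhost.example.com/v1`. If that ever matters, a stricter variant could compare the parsed hostname. The sketch below is a possible alternative, not the project's code; `is_local_ollama` is a hypothetical name.

```python
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def is_local_ollama(base_url: str) -> bool:
    """True when the URL's host is exactly localhost or 127.0.0.1."""
    return urlparse(base_url).hostname in LOCAL_HOSTS
```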

V2 Capability Matrix (Phase 4)

Added 2026-06-11. The VendorCapabilities dataclass in src/vendor_capabilities.py now has 12 v2 fields beyond the original 7 v1 fields:

V1 fields (unchanged):

vision, tool_calling, caching, streaming, model_discovery, context_window, cost_tracking

V2 fields (added):

local — backend is on-device (Ollama, etc.); consumed by _apply_runtime_caps_override for llama+localhost
reasoning — model supports thinking / reasoning traces (e.g., MiniMax-M2.5/M2.7, DeepSeek R1, llama-3.1-405b-reasoning)
structured_output — model supports JSON / tool-use output format
code_execution — model can run code (server-side; e.g., gemini-2.0-experimental)
web_search — model can do live web search (e.g., grok-2, gemini-grounded)
x_search — X/Twitter search (grok-specific)
file_search — model has a file_search tool (Anthropic)
mcp_support — model supports the Model Context Protocol (Anthropic, gemini)
audio — model accepts audio input (gemini-2.5+, qwen-audio)
video — model accepts video input (gemini-2.5+, qwen-vl-max)
grounding — model supports grounding (gemini)
computer_use — model can drive a computer (Anthropic claude-3.5+)

GUI rendering: src/gui_2.py:_render_v2_capability_badges renders small green badges in the provider panel for each field where caps.<field> = True. The user can see at a glance which capabilities their active vendor+model supports.

Static + runtime: Most v2 fields are per-model properties in the registry. caps.local is unique — it's runtime state (URL-dependent), so the GUI uses dataclasses.replace(caps, local=True) to override when the active backend is Ollama.
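The runtime override can be illustrated with a trimmed-down dataclass. This sketch assumes a frozen dataclass and shows only three of the fields; `apply_runtime_override` is a hypothetical name and not the project's `_apply_runtime_caps_override`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VendorCapabilities:
    """Trimmed, hypothetical version with three of the fields listed above."""
    tool_calling: bool = False
    streaming: bool = False
    local: bool = False


def apply_runtime_override(caps: VendorCapabilities, base_url: str) -> VendorCapabilities:
    """Return a copy with local=True when the backend URL points at this machine."""
    if "localhost" in base_url or "127.0.0.1" in base_url:
        return replace(caps, local=True)  # registry entry itself is never mutated
    return caps
```

Using `dataclasses.replace` keeps the registry entry immutable: the per-model record stays a static description, and the URL-dependent flag lives only on the copy handed to the GUI.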

PROVIDERS Location (Phase 2)

The PROVIDERS list moved from src/models.py to src/ai_client.py:56 per the AGENTS.md HARD RULE (no new src/<thing>.py files). A PEP 562 __getattr__ re-export in src/models.py:261 maintains backward compatibility (lazy import; breaks the circular dependency where src/ai_client.py imports ToolPreset from src/models.py).

Audit: scripts/audit_providers_source_of_truth.py fails if PROVIDERS is declared in src/models.py.
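A module-level `__getattr__` (PEP 562) is what makes the lazy re-export possible. The sketch below builds two throwaway modules in memory to show the mechanism; the module names `demo_ai_client` and `demo_models` and the shortened provider list are stand-ins, not the real files.

```python
import sys
import types

# Stand-in for src/ai_client.py, which now owns the list.
ai_client_mod = types.ModuleType("demo_ai_client")
ai_client_mod.PROVIDERS = ["gemini", "anthropic"]
sys.modules["demo_ai_client"] = ai_client_mod

# Stand-in for src/models.py, which no longer declares PROVIDERS.
models_mod = types.ModuleType("demo_models")


def _models_getattr(name: str):
    """PEP 562 hook: called only when normal attribute lookup fails."""
    if name == "PROVIDERS":
        import demo_ai_client  # deferred import avoids an import-time cycle
        return demo_ai_client.PROVIDERS
    raise AttributeError(f"module 'demo_models' has no attribute {name!r}")


models_mod.__getattr__ = _models_getattr

# Old call sites keep working even though the name is not stored in the module.
providers = models_mod.PROVIDERS
```

In a real file the hook is simply a top-level `def __getattr__(name)`; because the import runs at access time and not at module load, the two modules can import each other without a cycle.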

Tests

tests/test_vendor_capabilities.py (3 tests): registry lookup, vendor-default fallback, unknown-vendor raises
tests/test_openai_compatible.py (6 tests): non-streaming, streaming aggregation, tool call detection, vision, error classification, frozen dataclass
conductor/tracks/nagent_review_20260608/report.md §15 Pitfalls #2 and #4 — Deep-dive on the per-provider history globals and the stateful singleton pattern; future-track candidate for stateless LLMClient

Addition (2026-06-12) — Cache strategy and the 12-layer model

The nagent review (v2.3, §3.2 + §5) formalizes the cache strategy that this client implements. The strategy: stable-to-volatile context ordering, where layers 1-7 of the initial context are byte-identical across turns and across discussions of the same mode (and therefore cacheable), and layers 8-12 are per-turn (and therefore not cached).

The 12-layer model (the recap)

The canonical reference is conductor/code_styleguides/cache_friendly_context.md §1 (the full 12-layer table with the stable/volatile classification + the ─── data markings + the byte-comparison test). This section is a pointer.

The one-line summary: layers 1-7 (role instructions, function-calling schema, tool descriptions, system prompt, persona, project context, knowledge digest) are byte-identical across turns and cacheable. Layers 8-12 are per-turn and NOT cached. The cache boundary is at layer 7/8.

The byte-comparison test (the design contract)

The test in tests/test_aggregate_caching.py ensures the first N characters of the context are byte-identical across turns:

def test_aggregate_stable_to_volatile_ordering():
    ctrl = mock_app_controller()
    turn1 = aggregate.build_initial_context(ctrl, user_message="first")
    turn2 = aggregate.build_initial_context(ctrl, user_message="second")
    N = aggregate.stable_prefix_length(ctrl)
    assert turn1[:N] == turn2[:N], f"Stable prefix mismatch: {turn1[:N]!r} != {turn2[:N]!r}"

The test is the contract. If a new layer is added in the wrong position, the test fails; the agent must move the layer to the stable position or update the test with written justification.
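The contract can be demonstrated on a toy context builder. Everything here is hypothetical (three stable layers instead of seven, a made-up `build_context`); it only shows why stable-first ordering passes the prefix check and volatile-first ordering fails it.

```python
STABLE_LAYERS = ["role instructions", "tool schema", "system prompt"]  # hypothetical layers


def stable_prefix() -> str:
    return "\n".join(STABLE_LAYERS) + "\n"


def build_context(user_message: str, turn: int) -> str:
    """Stable layers first, per-turn content last."""
    volatile = f"turn={turn}\nuser: {user_message}\n"
    return stable_prefix() + volatile


def build_context_misordered(user_message: str, turn: int) -> str:
    """Counter-example: per-turn content placed ahead of the stable layers."""
    return f"turn={turn}\n" + stable_prefix() + f"user: {user_message}\n"


n = len(stable_prefix())
turn1 = build_context("first", 1)
turn2 = build_context("second", 2)
prefix_matches = turn1[:n] == turn2[:n]
misordered_matches = (
    build_context_misordered("first", 1)[:n] == build_context_misordered("second", 2)[:n]
)
```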

The provider-specific cache strategies

Anthropic (5-min ephemeral, 4 breakpoints max)

def _send_anthropic(messages, *, cache_prefix_chars=None):
    if cache_prefix_chars is not None:
        content_blocks = cache_prefix_blocks(messages, cache_prefix_chars)
    else:
        content_blocks = messages

    response = anthropic_client.messages.create(
        model=model,
        max_tokens=8192,
        messages=[{"role": "user", "content": content_blocks}],
    )
    return _result_with_usage(response.content, response.usage, messages)

The cache_prefix_blocks helper splits the message at the given char offsets and marks each prefix with cache_control: {"type": "ephemeral"}. Max 3 prefix blocks (provider limit is 4 breakpoints per request).

The Anthropic usage accounting (in _result_with_usage): cache_read_input_tokens + cache_creation_input_tokens are added to input_tokens so the accounting stays "tokens sent" across providers. Caching is invisible in the user-facing number.
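The accounting rule amounts to one addition. A small sketch, with `AnthropicUsage` as a hypothetical container mirroring the usage field names mentioned above and `tokens_sent` as an illustrative helper:

```python
from dataclasses import dataclass


@dataclass
class AnthropicUsage:
    """Hypothetical container mirroring the usage fields named above."""
    input_tokens: int
    output_tokens: int
    cache_read_input_tokens: int = 0
    cache_creation_input_tokens: int = 0


def tokens_sent(usage: AnthropicUsage) -> int:
    """Total prompt tokens sent, whether or not they were served from cache."""
    return (
        usage.input_tokens
        + usage.cache_read_input_tokens
        + usage.cache_creation_input_tokens
    )
```

With this rule a 1000-token prompt reports 1000 input tokens on a cache write, a cache hit, and an uncached call alike, which is what keeps the number comparable across providers.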

Gemini (1-h explicit, configurable TTL)

def _send_gemini(messages, *, cache_ttl_seconds=3600):
    if cache_ttl_seconds > 0:
        cached_content = genai_client.caches.create(
            model=model, contents=stable_prefix_messages, ttl=f"{cache_ttl_seconds}s",
        )
        response = genai_client.models.generate_content(
            model=model, contents=volatile_messages,
            config=genai.types.GenerateContentConfig(cached_content=cached_content.name),
        )
    else:
        response = genai_client.models.generate_content(model=model, contents=messages)
    return _result_with_usage(response.text, response.usage_metadata, messages)

The default TTL is 1 hour; configurable per-discussion via the GUI.

OpenAI (5-10 min implicit, provider-managed)

No application-side control; the provider handles caching. The GUI just shows "Cached by OpenAI; TTL: provider-managed."

The GUI exposure (the "Caching" Operations Hub sub-panel)

| Provider | Default TTL | Configurable? |
|---|---|---|
| Anthropic ephemeral | 5 min | yes (per-discussion state) |
| Gemini explicit | 1 h | yes (TTL override) |
| OpenAI implicit | 5-10 min (provider-managed) | no |
| claude-code (Claude Agent SDK) | varies (provider-managed) | no |

The new AI client state:

@dataclass
class DiscussionCacheState:
    discussion_id: str
    provider: str
    cached_at: datetime
    expires_at: Optional[datetime]  # None for OpenAI implicit
    hit_count: int = 0
    tokens_cached: int = 0
    last_invalidated_at: Optional[datetime] = None
    caching_enabled: bool = True

The Hook API additions:

GET  /api/cache                        # list all discussion cache states
GET  /api/cache/<discussion_id>        # get one
POST /api/cache/<discussion_id>/invalidate
POST /api/cache/<discussion_id>/disable
POST /api/cache/<discussion_id>/enable

The 5th provider (claude-code)

claude-code uses the Claude Agent SDK with local Claude Code authentication (no API key). The caching behavior is provider-managed.

def _send_claude_code(message, model, *, allowed_tools=None, max_turns=1):
    options = ClaudeAgentOptions(
        model=None if not model or model == "default" else model,
        max_turns=max_turns,
        tools=list(allowed_tools) if allowed_tools else [],
        allowed_tools=list(allowed_tools) if allowed_tools else [],
        cwd=os.getcwd(),
    )
    # ... claude_agent_sdk.query(prompt=message, options=options)
    return _result_with_usage(text, usage, message)

The cross-references

docs/guide_caching_strategy.md — the user-facing deep-dive
conductor/code_styleguides/cache_friendly_context.md — the canonical styleguide
docs/guide_agent_memory_dimensions.md — the 4 dims (where the cache hits)
conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.2, §5 — the nagent pattern
