manual_slop/conductor/tracks/caching_optimization_20260308/spec.md

Specification: AI Provider Caching Optimization

Overview

This track aims to verify and optimize the caching strategies across all supported AI providers (Gemini, Anthropic, DeepSeek, MiniMax, and OpenAI). The goal is to minimize token consumption and latency by ensuring that static and recurring context (system prompts, tool definitions, project documents, and conversation history) is effectively cached using each provider's specific mechanisms.

Functional Requirements

1. Provider-Specific Optimization

Anthropic (Claude)

  • 4-Breakpoint Strategy: Implement a hierarchical caching structure using the 4 available cache_control: {"type": "ephemeral"} breakpoints:
    1. Global System: Core instructions and stable tool definitions.
    2. Project Context: High-signal project documents (README.md, tech-stack.md, etc.).
    3. Context Injection: Large file contents or skeletons injected into the current session.
    4. Rolling History: A "sliding window" breakpoint on the history to cache the majority of previous turns.
  • Automatic Caching: Research and integrate the top-level cache_control flag if supported by the current SDK version to simplify history caching.
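The four breakpoints above can be sketched as a raw Messages API payload built from plain dicts. The `cache_control: {"type": "ephemeral"}` shape follows Anthropic's documented prompt-caching format; the function name, parameters, and example content are illustrative assumptions, not the final implementation:

```python
# Sketch of the 4-breakpoint layout; helper name and inputs are hypothetical.
EPHEMERAL = {"type": "ephemeral"}

def build_payload(global_system: str, project_docs: str,
                  injected_context: str, history: list, user_turn: str) -> dict:
    """Assemble a request with one cache breakpoint per tier."""
    system_blocks = [
        # Breakpoint 1: global system prompt (most stable prefix).
        {"type": "text", "text": global_system, "cache_control": EPHEMERAL},
        # Breakpoint 2: project context documents.
        {"type": "text", "text": project_docs, "cache_control": EPHEMERAL},
        # Breakpoint 3: large injected file contents / skeletons.
        {"type": "text", "text": injected_context, "cache_control": EPHEMERAL},
    ]
    messages = [dict(m) for m in history]
    if messages:
        # Breakpoint 4: rolling history - mark the last prior turn so that
        # everything up to and including it is cached as one prefix.
        last = messages[-1]
        content = last["content"]
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
        else:
            content = list(content)
        content[-1] = {**content[-1], "cache_control": EPHEMERAL}
        last["content"] = content
    messages.append({"role": "user", "content": user_turn})
    return {"system": system_blocks, "messages": messages}
```

Marking the last prior turn (rather than a fixed position) lets the cached prefix slide forward each turn, so only the newest user message falls outside the cache.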

OpenAI & DeepSeek

  • Prefix Stabilization: Ensure that the messages array is structured with most-static content first (System role content, then invariant project context).
  • Metric Monitoring: Implement tracking for cached_tokens (OpenAI) and prompt_cache_hit_tokens (DeepSeek) from the API usage metadata.
  • Minimum Threshold Awareness: Only attempt to optimize or report on caching when the total prefix length exceeds the provider's minimum cacheable size (1,024 tokens for OpenAI, 64 for DeepSeek).
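The extraction and threshold rules above can be sketched as a small normalization layer. The field names follow each provider's usage metadata as named in this spec; the function names and the threshold table are assumptions for illustration:

```python
# Minimum cacheable prefix sizes per the spec (tokens).
MIN_PREFIX = {"openai": 1024, "deepseek": 64}

def cached_tokens(provider: str, usage: dict) -> int:
    """Return the number of prompt tokens served from cache."""
    if provider == "openai":
        # OpenAI nests the count under prompt_tokens_details.
        return usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    if provider == "deepseek":
        # DeepSeek reports it as a top-level usage field.
        return usage.get("prompt_cache_hit_tokens", 0)
    return 0

def worth_reporting(provider: str, prompt_tokens: int) -> bool:
    """Only surface cache metrics once the prefix can actually be cached."""
    return prompt_tokens >= MIN_PREFIX.get(provider, float("inf"))
```

Defaulting to 0 (rather than raising) keeps the metrics path robust when a provider omits the field entirely.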

Gemini

  • Hybrid Caching:
    • Explicit: Continue using CachedContent for massive context blocks (>= 32k tokens) in the system_instruction.
    • Implicit: Optimize the system_instruction and first message prefix to trigger implicit caching for smaller contexts (~1k-4k tokens).
  • TTL Management: Allow users to configure the TTL for explicit caches to balance cost vs. persistence.
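The hybrid routing rule can be sketched as a pure decision function. The token thresholds come straight from the bullets above; the function name, return shape, and default TTL are assumptions to be replaced by the real configuration plumbing:

```python
# Hypothetical default; the spec requires this to be user-configurable.
DEFAULT_TTL_SECONDS = 3600

def choose_gemini_strategy(context_tokens: int,
                           ttl_seconds: int = DEFAULT_TTL_SECONDS) -> dict:
    """Route a context block to explicit, implicit, or no caching."""
    if context_tokens >= 32_000:
        # Large enough to justify an explicit CachedContent despite storage fees.
        return {"mode": "explicit", "ttl_seconds": ttl_seconds}
    if context_tokens >= 1_000:
        # Rely on implicit caching of a stable system_instruction prefix.
        return {"mode": "implicit"}
    return {"mode": "none"}
```

Keeping the routing decision separate from the SDK call makes the cost-effectiveness rule (skip explicit caching for single-query sessions) easy to test in isolation.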

MiniMax

  • Parity Check: Verify if the MiniMax API supports OpenAI-style usage metrics for cached tokens and implement tracking if available.
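The parity check could start as a defensive probe over the usage payload: look for the OpenAI- and DeepSeek-style cached-token fields, and return `None` when neither is present so the metrics layer can distinguish "zero cache hits" from "not reported". Whether MiniMax emits any of these fields is exactly what this track must verify:

```python
def probe_cached_tokens(usage: dict):
    """Return cached-token count if reported in a known field, else None."""
    details = usage.get("prompt_tokens_details")
    if isinstance(details, dict) and "cached_tokens" in details:
        return details["cached_tokens"]  # OpenAI-style nesting
    if "prompt_cache_hit_tokens" in usage:
        return usage["prompt_cache_hit_tokens"]  # DeepSeek-style field
    return None
```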

2. GUI Enhancements

  • Cache Hit Visualization: Add a "Cache Hit Rate" or "Saved Tokens" indicator to the AI Metrics panel.
  • Manual Cache Management:
    • Add a "Force Cache Rebuild" button to invalidate existing caches and start fresh.
    • Add a "Clear Provider Caches" button to explicitly delete server-side caches (primarily for Gemini).
  • Cache Awareness: Display a visual indicator (e.g., a "Cached" badge) next to files in the Context Hub that are currently part of a cached prefix.
  • Per-Response Comms Metrics: Ensure each response entry in the Comms Log displays its specific token usage (input, output, cached) directly in the list or detail view for immediate verification of caching efficacy.
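The per-response figures the Comms Log and AI Metrics panel would display can be sketched as a small aggregation helper. Field and function names are illustrative; the "saved tokens" definition here (cached input tokens, which providers bill at a discount) is an assumption the implementation may refine with per-provider pricing:

```python
def response_metrics(input_tokens: int, output_tokens: int,
                     cached_tokens: int) -> dict:
    """Compute the per-response row shown in the Comms Log."""
    hit_rate = cached_tokens / input_tokens if input_tokens else 0.0
    return {
        "input": input_tokens,
        "output": output_tokens,
        "cached": cached_tokens,
        "hit_rate": round(hit_rate, 3),
        # Headline "Saved Tokens" figure: cached input tokens are the
        # portion billed at the provider's discounted cache-hit rate.
        "saved_tokens": cached_tokens,
    }
```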

Non-Functional Requirements

  • Efficiency: The caching logic itself must not introduce significant overhead or latency.
  • Cost-Effectiveness: Avoid explicit caching (Gemini) for single-query sessions where storage fees outweigh token savings.
  • Robustness: Gracefully handle "Cache Not Found" errors or automatic evictions.
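The robustness requirement can be sketched as a rebuild-and-retry wrapper around the request path. The exception class here is a placeholder to be mapped onto the real SDK error for a missing or evicted cache:

```python
class CacheNotFound(Exception):
    """Placeholder for the provider SDK's cache-miss/eviction error."""

def call_with_cache_recovery(send, rebuild_cache):
    """send() issues the request; rebuild_cache() recreates the cache.

    On a cache-not-found failure, rebuild once and retry; a second
    failure propagates to the caller rather than looping.
    """
    try:
        return send()
    except CacheNotFound:
        rebuild_cache()  # cache was evicted or expired: recreate it
        return send()
```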

Acceptance Criteria

  • Anthropic requests consistently use 3-4 breakpoints for complex sessions.
  • OpenAI and DeepSeek responses correctly report cached token counts in the Comms Log and GUI.
  • Gemini explicit caches are proactively rebuilt and deleted according to the configured TTL and session lifecycle.
  • The GUI provides clear visibility into saved tokens and hit rates.
  • Users can manually clear or rebuild caches via the settings.

Out of Scope

  • Implementing client-side persistent KV caching (focusing strictly on vendor-side prompt/context caching).
  • Caching of image/vision data across different sessions (unless natively supported by the provider's prompt cache).