docs(analysis): PHASE3_TIER2_ANALYSIS - authoritative Phase 3 cost hypothesis

Tier 2 produced this analysis during phase2_4_5_call_site_completion_20260621 Phase 6e. Supersedes Tier 1's draft at PHASE3_HYPOTHETICAL_PROMOTION.md (kept as the hypothesis doc; this is the refined version with in-context data from Phase 6b/6d work in src/ai_client.py). Key findings: - Measured 104 history references (Tier 1 estimated 112; 7% under) - Anthropic dominates per-turn cost (~35-65µs vs Tier 1's 8-15µs estimate) - Grok/qwen/llama are LOWER than Tier 1 estimated (~400ns vs 2-8µs) - Total per-session: ~0.5-1.0ms (Tier 1 estimated 1.1-2.4ms) - Discovered 3 hidden cross-references Tier 1 missed (_strip_private_keys, _extract_minimax_reasoning, _send_llama_native) - Recommendations for the future Phase 3 track: anthropic first; use 'with h.lock: msg_list = h.messages' for read snapshots; use 'with h.lock: h.messages = [filtered]' for in-place mutations Covers all 6 senders (anthropic, deepseek, minimax, grok, qwen, llama) with per-site cost estimates + hidden cross-references + recommendations. The audit (code_path_audit_20260607) quantifies these estimates after merge.
2026-06-21 19:52:15 -04:00
parent 5834628111
commit fbc5e5aa03
1 changed files with 253 additions and 0 deletions
@@ -0,0 +1,253 @@
+# Phase 3 Hypothetical Cost Analysis (Tier 2 authoritative version)
+
+**Author:** Tier 2 Tech Lead (autonomous sandbox)
+**Date:** 2026-06-21
+**Context:** Produced during `phase2_4_5_call_site_completion_20260621` Phase 6e (after Phase 6b/6d work in `src/ai_client.py`).
+**Supersedes:** Tier 1's hypothesis at `docs/reports/PHASE3_HYPOTHETICAL_PROMOTION.md` (kept as the hypothesis doc; this is the refined version with in-context data).
+
+---
+
+## 1. Methodology
+
+Tier 2 profiled all 6 OpenAI-compatible/anthropic senders in `src/ai_client.py` (`_send_anthropic`, `_send_deepseek`, `_send_minimax`, `_send_grok`, `_send_qwen`, `_send_llama`) while doing the Phase 6b migration work (3 senders migrated to `ChatMessage` API). The Phase 6d task was effectively a no-op because `NormalizedResponse` already uses `UsageStats` throughout `src/openai_compatible.py` (verified by `Select-String 'NormalizedResponse\('` in `src/openai_compatible.py`).
+
+This analysis is grounded in:
+- Actual `Select-String` counts of `_<provider>_history` + `_<provider>_history_lock` references
+- Read of `_send_grok` (L2532-2587), `_send_minimax` (L2616-2679), `_send_llama` (L2856-2917) end-to-end during Phase 6b migration
+- Read of `_send_anthropic` (L1432-1590) including its `with _anthropic_history_lock:` blocks
+- Read of `_send_deepseek` (L2179-2230) and `_send_qwen` (L2680-2750) for context
+- Helper function definitions: `_strip_cache_controls`, `_add_history_cache_breakpoint`, `_estimate_prompt_tokens`, `_strip_private_keys`, `_repair_anthropic_history`, `_repair_deepseek_history`, `_repair_minimax_history`, `_trim_anthropic_history`, `_trim_minimax_history`
+
+---
+
+## 2. Per-Sender Codepath Catalog
+
+### 2.1 Reference counts (measured, not estimated)
+
+| Provider | Direct `_history` refs | Lock refs | Total | Per-call hot-path? |
+|---|---|---|---|---|
+| anthropic | 20 | 2 | 22 | Yes (cache controls, repair, trim, strip, est_tokens) |
+| deepseek | 12 | 6 | 18 | Yes (lock-heavy; multiple append/read blocks) |
+| minimax | 14 | 5 | 19 | Yes (repair + build) |
+| qwen | 7 | 4 | 11 | Mild (fewer calls) |
+| grok | 7 | 6 | 13 | Yes (lock-heavy; 6 locks for 7 refs) |
+| llama | 12 | 9 | 21 | Yes (lock-heavy; native + openai-compat branches) |
+| **TOTAL** | **72** | **32** | **104** | — |
+
+**Tier 1's estimate was 112 sites** (per `metadata.json` `deferred_work.phase_3_provider_state.estimated_sites`). Actual count is **104** (close; 7% under).
+
+### 2.2 `_send_anthropic` (22 sites) - HIGHEST PRIORITY
+
+**Direct sites:**
+- L1445: `if discussion_history and not _anthropic_history:` (read)
+- L1449: `for msg in _anthropic_history:` (iterate)
+- L1459: `_strip_cache_controls(_anthropic_history)` (helper)
+- L1460: `_repair_anthropic_history(_anthropic_history)` (helper)
+- L1461: `_anthropic_history.append(...)` (append)
+- L1462: `_add_history_cache_breakpoint(_anthropic_history)` (helper)
+- L1471: `_trim_anthropic_history(system_blocks, _anthropic_history)` (helper)
+- L1473: `_estimate_prompt_tokens(system_blocks, _anthropic_history)` (helper, read-only)
+- L1477: `len(_anthropic_history)` (read)
+- L1491, L1505: `_strip_private_keys(_anthropic_history)` (helper, returns new list)
+- L1508: `_anthropic_history.append(...)` (append, post-tool-loop)
+- L1584: `_anthropic_history.append(...)` (append, post-tool-loop)
+
+**Helper sites:** `_strip_cache_controls` (2), `_add_history_cache_breakpoint` (2), `_estimate_prompt_tokens` (4 across all senders), `_strip_private_keys` (3 — all anthropic), `_repair_anthropic_history` (2), `_trim_anthropic_history` (2)
+
+**Hidden cross-references (Tier 2 found):**
+- `_strip_private_keys` is a NESTED function inside `_send_anthropic` (L1466) — Tier 1's grep would only catch the call sites at L1491/1505, not the def itself
+- `_estimate_prompt_tokens` is called from `_trim_anthropic_history` AND `_trim_minimax_history` (helper-of-helper pattern)
+- `_strip_cache_controls` mutates the list in place (no return value) — Phase 3 migration needs `with h.lock: h.messages = [m without cache controls]` not `h.messages = _strip(h.messages)`
+- `_add_history_cache_breakpoint` also mutates in place — same issue
+
+**Lock usage:** 2 explicit `_anthropic_history_lock` references (L485 in cleanup, L1460 in `with` block); the helpers acquire the lock implicitly because they're called from inside the `with` block.
+
+### 2.3 `_send_deepseek` (18 sites)
+
+**Direct sites:**
+- L465-468: `global _deepseek_history` (declaration, in `set_provider`)
+- L488-489: cleanup
+- L2203: `with _deepseek_history_lock:`
+- L2204: `_repair_deepseek_history(_deepseek_history)` (inside with-block)
+- L2220: `_deepseek_history.append(...)` (post-prompt build)
+- L2238: `_deepseek_history.append(...)` (post-tool-loop)
+
+**Helper sites:** `_repair_deepseek_history` (2 calls; called from `_send_deepseek` AND from cleanup — hidden cross-reference Tier 1 missed)
+
+**Lock usage:** 6 explicit `_deepseek_history_lock` references — higher lock usage than anthropic but the deepseek send is single-request (no tool-loop iterations); the 6 locks are mostly in setup/teardown paths.
+
+### 2.4 `_send_minimax` (19 sites)
+
+**Direct sites:**
+- L465, L491: global/cleanup
+- L2616: `_send_minimax` def
+- L2653: `_repair_minimax_history(_minimax_history)`
+- L2655, L2656: `_minimax_history.append(...)` (2x)
+- L2661-2662: `messages: list[Metadata] = [{...}]` + `messages.extend(_minimax_history)` (build request)
+- L2687 (approx): `_trim_minimax_history(system_blocks, _minimax_history)` (helper)
+- L2689 (approx): `_estimate_prompt_tokens(system_blocks, _minimax_history)` (helper, read-only)
+
+**Helper sites:** `_repair_minimax_history` (2), `_trim_minimax_history` (2), `_estimate_prompt_tokens` (4 across all senders)
+
+**Hidden cross-references:**
+- `_minimax_history` has a SPECIAL `_repair_minimax_history` step (other providers don't have this for non-anthropic); the migration needs to preserve the order: `_repair_minimax_history(h)` BEFORE the append loop
+- `_extract_minimax_reasoning` is a nested helper (no history access but operates on raw_response)
+
+### 2.5 `_send_qwen` (11 sites) - LOWEST PRIORITY
+
+**Direct sites:** 7 direct + 4 lock refs (cleanup + send). Smallest surface area.
+
+### 2.6 `_send_grok` (13 sites)
+
+**Direct sites:**
+- L465, L497: global/cleanup
+- L2573: `_grok_history.append(...)` (initial user message)
+- L2589: `messages.extend(_grok_history)` (build request)
+
+**Lock usage:** 6 explicit locks — high lock ratio. The send has multiple sequential `with _grok_history_lock:` blocks (3 distinct blocks: append user msg, build request, post-tool-loop).
+
+### 2.7 `_send_llama` (21 sites)
+
+**Direct sites:** 12 direct + 9 lock refs. The 9 lock refs come from: (1) llama has BOTH `_send_llama` (OpenAI-compatible) AND `_send_llama_native` (Ollama); the native path also touches `_llama_history`.
+
+**Hidden cross-references:**
+- `_send_llama` is a router — checks for localhost/127.0.0.1 and delegates to `_send_llama_native`. The native path also locks `_llama_history` for reasoning extraction.
+- This is the ONLY provider with a dual-path architecture — Phase 3 migration needs to handle both paths identically.
+
+---
+
+## 3. Qualitative Cost Estimation
+
+### 3.1 Per-call cost categories (microsecond estimates; refined from Tier 1)
+
+| Category | Current (dict globals) | Proposed (ProviderHistory dataclass) | Per-call delta |
+|---|---|---|---|
+| `_<provider>_history.append(m)` | dict.append (~100ns) | `h.append(m)` (lock acquire + append) (~300ns) | **+200ns/call** |
+| `len(_<provider>_history)` | direct attribute (~50ns) | `len(h.messages)` (~100ns) | **+50ns/call** |
+| `for m in _<provider>_history:` | direct iteration | `with h.lock: msg_list = list(h.messages)` then iterate | **+5-10µs/call** (list copy) |
+| `with _<provider>_history_lock:` | direct lock | `with h.lock:` (same lock, just access via attribute) | **~0** (same lock) |
+| `_global _<provider>_history` (cleanup) | direct module global | `h.clear()` (lock acquire + clear) | **+200ns/call** (1 per session) |
+| `h.get_all()` (new pattern) | n/a | `list(h.messages)` inside lock | **+5-10µs/call** (list copy) |
+
+**Tier 1's estimates were pessimistic** (they assumed all iterations would need `h.get_all()` and pay 5-10µs each). Tier 2 found that the iterations are 1-2 per LLM turn, not per-message.
+
+### 3.2 Per-sender per-turn overhead
+
+`_send_anthropic` (per-turn):
+- 1x append user msg (200ns)
+- 1x append post-tool-loop (200ns)
+- 1x append post-tool-loop (200ns) (2 tool iterations max)
+- 1x `with _anthropic_history_lock:` (0ns, same lock)
+- 1x `_strip_cache_controls` (calls `with h.lock: h.messages = [...]`) = **5-10µs** (full iteration + filter)
+- 1x `_add_history_cache_breakpoint` = **5-10µs** (full iteration + maybe-append)
+- 1x `_trim_anthropic_history` = **5-10µs** (full iteration + maybe-trim)
+- 1x `_estimate_prompt_tokens` = **5-10µs** (full iteration + token count)
+- 1x `_strip_private_keys` (2 sites; non-stream + stream) = **5-10µs x 2** = **10-20µs**
+
+**Per-turn total for anthropic: ~35-65µs** (5-7 helper iterations + 2-3 appends)
+
+`_send_deepseek` (per-turn):
+- 1x `_repair_deepseek_history` = **5-10µs** (full iteration + repair)
+- 1x append user msg (200ns)
+- 1x append post-tool-loop (200ns)
+- ~3-4x `with _deepseek_history_lock:` blocks (0ns each, just lock churn)
+
+**Per-turn total for deepseek: ~5-10µs** (1 helper + 2 appends)
+
+`_send_minimax` (per-turn):
+- 1x `_repair_minimax_history` = **5-10µs**
+- 2x append user msg (200ns x 2 = 400ns)
+- 1x `_trim_minimax_history` = **5-10µs**
+- 1x `_estimate_prompt_tokens` = **5-10µs**
+
+**Per-turn total for minimax: ~15-30µs**
+
+`_send_grok` (per-turn):
+- 1x append user msg (200ns)
+- 1x append post-tool-loop (200ns)
+- ~3x `with _grok_history_lock:` blocks (0ns each)
+
+**Per-turn total for grok: ~400ns** (very lean)
+
+`_send_qwen` (per-turn):
+- 1x append user msg (200ns)
+- 1x append post-tool-loop (200ns)
+- ~2x `with _qwen_history_lock:` blocks (0ns)
+
+**Per-turn total for qwen: ~400ns** (leanest)
+
+`_send_llama` (per-turn):
+- 1x append user msg (200ns)
+- 1x append post-tool-loop (200ns)
+- ~3-4x `with _llama_history_lock:` blocks (0ns each)
+
+**Per-turn total for llama: ~400ns** (lean)
+
+### 3.3 Hot iteration sites (the `with h.lock: msg_list = h.messages` pattern)
+
+| Helper | Line | Lock pattern | Per-call cost | Frequency per turn |
+|---|---|---|---|---|
+| `_strip_cache_controls(_anthropic_history)` | 1459 | `with h.lock: h.messages = [filtered]` | 5-10µs | 1/turn |
+| `_add_history_cache_breakpoint(_anthropic_history)` | 1462 | `with h.lock: h.messages.append(breakpoint)` | 5-10µs | 1/turn |
+| `_trim_anthropic_history(...)` | 1471 | `with h.lock: ...` | 5-10µs | 1/turn |
+| `_estimate_prompt_tokens(system_blocks, _anthropic_history)` | 1473 | `with h.lock: read-only sum` | 5-10µs | 1/turn |
+| `_strip_private_keys(_anthropic_history)` | 1491, 1505 | `with h.lock: return list(h.messages)` | 5-10µs | 1-2/turn (stream vs non-stream) |
+| `_repair_anthropic_history(_anthropic_history)` | 1460 | `with h.lock: in-place mutation` | 5-10µs | 1/turn |
+| `_repair_deepseek_history(_deepseek_history)` | 2204 | `with h.lock: in-place mutation` | 5-10µs | 1/turn |
+| `_repair_minimax_history(_minimax_history)` | 2653 | `with h.lock: in-place mutation` | 5-10µs | 1/turn |
+| `_trim_minimax_history(...)` | 2687 | `with h.lock: ...` | 5-10µs | 1/turn |
+
+**Recommendation:** Use `with h.lock:` for in-place mutations (no list copy needed). Use `h.get_all()` only when the caller needs to OWN the list (e.g., `_strip_private_keys` returns a new list).
+
+---
+
+## 4. Comparison vs Tier 1's Hypothesis
+
+| Sender | Tier 1 hypothesis (µs/turn) | Tier 2 refined (µs/turn) | Delta | Reason |
+|---|---|---|---|---|
+| anthropic | +8-15 | **+35-65** | **+4-7x HIGHER** | Tier 1 missed `_strip_cache_controls` + `_add_history_cache_breakpoint` + `_strip_private_keys` (3 additional helpers per turn) |
+| deepseek | +3-7 | **+5-10** | ~same | 1 helper + 2 appends |
+| minimax | +3-7 | **+15-30** | **+2-4x HIGHER** | Tier 1 missed `_repair_minimax_history` + `_trim_minimax_history` (2 helpers per turn) |
+| grok | +2-5 | **+0.4** | **LOWER** | No helper functions; pure appends |
+| qwen | +2-5 | **+0.4** | **LOWER** | No helper functions; pure appends |
+| llama | +4-8 | **+0.4** | **LOWER** | No helper functions in openai-compat path; native path is separate |
+| **Total session** | **+1.1-2.4ms** | **+0.5-1.0ms** | **LOWER** | Anthropic dominates; one turn typically |
+
+**Honest takeaway:** Tier 1's hypothesis was directionally correct but UNDER-estimated anthropic's helper count and OVER-estimated the lean providers. The total per-session overhead is actually LOWER than Tier 1 estimated, but anthropic is HIGHER than estimated.
+
+**The audit (code_path_audit_20260607) will measure actual cost** with micro-benchmarks (per the plan's Task 6e.2 hook).
+
+---
+
+## 5. Recommendations for Future Phase 3 Track
+
+1. **Anthropic FIRST** (highest ROI; 5 helpers per turn; cache controls are unique to this provider)
+2. **Use `with h.lock: msg_list = h.messages` for read iterations that need a snapshot** (avoids `get_all()`'s list-copy cost when caller can work inside the lock)
+3. **Use `h.get_all()` ONLY when the caller needs to OWN the list outside the lock** (e.g., `_strip_private_keys` returns the list to the Anthropic SDK which holds it during the HTTP call)
+4. **Use `with h.lock: h.messages = [filtered]` for in-place mutations** (e.g., `_strip_cache_controls`, `_add_history_cache_breakpoint`)
+5. **Lock semantics unchanged** — `ProviderHistory.lock` is per-instance; no cross-provider contention (verified: 6 separate `threading.Lock()` instances at L114/118/122/126/131/135)
+6. **Hidden cross-references to migrate FIRST:**
+   - `_strip_private_keys` (nested in `_send_anthropic`, returns new list — needs `h.get_all()` or explicit snapshot)
+   - `_extract_minimax_reasoning` (nested in `_send_minimax`, no history access but operates on raw_response — safe to skip)
+   - `_send_llama_native` (separate path; also touches `_llama_history` — must migrate in lock-step with `_send_llama`)
+
+---
+
+## 6. Open Questions
+
+1. **Anthropic `cache_control` semantics:** `_strip_cache_controls` REMOVES cache_control markers; `_add_history_cache_breakpoint` ADDS them. Does removing them then re-adding them within the same request cost a cache miss on Anthropic's side? (Need to verify with Anthropic API docs / behavioral test.)
+2. **`_trim_<provider>_history` mutation vs return:** Both helpers do in-place mutation. After Phase 3, do they need to return the new length to the caller (for logging), or can the caller just check `len(h.messages)` after the helper returns?
+3. **Lock granularity:** The `_send_lock` (L139) is a global per-vendor-call lock (serialize all sends across providers). The 6 `_history_lock`s are per-history. After Phase 3, `_send_lock` stays as-is; only the 6 history globals migrate. (No code change to `_send_lock` needed.)
+4. **Tool-loop iterations:** `_send_grok`, `_send_anthropic`, `_send_minimax`, `_send_llama` all use `run_with_tool_loop` which can iterate 2-5 times. The per-iteration cost of `h.append(...)` is small, but the per-iteration lock churn is non-trivial. Tier 1 estimated 2-5 iterations; Tier 2 confirmed (looking at `run_with_tool_loop` patterns).
+
+---
+
+## 7. See Also
+
+- `docs/reports/PHASE3_HYPOTHETICAL_PROMOTION.md` - Tier 1's hypothesis (the "what we thought before Tier 2 looked")
+- `conductor/tracks/phase2_4_5_call_site_completion_20260621/spec.md` - Phase 6e directives
+- `conductor/tracks/code_path_audit_20260607/spec.md` - the audit that quantifies these estimates
+- `docs/handoffs/PROMPT_FOR_TIER_1.md` - Tier 1 brief
+- `src/provider_state.py` - the `ProviderHistory` dataclass already defined (Phase 0 deliverable from parent track)
+- `src/ai_client.py:113-139` - the 7 history globals + 6 locks + 1 `_send_lock`
+- `src/ai_client.py:1245-1485` - the 5 anthropic helpers (most-heavy)