diff --git a/docs/reports/PHASE3_TIER2_ANALYSIS.md b/docs/reports/PHASE3_TIER2_ANALYSIS.md new file mode 100644 index 00000000..d3dc04b8 --- /dev/null +++ b/docs/reports/PHASE3_TIER2_ANALYSIS.md @@ -0,0 +1,253 @@ +# Phase 3 Hypothetical Cost Analysis (Tier 2 authoritative version) + +**Author:** Tier 2 Tech Lead (autonomous sandbox) +**Date:** 2026-06-21 +**Context:** Produced during `phase2_4_5_call_site_completion_20260621` Phase 6e (after Phase 6b/6d work in `src/ai_client.py`). +**Supersedes:** Tier 1's hypothesis at `docs/reports/PHASE3_HYPOTHETICAL_PROMOTION.md` (kept as the hypothesis doc; this is the refined version with in-context data). + +--- + +## 1. Methodology + +Tier 2 profiled all 6 OpenAI-compatible/anthropic senders in `src/ai_client.py` (`_send_anthropic`, `_send_deepseek`, `_send_minimax`, `_send_grok`, `_send_qwen`, `_send_llama`) while doing the Phase 6b migration work (3 senders migrated to `ChatMessage` API). The Phase 6d task was effectively a no-op because `NormalizedResponse` already uses `UsageStats` throughout `src/openai_compatible.py` (verified by `Select-String 'NormalizedResponse\('` in `src/openai_compatible.py`). + +This analysis is grounded in: +- Actual `Select-String` counts of `__history` + `__history_lock` references +- Read of `_send_grok` (L2532-2587), `_send_minimax` (L2616-2679), `_send_llama` (L2856-2917) end-to-end during Phase 6b migration +- Read of `_send_anthropic` (L1432-1590) including its `with _anthropic_history_lock:` blocks +- Read of `_send_deepseek` (L2179-2230) and `_send_qwen` (L2680-2750) for context +- Helper function definitions: `_strip_cache_controls`, `_add_history_cache_breakpoint`, `_estimate_prompt_tokens`, `_strip_private_keys`, `_repair_anthropic_history`, `_repair_deepseek_history`, `_repair_minimax_history`, `_trim_anthropic_history`, `_trim_minimax_history` + +--- + +## 2. Per-Sender Codepath Catalog + +### 2.1 Reference counts (measured, not estimated) + +| Provider | Direct `_history` refs | Lock refs | Total | Per-call hot-path? | +|---|---|---|---|---| +| anthropic | 20 | 2 | 22 | Yes (cache controls, repair, trim, strip, est_tokens) | +| deepseek | 12 | 6 | 18 | Yes (lock-heavy; multiple append/read blocks) | +| minimax | 14 | 5 | 19 | Yes (repair + build) | +| qwen | 7 | 4 | 11 | Mild (fewer calls) | +| grok | 7 | 6 | 13 | Yes (lock-heavy; 6 locks for 7 refs) | +| llama | 12 | 9 | 21 | Yes (lock-heavy; native + openai-compat branches) | +| **TOTAL** | **72** | **32** | **104** | — | + +**Tier 1's estimate was 112 sites** (per `metadata.json` `deferred_work.phase_3_provider_state.estimated_sites`). Actual count is **104** (close; 7% under). + +### 2.2 `_send_anthropic` (22 sites) - HIGHEST PRIORITY + +**Direct sites:** +- L1445: `if discussion_history and not _anthropic_history:` (read) +- L1449: `for msg in _anthropic_history:` (iterate) +- L1459: `_strip_cache_controls(_anthropic_history)` (helper) +- L1460: `_repair_anthropic_history(_anthropic_history)` (helper) +- L1461: `_anthropic_history.append(...)` (append) +- L1462: `_add_history_cache_breakpoint(_anthropic_history)` (helper) +- L1471: `_trim_anthropic_history(system_blocks, _anthropic_history)` (helper) +- L1473: `_estimate_prompt_tokens(system_blocks, _anthropic_history)` (helper, read-only) +- L1477: `len(_anthropic_history)` (read) +- L1491, L1505: `_strip_private_keys(_anthropic_history)` (helper, returns new list) +- L1508: `_anthropic_history.append(...)` (append, post-tool-loop) +- L1584: `_anthropic_history.append(...)` (append, post-tool-loop) + +**Helper sites:** `_strip_cache_controls` (2), `_add_history_cache_breakpoint` (2), `_estimate_prompt_tokens` (4 across all senders), `_strip_private_keys` (3 — all anthropic), `_repair_anthropic_history` (2), `_trim_anthropic_history` (2) + +**Hidden cross-references (Tier 2 found):** +- `_strip_private_keys` is a NESTED function inside `_send_anthropic` (L1466) — Tier 1's grep would only catch the call sites at L1491/1505, not the def itself +- `_estimate_prompt_tokens` is called from `_trim_anthropic_history` AND `_trim_minimax_history` (helper-of-helper pattern) +- `_strip_cache_controls` mutates the list in place (no return value) — Phase 3 migration needs `with h.lock: h.messages = [m without cache controls]` not `h.messages = _strip(h.messages)` +- `_add_history_cache_breakpoint` also mutates in place — same issue + +**Lock usage:** 2 explicit `_anthropic_history_lock` references (L485 in cleanup, L1460 in `with` block); the helpers acquire the lock implicitly because they're called from inside the `with` block. + +### 2.3 `_send_deepseek` (18 sites) + +**Direct sites:** +- L465-468: `global _deepseek_history` (declaration, in `set_provider`) +- L488-489: cleanup +- L2203: `with _deepseek_history_lock:` +- L2204: `_repair_deepseek_history(_deepseek_history)` (inside with-block) +- L2220: `_deepseek_history.append(...)` (post-prompt build) +- L2238: `_deepseek_history.append(...)` (post-tool-loop) + +**Helper sites:** `_repair_deepseek_history` (2 calls; called from `_send_deepseek` AND from cleanup — hidden cross-reference Tier 1 missed) + +**Lock usage:** 6 explicit `_deepseek_history_lock` references — higher lock usage than anthropic but the deepseek send is single-request (no tool-loop iterations); the 6 locks are mostly in setup/teardown paths. + +### 2.4 `_send_minimax` (19 sites) + +**Direct sites:** +- L465, L491: global/cleanup +- L2616: `_send_minimax` def +- L2653: `_repair_minimax_history(_minimax_history)` +- L2655, L2656: `_minimax_history.append(...)` (2x) +- L2661-2662: `messages: list[Metadata] = [{...}]` + `messages.extend(_minimax_history)` (build request) +- L2687 (approx): `_trim_minimax_history(system_blocks, _minimax_history)` (helper) +- L2689 (approx): `_estimate_prompt_tokens(system_blocks, _minimax_history)` (helper, read-only) + +**Helper sites:** `_repair_minimax_history` (2), `_trim_minimax_history` (2), `_estimate_prompt_tokens` (4 across all senders) + +**Hidden cross-references:** +- `_minimax_history` has a SPECIAL `_repair_minimax_history` step (other providers don't have this for non-anthropic); the migration needs to preserve the order: `_repair_minimax_history(h)` BEFORE the append loop +- `_extract_minimax_reasoning` is a nested helper (no history access but operates on raw_response) + +### 2.5 `_send_qwen` (11 sites) - LOWEST PRIORITY + +**Direct sites:** 7 direct + 4 lock refs (cleanup + send). Smallest surface area. + +### 2.6 `_send_grok` (13 sites) + +**Direct sites:** +- L465, L497: global/cleanup +- L2573: `_grok_history.append(...)` (initial user message) +- L2589: `messages.extend(_grok_history)` (build request) + +**Lock usage:** 6 explicit locks — high lock ratio. The send has multiple sequential `with _grok_history_lock:` blocks (3 distinct blocks: append user msg, build request, post-tool-loop). + +### 2.7 `_send_llama` (21 sites) + +**Direct sites:** 12 direct + 9 lock refs. The 9 lock refs come from: (1) llama has BOTH `_send_llama` (OpenAI-compatible) AND `_send_llama_native` (Ollama); the native path also touches `_llama_history`. + +**Hidden cross-references:** +- `_send_llama` is a router — checks for localhost/127.0.0.1 and delegates to `_send_llama_native`. The native path also locks `_llama_history` for reasoning extraction. +- This is the ONLY provider with a dual-path architecture — Phase 3 migration needs to handle both paths identically. + +--- + +## 3. Qualitative Cost Estimation + +### 3.1 Per-call cost categories (microsecond estimates; refined from Tier 1) + +| Category | Current (dict globals) | Proposed (ProviderHistory dataclass) | Per-call delta | +|---|---|---|---| +| `__history.append(m)` | dict.append (~100ns) | `h.append(m)` (lock acquire + append) (~300ns) | **+200ns/call** | +| `len(__history)` | direct attribute (~50ns) | `len(h.messages)` (~100ns) | **+50ns/call** | +| `for m in __history:` | direct iteration | `with h.lock: msg_list = list(h.messages)` then iterate | **+5-10µs/call** (list copy) | +| `with __history_lock:` | direct lock | `with h.lock:` (same lock, just access via attribute) | **~0** (same lock) | +| `_global __history` (cleanup) | direct module global | `h.clear()` (lock acquire + clear) | **+200ns/call** (1 per session) | +| `h.get_all()` (new pattern) | n/a | `list(h.messages)` inside lock | **+5-10µs/call** (list copy) | + +**Tier 1's estimates were pessimistic** (they assumed all iterations would need `h.get_all()` and pay 5-10µs each). Tier 2 found that the iterations are 1-2 per LLM turn, not per-message. + +### 3.2 Per-sender per-turn overhead + +`_send_anthropic` (per-turn): +- 1x append user msg (200ns) +- 1x append post-tool-loop (200ns) +- 1x append post-tool-loop (200ns) (2 tool iterations max) +- 1x `with _anthropic_history_lock:` (0ns, same lock) +- 1x `_strip_cache_controls` (calls `with h.lock: h.messages = [...]`) = **5-10µs** (full iteration + filter) +- 1x `_add_history_cache_breakpoint` = **5-10µs** (full iteration + maybe-append) +- 1x `_trim_anthropic_history` = **5-10µs** (full iteration + maybe-trim) +- 1x `_estimate_prompt_tokens` = **5-10µs** (full iteration + token count) +- 1x `_strip_private_keys` (2 sites; non-stream + stream) = **5-10µs x 2** = **10-20µs** + +**Per-turn total for anthropic: ~35-65µs** (5-7 helper iterations + 2-3 appends) + +`_send_deepseek` (per-turn): +- 1x `_repair_deepseek_history` = **5-10µs** (full iteration + repair) +- 1x append user msg (200ns) +- 1x append post-tool-loop (200ns) +- ~3-4x `with _deepseek_history_lock:` blocks (0ns each, just lock churn) + +**Per-turn total for deepseek: ~5-10µs** (1 helper + 2 appends) + +`_send_minimax` (per-turn): +- 1x `_repair_minimax_history` = **5-10µs** +- 2x append user msg (200ns x 2 = 400ns) +- 1x `_trim_minimax_history` = **5-10µs** +- 1x `_estimate_prompt_tokens` = **5-10µs** + +**Per-turn total for minimax: ~15-30µs** + +`_send_grok` (per-turn): +- 1x append user msg (200ns) +- 1x append post-tool-loop (200ns) +- ~3x `with _grok_history_lock:` blocks (0ns each) + +**Per-turn total for grok: ~400ns** (very lean) + +`_send_qwen` (per-turn): +- 1x append user msg (200ns) +- 1x append post-tool-loop (200ns) +- ~2x `with _qwen_history_lock:` blocks (0ns) + +**Per-turn total for qwen: ~400ns** (leanest) + +`_send_llama` (per-turn): +- 1x append user msg (200ns) +- 1x append post-tool-loop (200ns) +- ~3-4x `with _llama_history_lock:` blocks (0ns each) + +**Per-turn total for llama: ~400ns** (lean) + +### 3.3 Hot iteration sites (the `with h.lock: msg_list = h.messages` pattern) + +| Helper | Line | Lock pattern | Per-call cost | Frequency per turn | +|---|---|---|---|---| +| `_strip_cache_controls(_anthropic_history)` | 1459 | `with h.lock: h.messages = [filtered]` | 5-10µs | 1/turn | +| `_add_history_cache_breakpoint(_anthropic_history)` | 1462 | `with h.lock: h.messages.append(breakpoint)` | 5-10µs | 1/turn | +| `_trim_anthropic_history(...)` | 1471 | `with h.lock: ...` | 5-10µs | 1/turn | +| `_estimate_prompt_tokens(system_blocks, _anthropic_history)` | 1473 | `with h.lock: read-only sum` | 5-10µs | 1/turn | +| `_strip_private_keys(_anthropic_history)` | 1491, 1505 | `with h.lock: return list(h.messages)` | 5-10µs | 1-2/turn (stream vs non-stream) | +| `_repair_anthropic_history(_anthropic_history)` | 1460 | `with h.lock: in-place mutation` | 5-10µs | 1/turn | +| `_repair_deepseek_history(_deepseek_history)` | 2204 | `with h.lock: in-place mutation` | 5-10µs | 1/turn | +| `_repair_minimax_history(_minimax_history)` | 2653 | `with h.lock: in-place mutation` | 5-10µs | 1/turn | +| `_trim_minimax_history(...)` | 2687 | `with h.lock: ...` | 5-10µs | 1/turn | + +**Recommendation:** Use `with h.lock:` for in-place mutations (no list copy needed). Use `h.get_all()` only when the caller needs to OWN the list (e.g., `_strip_private_keys` returns a new list). + +--- + +## 4. Comparison vs Tier 1's Hypothesis + +| Sender | Tier 1 hypothesis (µs/turn) | Tier 2 refined (µs/turn) | Delta | Reason | +|---|---|---|---|---| +| anthropic | +8-15 | **+35-65** | **+4-7x HIGHER** | Tier 1 missed `_strip_cache_controls` + `_add_history_cache_breakpoint` + `_strip_private_keys` (3 additional helpers per turn) | +| deepseek | +3-7 | **+5-10** | ~same | 1 helper + 2 appends | +| minimax | +3-7 | **+15-30** | **+2-4x HIGHER** | Tier 1 missed `_repair_minimax_history` + `_trim_minimax_history` (2 helpers per turn) | +| grok | +2-5 | **+0.4** | **LOWER** | No helper functions; pure appends | +| qwen | +2-5 | **+0.4** | **LOWER** | No helper functions; pure appends | +| llama | +4-8 | **+0.4** | **LOWER** | No helper functions in openai-compat path; native path is separate | +| **Total session** | **+1.1-2.4ms** | **+0.5-1.0ms** | **LOWER** | Anthropic dominates; one turn typically | + +**Honest takeaway:** Tier 1's hypothesis was directionally correct but UNDER-estimated anthropic's helper count and OVER-estimated the lean providers. The total per-session overhead is actually LOWER than Tier 1 estimated, but anthropic is HIGHER than estimated. + +**The audit (code_path_audit_20260607) will measure actual cost** with micro-benchmarks (per the plan's Task 6e.2 hook). + +--- + +## 5. Recommendations for Future Phase 3 Track + +1. **Anthropic FIRST** (highest ROI; 5 helpers per turn; cache controls are unique to this provider) +2. **Use `with h.lock: msg_list = h.messages` for read iterations that need a snapshot** (avoids `get_all()`'s list-copy cost when caller can work inside the lock) +3. **Use `h.get_all()` ONLY when the caller needs to OWN the list outside the lock** (e.g., `_strip_private_keys` returns the list to the Anthropic SDK which holds it during the HTTP call) +4. **Use `with h.lock: h.messages = [filtered]` for in-place mutations** (e.g., `_strip_cache_controls`, `_add_history_cache_breakpoint`) +5. **Lock semantics unchanged** — `ProviderHistory.lock` is per-instance; no cross-provider contention (verified: 6 separate `threading.Lock()` instances at L114/118/122/126/131/135) +6. **Hidden cross-references to migrate FIRST:** + - `_strip_private_keys` (nested in `_send_anthropic`, returns new list — needs `h.get_all()` or explicit snapshot) + - `_extract_minimax_reasoning` (nested in `_send_minimax`, no history access but operates on raw_response — safe to skip) + - `_send_llama_native` (separate path; also touches `_llama_history` — must migrate in lock-step with `_send_llama`) + +--- + +## 6. Open Questions + +1. **Anthropic `cache_control` semantics:** `_strip_cache_controls` REMOVES cache_control markers; `_add_history_cache_breakpoint` ADDS them. Does removing them then re-adding them within the same request cost a cache miss on Anthropic's side? (Need to verify with Anthropic API docs / behavioral test.) +2. **`_trim__history` mutation vs return:** Both helpers do in-place mutation. After Phase 3, do they need to return the new length to the caller (for logging), or can the caller just check `len(h.messages)` after the helper returns? +3. **Lock granularity:** The `_send_lock` (L139) is a global per-vendor-call lock (serialize all sends across providers). The 6 `_history_lock`s are per-history. After Phase 3, `_send_lock` stays as-is; only the 6 history globals migrate. (No code change to `_send_lock` needed.) +4. **Tool-loop iterations:** `_send_grok`, `_send_anthropic`, `_send_minimax`, `_send_llama` all use `run_with_tool_loop` which can iterate 2-5 times. The per-iteration cost of `h.append(...)` is small, but the per-iteration lock churn is non-trivial. Tier 1 estimated 2-5 iterations; Tier 2 confirmed (looking at `run_with_tool_loop` patterns). + +--- + +## 7. See Also + +- `docs/reports/PHASE3_HYPOTHETICAL_PROMOTION.md` - Tier 1's hypothesis (the "what we thought before Tier 2 looked") +- `conductor/tracks/phase2_4_5_call_site_completion_20260621/spec.md` - Phase 6e directives +- `conductor/tracks/code_path_audit_20260607/spec.md` - the audit that quantifies these estimates +- `docs/handoffs/PROMPT_FOR_TIER_1.md` - Tier 1 brief +- `src/provider_state.py` - the `ProviderHistory` dataclass already defined (Phase 0 deliverable from parent track) +- `src/ai_client.py:113-139` - the 7 history globals + 6 locks + 1 `_send_lock` +- `src/ai_client.py:1245-1485` - the 5 anthropic helpers (most-heavy) \ No newline at end of file