docs(phase-6): update ai_client+models guides; report + follow-up track setup

Phase 6 t6.1 + t6.2 (no archive per user directive): - docs/guide_ai_client.md: update Overview to mention 8 providers (was 5); add 'Shared OpenAI-Compatible Helper' section explaining src/openai_compatible.py (NormalizedResponse, OpenAICompatibleRequest, send_openai_compatible, usage pattern); document the Qwen adapter and Llama multi-backend. - docs/guide_models.md: update PROVIDERS list to 8 entries (was 5). - conductor/tracks.md: update the Qwen track entry to reflect '50/79 tasks done; Phase 6 in progress; NOT archiving - has follow-up'; add detailed status note pointing to the follow-up track + audit report. - docs/reports/qwen_llama_grok_followup_audit_20260611.md: NEW report explaining why a follow-up is needed (7 categories of gaps; the Tech Lead's 'footnote for now' failure mode; the lessons learned). - conductor/tracks/qwen_llama_grok_followup_20260611/: NEW follow-up track setup (spec.md, state.toml, metadata.json, TODO.md). 5 phases: tool loop lift, PROVIDERS move, UX adaptations 2-9, local-first + matrix v2, Anthropic/Gemini/DeepSeek migration. Phase 6 t6.3 (git mv to archive) and t6.4 (mark Recently Completed) are NOT applied per user directive: 'we can then doc this we're not archiving yet, if we have a follow up track I need this one to stay up because there is still alot todo'.
2026-06-11 09:33:18 -04:00
parent 457255bcd4
commit 691dc584eb
8 changed files with 745 additions and 3 deletions
@@ -6,10 +6,17 @@

 ## Overview

-`src/ai_client.py` (~116KB) is the **unified LLM client** for 5 providers. It abstracts the differences between providers (Gemini, Anthropic, DeepSeek, MiniMax, Gemini CLI) behind a single `send()` function.
+`src/ai_client.py` (~116KB) is the **unified LLM client** for 8 providers. It abstracts the differences between providers (Gemini, Anthropic, DeepSeek, MiniMax, Gemini CLI, Qwen, Grok, Llama) behind a single `send()` function.

 The module is a **stateful singleton** — all provider state is held in module-level globals. There is no class wrapping; the module itself is the abstraction layer.

+The 8 providers split into 3 API shapes:
+- **Native SDK**: Gemini (google-genai), Anthropic (anthropic), Qwen (DashScope)
+- **OpenAI-compatible**: MiniMax, Grok, Llama (Ollama/OpenRouter/custom), DeepSeek
+- **Subprocess**: Gemini CLI
+
+The OpenAI-compatible vendors all call the shared helper in `src/openai_compatible.py` (added 2026-06-06 by the `qwen_llama_grok_integration_20260606` track; see "Shared OpenAI-Compatible Helper" section below). The MiniMax provider's `_send_minimax` was refactored to use this helper (Phase 4 of the same track, 231 → 75 lines, 68% reduction).
+
 ---

 ## Module-Level Imports
@@ -430,4 +437,91 @@ Gated by env var (e.g., `RUN_REAL_AI_TESTS=1`). Hits the real API. Not in defaul
 - **[guide_state_lifecycle.md](guide_state_lifecycle.md)** — The per-provider history globals (`_anthropic_history`, etc.) are managed here; their locking and reset behavior is documented
 - **[guide_context_aggregation.md](guide_context_aggregation.md)** — The `aggregate.py` pipeline that produces the markdown the AI client sends
 - **[conductor/product.md](../conductor/product.md#multi-provider-integration)** — Product-level overview of providers
+- **[docs/reports/qwen_llama_grok_followup_audit_20260611.md](qwen_llama_grok_followup_audit_20260611.md)** — Audit of the parent track's gaps; follow-up track `qwen_llama_grok_followup_20260611` covers them
+
+---
+
+## Shared OpenAI-Compatible Helper (`src/openai_compatible.py`)
+
+Added 2026-06-06 by the `qwen_llama_grok_integration_20260606` track. Operates on a normalized request/response data structure so 4 OpenAI-compatible vendors (MiniMax, Grok, Llama, DeepSeek) can share the same request building, response parsing, streaming aggregation, tool call detection, and error classification logic.
+
+### Data Structures
+
+```python
+@dataclass(frozen=True)
+class NormalizedResponse:
+    text: str
+    tool_calls: list[dict[str, Any]]
+    usage_input_tokens: int
+    usage_output_tokens: int
+    usage_cache_read_tokens: int
+    usage_cache_creation_tokens: int
+    raw_response: Any
+
+@dataclass
+class OpenAICompatibleRequest:
+    messages: list[dict[str, Any]]
+    model: str
+    temperature: float = 0.0
+    top_p: float = 1.0
+    max_tokens: int = 8192
+    tools: Optional[list[dict[str, Any]]] = None
+    tool_choice: str = "auto"
+    stream: bool = False
+    stream_callback: Optional[Callable[[str], None]] = None
+```
+
+### The Function
+
+```python
+def send_openai_compatible(
+    client: Any,        # openai.OpenAI client with vendor-specific base_url + auth
+    request: OpenAICompatibleRequest,
+    *, capabilities: "VendorCapabilities",  # from src/vendor_capabilities.py
+) -> NormalizedResponse:
+```
+
+The function:
+1. Translates `request.messages` into the OpenAI SDK's `messages` parameter (passthrough — already in OpenAI shape).
+2. Translates `request.tools` if non-None (passthrough for now; future: strip unsupported fields based on `capabilities`).
+3. Calls `client.chat.completions.create(...)` with the right parameters.
+4. If streaming: aggregates chunks; calls `stream_callback(text_chunk)` for each text delta; collects final usage from the last chunk.
+5. If non-streaming: parses the response in one shot.
+6. Returns a `NormalizedResponse` with text, tool calls (in OpenAI shape), usage stats.
+7. On exception: classifies the OpenAI exception and re-raises as `ProviderError`.
+
+### Usage Pattern (per vendor)
+
+```python
+# _send_grok, _send_llama (single-shot placeholders), _send_minimax (with restored tool loop)
+def _send_grok(md_content, user_message, base_dir, file_items=None, discussion_history="", stream=False, ...):
+    client = _ensure_grok_client()  # openai.OpenAI(api_key=..., base_url="https://api.x.ai/v1")
+    with _grok_history_lock:
+        # ... build messages, append user, system + context ...
+        request = OpenAICompatibleRequest(
+            messages=messages, model=_model, stream=stream,
+            stream_callback=stream_callback,
+        )
+        caps = get_capabilities("grok", _model)
+        response = send_openai_compatible(client, request, capabilities=caps)
+        # ... append to history, return response.text ...
+```
+
+### Qwen Adapter (`src/qwen_adapter.py`)
+
+Qwen uses Alibaba's DashScope native SDK (not OpenAI-compatible) because DashScope's OpenAI-compatible mode drops important features (Qwen-Audio, Qwen-Long custom chunking, Qwen-VL-Max enhanced vision). The adapter normalizes DashScope tool format to OpenAI shape via `build_dashscope_tools()` and classifies DashScope exceptions via `classify_dashscope_error()`.
+
+### Llama Multi-Backend
+
+`_send_llama` supports 3 backends via the state globals `_llama_base_url` and `_llama_api_key`:
+- **Ollama** (local): `http://localhost:11434/v1`; no auth
+- **OpenRouter** (cloud aggregator): `https://openrouter.ai/api/v1`
+- **Custom URL** (escape hatch): any OpenAI-compatible endpoint
+
+The local-LLM signal is `_get_llama_cost_tracking()` (returns False for localhost/127.0.0.1).
+
+### Tests
+
+- `tests/test_vendor_capabilities.py` (3 tests): registry lookup, vendor-default fallback, unknown-vendor raises
+- `tests/test_openai_compatible.py` (6 tests): non-streaming, streaming aggregation, tool call detection, vision, error classification, frozen dataclass
 - **[conductor/tracks/nagent_review_20260608/report.md §15 Pitfalls #2 and #4](../conductor/tracks/nagent_review_20260608/report.md)** — Deep-dive on the per-provider history globals and the stateful singleton pattern; future-track candidate for stateless LLMClient
@@ -363,7 +363,7 @@ The file also defines several module-level constants used across the app:

 ```python
 # Provider routing
-PROVIDERS: list[str] = ["gemini", "anthropic", "deepseek", "MiniMax", "gemini-cli"]
+PROVIDERS: list[str] = ["gemini", "anthropic", "gemini_cli", "deepseek", "minimax", "qwen", "grok", "llama"]

 # Tool categories (for Tool Bias)
 TOOL_CATEGORIES: list[str] = [
@@ -0,0 +1,165 @@
+# Qwen/Llama/Grok Follow-Up Audit Report (2026-06-11)
+
+**Date:** 2026-06-11
+**Author:** Tier 2 Tech Lead
+**Subject:** Why a follow-up track is needed after `qwen_llama_grok_integration_20260606` Phase 5
+
+## TL;DR
+
+The parent track shipped 5 of 6 phases with 50/79 tasks done. The Tech Lead **did not surface the gaps at the checkpoints**; the user discovered them only at the Phase 5 checkpoint. The user is right: the Tech Lead's "footnote for now" pattern is bad — it looks like the work was hidden until called out.
+
+**7 categories of gap** are documented here. Each is captured in the new follow-up track `qwen_llama_grok_followup_20260611`.
+
+---
+
+## 1. Phase 5 partial: 1 of 9 UX adaptations shipped
+
+**What shipped:** Adaptation 1 (Screenshot button iff vision) at `src/gui_2.py:3030` + the helper `_get_active_capabilities()` at `src/gui_2.py:733`.
+
+**What didn't ship:** Adaptations 2-9:
+- Tools toggle iff tool_calling
+- Cache panel iff caching
+- Stream progress iff streaming
+- Fetch Models button iff model_discovery
+- Token budget max = context_window
+- Cost panel × 3 (estimate / "Free (local)" for localhost / "—" for other cost_tracking=false)
+
+**The right move:** All 9 at once, OR explicit user-facing "I'm shipping 1 of 9; the other 8 are deferred" BEFORE doing adaptation 1. The Tech Lead did the latter in a footnote, which the user called out as bad UX.
+
+---
+
+## 2. Tool-call loop regression: only MiniMax works
+
+**What shipped:** `_send_minimax` has a working tool loop. The other 7 vendor entry points do not.
+
+| Vendor | Tool loop? | Why |
+|---|---|---|
+| `_send_minimax` | ✅ Works (231 → 75 lines after refactor + tool loop restoration) | Worker did the refactor; I added the tool loop back manually |
+| `_send_qwen` | ❌ Single-shot | Phase 2 worker omitted it (Qwen has DashScope-specific tool format) |
+| `_send_grok` | ❌ Single-shot | Phase 3 worker omitted it (placeholder) |
+| `_send_llama` | ❌ Single-shot | Phase 3 worker omitted it (placeholder) |
+| `_send_anthropic` | ✅ Inline (4-way duplication with the other 3) | Pre-existing pattern |
+| `_send_gemini` | ✅ Inline | Pre-existing pattern |
+| `_send_gemini_cli` | ✅ Inline | Pre-existing pattern |
+| `_send_deepseek` | ✅ Inline | Pre-existing pattern |
+
+**The right move:** Lift the loop into a shared `run_with_tool_loop` helper that takes history management as injected parameters. Apply to all 8 vendors. This is a single-fix, 8-call-site refactor — much smaller than letting the duplication grow.
+
+The Tech Lead caught this at the end of Phase 4 (during the MiniMax refactor) but should have caught it at the end of Phase 2 (when the Qwen worker shipped single-shot) or the end of Phase 3 (when Grok+Llama workers shipped single-shot).
+
+---
+
+## 3. `src/models.py` has a PROVIDERS list — the user is right that this is sprawl
+
+**What's there now:**
+```python
+# src/models.py:79
+PROVIDERS: List[str] = ["gemini", "anthropic", "gemini_cli", "deepseek", "minimax", "qwen", "grok", "llama"]
+```
+
+**The problem:** `src/models.py` is for **MMA data models** (Tickets, Tracks, FileItem, WorkerContext, etc.). The vendor list is an **AI client concern**. The audit script `audit_no_models_config_io.py` enforces config I/O rules; PROVIDERS has no analogous enforcement.
+
+**The right move:** Move PROVIDERS to `src/ai_client.py` (or a new `src/ai_client_providers.py`). Add `scripts/audit_providers_source_of_truth.py` that fails the build if PROVIDERS is declared in models.py.
+
+The Tech Lead justified keeping it in models.py with "the centralized registry pattern" without asking whether models.py was the right home.
+
+---
+
+## 4. `src/ai_client.py` is 2784 lines and growing
+
+**What's there:** 8 vendor entry points (`_send_anthropic`, `_send_gemini`, `_send_gemini_cli`, `_send_deepseek`, `_send_minimax`, `_send_qwen`, `_send_grok`, `_send_llama`) plus all the supporting machinery (client init, history management, error classification, reasoning content extraction).
+
+**The 8 vendors' inline patterns are 70% similar.** Each has:
+- Client init (credentials + SDK setup)
+- History management (per-vendor lock + history list + repair + trim)
+- Message building (system + context + user content)
+- API call (via SDK or HTTP)
+- Tool loop (or single-shot — see gap #2)
+- Reasoning content extraction
+- Error classification
+
+**The right move:** Codepath consolidation. The shared `send_openai_compatible` covers the API call. A future `run_with_tool_loop` covers the tool loop (gap #2). What's left:
+- History management as a `VendorHistory` class or per-vendor thin wrapper
+- Reasoning content extraction as a uniform helper
+- Error classification as a per-HTTP-code helper
+
+Could cut `src/ai_client.py` by 30-40% (~1000 lines).
+
+---
+
+## 5. Local models deserve more emphasis
+
+**What's there now:** Ollama is one of 3 Llama backends (Ollama, OpenRouter, custom_url). The `cost_tracking: False` for localhost is a small signal.
+
+**The user feedback (verbatim):** "I want to put more emphasis and supporting local models and separating local model vending vis online/cloud vendors of models."
+
+**The right architecture:**
+- Add `local: bool` to VendorCapabilities (separate from `cost_tracking`)
+- Native Ollama (`/api/chat`) as the **default** for Llama (not the OpenAI-compatible fallback)
+- Meta Llama API as a 4th backend (the docs URL returned 400 last session; needs re-verification)
+- GUI: "Local Model" badge per-vendor
+- Cost panel: 4th state "Local (no cost)" distinct from "Free (local)" and "—"
+- vLLM, LM Studio, llama.cpp as additional custom-URL backends with discoverable presets
+
+This is a significant priority shift. The follow-up track's Phase 4 leads with this.
+
+---
+
+## 6. V2 matrix field expansion documented but not implemented
+
+**What the spec says (per Grok's consultation):** Add 12 new fields to VendorCapabilities:
+- `local: bool`
+- `reasoning: bool` (xAI `reasoning_effort`, Anthropic extended thinking, Ollama `think`)
+- `structured_output: bool` (response_format / format)
+- `code_execution: bool` (xAI code_interpreter, Anthropic Computer Use, Gemini Code Execution)
+- `web_search: bool` (xAI web_search, Gemini Grounding)
+- `x_search: bool` (xAI X/Twitter search)
+- `file_search: bool` (xAI file_search, Anthropic PDF, Gemini file API)
+- `mcp_support: bool` (xAI mcp_calls, Anthropic MCP)
+- `audio: bool` (Qwen-Audio, Gemini audio)
+- `video: bool` (Gemini video)
+- `grounding: bool` (Gemini Grounding with Google Search)
+- `computer_use: bool` (Anthropic Computer Use)
+
+**What shipped:** 0 of 12. None wired. No UI adaptations.
+
+The follow-up track's Phase 4 lands these.
+
+---
+
+## 7. Anthropic / Gemini / DeepSeek still not on the matrix
+
+**What's there:** These 3 vendors have unique APIs (4-breakpoint caching, genai SDK, raw HTTP) and the migration to the matrix is non-trivial. The follow-up track is documented (`parent spec §13.1.A`) but never scheduled.
+
+**The value:** Anthropic has prompt caching, extended thinking, Computer Use (big UX wins). Gemini has Grounding with Google Search, native video. DeepSeek has reasoning models.
+
+The follow-up track's Phase 5 lands these.
+
+---
+
+## Lessons (Tech Lead Process)
+
+1. **Surface gaps as they appear, not at the checkpoint.** If a task is going to be deferred mid-phase, say so immediately — don't footnote it later.
+2. **Be explicit about architectural deviations.** The `src/models.py` PROVIDERS sprawl should have been raised at Phase 2, not at Phase 5.
+3. **Plan for the test infrastructure before coding.** The tool-loop regression wasn't caught because no test exercised the loop.
+4. **The "footnote for now" pattern is bad UX.** It looks like the work was hidden until called out. Either ship the work or be explicit about deferring it BEFORE doing the work.
+
+## Follow-Up Track
+
+`conductor/tracks/qwen_llama_grok_followup_20260611/` — 5 phases:
+- Phase 1: Tool loop lift (run_with_tool_loop helper for 8 vendors)
+- Phase 2: PROVIDERS move (out of src/models.py)
+- Phase 3: UX adaptations 2-9 (8 of 9 deferred from parent Phase 5)
+- Phase 4: Local-first + matrix v2 expansion (12 new fields)
+- Phase 5: Anthropic / Gemini / DeepSeek migration
+
+## Parent Track Status
+
+`qwen_llama_grok_integration_20260606` is **NOT being archived** (per user directive). It stays open in `conductor/tracks/` for the follow-up to use as a reference. Phase 6 docs are being done now; the track folder remains at the same path.
+
+## See Also
+
+- `conductor/tracks/qwen_llama_grok_followup_20260611/spec.md` — the follow-up spec
+- `conductor/tracks/qwen_llama_grok_followup_20260611/state.toml` — the follow-up state
+- `conductor/tracks/qwen_llama_grok_followup_20260611/TODO.md` — the setup checklist
+- `conductor/tracks/qwen_llama_grok_integration_20260606/` — the parent track