Private
Public Access
0
0

docs(phase-6): update ai_client+models guides; report + follow-up track setup

Phase 6 t6.1 + t6.2 (no archive per user directive):
- docs/guide_ai_client.md: update Overview to mention 8 providers (was 5);
  add 'Shared OpenAI-Compatible Helper' section explaining
  src/openai_compatible.py (NormalizedResponse, OpenAICompatibleRequest,
  send_openai_compatible, usage pattern); document the Qwen adapter
  and Llama multi-backend.
- docs/guide_models.md: update PROVIDERS list to 8 entries (was 5).
- conductor/tracks.md: update the Qwen track entry to reflect
  '50/79 tasks done; Phase 6 in progress; NOT archiving - has follow-up';
  add detailed status note pointing to the follow-up track + audit
  report.
- docs/reports/qwen_llama_grok_followup_audit_20260611.md: NEW report
  explaining why a follow-up is needed (7 categories of gaps; the
  Tech Lead's 'footnote for now' failure mode; the lessons learned).
- conductor/tracks/qwen_llama_grok_followup_20260611/: NEW follow-up
  track setup (spec.md, state.toml, metadata.json, TODO.md).
  5 phases: tool loop lift, PROVIDERS move, UX adaptations 2-9,
  local-first + matrix v2, Anthropic/Gemini/DeepSeek migration.

Phase 6 t6.3 (git mv to archive) and t6.4 (mark Recently Completed)
are NOT applied per user directive: 'we can then doc this we're not
archiving yet, if we have a follow up track I need this one to stay
up because there is still alot todo'.
This commit is contained in:
2026-06-11 09:33:18 -04:00
parent 457255bcd4
commit 691dc584eb
8 changed files with 745 additions and 3 deletions
+1 -1
View File
@@ -16,7 +16,7 @@ Tracks that are unblocked and ready to start. Ordered by **dependency** (blocked
| # | Priority | Track | Status | Blocked By |
|---|---|---|---|---|
| 2 | A | [Qwen, Llama & Grok Vendor Integration + Capability Matrix](#track-qwen-llama-grok-vendor-integration--capability-matrix) | spec ✓, plan pending | **test_infrastructure_hardening_20260609 (merged)** |
| 2 | A | [Qwen, Llama & Grok Vendor Integration + Capability Matrix](#track-qwen-llama-grok-vendor-integration--capability-matrix) | spec ✓, plan ✓, 50/79 tasks done; **Phase 6 in progress (docs); NOT archiving — has follow-up track** | **test_infrastructure_hardening_20260609 (merged)** |
| 3 | A | [Data-Oriented Error Handling (Fleury Pattern)](#track-data-oriented-error-handling-fleury-pattern) | spec ✓, plan ✓, ready to start | startup_speedup, test_batching_refactor, **test_infrastructure_hardening_20260609 (merged)**, qwen_llama_grok |
| 4 | A | [Data Structure Strengthening (Type Aliases + NamedTuples)](#track-data-structure-strengthening-type-aliases--namedtuples) | spec ✓, plan pending | **test_infrastructure_hardening_20260609 (merged)** |
| 5 | A | [MCP Architecture Refactor (Sub-MCP Extraction)](#track-mcp-architecture-refactor-sub-mcp-extraction) | spec ✓, plan pending | test_infrastructure_hardening_20260609 (merged), data_oriented_error_handling, data_structure_strengthening |
@@ -0,0 +1,81 @@
# Track: Qwen, Llama & Grok Follow-Up (Post-Phase 5)
This is a TODO list for setting up the follow-up track. The Tier 2 Tech Lead will execute items in order.
## Status
- [x] Spec drafted: `conductor/tracks/qwen_llama_grok_followup_20260611/spec.md`
- [ ] state.toml initialized
- [ ] metadata.json created
- [ ] Phase 1 ready to start
## Immediate TODOs (in order)
1. **Read parent track state**
- [ ] Read `conductor/tracks/qwen_llama_grok_integration_20260606/state.toml` to confirm Phase 6 is complete
- [ ] Read `conductor/tracks/qwen_llama_grok_integration_20260606/plan.md` and find tasks tagged t6.* to confirm Phase 6 done
2. **Create the follow-up track structure**
- [ ] Create `conductor/tracks/qwen_llama_grok_followup_20260611/state.toml` with 5 phases × ~7 tasks
- [ ] Create `conductor/tracks/qwen_llama_grok_followup_20260611/metadata.json` with verification_criteria
3. **Phase 1: Tool Loop Lift (first concrete work)**
- [ ] Read current tool-loop patterns in `_send_minimax` (231 → 75 lines after refactor) and `_send_anthropic/_send_gemini/_send_gemini_cli/_send_deepseek` (inline loops)
- [ ] Design `run_with_tool_loop(client, request, capabilities, *, pre_tool_callback, qa_callback, patch_callback, base_dir, vendor_name, history_lock, history, trim_func)` helper
- [ ] Write 5 Red tests: no-tool-calls returns immediately, tool-calls dispatch, max-rounds limit, history appending, error-in-tool-call doesn't crash
- [ ] Implement helper in `src/tool_loop.py`
- [ ] Apply to all 8 vendors
- [ ] Audit script `scripts/audit_no_inline_tool_loops.py` to enforce the pattern
- [ ] Verify all 38+ existing tests still pass
- [ ] Phase 1 checkpoint
4. **Phase 2: PROVIDERS Move**
- [ ] Decide: `src/ai_client.py` vs new `src/ai_client_providers.py` (open question in spec)
- [ ] Move PROVIDERS constant
- [ ] Update 5 import sites
- [ ] Add `scripts/audit_providers_source_of_truth.py`
- [ ] Verify all 38+ tests pass
- [ ] Phase 2 checkpoint
5. **Phase 3: UX Adaptations 2-9**
- [ ] Apply each adaptation one at a time, 1-2 per commit
- [ ] Run live_gui tests in batch after each commit
- [ ] Phase 3 checkpoint when all 9 adaptations done
6. **Phase 4: Local-First + Matrix Expansion**
- [ ] Add `local: bool` to VendorCapabilities
- [ ] Native Ollama adapter (verify URL https://docs.ollama.com/api/chat is up)
- [ ] Meta Llama API adapter (verify URL https://llama.developer.meta.com/docs/overview is up — was 400 last session)
- [ ] GUI: "Local Model" badge
- [ ] Add 12 v2 fields to VendorCapabilities
- [ ] Update all vendor registry entries
- [ ] UI adaptations for the new fields
- [ ] Phase 4 checkpoint
7. **Phase 5: Anthropic / Gemini / DeepSeek Migration**
- [ ] Populate Anthropic matrix entries
- [ ] Populate Gemini matrix entries
- [ ] Populate DeepSeek matrix entries
- [ ] UI adaptations
- [ ] Docs + archive
## Pre-Work Prerequisites
Before starting Phase 1, confirm the parent track's Phase 6 is complete:
- `docs/guide_ai_client.md` updated with new vendors, matrix, helper
- `docs/guide_models.md` updated with new PROVIDERS entries
- Parent track folder **stays open** in `conductor/tracks/` (not archived)
- `conductor/tracks.md` reflects active status
## Lessons from Parent Track (apply to this one)
- **Surface gaps as they appear, not at the checkpoint.** If a task is going to be deferred mid-phase, say so immediately — don't footnote it later.
- **Be explicit about architectural deviations.** The `src/models.py` PROVIDERS sprawl should have been raised at Phase 2, not at Phase 5.
- **Plan for the test infrastructure before coding.** The parent track's tool-loop regression wasn't caught because no test exercised the loop. Future work: every helper gets tests BEFORE implementation.
## Status
- T0: Spec drafted (this file) — DONE
- T1: Parent track Phase 6 verification — TODO
- T2: Follow-up track files created — TODO
- T3: Phase 1 (tool loop lift) — TODO
@@ -0,0 +1,79 @@
{
"track_id": "qwen_llama_grok_followup_20260611",
"name": "Qwen/Llama/Grok Follow-Up (tool loop, PROVIDERS move, UX adaptations 2-9, local-first, matrix v2, Anthropic/Gemini/DeepSeek migration)",
"initialized": "2026-06-11",
"owner": "tier2-tech-lead",
"priority": "high",
"status": "active",
"type": "refactor + feature",
"scope": {
"new_files": [
"src/tool_loop.py",
"src/llama_ollama_native.py",
"src/llama_meta_api.py",
"tests/test_tool_loop.py",
"scripts/audit_no_inline_tool_loops.py",
"scripts/audit_providers_source_of_truth.py"
],
"modified_files": [
"src/ai_client.py",
"src/vendor_capabilities.py",
"src/gui_2.py",
"src/models.py",
"tests/test_minimax_provider.py",
"tests/test_grok_provider.py",
"tests/test_llama_provider.py",
"tests/test_qwen_provider.py",
"tests/test_anthropic_provider.py",
"tests/test_gemini_provider.py",
"tests/test_deepseek_provider.py",
"docs/guide_ai_client.md",
"docs/guide_models.md"
]
},
"blocked_by": {
"qwen_llama_grok_integration_20260606": "phase_6_in_progress"
},
"blocks": [
"anthropic_gemini_deepseek_capability_matrix_20260606"
],
"estimated_phases": 5,
"spec": "spec.md",
"plan": "plan.md",
"state": "state.toml",
"todo": "TODO.md",
"priority_order": "A (tool loop lift + PROVIDERS move + UX 2-9) > B (local-first + matrix v2) > C (Anthropic/Gemini/DeepSeek migration)",
"user_directions": [
"2026-06-11: User wants REPORT explaining why a follow-up is needed (gaps in parent track).",
"2026-06-11: User wants LOCAL MODELS prioritized as first-class; current implementation treats Ollama as 'one of 3 backends' which under-emphasizes local.",
"2026-06-11: User wants the source-of-truth sprawl cleaned up (PROVIDERS in models.py is wrong; should be elsewhere).",
"2026-06-11: User wants ai_client.py further codepath consolidation; new files need review."
],
"verification_criteria": [
"src/tool_loop.py:run_with_tool_loop handles no-tool-calls, dispatches tool calls, respects max-rounds, appends to history, doesn't crash on tool error",
"All 8 vendors (_send_minimax, _send_qwen, _send_grok, _send_llama, _send_anthropic, _send_gemini, _send_gemini_cli, _send_deepseek) use run_with_tool_loop",
"scripts/audit_no_inline_tool_loops.py passes (no inline tool loops in any _send_<vendor>)",
"PROVIDERS is no longer declared in src/models.py",
"scripts/audit_providers_source_of_truth.py passes",
"All 9 UX adaptations from parent spec §6 are applied to src/gui_2.py (1 from parent Phase 5 + 8 from this track's Phase 3)",
"src/llama_ollama_native.py: native Ollama adapter replaces OpenAI-compatible for Ollama backend (or used by default)",
"src/llama_meta_api.py: Meta Llama API adapter; new 4th backend",
"src/vendor_capabilities.py: 12 new v2 fields added (local, reasoning, structured_output, code_execution, web_search, x_search, file_search, mcp_support, audio, video, grounding, computer_use)",
"All vendor registry entries updated with the new fields",
"Anthropic matrix entries populated (caching, extended_thinking, pdf, computer_use)",
"Gemini matrix entries populated (caching, grounding, video, audio)",
"DeepSeek matrix entries populated (reasoning, low_cost)",
"GUI: 'Local Model' badge added to AI Settings panel",
"GUI: 4 cost panel states (estimate / 'Free (local)' / '-' / new local-no-cost state)",
"All existing tests still pass (38+ in batch; full suite has pre-existing live_gui flakes)",
"No new threading.Thread calls",
"docs/guide_ai_client.md + docs/guide_models.md updated"
],
"links": {
"parent_track": "conductor/tracks/qwen_llama_grok_integration_20260606/",
"parent_spec": "conductor/tracks/qwen_llama_grok_integration_20260606/spec.md",
"ai_client_guide": "docs/guide_ai_client.md",
"models_guide": "docs/guide_models.md",
"follow_up_audit_report": "docs/reports/qwen_llama_grok_followup_20260611.md (TBD; will be created in Phase 5)"
}
}
@@ -0,0 +1,237 @@
# Track: Qwen, Llama & Grok Follow-Up (Post-Phase 5)
**Status:** Active (initializing)
**Initialized:** 2026-06-11
**Owner:** Tier 2 Tech Lead
**Priority:** High (architectural consolidation + UX payoff; user is rightly concerned that the parent track shipped with gaps)
---
## Why This Track Exists
The parent track `qwen_llama_grok_integration_20260606` (status: 50/79 tasks done, Phase 6 in progress) shipped 5 phases cleanly but **left meaningful gaps** that the Tier 2 Tech Lead did not surface until the Phase 5 checkpoint. This track captures the deferred work, ordered by impact.
**The Tier 2's failure mode** (called out by the user 2026-06-11): "you never even told me until now and then you just say 'oh yeah we're done btw, fuck you' thats what it feels like." Rightly called. This track exists to fix that.
---
## Goals (Priority Order)
| Priority | Goal | Rationale |
|---|---|---|
| **A (architectural)** | Lift the tool-call loop into a shared `run_with_tool_loop()` helper. Apply to all 4 new vendors + the 4 existing vendors. | Today only `_send_minimax` has a working tool loop. Qwen/Grok/Llama are single-shot (regression). Anthropic/Gemini/Gemini-cli/DeepSeek already have inline tool loops (4-way duplication). Lifting gives one place to fix bugs + add new behavior. |
| **A (architectural)** | Move `PROVIDERS` out of `src/models.py`. | `src/models.py` is for MMA data models (Tickets, Tracks, FileItem). The vendor list is an AI client concern. The audit script `audit_no_models_config_io.py` enforces config I/O rules; PROVIDERS has no analogous enforcement. Move to `src/ai_client.py` (or new `src/ai_client_providers.py`); add an audit script that enforces the move. |
| **A (UX payoff)** | Apply the remaining 8 of 9 UX adaptations from parent track spec §6: tools toggle (tool_calling), cache panel (caching), stream progress (streaming), fetch models (model_discovery), token budget max (context_window), cost panel × 3. | The pattern is established (adaptation 1 shipped in parent Phase 5); the helper `_get_active_capabilities()` is in place; the remaining 8 are mechanical applications. |
| **B (local-first)** | Promote local models from "one of 3 backends" to first-class. | Add `local_backend: bool` capability field (separate from `cost_tracking`). Native Ollama (`/api/chat`) as the default for Llama (not the OpenAI-compatible fallback). Add Meta Llama API as a 4th backend. Add a "Local Model" UI badge. |
| **B (matrix expansion)** | Land the v2 matrix fields: `local`, `reasoning`, `structured_output`, `code_execution`, `web_search`, `x_search`, `file_search`, `mcp_support`, `audio`, `video`, `grounding`, `computer_use`. | These are the 12 fields documented in parent spec §3.1.1 after the Grok consultation. None wired today. Each addition is registry + UI adaptation. |
| **C (provider coverage)** | Migrate Anthropic / Gemini / DeepSeek onto the capability matrix. | Anthropic has prompt caching, extended thinking, Computer Use (high-value UX). Gemini has Grounding with Google Search, native video. DeepSeek has reasoning models. None of these capabilities are exposed in the GUI today. |
| **C (codepath consolidation)** | Reduce `src/ai_client.py` line count (currently 2784). | The 8 vendors' inline patterns have grown. Lifting history management, reasoning content extraction, error classification per HTTP code into shared helpers would cut ~30-40% of the file. |
### Non-Goals (this track)
- Not changing the matrix schema (the 7 v1 fields are good; v2 is additive)
- Not changing the shared `send_openai_compatible` helper (it works; the tool loop is separate)
- Not changing the `vendor_capabilities.py` lookup pattern (it works; registry is the source of truth)
- Not adding new vendors (the parent track added Qwen/Grok/Llama; this track only consolidates what's there)
---
## Architecture
### A.1 Tool Loop Lift
Today:
```python
# in _send_minimax (only):
for _round in range(MAX_TOOL_ROUNDS + 2):
request = OpenAICompatibleRequest(...)
response = send_openai_compatible(client, request, capabilities=caps)
if not response.tool_calls: return response.text
results = asyncio.run(_execute_tool_calls_concurrently(response.tool_calls, ...))
# ... append results to history ...
# in _send_qwen, _send_grok, _send_llama: no loop (single-shot, regression)
# in _send_anthropic, _send_gemini, _send_gemini_cli, _send_deepseek: inline loop (4-way duplication)
```
After:
```python
# new src/tool_loop.py:
def run_with_tool_loop(
client, request, capabilities, *,
pre_tool_callback, qa_callback, patch_callback,
base_dir, vendor_name, history_lock, history, trim_func,
) -> str:
"""Wraps send_openai_compatible with a tool-call loop. Works for any
OpenAI-compatible vendor; vendor-specific logic (history mgmt,
trim, message format) is injected via parameters."""
...
# in each _send_<vendor>:
response = run_with_tool_loop(
client=_ensure_<vendor>_client(),
request=OpenAICompatibleRequest(...),
capabilities=get_capabilities(vendor, _model),
pre_tool_callback=..., qa_callback=..., patch_callback=...,
base_dir=base_dir, vendor_name="<vendor>",
history_lock=_<vendor>_history_lock,
history=_<vendor>_history,
trim_func=_(vendor)_trim_history,
)
```
The helper takes history management as injected parameters (each vendor has its own lock and history list). The tool dispatch (`_execute_tool_calls_concurrently`) takes a `vendor_name` string.
### A.2 PROVIDERS Move
Today:
```python
# src/models.py:79
PROVIDERS: List[str] = ["gemini", "anthropic", "gemini_cli", "deepseek", "minimax", "qwen", "grok", "llama"]
```
After:
```python
# src/ai_client.py (new location) or src/ai_client_providers.py (new file)
PROVIDERS: List[str] = ["gemini", "anthropic", "gemini_cli", "deepseek", "minimax", "qwen", "grok", "llama"]
# src/models.py: import from src.ai_client or keep as re-export shim for backward compat
```
The audit script: add `scripts/audit_providers_source_of_truth.py` that verifies PROVIDERS is not declared in `src/models.py`. Fails the build if regressed.
### A.3 UX Adaptations 2-9
Same pattern as the shipped adaptation 1 (Screenshot button iff vision). For each render site:
```python
caps = app._get_active_capabilities()
imgui.begin_disabled(not caps.<field>)
... UI ...
imgui.end_disabled()
if not caps.<field>:
imgui.same_line()
imgui.text_disabled("(reason)")
```
### B.1 Local-First Architecture
- Add `local_backend: bool` to `VendorCapabilities` (default False)
- Set True for Llama (when base_url is localhost/127.0.0.1)
- Native Ollama adapter: `src/llama_ollama_native.py` (separate from openai_compatible)
- Meta Llama API adapter: `src/llama_meta_api.py` (verify docs first)
- GUI: new "Local Model" badge in the AI Settings panel
- Cost panel: 4th state "Local (no cost)" distinct from "Free (local)" and "—"
### B.2 Matrix Expansion (v2)
Add to `VendorCapabilities`:
- `local: bool` (B.1)
- `reasoning: bool` (xAI `reasoning_effort`, Anthropic extended thinking, Ollama `think`)
- `structured_output: bool` (response_format / format)
- `code_execution: bool` (xAI code_interpreter, Anthropic Computer Use, Gemini Code Execution)
- `web_search: bool` (xAI web_search, Gemini Grounding)
- `x_search: bool` (xAI X/Twitter search, xAI-specific)
- `file_search: bool` (xAI file_search, Anthropic PDF, Gemini file API)
- `mcp_support: bool` (xAI mcp_calls, Anthropic MCP)
- `audio: bool` (Qwen-Audio, Gemini audio)
- `video: bool` (Gemini video)
- `grounding: bool` (Gemini Grounding with Google Search)
- `computer_use: bool` (Anthropic Computer Use)
Each new field is a registry update + a UI adaptation. The matrix schema grows; the GUI filters based on the matrix.
### C.1 Anthropic / Gemini / DeepSeek Migration
Per the deferred follow-up track `anthropic_gemini_deepseek_capability_matrix_20260606` (parent spec §13.1.A). The capability matrix entries for these vendors can be populated:
- `anthropic/*` with `caching: True` (prompt caching), `extended_thinking: True`, `pdf: True`, `computer_use: True`
- `gemini/*` with `caching: True` (explicit cache), `grounding: True`, `video: True`, `audio: True`
- `deepseek/*` with `reasoning: True` (R1), `low_cost: True`
The implementations (`_send_anthropic`, `_send_gemini`, `_send_deepseek`) keep their unique per-vendor code paths. The matrix entries are the source of truth for the UI.
---
## Phase Plan (5 phases, 4 weeks of work)
### Phase 1: Tool Loop Lift (1-2 weeks)
- T1.1: Write red tests for `run_with_tool_loop` (5 tests covering: no tool calls returns immediately, tool calls dispatch, max rounds limit, history appending, error in tool call doesn't crash)
- T1.2: Implement `src/tool_loop.py` with `run_with_tool_loop`
- T1.3: Apply to `_send_minimax` (replace inline loop)
- T1.4: Apply to `_send_qwen`, `_send_grok`, `_send_llama` (add the missing loop)
- T1.5: Apply to `_send_anthropic`, `_send_gemini`, `_send_gemini_cli`, `_send_deepseek` (consolidate)
- T1.6: Verify all 8 vendors' existing tests still pass
- T1.7: Audit script `scripts/audit_no_inline_tool_loops.py` to enforce the pattern
### Phase 2: PROVIDERS Move (1 week)
- T2.1: Move `PROVIDERS` to `src/ai_client.py` (or new `src/ai_client_providers.py`)
- T2.2: Update all 5 import sites (gui_2.py, app_controller.py, etc.) to point to new location
- T2.3: Add `scripts/audit_providers_source_of_truth.py` to enforce the move
- T2.4: Verify all 38+ tests pass
### Phase 3: UX Adaptations 2-9 (1-2 weeks)
- T3.1: Apply adaptation 2 (tools toggle iff tool_calling)
- T3.2: Apply adaptation 3 (cache panel iff caching)
- T3.3: Apply adaptation 4 (stream progress iff streaming)
- T3.4: Apply adaptation 5 (fetch models iff model_discovery)
- T3.5: Apply adaptation 6 (token budget max = context_window)
- T3.6: Apply adaptation 7 (cost panel: estimate)
- T3.7: Apply adaptation 8 (cost panel: "Free (local)" for localhost)
- T3.8: Apply adaptation 9 (cost panel: "—" for other cost_tracking=false)
- T3.9: Verify live_gui tests pass
### Phase 4: Local-First + Matrix Expansion (1-2 weeks)
- T4.1: Add `local: bool` to VendorCapabilities; update registry for Llama
- T4.2: Native Ollama adapter (`src/llama_ollama_native.py`); replace OpenAI-compatible for Ollama backend
- T4.3: Meta Llama API adapter (`src/llama_meta_api.py`); add as 4th Llama backend
- T4.4: GUI: "Local Model" badge
- T4.5: Add v2 fields (local, reasoning, structured_output, code_execution, web_search, x_search, file_search, mcp_support, audio, video, grounding, computer_use)
- T4.6: Update all vendor registry entries with the new fields
- T4.7: Add UI adaptations for the new fields (e.g., "Reasoning" toggle, "Code execution" panel)
### Phase 5: Anthropic / Gemini / DeepSeek Migration (1-2 weeks)
- T5.1: Populate Anthropic matrix entries (caching, extended_thinking, pdf, computer_use)
- T5.2: Populate Gemini matrix entries (caching, grounding, video, audio)
- T5.3: Populate DeepSeek matrix entries (reasoning, low_cost)
- T5.4: UI adaptations for the new capabilities
- T5.5: Docs + archive
---
## Testing Strategy
- All new helpers (`run_with_tool_loop`) get TDD: Red tests first, then implementation
- All UX adaptations get a test that verifies the render function reads the capability
- All audit scripts get a self-test (the script can detect its own absence)
- Live_gui tests run in batch (per the docs_sync lessons: bisect in batch, not isolation)
---
## Risks
- **Tool loop lift risk:** Anthropic and Gemini have unique tool-use formats (Anthropic uses `tool_use` blocks; Gemini uses `functionCall`). Lifting requires careful preservation. Mitigation: keep the per-vendor `tool_format_converter` injection as a parameter.
- **PROVIDERS move risk:** 5 import sites to update; some might use `from src.models import PROVIDERS` and break. Mitigation: search-and-replace audit, run full test suite after.
- **UX adaptation risk:** Same as parent Phase 5 — touching 260KB of GUI code is high risk. Mitigation: ship 1-2 per commit, run live_gui batch after each.
---
## Open Questions
1. **`src/ai_client_providers.py` vs `src/ai_client.py`?** Should PROVIDERS go in a new file (clearer separation) or stay in the main ai_client module (less file proliferation)?
2. **Meta Llama API spec verification:** The 400 error on the docs URL last session — is it back up? If not, defer the Meta backend.
3. **Local model as separate UI mode?** Should the GUI have a "Local / Cloud / All" filter on the provider dropdown, or just show the local badge per-vendor?
---
## See Also
- Parent track: `conductor/tracks/qwen_llama_grok_integration_20260606/`
- Parent spec: `conductor/tracks/qwen_llama_grok_integration_20260606/spec.md`
- Parent Phase 5 report: `docs/reports/qwen_llama_grok_integration_20260610.md` (TBD)
- `docs/guide_ai_client.md` — the doc that needs updating in Phase 6 of the parent track
---
## Status
- T0: Spec drafted (this file)
- T1: Phase 1 (tool loop lift) ready to start
@@ -0,0 +1,86 @@
# Track state for qwen_llama_grok_followup_20260611
# Updated by Tier 2 Tech Lead as tasks complete
[meta]
track_id = "qwen_llama_grok_followup_20260611"
name = "Qwen/Llama/Grok Follow-Up (tool loop, PROVIDERS move, UX adaptations 2-9, local-first, matrix v2, Anthropic/Gemini/DeepSeek migration)"
status = "active"
current_phase = 0
last_updated = "2026-06-11"
[blocked_by]
# This follow-up is blocked on the parent track's Phase 6 (docs) completing.
qwen_llama_grok_integration_20260606 = "phase_6_in_progress"
[phases]
phase_1 = { status = "pending", checkpoint_sha = "", name = "Tool loop lift (run_with_tool_loop helper for 8 vendors)" }
phase_2 = { status = "pending", checkpoint_sha = "", name = "PROVIDERS move (out of src/models.py)" }
phase_3 = { status = "pending", checkpoint_sha = "", name = "UX adaptations 2-9 (8 of 9 deferred from parent Phase 5)" }
phase_4 = { status = "pending", checkpoint_sha = "", name = "Local-first + matrix v2 expansion (12 new fields)" }
phase_5 = { status = "pending", checkpoint_sha = "", name = "Anthropic/Gemini/DeepSeek capability matrix migration" }
[tasks]
# Phase 1: Tool loop lift
t1_1 = { status = "pending", commit_sha = "", description = "Read tool-loop patterns in _send_minimax + the 4 inline-loop vendors" }
t1_2 = { status = "pending", commit_sha = "", description = "Design run_with_tool_loop helper signature" }
t1_3 = { status = "pending", commit_sha = "", description = "Red: 5 tests for run_with_tool_loop in tests/test_tool_loop.py" }
t1_4 = { status = "pending", commit_sha = "", description = "Green: implement run_with_tool_loop in src/tool_loop.py" }
t1_5 = { status = "pending", commit_sha = "", description = "Apply to _send_minimax (replace inline loop)" }
t1_6 = { status = "pending", commit_sha = "", description = "Apply to _send_qwen + _send_grok + _send_llama (add missing loop)" }
t1_7 = { status = "pending", commit_sha = "", description = "Apply to _send_anthropic + _send_gemini + _send_gemini_cli + _send_deepseek (consolidate inline)" }
t1_8 = { status = "pending", commit_sha = "", description = "Add scripts/audit_no_inline_tool_loops.py" }
t1_9 = { status = "pending", commit_sha = "", description = "Phase 1 checkpoint + git note" }
# Phase 2: PROVIDERS move
t2_1 = { status = "pending", commit_sha = "", description = "Decide: src/ai_client.py vs new src/ai_client_providers.py" }
t2_2 = { status = "pending", commit_sha = "", description = "Move PROVIDERS to new location" }
t2_3 = { status = "pending", commit_sha = "", description = "Update 5 import sites" }
t2_4 = { status = "pending", commit_sha = "", description = "Add scripts/audit_providers_source_of_truth.py" }
t2_5 = { status = "pending", commit_sha = "", description = "Phase 2 checkpoint + git note" }
# Phase 3: UX adaptations 2-9
t3_1 = { status = "pending", commit_sha = "", description = "Adaptation 2: tools toggle iff tool_calling" }
t3_2 = { status = "pending", commit_sha = "", description = "Adaptation 3: cache panel iff caching" }
t3_3 = { status = "pending", commit_sha = "", description = "Adaptation 4: stream progress iff streaming" }
t3_4 = { status = "pending", commit_sha = "", description = "Adaptation 5: fetch models iff model_discovery" }
t3_5 = { status = "pending", commit_sha = "", description = "Adaptation 6: token budget max = context_window" }
t3_6 = { status = "pending", commit_sha = "", description = "Adaptation 7: cost panel: estimate" }
t3_7 = { status = "pending", commit_sha = "", description = "Adaptation 8: cost panel: 'Free (local)' for localhost" }
t3_8 = { status = "pending", commit_sha = "", description = "Adaptation 9: cost panel: '-' for other cost_tracking=false" }
t3_9 = { status = "pending", commit_sha = "", description = "Phase 3 checkpoint + git note" }
# Phase 4: Local-first + matrix v2
t4_1 = { status = "pending", commit_sha = "", description = "Add local: bool to VendorCapabilities" }
t4_2 = { status = "pending", commit_sha = "", description = "Native Ollama adapter src/llama_ollama_native.py" }
t4_3 = { status = "pending", commit_sha = "", description = "Meta Llama API adapter src/llama_meta_api.py" }
t4_4 = { status = "pending", commit_sha = "", description = "GUI: 'Local Model' badge" }
t4_5 = { status = "pending", commit_sha = "", description = "Add 12 v2 fields to VendorCapabilities" }
t4_6 = { status = "pending", commit_sha = "", description = "Update all vendor registry entries" }
t4_7 = { status = "pending", commit_sha = "", description = "UI adaptations for new fields (reasoning toggle, code execution panel, etc.)" }
t4_8 = { status = "pending", commit_sha = "", description = "Phase 4 checkpoint + git note" }
# Phase 5: Anthropic / Gemini / DeepSeek migration
t5_1 = { status = "pending", commit_sha = "", description = "Populate Anthropic matrix entries (caching, extended_thinking, pdf, computer_use)" }
t5_2 = { status = "pending", commit_sha = "", description = "Populate Gemini matrix entries (caching, grounding, video, audio)" }
t5_3 = { status = "pending", commit_sha = "", description = "Populate DeepSeek matrix entries (reasoning, low_cost)" }
t5_4 = { status = "pending", commit_sha = "", description = "UI adaptations for new capabilities" }
t5_5 = { status = "pending", commit_sha = "", description = "Phase 5 docs + archive" }
[verification]
phase_1_tool_loop_lifted = false
phase_2_providers_moved = false
phase_3_all_9_ux_adaptations = false
phase_4_local_first_and_matrix_v2 = false
phase_5_anthropic_gemini_deepseek_matrix = false
full_test_suite_passes = false
no_inline_tool_loops = false
no_providers_in_models_py = false
[open_questions]
# Phase 4
where_should_providers_live = "src/ai_client.py (existing file) or new src/ai_client_providers.py (new file)?"
[local_first_priority]
# Per user feedback 2026-06-11: emphasize local models as first-class
# vs cloud/online vendors. Add UI badge, distinct cost state, native Ollama.
local_model_as_first_class = true
native_ollama_default_for_llama = true
meta_llama_api_4th_backend = true
local_badge_in_gui = true
distinct_cost_state_for_local = true
+95 -1
View File
@@ -6,10 +6,17 @@
## Overview
`src/ai_client.py` (~116KB) is the **unified LLM client** for 5 providers. It abstracts the differences between providers (Gemini, Anthropic, DeepSeek, MiniMax, Gemini CLI) behind a single `send()` function.
`src/ai_client.py` (~116KB) is the **unified LLM client** for 8 providers. It abstracts the differences between providers (Gemini, Anthropic, DeepSeek, MiniMax, Gemini CLI, Qwen, Grok, Llama) behind a single `send()` function.
The module is a **stateful singleton** — all provider state is held in module-level globals. There is no class wrapping; the module itself is the abstraction layer.
The 8 providers split into 3 API shapes:
- **Native SDK**: Gemini (google-genai), Anthropic (anthropic), Qwen (DashScope)
- **OpenAI-compatible**: MiniMax, Grok, Llama (Ollama/OpenRouter/custom), DeepSeek
- **Subprocess**: Gemini CLI
The OpenAI-compatible vendors all call the shared helper in `src/openai_compatible.py` (added 2026-06-06 by the `qwen_llama_grok_integration_20260606` track; see "Shared OpenAI-Compatible Helper" section below). The MiniMax provider's `_send_minimax` was refactored to use this helper (Phase 4 of the same track, 231 → 75 lines, 68% reduction).
---
## Module-Level Imports
@@ -430,4 +437,91 @@ Gated by env var (e.g., `RUN_REAL_AI_TESTS=1`). Hits the real API. Not in defaul
- **[guide_state_lifecycle.md](guide_state_lifecycle.md)** — The per-provider history globals (`_anthropic_history`, etc.) are managed here; their locking and reset behavior is documented
- **[guide_context_aggregation.md](guide_context_aggregation.md)** — The `aggregate.py` pipeline that produces the markdown the AI client sends
- **[conductor/product.md](../conductor/product.md#multi-provider-integration)** — Product-level overview of providers
- **[docs/reports/qwen_llama_grok_followup_audit_20260611.md](qwen_llama_grok_followup_audit_20260611.md)** — Audit of the parent track's gaps; follow-up track `qwen_llama_grok_followup_20260611` covers them
---
## Shared OpenAI-Compatible Helper (`src/openai_compatible.py`)
Added 2026-06-06 by the `qwen_llama_grok_integration_20260606` track. Operates on a normalized request/response data structure so 4 OpenAI-compatible vendors (MiniMax, Grok, Llama, DeepSeek) can share the same request building, response parsing, streaming aggregation, tool call detection, and error classification logic.
### Data Structures
```python
@dataclass(frozen=True)
class NormalizedResponse:
text: str
tool_calls: list[dict[str, Any]]
usage_input_tokens: int
usage_output_tokens: int
usage_cache_read_tokens: int
usage_cache_creation_tokens: int
raw_response: Any
@dataclass
class OpenAICompatibleRequest:
messages: list[dict[str, Any]]
model: str
temperature: float = 0.0
top_p: float = 1.0
max_tokens: int = 8192
tools: Optional[list[dict[str, Any]]] = None
tool_choice: str = "auto"
stream: bool = False
stream_callback: Optional[Callable[[str], None]] = None
```
### The Function
```python
def send_openai_compatible(
client: Any, # openai.OpenAI client with vendor-specific base_url + auth
request: OpenAICompatibleRequest,
*, capabilities: "VendorCapabilities", # from src/vendor_capabilities.py
) -> NormalizedResponse:
```
The function:
1. Translates `request.messages` into the OpenAI SDK's `messages` parameter (passthrough — already in OpenAI shape).
2. Translates `request.tools` if non-None (passthrough for now; future: strip unsupported fields based on `capabilities`).
3. Calls `client.chat.completions.create(...)` with the right parameters.
4. If streaming: aggregates chunks; calls `stream_callback(text_chunk)` for each text delta; collects final usage from the last chunk.
5. If non-streaming: parses the response in one shot.
6. Returns a `NormalizedResponse` with text, tool calls (in OpenAI shape), usage stats.
7. On exception: classifies the OpenAI exception and re-raises as `ProviderError`.
### Usage Pattern (per vendor)
```python
# _send_grok, _send_llama (single-shot placeholders), _send_minimax (with restored tool loop)
def _send_grok(md_content, user_message, base_dir, file_items=None, discussion_history="", stream=False, ...):
client = _ensure_grok_client() # openai.OpenAI(api_key=..., base_url="https://api.x.ai/v1")
with _grok_history_lock:
# ... build messages, append user, system + context ...
request = OpenAICompatibleRequest(
messages=messages, model=_model, stream=stream,
stream_callback=stream_callback,
)
caps = get_capabilities("grok", _model)
response = send_openai_compatible(client, request, capabilities=caps)
# ... append to history, return response.text ...
```
### Qwen Adapter (`src/qwen_adapter.py`)
Qwen uses Alibaba's DashScope native SDK (not OpenAI-compatible) because DashScope's OpenAI-compatible mode drops important features (Qwen-Audio, Qwen-Long custom chunking, Qwen-VL-Max enhanced vision). The adapter normalizes DashScope tool format to OpenAI shape via `build_dashscope_tools()` and classifies DashScope exceptions via `classify_dashscope_error()`.
### Llama Multi-Backend
`_send_llama` supports 3 backends via the state globals `_llama_base_url` and `_llama_api_key`:
- **Ollama** (local): `http://localhost:11434/v1`; no auth
- **OpenRouter** (cloud aggregator): `https://openrouter.ai/api/v1`
- **Custom URL** (escape hatch): any OpenAI-compatible endpoint
The local-LLM signal is `_get_llama_cost_tracking()` (returns False for localhost/127.0.0.1).
### Tests
- `tests/test_vendor_capabilities.py` (3 tests): registry lookup, vendor-default fallback, unknown-vendor raises
- `tests/test_openai_compatible.py` (6 tests): non-streaming, streaming aggregation, tool call detection, vision, error classification, frozen dataclass
- **[conductor/tracks/nagent_review_20260608/report.md §15 Pitfalls #2 and #4](../conductor/tracks/nagent_review_20260608/report.md)** — Deep-dive on the per-provider history globals and the stateful singleton pattern; future-track candidate for stateless LLMClient
+1 -1
View File
@@ -363,7 +363,7 @@ The file also defines several module-level constants used across the app:
```python
# Provider routing
PROVIDERS: list[str] = ["gemini", "anthropic", "deepseek", "MiniMax", "gemini-cli"]
PROVIDERS: list[str] = ["gemini", "anthropic", "gemini_cli", "deepseek", "minimax", "qwen", "grok", "llama"]
# Tool categories (for Tool Bias)
TOOL_CATEGORIES: list[str] = [
@@ -0,0 +1,165 @@
# Qwen/Llama/Grok Follow-Up Audit Report (2026-06-11)
**Date:** 2026-06-11
**Author:** Tier 2 Tech Lead
**Subject:** Why a follow-up track is needed after `qwen_llama_grok_integration_20260606` Phase 5
## TL;DR
The parent track shipped 5 of 6 phases with 50/79 tasks done. The Tech Lead **did not surface the gaps at the checkpoints**; the user discovered them only at the Phase 5 checkpoint. The user is right: the Tech Lead's "footnote for now" pattern is bad — it looks like the work was hidden until called out.
**7 categories of gap** are documented here. Each is captured in the new follow-up track `qwen_llama_grok_followup_20260611`.
---
## 1. Phase 5 partial: 1 of 9 UX adaptations shipped
**What shipped:** Adaptation 1 (Screenshot button iff vision) at `src/gui_2.py:3030` + the helper `_get_active_capabilities()` at `src/gui_2.py:733`.
**What didn't ship:** Adaptations 2-9:
- Tools toggle iff tool_calling
- Cache panel iff caching
- Stream progress iff streaming
- Fetch Models button iff model_discovery
- Token budget max = context_window
- Cost panel × 3 (estimate / "Free (local)" for localhost / "—" for other cost_tracking=false)
**The right move:** All 9 at once, OR explicit user-facing "I'm shipping 1 of 9; the other 8 are deferred" BEFORE doing adaptation 1. The Tech Lead did the latter in a footnote, which the user called out as bad UX.
---
## 2. Tool-call loop regression: only MiniMax works
**What shipped:** `_send_minimax` has a working tool loop. The other 7 vendor entry points do not.
| Vendor | Tool loop? | Why |
|---|---|---|
| `_send_minimax` | ✅ Works (231 → 75 lines after refactor + tool loop restoration) | Worker did the refactor; I added the tool loop back manually |
| `_send_qwen` | ❌ Single-shot | Phase 2 worker omitted it (Qwen has DashScope-specific tool format) |
| `_send_grok` | ❌ Single-shot | Phase 3 worker omitted it (placeholder) |
| `_send_llama` | ❌ Single-shot | Phase 3 worker omitted it (placeholder) |
| `_send_anthropic` | ✅ Inline (4-way duplication with the other 3) | Pre-existing pattern |
| `_send_gemini` | ✅ Inline | Pre-existing pattern |
| `_send_gemini_cli` | ✅ Inline | Pre-existing pattern |
| `_send_deepseek` | ✅ Inline | Pre-existing pattern |
**The right move:** Lift the loop into a shared `run_with_tool_loop` helper that takes history management as injected parameters. Apply to all 8 vendors. This is a single-fix, 8-call-site refactor — much smaller than letting the duplication grow.
The Tech Lead caught this at the end of Phase 4 (during the MiniMax refactor) but should have caught it at the end of Phase 2 (when the Qwen worker shipped single-shot) or the end of Phase 3 (when Grok+Llama workers shipped single-shot).
---
## 3. `src/models.py` has a PROVIDERS list — the user is right that this is sprawl
**What's there now:**
```python
# src/models.py:79
PROVIDERS: List[str] = ["gemini", "anthropic", "gemini_cli", "deepseek", "minimax", "qwen", "grok", "llama"]
```
**The problem:** `src/models.py` is for **MMA data models** (Tickets, Tracks, FileItem, WorkerContext, etc.). The vendor list is an **AI client concern**. The audit script `audit_no_models_config_io.py` enforces config I/O rules; PROVIDERS has no analogous enforcement.
**The right move:** Move PROVIDERS to `src/ai_client.py` (or a new `src/ai_client_providers.py`). Add `scripts/audit_providers_source_of_truth.py` that fails the build if PROVIDERS is declared in models.py.
The Tech Lead justified keeping it in models.py with "the centralized registry pattern" without asking whether models.py was the right home.
---
## 4. `src/ai_client.py` is 2784 lines and growing
**What's there:** 8 vendor entry points (`_send_anthropic`, `_send_gemini`, `_send_gemini_cli`, `_send_deepseek`, `_send_minimax`, `_send_qwen`, `_send_grok`, `_send_llama`) plus all the supporting machinery (client init, history management, error classification, reasoning content extraction).
**The 8 vendors' inline patterns are 70% similar.** Each has:
- Client init (credentials + SDK setup)
- History management (per-vendor lock + history list + repair + trim)
- Message building (system + context + user content)
- API call (via SDK or HTTP)
- Tool loop (or single-shot — see gap #2)
- Reasoning content extraction
- Error classification
**The right move:** Codepath consolidation. The shared `send_openai_compatible` covers the API call. A future `run_with_tool_loop` covers the tool loop (gap #2). What's left:
- History management as a `VendorHistory` class or per-vendor thin wrapper
- Reasoning content extraction as a uniform helper
- Error classification as a per-HTTP-code helper
Could cut `src/ai_client.py` by 30-40% (~1000 lines).
---
## 5. Local models deserve more emphasis
**What's there now:** Ollama is one of 3 Llama backends (Ollama, OpenRouter, custom_url). The `cost_tracking: False` for localhost is a small signal.
**The user feedback (verbatim):** "I want to put more emphasis and supporting local models and separating local model vending vis online/cloud vendors of models."
**The right architecture:**
- Add `local: bool` to VendorCapabilities (separate from `cost_tracking`)
- Native Ollama (`/api/chat`) as the **default** for Llama (not the OpenAI-compatible fallback)
- Meta Llama API as a 4th backend (the docs URL returned 400 last session; needs re-verification)
- GUI: "Local Model" badge per-vendor
- Cost panel: 4th state "Local (no cost)" distinct from "Free (local)" and "—"
- vLLM, LM Studio, llama.cpp as additional custom-URL backends with discoverable presets
This is a significant priority shift. The follow-up track's Phase 4 leads with this.
---
## 6. V2 matrix field expansion documented but not implemented
**What the spec says (per Grok's consultation):** Add 12 new fields to VendorCapabilities:
- `local: bool`
- `reasoning: bool` (xAI `reasoning_effort`, Anthropic extended thinking, Ollama `think`)
- `structured_output: bool` (response_format / format)
- `code_execution: bool` (xAI code_interpreter, Anthropic Computer Use, Gemini Code Execution)
- `web_search: bool` (xAI web_search, Gemini Grounding)
- `x_search: bool` (xAI X/Twitter search)
- `file_search: bool` (xAI file_search, Anthropic PDF, Gemini file API)
- `mcp_support: bool` (xAI mcp_calls, Anthropic MCP)
- `audio: bool` (Qwen-Audio, Gemini audio)
- `video: bool` (Gemini video)
- `grounding: bool` (Gemini Grounding with Google Search)
- `computer_use: bool` (Anthropic Computer Use)
**What shipped:** 0 of 12. None wired. No UI adaptations.
The follow-up track's Phase 4 lands these.
---
## 7. Anthropic / Gemini / DeepSeek still not on the matrix
**What's there:** These 3 vendors have unique APIs (4-breakpoint caching, genai SDK, raw HTTP) and the migration to the matrix is non-trivial. The follow-up track is documented (`parent spec §13.1.A`) but never scheduled.
**The value:** Anthropic has prompt caching, extended thinking, Computer Use (big UX wins). Gemini has Grounding with Google Search, native video. DeepSeek has reasoning models.
The follow-up track's Phase 5 lands these.
---
## Lessons (Tech Lead Process)
1. **Surface gaps as they appear, not at the checkpoint.** If a task is going to be deferred mid-phase, say so immediately — don't footnote it later.
2. **Be explicit about architectural deviations.** The `src/models.py` PROVIDERS sprawl should have been raised at Phase 2, not at Phase 5.
3. **Plan for the test infrastructure before coding.** The tool-loop regression wasn't caught because no test exercised the loop.
4. **The "footnote for now" pattern is bad UX.** It looks like the work was hidden until called out. Either ship the work or be explicit about deferring it BEFORE doing the work.
## Follow-Up Track
`conductor/tracks/qwen_llama_grok_followup_20260611/` — 5 phases:
- Phase 1: Tool loop lift (run_with_tool_loop helper for 8 vendors)
- Phase 2: PROVIDERS move (out of src/models.py)
- Phase 3: UX adaptations 2-9 (8 of 9 deferred from parent Phase 5)
- Phase 4: Local-first + matrix v2 expansion (12 new fields)
- Phase 5: Anthropic / Gemini / DeepSeek migration
## Parent Track Status
`qwen_llama_grok_integration_20260606` is **NOT being archived** (per user directive). It stays open in `conductor/tracks/` for the follow-up to use as a reference. Phase 6 docs are being done now; the track folder remains at the same path.
## See Also
- `conductor/tracks/qwen_llama_grok_followup_20260611/spec.md` — the follow-up spec
- `conductor/tracks/qwen_llama_grok_followup_20260611/state.toml` — the follow-up state
- `conductor/tracks/qwen_llama_grok_followup_20260611/TODO.md` — the setup checklist
- `conductor/tracks/qwen_llama_grok_integration_20260606/` — the parent track