Documents the side-track surfaced during Phase 2 of
qwen_llama_grok_followup_20260611: src/models.py is bloated
with ~10 non-MMA types (Tool, ToolPreset, BiasProfile,
MCPConfiguration, ContextPreset, RAGConfig, Persona,
ExternalEditorConfig, FileItem, ThinkingSegment) that
should live in their parent modules per the HARD RULE.
The report captures:
- Evidence: which types, lines, target modules
- Why it matters: PROVIDERS move had to use __getattr__
to break a circular import that wouldn't have existed
if ToolPreset lived in src/ai_client.py
- Proposed move map (10 types)
- Prerequisites (1-6)
- Estimated scope: 3-5 days
- Open questions for the user
- Linkage to the follow-up track and the broader
deferred_work list
NOT EXECUTED. User decision: proceed to Phase 3 of the
follow-up. This report is the next agent's reference
when the namespace cleanup track is eventually picked up.
Phase 2 (PROVIDERS move out of src/models.py) is now complete.
The phase checkpoint is commit 7b24ee9 (the empty 'Phase 2
complete' commit). The audit report is attached as a git
note on that commit.
state.toml updates:
- phase_2.status pending -> completed; checkpoint_sha 7b24ee9
- t2_1 pending -> completed; commit 74c3b6b2 (tied to the
PROVIDERS move commit since the location decision was
resolved in that commit's body)
- phase_3.status pending -> in_progress (next)
5 of 5 Phase 2 tasks shipped:
- t2_1: location decision (src/ai_client.py per HARD RULE)
- t2_2: PROVIDERS moved + re-export via __getattr__
- t2_3: 4 import sites updated
- t2_4: audit script added
- t2_5: checkpoint + git note
Side-track surfaced (not in scope for Phase 2): src/models.py
is bloated with non-MMA types. Proposed as
'namespace_cleanup_20260611' track in the deferred_work
section; user to decide whether to side-track before Phase 3
or proceed to UX adaptations first.
Phase 2 ships:
- PROVIDERS lives in src/ai_client.py:56 (canonical home for
AI-client constants per the HARD RULE on src/ files)
- src/models.py keeps a __getattr__ re-export (PEP 562) for
backward compat; lazy-loaded to break the circular import
(src.ai_client imports ToolPreset/BiasProfile/Tool from
models at line 50, so a top-level 'from src.ai_client
import PROVIDERS' would deadlock)
- 4 call sites in src/app_controller.py:3093 and
src/gui_2.py:{2293,2849,5377} updated from
models.PROVIDERS to ai_client.PROVIDERS (direct lookup,
no per-call __getattr__ cost)
- Stale tests/test_provider_curation.py updated from 5 to
8 providers
- New test tests/test_providers_source_of_truth.py asserts
the re-export + object identity
- New audit scripts/audit_providers_source_of_truth.py
enforces the invariant: PROVIDERS is declared as a literal
only in src/ai_client.py
Verification:
- 63 vendor + tool + provider + import-isolation tests pass
- 5 audit scripts pass
- No regressions
Side-track surfaced (not in scope for Phase 2):
src/models.py is bloated with non-MMA types
(Tool/ToolPreset/BiasProfile/MCPConfiguration/ContextPreset/
Persona/RAGConfig/ExternalEditorConfig/ThinkingSegment/etc.)
that belong in their respective sub-system modules per the
HARD RULE. This is a separate refactor track — proposed as
'namespace_cleanup_20260611' in the follow-up track's
deferred_work section. Should be elevated to its own track
before Phase 3 (UX adaptations) to keep the codebase
maintainable.
Commits in this phase:
- 74c3b6b2: move PROVIDERS to src/ai_client.py; re-export
- 6c6a4aef: update 4 import sites
- be505605: add audit script
- <this> (empty): Phase 2 checkpoint
Phase 2 task 2.4 (the script part). The script enforces:
PROVIDERS is declared as a literal only in src/ai_client.py.
The __getattr__ re-export in src/models.py is allowed (it
lazy-imports, not a literal declaration).
Catches the literal pattern 'PROVIDERS: List[str] = ['
specifically, which the __getattr__ re-export does not
match.
OK: passes against current state where PROVIDERS is
declared only in src/ai_client.py:56.
Phase 2 tasks 2.3 (update 4 import sites) + 2.4 (audit script).
The 4 call sites in src/app_controller.py:3093 and src/gui_2.py
{2293, 2849, 5377} were using models.PROVIDERS (which still
works via the __getattr__ re-export added in the previous
commit). Updated them to use ai_client.PROVIDERS directly:
- Models.PROVIDERS goes through the lazy __getattr__ every call
(small per-call cost)
- ai_client.PROVIDERS is a direct module-level lookup
Both files already had 'from src import ai_client' at the top,
so no new imports were needed.
scripts/audit_providers_source_of_truth.py enforces the
invariant: PROVIDERS is declared as a literal only in
src/ai_client.py. Catches accidental declarations creeping
back into src/models.py or other modules. Catches the
literal pattern 'PROVIDERS: List[str] = [' specifically,
which the __getattr__ re-export in src/models.py does not
match (it's 'from src.ai_client import PROVIDERS').
All 5 audit scripts pass:
- audit_main_thread_imports.py
- audit_weak_types.py
- audit_no_models_config_io.py
- audit_no_inline_tool_loops.py
- audit_providers_source_of_truth.py (new)
63 vendor + tool + provider + import-isolation tests pass.
Phase 2 tasks 2.1 + 2.2 + 2.3a of the follow-up track.
PROVIDERS now lives in src/ai_client.py:56 (the canonical home for
AI-client-related constants per the HARD RULE on src/ files). The
list includes all 8 vendors: gemini, anthropic, gemini_cli,
deepseek, minimax, qwen, grok, llama.
Backward compat: src/models.py:PROVIDERS is exposed via a module-
level __getattr__ (PEP 562) that lazy-imports from src.ai_client.
The lazy approach was needed because src.ai_client imports
ToolPreset/BiasProfile/Tool from src.models at line 50, so a
top-level 'from src.ai_client import PROVIDERS' in models.py
would deadlock. Adding a branch to the existing __getattr__
in models.py (which also handles pydantic class factories) is
the surgical fix.
tests/test_provider_curation.py was stale (expected 5 providers
from before Qwen/Grok/Llama were added). Updated to 8.
New test: tests/test_providers_source_of_truth.py asserts:
- src.ai_client.PROVIDERS exists and matches the 8-provider list
- src.models.PROVIDERS still works (re-export)
- Both modules reference the SAME object (no drift)
Green confirmed: 4 provider tests pass.
Task 1.8 (the plan's numbering: 'Add audit script'). Audit checks
that no _send_<vendor> in src/ai_client.py contains an inline
'for round_idx in range(MAX_TOOL_ROUNDS' loop. The audit excludes
the 4 vendored-call-path vendors (anthropic, gemini, gemini_native,
deepseek) which are documented in state.toml's deferred_work
section as future work (they use their own SDKs and need
separate per-vendor conversion to OpenAICompatibleRequest).
state.toml:
- t1_7 (Apply to 4 inline-loop vendors): completed for
_send_gemini_cli only. Anthropic + Gemini + DeepSeek deferred.
- t1_8 (Add audit script): in_progress.
- t1_7 reuses commit 4748d134 (the send_func + on_pre_dispatch
refactor that introduced the new helper pattern for
vendored call paths).
OK: audit passes against the current 4 OpenAI-compat vendors
(minimax, grok, llama, qwen still uses _dashscope_call but
has no inline loop) + gemini_cli.
The follow-up track's tool-loop refactor moved
'from src.openai_compatible import send_openai_compatible,
OpenAICompatibleRequest, NormalizedResponse' to MODULE level
in src/ai_client.py. This violates the startup_speedup_20260606
invariant: heavy SDKs must not be loaded at module level because
ai_client.py is on the main thread's import chain.
src/openai_compatible.py line 5 does 'from openai import
OpenAIError, ...', so any import from it triggers the openai SDK
to load. test_ai_client_does_not_import_openai_at_module_level
guards this invariant and was failing.
Fix: move the imports back to local scope inside the function
bodies that need them:
- _default_send closure inside run_with_tool_loop
(imports send_openai_compatible)
- _send_grok (imports OpenAICompatibleRequest)
- _send_minimax (imports OpenAICompatibleRequest)
- _send_llama (imports OpenAICompatibleRequest)
- _send_gemini_cli (imports OpenAICompatibleRequest + NormalizedResponse)
Test patches: tests that previously patched
'src.ai_client.send_openai_compatible' now patch
'src.openai_compatible.send_openai_compatible' (the actual
import source). _execute_tool_calls_concurrently patches
unchanged (it's defined in src/ai_client.py itself).
Green confirmed: 62 vendor + tool + import-isolation tests
pass. 0 regressions.
Task 1.7 of the follow-up track. Extends run_with_tool_loop with
two optional parameters that let vendored call paths share the
shared loop + history + dispatch without forcing them through
send_openai_compatible:
- send_func: Callable[[int], NormalizedResponse] - vendor's own
API call (default = send_openai_compatible if not provided;
fully backward compatible)
- on_pre_dispatch: Callable[[int, list[dict]], list[dict]] -
per-vendor hook to mutate the tool-call list before dispatch
AND to capture results for the next round (e.g. Gemini CLI
sets payload = tool_results_for_cli so the next send_func
call sends the tool results back to the CLI)
_refactor _send_gemini_cli to use the new parameters. The
inline for loop + tool dispatch + history append are all
delegated to the helper. The vendor's send_func closure
handles:
- adapter.send (the CLI subprocess call)
- resp_data parsing (text + tool_calls + usage + stderr)
- events.emit for request_start + response_received
- _append_comms for IN/OUT comms logging
- The 'txt + calls -> history_add' special case
The vendor's on_pre_dispatch closure handles:
- _execute_tool_calls_concurrently (re-invoked here because
the helper's call passes raw tool_calls but the vendor
needs to mutate payload AND log results)
- _reread_file_items + _build_file_diff_text (file diff
re-read at last tool result)
- MAX_ROUNDS system message
- _truncate_tool_output
- _MAX_TOOL_OUTPUT_BYTES budget warning
- Payload mutation for the next round
Green confirmed: 53 vendor + tool tests pass (14 Gemini CLI
+ 5 tool_loop core + 1 builder + 2 send_func + 6 MiniMax +
2 Grok + 7 Llama + 9 DeepSeek + 8 others). No regressions.
Task 1.7 (apply run_with_tool_loop to anthropic + gemini + gemini_cli
+ deepseek) cannot proceed as a single task. The 4 vendors use their
own vendored call paths, not send_openai_compatible:
- _send_deepseek: requests.post with custom payload + custom streaming
parser + custom comms logging + budget enforcement
- _send_gemini: google-genai SDK streaming + custom types.Tool handling
- _send_gemini_cli: subprocess JSONL parsing via GeminiCliAdapter
- _send_anthropic: anthropic SDK + custom cache control + history
trimming
run_with_tool_loop is hard-coded to send_openai_compatible. Each
vendor needs to be refactored to produce OpenAICompatibleRequest
first (analogous to how parent Phase 3 converted Grok/Llama). That's
a multi-day refactor per vendor.
Per the per-task decision protocol in conductor/workflow.md
('plan approach doesn't fit'): STOP and report. Recommendation
in the deferred_work section: split Task 1.7 into 4 per-vendor
tasks under a new 'Phase 1.5 vendor-conversion-to-OpenAICompatibleRequest'
phase. The current Phase 1 milestone ('helper exists + 3 vendors
applied') is still meaningful and worth checkpointing as-is.
Task 1.6 of the follow-up track. _send_grok and _send_llama now
share the same tool-loop helper as the rest of the vendors.
Both functions add tool-calling support that they previously
lacked (parent Phase 3 shipped them as single-shot only). The
plan's Task 1.6 title says 'add missing loop' which matches
this scope. tool_choice='auto' if tools else 'auto' matches
the MiniMax pattern.
Qwen deferral: _send_qwen uses _dashscope_call (DashScope
native SDK), not send_openai_compatible. run_with_tool_loop
hard-codes send_openai_compatible. Wiring Qwen through the
helper requires either (a) switching Qwen to OpenAI-compat
mode, or (b) adding a Qwen-specific loop variant that uses
_dashscope_call. Both are non-trivial and out of scope for
Task 1.6. Tracked as a follow-up note in the state.toml.
Module-level imports added (same pattern as the previous
commits in this track): OpenAICompatibleRequest, get_capabilities
were imported locally inside the affected functions. Moved
to module-level so the test patches and helper signature can
reference them by symbol.
Green confirmed: 51 vendor + tool tests pass.
Task 1.3 of the follow-up track. _send_minimax now uses
run_with_tool_loop with a per-round request_builder callback
that re-reads _minimax_history under _minimax_history_lock.
The plan's Task 1.3 example builds the request once before the
loop. That would break MiniMax tool flows because the API
would not see the tool results appended to _minimax_history
on later rounds. The fix: extend run_with_tool_loop's 2nd arg
to accept Union[OpenAICompatibleRequest, Callable[[int],
OpenAICompatibleRequest]] (backward compatible; static-request
vendors pass a single request). MiniMax now passes a closure
that rebuilds messages from history each round.
Reasoning extraction: MiniMax exposes its chain-of-thought via
response.raw_response.choices[0].message.reasoning_details[0].
get('text'). Lifted to a _extract_minimax_reasoning callback
passed as reasoning_extractor=... (the new parameter added
in the previous commit).
Trim callback: wraps _trim_minimax_history so it can be called
from run_with_tool_loop after each tool-result append.
Green confirmed: 51 vendor + tool tests pass (6 MiniMax + 5
tool_loop core + 1 tool_loop builder + 39 others); the new
test_ai_client_tool_loop_builder.py locks in the per-round
builder contract.
Tasks 1.1 (red) + 1.2 (green) of the follow-up track. Adds a single
shared tool-call loop in src/ai_client.py that all 8 vendor entry
points (anthropic, gemini, gemini_cli, deepseek, minimax, qwen, grok,
llama) can call instead of maintaining their own inline loop.
Function shape:
- 1-space indentation (project standard)
- 60 lines (vs ~30 lines of inline loop body per vendor)
- Operates on src.openai_compatible.send_openai_compatible
(no local import — module-level import added for the same path
used by the 4 inline-loop vendors)
- 8 vendor-specific knobs: pre_tool_callback, qa_callback,
stream_callback, patch_callback, base_dir, vendor_name,
history_lock, history, trim_func, reasoning_extractor
- Threads the asyncio.get_running_loop / RuntimeError fallback
to handle the no-event-loop case (matches the existing
inline pattern from _send_minimax)
- Uses _execute_tool_calls_concurrently (the existing concurrent
dispatcher) — no new dispatch code
Deviations from plan/Task 1.1:
- The plan's test code patched src.tool_loop.send_openai_compatible
and the plan's Task 1.3 vendor wrapper imported 'from
src.tool_loop import run_with_tool_loop'. The plan predates the
AGENTS.md HARD RULE on src/<thing>.py files; per the follow-up
track's Naming Convention section, run_with_tool_loop lives IN
src/ai_client.py. Tests patch src.ai_client.send_openai_compatible
and the vendor wrapper imports 'from src.ai_client import
run_with_tool_loop' (next task).
- Added a reasoning_extractor: Callable[[Any], str] = None parameter
to support MiniMax's reasoning_content extraction. Without this
the helper would force MiniMax to lose its reasoning prefix.
Green confirmed: 50 vendor + tool tests pass; 4 audit scripts pass.
5 Red tests in tests/test_ai_client_tool_loop.py verify the planned
run_with_tool_loop contract (no-tool-call fast path, tool-call
dispatch, max-rounds safety, history append, error tolerance).
Deviation from plan: tests patch src.ai_client.send_openai_compatible
(plan's Task 1.1 had src.tool_loop.send_openai_compatible). The plan
predates the AGENTS.md HARD RULE on src/<thing>.py files; per the
follow-up track's Naming Convention section, run_with_tool_loop lives
IN src/ai_client.py. The function body imports send_openai_compatible
from src.openai_compatible, so src.ai_client.send_openai_compatible
is the correct patch path.
state.toml: current_phase 0 -> 1, phase_1 pending -> in_progress,
t1_1 pending -> in_progress, blocked_by status
phase_6_in_progress -> phase_6_complete (parent's Phase 6
checkpointed at 064cb26).
Confirmed red: 5 ImportError against src.ai_client.run_with_tool_loop
at collection time.
The user explicitly stated 2026-06-11: 'I need a naming convention
enforce for separate files you keep introducing that are technically
part of a system or parent module.' Per AGENTS.md 'File Size and
Naming Convention' HARD RULE: new src/<thing>.py files may only be
created on the user's explicit request. All AI-client code lives
IN src/ai_client.py.
Sweep through all follow-up track files to remove the stale
references to the no-longer-planned new src/ files:
- TODO.md: t1.4 'Implement helper in src/tool_loop.py' -> '...in
src/ai_client.py'
- plan.md: 5 stale references updated (Task 4.3 title, Step 1
'Files:', Step 5 'git add', Phase 4 git note, the function
summary in Phase 1 verification)
- plan.md: 'src/llama_ollama_native.py' removed (ollama_chat and
_send_llama_native both in src/ai_client.py)
- spec.md: Phase Plan section T1.2 and T4.2/T4.3 updated to
reference src/ai_client.py
- state.toml: t1.4, t4_2, t4_3 descriptions updated
- metadata.json: new_files list shrunk (3 new src/ files removed);
verification_criteria updated to reference src/ai_client.py
functions; follow_up_audit_report reference updated to point to
the actual file (docs/reports/qwen_llama_grok_followup_audit_20260611.md)
Spec additions from the same turn (not in the previous plan version):
- Naming Convention section explicitly references AGENTS.md HARD
RULE; 'If you find yourself about to create one, ASK FIRST'
- 'Non-Goals' section now lists 8 explicit non-goals (vs the
previous 4) including history management lift, reasoning
extraction lift, error classification lift
- 'Deferred Work' section documents 3 separate follow-up tracks
(namespace_cleanup_20260611, ai_client_codepath_consolidation_20260611,
mcp_architecture_refactor_20260606 [already specced])
- 'Open Questions' has 1 RESOLVED (PROVIDERS location) and 2 still
open (Meta URL verification; local model UI mode)
- 'Goals' table: 'local-backend' field added separately from
'cost_tracking' (per user feedback: distinct concept)
- 'B.1 Local-First' section: native Ollama DEFAULT for localhost
(not fallback), Meta Llama API prerequisite (verify URL first)
- 'B.2 Matrix Expansion' section: full list of 12 v2 fields + UI
adaptations for each
This is docs-only. The plan is now complete and aligned with the
HARD RULE. The next agent can pick up at Phase 1, Task 1.1 and
execute straight through.
The user called out the LLM training data bias: 'small files are
good, large files are bad.' This is wrong for production codebases.
Unreal has 15K+ line files; OS kernels, game engines, compilers all
routinely have 10K+ line files. File size is a non-issue. Cognitive
load is managed via naming, regions, and navigation tools (the
manual-slop MCP) — NOT via file splitting.
Updates:
1. AGENTS.md (master agent guidance):
- Added 'File Size and Naming Convention' section
- Added the hard rule: 'New namespaced src/<thing>.py files may
only be created on the user's explicit request. If you find
yourself about to create one, ASK FIRST.'
- Defaults: helpers and sub-systems go in the parent module
2. conductor/workflow.md (Guiding Principles):
- Removed 'Do NOT perform large file writes directamente' from
principle 7 (it was a delegating rule, but 'large file writes'
carried the propaganda)
- Added principle 8: 'File Naming Convention (HARD RULE)' that
references AGENTS.md
- Re-phrased principle 9 (Research-First) to clarify it's about
navigation efficiency, not file size
3. conductor/code_styleguides/python.md:
- Removed the 'extremely large files that violate the Anti-OOP
rule by necessity' framing
- Added the new rule about new src/<thing>.py files
4. .opencode/agents/tier3-worker.md and .opencode/agents/tier4-qa.md:
- Re-phrased 'Do NOT read full large files' to 'Use skeleton
tools to navigate any file regardless of size. File size is
not a concern; the right tools are.'
- Added the new rule about not creating new src/<thing>.py
files unless user explicitly requests it
5. conductor/tracks/qwen_llama_grok_followup_20260611/plan.md:
- Updated the 'Naming Convention' section to reference the new
'user explicit request' rule
This is docs-only. No code changes. The rule is now codified:
agents must ASK FIRST before creating new top-level src/ files.
The follow-up track had a spec but no plan. The plan is the executable
artifact — it specifies file:line refs, exact code to type, TDD steps,
and per-file atomic commits. Without the plan, the next agent cannot
implement from the spec alone.
Plan structure (5 phases, ~40 tasks):
- Phase 1: Tool loop lift (5 Red tests + helper + apply to 8 vendors +
audit script)
- Phase 2: PROVIDERS move (decide location + move + update 4 import
sites + audit script)
- Phase 3: UX adaptations 2-9 (8 separate applications of the pattern
established in parent Phase 5)
- Phase 4: Local-first + matrix v2 (12 new fields + native Ollama
adapter + Meta Llama API + Local Model GUI badge)
- Phase 5: Anthropic / Gemini / DeepSeek migration (matrix entries
for the 3 remaining providers + docs update)
Each task has:
- WHERE: exact file and (where applicable) line range
- WHAT: the specific change
- HOW: TDD step ordering (Red then Green)
- SAFETY: thread-safety, dependency-ordering, and project-invariant
constraints
The plan models the parent track's plan structure (2177 lines,
2-5 minute steps, per-file atomic commits).
Phase 6 of qwen_llama_grok_integration_20260606 ships the docs.
4 of 5 state tasks done (t6.3 CANCELLED per user directive:
'we can then doc this we're not archiving yet, if we have a follow up
track I need this one to stay up because there is still alot todo').
What shipped:
- t6.1: docs/guide_ai_client.md updated
- Overview mentions 8 providers (was 5)
- New 'Shared OpenAI-Compatible Helper' section: NormalizedResponse,
OpenAICompatibleRequest, send_openai_compatible, usage pattern
- Documents the Qwen adapter (src/qwen_adapter.py) and Llama
multi-backend state (3 backends; _get_llama_cost_tracking)
- Tests: 9 total (3 capabilities + 6 openai_compatible)
- t6.2: docs/guide_models.md updated
- PROVIDERS list: 5 -> 8 entries
- t6.4: conductor/tracks.md updated
- Status note on the qwen track entry: 50/79 tasks done;
Phase 6 in progress; NOT archiving; points to the follow-up
- t6.5: this checkpoint (active-with-follow-up, not archived)
- CANCELLED: t6.3 (no git mv to archive)
- CANCELLED: t6.4 'Recently Completed' move (track is active)
What was created in addition (not in the original Phase 6 plan):
- docs/reports/qwen_llama_grok_followup_audit_20260611.md
- Audit report explaining why a follow-up is needed
- 7 categories of gaps from the parent track
- The Tech Lead's 'footnote for now' failure mode (lessons learned)
- conductor/tracks/qwen_llama_grok_followup_20260611/
- 5-phase follow-up track: tool loop lift, PROVIDERS move,
UX adaptations 2-9, local-first + matrix v2,
Anthropic/Gemini/DeepSeek migration
- spec.md, state.toml, metadata.json, TODO.md
- Local-model-first priority per user feedback
- Wait for parent's Phase 6 to finish before starting (blocked_by)
Verification:
- 38/38 regression tests pass in batch
- No new audit script violations
- 4 new files in follow-up track: spec.md, state.toml,
metadata.json, TODO.md
- 1 new report: docs/reports/qwen_llama_grok_followup_audit_20260611.md
- 2 docs files updated: guide_ai_client.md, guide_models.md
The parent track remains ACTIVE (not archived) for the follow-up to
use as a reference. Per the user's 'there is still alot todo'.
Adds a status line to the qwen_llama_grok_integration_20260606 entry
in conductor/tracks.md noting that:
- Phases 1-5 are done; Phase 6 (docs) is in progress
- The track is NOT being archived (per user directive)
- A 5-phase follow-up track exists at
conductor/tracks/qwen_llama_grok_followup_20260611/
- An audit report is at docs/reports/qwen_llama_grok_followup_audit_20260611.md
- 50/79 tasks done; the remaining gaps are documented
Phase 6 t6.1 + t6.2 (no archive per user directive):
- docs/guide_ai_client.md: update Overview to mention 8 providers (was 5);
add 'Shared OpenAI-Compatible Helper' section explaining
src/openai_compatible.py (NormalizedResponse, OpenAICompatibleRequest,
send_openai_compatible, usage pattern); document the Qwen adapter
and Llama multi-backend.
- docs/guide_models.md: update PROVIDERS list to 8 entries (was 5).
- conductor/tracks.md: update the Qwen track entry to reflect
'50/79 tasks done; Phase 6 in progress; NOT archiving - has follow-up';
add detailed status note pointing to the follow-up track + audit
report.
- docs/reports/qwen_llama_grok_followup_audit_20260611.md: NEW report
explaining why a follow-up is needed (7 categories of gaps; the
Tech Lead's 'footnote for now' failure mode; the lessons learned).
- conductor/tracks/qwen_llama_grok_followup_20260611/: NEW follow-up
track setup (spec.md, state.toml, metadata.json, TODO.md).
5 phases: tool loop lift, PROVIDERS move, UX adaptations 2-9,
local-first + matrix v2, Anthropic/Gemini/DeepSeek migration.
Phase 6 t6.3 (git mv to archive) and t6.4 (mark Recently Completed)
are NOT applied per user directive: 'we can then doc this we're not
archiving yet, if we have a follow up track I need this one to stay
up because there is still alot todo'.
Phase 5 of qwen_llama_grok_integration_20260606 ships the foundation
for capability-driven UX. 4 of 6 state tasks done (t5.2 partial: 1 of 9
adaptations; t5.3 skipped; t5.5 cancelled: needs real API keys).
Shipped:
- t5.1: _get_active_capabilities() helper on App class
(src/gui_2.py:733) - reads the matrix for the active (provider, model)
pair; falls back to 'unregistered' VendorCapabilities if not found.
- t5.2 (partial): Adaptation 1 of 9 from spec §6 applied
- Screenshot button iff vision (render_files_and_media:3030)
- Pattern: caps = app._get_active_capabilities();
imgui.begin_disabled(not caps.<field>); ...UI...; imgui.end_disabled();
if not caps.<field>: imgui.same_line(); imgui.text_disabled('(reason)')
- t5.4: 38/38 regression batch passes
Skipped:
- t5.3: providers are exposed via centralized PROVIDERS in src/models.py
(already done in Phases 2 and 3); no per-provider gettable/callback
changes needed.
- t5.5: manual smoke test requires real API keys; user must do this
outside the agent context.
Deferred to follow-up (8 remaining UX adaptations):
- 2: Tools toggle iff tool_calling
- 3: Cache panel iff caching
- 4: Stream progress iff streaming
- 5: Fetch Models button iff model_discovery
- 6: Token budget max = context_window
- 7-9: Cost panel (3 cost_tracking states)
The pattern is established and the helper is in place. Each
remaining adaptation is a mechanical application of the same pattern
at its specific render site.
Verification: 38/38 regression tests pass.
After the end of Phase 5, only adaptation 1 of 9 from spec §6 was
applied (Screenshot button iff vision, render_files_and_media:3030).
The pattern is established; the remaining 8 are mechanical
applications of the same pattern at their respective render sites.
The follow-up track applies the wrapping at:
- tools toggle (tool_calling)
- cache panel (caching)
- stream progress (streaming)
- fetch models button (model_discovery)
- token budget max (context_window)
- cost panel (3 cost_tracking states: estimate / 'Free (local)' / '-')
The _get_active_capabilities() helper (t5.1) is already in place.
Phase 5 t5.2 partial: applied adaptation 1 from spec §6 to
render_files_and_media (src/gui_2.py:3030).
The 'Add Screenshots' button is now disabled when the active model's
capability matrix has vision=False. A tooltip-adjacent text_disabled
note shows '(vision not supported by <model>; attachments would be
ignored)' so the user knows WHY the button is disabled.
Pattern established for the remaining 8 adaptations (t5.2.2 through
t5.2.9 per spec §6):
caps = app._get_active_capabilities()
imgui.begin_disabled(not caps.<field>)
... UI ...
imgui.end_disabled()
if not caps.<field>:
imgui.same_line()
imgui.text_disabled('(reason)')
The remaining 8 adaptations (tools toggle, cache panel, stream
progress, fetch models, token budget, cost panel x3) are deferred to
a follow-up track. The pattern is established; the work is
mechanical application of it.
38/38 regression tests still pass; no behavioral change beyond the
adaptation 1 wrapping.
Phase 5 t5.1: the helper reads the capability matrix for the currently
active (provider, model) pair and returns the VendorCapabilities.
Falls back to an 'unregistered' VendorCapabilities if the pair is
not in the registry (e.g., a brand-new model name the user types in).
The 9 UX adaptations in spec §6 will call this helper to read the
capability flags (vision, tool_calling, caching, streaming, etc.)
and adapt the GUI accordingly.
Also fixed pre-existing indentation inconsistency in the App class
property methods (current_provider / current_model): the first
@property had 2-space indent but the body and subsequent def had
1-space indent (matching the project style). The mismatch was
latent; the new helper exposed it. Now uniform 1-space indent.
38/38 regression tests still pass; no behavioral change beyond the
helper addition.
As of end of Phase 4, only _send_minimax has a working tool-call loop.
Phase 3 (Grok, Llama) and Phase 2 (Qwen) entry points are single-shot;
they call send_openai_compatible once and return without executing
tool_calls. If the user notices 'tool execution doesn't work for
Qwen/Grok/Llama' after Phase 5 ships, the fix is to lift the tool
loop into a shared run_with_tool_loop() helper that wraps
send_openai_compatible. The 4 existing vendors (_send_anthropic /
_send_gemini / _send_gemini_cli / _send_deepseek) already have the
same inline duplication, so the lift would also help those.
This is a follow-up track, not in scope for qwen_llama_grok_integration_20260606.
Phase 4 t4.4: the wildcard entry 'minimax/*' was the only minimax
registration; this adds specific entries for the 4 fallback model
names returned by _list_minimax_models() at src/ai_client.py:2112
('MiniMax-M2.7', 'MiniMax-M2.5', 'MiniMax-M2.1', 'MiniMax-M2').
Each per-model entry mirrors the wildcard defaults (context_window=131072,
cost=0.20/0.20 per Mtok). Per-model entries let the matrix return
exact capability data for known models; the '*' wildcard still catches
new / future model names that aren't in the registry.
State [openai_compatible_models] minimax_models_refactored flag
flips to true (in the next state commit) -- this is the model-level
coverage the flag tracks.
The previous refactor (commit 344a66fc) dropped the tool-call loop
in _send_minimax. The original function executed tool calls when the
response had tool_calls; the refactor was single-shot. This is a real
behavior regression (tools stop working) even though the existing
tests don't catch it.
Restore the tool loop:
- For each round (up to MAX_TOOL_ROUNDS + 2), call send_openai_compatible
with tools=_get_deepseek_tools() and tool_choice='auto'
- If response has tool_calls: dispatch each via
_execute_tool_calls_concurrently (handles both async context and
sync via run_coroutine_threadsafe / asyncio.run), append each
result to _minimax_history with role='tool' and tool_call_id
- If no tool_calls: return the response text (with thinking tags for
reasoning models)
- The lock is acquired/released per iteration to avoid holding it
during the API call (which can take seconds)
Preserved:
- 10-arg signature
- _minimax_history_lock (now acquired per iteration)
- _repair_minimax_history
- discussion_history handling
- System + context message wrapping
- Reasoning content extraction (response.raw_response.choices[0].message
.reasoning_details[0].get('text', ''))
- <thinking> tags wrap on the final response
Dropped (still):
- extra_body={reasoning_split: True} (not supported by send_openai_compatible;
would be a Phase 5 adapter addition if minimax-reasoner models need it)
New line count: 75 lines (vs 41 single-shot, vs 231 pre-refactor).
Net effect: 231 -> 75 = 68% reduction; tool loop preserved.
Verification: 38/38 tests pass (no regressions).
Phase 3 of qwen_llama_grok_integration_20260606 ships Grok and Llama
provider support. 16 of 18 state tasks done (t3.4 and t3.15 cancelled:
no credentials_template.toml exists; t3.6 and t3.17 completed in
Phase 1's initial registry population).
Modules shipped:
- src/ai_client.py: state globals (_grok_*, _llama_* including _llama_base_url
and _llama_api_key), _ensure_grok_client() (OpenAI SDK with base_url
https://api.x.ai/v1), _ensure_llama_client() (OpenAI SDK with
configurable base_url + api_key for Ollama/OpenRouter/custom backends),
_send_grok() and _send_llama() (both 10-param signature matching
_send_minimax, both call send_openai_compatible), _list_grok_models()
and _list_llama_models() (return from capability registry),
_get_llama_cost_tracking() (the local-LLM signal: returns False when
base_url is localhost/127.0.0.1), 2 new branches in list_models(),
Grok + Llama state reset in reset_session()
- src/models.py: 'grok' and 'llama' added to PROVIDERS (centralized;
gui_2.py and app_controller.py import from this list)
- src/cost_tracker.py: 11 new regex pricing entries (3 Grok + 8 Llama)
Tests shipped:
- tests/test_grok_provider.py (28 lines, 2 tests)
- tests/test_llama_provider.py (68 lines, 6 tests)
- Total new tests this phase: 8 (all passing)
- Cumulative: 38 tests in batch (qwen + grok + llama + minimax + caps +
openai_compat + cost + no_top_level_sdk_imports)
Architectural correction (Grok-consulted 2026-06-11):
- Spec section 3.1.1 added: 'best API per vendor' principle
- Spec section 4.3 reverted from 'Native REST API' to 'OpenAI-Compatible'
per Grok's own confirmation: 'the OpenAI-compatible endpoint is
fully compatible and clean with no meaningful unique native surface
lost'
- Follow-up track B renamed: 'Llama Native APIs' (Ollama native +
Meta Llama API), not 'Native Vendor APIs' (no Grok native refactor
needed)
- v2 matrix field expansion documented (per Grok's recommendation):
audio, video, grounding, computer_use, local, reasoning,
web_search, x_search, code_execution, file_search, mcp_support,
structured_output
Deviations from plan (consistent with Phase 1 and Phase 2):
- Test signatures use 10-arg (real _send_minimax shape), not 12-arg
- PROVIDERS change is at src/models.py:56 (centralized), not in
gui_2.py and app_controller.py (which import from models)
- t3.4 and t3.15 (credentials template) skipped: no template file
exists; the user maintains their own credentials.toml directly
Phase 4 (MiniMax refactor) is now unblocked. The refactor replaces
~250 lines of inline OpenAI-compatible send logic in _send_minimax
with a thin wrapper around the shared send_openai_compatible helper
(per the spec §5.2 target: ~50 lines).
Side concerns for Phase 3:
1. PROVIDERS: src/models.py:56 now includes 'grok' and 'llama' alongside
the 6 existing vendors. Centralized registry; gui_2.py and
app_controller.py import from here. State tasks t3.5 and t3.16
were scoped to gui_2.py/app_controller.py but the actual change
is at the centralized registry, per the project's single-source-of-
truth pattern (per src/models.py module docstring and the Phase 5
audit script audit_no_models_config_io.py which enforces that
PROVIDERS lives in models.py).
2. cost_tracker.py: added 11 regex pricing entries (3 Grok + 8 Llama):
Grok (per xAI public pricing):
- grok-2: 2.00 / 10.00
- grok-2-vision: 2.00 / 10.00
- grok-beta: 5.00 / 15.00
Llama (per Grok's consultation: pricing varies by backend; registry
entries represent the most common case):
- llama-3.1-8b-instant: 0.05 / 0.08 (Groq)
- llama-3.1-70b-versatile: 0.59 / 0.79 (Groq)
- llama-3.1-405b-reasoning: 3.00 / 3.00 (OpenRouter avg)
- llama-3.2-1b-preview: 0.04 / 0.04
- llama-3.2-3b-preview: 0.06 / 0.06
- llama-3.2-11b-vision-preview: 0.18 / 0.18
- llama-3.2-90b-vision-preview: 0.90 / 0.90
- llama-3.3-70b-specdec: 0.59 / 0.79 (Groq)
(all per 1M tokens, USD; matches the structure of existing entries;
note: 'llama-3.1', 'llama-3.2', 'llama-3.3' are regex patterns to
allow future model variants in the same family.)
Spot check:
- estimate_cost('grok-2', 1000, 500) = 0.007 (= 0.002 + 0.005)
- estimate_cost('llama-3.3-70b-specdec', 1000, 500) = 0.000985
3. SKIPPED t3.4 and t3.15 (credentials templates): no
credentials_template.toml exists in the project (Phase 2 established
this). The user maintains their own credentials.toml directly.
4. t3.6 and t3.17 (Grok/Llama models in capability registry) were
completed in Phase 1's initial population of 22 entries
(commit 6be04bc). Grok has 4 entries (1 wildcard + 3 models);
Llama has 9 entries (1 wildcard + 8 models). Grok-2-vision has
vision=True; Llama 3.2-11b/90b vision variants have vision=True.
Verification: 38/38 tests pass in batch.
Grok's own recommendation (consulted 2026-06-11):
'xAI (Grok) | xAI official OpenAI-compatible (https://api.x.ai/v1) |
Fully compatible and clean. Supports Grok-2 + Grok-2-Vision. No
meaningful unique native surface lost by using the compatible
endpoint.'
This REVERSES the earlier 'xAI native' correction. The OpenAI-
compatible approach for Grok is the canonical full-featured path;
the implementation in Phase 3 (OpenAI SDK with base_url=https://api.x.ai/v1
+ send_openai_compatible helper) is correct as-is.
Updates to the spec:
1. §3.1.1: replaced the 'use xAI native' decision with the confirmed
per-vendor table. Qwen=Native, Grok=OpenAI-Compatible (per Grok's
own confirmation), MiniMax=OpenAI-Compatible, DeepSeek=OpenAI-
Compatible, Ollama=OpenAI-Compatible-in-v1 (native in v2),
Meta Llama API=Native (new 4th backend, follow-up), Gemini=Native
(follow-up), Anthropic=Native (follow-up). Also added Grok's
recommended v2 matrix field expansion: audio, video, grounding,
computer_use, local, reasoning/extended_thinking, web_search,
x_search, code_execution, file_search, mcp_support, structured_output.
2. §4.3: reverted from 'Grok via xAI (Native REST API)' back to
'Grok via xAI (OpenAI-Compatible) - confirmed 2026-06-11'. The
implementation does NOT need a native refactor; the OpenAI SDK
at https://api.x.ai/v1 is the canonical approach. Removed the
earlier 'caching: true' entry from the registry (since the
OpenAI-compat shim doesn't expose prompt_cache_key) and the
'no persistent client' state struct (back to the OpenAI SDK
pattern).
3. §13.1.B: renamed from 'Native Vendor APIs' to 'Llama Native APIs
(Ollama native + Meta Llama API)' and removed the Grok native
refactor item (Grok says OpenAI-compat is fine). Kept the Ollama
native + Meta Llama API items + matrix expansion. Clarified that
Grok tests do NOT need rewriting; only Llama tests get 2 more
(native Ollama, Meta Llama API).
Net effect: the Phase 3 work that just shipped (Grok+Llama Green
using OpenAI-compat shim) is CORRECT as-is. The implementation
matches Grok's actual recommendation. No code rollback needed.
Three additions to the spec, per the user's architectural correction
in this session:
1. NEW section 3.1.1: 'Architectural principle: Use the best API per
vendor' — explains why the OpenAI-compatible shim loses vendor-
specific features (xAI: prompt_cache_key, reasoning_effort, server-
side tools, cost_in_usd_ticks; Ollama: think param, images array,
thinking field, structured outputs) and states the principle:
'use each vendor's native SDK or REST API when one exists, falling
back to OpenAI-compatible only when no native option exists.'
Also notes that the capability matrix IS the aggregate tracker;
future native features go into the matrix, and the GUI filters
based on it (no per-vendor UI branches).
2. UPDATED section 4.3 (Grok): 'Grok via xAI (Native REST API)' — was
'OpenAI-Compatible'. Now specifies two native endpoints
(/v1/chat/completions and /v1/responses), the native features that
matter, the updated capability registry (caching=true for Grok
via prompt_cache_key), and a 'Phase 3 placeholder behavior' note
that this track's Phase 3 ships the OpenAI-compatible Grok as a
placeholder. The native refactor is deferred to follow-up B.
3. UPDATED section 13.1: added follow-up track B 'Native Vendor APIs
(post-OpenAI-compatible-placeholder)' which documents:
- Grok → xAI native REST
- Llama (Ollama) → native /api/chat
- Llama (Meta Llama API) → new 4th backend (deferred pending
verification of Meta's API spec; llama.developer.meta.com/docs/overview
returned 400 on fetch this session)
- Capability matrix expansion (web_search, x_search, code_execution,
file_search, mcp_support, reasoning_effort, structured_output)
- Test rewrites (mock requests.post instead of chat.completions.create)
This is a docs-only commit; no code changes. The Phase 3 Green work
continues with the OpenAI-compatible approach as planned in the
existing Red tests (t3.3 Grok + t3.14 Llama), and the follow-up track
B handles the native refactor when prioritized.
8 failing tests in 2 new files for the upcoming Grok and Llama
provider implementations.
Grok (tests/test_grok_provider.py, 2 tests):
1. test_send_grok_uses_xai_endpoint: _send_grok calls _ensure_grok_client
and uses an xAI client (base_url https://api.x.ai/v1)
2. test_grok_2_vision_supports_image: structural check that the
capability registry has vision=True for grok-2-vision (already
populated in Phase 1, so this test passes in Red phase; it is a
regression guard for the registry, not an implementation test)
Llama (tests/test_llama_provider.py, 6 tests):
1. test_send_llama_ollama_backend: _send_llama with localhost:11434
(Ollama) base URL
2. test_send_llama_openrouter_backend: _send_llama with OpenRouter URL
3. test_send_llama_custom_url: _send_llama with custom URL
(escape hatch for self-hosted)
4. test_llama_model_discovery_unions_ollama_and_openrouter: _list_llama_models
returns the 8 models from the capability registry
5. test_llama_3_2_vision_vision_capability: structural check for
llama-3.2-11b-vision-preview (passes in Red phase)
6. test_llama_local_backend_cost_tracking_false_for_ollama: the local-LLM
signal -- when base_url is localhost, _get_llama_cost_tracking()
returns False. This is the first test that exercises the local LLM
support that the capability matrix was designed for.
Both _reset_grok_state and _reset_llama_state fixtures use hasattr() to
be no-ops when the state doesn't exist (Red phase).
Test signatures use the real 10-arg _send_minimax signature, NOT the
plan's 12-arg with enable_tools / rag_engine.
Red phase: 6/8 tests fail (4 AttributeError on missing _send_*,
2 ImportError on missing _list_*/_get_*). 2/8 pass (registry structural
checks).
Next: Green phase - implement _send_grok + _ensure_grok_client +
_send_llama + _ensure_llama_client + _list_llama_models +
_get_llama_cost_tracking in src/ai_client.py.
Phase 2 of qwen_llama_grok_integration_20260606 ships Qwen support via
the Alibaba Cloud DashScope native SDK. 10 of 11 state tasks done
(t2.7 cancelled: no credentials_template.toml exists in the project;
t2.9 was completed in Phase 1's initial registry population).
Modules shipped:
- src/qwen_adapter.py (31 lines): build_dashscope_tools() (OpenAI shape
-> DashScope shape), classify_dashscope_error() (5 exception classes
-> ProviderError kinds: auth/network/quota)
- src/ai_client.py: state globals (_qwen_client, _qwen_history,
_qwen_history_lock, _qwen_region), _ensure_qwen_client() (sets
dashscope.base_http_api_url based on region: china vs international),
_dashscope_call() + _dashscope_exception_from_response() +
_extract_dashscope_tool_calls(), _send_qwen() (10-param signature
matching _send_minimax), _list_qwen_models()
- src/models.py: 'qwen' added to PROVIDERS (centralized; gui_2.py and
app_controller.py import from this list)
- src/cost_tracker.py: 7 Qwen pricing entries (regex-matched,
USD per 1M tokens)
Tests shipped: tests/test_qwen_provider.py (55 lines, 5 tests, all passing)
Total new tests this phase: 5
Total tests in new modules: 30 (qwen + minimax + capabilities +
openai_compatible + cost_tracker + no_top_level_sdk_imports)
Verification:
- 30/30 tests pass in batch
- No regressions
- 4/4 audit scripts pass (audit_main_thread_imports, audit_weak_types,
check_test_toml_paths, audit_no_models_config_io)
DashScope alignment (post-cleanup):
- Uses dashscope.common.error.AuthenticationError (real class in
1.25.21) instead of the non-existent InvalidApiKey
- Removed the InvalidApiKey -> AuthenticationError monkey-patch
- TimeoutException -> network (not rate_limit)
- ServiceUnavailableError -> network (not quota)
- _ensure_qwen_client sets base_http_api_url per region (china vs
international) per the latest DashScope API spec
Deviations from the plan:
- Test signature adapted from 12-param (plan) to 10-param (matching
real _send_minimax) -- the plan's enable_tools / rag_engine params
don't exist on _send_minimax
- PROVIDERS change is at src/models.py:56 (centralized), not in
gui_2.py and app_controller.py (which import from models)
- t2.7 (credentials template) skipped: no template file exists;
the user maintains their own credentials.toml directly
Phase 3 (Grok + Llama) is now unblocked. Local LLM support lands
in Phase 3 via Llama's Ollama backend (default base_url
http://localhost:11434/v1).