The follow-up track's tool-loop refactor moved
'from src.openai_compatible import send_openai_compatible,
OpenAICompatibleRequest, NormalizedResponse' to MODULE level
in src/ai_client.py. This violates the startup_speedup_20260606
invariant: heavy SDKs must not be loaded at module level because
ai_client.py is on the main thread's import chain.
src/openai_compatible.py line 5 does 'from openai import
OpenAIError, ...', so any import from it triggers the openai SDK
to load. test_ai_client_does_not_import_openai_at_module_level
guards this invariant and was failing.
Fix: move the imports back to local scope inside the function
bodies that need them:
- _default_send closure inside run_with_tool_loop
(imports send_openai_compatible)
- _send_grok (imports OpenAICompatibleRequest)
- _send_minimax (imports OpenAICompatibleRequest)
- _send_llama (imports OpenAICompatibleRequest)
- _send_gemini_cli (imports OpenAICompatibleRequest + NormalizedResponse)
Test patches: tests that previously patched
'src.ai_client.send_openai_compatible' now patch
'src.openai_compatible.send_openai_compatible' (the actual
import source). _execute_tool_calls_concurrently patches
unchanged (it's defined in src/ai_client.py itself).
Green confirmed: 62 vendor + tool + import-isolation tests
pass. 0 regressions.
Task 1.7 of the follow-up track. Extends run_with_tool_loop with
two optional parameters that let vendored call paths share the
shared loop + history + dispatch without forcing them through
send_openai_compatible:
- send_func: Callable[[int], NormalizedResponse] - vendor's own
API call (default = send_openai_compatible if not provided;
fully backward compatible)
- on_pre_dispatch: Callable[[int, list[dict]], list[dict]] -
per-vendor hook to mutate the tool-call list before dispatch
AND to capture results for the next round (e.g. Gemini CLI
sets payload = tool_results_for_cli so the next send_func
call sends the tool results back to the CLI)
_refactor _send_gemini_cli to use the new parameters. The
inline for loop + tool dispatch + history append are all
delegated to the helper. The vendor's send_func closure
handles:
- adapter.send (the CLI subprocess call)
- resp_data parsing (text + tool_calls + usage + stderr)
- events.emit for request_start + response_received
- _append_comms for IN/OUT comms logging
- The 'txt + calls -> history_add' special case
The vendor's on_pre_dispatch closure handles:
- _execute_tool_calls_concurrently (re-invoked here because
the helper's call passes raw tool_calls but the vendor
needs to mutate payload AND log results)
- _reread_file_items + _build_file_diff_text (file diff
re-read at last tool result)
- MAX_ROUNDS system message
- _truncate_tool_output
- _MAX_TOOL_OUTPUT_BYTES budget warning
- Payload mutation for the next round
Green confirmed: 53 vendor + tool tests pass (14 Gemini CLI
+ 5 tool_loop core + 1 builder + 2 send_func + 6 MiniMax +
2 Grok + 7 Llama + 9 DeepSeek + 8 others). No regressions.
Task 1.3 of the follow-up track. _send_minimax now uses
run_with_tool_loop with a per-round request_builder callback
that re-reads _minimax_history under _minimax_history_lock.
The plan's Task 1.3 example builds the request once before the
loop. That would break MiniMax tool flows because the API
would not see the tool results appended to _minimax_history
on later rounds. The fix: extend run_with_tool_loop's 2nd arg
to accept Union[OpenAICompatibleRequest, Callable[[int],
OpenAICompatibleRequest]] (backward compatible; static-request
vendors pass a single request). MiniMax now passes a closure
that rebuilds messages from history each round.
Reasoning extraction: MiniMax exposes its chain-of-thought via
response.raw_response.choices[0].message.reasoning_details[0].
get('text'). Lifted to a _extract_minimax_reasoning callback
passed as reasoning_extractor=... (the new parameter added
in the previous commit).
Trim callback: wraps _trim_minimax_history so it can be called
from run_with_tool_loop after each tool-result append.
Green confirmed: 51 vendor + tool tests pass (6 MiniMax + 5
tool_loop core + 1 tool_loop builder + 39 others); the new
test_ai_client_tool_loop_builder.py locks in the per-round
builder contract.
5 Red tests in tests/test_ai_client_tool_loop.py verify the planned
run_with_tool_loop contract (no-tool-call fast path, tool-call
dispatch, max-rounds safety, history append, error tolerance).
Deviation from plan: tests patch src.ai_client.send_openai_compatible
(plan's Task 1.1 had src.tool_loop.send_openai_compatible). The plan
predates the AGENTS.md HARD RULE on src/<thing>.py files; per the
follow-up track's Naming Convention section, run_with_tool_loop lives
IN src/ai_client.py. The function body imports send_openai_compatible
from src.openai_compatible, so src.ai_client.send_openai_compatible
is the correct patch path.
state.toml: current_phase 0 -> 1, phase_1 pending -> in_progress,
t1_1 pending -> in_progress, blocked_by status
phase_6_in_progress -> phase_6_complete (parent's Phase 6
checkpointed at 064cb26).
Confirmed red: 5 ImportError against src.ai_client.run_with_tool_loop
at collection time.
8 failing tests in 2 new files for the upcoming Grok and Llama
provider implementations.
Grok (tests/test_grok_provider.py, 2 tests):
1. test_send_grok_uses_xai_endpoint: _send_grok calls _ensure_grok_client
and uses an xAI client (base_url https://api.x.ai/v1)
2. test_grok_2_vision_supports_image: structural check that the
capability registry has vision=True for grok-2-vision (already
populated in Phase 1, so this test passes in Red phase; it is a
regression guard for the registry, not an implementation test)
Llama (tests/test_llama_provider.py, 6 tests):
1. test_send_llama_ollama_backend: _send_llama with localhost:11434
(Ollama) base URL
2. test_send_llama_openrouter_backend: _send_llama with OpenRouter URL
3. test_send_llama_custom_url: _send_llama with custom URL
(escape hatch for self-hosted)
4. test_llama_model_discovery_unions_ollama_and_openrouter: _list_llama_models
returns the 8 models from the capability registry
5. test_llama_3_2_vision_vision_capability: structural check for
llama-3.2-11b-vision-preview (passes in Red phase)
6. test_llama_local_backend_cost_tracking_false_for_ollama: the local-LLM
signal -- when base_url is localhost, _get_llama_cost_tracking()
returns False. This is the first test that exercises the local LLM
support that the capability matrix was designed for.
Both _reset_grok_state and _reset_llama_state fixtures use hasattr() to
be no-ops when the state doesn't exist (Red phase).
Test signatures use the real 10-arg _send_minimax signature, NOT the
plan's 12-arg with enable_tools / rag_engine.
Red phase: 6/8 tests fail (4 AttributeError on missing _send_*,
2 ImportError on missing _list_*/_get_*). 2/8 pass (registry structural
checks).
Next: Green phase - implement _send_grok + _ensure_grok_client +
_send_llama + _ensure_llama_client + _list_llama_models +
_get_llama_cost_tracking in src/ai_client.py.
5 failing tests in tests/test_qwen_provider.py that establish the
core behaviors of the new Qwen (DashScope) provider:
1. test_send_qwen_routes_to_dashscope: _send_qwen calls _ensure_qwen_client
and _dashscope_call, returns the text from the DashScope response
2. test_qwen_vision_vl_model_accepts_image: when file_items contains an
image, the messages passed to _dashscope_call include the image ref
3. test_qwen_tool_format_translation: build_dashscope_tools converts
OpenAI-shaped tool dicts to DashScope shape (name/description/parameters
flat structure, not wrapped in function:)
4. test_qwen_error_classification: classify_dashscope_error maps
dashscope.common.error.InvalidApiKey -> ProviderError(kind='auth',
provider='qwen')
5. test_list_qwen_models_returns_hardcoded_registry: _list_qwen_models
returns the 7 Qwen models registered in src/vendor_capabilities.py
The autouse _reset_qwen_state fixture uses hasattr() so it is a no-op
when _qwen_client / _qwen_history do not exist (yet); this keeps the
fixture working in the Red phase.
All 5 tests fail:
- Tests 1, 2: AttributeError: src.ai_client has no _ensure_qwen_client /
_send_qwen / _dashscope_call
- Tests 3, 4: ModuleNotFoundError: No module named src.qwen_adapter
- Test 5: ImportError: cannot import name _list_qwen_models
Test signature adapted to match the real _send_minimax signature at
src/ai_client.py:2143-2148 (10 params, no enable_tools / rag_engine)
rather than the plan's 12-param signature.
Next: Green phase - implement src/qwen_adapter.py + src/ai_client.py
state + _ensure_qwen_client + _send_qwen + _list_qwen_models.
6 failing tests in tests/test_openai_compatible.py that establish the
core behaviors of the new send_openai_compatible() shared helper:
1. test_send_non_streaming_returns_normalized_response: blocking call
returns text, empty tool_calls, and correct usage token counts
2. test_send_streaming_aggregates_chunks: streaming call aggregates
deltas into final text and fires stream_callback per chunk
3. test_tool_call_detection_in_response: tool_calls from the response
are converted to dicts with id/type/function/arguments fields
4. test_vision_multimodal_message: messages with multimodal content
(text + image_url) are passed through unchanged to the client
5. test_error_classification_429_to_rate_limit: RateLimitError from
openai SDK is caught and re-raised as ProviderError(kind='rate_limit')
6. test_normalized_response_is_frozen_dataclass: NormalizedResponse is
a frozen dataclass (FrozenInstanceError on attribute assignment)
All 6 tests fail with ModuleNotFoundError: No module named
'src.openai_compatible' (confirmed via pytest). The implementation file
will be created in the next commit (Green phase).
ProviderError confirmed importable from src.ai_client (no stub needed).
Green phase: src/vendor_capabilities.py now exists and all 3 Red-phase
tests in tests/test_vendor_capabilities.py pass.
Implementation:
- VendorCapabilities frozen dataclass with 12 fields (vendor, model, vision,
tool_calling, caching, streaming, model_discovery, context_window,
cost_tracking, cost_input_per_mtok, cost_output_per_mtok, notes)
- Module-level _REGISTRY dict keyed by (vendor, model)
- register() inserts/overwrites entries
- get_capabilities() returns specific entry if present, else vendor '*'
default, else raises KeyError with 'No capabilities registered' message
- list_models_for_vendor() returns sorted model names for a vendor
(excludes '*' wildcard)
Initial population (22 entries at module load):
- 1 minimax wildcard (cost: 0.20/0.20 per Mtok)
- 4 grok (1 wildcard + 3 models; grok-2-vision has vision=True)
- 9 llama (1 wildcard + 8 models; 11b/90b vision variants have vision=True)
- 8 qwen (1 wildcard + 7 models; qwen-vl-plus/max have vision=True;
qwen-audio has notes='Text-only in v1; audio input deferred')
The plan's Task 1.3 listed 22 entries but included one impossible entry
(vendor='minimax', model='grok-2-latest'). Omitted; 21 entries shipped.
Test fix: test_fallback_to_vendor_default previously used model name
'llama-3.3-70b-specdec' which IS in the registry, so the specific entry
was returned (with default cost_tracking=True), not the wildcard. Fixed
by changing to 'llama-3.3-future-unregistered' (not in registry, so
fallback fires correctly).
3 failing tests in tests/test_vendor_capabilities.py that establish the
core behaviors of the new VendorCapability matrix:
1. test_registry_lookup_known_model: registering and looking up a specific
(vendor, model) entry returns the registered entry
2. test_fallback_to_vendor_default: looking up an unregistered model returns
the vendor's '*' default entry
3. test_unknown_vendor_raises: looking up a vendor with no entries raises
KeyError with a 'No capabilities registered' message
All 3 tests fail with ModuleNotFoundError: No module named
'src.vendor_capabilities' (confirmed via pytest). The implementation file
will be created in the next commit (Green phase).
The autouse _clean_registry fixture snapshots src.vendor_capabilities._REGISTRY
before each test and restores it after, providing test isolation for the
module-level state.
Three real fixes for the sim test + the live_gui coordination layer:
1. /api/project_switch_status endpoint in src/app_controller.py.
The wait helper had been calling this endpoint but it did not exist;
the helper always received a 404, fell back to {in_progress: False},
and returned immediately even when a switch was in flight. Added the
endpoint that reads _project_switch_in_progress, active_project_path,
and _project_switch_error from the controller.
2. simulation/sim_base.py: replace time.sleep(2.0)/time.sleep(1.5) in
the setup() with wait_io_pool_idle and wait_for_project_switch so
the test does not click btn_md_only while a project switch is in
flight. Also added the wait calls to sim_context.py for the same
reason.
3. src/app_controller.py _handle_md_only: removed the is_project_stale()
early-return. The stale state is a transient window during which the
previous code dropped the click on the floor with a misleading
'stale ui' status. The MD generation worker is safe to run from any
project state; the action handler now always proceeds.
4. tests/test_extended_sims.py: set current_model to 'gemini-cli' so
_do_generate does not raise KeyError('model') when the test
overrides provider to gemini_cli.
KNOWN ISSUE: test_context_sim_live still fails with status
'switching to: temp_livecontextsim' after a 60s wait. The click
appears to be re-triggering a project switch via the GUI's render
loop. Root cause investigation deferred; the sim is async and the
test path is fragile.
The session-scoped live_gui fixture deleted the shared workspace
before recreating it, which raced with the per-worker lock acquisition
and produced FileNotFoundError on .live_gui_owner.lock in xdist.
The per-run timestamped name (tests/artifacts/live_gui_workspace_<ts>/)
already provides enough isolation between pytest invocations, so the
rmtree is unnecessary. Use mkdir(exist_ok=True) only.
The fix in 644d88ab changed the recovery path from client.delete_collection
to shutil.rmtree (chromadb 1.5.x delete_collection is broken on corrupted
state). The test still asserted the old behavior.
The bug: when the local embedding provider fails to initialize
(e.g. sentence-transformers not installed), RAGEngine.__init__
leaves self.embedding_provider = None (initialized at line 93
but never overwritten by the failing LocalEmbeddingProvider ctor).
The constructor returns. _sync_rag_engine's else branch then
sets status to 'ready' - a lie. The RAG panel shows 'ready'.
The user triggers a retrieval. The engine either has a broken
embedding provider (None) or the retrieval fails silently.
The RAG context never appears in the AI's history.
The fix: in _sync_rag_engine's _task, after RAGEngine(...)
returns, check if engine.embedding_provider is None. If so,
set status to 'error: RAG embedding provider failed to initialize'
and return early. This prevents:
- The engine from being assigned to self.rag_engine
- The rebuild being triggered
- The status being set to 'ready' / 'indexing'
Note: this does NOT make the RAG test pass. The test requires
the sentence-transformers package which isn't installed in this
env. The fix makes the failure reliable (not flaky) and surfaces
the right error message.
TDD: 3 tests added in tests/test_rag_engine_ready_status_bug.py:
- RAGEngine ctor raises ImportError on missing sentence-transformers
- _sync_rag_engine sets status to 'error' (not 'ready') on init failure
- RAGEngine ctor leaves embedding_provider=None when init fails
All 3 pass. The RAG batch test now fails reliably at line 46
with the clear error message.
PR1 follow-up (the actual IM_ASSERT root cause fix).
The IM_ASSERT in 'MainDockSpace' was triggered by the
render_approve_script_modal function (gui_2.py:4895) calling
imgui.checkbox with a None value for app.ui_approve_modal_preview.
The chain of bugs:
1. AppController.__getattr__ returned None for ANY ui_ attribute
(line 1237-1238). This was intended as a safety net for ui_*
flags defined in __init__ but it was too généreux: it returned
None for ui_ attrs that were NEVER set.
2. The pattern in render_approve_script_modal:
if not hasattr(app, 'ui_approve_modal_preview'):
app.ui_approve_modal_preview = False
_, app.ui_approve_modal_preview = imgui.checkbox(..., app.ui_approve_modal_preview)
relied on hasattr() returning False for unset attrs to trigger
the initialization. But the App.__setattr__ checks
hasattr(self.controller, name) to decide where to route
assignments. The controller's __getattr__ returned None for
ui_approve_modal_preview, so hasattr() returned True. The
App.__setattr__ routed the assignment to the controller.
The controller's __getattr__ then returned None on read,
silently dropping the False value.
3. The next line called imgui.checkbox with None, which raised
a TypeError. The TypeError propagated out of
render_approve_script_modal without closing the modal,
leaving the ImGui scope stack unbalanced. The unbalanced
scope triggered IM_ASSERT(Missing End()) on the next frame.
Fix: AppController.__getattr__ now only returns None for an
EXPLICIT allowlist of ui_ attrs that are defined in __init__.
For any other missing attribute (including the case
'hasattr() should return False'), it raises AttributeError.
The App.__getattr__ was also fixed (per the test) to check
hasattr(controller, name) before delegating. This is defense in
depth in case other __getattr__ patterns are added.
Test verification (TDD red → green):
- 1/1 test_app_getattr_hasattr_bug PASSES (verifies hasattr
returns False for unset attrs via App.__getattr__)
- 1/1 test_app_controller_getattr_ui_bug PASSES (verifies hasattr
returns False for unset ui_ attrs on controller)
Live verification:
- 4 sims + test_live_workflow + 2 markdown tests: 7/7 PASS in 83.15s
- Previously failed at 200s+ with 'cannot schedule new futures after
shutdown' / 121s with 'GUI is degraded before test starts'
- Now passes cleanly. The IM_ASSERT no longer fires.
13/13 related unit tests pass (app_controller_* + app_run_* +
app_getattr_*). No regressions in 51/51 io_pool/warmup/sigint/etc.
unit tests.
PR3 of the test_full_live_workflow_imgui_assert fix sequence.
When a prior live_gui test in the same session crashes the GUI (e.g.
via an ImGui IM_ASSERT from cumulative panel state), the controller's
_io_pool gets shut down. The next test starts in a degraded state
but only discovers this 120s later when its project switch times
out with a confusing 'cannot schedule new futures after shutdown'
error.
This commit adds a /api/gui_health pre-flight check at the start of
test_full_live_workflow. If the GUI is degraded, the test fails
fast (within 1s) with a clear, actionable message that includes:
- The exact RuntimeError that caused the degradation
- The full traceback of the last ImGui scope mismatch
- A note that the new test cannot proceed with a dirty state
Per user feedback 2026-06-08: 'I don't want a batch to be too fragile
where I can't restart the app and continue with the next test file
if it fails. Just has to note that the new file didn't get to deal
with a dirty state.'
Also includes the planning documents written earlier in this session:
- TODO_test_full_live_workflow_v2.md (task list)
- test_full_live_workflow_imgui_assert_20260608.md (root cause report)
- test_full_live_workflow_propagation_digest_20260608.md (solutions digest)
- batch_resilience_plan_20260608.md (batch resilience plan)
Verification:
- test_full_live_workflow in isolation: 13.45s PASS (health=True, no degrade)
- 4 sims + test_full_live_workflow in batch: 76.46s (1 FAIL fast, 4 sims PASS)
- Without PR3 fix: 200s FAIL with confusing 120s timeout
- With PR3 fix: 76s FAIL with clear 'GUI is degraded' message
- The fast-fail is observable, not silent (per user's 'wrap might be
worth it if that properly lets us handle the assert')