Latest commit: Three independent root causes fixed:
- `gui_2.py`: Route `mma_spawn_approval`/`mma_step_approval` events in `_process_event_queue`.
- `multi_agent_conductor.py`: Pass the asyncio loop from `ConductorEngine.run()` through to thread-pool workers for thread-safe event queue access; add a `_queue_put` helper.
- `ai_client.py`: Preserve `GeminiCliAdapter` in `reset_session()` instead of nulling it.

Test: `visual_sim_mma_v2::test_mma_complete_lifecycle` passes in ~8s.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
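The conductor fix above (passing the asyncio loop through to thread-pool workers) can be sketched roughly as follows. The names `ConductorEngine`, `run()`, and `_queue_put` come from the summary; everything else in the class is a simplified assumption, not the actual `multi_agent_conductor.py` implementation:

```python
import asyncio
import threading
from typing import Optional


class ConductorEngine:
    """Simplified sketch of the loop-passing fix: worker threads never
    touch the asyncio.Queue directly; they go through _queue_put, which
    schedules the put on the event-loop thread."""

    def __init__(self) -> None:
        self._loop: Optional[asyncio.AbstractEventLoop] = None
        self.event_queue: Optional[asyncio.Queue] = None

    def _queue_put(self, event: dict) -> None:
        # Safe to call from any worker thread: call_soon_threadsafe
        # hands the put_nowait over to the loop thread.
        assert self._loop is not None, "run() must capture the loop first"
        self._loop.call_soon_threadsafe(self.event_queue.put_nowait, event)

    async def run(self) -> dict:
        # Capture the running loop and create the queue on the loop
        # thread, so workers can reach both safely.
        self._loop = asyncio.get_running_loop()
        self.event_queue = asyncio.Queue()
        worker = threading.Thread(
            target=self._queue_put,
            args=({"type": "mma_spawn_approval", "tier": 3},),
        )
        worker.start()
        worker.join()
        return await self.event_queue.get()


event = asyncio.run(ConductorEngine().run())
print(event["type"])  # mma_spawn_approval
```

Calling `queue.put_nowait` directly from a worker thread is what made the old version unsafe; `call_soon_threadsafe` is the standard way to cross the thread/loop boundary.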
Track Specification: Robust Live Simulation Verification
Overview
Establish a robust, visual simulation framework to prevent regressions in the complex GUI and asynchronous orchestration layers. This track replaces manual human verification with an automated script that clicks through the GUI and verifies the rendered state.
Goals
- Simulation Framework Setup: Build a dedicated test script (`tests/visual_sim_mma_v2.py`) utilizing `ApiHookClient` to control the live GUI.
- Simulate Epic Planning: Automate the clicking of "New Epic", inputting a prompt, and verifying the expected Tier 1 tracks appear in the UI.
- Simulate Execution & Spawning: Automate the selection of a track, the generation of the DAG, and the interaction with the HITL Approval modal.
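The three goals might combine into a test flow like the following. `ApiHookClient`'s real interface is not documented here, so the methods (`click`, `set_text`, `get_json`) are guesses, stubbed with canned responses purely so the sketch runs standalone:

```python
class ApiHookClient:
    """Hypothetical stub standing in for the real hook client; records
    actions and returns canned GUI state so the flow is self-contained."""

    def __init__(self):
        self.actions = []

    def click(self, element_id):
        self.actions.append(("click", element_id))

    def set_text(self, element_id, text):
        self.actions.append(("set_text", element_id, text))

    def get_json(self, path):
        # Canned response: tracks "appear" once the epic was submitted.
        if ("click", "new_epic") in self.actions:
            return {"tracks": ["Data Architecture", "HITL Approval"]}
        return {"tracks": []}


def simulate_epic_planning(client):
    client.click("new_epic")
    client.set_text("epic_prompt", "Build the MMA dashboard")
    client.click("submit_epic")
    state = client.get_json("/api/gui/mma_status")
    # Fail loudly rather than hanging if the UI never rendered tracks.
    assert state["tracks"], "Tier 1 tracks never appeared in the UI"
    return state["tracks"]


tracks = simulate_epic_planning(ApiHookClient())
print(tracks)  # ['Data Architecture', 'HITL Approval']
```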
Constraints
- Must run against a live instance of the application using `--enable-test-hooks`.
- Must fail loudly if the visual state (e.g., rendered DAG nodes, text box contents) does not match expectations.
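The "fail loudly" constraint suggests a polling helper that raises with a full state dump on timeout rather than hanging silently; a minimal sketch, where the `get_state` callback and the `dag_nodes` field are assumptions for illustration:

```python
import time


def wait_for_state(get_state, predicate, timeout=10.0, interval=0.2):
    """Poll get_state() until predicate(state) is truthy.

    On timeout, raise with the last observed state so the failure is
    loud and diagnosable instead of an endless hang."""
    deadline = time.monotonic() + timeout
    last = None
    while time.monotonic() < deadline:
        last = get_state()
        if predicate(last):
            return last
        time.sleep(interval)
    raise AssertionError(
        f"Visual state never matched within {timeout}s; last state: {last!r}"
    )


# Example: a simulated GUI whose DAG nodes appear after a few polls.
calls = {"n": 0}

def fake_state():
    calls["n"] += 1
    return {"dag_nodes": ["t1", "t2"] if calls["n"] >= 3 else []}


state = wait_for_state(fake_state, lambda s: len(s["dag_nodes"]) == 2, timeout=5)
print(state["dag_nodes"])  # ['t1', 't2']
```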
Context & Origins
This track was born from the "Human Verification" phase of the initial MMA Orchestrator prototype (mma_orchestrator_integration_20260226). We realized that while the backend API plumbing for the hierarchical MMA tiers (Tiers 1-4) was technically functional, the product lacked the necessary state management, UX visualization, and human-in-the-loop security gates to be usable.
Key Takeaways from the Prototype Phase:
- The Tier 2 (Tech Lead) needs its own track-scoped discussion history, rather than polluting the global project history.
- Tasks within a track require a DAG (Directed Acyclic Graph) engine to manage complex dependencies and blocking states.
- The GUI must visualize this DAG and stream the output of individual workers directly to their associated tasks.
- We must enforce tiered context subsetting so that Tier 3/4 workers don't receive the massive global context blob, and we need a pre-spawn approval modal so the user can intercept, review, and modify worker prompts/contexts before they execute.
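The DAG requirement in the second takeaway can be illustrated with a minimal dependency resolver: given each task's prerequisites, compute which tasks are runnable and which are blocked. This is an illustrative sketch, not the engine's actual API:

```python
def ready_tasks(deps, done):
    """Return tasks whose dependencies are all complete and which are
    not themselves done; tasks with unmet dependencies stay blocked."""
    return sorted(
        task for task, reqs in deps.items()
        if task not in done and all(r in done for r in reqs)
    )


# A small track-level DAG: review is blocked until both test and docs
# finish, and those in turn block on implement.
deps = {
    "implement": [],
    "test": ["implement"],
    "docs": ["implement"],
    "review": ["test", "docs"],
}

print(ready_tasks(deps, done=set()))                          # ['implement']
print(ready_tasks(deps, done={"implement"}))                  # ['docs', 'test']
print(ready_tasks(deps, done={"implement", "test", "docs"}))  # ['review']
```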
Instructions for the Implementing Agent:
As you execute this track, ensure you maintain alignment with the other Phase 2 tracks. If you learn something that impacts the dependent tracks, please append a similar "Context Summary" to their spec.md files before concluding your run.
Execution Order & Dependencies
This is a multi-track phase. To ensure architectural integrity, these tracks MUST be executed in the following strict order:
- MMA Data Architecture & DAG Engine: (Builds the state and execution foundation)
- Tiered Context Scoping & HITL Approval: (Builds the security and context subsetting on top of the state)
- MMA Dashboard Visualization Overhaul: (Builds the UI to visualize the state and subsets)
- [CURRENT] Robust Live Simulation Verification: (Builds the tests to verify the UI and state)
Prerequisites for this track: MMA Dashboard Visualization Overhaul MUST be completed ([x]) before starting this track.
Session Compression (2026-02-28)
Current State & Glaring Issues:
- Brittle Interception System: The visual simulation (`tests/visual_sim_mma_v2.py`) relies heavily on polling an `api_hooks.py` endpoint (`/api/gui/mma_status`) that aggregates several boolean flags (`pending_approval`, `pending_spawn`). This has proven extremely brittle. For example, `mock_gemini_cli.py` defaults to emitting a `read_file` tool call, which triggers the general tool approval popup (`_pending_ask`), freezing the test because it was expecting the MMA spawn popup (`_pending_mma_spawn`) or the Track Proposal modal.
- Mock Pollution in App Domain: Previous attempts to fix the simulation shoehorned test-specific mock JSON responses directly into `ai_client.py` and `scripts/mma_exec.py`. This conflates the test environment with the production application codebase.
- Popup Handling Failures: The GUI's state machine for closing popups (like `_show_track_proposal_modal` in `_cb_accept_tracks`) is desynchronized from the hook API. The test clicks "Accept", the tracks generate, but the UI state doesn't cleanly reset, leading to endless timeouts during test runs.
Next Steps for the Handoff:
- Completely rip out the hardcoded mock JSON arrays from `ai_client.py` and `scripts/mma_exec.py`.
- Refactor `tests/mock_gemini_cli.py` to be a pure, standalone mock that faithfully simulates the expected streaming behavior of `gemini_cli` without relying on the app to intercept specific magic prompts.
- Stabilize the hook API (`api_hooks.py`) so the test script can unambiguously distinguish between a general tool approval, an MMA step approval, and an MMA worker spawn approval, instead of relying on a fragile `pending_approval` catch-all.
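One way to serve that last step is a single discriminated `state` field instead of overlapping booleans. A sketch of what a stabilized `/api/gui/mma_status` payload could look like; the field and flag names below mirror the ones mentioned in this spec, but the mapping itself is a proposal, not the current API:

```python
from enum import Enum


class MmaStatus(str, Enum):
    IDLE = "idle"
    TOOL_APPROVAL = "tool_approval"          # general _pending_ask popup
    MMA_STEP_APPROVAL = "mma_step_approval"
    MMA_SPAWN_APPROVAL = "mma_spawn_approval"
    TRACK_PROPOSAL = "track_proposal"


def mma_status(gui_flags: dict) -> dict:
    """Collapse the GUI's internal flags into one unambiguous state.
    Order matters: the most specific MMA modal wins over the general
    tool-approval popup, so a read_file ask can't mask a spawn."""
    if gui_flags.get("_pending_mma_spawn"):
        state = MmaStatus.MMA_SPAWN_APPROVAL
    elif gui_flags.get("_pending_mma_step"):
        state = MmaStatus.MMA_STEP_APPROVAL
    elif gui_flags.get("_pending_ask"):
        state = MmaStatus.TOOL_APPROVAL
    elif gui_flags.get("_track_proposal_open"):
        state = MmaStatus.TRACK_PROPOSAL
    else:
        state = MmaStatus.IDLE
    return {"state": state.value}


# A read_file tool popup no longer masquerades as a spawn approval:
print(mma_status({"_pending_ask": True}))        # {'state': 'tool_approval'}
print(mma_status({"_pending_mma_spawn": True}))  # {'state': 'mma_spawn_approval'}
```

With this shape the test script switches on one string instead of guessing which of several booleans "wins".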
Session Compression (2026-02-28, Late Session Addendum)
Current Blocker: The Tier 3 worker simulation is stuck. The orchestration loop in `multi_agent_conductor.py` correctly starts `run_worker_lifecycle`, and `ai_client.py` successfully sends a mock response back from `gemini_cli`. However, the visual test never sees this output in `mma_streams`.
- The GUI expects `handle_ai_response` to carry the final AI response (including a `stream_id` mapping to a specific Tier 3 worker string).
- In earlier attempts, we tried manually pushing a `handle_ai_response` event back into the GUI's `event_queue` at the end of `run_worker_lifecycle`, but it seems the GUI is still looping infinitely, showing `Polling streams: ['Tier 1']`. The state machine doesn't seem to recognize that the Tier 3 task is done or correctly populate the stream dictionary for the UI to pick up.
- Handoff Directive: The next agent needs to trace exactly how a successful AI response from a subprocess/thread (which `run_worker_lifecycle` operates in) is supposed to bubble up to `self.mma_streams` in `gui_2.py`. Is `events.emit("response_received")` or `handle_ai_response` missing? Why is the test only seeing `'Tier 1'` in the `mma_streams` keys? Focus on the handoff between `ai_client.py` completing a run and `gui_2.py` rendering the result.
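One plausible shape for the missing handoff, reduced to a runnable sketch. The names quoted in the directive (`run_worker_lifecycle`, `handle_ai_response`, `event_queue`, `mma_streams`) are kept; everything else is a hypothetical stand-in for `gui_2.py`, not its actual code:

```python
import queue


class GuiSketch:
    """Minimal stand-in for gui_2.py's event pump."""

    def __init__(self):
        self.event_queue = queue.Queue()   # thread-safe handoff point
        self.mma_streams = {"Tier 1": []}  # the dict the test polls

    def _process_event_queue(self):
        # Runs on the GUI thread: drain pending events and route them.
        while True:
            try:
                event = self.event_queue.get_nowait()
            except queue.Empty:
                return
            if event["type"] == "handle_ai_response":
                # Create the stream on first sight so a Tier 3 worker
                # appears in mma_streams instead of being dropped.
                self.mma_streams.setdefault(event["stream_id"], [])
                self.mma_streams[event["stream_id"]].append(event["text"])


def run_worker_lifecycle(gui, stream_id, response):
    # From the worker thread: hand the result to the GUI thread's
    # queue rather than mutating GUI state directly.
    gui.event_queue.put(
        {"type": "handle_ai_response", "stream_id": stream_id, "text": response}
    )


gui = GuiSketch()
run_worker_lifecycle(gui, "Tier 3 / task-1", "done")
gui._process_event_queue()
print(sorted(gui.mma_streams))  # ['Tier 1', 'Tier 3 / task-1']
```

If the real pump never calls `setdefault` (or never routes `handle_ai_response` at all), the symptom would match what the test observes: only `'Tier 1'` ever appears in the `mma_streams` keys.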