manual_slop/conductor/tracks/robust_live_simulation_verification/spec.md
Commit da21ed543d (Ed_): fix(mma): Unblock visual simulation - event routing, loop passing, adapter preservation
Three independent root causes fixed:
- gui_2.py: Route mma_spawn_approval/mma_step_approval events in _process_event_queue
- multi_agent_conductor.py: Pass asyncio loop from ConductorEngine.run() through to
  thread-pool workers for thread-safe event queue access; add _queue_put helper
- ai_client.py: Preserve GeminiCliAdapter in reset_session() instead of nulling it

Test: visual_sim_mma_v2::test_mma_complete_lifecycle passes in ~8s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 08:32:31 -05:00


Track Specification: Robust Live Simulation Verification

Overview

Establish a robust, visual simulation framework to prevent regressions in the complex GUI and asynchronous orchestration layers. This track replaces manual human verification with an automated script that clicks through the GUI and verifies the rendered state.

Goals

  1. Simulation Framework Setup: Build a dedicated test script (tests/visual_sim_mma_v2.py) utilizing ApiHookClient to control the live GUI.
  2. Simulate Epic Planning: Automate clicking "New Epic", entering a prompt, and verifying that the expected Tier 1 tracks appear in the UI.
  3. Simulate Execution & Spawning: Automate the selection of a track, the generation of the DAG, and the interaction with the HITL Approval modal.
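As a sketch of how goal 2 might be driven, the snippet below polls the live GUI until the expected state appears and fails loudly on timeout. The `click()`/`type_text()`/`get_state()` method names and element IDs are illustrative assumptions standing in for the real ApiHookClient surface, not its actual API.

```python
import time

def poll_until(predicate, timeout=10.0, interval=0.25):
    """Poll until predicate() returns a truthy value, or fail loudly."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout:.1f}s")

def simulate_epic_planning(client, prompt):
    """Drive the GUI through: New Epic -> prompt -> Tier 1 tracks rendered.

    `client` stands in for ApiHookClient; the method names and element
    identifiers here are hypothetical, chosen only to show the shape of
    the click-and-verify loop.
    """
    client.click("new_epic_button")
    client.type_text("epic_prompt_input", prompt)
    client.click("submit_epic")
    # Fail loudly if the Tier 1 tracks never show up in the rendered state.
    return poll_until(lambda: client.get_state().get("tier1_tracks"))
```

The key design point is that every GUI interaction is followed by a bounded poll on rendered state, so a hang surfaces as a `TimeoutError` rather than a silently stuck test run.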

Constraints

  • Must run against a live instance of the application using --enable-test-hooks.
  • Must fail loudly if the visual state (e.g., rendered DAG nodes, text box contents) does not match expectations.
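The "fail loudly" constraint could be captured in a small assertion helper that reports exactly which expected items (DAG node labels, text-box contents) are absent from the rendered state, rather than a bare boolean check. This is a minimal sketch, not existing test code:

```python
def assert_visual(expected, rendered, what="visual state"):
    """Fail loudly, with a diff, when expected items are absent from the
    rendered state (e.g. DAG node labels or text-box contents)."""
    missing = [item for item in expected if item not in rendered]
    if missing:
        raise AssertionError(
            f"{what} mismatch: missing {missing!r}; rendered={rendered!r}"
        )
```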

Context & Origins

This track was born from the "Human Verification" phase of the initial MMA Orchestrator prototype (mma_orchestrator_integration_20260226). We realized that while the backend API plumbing for the hierarchical MMA tiers (Tiers 1-4) was technically functional, the product lacked the necessary state management, UX visualization, and human-in-the-loop security gates to be usable.

Key Takeaways from the Prototype Phase:

  • The Tier 2 (Tech Lead) needs its own track-scoped discussion history, rather than polluting the global project history.
  • Tasks within a track require a DAG (Directed Acyclic Graph) engine to manage complex dependencies and blocking states.
  • The GUI must visualize this DAG and stream the output of individual workers directly to their associated tasks.
  • We must enforce tiered context subsetting so that Tier 3/4 workers don't receive the massive global context blob, and we need a pre-spawn approval modal so the user can intercept, review, and modify worker prompts/contexts before they execute.
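The DAG requirement above boils down to two operations: computing which tasks are runnable versus blocked, and rejecting cycles. A minimal sketch of both, under the assumption that dependencies are expressed as a task-to-prerequisites mapping (the real engine's data model may differ):

```python
def ready_tasks(deps, done):
    """Return the set of runnable tasks.

    deps maps each task to the set of tasks it depends on; done is the
    set of completed tasks. A task is ready when all of its dependencies
    are done; otherwise it is blocked.
    """
    return {task for task, reqs in deps.items()
            if task not in done and reqs <= done}

def is_acyclic(deps):
    """Kahn-style check that deps really is a DAG: repeatedly complete
    every ready task; if we stall before finishing, there is a cycle."""
    done = set()
    while True:
        ready = ready_tasks(deps, done)
        if not ready:
            break
        done |= ready
    return len(done) == len(deps)
```

(Python's stdlib `graphlib.TopologicalSorter` provides a production-grade version of this readiness/ordering logic.)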

Instructions for the Implementing Agent: As you execute this track, ensure you maintain alignment with the other Phase 2 tracks. If you learn something that impacts the dependent tracks, please append a similar "Context Summary" to their spec.md files before concluding your run.

Execution Order & Dependencies

This is a multi-track phase. To ensure architectural integrity, these tracks MUST be executed in the following strict order:

  1. MMA Data Architecture & DAG Engine: (Builds the state and execution foundation)
  2. Tiered Context Scoping & HITL Approval: (Builds the security and context subsetting on top of the state)
  3. MMA Dashboard Visualization Overhaul: (Builds the UI to visualize the state and subsets)
  4. [CURRENT] Robust Live Simulation Verification: (Builds the tests to verify the UI and state)

Prerequisites for this track: MMA Dashboard Visualization Overhaul MUST be completed ([x]) before starting this track.

Session Compression (2026-02-28)

Current State & Glaring Issues:

  1. Brittle Interception System: The visual simulation (tests/visual_sim_mma_v2.py) relies heavily on polling an api_hooks.py endpoint (/api/gui/mma_status) that aggregates several boolean flags (pending_approval, pending_spawn). This has proven extremely brittle: for example, mock_gemini_cli.py defaults to emitting a read_file tool call, which triggers the general tool-approval popup (_pending_ask) and freezes the test, which was instead waiting for the MMA spawn popup (_pending_mma_spawn) or the Track Proposal modal.
  2. Mock Pollution in App Domain: Previous attempts to fix the simulation shoehorned test-specific mock JSON responses directly into ai_client.py and scripts/mma_exec.py. This conflates the test environment with the production application codebase.
  3. Popup Handling Failures: The GUI's state machine for closing popups (like _show_track_proposal_modal in _cb_accept_tracks) is desynchronized from the hook API. The test clicks "Accept", the tracks generate, but the UI state doesn't cleanly reset, leading to endless timeouts during test runs.
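On issue 2, the "pure, standalone mock" the next steps call for would simulate gemini_cli's streaming output without any app-side magic-prompt interception. A sketch of that shape follows; the event dictionaries here are assumptions for illustration and would need to match the real gemini_cli wire format:

```python
import json

def mock_stream(prompt):
    """Yield gemini_cli-style streaming events for any prompt, with no
    dependence on ai_client.py intercepting specific magic prompts.
    Event shapes are hypothetical stand-ins for the real wire format."""
    yield {"type": "stream_start"}
    for chunk in ("Mock response for: ", prompt):
        yield {"type": "text_delta", "text": chunk}
    yield {"type": "stream_end"}

def main(prompt):
    # A standalone tests/mock_gemini_cli.py would just print JSON lines
    # for the app to consume exactly as it consumes the real CLI.
    for event in mock_stream(prompt):
        print(json.dumps(event))
```

Keeping the mock a pure function of its prompt is what lets the hardcoded JSON arrays come out of ai_client.py and scripts/mma_exec.py.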

Next Steps for the Handoff:

  • Completely rip out the hardcoded mock JSON arrays from ai_client.py and scripts/mma_exec.py.
  • Refactor tests/mock_gemini_cli.py to be a pure, standalone mock that perfectly simulates the expected streaming behavior of gemini_cli without relying on the app to intercept specific magic prompts.
  • Stabilize the hook API (api_hooks.py) so the test script can unambiguously distinguish between a general tool approval, an MMA step approval, and an MMA worker spawn approval, instead of relying on a fragile pending_approval catch-all.
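One way to make the hook API unambiguous, as the last bullet asks, is to collapse the GUI's internal flags into a single discriminated kind instead of a pending_approval catch-all. The `_pending_ask` and `_pending_mma_spawn` names come from this spec; the other attribute names and the priority order are illustrative assumptions:

```python
from enum import Enum

class PendingKind(Enum):
    NONE = "none"
    TOOL_APPROVAL = "tool_approval"          # general _pending_ask popup
    MMA_STEP_APPROVAL = "mma_step_approval"
    MMA_SPAWN_APPROVAL = "mma_spawn_approval"
    TRACK_PROPOSAL = "track_proposal"

def classify_pending(gui):
    """Map the GUI's internal flags to exactly one pending kind for
    /api/gui/mma_status. Attribute names beyond _pending_ask and
    _pending_mma_spawn are hypothetical."""
    if getattr(gui, "_pending_mma_spawn", None):
        return PendingKind.MMA_SPAWN_APPROVAL
    if getattr(gui, "_pending_mma_step", None):
        return PendingKind.MMA_STEP_APPROVAL
    if getattr(gui, "_pending_track_proposal", None):
        return PendingKind.TRACK_PROPOSAL
    if getattr(gui, "_pending_ask", None):
        return PendingKind.TOOL_APPROVAL
    return PendingKind.NONE
```

The test script can then wait for a specific kind, so a stray read_file tool-approval popup is reported as the wrong kind instead of freezing the poll loop.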

Session Compression (2026-02-28, Late Session Addendum)

Current Blocker: The Tier 3 worker simulation is stuck. The orchestration loop in multi_agent_conductor.py correctly starts run_worker_lifecycle, and ai_client.py successfully sends a mock response back from gemini_cli. However, the visual test never sees this output in mma_streams.

  • The GUI expects handle_ai_response to carry the final AI response (including stream_id mapping to a specific Tier 3 worker string).
  • In earlier attempts, we tried manually pushing a handle_ai_response event back into the GUI's event_queue at the end of run_worker_lifecycle, but it seems the GUI is still looping infinitely, showing Polling streams: ['Tier 1']. The state machine doesn't seem to recognize that the Tier 3 task is done or correctly populate the stream dictionary for the UI to pick up.
  • Handoff Directive: The next agent needs to trace exactly how a successful AI response from a subprocess/thread (which run_worker_lifecycle operates in) is supposed to bubble up to self.mma_streams in gui_2.py. Is events.emit("response_received") or handle_ai_response missing? Why is the test only seeing 'Tier 1' in the mma_streams keys? Focus on the handoff between ai_client.py completing a run and gui_2.py rendering the result.
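For the agent tracing that handoff: the commit at the top of this file says the fix passes the asyncio loop into the thread-pool workers and adds a _queue_put helper. A minimal sketch of that thread-to-GUI bubbling pattern is below; the function signatures and event dictionary are assumptions based on this spec, not the real code:

```python
import asyncio
import threading

def _queue_put(loop, event_queue, event):
    """Thread-safe push onto the GUI's asyncio event queue: the worker
    thread never touches the queue directly; it schedules the put on the
    GUI's own loop."""
    loop.call_soon_threadsafe(event_queue.put_nowait, event)

def run_worker_lifecycle(loop, event_queue, stream_id, run_ai):
    """Sketch of the thread-pool side: run the AI call, then bubble the
    final response up as a handle_ai_response event carrying the
    stream_id the GUI would use to key mma_streams. Names here are
    illustrative, not the real signatures."""
    text = run_ai()
    _queue_put(loop, event_queue, {
        "type": "handle_ai_response",
        "stream_id": stream_id,  # e.g. a Tier 3 worker key, not "Tier 1"
        "text": text,
    })
```

If the GUI only ever sees 'Tier 1' in mma_streams, one hypothesis consistent with this pattern is that the event reaches the queue but the consumer never maps its stream_id into the stream dictionary, which is why the directive above focuses on the ai_client.py-to-gui_2.py handoff.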