Track Specification: Simulation Hardening
Overview
The robust_live_simulation_verification track is marked complete but its session compression documents three unresolved issues: (1) brittle mock that triggers the wrong approval popup, (2) popup state desynchronization after "Accept" clicks, (3) Tier 3 output never appearing in mma_streams (fixed by mma_pipeline_fix track). This track stabilizes the simulation framework so it reliably passes end-to-end.
Prerequisites
mma_pipeline_fix_20260301 MUST be completed first (fixes Tier 3 stream plumbing).
Current Issues (from session compression 2026-02-28)
Issue 1: Mock Triggers Wrong Approval Popup
mock_gemini_cli.py defaults to emitting a read_file tool call, which triggers the general tool approval popup (_pending_ask_dialog) instead of the MMA spawn popup (_pending_mma_spawn). The test expects the spawn popup and times out.
Root cause: The mock's default response path doesn't distinguish between MMA orchestration prompts and Tier 3 worker prompts. It needs to NOT emit tool calls for orchestration-level prompts (Tier 1/2), only for worker-level prompts where tool use is expected.
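A minimal sketch of the classification the mock needs, assuming the prompt text itself is enough to tell tiers apart. The marker strings, function names, and response shapes below are illustrative assumptions, not the real mock's protocol:

```python
# Hypothetical sketch: only emit tool calls for Tier 3 worker prompts,
# never for Tier 1/2 orchestration prompts, so the general tool-approval
# popup (_pending_ask_dialog) cannot fire during orchestration.
# The marker strings and response dicts are assumptions for illustration.

ORCHESTRATION_MARKERS = ("propose tracks", "orchestrate", "plan the")


def classify_prompt(prompt: str) -> str:
    """Return 'orchestration' for Tier 1/2 prompts, 'worker' for Tier 3."""
    lowered = prompt.lower()
    if any(marker in lowered for marker in ORCHESTRATION_MARKERS):
        return "orchestration"
    return "worker"


def mock_response(prompt: str) -> dict:
    """Deterministic mock reply keyed off the prompt classification."""
    if classify_prompt(prompt) == "orchestration":
        # Plain-text answer: no tool call, so no approval popup fires.
        return {"type": "text", "content": "Proposed plan: ..."}
    # Worker-level prompts may exercise the tool-call path as before.
    return {"type": "tool_call", "name": "read_file",
            "args": {"path": "README.md"}}
```

In practice the real mock may need a more robust signal than substring markers (e.g. an explicit tier tag in the prompt envelope), but the shape of the fix is the same: branch before emitting any tool call.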
Issue 2: Popup State Desynchronization
After clicking "Accept" on the track proposal modal, _show_track_proposal_modal is set to False but the test still sees the popup as active. The hook API's mma_status returns stale proposed_tracks data.
Root cause: _cb_accept_tracks (gui_2.py:2012-2045) processes tracks and clears proposed_tracks, but this runs on the GUI thread. The ApiHookClient.get_mma_status() reads via the GUI trampoline pattern, but there may be a frame delay before the state updates are visible.
Issue 3: Approval Type Ambiguity
The test polling loop auto-approves pending_approval but can't distinguish between tool approval (_pending_ask_dialog), MMA step approval (_pending_mma_approval), and spawn approval (_pending_mma_spawn). The simulation needs explicit handling for each type.
Already resolved in code: get_mma_status now returns separate pending_tool_approval, pending_mma_step_approval, pending_mma_spawn_approval booleans. The test in visual_sim_mma_v2.py already checks these individually. The fix is in making the mock not trigger unexpected approval types.
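The per-type dispatch the polling loop needs can be sketched as follows. The three flag names match the booleans described above; the action keys and callback wiring are stand-ins for whatever the real test harness invokes:

```python
# Hypothetical sketch: approve exactly the pending-approval type that is
# set, instead of a single ambiguous auto-approve. Flag names follow the
# get_mma_status fields above; the actions mapping is an assumption.

def auto_approve(status: dict, actions: dict):
    """Dispatch on the specific pending flag; return which action fired."""
    for flag, action in (
        ("pending_tool_approval", "approve_tool"),
        ("pending_mma_step_approval", "approve_step"),
        ("pending_mma_spawn_approval", "approve_spawn"),
    ):
        if status.get(flag):
            actions[action]()  # e.g. click the matching popup's button
            return action
    return None  # nothing pending this poll cycle
```

Keeping the flag-to-action table explicit also makes the failure mode obvious: if the mock triggers a type the test did not expect, the returned action name shows up directly in the test log.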
Goals
- Make tests/visual_sim_mma_v2.py pass reliably against the live GUI.
- Clean up mock_gemini_cli.py to be deterministic and not trigger spurious approvals.
- Add retry/timeout resilience to polling loops.
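A generic bounded-poll helper covers both this goal and the frame-delay race in Issue 2 (the GUI-thread callback clearing state one frame after the hook API reads it). This is a sketch; the predicate and timings are placeholders for the real test's checks:

```python
# Hypothetical sketch of a bounded retry loop for the test's polling.
# Using time.monotonic() keeps the deadline immune to wall-clock jumps.
import time


def poll_until(predicate, timeout=10.0, interval=0.25, label="condition"):
    """Retry predicate until it returns truthy or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)  # absorbs GUI frame-delay between polls
    raise TimeoutError(f"timed out after {timeout}s waiting for {label}")
```

For Issue 2 specifically, the call site would look like `poll_until(lambda: not client.get_mma_status().get("proposed_tracks"), label="proposed_tracks cleared")`, so a one-frame stale read becomes a retry instead of a failure.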
Architecture Reference
- Simulation patterns: docs/guide_simulations.md
- Hook API endpoints: docs/guide_tools.md — see /api/gui/mma_status response fields
- HITL mechanism: docs/guide_architecture.md — see "The Execution Clutch"