# Verification & Simulation Framework

[Top](../Readme.md) | [Architecture](guide_architecture.md) | [Tools & IPC](guide_tools.md) | [MMA Orchestration](guide_mma.md)

---

## Infrastructure

### `--enable-test-hooks`

When launched with this flag, the application starts the `HookServer` on port `8999`, exposing its internal state to external HTTP requests. This is the foundation for all automated verification. Without this flag, the Hook API is only available when the provider is `gemini_cli`.

### The `live_gui` pytest Fixture

Defined in `tests/conftest.py`, this session-scoped fixture manages the lifecycle of the application under test.

**Spawning:**

```python
@pytest.fixture(scope="session")
def live_gui() -> Generator[tuple[subprocess.Popen, str], None, None]:
    process = subprocess.Popen(
        ["uv", "run", "python", "-u", gui_script, "--enable-test-hooks"],
        stdout=log_file,
        stderr=log_file,
        text=True,
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP if os.name == 'nt' else 0
    )
```

- **`-u` flag**: disables output buffering for real-time log capture.
- **Process group**: on Windows, uses `CREATE_NEW_PROCESS_GROUP` so the entire tree (GUI + child processes) can be killed cleanly.
- **Logging**: stdout/stderr are redirected to `logs/gui_2_py_test.log`.

**Readiness polling:**

```python
max_retries = 15  # seconds
start_time = time.time()
while time.time() - start_time < max_retries:
    try:
        response = requests.get("http://127.0.0.1:8999/status", timeout=0.5)
        if response.status_code == 200:
            ready = True
            break
    except requests.exceptions.ConnectionError:
        pass  # Server not listening yet
    if process.poll() is not None:
        break  # Process died early
    time.sleep(0.5)
```

Polls `GET /status` every 500 ms for up to 15 seconds, checking `process.poll()` each iteration to detect early crashes (this avoids waiting out the full timeout if the GUI exits). A pre-check tests whether port 8999 is already occupied.

**Failure path:** If the hook server never responds, the fixture kills the process tree and calls `pytest.fail()` to abort the entire test session.
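The port pre-check mentioned above can be sketched with a plain socket probe. This is a minimal sketch under the assumption of a `port_in_use` helper name; the actual fixture may implement the check differently:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds,
        # i.e. another process already owns the port.
        return sock.connect_ex((host, port)) == 0
```

Failing fast here gives a clearer error than waiting out the 15-second readiness poll against a stranger's server.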
Diagnostic telemetry (startup time, PID, success/fail) is written via `VerificationLogger`.

**Teardown:**

```python
finally:
    client = ApiHookClient()
    client.reset_session()  # Clean GUI state before killing
    time.sleep(0.5)
    kill_process_tree(process.pid)
    log_file.close()
```

Sends `reset_session()` via `ApiHookClient` before killing to prevent stale state files.

**Yield value:** `(process: subprocess.Popen, gui_script: str)`.

### Session Isolation

```python
@pytest.fixture(autouse=True)
def reset_ai_client() -> Generator[None, None, None]:
    ai_client.reset_session()
    ai_client.set_provider("gemini", "gemini-2.5-flash-lite")
    yield
```

Runs automatically before every test, resetting the `ai_client` module state and defaulting to a safe model so that state cannot leak between tests.

### Process Cleanup

```python
def kill_process_tree(pid: int | None) -> None:
```

- **Windows**: `taskkill /F /T /PID <pid>` force-kills the process and all its children (`/T` is critical, since the GUI spawns child processes).
- **Unix**: `os.killpg(os.getpgid(pid), signal.SIGKILL)` kills the entire process group.

### VerificationLogger

Structured diagnostic logging for test telemetry:

```python
class VerificationLogger:
    def __init__(self, test_name: str, script_name: str):
        self.logs_dir = Path(f"logs/test/{datetime.now().strftime('%Y%m%d_%H%M%S')}")

    def log_state(self, field: str, before: Any, after: Any, delta: Any = None)
    def finalize(self, description: str, status: str, result_msg: str)
```

Output format: a fixed-width column table (`Field | Before | After | Delta`) written to `logs/test/<timestamp>/<test_name>.txt`. Dual output: file plus tagged stdout lines for CI visibility.

---

## Simulation Lifecycle: The "Puppeteer" Pattern

Simulations act as external puppeteers, driving the GUI through the `ApiHookClient` HTTP interface. The canonical example is `tests/visual_sim_mma_v2.py`.
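In condensed form, a puppeteer run looks like the sketch below. `drive_simulation` is a hypothetical helper invented for illustration — the real simulation is staged out step by step in the sections that follow, and the spawn/step approval gates are elided here:

```python
import time

def drive_simulation(client) -> bool:
    """Hypothetical condensed driver for one puppeteer run."""
    client.set_value('mma_epic_input', 'Develop a new feature')
    client.click('btn_mma_plan_epic')
    for _ in range(60):  # poll up to ~60 s, auto-approving gates as they appear
        status = client.get_mma_status()
        if status.get('pending_tool_approval'):  # spawn/step gates elided
            client.click('btn_approve_tool')
        if status.get('proposed_tracks'):
            return True
        time.sleep(1)
    return False
```

Everything is driven from outside the process: the simulation only ever sets values, clicks buttons, and reads status over HTTP.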
### Stage 1: Mock Provider Setup

```python
client = ApiHookClient()
client.set_value('current_provider', 'gemini_cli')
mock_cli_path = f'{sys.executable} {os.path.abspath("tests/mock_gemini_cli.py")}'
client.set_value('gcli_path', mock_cli_path)
client.set_value('files_base_dir', 'tests/artifacts/temp_workspace')
client.click('btn_project_save')
```

- Switches the GUI's LLM provider to `gemini_cli` (the CLI adapter).
- Points the CLI binary at `python tests/mock_gemini_cli.py` — all LLM calls go to the mock.
- Redirects `files_base_dir` to a temp workspace to prevent polluting real project directories.
- Saves the project configuration.

### Stage 2: Epic Planning

```python
client.set_value('mma_epic_input', 'Develop a new feature')
client.click('btn_mma_plan_epic')
```

Enters an epic description and triggers planning. The GUI invokes the LLM (which hits the mock).

### Stage 3: Poll for Proposed Tracks (60s timeout)

```python
for _ in range(60):
    status = client.get_mma_status()
    if status.get('pending_mma_spawn_approval'):
        client.click('btn_approve_spawn')
    elif status.get('pending_mma_step_approval'):
        client.click('btn_approve_mma_step')
    elif status.get('pending_tool_approval'):
        client.click('btn_approve_tool')
    if status.get('proposed_tracks') and len(status['proposed_tracks']) > 0:
        break
    time.sleep(1)
```

The **approval automation** is a critical pattern repeated in every polling loop. The MMA engine has three approval gates:

- **Spawn approval**: permission to create a new worker subprocess.
- **Step approval**: permission to proceed with the next orchestration step.
- **Tool approval**: permission to execute a tool call.

All three are auto-approved by clicking the corresponding button. Without this, the engine would block indefinitely at each gate.

### Stage 4: Accept Tracks

```python
client.click('btn_mma_accept_tracks')
```

### Stage 5: Poll for Tracks Populated (30s timeout)

Waits until `status['tracks']` contains a track with `'Mock Goal 1'` in its title.
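Because the same three-gate auto-approval recurs in every polling stage, it could be factored into a helper. This is an illustrative refactoring, not code from the test suite:

```python
# Map each approval flag in the MMA status dict to the button that clears it.
APPROVAL_GATES = {
    'pending_mma_spawn_approval': 'btn_approve_spawn',
    'pending_mma_step_approval': 'btn_approve_mma_step',
    'pending_tool_approval': 'btn_approve_tool',
}

def auto_approve(client, status: dict) -> None:
    """Click whichever approval button the current status requires."""
    for flag, button in APPROVAL_GATES.items():
        if status.get(flag):
            client.click(button)
            break  # mirror the elif chain: at most one click per poll
```

Each polling loop then reduces to `auto_approve(client, status)` followed by its stage-specific exit condition.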
### Stage 6: Load Track and Verify Tickets (60s timeout)

```python
client.click('btn_mma_load_track', user_data=track_id_to_load)
```

Then polls until:

- `active_track` matches the loaded track ID.
- The `active_tickets` list is non-empty.

### Stage 7: Verify MMA Status Transitions (120s timeout)

Polls until `mma_status == 'running'` or `'done'`, continuing to auto-approve all gates.

### Stage 8: Verify Worker Output in Streams (60s timeout)

```python
streams = status.get('mma_streams', {})
if any("Tier 3" in k for k in streams.keys()):
    tier3_key = [k for k in streams.keys() if "Tier 3" in k][0]
    if "SUCCESS: Mock Tier 3 worker" in streams[tier3_key]:
        streams_found = True
```

Verifies that `mma_streams` contains a key mentioning "Tier 3" whose value contains the exact mock output string.

### Assertions Summary

1. Mock provider setup succeeds (try/except with `pytest.fail`).
2. `proposed_tracks` appears within 60 seconds.
3. A `'Mock Goal 1'` track exists in the tracks list within 30 seconds.
4. The track loads and `active_tickets` populates within 60 seconds.
5. MMA status becomes `'running'` or `'done'` within 120 seconds.
6. Tier 3 worker output with the specific mock content appears in `mma_streams` within 60 seconds.

---

## Mock Provider Strategy

### `tests/mock_gemini_cli.py`

A fake Gemini CLI executable that replaces the real `gemini` binary during integration tests. It outputs JSON-L messages matching the real CLI's streaming output protocol.
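Its core shape can be sketched in a few lines. This is a simplified sketch: the real script routes on more keywords (detailed below), and the placeholder content strings here are illustrative, not the actual payloads:

```python
import json
import sys

def respond(prompt: str) -> None:
    """Route on prompt keywords and emit the two-line JSON-L reply."""
    if 'PATH: Epic Initialization' in prompt:
        content, session = '[mock track payload]', 'mock-session-epic'  # placeholder
    elif '"role": "tool"' in prompt or '"tool_call_id"' in prompt:
        content, session = 'Success.', 'mock-session-final'             # placeholder
    else:
        content = 'SUCCESS: Mock Tier 3 worker implemented the change. [MOCK OUTPUT]'
        session = 'mock-session-default'
    # flush=True so the consuming process sees each line immediately.
    print(json.dumps({"type": "message", "role": "assistant", "content": content}), flush=True)
    print(json.dumps({"type": "result", "status": "success", "session_id": session}), flush=True)
```

The entry point would read `sys.stdin` and pass it to `respond`; the sections below cover the input mechanism and routing table in full.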
**Input mechanism:**

```python
prompt = sys.stdin.read()                  # Primary: prompt via stdin
sys.argv                                   # Secondary: management command detection
os.environ.get('GEMINI_CLI_HOOK_CONTEXT')  # Tertiary: environment variable
```

**Management command bypass:**

```python
if len(sys.argv) > 1 and sys.argv[1] in ["mcp", "extensions", "skills", "hooks"]:
    return  # Silent exit
```

**Response routing** — keyword matching on stdin content:

| Prompt Contains | Response | Session ID |
|---|---|---|
| `'PATH: Epic Initialization'` | Two mock Track objects (`mock-track-1`, `mock-track-2`) | `mock-session-epic` |
| `'PATH: Sprint Planning'` | Two mock Ticket objects (`mock-ticket-1` independent, `mock-ticket-2` depends on `mock-ticket-1`) | `mock-session-sprint` |
| `'"role": "tool"'` or `'"tool_call_id"'` | Success message (simulates the post-tool-call final answer) | `mock-session-final` |
| Default (Tier 3 worker prompts) | `"SUCCESS: Mock Tier 3 worker implemented the change. [MOCK OUTPUT]"` | `mock-session-default` |

**Output protocol** — every response is exactly two JSON-L lines:

```json
{"type": "message", "role": "assistant", "content": ""}
{"type": "result", "status": "success", "stats": {"total_tokens": N, ...}, "session_id": "mock-session-*"}
```

This matches the real Gemini CLI's streaming output format. `flush=True` on every `print()` ensures the consuming process receives data immediately.

**Tool call simulation:** The mock does **not** emit tool calls. It detects tool results in the prompt (the `'"role": "tool"'` check) and responds with a final answer, simulating the second turn of a tool-call conversation without actually issuing calls.

**Debug output:** All debug information goes to stderr, keeping stdout clean for the JSON-L protocol.

---

## Visual Verification Patterns

Tests in this framework don't just check return values — they verify the **rendered state** of the application via the Hook API.
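Each of the patterns below typically wraps its check in a poll-until-timeout loop. A generic helper for that shape might look like this (illustrative only; `wait_for` is not part of the framework):

```python
import time
from typing import Callable

def wait_for(predicate: Callable[[], bool], timeout_s: float,
             interval_s: float = 1.0) -> bool:
    """Poll predicate() until it returns truthy or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return False
```

For example, `wait_for(lambda: client.get_mma_status().get('proposed_tracks'), 60)` mirrors the Stage 3 loop above.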
### DAG Integrity

Verify that `active_tickets` in the MMA status matches the expected task graph:

```python
status = client.get_mma_status()
tickets = status.get('active_tickets', [])
assert len(tickets) >= 2
assert any(t['id'] == 'mock-ticket-1' for t in tickets)
```

### Stream Telemetry

Check `mma_streams` to ensure output from multiple tiers is correctly captured and routed:

```python
streams = status.get('mma_streams', {})
tier3_keys = [k for k in streams.keys() if "Tier 3" in k]
assert len(tier3_keys) > 0
assert "SUCCESS" in streams[tier3_keys[0]]
```

### Modal State

Assert that the correct dialog is active during a pending tool call:

```python
status = client.get_mma_status()
assert status.get('pending_tool_approval') is True
# or
diag = client.get_indicator_state('thinking')
assert diag.get('thinking') is True
```

### Performance Monitoring

Verify UI responsiveness under load:

```python
perf = client.get_performance()
assert perf['fps'] > 30
assert perf['input_lag_ms'] < 100
```

---

## Supporting Analysis Modules

### `file_cache.py` — ASTParser (tree-sitter)

```python
class ASTParser:
    def __init__(self, language: str = "python"):
        self.language = tree_sitter.Language(tree_sitter_python.language())
        self.parser = tree_sitter.Parser(self.language)

    def parse(self, code: str) -> tree_sitter.Tree
    def get_skeleton(self, code: str) -> str
    def get_curated_view(self, code: str) -> str
```

**`get_skeleton` algorithm:**

1. Parse the code to a tree-sitter AST.
2. Walk all `function_definition` nodes.
3. For each body (`block` node):
   - If the first non-comment child is a docstring: preserve the docstring and replace the rest with `...`.
   - Otherwise: replace the entire body with `...`.
4. Apply edits in reverse byte order (maintains valid offsets).

**`get_curated_view` algorithm:** An enhanced skeleton that preserves bodies under two conditions:

- The function has a `@core_logic` decorator.
- The function body contains a `# [HOT]` comment anywhere in its descendants.
If either condition is true, the body is preserved verbatim. This enables a two-tier code view: hot paths shown in full, boilerplate compressed.

### `summarize.py` — Heuristic File Summaries

Token-efficient structural descriptions without AI calls:

```python
_SUMMARISERS: dict[str, Callable] = {
    ".py": _summarise_python,      # imports, classes, methods, functions, constants
    ".toml": _summarise_toml,      # table keys + array lengths
    ".md": _summarise_markdown,    # h1-h3 headings
    ".ini": _summarise_generic,    # line count + preview
}
```

**`_summarise_python`** uses the stdlib `ast` module:

1. Parse with `ast.parse()`.
2. Extract deduplicated imports (top-level module names only).
3. Extract `ALL_CAPS` constants (both `Assign` and `AnnAssign`).
4. Extract classes with their method names.
5. Extract top-level function names.

Example output:

```
**Python** — 150 lines
imports: ast, json, pathlib
constants: TIMEOUT_SECONDS
class ASTParser: __init__, parse, get_skeleton
functions: summarise_file, build_summary_markdown
```

### `outline_tool.py` — Hierarchical Code Outline

```python
class CodeOutliner:
    def outline(self, code: str) -> str
```

Walks top-level `ast` nodes:

- `ClassDef` → `[Class] Name (Lines X-Y)` + docstring + recursion into methods
- `FunctionDef` → `[Func] Name (Lines X-Y)`, or `[Method] Name` if nested
- `AsyncFunctionDef` → `[Async Func] Name (Lines X-Y)`

Only the first line of each docstring is extracted. Indentation depth is used as a heuristic to distinguish methods from functions.
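A toy reduction of that outline walk is sketched below. This is hypothetical code, not the real `CodeOutliner`: it is simplified to one level of nesting and omits docstrings:

```python
import ast

def outline(code: str) -> str:
    """Toy outline: '[Class]/[Func]/[Method] Name (Lines X-Y)' per node."""
    lines = []
    for node in ast.parse(code).body:
        if isinstance(node, ast.ClassDef):
            lines.append(f"[Class] {node.name} (Lines {node.lineno}-{node.end_lineno})")
            for sub in node.body:  # one level only; the real tool recurses
                if isinstance(sub, ast.FunctionDef):
                    lines.append(f"  [Method] {sub.name} (Lines {sub.lineno}-{sub.end_lineno})")
        elif isinstance(node, ast.FunctionDef):
            lines.append(f"[Func] {node.name} (Lines {node.lineno}-{node.end_lineno})")
        elif isinstance(node, ast.AsyncFunctionDef):
            lines.append(f"[Async Func] {node.name} (Lines {node.lineno}-{node.end_lineno})")
    return "\n".join(lines)
```

The `lineno`/`end_lineno` attributes come straight from the stdlib parser, which is why this approach is line-accurate but not byte-accurate.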
---

## Two Parallel Code Analysis Implementations

The codebase has two parallel approaches to structural code analysis:

| Aspect | `file_cache.py` (tree-sitter) | `summarize.py` / `outline_tool.py` (stdlib `ast`) |
|---|---|---|
| Parser | tree-sitter with `tree_sitter_python` | Python's built-in `ast` module |
| Precision | Byte-accurate, preserves exact syntax | Line-level, may lose formatting nuance |
| `@core_logic` / `[HOT]` | Supported (selective body preservation) | Not supported |
| Used by | `py_get_skeleton` MCP tool, worker context injection | `get_file_summary` MCP tool, `py_get_code_outline` |
| Performance | Slightly slower (C extension + tree walk) | Faster (pure Python, simpler walk) |
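The precision difference is easy to demonstrate: a stdlib-`ast` round trip drops comments and normalizes formatting, which is exactly what the byte-accurate tree-sitter path avoids. A quick illustration:

```python
import ast

src = "def f(x):  # doubles x\n    return x*2\n"
# Re-emitting from the ast tree loses the comment and normalizes spacing:
# comments are not part of the ast, so line-level tools cannot preserve them.
print(ast.unparse(ast.parse(src)))
```

The comment disappears and `x*2` comes back as `x * 2`, so these modules are used where token-efficient structure matters more than exact reproduction.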