## Structural Testing Contract

To maintain the integrity of the test suite and ensure that AI-driven test modifications do not create false positives ("mock-rot"), the following rules apply to all testing within this project:

1. **Ban on Arbitrary Core Mocking:** Tier 3 workers are strictly forbidden from using `unittest.mock.patch` to bypass or stub core infrastructure (e.g., event queues, `ai_client` internals, threading primitives) unless explicitly authorized by the Tier 2 Tech Lead for a specific boundary test.
2. **`live_gui` Standard:** All integration and end-to-end testing must utilize the `live_gui` fixture to interact with a real instance of the application via the Hook API. Bypassing the hook server to directly mutate GUI state in tests is prohibited.
3. **Artifact Isolation:** All test-generated artifacts (logs, temporary workspaces, mock outputs) MUST be written to the `tests/artifacts/` or `tests/logs/` directories. These directories are git-ignored to prevent repository pollution.

---

## Verification & Simulation Framework

[Top](../Readme.md) | [Architecture](guide_architecture.md) | [Tools & IPC](guide_tools.md) | [MMA Orchestration](guide_mma.md)

---

## Infrastructure

### `--enable-test-hooks`

When launched with this flag, the application starts the `HookServer` on port `8999`, exposing its internal state to external HTTP requests. This is the foundation for all automated verification. Without this flag, the Hook API is only available when the provider is `gemini_cli`.

### The `live_gui` pytest Fixture

Defined in `tests/conftest.py`, this session-scoped fixture manages the lifecycle of the application under test.
**Spawning:**

```python
@pytest.fixture(scope="session")
def live_gui() -> Generator[tuple[subprocess.Popen, str], None, None]:
    process = subprocess.Popen(
        ["uv", "run", "python", "-u", gui_script, "--enable-test-hooks"],
        stdout=log_file,
        stderr=log_file,
        text=True,
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP if os.name == 'nt' else 0,
    )
```

- **`-u` flag**: Disables output buffering for real-time log capture.
- **Process group**: On Windows, uses `CREATE_NEW_PROCESS_GROUP` so the entire tree (GUI + child processes) can be killed cleanly.
- **Logging**: Stdout/stderr redirected to `logs/gui_2_py_test.log`.

**Readiness polling:**

```python
max_retries = 15  # seconds
while time.time() - start_time < max_retries:
    response = requests.get("http://127.0.0.1:8999/status", timeout=0.5)
    if response.status_code == 200:
        ready = True
        break
    if process.poll() is not None:
        break  # Process died early
    time.sleep(0.5)
```

Polls `GET /status` every 500ms for up to 15 seconds. Checks `process.poll()` each iteration to detect early crashes (avoids waiting the full timeout if the GUI exits). Pre-check: tests if port 8999 is already occupied.

**Failure path:** If the hook server never responds, kills the process tree and calls `pytest.fail()` to abort the entire test session. Diagnostic telemetry (startup time, PID, success/fail) is written via `VerificationLogger`.

**Teardown:**

```python
finally:
    client = ApiHookClient()
    client.reset_session()  # Clean GUI state before killing
    time.sleep(0.5)
    kill_process_tree(process.pid)
    log_file.close()
```

Sends `reset_session()` via `ApiHookClient` before killing to prevent stale state files.

**Yield value:** `(process: subprocess.Popen, gui_script: str)`.

### Session Isolation

```python
@pytest.fixture(autouse=True)
def reset_ai_client() -> Generator[None, None, None]:
    ai_client.reset_session()
    ai_client.set_provider("gemini", "gemini-2.5-flash-lite")
    yield
```

Runs automatically before every test.
Resets the `ai_client` module state and defaults to a safe model, preventing state pollution between tests.

### Process Cleanup

```python
def kill_process_tree(pid: int | None) -> None:
```

- **Windows**: `taskkill /F /T /PID <pid>` — force-kills the process and all children (`/T` is critical since the GUI spawns child processes).
- **Unix**: `os.killpg(os.getpgid(pid), SIGKILL)` to kill the entire process group.

### VerificationLogger

Structured diagnostic logging for test telemetry:

```python
class VerificationLogger:
    def __init__(self, test_name: str, script_name: str):
        self.logs_dir = Path(f"logs/test/{datetime.now().strftime('%Y%m%d_%H%M%S')}")

    def log_state(self, field: str, before: Any, after: Any, delta: Any = None)
    def finalize(self, description: str, status: str, result_msg: str)
```

Output format: fixed-width column table (`Field | Before | After | Delta`) written to `logs/test/<timestamp>/<test_name>.txt`. Dual output: file + tagged stdout lines for CI visibility.

---

## Simulation Lifecycle: The "Puppeteer" Pattern

Simulations act as external puppeteers, driving the GUI through the `ApiHookClient` HTTP interface. The canonical example is `tests/visual_sim_mma_v2.py`.

### Stage 1: Mock Provider Setup

```python
client = ApiHookClient()
client.set_value('current_provider', 'gemini_cli')
mock_cli_path = f'{sys.executable} {os.path.abspath("tests/mock_gemini_cli.py")}'
client.set_value('gcli_path', mock_cli_path)
client.set_value('files_base_dir', 'tests/artifacts/temp_workspace')
client.click('btn_project_save')
```

- Switches the GUI's LLM provider to `gemini_cli` (the CLI adapter).
- Points the CLI binary to `python tests/mock_gemini_cli.py` — all LLM calls go to the mock.
- Redirects `files_base_dir` to a temp workspace to prevent polluting real project directories.
- Saves the project configuration.
### Stage 2: Epic Planning

```python
client.set_value('mma_epic_input', 'Develop a new feature')
client.click('btn_mma_plan_epic')
```

Enters an epic description and triggers planning. The GUI invokes the LLM (which hits the mock).

### Stage 3: Poll for Proposed Tracks (60s timeout)

```python
for _ in range(60):
    status = client.get_mma_status()
    if status.get('pending_mma_spawn_approval'):
        client.click('btn_approve_spawn')
    elif status.get('pending_mma_step_approval'):
        client.click('btn_approve_mma_step')
    elif status.get('pending_tool_approval'):
        client.click('btn_approve_tool')
    if status.get('proposed_tracks') and len(status['proposed_tracks']) > 0:
        break
    time.sleep(1)
```

The **approval automation** is a critical pattern repeated in every polling loop. The MMA engine has three approval gates:

- **Spawn approval**: Permission to create a new worker subprocess.
- **Step approval**: Permission to proceed with the next orchestration step.
- **Tool approval**: Permission to execute a tool call.

All three are auto-approved by clicking the corresponding button. Without this, the engine would block indefinitely at each gate.

### Stage 4: Accept Tracks

```python
client.click('btn_mma_accept_tracks')
```

### Stage 5: Poll for Tracks Populated (30s timeout)

Waits until `status['tracks']` contains a track with `'Mock Goal 1'` in its title.

### Stage 6: Load Track and Verify Tickets (60s timeout)

```python
client.click('btn_mma_load_track', user_data=track_id_to_load)
```

Then polls until:

- `active_track` matches the loaded track ID.
- `active_tickets` list is non-empty.

### Stage 7: Verify MMA Status Transitions (120s timeout)

Polls until `mma_status == 'running'` or `'done'`. Continues auto-approving all gates.
### Stage 8: Verify Worker Output in Streams (60s timeout)

```python
streams = status.get('mma_streams', {})
if any("Tier 3" in k for k in streams.keys()):
    tier3_key = [k for k in streams.keys() if "Tier 3" in k][0]
    if "SUCCESS: Mock Tier 3 worker" in streams[tier3_key]:
        streams_found = True
```

Verifies that `mma_streams` contains a key with "Tier 3" and the value contains the exact mock output string.

### Assertions Summary

1. Mock provider setup succeeds (try/except with `pytest.fail`).
2. `proposed_tracks` appears within 60 seconds.
3. `'Mock Goal 1'` track exists in tracks list within 30 seconds.
4. Track loads and `active_tickets` populate within 60 seconds.
5. MMA status becomes `'running'` or `'done'` within 120 seconds.
6. Tier 3 worker output with specific mock content appears in `mma_streams` within 60 seconds.

---

## Mock Provider Strategy

### `tests/mock_gemini_cli.py`

A fake Gemini CLI executable that replaces the real `gemini` binary during integration tests. Outputs JSON-L messages matching the real CLI's streaming output protocol.
**Input mechanism:**

```python
prompt = sys.stdin.read()                    # Primary: prompt via stdin
sys.argv                                     # Secondary: management command detection
os.environ.get('GEMINI_CLI_HOOK_CONTEXT')    # Tertiary: environment variable
```

**Management command bypass:**

```python
if len(sys.argv) > 1 and sys.argv[1] in ["mcp", "extensions", "skills", "hooks"]:
    return  # Silent exit
```

**Response routing** — keyword matching on stdin content:

| Prompt Contains | Response | Session ID |
|---|---|---|
| `'PATH: Epic Initialization'` | Two mock Track objects (`mock-track-1`, `mock-track-2`) | `mock-session-epic` |
| `'PATH: Sprint Planning'` | Two mock Ticket objects (`mock-ticket-1` independent, `mock-ticket-2` depends on `mock-ticket-1`) | `mock-session-sprint` |
| `'"role": "tool"'` or `'"tool_call_id"'` | Success message (simulates post-tool-call final answer) | `mock-session-final` |
| Default (Tier 3 worker prompts) | `"SUCCESS: Mock Tier 3 worker implemented the change. [MOCK OUTPUT]"` | `mock-session-default` |

**Output protocol** — every response is exactly two JSON-L lines:

```json
{"type": "message", "role": "assistant", "content": "<response text>"}
{"type": "result", "status": "success", "stats": {"total_tokens": N, ...}, "session_id": "mock-session-*"}
```

This matches the real Gemini CLI's streaming output format. `flush=True` on every `print()` ensures the consuming process receives data immediately.

**Tool call simulation:** The mock does **not** emit tool calls. It detects tool results in the prompt (the `'"role": "tool"'` check) and responds with a final answer, simulating the second turn of a tool-call conversation without actually issuing calls.

**Debug output:** All debug information goes to stderr, keeping stdout clean for the JSON-L protocol.

---

## Visual Verification Patterns

Tests in this framework don't just check return values — they verify the **rendered state** of the application via the Hook API.
### DAG Integrity

Verify that `active_tickets` in the MMA status matches the expected task graph:

```python
status = client.get_mma_status()
tickets = status.get('active_tickets', [])
assert len(tickets) >= 2
assert any(t['id'] == 'mock-ticket-1' for t in tickets)
```

### Stream Telemetry

Check `mma_streams` to ensure output from multiple tiers is correctly captured and routed:

```python
streams = status.get('mma_streams', {})
tier3_keys = [k for k in streams.keys() if "Tier 3" in k]
assert len(tier3_keys) > 0
assert "SUCCESS" in streams[tier3_keys[0]]
```

### Modal State

Assert that the correct dialog is active during a pending tool call:

```python
status = client.get_mma_status()
assert status.get('pending_tool_approval') == True
# or
diag = client.get_indicator_state('thinking')
assert diag.get('thinking') == True
```

### Performance Monitoring

Verify UI responsiveness under load:

```python
perf = client.get_performance()
assert perf['fps'] > 30
assert perf['input_lag_ms'] < 100
```

---

## Supporting Analysis Modules

### `file_cache.py` — ASTParser (tree-sitter)

```python
class ASTParser:
    def __init__(self, language: str = "python"):
        self.language = tree_sitter.Language(tree_sitter_python.language())
        self.parser = tree_sitter.Parser(self.language)

    def parse(self, code: str) -> tree_sitter.Tree
    def get_skeleton(self, code: str, path: str = "") -> str
    def get_curated_view(self, code: str, path: str = "") -> str
    def get_targeted_view(self, code: str, symbols: List[str], path: str = "") -> str
```

**`get_skeleton` algorithm:**

1. Parse code to tree-sitter AST.
2. Walk all `function_definition` nodes.
3. For each body (`block` node):
   - If first non-comment child is a docstring: preserve docstring, replace rest with `...`.
   - Otherwise: replace entire body with `...`.
4. Apply edits in reverse byte order (maintains valid offsets).

**`get_curated_view` algorithm:** Enhanced skeleton that preserves bodies under two conditions:

- Function has `@core_logic` decorator.
- Function body contains a `# [HOT]` comment anywhere in its descendants.

If either condition is true, the body is preserved verbatim. This enables a two-tier code view: hot paths shown in full, boilerplate compressed.

**`get_targeted_view` algorithm:** Extracts only the specified symbols and their dependencies:

1. Find all requested symbol definitions (classes, functions, methods).
2. For each symbol, traverse its body to find referenced names.
3. Include only the definitions that are directly referenced.
4. Used for surgical context injection when `target_symbols` is specified on a Ticket.

### `summarize.py` — Heuristic File Summaries

Token-efficient structural descriptions without AI calls:

```python
_SUMMARISERS: dict[str, Callable] = {
    ".py": _summarise_python,    # imports, classes, methods, functions, constants
    ".toml": _summarise_toml,    # table keys + array lengths
    ".md": _summarise_markdown,  # h1-h3 headings
    ".ini": _summarise_generic,  # line count + preview
}
```

**`_summarise_python`** uses stdlib `ast`:

1. Parse with `ast.parse()`.
2. Extract deduplicated imports (top-level module names only).
3. Extract `ALL_CAPS` constants (both `Assign` and `AnnAssign`).
4. Extract classes with their method names.
5. Extract top-level function names.

Output:

```
**Python** — 150 lines
imports: ast, json, pathlib
constants: TIMEOUT_SECONDS
class ASTParser: __init__, parse, get_skeleton
functions: summarise_file, build_summary_markdown
```

### `outline_tool.py` — Hierarchical Code Outline

```python
class CodeOutliner:
    def outline(self, code: str) -> str
```

Walks top-level `ast` nodes:

- `ClassDef` → `[Class] Name (Lines X-Y)` + docstring + recurse for methods
- `FunctionDef` → `[Func] Name (Lines X-Y)` or `[Method] Name` if nested
- `AsyncFunctionDef` → `[Async Func] Name (Lines X-Y)`

Only extracts first line of docstrings. Uses indentation depth as heuristic for method vs function.
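The walk described above can be sketched with the stdlib `ast` module alone. This is a hedged approximation, not the actual `outline_tool.py` source; the exact output formatting is assumed, and recursion depth stands in for the real module's indentation heuristic:

```python
import ast

def outline(code: str) -> str:
    """Minimal CodeOutliner-style hierarchical outline (illustrative sketch)."""
    lines: list[str] = []

    def visit(node: ast.AST, depth: int = 0) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                lines.append(f"[Class] {child.name} (Lines {child.lineno}-{child.end_lineno})")
                doc = ast.get_docstring(child)
                if doc:
                    lines.append(f'  "{doc.splitlines()[0]}"')  # first docstring line only
                visit(child, depth + 1)  # recurse for methods
            elif isinstance(child, ast.FunctionDef):
                # Depth > 0 means we are inside a class body -> method.
                if depth:
                    lines.append(f"[Method] {child.name}")
                else:
                    lines.append(f"[Func] {child.name} (Lines {child.lineno}-{child.end_lineno})")
            elif isinstance(child, ast.AsyncFunctionDef):
                lines.append(f"[Async Func] {child.name} (Lines {child.lineno}-{child.end_lineno})")

    visit(ast.parse(code))
    return "\n".join(lines)
```

Running it over a small module yields entries like `[Class] A (Lines 1-4)` followed by its docstring line and `[Method]` children, then top-level `[Func]` entries.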
---

## Two Parallel Code Analysis Implementations

The codebase has two parallel approaches for structural code analysis:

| Aspect | `file_cache.py` (tree-sitter) | `summarize.py` / `outline_tool.py` (stdlib `ast`) |
|---|---|---|
| Parser | tree-sitter with `tree_sitter_python` | Python's built-in `ast` module |
| Precision | Byte-accurate, preserves exact syntax | Line-level, may lose formatting nuance |
| `@core_logic` / `[HOT]` | Supported (selective body preservation) | Not supported |
| Used by | `py_get_skeleton` MCP tool, worker context injection | `get_file_summary` MCP tool, `py_get_code_outline` |
| Performance | Slightly slower (C extension + tree walk) | Faster (pure Python, simpler walk) |
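To make the stdlib-`ast` column of the comparison concrete, here is a hedged approximation of the five-step `_summarise_python` algorithm described earlier. The function name and exact output formatting are assumptions; the real module may differ:

```python
import ast

def summarise_python(code: str) -> str:
    """Illustrative sketch of a _summarise_python-style structural summary."""
    tree = ast.parse(code)
    imports: set[str] = set()
    constants: list[str] = []
    classes: list[str] = []
    functions: list[str] = []
    for node in tree.body:
        if isinstance(node, ast.Import):
            imports.update(a.name.split(".")[0] for a in node.names)  # top-level module only
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module.split(".")[0])
        elif isinstance(node, (ast.Assign, ast.AnnAssign)):
            targets = node.targets if isinstance(node, ast.Assign) else [node.target]
            for t in targets:
                if isinstance(t, ast.Name) and t.id.isupper():  # ALL_CAPS constants
                    constants.append(t.id)
        elif isinstance(node, ast.ClassDef):
            methods = [m.name for m in node.body
                       if isinstance(m, (ast.FunctionDef, ast.AsyncFunctionDef))]
            classes.append(f"class {node.name}: {', '.join(methods)}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(node.name)
    parts = [f"**Python** — {len(code.splitlines())} lines"]
    if imports:
        parts.append("imports: " + ", ".join(sorted(imports)))
    if constants:
        parts.append("constants: " + ", ".join(constants))
    parts.extend(classes)
    if functions:
        parts.append("functions: " + ", ".join(functions))
    return "\n".join(parts)
```

The tree-sitter side of the table has no equivalent one-screen sketch: byte-accurate edits and `@core_logic` / `[HOT]` detection require the full `ASTParser` machinery in `file_cache.py`.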