# Verification & Simulation Framework
[Top](../Readme.md) | [Architecture](guide_architecture.md) | [Tools & IPC](guide_tools.md) | [MMA Orchestration](guide_mma.md)

---

## Infrastructure

### `--enable-test-hooks`
When launched with this flag, the application starts the `HookServer` on port `8999`, exposing its internal state to external HTTP requests. This is the foundation for all automated verification. Without this flag, the Hook API is only available when the provider is `gemini_cli`.

### The `live_gui` pytest Fixture

Defined in `tests/conftest.py`, this session-scoped fixture manages the lifecycle of the application under test.

**Spawning:**
```python
@pytest.fixture(scope="session")
def live_gui() -> Generator[tuple[subprocess.Popen, str], None, None]:
    process = subprocess.Popen(
        ["uv", "run", "python", "-u", gui_script, "--enable-test-hooks"],
        stdout=log_file, stderr=log_file, text=True,
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP if os.name == 'nt' else 0
    )
```

- **`-u` flag**: Disables output buffering for real-time log capture.
- **Process group**: On Windows, uses `CREATE_NEW_PROCESS_GROUP` so the entire tree (GUI + child processes) can be killed cleanly.
- **Logging**: Stdout/stderr redirected to `logs/gui_2_py_test.log`.

**Readiness polling:**
```python
max_retries = 15  # seconds
while time.time() - start_time < max_retries:
    try:
        response = requests.get("http://127.0.0.1:8999/status", timeout=0.5)
        if response.status_code == 200:
            ready = True
            break
    except requests.RequestException:
        pass  # Server not accepting connections yet
    if process.poll() is not None:
        break  # Process died early
    time.sleep(0.5)
```

Polls `GET /status` every 500 ms for up to 15 seconds, checking `process.poll()` on each iteration to detect early crashes (this avoids waiting out the full timeout if the GUI exits). As a pre-check, the fixture verifies that port 8999 is not already occupied.

**Failure path:** If the hook server never responds, the fixture kills the process tree and calls `pytest.fail()` to abort the entire test session. Diagnostic telemetry (startup time, PID, success/fail) is written via `VerificationLogger`.
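
The port pre-check can be sketched as follows. This is a minimal sketch; the helper name `port_in_use` is an assumption, and `tests/conftest.py` may implement the check differently:

```python
import socket

# Hypothetical sketch of the port-occupancy pre-check described above.
def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when something is already listening on the port
        return sock.connect_ex((host, port)) == 0
```

A `connect_ex`-based probe avoids raising on refusal, which keeps the pre-check branchless.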

**Teardown:**
```python
finally:
    client = ApiHookClient()
    client.reset_session()  # Clean GUI state before killing
    time.sleep(0.5)
    kill_process_tree(process.pid)
    log_file.close()
```

Sends `reset_session()` via `ApiHookClient` before killing to prevent stale state files.

**Yield value:** `(process: subprocess.Popen, gui_script: str)`.

### Session Isolation
```python
@pytest.fixture(autouse=True)
def reset_ai_client() -> Generator[None, None, None]:
    ai_client.reset_session()
    ai_client.set_provider("gemini", "gemini-2.5-flash-lite")
    yield
```

Runs automatically before every test. Resets the `ai_client` module state and defaults to a safe model, preventing state pollution between tests.

### Process Cleanup
```python
def kill_process_tree(pid: int | None) -> None:
```

- **Windows**: `taskkill /F /T /PID <pid>` — force-kills the process and all children (`/T` is critical since the GUI spawns child processes).
- **Unix**: `os.killpg(os.getpgid(pid), SIGKILL)` to kill the entire process group.

### VerificationLogger

Structured diagnostic logging for test telemetry:
```python
class VerificationLogger:
    def __init__(self, test_name: str, script_name: str):
        self.logs_dir = Path(f"logs/test/{datetime.now().strftime('%Y%m%d_%H%M%S')}")

    def log_state(self, field: str, before: Any, after: Any, delta: Any = None)
    def finalize(self, description: str, status: str, result_msg: str)
```

Output format: fixed-width column table (`Field | Before | After | Delta`) written to `logs/test/<timestamp>/<script_name>.txt`. Dual output: file + tagged stdout lines for CI visibility.
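
The fixed-width layout can be illustrated with a small sketch. The function below is illustrative only; `VerificationLogger`'s actual formatting code is not shown in this guide:

```python
# Illustrative rendering of Field | Before | After | Delta rows (assumed layout).
def render_state_table(rows: list[tuple]) -> str:
    header = ("Field", "Before", "After", "Delta")
    all_rows = [header] + [tuple(str(c) for c in r) for r in rows]
    widths = [max(len(str(r[i])) for r in all_rows) for i in range(4)]
    lines = [" | ".join(str(c).ljust(w) for c, w in zip(r, widths))
             for r in all_rows]
    # Separator line between the header and the data rows
    lines.insert(1, "-+-".join("-" * w for w in widths))
    return "\n".join(lines)

print(render_state_table([("mma_status", "idle", "running", "-")]))
```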

---

## Simulation Lifecycle: The "Puppeteer" Pattern

Simulations act as external puppeteers, driving the GUI through the `ApiHookClient` HTTP interface. The canonical example is `tests/visual_sim_mma_v2.py`.

### Stage 1: Mock Provider Setup
```python
client = ApiHookClient()
client.set_value('current_provider', 'gemini_cli')
mock_cli_path = f'{sys.executable} {os.path.abspath("tests/mock_gemini_cli.py")}'
client.set_value('gcli_path', mock_cli_path)
client.set_value('files_base_dir', 'tests/artifacts/temp_workspace')
client.click('btn_project_save')
```

- Switches the GUI's LLM provider to `gemini_cli` (the CLI adapter).
- Points the CLI binary to `python tests/mock_gemini_cli.py` — all LLM calls go to the mock.
- Redirects `files_base_dir` to a temp workspace to prevent polluting real project directories.
- Saves the project configuration.

### Stage 2: Epic Planning
```python
client.set_value('mma_epic_input', 'Develop a new feature')
client.click('btn_mma_plan_epic')
```

Enters an epic description and triggers planning. The GUI invokes the LLM (which hits the mock).

### Stage 3: Poll for Proposed Tracks (60s timeout)
```python
for _ in range(60):
    status = client.get_mma_status()
    if status.get('pending_mma_spawn_approval'):
        client.click('btn_approve_spawn')
    elif status.get('pending_mma_step_approval'):
        client.click('btn_approve_mma_step')
    elif status.get('pending_tool_approval'):
        client.click('btn_approve_tool')
    if status.get('proposed_tracks') and len(status['proposed_tracks']) > 0:
        break
    time.sleep(1)
```

The **approval automation** is a critical pattern repeated in every polling loop. The MMA engine has three approval gates:

- **Spawn approval**: Permission to create a new worker subprocess.
- **Step approval**: Permission to proceed with the next orchestration step.
- **Tool approval**: Permission to execute a tool call.

All three are auto-approved by clicking the corresponding button. Without this, the engine would block indefinitely at each gate.
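
The repeated gate-clearing logic can be factored into a helper along these lines. This is a sketch; the simulation scripts inline the pattern as shown above, and `client` is assumed to expose the `get_mma_status()`/`click()` methods used throughout this guide:

```python
# Mapping of status flags to their approval buttons, as used in the polling loops.
APPROVAL_BUTTONS = {
    'pending_mma_spawn_approval': 'btn_approve_spawn',
    'pending_mma_step_approval': 'btn_approve_mma_step',
    'pending_tool_approval': 'btn_approve_tool',
}

def auto_approve(client) -> list[str]:
    """Click the button for every approval gate currently pending; return the clicks."""
    status = client.get_mma_status()
    clicked = []
    for flag, button in APPROVAL_BUTTONS.items():
        if status.get(flag):
            client.click(button)
            clicked.append(button)
    return clicked
```

Unlike the `if`/`elif` chain above, this clears every pending gate in one pass; either form works because the loop re-polls each second.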

### Stage 4: Accept Tracks
```python
client.click('btn_mma_accept_tracks')
```

### Stage 5: Poll for Tracks Populated (30s timeout)

Waits until `status['tracks']` contains a track with `'Mock Goal 1'` in its title.

### Stage 6: Load Track and Verify Tickets (60s timeout)
```python
client.click('btn_mma_load_track', user_data=track_id_to_load)
```

Then polls until:

- `active_track` matches the loaded track ID.
- The `active_tickets` list is non-empty.

### Stage 7: Verify MMA Status Transitions (120s timeout)

Polls until `mma_status == 'running'` or `'done'`. Continues auto-approving all gates.

### Stage 8: Verify Worker Output in Streams (60s timeout)
```python
streams = status.get('mma_streams', {})
if any("Tier 3" in k for k in streams.keys()):
    tier3_key = [k for k in streams.keys() if "Tier 3" in k][0]
    if "SUCCESS: Mock Tier 3 worker" in streams[tier3_key]:
        streams_found = True
```

Verifies that `mma_streams` contains a key with "Tier 3" in it and that its value contains the exact mock output string.

### Assertions Summary
1. Mock provider setup succeeds (try/except with `pytest.fail`).
2. `proposed_tracks` appears within 60 seconds.
3. `'Mock Goal 1'` track exists in the tracks list within 30 seconds.
4. Track loads and `active_tickets` populate within 60 seconds.
5. MMA status becomes `'running'` or `'done'` within 120 seconds.
6. Tier 3 worker output with the specific mock content appears in `mma_streams` within 60 seconds.

---

## Mock Provider Strategy

### `tests/mock_gemini_cli.py`

A fake Gemini CLI executable that replaces the real `gemini` binary during integration tests. Outputs JSON-L messages matching the real CLI's streaming output protocol.

**Input mechanism:**
```python
prompt = sys.stdin.read()                   # Primary: prompt via stdin
sys.argv                                    # Secondary: management command detection
os.environ.get('GEMINI_CLI_HOOK_CONTEXT')   # Tertiary: environment variable
```

**Management command bypass:**
```python
if len(sys.argv) > 1 and sys.argv[1] in ["mcp", "extensions", "skills", "hooks"]:
    return  # Silent exit
```

**Response routing** — keyword matching on stdin content:

| Prompt Contains | Response | Session ID |
|---|---|---|
| `'PATH: Epic Initialization'` | Two mock Track objects (`mock-track-1`, `mock-track-2`) | `mock-session-epic` |
| `'PATH: Sprint Planning'` | Two mock Ticket objects (`mock-ticket-1` independent, `mock-ticket-2` depends on `mock-ticket-1`) | `mock-session-sprint` |
| `'"role": "tool"'` or `'"tool_call_id"'` | Success message (simulates post-tool-call final answer) | `mock-session-final` |
| Default (Tier 3 worker prompts) | `"SUCCESS: Mock Tier 3 worker implemented the change. [MOCK OUTPUT]"` | `mock-session-default` |
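
The routing table reduces to a simple keyword dispatch, roughly as below. The response labels here are descriptive placeholders, not the mock's actual payloads:

```python
# Sketch of the keyword routing in the table above (labels are placeholders).
def route_response(prompt: str) -> tuple[str, str]:
    """Return a (response_kind, session_id) pair for a given prompt."""
    if 'PATH: Epic Initialization' in prompt:
        return 'mock_tracks', 'mock-session-epic'
    if 'PATH: Sprint Planning' in prompt:
        return 'mock_tickets', 'mock-session-sprint'
    if '"role": "tool"' in prompt or '"tool_call_id"' in prompt:
        return 'final_answer', 'mock-session-final'
    return 'tier3_success', 'mock-session-default'
```

Ordering matters: the tool-result check must precede the default branch so that second-turn prompts are not mistaken for Tier 3 worker prompts.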

**Output protocol** — every response is exactly two JSON-L lines:
```json
{"type": "message", "role": "assistant", "content": "<response>"}
{"type": "result", "status": "success", "stats": {"total_tokens": N, ...}, "session_id": "mock-session-*"}
```

This matches the real Gemini CLI's streaming output format. `flush=True` on every `print()` ensures the consuming process receives data immediately.
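
A minimal emitter for this two-line protocol might look like the following. Field values beyond those shown above (such as the exact `stats` shape) are assumptions:

```python
import json

# Sketch of the two-line JSON-L emission; stats fields are assumed.
def emit_response(content: str, session_id: str, total_tokens: int = 0) -> list[str]:
    """Print the two JSON-L lines of the mock protocol; return them for inspection."""
    lines = [
        json.dumps({"type": "message", "role": "assistant", "content": content}),
        json.dumps({"type": "result", "status": "success",
                    "stats": {"total_tokens": total_tokens},
                    "session_id": session_id}),
    ]
    for line in lines:
        print(line, flush=True)  # flush so the consumer sees each line immediately
    return lines
```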

**Tool call simulation:** The mock does **not** emit tool calls. It detects tool results in the prompt (the `'"role": "tool"'` check) and responds with a final answer, simulating the second turn of a tool-call conversation without actually issuing calls.

**Debug output:** All debug information goes to stderr, keeping stdout clean for the JSON-L protocol.

---

## Visual Verification Patterns

Tests in this framework don't just check return values — they verify the **rendered state** of the application via the Hook API.

### DAG Integrity

Verify that `active_tickets` in the MMA status matches the expected task graph:
```python
status = client.get_mma_status()
tickets = status.get('active_tickets', [])
assert len(tickets) >= 2
assert any(t['id'] == 'mock-ticket-1' for t in tickets)
```

### Stream Telemetry

Check `mma_streams` to ensure output from multiple tiers is correctly captured and routed:
```python
streams = status.get('mma_streams', {})
tier3_keys = [k for k in streams.keys() if "Tier 3" in k]
assert len(tier3_keys) > 0
assert "SUCCESS" in streams[tier3_keys[0]]
```

### Modal State

Assert that the correct dialog is active during a pending tool call:
```python
status = client.get_mma_status()
assert status.get('pending_tool_approval') is True
# or
diag = client.get_indicator_state('thinking')
assert diag.get('thinking') is True
```

### Performance Monitoring

Verify UI responsiveness under load:
```python
perf = client.get_performance()
assert perf['fps'] > 30
assert perf['input_lag_ms'] < 100
```

---

## Supporting Analysis Modules

### `file_cache.py` — ASTParser (tree-sitter)
```python
class ASTParser:
    def __init__(self, language: str = "python"):
        self.language = tree_sitter.Language(tree_sitter_python.language())
        self.parser = tree_sitter.Parser(self.language)

    def parse(self, code: str) -> tree_sitter.Tree
    def get_skeleton(self, code: str) -> str
    def get_curated_view(self, code: str) -> str
```

**`get_skeleton` algorithm:**

1. Parse code to tree-sitter AST.
2. Walk all `function_definition` nodes.
3. For each body (`block` node):
   - If first non-comment child is a docstring: preserve docstring, replace rest with `...`.
   - Otherwise: replace entire body with `...`.
4. Apply edits in reverse byte order (maintains valid offsets).
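
The same elision idea can be sketched with the stdlib `ast` module. Note this is an approximation: the real implementation edits tree-sitter byte ranges in reverse order, while this sketch reparses and unparses instead:

```python
import ast

# stdlib-ast approximation of the docstring-preserving body elision above.
def skeleton(code: str) -> str:
    """Collapse function bodies to `...`, keeping docstrings."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kept = []
            first = node.body[0]
            if (isinstance(first, ast.Expr)
                    and isinstance(first.value, ast.Constant)
                    and isinstance(first.value.value, str)):
                kept.append(first)  # keep the docstring
            kept.append(ast.Expr(value=ast.Constant(value=...)))  # elide the rest
            node.body = kept
    return ast.unparse(tree)
```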

**`get_curated_view` algorithm:**

Enhanced skeleton that preserves bodies under two conditions:

- Function has `@core_logic` decorator.
- Function body contains a `# [HOT]` comment anywhere in its descendants.

If either condition is true, the body is preserved verbatim. This enables a two-tier code view: hot paths shown in full, boilerplate compressed.
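
The two conditions can be approximated with stdlib `ast`. A sketch only: the real check walks tree-sitter nodes, and the `# [HOT]` detection below falls back to a source-text scan because stdlib `ast` discards comments:

```python
import ast

# Approximation of the curated-view preservation check described above.
def should_preserve_body(fn: ast.FunctionDef, source: str) -> bool:
    """True if the function body should be kept verbatim in the curated view."""
    has_decorator = any(isinstance(d, ast.Name) and d.id == 'core_logic'
                        for d in fn.decorator_list)
    # Scan the function's source span for the hot-path marker comment.
    segment = ast.get_source_segment(source, fn) or ""
    return has_decorator or '# [HOT]' in segment
```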

### `summarize.py` — Heuristic File Summaries

Token-efficient structural descriptions without AI calls:
```python
_SUMMARISERS: dict[str, Callable] = {
    ".py": _summarise_python,       # imports, classes, methods, functions, constants
    ".toml": _summarise_toml,       # table keys + array lengths
    ".md": _summarise_markdown,     # h1-h3 headings
    ".ini": _summarise_generic,     # line count + preview
}
```

**`_summarise_python`** uses stdlib `ast`:

1. Parse with `ast.parse()`.
2. Extract deduplicated imports (top-level module names only).
3. Extract `ALL_CAPS` constants (both `Assign` and `AnnAssign`).
4. Extract classes with their method names.
5. Extract top-level function names.

Output:
```
**Python** — 150 lines
imports: ast, json, pathlib
constants: TIMEOUT_SECONDS
class ASTParser: __init__, parse, get_skeleton
functions: summarise_file, build_summary_markdown
```
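
The five extraction steps can be sketched as below. Names are simplified, and the sketch returns a dict where the real `_summarise_python` formats markdown output:

```python
import ast

# Sketch of the five extraction steps: imports, constants, classes, functions.
def summarise_python(code: str) -> dict:
    tree = ast.parse(code)
    imports: set[str] = set()
    constants, functions = [], []
    classes: dict[str, list[str]] = {}
    for node in tree.body:
        if isinstance(node, ast.Import):
            imports.update(alias.name.split('.')[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module.split('.')[0])
        elif isinstance(node, ast.Assign):
            constants += [t.id for t in node.targets
                          if isinstance(t, ast.Name) and t.id.isupper()]
        elif (isinstance(node, ast.AnnAssign)
              and isinstance(node.target, ast.Name) and node.target.id.isupper()):
            constants.append(node.target.id)
        elif isinstance(node, ast.ClassDef):
            classes[node.name] = [m.name for m in node.body
                                  if isinstance(m, ast.FunctionDef)]
        elif isinstance(node, ast.FunctionDef):
            functions.append(node.name)
    return {"imports": sorted(imports), "constants": constants,
            "classes": classes, "functions": functions}
```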

### `outline_tool.py` — Hierarchical Code Outline
```python
class CodeOutliner:
    def outline(self, code: str) -> str
```

Walks top-level `ast` nodes:

- `ClassDef` → `[Class] Name (Lines X-Y)` + docstring + recurse for methods
- `FunctionDef` → `[Func] Name (Lines X-Y)` or `[Method] Name` if nested
- `AsyncFunctionDef` → `[Async Func] Name (Lines X-Y)`

Only the first line of each docstring is extracted. Indentation depth serves as the heuristic for distinguishing methods from functions.
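
A stdlib-`ast` sketch of this walk, with formatting approximated from the labels above (the real `CodeOutliner` also emits first docstring lines, omitted here):

```python
import ast

# Sketch of the outline walk; nesting depth distinguishes methods from functions.
def outline(code: str) -> str:
    lines: list[str] = []

    def visit(node: ast.AST, depth: int = 0) -> None:
        for child in ast.iter_child_nodes(node):
            span = f"(Lines {getattr(child, 'lineno', '?')}-{getattr(child, 'end_lineno', '?')})"
            if isinstance(child, ast.ClassDef):
                lines.append(f"{'  ' * depth}[Class] {child.name} {span}")
                visit(child, depth + 1)  # recurse for methods
            elif isinstance(child, ast.AsyncFunctionDef):
                lines.append(f"{'  ' * depth}[Async Func] {child.name} {span}")
            elif isinstance(child, ast.FunctionDef):
                tag = "[Method]" if depth else "[Func]"
                lines.append(f"{'  ' * depth}{tag} {child.name} {span}")

    visit(ast.parse(code))
    return "\n".join(lines)
```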

---

## Two Parallel Code Analysis Implementations

The codebase has two parallel approaches for structural code analysis:

| Aspect | `file_cache.py` (tree-sitter) | `summarize.py` / `outline_tool.py` (stdlib `ast`) |
|---|---|---|
| Parser | tree-sitter with `tree_sitter_python` | Python's built-in `ast` module |
| Precision | Byte-accurate, preserves exact syntax | Line-level, may lose formatting nuance |
| `@core_logic` / `[HOT]` | Supported (selective body preservation) | Not supported |
| Used by | `py_get_skeleton` MCP tool, worker context injection | `get_file_summary` MCP tool, `py_get_code_outline` |
| Performance | Slightly slower (C extension + tree walk) | Faster (pure Python, simpler walk) |
|