# Verification & Simulation Framework
[Top](../Readme.md) | [Architecture](guide_architecture.md) | [Tools & IPC](guide_tools.md) | [MMA Orchestration](guide_mma.md)

---

## Infrastructure

### `--enable-test-hooks`
When launched with this flag, the application starts the `HookServer` on port `8999`, exposing its internal state to external HTTP requests. This is the foundation for all automated verification. Without this flag, the Hook API is only available when the provider is `gemini_cli`.

### The `live_gui` pytest Fixture

Defined in `tests/conftest.py`, this session-scoped fixture manages the lifecycle of the application under test.

**Spawning:**
```python
@pytest.fixture(scope="session")
def live_gui() -> Generator[tuple[subprocess.Popen, str], None, None]:
    process = subprocess.Popen(
        ["uv", "run", "python", "-u", gui_script, "--enable-test-hooks"],
        stdout=log_file, stderr=log_file, text=True,
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP if os.name == 'nt' else 0
    )
```

- **`-u` flag**: Disables output buffering for real-time log capture.
- **Process group**: On Windows, uses `CREATE_NEW_PROCESS_GROUP` so the entire tree (GUI + child processes) can be killed cleanly.
- **Logging**: Stdout/stderr redirected to `logs/gui_2_py_test.log`.

**Readiness polling:**
```python
max_retries = 15  # seconds
while time.time() - start_time < max_retries:
    try:
        response = requests.get("http://127.0.0.1:8999/status", timeout=0.5)
        if response.status_code == 200:
            ready = True
            break
    except requests.RequestException:
        pass  # Server not accepting connections yet
    if process.poll() is not None:
        break  # Process died early
    time.sleep(0.5)
```

Polls `GET /status` every 500 ms for up to 15 seconds, checking `process.poll()` on each iteration to detect early crashes (this avoids waiting out the full timeout if the GUI exits). As a pre-check, the fixture verifies that port 8999 is not already occupied.

**Failure path:** If the hook server never responds, the fixture kills the process tree and calls `pytest.fail()` to abort the entire test session. Diagnostic telemetry (startup time, PID, success/fail) is written via `VerificationLogger`.
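
The port pre-check can be sketched as follows. This is a minimal sketch; the helper name `port_in_use` is an assumption, and `tests/conftest.py` may implement the check differently:

```python
import socket

# Hypothetical sketch of the port-occupancy pre-check described above.
def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when something is already listening on the port
        return sock.connect_ex((host, port)) == 0
```

A `connect_ex`-based probe avoids raising on refusal, which keeps the pre-check branchless.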

**Teardown:**
```python
finally:
    client = ApiHookClient()
    client.reset_session()  # Clean GUI state before killing
    time.sleep(0.5)
    kill_process_tree(process.pid)
    log_file.close()
```

Sends `reset_session()` via `ApiHookClient` before killing to prevent stale state files.

**Yield value:** `(process: subprocess.Popen, gui_script: str)`.

### Session Isolation
```python
@pytest.fixture(autouse=True)
def reset_ai_client() -> Generator[None, None, None]:
    ai_client.reset_session()
    ai_client.set_provider("gemini", "gemini-2.5-flash-lite")
    yield
```

Runs automatically before every test. Resets the `ai_client` module state and defaults to a safe model, preventing state pollution between tests.

### Process Cleanup
```python
def kill_process_tree(pid: int | None) -> None:
```

- **Windows**: `taskkill /F /T /PID <pid>` — force-kills the process and all children (`/T` is critical since the GUI spawns child processes).
- **Unix**: `os.killpg(os.getpgid(pid), SIGKILL)` to kill the entire process group.

### VerificationLogger

Structured diagnostic logging for test telemetry:
```python
class VerificationLogger:
    def __init__(self, test_name: str, script_name: str):
        self.logs_dir = Path(f"logs/test/{datetime.now().strftime('%Y%m%d_%H%M%S')}")

    def log_state(self, field: str, before: Any, after: Any, delta: Any = None)
    def finalize(self, description: str, status: str, result_msg: str)
```

Output format: fixed-width column table (`Field | Before | After | Delta`) written to `logs/test/<timestamp>/<script_name>.txt`. Dual output: file + tagged stdout lines for CI visibility.
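
The fixed-width layout can be illustrated with a small sketch. The function below is illustrative only; `VerificationLogger`'s actual formatting code is not shown in this guide:

```python
# Illustrative rendering of Field | Before | After | Delta rows (assumed layout).
def render_state_table(rows: list[tuple]) -> str:
    header = ("Field", "Before", "After", "Delta")
    all_rows = [header] + [tuple(str(c) for c in r) for r in rows]
    widths = [max(len(str(r[i])) for r in all_rows) for i in range(4)]
    lines = [" | ".join(str(c).ljust(w) for c, w in zip(r, widths))
             for r in all_rows]
    # Separator line between the header and the data rows
    lines.insert(1, "-+-".join("-" * w for w in widths))
    return "\n".join(lines)

print(render_state_table([("mma_status", "idle", "running", "-")]))
```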

---

## Simulation Lifecycle: The "Puppeteer" Pattern

Simulations act as external puppeteers, driving the GUI through the `ApiHookClient` HTTP interface. The canonical example is `tests/visual_sim_mma_v2.py`.

### Stage 1: Mock Provider Setup
```python
client = ApiHookClient()
client.set_value('current_provider', 'gemini_cli')
mock_cli_path = f'{sys.executable} {os.path.abspath("tests/mock_gemini_cli.py")}'
client.set_value('gcli_path', mock_cli_path)
client.set_value('files_base_dir', 'tests/artifacts/temp_workspace')
client.click('btn_project_save')
```

- Switches the GUI's LLM provider to `gemini_cli` (the CLI adapter).
- Points the CLI binary to `python tests/mock_gemini_cli.py` — all LLM calls go to the mock.
- Redirects `files_base_dir` to a temp workspace to prevent polluting real project directories.
- Saves the project configuration.

### Stage 2: Epic Planning
```python
client.set_value('mma_epic_input', 'Develop a new feature')
client.click('btn_mma_plan_epic')
```

Enters an epic description and triggers planning. The GUI invokes the LLM (which hits the mock).

### Stage 3: Poll for Proposed Tracks (60s timeout)
```python
for _ in range(60):
    status = client.get_mma_status()
    if status.get('pending_mma_spawn_approval'):
        client.click('btn_approve_spawn')
    elif status.get('pending_mma_step_approval'):
        client.click('btn_approve_mma_step')
    elif status.get('pending_tool_approval'):
        client.click('btn_approve_tool')
    if status.get('proposed_tracks') and len(status['proposed_tracks']) > 0:
        break
    time.sleep(1)
```

The **approval automation** is a critical pattern repeated in every polling loop. The MMA engine has three approval gates:

- **Spawn approval**: Permission to create a new worker subprocess.
- **Step approval**: Permission to proceed with the next orchestration step.
- **Tool approval**: Permission to execute a tool call.

All three are auto-approved by clicking the corresponding button. Without this, the engine would block indefinitely at each gate.
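
The repeated gate-clearing logic can be factored into a helper along these lines. This is a sketch; the simulation scripts inline the pattern as shown above, and `client` is assumed to expose the `get_mma_status()`/`click()` methods used throughout this guide:

```python
# Mapping of status flags to their approval buttons, as used in the polling loops.
APPROVAL_BUTTONS = {
    'pending_mma_spawn_approval': 'btn_approve_spawn',
    'pending_mma_step_approval': 'btn_approve_mma_step',
    'pending_tool_approval': 'btn_approve_tool',
}

def auto_approve(client) -> list[str]:
    """Click the button for every approval gate currently pending; return the clicks."""
    status = client.get_mma_status()
    clicked = []
    for flag, button in APPROVAL_BUTTONS.items():
        if status.get(flag):
            client.click(button)
            clicked.append(button)
    return clicked
```

Unlike the `if`/`elif` chain above, this clears every pending gate in one pass; either form works because the loop re-polls each second.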

### Stage 4: Accept Tracks
```python
client.click('btn_mma_accept_tracks')
```

### Stage 5: Poll for Tracks Populated (30s timeout)

Waits until `status['tracks']` contains a track with `'Mock Goal 1'` in its title.

### Stage 6: Load Track and Verify Tickets (60s timeout)
```python
client.click('btn_mma_load_track', user_data=track_id_to_load)
```

Then polls until:

- `active_track` matches the loaded track ID.
- The `active_tickets` list is non-empty.

### Stage 7: Verify MMA Status Transitions (120s timeout)

Polls until `mma_status == 'running'` or `'done'`. Continues auto-approving all gates.

### Stage 8: Verify Worker Output in Streams (60s timeout)
```python
streams = status.get('mma_streams', {})
if any("Tier 3" in k for k in streams.keys()):
    tier3_key = [k for k in streams.keys() if "Tier 3" in k][0]
    if "SUCCESS: Mock Tier 3 worker" in streams[tier3_key]:
        streams_found = True
```

Verifies that `mma_streams` contains a key with "Tier 3" in it and that its value contains the exact mock output string.

### Assertions Summary
1. Mock provider setup succeeds (try/except with `pytest.fail`).
2. `proposed_tracks` appears within 60 seconds.
3. `'Mock Goal 1'` track exists in the tracks list within 30 seconds.
4. Track loads and `active_tickets` populate within 60 seconds.
5. MMA status becomes `'running'` or `'done'` within 120 seconds.
6. Tier 3 worker output with the specific mock content appears in `mma_streams` within 60 seconds.

---

## Mock Provider Strategy

### `tests/mock_gemini_cli.py`

A fake Gemini CLI executable that replaces the real `gemini` binary during integration tests. Outputs JSON-L messages matching the real CLI's streaming output protocol.

**Input mechanism:**
```python
prompt = sys.stdin.read()                   # Primary: prompt via stdin
sys.argv                                    # Secondary: management command detection
os.environ.get('GEMINI_CLI_HOOK_CONTEXT')   # Tertiary: environment variable
```

**Management command bypass:**
```python
if len(sys.argv) > 1 and sys.argv[1] in ["mcp", "extensions", "skills", "hooks"]:
    return  # Silent exit
```

**Response routing** — keyword matching on stdin content:

| Prompt Contains | Response | Session ID |
|---|---|---|
| `'PATH: Epic Initialization'` | Two mock Track objects (`mock-track-1`, `mock-track-2`) | `mock-session-epic` |
| `'PATH: Sprint Planning'` | Two mock Ticket objects (`mock-ticket-1` independent, `mock-ticket-2` depends on `mock-ticket-1`) | `mock-session-sprint` |
| `'"role": "tool"'` or `'"tool_call_id"'` | Success message (simulates post-tool-call final answer) | `mock-session-final` |
| Default (Tier 3 worker prompts) | `"SUCCESS: Mock Tier 3 worker implemented the change. [MOCK OUTPUT]"` | `mock-session-default` |
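
The routing table reduces to a simple keyword dispatch, roughly as below. The response labels here are descriptive placeholders, not the mock's actual payloads:

```python
# Sketch of the keyword routing in the table above (labels are placeholders).
def route_response(prompt: str) -> tuple[str, str]:
    """Return a (response_kind, session_id) pair for a given prompt."""
    if 'PATH: Epic Initialization' in prompt:
        return 'mock_tracks', 'mock-session-epic'
    if 'PATH: Sprint Planning' in prompt:
        return 'mock_tickets', 'mock-session-sprint'
    if '"role": "tool"' in prompt or '"tool_call_id"' in prompt:
        return 'final_answer', 'mock-session-final'
    return 'tier3_success', 'mock-session-default'
```

Ordering matters: the tool-result check must precede the default branch so that second-turn prompts are not mistaken for Tier 3 worker prompts.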

**Output protocol** — every response is exactly two JSON-L lines:
```json
{"type": "message", "role": "assistant", "content": "<response>"}
{"type": "result", "status": "success", "stats": {"total_tokens": N, ...}, "session_id": "mock-session-*"}
```

This matches the real Gemini CLI's streaming output format. `flush=True` on every `print()` ensures the consuming process receives data immediately.
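
A minimal emitter for this two-line protocol might look like the following. Field values beyond those shown above (such as the exact `stats` shape) are assumptions:

```python
import json

# Sketch of the two-line JSON-L emission; stats fields are assumed.
def emit_response(content: str, session_id: str, total_tokens: int = 0) -> list[str]:
    """Print the two JSON-L lines of the mock protocol; return them for inspection."""
    lines = [
        json.dumps({"type": "message", "role": "assistant", "content": content}),
        json.dumps({"type": "result", "status": "success",
                    "stats": {"total_tokens": total_tokens},
                    "session_id": session_id}),
    ]
    for line in lines:
        print(line, flush=True)  # flush so the consumer sees each line immediately
    return lines
```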

**Tool call simulation:** The mock does **not** emit tool calls. It detects tool results in the prompt (the `'"role": "tool"'` check) and responds with a final answer, simulating the second turn of a tool-call conversation without actually issuing calls.

**Debug output:** All debug information goes to stderr, keeping stdout clean for the JSON-L protocol.

---

## Visual Verification Patterns

Tests in this framework don't just check return values — they verify the **rendered state** of the application via the Hook API.

### DAG Integrity

Verify that `active_tickets` in the MMA status matches the expected task graph:
```python
status = client.get_mma_status()
tickets = status.get('active_tickets', [])
assert len(tickets) >= 2
assert any(t['id'] == 'mock-ticket-1' for t in tickets)
```

### Stream Telemetry

Check `mma_streams` to ensure output from multiple tiers is correctly captured and routed:
```python
streams = status.get('mma_streams', {})
tier3_keys = [k for k in streams.keys() if "Tier 3" in k]
assert len(tier3_keys) > 0
assert "SUCCESS" in streams[tier3_keys[0]]
```

### Modal State

Assert that the correct dialog is active during a pending tool call:
```python
status = client.get_mma_status()
assert status.get('pending_tool_approval') is True
# or
diag = client.get_indicator_state('thinking')
assert diag.get('thinking') is True
```

### Performance Monitoring

Verify UI responsiveness under load:
```python
perf = client.get_performance()
assert perf['fps'] > 30
assert perf['input_lag_ms'] < 100
```

---

## Supporting Analysis Modules

### `file_cache.py` — ASTParser (tree-sitter)
```python
class ASTParser:
    def __init__(self, language: str = "python"):
        self.language = tree_sitter.Language(tree_sitter_python.language())
        self.parser = tree_sitter.Parser(self.language)

    def parse(self, code: str) -> tree_sitter.Tree
    def get_skeleton(self, code: str) -> str
    def get_curated_view(self, code: str) -> str
```

**`get_skeleton` algorithm:**

1. Parse code to tree-sitter AST.
2. Walk all `function_definition` nodes.
3. For each body (`block` node):
   - If first non-comment child is a docstring: preserve docstring, replace rest with `...`.
   - Otherwise: replace entire body with `...`.
4. Apply edits in reverse byte order (maintains valid offsets).
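
The same elision idea can be sketched with the stdlib `ast` module. Note this is an approximation: the real implementation edits tree-sitter byte ranges in reverse order, while this sketch reparses and unparses instead:

```python
import ast

# stdlib-ast approximation of the docstring-preserving body elision above.
def skeleton(code: str) -> str:
    """Collapse function bodies to `...`, keeping docstrings."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kept = []
            first = node.body[0]
            if (isinstance(first, ast.Expr)
                    and isinstance(first.value, ast.Constant)
                    and isinstance(first.value.value, str)):
                kept.append(first)  # keep the docstring
            kept.append(ast.Expr(value=ast.Constant(value=...)))  # elide the rest
            node.body = kept
    return ast.unparse(tree)
```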

**`get_curated_view` algorithm:**

Enhanced skeleton that preserves bodies under two conditions:

- Function has `@core_logic` decorator.
- Function body contains a `# [HOT]` comment anywhere in its descendants.

If either condition is true, the body is preserved verbatim. This enables a two-tier code view: hot paths shown in full, boilerplate compressed.
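
The two conditions can be approximated with stdlib `ast`. A sketch only: the real check walks tree-sitter nodes, and the `# [HOT]` detection below falls back to a source-text scan because stdlib `ast` discards comments:

```python
import ast

# Approximation of the curated-view preservation check described above.
def should_preserve_body(fn: ast.FunctionDef, source: str) -> bool:
    """True if the function body should be kept verbatim in the curated view."""
    has_decorator = any(isinstance(d, ast.Name) and d.id == 'core_logic'
                        for d in fn.decorator_list)
    # Scan the function's source span for the hot-path marker comment.
    segment = ast.get_source_segment(source, fn) or ""
    return has_decorator or '# [HOT]' in segment
```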

### `summarize.py` — Heuristic File Summaries

Token-efficient structural descriptions without AI calls:
```python
_SUMMARISERS: dict[str, Callable] = {
    ".py": _summarise_python,       # imports, classes, methods, functions, constants
    ".toml": _summarise_toml,       # table keys + array lengths
    ".md": _summarise_markdown,     # h1-h3 headings
    ".ini": _summarise_generic,     # line count + preview
}
```

**`_summarise_python`** uses stdlib `ast`:

1. Parse with `ast.parse()`.
2. Extract deduplicated imports (top-level module names only).
3. Extract `ALL_CAPS` constants (both `Assign` and `AnnAssign`).
4. Extract classes with their method names.
5. Extract top-level function names.

Output:
```
**Python** — 150 lines
imports: ast, json, pathlib
constants: TIMEOUT_SECONDS
class ASTParser: __init__, parse, get_skeleton
functions: summarise_file, build_summary_markdown
```
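
The five extraction steps can be sketched as below. Names are simplified, and the sketch returns a dict where the real `_summarise_python` formats markdown output:

```python
import ast

# Sketch of the five extraction steps: imports, constants, classes, functions.
def summarise_python(code: str) -> dict:
    tree = ast.parse(code)
    imports: set[str] = set()
    constants, functions = [], []
    classes: dict[str, list[str]] = {}
    for node in tree.body:
        if isinstance(node, ast.Import):
            imports.update(alias.name.split('.')[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module.split('.')[0])
        elif isinstance(node, ast.Assign):
            constants += [t.id for t in node.targets
                          if isinstance(t, ast.Name) and t.id.isupper()]
        elif (isinstance(node, ast.AnnAssign)
              and isinstance(node.target, ast.Name) and node.target.id.isupper()):
            constants.append(node.target.id)
        elif isinstance(node, ast.ClassDef):
            classes[node.name] = [m.name for m in node.body
                                  if isinstance(m, ast.FunctionDef)]
        elif isinstance(node, ast.FunctionDef):
            functions.append(node.name)
    return {"imports": sorted(imports), "constants": constants,
            "classes": classes, "functions": functions}
```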

### `outline_tool.py` — Hierarchical Code Outline
```python
class CodeOutliner:
    def outline(self, code: str) -> str
```

Walks top-level `ast` nodes:

- `ClassDef` → `[Class] Name (Lines X-Y)` + docstring + recurse for methods
- `FunctionDef` → `[Func] Name (Lines X-Y)` or `[Method] Name` if nested
- `AsyncFunctionDef` → `[Async Func] Name (Lines X-Y)`

Only the first line of each docstring is extracted. Indentation depth serves as the heuristic for distinguishing methods from functions.
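
A stdlib-`ast` sketch of this walk, with formatting approximated from the labels above (the real `CodeOutliner` also emits first docstring lines, omitted here):

```python
import ast

# Sketch of the outline walk; nesting depth distinguishes methods from functions.
def outline(code: str) -> str:
    lines: list[str] = []

    def visit(node: ast.AST, depth: int = 0) -> None:
        for child in ast.iter_child_nodes(node):
            span = f"(Lines {getattr(child, 'lineno', '?')}-{getattr(child, 'end_lineno', '?')})"
            if isinstance(child, ast.ClassDef):
                lines.append(f"{'  ' * depth}[Class] {child.name} {span}")
                visit(child, depth + 1)  # recurse for methods
            elif isinstance(child, ast.AsyncFunctionDef):
                lines.append(f"{'  ' * depth}[Async Func] {child.name} {span}")
            elif isinstance(child, ast.FunctionDef):
                tag = "[Method]" if depth else "[Func]"
                lines.append(f"{'  ' * depth}{tag} {child.name} {span}")

    visit(ast.parse(code))
    return "\n".join(lines)
```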

---

## Two Parallel Code Analysis Implementations

The codebase has two parallel approaches for structural code analysis:

| Aspect | `file_cache.py` (tree-sitter) | `summarize.py` / `outline_tool.py` (stdlib `ast`) |
|---|---|---|
| Parser | tree-sitter with `tree_sitter_python` | Python's built-in `ast` module |
| Precision | Byte-accurate, preserves exact syntax | Line-level, may lose formatting nuance |
| `@core_logic` / `[HOT]` | Supported (selective body preservation) | Not supported |
| Used by | `py_get_skeleton` MCP tool, worker context injection | `get_file_summary` MCP tool, `py_get_code_outline` |
| Performance | Slightly slower (C extension + tree walk) | Faster (pure Python, simpler walk) |
|