Structural Testing Contract
To maintain the integrity of the test suite and ensure that AI-driven test modifications do not create false positives ("mock-rot"), the following rules apply to all testing within this project:
- Ban on Arbitrary Core Mocking: Tier 3 workers are strictly forbidden from using unittest.mock.patch to bypass or stub core infrastructure (e.g., event queues, ai_client internals, threading primitives) unless explicitly authorized by the Tier 2 Tech Lead for a specific boundary test.
- live_gui Standard: All integration and end-to-end testing must utilize the live_gui fixture to interact with a real instance of the application via the Hook API. Bypassing the hook server to directly mutate GUI state in tests is prohibited.
- Artifact Isolation: All test-generated artifacts (logs, temporary workspaces, mock outputs) MUST be written to the tests/artifacts/ or tests/logs/ directories. These directories are git-ignored to prevent repository pollution.
Verification & Simulation Framework
Infrastructure
--enable-test-hooks
When launched with this flag, the application starts the HookServer on port 8999, exposing its internal state to external HTTP requests. This is the foundation for all automated verification. Without this flag, the Hook API is only available when the provider is gemini_cli.
The live_gui pytest Fixture
Defined in tests/conftest.py, this session-scoped fixture manages the lifecycle of the application under test.
Spawning:
@pytest.fixture(scope="session")
def live_gui() -> Generator[tuple[subprocess.Popen, str], None, None]:
    process = subprocess.Popen(
        ["uv", "run", "python", "-u", gui_script, "--enable-test-hooks"],
        stdout=log_file, stderr=log_file, text=True,
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP if os.name == 'nt' else 0
    )
- -u flag: Disables output buffering for real-time log capture.
- Process group: On Windows, uses CREATE_NEW_PROCESS_GROUP so the entire tree (GUI + child processes) can be killed cleanly.
- Logging: Stdout/stderr redirected to logs/gui_2_py_test.log.
Readiness polling:
max_retries = 15  # seconds
while time.time() - start_time < max_retries:
    try:
        response = requests.get("http://127.0.0.1:8999/status", timeout=0.5)
        if response.status_code == 200:
            ready = True
            break
    except requests.RequestException:
        pass  # Hook server is not accepting connections yet
    if process.poll() is not None:
        break  # Process died early
    time.sleep(0.5)
Polls GET /status every 500ms for up to 15 seconds. Checks process.poll() each iteration to detect early crashes (avoids waiting the full timeout if the GUI exits). Pre-check: tests if port 8999 is already occupied.
Failure path: If the hook server never responds, kills the process tree and calls pytest.fail() to abort the entire test session. Diagnostic telemetry (startup time, PID, success/fail) is written via VerificationLogger.
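The polling and early-exit logic described above can be separated from HTTP and process handling, which makes it testable without launching the GUI. The following is a minimal sketch, not the code in tests/conftest.py: `wait_until_ready`, `probe`, and `is_alive` are hypothetical names, and the caller is assumed to supply a probe that performs the GET /status request and a liveness check that wraps `process.poll()`.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    is_alive: Callable[[], bool],
    timeout: float = 15.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once probe() succeeds, False on timeout or early process exit."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused/reset while the server is still starting.
            # requests and urllib connection errors both derive from OSError.
            pass
        if not is_alive():
            return False  # Process died early, no point waiting for the timeout
        sleep(interval)
    return False
```

In a fixture, a False result would lead to the failure path: kill the process tree and call pytest.fail(). Injecting `sleep` keeps unit tests of the loop fast.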
Teardown:
finally:
    client = ApiHookClient()
    client.reset_session()  # Clean GUI state before killing
    time.sleep(0.5)
    kill_process_tree(process.pid)
    log_file.close()
Sends reset_session() via ApiHookClient before killing to prevent stale state files.
Yield value: (process: subprocess.Popen, gui_script: str).
Session Isolation
@pytest.fixture(autouse=True)
def reset_ai_client() -> Generator[None, None, None]:
    ai_client.reset_session()
    ai_client.set_provider("gemini", "gemini-2.5-flash-lite")
    yield
Runs automatically before every test. Resets the ai_client module state and defaults to a safe model, preventing state pollution between tests.
Workspace Isolation (autouse)
@pytest.fixture(autouse=True)
def isolate_workspace(tmp_path_factory, monkeypatch) -> Generator[None, None, None]:
    # Redirects the path resolution layer to a temp directory
    # Prevents tests from writing to the user's actual project
    ...
This autouse fixture ensures every test runs against an isolated tmp_path workspace. It monkeypatch-es src.paths so that any code path resolving a project directory (e.g., manual_slop.toml lookup, conductor directory resolution, log directory) is redirected to a fresh temp directory per test. Without this, tests could mutate the user's actual manual_slop.toml or conductor tracks directory.
This is the primary mechanism for satisfying the Artifact Isolation rule in the Structural Testing Contract.
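The redirect-and-restore idea behind isolate_workspace can be illustrated without pytest. This is a sketch under stated assumptions: `paths` is a stand-in namespace rather than the real src.paths module, and `project_root` is a hypothetical attribute name. pytest's monkeypatch fixture performs the same restore step automatically at test teardown.

```python
import contextlib
import tempfile
import types
from pathlib import Path
from typing import Iterator

# Hypothetical stand-in for the project's path resolution module.
paths = types.SimpleNamespace(project_root=Path("/real/project"))


@contextlib.contextmanager
def isolated_workspace(module: object, attr: str = "project_root") -> Iterator[Path]:
    """Point module.<attr> at a fresh temp directory, then restore it."""
    original = getattr(module, attr)
    with tempfile.TemporaryDirectory() as tmp:
        setattr(module, attr, Path(tmp))
        try:
            yield Path(tmp)
        finally:
            # Always restore, even if the test body raised.
            setattr(module, attr, original)
```

Any code that resolves its project directory through the patched attribute writes into the temp directory, which is deleted when the block exits.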
Path Reset (autouse)
@pytest.fixture(autouse=True)
def reset_paths() -> Generator[None, None, None]:
    # Forces `src/paths.py` to re-resolve from environment / config on next access
    ...
Pairs with isolate_workspace to fully reset the path subsystem. After a test that creates a project config, the next test gets a clean slate.
mock_app and app_instance Fixtures
For unit tests that need a partially-mocked App (without the full live_gui launch), two additional fixtures are available:
- mock_app: Returns an App instance with key subsystems (event queue, comms log, file cache) mocked. Use for testing GUI logic in isolation.
- app_instance: Returns a real App instance with disk-backed state, but without launching the render loop. Use for testing the full controller logic.
These are scoped per-test (not session) and run faster than live_gui for unit-level testing.
Process Cleanup
def kill_process_tree(pid: int | None) -> None:
- Windows: taskkill /F /T /PID <pid> force-kills the process and all children (/T is critical since the GUI spawns child processes).
- Unix: os.killpg(os.getpgid(pid), SIGKILL) kills the entire process group.
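A minimal sketch of how the two platform branches might fit together is shown below. It is an illustration, not the project's implementation. The guard against signalling the caller's own process group is an added assumption: on Unix, a child started without its own session or process group shares the parent's group, so an unguarded killpg would also terminate the test runner.

```python
import os
import signal
import subprocess
from typing import Optional


def kill_process_tree(pid: Optional[int]) -> None:
    """Best-effort kill of a process and its children."""
    if pid is None:
        return
    if os.name == "nt":
        # /T walks the child tree, /F forces termination.
        subprocess.run(
            ["taskkill", "/F", "/T", "/PID", str(pid)],
            capture_output=True,
            check=False,
        )
        return
    try:
        pgid = os.getpgid(pid)
        if pgid == os.getpgrp():
            # Assumed safety guard: never SIGKILL our own process group.
            os.kill(pid, signal.SIGKILL)
        else:
            os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # Process already exited
```

For the group kill to reach child processes on Unix, the target would need to be launched with `start_new_session=True` (or an equivalent), which is a choice for the caller.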
VerificationLogger
Structured diagnostic logging for test telemetry:
class VerificationLogger:
    def __init__(self, test_name: str, script_name: str):
        self.logs_dir = Path(f"logs/test/{datetime.now().strftime('%Y%m%d_%H%M%S')}")
    def log_state(self, field: str, before: Any, after: Any, delta: Any = None): ...
    def finalize(self, description: str, status: str, result_msg: str): ...
Output format: fixed-width column table (Field | Before | After | Delta) written to logs/test/<timestamp>/<script_name>.txt. Dual output: file + tagged stdout lines for CI visibility.
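To make the fixed-width table concrete, here is one way a row could be rendered. The helper name `format_row` and the column widths are assumptions for illustration. The actual widths and separators used by VerificationLogger are not specified here.

```python
from typing import Any, Sequence


def format_row(
    field: Any,
    before: Any,
    after: Any,
    delta: Any = None,
    widths: Sequence[int] = (24, 16, 16, 16),  # assumed column widths
) -> str:
    """Render one Field | Before | After | Delta row with padded, truncated cells."""
    cells = (field, before, after, "" if delta is None else delta)
    return " | ".join(
        str(cell)[:width].ljust(width) for cell, width in zip(cells, widths)
    )


header = format_row("Field", "Before", "After", "Delta")
row = format_row("fps", 30, 60, 30)
```

Because every cell is padded or truncated to its column width, the header and all rows have the same length and line up when written to the log file.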
Simulation Lifecycle: The "Puppeteer" Pattern
Simulations act as external puppeteers, driving the GUI through the ApiHookClient HTTP interface. The canonical example is tests/visual_sim_mma_v2.py.
Stage 1: Mock Provider Setup
client = ApiHookClient()
client.set_value('current_provider', 'gemini_cli')
mock_cli_path = f'{sys.executable} {os.path.abspath("tests/mock_gemini_cli.py")}'
client.set_value('gcli_path', mock_cli_path)
client.set_value('files_base_dir', 'tests/artifacts/temp_workspace')
client.click('btn_project_save')
- Switches the GUI's LLM provider to gemini_cli (the CLI adapter).
- Points the CLI binary to python tests/mock_gemini_cli.py so that all LLM calls go to the mock.
- Redirects files_base_dir to a temp workspace to prevent polluting real project directories.
- Saves the project configuration.
Stage 2: Epic Planning
client.set_value('mma_epic_input', 'Develop a new feature')
client.click('btn_mma_plan_epic')
Enters an epic description and triggers planning. The GUI invokes the LLM (which hits the mock).
Stage 3: Poll for Proposed Tracks (60s timeout)
for _ in range(60):
    status = client.get_mma_status()
    if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn')
    elif status.get('pending_mma_step_approval'): client.click('btn_approve_mma_step')
    elif status.get('pending_tool_approval'): client.click('btn_approve_tool')
    if status.get('proposed_tracks') and len(status['proposed_tracks']) > 0: break
    time.sleep(1)
The approval automation is a critical pattern repeated in every polling loop. The MMA engine has three approval gates:
- Spawn approval: Permission to create a new worker subprocess.
- Step approval: Permission to proceed with the next orchestration step.
- Tool approval: Permission to execute a tool call.
All three are auto-approved by clicking the corresponding button. Without this, the engine would block indefinitely at each gate.
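Since the same approval logic recurs in every polling loop, it could be factored into one helper. The sketch below is hypothetical (`poll_with_auto_approval` is not an existing function in the suite). It assumes only the two client methods used above, `get_mma_status()` and `click()`, and the flag and button names already shown.

```python
import time
from typing import Any, Callable, Dict, Optional

# (status flag, button to click), checked in the same order as the if/elif chain.
APPROVAL_GATES = (
    ("pending_mma_spawn_approval", "btn_approve_spawn"),
    ("pending_mma_step_approval", "btn_approve_mma_step"),
    ("pending_tool_approval", "btn_approve_tool"),
)


def poll_with_auto_approval(
    client: Any,
    done: Callable[[Dict[str, Any]], bool],
    attempts: int = 60,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Dict[str, Any]]:
    """Poll MMA status, approving at most one gate per iteration.

    Returns the first status for which done(status) is true, or None on timeout.
    """
    for _ in range(attempts):
        status = client.get_mma_status()
        for flag, button in APPROVAL_GATES:
            if status.get(flag):
                client.click(button)
                break
        if done(status):
            return status
        sleep(interval)
    return None
```

A Stage 3 call would then read `poll_with_auto_approval(client, lambda s: bool(s.get('proposed_tracks')))`, with later stages changing only the predicate and the attempt count.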
Stage 4: Accept Tracks
client.click('btn_mma_accept_tracks')
Stage 5: Poll for Tracks Populated (30s timeout)
Waits until status['tracks'] contains a track with 'Mock Goal 1' in its title.
Stage 6: Load Track and Verify Tickets (60s timeout)
client.click('btn_mma_load_track', user_data=track_id_to_load)
Then polls until:
- active_track matches the loaded track ID.
- active_tickets list is non-empty.
Stage 7: Verify MMA Status Transitions (120s timeout)
Polls until mma_status is either 'running' or 'done'. Continues auto-approving all gates.
Stage 8: Verify Worker Output in Streams (60s timeout)
streams = status.get('mma_streams', {})
if any("Tier 3" in k for k in streams.keys()):
    tier3_key = [k for k in streams.keys() if "Tier 3" in k][0]
    if "SUCCESS: Mock Tier 3 worker" in streams[tier3_key]:
        streams_found = True
Verifies that mma_streams contains a key with "Tier 3" and the value contains the exact mock output string.
Assertions Summary
- Mock provider setup succeeds (try/except with pytest.fail).
- proposed_tracks appears within 60 seconds.
- 'Mock Goal 1' track exists in tracks list within 30 seconds.
- Track loads and active_tickets populate within 60 seconds.
- MMA status becomes 'running' or 'done' within 120 seconds.
- Tier 3 worker output with specific mock content appears in mma_streams within 60 seconds.
Mock Provider Strategy
tests/mock_gemini_cli.py
A fake Gemini CLI executable that replaces the real gemini binary during integration tests. Outputs JSON-L messages matching the real CLI's streaming output protocol.
Input mechanism:
prompt = sys.stdin.read() # Primary: prompt via stdin
sys.argv # Secondary: management command detection
os.environ.get('GEMINI_CLI_HOOK_CONTEXT') # Tertiary: environment variable
Management command bypass:
if len(sys.argv) > 1 and sys.argv[1] in ["mcp", "extensions", "skills", "hooks"]:
    return  # Silent exit
Response routing — keyword matching on stdin content:
| Prompt Contains | Response | Session ID |
|---|---|---|
| 'PATH: Epic Initialization' | Two mock Track objects (mock-track-1, mock-track-2) | mock-session-epic |
| 'PATH: Sprint Planning' | Two mock Ticket objects (mock-ticket-1 independent, mock-ticket-2 depends on mock-ticket-1) | mock-session-sprint |
| '"role": "tool"' or '"tool_call_id"' | Success message (simulates post-tool-call final answer) | mock-session-final |
| Default (Tier 3 worker prompts) | "SUCCESS: Mock Tier 3 worker implemented the change. [MOCK OUTPUT]" | mock-session-default |
Output protocol — every response is exactly two JSON-L lines:
{"type": "message", "role": "assistant", "content": "<response>"}
{"type": "result", "status": "success", "stats": {"total_tokens": N, ...}, "session_id": "mock-session-*"}
This matches the real Gemini CLI's streaming output format. flush=True on every print() ensures the consuming process receives data immediately.
Tool call simulation: The mock does not emit tool calls. It detects tool results in the prompt ('"role": "tool"' check) and responds with a final answer, simulating the second turn of a tool-call conversation without actually issuing calls.
Debug output: All debug information goes to stderr, keeping stdout clean for the JSON-L protocol.
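The routing and two-line output protocol described in this section can be condensed into a short sketch. This is not the contents of tests/mock_gemini_cli.py: the function names `route` and `emit`, the angle-bracket placeholder payloads, the token count, and the order in which keywords are checked are all assumptions. Only the keywords, session IDs, default reply, and message shapes come from the description above.

```python
import json
import sys
from typing import TextIO, Tuple

DEFAULT_REPLY = "SUCCESS: Mock Tier 3 worker implemented the change. [MOCK OUTPUT]"


def route(prompt: str) -> Tuple[str, str]:
    """Pick (content, session_id) by keyword match on the prompt text."""
    if "PATH: Epic Initialization" in prompt:
        return "<two mock tracks>", "mock-session-epic"  # placeholder payload
    if "PATH: Sprint Planning" in prompt:
        return "<two mock tickets>", "mock-session-sprint"  # placeholder payload
    if '"role": "tool"' in prompt or '"tool_call_id"' in prompt:
        return "<final answer after tool result>", "mock-session-final"
    return DEFAULT_REPLY, "mock-session-default"


def emit(content: str, session_id: str, out: TextIO = sys.stdout) -> None:
    """Write exactly two JSON-L lines: the assistant message, then the result."""
    message = {"type": "message", "role": "assistant", "content": content}
    result = {
        "type": "result",
        "status": "success",
        "stats": {"total_tokens": len(content.split())},  # illustrative count
        "session_id": session_id,
    }
    # flush=True so the consuming process sees each line immediately.
    print(json.dumps(message), file=out, flush=True)
    print(json.dumps(result), file=out, flush=True)
```

A script built this way would call `emit(*route(sys.stdin.read()))` and send any debug output to stderr.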
Visual Verification Patterns
Tests in this framework don't just check return values — they verify the rendered state of the application via the Hook API.
DAG Integrity
Verify that active_tickets in the MMA status matches the expected task graph:
status = client.get_mma_status()
tickets = status.get('active_tickets', [])
assert len(tickets) >= 2
assert any(t['id'] == 'mock-ticket-1' for t in tickets)
Stream Telemetry
Check mma_streams to ensure output from multiple tiers is correctly captured and routed:
streams = status.get('mma_streams', {})
tier3_keys = [k for k in streams.keys() if "Tier 3" in k]
assert len(tier3_keys) > 0
assert "SUCCESS" in streams[tier3_keys[0]]
Modal State
Assert that the correct dialog is active during a pending tool call:
status = client.get_mma_status()
assert status.get('pending_tool_approval') == True
# or
diag = client.get_indicator_state('thinking')
assert diag.get('thinking') == True
Performance Monitoring
Verify UI responsiveness under load:
perf = client.get_performance()
assert perf['fps'] > 30
assert perf['input_lag_ms'] < 100
Test Areas by Subsystem
Beyond the Puppeteer pattern, the test suite covers distinct subsystems with their own fixtures and assertions. The key areas:
| Area | Key test files | Approach |
|---|---|---|
| MMA orchestration | test_conductor_engine_v2.py, test_conductor_engine_abort.py, test_conductor_abort_event.py, test_orchestration_logic.py, test_parallel_execution.py | Unit tests of ConductorEngine, WorkerPool, run_worker_lifecycle. Use mock providers and direct conductor invocation. |
| MMA dashboard | test_mma_dashboard_refresh.py, test_mma_dashboard_streams.py, test_mma_node_editor.py, test_mma_orchestration_gui.py, test_mma_concurrent_tracks_sim.py | GUI-level tests with live_gui + ApiHookClient. Verify dashboard refresh, stream telemetry, node editor interaction. |
| Discussion | test_discussion_takes.py, test_discussion_takes_gui.py, test_discussion_metrics.py, test_discussion_compression.py, test_gui_discussion_tabs.py | Take branching, per-response token metrics, history compression. |
| Context composition | test_context_composition_*.py (Phase 1-6), test_context_preview_button.py, test_view_presets.py, test_custom_slices_annotations.py | Decoupled context panel, view presets, custom slices, preview button. |
| RAG | test_rag_engine.py, test_rag_integration.py, test_rag_gui_presence.py, test_rag_phase4_stress.py, test_rag_visual_sim.py | Vector store lifecycle, integration with ai_client.send, GUI presence, stress testing. |
| Beads | test_beads_client.py, test_mcp_client_beads.py, test_gui_dag_beads.py | BeadsClient CRUD, MCP tool registration, DAG visualization with Beads-backed tickets. |
| External MCP | test_external_mcp.py, test_external_mcp_e2e.py, test_external_mcp_hitl.py, test_mcp_config.py | Server lifecycle, end-to-end with real processes, HITL approval flow. |
| Hot reload | test_hot_reloader.py, test_hot_reload_integration.py | Module invalidation, state preservation, integration with rendering. |
| C/C++ AST | test_ts_c_tools.py, test_ts_cpp_tools.py, test_mcp_ts_integration.py | Tree-sitter AST tools dispatch, definitions, signatures, updates. |
| Personas & tool bias | test_persona_manager.py, test_persona_models.py, test_tool_bias.py, test_bias_efficacy.py, test_bias_integration.py | Persona CRUD, bias engine, prompt generation effects. |
| Provider-specific | test_deepseek_provider.py, test_minimax_provider.py, test_gemini_cli_adapter.py, test_gemini_cli_adapter_parity.py, test_gemini_metrics.py | Per-provider behavior, parity checks, Gemini cache metrics. |
| Workspace profiles | test_workspace_manager.py, test_workspace_profiles_sim.py | Profile save/load, scope inheritance, auto-switch (when integrated). |
| History (undo/redo) | test_history.py, test_history_manager.py, test_history_management.py, test_undo_redo_sim.py | Non-provider undo/redo, snapshot jumping. |
Convention: Subsystem-specific test files are named test_<subsystem>_<aspect>.py. Integration tests with live_gui end in _sim.py or _integration.py. End-to-end tests with real processes end in _e2e.py.
Headless Service Tests
The application also runs in headless mode (without GUI) as a decoupled FastAPI/Uvicorn service. These tests verify the headless path:
- test_headless_service.py: Basic service lifecycle, route registration.
- test_headless_simulation.py: End-to-end MMA simulation via the headless service (no GUI launch).
- test_headless_verification.py: Full run with error + QA interceptor verification.
The headless service uses the Remote Confirmation Protocol for HITL: when an AI action requires approval, the service blocks on an HTTP endpoint and waits for an external orchestrator (typically a CLI script) to POST a decision. The protocol is documented in guide_tools.md.
Supporting Analysis Modules
file_cache.py — ASTParser (tree-sitter)
class ASTParser:
    def __init__(self, language: str = "python"):
        self.language = tree_sitter.Language(tree_sitter_python.language())
        self.parser = tree_sitter.Parser(self.language)
    def parse(self, code: str) -> tree_sitter.Tree: ...
    def get_skeleton(self, code: str, path: str = "") -> str: ...
    def get_curated_view(self, code: str, path: str = "") -> str: ...
    def get_targeted_view(self, code: str, symbols: List[str], path: str = "") -> str: ...
get_skeleton algorithm:
- Parse code to tree-sitter AST.
- Walk all function_definition nodes.
- For each body (block node):
  - If first non-comment child is a docstring: preserve docstring, replace the rest with an ellipsis (...).
  - Otherwise: replace entire body with an ellipsis (...).
- Apply edits in reverse byte order (maintains valid offsets).
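The same steps can be approximated with the standard library to show why edits are applied back to front. Note the substitution: this sketch uses Python's ast module and line ranges instead of tree-sitter and byte ranges, so it is an analogue of get_skeleton rather than its implementation. The names `skeleton` and `_collect_edits` are hypothetical.

```python
import ast
from typing import List, Tuple

Edit = Tuple[int, int, str]  # (first line index, end index exclusive, replacement)


def _collect_edits(node: ast.AST, lines: List[str], edits: List[Edit]) -> None:
    for child in ast.iter_child_nodes(node):
        if not isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
            _collect_edits(child, lines, edits)
            continue
        first = child.body[0]
        has_doc = (
            isinstance(first, ast.Expr)
            and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)
        )
        rest = child.body[1:] if has_doc else child.body
        if not rest:
            continue  # Docstring-only function: nothing to elide
        stmt = rest[0]
        decorators = getattr(stmt, "decorator_list", [])
        if decorators:
            start = decorators[0].lineno - 1  # Start at the "@..." line
        elif lines[stmt.lineno - 1][: stmt.col_offset].strip():
            continue  # Body shares a line with the header; leave it alone
        else:
            start = stmt.lineno - 1
        line = lines[start]
        indent = line[: len(line) - len(line.lstrip())]
        # Do not descend further: nested definitions are replaced with the body.
        edits.append((start, child.end_lineno, indent + "..."))


def skeleton(code: str) -> str:
    """Replace function bodies with '...', keeping signatures and docstrings."""
    lines = code.splitlines()
    edits: List[Edit] = []
    _collect_edits(ast.parse(code), lines, edits)
    # Apply from the bottom up so earlier line indices stay valid.
    for start, end, replacement in sorted(edits, reverse=True):
        lines[start:end] = [replacement]
    return "\n".join(lines)
```

Applying the edits top-down would shift every later range after the first replacement. Reverse order avoids recomputing offsets, which is the same reason for reverse byte order in the tree-sitter version.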
get_curated_view algorithm:
Enhanced skeleton that preserves bodies under two conditions:
- Function has @core_logic decorator.
- Function body contains a # [HOT] comment anywhere in its descendants.
If either condition is true, the body is preserved verbatim. This enables a two-tier code view: hot paths shown in full, boilerplate compressed.
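The preservation test can be sketched as a single predicate. As an assumption for illustration, this version uses the standard ast module and a text search over the function's source lines, because ast discards comments. The tree-sitter implementation can inspect comment nodes directly. `keeps_body` is a hypothetical name.

```python
import ast
from typing import Union

FuncNode = Union[ast.FunctionDef, ast.AsyncFunctionDef]


def keeps_body(func: FuncNode, source: str) -> bool:
    """True if the function is marked @core_logic or contains a '# [HOT]' comment."""
    for decorator in func.decorator_list:
        # Handle @core_logic, @core_logic(...), and @module.core_logic.
        target = decorator.func if isinstance(decorator, ast.Call) else decorator
        if isinstance(target, ast.Attribute):
            name = target.attr
        else:
            name = getattr(target, "id", None)
        if name == "core_logic":
            return True
    # Text search over the function's lines; a string literal containing
    # "# [HOT]" would be a false positive in this simplified version.
    body_lines = source.splitlines()[func.lineno - 1 : func.end_lineno]
    return any("# [HOT]" in line for line in body_lines)
```

A skeleton pass would call this predicate for each function and skip the body replacement whenever it returns True.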
get_targeted_view algorithm:
Extracts only the specified symbols and their dependencies:
- Find all requested symbol definitions (classes, functions, methods).
- For each symbol, traverse its body to find referenced names.
- Include only the definitions that are directly referenced.
- Used for surgical context injection when target_symbols is specified on a Ticket.
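The selection logic (requested symbols plus one hop of direct references) might look like the following. This is a simplified sketch with the standard ast module, limited to top-level definitions. `targeted_symbols` is a hypothetical helper, and it returns names rather than rendered source.

```python
import ast
from typing import Dict, List


def targeted_symbols(code: str, symbols: List[str]) -> List[str]:
    """Return requested top-level definitions plus those they reference directly."""
    tree = ast.parse(code)
    defs: Dict[str, ast.AST] = {
        node.name: node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }
    keep = [name for name in symbols if name in defs]
    # Iterate over a snapshot so only direct (one-hop) references are added.
    for name in list(keep):
        for node in ast.walk(defs[name]):
            if isinstance(node, ast.Name) and node.id in defs and node.id not in keep:
                keep.append(node.id)
    # Emit in source order so the resulting view reads like the original file.
    return sorted(keep, key=lambda name: defs[name].lineno)
```

Stopping at one hop keeps the injected context small. Following references transitively would pull in more of the file with each level.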
summarize.py — Heuristic File Summaries
Token-efficient structural descriptions without AI calls:
_SUMMARISERS: dict[str, Callable] = {
    ".py": _summarise_python,      # imports, classes, methods, functions, constants
    ".toml": _summarise_toml,      # table keys + array lengths
    ".md": _summarise_markdown,    # h1-h3 headings
    ".ini": _summarise_generic,    # line count + preview
}
_summarise_python uses stdlib ast:
- Parse with ast.parse().
- Extract deduplicated imports (top-level module names only).
- Extract ALL_CAPS constants (both Assign and AnnAssign).
- Extract classes with their method names.
- Extract top-level function names.
Output:
**Python** — 150 lines
imports: ast, json, pathlib
constants: TIMEOUT_SECONDS
class ASTParser: __init__, parse, get_skeleton
functions: summarise_file, build_summary_markdown
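The extraction steps listed for _summarise_python can be sketched in a few lines with the standard ast module. This is an illustrative reconstruction, not the code in summarize.py: `summarise_structure` is a hypothetical name, it returns a dict rather than the formatted text shown above, and treating a name as a constant when `str.isupper()` is true is an assumption about how ALL_CAPS is detected.

```python
import ast
from typing import Any, Dict, List

_FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def summarise_structure(code: str) -> Dict[str, Any]:
    """Collect imports, ALL_CAPS constants, classes with methods, and functions."""
    imports: List[str] = []
    constants: List[str] = []
    classes: Dict[str, List[str]] = {}
    functions: List[str] = []
    for node in ast.parse(code).body:  # top-level statements only
        if isinstance(node, ast.Import):
            imports += [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.append(node.module.split(".")[0])
        elif isinstance(node, ast.Assign):
            constants += [
                t.id for t in node.targets
                if isinstance(t, ast.Name) and t.id.isupper()
            ]
        elif isinstance(node, ast.AnnAssign):
            if isinstance(node.target, ast.Name) and node.target.id.isupper():
                constants.append(node.target.id)
        elif isinstance(node, ast.ClassDef):
            classes[node.name] = [
                m.name for m in node.body if isinstance(m, _FUNC_TYPES)
            ]
        elif isinstance(node, _FUNC_TYPES):
            functions.append(node.name)
    return {
        "imports": sorted(set(imports)),  # deduplicated top-level module names
        "constants": constants,
        "classes": classes,
        "functions": functions,
    }
```

Rendering the dict into the line-oriented output format is then a matter of joining each list with commas.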
outline_tool.py — Hierarchical Code Outline
class CodeOutliner:
    def outline(self, code: str) -> str: ...
Walks top-level ast nodes:
- ClassDef → [Class] Name (Lines X-Y) + docstring + recurse for methods
- FunctionDef → [Func] Name (Lines X-Y) or [Method] Name if nested
- AsyncFunctionDef → [Async Func] Name (Lines X-Y)
Only extracts first line of docstrings. Uses indentation depth as heuristic for method vs function.
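A compact sketch of such an outliner is shown below. It follows the labels above but is not the CodeOutliner implementation: the "doc:" prefix, the two-space indent, and printing line ranges for methods are assumptions made for the example, and nesting depth stands in for the indentation heuristic.

```python
import ast
from typing import List, Sequence


def outline(code: str) -> str:
    """Produce an indented outline of classes, functions, and methods."""
    out: List[str] = []

    def visit(nodes: Sequence[ast.stmt], depth: int) -> None:
        pad = "  " * depth
        for node in nodes:
            if isinstance(node, ast.ClassDef):
                label = "[Class]"
            elif isinstance(node, ast.AsyncFunctionDef):
                label = "[Async Func]"
            elif isinstance(node, ast.FunctionDef):
                # Depth > 0 means the function sits inside a class body.
                label = "[Method]" if depth else "[Func]"
            else:
                continue
            out.append(f"{pad}{label} {node.name} (Lines {node.lineno}-{node.end_lineno})")
            doc = ast.get_docstring(node)
            if doc:
                out.append(f"{pad}  doc: {doc.splitlines()[0]}")  # first line only
            if isinstance(node, ast.ClassDef):
                visit(node.body, depth + 1)

    visit(ast.parse(code).body, 0)
    return "\n".join(out)
```

Only class bodies are descended into, so functions nested inside other functions do not appear in this version of the outline.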
Two Parallel Code Analysis Implementations
The codebase has two parallel approaches for structural code analysis:
| Aspect | file_cache.py (tree-sitter) | summarize.py / outline_tool.py (stdlib ast) |
|---|---|---|
| Parser | tree-sitter with tree_sitter_python | Python's built-in ast module |
| Precision | Byte-accurate, preserves exact syntax | Line-level, may lose formatting nuance |
| @core_logic / [HOT] | Supported (selective body preservation) | Not supported |
| Used by | py_get_skeleton MCP tool, worker context injection | get_file_summary MCP tool, py_get_code_outline |
| Performance | Slightly slower (C extension + tree walk) | Faster (pure Python, simpler walk) |