reports and potential new track

2026-03-05 00:31:55 -05:00
parent 1bc4205153
commit 983538aa8b
7 changed files with 3011 additions and 1 deletion

@@ -0,0 +1,3 @@
# Test Architecture Integrity & Simulation Audit
[Specification](spec.md) | [Plan](plan.md)

@@ -0,0 +1,9 @@
{
"id": "test_architecture_integrity_audit_20260304"`,
"name": "Test Architecture Integrity & Simulation Audit"`,
"status": "planned",
"created_at": "2026-03-04T00:00:00Z",
"updated_at": "2026-03-04T00:00:00Z",
"type": "audit",
"severity": "high"
}

@@ -0,0 +1,33 @@
# Implementation Plan
## Phase 1: Documentation (Planning)
Focus: Create comprehensive audit documentation with severity ratings
- [ ] Task 1.1: Document all identified false positive risks with severity matrix
- [ ] Task 1.2: Document all simulation fidelity gaps with impact analysis
- [ ] Task 1.3: Create mapping of coverage gaps to test categories
- [ ] Task 1.4: Provide concrete false positive examples
- [ ] Task 1.5: Provide concrete simulation miss examples
- [ ] Task 1.6: Prioritize recommendations by impact/effort matrix
## Phase 2: Review & Validation (Research)
Focus: Peer review of audit findings
- [ ] Task 2.1: Review existing tracks for overlap with this audit
- [ ] Task 2.2: Validate severity ratings against actual bug history
- [ ] Task 2.3: Cross-reference findings with docs/guide_simulations.md contract
- [ ] Task 2.4: Identify which gaps should be addressed in which future track
## Phase 3: Track Finalization
Focus: Prepare for downstream implementation tracks
- [ ] Task 3.1: Create prioritized backlog of implementation recommendations
- [ ] Task 3.2: Map recommendations to appropriate future tracks
- [ ] Task 3.3: Document dependencies between this audit and subsequent work
## Phase 4: User Manual Verification (Protocol in workflow.md)
Focus: Human review of audit findings
- [ ] Task 4.1: Review severity matrix for accuracy
- [ ] Task 4.2: Validate concrete examples against real-world scenarios
- [ ] Task 4.3: Approve recommendations for implementation

File diff suppressed because it is too large

@@ -0,0 +1,562 @@
# Test Architecture Integrity Audit — Claude Review
**Author:** Claude Sonnet 4.6 (Tier 1 Orchestrator)
**Review Date:** 2026-03-05
**Source Report:** report.md (authored by GLM-4.7, 2026-03-04)
**Scope:** Verify GLM's findings, correct errors, surface missed issues, produce actionable
recommendations for downstream tracks.
**Methodology:**
1. Read all 6 `docs/` architecture guides (guide_architecture, guide_simulations, guide_tools,
guide_mma, guide_meta_boundary, Readme)
2. Read GLM's full report.md
3. Read plan.md and spec.md for this track
4. Read py_get_skeleton for all 27 src/ modules
5. Read py_get_skeleton for conftest.py and representative test files
(test_extended_sims, test_live_gui_integration, test_dag_engine,
test_mma_orchestration_gui)
6. Read py_get_skeleton for all 9 simulation/ modules
7. Cross-referenced findings against JOURNAL.md, TASKS.md, and git history
---
## Section 1: Verdict on GLM's Report
GLM produced a competent surface-level audit. The structural inventory is
accurate and the broad categories of weakness (mock-rot, shallow assertions,
no negative paths) are valid. However, the report has material errors in
severity classification, contains two exact duplicate sections (Parts 10 and
11 are identical), and misses several issues that are more impactful than
the ones it flags at HIGH. It also makes recommendations that are
architecturally inappropriate for an ImGui immediate-mode application.
**Confirmed correct:** ~60% of findings
**Overstated or miscategorized:** ~25% of findings
**Missed entirely:** see Section 3
---
## Section 2: GLM Findings — Confirmed, Corrected, or Rejected
### 2.1 Confirmed: Mock Provider Never Fails (HIGH)
GLM is correct. `tests/mock_gemini_cli.py` has zero failure modes. The
keyword routing (`'"PATH: Epic Initialization"'`, `'"PATH: Sprint Planning"'`,
default) always produces a well-formed success response. No test using this
mock can ever exercise:
- Malformed or truncated JSON-L output
- Non-zero exit code from the CLI process
- A `{"type": "result", "status": "error", ...}` result event
- Rate-limit or quota responses
- Partial output followed by process crash
`GeminiCliAdapter.send()` parses streaming JSON-L line by line. A
corrupted line (encoding error, mid-write crash) would raise a `json.JSONDecodeError`
that bubbles up through `_send_gemini_cli`. This path is entirely untested.
**Severity: HIGH — confirmed.**
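To make the failure mode concrete, here is a minimal, self-contained sketch of a line-by-line JSON-L parse hitting a truncated line. The event shapes are illustrative only, not the adapter's verified schema:

```python
import json

# Illustrative JSON-L stream; the second line simulates a mid-write crash.
lines = [
    '{"type": "content", "text": "partial answer"}',
    '{"type": "content", "te',  # truncated line: the untested failure path
]

events, error = [], None
try:
    for line in lines:
        events.append(json.loads(line))
except json.JSONDecodeError as exc:
    error = f"corrupt JSON-L line: {exc}"
```

Because the mock never emits a line like the second one, the `except` branch above has no analogue anywhere in the suite.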
### 2.2 Confirmed: Auto-Approval Hides Dialog Logic (MEDIUM, not HIGH)
GLM flags this as HIGH. The auto-approval pattern in polling loops is:
```python
if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn')
```
This is structurally correct for automated testing — you MUST auto-approve
to drive the pipeline. The actual bug is different from what GLM describes:
the tests never assert that the dialog appeared BEFORE approving. The
correct pattern is:
```python
assert status.get('pending_mma_spawn_approval'), "Spawn dialog never appeared"
client.click('btn_approve_spawn')
```
Without the assert, the test passes even if the dialog never fires (meaning
spawn approval is silently bypassed at the application level).
**Severity: MEDIUM (dialog verification gap, not approval mechanism itself).**
**GLM's proposed fix ("Remove auto-approval") is wrong.** Auto-approval is
required for unattended testing. The fix is to assert the flag is True
*before* clicking.
There is also zero testing of the rejection path: what happens when
`btn_reject_spawn` is clicked? Does the engine stop? Does it log an error?
Does the track reach "blocked" state? This is an untested state transition.
### 2.3 Confirmed: Assertions Are Shallow (HIGH)
GLM is correct. The two canonical examples from simulation tests:
```python
assert len(tickets) >= 2 # structure unknown
"SUCCESS: Mock Tier 3 worker" in streams[tier3_key] # substring only
```
Neither validates ticket schema, ID uniqueness, dependency correctness, or
that the stream content is actually the full response and not a truncated
fragment.
**Severity: HIGH — confirmed.**
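A deepened version of the length-only assertion might look like this. The ticket fields shown (`id`, `title`, `depends_on`) are assumed for illustration, not the confirmed schema:

```python
# Example ticket payload with assumed fields; replace with the real schema.
tickets = [
    {"id": "T1", "title": "Set up fixtures", "depends_on": []},
    {"id": "T2", "title": "Add negative paths", "depends_on": ["T1"]},
]

assert len(tickets) >= 2               # the current, shallow check
ids = [t["id"] for t in tickets]
assert len(ids) == len(set(ids)), "duplicate ticket ids"
for t in tickets:
    assert t["title"].strip(), "empty title"
    assert all(dep in ids for dep in t["depends_on"]), "dangling dependency"
```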
### 2.4 Confirmed: No Negative Path Testing (HIGH)
GLM is correct. The entire test suite covers only the happy path. Missing:
- Rejection flows for all three dialog types (ConfirmDialog, MMAApprovalDialog,
MMASpawnApprovalDialog)
- Malformed LLM response handling (bad JSON, missing fields, unexpected types)
- Network timeout/connection error to Hook API during a live_gui test
- `shell_runner.run_powershell` timeout (60s) expiry path
- `mcp_client._resolve_and_check` returning an error (path outside allowlist)
**Severity: HIGH — confirmed.**
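The malformed-response case from the list above can be exercised without any framework. `parse_llm_reply` below is a hypothetical stand-in for the adapter's validation step, shown only to illustrate the shape of a negative-path test:

```python
import json

def parse_llm_reply(raw: str) -> dict:
    # Stand-in for response validation (hypothetical helper, not project code).
    data = json.loads(raw)
    if "tickets" not in data:
        raise ValueError("missing 'tickets' field")
    return data

def expect_failure(raw, exc_type):
    # Negative-path harness: the parse MUST fail with the expected type.
    try:
        parse_llm_reply(raw)
    except exc_type:
        return True
    return False

assert expect_failure('{"tickets": [', json.JSONDecodeError)   # truncated JSON
assert expect_failure('{"status": "ok"}', ValueError)          # missing field
```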
### 2.5 Confirmed: Arbitrary Poll Intervals Miss Transient States (MEDIUM)
GLM is correct. 1-second polling in simulation loops will miss any state
that exists for less than 1 second. The approval dialogs in particular may
appear and be cleared within a single render frame if the engine is fast.
The `WorkflowSimulator.wait_for_ai_response()` method is the most critical
polling target. It is the backbone of all extended simulation tests. If its
polling strategy is wrong, the entire extended sim suite is unreliable.
**Severity: MEDIUM — confirmed.**
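The sampling problem is easy to demonstrate in isolation. With an invented timeline where a dialog is open for 100 ms, 1-second polls never observe it, while a transition log does:

```python
# Invented state timeline: time (s) -> state entered at that time.
timeline = {0.0: "idle", 0.3: "dialog_open", 0.4: "idle", 1.0: "done"}

def state_at(t):
    # Latest state change at or before time t.
    return timeline[max(k for k in timeline if k <= t)]

polled = [state_at(t) for t in (0.0, 1.0, 2.0)]   # 1-second polling
assert "dialog_open" not in polled                 # transient state missed

# An event log (every transition recorded by the app) catches it:
events = [timeline[k] for k in sorted(timeline)]
assert "dialog_open" in events
```

This is why exposing a transition log through the Hook API is a stronger fix than tightening poll intervals.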
### 2.6 Confirmed: Mock CLI Bypasses Real Subprocess Path (MEDIUM)
GLM is correct. Setting `gcli_path` to a Python script does not exercise:
- Real PATH resolution for the `gemini` binary
- Windows process group creation (`CREATE_NEW_PROCESS_GROUP`)
- Environment variable propagation to the subprocess
- `mcp_env.toml` path prepending (in `shell_runner._build_subprocess_env`)
- The `kill_process_tree` teardown path when the process hangs
**Severity: MEDIUM — confirmed.**
### 2.7 CORRECTION: "run_powershell is a Read-Only Tool"
**GLM is WRONG here.** In Part 8, GLM lists:
> "Read-Only Tools: run_powershell (via shell_runner.py)"
`run_powershell` executes arbitrary PowerShell scripts against the filesystem.
It is the MOST dangerous tool in the set — it is not in `MUTATING_TOOLS` only
because it is not an MCP filesystem tool; its approval gate is the
`confirm_and_run_callback` (ConfirmDialog). Categorizing it as "read-only"
is a factual error that could mislead future workers about the security model.
### 2.8 CORRECTION: "State Duplication Between App and AppController"
**GLM is outdated here.** The gui_decoupling track (`1bc4205`) was completed
before this audit. `gui_2.App` now delegates all state through `AppController`
via `__getattr__`/`__setattr__` proxies. There is no duplication — `App` is a
thin ImGui rendering layer, `AppController` owns all state. GLM's concern is
stale relative to the current codebase.
### 2.9 CORRECTION: "Priority 5 — Screenshot Comparison Infrastructure"
**This recommendation is architecturally inappropriate** for Dear PyGui/ImGui.
These are immediate-mode renderers; there is no DOM or widget tree to
interrogate. Pixel-level screenshot comparison requires platform-specific
capture APIs (Windows Magnification, GDI) and is extremely fragile to font
rendering, DPI, and GPU differences. The Hook API's logical state verification
is the CORRECT and SUFFICIENT abstraction for this application. Adding
screenshot comparison would be high cost, low value, and high flakiness.
The appropriate alternative (already partially in place via `hook_api_ui_state_verification_20260302`)
is exposing more GUI state via the Hook API so tests can assert logical
rendering state (is a panel visible? what is the modal title?) without pixels.
### 2.10 CORRECTION: Severity Table Has Duplicate and Conflicting Entries
The summary table in Part 9 lists identical items at multiple severity levels:
- "No concurrent access testing": appears as both HIGH and MEDIUM
- "No real-time latency simulation": appears as both MEDIUM and LOW
- "No human-like behavior": appears as both MEDIUM and LOW
- "Arbitrary polling intervals": appears as both MEDIUM and LOW
Additionally, Parts 10 and 11 are EXACTLY IDENTICAL — the cross-reference
section was copy-pasted in full. This suggests the report was generated with
insufficient self-review.
### 2.11 CONTEXTUAL DOWNGRADE: Human-Like Behavior / Latency Simulation
GLM spends substantial space on the absence of:
- Typing speed simulation
- Hesitation before actions
- Variable LLM latency
This is a **personal developer tool for a single user on a local machine**.
These are aspirational concerns for a production SaaS simulation framework.
For this product context, these are genuinely LOW priority. The simulation
framework's job is to verify that the GUI state machine transitions correctly,
not to simulate human psychology.
---
## Section 3: Issues GLM Missed
These are findings not present in GLM's report that carry meaningful risk.
### 3.1 CRITICAL: `live_gui` is Session-Scoped — Dirty State Across Tests
`conftest.py`'s `live_gui` fixture has `scope="session"`. This means ALL
tests that use `live_gui` share a single running GUI process. If test A
leaves the GUI in a state with an open modal dialog, test B will find the
GUI unresponsive or in an unexpected state.
The teardown calls `client.reset_session()` (which clicks `btn_reset_session`),
but this clears AI state and discussion history, not pending dialogs or
MMA orchestration state. A test that triggers a spawn approval dialog and
then fails before approving it will leave `_pending_mma_spawn` set, blocking
the ENTIRE remaining test session.
**Severity: HIGH.** The current test ordering dependency is invisible and
fragile. Tests must not be run in arbitrary order.
**Fix:** Each `live_gui`-using test that touches MMA or approval flows should
explicitly verify clean state at start:
```python
status = client.get_mma_status()
assert not status.get('pending_mma_spawn_approval'), "Previous test left GUI dirty"
```
### 3.2 HIGH: `app_instance` Fixture Tests Don't Test Rendering
The `app_instance` fixture mocks out all ImGui rendering. This means every
test using `app_instance` (approximately 40+ tests) is testing Python object
state, not rendered UI. Tests like:
- `test_app_has_render_token_budget_panel(app_instance)` — tests `hasattr()`,
not that the panel renders
- `test_render_token_budget_panel_empty_stats_no_crash(app_instance)` — calls
`_render_token_budget_panel()` in a context where all ImGui calls are no-ops
This creates a systematic false-positive class: a method can be completely
broken (wrong data, missing widget calls) and the test passes because ImGui
calls are silently ignored. The only tests with genuine rendering fidelity
are the `live_gui` tests.
This is the root cause behind GLM's "state existence only" finding. It is
not a test assertion weakness — it is a fixture architectural limitation.
**Severity: HIGH.** The implication: all `app_instance`-based rendering
tests should be treated as "smoke tests that the method doesn't crash,"
not as "verification that the rendering is correct."
**Fix:** The `hook_api_ui_state_verification_20260302` track (adding
`/api/gui/state`) is the correct path forward: expose render-visible state
through the Hook API so `live_gui` tests can verify it.
### 3.3 HIGH: No Test for `ConfirmDialog.wait()` Infinite Block
`ConfirmDialog.wait()` uses `_condition.wait(timeout=0.1)` in a `while not self._done` loop.
There is no outer timeout on this loop. If the GUI thread never signals the
dialog (e.g., GUI crash after dialog creation, or a test that creates a
dialog but doesn't render it), the asyncio worker thread hangs indefinitely.
This is particularly dangerous in the `run_worker_lifecycle` path:
1. Worker pushes dialog to event queue
2. GUI process crashes or freezes
3. `dialog.wait()` loops forever at 0.1s intervals
4. Test session hangs with no error output
There is no test verifying that `wait()` has a maximum wait time and raises
an exception or returns a default (rejected) decision after it.
**Severity: HIGH.**
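A bounded-wait contract could look like the sketch below. `BoundedDialog` is an assumed minimal shape, not `ConfirmDialog`'s real internals; the point is the `timeout` on `wait_for` and the explicit `TimeoutError`:

```python
import threading

class BoundedDialog:
    def __init__(self):
        self._condition = threading.Condition()
        self._done = False
        self.decision = None

    def signal(self, decision):
        with self._condition:
            self.decision = decision
            self._done = True
            self._condition.notify_all()

    def wait(self, deadline=5.0):
        # Unlike the current wait(), give up after `deadline` seconds.
        with self._condition:
            if not self._condition.wait_for(lambda: self._done, timeout=deadline):
                raise TimeoutError("dialog never signalled within deadline")
            return self.decision

d = BoundedDialog()
try:
    d.wait(deadline=0.2)      # nobody signals: raises instead of hanging
except TimeoutError:
    timed_out = True
```

Alternatively, `wait()` could return a default rejected decision on timeout; either way the contract must be tested.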
### 3.4 MEDIUM: `mcp_client` Module State Persists Across Unit Tests
`mcp_client.configure()` sets module-level globals (`_allowed_paths`,
`_base_dirs`, `_primary_base_dir`). Tests that call MCP tool functions
directly without calling `configure()` first will use whatever state was
left from the previous test. The `reset_ai_client` autouse fixture calls
`ai_client.reset_session()` but does NOT reset `mcp_client` state.
Any test that calls `mcp_client.read_file()`, `mcp_client.py_get_skeleton()`,
etc. directly (not through `ai_client.send()`) inherits the allowlist from
the previous test run. This can cause false passes (path permitted by
previous test's allowlist) or false failures (path denied because
`_base_dirs` is empty from a prior reset).
**Severity: MEDIUM.**
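The leakage mechanism, reduced to a self-contained stand-in (real `mcp_client` keeps its allowlist in module globals as described above):

```python
class FakeMcpModule:
    # Stand-in for mcp_client's module-level globals.
    _allowed_paths: list = []

    @classmethod
    def configure(cls, allowed):
        cls._allowed_paths = list(allowed)

    @classmethod
    def is_allowed(cls, path):
        return path in cls._allowed_paths

# "Test A" configures an allowlist...
FakeMcpModule.configure(["src/app.py"])
# ...and "test B", which never calls configure(), silently inherits it:
leaked = FakeMcpModule.is_allowed("src/app.py")   # True: false-pass risk
FakeMcpModule.configure([])                        # the reset Rec 6 proposes
clean = FakeMcpModule.is_allowed("src/app.py")     # False after reset
```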
### 3.5 MEDIUM: `current_tier` Module Global — No Test for Concurrent Corruption
GLM mentions this as a "design concern." It is more specific: the
`concurrent_tier_source_tier_20260302` track exists because `current_tier`
in `ai_client.py` is a module-level `str | None`. When two Tier 3 workers
run concurrently (future feature), the second `send()` call will overwrite
the first worker's tier tag.
What's missing: there is no test that verifies the CURRENT behavior is safe
under single-threaded operation, and no test that demonstrates the failure
mode under concurrent operation to serve as a regression baseline for the fix.
**Severity: MEDIUM.**
### 3.6 MEDIUM: `test_arch_boundary_phase2.py` Tests Config File, Not Runtime
The arch boundary tests verify that `manual_slop.toml` lists mutating tools
as disabled by default. But the tests don't verify:
1. That `manual_slop.toml` is actually loaded into `ai_client._agent_tools`
at startup
2. That `ai_client._agent_tools` is actually consulted before tool dispatch
3. That the TOML → runtime path is end-to-end
A developer could modify how tools are loaded without breaking these tests.
The tests are static config audits, not runtime enforcement tests.
**Severity: MEDIUM.**
### 3.7 MEDIUM: `UserSimAgent.generate_response()` Calls `ai_client.send()` Directly
From `simulation/user_agent.py`: the `UserSimAgent` class imports `ai_client`
and calls `ai_client.send()` to generate "human-like" responses. This means:
- Simulation tests have an implicit dependency on a configured LLM provider
- If run without an API key (e.g., in CI), simulations fail at the UserSimAgent
level, not at the GUI level — making failures hard to diagnose
- The mock gemini_cli setup in tests does NOT redirect `ai_client.send()` in
the TEST process (only in the GUI process via `gcli_path`), so UserSimAgent
would attempt real API calls
No test documents whether UserSimAgent is actually exercised in the extended
sims (`test_extended_sims.py`) or whether those sims use the ApiHookClient
directly to drive the GUI.
**Severity: MEDIUM.**
### 3.8 LOW: Gemini CLI Tool-Call Protocol Not Exercised
The real Gemini CLI emits `{"type": "tool_use", "tool": {...}}` events mid-stream
and then waits for `{"type": "tool_result", ...}` piped back on stdin. The
`mock_gemini_cli.py` does not emit any `tool_use` events; it only detects
`'"role": "tool"'` in the prompt to simulate a post-tool-call turn.
This means `GeminiCliAdapter`'s tool-call parsing logic (the branch that
handles `tool_use` event types and accumulates them) is NEVER exercised by
any test. A regression in that parsing branch would be invisible to the
test suite.
**Severity: LOW** (only relevant when the real gemini CLI is used with tools).
### 3.9 LOW: `reset_ai_client` Autouse Fixture Timing is Wrong for Async Tests
The `reset_ai_client` autouse fixture runs synchronously before each test.
For tests marked `@pytest.mark.asyncio`, the reset happens BEFORE the test's
async setup. If the async test itself triggers ai_client operations in setup
(e.g., through an event loop created by the fixture), the reset may not
capture all state mutations. This is an edge case but could explain
intermittent behavior in async tests.
**Severity: LOW.**
---
## Section 4: Revised Severity Matrix
| Severity | Finding | GLM? | Source |
|---|---|---|---|
| **HIGH** | Mock provider has zero failure modes — all integration tests pass unconditionally | Confirmed | GLM |
| **HIGH** | `app_instance` fixture mocks ImGui — rendering tests are existence checks only | Missed | Claude |
| **HIGH** | `live_gui` session scope — dirty state from one test bleeds into the next | Missed | Claude |
| **HIGH** | `ConfirmDialog.wait()` has no outer timeout — worker thread can hang indefinitely | Missed | Claude |
| **HIGH** | Shallow assertions — substring match and length check only, no schema validation | Confirmed | GLM |
| **HIGH** | No negative path coverage — rejection flows, timeouts, malformed inputs untested | Confirmed | GLM |
| **MEDIUM** | Auto-approval never asserts dialog appeared before approving | Corrected | GLM/Claude |
| **MEDIUM** | `mcp_client` module state not reset between unit tests | Missed | Claude |
| **MEDIUM** | `current_tier` global — no test demonstrates safe single-thread or failure under concurrent use | Missed | Claude |
| **MEDIUM** | Arch boundary tests validate TOML config, not runtime enforcement | Missed | Claude |
| **MEDIUM** | `UserSimAgent` calls `ai_client.send()` directly — implicit real API dependency | Missed | Claude |
| **MEDIUM** | Arbitrary 1-second poll intervals miss sub-second transient states | Confirmed | GLM |
| **MEDIUM** | Mock CLI bypasses real subprocess spawning path | Confirmed | GLM |
| **LOW** | GeminiCliAdapter tool-use parsing branch never exercised by any test | Missed | Claude |
| **LOW** | `reset_ai_client` autouse timing may be incorrect for async tests | Missed | Claude |
| **LOW** | Variable latency / human-like simulation | Confirmed | GLM |
---
## Section 5: Prioritized Recommendations for Downstream Tracks
Listed in execution order, not importance order. Each maps to an existing or
proposed track.
### Rec 1: Extend mock_gemini_cli with Failure Modes
**Target track:** New — `mock_provider_hardening_20260305`
**Files:** `tests/mock_gemini_cli.py`
**What:** Add a `MOCK_MODE` environment variable selector:
- `success` (current behavior, default)
- `malformed_json` — emit a truncated/corrupt JSON-L line
- `error_result` — emit `{"type": "result", "status": "error", ...}`
- `timeout` — sleep 90s to trigger the CLI timeout path
- `tool_use` — emit a real `tool_use` event to exercise GeminiCliAdapter parsing
Tests that need to verify error handling pass `MOCK_MODE=error_result` via
`client.set_value()` before triggering the AI call.
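A sketch of the `MOCK_MODE` dispatch for `mock_gemini_cli.py`. The event payloads are illustrative guesses at the JSON-L protocol, not the verified schema:

```python
import json
import os
import sys
import time

def emit(event):
    # One JSON-L event per line, flushed so the adapter sees it immediately.
    sys.stdout.write(json.dumps(event) + "\n")
    sys.stdout.flush()

def main():
    mode = os.environ.get("MOCK_MODE", "success")
    if mode == "success":
        emit({"type": "content", "text": "ok"})
        emit({"type": "result", "status": "success"})
    elif mode == "malformed_json":
        sys.stdout.write('{"type": "content", "te\n')   # truncated line
    elif mode == "error_result":
        emit({"type": "result", "status": "error", "message": "quota exceeded"})
    elif mode == "timeout":
        time.sleep(90)                                   # trip the CLI timeout
    elif mode == "tool_use":
        emit({"type": "tool_use", "tool": {"name": "read_file"}})

if __name__ == "__main__":
    main()
```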
### Rec 2: Add Dialog Assertion Before Auto-Approval
**Target track:** `test_suite_performance_and_flakiness_20260302` (already planned)
**Files:** All live_gui simulation tests, `tests/test_visual_sim_mma_v2.py`
**What:** Replace the conditional approval pattern:
```python
# BAD (current):
if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn')
# GOOD:
assert status.get('pending_mma_spawn_approval'), "Spawn dialog must appear before approve"
client.click('btn_approve_spawn')
```
Also add at least one test per dialog type that clicks reject and asserts the
correct downstream state (engine marks track blocked, no worker spawned, etc.).
### Rec 3: Fix live_gui Session Scope Dirty State
**Target track:** `test_suite_performance_and_flakiness_20260302`
**Files:** `tests/conftest.py`
**What:** Add a per-test autouse fixture (function-scoped) that asserts clean
GUI state before each `live_gui` test:
```python
@pytest.fixture(autouse=True)
def assert_gui_clean(live_gui):
client = ApiHookClient()
status = client.get_mma_status()
assert not status.get('pending_mma_spawn_approval')
assert not status.get('pending_mma_step_approval')
assert not status.get('pending_tool_approval')
assert status.get('mma_status') in ('idle', 'done', '')
```
This surfaces inter-test pollution immediately rather than causing a
mysterious hang in a later test.
### Rec 4: Add ConfirmDialog Timeout Test
**Target track:** New — `mock_provider_hardening_20260305` (or `test_stabilization`)
**Files:** `tests/test_conductor_engine.py`
**What:** Add a test that creates a `ConfirmDialog`, never signals it, and
verifies after N seconds that the background thread does NOT block indefinitely.
This requires either a hard timeout on `wait()` or a documented contract that
callers must signal the dialog within a finite window.
### Rec 5: Expose More State via Hook API
**Target track:** `hook_api_ui_state_verification_20260302` (already planned, HIGH priority)
**Files:** `src/api_hooks.py`
**What:** This track is the key enabler for replacing `app_instance` rendering
tests with genuine state verification. The planned `/api/gui/state` endpoint
should expose:
- Active modal type (`confirm_dialog`, `mma_step_approval`, `mma_spawn_approval`, `ask`, `none`)
- `ui_focus_agent` current filter value
- `_mma_status`, `_ai_status` text values
- Panel visibility flags
Once this is in place, the `app_instance` rendering tests can be migrated
to `live_gui` equivalents that actually verify GUI-visible state.
### Rec 6: Add mcp_client Reset to autouse Fixture
**Target track:** `test_suite_performance_and_flakiness_20260302`
**Files:** `tests/conftest.py`
**What:** Extend `reset_ai_client` autouse fixture to also call
`mcp_client.configure([], [])` to clear the allowlist between tests.
This prevents allowlist state from a previous test from leaking into the next.
### Rec 7: Add Runtime HITL Enforcement Test
**Target track:** `test_suite_performance_and_flakiness_20260302` or new
**Files:** `tests/test_arch_boundary_phase2.py`
**What:** Add an integration test (using `app_instance`) that:
1. Calls `ai_client.set_agent_tools({'set_file_slice': True})`
2. Confirms `mcp_client.MUTATING_TOOLS` contains `'set_file_slice'`
3. Triggers a dispatch of `set_file_slice`
4. Verifies `pre_tool_callback` was invoked BEFORE the write occurred
This closes the gap between "config says mutating tools are off" and
"runtime actually gates them through the approval callback."
### Rec 8: Document `app_instance` Limitation in conftest
**Target track:** Any ongoing work — immediate, no track needed
**Files:** `tests/conftest.py`
**What:** Add a docstring to `app_instance` fixture:
```python
"""
App instance with all ImGui rendering calls mocked to no-ops.
Use for unit tests of state logic and method existence.
DO NOT use to verify rendering correctness — use live_gui for that.
"""
```
This prevents future workers from writing rendering tests against this fixture
and believing they have real coverage.
---
## Section 6: What the Existing Track Queue Gets Right
The `TASKS.md` strict execution queue is well-ordered for the test concerns:
1. `test_stabilization_20260302` → Must be first: asyncio lifecycle, mock-rot ban
2. `strict_static_analysis_and_typing_20260302` → Type safety before refactoring
3. `codebase_migration_20260302` → Already complete (commit 270f5f7)
4. `gui_decoupling_controller_20260302` → Already complete (commit 1bc4205)
5. `hook_api_ui_state_verification_20260302` → Critical enabler for real rendering tests
6. `robust_json_parsing_tech_lead_20260302` → Valid, but NOTE: the mock never produces
malformed JSON, so the auto-retry loop cannot be verified without Rec 1 above
7. `concurrent_tier_source_tier_20260302` → Threading safety for future parallel workers
8. `test_suite_performance_and_flakiness_20260302` → Polling determinism, sleep elimination
The `test_architecture_integrity_audit_20260304` (this track) sits logically
between #1 and #5 — it provides the analytical basis for what #5 and #8 need
to fix. The audit output (this document) should be read by the Tier 2 Tech Lead
for both those tracks.
The proposed new tracks (mock_provider_hardening, negative_path_testing) from
GLM's recommendations are valid but should be created AFTER track #5
(`hook_api_ui_state_verification`) is complete, since they depend on the
richer Hook API state to write meaningful assertions.
---
## Section 7: Architectural Observations Not in GLM's Report
### The Two-Tier Mock Problem
The test suite has two completely separate mock layers that do not know about
each other:
**Layer 1** (`app_instance` fixture, in-process): Patches `immapp.run()`,
`ai_client.send()`, and related functions with `unittest.mock`. Tests call
methods directly. No network, no subprocess, no real threading.
**Layer 2** (`mock_gemini_cli.py`, out-of-process): A fake subprocess that
the live GUI process calls through its own internal LLM pipeline. Tests drive
this via `ApiHookClient` HTTP calls to the running GUI process.
These layers test completely different things. Layer 1 tests Python object
invariants. Layer 2 tests the full application pipeline (threading, HTTP, IPC,
process management). Most of the test suite is Layer 1. Very few tests are
Layer 2. The high-value tests are Layer 2 because they exercise the actual
system, not a mock of it.
GLM correctly identifies that Layer 1 tests are of limited value for
rendering verification but does not frame it as a two-layer architecture
problem with a clear solution (expand Layer 2 via hook_api_ui_state_verification).
### The Simulation Framework's Actual Role
The `simulation/` module is not (and should not be) a fidelity benchmark.
Its role is:
1. Drive the GUI through a sequence of interactions
2. Verify the GUI reaches expected states after each interaction
The simulations (`sim_context.py`, `sim_ai_settings.py`, `sim_tools.py`,
`sim_execution.py`) are extremely thin wrappers. Their actual test value
comes from `test_extended_sims.py` which calls them against a live GUI and
verifies no exceptions are thrown. This is essentially a smoke test for the
GUI lifecycle, not a behavioral verification.
The real behavioral verification is in `test_visual_sim_mma_v2.py` and
similar files that assert specific state transitions. The simulation/
module should be understood as "workflow drivers," not "verification modules."
GLM's recommendation to add latency simulation and human-like behavior to
`simulation/user_agent.py` would add complexity to a layer that isn't the
bottleneck. The bottleneck is assertion depth in the polling loops, not
realism of the user actions.
---
*End of report. Next action: Tier 2 Tech Lead to read this alongside
`plan.md` and initiate track #5 (`hook_api_ui_state_verification_20260302`)
as the highest-leverage unblocking action.*

@@ -0,0 +1,96 @@
# Track Specification: Test Architecture Integrity & Simulation Audit
## Overview
Comprehensive audit of testing infrastructure and simulation framework to identify false positive risks, coverage gaps, and simulation fidelity issues. This analysis was triggered by a request to review how tests and simulations are set up, whether tests can report passing grades when they in fact fail, and whether the simulations are rigorous or merely rough emulators.
## Current State Audit (as of 20260304)
### Already Implemented (DO NOT re-implement)
- **Testing Infrastructure** (tests/conftest.py):
  - live_gui fixture for session-scoped GUI lifecycle management
  - Process cleanup with kill_process_tree()
  - VerificationLogger for diagnostic logging
  - Artifact isolation to tests/artifacts/ and tests/logs/
  - Ban on arbitrary core mocking
- **Simulation Framework** (simulation/):
  - sim_base.py: Base simulation class with setup/teardown
  - workflow_sim.py: Workflow orchestration
  - sim_context.py, sim_ai_settings.py, sim_tools.py, sim_execution.py
  - user_agent.py: Simulated human agent
- **Mock Provider** (tests/mock_gemini_cli.py):
- Keyword-based response routing
- JSON-L protocol matching real CLI output
#### Critical False Positive Risks Identified
1. **Mock Provider Always Returns Success**: Never validates input, never produces errors, never tests failure paths
2. **Auto-Approval Pattern**: All HITL gates auto-clicked, never verifying dialogs appear or rejection flows
3. **Substring-Based Assertions**: Only check existence of content, not validity or structure
4. **State Existence Only**: Tests check fields exist but not their correctness or invariants
5. **No Negative Path Testing**: No coverage for rejection, timeout, malformed input, concurrent access
6. **No Visual Verification**: Tests verify logical state via Hook API but never check what's actually rendered
7. **No State Machine Validation**: No verification that status transitions are legal or complete
#### Simulation Rigor Gaps Identified
1. **No Real-Time Latency Simulation**: Fixed delays don't model variable LLM/network latency
2. **No Human-Like Behavior**: Instant actions, no typing speed, hesitation, mistakes, or task switching
3. **Arbitrary Polling Intervals**: 1-second polls may miss transient states
4. **Mock CLI Redirection**: Bypasses subprocess spawning, environment passing, and process cleanup paths
5. **No Stress Testing**: No load testing, no edge case bombardment
#### Test Coverage Gaps
- No tests for approval dialog rejection flows
- No tests for malformed LLM response handling
- No tests for network timeout/failure scenarios
- No tests for concurrent duplicate requests
- No tests for out-of-order event sequences
- No thread-safety tests for shared resources
- No visual rendering verification (modal visibility, text overflow, color schemes)
#### Structural Testing Contract Gaps
- Missing rule requiring negative path testing
- Missing rule requiring state validation beyond existence
- Missing rule requiring visual verification
- No enforcement for thread-safety testing
## Goals
1. Document all identified testing pitfalls with severity ratings (HIGH/MEDIUM/LOW)
2. Create actionable recommendations for each identified issue
3. Map existing test coverage gaps to specific missing test files
4. Provide architecture recommendations for simulation framework enhancements
## Functional Requirements
- [ ] Document all false positive risks in a structured format
- [ ] Document all simulation fidelity gaps in a structured format
- [ ] Create severity matrix for each issue
- [ ] Generate list of missing test cases by category
- [ ] Provide concrete examples of how current tests would pass despite bugs
- [ ] Provide concrete examples of how simulations would miss UX issues
## Non-Functional Requirements
- Report must include author attribution (GLM-4.7) and derivation methodology
- Analysis must cite specific file paths and line numbers where applicable
- Recommendations must be prioritized by impact and implementation effort
## Architecture Reference
Refer to:
- docs/guide_simulations.md - Current simulation contract and patterns
- docs/guide_mma.md - MMA orchestration architecture
- docs/guide_architecture.md - Thread domains, event system, HITL mechanism
- conductor/tracks/*/spec.md - Existing track specifications for consistency
## Out of Scope
- Implementing the actual test fixes (that's for subsequent tracks)
- Refactoring the simulation framework (documenting only)
- Modifying the mock provider (analyzing only)
- Writing new tests (planning phase for future tracks)