# Test Architecture Integrity Audit — Claude Review

**Author:** Claude Sonnet 4.6 (Tier 1 Orchestrator)

**Review Date:** 2026-03-05

**Source Report:** report.md (authored by GLM-4.7, 2026-03-04)

**Scope:** Verify GLM's findings, correct errors, surface missed issues, produce actionable recommendations for downstream tracks.

**Methodology:**
1. Read all 6 `docs/` architecture guides (guide_architecture, guide_simulations, guide_tools, guide_mma, guide_meta_boundary, Readme)
2. Read GLM's full report.md
3. Read plan.md and spec.md for this track
4. Read py_get_skeleton for all 27 src/ modules
5. Read py_get_skeleton for conftest.py and representative test files (test_extended_sims, test_live_gui_integration, test_dag_engine, test_mma_orchestration_gui)
6. Read py_get_skeleton for all 9 simulation/ modules
7. Cross-referenced findings against JOURNAL.md, TASKS.md, and git history

---
## Section 1: Verdict on GLM's Report

GLM produced a competent surface-level audit. The structural inventory is accurate and the broad categories of weakness (mock-rot, shallow assertions, no negative paths) are valid. However, the report has material errors in severity classification, contains two exact duplicate sections (Parts 10 and 11 are identical), and misses several issues that are more impactful than the ones it flags at HIGH. It also makes recommendations that are architecturally inappropriate for an ImGui immediate-mode application.

**Confirmed correct:** ~60% of findings
**Overstated or miscategorized:** ~25% of findings
**Missed entirely:** see Section 3

---
## Section 2: GLM Findings — Confirmed, Corrected, or Rejected

### 2.1 Confirmed: Mock Provider Never Fails (HIGH)

GLM is correct. `tests/mock_gemini_cli.py` has zero failure modes. The keyword routing (`'"PATH: Epic Initialization"'`, `'"PATH: Sprint Planning"'`, default) always produces a well-formed success response. No test using this mock can ever exercise:
- Malformed or truncated JSON-L output
- Non-zero exit code from the CLI process
- A `{"type": "result", "status": "error", ...}` result event
- Rate-limit or quota responses
- Partial output followed by process crash

`GeminiCliAdapter.send()` parses streaming JSON-L line by line. A corrupted line (encoding error, mid-write crash) would raise a `json.JSONDecodeError` that bubbles up through `_send_gemini_cli`. This path is entirely untested.

**Severity: HIGH — confirmed.**
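The untested decode-failure path is easy to demonstrate in isolation. This standalone sketch mimics a line-by-line JSON-L parse loop; the stream content is invented for illustration and does not come from the real adapter:

```python
# Demonstration of the untested failure path: a JSON-L line truncated
# mid-write raises json.JSONDecodeError — the same exception that would
# bubble up through _send_gemini_cli with no test coverage.
import json

def parse_stream(lines):
    """Mimic a line-by-line JSON-L parse loop; raises on a corrupt line."""
    events = []
    for line in lines:
        events.append(json.loads(line))  # raises json.JSONDecodeError
    return events

good = ['{"type": "message", "content": "ok"}']
truncated = ['{"type": "message", "content": "partial resp']  # cut mid-string

assert parse_stream(good)[0]["type"] == "message"
try:
    parse_stream(truncated)
except json.JSONDecodeError as exc:
    print(f"decode failed at position {exc.pos}")
```

A failure-mode mock (see Rec 1) only needs to emit one line like `truncated` above to force this branch under test.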
### 2.2 Confirmed: Auto-Approval Hides Dialog Logic (MEDIUM, not HIGH)

GLM flags this as HIGH. The auto-approval pattern in polling loops is:

```python
if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn')
```

This is structurally correct for automated testing — you MUST auto-approve to drive the pipeline. The actual bug is different from what GLM describes: the tests never assert that the dialog appeared BEFORE approving. The correct pattern is:

```python
assert status.get('pending_mma_spawn_approval'), "Spawn dialog never appeared"
client.click('btn_approve_spawn')
```

Without the assert, the test passes even if the dialog never fires (meaning spawn approval is silently bypassed at the application level).

**Severity: MEDIUM (dialog verification gap, not the approval mechanism itself).**

**GLM's proposed fix ("Remove auto-approval") is wrong.** Auto-approval is required for unattended testing. The fix is to assert the flag is True *before* clicking.

There is also zero testing of the rejection path: what happens when `btn_reject_spawn` is clicked? Does the engine stop? Does it log an error? Does the track reach the "blocked" state? This is an untested state transition.
### 2.3 Confirmed: Assertions Are Shallow (HIGH)

GLM is correct. The two canonical examples from simulation tests:

```python
assert len(tickets) >= 2  # structure unknown
"SUCCESS: Mock Tier 3 worker" in streams[tier3_key]  # substring only
```

Neither validates ticket schema, ID uniqueness, dependency correctness, or that the stream content is actually the full response and not a truncated fragment.

**Severity: HIGH — confirmed.**
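A schema-level replacement for the shallow length check above can be sketched as follows. The ticket field names (`id`, `title`, `deps`) are illustrative assumptions; the real assertions should be derived from the engine's actual ticket model:

```python
# Sketch of deeper assertions to replace `assert len(tickets) >= 2`.
# Field names are hypothetical — align them with the real ticket schema.
def assert_ticket_schema(tickets):
    assert len(tickets) >= 2, "expected at least 2 tickets"
    ids = [t["id"] for t in tickets]
    assert len(ids) == len(set(ids)), "ticket IDs must be unique"
    for t in tickets:
        # every ticket needs a non-empty title
        assert isinstance(t.get("title"), str) and t["title"], t
        # every declared dependency must reference a known ticket
        for dep in t.get("deps", []):
            assert dep in ids, f"dangling dependency {dep!r}"

tickets = [
    {"id": "T1", "title": "Set up fixtures", "deps": []},
    {"id": "T2", "title": "Add negative paths", "deps": ["T1"]},
]
assert_ticket_schema(tickets)
```

The same pattern applies to the stream check: assert on the full expected payload (or a parsed structure), not a substring.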
### 2.4 Confirmed: No Negative Path Testing (HIGH)

GLM is correct. The entire test suite covers only the happy path. Missing:
- Rejection flows for all three dialog types (ConfirmDialog, MMAApprovalDialog, MMASpawnApprovalDialog)
- Malformed LLM response handling (bad JSON, missing fields, unexpected types)
- Network timeout/connection error to the Hook API during a live_gui test
- `shell_runner.run_powershell` timeout (60s) expiry path
- `mcp_client._resolve_and_check` returning an error (path outside allowlist)

**Severity: HIGH — confirmed.**
### 2.5 Confirmed: Arbitrary Poll Intervals Miss Transient States (MEDIUM)

GLM is correct. One-second polling in simulation loops will miss any state that exists for less than one second. The approval dialogs in particular may appear and be cleared within a single render frame if the engine is fast.

The `WorkflowSimulator.wait_for_ai_response()` method is the most critical polling target. It is the backbone of all extended simulation tests. If its polling strategy is wrong, the entire extended sim suite is unreliable.

**Severity: MEDIUM — confirmed.**
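One way to make transient states assertable is to poll at a short interval and record every observed state, rather than sampling once per second and discarding the history. This is an illustrative helper, not the project's existing API; `get_state` stands in for a Hook API status call:

```python
# Poll helper that returns the full history of observed states, so tests
# can assert that a sub-second transient (e.g. a pending-approval flag)
# actually appeared, instead of hoping a 1-second sample landed on it.
import time

def poll_history(get_state, predicate, timeout=5.0, interval=0.05):
    """Poll get_state() until predicate(state) holds or timeout expires;
    return every state observed along the way."""
    history = []
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        history.append(state)
        if predicate(state):
            return history
        time.sleep(interval)
    raise TimeoutError(f"predicate not satisfied after {timeout}s "
                       f"({len(history)} states observed)")
```

A test can then write `assert any(s == 'pending_approval' for s in history)` to verify the transient state occurred at all.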
### 2.6 Confirmed: Mock CLI Bypasses Real Subprocess Path (MEDIUM)

GLM is correct. Setting `gcli_path` to a Python script does not exercise:
- Real PATH resolution for the `gemini` binary
- Windows process group creation (`CREATE_NEW_PROCESS_GROUP`)
- Environment variable propagation to the subprocess
- `mcp_env.toml` path prepending (in `shell_runner._build_subprocess_env`)
- The `kill_process_tree` teardown path when the process hangs

**Severity: MEDIUM — confirmed.**
### 2.7 CORRECTION: "run_powershell is a Read-Only Tool"

**GLM is WRONG here.** In Part 8, GLM lists:

> "Read-Only Tools: run_powershell (via shell_runner.py)"

`run_powershell` executes arbitrary PowerShell scripts against the filesystem. It is the MOST dangerous tool in the set — it is not in `MUTATING_TOOLS` only because it is not an MCP filesystem tool; its approval gate is the `confirm_and_run_callback` (ConfirmDialog). Categorizing it as "read-only" is a factual error that could mislead future workers about the security model.
### 2.8 CORRECTION: "State Duplication Between App and AppController"

**GLM is outdated here.** The gui_decoupling track (`1bc4205`) was completed before this audit. `gui_2.App` now delegates all state through `AppController` via `__getattr__`/`__setattr__` proxies. There is no duplication — `App` is a thin ImGui rendering layer and `AppController` owns all state. GLM's concern is stale relative to the current codebase.
### 2.9 CORRECTION: "Priority 5 — Screenshot Comparison Infrastructure"

**This recommendation is architecturally inappropriate** for Dear PyGui/ImGui. These are immediate-mode renderers; there is no DOM or widget tree to interrogate. Pixel-level screenshot comparison requires platform-specific capture APIs (Windows Magnification, GDI) and is extremely fragile to font rendering, DPI, and GPU differences. The Hook API's logical state verification is the CORRECT and SUFFICIENT abstraction for this application. Adding screenshot comparison would be high cost, low value, and high flakiness.

The appropriate alternative (already partially in place via `hook_api_ui_state_verification_20260302`) is exposing more GUI state via the Hook API so tests can assert logical rendering state (is a panel visible? what is the modal title?) without pixels.
### 2.10 CORRECTION: Severity Table Has Duplicate and Conflicting Entries

The summary table in Part 9 lists identical items at multiple severity levels:
- "No concurrent access testing": appears as both HIGH and MEDIUM
- "No real-time latency simulation": appears as both MEDIUM and LOW
- "No human-like behavior": appears as both MEDIUM and LOW
- "Arbitrary polling intervals": appears as both MEDIUM and LOW

Additionally, Parts 10 and 11 are EXACTLY IDENTICAL — the cross-reference section was copy-pasted in full. This suggests the report was generated with insufficient self-review.
### 2.11 CONTEXTUAL DOWNGRADE: Human-Like Behavior / Latency Simulation

GLM spends substantial space on the absence of:
- Typing speed simulation
- Hesitation before actions
- Variable LLM latency

This is a **personal developer tool for a single user on a local machine**. These are aspirational concerns for a production SaaS simulation framework. For this product context, they are genuinely LOW priority. The simulation framework's job is to verify that the GUI state machine transitions correctly, not to simulate human psychology.

---
## Section 3: Issues GLM Missed

These are findings not present in GLM's report that carry meaningful risk.

### 3.1 CRITICAL: `live_gui` is Session-Scoped — Dirty State Across Tests

`conftest.py`'s `live_gui` fixture has `scope="session"`. This means ALL tests that use `live_gui` share a single running GUI process. If test A leaves the GUI in a state with an open modal dialog, test B will find the GUI unresponsive or in an unexpected state.

The teardown calls `client.reset_session()` (which clicks `btn_reset_session`), but this clears AI state and discussion history, not pending dialogs or MMA orchestration state. A test that triggers a spawn approval dialog and then fails before approving it will leave `_pending_mma_spawn` set, blocking the ENTIRE remaining test session.

**Severity: HIGH.** The current test-ordering dependency is invisible and fragile. Tests must not be run in arbitrary order.

**Fix:** Each `live_gui`-using test that touches MMA or approval flows should explicitly verify clean state at start:

```python
status = client.get_mma_status()
assert not status.get('pending_mma_spawn_approval'), "Previous test left GUI dirty"
```
### 3.2 HIGH: `app_instance` Fixture Tests Don't Test Rendering

The `app_instance` fixture mocks out all ImGui rendering. This means every test using `app_instance` (40+ tests) is testing Python object state, not rendered UI. Tests like:
- `test_app_has_render_token_budget_panel(app_instance)` — tests `hasattr()`, not that the panel renders
- `test_render_token_budget_panel_empty_stats_no_crash(app_instance)` — calls `_render_token_budget_panel()` in a context where all ImGui calls are no-ops

This creates a systematic false-positive class: a method can be completely broken (wrong data, missing widget calls) and the test passes because ImGui calls are silently ignored. The only tests with genuine rendering fidelity are the `live_gui` tests.

This is the root cause behind GLM's "state existence only" finding. It is not a test assertion weakness — it is a fixture architectural limitation.

**Severity: HIGH.** The implication: all `app_instance`-based rendering tests should be treated as "smoke tests that the method doesn't crash," not as "verification that the rendering is correct."

**Fix:** The `hook_api_ui_state_verification_20260302` track (adding `/api/gui/state`) is the correct path forward: expose render-visible state through the Hook API so `live_gui` tests can verify it.
### 3.3 HIGH: No Test for `ConfirmDialog.wait()` Infinite Block

`ConfirmDialog.wait()` uses `_condition.wait(timeout=0.1)` in a `while not self._done` loop. There is no outer timeout on this loop. If the GUI thread never signals the dialog (e.g., GUI crash after dialog creation, or a test that creates a dialog but doesn't render it), the asyncio worker thread hangs indefinitely.

This is particularly dangerous in the `run_worker_lifecycle` path:
1. Worker pushes dialog to event queue
2. GUI process crashes or freezes
3. `dialog.wait()` loops forever at 0.1s intervals
4. Test session hangs with no error output

There is no test verifying that `wait()` has a maximum wait time and raises an exception or returns a default (rejected) decision once that time expires.

**Severity: HIGH.**
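The missing outer deadline can be sketched against the same `_condition`/`_done` structure described above. This is a self-contained illustration of the fix, not the project's actual class; the 0.1s inner wait mirrors the loop quoted above:

```python
# Sketch: condition-variable dialog wait with the hard deadline the real
# ConfirmDialog.wait() lacks. If the GUI never signals, the worker gets a
# TimeoutError instead of hanging forever.
import threading
import time

class ConfirmDialogSketch:
    def __init__(self):
        self._condition = threading.Condition()
        self._done = False
        self.approved = False

    def signal(self, approved):
        # Called from the GUI thread when the user clicks approve/reject.
        with self._condition:
            self.approved = approved
            self._done = True
            self._condition.notify_all()

    def wait(self, max_wait=30.0):
        # Same 0.1s condition polling as the real loop, plus an outer
        # deadline so a dead GUI cannot block the worker thread forever.
        deadline = time.monotonic() + max_wait
        with self._condition:
            while not self._done:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError("dialog was never signalled")
                self._condition.wait(timeout=min(remaining, 0.1))
        return self.approved
```

Whether the deadline should raise or return a default-rejected decision is a design choice for the fix track; either ends the hang.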
### 3.4 MEDIUM: `mcp_client` Module State Persists Across Unit Tests

`mcp_client.configure()` sets module-level globals (`_allowed_paths`, `_base_dirs`, `_primary_base_dir`). Tests that call MCP tool functions directly without calling `configure()` first will use whatever state was left from the previous test. The `reset_ai_client` autouse fixture calls `ai_client.reset_session()` but does NOT reset `mcp_client` state.

Any test that calls `mcp_client.read_file()`, `mcp_client.py_get_skeleton()`, etc. directly (not through `ai_client.send()`) inherits the allowlist from the previous test run. This can cause false passes (path permitted by a previous test's allowlist) or false failures (path denied because `_base_dirs` is empty from a prior reset).

**Severity: MEDIUM.**
### 3.5 MEDIUM: `current_tier` Module Global — No Test for Concurrent Corruption

GLM mentions this as a "design concern." It is more specific than that: the `concurrent_tier_source_tier_20260302` track exists because `current_tier` in `ai_client.py` is a module-level `str | None`. When two Tier 3 workers run concurrently (a future feature), the second `send()` call will overwrite the first worker's tier tag.

What's missing: there is no test that verifies the CURRENT behavior is safe under single-threaded operation, and no test that demonstrates the failure mode under concurrent operation to serve as a regression baseline for the fix.

**Severity: MEDIUM.**
### 3.6 MEDIUM: `test_arch_boundary_phase2.py` Tests Config File, Not Runtime

The arch boundary tests verify that `manual_slop.toml` lists mutating tools as disabled by default. But the tests don't verify:
1. That `manual_slop.toml` is actually loaded into `ai_client._agent_tools` at startup
2. That `ai_client._agent_tools` is actually consulted before tool dispatch
3. That the TOML → runtime path works end-to-end

A developer could modify how tools are loaded without breaking these tests. The tests are static config audits, not runtime enforcement tests.

**Severity: MEDIUM.**
### 3.7 MEDIUM: `UserSimAgent.generate_response()` Calls `ai_client.send()` Directly

From `simulation/user_agent.py`: the `UserSimAgent` class imports `ai_client` and calls `ai_client.send()` to generate "human-like" responses. This means:
- Simulation tests have an implicit dependency on a configured LLM provider
- If run without an API key (e.g., in CI), simulations fail at the UserSimAgent level, not at the GUI level — making failures hard to diagnose
- The mock gemini_cli setup in tests does NOT redirect `ai_client.send()` in the TEST process (only in the GUI process via `gcli_path`), so UserSimAgent would attempt real API calls

No test documents whether UserSimAgent is actually exercised in the extended sims (`test_extended_sims.py`) or whether those sims use the ApiHookClient directly to drive the GUI.

**Severity: MEDIUM.**
### 3.8 LOW: Gemini CLI Tool-Call Protocol Not Exercised

The real Gemini CLI emits `{"type": "tool_use", "tool": {...}}` events mid-stream and then waits for `{"type": "tool_result", ...}` piped back on stdin. `mock_gemini_cli.py` does not emit any `tool_use` events; it only detects `'"role": "tool"'` in the prompt to simulate a post-tool-call turn.

This means `GeminiCliAdapter`'s tool-call parsing logic (the branch that handles `tool_use` event types and accumulates them) is NEVER exercised by any test. A regression in that parsing branch would be invisible to the test suite.

**Severity: LOW** (only relevant when the real gemini CLI is used with tools).
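The shape of the event the mock would need to emit can be sketched from the types quoted above. Field contents here are illustrative assumptions — the real gemini CLI's exact schema may differ and should be confirmed against a captured transcript before wiring this into the mock:

```python
# Illustrative shape of the tool_use event the mock never produces.
# The "tool" payload (name/args) is a hypothetical example, not a
# confirmed schema of the real CLI.
import json

def mock_tool_turn():
    """Emit one tool_use line, as a MOCK_MODE=tool_use variant might."""
    event = {
        "type": "tool_use",
        "tool": {"name": "read_file", "args": {"path": "src/ai_client.py"}},
    }
    return json.dumps(event)

line = mock_tool_turn()
parsed = json.loads(line)
assert parsed["type"] == "tool_use"
```

A mock in this mode would print the `tool_use` line, block on stdin for the `tool_result` line, and only then emit its final result event — exercising the adapter's accumulation branch.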
### 3.9 LOW: `reset_ai_client` Autouse Fixture Timing is Wrong for Async Tests

The `reset_ai_client` autouse fixture runs synchronously before each test. For tests marked `@pytest.mark.asyncio`, the reset happens BEFORE the test's async setup. If the async test itself triggers ai_client operations in setup (e.g., through an event loop created by the fixture), the reset may not capture all state mutations. This is an edge case but could explain intermittent behavior in async tests.

**Severity: LOW.**

---
## Section 4: Revised Severity Matrix

| Severity | Finding | GLM? | Source |
|---|---|---|---|
| **HIGH** | Mock provider has zero failure modes — all integration tests pass unconditionally | Confirmed | GLM |
| **HIGH** | `app_instance` fixture mocks ImGui — rendering tests are existence checks only | Missed | Claude |
| **HIGH** | `live_gui` session scope — dirty state from one test bleeds into the next | Missed | Claude |
| **HIGH** | `ConfirmDialog.wait()` has no outer timeout — worker thread can hang indefinitely | Missed | Claude |
| **HIGH** | Shallow assertions — substring match and length check only, no schema validation | Confirmed | GLM |
| **HIGH** | No negative path coverage — rejection flows, timeouts, malformed inputs untested | Confirmed | GLM |
| **MEDIUM** | Auto-approval never asserts dialog appeared before approving | Corrected | GLM/Claude |
| **MEDIUM** | `mcp_client` module state not reset between unit tests | Missed | Claude |
| **MEDIUM** | `current_tier` global — no test demonstrates safe single-thread or failure under concurrent use | Missed | Claude |
| **MEDIUM** | Arch boundary tests validate TOML config, not runtime enforcement | Missed | Claude |
| **MEDIUM** | `UserSimAgent` calls `ai_client.send()` directly — implicit real API dependency | Missed | Claude |
| **MEDIUM** | Arbitrary 1-second poll intervals miss sub-second transient states | Confirmed | GLM |
| **MEDIUM** | Mock CLI bypasses real subprocess spawning path | Confirmed | GLM |
| **LOW** | GeminiCliAdapter tool-use parsing branch never exercised by any test | Missed | Claude |
| **LOW** | `reset_ai_client` autouse timing may be incorrect for async tests | Missed | Claude |
| **LOW** | Variable latency / human-like simulation | Confirmed | GLM |

---
## Section 5: Prioritized Recommendations for Downstream Tracks

Listed in execution order, not importance order. Each maps to an existing or proposed track.

### Rec 1: Extend mock_gemini_cli with Failure Modes

**Target track:** New — `mock_provider_hardening_20260305`
**Files:** `tests/mock_gemini_cli.py`
**What:** Add a `MOCK_MODE` environment variable selector:
- `success` (current behavior, default)
- `malformed_json` — emit a truncated/corrupt JSON-L line
- `error_result` — emit `{"type": "result", "status": "error", ...}`
- `timeout` — sleep 90s to trigger the CLI timeout path
- `tool_use` — emit a real `tool_use` event to exercise GeminiCliAdapter parsing

Tests that need to verify error handling pass `MOCK_MODE=error_result` via `client.set_value()` before triggering the AI call.
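The selector can be sketched as below. Mode names match the list above; the response shapes mirror the events described in Section 2.1 but are otherwise illustrative (the `timeout` and `tool_use` modes are omitted for brevity):

```python
# Sketch of the proposed MOCK_MODE selector for the mock CLI. Response
# payloads are illustrative, not the mock's exact current output.
import json
import os

def mock_response_lines(mode):
    """Return the JSON-L lines the mock would emit for a given mode."""
    if mode == "success":
        return [
            json.dumps({"type": "message", "content": "SUCCESS: mock output"}),
            json.dumps({"type": "result", "status": "success"}),
        ]
    if mode == "error_result":
        return [json.dumps({"type": "result", "status": "error",
                            "error": "quota exceeded"})]
    if mode == "malformed_json":
        return ['{"type": "message", "content": "trunc']  # corrupt on purpose
    raise ValueError(f"unknown MOCK_MODE: {mode!r}")

if __name__ == "__main__":
    for line in mock_response_lines(os.environ.get("MOCK_MODE", "success")):
        print(line)
```

Keeping the mode logic in one pure function keeps each failure shape unit-testable without spawning the mock as a subprocess.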
### Rec 2: Add Dialog Assertion Before Auto-Approval

**Target track:** `test_suite_performance_and_flakiness_20260302` (already planned)
**Files:** All live_gui simulation tests, `tests/test_visual_sim_mma_v2.py`
**What:** Replace the conditional approval pattern:

```python
# BAD (current):
if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn')

# GOOD:
assert status.get('pending_mma_spawn_approval'), "Spawn dialog must appear before approve"
client.click('btn_approve_spawn')
```

Also add at least one test per dialog type that clicks reject and asserts the correct downstream state (engine marks track blocked, no worker spawned, etc.).
### Rec 3: Fix live_gui Session Scope Dirty State

**Target track:** `test_suite_performance_and_flakiness_20260302`
**Files:** `tests/conftest.py`
**What:** Add a per-test autouse fixture (function-scoped) that asserts clean GUI state before each `live_gui` test:

```python
@pytest.fixture(autouse=True)
def assert_gui_clean(live_gui):
    client = ApiHookClient()
    status = client.get_mma_status()
    assert not status.get('pending_mma_spawn_approval')
    assert not status.get('pending_mma_step_approval')
    assert not status.get('pending_tool_approval')
    assert status.get('mma_status') in ('idle', 'done', '')
```

This surfaces inter-test pollution immediately rather than causing a mysterious hang in a later test.
### Rec 4: Add ConfirmDialog Timeout Test

**Target track:** New — `mock_provider_hardening_20260305` (or `test_stabilization`)
**Files:** `tests/test_conductor_engine.py`
**What:** Add a test that creates a `ConfirmDialog`, never signals it, and verifies after N seconds that the background thread does NOT block indefinitely. This requires either a hard timeout on `wait()` or a documented contract that callers must signal the dialog within a finite window.
### Rec 5: Expose More State via Hook API

**Target track:** `hook_api_ui_state_verification_20260302` (already planned, HIGH priority)
**Files:** `src/api_hooks.py`
**What:** This track is the key enabler for replacing `app_instance` rendering tests with genuine state verification. The planned `/api/gui/state` endpoint should expose:
- Active modal type (`confirm_dialog`, `mma_step_approval`, `mma_spawn_approval`, `ask`, `none`)
- `ui_focus_agent` current filter value
- `_mma_status`, `_ai_status` text values
- Panel visibility flags

Once this is in place, the `app_instance` rendering tests can be migrated to `live_gui` equivalents that actually verify GUI-visible state.
### Rec 6: Add mcp_client Reset to the autouse Fixture

**Target track:** `test_suite_performance_and_flakiness_20260302`
**Files:** `tests/conftest.py`
**What:** Extend the `reset_ai_client` autouse fixture to also call `mcp_client.configure([], [])`, clearing the allowlist between tests. This prevents allowlist state from a previous test from leaking into the next.
### Rec 7: Add Runtime HITL Enforcement Test

**Target track:** `test_suite_performance_and_flakiness_20260302` or new
**Files:** `tests/test_arch_boundary_phase2.py`
**What:** Add an integration test (using `app_instance`) that:
1. Calls `ai_client.set_agent_tools({'set_file_slice': True})`
2. Confirms `mcp_client.MUTATING_TOOLS` contains `'set_file_slice'`
3. Triggers a dispatch of `set_file_slice`
4. Verifies `pre_tool_callback` was invoked BEFORE the write occurred

This closes the gap between "config says mutating tools are off" and "runtime actually gates them through the approval callback."
### Rec 8: Document `app_instance` Limitation in conftest

**Target track:** Any ongoing work — immediate, no track needed
**Files:** `tests/conftest.py`
**What:** Add a docstring to the `app_instance` fixture:

```python
"""
App instance with all ImGui rendering calls mocked to no-ops.
Use for unit tests of state logic and method existence.
DO NOT use to verify rendering correctness — use live_gui for that.
"""
```

This prevents future workers from writing rendering tests against this fixture and believing they have real coverage.

---
## Section 6: What the Existing Track Queue Gets Right

The `TASKS.md` strict execution queue is well-ordered for the test concerns:

1. `test_stabilization_20260302` → Must be first: asyncio lifecycle, mock-rot ban
2. `strict_static_analysis_and_typing_20260302` → Type safety before refactoring
3. `codebase_migration_20260302` → Already complete (commit 270f5f7)
4. `gui_decoupling_controller_20260302` → Already complete (commit 1bc4205)
5. `hook_api_ui_state_verification_20260302` → Critical enabler for real rendering tests
6. `robust_json_parsing_tech_lead_20260302` → Valid, but NOTE: the mock never produces malformed JSON, so the auto-retry loop cannot be verified without Rec 1 above
7. `concurrent_tier_source_tier_20260302` → Threading safety for future parallel workers
8. `test_suite_performance_and_flakiness_20260302` → Polling determinism, sleep elimination

The `test_architecture_integrity_audit_20260304` track (this track) sits logically between #1 and #5 — it provides the analytical basis for what #5 and #8 need to fix. The audit output (this document) should be read by the Tier 2 Tech Lead for both of those tracks.

The proposed new tracks (mock_provider_hardening, negative_path_testing) from GLM's recommendations are valid but should be created AFTER track #5 (`hook_api_ui_state_verification`) is complete, since they depend on the richer Hook API state to write meaningful assertions.

---
## Section 7: Architectural Observations Not in GLM's Report

### The Two-Tier Mock Problem

The test suite has two completely separate mock layers that do not know about each other:

**Layer 1** — `app_instance` fixture (in-process): Patches `immapp.run()`, `ai_client.send()`, and related functions with `unittest.mock`. Tests call methods directly. No network, no subprocess, no real threading.

**Layer 2** — `mock_gemini_cli.py` (out-of-process): A fake subprocess that the live GUI process calls through its own internal LLM pipeline. Tests drive this via `ApiHookClient` HTTP calls to the running GUI process.

These layers test completely different things. Layer 1 tests Python object invariants. Layer 2 tests the full application pipeline (threading, HTTP, IPC, process management). Most of the test suite is Layer 1; very few tests are Layer 2. The high-value tests are Layer 2 because they exercise the actual system, not a mock of it.

GLM correctly identifies that Layer 1 tests are of limited value for rendering verification but does not frame it as a two-layer architecture problem with a clear solution (expand Layer 2 via hook_api_ui_state_verification).
### The Simulation Framework's Actual Role
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The `simulation/` module is not (and should not be) a fidelity benchmark.
|
|
|
|
|
|
|
|
Its role is:
|
|
|
|
|
|
|
|
1. Drive the GUI through a sequence of interactions
|
|
|
|
|
|
|
|
2. Verify the GUI reaches expected states after each interaction
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The simulations (`sim_context.py`, `sim_ai_settings.py`, `sim_tools.py`,
|
|
|
|
|
|
|
|
`sim_execution.py`) are extremely thin wrappers. Their actual test value
|
|
|
|
|
|
|
|
comes from `test_extended_sims.py` which calls them against a live GUI and
|
|
|
|
|
|
|
|
verifies no exceptions are thrown. This is essentially a smoke test for the
|
|
|
|
|
|
|
|
GUI lifecycle, not a behavioral verification.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The real behavioral verification is in `test_visual_sim_mma_v2.py` and
|
|
|
|
|
|
|
|
similar files that assert specific state transitions. The simulation/
|
|
|
|
|
|
|
|
module should be understood as "workflow drivers," not "verification modules."
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
GLM's recommendation to add latency simulation and human-like behavior to
|
|
|
|
|
|
|
|
`simulation/user_agent.py` would add complexity to a layer that isn't the
|
|
|
|
|
|
|
|
bottleneck. The bottleneck is assertion depth in the polling loops, not
|
|
|
|
|
|
|
|
realism of the user actions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
---
*End of report. Next action: Tier 2 Tech Lead to read this alongside `plan.md` and initiate track #5 (`hook_api_ui_state_verification_20260302`) as the highest-leverage unblocking action.*