reports and potential new track

2026-03-05 00:31:55 -05:00
parent 1bc4205153
commit 983538aa8b
7 changed files with 3011 additions and 1 deletion

@@ -0,0 +1,3 @@
# Test Architecture Integrity & Simulation Audit
[Specification](spec.md) | [Plan](plan.md)

@@ -0,0 +1,9 @@
{
"id": "test_architecture_integrity_audit_20260304"`,
"name": "Test Architecture Integrity & Simulation Audit"`,
"status": "planned",
"created_at": "2026-03-04T00:00:00Z",
"updated_at": "2026-03-04T00:00:00Z",
"type": "audit",
"severity": "high"
}

@@ -0,0 +1,33 @@
# Implementation Plan
## Phase 1: Documentation (Planning)
Focus: Create comprehensive audit documentation with severity ratings
- [ ] Task 1.1: Document all identified false positive risks with severity matrix
- [ ] Task 1.2: Document all simulation fidelity gaps with impact analysis
- [ ] Task 1.3: Create mapping of coverage gaps to test categories
- [ ] Task 1.4: Provide concrete false positive examples
- [ ] Task 1.5: Provide concrete simulation miss examples
- [ ] Task 1.6: Prioritize recommendations by impact/effort matrix
## Phase 2: Review & Validation (Research)
Focus: Peer review of audit findings
- [ ] Task 2.1: Review existing tracks for overlap with this audit
- [ ] Task 2.2: Validate severity ratings against actual bug history
- [ ] Task 2.3: Cross-reference findings with docs/guide_simulations.md contract
- [ ] Task 2.4: Identify which gaps should be addressed in which future track
## Phase 3: Track Finalization
Focus: Prepare for downstream implementation tracks
- [ ] Task 3.1: Create prioritized backlog of implementation recommendations
- [ ] Task 3.2: Map recommendations to appropriate future tracks
- [ ] Task 3.3: Document dependencies between this audit and subsequent work
## Phase 4: User Manual Verification (Protocol in workflow.md)
Focus: Human review of audit findings
- [ ] Task 4.1: Review severity matrix for accuracy
- [ ] Task 4.2: Validate concrete examples against real-world scenarios
- [ ] Task 4.3: Approve recommendations for implementation

File diff suppressed because it is too large

@@ -0,0 +1,562 @@
# Test Architecture Integrity Audit — Claude Review
**Author:** Claude Sonnet 4.6 (Tier 1 Orchestrator)
**Review Date:** 2026-03-05
**Source Report:** report.md (authored by GLM-4.7, 2026-03-04)
**Scope:** Verify GLM's findings, correct errors, surface missed issues, produce actionable
recommendations for downstream tracks.
**Methodology:**
1. Read all 6 `docs/` architecture guides (guide_architecture, guide_simulations, guide_tools,
guide_mma, guide_meta_boundary, Readme)
2. Read GLM's full report.md
3. Read plan.md and spec.md for this track
4. Read py_get_skeleton for all 27 src/ modules
5. Read py_get_skeleton for conftest.py and representative test files
(test_extended_sims, test_live_gui_integration, test_dag_engine,
test_mma_orchestration_gui)
6. Read py_get_skeleton for all 9 simulation/ modules
7. Cross-referenced findings against JOURNAL.md, TASKS.md, and git history
---
## Section 1: Verdict on GLM's Report
GLM produced a competent surface-level audit. The structural inventory is
accurate and the broad categories of weakness (mock-rot, shallow assertions,
no negative paths) are valid. However, the report has material errors in
severity classification, contains two exact duplicate sections (Parts 10 and
11 are identical), and misses several issues that are more impactful than
the ones it flags at HIGH. It also makes recommendations that are
architecturally inappropriate for an ImGui immediate-mode application.
**Confirmed correct:** ~60% of findings
**Overstated or miscategorized:** ~25% of findings
**Missed entirely:** see Section 3
---
## Section 2: GLM Findings — Confirmed, Corrected, or Rejected
### 2.1 Confirmed: Mock Provider Never Fails (HIGH)
GLM is correct. `tests/mock_gemini_cli.py` has zero failure modes. The
keyword routing (`'"PATH: Epic Initialization"'`, `'"PATH: Sprint Planning"'`,
default) always produces a well-formed success response. No test using this
mock can ever exercise:
- Malformed or truncated JSON-L output
- Non-zero exit code from the CLI process
- A `{"type": "result", "status": "error", ...}` result event
- Rate-limit or quota responses
- Partial output followed by process crash
`GeminiCliAdapter.send()` parses streaming JSON-L line by line. A
corrupted line (encoding error, mid-write crash) would raise a `json.JSONDecodeError`
that bubbles up through `_send_gemini_cli`. This path is entirely untested.
**Severity: HIGH — confirmed.**
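To make the failure mode concrete, here is a minimal, self-contained sketch of a line-by-line JSON-L parse hitting a truncated line. The event shapes are illustrative only, not the adapter's verified schema:

```python
import json

# Illustrative JSON-L stream; the second line simulates a mid-write crash.
lines = [
    '{"type": "content", "text": "partial answer"}',
    '{"type": "content", "te',  # truncated line: the untested failure path
]

events, error = [], None
try:
    for line in lines:
        events.append(json.loads(line))
except json.JSONDecodeError as exc:
    error = f"corrupt JSON-L line: {exc}"
```

Because the mock never emits a line like the second one, the `except` branch above has no analogue anywhere in the suite.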
### 2.2 Confirmed: Auto-Approval Hides Dialog Logic (MEDIUM, not HIGH)
GLM flags this as HIGH. The auto-approval pattern in polling loops is:
```python
if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn')
```
This is structurally correct for automated testing — you MUST auto-approve
to drive the pipeline. The actual bug is different from what GLM describes:
the tests never assert that the dialog appeared BEFORE approving. The
correct pattern is:
```python
assert status.get('pending_mma_spawn_approval'), "Spawn dialog never appeared"
client.click('btn_approve_spawn')
```
Without the assert, the test passes even if the dialog never fires (meaning
spawn approval is silently bypassed at the application level).
**Severity: MEDIUM (dialog verification gap, not approval mechanism itself).**
**GLM's proposed fix ("Remove auto-approval") is wrong.** Auto-approval is
required for unattended testing. The fix is to assert the flag is True
*before* clicking.
There is also zero testing of the rejection path: what happens when
`btn_reject_spawn` is clicked? Does the engine stop? Does it log an error?
Does the track reach "blocked" state? This is an untested state transition.
### 2.3 Confirmed: Assertions Are Shallow (HIGH)
GLM is correct. The two canonical examples from simulation tests:
```python
assert len(tickets) >= 2 # structure unknown
"SUCCESS: Mock Tier 3 worker" in streams[tier3_key] # substring only
```
Neither validates ticket schema, ID uniqueness, dependency correctness, or
that the stream content is actually the full response and not a truncated
fragment.
**Severity: HIGH — confirmed.**
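A deepened version of the length-only assertion might look like this. The ticket fields shown (`id`, `title`, `depends_on`) are assumed for illustration, not the confirmed schema:

```python
# Example ticket payload with assumed fields; replace with the real schema.
tickets = [
    {"id": "T1", "title": "Set up fixtures", "depends_on": []},
    {"id": "T2", "title": "Add negative paths", "depends_on": ["T1"]},
]

assert len(tickets) >= 2               # the current, shallow check
ids = [t["id"] for t in tickets]
assert len(ids) == len(set(ids)), "duplicate ticket ids"
for t in tickets:
    assert t["title"].strip(), "empty title"
    assert all(dep in ids for dep in t["depends_on"]), "dangling dependency"
```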
### 2.4 Confirmed: No Negative Path Testing (HIGH)
GLM is correct. The entire test suite covers only the happy path. Missing:
- Rejection flows for all three dialog types (ConfirmDialog, MMAApprovalDialog,
MMASpawnApprovalDialog)
- Malformed LLM response handling (bad JSON, missing fields, unexpected types)
- Network timeout/connection error to Hook API during a live_gui test
- `shell_runner.run_powershell` timeout (60s) expiry path
- `mcp_client._resolve_and_check` returning an error (path outside allowlist)
**Severity: HIGH — confirmed.**
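The malformed-response case from the list above can be exercised without any framework. `parse_llm_reply` below is a hypothetical stand-in for the adapter's validation step, shown only to illustrate the shape of a negative-path test:

```python
import json

def parse_llm_reply(raw: str) -> dict:
    # Stand-in for response validation (hypothetical helper, not project code).
    data = json.loads(raw)
    if "tickets" not in data:
        raise ValueError("missing 'tickets' field")
    return data

def expect_failure(raw, exc_type):
    # Negative-path harness: the parse MUST fail with the expected type.
    try:
        parse_llm_reply(raw)
    except exc_type:
        return True
    return False

assert expect_failure('{"tickets": [', json.JSONDecodeError)   # truncated JSON
assert expect_failure('{"status": "ok"}', ValueError)          # missing field
```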
### 2.5 Confirmed: Arbitrary Poll Intervals Miss Transient States (MEDIUM)
GLM is correct. 1-second polling in simulation loops will miss any state
that exists for less than 1 second. The approval dialogs in particular may
appear and be cleared within a single render frame if the engine is fast.
The `WorkflowSimulator.wait_for_ai_response()` method is the most critical
polling target. It is the backbone of all extended simulation tests. If its
polling strategy is wrong, the entire extended sim suite is unreliable.
**Severity: MEDIUM — confirmed.**
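The sampling problem is easy to demonstrate in isolation. With an invented timeline where a dialog is open for 100 ms, 1-second polls never observe it, while a transition log does:

```python
# Invented state timeline: time (s) -> state entered at that time.
timeline = {0.0: "idle", 0.3: "dialog_open", 0.4: "idle", 1.0: "done"}

def state_at(t):
    # Latest state change at or before time t.
    return timeline[max(k for k in timeline if k <= t)]

polled = [state_at(t) for t in (0.0, 1.0, 2.0)]   # 1-second polling
assert "dialog_open" not in polled                 # transient state missed

# An event log (every transition recorded by the app) catches it:
events = [timeline[k] for k in sorted(timeline)]
assert "dialog_open" in events
```

This is why exposing a transition log through the Hook API is a stronger fix than tightening poll intervals.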
### 2.6 Confirmed: Mock CLI Bypasses Real Subprocess Path (MEDIUM)
GLM is correct. Setting `gcli_path` to a Python script does not exercise:
- Real PATH resolution for the `gemini` binary
- Windows process group creation (`CREATE_NEW_PROCESS_GROUP`)
- Environment variable propagation to the subprocess
- `mcp_env.toml` path prepending (in `shell_runner._build_subprocess_env`)
- The `kill_process_tree` teardown path when the process hangs
**Severity: MEDIUM — confirmed.**
### 2.7 CORRECTION: "run_powershell is a Read-Only Tool"
**GLM is WRONG here.** In Part 8, GLM lists:
> "Read-Only Tools: run_powershell (via shell_runner.py)"
`run_powershell` executes arbitrary PowerShell scripts against the filesystem.
It is the MOST dangerous tool in the set — it is not in `MUTATING_TOOLS` only
because it is not an MCP filesystem tool; its approval gate is the
`confirm_and_run_callback` (ConfirmDialog). Categorizing it as "read-only"
is a factual error that could mislead future workers about the security model.
### 2.8 CORRECTION: "State Duplication Between App and AppController"
**GLM is outdated here.** The gui_decoupling track (`1bc4205`) was completed
before this audit. `gui_2.App` now delegates all state through `AppController`
via `__getattr__`/`__setattr__` proxies. There is no duplication — `App` is a
thin ImGui rendering layer, `AppController` owns all state. GLM's concern is
stale relative to the current codebase.
### 2.9 CORRECTION: "Priority 5 — Screenshot Comparison Infrastructure"
**This recommendation is architecturally inappropriate** for Dear PyGui/ImGui.
These are immediate-mode renderers; there is no DOM or widget tree to
interrogate. Pixel-level screenshot comparison requires platform-specific
capture APIs (Windows Magnification, GDI) and is extremely fragile to font
rendering, DPI, and GPU differences. The Hook API's logical state verification
is the CORRECT and SUFFICIENT abstraction for this application. Adding
screenshot comparison would be high cost, low value, and high flakiness.
The appropriate alternative (already partially in place via `hook_api_ui_state_verification_20260302`)
is exposing more GUI state via the Hook API so tests can assert logical
rendering state (is a panel visible? what is the modal title?) without pixels.
### 2.10 CORRECTION: Severity Table Has Duplicate and Conflicting Entries
The summary table in Part 9 lists identical items at multiple severity levels:
- "No concurrent access testing": appears as both HIGH and MEDIUM
- "No real-time latency simulation": appears as both MEDIUM and LOW
- "No human-like behavior": appears as both MEDIUM and LOW
- "Arbitrary polling intervals": appears as both MEDIUM and LOW
Additionally, Parts 10 and 11 are EXACTLY IDENTICAL — the cross-reference
section was copy-pasted in full. This suggests the report was generated with
insufficient self-review.
### 2.11 CONTEXTUAL DOWNGRADE: Human-Like Behavior / Latency Simulation
GLM spends substantial space on the absence of:
- Typing speed simulation
- Hesitation before actions
- Variable LLM latency
This is a **personal developer tool for a single user on a local machine**.
These are aspirational concerns for a production SaaS simulation framework.
For this product context, these are genuinely LOW priority. The simulation
framework's job is to verify that the GUI state machine transitions correctly,
not to simulate human psychology.
---
## Section 3: Issues GLM Missed
These are findings not present in GLM's report that carry meaningful risk.
### 3.1 CRITICAL: `live_gui` is Session-Scoped — Dirty State Across Tests
`conftest.py`'s `live_gui` fixture has `scope="session"`. This means ALL
tests that use `live_gui` share a single running GUI process. If test A
leaves the GUI in a state with an open modal dialog, test B will find the
GUI unresponsive or in an unexpected state.
The teardown calls `client.reset_session()` (which clicks `btn_reset_session`),
but this clears AI state and discussion history, not pending dialogs or
MMA orchestration state. A test that triggers a spawn approval dialog and
then fails before approving it will leave `_pending_mma_spawn` set, blocking
the ENTIRE remaining test session.
**Severity: HIGH.** The current test ordering dependency is invisible and
fragile. Tests must not be run in arbitrary order.
**Fix:** Each `live_gui`-using test that touches MMA or approval flows should
explicitly verify clean state at start:
```python
status = client.get_mma_status()
assert not status.get('pending_mma_spawn_approval'), "Previous test left GUI dirty"
```
### 3.2 HIGH: `app_instance` Fixture Tests Don't Test Rendering
The `app_instance` fixture mocks out all ImGui rendering. This means every
test using `app_instance` (approximately 40+ tests) is testing Python object
state, not rendered UI. Tests like:
- `test_app_has_render_token_budget_panel(app_instance)` — tests `hasattr()`,
not that the panel renders
- `test_render_token_budget_panel_empty_stats_no_crash(app_instance)` — calls
`_render_token_budget_panel()` in a context where all ImGui calls are no-ops
This creates a systematic false-positive class: a method can be completely
broken (wrong data, missing widget calls) and the test passes because ImGui
calls are silently ignored. The only tests with genuine rendering fidelity
are the `live_gui` tests.
This is the root cause behind GLM's "state existence only" finding. It is
not a test assertion weakness — it is a fixture architectural limitation.
**Severity: HIGH.** The implication: all `app_instance`-based rendering
tests should be treated as "smoke tests that the method doesn't crash,"
not as "verification that the rendering is correct."
**Fix:** The `hook_api_ui_state_verification_20260302` track (adding
`/api/gui/state`) is the correct path forward: expose render-visible state
through the Hook API so `live_gui` tests can verify it.
### 3.3 HIGH: No Test for `ConfirmDialog.wait()` Infinite Block
`ConfirmDialog.wait()` uses `_condition.wait(timeout=0.1)` in a `while not self._done` loop.
There is no outer timeout on this loop. If the GUI thread never signals the
dialog (e.g., GUI crash after dialog creation, or a test that creates a
dialog but doesn't render it), the asyncio worker thread hangs indefinitely.
This is particularly dangerous in the `run_worker_lifecycle` path:
1. Worker pushes dialog to event queue
2. GUI process crashes or freezes
3. `dialog.wait()` loops forever at 0.1s intervals
4. Test session hangs with no error output
There is no test verifying that `wait()` has a maximum wait time and raises
an exception or returns a default (rejected) decision after it.
**Severity: HIGH.**
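A bounded-wait contract could look like the sketch below. `BoundedDialog` is an assumed minimal shape, not `ConfirmDialog`'s real internals; the point is the `timeout` on `wait_for` and the explicit `TimeoutError`:

```python
import threading

class BoundedDialog:
    def __init__(self):
        self._condition = threading.Condition()
        self._done = False
        self.decision = None

    def signal(self, decision):
        with self._condition:
            self.decision = decision
            self._done = True
            self._condition.notify_all()

    def wait(self, deadline=5.0):
        # Unlike the current wait(), give up after `deadline` seconds.
        with self._condition:
            if not self._condition.wait_for(lambda: self._done, timeout=deadline):
                raise TimeoutError("dialog never signalled within deadline")
            return self.decision

d = BoundedDialog()
try:
    d.wait(deadline=0.2)      # nobody signals: raises instead of hanging
except TimeoutError:
    timed_out = True
```

Alternatively, `wait()` could return a default rejected decision on timeout; either way the contract must be tested.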
### 3.4 MEDIUM: `mcp_client` Module State Persists Across Unit Tests
`mcp_client.configure()` sets module-level globals (`_allowed_paths`,
`_base_dirs`, `_primary_base_dir`). Tests that call MCP tool functions
directly without calling `configure()` first will use whatever state was
left from the previous test. The `reset_ai_client` autouse fixture calls
`ai_client.reset_session()` but does NOT reset `mcp_client` state.
Any test that calls `mcp_client.read_file()`, `mcp_client.py_get_skeleton()`,
etc. directly (not through `ai_client.send()`) inherits the allowlist from
the previous test run. This can cause false passes (path permitted by
previous test's allowlist) or false failures (path denied because
`_base_dirs` is empty from a prior reset).
**Severity: MEDIUM.**
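The leakage mechanism, reduced to a self-contained stand-in (real `mcp_client` keeps its allowlist in module globals as described above):

```python
class FakeMcpModule:
    # Stand-in for mcp_client's module-level globals.
    _allowed_paths: list = []

    @classmethod
    def configure(cls, allowed):
        cls._allowed_paths = list(allowed)

    @classmethod
    def is_allowed(cls, path):
        return path in cls._allowed_paths

# "Test A" configures an allowlist...
FakeMcpModule.configure(["src/app.py"])
# ...and "test B", which never calls configure(), silently inherits it:
leaked = FakeMcpModule.is_allowed("src/app.py")   # True: false-pass risk
FakeMcpModule.configure([])                        # the reset Rec 6 proposes
clean = FakeMcpModule.is_allowed("src/app.py")     # False after reset
```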
### 3.5 MEDIUM: `current_tier` Module Global — No Test for Concurrent Corruption
GLM mentions this as a "design concern." It is more specific: the
`concurrent_tier_source_tier_20260302` track exists because `current_tier`
in `ai_client.py` is a module-level `str | None`. When two Tier 3 workers
run concurrently (future feature), the second `send()` call will overwrite
the first worker's tier tag.
What's missing: there is no test that verifies the CURRENT behavior is safe
under single-threaded operation, and no test that demonstrates the failure
mode under concurrent operation to serve as a regression baseline for the fix.
**Severity: MEDIUM.**
### 3.6 MEDIUM: `test_arch_boundary_phase2.py` Tests Config File, Not Runtime
The arch boundary tests verify that `manual_slop.toml` lists mutating tools
as disabled by default. But the tests don't verify:
1. That `manual_slop.toml` is actually loaded into `ai_client._agent_tools`
at startup
2. That `ai_client._agent_tools` is actually consulted before tool dispatch
3. That the TOML → runtime path is end-to-end
A developer could modify how tools are loaded without breaking these tests.
The tests are static config audits, not runtime enforcement tests.
**Severity: MEDIUM.**
### 3.7 MEDIUM: `UserSimAgent.generate_response()` Calls `ai_client.send()` Directly
From `simulation/user_agent.py`: the `UserSimAgent` class imports `ai_client`
and calls `ai_client.send()` to generate "human-like" responses. This means:
- Simulation tests have an implicit dependency on a configured LLM provider
- If run without an API key (e.g., in CI), simulations fail at the UserSimAgent
level, not at the GUI level — making failures hard to diagnose
- The mock gemini_cli setup in tests does NOT redirect `ai_client.send()` in
the TEST process (only in the GUI process via `gcli_path`), so UserSimAgent
would attempt real API calls
No test documents whether UserSimAgent is actually exercised in the extended
sims (`test_extended_sims.py`) or whether those sims use the ApiHookClient
directly to drive the GUI.
**Severity: MEDIUM.**
### 3.8 LOW: Gemini CLI Tool-Call Protocol Not Exercised
The real Gemini CLI emits `{"type": "tool_use", "tool": {...}}` events mid-stream
and then waits for `{"type": "tool_result", ...}` piped back on stdin. The
`mock_gemini_cli.py` does not emit any `tool_use` events; it only detects
`'"role": "tool"'` in the prompt to simulate a post-tool-call turn.
This means `GeminiCliAdapter`'s tool-call parsing logic (the branch that
handles `tool_use` event types and accumulates them) is NEVER exercised by
any test. A regression in that parsing branch would be invisible to the
test suite.
**Severity: LOW** (only relevant when the real gemini CLI is used with tools).
### 3.9 LOW: `reset_ai_client` Autouse Fixture Timing is Wrong for Async Tests
The `reset_ai_client` autouse fixture runs synchronously before each test.
For tests marked `@pytest.mark.asyncio`, the reset happens BEFORE the test's
async setup. If the async test itself triggers ai_client operations in setup
(e.g., through an event loop created by the fixture), the reset may not
capture all state mutations. This is an edge case but could explain
intermittent behavior in async tests.
**Severity: LOW.**
---
## Section 4: Revised Severity Matrix
| Severity | Finding | GLM? | Source |
|---|---|---|---|
| **HIGH** | Mock provider has zero failure modes — all integration tests pass unconditionally | Confirmed | GLM |
| **HIGH** | `app_instance` fixture mocks ImGui — rendering tests are existence checks only | Missed | Claude |
| **HIGH** | `live_gui` session scope — dirty state from one test bleeds into the next | Missed | Claude |
| **HIGH** | `ConfirmDialog.wait()` has no outer timeout — worker thread can hang indefinitely | Missed | Claude |
| **HIGH** | Shallow assertions — substring match and length check only, no schema validation | Confirmed | GLM |
| **HIGH** | No negative path coverage — rejection flows, timeouts, malformed inputs untested | Confirmed | GLM |
| **MEDIUM** | Auto-approval never asserts dialog appeared before approving | Corrected | GLM/Claude |
| **MEDIUM** | `mcp_client` module state not reset between unit tests | Missed | Claude |
| **MEDIUM** | `current_tier` global — no test demonstrates safe single-thread or failure under concurrent use | Missed | Claude |
| **MEDIUM** | Arch boundary tests validate TOML config, not runtime enforcement | Missed | Claude |
| **MEDIUM** | `UserSimAgent` calls `ai_client.send()` directly — implicit real API dependency | Missed | Claude |
| **MEDIUM** | Arbitrary 1-second poll intervals miss sub-second transient states | Confirmed | GLM |
| **MEDIUM** | Mock CLI bypasses real subprocess spawning path | Confirmed | GLM |
| **LOW** | GeminiCliAdapter tool-use parsing branch never exercised by any test | Missed | Claude |
| **LOW** | `reset_ai_client` autouse timing may be incorrect for async tests | Missed | Claude |
| **LOW** | Variable latency / human-like simulation | Confirmed | GLM |
---
## Section 5: Prioritized Recommendations for Downstream Tracks
Listed in execution order, not importance order. Each maps to an existing or
proposed track.
### Rec 1: Extend mock_gemini_cli with Failure Modes
**Target track:** New — `mock_provider_hardening_20260305`
**Files:** `tests/mock_gemini_cli.py`
**What:** Add a `MOCK_MODE` environment variable selector:
- `success` (current behavior, default)
- `malformed_json` — emit a truncated/corrupt JSON-L line
- `error_result` — emit `{"type": "result", "status": "error", ...}`
- `timeout` — sleep 90s to trigger the CLI timeout path
- `tool_use` — emit a real `tool_use` event to exercise GeminiCliAdapter parsing
Tests that need to verify error handling pass `MOCK_MODE=error_result` via
`client.set_value()` before triggering the AI call.
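A sketch of the `MOCK_MODE` dispatch for `mock_gemini_cli.py`. The event payloads are illustrative guesses at the JSON-L protocol, not the verified schema:

```python
import json
import os
import sys
import time

def emit(event):
    # One JSON-L event per line, flushed so the adapter sees it immediately.
    sys.stdout.write(json.dumps(event) + "\n")
    sys.stdout.flush()

def main():
    mode = os.environ.get("MOCK_MODE", "success")
    if mode == "success":
        emit({"type": "content", "text": "ok"})
        emit({"type": "result", "status": "success"})
    elif mode == "malformed_json":
        sys.stdout.write('{"type": "content", "te\n')   # truncated line
    elif mode == "error_result":
        emit({"type": "result", "status": "error", "message": "quota exceeded"})
    elif mode == "timeout":
        time.sleep(90)                                   # trip the CLI timeout
    elif mode == "tool_use":
        emit({"type": "tool_use", "tool": {"name": "read_file"}})

if __name__ == "__main__":
    main()
```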
### Rec 2: Add Dialog Assertion Before Auto-Approval
**Target track:** `test_suite_performance_and_flakiness_20260302` (already planned)
**Files:** All live_gui simulation tests, `tests/test_visual_sim_mma_v2.py`
**What:** Replace the conditional approval pattern:
```python
# BAD (current):
if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn')
# GOOD:
assert status.get('pending_mma_spawn_approval'), "Spawn dialog must appear before approve"
client.click('btn_approve_spawn')
```
Also add at least one test per dialog type that clicks reject and asserts the
correct downstream state (engine marks track blocked, no worker spawned, etc.).
### Rec 3: Fix live_gui Session Scope Dirty State
**Target track:** `test_suite_performance_and_flakiness_20260302`
**Files:** `tests/conftest.py`
**What:** Add a per-test autouse fixture (function-scoped) that asserts clean
GUI state before each `live_gui` test:
```python
@pytest.fixture(autouse=True)
def assert_gui_clean(live_gui):
client = ApiHookClient()
status = client.get_mma_status()
assert not status.get('pending_mma_spawn_approval')
assert not status.get('pending_mma_step_approval')
assert not status.get('pending_tool_approval')
assert status.get('mma_status') in ('idle', 'done', '')
```
This surfaces inter-test pollution immediately rather than causing a
mysterious hang in a later test.
### Rec 4: Add ConfirmDialog Timeout Test
**Target track:** New — `mock_provider_hardening_20260305` (or `test_stabilization`)
**Files:** `tests/test_conductor_engine.py`
**What:** Add a test that creates a `ConfirmDialog`, never signals it, and
verifies after N seconds that the background thread does NOT block indefinitely.
This requires either a hard timeout on `wait()` or a documented contract that
callers must signal the dialog within a finite window.
### Rec 5: Expose More State via Hook API
**Target track:** `hook_api_ui_state_verification_20260302` (already planned, HIGH priority)
**Files:** `src/api_hooks.py`
**What:** This track is the key enabler for replacing `app_instance` rendering
tests with genuine state verification. The planned `/api/gui/state` endpoint
should expose:
- Active modal type (`confirm_dialog`, `mma_step_approval`, `mma_spawn_approval`, `ask`, `none`)
- `ui_focus_agent` current filter value
- `_mma_status`, `_ai_status` text values
- Panel visibility flags
Once this is in place, the `app_instance` rendering tests can be migrated
to `live_gui` equivalents that actually verify GUI-visible state.
### Rec 6: Add mcp_client Reset to autouse Fixture
**Target track:** `test_suite_performance_and_flakiness_20260302`
**Files:** `tests/conftest.py`
**What:** Extend `reset_ai_client` autouse fixture to also call
`mcp_client.configure([], [])` to clear the allowlist between tests.
This prevents allowlist state from a previous test from leaking into the next.
### Rec 7: Add Runtime HITL Enforcement Test
**Target track:** `test_suite_performance_and_flakiness_20260302` or new
**Files:** `tests/test_arch_boundary_phase2.py`
**What:** Add an integration test (using `app_instance`) that:
1. Calls `ai_client.set_agent_tools({'set_file_slice': True})`
2. Confirms `mcp_client.MUTATING_TOOLS` contains `'set_file_slice'`
3. Triggers a dispatch of `set_file_slice`
4. Verifies `pre_tool_callback` was invoked BEFORE the write occurred
This closes the gap between "config says mutating tools are off" and
"runtime actually gates them through the approval callback."
### Rec 8: Document `app_instance` Limitation in conftest
**Target track:** Any ongoing work — immediate, no track needed
**Files:** `tests/conftest.py`
**What:** Add a docstring to `app_instance` fixture:
```python
"""
App instance with all ImGui rendering calls mocked to no-ops.
Use for unit tests of state logic and method existence.
DO NOT use to verify rendering correctness — use live_gui for that.
"""
```
This prevents future workers from writing rendering tests against this fixture
and believing they have real coverage.
---
## Section 6: What the Existing Track Queue Gets Right
The `TASKS.md` strict execution queue is well-ordered for the test concerns:
1. `test_stabilization_20260302` → Must be first: asyncio lifecycle, mock-rot ban
2. `strict_static_analysis_and_typing_20260302` → Type safety before refactoring
3. `codebase_migration_20260302` → Already complete (commit 270f5f7)
4. `gui_decoupling_controller_20260302` → Already complete (commit 1bc4205)
5. `hook_api_ui_state_verification_20260302` → Critical enabler for real rendering tests
6. `robust_json_parsing_tech_lead_20260302` → Valid, but NOTE: the mock never produces
malformed JSON, so the auto-retry loop cannot be verified without Rec 1 above
7. `concurrent_tier_source_tier_20260302` → Threading safety for future parallel workers
8. `test_suite_performance_and_flakiness_20260302` → Polling determinism, sleep elimination
The `test_architecture_integrity_audit_20260304` (this track) sits logically
between #1 and #5 — it provides the analytical basis for what #5 and #8 need
to fix. The audit output (this document) should be read by the Tier 2 Tech Lead
for both those tracks.
The proposed new tracks (mock_provider_hardening, negative_path_testing) from
GLM's recommendations are valid but should be created AFTER track #5
(`hook_api_ui_state_verification`) is complete, since they depend on the
richer Hook API state to write meaningful assertions.
---
## Section 7: Architectural Observations Not in GLM's Report
### The Two-Tier Mock Problem
The test suite has two completely separate mock layers that do not know about
each other:
**Layer 1** (`app_instance` fixture, in-process): Patches `immapp.run()`,
`ai_client.send()`, and related functions with `unittest.mock`. Tests call
methods directly. No network, no subprocess, no real threading.
**Layer 2** (`mock_gemini_cli.py`, out-of-process): A fake subprocess that
the live GUI process calls through its own internal LLM pipeline. Tests drive
this via `ApiHookClient` HTTP calls to the running GUI process.
These layers test completely different things. Layer 1 tests Python object
invariants. Layer 2 tests the full application pipeline (threading, HTTP, IPC,
process management). Most of the test suite is Layer 1. Very few tests are
Layer 2. The high-value tests are Layer 2 because they exercise the actual
system, not a mock of it.
GLM correctly identifies that Layer 1 tests are of limited value for
rendering verification but does not frame it as a two-layer architecture
problem with a clear solution (expand Layer 2 via hook_api_ui_state_verification).
### The Simulation Framework's Actual Role
The `simulation/` module is not (and should not be) a fidelity benchmark.
Its role is:
1. Drive the GUI through a sequence of interactions
2. Verify the GUI reaches expected states after each interaction
The simulations (`sim_context.py`, `sim_ai_settings.py`, `sim_tools.py`,
`sim_execution.py`) are extremely thin wrappers. Their actual test value
comes from `test_extended_sims.py` which calls them against a live GUI and
verifies no exceptions are thrown. This is essentially a smoke test for the
GUI lifecycle, not a behavioral verification.
The real behavioral verification is in `test_visual_sim_mma_v2.py` and
similar files that assert specific state transitions. The simulation/
module should be understood as "workflow drivers," not "verification modules."
GLM's recommendation to add latency simulation and human-like behavior to
`simulation/user_agent.py` would add complexity to a layer that isn't the
bottleneck. The bottleneck is assertion depth in the polling loops, not
realism of the user actions.
---
*End of report. Next action: Tier 2 Tech Lead to read this alongside
`plan.md` and initiate track #5 (`hook_api_ui_state_verification_20260302`)
as the highest-leverage unblocking action.*

@@ -0,0 +1,96 @@
# Track Specification: Test Architecture Integrity & Simulation Audit
## Overview
Comprehensive audit of testing infrastructure and simulation framework to identify false positive risks, coverage gaps, and simulation fidelity issues. This analysis was triggered by a request to review how tests and simulations are set up, whether tests can report passing grades when they in fact fail, and whether the simulations are rigorous or merely rough emulators.
## Current State Audit (as of 20260304)
### Already Implemented (DO NOT re-implement)
- **Testing Infrastructure** (tests/conftest.py):
  - live_gui fixture for session-scoped GUI lifecycle management
  - Process cleanup with kill_process_tree()
  - VerificationLogger for diagnostic logging
  - Artifact isolation to tests/artifacts/ and tests/logs/
  - Ban on arbitrary core mocking
- **Simulation Framework** (simulation/):
  - sim_base.py: Base simulation class with setup/teardown
  - workflow_sim.py: Workflow orchestration
  - sim_context.py, sim_ai_settings.py, sim_tools.py, sim_execution.py
  - user_agent.py: Simulated human agent
- **Mock Provider** (tests/mock_gemini_cli.py):
- Keyword-based response routing
- JSON-L protocol matching real CLI output
#### Critical False Positive Risks Identified
1. **Mock Provider Always Returns Success**: Never validates input, never produces errors, never tests failure paths
2. **Auto-Approval Pattern**: All HITL gates auto-clicked, never verifying dialogs appear or rejection flows
3. **Substring-Based Assertions**: Only check existence of content, not validity or structure
4. **State Existence Only**: Tests check fields exist but not their correctness or invariants
5. **No Negative Path Testing**: No coverage for rejection, timeout, malformed input, concurrent access
6. **No Visual Verification**: Tests verify logical state via Hook API but never check what's actually rendered
7. **No State Machine Validation**: No verification that status transitions are legal or complete
#### Simulation Rigor Gaps Identified
1. **No Real-Time Latency Simulation**: Fixed delays don't model variable LLM/network latency
2. **No Human-Like Behavior**: Instant actions, no typing speed, hesitation, mistakes, or task switching
3. **Arbitrary Polling Intervals**: 1-second polls may miss transient states
4. **Mock CLI Redirection**: Bypasses subprocess spawning, environment passing, and process cleanup paths
5. **No Stress Testing**: No load testing, no edge case bombardment
#### Test Coverage Gaps
- No tests for approval dialog rejection flows
- No tests for malformed LLM response handling
- No tests for network timeout/failure scenarios
- No tests for concurrent duplicate requests
- No tests for out-of-order event sequences
- No thread-safety tests for shared resources
- No visual rendering verification (modal visibility, text overflow, color schemes)
#### Structural Testing Contract Gaps
- Missing rule requiring negative path testing
- Missing rule requiring state validation beyond existence
- Missing rule requiring visual verification
- No enforcement for thread-safety testing
## Goals
1. Document all identified testing pitfalls with severity ratings (HIGH/MEDIUM/LOW)
2. Create actionable recommendations for each identified issue
3. Map existing test coverage gaps to specific missing test files
4. Provide architecture recommendations for simulation framework enhancements
## Functional Requirements
- [ ] Document all false positive risks in a structured format
- [ ] Document all simulation fidelity gaps in a structured format
- [ ] Create severity matrix for each issue
- [ ] Generate list of missing test cases by category
- [ ] Provide concrete examples of how current tests would pass despite bugs
- [ ] Provide concrete examples of how simulations would miss UX issues
## Non-Functional Requirements
- Report must include author attribution (GLM-4.7) and derivation methodology
- Analysis must cite specific file paths and line numbers where applicable
- Recommendations must be prioritized by impact and implementation effort
## Architecture Reference
Refer to:
- docs/guide_simulations.md - Current simulation contract and patterns
- docs/guide_mma.md - MMA orchestration architecture
- docs/guide_architecture.md - Thread domains, event system, HITL mechanism
- conductor/tracks/*/spec.md - Existing track specifications for consistency
## Out of Scope
- Implementing the actual test fixes (that's for subsequent tracks)
- Refactoring the simulation framework (documenting only)
- Modifying the mock provider (analyzing only)
- Writing new tests (planning phase for future tracks)