# Test Architecture Integrity Audit — Claude Review

**Author:** Claude Sonnet 4.6 (Tier 1 Orchestrator)

**Review Date:** 2026-03-05

**Source Report:** report.md (authored by GLM-4.7, 2026-03-04)

**Scope:** Verify GLM's findings, correct errors, surface missed issues, produce actionable recommendations for downstream tracks.

**Methodology:**
1. Read all 6 `docs/` architecture guides (guide_architecture, guide_simulations, guide_tools, guide_mma, guide_meta_boundary, Readme)
2. Read GLM's full report.md
3. Read plan.md and spec.md for this track
4. Read py_get_skeleton for all 27 src/ modules
5. Read py_get_skeleton for conftest.py and representative test files (test_extended_sims, test_live_gui_integration, test_dag_engine, test_mma_orchestration_gui)
6. Read py_get_skeleton for all 9 simulation/ modules
7. Cross-referenced findings against JOURNAL.md, TASKS.md, and git history

---
## Section 1: Verdict on GLM's Report

GLM produced a competent surface-level audit. The structural inventory is accurate and the broad categories of weakness (mock-rot, shallow assertions, no negative paths) are valid. However, the report has material errors in severity classification, contains two exact duplicate sections (Parts 10 and 11 are identical), and misses several issues that are more impactful than the ones it flags at HIGH. It also makes recommendations that are architecturally inappropriate for an ImGui immediate-mode application.

**Confirmed correct:** ~60% of findings
**Overstated or miscategorized:** ~25% of findings
**Missed entirely:** see Section 3

---
## Section 2: GLM Findings — Confirmed, Corrected, or Rejected

### 2.1 Confirmed: Mock Provider Never Fails (HIGH)

GLM is correct. `tests/mock_gemini_cli.py` has zero failure modes. The keyword routing (`'"PATH: Epic Initialization"'`, `'"PATH: Sprint Planning"'`, default) always produces a well-formed success response. No test using this mock can ever exercise:
- Malformed or truncated JSON-L output
- Non-zero exit code from the CLI process
- A `{"type": "result", "status": "error", ...}` result event
- Rate-limit or quota responses
- Partial output followed by process crash

`GeminiCliAdapter.send()` parses streaming JSON-L line by line. A corrupted line (encoding error, mid-write crash) would raise a `json.JSONDecodeError` that bubbles up through `_send_gemini_cli`. This path is entirely untested.

**Severity: HIGH — confirmed.**
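The untested decode-failure path is easy to demonstrate in isolation. This standalone sketch mimics a line-by-line JSON-L parse loop; the stream content is invented for illustration and does not come from the real adapter:

```python
# Demonstration of the untested failure path: a JSON-L line truncated
# mid-write raises json.JSONDecodeError — the same exception that would
# bubble up through _send_gemini_cli with no test coverage.
import json

def parse_stream(lines):
    """Mimic a line-by-line JSON-L parse loop; raises on a corrupt line."""
    events = []
    for line in lines:
        events.append(json.loads(line))  # raises json.JSONDecodeError
    return events

good = ['{"type": "message", "content": "ok"}']
truncated = ['{"type": "message", "content": "partial resp']  # cut mid-string

assert parse_stream(good)[0]["type"] == "message"
try:
    parse_stream(truncated)
except json.JSONDecodeError as exc:
    print(f"decode failed at position {exc.pos}")
```

A failure-mode mock (see Rec 1) only needs to emit one line like `truncated` above to force this branch under test.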
### 2.2 Confirmed: Auto-Approval Hides Dialog Logic (MEDIUM, not HIGH)

GLM flags this as HIGH. The auto-approval pattern in polling loops is:

```python
if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn')
```

This is structurally correct for automated testing — you MUST auto-approve to drive the pipeline. The actual bug is different from what GLM describes: the tests never assert that the dialog appeared BEFORE approving. The correct pattern is:

```python
assert status.get('pending_mma_spawn_approval'), "Spawn dialog never appeared"
client.click('btn_approve_spawn')
```

Without the assert, the test passes even if the dialog never fires (meaning spawn approval is silently bypassed at the application level).

**Severity: MEDIUM (dialog verification gap, not the approval mechanism itself).**

**GLM's proposed fix ("Remove auto-approval") is wrong.** Auto-approval is required for unattended testing. The fix is to assert the flag is True *before* clicking.

There is also zero testing of the rejection path: what happens when `btn_reject_spawn` is clicked? Does the engine stop? Does it log an error? Does the track reach the "blocked" state? This is an untested state transition.
### 2.3 Confirmed: Assertions Are Shallow (HIGH)

GLM is correct. The two canonical examples from simulation tests:

```python
assert len(tickets) >= 2  # structure unknown
"SUCCESS: Mock Tier 3 worker" in streams[tier3_key]  # substring only
```

Neither validates ticket schema, ID uniqueness, dependency correctness, or that the stream content is actually the full response and not a truncated fragment.

**Severity: HIGH — confirmed.**
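A schema-level replacement for the shallow length check above can be sketched as follows. The ticket field names (`id`, `title`, `deps`) are illustrative assumptions; the real assertions should be derived from the engine's actual ticket model:

```python
# Sketch of deeper assertions to replace `assert len(tickets) >= 2`.
# Field names are hypothetical — align them with the real ticket schema.
def assert_ticket_schema(tickets):
    assert len(tickets) >= 2, "expected at least 2 tickets"
    ids = [t["id"] for t in tickets]
    assert len(ids) == len(set(ids)), "ticket IDs must be unique"
    for t in tickets:
        # every ticket needs a non-empty title
        assert isinstance(t.get("title"), str) and t["title"], t
        # every declared dependency must reference a known ticket
        for dep in t.get("deps", []):
            assert dep in ids, f"dangling dependency {dep!r}"

tickets = [
    {"id": "T1", "title": "Set up fixtures", "deps": []},
    {"id": "T2", "title": "Add negative paths", "deps": ["T1"]},
]
assert_ticket_schema(tickets)
```

The same pattern applies to the stream check: assert on the full expected payload (or a parsed structure), not a substring.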
### 2.4 Confirmed: No Negative Path Testing (HIGH)

GLM is correct. The entire test suite covers only the happy path. Missing:
- Rejection flows for all three dialog types (ConfirmDialog, MMAApprovalDialog, MMASpawnApprovalDialog)
- Malformed LLM response handling (bad JSON, missing fields, unexpected types)
- Network timeout/connection error to the Hook API during a live_gui test
- `shell_runner.run_powershell` timeout (60s) expiry path
- `mcp_client._resolve_and_check` returning an error (path outside allowlist)

**Severity: HIGH — confirmed.**
### 2.5 Confirmed: Arbitrary Poll Intervals Miss Transient States (MEDIUM)

GLM is correct. One-second polling in simulation loops will miss any state that exists for less than one second. The approval dialogs in particular may appear and be cleared within a single render frame if the engine is fast.

The `WorkflowSimulator.wait_for_ai_response()` method is the most critical polling target. It is the backbone of all extended simulation tests. If its polling strategy is wrong, the entire extended sim suite is unreliable.

**Severity: MEDIUM — confirmed.**
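One way to make transient states assertable is to poll at a short interval and record every observed state, rather than sampling once per second and discarding the history. This is an illustrative helper, not the project's existing API; `get_state` stands in for a Hook API status call:

```python
# Poll helper that returns the full history of observed states, so tests
# can assert that a sub-second transient (e.g. a pending-approval flag)
# actually appeared, instead of hoping a 1-second sample landed on it.
import time

def poll_history(get_state, predicate, timeout=5.0, interval=0.05):
    """Poll get_state() until predicate(state) holds or timeout expires;
    return every state observed along the way."""
    history = []
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        history.append(state)
        if predicate(state):
            return history
        time.sleep(interval)
    raise TimeoutError(f"predicate not satisfied after {timeout}s "
                       f"({len(history)} states observed)")
```

A test can then write `assert any(s == 'pending_approval' for s in history)` to verify the transient state occurred at all.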
### 2.6 Confirmed: Mock CLI Bypasses Real Subprocess Path (MEDIUM)

GLM is correct. Setting `gcli_path` to a Python script does not exercise:
- Real PATH resolution for the `gemini` binary
- Windows process group creation (`CREATE_NEW_PROCESS_GROUP`)
- Environment variable propagation to the subprocess
- `mcp_env.toml` path prepending (in `shell_runner._build_subprocess_env`)
- The `kill_process_tree` teardown path when the process hangs

**Severity: MEDIUM — confirmed.**
### 2.7 CORRECTION: "run_powershell is a Read-Only Tool"

**GLM is WRONG here.** In Part 8, GLM lists:

> "Read-Only Tools: run_powershell (via shell_runner.py)"

`run_powershell` executes arbitrary PowerShell scripts against the filesystem. It is the MOST dangerous tool in the set — it is not in `MUTATING_TOOLS` only because it is not an MCP filesystem tool; its approval gate is the `confirm_and_run_callback` (ConfirmDialog). Categorizing it as "read-only" is a factual error that could mislead future workers about the security model.
### 2.8 CORRECTION: "State Duplication Between App and AppController"

**GLM is outdated here.** The gui_decoupling track (`1bc4205`) was completed before this audit. `gui_2.App` now delegates all state through `AppController` via `__getattr__`/`__setattr__` proxies. There is no duplication — `App` is a thin ImGui rendering layer and `AppController` owns all state. GLM's concern is stale relative to the current codebase.
### 2.9 CORRECTION: "Priority 5 — Screenshot Comparison Infrastructure"

**This recommendation is architecturally inappropriate** for Dear PyGui/ImGui. These are immediate-mode renderers; there is no DOM or widget tree to interrogate. Pixel-level screenshot comparison requires platform-specific capture APIs (Windows Magnification, GDI) and is extremely fragile to font rendering, DPI, and GPU differences. The Hook API's logical state verification is the CORRECT and SUFFICIENT abstraction for this application. Adding screenshot comparison would be high cost, low value, and high flakiness.

The appropriate alternative (already partially in place via `hook_api_ui_state_verification_20260302`) is exposing more GUI state via the Hook API so tests can assert logical rendering state (is a panel visible? what is the modal title?) without pixels.
### 2.10 CORRECTION: Severity Table Has Duplicate and Conflicting Entries

The summary table in Part 9 lists identical items at multiple severity levels:
- "No concurrent access testing": appears as both HIGH and MEDIUM
- "No real-time latency simulation": appears as both MEDIUM and LOW
- "No human-like behavior": appears as both MEDIUM and LOW
- "Arbitrary polling intervals": appears as both MEDIUM and LOW

Additionally, Parts 10 and 11 are EXACTLY IDENTICAL — the cross-reference section was copy-pasted in full. This suggests the report was generated with insufficient self-review.
### 2.11 CONTEXTUAL DOWNGRADE: Human-Like Behavior / Latency Simulation

GLM spends substantial space on the absence of:
- Typing speed simulation
- Hesitation before actions
- Variable LLM latency

This is a **personal developer tool for a single user on a local machine**. These are aspirational concerns for a production SaaS simulation framework. For this product context, they are genuinely LOW priority. The simulation framework's job is to verify that the GUI state machine transitions correctly, not to simulate human psychology.

---
## Section 3: Issues GLM Missed

These are findings not present in GLM's report that carry meaningful risk.

### 3.1 CRITICAL: `live_gui` is Session-Scoped — Dirty State Across Tests

`conftest.py`'s `live_gui` fixture has `scope="session"`. This means ALL tests that use `live_gui` share a single running GUI process. If test A leaves the GUI in a state with an open modal dialog, test B will find the GUI unresponsive or in an unexpected state.

The teardown calls `client.reset_session()` (which clicks `btn_reset_session`), but this clears AI state and discussion history, not pending dialogs or MMA orchestration state. A test that triggers a spawn approval dialog and then fails before approving it will leave `_pending_mma_spawn` set, blocking the ENTIRE remaining test session.

**Severity: HIGH.** The current test-ordering dependency is invisible and fragile. Tests must not be run in arbitrary order.

**Fix:** Each `live_gui`-using test that touches MMA or approval flows should explicitly verify clean state at start:

```python
status = client.get_mma_status()
assert not status.get('pending_mma_spawn_approval'), "Previous test left GUI dirty"
```
### 3.2 HIGH: `app_instance` Fixture Tests Don't Test Rendering

The `app_instance` fixture mocks out all ImGui rendering. This means every test using `app_instance` (40+ tests) is testing Python object state, not rendered UI. Tests like:
- `test_app_has_render_token_budget_panel(app_instance)` — tests `hasattr()`, not that the panel renders
- `test_render_token_budget_panel_empty_stats_no_crash(app_instance)` — calls `_render_token_budget_panel()` in a context where all ImGui calls are no-ops

This creates a systematic false-positive class: a method can be completely broken (wrong data, missing widget calls) and the test passes because ImGui calls are silently ignored. The only tests with genuine rendering fidelity are the `live_gui` tests.

This is the root cause behind GLM's "state existence only" finding. It is not a test assertion weakness — it is a fixture architectural limitation.

**Severity: HIGH.** The implication: all `app_instance`-based rendering tests should be treated as "smoke tests that the method doesn't crash," not as "verification that the rendering is correct."

**Fix:** The `hook_api_ui_state_verification_20260302` track (adding `/api/gui/state`) is the correct path forward: expose render-visible state through the Hook API so `live_gui` tests can verify it.
### 3.3 HIGH: No Test for `ConfirmDialog.wait()` Infinite Block

`ConfirmDialog.wait()` uses `_condition.wait(timeout=0.1)` in a `while not self._done` loop. There is no outer timeout on this loop. If the GUI thread never signals the dialog (e.g., GUI crash after dialog creation, or a test that creates a dialog but doesn't render it), the asyncio worker thread hangs indefinitely.

This is particularly dangerous in the `run_worker_lifecycle` path:
1. Worker pushes dialog to event queue
2. GUI process crashes or freezes
3. `dialog.wait()` loops forever at 0.1s intervals
4. Test session hangs with no error output

There is no test verifying that `wait()` has a maximum wait time and raises an exception or returns a default (rejected) decision once that time expires.

**Severity: HIGH.**
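The missing outer deadline can be sketched against the same `_condition`/`_done` structure described above. This is a self-contained illustration of the fix, not the project's actual class; the 0.1s inner wait mirrors the loop quoted above:

```python
# Sketch: condition-variable dialog wait with the hard deadline the real
# ConfirmDialog.wait() lacks. If the GUI never signals, the worker gets a
# TimeoutError instead of hanging forever.
import threading
import time

class ConfirmDialogSketch:
    def __init__(self):
        self._condition = threading.Condition()
        self._done = False
        self.approved = False

    def signal(self, approved):
        # Called from the GUI thread when the user clicks approve/reject.
        with self._condition:
            self.approved = approved
            self._done = True
            self._condition.notify_all()

    def wait(self, max_wait=30.0):
        # Same 0.1s condition polling as the real loop, plus an outer
        # deadline so a dead GUI cannot block the worker thread forever.
        deadline = time.monotonic() + max_wait
        with self._condition:
            while not self._done:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError("dialog was never signalled")
                self._condition.wait(timeout=min(remaining, 0.1))
        return self.approved
```

Whether the deadline should raise or return a default-rejected decision is a design choice for the fix track; either ends the hang.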
### 3.4 MEDIUM: `mcp_client` Module State Persists Across Unit Tests

`mcp_client.configure()` sets module-level globals (`_allowed_paths`, `_base_dirs`, `_primary_base_dir`). Tests that call MCP tool functions directly without calling `configure()` first will use whatever state was left from the previous test. The `reset_ai_client` autouse fixture calls `ai_client.reset_session()` but does NOT reset `mcp_client` state.

Any test that calls `mcp_client.read_file()`, `mcp_client.py_get_skeleton()`, etc. directly (not through `ai_client.send()`) inherits the allowlist from the previous test run. This can cause false passes (path permitted by a previous test's allowlist) or false failures (path denied because `_base_dirs` is empty from a prior reset).

**Severity: MEDIUM.**
### 3.5 MEDIUM: `current_tier` Module Global — No Test for Concurrent Corruption

GLM mentions this as a "design concern." It is more specific than that: the `concurrent_tier_source_tier_20260302` track exists because `current_tier` in `ai_client.py` is a module-level `str | None`. When two Tier 3 workers run concurrently (a future feature), the second `send()` call will overwrite the first worker's tier tag.

What's missing: there is no test that verifies the CURRENT behavior is safe under single-threaded operation, and no test that demonstrates the failure mode under concurrent operation to serve as a regression baseline for the fix.

**Severity: MEDIUM.**
### 3.6 MEDIUM: `test_arch_boundary_phase2.py` Tests Config File, Not Runtime

The arch boundary tests verify that `manual_slop.toml` lists mutating tools as disabled by default. But the tests don't verify:
1. That `manual_slop.toml` is actually loaded into `ai_client._agent_tools` at startup
2. That `ai_client._agent_tools` is actually consulted before tool dispatch
3. That the TOML → runtime path works end-to-end

A developer could modify how tools are loaded without breaking these tests. The tests are static config audits, not runtime enforcement tests.

**Severity: MEDIUM.**
### 3.7 MEDIUM: `UserSimAgent.generate_response()` Calls `ai_client.send()` Directly

From `simulation/user_agent.py`: the `UserSimAgent` class imports `ai_client` and calls `ai_client.send()` to generate "human-like" responses. This means:
- Simulation tests have an implicit dependency on a configured LLM provider
- If run without an API key (e.g., in CI), simulations fail at the UserSimAgent level, not at the GUI level — making failures hard to diagnose
- The mock gemini_cli setup in tests does NOT redirect `ai_client.send()` in the TEST process (only in the GUI process via `gcli_path`), so UserSimAgent would attempt real API calls

No test documents whether UserSimAgent is actually exercised in the extended sims (`test_extended_sims.py`) or whether those sims use the ApiHookClient directly to drive the GUI.

**Severity: MEDIUM.**
### 3.8 LOW: Gemini CLI Tool-Call Protocol Not Exercised

The real Gemini CLI emits `{"type": "tool_use", "tool": {...}}` events mid-stream and then waits for `{"type": "tool_result", ...}` piped back on stdin. `mock_gemini_cli.py` does not emit any `tool_use` events; it only detects `'"role": "tool"'` in the prompt to simulate a post-tool-call turn.

This means `GeminiCliAdapter`'s tool-call parsing logic (the branch that handles `tool_use` event types and accumulates them) is NEVER exercised by any test. A regression in that parsing branch would be invisible to the test suite.

**Severity: LOW** (only relevant when the real gemini CLI is used with tools).
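The shape of the event the mock would need to emit can be sketched from the types quoted above. Field contents here are illustrative assumptions — the real gemini CLI's exact schema may differ and should be confirmed against a captured transcript before wiring this into the mock:

```python
# Illustrative shape of the tool_use event the mock never produces.
# The "tool" payload (name/args) is a hypothetical example, not a
# confirmed schema of the real CLI.
import json

def mock_tool_turn():
    """Emit one tool_use line, as a MOCK_MODE=tool_use variant might."""
    event = {
        "type": "tool_use",
        "tool": {"name": "read_file", "args": {"path": "src/ai_client.py"}},
    }
    return json.dumps(event)

line = mock_tool_turn()
parsed = json.loads(line)
assert parsed["type"] == "tool_use"
```

A mock in this mode would print the `tool_use` line, block on stdin for the `tool_result` line, and only then emit its final result event — exercising the adapter's accumulation branch.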
### 3.9 LOW: `reset_ai_client` Autouse Fixture Timing is Wrong for Async Tests

The `reset_ai_client` autouse fixture runs synchronously before each test. For tests marked `@pytest.mark.asyncio`, the reset happens BEFORE the test's async setup. If the async test itself triggers ai_client operations in setup (e.g., through an event loop created by the fixture), the reset may not capture all state mutations. This is an edge case but could explain intermittent behavior in async tests.

**Severity: LOW.**

---
## Section 4: Revised Severity Matrix

| Severity | Finding | GLM? | Source |
|---|---|---|---|
| **HIGH** | Mock provider has zero failure modes — all integration tests pass unconditionally | Confirmed | GLM |
| **HIGH** | `app_instance` fixture mocks ImGui — rendering tests are existence checks only | Missed | Claude |
| **HIGH** | `live_gui` session scope — dirty state from one test bleeds into the next | Missed | Claude |
| **HIGH** | `ConfirmDialog.wait()` has no outer timeout — worker thread can hang indefinitely | Missed | Claude |
| **HIGH** | Shallow assertions — substring match and length check only, no schema validation | Confirmed | GLM |
| **HIGH** | No negative path coverage — rejection flows, timeouts, malformed inputs untested | Confirmed | GLM |
| **MEDIUM** | Auto-approval never asserts dialog appeared before approving | Corrected | GLM/Claude |
| **MEDIUM** | `mcp_client` module state not reset between unit tests | Missed | Claude |
| **MEDIUM** | `current_tier` global — no test demonstrates safe single-thread or failure under concurrent use | Missed | Claude |
| **MEDIUM** | Arch boundary tests validate TOML config, not runtime enforcement | Missed | Claude |
| **MEDIUM** | `UserSimAgent` calls `ai_client.send()` directly — implicit real API dependency | Missed | Claude |
| **MEDIUM** | Arbitrary 1-second poll intervals miss sub-second transient states | Confirmed | GLM |
| **MEDIUM** | Mock CLI bypasses real subprocess spawning path | Confirmed | GLM |
| **LOW** | GeminiCliAdapter tool-use parsing branch never exercised by any test | Missed | Claude |
| **LOW** | `reset_ai_client` autouse timing may be incorrect for async tests | Missed | Claude |
| **LOW** | Variable latency / human-like simulation | Confirmed | GLM |

---
## Section 5: Prioritized Recommendations for Downstream Tracks

Listed in execution order, not importance order. Each maps to an existing or proposed track.

### Rec 1: Extend mock_gemini_cli with Failure Modes

**Target track:** New — `mock_provider_hardening_20260305`
**Files:** `tests/mock_gemini_cli.py`
**What:** Add a `MOCK_MODE` environment variable selector:
- `success` (current behavior, default)
- `malformed_json` — emit a truncated/corrupt JSON-L line
- `error_result` — emit `{"type": "result", "status": "error", ...}`
- `timeout` — sleep 90s to trigger the CLI timeout path
- `tool_use` — emit a real `tool_use` event to exercise GeminiCliAdapter parsing

Tests that need to verify error handling pass `MOCK_MODE=error_result` via `client.set_value()` before triggering the AI call.
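The selector can be sketched as below. Mode names match the list above; the response shapes mirror the events described in Section 2.1 but are otherwise illustrative (the `timeout` and `tool_use` modes are omitted for brevity):

```python
# Sketch of the proposed MOCK_MODE selector for the mock CLI. Response
# payloads are illustrative, not the mock's exact current output.
import json
import os

def mock_response_lines(mode):
    """Return the JSON-L lines the mock would emit for a given mode."""
    if mode == "success":
        return [
            json.dumps({"type": "message", "content": "SUCCESS: mock output"}),
            json.dumps({"type": "result", "status": "success"}),
        ]
    if mode == "error_result":
        return [json.dumps({"type": "result", "status": "error",
                            "error": "quota exceeded"})]
    if mode == "malformed_json":
        return ['{"type": "message", "content": "trunc']  # corrupt on purpose
    raise ValueError(f"unknown MOCK_MODE: {mode!r}")

if __name__ == "__main__":
    for line in mock_response_lines(os.environ.get("MOCK_MODE", "success")):
        print(line)
```

Keeping the mode logic in one pure function keeps each failure shape unit-testable without spawning the mock as a subprocess.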
### Rec 2: Add Dialog Assertion Before Auto-Approval

**Target track:** `test_suite_performance_and_flakiness_20260302` (already planned)
**Files:** All live_gui simulation tests, `tests/test_visual_sim_mma_v2.py`
**What:** Replace the conditional approval pattern:

```python
# BAD (current):
if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn')

# GOOD:
assert status.get('pending_mma_spawn_approval'), "Spawn dialog must appear before approve"
client.click('btn_approve_spawn')
```

Also add at least one test per dialog type that clicks reject and asserts the correct downstream state (engine marks track blocked, no worker spawned, etc.).
### Rec 3: Fix live_gui Session Scope Dirty State

**Target track:** `test_suite_performance_and_flakiness_20260302`
**Files:** `tests/conftest.py`
**What:** Add a per-test autouse fixture (function-scoped) that asserts clean GUI state before each `live_gui` test:

```python
@pytest.fixture(autouse=True)
def assert_gui_clean(live_gui):
    client = ApiHookClient()
    status = client.get_mma_status()
    assert not status.get('pending_mma_spawn_approval')
    assert not status.get('pending_mma_step_approval')
    assert not status.get('pending_tool_approval')
    assert status.get('mma_status') in ('idle', 'done', '')
```

This surfaces inter-test pollution immediately rather than causing a mysterious hang in a later test.
### Rec 4: Add ConfirmDialog Timeout Test

**Target track:** New — `mock_provider_hardening_20260305` (or `test_stabilization`)
**Files:** `tests/test_conductor_engine.py`
**What:** Add a test that creates a `ConfirmDialog`, never signals it, and verifies after N seconds that the background thread does NOT block indefinitely. This requires either a hard timeout on `wait()` or a documented contract that callers must signal the dialog within a finite window.
### Rec 5: Expose More State via Hook API

**Target track:** `hook_api_ui_state_verification_20260302` (already planned, HIGH priority)
**Files:** `src/api_hooks.py`
**What:** This track is the key enabler for replacing `app_instance` rendering tests with genuine state verification. The planned `/api/gui/state` endpoint should expose:
- Active modal type (`confirm_dialog`, `mma_step_approval`, `mma_spawn_approval`, `ask`, `none`)
- `ui_focus_agent` current filter value
- `_mma_status`, `_ai_status` text values
- Panel visibility flags

Once this is in place, the `app_instance` rendering tests can be migrated to `live_gui` equivalents that actually verify GUI-visible state.
### Rec 6: Add mcp_client Reset to the autouse Fixture

**Target track:** `test_suite_performance_and_flakiness_20260302`
**Files:** `tests/conftest.py`
**What:** Extend the `reset_ai_client` autouse fixture to also call `mcp_client.configure([], [])`, clearing the allowlist between tests. This prevents allowlist state from a previous test from leaking into the next.
### Rec 7: Add Runtime HITL Enforcement Test

**Target track:** `test_suite_performance_and_flakiness_20260302` or new
**Files:** `tests/test_arch_boundary_phase2.py`
**What:** Add an integration test (using `app_instance`) that:
1. Calls `ai_client.set_agent_tools({'set_file_slice': True})`
2. Confirms `mcp_client.MUTATING_TOOLS` contains `'set_file_slice'`
3. Triggers a dispatch of `set_file_slice`
4. Verifies `pre_tool_callback` was invoked BEFORE the write occurred

This closes the gap between "config says mutating tools are off" and "runtime actually gates them through the approval callback."
### Rec 8: Document `app_instance` Limitation in conftest

**Target track:** Any ongoing work — immediate, no track needed
**Files:** `tests/conftest.py`
**What:** Add a docstring to the `app_instance` fixture:

```python
"""
App instance with all ImGui rendering calls mocked to no-ops.
Use for unit tests of state logic and method existence.
DO NOT use to verify rendering correctness — use live_gui for that.
"""
```

This prevents future workers from writing rendering tests against this fixture and believing they have real coverage.

---
## Section 6: What the Existing Track Queue Gets Right

The `TASKS.md` strict execution queue is well-ordered for the test concerns:

1. `test_stabilization_20260302` → Must be first: asyncio lifecycle, mock-rot ban
2. `strict_static_analysis_and_typing_20260302` → Type safety before refactoring
3. `codebase_migration_20260302` → Already complete (commit 270f5f7)
4. `gui_decoupling_controller_20260302` → Already complete (commit 1bc4205)
5. `hook_api_ui_state_verification_20260302` → Critical enabler for real rendering tests
6. `robust_json_parsing_tech_lead_20260302` → Valid, but NOTE: the mock never produces malformed JSON, so the auto-retry loop cannot be verified without Rec 1 above
7. `concurrent_tier_source_tier_20260302` → Threading safety for future parallel workers
8. `test_suite_performance_and_flakiness_20260302` → Polling determinism, sleep elimination

The `test_architecture_integrity_audit_20260304` track (this track) sits logically between #1 and #5 — it provides the analytical basis for what #5 and #8 need to fix. The audit output (this document) should be read by the Tier 2 Tech Lead for both of those tracks.

The proposed new tracks (mock_provider_hardening, negative_path_testing) from GLM's recommendations are valid but should be created AFTER track #5 (`hook_api_ui_state_verification`) is complete, since they depend on the richer Hook API state to write meaningful assertions.

---
## Section 7: Architectural Observations Not in GLM's Report

### The Two-Tier Mock Problem

The test suite has two completely separate mock layers that do not know about each other:

**Layer 1** — `app_instance` fixture (in-process): Patches `immapp.run()`, `ai_client.send()`, and related functions with `unittest.mock`. Tests call methods directly. No network, no subprocess, no real threading.

**Layer 2** — `mock_gemini_cli.py` (out-of-process): A fake subprocess that the live GUI process calls through its own internal LLM pipeline. Tests drive this via `ApiHookClient` HTTP calls to the running GUI process.

These layers test completely different things. Layer 1 tests Python object invariants. Layer 2 tests the full application pipeline (threading, HTTP, IPC, process management). Most of the test suite is Layer 1; very few tests are Layer 2. The high-value tests are Layer 2 because they exercise the actual system, not a mock of it.

GLM correctly identifies that Layer 1 tests are of limited value for rendering verification but does not frame it as a two-layer architecture problem with a clear solution (expand Layer 2 via hook_api_ui_state_verification).
### The Simulation Framework's Actual Role
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The `simulation/` module is not (and should not be) a fidelity benchmark.
|
|
|
|
|
|
|
|
Its role is:
|
|
|
|
|
|
|
|
1. Drive the GUI through a sequence of interactions
|
|
|
|
|
|
|
|
2. Verify the GUI reaches expected states after each interaction
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The simulations (`sim_context.py`, `sim_ai_settings.py`, `sim_tools.py`,
|
|
|
|
|
|
|
|
`sim_execution.py`) are extremely thin wrappers. Their actual test value
|
|
|
|
|
|
|
|
comes from `test_extended_sims.py` which calls them against a live GUI and
|
|
|
|
|
|
|
|
verifies no exceptions are thrown. This is essentially a smoke test for the
|
|
|
|
|
|
|
|
GUI lifecycle, not a behavioral verification.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The real behavioral verification is in `test_visual_sim_mma_v2.py` and
|
|
|
|
|
|
|
|
similar files that assert specific state transitions. The simulation/
|
|
|
|
|
|
|
|
module should be understood as "workflow drivers," not "verification modules."
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
GLM's recommendation to add latency simulation and human-like behavior to
|
|
|
|
|
|
|
|
`simulation/user_agent.py` would add complexity to a layer that isn't the
|
|
|
|
|
|
|
|
bottleneck. The bottleneck is assertion depth in the polling loops, not
|
|
|
|
|
|
|
|
realism of the user actions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
---
*End of report. Next action: Tier 2 Tech Lead to read this alongside `plan.md` and initiate track #5 (`hook_api_ui_state_verification_20260302`) as the highest-leverage unblocking action.*