test(audit): fix critical test suite deadlocks and write exhaustive architectural report

- Fix 'Triple Bingo' history synchronization explosion during streaming

- Implement stateless event buffering in ApiHookClient to prevent dropped events

- Ensure 'tool_execution' events emit consistently across all LLM providers

- Add hard timeouts to all background thread wait() conditions

- Add thorough teardown cleanup to conftest.py's reset_ai_client fixture

- Write highly detailed report_gemini.md exposing asyncio lifecycle flaws
2026-03-05 01:42:47 -05:00
parent bfdbd43785
commit 35480a26dc
15 changed files with 715 additions and 481 deletions

# Test Architecture Integrity Audit — Gemini Review
**Author:** Gemini 2.5 Pro (Tier 2 Tech Lead)
**Review Date:** 2026-03-05
**Source Reports:** `report.md` (GLM-4.7) and `report_claude.md` (Claude Sonnet 4.6)
**Scope:** Exhaustive root-cause analysis of intermittent and full-suite test failures introduced by the GUI decoupling refactor.
---
## Executive Summary
This report serves as the definitive autopsy of the test suite instability observed following the completion of the `GUI Decoupling & Controller Architecture` track (`1bc4205`). While the decoupling successfully isolated the `AppController` state machine from the `gui_2.py` immediate-mode rendering loop, it inadvertently exposed and amplified several systemic flaws in the project's concurrency model, IPC (Inter-Process Communication) mechanisms, and test fixture isolation.
The symptoms—tests passing in isolation but hanging, deadlocking, or failing assertions when run as a full suite—are classic signatures of **state pollution** and **race conditions**.
This audit moves beyond the surface-level observations made by GLM-4.7 (which focused heavily on missing negative paths and mock fidelity) and Claude 4.6 (which correctly identified some scoping issues). This report details the exact mechanical failures within the threading models, event loops, and synchronization primitives that caused the build to break under load.
---
## Part 1: Deep Dive — The "Triple Bingo" History Synchronization Bug
**Symptom:** During extended simulations (`test_extended_sims.py`), the GUI would mysteriously hang, memory usage would spike, and tests would eventually time out.
**Root Cause Analysis:**
The architecture relies on an asynchronous event queue (`_api_event_queue`) and a synchronized task list (`_pending_gui_tasks`) to bridge the gap between the background AI processing threads and the main GUI rendering thread. When streaming was enabled for the Gemini CLI provider, a catastrophic feedback loop was created.
1. **The Streaming Accumulator Flaw:** In `AppController._handle_request_event`, the `stream_callback` was designed to push partial updates to the GUI. However, it was pushing the *entire accumulated response text* up to that point, not just the delta.
2. **The Unconditional History Append:** The `_process_pending_gui_tasks` loop running on the GUI thread received these "streaming..." events. Crucially, the logic failed to check if the turn was actually complete. For *every single chunk* received, it appended the full accumulated text to `_pending_history_adds`.
3. **Quadratic Growth:** If the AI responded with 100 chunks, the history array grew by `1 + 2 + 3 + ... + 100` strings (5,050 in total), producing a massive, quadratically growing list of duplicated, partially complete responses serialized into the project's discussion state. This caused the JSON/TOML parsers and ImGui text rendering to lock up entirely under the weight of megabytes of redundant text.
4. **Provider Inconsistency:** To compound this, the `GeminiCliAdapter` was manually emitting its own `history_add` events upon completion, meaning even without streaming, responses were being duplicated because both the controller and the adapter were trying to manage the discussion history.
**Architectural Lesson:** History state mutation must be strictly gated to terminal events (e.g., `status == 'done'`). Intermediate streaming states are purely ephemeral UI concerns and must never touch persistent data structures.
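The two fixes above can be sketched together: emit only the delta per chunk, and gate history mutation on the terminal event. This is a minimal illustration, not the project's actual `AppController` code; the class and field names (`StreamAccumulator`, `pending_history_adds`, a `status` key) are assumptions.

```python
# Sketch: delta-based streaming plus terminal-event gating.
# Names here are illustrative, not the project's real API.

class StreamAccumulator:
    """Tracks accumulated text so callbacks can emit only the new delta."""

    def __init__(self):
        self._accumulated = ""
        self.pending_history_adds = []

    def on_chunk(self, full_text_so_far: str) -> str:
        # The provider pushes the full accumulated text; emit only the
        # portion not yet sent, instead of re-sending the whole buffer.
        delta = full_text_so_far[len(self._accumulated):]
        self._accumulated = full_text_so_far
        return delta

    def on_event(self, event: dict):
        # History mutation is strictly gated to the terminal event.
        # Intermediate "streaming" statuses never touch persistent state.
        if event.get("status") == "done":
            self.pending_history_adds.append(self._accumulated)
            self._accumulated = ""
```

With this shape, 100 chunks append exactly one history entry, and the GUI receives 100 small deltas instead of 100 growing snapshots.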
---
## Part 2: Deep Dive — IPC and Event Polling Race Conditions
**Symptom:** Integration tests (like `test_gemini_cli_edge_cases.py` and `test_visual_sim_mma_v2.py`) that rely on `ApiHookClient.wait_for_event()` would sporadically time out waiting for an event (like `ask_received` or `script_confirmation_required`) that the server logs prove was successfully emitted.
**Root Cause Analysis:**
The testing framework relies heavily on polling the `/api/events` HTTP endpoint to coordinate test assertions with GUI state transitions.
1. **Destructive Reads:** The `get_events()` implementation in `ApiHookClient` made a GET request that triggered a lock-guarded drain of `_api_event_queue` on the server. This is a destructive read; once fetched, the server queue is cleared.
2. **Stateless Polling:** The original `wait_for_event` implementation fetched events, iterated through them looking for the target type, and if found, returned it. *However*, if the fetch returned multiple events (e.g., `['refresh_metrics', 'script_confirmation_required']`), and the test was only polling for `refresh_metrics`, the `script_confirmation_required` event was silently dropped on the floor because the client retained no state between polls.
3. **The Deadlock:** When the test advanced to the next step and called `wait_for_event('script_confirmation_required')`, it would hang until timeout because the event had already been consumed and discarded in the previous polling cycle.
**Architectural Lesson:** IPC clients designed for event polling must implement local, stateful event buffers (`_event_buffer`). Destructive reads from a server queue demand that the client takes full ownership of storing all received events until they are explicitly consumed by the test logic.
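A stateful `wait_for_event` can be sketched as follows. This is a simplified stand-in for `ApiHookClient`: the `fetch_fn` parameter models the destructive `get_events()` read, and the event-dict shape (`{"type": ...}`) is an assumption.

```python
# Sketch: a polling client that buffers every event it receives, so events
# fetched in earlier polls are never dropped. fetch_fn models the
# destructive server-side read.
import time

class BufferedEventClient:
    def __init__(self, fetch_fn):
        self._fetch = fetch_fn      # destructive read from the server queue
        self._event_buffer = []     # client-side ownership of all events

    def wait_for_event(self, event_type, timeout=10.0, poll_interval=0.1):
        deadline = time.monotonic() + timeout
        while True:
            # Always consume from the local buffer before polling again.
            for i, ev in enumerate(self._event_buffer):
                if ev.get("type") == event_type:
                    return self._event_buffer.pop(i)
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no {event_type!r} event within {timeout}s")
            # Buffer *everything* the destructive read returns, not just
            # the event we happen to be looking for right now.
            self._event_buffer.extend(self._fetch())
            time.sleep(poll_interval)
```

In the failure scenario from step 2, the `script_confirmation_required` event fetched alongside `refresh_metrics` now survives in `_event_buffer` until the test explicitly asks for it.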
---
## Part 3: Deep Dive — Asyncio Lifecycle & Threading Deadlocks
**Symptom:** The full test suite hangs around the 35% mark, often terminating with `RuntimeError: Event loop is closed` or simply freezing indefinitely.
**Root Cause Analysis:**
The `AppController` initializes its own internal `asyncio` loop running in a dedicated daemon thread (`_loop_thread`).
1. **Event Loop Exhaustion:** When `pytest` runs hundreds of tests, the `app_instance` fixture creates and tears down hundreds of `AppController` instances. `asyncio` is not designed to have hundreds of unmanaged loops created and abandoned within a single process space.
2. **Missing Teardown:** The `AppController` originally lacked a `shutdown()` method. When a test ended, the daemon thread remained alive, and the `asyncio` loop continued running. When Python's garbage collector eventually reclaimed the object, or when `pytest-asyncio` interfered with global loop policies, the background loops would violently close, throwing exceptions into the void and destabilizing the interpreter.
3. **Approval Dialog Deadlocks:** Tools like `run_powershell` and MMA worker spawns require user approval via `ConfirmDialog` or `MMASpawnApprovalDialog`. The AI background thread calls `.wait()` on a threading `Condition`. If the test fails to trigger the approval via the Hook API, or if the Hook API crashes due to the asyncio loop dying, the background thread waits *forever*. There were zero timeouts on the internal wait conditions.
**Architectural Lesson:**
1. Deterministic teardown of background threads and event loops is non-negotiable in test suites.
2. Threading primitives (`Event`, `Condition`) used for cross-thread synchronization must *always* employ outer timeouts to prevent catastrophic test suite hangs when state machines desynchronize.
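Both lessons can be shown in a few lines of stdlib code. This is a generic sketch, not `AppController` itself: the `LoopOwner` class and the approval-wait helper are illustrative assumptions about how the real teardown and dialog waits could be structured.

```python
# Sketch: deterministic teardown of a loop-owning daemon thread, and an
# approval wait that can never hang forever.
import asyncio
import threading

class LoopOwner:
    """Owns an asyncio loop on a dedicated daemon thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        asyncio.set_event_loop(self._loop)
        self._loop.run_forever()

    def shutdown(self, timeout=5.0):
        # Ask the loop to stop from its own thread, then join with a
        # hard timeout so a wedged loop cannot hang the test suite.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout)
        if not self._thread.is_alive():
            self._loop.close()

def wait_for_approval(cond: threading.Condition, predicate, timeout=120.0) -> bool:
    # Condition.wait_for returns False on timeout instead of blocking
    # forever, so a dead Hook API surfaces as a test failure, not a hang.
    with cond:
        return cond.wait_for(predicate, timeout=timeout)
```

An `app_instance` fixture teardown that calls `shutdown()` on every controller it created would prevent both the loop exhaustion in point 1 and the `Event loop is closed` crashes in point 2.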
---
## Part 4: Deep Dive — Phantom Hook Servers & Test State Pollution
**Symptom:** Tests utilizing the `live_gui` fixture sporadically fail with `ConnectionError`, or assertions fail because the test is interacting with a UI state left over from a completely different test file.
**Root Cause Analysis:**
The `live_gui` fixture spawns `sloppy.py` via `subprocess.Popen`. This process launches `api_hooks.HookServer` on `127.0.0.1:8999`.
1. **Zombie Processes:** If a test fails abruptly or a deadlock occurs (as detailed in Part 3), the `conftest.py` teardown block may fail to cleanly terminate the `Popen` instance.
2. **Port Hijacking:** The zombie `sloppy.py` process continues running in the background, holding port 8999 open.
3. **Cross-Test Telemetry Contamination:** When the *next* test in the suite executes, it spins up a new `ApiHookClient` pointing at port 8999. Because the zombie server is still listening, the client successfully connects—not to the fresh instance the fixture just spawned (which failed to bind to 8999), but to the dead test's GUI.
4. **In-Process Module Pollution:** For tests that don't use `live_gui` but instead mock `app_instance` in-process, global singletons like `ai_client` and `mcp_client` retain state. For example, if Test A modifies `mcp_client.MUTATING_TOOLS` or sets a specific `ai_client._provider`, Test B inherits those mutations unless strictly reset.
**Architectural Lesson:**
1. Subprocess test fixtures require aggressive, OS-level process tree termination (`taskkill /F /T` equivalent) to guarantee port release.
2. Global module state (`ai_client`, `mcp_client`, `events.EventEmitter`) mandates aggressive, explicit `autouse` fixtures that reset all variables and clear all pub/sub listeners between *every single test*.
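The first lesson can be sketched as a pre-flight port check plus tree-killing teardown. This is shown as a plain context manager rather than the real `conftest.py` fixture; the command, port constant, and helper names are assumptions, and a pytest fixture would wrap the same logic.

```python
# Sketch: assert the port is free *before* launching, and kill the whole
# process tree on teardown so the port is guaranteed to be released.
import contextlib
import socket
import subprocess
import sys

PORT = 8999

def port_is_free(port: int) -> bool:
    # connect_ex returns 0 only when something is accepting connections.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex(("127.0.0.1", port)) != 0

@contextlib.contextmanager
def live_gui_process(cmd=(sys.executable, "sloppy.py"), port=PORT):
    # Fail fast if a phantom server from a previous test still owns the
    # port, instead of silently talking to the dead test's GUI.
    assert port_is_free(port), f"phantom server still holds port {port}"
    proc = subprocess.Popen(list(cmd))
    try:
        yield proc
    finally:
        if sys.platform == "win32":
            # /T kills the whole process tree, /F forces termination.
            subprocess.run(["taskkill", "/F", "/T", "/PID", str(proc.pid)],
                           capture_output=True)
        else:
            proc.kill()
        proc.wait(timeout=10)
```

The up-front `port_is_free` assertion is what converts "mysterious cross-test contamination" into an immediate, attributable failure in the fixture itself.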
---
## Part 5: Evaluation of Prior Audits (GLM & Claude)
### Review of GLM-4.7's Report
GLM correctly identified the lack of negative path testing and the fact that `mock_gemini_cli.py` always returns success, which masks error-handling bugs. However, GLM's recommendation to entirely rewrite the testing framework to use custom `ContextManager` mocks is an over-correction. The core architecture is sound; the failure is in lifecycle management and polling discipline, not the inherent design of the event bus. GLM entirely missed the critical threading and IPC deadlocks that actually cause the suite to fail.
### Review of Claude 4.6's Report
Claude correctly identified that auto-approving dialogs without asserting their existence hides UX failures (a very astute observation). Claude also correctly identified that `mcp_client` state was bleeding between tests. However, Claude dismissed the `simulation/` framework as merely a "workflow driver," failing to recognize that the simulations *are* the integration tests that were exposing the deep `asyncio` and IPC deadlocks.
---
## Part 6: Prioritized Recommendations & Action Plan
To stabilize the CI/CD pipeline and restore confidence in the test suite, the following tracks must be executed in order.
### Priority 1: `test_stabilization_core_ipc_20260305` (URGENT)
**Goal:** Eradicate deadlocks, race conditions, and phantom processes.
**Tasks:**
1. **Stateful Event Buffering:** Implement `_event_buffer` in `ApiHookClient` to safely accumulate events between destructive server reads. (Partially addressed, requires full test coverage).
2. **Thread Deadlock Prevention:** Enforce a hard 120-second timeout on all `wait()` calls within `ConfirmDialog`, `MMAApprovalDialog`, and `MMASpawnApprovalDialog`.
3. **Aggressive Fixture Teardown:** Update `live_gui` in `conftest.py` to assert port 8999 is free *before* launching, and use robust process tree killing on teardown.
4. **Global State Sanitization:** Ensure `reset_ai_client` explicitly clears `events._listeners` and `mcp_client.configure([], [])`.
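The listener-clearing half of task 4 amounts to resetting a pub/sub registry in place. The `EventEmitter` below is a minimal stand-in for the project's `events` module, included only to show the pattern; the real fixture would additionally call `mcp_client.configure([], [])` and reset `ai_client`.

```python
# Sketch: pub/sub registry with an explicit reset, suitable for calling
# from an autouse fixture between every test. Stand-in for events.py.

class EventEmitter:
    def __init__(self):
        self._listeners = {}

    def on(self, name, fn):
        self._listeners.setdefault(name, []).append(fn)

    def reset(self):
        # Clear in place rather than rebinding, so any module that already
        # holds a reference to the dict sees the cleared state too.
        self._listeners.clear()
```

An `autouse=True` fixture whose teardown calls `emitter.reset()` guarantees Test B never fires callbacks that Test A registered.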
### Priority 2: `hook_api_ui_state_verification_20260302` (HIGH)
*This track was previously proposed but is now critical.*
**Goal:** Replace fragile log-parsing assertions with deterministic UI state queries.
**Tasks:**
1. Implement `GET /api/gui/state` in `HookHandler`.
2. Wire critical UI variables (`ui_focus_agent`, active modal titles, track status) into the `_settable_fields` dictionary to allow programmatic reading without pixels/screenshots.
3. Refactor `test_visual_sim_mma_v2.py` and `test_extended_sims.py` to use these new endpoints instead of brittle `time.sleep()` and string matching.
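A minimal shape for task 1, using only the stdlib `http.server` that a `HookServer` built on `BaseHTTPRequestHandler` could use. The `GUI_STATE` dict contents and the exact route are assumptions mirroring the field names listed above; the real handler would read live controller state instead of a module-level dict.

```python
# Sketch: a GET /api/gui/state endpoint returning queryable UI state as
# JSON, so tests assert on state instead of pixels or log strings.
import json
from http.server import BaseHTTPRequestHandler

# Illustrative state snapshot; the real server would read this from the
# controller's _settable_fields at request time.
GUI_STATE = {"ui_focus_agent": None, "active_modal": None, "track_status": "idle"}

class HookHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/gui/state":
            body = json.dumps(GUI_STATE).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep test output clean
```

A test then replaces `time.sleep()` with a poll of this endpoint until, say, `active_modal` equals the expected dialog title.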
### Priority 3: `mock_provider_hardening_20260305` (MEDIUM)
*Sourced from Claude 4.6's recommendations.*
**Goal:** Ensure error paths are exercised.
**Tasks:**
1. Add `MOCK_MODE` environment variable parsing to `mock_gemini_cli.py`.
2. Implement modes for `malformed_json`, `timeout`, and `error_result`.
3. Create a dedicated test file `test_negative_flows.py` to verify the GUI correctly displays error states and recovers without crashing when the AI provider fails.
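The mode dispatch from tasks 1 and 2 could look like the sketch below. The mode names follow this report; the response payload shape (`error`/`result` keys) and function name are illustrative assumptions about `mock_gemini_cli.py`.

```python
# Sketch: MOCK_MODE environment-variable dispatch for a mock provider CLI,
# so error paths are exercised instead of always returning success.
import json
import os
import time

def mock_response(prompt: str) -> str:
    mode = os.environ.get("MOCK_MODE", "success")
    if mode == "malformed_json":
        return '{"result": "unterminated'   # deliberately invalid JSON
    if mode == "timeout":
        time.sleep(3600)                    # caller's timeout must fire
    if mode == "error_result":
        return json.dumps({"error": "provider exploded", "result": None})
    return json.dumps({"error": None, "result": f"echo: {prompt}"})
```

`test_negative_flows.py` can then set `MOCK_MODE=malformed_json` in the subprocess environment and assert the GUI surfaces a parse error rather than crashing.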
### Priority 4: `asyncio_decoupling_refactor_20260306` (LOW - Architectural Debt)
**Goal:** Remove `asyncio` from the `AppController` entirely.
**Rationale:**
The internal use of `asyncio.Queue` and a dedicated event loop thread within `AppController` is over-engineered for a system that simply needs to pass dicts between a background worker and the GUI thread. Replacing it with a standard `queue.Queue` from the `queue` module will drastically simplify the architecture, eliminate `RuntimeError: Event loop is closed` bugs in tests, and make the application more robust.
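The proposed replacement is small enough to sketch in full. A worker thread `put`s dicts; the GUI frame loop drains with `get_nowait()` so rendering never blocks. Function names are illustrative, not the current `AppController` API.

```python
# Sketch: plain queue.Queue bridging a background worker and the GUI
# thread, with no asyncio loop or loop thread to manage.
import queue
import threading

gui_tasks: "queue.Queue[dict]" = queue.Queue()

def worker():
    # Background AI thread: thread-safe put, no event loop required.
    gui_tasks.put({"type": "history_add", "text": "done"})

def drain_pending_gui_tasks(handle):
    # Called once per GUI frame; get_nowait keeps the render loop
    # non-blocking even when the queue is empty.
    while True:
        try:
            task = gui_tasks.get_nowait()
        except queue.Empty:
            return
        handle(task)
```

Teardown also simplifies: there is no loop to stop or close, only threads to join, which removes the `Event loop is closed` failure class entirely.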
---
*End of Report.*