chore(conductor): Archive completed and deprecated tracks

- Moved codebase_migration_20260302 to archive

- Moved gui_decoupling_controller_20260302 to archive

- Moved test_architecture_integrity_audit_20260304 to archive

- Removed deprecated test_suite_performance_and_flakiness_20260302
2026-03-05 09:51:11 -05:00
parent c295db1630
commit d0e7743ef6
21 changed files with 0 additions and 83 deletions

@@ -1,5 +0,0 @@
# Track codebase_migration_20260302 Context
- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)

@@ -1,8 +0,0 @@
{
"track_id": "codebase_migration_20260302",
"type": "chore",
"status": "new",
"created_at": "2026-03-02T22:28:00Z",
"updated_at": "2026-03-02T22:28:00Z",
"description": "Move the codebase from the main directory to a src directory. Alleviate clutter by doing so. Remove files that are not used at all by the current application's implementation."
}

@@ -1,23 +0,0 @@
# Implementation Plan: Codebase Migration to `src` & Cleanup (codebase_migration_20260302)
## Status: COMPLETE [checkpoint: 92da972]
## Phase 1: Unused File Identification & Removal
- [x] Task: Initialize MMA Environment `activate_skill mma-orchestrator`
- [x] Task: Audit Codebase for Dead Files (1eb9d29)
- [x] Task: Delete Unused Files (1eb9d29)
- [-] Task: Conductor - User Manual Verification 'Phase 1: Unused File Identification & Removal' (SKIPPED)
## Phase 2: Directory Restructuring & Migration
- [x] Task: Create `src/` Directory
- [x] Task: Move Application Files to `src/`
- [x] Task: Conductor - User Manual Verification 'Phase 2: Directory Restructuring & Migration' (Checkpoint: 24f385e)
## Phase 3: Entry Point & Import Resolution
- [x] Task: Create `sloppy.py` Entry Point (c102392)
- [x] Task: Resolve Absolute and Relative Imports (c102392)
- [x] Task: Conductor - User Manual Verification 'Phase 3: Entry Point & Import Resolution' (Checkpoint: 24f385e)
## Phase 4: Final Validation & Documentation
- [x] Task: Full Test Suite Validation (ea5bb4e)
- [x] Task: Update Core Documentation (ea5bb4e)
- [x] Task: Conductor - User Manual Verification 'Phase 4: Final Validation & Documentation' (92da972)

@@ -1,33 +0,0 @@
# Track Specification: Codebase Migration to `src` & Cleanup (codebase_migration_20260302)
## Overview
This track focuses on restructuring the codebase to alleviate clutter by moving the main implementation files from the project root into a dedicated `src/` directory. Additionally, files that are completely unused by the current implementation will be automatically identified and removed. A new clean entry point (`sloppy.py`) will be created in the root directory.
## Functional Requirements
- **Directory Restructuring**:
- Move all active Python implementation files (e.g., `gui_2.py`, `ai_client.py`, `mcp_client.py`, `shell_runner.py`, `project_manager.py`, `events.py`, etc.) into a new `src/` directory.
- Update internal imports within all moved files to reflect their new locations or ensure the Python path resolves them correctly.
- **Root Directory Retention**:
- Keep configuration files (e.g., `config.toml`, `pyproject.toml`, `requirements.txt`, `.gitignore`) in the project root.
- Keep documentation files and directories (e.g., `Readme.md`, `BUILD.md`, `docs/`) in the project root.
- Keep the `tests/` and `simulation/` directories at the root level.
- **New Entry Point**:
- Create a new file `sloppy.py` in the root directory.
- `sloppy.py` will serve as the primary entry point for launching the application (bootstrapping the underlying `gui_2.py` logic, which will be moved into `src/`).
- **Dead Code/File Removal**:
- Automatically identify completely unused files and scripts in the project root (e.g., legacy files, unreferenced tools).
- Delete the identified unused files to clean up the repository.
## Non-Functional Requirements
- Ensure all automated tests (`tests/`) and simulations (`simulation/`) continue to function perfectly without `ModuleNotFoundError`s.
- `sloppy.py` must support existing CLI arguments (e.g., `--enable-test-hooks`).
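A minimal sketch of the entry-point plumbing these requirements imply, assuming the `src/` layout above and the `--enable-test-hooks` flag named in this spec (the real `sloppy.py` may wire things differently):

```python
import argparse
import sys
from pathlib import Path


def build_sys_path(project_root, existing=None):
    """Return a copy of sys.path with <project_root>/src prepended exactly once,
    so moved modules resolve without ModuleNotFoundError."""
    path = list(existing if existing is not None else sys.path)
    src = str(Path(project_root) / "src")
    if src not in path:
        path.insert(0, src)
    return path


def parse_args(argv):
    """Forward the existing CLI flags so the new entry point stays drop-in
    compatible with how gui_2.py was previously launched."""
    parser = argparse.ArgumentParser(prog="sloppy")
    parser.add_argument("--enable-test-hooks", action="store_true")
    return parser.parse_args(argv)
```

The key property is idempotence: calling `build_sys_path` twice must not stack duplicate `src/` entries.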
## Acceptance Criteria
- [ ] A `src/` directory exists and contains the main application logic.
- [ ] The root directory is clean, containing mainly configs, docs, `tests/`, `simulation/`, and `sloppy.py`.
- [ ] `sloppy.py` successfully launches the application.
- [ ] The full test suite runs and passes (i.e. all imports are correctly resolved).
- [ ] Obsolete/unused files have been successfully deleted from the repository.
## Out of Scope
- Complete refactoring of `gui_2.py` into a fully modular system (this track only moves it, though the move lays the groundwork for a future non-monolithic structure).

@@ -1,175 +0,0 @@
# Session Post-Mortem: 2026-03-04
## Track: GUI Decoupling & Controller Architecture
## Summary
Agent successfully fixed all test failures (345 passed, 0 skipped) but committed MULTIPLE critical violations of the conductor workflow and code style guidelines.
---
## CRITICAL VIOLATIONS
### 1. Edit Tool Destroys Indentation
**What happened:** The `Edit` tool automatically converts 1-space indentation to 4-space indentation.
**Evidence:**
```
git diff tests/conftest.py
# Entire file converted from 1-space to 4-space indentation
# 275 lines changed to 315 lines due to reformatting
```
**Root cause:** The Edit tool appears to apply Python auto-formatting (possibly Black or similar) that enforces 4-space indentation, completely ignoring the project's 1-space style.
**Impact:**
- Lost work when `git checkout` was needed to restore proper indentation
- Wasted time on multiple restore cycles
- User frustration
**Required fix in conductor/tooling:**
- Either disable auto-formatting in Edit tool
- Or add a post-edit validation step that rejects changes with wrong indentation
- Or mandate Python subprocess edits with explicit newline preservation
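The second option, a post-edit validation gate, could be a small heuristic like the sketch below. This is a hypothetical helper, not an existing tool: it flags edits that introduce indent widths absent from the original file (e.g. a 1-space file suddenly gaining 4-space lines):

```python
def indentation_widths(text):
    """Return the set of leading-space widths used by indented lines."""
    widths = set()
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        if stripped and len(line) != len(stripped):
            widths.add(len(line) - len(stripped))
    return widths


def edit_preserves_style(before, after):
    """Post-edit gate: reject the edit if it introduces indent widths that
    were absent before. Files with no indented lines are always accepted."""
    if not indentation_widths(before):
        return True
    return indentation_widths(after) <= indentation_widths(before)
```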
### 2. Did NOT Read Context Documents
**What happened:** Agent jumped straight to running tests without reading:
- `conductor/workflow.md`
- `conductor/tech-stack.md`
- `conductor/product.md`
- `docs/guide_architecture.md`
- `docs/guide_simulations.md`
**Evidence:** First action was `bash` command to run pytest, not reading context.
**Required fix in conductor/prompt:**
- Add explicit CHECKLIST at start of every session
- Block progress until context documents are confirmed read
- Add "context_loaded" state tracking
### 3. Did NOT Get Skeleton Outlines
**What happened:** Agent read full files instead of using skeleton tools.
**Evidence:** Used `read` on `conftest.py` (293 lines) instead of `py_get_skeleton`
**Required fix in conductor/prompt:**
- Enforce `py_get_skeleton` or `get_file_summary` before any `read` of files >50 lines
- Add validation that blocks `read` without prior skeleton call
### 4. Did NOT Delegate to Tier 3 Workers
**What happened:** Agent made direct code edits instead of delegating via Task tool.
**Evidence:** Used `edit` tool directly on `tests/conftest.py`, `tests/test_live_gui_integration.py`, `tests/test_gui2_performance.py`
**Required fix in conductor/prompt:**
- Add explicit check: "Is this a code implementation task? If YES, delegate to Tier 3"
- Block `edit` tool for code files unless explicitly authorized
### 5. Did NOT Follow TDD Protocol
**What happened:** No Red-Green-Refactor cycle. Just fixed code directly.
**Required fix in conductor/prompt:**
- Enforce "Write failing test FIRST" before any implementation
- Add test-first validation
---
## WORKAROUNDS THAT WORKED
### Python Subprocess Edits Preserve Indentation
```python
python -c "
with open('file.py', 'r', encoding='utf-8', newline='') as f:
content = f.read()
content = content.replace(old, new)
with open('file.py', 'w', encoding='utf-8', newline='') as f:
f.write(content)
"
```
This pattern preserved CRLF line endings and 1-space indentation.
---
## RECOMMENDED CHANGES TO CONDUCTOR FILES
### 1. workflow.md - Add Session Start Checklist
```markdown
## Session Start Checklist (MANDATORY)
Before ANY other action:
1. [ ] Read conductor/workflow.md
2. [ ] Read conductor/tech-stack.md
3. [ ] Read conductor/product.md
4. [ ] Read relevant docs/guide_*.md
5. [ ] Check TASKS.md for active tracks
6. [ ] Announce: "Context loaded, proceeding to [task]"
```
### 2. AGENTS.md - Add Edit Tool Warning
```markdown
## CRITICAL: Edit Tool Indentation Bug
The `Edit` tool DESTROYS 1-space indentation and converts to 4-space.
**NEVER use Edit tool directly on Python files.**
Instead, use Python subprocess:
\`\`\`python
python -c "..."
\`\`\`
Or use `py_update_definition` MCP tool.
```
### 3. workflow.md - Add Code Style Enforcement
```markdown
## Code Style (MANDATORY)
- **1-space indentation** for ALL Python code
- **CRLF line endings** on Windows
- Use `./scripts/ai_style_formatter.py` for formatting
- **NEVER** use Edit tool on Python files - it destroys indentation
- Use Python subprocess with `newline=''` to preserve line endings
```
### 4. conductor/prompt - Add Tool Restrictions
```markdown
## Tool Restrictions (TIER 2)
### ALLOWED Tools (Read-Only Research)
- read (for files <50 lines only)
- py_get_skeleton, py_get_code_outline, get_file_summary
- grep, glob
- bash (for git status, pytest --collect-only)
### FORBIDDEN Tools (Delegate to Tier 3)
- edit (on .py files - destroys indentation)
- write (on .py files)
- Any direct code modification
### Required Pattern
1. Research with skeleton tools
2. Draft surgical prompt with WHERE/WHAT/HOW/SAFETY
3. Delegate to Tier 3 via Task tool
4. Verify result
```
---
## FILES CHANGED THIS SESSION
| File | Change | Commit |
|------|--------|--------|
| tests/conftest.py | Add `temp_workspace.mkdir()` before file writes | 45b716f |
| tests/test_live_gui_integration.py | Call handler directly instead of event queue | 45b716f |
| tests/test_gui2_performance.py | Fix key mismatch (gui_2.py -> sloppy.py lookup) | 45b716f |
| conductor/tracks/gui_decoupling_controller_20260302/plan.md | Mark track complete | 704b9c8 |
---
## FINAL TEST RESULTS
```
345 passed, 0 skipped, 2 warnings in 205.94s
```
Track complete. All tests pass.

@@ -1,50 +0,0 @@
# Comprehensive Debrief: GUI Decoupling Track (Botched Implementation)
## 1. Track Overview
* **Track Name:** GUI Decoupling & Controller Architecture
* **Track ID:** `gui_decoupling_controller_20260302`
* **Primary Objective:** Decouple business logic from `gui_2.py` (3,500+ lines) into a headless `AppController`.
## 2. Phase-by-Phase Failure Analysis
### Phase 1: Controller Skeleton & State Migration
* **Status:** [x] Completed (with major issues)
* **What happened:** State variables (locks, paths, flags) were moved to `AppController`. `App` was given a `__getattr__` and `__setattr__` bridge to delegate to the controller.
* **Failure:** The delegation created a "Phantom State" problem. Sub-agents began treating the two objects as interchangeable, but they are not. Shadowing (where `App` has a variable that blocks `Controller`) became a silent bug source.
### Phase 2: Logic & Background Thread Migration
* **Status:** [x] Completed (with critical regressions)
* **What happened:** Async loops, AI client calls, and project I/O were moved to `AppController`.
* **Failure 1 (Over-deletion):** Tier 3 workers deleted essential UI-thread handlers from `App` (like `_handle_approve_script`). This broke button callbacks and crashed the app on startup.
* **Failure 2 (Thread Violation):** A "fallback queue processor" was added to the Controller thread. This caused two threads to race for the same event queue. If the Controller won, the UI never blinked/updated, causing simulation timeouts.
* **Failure 3 (Property Erasure):** During surgical cleanups in this high-reasoning session, the `current_provider` getter/setter in `AppController` was accidentally deleted while trying to remove a redundant method. `App` now attempts to delegate to a non-existent attribute, causing `AttributeError`.
### Phase 3: Test Suite Refactoring
* **Status:** [x] Completed (fragile)
* **What happened:** `conftest.py` was updated to patch `AppController` methods.
* **Failure:** The `live_gui` sandbox environment (isolated workspace) was broken because the Controller now eagerly checks for `credentials.toml` on startup. The previous agent tried to "fix" this by copying secrets into the sandbox, which is a security regression and fragile.
### Phase 4: Final Validation
* **Status:** [ ] FAILED
* **What happened:** Integration tests and extended simulations fail or timeout consistently.
* **Root Cause:** Broken synchronization between the Controller's background processing and the GUI's rendering loop. The "Brain" (Controller) and "Limb" (GUI) are disconnected.
## 3. Current Broken State of the Codebase
* **`src/gui_2.py`:** Contains rendering but is missing critical property logic. It still shadows core methods that should be purely in the controller.
* **`src/app_controller.py`:** Missing core properties (`current_provider`) and has broken `start_services` logic.
* **`tests/conftest.py`:** Has a messy `live_gui` fixture that uses environment variables (`SLOP_CREDENTIALS`, `SLOP_MCP_ENV`) but points to a sandbox that is missing the actual files.
* **`sloppy.py`:** The entry point works but the underlying classes are in a state of partial migration.
## 4. Immediate Recovery Plan (New Phase 5)
### Phase 5: Stabilization & Cleanup
1. **Task 5.1: AST Synchronization Audit.** Manually (via AST) compare `App` and `AppController`. Ensure every property needed for the UI exists in the Controller and is correctly delegated by `App`.
2. **Task 5.2: Restore Controller Properties.** Re-implement `current_provider` and `current_model` in `AppController` with proper logic (initializing adapters, clearing stats).
3. **Task 5.3: Explicit Delegation.** Remove the "magic" `__getattr__` and `__setattr__`. Replace them with explicit property pass-throughs. This will make `AttributeError` visible during static analysis rather than runtime.
4. **Task 5.4: Fix Sandbox Isolation.** Ensure `live_gui` fixture in `conftest.py` correctly handles `credentials.toml` via `SLOP_CREDENTIALS` env var pointing to the root, and ensure `sloppy.py` respects it.
5. **Task 5.5: Event Loop Consolidation.** Ensure there is EXACTLY ONE `asyncio` loop running, owned by the Controller, and that the GUI thread only reads from `_pending_gui_tasks`.
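Task 5.3's explicit pass-throughs might look like the following sketch. The real `App`/`AppController` surface is far larger (80+ properties per the plan), and the names here only mirror this debrief:

```python
# Hedged sketch of explicit delegation replacing magic __getattr__/__setattr__.
class AppController:
    """Owns the source-of-truth state."""
    def __init__(self):
        self.current_provider = "gemini"


class App:
    """Thin view: every delegated attribute is an explicit, statically
    analyzable property, so a missing controller attribute surfaces at
    definition time rather than as a runtime AttributeError."""
    def __init__(self, controller):
        self._controller = controller

    @property
    def current_provider(self):
        return self._controller.current_provider

    @current_provider.setter
    def current_provider(self, value):
        self._controller.current_provider = value
```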
## 5. Technical Context for Next Session
* **Encoding issues:** `temp_conftest.py` and other git-shipped files often have UTF-16 or different line endings. Use Python-based readers to bypass `read_file` failures.
* **Crucial Lines:** `src/gui_2.py` line 180-210 (Delegation) and `src/app_controller.py` line 460-500 (Event Processing) are the primary areas of failure.
* **Mocking:** All `patch` targets in `tests/` must now be audited to ensure they hit the Controller, not the App.

@@ -1,5 +0,0 @@
# Track gui_decoupling_controller_20260302 Context
- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)

@@ -1,8 +0,0 @@
{
"track_id": "gui_decoupling_controller_20260302",
"type": "refactor",
"status": "new",
"created_at": "2026-03-02T22:30:00Z",
"updated_at": "2026-03-02T22:30:00Z",
"description": "Extract the state machine and core lifecycle into a headless app_controller.py, leaving gui_2.py as a pure immediate-mode view."
}

@@ -1,37 +0,0 @@
# Implementation Plan: GUI Decoupling & Controller Architecture (gui_decoupling_controller_20260302)
## Status: COMPLETE [checkpoint: 45b716f]
## Phase 1: Controller Skeleton & State Migration
- [x] Task: Initialize MMA Environment `activate_skill mma-orchestrator` [d0009bb]
- [x] Task: Create `app_controller.py` Skeleton [d0009bb]
- [x] Task: Migrate Data State from GUI [d0009bb]
## Phase 2: Logic & Background Thread Migration
- [x] Task: Extract Background Threads & Event Queue [9260c7d]
- [x] Task: Extract I/O and AI Methods [9260c7d]
## Phase 3: Test Suite Refactoring
- [x] Task: Update `conftest.py` Fixtures [f2b2575]
- [x] Task: Resolve Broken GUI Tests [f2b2575]
## Phase 4: Final Validation
- [x] Task: Full Suite Validation & Warning Cleanup [45b716f]
- [x] WHERE: Project root
- [x] WHAT: `uv run pytest`
- [x] HOW: 345 passed, 0 skipped, 2 warnings
- [x] SAFETY: All tests pass
## Phase 5: Stabilization & Cleanup (RECOVERY)
- [x] Task: Task 5.1: AST Synchronization Audit [16d337e]
- [x] Task: Task 5.2: Restore Controller Properties (Restore `current_provider`) [2d041ee]
- [ ] Task: Task 5.3: Replace magic `__getattr__` with Explicit Delegation (DEFERRED - requires 80+ property definitions, separate track recommended)
- [x] Task: Task 5.4: Fix Sandbox Isolation logic in `conftest.py` [88aefc2]
- [x] Task: Task 5.5: Event Loop Consolidation & Single-Writer Sync [1b46534]
- [x] Task: Task 5.6: Fix `test_gui_provider_list_via_hooks` workspace creation [45b716f]
- [x] Task: Task 5.7: Fix `test_live_gui_integration` event loop issue [45b716f]
- [x] Task: Task 5.8: Fix `test_gui2_performance` key mismatch [45b716f]
- [x] WHERE: tests/test_gui2_performance.py:57-65
- [x] WHAT: Fix key mismatch - looked for "gui_2.py" but stored as full sloppy.py path
- [x] HOW: Use `next((k for k in _shared_metrics if "sloppy.py" in k), None)` to find key
- [x] SAFETY: Test-only change

@@ -1,21 +0,0 @@
# Track Specification: GUI Decoupling & Controller Architecture (gui_decoupling_controller_20260302)
## Overview
`gui_2.py` currently operates as a Monolithic God Object (3,500+ lines). It violates the Data-Oriented Design heuristic by owning complex business logic, orchestrator hooks, and markdown file building. This track extracts the core state machine and lifecycle into a headless `app_controller.py`, turning the GUI into a pure immediate-mode view.
## Architectural Constraints: The "Immediate Mode View" Contract
- **No Business Logic in View**: `gui_2.py` MUST NOT perform file I/O, AI API calls, or subprocess management directly.
- **State Ownership**: `app_controller.py` (or equivalent) owns the "Source of Truth" state.
- **Event-Driven Mutations**: The GUI must mutate state exclusively by dispatching events or calling controller methods, never by directly manipulating backend objects in the render loop.
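The contract above can be sketched as a single-writer event queue; the event name (`add_ticket`) and state shape here are illustrative, not taken from the codebase:

```python
from queue import Queue


class Controller:
    """Owns state; the view only enqueues events, never mutates directly."""
    def __init__(self):
        self.events = Queue()
        self.state = {"tickets": []}

    def dispatch(self, name, payload):
        """Called from the render loop -- no mutation happens here."""
        self.events.put((name, payload))

    def drain(self):
        """Called by the controller's own loop -- the single writer."""
        while not self.events.empty():
            name, payload = self.events.get()
            if name == "add_ticket":
                self.state["tickets"].append(payload)
```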
## Functional Requirements
- **Controller Extraction**: Create `app_controller.py` to handle all non-rendering logic.
- **State Migration**: Move state variables (`_tool_log`, `_comms_log`, `active_tickets`, etc.) out of `App.__init__` into the controller.
- **Logic Migration**: Move background threads, file reading/writing (`_flush_to_project`), and AI orchestrator invocations to the controller.
- **View Refactoring**: Refactor `gui_2.py` to accept the controller as a dependency and merely render its current state.
## Acceptance Criteria
- [ ] `app_controller.py` exists and owns the application state.
- [ ] `gui_2.py` has been reduced in size and complexity (no file I/O or AI calls).
- [ ] All existing features (chat, tools, tracks) function identically.
- [ ] The full test suite runs and passes against the new decoupled architecture.

@@ -1,3 +0,0 @@
# Test Architecture Integrity & Simulation Audit
[Specification](spec.md) | [Plan](plan.md)

@@ -1,9 +0,0 @@
{
"id": "test_architecture_integrity_audit_20260304",
"name": "Test Architecture Integrity & Simulation Audit",
"status": "planned",
"created_at": "2026-03-04T00:00:00Z",
"updated_at": "2026-03-04T00:00:00Z",
"type": "audit",
"severity": "high"
}

@@ -1,33 +0,0 @@
# Implementation Plan
## Phase 1: Documentation (Planning)
Focus: Create comprehensive audit documentation with severity ratings
- [ ] Task 1.1: Document all identified false positive risks with severity matrix
- [ ] Task 1.2: Document all simulation fidelity gaps with impact analysis
- [ ] Task 1.3: Create mapping of coverage gaps to test categories
- [ ] Task 1.4: Provide concrete false positive examples
- [ ] Task 1.5: Provide concrete simulation miss examples
- [ ] Task 1.6: Prioritize recommendations by impact/effort matrix
## Phase 2: Review & Validation (Research)
Focus: Peer review of audit findings
- [ ] Task 2.1: Review existing tracks for overlap with this audit
- [ ] Task 2.2: Validate severity ratings against actual bug history
- [ ] Task 2.3: Cross-reference findings with docs/guide_simulations.md contract
- [ ] Task 2.4: Identify which gaps should be addressed in which future track
## Phase 3: Track Finalization
Focus: Prepare for downstream implementation tracks
- [ ] Task 3.1: Create prioritized backlog of implementation recommendations
- [ ] Task 3.2: Map recommendations to appropriate future tracks
- [ ] Task 3.3: Document dependencies between this audit and subsequent work
## Phase 4: User Manual Verification (Protocol in workflow.md)
Focus: Human review of audit findings
- [ ] Task 4.1: Review severity matrix for accuracy
- [ ] Task 4.2: Validate concrete examples against real-world scenarios
- [ ] Task 4.3: Approve recommendations for implementation

@@ -1,562 +0,0 @@
# Test Architecture Integrity Audit — Claude Review
**Author:** Claude Sonnet 4.6 (Tier 1 Orchestrator)
**Review Date:** 2026-03-05
**Source Report:** report.md (authored by GLM-4.7, 2026-03-04)
**Scope:** Verify GLM's findings, correct errors, surface missed issues, produce actionable
recommendations for downstream tracks.
**Methodology:**
1. Read all 6 `docs/` architecture guides (guide_architecture, guide_simulations, guide_tools,
guide_mma, guide_meta_boundary, Readme)
2. Read GLM's full report.md
3. Read plan.md and spec.md for this track
4. Read py_get_skeleton for all 27 src/ modules
5. Read py_get_skeleton for conftest.py and representative test files
(test_extended_sims, test_live_gui_integration, test_dag_engine,
test_mma_orchestration_gui)
6. Read py_get_skeleton for all 9 simulation/ modules
7. Cross-referenced findings against JOURNAL.md, TASKS.md, and git history
---
## Section 1: Verdict on GLM's Report
GLM produced a competent surface-level audit. The structural inventory is
accurate and the broad categories of weakness (mock-rot, shallow assertions,
no negative paths) are valid. However, the report has material errors in
severity classification, contains two exact duplicate sections (Parts 10 and
11 are identical), and misses several issues that are more impactful than
the ones it flags at HIGH. It also makes recommendations that are
architecturally inappropriate for an ImGui immediate-mode application.
**Confirmed correct:** ~60% of findings
**Overstated or miscategorized:** ~25% of findings
**Missed entirely:** see Section 3
---
## Section 2: GLM Findings — Confirmed, Corrected, or Rejected
### 2.1 Confirmed: Mock Provider Never Fails (HIGH)
GLM is correct. `tests/mock_gemini_cli.py` has zero failure modes. The
keyword routing (`'"PATH: Epic Initialization"'`, `'"PATH: Sprint Planning"'`,
default) always produces a well-formed success response. No test using this
mock can ever exercise:
- Malformed or truncated JSON-L output
- Non-zero exit code from the CLI process
- A `{"type": "result", "status": "error", ...}` result event
- Rate-limit or quota responses
- Partial output followed by process crash
The `GeminiCliAdapter.send()` parses streaming JSON-L line-by-line. A
corrupted line (encoding error, mid-write crash) would throw a `json.JSONDecodeError`
that bubbles up through `_send_gemini_cli`. This path is entirely untested.
**Severity: HIGH — confirmed.**
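A failure-injecting variant of the mock could look like the sketch below. The mode names are illustrative and not part of the real `mock_gemini_cli.py`; the point is that a truncated JSON-L line surfaces as `json.JSONDecodeError`, the exact path the current mock can never exercise:

```python
import json


def mock_stream(mode="ok"):
    """Yield JSON-L lines; non-'ok' modes simulate the untested failures."""
    if mode == "ok":
        yield json.dumps({"type": "result", "status": "success"})
    elif mode == "truncated":
        yield '{"type": "result", "status": "succ'  # mid-write crash
    elif mode == "error":
        yield json.dumps({"type": "result", "status": "error",
                          "message": "quota exceeded"})


def parse_lines(lines):
    """Line-by-line JSON-L parse, as GeminiCliAdapter.send() is described
    doing; a corrupted line raises json.JSONDecodeError."""
    return [json.loads(line) for line in lines]
```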
### 2.2 Confirmed: Auto-Approval Hides Dialog Logic (MEDIUM, not HIGH)
GLM flags this as HIGH. The auto-approval pattern in polling loops is:
```python
if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn')
```
This is structurally correct for automated testing — you MUST auto-approve
to drive the pipeline. The actual bug is different from what GLM describes:
the tests never assert that the dialog appeared BEFORE approving. The
correct pattern is:
```python
assert status.get('pending_mma_spawn_approval'), "Spawn dialog never appeared"
client.click('btn_approve_spawn')
```
Without the assert, the test passes even if the dialog never fires (meaning
spawn approval is silently bypassed at the application level).
**Severity: MEDIUM (dialog verification gap, not approval mechanism itself).**
**GLM's proposed fix ("Remove auto-approval") is wrong.** Auto-approval is
required for unattended testing. The fix is to assert the flag is True
*before* clicking.
There is also zero testing of the rejection path: what happens when
`btn_reject_spawn` is clicked? Does the engine stop? Does it log an error?
Does the track reach "blocked" state? This is an untested state transition.
### 2.3 Confirmed: Assertions Are Shallow (HIGH)
GLM is correct. The two canonical examples from simulation tests:
```python
assert len(tickets) >= 2 # structure unknown
"SUCCESS: Mock Tier 3 worker" in streams[tier3_key] # substring only
```
Neither validates ticket schema, ID uniqueness, dependency correctness, or
that the stream content is actually the full response and not a truncated
fragment.
**Severity: HIGH — confirmed.**
### 2.4 Confirmed: No Negative Path Testing (HIGH)
GLM is correct. The entire test suite covers only the happy path. Missing:
- Rejection flows for all three dialog types (ConfirmDialog, MMAApprovalDialog,
MMASpawnApprovalDialog)
- Malformed LLM response handling (bad JSON, missing fields, unexpected types)
- Network timeout/connection error to Hook API during a live_gui test
- `shell_runner.run_powershell` timeout (60s) expiry path
- `mcp_client._resolve_and_check` returning an error (path outside allowlist)
**Severity: HIGH — confirmed.**
### 2.5 Confirmed: Arbitrary Poll Intervals Miss Transient States (MEDIUM)
GLM is correct. 1-second polling in simulation loops will miss any state
that exists for less than 1 second. The approval dialogs in particular may
appear and be cleared within a single render frame if the engine is fast.
The `WorkflowSimulator.wait_for_ai_response()` method is the most critical
polling target. It is the backbone of all extended simulation tests. If its
polling strategy is wrong, the entire extended sim suite is unreliable.
**Severity: MEDIUM — confirmed.**
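One mitigation is polling at a sub-frame interval with an explicit deadline. This is a generic sketch, not the existing `WorkflowSimulator` API:

```python
import time


def poll_until(predicate, timeout=10.0, interval=0.05):
    """Poll `predicate` every `interval` seconds until it is truthy or
    `timeout` elapses. A 50ms interval makes states that live for less
    than one second far less likely to be missed than 1s polling."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())  # one final check at the deadline
```

Note that even sub-frame polling cannot catch a state cleared within a single render frame; for those, the application must latch the transition (e.g. a "dialog was shown" counter exposed via the Hook API).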
### 2.6 Confirmed: Mock CLI Bypasses Real Subprocess Path (MEDIUM)
GLM is correct. Setting `gcli_path` to a Python script does not exercise:
- Real PATH resolution for the `gemini` binary
- Windows process group creation (`CREATE_NEW_PROCESS_GROUP`)
- Environment variable propagation to the subprocess
- `mcp_env.toml` path prepending (in `shell_runner._build_subprocess_env`)
- The `kill_process_tree` teardown path when the process hangs
**Severity: MEDIUM — confirmed.**
### 2.7 CORRECTION: "run_powershell is a Read-Only Tool"
**GLM is WRONG here.** In Part 8, GLM lists:
> "Read-Only Tools: run_powershell (via shell_runner.py)"
`run_powershell` executes arbitrary PowerShell scripts against the filesystem.
It is the MOST dangerous tool in the set — it is not in `MUTATING_TOOLS` only
because it is not an MCP filesystem tool; its approval gate is the
`confirm_and_run_callback` (ConfirmDialog). Categorizing it as "read-only"
is a factual error that could mislead future workers about the security model.
### 2.8 CORRECTION: "State Duplication Between App and AppController"
**GLM is outdated here.** The gui_decoupling track (`1bc4205`) was completed
before this audit. `gui_2.App` now delegates all state through `AppController`
via `__getattr__`/`__setattr__` proxies. There is no duplication — `App` is a
thin ImGui rendering layer, `AppController` owns all state. GLM's concern is
stale relative to the current codebase.
### 2.9 CORRECTION: "Priority 5 — Screenshot Comparison Infrastructure"
**This recommendation is architecturally inappropriate** for Dear PyGui/ImGui.
These are immediate-mode renderers; there is no DOM or widget tree to
interrogate. Pixel-level screenshot comparison requires platform-specific
capture APIs (Windows Magnification, GDI) and is extremely fragile to font
rendering, DPI, and GPU differences. The Hook API's logical state verification
is the CORRECT and SUFFICIENT abstraction for this application. Adding
screenshot comparison would be high cost, low value, and high flakiness.
The appropriate alternative (already partially in place via `hook_api_ui_state_verification_20260302`)
is exposing more GUI state via the Hook API so tests can assert logical
rendering state (is a panel visible? what is the modal title?) without pixels.
### 2.10 CORRECTION: Severity Table Has Duplicate and Conflicting Entries
The summary table in Part 9 lists identical items at multiple severity levels:
- "No concurrent access testing": appears as both HIGH and MEDIUM
- "No real-time latency simulation": appears as both MEDIUM and LOW
- "No human-like behavior": appears as both MEDIUM and LOW
- "Arbitrary polling intervals": appears as both MEDIUM and LOW
Additionally, Parts 10 and 11 are EXACTLY IDENTICAL — the cross-reference
section was copy-pasted in full. This suggests the report was generated with
insufficient self-review.
### 2.11 CONTEXTUAL DOWNGRADE: Human-Like Behavior / Latency Simulation
GLM spends substantial space on the absence of:
- Typing speed simulation
- Hesitation before actions
- Variable LLM latency
This is a **personal developer tool for a single user on a local machine**.
These are aspirational concerns for a production SaaS simulation framework.
For this product context, these are genuinely LOW priority. The simulation
framework's job is to verify that the GUI state machine transitions correctly,
not to simulate human psychology.
---
## Section 3: Issues GLM Missed
These are findings not present in GLM's report that carry meaningful risk.
### 3.1 CRITICAL: `live_gui` is Session-Scoped — Dirty State Across Tests
`conftest.py`'s `live_gui` fixture has `scope="session"`. This means ALL
tests that use `live_gui` share a single running GUI process. If test A
leaves the GUI in a state with an open modal dialog, test B will find the
GUI unresponsive or in an unexpected state.
The teardown calls `client.reset_session()` (which clicks `btn_reset_session`),
but this clears AI state and discussion history, not pending dialogs or
MMA orchestration state. A test that triggers a spawn approval dialog and
then fails before approving it will leave `_pending_mma_spawn` set, blocking
the ENTIRE remaining test session.
**Severity: HIGH.** The current test ordering dependency is invisible and
fragile. Tests must not be run in arbitrary order.
**Fix:** Each `live_gui`-using test that touches MMA or approval flows should
explicitly verify clean state at start:
```python
status = client.get_mma_status()
assert not status.get('pending_mma_spawn_approval'), "Previous test left GUI dirty"
```
### 3.2 HIGH: `app_instance` Fixture Tests Don't Test Rendering
The `app_instance` fixture mocks out all ImGui rendering. This means every
test using `app_instance` (approximately 40+ tests) is testing Python object
state, not rendered UI. Tests like:
- `test_app_has_render_token_budget_panel(app_instance)` — tests `hasattr()`,
not that the panel renders
- `test_render_token_budget_panel_empty_stats_no_crash(app_instance)` — calls
`_render_token_budget_panel()` in a context where all ImGui calls are no-ops
This creates a systematic false-positive class: a method can be completely
broken (wrong data, missing widget calls) and the test passes because ImGui
calls are silently ignored. The only tests with genuine rendering fidelity
are the `live_gui` tests.
This is the root cause behind GLM's "state existence only" finding. It is
not a test assertion weakness — it is a fixture architectural limitation.
**Severity: HIGH.** The implication: all `app_instance`-based rendering
tests should be treated as "smoke tests that the method doesn't crash,"
not as "verification that the rendering is correct."
**Fix:** The `hook_api_ui_state_verification_20260302` track (adding
`/api/gui/state`) is the correct path forward: expose render-visible state
through the Hook API so `live_gui` tests can verify it.
### 3.3 HIGH: No Test for `ConfirmDialog.wait()` Infinite Block
`ConfirmDialog.wait()` uses `_condition.wait(timeout=0.1)` in a `while not self._done` loop.
There is no outer timeout on this loop. If the GUI thread never signals the
dialog (e.g., GUI crash after dialog creation, or a test that creates a
dialog but doesn't render it), the asyncio worker thread hangs indefinitely.
This is particularly dangerous in the `run_worker_lifecycle` path:
1. Worker pushes dialog to event queue
2. GUI process crashes or freezes
3. `dialog.wait()` loops forever at 0.1s intervals
4. Test session hangs with no error output
There is no test verifying that `wait()` enforces a maximum wait time and
either raises an exception or returns a default (rejected) decision when
that window expires.
**Severity: HIGH.**
### 3.4 MEDIUM: `mcp_client` Module State Persists Across Unit Tests
`mcp_client.configure()` sets module-level globals (`_allowed_paths`,
`_base_dirs`, `_primary_base_dir`). Tests that call MCP tool functions
directly without calling `configure()` first will use whatever state was
left from the previous test. The `reset_ai_client` autouse fixture calls
`ai_client.reset_session()` but does NOT reset `mcp_client` state.
Any test that calls `mcp_client.read_file()`, `mcp_client.py_get_skeleton()`,
etc. directly (not through `ai_client.send()`) inherits the allowlist from
the previous test run. This can cause false passes (path permitted by
previous test's allowlist) or false failures (path denied because
`_base_dirs` is empty from a prior reset).
**Severity: MEDIUM.**
### 3.5 MEDIUM: `current_tier` Module Global — No Test for Concurrent Corruption
GLM mentions this as a "design concern." The problem is more concrete: the
`concurrent_tier_source_tier_20260302` track exists because `current_tier`
in `ai_client.py` is a module-level `str | None`. When two Tier 3 workers
run concurrently (future feature), the second `send()` call will overwrite
the first worker's tier tag.
What's missing: there is no test that verifies the CURRENT behavior is safe
under single-threaded operation, and no test that demonstrates the failure
mode under concurrent operation to serve as a regression baseline for the fix.
**Severity: MEDIUM.**
### 3.6 MEDIUM: `test_arch_boundary_phase2.py` Tests Config File, Not Runtime
The arch boundary tests verify that `manual_slop.toml` lists mutating tools
as disabled by default. But the tests don't verify:
1. That `manual_slop.toml` is actually loaded into `ai_client._agent_tools`
at startup
2. That `ai_client._agent_tools` is actually consulted before tool dispatch
3. That the TOML → runtime path is end-to-end
A developer could modify how tools are loaded without breaking these tests.
The tests are static config audits, not runtime enforcement tests.
**Severity: MEDIUM.**
### 3.7 MEDIUM: `UserSimAgent.generate_response()` Calls `ai_client.send()` Directly
From `simulation/user_agent.py`: the `UserSimAgent` class imports `ai_client`
and calls `ai_client.send()` to generate "human-like" responses. This means:
- Simulation tests have an implicit dependency on a configured LLM provider
- If run without an API key (e.g., in CI), simulations fail at the UserSimAgent
level, not at the GUI level — making failures hard to diagnose
- The mock gemini_cli setup in tests does NOT redirect `ai_client.send()` in
the TEST process (only in the GUI process via `gcli_path`), so UserSimAgent
would attempt real API calls
No test documents whether UserSimAgent is actually exercised in the extended
sims (`test_extended_sims.py`) or whether those sims use the ApiHookClient
directly to drive the GUI.
**Severity: MEDIUM.**
### 3.8 LOW: Gemini CLI Tool-Call Protocol Not Exercised
The real Gemini CLI emits `{"type": "tool_use", "tool": {...}}` events mid-stream
and then waits for `{"type": "tool_result", ...}` piped back on stdin. The
`mock_gemini_cli.py` does not emit any `tool_use` events; it only detects
`'"role": "tool"'` in the prompt to simulate a post-tool-call turn.
This means `GeminiCliAdapter`'s tool-call parsing logic (the branch that
handles `tool_use` event types and accumulates them) is NEVER exercised by
any test. A regression in that parsing branch would be invisible to the
test suite.
**Severity: LOW** (only relevant when the real gemini CLI is used with tools).
### 3.9 LOW: `reset_ai_client` Autouse Fixture Timing is Wrong for Async Tests
The `reset_ai_client` autouse fixture runs synchronously before each test.
For tests marked `@pytest.mark.asyncio`, the reset happens BEFORE the test's
async setup. If the async test itself triggers ai_client operations in setup
(e.g., through an event loop created by the fixture), the reset may not
capture all state mutations. This is an edge case but could explain
intermittent behavior in async tests.
**Severity: LOW.**
---
## Section 4: Revised Severity Matrix
| Severity | Finding | GLM? | Source |
|---|---|---|---|
| **HIGH** | Mock provider has zero failure modes — all integration tests pass unconditionally | Confirmed | GLM |
| **HIGH** | `app_instance` fixture mocks ImGui — rendering tests are existence checks only | Missed | Claude |
| **HIGH** | `live_gui` session scope — dirty state from one test bleeds into the next | Missed | Claude |
| **HIGH** | `ConfirmDialog.wait()` has no outer timeout — worker thread can hang indefinitely | Missed | Claude |
| **HIGH** | Shallow assertions — substring match and length check only, no schema validation | Confirmed | GLM |
| **HIGH** | No negative path coverage — rejection flows, timeouts, malformed inputs untested | Confirmed | GLM |
| **MEDIUM** | Auto-approval never asserts dialog appeared before approving | Corrected | GLM/Claude |
| **MEDIUM** | `mcp_client` module state not reset between unit tests | Missed | Claude |
| **MEDIUM** | `current_tier` global — no test demonstrates safe single-thread or failure under concurrent use | Missed | Claude |
| **MEDIUM** | Arch boundary tests validate TOML config, not runtime enforcement | Missed | Claude |
| **MEDIUM** | `UserSimAgent` calls `ai_client.send()` directly — implicit real API dependency | Missed | Claude |
| **MEDIUM** | Arbitrary 1-second poll intervals miss sub-second transient states | Confirmed | GLM |
| **MEDIUM** | Mock CLI bypasses real subprocess spawning path | Confirmed | GLM |
| **LOW** | GeminiCliAdapter tool-use parsing branch never exercised by any test | Missed | Claude |
| **LOW** | `reset_ai_client` autouse timing may be incorrect for async tests | Missed | Claude |
| **LOW** | Variable latency / human-like simulation | Confirmed | GLM |
---
## Section 5: Prioritized Recommendations for Downstream Tracks
Listed in execution order, not importance order. Each maps to an existing or
proposed track.
### Rec 1: Extend mock_gemini_cli with Failure Modes
**Target track:** New — `mock_provider_hardening_20260305`
**Files:** `tests/mock_gemini_cli.py`
**What:** Add a `MOCK_MODE` environment variable selector:
- `success` (current behavior, default)
- `malformed_json` — emit a truncated/corrupt JSON-L line
- `error_result` — emit `{"type": "result", "status": "error", ...}`
- `timeout` — sleep 90s to trigger the CLI timeout path
- `tool_use` — emit a real `tool_use` event to exercise GeminiCliAdapter parsing
Tests that need to verify error handling pass `MOCK_MODE=error_result` via
`client.set_value()` before triggering the AI call.
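A minimal sketch of the selector, assuming the event shapes listed above; the exact JSON-L field names of the real CLI protocol are not confirmed here, and the `render_output` helper is purely illustrative structure:

```python
import json
import os
import sys
import time

def render_output(mode: str) -> list[str]:
    """JSON-L lines the mock CLI would emit for each MOCK_MODE value."""
    if mode == "malformed_json":
        return ['{"type": "result", "status": ']  # deliberately truncated line
    if mode == "error_result":
        return [json.dumps({"type": "result", "status": "error",
                            "error": "mock provider failure"})]
    if mode == "tool_use":
        return [json.dumps({"type": "tool_use",
                            "tool": {"name": "read_file",
                                     "args": {"path": "demo.txt"}}})]
    # "success": current default behavior
    return [json.dumps({"type": "result", "status": "done",
                        "text": "mock response"})]

def main() -> None:
    mode = os.environ.get("MOCK_MODE", "success")
    if mode == "timeout":
        time.sleep(90)  # deliberately exceed the adapter's CLI timeout path
        return
    for line in render_output(mode):
        sys.stdout.write(line + "\n")
    sys.stdout.flush()

if __name__ == "__main__":
    main()
```

Keeping the mode dispatch in a pure function makes the mock's own output testable without spawning it as a subprocess.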
### Rec 2: Add Dialog Assertion Before Auto-Approval
**Target track:** `test_suite_performance_and_flakiness_20260302` (already planned)
**Files:** All live_gui simulation tests, `tests/test_visual_sim_mma_v2.py`
**What:** Replace the conditional approval pattern:
```python
# BAD (current):
if status.get('pending_mma_spawn_approval'): client.click('btn_approve_spawn')
# GOOD:
assert status.get('pending_mma_spawn_approval'), "Spawn dialog must appear before approve"
client.click('btn_approve_spawn')
```
Also add at least one test per dialog type that clicks reject and asserts the
correct downstream state (engine marks track blocked, no worker spawned, etc.).
### Rec 3: Fix live_gui Session Scope Dirty State
**Target track:** `test_suite_performance_and_flakiness_20260302`
**Files:** `tests/conftest.py`
**What:** Add a per-test autouse fixture (function-scoped) that asserts clean
GUI state before each `live_gui` test:
```python
@pytest.fixture(autouse=True)
def assert_gui_clean(live_gui):
client = ApiHookClient()
status = client.get_mma_status()
assert not status.get('pending_mma_spawn_approval')
assert not status.get('pending_mma_step_approval')
assert not status.get('pending_tool_approval')
assert status.get('mma_status') in ('idle', 'done', '')
```
This surfaces inter-test pollution immediately rather than causing a
mysterious hang in a later test.
### Rec 4: Add ConfirmDialog Timeout Test
**Target track:** New — `mock_provider_hardening_20260305` (or `test_stabilization`)
**Files:** `tests/test_conductor_engine.py`
**What:** Add a test that creates a `ConfirmDialog`, never signals it, and
verifies after N seconds that the background thread does NOT block indefinitely.
This requires either a hard timeout on `wait()` or a documented contract that
callers must signal the dialog within a finite window.
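A sketch of such a test, using a minimal stand-in dialog so it is self-contained; the `max_wait` parameter is the assumed contract under test, not the current implementation:

```python
import threading
import time

class ConfirmDialogSketch:
    """Stand-in mirroring the audited wait() loop, plus the outer
    timeout (max_wait) this test would enforce as a contract."""
    def __init__(self) -> None:
        self._condition = threading.Condition()
        self._done = False
        self._approved = False
        self._script = ""

    def wait(self, max_wait: float = 120.0) -> tuple[bool, str]:
        start = time.time()
        with self._condition:
            while not self._done:
                if time.time() - start > max_wait:
                    return False, self._script  # auto-reject, never hang
                self._condition.wait(timeout=0.1)
        return self._approved, self._script

def test_wait_is_bounded() -> None:
    dialog = ConfirmDialogSketch()  # never signaled: simulates a dead GUI
    start = time.time()
    approved, _ = dialog.wait(max_wait=0.5)
    assert approved is False          # default decision is rejection
    assert time.time() - start < 5.0  # bounded, not an infinite loop

test_wait_is_bounded()
```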
### Rec 5: Expose More State via Hook API
**Target track:** `hook_api_ui_state_verification_20260302` (already planned, HIGH priority)
**Files:** `src/api_hooks.py`
**What:** This track is the key enabler for replacing `app_instance` rendering
tests with genuine state verification. The planned `/api/gui/state` endpoint
should expose:
- Active modal type (`confirm_dialog`, `mma_step_approval`, `mma_spawn_approval`, `ask`, `none`)
- `ui_focus_agent` current filter value
- `_mma_status`, `_ai_status` text values
- Panel visibility flags
Once this is in place, the `app_instance` rendering tests can be migrated
to `live_gui` equivalents that actually verify GUI-visible state.
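For illustration, a hypothetical response payload and the style of assertion it enables; every key name here is an assumption, not the final API contract:

```python
# Hypothetical /api/gui/state payload covering the fields listed above.
example_state = {
    "active_modal": "mma_spawn_approval",  # confirm_dialog | mma_step_approval | ask | none
    "ui_focus_agent": "worker_3",
    "mma_status": "awaiting_approval",
    "ai_status": "idle",
    "panels": {"token_budget": True, "discussion": True},
}

def assert_modal(state: dict, expected: str) -> None:
    """The style of assertion a migrated live_gui test could make."""
    assert state.get("active_modal") == expected, state

assert_modal(example_state, "mma_spawn_approval")
```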
### Rec 6: Add mcp_client Reset to autouse Fixture
**Target track:** `test_suite_performance_and_flakiness_20260302`
**Files:** `tests/conftest.py`
**What:** Extend `reset_ai_client` autouse fixture to also call
`mcp_client.configure([], [])` to clear the allowlist between tests.
This prevents allowlist state from a previous test from leaking into the next.
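The reset semantics can be sketched with a stand-in for `mcp_client`'s module globals (names taken from Section 3.4); in the real suite, the two `configure([], [])` calls would live in the `reset_ai_client` fixture's setup and teardown:

```python
from types import SimpleNamespace

# Stand-in for mcp_client's module-level globals.
mcp_client = SimpleNamespace(_allowed_paths=[], _base_dirs=[],
                             _primary_base_dir=None)

def configure(allowed_paths: list, base_dirs: list) -> None:
    """Mirror of mcp_client.configure(): (re)sets the module allowlist."""
    mcp_client._allowed_paths = list(allowed_paths)
    mcp_client._base_dirs = list(base_dirs)
    mcp_client._primary_base_dir = base_dirs[0] if base_dirs else None

# A previous test leaves an allowlist behind...
configure(["/tmp/projA"], ["/tmp/projA"])
# ...and the extended fixture clears it between tests:
configure([], [])
assert mcp_client._allowed_paths == []
assert mcp_client._primary_base_dir is None
```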
### Rec 7: Add Runtime HITL Enforcement Test
**Target track:** `test_suite_performance_and_flakiness_20260302` or new
**Files:** `tests/test_arch_boundary_phase2.py`
**What:** Add an integration test (using `app_instance`) that:
1. Calls `ai_client.set_agent_tools({'set_file_slice': True})`
2. Confirms `mcp_client.MUTATING_TOOLS` contains `'set_file_slice'`
3. Triggers a dispatch of `set_file_slice`
4. Verifies `pre_tool_callback` was invoked BEFORE the write occurred
This closes the gap between "config says mutating tools are off" and
"runtime actually gates them through the approval callback."
### Rec 8: Document `app_instance` Limitation in conftest
**Target track:** Any ongoing work — immediate, no track needed
**Files:** `tests/conftest.py`
**What:** Add a docstring to `app_instance` fixture:
```python
"""
App instance with all ImGui rendering calls mocked to no-ops.
Use for unit tests of state logic and method existence.
DO NOT use to verify rendering correctness — use live_gui for that.
"""
```
This prevents future workers from writing rendering tests against this fixture
and believing they have real coverage.
---
## Section 6: What the Existing Track Queue Gets Right
The `TASKS.md` strict execution queue is well-ordered for the test concerns:
1. `test_stabilization_20260302` → Must be first: asyncio lifecycle, mock-rot ban
2. `strict_static_analysis_and_typing_20260302` → Type safety before refactoring
3. `codebase_migration_20260302` → Already complete (commit 270f5f7)
4. `gui_decoupling_controller_20260302` → Already complete (commit 1bc4205)
5. `hook_api_ui_state_verification_20260302` → Critical enabler for real rendering tests
6. `robust_json_parsing_tech_lead_20260302` → Valid, but NOTE: the mock never produces
malformed JSON, so the auto-retry loop cannot be verified without Rec 1 above
7. `concurrent_tier_source_tier_20260302` → Threading safety for future parallel workers
8. `test_suite_performance_and_flakiness_20260302` → Polling determinism, sleep elimination
The `test_architecture_integrity_audit_20260304` (this track) sits logically
between #1 and #5 — it provides the analytical basis for what #5 and #8 need
to fix. The audit output (this document) should be read by the Tier 2 Tech Lead
for both those tracks.
The proposed new tracks (mock_provider_hardening, negative_path_testing) from
GLM's recommendations are valid but should be created AFTER track #5
(`hook_api_ui_state_verification`) is complete, since they depend on the
richer Hook API state to write meaningful assertions.
---
## Section 7: Architectural Observations Not in GLM's Report
### The Two-Tier Mock Problem
The test suite has two completely separate mock layers that do not know about
each other:
**Layer 1** (`app_instance` fixture, in-process): Patches `immapp.run()`,
`ai_client.send()`, and related functions with `unittest.mock`. Tests call
methods directly. No network, no subprocess, no real threading.

**Layer 2** (`mock_gemini_cli.py`, out-of-process): A fake subprocess that
the live GUI process calls through its own internal LLM pipeline. Tests drive
this via `ApiHookClient` HTTP calls to the running GUI process.
These layers test completely different things. Layer 1 tests Python object
invariants. Layer 2 tests the full application pipeline (threading, HTTP, IPC,
process management). Most of the test suite is Layer 1. Very few tests are
Layer 2. The high-value tests are Layer 2 because they exercise the actual
system, not a mock of it.
GLM correctly identifies that Layer 1 tests are of limited value for
rendering verification but does not frame it as a two-layer architecture
problem with a clear solution (expand Layer 2 via hook_api_ui_state_verification).
### The Simulation Framework's Actual Role
The `simulation/` module is not (and should not be) a fidelity benchmark.
Its role is:
1. Drive the GUI through a sequence of interactions
2. Verify the GUI reaches expected states after each interaction
The simulations (`sim_context.py`, `sim_ai_settings.py`, `sim_tools.py`,
`sim_execution.py`) are extremely thin wrappers. Their actual test value
comes from `test_extended_sims.py` which calls them against a live GUI and
verifies no exceptions are thrown. This is essentially a smoke test for the
GUI lifecycle, not a behavioral verification.
The real behavioral verification is in `test_visual_sim_mma_v2.py` and
similar files that assert specific state transitions. The simulation/
module should be understood as "workflow drivers," not "verification modules."
GLM's recommendation to add latency simulation and human-like behavior to
`simulation/user_agent.py` would add complexity to a layer that isn't the
bottleneck. The bottleneck is assertion depth in the polling loops, not
realism of the user actions.
---
*End of report. Next action: Tier 2 Tech Lead to read this alongside
`plan.md` and initiate track #5 (`hook_api_ui_state_verification_20260302`)
as the highest-leverage unblocking action.*

# Test Architecture Integrity Audit — Gemini Review (Exhaustive Edition)
**Author:** Gemini 2.5 Pro (Tier 2 Tech Lead)
**Review Date:** 2026-03-05
**Source Reports:** `report.md` (GLM-4.7) and `report_claude.md` (Claude Sonnet 4.6)
**Scope:** Exhaustive root-cause analysis of intermittent and full-suite test failures introduced by the GUI decoupling refactor, with deep mechanical traces.
---
## 1. Executive Summary
This report serves as the definitive, exhaustive autopsy of the test suite instability observed following the completion of the `GUI Decoupling & Controller Architecture` track (`1bc4205`). While the decoupling successfully isolated the `AppController` state machine from the `gui_2.py` immediate-mode rendering loop, it inadvertently exposed and amplified several systemic flaws in the project's concurrency model, IPC (Inter-Process Communication) mechanisms, and test fixture isolation.
The symptoms—tests passing in isolation but hanging, deadlocking, or failing assertions when run as a full suite—are classic signatures of **state pollution** and **race conditions**.
This audit moves far beyond the surface-level observations made by GLM-4.7 (which focused heavily on missing negative paths and mock fidelity) and Claude 4.6 (which correctly identified some scoping issues). This report details the exact mechanical failures within the threading models, event loops, and synchronization primitives that caused the build to break under load. It provides code-level proofs, temporal sequence analyses, and strict architectural redesign requirements to ensure the robustness of future tracks.
---
## 2. Methodology & Discovery Process
To uncover these deep-seated concurrency and state issues, standard unit testing was insufficient. The methodology required stress-testing the architecture under full suite execution, capturing process dumps, and tracing the precise temporal relationships between thread execution.
### 2.1 The Execution Protocol
1. **Full Suite Execution Observation:** I repeatedly executed `uv run pytest --maxfail=10 -k "not performance and not stress"`. The suite consistently hung around the 35-40% mark, typically during `tests/test_extended_sims.py`, `tests/test_gemini_cli_edge_cases.py`, or `tests/test_conductor_api_hook_integration.py`.
2. **Targeted Re-execution (The Isolation Test):** Running the failing tests in isolation (`uv run pytest tests/test_extended_sims.py -v -s`) resulted in **100% PASSING** tests. This is the hallmark of non-deterministic state bleed. It immediately ruled out logical errors in the test logic itself and pointed definitively to **Inter-Test State Pollution** or **Resource Exhaustion**.
3. **Sequential Execution Analysis:** By running tests in specific chronological pairs (e.g., `uv run pytest tests/test_gemini_cli_edge_cases.py tests/test_extended_sims.py`), I was able to reliably reproduce the hang outside of the full suite context. This dramatically narrowed the search space.
4. **Log Tracing & Telemetry Injection:** I injected massive amounts of `sys.stderr.write` traces into the `_process_event_queue`, `_confirm_and_run`, `_handle_generate_send`, and `ApiHookClient` polling loops to track thread lifecycles, memory boundaries, and event propagation across the IPC boundary.
5. **Root Cause Isolation:** The traces revealed not one, but three distinct, catastrophic failure modes occurring simultaneously, which I have categorized below.
---
## 3. Deep Dive I: The "Triple Bingo" History Synchronization Bug
### 3.1 The Symptom
During extended simulations (specifically `test_context_sim_live` and `test_execution_sim_live`), the GUI process (`sloppy.py`) would mysteriously hang. CPU utilization on the rendering thread would hit 100%, memory usage would spike dramatically, and the test client would eventually time out after 60+ seconds of polling for a terminal AI response.
### 3.2 The Mechanism of Failure
The architecture of `Manual Slop` relies on an asynchronous event queue (`_api_event_queue`) and a synchronized task list (`_pending_gui_tasks`) to bridge the gap between the background AI processing threads (which handle network I/O and subprocess execution) and the main GUI rendering thread (which must remain lock-free to maintain 60 FPS).
When streaming was enabled for the Gemini CLI provider to improve UX latency, a catastrophic feedback loop was created.
#### 3.2.1 The Streaming Accumulator Flaw
In `AppController._handle_request_event`, the `stream_callback` was designed to push partial string updates to the GUI so the user could see the AI typing in real-time.
```python
# The original flawed callback inside _handle_request_event
try:
resp = ai_client.send(
event.stable_md,
event.prompt,
# ...
stream=True,
stream_callback=lambda text: self._on_ai_stream(text), # <--- THE CATALYST
# ...
)
```
However, the underlying AI providers (specifically `GeminiCliAdapter`) were returning the *entire accumulated response text* up to that point on every tick, not just the newly generated characters (the delta).
#### 3.2.2 The Unconditional History Append (O(N^2) Explosion)
The `_process_pending_gui_tasks` loop, running on the 60-FPS GUI thread, received these continuous "streaming..." events via the `handle_ai_response` action tag. Crucially, the controller logic failed to check if the AI's turn was actually complete (i.e., `status == 'done'`) before committing the payload to persistent storage.
```python
# Flawed AppController logic (Pre-Remediation)
elif action == "handle_ai_response":
payload = task.get("payload", {})
text = payload.get("text", "")
is_streaming = payload.get("status") == "streaming..."
# ... [Redacted: Code that updates self.ai_response] ...
# CRITICAL FLAW: Appends to memory on EVERY SINGLE CHUNK
if self.ui_auto_add_history and not stream_id:
role = payload.get("role", "AI")
with self._pending_history_adds_lock:
self._pending_history_adds.append({
"role": role,
"content": self.ai_response, # <--- The full accumulated text
"collapsed": False,
"ts": project_manager.now_ts()
})
```
**The Mathematical Impact:**
Assume the AI generates a final response of `T` total characters, delivered in `N` discrete streaming chunks.
- Chunk 1: Length `T/N`. History grows by `T/N`.
- Chunk 2: Length `2T/N`. History grows by `2T/N`.
- Chunk N: Length `T`. History grows by `T`.
Total characters stored in memory for a single message = `O(N * T)`.
If a 2000-character script is streamed in 100 chunks, the `_pending_history_adds` array contains 100 entries, consuming roughly 100,000 characters of memory for a 2,000 character output.
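A quick numeric check of that accounting, using the 2,000-character / 100-chunk example:

```python
# T-character response streamed in N chunks, with the FULL accumulated
# text appended to history on every chunk (the pre-remediation behavior).
T, N = 2000, 100
chunk_lengths = [T * k // N for k in range(1, N + 1)]  # k-th chunk carries k*T/N chars
total_stored = sum(chunk_lengths)

assert len(chunk_lengths) == N             # 100 redundant history entries
assert total_stored == T * (N + 1) // 2    # 101,000 chars for a 2,000-char reply
```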
#### 3.2.3 The TOML Serialization Lockup
When `_process_pending_history_adds` executed on the next frame, it flushed these hundreds of duplicated, massive string entries into the active discussion dictionary.
```python
# This runs on the GUI thread
if "history" not in disc_data:
disc_data["history"] = []
disc_data["history"].append(project_manager.entry_to_str(item))
```
This rapid mutation triggered the `App` to flag the project state as dirty, invoking `_save_active_project()`. The `tomli_w` parser was then forced to serialize megabytes of redundant, malformed text synchronously. This completely locked the main Python thread (holding the GIL hostage), dropping the application frame rate to 0, preventing the hook server from responding to HTTP requests, and causing the `pytest` simulator to time out.
#### 3.2.4 Provider Inconsistency (The Third Bingo)
To compound this architectural disaster, the `GeminiCliAdapter` was violating the separation of concerns by manually emitting its own `history_add` events upon completion:
```python
# Old GeminiCliAdapter logic (Pre-Remediation)
if "text" in res:
# A backend client modifying frontend state directly!
_append_comms("IN", "history_add", {"role": "AI", "content": res["text"]})
```
This meant even if streaming was disabled, responses were being duplicated because both the controller (via `ui_auto_add_history`) and the adapter were competing to push arrays into the discussion history.
### 3.3 The Implemented Resolution
1. **Strict Gated Appends:** Modified `AppController` to strictly gate history serialization. It now checks `if not is_streaming:`. Intermediate streaming states are treated correctly as purely ephemeral UI state variables (`self.ai_response`), not persistent data records.
2. **Adapter Responsibility Stripping:** Removed `history_add` emission responsibilities from all AI adapters. History management is strictly an `AppController` domain concern. The adapters are now pure functions that map prompts to vendor APIs and return raw strings or tool schemas.
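The gated append in (1) can be sketched with a stand-in controller; the field names follow the flawed snippet above, and the surrounding `AppController` machinery is elided:

```python
import threading

class ControllerSketch:
    """Minimal stand-in for AppController showing the gated history append."""
    def __init__(self) -> None:
        self.ai_response = ""
        self.ui_auto_add_history = True
        self._pending_history_adds: list[dict] = []
        self._lock = threading.Lock()

    def handle_ai_response(self, payload: dict) -> None:
        self.ai_response = payload.get("text", "")  # ephemeral UI state
        if payload.get("status") == "streaming...":
            return  # intermediate chunks are never persisted
        if self.ui_auto_add_history:
            with self._lock:
                self._pending_history_adds.append(
                    {"role": payload.get("role", "AI"),
                     "content": self.ai_response}
                )

c = ControllerSketch()
for k in range(1, 5):
    c.handle_ai_response({"text": "x" * k, "status": "streaming..."})
c.handle_ai_response({"text": "x" * 5, "status": "done"})
assert len(c._pending_history_adds) == 1  # exactly one entry, not one per chunk
assert c._pending_history_adds[0]["content"] == "xxxxx"
```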
---
## 4. Deep Dive II: IPC and Event Polling Race Conditions
### 4.1 The Symptom
Integration tests relying on the Hook API (e.g., `test_visual_sim_mma_v2.py`) would sporadically hang while executing `client.wait_for_event('script_confirmation_required')` or `client.wait_for_event('ask_received')`. The server logs definitively proved the GUI had reached the correct state and emitted the event to the queue, but the test script acted as if it never arrived, eventually failing with an HTTP 504 Timeout or an assertion error.
### 4.2 The Mechanism of Failure
The testing framework uses high-frequency HTTP polling against the `/api/events` endpoint to coordinate test assertions with background GUI state transitions.
#### 4.2.1 Destructive Server Reads
The `get_events()` implementation in `HookHandler.do_GET` performed a destructive read (a pop operation):
```python
# api_hooks.py (Server Side)
elif self.path == "/api/events":
# ...
if lock:
with lock:
events = list(queue)
queue.clear() # <--- DESTRUCTIVE READ: ALL events are wiped.
self.wfile.write(json.dumps({"events": events}).encode("utf-8"))
```
Once a client fetched the `/api/events` payload, those events were permanently wiped from the application's memory.
#### 4.2.2 Stateless Client Polling
The original `wait_for_event` implementation in `ApiHookClient` was completely stateless. It did not remember what it saw in previous polls.
```python
# Old ApiHookClient logic (Flawed)
def wait_for_event(self, event_type: str, timeout: float = 5):
start = time.time()
while time.time() - start < timeout:
events = self.get_events() # Fetches AND clears the server queue
for ev in events:
if ev.get("type") == event_type:
return ev
time.sleep(0.1)
return None
```
#### 4.2.3 The Race Condition Timeline (The Silent Drop)
Consider a scenario where the GUI rapidly emits two distinct events in a single tick: `['refresh_metrics', 'script_confirmation_required']`.
1. **T=0.0s:** The Test script calls `client.wait_for_event('refresh_metrics')`.
2. **T=0.1s:** `ApiHookClient` calls `GET /api/events`. It receives `['refresh_metrics', 'script_confirmation_required']`. The server queue is now EMPTY.
3. **T=0.1s:** `ApiHookClient` iterates the array. It finds `refresh_metrics`. It returns it to the test script.
4. **THE FATAL FLAW:** The `script_confirmation_required` event, which was also in the payload, is attached to a local variable (`events`) that is immediately garbage collected when the function returns. The event is **silently discarded**.
5. **T=0.5s:** The Test script advances to the next block of logic and calls `client.wait_for_event('script_confirmation_required')`.
6. **T=0.6s to T=5.0s:** `ApiHookClient` repeatedly polls `GET /api/events`. The server queue remains empty.
7. **T=5.0s:** The Test script fails with a Timeout Error, leaving the developer confused because the GUI logs explicitly say the script confirmation was requested.
### 4.3 The Implemented Resolution
Transformed the `ApiHookClient` from a stateless HTTP wrapper into a stateful event consumer by implementing an internal `_event_buffer`.
```python
# Fixed ApiHookClient
def get_events(self) -> list[Any]:
res = self._make_request("GET", "/api/events")
new_events = res.get("events", []) if res else []
self._event_buffer.extend(new_events) # Accumulate safely
return list(self._event_buffer)
def wait_for_event(self, event_type: str, timeout: float = 5):
    start = time.time()
    while time.time() - start < timeout:
        self.get_events()  # Refreshes buffer
        for i, ev in enumerate(self._event_buffer):
            if ev.get("type") == event_type:
                return self._event_buffer.pop(i)  # Consume ONLY the target
        time.sleep(0.1)
    return None  # Timed out: buffer never contained the target event
```
This architectural pattern (Client-Side Event Buffering) guarantees zero event loss, regardless of how fast the GUI pushes to the queue, how many events are bundled into a single HTTP response, or what chronological order the test script polls them in.
---
## 5. Deep Dive III: Asyncio Lifecycle & Threading Deadlocks
### 5.1 The Symptom
When running the full test suite (`pytest --maxfail=10`), execution would abruptly stop, usually midway through `test_gemini_cli_parity_regression.py`. Tests would throw `RuntimeError: Event loop is closed` deep inside background threads, breaking the application state permanently for the rest of the run, or simply freezing the terminal indefinitely.
### 5.2 The Mechanism of Failure
The `AppController` initializes its own internal `asyncio` loop running in a dedicated daemon thread (`_loop_thread`) to handle HTTP non-blocking requests (if any) and async queue processing.
#### 5.2.1 Event Loop Exhaustion
`pytest` is a synchronous runner by default, but it heavily utilizes the `pytest-asyncio` plugin to manage async fixtures and test coroutines. When `pytest` executes hundreds of tests, the `app_instance` and `mock_app` fixtures create and tear down hundreds of `AppController` instances.
`asyncio.new_event_loop()` is fundamentally incompatible with unmanaged, rapid creation and destruction of loops across multiple short-lived threads within a single process space. Thread-local storage (`threading.local`) for event loops becomes polluted, and Python's weak references break down under the load.
#### 5.2.2 Missing Teardown & Zombie Loops
Originally, the `AppController` completely lacked a `shutdown()` or `close()` method. When a `pytest` function finished, the daemon `_loop_thread` remained alive, and the inner `asyncio` loop continued attempting to poll `self.event_queue.get()`.
When Python's garbage collector eventually reclaimed the unreferenced `AppController` object, or when `pytest-asyncio` invoked global loop cleanup policies at the end of a module, these background loops were violently terminated mid-execution. This raised `CancelledError` or `Event loop is closed` exceptions, crashing the thread and leaving the testing framework in an indeterminate state.
#### 5.2.3 The Unbounded Wait Deadlock
When the AI Tier 3 worker wants to execute a mutating filesystem tool like `run_powershell` or spawn a sub-agent, it triggers a HITL (Human-in-the-Loop) gate. Because the AI logic runs on a background thread, it must halt and wait for the GUI thread to signal approval. It does this using a standard `threading.Condition`:
```python
# Old ConfirmDialog logic (Flawed)
def wait(self) -> tuple[bool, str]:
with self._condition:
while not self._done:
self._condition.wait(timeout=0.1) # <--- FATAL: No outer escape hatch!
return self._approved, self._script
```
If the test logic failed to trigger the approval via the Hook API (e.g., due to the event dropping bug detailed in Part 4), or if the Hook API crashed because the background asyncio loop died (as detailed in 5.2.2), the background worker thread called `dialog.wait()` and **waited forever**. It was trapped in an infinite loop, immune to `Ctrl+C` and causing the CI/CD pipeline to hang until a 6-hour timeout triggered.
### 5.3 The Implemented Resolution
1. **Deterministic Teardown Lifecycle:** Added an explicit `AppController.shutdown()` method which calls `self._loop.stop()` safely from a threadsafe context and invokes `self._loop_thread.join(timeout=2.0)`. Updated all `conftest.py` fixtures to rigorously call this during the `yield` teardown phase.
2. **Deadlock Prevention via Hard Timeouts:** Wrapped all `wait()` calls in `ConfirmDialog`, `MMAApprovalDialog`, and `MMASpawnApprovalDialog` with an absolute outer timeout of 120 seconds.
```python
# Fixed ConfirmDialog logic
def wait(self) -> tuple[bool, str]:
start_time = time.time()
with self._condition:
while not self._done:
if time.time() - start_time > 120:
return False, self._script # Auto-reject after 2 minutes
self._condition.wait(timeout=0.1)
return self._approved, self._script
```
If the GUI fails to respond within 2 minutes, the dialog automatically aborts, preventing thread starvation and allowing the test suite to fail gracefully rather than hanging infinitely.
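The teardown contract in (1) can be sketched in isolation; the real `AppController` wiring is elided, and only the loop-thread lifecycle is shown:

```python
import asyncio
import threading

class LoopOwner:
    """Minimal sketch: a daemon thread running an asyncio loop,
    stopped and joined deterministically on shutdown()."""
    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._loop_thread = threading.Thread(target=self._run, daemon=True)
        self._loop_thread.start()

    def _run(self) -> None:
        asyncio.set_event_loop(self._loop)
        self._loop.run_forever()

    def shutdown(self) -> None:
        # stop() must be scheduled threadsafe from outside the loop thread
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._loop_thread.join(timeout=2.0)
        self._loop.close()

owner = LoopOwner()
owner.shutdown()
assert not owner._loop_thread.is_alive()
assert owner._loop.is_closed()
```

A fixture that calls `shutdown()` in its teardown phase guarantees no zombie loop survives the test, regardless of how the test body exits.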
---
## 6. Deep Dive IV: Phantom Hook Servers & Test State Pollution
### 6.1 The Symptom
Tests utilizing the `live_gui` fixture sporadically failed with `ConnectionError: Max retries exceeded with url: /api/events`, or assertions failed completely because the test was mysteriously interacting with UI state (like `ui_ai_input` values) left over from a completely different test file run several minutes prior.
### 6.2 The Mechanism of Failure
The `live_gui` fixture in `conftest.py` spawns a completely independent GUI process using `subprocess.Popen([sys.executable, "sloppy.py", "--headless", "--enable-test-hooks"])`. This child process automatically binds to `127.0.0.1:8999` and launches the `api_hooks.HookServer`.
#### 6.2.1 Zombie Processes on Windows
If a test failed abruptly via an assertion mismatch or a timeout, the standard teardown block in the `live_gui` fixture called `process.terminate()`.
On Windows, `terminate()` maps to `TerminateProcess()`, which kills the immediate PID. However, it does *not* reliably kill child processes spawned by the target script. If `sloppy.py` had spawned its own worker threads, or if it had launched a PowerShell subprocess that got stuck, the parent process tree remained alive as a "zombie" or "phantom" process.
#### 6.2.2 Port Hijacking & Cross-Test Telemetry Contamination
The zombie `sloppy.py` process continues running silently in the background, keeping the HTTP socket on port 8999 bound and listening.
When the *next* test in the suite executes, the `live_gui` fixture attempts to spawn a new process. The new process boots, tries to start `HookServer` on 8999, fails (because the zombie holds the port), and logs an `OSError: Address already in use` error to `stderr`. It then continues running without a hook API.
The test script then instantiates `ApiHookClient()` and sends a request to `127.0.0.1:8999`. **The zombie GUI process from the previous test answers.** The current test is now feeding inputs, clicking buttons, and making assertions against a polluted, broken state machine from a different context, leading to entirely baffling test failures.
#### 6.2.3 In-Process Module Pollution (The Singleton Trap)
For unit tests that mock `App` in-process (avoiding `subprocess`), global singletons like `ai_client` and `mcp_client` retained state indefinitely. Python modules are loaded once per interpreter session.
If `test_arch_boundary_phase1.py` modified `mcp_client.MUTATING_TOOLS` or registered an event listener via `ai_client.events.on("tool_execution", mock_callback)`, that listener remained active forever. When `test_gemini_cli_adapter_parity.py` ran later, the old mock listener fired, duplicating events, triggering assertions on dead mocks, and causing chaotic, untraceable failures.
### 6.3 The Implemented Resolution
1. **Aggressive Subprocess Annihilation:** Imported `psutil` into `conftest.py` and implemented a `kill_process_tree` function to recursively slaughter every child PID attached to the `live_gui` fixture upon teardown.
2. **Proactive Port Verification:** Added HTTP GET polling to `127.0.0.1:8999/status` *before* launching the subprocess to ensure the port is completely dead. If it responds, the test suite aborts loudly rather than proceeding with a hijacked port.
3. **Singleton Sanitization (Scorched Earth):** Expanded the `reset_ai_client` autouse fixture (which runs before every single test) to rigorously clear `ai_client.events._listeners` via a newly added `clear()` method, and to call `mcp_client.configure([], [])` to wipe the file allowlist.
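The proactive port check in step 2 can be sketched with a plain TCP connect. This is an illustrative version, not the exact `conftest.py` code; the `127.0.0.1:8999` default mirrors the fixture described above, and the function name is hypothetical:

```python
import socket

def port_is_free(host: str = "127.0.0.1", port: int = 8999, timeout: float = 0.25) -> bool:
    """Return True if nothing is listening on host:port.

    A TCP connect that succeeds means a zombie process still holds the
    socket, so the suite should abort loudly instead of launching a new
    GUI that will silently lose the port race.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) != 0

# Example guard before spawning the GUI subprocess:
# if not port_is_free():
#     raise RuntimeError("Port 8999 still held by a previous run; aborting.")
```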
---
## 7. Review of Prior Audits (GLM-4.7 & Claude Sonnet 4.6)
### 7.1 Critique of GLM-4.7's Report
GLM-4.7 produced a report that was thorough in its static skeletal analysis but fundamentally flawed in its dynamic conclusions.
* **Accurate Findings:** GLM correctly identified the lack of negative path testing. It accurately noted that `mock_gemini_cli.py` always returning success masks error-handling logic in the main application. It also correctly identified that asserting substrings (`assert "Success" in response`) is brittle.
* **Inaccurate Findings:** GLM focused exclusively on "false positive risks" (tests passing when they shouldn't) and completely missed the far more critical "false negative risks" (tests failing or hanging due to race conditions).
* **The Over-Correction:** GLM's primary recommendation was to rewrite the entire testing framework to use custom `ContextManager` mocks and to rip out the simulation layer entirely. This was a severe misdiagnosis. The event bus (`EventEmitter` and `AsyncEventQueue`) was structurally sound; the failures were purely due to lifecycle management, bad polling loops, and missing thread timeouts. Throwing out the simulation framework would have destroyed the only integration tests capable of actually catching these deep architectural bugs.
### 7.2 Critique of Claude 4.6's Report
Claude 4.6's review was much closer to reality, correctly dialing back GLM's hysteria and focusing on structural execution.
* **Accurate Findings:** Claude accurately identified the auto-approval problem: tests were clicking "approve" without asserting the dialog actually rendered first, hiding UX failures. It brilliantly identified the "Two-Tier Mock Problem"—the split between in-process `app_instance` unit tests and out-of-process `live_gui` integration tests. It also correctly caught the `mcp_client` state bleeding issue (which I subsequently fixed in this track).
* **Missed Findings:** Claude dismissed the `simulation/` framework as merely a "workflow driver." It failed to recognize that the workflow driver was actively triggering deadlocks in the `AppController`'s thread pools due to missing synchronization bounds. It did not uncover the IPC Destructive Read bug or the Triple Bingo streaming issue, because those require dynamic runtime tracing to observe.
---
## 8. File-by-File Impact Analysis of This Remediation Session
To permanently fix these issues, the following systemic changes were applied during this track:
### 8.1 `src/app_controller.py`
* **Thread Offloading:** Wrapped `_do_generate` inside `_handle_generate_send` and `_handle_md_only` in explicit `threading.Thread` workers. The Markdown compilation step is CPU-bound and slow on large projects; running it synchronously was blocking the async event loop and the GUI render tick.
* **Streaming Gate:** Added conditional logic to `_process_pending_gui_tasks` ensuring that `_pending_history_adds` is only mutated when `is_streaming` is False and `stream_id` is None.
* **Hard Timeouts:** Injected 120-second bounds via `time.time()` into the `wait()` loops for `ConfirmDialog`, `MMAApprovalDialog`, and `MMASpawnApprovalDialog`.
* **Lifecycle Hooks:** Implemented `shutdown()` to terminate the `asyncio` loop and join background threads cleanly. Added event logging bridging to `_api_event_queue` for `script_confirmation_required` so the Hook API clients can see it.
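The shutdown lifecycle above can be sketched in miniature. Class and attribute names here are illustrative, not the real `AppController` API; the key point is that `call_soon_threadsafe` is the only safe way to stop a loop owned by another thread:

```python
import asyncio
import threading

class LoopOwner:
    """Minimal sketch of the shutdown() lifecycle (hypothetical names)."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._loop_thread = threading.Thread(target=self._run, daemon=True)
        self._loop_thread.start()

    def _run(self) -> None:
        asyncio.set_event_loop(self._loop)
        self._loop.run_forever()

    def shutdown(self) -> None:
        # A bare self._loop.stop() from this thread would race with the
        # loop thread; call_soon_threadsafe schedules it safely.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._loop_thread.join(timeout=2.0)
        self._loop.close()
```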
### 8.2 `src/ai_client.py`
* **Event Cleanliness:** Removed duplicated `events.emit("tool_execution", status="started")` calls across all providers (Gemini, Anthropic, Deepseek). Previously, some providers emitted it twice, and others omitted it entirely for mutating tools. Enforced single, pre-execution emission.
* **History Decoupling:** Stripped arbitrary `history_add` events from `_send_gemini_cli`. State persistence is exclusively the domain of the controller now.
### 8.3 `src/api_hook_client.py` & `src/api_hooks.py`
* **Stateful IPC:** Transformed `ApiHookClient` from a stateless HTTP wrapper into a stateful event consumer by implementing `_event_buffer`. `get_events()` now extends this buffer, and `wait_for_event()` pops from it, eliminating race conditions entirely.
* **Timeout Tuning:** Reduced `api_hooks.py` server-side lock wait timeouts from 60s to 10s to prevent the Hook Server from holding TCP connections hostage when the GUI thread is busy. This allows the client to retry gracefully rather than hanging.
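The stateful buffering pattern can be sketched as follows. `fetch_fn` stands in for the HTTP GET to the events endpoint, which in the buggy design destructively drained the server-side queue; all names here are illustrative, not the real `ApiHookClient` API:

```python
import time

class BufferedEventClient:
    """Sketch of a stateful event consumer (hypothetical names).

    Buffering locally means an event fetched while waiting for event A
    is not lost when the test later waits for event B.
    """

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn
        self._event_buffer: list[dict] = []

    def get_events(self) -> list[dict]:
        self._event_buffer.extend(self._fetch())
        return list(self._event_buffer)

    def wait_for_event(self, name: str, timeout: float = 5.0, interval: float = 0.05) -> dict:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            self.get_events()
            for i, evt in enumerate(self._event_buffer):
                if evt.get("event") == name:
                    return self._event_buffer.pop(i)  # consume exactly once
            time.sleep(interval)
        raise TimeoutError(f"event {name!r} not seen within {timeout}s")
```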
### 8.4 `tests/conftest.py`
* **Scorched Earth Teardown:** Upgraded the `reset_ai_client` autouse fixture to explicitly invoke `ai_client.events.clear()` and `mcp_client.configure([], [])`.
* **Zombie Prevention:** Modified the `live_gui` fixture to log warnings on port collisions and utilize strict process tree termination (`kill_process_tree`) upon yield completion.
### 8.5 `src/events.py`
* **Listener Management:** Added a `clear()` method to `EventEmitter` to support the scorched-earth teardown in `conftest.py`. Implemented `task_done` and `join` pass-throughs for `AsyncEventQueue`.
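The `clear()` addition is conceptually tiny. A minimal sketch of the emitter shape (illustrative, not the real `src/events.py` implementation):

```python
class EventEmitter:
    """Sketch of the listener-management API described above."""

    def __init__(self) -> None:
        self._listeners: dict[str, list] = {}

    def on(self, event: str, callback) -> None:
        self._listeners.setdefault(event, []).append(callback)

    def emit(self, event: str, **payload) -> None:
        for callback in self._listeners.get(event, []):
            callback(**payload)

    def clear(self) -> None:
        # Scorched-earth teardown hook: drop every registered listener so
        # a mock from one test file can never fire in a later one.
        self._listeners.clear()
```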
---
## 9. Prioritized Action Plan & Future Tracks
The critical blocking bugs have been resolved, and the test suite can now complete end-to-end without deadlocking. However, architectural debt remains. The following tracks should be executed in order:
### Priority 1: `hook_api_ui_state_verification_20260302` (HIGH)
**Context:** This is an existing, planned track, but it must be expedited.
**Goal:** Replace fragile `time.sleep()` and log-parsing assertions in `test_visual_sim_mma_v2.py` with deterministic UI state queries.
**Implementation Details:**
1. Implement a robust `GET /api/gui/state` endpoint in `HookHandler`.
2. Wire critical UI variables (e.g., `ui_focus_agent`, active modal titles, track operational status) into the `AppController._settable_fields` dictionary to allow programmatic reading without pixels or screenshots.
3. Refactor all simulation tests to poll for precise state markers (e.g., `assert client.get_value("modal_open") == "ConfirmDialog"`) rather than sleeping for arbitrary seconds.
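The polling discipline these steps rely on reduces to one helper (the `wait_until` signature matches the one planned for `conftest.py`; the body is a sketch, not the final implementation):

```python
import time

def wait_until(predicate_fn, timeout: float = 5.0, interval: float = 0.05):
    """Poll `predicate_fn` until it returns truthy, else raise TimeoutError.

    This is the deterministic replacement for time.sleep(n): the test
    proceeds the moment the state marker appears, instead of always
    paying the worst-case delay.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate_fn()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")
```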
### Priority 2: `asyncio_decoupling_refactor_20260306` (MEDIUM)
**Context:** The internal use of `asyncio` is a lingering risk factor for test stability.
**Goal:** Remove `asyncio` from the `AppController` entirely.
**Implementation Details:**
1. The `AppController` currently uses an `asyncio.Queue` and a dedicated `_loop_thread` to manage background tasks. This is vastly over-engineered for a system whose only job is to pass dictionary payloads between a background AI worker and the main GUI thread.
2. Replace `events.AsyncEventQueue` with a standard, thread-safe `queue.Queue` from Python's standard library.
3. Convert the `_process_event_queue` async loop into a standard synchronous `while True` loop running in a standard daemon thread.
4. This will permanently eliminate all `RuntimeError: Event loop is closed` bugs during test teardowns and drastically simplify mental overhead for future developers maintaining the codebase.
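The proposed replacement can be sketched as a plain queue plus a daemon thread. Names here are hypothetical; the point is that with no event loop to close, teardown can never raise `RuntimeError: Event loop is closed`:

```python
import queue
import threading

class SyncEventPump:
    """Sketch of the proposed asyncio-free event pump (hypothetical names)."""

    _STOP = object()  # sentinel that ends the drain loop deterministically

    def __init__(self, handler) -> None:
        self._queue: queue.Queue = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def post(self, payload: dict) -> None:
        self._queue.put(payload)  # queue.Queue is thread-safe; callable anywhere

    def _drain(self) -> None:
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._handler(item)

    def shutdown(self) -> None:
        # The sentinel is processed after all prior payloads, so shutdown
        # drains the queue before the thread exits.
        self._queue.put(self._STOP)
        self._thread.join(timeout=2.0)
```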
### Priority 3: `mock_provider_hardening_20260305` (MEDIUM)
**Context:** Sourced from Claude 4.6's valid recommendations.
**Goal:** Ensure error paths are exercised.
**Implementation Details:**
1. Add `MOCK_MODE` environment variable parsing to `tests/mock_gemini_cli.py`.
2. Implement distinct mock behaviors for `malformed_json`, `timeout` (sleep for 90s), and `error_result` (return a valid JSON payload indicating failure).
3. Create `tests/test_negative_flows.py` to verify the GUI correctly displays error states, allows session resets, and recovers without crashing when the AI provider returns garbage data.
### Priority 4: `simulation_fidelity_enhancement_20260305` (LOW)
**Context:** Sourced from GLM-4.7's recommendations.
**Goal:** Make tests closer to human use.
**Implementation Details:**
1. As Claude noted, this is low priority for a local developer tool. However, adding slight, randomized jitter to the `UserSimAgent` (e.g., typing delays, minor hesitations between clicks) can help shake out UI rendering glitches that only appear when ImGui is forced to render intermediate frames.
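A jittered typing helper could be as small as this sketch (`send_key` is an assumed keystroke-injection callback on the `UserSimAgent`, not an existing API):

```python
import random
import time

def type_text(send_key, text: str, base: float = 0.04, jitter: float = 0.06) -> None:
    """Feed `text` one keystroke at a time with randomized inter-key delays.

    Spacing the keystrokes out forces ImGui to render intermediate frames,
    instead of receiving the whole string in a single tick.
    """
    for ch in text:
        send_key(ch)
        time.sleep(base + random.uniform(0.0, jitter))
```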
---
*End of Exhaustive Report. Track Completed.*


@@ -1,96 +0,0 @@
# Track Specification: Test Architecture Integrity & Simulation Audit
## Overview
Comprehensive audit of testing infrastructure and simulation framework to identify false positive risks, coverage gaps, and simulation fidelity issues. This analysis was triggered by a request to review how tests and simulations are setup, whether tests can report passing grades when they fail, and if simulations are rigorous enough or are just rough emulators.
## Current State Audit (as of 20260304)
### Already Implemented (DO NOT re-implement)
- **Simulation Framework** (simulation/):
- sim_base.py: Base simulation class with setup/teardown
- workflow_sim.py: Workflow orchestration
- sim_context.py, sim_ai_settings.py, sim_tools.py, sim_execution.py
- user_agent.py: Simulated human agent
- **Testing Infrastructure** (tests/conftest.py):
- live_gui fixture for session-scoped GUI lifecycle management
- Process cleanup with kill_process_tree()
- VerificationLogger for diagnostic logging
- Artifact isolation to tests/artifacts/ and tests/logs/
- Ban on arbitrary core mocking
- **Mock Provider** (tests/mock_gemini_cli.py):
- Keyword-based response routing
- JSON-L protocol matching real CLI output
#### Critical False Positive Risks Identified
1. **Mock Provider Always Returns Success**: Never validates input, never produces errors, never tests failure paths
2. **Auto-Approval Pattern**: All HITL gates auto-clicked, never verifying dialogs appear or rejection flows
3. **Substring-Based Assertions**: Only check existence of content, not validity or structure
4. **State Existence Only**: Tests check fields exist but not their correctness or invariants
5. **No Negative Path Testing**: No coverage for rejection, timeout, malformed input, concurrent access
6. **No Visual Verification**: Tests verify logical state via Hook API but never check what's actually rendered
7. **No State Machine Validation**: No verification that status transitions are legal or complete
#### Simulation Rigor Gaps Identified
1. **No Real-Time Latency Simulation**: Fixed delays don't model variable LLM/network latency
2. **No Human-Like Behavior**: Instant actions, no typing speed, hesitation, mistakes, or task switching
3. **Arbitrary Polling Intervals**: 1-second polls may miss transient states
4. **Mock CLI Redirection**: Bypasses subprocess spawning, environment passing, and process cleanup paths
5. **No Stress Testing**: No load testing, no edge case bombardment
#### Test Coverage Gaps
- No tests for approval dialog rejection flows
- No tests for malformed LLM response handling
- No tests for network timeout/failure scenarios
- No tests for concurrent duplicate requests
- No tests for out-of-order event sequences
- No thread-safety tests for shared resources
- No visual rendering verification (modal visibility, text overflow, color schemes)
#### Structural Testing Contract Gaps
- Missing rule requiring negative path testing
- Missing rule requiring state validation beyond existence
- Missing rule requiring visual verification
- No enforcement for thread-safety testing
## Goals
1. Document all identified testing pitfalls with severity ratings (HIGH/MEDIUM/LOW)
2. Create actionable recommendations for each identified issue
3. Map existing test coverage gaps to specific missing test files
4. Provide architecture recommendations for simulation framework enhancements
## Functional Requirements
- [ ] Document all false positive risks in a structured format
- [ ] Document all simulation fidelity gaps in a structured format
- [ ] Create severity matrix for each issue
- [ ] Generate list of missing test cases by category
- [ ] Provide concrete examples of how current tests would pass despite bugs
- [ ] Provide concrete examples of how simulations would miss UX issues
## Non-Functional Requirements
- Report must include author attribution (GLM-4.7) and derivation methodology
- Analysis must cite specific file paths and line numbers where applicable
- Recommendations must be prioritized by impact and implementation effort
## Architecture Reference
Refer to:
- docs/guide_simulations.md - Current simulation contract and patterns
- docs/guide_mma.md - MMA orchestration architecture
- docs/guide_architecture.md - Thread domains, event system, HITL mechanism
- conductor/tracks/*/spec.md - Existing track specifications for consistency
## Out of Scope
- Implementing the actual test fixes (that's for subsequent tracks)
- Refactoring the simulation framework (documenting only)
- Modifying the mock provider (analyzing only)
- Writing new tests (planning phase for future tracks)


@@ -1,5 +0,0 @@
# Track test_suite_performance_and_flakiness_20260302 Context
- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)


@@ -1,8 +0,0 @@
{
"track_id": "test_suite_performance_and_flakiness_20260302",
"type": "chore",
"status": "new",
"created_at": "2026-03-02T22:30:00Z",
"updated_at": "2026-03-02T22:30:00Z",
"description": "Replace arbitrary time.sleep() calls with deterministic polling/Events and optimize test speed."
}


@@ -1,51 +0,0 @@
# Implementation Plan: Test Suite Performance & Flakiness (test_suite_performance_and_flakiness_20260302)
> **TEST DEBT FIX:** This track is responsible for eliminating the `RuntimeError: Event loop is closed` deadlocks and zombie subprocesses discovered during the `test_architecture_integrity_audit_20260304` audit.
## Phase 1: Asyncio Decoupling & Queue Refactor
- [ ] Task: Initialize MMA Environment `activate_skill mma-orchestrator`
- [ ] Task: Rip out `asyncio` from `AppController`
- [ ] WHERE: `src/app_controller.py` and `src/events.py`
- [ ] WHAT: Replace `events.AsyncEventQueue` with a standard `queue.Queue`. Convert `_process_event_queue` to a synchronous `while True` loop running in a daemon thread.
- [ ] HOW: Remove all `async`/`await` and `asyncio.run_coroutine_threadsafe` calls related to the internal event queue.
- [ ] SAFETY: Ensures test teardowns no longer violently crash background event loops.
- [ ] Task: Ensure Phantom Processes are Killed
- [ ] WHERE: `tests/conftest.py`
- [ ] WHAT: Verify the `kill_process_tree` implementation added in the audit is fully robust against hanging `sloppy.py` instances.
- [ ] HOW: Test with intentional process hangs.
- [ ] Task: Conductor - User Manual Verification 'Phase 1: Asyncio Decoupling'
## Phase 2: Audit & Polling Primitives
- [ ] Task: Initialize MMA Environment `activate_skill mma-orchestrator`
- [ ] Task: Create Deterministic Polling Primitives
- [ ] WHERE: `tests/conftest.py`
- [ ] WHAT: Implement a `wait_until(predicate_fn, timeout=5.0, interval=0.05)` utility.
- [ ] HOW: Standard while loop that evaluates `predicate_fn()`.
- [ ] SAFETY: Ensure it raises a clear `TimeoutError` if it fails.
- [ ] Task: Conductor - User Manual Verification 'Phase 2: Audit & Polling Primitives' (Protocol in workflow.md)
## Phase 3: Refactoring Integration Tests
- [ ] Task: Refactor `test_spawn_interception.py`
- [ ] WHERE: `tests/test_spawn_interception.py`
- [ ] WHAT: Replace hardcoded sleeps with `wait_until` checking the `event_queue` or internal state.
- [ ] HOW: Use the new `conftest.py` utility.
- [ ] SAFETY: Prevent event loop deadlocks.
- [ ] Task: Refactor Simulation Waits
- [ ] WHERE: `simulation/*.py` and `tests/test_live_gui_integration.py`
- [ ] WHAT: Replace `time.sleep()` blocks with `ApiHookClient.wait_for_event` or `client.wait_until_value_equals`.
- [ ] HOW: Expand `ApiHookClient` polling capabilities if necessary.
- [ ] SAFETY: Ensure the GUI hook server remains responsive during rapid polling.
- [ ] Task: Conductor - User Manual Verification 'Phase 3: Refactoring Sleeps' (Protocol in workflow.md)
## Phase 4: Test Marking & Final Validation
- [ ] Task: Apply Slow Test Marks
- [ ] WHERE: Across all `tests/`
- [ ] WHAT: Add `@pytest.mark.slow` to any test requiring a live GUI boot or API mocking that takes >2 seconds.
- [ ] HOW: Import pytest and apply the decorator.
- [ ] SAFETY: Update `pyproject.toml` to register the `slow` marker.
- [ ] Task: Full Suite Performance Validation
- [ ] WHERE: Project root
- [ ] WHAT: Run `uv run pytest -m "not slow"` and verify execution time < 10 seconds. Run `uv run pytest` to ensure total suite passes.
- [ ] HOW: Time the terminal command.
- [ ] SAFETY: None.
- [ ] Task: Conductor - User Manual Verification 'Phase 4: Final Validation' (Protocol in workflow.md)


@@ -1,19 +0,0 @@
# Track Specification: Test Suite Performance & Flakiness (test_suite_performance_and_flakiness_20260302)
## Overview
The test suite currently takes over 5.0 minutes to execute and frequently hangs on integration tests (e.g., `test_spawn_interception.py`). Several simulation tests are flaky or timing out. This track replaces arbitrary `time.sleep()` calls with deterministic polling (`threading.Event()`), aiming to drive the core TDD test execution time down to under 10 seconds.
## Architectural Constraints
- **Zero Arbitrary Sleeps**: `time.sleep(1.0)` is banned in test files unless testing actual rate-limiting or debounce functionality.
- **Deterministic Waits**: Tests must use state-polling (with aggressive micro-sleeps) or `asyncio.Event` / `threading.Event` to proceed exactly when the system is ready.
## Functional Requirements
- Audit all `tests/` and `simulation/` files for `time.sleep()`.
- Implement polling helper functions in `conftest.py` (e.g., `wait_until(condition_func, timeout)`).
- Refactor all integration tests to use the deterministic polling helpers.
- Apply `@pytest.mark.slow` to any test that legitimately takes >2 seconds, allowing developers to skip them during rapid TDD loops.
## Acceptance Criteria
- [ ] `time.sleep` occurrences in the test suite are eliminated or strictly justified.
- [ ] The core unit test suite (excluding `@pytest.mark.slow`) executes in under 10 seconds.
- [ ] Integration tests pass consistently without flakiness across 10 consecutive runs.