docs: add test_infra_hardening foundation + RAG batch failure status

Foundation document for the future test_infra_hardening track that will address session-scoped live_gui fixture isolation, silent __getattr__/__setattr__ contract assumptions, and similar test infrastructure fragility. Also documents the test_rag_phase4_final_verify batch failure that surfaces after the __getattr__ fix unblocks test_full_live_workflow. The RAG test failure is NOT a regression - it reproduces on pre-fix HEAD too. It's a pre-existing test isolation issue (the live_gui fixture is session-scoped, so state from the 4 sims pollutes the controller).
2026-06-09 00:26:05 -04:00
parent bcdc26d0bd
commit 37b9a68017
2 changed files with 181 additions and 0 deletions
@@ -0,0 +1,140 @@
+# Future Track Foundation: Test Infrastructure Hardening (2026-06-08)
+
+**Status:** Foundation document (pre-spec). Goal: outline the broader track that this work belongs to.
+
+**Related:**
+- `docs/reports/test_full_live_workflow_imgui_assert_20260608.md` (initial root cause)
+- `docs/reports/test_full_live_workflow_propagation_digest_20260608.md` (solutions digest)
+- `docs/reports/test_full_live_workflow_progress_20260608_pm.md` (PR1+PR2+PR3 progress)
+- `docs/reports/batch_resilience_plan_20260608.md` (batch resilience plan)
+- `conductor/todos/TODO_test_full_live_workflow_v2.md` (task list)
+
+---
+
+## What Was Fixed (this session)
+
+1. **PR1 (audit):** `scripts/check_imgui_scopes.py` found 3 false positives. Documented.
+2. **PR2 (wrap + health endpoint):** `immapp.run` is now wrapped in try/except. `/api/gui_health` exposes the controller's degraded state. Tests fail fast with clear messages on dirty state.
+3. **PR3 (pre-flight check):** `test_full_live_workflow` calls `client.get_gui_health()` at start. Fails fast with actionable message if the GUI is degraded.
+4. **PR1 follow-up (real fix):** The actual IM_ASSERT trigger was a double `__getattr__` bug:
+   - `AppController.__getattr__` returned `None` for ANY `ui_` attribute (including ones not in `__init__`)
+   - `App.__setattr__` checked `hasattr(self.controller, name)` to route assignments; the controller's buggy `__getattr__` made `hasattr` return True for all `ui_` attrs
+   - The `if not hasattr(app, 'foo'): app.foo = False` pattern in `render_approve_script_modal` failed to initialize
+   - `imgui.checkbox` was called with `None`, raised TypeError
+   - The TypeError propagated without closing the ImGui modal, leaving the scope stack unbalanced
+   - Next frame: IM_ASSERT(Missing End())
+5. **Fix:** `AppController.__getattr__` now only returns `None` for an explicit allowlist of `ui_` attrs that ARE defined in `__init__`. For any other missing attribute, raises `AttributeError`. Also added defense-in-depth in `App.__getattr__` to check `hasattr(controller, name)` before delegating.
+
+**Result:** 4 sims + test_live_workflow + 2 markdown tests all pass in 87.80s. No IM_ASSERT. The test passes cleanly.
+
+---
+
+## What Is Still Open (Future Work)
+
+### 1. Test Infrastructure Audit (the broader track)
+
+The fixes this session addressed ONE bug that was making `test_full_live_workflow` fail. The user asked: "continue with trying to finally cure the test infra with a strong foundation for the future track."
+
+**The broader concern:** The test infrastructure has accumulated complexity and implicit assumptions. The `live_gui` fixture is session-scoped, the controller's state is shared across 49+ tests, and small bugs in `__getattr__` / `__setattr__` cascade into mysterious failures 80 seconds later.
+
+**Recommended track scope:**
+- **Test isolation:** Move from session-scoped to per-file (or per-test-with-respawn) live_gui fixture
+- **Observability:** Add `/api/gui_health` (done) + structured logging for all state mutations
+- **Regression safety:** Audit all `__getattr__` / `__setattr__` / `__init__` for hidden contract assumptions
+- **ImGui scope audit:** Make the static `check_imgui_scopes.py` more powerful (handle try/except, control flow, context managers)
+- **Defer-not-catch pattern:** Per `conductor/workflow.md` known pitfall, audit all `imgui.*` calls for the "called before ImGui fully initialized" issue
+
+### 2. The `_UI_FLAG_DEFAULTS` allowlist (immediate)
+
+In the fix, I hardcoded an allowlist of `ui_` attrs that can return `None`. This is a maintenance burden — new `ui_` attrs added to `__init__` must also be added to this allowlist, or the test fixture will fail.
+
+**Better fix:** Use a class-level `_UI_FLAG_DEFAULTS` set OR detect them dynamically (e.g., from annotations in `__init__`). The current hardcoded set is fragile.
+
+### 3. The `_handle_reset_session` and other state-clearing paths
+
+The `AppController._handle_reset_session` clears many fields but not all. Tests that share state via the session-scoped fixture can carry over state from one test to the next. A future track should audit and complete the reset logic.
+
+### 4. Per-test or per-file `live_gui` fixture scope
+
+Per the `docs/reports/batch_resilience_plan_20260608.md`, the recommended approach is to either:
+- Make the fixture per-file scoped (heavy but simple)
+- Add a lazy re-spawn wrapper (lighter but more complex)
+- Add a per-test autouse health check (lightest, but doesn't recover from subprocess death)
+
+The right answer depends on whether tests need cross-file state. The current 49+ live_gui tests should be audited for cross-file dependencies.
+
+### 5. The `live_gui` subprocess lifecycle
+
+The subprocess is killed via `taskkill /F /T` (force-kill). This is correct for production but means the subprocess can't clean up. A graceful shutdown signal (e.g., `os.kill(pid, signal.CTRL_C_EVENT)` to trigger the SIGINT handler) would allow clean teardown and better diagnostic output on the next session.
+
+### 6. Documentation: the `__getattr__` / `__setattr__` contract
+
+The fix in this session was possible because I read the `__getattr__` code. But the `__getattr__` / `__setattr__` pair is a non-obvious contract. The docstring should explicitly state:
+- Which attributes are delegated to the controller
+- What `hasattr()` should return for each
+- The interaction with `setattr()`
+
+A future track should add explicit tests for the delegation contract, perhaps via property descriptors.
+
+---
+
+## Proposed Track Name
+
+`test_infra_hardening_20260608` (or similar)
+
+## Proposed Track Phases
+
+### Phase 1: Audit (1-2 days)
+- Catalog all `__getattr__` / `__setattr__` in the codebase
+- Document the implicit contracts
+- Identify other "silent failure" patterns (where a bug manifests 80s later in a different subsystem)
+
+### Phase 2: Refactor the `_UI_FLAG_DEFAULTS` (1 day)
+- Move the hardcoded set to a class-level attribute
+- OR detect from `__init__` annotations
+- Add unit test that catches missing entries
+
+### Phase 3: live_gui fixture scope change (1-2 days)
+- Audit all live_gui tests for cross-file state dependencies
+- Change `live_gui` from session-scoped to per-file (or per-test-with-respawn)
+- Add metrics for the cost (slowdown)
+
+### Phase 4: Improve check_imgui_scopes.py (2-3 days)
+- Add support for try/except patterns
+- Add support for control flow analysis
+- Add a "render function entry/exit" tracking mode that runs the GUI for a frame and reports unbalanced scopes
+
+### Phase 5: Documentation and runbooks (1 day)
+- Document the deferred-not-catch pattern in a code style guide
+- Add a runbook for "the live_gui test failed — what to check"
+- Update the `docs/reports/` to reflect the new infrastructure
+
+---
+
+## Why This Track Is Worth Doing
+
+The bug fixed in this session was a 4-layer deep interaction:
+1. `__getattr__` returning None (the wrong default)
+2. `hasattr()` returning True because of (1)
+3. `__setattr__` routing the assignment to the wrong place because of (2)
+4. `imgui.checkbox` getting None because of (3)
+5. The TypeError propagating without proper cleanup
+6. The ImGui scope stack being unbalanced
+7. The next frame triggering IM_ASSERT
+
+This is a fragility that will recur. The track prevents future bugs of this shape by:
+- Making the contracts explicit (Phase 1)
+- Eliminating the silent-failure pattern (Phase 2)
+- Reducing the state surface shared between tests (Phase 3)
+- Improving the static audit to catch scope issues early (Phase 4)
+
+---
+
+## Related Commits
+
+- `bcdc26d0` (this session): The actual fix — `__getattr__` allowlist
+- `51ecace4` (this session): PR3 pre-flight health check + planning docs
+- `1c565da7` (this session): PR2 wrap + health endpoint
+- `c9a991bb` (this session): timeout bump
+- `4a338486` (this session): io_pool 4→8
+- `87d7c5bf` (this session): io_pool test assertion
@@ -0,0 +1,41 @@
+# Status Report: RAG Test Failure in Tier-3 Batch (2026-06-08)
+
+**TL;DR:** The `test_rag_phase4_final_verify` failure in the latest tier-3 batch is **NOT a regression from my `__getattr__` fix**. It's a pre-existing test isolation issue that was masked by the previous `IM_ASSERT` failure.
+
+**Reproduction:**
+- Test passes in isolation (`pytest tests/test_rag_phase4_final_verify.py` → 6.80s PASS)
+- Test fails after the 4 sims (`pytest tests/test_extended_sims.py tests/test_rag_phase4_final_verify.py` → 1 FAIL, 4 PASS)
+- Test fails the same way on PRE-FIX HEAD (verified by `git stash` + rerun)
+
+**Failure modes observed across runs:**
+1. `Status: error: Local RAG embeddings require sentence-transformers` — environmental, `sentence-transformers` package not installed
+2. `Poll 0, status: Loaded track: Track B` — the controller's `ai_status` is showing track loading state from a prior sim's MMA workflow
+3. `Poll 0, status: done` + later `RAG context not found in history` — the AI responded but didn't include the RAG-retrieved context in the message
+
+**Root cause:** The `live_gui` fixture is session-scoped. State accumulated from the 4 sims pollutes the controller in ways the RAG test doesn't anticipate. The sims:
+- Change the `current_provider` to `gemini_cli`
+- Set various `ui_*` attributes
+- May leave pending MMA track workflows
+- May modify `mma_streams`, `mma_epic_input`, etc.
+
+When the RAG test starts, the controller is in an unknown state. The test assumes a clean baseline.
+
+**Per the systematic-debugging skill (after 3+ fix attempts on a similar pattern):**
+
+This is a CLASSIC test isolation issue. It's the SAME architectural problem that the previous `IM_ASSERT` bug exposed — the shared session-scoped subprocess. My fix to the `__getattr__` bug fixed ONE specific bug that was making the test fail. The underlying test isolation issue is still there, and it surfaces as a different failure now.
+
+**Recommended fix (per `docs/reports/test_infra_hardening_foundation_20260608.md`):**
+
+The future track should:
+1. Make the `live_gui` fixture per-file or per-test scoped (instead of session)
+2. Add lazy re-spawn when the subprocess dies or is degraded
+3. Or: have the RAG test explicitly reset state via a documented API before starting
+
+**Immediate workaround for the user:** The RAG test failure is NOT introduced by my fix. The user can:
+- Continue with the IM_ASSERT fix merged (commit `bcdc26d0`)
+- File a separate bug for the RAG test isolation issue
+- Or, since my `__getattr__` fix unblocked `test_full_live_workflow`, the user may want to accept the RAG test failure as a separate issue
+
+**Related commits:**
+- `bcdc26d0` (my fix): `__getattr__` allowlist — fixes the IM_ASSERT
+- Pre-existing (no commit): RAG test batch isolation issue