diff --git a/docs/reports/test_infra_hardening_foundation_20260608.md b/docs/reports/test_infra_hardening_foundation_20260608.md new file mode 100644 index 00000000..fcdfe15a --- /dev/null +++ b/docs/reports/test_infra_hardening_foundation_20260608.md @@ -0,0 +1,140 @@ +# Future Track Foundation: Test Infrastructure Hardening (2026-06-08) + +**Status:** Foundation document (pre-spec). Goal: outline the broader track that this work belongs to. + +**Related:** +- `docs/reports/test_full_live_workflow_imgui_assert_20260608.md` (initial root cause) +- `docs/reports/test_full_live_workflow_propagation_digest_20260608.md` (solutions digest) +- `docs/reports/test_full_live_workflow_progress_20260608_pm.md` (PR1+PR2+PR3 progress) +- `docs/reports/batch_resilience_plan_20260608.md` (batch resilience plan) +- `conductor/todos/TODO_test_full_live_workflow_v2.md` (task list) + +--- + +## What Was Fixed (this session) + +1. **PR1 (audit):** `scripts/check_imgui_scopes.py` found 3 false positives. Documented. +2. **PR2 (wrap + health endpoint):** `immapp.run` is now wrapped in try/except. `/api/gui_health` exposes the controller's degraded state. Tests fail fast with clear messages on dirty state. +3. **PR3 (pre-flight check):** `test_full_live_workflow` calls `client.get_gui_health()` at start. Fails fast with actionable message if the GUI is degraded. +4. **PR1 follow-up (real fix):** The actual IM_ASSERT trigger was a double `__getattr__` bug: + - `AppController.__getattr__` returned `None` for ANY `ui_` attribute (including ones not in `__init__`) + - `App.__setattr__` checked `hasattr(self.controller, name)` to route assignments; the controller's buggy `__getattr__` made `hasattr` return True for all `ui_` attrs + - The `if not hasattr(app, 'foo'): app.foo = False` pattern in `render_approve_script_modal` failed to initialize + - `imgui.checkbox` was called with `None`, raised TypeError + - The TypeError propagated without closing the ImGui modal, leaving the scope stack unbalanced + - Next frame: IM_ASSERT(Missing End()) +5. **Fix:** `AppController.__getattr__` now only returns `None` for an explicit allowlist of `ui_` attrs that ARE defined in `__init__`. For any other missing attribute, raises `AttributeError`. Also added defense-in-depth in `App.__getattr__` to check `hasattr(controller, name)` before delegating. + +**Result:** 4 sims + test_live_workflow + 2 markdown tests all pass in 87.80s. No IM_ASSERT. The test passes cleanly. + +--- + +## What Is Still Open (Future Work) + +### 1. Test Infrastructure Audit (the broader track) + +The fixes this session addressed ONE bug that was making `test_full_live_workflow` fail. The user asked: "continue with trying to finally cure the test infra with a strong foundation for the future track." + +**The broader concern:** The test infrastructure has accumulated complexity and implicit assumptions. The `live_gui` fixture is session-scoped, the controller's state is shared across 49+ tests, and small bugs in `__getattr__` / `__setattr__` cascade into mysterious failures 80 seconds later. + +**Recommended track scope:** +- **Test isolation:** Move from session-scoped to per-file (or per-test-with-respawn) live_gui fixture +- **Observability:** Add `/api/gui_health` (done) + structured logging for all state mutations +- **Regression safety:** Audit all `__getattr__` / `__setattr__` / `__init__` for hidden contract assumptions +- **ImGui scope audit:** Make the static `check_imgui_scopes.py` more powerful (handle try/except, control flow, context managers) +- **Defer-not-catch pattern:** Per `conductor/workflow.md` known pitfall, audit all `imgui.*` calls for the "called before ImGui fully initialized" issue + +### 2. The `_UI_FLAG_DEFAULTS` allowlist (immediate) + +In the fix, I hardcoded an allowlist of `ui_` attrs that can return `None`. This is a maintenance burden — new `ui_` attrs added to `__init__` must also be added to this allowlist, or the test fixture will fail. + +**Better fix:** Use a class-level `_UI_FLAG_DEFAULTS` set OR detect them dynamically (e.g., from annotations in `__init__`). The current hardcoded set is fragile. + +### 3. The `_handle_reset_session` and other state-clearing paths + +The `AppController._handle_reset_session` clears many fields but not all. Tests that share state via the session-scoped fixture can carry over state from one test to the next. A future track should audit and complete the reset logic. + +### 4. Per-test or per-file `live_gui` fixture scope + +Per the `docs/reports/batch_resilience_plan_20260608.md`, the recommended approach is to either: +- Make the fixture per-file scoped (heavy but simple) +- Add a lazy re-spawn wrapper (lighter but more complex) +- Add a per-test autouse health check (lightest, but doesn't recover from subprocess death) + +The right answer depends on whether tests need cross-file state. The current 49+ live_gui tests should be audited for cross-file dependencies. + +### 5. The `live_gui` subprocess lifecycle + +The subprocess is killed via `taskkill /F /T` (force-kill). This is correct for production but means the subprocess can't clean up. A graceful shutdown signal (e.g., `os.kill(pid, signal.CTRL_C_EVENT)` to trigger the SIGINT handler) would allow clean teardown and better diagnostic output on the next session. + +### 6. Documentation: the `__getattr__` / `__setattr__` contract + +The fix in this session was possible because I read the `__getattr__` code. But the `__getattr__` / `__setattr__` pair is a non-obvious contract. The docstring should explicitly state: +- Which attributes are delegated to the controller +- What `hasattr()` should return for each +- The interaction with `setattr()` + +A future track should add explicit tests for the delegation contract, perhaps via property descriptors. + +--- + +## Proposed Track Name + +`test_infra_hardening_20260608` (or similar) + +## Proposed Track Phases + +### Phase 1: Audit (1-2 days) +- Catalog all `__getattr__` / `__setattr__` in the codebase +- Document the implicit contracts +- Identify other "silent failure" patterns (where a bug manifests 80s later in a different subsystem) + +### Phase 2: Refactor the `_UI_FLAG_DEFAULTS` (1 day) +- Move the hardcoded set to a class-level attribute +- OR detect from `__init__` annotations +- Add unit test that catches missing entries + +### Phase 3: live_gui fixture scope change (1-2 days) +- Audit all live_gui tests for cross-file state dependencies +- Change `live_gui` from session-scoped to per-file (or per-test-with-respawn) +- Add metrics for the cost (slowdown) + +### Phase 4: Improve check_imgui_scopes.py (2-3 days) +- Add support for try/except patterns +- Add support for control flow analysis +- Add a "render function entry/exit" tracking mode that runs the GUI for a frame and reports unbalanced scopes + +### Phase 5: Documentation and runbooks (1 day) +- Document the deferred-not-catch pattern in a code style guide +- Add a runbook for "the live_gui test failed — what to check" +- Update the `docs/reports/` to reflect the new infrastructure + +--- + +## Why This Track Is Worth Doing + +The bug fixed in this session was a 4-layer deep interaction: +1. `__getattr__` returning None (the wrong default) +2. `hasattr()` returning True because of (1) +3. `__setattr__` routing the assignment to the wrong place because of (2) +4. `imgui.checkbox` getting None because of (3) +5. The TypeError propagating without proper cleanup +6. The ImGui scope stack being unbalanced +7. The next frame triggering IM_ASSERT + +This is a fragility that will recur. The track prevents future bugs of this shape by: +- Making the contracts explicit (Phase 1) +- Eliminating the silent-failure pattern (Phase 2) +- Reducing the state surface shared between tests (Phase 3) +- Improving the static audit to catch scope issues early (Phase 4) + +--- + +## Related Commits + +- `bcdc26d0` (this session): The actual fix — `__getattr__` allowlist +- `51ecace4` (this session): PR3 pre-flight health check + planning docs +- `1c565da7` (this session): PR2 wrap + health endpoint +- `c9a991bb` (this session): timeout bump +- `4a338486` (this session): io_pool 4→8 +- `87d7c5bf` (this session): io_pool test assertion diff --git a/docs/reports/test_rag_batch_failure_status_20260608.md b/docs/reports/test_rag_batch_failure_status_20260608.md new file mode 100644 index 00000000..7efa4f15 --- /dev/null +++ b/docs/reports/test_rag_batch_failure_status_20260608.md @@ -0,0 +1,41 @@ +# Status Report: RAG Test Failure in Tier-3 Batch (2026-06-08) + +**TL;DR:** The `test_rag_phase4_final_verify` failure in the latest tier-3 batch is **NOT a regression from my `__getattr__` fix**. It's a pre-existing test isolation issue that was masked by the previous `IM_ASSERT` failure. + +**Reproduction:** +- Test passes in isolation (`pytest tests/test_rag_phase4_final_verify.py` → 6.80s PASS) +- Test fails after the 4 sims (`pytest tests/test_extended_sims.py tests/test_rag_phase4_final_verify.py` → 1 FAIL, 4 PASS) +- Test fails the same way on PRE-FIX HEAD (verified by `git stash` + rerun) + +**Failure modes observed across runs:** +1. `Status: error: Local RAG embeddings require sentence-transformers` — environmental, `sentence-transformers` package not installed +2. `Poll 0, status: Loaded track: Track B` — the controller's `ai_status` is showing track loading state from a prior sim's MMA workflow +3. `Poll 0, status: done` + later `RAG context not found in history` — the AI responded but didn't include the RAG-retrieved context in the message + +**Root cause:** The `live_gui` fixture is session-scoped. State accumulated from the 4 sims pollutes the controller in ways the RAG test doesn't anticipate. The sims: +- Change the `current_provider` to `gemini_cli` +- Set various `ui_*` attributes +- May leave pending MMA track workflows +- May modify `mma_streams`, `mma_epic_input`, etc. + +When the RAG test starts, the controller is in an unknown state. The test assumes a clean baseline. + +**Per the systematic-debugging skill (after 3+ fix attempts on a similar pattern):** + +This is a CLASSIC test isolation issue. It's the SAME architectural problem that the previous `IM_ASSERT` bug exposed — the shared session-scoped subprocess. My fix to the `__getattr__` bug fixed ONE specific bug that was making the test fail. The underlying test isolation issue is still there, and it surfaces as a different failure now. + +**Recommended fix (per `docs/reports/test_infra_hardening_foundation_20260608.md`):** + +The future track should: +1. Make the `live_gui` fixture per-file or per-test scoped (instead of session) +2. Add lazy re-spawn when the subprocess dies or is degraded +3. Or: have the RAG test explicitly reset state via a documented API before starting + +**Immediate workaround for the user:** The RAG test failure is NOT introduced by my fix. The user can: +- Continue with the IM_ASSERT fix merged (commit `bcdc26d0`) +- File a separate bug for the RAG test isolation issue +- Or, since my `__getattr__` fix unblocked `test_full_live_workflow`, the user may want to accept the RAG test failure as a separate issue + +**Related commits:** +- `bcdc26d0` (my fix): `__getattr__` allowlist — fixes the IM_ASSERT +- Pre-existing (no commit): RAG test batch isolation issue