docs: add test_infra_hardening foundation + RAG batch failure status
Foundation document for the future test_infra_hardening track that will address session-scoped live_gui fixture isolation, silent __getattr__/__setattr__ contract assumptions, and similar test infrastructure fragility. Also documents the test_rag_phase4_final_verify batch failure that surfaces after the __getattr__ fix unblocks test_full_live_workflow. The RAG test failure is NOT a regression - it reproduces on pre-fix HEAD too. It's a pre-existing test isolation issue (the live_gui fixture is session-scoped, so state from the 4 sims pollutes the controller).
This commit is contained in:
@@ -0,0 +1,140 @@
|
||||
# Future Track Foundation: Test Infrastructure Hardening (2026-06-08)
|
||||
|
||||
**Status:** Foundation document (pre-spec). Goal: outline the broader track that this work belongs to.
|
||||
|
||||
**Related:**
|
||||
- `docs/reports/test_full_live_workflow_imgui_assert_20260608.md` (initial root cause)
|
||||
- `docs/reports/test_full_live_workflow_propagation_digest_20260608.md` (solutions digest)
|
||||
- `docs/reports/test_full_live_workflow_progress_20260608_pm.md` (PR1+PR2+PR3 progress)
|
||||
- `docs/reports/batch_resilience_plan_20260608.md` (batch resilience plan)
|
||||
- `conductor/todos/TODO_test_full_live_workflow_v2.md` (task list)
|
||||
|
||||
---
|
||||
|
||||
## What Was Fixed (this session)
|
||||
|
||||
1. **PR1 (audit):** `scripts/check_imgui_scopes.py` found 3 false positives. Documented.
|
||||
2. **PR2 (wrap + health endpoint):** `immapp.run` is now wrapped in try/except. `/api/gui_health` exposes the controller's degraded state. Tests fail fast with clear messages on dirty state.
|
||||
3. **PR3 (pre-flight check):** `test_full_live_workflow` calls `client.get_gui_health()` at start. Fails fast with actionable message if the GUI is degraded.
|
||||
4. **PR1 follow-up (real fix):** The actual IM_ASSERT trigger was a double `__getattr__` bug:
|
||||
- `AppController.__getattr__` returned `None` for ANY `ui_` attribute (including ones not in `__init__`)
|
||||
- `App.__setattr__` checked `hasattr(self.controller, name)` to route assignments; the controller's buggy `__getattr__` made `hasattr` return True for all `ui_` attrs
|
||||
- The `if not hasattr(app, 'foo'): app.foo = False` pattern in `render_approve_script_modal` failed to initialize
|
||||
- `imgui.checkbox` was called with `None`, raised TypeError
|
||||
- The TypeError propagated without closing the ImGui modal, leaving the scope stack unbalanced
|
||||
- Next frame: IM_ASSERT(Missing End())
|
||||
5. **Fix:** `AppController.__getattr__` now only returns `None` for an explicit allowlist of `ui_` attrs that ARE defined in `__init__`. For any other missing attribute, raises `AttributeError`. Also added defense-in-depth in `App.__getattr__` to check `hasattr(controller, name)` before delegating.
|
||||
|
||||
**Result:** 4 sims + test_live_workflow + 2 markdown tests all pass in 87.80s. No IM_ASSERT. The test passes cleanly.
|
||||
|
||||
---
|
||||
|
||||
## What Is Still Open (Future Work)
|
||||
|
||||
### 1. Test Infrastructure Audit (the broader track)
|
||||
|
||||
The fixes this session addressed ONE bug that was making `test_full_live_workflow` fail. The user asked: "continue with trying to finally cure the test infra with a strong foundation for the future track."
|
||||
|
||||
**The broader concern:** The test infrastructure has accumulated complexity and implicit assumptions. The `live_gui` fixture is session-scoped, the controller's state is shared across 49+ tests, and small bugs in `__getattr__` / `__setattr__` cascade into mysterious failures 80 seconds later.
|
||||
|
||||
**Recommended track scope:**
|
||||
- **Test isolation:** Move from session-scoped to per-file (or per-test-with-respawn) live_gui fixture
|
||||
- **Observability:** Add `/api/gui_health` (done) + structured logging for all state mutations
|
||||
- **Regression safety:** Audit all `__getattr__` / `__setattr__` / `__init__` for hidden contract assumptions
|
||||
- **ImGui scope audit:** Make the static `check_imgui_scopes.py` more powerful (handle try/except, control flow, context managers)
|
||||
- **Defer-not-catch pattern:** Per `conductor/workflow.md` known pitfall, audit all `imgui.*` calls for the "called before ImGui fully initialized" issue
|
||||
|
||||
### 2. The `_UI_FLAG_DEFAULTS` allowlist (immediate)
|
||||
|
||||
In the fix, I hardcoded an allowlist of `ui_` attrs that can return `None`. This is a maintenance burden — new `ui_` attrs added to `__init__` must also be added to this allowlist, or the test fixture will fail.
|
||||
|
||||
**Better fix:** Use a class-level `_UI_FLAG_DEFAULTS` set OR detect them dynamically (e.g., from annotations in `__init__`). The current hardcoded set is fragile.
|
||||
|
||||
### 3. The `_handle_reset_session` and other state-clearing paths
|
||||
|
||||
The `AppController._handle_reset_session` clears many fields but not all. Tests that share state via the session-scoped fixture can carry over state from one test to the next. A future track should audit and complete the reset logic.
|
||||
|
||||
### 4. Per-test or per-file `live_gui` fixture scope
|
||||
|
||||
Per the `docs/reports/batch_resilience_plan_20260608.md`, the recommended approach is to either:
|
||||
- Make the fixture per-file scoped (heavy but simple)
|
||||
- Add a lazy re-spawn wrapper (lighter but more complex)
|
||||
- Add a per-test autouse health check (lightest, but doesn't recover from subprocess death)
|
||||
|
||||
The right answer depends on whether tests need cross-file state. The current 49+ live_gui tests should be audited for cross-file dependencies.
|
||||
|
||||
### 5. The `live_gui` subprocess lifecycle
|
||||
|
||||
The subprocess is killed via `taskkill /F /T` (force-kill). This is correct for production but means the subprocess can't clean up. A graceful shutdown signal (e.g., `os.kill(pid, signal.CTRL_C_EVENT)` to trigger the SIGINT handler) would allow clean teardown and better diagnostic output on the next session.
|
||||
|
||||
### 6. Documentation: the `__getattr__` / `__setattr__` contract
|
||||
|
||||
The fix in this session was possible because I read the `__getattr__` code. But the `__getattr__` / `__setattr__` pair is a non-obvious contract. The docstring should explicitly state:
|
||||
- Which attributes are delegated to the controller
|
||||
- What `hasattr()` should return for each
|
||||
- The interaction with `setattr()`
|
||||
|
||||
A future track should add explicit tests for the delegation contract, perhaps via property descriptors.
|
||||
|
||||
---
|
||||
|
||||
## Proposed Track Name
|
||||
|
||||
`test_infra_hardening_20260608` (or similar)
|
||||
|
||||
## Proposed Track Phases
|
||||
|
||||
### Phase 1: Audit (1-2 days)
|
||||
- Catalog all `__getattr__` / `__setattr__` in the codebase
|
||||
- Document the implicit contracts
|
||||
- Identify other "silent failure" patterns (where a bug manifests 80s later in a different subsystem)
|
||||
|
||||
### Phase 2: Refactor the `_UI_FLAG_DEFAULTS` (1 day)
|
||||
- Move the hardcoded set to a class-level attribute
|
||||
- OR detect from `__init__` annotations
|
||||
- Add unit test that catches missing entries
|
||||
|
||||
### Phase 3: live_gui fixture scope change (1-2 days)
|
||||
- Audit all live_gui tests for cross-file state dependencies
|
||||
- Change `live_gui` from session-scoped to per-file (or per-test-with-respawn)
|
||||
- Add metrics for the cost (slowdown)
|
||||
|
||||
### Phase 4: Improve check_imgui_scopes.py (2-3 days)
|
||||
- Add support for try/except patterns
|
||||
- Add support for control flow analysis
|
||||
- Add a "render function entry/exit" tracking mode that runs the GUI for a frame and reports unbalanced scopes
|
||||
|
||||
### Phase 5: Documentation and runbooks (1 day)
|
||||
- Document the deferred-not-catch pattern in a code style guide
|
||||
- Add a runbook for "the live_gui test failed — what to check"
|
||||
- Update the `docs/reports/` to reflect the new infrastructure
|
||||
|
||||
---
|
||||
|
||||
## Why This Track Is Worth Doing
|
||||
|
||||
The bug fixed in this session was a 4-layer deep interaction:
|
||||
1. `__getattr__` returning None (the wrong default)
|
||||
2. `hasattr()` returning True because of (1)
|
||||
3. `__setattr__` routing the assignment to the wrong place because of (2)
|
||||
4. `imgui.checkbox` getting None because of (3)
|
||||
5. The TypeError propagating without proper cleanup
|
||||
6. The ImGui scope stack being unbalanced
|
||||
7. The next frame triggering IM_ASSERT
|
||||
|
||||
This is a fragility that will recur. The track prevents future bugs of this shape by:
|
||||
- Making the contracts explicit (Phase 1)
|
||||
- Eliminating the silent-failure pattern (Phase 2)
|
||||
- Reducing the state surface shared between tests (Phase 3)
|
||||
- Improving the static audit to catch scope issues early (Phase 4)
|
||||
|
||||
---
|
||||
|
||||
## Related Commits
|
||||
|
||||
- `bcdc26d0` (this session): The actual fix — `__getattr__` allowlist
|
||||
- `51ecace4` (this session): PR3 pre-flight health check + planning docs
|
||||
- `1c565da7` (this session): PR2 wrap + health endpoint
|
||||
- `c9a991bb` (this session): timeout bump
|
||||
- `4a338486` (this session): io_pool 4→8
|
||||
- `87d7c5bf` (this session): io_pool test assertion
|
||||
@@ -0,0 +1,41 @@
|
||||
# Status Report: RAG Test Failure in Tier-3 Batch (2026-06-08)
|
||||
|
||||
**TL;DR:** The `test_rag_phase4_final_verify` failure in the latest tier-3 batch is **NOT a regression from my `__getattr__` fix**. It's a pre-existing test isolation issue that was masked by the previous `IM_ASSERT` failure.
|
||||
|
||||
**Reproduction:**
|
||||
- Test passes in isolation (`pytest tests/test_rag_phase4_final_verify.py` → 6.80s PASS)
|
||||
- Test fails after the 4 sims (`pytest tests/test_extended_sims.py tests/test_rag_phase4_final_verify.py` → 1 FAIL, 4 PASS)
|
||||
- Test fails the same way on PRE-FIX HEAD (verified by `git stash` + rerun)
|
||||
|
||||
**Failure modes observed across runs:**
|
||||
1. `Status: error: Local RAG embeddings require sentence-transformers` — environmental, `sentence-transformers` package not installed
|
||||
2. `Poll 0, status: Loaded track: Track B` — the controller's `ai_status` is showing track loading state from a prior sim's MMA workflow
|
||||
3. `Poll 0, status: done` + later `RAG context not found in history` — the AI responded but didn't include the RAG-retrieved context in the message
|
||||
|
||||
**Root cause:** The `live_gui` fixture is session-scoped. State accumulated from the 4 sims pollutes the controller in ways the RAG test doesn't anticipate. The sims:
|
||||
- Change the `current_provider` to `gemini_cli`
|
||||
- Set various `ui_*` attributes
|
||||
- May leave pending MMA track workflows
|
||||
- May modify `mma_streams`, `mma_epic_input`, etc.
|
||||
|
||||
When the RAG test starts, the controller is in an unknown state. The test assumes a clean baseline.
|
||||
|
||||
**Per the systematic-debugging skill (after 3+ fix attempts on a similar pattern):**
|
||||
|
||||
This is a CLASSIC test isolation issue. It's the SAME architectural problem that the previous `IM_ASSERT` bug exposed — the shared session-scoped subprocess. My fix to the `__getattr__` bug fixed ONE specific bug that was making the test fail. The underlying test isolation issue is still there, and it surfaces as a different failure now.
|
||||
|
||||
**Recommended fix (per `docs/reports/test_infra_hardening_foundation_20260608.md`):**
|
||||
|
||||
The future track should:
|
||||
1. Make the `live_gui` fixture per-file or per-test scoped (instead of session)
|
||||
2. Add lazy re-spawn when the subprocess dies or is degraded
|
||||
3. Or: have the RAG test explicitly reset state via a documented API before starting
|
||||
|
||||
**Immediate workaround for the user:** The RAG test failure is NOT introduced by my fix. The user can:
|
||||
- Continue with the IM_ASSERT fix merged (commit `bcdc26d0`)
|
||||
- File a separate bug for the RAG test isolation issue
|
||||
- Or, since my `__getattr__` fix unblocked `test_full_live_workflow`, the user may want to accept the RAG test failure as a separate issue
|
||||
|
||||
**Related commits:**
|
||||
- `bcdc26d0` (my fix): `__getattr__` allowlist — fixes the IM_ASSERT
|
||||
- Pre-existing (no commit): RAG test batch isolation issue
|
||||
Reference in New Issue
Block a user