test(live_workflow): pre-flight health check fails fast on dirty state

PR3 of the test_full_live_workflow_imgui_assert fix sequence.

When a prior live_gui test in the same session crashes the GUI (e.g. via an ImGui IM_ASSERT from cumulative panel state), the controller's _io_pool gets shut down. The next test starts in a degraded state but only discovers this 120s later when its project switch times out with a confusing 'cannot schedule new futures after shutdown' error.

This commit adds a /api/gui_health pre-flight check at the start of test_full_live_workflow. If the GUI is degraded, the test fails fast (within 1s) with a clear, actionable message that includes:
- The exact RuntimeError that caused the degradation
- The full traceback of the last ImGui scope mismatch
- A note that the new test cannot proceed with a dirty state

Per user feedback 2026-06-08: 'I don't want a batch to be too fragile where I can't restart the app and continue with the next test file if it fails. Just has to note that the new file didn't get to deal with a dirty state.'

Also includes the planning documents written earlier in this session:
- TODO_test_full_live_workflow_v2.md (task list)
- test_full_live_workflow_imgui_assert_20260608.md (root cause report)
- test_full_live_workflow_propagation_digest_20260608.md (solutions digest)
- batch_resilience_plan_20260608.md (batch resilience plan)

Verification:
- test_full_live_workflow in isolation: 13.45s PASS (health=True, no degrade)
- 4 sims + test_full_live_workflow in batch: 76.46s (1 FAIL fast, 4 sims PASS)
- Without PR3 fix: 200s FAIL with confusing 120s timeout
- With PR3 fix: 76s FAIL with clear 'GUI is degraded' message
- The fast-fail is observable, not silent (per user's 'wrap might be worth it if that properly lets us handle the assert')
2026-06-08 21:17:54 -04:00
parent 8a597d1832
commit 51ecace464
5 changed files with 1077 additions and 0 deletions
@@ -0,0 +1,156 @@
+# TODO: Fix test_full_live_workflow — ImGui IM_ASSERT root cause + batch resilience
+
+**Report:** `docs/reports/test_full_live_workflow_imgui_assert_20260608.md` (v2, supersedes v1)
+**Predecessor:** `conductor/todos/TODO_test_full_live_workflow.md` (Tasks 1, 2, 4, 5, 6 SHIPPED; Tasks 3, 7 remaining and still relevant)
+**Status:** NEW. No tasks started. Awaiting user direction on which solution to implement first.
+**Failure reproducibility:** 100% in tier-3 batch (5+ live_gui tests, ~200s total), 0% in isolation
+
+---
+
+## The Real Root Cause (per v2 report)
+
+The test's `_do_project_switch` runs in ~8-10ms — it is NOT slow. The test fails because:
+
+1. Some `render_*` function has an ImGui scope mismatch (`begin()` without matching `end()`)
+2. After 4 sims have rendered their panels, the cumulative state triggers an `IM_ASSERT((0) && "Missing End()")` from imgui.cpp:11662 in window 'MainDockSpace', about 71.5s into the GUI's lifetime
+3. The `RuntimeError` from `immapp.run` propagates up through `app.run()` and `main()`
+4. The exception causes the controller's `_io_pool` to shut down (likely via `ThreadPoolExecutor.__del__` during GC, or via the `app.shutdown()` path if `immapp.run` caught the error internally and returned)
+5. The hook server thread keeps running (it's a separate `ThreadingHTTPServer` in `src/api_hooks.py`)
+6. The test's `btn_project_new_automated` click hits the click handler, which calls `submit_io(self._do_project_switch, path)`, which throws `RuntimeError: cannot schedule new futures after shutdown`
+7. The test's `wait_for_project_switch` polls `/api/project_switch_status` 1200+ times in 120s and times out
+
+The project-switch timeout is a symptom, not the cause: `_do_project_switch` itself is never scheduled because `submit_io` raises first.
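The error in step 6 is standard `concurrent.futures` behaviour and can be reproduced outside the app. A minimal sketch, using a throwaway executor as a stand-in for the controller's `_io_pool` (the function name is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor


def submit_after_shutdown() -> str:
    """Return the error message raised when submitting to a shut-down pool."""
    # Stand-in for the controller's _io_pool; not the real object.
    pool = ThreadPoolExecutor(max_workers=1)
    pool.shutdown(wait=True)
    try:
        pool.submit(print, "never runs")
    except RuntimeError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(submit_after_shutdown())
```

Once the executor is shut down, every later `submit` fails immediately, which is why the click handler cannot start the project switch and the test can only observe a timeout.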
+
+---
+
+## Tasks (ordered by dependency)
+
+### 1. [HIGH] Run `scripts/check_imgui_scopes.py` to identify the scope mismatch
+
+- **What:** Invoke the existing audit script against `src/gui_2.py` and any other ImGui-rendering files. Look for `begin()` calls without a matching `end()` in the same scope.
+- **Where:** `scripts/check_imgui_scopes.py` (existing), `src/gui_2.py` (90+ render functions).
+- **Why:** This is the real fix. The script exists for exactly this purpose but hasn't been run against the recent render additions.
+- **Pattern:** Per `conductor/workflow.md`: "Mandatory ImGui Verification: All changes to the GUI (gui_2.py) MUST be verified using the custom AST linter (scripts/check_imgui_scopes.py) to ensure all ImGui scopes (begin/end, push/pop) are properly matched."
+- **Acceptance:** Audit output identifies the specific `render_*` function and line number(s) with the unbalanced scope. Documented in the report.
+- **Effort:** 1-2 hours (audit run + manual triage of findings).
+- **Risk:** Medium. Findings may be in render paths that are only exercised by specific sim combinations. Need careful triage.
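To illustrate the kind of check Task 1 relies on, here is a simplified sketch of an AST-based balance count. It is not the contents of `scripts/check_imgui_scopes.py`; the `OPENERS` table and the `unbalanced_scopes` helper are hypothetical, and a plain call count cannot see early returns or conditional branches.

```python
import ast

# Illustrative subset of opener -> closer pairs; the real linter may cover more.
OPENERS = {"begin": "end", "begin_child": "end_child"}


def unbalanced_scopes(source: str) -> list[tuple[str, str, int, int]]:
    """Return (function, opener, opens, closes) where the call counts differ."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        counts: dict[str, int] = {}
        # Count attribute-style calls such as imgui.begin(...) inside the function.
        for call in ast.walk(node):
            if isinstance(call, ast.Call) and isinstance(call.func, ast.Attribute):
                counts[call.func.attr] = counts.get(call.func.attr, 0) + 1
        for opener, closer in OPENERS.items():
            opens, closes = counts.get(opener, 0), counts.get(closer, 0)
            if opens != closes:
                findings.append((node.name, opener, opens, closes))
    return findings


SAMPLE = '''
def render_ok(imgui):
    imgui.begin("A")
    imgui.end()

def render_leaky(imgui):
    imgui.begin("B")
    imgui.text("no end here")
'''

if __name__ == "__main__":
    for finding in unbalanced_scopes(SAMPLE):
        print(finding)
```

A count-based pass like this only narrows the search; findings still need the manual triage the task describes.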
+
+### 2. [HIGH] Fix the identified ImGui scope mismatch
+
+- **What:** Once Task 1 identifies the function, add the missing `end()` (or remove the spurious `begin()`).
+- **Where:** TBD by Task 1. Likely in a `render_*` function called from `_gui_func` → `_render_main_interface` → some panel.
+- **Why:** This is the actual bug. All other tasks are workarounds.
+- **Acceptance:**
+  - `IM_ASSERT` no longer fires in any test batch combination
+  - All existing tests still pass (no regression)
+  - `test_full_live_workflow` passes in tier-3 batch (the goal)
+- **Effort:** 1-4 hours depending on what Task 1 finds.
+- **Risk:** Medium. A wrong fix could break other tests. The offending render path may need the defer-not-catch pattern (a known pitfall per `conductor/workflow.md`).
+- **Depends on:** Task 1.
+
+### 3. [MED] Wrap `immapp.run` in `try/except RuntimeError` in `gui_2.py:618`
+
+- **What:** Catch the `RuntimeError` that `immapp.run` raises when an IM_ASSERT fires (or any other `RuntimeError` it raises), log it, and return gracefully so the process doesn't die.
+- **Where:** `src/gui_2.py:618`.
+- **Why:** Per user: "the wrap might be worth it if that properly lets us handle the assert." A proper wrap logs the assert, marks the GUI as degraded, and lets the hook server keep serving (so tests can complete their work). It is NOT a silent swallow — the error is logged at ERROR level and exposed via a new endpoint.
+- **Acceptance:**
+  - When IM_ASSERT fires, the subprocess stays alive
+  - The `_io_pool` is NOT shut down by the exception (or is re-created lazily — see Task 5)
+  - A new `/api/gui_health` endpoint (specified in Task 6) returns `{"healthy": false, "last_assert": "..."}` so tests can detect the state
+  - The log includes the full assert message + stack trace at ERROR level
+- **Effort:** 1-2 hours. The wrap is simple. The endpoint + logging is straightforward.
+- **Risk:** Low. The wrap is a band-aid, but it properly handles the failure (logs it, surfaces it) rather than swallowing silently.
+- **Depends on:** None. Can be done in parallel with Tasks 1+2. Belongs in the same PR as the fix or as a separate hardening PR.
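The wrap in Task 3 could look roughly like the sketch below. It assumes the GUI loop is passed in as a plain callable (standing in for the `immapp.run` call) and that degraded state lives in a small holder object; `GuiHealth` and `run_gui_guarded` are hypothetical names, not existing code.

```python
import logging
import traceback
from dataclasses import dataclass
from typing import Callable, Optional

log = logging.getLogger("gui")


@dataclass
class GuiHealth:
    """Hypothetical holder for the degraded-state fields."""

    degraded_reason: Optional[str] = None
    last_assert: Optional[str] = None

    @property
    def healthy(self) -> bool:
        return self.degraded_reason is None


def run_gui_guarded(run_gui: Callable[[], None], health: GuiHealth) -> None:
    """Run the GUI loop; on RuntimeError, log it and mark the GUI degraded."""
    try:
        run_gui()
    except RuntimeError as exc:
        # Record the failure so it can be surfaced later, never swallowed silently.
        health.degraded_reason = str(exc)
        health.last_assert = traceback.format_exc()
        log.error("GUI loop aborted: %s", exc, exc_info=True)
```

Because the function returns normally after logging, the rest of the process (including the hook server thread) keeps running and can report the recorded state.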
+
+### 4. [MED] Add batch-level test isolation (kill+restart sloppy.py per file)
+
+- **What:** Modify `scripts/run_tests_batched.py` to kill the `live_gui` subprocess at the end of each test file (or at the start of a new one), so a failing test file doesn't poison subsequent test files.
+- **Where:** `scripts/run_tests_batched.py` (existing batch runner).
+- **Why:** Per user: "I also don't want a batch to be too fragile where I can't restart the app and continue with the next test file if it fails. Just has to note that the new file didn't get to deal with a dirty state."
+- **Pattern:** A failing batch should not block subsequent batches. The user wants to be able to run a batch, see it fail, run the next batch, and have it start clean.
+- **Acceptance:**
+  - When a test file fails, the runner logs a clear "batch N failed; next batch will restart the app" message
+  - The next batch's `live_gui` fixture spawns a fresh `sloppy.py` subprocess (or detects the old one is dead and spawns a new one)
+  - No "dirty state" from a prior failed batch leaks into the next batch
+  - The batch runner continues to the next batch automatically (no user intervention needed)
+- **Effort:** 2-4 hours. Requires understanding the current batch runner's lifecycle and modifying the `live_gui` fixture to handle "previous subprocess died, start a new one".
+- **Risk:** Low. The conftest's `live_gui` fixture is currently session-scoped; making it module-scoped (one instance per test file), or function-scoped with batch-aware session reuse, is a small change.
+- **Depends on:** None. Can be done in parallel with the other tasks.
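One possible shape for the per-file restart in Task 4 is sketched below. The command is a placeholder for launching the app under test, and `fresh_app` and `run_files` are hypothetical helpers; the existing runner and fixture may be organised quite differently.

```python
import subprocess
import sys
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

# Placeholder command; a real runner would launch the app under test instead.
APP_CMD = (sys.executable, "-c", "import time; time.sleep(60)")


@contextmanager
def fresh_app(cmd: Sequence[str] = APP_CMD) -> Iterator[subprocess.Popen]:
    """Start a new app process and always stop it on exit, pass or fail."""
    proc = subprocess.Popen(list(cmd))
    try:
        yield proc
    finally:
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def run_files(
    test_files: Sequence[str],
    run_one: Callable[[str, subprocess.Popen], bool],
) -> dict[str, bool]:
    """Run each file against its own app instance; a failure never stops the loop."""
    results: dict[str, bool] = {}
    for path in test_files:
        with fresh_app() as proc:
            try:
                results[path] = bool(run_one(path, proc))
            except Exception:
                results[path] = False
        if not results[path]:
            print(f"{path} failed; the next file starts with a fresh app")
    return results
```

Tying the process lifetime to a context manager means a crash in one file cannot leave a degraded app behind for the next one.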
+
+### 5. [LOW] Make `submit_io` recover from a shut-down pool
+
+- **What:** In `submit_io`, if `self._io_pool` is shut down, recreate it lazily.
+- **Where:** `src/app_controller.py:2257-2284` (current `submit_io` body).
+- **Why:** Defense in depth. If the GUI crashes and shuts down the pool, the test can still submit work after the wrap (Task 3) catches the exception. Without this, the controller is permanently dead.
+- **Acceptance:**
+  - After a GUI crash + `immapp.run` recovery, `submit_io` works again
+  - No new threading issues (the recreated pool has the same semantics)
+  - Inflight counter (`_io_pool_inflight`) is reset
+- **Effort:** 30 minutes.
+- **Risk:** Low. Standard lazy-recreation pattern. The pool was already designed to be replaceable.
+- **Depends on:** None.
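A minimal sketch of the lazy-recreation idea in Task 5, assuming recovery is triggered by the `RuntimeError` that `submit` raises on a shut-down executor. `RecoveringIoPool` is a hypothetical wrapper, not the current `submit_io` body.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable


class RecoveringIoPool:
    """Hypothetical wrapper: recreates the executor if it has been shut down."""

    def __init__(self, max_workers: int = 4) -> None:
        self._max_workers = max_workers
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self.recreated = 0  # how many times the pool had to be rebuilt

    def submit_io(self, fn: Callable[..., Any], *args: Any) -> Future:
        with self._lock:
            try:
                return self._pool.submit(fn, *args)
            except RuntimeError:
                # submit() raises RuntimeError once the pool is shut down;
                # build a replacement and retry once.
                self._pool = ThreadPoolExecutor(max_workers=self._max_workers)
                self.recreated += 1
                return self._pool.submit(fn, *args)

    def shutdown(self) -> None:
        with self._lock:
            self._pool.shutdown(wait=True)
```

A real version would also reset the inflight counter (`_io_pool_inflight`) at the point where the pool is replaced, as the acceptance criteria require.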
+
+### 6. [LOW] Add `/api/gui_health` endpoint with degraded-state info
+
+- **What:** New endpoint returning `{"healthy": bool, "degraded_reason": str | null, "last_assert": str | null, "io_pool_alive": bool}`.
+- **Where:** `src/api_hooks.py` (add new `elif` branch) + `src/app_controller.py` (add `self._gui_degraded_reason` and `self._last_imgui_assert` state).
+- **Why:** Per Task 3, the wrap logs the assert. The endpoint exposes the state to tests so they can detect a degraded GUI and fail with a clear message ("GUI is degraded due to IM_ASSERT; skipping test") rather than a confusing timeout.
+- **Acceptance:**
+  - Endpoint returns 200 with the health dict
+  - Tests can call `client.get_gui_health()` and check `healthy == False` to detect a degraded GUI
+  - `tests/test_live_workflow.py` checks the health before starting and fails fast with a clear message if degraded
+- **Effort:** 1-2 hours.
+- **Risk:** Low. Read-only endpoint.
+- **Depends on:** Task 3.
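The payload from Task 6 and the pre-flight check that consumes it might be sketched as follows. The field names come from the task; the rule that `healthy` requires both no degraded reason and a live pool is an assumption, and `build_health` and `require_healthy_gui` are hypothetical helpers.

```python
from typing import Optional, TypedDict


class GuiHealthPayload(TypedDict):
    healthy: bool
    degraded_reason: Optional[str]
    last_assert: Optional[str]
    io_pool_alive: bool


def build_health(
    degraded_reason: Optional[str],
    last_assert: Optional[str],
    io_pool_alive: bool,
) -> GuiHealthPayload:
    """Assemble the response body described in Task 6."""
    return {
        # Assumption: healthy means no recorded failure and a usable pool.
        "healthy": degraded_reason is None and io_pool_alive,
        "degraded_reason": degraded_reason,
        "last_assert": last_assert,
        "io_pool_alive": io_pool_alive,
    }


def require_healthy_gui(health: GuiHealthPayload) -> None:
    """Pre-flight check: raise immediately if the GUI is degraded."""
    if health["healthy"]:
        return
    raise AssertionError(
        "GUI is degraded; this test cannot proceed with a dirty state.\n"
        f"reason: {health['degraded_reason']}\n"
        f"io_pool_alive: {health['io_pool_alive']}\n"
        f"last_assert:\n{health['last_assert']}"
    )
```

In a test, the dict would come from the `/api/gui_health` response and the check would run before the first click, so a degraded GUI is reported in about a second instead of after a 120s timeout.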
+
+---
+
+## Tasks Inherited from Predecessor TODO (still relevant)
+
+These are from `conductor/todos/TODO_test_full_live_workflow.md`, where they were marked as not yet shipped (Tasks 3 and 7 in that file, renumbered here):
+
+### 7. [MED] Replace `os.path.abspath("tests/artifacts/temp_project.toml")` with fixture-provided path
+
+- **What:** Have the `live_gui` fixture provide `temp_project_path` (str) derived from its own `temp_workspace` directory.
+- **Where:** `tests/conftest.py` (live_gui fixture) + `tests/test_live_workflow.py:79`.
+- **Why:** cwd-relative path is fragile; fixture-relative path is stable. Per the v1 report's Cause 1.
+- **Acceptance:** Test does `temp_project_path = live_gui_temp_project_path` (or accesses it as a fixture attribute). No more `os.path.abspath("tests/artifacts/...")`.
+- **Effort:** 30 minutes.
+- **Risk:** Low.
+
+### 8. [LOW] Add `tests/.test_durations.json` recording in CI / dev convenience
+
+- **What:** Add a dev-mode shortcut to record durations once the fix lands (e.g. `python scripts/run_tests_batched.py --durations`).
+- **Where:** `scripts/run_tests_batched.py` (already has `--durations` flag; just need a one-time run + commit).
+- **Why:** The categorizer uses `.test_durations.json` for `speed` auto-inference. Currently all files default to MEDIUM speed.
+- **Acceptance:** `tests/.test_durations.json` exists and has timing data for all 295+ tests.
+- **Effort:** 5 minutes (run + commit).
+- **Risk:** Low.
+
+---
+
+## Order of Work (recommended)
+
+1. **Tasks 1 + 2 first** — find and fix the ImGui scope mismatch. This is the real fix. If successful, Tasks 3, 4, 5, 6 may be unnecessary (or become hardening improvements rather than bug fixes).
+2. **Task 3 in parallel** — wrap `immapp.run` so the assert doesn't kill the process. Even if Task 2 succeeds, the wrap is a good safety net for future scope bugs.
+3. **Task 4** — batch-level isolation. Independent of the ImGui fix; improves robustness for ALL tests.
+4. **Tasks 5, 6** — defense in depth. Only valuable if Tasks 1+2 don't fully fix the issue OR as ongoing hardening.
+5. **Tasks 7, 8** — unrelated cleanup. Do in a separate small commit/PR.
+
+## Estimated Time
+
+- Tasks 1+2: 2-6 hours (real fix, may require investigation)
+- Task 3: 1-2 hours (band-aid, but proper one)
+- Task 4: 2-4 hours (batch resilience)
+- Tasks 5+6: 1.5-2.5 hours combined (defense in depth)
+- Tasks 7+8: ~35 minutes combined (cleanup)
+- **Total: roughly 7-15 hours**
+
+## Verification
+
+After fix:
+- `uv run python scripts/run_tests_batched.py --tiers 3 --no-xdist --no-color` shows `<<< tier-3-live_gui PASS`
+- `uv run pytest tests/test_live_workflow.py` still PASSes in isolation
+- `uv run pytest tests/test_live_workflow.py tests/test_extended_sims.py` (siblings) PASSes
+- A failing batch does NOT prevent the next batch from running with a clean state
+- Failure message on real regression is clear and actionable (e.g. "GUI degraded: IM_ASSERT(Missing End()) in render_X; skipping test")