test(live_workflow): pre-flight health check fails fast on dirty state
PR3 of the test_full_live_workflow_imgui_assert fix sequence. When a prior live_gui test in the same session crashes the GUI (e.g. via an ImGui IM_ASSERT from cumulative panel state), the controller's _io_pool gets shut down. The next test starts in a degraded state but only discovers this 120s later when its project switch times out with a confusing 'cannot schedule new futures after shutdown' error. This commit adds a /api/gui_health pre-flight check at the start of test_full_live_workflow. If the GUI is degraded, the test fails fast (within 1s) with a clear, actionable message that includes: - The exact RuntimeError that caused the degradation - The full traceback of the last ImGui scope mismatch - A note that the new test cannot proceed with a dirty state Per user feedback 2026-06-08: 'I don't want a batch to be too fragile where I can't restart the app and continue with the next test file if it fails. Just has to note that the new file didn't get to deal with a dirty state.' Also includes the planning documents written earlier in this session: - TODO_test_full_live_workflow_v2.md (task list) - test_full_live_workflow_imgui_assert_20260608.md (root cause report) - test_full_live_workflow_propagation_digest_20260608.md (solutions digest) - batch_resilience_plan_20260608.md (batch resilience plan) Verification: - test_full_live_workflow in isolation: 13.45s PASS (health=True, no degrade) - 4 sims + test_full_live_workflow in batch: 76.46s (1 FAIL fast, 4 sims PASS) - Without PR3 fix: 200s FAIL with confusing 120s timeout - With PR3 fix: 76s FAIL with clear 'GUI is degraded' message - The fast-fail is observable, not silent (per user's 'wrap might be worth it if that properly lets us handle the assert')
This commit is contained in:
@@ -0,0 +1,156 @@
|
||||
# TODO: Fix test_full_live_workflow — ImGui IM_ASSERT root cause + batch resilience
|
||||
|
||||
**Report:** `docs/reports/test_full_live_workflow_imgui_assert_20260608.md` (v2, supersedes v1)
|
||||
**Predecessor:** `conductor/todos/TODO_test_full_live_workflow.md` (Tasks 1, 2, 4, 5, 6 SHIPPED; Tasks 3, 7 remaining and still relevant)
|
||||
**Status:** NEW. No tasks started. Awaiting user direction on which solution to implement first.
|
||||
**Failure reproducibility:** 100% in tier-3 batch (5+ live_gui tests, ~200s total), 0% in isolation
|
||||
|
||||
---
|
||||
|
||||
## The Real Root Cause (per v2 report)
|
||||
|
||||
The test's `_do_project_switch` runs in ~8-10ms — it is NOT slow. The test fails because:
|
||||
|
||||
1. Some `render_*` function has an ImGui scope mismatch (`begin()` without matching `end()`)
|
||||
2. After 4 sims have rendered their panels, the cumulative state triggers an `IM_ASSERT((0) && "Missing End()")` from imgui.cpp:11662 in window 'MainDockSpace' at frame ~71.5s into GUI lifetime
|
||||
3. The `RuntimeError` from `immapp.run` propagates up through `app.run()` and `main()`
|
||||
4. The exception causes the controller's `_io_pool` to shut down (likely via `ThreadPoolExecutor.__del__` during GC, or via the `app.shutdown()` path if `immapp.run` internally caught and returned)
|
||||
5. The hook server thread keeps running (it's a separate `ThreadingHTTPServer` in `src/api_hooks.py`)
|
||||
6. The test's `btn_project_new_automated` click hits the click handler, which calls `submit_io(self._do_project_switch, path)`, which throws `RuntimeError: cannot schedule new futures after shutdown`
|
||||
7. The test's `wait_for_project_switch` polls `/api/project_switch_status` 1200+ times in 120s and times out
|
||||
|
||||
The `_do_project_switch` is a symptom, not the cause.
|
||||
|
||||
---
|
||||
|
||||
## Tasks (ordered by dependency)
|
||||
|
||||
### 1. [HIGH] Run `scripts/check_imgui_scopes.py` to identify the scope mismatch
|
||||
|
||||
- **What:** Invoke the existing audit script against `src/gui_2.py` and any other ImGui-rendering files. Look for `begin()` calls without a matching `end()` in the same scope.
|
||||
- **Where:** `scripts/check_imgui_scopes.py` (existing), `src/gui_2.py` (90+ render functions).
|
||||
- **Why:** This is the real fix. The script exists for exactly this purpose but hasn't been run against the recent render additions.
|
||||
- **Pattern:** Per `conductor/workflow.md`: "Mandatory ImGui Verification: All changes to the GUI (gui_2.py) MUST be verified using the custom AST linter (scripts/check_imgui_scopes.py) to ensure all ImGui scopes (begin/end, push/pop) are properly matched."
|
||||
- **Acceptance:** Audit output identifies the specific `render_*` function and line number(s) with the unbalanced scope. Documented in the report.
|
||||
- **Effort:** 1-2 hours (audit run + manual triage of findings).
|
||||
- **Risk:** Medium. Findings may be in render paths that are only exercised by specific sim combinations. Need careful triage.
|
||||
|
||||
### 2. [HIGH] Fix the identified ImGui scope mismatch
|
||||
|
||||
- **What:** Once Task 1 identifies the function, add the missing `end()` (or remove the spurious `begin()`).
|
||||
- **Where:** TBD by Task 1. Likely in a `render_*` function called from `_gui_func` → `_render_main_interface` → some panel.
|
||||
- **Why:** This is the actual bug. All other tasks are workarounds.
|
||||
- **Acceptance:**
|
||||
- `IM_ASSERT` no longer fires in any test batch combination
|
||||
- All existing tests still pass (no regression)
|
||||
- `test_full_live_workflow` passes in tier-3 batch (the goal)
|
||||
- **Effort:** 1-4 hours depending on what Task 1 finds.
|
||||
- **Risk:** Medium. A wrong fix could break other tests. May need to add defer-not-catch pattern (per `conductor/workflow.md` known pitfall) for the offending render path.
|
||||
- **Depends on:** Task 1.
|
||||
|
||||
### 3. [MED] Wrap `immapp.run` in `try/except RuntimeError` in `gui_2.py:618`
|
||||
|
||||
- **What:** Catch the IM_ASSERT (or any `RuntimeError` from `immapp.run`), log it, and return gracefully so the process doesn't die.
|
||||
- **Where:** `src/gui_2.py:618`.
|
||||
- **Why:** Per user: "the wrap might be worth it if that properly lets us handle the assert." A proper wrap logs the assert, marks the GUI as degraded, and lets the hook server keep serving (so tests can complete their work). It is NOT a silent swallow — the error is logged at ERROR level and exposed via a new endpoint.
|
||||
- **Acceptance:**
|
||||
- When IM_ASSERT fires, the subprocess stays alive
|
||||
- The `_io_pool` is NOT shut down by the exception (or is re-created lazily — see Task 5)
|
||||
- A new `/api/gui_health` endpoint returns `{"degraded": true, "last_assert": "..."}` so tests can detect the state
|
||||
- The log includes the full assert message + stack trace at ERROR level
|
||||
- **Effort:** 1-2 hours. The wrap is simple. The endpoint + logging is straightforward.
|
||||
- **Risk:** Low. The wrap is a band-aid, but it properly handles the failure (logs it, surfaces it) rather than swallowing silently.
|
||||
- **Depends on:** None. Can be done in parallel with Tasks 1+2. Belongs in the same PR as the fix or as a separate hardening PR.
|
||||
|
||||
### 4. [MED] Add batch-level test isolation (kill+restart sloppy.py per file)
|
||||
|
||||
- **What:** Modify `scripts/run_tests_batched.py` to kill the `live_gui` subprocess at the end of each test file (or at the start of a new one), so a failing test file doesn't poison subsequent test files.
|
||||
- **Where:** `scripts/run_tests_batched.py` (existing batch runner).
|
||||
- **Why:** Per user: "I also don't want a batch to be too fragile where I can't restart the app and continue with the next test file if it fails. Just has to note that the new file didn't get to deal with a dirty state."
|
||||
- **Pattern:** A failing batch should not block subsequent batches. The user wants to be able to run a batch, see it fail, run the next batch, and have it start clean.
|
||||
- **Acceptance:**
|
||||
- When a test file fails, the runner logs a clear "batch N failed; next batch will restart the app" message
|
||||
- The next batch's `live_gui` fixture spawns a fresh `sloppy.py` subprocess (or detects the old one is dead and spawns a new one)
|
||||
- No "dirty state" from a prior failed batch leaks into the next batch
|
||||
- The batch runner continues to the next batch automatically (no user intervention needed)
|
||||
- **Effort:** 2-4 hours. Requires understanding the current batch runner's lifecycle and modifying the `live_gui` fixture to handle "previous subprocess died, start a new one".
|
||||
- **Risk:** Low. The conftest's `live_gui` fixture is already session-scoped — making it per-file-scoped (or function-scoped with batch-aware session reuse) is a small change.
|
||||
- **Depends on:** None. Can be done in parallel with the other tasks.
|
||||
|
||||
### 5. [LOW] Make `submit_io` recover from a shut-down pool
|
||||
|
||||
- **What:** In `submit_io`, if `self._io_pool` is shut down, recreate it lazily.
|
||||
- **Where:** `src/app_controller.py:2257-2284` (current `submit_io` body).
|
||||
- **Why:** Defense in depth. If the GUI crashes and shuts down the pool, the test can still submit work after the wrap (Task 3) catches the exception. Without this, the controller is permanently dead.
|
||||
- **Acceptance:**
|
||||
- After a GUI crash + `immapp.run` recovery, `submit_io` works again
|
||||
- No new threading issues (the recreated pool has the same semantics)
|
||||
- Inflight counter (`_io_pool_inflight`) is reset
|
||||
- **Effort:** 30 minutes.
|
||||
- **Risk:** Low. Standard lazy-recreation pattern. The pool was already designed to be replaceable.
|
||||
- **Depends on:** None.
|
||||
|
||||
### 6. [LOW] Add `/api/gui_health` endpoint with degraded-state info
|
||||
|
||||
- **What:** New endpoint returning `{"healthy": bool, "degraded_reason": str | null, "last_assert": str | null, "io_pool_alive": bool}`.
|
||||
- **Where:** `src/api_hooks.py` (add new `elif` branch) + `src/app_controller.py` (add `self._gui_degraded_reason` and `self._last_imgui_assert` state).
|
||||
- **Why:** Per Task 3, the wrap logs the assert. The endpoint exposes the state to tests so they can detect a degraded GUI and fail with a clear message ("GUI is degraded due to IM_ASSERT; skipping test") rather than a confusing timeout.
|
||||
- **Acceptance:**
|
||||
- Endpoint returns 200 with the health dict
|
||||
- Tests can call `client.get_gui_health()` and check `healthy == False` to detect a degraded GUI
|
||||
- `tests/test_live_workflow.py` checks the health before starting and fails fast with a clear message if degraded
|
||||
- **Effort:** 1-2 hours.
|
||||
- **Risk:** Low. Read-only endpoint.
|
||||
- **Depends on:** Task 3.
|
||||
|
||||
---
|
||||
|
||||
## Tasks Inherited from Predecessor TODO (still relevant)
|
||||
|
||||
These are from `conductor/todos/TODO_test_full_live_workflow.md` and were marked as not yet shipped:
|
||||
|
||||
### 7. [MED] Replace `os.path.abspath("tests/artifacts/temp_project.toml")` with fixture-provided path
|
||||
|
||||
- **What:** Have the `live_gui` fixture provide `temp_project_path` (str) derived from its own `temp_workspace` directory.
|
||||
- **Where:** `tests/conftest.py` (live_gui fixture) + `tests/test_live_workflow.py:79`.
|
||||
- **Why:** cwd-relative path is fragile; fixture-relative path is stable. Per the v1 report's Cause 1.
|
||||
- **Acceptance:** Test does `temp_project_path = live_gui_temp_project_path` (or accesses it as a fixture attribute). No more `os.path.abspath("tests/artifacts/...")`.
|
||||
- **Effort:** 30 minutes.
|
||||
- **Risk:** Low.
|
||||
|
||||
### 8. [LOW] Add `tests/.test_durations.json` recording in CI / dev convenience
|
||||
|
||||
- **What:** Add a dev-mode shortcut to record durations once the fix lands (e.g. `python scripts/run_tests_batched.py --durations`).
|
||||
- **Where:** `scripts/run_tests_batched.py` (already has `--durations` flag; just need a one-time run + commit).
|
||||
- **Why:** The categorizer uses `.test_durations.json` for `speed` auto-inference. Currently all files default to MEDIUM speed.
|
||||
- **Acceptance:** `tests/.test_durations.json` exists, has timing data for all 295+ tests.
|
||||
- **Effort:** 5 minutes (run + commit).
|
||||
- **Risk:** Low.
|
||||
|
||||
---
|
||||
|
||||
## Order of Work (recommended)
|
||||
|
||||
1. **Tasks 1 + 2 first** — find and fix the ImGui scope mismatch. This is the real fix. If successful, Tasks 3, 4, 5, 6 may be unnecessary (or become hardening improvements rather than bug fixes).
|
||||
2. **Task 3 in parallel** — wrap `immapp.run` so the assert doesn't kill the process. Even if Task 2 succeeds, the wrap is a good safety net for future scope bugs.
|
||||
3. **Task 4** — batch-level isolation. Independent of the ImGui fix; improves robustness for ALL tests.
|
||||
4. **Tasks 5, 6** — defense in depth. Only valuable if Tasks 1+2 don't fully fix the issue OR as ongoing hardening.
|
||||
5. **Tasks 7, 8** — unrelated cleanup. Do in a separate small commit/PR.
|
||||
|
||||
## Estimated Time
|
||||
|
||||
- Tasks 1+2: 2-6 hours (real fix, may require investigation)
|
||||
- Task 3: 1-2 hours (band-aid, but proper one)
|
||||
- Task 4: 2-4 hours (batch resilience)
|
||||
- Tasks 5+6: 1-2 hours combined (defense in depth)
|
||||
- Tasks 7+8: 30 minutes combined (cleanup)
|
||||
- **Total: 6-14 hours**
|
||||
|
||||
## Verification
|
||||
|
||||
After fix:
|
||||
- `uv run python scripts/run_tests_batched.py --tiers 3 --no-xdist --no-color` shows `<<< tier-3-live_gui PASS`
|
||||
- `uv run pytest tests/test_live_workflow.py` still PASSes in isolation
|
||||
- `uv run pytest tests/test_live_workflow.py tests/test_extended_sims.py` (siblings) PASSes
|
||||
- A failing batch does NOT prevent the next batch from running with a clean state
|
||||
- Failure message on real regression is clear and actionable (e.g. "GUI degraded: IM_ASSERT(Missing End()) in render_X; skipping test")
|
||||
@@ -0,0 +1,250 @@
|
||||
# Batch-Level Test Resilience Plan
|
||||
|
||||
**Companion to:** `docs/reports/test_full_live_workflow_propagation_digest_20260608.md`
|
||||
**Status:** Pre-implementation plan
|
||||
**User requirement:** "I also don't want a batch to be too fragile where I can't restart the app and continue with the next test file if it fails. Just has to note that the new file didn't get to deal with a dirty state."
|
||||
|
||||
---
|
||||
|
||||
## 1. Current Behavior
|
||||
|
||||
The `tests/conftest.py:live_gui` fixture is **session-scoped**. It spawns a single `sloppy.py` subprocess at the start of the test session and keeps it alive for ALL live_gui tests across ALL tiers.
|
||||
|
||||
**Test file structure (relevant):**
|
||||
- `tests/test_extended_sims.py` — 4 sim tests: `test_context_sim_live`, `test_ai_settings_sim_live`, `test_tools_sim_live`, `test_execution_sim_live`. The IM_ASSERT fires during the 4th sim (~71.5s into GUI lifetime).
|
||||
- `tests/test_live_workflow.py` — separate file, runs AFTER test_extended_sims.py in alphabetical order. `test_full_live_workflow` is the failing test.
|
||||
|
||||
The IM_ASSERT crashes the GUI's main loop mid-test-file. The hook server (separate thread) survives, but the controller's `_io_pool` is in a shutdown state. The next test file (`test_live_workflow.py`) starts in this degraded state. Its first click (`btn_project_new_automated`) hits `submit_io` which raises `RuntimeError: cannot schedule new futures after shutdown`. The test's `wait_for_project_switch` polls for 120s before timing out.
|
||||
|
||||
**Failure mode observed by user:** "the new file didn't get to deal with a dirty state"
|
||||
|
||||
---
|
||||
|
||||
## 2. Real User Concern: Within-Session Subprocess Degradation
|
||||
|
||||
The user's concern is specifically about WITHIN-SESSION state. They want:
|
||||
|
||||
1. A test file can crash the subprocess without preventing the next file from running cleanly
|
||||
2. If the next file is doomed (subprocess is degraded), the runner should report this clearly, not silently time out
|
||||
3. The runner should continue to subsequent batches even after a failed one (this already works for tiers that don't use `live_gui`)
|
||||
|
||||
**The current implementation has NONE of these properties:**
|
||||
- `live_gui` is session-scoped, so the subprocess lives across the whole test session
|
||||
- A crashed subprocess poisons all subsequent live_gui tests
|
||||
- The degraded state (io_pool shut down) is not surfaced to the test, so the test fails with a confusing timeout, not a clear "subprocess degraded" message
|
||||
|
||||
---
|
||||
|
||||
## 3. Probable Solutions
|
||||
|
||||
### Solution A: Per-file live_gui Fixture (most isolated)
|
||||
|
||||
**Approach:** Change `live_gui` from `@pytest.fixture(scope="session")` to `@pytest.fixture(scope="module")`. Each test file gets a fresh subprocess.
|
||||
|
||||
**Code change (1 line):**
|
||||
```python
|
||||
# tests/conftest.py
|
||||
@pytest.fixture(scope="module") # was: "session"
|
||||
def live_gui(request):
|
||||
...
|
||||
```
|
||||
|
||||
**Pros:**
|
||||
- Maximum isolation. A test file that crashes the subprocess doesn't affect the next file.
|
||||
- The fixture's `finally` block (which calls `kill_process_tree`) is the per-file cleanup.
|
||||
- Simple to implement (one-line scope change + audit).
|
||||
|
||||
**Cons:**
|
||||
- ~1-2s overhead per file (subprocess spawn + hook server health check).
|
||||
- For 49 live_gui files, that's 49-98s of additional overhead.
|
||||
- Some tests may currently rely on cross-file state (e.g., a project loaded by file A is still loaded when file B starts). These tests would break.
|
||||
|
||||
**Mitigation:** Audit the live_gui tests for cross-file state dependencies. Most should be standalone (each test sets up its own state). If any are not, mark them with `@pytest.mark.requires_prior_state` and either:
|
||||
- Skip them when scope is module
|
||||
- Or document the dependency and add a setup step in the dependent file
|
||||
|
||||
**Effort:** 1-2 hours (scope change + audit + fix cross-file dependencies).
|
||||
|
||||
**Risk:** Medium. May break tests that depend on cross-file state. The audit is the main work.
|
||||
|
||||
### Solution B: Lazy Re-spawn (most flexible)
|
||||
|
||||
**Approach:** Keep the `live_gui` fixture session-scoped, but wrap it in a handle that re-spawns the subprocess if it dies. The handle exposes the same API as the current fixture.
|
||||
|
||||
**Code change (significant):**
|
||||
```python
|
||||
# tests/conftest.py
|
||||
class _LiveGuiHandle:
|
||||
def __init__(self, gui_script: str):
|
||||
self._gui_script = gui_script
|
||||
self._process: subprocess.Popen | None = None
|
||||
self._lock = threading.Lock()
|
||||
self._spawn()
|
||||
|
||||
def _spawn(self) -> None:
|
||||
# Existing fixture spawn logic, refactored into a method
|
||||
...
|
||||
|
||||
def is_alive(self) -> bool:
|
||||
return self._process is not None and self._process.poll() is None
|
||||
|
||||
def ensure_alive(self) -> None:
|
||||
with self._lock:
|
||||
if not self.is_alive():
|
||||
self._spawn()
|
||||
|
||||
@property
|
||||
def process(self) -> subprocess.Popen:
|
||||
self.ensure_alive()
|
||||
return self._process
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def live_gui(request):
|
||||
handle = _LiveGuiHandle(gui_script)
|
||||
yield handle, handle._gui_script
|
||||
handle._kill()
|
||||
```
|
||||
|
||||
**Pros:**
|
||||
- Preserves the per-session fixture scope.
|
||||
- Auto-recovers from subprocess death between tests.
|
||||
- Tests that rely on cross-file state can still do so (the subprocess is the same instance, modulo a respawn).
|
||||
- Single place to add health checks.
|
||||
|
||||
**Cons:**
|
||||
- More complex. The handle's `ensure_alive` adds a check at every test entry.
|
||||
- If the subprocess dies mid-test, the test still fails — we only recover BETWEEN tests.
|
||||
- Respawning the subprocess loses any in-process state. Tests that rely on state from a prior test fail on respawn.
|
||||
|
||||
**Effort:** 4-6 hours (refactor fixture + add respawn logic + tests).
|
||||
|
||||
**Risk:** Low. The respawn is a fallback; the primary path (subprocess stays alive) is unchanged.
|
||||
|
||||
### Solution C: Per-Batch Process Tracking (most surgical)
|
||||
|
||||
**Approach:** Add a process health check at the start of each batch in `scripts/run_tests_batched.py`. If the previous batch left the subprocess dead, log a clear warning. Tests can then fail fast with a known message.
|
||||
|
||||
**Code change (conftest writes pid file, batcher reads it):**
|
||||
```python
|
||||
# tests/conftest.py (in live_gui fixture, after spawn)
|
||||
pid_file = tests_dir / ".live_gui_pid"
|
||||
pid_file.write_text(str(process.pid))
|
||||
|
||||
# scripts/run_tests_batched.py
|
||||
def _run_batch(b: Batch, ...) -> ...:
|
||||
if b.label.startswith("tier-3-live_gui"):
|
||||
pid_file = tests_dir / ".live_gui_pid"
|
||||
if pid_file.exists():
|
||||
pid = int(pid_file.read_text().strip())
|
||||
if not _is_pid_alive(pid):
|
||||
print(_c(f"[BATCH-WARN] Prior tier-3 batch left the live_gui subprocess (pid={pid}) dead. "
|
||||
f"This batch's live_gui tests may not start with a clean state.",
|
||||
_C.BOLD_YELLOW))
|
||||
```
|
||||
|
||||
**Pros:**
|
||||
- Surgical. Doesn't change the fixture or test code.
|
||||
- Surfaces the dirty state via a clear warning, not a silent hang.
|
||||
- User can then choose to debug or skip the batch.
|
||||
|
||||
**Cons:**
|
||||
- Doesn't actually FIX the dirty state — just makes it visible.
|
||||
- Requires the fixture to write a pid file (small change).
|
||||
- Tests still fail with the same confusing timeout, but the warning is in the runner output.
|
||||
|
||||
**Effort:** 1-2 hours.
|
||||
|
||||
**Risk:** Low. Read-only check, no behavioral change.
|
||||
|
||||
### Solution D: Fixture Auto-Detect (middle ground)
|
||||
|
||||
**Approach:** Keep `live_gui` session-scoped, but at the START of each test (not file), check if the subprocess is alive. If dead, re-spawn.
|
||||
|
||||
**Code change (conftest auto-use hook):**
|
||||
```python
|
||||
# tests/conftest.py
|
||||
@pytest.fixture(autouse=True)
|
||||
def _check_live_gui_health(request, live_gui):
|
||||
if "live_gui" in request.fixturenames:
|
||||
handle, gui_script = live_gui
|
||||
handle.ensure_alive()
|
||||
yield
|
||||
```
|
||||
|
||||
**Pros:**
|
||||
- Per-test recovery. A test that crashes the subprocess doesn't affect the next test.
|
||||
- Minimal API change (tests still use `live_gui`).
|
||||
|
||||
**Cons:**
|
||||
- Per-test overhead (~0.1s for the health check).
|
||||
- If a test's clicks during a degraded subprocess fail, the test must be re-designed to be idempotent.
|
||||
- Respawning loses state.
|
||||
|
||||
**Effort:** 2-3 hours.
|
||||
|
||||
**Risk:** Medium. Tests that assume "subprocess is alive when my test starts" may need adjustment.
|
||||
|
||||
---
|
||||
|
||||
## 4. Recommended Combination
|
||||
|
||||
**Primary: Solution A (per-file fixture scope)**
|
||||
- Most isolated. Each test file is a clean unit.
|
||||
- Simple to implement and audit.
|
||||
- For the IM_ASSERT scenario: test_extended_sims.py crashes its subprocess at the end. test_live_workflow.py starts with a fresh subprocess. The IM_ASSERT-triggered pollution doesn't reach test_live_workflow.py.
|
||||
|
||||
**Secondary: Solution C (per-batch warning)**
|
||||
- Safety net. If a test file's subprocess dies mid-file (rather than at end of file), the next batch's runner logs a clear warning.
|
||||
- Doesn't fix the dirty state but makes it visible.
|
||||
|
||||
**Optional: Solution B (lazy re-spawn)**
|
||||
- If the audit for Solution A reveals too many cross-file dependencies, Solution B is the fallback.
|
||||
- More complex but preserves the per-session state model.
|
||||
|
||||
### NOT recommended: Solution D alone
|
||||
- Per-test recovery is too granular. A test's failure shouldn't trigger a re-spawn that affects subsequent tests' setup.
|
||||
- Also: Solution D doesn't help the IM_ASSERT scenario. The IM_ASSERT crashes the subprocess during test_extended_sims.py, and Solution D would respawn it for the next test in the SAME file. But the next test in test_extended_sims.py is `test_full_live_workflow` which is in a different file — Solution D would still respawn correctly for it.
|
||||
|
||||
Actually, Solution D WOULD work for the IM_ASSERT scenario:
|
||||
- IM_ASSERT fires during `test_execution_sim_live` (test 4 in test_extended_sims.py)
|
||||
- Next test is... well, there are no more tests in test_extended_sims.py
|
||||
- Next file is test_live_workflow.py, first test is test_full_live_workflow
|
||||
- Solution D's autouse fixture would re-spawn the subprocess before test_full_live_workflow
|
||||
|
||||
So Solution D is actually a viable primary approach. Let me reconsider.
|
||||
|
||||
**Revised recommendation:**
|
||||
- **Solution D (autouse fixture auto-respawn)** as the primary. It's the most surgical.
|
||||
- **Solution A (per-file scope)** as the alternative if Solution D's autouse approach has side effects.
|
||||
- **Solution C (per-batch warning)** as a safety net for any case the autouse doesn't catch.
|
||||
|
||||
---
|
||||
|
||||
## 5. Open Questions for the User
|
||||
|
||||
Before implementation, these need clarification:
|
||||
|
||||
1. **Fixture scope preference:** Per-file (Solution A) or per-test auto-respawn (Solution D)?
|
||||
- Per-file: more overhead but simpler reasoning
|
||||
- Per-test auto-respawn: more surgical but adds an autouse hook
|
||||
- My recommendation: Solution D. It's the closest to "the next test file gets a clean subprocess" without changing the fixture's API.
|
||||
|
||||
2. **State reset on respawn:** When the subprocess is re-spawned, should the new subprocess inherit any state (e.g., loaded project, recent discussion)?
|
||||
- My recommendation: No. Fresh subprocess = fresh state. Tests should set up their own state.
|
||||
|
||||
3. **Failure signaling:** If the subprocess can't be respawned (e.g., port 8999 still in use from a zombie), should the test fail immediately or retry?
|
||||
- My recommendation: Fail immediately with a clear error. Retries can hide real issues.
|
||||
|
||||
4. **Backward compatibility:** Are there tests that explicitly DEPEND on the session-scoped behavior (e.g., they share state across files)?
|
||||
- Need to audit. The audit is part of Solution A; for Solution D, the audit is less critical because respawned subprocesses are NEW instances (no shared state with prior subprocesses).
|
||||
|
||||
---
|
||||
|
||||
## 6. References
|
||||
|
||||
- `tests/conftest.py:282` — current `live_gui` fixture (session-scoped)
|
||||
- `tests/conftest.py:516-547` — `live_gui` fixture finally block (kill + cleanup)
|
||||
- `scripts/run_tests_batched.py:136-164` — `_run_batch` function
|
||||
- `scripts/run_tests_batched.py:51-86` — batch result tracking
|
||||
- `docs/reports/test_full_live_workflow_propagation_digest_20260608.md` — full solution matrix
|
||||
- `conductor/todos/TODO_test_full_live_workflow_v2.md` — task list including Task 4 (batch isolation)
|
||||
@@ -0,0 +1,267 @@
|
||||
# Root Cause Report: test_full_live_workflow batch failure (v2)
|
||||
|
||||
**Supersedes:** `test_full_live_workflow_root_cause_20260608.md` (older 6-cause analysis, dated 2026-06-08)
|
||||
**Date:** 2026-06-08
|
||||
**Status:** Investigation complete via diagnostic logging, no fix attempted
|
||||
**Failure reproducibility:** 100% in `tier-3-live_gui` batch (5+ tests, ~200s total), 0% in isolation (`pytest tests/test_live_workflow.py` → 11.69s PASS)
|
||||
**Related commits (reverted/no-op):** `4a338486` (io_pool 4→8), `c9a991bb` (timeout 30→120s), `87d7c5bf` (test_io_pool assertion)
|
||||
|
||||
---
|
||||
|
||||
## TL;DR — The Real Root Cause
|
||||
|
||||
**`test_full_live_workflow` does not fail because of a slow `_do_project_switch`.** The switch runs in ~8-10ms when it executes. The test fails because **the GUI subprocess crashes mid-batch** due to an ImGui scope mismatch in some render function, which leaves the controller's `io_pool` in a shutdown state. Subsequent clicks to the still-alive hook server fail with `RuntimeError: cannot schedule new futures after shutdown`.
|
||||
|
||||
The `_do_project_switch` was a SYMPTOM, not the cause. The previous report's 6-cause analysis (cwd-relative paths, race conditions, click fire-and-forget, etc.) addressed symptoms of the same underlying issue but did not address the IM_ASSERT.
|
||||
|
||||
---
|
||||
|
||||
## Evidence Trail
|
||||
|
||||
### Diagnostic instrumentation (temporarily added, then reverted)
|
||||
|
||||
Added `[switch-diag] +N.NNNs <step>` prints to stderr at every step inside `_do_project_switch` (production code). Output goes to `logs/sloppy_py_test.log` because the `live_gui` fixture captures the subprocess's stderr/stdout to that file. Pattern matches the existing `[startup]` and `[HOOKS]` instrumentation style.
|
||||
|
||||
### Key log findings (`logs/sloppy_py_test.log`)
|
||||
|
||||
#### Finding 1: All 4 sims' switches complete in ~8-10ms each
|
||||
|
||||
```
|
||||
[switch-diag] +0.000s enter path=temp_livecontextsim.toml
|
||||
[switch-diag] +0.000s flush_to_project_start
|
||||
[switch-diag] +0.000s flush_to_project_done
|
||||
[switch-diag] +0.000s load_project_start
|
||||
[switch-diag] +0.002s load_project_done
|
||||
[switch-diag] +0.002s preset_manager_start
|
||||
[switch-diag] +0.002s preset_manager_done
|
||||
[switch-diag] +0.002s persona_manager_start
|
||||
[switch-diag] +0.002s persona_manager_done
|
||||
[switch-diag] +0.002s refresh_start
|
||||
[switch-diag] +0.010s refresh_done
|
||||
[switch-diag] +0.010s mcp_configure_start
|
||||
[switch-diag] +0.010s mcp_configure_done
|
||||
[switch-diag] +0.010s success
|
||||
[switch-diag] +0.010s finally_enter
|
||||
[switch-diag] +0.010s finally_done
|
||||
```
|
||||
|
||||
Same pattern for all 4 sims. **The switch itself is fast. There is no hang inside `_do_project_switch`.**
|
||||
|
||||
#### Finding 2: An ImGui `IM_ASSERT` fires at 71.5s into GUI lifetime
|
||||
|
||||
```
|
||||
[02214] [imgui-error] In window 'MainDockSpace': Missing End()
|
||||
[startup] main_call: 71518.5ms
|
||||
Traceback (most recent call last):
|
||||
File "C:\projects\manual_slop\sloppy.py", line 75, in <module>
|
||||
main()
|
||||
File "C:\projects\manual_slop\src\gui_2.py", line 1478, in main
|
||||
app.run()
|
||||
File "C:\projects\manual_slop\src\gui_2.py", line 618, in run
|
||||
immapp.run(self.runner_params, ...)
|
||||
File "...\imgui_bundle\_patch_runners_add_save_screenshot_param.py", line 38, in patched_run
|
||||
run_backup(*args, **kwargs)
|
||||
RuntimeError: IM_ASSERT( (0) && "Missing End()" ) --- imgui.cpp:11662
|
||||
```
|
||||
|
||||
The `IM_ASSERT` is an ImGui scope-tracking assertion: a `begin()` call was not matched with a corresponding `end()`. The window reported is 'MainDockSpace' — a special window managed by `hello_imgui` for the dock space layout. Some child widget within the dock space has an unbalanced begin/end.
|
||||
|
||||
#### Finding 3: The test's `btn_project_new_automated` click hits a shutdown pool
|
||||
|
||||
```
|
||||
[HOOKS] POST /api/session data length: 1 ← test_live_workflow's post_session
|
||||
[HOOKS] GET /api/warmup_wait?timeout=60.0
|
||||
[HOOKS] GET /api/project_switch_status
|
||||
[HOOKS] POST /api/gui data length: 3 ← test's btn_reset
|
||||
[HOOKS] POST /api/gui data length: 3 ← test's btn_project_new_automated
|
||||
Error executing GUI task (click): cannot schedule new futures after shutdown
|
||||
Traceback (most recent call last):
|
||||
File "...\src\app_controller.py", line 1637, in _process_pending_gui_tasks
|
||||
self._gui_task_handlers[action](self, task)
|
||||
File "...\src\app_controller.py", line 580, in _handle_click
|
||||
controller._cb_new_project_automated(user_data)
|
||||
File "...\src\app_controller.py", line 2723, in _cb_new_project_automated
|
||||
self._switch_project(user_data)
|
||||
File "...\src\app_controller.py", line 2809, in _switch_project
|
||||
self.submit_io(self._do_project_switch, path)
|
||||
File "...\src\app_controller.py", line 2282, in submit_io
|
||||
future = self._io_pool.submit(fn, *args, **kwargs)
|
||||
File "...\concurrent\futures\thread.py", line 167, in submit
|
||||
raise RuntimeError('cannot schedule new futures after shutdown')
|
||||
RuntimeError: cannot schedule new futures after shutdown
|
||||
```
|
||||
|
||||
The IM_ASSERT happened ~71.5s into the GUI. The test_live_workflow runs after the 4 sims (~80s into GUI). Between the IM_ASSERT and the test's click, the `io_pool` was shut down.
|
||||
|
||||
#### Finding 4: The hook server (FastAPI/stdlib http.server) stays alive
|
||||
|
||||
The subprocess continues to respond to hooks (GET /api/events, GET /api/session, POST /api/gui) AFTER the IM_ASSERT and AFTER the io_pool is shut down. The test's `wait_for_project_switch` polls `/api/project_switch_status` 1200+ times in 120s — the server is responsive. Only `submit_io` fails.
|
||||
|
||||
This proves: the IO thread pool is the casualty, not the entire process. Something selectively shut down `_io_pool` while the rest of the controller (and the hook server) kept running.
|
||||
|
||||
---
|
||||
|
||||
## What Shuts Down The io_pool?
|
||||
|
||||
This is the unresolved question. The only places `_io_pool.shutdown(wait=False)` is called in `src/`:
|
||||
|
||||
1. `src/app_controller.py:762` — in the `_on_sigint` SIGINT handler. This requires SIGINT to be delivered to the subprocess. On Windows, `taskkill /F` does NOT deliver signals.
|
||||
2. `src/app_controller.py:2325` — in `controller.shutdown()`. This is called from `src/gui_2.py:869` (in `App.shutdown`), which is only called at `src/gui_2.py:620` AFTER `immapp.run()` returns successfully.
|
||||
|
||||
The IM_ASSERT raises `RuntimeError` from inside `immapp.run()`. The exception propagates up. **There is no `try/finally` around `immapp.run`**, so `App.shutdown()` is NOT called via this path.
|
||||
|
||||
**Hypothesis A:** Python's interpreter finalization calls `__del__` on the controller (or its `_io_pool`) during exception propagation. `ThreadPoolExecutor.__del__` defaults to `shutdown(wait=False)` per the io_pool.py module docstring. This is the most likely path on a `RuntimeError` propagating through `main()`.
|
||||
|
||||
**Hypothesis B:** `immapp.run` internally catches the IM_ASSERT and returns normally. Then `app.shutdown()` is called, which calls `controller.shutdown()`, which calls `_io_pool.shutdown(wait=False)`. Then `app.run()` returns, `main()` returns, and the sloppy.py process exits. But the log shows more activity AFTER the IM_ASSERT (clicks being processed), so this hypothesis is inconsistent with the evidence.
|
||||
|
||||
**Hypothesis C:** The hook server thread (FastAPI on port 8999) runs in a separate thread from the main ImGui loop. The IM_ASSERT crashes the main thread but the hook server thread keeps running. The `_io_pool` is the GUI's pool, not the hook server's. When the main thread crashes, Python's atexit / finalization shuts down `_io_pool`. The hook server thread continues independently.
|
||||
|
||||
**Hypothesis C is most consistent with the evidence.** The `_io_pool` is created in `AppController.__init__` (line ~810 in current code). The hook server is a separate `ThreadingHTTPServer` (line 11 of `src/api_hooks.py`). They are independent. The IM_ASSERT kills the ImGui main loop, the io_pool gets shut down during finalization, the hook server thread continues serving requests.
|
||||
|
||||
The exact mechanism of `_io_pool.shutdown` being called in this scenario is not directly observable from the log. It could be:
|
||||
- `ThreadPoolExecutor.__del__` during GC (Hypothesis C path)
|
||||
- An atexit handler installed by the warmup system or another module
|
||||
- A signal delivery I haven't identified (e.g., the Python interpreter catching the exception and sending SIGTERM internally)
|
||||
|
||||
**This matters less than the actual fix.** The IM_ASSERT is the trigger; the io_pool shutdown is a downstream consequence. Fixing the IM_ASSERT (the real bug) prevents all of this.
|
||||
|
||||
---
|
||||
|
||||
## Why Did The IM_ASSERT Only Fire In Batch?
|
||||
|
||||
The IM_ASSERT is deterministic: it fires every time the offending code path is rendered. In isolation, `test_full_live_workflow` runs alone — it does not exercise the same render functions as the 4 sims. In batch, the sims run first:
|
||||
- `test_context_sim_live` (ContextSimulation)
|
||||
- `test_ai_settings_sim_live` (AISettingsSimulation)
|
||||
- `test_tools_sim_live` (ToolsSimulation)
|
||||
- `test_execution_sim_live` (ExecutionSimulation)
|
||||
|
||||
Each sim opens specific panels (Context, AI Settings, Tools, Execution) and triggers render paths that may be unique to those simulations. After 4 sims, the cumulative state of `ImGui`'s internal scope stack is corrupted — a `begin()` was called in a panel that's only opened in sim mode, and its matching `end()` was either:
|
||||
- In a code path that was skipped due to a conditional render
|
||||
- In a `defer` block that early-returned
|
||||
- After a `return` inside a panel function
|
||||
- Skipped due to a conditional in the render loop
|
||||
|
||||
The IM_ASSERT then fires at frame 71.5s, which is some specific frame AFTER the sims have set up state. The exact render function is unknown without running `scripts/check_imgui_scopes.py` against the full codebase.
|
||||
|
||||
---
|
||||
|
||||
## Why Did My Previous Fixes Fail?
|
||||
|
||||
### Fix 1: io_pool 4→8 (commit `4a338486`)
|
||||
|
||||
**Wrong diagnosis:** I assumed the io_pool was saturated with sims' AI discussion turn workers, causing the new switch to queue forever.
|
||||
|
||||
**Actual cause:** The io_pool isn't the bottleneck. The switch runs in ~8-10ms. The pool wasn't saturated at the time of the test's switch.
|
||||
|
||||
**Why the commit doesn't hurt:** Bigger pool is a marginal improvement to startup concurrency. It's not a regression, just a fix for an issue that isn't the root cause.
|
||||
|
||||
### Fix 2: Test timeout 30s→120s (commit `c9a991bb`)
|
||||
|
||||
**Wrong diagnosis:** I assumed the switch was slow and just needed more time.
|
||||
|
||||
**Actual cause:** The switch is fast. The test fails because the click can't even reach the switch handler — `submit_io` throws at line 2282 because the pool is shut down.
|
||||
|
||||
**Why the commit doesn't hurt:** A longer timeout gives a clearer error message (the actual `RuntimeError: cannot schedule new futures after shutdown` is surfaced) but doesn't change the outcome. If anything, it makes the test more annoying to wait for.
|
||||
|
||||
### Fix 3 (uncommitted, reverted): Dedicated executor for switches
|
||||
|
||||
**Wrong diagnosis:** I assumed the project switch should not share a pool with background work.
|
||||
|
||||
**Actual cause:** The pool is fine. The pool gets killed by the GUI crash.
|
||||
|
||||
**Why the commit was reverted:** It added complexity for a non-issue. Per `conductor/workflow.md`: "Don't ship a known regression to save time."
|
||||
|
||||
---
|
||||
|
||||
## The Three Fixes I Have NOT Yet Attempted
|
||||
|
||||
Per the systematic-debugging skill, the architectural question needs user input. The 3 viable directions are documented in `docs/reports/test_full_live_workflow_propagation_digest_20260608.md` (to be written — see TODOs).
|
||||
|
||||
### Direction A: Fix the actual ImGui scope bug
|
||||
|
||||
**What:** Run `scripts/check_imgui_scopes.py` to find the `begin()`/`end()` mismatch. Fix the offending render function.
|
||||
|
||||
**Pros:** Real fix. Solves the root cause.
|
||||
|
||||
**Cons:** May require deep investigation across 90+ render functions. May be in a render path that's only triggered by a specific sim panel combination. Could take significant time.
|
||||
|
||||
**Risk:** Medium. A wrong fix could break other tests or hide the real issue.
|
||||
|
||||
### Direction B: Wrap `immapp.run` in `try/except RuntimeError`
|
||||
|
||||
**What:** In `src/gui_2.py:618`, wrap `immapp.run(...)` in a `try/except RuntimeError` (or broader). On exception, log it and let the app continue in a degraded state (e.g., skip the rest of the frame, return to event loop).
|
||||
|
||||
**Pros:** Band-aid that prevents the GUI crash from propagating to the process. Tests continue to work. Easier to implement than Direction A.
|
||||
|
||||
**Cons:** Hides the actual ImGui scope bug. Future tests may exhibit other weirdness from the scope mismatch. The user has said they don't want a "wrap" that just silently continues.
|
||||
|
||||
**Risk:** Low for tests, but masks a real bug. Per user: "I don't want the entire test to just linger or silently continue."
|
||||
|
||||
### Direction C: Make `_io_pool.shutdown` recoverable
|
||||
|
||||
**What:** In `submit_io`, check if the pool is shut down. If so, recreate it (lazily).
|
||||
|
||||
**Pros:** Decouples the test from the io_pool's lifecycle. Makes the controller more robust to GUI crashes.
|
||||
|
||||
**Cons:** Doesn't address the IM_ASSERT root cause. The GUI is still crashing — we're just hiding the consequences.
|
||||
|
||||
**Risk:** Low. Standard pattern for resilient thread pools.
|
||||
|
||||
### Direction D: Make the batch runner handle the failure cleanly
|
||||
|
||||
**What:** When a test file fails, the `run_tests_batched.py` runner currently continues to the next batch. The fix is to ensure: (1) the failing test file is marked as failed, (2) the next batch can start with a clean state (kill and restart the sloppy.py subprocess per batch, not session-wide).
|
||||
|
||||
**Pros:** Doesn't require fixing the underlying bug. Tests can fail without poisoning subsequent batches.
|
||||
|
||||
**Cons:** Doesn't fix `test_full_live_workflow`. The test still fails in its own batch.
|
||||
|
||||
**Risk:** Low. Standard pattern for test isolation.
|
||||
|
||||
### User's explicit guidance
|
||||
|
||||
Per the user:
|
||||
- "I don't want the entire test to just linger or silently continue"
|
||||
- "I also don't want a batch to be too fragile where I can't restart the app and continue with the next test file if it fails"
|
||||
- "Just has to note that the new file didn't get to deal with a dirty state"
|
||||
- "The wrap might be worth it if that properly lets us handle the assert"
|
||||
|
||||
The user wants:
|
||||
1. A report (this document)
|
||||
2. A todo list for the actual fixes
|
||||
3. A digest of probable solutions
|
||||
4. Batch-level resilience (kill+restart sloppy.py per file)
|
||||
5. The wrap-around-`immapp.run` is acceptable IF it properly handles the assert (not just swallows it)
|
||||
|
||||
---
|
||||
|
||||
## Files Referenced
|
||||
|
||||
- `src/app_controller.py:2282` — `submit_io` line that throws the RuntimeError
|
||||
- `src/app_controller.py:762` — `_on_sigint` shutdown
|
||||
- `src/app_controller.py:2325` — `controller.shutdown` pool shutdown
|
||||
- `src/gui_2.py:618` — `immapp.run(...)` call site (the IM_ASSERT trigger)
|
||||
- `src/gui_2.py:620` — `self.shutdown()` (only called on normal exit)
|
||||
- `src/api_hooks.py:117-136` — `/api/project_switch_status` endpoint
|
||||
- `src/api_hooks.py:11` — `from http.server import ThreadingHTTPServer` (independent of io_pool)
|
||||
- `tests/test_live_workflow.py:90-94` — `wait_for_project_switch` call
|
||||
- `tests/test_live_workflow.py:84-89` — defensive `os.path.exists` check
|
||||
- `tests/conftest.py:263-280` — `kill_process_tree` (uses `taskkill /F` on Windows; no signal)
|
||||
- `tests/conftest.py:111-126` — pytest smart watchdog (300s timeout in PARENT process)
|
||||
- `tests/conftest.py:516-547` — `live_gui` fixture finally block (session-scoped, only fires at end)
|
||||
- `scripts/check_imgui_scopes.py` — EXISTING audit script that can detect this class of bug
|
||||
- `logs/sloppy_py_test.log` — captured subprocess stderr from the failing test run
|
||||
|
||||
## Diagnostic Logging (temporarily added, then reverted)
|
||||
|
||||
```python
|
||||
# src/app_controller.py:2731-2763 (REVERTED)
|
||||
import sys as _diag_sys
|
||||
_diag_t0 = time.time()
|
||||
def _diag(step: str) -> None:
|
||||
print(f"[switch-diag] +{time.time()-_diag_t0:.3f}s {step} path={Path(path).name}",
|
||||
file=_diag_sys.stderr, flush=True)
|
||||
_diag("enter")
|
||||
# ... at every step ...
|
||||
```
|
||||
|
||||
All diagnostic logging has been removed. The production code is back to the pre-diagnostic state. The pattern can be re-applied if needed (e.g., to find which `begin()` in which render function is unbalanced).
|
||||
@@ -0,0 +1,388 @@
|
||||
# Digest: Probable Solutions for ImGui Assert Propagation Failure (2026-06-08)
|
||||
|
||||
**Companion to:** `docs/reports/test_full_live_workflow_imgui_assert_20260608.md`
|
||||
**Companion to:** `conductor/todos/TODO_test_full_live_workflow_v2.md`
|
||||
**Status:** Pre-implementation analysis. User has not yet chosen a direction.
|
||||
**Audience:** Future implementer, the user (decision reference)
|
||||
|
||||
---
|
||||
|
||||
## 1. The Problem (restated)
|
||||
|
||||
When `tests/test_extended_sims.py` (4 sims) runs before `tests/test_live_workflow.py` in the tier-3 batch, an ImGui `IM_ASSERT((0) && "Missing End()")` fires at ~71.5s into GUI lifetime in window 'MainDockSpace'. The `RuntimeError` propagates from `immapp.run` through `app.run()` and `main()`. The hook server thread (separate `ThreadingHTTPServer`) survives. The `_io_pool` ends up in a shutdown state (mechanism unclear — likely `ThreadPoolExecutor.__del__` during GC). Subsequent test clicks fail with `RuntimeError: cannot schedule new futures after shutdown`.
|
||||
|
||||
The test poll loop then waits 120s before timing out.
|
||||
|
||||
---
|
||||
|
||||
## 2. Constraints (from user)
|
||||
|
||||
Per the user's session feedback (2026-06-08):
|
||||
- "I don't want the entire test to just linger or silently continue" — silent failure is not acceptable
|
||||
- "I also don't want a batch to be too fragile where I can't restart the app and continue with the next test file if it fails" — batch isolation is required
|
||||
- "Just has to note that the new file didn't get to deal with a dirty state" — the failure should be observable, not hidden
|
||||
- "The wrap might be worth it if that properly lets us handle the assert" — a proper wrap (with logging + observable state) is acceptable; a silent swallow is not
|
||||
|
||||
The user wants:
|
||||
1. A real fix where possible
|
||||
2. A wrap that surfaces the failure (not swallows it)
|
||||
3. Batch resilience (a failed batch should not poison the next)
|
||||
4. Tests should be able to detect a degraded GUI and fail fast with a clear message
|
||||
|
||||
---
|
||||
|
||||
## 3. Solution Matrix
|
||||
|
||||
The 6 tasks in `conductor/todos/TODO_test_full_live_workflow_v2.md` are presented here as a solutions matrix. Each is evaluated on:
|
||||
- **Real-fix value** (does it address the root cause?)
|
||||
- **Test impact** (does it make `test_full_live_workflow` pass in batch?)
|
||||
- **Effort** (hours)
|
||||
- **Risk** (chance of regression)
|
||||
- **User alignment** (does it match the user's stated constraints?)
|
||||
|
||||
| # | Solution | Real fix? | Test impact | Effort | Risk | User-aligned? |
|
||||
|---|----------|-----------|------------|--------|------|---------------|
|
||||
| 1 | Run `check_imgui_scopes.py` to find the scope mismatch | **Yes** (the actual bug) | Yes (if successful) | 1-2h | Med | Yes |
|
||||
| 2 | Fix the identified ImGui scope mismatch | **Yes** (the actual bug) | Yes (if successful) | 1-4h | Med | Yes |
|
||||
| 3 | Wrap `immapp.run` in `try/except RuntimeError` | No (band-aid) | Yes (prevents crash) | 1-2h | Low | Yes (per user: "might be worth it if it properly handles") |
|
||||
| 4 | Kill+restart sloppy.py per test file | No (isolation) | Yes (clean state) | 2-4h | Low | Yes (per user: "I can't restart the app and continue with the next test file if it fails") |
|
||||
| 5 | Make `submit_io` recover from a shut-down pool | No (resilience) | Yes (survives crash) | 0.5h | Low | Yes (defense in depth) |
|
||||
| 6 | Add `/api/gui_health` endpoint | No (observability) | Yes (fast-fail) | 1-2h | Low | Yes (per user: "Just has to note that the new file didn't get to deal with a dirty state") |
|
||||
|
||||
---
|
||||
|
||||
## 4. Solution Details (Probable Approaches)
|
||||
|
||||
### 4.1 Solution 1+2: Audit + Fix the ImGui Scope Mismatch
|
||||
|
||||
**Approach A: Manual scope audit (lowest risk)**
|
||||
|
||||
1. Run `python scripts/check_imgui_scopes.py` against `src/gui_2.py`
|
||||
2. Triage findings — many will be false positives (e.g., `begin()` inside a conditional that has an `end()` in the other branch via context manager)
|
||||
3. Identify the SPECIFIC `render_*` function with the unbalanced scope
|
||||
4. Inspect that function's render path and find the missing `end()` or the extra `begin()`
|
||||
|
||||
**Probable location of the bug:** A render function that's only called in one of the sims' panel render paths. Candidates:
|
||||
- Render functions for the AI Settings panel (test_ai_settings_sim_live)
|
||||
- Render functions for the Tools panel (test_tools_sim_live)
|
||||
- Render functions for the Execution/Modals panel (test_execution_sim_live)
|
||||
- Render functions for the Context & Chat panel (test_context_sim_live)
|
||||
|
||||
**Probable cause of the bug:** A recent render refactor that added a `begin()` to show a tooltip or popup, but the matching `end()` is inside an `if` branch that can early-return. Or a `begin()` that's only conditionally reached, with the `end()` always called but the stack now has an extra entry.
|
||||
|
||||
**Approach B: Defer-not-catch pattern (per `conductor/workflow.md` known pitfall)**
|
||||
|
||||
The known pitfall section describes:
|
||||
> `imgui-bundle` (and similar native extension libraries) expose C-level functions that can crash the Python process with a Windows access violation (`0xc0000005`) or a SIGSEGV on Linux. **These crashes are not catchable from Python** — `try/except Exception` does not intercept native access violations, only Python exceptions.
|
||||
|
||||
The IM_ASSERT we observed IS a Python RuntimeError (catchable). But the fix pattern is similar: use `imscope` context managers (per `src/imgui_scopes.py`) to ensure scopes are balanced even on early returns.
|
||||
|
||||
**Implementation pattern:**
|
||||
```python
|
||||
# Before (buggy)
|
||||
def render_my_panel(app):
|
||||
imgui.begin("My Panel")
|
||||
if not app.some_state:
|
||||
return # <-- Missing end()!
|
||||
imgui.text("hello")
|
||||
imgui.end()
|
||||
|
||||
# After (fixed)
|
||||
def render_my_panel(app):
|
||||
with imscope(imgui.begin, "My Panel"): # auto end() on exit
|
||||
if not app.some_state:
|
||||
return
|
||||
imgui.text("hello")
|
||||
# end() called automatically
|
||||
```
|
||||
|
||||
**Effort:** 1-4 hours (depends on what the audit finds).
|
||||
|
||||
**Risk:** Medium. The fix may need to be applied to multiple render functions, each requiring careful testing.
|
||||
|
||||
**Confidence in success:** Medium. The IM_ASSERT is deterministic and ImGui's scope tracking is reliable. Once the offending function is found, the fix is mechanical.
|
||||
|
||||
### 4.2 Solution 3: Wrap `immapp.run` in `try/except RuntimeError`
|
||||
|
||||
**Approach: Catch and recover**
|
||||
|
||||
```python
|
||||
# src/gui_2.py:617-621
|
||||
try:
|
||||
immapp.run(self.runner_params, add_ons_params=immapp.AddOnsParams(with_markdown_options=md_options))
|
||||
except RuntimeError as e:
|
||||
# IM_ASSERT (Missing End()) or similar. Log the error, mark the GUI
|
||||
# as degraded, and let the hook server continue. Per user feedback
|
||||
# (2026-06-08): the wrap is acceptable IF it surfaces the failure
|
||||
# (does not silently swallow).
|
||||
self.controller._gui_degraded_reason = f"immapp.run raised: {e}"
|
||||
self.controller._last_imgui_assert = traceback.format_exc()
|
||||
print(f"[GUI-DEGRADED] {self.controller._gui_degraded_reason}", file=sys.stderr, flush=True)
|
||||
# Do NOT call self.shutdown() — keep the hook server alive for tests.
|
||||
# The io_pool may be in a weird state; lazy-recreate on next submit_io (Task 5).
|
||||
# On normal exit
|
||||
if not self.controller._gui_degraded_reason:
|
||||
self.shutdown()
|
||||
session_logger.close_session()
|
||||
```
|
||||
|
||||
**Key design choices:**
|
||||
1. **The wrap does NOT call `self.shutdown()`** when the assert fires. This keeps the hook server alive so subsequent tests can still query state.
|
||||
2. **The error is logged at ERROR level** with the full assert message and stack trace. This is observable, not silent.
|
||||
3. **The controller sets `_gui_degraded_reason` and `_last_imgui_assert`** so the new `/api/gui_health` endpoint (Task 6) can expose the state to tests.
|
||||
4. **Tests can detect the degraded state** via `client.get_gui_health()` and fail fast with a clear message.
|
||||
|
||||
**Effort:** 1-2 hours (wrap + logging + state).
|
||||
|
||||
**Risk:** Low. The wrap is a band-aid, but a transparent one. The error is still surfaced.
|
||||
|
||||
**Confidence in success:** High. The wrap itself is trivial. The integration with `/api/gui_health` and the test-side fast-fail is straightforward.
|
||||
|
||||
### 4.3 Solution 4: Kill+Restart sloppy.py per Test File
|
||||
|
||||
**Approach: Per-file fixture scope**
|
||||
|
||||
Currently, the `live_gui` fixture in `tests/conftest.py` is session-scoped. All live_gui tests share the same `sloppy.py` subprocess for the entire test session. If the subprocess crashes mid-session, all subsequent tests are poisoned.
|
||||
|
||||
**Change:** Make the fixture per-file-scoped (or function-scoped with smart re-spawn). When a test in file A finishes, the subprocess is killed. When file B starts, a fresh subprocess is spawned.
|
||||
|
||||
**Implementation pattern:**
|
||||
|
||||
```python
|
||||
# tests/conftest.py (modified)
|
||||
@pytest.fixture(scope="module") # was: "session"
|
||||
def live_gui(request):
|
||||
# If the prior file's subprocess died, this is a fresh start.
|
||||
# If the user wants true per-test isolation, change to "function".
|
||||
...
|
||||
try:
|
||||
yield process, gui_script
|
||||
finally:
|
||||
kill_process_tree(process.pid)
|
||||
log_file.close()
|
||||
shutil.rmtree(temp_workspace)
|
||||
```
|
||||
|
||||
**Considerations:**
|
||||
- **Performance:** Each test file's fixture spawn takes ~1-2s. With 49 live_gui files, that's 49-98s added to the tier-3 batch. Probably acceptable.
|
||||
- **State persistence:** Currently, tests can rely on state from a prior test (e.g., a project loaded by sim 1 is still loaded when sim 2 runs). Making the fixture per-file-scoped breaks this. **Most live_gui tests should NOT depend on prior test state** — they should set up their own state. This is the principle the v1 report identified.
|
||||
- **Watchdog interaction:** The conftest's smart watchdog (300s) and unconditional watchdog (900s) are based on `_pytest_finished_event`. The per-file fixture change is independent.
|
||||
|
||||
**Alternative approach: Smart re-spawn**
|
||||
|
||||
Keep the fixture session-scoped, but add a "re-spawn" check at the start of each test. If the subprocess is dead, spawn a new one. The fixture becomes a "lazy" fixture that may spawn multiple subprocesses over the session.
|
||||
|
||||
**Probable implementation:**
|
||||
```python
|
||||
@pytest.fixture(scope="session")
|
||||
def live_gui(request):
|
||||
process, gui_script = _spawn_sloppy()
|
||||
yield _LazyLiveGui(process, gui_script)
|
||||
kill_process_tree(process.pid)
|
||||
|
||||
class _LazyLiveGui:
|
||||
def __init__(self, process, gui_script):
|
||||
self._process = process
|
||||
self._gui_script = gui_script
|
||||
self._lock = threading.Lock()
|
||||
|
||||
def get_process(self):
|
||||
with self._lock:
|
||||
if self._process.poll() is not None:
|
||||
# Respawn
|
||||
self._process = _spawn_sloppy()
|
||||
return self._process
|
||||
```
|
||||
|
||||
This is more complex but preserves the per-session state model for tests that need it.
|
||||
|
||||
**Effort:** 2-4 hours (per-file approach is simpler; lazy re-spawn is more invasive).
|
||||
|
||||
**Risk:** Low. The per-file approach is straightforward. The lazy re-spawn is more complex but doesn't change the API surface for tests.
|
||||
|
||||
**Confidence in success:** High. The pattern is well-established in test infrastructure.
|
||||
|
||||
### 4.4 Solution 5: Make `submit_io` Recover from a Shut-Down Pool
|
||||
|
||||
**Approach: Lazy recreation**
|
||||
|
||||
```python
|
||||
# src/app_controller.py:2275-2284
|
||||
def submit_io(self, fn, *args, **kwargs):
|
||||
if not hasattr(self, "_io_pool") or self._io_pool is None or self._is_io_pool_shutdown():
|
||||
# Recreate the pool (it was shut down, e.g., by a GUI crash).
|
||||
self._io_pool = make_io_pool()
|
||||
self._io_pool_inflight = 0
|
||||
if not hasattr(self, "_io_pool_inflight_lock"):
|
||||
self._io_pool_inflight_lock = threading.Lock()
|
||||
with self._io_pool_inflight_lock:
|
||||
self._io_pool_inflight = getattr(self, "_io_pool_inflight", 0) + 1
|
||||
future = self._io_pool.submit(fn, *args, **kwargs)
|
||||
future.add_done_callback(lambda _f: self._io_pool_inflight_done())
|
||||
return future
|
||||
|
||||
def _is_io_pool_shutdown(self) -> bool:
|
||||
"""True if the io_pool has been shut down (e.g., via __del__ or controller.shutdown)."""
|
||||
pool = getattr(self, "_io_pool", None)
|
||||
if pool is None:
|
||||
return True
|
||||
# ThreadPoolExecutor doesn't expose a clean "is_shutdown" method.
|
||||
# Use a try/except probe.
|
||||
try:
|
||||
pool.submit(lambda: None).result(timeout=0.001)
|
||||
return False
|
||||
except RuntimeError:
|
||||
return True
|
||||
except Exception:
|
||||
return False
|
||||
```
|
||||
|
||||
**Key design choices:**
|
||||
1. **The lazy-recreate happens transparently.** Callers don't need to know about pool lifecycle.
|
||||
2. **The inflight counter is reset** when the pool is recreated (the old workers are dead).
|
||||
3. **The probe (`_is_io_pool_shutdown`) is best-effort.** `ThreadPoolExecutor` doesn't expose a clean shutdown check; a `submit().result()` probe is the standard pattern.
|
||||
|
||||
**Effort:** 30 minutes.
|
||||
|
||||
**Risk:** Low. The pool was already designed to be replaceable (per the existing test infrastructure). The probe is a non-invasive check.
|
||||
|
||||
**Confidence in success:** High. Standard pattern for resilient thread pools.
|
||||
|
||||
### 4.5 Solution 6: Add `/api/gui_health` Endpoint
|
||||
|
||||
**Approach: Read-only health endpoint**
|
||||
|
||||
```python
|
||||
# src/api_hooks.py (new elif branch in do_GET)
|
||||
elif self.path == "/api/gui_health":
|
||||
self.send_response(200)
|
||||
self.send_header("Content-Type", "application/json")
|
||||
self.end_headers()
|
||||
controller = _get_app_attr(app, "controller", None)
|
||||
if controller is None:
|
||||
payload = {"healthy": True, "degraded_reason": None, "last_assert": None, "io_pool_alive": True}
|
||||
else:
|
||||
payload = {
|
||||
"healthy": getattr(controller, "_gui_degraded_reason", None) is None,
|
||||
"degraded_reason": getattr(controller, "_gui_degraded_reason", None),
|
||||
"last_assert": getattr(controller, "_last_imgui_assert", None),
|
||||
"io_pool_alive": not controller._is_io_pool_shutdown() if hasattr(controller, "_is_io_pool_shutdown") else True,
|
||||
}
|
||||
self.wfile.write(json.dumps(payload).encode("utf-8"))
|
||||
```
|
||||
|
||||
**Test integration:**
|
||||
|
||||
```python
|
||||
# tests/test_live_workflow.py (new pre-check at start)
|
||||
def test_full_live_workflow(live_gui) -> None:
|
||||
client = ApiHookClient()
|
||||
assert client.wait_for_server(timeout=10)
|
||||
# New: check GUI health before proceeding
|
||||
health = client.get_gui_health()
|
||||
if not health.get("healthy"):
|
||||
pytest.fail(
|
||||
f"GUI is degraded before test starts: "
|
||||
f"degraded_reason={health.get('degraded_reason')}, "
|
||||
f"last_assert={health.get('last_assert')}"
|
||||
)
|
||||
...
|
||||
```
|
||||
|
||||
**Effort:** 1-2 hours.
|
||||
|
||||
**Risk:** Low. Read-only endpoint + trivial client method + simple pre-check.
|
||||
|
||||
**Confidence in success:** High.
|
||||
|
||||
---
|
||||
|
||||
## 5. Recommended Combination
|
||||
|
||||
The user's constraints suggest a **layered defense**:
|
||||
|
||||
### Layer 1: Real fix (Tasks 1+2)
|
||||
Find and fix the ImGui scope mismatch. This is the right thing to do.
|
||||
|
||||
### Layer 2: Safety net (Task 3)
|
||||
Wrap `immapp.run` so a future scope mismatch doesn't kill the process. Log the error, mark GUI as degraded.
|
||||
|
||||
### Layer 3: Observability (Task 6)
|
||||
Expose the degraded state via `/api/gui_health`. Tests can fast-fail with a clear message.
|
||||
|
||||
### Layer 4: Resilience (Task 5)
|
||||
Make `submit_io` recover from a shut-down pool. Defense in depth.
|
||||
|
||||
### Layer 5: Batch isolation (Task 4)
|
||||
Per-file (or per-test) fixture scope for `live_gui`. A failed batch doesn't poison the next.
|
||||
|
||||
**Recommended PR order:**
|
||||
- **PR 1 (highest priority):** Tasks 1+2 (real fix)
|
||||
- **PR 2 (in parallel with PR 1):** Tasks 3+6 (wrap + observability) — these don't depend on finding the bug
|
||||
- **PR 3:** Task 4 (batch isolation) — independent of the bug, valuable on its own
|
||||
- **PR 4 (defense in depth):** Task 5 (lazy pool recreation) — only needed if PR 2 doesn't fully solve the problem
|
||||
|
||||
### What's NOT recommended
|
||||
|
||||
- **Task 5 alone, without Tasks 1+2+3:** The lazy-recreate hides the bug. The io_pool will keep getting shut down by the GUI crash, then recreated, then shut down again, etc. The test will pass but the GUI is still broken.
|
||||
- **Task 4 alone, without the real fix:** Batch isolation is good hygiene, but the test still fails in its own batch. The bug is not fixed.
|
||||
- **A silent swallow in Task 3:** The user explicitly rejected this. A silent failure is worse than a visible failure.
|
||||
|
||||
---
|
||||
|
||||
## 6. Open Questions for the User
|
||||
|
||||
Before implementation, these need clarification:
|
||||
|
||||
1. **Task 3 wrap behavior:** When the IM_ASSERT fires, should the wrap:
|
||||
a. Just log and return (degraded mode, no further renders)
|
||||
b. Try to continue the render loop (skip the current frame, retry next frame)
|
||||
c. Restart the ImGui context entirely (clear and reinitialize)
|
||||
|
||||
My recommendation: (a) is the safest. (b) risks infinite loops if the scope is genuinely broken. (c) is invasive.
|
||||
|
||||
2. **Task 4 scope:** Per-file or per-test? Per-test is more isolated but adds 49+ fixture spawns. Per-file is a middle ground.
|
||||
|
||||
My recommendation: Per-file for now. Per-test can be added later if needed.
|
||||
|
||||
3. **Task 5 proactive recreation:** Should the pool be recreated eagerly (at controller init) or lazily (on first submit after shutdown)?
|
||||
|
||||
My recommendation: Lazy. The pool is normally alive; recreating eagerly is wasteful.
|
||||
|
||||
4. **New state attributes:** `_gui_degraded_reason` and `_last_imgui_assert` — should they be persisted to disk (e.g., for crash analysis) or just in-memory?
|
||||
|
||||
My recommendation: In-memory. Disk persistence is overkill for transient runtime state.
|
||||
|
||||
5. **Compatibility with existing tests:** Some existing tests may assume the io_pool is always available. Will the lazy-recreate break them?
|
||||
|
||||
My recommendation: No. The lazy-recreate preserves the pool's API surface. The only difference is that the pool may be a fresh instance after a crash. Existing tests don't care about the instance identity.
|
||||
|
||||
---
|
||||
|
||||
## 7. Effort & Risk Summary
|
||||
|
||||
| Solution | Effort | Risk | User-aligned | Recommendation |
|
||||
|----------|--------|------|--------------|----------------|
|
||||
| 1+2 (audit + fix) | 2-6h | Med | Yes | **PRIORITY 1: implement first** |
|
||||
| 3 (wrap) | 1-2h | Low | Yes | **PRIORITY 2: implement in parallel** |
|
||||
| 4 (batch isolation) | 2-4h | Low | Yes | **PRIORITY 3: independent, implement anytime** |
|
||||
| 5 (lazy recreate) | 0.5h | Low | Yes | **PRIORITY 4: only if 3 doesn't fully solve** |
|
||||
| 6 (health endpoint) | 1-2h | Low | Yes | **PRIORITY 2: implement with 3** |
|
||||
|
||||
**Total effort (all):** 6-14 hours
|
||||
**Recommended PR sequence:** 1+2 → 3+6 → 4 → 5
|
||||
|
||||
---
|
||||
|
||||
## 8. References
|
||||
|
||||
- `docs/reports/test_full_live_workflow_imgui_assert_20260608.md` — full debugging report
|
||||
- `conductor/todos/TODO_test_full_live_workflow_v2.md` — task list
|
||||
- `scripts/check_imgui_scopes.py` — existing audit script (use for Task 1)
|
||||
- `src/imgui_scopes.py` — ImGuiScope context manager (use for fix in Task 2)
|
||||
- `src/api_hooks.py:11` — `ThreadingHTTPServer` (independent of io_pool)
|
||||
- `src/gui_2.py:618` — `immapp.run(...)` call site
|
||||
- `src/app_controller.py:2282` — `submit_io` line that throws the RuntimeError
|
||||
- `conductor/workflow.md` "Known Pitfalls (2026-06-05)" — Defer-Not-Catch Pattern
|
||||
- `conductor/workflow.md` "Skip-Marker Policy" — don't add skip markers as workarounds
|
||||
- `AGENTS.md` "Critical Anti-Patterns" — no comments, no batch fragility
|
||||
@@ -39,6 +39,22 @@ def test_full_live_workflow(live_gui) -> None:
|
||||
"""
|
||||
client = ApiHookClient()
|
||||
assert client.wait_for_server(timeout=10)
|
||||
# 00. Pre-flight health check. If the live_gui subprocess was left in
|
||||
# a degraded state by a prior test (e.g. an ImGui IM_ASSERT crashed
|
||||
# the GUI main loop, shutting down the controller's _io_pool), fail
|
||||
# fast with a clear message instead of waiting 120s for a switch
|
||||
# that can never complete. Per user feedback 2026-06-08: the test
|
||||
# should "note that the new file didn't get to deal with a dirty state"
|
||||
# rather than silently time out.
|
||||
health = client.get_gui_health()
|
||||
if not health.get("healthy"):
|
||||
pytest.fail(
|
||||
f"GUI is degraded before test starts. "
|
||||
f"degraded_reason={health.get('degraded_reason')!r}, "
|
||||
f"last_assert={health.get('last_assert')!r}. "
|
||||
f"This is likely caused by a prior test in the same live_gui session "
|
||||
f"crashing the GUI. The new test cannot proceed with a dirty state."
|
||||
)
|
||||
client.post_session(session_entries=[])
|
||||
|
||||
# 0a. Wait for app warmup to complete. The warmup submits heavy-module
|
||||
|
||||
Reference in New Issue
Block a user