manual_slop/docs/reports/batch_resilience_plan_20260608.md
ed 51ecace464 test(live_workflow): pre-flight health check fails fast on dirty state
PR3 of the test_full_live_workflow_imgui_assert fix sequence.

When a prior live_gui test in the same session crashes the GUI (e.g.
via an ImGui IM_ASSERT from cumulative panel state), the controller's
_io_pool gets shut down. The next test starts in a degraded state
but only discovers this 120s later when its project switch times
out with a confusing 'cannot schedule new futures after shutdown'
error.

This commit adds a /api/gui_health pre-flight check at the start of
test_full_live_workflow. If the GUI is degraded, the test fails
fast (within 1s) with a clear, actionable message that includes:
- The exact RuntimeError that caused the degradation
- The full traceback of the last ImGui scope mismatch
- A note that the new test cannot proceed with a dirty state

Per user feedback 2026-06-08: 'I don't want a batch to be too fragile
where I can't restart the app and continue with the next test file
if it fails. Just has to note that the new file didn't get to deal
with a dirty state.'

Also includes the planning documents written earlier in this session:
- TODO_test_full_live_workflow_v2.md (task list)
- test_full_live_workflow_imgui_assert_20260608.md (root cause report)
- test_full_live_workflow_propagation_digest_20260608.md (solutions digest)
- batch_resilience_plan_20260608.md (batch resilience plan)

Verification:
- test_full_live_workflow in isolation: 13.45s PASS (health=True, no degrade)
- 4 sims + test_full_live_workflow in batch: 76.46s (1 FAIL fast, 4 sims PASS)
  - Without PR3 fix: 200s FAIL with confusing 120s timeout
  - With PR3 fix: 76s FAIL with clear 'GUI is degraded' message
- The fast-fail is observable, not silent (per user's 'wrap might be
  worth it if that properly lets us handle the assert')
2026-06-08 21:17:54 -04:00


Batch-Level Test Resilience Plan

Companion to: docs/reports/test_full_live_workflow_propagation_digest_20260608.md
Status: Pre-implementation plan
User requirement: "I also don't want a batch to be too fragile where I can't restart the app and continue with the next test file if it fails. Just has to note that the new file didn't get to deal with a dirty state."


1. Current Behavior

The tests/conftest.py:live_gui fixture is session-scoped. It spawns a single sloppy.py subprocess at the start of the test session and keeps it alive for ALL live_gui tests across ALL tiers.

Test file structure (relevant):

  • tests/test_extended_sims.py — 4 sim tests: test_context_sim_live, test_ai_settings_sim_live, test_tools_sim_live, test_execution_sim_live. The IM_ASSERT fires during the 4th sim (~71.5s into GUI lifetime).
  • tests/test_live_workflow.py — separate file, runs AFTER test_extended_sims.py in alphabetical order. test_full_live_workflow is the failing test.

The IM_ASSERT crashes the GUI's main loop mid-test-file. The hook server (separate thread) survives, but the controller's _io_pool is in a shutdown state. The next test file (test_live_workflow.py) starts in this degraded state. Its first click (btn_project_new_automated) hits submit_io which raises RuntimeError: cannot schedule new futures after shutdown. The test's wait_for_project_switch polls for 120s before timing out.
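The error text itself is standard-library behaviour and can be reproduced in isolation. The sketch below assumes that _io_pool is a concurrent.futures.ThreadPoolExecutor (the report does not state this explicitly) and uses a hypothetical stand-in for submit_io; it is not the controller's actual code.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the controller's _io_pool (ASSUMED to be a ThreadPoolExecutor).
io_pool = ThreadPoolExecutor(max_workers=1)


def submit_io(fn, *args):
    """Hypothetical stand-in for the controller's submit_io."""
    return io_pool.submit(fn, *args)


# Healthy pool: the work is scheduled and completes.
healthy_result = submit_io(lambda: "ok").result(timeout=5)

# Simulate the crash path shutting the pool down.
io_pool.shutdown(wait=True)

# Any later submission is rejected immediately with a RuntimeError.
try:
    submit_io(lambda: "never runs")
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(healthy_result, "|", error_message)
```

The rejection is immediate. Under this assumption, the 120s delay comes from the test polling for a project switch that was never scheduled, not from the pool.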

Failure mode observed by user: "the new file didn't get to deal with a dirty state"


2. Real User Concern: Within-Session Subprocess Degradation

The user's concern is specifically about WITHIN-SESSION state. They want:

  1. A test file can crash the subprocess without preventing the next file from running cleanly
  2. If the next file is doomed (subprocess is degraded), the runner should report this clearly, not silently time out
  3. The runner should continue to subsequent batches even after a failed one (this already works for tiers that don't use live_gui)

For live_gui tests, the current implementation has NONE of these properties:

  • live_gui is session-scoped, so the subprocess lives across the whole test session
  • A crashed subprocess poisons all subsequent live_gui tests
  • The degraded state (io_pool shut down) is not surfaced to the test, so the test fails with a confusing timeout, not a clear "subprocess degraded" message
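One way to surface the degraded state is a pre-flight check against the /api/gui_health endpoint named in the commit message. The following is a minimal sketch, not the committed implementation. The host and URL (port 8999 is taken from the open questions in this report), the JSON field names "healthy" and "error", and the names GuiDegradedError and assert_gui_healthy are all assumptions made for illustration.

```python
import json
import urllib.request
from typing import Callable

# ASSUMPTIONS: the hook server listens on 127.0.0.1:8999 and /api/gui_health
# returns JSON with a boolean "healthy" field and an optional "error" string.
GUI_HEALTH_URL = "http://127.0.0.1:8999/api/gui_health"


class GuiDegradedError(RuntimeError):
    """Raised when the GUI subprocess is unreachable or reports degradation."""


def _fetch_health_over_http(timeout: float = 1.0) -> dict:
    """Fetch and decode the health payload from the hook server."""
    with urllib.request.urlopen(GUI_HEALTH_URL, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def assert_gui_healthy(fetch: Callable[[], dict] = _fetch_health_over_http) -> None:
    """Fail fast with a clear message instead of timing out later."""
    try:
        payload = fetch()
    except OSError as exc:  # URLError and socket timeouts subclass OSError
        raise GuiDegradedError(f"GUI health endpoint unreachable: {exc}") from exc
    if not payload.get("healthy", False):
        detail = payload.get("error", "no detail reported")
        raise GuiDegradedError(
            f"GUI is degraded; this test did not get a clean state: {detail}"
        )
```

Passing the fetcher as a parameter keeps the check testable without a running GUI, for example assert_gui_healthy(lambda: {"healthy": True}).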

3. Candidate Solutions

Solution A: Per-file live_gui Fixture (most isolated)

Approach: Change live_gui from @pytest.fixture(scope="session") to @pytest.fixture(scope="module"). Each test file gets a fresh subprocess.

Code change (1 line):

# tests/conftest.py
@pytest.fixture(scope="module")  # was: "session"
def live_gui(request):
    ...

Pros:

  • Maximum isolation. A test file that crashes the subprocess doesn't affect the next file.
  • The fixture's finally block (which calls kill_process_tree) is the per-file cleanup.
  • Simple to implement (one-line scope change + audit).

Cons:

  • ~1-2s overhead per file (subprocess spawn + hook server health check).
  • For 49 live_gui files, that's 49-98s of additional overhead.
  • Some tests may currently rely on cross-file state (e.g., a project loaded by file A is still loaded when file B starts). These tests would break.

Mitigation: Audit the live_gui tests for cross-file state dependencies. Most should be standalone (each test sets up its own state). If any are not, mark them with @pytest.mark.requires_prior_state and either:

  • Skip them when scope is module
  • Or document the dependency and add a setup step in the dependent file

Effort: 1-2 hours (scope change + audit + fix cross-file dependencies).

Risk: Medium. May break tests that depend on cross-file state. The audit is the main work.

Solution B: Lazy Re-spawn (most flexible)

Approach: Keep the live_gui fixture session-scoped, but wrap it in a handle that re-spawns the subprocess if it dies. The handle exposes the same API as the current fixture.

Code change (significant):

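# tests/conftest.py
import subprocess
import threading

import pytest

class _LiveGuiHandle:
    """Owns the GUI subprocess and re-spawns it on demand."""

    def __init__(self, gui_script: str):
        self._gui_script = gui_script
        self._process: subprocess.Popen | None = None
        self._lock = threading.Lock()
        self._spawn()

    def _spawn(self) -> None:
        # Existing fixture spawn logic, refactored into a method
        ...

    def _kill(self) -> None:
        # Existing fixture teardown logic (kill_process_tree), refactored into a method
        ...

    def is_alive(self) -> bool:
        # poll() returns None while the process is still running
        return self._process is not None and self._process.poll() is None

    def ensure_alive(self) -> None:
        # Serialize respawns so concurrent callers cannot spawn two subprocesses
        with self._lock:
            if not self.is_alive():
                self._spawn()

    @property
    def process(self) -> subprocess.Popen:
        self.ensure_alive()
        return self._process

@pytest.fixture(scope="session")
def live_gui(request):
    # gui_script is resolved as in the existing fixture (elided here)
    handle = _LiveGuiHandle(gui_script)
    yield handle, handle._gui_script
    handle._kill()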

Pros:

  • Preserves the per-session fixture scope.
  • Auto-recovers from subprocess death between tests.
  • Tests that rely on cross-file state can still do so (the subprocess is the same instance, modulo a respawn).
  • Single place to add health checks.

Cons:

  • More complex. The handle's ensure_alive adds a check at every test entry.
  • If the subprocess dies mid-test, the test still fails — we only recover BETWEEN tests.
  • Respawning the subprocess loses any in-process state. Tests that rely on state from a prior test fail on respawn.

Effort: 4-6 hours (refactor fixture + add respawn logic + tests).

Risk: Low. The respawn is a fallback; the primary path (subprocess stays alive) is unchanged.
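The respawn behaviour can be exercised without the GUI. The sketch below uses a sleeping Python interpreter as a stand-in for the sloppy.py subprocess; the class name RespawningProcess and the spawn_count counter are illustrative only and are not part of the planned fixture.

```python
import subprocess
import sys
import threading
from typing import Optional


class RespawningProcess:
    """Minimal stand-in for _LiveGuiHandle, for illustration only."""

    def __init__(self, argv: list):
        self._argv = argv
        self._lock = threading.Lock()
        self._process: Optional[subprocess.Popen] = None
        self.spawn_count = 0
        self._spawn()

    def _spawn(self) -> None:
        self._process = subprocess.Popen(self._argv)
        self.spawn_count += 1

    def is_alive(self) -> bool:
        # poll() returns None while the child is still running
        return self._process is not None and self._process.poll() is None

    def ensure_alive(self) -> None:
        with self._lock:
            if not self.is_alive():
                self._spawn()

    def kill(self) -> None:
        with self._lock:
            if self._process is not None:
                self._process.kill()
                self._process.wait()
                self._process = None


handle = RespawningProcess([sys.executable, "-c", "import time; time.sleep(60)"])
handle.ensure_alive()        # already alive: no respawn
handle._process.kill()       # simulate a crash between tests
handle._process.wait()
handle.ensure_alive()        # dead: a new subprocess is spawned
spawns_after_crash = handle.spawn_count
alive_after_respawn = handle.is_alive()
handle.kill()                # teardown
```

Note that poll() only detects a process that has exited. In the scenario from section 1 the main loop dies while the hook server thread survives, so the process may still count as alive; a liveness check alone may therefore need to be paired with a health query.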

Solution C: Per-Batch Process Tracking (most surgical)

Approach: Add a process health check at the start of each batch in scripts/run_tests_batched.py. If the previous batch left the subprocess dead, log a clear warning. Tests can then fail fast with a known message.

Code change (conftest writes pid file, batcher reads it):

# tests/conftest.py (in live_gui fixture, after spawn)
pid_file = tests_dir / ".live_gui_pid"
pid_file.write_text(str(process.pid))

# scripts/run_tests_batched.py
def _run_batch(b: Batch, ...) -> ...:
    if b.label.startswith("tier-3-live_gui"):
        pid_file = tests_dir / ".live_gui_pid"
        if pid_file.exists():
            pid = int(pid_file.read_text().strip())
            if not _is_pid_alive(pid):
                print(_c(f"[BATCH-WARN] Prior tier-3 batch left the live_gui subprocess (pid={pid}) dead. "
                         f"This batch's live_gui tests may not start with a clean state.",
                         _C.BOLD_YELLOW))

Pros:

  • Surgical. Doesn't change the fixture or test code.
  • Surfaces the dirty state via a clear warning, not a silent hang.
  • User can then choose to debug or skip the batch.

Cons:

  • Doesn't actually FIX the dirty state — just makes it visible.
  • Requires the fixture to write a pid file (small change).
  • Tests still fail with the same confusing timeout, but the warning is in the runner output.

Effort: 1-2 hours.

Risk: Low. Read-only check, no behavioral change.

Solution D: Fixture Auto-Detect (middle ground)

Approach: Keep live_gui session-scoped, but at the START of each test (not file), check if the subprocess is alive. If dead, re-spawn.

Code change (conftest auto-use hook):

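# tests/conftest.py
@pytest.fixture(autouse=True)
def _check_live_gui_health(request):
    # Resolve live_gui lazily. Listing it as a parameter of an autouse fixture
    # would spawn the GUI for every test, including tests that never use it,
    # and would make the fixturenames check below always true.
    if "live_gui" in request.fixturenames:
        handle, gui_script = request.getfixturevalue("live_gui")
        handle.ensure_alive()  # handle is the _LiveGuiHandle from Solution B
    yield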

Pros:

  • Per-test recovery. A test that crashes the subprocess doesn't affect the next test.
  • Minimal API change (tests still use live_gui).

Cons:

  • Per-test overhead (~0.1s for the health check).
  • If a test's clicks during a degraded subprocess fail, the test must be re-designed to be idempotent.
  • Respawning loses state.

Effort: 2-3 hours.

Risk: Medium. Tests that assume "subprocess is alive when my test starts" may need adjustment.


4. Recommendation

Primary: Solution A (per-file fixture scope)

  • Most isolated. Each test file is a clean unit.
  • Simple to implement and audit.
  • For the IM_ASSERT scenario: test_extended_sims.py crashes its subprocess at the end. test_live_workflow.py starts with a fresh subprocess. The IM_ASSERT-triggered pollution doesn't reach test_live_workflow.py.

Secondary: Solution C (per-batch warning)

  • Safety net. If a test file's subprocess dies mid-file (rather than at end of file), the next batch's runner logs a clear warning.
  • Doesn't fix the dirty state but makes it visible.

Optional: Solution B (lazy re-spawn)

  • If the audit for Solution A reveals too many cross-file dependencies, Solution B is the fallback.
  • More complex but preserves the per-session state model.

Not recommended (initial assessment): Solution D (fixture auto-detect)

  • Per-test recovery is too granular. A test's failure shouldn't trigger a re-spawn that affects subsequent tests' setup.
  • Initial objection: Solution D appeared not to help the IM_ASSERT scenario, on the assumption that a respawn only benefits the next test in the SAME file (test_extended_sims.py). That assumption does not hold. The next test to run is test_full_live_workflow, which is in a different file, and the autouse check runs before every test regardless of file, so Solution D would still respawn correctly for it.

Actually, Solution D WOULD work for the IM_ASSERT scenario:

  • IM_ASSERT fires during test_execution_sim_live (test 4 in test_extended_sims.py)
  • test_execution_sim_live is the last test in test_extended_sims.py, so no later test in that file is affected
  • Next file is test_live_workflow.py, first test is test_full_live_workflow
  • Solution D's autouse fixture would re-spawn the subprocess before test_full_live_workflow

So Solution D is actually a viable primary approach. Let me reconsider.

Revised recommendation:

  • Solution D (autouse fixture auto-respawn) as the primary. It's the most surgical.
  • Solution A (per-file scope) as the alternative if Solution D's autouse approach has side effects.
  • Solution C (per-batch warning) as a safety net for any case the autouse doesn't catch.

5. Open Questions for the User

Before implementation, these need clarification:

  1. Fixture scope preference: Per-file (Solution A) or per-test auto-respawn (Solution D)?

    • Per-file: more overhead but simpler reasoning
    • Per-test auto-respawn: more surgical but adds an autouse hook
    • My recommendation: Solution D. It's the closest to "the next test file gets a clean subprocess" without changing the fixture's API.
  2. State reset on respawn: When the subprocess is re-spawned, should the new subprocess inherit any state (e.g., loaded project, recent discussion)?

    • My recommendation: No. Fresh subprocess = fresh state. Tests should set up their own state.
  3. Failure signaling: If the subprocess can't be respawned (e.g., port 8999 still in use from a zombie), should the test fail immediately or retry?

    • My recommendation: Fail immediately with a clear error. Retries can hide real issues.
  4. Backward compatibility: Are there tests that explicitly DEPEND on the session-scoped behavior (e.g., they share state across files)?

    • Need to audit. The audit is part of Solution A; for Solution D, the audit is less critical because respawned subprocesses are NEW instances (no shared state with prior subprocesses).
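For question 3, a fail-immediately policy needs a way to tell whether the hook server port is still held by a zombie before respawning. The sketch below probes the port with a TCP connect. The function name is_port_in_use and the default host are assumptions, and the port number 8999 is taken from the question above.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0


def require_free_port(port: int = 8999) -> None:
    """Raise instead of retrying when the port is still occupied."""
    if is_port_in_use(port):
        raise RuntimeError(
            f"port {port} is still in use; refusing to respawn the GUI subprocess"
        )
```

Raising here matches the recommendation to fail with a clear error: the message names the port, which points directly at a leftover process.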

6. References

  • tests/conftest.py:282 — current live_gui fixture (session-scoped)
  • tests/conftest.py:516-547 — live_gui fixture finally block (kill + cleanup)
  • scripts/run_tests_batched.py:136-164 — _run_batch function
  • scripts/run_tests_batched.py:51-86 — batch result tracking
  • docs/reports/test_full_live_workflow_propagation_digest_20260608.md — full solution matrix
  • conductor/todos/TODO_test_full_live_workflow_v2.md — task list including Task 4 (batch isolation)