# Batch-Level Test Resilience Plan

**Companion to:** `docs/reports/test_full_live_workflow_propagation_digest_20260608.md`
**Status:** Pre-implementation plan
**User requirement:** "I also don't want a batch to be too fragile where I can't restart the app and continue with the next test file if it fails. Just has to note that the new file didn't get to deal with a dirty state."

---

## 1. Current Behavior

The `tests/conftest.py:live_gui` fixture is **session-scoped**. It spawns a single `sloppy.py` subprocess at the start of the test session and keeps it alive for ALL live_gui tests across ALL tiers.

**Test file structure (relevant):**
- `tests/test_extended_sims.py` — 4 sim tests: `test_context_sim_live`, `test_ai_settings_sim_live`, `test_tools_sim_live`, `test_execution_sim_live`. The IM_ASSERT fires during the 4th sim (~71.5s into GUI lifetime).
- `tests/test_live_workflow.py` — separate file, runs AFTER test_extended_sims.py in alphabetical order. `test_full_live_workflow` is the failing test.

The IM_ASSERT crashes the GUI's main loop mid-test-file. The hook server (a separate thread) survives, but the controller's `_io_pool` has already been shut down. The next test file (`test_live_workflow.py`) starts in this degraded state. Its first click (`btn_project_new_automated`) hits `submit_io`, which raises `RuntimeError: cannot schedule new futures after shutdown`. The test's `wait_for_project_switch` then polls for 120s before timing out.
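
The `RuntimeError` text matches what Python's `concurrent.futures.ThreadPoolExecutor` raises when work is submitted after `shutdown()`. The sketch below reproduces that error in isolation. It assumes `_io_pool` is such an executor, which this plan does not state explicitly; `io_pool` is a stand-in, not the controller's real object.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the controller's _io_pool (assumed to be a ThreadPoolExecutor).
io_pool = ThreadPoolExecutor(max_workers=1)
io_pool.shutdown(wait=True)  # the state the crash path is assumed to leave behind

try:
    io_pool.submit(print, "never runs")
    error_message = None
except RuntimeError as exc:
    # Submitting to a shut-down executor fails immediately, not after a delay.
    error_message = str(exc)

print(error_message)
```

The point of the sketch is that the failure is immediate at the `submit` call; the 120s delay comes entirely from the test polling for a state change that can no longer happen.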

**Failure mode observed by user:** "the new file didn't get to deal with a dirty state"

---

## 2. Real User Concern: Within-Session Subprocess Degradation

The user's concern is specifically about WITHIN-SESSION state. They want:

1. A test file can crash the subprocess without preventing the next file from running cleanly
2. If the next file is doomed (subprocess is degraded), the runner should report this clearly, not silently time out
3. The runner should continue to subsequent batches even after a failed one (this already works for tiers that don't use `live_gui`)

**The current implementation has NONE of these properties:**
- `live_gui` is session-scoped, so the subprocess lives across the whole test session
- A crashed subprocess poisons all subsequent live_gui tests
- The degraded state (io_pool shut down) is not surfaced to the test, so the test fails with a confusing timeout, not a clear "subprocess degraded" message
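
One way to surface the degraded state is to give polling helpers a health predicate, so a wait aborts as soon as the subprocess is known to be unusable. The following is a minimal sketch under that assumption. `wait_until`, `SubprocessDegradedError`, and `is_healthy` are hypothetical names, and how `is_healthy` would detect a shut-down `_io_pool` is left open here.

```python
import time
from typing import Callable


class SubprocessDegradedError(RuntimeError):
    """Raised when the GUI subprocess is known to be unusable."""


def wait_until(
    condition: Callable[[], bool],
    is_healthy: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 0.5,
) -> bool:
    """Poll `condition`, but give up at once if `is_healthy` reports failure."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_healthy():
            raise SubprocessDegradedError(
                "live_gui subprocess is degraded; aborting wait instead of timing out"
            )
        if condition():
            return True
        time.sleep(interval)
    return False  # Healthy throughout, but the condition never became true.
```

With a helper of this shape, a degraded subprocess produces a named exception within one polling interval, while a genuine timeout still returns `False` after the full wait.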

---

## 3. Probable Solutions

### Solution A: Per-file live_gui Fixture (most isolated)

**Approach:** Change `live_gui` from `@pytest.fixture(scope="session")` to `@pytest.fixture(scope="module")`. Each test file gets a fresh subprocess.

**Code change (1 line):**
```python
# tests/conftest.py
@pytest.fixture(scope="module")  # was: "session"
def live_gui(request):
    ...
```

**Pros:**
- Maximum isolation. A test file that crashes the subprocess doesn't affect the next file.
- The fixture's `finally` block (which calls `kill_process_tree`) becomes the per-file cleanup.
- Simple to implement (one-line scope change + audit).

**Cons:**
- ~1-2s overhead per file (subprocess spawn + hook server health check).
- For 49 live_gui files, that's 49-98s of additional overhead.
- Some tests may currently rely on cross-file state (e.g., a project loaded by file A is still loaded when file B starts). These tests would break.

**Mitigation:** Audit the live_gui tests for cross-file state dependencies. Most should be standalone (each test sets up its own state). If any are not, mark them with `@pytest.mark.requires_prior_state` and either:
- Skip them when scope is module
- Or document the dependency and add a setup step in the dependent file

**Effort:** 1-2 hours (scope change + audit + fix cross-file dependencies).

**Risk:** Medium. May break tests that depend on cross-file state. The audit is the main work.
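
The lifecycle that a module-scoped fixture would give each test file can be illustrated without pytest. This is a sketch only: `fresh_subprocess` and `SLEEPER` are hypothetical names, the sleeping child stands in for the real GUI launch, and the plain `kill()` stands in for `kill_process_tree`.

```python
import subprocess
import sys
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def fresh_subprocess(args: list[str]) -> Iterator[subprocess.Popen]:
    """Spawn a process for one unit of work and always tear it down afterwards."""
    process = subprocess.Popen(args)
    try:
        yield process
    finally:
        # Stand-in for kill_process_tree: this kills only the direct child.
        if process.poll() is None:
            process.kill()
        process.wait(timeout=10)


# Placeholder command standing in for the real GUI launch.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(60)"]

# "File A": its subprocess is gone by the time the block exits.
with fresh_subprocess(SLEEPER) as first:
    first_alive = first.poll() is None

# "File B": starts with its own process, whatever happened to the first.
with fresh_subprocess(SLEEPER) as second:
    second_alive = second.poll() is None
```

Each `with` block plays the role of one test file: teardown runs even if the body raises, so the second block never sees the first block's process.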

### Solution B: Lazy Re-spawn (most flexible)

**Approach:** Keep the `live_gui` fixture session-scoped, but wrap it in a handle that re-spawns the subprocess if it dies. The handle exposes the same API as the current fixture.

**Code change (significant):**
```python
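# tests/conftest.py
import subprocess
import threading

import pytest


class _LiveGuiHandle:
    def __init__(self, gui_script: str):
        self._gui_script = gui_script
        self._process: subprocess.Popen | None = None
        self._lock = threading.Lock()
        self._spawn()

    def _spawn(self) -> None:
        # Existing fixture spawn logic, refactored into a method
        ...

    def _kill(self) -> None:
        # Existing finally-block logic (kill_process_tree + cleanup), refactored into a method
        ...

    def is_alive(self) -> bool:
        # Detects process exit only. Per section 1, a subprocess whose main loop
        # crashed may still be running, so a stronger probe may be needed.
        return self._process is not None and self._process.poll() is None

    def ensure_alive(self) -> None:
        with self._lock:
            if not self.is_alive():
                self._spawn()

    @property
    def process(self) -> subprocess.Popen:
        self.ensure_alive()
        return self._process


@pytest.fixture(scope="session")
def live_gui(request):
    # gui_script is resolved the same way as in the current fixture.
    handle = _LiveGuiHandle(gui_script)
    try:
        yield handle, handle._gui_script
    finally:
        handle._kill()  # Runs even if the session aborts.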
```

**Pros:**
- Preserves the per-session fixture scope.
- Auto-recovers from subprocess death between tests.
- Tests that rely on cross-file state can still do so (the subprocess is the same instance, modulo a respawn).
- Single place to add health checks.

**Cons:**
- More complex. The handle's `ensure_alive` adds a check on every access to `process`.
- If the subprocess dies mid-test, the test still fails — we only recover BETWEEN tests.
- Respawning the subprocess loses any in-process state. Tests that rely on state from a prior test fail on respawn.

**Effort:** 4-6 hours (refactor fixture + add respawn logic + tests).

**Risk:** Low. The respawn is a fallback; the primary path (subprocess stays alive) is unchanged.
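
A runnable miniature of the respawn idea is sketched below. It is not the proposed `_LiveGuiHandle`: `RespawningProcess` is a hypothetical name, the sleeping child is a placeholder for the GUI, and liveness is judged by process exit only.

```python
import subprocess
import sys
import threading
from typing import Optional


class RespawningProcess:
    """Miniature of the handle idea: restart the child if it has exited."""

    def __init__(self, args: list[str]):
        self._args = args
        self._lock = threading.Lock()
        self._process: Optional[subprocess.Popen] = None
        self.spawn_count = 0
        self.ensure_alive()

    def is_alive(self) -> bool:
        return self._process is not None and self._process.poll() is None

    def ensure_alive(self) -> bool:
        """Return True if a (re)spawn happened, so callers can log the lost state."""
        with self._lock:
            if self.is_alive():
                return False
            self._process = subprocess.Popen(self._args)
            self.spawn_count += 1
            return True

    def kill(self) -> None:
        with self._lock:
            if self._process is not None:
                if self._process.poll() is None:
                    self._process.kill()
                self._process.wait(timeout=10)


handle = RespawningProcess([sys.executable, "-c", "import time; time.sleep(60)"])
handle.kill()                      # simulate a crash between tests
respawned = handle.ensure_alive()  # the next access notices and restarts
alive_after = handle.is_alive()
handle.kill()                      # clean up the replacement child
```

Returning a flag from `ensure_alive` is a design choice of this sketch: it lets the caller record that in-process state was lost, which is the main cost of respawning noted above.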

### Solution C: Per-Batch Process Tracking (most surgical)

**Approach:** Add a process health check at the start of each batch in `scripts/run_tests_batched.py`. If the previous batch left the subprocess dead, log a clear warning. Tests can then fail fast with a known message.

**Code change (conftest writes pid file, batcher reads it):**
```python
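# tests/conftest.py (in live_gui fixture, after spawn)
pid_file = tests_dir / ".live_gui_pid"
pid_file.write_text(str(process.pid))

# scripts/run_tests_batched.py
# Hypothetical helper, meant to be called at the top of _run_batch.
def _warn_if_live_gui_dead(b: Batch) -> None:
    if not b.label.startswith("tier-3-live_gui"):
        return
    pid_file = tests_dir / ".live_gui_pid"
    if not pid_file.exists():
        return  # No live_gui fixture has run yet.
    try:
        pid = int(pid_file.read_text().strip())
    except ValueError:
        return  # Corrupt pid file: nothing reliable to report.
    # _is_pid_alive is not defined yet; it must be added to the runner.
    if not _is_pid_alive(pid):
        print(_c(f"[BATCH-WARN] Prior tier-3 batch left the live_gui subprocess (pid={pid}) dead. "
                 f"This batch's live_gui tests may not start with a clean state.",
                 _C.BOLD_YELLOW))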
```

**Pros:**
- Surgical. Doesn't change the fixture or test code.
- Surfaces the dirty state via a clear warning, not a silent hang.
- User can then choose to debug or skip the batch.

**Cons:**
- Doesn't actually FIX the dirty state — just makes it visible.
- Requires the fixture to write a pid file (small change).
- Tests still fail with the same confusing timeout, but the warning is in the runner output.

**Effort:** 1-2 hours.

**Risk:** Low. Read-only check, no behavioral change.

### Solution D: Fixture Auto-Detect (middle ground)

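**Approach:** Keep `live_gui` session-scoped, but at the START of each test (not each file), check whether the subprocess is alive. If it is dead, re-spawn it. This relies on a handle with an `ensure_alive` method, such as the one in Solution B.

**Code change (conftest autouse fixture):**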
```python
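# tests/conftest.py
@pytest.fixture(autouse=True)
def _check_live_gui_health(request):
    # Resolve live_gui lazily: listing it as a parameter would pull the
    # subprocess into every test, including tests that never use it.
    if "live_gui" in request.fixturenames:
        handle, gui_script = request.getfixturevalue("live_gui")
        handle.ensure_alive()  # Assumes the _LiveGuiHandle from Solution B.
    yield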
```

**Pros:**
- Per-test recovery. A test that crashes the subprocess doesn't affect the next test.
- Minimal API change (tests still use `live_gui`).

**Cons:**
- Per-test overhead (~0.1s for the health check).
- If a test's clicks during a degraded subprocess fail, the test must be re-designed to be idempotent.
- Respawning loses state.

**Effort:** 2-3 hours.

**Risk:** Medium. Tests that assume "subprocess is alive when my test starts" may need adjustment.

---

## 4. Recommended Combination

**Primary: Solution A (per-file fixture scope)**
- Most isolated. Each test file is a clean unit.
- Simple to implement and audit.
- For the IM_ASSERT scenario: test_extended_sims.py crashes its subprocess at the end. test_live_workflow.py starts with a fresh subprocess. The IM_ASSERT-triggered pollution doesn't reach test_live_workflow.py.

**Secondary: Solution C (per-batch warning)**
- Safety net. If a test file's subprocess dies mid-file (rather than at end of file), the next batch's runner logs a clear warning.
- Doesn't fix the dirty state but makes it visible.

**Optional: Solution B (lazy re-spawn)**
- If the audit for Solution A reveals too many cross-file dependencies, Solution B is the fallback.
- More complex but preserves the per-session state model.

### Reconsidering Solution D

Solution D was initially set aside as a standalone approach because per-test recovery seemed too granular: one test's failure should not trigger a re-spawn that changes the setup of subsequent tests.

Walking through the IM_ASSERT scenario shows that Solution D does cover it:
- IM_ASSERT fires during `test_execution_sim_live` (test 4 in test_extended_sims.py), which is the last test in that file.
- The next file is test_live_workflow.py, whose first test is `test_full_live_workflow`.
- Solution D's autouse fixture would re-spawn the subprocess before `test_full_live_workflow` runs, provided the health check detects the crashed subprocess.

So Solution D is a viable primary approach, and the ranking above is revised.

**Revised recommendation:**
- **Solution D (autouse fixture auto-respawn)** as the primary. It's the most surgical.
- **Solution A (per-file scope)** as the alternative if Solution D's autouse approach has side effects.
- **Solution C (per-batch warning)** as a safety net for any case the autouse doesn't catch.

---

## 5. Open Questions for the User

Before implementation, these need clarification:

1. **Fixture scope preference:** Per-file (Solution A) or per-test auto-respawn (Solution D)?
   - Per-file: more overhead but simpler reasoning
   - Per-test auto-respawn: more surgical but adds an autouse hook
   - My recommendation: Solution D. It's the closest to "the next test file gets a clean subprocess" without changing the fixture's API.

2. **State reset on respawn:** When the subprocess is re-spawned, should the new subprocess inherit any state (e.g., loaded project, recent discussion)?
   - My recommendation: No. Fresh subprocess = fresh state. Tests should set up their own state.

3. **Failure signaling:** If the subprocess can't be respawned (e.g., port 8999 still in use from a zombie), should the test fail immediately or retry?
   - My recommendation: Fail immediately with a clear error. Retries can hide real issues.

4. **Backward compatibility:** Are there tests that explicitly DEPEND on the session-scoped behavior (e.g., they share state across files)?
   - Need to audit. The audit is part of Solution A; for Solution D, the audit is less critical because respawned subprocesses are NEW instances (no shared state with prior subprocesses).
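
For question 3, the fail-immediately option could be a pre-spawn port check along these lines. This is a sketch under assumptions: `port_in_use`, `require_free_port`, and `PortInUseError` are hypothetical names, and the hook server is assumed to listen on TCP on the local host.

```python
import socket


class PortInUseError(RuntimeError):
    """Raised when the hook-server port is already taken before a respawn."""


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0


def require_free_port(port: int) -> None:
    """Fail immediately, with no retry, if the port is still occupied."""
    if port_in_use(port):
        raise PortInUseError(
            f"port {port} is still in use (possibly a zombie subprocess); "
            "refusing to respawn"
        )
```

Calling `require_free_port(8999)` before a respawn would turn a stuck port into a named error at fixture setup, instead of a later timeout inside the test body.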

---

## 6. References

- `tests/conftest.py:282` — current `live_gui` fixture (session-scoped)
- `tests/conftest.py:516-547` — `live_gui` fixture finally block (kill + cleanup)
- `scripts/run_tests_batched.py:136-164` — `_run_batch` function
- `scripts/run_tests_batched.py:51-86` — batch result tracking
- `docs/reports/test_full_live_workflow_propagation_digest_20260608.md` — full solution matrix
- `conductor/todos/TODO_test_full_live_workflow_v2.md` — task list including Task 4 (batch isolation)