docs(reports): root cause report for test_full_live_workflow race condition

2026-06-08 09:24:14 -04:00
parent 64823493c0
commit 08ee7547be
1 changed files with 144 additions and 0 deletions
@@ -0,0 +1,144 @@
+# Root Cause Report: test_full_live_workflow FAILED in tier-3 batch
+
+**Status:** Investigation complete, NO fix attempted (per user: report first, then simple todo)
+**Failure reproducibility:** 100% in tier-3 batch (`uv run python scripts/run_tests_batched.py --tiers 3`), 0% in isolation (`uv run pytest tests/test_live_workflow.py` → 11.3s, PASS)
+**Related track:** `test_batching_refactor_20260606` (now shipped; this is the only remaining pre-existing failure it surfaced)
+
+---
+
+## Symptom
+
+```
+tests/test_live_workflow.py::test_full_live_workflow FAILED              [ 65%]
+___________________________ test_full_live_workflow ___________________________
+    @pytest.mark.integration
+    def test_full_live_workflow(live_gui) -> None:
+        client = ApiHookClient()
+        assert client.wait_for_server(timeout=10)
+        client.post_session(session_entries=[])
+        # 1. Reset
+        client.click("btn_reset")
+        time.sleep(1)
+        # 2. Project Setup
+        temp_project_path = os.path.abspath("tests/artifacts/temp_project.toml")
+        if os.path.exists(temp_project_path):
+            try: os.remove(temp_project_path)
+            except: pass
+        client.click("btn_project_new_automated", user_data=temp_project_path)
+        # Wait for project to be active
+        success = False
+        for _ in range(10):
+            proj = client.get_project()
+            if proj.get('project', {}).get('project', {}).get('name') == 'temp_project':
+                success = True
+                break
+            time.sleep(1)
+>   assert success, "Project failed to activate"
+E   AssertionError: Project failed to activate
+```
+
+---
+
+## Root Causes (layered, not single)
+
+### Cause 1: `os.path.abspath("tests/artifacts/temp_project.toml")` is cwd-relative (test line 50)
+- The path is resolved against whatever cwd the test process happens to be in.
+- `pytest` and `uv run python scripts/run_tests_batched.py` may have different cwds depending on how the runner was invoked.
+- When the test runs via the batcher, `temp_project_path` may be different from the path used by earlier tests in the same `live_gui` subprocess session.
+- The `if os.path.exists(temp_project_path): os.remove(...)` cleanup at lines 52-53 then removes a file that isn't the one the controller will create, leaving a stale project at the controller's actual cwd.
+
+### Cause 2: Click → background switch → test poll race (test lines 55-65 vs `src/app_controller.py:2723`)
+Trace:
+1. `client.click("btn_project_new_automated", user_data=temp_project_path)` enqueues a GUI task
+2. `_handle_click` (line 575) dispatches to `_cb_new_project_automated(user_data)` (line 580)
+3. `_cb_new_project_automated` (line 2677):
+   - `name = Path(user_data).stem` = `"temp_project"` ✓
+   - `proj = project_manager.default_project(name)`
+   - `project_manager.save_project(proj, user_data)` — writes file to disk ✓
+   - `self._switch_project(user_data)` — **schedules background thread via `self.submit_io`**
+4. `_do_project_switch` (line 2692) runs in the background, loads the project, sets `self.project`
+5. Test polls `client.get_project()` → `/api/project` → reads `self.project` via `project_manager.flat_config(...)` (line 253 in `project_manager.py`)
+
+**The race:** the test starts polling immediately after `click()` returns. The background thread hasn't necessarily completed `_do_project_switch` yet. With 47 prior live_gui tests in the same subprocess, the IO thread pool is overloaded and the background task is delayed past the 10s window.
+
+### Cause 3: Click is "fire and forget" — no completion signal
+- `client.click()` enqueues a GUI task; it does not wait for the task to complete.
+- There is no API endpoint to query "is the click handler done?" or "did the switch succeed?"
+- The test's only feedback is polling `client.get_project()`, which returns the controller's current `self.project` — which may be from a prior test.
+
+### Cause 4: No defensive state verification
+- The test never checks whether the click was dispatched, whether the handler completed, or whether the file was actually written to disk.
+- It assumes: `click()` always enqueues, the handler always succeeds, the project always activates within 10s. These are all fragile assumptions.
+
+### Cause 5: Stale state from prior live_gui tests
+- Earlier live_gui tests in the tier-3 batch (47 of them) may have:
+  - Set `self.project` to a different project
+  - Added entries to `self.project_paths`
+  - Left a stale `tests/artifacts/temp_project.toml` from a prior failed run
+- The `live_gui` fixture is session-scoped, so all 48 tests share the same `App` and `AppController` instance.
+- The test's `client.click("btn_reset")` at line 46 resets the AI session but **does not reset the project** (see `_handle_reset_session` at line 3244 — it clears `files`, `context_files`, `disc_entries`, etc. but not `self.project` or `self.active_project_path`).
+
+### Cause 6: API response shape double-nesting (test line 61 vs `src/api_hooks.py:116`)
+- `client.get_project()` returns `{"project": <flat_config>}` where `<flat_config>` is `project_manager.flat_config()` output, which itself wraps the project dict in another `"project"` key.
+- So the actual response is `{"project": {"project": {"name": "temp_project", ...}, "output": {...}, ...}}`.
+- The test correctly walks `.get('project').get('project').get('name')` — this is NOT a bug. The test is right; the API and the test agree.
+- (Confirmed by reading `src/project_manager.py:253-272` and `src/api_hooks.py:115-116`.)
+
+---
+
+## Why isolation works but the batch fails
+
+| State | Isolation | Tier-3 batch |
+|-------|-----------|--------------|
+| `live_gui` subprocess | Fresh | Shared with 47 prior tests |
+| `self.project_paths` | `[]` | Populated by prior tests |
+| `self.project` | `{}` | Stale from prior `_switch_project` calls |
+| `self.io_pool` queue | Empty | Backed up from prior `submit_io` calls |
+| `tests/artifacts/temp_project.toml` | Pre-deleted (test did it) | May be re-created by prior test with different cwd |
+| `os.path.abspath("tests/artifacts/temp_project.toml")` | Resolves to the test's cwd | Resolves to whatever cwd the batcher set |
+
+The 10×1s poll loop is the **load-bearing fragile assumption** — it works when the IO pool is idle, fails when it's backed up.
+
+---
+
+## What's fragile about the test (user feedback)
+
+> "if there is a race condition, we need to properly have the test handle it so that its not fragile. The live gui tests should properly have feed when state is valid and also setup the correct state and not assume its perfect, that kind of fragility is not a good thing"
+
+The user is right on all four points:
+
+1. **Race condition handling is broken** — blind polling of HTTP endpoint for 10s is the test equivalent of `time.sleep(10)`. It assumes the system is well-behaved.
+2. **No "feed when state is valid"** — the test should wait for a deterministic signal (e.g. `ai_status == "switched to: temp_project"`) instead of polling the project state.
+3. **No "setup the correct state"** — the test should fully reset project state before starting (call `_switch_project(None)` or similar) to ensure it's not running on top of a prior test's state.
+4. **"Assume its perfect"** — the test never checks: (a) was the click dispatched? (b) did the file get written? (c) did the controller's project change? Each is independently a failure mode the test ignores.
+
+---
+
+## Recommended fix direction (NOT IMPLEMENTED)
+
+Per the user: this report only. A simple todo follows.
+
+The fix direction is to make the test poll a **deterministic signal** rather than a derived state. Two viable options:
+
+**Option A (preferred — add a signal):** Add `/api/project_switch_status` endpoint that returns `{"in_progress": bool, "path": str, "error": str | null}`. Test polls this until `in_progress == False` and `path == expected_path` and `error is None`. Catches all three failure modes (not dispatched, file not written, controller error).
+
+**Option B (less invasive — derive signal from `ai_status`):** The `_do_project_switch` method (line 2714) already sets `self.ai_status = f"switched to: {Path(path).stem}"`. Test could poll `/api/gui/state` for `ai_status == "switched to: temp_project"`. Less robust (ai_status can be overwritten by other events) but no API change.
+
+**Plus state hygiene:**
+- `_handle_reset_session` should also reset `self.project` and `self.active_project_path` to defaults (or call a new `_reset_project` helper).
+- The pre-delete at line 52 should use the `live_gui` fixture's `temp_project_path` (provided by the fixture) rather than a hardcoded cwd-relative path.
+- The 10-iteration poll should use exponential backoff or a condition-based wait with a clear timeout (see `superpowers:condition-based-waiting`).
+
+---
+
+## Files referenced (no changes made)
+
+- `tests/test_live_workflow.py:32-65` — failing test
+- `src/app_controller.py:2677-2684` — `_cb_new_project_automated` (synchronous up to switch, then dispatches IO)
+- `src/app_controller.py:2723-2747` — `_switch_project` (schedules IO)
+- `src/app_controller.py:2692-2721` — `_do_project_switch` (background, loads project)
+- `src/app_controller.py:3244-3296` — `_handle_reset_session` (does NOT reset project)
+- `src/api_hooks.py:110-116` — `/api/project` GET endpoint
+- `src/project_manager.py:253-272` — `flat_config` (returns nested `project` key)
+- `src/project_manager.py:109-161` — `default_project` (returns nested `project` key)
+- `tests/conftest.py` — `live_gui` session fixture (cwd-dependent artifact path)