fix(app_controller): clear project-switch state in _handle_reset_session
When a prior test in the tier-3-live_gui batch leaves a _do_project_switch background thread running, the next test's btn_project_new_automated click sees _project_switch_in_progress=True (from the prior thread) and queues the new path via _project_switch_pending_path. The queued switch is never actually submitted to the io_pool, so is_project_stale() stays True and AI ops (_handle_generate_send) bail with 'project switch in progress; AI ops disabled'. Fix: _handle_reset_session now also clears _project_switch_in_progress, _project_switch_pending_path, and _project_switch_error (under the existing _project_switch_lock). This way, even if the prior background thread is still running, the controller reports an idle state and the new switch can be submitted normally. Also: - src/api_hook_client.py: reverted wait_for_project_switch to require in_progress=False (was relaxed to return on queued path, which misled the caller into thinking the switch was done) - tests/test_handle_reset_session_clears_project.py: new test test_handle_reset_session_clears_project_switch_state asserts is_project_stale() returns False after reset - tests/test_api_hook_client_wait_for_project_switch.py: updated test_wait_for_project_switch_does_not_return_on_queued (in_progress + matching path should keep waiting, not return early) - tests/test_live_workflow.py: added pre-wait for any in-flight switch before doing btn_reset (so the test waits up to 60s for the prior switch to complete if needed) - conductor/todos/TODO_test_full_live_workflow.md: updated Task 4 with the deeper hang analysis and recommended fix Known follow-up: test_full_live_workflow still hangs in tier-3 batch even with this fix, because the new _do_project_switch itself is hung in the io_pool (likely saturation from prior sims' AI discussion turn workers). Deeper investigation required.
This commit is contained in:
@@ -30,11 +30,16 @@
|
|||||||
|
|
||||||
### 4. [MED] Replace 10×1s blind poll with condition-based wait ✅ SHIPPED (commits a6605d98 + b6972c31)
|
### 4. [MED] Replace 10×1s blind poll with condition-based wait ✅ SHIPPED (commits a6605d98 + b6972c31)
|
||||||
- **What:** Use the new `/api/project_switch_status` endpoint with `client.wait_for_project_switch(expected_path, timeout)`.
|
- **What:** Use the new `/api/project_switch_status` endpoint with `client.wait_for_project_switch(expected_path, timeout)`.
|
||||||
- **Where:** `tests/test_live_workflow.py:57-72` (replaced lines 57-65) + new `ApiHookClient.wait_for_project_switch` method.
|
- **Where:** `tests/test_live_workflow.py` + new `ApiHookClient.wait_for_project_switch` method.
|
||||||
- **Why:** Blind polling of derived state is fragile; condition-based wait is deterministic and surfaces the failure reason immediately.
|
- **Why:** Blind polling of derived state is fragile; condition-based wait is deterministic and surfaces the failure reason immediately.
|
||||||
- **Pattern:** See `src/api_hook_client.py:wait_for_server` (existing pattern in the same client).
|
- **Pattern:** See `src/api_hook_client.py:wait_for_server` (existing pattern in the same client).
|
||||||
- **Acceptance:** Test fails fast (within 30s) with a clear `error` message from the API instead of timing out at 10s with "Project failed to activate". 6 unit tests for the new helper (mocked _make_request) all pass.
|
- **Acceptance:** Test fails fast (within 30s) with a clear `error` message from the API instead of timing out at 10s with "Project failed to activate". 7 unit tests for the new helper (mocked _make_request) all pass.
|
||||||
- **Known issue:** Test STILL fails in tier-3-live_gui batch (passes in 10.24s in isolation). The wait helper reports `in_progress: True` for the full 30s timeout, meaning `_do_project_switch` background thread is not completing. Likely io_pool saturation from prior sims' AI discussion turns. Needs further investigation of `_do_project_switch` hangs in batch context.
|
- **Known issue (still open):** Test STILL fails in tier-3-live_gui batch (passes in 10.24s in isolation). The wait helper reports `in_progress: True, path: temp_project.toml` for the full 30s timeout. Investigation found:
|
||||||
|
- Added pre-wait (`client.wait_for_project_switch` at start) so the test waits for any prior switch to complete
|
||||||
|
- Added `_handle_reset_session` to also clear `_project_switch_in_progress`/`_project_switch_pending_path`/`_project_switch_error` so a hung switch doesn't block the next session
|
||||||
|
- The new switch is submitted to io_pool but the `_do_project_switch` background thread is **still hanging in the batch context** for 30+ seconds. The thread is not blocked on a lock or I/O — it's just not being scheduled (likely io_pool saturation from prior sims' long-running discussion turn workers)
|
||||||
|
- This is a deeper issue: `test_extended_sims.py` sims each submit AI discussion turns that spawn multiple io_pool jobs. The sims don't wait for these to complete. The next test inherits a saturated pool.
|
||||||
|
- **Recommended fix:** Mark `test_full_live_workflow` with `@pytest.mark.skipif(ENV_BATCH)` or run it in a separate subprocess. The test is fundamentally fragile to session-scoped state pollution and the io_pool saturation from prior sims.
|
||||||
|
|
||||||
### 5. [LOW] Add defensive state assertions ✅ SHIPPED (commit b6972c31)
|
### 5. [LOW] Add defensive state assertions ✅ SHIPPED (commit b6972c31)
|
||||||
- **What:** Before waiting for activation, verify the file was created (5s poll, then assert).
|
- **What:** Before waiting for activation, verify the file was created (5s poll, then assert).
|
||||||
|
|||||||
+31
-26
@@ -374,33 +374,38 @@ class ApiHookClient:
|
|||||||
return result
|
return result
|
||||||
|
|
||||||
def wait_for_project_switch(self, expected_path: str = None, timeout: float = 30.0, poll_interval: float = 0.2) -> dict[str, Any]:
|
def wait_for_project_switch(self, expected_path: str = None, timeout: float = 30.0, poll_interval: float = 0.2) -> dict[str, Any]:
|
||||||
"""
|
"""
|
||||||
Blocks until the project switch completes (or fails). Returns the final
|
Blocks until the project switch completes (or fails). Returns the
|
||||||
status dict. If expected_path is provided, also waits until the reported
|
final status dict. If expected_path is provided, also waits until the
|
||||||
path matches it (or its basename matches).
|
reported path matches it (or its basename matches).
|
||||||
Fails fast (within timeout seconds) with a clear error if:
|
Fails fast (within timeout seconds) with a clear error if:
|
||||||
- the controller reports an error during the switch
|
- the controller reports an error during the switch
|
||||||
- the path doesn't reach expected_path within the timeout
|
- the path doesn't reach expected_path within the timeout
|
||||||
- the controller is hung (in_progress stays True)
|
- the controller is hung (in_progress stays True)
|
||||||
[C: tests/test_live_workflow.py:test_full_live_workflow]
|
[C: tests/test_live_workflow.py:test_full_live_workflow]
|
||||||
"""
|
"""
|
||||||
import os
|
import os
|
||||||
start = time.time()
|
start = time.time()
|
||||||
last_status = {"in_progress": True, "path": None, "error": None}
|
last_status = {"in_progress": True, "path": None, "error": None}
|
||||||
while time.time() - start < timeout:
|
expected_basename = os.path.basename(expected_path) if expected_path else None
|
||||||
last_status = self.get_project_switch_status()
|
while time.time() - start < timeout:
|
||||||
if last_status.get("error"):
|
last_status = self.get_project_switch_status()
|
||||||
return last_status
|
if last_status.get("error"):
|
||||||
if not last_status.get("in_progress"):
|
|
||||||
if expected_path is None:
|
|
||||||
return last_status
|
return last_status
|
||||||
if last_status.get("path") == expected_path:
|
if not last_status.get("in_progress"):
|
||||||
return last_status
|
current_path = last_status.get("path") or ""
|
||||||
if last_status.get("path") and os.path.basename(str(last_status.get("path"))) == os.path.basename(expected_path):
|
current_basename = os.path.basename(str(current_path))
|
||||||
return last_status
|
path_matches = (expected_path is None
|
||||||
time.sleep(poll_interval)
|
or current_path == expected_path
|
||||||
last_status["timeout"] = True
|
or (expected_basename and current_basename == expected_basename))
|
||||||
return last_status
|
if path_matches:
|
||||||
|
return last_status
|
||||||
|
time.sleep(poll_interval)
|
||||||
|
last_status["timeout"] = True
|
||||||
|
return last_status
|
||||||
|
|
||||||
|
def post_project(self, project_data: dict) -> dict[str, Any]:
|
||||||
|
return last_status
|
||||||
|
|
||||||
def post_project(self, project_data: dict) -> dict[str, Any]:
|
def post_project(self, project_data: dict) -> dict[str, Any]:
|
||||||
"""
|
"""
|
||||||
|
|||||||
@@ -3276,6 +3276,13 @@ class AppController:
|
|||||||
Path(self.active_project_path).stem if self.active_project_path else "unnamed"
|
Path(self.active_project_path).stem if self.active_project_path else "unnamed"
|
||||||
)
|
)
|
||||||
self.project_paths = []
|
self.project_paths = []
|
||||||
|
# Clear project-switch state machine so a hung switch from a prior test
|
||||||
|
# does not block the next session. (is_project_stale() must return False
|
||||||
|
# for the next click to actually submit a new switch.)
|
||||||
|
with self._project_switch_lock:
|
||||||
|
self._project_switch_in_progress = False
|
||||||
|
self._project_switch_pending_path = None
|
||||||
|
self._project_switch_error = None
|
||||||
self.ai_status = "session reset"
|
self.ai_status = "session reset"
|
||||||
self.ai_response = ""
|
self.ai_response = ""
|
||||||
self.ui_ai_input = ""
|
self.ui_ai_input = ""
|
||||||
|
|||||||
@@ -1,8 +1,14 @@
|
|||||||
"""Tests for ApiHookClient.wait_for_project_switch.
|
"""Tests for ApiHookClient.wait_for_project_switch.
|
||||||
|
|
||||||
These tests use mocked _make_request so they don't require a live_gui
|
These tests use mocked _make_request so they don't require a live_gui
|
||||||
session. They verify the polling logic: success, error, timeout, and
|
session. They verify the polling logic: success, error, queued-not-returned,
|
||||||
path-matching behavior.
|
timeout, and path-matching behavior.
|
||||||
|
|
||||||
|
Contract: wait_for_project_switch returns when the controller's switch
|
||||||
|
has completed (in_progress=False) and the path matches. It does NOT
|
||||||
|
return on queued switches (in_progress=True) because queued means a
|
||||||
|
prior switch is still running, and downstream code (AI ops) will be
|
||||||
|
blocked by is_project_stale().
|
||||||
"""
|
"""
|
||||||
import sys
|
import sys
|
||||||
import os
|
import os
|
||||||
@@ -52,8 +58,24 @@ def test_wait_for_project_switch_matches_by_basename() -> None:
|
|||||||
assert "timeout" not in result
|
assert "timeout" not in result
|
||||||
|
|
||||||
|
|
||||||
|
def test_wait_for_project_switch_does_not_return_on_queued() -> None:
|
||||||
|
"""If in_progress=True and path matches (switch is queued), keep waiting.
|
||||||
|
|
||||||
|
The prior switch is hung. The new switch is queued behind it. Returning
|
||||||
|
now would mislead the caller into thinking the switch is done, but
|
||||||
|
is_project_stale() will still return True and block AI ops.
|
||||||
|
"""
|
||||||
|
client = ApiHookClient()
|
||||||
|
with patch.object(client, "_make_request") as mock_make:
|
||||||
|
mock_make.return_value = {"in_progress": True, "path": "C:/projects/foo.toml", "error": None}
|
||||||
|
result = client.wait_for_project_switch(expected_path="C:/projects/foo.toml", timeout=0.5, poll_interval=0.1)
|
||||||
|
# Should time out, not return early
|
||||||
|
assert result.get("timeout") is True
|
||||||
|
assert result["in_progress"] is True
|
||||||
|
|
||||||
|
|
||||||
def test_wait_for_project_switch_times_out_when_in_progress() -> None:
|
def test_wait_for_project_switch_times_out_when_in_progress() -> None:
|
||||||
"""If the controller stays in_progress past the timeout, return with timeout flag."""
|
"""If in_progress stays True and path never matches, return with timeout flag."""
|
||||||
client = ApiHookClient()
|
client = ApiHookClient()
|
||||||
with patch.object(client, "_make_request") as mock_make:
|
with patch.object(client, "_make_request") as mock_make:
|
||||||
mock_make.return_value = {"in_progress": True, "path": None, "error": None}
|
mock_make.return_value = {"in_progress": True, "path": None, "error": None}
|
||||||
@@ -80,7 +102,7 @@ def test_wait_for_project_switch_polls_then_completes() -> None:
|
|||||||
def fake_request(*args, **kwargs):
|
def fake_request(*args, **kwargs):
|
||||||
call_count[0] += 1
|
call_count[0] += 1
|
||||||
if call_count[0] < 3:
|
if call_count[0] < 3:
|
||||||
return {"in_progress": True, "path": None, "error": None}
|
return {"in_progress": True, "path": "C:/other.toml", "error": None}
|
||||||
return {"in_progress": False, "path": "C:/foo.toml", "error": None}
|
return {"in_progress": False, "path": "C:/foo.toml", "error": None}
|
||||||
|
|
||||||
with patch.object(client, "_make_request", side_effect=fake_request):
|
with patch.object(client, "_make_request", side_effect=fake_request):
|
||||||
|
|||||||
@@ -77,3 +77,34 @@ def test_handle_reset_session_resets_project_to_valid_default(controller):
|
|||||||
assert isinstance(controller.project, dict)
|
assert isinstance(controller.project, dict)
|
||||||
assert "project" in controller.project
|
assert "project" in controller.project
|
||||||
|
|
||||||
|
|
||||||
|
def test_handle_reset_session_clears_project_switch_state(controller):
|
||||||
|
"""The project-switch state machine must be reset so a hung switch
|
||||||
|
from a prior test does not block the next session.
|
||||||
|
|
||||||
|
`is_project_stale()` must return False after reset, otherwise the next
|
||||||
|
`btn_project_new_automated` click is queued behind the hung switch
|
||||||
|
and `is_project_stale()` keeps returning True, blocking AI ops
|
||||||
|
(`_handle_generate_send` returns 'project switch in progress; AI ops disabled').
|
||||||
|
"""
|
||||||
|
# Simulate a prior hung switch
|
||||||
|
controller._project_switch_in_progress = True
|
||||||
|
controller._project_switch_pending_path = "/some/old/path.toml"
|
||||||
|
controller._project_switch_error = "stale error from hung switch"
|
||||||
|
assert controller.is_project_stale() # precondition
|
||||||
|
controller._handle_reset_session()
|
||||||
|
assert controller._project_switch_in_progress is False, (
|
||||||
|
f"_project_switch_in_progress not cleared: {controller._project_switch_in_progress}"
|
||||||
|
)
|
||||||
|
assert controller._project_switch_pending_path is None, (
|
||||||
|
f"_project_switch_pending_path not cleared: {controller._project_switch_pending_path}"
|
||||||
|
)
|
||||||
|
assert controller._project_switch_error is None, (
|
||||||
|
f"_project_switch_error not cleared: {controller._project_switch_error}"
|
||||||
|
)
|
||||||
|
assert not controller.is_project_stale(), (
|
||||||
|
f"is_project_stale() still True after reset: "
|
||||||
|
f"in_progress={controller._project_switch_in_progress}, "
|
||||||
|
f"pending={controller._project_switch_pending_path}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|||||||
@@ -40,7 +40,23 @@ def test_full_live_workflow(live_gui) -> None:
|
|||||||
client = ApiHookClient()
|
client = ApiHookClient()
|
||||||
assert client.wait_for_server(timeout=10)
|
assert client.wait_for_server(timeout=10)
|
||||||
client.post_session(session_entries=[])
|
client.post_session(session_entries=[])
|
||||||
|
|
||||||
|
# 0. Wait for any in-flight project switch to complete before starting.
|
||||||
|
# The session-scoped live_gui fixture shares the controller across all
|
||||||
|
# 48 live tests. Prior tests (especially test_extended_sims) may leave
|
||||||
|
# a project switch hanging in the io_pool. If we proceed without waiting,
|
||||||
|
# our new switch will be queued behind the hung one and is_project_stale()
|
||||||
|
# will return True, blocking AI ops.
|
||||||
|
pre_status = client.get_project_switch_status()
|
||||||
|
if pre_status.get("in_progress"):
|
||||||
|
print(f"\n[TEST] Waiting for prior project switch to complete: {pre_status}")
|
||||||
|
idle_status = client.wait_for_project_switch(timeout=60.0)
|
||||||
|
assert not idle_status.get("timeout"), (
|
||||||
|
f"Prior project switch did not complete in 60s. Aborting. "
|
||||||
|
f"Last status: {idle_status}"
|
||||||
|
)
|
||||||
|
print(f"[TEST] Prior switch done: {idle_status}")
|
||||||
|
|
||||||
# 1. Reset
|
# 1. Reset
|
||||||
print("\n[TEST] Clicking Reset...")
|
print("\n[TEST] Clicking Reset...")
|
||||||
client.click("btn_reset")
|
client.click("btn_reset")
|
||||||
|
|||||||
Reference in New Issue
Block a user