docs(report): add deviation #2.5 for test_headless_verification fix

The headless batch hang the user reported was caused by an xdist worker crash on test_headless_verification_full_run, not a test logic failure. The same root cause as the 4 Phase 2 follow-ups (mock returns raw string but production does 'if not result.ok:'), but with a different failure mode (worker crash that hangs the batched test runner). Documented in section 3 of the report as deviation #2.5 with: - Where it went wrong (missed in the 4 follow-ups) - The specific symptom in the user's session - The fix (out-of-band commit e35b6a34) - Lesson for the next spec (verification must include xdist mode)
2026-06-15 21:28:29 -04:00
parent e35b6a34ad
commit bc388f11bb
1 changed files with 23 additions and 3 deletions
@@ -22,7 +22,7 @@ Two concerns, one track — both shipped:

 **(B) UI Polish Test Cleanup:** 2 broken test assertions fixed (`find()` → `rfind()` to locate the actual code instead of the comment block). The production code was already correct (user commits `d0b06575` and `df7bda6e`); the test bug was just the search logic.

-**Result:** 12 pre-existing test failures fixed (6 from the spec + 6 more discovered in Phase 2 follow-ups). 4 RAG failures remain (deferred to a separate RAG track per spec §7.1 OOS1).
+**Result:** 13 pre-existing test failures fixed (6 from the spec + 6 more discovered in Phase 2 follow-ups + 1 out-of-band that caused the headless batch hang). 4 RAG failures remain (deferred to a separate RAG track per spec §7.1 OOS1).

 **CRITICAL for the next track you plan:** The user has expressed intent to **mass-rename `send_result` to `send`** in a future refactor (stated during the run: "when we do I'm going to rename all send_result to send via mass refactor"). The track is designed so this rename is mechanical: the `Result[T]` return type is stable; only the public function name changes. The next track should plan for this rename and decide whether to keep `Result[T]` semantics or revert to `Optional[str]`.

@@ -54,8 +54,8 @@ Two concerns, one stability track. Per spec §0: "This is a **stability track**
 | **Test mock bug (G16, qwen)** | 1 | 2 tests in `test_qwen_provider.py` asserted against raw `str`; production returns `Result[str]` after the `data_oriented_error_handling` refactor |
 | **UI Polish test bug (G17, G18)** | 2 | `find()` located the comment block, not the code; `rfind()` fixes |
 | **Deprecation removal (G19)** | 1 | `send()` function + `filterwarnings` + `test_deprecation_warnings.py` |
-| **Discovered in implementation (not in spec)** | 7 | Production-affected test mocks (4 files: `test_conductor_tech_lead.py`, `test_orchestration_logic.py`, `test_orchestrator_pm.py`, `test_orchestrator_pm_history.py`, `test_phase6_engine.py`, `test_run_worker_lifecycle_abort.py`, `test_spawn_interception_v2.py`) + 4 follow-up mock-return-value fixes |
-| **Total** | **27 gaps** | (19 documented in metadata + 7 discovered in Phase 2 + 4 follow-up) |
+| **Discovered in implementation (not in spec)** | 7 | Production-affected test mocks (4 files: `test_conductor_tech_lead.py`, `test_orchestration_logic.py`, `test_orchestrator_pm.py`, `test_orchestrator_pm_history.py`, `test_phase6_engine.py`, `test_run_worker_lifecycle_abort.py`, `test_spawn_interception_v2.py`) + 4 follow-up mock-return-value fixes + 1 out-of-band headless_verification fix |
+| **Total** | **28 gaps** | (19 documented in metadata + 7 discovered in Phase 2 + 4 follow-up + 1 out-of-band) |

 ### 1.2 Symptoms (as code-discovered during Phase 1.1)

@@ -208,6 +208,26 @@ A `rg "ai_client\.send|ai_client, ['\"]send['\"]|ai_client\.send\("` would have

 **Lesson for the next spec:** A spec for a Result-based refactor should grep for `return_value="..."` and `return_value='...'` patterns that match the migrated function name. The script `scripts/audit_weak_types.py` does NOT catch this category (it catches `dict[str, Any]` annotations, not mock patterns).

+### #2.5: Out-of-band fix — `test_headless_verification.py` caused the headless batch hang (CRITICAL for batched runs)
+
+**Where this went wrong:** A 5th test file (`tests/test_headless_verification.py::test_headless_verification_full_run`) had the same raw-string mock pattern. It was missed in the original 4 follow-ups because:
+1. The 4 follow-ups targeted the 4 test files the user reported during the run
+2. The headless test fails in a different mode under xdist: the xdist worker crashes with `node down: Not properly terminated` rather than reporting a test failure
+3. The batched test runner (`scripts/run_tests_batched.py`) reads from a pipe; when the worker dies ungracefully, the master process waits forever for the pipe to close, hanging the entire `tier-1-unit-headless` batch
+
+**Symptom in the user's session:**
+```
+>>> Running tier-1-unit-headless (2 files)
+[gw0] [ 33%] PASSED tests/test_headless_simulation.py::test_mma_track_lifecycle_simulation
+Why does this never end...
+```
+
+**The fix:** Out-of-band commit `e35b6a34` — wrapped the mock return in `Result(data="...")`. 2/2 tests pass under xdist; full headless batch (14 tests) completes in 18.7s.
+
+**Lesson for the next spec:** A spec for a Result-based refactor should include a verification step that runs the full test suite under `pytest -n auto` (not just single-file or no-xdist mode). A test that "passes in isolation" can still hang under xdist, and a hang blocks the entire batched test runner.
+
+**Why this was out-of-band rather than Phase 2:** The user's earlier report only mentioned 3 specific test failures (`test_token_reduction_logging`, `test_rag_integration`, `test_run_worker_lifecycle_uses_strategy`). I fixed those 3 plus 1 more (`test_conductor_engine_v2.py` which the user fixed manually but the mocks still returned raw strings). The headless_verification test was a 5th file not in the user's report, and its failure mode is qualitatively different (xdist worker crash, not test failure).
+
 ### #3: Track work was done on `master` directly, not a feature branch (PROCEDURAL)

 **What happened:** The track was created on `master` and the work was committed directly to `master` over 31 atomic commits + 6 phase checkpoints. There is no feature branch to merge.